Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max Planck Institute for Informatics, Saarbruecken, Germany
6800
Anna Esposito Alessandro Vinciarelli Klára Vicsi Catherine Pelachaud Anton Nijholt (Eds.)
Analysis of Verbal and Nonverbal Communication and Enactment The Processing Issues COST 2102 International Conference Budapest, Hungary, September 7-10, 2010 Revised Selected Papers
Volume Editors Anna Esposito Second University of Naples and IIASS, Vietri sul Mare (SA), Italy E-mail:
[email protected] Alessandro Vinciarelli University of Glasgow, UK E-mail:
[email protected] Klára Vicsi Budapest University of Technology and Economics, Hungary E-mail:
[email protected] Catherine Pelachaud TELECOM ParisTech, Paris, France E-mail:
[email protected] Anton Nijholt University of Twente, Enschede, The Netherlands E-mail:
[email protected]
ISSN 0302-9743 e-ISSN 1611-3349 ISBN 978-3-642-25774-2 e-ISBN 978-3-642-25775-9 DOI 10.1007/978-3-642-25775-9 Springer Heidelberg Dordrecht London New York Library of Congress Control Number: Applied for CR Subject Classification (1998): H.4, H.5, I.4, I.2, J.4 LNCS Sublibrary: SL 3 – Information Systems and Application, incl. Internet/Web and HCI
© Springer-Verlag Berlin Heidelberg 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Preface
This book is dedicated to: Luigi Maria Ricciardi for his 360-degree open mind. We will miss his guidance now and forever and to: what has never been, what was possible, and what could have been though we never know what it was.
This volume brings together the advanced research results obtained by the European COST Action 2102 "Cross-Modal Analysis of Verbal and Nonverbal Communication," primarily discussed at the PINK SSPnet-COST 2102 International Conference on "Analysis of Verbal and Nonverbal Communication and Enactment: The Processing Issues" held in Budapest, Hungary, September 7–10, 2010 (http://berber.tmit.bme.hu/cost2102/). The conference was jointly sponsored by COST (European Cooperation in Science and Technology, www.cost.eu) in the domain of Information and Communication Technologies (ICT), for disseminating the advances of the research activities developed within the COST Action 2102 "Cross-Modal Analysis of Verbal and Nonverbal Communication" (cost2102.cs.stir.ac.uk), and by the European Network of Excellence on Social Signal Processing, SSPnet (http://sspnet.eu/). The main focus of the conference was on methods to combine and build up knowledge through verbal and nonverbal signals enacted in an environment and in a context. In previous meetings, COST 2102 focused on the importance of uncovering and exploiting the wealth of information conveyed by multimodal signals. The next steps have been to analyze actions performed in response to multimodal signals and to study how these actions are organized in a realistic and socially believable context. The focus was on processing issues, since the new approach is computationally complex and the amount of data to be treated may be considered algorithmically infeasible. Therefore, data processing for gaining enactive knowledge must account for natural and intuitive approaches, based more on heuristics and experience than on symbols, as well as on the discovery of new processing possibilities that account for new approaches to data analysis, coordination of the data flow through synchronization and temporal organization, and optimization of the extracted features.
The conference had a special session for COST 2102 students. The idea was to select original contributions from early-stage researchers. To this aim, all the papers accepted in this volume were peer reviewed. This conference also aimed at underlining the role that women have had in ICT and, to this end, the conference was named "First SSPnet-COST2102 PINK International Conference." The International Steering Committee was composed of only women. The themes of the volume cover topics on verbal and nonverbal information in body-to-body communication, cross-modal analysis of speech, gestures, gaze and facial expressions, socio-cultural differences and personal traits, multimodal algorithms and procedures for the automatic recognition of emotions, faces, facial expressions, and gestures, audio and video features for implementing intelligent avatars and interactive dialogue systems, and virtual communicative agents and interactive dialogue systems. The book is arranged into two scientific sections according to a rough thematic classification, even though both sections are closely connected and both provide fundamental insights for cross-fertilization of different disciplines. The first section, "Multimodal Signals: Analysis, Processing and Computational Issues," deals with conjectural and processing issues of defining models, algorithms, and heuristic strategies for data analysis, coordination of the data flow and optimal encoding of multi-channel verbal and nonverbal features. The second section, "Verbal and Nonverbal Social Signals," presents original studies that provide theoretical and practical solutions to the modelling of timing synchronization between linguistic and paralinguistic expressions, actions, body movements and activities in human interaction, and to their exploitation for effective human–machine interactions. The papers included in this book benefited from the live interactions among the many participants of the successful meeting in Budapest. Over 90 senior and junior researchers gathered for the event. The editors would like to thank the Management Board of the SSPnet and the ESF COST-ICT Programme for the support in the realization of the conference and the publication of this volume. Acknowledgements go in particular to the COST Science Officers Matteo Razzanelli, Aranzazu Sanchez, Jamsheed Shorish, and the COST 2102 reporter Guntar Balodis for their constant help, guidance, and encouragement. The event owes its success to more individuals than can be named, but notably the members of the local Steering Committee Klára Vicsi, György Szaszák, and Dávid Sztahó, who worked actively for the success of the event. Special appreciation goes to the president of the International Institute for Advanced Scientific Studies (IIASS), Gaetano Scarpetta, and to the Dean and the Director of the Faculty and the Department of Psychology at the Second University of Naples, Alida Labella and Giovanna Nigro, for
making available people and resources for the editing of this volume. The editors are deeply indebted to the contributors for making this book a scientifically stimulating compilation of new and original ideas and to the members of the COST 2102 International Scientific Committee for their rigorous and invaluable scientific revisions, dedication, and priceless selection process. July 2011
Anna Esposito Alessandro Vinciarelli Klára Vicsi Catherine Pelachaud Anton Nijholt
Organization
International Steering Committee

Anna Esposito, Second University of Naples and IIASS, Italy
Klára Vicsi, Budapest University of Technology and Economics, Hungary
Catherine Pelachaud, CNRS, TELECOM ParisTech, France
Zsófia Ruttkay, Pázmány Péter Catholic University, Hungary
Jurate Puniene, Kaunas University of Technology, Lithuania
Isabel Trancoso, INESC-ID Lisboa, Portugal
Inmaculada Hernaez, Universidad del Pais Vasco, Spain
Jerneja Zganec Gros, Ljubljana, Slovenia
Anna Pribilova, Slovak University of Technology, Slovak Republic
Kristiina Jokinen, University of Helsinki, Finland
COST 2102 International Scientific Committee Alberto Abad Samer Al Moubayed Uwe Altmann Sigr´ un Mar´ıa Ammendrup Hicham Atassi Nikos Avouris Martin Bachwerk Ivana Baldasarre Sandra Baldassarri Ruth Bahr G´erard Bailly Marena Balinova Marian Bartlett Dominik Bauer Sieghard Beller ˇ Stefan Be`ouˇs Niels Ole Bernsen Jonas Beskow Peter Birkholz Horst Bishof Jean-Francois Bonastre Marek Boh´ a`e Elif Bozkurt
INESC-ID Lisboa, Portugal Royal Institute of Technology, Sweden Friedrich Schiller University Jena, Germany School of Computer Science, Iceland Brno University of Technology, Czech Republic University of Patras, Greece Trinity College Dublin, Ireland Second University of Naples, Italy Zaragoza University, Spain University of South Florida, USA GIPSA-lab, Grenoble, France University of Applied Sciences, Austria University of California, San Diego, USA RWTH Aachen University, Germany Universit¨ at Freiburg, Germany Constantine the Philosopher University, Slovakia University of Southern Denmark, Denmark Royal Institute of Technology, Sweden RWTH Aachen University, Germany Technical University Graz, Austria Universit´e d’Avignon, France Technical University of Liberec, Czech Republic Ko¸c University, Turkey
Nikolaos Bourbakis Maja Bratani´c Antonio Calabrese Erik Cambria Paola Campadelli Nick Campbell Valent´ın Carde˜ noso Payo Nicoletta Caramelli Antonio Castro-Fonseca Aleksandra Cerekovic Peter Cerva Josef Chaloupka Mohamed Chetouani G´erard Chollet Simone Cifani Muzeyyen Ciyiltepe Anton Cizmar David Cohen Nicholas Costen Francesca D’Olimpio Vlado Deli´c C´eline De Looze Francesca D’Errico Angiola Di Conza Giuseppe Di Maio Marion Dohen Thierry Dutoit Laila DybkjÆr Jens Edlund Matthias Eichner Aly El-Bahrawy Ci˘ gdem Ero˘glu Erdem Engin Erzin Anna Esposito Antonietta M. Esposito Joan F` abregas Peinado Sascha Fagel Nikos Fakotakis Manuela Farinosi Marcos Fa´ undez-Zanuy Tibor Fegy´ o Fabrizio Ferrara Dilek Fidan Leopoldina Fortunati
ITRI, Wright State University, USA University of Zagreb, Croatia Istituto di Cibernetica – CNR, Naples, Italy University of Stirling, UK Universit` a di Milano, Italy University of Dublin, Ireland Universidad de Valladolid, Spain Universit` a di Bologna, Italy Universidade de Coimbra, Portugal Faculty of Electrical Engineering, Croatia Technical University of Liberec, Czech Republic Technical University of Liberec, Czech Republic Universit`e Pierre et Marie Curie, France CNRS URA-820, ENST, France Universit` a Politecnica delle Marche, Italy Gulhane Askeri Tip Academisi, Turkey Technical University of Koˇsice, Slovakia Universit´e Pierre et Marie Curie, Paris, France Manchester Metropolitan University, UK Second University of Naples, Italy University of Novi Sad, Serbia Trinity College Dublin, Ireland Universit`a di Roma 3, Italy Second University of Naples, Italy Second University of Naples, Italy ICP, Grenoble, France Facult´e Polytechnique de Mons, Belgium University of Southern Denmark, Denmark Royal Institute of Technology, Sweden Technische Universit¨at Dresden, Germany Ain Shams University, Egypt `ı Bah¸ce¸sehir University, Turkey Ko¸c University, Turkey Second University of Naples, Italy Osservatorio Vesuviano Napoli, Italy Escola Universitaria de Mataro, Spain Technische Universit¨at Berlin, Germany University of Patras, Greece University of Udine, Italy Universidad Polit´ecnica de Catalu˜ na, Spain Budapest University of Technology and Economics, Hungary University of Naples “Federico II”, Italy Ankara Universitesi, Turkey Universit` a di Udine, Italy
Todor Ganchev Carmen Garc´ıa-Mateo Vittorio Girotto Augusto Gnisci Milan Gnjatovi´c Bjorn Granstrom Marco Grassi Maurice Grinberg Jorge Gurlekian Mohand-Said Hacid Jaakko Hakulinen Ioannis Hatzilygeroudis Immaculada Hernaez Javier Hernando Wolfgang Hess Dirk Heylen Daniel Hl´adek R¨ udiger Hoffmann Hendri Hondorp David House Evgenia Hristova Stephan H¨ ubler Isabelle Hupont Amir Hussain Viktor Imre Ewa Jarmolowicz Kristiina Jokinen Jozef Juh´ar Zdravko Kacic Bridget Kane Jim Kannampuzha Maciej Karpinski Eric Keller Adam Kendon Stefan Kopp Jacques Koreman Theodoros Kostoulas Maria Koutsombogera Robert Krauss Bernd Kr¨oger Gernot Kubin Olga Kulyk Alida Labella
University of Patras, Greece University of Vigo, Spain Universit` a IUAV di Venezia, Italy Second University of Naples, Italy University of Novi Sad, Serbia Royal Institute of Technology, Sweden Universit` a Politecnica delle Marche, Italy New Bulgarian University, Bulgaria LIS CONICET, Argentina Universit´e Claude Bernard Lyon 1, France University of Tampere, Finland University of Patras, Greece University of the Basque Country, Spain Technical University of Catalonia, Spain Universit¨ at Bonn, Germany University of Twente, The Netherlands Technical University of Koˇsice, Slovak Republic Technische Universit¨at Dresden, Germany University of Twente, The Netherlands Royal Institute of Technology, Sweden New Bulgarian University, Bulgaria Dresden University of Technology, Gremany Aragon Institute of Technology, Spain University of Stirling, UK Budapest University of Technology and Economics, Hungary Adam Mickiewicz University, Poland University of Helsinki, Finland Technical University Koˇsice, Slovak Republic University of Maribor, Slovenia Trinity College Dublin, Ireland RWTH Aachen University, Germany Adam Mickiewicz University, Poland Universit´e de Lausanne, Switzeland University of Pennsylvania, USA University of Bielefeld, Germany University of Science and Technology, Norway University of Patras, Greece Institute for Language and Speech Processing, Greece Columbia University, USA RWTH Aachen University, Germany Graz University of Technology, Austria University of Twente, The Netherlands Second University of Naples, Italy
Emilian Lalev Yiannis Laouris Anne-Maria Laukkanen Am´elie Lelong Borge Lindberg Saturnino Luz Wojciech Majewski Pantelis Makris Kenneth Manktelow Raffaele Martone Rytis Maskeliunas Dominic Massaro Olimpia Matarazzo Christoph Mayer David McNeill Jiˇr´ı Mekyska Nicola Melone Katya Mihaylova P´eter Mihajlik Michal Miriloviˇc Izidor Mlakar Helena Moniz Tam´as Mozsolics Vincent C. M¨ uller Peter Murphy Antonio Natale Costanza Navarretta Eva Navas Delroy Nelson G´eza N´emeth Friedrich Neubarth Christiane Neuschaefer-Rube Giovanna Nigro Anton Nijholt Jan Nouza Michele Nucci Catharine Oertel Stanislav Ond´ aˇs Rieks Op den Akker
New Bulgarian University, Bulgaria Cyprus Neuroscience and Technology Institute, Cyprus University of Tampere, Finland GIPSA-lab, Grenoble, France Aalborg University, Denmark Trinity College Dublin, Ireland Wroclaw University of Technology, Poland Neuroscience and Technology Institute, Cyprus University of Wolverhampton, UK Second University of Naples, Italy Kaunas University of Technology, Lithuania University of California - Santa Cruz, USA Second University of Naples, Italy Technische Universit¨at M¨ unchen, Germany University of Chicago, USA Brno University of Technology, Czech Republic Second University of Naples, Italy University of National and World Economy, Bulgaria Budapest University of Technology and Economics, Hungary Technical University of Koˇsice, Slovakia Roboti c.s. d.o.o, Maribor, Slovenia INESC-ID Lisboa, Portugal Budapest University of Technology and Economics, Hungary Anatolia College/ACT, Greece University of Limerick, Ireland University of Salerno and IIASS, Italy University of Copenhagen, Denmark Escuela Superior de Ingenieros, Spain University College London, UK University of Technology and Economics, Hungary Austrian Research Inst. Artificial Intelligence, Austria RWTH Aachen University, Germany Second University of Naples, Italy Universiteit Twente, The Netherlands Technical University of Liberec, Czech Republic Universit`a Politecnica delle Marche, Italy Trinity College Dublin, Ireland Technical University of Koˇsice, Slovak Republic University of Twente, The Netherlands
Karel Paleˇcek Igor Pandzic Harris Papageorgiou Kinga Papay Paolo Parmeggiani Ana Pavia Paolo Pedone Tomislav Pejsa Catherine Pelachaud Bojan Petek Harmut R. Pfitzinger Francesco Piazza Neda Pintaric Mat´ uˇs Pleva Isabella Poggi Guy Politzer Jan Prazak Ken Prepin Jiˇrı Pˇribil Anna Pˇribilov´ a Emanuele Principi Michael Pucher Jurate Puniene Ana Cristina Quelhas Kari-Jouko R¨aih¨a Roxanne Raine Giuliana Ramella Fabian Ramseyer Jos`e Rebelo Peter Reichl Luigi Maria Ricciardi Maria Teresa Riviello Matej Rojc Nicla Rossini Rudi Rotili Algimantas Rudzionis Vytautas Rudzionis Hugo L. Rufiner Milan Rusko
Technical University of Liberec, Czech Republic Faculty of Electrical Engineering, Croatia Institute for Language and Speech Processing, Greece University of Debrecen, Hungary Universit` a degli Studi di Udine, Italy Spoken Language Systems Laboratory, Portugal Second University of Naples, Italy University of Zagreb, Croatia Universit´e de Paris, France University of Ljubljana, Slovenia University of Munich, Germany Universit`a degli Studi di Ancona, Italy University of Zagreb, Croatia Technical University of Koˇsice, Slovak Republic Universit` a di Roma 3, Italy University of Paris 8, France Technical University of Liberec, Czech Republic Telecom-ParisTech, France Academy of Sciences, Czech Republic Slovak University of Technology, Slovakia Universit` a Politecnica delle Marche, Italy Telecommunications Research Center Vienna, Austria Kaunas University of Technology, Lithuania Instituto Superior de Psicologia Aplicada, Portugal University of Tampere, Finland University of Twente, The Netherlands Istituto di Cibernetica – CNR, Naples, Italy University Hospital of Psychiatry Bern, Switzerland Universidade de Coimbra, Portugal FTW Telecommunications Research Center, Austria Universit` a di Napoli “Federico II”, Italy Second University of Naples and IIASS, Italy University of Maribor, Slovenia Universit`a del Piemonte Orientale, Italy Universit` a Politecnica delle Marche, Italy Kaunas University of Technology, Lithuania Kaunas University of Technology, Lithuania Universidad Nacional de Entre R´ıos, Argentina Slovak Academy of Sciences, Slovak Republic
Zs´ofia Ruttkay Yoshinori Sagisaka Bartolomeo Sapio Mauro Sarrica Gell´ert S´ arosi Gaetano Scarpetta Silvia Scarpetta Stefan Scherer Ralph Schnitker Jean Schoentgen Bj¨orn Schuller Milan Seˇcujski Stefanie Shattuck-Hufnagel Marcin Skowron Jan Silovsky Zdenˇek Sm´ekal Stefano Squartini Piotr Staroniewicz J´ an Staˇs Vojtˇech Stejskal Marian Stewart-Bartlett Xiaofan Sun Jing Su D´ avid Sztah´ o Jianhua Tao Bal´azs Tarj´an Jure F. Tasiˇc Murat Tekalp Kristinn Th´orisson Isabel Trancoso Luigi Trojano Wolfgang Tschacher Markku Turunen Henk Van den Heuvel Betsy van Dijk Giovanni Vecchiato Leticia Vicente-Rasoamalala Robert Vich Kl´ ara Vicsi
Pazmany Peter Catholic University, Hungary Waseda University, Japan Fondazione Ugo Bordoni, Italy University of Padova, Italy Budapest University of Technology and Economics, Hungary University of Salerno and IIASS, Italy Salerno University, Italy Ulm University, Germany Aachen University, Germany Universit´e Libre de Bruxelles, Belgium Technische Universit¨at M¨ unchen, Germany University of Novi Sad, Serbia MIT, Research Laboratory of Electronics, USA Austrian Research Institute for Artificial Intelligence, Austria Technical University of Liberec, Czech Republic Brno University of Technology, Czech Republic Universit` a Politecnica delle Marche, Italy Wroclaw University of Technology, Poland Technical University of Koˇsice, Slovakia Brno University of Technology, Czech Republic University of California, San Diego, USA University of Twente, The Netherlands Trinity College Dublin, Ireland Budapest University of Technology and Economics, Hungary Chinese Academy of Sciences, P.R. China Budapest University of Technology and Economics, Hungary University of Ljubljana, Slovenia Ko¸c University, Turkey Reykjav´ık University, Iceland Spoken Language Systems Laboratory, Portugal Second University of Naples, Italy University of Bern, Switzerland University of Tampere, Finland Radboud University Nijmegen, The Netherlands University of Twente, The Netherlands Universit` a “La Sapienza”, Italy Alchi Prefectural University, Japan Academy of Sciences, Czech Republic Budapest University, Hungary
Hannes H¨ogni Vilhj´ almsson Jane Vincent Alessandro Vinciarelli Laura Vincze Carl Vogel Jan Vol´ın Rosa Volpe Martin Vondra Pascal Wagner-Egger Yorick Wilks Matthias Wimmer Matthias Wolf Bencie Woll Bayya Yegnanarayana Vanda Lucia Zammuner ˇ Jerneja Zganec Gros Goranka Zoric
Reykjav´ık University, Iceland University of Surrey, UK University of Glasgow, UK Universit`a di Roma 3, Italy Trinity College Dublin, Ireland Charles University, Czech Republic Universit´e de Perpignan, France Academy of Sciences, Czech Republic Fribourg University, Switzerland University of Sheffield, UK Institute for Informatics Munich, Germany Technische Universit¨at Dresden, Germany University College London, UK International Institute of Information Technology, India University of Padova, Italy Alpineon, Development and Research, Slovenia Faculty of Electrical Engineering, Croatia
Sponsors The following organizations sponsored and supported the international conference: European COST Action 2102 “Cross-Modal Analysis of Verbal and Nonverbal Communication” (cost2102.cs.stir.ac.uk)
ESF provides the COST Office through an EC contract.
COST is supported by the EU RTD Framework Programme.
COST, the acronym for European Cooperation in Science and Technology, is the oldest and widest European intergovernmental network for cooperation in research. Established by the Ministerial Conference in November 1971, COST is presently used by the scientific communities of 36 European countries to cooperate in common research projects supported by national funds. The funds provided by COST, less than 1% of the total value of the projects, support the COST cooperation networks (COST Actions) through which, with EUR 30 million per year, more than 30,000 European scientists are involved in research having a total value which exceeds EUR 2 billion per year. This is the financial worth of the European added value which COST achieves. A "bottom-up approach" (the initiative of launching a COST Action comes from the European scientists themselves), "à la carte participation" (only countries interested in the Action participate), "equality of access" (participation is open also to the scientific communities of countries not belonging to the European Union) and "flexible structure" (easy implementation and light management of the research initiatives) are the main characteristics of COST. As a precursor of advanced multidisciplinary research, COST plays a very important role in the realization of the European Research Area (ERA), anticipating and complementing the activities of the Framework Programmes, constituting a "bridge" toward the scientific communities of emerging countries, increasing the mobility of researchers across Europe and fostering the establishment of "Networks of Excellence" in many key scientific domains such as: biomedicine and molecular biosciences; food and agriculture; forests, their products and services; materials, physical and nanosciences; chemistry and molecular sciences and technologies; earth system science and environmental management; information
and communication technologies; transport and urban development; individuals, societies, cultures and health. It covers basic and more applied research and also addresses issues of a pre-normative nature or of societal importance. Website: http://www.cost.eu

SSPnet: European Network on Social Signal Processing, http://sspnet.eu/
The ability to understand and manage the social signals of a person we are communicating with is the core of social intelligence. Social intelligence is a facet of human intelligence that has been argued to be indispensable and perhaps the most important for success in life. Although each one of us understands the importance of social signals in everyday life situations, and in spite of recent advances in machine analysis and synthesis of relevant behavioral cues like blinks, smiles, crossed arms, head nods, laughter, etc., the research efforts in machine analysis and synthesis of human social signals such as empathy, politeness, and (dis)agreement are few and tentative. The main reasons for this are the absence of a research agenda and the lack of suitable resources for experimentation. The mission of the SSPNet is to create sufficient momentum by integrating an existing large amount of knowledge and available resources in social signal processing (SSP) research domains, including cognitive modeling, machine understanding, and synthesizing social behavior, and thus:

– Enable the creation of the European and world research agenda in SSP
– Provide efficient and effective access to SSP-relevant tools and data repositories to the research community within and beyond the SSPNet
– Further develop complementary and multidisciplinary expertise necessary for pushing forward the cutting edge of the research in SSP

The collective SSPNet research effort is directed toward integration of existing SSP theories and technologies, and toward identification and exploration of potentials and limitations in SSP. More specifically, the framework of the SSPNet will revolve around two research foci selected for their primacy and significance: human–human interaction (HHI) and human–computer interaction (HCI). A particular scientific challenge that binds the SSPNet partners is the synergetic combination of human–human interaction models, and automated tools for human behavior sensing and synthesis, within socially adept multimodal interfaces.

School of Computing Science, University of Glasgow, Scotland, UK
Department of Psychology, Second University of Naples, Caserta, Italy
Laboratory of Speech Acoustics, Department of Telecommunication and Media Informatics, Budapest University of Technology and Economics, Budapest, Hungary
Complex Committee on Acoustics of the Hungarian Academy of Sciences, Budapest, Hungary
Scientific Association for Infocommunications, Budapest, Hungary
International Institute for Advanced Scientific Studies "E.R. Caianiello" (IIASS), www.iiassvietri.it/
Società Italiana Reti Neuroniche, SIREN, www.associazionesiren.org/
Regione Campania and Provincia di Salerno, Italy
Table of Contents
Multimodal Signals: Analysis, Processing and Computational Issues

Real Time Person Tracking and Behavior Interpretation in Multi Camera Scenarios Applying Homography and Coupled HMMs (Dejan Arsić and Björn Schuller)
Animated Faces for Robotic Heads: Gaze and Beyond (Samer Al Moubayed, Jonas Beskow, Jens Edlund, Björn Granström, and David House)
RANSAC-Based Training Data Selection on Spectral Features for Emotion Recognition from Spontaneous Speech (Elif Bozkurt, Engin Erzin, Çiğdem Eroğlu Erdem, and A. Tanju Erdem)
Establishing Linguistic Conventions in Task-Oriented Primeval Dialogue (Martin Bachwerk and Carl Vogel)
Switching Between Different Ways to Think: Multiple Approaches to Affective Common Sense Reasoning (Erik Cambria, Thomas Mazzocco, Amir Hussain, and Tariq Durrani)
Efficient SNR Driven SPLICE Implementation for Robust Speech Recognition (Stefano Squartini, Emanuele Principi, Simone Cifani, Rudi Rotili, and Francesco Piazza)
Study on Cross-Lingual Adaptation of a Czech LVCSR System towards Slovak (Petr Cerva, Jan Nouza, and Jan Silovsky)
Audio-Visual Isolated Words Recognition for Voice Dialogue System (Josef Chaloupka)
Semantic Web Techniques Application for Video Fragment Annotation and Management (Marco Grassi, Christian Morbidoni, and Michele Nucci)
Imitation of Target Speakers by Different Types of Impersonators (Wojciech Majewski and Piotr Staroniewicz)
Multimodal Interface Model for Socially Dependent People (Rytis Maskeliunas and Vytautas Rudzionis)
Score Fusion in Text-Dependent Speaker Recognition Systems (Jiří Mekyska, Marcos Faundez-Zanuy, Zdeněk Smékal, and Joan Fàbregas)
Developing Multimodal Web Interfaces by Encapsulating Their Content and Functionality within a Multimodal Shell (Izidor Mlakar and Matej Rojc)
Multimodal Embodied Mimicry in Interaction (Xiaofan Sun and Anton Nijholt)
Using TTS for Fast Prototyping of Cross-Lingual ASR Applications (Jan Nouza and Marek Boháč)
Towards the Automatic Detection of Involvement in Conversation (Catharine Oertel, Céline De Looze, Stefan Scherer, Andreas Windmann, Petra Wagner, and Nick Campbell)
Extracting Sentence Elements for the Natural Language Understanding Based on Slovak National Corpus (Stanislav Ondáš, Jozef Juhár, and Anton Čižmár)
Detection of Similar Advertisements in Media Databases (Karel Palecek)
Towards ECA's Animation of Expressive Complex Behaviour (Izidor Mlakar and Matej Rojc)
Recognition of Multiple Language Voice Navigation Queries in Traffic Situations (Gellért Sárosi, Tamás Mozsolics, Balázs Tarján, András Balog, Péter Mihajlik, and Tibor Fegyó)
Comparison of Segmentation and Clustering Methods for Speaker Diarization of Broadcast Stream Audio (Jan Prazak and Jan Silovsky)
Influence of Speakers' Emotional States on Voice Recognition Scores (Piotr Staroniewicz)
Automatic Classification of Emotions in Spontaneous Speech (Dávid Sztahó, Viktor Imre, and Klára Vicsi)
Modification of the Glottal Voice Characteristics Based on Changing the Maximum-Phase Speech Component (Martin Vondra and Robert Vích)
Verbal and Nonverbal Social Signals

On Speech and Gestures Synchrony (Anna Esposito and Antonietta M. Esposito)
Study of the Phenomenon of Phonetic Convergence Thanks to Speech Dominoes (Amélie Lelong and Gérard Bailly)
Towards the Acquisition of a Sensorimotor Vocal Tract Action Repository within a Neural Model of Speech Processing (Bernd J. Kröger, Peter Birkholz, Jim Kannampuzha, Emily Kaufmann, and Christiane Neuschaefer-Rube)
Neurophysiological Measurements of Memorization and Pleasantness in Neuromarketing Experiments (Giovanni Vecchiato and Fabio Babiloni)
Annotating Non-verbal Behaviours in Informal Interactions (Costanza Navarretta)
The Matrix of Meaning: Re-presenting Meaning in Mind Prolegomena to a Theoretical Model (Rosa Volpe, Lucile Chanquoy, and Anna Esposito)
Investigation of Movement Synchrony Using Windowed Cross-Lagged Regression (Uwe Altmann)
Multimodal Multilingual Dictionary of Gestures: DiGest (Milan Rusko and Štefan Beňuš)
The Partiality in Italian Political Interviews: Stereotype or Reality? (Enza Graziano and Augusto Gnisci)
On the Perception of Emotional "Voices": A Cross-Cultural Comparison among American, French and Italian Subjects (Maria Teresa Riviello, Mohamed Chetouani, David Cohen, and Anna Esposito)
Influence of Visual Stimuli on Evaluation of Converted Emotional Speech by Listening Tests (Jiří Přibil and Anna Přibilová)
Communicative Functions of Eye Closing Behaviours (Laura Vincze and Isabella Poggi)
Deception Cues in Political Speeches: Verbal and Non-verbal Traits of Prevarication (Nicla Rossini)
Selection Task with Conditional and Biconditional Sentences: Interpretation and Pattern of Answer (Fabrizio Ferrara and Olimpia Matarazzo)
Types of Pride and Their Expression (Isabella Poggi and Francesca D'Errico)
People's Active Emotion Vocabulary: Free Listing of Emotion Labels and Their Association to Salient Psychological Variables (Vanda Lucia Zammuner)

Author Index
Real Time Person Tracking and Behavior Interpretation in Multi Camera Scenarios Applying Homography and Coupled HMMs

Dejan Arsić (Müller BBM Vibroakustiksysteme GmbH, Planegg, Germany, [email protected]) and Björn Schuller (Institute for Human-Machine Communication, Technische Universität München, Germany, [email protected])
Abstract. Video surveillance systems have been introduced in various fields of our daily life to enhance security and protect individuals and sensitive infrastructure. Up to now they have usually been utilized as a forensic tool for after-the-fact investigations and are commonly monitored by human operators. A further gain in safety can only be achieved by the implementation of fully automated surveillance systems which will assist human operators. In this work we will present an integrated real time capable system utilizing multiple camera person tracking, which is required to resolve heavy occlusions, to monitor individuals in complex scenes. The resulting trajectories will be further analyzed for so-called Low Level Activities, such as walking, running and stationarity, applying HMMs, which will be used for the behavior interpretation task along with motion features gathered throughout the tracking process. An approach based on coupled HMMs will be used to model High Level Activities such as robberies at ATMs and luggage related scenarios.
1 Introduction

Visual surveillance systems, which are quite common in urban environments, aim at providing safety in everyday life. Unfortunately most CCTV cameras are unmonitored, and the vast majority of benefits lie either in forensic use or in deterring potential offenders, as these might be easily recognized and detected [40]. Therefore it seems desirable to support human operators and implement automated surveillance systems to be able to react in time. In order to achieve this aim, most systems are split into two parts: the detection and tracking application and the subsequent behavior interpretation part. As video material may contain various stationary or moving objects and persons whose behavior may be interesting, these have to be detected in the current video frame and tracked over time. As a single camera usually is not sufficient to cope with dense crowds and large regions, multiple cameras should be mounted to view defined regions from different perspectives. Within these perspectives corresponding objects have to be located. Appearance based methods, such as matching color [32], lead to frequent errors due to different color settings and lighting situations in the individual sensors.
Approaches based on geometrical information rely on geometrical constraints between views, using calibrated data [43] or homography between uncalibrated views, which e.g. Khan [25] suggested to localize feet positions. However, as Khan's approach only localizes feet, it consequently tends to segment persons into further parts. In these respects a novel extension to this framework is presented herein, applying homography in multiple layers to successfully overcome the problem of aligning multiple segments belonging to one person. As a convenient side effect, the localization performance will increase dramatically [6]. Nevertheless this approach still creates some errors in complex scenes and is computationally quite expensive. Therefore a real time capable alteration of the initial homography approach will be presented in sec. 2. The results of the applied tracking approaches will be presented using the multi camera tracking databases from the Performance Evaluation of Tracking and Surveillance Challenges (PETS) in the years 2006, 2007 and 2009 [37,3,28]. All these databases have been recorded in public places, such as train stations or airports, and show at least four views of the scene. Subsequently an integrated approach for behavior interpretation will be presented in sec. 3. Although a wide range of approaches already exists, this issue is not yet solved. Most of these operate on the 2D level, using texture information to extract behaviors or gait [39,15]. Unfortunately it is not possible to guarantee similar and non-obscured views in real world scenarios, which are required by these algorithms. Hence it is suggested to operate on trajectory level. Trajectories can be extracted robustly by the previously mentioned algorithm, easily be normalized, and compared to a baseline scenario with little to no changes and knowledge of the scene geometry. Nevertheless the positions of important landmarks and objects, which may be needed for the scenario recognition, should be collected. Other information is not required. Common approaches come at the cost of collecting a large amount of data to train Hidden Markov Models (HMM) [31] or behavioral maps [11]. Despite the scenarios' complexity and large inter class variance, some scenarios follow a similar scheme, which can be modeled by an HMM architecture in two layers, where the first layer is responsible for the recognition of Low Level Activities (LLA). In the second layer complex scenarios are furthermore analyzed, again applying HMMs, where only LLAs are used as features. High flexibility and robustness is achieved by the introduction of state transitions between High Level Activities (HLA), allowing a detailed dynamic scene representation. It will be shown that this approach provides a high accuracy at low computational effort.
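To make the trajectory-level idea concrete, the following is a minimal, hypothetical sketch of the first layer (recognising low-level activities such as walking, running or stationarity by competing per-class HMMs over simple motion features). It uses the hmmlearn package, invented feature and window choices, and trajectories assumed to be given in ground-plane metres; it is not the authors' implementation.

```python
import numpy as np
from hmmlearn import hmm

def motion_features(track, fps=25.0):
    """Per-frame speed and acceleration magnitude from a (T, 2) ground-plane trajectory."""
    velocity = np.diff(track, axis=0) * fps            # metres per second between frames
    speed = np.linalg.norm(velocity, axis=1)
    accel = np.diff(speed, prepend=speed[0]) * fps     # crude finite-difference acceleration
    return np.column_stack([speed, accel])             # shape (T-1, 2)

class LLARecognizer:
    """One Gaussian HMM per low-level activity; the most likely model wins."""
    def __init__(self, activities=("walking", "running", "stationary"), n_states=3):
        self.models = {a: hmm.GaussianHMM(n_components=n_states,
                                          covariance_type="diag", n_iter=50)
                       for a in activities}

    def fit(self, labelled_tracks):
        # labelled_tracks: dict mapping activity name -> list of (T_i, 2) trajectories
        for activity, tracks in labelled_tracks.items():
            feats = [motion_features(t) for t in tracks]
            X = np.vstack(feats)
            lengths = [len(f) for f in feats]
            self.models[activity].fit(X, lengths)

    def predict(self, track):
        feats = motion_features(track)
        scores = {a: m.score(feats) for a, m in self.models.items()}
        return max(scores, key=scores.get)              # highest log-likelihood wins
```

The second layer would then treat the resulting LLA sequence as the observation stream of scenario-level HMMs in the same fashion.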
2 Object Localization Using Homography

2.1 Planar Homographies

Homography [22] is a special case of projective geometry. It enables the mapping of points in spaces with different dimensionality R^n [17]. Hence, a point p observed in a view can be mapped into its corresponding point p' in another perspective or even another coordinate system. Fig. 1 illustrates this for the transformation of a point p in world coordinates R^3 into the image pixel p' in R^2:

p' = (x', y') ← p = (x, y, z).   (1)
Fig. 1. The homography constraint visualized with a cylinder standing on a planar surface
Planar homographies, here the matching of image coordinates onto the ground plane, in contrast only require an affine transformation from R2 → R2 . This can be interpreted as a simple rotation with R and translation with T
p' = R p + T.   (2)
As has been shown in [25], projective geometry between multiple cameras and a plane in world coordinates can be used for person tracking. A point p_π located on the plane is visible as p_iπ in view C_i and as p_jπ in a second view C_j. p_iπ and p_jπ can be determined with p_iπ = H_iπ p_π and p_jπ = H_jπ p_π, (3) where H_iπ denotes the transformation between view C_i and the ground plane π. The composition of both perspectives results in a homography [22]

p_jπ = H_jπ H_iπ^{-1} p_iπ = H_ij p_iπ   (4)
between the image planes. This way each pixel in a view can be transformed into another arbitrary view, given the projection matrices for the two views. A 3D point p located off the plane π, visible at location p_iπ in view C_i, can also be warped into another image with p_w = H p_iπ, and p_w ≠ p_2π. The resulting misalignment is called plane parallax. As illustrated in fig. 1, the homography projects a ray from the camera center C_i through a pixel p and extends it until it intersects with the plane π, which is referred to as the piercing point of a pixel and the plane π. The ray is subsequently projected into the camera center of C_j, intersecting the second image plane at p_w. As can be seen, points on the plane do not show any plane parallax, whereas those off the plane show a considerable one. Each scene point p_π located on an object in the 3D scene and on plane π will therefore be projected into a pixel p_1π, p_2π, ..., p_nπ in all available n views, if the projections are located in detected foreground regions FG_i with

p_iπ ∈ FG_i.   (5)
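As a concrete illustration of eqs. (3)-(4), the following minimal sketch (using OpenCV and invented example coordinates) estimates the mapping of one view onto the ground plane from four marked correspondences and projects a detected foot pixel onto the plane; it is an assumption-laden example, not the authors' calibration procedure.

```python
import cv2
import numpy as np

# Four image points (pixels) of view C_i and their known ground-plane coordinates in
# metres (hypothetical values; in practice they come from the camera calibration):
img_pts    = np.float32([[102, 640], [890, 655], [820, 300], [180, 310]])
ground_pts = np.float32([[0.0, 0.0], [6.0, 0.0], [6.0, 8.0], [0.0, 8.0]])

# G_i maps image pixels of C_i onto the ground plane, i.e. the inverse of H_ipi in eq. (3).
G_i, _ = cv2.findHomography(img_pts, ground_pts)

# Project a detected foot pixel onto the ground plane:
foot_px = np.float32([[[415.0, 602.0]]])              # shape (1, 1, 2), as OpenCV expects
foot_on_plane = cv2.perspectiveTransform(foot_px, G_i)

# The view-to-view homography of eq. (4), H_ij = H_jpi * H_ipi^{-1}, then becomes
# H_ij = np.linalg.inv(G_j) @ G_i   for a second view with ground mapping G_j.
```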
Furthermore, each point p_iπ can be determined by a transformation between view i and an arbitrarily chosen one indexed with j,

p_iπ = H_ij p_jπ,   (6)

where H_ij is the homography of plane π from view i to j. Given a foreground pixel p_i ∈ FG_i in view C_i, with its piercing point located inside the volume of an object inside the scene, the projection

p_j = H_ij p_i ∈ FG_j   (7)
lies in the foreground region FG_j. This proposition, the so-called homography constraint, segments pixels corresponding to ground plane positions of objects and helps resolve occlusions. The homography constraint is not necessarily limited to the ground plane and can be used in any other plane in the scene, as will be shown in sec. 2.2. For the localization of objects, the ground plane seems sufficient to find objects touching it. In the context of pedestrians a detection of feet is performed, which will be explained in the following sections. Now that it is possible to compute point correspondences from the 2D space to the 3D world and vice versa, it is also possible to determine the number of objects and their exact location in a scene. In the first stage a synchronized image acquisition is needed, in order to compute the correspondences of moving objects in the current frames C_1, C_2, ..., C_n. Subsequently, a foreground segmentation is performed in all available smart sensors to detect changes from the empty background B(x, y) [25]:

FG_i(x, y, t) = I_i(x, y, t) − B_i(x, y),   (8)
where the appropriate technique to update the background pixels, here based on Gaussian Mixture Models, is chosen for each sensor individually. It is advisable to set parameters, such as the update time, separately in all sensors to guarantee a high performance. Computational effort is reduced by masking the images with a predefined tracking area. Now the homography H_iπ between a pixel p_i in the view C_i and the corresponding location on the ground plane π can be determined. In all views the observations x_1, x_2, ..., x_n can be made at the pixel positions p_1, p_2, ..., p_n. Let X denote the event that a foreground pixel p_i has a piercing point within a foreground object, with the probability P(X|x_1, x_2, ..., x_n). With Bayes' law we have

p(X|x_1, x_2, ..., x_n) ∝ p(x_1, x_2, ..., x_n|X) p(X).   (9)
The first term on the right side is the likelihood of making an observation x_1, x_2, ..., x_n, given an event X happens. Assuming conditional independence, the term can be rewritten to

p(x_1, x_2, ..., x_n|X) = p(x_1|X) · p(x_2|X) · ... · p(x_n|X).   (10)

According to the homography constraint, a pixel within an object will be part of the foreground object in every view,

p(x_i|X) ∝ p(x_i),   (11)
where p(x_i) is the probability of x_i belonging to the foreground. An object is then detected in the ground plane when

p(X|x_1, x_2, ..., x_n) ∝ ∏_{i=1}^{n} p(x_i)   (12)
Fig. 2. a) Planar homography for object detection. b) Resolving occlusions by adding further views.
exceeds a threshold θ. In order to keep computational effort low, it is feasible to transform only regions of interest [3]. These are determined by thresholding the entire image, resulting in a binary image, before the transformation and the detection of blobs with a simple connected component analysis. This way only the binary blobs are transformed into the ground plane instead of the corresponding probability maps. Therefore eq. 12 can be simplified to

p(X|x_1, x_2, ..., x_n) ∝ ∑_{i=1}^{n} p(x_i)   (13)
without any influence on the performance. The value of θlow is usually set dependent on the number n of camera sensors to θlow = n − 1, in order to provide some additional robustness in case one of the views accidentally fails. The thresholding on sensor level has a further advantage compared to the so called soft threshold [25,12], where the entire probability map is transformed and probabilities are actually multiplied as in eq. 12. A small probability or even xi = 0 would affect the overall probability and set it to small values, whereas the thresholded sum is not affected. Using the homography constraint hence solves the correspondence problem in the views C1 ,C2 , . . . ,Cn , as illustrated in fig 2a) for a cubic object. In case the object is human, only the feet of the person touching the ground plane will be detected. The homography constraint additionally resolves occlusions, as can be seen in fig. 2a). Pixel regions located within the detected foreground areas, indicated in dark gray on the ground plane, and representing the feet, will be transformed to a piercing point within the object volume. Foreground pixels not satisfying the homography constraint are located off the plane, and are being warped into background regions of other views. The piercing point is therefore located outside the object volume. All outliers indicate regions with high uncertainty, as there is no depth information available. This limitation can now be used to detect occluded objects. As visualized in fig. 2b), one cuboid is occluded by the other one in view C1 , as apparently foreground blobs are merged. The right object’s bottom side is occluded by the larger object’s body. Both objects are visible in view C2 , resulting in two detected foreground regions. A second set of foreground pixels, located on the ground plane π in view C2 , will now satisfy the homography constraint and localize the occluded object. This process allows the localization of feet positions, although they are entirely occluded, by creating a kind of see through effect.
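A compact sketch of this fusion step is given below, assuming per-camera foreground masks from a GMM background model and precomputed homographies that map each image onto a discretised ground-plane grid (OpenCV conventions; the grid size, threshold and variable names are illustrative only, not the authors' code).

```python
import cv2
import numpy as np

def ground_plane_fusion(frames, homographies, subtractors, grid_size=(600, 400)):
    """Warp each camera's binary foreground mask onto the ground-plane grid and keep
    only cells supported by at least n-1 views (eq. 13 with theta_low = n - 1)."""
    n = len(frames)
    votes = np.zeros(grid_size[::-1], dtype=np.uint8)    # accumulator on the ground plane
    for frame, H, bgs in zip(frames, homographies, subtractors):
        fg = bgs.apply(frame)                             # GMM background subtraction
        fg = (fg > 127).astype(np.uint8)                  # drop shadow label, binarize
        warped = cv2.warpPerspective(fg, H, grid_size, flags=cv2.INTER_NEAREST)
        votes += warped
    detection = (votes >= n - 1).astype(np.uint8)         # thresholded sum over views
    num, labels, stats, centroids = cv2.connectedComponentsWithStats(detection)
    return centroids[1:], stats[1:]                       # skip background component 0

# One MOG2 background model per camera, created once and updated with every frame:
# subtractors = [cv2.createBackgroundSubtractorMOG2() for _ in cameras]
```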
Fig. 3. Detection example applying homographic transformation in the ground plane. Detected object regions are subsequently projected into the third view of the PETS2006 data set. The regions in yellow represent intersecting areas. As can be seen, some objects are split into multiple regions. These are aligned in a subsequent tracking step.
Exemplary results of the object localization are shown in fig. 3, where the yellow regions on the left hand side represent possible object positions. For an easier post processing, the resulting intersections are interpreted as circular object regions OR_i with center point p_j(x, y, t) and radius r_j(t), which is given by r_j(t) = sqrt(A_j(t)/π), where A_j(t) is the size of the intersecting region.
2.2 3D Reconstruction of the Scene

The major drawback of planar homography is the restriction to the detection of objects touching the ground, which leads to some unwanted phenomena. Humans usually have two legs and therefore two feet touching the ground, but unfortunately not necessarily positioned next to each other. Walking people will show a distance between their feet of up to one meter. Computing intersections in the ground plane consequently results in two object positions per person. Fig. 3 illustrates the detected regions for all four persons present in the scene. As only the position of the feet is determined, remaining information on body shape and posture is dismissed. As a consequence, distances between objects and individuals cannot be determined exactly. For instance, a person might try to reach an object with her arm and be just a few millimeters away from touching it, though the computed distance would be almost one meter. Furthermore, tracking is limited to objects located on a plane, while other objects, such as hands, birds, etc. cannot be considered. To resolve these limitations, it seems reasonable to try to reconstruct the observed scenery as a 3D model. Therefore various techniques have already been applied: recent works mostly deal with the composition of so-called visual hulls from an ensemble of 2D images [27,26], which requires a rather precise segmentation in each smart sensor and the use of 3D constructs like voxels or visual cones. These are subsequently being intersected in the 3D world. A comparison of scene reconstruction techniques can be found in [35]. An approach for 3D reconstruction of objects from multiple views applying homography has already been presented in [24]. All required information can be gathered by fusion of silhouettes in the image plane, which can be resolved by planar homography. With a large set of cameras or views a quite precise object reconstruction can be
Fig. 4. a) Computation of layer intersections using two points. b) Transformed blobs in multiple layers. c) 3D reconstruction of a cuboid.
achieved, which is not required for this work. This approach can be altered to localize objects and approximate the occupied space with low additional effort [6], which will improve the detection and tracking performance. The basic idea is to compute the intersections of transformed object boundaries in additional planes, as illustrated in fig. 4b). This transformation can be computed rapidly by taking epipolar geometry into account, which will be computationally more efficient than computing the transformation for each layer. All possible transformations of an image pixel I(x, y) are basically located on an infinite line g in world coordinates (xw , yw , zw ). This line can be described by two points p1 and p2 . Therefore only two transformations, which can be precomputed, are required for the subsequent processing steps. This procedure is usually only valid for a linear stretch in space, which can be assumed in most applied sensor setups. The procedure described in sec. 2.1 is applied for each desired layer, resulting in intersecting regions in various heights, as illustrated in fig 4 b) and c). The object’s height is not required as the polygons are only intersecting within the region above the person’s position. In order to track humans it has been decided to use ten layers with a distance of 0.20 m covering the range of 0.00 m to 1.80 m, as this is usually sufficient to separate humans and only the head would be missing in case the person is by far taller. The ambiguities created by the planar homography approach are commonly solved by the upper body. Therefore the head, which is usually smaller than the body, is not required. The computed intersections have to be aligned in a subsequent step in order to reconstruct the objects’ shapes. Assuming that an object does usually not float above another one, all layers can be stacked into one layer by projecting the intersections to the ground floor. This way a top view is simulated applying a simple summation of the pixel P = (xw , yw , zw ) in all layers into one common ground floor layer with: GF(xw , yw ) =
∑_{l=1}^{n} P(x_w, y_w, z_l).   (14)
Subsequently, a connected component analysis is applied, in order to assign unique IDs to all possible object positions in the projected top view. Each ID is then propagated to the layers above the ground floor, providing a mapping of object regions in the single layers. Besides the exact object location, additionally volumetric information, such as
Fig. 5. Detection example on PETS2007 data [3] projected in two camera views. All persons, except the lady in the ellipse, have been detected and labeled consistently in both views. The error occurred already in the foreground segmentation.
height, width, and depth, is extracted from the image data, providing a more detailed scene representation than the simple localization. Some localization examples are provided in fig. 5, where cylinders approximate the object volume. The operating area has been restricted to the predefined area of interest, which is the region with the marked up coordinate system. As can be seen, occlusions can be resolved easily without any errors. One miss, the lady marked with the black ellipse, appeared because of an error in the foreground segmentation. She has been standing in the same spot even before the background model has been created, and therefore not been detected.

2.3 Computational Optimization of the 3D Representation

The localization accuracy of the previously described approach comes at the cost of computational effort. Both the homography and the fusion in individual layers are quite demanding operations, although a simple mathematical model lies beneath them. Therefore a computationally more efficient variation will be presented in the following. As each detected foreground pixel is transformed into the ground plane, a vast amount of correspondences has to be post processed within the localization process. Instead of computing complex occupancy cones, the observed region is covered by a three dimensional grid with predefined edge lengths. Thus, we segment the observed space into a grid of volume elements, so called voxels. In a first step corresponding voxel and pixel positions in the recorded image are computed. This can be done by computing homographies in various layers, using occupancy rays cast from each image pixel in each separate camera view. Each voxel passed by a ray originating from one pixel is henceforth associated with that pixel. Due to the rough quantization of the 3D space, multiple pixel positions will be matched to each voxel. While slightly decreasing precision, this will result in a larger tolerance to calibration errors. As we now have a precomputed lookup table of pixel to voxel correspondences, it is possible to calculate an occupancy grid quickly for each following observation. Each voxel is assigned a score which is set to zero at first. For each pixel showing a foreground object, all associated voxels' scores are incremented by one step. Going through all the foreground regions of all images, it is possible to compute the scores for each voxel in the occupancy grid. After all image pixels have been processed, a simple thresholding operation is performed on the scores of the voxels, excluding voxels with low scores and thus ambiguous regions. The remaining voxels with higher scores
Fig. 6. 3D reconstruction and detection results of a scene from the PETS2009 [18] dataset
The remaining voxels with higher scores then provide an approximated volume of the observed object. The threshold is usually set equal to the number of cameras, meaning that a valid voxel needs an associated foreground/object pixel in each camera view. After filling the individual grid elements, a connected component analysis, as commonly used in image processing, is applied to the 3D voxel grid in order to locate objects. The only significant difference to the 2D operation is the number of possibly connected neighbor elements, which rises from 8 to 26. An exemplary detection result is illustrated in fig. 6, using a scene from the PETS2009 workshop [18]. Due to the rough quantization of the tracking region, calibration errors and unreliable foreground segmentation could be partially compensated, and a considerably higher tracking accuracy has been reached with this method, which has been evaluated on the PETS2007 database. While the multi-layer homography approach (MLH) and the presented voxel-based tracking achieved the same localization accuracy of 0.15 m, the number of ID changes decreased drastically from 18 to 3. This result is comparable to a combined MLH and 2D tracking approach, as presented in [8], where a graph-based representation using SIFT features [30] has been applied [28]. In terms of tracking accuracy, the performance has thus not risen drastically; the computational effort, however, has been decreased by a factor of seven at the same time. This makes the approach considerably more efficient than comparable ones.
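The voxel-based localization described above can be summarized in a short sketch; the lookup table pixel_to_voxels, the grid shape and the variable names are illustrative assumptions, not the authors' code.

```python
import numpy as np
from scipy import ndimage

def voxel_occupancy(foreground_masks, pixel_to_voxels, grid_shape, n_cameras):
    """Accumulate voxel scores from per-camera foreground masks and keep only
    voxels supported by a foreground pixel in every camera view.

    foreground_masks: list of binary images, one per camera.
    pixel_to_voxels:  dict mapping (camera, x, y) -> iterable of voxel indices,
                      i.e. the precomputed ray/lookup table.
    """
    scores = np.zeros(grid_shape, dtype=np.int32)
    for cam, mask in enumerate(foreground_masks):
        ys, xs = np.nonzero(mask)
        for x, y in zip(xs, ys):
            for vx, vy, vz in pixel_to_voxels.get((cam, x, y), ()):
                scores[vx, vy, vz] += 1
    # A valid voxel needs support from all cameras.
    occupied = scores >= n_cameras
    # Connected component analysis in 3D: 26-connectivity instead of 8.
    labels, n_objects = ndimage.label(occupied, structure=np.ones((3, 3, 3)))
    return labels, n_objects
```

From the per-object label volume, the bounding box of each label directly yields the position, height, width, and depth mentioned above.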
3 Behavior Interpretation

The created trajectories and changes in motion patterns can now be used by a behavior interpretation module, which subsequently either triggers an alarm signal or reacts to the observed activity by other appropriate means [23]. This module essentially matches an unknown observation sequence against stored reference samples. The basic problem is to find a meaningful representation of human behavior, which is quite a challenging task even for highly trained human operators, who
indeed should be 'experts in the field'. A wide range of classifiers, based on statistical learning theory, has been employed in the past in order to recognize different behaviors. Probably the most popular approaches involve dynamic classifiers, such as HMMs [31] or Dynamic Time Warping [36]. Nevertheless, static classifiers, e.g. Support Vector Machines (SVM) or Neural Networks (NN), are being further explored, as these may outperform dynamic ones [4]. All these approaches are data driven and usually require a vast amount of real-world training data. Such data is usually not available, as authorities either do not provide or simply do not have it, and data preparation and model creation are quite time consuming. Therefore an effective solution has to be found to overcome this problem. In order to be able to pick up interesting events and derive so-called 'threat intentions', which may for instance include robberies or even the placement of explosives, a set of Predefined Indicators (PDI), such as loitering in a defined region, has been collected [13]. These PDIs have been assembled into complex scenarios, which can be interpreted as combinations and temporal sequences of so-called Low Level Activities (LLA). Hence, the entire approach consists of two steps: the Low Level Activity detection and the subsequent scene analysis using the outputs of the LLA detection.

3.1 Feature Extraction

The recognition of complex events on trajectory level requires a detailed analysis of temporal events. A trajectory can be interpreted as an object projected onto the ground plane, and therefore techniques from the 2D domain can be used. According to Francois [20] and Choi [16], the most relevant trajectory-related features are defined as follows: continue, appear, disappear, split, and merge. All of these can be handled by the tracking algorithm, where the object age, i.e. the number of frames a person has been visible, can also be determined reliably. Additionally, motion patterns, such as speed and stationarity, are analyzed.
– Motion Features: In order to be able to perform an analysis of LLAs from a wide range of recordings and setups, it is reasonable to remove the position of the person in the first place. It is important to detect running, walking or loitering persons, where the position only provides contextual information. Therefore only the persons' speed and acceleration are computed directly on trajectory level. The direction of movement can also be considered as contextual information, which leads to the conclusion to record only changes in the direction of motion on the xy plane.
– Stationarity: For some scenarios, such as left luggage detection, objects not altering their spatial position have to be picked up in a video sequence. Due to noise in the video material or slight changes in the detector output, e.g. the median of a particle filter, the object location is slightly jittering. A simple spatial threshold over time is usually not adequate, because the jitter might vary in intensity over time. Therefore the object position p_i(t) is averaged over the last N frames:
\bar{p}_i = \frac{1}{N} \sum_{t'=t-N}^{t} p_i(t')    (15)
Subsequently, the normalized variance in both the x- and y-direction,

\sigma_i(t) = \frac{1}{N} \sum_{t'=t-N}^{t} \left( p_i(t') - \bar{p}_i \right)^2,    (16)
is computed [9,3]. This step is required to smooth noise created by the sensors and errors during image processing. Stationarity can then be assumed for objects whose variance falls below a predefined threshold \theta:

\mathrm{stationarity} = \begin{cases} 1 & \text{if } \sigma_i(t) < \theta \\ 0 & \text{else,} \end{cases}    (17)

where 1 indicates stationarity and 0 represents walking or running. Given only the location coordinates, this method does not discriminate between pedestrians and other objects, enabling stationarity detection for any given object in the scene. A detection example is illustrated in fig. 7.
– Detection of Splits and Mergers: According to Perera [33], splits and merges have to be detected in order to maintain IDs in the tracking task. Guler [21] tried to handle these as low level events describing more complex scenarios, such as people getting out of cars or forming crowds. A merger usually appears in case two previously independent objects O_1(t) and O_2(t) unite into a normally bigger one:

O_{12}(t) = O_1(t-1) \cup O_2(t-1).    (18)
This observation is usually made in case two objects are either located extremely close to each other or touch one another in 3D, whereas in 2D a partial occlusion might be the reason for a merger. In contrast, two objects O_{11}(t) and O_{12}(t) can be created by a splitting object O_1(t-1), which might itself have been created by a previous merger. While others usually analyze object texture and luminance [38], the applied rule-based approach relies only on the object positions and the regions' sizes. Disappearing and appearing objects have to be recognized during the tracking process in order to incorporate a split or merge:
• Merge: one object disappears, but two objects of the previous frame can be mapped onto one and the same object during tracking. In the optimal case both surfaces intersect with the resulting bigger surface:

O_1(t-1) \cap O_{12}(t) \neq \emptyset \;\wedge\; O_2(t-1) \cap O_{12}(t) \neq \emptyset.    (19)
• Split: Analogously, in a split two objects at frame t are mapped onto one object at time t-1, where both new objects intersect with the old, splitting one:

O_{11}(t) \cap O_1(t-1) \neq \emptyset \;\wedge\; O_{12}(t) \cap O_1(t-1) \neq \emptyset.    (20)

A short code sketch of the stationarity test and this split/merge bookkeeping is given below.
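The following sketch illustrates the stationarity criterion of Eqs. (15)-(17) and the overlap tests of Eqs. (19)-(20). The function names and the representation of object regions as sets of ground-plane cells are assumptions made for the example, not the system's actual data structures.

```python
import numpy as np

def is_stationary(track, N, theta):
    """Eqs. (15)-(17): variance of the last N positions of a track vs. a threshold.
    track: array of shape (T, 2) with (x, y) positions per frame."""
    window = np.asarray(track[-N:], dtype=float)
    mean_pos = window.mean(axis=0)                             # Eq. (15)
    var = np.mean(np.sum((window - mean_pos) ** 2, axis=1))    # Eq. (16)
    return var < theta                                         # Eq. (17)

def overlaps(a, b):
    """Object regions represented as sets of occupied ground-plane cells."""
    return len(a & b) > 0

def is_merge(o1_prev, o2_prev, o12_now):
    """Eq. (19): two previous objects both intersect the new, bigger one."""
    return overlaps(o1_prev, o12_now) and overlaps(o2_prev, o12_now)

def is_split(o1_prev, o11_now, o12_now):
    """Eq. (20): two new objects both intersect the old, splitting one."""
    return overlaps(o11_now, o1_prev) and overlaps(o12_now, o1_prev)
```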
– Proximity of Objects: As in various cases persons are interacting with each other, it seems reasonable to model combined motions. This can be done according to the direction of movement, the proximity of objects, and velocity. As the direction of motion can be computed easily, it is possible to elongate the motion vector v and compute intersections with interesting objects or other motion vectors. Further, the distance between object positions can be determined with

d_{ij} = \sqrt{(x_i(t) - x_j(t))^2 + (y_i(t) - y_j(t))^2}.    (21)

Distances between persons and objects are usually computed in a scenario-related manner and require contextual knowledge, as the positions of fixed objects are known beforehand and these objects cannot necessarily be detected automatically. In case interactions between persons are of interest, it is sufficient to analyze only the objects with the smallest distance.

Fig. 7. Exemplary recognition results for Walking, Loitering and Operating an ATM

3.2 Low Level Activity Detection

The classification of Low Level Activities has been performed applying various techniques. Rule-based approaches [6] and Bayesian Networks [14] have been quite popular. As it is hard to handle continuous data streams with both approaches and to set up a comprehensive set of rules for each activity, dynamic data-driven classification should be preferred. Though it has previously been stated that data is hardly available, this applies only to complex scenarios, such as robberies or theft. It is therefore reasonable to collect LLAs from different data sources and additionally to collect a large amount of normal data containing none of the desired LLAs, as this will be the most frequently occurring class. Hidden Markov Models [34] are applied for the trajectory analysis task in the first stage, as these can cope with dynamic sequences of variable length. Neither the duration nor the start or end frames of the desired LLAs are known before the training phase; only the order and number of activities for each sample in the database are defined. Each action is represented by a four- or five-state, left-right, continuous HMM and trained using the Baum-Welch algorithm [10]. During the training process the activities are aligned to the training data via the Viterbi algorithm in order to find the start and end frames of the contained activities. The recognition task was also performed applying the Viterbi algorithm. For this task all features except the contextual information, such as position or proximity, have been applied. Table 1 illustrates the desired classes and the recognition results.
Table 1. Detection (det) results and false positives (fpos) for all five LLAs within the databases. The HMM-based approach clearly outperforms the static Bayesian Network approach.

Event            [#]   det BN   det HMM   fpos BN   fpos HMM
Running           14       10        13         1          0
Stationarity       7        7         0         0          0
Drop luggage      18        0        15        12          1
Pick up luggage   12        0        10         0          2
Loitering         60       60        60         3          1
Fig. 8. a) Structure of the coupled HMMs
This approach has been evaluated on a total of 2.5 h of video comprising the PETS2006, PETS2007, and PROMETHEUS [1] datasets. As such a detailed analysis of these datasets has not yet been performed elsewhere, a comparison to competing approaches is not possible; nevertheless, results applying Bayesian Networks, as presented in [7], are provided where available. Note that the activities of interest only cover a small part of the databases. It is remarkable that for all classes only a few misses are reported and a very small number of false positives is detected. A confusion matrix is not provided, as misses were usually confused with neutral behavior, while neutral behavior was in turn responsible for most false positives. Walking is handled as neutral behavior and, due to the large amount of data, not especially considered in the evaluation. Nevertheless it can be recognized almost flawlessly, although longer sequences of walking are frequently segmented into shorter parts. This problem can be handled by merging consecutive segments of walking.

3.3 Scenario Recognition

Having extracted and detected all required LLAs, either with HMMs or directly within the tracking algorithm, these can now be further analyzed by a scenario interpretation module. Recent approaches were frequently based on a so-called Scenario Description Language (SDL), which contains examples for each possible scenario [13]. Applying the SDL-based approach can be interpreted as rule-based reasoning, which can be achieved with a simple set of rules [8]. Current approaches use a wide range of LLA features and perform the analysis of behaviors or emotions with Dynamic Bayesian Networks (DBN) [41], which usually require a vast amount of data to compute the inferences. A simple form of the DBN, also data driven, is the well-known HMM. It is capable of segmenting and classifying data streams at the same time. Current implementations usually analyze the trajectory created by one person and do not allow for interactions between multiple persons.
Table 2. Detection (det) results and false positives (fpos) for all five complex scenarios within the evaluated databases. Rule-based approaches clearly perform worse than DBNs, which are in turn outperformed by HMMs.

Event            [#]   det DBN   det Rules   det HMM   fpos DBN   fpos Rules   fpos HMM
Left Luggage      11         9           5        10          3            6          2
Luggage Theft      6         2           0         4          3            1          2
Operate ATM       17        17          17        17          2            5          0
Wait at ATM       15        15          10        15          3            7          1
Robbery at ATM     3         3           2         3          0            4          0
Furthermore, it is hard to compute transition probabilities when a wide range of states and orders is allowed but only little data is available. Therefore it has already been proposed to couple Markov chains [13]. A DBN-based approach has been presented in [7], where the outputs of individually classified trajectories have been combined into an overall decision. In contrast to the previously used simple Markovian structure, an HMM-based implementation is now used to allow for more complex models and scenarios. As fig. 8 illustrates, the applied implementation allows transitions between several HMMs that are run through in parallel. This has the advantage that not every scenario has to be modeled individually, and links between individually modeled trajectories or persons can be established. In a very basic implementation it can be assumed that these state transitions are simple triggers which set a feature value, allowing the model to leave the current state, which has been repeated a couple of times. One of the major issues with this approach is the need for real data. As this is not available in vast amounts, training has been performed using real data and an additional set of definitions by experts, where artificial variance has been introduced by insertions and deletions of observations. The trained models have once more been evaluated on the previously mentioned three databases, namely PETS2006, PETS2007 and PROMETHEUS. A brief overview of the results is given in table 2, which compares the HMM-based approach to previous ones applying either rules [3] or the previously mentioned Dynamic Bayesian Networks (DBN) [7]. Obviously both DBNs and HMMs perform better than rule-based approaches. The presented coupled HMM approach nevertheless performs slightly better than the previous DBN-based implementation, which only allowed state transitions from left to right and not between individual models. Especially the lower false positive rate of the coupled HMM approach is remarkable.
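As a simplified stand-in for the coupled-HMM scenario models (the coupling and the trigger transitions between parallel chains are omitted here), the following sketch scores a detected LLA sequence against per-scenario discrete HMMs with the forward algorithm; all model parameters and names are illustrative assumptions.

```python
import numpy as np

def log_forward(obs, log_pi, log_A, log_B):
    """Log-likelihood of a discrete observation sequence under an HMM.
    obs:    sequence of LLA indices, e.g. [WALK, LOITER, DROP_LUGGAGE, ...]
    log_pi: (S,)   log initial state probabilities
    log_A:  (S, S) log transition matrix
    log_B:  (S, V) log emission matrix over the LLA vocabulary
    """
    alpha = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        # log-sum-exp over previous states for each new state
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, o]
    return np.logaddexp.reduce(alpha)

def recognize_scenario(obs, scenario_models):
    """Pick the scenario model (e.g. 'left luggage', 'robbery at ATM') with
    the highest log-likelihood for the observed LLA sequence."""
    return max(scenario_models.items(),
               key=lambda kv: log_forward(obs, *kv[1]))[0]
```

In the actual system, left-right topologies and trigger-based links between the parallel chains would replace the generic transition matrix used in this sketch.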
Fig. 9. Exemplary recognition of a robbery at an ATM
Two exemplary recognition results from the PROMETHEUS database are provided in fig. 7 and fig. 9, where a person is either operating an ATM or being robbed at an ATM. As can be seen, the activities in the scene are correctly picked up, assigned to the corresponding persons, and displayed in the figures.
4 Conclusion and Outlook

We have presented an integrated framework for the robust interpretation of complex behaviors utilizing multi-camera surveillance systems. The tracking part has been conducted in a voxel-based representation of the desired tracking regions, which is based on multi-layer homography. This approach has been improved both in speed and performance by the rough quantization of space. Nevertheless, tracking performance can be further enhanced by creating a 3D model of the person using information retrieved from the original images, as proposed for Probabilistic Occupancy Maps [19]. Furthermore, the introduction of other sensors, such as 3D cameras or thermal infrared, could provide a more reliable segmentation of the scene [5]. It has further been demonstrated that a complex behavior can be decomposed into multiple easy-to-detect LLAs, which can be recognized either during the tracking phase or by applying HMMs. The detected LLAs are subsequently fed into a behavior interpretation module, which uses coupled HMMs and allows transitions between concurrently running models. Applying this approach resulted in a high detection rate and a low false positive rate for all three evaluated databases. For future development it would be desirable to analyze persons in further detail, which would include the estimation of the person's pose [2,29] and would also allow the recognition of gestures [42]. Besides the introduction of further features and potential LLAs, the scenario interpretation needs further improvement: while a limited set of behaviors can be modeled with little data, ambiguities between classes with low variance may not be distinguished that easily. In summary, the presented methods can be used as assistance for human-operated CCTV systems, helping staff to focus attention on noticeable events at a low false positive rate, while at the same time ensuring minimal false negatives.
References

1. Ahlberg, J., Arsić, D., Ganchev, T., Linderhed, A., Menezes, P., Ntalampiras, S., Olma, T., Potamitis, I., Ros, J.: Prometheus: Prediction and interpretation of human behavior based on probabilistic structures and heterogeneous sensors. In: Proceedings 18th ECCAI European Conference on Artificial Intelligence, ECAI 2008, Patras, Greece, pp. 38–39 (2008)
2. Andriluka, M., Roth, S., Schiele, B.: Monocular 3d pose estimation and tracking by detection. In: Proceedings International IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010), pp. 623–630 (2010)
3. Arsić, D., Hofmann, M., Schuller, B., Rigoll, G.: Multi-camera person tracking and left luggage detection applying homographic transformation. In: Proceedings Tenth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, PETS 2007, Rio de Janeiro, Brazil, pp. 55–62 (2007)
4. Arsić, D., Hörnler, B., Schuller, B., Rigoll, G.: A hierarchical approach for visual suspicious behavior detection in aircrafts. In: Proceedings 16th IEEE International Conference on Digital Signal Processing, Special Session "Biometric Recognition and Verification of Persons and their Activities for Video Surveillance", DSP 2009, Santorini, Greece (2009)
5. Arsić, D., Hörnler, B., Schuller, B., Rigoll, G.: Resolving partial occlusions in crowded environments utilizing range data and video cameras. In: Proceedings 16th IEEE International Conference on Digital Signal Processing, Special Session "Fusion of Heterogeneous Data for Robust Estimation and Classification", DSP 2009, Santorini, Greece (2009)
6. Arsić, D., Lehment, N., Hristov, E., Hörnler, B., Schuller, B., Rigoll, G.: Applying multi layer homography for multi camera tracking. In: Proceedings Second ACM/IEEE International Conference on Distributed Smart Cameras, ICDSC 2008, Stanford, CA, USA, pp. 1–9 (2008)
7. Arsić, D., Lyutskanov, A., Kaiser, M., Rigoll, G.: Applying bayes markov chains for the detection of atm related scenarios. In: Proceedings IEEE Workshop on Applications of Computer Vision (WACV), in Conj. with the IEEE Computer Society's Winter Vision Meetings, Snowbird, Utah, USA, pp. 1–8 (2009)
8. Arsić, D., Schuller, B., Rigoll, G.: Multiple camera person tracking in multiple layers combining 2d and 3d information. In: Proceedings Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications (M2SFA2), Marseille, France (2008)
9. Auvinet, E., Grossmann, E., Rougier, C., Dahmane, M., Meunier, J.: Left luggage detection using homographies and simple heuristics. In: Proceedings Ninth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, PETS 2006, New York, NY, USA, pp. 51–59 (2006)
10. Baum, L.E.: An inequality and associated maximization technique in statistical estimation for probabilistic functions of markov processes. Inequalities 3, 1–8 (1972)
11. Berclaz, J., Fleuret, F., Fua, P.: Multi-camera tracking and atypical motion detection with behavioral maps. In: Proceedings 10th European Conference on Computer Vision, Marseille, France (2008)
12. Broadhurst, A., Drummond, T., Cipolla, R.: A probabilistic framework for space carving. In: Proceedings Eighth IEEE International Conference on Computer Vision, ICCV 2001, pp. 388–393 (2001)
13. Carter, N., Ferryman, J.: The safee on-board threat detection system. In: Proceedings International Conference on Computer Vision Systems, pp. 79–88 (May 2008)
14. Carter, N., Young, D., Ferryman, J.: A combined bayesian markovian approach for behaviour recognition. In: Proceedings 18th International IEEE Conference on Pattern Recognition, ICPR 2006, Washington, DC, USA, pp. 761–764 (2006)
15. Chen, D., Liao, H.M., Shih, S.: Continuous human action segmentation and recognition using a spatio-temporal probabilistic framework. In: Proceedings Eighth IEEE International Symposium on Multimedia, ISM 2006, Washington, DC, USA, pp. 275–282 (2006)
16. Choi, J., Cho, Y., Cho, K., Bae, S., Yang, H.S.: A view-based multiple objects tracking and human action recognition for interactive virtual environments. The International Journal of Virtual Reality 7, 71–76 (2008)
17. Estrada, F., Jepson, A., Fleet, D.: Planar homographies, lecture notes foundations of computer vision. University of Toronto, Department of Computer Science (2004)
18. Ferryman, J., Shahrokni, A.: An overview of the pets 2009 challenge.
In: Proceedings Eleventh IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, PETS 2009, Miami, FL, USA, pp. 1–8 (2009)
19. Fleuret, F., Berclaz, J., Lengagne, R., Fua, P.: Multi-camera people tracking with a probabilistic occupancy map. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 30(2), 267–282 (2008)
20. Francois, A.R.J.: Real-time multi-resolution blob tracking. In: IRIS Technical Report IRIS04-422, University of Southern California, Los Angeles, USA (2004)
21. Guler, S.: Scene and content analysis from multiple video streams. In: Proceedings 30th IEEE Workshop on Applied Imagery Pattern Recognition, AIPR 2001, pp. 119–123 (2001)
22. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2003)
23. Hu, W., Tan, T., Wang, L., Maybank, S.: A survey on visual surveillance of object motion and behaviors. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 34(3), 334–352 (2004)
24. Khan, S.M., Yan, P., Shah, M.: A homographic framework for the fusion of multi-view silhouettes. In: Proceedings Eleventh IEEE International Conference on Computer Vision, ICCV 2007, Rio de Janeiro, Brazil, pp. 1–8 (2007)
25. Khan, S., Shah, M.: A multiview approach to tracking people in crowded scenes using a planar homography constraint. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 133–146. Springer, Heidelberg (2006)
26. Kutulakos, K., Seitz, S.: A theory of shape by space carving, technical report tr692. Tech. rep., Computer Science Department, University of Rochester (1998)
27. Laurentini, A.: The visual hull concept for silhouette-based image understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(2), 150–162 (1994)
28. Lehment, N., Arsić, D., Lyutskanov, A., Schuller, B., Rigoll, G.: Supporting multi camera tracking by monocular deformable graph tracking. In: Proceedings Eleventh IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, PETS 2009, Miami, FL, USA, pp. 87–94 (2009)
29. Lehment, N., Kaiser, M., Arsić, D., Rigoll, G.: Cue-independent extending inverse kinematics for robust pose estimation in 3d point clouds. In: Proceedings IEEE International Conference on Image Processing (ICIP 2010), Hong Kong, China, pp. 2465–2468 (2010)
30. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004)
31. Oliver, N., Rosario, B., Pentland, A.: A bayesian computer vision system for modeling human interactions. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 831–843 (2000)
32. Orwell, J., Remagnino, P., Jones, G.: Multi-camera colour tracking. In: Proceedings Second IEEE Workshop on Visual Surveillance, VS 1999, Fort Collins, CO, USA, pp. 14–21 (1999)
33. Perera, A., Srinivas, C., Hoogs, A., Brooksby, G., Hu, W.: Multi-object tracking through simultaneous long occlusions and split-merge conditions. In: Proceedings 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2006, Washington, DC, USA, pp. 666–673 (2006)
34. Rabiner, L.: A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE 77, 257–286 (1989)
35. Seitz, S., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: Proceedings IEEE Conference on Computer Vision and Pattern Recognition, CVPR, New York, NY, June 17-22, vol. 1, pp. 519–528 (2006)
36. Takahashi, K., Seki, S., Kojima, E., Oka, R.: Recognition of dexterous manipulations from time-varying images. In: Proceedings 1994 IEEE Workshop on Motion of Non-Rigid and Articulated Objects, pp. 23–28 (1994)
37. Thirde, D., Li, L., Ferryman, J.: Overview of the pets2006 challenge. In: Proceedings Ninth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, PETS 2006, pp. 1–8. IEEE, New York (2006)
38. Vigus, S., Bull, D., Canagarajah, C.: Video object tracking using region split and merge and a kalman filter tracking algorithm. In: Proceedings International Conference on Image Processing, ICIP 2001, Thessaloniki, Greece, pp. 650–653 (2001)
39. Wang, L.: Abnormal walking gait analysis using silhouette-masked flow histograms. In: Proceedings 18th International Conference on Pattern Recognition, pp. 473–476. IEEE Computer Society, Washington, DC (2006)
40. Welsh, B., Farrington, D.: Effects of closed circuit television surveillance on crime. Campbell Systematic Reviews 17, 110–135 (2008)
41. Wöllmer, M., Schuller, B., Eyben, F., Rigoll, G.: Combining long short-term memory and dynamic bayesian networks for incremental emotion-sensitive artificial listening. IEEE Journal of Selected Topics in Signal Processing 4(5), 867–881 (2010); special issue on "Speech Processing for Natural Interaction with Intelligent Environments"
42. Wu, C., Aghajan, H.: Model-based human posture estimation for gesture analysis in an opportunistic fusion smart camera network. In: Proceedings IEEE Conference on Advanced Video and Signal Based Surveillance, AVSS 2007, pp. 453–458 (2007)
43. Yue, Z., Zhou, S., Chellappa, R.: Robust two-camera tracking using homography. In: Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2004, vol. 3, pp. 1–4 (2004)
Animated Faces for Robotic Heads: Gaze and Beyond

Samer Al Moubayed, Jonas Beskow, Jens Edlund, Björn Granström, and David House

Department of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden
{sameram,beskow,davidh}@kth.se, {edlund,bjorn}@speech.kth.se
http://www.speech.kth.se
Abstract. We introduce an approach to using animated faces for robotics where a static physical object is used as a projection surface for an animation. The talking head is projected onto a 3D physical head model. In this chapter we discuss the different benefits this approach adds over mechanical heads. After that, we investigate a phenomenon commonly referred to as the Mona Lisa gaze effect. This effect results from the use of 2D surfaces to display 3D images and causes the gaze of a portrait to seemingly follow the observer no matter where it is viewed from. The experiment investigates the perception of gaze direction by observers. The analysis shows that the 3D model eliminates the effect, and provides an accurate perception of gaze direction. We discuss at the end the different requirements of gaze in interactive systems, and explore the different settings these findings give access to. Keywords: Facial Animation, Talking Heads, Shader Lamps, Robotic Heads, Gaze, Mona Lisa Effect, Avatar, Dialogue System, Situated Interaction, 3D Projection, Gaze Perception.
1 Introduction
During the last two decades, there has been ongoing research and impressive progress in facial animation. Researchers have been developing human-like talking heads that can interact with humans in a human-like manner [1], realize realistic facial expressions [2], express emotions [3] and communicate behaviors [4]. Several talking heads are designed to represent personas embodied in 3D facial designs (referred to as ECAs, Embodied Conversational Agents), simulating human behavior and establishing interaction and conversation with a human interlocutor. Although these characters have been embodied in human-like 3D animated models, this embodiment has always been limited by how the characters are displayed in our environment. Traditionally, talking heads have been displayed using two-dimensional displays (e.g. flat screens, wall projections, etc.),
having no shared access to the three-dimensional environment where the interaction is taking place. Surprisingly, there is little research on the effects of displaying 3D ECAs on 2D surfaces on the perception of the agent's embodiment and its natural interaction effects [5]. Moreover, 2D displays come with several usually undesirable illusions and effects, such as the Mona Lisa gaze effect. For a review of these effects, refer to [6]. In robotics, on the other hand, the complexity, robustness and high resolution of facial animation, which is achieved using computer graphics, is not employed. This is due to the fact that the accurate, highly subtle and complicated control of computer models (such as eyes, eyelids, wrinkles, lips, etc.) does not map onto mechanically controlled heads. Such computer models would require very delicate, smooth, and fast control of the motors, appearance and texture of a mechanical head. This fact has large implications for the development of robotic heads. Moreover, in a physical mechanical robot head, the design and implementation of anthropomorphic properties can be limited, highly expensive, time consuming and difficult to test until the final head is finished. In talking heads, on the other hand, changes in color, design, features, and even control of the face can be very easy and time efficient compared to mechanically controlled heads. There are a few studies attempting to take advantage of the appearance and behavior of talking heads in robotic heads. In [7], a flat screen is used as the head of the robot, displaying an animated agent. In [8], the movements of the motors of a mechanical head are driven by the control parameters of animated agents, in an attempt to generate facial trajectories that are similar to those of a 3D animated face. These studies, although showing the interest in and need for using the characteristics of animated talking agents in robot heads, are still limited by how the agent is represented: in the first case by a 2D screen that profits from the appearance of the animated face but comes with detrimental effects and illusions, and in the second case by a mechanical head that tries to benefit from the behavior but misses out on the appearance. In this chapter we present a new approach to using animated faces for robotic heads, where we attempt to combine the physical dimensionality and embodiment of a robotic head with the appearance and behavior of animated agents. After presenting our approach and discussing its benefits, we investigate and evaluate it by studying its accuracy in delivering gaze direction in comparison to two-dimensional display surfaces. Perhaps one of the most important effects of displaying three-dimensional scenes on two-dimensional surfaces is the Mona Lisa gaze effect, commonly described as an effect that makes it appear as if Mona Lisa's gaze rests steadily on the viewer as the viewer moves through the room. This effect has important implications for situational and spatial interaction, since the gaze direction of a face displayed on a two-dimensional display does not point to an absolute location in the environment of the observer. In Section 2 we describe our proposal of using a 3D model of a human head as a projection surface for an animated talking head.
Fig. 1. The technical setup: the physical model of a human head used as a 3D projection surface, to the left; the laser projector in the middle; and a snapshot of the 3D talking head to the right.
In Section 3 we discuss the benefits of using our approach in comparison to a traditional mechanical robotic head. In Section 4 we describe an experimental setup and a user study on the perception of gaze targets using a traditional 2D display and the novel 3D projection surface. In Section 5 we discuss the properties of gaze in terms of faithfulness for different communication requirements and configurations. We discuss different applications that can capitalize on our approach, as well as research and experimentation made possible by it, in Section 6 and present final conclusions in Section 7.
2 Projected Animated Faces on 3D Head Models
Our approach is based on the idea of projecting an animated face onto a 3D surface: a static, physical model of a human head. The technique of manipulating static objects with light is commonly referred to as the Shader Lamps technique [9], [10]. This technique is used to change the physical appearance of still objects by illuminating them with projections of static or animated textures, or video streams. We implement this technique by projecting an animated talking head (seen to the right in figure 1) onto an arbitrary physical model of a human head (seen to the left in figure 1) using a laser micro projector (SHOWWX Pico Projector, seen in the center of figure 1). The main advantage of using a laser projector is that the image is always in focus, even on curved surfaces. The talking head used in the studies is detailed in [11] and includes a face, eyes, tongue, and teeth, based on static 3D wireframe meshes that are deformed using direct parameterization, applying weighted transformations to their vertices according to principles first introduced in [12]. Figure 2 shows the 3D projection surface with and without a projection of the talking head.
3 Robotic Heads with Animated Faces
The capacity for adequate interaction is a key concern.
Fig. 2. A physical model of a human head, without projection (left) and complete with a projection of the talking head, a furry hat, and a camera (right)
Since a great proportion of human interaction is managed non-verbally through gestures, facial expressions and gaze, an important current research trend in robotics deals with the design of social robots. But what mechanical and behavioral compromises should be considered in order to achieve satisfying interaction with human interlocutors? In the following, we present an overview of the practical benefits of using an animated talking head projected on a 3D surface as a robotic head.
1 Optically based. Since the approach utilizes a static 3D projection surface, the actual animation is done entirely using computer graphics projected onto the surface. This provides an alternative to mechanically controlled faces, reducing electrical consumption and avoiding complex mechanical designs and motor control. Computer graphics also offers many advantages over motor-based animation of robotic heads in speed, animation accuracy, resolution and flexibility.
2 Animation using computer graphics. Facial animation technology has shown tremendous progress over the last decade, and currently offers realistic, efficient, and reliable renditions. It is able to establish facial designs that are far more human-like in appearance and behavior than the physical designs of mechanical robotic heads.
3 Facial design. The face design is done through software, which potentially provides the flexibility of having an unlimited range of facial designs for the same head. Even if the static projection surface needs to be re-customized to match a particularly unusual design, this is considerably simpler, faster, and cheaper than redesigning a whole mechanical head. In addition, the easily interchangeable face design offers the possibility to efficiently experiment with different aspects of facial designs and characteristics in robotic heads, for example to examine the anthropomorphic spectrum.
4 Light weight. The optical design of the face leads to a considerably more lightweight head, depending only on the design of the projection surface. This makes the design of the neck much simpler, and a more lightweight neck can be used, as it has to carry and move less weight. Ultimately, a lighter mobile robot is safer and saves energy.
5 Low noise level. Using light projection instead of a motor-controlled face avoids all motor noises generated by moving the face. This is
crucial for a robot interacting verbally with humans, and in any situation where noise generation is a problem.
6 Low maintenance. Maintenance is reduced to software maintenance and maintenance of the micro laser projector, which is easily replaceable. In contrast, mechanical faces are complicated both electronically and mechanically, and an error in the system can be difficult and time consuming to troubleshoot.
Naturally, there are drawbacks as well. Some robotic face designs cannot be fully achieved using light-projected animation alone, for example those requiring very large jaw openings, which cannot be easily and realistically delivered without mechanically changing the physical projection surface. For such requirements, a hybrid approach can be implemented which combines a motor-based physical animation of the head for the larger facial movements with an optically projected animation for the more subtle movements, for example changes in eyes, wrinkles and eyebrows. In addition, the animations are delivered using light, so the projector must be able to outshine the ambient light, which becomes an issue if the robot is designed to be used in very bright light, such as full daylight. This problem can be remedied by employing the ever more powerful laser projectors that are being brought to the market.
4 Gaze Perception and the Mona Lisa Gaze Effect
The importance of gaze in social interaction is well established. From a human communication perspective, Kendon's work [13] on gaze direction in conversation has been particularly important in inspiring a wealth of studies that singled out gaze as one of the strongest non-vocal cues in human face-to-face interaction (see e.g. [14]). Gaze has been associated with a variety of functions within social interaction. Kleinke's review article from 1986, for example, contains the following list: (a) provide information, (b) regulate interaction, (c) express intimacy, (d) exercise social control, and (e) facilitate service and task goals ([15]). These efforts, in turn, were shadowed by a surge of activity in the human-computer interaction community, which recognized the importance of modeling gaze in artificial personas such as embodied conversational agents (ECAs) (e.g. [16]; [17]). To date, these efforts have been somewhat biased towards the production of gaze behavior, whereas less effort has been expended on the perception of gaze. In light of the fact that an overwhelming majority of ECAs are either 2D or 3D models rendered on 2D displays, this is somewhat surprising: the perception of 2D renditions of 3D scenes is notoriously riddled with artefacts and illusions of many sorts; for an overview, see [18]. Perhaps the most important of these for using gaze behaviors in ECAs for communicative purposes is the Mona Lisa gaze effect, or the Mona Lisa stare, commonly described as an effect that makes it appear as if Mona Lisa's gaze rests steadily on the viewer as the viewer moves through the room (figure 3). The fact that the Mona Lisa gaze effect occurs when a face is presented on a 2D display has significant consequences for the use and control of gaze in communication.
Fig. 3. Leonardo da Vinci’s Mona Lisa. Mona Lisa appears to be looking straight at the viewer, regardless of viewing angle. The painting is in the public domain.
To the extent that gaze in a 2D face follows the observer, gaze does not point unambiguously at a point in 3D space. In the case of multiple observers, they all have the same perception of the image, no matter where they stand in relation to, e.g., the painting or screen. This causes an inability to establish situated eye contact with one particular observer without simultaneously establishing it with all others, which leads to miscommunication if gaze is employed to support a smoothly flowing interaction with several human subjects: all human subjects will perceive the same gaze pattern. In the following experiment, we investigate the accuracy of perceived gaze direction in our 3D head model, discuss the different applications it can be used for, and contrast it with a traditional 2D display. The experiment detailed here was designed and conducted to confirm the hypothesis that a talking head projected on a 2D display is subject to the Mona Lisa gaze effect, while projecting it on a 3D surface inhibits the effect and enforces an eye-gaze direction that is independent of the subject's angle of view. Accordingly, the experiment measures the perception accuracy of gaze in these two configurations.

4.1 Setup
The experiment setup employs a set of subjects simultaneously seated on a circle segment centred on the stimulus point (a 2D or 3D projection surface), facing the stimulus point. Adjacent subjects are equidistant from each other and all subjects are equidistant from the projection surface, so that the angle between two adjacent subjects and the projection surface is always about 26.5 degrees. The positions are annotated as -53, -26.5, 0, 26.5, 53, where 0 is the seat directly in front of the projection surface. The distance from the subjects to the projection surface was 1.80 meters (figure 4).
Fig. 4. Schematic of the experiment setup: five simultaneous subjects are placed at equal distances along the perimeter of a circle centred on the projection surface
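For concreteness, the seating geometry can be written out as a short worked example; the coordinate convention (projection surface at the origin, looking along the positive y axis) is an assumption of this sketch and not part of the experimental description.

```python
import numpy as np

# Five seats on a circle segment around the projection surface,
# 26.5 degrees apart, at a radius of 1.80 m.
seat_angles_deg = np.array([-53.0, -26.5, 0.0, 26.5, 53.0])
radius_m = 1.80

seat_x = radius_m * np.sin(np.radians(seat_angles_deg))
seat_y = radius_m * np.cos(np.radians(seat_angles_deg))

for a, x, y in zip(seat_angles_deg, seat_x, seat_y):
    print(f"seat at {a:+6.1f} deg -> ({x:+.2f} m, {y:+.2f} m)")
```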
Two identical sets of stimuli are projected on a 2D surface in the 2D condition (2DCOND) and on a 3D surface in the 3D condition (3DCOND). The stimuli sets contain the animated talking head with 20 different gaze angles. The angles are equally spaced between -25 degrees and +13 degrees in the 3D model's internal gaze angle (horizontal eyeball rotation in relation to the skull) with 2-degree increments, where a rotation of 0 degrees means that the eyes are looking straight forward. The angles between +13 degrees and +25 degrees were left out because of a programming error, but we found no indications that this asymmetry had any negative effects on the experimental results. Five subjects were simultaneously employed in a within-subject design, where each subject judged each stimulus in the experiment. All five subjects had normal or corrected-to-normal eyesight.

4.2 Method
Before the experiment, the subjects were presented with an answer sheet, and the task of the experiment was explained: to point out, for each stimulus, at which subject the gaze of the animated head is pointing. The advantage of using subjects as gaze targets is that this method provides perceptually and communicatively relevant gaze targets instead of using, for example, a spatial grid as in [19]. For each set of 20 stimuli, each of the seated subjects received an empty answer sheet with 20 answer lines indicating the positions of all subjects. The subject enters a mark on one of the subjects, indicating her decision. If a subject believed the head was looking beyond the rightmost or the leftmost subject, the subject entered the mark at the end of either of the two arrows to the right or left of the boxes that represent the subjects.
Fig. 5. Snapshots, taken over the shoulder of a subject, of the projection surfaces in 3DCOND (left) and 2DCOND (right)
The five subjects were then randomly seated at the five positions and the first set of 20 stimuli was projected in 3DCOND, as seen on the left of figure 5. Subjects marked their answer sheets after each stimulus. When all stimuli had been presented, the subjects were shifted to new positions and the process repeated, in order to capture any bias for subject/position combinations. The process was repeated five times, so that each subject sat in each position once, resulting in five sets of responses from each subject.

4.3 Analysis and Results
Figure 6 plots the raw data for all responses over gaze angles. The size of the bubbles indicates the number of responses with the corresponding value for that angle; the bigger the bubble, the more subjects perceived gaze in that particular direction. It is clear that in 3DCOND the perception of gaze is more precise (i.e. fewer bubbles per row) than in 2DCOND. Figure 7 shows bubble plots similar to those in figure 6, with responses for each stimulus. The figure differs in that the data plotted is filtered so that only those responses are shown where perceived gaze matched the responding subject, that is, when subjects responded that the gaze was directed at themselves, commonly called eye contact or mutual gaze. These plots show the location and number of subjects that perceived eye contact over the different gaze angles. In 2DCOND, the Mona Lisa gaze effect is clearly visible: for all near-frontal angles, each of the five subjects, independently of where they were seated, perceived eye contact. The figure also shows that the effect is completely eliminated in 3DCOND, in which generally only one subject at a time perceived eye contact with the head.

4.4 Estimating the Gaze Function
In addition to investigating the gaze perception accuracy of projections on different types of surfaces, the experimental setup allows us to measure a psychometric
Fig. 6. Responses for all subject positions (X axis) over all internal angles (Y axis) for each of the conditions: 2DCOND to the left and 3DCOND to the right. Bubble size indicates number of responses. The X axis contains the responses for each of the five subject positions (from 1 to 5), where 0 indicates gaze perceived beyond the leftmost subject, and 6 indicates gaze perceived beyond the rightmost subject.
function for gaze which maps eyeball rotation in a virtual talking head to physical, real-world angles, an essential function for establishing eye contact between the real and the virtual world. We estimated this function by applying a first-order polynomial fit to the data to get a linear mapping from the real positions of the gaze targets perceived by the subjects to the actual internal eyeball angles in the projected animated talking head, for each condition. In 2DCOND, the estimated function resulting from the linear fit to the data is:

Angle = -5.2 × Gaze Target    (1)
RMSE = 17.66    (2)
R square = 0.668    (3)
Fig. 7. Bubble plot showing only responses where subjects perceived eye-contact: subject position (X axis) over all internal angles (Y axis) for each of the conditions: 2DCond to the left and 3DCond to the right. Bubble size indicates number of responses.
And for 3DCOND:

Angle = -4.1 × Gaze Target    (4)
RMSE = 6.65    (5)
R square = 0.892    (6)
where R square represents the ability of the linear fit to describe the data. Although the resulting gaze functions from the two conditions are similar, the goodness of fit is markedly better in 3DCOND than in 2DCOND. The results provide a good estimation of a gaze psychometric function. If the physical target gaze point is known, the internal angle of eye rotation can be calculated. By reusing the experimental design, the function can be estimated for any facial design or display surface.
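The gaze psychometric function can be estimated with a straightforward least-squares fit, as in the sketch below. The arrays are placeholders standing in for the experiment's response data, not the published measurements, and the helper name internal_angle_for_target is an assumption of the example.

```python
import numpy as np

# Placeholder data (not the published measurements): perceived gaze targets,
# expressed as seat positions in degrees, and the internal eyeball rotations
# of the animated head that the subjects associated with them.
perceived_target_deg = np.array([-53.0, -26.5, 0.0, 26.5, 53.0])
internal_angle_deg   = np.array([ 13.0,   7.0, 0.0, -7.0, -13.0])

# First-order polynomial fit: internal angle as a linear function of gaze target.
slope, intercept = np.polyfit(perceived_target_deg, internal_angle_deg, 1)

# Goodness-of-fit measures analogous to the RMSE and R square values above.
pred = slope * perceived_target_deg + intercept
rmse = np.sqrt(np.mean((internal_angle_deg - pred) ** 2))
ss_res = np.sum((internal_angle_deg - pred) ** 2)
ss_tot = np.sum((internal_angle_deg - internal_angle_deg.mean()) ** 2)
r_square = 1.0 - ss_res / ss_tot

def internal_angle_for_target(target_deg):
    """Eyeball rotation needed for the head to be perceived as looking
    at a physical target at the given angle."""
    return slope * target_deg + intercept
```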
5 Spatial Faithfulness of Gaze and Situated Interaction
Armed with this distinction between perception of gaze in 2D and 3D displays, we now turn to how communicative gaze requirements are met by the two system types. Situated interaction requires a shared perception of spatial properties:
where interlocutors and objects are placed, in which direction a speaker or listener turns, and at what the interlocutors are looking. Accurate gaze perception is crucial, but plays different roles in different kinds of communication, for example between co-located interlocutors, between humans in avatar- or video-mediated human-human communication, and between humans and ECAs or robots in spoken dialogue systems. We propose that it is useful to talk about three levels of gaze faithfulness, as follows. We define the observer as the entity perceiving gaze and a target point as an absolute position in the observer's space.
– Mutual Gaze. When the observer is the gaze target, the observer correctly perceives this. When the observer is not the gaze target, the observer correctly perceives this. In other words, the observer can correctly answer the question: Does she look me in the eye?
– Relative Gaze. There is a direct and linear mapping between the intended angle of the gaze relative to the observer and the observer's perception of that angle. In other words, the observer can correctly answer the question: How much to the left of/to the right of/above/below me is she looking?
– Absolute Gaze. A one-to-one mapping is correctly preserved between the intended target point of gaze and the observer's perception of that target point. In other words, the observer can accurately answer the question: At what exactly is she looking?
Whether a system can produce faithful gaze or not depends largely on four parameters. Two of these represent system capabilities: the type of display used, limited here to whether the system produces gaze on a 2D surface or on a 3D surface, and whether the system knows where relevant objects (including the interlocutor's head and eyes) are in physical space (e.g. through automatic object tracking or with the help of manual guidance). A special case of the second capability is the ability to know only where the head of the interlocutor is. The remaining two have to do with the requirements of the application: the first is what level of faithfulness is needed, as discussed above, and the second is whether the system is to interact with one or many interlocutors at the same time. We start by examining single-user systems with a traditional 2D display without object tracking. These systems are faithful in terms of mutual gaze: no matter where in the room the observer is, the system can look straight ahead to achieve mutual gaze and anywhere else to avoid it. They are faithful in terms of relative gaze: regardless of where in the room the observer is, the system can look to the left and be perceived as looking to the right of the observer, and so on. And they are unrealistic in terms of absolute gaze: the system can only be perceived as looking at target objects other than the observer by pure luck. Next, we note that single-user systems with a traditional 2D display with object tracking are generally the same as those without object tracking. It is possible, however, that the object tracking can help absolute gaze faithfulness, but it requires a fairly complex transformation involving targeting the objects in terms of angles relative to the observer. If the objects are targeted in absolute terms, the observer will not perceive gaze targets as intended.
Fig. 8. Faithful (+) or unrealistic (-) gaze behaviour under different system capabilities and application requirements. +* signifies that although faithfulness is most likely possible, it involves unsolved issues and additional transformations that are likely to cause complications.
Multi-user systems with a traditional 2D display and no object tracking perform poorly. They are unrealistic in terms of mutual gaze, as either all or none of the observers will perceive mutual gaze; they are unrealistic with respect to relative gaze, as all observers will perceive the gaze to be directed at the same angle relative to themselves; and they are unrealistic in terms of absolute gaze as well. Multi-user systems with a traditional 2D display and object tracking perform exactly as poorly as those without object tracking: regardless of any attempt to use the object tracking to help absolute faithfulness by transforming target positions into relative terms, all observers will perceive the same angle in relation to themselves, and at best only one will perceive the intended position. Turning to the 3D projection surface systems, both single- and multi-user systems with a 3D projection surface and no object tracking are unrealistic in terms of mutual gaze, relative gaze, and absolute gaze: without knowing where to direct its gaze in real space, such a system is lost. By adding head tracking, the systems can produce faithful mutual gaze, and single-user systems with head tracking can attempt faithful relative gaze by shifting the gaze angle relative to the observer's head. In contrast, both single- and multi-user systems with a 3D projection surface and object tracking, coupling the ability to know where objects and observers are with the ability to target any position, are faithful in terms of all of mutual gaze, relative gaze, and absolute gaze. Figure 8 presents an overview of how meeting the three levels of faithfulness depends on system capabilities and application requirements. Examining the table in the figure, we first note that in applications where more than one
participant is involved, using a 2D projection surface will result in a system that is unrealistic on all levels (lower left quadrant of the table), and secondly, that a system with a 3D projection surface and object tracking will provide faithful eye gaze regardless of application requirements (rightmost column). These are the perhaps unsurprising results of the Mona Lisa gaze effect being in place in the first case, causing the gaze perception of everyone in the room to be the same, and of mimicking, in the second case, the conditions under which a situated human interacts, with a physical presence in space and full perception of the environment and one's relation to it. Thirdly, we note that if no automatic or manual object or head tracking is available, the 3D projection surface is unrealistic in all conditions, as it requires information on where in the room to direct its gaze, and that head-only tracking improves the situation to some extent. Fourthly, and more interestingly, we note that in single-user cases where no object tracking, or head tracking only, is available, the 2D surface is the most faithful one (upper left quadrant). In these cases, we can tame and harness the Mona Lisa gaze effect and make it work for us. This suggests that gaze experiments such as those described in [20] and [21] could not have been performed with a 3D projection surface unless sophisticated head trackers had been employed. In summary, it is worthwhile to have a clear view of the requirements of the application or investigation before designing the system. In some cases (i.e. single-user cases with no need for absolute gaze faithfulness), a simpler 2D display system without any tracking can give results similar to a more complex 3D projection surface system with head or object tracking facilities, at considerably lower cost and effort. On the other hand, if we are to study situated interaction with objects and multiple participants, we need to guarantee successful delivery of gaze at all levels, with a 3D projection surface that inhibits the Mona Lisa gaze effect and reliable object tracking, manual or automatic, to direct the gaze.
6 Applications and Discussions
As we have seen, the Mona Lisa gaze effect is highly undesirable in several communicative setups due to the manner in which it limits our ability to control gaze target perception. We have also seen that under certain circumstances the effect, a cognitive ability to perceive a depicted scene from the point of view of the camera or painter, can be harnessed to allow us to build relatively simple applications which would otherwise have required much more effort. A hugely successful example is the use of TV screens and movie theaters, where entire audiences perceive the same scene independently of where they are seated. If this were not the case, the film and TV industries might well have been less successful. There are also situations where an ECA can benefit either from establishing eye contact with all viewers simultaneously in a multiparty situation, as when delivering a message or taking the role of, for example, a weather presenter, or from establishing eye contact with one person whose position in
the room is unknown to the ECA, as is the case in most spoken dialogue system experiments involving an ECA to date. Although the Mona Lisa gaze effect can be exploited in some cases, it is an obstacle to be overcome in the majority of interaction scenarios, such as those where gaze is required to point exclusively to objects in the physical 3D space of the observer, or where multiple observers are involved in anything but the most basic interactions. In order to do controlled experiments investigating gaze in situated multiparty dialogues, the Mona Lisa gaze effect must be overcome, and we can do this readily using the proposed technique. In other words, the technique opens possibilities for many applications which require absolute gaze perception and would not have been possible with the use of a 2D display. In the following we present a short list of application families that we have recently begun to explore in the situated interaction domain, all of which require the levels of gaze perception afforded by 3D projection surfaces. The first family of applications is situated and multiparty dialogue with ECAs or social conversational robots. These systems need to be able to switch their attention among the different dialogue partners while keeping the partners informed about the status of the dialogue and who is being addressed, and exclusive eye contact with single subjects is crucial for selecting an addressee. In such scenarios, a coherently shared and absolute perception of gaze targets is needed to achieve a smooth, human-like dialogue flow, a requirement that cannot be met unless the Mona Lisa gaze effect is eliminated. The second family involves any application where there is a need for a pointing device to point at objects in real space, the space of the human participant. Gaze is a powerful pointing device that can point from virtual space to real space while being completely non-mechanical, as opposed to, for example, fingers or arrows, and it is non-intrusive and subtle. A third family of applications is mediated interaction and tele-presence. A typical application in this family is video conferencing. In a traditional system, the remote partner cannot meaningfully gaze into the environment of the other partners, since the remote partner is presented through a 2D display subject to the Mona Lisa gaze effect. Establishing a one-to-one interaction through mutual gaze cannot be done, as there is no way to establish exclusive eye contact. In addition, people look at the video presenting the other partners instead of looking into the camera, which is another obstacle to shared attention and mutual gaze, and no one can reliably estimate what the remote participant is looking at. If a 3D head is used to represent the remote subject, who is represented through mediation as an avatar, these limitations of video conferencing can, at least partially, be resolved.
7 Conclusions
To sum up, we have proposed two ways of taming Mona Lisa: firstly by eliminating the effect and secondly by harnessing and exploiting it.
En route to this conclusion, we have proposed an affordable way of eliminating the effect by projecting an animated talking head onto a 3D projection surface, a generic physical 3D model of a human head, and verified experimentally that it allows subjects to perceive gaze targets in the room clearly from various viewing angles, meaning that the Mona Lisa effect is eliminated. In the experiment, the 3D projection surface was contrasted with a 2D projection surface, which clearly displayed the Mona Lisa gaze effect. In addition to eliminating the Mona Lisa gaze effect, the 3D setup allowed observers to perceive with very high agreement who was being looked at; the 2D setup showed no such agreement. We showed how the data serves to estimate a gaze psychometric function mapping actual gaze targets into eyeball rotation values in the animated talking head. Based on the experimental data and the working model, we proposed three levels of gaze faithfulness relevant to applications using gaze: mutual gaze faithfulness, relative gaze faithfulness, and absolute gaze faithfulness. We further suggested that whether a system achieves gaze faithfulness depends on several system capabilities (whether the system uses a 2D display or the proposed 3D projection surface, and whether it has some means of knowing where objects and interlocutors are), but also on the application requirements (whether the system is required to speak to more than one person at a time, and the level of gaze faithfulness it requires). One of the implications of this is that the Mona Lisa gaze effect can be exploited and put to work for us in some types of applications. Although perhaps obvious, this falls out neatly from the working model. Another implication is that the only way to robustly achieve all three levels of gaze faithfulness is to have some means of tracking objects in the room and to use an appropriate 3D projection surface. However, without knowledge of objects' positions, the 3D projection surface falls short. We close by discussing the benefits of 3D projection surfaces in terms of human-robot interaction, where the technique can be used to create faces for robotic heads with a high degree of human-likeness, better design flexibility, more sustainable animation, low weight and noise levels, and lower maintenance costs, and by discussing in some detail a few application types and research areas where the elimination of the Mona Lisa gaze effect through the use of 3D projection surfaces is particularly useful, such as when dealing with situated interaction or multiple interlocutors. We consider this work to be a stepping stone for several future investigations and studies into the role and employment of gaze in human-robot, human-ECA, and human-human mediated interaction. Acknowledgments. This work has been partly funded by the EU project IURO (Interactive Urban Robot), FP7-ICT-248314. The authors would like to thank the five subjects for participating in the experiment.
References
1. Beskow, J., Edlund, J., Granström, B., Gustafson, J., House, D.: Face-to-face interaction and the KTH Cooking Show. In: Esposito, A., Campbell, N., Vogel, C., Hussain, A., Nijholt, A. (eds.) Development of Multimodal Interfaces: Active Listening and Synchrony, pp. 157–168. Springer, Heidelberg (2010)
2. Ruttkay, Z., Pelachaud, C. (eds.): From Brows till Trust: Evaluating Embodied Conversational Agents. Kluwer, Dordrecht (2004)
3. Pelachaud, C.: Modeling Multimodal Expression of Emotion in a Virtual Agent. Philosophical Transactions of the Royal Society B: Biological Sciences 364, 3539–3548 (2009)
4. Granström, B., House, D.: Modeling and evaluating verbal and non-verbal communication in talking animated interface agents. In: Dybkjær, L., Hemsen, H., Minker, W. (eds.) Evaluation of Text and Speech Systems, pp. 65–98. Springer, Heidelberg (2007)
5. Shinozawa, K., Naya, F., Yamato, J., Kogure, K.: Differences in effect of robot and screen agent recommendations on human decision-making. International Journal of Human Computer Studies 62(2), 267–279 (2005)
6. Todorović, D.: Geometrical basis of perception of gaze direction. Vision Research 45(21), 3549–3562 (2006)
7. Gockley, R., Simmons, J., Wang, D., Busquets, C., DiSalvo, K., Caffrey, S., Rosenthal, J., Mink, S., Thomas, W., Adams, T., Lauducci, M., Bugajska, D., Perzanowski, Schultz, A.: Grace and George: Social Robots at AAAI. In: Proceedings of AAAI 2004, Mobile Robot Competition Workshop, pp. 15–20. AAAI Press, Menlo Park (2004)
8. Sosnowski, S., Mayer, C., Kuehnlenz, K., Radig, B.: Mirror my emotions! Combining facial expression analysis and synthesis on a robot. In: Proceedings of the Thirty Sixth Annual Convention of the Society for the Study of Artificial Intelligence and Simulation of Behaviour, AISB 2010 (2010)
9. Raskar, R., Welch, G., Low, K.-L., Bandyopadhyay, D.: Shader lamps: animating real objects with image-based illumination. In: Proc. of the 12th Eurographics Workshop on Rendering Techniques, pp. 89–102 (2001)
10. Lincoln, P., Welch, G., Nashel, A., Ilie, A., State, A., Fuchs, H.: Animatronic shader lamps avatars. In: Proc. of the 2009 8th IEEE International Symposium on Mixed and Augmented Reality (ISMAR 2009). IEEE Computer Society, Washington, DC (2009)
11. Beskow, J.: Talking heads – Models and applications for multimodal speech synthesis. Doctoral dissertation, KTH (2003)
12. Parke, F.I.: Parameterized Models for Facial Animation. IEEE Computer Graphics and Applications 2(9), 61–68 (1982)
13. Kendon, A.: Some functions of gaze direction in social interaction. Acta Psychologica 26, 22–63 (1967)
14. Argyle, M., Cook, M.: Gaze and mutual gaze. Cambridge University Press, Cambridge (1976) ISBN: 978-0521208659
15. Kleinke, C.L.: Gaze and eye contact: a research review. Psychological Bulletin 100, 78–100 (1986)
16. Takeuchi, A., Nagao, K.: Communicative facial displays as a new conversational modality. In: Proc. of the INTERACT 1993 and CHI 1993 Conference on Human Factors in Computing Systems (1993)
17. Bilvi, M., Pelachaud, C.: Communicative and statistical eye gaze predictions. In: Proc. of International Conference on Autonomous Agents and Multi-Agent Systems, Melbourne, Australia (2003)
18. Gregory, R.: Eye and Brain: The Psychology of Seeing. Princeton University Press, Princeton (1997)
19. Delaunay, F., de Greeff, J., Belpaeme, T.: A study of a retro-projected robotic face and its effectiveness for gaze reading by humans. In: Proc. of the 5th ACM/IEEE International Conference on Human-Robot Interaction, pp. 39–44. ACM, New York (2010)
20. Edlund, J., Nordstrand, M.: Turn-taking gestures and hour-glasses in a multimodal dialogue system. In: Proc. of ISCA Workshop on Multi-Modal Dialogue in Mobile Environments, Kloster Irsee, Germany (2002)
21. Edlund, J., Beskow, J.: MushyPeek – a framework for online investigation of audiovisual dialogue phenomena. Language and Speech 52(2-3), 351–367 (2009)
RANSAC-Based Training Data Selection on Spectral Features for Emotion Recognition from Spontaneous Speech
Elif Bozkurt1, Engin Erzin1, Çiğdem Eroğlu Erdem2, and A. Tanju Erdem3
1 Multimedia, Vision and Graphics Laboratory, College of Engineering, Koç University, 34450 Sariyer, Istanbul, Turkey
{ebozkurt,eerzin}@ku.edu.tr
2 Department of Electrical and Electronics Engineering, Bahçeşehir University, 34349 Beşiktaş, Istanbul, Turkey
[email protected]
3 Department of Electrical and Electronics Engineering, Özyeğin University, 34662 Üsküdar, Istanbul, Turkey
[email protected]
Abstract. Training datasets containing spontaneous emotional speech are often imperfect due to the ambiguities and difficulties of labeling such data by human observers. In this paper, we present a Random Sampling Consensus (RANSAC) based training approach for the problem of emotion recognition from spontaneous speech recordings. Our motivation is to insert a data cleaning process into the training phase of the Hidden Markov Models (HMMs) for the purpose of removing some suspicious instances of labels that may exist in the training dataset. Our experiments using HMMs with Mel Frequency Cepstral Coefficients (MFCC) and Line Spectral Frequency (LSF) features indicate that utilization of RANSAC in the training phase provides an improvement in the unweighted recall rates on the test set. Experimental studies performed over the FAU Aibo Emotion Corpus demonstrate that decision fusion configurations with LSF and MFCC based classifiers provide further significant performance improvements. Keywords: Affect recognition, emotional speech classification, RANSAC, data cleaning, decision fusion.
1 Introduction
For supervised pattern recognition problems such as emotion recognition from spontaneous speech, large training sets need to be recorded and labeled to be used for the training of the classifier. The labeling of large training datasets is a tedious job, carried out by humans and hence prone to human mistakes. The mislabeled (or noisy) examples of the training data may result in a decrease in classifier performance. It is not easy to identify these contaminations or imperfections of the training data, since they may also be hard-to-learn examples.
In that respect, pointing out troublesome examples is a chicken-and-egg problem, since good classifiers are needed to tell which examples are noisy [1]. Spectral features play an important role in emotion recognition. The dynamics of the vocal tract can potentially change under different emotional states; hence the spectral characteristics of speech differ for various emotions [14]. The utterance-level statistics of spectral features have been widely used in speech emotion recognition and have demonstrated considerable success [13], [12]. In this work, we assume that outliers in the training set of emotional speech recordings mainly result from mislabeled or ambiguous data. Our goal is to remove such noisy samples from the training set to increase the performance of Hidden Markov Model based classifiers modeling spectral features.
1.1 Previous Work
Previous research on data cleaning, which is also called data pruning or decontamination of training data, shows that removing noisy samples is worthwhile [1] [2] [3]. Guyon et al. [9] have studied data cleaning in the context of discovering informative patterns in large databases. They mention that informative patterns are often intermixed with unwanted outliers, which are errors introduced non-intentionally into the database. Informative patterns correspond to atypical or ambiguous data and are pointed out as the most "surprising" ones. On the other hand, garbage patterns, which correspond to meaningless or mislabeled patterns, are also surprising. The authors point out that automatically cleaning the data by eliminating patterns with suspiciously large information gain may result in the loss of valuable informative patterns. Therefore they propose a user-interactive method for cleaning a database of hand-written images, where a human operator checks the patterns that have the largest information gain and are therefore the most suspicious. Barandela and Gasca [2] report a cleaning process that removes suspicious instances of the training set, or corrects their class labels and keeps them in the training set. Their method is based on the Nearest Neighbor classifier. Wang et al. [22] present a method to sample large and noisy multimedia data. Their method is based on a simple distance measure that compares the histograms of the sample set and the whole set in order to assess the representativeness of the sample set. The proposed method deals with noise in an elegant way, and has been shown to be superior to the simple random sample (SRS) method [8][16]. Angelova et al. [1] present a fully automatic algorithm for data pruning, and demonstrate its success for the problem of face recognition. They show that data pruning can improve the generalization performance of classifiers. Their algorithm has two components: the first component consists of multiple semi-independent classifiers learned on the input data, where each classifier concentrates on different aspects, and the second component is a probabilistic reasoning machine for identifying examples which are in contradiction with most learners and are therefore noisy.
There are also other approaches for learning with noisy data, based on regularization [17] or on averaging the decisions of several functions, such as bagging [4]. However, these methods are not successful in high-noise cases.
1.2 Contribution and Outline of the Paper
In this paper, we propose an algorithm for automatic noise elimination from training data using Random Sample Consensus. RANSAC is a paradigm for fitting a model to noisy data and is utilized in many computer vision problems [21]. RANSAC performs multiple trials of selecting small subsets of the data to estimate the model. The final solution is the model with maximal support from the training data. The method is robust to considerable noise. In this paper, we adopt RANSAC for training HMMs for the purpose of emotion recognition from spontaneous emotional speech. To the best of our knowledge, RANSAC has not been used before for cleaning an emotional speech database. The outline of the paper is as follows. In Section 2, background information is provided describing the spontaneous speech corpus and the well-known RANSAC algorithm. In Section 3, the proposed method is described, including the speech features, the Hidden Markov Model, the RANSAC-based HMM fitting approach and the decision fusion method. In Section 4, our experimental results are provided, which is followed by conclusions and future work given in Section 5.
2 Background
2.1 The Spontaneous Speech Corpus
The FAU AIBO corpus is used in this study [19]. The corpus consists of spontaneous, German and emotionally colored recordings of children interacting with Sony's pet robot Aibo. The data was collected from 51 children and consisted of 48,401 words. Each word was annotated independently of the others as neutral or as belonging to one of the ten other classes, which are named: joyful (101 words), surprised (0), emphatic (2,528), helpless (3), touchy (i.e., irritated) (225), angry (84), motherese (1,260), bored (11), reprimanding (310), rest (i.e., non-neutral but not belonging to the other categories) (3), neutral (39,169); there were also 4,707 words not annotated since they did not satisfy the majority vote rule used in the labeling procedure. Five labelers were involved in the annotation process, and a majority vote approach was used to decide on the final label of a word, i.e., if at least three labelers agreed on a label, the label was attributed to the word. As we can see from the above numbers, for 4,707 of the words the five listeners could not agree on a label. Therefore, we can say that labeling spontaneous speech data into emotion classes is not an easy task, since emotions are not easily categorized and an utterance may even contain a mixture of more than one emotion. This implies that the labels of the training data may be imperfect, which may adversely affect the recognition performance of the trained pattern classifiers.
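As an illustration of the majority-vote labeling scheme just described, the short sketch below aggregates five hypothetical per-word annotations; the function name and the data representation are our own assumptions and are not part of the FAU AIBO tooling.

```python
from collections import Counter

def majority_vote(labels, min_agreement=3):
    """Return the majority label if at least `min_agreement` annotators
    agree on it, otherwise None (the word is left unannotated)."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agreement else None

# Hypothetical examples: five annotators label one word each time.
print(majority_vote(["neutral", "neutral", "emphatic", "neutral", "touchy"]))   # -> "neutral"
print(majority_vote(["angry", "touchy", "emphatic", "neutral", "motherese"]))   # -> None
```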
In the INTERSPEECH 2009 emotion challenge, the FAU AIBO dataset was segmented into manually defined chunks consisting of one or more words, since that was found to be the best unit of analysis [19], [20]. A total of 18,216 chunks were used for the challenge, and the emotions were grouped into five classes, namely: Anger (including angry, touchy, and reprimanding classes) (1,492), Emphatic (3,601), Neutral (10,967), Positive (including motherese and joyful) (889), and Rest (1,267). The data is highly unbalanced. Since the data was collected at two different schools, speaker independence is guaranteed by using the data of one school for training and the data of the other school for testing. This dataset is used in the experiments of this study.
2.2 The RANSAC Algorithm
Random Sample Consensus is a method for fitting a model to noisy data [7]. RANSAC remains robust even when a significant fraction of the data is erroneous. The main idea is to identify the outliers as the data samples with the greatest residuals with respect to the fitted model; these can be excluded and the model recomputed. The steps of the general RANSAC algorithm are as follows [21], [7]:
1. Suppose we have n training data samples X = {x1, x2, ..., xn} to which we hope to fit a model determined by (at least) m samples (m ≤ n).
2. Set an iteration counter k = 1.
3. Choose at random m items from X and compute a model.
4. For some tolerance ε, determine how many elements of X are within ε of the derived model. If this number exceeds a threshold t, re-compute the model over this consensus set and stop.
5. Set k = k + 1. If k < K for some predetermined K, go to step 3. Otherwise, accept the model with the biggest consensus set so far, or fail.
There are possible improvements to this algorithm [21], [7]. The random subset selection may be improved if we have prior knowledge of the data and its properties, that is, if some samples are more likely to fit a correct model than others. There are three parameters that need to be chosen:
• ε, the acceptable deviation from a good model. It might be determined empirically by fitting a model to m points, measuring the deviations, and setting ε to some number of standard deviations above the mean error.
• t, the size of the consensus set. This parameter serves two purposes: it must represent enough sample points for a sufficient model, and enough samples to refine the model to the final best estimate. For the first point, a value of t satisfying t − m > 5 has been suggested [7].
• K, the maximum number of iterations to run the algorithm while searching for a satisfactory fit. Values of K = 2ω^(−m) or K = 3ω^(−m) have been argued to be reasonable choices [7], where ω is the probability of a randomly selected sample being within ε of the model.
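The control flow described above can be summarized in a short, model-agnostic sketch. The fit_model and residual arguments are placeholders for whatever model and error measure are being fitted; the sketch only illustrates the loop structure of the algorithm as outlined in [21], [7], not any particular implementation.

```python
import random

def ransac(data, fit_model, residual, m, eps, t, K):
    """Generic RANSAC loop over `data` (a list of samples).

    fit_model : fits a model to a list of samples and returns it
    residual  : error of one sample with respect to a model
    m         : minimal number of samples needed to fit a model (m <= len(data))
    eps       : tolerance for counting a sample as agreeing with the model
    t         : required size of the consensus set
    K         : maximum number of iterations
    """
    best_model, best_consensus = None, []
    for _ in range(K):
        subset = random.sample(data, m)                              # step 3
        model = fit_model(subset)
        consensus = [x for x in data if residual(x, model) <= eps]   # step 4
        if len(consensus) >= t:
            # Enough support: refit on the whole consensus set and stop.
            return fit_model(consensus), consensus
        if len(consensus) > len(best_consensus):
            best_model, best_consensus = model, consensus
    # Step 5 fallback: keep the model with the biggest consensus set so far.
    if best_consensus:
        return fit_model(best_consensus), best_consensus
    return best_model, best_consensus
```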
3 RANSAC-Based Data Cleaning Method
3.1 Extraction of the Speech Features
We represent spectral features of speech using mel-frequency cepstral coefficients (MFCC) and line spectral frequencies (LSF) with their first and second order derivatives.

MFCC features. Spectral features, such as mel-frequency cepstral coefficients (MFCC), are expected to model the varying nature of speech spectra under different emotions. We represent the spectral features of each analysis window of the speech data with a 13-dimensional MFCC vector consisting of energy and 12 cepstral coefficients, which will be denoted as fC.

LSF features. Line spectral frequency (LSF) decomposition was first developed by Itakura [10] for robust representation of the coefficients of linear predictive (LP) speech models. LP analysis of speech assumes that a short stationary segment of speech can be represented by a linear time-invariant all-pole filter of the form H(z) = 1/A(z), which is a p-th order model for the vocal tract. LSF decomposition refers to expressing the p-th order inverse filter A(z) in terms of two polynomials P(z) = A(z) − z^(p+1) A(z^(−1)) and Q(z) = A(z) + z^(p+1) A(z^(−1)), which are used to represent the LP filter as

    H(z) = 1/A(z) = 2/(P(z) + Q(z)).                                    (1)
The polynomials P(z) and Q(z) each have p/2 zeros on the unit circle, where the phases of the zeros are interleaved in the interval [0, π]. The phases of the p zeros from the P(z) and Q(z) polynomials form the LSF feature representation for the LP model. Extraction of LSF features, i.e., finding the p zeros of the P(z) and Q(z) polynomials, is also computationally efficient and robust. Note that the formant frequencies correspond to the zeros of A(z). Hence, P(z) and Q(z) will be close to zero at each formant frequency, which implies that neighboring LSF features will be close to each other around formant frequencies. This property relates the LSF features to the formant frequencies [15] and makes them good candidates to model emotion-related prosodic information in the speech spectra. We represent the LSF feature vector of each analysis window of speech as a p-dimensional vector fL.

Dynamic features. Temporal changes in the spectra play an important role in human perception of speech. One way to capture this information is to use dynamic features, which measure the change in the short-term spectra over time. We compute the first and second time derivatives of the thirteen-dimensional MFCC features using the following regression formula:
    ΔfC[n] = ( Σ_{k=−2}^{2} k fC[n + k] ) / ( Σ_{k=−2}^{2} k² ),                    (2)
where fC[n] is the MFCC feature vector at time frame n. Then, the extended MFCC feature vector, including the first and second order derivative features, is represented as fCΔ = [fC^T ΔfC^T ΔΔfC^T]^T, where T denotes the vector transpose. Likewise, the extended LSF feature vector including dynamic components is denoted as fLΔ.
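As a rough illustration of the feature extraction described above, the sketch below computes MFCCs with first- and second-order deltas and derives LSFs from LP coefficients by locating the unit-circle zeros of P(z) and Q(z). It assumes the librosa library and a 16 kHz sampling rate, neither of which is stated in the paper, and librosa's delta filter only approximates the ±2-frame regression of Eq. (2).

```python
import numpy as np
import librosa

def mfcc_with_deltas(wav_path, sr=16000, n_mfcc=13):
    """13 MFCCs per frame (first coefficient is energy-related), extended with
    first- and second-order deltas into a 39-dimensional vector per frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    f_c = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape (13, n_frames)
    d1 = librosa.feature.delta(f_c, width=5, order=1)       # ~ Eq. (2), +-2 frames
    d2 = librosa.feature.delta(f_c, width=5, order=2)
    return np.vstack([f_c, d1, d2]).T                       # shape (n_frames, 39)

def lsf_from_lpc(a):
    """Line spectral frequencies of an LP inverse filter with coefficients
    a = [1, a_1, ..., a_p]: the phases in (0, pi) of the unit-circle zeros of
    the symmetric/antisymmetric polynomials P(z) and Q(z)."""
    a = np.asarray(a, dtype=float)
    P = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])
    Q = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])
    roots = np.concatenate([np.roots(P), np.roots(Q)])
    phases = np.angle(roots)
    # Keep the p phases strictly inside (0, pi); trivial roots at 0 and pi are dropped.
    return np.sort(phases[(phases > 1e-6) & (phases < np.pi - 1e-6)])
```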
3.2 Emotion Classification Using Hidden Markov Models
Hidden Markov models (HMMs) have been deployed with great success in automatic speech recognition to model temporal spectral information, and they have also been used for emotion recognition [18]. We model the temporal patterns of the emotional speech utterances using HMMs. We aim to make a decision for syntactically meaningful chunks of speech, where in each segment a single emotional evidence is typically expected. Furthermore, in each speech segment the emotional evidence may exhibit temporal patterns. Hence, we employ an N-state left-to-right HMM to model each emotion class. Feature observation probability distributions are modeled by mixtures of M Gaussian density functions with diagonal covariance matrices. The structural parameters N and M are determined through a model selection method and discussed under the experimental studies. In the emotion recognition phase, the likelihood of a given speech segment is computed over the HMM of each emotion class with Viterbi decoding. The utterance is then classified as expressing the emotion which yields the highest likelihood score.
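A minimal sketch of this classification scheme is given below, assuming the hmmlearn package (the paper does not state which HMM toolkit was used). It trains one diagonal-covariance GMM–HMM per emotion class and assigns an utterance to the class with the highest log-likelihood; for N > 1 states, a left-to-right topology would additionally have to be imposed on the transition matrix, which is omitted here for brevity.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_emotion_hmms(train_data, n_states=1, n_mix=16):
    """train_data maps an emotion label to a list of utterances, each an
    array of shape (n_frames, n_features). Returns one GMM-HMM per class."""
    models = {}
    for emotion, utterances in train_data.items():
        X = np.vstack(utterances)
        lengths = [len(u) for u in utterances]
        hmm = GMMHMM(n_components=n_states, n_mix=n_mix,
                     covariance_type="diag", n_iter=20)
        hmm.fit(X, lengths)
        models[emotion] = hmm
    return models

def classify(models, utterance):
    """Pick the emotion whose HMM scores the utterance highest.
    models[e].decode(utterance)[0] would give the Viterbi path score instead
    of the forward (total) log-likelihood used here."""
    return max(models, key=lambda e: models[e].score(utterance))
```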
3.3 RANSAC-Based Training of HMM Classifiers
Our goal is to train an HMM for each of the five emotion classes in the training set (Anger, Emphatic, Positive, Neutral and Rest). For each emotion class, we want to select a training set such that the fraction of the number of inliers (consensus set) over the total number of utterances in the dataset is maximized. In order to apply the RANSAC algorithm for fitting an HMM, we need to estimate suitable values for the parameters m, ε, t, K and ω, which were defined in Section 2.2. For determining the biggest consensus set (inliers) for each of the five emotions, we use a simple HMM structure with a single state and 16 Gaussian mixtures per state. The steps of the RANSAC-based HMM training method are as follows:
1. For each of the five emotions, suppose we have n training data samples X = {x1, x2, ..., xn} to which we hope to fit a model determined by (at least) m samples (m ≤ n). Initially, we randomly select m = 320 utterances, considering that 20 utterances per Gaussian mixture are sufficient for the training process.
2. Set an iteration counter k = 1.
3. Choose at random m items from X and compute an HMM with a given number of states and Gaussian mixtures per state. Estimate the normalized likelihood values for the rest of the training set using the trained HMM.
4. Set the tolerance level to ε = μ − 1.5σ, where the mean (μ) and standard deviation (σ) are calculated over the normalized likelihood values of the initially selected m random utterances. Determine how many elements of X are within ε of the derived model. If this number exceeds a threshold t, recompute the model over this consensus set and stop.
5. Increase the iteration counter, k = k + 1. If k < K, for some predetermined K, and k < 200, go to step 3. Otherwise, accept the model with the biggest consensus set so far, or fail.
Here, we estimate K, the number of loops required for the RANSAC algorithm to converge, using the number of inliers [4]:

    K = ln(1 − p) / ln(1 − ω^m)                                    (3)

Here we set ω = mi/m, where mi is the number of inliers for iteration i, and p = 0.9 is the probability that at least one of the sets of random samples does not include an outlier.
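Putting the steps above together, the following sketch outlines the RANSAC-based selection of a consensus training set for one emotion class. The train_hmm argument stands for whatever HMM estimation routine is used (e.g., the single-state, 16-mixture model mentioned above), "normalized likelihood" is interpreted here as a per-frame log-likelihood, and ω is estimated as the inlier fraction of the full training set; these readings, and all names, are our assumptions.

```python
import math
import random
import numpy as np

def ransac_select_training_set(utterances, train_hmm, m=320, t=None,
                               p=0.9, max_iter=200):
    """Select the largest consensus set of utterances for one emotion class.

    utterances : list of per-utterance feature arrays (n_frames x n_features)
    train_hmm  : maps a list of utterances to a model exposing .score(X),
                 the log-likelihood of an utterance
    """
    n = len(utterances)
    m = min(m, n)
    t = t if t is not None else n // 2          # assumed consensus-size threshold
    best_inliers, k, K = [], 0, float("inf")
    while k < K and k < max_iter:
        subset = random.sample(utterances, m)
        model = train_hmm(subset)
        # Per-frame log-likelihoods of the random subset define the
        # tolerance eps = mu - 1.5 * sigma (step 4 above).
        subset_ll = np.array([model.score(u) / len(u) for u in subset])
        eps = subset_ll.mean() - 1.5 * subset_ll.std()
        inliers = [u for u in utterances if model.score(u) / len(u) >= eps]
        if len(inliers) >= t:
            return train_hmm(inliers), inliers
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
        # Re-estimate the required number of trials K from Eq. (3),
        # with omega taken as the inlier ratio over the full set.
        w_m = (max(len(inliers), 1) / n) ** m
        if 0.0 < w_m < 1.0:
            K = math.log(1 - p) / math.log1p(-w_m)
        k += 1
    return (train_hmm(best_inliers), best_inliers) if best_inliers else (None, [])
```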
3.4 Decision Fusion for Classification of Emotions
Decision fusion is used to compensate for possible misclassification errors of a given modality classifier with other available modalities, where the scores resulting from each unimodal classification are combined to arrive at a conclusion. Decision fusion is especially effective when the contributing modalities are not correlated and the resulting partial decisions are statistically independent. We consider a weighted summation based decision fusion technique to combine different classifiers [6] for emotion recognition. The HMM classifiers with MFCC and LSF features output likelihood scores for each emotion and utterance, which need to be normalized prior to the decision fusion process. First, for each utterance, the likelihood scores of both classifiers are mean-removed over emotions. Then, sigmoid normalization is used to map the likelihood values to the [0, 1] interval for all utterances [6]. After normalization, we have two likelihood score sets for the HMM classifiers for each emotion and utterance. Let us denote the normalized log-likelihoods of the MFCC and LSF based HMM classifiers as ρ̄_γe(C) and ρ̄_γe(L), respectively, for the emotion class e. The decision fusion then reduces to computing a single set of joint log-likelihood ratios, ρe, for each emotion class e. Assuming the two classifiers are statistically independent, we fuse them, which will be denoted by γe(C) ⊕ γe(L), by computing the weighted average of the normalized likelihood scores

    ρe = α ρ̄_γe(C) + (1 − α) ρ̄_γe(L),                                    (4)
where the parameter α is selected in the interval [0, 1] to maximize the recognition rate on the training set.
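A sketch of this fusion step is given below; the exact form of the sigmoid normalization beyond what the text specifies (mean removal over emotions followed by a logistic mapping) is our assumption.

```python
import numpy as np

def fuse_scores(ll_mfcc, ll_lsf, alpha):
    """Weighted decision fusion of two classifiers, following Eq. (4).

    ll_mfcc, ll_lsf : arrays of shape (n_utterances, n_emotions) holding the
                      log-likelihoods of the MFCC- and LSF-based HMM classifiers
    alpha           : fusion weight in [0, 1], tuned on (a subset of) the training set
    Returns the fused scores and the index of the predicted emotion per utterance.
    """
    def normalize(ll):
        ll = ll - ll.mean(axis=1, keepdims=True)   # mean removal over emotions
        return 1.0 / (1.0 + np.exp(-ll))           # sigmoid mapping to [0, 1]

    rho = alpha * normalize(ll_mfcc) + (1.0 - alpha) * normalize(ll_lsf)
    return rho, rho.argmax(axis=1)
```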
4 Experimental Results
In this section, we present our experimental results for the 5-class emotion recognition problem using the FAU Aibo speech database provided by the INTERSPEECH 2009 emotion challenge. Since the distribution of emotional classes in the database is highly unbalanced, performance is measured as the unweighted average recall (UA) rate, which is the average recall over all classes. In Tables 1 and 2, we list the UA rates for classifiers modeling MFCC and LSF features with 1-state and 2-state HMMs, with the number of Gaussian mixtures per state in the range [8, 160]. In the experiments, further increasing the number of states did not improve our results. We can see that incorporation of the RANSAC based data cleaning procedure yields an increase in the unweighted recall rates in all cases. For the MFCC feature set, the highest improvement (2.84%) is seen for the 1-state HMM with 160 Gaussian mixtures, whereas for the LSF feature set the highest improvement, 2.73%, is obtained for the 1-state HMM with 80 Gaussian mixtures.

Table 1. Unweighted recall rates (UA) for 1- and 2-state HMMs modeling MFCC features with and without RANSAC
Number of    1 state               2 states
mixtures     All-data   RANSAC     All-data   RANSAC
16           38.39      39.51      38.46      38.63
56           38.84      39.79      40.17      40.45
80           38.63      40.62      40.18      40.95
160          38.82      41.66      40.36      41.32
Table 2. Unweighted recall rates (UA) for 1- and 2-state HMMs modeling LSF features with and without RANSAC

Number of    1 state               2 states
mixtures     All-data   RANSAC     All-data   RANSAC
16           34.53      34.24      36.59      36.71
56           36.69      38.39      35.38      37.54
80           36.67      39.40      35.65      36.95
160          36.82      39.30      35.98      37.50
We also provide plots of the unweighted recall rate versus the number of Gaussian mixtures per state for 1-state and 2-state HMMs with and without RANSAC cleaning in Figures 1 and 2, for the MFCC and LSF feature sets, respectively. If we compare the curves denoted by circles and squares for the feature sets, we can say that the RANSAC based data cleaning method brings significant improvements to the emotion recognition rate.
Fig. 1. Unweighted recall rate versus number of Gaussian mixtures per state for (a) 1-state and (b) 2-state HMMs modeling MFCCΔΔ features with and without RANSAC
Comparison of the Classifiers. We would like to compare the accuracies of the HMM classifiers with and without RANSAC-based training data selection. There are various statistical tests for comparing the performances of supervised classification learning algorithms [5] [11]. McNemar's test assesses the significance of the differences in the performances of two classification algorithms that have been tested on the same testing data. McNemar's test has been shown to have a low probability of incorrectly detecting a difference when no difference exists (type I error) [5]. We performed McNemar's test to show that the improvement achieved with the proposed RANSAC-based data cleaning method, as compared to employing all the available training data, is significant. The McNemar values for the MFCC feature set modeled by 1- and 2-state HMM classifiers with 160 Gaussian mixtures per state are computed as 231.246 and 8.917, respectively. Since these values are larger than the statistical significance threshold χ²(1, 0.95) = 3.8414, we can conclude that the improvement provided by RANSAC-based cleaning is statistically significant. The McNemar values for the LSF feature set modeled by 1- and 2-state HMMs with 160 Gaussian mixtures per state are calculated as 196.564 and 22.448, respectively. Again, since these values are greater than the statistical significance threshold, we can claim that the RANSAC based classifier has better accuracy, and that the difference is statistically significant. Note that the data we fed to the RANSAC-based training data selection algorithm consisted of chunks of one or more words for which three of the five labelers agreed on the emotional content. Using five labelers may not always be possible, and if only one labeler is present, the training data is expected to be noisier. In such cases, the proposed RANSAC based training data selection algorithm has the potential to bring even higher improvements to the performance of the classifier.
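For reference, the McNemar statistic used above can be computed from paired predictions as sketched below (with the common continuity correction; whether the original computation applied it is not stated).

```python
def mcnemar_statistic(y_true, pred_a, pred_b):
    """McNemar chi-square statistic (with continuity correction) for two
    classifiers evaluated on the same test items."""
    # b: items classifier A gets right and B gets wrong; c: the opposite.
    b = sum(a == t and bb != t for t, a, bb in zip(y_true, pred_a, pred_b))
    c = sum(a != t and bb == t for t, a, bb in zip(y_true, pred_a, pred_b))
    return 0.0 if b + c == 0 else (abs(b - c) - 1) ** 2 / (b + c)

# The resulting value is compared against the chi-square threshold
# chi2(1, 0.95) = 3.8414 quoted in the text.
```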
Fig. 2. Unweighted recall rate versus number of Gaussian mixtures per state for (a) 1-state and (b) 2-state HMMs modeling LSFΔΔ features with and without RANSAC.
One drawback of the RANSAC algorithm observed during the experiments is that it is time consuming, since many random subset selections need to be tested.

Decision Fusion of the RANSAC-Based Trained Classifiers. Decision fusion of the RANSAC-based trained HMM classifiers is performed for various combinations of MFCC and LSF features. The fusion weight, α, is optimized over a subset of the training database prior to being used on the test data. The highest recall rate observed with classifier fusion is 42.22% for α = 0.84, when 1-state HMMs with 80 mixtures modeling RANSAC-cleaned MFCCs are fused with 2-state HMMs with 104 mixtures modeling RANSAC-cleaned LSF features.
5 Conclusions and Future Work
In this paper, we presented a random sampling consensus based training data selection method for the problem of emotion recognition from a spontaneous emotional speech database. The experimental results show that the proposed method is promising for HMM based emotion recognition from spontaneous speech data. In particular, we observed an improvement of up to 2.84% in the unweighted recall rates on the test set of the spontaneous FAU AIBO corpus, the significance of which has been shown by McNemar's test. Moreover, the decision fusion of the LSF features with the MFCC features resulted in improved classification rates over the state-of-the-art MFCC-only decision for the FAU Aibo database.
In order to increase the benefits of the data cleaning approach, and to decrease the training effort, the algorithm may be improved by using semi-deterministic subset selection methods. Further experimental studies are planned to include more speech features (e.g., prosodic features), more complicated HMM structures and other spontaneous datasets. Acknowledgments. This work was supported in part by the Turkish Scientific and Technical Research Council (TUBITAK) under projects 106E201, 110E056 and COST2102 action.
References
1. Angelova, A., Abu-Mostafa, Y., Perona, P.: Pruning training sets for learning of object categories. In: Proc. Int. Conf. on Computer Vision and Pattern Recognition, CVPR (2005)
2. Barandela, R., Gasca, E.: Decontamination of training samples for supervised pattern recognition methods. In: Amin, A., Pudil, P., Ferri, F., Iñesta, J.M. (eds.) SPR 2000 and SSPR 2000. LNCS, vol. 1876, pp. 621–630. Springer, Heidelberg (2000)
3. Ben-Gal, I.: Outlier Detection. In: Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers. Kluwer Academic Publishers, Dordrecht (2005)
4. Breiman, L.: Bagging predictors. Machine Learning 24, 123–140 (1996)
5. Dietterich, T.G.: Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation 7, 1895–1924 (1998)
6. Erzin, E., Yemez, Y., Tekalp, A.M.: Multimodal speaker identification using an adaptive classifier cascade based on modality reliability. IEEE Transactions on Multimedia 7(5), 840–852 (2005)
7. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Graphics and Image Processing 24 (1981)
8. Gu, B., Hu, F., Liu, H.: Sampling and its applications in data mining: A survey. Tech. Rep., School of Computing, National University of Singapore (2000)
9. Guyon, I., Matin, N., Vapnik, V.: Discovering informative patterns and data cleaning. In: Workshop on Knowledge Discovery in Databases (1994)
10. Itakura, F.: Line spectrum representation of linear predictive coefficients of speech signals. Journal of the Acoustical Society of America 57(1), S35 (1975)
11. Kuncheva, L.I.: Combining Pattern Classifiers. John Wiley and Sons, Chichester (2004)
12. Kwon, O., Chan, K., Hao, J., Lee, T.: Emotion recognition by speech signals. In: Proc. of Eurospeech 2003, Geneva (September 2003)
13. Lee, C.M., Narayanan, S.S.: Toward detecting emotions in spoken dialogs. IEEE Transactions on Speech and Audio Processing 13, 293–303 (2005)
14. Lee, C.M., Yildirim, S., Bulut, M., Kazemzadeh, A., Busso, C., Deng, Z., Lee, S., Narayanan, S.: Emotion recognition based on phoneme classes. In: Proc. ICSLP 2004, pp. 889–892 (2004)
15. Morris, R.W., Clements, M.A.: Modification of formants in the line spectrum domain. IEEE Signal Processing Letters 9(1), 19–21 (2002)
16. Olken, F.: Random Sampling from Databases. Ph.D. Thesis, Department of Computer Science, University of California, Berkeley (1993)
17. Rätsch, G., Onoda, T., Müller, K.: Regularizing AdaBoost. Advances in Neural Information Processing Systems 11, 564–570 (2000)
18. Schuller, B., Rigoll, G., Lang, M.: Hidden Markov model based speech emotion recognition. In: Proc. Int. Conf. Acoustics, Speech and Signal Processing, ICASSP (2003)
19. Schuller, B., Steidl, S., Batliner, A.: The INTERSPEECH 2009 emotion challenge. In: Proc. Interspeech 2009, ISCA, Brighton, UK (2009)
20. Seppi, D., Batliner, A., Schuller, B., Steidl, S., Vogt, T., Wagner, J., Devillers, L., Vidrascu, L., Amir, N., Aharonson, V.: Patterns, prototypes, performance: Classifying emotional user states. In: Proc. Interspeech 2008, ISCA (2008)
21. Sonka, M., Hlavac, V., Boyle, R.: Image Processing, Analysis and Machine Vision. Thomson (2008)
22. Wang, S., Dash, M., Chia, L., Xu, M.: Efficient sampling of training set in large and noisy multimedia data. ACM Transactions on Multimedia Computing, Communications and Applications 3 (2007)
Establishing Linguistic Conventions in Task-Oriented Primeval Dialogue Martin Bachwerk and Carl Vogel Computational Linguistics Group, School of Computer Science and Statistics, Trinity College, Dublin 2, Ireland {bachwerm,vogel}@tcd.ie
Abstract. In this paper, we claim that language is likely to have emerged as a mechanism for coordinating the solution of complex tasks. To confirm this thesis, computer simulations are performed based on the coordination task presented by Garrod & Anderson (1987). The role of success in task-oriented dialogue is analytically evaluated with the help of performance measurements and a thorough lexical analysis of the emergent communication system. Simulation results confirm a strong effect of success mattering on both reliability and dispersion of linguistic conventions.
1 Introduction
In the last decade, the field of communication science has seen a major increase in the number of research programmes that go beyond the more conventional studies of human dialogue (e.g. [6,7]) in an attempt to reproduce the emergence of conventionalized communication systems in a laboratory (e.g. [4,8,10]). In his seminal paper, Galantucci proposed referring to this line of research as experimental semiotics, which he sees as a more general form of experimental pragmatics. In particular, Galantucci states that the former "studies the emergence of new forms of communication", while the latter "studies the spontaneous use of pre-existing forms of communication" (p. 394, [5]). Experimental semiotics provides a novel way of reproducing the emergence of a conventionalized communication system under laboratory conditions. However, the findings from this field cannot be transferred to the question of the primeval emergence of language without the caveat that the subjects of the present-day experiments are very much familiar with the concepts of conventions and communication systems (even if they are not allowed to employ any existing versions of these in the conducted experiments), while our ancestors who somehow managed to invent the very first conventionalized signaling system, by definition, could not have been aware of these concepts. Since experimental semiotics researchers cannot adjust the minds of their subjects in order to find out how they could discover the concept of a communication system, the most these experiments can realistically achieve is to make the subjects signal the 'signalhood' of some novel form of communication (see [13]). To go any further seems, at least for now, to require the use of computer models and simulations.
Consequently, we are interested in how a community of simulated agents can agree on a set of lexical conventions with a very limited amount of given knowledge about the notion of a communication system. In this particular paper, we address this issue by conducting several computer simulations that are meant to reconstruct the human experiments conducted by [6] and [7], which suggest that the establishment of new conventions requires at least some understanding to be experienced, for example measured by the success of the action performed in response to an utterance, and that differently organized communities can come up with communication systems of varying effectiveness. While the communities in the current experiments are in a way similar to the social structures implemented in [1], the focus here is on local coordination and the role of task-related communicative success, rather than on the effect of different higher-order group structures.
2 Modelling Approach
The experiments presented in this paper have been performed with the help of the Language Evolution Workbench (LEW) (see [16,1] for more detailed descriptions of the model). This workbench provides over 20 adjustable parameters and makes as few assumptions about the agents' cognitive skills and their awareness of the possibility of a conventionalized communication system as possible. The few cognitive skills that are assumed can be considered widely accepted (see [11,14] among others) as the minimal prerequisites for the emergence of language. These skills include the ability to observe and individuate events, the ability to engage in a joint attention frame fixed on an occurring event, and the ability to interact by constructing words and utterances from abstract symbols1 and transmitting these to one's interlocutor.2,3 During such interactions, one of the agents is assigned the intention to comment on the event, while a second agent assumes that the topic of the utterance relates in some way to the event and attempts to decode the meaning of the encountered symbols accordingly. From an evolutionary point of view, the LEW fits in with the so-called faculty of language in the narrow sense as proposed by [9], in that the agents are equipped with the sensory, intentional and concept-mapping skills at the start, and the simulations attempt to provide an insight into how these could be combined to produce a communication system with properties comparable to a human language. From a pragmatics point of view, our approach directly adopts the claim made by [12] that dialogue is the underlying form of communication. Furthermore, despite the agents in the LEW lacking any kind of embodiment, they are designed in a way that makes each agent individuate events according to
1 While we often refer to such symbols as 'phonemes' throughout the paper, there is no reason why these should not be representative of gestural signs.
2 Phenomena such as noise and loss of data during signal transmission are ignored in our approach for the sake of simplicity.
3 It is important to stress that hearers are not assumed to know the word boundaries of an encountered utterance. However, simulations with so-called synchronized transmission have been performed previously by [15].
its own perspective, which in most cases results in their situation models being initially non-aligned, thus providing the agents with the task of aligning their representations, similarly to the account presented in [12].
3 Experiment Design
In the presented experiments, we aim to reproduce the two studies originally performed by Garrod and his colleagues, but in an evolutionary simulation performed on an abstract model of communication. Our reconstruction lies in the context of a simulated dynamic system of agents, which should provide us with some insights about how Garrod's findings can be transferred to the domain of language evolution. The remainder of this section outlines the configuration of the LEW used in the present study, together with an explanation of the three manipulated parameters. The results of the corresponding simulations are then evaluated in Section 4, with special emphasis being put on the communicative potential and general linguistic properties of the emergent communication systems.4 Garrod observed in his two studies that conventions have a better chance of getting established and reused if their utilisation appears to lead to one's interlocutor understanding one's utterance, either by explicitly signaling so or by performing an adequate action. Notably, in task-based communication, interlocutors may succeed in achieving a task with or without complete mutual understanding of the surrounding dialogue. Nevertheless, our simulations have focussed on a parameter of the LEW that defines the probability that communicative success matters, psm, in an interaction. From an evolutionary point of view, this parameter is motivated by the numerous theories that put cooperation and survival as the core function of communication (e.g. [2]). However, the abstract implementation of the parameter allows us to refrain from selecting any particular evolutionary theory as the target one by generalizing over all kinds of possible success that may result from a communication bout, e.g. avoiding a predator, hunting down prey or fighting off a rival gang. The level of the parameter that defines whether success matters was varied between 0 and 1 (in steps of 0.25) in the presented simulations. To clarify the selected values of the parameter: psm = 0 means that communicative success plays no role whatsoever in the system, and psm = 1 means that only interactions satisfying a minimum success threshold will be remembered by the agents. The minimum success threshold is established by an additional parameter of the LEW and can generally be interpreted as the minimum amount of information that needs to be extracted by the hearer from an encountered utterance in order to be of any use.
4 We intentionally refrain from referring to the syntax-less communication systems that emerge in our simulations as 'language', as that would be seen as highly contentious by many readers. Furthermore, even though the term 'protolanguage' appears to be quite suited for our needs (cf. [11]), the controversial nature of that term does not really encourage its use either, prompting us to stick to more neutral expressions.
In our experiments, we varied the minimum success threshold between 0.25 and 1 (in steps of 0.25).5 The effects of this parameter will not be reported in this paper due to a lack of significance and space limitations. In addition to the above two parameters, the presented experiments also introduce two different interlocutor arrangements, similar to the studies in [6] and [7]. In the first of these, pairs of agents are partnered with each other for the whole duration of the simulation, meaning that they do not converse with any other agents at all. The second arrangement emulates the community setting introduced in [7] by successively alternating the pairings of agents, in our case after every 100 interaction 'epochs'.6 The introduction of the community setting was motivated by the hypothesis that a community of agents should be able to engage in a global coordination process, as opposed to local entrainment, resulting in more generalized and thus eventually more reliable conventions.
4 Results and Discussion
The experimental setup described above resulted in 34 different parameter combinations, for each of which 600 independent runs have been performed in order to obtain empirically reliable data. The evaluation of the data has been performed with the help of a number of measures that have been selected with the goal of describing the communicative usefulness of an evolved convention system, as well as comparing its main properties to those of languages as we know them now (see [1] for a more detailed account). In order to understand how well a communication system performs in a simulation, it is common to observe the understanding precision and recall rates, which can be combined into a single F-measure, F1 = 2 · precision · recall / (precision + recall). As can be seen from Figure 1(a), the results suggest that having a higher psm has a direct effect on the understanding rates of a community (t value between 26.68 and 210.63, p<0.0001). However, a communication setup in which agents communicate with each other in turns, as opposed to with a fixed partner, does not appear to be advantageous for the establishment of a reliable means of communication (t= −15.85, p<0.0001). Looking further, Figure 1(b) indicates that, just as observed in [7], agents operating in a community have a larger amount of variation available to them, in our case in the form of a larger lexicon (t=35.52, p<0.0001). However, unlike in the empirical study, the agents in the LEW do not benefit from this property, among other things due to the lack of an ability to enter into a negotiation about conventions to use in a given context. It is important to note at this stage that the understanding measure presented in Figure 1(a) only takes into account the interactions that have been successful, i.e. were not below the minimum success threshold in cases where success was chosen to matter. Consequently, this figure does not tell us how well the agents'
5 Setting the minimum success threshold to 0 is equivalent to having psm = 0.
6 In both cases, the agent population was set to ten and so each 'epoch' comprised ten interactions, whereby every agent would on average take part in two interactions: once as a speaker and once as a hearer.
Fig. 1. Effect of the interaction type and the probability that success matters on (a) communicative success and (b) agent lexicon size
lexicons are actually equipped to interpret a wide range of utterances. In order to evaluate the lexicons of agents without any effect that simple guessing luck might have on understanding, we take a look at two further measures: lexicon use, i.e. the average ratio of forms of an utterance that the hearer agent was able to find in its lexicon, and lexicon precision, i.e. the ratio of correct meanings found by the hearer in the cases where the agent used its lexicon for decoding a form. Furthermore, the decrease in lexicon size alone does not provide any specific information as to what exactly is happening to the agents' lexicons. In other words, further measures are required that could explain what effect the decrease actually has on the expressive and interpretative potential of a lexicon. Figure 2(a) depicts the rates of lexicon use, suggesting that with the increase of psm and the corresponding diminishing of lexicon size (t value between −40.06 and −75.26, p<0.0001), the number of forms in an agent's lexicon appears to decrease (t value between −39.81 and −78.23, p<0.0001) with a significant
Fig. 2. Effect of the interaction type and the probability that success matters on (a) lexicon use and (b) lexicon precision
Fig. 3. Effect of the interaction type and the probability that success matters on the number of (a) unique meanings and (b) unique forms in agent lexicons
effect on lexicon use (t value between −4.57 and −20.38, p<0.0001), as further confirmed by Figure 3(b). The intuition is that for higher levels of psm, wrongly guessed meanings are not being recorded in the agents' lexicons, resulting in higher-quality convention systems. This is confirmed by the increase in lexicon precision (t value between 11.63 and 101.64, p<0.0001) depicted in Figure 2(b). Interestingly enough, the decrease in the number of different forms in agents' lexicons does not seem to have a significant effect on agent lexicon synonymy across the board (p > 0.1 for psm = 0.25; yet t value between −4.64 and −28.18, p<0.0001 for higher levels of psm) (see Figure 4(a)). Presumably, the reason for this is that the drop-off in the number of distinct meanings (see Figure 3(a)) is directly proportional to that of distinct forms, which would explain the less affected synonymy and homonymy ratios (see Figure 4(b) for a plot of the latter).
Fig. 4. Effect of the interaction type and the probability that success matters on (a) agent lexicon synonymy and (b) agent lexicon homonymy
Fig. 5. Effect of the interaction type ((a) only) and the probability that success matters on (a) average mapping share and (b) ratio of mappings shared by exactly X agents
So far we have only evaluated the results of our experiments from the point of view of the agents, either by looking at their observed interaction success or by evaluating the communicative potential of their lexicons. However, as one of the main topics of the presented study was the establishment of conventions in a community of interlocutors, we should also evaluate the simulation results from the point of view of conventions, i.e. meaning-form mappings. In fact, there is a significant effect of both the community setting (t value between 3.58 and 91.14, p<0.00035) and success mattering (t=86.64, p<0.0001) on the number of agents that share a mapping on average, as depicted in Figure 5(a). This effect is broken down in Figure 5(b), in which one can see the portion of the global lexicon that is shared by any particular number of agents.7 The effects observed in the latter figure can be further described by an equation of the form $map_{share} = a \, p_{sm} \cdot b^{-n}$, whereby the ratio of shared mappings ($map_{share}$) is directly proportional to success mattering ($p_{sm}$) and inversely proportional to the number of agents (n) that are expected to know the mappings.
5
Conclusions and Future Work
In summary, experiencing a degree of success provides the all-important foundation required for establishing linguistic conventions in task-oriented dialogue and dispersing these throughout the community. The ramifications of this finding are that language is very unlikely to have emerged for the benefit of a success-agnostic activity, such as gossip (cf. [3]), but has presumably evolved as an adaptational necessity in times when human cooperation became essential. The shortcomings of the community setting can be attributed to the LEW's implementation of interactions as two autonomous activities and to the lack of success-based adjustment of mapping usage strategies. Future work should aim to improve this aspect by looking into the interactive alignment model (cf. [12]).
7 The remainder of the mappings is not shared, i.e. known by only one agent.
References 1. Bachwerk, M., Vogel, C.: Modelling Social Structures and Hierarchies in Language Evolution. In: Bramer, M., Petridis, M., Hopgood, A. (eds.) Research and Development in Intelligent Systems XXVII, pp. 49–62. Springer, Heidelberg (2010) 2. Bickerton, D.: Foraging Versus Social Intelligence in the Evolution of Protolanguage. In: Wray, A. (ed.) The Transition to Language, pp. 207–225. Oxford University Press, Oxford (2002) 3. Dunbar, R.I.M.: Grooming, Gossip and the Evolution of Language. Harvard University Press, Cambridge (1997) 4. Galantucci, B.: An Experimental Study of the Emergence of Human Communication Systems. Cognitive Science 29, 737–767 (2005) 5. Galantucci, B.: Experimental Semiotics: A New Approach for Studying Communication as a Form of Joint Action. Topics in Cognitive Science 1(2), 393–410 (2009) 6. Garrod, S., Anderson, A.: Saying what you mean in dialogue: A study in conceptual and semantic co-ordination. Cognition 27, 181–218 (1987) 7. Garrod, S., Doherty, G.: Conversation, co-ordination and convention: an empirical investigation of how groups establish linguistics conventions. Cognition 53, 181–215 (1994) 8. Garrod, S., Fay, N., Lee, J., Oberlander, J., Macleod, T.: Foundations of Representation: Where Might Graphical Symbol Systems Come From? Cognitive Science 31, 961–987 (2007) 9. Hauser, M.D., Chomsky, N., Fitch, W.T.: The Faculty of Language: What Is It, Who Has It, and How Did It Evolve? Science 298(5598), 1569–1579 (2002) 10. Healey, P.G.T., Swoboda, N., Umata, I., Katagiri, Y.: Graphical representation in graphical dialogue. International Journal of Human-Computer Studies 57, 375–395 (2002) 11. Jackendoff, R.: Possible stages in the evolution of the language capacity. Trends in Cognitive Sciences, 272–279 (1999) 12. Pickering, M.J., Garrod, S.: Toward a mechanistic psychology of dialogue. Behavioral and Brain Sciences 27, 169–190 (2004) 13. Scott-Phillips, T.C., Kirby, S., Ritchie, G.R.S.: Signalling signalhood and the emergence of communication. Cognition 113, 226–233 (2009) 14. Tomasello, M.: Constructing a Language: A Usage-Based Theory of Language Acquisition. Harvard University Press, Cambridge (2003) 15. Vogel, C.: Group Cohesion, Cooperation and Synchrony in a Social Model of Language Evolution. In: Esposito, A., Campbell, N., Vogel, C., Hussain, A., Nijholt, A. (eds.) Development of Multimodal Interfaces: Active Listening and Synchrony, pp. 16–32. Springer, Heidelberg (2010) 16. Vogel, C., Woods, J.: A Platform for Simulating Language Evolution. In: Bramer, M., Coenen, F., Tuson, A. (eds.) Research and Development in Intelligent Systems, pp. 360–373. Springer, London (2006)
Switching Between Different Ways to Think: Multiple Approaches to Affective Common Sense Reasoning
Erik Cambria1, Thomas Mazzocco2, Amir Hussain2, and Tariq Durrani2
1 National University of Singapore, Singapore
2 University of Stirling, United Kingdom
[email protected], {tma,ahu,tdu}@cs.stir.ac.uk, http://cs.stir.ac.uk/~eca/sentics
Abstract. Emotions are different Ways to Think that our mind triggers to deal with different situations we face in our lives. Our ability to reason and make decisions, in fact, is strictly dependent on both our common sense knowledge about the world and our inner emotional states. This capability, which we call affective common sense reasoning, is a fundamental component in human experience, cognition, perception, learning and communication. For this reason, emotions cannot be left out of the development of intelligent user interfaces: if we want computers to be really intelligent, not just have the veneer of intelligence, we need to give them the ability to recognize, understand and express emotions. In this work, we show how graph mining, multi-dimensionality reduction, clustering and space transformation techniques can be used on an affective common sense knowledge base to emulate the process of switching between different perspectives and finding novel ways to look at things.
Keywords: Sentic Computing, AI, Semantic Web, NLP, Cognitive and Affective Modeling, Opinion Mining and Sentiment Analysis.
1
Introduction
The affective aspect of cognition and communication is recognized to be a crucial part of human intelligence and has been argued to be more fundamental in human behavior and success in social life than intellect [1,2]. Emotions influence cognition, and therefore intelligence, especially when this involves social decision-making and interaction. Emotions are special states of our mind that have been shaped by natural selection to adjust various aspects of our organism in order to make it better face particular situations, e.g., anger evolved for reaction, fear evolved for protection and affection evolved for reproduction. That is why, when developing intelligent user interfaces, we need to enable them to recognize and understand emotions. Existing approaches to affect recognition from natural language are still mainly keyword-based and, hence, very limited. We need to stop playing with word co-occurrence frequencies and instead focus on emulating human reasoning processes.
We need to enable machines to represent knowledge and perform reasoning in many different ways so that, whenever they get stuck, they can switch among different points of view and find one that works. In particular, to bridge the cognitive and affective gap between word-level natural language data and the concept-level opinions and sentiments conveyed by them, we use different graph mining, multi-dimensionality reduction, clustering and space transformation techniques on a knowledge base obtained by merging a directed graph representation of common sense with a linguistic resource for the lexical representation of affect. The structure of the paper is as follows: Section 2 presents an overview of affect recognition from natural language, Section 3 illustrates the adopted emotion categorization model, Section 4 explains how we build the affective common sense knowledge base, Sections 5, 6, 7 and 8 show how we implement different Ways to Think, Section 9 discusses how to concurrently exploit these processes, and Section 10 presents concluding remarks and future directions.
2
Background
Existing approaches to affect recognition from natural language can be grouped into four main categories: keyword spotting, in which text is classified into categories based on the presence of fairly unambiguous affect words [3,4,5]; lexical affinity, which assigns arbitrary words a probabilistic affinity for a particular emotion [6,7]; statistical methods, which calculate the valence of affective keywords, punctuation and word co-occurrence frequencies on the basis of a large training corpus [8,9]; and sentic computing [10], a multi-disciplinary approach to opinion mining and sentiment analysis that exploits both computer and social sciences to better recognize, interpret and process emotions over the Web. The problem with the first three approaches is that they mainly rely on parts of text in which affect is explicitly expressed, such as positive terms (e.g., good, excellent, superior), negative terms (e.g., bad, poor, wrong) or verbs, adjectives and adverbs of emotion (e.g., to love/to hate, angry/pleased, happily/sadly). Opinions and sentiments, however, are more often expressed implicitly through concepts with an affective valence such as 'play a game', 'be laid off' or 'go on a first date'. In sentic computing, whose term derives from the Latin sentire (root of words such as sentiment and sentience) and sensus (intended both as capability of feeling and as common sense), the analysis of natural language is based on affective ontologies [11] and common sense reasoning tools [12], which enable the analysis of documents not only at page- or paragraph-level but also at sentence-level. In particular, sentic computing involves the use of AI and Semantic Web techniques, for knowledge representation and inference; mathematics, for carrying out tasks such as graph mining and multi-dimensionality reduction; linguistics, for discourse analysis and pragmatics; psychology, for cognitive and affective modeling; sociology, for understanding social network dynamics and social influence; and, finally, ethics, for understanding related issues about the nature of mind and the creation of emotional machines (Fig. 1).
Fig. 1. Sentic computing wheel
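To make the contrast between the four categories concrete, the following sketch shows a bare-bones keyword-spotting classifier of the kind the first category relies on; the tiny affect lexicon and the test sentence are invented for illustration and are not taken from any of the cited systems.

```python
# Bare-bones keyword spotting (illustrative only; the lexicon below is invented).
AFFECT_LEXICON = {
    "happy": "joy", "glad": "joy", "love": "joy",
    "sad": "sadness", "cry": "sadness",
    "angry": "anger", "furious": "anger",
    "afraid": "fear", "scared": "fear",
}

def keyword_spotting(text):
    """Count the affect labels triggered by unambiguous affect words in the text."""
    counts = {}
    for token in text.lower().split():
        label = AFFECT_LEXICON.get(token.strip(".,!?"))
        if label is not None:
            counts[label] = counts.get(label, 0) + 1
    return counts

print(keyword_spotting("I was so happy and glad, not sad at all!"))
# {'joy': 2, 'sadness': 1} -- the negation of 'sad' is missed entirely
```

The misread negation in the example output is precisely the kind of limitation that motivates concept-level approaches such as sentic computing.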
3
Adopted Emotion Categorization Model
Affect has been classified into six universal 'basic' categories or emotions, i.e., happiness, sadness, fear, anger, disgust and surprise [13]. A few tentative efforts to detect non-basic affective states, such as fatigue, anxiety, confusion or frustration, have also been made [14,15]. However, these categorical approaches classify emotions using a list of labels, failing to describe the complex range of emotions that can occur in daily communication. Unlike categorical approaches, the Hourglass of Emotions [16] is an affective categorization model that can potentially describe any human emotion in terms of four independent but concomitant dimensions, whose different levels of activation make up the total emotional state of the mind. The Hourglass model, in fact, is based on the idea that the mind is made of different independent resources and that emotional states result from turning some set of these resources on and turning another set of them off [17]. Each such selection changes how we think by changing our brain's activities: the state of anger, for example, appears to select a set of resources that help us react with more speed and strength while also suppressing some other resources that usually make us act prudently. The primary quantity we can measure about an emotion we feel is its strength. But, when we feel a strong emotion, it is because we feel a very specific emotion. And, conversely, we cannot feel a specific emotion like fear or amazement without that emotion being reasonably strong. Mapping this space of possible emotions leads to an hourglass shape (Fig. 2).
The Hourglass of Emotions, in particular, can be exploited in the context of human-computer interaction (HCI) to measure, respectively, how much:
1. the user is amused by interaction modalities (Pleasantness)
2. the user is interested in interaction contents (Attention)
3. the user is comfortable with interaction dynamics (Sensitivity)
4. the user is confident in interaction benefits (Aptitude)
Each affective dimension, in particular, is characterized by six levels of activation (measuring the strength of an emotion), termed 'sentic levels', which determine the intensity of the expressed/perceived emotion as an integer in [−3, 3]. These levels are also labeled as a set of 24 basic emotions [18], six for each of the affective dimensions, in a way that allows the model to specify the affective information associated with text both in a dimensional and in a discrete form. The dimensional form, in particular, is called a 'sentic vector' and it is a four-dimensional float vector that can potentially synthesize any human emotion in terms of Pleasantness, Attention, Sensitivity and Aptitude. Some particular sets of sentic vectors have special names as they specify well-known compound emotions. For example, the set of sentic vectors with a level of Pleasantness in (1, 2] (joy), null Attention, null Sensitivity and a level of Aptitude in (1, 2] (trust) are called 'love sentic vectors' since they specify the compound emotion of love.
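As a small, hedged illustration of the dimensional form, the sketch below encodes a sentic vector as four floats and checks the 'love' condition described above; the class and helper names are ours, not part of the Hourglass model itself.

```python
from dataclasses import dataclass

@dataclass
class SenticVector:
    # The four Hourglass dimensions, each a float in [-3, 3]
    pleasantness: float
    attention: float
    sensitivity: float
    aptitude: float

def is_love(v, eps=1e-9):
    """'Love' sentic vectors: Pleasantness in (1, 2] (joy), Aptitude in (1, 2] (trust),
    null Attention and null Sensitivity, as stated in the text."""
    return (1 < v.pleasantness <= 2 and 1 < v.aptitude <= 2
            and abs(v.attention) < eps and abs(v.sensitivity) < eps)

print(is_love(SenticVector(1.6, 0.0, 0.0, 1.3)))  # True
print(is_love(SenticVector(2.5, 0.0, 0.0, 1.3)))  # False: Pleasantness outside (1, 2]
```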
4
Building the Affective Common Sense Knowledge Base
When people communicate with each other, they rely on similar background knowledge, e.g., the way objects relate to each other in the world, people's goals in their daily lives and the emotional content of events or situations. This 'taken for granted' information is what we call common sense: obvious things people normally know and usually leave unstated. The Open Mind Common Sense project has been collecting this kind of knowledge from volunteers on the Internet since 2000 to provide intuition to AI systems and applications. ConceptNet (CN) [19] represents the information in the Open Mind corpus as a directed graph in which the nodes are concepts and the labeled edges are assertions of common sense that interconnect them. WordNet-Affect (WNA) [20] is a linguistic resource for the lexical representation of affective knowledge, developed starting from WordNet [21]. The knowledge base is built by assigning to a core of WordNet synsets one or more affective labels (a-labels) and then by extending this core with the relations defined in WordNet. We blend [22] CN and WNA by aligning the lemma forms of CN concepts with the lemma forms of the words in WNA so that, whenever a CN concept and a WNA entry have the same lemma form, there will be a row of the blended matrix that contains information from both CN and WNA. This way, we obtain a new matrix, A, in which common sense and affective knowledge coexist, i.e., a 14,301 × 117,365 matrix whose rows are concepts (e.g., 'dog' or 'bake cake'), whose columns are either common sense or affective features (e.g., 'isA-pet' or 'hasEmotion-joy'), and whose values indicate truth values of assertions.
Fig. 2. The Hourglass of Emotions
Therefore, in A, each concept is represented by a vector in the space of possible features whose values are positive for features that produce an assertion of positive valence (e.g., ‘a penguin is a bird’), negative for features that produce an assertion of negative valence (e.g., ‘a penguin cannot fly’) and zero when nothing is known about the assertion. The degree of similarity between two concepts, then, is the dot product between their rows in A. The value of such a dot product increases whenever two concepts are described with the same feature and decreases when they are described by features that are negations of each other.
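The following sketch illustrates this similarity measure on a toy slice of A; the concept and feature names are invented for the example, and the real matrix is far larger (14,301 × 117,365) and sparse.

```python
import numpy as np

# Toy slice of the blended matrix A: rows are concepts, columns are features,
# values are +1 (positive assertion), -1 (negated assertion) or 0 (unknown).
concepts = ["penguin", "sparrow", "dog"]
features = ["isA-bird", "CapableOf-fly", "isA-pet", "HasA-feathers"]
A = np.array([
    [1, -1, 0, 1],   # a penguin is a bird, cannot fly, has feathers
    [1,  1, 0, 1],   # a sparrow is a bird, can fly, has feathers
    [0,  0, 1, 0],   # a dog is a pet
])

def similarity(c1, c2):
    """Dot product between concept rows: shared features raise it, contradictions lower it."""
    return int(A[concepts.index(c1)] @ A[concepts.index(c2)])

print(similarity("penguin", "sparrow"))  # 1: two shared features minus one contradiction
print(similarity("penguin", "dog"))      # 0: nothing in common
```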
5
Switching Between Different Graph Seeds
One way to perform reasoning on the newly built affective common sense knowledge base is to see it as a graph and exploit its connectivity to find semantically and affectively related concepts. To this end, we use spectral association [23], a technique that involves assigning values, or activations, to ‘seed concepts’ and applying an operation that spreads their values across the graph.
This operation, an approximation of many steps of spreading activation, transfers the most activation to concepts that are connected to the key concepts by short paths or many different paths in affective common sense knowledge. In particular, we build a matrix C that relates concepts to other concepts, instead of their features, and add up the scores over all relations that relate one concept to another, disregarding direction. Applying C to a vector containing a single concept spreads that concept's value to its connected concepts. Applying $C^2$ spreads that value to concepts connected by two links (including back to the concept itself). But what we would really like is to spread the activation through any number of links, with diminishing returns, so the operator we want is:

$$1 + C + \frac{C^2}{2!} + \frac{C^3}{3!} + \cdots = e^C$$
We can calculate this odd operator, $e^C$, because we can factor C. C is already symmetric, so instead of applying Lanczos' method to $CC^T$ and getting the singular value decomposition (SVD), we can apply it directly to C and get the spectral decomposition $C = V \Lambda V^T$. As before, we can raise this expression to any power and cancel everything but the power of $\Lambda$. Therefore, $e^C = V e^{\Lambda} V^T$. This simple twist on the SVD lets us calculate spreading activation over the whole matrix instantly. We can truncate these matrices to k axes and therefore save space while generalizing from similar concepts. We can also rescale the matrix so that activation values have a maximum of 1 and do not tend to collect in highly connected concepts such as 'person', by normalizing the truncated rows of $V e^{\Lambda/2}$ to unit vectors and multiplying that matrix by its transpose to get a rescaled version of $V e^{\Lambda} V^T$. The outcomes of spectral association can be very different according to which k and which 'seed concepts' we select. Choosing different k values can be seen as developing different reasoning strategies, while choosing different seeds can be associated with changing the focus around which we develop those strategies. An option for the choice of 'seed concepts' is to use CF-IOF (concept frequency - inverse opinion frequency) [24], a technique that identifies common domain-dependent semantics in order to evaluate how important a concept is to a specific context. Firstly, the frequency of a concept c for a given domain d is calculated by counting the occurrences of the concept c in the set of available d-tagged opinions and dividing the result by the sum of the number of occurrences of all concepts in the set of opinions concerning d. This frequency is then multiplied by the logarithm of the inverse frequency of the concept in the whole collection of opinions, that is:

$$\mathrm{CF\text{-}IOF}_{c,d} = \frac{n_{c,d}}{\sum_k n_{k,d}} \, \log \frac{\sum_k n_k}{n_c}$$

where $n_{c,d}$ is the number of occurrences of concept c in the set of opinions tagged as d, $n_k$ is the total number of concept occurrences and $n_c$ is the number of occurrences of c in the whole set of opinions. A high weight in CF-IOF is reached by a high concept frequency in a given domain and a low frequency of the concept in the whole collection of opinions.
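A minimal sketch of the spreading step on a toy symmetric concept-concept matrix is given below, under the assumption that C has already been built; it follows the truncation-and-rescaling recipe described above, but the graph, the seed vector and the value of k are arbitrary.

```python
import numpy as np

def spectral_association(C, seed, k):
    """Spread activation from a seed vector using e^C = V e^Lambda V^T, truncated to the
    k largest eigenvalues and rescaled so highly connected concepts do not dominate."""
    lam, V = np.linalg.eigh(C)                 # spectral decomposition of the symmetric C
    top = np.argsort(lam)[::-1][:k]            # keep the k largest eigenvalues
    lam, V = lam[top], V[:, top]
    W = V * np.exp(lam / 2.0)                  # truncated rows of V e^{Lambda/2}
    W /= np.linalg.norm(W, axis=1, keepdims=True) + 1e-12   # normalize rows to unit vectors
    return (W @ W.T) @ seed                    # rescaled V e^Lambda V^T applied to the seed

# Toy 4-concept graph (symmetric concept-concept matrix); the seed activates concept 0.
C = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
print(spectral_association(C, np.array([1., 0., 0., 0.]), k=3))
```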
6
Switching Between Different Dimensionalities
Another way to perform reasoning on our affective common sense knowledge base is to use multi-dimensionality reduction techniques. In particular, we again use SVD in order to obtain a new matrix, which we call AffectiveSpace [25], that more efficiently summarizes hierarchical affective knowledge and common sense knowledge. The resulting matrix has the form $\tilde{A} = U_k \Sigma_k V_k^T$ and is a low-rank approximation of A, the original data. This approximation is based on minimizing the Frobenius norm of the difference between A and $\tilde{A}$ under the constraint $\mathrm{rank}(\tilde{A}) = k$. By the Eckart-Young theorem [26], it represents the best approximation of A in the least-squares sense, in fact:

$$\min_{\tilde{A}\,|\,\mathrm{rank}(\tilde{A})=k} |A - \tilde{A}| = \min_{\tilde{A}\,|\,\mathrm{rank}(\tilde{A})=k} |\Sigma - U^{*} \tilde{A} V| = \min_{\tilde{A}\,|\,\mathrm{rank}(\tilde{A})=k} |\Sigma - S|$$

assuming that $\tilde{A}$ has the form $\tilde{A} = U S V^{*}$, where S is diagonal. From the rank constraint, i.e., S has k non-zero diagonal entries, the minimum of the above statement is obtained as follows:

$$\min_{\tilde{A}\,|\,\mathrm{rank}(\tilde{A})=k} |\Sigma - S| = \min_{s_i} \sqrt{\sum_{i=1}^{n}(\sigma_i - s_i)^2} = \min_{s_i} \sqrt{\sum_{i=1}^{k}(\sigma_i - s_i)^2 + \sum_{i=k+1}^{n}\sigma_i^2} = \sqrt{\sum_{i=k+1}^{n}\sigma_i^2}$$
Therefore, $\tilde{A}$ of rank k is the best approximation of A in the Frobenius norm sense when $\sigma_i = s_i$ (i = 1, ..., k) and the corresponding singular vectors are the same as those of A. If we choose to discard all but the first k principal components, common sense concepts and emotions are represented by vectors of k coordinates: these coordinates can be seen as describing concepts in terms of 'eigenmoods' that form the axes of AffectiveSpace, i.e., the basis $e_0, \ldots, e_{k-1}$ of the vector space (Fig. 3). For example, the most significant eigenmood, $e_0$, represents concepts with positive affective valence. That is, the larger a concept's component in the $e_0$ direction is, the more affectively positive it is likely to be. Concepts with negative $e_0$ components, then, are likely to have negative affective valence. Thus, by exploiting the information sharing property of SVD, concepts with the same affective valence are likely to have similar features; that is, concepts conveying the same emotion tend to fall near each other in AffectiveSpace. Concept similarity does not depend on their absolute positions in the vector space, but rather on the angle they make with the origin.
Fig. 3. AffectiveSpace
For example, we can find concepts such as 'beautiful day', 'birthday party' and 'make person happy' very close in direction in the vector space, while concepts like 'feel guilty', 'be laid off' and 'shed tear' are found in a completely different direction (nearly opposite with respect to the center of the space). The number k of singular values we choose to build AffectiveSpace is a measure of the trade-off between precision and efficiency in the representation of our affective common sense knowledge base. Switching between different values of k can be seen as looking at the data from many different points of view. Different k values, in fact, work differently according to the affective dimension we consider, e.g., for Pleasantness the best k appears to be closer to 100, while for Sensitivity a space of about 50 dimensions appears to be enough to precisely and efficiently represent affective common sense knowledge.
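The general recipe can be sketched as follows, with a random matrix standing in for A and angular (cosine) similarity used to compare concept vectors, as suggested above; taking $U_k \Sigma_k$ as the concept coordinates is a common convention rather than a detail stated in the text.

```python
import numpy as np

def build_affective_space(A, k):
    """Project the concepts (rows of A) onto the first k principal components ('eigenmoods')."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] * s[:k]                     # k-dimensional coordinates of each concept

def angular_similarity(space, i, j):
    """Cosine of the angle the two concept vectors make with the origin."""
    a, b = space[i], space[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 500))             # random stand-in for the 14,301 x 117,365 matrix
space = build_affective_space(A, k=100)
print(space.shape)                              # (200, 100)
print(angular_similarity(space, 0, 1))          # close to +1 for affectively similar concepts
```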
7
Switching Between Different Centroids
The capability of switching among different Ways to Think can also be thought of as changing the focus around which we develop our different reasoning strategies. This can be implemented in AffectiveSpace by changing the centroids around which we cluster the vector space.
We apply a technique called Sentic Medoids [27] that adopts a k-medoids approach [28] to partition the given observations into k clusters around as many centroids, trying to minimize a given cost function. Differently from the k-means algorithm [29], which does not pose constraints on centroids, k-medoids assumes that centroids must coincide with k observed points. The most commonly used algorithm for finding the k medoids is the Partitioning Around Medoids (PAM) algorithm. The PAM algorithm determines a medoid for each cluster by selecting the most centrally located point within the cluster. After the selection of medoids, clusters are rearranged so that each point is grouped with the closest medoid. Since k-medoids clustering is an NP-hard problem [30], different approaches based on alternative optimization algorithms have been developed, though at the risk of being trapped in local minima. We use a modified version of the algorithm recently proposed by Park and Jun [31], which runs similarly to the k-means clustering algorithm and has been shown to have similar performance to the PAM algorithm while requiring significantly less computational time. In particular, we have N concepts (N = 14,301) encoded as points $x \in \mathbb{R}^p$ (p = 50). We want to group them into k clusters and, in our case, we can fix k = 24, as we are looking for one cluster for each sentic level s of the Hourglass model. Generally, the initialization of clusters for clustering algorithms is a problematic task, as the process often risks getting stuck in local optima, depending on the initial choice of centroids [32]. However, we decide to use as initial centroids the concepts that are currently used as centroids for the clusters, as they specify the emotional categories we want to organize AffectiveSpace into. For this reason, what is usually seen as a limitation of the algorithm can be seen as an advantage for this approach, since we are not looking for the 24 centroids leading to the best 24 clusters but rather for the 24 centroids identifying the required 24 sentic levels (i.e., the centroids should not be 'too far' from the ones currently used). In particular, as the Hourglass affective dimensions are independent but concomitant, we need to cluster AffectiveSpace four times, once for each dimension. According to the Hourglass categorization model, in fact, each concept can convey, at the same time, more than one emotion (which is why we get compound emotions), and this information can be expressed via a sentic vector specifying the concept's affective valence in terms of Pleasantness, Attention, Sensitivity and Aptitude. Therefore, given that the distance between two points in AffectiveSpace is defined as $D(a,b) = \sqrt{\sum_{i=1}^{p}(a_i - b_i)^2}$ (note that the choice of Euclidean distance is arbitrary), the algorithm used, applied for each of the four affective dimensions, can be summarized as follows:
1. Each centroid $C_n \in \mathbb{R}^{50}$ (n = 1, 2, ..., k) is set as one of the six concepts corresponding to each sentic level s in the current affective dimension.
2. Assign each record x to a cluster $\Xi$ so that $x_i \in \Xi_n$ if $D(x_i, C_n) \le D(x_i, C_m)$ for m = 1, 2, ..., k.
3. Find a new centroid C for each cluster $\Xi$ so that $C_j = x_i$ if $\sum_{x_m \in \Xi_j} D(x_i, x_m) \le \sum_{x_m \in \Xi_j} D(x_h, x_m)$ for all $x_h \in \Xi_j$.
4. Repeat steps 2 and 3 until no changes in the centroids are observed.
This clusterization of AffectiveSpace makes it possible to calculate, for each common sense concept x, a four-dimensional sentic vector that defines its affective valence in terms of a degree of fitness f(x), where $f_a(x) = D(x, C_j)$ with $C_j$ such that $D(x, C_j) \le D(x, C_k)$, for a = 1, 2, 3, 4 and k = 6a−5, 6a−4, ..., 6a.
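A simplified sketch of the clustering loop for a single affective dimension is shown below; it follows the four steps listed above in the Park and Jun style, but the data and the initial medoids are random stand-ins, whereas the paper fixes the initial medoids to the concepts corresponding to the six sentic levels of the current dimension.

```python
import numpy as np

def k_medoids(X, init_idx, max_iter=100):
    """Assign points to the nearest medoid, then pick as new medoid the cluster member
    that minimizes the sum of distances to the other members (steps 2-4 above)."""
    medoids = np.array(init_idx)
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)  # Euclidean distances
        labels = d.argmin(axis=1)                                           # step 2: assignment
        new_medoids = medoids.copy()
        for j in range(len(medoids)):                                       # step 3: update
            members = np.where(labels == j)[0]
            if members.size == 0:
                continue
            intra = np.linalg.norm(X[members][:, None, :] - X[members][None, :, :], axis=2)
            new_medoids[j] = members[intra.sum(axis=1).argmin()]
        if np.array_equal(new_medoids, medoids):                            # step 4: convergence
            break
        medoids = new_medoids
    return medoids

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 50))               # 500 concept vectors in a 50-dimensional space
init = rng.choice(500, size=6, replace=False)    # in the paper: the six sentic-level concepts
print(k_medoids(X, init))
```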
8
Switching Between Different Space Configurations
Yet another way to try to emulate our superfine capability to look at things from a different perspective is to apply different space transformations to AffectiveSpace. Since the distribution of the values of each vector space dimension is bell-shaped (with different centers and different degrees of dispersion around them), we investigate a different way to represent AffectiveSpace that consists in centering the values of the distribution of each dimension on the origin and in mapping dimensions according to a transformation $x \in \mathbb{R} \to x^{*} \in [-1, 1]$. We apply such a transformation because AffectiveSpace tends to have different grades of dispersion of data points across different dimensions, with some space regions more densely populated than others. The switch to a different space configuration helps to distribute data more uniformly, possibly leading to an improved (or, at least, different) reasoning process. In particular, we first apply the transformation $x_{ij} \to x_{ij} - \mu_i$, where $\mu_i$ is the average of all values of the i-th dimension. Then we normalize, combining the previous transformation with a new one, $x_{ij} \to \frac{x_{ij}}{a \cdot \sigma_i}$, where $\sigma_i$ is the standard deviation calculated on the i-th dimension and a is a coefficient that can modify the proportion of data represented within a specified interval. Finally, in order to ensure that all components of the vectors in the defined space are within [−1, 1] (i.e., that the Chebyshev distance between the origin and each vector is smaller than or equal to 1), we need to apply a final transformation $x_{ij} \to s(x_{ij})$, where s(x) is a sigmoid function. Different choices for the sigmoid function may be made, influencing how 'fast' the function approaches 1 as the independent variable approaches infinity. Combining the proposed transformations, two possible mapping functions are expressed in the following formulae (1) and (2):

$$x^{*}_{ij} = \tanh\left(\frac{x_{ij} - \mu_i}{a \cdot \sigma_i}\right) \qquad (1)$$

$$x^{*}_{ij} = \frac{x_{ij} - \mu_i}{a \cdot \sigma_i + |x_{ij} - \mu_i|} \qquad (2)$$
This space transformation leads to two main advantages, which could be of notable importance depending on the problem being tackled. First, this different space configuration ensures that each dimension is equally important by avoiding that the information provided by dimensions with higher (i.e., more distant from the origin) averages predominates. Second, normalizing according to the standard deviations of each dimension allows a more uniform distribution of data around the origin, leading to a full use of information potential.
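A brief sketch of the two mapping functions (1) and (2), applied column-wise under the assumption that rows are concepts and columns are AffectiveSpace dimensions, might look as follows; the choice of a is arbitrary here.

```python
import numpy as np

def remap_tanh(X, a=1.0):
    """Formula (1): centre each dimension, scale by a*sigma, squash into [-1, 1] with tanh."""
    mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-12
    return np.tanh((X - mu) / (a * sigma))

def remap_rational(X, a=1.0):
    """Formula (2): same centring and scaling, with a rational sigmoid instead of tanh."""
    mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-12
    centred = X - mu
    return centred / (a * sigma + np.abs(centred))

X = np.random.default_rng(2).standard_normal((1000, 50)) * 5 + 3
for f in (remap_tanh, remap_rational):
    Y = f(X, a=2.0)
    print(f.__name__, round(float(Y.min()), 3), round(float(Y.max()), 3))  # all values in (-1, 1)
```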
9
Discussion
To some extent, our reasoning capability can be traced back to the identification of useful patterns in our acquired knowledge about the world. Our experience and common sense knowledge are likely to be organized in our mind as interconnected concepts and situations, and most of these links are probably weighted by emotions, as we tend to forget, or hardly recall, memories that are not associated with any kind of positive or negative emotion. If this assumption is correct, our capability to envision possible outcomes of a decision might lie both in the capability of crawling the semantic network of concepts we have acquired through experience and in the capability of summarizing the huge amount of inputs and outputs of previous situations to find useful patterns that might work at the present time. In this work, we try to emulate the former capability by using graph mining techniques on an affective common sense knowledge base and the latter by using dimensionality reduction techniques on the same resource. Another key skill of our mind is the capability of almost instantly switching between different points of view of a problem, until we find one that best suits the present situation. In this work, we try to emulate this process by applying different graph weighting, clustering and space transformation techniques on an affective common sense knowledge base. In order to switch efficiently and promptly between these different reasoning strategies, we perform all the computations (relative to the most significant configurations) a priori and save the results in a semantic-aware format, using an approach previously adopted for building SenticNet [33]. The result is a system for affect recognition that has multiple ways to deal with natural language semantics and sentics, that is, the cognitive and affective information associated with text (Fig. 4). In particular, we use an NLP module to interpret all the affective valence indicators usually contained in text, such as special punctuation, complete upper-case words, onomatopoeic repetitions, exclamation words, negations, degree adverbs and emoticons, and finally to lemmatize the text. Then a Semantic Parser deconstructs the text into concepts using a lexicon based on 'sentic n-grams', i.e., sequences of lexemes that represent multiple-word common sense and affective concepts extracted from CN, WNA and other linguistic resources. These n-grams are not used blindly as fixed word patterns but exploited as a reference for the module in order to extract multiple-word concepts from information-rich sentences. So, differently from other shallow parsers, the module can recognize complex concepts also when irregular verbs are used or when these are interspersed with adjectives and adverbs, e.g., the concept 'buy Christmas present' in the sentence 'I bought a lot of very nice Christmas presents'. The Semantic Parser, additionally, provides, for each retrieved concept, its relative frequency, valence and status, that is, the concept's occurrence in the text, its positive or negative connotation and the degree of intensity with which the concept is expressed.
Fig. 4. Overview of the system
After extracting concepts from text, the system tries to look at them from many different points of view, that is, it switches its configuration until it finds the semantics and sentics of these concepts with good enough confidence. A preliminary evaluation of the system has been performed exploiting a set of 2000 posts from LiveJournal (http://livejournal.com), a virtual community of more than 23 million users who keep a blog, journal or diary. Evaluation results show that, although inevitably slower than the architectures built by applying each technique individually, the system is much more accurate: overall precision, in particular, is 89% while the average recall rate is 77%, for a total F-measure of 82%. Further results will be submitted elsewhere for publication.
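The concept-extraction behaviour described above could be approximated, very roughly, by matching lemmatized word sequences against a concept lexicon, as in the sketch below; the tiny lexicon and toy lemmatizer are placeholders, and the code is not the actual Semantic Parser.

```python
# Rough illustration of multiple-word concept extraction via lemmatized word sequences.
CONCEPTS = {("buy", "christmas", "present"), ("go", "first", "date"), ("be", "laid", "off")}
LEMMAS = {"bought": "buy", "presents": "present", "went": "go"}   # toy lemmatizer

def extract_concepts(sentence):
    """Match known multi-word concepts whose lemmas appear in order in the sentence,
    even when interspersed with adjectives and adverbs."""
    tokens = [LEMMAS.get(t.strip(".,!?"), t.strip(".,!?")) for t in sentence.lower().split()]
    found = []
    for concept in CONCEPTS:
        start = 0
        for word in concept:
            try:
                start = tokens.index(word, start) + 1
            except ValueError:
                break
        else:                              # every word of the concept was found, in order
            found.append(" ".join(concept))
    return found

print(extract_concepts("I bought a lot of very nice Christmas presents"))
# ['buy christmas present']
```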
10
Conclusion and Future Directions
We argue that our capability to reason and make decisions might lie both in the capability of crawling the semantic network of concepts we have acquired through experience and in the capability of summarizing the huge amount of inputs and outputs of previous situations to find useful patterns that might work at the present time. In this work, we tried to emulate the former capability by using graph mining techniques on an affective common sense knowledge base and the latter by using dimensionality reduction techniques on the same resource. Initial results are promising and pave the way for more bio-inspired approaches to the emulation of affective common sense reasoning.
Whilst this study has shown encouraging results, further research studies are now planned to investigate new dimensionality reduction strategies, such as independent component analysis and random projections, and new classification techniques, such as support and relevance vector machines and neural networks. We also plan to develop new methods to more easily, timely and efficiently switch between different Ways to Think and, hence, try to emulate our superfine capacity to change perspective and find novel ways to look at things. Acknowledgments. This work has been part-funded by the Royal Society of Edinburgh (UK), the Chinese Academy of Sciences in Beijing (P.R. China), the UK Engineering and Physical Sciences Research Council (EPSRC Grant Reference: EP/G501750/1) and Sitekit Solutions Ltd. (UK).
References 1. Vesterinen, E.: Affective Computing. In: Digital Media Research Seminar, Helsinki (2001) 2. Pantic, M.: Affective Computing. In: Encyclopedia of Multimedia Technology and Networking, vol. 1, pp. 8–14. Idea Group Reference, USA (2005) 3. Elliott, C.D.: The Affective Reasoner: A Process Model of Emotions in a MultiAgent System. PhD thesis, Northwestern University, Evanston (1992) 4. Wiebe, J., Wilson, T., Cardie, C.: Annotating Expressions of Opinions and Emotions in Language. Language Resources and Evaluation 39(2), 165–210 (2005) 5. Kim, S., Hovy, E.: Automatic Detection of Opinion Bearing Words and Sentences. In: Proceedings of IJCNLP, Jeju Island, South Korea (2005) 6. Somasundaran, S., Wiebe, J., Ruppenhofer, J.: Discourse Level Opinion Interpretation. In: Proceedings of COLING, Manchester, UK (2008) 7. Wilson, T., Wiebe, J., Hoffmann, P.: Recognizing Contextual Polarity in PhraseLevel Sentiment Analysis. In: Proceedings of HLT/EMNLP, Vancouver, CA (2005) 8. Hu, M., Liu, B., Mcguinness, D., Ferguson, G.: Mining Opinion Features in Customer Reviews. In: Proceedings of AAAI, San Jose, USA (2004) 9. Goertzel, B., Silverman, K., Hartley, C., Bugaj, S., Ross, M.: The Baby Webmind project. In: Proceedings of AISB, Birmingham, UK (2000) 10. Cambria, E., Hussain, A., Havasi, C., Eckl, C.: Sentic Computing: Exploitation of Common Sense for the Development of Emotion-Sensitive Systems. In: Esposito, A., Campbell, N., Vogel, C., Hussain, A., Nijholt, A. (eds.) Development of Multimodal Interfaces, COST Seminar 2009. LNCS, vol. 5967, pp. 148–156. Springer, Heidelberg (2010) 11. Cambria, E., Grassi, M., Hussain, A., Havasi, C.: Sentic Computing for Social Media Marketing. Multimedia Tools and Application (2011), doi:10.1007/s11042011-0815-0 12. Cambria, E., Hussain, A., Havasi, C., Eckl, C.: Common Sense Computing: from the Society of Mind to Digital Intuition and Beyond. In: Fierrez, J., Ortega-Garcia, J., Esposito, A., Drygajlo, A., Faundez-Zanuy, M. (eds.) BioID MultiComm2009. LNCS, vol. 5707, pp. 252–259. Springer, Heidelberg (2009) 13. Ekman, P., Dalgleish, T., Power, M.: Handbook of Cognition and Emotion. Wiley, Chichester (1999)
14. Kapoor, A., Burleson, W., Picard, R.: Automatic Prediction of Frustration. International Journal of Human-Computer Studies 65, 724–736 (2007) 15. Castellano, G., Kessous, L., Caridakis, G.: Multimodal Emotion Recognition from Expressive Faces, Body Gestures and Speech. In: Doctoral Consortium of ACII, Lisbon, Portugal (2007) 16. Cambria, E., Hussain, A., Havasi, C., Eckl, C.: SenticSpace: Visualizing Opinions and Sentiments in a Multi-Dimensional Vector Space. In: Setchi, R., Jordanov, I., Howlett, R.J., Jain, L.C. (eds.) KES 2010. LNCS, vol. 6279, pp. 385–393. Springer, Heidelberg (2010) 17. Minsky, M.: The Emotion Machine: Commonsense Thinking, Artificial Intelligence, and the Future of the Human Mind. Simon & Schuster, New York (2006) 18. Plutchik, R.: The Nature of Emotions. American Scientist 89(4), 344–350 (2001) 19. Havasi, C., Speer, R., Alonso, J.: ConceptNet 3: a Flexible, Multilingual Semantic Network for Common Sense Knowledge. In: Proceedings of RANLP, Borovets (2007) 20. Strapparava, C., Valitutti, A.: WordNet-Affect: an Affective Extension of WordNet. In: Proceedings of LREC, Lisbon, Portugal (2004) 21. Fellbaum, C.: WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press, Cambridge (1998) 22. Havasi, C., Speer, R., Pustejovsky, J., Lieberman, H.: Digital Intuition: Applying Common Sense Using Dimensionality Reduction. IEEE Intelligent Systems 24, 24– 35 (2009) 23. Havasi, C., Speer, R., Holmgren, J.: Automated Color Selection Using Semantic Knowledge. In: Proceedings of AAAI CSK, Arlington, USA (2010) 24. Cambria, E., Hussain, A., Durrani, T., Havasi, C., Eckl, C., Munro, J.: Sentic Computing for Patient Centered Applications. In: Proceedings of IEEE ICSP, Beijing, China (2010) 25. Cambria, E., Hussain, A., Havasi, C., Eckl, C.: AffectiveSpace: Blending Common Sense and Affective Knowledge to Perform Emotive Reasoning. In: WOMSA at CAEPIA, Seville, Spain (2009) 26. Eckart, C., Young, G.: The Approximation of One Matrix by Another of Lower Rank. Psychometrika 1(3), 211–218 (1936) 27. Cambria, E., Mazzocco, T., Hussain, A., Eckl, C.: Sentic Medoids: Organizing Affective Common Sense Knowledge in a Multi-Dimensional Vector Space. In: Liu, D. (ed.) ISNN 2011, Part III. LNCS, vol. 6677, pp. 601–610. Springer, Heidelberg (2011) 28. Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis (Wiley Series in Probability and Statistics). Wiley Interscience, Hoboken (2005) 29. Hartigan, J., Wong, M.: Algorithm AS 136: A K-Means Clustering Algorithm. Journal of the Royal Statistical Society 28(1), 100–108 (1979) 30. Garey, M., Johnson, D.: Computers and Intractability: A Guide to the Theory of NP-Completeness (Series of Books in the Mathematical Sciences). W.H. Freeman, New York (1979) 31. Park, H., Jun, C.: A Simple and Fast Algorithm for K-Medoids Clustering. Expert Systems with Applications 36(2), 3336–3341 (2009) 32. Duda, R., Hart, P.: Pattern Classification and Scene Analysis. John Wiley & Sons Inc., Chichester (1973) 33. Cambria, E., Speer, R., Havasi, C., Hussain, A.: SenticNet: A Publicly Available Semantic Resource for Opinion Mining. In: Proceedings of AAAI CSK, Arlington, USA (2010)
Efficient SNR Driven SPLICE Implementation for Robust Speech Recognition
Stefano Squartini, Emanuele Principi, Simone Cifani, Rudi Rotili, and Francesco Piazza
3MediaLabs, DIBET, Università Politecnica delle Marche, Ancona, Italy
{s.squartini,e.principi,s.cifani,r.rotili,f.piazza}@univpm.it
Abstract. The SPLICE algorithm has recently been proposed in the literature to address the robustness issue in Automatic Speech Recognition (ASR). Several variants have also been proposed to improve on some drawbacks of the original technique. In this paper an innovative, efficient solution is discussed: it is based on SNR estimation in the frequency or mel domain and investigates the possibility of using different noise types for GMM training in order to maximize the generalization capabilities of the tool and therefore the recognition performance in the presence of unknown noise sources. Computer simulations, conducted on the AURORA2 database, seem to confirm the effectiveness of the idea: the proposed approach yields accuracy performance similar to the reference one, even though it employs a simpler mismatch compensation paradigm which does not need any a-priori knowledge of the noises used in the training phase.
1
Introduction
In the last decades significant efforts have been devoted to enabling verbal interaction between humans and machines. The rationale is that speech is a natural and fast form of communication for humans and is the only feasible means of human-computer interaction in several situations, e.g. while driving. One of the key components of speech-based human machine interfaces are automatic speech recognizers (ASR). An important issue in ASR systems is the presence of acoustic non-idealities in the input speech signal, which is one of the main causes of performance degradation. Several efforts have been devoted by the scientific community to the problem. The developed solutions can be divided into two families: model-domain approaches adapt acoustic model parameters to maximize the system's matching to the distorted environment. Examples are Parallel Model Combination [1] and Vector Taylor Series [2]. Feature-domain approaches reduce the presence of noise in the feature vectors to reduce the mismatch between training and testing conditions. Examples of feature-domain approaches are single and multichannel Bayesian algorithms [3,4,5] and statistics normalization approaches [6,7,8]. SPLICE (Stereo-based Piecewise LInear Compensation for Environments) is part of the latter family and was originally proposed in [9]. It operates on
the Mel-frequency cepstral coefficients (MFCC) and learns the joint probability distribution for noisy and clean speech to map the received features into a clean estimate. SPLICE has two main limitations: first, in the training phase it needs stereo data. This problem has been addressed using discriminative approaches [10], and speech synthesis based approaches [11]. Second, SPLICE needs a model for each noisy condition, which requires a complex training phase. This paper proposes a signal to noise ratio (SNR) driven solution to the latter problem. The number of noisy conditions needed in the training phase is reduced by clustering utterances with similar SNR, and model selection is performed by means of a noise variance estimator. Experimental results conducted on the Aurora 2 database confirm the effectiveness of the solution. The outline of the paper is the following: Section 2 briefly describes the maximum a posteriori (MAP) and minimum mean squared error (MMSE) SPLICE algorithms; Section 3 gives details about the proposed SNR driven SPLICE approach; Section 4 shows recognition results obtained on the Aurora 2 database; finally, Section 5 draws conclusions and outlines future developments.
2
SPLICE Algorithm Overview
Fig. 1 shows the basic SPLICE operations. In the training phase, stereo data are used to build a model of the joint distribution of clean and noisy speech. Here, “stereo” means that for each utterance a noise free (clean) version and a noise corrupted version exist. A set of correction vectors for each SNR-noise pair is trained and stored to be used in the denoising phase. In this phase, the environmental model selection algorithm selects the most appropriate correction vectors to be used, and the “Enhancement” stage applies the correction vectors to the noisy MFCCs. The result is a clean speech estimate that will be used in the ASR decoding stage.
Fig. 1. Scheme of SPLICE training and denoising phases
Let x and y be, respectively, the clean and noisy speech MFCC vectors. One of the basic SPLICE modelling assumptions is that the distribution of y follows a mixture of Gaussians:

$$p(y) = \sum_i p(y|i)\,p(i), \quad \text{where} \qquad (1)$$

$$p(y|i) = \mathcal{N}(y; \mu_i, \Sigma_i), \qquad (2)$$

i is the mixture component index, and p(i) is the i-th mixture weight. The second modelling assumption in SPLICE is that the conditional distribution for x given the noisy speech y and the mixture component i is Gaussian. Also, the mean vector is assumed to be a shifted version of the noisy speech y:

$$p(x|y,i) = \mathcal{N}(x; y + r_i, \Gamma_i), \qquad (3)$$

where $r_i$ are the correction vectors that need to be trained using stereo data. Following these two assumptions, a MAP estimate of the clean speech MFCC vectors has been derived in [9]. The MAP estimate $\hat{x}_{\mathrm{MAP}}$ is obtained as follows:

$$\hat{x}_{\mathrm{MAP}} = \arg\max_x p(x|y) = \arg\max_x p(x,y)/p(y) \qquad (4)$$
$$= \arg\max_x p(x,y) = \arg\max_x \sum_i p(x,y|i)\,p(i) \qquad (5)$$
$$\approx \arg\max_x \arg\max_i p(x,y|i)\,p(i) \qquad (6)$$
$$= \arg\max_x \arg\max_i p(i)\,p(y|i)\,p(x|y,i) \qquad (7)$$
$$= \arg\max_x \arg\max_i \left\{ \mathcal{N}(y; \mu_i, \Sigma_i)\,p(i) \right\} \mathcal{N}(x; y + r_i, \Gamma_i). \qquad (8)$$

Equation (8) shows that $\hat{x}_{\mathrm{MAP}}$ can be obtained in two steps: first, finding the optimal mixture component

$$\hat{i} = \arg\max_i \mathcal{N}(y; \mu_i, \Sigma_i)\,p(i); \qquad (9)$$

second, calculating

$$\hat{x}_{\mathrm{MAP}} = y + r_{\hat{i}}. \qquad (10)$$

This expression is obtained by observing that the term between braces in equation (8) is independent of x, and that $\mathcal{N}(x; y + r_i, \Gamma_i)$ is maximum when x coincides with its mean value. The minimum-mean-squared-error (MMSE) estimate of x has been proposed in [12]. It can be obtained as follows:

$$\hat{x}_{\mathrm{MMSE}} = \int x\,p(x|y)\,dx = \sum_i p(i|y) \int x\,p(x|y,i)\,dx \qquad (11)$$
$$= \sum_i p(i|y)\,(y + r_i) = y + \sum_i p(i|y)\,r_i, \qquad (12)$$

where, according to Bayes' rule,

$$p(i|y) = \frac{p(y|i)\,p(i)}{\sum_i p(y|i)\,p(i)}. \qquad (13)$$

The GMM parameters of p(y) are trained for each SNR-noise pair using noisy speech data. The correction vectors $r_i$ are trained using stereo data in a maximum likelihood framework:

$$r_i = \frac{\sum_t p(i|y_t)\,(x_t - y_t)}{\sum_t p(i|y_t)}, \quad \text{where} \qquad (14)$$

$$p(i|y_t) = \frac{p(y_t|i)\,p(i)}{\sum_i p(y_t|i)\,p(i)}, \qquad (15)$$

and t is the time frame index.
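A condensed sketch of the MMSE enhancement of equations (12)-(13) is given below, assuming the GMM parameters (weights, means, diagonal covariances) and the correction vectors have already been trained on stereo data; the random parameters in the usage example are placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal

def splice_mmse(Y, weights, means, covs, R):
    """MMSE SPLICE enhancement, x_hat = y + sum_i p(i|y) r_i (equation (12)).

    Y: (T, D) noisy MFCC frames; weights: (I,); means: (I, D);
    covs: (I, D) diagonal covariances; R: (I, D) correction vectors."""
    T, I = len(Y), len(weights)
    lik = np.empty((T, I))
    for i in range(I):                     # p(y_t | i) p(i) for every frame and component
        lik[:, i] = weights[i] * multivariate_normal.pdf(Y, mean=means[i], cov=np.diag(covs[i]))
    post = lik / (lik.sum(axis=1, keepdims=True) + 1e-300)   # p(i | y_t), equation (13)
    return Y + post @ R                                      # clean speech estimate

# Toy usage with random placeholder parameters (real ones come from stereo training data).
rng = np.random.default_rng(3)
T, D, I = 100, 13, 4
Y = rng.standard_normal((T, D))
x_hat = splice_mmse(Y, np.full(I, 1.0 / I), rng.standard_normal((I, D)),
                    np.ones((I, D)), 0.1 * rng.standard_normal((I, D)))
print(x_hat.shape)   # (100, 13)
```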
2.1
Environmental Model Selection
As aforementioned and shown in Fig. 1, the distribution parameters for y are trained for each noise condition, resulting in a set of correction vectors $RM_j$ for each of them. In the denoising phase, the correction vector set to be used must be selected. This is performed by means of the environmental model selection algorithm, originally proposed in [13]. Let $RM_j$ be one of the correction vector sets obtained in the training phase. The conditional probability of $RM_j$ given the noisy speech signal y can be inferred using Bayes' rule:

$$p(RM_j | y) = p(y | RM_j)\,\frac{p(RM_j)}{p(y)}. \qquad (16)$$

If many frames are used to estimate $p(y|RM_j)$, the relative importance of $p(RM_j)$ diminishes, and it can be ignored. The term p(y) can also be ignored since it is independent of $RM_j$ and the interest is in finding the most likely correction vector set. The maximum likelihood estimate of the correction vector set is then obtained as:

$$\widehat{RM}_j = \arg\max_{RM_j} p(y | RM_j). \qquad (17)$$
Note that the term p(y|RMj ) coincides with the term p(y) of equation (1), where RMj was omitted for simplicity of notation.
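The selection rule of equation (17) can be sketched as follows: accumulate the log-likelihood of the utterance under each condition's GMM and pick the maximizer. The GMM parameters here are random placeholders rather than trained models.

```python
import numpy as np
from scipy.stats import multivariate_normal

def select_model(Y, gmms):
    """Pick the correction vector set whose GMM maximizes p(y | RM_j) over the utterance.

    Y: (T, D) noisy MFCC frames; gmms: list of (weights, means, diag_covs), one per condition."""
    scores = []
    for weights, means, covs in gmms:
        lik = np.zeros(len(Y))
        for w, m, c in zip(weights, means, covs):
            lik += w * multivariate_normal.pdf(Y, mean=m, cov=np.diag(c))
        scores.append(np.log(lik + 1e-300).sum())   # log p(y | RM_j) summed over frames
    return int(np.argmax(scores))

# Toy usage: two hypothetical noise conditions, each modelled by a two-component GMM.
rng = np.random.default_rng(4)
D = 13
gmms = [(np.array([0.5, 0.5]), rng.standard_normal((2, D)), np.ones((2, D))) for _ in range(2)]
print(select_model(rng.standard_normal((50, D)), gmms))
```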
3
SNR Driven SPLICE Approach
In the original SPLICE algorithm, a distribution p(y) and a set of correction vectors RMj is created for each SNR-noise pair. Then, in the denoising phase,
the correction vector set to be used is selected using the environmental model selection algorithm. Here, an alternative solution is proposed (Fig. 2): in the training phase, utterances with the same SNR are clustered together even though they are corrupted with different types of noise. In the denoising phase, the input signal SNR is first estimated by means of a noise variance estimator algorithm. Then, the selected correction vector set is the one created from speech utterances having SNR "closest" to the estimated one.
Fig. 2. The proposed approach
3.1
SNR Estimation
In order to estimate the signal-to-noise ratio (SNR), the noise variance is first calculated by means of a noise variance estimator algorithm. In this work, Cohen's improved minima controlled recursive averaging (IMCRA) algorithm [14] has been chosen for the purpose. The noise variance $\sigma_n^2(t,k)$ is calculated as:

$$\sigma_n^2(t,k) = \mathrm{IMCRA}\left\{ |S(t,k)|^2 \right\}. \qquad (18)$$

The SNR is then estimated with the well-known a posteriori SNR expression:

$$\widehat{\mathrm{SNR}}(t,k) = \frac{|S(t,k)|^2}{\sigma_n^2(t,k)}, \qquad (19)$$
where t is the time frame index. In the original IMCRA formulation, S(t, k) is the input signal discrete Fourier transform, and k is the frequency bin index. It has been shown [3,4] that IMCRA can be effectively applied to the mel-filterbank output of the feature extraction pipeline. In this case, S(t, k) is the energy in the k-th filter of the mel-filterbank. This solution allows a significant reduction of the computational burden, since IMCRA in the mel domain operates on about 20 samples while in the frequency domain it operates on about 200 samples.
Instead of using the per-frame SNR of equation (19), the model selection stage exploits the average utterance SNR:

$$\widehat{\mathrm{SNR}}_{\mathrm{AVG}} = \frac{1}{NB} \sum_{t=0}^{N-1} \sum_{k=0}^{B-1} \frac{|S(t,k)|^2}{\sigma_n^2(t,k)}, \qquad (20)$$
where N is the number of frames in the input utterance, and B is the number of filters.
3.2
Model Selection
After the SNR has been estimated, the selected correction vector set is the one created from utterances with SNR closest to $\widehat{\mathrm{SNR}}_{\mathrm{AVG}}$. Let S be the set of SNRs of the training utterances, and $R = \{R_j : j \in S\}$ be all the correction vector sets created in the training phase. The model selection algorithm simply selects the set $R_j$ for which $|\widehat{\mathrm{SNR}}_{\mathrm{AVG}} - j|$ is minimum.
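A sketch of equations (19)-(20) and of the nearest-SNR rule is given below; it assumes the per-frame noise variance has already been produced by a noise estimator such as IMCRA (not re-implemented here), and the conversion of the average SNR to dB before comparing it with the dB-labelled training sets is our assumption.

```python
import numpy as np

def average_snr_db(S_pow, noise_var):
    """Equation (20): mean a posteriori SNR |S(t,k)|^2 / sigma_n^2(t,k), reported here in dB."""
    snr = S_pow / (noise_var + 1e-12)
    return 10.0 * np.log10(snr.mean() + 1e-12)

def select_snr_model(snr_avg_db, trained_snrs_db):
    """Pick the correction vector set trained at the SNR closest to the estimated one."""
    return min(trained_snrs_db, key=lambda s: abs(snr_avg_db - s))

# Toy usage: mel-filterbank energies of a 120-frame, 23-band utterance with an assumed
# constant noise variance (in practice the variance comes from IMCRA).
rng = np.random.default_rng(5)
S_pow = rng.gamma(2.0, 1.0, size=(120, 23))
noise_var = np.full_like(S_pow, 0.2)
est = average_snr_db(S_pow, noise_var)
print(round(est, 1), "dB ->", select_snr_model(est, [5, 10, 15, 20]), "dB model")
```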
4
Computer Simulations
Computer simulations have been conducted on the Aurora 2 [15] database. Aurora 2 consists of a subset of TIDigits utterances downsampled to 8 kHz with added noise and channel distortion. Aurora 2 defines two training sets: a "clean" training set, composed of noise-free utterances, and a "multicondition" training set, composed of mixed clean/noisy utterances. Noisy utterances have SNR in the range 5–20 dB, and are corrupted with subway, babble, car and exhibition hall noises, resulting in a total of 17 sets. Aurora 2 defines three test sets: in test set A, utterances are filtered with the G.712 characteristic and corrupted with the same noises as the multicondition training set, but the SNR is in the range 0–20 dB. Test set B is similar to test set A, but four different noises are used. Finally, in test set C utterances are corrupted with a noise from test set A and a noise from test set B and filtered with the MIRS characteristic. Recognition has been performed by means of the Hidden Markov Model Toolkit [16]. The acoustic model structure and recognition parameters are the same as in [15]. Feature vectors are extracted from frames of 200 samples (25 ms) with an overlap of 120 samples. The final vectors are composed of 13 MFCCs (with C0 and without energy) and their first and second derivatives. Pre-emphasis and cepstral mean normalization are included in the feature extraction pipeline (Fig. 3). Recognition results are expressed as word accuracy percentages. Averages are computed on the 0–20 dB SNR range [15].
4.1
SPLICE Setup
Correction vectors have been trained against two different sets of utterances:
[Block diagram of the feature extraction stages: speech input, pre-emphasis, windowing, DFT, mel-filter bank, MFCC computation (log & DCT), CMN, and Δ/ΔΔ calculation]
Fig. 3. Feature extraction pipeline
– Aurora 2 multicondition training set.
– Noisex training set: the clean utterances of the Aurora 2 multicondition training set are here corrupted with noises from the Noisex database [17]. Four noises have been used: destroyer operations room noise, F-16 cockpit noise, tank noise and white noise.
The number of components used for modelling the noise probability density function is 192. Training has been performed by means of the expectation maximization algorithm [18]. Three different SPLICE configurations have been considered:
– SPLICE reference: a set of correction vectors is created for each combination of noise and SNR (17 sets total), and the model is selected using the environment model selection algorithm.
– SNR driven: a set of correction vectors is created for each SNR, resulting in a total of 4 sets, and the SNR is estimated with the IMCRA algorithm. Correction vectors are trained against the Aurora 2 multicondition set.
– SNR driven (Noisex): the same as "SNR driven", but correction vectors are trained against the Noisex training set.
In multicondition acoustic model tests, SPLICE is applied both to the training set used in acoustic model training and to the test sets.
4.2
Comparison between Frequency and Mel Domain SNR Estimation
As aforementioned, SNR can be estimated either in the frequency or in the mel domain. The latter choice is more efficient since it operates on 23 coefficients instead of 256. In terms of recognition accuracy, the two solutions give similar results: using the "SNR driven" configuration on the clean acoustic model, estimation in the frequency domain gives 84.92% recognition accuracy, while estimation in the mel domain gives 85.07%. Therefore, in the following experiments estimation is performed in the mel domain.
4.3
Results
Figures 4(a) and 4(b) show results obtained using the aforementioned SPLICE configurations. Results show that, on the clean acoustic model, the SNR driven approach gives accuracy values comparable to those of the SPLICE reference setup.
Similar performance is achieved also with the Noisex setup, demonstrating that SPLICE is robust with respect to a mismatch between training and testing conditions. On the multicondition acoustic model, the SNR driven approach gives a small performance boost over the other approaches, demonstrating its effectiveness. Again, the Noisex setup achieves performance similar to the SPLICE reference, confirming the aforementioned robustness.
Fig. 4. Recognition accuracies for different SPLICE configurations: (a) clean acoustic model results, (b) multicondition acoustic model results
4.4
SPLICE and Histogram Equalization
The feature extraction pipeline used in the previous experiments included cepstral mean normalization as its last stage. The better-performing histogram equalization (HEQ), in particular its efficient quantile-based variant (QBEQ) [6], can be used instead (Fig. 5). While CMN equalizes only the first moment of each MFCC, HEQ is able to equalize the whole distribution.
[Block diagram as in Fig. 3, with QBEQ replacing CMN among the feature extraction stages]
Fig. 5. Feature extraction pipeline comprising histogram equalization
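As a rough illustration of the idea, and not of the specific QBEQ algorithm of [6], the sketch below maps a handful of quantiles of each cepstral coefficient onto reference quantiles estimated from clean training data, interpolating in between.

```python
import numpy as np

def quantile_equalize(X, ref_q, probs=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Map each feature dimension of X so that its quantiles match the reference quantiles.

    X: (T, D) features of one utterance; ref_q: (len(probs), D) clean-data quantiles."""
    out = np.empty_like(X, dtype=float)
    for d in range(X.shape[1]):
        obs_q = np.quantile(X[:, d], probs)                 # quantiles of the noisy feature
        out[:, d] = np.interp(X[:, d], obs_q, ref_q[:, d])  # piecewise-linear mapping
    return out

# Toy usage: equalize a shifted, scaled 13-dimensional feature stream towards clean statistics.
rng = np.random.default_rng(6)
probs = (0.0, 0.25, 0.5, 0.75, 1.0)
clean = rng.standard_normal((5000, 13))
ref_q = np.quantile(clean, probs, axis=0)                   # (5, 13) reference quantiles
noisy = 0.7 * rng.standard_normal((300, 13)) + 1.5
print(quantile_equalize(noisy, ref_q, probs).mean(axis=0).round(2))   # means pulled back near 0
```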
Fig. 6. Results of SNR driven SPLICE combined with HEQ: (a) clean acoustic model results, (b) multicondition acoustic model results

Fig. 6 shows the obtained recognition results: it is evident from Fig. 6(a) that coupling SPLICE and HEQ does not yield significant improvements when using the clean acoustic model. Fig. 6(b) shows multicondition acoustic model experiments: here, results labelled "w/o SPLICE in AM" are obtained without applying SPLICE to the training utterances. In this case, HEQ gives an 8.81% improvement over CMN; on the contrary, when SPLICE is applied to both the training and test sets, CMN and HEQ perform similarly. These results imply that SPLICE does not need to be applied in the training phase when HEQ is present in the feature extraction pipeline.
4.5
4.5 Comparison with Different Approaches
It is interesting to show how the proposed approach performs with respect to algorithms based on the Bayesian framework and on multi-channel histogram equalization. Fig. 7 shows SNR driven SPLICE compared to two Bayesian algorithms, namely the enhanced MFCC-MMSE algorithm [5] and the multi-channel MFCC-MAP algorithm [4], and to the multi-channel quantile-based histogram equalization algorithm [7]. Results show that SNR driven SPLICE is able to obtain higher accuracies than the single- and multi-channel Bayesian approaches when using the clean acoustic model. The multi-channel histogram equalization approach is still able to outperform the single- and multi-channel Bayesian approaches, as well as SNR driven SPLICE.
Fig. 7. Comparison of SNR driven SPLICE with different approaches (E-MFCC-MMSE, SPLICE SNR driven, M-MFCC-MAP with 4 channels, QBEQ with 4 channels): (a) clean acoustic model results, (b) multicondition acoustic model results
5 Conclusions
In this paper, an innovative SNR driven approach to the SPLICE algorithm for robust speech recognition applications has been proposed. SNR estimation is performed by means of the IMCRA algorithm in the mel domain, which allows a significant reduction of the computational cost without affecting recognition performance. Histogram equalization has also been applied instead of cepstral mean normalization in conjunction with SPLICE. This solution improves recognition results and makes it unnecessary to apply SPLICE to the multicondition training set. Future work will aim to integrate the multichannel approach, so far successfully applied to Bayesian noise reduction [4] and histogram equalization techniques [7,8], into the SPLICE paradigm.
References
1. Gales, M.J.: Model Based Techniques for Noise Robust Speech Recognition. Ph.D. Thesis, Cambridge University, Cambridge (1995)
2. Moreno, P.J., Raj, B., Stern, R.M.: A Vector Taylor Series approach for environment independent speech recognition. In: IEEE ICASSP, pp. 733–736 (1996)
3. Yu, D., Deng, L., Droppo, J., Wu, J., Gong, Y., Acero, A.: Robust speech recognition using a cepstral minimum-mean-square-error-motivated noise suppressor. IEEE Trans. Audio, Speech, and Lang. Process. 16(5), 1061–1070 (2008)
4. Principi, E., Rotili, R., Cifani, S., Marinelli, L., Squartini, S., Piazza, F.: Robust speech recognition using feature-domain multi-channel Bayesian estimators. In: IEEE ISCAS, Paris, France, pp. 2670–2673 (2010)
5. Principi, E., Cifani, S., Rotili, R., Squartini, S., Piazza, F.: Comparative Evaluation of Single-Channel MMSE-Based Noise Reduction Schemes for Speech Recognition. Journal of Electrical and Computer Engineering (2010)
6. Segura, J.C., Benítez, C., de la Torre, Á., Rubio, A.J., Ramírez, J.: Cepstral Domain Segmental Nonlinear Feature Transformations for Robust Speech Recognition. IEEE Signal Process. Lett. 11(5), 517–520 (2004)
7. Squartini, S., Fagiani, M., Principi, E., Piazza, F.: Multichannel Cepstral Domain Feature Warping for Robust Speech Recognition. In: Proceedings of WIRN 2010, Vietri sul Mare, Italy (2010)
8. Rotili, R., Principi, E., Cifani, S., Piazza, F., Squartini, S.: Multichannel Feature Enhancement for Robust Speech Recognition. In: Speech Technologies / Book 1, Ivo Ipsic (ed.) (June 2011) ISBN 978-953-307-152-7
9. Deng, L., Acero, A., Plumpe, M., Huang, X.D.: Large-vocabulary speech recognition under adverse acoustic environments. ICSLP 3, 806–809 (2000)
10. Droppo, J., Acero, A.: Maximum Mutual Information SPLICE Transform for Seen and Unseen Conditions. In: Interspeech, Lisboa, Portugal, pp. 989–992 (2005)
11. Du, J., Hu, Y., Dai, L.-R., Wang, R.-H.: HMM-based pseudo-clean speech synthesis for SPLICE algorithm. In: IEEE ICASSP, Dallas, U.S.A., pp. 4570–4573 (2010)
12. Deng, L., Acero, A., Jiang, L., Droppo, J., Huang, X.D.: High-performance robust speech recognition using stereo training data. In: IEEE ICASSP, Salt Lake City, Utah, pp. 301–304 (2001)
13. Droppo, J., Deng, L., Acero, A.: Efficient on-line acoustic environment estimation for FCDCN in a continuous speech recognition system. In: ICASSP, Salt Lake City, Utah, pp. 209–212 (2001)
14. Cohen, I.: Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging. IEEE Trans. on Speech and Audio Process. 11(5), 466–475 (2003)
15. Hirsch, H.-G., Pearce, D.: The Aurora experimental framework for the performance evaluation of speech recognition systems under noise conditions. In: ISCA ITRW ASR, Paris, France (2000)
16. Young, S., Kershaw, D., Odell, J., Ollason, D., Valtchev, V., Woodland, P.: The HTK Book V2.2 (1999)
17. Varga, A., Steeneken, H.J.M., Tomlinson, M., Jones, D.: The NOISEX-92 study on the effect of additive noise on automatic speech recognition. Documentation included in the NOISEX-92 CD-ROMs (1992)
18. Bilmes, J.: A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. International Computer Science Institute (1998)
Study on Cross-Lingual Adaptation of a Czech LVCSR System towards Slovak Petr Cerva, Jan Nouza, and Jan Silovsky Institute of Information Technology and Electronics, Faculty of Mechatronics, Technical University of Liberec, Studentska 2, CZ 461 17, Liberec, Czech Republic {petr.cerva,jan.nouza,jan.silovsky}@tul.cz
Abstract. This paper deals with the cross-lingual adaptation of a Large Vocabulary Continuous Speech Recognition (LVCSR) system between two similar Slavic languages, from Czech to Slovak. The proposed adaptation scheme is performed in two consecutive phases and is focused on acoustic modeling and on phoneme and pronunciation mapping. It also utilizes language similarities between the source and the target language and speaker adaptation approaches. The presented experimental results show that the proposed cross-lingual adaptation approach yields a reduction of the Word Error Rate (WER) from 12.8% to 8.1% in the voice dictation task. Keywords: speech recognition, cross-lingual adaptation, speaker adaptation, Slavic languages.
1 Introduction

Various automatic speech recognition systems capable of processing fluent speech have recently been utilized in many practical applications all over the world. These systems serve mainly for a) fluent voice dictation and b) transcription of various spoken data streams such as broadcast news (BN). While the systems from the first group allow the creation of various text documents and reports in a natural way by voice and are used mainly (but not only) by medical doctors, judges and lawyers, the latter group of systems is usually employed for real-time subtitling and/or low-cost transcription and indexing in media monitoring companies or by individual radio/TV stations. At our laboratory, we have developed systems belonging to both of these groups for the highly inflective Czech language in recent years. All the developed systems utilize the same recognition engine, which can operate with lexicons containing several hundred thousand words [1]. The motivation for our next research was to utilize this already developed technology and to adapt it to other similar (Slavic) languages. The hypothesis we want to demonstrate is that a fully functional system created originally for the Czech language can easily be adapted for interaction with a similar language without the need to collect a large amount of new speech data. For this purpose, we selected Slovak as the first target language. There are several natural reasons for this choice:
a) Czech and Slovak belong to the same West-Slavic branch of European languages, b) most Czech people can understand Slovak and vice versa, and c) there is strong commercial interest in porting existing systems to Slovak. Our cross-lingual adaptation approach is not based on mapping individual words from the source to the target language as in [2], or on building universal phoneme models for both languages [3]. Instead, we propose a phonetic and pronunciation mapping table for individual phonemes. This table can be utilized to create new speaker independent (SI) acoustic models and corresponding pronunciation lexicons for the target language using only a very limited amount of new training acoustic data. After that, we utilize speaker adaptation (SA) methods to fine-tune this baseline model for individual speakers (users of the adapted system), whose native language is the target one. The advantage of this approach is that during the SA phase it is possible to create a full set of target phonemes which are speaker specific and not limited by the phoneme mapping table. This paper is structured as follows: differences and similarities in phonetics and lexicons between Czech and Slovak are compared briefly in the following section. Section 3 then describes the proposed cross-lingual adaptation framework in detail. Section 4 deals with the experimental evaluation and conclusions are given in Section 5.
2 Comparison of Czech and Slovak

As mentioned in the previous section, both Czech and Slovak belong to the West-Slavic branch of European languages. Moreover, both of them were official languages of one state in past decades. For these reasons, Czech and Slovak are considered very similar and closely related. Of course, several differences also exist, which are described in the following sub-sections.

2.1 Difference and Similarity in Lexicons

The measurement of dissimilarity was performed within two linguistic experiments. In the first one, we compared a set of the same documents from EU Web pages [4], which are available in Czech as well as in Slovak (such as the Treaty of EU, the Lisbon Treaty, etc.). These documents contained more than 215K words for both languages, of which 12,434 were distinct for Czech and 12,217 were distinct for Slovak. These two distinct lists contained 3K common items (i.e., 24%). In the second experiment, we compared the 300K Czech and Slovak vocabularies that were compiled for our ASR systems via the word frequency criterion. We found 59K common words (i.e., 20%). Although these results show that about 80% of the lexical inventories of Czech and Slovak are different, the real dissimilarity is not so high if we perform a more detailed comparison of corresponding orthographic forms. A lot of them differ only in one or two characters or in the suffix, due to slightly different morphological patterns. For example, this fact is demonstrated by the following comparison of the Czech and Slovak versions of a sentence saying 'The most recent COST 2102 conference was held in September.'
CZ: Zatím poslední konference COST 2102 se konala v září.
SK: Zatiaľ posledná konferencia COST 2102 sa konala v septembri.

2.2 Difference and Similarity in Phonetics

Usually, 41 individual phonemes and three additional diphthongs are distinguished in spoken Czech, while the literature offers several sets of phonemes for Slovak. For example, 57 different phonetic units are defined in [5], while only 48 phones and 4 diphthongs are recognized in [6]. For this work, we adopted the latter Slovak phonetic inventory, as it is more compatible with the Czech one. Both phonetic sets are organized in Table 1, where the distinctive phonemes are printed in bold.

Table 1. Comparison of Czech and Slovak phonetic inventory sets (in SAMPA)
Vowels – Czech: a, e, i, o, u, a:, e:, i:, o:, u:, @ (schwa); Slovak: A, e, i, o, u, A:, e:, i:, o:, u:, {
Diphthongs – Czech: o_u, a_u, e_u; Slovak: I_^a, I_^e, I_^U\, u_^o
Consonants – Czech: p, b, t, d, c, J\, k, g, ts, dz, tS, dZ, r, l, f, v, s, z, S, Z, X, j, h\, Q\, P\, m, n, N, J, F; Slovak: p, b, t, d, c, J\, k, g, ts, dz, tS, dZ, r, l, r=, r=:, l=, l=:, L, f, v, s, z, S, Z, X, j, h\, w, U_^, G, I_^, m, n, N, J, F
In this table, the symbols ‘r=’ and ‘l=’ in the Slovak phoneme set represent the syllabic versions of ‘r’ and ‘l’; they are defined in Czech too, as allophones of ‘r’ and ‘l’. The phonemes ‘r=:’ and ‘l=:’ are just their longer forms. The situation is similar for the Slovak phonemes ‘G’ and ‘I_^’, which are considered allophones of ‘h\’ and ‘j’ in the Czech language.
3 Proposed Cross-Lingual Adaptation Approach

Our adaptation approach is focused on acoustic modeling and is performed in two consecutive phases. During the former one, speaker independent or gender dependent (GD) model training is performed using the phoneme and pronunciation mapping table. The second phase then utilizes speaker-specific data and speaker adaptation approaches.

3.1 Speaker Independent Acoustic Modeling

The amount of Slovak acoustic data was limited to several hours, and it was not possible to create a separate speaker independent Slovak acoustic model within the standard training procedure. Therefore, we proposed a phoneme mapping approach that allows the new Slovak model to be trained also on data from the source Czech language. The resulting mapping rules from the Slovak phonetic inventory to the Czech one are summarized in Table 2.
This conversion table was applied to the phonetic transcriptions of all available Slovak acoustic data as well as to all pronunciation forms in the Slovak lexicon. The Slovak-specific phones and diphthongs were mapped onto their closest Czech counterparts, either single phonemes or phoneme strings.

Table 2. Phonetic mapping of Slovak phonemes onto their closest Czech counterparts (Slovak letter / Slovak phoneme / Czech phoneme)
ä / { / e
ľ / L / l
ĺ / l=: / l
ŕ / r=: / r
v / U_^ / u
v / w / v
h / G / h
j / I_^ / j
ô / u_^o / uo
ia / I_^a / ja
ie / I_^e / je
iu / I_^U\ / ju
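For illustration, the mapping of Table 2 can be expressed as a simple lookup applied to Slovak SAMPA transcriptions before pooling them with the Czech training data. The dictionary below follows Table 2; splitting the Czech targets 'uo', 'ja', 'je', 'ju' into two-phoneme strings is an assumption, and unlisted phonemes are passed through unchanged.

```python
# Mapping of the Slovak-specific SAMPA units of Table 2 onto Czech phonemes;
# targets written as 'uo', 'ja', 'je', 'ju' in the table are assumed to be
# two-phoneme strings.  Phonemes not listed here are passed through unchanged.
SK_TO_CZ = {
    "{": "e", "L": "l", "l=:": "l", "r=:": "r", "U_^": "u", "w": "v",
    "G": "h", "I_^": "j", "u_^o": "u o",
    "I_^a": "j a", "I_^e": "j e", "I_^U\\": "j u",
}

def map_transcription(sk_phonemes):
    """Map a list of Slovak SAMPA symbols onto the Czech inventory."""
    czech = []
    for ph in sk_phonemes:
        czech.extend(SK_TO_CZ.get(ph, ph).split())
    return czech

# Example: the diphthong in 'konferencia' ends up as the Czech string 'j a'.
print(map_transcription(["k", "o", "n", "f", "e", "r", "e", "n", "ts", "I_^a"]))
```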
Initial experiments proved that the proposed mapping works well. The only problem occurred with word substitutions caused by phonetic assimilation, namely at word boundaries in fluent speech. It should be noted that this phenomenon (a voiced sound changing into an unvoiced one and vice versa, depending on the phonemes that follow) is very typical for Slovak. We solved it by adding alternative pronunciation forms to all the words ending with paired consonants, as well as to homonyms and to those words whose pronunciation depends on an ambiguous morphological classification (e.g. words like „citovaní“, where the letter ‘n’ can be converted into the phoneme ‘n’ or ‘J’). After all these refinements, the lexicon included approximately 1.2 phonetic forms per word.

3.2 Speaker Dependent Acoustic Modeling

One disadvantage of the previous adaptation phase is that the phoneme set of the created speaker independent acoustic model is limited by the phonetic mapping table to the Czech phonemes only. The aim of the second adaptation phase is to further improve the performance of this model. This approach can be applied in situations when the target ASR system (e.g. one for voice dictation) is used by a speaker who can be asked to provide his/her acoustic data. These data can be utilized to extend the phoneme set of the baseline SI model to the full set of Slovak phonemes, with parameters adapted to the speaker. This process can be described in detail as follows: at first, phonemes like ‘l’ and ‘r’ in the SI Slovak acoustic model, which were the targets of the mapping (see Table 2), are duplicated and renamed according to the corresponding source phonemes in order to create a new model with the full set of Slovak phonemes. The resulting model then contains several identical Slovak phonemes, like
‘l=:’ and ‘l’ or ‘r=:’ and ‘r’, for example. After that, state occupation likelihoods are calculated for all the Gaussian components of this model on all available speaker adaptation data. Finally, this model is adapted to the given speaker using a combination of the MAP [7] and MLLR [8] methods. We utilize a regression tree rather than static regression classes for MLLR, and the first two nodes of this tree are created manually by splitting all the acoustic units into two groups: the first one contains only models of phonemes, while the second one contains only models of noises. The other nodes are constructed automatically using the state occupation likelihoods that have to be collected during the previous phase of SI model building. Unfortunately, these numbers are not available for the phonemes created by duplication; their statistics therefore have to be set to the same values as those of their template models. The result of this whole adaptation procedure is a new acoustic model containing the full set of speaker-specific Slovak phonemes.
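A minimal sketch of the duplication step is given below; the model is represented as a plain dictionary of per-phoneme parameters, and the list of (new phoneme, template) pairs beyond the 'l'/'r' examples given in the text is an assumption derived from Table 2 (diphthongs mapped onto phoneme strings are not covered).

```python
import copy

# (new Slovak phoneme, existing template) pairs; 'l' -> L and 'l' -> l=: are the
# examples given in the text, the remaining single-phoneme pairs are assumed
# from Table 2 (diphthongs mapped onto phoneme strings are not handled here).
DUPLICATES = [("L", "l"), ("l=:", "l"), ("r=:", "r"), ("{", "e"),
              ("U_^", "u"), ("w", "v"), ("G", "h"), ("I_^", "j")]

def expand_phoneme_set(si_model):
    """Clone HMMs so that the model covers the full Slovak phoneme inventory.

    si_model : dict phoneme -> HMM parameters (means, variances, weights, ...).
    The clones start with the statistics of their template phoneme and are
    later re-estimated by MAP/MLLR speaker adaptation.
    """
    sd_model = copy.deepcopy(si_model)
    for new_ph, template in DUPLICATES:
        if new_ph not in sd_model and template in sd_model:
            sd_model[new_ph] = copy.deepcopy(si_model[template])
    return sd_model
```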
4 Experimental Evaluation

Within this study, the ASR system was evaluated in the voice dictation task. The test data set was compiled from recordings of 3 male and 3 female speakers. These persons were asked to dictate text on various topics, including news from the Internet, economic texts, etc. The total length of these test data was 63 minutes and they contained 9827 words. Each test speaker also provided 100 sentences for adaptation, which were selected to cover all the specific Slovak phonemes. We used standard 39 MFCC features for signal parameterization and three-state context-independent HMMs with up to 96 components per state for acoustic modeling.

4.1 Text and Language Resources

The available language corpora for Slovak contained 8 GB of texts covering various domains such as broadcast news, economic reports, judgments, etc. From this corpus, we created a lexicon containing the 300K most frequent words and a bigram language model that was smoothed by the Witten-Bell method.

4.2 Evaluation of the First Adaptation Phase

In initial ASR experiments, we tried to utilize the existing Czech speaker independent and gender dependent (GD) acoustic models. The results were surprisingly good (see Table 3). After all phonetic transcriptions of the Slovak acoustic data (3 hours of male and 3 hours of female speech) had been mapped onto the Czech phoneme inventory, we added them to our Czech speech database and re-trained all the models. The resulting adapted models were made of approx. 90% Czech and 10% Slovak data. The results presented in Table 3 were calculated over all the test speakers.
Table 3. Results of Czech and Czech-to-Slovak adapted GD and SI acoustic models (WER [%])
Czech SI: 13.5
Czech GD: 12.8
Czech-to-Slovak adapted SI: 12.5
Czech-to-Slovak adapted GD: 11.4
They show that the GD models outperformed the corresponding SI models in both cases and that the Czech-to-Slovak adapted models give better results than the original Czech ones. Adaptation of the GD models leads to a reduction of WER from 12.8% to 11.4% (about 11% relative).

4.3 Evaluation of the Second Adaptation Phase

Two different experiments were performed for the second adaptation phase. In the first one, 10 minutes of speaker-specific data were used for MAP and MLLR based adaptation of the Slovak GD models created within the previous experiment. In the latter experiment, we used the same models as the prior for adaptation together with the approach proposed in Section 3.2: at first, several phonemes from the prior GD model were duplicated (l -> L, l -> l=:, etc.) so that it was not necessary to map the phonetic transcriptions of the adaptation data onto the Czech set of phonemes. After that, speaker-specific models with the full set of Slovak phonemes were created for each test speaker using the combination of MAP and MLLR.

Table 4. WER in [%] after utilization of speaker adaptation methods (speaker / GD models / GD models + SA / GD models + phon. dupl. + SA)
M1 / 11.5 / 9.6 / 8.7
M2 / 10.6 / 8.2 / 7.3
M3 / 12.2 / 10.1 / 8.4
F1 / 11.6 / 9.1 / 8.0
F2 / 10.3 / 8.3 / 7.3
F3 / 12.1 / 9.8 / 8.6
total / 11.4 / 9.2 / 8.1
The results of this evaluation are summarized in Table 4. The value of the total WER (in the last row) was calculated as an average over all the test speakers, weighted by the number of words in their test recordings. The presented numbers show that the WER of the prior Czech-to-Slovak adapted GD models with phoneme mapping was reduced from 11.4% to 9.2% after speaker adaptation. It is also evident that the proposed approach for speaker dependent acoustic modeling yielded an additional reduction of WER to 8.1%.
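The weighted average can be reproduced as follows; the per-speaker word counts in the example are invented for illustration (only their sum, 9827 words, matches the test set).

```python
def total_wer(per_speaker):
    """Word-count-weighted average WER, as used for the 'total' row of Table 4.

    per_speaker : list of (wer_percent, n_words) tuples, one per test speaker.
    """
    errors = sum(wer / 100.0 * n for wer, n in per_speaker)
    words = sum(n for _, n in per_speaker)
    return 100.0 * errors / words

# Hypothetical per-speaker word counts (their sum, 9827, matches the test set);
# with the WERs of the last column of Table 4 this prints roughly 8.0.
print(total_wer([(8.7, 1700), (7.3, 1650), (8.4, 1600),
                 (8.0, 1550), (7.3, 1700), (8.6, 1627)]))
```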
5 Conclusion

Within this study, a two-phase cross-lingual adaptation from Czech to Slovak was proposed and evaluated experimentally for an existing LVCSR system. The presented results showed that the resulting Czech-to-Slovak adapted system can operate with a WER of 8% in the voice dictation task. This value is only about 3% worse than the typical WER of the original Czech system in the same task when using speaker-specific models. At this moment, a similar concept of cross-lingual adaptation is also being tested for another Slavic language, Polish. The plan for the next research is to focus on collecting more data for AM training and on unsupervised cross-lingual adaptation approaches, which should allow creating better SI or GD models without the need for manual phonetic transcriptions.
Acknowledgments. This work was supported by the Grant Agency of the Czech Republic within grant no. P103/11/P499 and grant no. 102/08/0707.
References 1. Nouza, J., Zdansky, J., Cerva, P., Kolorenc, J.: Continual On-line Monitoring of Czech Spoken Broadcast Programs. In: Proceedings of International Conference on Spoken Language Processing (ICSLP) 2006, Pittsburgh, USA, pp. 1650–1653 (September 2006) 2. Bayeh, R., Lin, S., Chollet, G., Mokbel, C.: Towards multilingual speech recognition using data driven source/target acoustical units association. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2004, Montreal, Quebec, Canada, pp. 521–524 (2004) 3. Lin, H., Deng, L., Yu, D., Gong, Y., Acero, A., Lee, C.H.: A study on multilingual acoustic modeling for large vocabulary ASR. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2009, Taipei, Taiwan, pp. 4333–4336 (2009) 4. EU Web Pages, http://eur-lex.europa.eu/en/treaties/index.htm 5. Kral, A., Sabol, J.: Phonetics and Phonology. SPN, Bratislava (1989) (in Slovak) 6. Ivanecky, J.: Automatic speech transcription and segmentation. PhD thesis, Košice (2003) (in Slovak) 7. Gauvain, J.L., Lee, C.H.: Maximum A Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains. IEEE Transactions on Speech and Audio Processing 2, 291–298 (1994) 8. Gales, M.J.F., Woodland, P.C.: Mean and Variance Adaptation Within the MLLR Framework. Computer Speech and Language 10, 249–264 (1996)
Audio-Visual Isolated Words Recognition for Voice Dialogue System Josef Chaloupka Institute of Information Technology, Technical University of Liberec, Studentska 2, 461 17 Liberec, Czech Republic
[email protected]
Abstract. This contribution describes experiments in audio-visual isolated word recognition. The results of these experiments will be used to improve our voice dialogue system, to which visual speech recognition will be added. Voice dialogue systems can be used in train or bus stations (or elsewhere), where noise levels are relatively high; the visual part of speech can therefore improve the recognition rate, mainly in noisy conditions. The audio-visual recognition of isolated words in our experiments was based on the technique of two-stream Hidden Markov Models (HMM) and on HMMs of single Czech phonemes and visemes. Different visual speech features and different numbers of HMM states and mixtures were evaluated in individual tests. In the subsequent experiments, isolated words were recognized after training the HMMs, and babble noise was added to the acoustic speech signal in successive steps. Keywords: Audio-visual speech recognition, visual speech parameterization, audio-visual voice dialogue system.
1 Introduction

Human lips, teeth and mimic muscles affect the production of speech. Visual speech information can help the hearing-impaired to understand speech, and it also helps all people to understand spoken information in noisy conditions. It is therefore beneficial to use visual information for automatic speech recognition, especially in noisy conditions. The utilization of the visual part of speech in speech recognition systems is still mostly at the stage of tests or prototypes, whereas audio-visual speech synthesis has been used in various communication, information and educational systems around the world for more than ten years. We have developed several multimodal voice dialogue systems in our lab that include audio-visual speech synthesis (a talking head) [1]. We would like to add a subsystem for audio-visual automatic speech recognition (AV ASR) to our multimodal voice dialogue system; we have therefore developed an algorithm for visual speech parameterization in real time and tested several strategies for recognizing audio-visual speech signals.
2 Features Extraction and Audio-Visual Speech Recognition

Feature extraction from an audio speech signal is well solved at present (2010): LPCC (Linear Prediction Cepstral Coefficients) or MFCC (Mel Frequency Cepstral Coefficients) are very often used successfully. Therefore, only visual speech parameterization is described in this part.
Fig. 1. The principle of audio and visual speech features extraction
The extraction of visual features is as follows: in the first step, human faces are detected in the video images of the visual signal. The Viola-Jones face detector [2], based on Haar-like filters and the AdaBoost algorithm, was used in our parameterization system. It is necessary to decide who is speaking if more than one human face is detected in a video recording. A visual voice activity detector solves this problem: the lip object is segmented from the bottom part of the detected face by image segmentation. The static visual feature (the vertical opening of the lips) is taken from the segmented lip object. The dynamic features are computed from the static features for each detected face, and the sum of ten subsequent absolute values of the dynamic features is the parameter used to decide who is speaking. It is not a good idea to use only the static visual feature of the vertical lip opening to determine who is speaking, because somebody may have a widely opened mouth (e.g. while yawning) without speaking. The sum of the dynamic features computed from DCT visual features was used in our previous work, but the problem was that somebody with pronounced facial grimacing during a speech act could be wrongly selected as the speaker.
Around the object of the segmented lips, the ROI (Region Of Interest) is selected and separated. The visual features are extracted from the ROI in the last step. At present [3], two main groups of visual speech features exist: shape visual features and appearance-based visual features. The shape visual features are extracted directly from the segmented lip object [4]; they are the horizontal and vertical opening of the lips, lip rounding, etc. It is difficult to find the exact border of human lips in some real video images (the color of the lips is sometimes very similar to the color of human skin), and it is therefore difficult to obtain exact shape features. Hence, the appearance-based visual features are used more often. These visual features are computed from the ROI by means of a transform: DFT (Discrete Fourier Transform), DCT (Discrete Cosine Transform), LDA (Linear Discriminant Analysis) or PCA (Principal Component Analysis). DCT is chosen most often [5] because it can be computed very fast using an algorithm similar to the well-known FFT (Fast Fourier Transform, used for computing the DFT). For the extraction of visual features, it is also possible to use methods and algorithms for stereovision [6]; the results are better than with the extraction of visual features from a single 2D video image, but their use requires more computation time. A relatively new method for the extraction of visual features is based on AAM (Active Appearance Models), but the visual features from AAM are quite speaker-dependent. It is, however, possible to use a transform [7] of the visual feature vectors and use them in a speaker-independent audio-visual speech recognizer. After the extraction of the audio and visual features, either the features themselves or the results of audio-only and visual-only speech recognition are combined (integrated) [3]. The early integration of visual and audio features and the middle integration by two-stream HMMs were used in our work. In the early integration process, the audio and visual features are combined into one vector, and these vectors are used for training the HMMs and for recognition based on these HMMs. The output function of state S for the two-stream HMM (middle integration) is:
$$b_S^{\gamma}(\vec{x}) = \prod_{t=1}^{2} \left( b_S(\vec{x}_t) \right)^{\gamma_t} \qquad (1)$$

where $\vec{x}_1$ is the audio feature vector, $\vec{x}_2$ is the visual feature vector, $\gamma_t$ is the weight of stream $t$ and $b_S(\vec{x}_t)$ is the output state function:

$$b_S(\vec{x}_t) = \sum_{m=1}^{M} c_{sm} \frac{1}{\sqrt{(2\pi)^P \det \Sigma_{sm}}} \exp\left[ -0.5\, (\vec{x}_t - \vec{x}_{sm})^T \Sigma_{sm}^{-1} (\vec{x}_t - \vec{x}_{sm}) \right] \qquad (2)$$

where $\vec{x}_t$ is the feature vector, $P$ is the number of features, $\vec{x}_{sm}$ is the mean vector, $\Sigma_{sm}$ is the covariance matrix and $M$ is the number of mixtures. The main task when using two-stream HMMs is to find the weights $\gamma_t$ for the audio and visual streams for a given SNR (Signal to Noise Ratio). The only possible way to achieve this is to change the SNR (add noise to the audio signal), change the weights for the audio and visual streams in single steps and look for the best recognition rate, see Fig. 2.
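A small sketch of how Eqs. (1) and (2) could be evaluated in the log domain is given below; the diagonal-covariance assumption and the example stream weights are illustrative.

```python
import numpy as np

def stream_log_b(x, weights, means, variances):
    """log b_S(x) for one stream: a diagonal-covariance GMM, as in Eq. (2)."""
    d = x.shape[0]
    log_comp = (np.log(weights)
                - 0.5 * (d * np.log(2 * np.pi)
                         + np.sum(np.log(variances), axis=1)
                         + np.sum((x - means) ** 2 / variances, axis=1)))
    m = log_comp.max()                       # log-sum-exp over the M mixtures
    return m + np.log(np.exp(log_comp - m).sum())

def two_stream_log_b(x_audio, x_video, gmm_audio, gmm_video, gamma=(0.7, 0.3)):
    """log of the two-stream output function of Eq. (1); gamma are stream weights."""
    return (gamma[0] * stream_log_b(x_audio, *gmm_audio)
            + gamma[1] * stream_log_b(x_video, *gmm_video))
```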
Fig. 2. Recognition rate of audio-visual speech recognition as a function of the audio and visual stream weights, for 5 dB SNR in the audio signal
3 Noisy Condition Simulation

It is very difficult or even impossible to create an audio-visual speech database where the audio signal has a given SNR; therefore, noisy conditions are simulated and noise is added to the original audio signal. A special algorithm was developed for adding noise to the audio signal at a given SNR. Babble noise from the NOISEX database [8] was chosen for our purpose. Prior to the experiments, it was necessary to estimate the Signal to Noise Ratio (SNR) in our audio speech signals. Several algorithms exist for this [9]. In our case, the SNR is calculated from the power of the signal Ps and from the power of the noise Pn (which is added to the signal):
$$\mathrm{SNR} = 10 \log \frac{P_s}{P_n} = 10 \log \frac{\sum_{i=0}^{N-1} s^2[i]}{\sum_{i=0}^{N-1} n^2[i]} \qquad (3)$$
where N is the number of samples of the signal, s[i] are the samples of the "clean" signal (without noise), and n[i] are the samples of the noise signal that is included in the original audio signal x. Our hypothesis was that the noise n is additive, hence x[i] = s[i] + n[i]. Ps was estimated from the power of the audio signal Px and from Pn, which was computed from the non-speech part of the audio signal; a speech/non-speech detector was used for this purpose.
$$P_s = P_x - P_n = \frac{\sum_{i=0}^{N-1} s^2[i]}{N} = \frac{\sum_{i=0}^{N-1} x^2[i]}{N} - \frac{\sum_{i=0}^{N-1} n^2[i]}{N} \qquad (4)$$
The problem is that the SNR in (3) is computed over the whole audio signal, so the dynamic changes of the speech signal are not well covered; the segmental signal-to-noise ratio (SSNR) therefore yields a better result:
$$\mathrm{SSNR} = \frac{10}{F} \sum_{j=0}^{F-1} \log \frac{\sum_{i=0}^{M-1} s_j^2[i]}{\sum_{i=0}^{M-1} n_j^2[i]} \qquad (5)$$

where F is the number of frames of the speech signal and M is the length of one frame in samples. The SSNR (SSNRe) was estimated from each audio signal, and in the second step the noise signal was added according to a given relative change of SSNR (ΔSSNR):
$$\mathrm{SSNR}_w = \mathrm{SSNR}_e - \Delta\mathrm{SSNR} \qquad (6)$$

where SSNRw is the new value of the SSNR in the audio signal:

$$\mathrm{SSNR}_w = 10 \log \frac{P_x - P_n}{P_n + c \cdot P_{an}} \qquad (7)$$

where Pan is the power of the additive noise and c is the gain coefficient:

$$c = \frac{P_x - P_n}{P_{an} \cdot 10^{\mathrm{SSNR}_w / 10}} - \frac{P_n}{P_{an}} \qquad (8)$$

The resulting audio signal xn[i] is created from the original signal x[i] and from the additive noise an[i]:

$$x_n[i] = x[i] + an[i] \cdot c, \quad 0 \le i \le N \qquad (9)$$
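The noise-addition procedure of Eqs. (5)-(9) could be sketched as follows; the 16 kHz sampling rate, the crude noise-power estimate from the first 0.2 s (instead of a proper speech/non-speech detector) and the utterance-level approximation of the SSNR are assumptions.

```python
import numpy as np

def add_noise(x, an, delta_ssnr_db, noise_power=None, fs=16000):
    """Add scaled babble noise so that the SSNR drops by delta_ssnr_db (sketch).

    x  : original audio samples (already containing some noise of power Pn)
    an : additive noise samples (e.g. NOISEX babble), same length as x
    noise_power : estimate of the noise power Pn already present in x; if not
        given it is crudely taken from the first 0.2 s, whereas the paper uses
        a speech/non-speech detector for this purpose.
    """
    Px = np.mean(x ** 2)
    Pn = np.mean(x[:int(0.2 * fs)] ** 2) if noise_power is None else noise_power
    Pan = np.mean(an ** 2)

    # Estimated and target SSNR, Eqs. (5)-(6); the utterance-level ratio is
    # used here as a stand-in for the frame-averaged segmental value.
    ssnr_e = 10 * np.log10((Px - Pn) / Pn)
    ssnr_w = ssnr_e - delta_ssnr_db

    # Gain coefficient from Eq. (8) and mixing from Eq. (9), where c is applied
    # directly to the noise samples.
    c = (Px - Pn) / (Pan * 10 ** (ssnr_w / 10)) - Pn / Pan
    return x + c * an
```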
4 Experiments

Our own Czech audio-visual speech database AVDBcz2 was used for the experiments on audio-visual word recognition in noisy conditions. Frontal-camera recordings of 35 people (speakers) were taken for this database. Each speaker uttered 50 words and 50 sentences. Two experiments were performed on our database. Whole-word HMMs were used in the first experiment, while HMMs of phonemes (40 Czech phonemes) and visemes (13 Czech visemes) and the early integration of audio and visual features were used in the second experiment. The 50 words (video recordings) from the first 25 speakers formed the training database for the whole-word HMM training, and the 50
words from the remaining 5 speakers were used for establishing the weights in the two-stream audio-visual HMMs. The 50 words from the last 5 speakers formed the test database. The test database for the second experiment was the same as for the first experiment, but the 50 sentences from the first 30 speakers were used for training the HMMs (3 states) of single phonemes and visemes. 15 visual features (5 DCT static features + 5 delta + 5 delta-delta) were extracted from the visual signal and 39 audio features (13 MFCC + 13 delta + 13 delta-delta) were obtained from the audio signal. The number of DCT visual features had been established in our previous tests on visual speech recognition, where 5 DCT visual features (+ dynamic features) yielded the best recognition rate (for 14-state HMMs). Babble noise was added to the original audio signal in single steps (for the given SNR), and the recognition rates for audio-only speech recognition and for audio-visual speech recognition were evaluated. The results of the first experiment are shown in Fig. 3 and the results of the second experiment can be seen in Fig. 4.
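Early integration amounts to concatenating the audio and visual feature vectors frame by frame; the sketch below assumes the visual stream is upsampled to the audio frame rate by simple repetition, which is not specified in the paper.

```python
import numpy as np

def early_integration(audio_feats, visual_feats):
    """Concatenate audio and visual features frame by frame (early integration).

    audio_feats  : (Ta, 39) MFCC + delta + delta-delta vectors (audio frame rate)
    visual_feats : (Tv, 15) DCT + delta + delta-delta vectors (video frame rate)
    The lower-rate visual stream is repeated (nearest-neighbour upsampling) so
    that both streams provide one vector per audio frame.
    """
    Ta, Tv = audio_feats.shape[0], visual_feats.shape[0]
    idx = np.minimum((np.arange(Ta) * Tv) // Ta, Tv - 1)
    return np.hstack([audio_feats, visual_feats[idx]])   # (Ta, 54) joint vectors
```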
Fig. 3. Audio-visual speech recognition with two-stream HMM
Fig. 4. Audio-visual speech recognition with HMM of phonemes and visemes
The visual-only recognition rate was 45.2% when whole-word two-stream HMMs (middle integration) were used, and 30% when HMMs of single phonemes and visemes (early integration) were used.
5 Conclusion

Several experiments on audio-visual speech recognition in noisy conditions have been carried out in this work. The results of audio-visual speech recognition in noisy conditions from the first experiment (two-stream HMMs, middle integration) are better than those obtained in the second experiment (HMMs of phonemes and visemes, early integration), but the use of HMMs of single phonemes and visemes in the recognizer of the voice dialogue system is more practical, particularly when we have a large vocabulary (more than 1000 words). We would like to integrate the audio-visual speech recognizer based on the HMMs of single Czech phonemes and visemes into our multimodal voice dialogue system in the near future.
Acknowledgments. The research reported in this paper was partly supported by the grant MSMT OC09066 (project COST 2102) and by the Czech Science Foundation (GACR) through the project No. 102/08/0707.
References
1. Chaloupka, J., Chaloupka, Z.: Czech Artificial Computerized Talking Head George. In: Esposito, A., Vích, R. (eds.) Cross-Modal Analysis of Speech, Gestures, Gaze and Facial Expressions. LNCS (LNAI), vol. 5641, pp. 324–330. Springer, Heidelberg (2009)
2. Viola, P., Jones, M.J.: Robust Real-Time Face Detection. International Journal of Computer Vision 57(2), 137–154 (2004)
3. Potamianos, G., Neti, C., Gravier, G., Garg, A., Senior, A.W.: Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE 91(9), 1306–1326 (2003)
4. Liew, A.W.C., Wang, S.: Visual speech recognition – lip segmentation and mapping. Medical Information Science Reference Press, New York (2009)
5. Heckmann, M., Kroschel, K., Savariaux, C., Berthommier, F.: DCT-based video features for audio-visual speech recognition. In: Proc. Int. Conf. Spoken Lang. Process. (2002)
6. Goecke, R., Asthana, A.: A Comparative Study of 2D and 3D Lip Tracking Methods for AV ASR. In: Proceedings of the International Conference on Auditory-Visual Speech Processing (AVSP 2008), Australia, pp. 235–240 (2008) ISBN 978-0-646-49504-0
7. Lan, Y., Theobald, B.J., Harvey, R., Ong, E.J., Bowden, R.: Improving Visual Features for Lip-reading. In: The 9th International Conference on Auditory-Visual Speech Processing AVSP 2010, Japan, pp. 142–147 (September 2010) ISBN 978-4-9905475-0-9
8. Varga, A.P., Steeneken, H.J.M., Tomlinson, M., Jones, D.: The NOISEX-92 study on the effect of additive noise on automatic speech recognition. Tech. Rep., Speech Research Unit, Defence Research Agency, Malvern, UK (1992)
9. Zhao, D.Y., Kleijn, W.B., Ypma, A., de Vries, B.: Online Noise Estimation Using Stochastic-Gain HMM for Speech Enhancement. IEEE Transactions on Audio, Speech, and Language Processing 16(4), 835–846 (2008)
Semantic Web Techniques Application for Video Fragment Annotation and Management Marco Grassi, Christian Morbidoni, and Michele Nucci Department of Biomedical, Electronic and Telecommunication Engineering, Università Politecnica delle Marche - Ancona, 60131, Italy {m.grassi,c.morbidoni}@univpm.it,
[email protected] http://www.semedia.dibet.univpm.it
Abstract. The amount of videos loaded every day on the Web is constantly growing, and in the near future videos will constitute the primary Web content. However, video resources are currently handled only through the use of plugins and are therefore scarcely integrated into the World Wide Web. Standards such as HTML5 and Media Fragment URI, currently under development, promise to enhance video accessibility and to allow a more effective management of video fragments. On the other hand, the need for annotating digital objects, possibly at a low granularity level, is being highlighted in various scientific communities. User-created annotations, if properly structured and machine processable, can enrich web content and enhance search and browsing capabilities. Providing full support for video fragment tagging, linking, annotation and retrieval therefore represents a key factor for the development of a new generation of Web applications. In this paper, we discuss the feasibility of Semantic Web techniques in this scenario and introduce a novel Web application for semantic multimodal video fragment annotation and management that we are currently developing. Keywords: Video annotation, Semantic Web, Media Fragment.
1 Introduction
The advent of Web 2.0 has led to an explosion of user-generated Web content and has made tagging, linking and commenting on resources everyday activities for web users and a valuable source of metadata that can be exploited to drive resource ranking, classification and retrieval. The collaborative approach has therefore been increasingly understood as a key factor for resource annotation, which can be applied not only to user-generated data but also for scientific purposes when dealing with large collections of information, as in the case of digital libraries. User-created annotations, if properly structured and machine-processable, can enrich web content and enhance search and browsing capabilities. Semantic Web (SW) techniques are finding more and more application in web resource annotation. They allow in fact univocally identifying web resources, exploiting a univocally interpretable format such as RDF for resource description, and using ontologies
to add semantics to the encoded information in order to make it effectively sharable between different users. In addition, the possibility of managing digital objects at a low granularity level, for example to access or annotate text excerpts or photo regions, is being highlighted in various scientific communities. This is particularly true for video resources, which until a few years ago constituted just a marginal part of Web content, are still handled only through the use of plugins and are therefore scarcely integrated into the World Wide Web. Since the spread of video sharing services such as YouTube, however, the amount of videos loaded every day on the web is growing constantly, and in the near future videos will constitute the primary content of the Web. Standards such as HTML5 and Media Fragment URI, currently under development, promise to enhance video accessibility and to allow a more effective management of video fragments. It is no coincidence that HTML5, the next major revision of the HTML standard, introduces a specific video tag and provides wider support for video functionalities. We believe that providing full support for video fragment tagging, linking, annotation and retrieval therefore represents a key factor for the development of a new generation of Web applications. In this paper, we discuss the application of SW techniques in Web resource annotation, analyzing the common requirements of a general-purpose web annotator and focusing on video annotation to introduce a novel Web application for semantic video fragment annotation and management that we are currently developing. The paper is organized as follows. Section 2 briefly introduces the Semantic Web initiative. Section 3 provides an overview of the main general-purpose annotation systems, while Section 4 focuses on existing video annotation tools. Section 5 discusses the main requirements and implementation guidelines of a general-purpose annotation tool. Finally, Section 6 introduces the prototype video annotation application.
2 Semantic Web
The Semantic Web¹ (SW) is an initiative by the W3C that aims to implement a next-generation Web in which information can be expressed in a machine-understandable format and can be processed automatically by software agents. The SW enables data interoperability, allowing data to be shared and reused across heterogeneous devices and applications. The SW is mainly based on the Resource Description Framework (RDF) to define relations among different data, creating semantic networks. In the SW, ontologies are used to organize information and formally describe concepts of a domain of interest. An ontology is a vocabulary including a set of terms and the relations among them. Ontologies can be developed using specific ontology languages such as the RDF Schema Language (RDFS) or the Web Ontology Language (OWL) for inference and knowledge-base modeling. Semantic Web techniques are suitable for application in all scenarios that require advanced data integration, to link data coming from multiple sources
without preexisting schemas, and powerful data modeling to represent expressive semantic descriptions of application domains and to provide inferencing power for applications.
¹ See http://www.w3.org/2001/sw/ for the Semantic Web initiative, related technologies and standards.
3 Annotation Systems
Annotating web documents such as web pages, parts of web pages, images, audios and videos is one of the most widespread techniques for creating interconnected and structured metadata on the Web. In recent years, many annotation systems have been proposed to ease and support the creation of annotations. Annotea is a web-based annotation protocol that uses an RDF-based annotation schema [1] to formally describe annotations. Annotea was implemented for the first time in the Amaya [2] browser, but currently other applications based on the Annotea protocol are available. Some of these applications have extended and adapted the Annotea protocol to support additional use cases such as the annotation of audio and video material [3]. The EuropeanaConnect Media Annotation Prototype (ECMAP) [4] is an online media annotation suite based on Annotea that allows users to extend existing bibliographic information about digital items like images, audio and videos. ECMAP provides free-text annotation and semantic tagging, integrates Linked Data resource linkage into the user annotation process and provides shape drawing tools for images, maps and video. It also provides special support for high-resolution map images, enabling tile-based rendering for faster delivery, geo-referencing and semantic tag suggestions based on geographic location. LORE [5] (Literature Object Reuse and Exchange) is a lightweight tool designed to enable scholars and teachers of literature to author, edit and publish compliant compound information objects that encapsulate related digital resources and bibliographic records. LORE provides a graphical user interface for creating, labeling and visualizing typed relationships between individual objects using terms from a bibliographic ontology. SWickyNotes [6] is an open-source desktop application for semantically annotating web pages and digital libraries. It is based on Semantic Web standards that allow annotations to be more than simple textual sticky notes: they can contain semantically structured data that can later be used to meaningfully browse the annotated contents. One Click Annotator [7] is a WYSIWYG Web editor for enriching content with RDFa annotations, enabling non-experts to create semantic metadata. It allows annotating words and phrases with references to ontology concepts and creating relationships between annotated phrases. Automatic annotation systems have also been developed, which aim to create structured metadata while the user is creating content. OpenCalais [8] is a web annotation system based on a web service that is able to automatically create rich semantic metadata for text content submitted by users. COHSE [9] enables the automatic generation of metadata descriptions by analyzing the content of a Web page and comparing the analyzed parts with concepts described in a lexicon. These kinds of automatic annotation systems generally rely on Natural Language Processing (NLP), Machine Learning, Text Mining and other similar techniques.
4 Video Annotation Tools
Video annotation constitutes the starting point of many scientific investigations, for example in the study of emotion and of the multimodality of human communication. Multimodal video annotation, in particular, is a highly time-consuming and difficult task. Several desktop tools have been developed in recent years to provide support for this activity. A complete review of such tools goes beyond the purpose of this work, which focuses on web applications, and can be found in [10]. Also, in a previous work [11], we conducted a survey about multimodal video annotation tools and schemas and traced a roadmap toward the application of Semantic Web techniques to enhance multimodal video annotation, which, also according to the survey results, appears to be highly beneficial in this scenario. In recent years, as a result of the great spread of videos over the Web, video content management has been increasingly supported by web applications, and tools for general-purpose video annotation have also started to be developed. YouTube itself [12] enables video uploaders to create textual annotations in the form of text bubbles or notes, also highlighting parts of the screen, and to make these annotations visible to all YouTube users when the video is played. VideoAnt [13] is a Web application that also uses YouTube as its video source; it allows users to insert markers in the video timeline and to associate textual annotations with them. The created annotations are also sent by email so that they can be accessed by other users. Project Pad [14] is a project to build a web-based system for media annotation and collaboration for teaching, learning and scholarly applications. Project Pad provides an open-source Web application, distributed under the GPL license and developed in Java and Flash, available both as a standalone application and as part of Sakai. The application allows selecting video segments and creating textual annotations. It also provides a timeline visualization of the annotations. Kaltura Advanced Editor [15] provides several functionalities for online video editing, supporting timeline-based editing and video and audio layers. It allows adding soundtracks and transitions, importing videos, images and audio while editing, and adding effects and textual annotations. The EuropeanaConnect Media Annotation Suite (ECMAS) [4] includes a client application for video annotation. It allows selecting video segments by adding markers and attaching textual annotations. The EuropeanaConnect video annotator also includes Semantic Web capabilities so that users can augment existing online videos with related resources on the Web, provided by the Linked Data cloud. This augmentation happens on the fly: while the users are writing their annotations, the application proposes related resources derived from DBpedia [16], a semantically compliant version of Wikipedia. The user has to verify the semantic validity of a link or to disambiguate between possible homonyms before they become part of the annotation. Such linked resources can then be exploited in the underlying search and retrieval infrastructure. Among the existing web video annotation tools, only ECMAS exploits some of the possibilities offered by the Semantic Web, and only for information augmentation. Annotations can only be added in the form of free text and there is no possibility to structure the annotation according to standardized domain
ontologies. Also, the management of video fragments, when provided, does not follow the Media Fragment URI standard, which limits interoperability and accessibility. In addition, apart from the Kaltura editor, the existing applications for video annotation provide rather poor interfaces and limited performance in comparison with existing desktop tools, not supporting, for example, drag-and-drop functionality.
5 SemLib Annotation Tool
The SEMLIB European project [17], in which we are currently participating, aims to improve the current state of the art in digital libraries through the application of Semantic Web techniques in this scenario. Three main challenges need to be faced:
– improve the efficiency of searches, considering that, due to the large amount of data that digital libraries can store, it has become very difficult for users to find and retrieve relevant content;
– promote interoperability, allowing the re-use, re-purposing and re-mixing of digital objects in heterogeneous environments, taking into account that nowadays digital libraries are consumed and manipulated at the same time by human actors and by machines and other software applications;
– allow effective resource linking also outside the boundaries of a single digital repository.
The purpose is therefore to develop a modular and configurable web application based on Semantic Web technologies that can be plugged into other existing web applications and digital libraries and that can export/import semantic annotations from/to the Web of Data (Linked Data). The system shall allow common users of Digital Libraries, with no knowledge of Semantic Web techniques, to enrich the content, establish relations and support their scholarly activities. In the next subsections, we discuss the main requirements of a web resource annotation system and the applicable technologies for their accomplishment.
5.1 Requirements Discussion
In order to accomplish its purposes, five main requirements have been identified for the application:
– Flexibility. The proposed application has to allow detailed annotations of heterogeneous resources (text, images, audio and video) in different application domains and has to be pluggable into other existing web applications.
– Interoperability. The possibility to share fully understandable information between different users, software agents and applications represents a fundamental requirement for creating richer applications, allowing the original knowledge base to be augmented by adding related information coming from different external sources.
– Collaborative annotations. The application has to provide support for collaborative annotation management, allowing every user to create his or her own annotations and to access existing annotations.
– Fine-grained annotations. Nowadays, links, tags and annotations are added at the resource level. For example, creators or users can add tags to classify a document by its main topics, but they cannot specify in which part of the document each single topic is treated. The capability to fully implement the concept of bookmarks for web resources, identifying and providing access to a specific desired fragment of a resource, represents a key factor for creating the next generation of web applications, enhancing resource fruition and enabling more efficient automatic information aggregation.
– Ease of use. The application should expose an intuitive and engaging interface able to hide the underlying complexity of the system from users, who are not required to have any knowledge of SW techniques. In particular, the creation of well-structured annotations, according to the RDF (subject, property, value) model, should be accomplished in a fast and easy way.
Fig. 1. A simplified sketch of system architecture
5.2 Technical Solutions and Implementation Guidelines
In order to satisfy the requirements outlined in the previous subsection, several technical solutions and implementation guidelines have been identified for the implementation of the proposed system, which can be synthesized as follows:
– Standards compliance. In order to provide maximum support for interoperability, the system implementation relies on standards both in data encoding and in resource identification. RDF is used as the data model to encode information in a univocally interpretable standard. XPointer [18] and Media Fragment URI [19] are used to unambiguously identify, respectively, text excerpts in web pages and subparts of images and audio-video resources, providing support for addressing and retrieving such subparts as well as for their automated processing and reuse.
– Stand-off markup. This paradigm is applied for annotation management, which means that annotations reside in a location different from the location of the data being described. Exploiting URIs, XPointers and Media Fragment URIs, which allow univocally identifying resources and fragments of them, a resolution mechanism can be implemented that allows annotations to be accessed and stored independently from the original resources while still remaining unambiguously associated with them. This approach is particularly suitable in the considered scenario, allowing on one side the annotation of every resource on the Web, even if read-only, secured or located on a remote server, and on the other side providing maximum freedom to users both in creating their own annotations and in filtering and visualizing existing annotations.
– Pluggable ontologies. The use of ontologies allows providing semantically rich and structured descriptions of resources in specific knowledge domains. It therefore constitutes a fundamental requirement both for the flexibility (the possibility to create detailed descriptions in different domains) and for the interoperability (the possibility to rely on standardized vocabularies in the annotations) of the created annotations. The proposed application should provide a basic set of default general-purpose descriptors. In addition, it should allow importing external ontologies, as plug-in vocabularies, to enable effective structured descriptions of any knowledge domain.
– Modularity. This represents a fundamental requirement of the system architecture, both to provide support for the management of different resource formats and to allow the system to be plugged into other existing applications.
Figure 1 provides a simplified sketch of the core system architecture. Annotation creation is separated from annotation visualization, and different handlers are provided to supply specific management for the different functionalities required by the different supported media formats.
6 SemTube: Semantic YouTube Video Annotation Prototype
As a proof of concept, we are currently developing a web application for semantic annotation of YouTube videos. Besides being the main video sharing service on the Web, YouTube offers powerful APIs [20] that make embedding videos in Web pages an easy task, and also provides Player APIs that give control over YouTube video playback. In particular, the JavaScript Chromeless Player APIs have been used to create a custom player that, in addition to the common playback functionality, allows frames and video segments to be selected for annotation. A custom video progress bar has been created in JavaScript, allowing markers to be placed for selecting frames and segments. Once the frame or the segment to be annotated has been selected, annotation can be performed using free text, tags, or the descriptors provided by an ontology, which can be retrieved in real time from a SPARQL endpoint using AJAX technology. The created annotations are both displayed in the web page and stored in a Sesame triplestore for later retrieval and querying.
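As a rough sketch of the ontology-lookup step (the actual SemTube endpoint and queries are not described in the paper; DBpedia is used here only as a stand-in public endpoint), descriptors could be retrieved from a SPARQL endpoint as follows.

```python
# Illustrative sketch: retrieving candidate descriptors from a SPARQL endpoint.
# DBpedia is used here only as an example of a public endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?concept ?label WHERE {
        ?concept rdfs:label ?label .
        FILTER (lang(?label) = "en")
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["concept"]["value"], "-", row["label"]["value"])
```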
Fig. 2. A screenshot of SemTube (Semantic YouTube Video Annotator)
7 Conclusions
The capability of providing full support for video fragment annotation and management represents a key factor for the development of a new generation of Web applications. In this paper, we discussed the application of SW techniques in this scenario, analyzing the main requirements of a general-purpose web annotator and focusing on video annotation. We also introduced a novel Web application for semantic video fragment annotation and management that we are currently developing. Acknowledgments. This work has been supported by COST 2102, SSPNET and the SEMLIB project (SEMLIB - 262301 - FP7-SME-2010-1).
References 1. Kahan, J., Koivunen, M.R.: Annotea: An Open RDF Infrastructure for Shared Web Annotations. In: Proceedings of the 10th International Conference on World Wide Web, pp. 623–632 (2001) 2. Amaya Web Browser, http://www.w3.org/Amaya/ 3. Schroeter, R., Hunter, J., Kosovic, D.: FilmEd - Collaborative Video Indexing, Annotation and Discussion Tools Over Broadband Networks. In: Proceedings of the Multimedia Modelling Conference 2004, pp. 346–353 (January 2004)
4. Haslhofer, B., Momeni, E., Gay, M., Simon, R.: Augmenting Europeana Content with Linked Data Resources. In: 6th International Conference on Semantic Systems (I-Semantics) (September 2010) 5. Gerber, A., Hunter, J.: Authoring, Editing and Visualizing Compound Objects for Literary Scholarship. Journal of Digital Information 11 (2010) 6. SWickyNotes: Sticky Web Notes with Semantics, http://dbin.org/swickynotes/ 7. Ralf Heese, M.L.: One Click Annotation. In: 6th Workshop on Scripting and Development for the Semantic Web (2010) 8. OpenCalais, http://www.opencalais.com/ 9. Goble, C., Bechhofer, S., Carr, L., De Roure, D., Hall, W.: Conceptual Open Hypermedia = The Semantic Web? In: The Second International Workshop on the Semantic Web, Hong Kong, p. 4450 (May 2001) 10. Rohlfing, K., et al.: Comparison of multimodal annotation tools - workshop report. Gespraechsforschung-Online Zeitschrift zur verbalen Interaktion 7(7), 99–123 (2006) 11. Grassi, M., Morbidoni, C., Piazza, F.: Towards Semantic Multimodal Video Annotation. In: Esposito, A., Esposito, A.M., Martone, R., M¨ uller, V.C., Scarpetta, G. (eds.) COST 2010. LNCS, vol. 6456, pp. 305–316. Springer, Heidelberg (2011) 12. YouTube Video Annotations, http://www.youtube.com/it/annotations_about 13. VideoANT, http://ant.umn.edu/ 14. Project Pad, http://dewey.at.northwestern.edu/ppad2/ 15. Kaltura Video Editing and Annotation, http://corp.kaltura.com/video_platform/video_editing 16. Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann, S.: DBpedia A Crystallization Point for the Web of Data. Journal of Web Semantics: Science, Services and Agents on the World Wide Web (7), 154165 (2009) 17. SEMLIB - Semantic Tools for Digital Libraries. SEMLIB - 262301 - FP7-SME2010-1, http://www.semlib.org/ 18. XML Pointer Language (XPointer), http://www.w3.org/TR/xptr 19. Media Fragments URI 1.0. W3C Working Draft June 24 (2010), http://www.w3.org/TR/media-frags/ 20. YouTube APIs and Tools, http://code.google.com/apis/youtube/overview
Imitation of Target Speakers by Different Types of Impersonators Wojciech Majewski and Piotr Staroniewicz Wroclaw University of Technology, Institute of Telecommunications, Teleinformatics and Acoustics, Wybrzeze Wyspianskiego 27, 50-370 Wroclaw, Poland
[email protected]
Abstract. Vowel formant frequency planes obtained from speech samples of three well-known Polish personalities and their imitations performed by three impersonators of different types (professional, semi-professional and amateur) have been compared. The vowel formant planes for the imitations were generally, but not always, placed between the impersonator's natural voice and the target. The largest resemblance between the formant planes for the imitation and the target was obtained for the amateur, whose imitations were, however, subjectively evaluated as the worst ones. Thus, apart from the acoustical parameters, other factors, such as the qualification and experience of the impersonator, are very important in the realization of impersonation tasks. Keywords: vowel formant planes, impersonators.
1 Introduction Impersonation of a person by means of voice may occur in two very different situations. The first situation concerns public entertainment, when a professional impersonation artist amuses the public by imitating the voices of well-known personalities. The second situation concerns forensic voice identification, when an automatic speaker recognition system used for security purposes may be cheated by a skilful impostor imitating the voice of an authorized person. Thus, the problem of voice mimicry seems to be very interesting for the general public and is very important for law enforcement agencies. In spite of this, the problem of voice imitation has been studied to a rather limited extent. The first study on voice imitation was published in 1971 by Endres, Bambach and Flösser [1]. In this study, vowel formant frequencies and fundamental frequency in original and imitated voices were compared. Although the imitators managed to change their formant and fundamental frequencies in the direction of the target values, they were not able to match or closely approach those of the imitated people. In 1997 Ericson and Wretling [2] examined the timing, fundamental frequency and vowel formant frequencies of three Swedish politicians and their imitations performed by a professional impersonator. The global speech rate and fundamental frequency were mimicked very closely, and the vowel space for two of the three target
voices was intermediate between that of the artist's own voice and the target, but for the third target there was no apparent reduction in the distance. In the same year, Schlichting and Sullivan [3] published the results of subjective speaker recognition indicating that listeners are able to discriminate between a real voice and a professional imitation. However, the imitation led to 100 per cent misidentification in the worst case. In 2006 Zetterholm [4] examined one impersonator and his different voice imitations to gain some insight into the flexibility of the human voice and speech. The results indicated that the impersonator was able to adopt a range of articulatory-phonetic configurations to approximate the target speakers. The authors of the present paper examined selected aspects of voice imitation. In the first study, published in 2005 [5], the results of aural-perceptual voice recognition of Polish personalities and their imitations performed by cabaret entertainers were presented. It was shown that the impersonators were able to fool the listeners, i.e. to convince them that they had heard the target speakers. At the same time, however, a similar number of listeners recognized the imitation. In the subsequent studies, in 2006 [6] mel frequency cepstral coefficients of original speakers and their imitators were presented, while in 2007 [7] speaking fundamental frequency under similar conditions was examined. Finally, in the last study, published in 2008 [8], selected acoustical parameters obtained from speech samples of well-known Polish personalities and their imitations performed by cabaret entertainers were presented and discussed. In the present study the influence of the impersonator's type on voice mimicry is examined. Only one kind of parameter, i.e. the formant frequencies of Polish vowels, was utilized. To be more specific, the two lowest formant frequencies, i.e. F1 and F2, of Polish vowels were measured, drawn in F1-F2 planes and compared to find out how flexible the human voice apparatus is and whether the person of the impersonator and his professional qualifications have an influence on the distribution of formant frequencies in F1-F2 planes and on the distances between these planes for different speakers and different ways of speech production. F1-F2 planes have been applied in the experiment since such patterns are widely used in research on speech acoustics, and F1 and F2 are considered the most important parameters for speech and speaker recognition.
2 Experimental Procedure As the target speakers, three well-known Polish personalities have been selected, whose characteristic voices are relatively easy to imitate for impersonators. The selected target speakers were: Lech Walesa – former president of Poland, Jerzy Urban – editor-in-chief of Polish weekly Nie and Adam Michnik – editor-in-chief of Polish daily Gazeta Wyborcza. The speech samples of over one minute duration were obtained for Lech Walesa from the recordings available in the archives of the Polish radio and for Adam Michnik and Jerzy Urban from the internet. Three different types of impersonators have been employed. The first one was a professional. It was Waldemar Ochnia, one of the best Polish impersonators. He was imitating the voices of Lech Walesa and Jerzy Urban. The second one was a
semi-professional. It was Piotr Gumulec, a cabaret artist, who imitated the voices of Lech Walesa and Adam Michnik. The third one was Lukasz Likus, an amateur, who also imitated the voices of Lech Walesa and Adam Michnik. The test material consisted of speech samples produced by the target speakers and by each impersonator under two speaking conditions: 1) while imitating a given target speaker, 2) while speaking the same text in his natural voice. Thus, for a given target–impersonator pair, the semantic content of all three speech samples was the same. From the audio files under given speaking conditions all six Polish oral vowels, i.e. u, o, a, e, i and y [I], were extracted. Each vowel was represented by nine segments of 40 ms duration taken from particular words spoken in the same context. The vowel segments were extracted from the beginning, middle and end of particular words, from short and long words, and from the first and second part of the spoken text. Such a variety of vowel selection made it possible to consider the influence of the position of the vowel in a word, the influence of short and long words, and the influence of the first and second part of the text on vowel parameters. The two lowest vowel formant frequencies, i.e. F1 and F2, were obtained by means of the Praat program. For each vowel under given speaking conditions, the mean values of F1 and F2 in Hz from all nine realizations of a given vowel were calculated, converted to the Bark scale and used to draw the F1-F2 planes of the six Polish vowels. In addition, an aural-perceptual speaker recognition test was carried out to find out how effective the voice imitations were. The test was performed by a group of 15 listeners with normal hearing who stated that they knew the target speakers. Speech samples of 15 seconds duration produced by the target speakers were presented to the listeners first and then, after a short break, another four speech samples of similar duration for each of the target voices and their imitations were presented in a random order. The only restriction was that an original speech sample and its imitation could not be presented as adjacent stimuli. The task of the listeners was to state whether a given speech sample was an original or an imitation. In the case of an imitation, the quality of the imitation had to be evaluated on a five-point scale (1 – poor imitation, 5 – very good imitation).
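The following minimal sketch illustrates the kind of post-processing described above: averaging nine per-segment formant measurements and converting them to the Bark scale. The formant values are invented, and Traunmüller's approximation is assumed for the Hz-to-Bark conversion, since the paper does not state which formula was used.

```python
# A minimal sketch only: mean F1/F2 of one vowel converted to the Bark scale.
# The nine per-segment measurements below are invented; the measurements in
# the paper were obtained with Praat.
import numpy as np

def hz_to_bark(f_hz):
    # Traunmueller's approximation of the Bark scale (an assumption here).
    f_hz = np.asarray(f_hz, dtype=float)
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

# Hypothetical measurements: nine 40 ms realizations of vowel /a/ (F1, F2 in Hz).
f1_a = [710, 695, 720, 705, 690, 715, 700, 698, 712]
f2_a = [1220, 1190, 1235, 1210, 1205, 1225, 1215, 1200, 1230]

# Mean formant values in Hz, then converted to Bark, giving one point
# of the F1-F2 vowel plane.
f1_bark = hz_to_bark(np.mean(f1_a))
f2_bark = hz_to_bark(np.mean(f2_a))
print(f"/a/: F1 = {f1_bark:.2f} Bark, F2 = {f2_bark:.2f} Bark")
```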
3 Results In Figs. 1-6, F1-F2 planes are drawn on the basis of the mean values of formant frequencies obtained for all the realizations of speech samples produced by the target speakers and their impersonators when they imitate the target speakers and when they speak naturally. In Figs. 1-3, the vowel formant plane for Walesa is presented, accompanied by the vowel formant parameters of the imitation and natural voice of Ochnia (Fig. 1), Gumulec (Fig. 2) and Likus (Fig. 3). Looking at Fig. 1, it may be observed that the professional impersonator was able to change his vowel formant frequencies, in comparison to his natural voice, to be closer to the target voice. His vowel formant plane for the imitation is larger and is generally placed between the planes for his natural voice and the target voice. The obtained results are similar to those presented by Ericson and Wretling [2].
Fig. 1. Vowel formant planes for Walesa (target) and Ochnia (imitation and natural)
Fig. 2. Vowel formant planes for Walesa (target) and Gumulec (imitation and natural)
Fig. 3. Vowel formant planes for Walesa (target) and Likus (imitation and natural)
Fig. 4. Vowel formant planes for Urban (target) and Ochnia (imitation and natural)
Fig. 5. Vowel formant planes for Michnik (target) and Gumulec (imitation and natural)
Fig. 6. Vowel formant planes for Michnik (target) and Likus (imitation and natural)
The situation for the semi-professional impersonator presented in Fig. 2 is not so clear. It may be seen that Gumulec changed his vowel formant frequencies, but it is difficult to say whether the formant plane for the imitation is more similar to the target plane or to his natural plane. Moreover, in contrast to the results obtained by the
professional, the plane for the imitation is smaller than for the natural voice and shifted in the direction of larger values of F1. Still another situation is presented in Fig. 3. First of all, very large changes in formant frequency values between the natural voice and the imitation may be seen. The large effort of the impersonator to change his voice is confirmed by the frequent breaks he made during the imitation to give his voice production apparatus a rest. The F1-F2 plane for his natural voice is the smallest, and the plane for the imitation is very close to the plane for the target voice. In Fig. 4 the vowel formant plane for Urban is presented together with the planes for the imitation and natural voice of the professional impersonator (Ochnia). Similarly to many other cases, in comparison to the imitator's natural voice, the imitation is shifted toward larger values of F1 and somewhat in the direction of larger values of F2. This time, however, the plane for the imitation does not seem to be closer to the plane of the target. In Figs. 5 and 6 the vowel formant plane for Michnik is accompanied by the vowel formant planes for the imitation and natural voice of the semi-professional (Gumulec) (Fig. 5) and the amateur (Likus) (Fig. 6). In Fig. 5 it may again be seen that the imitation is shifted in the direction of larger values of F1 and F2. In Fig. 6 similar tendencies as in Fig. 3 may be seen: there is a large shift between the planes for the imitation and the natural voice of the impersonator, the plane for the natural voice is the smallest, and the planes for the imitation and the target are very close. Since one of the goals of the present study was a comparison between the achievements of different types of impersonators, in Fig. 7 the vowel formant planes have been plotted together for Walesa and all three impersonators employed. Visually, the most similar planes are those for Walesa and the professional (Ochnia). It is interesting to note that the plane for the semi-professional (Gumulec) was smaller than that for the amateur (Likus). The results shown in Fig. 7 have been used to calculate the Euclidean distances between particular vowels. The results of these calculations are plotted in Fig. 8. An interesting observation is that the mean imitation–target distances over all vowels are equal for the professional (Ochnia) and the amateur (Likus), while for the semi-professional (Gumulec) they are substantially larger. As has already been mentioned, the subjective tests of recognizing speech samples as originals or imitations and the evaluation of the quality of impersonation have also been carried out. The listeners' judgments of the perceived stimuli as originals or imitations are presented in Table 1. The evaluation of the quality of imitation on a five-point scale is also given. The effectiveness of recognition of the original target voices by the listeners was high and, expressed in percent, reached 98.3% for Walesa, 90% for Urban and 65% for Michnik, which reflects the public popularity of a given speaker. The effectiveness of imitation presented in the same table was much lower, ranging from 46.7% for the impersonation of Urban by Ochnia (professional) to only 6.7% for the impersonation of Michnik by Likus (amateur). On the basis of the results presented in Table 1 it may be said that, on average, the results of imitation for the semi-professional (Gumulec) were twice as good as for the amateur (Likus) and twice as bad as for the professional (Ochnia).
Thus, the qualification and experience of the impersonator play a major role in impersonation tasks. This observation is confirmed by the evaluation of the quality of
imitation presented in the last six columns of Table 1. The impersonations performed by the amateur were generally judged as poor (1.5 and 1.6 points on average on a five-point scale), those by the semi-professional as satisfactory (2.3 and 2.4 points), and those by the professional as good (3.0 and 3.6 points).
Fig. 7. Vowel formant planes for Walesa (target) and all his three impersonators
Fig. 8. Euclidean distances between the vowels of Walesa and all his three impersonators
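The Euclidean distances of Fig. 8 can be computed directly from the Bark-scaled F1-F2 coordinates; the sketch below uses made-up coordinates purely to illustrate the calculation.

```python
# Sketch with invented coordinates: Euclidean distance between a target's and
# an impersonator's vowels in the Bark-scaled F1-F2 plane, plus the mean
# distance over all six Polish vowels.
import numpy as np

vowels = ["u", "o", "a", "e", "i", "y"]
# (F1, F2) in Bark; the values below are purely illustrative.
target = {"u": (3.1, 6.5), "o": (4.4, 7.8), "a": (6.4, 9.6),
          "e": (5.2, 11.3), "i": (2.9, 13.6), "y": (3.6, 12.1)}
imitation = {"u": (3.4, 6.9), "o": (4.9, 8.1), "a": (6.0, 9.9),
             "e": (5.6, 11.0), "i": (3.3, 13.1), "y": (4.0, 11.7)}

distances = {v: float(np.linalg.norm(np.subtract(target[v], imitation[v])))
             for v in vowels}
print(distances)
print("mean imitation-target distance:", np.mean(list(distances.values())))
```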
Table 1. Distribution of listeners' answers to perceived stimuli

Speech samples        Target [%]  Imitation [%]   Imitation evaluation: 1   2   3   4   5    Mean
Walesa-original       98.3        1.7                                  0   0   0   0   1    5.0
Michnik-original      65.0        35.0                                 1   2   8   2   1    3.7
Urban-original        90.0        10.0                                 0   0   1   3   0    4.2
Walesa by Ochnia      26.7        73.3                                 2   9   22  10  0    3.0
Urban by Ochnia       46.7        53.3                                 1   3   11  9   8    3.6
Walesa by Gumulec     15.0        85.0                                 8   22  18  3   0    2.3
Michnik by Gumulec    16.7        83.3                                 14  10  16  10  0    2.4
Walesa by Likus       8.3         91.7                                 32  18  4   1   2    1.5
Michnik by Likus      6.7         93.3                                 28  22  5   1   8    1.6
4 Conclusions It has been shown that speakers are able to modify their speech production apparatus in the desired direction. The vowel formant planes for the imitations were generally, but not always, placed between the impersonator's natural voice and the target. The largest resemblance between the vowel formant planes for the imitation and the target was obtained for the amateur, whose imitations were, however, subjectively evaluated as the worst ones. This indicates that, apart from the examined acoustical parameters, other factors, such as the qualification and experience of the impersonator, are very important in the realization of impersonation tasks. Acknowledgments. This work was partially supported by COST Action 2102 "Cross-modal Analysis of Verbal and Non-verbal Communication" and by a grant from the Polish Minister of Science and Higher Education (decision no. 115/N-COST/2008/0).
References 1. Endres, W., Bambach, W., Flösser, G.: Voice spectrograms as a function of age, voice disguise and voice imitation. JASA 49, 1842–1848 (1971) 2. Ericson, A., Wretling, P.: How flexible is the human voice – a case study of mimicry. In: Proc. Eurospeech 1997, Rhodes, vol. 2, pp. 1043–1046 (1997) 3. Schlichting, F., Sullivan, K.: The imitated voice – a problem for line-ups? Int. J. Speech, Language and the Law 4, 148–165 (1997) 4. Zetterholm, E.: Same speaker – different voices. A study of one impersonator and some of his different imitations. In: Proc. 11 Australian Int. Conf. on Speech Sci. & Techn., Auckland, pp. 70–75 (2006) 5. Majewski, W.: Aural-perceptual voice recognition of original speakers and their imitators. Archives of Acoustics 30 supplement, 183–186 (2005) 6. Majewski, W.: Mel frequency cepstral coefficients (MFCC) of original speakers and their imitators. Archives of Acoustics 31, 445–449 (2006) 7. Majewski, W.: Speaking fundamental frequency of original speakers and their imitators. Archives of Acoustics 31, 17–23 (2007) 8. Majewski, W., Staroniewicz, P.: Acoustical parameters of target voices and their imitators. Speech and Language Technology 11, 17–23 (2008)
Multimodal Interface Model for Socially Dependent People Rytis Maskeliunas1 and Vytautas Rudzionis2
1 Kaunas University of Technology, Kaunas, Lithuania
2 Vilnius University, Kaunas faculty, Kaunas, Lithuania
[email protected]
Abstract. The paper presents an analysis of a multimodal interface model for socially dependent people. The general requirements for the interface were that it should be as simple and as natural as possible (in principle, such an interface should be a theoretical replacement of a typical "standard" one-finger "joystick" control). The experiments performed allowed us to identify the most often used commands and the expected accuracy level for the selected applications, and to carry out various usability tests. Keywords: multimodal, speech recognition, touch based GUI, human – machine interaction.
1 Introduction Multimodal interfaces have many advantages compared with the more widely used "standard" (single-modality) interfaces. Domains such as health, education, e-government, and e-commerce have a great potential for the application of multimodal interfaces, which could lead to higher efficiency, lower costs, better reliability, accessibility, quality of information content, decentralized communication, etc. [1]. The advantages of multimodal dialogs [2–4] could be exploited even better by designing applications whose primary users will be socially dependent people (the elderly, people with disabilities, technically naive people, etc.) [5–7]. These advantages lie first of all in the fact that such users are often technically naive and have some sort of fear of dealing with new technologies. Interfaces which are as similar as possible to real human-human communication are of great importance for such people. An additional inconvenience is related to the fact that mobile and small-size portable devices usually have small keyboards and screens [8]. This unavoidable design compromise of such devices introduces an additional factor of inconvenience for the development of traditional GUI-type interfaces. Such inconvenience is felt particularly sharply by the elderly and other socially dependent people [9]. Many studies have proven that speech recognition centric interfaces used as the main modality for the control of mobile and portable devices have an enormous market potential and many usability advantages. Spoken commands may be the simplest and the most convenient way to replace a traditional keyboard-based control of portable
devices [10, 11]. In recent years a number of new multimodal services and prototypes have been presented and developed in various countries and in various areas of application [12–15]. Unfortunately, the design of multimodal interfaces still isn't a straightforward task: there are no clear answers as to which spoken commands should be used to achieve the necessary naturalness of the Human–Computer Interface (HCI) and to minimize information access time, which additional modalities should be integrated and when, etc. Our research tries to find the answers to some of those questions.
2 Multimodal Interface Model for Socially Dependent People In this study the main goal was to propose a multimodal interface model to potentially provide easier communication with more and more widespread technical devices and more and more complicated user interfaces. An interface model for socially dependent people has been formulated and several research tasks were established. Among the research tasks were the following:
• How many voice commands should be recognized for efficient performance and the necessary naturalness;
• What is the optimum "length" of a command for human–computer interaction;
• What control modalities should be chosen to perform the selected tasks in the most efficient and easy way;
• The evaluation of the usability and naturalness of such an interface.
The evaluation and consideration of those tasks and the design of a human-machine interface based on these results should in principle lead to a convenient and natural user-oriented interface. It is possible that some of the results could lead to contradictory conclusions. So another aim of this study was to try to determine which factors could be treated as the more important ones when designing this type of human-machine interface. For the experimental evaluation tasks, a demo application for multimodal wheelchair control was developed (continuing from [16]), serving as the basis for the interface efficiency evaluation experiments. The target condition was that this demo application should satisfy speech recognition accuracy requirements (at least 95% recognition accuracy for every voice command), meaning that if some particular voice command cannot guarantee the necessary speaker-independent recognition accuracy level, it should be replaced by another voice command. The design of the application was to provide voice-based input combined with the more traditional GUI-based touch screen interface, with additional video input for added control (gaze and motion recognition). The targeted audience was socially dependent (mostly elderly and disabled) and not computer literate people. It was decided that the control of such an application should be realized on a widespread portable device (smartphone) utilizing traditional touch-based GUIs, providing the user with the possibility to use more than one type of interface for human-machine interaction. The information flowchart of this application is shown in Fig. 1.
Fig. 1. Information flowchart of the demo application
The application architecture was a typical client-side application serving as a connection point with the server-side speech and video processing engines. Users did not have the opportunity to train their voices for better recognition of voice commands or to adapt the speech recognition engine to the characteristic properties of their voices (no "learning curve"). The video processing wasn't used in this evaluation due to time constraints.
Fig. 2. The illustration of the demo application
Touchscreen and haptic (built-in vibro) input capabilities were used to provide a simple GUI for wheelchair control (Fig. 2). The user was given the possibility to confirm or reject a proposition or to point out the direction. It was expected that this capability would be used mainly in cases when speech recognition cannot provide an accurate enough recognition rate.
3 The Experimental Evaluation Several groups of experiments were carried out to evaluate the usability and users' preferences in various modes of operation. The first group of experiments was performed to evaluate the speech recognition accuracy of the voice commands used in the demo application. 20 speakers of different age groups participated in the experiment (speakers of the older age groups prevailed). There were two sets of voice commands used in this experiment. Each set contained 10 Lithuanian voice commands. The commands in each set had in principle the same semantic meanings, but the first set consisted of phonetically more complicated commands while the second one was composed of "simpler" commands. Each speaker pronounced each utterance 50 times, so a total number of 1000 phrases and sentences were used to test voice command recognition accuracy. A proprietary Lithuanian ASR system based on HMM (restricted via GRXML rules) was used for the evaluation. The recognition accuracy of each voice command is presented in Fig. 3. The average recognition accuracy for the first voice command set was 77 %, while for the second one the recognition accuracy was 97 %.
Fig. 3. The recognition accuracy of ten voice commands
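A per-command accuracy of this kind can be computed from a simple log of (spoken command, recognizer output) pairs, as in the illustrative sketch below (the command labels and log entries are invented).

```python
# Illustrative only: per-command and average recognition accuracy from
# (true command, recognized command) pairs; the labels are invented.
from collections import defaultdict

results = [  # hypothetical log entries
    ("forward", "forward"), ("forward", "backward"), ("stop", "stop"),
    ("stop", "stop"), ("left", "left"), ("right", "left"),
]

totals, correct = defaultdict(int), defaultdict(int)
for truth, hypothesis in results:
    totals[truth] += 1
    correct[truth] += int(truth == hypothesis)

for command in totals:
    print(f"{command}: {100.0 * correct[command] / totals[command]:.1f} %")
print("average:", 100.0 * sum(correct.values()) / sum(totals.values()), "%")
```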
These results show that proper design of the voice command vocabulary can lead to a substantial increase in accuracy rate and customer satisfaction (all users said that they preferred to use the second command set despite the fact that it contained commands composed of words used less frequently in everyday speech). We may conclude that users are ready to use voice input if it allows them to achieve a high enough recognition accuracy rate, rather than to use more popular words but face lower recognition accuracy and the necessity to repeat the same commands. It can be seen that the first command set was characterized by a bigger accuracy deviation among different commands (the worst recognition accuracy was only 19 %). Using the second set of commands, 95 % of all users expressed satisfaction with the control capabilities, while only 40 % of users expressed satisfaction using the first set of voice commands. The factor of low recognition accuracy was pointed out as the most irritating by the users.
Another test was carried out using long voice command strings (two different sets, one with higher recognition accuracy than the other), trying to determine the optimum "length" of a command. In this case the same 20 users needed to utter several voice commands in a row (continuous speech) to achieve a predefined task. If all voice commands were recognized correctly, the task was treated as solved; otherwise the task was treated as unsolved and the user was asked to repeat the misrecognized parts. The first set required the use of 4 words (imitating a simple sentence), the second required 8 words (imitating the description of an action) and the third task required 11 words (imitating detailed instructions). The results of this experiment are shown in Fig. 4. In this experiment all users also preferred the second (composed of simpler, less popular words but recognized better) set of voice commands. As expected, the accuracy decreased and the complexity in usability increased the longer the utterance became. An interesting observation was made – the longer the utterance time became, the lower was the satisfaction level (noticeable irritation was expressed by most users). Only 65 % preferred to use voice while still getting quite usable accuracy levels (~80 %) with the long set of 11 commands.
Fig. 4. The recognition accuracy of voice command strings
In the third group of experiments users had the possibility to freely choose whether to use voice commands to control the device or to use the touch screen capabilities to navigate through the menu, invoke the same control capabilities and solve the same tasks. The same 20 speakers took part in these experiments. They were divided into two groups with 10 participants in each group (with a similar composition of ages in both groups). One group used the first set of voice commands (lower recognition accuracy) with the touch screen interface, while the second group used the second set of voice commands (higher recognition accuracy) with the same touch screen GUI. Users in each group were given the same tasks (achievable either by voice commands or by some actions on screen) and were asked after the experiment which communication mode – spoken commands or touch screen navigation – they would treat as the more preferable one. In the first group only 50 % of users said that spoken input was the preferable way of interaction, while in the second group 90 % of users preferred spoken input over touch screen navigation. In both groups 85 % of users confirmed that the combined use of voice commands with touch screen navigation was helpful. 72 % of all elder participants selected speech input as the most attractive way for HCI control. These results are important in the light of our pilot test with more technically skilled younger users having some experience interacting with portable devices in general and touch screen devices in particular. In this group the preference to use spoken commands wasn't expressed so clearly: only 20 % of the younger (technically literate) users said that spoken input was the preferable way of interaction, while the others preferred the touch-based GUI. The overall results (Fig. 5) let us make a preliminary conclusion that a speech centric multimodal interface is of particular importance and convenience for socially dependent users. A detailed evaluation of user satisfaction as a function of the WER will be obtained in the near future (more data will be gathered from a larger number of participants).
Fig. 5. The evaluation of the usability
4 Conclusions 1. A prototype system of a multimodal interface (using speech recognition as the main modality) for socially dependent people has been proposed. The system uses voice commands combined with a touch screen based GUI-like interface intended to be used by socially dependent people. The experiments with two different sets of voice commands showed that nearly all users expressed satisfaction with the application when the average recognition accuracy was 97 %. 2. The observation was made that the longer the utterance became, the lower was the satisfaction level. Only 65 % preferred to use voice while still getting quite usable accuracy levels (~80 %) with the long set of 11 words. 3. Most of the users (85 %) expressed satisfaction at having the possibility to use a multimodal interface (voice commands supplemented with a touch screen based GUI-like interface). 4. The technically naive and socially dependent people got more value from the multimodal and voice based interface than the technically skilled users. Most of the elder users (72 %) said that control using voice commands is the most attractive way for human-machine interaction, while 80 % of younger users preferred the touch based GUI. As expected, lower speech recognition accuracy caused a more frequent use of touch, and vice versa.
Acknowledgments. This research was done under a grant by the Lithuanian Academy of Sciences for the research project "Dialogų modelių, valdomų lietuviškomis balso komandomis, panaudojimo telefoninėse klientų aptarnavimo sistemose analizė", No. 20100701–23.
References 1. Noyes, J.M.: Enhancing mobility through speech recognition technology. IEE Developments in Personal Systems, 4/1–4/3 (1995) 2. Pieraccini, M., Huerta, R.J.: Where do we go from here? Research and Commercial Spoken Dialog Systems. In: Proc. of 6th SIGdial Workshop on Discourse and Dialog, Lisbon, Portugal, pp. 1–10 (2005) 3. Acomb, K., et al.: Technical Support Dialog Systems, Issues, Problems, and Solutions. In: Proc. of the Workshop on Bridging the Gap: Academic and Industrial Research in Dialog Technologies, Rochester, New York, pp. 25–31 (2007) 4. Paek, T., Pieraccini, R.: Automating spoken dialogue management design using machine learning: An industry perspective. Speech Communication, Special Issue on Evaluating New Methods and Models for Advanced Speech-Based Interactive Systems 50(8-9), 716– 729 (2008) 5. Valles, M., et al.: Multimodal environmental control system for elderly and disabled people. In: Proc. of Engineering in Medicine and Biology Society, Amsterdam, vol. 2, pp. 516–517 (1996) 6. Perry, M., et al.: Multimodal and ubiquitous computing systems: supporting independentliving older users. IEEE Transactions on Information Technology in Biomedicine 8(3), 258–270 (2004) 7. Wai, A.A.P., et al.: Situation-Aware Patient Monitoring in and around the Bed Using Multimodal Sensing Intelligence. In: Proc. of Intelligent Environments, Kuala Lampur, pp. 128–133 (2010) 8. Ishikawa, S.Y., et al.: Speech-activated text retrieval system for multimodal cellular phones. In: Proc. of Acoustics, Speech, and Signal Processing, vol. 1, pp. I-453–I-456 (2004) 9. Verstockt, S., et al.: Assistive smartphone for people with special needs: The Personal Social Assistant. In: Proc. of Human System Interactions, Catania, pp. 331–337 (2009) 10. Oviatt, S.: User-centered modeling for spoken language and multimodal interfaces. IEEE Multimedia 3(4), 26–35 (1996) 11. Deng, L., et al.: A speech-centric perspective for human-computer interface. In: Proc. of Multimedia Signal Processing 2002, pp. 263–267 (2002) 12. Zhao, Y.: Speech-recognition technology in health care and special-needs assistance (Life Sciences). Signal Processing Magazine 26(3), 87–90 (2009) 13. Sherwani, J., et al.: Speech vs. touch-tone: Telephony interfaces for information access by low literate users. In: Proceedings of Information and Communication Technologies and Development, Doha, pp. 447–457 (2009) 14. Motiwalla, L.F.: Jialun Qin. Enhancing Mobile Learning Using Speech Recognition Technologies: A Case Study. In: Management of eBusiness 2007, Toronto, pp. 18–25 (2007) 15. Sherwani, J., et al.: HealthLine: Speech-based Access to Health Information by Lowliterate Users. In: Proc. of Information and Communication Technologies and Development, Bangalore, pp. 1–9 (2007) 16. Maskeliunas, R.: Modeling Aspects of Multimodal Lithuanian Human - Machine Interface. In: Esposito, A., Hussain, A., Marinaro, M., Martone, R. (eds.) Multimodal Signals, COST Seminar 2008. LNCS (LNAI), vol. 5398, pp. 75–82. Springer, Heidelberg (2009)
Score Fusion in Text-Dependent Speaker Recognition Systems Jiří Mekyska1, Marcos Faundez-Zanuy2, Zdeněk Smékal1, and Joan Fàbregas2
1 Signal Processing Laboratory, Department of Telecommunications, Faculty of Electrical Engineering and Communication, Brno University of Technology, Brno, Czech Republic
[email protected], [email protected]
2 Escola Universitària Politècnica de Mataró, Barcelona, Spain
{faundez,fabregas}@tecnocampus.com
Abstract. Owing to some significant advantages, text-dependent speaker recognition is still widely used in biometric systems. In comparison with text-independent systems, these systems are more accurate and resistant against replay attacks. There are many approaches to text-dependent recognition. This paper introduces a combination of classifiers based on fractional distances, a biometric dispersion matcher and dynamic time warping. The first two classifiers are based on a voice imprint. They have low memory requirements, and the recognition procedure is fast. This is advantageous especially in low-cost biometric systems supplied by batteries. It is shown that, using the trained score fusion, it is possible to reach a successful detection rate equal to 98.98 %, and 92.19 % in the case of microphone mismatch. During verification, the system reached an equal error rate of 2.55 %, and 6.77 % when assuming microphone mismatch. The system was tested using a Catalan database which consists of 48 speakers (three 3 s training samples per speaker). Keywords: Text-dependent speaker recognition, Voice imprint, Fractional distances, Biometric dispersion matcher, Dynamic time warping.
1 Introduction
Speaker identification is a task where the system tries to answer the question "Who is speaking?" Many behavioral biometric systems are based on this task because, due to the individual shape of the vocal tract and the manner of speaking (intonation, loudness, rhythm, accent, etc.), it is possible to distinguish between speakers. These systems can serve as gates to secured areas, as authentication systems in the field of banking, or as simple systems which provide access to private lifts (e. g. in hospitals). Generally, it is possible to divide these systems into text-dependent and text-independent recognition systems. In the case of text-dependent systems, the speaker has to utter exactly the required phoneme, word or sentence. This utterance can be repeated during all recognitions, or it can be an utterance randomly chosen
(e. g. a sequence of digits). The utterance can also serve as a password, so that it is known only by the target speaker and the system. In the case of text-independent speaker recognition, the speaker is recognized independently of the utterance. This is advantageous in more general cases, where it is not possible to force the speaker to utter an exact sequence of phonemes. However, text-independent systems are not as accurate as text-dependent ones and they usually require a lot of training data, which is not always available. Moreover, text-dependent systems can be resistant against replay attacks when randomly chosen utterances from a large set are used.
1.1 Low-Cost Text-Dependent Speaker Recognition
The state-of-the-art text-independent speaker recognition systems are usually based on GMM–UBM (Gaussian Mixture Model – Universal Background Model) [14] or SVM [2], [3]. A deeper overview of these systems can be found in [10]. In [12] we introduced text-dependent speaker recognition in low-cost biometric systems based on the voice imprint. Although the prices of memories and the computational burden have rapidly decreased, there are still some cases where low-memory and low-computation requirements are necessary. For example, in biometric systems based on sensor nets there can be dozens of sensors which are switched on from the standby state just for the purpose of recognition and then switched off again. These sensors are usually supplied by batteries, therefore the computational burden during recognition must be very low. Moreover, these sensors do not have big memories for the storage of large data. During the recognition procedure, the system used the classifiers DTW (Dynamic Time Warping), FD (Fractional Distances) and BDM (Biometric Dispersion Matcher). It has been shown that, using FD, the system fulfilled the requirements of a low-cost biometric system: 1. low memory needed for the speakers' models and the procedure of recognition, 2. training using just a limited number of samples (i. e. 2 – 3 samples lasting app. 3 s), 3. fast identification/verification (less than 100 ms for a database which consists of app. 50 speakers' models). This research focused on the improvement of recognition accuracy. In [12] it was shown that, using the fractional distances along with the voice imprint, it is possible to reach a successful detection rate equal to 96.94 %, and 82.29 % in the case of microphone mismatch. During verification, the system reached an equal error rate of 3.93 %, and 10.43 % when assuming microphone mismatch. This research shows that, using a different combination of classifiers, it is possible to reach better results. This paper is organized as follows. Section 2 describes the process of calculating the voice imprint used by FD and BDM. Section 3 mentions a well-known classifier used for text-dependent speaker recognition and introduces a new application of other classifiers which are usually applied in other fields of recognition (e. g. hand-writing or face recognition). Section 4 is devoted to the main experimental results and conclusions.
2 Voice Imprint
To describe the procedure of voice imprint calculation, first consider the feature matrix Λ, where each column is related to a signal segment (20 – 30 ms) and the nth row to the nth feature. In this work the imprint was based on MFCC [15], LPCC [17], PLP [9], CMS [11], ACW [5] and combinations of these features. These features were also extended by 1st order and 2nd order regression coefficients. The disadvantage of the matrix Λ is that it has two dimensions, some coefficients in the matrix can be irrelevant, and the number of vectors in the matrix can vary for each speaker and sentence. If the DCT is applied to Λ in the horizontal direction, it concentrates the energy in a few coefficients; moreover, if the original coefficients were already less correlated in the vertical direction, the energy will be concentrated in one corner of the matrix (this effect is similar to the JPEG image format, where the two-dimensional DCT concentrates the energy in one corner of the image matrix). This process is illustrated in Fig. 1. Picture b) represents the matrix of LPCC and picture c) shows its DCT.
Fig. 1. Procedure of voice imprint calculation: a) spectrogram; b) matrix Λ of clpcc [n]; c) DCT {Λ}; d) voice imprint cprnt [n] (fs = 16 kHz, NFFT = 2048, p = 20, Hamming window with size 20 ms and overlap 10 ms)
To obtain a one-dimensional signal from the matrix Λ, the coefficients can be read from the matrix in different ways. They can be read zig-zag as in JPEG, by columns, or by rows. Fig. 1 d) shows an example of coefficients read by columns. The first DC coefficient is not used, because it usually carries no important information about the speaker. To obtain the same length of this one-dimensional signal for each speaker, it is simply multiplied by a rectangular window. At this point the voice imprint c_prnt[n] can be defined. In the remainder of this work, we will consider the voice imprint as:

\[
c_{\mathrm{prnt}}[n] = w[n] \cdot r\left(\mathrm{DCT}\{\Lambda\}\right), \tag{1}
\]
\[
w[n] = \begin{cases} 1, & \text{for } n = 0, 1, 2, \dots, N_v - 1,\\ 0, & \text{otherwise,} \end{cases} \tag{2}
\]
where the function r(M) represents reading from the matrix M and N_v is the length of the voice imprint. If r(DCT{Λ}) is shorter than N_v, it should be padded with zeros. By the value of N_v we can also limit the number of coefficients that are important for speaker recognition.
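A minimal sketch of the voice imprint computation of eqs. (1)-(2) is given below, assuming a feature matrix with one column per frame; the DCT type and normalization are assumptions, as the paper does not specify them.

```python
# Sketch of eqs. (1)-(2), assuming a feature matrix "Lambda" with one column
# per frame (e.g. LPCC).  The DCT type and normalization are assumptions.
import numpy as np
from scipy.fft import dct

def voice_imprint(Lambda, n_v=350):
    # DCT along the time (horizontal) direction compacts the energy.
    transformed = dct(Lambda, type=2, axis=1, norm="ortho")
    # Read the matrix column by column and drop the first (DC) coefficient.
    coeffs = transformed.flatten(order="F")[1:]
    # Rectangular window w[n]: keep exactly n_v coefficients, zero-padded.
    imprint = np.zeros(n_v)
    imprint[:min(n_v, coeffs.size)] = coeffs[:n_v]
    return imprint

Lambda = np.random.randn(20, 150)   # placeholder: 20 features x 150 frames
print(voice_imprint(Lambda).shape)  # -> (350,)
```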
3 Classifiers
The proposed system uses just a limited number of training samples, which means 2 – 3 samples. It is assumed that these samples last app. 3 s. Under this assumption it is not suitable to use statistical methods like GMM (Gaussian Mixture Models) [1], [14], [13], HMM (Hidden Markov Models) [1] or ANN (Artificial Neural Networks) [18], because these classifiers need a lot of training data to achieve a good performance. However, one statistical classifier does not have this disadvantage: BDM (Biometric Dispersion Matcher), which will be described in depth in sec. 3.2 [6], [7].
3.1 Template Matching Methods
The voice imprint is a 1D signal with a fixed number of coefficients, thus it is possible to use a template matching method during the classification. One representative of these methods is a classifier based on fractional distances (FD). This classifier was successfully tested on on-line signature recognition in [16], and in [12] it was shown that this classifier also works well in the field of text-dependent speaker recognition. Assume that we have one input voice imprint c_Iprnt[n] and one reference voice imprint c_Rprnt[n], and that their lengths are the same and equal to N_v. Then the distance between these two imprints, d(c_Iprnt, c_Rprnt), is calculated according to the equation [16]:

\[
d\left(c_{\mathrm{Iprnt}}, c_{\mathrm{Rprnt}}\right) = \left( \sum_{n=0}^{N_v-1} \left| c_{\mathrm{Iprnt}}[n] - c_{\mathrm{Rprnt}}[n] \right|^{k} \right)^{\frac{1}{k}}, \tag{3}
\]
where, according to [16], k is recommended to be set to around 0.4. This has the effect that distances between small coefficients also contribute significantly to the final distance
d(c_Iprnt, c_Rprnt). This is useful especially when these small values are important for speaker recognition, which is typical for the voice imprint. In this case the Euclidean distance is not suitable, because its k = 2. Another template matching method is DTW (Dynamic Time Warping) [8]. However, it is important to highlight that in this work DTW was used along with the feature matrix Λ, not with the voice imprint c_prnt[n].
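A small sketch of the fractional distance of eq. (3), with k = 0.5 (the value later used in sec. 4.1), is shown below; the random imprints are placeholders.

```python
# Sketch of the fractional distance of eq. (3); k = 2 would give the Euclidean
# distance instead.  The imprints here are random placeholders.
import numpy as np

def fractional_distance(c_i, c_r, k=0.5):
    c_i, c_r = np.asarray(c_i, float), np.asarray(c_r, float)
    return np.sum(np.abs(c_i - c_r) ** k) ** (1.0 / k)

rng = np.random.default_rng(0)
input_imprint = rng.standard_normal(350)
reference_imprint = rng.standard_normal(350)
print(fractional_distance(input_imprint, reference_imprint))
```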
3.2 Biometric Dispersion Matcher
In [6] J. Fàbregas and M. Faundez-Zanuy proposed a new classifier called the biometric dispersion matcher (BDM). This classifier can be used with advantage in biometric systems where just a few training samples per person exist. Instead of using one model per person, BDM trains a quadratic discriminant classifier (QDC) that distinguishes only between two classes: E (pairs of patterns corresponding to the same class) and U (pairs of patterns corresponding to different classes) [6]. Using BDM, it is possible to solve the simple dichotomy: "Do the two feature vectors belong to the same speaker?" Consider that c is the number of speakers, m is the number of samples taken from each speaker, x_ij is the j-th sample feature column vector of speaker i (in our case, each vector can be represented by the voice imprint c_prnt[n]), p is the dimension of each feature vector x_ij and δ ∈ R^p is the difference of two feature vectors; then the quadratic discriminant function g(δ) that solves the dichotomy can be described according to [6]:

\[
g(\delta) = \frac{1}{2}\,\delta^{T}\left(\mathbf{S}_U^{-1} - \mathbf{S}_E^{-1}\right)\delta + \frac{1}{2}\ln\frac{|\mathbf{S}_U|}{|\mathbf{S}_E|}, \tag{4}
\]
where S_U and S_E are the covariance matrices corresponding to the classes E and U. The matrices can be calculated according to the formulas [6]:

\[
\mathbf{S}_E = \xi\,(c-1) \sum_{i=1}^{c} \sum_{j,l=1}^{m} \left(\mathbf{x}_{ij} - \mathbf{x}_{il}\right)\left(\mathbf{x}_{ij} - \mathbf{x}_{il}\right)^{T}, \tag{5}
\]
\[
\mathbf{S}_U = \xi \sum_{\substack{i,k=1\\ i \neq k}}^{c} \sum_{j,l=1}^{m} \left(\mathbf{x}_{ij} - \mathbf{x}_{kl}\right)\left(\mathbf{x}_{ij} - \mathbf{x}_{kl}\right)^{T}, \tag{6}
\]
\[
\xi = \frac{1}{c\,m^{2}\,(c-1)}. \tag{7}
\]
If g(δ) ≥ 0, then δ ∈ E, which means that the two patterns (or voice imprints) belong to the same speaker; otherwise they belong to different speakers. The BDM has three important advantages:
1. Comparing the dispersions of the distributions of E and U, BDM performs feature selection. Only features with a quotient of the standard deviations σ_E/σ_U smaller than a fixed threshold are selected.
2. When a new speaker is added to the system, no new model has to be trained. Only two models are trained at the beginning. According to these models we decide whether two patterns come from the same speaker or not.
3. Most verification systems set the threshold θ a posteriori in order to minimize the equal error rate R_EER. This is an unrealistic situation, because systems need to fix θ in advance. BDM has the threshold set a priori and is still comparable to the state-of-the-art classifiers [6], [7].
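The following sketch shows one direct (unoptimized) way to estimate S_E and S_U from eqs. (5)-(7) and to evaluate g(δ) from eq. (4); the toy data and dimensions are invented, and no feature selection is performed.

```python
# Sketch of eqs. (4)-(7): estimating S_E and S_U from training imprints
# (X[i][j] = j-th sample of speaker i) and evaluating g(delta) for a pair.
import numpy as np

def train_bdm(X):
    c, m, p = X.shape                      # speakers, samples/speaker, features
    xi = 1.0 / (c * m**2 * (c - 1))        # eq. (7)
    S_E = np.zeros((p, p))
    S_U = np.zeros((p, p))
    for i in range(c):
        for j in range(m):
            for l in range(m):
                d = X[i, j] - X[i, l]                  # same-speaker pair
                S_E += np.outer(d, d)
                for k in range(c):
                    if k != i:
                        d_u = X[i, j] - X[k, l]        # different-speaker pair
                        S_U += np.outer(d_u, d_u)
    return xi * (c - 1) * S_E, xi * S_U                # eqs. (5) and (6)

def g(delta, S_E, S_U):
    # Quadratic discriminant of eq. (4); g >= 0 -> same speaker.
    a = np.linalg.inv(S_U) - np.linalg.inv(S_E)
    _, logdet_u = np.linalg.slogdet(S_U)
    _, logdet_e = np.linalg.slogdet(S_E)
    return 0.5 * delta @ a @ delta + 0.5 * (logdet_u - logdet_e)

X = np.random.randn(5, 3, 10)    # toy data: 5 speakers, 3 samples each, p = 10
S_E, S_U = train_bdm(X)
print(g(X[0, 0] - X[0, 1], S_E, S_U) >= 0)
```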
4 Experimental Results
The text-dependent recognition system based on FD, BDM and DTW was tested using a corpus which consists of 48 bilingual speakers (24 males and 24 females) who were recorded in 4 sessions. The delay among the first three sessions is one week; the delay between the 3rd and 4th session is one month. The speakers uttered digits, sentences and text in the Spanish and Catalan languages. The speech signals were sampled at fs = 16 kHz and recorded using three microphones: AKG C420, AKG D46S and SONY ECM 66B. Each speech sample is labeled M1 – M8; the meaning of these labels is described in Tab. 1.

Table 1. Notation of speech corpus

Lab.  Sess.  Microphone      Lab.  Sess.  Microphone
M1    1      AKG C420        M5    2      AKG D46S
M2    2      AKG C420        M6    3      SONY ECM 66B
M3    3      AKG C420        M7    4      AKG C420
M4    1      AKG D46S        M8    4      SONY ECM 66B
During the evaluation, the classifier was trained using three samples and was tested with the remaining one. There are four different possibilities of testing:
1. Training by sessions 2, 3, 4 and testing by session 1.
2. Training by sessions 1, 3, 4 and testing by session 2.
3. Training by sessions 1, 2, 4 and testing by session 3.
4. Training by sessions 1, 2, 3 and testing by session 4.
It is obvious that after testing there are 4 confusion matrices. According to these matrices, the successful detection rate RS [%], the equal error rate REE [%] and the minimum of the detection cost function min(FDCF) [%] were calculated. The whole testing procedure was divided into two scenarios. In the first scenario SC1, only the samples (sessions) recorded by the microphone AKG C420 were selected. To evaluate the system under mismatched conditions, samples from 4 different sessions recorded by 3 different microphones (AKG C420, AKG D46S, SONY ECM 66B) were selected in the second scenario SC2. Two of these sessions were recorded by the AKG C420. We decided not to use just 3 sessions recorded by 3 different microphones, because at the beginning of the use of the system the probability that the actual signal is recorded by the same microphone as the reference signal is higher.
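Although the paper does not describe how REE was computed, a common way to estimate the equal error rate from genuine and impostor distance scores is sketched below.

```python
# Not from the paper: a common way to estimate the equal error rate (EER) from
# genuine and impostor distance scores (smaller distance = better match).
import numpy as np

def equal_error_rate(genuine, impostor):
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([np.mean(impostor <= t) for t in thresholds])  # false accepts
    frr = np.array([np.mean(genuine > t) for t in thresholds])    # false rejects
    idx = np.argmin(np.abs(far - frr))
    return 100.0 * (far[idx] + frr[idx]) / 2.0

rng = np.random.default_rng(0)
genuine_scores = rng.normal(1.0, 0.3, 200)     # toy intra-speaker distances
impostor_scores = rng.normal(2.0, 0.5, 2000)   # toy inter-speaker distances
print(f"EER = {equal_error_rate(genuine_scores, impostor_scores):.2f} %")
```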
4.1 Settings of Classifiers and Features
The classifiers' settings were found empirically so that they provide good results, but all these settings, along with the suitable selection of features, affect the final results, and it is possible that other settings are better for the classification. For this purpose it would be better to use some kind of optimization or genetic algorithm. The settings used in this work are listed below:
– FD – the coefficient used for the calculation of the distance (see sec. 3.1) was k = 0.5; the first DC coefficient of c_prnt[n] was removed; one template voice imprint was calculated as the mean of the three training imprints.
– BDM – the first DC coefficient of c_prnt[n] was removed; the threshold used for the feature selection (see sec. 3.2) was set to the value 0.24.
– DTW – in the case of this classifier the matrices of features Λ were used.
Regarding the features, it was decided to use these representatives and their combinations: MFCC, PLP, LPCC, CMS, ACW, MFCC+LPCC+ACW. Each signal was trimmed using a VAD and consequently filtered by a 1st order high-pass filter with α = 0.95. During the feature extraction, a Hamming window with size 25 ms (400 samples) and overlap 10 ms (160 samples) was used. In the case of BDM, the length of the voice imprint was Nv = 257 (since the first DC coefficient was removed, Nv = 256; this length was better for calculation). In the case of FD with MFCC, PLP, LPCC, CMS and ACW, the length of the voice imprint was set to Nv = 350. In the case of FD with MFCC+LPCC+ACW, Nv = 1050.
4.2 Score Fusion
Higher recognition accuracy can be reached using score fusion. As was written in sec. 4, after the test procedure there are 4 matrices with particular distances as the output of each classifier. If these distances are considered as scores, we can combine them into the final matrix D according to the equation:

\[
d_{i,j} = \sum_{p=1}^{P} a_p \cdot c_{\mathrm{norm}}^{p} \cdot d_{i,j}^{p}, \tag{8}
\]
where d_{i,j} is the final distance in the i-th row and j-th column of D, d^p_{i,j} is the distance in the i-th row and j-th column of the p-th score matrix, P is the number of all score matrices, a_p ∈ [0, 1] is the weight of the p-th score matrix and c^p_norm is a normalization coefficient of the p-th score matrix. Each classifier can generate a different range of distances. For example, distances calculated by DTW can generally be more than a thousand times higher than in the case of FD. Therefore it is necessary to first normalize all values in the matrices, so that distances from the different classifiers have similar weights. This can be done using the normalization coefficient c^p_norm. If we add the diagonals of all score matrices D^p generated by the same p-th classifier to one vector v_p, then c^p_norm can be calculated according to:

\[
c_{\mathrm{norm}}^{p} = \frac{1}{\mathrm{mean}\left(\mathbf{v}_p\right)}. \tag{9}
\]
The coefficient a_p determines how much the p-th distance participates in the final score. If there are many classifiers, the best values of a_p can be found using some optimization method (e.g. hill climbing, genetic algorithms). If there are only 2 classifiers, it is possible to use one coefficient a according to

d_{i,j} = a \cdot c_{\mathrm{norm}}^{1} \cdot d_{i,j}^{1} + (1 - a) \cdot c_{\mathrm{norm}}^{2} \cdot d_{i,j}^{2}.    (10)

Changing the value of a from 0 to 1 with a step of 0.01, it is possible to find the best combination.
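A minimal sketch of this fusion scheme, directly following Eqs. (8)–(10) and using hypothetical pre-computed distance matrices, could look as follows (for simplicity the normalization coefficient is taken from the diagonal of a single score matrix rather than from all matrices of one classifier):

```python
# Score fusion sketch for two classifiers (e.g. DTW and FD), following Eqs. (8)-(10).
import numpy as np

def norm_coeff(score_matrix):
    # Eq. (9): c_norm = 1 / mean of the diagonal (matched-trial) distances
    return 1.0 / np.mean(np.diag(score_matrix))

def fuse(d1, d2, a):
    # Eq. (10): weighted sum of the normalized distance matrices
    return a * norm_coeff(d1) * d1 + (1.0 - a) * norm_coeff(d2) * d2

def sweep_weight(d1, d2, evaluate, step=0.01):
    # Try a = 0.00, 0.01, ..., 1.00 and keep the weight with the best figure of merit
    return max((evaluate(fuse(d1, d2, a)), a)
               for a in np.arange(0.0, 1.0 + step, step))

# Placeholder distance matrices (rows: test samples, columns: reference speakers)
rng = np.random.default_rng(1)
d_dtw, d_fd = rng.random((50, 50)) * 1000.0, rng.random((50, 50))
rate = lambda d: np.mean(np.argmin(d, axis=1) == np.arange(d.shape[0]))  # identification rate
print(sweep_weight(d_dtw, d_fd, rate))   # (best rate, best weight a)
```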
Table 2. System performance in both scenarios

                                        (SC1)                            (SC2)
Cl.   Features                 RS [%]  REE [%]  FDCF^b [%]     RS [%]  REE [%]  FDCF [%]
FD    MFCC+Δ+Δ2                 93.37    7.14      5.82         81.77   15.10    12.77
      PLP+Δ+Δ2                  91.33    8.16      6.53         72.92   16.67    14.23
      LPCC+Δ+Δ2                 94.90    5.10      4.35         77.08   13.54    12.32
      CMS+Δ+Δ2                  89.80    8.16      7.54         80.21   14.06    12.27
      ACW+Δ+Δ2                  96.94    4.08      3.93         81.77   10.94    10.43
      MFCC+LPCC+ACW (+Δ+Δ2)     96.43    5.61      4.62         82.29   13.54    11.78
BDM   MFCC+Δ                    86.22    8.67      7.33         58.85   25.00    17.98
      PLP+Δ                     86.73    7.14      7.08         53.65   25.52    17.92
      LPCC+Δ                    83.16    8.67      8.18         38.02   34.38    26.22
      CMS+Δ                     58.16   25.51     18.45         35.42   37.50    27.38
      ACW+Δ                     84.69    7.65      7.23         37.50   35.94    25.10
      MFCC+LPCC+ACW             62.76   27.55     19.03         66.15   17.19    16.37
DTW   MFCC+Δ+Δ2                 96.94    4.08      3.82         82.81   11.46    10.40
      PLP+Δ+Δ2                  92.86    8.67      5.71         76.04   13.54    11.73
      LPCC+Δ+Δ2                 96.94    3.06      2.52         85.94   12.50    10.77
      CMS+Δ+Δ2                  97.96    4.08      3.14         91.67    6.77     6.56
      ACW+Δ+Δ2                  98.47    2.55      1.95         89.06    9.90     9.42
      MFCC+LPCC+ACW (+Δ+Δ2)     96.94    4.08      3.66         82.81   11.46    10.07

^b The values of FDCF are considered as the minimum of this function.
Table 3. Score fusion using the equal weights

                                         (SC1)                            (SC2)
Cl.           Features^a        RS [%]  REE [%]  FDCF^b [%]     RS [%]  REE [%]  FDCF [%]
BDM–DTW       MFCC               96.94    4.08      3.80         82.81   11.46    10.39
              PLP                92.86    8.67      5.71         76.04   13.54    11.72
              LPCC               96.94    3.06      2.52         85.94   12.50    10.77
              CMS                97.96    4.08      3.14         91.67    6.77     6.56
              ACW                98.47    2.55      1.96         89.06    9.90     9.41
              MFCC+LPCC+ACW      77.04   18.37     11.80         82.81   11.46    10.11
DTW–FD        MFCC               97.45    4.08      3.84         83.85   11.46    10.16
              PLP                96.94    6.12      4.88         81.25   11.98    11.11
              LPCC               96.94    3.57      2.76         85.42   10.42     9.71
              CMS                96.94    4.08      4.02         89.06    9.90     8.23
              ACW                98.47    4.08      3.01         89.06    9.90     8.58
              MFCC+LPCC+ACW      98.47    4.59      3.07         85.42   10.42     9.77
FD–BDM        MFCC               93.37    7.14      5.81         81.77   15.10    12.77
              PLP                91.33    8.16      6.52         72.92   16.67    14.23
              LPCC               94.90    5.10      4.35         77.08   13.54    12.32
              CMS                89.80    8.16      7.54         80.21   14.06    12.27
              ACW                96.94    4.08      3.93         81.77   10.94    10.43
              MFCC+LPCC+ACW      76.02   17.86     11.81         82.29   13.54    11.78
FD–BDM–DTW    MFCC               97.45    4.08      3.84         83.85   11.46    10.15
              PLP                96.94    6.12      4.88         81.25   11.98    11.11
              LPCC               96.94    3.57      2.76         85.42   10.42     9.71
              CMS                96.94    4.08      4.02         89.06    9.90     8.23
              ACW                98.47    4.08      3.01         89.06    9.90     8.58
              MFCC+LPCC+ACW      77.55   14.80     10.72         85.42   10.42     9.77

^a The features were extended by 1st and 2nd order regression coefficients.
^b The values of FDCF are considered as the minimum of this function.

4.3 System Evaluation
Tab. 2 shows the results of identification and verification when only one classifier was used. Tab. 3 provides the results for the combinations BDM–DTW, DTW–FD, FD–BDM and FD–BDM–DTW; in this case all coefficients a_p were set to 1. Comparing the values in both tables, it is clear that BDM does not provide any improvement. Only DTW–FD can increase the accuracy; therefore, the next part of the research focuses on this combination. Different values of the coefficient a were also used. Some characteristics showing the change of RS, REE and min(FDCF) depending on the weight a can be seen in Fig. 2 (Eq. (10) was used for the score fusion calculation). Fig. 2 takes into account only the score fusion of distances calculated using the same features. However, it is also possible to take the first distance from DTW, based on one feature (e.g. MFCC), and the second distance from FD, based on another feature (e.g. ACW). The results of RS, REE and min(FDCF) for different feature combinations can be found in Tab. 4.
Fig. 2. Change of characteristics depending on the weight a (combination DTW–FD): a) RS in SC1; b) RS in SC2; c) REE in SC1; d) REE in SC2; e) min(FDCF) in SC1; f) min(FDCF) in SC2
If Tab. 2 is compared with Tab. 4, it can be seen that, using the score fusion of DTW and FD, it is possible to increase RS from 98.47 % to 98.98 % and to decrease min(FDCF) from 1.95 % to 1.90 % in SC1. In the case of SC2, it is possible to increase RS from 91.67 % to 92.19 %, and to decrease min(FDCF) from 6.56 % to 6.36 %.
Table 4. Best results of score fusion using different features and different values of the weight a (these results are obtained by the combination of DTW and FD). The values of RS, REE and FDCF are expressed in %

                                              (SC1)                                    (SC2)
Feat. comb.^a                  RS      a    REE     a    FDCF^b    a      RS      a    REE     a    FDCF     a
MFCC–MFCC                    97.45  0.55   4.08  0.01    3.24   0.00    86.46  0.58  10.42  0.04    9.88  0.02
MFCC–PLP                     97.96  0.69   4.08  0.01    3.44   0.01    85.42  0.61  10.94  0.03   10.01  0.00
MFCC–LPCC                    98.47  0.62   3.57  0.00    3.24   0.00    85.94  0.57  10.42  0.05    9.34  0.00
MFCC–CMS                     98.47  0.68   3.57  0.02    3.39   0.00    84.90  0.51  10.42  0.00    9.17  0.00
MFCC–ACW                     98.47  0.70   3.57  0.26    2.76   0.00    86.46  0.49   9.38  0.96    8.37  0.06
MFCC–MFCC+LPCC+ACW           98.47  0.58   4.08  0.05    3.10   0.00    88.02  0.58  10.42  0.01    9.57  0.02
PLP–MFCC                     97.45  0.54   5.10  1.00    4.28   0.00    83.33  0.49  10.94  0.04    9.62  0.00
PLP–PLP                      96.94  0.52   5.10  1.00    4.85   0.01    82.29  0.57  11.46  0.03   10.49  0.00
PLP–LPCC                     97.96  0.54   3.57  1.00    3.39   1.00    84.38  0.46  10.94  0.06    9.38  0.00
PLP–CMS                      96.94  0.68   5.10  1.00    4.45   0.00    84.90  0.55  10.94  0.00    8.93  0.00
PLP–ACW                      97.45  0.32   3.57  1.00    3.05   1.00    86.98  0.41   9.38  1.00    8.65  0.98
PLP–MFCC+LPCC+ACW            97.96  0.42   4.59  1.00    3.75   1.00    84.90  0.51  10.94  0.50    9.47  0.50
LPCC–MFCC                    97.96  0.41   2.55  0.01    2.10   0.00    87.50  0.64  10.42  0.04    9.42  0.00
LPCC–PLP                     97.45  0.67   2.55  0.01    2.29   0.00    87.50  0.71  10.42  0.04    9.70  0.00
LPCC–LPCC                    97.45  0.34   2.55  0.00    2.37   0.00    86.46  0.75   9.90  0.07    9.26  0.00
LPCC–CMS                     97.96  0.48   3.06  0.02    2.33   0.00    86.98  0.73  10.42  0.00    8.58  0.00
LPCC–ACW                     97.45  0.37   2.55  0.05    2.10   0.00    86.46  0.61   8.85  1.00    8.47  1.00
LPCC–MFCC+LPCC+ACW           97.45  0.83   2.55  0.05    2.18   0.00    87.50  0.57  10.42  0.00    9.21  0.05
CMS–MFCC                     97.96  0.59   3.06  0.01    2.57   0.00    92.19  0.59   6.77  0.03    6.42  0.00
CMS–PLP                      97.96  0.98   3.06  0.01    2.63   0.03    91.67  0.94   6.77  0.04    6.54  0.00
CMS–LPCC                     97.96  0.63   3.06  0.00    2.47   0.00    91.67  0.93   6.77  0.03    6.52  0.00
CMS–CMS                      97.96  0.69   3.06  0.02    2.81   0.00    92.19  0.85   6.77  0.03    6.49  0.00
CMS–ACW                      98.47  0.71   2.55  0.11    2.47   0.00    92.19  0.62   6.77  0.02    6.36  0.00
CMS–MFCC+LPCC+ACW            97.96  0.91   3.06  0.06    2.34   0.00    92.19  0.76   6.77  0.01    6.55  0.00
ACW–MFCC                     98.47  0.82   2.55  0.01    1.95   0.00    90.10  0.77   9.38  0.02    8.17  0.00
ACW–PLP                      98.47  1.00   2.55  0.01    1.95   0.00    89.58  0.83   9.38  0.03    8.51  0.00
ACW–LPCC                     98.98  0.93   2.55  0.00    1.94   0.00    89.58  0.80   9.38  0.07    8.50  0.00
ACW–CMS                      98.98  0.87   2.55  0.02    1.90   0.00    90.10  0.79   8.85  0.00    7.83  0.00
ACW–ACW                      98.98  0.86   2.55  0.26    1.95   0.00    90.10  0.75   8.85  0.17    8.11  0.01
ACW–MFCC+LPCC+ACW            98.47  0.81   2.55  0.02    1.95   0.00    89.58  0.81   8.85  0.00    8.32  0.04
MFCC+LPCC+ACW–MFCC           97.96  0.63   4.08  0.01    3.17   0.00    86.46  0.58  10.42  0.04    9.93  0.02
MFCC+LPCC+ACW–PLP            97.96  0.71   4.08  0.01    3.37   0.01    85.42  0.65  10.94  0.03    9.96  0.00
MFCC+LPCC+ACW–LPCC           98.47  0.62   3.57  0.00    3.11   0.00    85.94  0.64  10.42  0.05    9.38  0.00
MFCC+LPCC+ACW–CMS            98.47  0.70   3.57  0.02    3.26   0.00    85.42  0.50  10.42  0.00    9.16  0.00
MFCC+LPCC+ACW–ACW            98.47  0.72   3.57  0.50    2.73   0.00    86.98  0.31   9.38  0.97    8.37  0.01
MFCC+LPCC+ACW–MFCC+LPCC+ACW  98.47  0.56   4.08  0.05    3.07   0.00    88.02  0.60   9.90  0.00    9.62  0.03

^a The features were extended by 1st and 2nd order regression coefficients.
^b The values of FDCF are considered as the minimum of this function.
5 Conclusion
According to the results in Sec. 4.3, it is possible to improve the results of a text-dependent speaker recognition system using score fusion. The system can use the classifiers DTW, BDM and FD, but in the case of BDM there is no improvement. In the case of the combination DTW–FD with equal weights it is possible to increase the accuracy (e.g. using PLP features). Nevertheless, to find the best score fusion it is necessary to use different weights and different feature combinations. Tab. 4 shows that, using the score fusion of DTW and FD (when the features ACW and CMS are selected), it is possible to reach a successful detection rate equal to 98.98 %, and 92.19 % in the case of microphone mismatch.
During verification, the system reached an equal error rate of 2.55 %, and 6.77 % when assuming microphone mismatch. Although it is possible to increase the identification accuracy using score fusion, in the case of verification the improvement was not significant. Considering recordings made by the same microphone, it is probably difficult to reach better results, because the system already performs well. In the case of microphone mismatch, better results can probably be obtained using more robust features. It is also possible to use more sophisticated methods to find suitable coefficients a_p and c_norm^p (e.g. genetic algorithms).

For future work it is proposed to extend the segmental features by suprasegmental features such as the fundamental frequency. The first three formants and features based on frequency tracking can also be used [4]. All these features can improve the accuracy of the classification. Identification using DTW is, in comparison to FD, very slow, nearly a thousand times slower; moreover, this time increases with the number of speakers and samples in the database. On the other hand, DTW provides better results. Therefore, it would be good to find a compromise between accuracy and identification time: for example, FD can very quickly find the first 10 candidates, and these candidates can then be processed by DTW. An extension to FD is also proposed: the distances can be calculated from the output of a floating window which returns the maximum, minimum, mean or standard deviation of the series selected by this window. As was already mentioned in Sec. 4.1, the classifiers' settings were found empirically, but it is possible that a better option exists; for this purpose it would be good to apply some kind of optimization.

Acknowledgments. This research has been supported by Project KONTAKT ME 10123 (Research of Digital Image and Image Sequence Processing Algorithms), Project SIX (CZ.1.05/2.1.00/03.0072), Project VG20102014033, and projects MICINN and FEDER TEC2009-14123-C04-04.
References
1. BenZeghiba, M.F., Bourlard, H.: User-customized Password Speaker Verification Using Multiple Reference and Background Models. Speech Communication 8, 1200–1213 (2006), IDIAP-RR 04-41
2. Campbell, W.M., Campbell, J.P., Reynolds, D.A., Singer, E., Torres-Carrasquillo, P.A.: Support Vector Machines for Speaker and Language Recognition. Computer Speech & Language 20(2-3), 210–229 (2006), http://www.sciencedirect.com/science/article/B6WCW-4GSSP9F-1/2/4aaea6467cc61ee4919a9b1c953316b1, Odyssey 2004: The Speaker and Language Recognition Workshop
3. Campbell, W.M., Sturim, D.E., Reynolds, D.A.: Support Vector Machines Using GMM Supervectors for Speaker Verification. IEEE Signal Processing Letters 13(5), 308–311 (2006)
4. Das, A., Chittaranjan, G., Srinivasan, V.: Text-dependent Speaker Recognition by Compressed Feature-dynamics Derived from Sinusoidal Representation of Speech. In: 16th European Signal Processing Conference (EUSIPCO 2008), Lausanne, Switzerland (2008)
5. Davis, S., Mermelstein, P.: Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech and Signal Processing 28(4), 357–366 (1980)
6. Fàbregas, J., Faundez-Zanuy, M.: Biometric Dispersion Matcher. Pattern Recogn. 41, 3412–3426 (2008), http://portal.acm.org/citation.cfm?id=1399656.1399907
7. Fàbregas, J., Faundez-Zanuy, M.: Biometric Dispersion Matcher Versus LDA. Pattern Recogn. 42, 1816–1823 (2009), http://portal.acm.org/citation.cfm?id=1542560.1542866
8. Furui, S.: Cepstral Analysis Technique for Automatic Speaker Verification. IEEE Transactions on Acoustics, Speech and Signal Processing 29(2), 254–272 (1981)
9. Hermansky, H.: Perceptual Linear Predictive (PLP) Analysis of Speech. The Journal of the Acoustical Society of America 87(4), 1738–1752 (1990), http://link.aip.org/link/?JAS/87/1738/1
10. Kinnunen, T., Li, H.: An Overview of Text-independent Speaker Recognition: From Features to Supervectors. Speech Communication 52(1), 12–40 (2010), http://www.sciencedirect.com/science/article/B6V1C-4X4Y22C-1/2/7926da351ef5c650f2a1a37adcd839a1
11. Mammone, R.J., Zhang, X., Ramachandran, R.P.: Robust Speaker Recognition: a Feature-based Approach. IEEE Signal Processing Magazine 13(5), 58 (1996)
12. Mekyska, J., Faundez-Zanuy, M., Smekal, Z., Fàbregas, J.: Text-dependent Speaker Recognition in Low-cost Systems. In: 6th International Conference on Teleinformatics, Dolni Morava, Czech Republic, pp. 154–158 (2011)
13. Reynolds, D.A.: Speaker Identification and Verification Using Gaussian Mixture Speaker Models. Speech Commun. 17, 91–108 (1995), http://portal.acm.org/citation.cfm?id=211311.211317
14. Reynolds, D.A., Quatieri, T.F., Dunn, R.B.: Speaker Verification Using Adapted Gaussian Mixture Models. Digital Signal Processing (2000)
15. Swanson, A.L., Ramachandran, R.P., Chin, S.H.: Fast Adaptive Component Weighted Cepstrum Pole Filtering for Speaker Identification. In: Proceedings of the 2004 International Symposium on Circuits and Systems, ISCAS 2004, vol. 5, pp. 612–615 (May 2004)
16. Vivaracho-Pascual, C., Faundez-Zanuy, M., Pascual, J.M.: An Efficient Low Cost Approach for On-line Signature Recognition Based on Length Normalization and Fractional Distances. Pattern Recogn. 42, 183–193 (2009), http://portal.acm.org/citation.cfm?id=1412761.1413027
17. Wong, E., Sridharan, S.: Comparison of Linear Prediction Cepstrum Coefficients and Mel-frequency Cepstrum Coefficients for Language Identification. In: Proceedings of 2001 International Symposium on Intelligent Multimedia, Video and Speech Processing, pp. 95–98 (2001)
18. Yegnanarayana, B., Kishore, S.P.: AANN: an Alternative to GMM for Pattern Recognition. Neural Networks 15(3), 459–469 (2002), http://www.sciencedirect.com/science/article/B6T08-459952R-2/2/a53c123eaecb7ccb7b50baec88885192
Developing Multimodal Web Interfaces by Encapsulating Their Content and Functionality within a Multimodal Shell Izidor Mlakar1 and Matej Rojc2 1
Roboti c.s. d.o.o., Tržaška cesta 23, Slovenia
[email protected]
2 Faculty of Electrical Engineering and Computer Science, University of Maribor, Smetanova ulica 17, Slovenia
[email protected]
Abstract. Web applications are a widely-spread and widely-used concept for presenting information. Their underlying architecture and standards, in many cases, limit their presentation/control capabilities to showing pre-recorded audio/video sequences. Highly-dynamic text content, for instance, can only be displayed in its native form (as part of the HTML content). This paper provides concepts and answers that enable the transformation of dynamic web-based content into multimodal sequences generated by different multimodal services. Based on the encapsulation of the content into a multimodal shell, any text-based data can, dynamically and at interactive speeds, be transformed into multimodal, visually-synthesized speech. Techniques for the integration of multimodal input (e.g. visioning and speech recognition) are also included. The concept of multimodality relies on mashup approaches rather than traditional integration. It can, therefore, extend any type of web-based solution transparently, with no major changes to either the multimodal services or the enhanced web application. Keywords: multimodal interfaces, multimodality, multimodal shell, web multimodal, ECA-based speech synthesis.
1 Introduction

Multimodal user interfaces offer several communication channels for use when interacting with the machine. These channels are usually categorized as input- and output-based. Multimodal applications, which traditionally ran within resource-limited, operating-system (OS) and device-dependent pervasive computing environments, started to transition to the web environment with the W3C's presentation of the multimodal web and the standardisation of the Extensible Multimodal Annotation markup language (EMMA) [1]. The web environment represents device- and operating-system-independent software architectures with almost unlimited computational resources. Due to the advances in the physical infrastructures of wide-area networks (e.g. stability of links, increased up/down link bandwidth, etc.), the intensive data exchange between services also became a solvable problem. Different distributive environments
started to form. These environments enable resource sharing, and increase the operability of traditionally resource-heavy input/output processing technologies. Although such environments enable relatively device/OS-independent multimodal services, the user interfaces (running on selected devices) still depend on basic device capabilities and therefore need to be adjusted, or even separately recoded, for each device type. Web applications, on the other hand, provide an easily adjustable (and with HTML 5 even self-adjustable) user interface that can run on any device that supports a web browser. Browser-enabled devices range from desktops to fast-emerging mobile devices (e.g. smart-phones, iPhone, PDAs, etc.) that enable accessing web content virtually from anywhere, at any time, and are becoming more and more popular every day. As a consequence, web services and applications have evolved and range from simple presentation pages to complex social networks, e-commerce, e-learning, and other business applications (e.g. B2B, CMS, etc.). Although web applications offer literally limitless data- and pattern-gathering fields that can be extensively used with the aim of generating efficient, personalized multimodal human-machine interaction interfaces, they can still prevent users from experiencing their full potential. Such cases usually refer either to the complexity of the web page, or to an impairment of the user (e.g. modality mismatches for blind or deaf users). Traditional assisting tools such as the WebSpeak [2] and BrookesTalk [3] text-to-speech systems (providing an aural overview of web content for the user) have been developed because using speech is the most natural way of communication. In addition, interfaces have also been developed that rely on visual cues, such as [5]. Speech technologies have the ability to further expand the convenience and accessibility of services, as well as to lower the complexity of traditional unimodal GUI-based web interfaces [6]. State-of-the-art web applications, such as iGoogle (http://www.google.com/ig), MyYahoo (http://my.yahoo.com/) and Facebook (http://www.facebook.com/), offer, in essence, at least some personalization plug-ins (e.g. in the form of user-defined styles and content), yet in the context of natural human-machine interaction and personification they still only offer a small number of multimodal interactive channels, usually limited to keyboard, mouse, touch-screen, and sometimes text-to-speech synthesis and speech recognition. This paper presents a multimodal web-based user interface, BQ-portal, which was developed on a flexible and distributive multimodal web platform. The BQ-portal is, in essence, a stand-alone web application acting as an information kiosk for students. The application, and the core frameworks it is built upon, address the multiple-modality issues by enhancing the presented content via speech and non-speech related technologies (text-to-speech and visual speech synthesis using embodied conversational agents). Currently, the web-based user interface offers several communication channels that users can use, ranging from traditional keyboard/mouse/touch-screen setups to complex content presentation using embodied conversational agents (ECAs). In the presented work ECAs are regarded as a "one-way" presentation channel. Their interaction capabilities are, therefore, limited to the presentation of the web content. However, the architectural concepts presented in this paper, and their distributive and
service-oriented nature, will allow the flexible implementation and integration of different Interactive Communication Management (ICM) and Own Communication Management (OCM) based tactics for the simulation of "human-like" interaction by using reactive ECAs. The paper is structured as follows. Firstly, it addresses related work on developing multimodal web interfaces. Section 3 presents a general concept for fusing existing web-based interfaces with different multimodal services. Section 4 presents the development of the multimodal web application BQ-portal, built on this novel fusion concept. The paper then presents an implementation of two mashup-based multimodal web services (an RSS feed reader and a language translator) that can be used by students while browsing the content of the BQ-portal web application. Finally, the paper concludes with a short discussion of future work, and the conclusion.
2 Multimodal Web-Based Interfaces – Related Works

As far as the field of multimodal research is concerned, new concepts and ideas emerge almost every day. Yet most of them are still task-oriented, and placed within relatively closed user environments. Such, usually unimodal, concepts do provide vast insights into how human-human interaction works and also into how man-machine interaction should work, but their usage and extension usually remain limited. Services extracted from different multimodal concepts and closed multimodal applications are limited to the usage of a few isolated modalities, such as text-to-speech synthesis, speech recognition, etc. Each service has its own underlying communication protocol. Therefore, it seems only natural to provide a common platform within which web and non-web based services could interact by using a set of compatible protocols. Web-based user interfaces allow a generalization of different services and device-independent implementations. Such interfaces implement standard communication protocols, such as HTTP, RTP, and TCP/IP (depending on the data). These protocols are understandable to most of the existing web browsers. Furthermore, with HTML and cascading style sheets (CSSs), web-based user interfaces also provide a means of interface adaptation to both the user and the device context. MONA [7] can be regarded as one of the environments for the development of multimodal web applications. MONA focuses mainly on middleware for providing complete multimodal web-based solutions running on diverse mobile clients (e.g. PDAs, smart phones, mobile phones, iPhone, etc.). The MONA concept involves a presentation server for mobile devices in different mobile phone networks. MONA's framework transforms a single multimodal web-based user interface (MWUI) specification into a device-adaptable multimodal web user interface. The speech data (the main multimodal channel) are transferred using either a circuit- or a packet-switched voice connection. In addition to MONA, other researchers have also worked on providing multimodal interfaces in resource-limited pervasive environments, e.g. [8], [9]. Furthermore, the TERESA [10] and ICARE [11] projects reach beyond the unimodal world of web services. TERESA provides an authoring tool for the development of multimodal interfaces within multi-device environments. It automatically produces combined XHTML and VoiceXML multimodal web interfaces that are specific to a targeted
multimodal platform. ICARE provides a component-based approach for the development of web-based multimodal interfaces. SmartWeb [12] is an extension of the SmartKom project (www.smartkom.org), one of the multimodal dialog systems that combine speech and gestures with facial expressions for both input and output. SmartWeb provides a context-aware user interface, and can support users in different roles. It is based on the W3C standards for the semantic web, the Resource Description Framework (RDF/S), and the Web Ontology Language (OWL). As fully operable examples of SmartWeb usage, researchers have provided a personal guide for the 2006 FIFA World Cup in Germany, and P2P-based communication between a car and a motorbike. In contrast to complete frameworks for the development of web-based multimodal solutions, Microsoft and IBM provide toolkits and multimodal APIs, such as the Speech Application Language Tags (SALT, http://msdn.microsoft.com/en-us/library/ms994629.aspx) and the WebSphere Voice Toolkit from IBM (http://www-01.ibm.com/software/pervasive/tech/demos/tts.shtml). These APIs contain several libraries for the development of multimodal services, and for their integration within different web applications. A more detailed study on the subject of multimodal web applications can be found in [13]. Most of these frameworks are directed towards ubiquitous computing, and provide approaches for the development of new multimodal web applications, rather than enabling the empowerment of existing web applications with different multimodal services and solutions. The BQ-portal web-based interface development presented in this paper follows the idea of implementing web browser-based solutions as multimodal interfaces. It is based on the distributive DATA platform [14] and the multimodal web platform (MWP) [15] concepts. The main purpose of these concepts is to allow as flexible a fusion as possible of different multimodal technologies with the services provided by already developed, non-multimodal web applications. The basic idea of the BQ-portal web-based interface development and its infrastructure for deploying speech technologies (speech recognition and text-to-speech synthesis) has already been outlined in [15]. This paper, however, provides an in-depth presentation of the BQ-portal web-based user interface development, including its fusion with the advanced technologies needed for the visual synthesis of human-like behaviour characteristics. The BQ-portal web application does not adapt/transform the core technologies (such as the ECA service or the text-to-speech service). It is based on the idea of providing an extension to the core technologies used within existing web applications (similar to [16]). The concept of the BQ-portal application, therefore, strives to enhance different web-based services with different non-web based technologies. The fusion concept is based on a mashup approach and browser cross-domain communication, and does not require any major conceptual/functional/code/architectural changes to either the web or the non-web based services. By exploiting the features of web-based solutions (HTML 5.0, CSS, JavaScript), such an interface also enables a device-independent provision ranging from a common desktop computer to browser-enabled mobile devices. The multimodal web-based interfaces provided by such a fusion enable the usage of different multimodal technologies in a transparent manner, regardless of their modality. The multimodal fusion concept is also quite general and can be extended to virtually any existing web-based solution.
3 Multimodal Web Application – General Concept

Let us assume that users can access web applications within different computer environments using different web-based browser interfaces. Normally, these web applications (with rich content) are based on GUIs and support user-machine interaction using traditional input/output devices (e.g. mouse, keyboard, touch screen, etc.). More advanced user interfaces should be developed and provided in order to integrate and enable additional modalities for machine interaction, and to further improve the user experience. Such a concept, which enables the flexible integration of multimodal technologies into general web applications, is introduced in Figure 1. The concepts presented in Figure 1 suggest the fusion of general web applications with several multimodal extensions, and the formation of a platform-independent multimodal service framework.
Fig. 1. Multimodal web-platform concept
On the one hand, we have core multimodal technologies and multimodal services; on the other hand, a web application providing a user interface, content and several services that should/could be extended with additional input/output modalities. Both technologies run, in general, on different platforms and use different communication protocols. It is assumed that web-based services use HTTP (or any other web-service protocol), and that the non-web services interact under a general, unified protocol understandable to all user devices. It is also assumed that both protocols are unrelated and cannot simply be merged. In order to fuse web and non-web based services in a flexible way, an intermediate object (a multimodal shell) is introduced. Such a multimodal shell allows both types of technologies to retain their base rules whilst, at the same time, complementing each other. Any multimodal web-based application can be viewed as a set of four relatively independent objects. The multimodal services (the first object) represent several input/output-based core technologies (text-to-speech synthesis, visual ECA synthesis, etc.). The second object then represents a general web application (e.g. e-learning, e-business, information kiosk, etc.) providing the core user interface, content, and core functionality. The third object represents a set of different browser-based user interfaces that can use web-based
services. And finally, the fourth object is a multimodal shell, representing an intermediate object that understands both the multimodal services and the web application, receives user requests and responds to them (e.g. voice browsing). The implementation of the multimodal shell, presented in the following section, involves IFRAME-based cross-domain interaction. By using the presented multimodal shell (which understands both types of service protocols), general web services (or their results) can be flexibly transformed into multimodal-based services (or their results).

Fig. 2. Multimodal shell's functional architecture

3.1 Multimodal Shell

The multimodal shell is an intermediate object that can be used for the transformation of web-based content into multimodal content, and for presenting such fused content according to the user/device context. Figure 2 presents the functional architecture of such an intermediate object. It is basically an application wrapper, implemented as a device-independent web application without any visual components. The transformation from web-based to multimodal-based and back to web-compatible data is implemented by using the concept of encapsulation. The core user interface is stored within an IFRAME of the multimodal shell's interface. Since it is assumed that both web applications (the core application and the multimodal shell) are of different origins (different domains), a cross-domain API is integrated into both interfaces. This cross-domain API then handles any data transfer from the core application towards the multimodal services (e.g. data for text-to-speech synthesis), or data transfer from the multimodal services towards the web application (e.g. additional control options, such as speech recognition). In addition to data transformation, the multimodal shell also handles the presentation of multimodal data (e.g. playing the audio/video stream of the synthesized data), and the collection of device properties. The fusion of web and
non-web content, performed by the multimodal shell, is based on three components: the Web platform-interface, the Multimodal core and the Multimodal user front-end. The Multimodal core component represents a group of non-web services that work under a unified protocol. These core services can be accessed by different applications in a general way, regardless of the web application context.

The Web-platform interface component. The Web-platform interface component handles the adaptation of multimodal services to the user and device contexts. This software module registers the user interface to the Multimodal core and decides which core services to allow for each user interface, and in what form these services should respond. The component also normalizes the web-based data (e.g. HTML-based textual information) to the input format required by the core. In the context of web-based services, this component additionally implements connections to different external, web-based services (e.g. weather services, RSS feeds, etc.). The main task of this component is, therefore, to serve different context-dependent requests and to propagate different context-oriented responses. To achieve all these tasks, the component further implements low-level components used for processes such as: registration and adaptation of user interfaces and user services, web and non-web based data exchange, data normalization, protocol unification, etc. The Interface services component describes a set of services that are used as an access point to different external services (e.g. speech synthesis, RSS feeds, weather services, tourist info services, etc.), and for data and protocol transformation and normalization (HTML/XML-based data to raw text, high-level HTML-based requests to low-level sequences of command packets, etc.). For instance, in the case of RSS feeds, this component forms an HTTP connection to the RSS feed provider, transforms its data into raw text, and redirects it to the different core services, e.g. the TTS service. The Device manager and Service manager components are used for the registration and adaptation of user interfaces and user services through the processing of different contextual information. The minimal context information that most user devices can provide includes: input/output capabilities (type, display size, presence of audio/video input and output devices, etc.), network interface, web-browser type, the preferred video stream player, etc. This information can be gained through services such as the Resource Description Framework (RDF) and different JavaScript-based client libraries that can be integrated within the Device manager component, or within the Multimodal shell's front-end component. Such contextual information is then used to model the different multimodal services. For instance, the type of web browser, the screen dimensions, and the type of network connection define the availability of the embodied conversational agent, as well as the video stream properties (e.g. encoding and maximal video size), the video stream player, etc. The Service manager component also obtains/holds the web properties, i.e. information about the web content and its structure. The web properties include, among others, service descriptions and the HTML structure of web pages. The Dialog manager component models the human-machine interaction and, by using the device, service, and user context, enables/disables the multimodal services. It also defines the communication paths for different user requests (e.g. text for visual synthesis). This component also transforms (normalizes) any input data in order to meet the requirements of each non-web based service.
The Multimodal user front-end component. The Multimodal user front-end component is an HTML-based user interface that merges the front-end of the targeted web application with the multimodal shell's front-end. It represents a web page containing an IFRAME without any visual elements. The IFRAME serves to present the content of the targeted web application, and forms a "domain" bridge between the provider of the multimodal services and the targeted web application. This "domain" bridge is implemented with a JavaScript-based cross-domain API, which allows direct data transfer between the Multimodal shell component and the targeted web application (i.e. the encapsulated web application). In other words, the cross-domain API allows direct interaction and data exchange between two domains. Users can send data from the web-application interface directly to the Multimodal shell component, and the Multimodal shell component can remotely control the users' web-based interfaces. The Multimodal user front-end component also contains a Java-based client API that can be used for establishing a direct TCP/IP session to the multimodal core. The created session can then be used for management and data transfer. Data transferred during this session is assumed to be of a non-textual (non-web) nature (e.g. a video stream from a camera, an audio stream from a microphone, audio-visually synthesized text, etc.).
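To make the role of the shell more concrete, the following sketch illustrates the kind of mediation described above: web content is normalized to raw text, a stream configuration is chosen from the device context, and the request is routed to a non-web output service. All class, service and address names here are hypothetical illustrations and not part of the actual MWP/DATA implementation.

```python
# Illustrative sketch of the shell's mediation logic (hypothetical names and protocol).
import re
import socket

def normalize_html(html: str) -> str:
    """Strip tags/styles so that only raw text is forwarded to the non-web services."""
    text = re.sub(r"<(script|style).*?</\1>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

class MultimodalShell:
    def __init__(self, core_host="multimodal-core.example", core_port=9000):
        self.core_addr = (core_host, core_port)       # assumed TCP endpoint of the core

    def choose_stream(self, device):
        # Device context decides the encoding/size of the ECA video stream
        return {"codec": "H.264" if device.get("bandwidth_kbps", 0) > 1000 else "FLV",
                "width": min(device.get("screen_width", 640), 640)}

    def present(self, html_fragment, device):
        text = normalize_html(html_fragment)
        request = f"TTS+ECA|{self.choose_stream(device)}|{text}"   # hypothetical core protocol
        with socket.create_connection(self.core_addr, timeout=5) as s:
            s.sendall(request.encode("utf-8"))
            return s.recv(1024).decode("utf-8")       # e.g. a stream URL returned by the core

# shell = MultimodalShell(); shell.present("<p>Hello!</p>", {"bandwidth_kbps": 2000})
```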
4 BQ-Portal Web Application

The BQ-portal web application is an implementation of a mashup-based web application for the provision of multimodal services. It was developed by using the presented multimodal web application concept (Section 3), based on the MWP [15] and DATA [14] architectures. The BQ-portal web application can be regarded as a "targeted" web application with no direct implementation of non-web based services. It serves as an information kiosk for students. By fusing web-based services (e.g. RSS feeds and the BQ-portal's web-based services) and non-web based services (TTS, ECA and language translation services), it performs audio/visual content presentation using embodied conversational agents. Its functional architecture, based on the multimodal web application concept presented in Section 3, is shown in Figure 3.
Fig. 3. Functional architecture of the BQ-portal web application
The BQ-portal web application (Figure 3) implements its own web-based user interfaces, supporting the cross-domain communication. The Multimodal shell component mediates between the device-adapted web-based user interfaces and the multimodal services within the Multimodal core component. It therefore performs the fusion of web-based and non-web-based technologies into multimodal, user-oriented services and multimodal, web-based user interfaces. The web-based content is presented in a format that is acceptable to web browsers, and the multimodal content (speech output and ECA animation) is presented as a video stream. Users can use all the services provided by the BQ-portal web application in a regular (non-multimodal) fashion, or as multimodal services. Figure 3 presents the two applicative domains suggested by the concept of the MWP platform. The first one is the multimodal domain (established by the services within the Multimodal core component). It combines different non-web based technologies (the PLATTOS TTS service [4] and the EVA service [17]). The second one is the web domain. It combines web-related technologies, such as the different services provided by the BQ-portal web application, and the web-based services provided by other web applications, such as RSS feeds, news feeds, weather forecasts, traffic info, etc. Both mentioned applicative domains are assumed to be of different origins (e.g. they run on different application frameworks) and, therefore, cannot communicate directly with each other. Therefore, the cross-domain API (discussed in Section 3) is used for interfacing both domains and for the relevant data exchange performed directly through the user interface. The cross-domain API implements an IFRAME encapsulation, and wraps the targeted (general) web application into the IFRAME of the Multimodal shell's web-based user interface. Additionally, a multimodal client interface API (Figure 2), a light Java client-based application, also automatically runs within the web-based user interface of the Multimodal shell component. This light client provides general device information, and enables communication between the user device and the Multimodal core component. The mashup principle, implemented by the cross-domain API, serves as a data exchange process between applications of different domains, for instance to generate audio/visual content based on the text presented within the user interface. The generated speech and ECA EVA responses are always presented together as a video stream played by a web-based stream player. ECA EVA can provide different formats for the output stream (e.g. RTP or HTTP), and different audio/video encodings (e.g. MPEG-2/4, H.264, FLV, etc.). The multimodal front-end automatically generates the video stream player based on the device properties (e.g. a custom Java-based video player, or a web player encapsulated within the MWP's front-end). Some of the multimodal web-based services provided by the BQ-portal application are outlined in the following sections. The concepts of multimodality within the BQ-portal web application are quite general and can serve as a base for designing several new multimodal services. For instance, when speech recognition is available among the multimodal services, the use of the web-based interface can also be voice-driven.
5 Multimodal Services – ECA Enhanced Web Services

Based on the concept of multimodality, the BQ-portal web application already offers several multimodal services based on the ECA EVA and the TTS system. These services
transform general web-based content into an audio-visual response (generated by the embodied conversational agent EVA and the TTS PLATTOS). Such services, therefore, unify the web content, the text-to-speech output, and the embodied conversational agent into responsive, audio-visually presented content. This section presents insights into how web-based content is transformed into a multimodal response, and describes two multimodal services provided by the BQ-portal web application.

5.1 Visualized RSS Feeds

RSS feeds are a common entity within today's web applications. Feed services are used to provide and present information from outside sources (external web applications). Commonly, these services can be accessed over a standardized XML interface, using a well-structured and constant XML scheme. If the structure is unknown, it can also be determined from the content of the XML. The idea of enhancing RSS feeds with multimodal output services involves:
- parsing the feed (to obtain the titles, the corresponding content, images, etc.),
- removing all HTML-related content, such as CSS styles, tags, etc.,
- redirecting the normalized content to the corresponding multimodal services,
- generating the multimodal output,
- presenting the multimodal output to the user.
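A minimal sketch of the first three steps is given below. The feed URL and the TTS endpoint are placeholder assumptions, and the actual MWP implementation performs this routing inside the Interface services component rather than in a stand-alone script.

```python
# Illustrative sketch: parse an RSS feed, strip HTML and forward raw text (assumed endpoint).
import re
import urllib.request
import xml.etree.ElementTree as ET

def fetch_feed_items(feed_url):
    with urllib.request.urlopen(feed_url, timeout=10) as resp:
        root = ET.fromstring(resp.read())
    # Standard RSS 2.0 scheme: <rss><channel><item><title>/<description>
    for item in root.iter("item"):
        yield item.findtext("title", ""), item.findtext("description", "")

def to_raw_text(html):
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", html)).strip()

def read_feed(feed_url, tts_endpoint="http://tts.example/synthesize"):
    for title, description in fetch_feed_items(feed_url):
        text = to_raw_text(f"{title}. {description}")
        req = urllib.request.Request(tts_endpoint, data=text.encode("utf-8"),
                                     headers={"Content-Type": "text/plain; charset=utf-8"})
        # urllib.request.urlopen(req) would dispatch the text to the (hypothetical) TTS service,
        # whose response would reference the generated audio/ECA video stream.
        print(title, "->", len(text), "characters of raw text prepared")

# read_feed("http://news.example/rss")   # placeholder feed URL
```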
Fig. 4. Enhanced RSS feeds by using multimodal output services
This process is implemented upon the user's request, either for the index of RSS titles or for the content of a title. If the index page is to be read, the TTS service is fed with the titles; otherwise the selected title is connected to its content, and the content is then fed to the TTS service. The TTS service (based on the PLATTOS TTS system, available in the DATA service cloud) generates the speech and an EVA-Script based description of the text being synthesized. This description is a sequential set of phonemes that are assigned attributes such as duration, pitch, and prominence level. The EVA-Script description is then transformed into animated lip movement, and sent as a video stream towards the user's multimodal front-end component. Figure 4 presents the functional architecture of the multimodal output based RSS feed reader. Three different RSS news feed providers are supported within the current version of the BQ-portal web application. The indexes and individual feeds are accessed using standard web-based RSS connectors. The RSS parser parses the feeds, and generates raw text by using the RSS scheme descriptors (e.g. XML tree parsing). When the user accesses the feed, he/she actually accesses the RSS feed reader interface that stores both the raw and the styled data of the individual feed. A read-feed request initiates the feeding process and, in turn, launches the multimodal output process. With a delay within interactive norms, the user then hears and sees the generated multimodal output within his/her interface.

5.2 Visualized Translations

The BQ-portal also supports Google-translator-API-based (http://code.google.com/apis/language/) real-time language translations. In this way, users can translate either the content of the currently presented web page as a whole, or only fragments of the content at a time. The data source for the translation is always provided by the user interface, and the process of translation is always initiated upon the user's request. Figure 5 presents the functional architecture of the BQ-portal multimodal output-based text translator's API. The multimodal output based text translator sends cleaned data (raw text with no HTML, CSS or XML elements) into the 'language discovery' and 'language translation' modules, and returns two types of output: the translations as text, and the video sequence of the generated multimodal output (synthesized and visualized speech). The BQ-portal's presentation layer provides the translation data in the form of raw text (text without any HTML/CSS related information). This raw text is then translated. Firstly, it is passed through the 'language discovery' module. This module is native to the Google API and determines the language of the input raw text. If the 'language discovery' module fails to do this, the user is asked to define the input language manually. The target translation language is currently Slovenian (since the PLATTOS TTS for the Slovenian language is available). The real-time translation process then proceeds with the translation phase, where the detected input language and the raw text are passed to the Google translator API, which translates the text into the Slovenian language. The translation result is raw text that is fed to the DATA service cloud. Within this cloud, the raw data are firstly redirected to the TTS service that generates the audio and the EVA-Script based descriptions. The TTS service output is then redirected towards the ECA service. The obtained data are used for the generation of the multimodal output (ECA animated speech sequences).
The multimodal output is then transferred back to the user interface as a video stream.
Fig. 5. Enhanced text translations by using multimodal output services
The presented approach to the multimodal output based text translation can easily be extended to any type of document that can be parsed and cleaned (transformed into raw text) by the BQ-portal web application, ranging from Word documents to PDF books. The quality of the multimodal output based translation, however, depends highly on the quality of the translated text, i.e. on the quality of the Google translator API.
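The overall flow can be summarized by the sketch below. The function names and endpoints are hypothetical stand-ins for the 'language discovery', 'language translation', TTS and ECA services; they are not the actual Google or DATA/MWP interfaces.

```python
# Hypothetical orchestration of the visualized-translation pipeline described above.
def visualize_translation(raw_text, detect_language, translate, synthesize, animate,
                          target_lang="sl", ask_user_for_language=None):
    """raw_text: cleaned text from the presentation layer (no HTML/CSS/XML)."""
    source_lang = detect_language(raw_text)              # 'language discovery' step
    if source_lang is None and ask_user_for_language:
        source_lang = ask_user_for_language()             # manual fallback, as described above
    translated = translate(raw_text, source_lang, target_lang)   # 'language translation' step
    audio, eva_script = synthesize(translated)            # TTS: speech + EVA-Script description
    stream_url = animate(audio, eva_script)               # ECA: lip-synchronized video stream
    return translated, stream_url

# Example wiring with dummy services (placeholders only):
result = visualize_translation(
    "Hello world",
    detect_language=lambda t: "en",
    translate=lambda t, src, dst: f"[{src}->{dst}] {t}",
    synthesize=lambda t: (b"", f"<eva>{t}</eva>"),
    animate=lambda audio, script: "rtp://eca.example/stream/42",
)
print(result)
```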
6 Conclusion

This paper has presented a concept for developing multimodal web interfaces that can overcome the device/system dependences common to many multimodal web interfaces. The concepts outlined in this article enable the integration of multimodal technologies into different types of web-based solutions. By visualizing the textual content, embodied conversational agents can add more life to the content, and can also be used as supportive technologies providing additional meaning to the content. As a result, this paper has presented two services that are implemented within the BQ-portal web application. The RSS feed visualization service allows the BQ-portal to directly visualize the content of any provided RSS feed (it must have a known XML/HTML scheme). The visualized translations, on the other hand, present a service that incorporates both web and non-web services. By using the Google translator API (a web-based service), and TTS + ECA (non-web based services), the user can translate selected text and also visualize the translations.
The main focus of the presented MWP concept is to provide an interface that allows the fusion (not integration) of non-web based services with general web-based applications. By using the mashup-based principle of cross-domain interaction, it has been shown that non-web technologies, such as text-to-speech synthesis and embodied conversational agents, can be fused with different web services and web content. Such a fusion enriches and enhances existing general web content and presents it through multiple communication channels. In the paper, ECAs were regarded only as a one-way presentation channel, being able to perform only OCM (e.g. visual speech synthesis, speech-related gesture synthesis, etc.) based on the TTS output. However, the BQ-portal web application's service-oriented architecture allows the development and integration of application-independent ICM. In order to form two-way interaction loops, different behaviour management techniques, as e.g. in [18], [19], can be integrated into the multimodal core component. These techniques can be further interfaced with the ECA and TTS services. In this way, ECAs would gain the ability not only to present the web content, but also to respond to different user requests and to influence the interaction flow. In the future we plan to extend the presented multimodal concept by providing services such as speech recognition and visioning (multimodal input), and to further enhance the BQ-portal web application with new input modalities. In addition, we plan to research and introduce different ICM tactics and dialog management systems in order to provide a more human-like communicative dialog to a general web-based application or service. These research activities will allow us to transform the kiosk-based application into a more natural web-based interface that can be used within different intelligent environments. Acknowledgements. Operation part financed by the European Union, European Social Fund.
References 1. EMMA: Extensible MultiModal Annotation Markup Language. W3C Recommendation (2009), http://www.w3.org/TR/2009/REC-emma-20090210/ 2. Hakkinen, M., Dewitt, J.: WebSpeak: user interface design of an accessible web browser. White Paper, the Productivity Works Inc. (1996) 3. Zajicek, M., Powell, C., Reeves, C.: A web navigation tool for the blind. In: Proceedings of the 3rd ACM/SIGAPH on Assistive Technologies, pp. 204–206 (1998) 4. Rojc, M., Kačič, Z.: Time and Space-Efficient Architecture for a Corpus-based Text-toSpeech Synthesis System. Speech Communication 49(3), 230–249 (2007) 5. Yu, W., Kuber, R., Murphy, E., Strain, P., McAllister, G.: A novel multimodal interface for improving visually impaired people’s web accessibility. Virtual Reality 9(2), 133–148 (2006) 6. Oviatt, S., Cohen, P.: Perceptual user interfaces: multimodal interfaces that process what comes naturally. Communications of the ACM 43(3), 45–53 (2000) 7. Niklfeld, G., Anegg, H.: Device independent mobile multimodal user interfaces with the MONA Multimodal Presentation Server. In: Proceedings of Eurescom Summit 2005 (2005)
8. Song, K., Lee, K.H.: Generating multimodal user interfaces for Web services. Interacting with Computers Archive 20(4-5) (September 2008) 9. Chang, S.E., Minkin, B.: The implementation of a secure and pervasive multimodal Web system architecture. Information and Software Technology 48(6) (2006) 10. Berti, S., Paternò, F.: Migratory MultiModal Interfaces in MultiDevice Environments. In: Proc. of 7th Int. Conf. on Multimodal Interfaces ICMI 2005. ACM Press, New York (2005) 11. Bouchet, J., Nigay, L., Ganille, T.: ICARE software components for rapidly developing multimodal interface. In: Conference Proceedings of ICMI 2004 (2004) 12. Wahlster, W.: SmartWeb: Mobile Applications of the Semantic Web. In: Biundo, S., Frühwirth, T., Palm, G. (eds.) KI 2004. LNCS (LNAI), vol. 3238, pp. 50–51. Springer, Heidelberg (2004) 13. Stanciulescu, A., Vanderdonckt, J.: Design Options for Multimodal Web Applications. In: Computer Aided Design of User Interfaces V, pp. 41–56 (2007) 14. Rojc, M., Mlakar, I.: Finite-state machine based distributed framework DATA for intelligent ambience systems. In: Proceedings of CIMMACS 2009, WSEAS Press (2009) 15. Mlakar, I., Rojc, M.: Platform for flexible integration of multimodal technologies into web application domain. In: Proceedings of E-ACTIVITIES 2009, International Conference on Information Security and Privacy (ISP 2009), WSEAS Press (2009) 16. Thang, M.D., Dimitrova, V., Djemame, K.: Personalised Mashups Opportunities and Challenges for User Modelling. In: Conati, C., McCoy, K., Paliouras, G. (eds.) UM 2007. LNCS (LNAI), vol. 4511, pp. 415–419. Springer, Heidelberg (2007) 17. Mlakar, I., Rojc, M.: EVA: expressive multipart virtual agent performing gestures and emotions. International Journal of Mathematics and Computers in Simulation 5(1), 36–44 (2011), http://www.naun.org/journals/mcs/19-710.pdf 18. Morency, L.P., de Kok, I., Jonathan Gratch, J.: A probabilistic multimodal approach for predicting listener backchannels. Autonomous Agents and Multi-Agent Systems 20(1), 70–84 (2010) 19. Wrede, B., Kopp, S., Rohlfing, K., Lohse, M., Muhl, C.: Appropriate feedback in asymmetric interactions. Journal of Pragmatics 42(9), 2369–2384 (2010)
Multimodal Embodied Mimicry in Interaction Xiaofan Sun and Anton Nijholt Human Media Interaction, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands {x.f.sun,a.nijholt}@ewi.utwente.nl
Abstract. Nonverbal behavior plays an important role in human-human interaction. One particular kind of nonverbal behavior is mimicry. Behavioral mimicry supports harmonious relationships in social interaction by creating affiliation, rapport, and liking between partners. Affective computing that employs mimicry knowledge and that is able to predict how mimicry affects social situations and relations can find immediate application in human-computer interaction to improve interaction. In this short paper we survey and discuss mimicry issues that are important from that point of view: application in human-computer interaction. We designed experiments to collect mimicry data. Some preliminary analysis of the data is presented. Keywords: Mimicry, affective computing, embodied agents, social robots.
1 Introduction

People come from different cultures and have different backgrounds while growing up. This is reflected in their verbal and nonverbal interaction behavior, speech and language use, attitudes, social norms and expectations. Sometimes a harmonious communication is difficult to establish or continue because of these different cultures and backgrounds. This is also true when people are from the same culture and have the same background, but differ in opinions or are in competition. In designing user interfaces for human-computer interaction, including social robots and artificial embodied agents, in designing tools for computer-mediated interaction, and in designing tools or environments for training and simulation where interaction is essential, we should be aware of this. These interfaces, tools and environments need to be socially intelligent, capable of sensing or detecting information relevant to social interaction. Mimicry is often an automatic and unconscious process where, usually, the mimicker neither intends to mimic nor is consciously aware of doing so, but may tend to activate a desire to affiliate. For example, mimicking behaviors even occur among strangers when no affiliation goal is present; certainly, mimicking strangers assumes unconscious mimicry. In other cases, people often mimic each other without realizing they want to create similarity. This can also be assumed to be unconscious mimicry. Conversational partners may or may not be consciously engaged in mimicry but, no doubt, one or both of the interactants take on the posture, mannerisms, and movements of the other during natural interaction [1].
Some instances of mimicry in daily life, and factors that affect them, are given below. People often mimic their boss's behavior in a meeting or discussion; for example, they repeat what the boss said out of a desire to affiliate, even if there is no real agreement. As another example, meeting or discussion partners mimic each other to gain acceptance and agreement when they share, or want to share, an opinion in a discussion. Thus, it is worth noting that interactants mimic each other because of directly activated goals, though without consistent awareness. Mimicry occurs in our daily life all the time, and in most cases mimicry behavior implicitly or explicitly reveals the mimickee's and mimicker's actual attitudes, beliefs, and affects, and, moreover, their judgment of the current interaction situation as positive or negative. Nonconscious mimicry widely occurs in our daily life; for example, people unconsciously speak more softly when they are visiting a library. Mimicry is inherently sensitive to the actual social context; in other words, automatic mimicry changes with changing goals according to the realistic social situation. It is expected that human-computer interfaces that employ knowledge of mimicry can improve natural, human-like interaction behavior. This requires the detection and generation of mimicry behavior. It allows the interface to adapt to its human partner and to create affiliation and rapport. This can in particular be true when mimicry behavior is added to human-like computer agents with which users communicate. One of the important goals for future studies in embodied virtual agents and social robots is to use social strategies in order to make them more sociable and natural [2]. The sociable agent should have the capability of recognizing positive and negative situations, and its communicative behavior should be appropriate in the current situation. Then it can achieve desirable interaction results such as creating affiliation and rapport, gaining acceptance, increasing belongingness, and, of course, a better understanding of the conversational partner. Indeed, in recent research on humanoid agents the view that humans are "users" of a certain "tool" is shifting to that of a "partnership" with artificial, autonomous agents [3], [4]. Social agents need to have the capability to acquire various types of input from human users in verbal and non-verbal communication modalities. Also, social agents should have the capability of understanding the input signals to recognize the current situation, and then, according to the desired goals in the conversational setting, to combine social strategies to determine what behavior is appropriate to express in response to the multimodal input information. Similarly, in the output phase, agents are expected to have the capability of mimicking the users' facial expressions, eye contact, posture or even verbal style to gain more closeness and a more natural communication.
2 Types of Mimicry Various types of mimicry can be distinguished. They range from almost directly mimicking facial expressions and slight head movements to long term effects of interaction such as convergence in attitudes [2]. When we look at automatic detection and generation, we confine ourselves to the directly observable and developing mimicry behavior during interactions and what can be concluded from that. Therefore, below we distinguish mimicry in facial expressions, in speech, in body behavior (including gestures and head movements) and emotions.
2.1 Facial Expression Mimicry

Interactants may display similar facial expressions during face-to-face interactions. When one of two interactants facing each other takes on a certain facial action, the partner may take on a congruent action [5], [6]. For instance, if one is smiling, the other may also smile. Previous mimicry experiments have shown that when images of a facial expression displaying a particular emotion are presented, people display similar expressions, even if those images are just static expressions [7], [8], [9].

2.2 Vocal Mimicry

Vocal behavior coordination occurs when people match the speech characteristics and patterns of their interaction partners [10]. They may neither intend to do so nor be consciously aware of doing so. This can be observed even when the partners are not facing each other [11].

2.3 Postural Mimicry

Body behavioral coordination involves taking on the postures, mannerisms, gestures, and motor movements of other people, such as rubbing the face, touching the hair, or moving the legs [12]. For instance, if one person crosses his legs with the right leg on top of the left, the other may also cross his legs, with either the left leg or the right leg on top [13].

2.4 Emotional Mimicry

The perception of mimicry is not limited to the perception of behavioral expressions [14]. Emotional mimicry is another phenomenon that needs to be considered. It is more complicated and largely based on personal feeling and perception. In [7] emotional mimicry is classified into positive mood mimicry, negative mood mimicry and counter-mimicry. In an actual social situation not all emotion expressions are mimicked equally. People are normally more likely to mimic positive emotion than negative emotion, which seems to be because negative emotional mimicry is less relevant and more costly [15]. Consider, for example, the situation where someone tells you that a bad thing has happened to him or her and, consciously or unconsciously, displays a sad face. Mimicking his or her sadness expression means signaling understanding, and maybe also a willingness to help. Hence, sadness mimicry only occurs between people who are close to each other, rather than between passing acquaintances [15]. In contrast, people mimic happiness regardless of the relationship or the situational context, because mimicking a positive emotion carries low risk and low cost [14]. In competitive settings such as debates or negotiations, counter-mimicry is usually evoked to express differing attitudes or negative emotion in a polite and implicit way, through contrasting facial expressions and postural or vocal cues, such as a smile when the expresser winces in pain [7].
3 Mimicry as a Nonconscious Tool to Enhance Communication

Individuals may consciously engage in more mimicry with each other when they intend to affiliate during an interaction. Conversely, they may consciously engage in less mimicry when they prefer disaffiliation [16]. Hence, mimicry has the power to enhance social interaction and to express preferences. This is not really different in the case of unconscious mimicry. Unconscious mimicry reflects a merging of minds, such as developing more similar attitudes or sharing more viewpoints [12]. Moreover, in interpersonal interaction mimicry can be an unconsciously used ‘tool’ to create greater feelings of, for example, rapport and affiliation [17]. Mimicry can also be seen as an assessment of the current social interaction situation (e.g., a positive or a negative environment). The connection between mimicry and the closeness of social interaction was shown in a study by Jefferis, van Baaren and Chartrand [18]. To use mimicry as a tool to enrich social interaction, important research issues are, first, to understand and explore how people experience and use mimicry; second, to examine the implications of explicit mimicry behaviors in terms of social perceptions of the mimickers; third, to analyze detected and classified mimicry behavior for cues about the characteristics of the interaction; and, finally, to examine to what extent mimicking should occur so that it enriches communication properly. Embodied automatic mimicry can be used as a social strategy to achieve the desired level of affiliation or disaffiliation. The key is to obtain an optimal level of embodied mimicry [2]; that is, mimicry should occur only to the proper degree, so that it serves the affiliation goal and is neither costly nor risky.
4 Measuring Mimicry

Mimicry refers to the coordination of movement between individuals, in both timing and form, during interpersonal communication. This phenomenon is observed in newborn infants [8], and it has been reported to be related to language acquisition [10] and, as mentioned before, to rapport. Many researchers have therefore been interested in investigating its nature and have introduced theories explaining it. Because of this broad range of theoretical applicability, interactional mimicry has been measured in many different ways [19]. These methodologies can be divided into two types: behavior coding and rating. Some research has illustrated the similarities and differences between using a coding method and a rating method for measuring mimicry, and some researchers have studied interpersonal communication using both methods. Recently, Reidsma et al. [20] presented a quantitative method for measuring the level of nonverbal synchrony during interaction. First, the amount of movement of a person as a function of time is measured by image difference computations. Then, with the help of the cross-correlation between the movement functions of two conversational partners, taking possible time delays into account, it is determined whether they move synchronously. In research on judging rapport and affiliation, studies have examined how people use objective cues, as measured by a coding method, or subjective cues, as measured by a rating method, when they perceive interpersonal communication.
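To make the cross-correlation idea concrete, the following sketch estimates a per-frame movement signal from image differences and then searches for the time lag that maximizes the normalized cross-correlation between two speakers' movement signals. This is only a minimal illustration of the kind of computation described by Reidsma et al. [20], not their implementation; the frame format, the maximum lag and the function names are our own assumptions.

```python
import numpy as np

def movement_signal(frames):
    """Amount of movement per frame: mean absolute difference between
    consecutive grayscale frames (frames: array of shape T x H x W)."""
    diffs = np.abs(np.diff(frames.astype(float), axis=0))
    return diffs.mean(axis=(1, 2))

def best_lag(sig_a, sig_b, max_lag=25):
    """Lag (in frames) maximizing the normalized cross-correlation between
    two movement signals; a positive lag means speaker B follows speaker A."""
    a = (sig_a - sig_a.mean()) / (sig_a.std() + 1e-9)
    b = (sig_b - sig_b.mean()) / (sig_b.std() + 1e-9)
    lags = list(range(-max_lag, max_lag + 1))
    scores = []
    for lag in lags:
        if lag >= 0:
            x, y = a[:len(a) - lag], b[lag:]
        else:
            x, y = a[-lag:], b[:len(b) + lag]
        n = min(len(x), len(y))
        scores.append(float(np.dot(x[:n], y[:n]) / n))
    i = int(np.argmax(scores))
    return lags[i], scores[i]
```

A high correlation score at some small lag would then be read as the two partners moving synchronously.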
For automatic mimicry detection, advanced learning techniques need to be employed to construct a model from both subjective knowledge and training data. Affect (e.g., disagreement/agreement) recognition is accomplished through probabilistic inference, by systematically integrating mimicry measurements with mimicry behavior detection and a mimicry behavior organization model. In such a model, head movements, postural movements and facial expressions can be explicitly modeled by different sub-models at the lower levels, while the higher-level model represents the interaction between the modalities. However, automatic selection of the sensory sources based on the information need is non-trivial, and hence no operational system exploits this yet. Individual sensors are integrated in sensor networks, and the data perceived by single sensors need to be fused and integrated in the network. Moreover, the multimodal signals should be treated as mutually dependent rather than being combined only at the end, as is the case in decision-level fusion. The same problem also appears at the feature level: when and how should the features from the various sensor models be combined?
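The difference between the two fusion strategies mentioned above can be sketched in a few lines. The example below is purely illustrative: the per-modality feature matrices are random placeholders, the labels stand for an affect category such as agreement vs. disagreement, and logistic regression is only a stand-in classifier, not the model used in this work.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder per-modality features (head movement, posture, face),
# aligned to the same interaction segments, with binary affect labels.
rng = np.random.default_rng(0)
X_head, X_post, X_face = (rng.normal(size=(200, 6)) for _ in range(3))
y = rng.integers(0, 2, size=200)

# Decision-level (late) fusion: one classifier per modality, outputs averaged.
late_scores = np.mean(
    [LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
     for X in (X_head, X_post, X_face)],
    axis=0)

# Feature-level (early) fusion: concatenating the features lets the model
# exploit dependencies between modalities, as argued above.
X_all = np.hstack([X_head, X_post, X_face])
early_scores = LogisticRegression(max_iter=1000).fit(X_all, y).predict_proba(X_all)[:, 1]
```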
5 Collecting Data and Annotation

Our aim is to automatically detect mimicry and to recognize affect based on mimicry analysis. To achieve this ultimate goal, some subgoals need to be reached. First, a multimodal database of interactional mimicry in social interactions needs to be set up; secondly, possible rules and algorithms of mimicry in interaction need to be explored, building on experimental social psychology. The aims in setting up a multimodal database of interactional mimicry in social interactions are to (1) understand and explore how people consciously and unconsciously employ and display mimicry behavior, (2) develop methods and design tools to automatically detect synchrony and mimicry in social interactions, (3) examine and annotate the implications of mimicry detection in terms of the social perceptions and emotions of the mimickers, and (4) develop social mimicry algorithms to be utilized by embodied conversational agents. In sum, the goal is to understand, by annotating, analyzing and modeling recorded data, when and why mimicry behavior happens and exactly which types of nonverbal behavior are involved in human face-to-face communication. Recently we finished collecting data from a large number of face-to-face interactions in an experimental setting. The recordings were made at Imperial College London in collaboration with the iBUG group of Imperial College. The setting and the interaction scenarios were aimed at eliciting natural multimodal mimicry and at exploring the relationship between the occurrence of mimicry and human affect (see Section 2). The corpus was recorded using a wide range of devices, including face-to-face-talking and fixed microphones and individual and room-view video cameras from different views, all of which produced auditory and visual output signals that are synchronized with each other. Two scenarios were followed in the experiments: a discussion on a political topic and a role-playing game. More than 40 participants were recruited. They also had to fill in questionnaires to report their felt experiences. The recordings and ratings are stored in a database. The interactions are being manually annotated for
many different phenomena, including dialogue acts, turn-taking, affect, and some head and hand gestures, body movements and facial expressions. Annotation includes annotating behavioral expressions for each participant separately, annotating the meaning expressed by those behavioral expressions, and annotating mimicry episodes. Some preliminary results on the automatic detection of mimicry episodes can be found in [21]. The corpus will be made available to the scientific community through a web-accessible database.
6 Conclusion

Embodied mimicry can provide important clues for investigations of human-human and human-agent interactions: first, as an indicator of cooperativeness and empathy, and second, as a means to enrich communication. The impact of a practical technology to mediate human interactions in real time would be enormous, both for society as a whole and for individuals (improving business relations, cultural understanding, communication in relationships, etc.). It would find immediate applications in areas such as adapting interactions to help people with less confidence, training people for improved social interactions, or in specific tools for tasks such as negotiation. This technology would also strongly influence science and technology (providing a powerful new class of research tools for social science and anthropology, for example). While the primary goal of such an effort would be to facilitate direct mediated communication between people, advances here would also facilitate interactions between humans and machines. Moreover, given the huge advances in computer vision and algorithmic gesture detection, coupled with the growing number of computers equipped with high-bandwidth connections and embedded video cameras, the potential for computer agents to detect, mimic, and implement human gestures and other behaviors is vast and promising. Together with the early findings in [21], this suggests that mimicry can be added to computer agents to improve the user's experience unobtrusively, that is to say, without the user noticing. It is worth mentioning again that the first main issue in our research is to explore, and later to analyze automatically, in what situations and to what extent mimicking behaviors occur.

Acknowledgments. We gratefully acknowledge the useful comments of the anonymous referees. This work has been funded in part by FP7/2007-2013 under grant agreement no. 231287 (SSPNet).
References 1. Chartrand, T.L., Bargh, J.A.: The chameleon effect: the perception-behavior link and social interaction. Journal of Personality and Social Psychology 76(6), 893–910 (1999) 2. Kopp, S.: Social resonance and embodied coordination in face-to- face conversation with artificial interlocutors. Speech Communication 52(6), 587–597 (2010) 3. Bailenson, J.N., Yee, N.: Digital chameleons. Psychological Science 16(10), 814–819 (2005)
4. Bailenson, J.N., Yee, N., Patel, K., Beall, A.C.: Detecting digital chameleons. Computers in Human Behavior 24(1), 66–87 (2008) 5. Chartrand, T.L., Jefferis, V.E.: Consequences of automatic goal pursuit and the case of nonconscious mimicry, pp. 290–305. Psychology Press, Philadelphia (2003) 6. Nagaoka, C., Komori, M., Nakamura, T., Draguna, M.R.: Effects of receptive listening on the congruence of speakers’ response latencies in dialogues. Psychological Reports 97, 265–274 (2005) 7. Hess, U., Blairy, S.: Facial mimicry and emotional contagion to dynamic emotional facial expressions and their influence on decoding accuracy. Int. J. Psychophysiology 40(2), 129–141 (2001) 8. Bernieri, F.J., Reznick, J.S., Rosenthal, R.: Synchrony, pseudosynchrony, and dissynchrony: Measuring the entrainment process in mother-infant interactions. Journal of Personality and Social Psychology 54(2), 243–253 (1988) 9. Yabar, Y., Johnston, L., Miles, L., Peace, V.: Implicit behavioral mimicry: Investigating the impact of group membership. Journal of Nonverbal Behavior 30(3), 97–113 (2006) 10. Giles, H., Powesland, P.F.: Speech style and social evaluation. Academic Press, London (1975) 11. Lakin, J.L., Chartrand, T.L., Arkin, R.M.: Exclusion and nonconscious behavioral mimicry: Mimicking others to resolve threatened belongingness needs (2004) (manuscript) 12. Bernieri, F.J.: Coordinated movement and rapport in teacher student interactions. Journal of Nonverbal Behavior 12(2), 120–138 (1998) 13. LaFrance, M.: Nonverbal synchrony and rapport: Analysis by the cross-lag panel technique. Social Psychology Quarterly 42(1), 66–70 (1979) 14. Chartrand, T.L., Maddux, W., Lakin, J.L.: Beyond the perception behavior link: The ubiquitous utility and motivational moderators of nonconscious mimicry. In: Hassin, R.R., Uleman, J.S., Bargh, J.A. (eds.) The New Unconscious, pp. 334–361. Oxford University Press, New York (2005) 15. Bourgeois, P., Hess, U.: The impact of social context on mimicry. Biol. Psychol. 77, 343–352 (2008) 16. Lakin, J., Chartrand, T.L.: Using nonconscious behavioral mimicry to create affiliation and rapport. Psychol. Sci. 14, 334–339 (2003) 17. Chartrand, T.L., van Baaren, R.: Chapter 5 Human Mimicry. Advances in Experimental Social Psychology 41, 219–274 (2009) 18. Jefferis, V.E., van Baaren, R., Chartrand, T.L.: The functional purpose of mimicry for creating interpersonal closeness. Ohio State University (2003) (manuscript) 19. Gueguen, N., Jacob, C., Martin, A.: Mimicry in social interaction: Its effect on human judgment and behavior. European Journal of Sciences 8(2), 253–259 (2009) 20. Reidsma, D., Nijholt, A., Tschacher, W., Ramseyer, F.: Measuring Multimodal Synchrony for Human-Computer Interaction. In: Proceedings International Conference on CYBERWORLDS, pp. 67–71. IEEE Xplore, Los Alamitos (2010) 21. Sun, X.F., Truong, K., Nijholt, A., Pantic, M.: Automatic Visual Mimicry Expression Analysis in Interpersonal Interaction. In: Proceedings Fourth IEEE Workshop on CVPR for Human Communicative Behavior Analysis. IEEE Xplore, Los Alamitos (2011)
Using TTS for Fast Prototyping of Cross-Lingual ASR Applications Jan Nouza and Marek Boháč Institute of Information Technology and Electronics, Technical University of Liberec Studentská 2, 461 17 Liberec, Czech Republic {jan.nouza,marek.bohac}@tul.cz
Abstract. In this paper we propose a method that simplifies the initial stages in the development of speech recognition applications that are to be ported to other languages. The method is based on cross-lingual adaptation of the acoustic model. In the search for an optimal mapping between the target and original phonetic inventories we utilize data generated in the target language by a high-quality TTS system. The data is analyzed by an ASR module that serves as a partly restricted phoneme recognizer. We demonstrate the method on the Czech-to-Polish adaptation of two prototype systems, one aimed at handicapped persons and another prepared for fluent dictation with a large vocabulary. Keywords: Speech recognition, speech synthesis, cross-lingual adaptation.
1 Introduction

As the number and variety of voice technology applications increase, the demand to port them to other languages becomes acute. One of the crucial issues in localizing already developed products for another language is the cost of the transfer. In ASR (Automatic Speech Recognition) systems, the major costs are related to the adaptation of their two language-dependent layers: the acoustic-phonetic one and the linguistic one. Usually, the latter task is easier to automate because it is based on statistical processing of texts, which are now widely available in digital form (e.g., on the Internet [1]). The former task takes significantly more human work, since it requires a large amount of annotated speech recordings and some deeper phonetic knowledge. These costs may be prohibitive if we aim at porting applications for special groups of clients, such as handicapped persons, where the number of potential users is small and the price of the products should be kept low. The research described in this paper has had three major goals. First, we were asked to transfer the voice tools developed for Czech handicapped persons to similar target groups in countries where these tools are not available. Second, we wanted to find a methodology that would make the transfer as fast and cheap as possible. And third, we wished to explore the limits of the proposed approach, to see whether it is also applicable to more challenging tasks. Our initial aim was to enable porting of the MyVoice and MyDictate tools to other (mainly Slavic) languages. The two programs were developed in our lab between 2004 and 2006. They enabled Czech motor-handicapped people to work with a PC in a
hands-free manner, with a large degree of flexibility and customization [2]. Very soon, a demand arose to port MyVoice to Slovak, and a few years later the software was also transferred to Spanish. The adaptation of the acoustic-phonetic layer of the MyVoice engine was done in a simple and straightforward way – by mapping the phonemes of the target language onto the original Czech ones [3]. In both cases, the mapping was conducted by experts who knew the phonetics of the target and original languages. As the demand for porting the voice tools to several other languages increases, we are searching for an alternative approach in which the expert can be (at least partly) replaced by a machine. In this paper, we investigate a method in which a TTS system together with an ASR-based tool tries to play the role of a ‘skilled phonetician’ whose aim is to find the optimal acoustic-phonetic mapping. The approach has been proposed and successfully tested on Polish. Our experiments show that the scheme yields promising results not only for small-vocabulary applications but also for a more challenging task, such as fluent dictation of professional texts. In the following sections, we briefly introduce the ASR systems developed for Czech. Then we focus on the issues related to their transfer to Polish. We describe the main differences between the two languages at the phonetic level and propose a method that utilizes the output of a Polish TTS system for creating an objective mapping of Polish orthography onto the Czech phonetic inventory. The proposed solution is simple and cheap, because it does not require human-made recordings or an expert in phonetics, and yet it seems applicable in the desired area.
2 ASR Systems Developed for the Czech Language

During the last decade we have developed two types of ASR engines: one for voice-command input and discrete-speech dictation, and another for fluent speech recognition with very large vocabularies. The former has proved useful mainly in applications where robust hands-free performance is the highest priority. This is the case, for example, for voice-controlled aids developed for motor-handicapped people. Voice commands and voice typing can help them very much if they are reliable, flexible, customizable and do not require high-cost computing power. The speed of typing, on the other hand, is slower, but this is not the crucial factor. The engine we have developed can operate with small vocabularies (tens or hundreds of commands) as well as with very large lexicons (up to 1 million words). It has recently been used in the MyVoice tool and in the MyDictate program. Both programs can run not only on low-cost PCs but also on mobile devices [4]. The latter engine is a large-vocabulary continuous-speech recognition (LVCSR) decoder. It has been developed for voice dictation and speech transcription tasks with regard to the specific needs of highly inflected languages, like Czech and other Slavic languages [5]. The recent version operates in real time with lexicons of up to 500 thousand words. Both engines use the same signal processing and acoustic modeling core. A speech signal is sampled at 16 kHz and parameterized every 10 ms into 39 MFCC features per frame. The acoustic model (AM) employs CDHMMs that are based either on context-independent phonetic units (monophones) or on context-dependent triphones. The latter yield slightly better performance, though the former are more compact,
require less memory and are more robust against pronunciation deviations. The last aspect is especially important if we consider using the AM for speech recognition in another language. The linguistic part of the systems consists of the lexicon (which can be general or application oriented) and the corresponding language model (LM). The LM used in the simpler systems has the form of a fixed grammar; in the dictation and transcription systems it is based on bigrams. The final applications (e.g., the programs MyVoice, MyDictate and FluentDictate) have been developed for Czech. Yet the engines themselves are language-independent. If the above programs are to be used in another language, we need to provide them with a new lexicon, a corresponding LM and an AM that fits the coding used for the pronunciation part of the lexicon.
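The front-end described above (16 kHz signal, 10 ms frame shift, 39 MFCC-based features per frame) can be approximated with standard tools. The sketch below uses librosa and is only an approximation of the parameterization used by the engines: the window length, the use of delta and delta-delta coefficients to reach 39 dimensions, and the absence of any energy or cepstral-mean normalization are our assumptions, not details stated in the paper.

```python
import numpy as np
import librosa

def mfcc39(path):
    """13 MFCCs + delta + delta-delta every 10 ms from 16 kHz audio."""
    y, sr = librosa.load(path, sr=16000)                    # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=400, hop_length=160)  # 25 ms window, 10 ms shift
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T                      # shape: frames x 39
```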
3 Case Study: Adapting an ASR System to Polish

In the following part, we present a method that allows us to adapt the acoustic model of an ASR system to a new language. Its main benefit is that only a minimal amount of speech data needs to be recorded and annotated for the target language. Instead of recording human-produced speech, we employ data generated by a high-quality TTS system. We then analyze this data with an ASR system in order to find the optimal mapping between the phonemes of the target language and the phonetic inventory of the original acoustic model. Moreover, the ASR system serves as an automatic transducer that proposes and evaluates the rules for transcribing the orthographic form of words in the target language into pronunciation forms based on the phonemes of the original language. It is evident that an AM created for the new language using the above approach cannot perform as well as an AM trained directly on target-language data. Therefore, we need to evaluate how good this adapted AM is. For this purpose we utilize the TTS again, this time as a generator of test data: we perform speech recognition tests and compare the results with those achieved for the same utterances produced by human speakers. In the next sections, we illustrate the method on a case study in which two Czech ASR systems have been adapted for Polish.

3.1 Czech vs. Polish Phonology

Czech and Polish belong to the same West branch of the Slavic languages; however, they differ significantly on the lexical as well as the phonetic level. The phonetic inventory of Czech consists of 10 vowels (5 short and 5 long) plus a very rare schwa, and 30 consonants. All are listed in Table 1, where each phoneme is represented by its SAMPA symbol [6]. (In this text, we prefer the SAMPA notation to IPA because it is easier to type and read.) Polish phonology [8] recognizes 8 vowels and 29 consonants. Their list, with SAMPA symbols [9], is given in Table 2. By comparing the two tables we can see that there are 3 vowels (I, e~, o~) and 5 consonants (ts', dz', s', z', w) that are specific to Polish. All the other phonemes have their counterparts in Czech. (Note that the symbol n' used in Polish SAMPA is equivalent to J in Czech SAMPA.)
Table 1. Czech phonetic inventory

Groups            SAMPA symbols
Vowels (11)       a, e, i, o, u, a:, e:, i:, o:, u:, @ (schwa)
Consonants (30)   p, b, t, d, c, J\, k, g, ts, dz, tS, dZ, f, v, s, z, S, Z, X, h\, Q\, P\, j, r, l, m, n, N, J, F

Table 2. Polish phonetic inventory

Groups            SAMPA symbols
Vowels (8)        a, e, i, o, u, I, e~, o~
Consonants (29)   p, b, t, d, k, g, ts, dz, tS, dZ, f, v, s, z, S, Z, X, ts', dz', s', z', w, j, r, l, m, n, N, n' (equivalent to Czech J)
3.2 How to Map Polish Phonemes to the Czech Phoneme Inventory?

In our previous research on cross-lingual adaptation, we transferred a Czech ASR system to Slovak [10] and to Spanish [3]. In both cases, the Czech acoustic model was used and the language-specific phonemes were mapped to Czech ones. The mapping was designed by experts who knew the original and the target language. An alternative to this expert-driven method is a data-driven approach, e.g., the one described in [11], where the similarity between phonemes in two languages is measured by the Bhattacharyya distance. However, this method requires quite a lot of recorded and annotated data in both languages. In this paper, we propose an approach in which the data for the target language are generated by a TTS system and the mapping is controlled by an ASR system. The main advantage is that the data can be produced automatically, on demand, and in the amount and structure that is needed.

3.3 Phonetic Mapping Based on TTS Output Analyzed by an ASR System

The key component is a high-quality TTS system. For Polish, we have chosen the IVONA software [12]. It employs an algorithm that produces almost natural speech by concatenating properly selected units from a large database of recordings. The software has won several awards in TTS competitions [13, 14]. Currently, it offers four different voices (two male and two female), which – for our purpose – introduces an additional degree of voice variety. The software can be tested via its web pages [12]: any text typed into its input box is immediately converted into an utterance. The second component is an ASR system operating with the given acoustic model (the Czech one in this case). It is arranged so that it works as a partly restricted phoneme recognizer. The ASR module takes a recording, transforms it into a series of feature vectors X = x(1), ..., x(t), ..., x(T) and outputs the most probable sequence of phonemes p1, p2, ..., pN. The output includes the phonemes, their times and their likelihoods. The module is called with several parameters, as shown in the example below:
Recording_name:      maslo-Ewa.wav
Recorded_utterance:  masło
Pronunciation:       mas?o
Variants:            ? = u | l | uv

In the above example, the recognizer takes the given sound file, processes it and evaluates which of the proposed pronunciations is best. The output looks like this:

1. masuo   -  avg. likelihood = -77.417
2. masuvo  -  avg. likelihood = -77.956
3. maslo   -  avg. likelihood = -78.213
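The selection step above amounts to expanding each pronunciation pattern into its candidate variants and keeping the one with the highest average log-likelihood. The sketch below only illustrates that bookkeeping; score_pronunciation is a placeholder for the actual forced phoneme-recognition pass and is not part of the described system.

```python
def expand_variants(pattern, variants, placeholder="?"):
    """Expand e.g. 'mas?o' with ['u', 'l', 'uv'] into candidate pronunciations."""
    return [pattern.replace(placeholder, v) for v in variants]

def pick_best(recording, pattern, variants, score_pronunciation):
    """Return (average log-likelihood, pronunciation) of the best candidate.

    score_pronunciation(recording, pron) is assumed to run the restricted
    phoneme recognizer and return an average log-likelihood such as -77.417.
    """
    candidates = expand_variants(pattern, variants)
    scored = [(score_pronunciation(recording, p), p) for p in candidates]
    return max(scored)   # e.g. (-77.417, 'masuo')
```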
We can see that, for the given recording and the given AM, it is the Czech phoneme ‘u’ that fits the Polish letter ‘ł‘ (and the corresponding phoneme ‘w’) best. The module also provides rich information from the phonetic decoding process (phoneme boundaries, likelihoods in frames, etc.), which can be used for detailed study, as shown in Fig. 1.
Fig. 1. Diagrams showing log likelihoods in frames of speech generated by TTS system (voices Ewa, Jacek, Jan, Maja). Different pronunciation variants of Polish letter ‘ł‘ in word ‘masło‘ can be compared.
Using the TTS software we recorded more than 50 Polish words, each spoken by the four available voices. The words were selected so that all the Polish-specific phonemes occurred in various positions and contexts (e.g., at the start, in the middle, at
the end of words, in specific phonetic clusters, etc.). For each word, we offered the phoneme recognizer several pronunciation alternatives to choose from. In most cases, the recognizer's output was consistent, in the sense that the same best Czech phonemes were assigned to the Polish ones across different words and across the four TTS voices. In some cases, however, the mapping turned out to be context-dependent; e.g., Polish ‘rz’ was mapped to the Czech phonemes ‘Z’, ‘Q\’ or ‘P\’, depending on the context. The results are summarized in Table 3. We can see that the resulting map covers not only the phoneme-to-phoneme relations but also the grapheme-to-phoneme conversion. It is also interesting to compare these objectively derived mappings with subjective perception. Since Poland and the Czech Republic are neighboring countries, Czech people have many chances to hear spoken Polish and to use some Polish words, such as proper and geographical names. The subjective perception of some Polish-specific phonemes seems to differ from what has been found by the objective investigation. For example, Czech people tend to perceive Polish ‘I’ as ‘i’ (the reason being that the letters ‘i’ and ‘y’ are pronounced in the same way in Czech – as ‘i’). Also, the Polish letter pair ‘rz’ is usually considered equivalent to Czech ‘ř’, which is not always true. The method described above shows that the ASR machine (equipped with the given acoustic model) has a different perception. This perception, however, is the objective one, because it is the ASR system that will have to perform the recognition task.

Table 3. Polish orthography and phonemes mapped to the Czech phonetic inventory
Letter(s) in Polish      Polish phoneme(s)   Mapping to Czech
orthography              (SAMPA)             phoneme(s) (SAMPA)
y                        I                   e, (schwa)
ó                        u                   u
ę                        e~                  e+n, (e+N)
ą                        o~                  o+n, (o+N)
dz                       dz                  dz
ź / z(i)                 z'                  Z
ś / s(i)                 s'                  S
dź / dz(i)               dz'                 dZ
ć / c(i)                 ts'                 tS
ż                        Z                   Z
rz                       Z                   Z (Q\ or P\ in clusters trz, drz)
sz                       S                   S
dż                       dZ                  dZ
cz                       tS                  tS
ń / n(i)                 n'                  J
h, ch                    X                   X
ł                        w                   u
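The rules in Table 3 can be applied as a greedy, longest-match rewrite over the Polish orthography. The sketch below covers only the context-independent rows of the table and deliberately ignores the harder cases (the ‘rz’ behavior in ‘trz’/‘drz’ clusters and the softening ‘i’ contexts); it illustrates the mechanism of the conversion, not the actual transducer used in the system.

```python
# Subset of Table 3: Polish letters -> Czech SAMPA phoneme strings.
RULES = [
    ("dź", "dZ"), ("dż", "dZ"), ("dz", "dz"),
    ("ch", "X"), ("cz", "tS"), ("sz", "S"), ("rz", "Z"),
    ("ą", "o n"), ("ę", "e n"), ("ó", "u"), ("y", "e"),
    ("ź", "Z"), ("ś", "S"), ("ć", "tS"), ("ż", "Z"),
    ("ń", "J"), ("ł", "u"), ("h", "X"),
]

def to_czech_sampa(word):
    """Greedy left-to-right, longest-match rewrite; other letters pass through."""
    rules = sorted(RULES, key=lambda r: -len(r[0]))   # longest graphemes first
    out, i = [], 0
    while i < len(word):
        for src, dst in rules:
            if word.startswith(src, i):
                out.append(dst)
                i += len(src)
                break
        else:
            out.append(word[i])
            i += 1
    return " ".join(out)

print(to_czech_sampa("masło"))   # -> 'm a s u o'
```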
3.4 Evaluation on a Small-Vocabulary Task

The first task, in which we tested the proposed method and evaluated the resulting mapping, was Polish voice-command control, the same as in the MyVoice tool. The basic lexicon in this application consists of 256 commands, such as the names of letters, digits, keys on the PC keyboard, mouse actions, names of computer programs, etc. These commands were translated into Polish, their pronunciations were created automatically using the rules in Table 3, and after that they were recorded by the IVONA TTS (all four voices) and by two Polish speakers. All the recordings were passed to the MyVoice ASR module operating with the original Czech AM. The experiment was to show us how well this cross-lingual application performs and whether there is a significant difference between the recognition of synthetic and human speech. It also allowed us to compare the objectively derived mapping with the subjective phoneme conversion mentioned in Section 3.3. The results are included in Table 4. We can observe that the performance, measured by the Word Recognition Rate (WRR), is considerably high both for the TTS data and for the human speakers. The results are comparable to those achieved for Czech, Slovak and Spanish [3].

3.5 Evaluation on Fluent Speech Dictation with a Large Lexicon

The second task was to build a very preliminary version of a Polish voice dictation system for radiology. In this case, we used data (articles, annotations, medical reports) available online at [15]. For this purpose, we collected a small corpus (approx. 2 MB) of radiology texts and created a lexicon made of the 23,060 most frequent words. Their pronunciations were derived automatically using the rules in Table 3. The bigram language model was computed on the same corpus. To test the prototype system, we selected three medical reports not included in the training corpus. They were again recorded by the IVONA software (four times, with the four different voices) and by two native speakers. The results from this experiment are also part of Table 4. The WRR values are about 8–10% lower than for the Czech dictation system for radiology, but it should be noted that our main aim was to test the proposed fast prototyping technique; the complete design of this demo system took just one week. It is also interesting to compare the results achieved with the TTS data to the human-produced ones. We can see that the TTS speech yielded slightly better recognition rates. This is not surprising, as we have already observed it in our previous investigations [16]. In any case, we can see that the TTS utterances can be used during the development process as a cheap source of benchmarking data.

Table 4. Results from speech recognition experiments in Polish
Task                                           Lexicon size   WRR [%]
Voice commands – TTS data                      256            97.8
Voice commands – human speech                  256            96.6
Fluent dictation (radiology) – TTS data        23060          86.4
Fluent dictation (radiology) – human speech    23060          83.7
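For completeness, the Word Recognition Rate used in Table 4 can be computed from a standard edit-distance alignment of the reference and recognized word sequences. The exact definition used above is not stated in the text, so the sketch below assumes the common correctness measure (N − S − D) / N, where N is the number of reference words and S and D are substitutions and deletions.

```python
def wrr(reference, hypothesis):
    """Word recognition rate (N - S - D) / N in percent, via Levenshtein alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]      # d[i][j]: edit cost ref[:i] vs hyp[:j]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # Backtrack to count substitutions and deletions.
    i, j, subs, dels = n, m, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            j -= 1
    return 100.0 * (n - subs - dels) / n
```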
4 Discussion and Conclusions

The results of the two experiments show that the proposed combination of TTS data and ASR-driven mapping is applicable to the rapid prototyping of programs that are to be transferred to other languages. The TTS system for the target language should, of course, be of high quality, and it is an advantage if it offers multiple voices. If this is the case, we can obtain not only the required L2-L1 phonetic mapping but also a grapheme-to-phoneme conversion table that helps in generating pronunciations for the lexicon of the target application. Moreover, the TTS system can serve as a cheap source of the test data needed for preliminary evaluations. The results obtained in the first experiment show that the created lexicon (with its automatically derived pronunciations) could be used immediately in the Polish version of the MyVoice software. Even though the internal acoustic model is Czech, we can expect the overall system performance to be at a level similar to that for Czech users. Most importantly, no Polish data needed to be recorded and annotated during prototype development, so the whole process could be fast and cheap. Furthermore, we showed that the phonetic mapping generated by the combination of TTS and ASR systems leads to more objective and better results than a mapping based on subjective perception. In the second experiment we demonstrated that the same automated approach can also be utilized in a more challenging task, during the initial phase of the development of a dictation system. Within a very short time we were able to create a Polish version of the program that can be used for demonstration purposes, for getting potential partners interested and for allowing at least initial testing with future users.

Acknowledgments. The research was supported by the Grant Agency of the Czech Republic (grant no. 102/08/0707).
References 1. Vu, N.T., Schlippe, T., Kraus, F., Schultz, T.: Rapid Bootstrapping of five Eastern European Languages using the Rapid Language Adaptation Toolkit. In: Proc. of Interspeech 2010, Japan, Makuhari, pp. 865–868 (2010) 2. Cerva, P., Nouza, J.: Design and Development of Voice Controlled Aids for MotorHandicapped Persons. In: Proc. of Interspeech 2007, Antwerp, pp. 2521–2524 (2007) 3. Callejas, Z., Nouza, J., Cerva, P., López-Cózar, R.: Cost-Efficient Cross-Lingual Adaptation of a Speech Recognition System. In: Advances in Intelligent and Soft Computing, vol. 57, pp. 331–338. Springer, Heidelberg (2009) 4. Nouza, J., Cerva, P., Zdansky, J.: Very Large Vocabulary Voice Dictation for Mobile Devices. In: Proc. of Interspeech 2009, UK, Brighton, pp. 995–998 (2009) 5. Nouza, J., Zdansky, J., Cerva, P., Silovsky, J.: Challenges in Speech Processing of Slavic Languages (Case Studies in Speech Recognition of Czech and Slovak). In: Esposito, A., Campbell, N., Vogel, C., Hussain, A., Nijholt, A. (eds.) Development of Multimodal Interfaces, COST Seminar 2009. LNCS, vol. 5967, pp. 225–241. Springer, Heidelberg (2010) 6. Czech SAMPA, http://noel.feld.cvut.cz/sampa/ 7. Nouza, J., Psutka, J., Uhlir, J.: Phonetic Alphabet for Speech Recognition of Czech. Radioengineering 6(4), 16–20 (1997)
8. Gussman, E.: The Phonology of Polish. Oxford University Press, Oxford (2007) 9. Polish SAMPA, http://www.phon.ucl.ac.uk/home/sampa/polish.htm 10. Nouza, J., Silovsky, J., Zdansky, J., Cerva, P., Kroul, M., Chaloupka, J.: Czech-to-Slovak Adapted Broadcast News Transcription System. In: Proc. of Interspeech 2008, Australia, Brisbane, pp. 683–2686 (September 2008) 11. Kumar, S.C., Mohandas, V.P., Li, H.: Multilingual Speech Recognition: A Unified Approach. In: Proc. of Interspeech 2005, Portugal, Lisboa, pp. 3357–3360 (2005) 12. IVONA TTS system, http://www.ivona.com/ 13. Kaszczuk, M., Osowski, L.: Evaluating Ivona Speech Synthesis System for Blizzard Challenge 2006. In: Blizzard Workshop, Pittsburgh (2006) 14. Kaszczuk, M., Osowski, L.: The IVO Software Blizzard 2007 Entry: Improving Ivona Speech Synthesis System. In: Sixth ISCA Workshop on Speech Synthesis, Bonn (2007) 15. http://www.openmedica.pl/ 16. Vich, R., Nouza, J., Vondra, M.: Automatic Speech Recognition Used for Intelligibility Assessment of Text-to-Speech Systems. In: Esposito, A., Bourbakis, N.G., Avouris, N., Hatzilygeroudis, I. (eds.) HH and HM Interaction. LNCS (LNAI), vol. 5042, pp. 136–148. Springer, Heidelberg (2008)
Towards the Automatic Detection of Involvement in Conversation
Catharine Oertel1, Céline De Looze1, Stefan Scherer2, Andreas Windmann3, Petra Wagner3, and Nick Campbell1
1 Speech Communication Laboratory, Trinity College Dublin, Ireland
2 University of Ulm, Germany
3 Bielefeld University, Germany
Abstract. Although an increasing amount of research has been carried out into human-machine interaction in the last century, even today we are not able to fully understand the dynamic changes in human interaction. Only when we achieve this will we be able to go beyond a one-to-one mapping between text and speech and be able to add social information to speech technologies. Social information is expressed to a high degree through prosodic cues and through movement of the body and the face. The aim of this paper is to use those cues to make one aspect of social information more tangible, namely participants' degree of involvement in a conversation. Our results for voice span and intensity, and our preliminary results on the movement of the body and face, suggest that these cues are reliable for the detection of distinct levels of participants' involvement in conversation. This will allow for the development of a statistical model which is able to classify these stages of involvement. Our data indicate that involvement may be a scalar phenomenon. Keywords: Social involvement, multi-modal corpora, discourse prosody.
1 Introduction
Language and speech, and later writing systems, have evolved to serve human communication. In today's society human-machine interaction is becoming more and more ubiquitous. However, despite more than half a century of research in speech technology, neither computer scientists, linguists nor phoneticians have yet reached a full understanding of how the variations in speech function as a means of human communication and social interaction. A one-to-one mapping between text and speech is not sufficient to capture the social information exchanged in human interaction. What makes a conversation a naturally interactive dialogue are the dynamic changes involved in spoken interaction. We propose that these changes might be explained by the concept of involvement. Following Antil [1], we define involvement as “the level of perceived personal importance and/or interest evoked by a stimulus (or stimuli) within a specific situation” [1].
Moreover, we consider involvement in our study to be a scalar phenomenon. Contrary to Wrede & Shriberg [2], who define involvement as a binary phenomenon, we agree with Antil that “involvement must be conceptualized and operationalized as a continuous variable, not as a dichotomous variable” [1]. Similar to Dillon [3], who uses a slider to let participants indicate their level of emotional engagement, we used a scale from 1 to 10 in our annotation scheme to indicate distinct levels of involvement. Studies on involvement [2], [4], or on related concepts such as emotional engagement [5], [6], interest [7], or interactional rapport [8], have reported that these phenomena are conveyed by specific prosodic cues. For example, Wrede and Shriberg [2], in their study on involvement, found an increase in the mean and range of the fundamental frequency (F0) in more activated speech, as well as a tense voice quality. Moreover, Crystal and Davy [9] reported that, in live cricket commentaries, the more the commentator is involved in reporting the action (i.e. at the action peak), the quicker the speech rate.
2 Main Objectives and Hypotheses
In our study we looked at how prosodic parameters as well as visual cues may be used to indicate levels of involvement. A statistical model based on these cues would enable the automatisation of involvement detection. Automatic involvement detection allows for a time efficient search through multimodal corpora, and may be used for interactive speech synthesis. The prosodic parameters (i.e. F0, duration and intensity) include level and span of the voice, articulation rate (i.e. excluding pauses) and intensity of the voice. The visual parameter includes the participants’ amount of change in movement of the body and face. Based on studies [2–9] our hypotheses are: the higher the degree of involvement, (1) the higher the level and (2) the wider the span of the voice, (3) the quicker the articulation rate, (4) the higher the intensity and (5) the higher the amount of movement in the face and body of the participants.
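Assuming F0 and intensity tracks have already been extracted (for example with Praat, which is used later in Section 3.4), the acoustic cues listed above reduce to a few lines of array arithmetic. The sketch below anticipates the definitions given in Section 3.4 (F0 median for level, span expressed in octaves, the percentile-based pitch floor/ceiling, and z-score normalization per speaker); it is a hedged illustration, not the authors' script, and the handling of unvoiced frames is our assumption.

```python
import numpy as np

def prosodic_cues(f0_hz, intensity_db):
    """F0 level (median, Hz), F0 span (octaves) and mean intensity (dB)
    from per-frame tracks; zero/NaN F0 frames are treated as unvoiced."""
    f0 = np.asarray(f0_hz, dtype=float)
    f0 = f0[np.isfinite(f0) & (f0 > 0)]
    # Pitch floor/ceiling rule described in Section 3.4 (q15*0.83, q65*1.92),
    # applied here to discard likely octave errors before measuring the cues.
    floor = np.percentile(f0, 15) * 0.83
    ceiling = np.percentile(f0, 65) * 1.92
    f0 = f0[(f0 >= floor) & (f0 <= ceiling)]
    level = float(np.median(f0))
    span = float(np.log2(f0.max() / f0.min()))
    return level, span, float(np.mean(intensity_db))

def zscore(values):
    """Per-speaker normalization applied before comparing involvement levels."""
    v = np.asarray(values, dtype=float)
    return (v - v.mean()) / v.std()
```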
3 Experiment

3.1 Data Collection: The D64 Corpus
We used the D64 corpus [10] for this study. It was recorded over two successive days in a rented apartment, resulting in a total of eight hours of multimodal recordings. Five participants took part on the first day and four on the second. Three of the participants were male and two female. They were colleagues and/or friends (with the exception of one naive participant), ranging in age from early twenties to early sixties. They were able to move freely around as well as to eat and drink refreshments as in normal daily life. The conversation was not directed and ranged widely over topics both trivial and technical.
3.2 Data Selection
For our analysis, all five speakers were included. Data were chosen from two different recording sessions, Session 1 and Session 2 (a total of 1 hour of recording). For Session 1, there was no predefined topic, and the conversation was allowed to meander freely. For Session 2, the first author's Master's research was among the topics of discussion. Speaking time per speaker varies between 1 and 15 minutes (mean = 9 min; sd = 5.15).

3.3 Data Annotation
We developed an annotation scheme based on hearer-independent, intuitive impressions [11] and annotated approximately 1 hour of video recordings for levels of involvement. The annotation scheme was validated perceptually and was combined with acoustic analysis and movement data. Our measure of involvement comprises the joint involvement of the entire group. Involvement annotations are based on the following criteria. Involvement level 1 is reserved for cases in which virtually no interaction is taking place, in which interlocutors are not taking notice of each other at all and are engaged in completely different pursuits. Involvement level 2 is a less extreme variant of level 1. Involvement level 3 is annotated when subgroups emerge; for example, in a conversation with four participants, two subgroups of two interlocutors each would be talking about different subjects and ignoring the respective other subgroup. Involvement level 4 is annotated when only one conversation is taking place, while for involvement level 5 interlocutors also need to show mild interest in the conversation. Involvement level 6 is annotated when the conditions for level 5 are fulfilled and interlocutors encourage the turn-holder to carry on. Involvement level 7 is annotated when interlocutors show increased interest and actively contribute to the conversation. For involvement level 8, interlocutors must fulfil the conditions for level 7 and contribute even more actively to the conversation; they might, for example, jointly and wholeheartedly laugh, or totally freeze following a remark of one of the participants. Involvement level 9 is annotated when interlocutors show absolute, undivided interest in the conversation and in each other and vehemently emphasise the points they want to make; participants signal that they either strongly agree or disagree with the turn-holder. Involvement level 10 is an extreme variant of level 9. A ten-point scale was chosen for annotation, but only values 4–9 were actually used in the annotations. This fact might be explained by the calm and friendly nature of the conversation. The numbers of occurrences of involvement levels 4 and 9 were statistically insufficient, and these levels were thus excluded from further analysis.

3.4 Measurements and Statistical Analyses
Acoustic measurements were obtained using the software Praat [12]. The level and span of the voice were measured by calculating the F0 median (the mean
being too sensitive to erroneous values) and log2(F0max/F0min), respectively. The F0 level is given on a linear scale (i.e., Hertz), while the F0 span is given on a logarithmic scale (i.e., octaves). In order to avoid possible pitch tracking errors, the pitch floor and pitch ceiling were set to the values q15 · 0.83 (where ‘q’ stands for percentile) and q65 · 1.92 (De Looze [13]). Articulation rate was calculated as the number of syllables per second. Syllables were detected automatically using the prominence detection tool developed by Tamburini [14]. In order to neutralise speaker differences in voice level and span, articulation rate and intensity, the data were normalised by a z-score transformation. For the movement extraction, an algorithm was chosen which is not restricted to calculating movement changes for the whole picture but can do so for individual people (note that movement measurements were calculated in this study for only two speakers). From the video data, the coordinates of the faces and bodies at each frame, given by the exact positions of the top left and bottom right corners of the face, are extracted as in Scherer et al. [15] by using the standard Viola-Jones algorithm [16]. Since these coordinates are highly dependent on the distance of the person to the camera, normalisation is carried out to obtain movement relative to the size of the detected face and body. A moving average is calculated only for frames in which a face is recognised. ANOVA analyses were carried out for the above-mentioned cues.
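A rough sketch of this movement extraction is given below, using OpenCV's stock Viola-Jones face detector. The region of interest around the detected face, the normalization by face area and the moving-average window are our own assumptions made for illustration; they approximate, but do not reproduce, the procedure of Scherer et al. [15].

```python
import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def movement_track(video_path, smooth=25):
    """Per-frame movement: inter-frame absolute difference in a region around
    the detected face, normalized by face size and smoothed with a moving
    average; only frames with a detected face contribute."""
    cap = cv2.VideoCapture(video_path)
    prev, track = None, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, 1.3, 5)
        if len(faces) and prev is not None:
            x, y, w, h = faces[0]
            roi = (slice(max(y - h, 0), y + 3 * h),      # face plus upper body
                   slice(max(x - w, 0), x + 2 * w))
            diff = cv2.absdiff(gray[roi], prev[roi])
            track.append(diff.mean() / float(w * h))     # relative to face size
        prev = gray
    cap.release()
    if not track:
        return np.array([])
    kernel = np.ones(smooth) / smooth
    return np.convolve(track, kernel, mode="same")
```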
3.5 Results
Level and Span of the Voice. As illustrated in Figure 1, involvement level 6 is significantly higher than involvement level 5 (F(3,1041)=8.843; p=0.006370) and involvement level 8 is significantly higher than involvement level 7 (F(1,440)=6.58; p=0.0106). Involvement level 7 is, however, not significantly higher than level 6 (F(2,830)=4.899; p=0.35040).

Fig. 1. Boxplots of F0-median and F0-max/min according to four levels of involvement

The acoustic cue F0-max/min, as illustrated in Figure 1, increases with involvement. While involvement level 7 is significantly higher than involvement level
6 (F(2,831)=22.82; p=7.96e-08), involvement level 6 is not significantly higher than involvement level 5 (F(3,1041)=18.31; p=0.6325) and involvement level 8 is not significantly higher than involvement level 7 (F(1,440)=21.2; p=0.274).

Articulation Rate. The acoustic cue articulation rate does not show any significant changes. The articulation rate of the individual speakers stays approximately the same over the various involvement levels.

Intensity. The acoustic cue intensity shows an increasing slope, as can be seen in Figure 2. While involvement level 6 is significantly higher than involvement level 5 (F(3,1130)=139.5; p=1.62e-05) and involvement level 7 is significantly higher than involvement level 6 (F(2,889)=121; p<2e-16), involvement level 8 is not significantly different from involvement level 7 (F(1,453)=0.223; p=0.637).
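The F and p values reported here come from one-way ANOVAs over the z-scored cue values grouped by annotated involvement level. A comparable test can be run with SciPy, as in the minimal sketch below; the grouping of frames by level is assumed, and this is not the authors' analysis script.

```python
from scipy import stats

def involvement_anova(values, levels):
    """One-way ANOVA of a cue (e.g., z-scored intensity) across the annotated
    involvement levels; returns the F statistic and the p value."""
    groups = [[v for v, lev in zip(values, levels) if lev == level]
              for level in sorted(set(levels))]
    return stats.f_oneway(*groups)
```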
Fig. 2. Boxplots of intensity according to four levels of involvement
Fig. 3. Boxplots of Movement for speaker F according to four levels of involvement
Movement. For movement of the body and face, it can be seen in Figure 3 that for speaker F there is an increasing slope. Involvement level 6 is significantly higher than involvement level 5 (F(3,611)=11.67; p=0.00484). However, involvement level 7 is not significantly different from involvement level 6 (F(2,464)=5.617; p=0.04029) and involvement level 8 is not significantly different from involvement level 7 (F(1,240)=2.152; p=0.144). For movement of the body and face of speaker C, we present the results separately for Sessions 1 and 2, since the trends of the two sessions differ significantly here. It can be seen in Figure 4 that there is an increasing slope for Session 1. Except for the increase from involvement level 6 to involvement level 7 (F(2,192)=3.913; p=0.00753), this increase is, however, not significant for the other levels. It can also be seen in Figure 4 that there is a decreasing slope for Session 2 over involvement levels 6, 7 and 8, where the decrease from involvement level 6 to 7 is significant (F(2,269)=7.497; p=0.000256).
Fig. 4. Boxplots of Movement for speaker C in Session 1 and 2 according to four levels of involvement
4 Discussion
In this study, we examined the prosodic parameters correlated with the dynamic changes that characterise social conversation. We looked at the level and span of the voice, intensity and articulation rate. We confirmed the findings of Wrede and Shriberg [2] that the level and span of the voice, as well as the intensity, increase in more activated speech. Contrary to their binary distinction, however, our data suggest that involvement is a scalar rather than a binary phenomenon. We found a clear linear relationship between our perceptual measure of involvement and the level and span of the voice as well as intensity. Wrede and Shriberg make no mention of articulation rate; we looked at articulation rate and found no relationship.
In a pilot study we added a multimodal aspect to our analysis by considering automatically extracted measures of body and head movement. We found this parameter to be well correlated with our perceptual measures of involvement. Our movement analysis (based on two speakers) indicates for speaker F that the higher the involvement, the greater the amount of movement. For speaker C, however, the movement in Session 2 does not fit this pattern. This may perhaps be explained by the fact that speaker C held a laptop on her lap in Session 2, hiding her hands for part of the session. Our analyses based on the level and span of the voice and on intensity suggest that involvement is a scalar phenomenon. Furthermore, the preliminary measures of movement appear to correlate strongly with the acoustic parameters, so it might be advantageous to merge them to obtain a more robust automatic measure of involvement. Further analysis will be carried out to confirm our preliminary results. Our current and future work involves building a statistical model incorporating both sources of information, in order to clarify their mutual information. The number of levels needed to quantify involvement is not clear at the moment, but we will continue to use a scale of one to ten.
5 Conclusion
Our study confirmed that social information is expressed to a high degree through prosodic cues and through movement of the body and face. The aim of this paper was to use those cues to make one aspect of social information more tangible, namely participants' degree of involvement in a conversation. Our results for voice span and intensity, and our preliminary results on the movement of the body and face, suggest that these cues are reliable for the detection of distinct levels of participants' involvement in conversation. This will allow for the development of a statistical model which is able to classify these stages of involvement. Such a model would have applications in automatic multimodal corpus search, automatic spoken dialog systems, robotics, games and other such technologies.
References 1. Antil, J.H.: Conceptualization and Operationalization of Involvement. Advances in Consumer Research 11(1), 203–209 (1984) 2. Wrede, B., Shriberg, E.: Spotting Hot Spots in Meetings: Human Judgements and Prosodic Cues. In: Proceedings of Eurospeech 2003, Geneva, pp. 2805–2808 (2003) 3. Dillon, R.: Lecture Notes in Computer Science: A Possible Model for Predicting Listener’s Emotional Engagement. Springer, Heidelberg (2006) 4. Selting, M.: Emphatic speech style: with special focus on the prosodic signalling of heightened emotive involvement in conversation. Journal of pragmatics 22(3-4), 375–408 (1994) 5. Gustafson, J., Neiberg, D.: Prosodic cues to engagement in non- lexical response tokens in Swedish. In: DiSS-LPSS Joint Workshop 2010, Tokyo, Japan (2010) 6. Yu, C., Aoki, P.M., Woodruff, A.: Detecting user engagement in everyday conversations. In: 8th International Conference on Spoken Language Processing (ICSLP 2004), Jeju Island, Korea, pp. 1329–1332 (2004)
7. Gatica-Perez, D.: Modeling Interest in Face-to-Face Conversations from Multimodal Nonverbal Behavior. In: Thiran, J.-P., Bourlard, H., Marques, F. (eds.) Multimodal Signal Processing, pp. 309–323. Academic Press, San Diego (2009) 8. Duncan, S., Baldenebro, T., Lawandow, A., Levow, G.-A.: Multi-modal Analysis of Interactional Rapport in Three Language Cultural Groups. In: Workshop on Modeling Human Communication Dynamics, Vancouver, B.C., Canada, pp. 42–45 (2010) 9. Crystal, D., Davy, D.: Investigating English Style. Longman Group. Ltd., London (1969) 10. Oertel, C., Cummins, F., Campbell, N., Edlund, J., Wagner, P.: D64: a corpus of richly recorded conversational interaction. In: Proceedings of LREC 2010; Workshop on Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality, Valetta, pp. 27–30 (2010) 11. Oertel, C.: Identification of Cues for the Automatic Detection of Hotspots. Bielefeld University, Bielefeld (2010) (unpublished) 12. Boersma, P., Weenink, D.: Praat: doing phonetics by computer 13. De Looze, C., Hirst, D.J.: Integrating changes of register into automatic intonation analysis. In: Proceedings of the Speech Prosody 2010 Conferene, Chicago, 4 pages (2010) 14. Tamburini, F., Wagner, P.: On automatic prominence detection for german. In: Proceedings of Interspeech 2007, Antwerp, pp. 1809–1802 (2007) 15. Scherer, S., Campbell, N.: Multimodal laughter detection in natural discourses. In: Proceedings of the 3rd International Workshop on Human-Centered Robotic Systems (HCRS 2009), pp. 111–121 (2009) 16. Viola, P., Jones, M.J.: Robust real-time face detection. International Journal of Computer Vision 57(2), 137–154 (2004)
Extracting Sentence Elements for the Natural Language Understanding Based on Slovak National Corpus
Stanislav Ondáš, Jozef Juhár, and Anton Čižmár
Faculty of Electrical Engineering and Informatics, Technical University of Košice, Park Komenského 13, 040 01 Košice, Slovakia
{stanislav.ondas,jozef.juhar,anton.cizmar}@tuke.sk
Abstract. This paper introduces an approach for extracting sentence elements from Slovak sentences based on linguistic analysis. The key idea lies in the assumption that sentence elements relate to the meaning of a sentence and can be helpful in the process of semantic role identification. The system for extracting sentence elements from Slovak sentences has been developed with a morphological analyzer, a disambiguator, and a syntactic analyzer as its fundamental components. The morphological analyzer uses data obtained from the Slovak National Corpus. The syntactic analyzer uses context-free grammars. Several evaluation experiments were performed on a limited range of sentences to assess the success of the proposed approach. Keywords: Natural language understanding, sentence elements, parser, tagger.
1 Introduction
Nowadays, the domain of natural language understanding (NLU) has become very important due to the increasing number of applications based on spoken interaction with the user. The goal of NLU is to obtain a conceptual representation of natural language sentences. Obtaining meaning from speech is a complex process and many different approaches and models have been proposed [1]. Three main approaches can be identified – statistical, linguistic, and knowledge-based [12]. Authors usually divide the understanding process into several layers (see [2], [3] and [4]):
• Morphological layer: It analyses words and tries to extract their grammatical categories.
• Syntactic layer: It tries to determine the syntax, which is represented by the sentence elements and their relation to the context.
• Morphemic layer: In addition to the morphological and syntactic layers, it adds information about the morphemic structure of words.
• Semantic layer: It analyses important parts of sentences and tries to identify the semantic structure of the analyzed sentences.
• Contextual layer: It uses context to make the semantic result more accurate.
Because of several factors, such as spoken language ambiguity, incomplete utterances, variable word order, as well as ambiguities in the semantic layer, the problem of
NLU is complicated and complex. Therefore, simpler approaches are usually used in practical applications, often based on keyword-spotting techniques, which are powerful enough for domain-oriented applications [13]. The keyword-spotting approach can be feasible for spoken dialogue systems, but it is unsatisfactory for applications such as communication with conversational agents or question-answering machines. At present, no advanced NLU system for the Slovak language, usable in conversation with embodied conversational agents (ECAs), exists. Therefore we have focused on the design and development of such a system. The key requirements have been identified as: identification of base semantic concepts (in the form of sentence elements), simplicity, and fast processing. The lack of data was the main problem and some limitations had to be accepted. On the morphological layer, annotated data from the Slovak National Corpus (the part with journalistic as well as artistic texts) has been used, because we do not yet have access to the part of the SNC with spoken data. On the syntactic layer, no annotated corpus for Slovak exists, so the rule-based approach was the only solution. The key idea of our approach lies in the effort to obtain information about "who, where, when, how, why and with whom/what is doing something or what is happening". Information about sentence elements can give us answers to such semantic-related questions and can help us in the semantic analysis process to affirm or disprove the semantic roles [14] of a particular word. The base types of Slovak sentence elements are as follows:
• Subject enables us to obtain the answer to the question "Who does it?" It can be expressed or unexpressed, unlike in English. For example: "(On) Spí." ("He is sleeping.")
• Predicate enables us to obtain the answer to the question "What does he/she do? What happened?"
• Adverbial gives information about "Where/when/why/how does he/she do something?"
• Object enables us to obtain the answer to the question "What is he/she doing? With whom is he/she doing something?"
• Attribute gives us the answer to questions about the characteristics and properties of the object and subject.
2 Sentence Elements Extraction
The reasons mentioned in the Introduction led us to start designing and developing a system for the extraction of meaning from Slovak sentences. It is based on linguistic analysis with morphological, syntactic, semantic, and contextual layers. The semantic and contextual layers are not developed yet. The architecture of the system is based on the Galaxy hub-server solution [9]. Currently the system consists of four servers – an input/output server (IO server), a morphological analyzer, a disambiguator, and a syntactic analyzer. Figure 1 shows the realization of the sentence elements extraction system for the Slovak language. The central element of the system is the Galaxy hub process, which routes all communication between the components of the system. Input sentences
are loaded by the IO server, which is responsible for their preprocessing (downcasing, tokenization). Then the morphological analyzer (tagger) takes the incoming sentences and assigns a group of morphological tags to each word, representing the morphological properties of the particular word. The server named disambiguator takes the sentence with tagged words and tries to select a unique tag for each word from the several tags generated by the morphological analyzer. The main task of the syntactic analyzer is to analyze the word-tag pairs and to determine the role of each word in the sentence – the type of the sentence element.
Fig. 1. The scheme of the Slovak sentence elements extraction system
After this analysis, each word is labeled by its lemma, tag, and sentence element type, and the labeled sentence is then presented by the IO server. The next step in the processing flow will be the semantic and contextual analysis of the processed sentence. The semantic analysis will be based on the concept of semantic cases according to the work of Emil Pales published in a linguistics journal [14] and in [8], where 66 semantic roles were identified in 8 categories. A system for the automatic extraction of these roles was also presented in [8].
2.1 The Morphological Analyzer
The morphological analyzer (tagger) in the proposed system works with data from the Slovak National Corpus (SNC) [5]. It is a database of contemporary Slovak language texts, covering a broad range of language styles and enriched with additional linguistic information. The Corpus has been collected and processed by the Ľudovít Štúr Institute of Linguistics (JULS) at the Slovak Academy of Sciences since 2002 [5]. The most important linguistic information for us is the morphological annotation. In 2008 a script for extracting information about trigram occurrences was prepared and applied to one part of the SNC (prim-3.0). This part of the corpus consists mostly of journalistic as well as artistic texts. The obtained data has the form of a tagged dictionary and word trigrams with morphological tags and their frequencies of occurrence [15]. The
structure of the tags is described in [11]. A rapid algorithm was prepared for searching in these data and assigning appropriate tags to the words in analyzed sentences. Two versions of the tagger were prepared and evaluated. The base version of the analyzer picks up word by word (it does not use context information) from the processed sentence and assigns a group of possible tags (obtained only from the dictionary) to each word. The second version of the tagger picks up word trigrams from the processed sentence and tries to find the appropriate word-tag trigrams in the data from the SNC. The trigram sequence with the highest overall occurrence is then selected.
2.2 Disambiguation
Morphological analysis usually generates ambiguities [8] in the form of several possible tags for a particular word. Selection of the correct tag depends on the surrounding words. Therefore, some disambiguation process is needed. There are two possible ways – a separate disambiguator server, or a morphological analysis which produces an unambiguous result. The second version of the proposed tagger runs in this mode. When this version of the tagger is used, the disambiguator server is bypassed, because the tagger returns only the best word-tag sequence. In the case of the base version of the morphological analyzer, the disambiguator server must be used and it must select the most probable tag for each word. At present, the disambiguator in our system works in a very simple mode: it selects only the first tag from the group of tags and assigns it to the processed word. It is clear that some errors are introduced here. The difference between the results of both approaches to morphological analysis is shown in experiments No. 1 and No. 2.
2.3 The Syntactic Analyzer (Parser)
The last component in the processing chain is the syntactic analyzer server (parser), which takes sentences with word-tag pairs and tries to assign the correct sentence element type to each word according to its tag and the surrounding word-tag pairs. In inflective languages like Slovak, and in the case of spontaneous speech, we cannot rely on word order, and therefore classical syntactic analysis, which divides sentences into phrases (noun phrase, verb phrase), may not bring the expected results. For these reasons, an approach mainly based on context-free grammars (CFG) was chosen, and in some cases inter-word relations are taken into consideration in 2-pass processing. The direction of processing is from left to right and bottom-up. After studying the Rules of the Slovak Language [10], a set of 26 rules has been defined. The developed syntactic analyzer is able to detect five types of elements – subject (expressed/unexpressed), predicate, object, adverbial, and attribute.
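To make the trigram-based tagging and disambiguation of Sections 2.1–2.2 more concrete, the following Python sketch selects, for a short sentence, the tag sequence whose tag trigrams occur most frequently in the corpus-derived data. The structures word_tags (the tagged dictionary) and trigram_freq (the trigram statistics, keyed by 3-tuples of tags) are hypothetical stand-ins for the SNC-derived resources described above, and the exhaustive search over candidate combinations is an assumption that is only tolerable for the short sentences used here (ASL < 6).

```python
from itertools import product

def tag_sentence(words, word_tags, trigram_freq):
    """Assign one morphological tag per word by maximizing the summed corpus
    frequency of the resulting tag trigrams (second tagger version)."""
    if not words:
        return []
    candidates = [word_tags.get(w, ["X"]) for w in words]   # "X" marks unknown words
    best_seq, best_score = None, -1
    # Exhaustive search over candidate tag combinations; feasible for short sentences.
    for seq in product(*candidates):
        score = sum(trigram_freq.get(seq[i:i + 3], 0) for i in range(len(seq) - 2))
        if score > best_score:
            best_seq, best_score = seq, score
    return list(zip(words, best_seq))
```

The base tagger version corresponds to producing only the per-word candidate lists, leaving the selection of a single tag to the disambiguator server.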
3 Evaluation Experiments
The evaluation experiments focused on the sentence element detection functionality. The results of the experiments reflect the performance of both the tagger and the parser. The experiments were performed on 127 Slovak sentences obtained randomly from the newspaper corpus [7] and on 100 sentences obtained as a
transcription of part of a TV series (experiment No. 7). Analysis of sentences in TV series shows that the Average Sentence Length (ASL) in real conversation is relatively low (ASL = 3.63). Therefore, in the case of sentences from the newspaper corpus, we selected sentences with fewer than six words per sentence. For the purpose of the evaluation we created a reference file with manually labeled sentences. Six experiments were done with the sentences from the newspaper corpus and the seventh experiment was done with the sentences from the TV series. The results of each experiment were written into output files and compared with the reference file. First, two types of values were computed – the percentage of correctly detected sentence elements (CORR) and the percentage of incorrectly detected sentence elements (INCORR). The INCORR value represents situations where some word is wrongly marked as a concrete sentence element type. Computation of the CORR and INCORR values helped us to obtain initial information about the success of the system (see Table 1). Then the confusion matrix was identified as a better way to show the results of the evaluation experiments (Table 2). Finally, the total WER was computed for the best experiments (No. 6 and No. 7). The first experiment was done with the base version of the morphological analyzer, the stand-alone disambiguator server, and the syntactic analyzer with an initial set of rules. The results showed that the weak spot of the system is the detection of subjects, objects, and the adverbial part of the sentence. The analysis showed that improvements in both the tagger and the parser were necessary. The setup of the second experiment consisted of the second version of the tagger (with trigrams) and the same version of the parser as in experiment No. 1. An improvement in the detection of objects (+8.5%), unexpressed subjects (+8.69%), and the adverbial part of the sentence (+4.35%) was reached. The contextual information included in the trigrams helps the system to distinguish between subjects and objects. We then started improving the CFG rules in the parser. New rules for detecting subjects and objects were added, which enable identification of pronouns as subjects (nominative case) and as objects (accusative and dative cases). The list of words typical of the adverbial part of the sentence was also extended. Experiments No. 3, 4 and 5 were done for testing and debugging these new rules and other extensions. The improvement in experiment No. 5 (against experiment No. 2) was: subjects +13.73%, objects +32.25%, adverbial +26.05%.
Table 1. The results of evaluation experiments No. 1, 5 and 6
Parameter    Experiment No.1     Experiment No.5     Experiment No.6
             CORR    INCORR      CORR    INCORR      CORR    INCORR
Subject      72.55   16.41       85.29   4.68        88.23   4.68
Predicate    99.22   0.78        98.43   0.00        98.43   0.00
Attribute    90.74   12.50       92.59   7.81        90.74   3.14
Object       33.33   6.29        74.07   5.47        85.18   5.47
Adverbial    52.27   1.56        80.43   1.56        82.60   1.56
Un. subj.    73.91   17.18       83.33   8.59        83.33   8.59
After analyzing the results of experiment No. 5, the need for an additional rule was identified, to handle a special situation that can occur during the detection of attributes, which are mainly adjectives: if an adjective stands without a subject or object, it takes over their role in the sentence. Therefore, we added new rules for this case and repeated the experiment (No. 6). Its results are in the third column of Table 1. The included rules helped the system to increase the number of detected subjects (+2.94%) and objects (+11.11%). More concrete information can be obtained from the confusion matrices. They make it possible to see where the weak spots of the classification system are. Table 2 shows the confusion matrices for experiments No. 1, 5 and 6. The values for each experiment are delimited by a vertical bar.
Table 2. Confusion matrices of experiments No. 1, 5 and 6
Exp.1|5|6    Subj.       Pre.           Attr.       Obj.        Adv.        Un.subj.
Subject      74|87|90    0              0|1|0       15|4|4      2|0|0       x
Predicate    0           127|126|126    1|0|0       0           0           x
Attribute    10|4|1      0              49|50|49    5|6|2       1|0|0       x
Object       0|2|3       0              1|2|2       18|40|46    5|1|1       x
Adverbial    1|0|0       0              0|2|2       0|1|1       23|37|38    x
Un. subj.    x           x              x           0|1|1       x           17|20|20
The computation of the Word Error Rate (WER) (1) gives information about the overall performance of the system:
WER = (S + D + I) / N,    (1)
where S, D and I are the numbers of substituted, deleted and inserted elements, and N is the total number of elements in the reference.
WERs were computed in experiments No. 6 and No. 7 (Table 3).
Table 3. Word Error Rates obtained in experiments No. 6 and No. 7
            Experiment No.6    Experiment No.7
WER [%]     14.07              27
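As an illustration of how such a WER value can be obtained from the reference and hypothesis sequences of sentence-element labels, the sketch below uses a standard Levenshtein alignment; the exact alignment procedure used by the authors is not specified in the text, so this is only an assumption.

```python
def wer(reference, hypothesis):
    """Word Error Rate (S + D + I) / N via Levenshtein alignment of two
    label sequences (e.g. lists of sentence-element types)."""
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[n][m] / float(n)

# e.g. wer(["Subj", "Pred", "Obj"], ["Subj", "Pred", "Adv"]) -> 0.333...
```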
4 Conclusions
The base version of a sentence elements extraction system for processing Slovak sentences was introduced. The evaluation experiments show both promising results with a relatively small set of CFG rules and also the weak spots of the system. At the beginning we accepted several limitations which affect the success of the system. The main problem is the lack of data in the form of annotated corpora for both the parser and the tagger. Data for the tagger with morphological annotation were obtained from the part of the Slovak National Corpus with journalistic and artistic texts, and it is clear that they are not fully appropriate for processing natural spoken language. Unfortunately, data with syntactic annotation for the Slovak language are not available. The way to
improve the success of the system is to use the manually corrected output of the system for creating annotated corpora for the parser and the tagger. Of course, the way to real understanding is even longer. In the proposed solution, the semantic and contextual analyzers are missing. These two processes rely on a knowledge base with data about the real world. There are many issues which should be studied and solved. Our future work will be focused on these areas as well as on improving the other parts of the system. Acknowledgments. The work presented in this paper was supported by the Slovak Research and Development Agency under the project VMSP-P-0004-09, by the Ministry of Education of the Slovak Republic under research project VEGA-1/0065/10, and under the framework of the EU ICT Project INDECT (FP7-No. 218086).
References 1. Mori, D., et al.: Spoken language understanding. IEEE Signal Processing Magazine 25(3), 50–58 (2008) 2. Psutka, J., et al.: We talk with the computer in Czech (In Czech: Mluvime s pocitacem cesky), Academia, Praha (2006), ISBN 80-200-1309-1 3. Natural language processing website (2010), http://www.gurmania.sk/trabalka/NLP.htm 4. Turban, E.: Expert Systems and Applied Artificial Intelligence. Maxmillan Publishing Company (1992) 5. Slovak National Corpus web page, http://korpus.juls.savba.sk/ 6. Simkova, M.: Slovak national corpus - history and current situation. In: Insight into the Slovak and Czech Corpus Linguistics, Veda, Bratislava, pp. 151–159 (2006) 7. Stas, J., Hladek, D., Pleva, M., Juhar, J.: Slovak Language Model from Internet Text Data. In: Esposito, A., Esposito, A.M., Martone, R., Müller, V.C., Scarpetta, G. (eds.) COST 2010. LNCS, vol. 6456, pp. 340–346. Springer, Heidelberg (2011) 8. Pales, E.: Sapfo. Paraphraser of the Slovak language, the computer tool for modeling in linguistic (in Slovak: Parafrazovac Slovenciny, pocitacovy nastroj na modelovanie v jazykovede), VEDA, Bratislava (1994) ISBN 80-224-0109-9 9. Galaxy communicator website, http://communicator.sourceforge.net/ 10. The Slovak Grammar (in Slovak: Pravidlá slovenského pravopisu), Bratislava, VEDA (2000), http://www.juls.savba.sk/ediela/psp2000/psp.pdf (2010) 11. Morphological annotation of the Slovak National Corpus website, http://korpus.juls.savba.sk/usage/morpho/ 12. Furdík, K.: Information Retrieval from texts in natural language, using hypertext structures (in Slovak), PhD Thesis, Department of Cybernetics and Artificial Intelligence, Technical University of Košice (2003) 13. McTear, F.M.: Spoken Dialogue Technology. In: Toward the Conversational User Interface. Springer, London (2006) ISBN 1852336722 14. Pales, E.: Semantic roles of the Slovak verbs. Linguistic Journal (41), 30–47 (1990) (in Slovak) 15. Mirilovic, M.: The stochastic language model of Slovak language for using in automatic continuous speech recognition systems. PhD Thesis, Kosice (January 2008) (in Slovak)
Detection of Similar Advertisements in Media Databases
Karel Palecek
Institute of Information Technology and Electronics, Technical University of Liberec, Czech Republic
[email protected]
Abstract. This contribution presents a system for the detection of similar images of advertisements in moderate-size datasets. These datasets are updated daily and mainly consist of advertisements from TV, newspapers, journals, etc. The task is to identify clusters of duplicate advertisements in a given dataset. Images differ by translation, scale, and the amount of compression. The presented approach is based on the recently popular bag-of-features approach, which has been successfully used in the context of image retrieval and other related areas. Each image is represented as a weighted histogram of local features. Similarities computed from the extracted features are projected onto a separating hyperplane and clustered using agglomerative hierarchical clustering. Experiments show that this simple and efficient scheme yields good results and finds corresponding images, even for advertisements which are substantially dissimilar in spatial arrangement and color composition, with a reasonable false positive rate. Keywords: bag-of-features, clustering, image similarity.
1 Introduction
We present a system for the detection of similar and equivalent advertisement images as part of the media monitoring process. It has been built for a commercial company which creates textual transcriptions of TV news, monitors Czech local media such as newspapers or magazines, and reports information about advertisements presented on the Internet, in television, or in printed periodicals. Every day, this company analyzes a large amount of advertisements which need to be sorted for subsequent processing. These sets typically contain from several thousands to tens of thousands of images of commercials. Our task is to identify groups of near-duplicate advertisements, i.e. find all pairs of similar advertisements in a given dataset. The images are cropped out of printed magazines and differ by several transformations. Because of imprecise cropping, images are translated and scaled with respect to each other, contain borders, and have different quality – degradations such as noise, compression artifacts, or changes in brightness and contrast are present. Also, advertising companies use different size formats in various periodicals and
often arrange the salient regions of advertisements in a number of ways. Therefore the key elements of advertisements, i.e. logos, promotional texts, or pictures of people, are not spatially consistent, and we are unable to use global appearance methods which capture such relationships. Last but not least, the near-duplicate relation of advertisements is difficult to define in terms of image similarity. For some pairs of corresponding images, visual similarity may in fact be very low even for the human eye, whereas for some pairs of distinct advertisements the differences can be statistically indistinguishable from noise, especially when using global features. Our problem formulation is closely related to several areas of research – content-based image retrieval, image registration (IR), and clustering. Image registration techniques try to transform various images of the same scene into one coordinate system [9]. However, since our advertisements vary in their design, there is no guarantee of a global geometric transformation between any pair of corresponding images, and thus we cannot apply IR techniques to our problem. The image retrieval area of research deals with the problem of organizing images by their visual content [3]. This may include finding all database images similar to a given query, categorization of pictures of faces, cars, buildings, etc. Traditional techniques used in image retrieval involve color, texture, shape matching, etc. In [4], several normalized image frames are extracted from every image based on color segmentation, and then a sublinear indexing tree is created. However, this approach would be difficult to adapt to our problem for two reasons: first, the time required for building the indexing tree is on the order of days, and second, it is not well suited to variable spatial arrangement of image elements, since the indexing is based on intensity thresholding of fixed locations in images. One of the most popular approaches in recent years has been the bag-of-features (BoF) image representation, which derives from the bag-of-words (BoW) model in natural language processing (NLP). Analogous to BoW, each image in BoF is represented by a histogram of its visual words, see e.g. [2,8]. Typically, these are local image regions with associated descriptors. Our method for the detection of similar advertisements takes inspiration mostly from the image retrieval area. We represent images of advertisements using the BoF model, as it is well established and tested. It also has several useful properties in terms of our problem, mainly invariance to the spatial arrangement of salient regions. With a suitable feature detector and descriptor, it is also robust to quality degradations. In Sect. 2 we describe the algorithm and its details, and in Sect. 3 we evaluate the results and present conclusions and future extensions.
2 System Description
As discussed in the previous section, our system utilizes a bag-of-features representation for images of advertisements. We first discuss the outline of the method and then describe its details in the following subsections. The outline of the method is as follows. First, for every image a set of local features is extracted. This involves the detection of local regions and the computation
of their descriptors. For better robustness and to have more features available for recognition, two types of interest regions are used – Harris-Laplace and Maximally Stable Extremal Regions. Next, a visual vocabulary with a fixed number of visual words is created by vector-quantizing the extracted local descriptors using the k-means algorithm. Each image is represented as a weighted histogram of visual words, i.e. a set of local features, each of which is associated with its nearest cluster in the visual vocabulary. Searching for all similar advertisements then corresponds to searching for all similar histograms in the given dataset. For each pair of advertisements, two cosine similarities, one for each feature type, are computed and classified by a quadratic perceptron. Final clusters of similar advertisements are obtained by a hierarchical clustering algorithm.
2.1 Extraction of Local Features
The extraction of local features involves the detection of local regions which are invariant to translation and scale. The translation and scale invariance is necessary because of the imprecise cropping of advertisement images. It is assumed that in the local scope only these two Euclidean transformations apply, i.e. individual letters, logos, or faces are not disproportionately deformed or rotated. To prevent having too few features for reliable matching, two complementary types of interest regions are extracted: Harris-Laplace [7] and Maximally Stable Extremal Regions [6]. More than one type of interest region is used because of the varied design of advertisements, as some images contain well-structured logos and catchwords, while others comprise natural scenes or pictures of promoted products. The Harris-Laplace (HL) detector is a scale-invariant modification of the Harris corner detector. It finds corners in images as local maxima of a cornerness measure of the scale-adapted auto-correlation matrix of an interest point neighborhood. Since HL depends on auto-correlation, it is best suited for textured scenes which contain a lot of detail. Maximally stable extremal regions (MSER) are extracted by watershed segmentation followed by finding a local minimum of a function of the size growth of each connected component. They correspond to bright and dark blobs which are stable over a wide range of intensities and are best suited for well-structured scenes with many homogeneous regions. Descriptors of interest regions are computed as follows. Each detected region is first normalized according to its second moment matrix using bilinear interpolation. On the resulting patch of fixed size 32×32 pixels, a two-dimensional discrete cosine transformation (DCT) is computed. The descriptor then consists of the first 15 coefficients after the zeroth coefficient, according to the standard zig-zag ordering of the DCT (as used e.g. in JPEG compression), normalized to unit L2 norm. Since only low-frequency coefficients are kept, the descriptor is robust to noise and small misalignments of interest regions. Excluding the zeroth coefficient of the DCT and normalizing the descriptor to unit length guarantees invariance to changes of brightness and contrast. We also tried to make the descriptor invariant to rotation by normalizing each patch by its dominant gradient orientation [5] before computation of the DCT, but we found that in that case the overall recognition rate was lower. This can be attributed to the fact that most of
advertisements are oriented upright for good readability, and therefore too much robustness causes the features to lose their discriminative power.
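A rough sketch of the descriptor computation is given below. The affine normalization of the detected region by its second moment matrix is abstracted away here and replaced by a plain resize, and a grayscale patch is assumed as input; this is only an illustration of the DCT/zig-zag step, not the authors' exact implementation.

```python
import numpy as np
import cv2  # OpenCV, assumed available

def zigzag_indices(n):
    """(row, col) indices of an n x n block in JPEG-style zig-zag order."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def dct_descriptor(patch, n_coeffs=15):
    """Resize the patch to 32x32, compute the 2-D DCT, keep the first
    n_coeffs zig-zag coefficients after the DC term, L2-normalize."""
    patch = cv2.resize(patch, (32, 32), interpolation=cv2.INTER_LINEAR)
    coeffs = cv2.dct(np.float32(patch))
    order = zigzag_indices(32)[1:1 + n_coeffs]    # skip the zeroth (DC) coefficient
    d = np.array([coeffs[r, c] for r, c in order])
    norm = np.linalg.norm(d)
    return d / norm if norm > 0 else d
```

Dropping the DC coefficient removes the dependence on mean brightness, and the unit-length normalization removes the dependence on contrast, as described above.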
2.2 Visual Vocabulary
Two visual vocabularies, one for each type of region, are built separately. In our case, the training datasets contained roughly a thousand pictures of advertisements, resulting in a total of ca. 500k local features for both HL and MSER regions. To reduce training time, only a randomly selected 20% of the descriptors from the training dataset are used. To vector-quantize the local features into a vocabulary, k-means clustering is performed. We use the standard Euclidean distance as a measure of cluster dissimilarity and initialize the centers by the k-means++ algorithm [1]. The number of clusters was chosen empirically as k = 1024, as it provided a good recognition rate while keeping the time complexity of both the training and testing phases reasonable. After the vector quantization, the clusters found by k-means represent the visual vocabulary and every feature from the training dataset is assigned to its nearest cluster. We use the standard term frequency – inverse document frequency (tf-idf) weighting scheme for each feature:
wij = (Nij / Nj) · log(n / ni),    (1)
where wij is the weight of the i-th feature in the j-th image, Nij is the number of occurrences of the i-th feature in the j-th image, Nj is the number of all features in the j-th image, n is the number of images in the training dataset, and ni is the number of images in which feature i occurs. Each image is then represented by two vectors of weight coefficients, one for MSER and one for HL interest regions.
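A minimal sketch of building such a weighted histogram for one image is shown below; it assumes the vocabulary (cluster centres) and the idf terms log(n/ni) have already been computed for the given feature type.

```python
import numpy as np

def bof_histogram(descriptors, vocabulary, idf):
    """Weighted bag-of-features histogram for one image.
    descriptors: iterable of local descriptors (each a length-d vector)
    vocabulary:  (k, d) array of k-means cluster centres
    idf:         (k,) array of precomputed log(n / n_i) terms"""
    k = vocabulary.shape[0]
    counts = np.zeros(k)
    for d in descriptors:
        word = np.argmin(np.linalg.norm(vocabulary - d, axis=1))  # nearest visual word
        counts[word] += 1
    tf = counts / max(counts.sum(), 1.0)    # N_ij / N_j
    return tf * idf                         # w_ij = (N_ij / N_j) * log(n / n_i)
```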
2.3 Clustering of Images
Images in the training dataset are compared pair-wise based on their weighted histograms. For both types of regions, cosine similarities
c(x, y) = (x · y) / (‖x‖ · ‖y‖),    (2)
where x and y are the vectors of coefficients wij of two distinct images, are computed separately for MSER-based and HL-based features. The two histogram similarities cM and cH are projected onto a conic which is found in the training phase by learning a single-layer quadratic perceptron. We use a linear iterative gradient descent learning rule for finding the coefficients of the six-dimensional separating hyperplane, which corresponds to a general 2nd-degree curve in two-dimensional space. The perceptron is trained on a ground-truth dataset where each pair of advertisements is labeled by g ∈ {−1, 1}, indicating whether the two images belong to the same cluster.
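The projection onto the conic can be sketched as follows. The ordering of the six quadratic features and the simple perceptron-style update rule are assumptions, since the text does not fix them; the sketch only illustrates how two similarities are lifted to the 6-D space of a general second-degree curve and separated there.

```python
import numpy as np

def quad_features(c_m, c_h):
    """Lift (c_M, c_H) onto the 6-D feature space of a general conic."""
    return np.array([c_m**2, c_h**2, c_m * c_h, c_m, c_h, 1.0])

def train_quadratic_perceptron(pairs, labels, lr=0.01, epochs=100):
    """pairs: list of (c_M, c_H); labels: +1 same cluster / -1 different.
    Iterative perceptron-style gradient updates of the hyperplane w."""
    w = np.zeros(6)
    for _ in range(epochs):
        for (c_m, c_h), g in zip(pairs, labels):
            x = quad_features(c_m, c_h)
            if g * np.dot(w, x) <= 0:      # misclassified pair -> update
                w += lr * g * x
    return w
```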
Based on the dissimilarity matrix D, whose elements are dij = −w · cij, where w is the vector of perceptron coefficients and cij = (cM, cH), a hierarchical clustering of advertisements is performed. Clustering starts with each advertisement in its own cluster, and in every iteration the two clusters with the smallest distance are merged. The distance dxy of the clusters x and y, where cluster x was created by merging clusters z and w, is computed recursively by
dxy = (dzy + dwy) / 2    (3)
This way an agglomerative hierarchical clustering tree is built whose leaves correspond to individual advertisements. Final clusters are obtained by thresholding the distance dxy , i.e. merging of the clusters stops when their dissimilarity exceeds some limit.
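A minimal sketch of this clustering step using SciPy is given below. The recursive rule (3) corresponds to the 'weighted' (WPGMA) linkage; the shift of D to non-negative values (with the threshold t understood to be shifted accordingly) is an implementation assumption, not part of the original method.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_ads(D, t):
    """D: symmetric dissimilarity matrix with d_ij = -w . c_ij;
    t: threshold on the linkage distance at which merging stops.
    Returns one cluster label per advertisement."""
    D = np.array(D, dtype=float)
    D = D - D.min()                      # make distances non-negative (assumption)
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method='weighted')  # d_xy = (d_zy + d_wy)/2
    return fcluster(Z, t, criterion='distance')
```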
3 Results
We tested the algorithm on two datasets, A and B, which were provided by the external company. The datasets comprise 981 and 1174 pictures of advertisements, respectively. For reasons of efficiency, the images were scaled down such that neither their width nor their height exceeded 1024 pixels. The clustering algorithm is treated as a series of binary decisions, where each pair of advertisements is classified as either belonging or not belonging to the same cluster. The quadratic perceptron was trained on a separate training dataset consisting of 1036 images. Figure 1 shows the receiver operating characteristic curve for the threshold t of the cluster linkage criterion (3). We select t such that the F-measure of the final clustering of the training dataset S is maximized. The F-measure is computed as
Fβ = ((1 + β²) · p · r) / (β² · p + r),    (4)
where p = TP / (TP + FP) is precision and r = TP / (TP + FN) is recall. We choose β = 4, as in our application recall is somewhat more important than precision, i.e. only a minimum of pairs of similar advertisements should be missed. We have tested four scenarios where the weighted histograms were created based on various vocabularies. The results of the final clustering of the testing dataset are summarized in Tab. 1. In all four cases the system shows similar performance in terms of overall success rate. It can be seen that in the case of dataset A the false negative rate rose to 6.9% with the histograms created based on the vocabulary of dataset B. This is because the datasets contain different sets of advertisements and therefore the visual vocabularies may not be representative enough. Example output pairs of advertisements are shown in Fig. 2.
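The threshold selection described above can be sketched as follows; evaluate is a hypothetical routine that clusters the training set S with a given t and returns the resulting counts of true-positive, false-positive, and false-negative pair decisions.

```python
def f_measure(tp, fp, fn, beta=4.0):
    """F_beta = (1 + beta^2) * p * r / (beta^2 * p + r)."""
    p = tp / float(tp + fp) if (tp + fp) else 0.0   # precision
    r = tp / float(tp + fn) if (tp + fn) else 0.0   # recall
    if p == 0.0 and r == 0.0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

def select_threshold(candidate_ts, evaluate):
    """Pick the linkage threshold t that maximizes F_4 on the training set."""
    return max(candidate_ts, key=lambda t: f_measure(*evaluate(t)))
```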
Fig. 1. ROC of the classifier for the two datasets. For example, graph AB shows the ROC for dataset A where local features were assigned to the nearest word of the visual vocabulary of dataset B.
Table 1. Results of the clustering. γ is the normalized Pearson's correlation coefficient of the ground truth and incidence matrices after clustering.
set   voc.   TP       TN       FP       FN       γ        rate
A     A      0.9957   0.9971   0.0029   0.0043   0.9728   99.70
A     B      0.9315   0.9971   0.0029   0.0685   0.9364   99.36
B     B      0.9883   0.9994   0.0006   0.0117   0.9882   99.88
B     A      0.9942   0.9974   0.0026   0.0058   0.9747   99.72
Fig. 2. Example pair of advertisements. Left: true positive, right: false negative.
We performed the clustering on a computer with an Intel Core 2 Duo @ 3 GHz and 4 GB of RAM. For the extraction of HL regions we used the original implementation1 of [7]. We used the OpenCV2 implementation of MSER and k-means and the MATLAB implementation of hierarchical clustering. All tests were run as single-threaded applications. The most time-consuming part of the algorithm is the extraction of local features: detecting ca. 500k features in 981 images took 256 seconds for MSER and 570 seconds for HL. Vector-quantizing 100k features of dimension 15 took 286 seconds. Computation of the dissimilarity matrices and the hierarchical clustering took 4 seconds for the 1174 histograms of dataset B.
1 http://www.robots.ox.ac.uk/~vgg/research/affine/
2 http://opencv.willowgarage.com/wiki/
4 Conclusion
We have developed a system for the automatic detection of similar advertisements in media databases. The advantages of our system are its configurability, extensibility, and the simplicity of the used scheme. By choosing a different threshold for the cluster linkage criterion, the system can be configured to prefer either similarity or strict equivalence. Experimental results show that matching based on the bag-of-features approach is suitable even for images with different spatial and color arrangement. With the utilization of suitable feature detectors and descriptors the system is robust, but because of the averaging of local features and the normalization of histograms it is not sensitive to details. A disadvantage of our system is the need for computation of the dissimilarity matrix, whose complexity is quadratic in both time and space. Usage of our system is therefore limited to a maximum of tens of thousands of advertisements. One possible solution for larger datasets is the utilization of efficient all-pairs similarity search methods which avoid computing the distance between every pair of records by pruning the dataset. We would also like to explore possibilities of geometric verification of pairs of advertisements based on finding groups of similar features between the two images. Acknowledgments. The research reported in this paper was partly supported by the grant MSMT OC09066 (project COST 2102) and by the Student Grant Scheme at the Technical University of Liberec.
References 1. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035 2. Dance, C., Willamowski, J., Fan, L., Bray, C., Csurka, G.: Visual categorization with bags of keypoints. In: ECCV International Workshop on Statistical Learning in Computer Vision, pp. 1–22 (2004) 3. Datta, R., Joshi, D., Li, J., Wang, J.Z.: Image retrieval: Ideas, influences, and trends of the new age. ACM Comput. Surv. 40, 5:1–5:60 (2008) 4. Horacek, O., Bican, J., Kamenicky, J., Flusser, J.: Image retrieval for image theft detection. In: Kurzynski, M., Puchala, E., Wozniak, M., Zolnierek, A. (eds.) Computer Recognition Systems 2. Advances in Soft Computing, vol. 45, pp. 44–51. Springer, Heidelberg (2007) 5. Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60(2), 91–110 (2004) 6. Matas, J., Chum, O., Martin, U., Pajdla, T.: Robust wide baseline stereo from maximally stable extremal regions. In: Proceedings of the British Machine Vision Conference, London, vol. 1, pp. 384–393 (2002) 7. Mikolajczyk, K., Schmid, C.: Scale & affine invariant interest point detectors. Int. J. Comput. Vision 60, 63–86 (2004) 8. Sivic, J., Zisserman, A.: Video Google: A Text Retrieval Approach to Object Matching in Videos. In: IEEE International Conference on Computer Vision, vol. 2, pp. 1470–1477 (April 2003) 9. Zitova, B.: Image registration methods: a survey. Image and Vision Computing 21(11), 977–1000 (2003)
Towards ECA's Animation of Expressive Complex Behaviour
Izidor Mlakar1 and Matej Rojc2
1 Roboti c.s. d.o.o, Tržaška cesta 23, Slovenia
[email protected]
2 Faculty of Electrical Engineering and Computer Science, University of Maribor, Smetanova ulica 17, Slovenia
[email protected]
Abstract. Multimodal interfaces supporting ECAs enable the development of novel concepts for human-machine interaction interfaces and provide several communication channels, such as natural speech, facial expression, and different body gestures. This paper presents the synthesis of expressive behaviour within the realm of affective computing. By providing descriptions of different expressive parameters (e.g. temporal, spatial, power, and different degrees of fluidity) and the context of unplanned behaviour, it addresses the synthesis of expressive behaviour by enabling the ECA to visualize complex human-like body movements (e.g. expressions, emotional speech, hand and head gestures, gaze, and complex emotions). Movements performed by our ECA EVA are reactive, do not require extensive planning phases, and can be represented hierarchically as a set of different events. The animation concepts prevent the synthesis of unnatural movements even when two or more behavioural events influence the same segments of the body (e.g. speech with different facial expressions). Keywords: Expressive behaviour, animation blending, expressivity parameters, expressive embodied conversational agent.
1 Introduction
Human-machine interfaces are striving more and more to emulate natural and highly complex human-human interactions. Substantial effort by many researchers in different fields of communication management has already been devoted to this task, researching different communication tactics, from OCM-related (OCM – Own Communication Management) person- and emotion-specific body motion synthesis to ICM-related (ICM – Interactive Communication Management) feedback-based responses and turn taking. An understanding of the relation between attitude and emotion, together with how gestures (facial and hand) and body movement complement or in some cases even override verbal information, is essential for the simulation of human-like communicative behaviour. Such knowledge provides crucial information for modelling interactive management, from both the input and output perspectives, when generating natural human-machine interaction [1].
Research into input modalities mostly focuses on studies of speech-based technologies (e.g. speech recognition) and video/image-based analysis technologies (e.g. interfaces using pointing gestures [2], emotion recognition [3], etc.), whereas most research on output modalities is currently oriented towards the contextual presentation of information (e.g. interfaces based on the semantic web [4] and pragmatic web [5]), embodied conversational agents (ECAs) [39], and agent-related communicative management [40]. ECAs provide the ability to merge several output modalities into human-like multimodal output. Currently, ECAs are being embedded into different multimodal human-machine interfaces as animated talking heads [6][7] or fully-functional conversational agents [8][9]. But how do we define natural behaviour? The literature and theory of affective computing imply several conditions for synthesized motion to appear natural [10][11]. Among the more important are the speed of interaction and emotion/speech-correlated believable body motion (facial gestures, hand gestures, head gestures, etc.). The expressivity of an ECA plays a central role in defining its personality and its emotional state, and can further explain the context of the spoken dialogue (e.g. which parts of the dialogue are important – emphasis, visualization of the spoken word accompanied by different facial expressions, etc.). The main challenges of natural human-machine interaction, however, lie in an understanding of the information context (user, situation/application, and cultural), the corresponding dialogue management [17], and the gestures/expressions to be performed. Researchers actively address different functions of gestures, gaze, body movement, and facial expressions, to gain knowledge about the context of human-human interaction, and also to synthesize expressive synthetic behaviour, in order to provide more intuitive and flexible interactions between humans and computers' ECAs. Expressivity basically defines "how" information is presented through physically-based behaviour (movement) [23], and plays a central role in the perception of verbal and non-verbal dialogue [24]. The base requirements of an ECA are provided in [19]. Embodied conversational agents should provide content and added value and, therefore, should visually represent verbal information. Consequently, all gestures and facial expressions performed should provide meaning (be used as an additional information channel that extends verbal information by "colorizing" its meaning with emotions or by emphasizing its meaning). The synthesis of expressive behaviour usually incorporates the processing of several different input/output modalities. MAXINE [8] is one of the frameworks that incorporate various input modalities which the user can utilize to simulate natural human-machine interaction. Here, the input processing and behaviour modelling are defined within a programmable interfaced structure. Several concepts of affective computing are also presented in [14]. By utilising the Affective Presentation Mark-up Language APML [15], an XML-based abstract language, and definitions of facial expressions based on MPEG-4 FAPs, the ECA synthesizes expressive speech, non-speech-related facial and head gestures (e.g. gaze), and more complex emotions (e.g. emotion blending).
In order to extend the natural presentation of behavioural states, the authors in [16] present a novel approach for the generation of coordinated multimodal behaviour, using the behaviour mark-up language (BML) [26] and the low-level animation language EMBRScript. A lot of research, on the other hand, addresses different modalities individually. Most speech and facial emotion
synthesis-oriented frameworks (the most popular concepts within the ECA context) suggest incorporating one or more input processing techniques and usually address one output modality at a time. For instance, the system presented in [12] addresses emotional speech synthesis based on statistical modelling and the HUGE architecture [13]. Another approach, described in [18], addresses affective computing from a visual speech synthesis perspective; based on HMMs for speech generation, it synthesizes lip movement from arbitrary text. In our previous work [20], we outlined and presented an ECA (named EVA) capable of simulating expressive behaviour. We also presented EVA's exchangeable articulated model, the underlying animation engine, and EVA Script, an arbitrary XML scheme that provides an intermediate layer between abstract behaviour description languages (e.g. BML, APML, ALMA [21], MURML [22]) and low-level animation parameters. In order to achieve believable part-based animation, ECA EVA provides morph- and bone-based control units by which each body part can be animated. This paper presents a further extension of ECA EVA's capabilities, aiming towards affective computing and addressing the concept of expressive behaviour. The focus of this work is, therefore, oriented towards generating complex behaviour as a combination of different animated body parts (building the animation of body segments separately and synchronizing them into animated complex behaviour, e.g. correlated speech and gestures). We also address several components of expressive behaviour, such as: temporal and spatial components of movement, fluidity of movement and repetitivity, and the ability to replicate the same movement so that it differs only in its spatial components. This paper is structured as follows. Firstly, a description of EVA's animation engine [20] is presented in the context of expressive components, followed by a detailed description of EVA's animated behaviour. The behavioural animation concepts, such as autonomous behaviour (e.g. eye-blinks), gestures (incorporating head, hand, and arm motion), and eye gaze, are also addressed in detail. The paper concludes with a short presentation of expressive (affective) behaviour involving the animation of complex gestures, facial expressions, and emotions, and a brief description of our future plans regarding natural human-machine interaction and interfaces.
2 Expressivity
The term expressivity defines 'how' information is presented through physically-based behaviour (motion) [14]. Expressivity also plays a central role in the perception of verbal and non-verbal dialogue [26]. The expressivity of animated behaviour seems to be the leading topic regarding multimodal interfaces using ECAs, and researchers have addressed it from different perspectives. Most of them try to address it either in the context of expressive speech or in the context of facial expressions. E.g. the ECA Greta [14][27] presents expressivity through qualitative sets of motion parameters affecting the physical characteristics of movements (e.g. speed, width, strength, etc.). In order to define the taxonomy of head gestures and gaze, a six-dimensional approach is proposed that uses Overall Activity (models the general amount of activity), Spatial Extent (modifies the amplitudes of movements), Temporal Extent (changes the duration of movements), Fluidity (influences the smoothness and continuity of movement), Power (represents the dynamic properties of the movement), and Repetitivity
(models the tendency of the ECA to replicate the same movement with short and close repetitions over time). Several other studies also present different taxonomies and correlations between gestures, emotion, and speech [28][29]. Expressivity can be extended from facial emotions and speech to head gestures and hand gestures (and possibly to other types of body motion/posture). ECA EVA can, by using EVA Script, efficiently generate expressive behaviour. Due to the bone-based approach, such behaviour can be performed (with 360° of 3D rotational freedom) on any body part. The following sections address some of the expressivity dimensions provided by [14]. As already described in [20], ECA EVA's articulated model can be described as a multipart surface model with underlying skeleton chains and, therefore, supports several types of expressivity concepts, from facial expressions (facial gestures, emotional speech, facial emotions) to expressive speech/non-speech-related hand and other body-part movements. Three major body segments are also defined: the facial region, the mouth region (speech), and the body region, which can all be described within EVA Script. Each of these regions hosts a unique set of control units (either skeleton chains, morphed shapes, or both) by which ECA EVA can be controlled. The influence on the shared border regions is controlled by using the animation-blending technique (e.g. both the facial and mouth regions can influence the mouth at the same time – a smile whilst speaking). The animation of skeleton chains is defined using the forward kinematics (FK) technique provided by the base of our animation engine, the Panda 3D game engine [30].
2.1 EVA's Temporal, Spatial, and Power Components
This section addresses the temporal, spatial, and power components of expressivity in terms of EVA Script. Figure 1 shows a segment of an EVA-Script-based description of head rotation and the corresponding movement of the eyes (gaze).
Fig. 1. EVA script-based description of a gesture – behavioural event
The temporal component of EVA-Script-based descriptions
The temporal component of expressivity describes the duration of each movement. When defining the temporal expressive parameters, a choice was made to define the movement phases of each gesture as a set of four intervals (similar to [27]): preparation, stroke, hold, and retraction. The 'preparation phase' is represented by the start attribute and describes how long after the overall animation segment commences the corresponding control unit starts to move. The durationUp attribute describes the 'stroke phase'. This phase represents the time period within which the corresponding movement shifts (translates/rotates) from its current state towards the script-described state. The persistent attribute describes the 'hold phase'. If this attribute is set to "inf",
the control unit remains in the elevated state indefinitely, or at least until a behavioural event moves it into a different state. In general, all gestures have the tendency to move from a neutral to an excited and back to a neutral state. In order to override this functionality of our ECA, we added the ability for a gesture to remain in its excited state. This presents an important feature when generating complex animations, since each stage of a complex animation can be an extension of the previously elevated stage. The last temporal attribute, durationDown, describes the retraction duration. This represents the interval within which the corresponding unit returns to its previous state (the state before the behavioural event) or to its neutral state. The durationDown attribute is automatically ignored in only two cases: when a gesture is to be persistent, or when a gesture is defined as complex (built out of several animated segments) and the next stage of the complex animation also contains a description of the corresponding control unit (in such cases the control unit will automatically shift from its current to its next state).
The spatial component of EVA-Script-based descriptions
The spatial component of expressivity describes the space which a gesture occupies in the context of the rotation/translation values of its control units. The spatial component is most commonly described by the 3D rotational vector of the control unit, in general (1) RV = (Rx, Ry, Rz), or in our case (2) RV = (H, P, R) [H – heading, P – pitch, R – roll]. Each rotation vector has, in general, 360° of freedom for each axis, but when used in combination with ECA EVA a [-180°, 180°] interval is used, where "–" defines the direction in which the control unit should rotate (e.g. from right to left). The transformation from the [0°, 360°] interval into the [-180°, 180°] interval is implemented automatically by the animation engine of our ECA. Whilst transforming, the minimum/maximum rotation limitations for each control unit are also imposed. The morphed shapes (and in some cases also the bones) can use the 3D translational vector (3) TV = (Tx, Ty, Tz). In the case of morphs, only the Tx value is accounted for. Morphed shapes (blend shapes) present a modelled copy of the original mesh, and the extent of the presented shape usually ranges over [0, 1] on the X axis [28]. EVA-Script uses two attributes in order to define the spatial component of a gesture/control unit. The type attribute defines the transformation being invoked on the control unit. Two common types are currently used: 'HPR' represents the rotation transformation and 'XYZ' the translation. In addition, both 'HPR' and 'XYZ' can also be extended to component-based transformations. Therefore, each control unit can be transformed by only one of its vector components (e.g. types H, P, R, X, Y and Z). During the vector-component-based transformation, all non-moving components are assumed to hold their current values.
The power component of EVA-Script-based descriptions
The power component of expressiveness, in general, describes the dynamic properties of movement. EVA-Script's attribute stress represents this expressive component. By using it, the presentation of each gesture can be additionally fine-tuned. The stress attribute directly influences the value attribute of each control unit. In essence, it enables each gesture to be defined at its maximum spatial values, and by using the interpolation curve y = (3x − 2x³ + 1) · 0.5, each gesture is interpolated to the desired presentation.
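A minimal sketch of applying the power component to a control unit's target transformation is given below. The mapping of the stress attribute onto the argument x of the interpolation curve is an assumption, since the text does not state it explicitly.

```python
def stress_interpolate(target, stress):
    """Scale a gesture defined at its maximal spatial values by the stress
    attribute, using the curve y = (3x - 2x^3 + 1) * 0.5 quoted in the text."""
    x = max(-1.0, min(1.0, 2.0 * stress - 1.0))   # assumed: stress in [0, 1] mapped to x in [-1, 1]
    y = (3.0 * x - 2.0 * x**3 + 1.0) * 0.5
    return tuple(v * y for v in target)           # target e.g. an (H, P, R) rotation vector

# e.g. a smile defined at its maximal rotations, rendered as a weaker smile:
# stress_interpolate((10.0, 0.0, 0.0), 0.2)
```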
For instance, by defining a smile facial gesture (maximum spatial attributes for the corresponding control units) and by using the stress attribute, different states of the smile gesture can be described, such as: weak smile (stress = 0.0), full smile (stress = 1.0), etc. The value of the stress attribute in EVA-Script can be represented either by a 1D vector or by a 3D
vector. The dimensionality of the vector depends on the type of transformation the control unit is performing. A 3D vector is used for 3D transformations (e.g. type = HPR or type = XYZ) and, similarly, when transforming by vector components (e.g. type = X), a 1D vector is used. If the stress attribute describes speech co-articulation, then a 1D vector holding values from 1.0 to 10.0 is sufficient.
Repetitive motion
Repetitive gestures, such as hand waves, head nods, etc., are very important when simulating human motion. These gestures can also present important visual cues, e.g. emphasis of spoken dialogue. ECA EVA can repeat and stress-interpolate any gesture (even over very short intervals). As in the context of the continuity attribute, it can also be stated that no repeated human movement is exactly the same in terms of its temporal and spatial components. Therefore, each repetitive movement must be performed with slight adjustments, either in the temporal or in the spatial components. EVA Script describes the repetitivity parameter using the 'loop' attribute in the behavioural event description. The loop attribute is used only to describe the repetitivity of a gesture as a whole, and is therefore only used within the tags that describe a gesture as a whole. In addition, the loop attribute is automatically ignored when used in gesture templates (predefined gesture descriptions). Currently, a random method is used on the spatial parameters. The modification value (RI – repetitivity influence factor) is randomly selected from the floating-point interval [-0.1000, 0.1000]. The loop-influenced rotational vector can therefore be described as RVloop = RI * (H, P, R).
2.2 EVA's Fluidity Component
Fluidity influences the smoothness and continuity of each animated segment, and of the animation as a whole. In the context of ECA EVA, three degrees of fluidity are defined: continuity (describes the continuity between related gesture segments), transition (enables non-linear interpolation of a control unit's transformation), and interference (defines how to handle different segments of animation when the influence on a body segment is shared). This section addresses the three degrees of fluidity in terms of EVA Script; a segment of an EVA-Script-based motion description is presented in Figure 2.
Fig. 2. Defining speech sequence and hand-gesture using EVA Script
The vizem tag presents a call to a predefined description of viseme animation. By defining predefined motion, the usage of EVA-script's control units was extended from the simple transformation of base control units (bones/morphed shapes) to also include complex descriptions of predefined gestures. When building an animation out of animated segments, six basic rules are set:

1. The UNIT/vizem tag presents the lowest level, and always holds the transformation information.
2. If units are placed within a parallel tag, the different units will be animated at the same time (each based on its own description).
3. If units/parallels are placed within a sequence tag, their transformations will be presented one after another.
4. The speech tag ignores rules 1 and 2. Any speech sequence is regarded as a sequential animation; therefore, vizem tags will always animate one after another.
5. If an identical UNIT tag is used in different animated segments appearing at the same time, the last occurrence of such a tag overrides all previous occurrences (to prevent movement/gesture mismatches).
6. If a speech sequence contains two or more identical vizem tags one after another, the stress level automatically adjusts so that lip movement still occurs.
Continuity

The continuity, proposed in our previous work [20], is handled internally by EVA's animation engine. It basically states that each body movement (gesture) can be planned as sets of parallel/sequential motions of control units. The idea of continuity is implemented by performing an animated segment in the form of a finite-state machine (FSM). The control units only change their physical characteristics (e.g. translation or rotation) on transitions between different animation states. Different simultaneous animated segments (events) always model the ECA EVA in a parallel fashion. Each animated segment is first classified by its type (body animation, facial animation, or speech). Then each segment is broken down into sequential intervals of the control units' parallel transformations. During each transition between animation states, all control units of the previous state (e.g. Ai) are compared against the control units of the state about to be exposed (e.g. Ai+1). Those that match transit to their positions according to the state Ai+1. The control units of state Ai that do not match any unit in Ai+1 are either transformed back to their neutral state or, depending on the persistence attribute, remain in their elevated state. The different control-unit sets can be described as:

a) exposed control units – units transforming from the neutral to an elevated state (4);
b) dissipated control units – units moving back to their initial state (5);
c) transiting control units – units transforming from the previous elevated state to the next elevated state (6).
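A minimal sketch of this partitioning in terms of ordinary set operations is given below; the dictionary representation of a state and the function name are illustrative assumptions, not EVA's actual implementation.

```python
def split_control_units(state_prev, state_next):
    """Partition control units at a transition of the animation FSM.

    Both states are assumed to be dicts mapping control-unit names to their
    target transformations (a simplification made for this sketch).
    """
    prev_units, next_units = set(state_prev), set(state_next)
    exposed = next_units - prev_units     # neutral -> elevated, cf. (4)
    dissipated = prev_units - next_units  # elevated -> neutral, cf. (5)
    transiting = prev_units & next_units  # elevated -> next elevated, cf. (6)
    return exposed, dissipated, transiting


state_a_i = {"head": (10, 0, 0), "left_eye_blink": (1, 0, 0)}
state_a_i1 = {"head": (0, 15, 0), "r_wrist": (30, 0, 0)}
print(split_control_units(state_a_i, state_a_i1))
```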
Transition

The second degree of fluidity, the transition, enables animators to use three types of interpolation when generating behavioural events. Human motion is seldom (almost never) linear. The temporal interpolation between the frames of each gesture therefore enables the movement to accelerate/decelerate at certain stages of the animation progress. The types of movement interpolator currently used are EaseIn, EaseOut and EaseInOut. Figure 3 shows the EaseIn interpolation curve, Figure 4 the EaseOut curve, and Figure 5 the EaseInOut interpolation curve. The velocity diagrams in Figures 3-5 serve only for presentational purposes. The actual acceleration depends on the spatial and temporal components of the transformation, and is automatically adjusted by the Panda3D core engine [30].
Fig. 3. The EaseIn temporal interpolation
Fig. 4. The EaseOut temporal interpolation
Fig. 5. The EaseInOut temporal interpolation
The EaseIn temporal interpolation (Figure 3) suggests that a gesture should be performed with an accelerated start (the slow-start and ramp-to-full phases of the velocity diagram), then continue at maximal speed and, in the last frame of the animated segment, jump from full speed to 0. The acceleration curve is represented as curve Δ (EaseIn). The curve can be described mathematically as (7) y = 0.5 · (3t² − t³), with the temporal component t limited to the interval [0, 1]. The temporal component t is defined as the current time stamp divided by the overall duration. The EaseOut temporal interpolation (Figure 4) is the inverse of the EaseIn interpolation: the animation starts at full speed and in the last n frames decelerates to a slow stop. Equation (7) also applies to the EaseOut interpolation; the temporal component t is, however, limited to the interval [1, 2]. The EaseInOut interpolation is defined as (8) y = 3t² − 2t³ and is a combination of the EaseIn and EaseOut interpolations. The animation starts slowly, ramps to full speed and, after the constant phase (if it exists), slowly decelerates to a full stop.
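The three interpolation curves can be written down directly from equations (7) and (8); the sketch below is only an illustration, and the shift used to map the EaseOut output back onto [0, 1] is our own normalization choice.

```python
def ease_in(t):
    """Equation (7) with t in [0, 1]: slow start, full speed at the end."""
    return 0.5 * (3 * t ** 2 - t ** 3)


def ease_out(t):
    """Equation (7) evaluated on [1, 2], shifted so the output runs from 0 to 1."""
    u = 1.0 + t
    return 0.5 * (3 * u ** 2 - u ** 3) - 1.0


def ease_in_out(t):
    """Equation (8): slow start, full speed in the middle, slow stop."""
    return 3 * t ** 2 - 2 * t ** 3


for f in (ease_in, ease_out, ease_in_out):
    print(f.__name__, [round(f(t / 4), 3) for t in range(5)])
```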
Interference

The interference addresses how the ECA EVA should be modelled when control units perform different transformations at the same time, and how the polygonal mesh should react when different control units try to influence the same sets of vertices. Figure 6 shows an example of a smile whilst speaking; this is a case in which several different control units influence the same sets of vertices.
Fig. 6. Blending different animation segments
The occurrence of a facial expression interfering with the synthesis of a speech sequence is highly probable. Such interferences include teeth piercing the mouth, unnatural deformations of the mouth region, etc. In the example in Figure 6, the teeth would therefore pierce the lower lip (an unnatural teeth position). Interference rules are enacted in order to prevent lip piercing and similar unnatural occurrences in the facial region. Within these rules it is assumed that the visual synthesis of speech should always dominate over any other facial expression. Therefore, the level of influence for speech-related motion is always set to 1, and the level of any other facial expression to 0.2. The value of 0.2 was determined through visual evaluation. The interference limits the influence of the facial expression where the different sets of influenced vertices intersect. Similar levels of influence are also set when two or more facial expressions interfere (without speech). The concept of interference therefore enables the synthesis of both expressive speech and complex emotions.
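A possible way to combine such weighted contributions on a shared vertex is sketched below; the additive combination rule and the parameter names are our own assumptions, since the text only specifies the influence levels (1.0 for visual speech, 0.2 for other facial expressions).

```python
def blend_vertex_offset(speech_offset, expression_offsets,
                        speech_weight=1.0, expression_weight=0.2):
    """Combine displacement offsets acting on one shared vertex.

    `speech_offset` is the displacement requested by the visual speech
    synthesis, `expression_offsets` those requested by facial expressions;
    each expression contribution is scaled down by its influence level.
    """
    blended = speech_weight * speech_offset
    blended += sum(expression_weight * offset for offset in expression_offsets)
    return blended


# A lower-lip vertex: speech opens the mouth by 4 mm, a smile raises it by 2 mm.
print(blend_vertex_offset(4.0, [2.0]))  # speech clearly dominates
```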
3 Results: Synthesizing Expressive Behaviour

The expressive facial gestures of ECA EVA influence both the facial and the body regions of the ECA's articulated model. The morphed shapes used to define several facial gestures are based on the MPEG-4 FAPs. The bone chains, however, serve to widen the influence of the animation (especially at the border regions). In this way, e.g. a part of the lower-jaw animation can also be transferred to the neck region. In addition, by defining bone chains for the teeth (lower and upper) and the tongue, both parts can easily be animated without making additional morphed shapes. Combining the morph-based and bone-based animation techniques widens the ECA's animation capabilities, and extends the finite set of morph-based animations with 360° freedom of movement in all X, Y, and Z spatial planes (the physical movement limitations of each bone are also set with respect to the natural rotation/translation capabilities of the human body). In addition, ECA EVA consists of several body parts (3D sub-models) forming the articulated 3D model. Each body part can be animated independently (either by a gesture template or directly by a behavioural event).
This allows ECA EVA to animate an even broader range of facial/body movements, each animated movement being dependent either on personality traits or on time-variant attributes of human behaviour such as emotion, mood, etc. ECA EVA also provides an 'expressive body' that can be used as part of speech- and non-speech-correlated motion. The expressive body relies solely on bone-based animation and forward kinematics. The body bone chains are structured so as to also influence and animate external objects such as a dress and other objects attached to the body (e.g. jewellery).

3.1 Synthesis of Unconscious Behaviour

Unconscious behaviour can be regarded as background, somewhat autonomous behaviour that is unconsciously controlled by the person performing it. In most cases, such behaviour refers to eye-blinking and breathing (among other types). Unconscious behaviour is also usually somewhat cyclic and periodic. Standard measurements, such as breathing rate, eye-blink rate, heart rate, etc., are usually defined by statistical modelling. The researchers in [31-34] address unconscious behaviour from different perspectives, and provide several insights related to simulating it. In addition, in [32] some evidence relating to randomness factors within cyclic and periodic behaviour can also be found. For instance, human eye-blinking appears to be a cyclic behaviour that occurs at random intervals. ECA EVA can already simulate the findings of [31-34], and of other studies of human behaviour, as background animation. The key concept used when simulating unconscious behaviour combined with conscious behaviour is the already mentioned animation blending technique, together with the fact that all the spatial components of EVA's expressivity are regarded as relative. The initial position of a control unit is taken automatically from the spatial parameters of the control unit at the time the animation event (a behavioural event or an occurrence of unconscious behaviour) is transformed into animation, and not from the animation parameter sets. As a result, the animation itself does not require extensive planning. As an example of the simulation of unconscious behaviour, the animation of eye-blinks on neutral, angry and surprise facial expressions is presented. Figure 7 shows an example of an eye-blink posed on all three types of facial expression. The angry and surprise expressions were chosen since both significantly influence the shape of the eye region (e.g. the position of the eyelids). An eye-blink is defined as the closing of both the lower and the upper eyelids (MPEG-4 FAPs F19-F22, relating to the displacement of the eyelids), and the facial expressions angry and surprise also influence these FAPs. In terms of EVA-script, the eye-blink is defined by two morphed shapes named 'left_eye_blink' and 'right_eye_blink', both generated out of the FAP 19-22 set. Similarly, other facial expressions can also be defined based on their FAP definitions [35][36].
Fig. 7. Blending the eye region
Fig. 8. Expressive behaviour
The eye-blink is background behaviour and, in terms of animation, independent of any other of the ECA's activities. The animation of an eye-blink can be described as a sequence with four phases: SET, HOLD, RELEASE, and HOLD. The SET phase closes the left and right eyelids, and describes both the temporal and spatial characteristics of the stage. The RELEASE phase then describes the temporal and spatial components of the eyelids' retraction to their default state (in our case the neutral, angry or surprise-influenced expression). The HOLD phases define the duration for which the animation remains in its SET/RELEASE-dependent appearance. EVA's eye-blink procedure generates a sequence with four phases for each animation cycle at the time the cycle should occur (depending on the last HOLD phase of the previous cycle). Therefore, continuous eye-blink animation is unplanned and individually generated for each eye-blink cycle. In addition, during each animation cycle, the eye-blink also includes a certain degree of randomness over each of its phases. The interval from which the temporal and spatial random factors are chosen is currently defined by standard deviation (STD) values for each stage, derived from different studies on eye-blink rates [37, 38]. By using animation blending techniques, the background behaviour does not produce unnatural surfaces, even when other behaviour (e.g. a facial expression) is influencing the same region. The eyelids close and open normally even when the eyelid region is narrowed (angry) or expanded (surprise). The concept of eye-blinking can easily be extended to other unconscious behaviour (e.g. breathing). Unconscious behaviour can then be modelled individually and presented as an independent animation. The animation blending techniques and the ability to define unplanned animation of body movement enable ECA EVA not only to emulate unconscious behaviour, but also to emulate complex expressive behaviour, such as that presented in Figure 8, where the ECA provides additional communication cues by using arms, hands and head, and expresses its attitude/emotion using different facial expressions.
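The per-cycle randomization can be sketched as follows; the mean durations and deviations below are placeholders chosen for illustration, whereas the paper derives its values from the blink-rate studies [37, 38].

```python
import random

# (phase name, mean duration in seconds, standard deviation) -- placeholder values
BLINK_PHASES = [
    ("SET", 0.08, 0.02),       # eyelids close
    ("HOLD", 0.04, 0.01),      # eyes stay closed
    ("RELEASE", 0.12, 0.03),   # eyelids retract to the current expression
    ("HOLD", 3.50, 1.00),      # inter-blink interval before the next cycle
]


def next_blink_cycle():
    """Generate one randomized eye-blink cycle (an illustrative sketch only)."""
    return [(name, max(0.01, random.gauss(mean, std)))
            for name, mean, std in BLINK_PHASES]


print(next_blink_cycle())
```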
4 Conclusion

The expressivity of an ECA plays a central role in the perception of verbal and non-verbal dialogue. It defines the ECA's personality and can further explain the context of the spoken dialogue (e.g. which parts of the dialogue are important, through emphasis or visualization of the spoken word). In essence, it therefore defines how information is presented through non-verbal, physically-based behaviour (motion). This paper has presented expressivity in the context of an ECA named EVA. It has outlined and explained several expressivity parameters that EVA uses to simulate natural behaviour.
Each motion generated by EVA can be described hierarchically, and can hold several degrees of complexity (e.g. a composite of several independent movements, or independently animated cycles). EVA-script also defines several parameters in order to fine-tune behavioural events and behavioural templates. Persistence defines the duration of the stroke and overrides the tendency of each motion to start and end in the neutral state. Any position of the body can therefore remain in its stroke state indefinitely, or at least until a future behavioural event (controlling one or more of the relevant control units) transforms it into a different state. The loop attribute enables the animation of repetitive movement (e.g. hand waving, nodding, etc.). Each motion generated within such a repetitive cycle also tends to be unique, differing slightly in both its temporal and spatial attributes. The results in this paper have also presented the concept of synthesizing unconscious behaviour (e.g. eye-blinking), and snapshots of a few expressive gestures combining facial expressions (and masking), head movement, gaze and hand gestures into complex animated behaviour. The concept of expressivity presented in this paper relies on the current state of the ECA's control units. We believe that an ECA should be able to react to different dialogue events in unique, unplanned ways. The concepts and techniques of expressivity presented in this work form a sound basis for synthesizing natural human behaviour that can result in a responsive and human-like interaction between the user and the embodied conversational agent. Our future work will be oriented towards broadening the expressivity of ECA EVA. Using limited sets of hand and facial gestures (obtained through the annotation of video databases), either accompanying speech or non-speech-related tasks, we would like to specify those gestures that are generally used in natural multimodal human-human interactions. Since spoken dialogue can provide substantial information on interaction tendency and expressivity, we also plan further study of the relationships between speech, gestures, and facial and body expressions.

Acknowledgements. This operation is part-financed by the European Union, European Social Fund.
References 1. Georgantas, G., Issarny, V., Cerisara, C.: Dynamic Synthesis of Natural Human-Machine Interfaces in Ambient Intelligence Environments. In: Ambient Intelligence, Wireless Networking, and Ubiquitous Computing. Artech House, Boston (2006) 2. Sato, E., Yamaguchi, T., Harashima, F.: Natural Interface Using Pointing Behavior for Human–Robot Gestural Interaction. Industrial Electronics 54(2), 1105–1112 (2007) 3. Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollia, S., Fellenz, W., Taylor, J.G.: Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine 18(1), 32–80 (2001) 4. Daconta, M.C., Obrst, L.J., Smith, K.T.: The Semantic Web: A Guide to the Future of XML, Web Services, and Knowledge Management. Wiley, Chichester (2003) 5. Schoop, M., de Moor, A., Dietz, J.L.G.: The pragmatic web: a manifesto. Commun. ACM 49(5), 75–76 (2006)
6. Cosatto, E., Graf, H.: Sample-Based Synthesis of Photo-Realistic Talking Heads. In: Proceedings of the Computer Animation, p. 103 (1998) 7. Poggi, I., Pelachaud, C., De Rosis, F., Carofiglio, V., De Carolis, B.: Greta, a believable embodied conversational agent. In: Multimodal Intelligent Information Presentation Text, Speech and Language Technology, vol. 27 (2005) 8. Baldassarri, S., Cerezo, E., Seron, F.J.: Chaos and Graphics: Maxine: A platform for embodied animated agents. Computers and Graphics 32(4), 430–437 (2008) 9. Chuang, E., Bregler, C.: Mood swings: expressive speech animation. ACM Transactions on Graphics (TOG) 24(2), 331–347 (2005) 10. Abrilian, S., Devillers, L., Buisine, S., Martin, J.C.: EmoTV1: Annotation of Real-life Emotions for the Specification of Multimodal Affective Interfaces. HCI International (2005) 11. Malatesta, L., Raouzaiou, A., Karpouzis, K., Kollias, S.: Towards modeling embodied conversational agent character profiles using appraisal theory predictions in expression synthesis. Applied Intelligence 30(1), 58–64 (2009) 12. Zoric, G., Pandzic, I.S.: Towards Real-time Speech-based Facial Animation Applications built on HUGE architecture. In: Proceedings of International Conference on AuditoryVisual Speech Processing AVSP (2008) 13. Smid, K., Zoric, G., Pandzic, I.S.: HUGE: Universal Architecture for Statistically Based HUman Gesturing. In: Gratch, J., Young, M., Aylett, R.S., Ballin, D., Olivier, P. (eds.) IVA 2006. LNCS (LNAI), vol. 4133, pp. 256–269. Springer, Heidelberg (2006) 14. Bevacqua, E., Mancini, M., Niewiadomski, R., Pelachaud, C.: An expressive ECA showing complex emotions. In: Proceedings of the AISB Annual Convention (2007) 15. DeCarolis, B., Pelachaud, C., Poggi, I., Steedman, M.: APML, a mark-up language for believable behavior generation. In: Prendinger, H., Ishizuka, M. (eds.) Life-like Characters. Tools, Affective Functions and Applications, pp. 65–85. Springer, Heidelberg (2004) 16. Kipp, M., Heloir, A., Gebhard, P., Schroeder, M.: Realizing Multimodal Behavior: Closing the gap between behavior planning and embodied agent presentation. In: Proceedings of the 10th International Conference on Intelligent Virtual Agents (IVA 2010). Springer, Heidelberg (2010) 17. Jokinen, K.: Gaze and Gesture Activity in Communication. In: Stephanidis, C. (ed.) UAHCI 2009. LNCS, vol. 5615, pp. 537–546. Springer, Heidelberg (2009) 18. Masuko, T., Kobayashi, T., Tamura, M., Masubuchi, J., Tokuda, K.: Text-to-visual speech synthesis based on parameter generation from HMM. In: Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 6, pp. 3745–3748 (1998) 19. Eliens, A., Huang, Z., Hoorn, J.F., Visser, C.T.: ECA Perspectives - Requirements, Applications, Technology. Dagstuhl Seminar Proceedings 04121, Evaluating Embodied Conversational Agents (2006) 20. Mlakar, I., Rojc, M.: EVA: expressive multipart virtual agent performing gestures and emotions. International Journal of Mathematics and Computers in Simulation 5(1), 36–44 (2011) 21. Gebhard, P.: Alma: a layered model of affect. In: Proceedings of the Fourth International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 29–36. ACM Press, New York (2005) 22. Kranstedt, A., Kopp, S., Wachsmuth, I.: MURML: A Multimodal Utterance Representation Markup Language for Conversational Agents. In: AAMAS 2002 Workshop Embodied Conversational Agents (2002)
23. Bevacqua, E., Mancini, M., Niewiadomski, R., Pelachaud, C.: An expressive ECA showing complex emotions. In: Proceedings of the AISB Annual Convention, Newcastle, UK, pp. 208–216 (2007) 24. Martin, J., Abrilian, C., Devillers, S., Lamolle, L., Mancini, M., Pelachaud, C.: Levels of Representation in the Annotation of Emotion for the Specification of Expressivity in ECAs. In: Panayiotopoulos, T., Gratch, J., Aylett, R.S., Ballin, D., Olivier, P., Rist, T. (eds.) IVA 2005. LNCS (LNAI), vol. 3661, pp. 405–417. Springer, Heidelberg (2005) 25. Rojc, M., Kačič, Z.: Time and space-efficient architecture for a corpus-based text-to-speech synthesis system. Speech Communication 49(3), 230–249 (2007) 26. Kopp, S., Krenn, B., Marsella, S., Marshall, A., Pelachaud, C., Pirker, H., Thórisson, K., Vilhjalmsson, H.: Towards a Common Framework for Multimodal Generation in ECAs: The Behavior Markup Language. In: Gratch, J., Young, M., Aylett, R.S., Ballin, D., Olivier, P. (eds.) IVA 2006. LNCS (LNAI), vol. 4133, pp. 205–217. Springer, Heidelberg (2006) 27. Martin, J., Niewiadomski, R., Devillers, L., Buisine, S., Pelachaud, C.: Multimodal complex emotions: gesture expressivity and blended facial expressions. International Journal of Humanoid Robotics (IJHR), Special Issue Achieving Human-Like Qualities in Interactive Virtual and Physical Humanoids 3(3), 269–291 (2006) 28. Poggi, I.: Mind markers. In: Trigo, N., Rector, M., Poggi, I. (eds.) Gestures. Meaning and Use, University Fernando Pessoa Press (2002) 29. Kipp, M., Neff, M., Albrecht, I.: An annotation scheme for conversational gestures: how to economically capture timing and form. In: Language Resources and Evaluation. Springer, Netherlands (2007) 30. Goslin, M., Mine, M.R.: The Panda3D Graphics Engine. Computer 37(10), 112–114 (2004) 31. Stern, J., Boyer, D., Schroeder, D.: Blink rate: a possible measure of fatigue. Hum. Factors 36(2), 285–297 (1994) 32. Pelachaud, C., Badler, N., Steedman, M.: Generating Facial Expressions for Speech. Cognitive Science 20(1), 1–46 (1996) 33. Albrecht, I., Haber, J., Seidel, H.P.: Automatic generation of non-verbal facial expressions from speech. In: Proceedings of the Computer Graphics International, pp. 283–293 (2002) 34. Clark, F.J., von Euler, C.: On the regulation of depth and rate of breathing. Journal of Physiol. 222(2), 267–295 (1972) 35. Ostermann, J.: Animation of synthetic faces in MPEG-4. In: Proceedings of Computer Animation 1998, pp. 49–55 (1998) 36. Pandzic, I.S., Forchheimer, R.: MPEG-4 facial animation: the standard, implementation and applications. Wiley, Chichester (2002) 37. Bentivoglio, A.R., Bressman, S.B., Cassetta, E., Carretta, D., Tonali, P., Albanese, A.: Analysis of blink rate patterns in normal subjects. Movement Disorders 12, 1028–1034 (1997) 38. Carney, L.G., Hill, R.M.: The nature of normal blinking patterns. Acta Ophthalmologica 60, 427–433 (1982) 39. Nass, C., Isbister, K., Lee, E.J.: Truth is beauty: Researching embodied conversational agents. In: Cassell, J., Sullivan, J., Prevost, S., Churchill, E. (eds.) Embodied Conversational Agents, pp. 374–402. MIT Press, Cambridge (2000) 40. Kopp, S., Allwood, J., Grammer, K., Ahlsen, E., Stocksmeier, T.: Modeling Embodied Feedback with Virtual Humans. In: Wachsmuth, I., Knoblich, G. (eds.) ZiF Research Group International Workshop. LNCS (LNAI), vol. 4930, pp. 18–37. Springer, Heidelberg (2008)
Recognition of Multiple Language Voice Navigation Queries in Traffic Situations

Gellért Sárosi1, Tamás Mozsolics1,2, Balázs Tarján1, András Balog1,2, Péter Mihajlik1,2, and Tibor Fegyó1,3

1 Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics
  {sarosi,tarjanb,mihajlik,fegyo}@tmit.bme.hu, http://www.tmit.bme.hu
2 THINKTech Research Center Nonprofit LLC.
  {tmozsolics,abalog}@thinktech.hu, http://www.thinktech.hu/
3 Aitia International Inc.
  http://www.aitia.ai/
Abstract. This paper introduces our work and results related to a multiple language continuous speech recognition task. The aim was to design a system that introduces a tolerable amount of recognition errors for point-of-interest words in voice navigational queries, even in the presence of real-life traffic noise. An additional challenge was that no task-specific training databases were available for language and acoustic modeling. Instead, general-purpose acoustic databases were obtained, and (probabilistic) context free grammars were constructed for the acoustic and language models, respectively. A public pronunciation lexicon was used for the English language, whereas rule- and exception-dictionary-based pronunciation modeling was applied for French, German, Italian, Spanish and Hungarian. For the last four languages, the classical phoneme-based pronunciation modeling approach was also compared to a grapheme-based pronunciation modeling technique. Noise robustness was addressed by applying various feature extraction methods. The results show that achieving high word recognition accuracy is feasible if cooperative speakers can be assumed.

Keywords: Point of interest, speech recognition, context free grammar, noise robustness, feature extraction, multiple languages, navigation system.
1 Introduction
The main interest of our paper is the design of a speech-based automated guiding service for car drivers and pedestrians. People can ask for help through the public telephone network to find a target destination. The system shown in Figure 1 supports multiple languages; the required one is selected by a keystroke. Incoming calls are directed into a service center featuring a two-level processing system.
Fig. 1. Overview of the navigation service system
At first, an automated call-center service tries to identify the POI (Point Of Interest) in the incoming call based on ASR (Automatic Speech Recognition) technology. The ASR system matches the incoming utterances to a previously loaded speech recognition network and returns the most likely result. The network represents word sequences expected in real-life navigational situations. If a customer notices that the ASR has failed, the call is rerouted to a human assistant who answers the request. In either case, the user's navigation system gets back the GPS coordinates of the most probable POI as an answer. In this paper, we present the design and implementation issues of the ASR part of the system. We first go through the related work in the next section. Then, in Section 3, the characteristics of the training and test databases are described. Section 4 details the training process of the ASR system and the feature extraction methods that we used in our experiments. Our results are summarized in Section 5. In the last section we discuss our findings and draw some pertinent conclusions.
2 Tasks and Related Works
The integration of speech recognition into car navigation systems is increasingly popular. There are several commercially available solutions enabling voice control for navigation, searching, call management, note or e-mail dictation, etc. However, these applications require sophisticated cell phones or mobile operating systems, and operate only in US English. Speech recognition services are typically server-based solutions, but these are still device-dependent applications. Our approach was to develop a server-based speech recognition system for a device-independent service. Since ASR systems still require more resources than an everyday cell phone can provide, the optimal solution is to integrate the recognizer into a platform-independent application like a call center, so that the navigation service can be reached by a simple telephone call. We developed an ASR system for navigational services which can operate in six languages, with manual language pre-selection. Google, one of the leading companies in speech technology, has recently published a study [1] about the language modeling approaches applied in their "Search by Voice" service. In this study they reported using a training database consisting of 320 billion words of Google text search queries. Processing such a huge training corpus needs special treatment. In order to reduce the vocabulary and language model size, a finite-state text normalization technique and aggressive language model pruning were applied.
With the resulting recognition network, around 17% WER was achieved on 10k queries by using 2-pass decoding with lattice rescoring. In another paper by Google [2], a description of the development of the acoustic models used for "Search by Voice" can be found. The service was started with a mismatched acoustic model. As the users provided more and more training data, manual transcriptions were first made for supervised training; then, as the traffic increased, they changed to unsupervised training. Every release was reported to improve the overall accuracy of the system. Researchers at Microsoft presented methods for training text normalization and for interpolating transcripts of real calls with a listing database [3]. However, it was also emphasized that nothing is better than more real data. Hence, our task is more challenging, since we lack task-specific training databases. During dictionary building, we set the number of POI expressions to around 8000 as a compromise between recognition accuracy and POI-expression variety. In [4] there are many more POI words in the language model than in our dictionary; however, no surrounding text is allowed during recognition. This solution simplifies the system, but ignoring the contextual information can make recognition more difficult. In [4], speech enhancement was combined with end-point detection and speech/non-speech discrimination. The time-domain preprocessing stages, such as voice activity detection, may discard noisy but important speech segments; therefore we used only feature extraction to process the speech signals. The general experience [5]-[6] is that every method performs differently depending on the noise conditions and the SNR level. Therefore we re-evaluated several recently developed and several baseline feature extraction methods to examine which front-ends are more suitable for our real-life noisy recognition task.
3 Databases
For training purposes, we used various SpeechDat [7] type databases (see also [8]), recorded typically through mobile telephone networks. These contain recordings from 500 to 5000 speakers for the required languages. Common features of all training and testing databases are the 8 kHz sample rate, single channel, and 8-bit A-law encoding. The identifiers and parameters of each language corpus are presented in Table 1. Recognition tests were performed on a database consisting of navigational questions or statements spoken by native speakers of different ages and of both genders. All of them were recorded through the mobile telephone network, either from the street or from a moving vehicle. The length of the test recordings was in the range of 1-6 seconds. The parameters of the test and training corpora are detailed in Table 1. The test database was recorded in the presence of a wide variety of background noises – callers were asked to walk on the street or travel in a vehicle during the recording. The test subjects had to read out or make up sentences consisting of a question about or a description of a POI – for example a theater, a diner or a museum.
Table 1. The source identifiers (see [7]) and the most important features of the acoustic training databases, and the parameters of the test recordings

                    English  French  German  Hungarian  Italian  Spanish
Training db:
  ELRA ID           S0011    S0061   S0051   –          S0116    S0101
  length [hour]     17,8     57,9    62,1    28,9       93,7     56,5
  # of words        64k      269k    219k    92k        251k     212k
  # of chars        373k     1447k   1586k   630k       1568k    1247k
Test data:
  # of records      58       40      26      291        71       28
  # of speakers     9        5       3       27         9        3
  genders [m/f]     5/4      1/4     3/0     20/7       5/4      2/1
Altogether, 85% of the collected sentences followed the predefined sentence structures, while 15% of them were made up by the test subjects. There were also sentences that did not contain any POI.
4 Speech Recognition Models

4.1 Language Models
Our approach was to apply LVCSR technology to extract POI's from the spoken utterances. Continuous speech recognizers are usually trained on task-specific text corpora. However, collecting a large training database specifically for the investigated speech-controlled guiding system could not fit into the project's financial and time limits. In this section we present two language modeling techniques that can be used when a theme-specific training corpus is completely unavailable. The first model presented here is a rule-based grammar, where the expected sentences for the given situation have to be collected manually. Examples of search statements:

'Where is the nearest pizza/POI restaurant/POI?'
'Is there a shoe/POI shop/POI near here?'
'Find the Modern/POI Art/POI Gallery/POI.'
'Take me to the airport/POI.'

where the whole sentences are recognized and the /POI tags help to extract the POI words from the output of the recognizer. Theoretically there are infinite sentence variations, and it is impossible to collect all the potential search requests. However, separating the class of POI's from the sentence structures results in a much more general representation:

'Where is the nearest [poi]?'      'pizza restaurant'
'Is there a [poi] near here?'      'shoe shop'
'Find the [poi].'                  'Modern Art Gallery'
'Take me to the [poi].'            'airport'
Table 2. Details about the grammar models

                      English  French  German  Hungarian  Italian  Spanish  Average
sentence structures   36,7k    4,1k    68,3k   18,8k      5,4k     17,6k    ≈25k
dictionary size       5,4k     5,9k    7,1k    15k        7,8k     5,4k     ≈7,8k
Replacing the [poi] tags in the predefined sentence structures with the actual POI's in the right-hand column, we get a CFG (Context Free Grammar) model which has NSENTENCES × NPOI sentences (16 in the example). The sentence structure variations can be efficiently described in the GRA format (defined in the Phoenix Parser, see [9]), and it is practical to divide the destinations into subcategories like restaurants, shopping, services, etc. As mentioned above, building an efficient N-gram language model would require a large, task-specific training text corpus. Fortunately, there is a technique [10] that allows us to utilize the collected sentences to train a stochastic grammar. This method performs a three-way randomization on the original database, where the sentences are varied in length, word order and appearance probability. The resulting corpus is then suitable for class N-gram training. The full process is carried out with the Logios language compilation suite [11], which generates class N-gram language models in ARPA format (hereafter referred to as PCFG N-grams: Probabilistic Context Free Grammars, following [10]) from GRA-format CFGs. CFG and PCFG 3-gram models are built for the target languages with the parameters shown in Table 2. There are 8000 POI's used during model training; however, the dictionary sizes are smaller than 8k for all languages except Hungarian (discussed later). There are matching words in the POI expressions, as in the Museum of Applied Arts and the National Museum, which causes the dictionary size to decrease. The variance between the languages comes from the different numbers of contextual words. Hungarian, however, is a highly agglutinative language, so all POI's have three more alternatives with the '-t', '-ba/-be' and '-hoz/-hez/-höz' suffixes (detailed in Section 4.6), which drastically increases the dictionary size.
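The class expansion itself is straightforward; the short sketch below only illustrates how the [poi] class multiplies out against the sentence structures (the variable names are ours).

```python
templates = [
    "Where is the nearest [poi]?",
    "Is there a [poi] near here?",
    "Find the [poi].",
    "Take me to the [poi].",
]
pois = ["pizza restaurant", "shoe shop", "Modern Art Gallery", "airport"]

# Expanding the class yields N_SENTENCES x N_POI sentences (16 in this example).
sentences = [t.replace("[poi]", p) for t in templates for p in pois]
print(len(sentences))      # 16
print(sentences[0])        # Where is the nearest pizza restaurant?
```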
4.2 Pronunciation Model
In the phoneme-based approach, simple grapheme-to-phoneme rules are applied to each lexicon separately in order to obtain word-to-phoneme mappings. The following phonetic transcribers are used: LIA_PHON [12] for French, TXT2PHO [13] for German and Spanish, and our own transcriber for the rest of the languages (English, Italian, Hungarian). In the Hungarian and English pronunciation models, the automatically derived phonetic transcriptions are corrected by using word exception pronunciation dictionaries. For this purpose the BEEP [14] dictionary is applied in the case of the English experiments, whereas for Hungarian only the exceptionally pronounced POI's have been collected in an exception list.
The application of phoneme-based acoustic models requires a considerable amount of language-specific knowledge, such as grapheme-to-phoneme rules or manual phonemic transcriptions. Hence, grapheme-based models are also tested for the four languages (German, Hungarian, Italian and Spanish) where acoustic models are built directly on letters (or graphemes) instead of phonemes [15]. In this approach, the grapheme "pronunciations" of even foreign, traditional and other morphs are obtained as their linear sequence of alphabetic letters, and thus no alternative pronunciations are allowed. However, the grapheme-based approach can be inaccurate for modeling untypical pronunciation variations of grapheme sequences. Applying language-specific grapheme-based exception dictionaries – similarly to the phoneme-based ones – can significantly improve recognition accuracy. For example:

Deutsche = d o j c s e ;    (for Hungarian)
Deutsche = d o y c h e ;    (for English)
Auchan   = o s a n ;        (for Hungarian)
Auchan   = o s c h a n ;    (for German)
Toyota   = t o j o t a ;    (for Hungarian)
Toyota   = t o j o t a ;    (for German)
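A minimal sketch of how such an exception dictionary can be applied is shown below; the dictionary contents repeat the Hungarian entries of the example above, and the function and variable names are our own.

```python
# Illustrative Hungarian exception entries (taken from the example above);
# the real dictionaries used in the paper are of course much larger.
HU_EXCEPTIONS = {
    "deutsche": "d o j c s e",
    "auchan": "o s a n",
    "toyota": "t o j o t a",
}


def grapheme_pronunciation(word, exceptions):
    """Return the grapheme 'pronunciation' of a word.

    An exception entry overrides the default behaviour of spelling the word
    out as its linear sequence of letters.
    """
    key = word.lower()
    if key in exceptions:
        return exceptions[key].split()
    return list(key)


print(grapheme_pronunciation("Toyota", HU_EXCEPTIONS))  # exception applies
print(grapheme_pronunciation("mozi", HU_EXCEPTIONS))    # default letter sequence
```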
4.3 Context Dependency Model
As Equation (1) shows, triphone context expansion is performed after the integration of the higher-level knowledge sources. Context dependency is modeled across word boundaries, with respect to inter-word optional silences as well.

4.4 Acoustic Models
Speaker-independent, decision-tree state-clustered, cross-word tied triphone models were trained using ML (Maximum Likelihood) estimation [16]. Three-state left-to-right HMM's were applied, with GMM's (Gaussian Mixture Models) associated to the states. The acoustic models were trained for each language from the related database according to Table 1. The number of states was in the range 800-5200, depending on the actual language, and 10-15 Gaussians were used per state. All the feature types detailed in Section 4.7 were used, and blind channel equalization [17] was also applied. Context-dependent grapheme-based acoustic models, called "trigraphones", were also trained, similarly to the phoneme-based triphone acoustic models, for the four languages mentioned in Section 4.2.
Table 3. The HMM state numbers for grapheme- and phoneme-based acoustic models

         English  French  German  Hungarian  Italian  Spanish
Phon     0,9k     4,8k    5,2k    0,8k       1,2k     3,6k
Graph    –        –       4,8k    1,8k       3,8k     3,8k
By default, the phonemic questions used in the decision-tree constructions were simply converted to graphemic questions, as in [15]. The resulting HMM state numbers of the phoneme- and grapheme-based models are shown in Table 3.

4.5 Off-Line Recognition Network Construction
The WFST (Weighted Finite State Transducer) [18] recognition network is computed on the HMM level:

    H o wpush( min( det( C o det( L o G ) ) ) )                                (1)
where G (Grammar) denotes the word-level language model, L (Library) is the lexicon of words and their pronunciations, C is the context-dependency transducer, and H (HMM dictionary) is the lexicon of triphones and their HMM states. In (1), det(L o G) is the phoneme-level model, its composition with C the triphone-level model, and the complete expression the HMM-level model. The 'o' symbol denotes the composition operator that carries out the cross-level transformations between the models, and 'det', 'min' and 'wpush' denote further optimization steps (determinization, minimization and weight pushing) [18].
4.6 Models for Multiple Languages
The sequence of WFST operations used for building the recognition networks (1) is language-independent; therefore our main task was to construct the H, C, L and G transducers for each language. The structure of the H, C and L transducers is well defined in [18] and [19]. Their construction is quite straightforward if the language-specific acoustic models and pronunciation rules are given (see Sections 4.4 and 4.2). The construction of the language models (G) has been discussed in Section 4.1. However, there are some language-specific subproblems that still have to be handled. For instance, in Hungarian the target destinations appear as grammatical objects or adverbials in place of the navigation-related keywords. Hence, accusative and adverbial suffixes have to be removed from the end of POI's, for example 'moziba' (to the cinema) or 'áruházhoz' (to a store). These suffixes usually have a couple of alternatives ('-ba/-be', '-hoz/-hez/-höz') according to the position of the back vowels ('a', 'á', 'o', 'ó', 'u', 'ú') and front vowels ('e', 'é', 'i', 'í', 'ö', 'ő', 'ü', 'ű') in the actual word. The lexical form of the POI's can be extracted by using a simple, rule-based software tool that can choose the right suffix alternative with 95% accuracy for our POI dictionary.
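A rough sketch of such a vowel-harmony-based rule is given below; it only covers the '-hoz/-hez/-höz' case, ignores neutral-vowel subtleties, and uses our own function and constant names, so it illustrates the idea rather than the actual tool.

```python
BACK_VOWELS = set("aáoóuú")
FRONT_UNROUNDED = set("eéií")
FRONT_ROUNDED = set("öőüű")


def allative_alternative(poi):
    """Pick the '-hoz/-hez/-höz' alternative for a POI by simple vowel harmony.

    The decision is based on the last harmony-relevant vowel; real Hungarian
    harmony has further exceptions that the paper's rule-based tool handles.
    """
    for ch in reversed(poi.lower()):
        if ch in BACK_VOWELS:
            return poi + "hoz"
        if ch in FRONT_ROUNDED:
            return poi + "höz"
        if ch in FRONT_UNROUNDED:
            return poi + "hez"
    return poi + "hoz"


print(allative_alternative("áruház"))    # áruházhoz
print(allative_alternative("étterem"))   # étteremhez
```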
4.7 Feature Extraction
In order to automatically recognize speech in an environment filled with real-life noise, the choice of the front-end processing stage can be crucial. Multiple feature extraction methods have been developed for this purpose. However, the general experience is that if a technique performs well in certain noise conditions, it can be suboptimal in other noise or high-SNR conditions.
Real-life noise therefore always calls for a re-evaluation of the acoustic feature extraction techniques. This section describes the advanced and baseline methods included in our comparative test. The Mel Frequency Cepstral Coefficients (MFCC) front-end is a widely used feature extraction method implemented in multiple ways. We tested the variant included in the HTK (Hidden Markov Model Toolkit) [16], the front-end of the SPHINX [20] speech recognition system, and our own version implemented in the VOXerver recognition software (developed at Aitia International Inc.), which we also used in the recognition tests. The major difference between the three MFCC front-ends lies in the procedure that reduces the convolutive distortions caused by the transmission channel. The HTK and SPHINX systems use CMN (Cepstral Mean Normalization), while our implementation applies an adaptive technique based on BEQ (Blind Equalization) [17]. The Perceptual Linear Prediction (PLP) [21] is also a quite popular feature extraction method, because it is considered a more noise-robust solution; therefore we added the HTK implementation to our tests. The Perceptual Minimum Variance Distortionless Response (PMVDR) [22] is based on a procedure that estimates the transfer characteristic from the signal's spectrum by computing an upper spectral envelope. A special transformation called frequency bending is applied to the FFT spectra instead of a filtering step. The MVDR spectrum is derived from the LP coefficients, calculated in a similar way to the PLP method. This front-end also uses BEQ to reduce convolutive distortions. The Power Normalized Cepstral Coefficients (PNCC) [23] is a recently introduced front-end technique, similar to the MFCC, but the Mel-scale transformation is replaced by Gammatone filters [24] simulating the behavior of the cochlea. Furthermore, it includes a step called medium-time power bias removal to increase robustness. The bias vector is calculated using the arithmetic-to-geometric mean ratio, in a way that estimates the speech quality reduction caused by noise.
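For reference, the CMN step used by the HTK and SPHINX front-ends is a simple per-utterance mean subtraction; the sketch below shows the textbook form of it (it is not taken from either toolkit and is only meant to contrast with the adaptive blind equalization of the VOXerver front-end).

```python
import numpy as np


def cepstral_mean_normalization(cepstra):
    """Subtract the per-utterance mean from each cepstral coefficient.

    `cepstra` is a (frames x coefficients) array; removing the mean reduces
    convolutive channel distortion, which appears as an additive offset in
    the cepstral domain.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)


# Toy MFCC-like features with a constant channel offset of +3 per coefficient
frames = np.random.randn(200, 13) + 3.0
normalized = cepstral_mean_normalization(frames)
print(np.round(normalized.mean(axis=0), 3))  # approximately zero per coefficient
```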
4.8 Evaluation
One-pass decoding was performed by the frame-synchronous WFST decoder called VOXerver, developed in our laboratories. The RTF (Real Time Factor) of the decoding process was adjusted to be nearly equal (0.2-0.4 @ 2 GHz, 2-core CPU) across all languages, using standard pruning techniques. The standard WACC (Word Recognition Accuracy) was measured to evaluate the general performance of each ASR system, whereas the efficiency of POI retrieval was estimated by measuring the word recognition accuracy of the POI-related words (WACC,POI).
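For completeness, the usual definition of word recognition accuracy (the HTK-style accuracy figure) is sketched below; the paper itself does not spell out the formula, so this is the standard definition rather than a quotation from it.

```python
def word_accuracy(n_ref_words, substitutions, deletions, insertions):
    """Standard word recognition accuracy in percent: (N - S - D - I) / N."""
    return 100.0 * (n_ref_words - substitutions - deletions - insertions) / n_ref_words


# e.g. 1000 reference words with 50 substitutions, 20 deletions and 10 insertions
print(word_accuracy(1000, 50, 20, 10))  # 92.0
```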
5 Results and Discussion
In this section, we discuss the results according to the various aspects of the speech recognition tests. First we compare the phoneme-based CFG and PCFG 3-gram models for the six languages. Then grapheme-based pronunciation and acoustic modeling are compared to the classical phoneme-based approach for the four suitable languages. Finally, we discuss the impact of the various feature extraction methods, using both the CFG and the 3-gram models. The average results of the tests were weighted by the relative number of test recordings for each language.
5.1 CFG vs. PCFG 3-Gram
In this test, we compared the phoneme-based CFG and PCFG 3-gram models for the six languages in the case of a complete match, which means that there were no OOV (Out Of Vocabulary) words or out-of-grammar expressions in the test recordings. This test can also be interpreted as a comparison of the targeted languages. The results are shown in Table 4, where the highest average scores are emphasized. The complete match gave much better results with the CFG model; however, the more flexible PCFG 3-gram model performed nearly as well as the CFG. The applied feature extraction technique was the internal MFCC method of the VOXerver. The results for the different languages were similar, except for the English model, which significantly underperformed the other five. This was probably caused by the relatively small training database we had for acoustic model training. Another possible explanation is that the mapping of training and test words to their phonetic counterparts was obtained independently, using different methods. We performed another test on the Hungarian CFG and PCFG 3-gram models. Initially, all test sentences were included in – they were all expected by – the language models. In this test, some of the sentence structures were removed from the training data; therefore several test sentences were not included in the language models. These became unexpected sentences and caused the matching rate (expected test sentences / all test sentences) to drop. The flexibility of the CFG and PCFG models was evaluated by decreasing this matching rate in three steps.

Table 4. The word recognition accuracy results of the phoneme-based CFG and PCFG 3-gram models (with POI vocabulary size of 8k avg.) for the six languages, and the average accuracies (the highest word and POI accuracies are written in bold)

WACC    English      French       German       Hungarian    Italian      Spanish      Average
[%]     All   POI    All   POI    All   POI    All   POI    All   POI    All   POI    All   POI
CFG     50.4  35.1   74.2  80.0   72.4  89.1   70.9  66.9   77.4  81.8   86.3  84.9   70.7  68.5
PCFG    51.0  50.4   78.5  85.2   74.9  82.6   68.9  64.5   66.1  83.3   81.4  80.3   68.2  68.9
Fig. 2. a) The performance of the Hungarian CFG and PCFG 3-gram methods at different matching rates (the ratio of the number of test sentence structures covered by the language models to the number of all test sentences); b) The WACC,POI accuracy and its approximating power-law regression function for the Hungarian CFG model in the NPOI range from 1000 to 21000
As Figure 2 (a) shows, the recognition accuracy decreased as the test data became more unexpected for the language models. The 3-gram showed an increasing advantage over the CFG model as the matching rate decreased, because the PCFG has a more flexible structure for recognizing unexpected word sequences. The differences between the CFG and PCFG models were only measured to be significant (signed-rank Wilcoxon test) if the matching rate was under 0.8. Summarizing the comparison of the CFG and PCFG models: the CFG gave a slightly better average recognition result in the case of a perfect match to the test database, but the PCFG N-gram performed better with a smaller sentence-structure database and was more flexible with unexpected sentences. The Hungarian CFG model was also tested using POI dictionary sizes varying linearly from 1000 to 21000 in 20 steps, keeping the complete-match condition for all models. The power-law regression function (2) was calculated on the measured recognition accuracy values, see Figure 2 (b):

    WACC,POI ≈ 87.1385 · (NPOI)^(−0.0286) [%]                                  (2)
According to Equation (2), the POI-related word recognition accuracy shows a decreasing tendency in the range NPOI ∈ [1000, 21000]. During the WFST network construction, the memory needs grew almost linearly with the increase of the POI number, but constructing a PCFG model needed significantly more memory than the CFG model.
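Evaluating the fitted regression at a few dictionary sizes illustrates how mild the decrease is; the sketch below simply plugs values into Equation (2).

```python
def predicted_poi_accuracy(n_poi):
    """Evaluate the fitted power-law regression (2) from Figure 2 (b)."""
    return 87.1385 * n_poi ** (-0.0286)


for n in (1000, 8000, 21000):
    print(n, round(predicted_poi_accuracy(n), 1))   # roughly 71.5, 67.4, 65.6 [%]
```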
5.2 Grapheme-Based Models vs. Phoneme-Based Models
Grapheme-based language models were tested for German, Hungarian, Italian and Spanish, in comparison with the results of the previous section.
Table 5. The word recognition accuracy results of the grapheme- and phoneme-based CFG models of the four languages, and the average accuracies (highest values are written in bold)

WACC [%]    German       Hungarian    Italian      Spanish      Average
            All   POI    All   POI    All   POI    All   POI    All   POI
Grapheme    63.8  82.6   69.2  58.4   76.6  82.6   79.8  75.8   70.8  65.2
Phoneme     72.4  89.1   70.9  66.9   77.4  81.8   86.3  84.9   73.1  72.0
Fig. 3. The average recognition accuracies of the phoneme- and grapheme-based CFG models for the four languages
The applied feature extraction was again the internal MFCC method of the VOXerver. Following the negative experiences of [15], the English and French languages were excluded from this series of grapheme-based experiments. Table 5 contains the recognition results comparing the grapheme-based models to the phoneme-based ones, and Figure 3 shows the average accuracies of the two methods. A significant part of the POI's contains names of multinational enterprises and international brands from all over the world. The ratio of foreign words in the test recordings was negligible in the case of German, Italian and Spanish; therefore the application of the exception dictionaries did not affect the grapheme-based accuracies. However, our Hungarian test recordings contained 15% foreign POI's, for example Erste Bank, McDonald's, Renault, etc. In this case, the grapheme-based exception dictionary (see Section 4.2) significantly improved the Hungarian grapheme-based accuracies (from WACC,All = 66.4% and WACC,POI = 51.1% to the scores in Table 5). The Hungarian, German and Spanish grapheme-based models performed a bit worse than their phoneme-based counterparts. However, the Italian model performed better than its phoneme-based alternative. The weaker performance of the phoneme-based Italian model was probably caused by the manually collected and possibly incomplete phonetic transcription rules. According to the average results in Figure 3, the accuracy of the grapheme-based models comes close to that of the phoneme-based ones.
Not surprisingly, the POI names are harder to recognize, because they typically contain more "out of language" expressions than ordinary words. The grapheme-based exception dictionaries could help in every language, not just for Hungarian. This could be investigated in the future by obtaining more test recordings that contain foreign POI names.
5.3 Comparison of Feature Extraction Methods
In order to compare the acoustic front-ends, a series of experiments was performed using the standard settings of the previous tests. Both the CFG and the PCFG 3-gram language models were included in the recognition tests. Table 6 shows the word recognition accuracies (WACC) over all words, grouped by the CFG and PCFG language model types. The three highest average word recognition scores are emphasized for both language modeling types, to indicate the best-performing front-end techniques. In addition, Figure 4 displays the average word scores for both language models. The different front-end methods generally performed better using the CFG-based language models, similarly to the results in Section 5.1. For both model types, the same three feature extraction methods gave the best performance. According to Figure 4, the PNCC technique proved to be the best, but the MFCC and PLP front-ends of the HTK are only slightly behind. Surprisingly, the noise-robust methods could not outperform the standard MFCC techniques, although the differences can vary considerably across tasks and implementations. The POI recognition rates were also evaluated but are not discussed here, because they correlate well with the global word accuracies and with the previous results. A more detailed analysis of the noise robustness of the applied front-end techniques can be found in [25].

Table 6. Recognition accuracies of the CFG and the PCFG 3-gram models using various feature extraction methods

WACC,CFG [%]    English  French  German  Hungarian  Italian  Spanish  Average
mfcc-htk        59.8     69.2    71.2    73.3       79.7     78.1     72.5
mfcc-sphinx     48.3     72.8    68.1    70.1       75.8     77.6     68.9
mfcc-vox        50.4     74.2    72.4    70.9       77.4     86.3     70.7
plp-htk         59.5     70.6    68.7    71.7       79.7     83.6     71.9
pmvdr           49.7     68.5    68.7    72.1       78.7     76.0     70.3
pncc            50.8     75.3    65.0    73.5       77.4     83.6     71.9

WACC,PCFG [%]   English  French  German  Hungarian  Italian  Spanish  Average
mfcc-htk        55.0     76.7    70.6    70.8       68.4     74.9     69.4
mfcc-sphinx     42.7     75.6    64.7    68.7       66.6     74.3     66.1
mfcc-vox        51.0     78.5    74.9    68.9       66.1     81.4     68.2
plp-htk         57.8     75.3    71.2    69.9       69.2     81.4     69.6
pmvdr           45.0     75.6    71.2    69.6       68.9     73.8     68.2
pncc            50.6     78.1    72.4    71.4       72.6     83.1     70.4
Fig. 4. Averaged word accuracies of the CFG and the PCFG language models
6 Conclusions
This paper introduced our work on designing a multiple language continuous speech recognition system for a navigation service. The aim was to achieve good recognition accuracy for point-of-interest words in voice navigational queries, even in the presence of real-life traffic noise. A serious challenge was that no task-specific training databases were available for language modeling. Instead, we applied and compared two language model construction methods: CFG modeling and PCFG N-grams. Both methods gave suitable solutions for this specific speech recognition task. As expected, the completely matched CFG model yielded the highest recognition accuracies. Increasing the number of POI expressions from 1000 to 21000 in the Hungarian CFG model, we found that a power-law approximation can be applied between the word error rate and the number of POI's in the examined range. The search space was much larger in the case of the PCFG N-gram model than with the CFG approach, which resulted in a minor recognition accuracy reduction in the fully matched tests. However, a significant advantage of the PCFG model was observed when the test contained more out-of-grammar and OOV elements. The classical phoneme-based pronunciation modeling approach was compared to a customized grapheme-based pronunciation modeling technique for the German, Hungarian, Italian and Spanish languages. The results showed that building an exception dictionary for foreign POI-related words can significantly improve the grapheme-based models, making them almost competitive with the phoneme-based ones. Noise robustness was addressed by applying various feature extraction methods. The results showed a great variation in the recognition accuracies of the acoustic front-ends. The recognition scores of the PNCC proved to be the highest, but the MFCC and PLP front-ends of the HTK were only slightly behind.
The results suggest that achieving high word recognition accuracy is possible if cooperative speakers can be assumed – that is, if the users raise navigational questions in the usual way, so that the ratio of OOV and out-of-grammar expressions is minimal. The different cost-efficient language and acoustic modeling techniques and feature extraction methods worked well for the six included languages, although the English accuracy should be improved. Hopefully, with the growth of the acoustic model training database, the results of our English model will improve and approach the accuracies of the other five. Acknowledgment. Our research was partially funded by: OM-00102/2007, OMFB-00736/2005, TÁMOP-4.2.2-08/1/KMR-2008-0007, TÁMOP-4.2.1/B-09/1/KMR-2010-0002, KMOP-1.1.1-07/1-2008-0034.
References 1. Chelba, C., Schalkwyk, J., Brants, T., Ha, V., Harb, B., Neveitt, W., Parada, C., Xu, P.: Query Language Modeling for Voice Search. In: Proceedings of the 2010 IEEE Workshop on Spoken Language Technology (2010) 2. Schalkwyk, J., Beeferman, D., Beaufays, F., Byrne, B., Chelba, C., Cohen, M., Garrett, M., Strope, B.: Google Search by Voice: A Case Study (2010) 3. Yu, D., Ju, Y.-C., Wang, Y.-Y., Zweig, G., Acero, A.: Automated Directory Assistance System - from Theory to Practice. In: INTERSPEECH 2007, pp. 2709–2712 (2007) 4. Lee, S.H., Chung, H., Park, J.G., Young, H.-Y., Lee, Y.: A Commercial Car Navigation System using Korean Large Vocabulary Automatic Speech Recognizer. In: APSIPA 2009 Annual Summit and Conference, pp. 286–289 (2009) 5. Kim, D.-S., Lee, S.-Y., Rhee, M., Kil, R.M.: Auditory Processing of Speech Signals for Robust Speech Recognition in Real-World Noisy Environments. IEEE Transactions on Speech and Audio Processing 7(1), 55–69 (1999) 6. Milner, B.: A comparison of front-end configurations for robust speech recognition. In: ICASSP 1993, pp. 797–800 (1993) 7. European Language Resource Association, http://catalog.elra.info/ 8. Hungarian Telephone Speech Database (Magyar Telefonos Besz´ed Adatb´ azis), http://alpha.tmit.bme.hu/speech/hdbMTBA.php 9. Center for Spoken Language Research of Colorado: Phoenix parser for spontaneous speech, http://cslr.colorado.edu/~whw/phoenix/ 10. Harris, T.K.: Bi-grams Generated from Phoenix Grammars and Sparse Data for the Universal Speech Interface. In: Language and Statistics Class Project, CMU (May 2002) 11. CMU Language Compilation Suite for Dialog Systems, https://cmusphinx.svn. sourceforge.net/svnroot/cmusphinx/trunk/logios/ 12. A text phonetization system for the MBROLA system, http://tcts.fpms.ac.be/ synthesis/mbrola/tts/French/liaphon.tar.gz 13. A German TTS-frontend for MBROLA system, http://www.sk.uni-bonn.de/ forschung/phonetik/sprachsynthese/txt2pho 14. British English pronunciation dictionary, http://mi.eng.cam.ac.uk/comp. speech/Section1/Lexical/beep.html
15. Kanthak, S., Ney, H.: Context-dependent acoustic modeling using graphemes for large vocabulary speech recognition. In: ICASSP 1993, pp. 845–848 (1993) 16. Young, S., Ollason, D., Valtchev, V., Woodland, P.: The HTK book (for HTK version 3.4) (March 2009), http://htk.eng.cam.ac.uk 17. Mauuary, L.: Blind Equalization in the Cepstral Domain for robust Telephone based Speech Recognition. In: Proc. of EUSPICO 1998, vol. 1, pp. 59–363 (1998) 18. Mohri, M., Pereira, F., Riley, M.: Weighted Finite-State Transducers in speech Recognition. Computer Speech and Language 16(1), 69–88 (2002) 19. Szarvas, M.: Efficient Large Vocabulary Continuous Speech Recognition Using Weighted Finite-state Transducers – The Development of a Hungarian Dictation System. PhD Thesis, Department of Computer Science, Tokyo Institute of Technology, Tokyo (March 2003) 20. CMU Speech Recognition Engine (SphinxTrain 1.0), http://www.speech.cs.cmu. edu/ 21. Hermansky, H.: Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America 87(4), 1738–1752 (1990) 22. Yapanel, U.H., Hansen, J.H.L.: A New Perspective on Feature Extraction for Robust In-Vehicle Speech Recognition. In: EUROSPEECH 2003, pp. 1281–1284 (2003) 23. Kim, C., Stern, R.M.: Feature Extraction for Robust Speech Recognition using a Power-Law Nonlinearity and Power-Bias Subtraction. In: INTERSPEECH 2009, pp. 28–31 (2009) 24. Patterson, R.D., Robinson, K., Holdsworth, J., McKeown, D., Zhang, C., Allerhand, M.H.: Complex sounds and auditory images. In: Cazals, Y., Demany, L., Horner, K. (eds.) Auditory and Perception, pp. 429–446. Pergamon Press, Oxford (1992) 25. S´ arosi, G., Mozs´ ary, M., Mihajlik, P., Fegy´ o, T.: Comparison of Feature Extraction Methods for Speech Recognition in Noise-Free and in Traffic Noise Environment. In: Proc. of the 6th Conference on Speech Technology and Human-Computer Dialogue, Romania, Brasov (2011)
Comparison of Segmentation and Clustering Methods for Speaker Diarization of Broadcast Stream Audio Jan Prazak and Jan Silovsky Institute of Information Technology and Electronics, Faculty of Mechatronics, Technical University of Liberec, Studentska 2, CZ 46117 Liberec, Czech Republic {jan.prazak,jan.silovsky}@tul.cz
Abstract. This paper investigates various approaches to segmentation of media streams into speaker homogenous segments and approaches to clustering of speakers within a speaker diarization system for processing of broadcast audio. Evaluated segmentation approaches are all based on the widely used Bayesian Information Criterion (BIC). They differ in a strategy for choice of the length of the window (fixed or variable) and in a strategy for estimation of the decision threshold (fixed or adaptive). Further, we compare two bottom-up clustering approaches. The traditional BIC-based clustering is compared with the approach based on a measure of the distance between GMMs estimated for the data of clusters by the Maximum A Posteriori (MAP) adaptation. Keywords: Speaker diarization, speaker segmentation, clustering, broadcast stream audio.
1 Introduction
Speaker diarization is the process of partitioning input audio data into speaker-homogeneous segments. It is a useful preprocessing step in speech or speaker recognition and for searching and indexing of audio archives, whose number and extent are growing exponentially. It can also improve the readability of automatic transcriptions. Our research is motivated mainly by the fact that we have developed a system for continual monitoring of Czech broadcasting [1]. It has been used in practice since 2008 and, as the number of monitored TV and radio stations increases rapidly, the process of audio data transcription, labeling and indexing needs to be further automated. One part of this is broadcast stream diarization, which helps to identify new speakers (those not included in the speaker database) and to collect their audio data for speaker recognition and speaker adaptation purposes.
2 Speaker Diarization System
Our speaker diarization system consists of three main parts depicted in Fig. 1. After feature vectors are extracted the speech activity detection is applied. Next,
Fig. 1. Framework of the speaker diarization system (audio → Speech Activity Detection → Speaker Segmentation → Clustering → speaker diarization)
speaker change points are detected by the speaker segmentation module. Finally, segments of the same speakers are clustered and the speaker diarization output is provided. In our experiments, all modules use Mel-frequency cepstral coefficients (MFCCs) as features. Our speech activity detector has two parts - an energy detector with an adaptive threshold and a Gaussian Mixture Model (GMM) based detector. The aim of the former is to remove silent parts from the signal, while the latter does the same for other non-speech events, namely music and noise. The other modules will be described in more detail in the next subsections.
2.1 Speaker Segmentation
The aim of the speaker segmentation module is to find such points in audio streams where a change of speakers occurs. Essentially, the speaker segmentation module works as follows. The audio stream is sequentially processed within a sliding window which is analyzed for a change of speakers as depicted in Fig. 2. A change point candidate is generated based on a metric that evaluates the difference between hypotheses that the data in the window were represented by a single distribution or two distributions. A change point is detected if the value of the metric for the candidate is higher than a decision threshold. Finally, the analyzed window is shifted.
Fig. 2. Process of speaker segmentation of an audio stream (an adaptive window spanning frames a to b of the speech segment is split at a candidate frame t; the window data are modeled either by a single distribution (μ, Σ) or by two distributions (μ1, Σ1) and (μ2, Σ2))
Typically, the metric used by speaker segmentation methods is based on the Bayesian Information Criterion (BIC). Let X = x1, . . . , xF be a sequence of d-dimensional feature vectors representing an audio stream and let us assume an analyzed window depicted in Fig. 2 with start frame a and end frame b. The value of the penalized likelihood ratio test function based on the BIC for frame t (a ≤ t < b) is defined as [2]

ΔBIC(t) = (N1 + N2) log |Σ| − N1 log |Σ1| − N2 log |Σ2| − αP,   (1)

where N1 and N2 are the numbers of frames in the two parts of the window, Σ, Σ1 and Σ2 are the full covariance matrices estimated on the whole window and on its two parts, and P is the penalty
P = (1/2) (d + (1/2) d(d + 1)) log(N1 + N2),   (2)
and finally α is the penalty weight (we used α = 1).
2.2 Speaker Segmentation - Choice of Analyzed Window
There are two basic strategies for the choice of the analyzed window. The first, general approach uses a variable-length (adaptive) window [3]. This method works as follows [2]. First, a window of minimal length Wmin (100 frames in our case) is set at the beginning of the stream. Then the following routine is repeated until the end of the analyzed window reaches the end of the stream: the ΔBIC is computed for every frame of the analyzed window, and if the ΔBIC for a frame t0 is higher than a certain threshold, a new change point is detected at frame u = maxt(ΔBIC(t)), where t is in the interval [t0, t0 + 500 ms]. Finally, the new boundaries of the analyzed window are set to a = u + 1 and b = u + Wmin. Every time no change point is detected within the analyzed window, the end of the window is shifted forward by a fixed number of frames (50 in our experiments). If the length of the window reaches a maximum length (we used 2000 frames ≈ 20 s) and no change point is detected within the window, then a fictive change point is placed at the time corresponding to the end of the window b and a new window of length Wmin starts at frame b + 1. This trick can reduce the computational cost because analysis of long windows is computationally expensive. In most cases, the fictive change points are eliminated at the clustering stage. The speaker segmentation employing a fixed-length window operates with a window of constant length (3 s in our case) which moves along the stream frame-by-frame, and the ΔBIC values are computed only for the center frame of the window. Please note that when using the fixed-length window, the penalty P is constant for every frame.
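The following minimal Python/NumPy sketch illustrates the ΔBIC computation of Eqs. (1)-(2) and a simplified version of the growing-window search described above; it is not the authors' implementation, the covariance regularization and the decision threshold are assumptions, and the window-reset logic after a detection is omitted for brevity.

import numpy as np

def delta_bic(X, t, alpha=1.0):
    # ΔBIC for splitting the window X (frames x dims) at frame index t (Eqs. 1-2).
    n, d = X.shape
    logdet = lambda S: np.linalg.slogdet(S + 1e-6 * np.eye(d))[1]
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (n * logdet(np.cov(X.T)) - t * logdet(np.cov(X[:t].T))
            - (n - t) * logdet(np.cov(X[t:].T)) - alpha * penalty)

def detect_change(X, w_min=100, shift=50, threshold=0.0):
    # Grow the analysis window until a frame with ΔBIC above the threshold is found.
    a, b = 0, w_min
    while b <= len(X):
        win = X[a:b]
        scores = np.array([delta_bic(win, t) for t in range(10, len(win) - 10)])
        if scores.size and scores.max() > threshold:
            return a + 10 + int(scores.argmax())   # detected change point (frame index)
        b += shift                                 # no change found: extend the window
    return None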
2.3 Speaker Segmentation - Threshold Estimation Strategies
Usually the speaker segmentation modules employ a fixed threshold to decide about the change point candidates. The threshold is estimated using development data. This approach is very straightforward but not very robust, and thus it may severely affect the performance of the speaker segmentation in domains where the nature of the encountered audio streams is not uniform, e.g. the broadcast domain. The adaptive document-specific threshold ΔBIC* is estimated as follows [4]:

ΔBIC* = μΔBIC + p(HΔBIC − μΔBIC),   (3)

where the ΔBIC values are computed for all frames applying the fixed-length window approach (we used a length of 3 s), μΔBIC is the mean of the values and HΔBIC represents the average value of the L highest values. Finally, p is a regularization parameter estimated using development data. The drawback of the approach with the adaptive threshold is that it requires two passes and thus it is computationally more expensive.
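A corresponding sketch of the document-specific threshold of Eq. (3); the default values of p and L used here are arbitrary assumptions, not the values tuned by the authors.

import numpy as np

def adaptive_threshold(delta_bic_values, p=0.5, top_l=20):
    # ΔBIC* = mu + p * (H - mu), H being the mean of the L highest fixed-window ΔBIC values.
    v = np.asarray(delta_bic_values, dtype=float)
    mu = v.mean()
    h = np.sort(v)[-top_l:].mean()
    return mu + p * (h - mu)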
2.4 Clustering
The aim of clustering is to group segments from the same speaker together. We used bottom-up clustering (a.k.a. hierarchical, agglomerative clustering), which is the predominant approach for speaker clustering in the speaker diarization framework. For simplicity, we will refer to each speaker segment as a cluster in the further description. The clustering works as follows (a minimal code sketch is given after the list):
1. compute the distance between each pair of clusters
2. if a stopping criterion holds then exit
3. merge the closest pair
4. update the distances of the remaining clusters to the new cluster
5. go to 2.
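The loop above in a minimal, metric-agnostic Python form (a sketch, not the authors' code); the distance function and the stopping threshold stand for either of the two metrics described in the following subsections.

def bottom_up_clustering(clusters, distance, stop_threshold):
    # Agglomerative clustering: merge the closest pair until the minimal
    # pairwise distance exceeds the stopping threshold.
    clusters = list(clusters)                     # each cluster = list of segments
    while len(clusters) > 1:
        pairs = [(distance(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        d_min, i, j = min(pairs)
        if d_min > stop_threshold:                # stopping criterion (step 2)
            break
        merged = clusters[i] + clusters[j]        # merge the closest pair (step 3)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters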
We used this algorithm with two different distance metrics, which will be described in the next subsections in more detail.
2.5 Clustering - BIC Based
The most popular clustering metric is based on the BIC. The BIC-based criterion compares the BIC statistics of clusters g1 and g2 with the BIC statistics of the cluster g that is formed by merging clusters g1 and g2. The criterion is defined as

ΔBIC(g1, g2) = (N1 + N2) log |Σ| − N1 log |Σ1| − N2 log |Σ2| − αP,   (4)

where Ni is the number of frames and Σ (Σi) is the full covariance matrix of the data of the merged cluster (of cluster gi). It is obvious that Eq. 4 represents exactly the same penalized likelihood function as defined by Eq. 1. The penalty P is computed as in Eq. 2 and the penalty weight is α = 1. If the ΔBIC is lower than the estimated threshold, the clusters are merged. If the minimal distance between any pair of clusters is over a fixed threshold estimated on development data, then the stopping criterion is met.

2.6 Clustering - Parameter-Derived Distance between Adapted GMMs
This clustering technique, introduced in [5], is based on the principles of the Maximum A Posteriori (MAP) adaptation of GMMs. In the MAP adaptation scenario, the GMM for a speaker segment (cluster) is adapted from a Universal Background Model (UBM). The UBM is trained using a large amount of data pooled from many speakers and its parameters are used as priors for the MAP estimation. Here we use all speech segments of the currently processed stream to estimate the UBM, so that no development data are needed. Such a UBM will be referred to as the Document Speech Background Model (DSBM). The parameters of the model are the weights wc, mean vectors μc and covariance matrices Σc, where c = 1, . . . , C and C is the number of Gaussian components in the model. Here only diagonal covariance matrices are employed. Only the means are adapted within the MAP
adaptation process. The adapted mean vectors for a cluster are derived from the DSBM according to [6]

μ̂c = βc Ec(X) + (1 − βc) μc,   (5)
where X = x1, . . . , xF is a sequence of F feature vectors representing the data of the cluster, μc represents the DSBM mean vectors and Ec(X) is the maximum likelihood estimate of the mean vector for component c (see [6] for more detail). The βc is a coefficient that controls the balance between the new and old estimates. It is defined as βc = nc/(nc + r), where r is the relevance factor value (we used the same value for all components) and nc = Σ_{f=1..F} γc(f), where γc(f) represents the probability that the f-th feature vector xf was generated by component c. The Parameter-Derived Distance (PDD) between clusters g1 and g2 is defined as [5]
C
2c )T Σ −1 2c ). wc ( μ1c − μ μ1c − μ c (
(6)
c=1
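A minimal NumPy sketch of Eqs. (5)-(6): means-only MAP adaptation from a diagonal-covariance DSBM followed by the parameter-derived distance. The DSBM parameters (weights, means, diagonal covariances) are assumed to be given by a previously trained GMM; the relevance factor default is an assumption.

import numpy as np

def map_adapt_means(X, weights, means, diag_covs, r=16.0):
    # Means-only MAP adaptation of a diagonal-covariance GMM (Eq. 5).
    C, d = means.shape
    log_p = np.stack([
        -0.5 * (np.sum((X - means[c]) ** 2 / diag_covs[c], axis=1)
                + np.sum(np.log(2 * np.pi * diag_covs[c]))) + np.log(weights[c])
        for c in range(C)], axis=1)                          # frame-level log p(x, c)
    gamma = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)                # posteriors gamma_c(f)
    n_c = gamma.sum(axis=0)                                  # occupation counts
    e_c = (gamma.T @ X) / np.maximum(n_c[:, None], 1e-10)    # ML mean estimates E_c(X)
    beta = n_c / (n_c + r)                                   # adaptation coefficients
    return beta[:, None] * e_c + (1.0 - beta[:, None]) * means

def pdd(mu1, mu2, weights, diag_covs):
    # Parameter-derived distance between two adapted GMMs (Eq. 6).
    diff = mu1 - mu2
    return float(np.sum(weights * np.sum(diff * diff / diag_covs, axis=1)))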
3 Experiments and Results
3.1 Evaluation Data
We used our database of Czech broadcast streams. First, 12 hours of continual recording of CT24 TV broadcasting were split into 20-minute excerpts. CT24 is a Czech all-news television channel which provides 24-hour news coverage (of the same type as BBC or CNN). The broadcasting consists mostly of news and interviews. However, nothing was removed, so e.g. commercials or intervals with music only were also included in the evaluation data. The 20-minute excerpts were divided into three mutually disjoint datasets. The data from the first set were used for training of background models, e.g. the speech and non-speech models for the speech activity detector. The second set was used as a development set for the estimation of various thresholds and for tuning of the systems. Finally, the third set was used for the tests.
3.2 Evaluation Metrics
We used the Diarization Error Rate (DER), defined by the National Institute of Standards and Technology (NIST) (http://www.itl.nist.gov/iad/mig//tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf), as the metric for evaluating the performance of our diarization systems. The DER can be expressed as the sum of three components called speaker error SPKE, speech false alarm error FA and missed speech error MISS. The SPKE reflects the amount of speech data that is attributed to a wrong speaker and is primarily affected by the performance of the speaker
clustering module. The NIST scoring tool (http://itl.nist.gov/iad/mig/tests/rt/2006-spring/code/md-eval-v21.pl) was employed to compute the DERs for our experiments. Beyond the overall performance of the diarization systems, we also evaluated the performance of the speaker segmentation approaches separately in terms of three widely used metrics: Recall, Precision and F-measure (see e.g. [8]). For the separate evaluation of the clustering techniques, we applied the weighted average frame-level cluster purity and cluster coverage [7].
3.3 Results
All experiments were carried out within the framework of the speaker diarization system. The speaker segmentation module is thus preceded by the speech activity detection module, which implicitly provides speaker segmentation by placing a non-speech mini-segment when the speech of two speakers is separated by a non-speech interval. The speech activity module is responsible for the successful detection of 32.8 % of all speaker turns in the evaluation data. These speaker turns are thus excluded (as an effect of the discarding of non-speech segments) from the evaluation of the speaker segmentation approaches, and none of the remaining turns is separated by a non-speech interval, which makes the speaker segmentation harder and is supposed to harm the performance in terms of the speaker segmentation metrics. The decision threshold for the approaches operating with a fixed value was estimated using the development data, as well as the regularization parameter used by the adaptive approaches. Both values were primarily optimized with a preference for higher Recall values. Hence, the speaker segmentation module produces a higher number of false change point detections. However, this strategy is motivated by our observation that most false change point detections are eliminated in the clustering stage, while missed change points are unrecoverable. The Recall, Precision and F-measure values for the evaluated speaker segmentation methods are summarized in Tab. 1.

Table 1. Speaker segmentation evaluation

Threshold  Window length  Recall [%]  Precision [%]  F-measure [%]
fixed      variable       66.9        11.9           20.3
fixed      fixed          61.2        11.1           18.7
adaptive   variable       62.0        13.4           22.0
adaptive   fixed          68.3        11.2           19.2
Based on the F-measure values, we can conclude that the methods using a variable-length window outperform the methods operating with a fixed-length window, and that those applying adaptive threshold estimation perform better than the methods using a fixed threshold. The combination of the adaptive threshold and the variable-length window yielded the best F-measure value. On the other hand, our
results show that the Recall is a more relevant metric than the F-measure for the evaluation of speaker segmentation within the speaker diarization framework. The method that employs the adaptive threshold and the fixed-length window, and that achieved the highest value of the Recall, also reached the best speaker diarization performance in combination with both clustering techniques. Both clustering approaches were evaluated in combination with all speaker segmentation methods. For all systems, the threshold used for the clustering stop criterion was estimated with respect to the minimization of the DER of the system on the development data. For the experiments with the PDD based approach, we used GMMs with 16 Gaussian components. The achieved results show that the PDD based approach provides better cluster purity but worse coverage than the BIC based approach. The authors in [5] also carried out experiments on broadcast news data, which led to slightly worse results for the PDD based approach in comparison with the BIC based clustering. Our results are quite in contrast with these observations. In our case, the PDD based clustering achieved better performance than the BIC based algorithm in combination with each speaker segmentation method. Tab. 2 sums up both the speaker clustering and the diarization performance. We do not show the values of FA = 1.2 % and MISS = 1.5 % in the table, as the speech activity detection module was common for all systems and these values were the same. Considering the performance of the various segmentation methods from the perspective of the speaker diarization task, we can conclude that the combination of the fixed-length window and the adaptive threshold outperforms the other combinations. This observation holds for both clustering approaches. The PDD based clustering yielded better performance for all segmentation methods. These results seem to be indicated by the higher values of cluster purity achieved by the PDD clustering technique. We attribute this to the capability of GMMs to model more complex distributions.

Table 2. Results of speaker diarization systems

Threshold  Window length  Clustering criterion  Purity [%]  Coverage [%]  SPKE [%]  DER [%]
fixed      variable       BIC                   93.3        91.7          13.3      16.0
fixed      fixed          BIC                   93.3        90.9          13.3      16.0
adaptive   variable       BIC                   93.5        89.8          14.7      17.4
adaptive   fixed          BIC                   93.5        92.0          12.2      14.9
fixed      variable       PDD                   96.0        89.3          11.8      14.5
fixed      fixed          PDD                   95.4        89.2          12.0      14.7
adaptive   variable       PDD                   95.1        89.8          11.8      14.5
adaptive   fixed          PDD                   95.4        90.2          11.3      14.0
Table 3. Results of speaker clustering methods using reference segmentation

Clustering criterion  Purity [%]  Coverage [%]  SPKE [%]  DER [%]
BIC                   93.9        92.9          12.0      12.5
PDD                   96.6        92.6          8.9       9.4
Tab. 3 presents the results of the speaker clustering methods using the reference segmentation. We do not show the values of FA = 0.0 % and MISS = 0.5 % in the table. (The used speech activity detector presumes that at most one speaker is speaking at any time in the analyzed signal. However, our test data also contained parts in which two or more speakers are speaking at once. During the processing of such parts of the audio by the speech activity detector, the speech of at least one of the speakers was missed by the system. Therefore, the MISS value of the reference segmentation is non-zero.) Based on a comparison of the results reached by the speaker diarization systems employing automatic segmentation with those applying the reference segmentation, we conclude that the PDD clustering is more sensitive to segmentation errors than the BIC based approach. This finding corresponds with an observation made in [5].
4 Conclusions
We compared several approaches to the essential technologies of a speaker diarization system, which are speaker segmentation and clustering. Four speaker segmentation methods and two clustering approaches were evaluated in the broadcast domain. Speaker diarization of broadcast streams is a very challenging task because of the large variability of the content, e.g. news, interviews, commercials, etc. The best performing speaker diarization system setup uses speaker segmentation operating with a fixed-length window and an adaptive threshold in combination with clustering based on the parameter-derived distance between MAP adapted GMMs. The comparison of the BIC based clustering and the PDD based clustering shows that the latter provided better performance as evaluated by the DER of the whole speaker diarization framework. Acknowledgments. This research was supported by the Technology Agency of the Czech Republic project no. TA01011204 and the Student Grant Scheme (SGS) at the Technical University of Liberec.
References
1. Nouza, J., Zdansky, J., Cerva, P., Kolorenc, J.: Continual On-line Monitoring of Czech Spoken Broadcast Programs. In: Proceedings of 7th International Conference on Spoken Language Processing (ICSLP 2006), Pittsburgh, pp. 1650–1653 (2006)
2. Chen, S.S., Gopalakrishnan, P.S.: Speaker, environment and channel change detection and clustering via the Bayesian information criterion. In: Proceedings 1998 DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, pp. 127–132 (1998) 3. Tranter, S.E., Reynolds, D.A.: An overview of automatic speaker diarization systems. IEEE Transactions on Audio, Speech, Language Processing 14(5), 1557–1565 (2006) 4. Meignier, S., Moraru, D., Fredouille, C., Bonastre, J.-F., Besacier, L.: Step-by-Step and integrated approaches in broadcast news speaker diarization. Computer Speech And Language (20), 303–330 (2005) 5. Ben, M., Bester, M., Bimbot, F., Gravier, G.: Speaker diarization using bottom-up clustering based on a parameter-derived distance between adapted GMMs. In: Proceedings of 8th International Conference Spoken Language Processing, Jeju Island, pp. 2329–2332 (2004) 6. Reynolds, D.A., Quatieri, T., Dunn, R.: Speaker verification using adapted Gaussian mixture models. Digital Signal Processing 10(1-3), 19–41 (2000) 7. Gauvain, J.-L., Lamel, L., Adda, G.: Partitioning and transcription of broadcast news data. In: Proceedings International Conference Spoken Language Processing, Sydney, pp. 1335–1338 (1998) 8. Lopez-Otero, P., Fernandez, L.D., Garcia-Mateo, C.: Novel strategies for reducing the false alarm rate in a speaker segmentation system. In: Proceedings of ICASSP 2010, Dallas, pp. 4970–4973 (2010)
Influence of Speakers' Emotional States on Voice Recognition Scores Piotr Staroniewicz Wroclaw University of Technology, Institute of Telecommunications, Teleinformatics and Acoustics, Wybrzeze Wyspianskiego 27, 50-370 Wroclaw, Poland [email protected]
Abstract. The paper presents the voice recognition EER (Equal Error Rate) scores for speakers' basic emotional states. The database of Polish emotional speech used during the tests includes recordings of six acted emotional states (anger, sadness, happiness, fear, disgusts, surprise) and the neutral state of 13 amateur speakers (2118 utterances). The voice recognition procedure was proceeded with MFCC features and GMM classifiers. The EER scores distinctly depend on speakers' emotional states, even for a simulated database. The mean EER results tend to be only slightly less sensitive to an emotional state, even when using speech in various kinds of emotional arousal in a training set. Keywords: Emotional speech, voice recognition.
1 Introduction
Voice recognition (also: speaker recognition) systems are a significant part of biometrics, which is a domain very important for our security. Voice recognition scores depend highly on the speaker's condition. Tiredness, illness or emotional state can change the speaker's voice features, which can lead to problems with proper speaker verification or identification. Emotion in speech is an important aspect of speech-computer communication (beside speaker recognition, also in speech recognition and synthesis). The main source of complication during research on emotional speech is that there is no strict definition of emotions and their classification rules. The literature describes them as emotion dimensions (i.e. potency, activation, etc.) or discrete concepts (i.e. anger, fear, etc.) (Fig. 1) [1,2,3]. Distinct terms which are easily understood by speakers are usually chosen in acted emotions. In order to be able to compare the results with older studies, and because they are generally considered the most common ones, it was decided to use six basic emotional states plus the neutral state. Despite the fact that there is no definite list of basic emotions, there exists a general agreement on the so-called "big six" [1,2]: anger, sadness, happiness, fear, disgust, surprise and the neutral state. The paper presents the voice recognition scores of tests carried out on the Polish emotional speech database (presented earlier in [4,5,6]).
Fig. 1. Six basic emotional states (anger, happiness, fear, surprise, disgust, sadness) and the neutral state on the three dimensions of emotional space (activation, potency, evaluation)
2 Database
Despite all the disadvantages of acted emotions in comparison to natural and elicited ones (i.e., recordings of spontaneous speech), only recording simulated (or semi-natural) emotions can guarantee control of the recordings which fulfils [2,3,5]:
- a reasonable number of subjects acting all emotions, to enable generalization over a target group,
- all subjects uttering the same verbal content, to allow comparison across emotions and speakers,
- high-quality recordings, to enable later proper speech feature extraction,
- unambiguous emotional states (only one emotion per utterance).
The amateur speakers who took part in the recordings were sex balanced. All the subjects were recorded in separate sessions to prevent them influencing each other's speaking styles. The speakers were asked to use their own everyday way of expressing emotional states, not stage acting. The decision to select simulated emotional states enabled a free choice of utterances to be recorded. The most important condition was that all selected texts should be interpretable according to emotions and should not contain an emotional bias. Everyday life speech was used, which has some important advantages:
- it is the natural form of speech under emotional arousal,
- speakers can immediately speak it from memory, with no need for memorising and reading it, which could lead to a lecturing style.
The group of speakers consisted of 13 subjects, 6 women and 7 men; each recorded 10 sentences in 7 emotional states in several repetitions. Altogether 2351 utterances were recorded, 1168 with female and 1183 with male voices. The average duration of a single utterance was around 1 second. After a preliminary validation, some doubtful emotional states and recordings with poor acoustical quality were rejected. The final number of 2118 utterances was then divided into training and testing sets for a later automatic recognition of emotional states [5]. The database was validated with subjective human recognition tests [5].
3 Voice Recognition Procedure
A classical speaker verification system is composed of two phases, a training and a testing one (Fig. 2) [7]. The first step of both the training and the testing is the speech feature extraction process [8,9]. In the front-end procedures of the applied verification system, standard speech parametrization methods were applied, namely pre-emphasis, windowing, and extraction of the MFCC (Mel Frequency Cepstral Coefficients) vectors. The calculated cepstral coefficients can then be centered, which is realized by subtraction of the cepstral mean vector (CMS), lowering the contribution of slowly varying convolutive noises. Afterwards, the dynamic information was incorporated into the feature vectors by using Δ and ΔΔ parameters, which are polynomial approximations of the first and the second derivatives.
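The described front-end can be sketched, for illustration only, with the librosa library (the paper does not specify the toolkit actually used; the sampling rate and number of coefficients below are assumptions):

import numpy as np
import librosa

def front_end(path, sr=8000, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr)
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])          # pre-emphasis
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mfcc = mfcc - mfcc.mean(axis=1, keepdims=True)      # cepstral mean subtraction (CMS)
    d1 = librosa.feature.delta(mfcc)                    # Δ parameters
    d2 = librosa.feature.delta(mfcc, order=2)           # ΔΔ parameters
    return np.vstack([mfcc, d1, d2]).T                  # feature vectors: (frames, 3 * n_mfcc)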
Fig. 2. The scheme of the speaker verification system (training phase: known speaker data → speech feature extraction → speaker modeling → speaker models; testing phase: verified speaker data and verified identity → speech feature extraction → speaker and background models → score normalization → verification decision)
The second step, the statistical modeling (Fig. 2), was done with GMM (Gaussian Mixture Models), nowadays the most successful likelihood function [7,8,9,10]. The final step of the speaker verification process is the decision, which consists of comparing the likelihood resulting from the comparison between the claimed speaker model and the incoming speech signal with a decision threshold. The claimed speaker is accepted if the likelihood is higher than the threshold, otherwise it is rejected. The tuning of the decision threshold is a troublesome problem in speaker verification because of the score variability between trials (differences in the content and duration of the speech material between speakers, variation in a speaker's voice caused by the emotional state, acoustical conditions, etc.). To avoid the above problems, score normalization techniques have been introduced. Three normalization techniques have been tested: Tnorm (Test normalization), Znorm (Zero normalization) and ZTnorm (the combination of Znorm and Tnorm).
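The verification step can be illustrated by the following sketch, which uses scikit-learn's GaussianMixture as a stand-in for the GMMs described above and shows a simple log-likelihood ratio score with Zero normalization; the model sizes and the use of a single background model are assumptions, not the exact configuration of the tested system.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(features, n_components=64):
    return GaussianMixture(n_components=n_components, covariance_type="diag",
                           max_iter=200, random_state=0).fit(features)

def llr_score(test_features, speaker_gmm, background_gmm):
    # Average log-likelihood ratio of the claimed speaker model against the background model.
    return speaker_gmm.score(test_features) - background_gmm.score(test_features)

def znorm(raw_score, impostor_scores):
    # Zero normalization: scale the score by impostor-score statistics of the claimed model.
    mu, sigma = np.mean(impostor_scores), np.std(impostor_scores) + 1e-10
    return (raw_score - mu) / sigma

# Decision: accept the identity claim if znorm(llr_score(...), impostor_scores) > threshold.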
4 Experimental Results
In the speaker verification system two basic types of errors occur, namely FAR (False Acceptance Rate) and FRR (False Rejection Rate). A false acceptance error occurs when an identity claim from an impostor is accepted, whereas a false rejection error occurs when a valid identity claim is rejected. Both FAR and FRR depend on
the threshold value which is set in the verification decision process. Such a system has many operating points, so a single performance number is usually inadequate to represent the capabilities of the system. The EER (Equal Error Rate) measure is sometimes used to summarize the performance of the system in a single figure. It corresponds to the operating point where the FAR is equal to the FRR. Two cases were examined during the tests: the first one with only neutral state samples in the training set, and the second one with samples of all emotional states in the training set. The length of the training sets for both cases remained the same. The testing set in both cases was built for each emotional state separately. In Tables 1 and 2 the EER scores for female and male voices are presented (for both cases of the training sets). In Fig. 3 the mean scores are presented.
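For illustration, the EER can be read off from lists of genuine and impostor scores by sweeping the decision threshold, as in the following sketch (the score values in the usage comment are hypothetical):

import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    # Sweep the threshold over all observed scores and return the point where FAR ≈ FRR.
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])  # false acceptance rate
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])    # false rejection rate
    i = np.argmin(np.abs(far - frr))
    return 0.5 * (far[i] + frr[i]), thresholds[i]

# e.g. eer, thr = equal_error_rate(np.array([2.1, 1.8, 2.5]), np.array([0.2, 0.9, 1.1]))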
Table 1. EER scores for female speakers

                EER, training set – neutral state   EER, training set – all states
Neutral state   0.9%                                0.9%
Anger           2.7%                                2.0%
Happiness       2.9%                                2.1%
Sadness         1.5%                                1.3%
Fear            2.6%                                2.2%
Disgust         1.6%                                1.1%
Surprise        2.9%                                2.4%
Table 2. EER scores for male speakers

                EER, training set – neutral state   EER, training set – all states
Neutral state   0.5%                                0.4%
Anger           0.8%                                0.6%
Happiness       1.2%                                1.1%
Sadness         0.4%                                0.3%
Fear            1.1%                                0.8%
Disgust         0.9%                                0.6%
Surprise        1.6%                                1.2%
Fig. 3. Mean EER scores for the neutral state and the six emotions (grey: training set – neutral state; white: training set – all states)
5 Conclusions
According to the listeners' tests carried out earlier on the same data [3,5], the listeners confused sadness and disgust with the neutral state extremely often: almost 23% for sadness and 29% for disgust. Surprise was confused with the neutral state in only 4% of the cases. These results correspond with the mean EER scores. The best results were obtained for the states most often confused with the neutral state, which were sadness and disgust, whereas the worst EER scores were obtained for surprise (Fig. 3). The best mean EER scores were obtained for the neutral state (0.7%). Quite low error rates were achieved for sadness (0.9%) and disgust (1.1%). Happiness (1.8%), surprise (2.0%), anger (1.5%) and fear (1.7%) are strong emotions (see the activation axis in Fig. 1) and caused more trouble in proper voice recognition. In almost all cases, using all emotional states (instead of the neutral state only) in the training set improved the recognition results and, what is very important, did not lower the high results for the neutral state. Still, even when using various kinds of emotional arousal in the training set, the mean EER results tend to be only slightly less sensitive to the emotional state. The presented tests were carried out with the currently most common speaker verification algorithms. The results revealed that the speakers' emotional states can influence the voice recognition scores (especially in the case of emotions with strong activation). The problem can be partially solved by using not only the neutral state in the training set. The results would probably be substantially better in the case of a fusion of the speaker recognition system with identification of the emotional state. Acknowledgments. This work was partially supported by COST Action 2102 "Crossmodal Analysis of Verbal and Non-verbal Communication" [11] and by the grant from the Polish Minister of Science and Higher Education (decision nr 115/NCOST/2008/0).
References 1. Cowie, R.: Describing the Emotional States Expressed in Speech. In: Proc. of ISCA, Belfast 2000, pp. 11–18 (2000) 2. Douglas-Cowie, E., Campbell, N., Cowie, R., Roach, P.: Emotional speech: Towards a new generation of databases. Speech Communication 40, 33–60 (2003) 3. Ververdis, D., Kotropoulos, C.: A State of the Art on Emotional Speech Databases. In: Proc. of 1st Richmedia Conf., Laussane, Switzerland, pp. 109–119 (October 2003) 4. Staroniewicz, P.: Polish emotional speech database–design. In: Proc. of 55th Open Seminar on Acoustics, Wroclaw, Poland, pp. 373–378 (2008) 5. Staroniewicz, P., Majewski, W.: Polish Emotional Speech Database – Recording and Preliminary Validation. In: Esposito, A., Vích, R. (eds.) Cross-Modal Analysis of Speech, Gestures, Gaze and Facial Expressions. LNCS (LNAI), vol. 5641, pp. 42–49. Springer, Heidelberg (2009) 6. Staroniewicz, P.: Recognition of Emotional State in Polish Speech – Comparison between Human and Automatic Efficiency. In: Fierrez, J., Ortega-Garcia, J., Esposito, A., Drygajlo, A., Faundez-Zanuy, M. (eds.) BioID MultiComm2009. LNCS, vol. 5707, pp. 33–40. Springer, Heidelberg (2009)
7. Staroniewicz, P.: Test of Robustness of GMM Speaker Verification in VoIP Telephony. Archives of Acoustics 32(4), 187–192 (2007) 8. Reynolds, D.A., Quatieri, T.F., Dunn, R.B.: Speaker verification using adapted gaussian mixture models. Digital Signal Processing 10, 19–41 (2000) 9. Bimbot, F., et al.: A tutorial on text-independent speaker verification. EURASIP Journal on Applied Signal Processing 4, 430–451 (2004) 10. Staroniewicz, P.: Speaker Recognition for VoIP Transmission Using Gaussian Mixture Models. In: Computer Recogition Systems, pp. 739–745. Springer, Heidelberg (2005) 11. COST Action 2102 Modal Analysis of Verbal and Non-verbal Communication. Memorandum of Understanding, Brussels, July 11 (2006)
Automatic Classification of Emotions in Spontaneous Speech Dávid Sztahó, Viktor Imre, and Klára Vicsi Budapest University of Technology and Economics, Department of Telecommunication and Mediainformatics, Laboratory of Speech Acoustics, H-1117 Budapest, Magyar tudósok krt. 2. {sztaho,vicsi}@tmit.bme.hu, [email protected]
Abstract. Numerous examinations are performed related to automatic emotion recognition and speech detection in the Laboratory of Speech Acoustics. This article reviews results achieved for automatic emotion recognition experiments on spontaneous speech databases on the base of the acoustical information only. Different acoustic parameters were compared for the acoustical preprocessing, and Support Vector Machines were used for the classification. In spontaneous speech, before the automatic emotion recognition, speech detection and speech segmentation are needed to segment the audio material into the unit of recognition. At present, phrase was selected as a unit of segmentation. A special method was developed on the base of Hidden Markov Models, which can process the speech detection and automatic phrase segmentation simultaneously. The developed method was tested in a noisy spontaneous telephone speech database. The emotional classification was prepared on the detected and segmented speech. Keywords: Emotion recognition, spontaneous speech, speech detection.
1 Introduction
Automatic emotion recognition is a complex problem. In order to realize it in real time, we have to solve the problem of real-time detection and segmentation of speech. The solution of this problem is of quite critical importance, because training the emotion recognizer with proper speech units cannot be realized without it. That is why we have carried out examinations related to automatic speech detection and speech segmentation in the Laboratory of Speech Acoustics at the Department of Telecommunication and Mediainformatics of the Budapest University of Technology and Economics. We recorded, segmented and annotated speech databases for this purpose. Following this segmentation procedure, emotion recognition tasks were performed. It is still an open question what kind of acoustic parameters (feature vectors) influence the recognition. In this article we measured other spectral parameters beyond the basic ones that can be found in the literature [2][3]. Although separate sentences and words are applied more frequently as recognition units in emotion recognition in the literature, we use a different recognition unit in continuous
spontaneous speech, on the basis of our earlier results [4]. It was decided that a phrase-sized segment would be used as the basic unit for emotion recognition. But, as mentioned before, in a real-time process we have to solve speech detection and phrase-unit segmentation before emotion recognition.
2 Speech Detection
In this chapter we will introduce the automatic speech detection process and the database used.
2.1 Noisy Telephone Speech Database for Speech Detection
For training and testing the speech detection system we needed a speech database which contains voice material similar to the circumstances of usage with respect to the background noise level. The applied database was made by colleagues and students working at our Laboratory, using mobile phones. The records can be divided into three different noise levels. There are records that were made in a roughly noiseless environment. Conversations encumbered with noise can be further divided into two parts: moderately noisy, where speech is well understandable, but there are different background noises (noise of cars, loud speech in the street); and strongly noisy recordings, in which speech was difficult to understand.

Table 1. Number of recordings according to the different noise level classes
Noise level  Number of records  Total duration of recordings  Total number of speakers
Low          9                  17 min                        9
Medium       16                 27 min                        16
High         6                  30 min                        6
Table 2. Applied marks during the annotation of database

Name of row  Sound type          Marking  Total number of samples
speech       speech              b        1087
             pause               u        853
noise        noise of vehicle    a        106
             voice gestures      g        241
             background speech   k        60
             noise of wind       s        227
             telephone sound     t        53
             creaking            r        267
             hooter              i        7
             hitting             h        196
             paper rattle        p        61
Fig. 1. An example for the manual segmentation
The records can be divided into two groups according to the speaking mode: the first group of recordings (medium level of noise) contains formal speech that has well-separated sentences, while in the second group (with high-level noise) the records contain spontaneous speech. The Praat software was used for the phrase-level labeling [5] during the annotation of the recordings. A prepared sample is shown in Figure 1. The label file contains two rows, one for the "speech" and another one for the "noise". In the speech row we marked the speech and pause parts and their boundaries. In the noise row we marked the different background noises and their boundaries. Table 2 contains the distinguished types of noises.
2.2 Speech Detection Process
Automatic recognition was done with the help of Hidden Markov Models. The HTK Toolkit [6] was applied, which is a recognition toolkit with a Hidden Markov Model realization. In order to automatically separate speech and noise parts, we built separate Markov models for the different noise types and for the speech (phrase) sections with the help of the database presented in Section 2.1. The features obtained by the acoustic preprocessing are given in Table 3. These basic features and their first and second derivatives went into the final training vector.

Table 3. Used acoustic features
Characteristic                               Window size               Timestep
Fundamental frequency                        75 ms                     10 ms
Intensity                                    250 ms                    10 ms
Mel-frequency cepstral coefficients (MFCC)   700, 500, 250 and 100 ms  10 ms
Acoustic HMMs were trained separately for speech, for the different noises and for silence. The numbers of states of these models were optimized. The best result was obtained when the speech model had 11 states, the noise model had 5 states, and the silence model had 3 states. The samples that were used from the database were separated into training and testing parts. Therefore all the tests were made on the same sample group. For the two groups the samples were chosen randomly, but taking variability into consideration. Thus there were extremely noisy, normal and moderately noisy samples (with a drawn-up car window, not using a hands-free telephone) both in the training and in the testing set. For evaluating the performance of the automatic segmenter, a simple index was used that made the process faster. We calculated two error matrices, which contained insertion and confusion errors. These matrices can be quite large in certain cases, for example if there are many label types. This has the unpleasant consequence that they are difficult to interpret, and it takes a long time to determine at first glance whether the recognition is good or not. To eliminate this, for the sake of easier understanding, we introduced a simple index calculation. It consists of two parts: one is the so-called speech index, and the other one is the noise index. Their weighted sum gives the total index, where the noise index is taken into account only with a quarter weight. This means that if we recognize speech much better, we may neglect the cases when noise is marked falsely, from the viewpoint of the final recognition. This solution gives a better score considering our final aim of the automatic recognition, which is speech detection. The speech index consists of two components: the inserting rate and the confusion rate.

inserting rate = number of well inserted intervals in the class / number of original intervals of the class   (1)

confusion rate = number of covered time intervals / number of original intervals of the class   (2)

speech index [%] = inserting rate ∗ confusion rate ∗ 100   (3)
It can be seen that the maximum of the confusion rate is 1, while the inserting rate can essentially be arbitrary, thus the maximum of the speech index is not 100 either. We had to introduce a cap at the value of 100 so that it would be the maximum score. If the inserting rate is bigger than 1, then we maximize the speech index. It can be seen from the evaluation of the results that this change does not make the ability to evaluate any worse. At a speech index of about 80 the recognition is acceptable.

speech index = { confusion rate ∗ inserting rate, if inserting rate ≤ 1;  100, if inserting rate > 1 }   (4)
The noise index was calculated in the same way as described above for the individual noises, and the final noise index is the average of these items. The total index is the following:
total index = (3/4) ∗ speech index + (1/4) ∗ noise index   (5)
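The index calculation of Eqs. (1)-(5) can be written directly in code; the interval matching itself (how "well inserted" and "covered" intervals are counted) is left abstract in this sketch.

def speech_index(inserting_rate, confusion_rate):
    # Speech index of Eqs. (3)-(4): capped at 100 when the inserting rate exceeds 1.
    if inserting_rate > 1.0:
        return 100.0
    return confusion_rate * inserting_rate * 100.0

def total_index(speech_idx, noise_indices):
    # Total index of Eq. (5): the noise classes enter only with a quarter weight.
    noise_idx = sum(noise_indices) / len(noise_indices)   # average over the noise classes
    return 0.75 * speech_idx + 0.25 * noise_idx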
2.3 Results
At the beginning of the test series the following classes were prepared for training: b (speech), u (silence/pause), a (noise of car), g (gesture), k (background speech), s (noise of wind), t (telephone signal), r (creaking), i (hooter). The sound of the hooter was removed immediately at the first testing, because it appeared only in one sound file, for a short time. The markings of the p (clatter of paper) and h (hitting) sounds were merged with creaking because of the acoustic similarity of these sounds. During the tests we introduced a "breathing" label, which consisted of breathing noises from the speakers on the phone. In test series 1 the acoustic parameters were the following: mel-frequency cepstral coefficients calculated with different window lengths (100, 250, 500 and 750 milliseconds), intensity and fundamental frequency values, and their first and second derivatives. We achieved the best results in the case of MFCC parameters calculated with a window size of 500 ms (Table 5). In the case of the sound files of the worst quality (recorded in a car, with a hands-free phone, marked with * in Table 5) the classification results became poor, too. The system was hardly able to recognize any speech in these files. To improve these results, we introduced a "noisy speech" class (marked with "z"). The results achieved this way, and the results achieved with the original models, can be seen in Table 5.

Table 4. Length of Markov Models assigned to classes
Number of states   Labels (classes)
11-state model     b, k
5-state model      a, g, s, t, r, u, l
Table 5. The best classification results achieved with a time window of 500 ms according to different indexes given in [%]
                   In case of original models                After the introduction of a noisy model
Record identifier  Speech index  Noise index  Total index    Speech index  Noise index  Total index
01*                0.69          63.95        16.51          46.81         63.3         50.93
02*                11.36         24.29        14.59          32.74         24.29        30.6
03                 100           33.7         83.42          100           35.58        83.89
04                 83.62         29.39        70.07          68.43         29.07        58.59
05                 100           15.34        78.84          82.64         9.8          64.43
06                 98.75         22.9         79.79          98.88         23.34        79.99
07                 67.22         33.4         58.76          76.8          33.28        65.92
08                 83.61         33.1         70.98          84.22         32.71        71.34
09                 76.31         0.46         57.35          80.06         0.58         60.19
10                 84.55         36.79        72.61          88.82         38.24        76.17
Table 6. Result of modified class grouping
State number  Classes
14            b, z, k
11            s, a, u
5             g, r
4             l, t
Table 7. Recognition results achieved with modified groups of classes given in [%]
Record identifier  Speech index  Noise index  Total index
01                 49.65         57.96        51.73
02                 16.75         28.95        19.79
03                 100           38.34        84.58
04                 87.23         17.75        69.86
05                 82.64         8.61         64.13
06                 100           29.2         82.3
07                 65.2          30.09        56.42
08                 86.91         37.24        74.49
09                 83.24         0.58         62.57
10                 88.1          36.89        75.3
Fig. 2. An example for the result of automatic classification
For the sake of further improvement of the classification, we tried to modify the models according to multiple approaches. On the basis of the assumed difficulty (complexity of the acoustic classes), the labels thought to be falsely classified, and the average length of the individual sound patterns, we generated different groups of models and assigned Markov models with different state numbers to them. The groups of classes obtained with this method, and the recognition results belonging to them, are shown in Tables 6 and 7. The best recognition resulted from four different groups of Markov model state numbers. Short Markov models were applied to short noises, like hit and crash noises, and to sounds with a low rate of spectral change, like dial tones. Longer Markov models were applied to speech and to longer noises with a higher rate of spectral change, like wind and car sounds. These changes increased the recognition performance in the case of almost every recording.
3 Emotion Recognition
3.1 Database of Emotions
For the realization of emotion recognition, spontaneous telephone speech containing continuous conversations and recordings of different talk shows were collected and annotated. The recordings consist of spontaneous speech material and improvisation plays by actors. The continuous speech was divided into phrase units, the phrases were annotated with emotions, and the most characteristic emotional parts were marked. During the annotation it turned out that the emotional classification of the phrase units was not obvious to the human listeners. The annotators marked different emotions for the same segments. In order to solve this, the persons making the annotation had to mark only the borders of the segments filled with emotion, and their classification was made by multiple listeners during a separate subjective test series using a predefined set of emotions without any scale of intensity of a given emotion (Table 8). The listeners did not have to consider the intensity of the heard emotion. Thus the subjective listening tests of the 2540 emotional segments were made by 30 persons, after which we chose 985 emotional segments from 43 speakers covering 6 emotions. Only those voice patterns were selected where there was 70% correspondence in the decisions. The emotions were the following: neutral, sad, surprised, angry/nervous, laughing during speech, and happy. The distribution among the categories is shown in Table 8.

Table 8. Number of emotional patterns selected by 30 monitoring persons
Type of emotion         Number of selected phrases (70% of correspondence in the decisions at the subjective test)
Neutral                 517
Nervous/angry           290
Happy                   39
Laughing during speech  42
Sad                     54
Surprised               43
3.2 Emotion Recognition Process
During the emotion recognition experiments we had to reduce the set of emotion categories, since not all of them had enough samples to achieve proper training. Four emotion categories were selected; according to Tables 9 and 10 they are the following: neutral, angry/nervous, happy and laughing during speech together, and sad. In order to achieve proper training, a balanced set of emotion samples was selected. The
neutral and anger categories were reduced to the size of the happy category. We applied Support Vector Machines for the automatic classification, using the freely downloadable LIBSVM toolkit [7] with the C# programming language. The aim of these experiments was to examine which acoustic parameters are necessary for the recognition of emotions. We examined the following features (a code sketch follows the list):
• average, maximum, range and standard deviation of the fundamental frequency (marking: F0)
• average, maximum, range and standard deviation of the derivative of the fundamental frequency (marking: ΔF0)
• average, maximum, range and standard deviation of the intensity (marking: EN)
• average, maximum, range and standard deviation of the derivative of the intensity (marking: ΔEN)
• average, maximum, range and standard deviation of 12 mel-frequency cepstral coefficients (marking: MFCCi)
• average, maximum, range and standard deviation of the harmonicity values (marking: HARM)
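A minimal sketch of building such a per-phrase feature vector and training the classifier; scikit-learn's SVC is used here only as a stand-in for the LIBSVM/C# toolkit used by the authors, and the kernel settings are assumptions.

import numpy as np
from sklearn.svm import SVC

def phrase_features(tracks):
    # tracks: dict of frame-level sequences (e.g. F0, ΔF0, EN, ΔEN, MFCC1..12, HARM),
    # each sampled with a 10-ms timestep over one phrase.
    stats = []
    for x in tracks.values():
        x = np.asarray(x, dtype=float)
        stats += [x.mean(), x.max(), x.max() - x.min(), x.std()]
    return np.array(stats)

# X: one row per phrase, y: emotion label (A, J, N or S)
# X = np.vstack([phrase_features(t) for t in phrase_tracks]); y = np.array(labels)
# clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
# predictions = clf.predict(X_test)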
Every characteristic was computed with a 10-ms timestep, and then we calculated the corresponding statistics for every phrase-length unit. Thus every phrase had a value for each item of the above enumeration, and all of these features were put into the feature vector related to the given phrase.
3.3 Results
Four emotion marks – angry/nervous: A, happy: J, neutral: N, sad: S – were used during the tests. Table 9 contains the results of four experiments prepared with different types of feature vectors.

Table 9. Results of automatic recognition in [%], in case of different groups of feature vectors
Feature vector: F0, ΔF0, EN, ΔEN
     A   J   N   S
A   51  18   6  15
J   15  32   9   4
N    5  17  57  13
S    4   2   3   7
Recognition result: 56.98
Feature vector: F0, ΔF0, EN, ΔEN, HARM
     A   J   N   S
A   46  17   7  12
J   13  30   8   7
N   10  16  56  12
S    6   6   4   8
Recognition result: 54.26
Table 9. (continued)
Feature vector: F0, ΔF0, EN, ΔEN, MFCCi
     A   J   N   S
A   57  12   4   5
J   13  37  12  17
N    4  13  55   5
S    1   7   4  12
Recognition result: 62.40
Feature vector: F0, ΔF0, EN, ΔEN, HARM, MFCCi
     A   J   N   S
A   61   9   4   1
J   11  41  11   6
N    3  12  56   4
S    5  16   5  13
Recognition result: 66.27

Table 10. Result of automatic emotion recognition in [%] in case of female and male voice samples, and in case of the characteristic vector giving the best result
Male speakers
     A   J   N   S
A   17   1   2   1
J    0   7   2   5
N    4   2  18   0
S    1   7   0  14
Recognition result: 69.14

Female speakers
     A   J   N   S
A   46   9   1   3
J    6  31   9   8
N    1  11  40   6
S    0   1   3   2
Recognition result: 67.28

The recognition results show that, beyond the basic prosodic parameters that can be found in the literature (fundamental frequency, intensity), the mel-frequency cepstral parameters have an important role in automatic recognition. This means that spectral
features also have an important meaning in emotion recognition. Harmonicity values can improve it even further, but since the number of samples is not yet sufficient, their effect is not proven; however, it is worth examining in the future. There is a need for continuous database collection. It is also worth looking at the results when the voice patterns are selected separately, divided into female and male patterns. The result of this can be seen in Table 10. Although the recognition shows a slight improvement, it corresponds to only a few differences in the voice sample numbers because of the insufficient number of voice samples.
4 A Quasi-real Time Emotion Recognition Process in Spontaneous Speech
During speech communication, mainly in the case of a long conversation, the emotional state of the speaking person can change continuously. To follow the mental state of the speaker, we have to separate the continuous speech into sections. In the present case we chose the phrase to be the basic unit of segmentation. In the construction of our real-time recognizer the automatic phrase-level segmentation is realized by the speech detector described in Chapter 2. The block diagram of the real-time automatic emotion recognizer is shown in Figure 3, where the speech detector-segmenter and the emotion recognizer are built together. The acoustic processing of the two independent recognizers is separated in the figure, because the system uses two different methods. However, we plan to use only one module for this in the future.
[Figure 3 components: the audio signal undergoes acoustic preprocessing; Hidden Markov models trained on a database of phone-line recordings perform speech/noise detection and clause-unit segmentation; for each speech segment, the F0i, ΔF0i, Ei, ΔEi and MFCCi features are normalized and assembled into a multidimensional feature vector; Support Vector Machines trained on a database of emotional recordings perform the emotion category classification.]
Fig. 3. Block diagram of the automatic emotion recognizer in case of spontaneous speech
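One possible way to wire the two stages together is sketched below (not the authors' implementation: it assumes that phrase segments are already delivered by the HMM-based detector, reuses the phrase_feature_vector helper from the earlier sketch, and uses scikit-learn's SVC (a wrapper around LIBSVM [7]) in place of the original tools; the file names are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical training material: one phrase-level feature vector per row and one label
# per phrase (A = angry/nervous, J = happy, N = neutral, S = sad).
X_train = np.load("phrase_features.npy")        # assumed file, shape (n_phrases, n_features)
y_train = np.load("phrase_labels.npy")          # assumed file, shape (n_phrases,)

scaler = StandardScaler().fit(X_train)          # the normalization block of Fig. 3
classifier = SVC(kernel="rbf").fit(scaler.transform(X_train), y_train)

def classify_phrase(tracks):
    """Classify one phrase delivered by the HMM-based speech detector/segmenter.

    tracks: dict with the phrase's 10-ms feature tracks
    ('f0', 'energy', 'mfcc', 'harmonicity'), as in the earlier sketch.
    """
    vector = phrase_feature_vector(tracks["f0"], tracks["energy"],
                                   tracks["mfcc"], tracks["harmonicity"])
    return classifier.predict(scaler.transform(vector.reshape(1, -1)))[0]
```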
5 Conclusion

In this article a method was presented for the automatic emotion recognition task that is able to recognize emotions in real time and in noisy environments on the basis of prosodic and spectral parameters of speech. We developed a process based on Hidden Markov Models that segments the audio signal into phrase-sized speech parts and acoustic environment noise, thereby solving speech/non-speech detection and phrase-level segmentation. The evaluation of the speech detection results shows that this method can be applied to spontaneous speech: the achieved speech index can reach 80% for recordings that are not too noisy, which is an acceptable performance, as can be seen in Figure 2. The speech detection and phrase segmentation process is followed by the emotion recognition process. When trained with subjectively monitored voice samples of four emotions, the automatic recognizer based on Support Vector Machines can classify emotional voice samples of phrase-length units with 66% correctness.

Acknowledgement. This research was prepared in the framework of the Jedlik project No. OM-00102/2007 named "TELEAUTO", the TÁMOP-4.2.2-08/1/KMR-2008-0007 project and the TÁMOP 4.2.2-08/1-2008-0009 project.
References
1. Tóth, S.L., Sztahó, D., Vicsi, K.: Speech Emotion Perception by Human and Machine. In: Proceedings of COST Action 2102 International Conference, Patras, Greece, October 29–31 (2007); Revised Papers in Verbal and Nonverbal Features of Human-Human and Human-Machine Interaction. LNCS, vol. 5042, pp. 213–224. Springer, Heidelberg (2008)
2. Hozjan, V., Kacic, Z.: A rule-based emotion-dependent feature extraction method for emotion analysis from speech. The Journal of the Acoustical Society of America 119(5), 3109–3120 (2006)
3. Navas, E., Hernáez, I., Luengo, I.: An Objective and Subjective Study of the Role of Semantics and Prosodic Features in Building Corpora for Emotional TTS. IEEE Transactions on Audio, Speech, and Language Processing 14(4), 1117–1127 (2006)
4. Vicsi, K., Sztahó, D.: Ügyfél érzelmi állapotának detektálása telefonos ügyfélszolgálati dialógusban [Detection of the customer's emotional state in telephone customer-service dialogues]. In: VI. Magyar Számítógépes Nyelvészeti Konferencia, Szeged, pp. 217–225 (2009)
5. Boersma, P., Weenink, D.: Praat: doing phonetics by computer (Computer program), http://www.praat.org
6. The Hidden Markov Model Toolkit (HTK), http://htk.eng.cam.ac.uk/
7. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines (2001). Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Modification of the Glottal Voice Characteristics Based on Changing the Maximum-Phase Speech Component Martin Vondra and Robert Vích Institute of Photonics and Electronics, Academy of Sciences of the Czech Republic, Chaberska 57, CZ 18251 Prague 8, Czech Republic {vondra,vich}@ufe.cz
Abstract. Voice characteristics are influenced especially by the vocal cords and by the vocal tract. Characteristics known as voice type (normal, breathy, tense, falsetto, etc.) are attributed to the vocal cords. Emotion influences, among other things, the tonus of the muscles and thus also the behavior of the vocal cords. Previous research confirms a strong dependence of emotional speech on the glottal flow characteristics. There are several possible ways of obtaining the glottal flow signal from speech. One of them is the decomposition of speech, using the complex cepstrum, into maximum- and minimum-phase components. In this approach the maximum-phase component is considered to be the open phase of the glottal flow signal. In this contribution we present experiments with the modification of the maximum-phase speech signal component with the aim of obtaining synthetic emotional speech.
1 Introduction

Current research in speech synthesis focuses especially on the ability to change the expressivity of the produced speech. In the case of unit concatenation synthesis, where synthetic speech achieves an almost natural sound, the only option is to construct several new speech corpora, one for each expressive style [1]. This is a very time- and resource-consuming procedure and greatly increases the memory demands of the speech synthesis system. From this perspective, it would be better if we could directly influence or modify the individual characteristics of the speech related to the expressive style of speaking. This can be achieved with a suitable speech model that allows the individual parameters to be influenced. The basic speech production model is based on the source-filter theory (Fig. 1). In the simplest case the source, or excitation, is represented by Dirac unit impulses with a period equal to the fundamental period of speech for voiced sounds, and by white noise for unvoiced speech. The vocal tract model is represented by a time-varying digital filter, which performs the convolution of the excitation with its impulse response. The vocal tract model can be based on linear prediction [2], on an approximation of the inverse cepstral transformation [3], etc. The main speech characteristics related to expressivity are the prosody (pitch, intensity and timing variation) and the voice quality (the speech timbre). Voice quality is determined both by the vocal tract and by the vocal cord oscillation. For
[Figure 1 components: an impulse generator driven by the pitch period (voiced) and a white-noise generator (unvoiced) excite a vocal tract model controlled by the vocal tract parameters, producing the synthetic speech.]
Fig. 1. Source-filter speech model
the given speaker the vocal tract is primarily responsible for creating the resonances realizing the corresponding speech sounds. The voice quality characteristics are given mainly by the excitation of the vocal tract – by the glottal signal. The vocal cords influence the speech in such a way that can be described as modal, breathy, pressed or lax phonation. There are several papers that confirm that source speech parameters are influenced by the expressive content of speech [4, 5]. If we want to achieve a modification of speech based on the source-filter model we must perform speech deconvolution into the source and vocal tract components at first. There are several possibilities for doing this. If we have an estimation of the vocal tract model parameters, we can use filtering by the inverse vocal tract filter [6]. Research on Zeros of Z-Transform (ZZT) [7] of the speech frames proved that deconvolution into the source and filter components can be done by separation of ZZT into zeros inside and outside the unit circle in the z-plane. The same can be done using the separation into the anticipative and causal complex cepstrum parts [8]. Our approach is based on speech deconvolution using the complex cepstrum, which allows to get the glottal signal from the speech. However, practical experiments show that the cepstral deconvolution does not lead to the true glottal signal in all voiced speech frames. Our solution of this issue lies in the estimation of the parameters of the glottal signal from the source magnitude spectrum obtained by deconvolution, where the basic glottal parameters are usually maintained. These parameters are the glottal formant and its bandwidth. Based on these parameters we can design a linear anticausal IIR model of the glottal signal [9]. First the deconvolution based on the complex speech cepstrum will be introduced. We describe also several methods of complex cepstrum computation. Further some examples of reliable and poor estimation of the glottal signal will be shown. In the following part the design of a 2nd order anticausal IIR glottal model will be described. Then the speech deconvolution can be performed with the inverse model of the glottal signal, which leads to the vocal tract impulse response. If we save this vocal tract impulse response and change the glottal model parameters, we achieve a modified speech signal after convolution of the saved vocal tract and the modified glottal model impulse responses.
2 Complex Cepstrum Deconvolution of Speech There are several possibilities for estimating the glottal signal from speech. A list of the most common methods is given in [6]. The majority of methods include inverse filtering by the vocal tract model. A new and attractive method, which can lead to an
estimation of the glottal signal, is based on ZZT [7], which is a method for complex cepstrum computation of exponential sequences. In this method the glottal signal is estimated from the maximum-phase component of the speech frame. This is done by computing the roots of the speech frame Z-transform and by separating the roots into zeros inside and outside the unit circle in the z-plane. From the properties of the complex cepstrum and the ZZT, the same can also be performed by separation into the anticipative and causal parts of the complex cepstrum.

2.1 Methods of Complex Cepstrum Calculation

The complex cepstrum x̂[n] of the windowed speech frame x[n], n = 0, ..., N – 1, where N is the speech frame length, can be computed by several methods.

2.1.1 Calculation Using the Complex Logarithm

The complex cepstrum x̂[n] is given by the inverse Fast Fourier Transform (FFT) of X̂[k],
\hat{x}[n] = \frac{1}{M} \sum_{k=0}^{M-1} \hat{X}[k]\, e^{\,j 2\pi k n / M},    (1)

where n = −M/2 + 1, ..., 0, ..., M/2, and M is the dimension of the applied FFT algorithm; to minimize cepstral aliasing, M > N. Here

\hat{X}[k] = \ln X[k] = \ln |X[k]| + j \arg X[k]    (2)

is the logarithmic complex spectrum of the speech frame, where the real part is given by the logarithm of the spectrum magnitude and the imaginary part is given by the phase spectrum in radians. The phase is an ambiguous function with an uncertainty of 2π; for this reason the phase unwrapping must be performed before the inverse FFT (1). The spectrum is efficiently computed by the FFT

X[k] = \sum_{n=0}^{M-1} x[n]\, e^{-j 2\pi k n / M}.    (3)
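A compact numpy realization of (1)–(3) might look as follows (a sketch, not the authors' code; numpy's unwrap performs the phase unwrapping, and the optional removal of the linear-phase term mirrors what toolbox implementations of the complex cepstrum typically do):

```python
import numpy as np

def complex_cepstrum_log(x, M=2048):
    """Complex cepstrum of a windowed frame x via the complex logarithm, Eqs. (1)-(3)."""
    X = np.fft.fft(x, M)                           # Eq. (3), zero-padded to M points
    phase = np.unwrap(np.angle(X))                 # phase unwrapping
    k = np.arange(M)
    r = round(phase[M // 2] / np.pi)               # estimate of the linear-phase (delay) term
    phase = phase - np.pi * r * k / (M // 2)       # remove it so the cepstrum decays properly
    X_hat = np.log(np.abs(X)) + 1j * phase         # Eq. (2)
    return np.real(np.fft.ifft(X_hat))             # Eq. (1); indices M/2+1..M-1 hold the
                                                   # anticipative (negative-quefrency) part
```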
The phase unwrapping can be rather difficult, especially in cases where the speech frame Z-transform has zeros close to the unit circle in the z-plane, which cause sudden phase changes [10]. For this reason we have also tried other methods for complex cepstrum computation.

2.1.2 Using the Logarithmic Derivative

\hat{x}[n] = -\frac{1}{j n M} \sum_{k=0}^{M-1} \frac{X'[k]}{X[k]}\, e^{\,j 2\pi k n / M}, \quad \text{for } n = 1, \ldots, M-1,    (4)

where

X'[k] = -j \sum_{n=0}^{M-1} n\, x[n]\, e^{-j 2\pi k n / M}    (5)
is the logarithmic derivative of the speech frame spectrum. The first cepstral coefficient can be computed as the mean value of the logarithmic speech magnitude spectrum,

\hat{x}[0] = \frac{1}{M} \sum_{k=0}^{M-1} \log |X[k]|.    (6)

The advantage of this method is that it does not need phase unwrapping. However, if the speech frame Z-transform has zeros close to the unit circle in the z-plane and the dimension of the FFT is low, this method gives wrong, useless results. Moreover, formula (4), adopted from [11], p. 793, is not appropriate for practical implementation. If we substitute the fraction X'[k]/X[k] by X_d[k], then a practical implementation of (4) can be based on the following formulae:

\hat{x}[n] = -\frac{1}{j n}\, \mathrm{IFFT}\big[X_d[k]\big], \quad \text{for } n = 1, \ldots, M/2,
\hat{x}[n] = \frac{1}{j (M-n)}\, \mathrm{IFFT}\big[X_d[k]\big], \quad \text{for } n = 1 + M/2, \ldots, M-1.    (7)
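Equations (5)–(7) can be sketched in the same way (again illustrative, not the authors' code; the IFFT is computed once and the two branches of (7) only differ in how the wrapped indices are rescaled):

```python
def complex_cepstrum_logderiv(x, M=2048):
    """Complex cepstrum via the logarithmic derivative, Eqs. (4)-(7); no unwrapping needed."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    X = np.fft.fft(x, M)
    Xd = np.fft.fft(-1j * n * x, M) / X            # X'[k] / X[k], with X'[k] from Eq. (5)
    c = np.fft.ifft(Xd)                            # (1/M) * sum_k X_d[k] e^{j 2 pi k n / M}
    xhat = np.zeros(M, dtype=complex)
    pos = np.arange(1, M // 2 + 1)                 # n = 1 .. M/2
    neg = np.arange(M // 2 + 1, M)                 # n = M/2+1 .. M-1 (negative quefrencies)
    xhat[pos] = -c[pos] / (1j * pos)               # first line of Eq. (7)
    xhat[neg] = c[neg] / (1j * (M - neg))          # second line of Eq. (7)
    xhat[0] = np.mean(np.log(np.abs(X)))           # Eq. (6)
    return np.real(xhat)
```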
2.1.3 Using ZZT

The Z-transform of the windowed speech frame x[n] can be written as

X(z) = \sum_{n=0}^{N-1} x[n]\, z^{-n} = z^{-N+1} \sum_{n=0}^{N-1} x[n]\, z^{N-1-n} = x[0]\, z^{-N+1} \prod_{m=1}^{N-1} (z - Z_m).    (8)
Setting X(z) = 0 leads to the solution of a high-degree polynomial equation. A numerical method must be used; we utilize the Matlab roots function, which is based on the eigenvalues of the associated companion matrix. The zeros Z_m of (8) can lie inside or outside the unit circle in the z-plane. If we denote the zeros inside the unit circle as a_k and the zeros outside the unit circle as b_k, we can compute the complex cepstrum based on the relationship [11]

\hat{x}[n] = \log |A|, \qquad n = 0,
\hat{x}[n] = -\sum_{k=1}^{M_i} \frac{a_k^{\,n}}{n}, \qquad n > 0,    (9)
\hat{x}[n] = \sum_{k=1}^{M_o} \frac{b_k^{-n}}{n}, \qquad n < 0,

where A is a real constant, M_i is the number of zeros inside the unit circle in the z-plane and M_o is the number of zeros outside the unit circle in the z-plane. The complex cepstrum computed by this technique is called the root cepstrum. The disadvantage of this method is its relatively high computational requirement compared with the previous methods, especially for higher sampling frequencies, where the speech frame has a relatively high number of samples.
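A sketch of the root-cepstrum computation along the lines of (8)–(9) is given below (illustrative only; np.roots solves the polynomial via the companion-matrix eigenvalues, like the Matlab roots function mentioned above, and the anticipative coefficients are computed from the reciprocals of the outside zeros so that the series converges):

```python
def root_cepstrum(x, n_ceps=100):
    """Root (complex) cepstrum from the zeros of the frame's Z-transform, Eqs. (8)-(9)."""
    x = np.trim_zeros(np.asarray(x, dtype=float))       # x[0] must be non-zero in Eq. (8)
    zeros = np.roots(x)                                  # zeros Z_m of X(z)
    a = zeros[np.abs(zeros) < 1.0]                       # zeros inside the unit circle
    b = zeros[np.abs(zeros) >= 1.0]                      # zeros outside the unit circle
    n = np.arange(1, n_ceps)
    causal = np.array([-np.sum(a ** k).real / k for k in n])                # x^[n], n > 0
    anticipative = np.array([-np.sum((1.0 / b) ** k).real / k for k in n])  # x^[-n], n > 0
    x0 = np.log(np.abs(x[0]) * np.prod(np.abs(b)))       # x^[0] = log|A|
    return anticipative, x0, causal
```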
2.2 Complex Cepstrum Speech Deconvolution
The steps of complex cepstrum speech deconvolution are shown in Figs. 2 and 3. The first step is the speech segmentation. The frames must be chosen pitch-synchronously, with a length of two pitch periods and an overlap of one pitch period. It is important to have the Glottal Closure Instant (GCI) in the middle of the frame, as shown in [7]. The frame weighting is also of high importance. The Hamming window, which is usually used in speech analysis, is not the best choice; in [8] a new parameterized window is proposed, with the Hann and the Blackman windows as particular cases, and the optimum parameter for deconvolution is given. The segmentation and the chosen window cause the magnitude spectrum to be very smooth – the periodicity of the voiced excitation is completely removed and the magnitude spectrum approximates the spectral envelope. The second step is the complex cepstrum computation, for which we can use (1), (7) or (9). Matlab has the function cceps(), which realizes (1) or (9). If we use (1), phase unwrapping must be performed, and the resulting complex cepstrum is very sensitive to the phase unwrapping algorithm used. In any case it is better to use a sufficiently high dimension for the FFT algorithm – the phase unwrapping is then less ambiguous (for an 8 kHz sampling frequency, M = 2048 FFT points is usually sufficient). If we compute the complex cepstrum using the logarithmic derivative (7), the results are quite inconsistent – their sensitivity to the FFT dimension is even higher than in the computation using phase unwrapping. The most reliable method for computing the complex cepstrum seems to be the ZZT. A comparison of speech deconvolution using the complex cepstrum computed with the FFT and with the ZZT is shown in Fig. 3.
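As an illustration of these steps (a sketch under assumptions, not the authors' implementation: it presumes a pitch-synchronous two-period frame with the GCI at its centre, uses a Hann window as a stand-in for the parameterized window of [8], and reuses the complex_cepstrum_log routine sketched in Sect. 2.1.1):

```python
import numpy as np
from scipy.signal.windows import hann

def deconvolve_frame(frame, M=2048):
    """Split a windowed two-period frame into maximum- and minimum-phase components."""
    xhat = complex_cepstrum_log(frame * hann(len(frame)), M)   # wrapped complex cepstrum
    causal = np.zeros(M)
    anticausal = np.zeros(M)
    causal[1:M // 2 + 1] = xhat[1:M // 2 + 1]      # n > 0 -> minimum-phase (vocal tract) part
    anticausal[M // 2 + 1:] = xhat[M // 2 + 1:]    # n < 0 -> maximum-phase (glottal) part
    anticausal[0] = xhat[0]                        # gain term assigned to one side by convention
    h_min = np.real(np.fft.ifft(np.exp(np.fft.fft(causal))))      # minimum-phase impulse response
    h_max = np.real(np.fft.ifft(np.exp(np.fft.fft(anticausal))))  # maximum-phase impulse response
    return h_max, h_min                            # h_max is wrapped: its support lies at the array tail
```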
Fig. 2. Signal, spectra and complex cepstrum of the stationary part of the vowel a
Fig. 3. Anticipative and causal cepstra computed using FFT and ZZT and the corresponding spectra and impulse responses for the vowel a
The last step in complex cepstrum deconvolution is the inverse cepstral transformation separately for the anticipative and for the causal parts of the complex cepstrum. This leads to the anticipative (maximum-phase) or causal (minimum-phase) spectrum and further to the anticipative (maximum-phase) or causal (minimum-phase) impulse responses. From Fig. 3 it is evident that the maximum-phase part of the speech can be considered as the glottal signal, which is proved in [7]. The reconstruction of the speech frame can be performed by convolution of the anticipative and causal impulse responses [12]. This reconstructed speech is of mixed phase and has higher quality than the classical parametric speech models, which employ the Dirac unit impulse excitation and the minimum-phase vocal tract model based e.g. on linear prediction or on Padé approximation. 2.3 Problematic Frames in Complex Cepstral Deconvolution
First we used directly the maximum-phase impulse response for the modification of the glottal signal [13], but we observed that for some voiced speech segments after the complex cepstral speech deconvolution the maximum-phase speech components are not similar to the typical glottal signal, see Fig. 4. This occurs more often for a higher
sampling frequency than 8 kHz. The anticipative impulse response in Fig. 4 looks more like an amplitude modulation of the glottal signal. This is probably caused by the noise component of the excitation in natural speech, which may have a negative impact on the separability into the minimum- and maximum-phase speech components. It might be interesting to perform a harmonic/noise decomposition [14] before the complex cepstrum deconvolution, which would then be applied only to the harmonic component.
3 Design of 2nd order Anticausal IIR Model of the Glottal Signal Our first experiment with the modification of the glottal signal [13] was based on extension or shortening of the maximum-phase impulse response. For the speech frame analyzed in Fig. 3 this is appropriate and we can achieve a modification of the open quotient of the glottal signal. However, this technique cannot be used for the speech frame analyzed in Fig. 4. In this case, when we perform the convolution of the original maximum- and minimum-phase impulse responses, we obtain the original speech frame, but after the extension or shortening of the maximum-phase component, the convolution produces a signal that differs from a typical speech signal.
Fig. 4. Anticipative and causal cepstra, the corresponding spectra and impulse responses for the problematic voiced speech frame estimated using ZZT
We decided to solve this problem by designing a model of the glottal signal, whose parameters can be reliably estimated from the anticipative (maximum-phase) magnitude spectrum. The first peak in the maximum-phase magnitude spectrum near the zero frequency refers to the glottal formant. The glottal formant is not caused by a resonance as a classical vocal tract formant, but it is a property of the glottal impulse. This glottal formant is usually visible also in the problematic frames – see Fig. 3 and 4 – and can be estimated by peak picking. In our experience, the formant is more visible when the spectrum is computed from the anticipative root cepstrum. The glottal formant is one of the main parameters of the glottal signal and it is coupled with the open quotient of the glottal impulse [9]. The model of the glottis can be represented by two complex conjugate poles in the z-plane at a frequency, which is equal to the frequency of the glottal formant. This is a property of a 2nd order IIR filter. The magnitude of the pole pair can be estimated from the glottal formant bandwidth. For agreement with the glottal signal phase properties this model must have poles outside the unit circle. Such a filter is unstable, but this model can be designed as anticausal, which means that the time response of this filter is calculated in the reverse time direction. The response of such a filter can be computed as a time reversed response of a causal filter with poles in conjugate reciprocal position to the original poles. The frequency and impulse responses of the glottal model for the speech frame in Fig. 4 are depicted in Fig. 5. 3.1 Speech Deconvolution with the 2nd Order Anticausal Model of the Glottal Signal
If we want to use the described 2nd order anticausal glottal model for modification of the speech signal, we must integrate this model into the speech deconvolution. The deconvolution can be performed by filtering the windowed speech frame by the inverse model of the glottal signal, which is a simple FIR filter with zeros in the same place, where the glottal model has the poles. The resulting signal can be considered as the vocal tract impulse response. The described deconvolution is schematically shown in Fig. 6.
Fig. 5. Frequency and impulse responses of the 2nd order anticausal glottal model together with the pole plot in the z-plane
Fig. 6. Speech deconvolution with the inverse glottal model
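A sketch of how such a model and its inverse filter might be realized (illustrative only; the mapping from bandwidth to pole radius, rho = exp(-pi*B/fs), is a standard resonator design choice rather than a formula quoted from the paper):

```python
import numpy as np
from scipy.signal import lfilter

def glottal_model(fg, bw, fs, n_samples=200):
    """2nd-order anticausal resonator at the glottal formant fg (Hz) with bandwidth bw (Hz)."""
    rho = np.exp(-np.pi * bw / fs)                     # pole radius of the causal prototype
    theta = 2.0 * np.pi * fg / fs                      # pole angle
    a = np.array([1.0, -2.0 * rho * np.cos(theta), rho ** 2])   # denominator of the causal filter
    impulse = np.zeros(n_samples)
    impulse[0] = 1.0
    h_causal = lfilter([1.0], a, impulse)              # stable causal resonator response
    return h_causal[::-1], a                           # time reversal gives the anticausal response

def inverse_glottal_filter(frame, a):
    """Anticausal FIR inverse of the glottal model (zeros where the model has its poles)."""
    return lfilter(a, [1.0], frame[::-1])[::-1]        # apply the FIR backwards in time
```

The output of inverse_glottal_filter applied to the windowed speech frame would then serve as the vocal tract impulse response, as in Fig. 6.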
4 Glottal Signal Modification

The glottal signal modification can be performed by changing the estimated parameters – the glottal formant and its bandwidth. These parameters are obtained from the maximum-phase magnitude spectrum, which is estimated by the complex cepstrum deconvolution. The simplest modification is the increase or decrease of the open quotient of the glottal signal. According to [15] the open quotient is inversely proportional to the glottal formant. The bandwidth of the glottal formant is coupled with the asymmetry coefficient of the glottal flow. Fig. 7 shows the influence of varying the glottal formant and its bandwidth by the same multiplication factor. This results in a modification of the open quotient only; the asymmetry quotient is the same in all cases. Specifically, Fg_a = 3/2 Fg_o, Bfg_a = 3/2 Bfg_o and Fg_b = 2/3 Fg_o, Bfg_b = 2/3 Bfg_o, where Fg_o is the frequency of the original
Fig. 7. Example of the responses for 2nd order anticausal model of the glottal signal for the modification of the glottal formant and its bandwidth
glottal formant, Bfg_o is its bandwidth, Fg_a and Fg_b are the modified frequencies of the glottal formant, and Bfg_a and Bfg_b are the corresponding modified bandwidths. The modified speech frames, obtained by convolving the modified glottal signal with the impulse response of the vocal tract (itself obtained by deconvolution using the inverse glottal model with the original glottal formant and bandwidth), are shown in Fig. 8. Finally, Fig. 9 shows the original speech and both speech conversions, in which the modifications of the glottal signal from the previous example were used.
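Continuing the earlier sketch (purely illustrative: the sampling frequency, the estimated Fg_o and Bfg_o values and the analysis frame below are hypothetical placeholders, not values taken from the paper), the modification illustrated in Figs. 7–9 amounts to:

```python
fs = 8000                                       # assumed sampling frequency
fg_orig, bw_orig = 250.0, 150.0                 # hypothetical estimated Fg_o and Bfg_o in Hz
windowed_frame = np.random.randn(320)           # stand-in for a two-pitch-period analysis frame

g_orig, a_orig = glottal_model(fg_orig, bw_orig, fs)
vocal_tract = inverse_glottal_filter(windowed_frame, a_orig)   # vocal tract impulse response

for factor in (1.5, 2.0 / 3.0):                 # Fg_a, Bfg_a = 3/2 * ...;  Fg_b, Bfg_b = 2/3 * ...
    g_mod, _ = glottal_model(factor * fg_orig, factor * bw_orig, fs)
    modified_frame = np.convolve(g_mod, vocal_tract)            # modified speech frame
```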
Fig. 8. Example of the modified speech impulse responses for the cases in Fig. 7
Fig. 9. Example of the modified speech using the change of the glottal model parameters (see Fig. 7)
5 Conclusion

In this contribution our experience with complex cepstrum speech deconvolution and a proposal for glottal signal modification based on a 2nd-order anticausal glottal model were described. Complex cepstral speech deconvolution is sensitive, above all, to the speech segmentation: the speech frame must be two pitch periods long with the GCI in the middle of the frame, and a proper weighting window must be used. The method of complex cepstrum estimation, or a robust phase unwrapping algorithm, is also of high importance. Even when all of these criteria are fulfilled, the complex cepstral deconvolution does not give adequate results, especially for sampling frequencies higher than 8 kHz. This is probably caused by the portion of noise present in the higher frequency band of the source speech signal. For this reason we developed a 2nd-order anticausal model of the glottal signal, whose parameters can be reliably estimated by the complex cepstral deconvolution even for problematic speech frames. The proposed 2nd-order anticausal model of the glottal signal has two basic parameters – the frequency of the glottal formant and its bandwidth. Cepstral deconvolution is then used only for the estimation of these parameters. A modification of the glottal signal is achieved by filtering the original windowed speech frame with the inverse model of the glottal signal, which leads to the impulse response of the vocal tract. Then the parameters of the glottal model are changed and the modified speech frame is obtained by convolution of the impulse response of the vocal tract with the new glottal model response. Preliminary listening tests showed that an increase of the glottal formant and of its bandwidth (i.e. a decrease of the open quotient of the glottal signal) leads to a tense-sounding voice. Conversely, a decrease of the glottal formant and of its bandwidth (i.e. an increase of the open quotient) leads to a lax-sounding voice. It is clear, however, that for a change of the emotional speech style the conversion of the vocal tract model and of the prosody must also be used. The modification of the glottal signal alone is not sufficient for the generation of emotional speech, but it can reinforce a speech style conveyed mainly by the prosody.

Acknowledgments. This paper has been supported within the framework of COST 2102 by the Ministry of Education, Youth and Sports of the Czech Republic, project number OC08010, and by research project 102/09/0989 of the Grant Agency of the Czech Republic.
References
1. Iida, A., Campbell, N., Higuchi, F., Yasumura, M.: A corpus-based speech synthesis system with emotions. Speech Communication 40, 161–187 (2003)
2. Vích, R.: Pitch Synchronous Linear Predictive Czech and Slovak Text-to-Speech Synthesis. In: Proc. of the 15th International Congress on Acoustics, ICA 1995, Trondheim, Norway, vol. III, pp. 181–184 (1995)
3. Vích, R.: Cepstral Speech Model, Padé Approximation, Excitation and Gain Matching in Cepstral Speech Synthesis. In: Jan, J. (ed.) BIOSIGNAL 2000, VUTIUM, Brno, pp. 77–82 (2000)
4. Gobl, C., Ní Chasaide, A.: The role of voice quality in communicating emotion, mood and attitude. Speech Communication 40, 189–212 (2003)
5. Airas, M., Alku, P.: Emotions in Vowel Segments of Continuous Speech: Analysis of the Glottal Flow Using the Normalized Amplitude Quotient. Phonetica 63, 26–46 (2006)
6. Walker, J., Murphy, P.: A Review of Glottal Waveform Analysis. In: Stylianou, Y., Faundez-Zanuy, M., Esposito, A. (eds.) COST 277. LNCS, vol. 4391, pp. 1–21. Springer, Heidelberg (2007)
7. Bozkurt, B.: Zeros of the z-transform (ZZT) representation and chirp group delay processing for the analysis of source and filter characteristics of speech signals. Ph.D. Thesis, Faculté Polytechnique de Mons, Belgium (2005)
8. Drugman, T., Bozkurt, B., Dutoit, T.: Complex Cepstrum-based Decomposition of Speech for Glottal Source Estimation. In: INTERSPEECH 2009, Brighton, U.K., pp. 116–119 (2009)
9. Doval, B., d'Alessandro, C., Henrich, N.: The voice source as a causal/anticausal linear filter. In: Proc. of ISCA Tutorial and Research Workshop on Voice Quality (VOQUAL), Geneva, pp. 15–19 (2003)
10. Tribolet, J.: A new phase unwrapping algorithm. IEEE Transactions on Acoustics, Speech and Signal Processing 25(2), 170–177 (1977)
11. Oppenheim, A.V., Schafer, R.W.: Discrete-Time Signal Processing, pp. 768–825. Prentice Hall, Englewood Cliffs (1989)
12. Vích, R.: Nichtkausales Cepstrales Sprachmodell [Non-causal cepstral speech model]. In: Proc. 20th Electronic Speech Processing Conference – ESSV 2009, Dresden, Germany, pp. 107–114 (2009)
13. Vondra, M., Vích, R.: Speech Conversion Using a Mixed-phase Cepstral Vocoder. In: Proc. of 21st Electronic Speech Processing Conference – ESSV 2010, Berlin, Germany, pp. 112–118 (2010)
14. Stylianou, Y.: Decomposition of speech signals into a deterministic and a stochastic part. In: Proc. of the Fourth International Conference on Spoken Language Processing, ICSLP 1996, Philadelphia, pp. 1213–1216 (1996)
15. Doval, B., d'Alessandro, C., Henrich, N.: The spectrum of glottal flow models, http://rs2007.limsi.fr/index.php/PS:Page_2
On Speech and Gestures Synchrony Anna Esposito1,2 and Antonietta M. Esposito3 1
Dep. of Psychology, Second University of Naples, Via Vivaldi 43, 81100 Caserta, Italy 2 IIASS, Via Pellegrino 19, 84019, Vietri sul Mare, SA, Italy 3 Istituto Nazionale di Geofisica e Vulcanologia, sezione di Napoli Osservatorio Vesuviano, Napoli, Italy [email protected], [email protected]
Abstract. Previous research works proved the existence of synchronization between speech and holds in adults and in 9 year old children with a rich linguistic vocabulary and advanced language skills. When and how does this synchrony develop during child language acquisition? Could it be observed also in children younger than 9? The present work aims to answer the above questions reporting on the analysis of narrations produced by three different age groups of Italian children (9, 5 and 3 year olds). Measurements are provided on the amount of synchronization between speech pauses and holds in the three different groups, as a function of the duration of the narrations. The results show that, as far as the reported data concerns, in children, as in adults, holds and speech pauses are to a certain extent synchronized and play similar functions, suggesting that they may be considered as a multi-determined phenomenon exploited by the speaker under the guidance of a unified planning process to satisfy a communicative intention. In addition, considering the role that speech pauses play in communication, we speculate on the possibility that holds may serve to similar purposes supporting the hypothesis that gestures as speech are an expressive resource that can take on different functions depending on the communicative demand. While speech pauses are likely to play the role of signalling mental activation processes aimed at replacing the “old spoken content” of the communicative plan with a new one, holds may signal mental activation processes aimed at replacing the “old visible bodily action” with new ones reflecting the representational and/or propositional contribution of gestures to the new communicative plan. Keywords: Speech pauses, holds, synchrony, child narrations.
1 Introduction Humans communicate through a gestalt of actions which involve much more than the speech production system. Facial expressions, head, body and arm movements (grouped under the name of gestures) all potentially provide information to the communicative act, supporting (through different channels) the speaker’s communicative goal and also allowing the speaker to add a variety of other information to his/her messages including (but not limited to) his/her psychological
state, attitude, etc. The complexity of the communicative act expression should be taken into account in human-computer interaction research aiming at modeling and improving such interaction by developing user-friendly applications which should simplify and enrich the average end user’s ability to use automatic systems. Psycholinguistic studies have confirmed the complementary nature of verbal and nonverbal aspects in human expressions [44, 56, 58], demonstrating how visual information processing integrates and supports speech comprehension [61]. In the field of human-machine interaction, research works on mutual contribution of speech and gestures to communication have being carried out along three main axes. Some studies have been mainly devoted to model and synchronize speech production and facial movements for implementing more natural “talking heads” or “talking faces” [23, 28, 33, 37, 40-41, 54] taking into account, in some cases, features able also to encode emotional states [15, 29, 64]. Other studies have exploited a combination of speech and gestures features (mainly related to the oral movements) with the aim to improve the performance of automatic speech recognition systems [11]. Some others have dealt with the modeling and synthesis of facial expression (virtual agents), head, hand movements and body postures [3, 48] with the aim to improve the naturalness and effectiveness of interactive dialogues systems. Such studies are in their seminal stage, even though some prototypes, which prove the efficacy of modeling gestural information, have already been developed for the American English language [6-7, 70]. A less investigated but crucial aspect for multimodal human-machine interaction is the relationship between paralinguistic and extra-linguistic information conveyed by speech and gestures in human body-to-body interaction (in this context, the term gestures is mainly referred to facial expressions, head and hand/arm movements). Psycholinguistic studies have shown that humans convey meanings not only by using words (lexicon), and that there exists a set of non-lexical expressions carrying specific communicative values, expressing for example turn-taking and feedback mechanism regulations, or signalling active cognitive processes (such as the recovery of lexicon from the long term memory) during speech production [5, 8-10]. Typical non-lexical but communicative events at the speech level are, for example, empty and filled pauses and other hesitation phenomena (by which the speaker signals his/her intention to keep the turn), vocalizations and nasalizations signalling positive or negative feedback and the so called “speech repairs” which convey information on the speaker’s cognitive state and the planning and re-planning strategies she/he is typically using in a discourse. All these non-lexical events are often included in the overall category of “disfluencies” and therefore considered (mostly in the automatic speech recognition research) as similar to non-lexical and non communicative speech events such as coughing or sneezing. On the other hand, seminal works have observed that such non-lexical acts are also communicative speech events and show gestural correlates, both for the English and the Italian language [21-22, 25]. 
Adding a representation of this gestural information to a mathematical model of human-machine interaction would lead to the implementation of more natural and user-friendly interactive dialog systems and may contribute to a general improvement of system performance. The present paper aims to contribute to the development of this research by reporting data on the synchronization between communicative entities in speech (in
particular empty and filled speech pauses and vowel lengthening) and in gestures (in particular holds). The data were collected both for adults and for three differently aged groups of children (3, 5 and 9 year olds) with the aim of assessing when the synchrony between holds and speech pauses develops during child language acquisition.
2 Getting the Focus on Speech Pauses and Holds 2.1 The Role of Pausing Strategies in Dialogue Organization A characteristic of spontaneous speech, as well as of other types of speech, is the presence of silent intervals (empty pauses) and vocalizations (filled pauses) that do not have a lexical meaning. Pauses seem to play a role in controlling the speech flow. Several studies have been conducted to investigate the system of rules that underlie speaker pausing strategies and their psychological bases. Research in this field has shown that pauses may play several communicative functions, such as building up tension or raising expectations in the listener about the rest of the story, assisting the listener in her/his task of understanding the speaker, signalling anxiety, emphasis, syntactic complexity, degree of spontaneity and gender, and transmitting educational and socio-economical information [1, 34, 36, 51]. Studies on speech pause distribution in language production have produced evidence of a relationship between pausing and discourse structure. Empty and filled pauses are more likely to coincide with boundaries, realized as a silent interval of varying length, at clause and paragraph level [68]. This is particularly true for narrative structures where it has been shown that pausing marks the boundaries of narrative units [9-10, 18-20, 62-63]. Several cognitive psychologists have suggested that pausing strategies reflect the complexity of neural information processing. Pauses will surface in the speech stream as the end product of a “planning” process that cannot be carried out during speech articulation and the amount and length of pausing reflects the cognitive effort related to lexical choices and semantic difficulties for generating new information [4-5, 10, 18-20, 34]. We can conclude from the above considerations that pauses in speech are typically a multi-determined phenomenon attributable to physical, socio-psychological, communicative, linguistic and cognitive causes. Physical pauses are normally attributed to breathing or articulatory processes (i.e. pauses due to the momentary stoppage of the breath stream caused by the constrictors of the articulatory mechanism or the closure of the glottis). Socio-psychological pauses are caused by stress or anxiety [2]. Communicative pauses are meant to permit the listener to comprehend the message or to interrupt and ask questions or make comments. Linguistic pauses are used as a mean for discourse segmentation. Finally, cognitive pauses are related to mental processes connected to the flow of speech, such as replacing the current mental structure with a new one, in order to continue the production [8-10] or difficulties in conceptualization [34]. Recent studies aimed to investigate the role of speech pauses, such as empty and filled pauses, and phoneme lengthening, in child narrations have shown that also children exploit pausing strategies to shape their discourse structure (in Italian)[18-20]. Children pause, like
adults, to recover from their memory the new information (the added1 one) they are trying to convey. More complex (in terms of cognitive processing) is the recovery effort, longer is the pausing time. The longer are the pauses, the lower is the probability that they can be associated to given information. Most of the long pauses (96% for female and 94% for male) are associated to a change of scene suggesting that long pauses are favored by children in signalling discourse boundaries. The consistency in the distribution of speech pauses seems to suggest that, at least in Italian, both adults and children exploit a similar model of timing to regulate speech flow and discourse organization. In the light of these considerations it seems pretty logical to ask what could be, if any, the role of gesture pauses (holds henceforth) in communication and the possible functions they are assumed to play with respect to speech pauses. To this aim, in the reported data, socio-psychological, articulatory, and communicative pauses were ruled out from the analysis. The first ones were assumed not to be a relevant factor, by virtue of the particular elicitation setting (see next section for details). The second and third ones were identified during the speech analysis and eliminated from the dataset. The speech pauses considered in this work, therefore, are linguistic, cognitive and breathing pauses. On the basis of the assumption that breathing and linguistic pauses are part of the strategy the speakers adopt for grouping words into a sentence, in the following they are both considered part of the planned communication process. 2.2 The Role of Gestural Holds in Shaping the Interaction In daily human-to-human interaction we usually encode the messages we want to transmit in a set of actions that go beyond verbal modality. Nonverbal actions (grouped under the name of gestures) help to clarify meanings, feelings, and contexts, acting for the speaker as an expressive resource exploited in partnership with speech for appropriately shaping communicative intentions and satisfying the requirements of a particular message being transmitted. There is a considerable body of evidence attributing to gestures similar semantic and pragmatic functions as in speech and rejecting the hypothesis that neither gestures nor speech alone might have the primary role in the communicative act [16, 44, 56] but there are also data suggesting that the role of gestures is secondary to speech serving as support to the speaker's effort to encode his/her message [49-50, 52, 59-60, 65]. The latter hypothesis appears to be a reasonable position since during our everyday interactions we are aware of generating verbal messages and of the meaning we attribute to these, thanks to a continuous auditory feedback. On the other hand, we are not endowed with a similar feedback for our gesticulation, posture, and facial expressions. In addition, most of the gesturing is made without a conscious control, since we do not pay special attention to it while speaking, and additionally, humans carry out successful communications also in situations where they cannot see each other (on the telephone for example, see Short et al. [69] and Williams [74]). Conversely, it is really hard to infer the meaning of a message when only gestures and no speech is provided and therefore it might appear obvious, if not trivial to presume 1
1 The present work interprets the concepts of “given” and “added” according to the definition proposed by Chafe [8], who considered as “added” any verbal material that produces a modification in the listener’s conscious knowledge; “given” verbal material is therefore understood not to produce such a modification.
that the role of gestures, if any, in communication is just to assist the listener and/or the speaker [31-32, 52-53, 60-61, 66]. Nonetheless, more in-depth analyses shed doubts on the above position and show that gestures and speech are partners in shaping communication, giving kinetic and temporal (visual and auditory) dimensions to our thoughts. Some hints of these gestural functions can simply be experienced in everyday life. Gestures resolve speech ambiguities and facilitate comprehension in noisy environments, they act as a language when verbal communication is impaired, and in some contexts they are not only preferred but produce more effective results than speech in communicating ideas [35, 42-47, 55-58, 72]. More interestingly, it has been shown that gestures are used in semantic coherence with speech and may be coordinated with tone units and prosodic entities, such as pitch-accented syllables and boundary tones [17, 43, 71, 75]. In addition, gestures add an imagistic dimension to the phrasal contents [35, 39, 45, 47, 55] and are synchronized with speech pauses [4-5, 21-22, 25, 38, 43]. In the light of these considerations, gestures are to be regarded as an expressive system that, in partnership with speech, provides a means for giving form to our thoughts [13, 42, 55-56].
Fig. 1. Distributions of Empty Pauses (1a) and Holds (1b) over the Clauses (yellow bars) in an action plan dialogue produced by an American English speaker (Esposito et al., 2001)
Have we been convincing? Are the provided data able to definitely assess the role of gestures in communicative behaviours? Since the experimental data are somewhat conflictual, the question of how to integrate and evaluate the above different positions on the relevance of gestures in communication is still open and the results we are going to present may be relevant in evaluating their relative merits. In previous works (see Esposito et al. [21-22, 25]), we adopted the theoretical framework that gestures, acting in partnership with speech, have similar semantic and pragmatic functions. Starting from these assumptions, we tried to answer the following questions about hand movements: assuming that speech and gestures are co-involved in the production of a message, is there any gestural equivalent to filled and empty pauses in speech? Assuming that we have found some equivalent gestural entities, to what degree do these synchronize with speech pauses? As an answer to our first question, in two pilot studies we identified a gestural entity that we called hold. A careful review of speech and gesture data showed that in fluent speech contexts, holds appear to be distributed similarly to speech pauses and to overlap with them, independently from the language (the gesture data were produced by Italian and American English speakers) and the context (there were two narrative contexts: an action plan and a narration dialogue). As an example to support the above conclusions, the data in Figure 1 show the distribution of empty speech pauses (red bars, Figure 1a)) and the distribution of holds (red bars, Figure 1b) over speech clauses(yellow bars) produced by an American English speaker during an action plan dialogue. On the y-axis are reported the number of clauses (also displayed as yellow bars) and on the x-axis the durations of clauses, holds, and speech pauses. Figure 2 shows the amount of overlaps between empty speech pauses and holds (red bars) and between locutions and holds (white bars) during the narration of an episode of a familiar cartoon (Silvester-Twitee) reported by an Italian (Figure 2a) and an American English speaker (Figure 2b). In a recent work [16] we found further support to our previous speculations through the analysis of narrative discourse data collected both from children and adults who participated in a similar elicitation experiment. Both adults and children were native speakers of Italian. There were two goals motivating this extension of the previous research: 1) If the relationships previously found, between holds and speech pause, are robust, they should be independent of age; i.e., they should also be evident in child narrations; 2) If at the least some aspects of speech and gesture reflect a unified planning process, these should be similar for all human beings providing that the same expressive tools are available. The results of the above research work [16] are partially displayed in Figures 3 and 4. Figure 3 graphically shows the percentage of overlaps against the percentage of speech pauses that do not overlap with holds, in children (3a) and adults (3b).
Fig. 2. Percentage of overlaps between empty (EP) and filled (FP) speech pauses and holds (red bars) and between clauses and holds (white bars) during the narration of a cartoon episode (Silvester-Twitee) narrated by an Italian (Figure 2a) and an American English speaker (Figure 2b).
Figure 4 displays, for each subject in each group (children and adults), the hold and speech pause rates computed as the ratios between the number of holds and/or speech pauses and the length of the subject’s narrations measured in seconds. Figure 4a is for children and Figure 4b is for adults. The Pearson correlation coefficient was computed as a descriptive statistic of the magnitude or the amount of information that can be inferred about speech pause frequency from the known hold frequency. The Pearson correlation coefficient between holds and speech pauses for children was r = 0.97, and the proportion of the variation of speech pauses that is determined by the variation of holds (i.e. the coefficient of determination) was r² = 0.93, which means that 93% of the children’s speech pause variation is predictable from holds. For adults, r = 0.88 and r² = 0.78.
[Figure 3 data: children – 84.8% of speech pauses overlapping with holds, 15.2% not overlapping; adults – 83% overlapping, 17% not overlapping.]

Fig. 3. Percentage of overlaps and non-overlaps between speech pauses and holds in children (3a) and adults (3b)
The two groups of speakers produced a similar distribution of hold and speech pause overlaps. The degree of synchronization was so high that further statistical analyses to assess its significance were not necessary, if the word “synchronization” is interpreted more loosely to mean “the obtaining of a desired fixed relationship among corresponding significant instants of two or more signals” [www.its.bldrdoc.gov]. In summary, the reported data showed that the frequency of overlaps between holds and speech pauses was not only remarkably high but also much the same for adults and children (see Figures 3 and 4), clearly indicating that both children and adults tended to synchronize speech pauses with holds independently of their age. The two rates were also compared statistically with a one-way ANOVA test performed for each group, with hold and speech pause rates as within-subject variables. The differences between hold and speech pause rates were not significant for children (F(1,10) = 1.09, p = 0.32), suggesting that holds and speech pauses were equally distributed along the children’s narrations. For adults, the differences were statistically significant (F(1,6) = 11.38, p = 0.01), suggesting that adults used holds more frequently than speech pauses.
Fig. 4. Hold rates against speech pause rates for children (4a) and adults (4b)
From these results, considering the role that speech pauses play in communication we speculate on the possibility that holds may serve to similar purposes supporting the view that gestures as speech are an expressive resource that can take on different functions depending on the communicative demand. The data discussed above seem to support this hypothesis, showing that 93% of the children and 78% of the adult speech pause variation is predictable from holds, suggesting that at least to some extent, the function of holds may be thought to be similar to speech pauses. We further speculated that while speech pauses are likely to play the role of signalling mental activation processes aimed at replacing the “old spoken content” of an “utterance” with a new one, holds may signal mental activation processes aimed at replacing the “old visible bodily actions” (intimately involved in the semantic and/or pragmatic contents of the old “utterance”) with new bodily actions reflecting the representational and/or propositional contribution that gestures are engaged to convey in the new “utterance”. In order to further support the above results, in the present work we try to answer the following questions: When and how does this synchrony develop during child language acquisition? Could it be observed also in children younger than 9? To answer the above questions, in the following sections, are described the analyses of
narrations produced by three different age groups of Italian children (9, 5 and 3 year olds) and measurements of the amount of speech pauses and holds are provided as a function of the word rates and narration durations.
3 Material

The video recordings on which our analysis is based are of narrations by three groups of children:
• 8 females, aged 9 years ± 3 months;
• 5 males and 5 females, aged 5 years ± 3 months;
• 3 males and 3 females, aged 3 years ± 3 months.
The children told the story of a 7-minute animated color cartoon they had just seen. The cartoon was of a type familiar to Italian children, involving a cat and a bird. The listener was the child’s teacher, together with other children also participating in the experiment. The children’s recordings were made after the experimenter had spent two months with the children in order to become familiar to them, and after several preparatory recordings had been made in various contexts so that the children could get used to the camera. This kept stranger-experimenter inhibitions out of the elicitation setting, i.e., factors that could result in stress and anxiety. Limiting these factors allowed us to rule out the “socio-psychological” type of pauses [2]. The cartoon had an episodic structure, each episode characterized by a “cat that tries to catch a bird and is foiled” narrative arc. The experimental set-up is the same as that reported in previous work by McNeill and Duncan [57], and the decision to use such a similar experimental set-up was made in order to allow future comparisons with other similar research works. Because of the cartoon’s episodic structure, children would typically forget entire episodes, and therefore only four episodes (those common to all the child narrations) were analyzed. None of the participants was aware that speech and gesture pauses were of interest. The video was analyzed using commercial video analysis software (VirtualDub™) that allows viewing video shots and moving forward and backward through the shots. The speech waves, extracted from the video, were sampled at 16 kHz and digitized at 16 bits. The audio was analyzed using Speechstation2™ from Sensimetrics. For the audio measurements the waveform, energy, spectrogram, and spectrum were considered together, in order to identify the beginnings and endings of utterances, filled and empty speech pauses, and phoneme lengthening. The details of the criteria applied to identify the boundaries in the speech waveform are described in [24, 26]. Both the video and audio data were analyzed perceptually, the former frame-by-frame and the latter clause-by-clause or locution-by-locution, where a “clause” or a “locution” is assumed to be “a sequence of words grouped together on a semantic or functional basis” [18-20].

3.1 Some Working Definitions

In this study, empty pauses are simply defined as a silence (or verbal inactivity) in the flow of speech equal to or longer than 120 milliseconds. Filled pauses are defined as
the lengthening of a vowel or consonant identified perceptually (and on the spectrogram) by the experimenter, or as one of the following expressions: “uh, hum, ah, ehm, ehh, a:nd, the:, so, the:n, con:, er, e:, a:, so:”2. A hold is detected when the arms and hands remain still for at least three video frames (i.e., approximately 120 ms) in whatever position, excluding the rest position. The latter is defined as the home position of the arms and hands when they are not engaged in gesticulation, typically at the lower periphery of the gesture space (see McNeill [58], p. 89). The holds associated with gesture rest were not included in the analysis by virtue of the particular elicitation setting (see the next section for details). Note that the absence of movement is judged perceptually by an expert human coder; therefore, the concept of hold is ultimately a perceptual one. A hold may be thought to be associated with a particular level of discourse abstraction. In producing a sentence, the speaker may employ a metaphoric gesture with a hold spanning the entire utterance. However, the speaker may also engage in word-search behaviour (characterized by a slight oscillatory motion centered around the original hold) without any change in hand shape (the Butterworth gesture cited in McNeill [58]). The speaker may also add emphatic beats coinciding with points of peak prosodic emphasis in the utterance. While the word search and emphatic beats may sit atop the original hold, most observers will still perceive the underlying gesture hold.
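The overlap and rate measures used in the next section can be made concrete with a small sketch (illustrative only; the interval boundaries and per-subject rates shown below are hypothetical values, not measurements from this study):

```python
import numpy as np

def overlap_fraction(pauses, holds):
    """Fraction of speech pauses that overlap in time with at least one hold.

    pauses, holds: lists of (start, end) times in seconds taken from the annotation.
    """
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]
    return sum(any(overlaps(p, h) for h in holds) for p in pauses) / len(pauses)

def event_rate(events, narration_duration):
    """Number of pauses (or holds) per second of narration."""
    return len(events) / narration_duration

# Pearson correlation between per-subject hold and pause rates (r and r^2 as in the text);
# the numbers below are hypothetical placeholders.
hold_rates = np.array([0.31, 0.27, 0.40, 0.22])
pause_rates = np.array([0.29, 0.25, 0.43, 0.20])
r = np.corrcoef(hold_rates, pause_rates)[0, 1]
r_squared = r ** 2
```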
4 Results

Figure 5 displays the percentage of holds overlapping with speech pauses in the three different groups. In this particular case rest positions were not included in the data, since it could have been objected that small children may produce a considerable number of rest positions that cannot be counted as holds (in the previous study rest positions were difficult to distinguish from holds, and the two types of gestures were considered together). The Pearson product-moment correlation coefficient was computed as a descriptive statistic of the magnitude or the amount of information that can be inferred about speech pause frequency from the known hold frequency. The Pearson correlation coefficient between holds and speech pauses for 3 year old children was r = 0.72, and the proportion of the variation of speech pauses that is determined by the variation of holds (i.e. the coefficient of determination) was r² = 0.52, which means that 52% of the children’s speech pause variation is predictable from holds. For 5 year old children r = 0.18 and r² = 0.03, which means that there was no correlation between pause variation and holds in 5 year old children. For 9 year old children r = 0.83 and r² = 0.70, which means that 70% of the children’s speech pause variation is predictable from holds. For adults it was previously found [16] that r = 0.88 and r² = 0.78, which means that 78% of the adults’ speech pause variation is predictable from holds.
2 The notation “:” indicates vowel or consonant lengthening.
[Figure 5 data: average percentage of holds overlapping with speech pauses and standard deviation for the three groups (AV = 84.6%, SD = 7.8; AV = 71.6%, SD = 12.1; AV = 70.6%, SD = 8.1).]

Fig. 5. Percentage of holds overlapping with speech pauses in the three different age groups of children
Fig. 6. Distribution of pause and hold rates in three year old children (subjects F1–F3, M1–M3)

Fig. 7. Distribution of pause and hold rates in five year old children (subjects F1, F2, F4–F6, M1–M5)
Fig. 8. Distribution of pause and hold rates in nine year old children (subjects F1–F5, F7, F9, F10)
The data reported above show that a correlation between holds and speech pauses exists only for two of the three groups of children, and that for five-year-old children there was no synchronization between speech pauses and holds. In order to ascertain the causes of this discrepancy, the distribution of hold and speech pause rates was checked for each group and each subject. Figures 6, 7 and 8 display these distributions in the three different age groups; the labels F and M on the x-axis indicate female and male children respectively. The distribution of hold and speech pause rates does not seem to explain why there was no correlation between holds and speech pauses in five-year-old children. Therefore, as a further control, the rest position rates (computed as the ratio of the number of rest positions to the duration – in seconds – of the episodes under analysis) were considered for the three groups of children (see Figures 9, 10, and 11).
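Assuming that pause and hold rates are defined analogously to the rest position rates, i.e., as counts divided by the analyzed episode duration in seconds, the per-subject rates plotted in Figures 6-11 could be computed as follows (the subjects, counts and durations shown are hypothetical).

```python
# Illustrative only: per-subject rates as counts divided by the analyzed
# episode duration in seconds. The dictionary is a placeholder, not the study's data.
annotations = {
    "F1": {"pauses": 34, "holds": 29, "rest_positions": 4, "duration_s": 95.0},
    "M1": {"pauses": 28, "holds": 31, "rest_positions": 7, "duration_s": 110.0},
}

for subject, a in annotations.items():
    rates = {k: a[k] / a["duration_s"] for k in ("pauses", "holds", "rest_positions")}
    print(subject, {k: round(v, 3) for k, v in rates.items()})
```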
Fig. 9. Rest position rates in three year old children (subjects F1–F3, M1–M3)
Fig. 10. Rest position rates in five year old children (subjects F1, F2, F4–F6, M1–M5)

Fig. 11. Rest position rates in nine year old children (subjects F1–F5, F7, F9, F10)
The data show an unexpected trend and provide the basis for a novel interpretation of the amount of gestural holds that can be predicted from speech pauses. In fact, Figure 10 shows that the rest position rates in five-year-old children are quite high and very different from those observed in three- and nine-year-old children.
5 Discussion

The previous section reported three interesting results. First, a large amount of the variation in speech pauses is highly correlated with holds, both in adults and in children, and there is a great amount of overlap between the two speech and gestural entities. Second, speech pauses are highly synchronized with holds, and this synchronization
does not depend on the speaker's age. Third, five-year-old children do not seem to show the same amount of overlap and synchronization between hold and speech pause entities. What does this suggest about the gesture–speech partnership? To answer this question it is necessary to recall the role attributed to speech pauses, in particular to the cognitive speech pauses examined in this work. As already pointed out in the introductory section, these speech pauses are used to "hold the floor", i.e., to prevent interruption by the listener while the speaker searches for a specific word [27], but they can also serve other functions, such as reflecting the complexity of neural information processing. Pauses surface in the speech stream as the end product of a "planning" process that cannot be carried out during speech articulation, and the amount and length of pausing reflect the cognitive effort related to lexical choices and semantic difficulties in generating new information [4-5, 8-10, 34]. In summary, speech pauses seem to signal mental activation processes aimed at replacing a particular attentional state with a new one. Given the great amount of overlap between holds and speech pauses, holds appear to be gestural entities with a function and behaviour similar to those of speech pauses. These data therefore appear to support the hypothesis that non-verbal modalities and speech have similar semantic and pragmatic functions and that, at least in some respects, speech and gestures reflect a unified planning process, which is implemented synchronously in space and time through the exploitation of two different avenues (the manual-visual versus the oral-auditory channel). As speech pauses seem to signal mental activation processes aimed at replacing the "given spoken content" of a former utterance with an "added" one, holds may signal mental activation processes aimed at replacing "given visible bodily actions" (intimately involved in the semantic and/or pragmatic contents of the former "utterance") with "added bodily actions" reflecting the new representational and/or propositional contribution that gestures convey in the new "utterance". Note that the meaning given here to the word "utterance" is the same as that used by Kendon (see chapter 1, page 5, [44]): "an object constructed for others from components fashioned from both spoken language and gesture". As far as the reported data are concerned, in children, as in adults, holds and speech pauses are to a certain extent synchronized and play similar functions, suggesting that they may be considered a multi-determined phenomenon exploited by the speaker, under the guidance of a unified planning process, to satisfy a communicative aim. Under the above assumption about the meaning of the word "utterance", we can speculate about how to explain the second result reported in the present work, i.e., why hold rates in adults are significantly different from speech pause rates whereas this is not the case for 3- and 9-year-old children. Our hypothesis is that, since gestures are the "kinetic" expression of our thoughts, the speaker may use them in many different ways, one of which could be to structure the spoken discourse when lexical access does not present difficulties. This is one of the functions played by holds in adults.
In fact, the holds performed by adults in their narrations that were not synchronized with speech pauses occurred at the end of clauses, as if to mark the different components of the sentence and to emphasize or underline groups of words. Children, instead, being less skilled at assembling bodily and verbal information, tend to attribute to holds the same functions as speech pauses. These considerations may explain the differences in
hold and speech pause rates between the two groups. On the other hand, children may be less skilled in bodily actions than in language, since they begin to experience visible actions only after birth, whereas language feedback is already experienced during pregnancy. Furthermore, sophisticated utterances, in which verbal and nonverbal entities are put together to express thoughts with the aim of maximizing the amount of information transmitted, are a prerogative of adult communicative behaviour and may not be necessary in child utterances, limiting the functions and the use of gestures and consequently of holds. One apparent inconsistency remains in the above discussion, since five-year-old children do not seem to synchronize speech pauses and holds. However, at the age of 5-6 children acquire social consciousness and begin to care about their performance as well as about how they are perceived by others (according to several psychological theories, such as Theory of Mind [30, 73]). This includes heightened sensitivity to criticism [14]. It could be this acquired sensitivity that prevented the children, even in a friendly environment, from delivering their narrations in a relaxed way, increasing their rest positions and decreasing the synchronization of their gestures with their speech. Although the present data may be relevant in assessing the partnership between speech and gestures, it should be emphasized that this is a pilot study, based on data restricted to a narration context, and that further work is needed to support the above assumptions as well as to assess the functions of holds in the production of utterances.
6 Conclusions
The present paper reports perceptual data showing that both adults and children produce speech pauses in synchronization with holds, thereby supporting the hypothesis that, at least in some respects, speech and gestures reflect a unified communicative planning process in the production of utterances. The consistency among the subjects in the distribution of holds and speech pauses suggests that, at least in the Italian language, there is an intrinsic timing behaviour, probably a general pattern of rules, that speakers (in narrations) use to regulate the speech flow in synchrony with bodily actions in order to structure the discourse organization. The synchrony we are speaking of is more specific than the synchrony discussed by Condon and Sander [12] as well as in several more recent papers in the literature [5, 27, 42-44, 55-58]. Contrary to an objection raised by a reviewer of this paper, synchrony between holds and speech pauses of different types was first observed and discussed by one of the authors in [16, 22, 25]. The importance of this synchrony is strongly related to the multi-determined nature of pauses in speech and may help to elucidate the role of gestures in communication. The authors will welcome new investigations in this direction. It would be interesting to conduct an analysis on a more extensive data set and to model this behaviour in mathematical terms. This might help to derive a deterministic algorithm that would be of great utility for applications in the field of human-machine interaction, favouring the implementation of more natural speech synthesis and interactive dialog systems. The analysis developed in this paper sheds light on only a subset of the much richer and more subtle processes underlying the rules and procedures governing the dynamics of face-to-face communication. Among the phenomena not yet examined and worth investigating are:
• The relevance that the bodily actions of the speaker and the listener might have in guiding the dialogue;
• The exploitation of speech pauses and holds in mid-turn or in signalling the engagement and disengagement of the participants within the turn;
• The functioning and positioning of speech pauses and holds at certain favourite sequential positions within conversations where they are more likely to be relaxed, such as at the end of clauses and paragraphs during a narration.
In the present study, the consequences of the listener's actions on the speaker's behaviour have not been considered. Interactions between speaker and listener are relevant and may well lead to systematic changes in the emerging structure of the speaker's utterance and in her/his distribution of speech pauses and holds along the utterance. How these dynamics are implemented during interaction is a central issue for the development of a theory of the speech and gesture partnership.

Acknowledgments. This work has been supported by the European projects COST 2102 "Cross Modal Analysis of Verbal and Nonverbal Communication" (http://cost2102.cs.stir.ac.uk/) and COST ISCH TD0904 "TIMELY: Time in MEntal activitY" (http://w3.cost.eu/index.php?id=233&action_number=TD0904). Our thanks go to three anonymous reviewers for their helpful comments and suggestions and to Tina Marcella Nappi for her editorial help.
References

1. Abrams, K., Bever, T.G.: Syntactic Structure Modifies Attention During Speech Perception and Recognition. Quarterly Journal of Experimental Psychology 21, 280–290 (1969)
2. Beaugrande, R.: Text Production. Ablex Publishing Corporation, Norwood (1984)
3. Bryll, R., Quek, F., Esposito, A.: Automatic Hand Hold Detection in Natural Conversation. In: Proc. of IEEE Workshop on Cues in Communication, Hawaii, December 9 (2001)
4. Butterworth, B.L., Hadar, U.: Gesture, Speech, and Computational Stages: A Reply to McNeill. Psychological Review 96, 168–174 (1989)
5. Butterworth, B.L., Beattie, G.W.: Gestures and Silence as Indicators of Planning in Speech. In: Campbell, R.N., Smith, P.T. (eds.) Recent Advances in the Psychology of Language, pp. 347–360. Plenum Press, New York (1978)
6. Cassell, J., Nakano, Y., Bickmore, T., Sidner, C., Rich, C.: Non-verbal Cues for Discourse Structure. In: Association for Computational Linguistics Joint EACL-ACL Conference (2001a)
7. Cassell, J., Vilhjalmsson, H., Bickmore, T.: BEAT: The Behavior Expression Animation Toolkit. In: Proc. of SIGGRAPH (2001b)
8. Chafe, W.L.: Language and Consciousness. Language 50, 111–133 (1974)
9. Chafe, W.L.: The Deployment of Consciousness in the Production of a Narrative. In: Chafe, W.L. (ed.) The Pear Stories, pp. 9–50. Ablex, Norwood (1980)
10. Chafe, W.L.: Cognitive Constraint on Information Flow. In: Tomlin, R. (ed.) Coherence and Grounding in Discourse, pp. 20–51. John Benjamins, Amsterdam (1987)
11. Chen, L., Liu, Y., Harper, M.P., Shriberg, E.: Multimodal Model Integration for Sentence Unit Detection. In: Proceedings of ICMI, State College, Pennsylvania, USA, October 13-15 (2004)
12. Condon, W.S., Sander, L.W.: Synchrony Demonstrated between Movements of the Neonate and Adult Speech. Child Development 45(2), 456–462 (1974) 13. De Ruiter, J.P.: The Production of Gesture and Speech. In: McNeill, D. (ed.) Language and Gesture, pp. 284–311. Cambridge University Press, UK (2000) 14. Dunn, J.: Children as Psychologist: The Later Correlates of Individual Differences in Understanding Emotion and Other Minds. Cognition and Emotion 9, 187–201 (1995) 15. Esposito, A.: Affect in Multimodal Information. In: Tao, J., Tan, T. (eds.) Affective Information Processing, pp. 211–234. Springer, Heidelberg (2008) 16. Esposito, A., Marinaro, M.: What Pauses Can Tell Us About Speech and Gesture Partnership. In: Esposito, A., et al. (eds.) Fundamentals of Verbal and Nonverbal Communication and the Biometric Issue. NATO Publishing Sub-Series E: Human and Societal Dynamics, vol. 18, pp. 45–57. IOS Press, The Netherlands (2007) 17. Esposito, A., Esposito, D., Refice, M., Savino, M., Shattuck-Hufnagel, S.: A Preliminary Investigation of the Relationships between Gestures and Prosody in Italian. In: Esposito, A., et al. (eds.) Fundamentals of Verbal and Nonverbal Communication and the Biometric Issue. NATO Publishing Sub-Series E: Human and Societal Dynamics, vol. 18, pp. 65–74. IOS Press, The Netherlands (2007) 18. Esposito, A.: Children’s Organization of Discourse Structure Through Pausing Means. In: Faundez-Zanuy, M., Janer, L., Esposito, A., Satue-Villar, A., Roure, J., Espinosa-Duro, V., et al. (eds.) NOLISP 2005. LNCS (LNAI), vol. 3817, pp. 108–115. Springer, Heidelberg (2006) 19. Esposito, A.: Pausing Strategies in Children. In: Proceedings of the International Conference in Nonlinear Speech Processing, Cargraphics, Barcelona, Spain, April 19-22, pp. 42–48 (2005) 20. Esposito, A., Marinaro, M., Palombo, G.: Children Speech Pauses as Markers of Different Discourse Structures and Utterance Information Content. In: Proceedings of the International Conference: From Sound to Sense: +50 Years of Discoveries in Speech Communication, June 10-13, pp. C139–C144. MIT, Cambridge (2004) 21. Esposito, A., Natale, A., Duncan, S., McNeill, D., Quek, F.: Speech and Gestures Pauses Relationships: A Hypothesis of Synchronization. In: Proceedings of the V National Conference on Italian Psychology, AIP, Grafica80-Modugno, Bari, Italy, pp. 95–98 (2003) (in Italian) 22. Esposito, A., Duncan, S., Quek, F.: Holds as gestural correlates to empty and filled pauses. In: Proc. of ICSLP, Colorado, vol. 1, pp. 541–544 (2002a) 23. Esposito, A., Gutierrez-Osuna, R., Kakumanu, P., Garcia, O.N.: Optimal Data Encoding for Speech Driven Facial Animation. Wright State University Technical Report N. CSWSU-04-02, Dayton, Ohio, USA 1-11 (2002b) 24. Esposito, A.: On Vowel Height and Consonantal Voicing Effects: Data from Italian. Phonetica 9(4), 197–231 (2002c) 25. Esposito, A., McCullough, K.E., Quek, F.: Disfluencies in Gesture: Gestural Correlates to Filled and Unfilled Speech Pauses. In: Proc. of IEEE Workshop on Cues in Communication, Hawai (2001) 26. Esposito, A., Stevens, K.N.: Notes on Italian Vowels: An Acoustical Study (Part I). Research Laboratory of Electronic, Speech Communication Working Papers 10, 1–42 (1995) 27. Erbaugh, M.S.: A Uniform Pause and Error Strategy for Native and Non-native Speakers. In: Tomlin, R. (ed.) Coherence and Grounding in Discourse, pp. 109–130. John Benjamins, Amsterdam (1987)
28. Ezzat, T., Geiger, G., Poggio, T.: Trainable Video Realistic Speech Animation. In: Proc. of SIGGRAPH, San Antonio, Texas, pp. 388–397 (2002)
29. Fasel, B., Luettin, J.: Automatic Facial Expression Analysis: A Survey. Pattern Recognition 36(1), 259–275 (2003)
30. Flavell, J.H.: Cognitive Development: Children's Knowledge About the Mind. Annual Review of Psychology 50, 21–45 (1999)
31. Freedman, N.: The Analysis of Movement Behaviour During the Clinical Interview. In: Siegmann, A.W., Pope, B. (eds.) Studies in Dyadic Communication, pp. 177–208. Pergamon Press, Oxford (1972)
32. Freedman, N., Van Meel, J., Barroso, F., Bucci, W.: On the Development of Communicative Competence. Semiotica 62, 77–105 (1986)
33. Fu, S., Gutierrez-Osuna, R., Esposito, A., Kakumanu, P., Garcia, O.N.: Audio/Visual Mapping with Cross-Modal Hidden Markov Models. IEEE Transactions on Multimedia 7(2), 243–252 (2005)
34. Goldman-Eisler, F.: Psycholinguistics: Experiments in Spontaneous Speech. Academic Press, London (1968)
35. Goldin-Meadow, S.: Gesture: How Our Hands Help Us Think. Harvard University Press, Cambridge (2003)
36. Green, D.W.: The Immediate Processing of Sentences. Quarterly Journal of Experimental Psychology 29, 135–146 (1977)
37. Gutierrez-Osuna, R., Kakumanu, P., Esposito, A., Garcia, O.N., Bojorquez, A., Castello, J., Rudomin, I.: Speech-Driven Facial Animation with Realistic Dynamics. IEEE Transactions on Multimedia 7(1), 33–42 (2005)
38. Hadar, U., Butterworth, B.L.: Iconic Gestures, Imagery and Word Retrieval in Speech. Semiotica 115, 147–172 (1997)
39. Kähler, K., Haber, J., Seidel, H.: Geometry-based Muscle Modeling for Facial Animation. In: Proc. of Inter. Conf. on Graphics Interface, pp. 27–36 (2001)
40. Kakumanu, P., Esposito, A., Gutierrez-Osuna, R., Garcia, O.N.: Comparing Different Acoustic Data-Encoding for Speech Driven Facial Animation. Speech Communication 48(6), 598–615 (2006)
41. Kakumanu, P., Gutierrez-Osuna, R., Esposito, A., Bryll, R., Goshtasby, A., Garcia, O.N.: Speech Driven Facial Animation. In: Proc. of ACM Workshop on Perceptive User Interfaces, Orlando, November 15-16 (2001)
42. Kendon, A.: Spacing and Orientation in Co-present Interaction. In: Esposito, A., Campbell, N., Vogel, C., Hussain, A., Nijholt, A. (eds.) Second COST 2102. LNCS, vol. 5967, pp. 1–15. Springer, Heidelberg (2010)
43. Kendon, A.: Some Topics in Gesture Study. In: Esposito, A., et al. (eds.) Fundamentals of Verbal and Nonverbal Communication and the Biometric Issue. NATO Publishing Sub-Series E: Human and Societal Dynamics, vol. 18, pp. 1–17. IOS Press, The Netherlands (2007)
44. Kendon, A.: Gesture: Visible Action as Utterance. Cambridge University Press, Cambridge (2004)
45. Kendon, A.: Sign Languages of Aboriginal Australia: Cultural, Semiotic and Communicative Perspectives. Cambridge University Press, Cambridge (1988)
46. Kendon, A.: Current Issues in the Study of Gesture. In: Nespoulous, J.L., et al. (eds.) The Biological Foundations of Gestures: Motor and Semiotic Aspects, pp. 23–27. LEA Publishers, Hillsdale (1986)
47. Kendon, A.: Gesticulation and Speech: Two Aspects of the Process of Utterance. In: Ritchie Key, M. (ed.) The Relationship of Verbal and Nonverbal Communication, pp. 207–227. Mouton and Co., The Hague (1980) 48. Kipp, M.: From Human Gesture to Synthetic Action. In: Proc. of Workshop on Multimodal Communication and Context in Embodied Agents, Montreal, pp. 9–14 (2001) 49. Kita, S., Özyürek, A.: What Does Cross-Linguistic Variation in Semantic Coordination of Speech and Gesture Reveal? Evidence for an Interface Representation of Spatial Thinking and Speaking. Journal of Memory and Language 48, 16–32 (2003) 50. Kita, S.: How Representational Gestures Help Speaking. In: McNeill, D. (ed.) Language and Gesture, pp. 162–185. Cambridge University Press, UK (2000) 51. Kowal, S., O’Connell, D.C., Sabin, E.J.: Development of Temporal Patterning and Vocal Hesitations in Spontaneous Narratives. Journal of Psycholinguistic Research 4, 195–207 (1975) 52. Krauss, R., Chen, Y., Gottesman, R.F.: Lexical Gestures and Lexical Access: A Process Model. In: McNeill, D. (ed.) Language and Gesture, pp. 261–283. Cambridge University Press, UK (2000) 53. Krauss, R., Morrel-Samuels, P., Colasante, C.: Do Conversational Hand Gestures Communicate? Journal of Personality and Social Psychology 61(5), 743–754 (1991) 54. Lee, Y., Terzopoulos, D., Waters, K.: Realistic Modeling for Facial Animation. In: Proc. of SIGGRAPH, pp. 55–62 (1995) 55. McNeill, D.: Gesture and Thought. In: Esposito, A., et al. (eds.) Fundamentals of Verbal and Nonverbal Communication and the Biometric Issue. NATO Publishing Sub-Series E: Human and Societal Dynamics, vol. 18, pp. 18–31. IOS Press, The Netherlands (2007) 56. McNeill, D.: Gesture and Thought. University of Chicago Press, Chicago (2005) 57. McNeill, D., Duncan, S.: Growth Points in Thinking for Speaking. In: McNeill, D. (ed.) Language and Gesture, pp. 141–161. Cambridge University Press, UK (2000) 58. McNeill, D.: Hand and Mind: What Gesture Reveal about Thought. University of Chicago Press, Chicago (1992) 59. Morsella, E., Krauss, R.M.: Muscular Activity in the Arm During Lexical Retrieval: Implications for Gesture-Speech Theories. Journal of Psycholinguistic Research 34, 415– 437 (2005) 60. Morsella, E., Krauss, R.M.: Can Motor States Influence Semantic Processing? Evidence from an Interference Paradigm. In: Columbus, A. (ed.) Advances in Psychology Research, vol. 36, pp. 163–182. Nova, New York (2005a) 61. Munhall, K.G., Jones, J.A., Callan, D.E., Kuratate, T., Vatikiotis-Bateson, E.: Visual Prosody and Speech Intelligibility. Psychological Science 15(2), 133–137 (2004) 62. O’Shaughnessy, D.: Timing Patterns in Fluent and Disfluent Spontaneous Speech. In: Proceedings of ICASSP Conference, Detroit, Detroit, pp. 600–603 (1995) 63. Oliveira, M.: Pausing Strategies as Means of Information Processing Narratives. In: Proceedings of the International Conference on Speech Prosody, Aix-en-Provence, pp. 539–542 (2002) 64. Prinosil, J., Smekal, Z., Esposito, A.: Combining Features for Recognizing Emotional Facial Expressions in Static Images. In: Esposito, A., Bourbakis, N.G., Avouris, N., Hatzilygeroudis, I., et al. (eds.) HH and HM Interaction. LNCS (LNAI), vol. 5042, pp. 56– 69. Springer, Heidelberg (2008) 65. Rimé, B., Schiaratura, L.: Gesture and Speech. In: Feldman, R.S., Rimé, B. (eds.) Fundamentals of Nonverbal Behavior, Cambridge University Press, pp. 239–284. Cambridge University Press, Cambridge (1992)
66. Rimé, B.: The Elimination of Visible Behaviour from Social Interactions: Effects of Verbal, Nonverbal, and Interpersonal Variables. European Journal of Social Psychology 12, 113–129 (1982) 67. Rogers, W.T.: The Contribution of Kinesic Illustrators Towards the Comprehension of Verbal Behaviour Within Utterances. Human Communication Research 5, 54–62 (1978) 68. Rosenfield, B.: Pauses in Oral and Written Narratives. Boston University Press (1987) 69. Short, J., Williams, E., Christie, B.: The Social Psychology of Telecommunications. Wiley, New York (1976) 70. Stocky, T., Cassell, J.: Shared Reality: Spatial Intelligence in Intuitive User Interfaces. In: Proc. of Intelligent User Interfaces, San Francisco, CA, pp. 224–225 (2002) 71. Shattuck-Hufnagel, S., Yasinnik, Y., Veilleux, N., Renwick, M.: A Method for Studying the Time Alignment of Gestures and Prosody in American English: ‘Hits’ and Pitch Accents in Academic-Lecture-Style Speech. In: Esposito, A., et al. (eds.) Fundamentals of Verbal and Nonverbal Communication and the Biometric Issue. NATO Publishing SubSeries E: Human and Societal Dynamics, vol. 18, pp. 32–42. IOS Press, The Netherlands (2007) 72. Thompson, L.A., Massaro, D.W.: Evaluation and Integration of Speech and Pointing Gestures During Referential Understanding. Journal of Experimental Child Psychology 42, 144–168 (1986) 73. Wellman, H.M.: Early Understanding of Mind: The Normal Case. In: Baron-Cohen, S., et al. (eds.) Understanding Other Mind: Perspective from Children with Autism, pp. 10–39. Oxford Univ. Press, Oxford (1993) 74. Williams, E.: Experimental Comparisons of Face-to-Face and Mediated Communication: A Review. Psychological Bulletin 84, 963–976 (1977) 75. Yasinnik, Y., Renwick, M., Shattuck-Hufnagel, S.: The Timing of Speech-Accompanying Gestures with Respect to Prosody. In: Proceedings of the International Conference: From Sound to Sense: +50 Years of Discoveries in Speech Communication, June 10-13, pp. C97–C102. MIT, Cambridge (2004)
Study of the Phenomenon of Phonetic Convergence Thanks to Speech Dominoes Amélie Lelong and Gérard Bailly GIPSA-Lab, Speech & Cognition dpt., UMR 5216 CNRS/Grenoble INP/UJF/U. Stendhal, 38402 Grenoble Cedex, France {amelie.lelong,gerard.bailly}@gipsa-lab.grenoble-inp.fr
Abstract. During an interaction people are known to mutually adapt. Phonetic adaptation has been studied notably for prosodic parameters such as loudness, speech rate or fundamental frequency. In most cases, results are contradictory and the effectiveness of phonetic convergence during an interaction remains an open issue. This paper describes an experiment based on a children's game known as speech dominoes that enabled us to collect several hundred syllables uttered by different speakers in different conditions: alone before any interaction vs. after it, in a mediated interaction vs. in a face-to-face interaction. Speech recognition techniques were then applied to globally characterize a possible phonetic convergence. Keywords: face-to-face interaction, phonetic convergence, mutual adaptation.
1 Introduction

Communication Accommodation Theory (CAT), introduced by Giles et al. [1], postulates that individuals accommodate their communication behavior either by moving much closer to their interlocutor (convergence) or, on the contrary, by increasing their differences (divergence). People can adapt to each other in different ways. For example, conversational partners notably adapt to each other's choice of words and references [2] and also converge on certain syntactic choices [3]. Zoltan-Ford [4] has shown that users of dialog systems converge lexically and syntactically to the spoken responses of the system. Ward et al. [5] demonstrated that adaptive systems mimicking this behavior facilitate learning. This alignment [6] may have several benefits, such as easing comprehension [7], facilitating the exchange of messages whose meaning is highly context-dependent [8], disclosing the ability and willingness to perceive, understand or accept new information [9], and maintaining social glue or resonance [10]. Researchers have also examined adaptation of phonetic dimensions such as pitch [11], speech rate [12], loudness [13], and dispersions of vocalic targets [14], as well as more global alignment such as turn-taking [15]. But the results of these different studies show a weak convergence and, in some cases, no convergence at all. In the perceptual study conducted by Pardo [16], disparities between talkers have been attributed to various dimensions such as social settings, communication goals and varying roles in the conversation. Sex differences have also been put forward: female interlocutors show more convergence than males.
This emerging field of research is crucial to understanding adaptive behavior during unconstrained conversation on the one hand, and to versatile speech technologies that aim at substituting an artificial conversational agent for one of the partners on the other. The literature shows that two main challenges persist: (a) the need for original experiments that allow us to collect sufficient phonetic material to study and isolate the impact of the numerous factors influencing adaptation; (b) the need for automatic techniques for characterizing the degree of convergence, if any.
2 State of the Art

In the following section, several influential articles will be presented. These papers thoroughly summarize research about phonetic adaptation.

2.1 Convergence and Social Role

There are only a few studies that explain the role of convergence in a social interaction. Different interpretations have been given. First of all, convergence could be a consequence of the episodic memory system [17]. People keep a trace of all their multimodal experiences during social interaction. An exemplar-based retrieval of previous behavior given a similar social context is triggered, so that the current interaction benefits from previous attunement. Adaptation can also be used in a community to let a more stable form emerge across those present in the community [18], or to help people define their identity by categorizing others and themselves into groups that are constantly compared and evaluated [19]. Other studies have shown that convergence may help to accomplish mutual goals [20], align representations [18], increase the quality of an interaction [21], and furthermore contribute to mutual comprehension by decreasing social distance [21]. According to Labov [22], convergence could be due to the need to add emphasis to expression and could persist for the next interaction. Finally, adaptation could be interpreted as a behavioral strategy to achieve particular social goals such as approval [11, 23] or desirability [24].

2.2 Description of Key Studies on Phonetic Convergence

Pardo [16] examined whether pairs of talkers converged in their phonetic repertoire during a single conversational interaction called a map task. Six same-sex pairs were recruited to solve a series of 5 map tasks in which their roles – instruction giver or receiver – were exchanged. The advantage of the map task is that it collects landmark names that are uttered several times during the interaction by each interlocutor, since the receiver has to replicate the itinerary described by the giver. One or two weeks before any interaction, talkers read out the set of map-task landmark labels in order to obtain reference pronunciations. Just after the interaction, the same procedure was performed again to test the persistence of convergence, i.e., to distinguish stimulus-dependent mimicry from mimesis, which is supposed to originate from a deeper change of phonetic representations [25]. To measure convergence, 30 listeners were asked to judge the similarity between pronunciations of pre-, map- and post-task landmark labels in an AXB test, X being a map-task utterance and (A,B) pre-, map- or
post-task versions of the same utterance pronounced by the corresponding partner. Results of this forced choice showed significant main effects of exposure and persistence, but there was also a dependence on role and sex: givers' instructions converged more than receivers' instructions, particularly for female givers. This is in agreement with the results found by Namy [26]. Delvaux and Soquet [14] questioned the influence of ambient speech on the pronunciations of some keywords. These keywords were chosen in order to collect representatives of two sounds (the mid-open vowels [ɔ] and [ɛ]) whose allophonic variations are typical of the two dialects of French spoken in Belgium. During these non-interactive experiments, subjects were asked to describe a simple scene: "C'est dans X qu'il y a N Y" (It's in X that there are N Y), where X were locations, N numbers and Y objects. This description was either uttered by the subject or by recorded speakers using the same or the other dialect. Pre- and post-tasks were also performed, for the same reasons stated previously. The phonetic analysis focused on the production of the two sounds that were used in the two possible labels X. The authors looked for unintentional imitation. To characterize the amplitude of that change, they compared durations and spectral characteristics of the target sounds. In most cases, small but significant displacements towards the prototypes of the other ambient dialect were observed for both sounds (see the lowering of the canonical values in Tests 1 and 2 in Fig. 1). Similar unconscious imitation of characteristics of ambient speech has also been observed by Gentilucci et al. [27] for audiovisual stimulations.
Fig. 1. Results on spectral distance calculated by Delvaux and Soquet [14]. It can be seen that, during tests (Tests 1 & 2), subjects are getting away from their own reference (Pre-test) and closer to the other dialect (References 1 & 2).
Aubanel and Nguyen [28] also conducted experiments to study the mutual influence between French accents, i.e. northern versus southern, that could be part of the subjects’ experience. They have proposed an original paradigm to collect dense interactive corpora made up of uncommon proper nouns. They defined some criteria in order to discriminate the two accents, i.e. schwa, back mid vowels, mid vowels in word-final syllables, coronal stops, and nasal vowels. Uncommon proper nouns containing these segments are chosen so as to maximize coverage of alternative spellings. They chose
their subjects in a large high school and grouped them according to their sex and to similar scores on the Crowne-Marlowe [29] social desirability scale. One week before any interaction, subjects read out three sets of 16 names to obtain reference pronunciations. This session was repeated just after the interactions to measure mimesis. During the interaction, dyads were asked to associate names with photographs and the corresponding characters' statements. Aubanel and Nguyen used a Bayes classifier to automatically assign subjects to a group and tested different levels of convergence in the dyads (towards the interlocutor, the interlocutor's group and accent) using linear discriminant analysis performed on spectral targets. They found very few instances of convergence. Additionally, convergence was quite dependent on the critical segments analyzed, the sessions and the pairs.

2.3 Comments

These studies show that phonological and phonetic convergence is very weak. The experimental paradigms used so far either collect few instances (typically a dozen in Aubanel and Nguyen) of a few key segments or many instances of a very small set of key segments (two in Delvaux and Soquet). These segments are always produced in a controlled context within key words. Both studies have focused on inter-dialectal convergence and on segments that carry most of the dialectal variation. This a priori choice is questionable, since it remains to be shown that subjects negotiate these critical segments earlier or more easily than others. Since convergence is segment-dependent, it is interesting to study the speakers' alignment on the common repertoire of their mother tongue. In our experiments, we will examine the convergence of the eight French peripheral oral vowels. In most studies, interlocutors or ambient speech are not known a priori by the subjects. The authors were certainly expecting to observe on-line convergence as the dialog proceeds. The hypothesis that adaptation and alignment are immediate and fast is questionable: in the following we will compare the convergence of unknowns with that of good friends.

Table 1. First speech dominoes used in the interactive scenario. Interlocutors have to choose and utter alternately the rhyming words. Correct chainings of rhymes are highlighted with a dark background.
spk 1   spk 2   spk 1   spk 2   spk 1   spk 2   spk 1
rotnr   tordy   5imi    5ema    leto    Ieri    b'rly
dyre    repi    pile    kepi    todi    …
3 Material and Protocol

During our experiments, speakers were instructed to choose between two words displayed on a computer screen.
3.1 Speech Dominoes

The rule of the game is quite simple. Speakers have to choose, between two words, the one that begins with the same syllable as the final syllable of the word previously uttered by the interlocutor (see Table 1). Such rhyme games – here, speech dominoes – are part of children's folklore and widely used in primary school, for example for language learning. We decided to chain simple disyllabic words such as bateau [bato], taudis [todi], diffus [dify], furie [fyri], etc. We used only disyllabic words in order to limit the cognitive load and ease the running of successive sessions. The words were chosen to uniformly collect allophonic variations of the eight peripheral oral vowels of French: [a], [ɛ], [e], [i], [y], [u], [o], [ɔ]. To force mutual attention during the interaction, the word list was built so that the speaker could not guess the next domino given the sole history of the dialog. In fact, he has to pay attention to the word uttered by his interlocutor to decide which "domino" he will have to utter next. For instance, spk 2, after having chosen [tordy] in Table 1, will be presented with the following two alternatives, namely [ʃema] and [repi]. Since [dyʃe] and [dyre] are two valid French common words with almost the same word frequency, spk 2 will have to wait until spk 1 chooses the right rhyme to decide on his own. A chain of 350 dominoes was thus established, which permitted us to collect almost 40 exemplars of each peripheral oral vowel (see Table 2); a simple automatic check of this chaining constraint is sketched below, after Table 2.

Table 2. Number of phones collected for each speaker during the dominoes' game. 350 CV or CVC syllables are pronounced in total.
phones:  a   ɛ   e   i   y   u   o   ɔ   others
#items:  47  48  45  43  44  40  43  31  9
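A minimal sketch of the chain check mentioned above: given phonetic transcriptions of the disyllabic words, each domino must begin with the final syllable of its predecessor. The helper and the example chain are hypothetical illustrations, not the material actually used.

```python
from collections import Counter

def valid_chain(words):
    """Check that each word begins with the final syllable of its predecessor.
    Words are given as (first_syllable, second_syllable) tuples of phonetic strings."""
    return all(nxt[0] == prev[1] for prev, nxt in zip(words, words[1:]))

# Hypothetical chain in the spirit of bateau -> taudis -> diffus -> furie
chain = [("ba", "to"), ("to", "di"), ("di", "fy"), ("fy", "ri")]
print(valid_chain(chain))  # True

# Rough check of vowel coverage, assuming simple CV syllables
vowel_counts = Counter(syllable[-1] for word in chain for syllable in word)
print(vowel_counts)
```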
Fig. 2. Face-to-face interaction

3.2 Conditions

The speakers pronounced dominoes under different conditions. First of all, we needed to obtain references for each speaker; we called this condition the pre-test. To do this, they
uttered a list of 350 words before any dialog with their interlocutor. The pre-test words were the same as those pronounced by the two speakers during the dominoes game. This allowed us to characterize each speaker's phonetic space and to measure the amplitude of adaptation, if any.

3.3 Experiments

In this paper, we only contrast the pre-test condition and the interactive game played during three experiments:
• Experiment I: speakers were in two different rooms and communicated through microphones and headphones. This setup was easy to realize thanks to the MICAL platform of our laboratory (two rooms separated by a tinted mirror). Speakers were unknown to each other.
• Experiment II: same as Experiment I but with a reduced set of good friends, i.e., people who have known each other or worked together for a long time (mean of 15 years, ranging from 10 to 25 years).
• Experiment III: speakers were in a face-to-face interaction. We also studied dyads of good friends (from 6 months to 3 years and 6 months).
In both settings, speakers were instructed to avoid speech overlaps and repairs so as to ease automatic segmentation and alignment.

3.4 Experimental Settings

For Experiments I and II, people played through sets of microphones and headphones. Signals were digitized at 16 kHz thanks to a high-quality stereo sound card. Dominoes were displayed as a pdf file on two computer screens. For Experiment III, the setting is quite different. Speakers sat on either side of a table facing two back-to-back computer screens. They were recorded with a camera – a mirror allowed us to capture both interactants (see Figure 2) – and their head movements were monitored using four infrared cameras (Qualisys® system). We used two keyboards connected to the same computer to forward turns: when a speaker finishes uttering his domino, he presses a key on his keyboard to display the two choices for his next turn on his own screen.

3.5 Characterization

Delvaux and Soquet [14] noticed that a global automatic analysis of spectral distributions by MFCC (Mel Frequency Cepstral Coefficients) leads to a quasi-identical but more robust characterization of convergence than a more detailed semi-automatic phonetic analysis such as formant tracking. Aubanel and Nguyen [28] similarly used automatic recognition techniques to recognize idiolects. Here, we trained phone-sized context-independent HMMs with 5 states using HTK on the pre-test data. The input parameters are the first 12 MFCCs + energy + the deltas of these parameters. After various forced alignments, we compared the distributions of normalized self vs. other's recognition scores of the central states of each vowel (see Fig. 4 and Fig. 5). Paired t-tests were also performed to compare changes in the distributions of scores of vowels produced in the same words (175 words for each speaker).
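A rough open-source equivalent of this characterization pipeline could look as follows; it is a sketch only, with librosa and hmmlearn standing in for HTK, 13 MFCCs (the energy-like c0 replacing the explicit energy term) plus deltas standing in for the 12 MFCC + energy + deltas parameterization, and whole-segment average log-likelihoods standing in for the central-state scores used here.

```python
import numpy as np
import librosa
from hmmlearn import hmm

def features(wav_path, sr=16000):
    """26-dim frames: 13 MFCCs (c0 as an energy stand-in) + their deltas."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    feats = np.vstack([mfcc, librosa.feature.delta(mfcc)])
    return feats.T  # shape (frames, 26)

def train_vowel_model(segments, n_states=5):
    """Train one context-independent vowel HMM on pre-test segments,
    where segments is a list of (frames, 26) arrays cut at vowel boundaries."""
    X = np.vstack(segments)
    lengths = [len(s) for s in segments]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
    model.fit(X, lengths=lengths)
    return model

def per_frame_score(model, segment):
    """Average log-likelihood per frame of one vowel token under a speaker's model."""
    return model.score(segment) / len(segment)

# Convergence indicator: during the interaction, scores under the speaker's own
# model should drop while scores under the interlocutor's model rise.
# score_self = per_frame_score(model_speaker, token)
# score_other = per_frame_score(model_interlocutor, token)
```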
We aligned the signals with a network of pronunciation variants for each word to semi-automatically segment them. This segmentation was then checked by hand (by three different annotators). Devoiced or creaky vowels – often the high vowels [i], [y] in unvoiced contexts – were discarded. Dialectal variations were also considered: allophonic variations of mid-vowels (open or closed) were determined according to the speaker-dependent partition of the range of the first formant.
4 Results

4.1 Phonological Variations

Despite the fact that our corpus was not designed to enhance dialectal variations (unlike Aubanel and Nguyen [28]), we observed some dialectal variations, mainly concerning allophonic variations of mid-vowels. Most participants came from the North of France and thus used exclusively open vowels in closed syllables (e.g. sabord /sabɔʁ/ vs. sabot /sabo/). Other interlocutors spectrally contrasted minimal pairs such as vallée vs. valais (/vale/ vs. /valɛ/), miné vs. minet (/mine/ vs. /minɛ/), etc. We observed few cases of phonological adaptation, i.e., subjects adopting a pronunciation different from the one chosen in their pre-test to get closer to the pronunciation of their interlocutor. Most interactions resulted in convergence of allophonic choices (see Fig. 3), but this is not significant due to the limited data. For example, for the vowel [e], a mutual adaptation can be seen during the interaction between ALa and MGB and also between ALa and MSM.
Fig. 3. Proportion of peripheral mid vowels pronounced as closed by 4 pairs (left: the vowel [e]; right: vowel [o]). The initiator was the same female ALa interacting with 3 males (MGB, MMP, MSM) and one female (FLD). For each pair, bars represent the proportion uttered during respectively the ALa pretest, ALa interacting with her interlocutor, her interlocutor interacting with ALa and the interlocutor’s pretest.
We should mention, however, that the labeling of allophonic variations of mid-vowels is very difficult, since French speakers now have a tendency to front mid-closed vowels [30-31]. The labeling is particularly difficult in non-accented positions, where vowel undershoot or coarticulation may override perceptual intuition. We always privileged labeling based on objective measurements, which tend to favor mid-closed options.

4.2 Sub-phonemic Convergence

Given the assumption that each phone was properly labeled, we compared the distributions of normalized recognition scores of the pre-test and interactive utterances, as explained previously. These utterances were recognized first by the speaker's own HMMs and then by the HMMs of his interlocutor. For pre-test data, we expected high scores for HMMs tested on their own training data, by construction, and lower scores for the HMMs of the interlocutor. The recognition score of each vowel is the average log likelihood per frame for the central state of the corresponding HMM. The difference between the scores thus reflects, to some extent, the inter-speaker distance. Convergence would be characterized by a decrease of the scores given by the self HMMs and an increase of the scores given by the other's HMMs. The recognition is thus performed by the HMM models of each speaker and of his/her interlocutor. Fig. 4 and Fig. 5 compare the distributions of normalized recognition scores for the pre-test (left) versus the interaction (right). Scores are typically higher for phones uttered by a speaker and recognized by his own HMMs. In the case of an interaction between two unknowns (cf. Fig. 4), the distributions computed for the interactive speech do not change much. We observe stronger convergence in the case of good friends (cf. Fig. 5).
Fig. 4. Distribution of recognition scores for the vowels of disyllabic words produced by two unknowns. The recognition is performed by their own HMM models and by the HMM models of their interlocutor. Scores are expected to be higher when using their own HMM models. Left: scores for word lists read aloud in isolation; this speech data is used to train the speaker-specific HMM models. It can be seen that using each interlocutor's own model (lmb_lmb and rl_rl) yields higher recognition scores than cross recognition (lmb_rl and rl_lmb). Right: same words pronounced in a verbal domino game. In this case, we expect a decrease of the recognition scores obtained with each interlocutor's own model (lmb_rl_lmb and rl_lmb_rl) and an increase of the recognition scores obtained with cross recognition (lmb_rl_rl and rl_lmb_lmb). Here, only small adjustments are observable (a weak shift to the left for lmb_rl_lmb and rl_lmb_rl and to the right for lmb_rl_rl and rl_lmb_lmb).
Fig. 5. Same as Fig. 4 but for disyllabic words produced by two old friends. Stronger convergence is observed here, as reflected by the larger shifts.
4.3 Distributions of Recognition Scores

Fig. 6 shows the average convergence rate for all dyads recorded in the two
experiments. This is computed from the relative distance of the vocalic targets produced in the pre-test vs. the interaction (central state of the HMM alignment). As in Delvaux and Soquet [14], a linear discriminant analysis is performed on the target MFCC parameters for each vowel to categorize the two interlocutors' vocalic spaces into two distinct groups. For each pair, pre-test and interactive vocalic targets are projected onto the first discriminant axis. A normalized convergence rate is then computed by dividing the distance between the targets produced during the interaction by the distance between the targets produced during the pre-test for the same word.
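The core of this normalization can be sketched with scikit-learn's LDA; the function below is an illustration under simplifying assumptions (pooled per-speaker means rather than per-word rates, and no random pre-test split for the reference), and the input arrays are hypothetical MFCC target matrices.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def convergence_rate(pre_a, pre_b, inter_a, inter_b):
    """Project the MFCC vowel targets of speakers A and B onto the first discriminant
    axis fitted on their pre-test data, then compare the A-B distance during the
    interaction with the pre-test distance (1.0 = no change, 0.0 = full convergence)."""
    X = np.vstack([pre_a, pre_b])
    y = np.array([0] * len(pre_a) + [1] * len(pre_b))
    lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)

    d_pre = abs(lda.transform(pre_a).mean() - lda.transform(pre_b).mean())
    d_int = abs(lda.transform(inter_a).mean() - lda.transform(inter_b).mean())
    return d_int / d_pre
```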
Fig. 6. Average convergence rate (calculated over 100 iterations) of the vocalic targets of the interlocutors for all conditions. A linear discriminant analysis on one random half of the pre-test data has been used to separate each interlocutor's vocalic space. A reference discriminant distance between interlocutors is thus obtained and is used to calculate normalized convergence rates. First, it is used to calculate the convergence rate on the other half of the pre-test, which provides the reference departure point for each interlocutor. The two dotted lines represent the mean of the pre-test of the tested subjects (line 0) and of the reference subjects (line 1). Distributions displayed with bold lines are significantly different (p<0.05) from the corresponding pre-test (reference departure point). Note that only two significant divergences are found (one speaker in pair number 12 and one in pair 22). Most cases of convergence are observed for pairs of the same sex.
Fig. 7. Convergence rates for two different pairs, detailed for each vowel. Pair (a) shows no convergence at all except for the mid-open vowel [ɔ], while pair (b) exhibits complete mutual adaptation. The rates are calculated in the same way as in Figure 6. For each panel (or interaction), the outer dotted line corresponds to the reference subject and the other dotted line to the tested subject. The darker grey corresponds to the reference subject's convergence rates and the lighter one to those of the tested subject
Fig. 8. First discriminant space projection of the MFCC targets for [ɔ] produced by speaker alb (dark dispersion ellipse for the pre-test, drawn at the center) interacting successively with three interlocutors A, B, C (pre-test ellipses located at the periphery). Realizations during the interactions are displayed with unfilled ellipses for alb and with filled ellipses, in the same color as the pre-test, for the interlocutors. While A and B converge towards alb, alb and C do not adapt.
Convergence is not systematic. We can see that the phenomenon is amplified for pairs of the same sex, and particularly for women (this may be due to a priori more similar speakers). We used this observation for the last seven interactions by selecting only women for the experiment, and the results confirmed it. An ANOVA was performed to assess the significance of adaptation. Distributions with significant convergence rates are drawn in bold.
4.4 Convergence of Vocalic Targets

Convergence is a vowel- and interlocutor-dependent phenomenon. Fig. 7 shows that some pairs do not adapt at all while others show a significant mutual adaptation. To study interlocutor-specific adaptation strategies, our game initiators interacted with 2 to 5 different interlocutors. Fig. 8 illustrates examples of target- and interlocutor-specific behaviors. When considering each vowel separately (40 occurrences on average, see Table 2), we do observe cases of full convergence (speakers A and B in Fig. 8). Note, however, that our analysis is based on relative distances and should take into account the whole structure of the vocalic space of the interlocutors. Speakers notably fill their available acoustic space differently, especially between mid-vowels [32]. An evolution of convergence rates with time was expected. Convergence rates as a function of time were plotted for each interlocutor, but nothing relevant was observed. Perhaps the proposed task was too short to observe this phenomenon.
Fig. 9. Mean changes of relative difference between F0 registers. As expected the values are higher for pairs with different sex. No significant narrowing of this difference is induced by interaction. This can be due to the task that imposes short utterances.
Fig. 10. Mean changes of relative difference between syllabic durations
4.5 Prosody

Fig. 9 shows that the fundamental frequency register was relatively unaffected by the interaction. The exchange of simple words does not favor the attunement of melody. Convergence of speech rhythm is clearly observed, certainly due to the 'ping-pong' nature of the task. This may also be due to the fact that speech rhythm was much quicker in interactive speech than in isolated word reading (cf. Fig. 10), with a notable shortening of final syllables. This is probably due to the task focusing on rhyme matching. Delvaux and Soquet [14] advise discarding final syllables when studying phonetic convergence. In our case, we did not find any difference between global and partial statistics, except for a stronger convergence of the durations of the vocalic nuclei of initial syllables.
5 Conclusions and Perspectives

We proposed here an original speech game that quickly collects many instances of target sounds, with a mutual influence that forces the interlocutors to engage in active action-perception loops. The distribution of target sounds can be explicitly controlled to observe convergence in action, if any. We found occurrences of strong phonetic convergence with only one instance of small divergence. This convergence strongly depended on the dyads – with the strongest convergence observed for pairs of the same sex – and seemed to be phoneme-dependent. We used this observation to select our last subjects, and the results confirmed a stronger convergence for dyads composed of women. These objective measurements should be confirmed by subjective assessments such as those promoted by Pardo [16]. We are also planning to conduct a series of subjective tests to determine whether adapted stimuli offer a clearer perceptual benefit for listeners compared to non-adapted stimuli. Perception of degraded stimuli, as used by Adank et al. [7], is an interesting option. This gaming paradigm will now be used to select subjects and dyads who exhibit the strongest adaptation abilities and to study more complex conversational situations. These data will be used to train speech synthesis engines that will implement these adaptation strategies. Such interlocutor-aware components are certainly crucial for creating social rapport between humans and virtual conversational agents [33].

Acknowledgments. This work has been financed by ANR Amorces and by the Cluster RA ISLE. We thank Frederic Elisei, Sascha Fagel and Loïc Martin for their help.
References 1. Giles, H., et al.: Speech accommodation theory: The first decade and beyond. In: McLaughlin, M.L. (ed.) Communication Yearbook, pp. 13–48. Sage Publishers, London (1987) 2. Brennan, S.E., Clark, H.H.: Lexical choice and conceptual pacts in conversation. Journal of Experimental Psychology: Learning, Memory, and Cognition 22, 1482–1493 (1996)
3. Lockridge, C.B., Brennan, S.E.: Addressees needs influence speakers early syntactic choices. Psychonomic Bulletin and Review 9, 550–557 (2002) 4. Zoltan-Ford, E.: How to get people to say and type what computers can understand. International Journal of Man-Machine Studies 34, 527–547 (1991) 5. Ward, A., Litman, D.: Dialog convergence and learning. In: International Conference on Artificial Intelligence in Education (AIED), Los Angeles, CA (2007) 6. Pickering, M., et al.: Activation of syntactic priming during language production. Journal of Psycholinguistic Research 29(2), 205–216 (2000) 7. Adank, P., Hagoort, P., Bekkering, H.: Imitation improves language comprehension. Psychological Science 21, 1903–1909 (2010) 8. Lakin, J., et al.: The chameleon effect as social glue: evidence for the evolutionary significance of nonconscious mimicry. Nonverbal Behavior 27(3), 145–162 (2003) 9. Allwood, J.: Bodily communication - dimensions of expression and content. In: Granström, B., House, D., Karlsson, I. (eds.) Multimodality in Language and Speech Systems, pp. 7– 26. Kluwer Academic Publishers, Dordrecht (2002) 10. Kopp, S.: Social resonance and embodied coordination in face-to-face conversation with artificial interlocutors. Speech Communication 52(6), 587–597 (2010) 11. Gregory, S.W., Webster, S.: A nonverbal signal in voices of interview partners effectively predicts communication accommodation and social status perceptions. Journal of Personality and Social Psychology 70, 1231–1240 (1996) 12. Edlund, J., Heldner, M., Hirschberg, J.: Pause and gap length in face-to-face interaction. In: Interspeech, Brighton (2009) 13. Kousidis, S., et al.: Towards measuring continuous acoustic feature convergence in unconstrained spoken dialogues. In: Interspeech, Brisbane (2008) 14. Delvaux, V., Soquet, A.: The influence of ambient speech on adult speech productions through unintentional imitation. Phonetica 64, 145–173 (2007) 15. Benus, S.: Are we ’in sync’: Turn-taking in collaborative dialogues. In: Interspeech, Brighton (2009) 16. Pardo, J.S.: On phonetic convergence during conversational interaction. Journal of the Acoustical Association of America 119(4), 2382–2393 (2006) 17. Dijksterhuis, A., Bargh, J.A.: The perception-behavior expressway: automatic effects of social perception on social behavior. Advances in Experimental Social Psychology 33, 1– 40 (2001) 18. Garrod, S., Doherty, G.: Conversation, co-ordination, and convention: An empirical investigation of how groups establish linguistic conventions. Cognition & Emotion 53, 181–215 (1994) 19. Tajfel, H., Turner, J.: An integrative theory of intergroup conflict. In: Austin, W.G., Worchel, S. (eds.) The Social Psychology of Intergroup Relations, pp. 94–109. BrooksCole, Monterey (1979) 20. Clark, H.H.: Using Language. Cambridge University Press, Cambridge (1996) 21. Babel, M.E.: Phonetic and social selectivity in speech accommodation. In: Department of Linguistics, p. 181. University of California, Berkeley (2009) 22. Labov, W.: The anatomy of style-shifting, in Style and Sociolinguistic Variation. In: Eckert, P., Rickford, J.R. (eds.), pp. 85–108. Cambridge University Press, Cambridge (2001) 23. Giles, H., Clair, R.: Language and Social Psychology. Blackwell, Oxford (1979) 24. Natale, M.: Social desirability as related to convergence of temporal speech patterns. Perceptual Motor Skills 40, 827–830 (1975)
25. Donald, M.: Origins of the Modern Mind: three stages in the evolution of culture and cognition. Harvard University Press, Cambridge (1991) 26. Namy, L.L., Nygaard, L.C., Sauerteig, D.: Gender differences in vocal accommodation: The role of perception. Journal of Language and Social Psychology 21, 422–432 (2002) 27. Gentilucci, M., Bernardis, P.: Imitation during phoneme production. Neuropsychologia 45(3), 608–615 (2007) 28. Aubanel, V., Nguyen, N.: Automatic recognition of regional phonological variation in conversational interaction. Speech Communication 52, 577–586 (2010) 29. Crowne, D.P., Marlowe, D.: A new scale of social desirability independent of psychopathology. Journal of Consulting Psychology 24, 349–354 (1960) 30. Coveney, A.: The Sounds of Contemporary French: Articulation and Diversity. Elm Bank Publications, Exeter (2001) 31. Boula de Mareüil, P., et al.: Accents étrangers et régionaux en français: Caractérisation et identification. Traitement Automatique des Langues 49(3), 135–163 (2008) 32. Ménard, L., Schwartz, J.-L., Aubin, J.: Invariance and variability in the production of the height feature in French vowels. Speech Communication 50(1), 14–28 (2008) 33. Gratch, J., Wang, N., Gerten, J., Fast, E., Duffy, R.: Creating rapport with virtual agents. In: Pelachaud, C., Martin, J.-C., André, E., Chollet, G., Karpouzis, K., Pelé, D. (eds.) IVA 2007. LNCS (LNAI), vol. 4722, pp. 125–138. Springer, Heidelberg (2007)
Towards the Acquisition of a Sensorimotor Vocal Tract Action Repository within a Neural Model of Speech Processing Bernd J. Kröger1, Peter Birkholz1, Jim Kannampuzha1, Emily Kaufmann2, and Christiane Neuschaefer-Rube1 1 Department of Phoniatrics, Pedaudiology, and Communication Disorders, University Hospital Aachen and RWTH Aachen University, Aachen, Germany {bkroeger,pbirkholz,jkannampuzha,cneuschaefer}@ukaachen.de 2 Human Technology Centre, RWTH Aachen University, Aachen, Germany [email protected]
Abstract. While a mental lexicon stores phonological, grammatical and semantic features of words, a vocal tract action repository is assumed to store inner motor and sensory representations of speech items (i.e. the sounds, syllables and words) of the speaker’s native language. On the basis of a neural model of speech processing, which comprises important cognitive and sensorimotor aspects of speech production, perception, and acquisition (Speech Commun 51, 793–809, 2009), this paper will outline how a sensorimotor vocal tract action repository can be acquired in a self-organizing neural network structure which is trained using unsupervised associative learning. Keywords: Speech actions, neural model, speech production, speech perception, speech acquisition, mental lexicon, neural network, self-organization.
1 Introduction
Neural models of speech processing aim to account for cognitive, sensory, and motor aspects of speech production and perception ([1], [2], and [3]). While the mental lexicon plays a major role as a repository for the cognitive linguistic description of words [4], a mental syllabary is presumed to be the central repository for the sensory and motor representation of frequent syllables ([4], [5], and [6]). A comparable module, which we will call the sensorimotor vocal tract action repository, is presumed to represent the mental syllabary in our approach [3]. The central structural feature of this module is a self-organizing map (a phonetic map or hypermodal action map: P-MAP). This map associates the motor, sensory, and phonemic states of the most frequent syllables. Our model has already been tested for a limited model language data set comprising a simple vowel and consonant system with 45 CV- and 20 CCV-syllables (V = vowel, C = consonant; [7] and [8]). In this paper, a simulation experiment will be described in which the system acquired a basic set of 200 syllables of a natural language, i.e. Standard German in the case of this study.
2 The Neural Model Our neural model comprises two knowledge repositories, i.e. the mental lexicon and the action repository, as well as modules for neuromuscular and perceptual processing (Fig. 1). Phonological and semantic processing modules outside the mental lexicon are not yet integrated into the model. Word production starts with local neural activations within the semantic self-organizing map (S-MAP). Here, one model neuron represents one lexical item, i.e. one word. This neural activation leads to a coactivation of a distributed neural activation pattern, representing the semantic state of that word. The S-MAP is also connected with the phonetic self-organizing map (PMAP), leading to a co-activation of those model neurons within the P-MAP which represent the syllables of that word. Thus, phonemic states, motor plans, and internal sensory states are also co-activated for these syllables ([3] and [9]). This activation triggers the execution (i.e. articulation) of the word. Then, the still-activated inner sensory states of each syllable can be compared with their external sensory states using the articulatory-acoustic model (sensorimotor feedback loop). A detailed babbling and imitation training which establishes the phonetic map and the neural associations with the motor plan and sensory maps has been described for V- and CVsyllable states [3] and for V-, CV- and CCV-syllable states; see [7] and [8].
Fig. 1. Structure of the neural model of speech processing. Light blue boxes indicate processing modules; dark blue boxes indicate self-organizing maps (S-MAP and P-MAP) or neural state maps, i.e. the semantic, phonemic, auditory, somatosensory, and motor plan state map.
3 Method: Training the Model
A word and syllable list was assembled based on our corpus of Standard German children's books, which comprises transcriptions of 40 books targeted to children between one and six years of age. This corpus comprises 6513 sentences and 70512 words in total, with morphologically distinct forms of the same word counted as separate words (e.g. 'kleine' and 'kleinen', two forms of the word 'klein', meaning small, which are used with nouns of different grammatical genders and in different grammatical cases). A further analysis revealed that the corpus comprises 8217 different words, which is assumed to approximately represent a six-year-old child's mental lexicon (Tab. 1). These words were phonetically transcribed using phonetic transcription rules for Standard German [10]. There were 4763 different syllables found in the transcription, of which 2139 syllables can be defined as frequent syllables: 96% of the corpus sentences can be produced using only these 2139 syllables (Tab. 2). The 200 most frequent syllables, including phonetic simplifications which typically occur in children's word production (e.g. elisions and assimilations of sounds [12]), comprise CV-, CVC-, CVCC-, CCV-, and CCVC-syllables. Typical frequent CV-syllables comprise the consonants [t, >+f+m+c+y+a+k+q+r+g+ l+j+k+e+u+B+R+o+ w+i\together with the vowels [?+T+H,`H+`+h9+D5+`T+d9+D+5+N+t9+`9+d9+ n9+`5+ n5+h5+x9\. Typical frequent CVC-syllables are [!>Tm+!c`r+ !>`Hm+!>Hr+!>Dr+ !g`s+ !>`Te+!lHs+!>`m+!mHB+!yHB+ !y`9s, !>HB+ !>Hl+ !>`Tr] ([!] indicates a stressed syllable). Typical frequent CVCC-syllables are [!>Tms+ !>Hrs+ü!>`kr+ !mHBs+!y`9js+ !ln9ms+ !>`ks+ !jNls]. Typical frequent CCV-syllables are [!srt9+sr?+ !jk`H+!sr`H+!Roh9]. Typical frequent CCVC-syllables are [!sr?m] and [!RtD5n].

Table 1. The ten most frequent words in the categories noun, verb, adjective/adverb and other (i.e. pronouns and particles; particles comprise prepositions, conjunctions, and interjections [11]) in our corpus of Standard German; N = frequency of occurrence of that word.

Nouns                      N     Verbs                N     Adj./Adv.              N     Others             N
"Mama" (mom)             392     "ist" (is)         793     "kleine" (little)    287     "und" (and)     2367
"Bär" (bear)             278     "hat" (has)        448     "mehr" (more)        126     "die" (the)     1678
"Papa" (dad)             235     "sagt" (says)      413     "schnell" (fast)      90     "der" (the)     1644
"Mond" (moon)            217     "war" (was)        246     "viel" (much)         75     "sie" (she/it)  1391
"Kinder" (children)      190     "kann" (can)       184     "kleinen" (little)    74     "das" (the)      891
"Katze" (cat)            147     "wird" (will be)   159     "fest" (fixed)        67     "den" (the)      831
"Frau" (wife)            145     "will" (want)      156     "genau" (exactly)     60     "ein" (a)        781
"Bett" (bed)             106     "sagte" (said)     131     "großen" (large)      59     "er" (he)        777
"Mädchen" (girl)         105     "muss" (must)      120     "einfach" (simple)    58     "es" (it)        764
"Wasser" (water)         104     "sieht" (sees)     112     "große" (large)       58     "in" (in)        616
Table 2. Number N of most frequent syllables occurring at least M times within the corpus, and percentage of text or speech which can be produced using only these syllables.

Number N of most       Minimum number M of instances of      Percentage of sentences within the corpus which can
frequent syllables     each of these N frequent syllables    be produced using the N most frequent syllables
 477                   >= 40                                  75%
 856                   >= 20                                  85%
1396                   >= 10                                  91%
2139                   >=  5                                  96%
2843                   >=  3                                  98%
3475                   >=  2                                  99%
4763                   >=  1                                 100%
The training of the phonetic map (P-MAP) was done in two steps. First, the training set (comprising phonemic, auditory, and motor plan states) was established for the 200 most frequent syllables. This was done by (i) choosing one acoustic realization of each syllable produced by one speaker of Standard German (33 years old, male), who uttered a selection of the sentences listed in the children's book corpus, and (ii) applying an articulatory-acoustic re-synthesis method [13] in order to generate the appropriate motor plans. Each auditory state is based on the acoustic realization and is represented in our model as a short-term memory spectrogram comprising 24 × 65 neurons, where 24 rows of neurons represent the 24 critical bands (20 to 16000 Hz) and where 65 columns represent successive time intervals of 12.5 ms each (overall length of short-term time interval: 812.5 ms). The degree of activation of each neuron represents the spectral energy within a time-frequency interval. Each motor plan state is based on the motor plan generated by our re-synthesis method [13] and is represented in the neural model by a vocal tract action score as introduced in [14]. The score is determined by considering (i) a specification of the temporal organization of vocal tract actions within each syllable (i.e. 11 action rows over the whole short-term time interval: 11 × 65 neurons) and (ii) a specification of each type of action (4 × 17 for consonantal and 2 × 15 for vocalic actions; assuming CCVCC as the maximally complex syllable structure). Each phonemic state is based on the discrete description of all segments (allophones) of each syllable: 159 neurons in total. In the second step, this syllabic sensorimotor training set, covering the 200 most frequent syllables, was applied in order to train three P-MAPs of different sizes, i.e. self-organizing neuron maps with 15 × 15, 20 × 20, and 25 × 25 neurons, respectively. 5000 incremental training cycles were computed using standard training conditions for self-organizing maps [3]. The training of the P-MAP can be called associative training since phonemic, motor, and sensory states are presented synchronously to the network for each syllable. Each cycle comprised 703 incremental training steps, and each syllable was represented within the training set proportionally to the frequency of its occurrence in the children's book corpus; i.e. the most frequent syllable occurred 25 times per training cycle, while the least frequent syllable (number 200 in the ranking) occurred once per cycle. Thus, the least frequent syllable appeared 5000 times in total, and the most frequent syllable appeared 125000 times in total in the training.
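As a rough illustration of this associative training scheme (a generic Kohonen-style self-organizing map, not the authors' implementation), the sketch below trains a P-MAP-like grid on joint phonemic/auditory/motor state vectors, presenting each syllable in proportion to its corpus frequency. All parameter values, array layouts and names are assumptions for the example.

```python
import numpy as np

# Hypothetical state dimensionality: phonemic + auditory spectrogram + motor plan score.
STATE_DIM = 159 + 24 * 65 + (11 * 65 + 4 * 17 + 2 * 15)

def train_pmap(states, frequencies, map_size=25, cycles=5000,
               lr0=0.5, sigma0=None, rng=np.random.default_rng(0)):
    """Kohonen-style associative training of a phonetic map.

    states: (n_syllables, state_dim) joint phonemic/auditory/motor vectors.
    frequencies: per-syllable presentation counts per cycle (1..25 in the paper).
    """
    sigma0 = sigma0 or map_size / 2.0
    weights = rng.random((map_size, map_size, states.shape[1]))
    yy, xx = np.mgrid[0:map_size, 0:map_size]          # grid coordinates for the neighbourhood
    # one training cycle presents each syllable proportionally to its frequency
    schedule = np.repeat(np.arange(len(states)), frequencies)
    for cycle in range(cycles):
        t = cycle / cycles
        lr = lr0 * (1.0 - t)                            # decaying learning rate
        sigma = sigma0 * (1.0 - t) + 1.0                # shrinking neighbourhood radius
        rng.shuffle(schedule)
        for idx in schedule:
            x = states[idx]
            d = np.linalg.norm(weights - x, axis=2)     # distance to every map neuron
            wy, wx = np.unravel_index(np.argmin(d), d.shape)   # best-matching unit
            h = np.exp(-((yy - wy) ** 2 + (xx - wx) ** 2) / (2 * sigma ** 2))
            weights += lr * h[..., None] * (x - weights)       # move neighbourhood toward x
    return weights
```

Because the phonemic, auditory and motor parts of each state vector are updated together, neurons that come to represent a syllable also carry its associated sensory and motor patterns, which is the essence of the associative training described above.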
4 Results
Our simulation experiments indicate that a P-MAP comprising at least 25 × 25 neurons is needed in order to represent all 200 syllables. 158 syllables were represented in the 15 × 15 phonetic map, and 176 syllables were represented in the 20 × 20 map (see Fig. 2) after training was complete.
Fig. 2. Organization of the 20 × 20 neuron P-MAP. Each box represents a neuron within the self-organizing neural map. A syllable appears only if the activation of its phonemic state is greater than 80% of maximum activation.
While most of the syllables are represented by only one neuron in the 15 × 15 map, approximately the 100 most frequent syllables are represented by two or more neurons in the 20 × 20 and 25 × 25 maps. This allows the map to represent more than one realization for each of these syllables (e.g. [!c`] is represented by 3 neurons, while [!c`m] and [!j`m] are represented by only one neuron each in the 20 × 20 map:
see Fig. 2). It should be noted that the syllables in Figure 2 are loosely ordered with respect to syllable structure (e.g. CV vs. CCV or CVC), vowel type (e.g. [i] vs. [a]) and consonant type (e.g. plosive vs. fricative or nasal).
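One way to read out a trained map of this kind — deciding which syllable, if any, to display at each neuron, as in Fig. 2 — is sketched below. The 80% criterion mirrors the figure caption, but the cosine-similarity measure and the assumed weight layout (phonemic part first) are illustrative choices, not the authors' procedure.

```python
import numpy as np

def label_map_neurons(weights, phonemic_states, syllable_names, threshold=0.8):
    """Label each map neuron with the syllable whose phonemic state best matches
    the phonemic part of the neuron's weight vector, keeping the label only if
    the cosine similarity exceeds `threshold`."""
    m, n, _ = weights.shape
    p_dim = phonemic_states.shape[1]
    norms = np.linalg.norm(phonemic_states, axis=1, keepdims=True)
    p_states = phonemic_states / np.maximum(norms, 1e-12)     # unit-length syllable states
    labels = np.full((m, n), None, dtype=object)
    for i in range(m):
        for j in range(n):
            w = weights[i, j, :p_dim]                          # phonemic sub-vector (assumed layout)
            w = w / (np.linalg.norm(w) + 1e-12)
            sim = p_states @ w                                 # cosine similarity to every syllable
            k = int(np.argmax(sim))
            if sim[k] >= threshold:
                labels[i, j] = syllable_names[k]
    return labels
```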
5 Discussion
Our neural model of speech processing as developed thus far is capable of simulating the basic processes of acquiring the motor plan and sensory states of frequent syllables of a natural language by using unsupervised associative learning. This process is illustrated here on the basis of our Standard German children's book corpus, in which 96% of fluent speech can be produced using only the 2000 most frequent syllables. These frequent syllables are assumed to be produced directly by activating stored motor plans, without using complex motor processing routines. In our neural network model, the sensory and motor information about frequent syllables is stored by the dynamic link weights of the neural associations occurring between a self-organizing P-MAP and neural state maps for motor plan, auditory, somatosensory, and phonemic states. Thus, a neuron within the P-MAP represents a syllable, which – if activated – leads to a syllable-specific activation pattern within each neural state map. These neural activations represent "internal speech" or "verbal imagery" [15], i.e. "how to articulate a syllable" (motor plan state), "what a syllable sounds like" (auditory state), and "what a syllable articulation feels like" (somatosensory state), without actually articulating that syllable. While in earlier experiments our simulations were based on an artificial and completely symmetric model language, comprising five vowels [i, e, D, o, a] and nine consonants [b, d, g, p, t, k, m, n, l] and all combinations of vowels and consonants as CV-syllables and all combinations of four CC-clusters [bl, gl, pl, kl] with all vowels as CCV-syllables, this paper gives the first results of simulation experiments based on a natural language, i.e. based on the 200 most frequent syllables of Standard German as they occur in our children's book corpus, including phonetic simplifications which typically occur in children's word production. While syllables are strictly ordered with respect to phonetic features in the P-MAP in the case of the model language (see [3], [7], and [8]), we can see here that syllables are ordered more "loosely" in the case of a natural language. This is because natural languages are less symmetrical than the model language: owing to gaps in syllable structure, not all combinations of vowels and consonants are equally likely to occur in a natural language as they are in a model language. Furthermore, our simulations indicate that the representation of 200 syllables within the P-MAP requires a minimum map size of 25 × 25 neurons. Phonetic maps of 15 × 15 or 20 × 20 neurons were not capable of representing all 200 syllables. In order to be able to account for complete acquisition of a language, more than 200 syllables (up to 2000) must be included in the training set, so the size of the P-MAP and the S-MAP must be increased before this will be possible (cf. [9]). Acknowledgments. We thank Cornelia Eckers and Cigdem Capaat for building the corpus. This work was supported in part by the German Research Council (DFG) grant Kr 1439/13-1 and grant Kr 1439/15-1 and in part by COST-action 2102.
References 1. Guenther, F.H., Ghosh, S.S., Tourville, J.A.: Neural modeling and imaging of the cortical interactions underlying syllable production. Brain and Language 96, 280–301 (2006) 2. Guenther, F.H., Vladusich, T.: A neural theory of speech acquisition and production. Journal of Neurolinguistics (in press) 3. Kröger, B.J., Kannampuzha, J., Neuschaefer-Rube, C.: Towards a neurocomputational model of speech production and perception. Speech Communication 51, 793–809 (2009) 4. Levelt, W.J.M., Roelofs, A., Meyer, A.: A theory of lexical access in speech production. Behavioral and Brain Sciences 22, 1–75 (1999) 5. Levelt, W.J.M., Wheeldon, L.: Do speakers have access to a mental syllabary? Cognition 50, 239–269 (1994) 6. Wade, T., Dogil, G., Schütze, H., Walsh, M., Möbius, B.: Syllable frequency effects in a context-sensitive segment production model. Journal of Phonetics 38, 227–239 (2010) 7. Kröger, B.J.: Computersimulation sprechapraktischer Symptome aufgrund funktioneller Defekte. Sprache-Stimme-Gehör 34, 139–145 (2010) 8. Kröger, B.J., Miller, N., Lowit, A.: Defective neural motor speech mappings as a source for apraxia of speech: Evidence from a quantitative neural model of speech processing. In: Lowit, A., Kent, R. (eds.) Assessment of Motor Speech Disorders. Plural Publishing, San Diego (in press) 9. Li, P., Farkas, I., MacWhinney, B.: Early lexical development in a self-organizing neural network. Neural Networks 17, 1345–1362 (2004) 10. Kohler, W.: Einführung in die Phonetik des Deutschen. Erich Schmidt Verlag, Berlin (1995) 11. Glinz, H.: Deutsche Syntax. Metzler Verlag, Stuttgart (1970) 12. Ferguson, C.A., Farwell, C.B.: Words and sounds in early language acquisition. Language 51, 419–439 (1975) 13. Bauer, D., Kannampuzha, J., Kröger, B.J.: Articulatory Speech Re-Synthesis: Profiting from natural acoustic speech data. In: Esposito, A., Vích, R. (eds.) Cross-Modal Analysis of Speech, Gestures, Gaze and Facial Expressions. LNCS (LNAI), vol. 5641, pp. 344–355. Springer, Heidelberg (2009) 14. Kröger, B.J., Birkholz, P., Lowit, A.: Phonemic, sensory, and motor representations in an action-based neurocomputational model of speech production (ACT). In: Maassen, B., van Lieshout, P. (eds.) Speech Motor Control: New Developments in Basic and Applied Research, pp. 23–36. Oxford University Press, Oxford (2010) 15. Ackermann, H., Mathiak, K., Ivry, R.B.: Temporal organization of “internal speech” as a basis for cerebellar modulation of cognitive functions. Behavioral and Cognitive Neuroscience Reviews 3, 14–22 (2004)
Neurophysiological Measurements of Memorization and Pleasantness in Neuromarketing Experiments Giovanni Vecchiato1,2 and Fabio Babiloni1,2 1 Dept. Physiology and Pharmacology, Univ. of Rome Sapienza, 00185, Rome, Italy 2 IRCCS Fondazione Santa Lucia, via Ardeatina 306, 00179, Rome, Italy [email protected]
Abstract. The aim of this study was to analyze the brain activity occurring during the "naturalistic" observation of commercial ads. In order to measure both the brain activity and the emotional engagement, we used electroencephalographic (EEG) recordings and the high-resolution EEG technique to obtain an estimation of the cortical activity during the experiment. Results showed that the TV commercials proposed to the analyzed population increased the cortical activity, mainly in the theta band in the left hemisphere, when they were later memorized and judged pleasant. A correlation analysis also revealed that the increase of the EEG Power Spectral Density (PSD) at left frontal sites is negatively correlated with the degree of pleasantness perceived. Conversely, the de-synchronization of left frontal alpha activity is positively correlated with judgments of high pleasantness. Moreover, our data also showed an increase of PSD related to the observation of unpleasant commercials. Keywords: Neuromarketing, EEG, EEG frontal asymmetry, high resolution EEG, TV commercials.
1 Introduction
In recent years there has been increased interest in the use of brain imaging techniques, based on hemodynamic or electromagnetic recordings, for the analysis of brain responses to commercial advertisements or for the investigation of the purchasing attitudes of subjects [1, 2, 3, 4]. This interest is justified by the possibility of correlating the particular observed brain activations with the characteristics of the proposed commercial stimuli, in order to derive conclusions about how interesting, or emotionally engaging, such ad stimuli are for the subjects. Standard marketing techniques employed so far involve an interview and the compilation of a questionnaire by the subjects after exposure to a novel commercial ad, before the massive launch of the ad itself (ad pre-test). However, it is now recognised that verbal advertising pre-testing is often flawed by the respondents' cognitive processes activated during the interview, since implicit memory and the subjects' feelings are often inaccessible to an interviewer who uses
traditional techniques [5]. In addition, it has also been suggested that the interviewer in such typical pre-testing interviews has a great influence on what the respondent recalls and on how it is subjectively experienced [6, 7]. With these considerations in mind, researchers have attempted to investigate the signs of brain activity correlated with an increase of attention, memory or emotional engagement during the observation of such commercial ads. Researchers within the consumer neuroscience community promote the view that findings and methods from neuroscience complement and illuminate existing knowledge in consumer research in order to better understand consumer behaviour [8, 9]. The use of electroencephalographic (EEG) measurements allows brain activity to be followed on a millisecond basis, but it has the problem that the recorded EEG signals are mainly due to the activity generated in the cortical structures of the brain. In fact, the electromagnetic activity elicited by the deep structures held responsible for the generation of emotional processing in humans is almost impossible to gather from the usual superficial EEG electrodes [10, 11]. It has been underlined that positive or negative emotional processing of commercial ads is an important factor for the formation of stable memory traces [12]. Hence, it becomes relevant to infer the emotional engagement of the subject by using indirect signs of it. Indirect variables of emotional processing can also be gathered by tracking variations of the activity of other anatomical structures linked to emotional processing in humans, such as the prefrontal and frontal cortex (PFC and FC, respectively; [13, 8]). The PFC region is structurally and functionally heterogeneous, but its role in emotion is well recognized [14, 9]. EEG spectral power analyses indicate that the anterior cerebral hemispheres are differentially lateralized for approach and withdrawal motivational tendencies and emotions. Specifically, findings suggest that the left PFC is an important brain area in a widespread circuit that mediates appetitive approach, while the right PFC appears to form a major component of a neural circuit that instantiates defensive withdrawal [15, 16]. In this study we were interested in analysing the brain activity occurring during the "naturalistic" observation of commercial ads intermingled in a random order with a documentary. To measure both the brain activity and the emotional engagement we used the EEG and the high-resolution EEG technique to obtain an estimation of the cortical activity during the experiment. The aim was to link significant variations of the EEG measurements with the memory and pleasantness of the stimuli presented, as determined subsequently from the subjects' verbal interview. To this end, different indexes were employed to summarize the cerebral measurements and used in the statistical analysis. In order to recreate, as much as possible, a "naturalistic" approach to the task, the observers watched the TV screen without particular goals in mind. In fact, the subjects were not instructed at all about the aim of the task, and they were not aware that an interview about the TV commercials intermingled with the documentary would be conducted at the end of the task. The experimental questions of the present study are the following:
1. In the particular task employed and for the analyzed population, are there particular EEG activities in the spectral domain that correlate with the memorization performed or the pleasantness perceived by the subjects?
2. Does there exist any frontal EEG asymmetry when watching pleasant and unpleasant commercial advertisements?
3. Is it possible to extract from the EEG signals a descriptor which is strictly correlated with the degree of perceived pleasantness?
In the following pages, a detailed description of the two experiments and the related methodologies will be presented. Subsequently, the results derived from the experiments will be described, and a general discussion of the significance of these results with respect to the existing literature will close the scientific part of the work.
2 Materials and Methods
High-resolution EEG technologies have been developed to enhance the poor spatial information content of the EEG activity [17, 18, 10, 19, 20]. Basically, these techniques involve the use of a large number (64-256) of scalp electrodes. In addition, high-resolution EEG techniques rely on realistic MRI-constructed head models and spatial de-convolution estimations, which are usually computed by solving a linear-inverse problem based on Boundary-Element Mathematics [21, 22]. Subjects were comfortably seated on a reclining chair, in an electrically shielded, dimly lit room. In the present work, the cortical activity was estimated from scalp EEG recordings by using realistic head models whose cortical surface consisted of about 5000 uniformly disposed triangles. The current density estimate for each triangle, which represents the electrical dipole of the underlying neuronal population, was computed by solving the linear-inverse problem according to the techniques described in previous papers [23, 24, 25].
2.1 Experiment 1
Fifteen healthy volunteers (mean age 27.5±7.5 years; 7 women and 8 men) were recruited for this study. The experimental task consisted of observing a thirty-minute-long documentary in which we inserted three advertising breaks: the first one eight minutes after the beginning, the second one in the middle and the last one at the end of the movie. Each interruption was formed by the same number of commercial videoclips of about thirty seconds. During the whole documentary, a total of six TV commercials was presented. The clips were related to standard international brands of commercial products, like cars, food, etc., and public service announcements (PSA) such as campaigns against violence. The occurrence of the commercial videos within the documentary was randomized to remove the factor "sequence" as a possible confounding effect in the following analysis. During the observation of the documentary and TV commercials, subjects were not aware that an interview would be held within a couple of hours from the end of the movie. They were simply told to pay attention to what they were about to watch, and no mention of the importance of the commercial clips was made. In the interview, subjects were asked to recall the commercial clips they remembered. In addition, a
question on the pleasantness of the advertisement was asked. According to the information acquired, the neurophysiological activity recorded was divided into four different datasets. The first pool was related to the activity collected during the viewing of the commercial clips that the subjects had correctly remembered, and this dataset was named RMB. The second pool was related to the activity collected during the observation of the TV commercials that had been forgotten by the subjects, and this set was named FRG. The third pool is instead formed by the activity associated with subjects who stated that they liked the advertisement in question. This group has been named LIKE. Analogously, the fourth and last group comprises all the cerebral and autonomic activity of subjects who answered negatively to the question on likeability. We refer to this dataset as DISLIKE. These two datasets (LIKE/DISLIKE) only take into account the emotional feeling of the subject, since he/she is asked to answer the question "Did you like the commercial you have seen in the movie?". Hence, an advertisement could be labelled as DISLIKE even though the subject found it meaningful or interesting. In fact, the question does not investigate cognitive aspects but only the degree of pleasantness perceived. Finally, the neurophysiological activity during the observation of the documentary was also analyzed, and a final pool of data related to this state was generated with the name REST. This REST period was taken as the period in which the subject looked at the documentary. We took into account a two-minute-long sequence of the documentary immediately before the appearance of the first advertising break, chosen in order to minimize the variations of the spectral responses owing to fatigue or loss of concentration. The cerebral activity was recorded by means of a portable 64-channel system (BE+ and Galileo software, EBneuro, Italy). Informed consent was obtained from each subject after explanation of the study, which was approved by the local institutional ethics committee. All subjects were comfortably seated on a reclining chair, in an electrically shielded, dimly lit room. Electrode positions were acquired in 3D space with a Polhemus device for the successive positioning on the head model employed for the analysis. Recordings were initially extra-cerebrally referenced and then converted to an average reference off-line. We collected the EEG activity at a sampling rate of 256 Hz while the impedances were kept below 5 kΩ. Each EEG trace was then converted into the Brain Vision format (BrainAmp, Brainproducts GmbH, Germany) in order to perform signal pre-processing such as artefact detection, filtering and segmentation. Raw EEG traces were first band-pass filtered (high pass = 2 Hz; low pass = 47 Hz) and Independent Component Analysis (ICA) was then applied to detect and remove components due to eye movements, blinks, and muscular artefacts. These EEG traces were then segmented to obtain the cerebral activity during the observation of the TV commercials and that associated with the REST period. Since we recorded such activity from fifteen subjects, for each proposed advertisement we collected fifteen trials, which were grouped and averaged to obtain the results illustrated in the following sections. This dataset has been used to evaluate the cortical activity and calculate the power spectral density (PSD) for each segment according to the Welch method [38].
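As a rough sketch of the per-segment spectral estimate described above (after band-pass filtering and ICA-based artefact removal), the following code uses SciPy's implementation of the Welch method; the window length and variable names are illustrative assumptions rather than the authors' exact settings.

```python
import numpy as np
from scipy.signal import welch

def segment_psd(eeg, fs=256.0, fmin=2.0, fmax=47.0):
    """Welch power spectral density for one EEG segment.

    eeg: array of shape (n_channels, n_samples), already filtered and
    cleaned of artefacts (e.g. via ICA).
    Returns the frequencies within [fmin, fmax] and the PSD per channel.
    """
    freqs, psd = welch(eeg, fs=fs, nperseg=int(2 * fs), axis=-1)  # 2-second windows (assumed)
    band = (freqs >= fmin) & (freqs <= fmax)
    return freqs[band], psd[:, band]
```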
2.2 Experiment 2 Eleven voluntary and healthy undergraduate students of our faculty participated in the study (age, 22–25 years; 8 males and 3 females). They had no personal history of neurological or psychiatric disorder. They were free from medications, or alcohol or drugs abuse. For the EEG data acquisition, subjects were comfortably seated on a reclining chair in an electrically shielded and dimly lit room. They were exposed to the vision of a film of about 30 minutes and asked to pay attention to the above stimuli; they were not aware about the aim of the experiment and did not know that an interview would be performed after the recording. The movie consisted in a neutral documentary. Three interruptions have been generated: one at the beginning, the second at the middle and the last one at the end of the documentary. Each interruption was composed by six 30 seconds long commercial video-clips. Eighteen commercials were showed during the whole documentary. The TV spots were relative to standard international brands of commercial products, such as cars and food, and no profit associations, such as FAO and Greenpeace. They have never been broadcasted in the country in which the experiment has been performed. Hence, the advertising material was new to the subject as well as the documentary they observed. After two hours from the end of the recording, each experimental subject was contacted and an interview was performed. In such a questionnaire, the experimenter asked the subjects to recall the clips they remembered. Firstly, the operator verbally listed the sequence of advertisements presented within the documentary asking them to tell which they remembered, one by one. Successively, the interviewer showed to the subject several sheets, each presenting several frame sequences of each commercial inserted in the movie in order to solicit the memory of the stimuli presented. Along with these pictures, we also showed an equal number of ads which we did not choose as stimuli. This was done to provide to the subject the same number of distractors when compared to the target pictures. Finally, for each advertisement the subjects remembered, we asked them to give a score ranging between 1 and 10 according to the level of pleasantness they perceived during the observation of the ad (1, lowly pleasant; 5, indifferent; 10, highly pleasant). The EEG signals were segmented and classified according to the rated pleasantness score in order to group, in different datasets, the neuroelectrical activity elicited during the observation of commercials. Moreover, for each subject, a two minutes EEG segment related to the observation of the documentary has been further taken into account as baseline activity. In the following analysis, we considered only those pleasantness scores which have been expressed at least by three subjects in the population analyzed, in order to avoid outliers. According to this criteria, we discarded the EEG activity related to the ads that have been rated as 1, 2 and 10. The signals associated to the lowest pleasantness ratings from 3 to 5 have been labelled as DISLIKE dataset; conversely, the ones related to the higher ratings from 7 to 9 have been labelled as LIKE dataset. In such a case, these two datasets (LIKE/DISLIKE) only take into account the emotional feeling of the subject since he/she is asked to answer to the question “Did you like the
commercial you have seen in the movie?”. Hence, an advertisement could be labelled as DISLIKE even though the subject found it meaningful or interesting. In fact, the question does not investigate cognitive aspects but only the degree of pleasantness perceived. A 96-channel system with a frequency sampling of 200 Hz (BrainAmp, Brainproducts GmbH, Germany) was used to record the EEG electrical potentials by means of an electrode cap which was built according to an extension of the 10-20 international system to 64 channels. Linked ears reference was used. Since a clear role of the frontal areas have been depicted for the phenomena we would like to investigate [13, 14, 15], we used the left and right frontal and prefrontal electrodes of the 10-20 international system to compute the following spectral analysis. In such a case, we considered the following couples of homologous channels: Fp2/Fp1, AF8/AF7, AF4/AF3, F8/F7, F6/F5, F4/F3, F2/F1. The EEG signals have been band pass filtered at 1-45 Hz and depurated of ocular artefacts by employing the Independent Component Analysis (ICA) in such a way the components due to eye blinks and ocular movements detected by eye inspection were then removed from the original signal. The EEG traces related to our datasets of interest have been further segmented in one second trials. Then, a semi-automatic procedure has been adopted to reject trials presenting muscular and other kinds of artefacts. Only artefacts-free trials have been considered for the following analysis. The extra-cerebrally referred EEG signals have been transformed by means of the Common Average Reference (CAR) and the Individual Alpha Frequency (IAF) has been calculated for each subject in order to define four bands of interest according to the method suggested in the scientific literature [26]. Such bands were in the following reported as IAF+x, where IAF is the Individual Alpha Frequency, in Hertz, and x is an integer displacement in the frequency domain which is employed to define the band. In particular we defined the following four frequency bands: theta (IAF-6, IAF-2), i.e. theta ranges between IAF-6 and IAF-2 Hz, alpha (IAF-2, IAF+2). The higher frequency ranges of the EEG spectrum have been also analyzed but we do not report any results since their variations were not significant. The spectral EEG scalp activity has been calculated by means of the Welch method [38] for each segment of interest. In order to discard the single subject’s baseline activity we contrasted the EEG power spectra computed during the observation of the commercial video clips with the EEG power spectra obtained in different frequency bands during the observation of the documentary by using the z-score transformation [41]. In particular, for each frequency band of interest the used transformation is described as follows:
Z = (X̄ − μ) / (σ / √N)     (1)
where X denotes the distribution of PSD values (of cardinality N) elicited during the observation of commercials and the bar denotes the mean operator, μ denotes the mean value of the PSD activity related to the documentary and σ its standard deviation [41]. By using the z-score transformation, we removed the variance due to the baseline differences in EEG power spectra among the subjects.
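A minimal numerical sketch of Eq. (1), assuming one array of per-segment PSD values for a commercial and one for the documentary baseline in a given frequency band and channel (array names are illustrative):

```python
import numpy as np

def zscore_vs_baseline(psd_commercial, psd_baseline):
    """Eq. (1): z-score of the mean PSD during a commercial against the
    documentary baseline, for one frequency band and channel."""
    n = psd_commercial.size               # cardinality N of the commercial PSD distribution
    mu = psd_baseline.mean()              # mean PSD of the documentary baseline
    sigma = psd_baseline.std(ddof=1)      # baseline standard deviation
    return (psd_commercial.mean() - mu) / (sigma / np.sqrt(n))
```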
To study the EEG frontal activity, we compared the LIKE activity against the DISLIKE one by evaluating the difference of their average spectral values as follows:

Z = ZLIKE − ZDISLIKE     (2)

where ZLIKE is the z-score of the power spectra of the EEG recorded during the observation of commercial videoclips rated pleasant ("liked") by the analyzed subjects in a particular frequency band of interest, and ZDISLIKE is the z-score for the EEG recorded during the observation of commercial videoclips rated unpleasant by the subjects. This spectral index has been mapped onto a real scalp model in the two bands of interest. Moreover, in order to investigate the cerebral frontal asymmetry, for each couple of homologous channels we calculated the following spectral imbalance:

ZIM = Zdx − Zsx     (3)

This index has been employed to calculate the Pearson product-moment correlation coefficient [41] between the pleasantness score and the neural activity, in the theta and alpha bands, for each couple of channels we analyzed. Finally, we adopted Student's t-test to compare the ZIM index between the LIKE and DISLIKE conditions by evaluating the corresponding indexes.
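The imbalance index of Eq. (3) and the subsequent correlation and t-test analyses can be sketched as follows; the input arrays (one ZIM value and one pleasantness rating per observation) are hypothetical placeholders rather than the study's data.

```python
import numpy as np
from scipy.stats import pearsonr, ttest_ind

def imbalance(z_right, z_left):
    """Eq. (3): ZIM = Z(right electrode) - Z(left homologous electrode)."""
    return np.asarray(z_right) - np.asarray(z_left)

# Hypothetical usage, one value per observed ad/subject:
# zim_theta = imbalance(z_f2_theta, z_f1_theta)
# r, p = pearsonr(zim_theta, pleasantness_scores)   # correlation with pleasantness ratings
# t, p = ttest_ind(zim_like, zim_dislike)           # LIKE vs. DISLIKE comparison
```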
3 Results
3.1 Experiment 1
The EEG signals gathered during the observation of the commercial spots were subjected to the estimation of the cortical power spectral density by using the techniques described in the Methods section. In each subject, the cortical power spectral density was evaluated in the different frequency bands adopted in this study and contrasted with the values of the power spectral density of the EEG during the observation of the documentary through the estimation of the z-score. These cortical distributions of the z-scores obtained during the observation of the commercials were then organized into two different populations: the first was composed of the cortical z-scores relative to the observation of commercial videos that were remembered during the interview (RMB group), while the second was composed of the cortical z-score distributions relative to the observation of commercial videos that were forgotten (FRG group). A contrast was then made between the cortical z-score distributions of these two populations, and the resulting cortical distributions in the four frequency bands highlight the cortical areas in which the estimated power spectra statistically differ between the populations. Fig. 1 presents two cortical maps, in which the brain is viewed from a frontal perspective. The maps are relative to the contrast between the two populations in the theta (upper left) and alpha (upper right) frequency bands. The gray scale on the cortex codes the statistical significance.
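The population contrast underlying the maps of Fig. 1 can be thought of as a vertex-wise statistical comparison of the two sets of cortical z-score maps with a Bonferroni-corrected threshold. The paper does not specify the exact test, so the independent-samples t-test, the array shapes and the names below are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

def contrast_map(z_group_a, z_group_b, alpha=0.05):
    """Vertex-wise contrast of two populations of cortical z-score maps
    (e.g. RMB vs. FRG), with Bonferroni correction over the vertices.

    z_group_a, z_group_b: arrays of shape (n_subjects, n_vertices).
    Returns the t-values, zeroed where the corrected test is not significant.
    """
    t, p = ttest_ind(z_group_a, z_group_b, axis=0)
    significant = p < alpha / z_group_a.shape[1]   # Bonferroni-corrected threshold
    return t * significant
```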
Fig. 1 presents an increase of cortical activity in the theta band that is prominent over the left prefrontal and frontal regions for the RMB group. The statistically significant activity in the alpha frequency band for the RMB group is also increased in the left hemisphere, although there are a few frontocentral and right prefrontal zones where the cortical activity was prominent for the FRG group.
Fig. 1. Two cortical z-score maps in the two frequency bands employed. The gray scale represents cortical areas in which a statistically significant increase of activity occurs (p<0.05, Bonferroni corrected)
Fig. 2 presents the contrast between the LIKE and DISLIKE groups in the two frequency bands considered in this analysis. The same conventions as in Fig. 1 are used. A significant increase of frontal activity in the theta band is visible in the LIKE group when compared to the DISLIKE one (upper left part of Fig. 2). Scattered increases of cortical activity in the left hemisphere are also present in the DISLIKE group. In the alpha frequency band (upper right of Fig. 2), a significant increase of cortical activity is present in the left hemisphere and in the right orbitofrontal region in the LIKE group when compared to the DISLIKE one.
Fig. 2. Two cortical maps for the LIKE and DISLIKE groups. Same conventions as in the previous figure
3.2 Experiment 2
In Fig. 3 the scalp distribution of the z-score values for the theta and alpha bands is presented. In the z-score distribution for the theta band, it is possible to observe a major activation for the ZDISLIKE condition at the electrodes F2 and AF8, roughly overlying the right frontal cortex (FC) and the prefrontal cortex (PFC), although an enhancement of spectral activity is also present in the left hemisphere (site F3). However, it is possible to observe that the EEG spectral power is also increased at the Fp1 electrode for the ZLIKE dataset. The right side of Fig. 3 shows the differences in EEG spectral power of the alpha band. They are mainly located at the electrode F1, which roughly overlies the left FC. Conversely, differences are also visible in the right hemisphere at scalp sites AF8 and AF4, roughly overlying the right PFC.
Fig. 3. The two scalp maps represent the Z index of Eq. (2) for the theta (left) and alpha band (right). Z values are mapped onto a realistic scalp model, seen from a frontal perspective. The gray scale codes scalp areas in which the LIKE spectral activity is greater than the DISLIKE one and regions in which the DISLIKE spectral activity is greater than the LIKE one.
In order to further investigate the frontal EEG asymmetry and its relation to the pleasantness our experimental subjects perceived, we calculated the Pearson product-moment correlation coefficient between the imbalance index ZIM, described in Eq. 3, and the pleasantness scores provided by the subjects, at each scalp location and frequency band of interest. These results are summarized in Table 1. As to the theta band, we found significant negative correlations at pre-frontal and lateral sites, while only the couple of electrodes F2-F1 presents a significant positive correlation between ZIM and the pleasantness score. As far as the correlation in the alpha band is concerned, we obtained a significant positive correlation for the couple F2-F1.
Table 1. Correlation coefficients between the ZIM index and the pleasantness score, for each couple of electrodes, in the theta and alpha bands. Statistically significant values (p < 0.05) are marked with an asterisk.

Couple of electrodes    Theta                Alpha
Fp2-Fp1                 -0.17 (p=0.04)*      -0.11 (p=0.19)
AF8-AF7                 -0.17 (p=0.04)*      -0.15 (p=0.07)
F8-F7                   -0.21 (p=0.01)*      -0.15 (p=0.06)
F6-F5                   -0.11 (p=0.16)       -0.09 (p=0.26)
AF4-AF3                 -0.02 (p=0.81)       -0.02 (p=0.80)
F4-F3                    0.07 (p=0.40)        0.05 (p=0.58)
F2-F1                    0.17 (p=0.04)*       0.16 (p=0.04)*
Finally, in order to assess the different behaviour presented in Table 1, we performed a t-test analysis between the ZIM values of the pre-frontal and lateral electrodes for the theta band (Fp2/Fp1, AF8/AF7, F8/F7, F6/F5) and the medial ones for the alpha band (AF4/AF3, F4/F3, F2/F1). As shown in Fig. 4, the results of the statistical test revealed a significant difference of ZIM values between the LIKE and DISLIKE conditions, in both the theta (t = -3.2, p = .0014) and alpha band (t = 2.2, p = .0298). In fact, it is possible to observe a greater absolute ZIM value for the DISLIKE condition in both the theta and alpha bands.
Fig. 4. Representation of the mean values of the ZIM index for the LIKE (dark gray) and DISLIKE (light gray) conditions in the theta (left) and alpha band (right). Both differences are statistically significant (theta: t = -3.2, p < 0.01; alpha: t = 2.2, p < 0.05).
4 Discussion
4.1 Experiment 1
The analysis of the statistical cortical maps in the different conditions (RMB vs FRG and LIKE vs DISLIKE) suggested that the left frontal hemisphere was highly active during the RMB condition, especially in the theta band, while the activity of the brain is greater in the LIKE condition than in the DISLIKE one in both frequency bands. These results are in agreement with different observations on the RMB condition reported in the literature [28, 29, 3]. In addition, the results obtained here for the LIKE condition are also congruent with other observations performed with EEG in a group of 20 subjects during the observation of pictures from the International Affective Picture System (IAPS, [30]). Such observations indicated an increase of the EEG activity in the theta band over the anterior areas of the left hemisphere. It is worth noting that there were methodological differences between the study by Aftanas and colleagues and the present one, mainly related to the use of different material as stimuli and different processing algorithms. Nevertheless, the convergence of these results, obtained in the "naturalistic" conditions of the observation of commercial videos within the documentary, with those of more controlled memory and affective tasks deserves attention.
4.2 Experiment 2
The results presented in Experiment 2 showed asymmetrical frontal activations of cerebral activity during the observation of pleasant and unpleasant commercial videos. The reported EEG power spectral maps distinguished the different activations between the LIKE and DISLIKE conditions both in the theta and in the alpha band. It is worth noting that most of the activity in the left frontal hemisphere relates to the observation of commercials that have been judged pleasant by the analyzed population. On the other hand, the right frontal sites highlighted neuroelectrical activations concerning the observation of advertisements that have been judged less pleasant. Moreover, this imbalance in the activations was linearly correlated with the degree of pleasantness the subjects expressed. The correlation analysis revealed that pleasantness scores are significantly negatively correlated with the theta imbalance index, mostly at the pre-frontal and lateral frontal sites. Conversely, at the alpha frequencies the imbalance index is positively correlated around the medial frontal region. It should be noted that the adopted correlation index (Pearson's r) assumes that the distributions of the variables employed follow a Gaussian distribution. While slight departures from this normality assumption do not alter the significance of the correlation obtained, other correlation coefficients can be computed by using non-parametric statistics (such as Spearman's R and Kendall's Tau) [27]. Here, we used the parametric correlation coefficient (Pearson's r) due to its higher statistical power. These data showed that the variations of the spectral index we defined are able to describe the degree of pleasantness perceived by subjects while watching
TV commercial ads. The different number of TV commercials employed in the two studies does not influence the results achieved, since there is no comparison between them. In particular, the scalp regions over the left frontal and pre-frontal areas are mostly activated when pleasant feelings are experienced. However, the right frontal lobe is more activated while watching commercials that have been judged unpleasant. Overall, the right frontal activity is significantly greater than that in the left frontal lobe, in both the theta and alpha bands. Altogether, these results are in line with previous findings suggesting the presence of asymmetrical EEG activity when subjects experience emotional stimuli [13, 14, 15]. Moreover, the greater spectral activity elicited in the right frontal areas during the observation of unpleasant TV ads could also be congruent with the literature associating the insula/parainsula [31, 32] and the ventral anterior cingulate cortices [33, 34] with the processing of negatively valenced emotions in social situations. Of course, these statements need to be confirmed by further studies employing high-resolution EEG techniques in order to estimate and investigate the related cortical patterns [35-40]. Taken together, the results indicated that the cortical activity in the theta band over the left frontal areas was increased during the memorization of commercials, and it was also increased during the observation of commercials that were liked by the subjects. These results are in agreement with the role that has been advocated for the left prefrontal and frontal regions in the transfer of sensory percepts from short-term memory toward long-term memory storage by the HERA model [41]. In fact, in this model the left hemisphere plays a key role during the encoding of information from short-term memory to long-term memory, whereas the right hemisphere plays a role in the retrieval of such information. The results of the present study suggest the following answers to the questions raised in the Introduction:
1) In the population analyzed, the cortical activity in the theta band elicited during the observation of the TV commercials that were remembered (RMB) is higher and localized in the left frontal brain areas when compared to the activity elicited during the vision of the TV commercials that were forgotten (FRG). The same increase in theta activity occurred during the observation of commercials that were judged pleasant (LIKE) when compared with the others (DISLIKE).
2) There exists a frontal EEG asymmetry elicited by the observation of pleasant TV commercials; in particular, there is a stronger activation in the left hemisphere related to pleasant ads and, conversely, an enhancement of spectral power in the right hemisphere associated with unpleasant ads.
3) The degree of perceived pleasantness linearly correlates with the imbalance between the EEG power spectra estimated at selected right and left scalp sites.
References 1. Ioannides, A.A., Liu, L., Theofilou, D., et al.: Real time processing of affective and cognitive stimuli in the human brain extracted from MEG signals. Brain Topogr. 13(1), 11– 19 (2000) 2. Knutson, B., Rick, S., Wimmer, G.E., Prelec, D., Loewenstein, G.: Neural predictors of purchases. Neuron. 53(1), 147–156 (2007) 3. Astolfi, L., Cincotti, F., Mattia, D., et al.: Tracking the Time-Varying Cortical Connectivity Patterns by Adaptive Multivariate Estimators. IEEE Trans. Biomed. Eng. 55(3), 902–913 (2008) 4. Morris, J.D., Klahr, N.J., Shen, F., et al.: Mapping a multidimensional emotion in response to television commercials. Hum. Brain Mapp. 30(3), 789–796 (2009) 5. Zaltman, G.: How Customers Think: Essential Insights into the Mind of the Market, 1st edn. Harvard Business Press, Boston (2003) 6. McDonald, C.: Is Your Advertising Working? WARC (2003) 7. Franzen, G., Bouwman, M.: The Mental World of Brands: Mind, Memory and Brand Success. NTC Publications (2001) 8. Ambler, T., Ioannides, A., Rose, S.: Brands on the Brain: Neuro-Images of Advertising. Business Strategy Review 11(3), 17–30 (2000) 9. Klucharev, V., Smidts, A., Fernández, G.: Brain mechanisms of persuasion: how ’expert power’ modulates memory and attitudes. Soc. Cogn. Affect Neurosci. 3(4), 353–366 (2008) 10. Nunez, P.: Neocortical Dynamics and Human EEG Rhythms, 1st edn. Oxford University Press, USA (1995) 11. Urbano, A., Babiloni, C., Onorati, P., Babiloni, F.: Dynamic functional coupling of high resolution EEG potentials related to unilateral internally triggered one-digit movements. Electroencephalogr. Clin. Neurophysiol. 106(6), 477–487 (1998) 12. Kato, J., Ide, H., Kabashima, I., et al.: Neural correlates of attitude change following positive and negative advertisements. Front Behav. Neurosci. 3, 6 (2009) 13. Davidson, I.: The functional neuroanatomy of emotion and affective style. Trends Cogn. Sci (Regul. Ed.) 3(1), 11–21 (1999) 14. Davidson, R.J.: Anxiety and affective style: role of prefrontal cortex and amygdala. Biol. Psychiatry. 51(1), 68–80 (2002) 15. Davidson, R.: What does the prefrontal cortex ”do” in affect: perspectives on frontal EEG asymmetry research. Biol. Psychol. 67(1-2), 219–233 (2004) 16. Davidson, R.J.: Affective style, psychopathology, and resilience: brain mechanisms and plasticity. Am. Psychol. 55(11), 1196–1214 (2000) 17. Ding, L., Lai, Y., He, B.: Low resolution brain electromagnetic tomography in a realistic geometry head model: a simulation study. Phys. Med. Biol. 50(1), 45–56 (2005) 18. He, B., Wang, Y., Wu, D.: Estimating cortical potentials from scalp EEG’s in a realistically shaped inhomogeneous head model by means of the boundary element method. IEEE Trans. Biomed. Eng. 46(10), 1264–1268 (1999) 19. Babiloni, C., Babiloni, F., Carducci, F., et al.: Mapping of early and late human somatosensory evoked brain potentials to phasic galvanic painful stimulation. Hum. Brain Mapp. 12(3), 168–179 (2001) 20. De Vico Fallani, F., Astolfi, L., Cincotti, F., et al.: Cortical functional connectivity networks in normal and spinal cord injured patients: Evaluation by graph analysis. Hum. Brain Mapp. 28(12), 1334–1346 (2007)
21. Grave de Peralta Menendez, R., Gonzalez Andino, S.: Distributed source models:standard solutions and new developments. In: Uhl, C. (ed.) Analysis of Neurophysiological Brain Functioning, 1999, pp. 176–201. Springer, Heidelberg (1998) 22. Dale, A.M., Liu, A.K., Fischl, B.R., et al.: Dynamic statistical parametric mapping: combining fMRI and MEG for high-resolution imaging of cortical activity. Neuron. 26(1), 55–67 (2000) 23. Astolfi, L., Cincotti, F., Mattia, D., Marciani, M.G., Baccala, L., de Vico Fallani, F., Salinari, S., Ursino, M., Zavaglia, M., Ding, L., Edgar, J.C., Miller, G.A., He, B., Babiloni, F.: Comparison of different cortical connectivity estimators for high-resolution EEG recordings. Hum. Brain Mapp. 28(2), 143–157 (2007a) 24. Astolfi, L., De Vico Fallani, F., Cincotti, F., Mattia, D., Marciani, M.G., Bufalari, S., Salinari, S., Colosimo, A., Ding, L., Edgar, J.C., Heller, W., Miller, G.A., He, B., Babiloni, F.: Imaging Functional Brain Connectivity Patterns From High-Resolution EEG And fMRI Via Graph Theory. Psychophysology 44(6), 880–893 (2007b) 25. Babiloni, F., Cincotti, F., Babiloni, C., Carducci, F., Basilisco, A., Rossini, P.M., Mattia, D., Astolfi, L., Ding, L., Ni, Y., Cheng, K., Christine, K., Sweeney, J., He, B.: Estimation of the cortical functional connectivity with the multimodal integration of high resolution EEG and fMRI data by Directed Transfer Function. Neuroimage 24(1), 118–131 (2005) 26. Klimesch, W.: EEG alpha and theta oscillations reflect cognitive and memory performance: a review and analysis. Brain Res. Brain Res. Rev. 29(2-3), 169–195 (1999) 27. Zar, J.H.: Biostatistical Analysis, 5th edn. Prentice Hall, Englewood Cliffs (2009) 28. Summerfield, C., Mangels, J.: Coherent theta-band EEG activity predicts item-context binding during encoding. Neuroimage 24(3), 692–703 (2005) 29. Werkle-Bergner, M., Müller, V., Li, S., Lindenberger, U.: Cortical EEG correlates of successful memory encoding: implications for lifespan comparisons. Neurosci. Biobehav. Rev. 30(6), 839–854 (2006) 30. Aftanas, L.I., Reva, N.V., Varlamov, A.A., Pavlov, S.V., Makhnev, V.: Analysis of evoked EEG synchronization and desynchronization in conditions of emotional activation in humans: temporal and topographic characteristics. Neurosci. Behav. Physiol. 34(8), 859– 867 (2004) 31. Coan, J.A., Schaefer, H.S., Davidson, R.: Lending a hand: social regulation of the neural response to threat. Psychol. Sci. 17(12), 1032–1039 (2006) 32. Lamm, C., Batson, C.D., Decety, J.: The neural substrate of human empathy: effects of perspective-taking and cognitive appraisal. J. Cogn. Neurosci. 19(1), 42–58 (2007) 33. Somerville, L.H., Heatherton, T.F., Kelley, W.: Anterior cingulate cortex responds differentially to expectancy violation and social rejection. Nat. Neurosci. 9(8), 1007–1008 (2006) 34. Eisenberger, N.I., Lieberman, M.D., Williams, K.: Does rejection hurt? An FMRI study of social exclusion. Science 302(5643), 290–292 (2003) 35. Babiloni, C., Carducci, F., Del Gratta, C., et al.: Hemispherical asymmetry in human SMA during voluntary simple unilateral movements. An fMRI study. Cortex 39(2), 293–305 (2003) 36. Urbano, A., Babiloni, F., Babiloni, C., et al.: Human short latency cortical responses to somatosensory stimulation. A high resolution EEG study. Neuroreport 8(15), 3239–3243 (1997) 37. Babiloni, C., Babiloni, F., Carducci, F., et al.: Human cortical responses during one-bit short-term memory. A high-resolution EEG study on delayed choice reaction time tasks. Clin. Neurophysiol. 
115(1), 161–170 (2004)
308
G. Vecchiato and F. Babiloni
38. De Vico Fallani, F., Astolfi, L., Cincotti, F., et al.: Extracting information from cortical connectivity patterns estimated from high resolution EEG recordings: a theoretical graph approach. Brain Topogr. 19(3), 125–136 (2007) 39. Babiloni, C., Brancucci, A., Babiloni, F., et al.: Anticipatory cortical responses during the expectancy of a predictable painful stimulation. A high-resolution electroencephalography study. Eur. J. Neurosci. 18(6), 1692–1700 (2003) 40. Astolfi, L., Cincotti, F., Mattia, D., et al.: Assessing cortical functional connectivity by partial directed coherence: simulations and application to real data. IEEE Trans. Biomed. Eng. 53(9), 1802–1812 (2006) 41. Tulving, E., Kapur, S., Craik, F.I., Moscovitch, M., Houle, S.: Hemispheric encoding/retrieval asymmetry in episodic memory: positron emission tomography findings. Proc. Natl. Acad. Sci. U.S.A. 91(6), 2016–2020 (1994)
Annotating Non-verbal Behaviours in Informal Interactions
Costanza Navarretta
Centre for Language Technology, University of Copenhagen, Njalsgade 140, build. 25, 4., 2300 Copenhagen S, Denmark
[email protected]
http://www.cst.ku.dk
Abstract. This paper deals with the annotation of non-verbal behaviours in a Danish multimodal corpus of naturally occurring interactions between people who are well-acquainted. The main goal of this work is to provide formally annotated data for describing and modelling various communicative phenomena in this corpus type. In the paper we describe the annotation model and present a first analysis of the annotated data, focusing on feedback-related non-verbal behaviours. The data confirm that head movements are the most common feedback-related non-verbal behaviours, but they also indicate that there are differences in the way feedback is expressed in two-party and in three-party interactions. Keywords: multimodal annotations, feedback, two-party and multiparty spontaneous interactions.
1 Introduction
People communicate with their voice and their bodies. Verbal and non-verbal communicative behaviours, comprising facial expressions, head movements, body postures and hand gestures, are intertwined on many levels [13,8]. Furthermore, multimodal behaviours are influenced by numerous factors such as the type of social activity in which they are produced, the cultural environment, the setting as well as the individuals involved in the interaction. It is important to study the single modalities and their relations in several communicative and cultural situations, and this has been the main aim in recent national and international projects and networks, such as CALO, HUMAINE, NOMCO, SPONTAL, SSPNET, and VACE. The present work follows this line of research. In particular, we describe the annotation of multimodal behaviours related to various communicative functions in a Danish corpus of naturally-occurring informal interactions, and we present a first analysis of feedback-related gestures. There are numerous studies that focus on feedback-related head movements, inter alia [21,6,12], and machine learning techniques have been applied to predict
the feedback function of head movements from speech, prosody and/or eye gaze, e.g. [14,16]. The relation between feedback and cultural aspects has also been investigated, especially in first-encounter data, e.g. [20,1,17], because first-encounter interactions show how different cultures deal with varying degrees of familiarity, social status and norms [3]. Differing from these studies, we investigate feedback-related behaviours in spontaneous interactions between well-acquainted people. Because most available multimodal corpora are scenario-based or not spontaneous, this type of interaction has not been studied extensively. Thus, it will be useful to investigate multimodal communicative behaviours in these data and to compare them with the multimodal behaviours annotated in scenario-based interactions and in interactions between people who do not know each other. In section 2, we briefly present our corpus, and in section 3 we describe the annotation model and the annotations coded so far. In section 4, we present a first analysis of the annotated data, and in section 5 we discuss future work and conclude.
2 The Corpus
The corpus consists of video recordings from the movin database collected by researchers at the University of Southern Denmark. The database contains audio and video recordings of naturally-occurring interactions belonging to different types of social activity and involving various numbers of participants. The language spoken in the interactions is Danish. Examples of these activities are cooking, buying food, and visiting family and friends. All the movin data are transcribed in clan [11] according to the CA tradition. We have selected part of the movin video recordings and have coded them with multimodal annotations. The main criteria behind our selection have been the following: i) the recordings had to be freely available at least for researchers; ii) the bodies and faces of the participants had to be visible; and iii) the audio quality had to be good enough that all speech could be transcribed. So far, we have annotated four video recordings of informal conversations between friends or family members in private homes (approx. 40 minutes). The participants in the interactions are women between 55 and 80 years old and are all Danish native speakers. The setting is similar in all recordings: the subjects are sitting around a sofa table while they drink coffee, eat cakes, and talk about various topics such as soccer, the economic crisis and family relations. The multimodal annotations have been mainly produced in the Danish clarin project by researchers from the University of Copenhagen. The main aims of the pilot multimodal annotation work are i) to provide a formal annotation of video recordings of spontaneous interactions and ii) to combine annotations produced by different research communities.
The Danish clarin is an infrastructure project (2008-2011) funded by the Danish Research Councils, and it integrates written and spoken resources, pictures and video records.
The conversations have been orthographically transcribed in praat [4]. These transcriptions, together with the pre-existing CA transcriptions, have been imported into the anvil tool [9], in which the multimodal behaviours have been coded.
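Once the transcription tiers and the gesture tracks share a common timeline, candidate words co-occurring with a given gesture can be listed by simple temporal overlap, while the semantic link itself remains an annotator judgement. The following minimal sketch illustrates this step; the tuple-based tier representation, the Danish example words and the function name are our own simplification, not the actual praat or anvil data formats.

# Minimal sketch (our own simplification, not the praat/anvil formats):
# word and gesture tiers as (start, end, label) tuples on a shared timeline.
words = [(0.00, 0.35, "ja"), (0.35, 0.80, "det"), (0.80, 1.30, "synes"), (1.30, 1.70, "jeg")]
gestures = [(0.30, 1.05, "Nod/Repeated")]

def overlapping_words(gesture, word_tier):
    """Return the words that temporally overlap a gesture interval."""
    g_start, g_end, _ = gesture
    return [label for (w_start, w_end, label) in word_tier
            if w_start < g_end and w_end > g_start]

for g in gestures:
    # Candidate words for the annotator's semantic gesture-speech link
    print(g[2], "co-occurs with:", overlapping_words(g, words))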
3 The Annotation Model
The multimodal annotations follow the mumin annotation model proposed by [2] in a Nordic network on multimodal interfaces (2003-2005). The model has been implemented in various annotation tools and has been applied to multimodal data in various languages, comprising Chinese, Danish, English, Estonian, Finnish, Greek, Japanese and Swedish, e.g. [2,10,16]. The mumin model deals with communicative non-verbal behaviours (gestures henceforth) related to feedback, turn management and sequencing. The gestures that are accounted for are facial expressions, head movements, hand gestures, body postures and gaze. The model describes both the shape and the communicative function of gestures via pre-defined attributes and values. Gestures can be multifunctional, e.g. a nod can be related to both feedback and turn management. The encoding of the gestures’ shape is quite coarse-grained with respect to other annotation schemes, but more specific attributes and values can be added to the model. Gestures can be assigned an attribute indicating the attitude which they show, and they can be linked to one or more words in the orthographic transcriptions if the annotators judge that there is a semantic relation between the gestures and speech [19]. Finally, a semiotic type can be assigned to gestures to indicate the relation holding between the signs and the objects that they denote. The mumin scheme adopts Peirce’s classification [18] of semiotic types distinguishing indexical, iconic and symbolic types. Indexical gestures have a real and direct connection with the objects they denote. They comprise deictic and non-deictic gestures, e.g. beats and displays. Iconic gestures, also known as emblems, denote their objects by similarity and include metaphoric gestures, while symbolic gestures are established by means of an arbitrary conventional relation. Similar classifications of gestures have been proposed by [13,8]. 3.1
The Annotation Scheme
In the Danish clarin project we follow the mumin scheme, but have adopted a more fine-grained description of the gestures’ shape. We also use a larger number of semiotic subtypes than those described in mumin. In particular, we distinguish several subtypes of deictic gestures reflecting the kind of object which they point to. Examples of these types are spatial, first-person and second-person deictic. In Table 1, the attributes and values which describe head movements are shown. Head movements are described by their shape and by the type of movement indicating whether the movement is simple or repeated. Furthermore, we annotate whether a subject has the face toward or away from the interlocutor, and we code the direction of the gaze.
Table 1. Head movement features

Behaviour attribute    Behaviour value
Head Movement          Nod, Jerk, HeadBackward, HeadForward, TiltRight, TiltLeft, SideTurnRight, SideTurnLeft, Shake, Waggle, HeadOther
Head Repetition        Single, Repeated
FaceInterlocutor       ToInterlocutor, AwayFromInterlocutor
GazeDirection          Up, Down, Forward, Left, Right, Other
GazeInterlocutor       ToInterlocutor, AwayFromInterlocutor

Table 2. Feedback description

Behaviour attribute    Behaviour value
FeedbackBasic          CPU, Other
FeedbackDirection      Give, Elicit, GiveElicit
FeedbackAgreement      Agree, NonAgree
In Table 2, the features describing feedback are given. The first attribute in the table, FeedbackBasic, indicates whether there is feedback or not. The second attribute FeedbackDirection describes whether a subject is giving or asking for feedback, while the attribute FeedbackAgreement is coded when a person agrees or disagrees with what is stated by the interlocutors.
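To make the scheme concrete, the sketch below shows one possible in-memory representation of a single head-movement annotation carrying the attributes of Tables 1 and 2. It is only an illustration in Python: the class and field names are ours and do not correspond to the MUMIN tools or to anvil's track format; only the attribute values are taken from the tables.

from dataclasses import dataclass
from typing import Optional

# Controlled vocabularies copied from Tables 1 and 2 (class/field names are illustrative).
HEAD_MOVEMENTS = {"Nod", "Jerk", "HeadBackward", "HeadForward", "TiltRight", "TiltLeft",
                  "SideTurnRight", "SideTurnLeft", "Shake", "Waggle", "HeadOther"}
FEEDBACK_DIRECTIONS = {"Give", "Elicit", "GiveElicit"}

@dataclass
class HeadAnnotation:
    start: float                               # seconds into the recording
    end: float
    movement: str                              # one of HEAD_MOVEMENTS
    repetition: str = "Single"                 # "Single" or "Repeated"
    face_interlocutor: str = "ToInterlocutor"  # or "AwayFromInterlocutor"
    gaze_direction: str = "Forward"            # Up, Down, Forward, Left, Right, Other
    feedback_basic: Optional[str] = None       # "CPU" or "Other"; None if not feedback-related
    feedback_direction: Optional[str] = None   # one of FEEDBACK_DIRECTIONS
    feedback_agreement: Optional[str] = None   # "Agree" or "NonAgree"

    def is_feedback(self) -> bool:
        # A gesture counts as feedback-related when FeedbackBasic is coded
        return self.feedback_basic is not None

# Example: a repeated nod that gives agreeing feedback
nod = HeadAnnotation(12.4, 13.1, "Nod", repetition="Repeated",
                     feedback_basic="CPU", feedback_direction="Give",
                     feedback_agreement="Agree")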
4 Feedback-Related Gestures
We have annotated facial expressions, head movements, gaze direction, hand gestures, and body postures. So far, we have assigned the following communicative functions to gestures: feedback, turn management, sequencing and information structure. Furthermore, we have focused on the classification of deictic gestures and their co-occurrence with various types of verbal referring expressions. We have not run any inter-coder agreement study on these data, but several validation studies have been carried out on other corpora annotated according to the same model. The inter-coder agreement figures reported in these studies are in general acceptable, see inter alia [7,16,15]. Tables 3 and 4 show the number and type of gesture recognised in the two-party and three-party interactions, respectively. The tables also provide the number and percentage of the gestures which the annotators have classified as feedback-related. These data confirm that feedback is mainly expressed via head movements [21,6,5]. However, feedback is also related to facial expressions, hand gestures and body postures, although the gestures classified as body posture are quite few. This is probably due to the setting in which the participants are sitting around a table.
Table 3. Gestures and feedback in two-party interactions

Speaker   Face   Feedback   Head   Feedback   Hand   Feedback   Body   Feedback
A          17      11        112     74        46      6          8      8
B          33      30         99     75        56      9          9      8
Total      50      41        211    149       102     17         17     16
%                  82               71                17                94
Table 4. Gestures and feedback in three-party interactions

Speaker   Face   Feedback   Head   Feedback   Hand   Feedback   Body   Feedback
A          20       8        224    139       120      3         14      8
B          17       2        209    137        51      3          0      0
C           6       3        190    144        49      6          4      0
Total      43      13        623    420       220     12         18      8
%                  30               67               5.5                44
The percentage of feedback-related gestures in the two-party interactions is higher than in the three-party interactions. This can indicate that people feel obliged to give and ask for feedback more often when they interact with one interlocutor than when they communicate with more persons. The distribution of feedback gestures in the two-party interactions is as follows: Face 19%, Head 69%, Hand 5%, and Body 7%. The distribution of the same gestures in the three-party interactions is the following: Face 3%, Head 93%, Hand 3%, and Body 1%. Thus, the distribution of feedback-related head movements is higher in the three-party than in the two-party interactions. In Tables 5 and 6 the values of FeedbackDirection for facial expressions and head movements in two-party and three-party interactions are shown.

Table 5. Feedback direction in two-party interactions

Speaker   Face   GiveElicit   Give   Elicit   Head   GiveElicit   Give   Elicit
A           8        2          6       2      74        9          29      32
B          30        5         18       5      75        7          49      13
Total      38        7         24       7     149       16          78      45

Table 6. Feedback direction in three-party interactions

Speaker   Face   GiveElicit   Give   Elicit   Head   GiveElicit   Give   Elicit
A           8        1          6       0     139        4          23      18
B           2        0          2       0     137        1          49       4
C           3        0          3       0     144        0          81       1
Total      13        1         11       0     420        5         153      23
The tables indicate that the subjects give feedback more frequently than they elicit it. However, asking for feedback occurs more frequently in the two-party interactions than in the three-party interactions.
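Counts of the kind reported in Tables 3-6 can be derived mechanically once every gesture annotation records its type and, where coded, its feedback direction. The sketch below is only a hedged illustration of that aggregation step; the (speaker, gesture type, feedback direction) triples are an invented simplification of the exported annotation tracks, not the actual pipeline used for this corpus.

from collections import Counter

# Toy data: (speaker, gesture type, feedback direction or None) per annotated gesture.
annotations = [
    ("A", "Head", "Give"), ("A", "Head", "Elicit"), ("A", "Face", None),
    ("B", "Head", "Give"), ("B", "Hand", "GiveElicit"), ("B", "Body", None),
]

gestures_per_type = Counter(g_type for _, g_type, _ in annotations)
feedback_per_type = Counter(g_type for _, g_type, fb in annotations if fb is not None)
direction_counts = Counter(fb for _, _, fb in annotations if fb is not None)

for g_type, total in gestures_per_type.items():
    fb = feedback_per_type.get(g_type, 0)
    print(f"{g_type}: {fb}/{total} feedback-related ({100 * fb / total:.0f}%)")
print("Feedback direction:", dict(direction_counts))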
5 Concluding Remarks
In the paper, we described the multimodal annotations of Danish two-party and three-party informal and spontaneous interactions between well-acquainted interlocutors. The analysis of the feedback-related gestures confirms that head movements are the most common non-verbal behaviour when people express feedback, but it also indicates that all gesture types can be related to feedback. Giving feedback occurs more often than asking for it in this corpus, but there are differences in the distribution of feedback-related gesture types in the two-party and three-party interactions. Currently, we are investigating the relation between co-occurring feedback-related gestures and speech, and we are analysing the other communicative function types which are annotated in the corpus. In the future we will compare the feedback annotations in these data with those in a Danish corpus of first-encounter meetings [17], which is annotated according to the same model, in order to measure to what extent the degree of familiarity of the interlocutors influences their feedback-related multimodal behaviours. We also plan to investigate whether there are individual differences in the multimodal behaviours of the various participants in these data.
Acknowledgements. The work described in this paper has been done under the Danish clarin project and the verbal and bodily communication (VKK) project, both funded by the Danish Research Councils. A special thanks also goes to Elisabeth Ahlsén, Jens Allwood, Kristiina Jokinen and especially Patrizia Paggio.
References
1. Allwood, J., Lu, J.: Chinese and Swedish multimodal communicative feedback. In: Abstracts of the 5th Conference on Multimodality, Sydney, pp. 19–20 (2010)
2. Allwood, J., Cerrato, L., Jokinen, K., Navarretta, C., Paggio, P.: The MUMIN coding scheme for the annotation of feedback, turn management and sequencing phenomena. In: Martin, J.-C., et al. (eds.) Multimodal Corpora for Modelling Human Multimodal Behaviour, Special Issue of the International Journal of Language Resources and Evaluation, pp. 273–287. Springer, Heidelberg (2007)
3. Argyle, M.: Bodily Communication. Methuen, New York (1975)
4. Boersma, P., Weenink, D.: Praat: doing phonetics by computer (version 5.1.05), http://www.praat.org/ (retrieved May 1, 2009)
5. Cerrato, L.: Investigating Communicative Feedback Phenomena across Languages and Modalities. Ph.D. thesis, Stockholm, KTH, Speech and Music Communication (2007)
6. Duncan, S.: Some signals and rules for taking speaking turns in conversations. Journal of Personality and Social Psychology 23, 283–292 (1972)
7. Jokinen, K., Navarretta, C., Paggio, P.: Distinguishing the communicative functions of gestures. In: Proceedings of the 5th Joint Workshop on Machine Learning and Multimodal Interaction. Springer, Heidelberg (2008)
8. Kendon, A.: Gesture: Visible Action as Utterance. Cambridge University Press, Cambridge (2004)
9. Kipp, M.: Gesture Generation by Imitation - From Human Behavior to Computer Character Animation. Ph.D. thesis, Saarland University, Saarbruecken, Germany, Boca Raton, Florida, dissertation.com (2004)
10. Koutsombogera, M., Touribaba, L., Papageorgiou, H.: Multimodality in Conversation Analysis: A Case of Greek TV Interviews. In: Proceedings of the LREC Workshop on Multimodal Corpora from Models of Natural Interaction to Systems and Applications, Marrakesh, pp. 12–15 (2008)
11. MacWhinney, B.: The CHILDES Project: Tools for Analyzing Talk. Lawrence Erlbaum Associates, Mahwah (2000)
12. McClave, E.: Linguistic functions of head movements in the context of speech. Journal of Pragmatics 32, 855–878 (2000)
13. McNeill, D.: Hand and mind: What gestures reveal about thought. University of Chicago Press, Chicago (2000)
14. Morency, L., Sidner, C., Lee, C., Darrell, T.: Contextual Recognition of Head Gestures. In: Proceedings of the International Conference on Multi-modal Interfaces (2005)
15. Navarretta, C., Ahlsén, E., Allwood, J., Jokinen, K., Paggio, P.: Creating Comparable Multimodal Corpora for Nordic Languages. In: Proceedings of the 18th Nordic Conference of Computational Linguistics (Nodalida 2011), Riga, Latvia, May 11-13, pp. 153–160 (2011)
16. Navarretta, C., Paggio, P.: Classification of Feedback Expressions in Multimodal Data. In: Proceedings of ACL 2010, Uppsala, Sweden, pp. 318–324 (2010)
17. Paggio, P., Navarretta, C.: Head movements, facial expressions and feedback in first encounters interactions. In: Proceedings of HCI 2011, Orlando, Florida (to appear, July 2011)
18. Peirce, C.S.: Collected Papers of Charles Sanders Peirce. In: Hartshorne, C., Weiss, P., Burks, A. (eds.), vol. 8. Harvard University Press, Cambridge (1931-1958)
19. Poggi, I., Magno Caldognetto, E.: A score for the analysis of gestures in multimodal communication. In: Messing, L. (ed.) Proceedings of the Workshop on the Integration of Gesture and Language in Speech. Applied Science and Engineering Laboratories, Newark and Wilmington, Delaware, pp. 235–244 (1996)
20. Rehm, M., Nakano, Y., Andre, E., Nishida, T.: Culture-Specific First Meeting Encounters between Virtual Agents. In: Prendinger, H., Lester, J.C., Ishizuka, M. (eds.) IVA 2008. LNCS (LNAI), vol. 5208, pp. 223–236. Springer, Heidelberg (2008)
21. Yngve, V.: On getting a word in edgewise. In: Papers from the Sixth Regional Meeting of the Chicago Linguistic Society, pp. 567–578 (1970)
The Matrix of Meaning: Re-presenting Meaning in Mind
Prolegomena to a Theoretical Model
Rosa Volpe (1), Lucile Chanquoy (2), and Anna Esposito (3)
(1) University of Perpignan, 66000 Perpignan, France
[email protected]
(2) University of Nice Sophia Antipolis, 06100 Nice, France
(3) Seconda Università di Napoli and IIASS, 81100 Caserta, Italy
Abstract. Understanding the role mental representations play within the process of meaning structure comes with the understanding of the relationship between verbal semantics and predicate argument structures. While fulfilling a specific linguistic function, predicate argument structures also allow for the organization of more general information of extra-linguistic and perceptual nature. Previous research suggests that the development of linguistic competence cannot happen without bringing into play such general world knowledge, given that concepts get linked to elements of the perceptive world. Our research studies the role mental representations play within the process of meaning structure. To trigger some food for thought on a new model of meaning structure, we discuss the results of our experimental study on the intermodality image-text and we analyze the collected data from the perspective of Vygotsky's non-classical psychology, which implies a philosophical understanding of holography. Keywords: representation, predicate argument structures, perception, meaning structure, holography.
1 Introduction
Our approach to meaning structure goes beyond conceiving it simply in terms of bringing together linguistic forms and shapes. Needless to say, morphology, syntax, semantics, and pragmatics contribute to such a process, yet there is more to the alchemy of communication and understanding that needs to be uncovered. Although some issues about meaning structure may be hard to pin down, others come with a sense of immediacy and demand attention. This paper contends that, if the process of meaning structure is to be understood in terms of "putting the pieces together", then investigating (a) what is the nature of such pieces, and whether (b) there is a "who" in charge of putting the pieces together, as well as (c) where this "who" is to be found, needs to be addressed. Doing so also implies looking for "the situation" allowing for the pieces to come together while comprehending what is "the nature of the situation" that triggers such a process.
After reviewing some of the research that has lead us to our previous data collection on the inter-modality image-text, in order to study the role mental representations play in the process of meaning structure, we will present Varela’s notion of groundlessness [1] allowing us to introduce Vygotsky’s holographic perspective on the structure of meaning. Some preliminary considerations on a model of matrix of meaning will follow. 1.1 On the Nature of Meaning Structure When dealing with issues of meaning structure, the real heart of the matter can be summarized, according to Glenberg, as it follows: linguistic symbols, such as words and syntactic structures, have meaning only if non-linguistic pieces of experience such as action and perception underlie them [2]. Furthermore, in the attempt to explain “why language is due to have anchors well beyond a linguistic system,” Glenberg also explains the reasons why the process of meaning structure represents a major concern within the field of cognitive psychology. In other words, meaning controls memory and perception; meaning is the goal of communication; meaning underlies social activities and culture. Hence, because human cultures attribute meanings to natural phenomena, artifacts, and human relations, the way they engage in doing so also contributes to distinguishing each culture from one another [3]. Consequently, it would be worth investigating the process of meaning structure under this wider perspective rather than simply conceiving it as arising from the syntactic combination of abstract amodal symbols. Glenberg looks at meaning structure from the perspective of embodied cognition. Cognition, as Glenberg explains, has evolved due to the coordination of effective action responsible for survival and success in reproduction dependent on the structure of the body [4]. Such a consideration leads to the conclusion that mental representations of language are representations of situations, rather than of language itself. In fact, the understanding of a concept that is presented linguistically also calls upon perceptual experience [5]. A number of empirical studies support this point of view. According to the experiential approach, representations involved in language comprehension are of the same kind as those involved in sensory experiences, perceptions, and actions [6]. These studies also show that the processing of words and sentences triggers the activation of the brain’s motor system in language users. Such activations depend on the meaning of linguistic construction as well as the depth with which participants process them. Out of this understanding, the embodied cognition framework claims that facts of the body (including perceptual processes) play a prominent role in cognition; thus, the process of meaning structure undergoes the same dynamics [7]. Research also shows that meaning structure relies both on linguistic and nonlinguistic information, and that mental representations of extra-linguistic nature affect understanding [8]. We experience daily events and live through various situations trying to answer questions of this type: “who, does, what . . . ”. Such ‘tacit knowledge’ seems to constitute the ‘backbone’ of linguistic processing enabling comprehension and
production, and it is carried out by the verb’s thematic roles [9]. Several studies show that the processing of syntactic and semantic information depends on the structure of knowledge, and that comprehension results from the rapid integration of such knowledge. Kupersberg [10] measured Event Related Potentials, a multidimensional measure (P600 and N400) to test syntactic and semantic violations in sentence comprehension in order to detect the similarities and differences according to which linguistic information is processed. The results show that the meaning structure of a sentence engages parallel and serial processing accounting for non-linguistic information to come into play while processing linguistic one. McRay [11] has also found that in addition to contributing to the comprehension and production processes, non-linguistic information determines the formation of verbs’ thematic roles. In other words, the meaning structure comes from the interplay of linguistic and non-linguistic information. Take the verb to accuse, for instance, this verb will solicit a person’s mental representations of the role agent from his/her previous personal experience with the witnessing of people’s accusing other people. These studies on the interplay between linguistic and non-linguistic information suggest that general world knowledge gets internally organized into a “model” that comes with all of the information about objects, and situations, on the one hand, and with the order of the events from previous background knowledge and experience, on the other. Such a conclusion is supported by additional findings by Cordier & Pariollaud [12] who pointed out that accessing verb’s meaning comes with the understanding both of its general knowledge and the capturing of the specificity nature of its components. Hence, verbs contribute to the making up of the mental representation of the dynamics associated with a given situation. Such representation includes the semantic information of the verbal chunk as well as those elements associated with its thematic roles. On the occasion of an experimental study, Cordier and colleagues showed their subjects a few lists of verbs ‘out of context,’ that is to say with no mentioning of their thematic roles, yet, participants spontaneously evoked out loud to themselves the verb’s thematic role(s). Such behavior brought Cordier & Pariollaud to the conclusion that there exists in memory a link between the verb and its thematic roles and that individuals’ general knowledge about the world plays a fundamental role on the structure of meaning. Because verbs carry linguistic information about the syntax of their argument structures, it can be assumed that verbs give access to the overall structure of the situation and that linguistic competence depends on such knowledge, known as schema. Take the verb to entertain, it is not intuitively possible to conceive a situation of having to entertain without also conceiving someone to be entertained [13], for instance. A number of theoretical models have attempted to account for the ways general world knowledge contributes to the building up of mental structures responsible for organizing the way one experiences reality. Schema theory is at the foundation of Ferretti’s [9] framework and predicts that the knowledge built all along one’s life forms units which also include information about objects and situations together with the order they appear in. Knowledge about these situations gets assembled around a
number of well structured “gaps”, each one marked by a to be realized “value”. The schema corresponding to the situation of getting arrested, for instance, would come with a “gap” for the agent conducting the action of arresting, yet its “value” depends on whether it is the police, the soldier, or the security guard who carries out the action. With this respect, researchers agree that exists a general schema at the situational level responsible for organizing experience. Such structures also include more specific and well detailed knowledge about the situation one is witnessing. Previous research on the contribution that mental models play on written text comprehension, whether illustrated or not with pictures, has shown that general world knowledge gets organized into mental representations and it contributes to the structure of meaning when it comes to making sense of so to speak, “brand new” situations [14]. 1.2 The Image-Text Intermodality Data Collection Taking into account the implicit preexistence, within the mind, of such schemas making up the structure of more general mental representations, Volpe [19] and colleagues carried out a preliminary experimental study based on the intermodality image-text to trigger in the viewer, mental representations from watching a set of images ‘narrating’ a daily situation. A written sentence describing the previously displayed set of visual stimuli followed. Subjects were required to answer whether yes or not written sentences described the previously displayed set of images making up the corresponding visual sentences. Volpe [19] found that having to decide whether a written sentence describes YES or NOT the previously displayed visual sentence, participants were faster in answering correctly when the visual sentence and the written sentence were plausible. When written sentences were non-plausible time to respond took longer and more errors occurred. To test the role mental representations play in the process of meaning structures the stimuli were planned in such a way as to expose subjects to visual and written stimuli representing both plausible and nonplausible situations considering the four possible conditions listed in Table 1. To minimize the uncontrolled effects of non-standardized stimuli, such as film segments or other materials, the choice of the stimuli for this preliminary experimental study fell on classic standardized black and white still pictures (image-signs). The whole set of image-sign pictures consisted of a total of 200 “complete visual sentences”. These pictures were extracted from existing databases [15], [16]. Each visual sentence contained three image-sign pictures representing the “who” “does” “what” (agent, predicate, agent) mental representations. Each visual sentence was meant to “tell a story” and/or represent a “cognitive scene” from daily routine. A horizontal arrow “showed” the order in which the “visual reading” of these image-signs pictures should occur. The Appendix shows an example of how visual stimuli are combined with written ones following to the Plausible/Plausible, Plausible/Non-Plausible, NonPlausible/Plausible, Non-Plausible/Non-Plausible condition. The analysis of variance (ANOVA) accounted for: the length of time participants took to decide whether YES or NOT the written sentence describes the preceding visual scene, and the recurrence of errors during this decision making process.
Concerning the incidence of errors, the results show:
The Image: F(1,93) = 79.881, p < 0.0001
The Text: F(1,93) = 12.956, p < 0.0005
The interaction Image-Text: F(1,93) = 16.961, p < 0.0001
Table 1. The inter-modality IMAGE-TEXT and its conditions

Condition   IMAGE (visual sentence)   TEXT (written sentence)
1           Plausible                 Plausible
2           Plausible                 Non-Plausible
3           Non-Plausible             Plausible
4           Non-Plausible             Non-Plausible
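The four cells of Table 1 amount to a 2x2 crossing of plausibility over the two modalities. The sketch below is only an illustration of how a trial in this design could be represented in code; the stimulus content, the field names and the derived condition list are our assumptions, not the materials actually used in the study.

from dataclasses import dataclass
from itertools import product

@dataclass
class Trial:
    images: tuple          # three image-signs read left to right: who, does, what
    sentence: str          # written sentence shown after the visual sentence
    image_plausible: bool
    text_plausible: bool
    correct_answer: str    # "YES" if the sentence describes the displayed images

# The four conditions of Table 1 as a 2x2 crossing of plausibility.
CONDITIONS = [{"image_plausible": img, "text_plausible": txt}
              for img, txt in product((True, False), repeat=2)]

# Hypothetical condition-1 trial (both modalities plausible).
example = Trial(images=("young man", "waters", "flowers"),
                sentence="The young man waters the flowers.",
                image_plausible=True, text_plausible=True,
                correct_answer="YES")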
More specifically, when the Image is Plausible, the incidence of error is significantly greater when participants had to decide whether or not the following Text (written sentence) describing it is Non-Plausible than when it is Plausible, F(1,93) = 27.37, p < 0.001. This result is quite predictable. In fact, if both the Image and the Text are Plausible, then the process of matching up meanings from visual and written input happens smoothly. On the contrary, when the Image is Plausible and the Text is Non-Plausible, participants seem to 'cling' more heavily to the structure of the written text rather than being 'facilitated' by the immediacy of the visual input from the visual sentence. This being said, because Non-Plausible visual sentences are meant to portray a 'distorted reality' representing life situations that could not trigger mental representations of any sort (example: the flowers water the young man), participants clung to trying to understand the written sentence rather than relying on the Images.
Concerning the length of time taken to decide, the results show:
The Image: F(1,93) = 77.68, p < 0.0001
The Text: F(1,93) = 11.01, p = 0.001
The interaction Image-Text: F(1,93) = 0.1693, p = 0.6870
Hence, the response time is significantly greater when the Image (visual sentence) is Non-Plausible compared to the Plausible one, F(1,93) = 77.68 p < 0.0001. Non-plausible Images are meant to portray a ‘distorted reality,’ they represent life situations that could not trigger mental representations of any sort, as they make little or no sense. Having to decide whether the written sentence describes the previous visual one takes longer when the Image is Non-Plausible. When the Image is Plausible there is no ‘meaning matching conflict’ at the level of mental representation. The visually represented meaning matches with the ‘mentally represented reality,’ the choice is easily made, and the time it takes to make it is shorter. This does not apply when the visual sentence is Non-Plausible. Further details on the results are displayed in Tables 2 and 3.
Table 2. ANOVA with the dependent variable count of right/wrong answers and the independent variables IMAGE, TEXT, IMAGE*TEXT

              SS        DF    MS       F        P
IMAGE         9.8513     1    9.8513   79.8811  0
Error        11.4891    93    0.1233
TEXT          0.9855     1    0.9855   12.9565  0.000514
Error         7.0741    93    0.0761
IMAGE*TEXT    0.7511     1    0.7511   16.9608  0.000083
Error         4.1187    93    0.0443
Table 3. ANOVA with the dependent variable response time and the independent variables IMAGE, TEXT, IMAGE*TEXT

              SS           DF    MS           F         P
IMAGE         2.099902      1    2.099902     795.6886  0
Error         2.513928     93    2.703148
TEXT          5.812424      1    5.812424     11.0173   0.0021290
Error         4.906406     93    5.275705
IMAGE*TEXT    7.8485550     1    7.8485550    0.1633    0.687088
Error         4.470501     93    4.806990
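For reference, an analysis with the same structure (two within-subject factors and 94 participants, hence F(1,93) for each effect) can be reproduced on synthetic data with a standard repeated-measures ANOVA. The sketch below uses statsmodels on an invented data frame; the column names, effect sizes and choice of library are our assumptions, not the authors' actual analysis scripts.

import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
rows = []
for subj in range(94):                                # 94 participants -> error df = 93
    for image in ("Plausible", "Non-Plausible"):
        for text in ("Plausible", "Non-Plausible"):
            rt = (1.0
                  + 0.3 * (image == "Non-Plausible")  # synthetic main effect of IMAGE
                  + 0.1 * (text == "Non-Plausible")   # synthetic main effect of TEXT
                  + rng.normal(0.0, 0.1))             # subject/trial noise
            rows.append({"subject": subj, "image": image, "text": text, "rt": rt})
df = pd.DataFrame(rows)

# One F(1,93) per effect: IMAGE, TEXT and the IMAGE x TEXT interaction,
# mirroring the structure of Tables 2 and 3.
result = AnovaRM(df, depvar="rt", subject="subject", within=["image", "text"]).fit()
print(result.anova_table)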
With respect to the Non-Plausible Image-Text modality condition we expected to obtain the same results as the Plausible Image-Text one, given that the only possible correct answer to give was: YES, the written sentence describes the previously displayed visual sentence (although Non-Plausible). However, when both the written and visual modalities were Non-Plausible the subject’s ability to process their meaning was greatly affected. What went wrong, given that deciding whether YES or NOT the written sentence described the visual sentence should have been easy, as both stimuli were Non-Plausible? Is this result to be attributed to the fact that none of the stimuli portrayed any possible real life experience participants could relate to? How did participants go about constructing meaning from the displayed stimuli? Did their mental representations play any role in deciding whether YES or NOT the NonPlausible visual sentences matched with the written sentences? Why is it that the NonPlausible visual and written stimuli weren’t easier to process compared to the Plausible and Non-Plausible modality pairs? It needs to be said that having paired up the visual stimuli with the written sentences, and because the task consisted in deciding whether YES or NOT the written sentence described the visual one, we expected participants’ involvement in the decision-making process to be such as to exclude conceiving the visual stimuli as having passive structures (example: the flowers were watered by the young man). In fact, not only the task was clearly and unambiguously stated, but also previous exposure to the visual stimuli came first, given that the goal of such study was to trigger participants’ mental representations from previous real life experience and not inducing them to mentally constructing
passive sentences requiring linguistic processing. Understanding the role mental representations play within the construction of meaning comes with the understanding of the relationship between verbal semantics and predicate argument structures. While they fulfill a specific linguistic function, predicate argument structures also allow for the organization of more general information of extra-linguistic and perceptual nature [17]. Hence, the development of linguistic competence cannot happen without bringing into play a more general world knowledge. Barsalou [18] claimed that linguistic symbols and their related perceptive symbols happen simultaneously in each person. In other words, a schematic memory of a perceptive event arises at the same time as a schematic memory of a linguistic symbol. This suggests that linguistic and perceptive symbols share the same dynamics due to the interplay between linguistic and non-linguistic information within the process of meaning structure. Our data collection shows that the Non-Plausible visual and written stimuli were as cumbersome to process as the Plausible and Non-Plausible pairs [19]. Had the decision-making been based on experience alone (the visual stimuli rather than their combination with the written text), would the results have been any different? In other words, what if the participants had had to decide whether YES or NOT the visual sentences matched "possible real life experiences"? What would have been the incidence of errors within the process of meaning structure? How long would it have taken for them to decide? To gather further insights on the participants' decision-making process following exposure to the inter-modality Image-Text stimuli, we asked them to fill out a questionnaire at the end of their session. Below is a sample of answers drawn from the questionnaire.

QUESTION N. 1: What did you find hard or unpleasant during the data collection?
Subject One: Going from a Plausible (coherent) situation to a Non-Plausible (incoherent) situation was rather unpleasant, while going from a Non-Plausible (incoherent) situation to a Plausible (coherent) situation was rather pleasant.
Subject Two: I found the experience quite strange. I found that I didn't let myself be guided by the images because at times they seemed to make no sense at all. Too bad I got into this "controlling" pattern.

QUESTION N. 2: What did you find easy and/or pleasant?
Subject One: Coherent situations were pleasant and easy.
Subject Two: Some images and sentences matched up perfectly; this was the easy aspect of this experience. It was even very pleasant to let these images and their descriptions pass by and enjoy the show despite their absurdity, at times, once it became clear that they could be so.

QUESTION N. 3: What do the images bring to mind?
Subject One: The images I remember portrayed mostly daily life circumstances describing moments of one's private life, as well as one's professional life (for instance the one with the boat).
Subject Two: They remind me of familiar daily life actions, even tender at times; they remind me of childhood. The drawings do not portray our contemporary times, with no TV, nor computer . . . A simple life that the absurd side of the quite surprising associations enriches and poetizes . . .
The above data seem to point to the fact that Non-Plausible written sentences (although they followed previously displayed visual ones) triggered confusion, discomfort, and even uneasiness. In fact, in the participants' minds, things were not as "they were supposed to be." Interestingly enough, our data also unveil two opposite attitudes: on the one hand, wanting to "control" one's experience and possibly put things back "where they belong"; on the other, letting go of one's wishes and experiencing things as they are while "enjoying the show," as meaningless as it is. In other words, it would seem that, given the interplay of linguistic and non-linguistic information, the process of meaning structure also takes into account further cognitive and metacognitive abilities, such as attitudes and beliefs, to be considered, these latter, as additional "shades" of mental representations.
2 From Schema Theory to Groundlessness Unwillingly, the above different attitudes bring some insights on the role mental representations play within the process of meaning structure. According to Le Ny, semantic representations are nothing but “concepts in the mind” which correspond to “realities in the universe,” and they constitute what he calls “the individual lexicon” [20]. Le Ny’s approach to verb semantics has been said to be ‘transcognitive’, meaning that it includes notions borrowed from cognitive psychology, linguistics, logics, philosophy, biology and neurobiology. To understand Le Ny’s approach, let’s take for example the sentence Kathy met Sophie in the cafeteria after the break. Having heard (or read) it, what is recalled, Le Ny suggests, is a semantic postrepresentation which would match the perceptual representation one might have experienced had one witnessed such an encounter in reality. Visetti gives such a ‘reconstruction’ the name of interpretation and/or synopsis, and introduces the role of action (and of its actor) to explain the process of meaning structure [21]. Doing so brings about another element, namely the actor’s intention to carry out his/her own ‘story.’ According to Rosenthal, action comes also with anticipation and all together resorts to experience [22]. Immediate experience, whether it is a perception, a thought, an expression, a fantasy, is the result of some sort of development in the present time. Rosenthal identifies it as micro-development since it anticipates what it will allow to perceive, understand, or hear. From this perspective, meaning construction becomes an endless process, and the notion of meaning itself acquires a ‘generic shade,’ where perception takes the place of ‘the original modality of experience,’ while the notion of microgenetics implies a psychophysical process involving at the same time both the actual body and the field of experience.
Experiential theories based on embodiment claim that the body has to find itself in the appropriate emotional state in order to understand (emotional) language. Glenberg’s study shows, for instance, that people were able to quickly detect happy emotional states from reading when holding a pen in between their teeth – hinting to a smile, “because that’s how words become meaningful [23].” In their book The Embodied Mind Varela and colleagues claim: (. . .) Our human embodiment and the world that is enacted by our history (of coupling) reflect only one of many possible evolutionary pathways. We are already constrained by the path we have lied down, but there is no ultimate ground to prescribe the steps that we take [1].
Considering Varela’s assertion, we might wonder: isn’t it constraining to conceive meaning as the result of how “language arranges itself” to convey meaning? Isn’t language to be found in more than its structure? Do (mental) representations from previous experience play any role within the process of meaning structure? Barsalou warns about the misconception that the notion “embodied cognition” brings about. Bodily states, he says, are not necessarily needed for cognition, nor does research only investigate on bodily states. On the contrary, cognition proceeds independently of the body as there are many different ways cognition is grounded, such as simulations, situated action, and occasionally bodily states. Simulations are said to derive from the reenactment of perceptual, motor and introspective states acquired during experience with the world, body and mind. In other words, the brain captures the various states of experience and it integrates them with multimodal representations stored in memory. These representations are reactivated to simulate the way the brain represented the perception, action and introspection when the knowledge representing that category is needed [24], re-experiencing or reconstructing the initial sensory sensation [25]. Additional findings show that simulations are central for constructing future events on memories of past events [26]. As Barsalou, Yuri Alexandrov also believes that cognition should be defined in different ways other than memory, language, processing, problem solving and thinking. Instead, cognition should also include defining adaptive activities of individuals [27]. Within such a framework, language becomes an important carrier of consciousness given that it is essential in the collective achievement of results. Conscious facts can be shared through communication with others. By the use of language, individuals can evaluate their behavior and share with others such evaluation. In fact, language areas of the brain are involved in the organization of behavior even when overt verbalization is not required.
3 About Groundlessness and Its Implications for Meaning Structure Let’s now consider Varela’s notion of groundlessness claiming that reality is not grounded externally but it is the result of the subject’s enacted experience based on the circularity nature of mind and experience. Under this perspective, there is no subjective nor objective ground, all that is found instead is a world enacted by one’s history of structural coupling. These various forms of groundlessness are, according
to Varela and colleagues, but one: organism and environment enfold into each other and unfold from one another in the fundamental circularity that is life itself [1]. Hence, cognition emerges from the background of a world that extends beyond us but that cannot be found apart from our embodiment which is “contextually determined by our common sense” [1]. Cognition becomes then embodied actions inextricably woven with histories that are lived; these lived histories are the result of evolution as natural drift. In fact, embodied and grounded cognition theories predict that mind draws meaning out of the body’s experience. Furthermore, according to the Embodied Construction Grammar model, the understanding of everyday language depends on the triggering of mental simulations of its perceptual and motor content [26]. This model predicts that linguistic meanings become the parameters of some aspects of such simulations; thus they behave as an interface between the properties of language on the one hand, and the detailed and encyclopedic knowledge entailed by simulations on the other. Strong evidence exists that during language understanding embodied knowledge is unconsciously and automatically brought to bear [28]. Because the notion of groundlessness is tied to the circularity nature of mind and experience within the environment we suggest representing such dynamics in Figure 1.
Fig. 1. On the circularity nature of mind and experience within the Environment (E)
Research on embodied and grounded cognition [3, 4, 5, 6, 23, 26, 28] has shown that to successfully carry out a task, subjects rely upon previous experience and knowledge including motor, spatial and perceptual skills, that is to say what the mindbody had already experienced, and learned, within a given situation and environment. This being said, we suggest considering the role the environment’s whole structure plays within the process of meaning, rather than considering it as the sum of separate entities, such as objects, events and situations; under this perspective we call (E) such an environment. In fact, predicting the circularity nature of mind and experience also requires predicting the role the environment plays within such dynamics. Positing the circularity nature of mind and experience comes with positing the role the environment’s whole structure (E) plays for such circularity to happen. 3.1 From Groundlessness to Holography Having predicted the circularity nature of mind and experience and the role the environment plays within the process of meaning structure, Varela’s notion of
groundlessness allows us to bring some new insights on Vygotsky's holographic approach to the structure of meaning, thus making it possible to introduce the Matrix of Meaning (MoM). Before Varela spoke about groundlessness, Luria had founded his theory of language on the following assumption, based on Vygotsky's research: the social first, in other words, the external conditions of life [29]. Robbins explains that because Russian psychology places the entire personality of an individual within a structure that is holographic in nature, Luria's approach makes sense. In fact, contrary to Western psychology, the Russian approach conceives the social environment as well as the human consciousness, language, and the development of concepts and activity as important. Hence, understanding holography requires being able to conceive both the "whole" (social) and the "parts" (individual) [29]. Doing so helps to envision Varela's predicament about groundlessness: pursuing groundlessness requires going further into it rather than trying to culturally find another ground [1]. Once again, Luria's work offers instances of how this is actually possible. Luria worked with twin boys (. . .) who had not developed linguistically or mentally. Luria changed the overall learning environment of the boys, and in the end, the improvement made could be monitored when the boys were able to separate their actions from language, hence internalizations, where meaning was then relocated and transformed within a new sense of action [29].
Out of these observations Luria noticed that language contributed greatly to the development of the children. The notions of displacement and inner speech, more particularly, describe how the children were able “to detach themselves from the immediate situation, to subordinate their activity to a verbally formulated project, and by doing so, to stand in a new relation to this situation [29]. Such attitude fully reflects Varela’s concept of groundlessness [1].
4 Prolegomena to the Matrix of Meaning Having posited the fundamental role the environment (E) plays within the process of meaning structure, through the circularity nature of mind and experience, let’s now consider how Varela’s notion of groundlessness makes it possible to move towards conceiving a model of meaning structure based on the Matrix of Meaning (MoM), our prolegomena to a holographic model of meaning. In fact, bringing the notion of groundlessness into play makes it possible to understand how “the experiencing mind” ends up intertwining itself with the specificity of the situation (stimulus) of its immediate environment (e) despite the environment (E) within which the experiencing mind finds itself doesn’t come with any finite temporal nor spatial boundaries, its nature being holistic (E). By acknowledging the generic, unbound, holistic nature of the environment (E) the notion of groundlessness allows to attribute to the “circularity nature of the experiencing mind” the role of its competence: its interaction with the specificity of the situation (stimulus) of its immediate environment (e) which, together with the perceiver, represents the part of the whole (E).
If we consider Figure 2, the environment (E) represents, according to our view, the Matrix of Meaning (MoM). To understand the MoM the notion of groundlessness comes handy once again: “the very condition for the richly textured and interdependent world of human experience [1].” Although, this sentence expresses one of Varela’s definitions of groundlessness, we also believe that it describes our understanding of a MoM, because it emphasizes the relationship the latter entertains with the former. The MoM looks like a fabric. When looking at a fabric, each stitch contributes to the making up of the fabric, but taken independently, each stitch is meaningless. Such conception also entails that the MoM is made up of a network of Cells of Meaning (CoM); yet alone each CoM is meaningless. Although each CoM contributes to the structure of the MoM, the latter must not be understood as the sum total of X Cells of Meaning. In fact, the MoM is, in itself, meaning “in close dependence on the historical conditions of the social situation and the whole pragmatic the run of life (Volosinov [29]).” This being said, we also suggest thinking of the MoM in terms of “meaning in the bud.” In fact, potentially speaking meaning is all around, but it takes shape only when, stimulated by a specific event and situation (verbal or not verbal), the perceiver’s intention and motivation to “act upon it” gets triggered also because of previous embodied experiences. Although, both the CoM and MoM share the same nature, the former is the holographic representation of the latter, in other words, part of the whole. Furthermore, for the CoM to construct meaning, the MoM needs to be conceived. The opposite cannot apply.
Fig. 2. The Matrix of Meaning and its relation to the Cell(s) of Meaning
The MoM basic assumption establishes that the body (a cell of meaning) is a “bundle of meanings,” (from memories making up meaning). On the other hand, the body partakes within the process of meaning structure, of which it is a manifestation. The meaning structure for a given situation (and a given perceiver) happens from the pairing up of external stimuli with the perceiver’s pre-existing underlying meaning representation (CoM). As a matter of fact, the external stimulus works as the entre deux (between the two), that is to say, between the CoM (the perceiver’s previous knowledge and experience (e) and the MoM (the unbound, generic, environment (E)
of which the perceiver partakes of as part of the whole), as Figure 3 shows. It is also important to understand that there is no “comes first” dynamics within such a conception, which is characterized by a circular nature, the same that within the conception of groundlessness describes the circular nature of mind and experience. Groundlessness comes then into play to account for all that stands beyond grounded cognition, thus contributing to the whole process of meaning structure. In fact, as research on embodied and grounded cognition has made it possible to unveil the role non-linguistic information plays within the process of meaning structure, the notion of groundlessness opens the doors to further investigation for understanding such process. Furthermore, by introducing the notion of groundlessness Varela’s wish is to challenge a deeper understanding of the circularity nature of mind and experience in order to explain the process of meaning structure. Namely, the dynamic structure of meaning grows out of the circularity nature of mind(body) and experience. Predicting the circularity nature of mind and experience requires to predict the role the environment plays within such dynamics.
Fig. 3. The process of making Meaning (M) happen
4.1 From the Matrix of Meaning to the Cell of Meaning Conceiving the MoM out of Varela’s notion of groundlessness is meant to emphasize the fact that it is due to the circularity nature of mind and experience that meaning gets environmentally, socially, historically, cognitively, bodily and emotionally structured. Having established that the MoM, as a whole, is boundless and groundless in its nature, let’s now posit that each CoM (the perceiver) contributes to the process of meaning structure only if the MoM is conceived. This means that although each CoM comes with the potential for meaning, since it partakes in the overall structure of the environment (E) - namely the MoM, only its coming together with a given stimuli “in search of meaning” can trigger and produce the specific meaning structure the external stimuli solicit (e). In fact, the very characteristics of the external stimulus
crossing the perceiver's field is that, on its own, it has no meaning. This explains the holographic relationship the CoM entertains with the MoM. In other words, for meaning to exist, the perceiver (CoM) has to be conceived as well. Given that the perceiver's peculiarity, representing the part of the whole, rests on the circular nature of his/her mind and experience within the environment (E), previously posited as the MoM, (s)he naturally partakes in its holistic structure. In other words, the MoM and the CoM feed into each other, the latter being the part of the whole, as represented in Figure 4.
Fig. 4. The elements that make up meaning (the Cell of Meaning, the circular nature of mind and experience, the entre deux, and the external stimuli "in search for meaning")
4.1.1 Bringing the Pieces Together: On the Circular Nature of the CoM

In his book The Embodied Mind [1], Varela posits that mind and experience cannot be taken apart; on the contrary, there is a circular process linking the two. Understanding the circular nature of mind and experience implies understanding the complex dynamics that brings together the various intertwined fields of knowledge. Having posited the CoM as being part of the whole, its structure reflects the circular nature of local and global knowledge. The external stimulus, which instantiates the nature of verb semantics (agent and patient roles, or predicate argument structures, PAS), acts as entre deux (in between the two), as described in Figure 5.
Fig. 5. The structure of the Cell of Meaning (local and global knowledge linked by mind and experience, with the PAS as entre deux)
Let us now consider the nature of the various intertwined fields of knowledge that, through the circular nature of mind and experience, partake in the making up of the local and global knowledge structures the CoM and the MoM feed upon and share for meaning to happen. By bringing about one's potential for producing language-specific meaning, predicate argument structures (PAS) have the specific function of bridging linguistic information with extra-linguistic information [30]. Hence, understanding verb meaning implies mastering both its general meaning and the specificity of its components [12]. Furthermore, verb semantics also speaks of the way one 'looks at' each situation [20]. Visetti gives the same phenomenon the name of interpretation, or synopsis, to bring out the role that action and its actor play within such a process [21]. Hence, intentionality becomes another intervening component within the dynamics of meaning structure, where experience is "becoming" and perception takes the place of the primary modality of experience, since it is in the present time that perception takes shape [31]. Barsalou predicts that it is possible to build a theory of knowledge based on perception. The perceptual experience would be the point of departure of such a process, with selective attention playing a major role by extracting a number of components from it in order to build a number of simulators having the function of concepts. Simulators make up a "type" capable of producing categorical inferences that intertwine productively to form complex (conceptualization) simulations that have nothing to do with previous experience, resulting in propositions that build up "tokens" in the world. Such a model represents both concrete and abstract concepts [24]. Varela's embodiment and enactive approach accounts for such interactions [1]; Rosenthal talks of micro-developments to account for the genetic circularity of mind and experience [31]. Pribram even challenges the concept of mind, given that it "makes something concrete out of something very multifaceted" [33]. We believe that both Pribram's and Rosenthal's positions point at Varela's notion of groundlessness, although from a different perspective. Viewing cognition as embodied action leads to a truer understanding of groundlessness, one which includes transformative approaches to experience as embodied groundlessness [1].

4.2 From the Matrix of Meaning to the Holographic Structure of Meaning

Both the CoM and the MoM are characterized by a circular dynamics, and they account for the notion of groundlessness, the first step towards predicting a holographic theory of meaning. Holistic and holographic approaches to meaning structure have already been anticipated by several researchers [29]. Namely, Luria posited the existence of a "functional system" allowing one to grasp the relationship of the parts making up the whole. According to Luria, the parts of this system may be scattered over a wide area of the body and get united only on the occasion of the execution of a task. The functional system operates as a complete entity, organizing the flow of excitation and coordinating the activity of the individual organs [28]. This being said, although the holographic model includes phenomena that cannot yet be explained, Bohm believes that this lack of knowledge speaks of an "implicate" (enfolding) order at work at a much deeper level, and that one could transform it at will if one better understood
the principles of holography [25]. Our prolegomena to a holographic theory of meaning intend to contribute to such understanding. In this respect, a neuronal holographic (or similar) process implies that input information is distributed over the entire depth and surface of the brain; however, only those limited regions where reasonably stable junctional designs are initiated by the input participate in the distribution [25]. We believe that, from the perspective of our Matrix of Meaning Model, such a claim goes hand in hand with our prediction about the role each cell of meaning plays within the process of meaning structure. We owe our initial inspirational thoughts to Dorothy Robbins [29] and to Pribram's research on the holographic brain [25]. Pribram believes that there is a holographic informational representation that is distributed throughout all neural patterns, just as this distribution is found in holographic photographic records. To actually operate the MoM, we predict that, just as Fourier transforms are known to be the key to understanding a hologram and holographic theories, the concepts of MoM and CoM are the key to conceiving a holographic approach to meaning structure. Explaining the circular nature of the MoM and CoM on the basis of Fourier transforms, which predict that any signal can be expressed as a sum of sinusoids, is our next undertaking.
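As a minimal, standard illustration of the latter claim (added here for clarity; it is textbook Fourier analysis, not part of the authors' model), any sufficiently well-behaved periodic signal x(t) with period T admits the decomposition

x(t) = \frac{a_0}{2} + \sum_{k=1}^{\infty}\left[a_k\cos\left(\frac{2\pi k t}{T}\right) + b_k\sin\left(\frac{2\pi k t}{T}\right)\right],
\qquad a_k = \frac{2}{T}\int_0^{T} x(t)\cos\left(\frac{2\pi k t}{T}\right)dt,
\qquad b_k = \frac{2}{T}\int_0^{T} x(t)\sin\left(\frac{2\pi k t}{T}\right)dt.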
5 Conclusions and Implications for Future Research

When looking at the preliminary qualitative and quantitative results of our intermodality image-text experimental study, it might be tempting to say that, indeed, language works as the ultimate ground "prescribing the steps we take," and that experience does not work as a dynamic unfolding of differentiation. Conceiving the elaboration of a holographic model of meaning by integrating the notion of groundlessness within the process of meaning structure might open some new perspectives. The conception of a MoM is our contribution in this direction. Such an approach can contribute to a number of fields by enlightening the relationship between language and thought, mind and brain. The notion of groundlessness bears, we believe, some important implications for language and meaning structure. The next step could be to define a research agenda supporting our Matrix of Meaning Model through additional data collection.

Acknowledgments. Particular thanks to Prof. Thierry Baccino, the University of Nice, for supporting this research, as well as to the members of the cognitive psychology laboratory for contributing in many different ways to this experimental study. Special thanks also go to Prof. Jérôme Boissier, UMR 5244, Centre de Biologie et d'écologie Tropicale et Méditerranéenne, CNRS UPVD, for his assistance with the statistical analysis. This work has been supported by the European projects COST 2102 "Cross Modal Analysis of Verbal and Nonverbal Communication" (http://cost2102.cs.stir.ac.uk/) and COST ISCH TD0904 "TMELY: Time in Mental activity" (http://w3.cost.eu/index.php?id=233&action_number=TD0904).
References

1. Varela, F.: L'inscription corporelle de l'esprit. Seuil, Paris (1993)
2. Glenberg, A.M., Robertson, D.A.: Symbol Grounding and Meaning: A Comparison of High Dimensional and Embodied Theories of Meaning. Journal of Memory and Language 43, 379–401 (2000)
3. Glenberg, A.M., Kaschak, M.P.: Grounding language in action. Psychonomic Bulletin & Review 9, 558–565 (2002)
4. Glenberg, A.M.: Language and action: creating sensible combinations of ideas. In: Gaskell, G. (ed.) The Oxford Handbook of Psycholinguistics, pp. 361–370. Oxford University Press, Oxford (2007)
5. Glenberg, A.M., Havas, D., Becker, R., Rinck, M.: Grounding Language in Bodily States: The Case for Emotion. In: Zwaan, R., Pecher, D. (eds.) The Grounding of Cognition: The Role of Perception and Action in Memory, Language, and Thinking. Cambridge University Press, Cambridge (2005)
6. Zwaan, R.A., Taylor, L.J.: Seeing, Acting, Understanding: Motor Resonance in Language Comprehension. Journal of Experimental Psychology: General 135(1), 1–11 (2006)
7. Havas, D.A., Glenberg, A.M., Rinck, M.: Emotion simulation during language comprehension. Psychonomic Bulletin & Review 14, 436–444 (2007)
8. Taylor, L.J., Zwaan, R.A.: Action in Cognition: The Case of Language. In: Language and Cognition, http://www.brain-cognition.eu/publications/L&C_LT&RZ_inpress.pdf
9. Ferretti, T.R., McRae, K., Hatherell, A.: Integrating Verbs, Situation Schemas, and Thematic Role Concepts. Journal of Memory and Language 44, 516–547 (2001)
10. Kuperberg, G., Caplan, D., Sitnikova, T.: Neural correlates of processing syntactic, semantic, and thematic relationships in sentences. Language and Cognitive Processes 21(5), 489–530 (2006)
11. McRae, K., Spivey-Knowlton, M.J., Tanenhaus, M.K.: Modeling the influence of thematic fit (and other constraints) in on-line sentence comprehension. Journal of Memory and Language 38, 283–312 (1998)
12. Cordier, F., Pariollaud, F.: From the choice of the patients for a transitive verb to its polysemy. Current Psychology Letters 21(1) (2007)
13. Gibson, J.J., Gibson, E.J.: Perceptual learning: Differentiation or enrichment? Psychological Review 62, 32–41 (1955)
14. Zwaan, R.A., Radvansky, G.A.: Situation Models in Language Comprehension and Memory. Psychological Bulletin 123(2), 167–185 (1998)
15. Bonin, P., Peereman, R., Malardier, N., Méot, A., Chalard, M.: A new set of 299 pictures for psycholinguistic studies: French norms for name agreement, image agreement. Behavior Research Methods, Instruments, & Computers 35(1), 158–167 (2003)
16. Alario, F.-X., Ferrand, L.: A set of 400 pictures standardized for French: Norms for name agreement, image agreement, familiarity, visual complexity, image variability and age of acquisition. Behavior Research Methods, Instruments, & Computers 31, 531–552 (1999)
17. Valette, M.: Linguistiques énonciatives et cognitives françaises. Honoré Champion, Paris (2006)
18. Barsalou, L.W.: Perceptual symbol systems. Behavioral and Brain Sciences 22, 577–660 (1999)
19. Volpe, R.: Representing Meaning in Mind: When Predicate Argument Structures Meet Mental Representations. In: Esposito, A., Esposito, A.M., Martone, R., Müller, V.C., Scarpetta, G. (eds.) COST 2010. LNCS, vol. 6456, pp. 165–179. Springer, Heidelberg (2011)
20. Le Ny, J.-F.: La sémantique des verbes et la représentation des situations. Syntaxe & Sémantique – Sémantique du lexique verbal, 2 (2000)
21. Visetti, Y.-M.: La place de l'action dans les linguistiques cognitives. Texto, mars (1998)
22. Rosenthal, V.: Perception comme anticipation: vie perceptive et microgenèse. In: Sock, R., Vaxelaire, B. (eds.) L'Anticipation à l'horizon du Présent, pp. 13–32. Mardaga (Collection Psychologie et Sciences Humaines), Liège (2004)
23. Glenberg, A.M., Kaschak, M.P.: The body's contribution to language. In: Ross, B. (ed.) The Psychology of Learning and Motivation, vol. 43, pp. 93–126. Academic Press, New York (2003)
24. Barsalou, L.W.: Perceptual symbol systems. Behavioral and Brain Sciences 22, 577–660 (1999)
25. Pribram, K.H., McGuinness, D.: Arousal, activation, and effort in the control of attention. Psychological Review 82(2), 116–149 (1975)
26. Bergen, B.K., Chang, N.: Embodied Construction Grammar in Simulation-Based Language Understanding. In: Evans, V., et al. (eds.) The Cognitive Linguistics Reader. Equinox (2007)
27. Alexandrov, Y., Sams, M.E.: Emotion and consciousness: Ends of a continuum. Cognitive Brain Research 25, 387–405 (2005)
28. Zwaan, R.A., Taylor, L.J.: Seeing, Acting, Understanding: Motor Resonance in Language Comprehension. Journal of Experimental Psychology 135(1), 1–11 (2006)
29. Robbins, D.: Generalized holographic visions of language. In: Vygotsky, L., et al. (eds.) Intercultural Pragmatics, vol. 2(1), pp. 25–39 (2005)
30. Badecker, W.: On Some Proposals Concerning the Status of Predicate Argument Structure Representations. Brain and Language 40, 373–383 (1991)
31. Rosenthal, V.: Perception comme anticipation: vie perceptive et microgenèse. In: Sock, R., Vaxelaire, B. (eds.) L'Anticipation à l'horizon du Présent, pp. 13–32. Mardaga (Collection Psychologie et Sciences Humaines), Liège (2004)
32. Mishlove, J.: Holographic Brain. For the TV program Thinking Allowed: Conversations on the Leading Edge of Knowledge and Discovery (1998, TV interview available on paper), http://twm.co.nz/pribram.htm
APPENDIX: An Example of the Four Conditions Used for the IMAGE-TEXT Presentations

Condition One: Image Plausible, Text Plausible: The young man waters the flowers. Expected answer: YES (the Text describes the Image).
Condition Two: Image Plausible, Text Non Plausible: The flowers water the young man. Expected answer: NO (the Text doesn’t describe the Image).
Condition Three: Image Non Plausible, Text Plausible: The young man waters the flowers. Expected answer: NO (the Text doesn’t describe the Image).
Condition Four: Image Non Plausible, Text Non Plausible: The flowers water the young man. Expected answer: YES (the Text describes the Image).
Investigation of Movement Synchrony Using Windowed Cross-Lagged Regression

Uwe Altmann

Institute of Psychology, Friedrich-Schiller-University Jena, Germany
[email protected]
Abstract. Movement synchrony is studied in various fields of research because the occurrence of movement synchrony correlates with the quality of interaction in terms of liking, rapport, and affiliation. Usually, movement synchrony is investigated with time series and a window-wise computed cross-lagged correlation. This paper is concerned with the problem that (windowed) cross-lagged correlation can be confounded by auto-correlation, which may lead to biased conclusions about movement synchrony. The proposed solution combines the idea of a window-wise computed measure with the methodological framework of autoregressive models. As shown through simulated time series, the new method is robust against auto-correlation and identifies the time lag and duration of movement synchrony correctly. Finally, the method is applied to real time series from a pilot study on children's nonverbal behaviour. Friend vs. non-friend dyads are compared in neutral vs. conflict situations regarding the occurrence of movement synchrony. Keywords: nonverbal behaviour, synchronisation, automatic identification, interpersonal conflicts, children's friendships.
1 Introduction
Synchrony is a general term for phenomena like imitation of gestures, facial mimicry, posture mirroring, simultaneous changes of voice parameters, etc. Synchrony has been studied in various types of interactions. Feldman et al. summarise for mother-infant interactions that "synchrony describes a timebound, co-regulatory lived experience within attachment relationships that provides the foundation for the child's later capacity for intimacy, symbol use, empathy, and the ability to read the intentions of others" [8, p. 330]. In peer interactions during early childhood, synchrony holds a central role regarding the learning of social skills [11]. In interactions within an institutional context (e.g. student-teacher, physician-patient, psychotherapist-patient), synchrony mediates professional competence and effectiveness [3], [14], [17]. Moreover, an avatar's ability to synchronise intensifies the impression of a naturalistic human-machine interaction [7], [15]. To reduce the diversity of the analysed phenomena, in this paper we focus on movement synchrony (short: sync). It can be defined as an observer impression: as
dance-like, well-coordinated, or well-timed body movements [3]. An often used methodology is [6, p. 360-361]:

1. Capture the body movements of both persons (automatically),
2. identify synchronous movements in the resulting time series, and
3. study the relation between the occurrence of movement synchrony and a variable of interest, e.g. the success of psychotherapy.

In step 2, movement synchrony is usually operationalized as a temporary linear relationship between two time series which describe the movements of the persons over time [5], [17], [19]. The standard method is windowed cross-lagged correlation (WCLC, [5], [17], [19]). The computation of a window-wise measure takes into account the circumstance that humans synchronise their behaviour sporadically and temporarily. But the method does not take into account the possibility of obtaining a significant cross-correlation between two time series which are independent of each other. Such spurious cross-correlations can arise if both time series are auto-correlated (cyclic) [9], [18], [20]. In other words, (windowed) cross-lagged correlation can be biased by auto-correlation, and consequently so can conclusions about the occurrence of movement synchrony. In this paper, we propose windowed cross-lagged regression (WCLR) as a solution. In section 2, we introduce the identification of movement synchrony with WCLR and compare the results of WCLC and WCLR in the case of auto-correlated time series. In section 3, WCLR is applied to time series from a study on children's nonverbal behaviour. The occurrence of movement synchrony in friend vs. non-friend dyads is compared in neutral vs. conflict situations. Finally, we discuss the results of the study and the developed methods.
2 WCLR and Its Validation
Sync can be operationalised as a temporary linear relationship between two time series (e.g. X1,t and X2,t with t ∈ {1, 2, . . . , T}) which describe the movement behaviour of two persons. Usually, such relationships are determined with WCLC (for details see [5]). In WCLC we consider short intervals of the two time series (called windows). The term "cross" means that it is a correlation of two different time series. The term "lagged" refers to the fact that a time lag (τ) may lie between the two windows. First, we fix the bandwidth of the windows (b) and choose one person (e.g. person A) whose behaviour shall be explained by the interaction partner. Next, the correlation of a window at tstart and the window of the reference person at tstart + τ is computed. This step is repeated for a set of time lags (e.g. τ ∈ {0, 1, . . . , 10}) and all possible values of tstart. With R²_WCLC = r_WCLC · r_WCLC we quantify the variance of a later window which is explained by the variance of a previous window. Before turning to the further development of WCLC, a look back into interaction research is necessary. We often find models which describe the relationships
between the variables during the whole interaction sequence. Cross-lagged regression (CLR) with time lag τ = 1 is a well known model in interaction research (e.g. [2], [6]):

Person 1:  X1,t+1 = β10 + β11·X1,t + β12·X2,t + ε1,t
Person 2:  X2,t+1 = β20 + β21·X2,t + β22·X1,t + ε2,t        (1)
In this model, β11 and β21 quantify the linear relationship between a variable at t and itself at t + 1. These relationships correspond to the auto-correlation of the variables. β12 and β22 indicate that one's behaviour can predict the later behaviour of the interaction partner. The relationship between X1,t+1 and X2,t is not confounded by auto-correlation, because X1,t is a covariate in the model. The same applies to the relationship between X2,t+1 and X1,t.
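To make the window-wise computation concrete, the following minimal sketch (ours, not the author's implementation; function and parameter names are illustrative) computes R²_WCLC for all window positions and a symmetric set of time lags, assuming the two movement time series are given as numpy arrays:

import numpy as np

def wclc(x1, x2, b=100, max_lag=50):
    # Squared window-wise cross-correlations for all window positions and lags.
    T = len(x1)
    lags = np.arange(-max_lag, max_lag + 1)
    starts = np.arange(max_lag, T - b - max_lag + 1)   # admissible window positions
    r2 = np.full((len(starts), len(lags)), np.nan)
    for i, t0 in enumerate(starts):
        w1 = x1[t0:t0 + b]                             # window of person A at t0
        for j, tau in enumerate(lags):
            w2 = x2[t0 + tau:t0 + tau + b]             # lagged window of person B
            r = np.corrcoef(w1, w2)[0, 1]              # Pearson correlation
            r2[i, j] = r * r                           # R^2_WCLC: shared variance
    return starts, lags, r2

A dark cell in a plot such as Figure 1 below corresponds to a large entry of r2.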
2.1 Exploration of Temporary Relationships with WCLR
The framework of autoregressive models, which provides an unconfounded relationship measure, can be combined with the idea of a window-wise computed measure. As in WCLC, we fix the window width b and compute the windowed cross-lagged regression (WCLR) for all possible window positions and a set of time lags. To get a standardized relationship measure (like R²) we compare two models¹ for each combination of tstart and τ:

Model 1:  X1,t+τ = β0 + β1·X1,t + ε1,t        (2)
Model 2:  X1,t+τ = β0 + β1·X1,t + β2·X2,t + ε1,t        (3)
In simple terms, model 1 includes only auto-correlation and model 2 both auto- and cross-correlation. Using the coefficients of determination (R²_Model1 and R²_Model2), the variance explained by cross-correlation can be quantified as

R²_CC = R²_Model2 − R²_Model1 .        (4)
R²_CC is standardized in the sense that R²_CC ∈ [0, 1]. If R²_CC > 0, then the model with cross-correlation fits the current position tstart and time lag τ better than the model without cross-correlation. With a (window-wise) R² difference test we can check whether R²_CC differs statistically significantly from zero. This respects the claim of [16] that synchrony must be more than random. Figure 1 allows a visual comparison of WCLC and WCLR. The upper plot shows the curves of two temporarily coupled oscillators. Both curves are auto-correlated. The persons synchronise their behaviour only in the intervals [100, 150] and [250, 300]. In the first interval person B (red curve) is the interaction leader; in the second it is person A (blue curve). WCLC (middle plot) and WCLR (lower plot) should identify a significant relationship between both time series only for these two intervals. The plots of WCLC and WCLR can be read in the same way. The colour indicates the value of R² resp. R²_CC (white: low values, black: large values). For example, a large value at t = 250 and τ = 20 means that the behaviour of person A around t = 250 can explain the behaviour of person B 20 time units later.
¹ In synchronisation research, a previous approach based on such model comparisons can be found in the work of [9].
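A minimal numerical sketch of this window-wise model comparison (an ordinary least-squares fit via numpy; the helper names are ours and the simplifications deliberate, so this is an illustration rather than the author's MATLAB code):

import numpy as np

def r2_ols(y, X):
    # R^2 of an OLS fit of y on the columns of X (intercept added here).
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

def r2_cc(x1, x2, t0, tau, b=100):
    # Variance in person A's later window explained only by person B (eq. 4).
    y = x1[t0 + tau:t0 + tau + b]                      # X1 at t + tau within the window
    a = x1[t0:t0 + b]                                  # X1 at t (auto-regressive term)
    c = x2[t0:t0 + b]                                  # X2 at t (cross term)
    r2_model1 = r2_ols(y, a)                           # model 1: auto-correlation only
    r2_model2 = r2_ols(y, np.column_stack([a, c]))     # model 2: auto- and cross-correlation
    return r2_model2 - r2_model1                       # R^2_CC of eq. (4)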
[Figure: three stacked panels, "curves", "R² of WCLC", and "R²_CC of WCLR", plotted over time (x-axis) and time lag (y-axis of the lower two panels)]
Fig. 1. The WCLC and WCLR results for two temporarily coupled oscillators
The sign of τ indicates the "leader" (plus, if person A leads person B; minus, if person B leads person A). In the WCLC plot, we see dark areas during the whole interaction time. In other words, the method suggests a (nonexistent!) relationship between the behaviour of both persons over the whole time interval. In the WCLR plot, we see dark areas only in the intervals of coupling.
2.2 Identification of Sync-Intervals with a Peak-Picking Algorithm
The WCLR plot in Figure 1 discloses another problem. Within the intervals of coupling there are several parallel lines, which can be explained by the cyclic nature of the data. Which of these mirrors synchronised movements? To find an answer, we simulate time series (with Gauss pulses, length: 80 points of time) for which the time lag, beginning and duration of movement synchrony are known (see Figure 2). The first sync interval starts at t = 80 with τ = 40 (person A leads) and the second interval starts at t = 220 with τ = −21 (person B leads).
[Figure: three stacked panels, "curves", "R²_CC of WCLR and peaks selected with Boker's algorithm (red dots)", and "R²_CC of WCLR and peaks selected with the new algorithm (green lines)", plotted over time and time lag]
Fig. 2. The selection of R²_CC peaks with two algorithms
The middle plot shows the R²_CC of WCLR and the peaks which are selected by the algorithm of [5]. The time lag is not identified correctly at every point in time. Moreover, around t = 210 the algorithm suggests the wrong person as interaction leader (wrong sign of τ). Due to these results, we implemented an alternative peak-picking algorithm in MATLAB. It selects R²_CC peaks which lie on a line and have the largest R²_CC values compared to alternative R²_CC peaks in the same time interval (see the lower plot in Figure 2). The first selected peak interval is [45, 165]. Person A is correctly identified as leader with time lag τ = 40. The second selected interval is [225, 320] with person B as leader (τ = −21). In both intervals, the length of the Gauss pulse plus the time lag is equal to the length of the peak interval (80 + 40 = 165 − 45 and 80 + 21 ≈ 320 − 225).
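As a rough illustration of the selection step (a much-simplified stand-in for the iterative MATLAB algorithm; the threshold and minimum run length are illustrative choices, not values from the paper), consecutive windows whose best lag is identical can be grouped into candidate sync intervals:

import numpy as np

def sync_intervals(r2cc, lags, threshold=0.2, min_run=10):
    # r2cc: (n_windows, n_lags) array of R^2_CC values from WCLR.
    best = lags[np.argmax(r2cc, axis=1)]          # best lag per window position
    peak = r2cc.max(axis=1)                       # corresponding R^2_CC value
    intervals, start = [], None
    for i in range(len(best)):
        if peak[i] < threshold:                   # no convincing peak in this window
            if start is not None and i - start >= min_run:
                intervals.append((start, i - 1, int(best[start])))
            start = None
        elif start is None or best[i] != best[start]:
            if start is not None and i - start >= min_run:
                intervals.append((start, i - 1, int(best[start])))
            start = i                             # open a run at the new stable lag
    if start is not None and len(best) - start >= min_run:
        intervals.append((start, len(best) - 1, int(best[start])))
    return intervals                              # (first window, last window, lag)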
2.3 Concluding Remarks
Movement synchrony is operationalised as a temporary linear relationship between the time series of both interaction partners whose time lag is stable over the time
interval of synchrony. It is necessary that the behaviour of a person is represented by one time series (e.g. the intensity of body motions). The introduced method consists of:

1. WCLR of the two time series, which explores the temporary relationships,
2. selection of relationships with a stable time lag using the WCLR output and an iterative peak-picking algorithm, and
3. computation of sync measures (e.g. time lag, duration) based on the selected peak intervals.

The sign of the time lag opens up the possibility to study interaction leadership, because it indicates which person predicts the later behaviour of the interaction partner. Moreover, based on the duration of the selected peak intervals, the occurrence of movement synchrony within an episode can be computed. This measure allows the comparison of interactions which occur under different conditions (e.g. neutral vs. conflict situations).
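The occurrence measure mentioned here (and used in Section 3) can be sketched as a simple proportion, assuming interval tuples such as those produced by the selection sketch above and ignoring the exact book-keeping of window width and lag:

def sync_occurrence(intervals, episode_length):
    # Total length of all selected sync intervals relative to the episode length.
    total = sum(end - start + 1 for start, end, _lag in intervals)
    return total / episode_length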
3 Movement Synchrony in Interactions of Children
In this section, the introduced methods are applied to real time series from a study on children's nonverbal behaviour (for details about the study see [1] and [12]). Based on the work of [11] and [13], we hypothesise that (1) dyads of friends synchronise more often than non-friends and that (2) dyads synchronise more often within neutral situations than within conflict situations. Moreover, we assume that these differences occur with no interaction between situation and interpersonal relationship. For example: dyads of non-friends synchronise more often within neutral situations than within conflict situations.
3.1 Methods
All participants of the study are from a German school class and are 12 or 13 years old. The sample includes interactions of 7 non-friend dyads and 6 friend dyads. Each dyad interacts in a neutral situation and in a conflict situation. In sum, we have N = (7 + 6) · 2 = 26 episodes. In order to compare neutral and conflict situations we used the futuristic computer game AquaNautic, which was developed by [4]. One child navigates a submarine, and the other one collects artefacts with the submarine's tractor beam. The game level ends in the case of too many collisions or if too few artefacts were collected. At the end of a game level, the computer displays in big red letters the name of the player who has caused the end. This can be understood as the initial moment of an interpersonal conflict. Discussions about bad playing can be expected. Each dyad solves one tutorial at the beginning and after that normal game levels. This allows a comparison of neutral situations (the pause after the tutorial; the bad player's name is not displayed) and conflict situations (the other pauses; the bad player's name is displayed).
Based on the video clips of the interactions, body movements were captured automatically with Motion Energy Analysis [10], [19] for each child separately (for details see [1]). We obtained 24 measurements per second. Motion Energy can be termed a global measure: head movements, gestures, body movements, etc. at the same point of time are summarised into one measure. After data preprocessing (Anscombe transformation to stabilize the variance, and smoothing splines), movement synchrony sequences were identified with the new method (WCLR bandwidth b = 100 frames ≈ 4 sec). The movement synchrony occurrence of an episode is operationalized as a proportion: the sum of the lengths of all sync sequences within the episode relative to the episode length. As an illustration, Figure 3 shows the Motion Energy time series of an episode and the identified sync sequences (marked in gray). Compared to the simulated time series (see Figure 2), the Motion Energy time series are more complex. Sync sequences could not be identified by visual curve inspection. This underlines the need for an automatic identification algorithm.
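A hedged sketch of this preprocessing chain (a generic frame-differencing proxy for motion energy inside a person-specific region of interest, followed by the Anscombe transform; this is an illustration, not the actual tool used in the study, and the smoothing-spline step is omitted):

import numpy as np

def motion_energy(frames, roi, diff_threshold=10):
    # frames: iterable of greyscale images (2-D arrays); roi: (row_slice, col_slice).
    energy, prev = [], None
    for frame in frames:
        crop = frame[roi].astype(float)
        if prev is not None:
            # count pixels whose intensity changed noticeably between frames
            energy.append(int(np.sum(np.abs(crop - prev) > diff_threshold)))
        prev = crop
    return np.asarray(energy)          # one value per frame transition

def anscombe(x):
    # Variance-stabilising transform for count-like data: 2 * sqrt(x + 3/8).
    return 2.0 * np.sqrt(np.asarray(x, dtype=float) + 3.0 / 8.0)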
Fig. 3. Two Motion Energy time series and the selected sync intervals
3.2 Results
Table 1 shows descriptive statistics of the sync occurrence. With contrast tests, the means were compared according to the hypotheses mentioned above. The results are listed in Table 2. All tests are one-sided and use df = 22. Beforehand, normal distribution and equality of group variances were checked with the Kolmogorov-Smirnov test and Levene's test, respectively.

Table 1. Mean and standard deviation of sync occurrence under different conditions

                neutral situation    conflict situation    total
                M      (SD)          M      (SD)           M      (SD)
non-friends     0.155  (0.080)       0.138  (0.096)        0.147  (0.085)
friends         0.224  (0.138)       0.099  (0.092)        0.162  (0.129)
total           0.187  (0.111)       0.120  (0.092)        0.154  (0.106)
Table 2. Mean comparisons of sync occurrence using one-sided contrast tests

the compared groups                                  value of contrast   SE      p
neutral vs. conflict situations                      0.141               0.081   .046*
neutral vs. conflict situations (only friends)       0.125               0.059   .024*
neutral vs. conflict situations (only non-friends)   0.171               0.055   .379
friends vs. non-friends                              0.030               0.081   .308
friends vs. non-friends in neutral situations        0.069               0.057   .121
friends vs. non-friends in conflict situations       −0.039              0.057   .251
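For transparency, a one-sided contrast test of the kind reported in Table 2 can be sketched as follows (a generic illustration: the pooled error variance, cell sizes and contrast weights passed in are placeholders, not the study's raw data):

import numpy as np
from scipy import stats

def contrast_test(means, ns, weights, mse, df):
    # means, ns, weights: values over the compared cells; mse: pooled error variance.
    L = float(np.dot(weights, means))                  # value of contrast
    se = float(np.sqrt(mse * np.sum(np.square(weights) / np.asarray(ns))))
    t = L / se
    p = stats.t.sf(t, df)                              # one-sided p-value
    return L, se, p

# e.g. neutral vs. conflict for the friend dyads, weights (+1, -1) on the two cells:
# contrast_test(means=[0.224, 0.099], ns=[6, 6], weights=[1, -1], mse=..., df=22)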
4 Discussion
Since the beginning of synchrony research, the methods, their possibilities, and their limitations have often been discussed (e.g. [3], [6], [16], [17]). Today, the automatic capture of body motions and the automatic identification of movement synchrony using WCLC and the peak-picking algorithm from [5] are standard. Independent of which computer analyses the time series, we get the same movement and sync codings. Furthermore, various statistical tests (e.g. bootstrapping within WCLC) have been used to show that synchronous body movements are more than random events.
4.1 Developed Methods
Using simulated time series whose sync intervals are known, this paper uncovers two problems with the standard methods: if the time series are auto-correlated (cyclic), windowed cross-lagged correlation misrepresents the relationship between the behaviour of the two interaction partners, and Boker's peak-picking algorithm sometimes tends to pick up wrong peaks. Due to this, WCLR and an alternative peak-picking algorithm were developed. WCLR is robust with regard to auto-correlation, and the alternative algorithm allows the estimation of the time lag and duration of a sync interval. Both can be used for subsequent analyses. The sign of the time lag indicates the leader of the interaction. Leader means that the current behaviour of a person predicts the later behaviour of the interaction partner. Furthermore, the duration of all sync intervals of an episode can be used to compute the synchrony occurrence of an episode. Concerning the comparison of the standard methods with the new methods, two points should be taken into account. First, we only conducted case studies; therefore, generalizations are limited. Second, the standard methods and the new methods have different outputs. For each point in time, the standard methods provide the smallest significant time lag. Subsequently, the average of all time lags of the interaction is analyzed
(see e.g. [5] or [17]). However, the new methods assume that the time lag is stable over time within a sync interval. For this reason, in subsequent analyses we consider the duration of the sync intervals.
4.2 Application
The new methods were applied to automatically captured time series from a pilot study on children's nonverbal behaviour. Regarding the occurrence of movement synchrony identified with the introduced methods, we found a significant difference between neutral and conflict situations. This result corresponds with [13], who report a difference between emotionally positive and emotionally negative episodes based on manual synchrony ratings of college student interactions. Contrary to our assumption, dyads of friends and non-friends do not differ significantly. Similar results were reported by [16], who studied an interaction of six college students using manual codings of the behaviour. Remarkably, in our study the difference between neutral and conflict situations was much larger when we considered only dyads of friends. Due to this, it can be assumed that the sync occurrence depends on the combination of situation characteristics (e.g. neutral vs. conflict situation) and dyad characteristics (e.g. an interpersonal relationship like friendship). However, generalisations of the findings are limited. Only a unimodal case was considered: synchronous body movements. All body movements of one person at a point of time were aggregated into an intensity index. Both choices served to reduce the complexity of the investigated phenomena and to obtain simple and transparent data. Furthermore, the sample size was small and the interaction time short because the data come from a pilot study. Regarding the methods used in the pilot study, a set of advantages should be named. The experimental setup with the computer game provides a standardised initial moment (naming the bad player) and a within-subject design (one dyad can be studied both in a neutral situation and in a conflict situation). Furthermore, the body motions were captured with Motion Energy Analysis and movement synchrony sequences were identified with the introduced methods. Such automatic ratings are highly reliable because every computer codes the same behaviour and identifies the same sync sequences. A pleasant side effect is that time-intensive manual behaviour codings are saved.
4.3 Outlook
Prospectively, the introduced methods could be used to study various types of interactions. Besides applications, methodological developments should be pursued, e.g. systematic simulation studies regarding correct sync interval identification depending on the method of motion capture, the signal-to-noise ratio and the signal complexity. This also includes a validation of the methods. Often, new methods are applied to real data without testing whether they bring the expected results. This article has shown, for example, that synchrony is not always identified correctly using a cross-correlation approach.
References

1. Altmann, U.: Interrater-Reliabilität = 1 in Videostudien? Automatisierte Erhebung von Nonverbalität in einem Experiment zur Kooperation von Schülern (Automated coding of nonverbal behavior in an experiment on the cooperation of students). In: Schwarz, B., Nenniger, P., Jäger, R.S. (eds.) Erziehungswissenschaftliche Forschung – nachhaltige Bildung. Beiträge zur 5. DGfE-Sektionstagung/AEPF-KBBB im Frühjahr 2009, pp. 261–267. VEP, Landau (2010)
2. Badalamenti, A.F., Langs, R.J.: An empirical investigation of human dyadic systems in the time and frequency domains. Behavioral Science 36(2), 100–114 (1991)
3. Bernieri, F.J.: Coordinated movement and rapport in teacher-student interactions. Journal of Nonverbal Behavior 12(2), 120–138 (1988)
4. Biemer, S., Müller, C.: Entwicklung eines Videospiels auf Basis von Open-Source-Bibliotheken für die Anwendung im Rahmen eines Experiments zur Untersuchung von dyadischen Interaktionsprozessen (Development of a video game based on open-source libraries for use in an experiment to study dyadic interaction processes). Unveröffentlichte Diplomarbeit, Technische Universität Dresden, Dresden (2008)
5. Boker, S.M., Xu, M., Rotondo, J.L., King, K.: Windowed cross-correlation and peak picking for the analysis of variability in the association between behavioral time series. Psychological Methods 7(1), 338–355 (2002)
6. Cappella, J.N.: Dynamic coordination of vocal and kinesic behavior in dyadic interaction: Methods, problems, and interpersonal outcomes. In: Watt, J., van Lear, C.A. (eds.) Dynamic Patterns in Communication Processes, pp. 353–386. Sage Publications, Thousand Oaks (1996)
7. Caridakis, G., Raouzaiou, A., Bevacqua, E., Mancini, M., Karpouzis, K., Malatesta, L., Pelachaud, C.: Virtual agent multimodal mimicry of humans. Language Resources and Evaluation 41(3-4), 367–388 (2007)
8. Feldman, R.: Parent-infant synchrony and the construction of shared timing; physiological precursors, developmental outcomes, and risk conditions. Journal of Child Psychology and Psychiatry 48(3/4), 329–354 (2007)
9. Gottman, J.M., Ringland, J.T.: The analysis of dominance and bidirectionality in social development. Child Development 52(1), 393–412 (1981)
10. Grammer, K., Honda, M., Juette, A., Schmitt, A.: Fuzziness of nonverbal courtship communication unblurred by motion energy detection. Journal of Personality and Social Psychology 77(3), 487–508 (1999)
11. Harrista, A.W., Waugh, R.M.: Dyadic synchrony: Its structure and function in children's development. Development Review 22(4), 555–592 (2002)
12. Hoffmann, R., Alisch, L.M., Altmann, U., Fehér, T., Petrick, R., Wittenberg, S., Hermkes, R.: The acoustic front-end in scenarios of interaction research. In: Esposito, A., Bourbakis, N.G., Avouris, N., Hatzilygeroudis, I. (eds.) HH and HM Interaction. LNCS (LNAI), vol. 5042, pp. 187–199. Springer, Heidelberg (2008)
13. Kimura, M., Daibo, I.: Interactional synchrony in conversations about emotional episodes: A measurement by "the between participants pseudosynchrony experimental paradigm". Journal of Nonverbal Behavior 30(3), 115–126 (2006)
14. Koss, T., Rosenthal, R.: Interactional synchrony, positivity and patient satisfaction in the physician-patient relationship. Medical Care 35(11), 1158–1163 (1998)
15. Marin, L., Issartel, J., Chaminade, T.: Interpersonal motor coordination. From human-human to human-robot interactions. Interaction Studies 10(3), 479–504 (2009)
16. McDowall, J.J.: Interactional synchrony: A reappraisal. Journal of Personality and Social Psychology 36(9), 963–975 (1978)
17. Ramseyer, F., Tschacher, W.: Nonverbal synchrony or random coincidence? How to tell the difference. In: Esposito, A., Campbell, N., Vogel, C., Hussain, A., Nijholt, A. (eds.) Development of Multimodal Interfaces: Active Listening and Synchrony, pp. 182–196. Springer, Berlin (2010)
18. Rogosa, D.: A critique of cross-lagged correlation. Psychological Bulletin 88(2), 245–258 (1980)
19. Watanabe, T.: A study of motion-voice synchronization. Bulletin of the Japanese Society of Mechanical Engineers 26(222), 2244–2250 (1983)
20. Yule, G.U.: Why do we sometimes get nonsense-correlations between time-series? – A study in sampling and the nature of time-series. Journal of the Royal Statistical Society 89(1), 1–63 (1926)
Multimodal Multilingual Dictionary of Gestures: DiGest

Milan Rusko¹ and Štefan Beňuš¹,²

¹ Institute of Informatics of the Slovak Academy of Sciences, Dúbravská cesta 9, 845 07 Bratislava, Slovakia
² Constantine the Philosopher University, Štefánikova 67, 94974 Nitra, Slovakia
[email protected], [email protected]
Abstract. The paper presents a web-based multimodal and multilingual dictionary of gestures. Its current version contains several hundred gestures, each represented by a still image, a description of the gesture and its meaning, and optional sound and video recordings. The current version includes language- and culture-dependent content for American English, Slovak, Italian, and Mongolian. Entries for Japanese, Chinese, and Hungarian are being implemented. The primary motivation for creating the database is to build a research tool that will facilitate identifying problems in research on nonverbal speech displays and their intercultural and intermodal aspects, and help in testing proposed solutions to these problems. Keywords: Gestures, vocalizations, non-verbal speech sounds, grunts, dictionary of gestures, gesture database.
1 Introduction

In addition to canonical words, spontaneous everyday human conversations contain many grunts, exclamations, sighs, laughs, and other non-lexical vocalizations. They play an important and varied role in the complex system of human communication [1]. Pragmatically, they range from relatively neutral hesitation sounds like mm or uh, to backchannels/acknowledgments like mmhm or uhm that are primarily utilized by the turn-taking and information management mechanisms, to vocalizations like pch [px], 'whistle', yay [jaj], or 'laugh' with high emotional affectivity expressing positive or negative attitudes towards propositions and/or interlocutors. Note that although the above examples are written as individual lexical items, their phonetic and prosodic variability is enormous and highly contextually determined, and it precludes a straightforward conversion to spelling, sometimes posing problems even for phonetic transcription. Research in this area seems to be fragmented in the sense that it typically studies only a limited number of types (studies specifically on filled pauses [2, 3], laughs [4, 5], or various sets of affirmative and/or grounding responses [6, 7]), and seems to focus on the pragmatic meanings of the canonical (lexicalized) forms of these items, commonly filtering out the huge production variability of these vocalizations [1]. Our own research, focused mainly on speech synthesis (text to speech, TTS) and automatic speech recognition (ASR), has shown that it is necessary to process these
non-lexical vocalizations in the signal properly when we work with speech, especially when expressive speech is involved. More specifically, we need a better understanding of the systematic correspondence between the forms of these vocalizations and their conversational meanings. In this paper, we refer to the above-mentioned vocalizations, grunts, exclamations, sighs, etc. as nonverbal voice gestures (NVGs, [8]). The term gesture is generally used to represent visible movements of the body, hand, or face in order to convey a certain message as part of non-verbal communication [9]. Our NVGs extend this meaning to include speech sounds sending some signals (bits of information) to the listener without having the form of a traditional word. The term nonverbal often designates communications not transmitted by words [e.g. 10], and it can be subdivided into vocal and non-vocal. We use the modifiers 'nonverbal' and 'voice' in our term to indicate vocal sounds that commonly lack a standard unambiguous written form.

Despite some work in this area in other languages [1, 11], virtually no descriptions exist for Slovak. One study [1] investigates non-lexical speech utterances in Japanese, and it provided us with valuable information on how many classes of vocalizations can be expected at a reasonable level of generalization. In that study, the author describes their extra large speech corpus in the following way: "A dictionary of only 100 items accounts for at least half of the non-lexical speech utterances in the corpus. Our nonverbal dictionary contains several thousand items but many of them only occur very infrequently. Because the small number of common sounds (typical grunts) are so very frequent in conversational speech, these particular sounds facilitate very fine comparison of their prosodic differences." [1:124] Another study [11] provides a comprehensive taxonomy of both forms and functions of non-lexical conversational sounds in American English and shows how they form a separate language system that complements the conversation occurring in the main channel. Importantly, it also identified the compositional character of the sound-meaning correspondence.

In our attempt to study Slovak NVGs, we found that very little research has been done in this field in Slovakia so far and that basic information on Slovak NVGs, which would facilitate further applied research, is missing. The first obstacle in our work was that, apart from our subjective intuitions, we had very little idea about the diversity of the basic forms of such sounds and their classes, about their frequency of occurrence in spontaneous speech and dialogues, or about their functions and potential meanings. We realize that many of these features are of a continuous nature in natural speech and that every clustering or categorization is an idealization that provides only a crude approximation to real life. However, the most efficient way to proceed in this situation was to combine top-down and bottom-up approaches: first create a list of NVGs in Slovak and their categories, however non-representative and open it might be in these initial stages, then subject this list to tests of frequency and reliability of categorization based on real speech corpora, and adjust our categorizations accordingly.
Our first task was therefore to collect an initial set of candidate NVGs which would be rich enough to reflect basic characteristics of these elements in human speech. The collection and categorization of these NVG candidates facilitates the development of tools and skills that will be required for the utilization of NVGs in work with spontaneous speech and applications linked to it. A collection of general ideas describing our approach was published in [8]. In this contribution, we provide a detailed description of the structure and content of the first version of the multi-modal and multi-lingual database of gestures DiGest together with suggestions for its usage.
2 DiGest - The Multilingual Multimodal Dictionary of Gestures

Early stages

In the initial stages of our project, our research drew heavily on the Picture Dictionary of Gestures – American, Slovak, Japanese and Chinese by Eva Ružičková [8, 12]. The dictionary divides gestures into four groups according to their generic meaning: gestures expressing the physical body, initiative contact, the emotional body, and the mental body. Such a division is motivated by the semantic meanings of the gestures. Gestures are means of communication between at least two participants, but the emphasis of the dictionary is primarily on the speaker; both verbal and non-verbal communication are taken into account. The dictionary's goal was to provide a source of information for students of intercultural communication and other people who meet and interact with representatives of other cultures and therefore need to understand the similarities and differences in how gestures are used in different countries.

Taking this dictionary as our starting point, we first identified those gestures that, in our minds, have a vocal counterpart in Slovak [8]. This provided us with a first list of text-induced vocal gestures in Slovak. But we soon realized that if such a list were extended to a dictionary, it could facilitate research activities in several related areas. Thinking about human communication and its applied areas of multimodal interfaces, audiovisual speech synthesis and audiovisual speech recognition, as well as a number of works of scholars (some of whom also participate in the COST 2102 initiative), brought us to the conclusion that our research should also include the relation of the vocalizations to body and face gestures, and that we should make our work more general and open it to other languages and cultures, which would in turn encourage more cross-linguistic and cross-cultural comparative research.

To facilitate these efforts, we decided to create a web-based electronic version of the original dictionary. The new web-based dictionary is called DiGest. We have added new annotation layers and modalities that were not present in the original work, such as sound and video, added new languages, and prepared the dictionary for the introduction of new gestures. As a result of this development, DiGest is a multilingual multimodal dictionary of gestures, and it has allowed us to broaden our original goal – the study of nonverbal speech gestures – to also include the study of nonverbal communication as a whole and a comparative study of gestures in different cultures.
Description of the current version

DiGest allows quick access to multi-layered information about gestures and cross-comparison of these gestures in different cultures. Technically, it is a database system that uses MySQL for content storage and a web interface written in PHP. A testing version of a subset of the dictionary can be found at http://ui.sav.sk/gestures/. The first basic set of gestures and their descriptions in the first version of DiGest were adopted from [12]. All the recorded modalities of the included gestures are acted (posed) at this time. Information on the broader context of a gesture can be introduced in this version by including a video or an audio recording of a longer sequence, possibly accompanied by an annotation file. The gestures are meant to be prototypical (representing a class of gestures) and their recordings in all modalities illustrate their basic communicative characteristics.

All languages and cultures represented in the dictionary have strictly equal status. However, we had to designate the language of the description as well as of the user interface. We chose English as it is the language of international scientific communication. English is used for all the language-independent features and for navigation. The current version includes language-dependent content for American English, Slovak, Italian, and Mongolian, and the implementation of Japanese, Chinese and Hungarian content is in progress. Fig. 1 illustrates the current DiGest interface. The top rectangle represents culture-independent information and the bottom one shows culture-dependent descriptions.
Fig. 1. Current graphical user interface of DiGest
Culture independent information

Culture-independent data provide the user with gesture information that is assumed to be relevant independent of the language and culture of the gesture producer. This description should be valid in the cultures where the gesture is known, but this does not mean it should be known (and used) in all cultures. Culture-independent data consist of gesture components, generic meaning and cross-references. Gesture components describe the physical aspect of the gesture: which parts of the body take the active role in performing the gesture, and in which way. Generic meaning provides the name of the gesture and serves as an identifier of the particular gesture.

The dictionary is divided into four groups according to the semantically motivated generic meanings. Physical body (24 generic meaning classes, 46 gestures) includes gestures that are physically oriented, meaning that they manifest sizes, shapes and conditions of physical objects, human bodies included. Apart from sight and touch, other senses such as taste, smell, or hearing are included, as well as concepts, such as eating and drinking, which are related to these senses. Initiative contact (29 classes, 109 gestures) involves gestures which indicate the initiation of contact between communicators, either in the form of greetings, initiative actions, commands, or requests. Emotional body (25 classes, 90 gestures) groups gestures which indicate emotions, romance, sex, and taboo topics. The six basic emotions – Anger, Disgust, Fear, Happiness, Sadness, and Surprise – that are assumed to be basic or biologically universal to all humans [13] are presented in many variations, as they were in the original taxonomy [4], but may be extended to cover other emotions as well. Mental body (20 classes, 79 gestures) involves gestures related to mental activities and judgements, and some miscellaneous gestures concerning good luck and the concept of self. In total, DiGest now has 98 generic meaning classes and 324 gestures with manifold information.

The final piece of culture-independent information lists the cross-references of each gesture, which show the user the relatedness of one gesture to another. One generic meaning can be expressed by different gestures and one gesture can express more than one generic meaning. Cross-references, currently implemented through a 'See also' button, offer the reader effortless access to the related gestures throughout the dictionary.

Culture dependent information

Our working hypothesis during the construction of DiGest is that some gestures are known to all the cultures considered and some are culture-dependent. Moreover, some gestures might show slight variations which are typical of only one culture. To facilitate a comparative approach towards testing these hypotheses, our graphical user interface allows for choosing three different cultures/languages from the list and for showing culture-dependent information related to the gesture in these three cultures simultaneously. This information includes the translation of the English name of the gesture into the respective language, lexical and non-lexical co-gestural messages, and the sociolinguistic context, including information on the formality of the gesture, the gender, age, and social status of the user (sender), and the family (or other) relation between the sender and recipient.
Culture-dependent information can be supplemented by a comment on the specifics of the use and forms of the particular gesture in the given culture. Comments as well as text, audio, video, and annotation files can be attached to demonstrate the phenomena described in the comment text. Below we discuss in more detail co-gestural messages, as the aspect of the database that is most relevant to our work.

Co-gestural messages

Many scholars (e.g. [14, 15]) use the notion of co-verbal gesture, where the meaning of the word gesture is limited to the body and face communication act that accompanies the speech utterance. We introduce the notion "co-gestural message", which designates speech material (lexical or non-lexical) that typically accompanies a particular body gesture. Nevertheless, we consider speech gestures to be full-value gestures, and many of them can be displayed without involving other modalities.

A lexical specific message is an utterance that is typically used in a given culture with a particular gesture and consists of canonical, lexical words. In DiGest, we include the Orthographic, Orthoepic, and Latinized forms of the message, and its literal translation into English. For example, the gesture Coldness A can be accompanied by the Slovak sentence 'Bŕŕŕ, všetci čerti sa tam ženia!' in standard Slovak orthography, which means that the weather is very cold and windy outside. Additionally, our description includes the orthoepic transcription using the International Phonetic Alphabet (IPA), [br̩ːː fʃɛtsi tʃɛrci sa tam ʒɛɲia], no Latinized form (since this applies only to languages that use non-Latin alphabets), and the literal English translation '(Brrr,) all devils are having their wedding out there!' The lexical information can be supplemented by a comment on the specifics of the use and forms of the particular gesture in the given culture regarding the lexical specific message. It can have text, audio and video files attached as well.

A non-lexical specific message is an utterance that typically accompanies a particular gesture in a given culture and consists of non-verbal (non-lexical) speech material that usually cannot be transcribed using canonical words. The information related to the message consists of the Orthographic form – the customary text transcription in the respective language using its own alphabet (if it exists) – and the Orthoepic and Latinized forms. Non-lexical information can also be supplemented by a comment and additional text/audio/video files illustrating the uses and forms of the gesture. It is important to note, however, that non-lexical and lexical elements are often combined in one utterance (e.g. 'Brrrr! I am cold!'). In this version of DiGest, such cases are handled by including the whole utterance both as a lexical and a non-lexical item. The lexical features are described in the specific message of the lexical part and the non-lexical information is covered in the specific message of the non-lexical part.
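To make the structure described above more tangible, the following is an illustrative sketch of one dictionary entry (our own simplification in Python, not the actual MySQL schema; field and class names are assumptions):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CoGesturalMessage:
    orthographic: str                  # customary spelling in the language's own alphabet
    orthoepic: str                     # IPA transcription
    latinized: Optional[str] = None    # only for languages with non-Latin alphabets
    english_gloss: Optional[str] = None
    comment: Optional[str] = None      # culture-specific notes on use and form

@dataclass
class CultureDependentEntry:
    language: str
    local_name: str                    # translation of the English gesture name
    lexical_message: Optional[CoGesturalMessage] = None
    nonlexical_message: Optional[CoGesturalMessage] = None
    sociolinguistic_context: str = ""  # formality, gender, age, social status, relation
    media: List[str] = field(default_factory=list)   # attached audio/video/text files

@dataclass
class GestureEntry:
    generic_meaning: str               # e.g. "Coldness A"
    group: str                         # physical / initiative contact / emotional / mental body
    components: str                    # which body parts take the active role, and how
    cross_references: List[str] = field(default_factory=list)
    cultures: List[CultureDependentEntry] = field(default_factory=list)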
3 Usage and Limitations From a scientific point of view, DiGest provides a platform on which basic concepts of annotation, classification, representativeness, and presentation can be validated. The dictionary is inevitably a theory-driven tool and its existence allows for rigorous testing of the parts of the theory. For example, one can test if the division into 4
groups of generic meanings is justified, or if co-gestural messages are sufficiently robustly linked to the respective gestures. Additionally, work with the dictionary generates multiple research questions. For example, we can ask which common structural features (e.g. number of syllables, positive pitch slope, etc.) are shared by the gestures for the same semantic content among the cultures, or how much semantic information a nonverbal speech gesture bears when it is alone, combined with a visual gesture, or positioned in a wider speech context. Furthermore, the process of DiGest creation forces us to contemplate the potential as well as the pitfalls of this multilingual and multimodal approach to gesture representation. For example, it is necessary to adopt a sufficiently general, yet reasonably constrained taxonomy of both visual as well as spoken gestures; see [16] for a relevant review and suggestion. DiGest, similar to its predecessor, can also be used as a source of information for people who meet and communicate with representatives of other cultures and therefore need to understand the similarities and differences in using gestures in different countries. Communication partners bring to communication presuppositions and expectations from their own cultures [8] and DiGest gives them an opportunity to check the meaning of gestures and appropriateness of their use in other cultures. There are several limitations of this first version. First, the physical gesture components description will have to be revised in future versions. The body posture and face gesture annotation scheme MUMIN provides a reasonably detailed description [17] which would make the description substantially more systematic and transparent – and thus more suitable for computer processing. Second, no strategy was taken in this version for the annotation of suprasegmental (paralinguistic and extralinguistic) phenomena of speech recordings. One imperfect solution is to open the sound files in PRAAT [18] or another program, enabling immediate acoustic/phonetic analysis. However such a program does not automatically provide the user with a symbolic annotation (e.g. ToBI [19] for intonation and accents), and both human and automatically extractable suprasegmental annotations will be investigated in the future.
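As a rough illustration of what an automatically extractable suprasegmental annotation could start from, the following Python sketch estimates F0 for a single speech frame by short-time autocorrelation; it is not part of DiGest, and in practice a dedicated tool such as Praat [18] would be preferred, while symbolic layers such as ToBI [19] still require human labelling.

    # Minimal, self-contained F0 estimate for one frame (sketch only).
    import numpy as np

    def estimate_f0(frame, sr, fmin=75.0, fmax=400.0):
        """Return a rough F0 estimate in Hz for one frame, or None if judged unvoiced."""
        frame = frame - np.mean(frame)
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # autocorrelation, lags 0..N-1
        lo, hi = int(sr / fmax), int(sr / fmin)
        hi = min(hi, len(ac) - 1)
        if hi <= lo:
            return None
        lag = lo + int(np.argmax(ac[lo:hi]))
        # crude voicing check: the peak must stand out against the zero-lag energy
        return sr / lag if ac[lag] > 0.3 * ac[0] else None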
4 Conclusion Inspired by a dictionary of gestures in book form [12], we created DiGest, a web-based electronic multilingual and multimodal dictionary of gestures. The database structure and graphical interface were designed, and the content was divided into language-dependent and language-independent parts. New annotation layers were added, such as phonetic (orthoepic) transcription and literal translation, and information on non-verbal speech gestures was introduced. New modalities, such as audio and video recordings, were enabled. The original content was expanded for Slovak and Chinese, and new languages were added: Hungarian, Italian, Mongolian (Cyrillic alphabet), and Mongolian (traditional alphabet). The dictionary is thus intended as a multi-layered and multimodal research tool for investigating the relationship between modalities in gesture use and their potential implementation in automated speech processing applications, cross-linguistic and cross-cultural research, and research on the relationship between the forms and communicative functions of gestures.
Acknowledgements. We thank Eva Ružičková for allowing us to adopt the concept, descriptions, and other material from her book in our work, and the following programmers, advisors, demonstrators and informants: Sakhia Darjaa, Pang Qiwei, Lucia Rusková, Jolana Sebestyénová, Peter Kurdel, and Jozef Juhár. This work was supported in part by the EU grant CRISIS and an International Cooperation in Science and Technology project.
References
1. Ward, N.: Non-Lexical Conversational Sounds in American English. Pragmatics & Cognition 14(1), 129–182 (2006)
2. Stewart, O.W., Corley, M.: Hesitation disfluencies in spontaneous speech: The meaning of um. Language and Linguistics Compass 4, 589–602 (2008)
3. Clark, H.H., Fox Tree, J.E.: Using uh and um in spontaneous speaking. Cognition 84, 73–111 (2002)
4. Laskowski, K.: Finding emotionally involved speech using implicitly proximity-annotated laughter. In: Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 5226–5229 (2010)
5. Ruch, W., Ekman, P.: The Expressive Pattern of Laughter. In: Kaszniak, A. (ed.) Emotion, Qualia and Consciousness. World Scientific, Tokyo (2001)
6. Gravano, A., Hirschberg, J., Beňuš, Š.: Affirmative cue words in task-oriented dialogues. Computational Linguistics (in press)
7. Benus, S., Gravano, A., Hirschberg, J.: The prosody of backchannels in American English. In: Proceedings of ICPhS, pp. 1065–1068 (2007)
8. Rusko, M., Juhár, J.: Towards Annotation of Nonverbal Vocal Gestures in Slovak. In: Esposito, A., Bourbakis, N.G., Avouris, N., Hatzilygeroudis, I. (eds.) HH and HM Interaction. LNCS (LNAI), vol. 5042, pp. 255–265. Springer, Heidelberg (2008)
9. Kendon, A.: Gesture: Visible Action as Utterance. Cambridge University Press, Cambridge (2004)
10. Knapp, M.L., Hall, J.A.: Nonverbal communication in human action. Wadsworth, Belmont (2001)
11. Campbell, N.: On the Use of Non Verbal Speech Sounds in Human Communication. In: Esposito, A., Faundez-Zanuy, M., Keller, E., Marinaro, M. (eds.) COST Action 2102. LNCS (LNAI), vol. 4775, pp. 117–128. Springer, Heidelberg (2007)
12. Ružičková, E.: Picture dictionary of gestures (American, Slovak, Japanese, and Chinese). Comenius University Publishing House, Bratislava (2001)
13. Ekman, P.: Universals and Cultural Differences in Facial Expression of Emotion. In: Cole, J. (ed.) Nebraska Symposium on Motivation, pp. 207–283. University of Nebraska Press, Lincoln (1972)
14. McNeill, D.: So you think gestures are nonverbal? Psychological Review 92, 350–371 (1985)
15. Pizzuto, E., Catenacci, C.: Signed Languages, Verbal Languages, Coverbal Gestures: Analysis and Representation, http://www.loa-cnr.it/iliks/Pizzuto-Catenacci06.pdf (accessed December 15, 2010)
16. Rossini, N.: The Analysis of Gesture: Establishing a Set of Parameters. In: Camurri, A., Volpe, G. (eds.) GW 2003. LNCS (LNAI), vol. 2915, pp. 124–131. Springer, Heidelberg (2004)
17. Allwood, J., Cerrato, L., Jokinen, K., Navaretta, C., Paggio, P.: The MUMIN coding scheme for the annotation of feedback, turn management and sequencing phenomena. Language Resources & Evaluation 41, 273–287 (2007)
18. Boersma, P., Weenink, D.: Praat: doing phonetics by computer (Version 5.1.05) (Computer program), http://www.praat.org/ (retrieved May 1, 2009)
19. Beckman, M.E., Hirschberg, J.: The ToBI Annotation Conventions, http://www.ling.ohio-state.edu/~tobi/ame_tobi/annotation_conventions.html (retrieved November 1, 2010)
The Partiality in Italian Political Interviews: Stereotype or Reality? Enza Graziano and Augusto Gnisci Department of Psychology, Second University of Naples, Via Vivaldi 43, 81100 Caserta, Italy {enza.graziano,augusto.gnisci}@unina2.it
Abstract. This contribution has two closely related aims. The first is to assess the toughness and partiality of the main Italian political broadcasts by analysing the interruptions that occur during the interviews. Interruptions can be considered a conversational index according to the theory of equivocation. Results show that more than half of the studied broadcasts are “tough” and that many beliefs held by public opinion are true, with some surprising exceptions. The second aim is to design interactive multimedia software for coding interruptions, based on the interruption coding system applied to our sample and on the results obtained. Indications are drawn for implementing this software. Keywords: Interruptions, Software, Political interviews, Toughness, Partiality.
1 Introduction This contribution has two aims which are closely related to one another. The first is to assess the impartiality of the main Italian political broadcasts by analyzing the interruptions occurring during the interviews. They were coded applying a coding system adapted to the Italian culture and language. The second aim is to introduce the project of an interactive, multimedia piece of software based on the interruption coding system, which could help researchers in different fields to analyze some specific aspects of human interaction. The interest in our first aim is generated by the law 28/2000 [1], known as par condicio, which states that broadcasters have to guarantee adequate visibility to all political parties. This law regulates television political communication, the distribution of airtime ensured to each political party, and the sanctions and measures applied if the law is broken. It controls “how much” a political party or a politician appears in public, but it does not control “how” they are treated during a political broadcast [2]. The theoretical reference is the theory of equivocation [3]. Equivocation is a type of vague, unclear, tangential communication that includes different linguistic acts used in order not to answer questions in an unequivocal and brief way. Evasiveness, which is a type of equivocation, is not caused by intrinsic or natural characteristics of politicians, but is related to situational elements. Politicians are continuously put into avoidance-avoidance conflicts by interviewers [4]. Avoidance-avoidance conflicts are communicative conflicts that are difficult, or rather impossible, to manage, because people are forced to choose among unfavorable communicative alternatives [5]. In other terms, every
reply to this type of question can be a politician’s face-threat. The politician’s “face” concerns positive social attributes [6] that he/she wants to give of him/herself, of his/her party and of significant others connected to the party, for example, colleagues and allies [7] [8]. A positive face determines social approval and as a consequence electoral consent [9]. Obviously mass media are a powerful mean for amplifying and communicating the politician’s positive face. Different ways exist to put a politician in front of a communicative conflict and face threat during an interview. The most studied way is the type of question asked to the interviewed [10] [11]. Only a few studies have considered conversational (or structural) ways like interruptions, from a quantitative point of view, as indexes of communicative conflict. If we assume that an interruption has a negative value, for example, an interviewer can interrupt a speaker when he/she uses equivocation, so that he/she turns his/her attention to the problem. While intruding in the speaker’s speech, interruptions can cause a perturbation of conversational flow, they can show conversational disagreement [12], they can try to take the floor or prevent the first speaker from completing his/her speech, so that they are considered “small insults” [13]. From this point of view, interruptions are an index of aggressiveness, toughness, control and conversational dominance [14] [15] [16]. Indeed, this research showed that when we say “interruptions”, we refer to a very broad and complex phenomenon. So we must separate different types of positive or negative interruptions according to their purposes and effects [17] [15] [18]. There are neutral events, supportive interventions, successful and unsuccessful interruptions
Q1. Speakers change turns synchronizing them (they speak one at a time)?
  Yes -> NEUTRAL EVENT (False Start, Overlapping, Pause, Latching, Afterthought)
  No  -> INTERRUPTION (go to Q2)
Q2. Interruption aims at supporting the speaker?
  Yes -> SUPPORTIVE INTERVENTION (Listener response, Lexical suggestion)
  No  -> go to Q3
Q3. The second speaker prevents the first one to complete the turn and completes his/her utterance?
  Yes -> SUCCESSFUL INTERRUPTION (Successful Single interruption; Successful Complex interruption; Snatch-back; Interjection)
  No  -> UNSUCCESSFUL INTERRUPTION (Unsuccessful Single/Complex interruption; Unsuccessful Single/Complex interrupted interruption; Unsuccessful Single/Complex snatch-back; Unsuccessful Single/Complex interruption with Completion/with Overlapping)
Fig. 1. Simplified flow chart of the interruption coding system used in the study
[19]. Neutral events include regular turn-taking: speakers take turns tidily, synchronizing them, sometimes with brief pauses or little overlapping. Supportive interventions, instead, aim at supporting the speaker and show interest and attention for what he/she is saying, even if they are interruptions from a structural point of view. Real interruptions are those that disrupt the conversational flow, and they can be successful or unsuccessful according to their outcome [19]. In successful interruptions, the interrupter prevents the first speaker from completing his/her utterance by taking the floor and completing his/her own speech. In unsuccessful interruptions, the first speaker does not give up the floor, and the interrupter may or may not complete his/her utterance (see Figure 1). So supportive interventions maintain or improve the face of the person who receives them [20], while successful and unsuccessful interruptions can damage or threaten the face of the person who suffers them. For simplicity, from now on we will refer to the latter as “aggressive” interruptions. In a similar way to other studies on threatening questions [15], in this study a broadcast’s toughness (or aggressiveness) is operationalized as the proportion of aggressive interruptions (successful and unsuccessful interruptions) made by the interviewers towards politicians, relative to the other forms of turn taking occurring in the interview. It is an index that measures how much an interviewer or a broadcast constitutes a face-threat for politicians. This index varies between 0 and 100, and when it increases it means that absolute toughness increases too. Impartiality (or neutrality), versus partiality (or tendentiousness), is operationalized as the difference between the proportions of aggressive interruptions received by different political parties (in our study, we took into account the Popolo della Libertà, Silvio Berlusconi’s right-wing party, and the Partito Democratico, the most important left-wing party, led at that time by Walter Veltroni). When this index is 0, there is perfect broadcast impartiality toward politicians of different parties. Moreover, it can take positive or negative values, which show that the interviewer supports one party rather than the other, or vice versa. As we will describe below in the Method section, our research focuses on political broadcasts aired during the electoral campaign for the Italian general election of 13 and 14 April 2008. The main aim of this research is to assess the level of toughness and impartiality of the most important political broadcasts. The research aims to verify whether some common-sense beliefs or some journalistic analyses about the partiality of interviewers and TV channels are true or just stereotypes. We are referring, for example, to the supposed tendentiousness of Rai 3 and of some interviewers from different channels, like Michele Santoro, against the right-wing party and in favor of the left-wing party. We are also referring to the largely claimed partiality of Mediaset channels, owned by the Prime Minister’s family, or to the supposed pro-government partiality of some broadcasts like “Porta a Porta”. One of our main interests is the context: both the political and historical context of the analyzed broadcasts, which will be explained in one of the next sections, and the narrower context in which interruptions occur (that is, political interviews), since their characteristics can change according to the formal/informal, relational, cultural and linguistic context.
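A minimal Python sketch of the two indices as defined above, assuming only the per-broadcast counts of aggressive interruptions and of the other turn-taking forms are available (the example counts are those later reported in Table 1 for "Porta a Porta"):

    def toughness(aggressive, other_turn_taking):
        """Proportion (0-100) of aggressive interruptions over all turn-taking events."""
        total = aggressive + other_turn_taking
        return 100.0 * aggressive / total if total else 0.0

    def partiality(pd_aggressive, pd_total, pdl_aggressive, pdl_total):
        """Difference between the rates of aggressive interruptions received by PD and PdL.
        0 means perfect impartiality; negative values mean PD is treated better than PdL,
        positive values mean PdL is treated better than PD."""
        pd_rate = 100.0 * pd_aggressive / pd_total if pd_total else 0.0
        pdl_rate = 100.0 * pdl_aggressive / pdl_total if pdl_total else 0.0
        return pd_rate - pdl_rate

    # toughness(525, 443) -> about 54.2, the value reported for "Porta a Porta"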
The second aim of this contribution is the implementation of the interruption coding system we used to assess partiality in an interactive, multimedia piece of software. Since we adapted this coding system to the Italian language and context (see Method) through a reliability study and the introduction of some categories found in Italian
interactions [21], we are now testing this system on a wide sample of 57 hours of political interviews. As we will see, the obtained results suggest we can proceed with the design of the software, whose characteristics will be briefly described below.
2 Method 2.1 Sample Sample was formed by 57 hours of video recording material from 12 Italian broadcasts of political information aired on different TV channels, radio and digital channels during the pre-electoral period. In these broadcasts, there are the leaders of the main Italian political parties who run for the Prime Minister office. The broadcasts format was of two types: face-to-face interviews between one interviewer and one politician (for example, in “In mezz’ora”) and one interviewer which incites discussion between two or six politicians belonging to different parties, together with the presence of some experts (as in “Ballarò”). 2.2 Procedure Observers were trained to label turn taking modalities using ICS [19]. Each time a turn taking occurred in the analysed broadcasts, they stopped videorecording and identified the type of turn taking, answering to the questions of the flow chart which constitutes ICS (Figure 1). So, from the first question of original coding system (“Can a first and a second speaker be identified?”) and depending on the answer of each question, the coder go on until a category was found corresponding to the observed event. For example, a turn taking as the following is labelled as neutral event (Figure 1): “A: I think we can have a break now and we can take a walk outside. – B: Uhm, I have to complete my task, I can’t right now”. An event such this: “A: I think we can have a break now and we can- B: Uhm, that’s right! Let’s go!” is an example of supportive intervention. If the coder observes a turn taking like this: “A: I think we can have a break now and we can- B: Uhm, I have to complete my task, I can’t right now”, then it’s called successful interruption. An unsuccessful interruption is the following: “A: I think we can have a break +now and we can* take a walk outside. – B: +Uhm, I have to*”. These are only examples of the major categories considered in this study (see Figure 1) while observers coded micro categories (all those included in adapted ICS; [21]). Observers transcribed part of the sample (1 interview whose duration was 2h 04m) in order to understand the interviews context and its characteristics. The multimodality of materials is important for a correct identification of interruptions; particular signals can be seen just before an interruption occurs, such as an increasing pitch of voice, hand gestures and especially facial expressions which are meaningful signals of the speaker’s intentions in interrupting one another. Since all of this had to be taken into account, coders used computer video players in order to be able to pause the videos, quickly rewind it to listen again and record on a text file the moment of the interruption, who interrupted whom and the type of interruption.
2.3 Coding System and Reliability The observers used an adaptation of the Interruption Coding System (ICS) [19] [21] for coding interruptions. ICS is based on the distinction between single and complex interruptions on the one hand, and between successful and unsuccessful interruptions on the other. These last two categories are considered “aggressive interruptions”. We added lexical support and floor changing (with pause or “latching”, that is, synchronization) to the interruption categories included in ICS, because they are often found in Italian political interviews and, moreover, they make the coding system more complete. Floor changing, overlapping and false start are considered neutral events. Finally, supportive interventions include lexical support and backchannels. The proportion of these three main categories (aggressive interruptions, neutral events and supportive interventions) gives us information about toughness and impartiality. However, due to reasons of space, in this contribution we took into account only the proportion of “aggressive interruptions” against the other two categories considered together. Specifically, toughness is defined as the proportion of threatening interruptions made by interviewers to the politicians over the whole set of turn-taking modalities; impartiality is the difference between the rates of aggressive interruptions received by the two main Italian political parties (PD and PdL). Interobserver reliability among 4 independent observers, who coded 2h 17m of the whole sample (4%), was satisfactory. The average agreement percentage referring to the identification of the event (“Is this an interruption?”) is 97.8%. The average index referring to all the categories is “excellent” (κ = .91) [22]. Moreover, we calculated Cohen’s κ for different levels of the coding system (from micro- to main categories), conducting a study on its reliability aimed at adapting it to Italian culture and language [21]. 2.4 The Political Situation The sampling procedure was conducted during the electoral campaign of the Italian general election of 2008. For these elections two “new” big parties appeared on the Italian political scene. One was a right-wing party, the Popolo della Libertà (PdL), led by Silvio Berlusconi. This party came out of the association of Forza Italia (“Go Italy”) and Alleanza Nazionale (“National Alliance”). The other one was the Partito Democratico (PD), led by Walter Veltroni. It was formed by the association between the Democratici di Sinistra (“Left Democratic Party”) and the Margherita (“Daisy”, a moderate party). These two parties were also born as a consequence of the problems and of the obvious incompetence of the previous wide coalitions. These coalitions could not rule and manage their continuous conflicting internal needs due to their internal composition: they were formed by too many parties. The left-wing government fell only two years after the previous elections (which occurred in 2006), and an overwhelming victory of the right wing was predicted, which in effect occurred.
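For reference, the agreement statistics mentioned in Sect. 2.3 (percentage agreement and Cohen’s kappa [22]) can be computed for a pair of coders as in the following sketch; the multi-coder figures reported above were obtained with the authors’ own procedure, not with this code.

    from collections import Counter

    def percent_agreement(codes_a, codes_b):
        """Percentage of items on which two coders assign the same label."""
        return 100.0 * sum(a == b for a, b in zip(codes_a, codes_b)) / len(codes_a)

    def cohens_kappa(codes_a, codes_b):
        """Cohen's kappa for two coders over the same items (nominal labels)."""
        n = len(codes_a)
        observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
        freq_a, freq_b = Counter(codes_a), Counter(codes_b)
        expected = sum(freq_a[c] * freq_b[c] for c in set(codes_a) | set(codes_b)) / (n * n)
        return (observed - expected) / (1.0 - expected)

    # Example: two coders labelling five turn-taking events
    # cohens_kappa(["succ", "neutral", "succ", "supp", "unsucc"],
    #              ["succ", "neutral", "succ", "neutral", "unsucc"])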
Concerning the Italian television situation, there are 7 national channels, among which three are public (Rai 1, Rai 2 and Rai 3), three are Mediaset private channels (Rete 4, Canale 5 and Italia 1) that belong to Silvio Berlusconi’s family, and one is La7. Public television is managed by a Board of Directors (Consiglio di Amministrazione) designated by the Government. Over the years a kind of distribution of channels has happened. So Rai 1 would be a pro-government channel, Rai 3 would be historically bound to the left-wing. In 2008 Rai 2 was “assigned” to one of the opposition parties, that is a centre-right party (Lega Nord, “North League”). Since a few years a new pay TV is broadcasted in Italy: it is Sky, and the channel Sky TG24 airs news and informations 24 hours a day. 2.5 Results Table 1 shows toughness level of broadcasts included in the sample. Interruptions constitute on average more than half on the turn taking modalities during interviews. In particular, interruptions are more than the half on the turn taking modalities in most of broadcasts (7 out of 12). The toughest interviewer is Santoro (“AnnoZero”, Rai 2), followed by Floris (“Ballarò”, Rai3) with very high values (> 60%) and by Formigli (“Controcorrente”, Sky TG24). A set of 6 broadcasts places itself within the range of 10 points around 50%. Among these programs, the toughest interviewer is Vespa (“Porta a Porta”, Rai 1), followed by the two interviewers of “Otto e Mezzo” (La7), by those of “Telecamere” and “Tg3 Primo Piano”, both of Rai 3, and then by Mentana (“Matrix”, Canale 5) and Annunziata (“In mezz’ora”, Rai 3). “Conferenza Stampa” (Rai 3) is the least tough television broadcast. Radio and digital broadcasts have the less toughness levels. The broadcasts partiality is shown in Table 2 and it is referred to two political parties, namely Popolo della Libertà (PdL) and Partito Democratico (PD). “Radio anch’io” (Rai Radio 1), “Telecamere” (Rai 3), “Tg3 Primo Piano” (Rai 3) and “Ballarò” (Rai 3) support PD rather than PdL. Even “Matrix” (Canale 5) and “Controcorrente” (Sky TG24) support PD. Annunziata (“In mezz’ora”, Rai 3) and Santoro (“AnnoZero”, Rai 2) use more aggressive interruptions towards PD rather than towards PdL, but this trend is restrained (around 5%). A paired sample t test was conducted to assess if the two considered parties were interrupted in a different way during the Italian political broadcasts. The results (t (11) = -2.49, p<.05) show that PD is interrupted significantly less (M=48.58, SD=15.45) than PDL (M=56.62, SD=16.02). Moreover, the percentages of aggressive interruptions received by PD and PDL are positively and significantly correlated (r = .748, p<.05). Thus, in general, interviewers who administered high, intermediate or low interruptions to one party, administered high, intermediate or low interruptions also to the other party, respectively.
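A short sketch of the two tests reported above, assuming SciPy and using the per-broadcast percentages of aggressive interruptions received by PD and PdL as listed in Table 2; it should reproduce, up to rounding, the reported t(11) = -2.49 and r = .748.

    from scipy import stats

    # Per-broadcast percentages of aggressive interruptions (same order as Table 2).
    pd_pct  = [53, 77.5, 59.9, 49.6, 48.8, 35.7, 38.6, 42.4, 57.5, 30.4, 66, 23.6]
    pdl_pct = [51.3, 72.7, 76.5, 58.7, 42.4, 39.4, 60.2, 69.1, 57.5, 29.8, 79.8, 42.1]

    t, p = stats.ttest_rel(pd_pct, pdl_pct)      # paired-sample t test, df = 11
    r, p_r = stats.pearsonr(pd_pct, pdl_pct)     # Pearson correlation between the two series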
Table 1. Levels of Toughness of Italian political broadcasts during election campaign of 2008

Broadcasts (Channels) | Interviewers | Aggressive Interruptions N | % | Other Turn-Taking N | % | TOT
Porta a Porta (Rai 1) | B. Vespa | 525 | 54.2 | 443 | 45.8 | 968
AnnoZero (Rai 2) | M. Santoro | 355 | 75.9 | 113 | 24.1 | 468
Ballarò (Rai 3) | G. Floris | 434 | 68.3 | 201 | 31.6 | 635
Tg3 Primo Piano (Rai 3) | B. Berlinguer, M. Mannoni, A. Di Bella, G. Giubilei | 239 | 51.2 | 228 | 48.8 | 467
In mezz'ora (Rai 3) | L. Annunziata | 395 | 45.4 | 475 | 54.6 | 870
Conferenza Stampa (Rai 3) | G. Del Bufalo, G.S. Rossi | 117 | 40.2 | 174 | 59.8 | 291
Telecamere (Rai 3) | A. La Rosa | 206 | 52 | 190 | 48 | 396
Matrix (Canale 5) | E. Mentana | 218 | 48.7 | 230 | 51.3 | 448
Otto e Mezzo (La7) | G. Ferrara, R. Armeni, L. Pace | 184 | 52.4 | 167 | 47.6 | 351
Incontri Digitali (Corriere Tv) | Unavailable information | 60 | 17.9 | 276 | 82.1 | 336
Controcorrente (Sky TG24) | C. Formigli | 280 | 67.5 | 135 | 32.5 | 415
Radio anch'io (Rai Radio 1) | A. Caprarica | 99 | 31.4 | 216 | 68.6 | 315
TOTAL | | 3112 | 52.2 | 2848 | 47.7 | 5960
AVERAGE | | | 50.4 | | 49.6 |

Remark 1. "Aggressive Interruptions" comprises successful and unsuccessful interruptions; "Other Turn-taking" comprises neutral events and supportive interventions.
Table 2. Levels of Impartiality/Tendentiousness toward PdL and PD in Italian political broadcasts during election campaign of 2008

Broadcasts (Channels) | Aggressive Interruptions n | % | PD n | % | PdL n | % | Δ(PD-PdL)
Porta a Porta (Rai 1) | 525 | 54.2 | 96 | 53 | 154 | 51.3 | 1.7
AnnoZero (Rai 2) | 355 | 75.9 | 38 | 77.5 | 64 | 72.7 | +4.8
Ballarò (Rai 3) | 434 | 68.3 | 103 | 59.9 | 150 | 76.5 | -16.6
Tg3 Primo Piano (Rai 3) | 239 | 51.2 | 65 | 49.6 | 64 | 58.7 | -9.1
In mezz'ora (Rai 3) | 395 | 45.4 | 63 | 48.8 | 75 | 42.4 | +6.5
Conferenza Stampa (Rai 3) | 117 | 40.2 | 15 | 35.7 | 13 | 39.4 | -3.7
Telecamere (Rai 3) | 206 | 52 | 22 | 38.6 | 53 | 60.2 | -21.6
Matrix (Canale 5) | 218 | 48.7 | 28 | 42.4 | 67 | 69.1 | -26.6
Otto e Mezzo (La7) | 184 | 52.4 | 42 | 57.5 | 50 | 57.5 | 0.1
Incontri Digitali (Corriere Tv) | 60 | 17.9 | 24 | 30.4 | 17 | 29.8 | 0.6
Controcorrente (Sky TG24) | 280 | 67.5 | 31 | 66 | 75 | 79.8 | -13.8
Radio anch'io (Rai Radio 1) | 99 | 31.4 | 22 | 23.6 | 48 | 42.1 | -18.4
TOTAL | 3112 | | 549 | | 830 | |

Remark 1. The sum of frequencies of the two main parties is less than the whole total because in the whole total there are also frequencies of other smaller parties. Remark 2. When the Impartiality index (Δ(PD-PdL)) is 0, it indicates absolute impartiality. When it has positive values, PdL is treated better than PD; when it has negative values PD is treated better than PdL.
3 Discussion and Conclusions 3.1 Toughness and Partiality of Italian Political Broadcasts This study highlights that aggressive interruptions are very frequent in Italian political interviews. They are more frequent than those made in informal contexts [23] and in other formal contexts, like for example in courtroom [24] [25]. These results are consistent with the ones of other international studies [14] [15], which highlight the high frequency of interruptions in political interviews. Interruptions seem to be a television more than a radio or digital phenomenon. In general, interviewers keep the same interruptive behavior towards the two considered political parties. This result is not surprising if we consider that it can probably be a reflection of the general style of interviewing of the leading journalist. However this general style can be more though towards one party rather than towards the other one, and this can explain results concerning partiality. We can observe that “Controcorrente” (broadcast of the new information channel Sky TG24) shows high levels of toughness. Moreover “In mezz’ora” has opposite levels of face threatening with respect to the common sense: it shows low levels of toughness. In general, among the twelve analyzed broadcasts, six are favorable to PD (among these, four in a very clear way), which in general is also the better treated party, two are favorable to PdL, and four are almost fair. Among the five broadcasts of Rai 3, three clearly support PD (“Ballarò”, “Tg3 Primo Piano”, “Telecamere”), one is fair and the other one is in favor of PdL (“In mezz’ora”): this substantiates the left-wing political orientation of the channel. “Conferenza Stampa” seems to be fair; it’s aired only during electoral campaign and it’s also the least tough among television broadcasts. Perhaps this is due to the fact that it’s the heir of the traditional political information broadcast (the former “Tribuna Politica”, that once was the unique broadcast of political information) and it’s organized in this way: there are one interviewed politician and many journalists from different newspapers that take turns in asking questions. Concerning the other two broadcasts of Rai (Rai 1 e Rai 2), one is fair (“Porta a Porta”) and the other one, “AnnoZero”, is favorable to PdL. This last result is quite surprising, being the anchorman often accused of favoring left-wing parties. The only considered broadcast of Mediaset channels (“Matrix”) supports the leftwing party. This result, apparently in contrast with expectations (a Mediaset channel is not expected to support a left-wing party, because of its owner’s political orientation; see above), is consistent with the later disclosed difficulties that the anchorman was having with Mediaset’s board during the general election of 2008 [26]. Afterwards he was dismissed and he is now (2010) in charge of the competing channel La7. “Otto e Mezzo” of La7 is quite tough but substantially fair, consistent with results of other studies [2] conducted on the same broadcast but with different interviewers. The only broadcast of Sky seems to be consistent with the expectations, because it is very tough and quite contrary to PdL. In effect it’s known that Murdoch, who is the owner of Sky, is a competitor of Mediaset, whose owner is Berlusconi’s family. Berlusconi is also a direct competitor of the satellite TV because he has got a digital pay TV.
After all, if we consider the political situation of general election of 2008, only among the television broadcasts six out of ten are consistent with expectations of common sense and with journalistic analyses. We are referring to those broadcasts which are expected favoring the left-wing party, namely those aired on Rai 3 and on Sky; there are also two broadcasts which neutrality was expected, namely “Otto e Mezzo” and “Conferenza Stampa”, as we have already explained. Results are also consistent with the political period we considered: the board was designated by a leftwing party, which was in charge before the general election of 2008. The “Matrix” exception can be understood taking into consideration the following interviewer’s dismissal (see above). However there are two remarkable exceptions to expectations. First of all, “AnnoZero”, presented by Santoro, is the toughest broadcast and it shows some trends in favour of PdL (which result is consistent with the administration of Rai 2see above). Second, “In mezz’ora”, whose interviewer is Annunziata, in general shows low levels of toughness and it threatens PD, while supporting PdL. Our results point out that tendentiousness and impartiality coexist in the main Italian political broadcasts. Many of them show a bias which is consistent with the political orientation and/or with the orientation of the economic property of the channel and with who has designated the public channels boards (namely, the government). Hence, these results partly reflect stereotypes and expectations of public opinion. However there are some broadcasts that contrast stereotypes because they are substantially fair: among these we find two that are notoriously accused of favoring the left-wing party (“Conferenza Stampa”, aired on Rai 3; “Otto e Mezzo”, aired on La7). In general, we should not generalize these results to the interviewer tout court. First, the considered indexes of toughness and impartiality are bound to interruption, that is a structural index, and they are not necessarily associated with, for example, the face-threatening aspect of the contents treated during the political interviews. Therefore, a possible interviewer’s partiality doesn’t leave out any other one based on different indexes, like questions [15] [9] [2]. More, interruptions and questions, as indexes of interviewers’ performance, are both indexes that can be generalized to the interactive parts of a broadcast. For example, an anchorman can result impartial when he/she interviews politicians but there is the possibility that other parts of the broadcast (reports, comments of experts, videos, etc.) are completely tendentious. Therefore, we should always keep in mind that the indexes used in this research work well especially for interactive parts of the broadcasts but do not grasp the complete partiality of the broadcast or of the interviewer. The considered indexes (toughness and partiality) and other measures should be used to guarantee an equal treatment to all political parties. This aim overshoots and at the same time completes the law 28/2000. In every democratic system, press should become the “watchdog” of democracy, it should develop indispensable antibodies for the regular execution of democratic rules [27]. Without correct and impartial information citizens can’t be fully responsible and aware of their choices. Future researches have to combine the results concerning different threatening and equivocation indexes joining them into a single corpus of results.
3.2 A Software for Coding Interruptions The results of the study analyzing toughness and partiality of broadcasts suggest that the adaptation of the used interruption coding system can be profitably applied to study this frequent characteristic of human interaction. It is reliable and exhaustive, because its categories comprise all the modalities of turn taking, at least those occurring in Italian language. Next step would be implementing it on a software with two important features. The first one is interactivity: it might allow to obtain information in a dynamic way, so that the user can learn the coding system in a simple way. The second one is multimedia: the software has to be based on different communicative media, for a correct identification of the type of interruptions (video, audio, transcriptions, etc.). Further studies may code different behavioral flows by many coding systems concerning gestures, posture, facial expressions, prosody related to interruptions. The correlations between all of these nonverbal cues may help the construction of a complex multimedia software. So it would be easier labeling turn taking forms taking into account nonverbal aspects related to some specific types of interruptions. The software will be in a flow chart form (as ICS [19]), in order to identify the right type of interruption answering to a series of nested yes/no questions (for a simplified example, see Figure 1). So, if the answer to the first question of the flow chart (Figure 1) “Do speakers take the floor synchronizing them?” is “Yes”, then observer can codify the turn-taking as a “neutral event”. Simply by a click of the mouse, user will have the definition of the category at stakes and some real examples. For example, by a click on the box “Neutral event”, observer can read: “It’s a type of turn taking in which speakers take turns tidily synchronizing them, sometimes with brief pauses or little overlapping”. Another click of the mouse allow the observer to reach an example of neutral event, including a video- and audiorecording short scene in which two speakers talk one at a time. A transcription of the turn taking appears in the same window and helps observer to immediately identify the event of interest. The following exchange can be an example: “A: I think we can have a break now. – B: Uhm, I have to complete my task, I can’t right now”. Indeed the software will comprise a video recording catalog taken also from the sample of this study; the catalog will include examples of every category. This type of software can be useful for all researchers of various fields interested in observational studies involving interaction and interruptions.
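A toy console sketch of the interactive, nested yes/no logic described above; it is not the planned software, only an illustration of how each node could expose a definition and an example on request, mirroring the "click for definition" behaviour.

    NODES = {
        "neutral event": ("Speakers take turns tidily, synchronizing them, sometimes with "
                          "brief pauses or little overlapping.",
                          "A: I think we can have a break now. - B: Uhm, I can't right now."),
        "supportive intervention": ("The intrusion supports the speaker and shows interest.",
                                    "A: ...and we can- B: Uhm, that's right! Let's go!"),
        "successful interruption": ("The interrupter takes the floor and completes his/her speech.",
                                    "A: ...and we can- B: Uhm, I have to complete my task."),
        "unsuccessful interruption": ("The first speaker keeps the floor.",
                                      "A: ...break +now and we can* go. - B: +Uhm, I have to*"),
    }

    def ask(question):
        return input(question + " [y/n] ").strip().lower().startswith("y")

    def code_event():
        if ask("Do speakers take the floor synchronizing them?"):
            label = "neutral event"
        elif ask("Does the interruption aim at supporting the speaker?"):
            label = "supportive intervention"
        elif ask("Does the second speaker prevent the first from completing the turn and finish his/her utterance?"):
            label = "successful interruption"
        else:
            label = "unsuccessful interruption"
        definition, example = NODES[label]
        print(label.upper(), "-", definition)
        print("Example:", example)
        return label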
References 1. Law 2000 February 22th, n. 28. Disposizioni per la Parità di Accesso ai Mezzi di Informazione durante le Campagne Elettorali e Referendarie e per la Comunicazione Politica (Regulation for Equal Access to Information Media during Electoral and Referendum Campaigns and for the Political Communication). Gazzetta Ufficiale della Repubblica, 43 2. Gnisci, A.: Coercive and Face-Threatening Questions to Left-Wing and Right-Wing Politicians during Two Italian Broadcasts: Conversational Indexes of Par Conditio for Democracy Systems. J. Appl. Soc. Psychol. 38, 1179–1210 (2008)
3. Bavelas, J.B., Black, A., Bryson, L., Mullett, J.: Political Equivocation: A Situational Explanation. J. Lang. Soc. Psychol. 7, 137–145 (1988) 4. Lewin, K.: The Conceptual Representation and Measurement of Psychological Forces. Contributions to Psychological Theory 1 (1938) 5. Bavelas, J.B.: Theoretical and Methodological Principles of the Equivocation Project. J. Lang. Soc. Psychol. 17, 183–199 (1998) 6. Goffman, E.: On Face-Work: An Analysis of Ritual Elements in Social Interaction. Psychiatry 18, 213–231 (1955); Reprinted in Goffman, E.: Interaction Ritual: Essays on Face to Face Behavior. Anchor, Garden City, NY (1967) 7. Bull, P.E.: ‘‘Slipperiness, Evasion, and Ambiguity’’: Equivocation and Facework in Noncommittal Political Discourse. J. Lang. Soc. Psychol. 27, 324–332 (2008) 8. Bull, P., Elliott, J., Palmer, D., Walker, L.: Why Politicians are Three-Faced: The Face Model of Political Interviews. Brit. J. Soc. Psychol. 35, 267–284 (1996) 9. Gnisci, A., Bonaiuto, M.: Grilling Politicians. Politicians’ Answers to Questions in Television Interviews and Courtroom Examinations. J. Lang. Soc. Psychol. 22, 385–413 (2003) 10. Bull, P.: On Identifying Questions, Replies and Non-Replies in Political Interviews. J. Lang. Soc. Psychol. 13, 115–131 (1994) 11. Bull, P., Elliott, J.: Level of Threat: A Means of Assessing Interviewer Toughness and Neutrality. J. Lang. Soc. Psychol. 17, 220–244 (1998) 12. Bull, P., Mayer, K.: Interruptions in Political Interviews: A Study of Margaret Thatcher and Neil Kinnock. J. Lang. Soc. Psychol. 7, 35–45 (1988) 13. West, C., Zimmerman, D.H.: Small Insults: A study of Interruptions in Cross-sex Conversations between Unacquainted Persons. In: Thorne, B., Henley, N. (eds.) Language, Gender, and Society, pp. 118–124. Newbury House, Rowley (1983) 14. Beattie, G.W.: Turn-Taking and Interruptions in Political Interviews- Margaret Thatcher and Jim Callaghan Compared and Contrasted. Semiotica 39, 93–114 (1982) 15. Bull, P.: The Microanalysis of Political Communication: Claptrap and Ambiguity. Routledge, London (2003) 16. Roger, D.B., Schumacher, A.: Effects of Individual Differences on Dyadic Conversational Strategies. J. Pers. Soc. Psychol. 45, 700–705 (1983) 17. Bazzanella, C.: Le Interruzioni “Competitive” e “Supportive”: Verso una Configurazione Complessiva. In: Stati, S., Weigand, E., Hundsnurscher, F. (eds.) Dialog-analyse III, Niemeyer, Tubingen, pp. 283–292 (1991) 18. Murata, K.: Intrusive or Cooperative? A Cross-cultural Study of Interruption. J. Pragmatics 21, 385–400 (1994) 19. Roger, D.B., Bull, P.E., Smith, S.: The Development of a Comprehensive System for Classifying Interruptions. J. Lang. Soc. Psychol. 7, 27–34 (1988) 20. Brown, P., Levinson, S.C.: Universals in Language Usage: Politeness Phenomena. In: Goody, E. (ed.) Questions of Politeness, pp. 56–310. Cambridge University Press, Cambridge (1978) 21. Gnisci, A., Bull, P., Graziano, E., Ciancia, M.R., Errico, D.: Un Sistema di Codifica delle Interruzioni nell Intervista Politica Italiana. Psicologia Sociale 1, 119–140 (2011) 22. Cohen, J.A.: A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 20, 37– 46 (1960) 23. Bazzanella, C.: Le Facce del Parlare: Un Approccio Pragmatico all’Italiano Parlato, La Nuova Italia, Firenze (2001) 24. Gnisci, A.: Sequential Strategies of Accomodation: A New Method in Courtroom. Brit. J. Soc. Psychol. 44, 621–643 (2005)
25. Gnisci, A., Bakeman, R.: Sequential Accomodation of Turn Taking and Turn Length: A Study of Courtroom Interaction. J. Lang. Soc. Psychol. 26, 134–259 (2007) 26. Mentana, E.: La Passionaccia. Milano, Rizzoli (2009) 27. Gnisci, A., Di Conza, A., Zollo., P.: Political Journalism as a Democracy Watchman. In: Herrmann, P. (ed.) Democracy in Theory and Action, NOVA Publishers, New York (in press) 28. Gnisci, A., Sergi, I., De Luca, E., Errico, V.: Does Frequency of Interruptions Amplify the effect of Various Types of Interruptions? Experimental Evidence. J. Nonverbal Behav. (in press)
On the Perception of Emotional “Voices”: A Cross-Cultural Comparison among American, French and Italian Subjects Maria Teresa Riviello1, Mohamed Chetouani2, David Cohen2, and Anna Esposito1 1
Seconda Università degli Studi di Napoli, Department of Psychology, and IIASS, Italy 2 University Pierre and Marie Curie (UPMC), Paris, France [email protected], [email protected], [email protected], [email protected]
Abstract. Does a global world mean a common perception? Is the rather universal use of American English as a tool of communication among different cultures widespread enough to bring about a universal perception of emotional states? How much of the supra-segmental emotional information is captured by non-native speakers, and how well do native speakers perform? The present work aims to investigate how different cultures perceive emotional American English voices. In particular, the comparison reported here is among American, French and Italian subjects, who were tested on the perception of emotional voices extracted from American English movies. The assumption is that the recognition of the emotional states expressed by the actors/actresses will change according to the familiarity of the language and the exposure of the subjects to the cultural environment. The results show that the identification of emotional voices depends on the native language of the listener, since the ability of Italian subjects to recognize emotional information from American vocal cues differs significantly from that of American and French subjects. Keywords: Vocal expression of emotion, (non) native speakers, vocal perception of emotion, cross-cultural comparison.
1 Introduction Currently, English is the dominant international language in communications, science, business, aviation, entertainment, radio and diplomacy. It is the world's second largest native language, the official language in 70 countries, the world media language, and the language of cinema, TV, pop music and the computer world. According to Wikipedia, one billion people speak English as their first or second language, and another billion are learning it. The causes of this universality are very well known and understandable. English first began to spread during the 16th century with the British Empire and was strongly reinforced in the 20th century by the world domination of the USA in economic, political and military terms [7].
In the era of world mass communication American English is described as the Universal language. All over the planet people know many English words, their pronunciation and meaning. However, transmitting and/or understanding the semantic meaning of words or phrases of a spoken language are not all: speech is an acoustically rich signal that provides considerable personal information about talkers, such as her/his state of health, intentions and emotional states [1, 18, and 30] In particular, the expression of emotions in speech sounds and corresponding abilities to perceive such emotions are both fundamental aspects of human communication. Emotions, producing changes in respiration, phonation, and articulation [3-6, 9-11, 21, 23, 32-35], affecting vocalizations and the acoustic parameters of the corresponding signal such as amplitude, sound intensity and the fundamental frequency (F0). These variations in the acoustic features directly affect the suprasegmental and prosodic aspect of the speech like loudness, pitch and timing. For such universal language, does the ability to infer emotion from vocal cues exist among different cultures or is it a prerogative of native speakers of the language? How much of emotional information is captured by non-native speakers and how well do the native ones perform? The goal of the present work is to explore how members of different cultures1 perceive emotional American English voices. A comparison among American, French and Italian subjects, tested on the perception of emotional voices extracted from American English movies, is reported. Given the indisputable role of emotional expression in social communication, the ability of members of one culture to correctly identify the meaning of the emotional expressions in another culture should provide at least some support for positions claiming a high degree of universality of the emotion process [8, 12-14, 24-27, and 29]. Psychologists have long debated whether emotions are universal versus whether they vary by culture, and many theorists have taken extreme positions. After Darwin [8], who was the first to suggest that emotions have important adaptational functions and that specific expressions and physiological response patterns are rudiments of appropriate behaviors, Tomkins and those he inspired [12, 24] perpetuated Darwin’s fundamental assumption of a biologically based emotion mechanism clearly implying intercultural universality of the emotion process. To these, Tomkins added the idea of a small, fixed number of discrete ("basic") emotions. According to Tomkins's theory, each basic emotion can vary in intensity and consists of a single brain process (an "affect program"), whose triggering produces all the various manifestations (components) of the emotion, including its facial and vocal expression, changes in peripheral physiology, subjective experience, and instrumental action. The set of theories, methods, and assumptions inspired by Tomkins guided the study of emotion for over a quarter century. On the other side, the idea of the universality of the emotions’ expressions was debated by several authors, among them White [37] and Fridlund [21], according to whom any behavior, and therefore also expressions of emotions, are learned and thus they vary across cultures. 1
The term “culture” is here used as a theoretical concept (in a very general sense), to identify the geo-graphical location, language, history, lifestyle, the costumes and tradition of a country.
Recent theoretical models have attempted to account for both universality and cultural variation by specifying which particular aspects of emotion show similarities and differences across cultural boundaries [15, 20, 28, 32-33]. Along this line, our approach is based on the assumption that culture and languagespecific paralinguistic patterns may influence the decoding process of a speech, and the familiarity of the language and the expositions of the subjects to the cultural environment affect the recognition of the emotional states vocally expressed [17, 32].
2 Materials Emotions in speech may be real or pretended. The first type occurs when a speaker is truly happy, sad or angry, and this emotional state is reflected in his or her speech. The second type occurs when the emotion expressed is not the same as the person’s emotional state. This is the case, for instance, when an actor pretends to be sad or happy. Most of the research on emotions in speech has focused on the latter type of emotions in speech [31]. The material used in the presented study consists of audio waves extracts from American English movies whose protagonists were carefully chosen among actors and actresses that are largely acknowledged by the critique and considered capable of giving some very real and careful interpretations. The use of audio waves extracted from movies provided a set of more realistic emotional expressions. In fact, actors and actresses were not asked to produce a given emotional expression, but they were acting according to the movie script and the movie director (supposed to be an expert) has assessed his/her performance as appropriate to the required emotional context, and even though the emotions expressed were still simulations under studio conditions (and may not have reproduced a genuine emotion but an idealization of it) they were able to catch up and engage the emotional feeling of the spectators and therefore we were quite confident of their perceptual emotional contents. The current database consists of audio stimuli representing and expressing 6 different emotional states: Happiness, Sarcasm/Irony, Fear, Anger, Surprise, and Sadness (these emotions were selected because many theories of emotions agree that they can be considered as basic and universal [12-14, 26, 33-35]. Each emotional state was represented by 10 stimuli, 5 produced by an actor and 5 produced by an actress, coming up to a total of 60 audio stimuli. The selected stimuli were short in duration (the average stimulus’ length was 3.5s, SD = ± 1s), to reject any overlapping of emotional states and moods that could have confused the subject’s perception. In addition, stimuli were chosen so that the semantic meaning of the sentences expressed by the protagonists was not clearly expressing the portrayed emotional state and its intensity level was moderate, trying to obtain emotional expressions very similar to what generally occurs in natural social interaction. Once selected, the stimuli were labeled by two expert judges and then by three naïve judges independently. The expert judges carefully examined the stimuli
exploiting emotional information on vocal expressions such as F0 contour, rising and falling of intonation contour, etc [16, 19, 36-37] and also considering the contextual situation the protagonist was interpreting. There were no opinion exchanges between the experts and naïve judges and the final agreement on the labeling between the two groups was 100%. The collected stimuli, being extracted from movie scenes containing environmental noise are also useful for testing realistic computer applications [2]. The stimuli were then randomized and proposed to American, French and Italian subjects in order to explore their ability to recognize American English emotional vocal expressions. 2.1 Participants A total of 90 participants, 30 Americans, 30 French and 30 Italian were involved in the evaluation of the American English emotional audio stimuli. In each group 15 participants were male and 15 were female. The participants’ age was similar between countries, ranging from 18 to 35 years. The knowledge of American English by the Italian and French subjects was comparable. Subjects were required to carefully listen to the experimental stimuli via headphones in a quiet room. They were asked to focus on the emotion expressed and decide at the end of each presentation, which emotional state was expressed in it. Responses were recorded on a matrix paper form 30x8 where the rows listed the stimuli’s numbers and the columns the emotional states of happiness, sarcasm/irony, fear, anger, surprise, and sadness, plus an option for any other emotion (where subjects were free to report a different emotional label than the six listed), plus the option neutral that was suggested when according to the subject’s feeling the protagonist did not show an emotional state. Each emotional label given by the participants as an alternative to one of the six listed was included in one of listed emotional class only if criteria of synonymity and/or analogy were satisfied otherwise it was included in the class labeled “any other emotion”.
3 Results The data obtained from American, French and Italian subjects were first analyzed separately in terms of confusion matrices matching the intended/encoded categories with the inferred/decoded categories (Tables 1, 2, and 3). This analysis provides the percentages of correct inference (recognition accuracy) through the entries on the diagonal of each matrix, as well as the pattern of errors or confusions in the off-diagonal entries. Since American subjects share both the language and, possibly, the cultural background of emotion expression with the encoders of the stimulus material, they can be regarded as a reference for optimal emotion recognition. The results of the American subjects are presented in Table 1, while the results of French and Italian subjects are reported in Tables 2 and 3, respectively.
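The confusion matrices below can be derived from the raw response sheets as in the following sketch, where each response pairs the intended (encoded) label of a stimulus with the label chosen by a subject, and percentages are taken over the 300 expected correct answers per emotion (10 stimuli times 30 subjects), as in Tables 1-3.

    from collections import defaultdict

    LABELS = ["happiness", "fear", "anger", "irony/sarcasm", "surprise",
              "sadness", "no emotion", "others"]

    def confusion_matrix(responses, n_expected=300):
        """responses: iterable of (intended_label, chosen_label) pairs."""
        counts = defaultdict(lambda: defaultdict(int))
        for intended, chosen in responses:
            counts[intended][chosen] += 1
        return {row: {col: 100.0 * counts[row][col] / n_expected for col in LABELS}
                for row in LABELS[:6]}          # only the six encoded emotions have rows

    def recognition_accuracy(matrix):
        """Diagonal of the confusion matrix: percentage of correct inferences per emotion."""
        return {emotion: matrix[emotion][emotion] for emotion in matrix}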
Table 1. Confusion matrix for American Subjects assessing American English audio emotional stimuli. The numbers are percentages computed considering the number of correct answers over the total number of expected correct answers (300) for each emotional state

Stimuli / % of recognition | HAPPINESS | FEAR | ANGER | IRO/SAR | SURPRISE | SADNESS | NO EMOTION | OTHERS
HAPPINESS | 33.3 | 2 | 2.3 | 15.3 | 10.3 | 6 | 23 | 7.7
FEAR | 1 | 76.7 | 2 | 0.33 | 9 | 3.7 | 5.7 | 1.7
ANGER | 1 | 0.7 | 86 | 3.03 | 3.7 | 1.3 | 3.3 | 0.7
IRO/SAR | 11.7 | 1.3 | 2.7 | 44.3 | 9.3 | 5.3 | 20.7 | 4.7
SURPRISE | 5.3 | 6.3 | 6.3 | 10.3 | 47 | 9 | 12.3 | 3.3
SADNESS | 2 | 22 | 3.3 | 2.7 | 2.7 | 52.7 | 9.3 | 5.3
Table 2. Confusion matrix for French Subjects assessing American English audio emotional stimuli. The numbers are percentages computed considering the number of correct answers over the total number of expected correct answers (300) for each emotional state

Stimuli / % of recognition | HAPPINESS | FEAR | ANGER | IRO/SAR | SURPRISE | SADNESS | NO EMOTION | OTHERS
HAPPINESS | 34.33 | 5.33 | 2.7 | 13.33 | 6.33 | 4.33 | 28.7 | 5
FEAR | 1 | 72.7 | 5.3 | 0.33 | 8.33 | 4.7 | 5.33 | 2.33
ANGER | 0.33 | 0 | 92 | 1.7 | 3 | 0.7 | 2 | 0.33
IRO/SAR | 16 | 1.7 | 1.7 | 33 | 5.33 | 6 | 33 | 3.33
SURPRISE | 6.33 | 5 | 7 | 4.33 | 45.33 | 11.33 | 15 | 5.7
SADNESS | 0.7 | 20.7 | 3 | 1 | 0.7 | 54.33 | 15.33 | 4.33
Table 3. Confusion matrix for Italian Subjects assessing American English audio emotional stimuli. The numbers are percentages computed considering the number of correct answers over the total number of expected correct answers (300) for each emotional state

Stimuli / % of recognition | HAPPINESS | FEAR | ANGER | IRO/SAR | SURPRISE | SADNESS | NO EMOTION | OTHERS
HAPPINESS | 40 | 4.3 | 6.7 | 15 | 10.3 | 8 | 9.3 | 6.3
FEAR | 3 | 49 | 10.7 | 8 | 8.3 | 6.7 | 8.3 | 6
ANGER | 3.7 | 1.7 | 76.3 | 5.3 | 4 | 1 | 6.3 | 1.7
IRO/SAR | 16 | 3.4 | 7 | 27.3 | 7 | 15.7 | 8.3 | 15.3
SURPRISE | 9.3 | 8 | 4 | 10.7 | 32.3 | 12 | 17.3 | 6.4
SADNESS | 3.3 | 24 | 6.7 | 2.7 | 4.6 | 44.7 | 9 | 5
Across emotions, the error patterns observed in the confusion matrices seem to be similar across countries. In particular, for all three groups of subjects, Happiness is confused with Irony and Surprise and vice-versa. Fear is most confused with Surprise for Americans and French, whereas for Italians this emotional state is
confused with Anger, Surprise and Sadness. In this case, the Italians' recognition accuracy for Fear is very low with respect to the other groups evaluating the audio stimuli. A similar trend is shown for Sadness, which is mainly confused with Fear by all groups of subjects, and for Anger, which gets the highest percentage of recognition accuracy in subjects from all the considered countries. To evaluate intercultural differences in the decoding of vocal portrayals of emotional states, Figure 1 reports the main diagonals of the confusion matrices for American, French and Italian subjects in order to facilitate a comparison.
Fig. 1. Overall results in emotion recognition. The x-axis shows the basic emotions under consideration; the y-axis reports, for each emotion, the percentage of correct agreement obtained by the American (grey bars), French (black bars) and Italian (white bars) subjects, respectively.
An ANOVA was performed on the collected data, considering nationality as a between-subjects variable and emotion and gender as within-subjects variables. Significance was established at α = .05. The analysis shows that nationality plays a significant role (F(2, 12) = 4.288, p = .04); in fact, a Duncan post-hoc test revealed that the Italian subjects differ significantly from both the French and the American subjects at α = .05. No interaction was found between nationality and emotion (F(10, 60) = .671, p = .74), or between nationality and gender (F(2, 12) = .105, p = .89). Identification of the emotions differs significantly (F(5, 60) = 17.790, p = .0001). In addition, there is no effect of male versus female voice, as displayed in Figure 2, where the straight lines represent the trend of accuracy in the emotional recognition of male and female voices for American, French, and Italian subjects.
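For readers who wish to reproduce this kind of analysis, the sketch below shows a simplified mixed-design ANOVA using the pingouin library, with nationality as the between-subjects factor and emotion as the within-subjects factor; the gender factor used in the paper is omitted here for brevity, and the data frame and column names are hypothetical assumptions, not the authors' original code.

```python
# Simplified sketch of the mixed-design analysis (hypothetical column names).
import pingouin as pg

def nationality_emotion_anova(scores):
    """scores: long-format DataFrame with one accuracy value per subject x emotion,
    columns 'subject', 'nationality', 'emotion', 'accuracy' (all hypothetical)."""
    return pg.mixed_anova(data=scores, dv="accuracy",
                          within="emotion", subject="subject",
                          between="nationality")

# A Duncan post-hoc test, as used in the paper, is not provided by this library;
# pairwise comparisons between nationalities would have to be run separately.
```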
Fig. 2. Effect of male and female voices on the recognition accuracy for American, French, and Italian subjects
4 Conclusions
The present study reports on the perception of emotions from vocal expressions in a cross-cultural perspective. In particular, it investigates whether the ability to infer emotional information from acoustic cues in American English speech is shared among different cultures, as is the use of that language, which is considered the universal one. The data presented, obtained by comparing American, French, and Italian subjects on their ability to recognize emotional expressions from American voices extracted from American movies, make it possible to hypothesize that knowledge of the American English language is widespread enough to allow a general identification of emotional voices. In fact, as described above, for all three groups of subjects the accuracy in the recognition of emotional vocal expressions follows a similar trend across all emotions, except for Fear, which is equally well recognized by the French and the Americans but not by the Italian subjects. Nevertheless, it seems that speakers of different languages have different sensitivity to the emotional information conveyed through speech. In fact, even though the knowledge of American English by the Italian and French subjects was comparable, Italians have more difficulty than French subjects in identifying foreign emotional vocal expressions. It is possible that at the basis of the encoding of emotional information there is a more language-specific process, strictly related to the native language and to the way suprasegmental information is encoded and expressed in it. Before going into this hypothesis, the above results need to be supported by more data, by extending the perceptual experiment to participants from other countries and by investigating the identification of vocal emotional information using a less widespread language. To this end we intend to involve Hungarian and Indian participants in the emotional perception experiments, as well as to test the participants of all countries on the perception of Italian emotional voices. Additional significant differences in the perception of
foreign emotional vocal expressions among cultures would support the hypothesis that culture, and in particular language specificity, affects the recognition of emotional vocal expressions. However, it is worth highlighting the high percentage of recognition obtained for Anger in all three groups, no matter the cultural context. This result suggests that, among the emotions, Anger is perceptually privileged, probably because of the phylogenetic value of its clear survival function [12]. Identifying anger in the interlocutor may trigger cognitive self-defense mechanisms that are critical for the perceiver's survival, and therefore humans may have a high sensitivity to recognizing it independently of the cultural environment.
Acknowledgements. This work has been supported by the European projects COST 2102 "Cross Modal Analysis of Verbal and Nonverbal Communication", http://cost2102.cs.stir.ac.uk/, and COST ISCH TD0904 "TMELY: Time in MEntal activitY" (http://w3.cost.eu/index.php?id=233&action_number=TD0904). Acknowledgements go to Miss Tina Marcella Nappi for her editorial help.
References 1. Apple, W., Hecht, K.: Speaking emotionally: The relation between verbal and vocal communication of affect. Journal of Personality and Social Psychology 42, 864–875 (1982) 2. Atassi, H., Riviello, M.T., Smékal, Z., Hussain, A., Esposito, A.: Emotional Vocal Expressions Recognition using the COST 2102 Italian Database of Emotional Speech. In: Esposito, A., Campbell, N., Vogel, C., Hussain, A., Nijholt, A. (eds.) Second COST 2102. LNCS, vol. 5967, pp. 255–267. Springer, Heidelberg (2010) 3. Bachorowski, J.A.: Vocal expression and perception of emotion. Current Directions in Psychological Science 8, 53–57 (1999) 4. Banse, R., Scherer, K.: Acoustic profiles in vocal emotion expression. Journal of Personality & Social Psychology 70(3), 614–636 (1996) 5. Breitenstein, C., Van Lancker, D., Daum, I.: The contribution of speech rate and pitch variation to the perception of vocal emotions in a German and an American sample. Cognition & Emotion 15, 57–79 (2001) 6. Cosmides, L.: Invariances in the acoustic expressions of emotions during speech. Journal of Experimental Psycology, Human Perception Performance 9, 864–881 (1983) 7. Crystal, D.: English as a global Language, 2nd edn. Cambridge University Press, Cambridge (2003) 8. Darwin, C.: The expression of the emotions in man and the animals (1872); Reproducedby the University of Chicago, Chicago press (1965) 9. Davitz, J.R.: Auditory correlates of vocal expression of emotional feeling. In: Davitz, J.R. (ed.) The Communication of Emotional Meaning, pp. 101–112. McGraw-Hill, New York (1964) 10. Davitz, J.: The communication of emotional meaning. McGraw-Hill, New York (1964) 11. Davitz, J.R.: Auditory correlates of vocal expression of emotional feeling. In: Davitz, J.R. (ed.) The Communication of Emotional Meaning, pp. 101–112. McGraw-Hill, New York (1964) 12. Ekman, P.: An argument for basic emotions. Cognition and Emotion 6, 169–200 (1992)
13. Ekman, P.: The argument and evidence about universals in facial expressions of emotion. In: Wagner, H., Manstead, A. (eds.) Handbook of Social Psychophysiology, pp. 143–164. Wiley, Chichester (1989) 14. Ekman, P.: Expression and the nature of emotion. In: Scherer, K., Ekman, P. (eds.) Approaches to Emotion, pp. 319–343. Lawrence Erlbaum, Hillsdale (1984) 15. Ekman, P.: Universals and cultural differences in facial expressions of emotion. In: Cole, J. (ed.) Nebraska Symposium on Motivation, 1971, vol. 19, pp. 207–282. University of Nebraska Press, Lincoln (1972) 16. Esposito, A.: The Perceptual and Cognitive Role of Visual and Auditory Channels in Conveying Emotional Information. Cognitive Computation Journal 1(2), 268–278 (2009) 17. Esposito, A., Riviello, M.T., Bourbakis, N.: Cultural Specific Effects on the Recognition of Basic Emotions: A Study on Italian Subjects. In: Holzinger, A., Miesenberger, K. (eds.) USAB 2009. LNCS, vol. 5889, pp. 135–148. Springer, Heidelberg (2009) 18. Esposito, A.: Affect in Multimodal Information. In: Tao, J., Tan, T. (eds.) Affective Information Processing, pp. 211–234. Springer, Heidelberg (2008) 19. Esposito, A.: The Amount of Information on Emotional States Conveyed by the Verbal and Nonverbal Channels: Some Perceptual Data. In: Stylianou, Y., Faundez-Zanuy, M., Esposito, A. (eds.) COST 277. LNCS, vol. 4391, pp. 249–268. Springer, Heidelberg (2007) 20. Fiske, A.P., Kitayama, S., Markus, H.R., Nisbett, R.E.: The cultural matrix of social psychology. In: Gilbert, D.T., Fiske, S.T., Lindzey, G. (eds.) The Handbook of Social Psychology, 4th edn., pp. 915–981. McGraw-Hill, Boston (1998) 21. Fridlund, A.J.: The new ethology of human facial expressions. In: Russell, J.A., FernandezDols, J. (eds.) The Psychology of Facial Expression, pp. 103–129. Cambridge University Press, Cambridge (1997) 22. Friend, M.: Developmental changes in sensitivity to vocal paralanguage. Developmental Science 3, 148–162 (2000) 23. Fulcher, J.A.: Vocal affect expression as an indicator of affective response. Behavior Research Methods, Instruments, & Computers 23, 306–313 (1991) 24. Izard, C.E.: Innate and universal facial expressions: Evidence from developmental and cross-cultural research. Psychological Bulletin 115, 288–299 (1994) 25. Izard, C.E.: Organizational and motivational functions of discrete emotions. In: Lewis, M., Haviland, J.M. (eds.) Handbook of Emotions, pp. 631–641. Guilford Press, New York (1993) 26. Izard, C.E.: Basic emotions, relations among emotions, and emotion–cognition relations. Psychological Review 99, 561–565 (1992) 27. Izard, C.: Human Emotions. Plenum Press, New York (1977) 28. Mesquita, B., Frijda, N.H., Scherer, K.R.: Culture and emotion. In: Berry, J.W., Dasen, P.R., Saraswathi, T.S. (eds.) Handbook of Cross-cultural Psychology. Basic processes and human development, vol. 2, pp. 255–297. Allyn & Bacon, Boston (1997) 29. Nushikyan, E.A.: Intonational universals in texual context. In: Elenius, K., Branderudf, P. (eds.) Proceedings of ICPhS 1995, Arne Strömbergs Grafiska, vol. 1, pp. 258–261 (1995) 30. Oatley, K., Jenkins, J.M.: Understanding emotions. Blackwell, Oxford (1996) 31. Scherer, K.R.: Vocal communication of emotion: A review of research paradigms. Speech Communication 40, 227–256 (2003) 32. Scherer, K.R., Banse, R., Wallbott, H.G.: Emotion inferences from vocal expression correlate across languages and cultures. Journal of Cross-Cultural Psychology 32, 76–92 (2001) 33. Scherer, K.R.: The role of culture in emotion-antecedent appraisal. 
Journal of Personality and Social Psychology 73, 902–922 (1997)
34. Scherer, K.R., Banse, R., Wallbott, H.G., Goldbeck, T.: Vocal cues in emotion encoding and decoding. Motivation and Emotion 15, 123–148 (1991) 35. Scherer, K.R.: Vocal correlates of emotional arousal and affective disturbance. In: Wagner, H., Manstead, A. (eds.) Handbook of Social Psychophysiology, pp. 165–197. Wiley, New York (1989) 36. Scherer, K.R., Oshinsky, J.S.: Cue utilization in emotion attribution from auditory stimuli. Motivation and Emotion 1, 331–346 (1977) 37. Ververidis, D., Kotropoulos, C.: Emotional Speech Recognition: Resources, Features and Methods. Speech Communication 48(9), 1162–1181 (2006) 38. White, G.M.: Emotion inside out: The anthropology of affect. In: Haviland, M., Lewis, J.M. (eds.) Handbook of Emotion, pp. 29–40. Guilford Press, New York (1993)
Influence of Visual Stimuli on Evaluation of Converted Emotional Speech by Listening Tests Jiří Přibil1,2 and Anna Přibilová3 1
Institute of Photonics and Electronics, Academy of Sciences CR, v.v.i., Chaberská 57, CZ-182 51 Prague 8, Czech Republic 2 Institute of Measurement Science, SAS, Dúbravská cesta 9, SK-841 04 Bratislava, Slovakia [email protected] 3 Institute of Electronics and Photonics, Faculty of Electrical Engineering & Information Technology, Slovak University of Technology, Ilkovičova 3, SK-812 19 Bratislava, Slovakia [email protected]
Abstract. Emotional voice conversion is usually evaluated by subjective listening tests. In our experiment, the sentences of emotionally converted speech were evaluated by a classical listening test accompanied by visual stimuli: affective pictures with emotional content from the International Affective Picture System (IAPS) database. The obtained results for sentences uttered by male and female speakers corresponding to five emotional states (joy, surprise, sadness, and anger, together with the original neutral state) confirm our working hypothesis about the influence of visual stimuli on the speech emotion evaluation process. Keywords: emotional voice conversion, visual stimuli, listening test.
1 Introduction
Research on multimodal communication of emotions has been at the centre of psychologists' interest in recent years. Results of the perception of emotions conveyed by verbal and nonverbal channels also depend on the form of the auditory and visual stimuli. Using video clips, which represent dynamically changing information perceived by both the auditory and the visual channel, did not show any increase in the amount of emotional information when the combination of both channels was used. Native speakers relied more on the auditory than on the visual channel to infer information on emotional states [1], and video alone was less informative. As regards static visual stimuli represented by still photographs, some authors maintain that facial expressions are more informative than vocal expressions, whereas others suggest that vocal expressions are more faithful than facial ones in expressing emotional states [2]. Psychologists are also interested in understanding the neural pathways in emotion processing. Neural mechanisms of audio-visual emotion perception were studied using stimuli consisting of a static photograph coupled with a short audio track, or of dynamic video paired with a single spoken word or a sentence. In the latter case, short movies
were blocked by modality (audio, video, and audio-video) and/or emotion (angry, fearful, happy, and neutral) and presented to participants undergoing fMRI scanning [3]. Many psychological studies have theoretically and empirically demonstrated the importance of the integration of information from multiple modalities (e.g. vocal and visual expression) to yield a coherent representation and inference of emotions [4-8]. In any modality affect can be described in terms of discrete categories of prototypical (basic) emotions which include happiness, sadness, fear, anger, disgust, and surprise [9], [10]. An alternative to the categorical description of human affect is the dimensional description where an affective state is characterized in terms of small number of latent dimensions rather than in terms of small number of discrete emotion categories [11], [12]. These dimensions include evaluation, activation, control, power, etc. In particular, the evaluation and activation dimensions are expected to reflect the main aspects of emotion. The evaluation dimension measures how a human feels from positive to negative. The activation dimension measures whether humans are more or less likely to take an action under the emotional state, from active to passive [13]. Sometimes the emotion space is described by three major dimensions: valence, arousal, and power [14]. Examination of the effect of audio-visual integration on perceived emotions by using face-voice stimuli showed that under normal conditions the visual information dominates the auditory one when judging an emotion. Some studies have investigated the same process for body-sound by means of music [15]. Studying perception of music accompanied with static visual stimuli has shown that music alone is more effective in raising emotional feeling than music combined either with congruent or incongruent visual stimuli [16]. In our research work static visual stimuli were used together with listening of converted emotional speech to evaluate influence of visual information on the results of listening tests.
2 Method
An internet realization of the automated listening test program supplemented with emotionally loaded visual information was used for the evaluation of emotional speech conversion of sentences uttered by male and female speakers corresponding to five emotional states (neutral together with converted joyous, surprised, sad, and angry ones).
2.1 Applied Emotional Speech Conversion Method
The emotional speech conversion method based on spectral modification of male and female voices using the cepstral and the harmonic speech model was described in our previous work [17], [18]. Our approach to spectral modification consists of a non-linear spectral envelope transformation (see the applied values of the formant ratios γ1 and γ2 in Table 1), with the effect of shifting the first formant to the left and the higher ones to the right for pleasant emotions, and the first formant to the right and the higher ones to the left for unpleasant emotions, according to the findings of psychological and phonetic research [19]. The proposed spectral modification is combined with
modification of the F0 mean, F0 range and energy (see the emotional-to-neutral ratios used in Table 2), and with the superposition of a linear F0 trend at the end of the sentence (rising for the joyous and surprised styles, falling for the angry style). The applied emotional speech conversion method also includes temporal changes: time duration lengthening for the sad emotional style, and time duration shortening for the joyous, surprised, and angry styles. The whole process of voice conversion is presented by the block diagram in Fig. 1.
Fig. 1. Block diagram of the applied emotional speech conversion method (input sentence in neutral emotional style → cepstral speech analysis → modification of prosodic parameters → spectral properties modification → pitch-synchronous speech synthesis with cepstral description → target sentences in joyous, surprised, sad, and angry styles)

Table 1. Applied emotional-to-neutral formant ratios γ1 at 165.5 Hz, γ2 at 2426 Hz [17], [18]
Chosen formant ratio (shift) | γ1 | γ2
angry-to-neutral | 1.35 (+35 %) | 0.85 (-15 %)
sad-to-neutral | 1.10 (+10 %) | 0.90 (-10 %)
surprise-to-neutral | 0.75 (-25 %) | 1.10 (+10 %)
joyous-to-neutral | 0.70 (-30 %) | 1.05 (+5 %)
Table 2. Emotional-to-neutral ratios for prosodic parameters transformation according to [17]
Emotion ratio | F0 mean | F0 range | Energy | Duration
angry-to-neutral | 1.16 | 1.30 | 1.70 | 0.84
sad-to-neutral | 0.81 | 0.62 | 0.95 | 1.16
surprise-to-neutral | 1.20 | 1.25 | 1.50 | 0.85
joyous-to-neutral | 1.18 | 1.30 | 1.30 | 0.81
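As a rough illustration of how the ratios in Table 2 act on a neutral utterance, the sketch below applies them to extracted prosodic contours. It is not the authors' implementation, which operates inside the cepstral and harmonic speech models of [17], [18]; the function and variable names are hypothetical.

```python
# Illustrative sketch only: applying the emotional-to-neutral ratios of Table 2
# to a neutral prosody description (F0 contour in Hz, energy envelope, duration).
import numpy as np

RATIOS = {  # (F0 mean, F0 range, energy, duration), taken from Table 2
    "angry":    (1.16, 1.30, 1.70, 0.84),
    "sad":      (0.81, 0.62, 0.95, 1.16),
    "surprise": (1.20, 1.25, 1.50, 0.85),
    "joyous":   (1.18, 1.30, 1.30, 0.81),
}

def convert_prosody(f0, energy, duration, style):
    r_mean, r_range, r_energy, r_dur = RATIOS[style]
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0                       # unvoiced frames stay at 0
    mean_f0 = f0[voiced].mean()
    new_f0 = f0.copy()
    # Scale the mean and the excursion around the mean separately.
    new_f0[voiced] = mean_f0 * r_mean + (f0[voiced] - mean_f0) * r_range
    new_energy = np.asarray(energy, dtype=float) * r_energy
    new_duration = duration * r_dur       # realised by time-scale modification
    return new_f0, new_energy, new_duration
```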
2.2 Listening Test Specification
Our internet server realization of the automated listening test program [20] has the following properties:
─ the listening test program and the testing speech corpus are stored on the server PC (where the listening tests are also executed),
─ the performance of the current listening test is controlled from the user's local computer,
─ the listening test program communicates with the user via the HTTP protocol (by WEB pages with frames in the HTML language) – see the example in Fig. 2,
─ the testing program automatically generates a text protocol about the currently running test,
─ the obtained test output values are stored on the server (which also performs statistical post-processing and evaluation),
─ the actual summary results are directly accessible to any connected internet user (in the form of a confusion matrix or as a diagram).
Fig. 2. Demonstration example of a WEB page generated by the automated listening test program, with typical user information and control elements (displayed in the Firefox internet browser): type of test, number of the evaluation set and of the sentence, background image, a button for replaying the evaluated sentence, and a button for sending the evaluation to the server. The background picture used is No. 7009 from the IAPS database [21].
2.3 Categorization of Picture Material for Listening Tests
The visual stimuli were evoked by coloured images picked from the International Affective Picture System (IAPS) [21], a standardized picture database. It consists of more than a thousand emotionally evocative colour photographs rated along three dimensions: valence, arousal, and dominance. The two primary dimensions of valence (ranging from pleasant to unpleasant) and arousal (ranging from calm to excited) can be used for the representation of the six primary emotions
(anger, disgust, fear, sadness, surprise, joy) in the affective space (excluding the dominance factor). In our listening test experiment, we use a different approach to the classification of emotional pictures described by the two emotional dimensions mentioned above. We divided the photographs into five categories of emotions (neutral, two pleasant, and two unpleasant) as follows:
1. Negative ⇐ Anger, Disgust, Fear
2. Slightly negative ⇐ Sadness
3. Neutral ⇐ Neutral
4. Slightly positive ⇐ Surprise
5. Positive ⇐ Joy
Preliminary ranges of valence and arousal values for seven emotional states (the six primary emotions plus a neutral one) were chosen experimentally with the aim of selecting maximally disjunctive groups of pictures corresponding to the basic emotional categories. The whole process of background picture selection from the IAPS database consists of five steps:
1. selection by picture size (only pictures with a dimension of 1024x768 pixels were selected) − automatically,
2. pre-selection by the arousal and valence parameter ranges of the basic emotion categories − automatically,
3. correction of the automatic selection – removing pictures with high violence (sadistic, perverse etc.) or erotic (pornographic) contents – manually,
4. separation of the pre-selected pictures into the five pleasant/unpleasant derived categories − automatically,
5. final selection by picture quality for displaying on the WEB page (removing pictures with black areas around them, pictures scanned from papers with low quality etc.) – manually.
3 Material and Experiments
The listening test called "Influence of visual information on determination of emotion type" is located on the web page http://www/lef.um.savba.sk/scripts/itstposl2.dll for all potential users. This test consists of 10 evaluation sets selected randomly from the test speech corpus (sampled at 16 kHz) in neutral and converted emotional styles. These sentences were obtained with the corresponding settings of the emotional conversion described in [17] and [18] by an off-line method at the time of building the speech corpus; this means that during the test already prepared wave files are used (with a mean time duration of 2 sec for the original sentences in neutral style) – see Table 3. Every evaluation set consists of five sentences that are sequentially displayed with a background image, which means that the listener evaluates 50 sentences in the whole test. The test speech corpus includes 5x8 short sentences from male professional actors (in Czech and Slovak) with an applied speech resynthesis method based on the source-filter speech model, and 5x8 short sentences from female professional actors
(in Czech) with an applied speech resynthesis method based on the harmonic speech model. In each evaluation the listeners can choose the type of emotion of the sentence from five possibilities: "Joy", "Anger", "Sadness", "Surprise", and "Neutral".

Table 3. Basic description of used sentences in the listening test
Sentence No | Text of utterance (language) | Voice | Duration [sec] *)
1 | He climbed up the tower. (SK) | Male | 1.82
2 | All he needed. (CZ) | Female | 1.91
3 | He woke the tailor up. (CZ) | Male | 1.58
4 | As none in the world. (CZ) | Female | 2.08
5 | He didn't know what would he say to it. (CZ) | Male | 2.24
6 | A handsome young man. (CZ) | Female | 2.43
7 | The doggie said nothing. (CZ) | Male | 1.22
8 | During the day she walked in the garden. (CZ) | Female | 2.95
9 | He sent his servant. (SK) | Male | 1.28
10 | Why he bought them. (CZ) | Female | 1.52
11 | And she fulfilled everything. (CZ) | Male | 1.23
12 | She went to eat to the castle. (CZ) | Female | 2.08
13 | He presented them with a small piece of barrens. (CZ) | Male | 2.05
14 | Ugly altogether. (CZ) | Female | 1.62
15 | Life and speech. (CZ) | Male | 1.60
16 | So they became engaged. (CZ) | Female | 1.62
*) Time duration of the original sentence in the neutral style; the length of the transformed emotions corresponds to the used emotional-to-neutral ratio – see Table 2.
The chosen ranges and the corresponding basic statistical values of the arousal/valence parameters of emotional pictures from the IAPS database used in the pre-selection step are presented in Table 4. The positions in arousal/valence space of the pictures used in the pre-selection step are shown in Fig. 3 (left); the positions of the finally used pictures are shown in Fig. 3 (right). The created picture test corpus consists of 65 background pictures selected randomly for every evaluated sentence, divided into five categories: "Negative", "Slightly negative", "Slightly positive", "Positive", and "Neutral" (every category contains 15 pictures – see Table 5).

Table 4. Chosen ranges and corresponding basic statistical values of arousal/valence parameters of emotional pictures from the IAPS database used in the pre-selection step
Emotional state | Valence range [-] | Valence mean (SD) [-] | Arousal range [-] | Arousal mean (SD) [-]
Anger | (1.0 ÷ 3.0) | 2.40 (1.03) | (6.0 ÷ 8.0) | 6.04 (0.89)
Disgust | (3.0 ÷ 4.5) | 3.50 (0.56) | (4.5 ÷ 6.5) | 5.73 (0.71)
Fear | (1.5 ÷ 3.5) | 2.97 (0.47) | (4.0 ÷ 6.5) | 5.72 (0.57)
Sadness | (2.0 ÷ 3.5) | 3.04 (0.76) | (3.0 ÷ 5.0) | 3.88 (0.74)
Neutral | (4.0 ÷ 6.0) | 5.14 (0.53) | (2.5 ÷ 4.5) | 3.45 (0.61)
Surprise | (4.5 ÷ 7.0) | 5.67 (0.95) | (4.5 ÷ 7.0) | 4.81 (0.86)
Joy | (7.0 ÷ 9.0) | 8.44 (0.38) | (4.5 ÷ 8.0) | 5.88 (0.86)
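The automatic pre-selection (step 2 above) essentially amounts to filtering the IAPS ratings by the ranges of Table 4. A minimal sketch is given below, assuming the ratings are available in a table with hypothetical columns 'picture', 'valence' and 'arousal'; it is not the program actually used by the authors.

```python
# Sketch of the automatic pre-selection step, using the ranges of Table 4.
import pandas as pd

RANGES = {  # emotion: ((valence_lo, valence_hi), (arousal_lo, arousal_hi))
    "Anger":    ((1.0, 3.0), (6.0, 8.0)),
    "Disgust":  ((3.0, 4.5), (4.5, 6.5)),
    "Fear":     ((1.5, 3.5), (4.0, 6.5)),
    "Sadness":  ((2.0, 3.5), (3.0, 5.0)),
    "Neutral":  ((4.0, 6.0), (2.5, 4.5)),
    "Surprise": ((4.5, 7.0), (4.5, 7.0)),
    "Joy":      ((7.0, 9.0), (4.5, 8.0)),
}

def preselect(iaps: pd.DataFrame) -> dict:
    """Return a dict mapping each basic emotion to the pre-selected picture numbers."""
    selected = {}
    for emotion, ((v_lo, v_hi), (a_lo, a_hi)) in RANGES.items():
        mask = (iaps["valence"].between(v_lo, v_hi)
                & iaps["arousal"].between(a_lo, a_hi))
        selected[emotion] = iaps.loc[mask, "picture"].tolist()
    return selected
```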
Fig. 3. 2-D diagrams of the distribution of the emotional coloured pictures in arousal/valence space: pictures used in the pre-selection step, classified by the chosen arousal and valence ranges into the basic emotional categories Anger, Disgust, Fear, Sadness, Neutral, Surprise, and Joy (left); pictures used after the final selection into the derived categories "Negative", "Slightly negative", "Neutral", "Slightly positive", and "Positive" (right)

Table 5. Finally selected pictures from the IAPS database used as background images in the listening test (with corresponding basic statistical values of arousal/valence parameters)
Emotion category | Picture numbers from the IAPS database | Valence mean (SD) | Arousal mean (SD)
Positive | 1463, 1920, 1999, 1340, 2530, 4616, 2550, 5480, 1710, 5202, 4614, 8120, 8461, 4628, 8540 | 7.41 (0.29) | 4.77 (0.66)
Slightly positive | 2034, 1595, 1942, 2055.2, 2208, 7195, 2392, 5199, 7472, 2352, 7660, 8205, 8600, 8620, 1659 | 6.45 (0.38) | 4.62 (0.52)
Neutral | 7211, 5471, 5500, 7001, 7003, 7017, 7021, 7033, 7052, 7512, 5535, 7300, 7009, 7161, 2980 | 5.27 (0.29) | 3.34 (0.50)
Slightly negative | 9000, 2455, 9280, 9295, 9342, 9600, 9902, 7520, 9002, 9010, 9395, 9471, 9621, 9220, 9330 | 2.87 (0.54) | 4.86 (0.85)
Negative | 1050, 1275, 1930, 1202, 2115, 6563, 9940, 2122, 8485, 9424, 5961, 9495, 2683, 2692, 1205 | 3.26 (0.84) | 5.97 (0.79)
From our previous experiments it follows that the listening test results depend on the sex of the listeners, and that there is no influence of the listener's nationality (this was confirmed for the Czech and Slovak languages [17], [18], [22]). Therefore, the obtained results are interpreted separately in three groups – male listeners, female listeners, and both together for comparison. During the test realization and in the process of
evaluation of partial results it turned out that a comparison over all five types of emotions and background images is too complicated and does not bring correct results. For this reason, the obtained values were cumulated into three basic categories: "Anger/Sadness", "Joy/Surprise", and "Neutral" (for speech), and "Negative", "Positive", and "Neutral" (for background images).
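This cumulation can be expressed as a simple relabelling applied to each recorded answer before the confusion matrices are recomputed; the sketch below is only an illustration of that mapping, with hypothetical field names, not the evaluation code actually used.

```python
# Mapping the five speech-emotion labels and the five background-image labels
# onto the three merged categories used in the analysis.
SPEECH_MERGE = {
    "Anger": "Anger/Sadness", "Sadness": "Anger/Sadness",
    "Joy": "Joy/Surprise", "Surprise": "Joy/Surprise",
    "Neutral": "Neutral",
}
IMAGE_MERGE = {
    "Negative": "Negative", "Slightly negative": "Negative",
    "Positive": "Positive", "Slightly positive": "Positive",
    "Neutral": "Neutral",
}

def merge_labels(answers):
    """answers: iterable of dicts with hypothetical keys
    'presented', 'evaluated', 'background'."""
    return [{
        "presented": SPEECH_MERGE[a["presented"]],
        "evaluated": SPEECH_MERGE[a["evaluated"]],
        "background": IMAGE_MERGE[a["background"]],
    } for a in answers]
```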
4 Results
Twenty-eight listeners (20 Czechs and 8 Slovaks, 9 women and 19 men) took part in our experimental listening test. Apart from the graphical comparison, the obtained results were compared numerically using the confusion matrix values. A total of 1400 sentences was evaluated over all types of background images, which means that 280 sentences were evaluated together with background images of each emotion category. From the whole test we finally obtained percentage evaluations in the form of six full confusion matrices: a basic one – for all types of background images – and a further five, one for every type of background image (i.e. negative, slightly negative, neutral, slightly positive, and positive), from which three confusion matrices for the merged categories (negative, neutral, positive) were consecutively created. The automated listening test program running on the server also generates a test protocol with time marks (user login time and logoff time after finishing the test [20]), hence we can determine the length of the performed test. The whole comparison of the listening test results was realized in five steps:
1. Comparison of the currently obtained results with neutral background images with the results acquired from our previous experiments on the evaluation of emotional speech conversion (cumulated values of male and female emotional voice conversion without classification by the listener's sex) – Tables 6 and 7.
2. Comparison of the cumulated results (emotional types of utterances divided into three basic categories) differentiated by the type of background image (two positive types, two negative types, and a neutral one), evaluated in dependence on the listeners' sex – see Figures 4, 5, 6, and the numerical results in Tables 8, 9, 10.
3. Analysis of the influence of background images on the correctness of the evaluated sentences, performed also on cumulated values of the sentences' emotion types and background image types, and differentiated by the listeners' sex (Fig. 7), with summary results in Table 11.
4. Detailed analysis of the evaluation of every sentence used in the test (for cumulated emotion categories, without classification by the listeners' sex), including determination of the best and the worst evaluated sentence – see Tables 12, 13, and 14.
5. Statistical evaluation of the time duration of the executed listening tests in dependence on the listener's sex (male, female, and both for comparison) – see Table 15.
Table 6. Confusion matrix of cumulated percentual values of male and female voice conversion listening tests performed in the year 2008 [17], [18]
Presented \ Evaluated | Anger | Sadness | Joy | Other *)
Anger | 67.08 % | 0.50 % | 8.17 % | 24.25 %
Sadness | 0.83 % | 87.92 % | 1.88 % | 9.38 %
Joy | 6.21 % | 4.41 % | 64.79 % | 24.59 %
*) Emotion Neutral was not evaluated in our previous experiment, where the category "Other" represented evaluations not falling into the basic three categories: Joy, Anger, Sadness.
Table 7. Originally obtained confusion matrix of the current listening test with selected values for neutral background images (summary results for both listener's sex categories)
Presented \ Evaluated | Anger | Sadness | Neutral | Surprise | Joy
Anger | 58.33 % | 9.26 % | 13.89 % | 5.56 % | 12.96 %
Sadness | 1.48 % | 81.85 % | 10.74 % | 1.11 % | 4.81 %
Neutral | 5.93 % | 15.93 % | 64.81 % | 5.56 % | 7.78 %
Surprise | 5.56 % | 3.70 % | 31.67 % | 38.52 % | 15.56 %
Joy | 13.33 % | 1.48 % | 15.74 % | 22.78 % | 46.67 %
Fig. 4. Graphical results of the listening test for joined emotional speech categories evaluated by male (left) and female (right) listeners − background image type: "Neutral"

Table 8. Derived confusion matrix for joined emotional speech categories with selected values of neutral background images (summary results)
Presented \ Evaluated | Anger/Sadness *) | Neutral | Joy/Surprise **)
Anger/Sadness | 75.24 % | 19.05 % | 5.71 %
Neutral | 23.91 % | 69.57 % | 6.52 %
Joy/Surprise | 17.65 % | 20.21 % | 62.14 %
*) Values for emotions Anger and Sadness were joined to one category for this comparison.
**) Values for emotions Joy and Surprise were joined to one category for this comparison.
Fig. 5. Graphical results of the listening test for joined emotional speech categories evaluated by male (left) and female (right) listeners − background image type "Negative" (including "Slightly negative")

Table 9. Derived confusion matrix for joined emotional speech categories with selected values of "Negative" (including "Slightly negative") background images (summary results)
Presented \ Evaluated | Anger/Sadness | Neutral | Joy/Surprise
Anger/Sadness | 80.35 % | 10.49 % | 9.16 %
Neutral | 31.93 % | 57.98 % | 10.08 %
Joy/Surprise | 20.00 % | 30.85 % | 49.15 %
Fig. 6. Graphical results of the listening test for joined emotional speech categories evaluated by male (left) and female (right) listeners − background image type: "Positive" (including "Slightly positive")

Table 10. Derived confusion matrix for joined emotional speech categories with selected values of "Positive" (including "Slightly positive") background images (summary results)
Presented \ Evaluated | Anger/Sadness | Neutral | Joy/Surprise
Anger/Sadness | 63.72 % | 13.40 % | 22.88 %
Neutral | 8.16 % | 65.73 % | 26.11 %
Joy/Surprise | 13.52 % | 16.00 % | 70.48 %
Fig. 7. Graphical results of the influence of background images on the correctness of speech evaluation of the joined emotions Angry/Sad, Neutral, Joyful/Surprised for the group of male (left) and female (right) listeners

Table 11. Summary results of influence of background images {"Negative" (including "Slightly negative"), Neutral, and "Positive" (including "Slightly positive")} on correctness of speech evaluation of joined emotions for both listeners' sex categories
Background images: Negative, Neutral, Positive
Correctness of speech emotion evaluation (Anger/Sadness, Neutral, Joy/Surprise): 57.14 %, 47.73 %, 80.35 %, 67.98 %, 61.72 %, 67.57 %, 56.15 %, 65.24 %, 77.48 %
Table 12. Detailed evaluation results of all used sentences for cumulated emotion speech styles (for both listeners' sex categories) − background image type "Neutral"
Sentence No | Anger/Sadness OK *) [%] | XC **) [%] | Neutral OK [%] | XC [%] | Joy/Surprise OK [%] | XC [%]
1 | 50.00 | 50.00 | 70.59 | 29.41 | 42.86 | 57.14
2 | 80.56 | 19.44 | 72.22 | 27.78 | 38.89 | 61.11
3 | 78.95 | 21.05 | 70.00 | 30.00 | 57.14 | 42.86
4 | 62.50 | 37.50 | 44.44 | 55.56 | 79.41 | 20.59
5 | 55.88 | 44.12 | 76.47 | 23.53 | 50.00 | 50.00
6 | 50.00 | 50.00 | 72.22 | 27.78 | 37.50 | 62.50
7 | 58.82 | 41.18 | 64.71 | 35.29 | 50.00 | 50.00
8 | 55.88 | 44.12 | 70.59 | 29.41 | 38.24 | 61.76
9 | 50.00 | 50.00 | 89.47 | 10.53 | 50.00 | 50.00
10 †) | 61.54 | 38.46 | 20.00 | 80.00 | 34.62 | 65.38
11 | 44.44 | 55.56 | 62.50 | 37.50 | 56.25 | 43.75
12 ‡) | 93.33 | 6.67 | 66.67 | 33.33 | 60.00 | 40.00
13 | 56.25 | 43.75 | 56.25 | 43.75 | 28.57 | 71.43
14 | 66.67 | 33.33 | 44.44 | 55.56 | 55.56 | 44.44
15 | 71.74 | 28.26 | 52.17 | 47.83 | 23.91 | 76.09
16 | 75.00 | 25.00 | 72.22 | 27.78 | 38.89 | 61.11
*) Summary result of correctly evaluated emotion style in percentage. **) Summary result of exchange in evaluation emotion style in percentage. ‡) The best evaluated sentence. †) The worst evaluated sentence.
Table 13. Detailed evaluation results of all used sentences for cumulated emotion speech styles (for both listeners' sex categories) − background image type "Negative" (including "Slightly negative")
Sentence No | Anger/Sadness OK *) [%] | XC **) [%] | Neutral OK [%] | XC [%] | Joy/Surprise OK [%] | XC [%]
1 | 53.85 | 46.15 | 87.50 | 12.50 | 30.77 | 69.23
2 | 93.75 | 6.25 | 57.14 | 42.86 | 7.69 | 92.31
3 | 85.71 | 14.29 | 35.71 | 64.29 | 70.59 | 29.41
4 | 37.50 | 62.50 | 12.50 | 87.50 | 77.78 | 22.22
5 | 57.14 | 42.86 | 75.00 | 25.00 | 18.75 | 81.25
6 | 52.63 | 47.37 | 42.86 | 57.14 | 20.00 | 80.00
7 | 75.00 | 25.00 | 61.54 | 38.46 | 38.46 | 61.54
8 | 61.54 | 38.46 | 50.00 | 50.00 | 23.08 | 76.92
9 ‡) | 52.63 | 47.37 | 100.00 | 0.00 | 46.15 | 53.85
10 †) | 61.54 | 38.46 | 0.00 | 100.00 | 12.50 | 87.50
11 | 43.75 | 56.25 | 37.50 | 62.50 | 33.33 | 66.67
12 | 100.00 | 0.00 | 50.00 | 50.00 | 25.00 | 75.00
13 | 46.67 | 53.33 | 57.14 | 42.86 | 10.00 | 90.00
14 | 75.00 | 25.00 | 45.45 | 54.55 | 41.18 | 58.82
15 | 76.47 | 23.53 | 45.45 | 54.55 | 27.78 | 72.22
16 | 78.95 | 21.05 | 66.67 | 33.33 | 27.27 | 72.73
*) Summary result of correctly evaluated emotion style in percentage. **) Summary result of exchange in evaluation emotion style in percentage. ‡) The best evaluated sentence. †) The worst evaluated sentence.
Table 14. Detailed evaluation results of all used sentences for cumulated emotion speech styles (for both listeners' sex categories) − background image type "Positive" (including "Slightly positive")
Sentence No | Anger/Sadness OK *) [%] | XC **) [%] | Neutral OK [%] | XC [%] | Joy/Surprise OK [%] | XC [%]
1 | 64.29 | 35.71 | 50.00 | 50.00 | 71.43 | 28.57
2 | 38.46 | 61.54 | 100.00 | 0.00 | 56.25 | 43.75
3 | 77.78 | 22.22 | 43.75 | 56.25 | 46.15 | 53.85
4 | 100.00 | 0.00 | 72.22 | 27.78 | 20.00 | 80.00
5 | 60.00 | 40.00 | 80.00 | 20.00 | 57.14 | 42.86
6 | 53.85 | 46.15 | 83.33 | 16.67 | 64.71 | 35.29
7 | 47.06 | 52.94 | 75.00 | 25.00 | 53.85 | 46.15
8 | 47.06 | 52.94 | 100.00 | 0.00 | 61.54 | 38.46
9 ‡) | 69.23 | 30.77 | 75.00 | 25.00 | 64.71 | 35.29
10 | 55.56 | 44.44 | 83.33 | 16.67 | 54.55 | 45.45
11 | 63.64 | 36.36 | 70.00 | 30.00 | 66.67 | 33.33
12 | 15.38 | 84.62 | 80.00 | 20.00 | 50.00 | 50.00
13 | 41.67 | 58.33 | 57.14 | 42.86 | 53.85 | 46.15
14 †) | 33.33 | 66.67 | 42.86 | 57.14 | 63.64 | 36.36
15 | 31.58 | 68.42 | 44.44 | 55.56 | 66.67 | 33.33
16 | 30.77 | 69.23 | 100.00 | 0.00 | 77.78 | 22.22
*) Summary result of correctly evaluated emotion style in percentage. **) Summary result of exchange in evaluation emotion style in percentage. ‡) The best evaluated sentence. †) The worst evaluated sentence.
Table 15. Results of statistical evaluation of the time duration of executed listening tests
Listeners' gender | Minimum [minutes] | Maximum [minutes] | Mean [minutes]
Male | 8 | 58 | 19.95
Female | 9 | 25 | 12.89
All | 8 | 58 | 17.50
5 Discussion and Conclusion
The motivation of our work was to find out whether emotional perception through hearing is influenced by perception through vision. We set a working hypothesis: if the valence-arousal model is valid for the audio as well as the visual representation of emotional states, then the perception of emotional speech might be changed by the simultaneous perception of an affective picture. If this hypothesis is correct, then listening tests evaluating emotional speech conversion might be biased by emotionally loaded visual information. Immediately after the listening test realization we asked some of the listeners about their experience of the performed test. From the listeners' feedback it follows that:
─ the evaluated sentences are sometimes too short for correct emotion recognition,
─ some sounds of emotions representing Joy and Surprise are very similar; these two emotions are often confused,
─ listeners often concentrate on the utterance evaluation and do not consciously perceive the background image,
─ manual playing of the speech examples decreases sensibility to the background images,
─ a combination of contrary picture and speech valence produced an impression of grotesqueness rather than a feeling of emotional neutrality.
The time duration of the test execution depends on many objective and subjective factors (including the listener's interest, the quality of the internet connection, etc.). From the statistical evaluation of the values obtained from the listening test protocols it follows that the mean time duration of the whole test was 17.5 minutes – see Table 15. For this time interval it is not a problem to stay concentrated on listening and also on visual perception, which means we can obtain convincing results from our test. Fig. 7 shows the influence of the background picture upon the correctness of emotion recognition in speech. After fusion into three categories of auditory information (positive emotions including joy and surprise, negative emotions including anger and sadness, and a neutral emotion) and of visual information (negative emotions including slightly negative, positive emotions including slightly positive, and a neutral emotion), the resulting bar graph consists of nine bars. For both groups of listeners (male/female) a positive trend can be observed for positive background pictures and a negative trend for negative background pictures. It means that speech is evaluated as more positive when positive pictures are used (see the white bars in Fig. 7) and negative pictures cause a more negative evaluation of speech (see the black bars in Fig. 7). In both cases it is an incorrect shift in the evaluation of the auditory information due to the visual information which is effective at the same time. For neutral background pictures this phenomenon is not observed (see the grey bars in Fig. 7). This conclusion is confirmed
also by the result values of the evaluation of the sentences: Table 13 (influence of negative background images) – the sentences with negative emotions are predominantly determined correctly (the column of OK values), but the positive sentences predominantly exhibit an exchange of the evaluated emotional type (the XC column); Table 14 (influence of positive background pictures) – the sentences with positive emotions mostly have the evaluation OK, while the negative sentences mainly have XC. According to our preliminary assumption, a picture with positive valence would increase the perceived valence of emotional speech, i.e. angry/sad speech would be perceived as neutral and neutral speech would be perceived as joyful. In the same manner, a picture with negative valence would decrease the perceived valence of emotional speech, i.e. joyous speech would be perceived as neutral and neutral speech would be perceived either as angry or as sad. However, the listening tests have not confirmed this presumption. The combination of a positive picture with negative speech, as well as the combination of a negative picture with positive speech, did not evoke a neutral overall emotional perception. From the performed listening test it follows that our first hypothesis was confirmed for both groups of listeners. The influence of background images was greater in the female group of listeners than in the male one, because males were more concentrated on speech and ignored the visual part of perception. In general it can be said that the use of emotive pictures decreases the recognition score of emotional speech evaluation during the listening test.
Acknowledgments. The work has been supported by the Grant Agency of the Czech Republic (GA102/09/0989), by the Grant Agency of the Slovak Academy of Sciences (VEGA 2/0090/11) and the Ministry of Education of the Slovak Republic (VEGA 1/0987/12).
References 1. Esposito, A.: The Amount of Information on Emotional States Conveyed by the Verbal and Nonverbal Channels: Some Perceptual Data. In: Stylianou, Y., Faundez-Zanuy, M., Esposito, A. (eds.) COST 277. LNCS, vol. 4391, pp. 249–268. Springer, Heidelberg (2007) 2. Esposito, A.: Affect in Multimodal Information. In: Tao, J.H., Tan, T.N. (eds.) Affective Information Processing, pp. 203–226. Springer, London (2009) 3. Robins, D.L., Hunyadi, E., Schutz, R.T.: Superior Temporal Activation in Response to Dynamic Audio-Visual Emotional Cues. Brain and Cognition 69, 269–278 (2009) 4. Strongman, K.T.: The Psychology of Emotion: From Everyday Life to Theory. John Wiley & Sons Ltd., Chichester (2003) 5. Tomkins, S.S.: Affect Imagery Consciousness: The Complete Edition. Springer Publishing Company, LLC, New York (2008) 6. Park, J.-Y., Gu, B.-M., Kang, D.-H., Shin, Y.-W., Choi, C.-H., Lee, J.-M., Kwon, J.S.: Integration of Cross-Modal Emotional Information in the Human Brain: An fMRI Study. Cortex 46, 161–169 (2010) 7. Jeong, J.-W., Diwadkar, V.A., Chugani, C.D., Sinsoongsud, P., Muzik, O., Behen, M.E., Chugani, H.T., Chugani, D.C.: Congruence of Happy and Sad Emotion in Music and Faces Modifies Cortical Audiovisual Activation. NeuroImage 54, 2973–2982 (2011)
8. Naranjo, C., Kornreich, C., Campanella, S., Noël, X., Vandriette, Y., Gillain, B., de Longueville, X., Delatte, B., Verbanck, P., Constant, E.: Major Depression Is Associated with Impaired Processing of Emotion in Music As Well As in Facial and Vocal Stimuli. Journal of Affective Disorders 128, 243–251 (2011) 9. Ekman, P.: Facial Expression and Emotion. American Psychologist 48, 384–392 (1993) 10. Scherer, K.R.: Appraisal Theory. In: Dalgleish, T., Power, M. (eds.) Handbook of Cognition and Emotion, pp. 637–663. John Wiley & Sons Ltd., Chichester (1999) 11. Watson, D., Clark, L.A., Tellegen, A.: Development and Validation of Brief Measures of Positive and Negative Affect: The PANAS Scales. Journal of Personality and Social Psychology 54, 1063–1070 (1988) 12. Bradley, M.M., Lang, P.J.: Motivation and Emotion. In: Cacioppo, J.T., Tassinary, L.G., Berntson, G.G. (eds.) Handbook of Psychophysiology, pp. 581–607. Cambridge University Press, New York (2007) 13. Zeng, Z., Pantic, M., Roisman, G.I., Huang, T.S.: A Survey of Affect Recognition Methods: Audio, Visual, and Spontaneous Expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 39–58 (2009) 14. Jin, X., Wang, Z.: An Emotion Space Model for Recognition of Emotions in Spoken Chinese. In: Tao, J., Tan, T., Picard, R.W. (eds.) ACII 2005. LNCS, vol. 3784, pp. 397–402. Springer, Heidelberg (2005) 15. Petrini, K., McAleer, P., Pollick, F.: Audiovisual Integration of Emotional Signals from Music Improvisation Does not Depend on Temporal Correspondence. Brain Research 1323, 139–148 (2010) 16. Esposito, A., Carbone, D., Riviello, M.T.: Visual Context Effects on the Perception of Musical Emotional Expressions. In: Fierrez, J., Ortega-Garcia, J., Esposito, A., Drygajlo, A., Faundez-Zanuy, M. (eds.) BioID MultiComm2009. LNCS, vol. 5707, pp. 73–80. Springer, Heidelberg (2009) 17. Přibilová, A., Přibil, J.: Spectrum Modification for Emotional Speech Synthesis. In: Esposito, A., Hussain, A., Marinaro, M., Martone, R. (eds.) COST Action 2102. LNCS (LNAI), vol. 5398, pp. 232–241. Springer, Heidelberg (2009) 18. Přibilová, A., Přibil, J.: Harmonic Model for Female Voice Emotional Synthesis. In: Fierrez, J., Ortega-Garcia, J., Esposito, A., Drygajlo, A., Faundez-Zanuy, M. (eds.) BioID MultiComm2009. LNCS, vol. 5707, pp. 41–48. Springer, Heidelberg (2009) 19. Scherer, K.R.: Vocal Communication of Emotion: A Review of Research Paradigms. Speech Communication 40, 227–256 (2003) 20. Přibil, J., Přibilová, A.: Distributed Listening Test Program for Synthetic Speech Evaluation. In: Proceedings of the 34 Jahrestagung für Akustik DAGA 2008, Dresden, Germany, pp. 241–242 (2008) 21. Lang, P.J., Bradley, M.M., Cuthbert, B.N.: International Affective Picture System (IAPS): Affective Ratings of Pictures and Instruction Manual. Technical Report A-8. University of Florida, Gainesville, FL (2008) 22. Přibil, J., Přibilová, A.: Evaluation of Voice Conversion in TTS System Based on Cepstral Description. In: Proceedings of the 18th Biennial International EURASIP Conference Biosignal, Brno, Czech republic, pp. 67–69 (2006)
Communicative Functions of Eye Closing Behaviours Laura Vincze and Isabella Poggi Education Sciences Department, University Roma Tre, 53 Manin, 00185 Rome, Italy [email protected], [email protected]
Abstract. This work presents a typology of eye closing behaviours based on a semantic taxonomy of communicative signals. The types of eye closing we investigate are blinks, eye-closures and winks performed during political debates. While other studies in the literature attempt to classify blinks and eye-closures according to their duration or the (in)completeness of the closure, our study aims to distinguish between communicative and non-communicative types of eye closing and, within the former category, between the different meanings possibly conveyed by closing one's eyes. We argue that while winks are always communicative, i.e. they bear a meaning, blinks and eye-closures may have a communicative value too. To analyse eye closing types, both an observational and a Speaker's judgement approach are adopted, and the exemplified items of eye closing are classified on the basis of a semantic taxonomy that distinguishes signals according to whether they convey information on the World, on the Sender's mind or on the Sender's Identity. Keywords: Eye-closure, blink, wink, gaze, facial expression, multimodal communication.
1 Gaze Studies
Numerous studies have been devoted to eye communication: gaze has been investigated in many of its social and communicative functions ([1]; [2]), mainly in connection with greeting and flirting behaviour ([3]; [4]), conversational manoeuvres like turn-taking ([5], [6]) and backchannel ([7], [8]). Eyebrows have also received attention: [9], [10] and [11] studied eyebrow behaviour as an emotional, syntactic and conversational signal. Other scholars ([12]; [13]) hypothesized the existence of a lexicon of gaze, according to which a specific meaning corresponds to each communicative gaze signal. Analyses of the behaviours of the eye region (eyebrows, eyelids, eyes, eye sockets) [13] found that gaze can produce a number of communicative signals and showed that they may be decomposed into minimal units, comparable to phonemes, distinctive features, or even morphemes of verbal languages, that, depending on how they combine, result in changes in meaning. Studies about specific aspects of gaze, like eyelid positions [14], highlighted the semantic richness of gaze, which can convey information even on the surrounding world (by pointing, or 'mimicking' some concrete or abstract qualities, like 'huge', 'subtle', 'difficult') and express sophisticated metadiscursive information (by a total eye closure, meaning that some topic can be passed over).
Eye-closing behaviour, especially blinking, has also attracted the interest of researchers in multimodality. Blinks have been studied in face to face interactions in relation to gaze direction before and after eye closure [15] or during cognitive tasks such as reading, memorizing or even lying [16; 17]. On the basis of previous research according to which subjects' blink rate tended to decrease during tasks requiring a higher cognitive load, [17] tested the hypothesis that also during lying, which is more cognitively demanding than truth telling [18; 19], subjects' blinks would decrease while a lie is told, followed by an increase in blinking rate immediately after. The experiments conducted confirmed the hypothesis: liars displayed a reduction in blink rate during the target period (i.e. during the lie-telling), followed by an increase in blink rate after the lie was told, an increase explained in terms of a compensatory effect. Similar results were obtained by [20] who, analysing suspects' blink rates in high-stake contexts such as police interviews, found a decrease in their eye-blink rate while lying; a result that contradicts the hypothesis that an increase in blinking rate is due to anxiety. The relationship between deception and blinking was also investigated by [16] who, with the aim of strengthening the theoretical bases of the Concealed Information Test (one of the main polygraph or 'lie-detector' tests used for the detection of deception), focused on the startle eye blink. By analyzing the subjects' physiological responses to crime-related questions as compared to those following incorrect control questions, the Authors assumed that the guilty subjects show stronger reactivity to the crime-related questions as opposed to the control ones. In line with the Concealed Information Test, [16] predicted enhanced physiological reactions (heart rate change, skin conductance, respiration line length and startle blink) to crime pictures in comparison to control pictures. Their hypotheses were only partly confirmed in that physiological reactions indeed increased in the entire body except for the startle blink, which decreased. This is in line with the results of [17], stating that when lying subjects tend to manifest a decrease in blink rate. A very plausible alternative explanation of the reduced blink rate when lying could be the one advanced by [15]: people tend to blink less not as an effect of high cognitive activity, but due to the necessity of keeping one's eyes as open as possible. When performing highly risky activities, such as lying, the need to always 'keep an eye' on the interlocutor and observe his reactions becomes more acute. The need for total focus on the activity performed is, according to [15], also the explanation of the reduced blink rate in surgeons while performing surgery. Confronted with a large literature concerned with the study of blinks in experimental settings where subjects were required to gaze at fixed targets, were involved in reading activities or performed surgery, [15] emphasizes the importance of studying blinks in face to face communication instead. Due to the enormous contribution of the entire body to communication, through emphasizing certain meanings conveyed verbally or adding new ones, we assume, in line with Cummins' considerations, that blinks "might bear a richer relationship to spoken communication than has previously been recognized" ([15]: 2). His study is an attempt to find patterns in blinking behaviour among subjects involved in conversational contexts.
Although blinking style is a highly independent behaviour, affirming, on the basis of his results, that "the most common blink is a short blink with unperturbed gaze while both participants are looking directly at each other" ([15]: 4) is not a hasty generalization.
2 Closing the Eyes: A Subset of Gaze Behaviours
In this paper we study the closing of the eyes: a particular subset of gaze behaviours, performed by partially or totally closing one or both eyes. Our hypothesis is that eye closings make up a small set of specific behaviours of the eye region, which may or may not have a communicative function. Here we focus on those performed during conversation, while specifying that their Sender, that is, the person who is performing an eye-closing, may be either the present Speaker or the Listener. Our aim is to describe eye-closings in terms of the parameters of their physical production, to distinguish their communicative from their non-communicative uses, and to classify the communicative ones in terms of a semantic taxonomy, singling out the specific meanings they convey.
3 Method
To collect and analyze cases of eye closing, we conducted an observational study and a qualitative analysis on a corpus of political debates, through the annotation scheme (Table 1) illustrated below. Our corpus includes six debates and an interview. The debates, of roughly 40 minutes each, are taken from "Canal 9", a corpus of debates held between 2004 and 2006 at Canal 9, a TV emitter in the Canton Valais, collected by the IDIAP Research Institute of Martigny (Switzerland) and publicly available on the web portal of the SSPNet (Social Signal Processing Network of Excellence, http://sspnet.eu/) ([21]). The last item of our corpus is a pre-electoral interview, held in May 2007 in the studios of France 2, whose guest was the Socialist party's representative Ségolène Royal, Nicolas Sarkozy's counter-candidate in the French presidential elections. To analyze eye behaviours during the debates we adopted an annotation scheme (see Table 1), where each item is described and classified both on the side of the behaviour and on that of its possible meaning. In column 1 we write the time in the video of the behaviour under analysis; columns 2 and 3 contain a description, respectively, of the verbal and the nonverbal behaviour; and col. 4 the practical goal or the communicative goal (meaning) of the behaviour in column 3. If the behaviour is attributed a communicative goal – a meaning – this is phrased as a sentence in the first person. Further, since a communicative action, besides its direct goal, may aim at one or more "communicative supergoals", that is, it may indirectly convey some information to be inferred by the Addressee – other goals for which the direct goal is a means – in col. 5 we write the possible supergoal of the actions in column 3. The meanings were attributed to body signals in some cases relying on the findings of previous studies ([22]; [23]), in others through linguistic intuition, also taking context into account. All items were coded by two expert coders who reached an agreement by discussing their interpretations, and these were later confirmed by 30 naive subjects. Finally, in column 6 we classify the goal of column 4 (or the supergoal written in column 5, when there is one) in terms of a taxonomy of mutually exclusive categories of meanings (see Sect. 6).
Table 1. Annotation Scheme

1). 13.01 – Chevrier (Speaker)
2. Speech: Il n'y a aucune volonté à démanteler (There is no will to dismantle)
3. Action: Head: shakes head. Gaze: eye-closure, presses upper eyelid against the lower one
4. Goal/Meaning: Totality. Information on the Sender's Mind: "I am categorical."
5. Supergoal: –
6. Type of Eye-closing: Certainty

2). 51.32 – Ségolène Royal (Speaker)
2. Speech: […] alors qu'il y a tellement d'abus de l'autre côté parmi les amis du pouvoir ([…] when there are so many abuses on the other side among the friends of power)
3. Action: Gaze: frowns, squints eyes, right eye winks; looks down while saying "on the other side, among the friends of power"
4. Goal/Meaning: "I disprove of this." "I am angry." "I am your confederate." "I refer to the opponent's party and I locate it here."
5. Supergoal: Negative evaluation of the opponent. "I want you to understand what I am not saying explicitly." → "I warn you that something wrong is going on there: Sarkozy favours his friends."
6. Type of Eye-closing: Information on the Sender's Emotions; Information on the Sender's Goals; Information on the World
Table 1 contains the detailed analysis of one item of communicative eye-closure and one of communicative wink. In the first instance the sender of the signal, Mr. Chevrier, a politician involved in a debate about whether or not to reform the Disability Insurance, while trying to assure the audience that there is no intention whatsoever to dismantle the Insurance, shakes his head and presses the upper eyelids against the lower ones; his head shake is an intensifier of totality [24], meaning in this case the total absence of dismantling intentions, while the eye closing with pressed eyelids means that the Sender is being categorical in stating this: so it is classified, in col. 6, as providing information about the Sender's certainty about what he is saying. The second item is an instance of wink performed by Ségolène Royal, the left-wing candidate running for President of France. While speaking, in the absence of the counter-candidate Nicolas Sarkozy, about the difficulties poor people are subject to, and about rich people's abuses, Royal frowns, squints her eyes in a disagreeing grimace and performs a wink with the right eye. This may be interpreted as an allusive warning: she wants to attract the electors' attention to the fact that Sarkozy favours rich people, without uttering his name explicitly. Such analysis of a corpus is of course a useful method for gaining an overview of a category of signals. Yet, since our ambition was to have a comprehensive list of possible eye-closings, and some of them might not be present in our corpus, we also relied on self-generated examples, using the method of the Speaker's judgements:
by thinking of a given body behaviour (including a closing of the eyes) and imagining it as used in a specific context, whether or not you have found it in your corpus, you ask yourself: how it is physically produced, and how it differs from other possibly similar behaviours; whether it is simply a practical action or a communicative behaviour (a signal conveying some meaning); what it means; whether it is ambiguous and, if so, what its multiple meanings are; whether its meaning is acceptable in one context or another; how it could be paraphrased in a verbal language; and what other behaviours in the same or other modalities may be synonyms of it. This method complements the observation of actual cases because, thanks to our communicative intuitions (similar to Chomsky's linguistic intuitions [25]), we can also judge cases that are not actual but only possible, hence reaching a more complete list of cases in a category of signals. Since our focus is on qualitative rather than quantitative research, the very existence of a single item of any category is sufficient to prove the validity of the taxonomy. Further research will concentrate on identifying more items belonging to the same category.
4 Three Types of Eye Closing: Parameters and Values Based on our analysis of the corpus above, and on self-generated examples, we singled out three types of eye closing – blink, eye-closure and wink – that can be described in terms of a set of parameters and their values. To be relevant, a parameter should allow us to differentiate at least two items of behaviour that are otherwise similar in all respects. To identify the parameters and values distinguishing different types of eye closing, we started from those proposed to analyze the movement of gestures ([27], [13], [28]) – amplitude, velocity, tension, duration – assuming they can be applied to any body movement, including those of the head [24] and of the eye region. For eye closings, the following parameters and values seem to be relevant: eyelid tension (tense, default, relaxed); velocity (fast, default, slow); duration (long, default, brief); repetition (0, 2, n); laterality of the eye closing (left, right, both). In terms of these parameters we can distinguish three different types of eye closing – blink, wink and eye-closure – and, within each, communicative vs. non-communicative ones. By blink we mean, following [30], a quick closing of the eyes and return to eyes open; by eye-closure, an eye closing longer than a blink, sometimes further characterized by a higher tension in the eyelids; while by wink we refer to a unilateral lowering of the upper lid. All three signals share a common feature, complete closing of the eye(s), but they differ in at least four major features: repetition, duration, tension and laterality of the closing. The parameter repetition distinguishes blinks from the other types of eye closing, since blinks are the only repeated ones. Duration makes the difference between blinks and eye-closures: blinks are brief, while eye-closures are longer than a blink. Duration, on the contrary, has no influence on winks: even if we vary this parameter – which in blinks and eye-closures leads to crucial differences in the signal – the wink remains the same. In their facial coding system, [30] advise coding a wink longer than 2 seconds as a unilateral eye-closure, but in our view this does not affect either the signal or the meaning: a wink different in duration is still a wink.
Another characterizing feature of the wink is, according to the above authors, hesitation during closure, even though the closure may be very brief. Hesitation in winks cannot be considered a parameter (while this is the case, for instance, with unilaterality) because the difference between a hesitant and a non-hesitant wink is not relevant. While the unilateral closure is intentional, hence communicative, the hesitation in closure is not: it may simply be due to a lack of training in unilateral closings (as they are considerably less frequent than blinks). Thus the only distinctive parameter applying to winks is laterality: only winks are performed unilaterally, by either the right or the left eye, while blinks and eye-closures are bilateral. Finally, tension may be a characterizing feature of both eye-closures and winks, but definitely not of blinks, as the parameter of tension is connected to duration. By definition a blink – a quick eye closing and return to eyes open – is so fast that it cannot involve tension. If one has the time to press the upper eyelid against the lower one, it is not a blink anymore, but an eye-closure. So any bilateral eye closing that is long and tense is an eye-closure. As we shall see later, the parameters of tension and duration play an important role in conveying the meaning of being categorical in what we are stating, therefore intensifying the degree of certainty in what we say or hear.
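To make the distinctions of this section easier to survey, the sketch below restates them as schematic decision rules over the parameters just discussed (laterality, duration, repetition, tension); it is only a summary device of ours, not an annotation tool used in the study.

```python
def classify_eye_closing(duration: str, unilateral: bool, repeated: bool, tense: bool) -> str:
    """Schematic decision rules for the three types of eye closing discussed above.

    duration:   'brief', 'default' or 'long'
    unilateral: True if only one eye is closed
    repeated:   True if the closing occurs in a rapid sequence
    tense:      True if the upper eyelid is pressed against the lower one
    """
    if unilateral:
        # Only winks are unilateral; varying duration does not change the signal.
        return "wink"
    if duration == "brief" and not tense:
        # A quick bilateral closing and return to eyes open. Repetition suggests a
        # communicative rather than a purely physiological blink (necessary, not sufficient).
        return "communicative blink (candidate)" if repeated else "blink (possibly physiological)"
    # Any bilateral eye closing that is long and/or tense counts as an eye-closure.
    return "eye-closure"

print(classify_eye_closing(duration="long", unilateral=False, repeated=False, tense=True))  # eye-closure
```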
5 Communicative and Non-communicative Eye-Closings So far we have focused on the signal side of eye-closings, characterizing how they are produced. Now we come to their function. First, still considering their parameters and values, we distinguish communicative from non-communicative eye closings. By communicative we mean that a certain morphological feature or behaviour is exhibited with a goal (which may be a conscious intention, but also an unconscious impulse or a biological function [13]) of conveying information. After presenting some types of non-communicative eye-closings, we outline for the communicative eye closings a semantic typology that distinguishes blinks, winks and eye-closures in terms of their specific meanings. 5.1 Communicative vs. Non-communicative Blinks Some of the parameters above allow communicative types of eye closing to be distinguished from non-communicative ones. In this case a very relevant parameter is repetition: a physiological blink, i.e. one simply aimed at keeping the standard humidity of the eye, is single, while a communicative blink is generally faster and repeated. Repetition is, in general, a necessary condition for considering a blink communicative; yet it is not a sufficient condition, since, due to idiosyncratic differences, some people tend to blink more frequently than others. There are three types of non-communicative blinks ([26]):
1. the "physiological" blink, which merely fulfils the physiological need of keeping a standard level of eye humidity.
2. the blink of a stuttering person. When a person has a problem pronouncing a word, he may blink twice when engaging in the production of that word, while repeating its first syllable. This type of blink is not communicative, even if it somehow "helps" the stutterer to communicate.
3. the startle blink [31]. Supposing the startle is real and not acted, this type of blink, even though repeated, is not communicative, in that the Sender does not want to communicate his startle reaction to others.
5.2 Communicative vs. Non-communicative Eye-Closures Two more possibly relevant parameters for distinguishing between communicative and non-communicative eye-closures are duration and eyelid pressure: a communicative eye-closure is longer than a blink but considerably shorter than one made while going to sleep; further, in a communicative eye-closure the upper eyelids are generally pressed against the lower ones. And when it is used to convey emphasis, the eyebrows may be raised as well, causing a tightening of the upper eyelid. Pressure may distinguish communicative from non-communicative eye-closures also because in the latter (e.g., in sleeping) the eyelids are lax, while in the former they are possibly tense. The context is also relevant: in a debate it is much less likely (if not impossible) for a non-communicative eye-closure to appear, while in a relaxed, familiar situation this may sometimes occur. [26] distinguish three cases of non-communicative eye-closures:
1. the one occurring while sleeping (which obviously cannot be found in a debate, at least on the part of the debaters, while it might occur in a bored spectator).
2. eye-closure while laughing. Sometimes, while laughing, one closes one's eyes for a longer duration than a blink.
3. eyes closed while thinking. While concentrating we often close our eyes for a few seconds, to isolate ourselves from the surrounding space: this is the cut-off, a type of eye-closure which can transmit information on the cognitive processes of the Sender [32]. This eye behaviour is not strictly communicative, in that it can be displayed exclusively to help the process of thought. Although, by seeing us close our eyes, our interlocutor can infer that we are thinking, this does not mean that we intended to communicate this to him. If instead we choose to display our eye closing in order to let the other know that we are concentrating (and maybe do not want to be disturbed), this is indeed a communicative eye-closure.
5.3 Winks: Only Communicative While blinks and eye-closures can be either communicative or non-communicative, winks are probably always communicative: due to the unnatural unilaterality of the wink, intentionality seems necessary to its performance. As a wink is unilateral, it attracts our attention through its discontinuity as compared to blinks. When closing only one eye instead of two, the Sender of the wink intentionally chooses to send a visual signal to the Addressee. This signal is highly intentional, aimed at attracting attention, but not anybody's attention: it is directed to a particular person, the Addressee, and only to him; sometimes one performs the wink only after making sure that no one else (or at least not the one who is not supposed to understand) sees it. That is why we might say that winks are, at the same time, both overt and covert, open and hidden communicative signals: overt because they are highly peculiar and therefore aimed at breaking the continuity of blinks; and covert, hidden, because furtive, performed only while the others are not looking. Winks
therefore express a sort of complicity with the Addressee, while alluding in a non-conspicuous manner (i.e., not so noticeable by bystanders) to something that should be the object of attention only for Sender and Addressee.
6 Communicative Eye Closings: A Semantic Typology Communicative eye-closings can be grouped on the basis of their meaning. According to [13] and [33], any communicative signal – words, sentences, prosody, gestures, and therefore also an eye closing – conveys one of three basic kinds of information: about the World, the Sender's Identity, or the Sender's Mind. Information on the World concerns the concrete and abstract entities and events of the world outside the speaker (objects, persons, organisms, events, their place and time); Information on the Speaker's Identity concerns his/her age, sex, personality and cultural roots; while Information on the Speaker's Mind concerns the Speaker's mental states: his/her goals, beliefs and emotions relating to the ongoing discourse. Let us see the types of information borne by eye closings. 6.1 Eye Closings Informing on the Sender's Identity Information about the Sender's Identity concerns the age, sex, personality and cultural roots of the person making an eye-closing. In the debate "Disability Insurance", Mr. Richoz, a person representing the blind, and himself affected by a degenerative blinding disease, while assuring the opponent (and the audience) about the efforts of disabled people to obtain a qualification, performs a frown and an eye-closure, which might be paraphrased as "I am concentrated in this effort". Richoz's eye-closure somehow mimics the disabled's determination to do their best, therefore informing on the disabled persons' identity. Considering that he himself is part of the same category of people, and himself attended training classes to obtain a qualification, we can say that his eye behaviour conveys information on his own identity. 6.2 Eye Closings and the Sender's Mind Among the signals bearing information on the Sender's Mind, [33] distinguishes Belief Markers, Goal Markers and Emotion Markers. Belief Markers inform on the Sender's degree of certainty, or other ongoing cognitive processes, regarding the message being delivered; Goal Markers inform on the goals (the performatives) of one's sentences or on the structure of the message; while Emotion Markers convey the emotions being felt during or regarding the delivered discourse. 6.2.1 Belief Markers. The degree of certainty one attributes to the beliefs mentioned in the ongoing discourse can be conveyed verbally, by verbal markers such as absolutely, probably or possibly, or by the morphology of the conditional mood or evidential verbs, but also through gestures and eye behaviour. Among eye-closings, through both rapid repeated blinks (whether accompanied by nods or not) and eye-closure the interlocutor can confirm what the present speaker is saying, thus manifesting one's certainty that
the speaker's statements are correct. In comparison to the blink, the eye-closure adds to the meaning of 'yes' an element of "categorical", i.e. a higher level of certainty and possibly of commitment to what one is saying: this longer closure of the eyelids can be paraphrased as "Absolutely, I am completely certain about this". In a previous paper, [29] proposed a classification of nods on the basis of the meanings they convey. From the analysis of our corpus we can state that the eye-closure (especially if long in duration and with high tension on the lower eyelid), if performed while nodding or while shaking the head, conveys a higher degree of conviction than nodding/head shaking alone. When accompanied by a nod or a head shake, the eye-closure can therefore be seen as an intensifier of the degree of the Sender's conviction in what he is saying or hearing: as in the example of Mr. Chevrier above (see Table 1) who, by his head shake accompanied by a pressed eye-closure, intensifies the absolute lack of dismantling intentions. But the meaning "categorical" can also be added by the eye-closure to a nod, as in the following example. The Sender of the nod accompanied by eye-closure is the Listener, Mr. Richoz, who, when hearing Mr. Delessert's evaluation of the disabled's misfortune, shows his total agreement with him by performing high-amplitude nods and a very tense eye-closure with eyebrows raised. 6.2.2 Emotion Markers. Among emotion markers in gaze, i.e. gaze items informing about the Sender's emotions, typical ones are those of surprise, either really felt or only acted, and acted desperation. Besides the typical expression of raised eyebrows and wide open eyes [31], surprise (either only acted, or actually felt at a certain moment in time and now re-expressed, therefore mimicked) can also be conveyed by rapid repeated blinks. In a debate of our corpus, the vice-mayor Feferler speaks of the surprise felt by himself and other town hall workers when a questionnaire about their previous political activity was presented to the inhabitants of Valais right before the elections. While pronouncing the word surprise, he makes a series of rapid repeated blinks accompanied by raised eyebrows, as if mimicking the surprise he felt at the particular moment when the questionnaire came out. 6.2.3 Goal Markers. Goal Markers are signals in which Sender and Speaker are necessarily the same person, since they inform about the Sender's goals concerning the discourse he is delivering. Important subtypes of these goals – adequately conveyed also by some types of eye closings – are the following:
1. meta-sentence goals: they include the goals of signalling the beginning or the end of a sentence or phrase (syntactic goals), and of marking the part of the sentence that constitutes the comment – the new and important information (emphasis);
2. meta-discursive goals: they mark the part of discourse that, within the structure of his discourse, the Speaker considers important or, on the contrary, not important, so much so that it could be passed over or left out;
3. performative markers, which inform about sentence goals: signals that make clear the specific performative – the communicative intention – of a sentence or other communicative act.
A case of eye-closure with the syntactic function of marking the start of a sentence is exploited in a case of misspeaking and self-correction. Mr. Feferler is talking about a
decision made by the General Council. While quoting the numbers of votes in favour, against and abstaining, respectively, he has a moment of confusion and makes a mistake; so, as he realizes he has said "one abstention" instead of "one against", while restarting in order to correct himself, he performs a rapid eye-closure with raised eyebrows and a violent nod. The meaning of his nonverbal behaviour is "I correct myself and I start all over again"; and he starts to enumerate the results of the voting once more. The eye-closure in this case functions as a demarcation signal of the point at which the Speaker stops and starts all over again. Among meta-sentence and meta-discursive markers, some signal the main concepts of one's discourse. One may emphasize the comment of one's sentence by batons and eyebrow raisings, but also by a sudden widening of the eyelid aperture or rapid repeated blinks ([26]). Rapid repeated blinks, sometimes accompanied by raised eyebrows, can be used as a punctuation mark during speech: the Speaker can perform a sequence of several quick blinks while talking about an important concept, thereby signalling that something important has just been stated, and attracting the Interlocutor's attention to it. On the opposite side stand cases in which, by an eye-closure, the Speaker does not have the goal of emphasizing a concept, but rather of passing it over, implying it is not essential in the structure of his present discourse. During the debate on Héliski (a service carrying skiers up the mountains by helicopter, strongly opposed by ecologists), while speaking about the number of flights made for Héliski, Mr. Pouget, a helicopter pilot, mentions that their number is not that important, and that this issue could be dealt with later. While saying n'est pas si important que ça (it is not that important), Pouget performs a slow eye-closure, meaning "I am skipping this part, as I don't consider it important for the present discussion". Finally, an example of a performative eye-closing is the wink. This closing of one eye conveys the Sender's goal of addressing a specific and unique Addressee, one with whom the Sender feels associated, who shares his same interests and goals. The Sender of the wink also has the goal of performing a signal concealed from everyone else, a signal not perceived by the other 'camp': that is where the wink's furtive, allusive character comes from. In fact, the wink may be paraphrased as: "I want to communicate about this but only to you, hence I want to do so in a furtive, allusive, covert way". In this sense, the wink is always a signal of complicity: the Sender of the wink wants to convey his affiliation to a group, sometimes a very restricted group, which for some reason must conceal its goals from others, and to which the Sender and the Addressee belong, while the person against whom the wink is directed is excluded from it. This is how the dual nature of the wink can be explained: the wink is at the same time an overt and a covert signal of both inclusion and exclusion, of affinity between Sender and Addressee and of difference from others. Two types of wink can be distinguished: the playful complicity wink and the warning wink. Both share a feature of complicity, but an important difference is that the playful complicity wink – possibly accompanied by a smile – can be displayed to the 'enemy' as well, thus meta-communicating that the Sender is not being serious, but kidding.
In the warning wink, on the contrary, the Sender wants to exclude the ‘enemy’, the person against whom the wink is directed, since her knowing of the Sender and Addressee’s complicity might imply some danger: therefore, the signal
must be perceived only by the confederate and concealed from the 'enemy', to prevent him from interfering with the Sender's and Addressee's goals. If the 'enemy' sees it too, all the Sender's efforts to warn the Addressee are useless. In sum, while the warning wink is directed to the Addressee, the playful wink is only indirectly addressed to him: it is actually addressed to the 'excluded' person. In a playful interaction, if one wants to let the other know that he is excluded from the group, one may have the ultimate goal of attracting him back into the group, after having pointed out his failure to the other members of the group. The wink in this case is not properly exclusive, but has an inclusive function. Warning winks can also sometimes be overt, as in the example of Ségolène Royal illustrated above. She is being interviewed in a broadcast TV interview that her opponent, i.e. the 'enemy' she wants to warn the public against, is certainly watching. In such a case, besides the warning intention directed to the audience of electors, there is, we argue, precisely the goal of communicating to the 'enemy' one's affiliation with the electors – who do not have rich friends – and his exclusion from it. 6.3 Eye Closings and Information on the World The third category of our semantic taxonomy [13] is Information on the World. Although so far in our corpus we have not found cases of eye closings conveying this kind of information (the example of Royal in Table 1 concerns a case of eye direction, not of eye closing), through our Speaker's judgements method we can hypothesize cases in which one could, through blinks, express information about the World. This happens, for instance, when one imitates other people's characteristics or behaviours. For example, if we were to mimic a snobbish person, we would very likely decide to make a series of repeated blinks with tightened eyelids, accompanied by raised eyebrows and a lifted chin. When wanting to describe the nonverbal behaviour of a person engaged in a seduction attempt, we would probably again perform a series of quick repeated blinks, but interrupted by brief pauses and glances at the object of desire, just to check whether s/he is gazing back as well. Of course, all this eye behaviour is driven to the extreme and could only occur in mimicking or caricaturing contexts, but it nonetheless conveys (stereotypical) information on the people around us, therefore information on the World.
7 Conclusion In this paper we presented a study on eye closing behaviours: we classified them according to whether or not they are communicative and, for the communicative cases, we illustrated a taxonomy of their meanings. The paper represents a new step toward the construction of a lexicon of gaze, and more generally toward a finer-grained and more complete picture of the sophisticated nuances of multimodal communication. Acknowledgments. This research is supported by the Seventh Framework Program, European Network of Excellence SSPNet (Social Signal Processing Network), Grant Agreement Number 231287. We are indebted to the anonymous referees for their useful suggestions, which allowed a better phrasing of our work.
References 1. Kendon, A., Cook, M.: The consistency of gaze pattern in social interaction. British Journal of Psychology 60, 48–94 (1969) 2. Argyle, M., Cook, M.: Gaze and mutual gaze. Cambridge University Press, Cambridge (1976) 3. Kendon, A.: A description of some human greetings. In: Michael, R., Crook, J. (eds.) Comparative Ethology and Behaviour of Primates, pp. 591–668. Academic Press, New York (1973) 4. Eibl-Eibesfeldt, I.: Similarities and differences between cultures in expressive movements. In: Hinde, R. (ed.) Nonverbal Communication, pp. 297–314. Cambridge Univ. Press, Cambridge (1972) 5. Duncan, S.: Some signals and rules for taking speaking turns in conversations. In: Weitz, S. (ed.) Nonverbal Communication, Oxford University Press, Oxford (1974) 6. Goodwin, C.: Conversational organization. Interaction between speakers and hearers. Academic Press, NY (1991) 7. Heylen, D.: A closer look at gaze. In: Proceedings of the 4th International Joint Conference on Autonomous Agents and Multimodal Agent Systems 2005 (2005) 8. Maatman, R., Gratch, J., Marsella, S.: Natural behaviour of a listening agent. In: Proceedings of the 5th International Conference on Interactive Virtual Agents, Kos, Greece (2005) 9. Ekman, P.: About brows: Emotional and conversational signals. In: von Cranach, M., Foppa, K., Lepenies, W., Ploog, D. (eds.) Human Ethology: Claims and Limits of a New Discipline: Contributions to the Colloquium, pp. 169–248. Cambridge University Press, Cambridge (1979) 10. Pelachaud, C., Prevost, S.: Sight and sound: Generating facial expressions and spoken intonation from context. In: Proceedings of the 2nd ESCA/AAAI/IEEE Workshop on Speech Synthesis, New Paltz, New York, pp. 216–219 (1994) 11. Costa, M., Ricci Bitti, P.E.: Il chiasso delle sopracciglia. Psicologia Contemporanea 176, 38–47 (2003) 12. Kreidlin, G.E.: Neverbal’naia semiotika: Iazyk tela i estestvennyi iazyk. Novoe literaturnoe obozrenie, Moskva (2002) 13. Poggi, I.: Mind, Hands, Face and Body. A goal and belief view of multimodal communication. Weidler Buchverlag (2007) 14. Poggi, I.: Mind markers. In: Rector, M., Poggi, I., Trigo, N. (eds.) Gestures. Meaning and Use. University Fernando Pessoa Press, Oporto (2002) 15. Cummins, F.: Blinking in Face to Face Communication. In: Proceedings of the 21st National Conference on Artificial Intelligence and Cognitive Science (AISC), NUI Galway, IE, pp. 74–83 (2010) 16. Verschuere, B., Crombez, G., Koster, E.H.W., Bockstaele van, B., De Clerq, A.: Startling secrets: Startle eye blink modulation by concealed crime information. Biological Psychology 76, 52–60 (2007) 17. Leal, S., Vrij, A.: Blinking during and after lying. Journal of Nonverbal Behaviour 32, 187–194 (2008) 18. DePaulo, B.M., Kirkendol, S.E.: The motivational impairment effect in the communication of deception. In: Yuille, J.C. (ed.) Credibility Assessment, pp. 51–70. Kluwer, Dordrecht (2003)
19. Zuckerman, M., DePaulo, B.M., Rosenthal, R.: Verbal and nonverbal communication of deception. In: Berkowitz, L. (ed.) Advances in Experimental Social Psychology, vol. 14, pp. 1–57. Academic Press, New York (1981) 20. Mann, S., Vrij, A., Bull, R.: Suspects, lies and videotape: An analysis of authentic highstakes liars. Law and Human Behaviour 26, 365–376 (2002) 21. Vinciarelli, A., Favre, S., Salamin, H., Dielmann, A.: Canal 9: A Database of Political Debates for Analysis of Social Interactions. In: Proceedings of the IEEE SSP Workshop (2009) 22. Poggi, I., D’Errico, F., Spagnolo, A.: The Embodied Morphemes of Gaze. In: Kopp, S., Wachsmuth, I. (eds.) GW 2009. LNCS (LNAI), vol. 5934, pp. 34–46. Springer, Heidelberg (2010) 23. Poggi, I., Roberto, E.: The eyes and the eyelids. A compositional view about the meanings of Gaze. In: Ahlsén, E., Henrichsen, P.J., Hirsch, R., Nivre, J., Abelin, A., Stroemqvist, S., Nicholson, S. (eds.) Communicaion – Action – Meaning. A festschrift to Jens Allwood. Dpt. Linguistics, pp. 333–350. Goteborg University (2007) 24. McClave, E.: Linguistic functions of head movements in the context of speech. Journal of Pragmatics 32, 855–878 (2000) 25. Chomsky, N.: Aspects of the theory of syntax. MIT Press, Cambridge (1965) 26. Vincze, L., Poggi, I.: Close your eyes and communicate. In: Proceedings of Giornata di Studi: Teorie e trascrizione – Trascrizione e teoria, Bolzano (December 2009) (in press) 27. Hartmann, B., Mancini, M., Pelachaud, C.: Formational Parameters and Adaptive Prototype Instantiation for MPEG-4 Compliant Gesture Synthesis. In: Computer Animation 2002, pp. 111–119 (2002) 28. Poggi, I., Pelachaud, C.: Persuasive gestures and the expressivity of ECAs. In: Wachsmuth, I., Lenzen, M., Knoblich, G. (eds.) Embodied Communication in Humans and Machines. Oxford University Press, Oxford (2008) 29. Poggi, I., D’Errico, F., Vincze, L.: Types of Nods. The polysemy of a social signal. In: Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC), Malta, May 19-21 (2010) 30. Ekman, P., Friesen, W., Hager, J.: Facial Action Coding System. The Manual. Research Nexus division of Network Information Research Corporation, USA (2002) 31. Ekman, P., Friesen, W.: Giù la maschera. Giunti ed. (2007) 32. Morris, D.: Manwatching. Jonathan Cape, London (1977) 33. Poggi, I.: Mind markers. In: Rector, M., Poggi, I., Trigo, N. (eds.) Gestures. Meaning and Use. University Fernando Pessoa Press, Oporto (2002)
Deception Cues in Political Speeches: Verbal and Non-verbal Traits of Prevarication
Nicla Rossini
I.I.A.S.S. – Istituto Internazionale per gli Alti Studi Scientifici, "Eduardo R. Caianiello", Vietri sul Mare, Italy
[email protected]
Abstract. Deception is a determinant social phenomenon already observed extensively in the literature of several different research fields. This study presents the analysis of both micro-expressions and voice features in sample TV clips, in order to outline a defined research agenda on the topic. Keywords: Human behaviour, Deception cues, Expression recognition, Prosodic cues, Micro-expression cues, research agenda.
1 Introduction Deception is a determinant phenomenon in human social interaction that has been outlined and studied within different research fields (see e.g. [1, 2]), with diverse conclusions and suggestions for further enquiry, and with special focus on the possibility of isolating cues to deception and prevarication in interaction [3]. Ekman and Friesen [4], for instance, distinguish between deception cues on the one hand and leakage cues on the other: the first phenomenon consists in micro-expressions, bodily cues, and linguistic features revealing an attempt at deception and prevarication by the speaker, while the second phenomenon takes place whenever the speaker makes a social effort to mask spontaneous emotions and thoughts. Because this attempt is usually not completely successful, ambiguous signals are usually conveyed, such as twofold expressions [3] or inconsistent gesture-speech matching [5]. Deception and emotions have in fact been investigated from multifold perspectives, including the psychological [1], sociological [6], and computational [7-9] ones. Zuckerman et al. [10], for instance, put emphasis on the cognitive effort placed on the speaker who is involved in an attempt to lie to or deceive his audience: because organizing consistent and coherent spoken and behavioural messages when lying supposedly requires a higher cognitive load than recalling the truth, several effects on the emitted signals should be evident, according to the authors. The major ones are as follows:
- Longer response latencies
- Higher number of speech hesitations
- Greater pupil dilation
- Fewer illustrators [11] accompanying speech
A common hypothesis is that lying involves certain recurrent and universal emotions, which can also serve as a way to spot attempts at deception: Ekman [10], for instance, states that anxiety or excitement can be associated with lying; because both emotional states cause pupil dilation, the latter is assumed to be a reliable cue for detecting lying in progress. De Paulo et al. [12] adopt a self-presentational perspective, drawing from Ekman and Friesen's [2] distinction between lying to others and lying to the self: according to De Paulo, lying is substantially a self-presentation issue and thus involves a stronger effort in regulating otherwise unwitting pieces of verbal and nonverbal behaviour. This greater effort would thus result in a higher number of self-adaptors [6, 12], i.e. those gestures focused on the body. Scherer [13, 14] also records an interestingly higher peak of the F0 contour in the case of efforts aimed at controlling the voice trembling that is normally related to anxiety. These types of approach can also be compared to Mehrabian's [15] suggestion that deception usually results from a process that the liar himself is not willing or able to embrace completely, in terms of both the emotional and the cognitive spheres: deception would thus result in negative emotions, such as anxiety, a sense of guilt and discomfort, on the one hand, and a poor representation of knowledge, on the other: because lying involves a fictional representation of the world, the speech content of the liar is usually less precise. A recent study conducted by McNeill et al. [5] is consistent with Mehrabian's hypothesis as far as the cognitive sphere of deception is involved: it has in fact been recorded that being forced to produce unrehearsed lies during a storytelling task leads to a mismatch between the information conveyed by the hand gestures performed and that conveyed by speech, which leads to distrust in the recipient. Nevertheless, the question of detecting possible leakage and deception cues brings to the fore some issues concerning the inter-cultural variation of some traits, such as, for instance, the stigma associated with lying, the possibility of decoding micro-expressions in spontaneous interactions, and the possibility of tracking emotions in the human voice [13].
2 The Recognition of Expressions: A Case Study Research focusing on the recognition of emotions is diverse, and has led to remarkable results, despite the fact that the study of emotions in humans and animals is at present divided into two somewhat antithetical theoretical approaches: the first is based on the assumption that emotions are a universal phenomenon associated with universal traits, such as quick onset, brief duration, a specialized response of the autonomic nervous system, and universal signals associated with each emotion [14]. A different approach is based on the claim of socio-cultural variation in the expressions conveying emotions [16]. A key point of the research conducted so far is the use of mimed expressions produced by actors in the questionnaires used for the scientific investigation of the recognition of emotions [16], while less is known about the supposedly universal perception of spontaneous expressions. In this regard, Luca Urbani [17] designed,
under the guidance of Professor Rossini, a thesis project that focused on the recognition of spontaneous expressions. In order to achieve this aim, Urbani captured spontaneous reactions to video clips and real life happenings. During this data collection, subjects were unaware both of being recorded and of the aims of the investigation. After having collected the relevant data, he disclosed both the experimental setting and the aims of the study, and asked his participants to mime the expressions previously obtained spontaneously. Only the images of those participants who agreed to informed consent were used to design a questionnaire. The questionnaire was shown to 40 participants who were asked to generally describe what had happened to the subjects shown in the pictures. For the purposes of his research, Urbani [16] combined Scherer’s [18] Stimulus evaluation checks that combine both physiological and cognitive traits in a hierarchical classification of emotions, Izard’s [1] Differential Emotions Theory, and Ekman’s studies in order to develop an independent list of possible emotions to be investigated, that are as follows: a) Interest; b) Joy; c) Fear; d) Sadness; e) Anger; f) Disgust; g) Contempt; h) Surprise; i) Shyness; l) Sense of guilt. An instance of the images used for the questionnaire is shown in Figure 1.
Fig. 1. Instance of images used for the questionnaire. 1- fake interest, 2- interest, 3- joy, 4- fake joy, 5- sadness; 6- fake sadness (Urbani, 2008: 61)
Interestingly enough, the results show that mimed expressions were usually recognized and interpreted as spontaneous ones. A different phenomenon takes place with spontaneous emotions: in this latter case, only a few expressions are recognized. This research shows that, while interesting results have been achieved already in the field of enquiry pertaining to the role of emotions in communication, further study on the recognition of spontaneous expressions is desirable.
3 Deception Cues in Political Speeches: A Case Study When attempting an analysis of deception in spontaneous interactions, a number of questions come to the fore: it is in fact quite difficult to propose the hypothesis that cues of deception can be recognized, especially where there is no general agreement on the number of basic emotions nor on the way that emotions are conveyed – and thus recognized – by means of facial expression. Nevertheless, it is a matter of fact that many people, when facing deception, will be able to recognize it. A possible explanation of this phenomenon is that voice quality, micro-expressions, and posture – among other clues such as gesture and of course the spoken signal – contribute to form a unique decodable message. For the purposes of this study, we will analyze video clips available from TV shows. The main reason for choosing such material is that TV shows often involve full-face close-up shots of the interlocutors, which allow for an analysis of micro-expressions under circumstances in which it is normally expected that participants will try to suppress or mask their inner state. The major drawback of using these data is that the audio is often filtered or otherwise modulated, which makes it difficult or impossible to gauge absolute values of voicing, in particular intensity values. This issue was addressed by analyzing comparable audio streams. The clips selected for this study are as follows:
- Hillary Clinton's speech at the Democratic Convention
- a discussion between Silvio Berlusconi and Massimo D'Alema during an Italian talk show
- a small fragment of the spontaneous anger reaction of the comedian Luca Bizzarri during the Sanremo song festival.
This last video clip has been used to allow for a comparison between rhetorical anger and a spontaneous emotional reaction. For reasons of space, only brief segments of the videos will be presented in the text of this article. 3.1 Expressions Hillary Clinton's speech at the Democratic Convention was made soon after she had abandoned her campaign for President. Her speech follows the norms of the American rhetorical and behavioural protocol for political public speeches during a political campaign. Her speech (see Figure 2) starts with a negation and follows with a list of roles as she presents herself to the convention. During this phase, head nodding and emphatic eyebrow flicks are used as display mechanisms (Kendon, 2004), in order to underline the key words of her speech. During this phase, the corners of the mouth are higher than in neutral expressions. As the speaker pauses, the corners of her mouth lower (Figure 2, frame 2): after this transition, she adds, "and a proud supporter of Barack Obama" (Figure 2). The expression recorded after Hillary Clinton utters the word "Obama" is visible in Figure 2, frame 3: here, the attempted smile can be interpreted as a signal of leakage [4]. A comparison between the appearance of Hillary Clinton's spontaneous smile and the one recorded here (see Figure 3) will make this phenomenon visible.
I // * [I am so honored to be here tonight] /// (Head shaking)
[No]// I am here tonight [as a] / [proud mothe<e>r]// as a proud [democrat] (Both hands raise, palm up and away from body; eyebrows flick; eyebrows flick and head shakes; eyebrows flick)
[And a proud supporter of Barack Obama//] (Head nods)
Fig. 2. Hillary Clinton's speech at the Democratic Convention
Fig. 3. Comparison between spontaneous smile and the expression observable after the word “Obama”
A similar phenomenon is recorded in Silvio Berlusconi’s expressions during the talk show “Ballarò” (Fig. 4 and Fig. 5). In the segment presented here, Silvio Berlusconi is in the position of listener while Massimo D’Alema – a political
Fig. 4. Silvio Berlusconi’s expressions during a talk show, while in the position of listener
opponent – is speaking. Fig. 4 shows the beginning of Massimo D'Alema's speech (plate a) and the subsequent micro-expressive reaction as Massimo D'Alema leads his monologue towards unemployment rates among the young. Plate a) in Fig. 4 shows the smile with which Silvio Berlusconi presents himself. On first analysis, it is visible that one half of the face (the right one) is dramatically more active than the other. After the topic shift, a sudden change of expression is visible: the corners of the mouth lower, and blinking is more frequent (Fig. 4, plate b). Silvio Berlusconi's blinking rate is double that observed in Massimo D'Alema. Because blinking is considered to be a signal of unease [4], it can easily be associated with the micro-expressive change and analysed as a cue to leakage, while Berlusconi's smile, shown in Figure 4, plate a), is probably deceptive: if, in fact, one focuses on the corners of the mouth (Fig. 5), it will be evident that, while the right corner of the mouth points up, the left corner points down. The expression shown in plate b of Figure 5 is in fact a negative one.
Fig. 5. Silvio Berlusconi’s smile. The expression in the left half of the mouth is in contrast with the expression displayed in the right half.
Turning to the expressions of Massimo D'Alema, it is possible to single out a moment of particular emphasis, which is addressed in the next section with respect to voice quality. While evoking his efforts in describing the situation of the younger generation, Massimo D'Alema exhibits a sudden expression of anger, as shown in Figure 6. The fact that this expression of anger is not synchronized with the most prominent part of the speech signal, but rather follows it, is an index of deception (see also Scherer's work [19] for comparable results).
perchè non hanno più neanche la speranza di trovare [lavoro! //] (because they no longer even have the hope of finding a job!) – Anger expression
Fig. 6. Anger expression in Massimo D'Alema, starting with the word "lavoro"
3.2 Voice Quality The analysis of voice quality can be a reliable cue for the interpretation and analysis of emotional states [13]. We will focus here on the voice signal of Massimo D'Alema's speech in the same video clip presented in the previous section for the analysis of facial expressions. The video clip has been analysed by means of PRAAT, with simple parameters such as utterance (reporting the speech signal), pauses (both silent and filled), pitch, and intensity. Figure 7 shows a screen capture of a PRAAT spectral display of Massimo D'Alema's voice signal. As already stated, the speaker shifts the focus to employment rates among the young in Italy. D'Alema states that "this [shift] is important, so he [Silvio Berlusconi] understands the reason why [this phenomenon happens] // Because, on the basis of the statistics provided by Silvio Berlusconi, there should be a plebiscite" (the Italian version of the speech is provided in Figure 7). As visible in Figure 7, the most prominent part of the utterance in this case is the word "perchè" (here translated as "the reason why"), which shows the maximum pitch level and is followed by a long silent pause. The highest intensity in the voice stream seems to fall on the same point, which lends emphasis to the signal.
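For readers who want to reproduce this kind of measurement outside the PRAAT graphical interface, the fragment below sketches how pitch and intensity contours could be extracted and their peaks compared using the praat-parselmouth Python wrapper; the file name is hypothetical, and this is not the exact procedure used to produce the figures in this paper.

```python
import numpy as np
import parselmouth  # praat-parselmouth: Python wrapper around the Praat analysis engine

# Hypothetical file name: an excerpt of the speech segment under analysis.
snd = parselmouth.Sound("dalema_segment1.wav")

pitch = snd.to_pitch()                       # default time step and pitch range
f0 = pitch.selected_array["frequency"]       # F0 in Hz, 0 where unvoiced
f0_times = pitch.xs()

intensity = snd.to_intensity()
db = intensity.values[0]                     # intensity contour in dB
db_times = intensity.xs()

# Locate the pitch peak (ignoring unvoiced frames) and the intensity peak.
voiced = f0 > 0
t_pitch_peak = f0_times[voiced][np.argmax(f0[voiced])]
t_intensity_peak = db_times[np.argmax(db)]

print(f"Max F0: {f0[voiced].max():.1f} Hz at {t_pitch_peak:.2f} s")
print(f"Max intensity: {db.max():.1f} dB at {t_intensity_peak:.2f} s")
# A large offset between the two peaks corresponds to the pitch/intensity mismatch
# discussed in the text; a small offset indicates that the peaks are synchronised.
print(f"Peak offset: {abs(t_pitch_peak - t_intensity_peak):.2f} s")
```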
Fig. 7. Massimo D’Alema’s speech, first segment
The utterance proceeds with "but this doesn't happen, so I am providing him with a more convincing interpretational key than the demented one that [we lie] with our televisions" (Figure 8). In this segment, it is interesting to note that the highest peak of intensity is synchronised with the first three syllables of "più convincente" (more convincing), while the highest pitch is recorded on the word "fornisco" ("I provide"). It is also interesting to note that, when reporting the idea of a conspiracy set up by the televisions controlled by the left party, a chant-like pitch contour appears (Figure 8).
Fig. 8. Massimo D’Alema’s speech. Second segment
D'Alema then explains that in the South of Italy seventy thousand jobs were lost in 2005, referring to the data provided by ISTAT, the National Institute for Statistics (Figure 9). The interesting phenomenon here is the congruence between the highest pitch and the highest intensity values, which here fall on the syllable "ta" of the number settantamila (seventy thousand).
Fig. 9. Massimo D’Alema’s speech. Third segment.
Afterwards, there is an intensification of emphasis, when D’Alema explains that the young people move from south to north because they do not even have the hope of finding a job at home. Figure 10 shows the most emphatic moment of the speech, followed by the expression of anger reported in Figure 6.
Fig. 10. Massimo D’Alema’s speech. Final segment
It is interesting to note that, while the highest pitch value is here synchronised with "Speranza" (Eng.: "hope"), the highest level of intensity is found elsewhere. Moreover, as already stated, the most emphatic part of the speech precedes the expression of anger recorded in Figure 6.
If one compares this with an instance of genuine anger, the difference is evident. For this purpose, we will examine here a display of anger by the comedian Luca Bizzarri at the Sanremo Song Festival. The episode is preceded by a satirical song about Silvio Berlusconi and Gianfranco Fini, which Bizzarri had presented together with Paolo Kessisoglu. The song in question had caused embarrassment to the board of directors of RAI (the Italian state-owned broadcasting service). On the occasion presented here, Luca and Paolo are introduced to the authorities in the first row. When forced to pay homage to the first row, both comedians adopt a sarcastic and falsely deferential attitude. Figure 11 shows Luca Bizzarri's sentence, uttered towards one of the Directors ("we won't touch your Berlusconi again"): as is visible, when the tone of the utterance is sarcastic and/or playful, the pitch accent is higher than the intensity of the sentence. Figure 12 and Figure 13 show the expressions of Luca Bizzarri and the analysis of his voice after he hears the word "bipartisan", uttered by Gianni Morandi (in the transcripts, M). Figure 12 shows the transitions in Luca's expressions (on the right), when Morandi says "bravo" (frame 1) and when he utters the word "bipartisan" (frame 2). It will hopefully be visible here that Luca's mouth changes from a smile to anger before he replies. His voice is reported in Figure 13. As is visible here, in this case both the pitch and the maximum intensity tend to synchronise, while the angry speech slightly follows the expression shift.
Fig. 11. Luca Bizzarri’s sarcastic salute to one of the RAI directors
A comparison with Massimo D'Alema's performance (Figure 6) makes it evident that the politician's anger was probably simulated, as indicated by the excessive emphasis in the expression, the inverted temporal order of expression and speech, and the mismatch between pitch and intensity.
M.: Bravo!// Bipartisan/ (Bravo! Bipartisan)
L: Ma che bipartisan! A me non me ne frega un cazzo né di uno né di e<eee> (What bipartisan! I don't give a fuck about either one or the e<eee>)
Fig. 12. Luca Bizzarri’s expressions
Fig. 13. Luca Bizzarri’s voice
4 Conclusions The possibility of tracking deception in everyday interaction is an interesting hypothesis, although it requires a broader understanding of perception, on the one hand, and signal emission, on the other, than is normally brought into
play in current analysis. The study of deception, thus, while revealing intriguing possible applications in the fields of verbal and nonverbal communication, intercultural communication, negotiation, and even human-robot interaction, needs further field study aimed at addressing and unfolding the complexity of face-to-face human communication. Acknowledgments. Thanks to Anna Esposito, Karl-Erik McCullough, and Catherine Pelachaud for their comments on this work, and to my students for their vivid interest in this topic.
References 1. Izard, C.E.: Human emotions. Plenum Press, New York (1977) 2. Zhou, L., Burgoon, J.K., Nunamaker, J.F., Twitchell, D.: Automating Linguistics-Based Cues for Detecting Deception in Text-based Asynchronous Computer-Mediated Communication. Group Decision and Negotiation 13, 81–106 (2004) 3. Ekman, P., Friesen, W.V., Scherer, K.R.: Body movement and voice pitch in deceptive interactions. Semiotica 16, 23–27 (1976) 4. Ekman, P., Friesen, W.V.: Nonverbal leakage and clues to deception. Psychiatry 32, 88–106 (1969) 5. McNeill, D., Duncan, S., Franklin, D., Goss, J., Kimbara, I., Parrill, I., Welji, H., Chen, L., Harper, M., Quek, F., Rose, T., Tuttle, R.: Mind Merging. Festschrift in honor of Robert M. Krauss, Chicago (August 11, 2007) 6. Barnes, J.A.: A Pack of Lies: Towards a Sociology of Lying. Cambridge University Press, Canbridge (1994) 7. Martin, J.-C., Niewiadomski, R., Devillers, L., Buisine, S., Pelachaud, C.: Multimodal complex emotions: Gesture expressivity and blended facial expressions. International Journal of Humanoid Robotics, Special Edition “Achieving Human-Like Qualities in Interactive Virtual and Physical Humanoids” 3(3), 269–292 (2006) 8. Pelachaud, C.: Modelling Multimodal Expression of Emotion in a Virtual Agent. Philosophical Transactions of Royal Society B Biological Science, B 364, 3539–3548 (2009) 9. Esposito, A.: Affect in Multimodal Information. In: Tao, J., Tan, T. (eds.) Affective Information Processing IV, pp. 203–226. Springer, Berlin (2009) 10. Zuckerman, M., DeFrank, R.S., Hall, J.A., Larrance, D.T., Rosenthal, R.: Facial and vocal cues of deception and honesty. Journal of Experimental Social Psychology 15, 378–396 (1979) 11. Zuckerman, M., DePaulo, B.M., Rosenthal, R.: Verbal and nonverbal communication of deception. In: Berkowitz, L. (ed.) Advances in Experimental Social Psychology, vol. 14, pp. 1–59. Academic Press, New York (1981) 12. De Paulo, B.M., Lindsay, J.J., Malone, B.E., Muhlenbruck, L., Charlton, K., Cooper, H.: Cues to deception. Psychological Bulletin 129(1), 74–118 (2003) 13. Juslin, P.N., Scherer, K.R.: Vocal expression of affect. In: Harrigan, J., Rosenthal, R., Scherer, K.R. (eds.) The New Handbook of Methods in Nonverbal Behavior Research, pp. 65–135. Oxford University Press, Oxford (2005) 14. Scherer, K.R.: Speech and emotional states. In: Darby, J.K. (ed.) Speech Evaluation in Psychiatry, pp. 189–220. Grune & Stratton, New York (1981)
15. Mehrabian, A.: Nonverbal communication. Aldine Atherton, Chicago (1972) 16. Ekman, P.: Facial Expression and Emotion. American Psychologist 48(4), 372–379 (1993) 17. Urbani, L.: La percezione delle emozioni. M.A. thesis in Non-Verbal Communication, Università del Piemonte Orientale (2008) 18. Scherer, K.: On the Nature and Function of Emotions. A component Process approach. In: Scherer, K.R., Ekman, P. (eds.) Approaches to Emotion, pp. 293–317. Erlbaum, Hillsdale (1984)
Selection Task with Conditional and Biconditional Sentences: Interpretation and Pattern of Answer
Fabrizio Ferrara¹ and Olimpia Matarazzo²
¹ Department of Relational Sciences "G. Iacono", University of Naples "Federico II", Italy
[email protected]
² Department of Psychology, Second University of Naples, Italy
[email protected]
Abstract. In this study we tested the hypothesis according to which sentence interpretation affects performance in the selection task, the most used task to investigate conditional reasoning. Through a between design, conditional (if p then q) and biconditional (if and only if p then q) sentences, of which participants had to establish the truth-value, were compared. The selection task was administered with a sentence-interpretation task. The results showed that the responses to the selection task widely depended on the sentence interpretation and that conditional and biconditional sentences were interpreted, at least in part, in analogous way. The theoretical implications of these results are discussed. Keywords: selection task, interpretation task, conditional reasoning, biconditional reasoning.
1 Introduction One of the most used experimental paradigms in the study of conditional reasoning – that is, reasoning with sentences of the form "if... then" – is the selection task. It is a rule-testing task, devised by Wason in 1966 [1] in order to investigate the procedure people follow to test a hypothesis. The selection task consists in selecting the states of affairs (p, not-p, q, not-q) necessary to determine the truth-value of a conditional rule "if p then q". In its original version participants were presented with four cards: each card had a letter on one side and a number on the other side. The cards were presented so that two of them were visible only from the "letter" side and the other two were visible only from the "number" side (see fig. 1).
A    K    2    7
Fig. 1. The four cards used in the original Wason’s experiment
The relationship between letters and numbers in the four cards was expressed through the following rule: “if there is a vowel on one side then there is an even number on the other side”. Participants had to select those cards they needed to turn over to determine whether the rule was true or false¹.

According to propositional logic, a conditional sentence² “if p then q” is conceived as a material implication between the two simple sentences p (antecedent) and q (consequent). The relationship of material implication means that a conditional statement is false only when the antecedent is true (p) and the consequent is false (not-q), while it is true in the remaining combinations of truth-values for p and for q (p/q, not-p/not-q, not-p/q). Therefore, the logically correct answer to the selection task consists in selecting the cards “A” (p) and “7” (not-q), because they are the only ones that may present a letter/number combination able to falsify the rule (that is, a vowel on one side and an odd number on the other side). Selecting “K” (not-p) and “2” (q) is useless because any state of affairs associated with them makes the rule true.

¹ Following what is customary in the literature, we will keep using the phrase “truth/falsity of a rule” and the terms “sentence” and “rule” interchangeably, although, strictly speaking, a rule cannot be called true or false because only the sentence describing it may take one of the two truth values.
² The terms “sentence” and “statement” are used interchangeably.

The selection of p & not-q allows one to establish both the truth and the falsity of the rule in closed-context tasks – where all the states of affairs covered by the rule can be explored – whereas it allows one to establish only the falsity of the rule in open-context tasks – where the rule concerns a set of cases that cannot be fully explored. In this case, the truth of the rule is indemonstrable. Originally, Wason devised the selection task as a closed-context task but afterwards, in the countless studies based on this experimental paradigm, there has not been a clear definition of the context of the task.

In the first experiments conducted by Wason (summarized in [2]) only 4% of participants gave the correct answer, selecting the p & not-q cards, whereas the most frequent answers were p & q (46%) or p alone (33%). These percentages have remained almost unchanged in studies using selection tasks with features analogous to the original ones, i.e. with instructions requiring participants to establish the truth-value of an arbitrary rule (for a review see [3]).

Several hypotheses have been advanced to explain this recurring pattern of answers. Wason [1] assumed that the p & q answer resulted from a confirmation bias: participants tend to confirm the rule rather than refute it. Evans (see [4] for a review) posits that most participants are guided by a matching bias, a heuristic process leading them to select only the states of affairs directly mentioned in the rule (just p & q). In the framework of relevance theory, Sperber, Cara & Girotto [5] argue that people use unconscious inferential processes (called relevance mechanisms), specialized for discourse comprehension, to solve the selection task. These mechanisms lead them to select the cards containing the most relevant information, i.e. those producing high cognitive effects (new inferences) with low cognitive effort (processing cost): usually, in selection tasks with abstract content these cards are p and q.

A different explanation is advanced by the information gain theory [6], [7]. The theory adopts a probabilistic conception of conditionals, grounded in Ramsey’s test [8], according to which people judge the truth-value of a conditional sentence on the
basis of the conditional probability of q given p, P(q|p). The selection task is seen as an inductive problem of optimal data selection rather than a deductive task. Participants would unconsciously interpret it as an open-context task, and select cards that have the greatest expected information gain in order to decide between two competing hypotheses: (H1) the rule is true and the p cases are always associated with the q ones, (H0) the rule is false and p and q cases are independent. Oaksford & Chater [6], [7] developed a model to calculate the expected information gain associated with the four cards, according to which the information gain of the cards is ordered as p > q > not-q > not-p. Since the participants’ responses conform to the model predictions, they should no longer be viewed as biased, but as the most rational ones.

A number of authors [9], [10], [11], [12], [13], [14], [15], who maintain the deductive view of conditionals, focused on the role played by sentence interpretation in the responses people give to the selection task, and underline that the conditional sentence is often interpreted as a biconditional. In propositional logic a biconditional sentence “if and only if p then q” describes the relationship of double implication between two propositions: in this case not only p implies q, as in conditional sentences, but also q implies p. A biconditional is true when its antecedent and consequent are both true or both false (p/q or not-p/not-q) and is false when antecedent and consequent have different truth-values (p/not-q or not-p/q). Unlike the conditional sentence, the biconditional is logically equivalent to its converse sentence “if and only if q then p” and to its inverse sentence “if and only if not-p then not-q”. The logically correct answer to the selection task with a biconditional sentence consists in selecting all the cards: indeed all of them may present a combination of states of affairs that falsifies the sentence. Nevertheless, the experimental instructions of the task, which require selecting only the cards necessary to determine the truth value of the rule, could pragmatically discourage the production of this type of answer and favor the more economical selection of the p & q cards.

In natural language biconditional statements are often expressed with “if... then” sentences, their appropriate interpretation depending on the context. However, in conditional reasoning tasks, and especially in abstract selection tasks, the context is frequently not well defined; therefore, the biconditional interpretation of the conditional statement could be favoured. Moreover, this interpretation could be encouraged by the binary nature of the task’s materials [10]. For instance, in Wason’s original task the rule “if there is a vowel on one side, then there is an even number on the other side” could lead participants to believe that the inverse rule “if there is a consonant then there is an odd number” also holds.

Margolis [11], [12] hypothesizes that performance in the selection task is affected by wrong interpretations of the task. Participants, indeed, would unconsciously interpret the four cards not as individual cases, but as all-inclusive categories. For instance, the “A” card is not regarded as a single card, but as representative of all the possible “A” cards. So, the number found behind the single “A” card would be the same for all “A” cards. If there is, for example, an even number, it means that all “A” cards have an even number on the other side and that no “A” card has an odd number.
Consequently, selecting only the p card is sufficient to establish the truth-value of a conditional rule because in this way all the states of affairs covered by the rule are
explored. For the same principle, in case of a biconditional interpretation it is sufficient to select p & q. In virtue of this misinterpretation of the cards, Margolis argues that p and p & q answers should not be considered as mistakes, but as the correct responses in conformity with a conditional and a biconditional interpretation of the rule, respectively. Laming and colleagues [14], [15] posit that most participants misunderstand the experimental instructions given in the selection task in different ways, the most typical being interpreting the conditional sentence as a biconditional and reading “one side/other side” as “top/underneath”. However, the participants’ responses are largely consistent with their understanding of the rule: the “top/underneath” interpretation leads to turn p card over, the biconditional interpretation to turn all the cards over, while the combination of the two misinterpretations leads to turn p & q cards over. So, these responses should be seen to be logical rather than erroneous. According to the Mental Model theory [9], [16], participants select only the cards that are exhaustively represented in their mental model of the rule. The theory assumes that people reason by constructing mental models of the possibilities compatible with the premises of an argument, from which they draw putative conclusions, successively validated by searching for counterexamples. However, usually people do not flesh out exhaustive models of the premises, but only those representing true possibilities. In selection task, when the rule is interpreted as a conditional, participants tend to construct only the model [p] q (the square brackets indicate that the state of affairs is exhaustively represented in the model) and select the p card. Instead, in the case of a biconditional interpretation, they construct the model [p] [q] and select the p & q cards. Only if participants are able to flesh out the exhaustive models of the rule – [p][q], [not-p][q], [not-p] [not-q] in the case of conditional interpretation; [p][q], [not-p][not-q] in the case of biconditional interpretation – they can infer the counter-example of the rule – p/not-q for conditionals, p/not-q and not-p/q for biconditionals – and select the logically correct answers. It is worth to note that also the relevance theorists (Sperber, Cara & Girotto [5]) advance the hypothesis of the biconditional interpretation of the rule, but they posit that in abstract tasks, p & q are viewed as the most relevant cards, regardless of the type of interpretation. Although the “sentence-interpretation” hypothesis has been shared by a number of authors, only few studies have explicitly assessed how people interpret the conditional rules presented in selection tasks. Laming and colleagues [14], [15] gave participants six sets of four cards and asked them to establish the truth-value of a conditional rule for each set by physically turning over the cards needed to do it. Green, Over and Pyne [17] administered a construction task after the selection task, in which participants were asked to imagine, supposing the truth of the rule, which state of affairs was depicted on the hidden side of the four cards. The study found that p & not-q responses are linked to a conditional interpretation. 
More recently, Wagner-Egger [13] showed that the conditional interpretation of the rule is associated with p & not-q and p alone responses while the biconditional interpretation is linked to the p & q answer; in this study the effective interpretation of the rule was determined using a deductive task that, for each of the four cards, required participants to indicate the states of affairs compatible with the truth of the rule. To our knowledge, no studies have compared
conditional vs. biconditional sentences to investigate whether participants interpret them in the same or different manner and whether the responses they give to the selection task are affected by their sentence interpretation.
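To make the logical background of the two readings concrete before describing the experiments, the short sketch below derives, from the truth tables of the material conditional and the biconditional, which falsifying combinations exist and which visible card faces could conceal them (and hence which selections are logically required). It is only an illustration of the standard logical analysis summarized above, written in Python with naming conventions of our own.

```python
from itertools import product

def conditional(p, q):      # material implication "if p then q"
    return (not p) or q

def biconditional(p, q):    # "if and only if p then q"
    return p == q

def falsifying_combinations(rule):
    """All antecedent/consequent value pairs that make the rule false."""
    return [(p, q) for p, q in product([True, False], repeat=2) if not rule(p, q)]

def cards_to_select(rule):
    """Visible faces whose hidden side could complete a falsifying combination."""
    faces = {"p": ("antecedent", True), "not-p": ("antecedent", False),
             "q": ("consequent", True), "not-q": ("consequent", False)}
    falsifiers = falsifying_combinations(rule)
    selected = []
    for name, (side, value) in faces.items():
        if any((side == "antecedent" and p == value) or (side == "consequent" and q == value)
               for p, q in falsifiers):
            selected.append(name)
    return selected

for label, rule in [("conditional", conditional), ("biconditional", biconditional)]:
    combos = [("p" if p else "not-p") + " & " + ("q" if q else "not-q")
              for p, q in falsifying_combinations(rule)]
    print(f"{label}: falsified by {combos}; cards worth turning: {cards_to_select(rule)}")
# The conditional is falsified only by p & not-q, so only the p and not-q cards need to be
# turned; the biconditional is also falsified by not-p & q, so all four cards do.
```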
2 Experiment 1

This experiment aimed at further investigating the sentence-interpretation hypothesis in two ways:
1. by administering an interpretation task jointly with an abstract selection task in order to ascertain how participants interpreted the sentence and whether the responses to the selection task were affected by their interpretation;
2. by comparing in both tasks a conditional vs. a biconditional sentence in order to establish whether the sentence interpretation and the pattern of responses to the selection task differed as a function of the type of sentence.
Concerning 1., we must point out that, unlike the interpretation tasks used in other studies [13], [15], where it was required to take as given the truth of the rule, our task held it uncertain: participants were presented with some possible ways in which the open side of each card could be matched with the covered side and, for each pattern, they had to indicate whether it confirmed or falsified the hypothesis to test. In our opinion, this procedure should prevent participants from believing that the hypothesis presented in the selection task was true and that they should look for evidence in support of its truth. The order of the two tasks, selection and interpretation, was balanced across the participants. We expected that this variable would affect the results: in our opinion, the interpretation task, requiring to reason about the combinations of states of affairs able to confirm or falsify the hypothesis, would improve the performance in the selection task. So the number of correct responses should have increased when the interpretation task was administered before the selection task. Two versions of the interpretation task were built: in one the hidden side had only the same states of affairs represented on the visible side of the cards; in the other, the hidden side had also different states of affairs. This last version should avoid a binary interpretation of the states of affairs and therefore prevent a biconditional interpretation of the sentence. As to 2., this is the first study that explicitly compared conditional vs. biconditional rules. We had two reasons for this choice: a) to inspect whether a biconditional sentence elicits a specific - biconditional - pattern of answers; b) to find out whether the overlap between conditional and biconditional interpretation is limited to the “if... then” sentences or affects also the “if and only if... then” sentences. In other words, we wondered if also biconditional statements are misinterpreted, as is the case for conditionals. If so, we should infer that in natural language, the connectives used to introduce conditional or biconditional statements are ambiguous and undefined and that understanding the participants' interpretation of the sentences they are presented with should be a preliminary step to any reasoning task.
2.1 Design

The 2x2x2 research design involved the manipulation of three between-subjects variables: type of sentence (conditional vs. biconditional), order of administration of the tasks (interpretation task–selection task vs. selection task–interpretation task, henceforth: “IS” vs. “SI”), and type of materials used in the interpretation task (cards with same states of affairs vs. cards with different states of affairs, henceforth: same values vs. different values).

2.2 Participants

Two hundred and forty undergraduates of the Universities of Naples participated in the experiment as unpaid volunteers. They had no knowledge of logic or psychology of reasoning and their age ranged between 18 and 35 years (M=21,74; SD=3,59). Participants were assigned randomly to one of the eight experimental conditions (n=30 for each condition).

2.3 Materials and Procedure

The selection task and the interpretation task were presented together in a booklet. Participants were instructed to solve the tasks one by one, in the exact order in which they were presented: they could go to the next page only after completing the current page, and it was forbidden to return to the previous page. The “IS” version of the booklet showed on the first page a presentation of the states of affairs: four cards having the name of a flower on one side and a geometric shape on the other side. The cards were visible only from one side. A hypothesis about the relationship between flower names and geometric shapes was formulated: “if there is a daisy on one side then there is a square on the other side” (in experimental conditions with the conditional sentence) or “if and only if there is a daisy on one side then there is a square on the other side” (in experimental conditions with the biconditional sentence). The depiction of the four cards was presented (see fig. 2).
Fig. 2. The four cards used in the experiment (visible sides: daisy, tulip, square, triangle)
On the second page of the booklet there was the interpretation task. It presented four card patterns, in each of which four possible combinations of both sides of the four cards were depicted. In each pattern the four cards were depicted so that both sides were visible: the hidden side was colored in grey and placed beside the visible side (see figures 3 and 4). For each pattern, participants had to judge whether it confirmed or falsified the hypothesis. In the “same values” version of the task (see fig. 3), the hidden sides had the same states of affairs depicted on the visible sides of the four cards (daisy, tulip, square, triangle); in the “different values” version (see fig. 4), the hidden sides had also different states of affairs (i.e. sunflower, rose, orchid, circle, rectangle, pentagon).
The combinations presented in the four card patterns were the following:
1. p & q; not-p & not-q; q & p; not-q & not-p
2. p & not-q; not-p & not-q; q & p; not-q & not-p
3. p & q; not-p & not-q; q & not-p; not-q & not-p
4. p & q; not-p & not-q; q & p; not-q & p.
The first card pattern confirmed both conditional and biconditional statements, the second and the fourth patterns falsified both statements, the third pattern (see fig. 3 and 4) confirmed the conditional statement and falsified the biconditional one. So, this pattern was able to discriminate whether participants made a conditional or a biconditional interpretation of the hypothesis: if they answered “confirms”, judging the q & not-p combination compatible with the hypothesis, then they interpreted it as a conditional statement; on the contrary, if they answered ”falsifies”, judging the combination incompatible with the hypothesis, they interpreted it as a biconditional one. The order of the four configurations was randomized across the participants. The third and last page of the booklet included the selection task. The same four cards of page 1 were presented again, along with the hypothesis formulated about the relationship between the two sides: “if there is a daisy on one side then there is a square on the other side” (in experimental conditions with conditional rule) or “if and only if there is a daisy on one side then there is a square on the other side” (in experimental conditions with biconditional rule). Participants were asked to indicate which card or cards needed to be turned over in order to determine whether the hypothesis was true or false. The “SI” version of the booklet presented a different order of administration of the two tasks: participants had to solve first the selection task and then the interpretation task. The first page was very similar to that of the “IS” booklet, the only difference being that, after the presentation of the states of affairs and the formulation of the hypothesis, participants were asked to indicate which card or cards needed to be turned over in order to determine whether the hypothesis was true or false. On the second page the interpretation task was presented in the same way as in the “IS” version.
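The role of the third pattern as the discriminating one can be checked mechanically. The following sketch is our own encoding (not part of the original materials): p stands for “daisy”, q for “square”, and each card is normalized to an (antecedent, consequent) pair regardless of which side is visible; the four patterns are then classified under the two readings.

```python
# Each pattern is a list of four cards, each card an (antecedent_side, consequent_side) pair:
# "p"/"not-p" codes daisy / another flower, "q"/"not-q" codes square / another shape.
PATTERNS = {
    1: [("p", "q"), ("not-p", "not-q"), ("p", "q"), ("not-p", "not-q")],
    2: [("p", "not-q"), ("not-p", "not-q"), ("p", "q"), ("not-p", "not-q")],
    3: [("p", "q"), ("not-p", "not-q"), ("not-p", "q"), ("not-p", "not-q")],
    4: [("p", "q"), ("not-p", "not-q"), ("p", "q"), ("p", "not-q")],
}

def conditional_holds(card):      # "if p then q": false only for p & not-q
    return card != ("p", "not-q")

def biconditional_holds(card):    # "if and only if p then q": true only for p & q and not-p & not-q
    return card in (("p", "q"), ("not-p", "not-q"))

for number, cards in PATTERNS.items():
    conf_cond = all(conditional_holds(c) for c in cards)
    conf_bic = all(biconditional_holds(c) for c in cards)
    print(f"pattern {number}: conditional -> {'confirms' if conf_cond else 'falsifies'}, "
          f"biconditional -> {'confirms' if conf_bic else 'falsifies'}")
# Expected output: pattern 1 confirms both readings, patterns 2 and 4 falsify both,
# and pattern 3 confirms the conditional but falsifies the biconditional reading.
```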
Fig. 3. “Same values” experimental condition - The critical configuration to discern whether the hypothesis was interpreted as a conditional or a biconditional sentence: the “square” visible side is associated to a flower different from a daisy (flower sides shown: Daisy, Tulip, Tulip, Tulip; participants answered “confirms” or “falsifies” to the question “Does this configuration confirm or falsify the hypothesis?”).
Fig. 4. “Different values” experimental condition - The critical configuration to discern whether the hypothesis was interpreted as a conditional or a biconditional sentence: the “square” visible side is associated to a flower different from a daisy (flower sides shown: Daisy, Dahlia, Tulip, Rose; answer options: “confirms” / “falsifies”).
2.4 Results

Sentence-interpretation task. The frequency of answers to the sentence-interpretation task in the eight experimental conditions is reported in table 1. We counted as “conditional interpretation” when participants answered “confirms” to the critical pattern and correctly to the other three patterns, as “biconditional interpretation” when they answered “falsifies” to the critical pattern and correctly to the other three patterns, and “other interpretation” when participants, aside from their answer to the critical combination, made one or more mistakes judging the other three patterns (that is, choosing “confirms” to one or both of the combinations that falsified the hypothesis and/or choosing “falsifies” to the combination that confirmed the hypothesis).

Table 1. Frequency of answers to the interpretation task in the eight experimental conditions

                          Conditional sentence              Biconditional sentence
                          IS              SI                IS              SI
Type of interpretation    Sa.     Di.     Sa.     Di.       Sa.     Di.     Sa.     Di.     Tot
Conditional               13      6       9       6         7       6       9       6       62
Biconditional             13      12      17      14        17      11      16      14      114
Other                     4       12      4       10        6       13      5       10      64
Tot                       30      30      30      30        30      30      30      30      240

Sa. = same values; Di. = different values; IS/SI = order of administration.

Observing the marginal totals of table 1, one can note that the biconditional interpretation is the most frequent: it was given by 47,5% of participants whereas the conditional one was delivered by 25,8% of them; other interpretations reached the percentage of 26,7%. The inspection of table 1 also shows that these percentages are independent of the type of sentence (conditional vs. biconditional) and of the order of tasks administration (IS vs. SI). In particular, the conditional sentence was interpreted as
conditional by 28,3% of the participants and as biconditional by 46,7%, while the remaining 25% gave other interpretations; the biconditional sentence was interpreted as biconditional by 48,3% of the participants and as conditional by 23,3%, while the remaining 28,3% gave other interpretations. LOGIT analyses, conducted on the interpretation as dependent variable and the sentence, the order of administration and the type of materials (same values vs. different values) as independent variables, corroborated these considerations. The best model was the one in which the interpretation was affected only by the type of materials (G² = 4,34; d. f. = 12; p = .98). Parameter estimates showed that in the “same values” condition both conditional and biconditional interpretations increased, whereas other interpretations increased in the “different values” condition (all p < .001).

Selection task. The answers retained for the analyses were: p & q, p (usually the most frequent ones), p & not-q, all cards (the logically correct responses according to a conditional and a biconditional interpretation, respectively); all the other types of answers were assembled in the other category. Table 2 presents frequencies of responses as a function of the eight experimental conditions and of the type of interpretation. Inspecting table 2, it is possible to note that, regardless of the type of sentence, the order of administration and the type of materials, 84,6% of p & not-q answers and 62,7% of p answers are associated with conditional interpretation of the sentence, while p & q responses (78,7%) and the selection of all cards (96%) are strongly linked to its biconditional interpretation. This observation has been supported by LOGIT analyses, performed on the answer as dependent variable, and the sentence (conditional vs. biconditional), the order of administration (IS vs. SI), the type of materials used in the interpretation task (same values vs. different values), and the interpretation (conditional vs. biconditional vs. other) as factors. The best model was the one in which the response was affected only by the interpretation (G² = 82,766; d. f. = 84; p = .518). Parameter estimates showed that p and p & not-q responses were associated with the conditional interpretation, while p & q and all cards were linked to the biconditional interpretation; other responses increased with other interpretations (all p < .001).
Table 2. Frequencies of answers to the selection task as a function of the eight experimental conditions and of the sentence interpretation

[The table crosses the five answer categories (p & not-q, p & q, p, all cards, others) with the eight experimental conditions (type of sentence × order of administration × type of materials), each broken down by interpretation (conditional, biconditional, other). Row totals over the 240 participants: p & not-q = 13, p & q = 89, p = 59, all cards = 25, others = 54.]

3 Experiment 2

In the interpretation task of experiment 1 four card patterns were presented: as to the conditional sentence, two of them confirmed it and the other two falsified it; as regards the biconditional sentence, one pattern confirmed it and the other three falsified it. However, it should be noted that, whereas the conditional statement is falsified only by the p & not-q combination and is confirmed by all other combinations of antecedent and consequent, the biconditional statement is falsified whenever the presence of the antecedent does not correspond to the presence of the consequent and vice versa. Thus, the card patterns presented in the first experiment did not include the one with the fourth combination able to falsify the biconditional, i.e. not-p & q. In this small-scale study, suggested by one of the reviewers of the first
version of the work, experiment 1 was replicated by adding the fifth pattern (see fig. 5) in the interpretation task. Since the results of experiment 1 showed that the use of different values in the interpretation task increased other interpretations and decreased the conditional and biconditional ones, in this study the interpretation task was performed only with cards presenting the same values on both sides.

3.1 Design

The 2x2 research design involved the manipulation of two between-subjects variables: type of sentence (conditional vs. biconditional) and order of administration of the tasks (IS vs. SI).

3.2 Participants

Eighty undergraduates of the Universities of Naples participated in the experiment as unpaid volunteers. They had no knowledge of logic or psychology of reasoning and their age ranged between 18 and 30 years (M=22,41; SD=2,85). Participants were assigned randomly to one of the four experimental conditions (n=20 for each condition).

3.3 Materials and Procedure

The materials used in this study were the same as in experiment 1. However, unlike experiment 1, the interpretation task, presented only in the “same values” version, had five card patterns instead of four.
Fig. 5. The fifth card pattern, with the not-p & q combination (flower sides shown: Daisy, Tulip, Daisy, Tulip; answer options: “confirms” / “falsifies”)
3.4 Results

Since the number of participants in this study was smaller than in the first experiment, we preliminarily checked whether the order of task administration affected the responses, in order to drop this variable in case it had no influence and thus to simplify the experimental design. The chi-square test showed no effect of the administration order either on the interpretation task (χ² = .096; d. f. = 2; p = .95) or on the selection task (χ² = 4.634; d. f. = 4; p = .33). So, the two orders of administration were aggregated in the subsequent analyses.
Sentence-interpretation task. The frequency of answers to the sentence-interpretation task is reported in table 3.

Table 3. Frequencies of answers to the sentence-interpretation task

Interpretation     Conditional sentence    Biconditional sentence    Tot
Conditional        11                      8                         19
Biconditional      18                      20                        38
Other              11                      12                        23
Tot                40                      40                        80
Examining table 3, it is possible to note that the biconditional interpretation is the most frequent, regardless of the type of sentence (conditional vs. biconditional). More specifically, the conditional sentence was interpreted as conditional by 27,5% of the participants and as biconditional by 45%, while the remaining 27,5% gave other interpretations; the biconditional sentence was interpreted as biconditional by 50% of the participants and as conditional by 20%, while the remaining 30% gave other interpretations. The sentence did not affect the interpretation (χ² = .622; d. f. = 2; p = .73).

Selection task. In table 4 the frequency of responses as a function of the sentence and of the interpretation is presented. Observing table 4, it is possible to note that, regardless of the sentence, 84,2% of p answers were associated with conditional interpretation of the sentence, while 83,3% of p & q responses were linked to its biconditional interpretation. This consideration was supported by LOGIT analyses, performed on the answer as dependent variable, and the sentence (conditional vs. biconditional) and the interpretation (conditional vs. biconditional vs. other) as factors. The best model was the one in which the response was affected only by the interpretation (G² = 4,87; d.f. = 12; p = .96). Parameter estimates showed that p and p & not-q responses were associated with a conditional interpretation, while p & q and all responses were linked to a biconditional interpretation; other responses increased with other interpretations (all p < .001).

Table 4. Frequencies of answers to the selection task as a function of the sentence and of the interpretation

              Conditional sentence          Biconditional sentence
Answer        Cond.   Bicond.   Other       Cond.   Bicond.   Other       Tot
p & not-q     2       0         0           0       0         0           2
p & q         1       12        2           0       13        2           30
p             8       0         2           8       0         1           19
All           0       3         0           0       5         0           8
Others        0       3         7           0       2         9           21
Tot           11      18        11          8       20        12          80

Cond. = conditional interpretation; Bicond. = biconditional interpretation.
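The authors analyzed these frequencies with LOGIT (log-linear) models. As a rough, simplified illustration of the association between interpretation and response, one could collapse Table 4 over the sentence factor and run a chi-square test of independence instead; the snippet below is only a sketch of that simpler, substitute analysis (not the LOGIT model reported above), using the counts from Table 4.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts from Table 4, collapsed over the sentence factor (conditional + biconditional).
# Rows: answer in the selection task; columns: interpretation given in the interpretation task.
answers = ["p & not-q", "p & q", "p", "All", "Others"]
interpretations = ["conditional", "biconditional", "other"]

table = np.array([
    [2 + 0, 0 + 0, 0 + 0],    # p & not-q
    [1 + 0, 12 + 13, 2 + 2],  # p & q
    [8 + 8, 0 + 0, 2 + 1],    # p
    [0 + 0, 3 + 5, 0 + 0],    # All cards
    [0 + 0, 3 + 2, 7 + 9],    # Others
])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.2f}, p = {p_value:.4f}")

# Share of each answer falling under each interpretation (cf. the 84,2% and 83,3% figures above).
row_shares = table / table.sum(axis=1, keepdims=True)
for answer, shares in zip(answers, row_shares):
    print(answer, {i: f"{s:.1%}" for i, s in zip(interpretations, shares)})
```

Note that several expected cell counts are small here, which is one reason the authors' log-linear approach is preferable; the sketch is meant only to show how the raw contingency structure can be inspected.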
4 Discussion and Conclusions

The results of the interpretation task in both experiments showed that almost half of the participants interpreted both conditional and biconditional sentences as biconditionals, regardless of their linguistic formulation. On the other hand, about 28% of the participants appropriately interpreted the conditional statement and more than 20% of them interpreted the biconditional statement as a conditional. Whereas the biconditional interpretation of conditionals is widely documented in the reasoning literature (see [13] for a review), to our knowledge the conditional reading of biconditional statements has not been documented yet. These findings suggest that many people assign a similar meaning to “if… then” and “if and only if… then” sentences with abstract content and that, consequently, the linguistic formulation alone is not sufficient to determine the meaning of a (bi)conditional sentence, without referring to its thematic content and context. Contrary to our predictions, the use of cards also presenting, on their hidden side, values different from those shown on the visible side did not prevent or discourage a biconditional interpretation of the sentence – which was our aim – but it created a confounding effect that increased other interpretations.

However, our findings widely support the “sentence-interpretation” hypothesis: the way the sentence is interpreted directly influences the pattern of answers. Aside from the conditional or biconditional formulation of the sentence presented to participants, p and p & not-q answers are associated with its conditional interpretation, while p & q and the selection of all cards are associated with its biconditional interpretation. The systematic link of the p & q response with the biconditional interpretation of the statement undermines the alternative theoretical perspectives seeing this response either as the result of a confirmation [1] or of a matching bias [4], or as the most relevant [5] or the most rational [6], [7] response.

We turn now to consider only the correct responses, given the conditional and the biconditional interpretations, respectively. Across the two experiments of this study, the percentage of p & not-q responses, given the conditional interpretation of the sentence, is 13,9%; the percentage of selection of all cards, given the biconditional interpretation, is 20,5%. Since the order of tasks administration (IS vs. SI), contrary to our hypothesis, did not affect the participants’ responses, one can infer that making the sentence interpretation explicit, through the interpretation task, does not improve performance in the selection task. Besides, the absence of difference between the results of the two experiments shows that presenting (in experiment 2) a further combination (not-p & q) able to falsify the biconditional does not affect the sentence interpretation nor does it increase the choice of all cards in the selection task. In fact, although our findings are analogous to those of similar studies [e.g. 13, experiment 1], we still have to address the question of why p and p & q responses are the most frequent given a conditional or a biconditional interpretation, respectively. The interpretation task showed that participants giving a correct interpretation (conditional or biconditional) recognized which combinations of states of affairs falsify the sentence and which cards may have these combinations, but they did not use this knowledge to select the logically correct cards in the selection task.
For instance, although participants giving the conditional interpretation understood that the not-q card, associated with p, falsified the rule, most of them tended to choose only the p card in the
selection task. The congruence between interpretation and selection found by Laming and colleagues [14-15] has only partly been replicated in this study, which rather suggests that the cognitive processes involved in the two tasks only partially overlap.

Although many hypotheses have already been advanced in order to explain what might be called “incomplete selection” – p instead of p & not-q; p & q instead of all cards – [see 13 for a review], here we formulate a further hypothesis. We might speculate that in performing the selection task people tend to reason only in the forward direction, i.e. from the antecedent (p) to the consequent (q). In other words, they would consider it sufficient to reason about the p card (if p is associated with q then the hypothesis is true, whereas if p is associated with not-q then the hypothesis is false), and deem the more difficult backward reasoning about the not-q card to be needless, even if they are aware that it is able to falsify the hypothesis. The p & q answer would be the result of the same strategy when the sentence is interpreted as a biconditional; the selection of these two cards could be due to the reading of the biconditional as the conjunction of a conditional with its converse statement. The rarity of the p & not-p selection, the response corresponding to the interpretation of the biconditional as the conjunction of a conditional with its inverse statement, could be due to the well-documented difficulties in reasoning with negations. Further studies will be carried out to test this hypothesis.
References
1. Wason, P.C.: Reasoning. In: Foss, B.M. (ed.) New Horizons in Psychology I. Penguin, Harmondsworth (1966)
2. Evans, J.S.B.T.: Logic and human reasoning: An assessment of the deduction paradigm. Psychological Bulletin 128, 978–996 (2002)
3. Wason, P.C., Johnson-Laird, P.N.: Psychology of reasoning: Structure and content. Penguin, Harmondsworth (1972)
4. Evans, J.S.B.T.: Matching bias in conditional reasoning: Do we understand it after 25 years? Thinking and Reasoning 4, 45–110 (1998)
5. Sperber, D., Cara, F., Girotto, V.: Relevance theory explains the selection task. Cognition 57, 31–95 (1995)
6. Oaksford, M., Chater, N.: A rational analysis of the selection task as optimal data selection. Psychological Review 101, 608–631 (1994)
7. Oaksford, M., Chater, N.: Rational explanation of the selection task. Psychological Review 103, 381–391 (1996)
8. Ramsey, F.P.: General Propositions and Causality. In: Mellor, D.H. (ed.) Philosophical Papers, pp. 145–163. Cambridge University Press, Cambridge (1929/1990)
9. Johnson-Laird, P.N., Byrne, R.M.J.: Conditionals: A theory of meaning, pragmatics and inference. Psychological Review 109, 646–678 (2002)
10. Legrenzi, P.: Relation between language and reasoning about deductive rules. In: Flores D’Arcais, G.B., Levelt, W.J.M. (eds.) Advances in Psycholinguistics. North-Holland, Amsterdam (1970)
11. Margolis, H.: Patterns, thinking and cognition. University of Chicago Press, Chicago (1987)
12. Margolis, H.: Wason’s selection task with reduced array. PSYCOLOQUY 11(005), ftp://ftp.princeton.edu/pub/harnad/Psycoloquy/2000.volume.11/
13. Wagner-Egger, P.: Conditional reasoning and the Wason selection task: Biconditional interpretation instead of reasoning bias. Thinking and Reasoning 13, 484–505 (2007)
14. Gebauer, G., Laming, D.: Rational choice in Wason’s selection task. Psychological Research 60, 284–293 (1997)
15. Osman, M., Laming, D.: Misinterpretation of conditional statements in Wason’s selection task. Psychological Research 65, 121–144 (2001)
16. Johnson-Laird, P.N.: Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness. Cambridge University Press, Cambridge (1983)
17. Green, D.W., Over, D.E., Pyne, R.A.: Probability and choice in selection task. Thinking and Reasoning 3, 209–235 (1997)
Types of Pride and Their Expression

Isabella Poggi and Francesca D’Errico

Roma Tre University, Department of Education Sciences
{poggi,fderrico}@uniroma3.it
Abstract. The paper analyzes pride, its nature, expression and functions, as a social emotion connected to the areas of image and self-image and to power relations. Three types of pride, dignity, superiority and arrogance, are distinguished, their mental ingredients are singled out, and two experimental studies are presented showing that they are conveyed by different combinations of smile, eyebrow and eyelid positions, and head posture. Keywords: pride, social emotion, social signal, facial expression.
1 Introduction

In the last decade a new research area has arisen at the interface between Computer Scientists and Social Scientists: the area of social signal processing. Whereas previous work on signal processing studied physical quantities in various modalities, from 2007 on Pentland [1, 2] launched the idea of analyzing physical signals that convey socially relevant information, such as the activity level during an interaction, or mirroring between participants, and the like. The field of Social Signal processing is now becoming established as the area of research that analyzes the communicative and informative signals which convey information about social interactions, social relations, social attitudes and social emotions.

Among emotions, we can distinguish “individual” from “social” emotions, and within these, three types of them [3]. First, those felt toward someone else; in this sense, happiness and sadness are individual emotions, while admiration, envy, contempt and compassion are social ones: I cannot admire without admiring someone, and I can only envy or feel contempt toward someone, while I can be happy or sad on my own. Second, some emotions are “social” in that they are very easily transmitted from one person to another, like enthusiasm, panic, or anxiety. A third set are the so-called “self-conscious emotions” [4], like shame, pride and embarrassment, which we feel when our own image or self-image, an important part of our social identity, is at stake. They are triggered by our adequacy or inadequacy with respect to some standards and values, possibly imposed by the social context [5], that we want to live up to, and thus they concern and determine our relationships with others.

In Social Signal processing, as well as in Affective Computing, a relevant objective is to build systems able to process and recognize signals of social emotions. In this paper we briefly overview some studies on the emotion of pride, trying to distinguish different types of it, and present two studies on the expression of this emotion aimed at recognizing the three types from the nuances of their display.
2 Authentic and Hubristic Pride

The emotion of pride has traditionally been an object of attention in myth, moral philosophy and religious speculation, more than in psychology. Within the psychological literature, Darwin [6] and Lewis [4] include it among the “complex”, or “self-conscious”, emotions. Different from the so-called “primary” emotions, like joy or sadness, anger or disgust, the “self-conscious” emotions, like shame, guilt and embarrassment, have a less clear universal and biologically innate expressive pattern than the “primary” ones, and can be felt only by someone who has a concept of self, like a child of more than two years, or some great apes, since they entail the fulfilment and transgression of social norms and values.

More recently, Tracy and Robins [7] investigated the nature, function and expression of pride, and distinguished two types of it, authentic and hubristic. Authentic pride, represented in words like accomplished and confident, is positively associated with the personality traits of extraversion, agreeableness, conscientiousness, and with genuine self-esteem, whereas hubristic pride, related to words like arrogant and conceited, is related positively to self-aggrandizing narcissism and shame-proneness. Hubristic pride “may contribute to aggression, hostility and interpersonal problems” (p.148), while authentic pride can favour altruistic action, since the most frequent behavioural responses to the pride experience are seeking and making contact with others. Seen in terms of attribution theory [24], “authentic pride seems to result from attributions to internal but instable, specific, and controllable causes, such as (...) effort, hard work, and specific accomplishments” [8], whereas hubristic pride is felt when one attributes one’s success to “internal but stable, global, and uncontrollable causes” such as “talents, abilities, and global positive traits” [9].

Concerning the adaptive function of pride, Tracy and Robins [7] suggest that its feeling “might have evolved to provide information about an individual’s current level of social status and acceptance” (p.149), thus being importantly linked to self-esteem. They also investigated the nonverbal expression of pride [10] and singled out its constituting elements: small smile, head slightly tilted back, arms raised and expanded posture. They argued that pride and its expression are universal and that their function may be “alerting one’s social group that the proud individual merits increased status and acceptance” [7] (p.149-150).

By adopting a functionalist view of emotions, Tracy, Shariff & Cheng [8] propose that pride serves the adaptive function of promoting high status, and does so because the pleasant reinforcing emotion of pride due to previous accomplishments enhances motivation and persistence in future tasks, while the internal experience, by enhancing self-esteem, informs the individual – and the external nonverbal expression informs others – of one’s achievement, indicating one deserves a high status in the group. While wondering whether the two facets of the emotion of pride, authentic and hubristic, have different adaptive functions, they stick to Henrich & Gil-White’s [25] distinction between two distinct forms of high status that humans are in search of: dominance, to be acquired mainly through force, threat, intimidation, aggression, and prestige, a respect-based status stemming from demonstrated knowledge, skill, and altruism. Tracy et al.
[8] posit that the emotion of hubristic pride and its expression serve the function of dominance, while authentic pride serves the function of prestige, thus being a way to gain a higher status by demonstrating one’s real skills and social and
cooperative ability. To sum up, for Tracy and Robins [7], “Authentic pride might motivate behaviours geared toward long-term status attainment, whereas hubristic pride provides a ‘short cut’ solution, promoting status that is immediate but fleeting and, in some cases, unwarranted”; it may have “evolved as a ‘cheater’ attempt to convince others of one’s success by showing the same expression when no achievement occurred” (p.150).

The view of pride outlined by Tracy et al. [7, 8, 10], with its two contrasting facets and their function, looks interesting and insightful. Yet, their distinction between authentic and hubristic pride suffers from the connotation of their very names: authentic sounds only positive, while hubristic sounds negative and, being contrasted with authentic, seems typically to imply “cheating”. In our view, it is one thing to distinguish types of pride in terms of their very nature, and another to see whether they can be expressed to cheat others (or oneself) about one’s worth. Actually, the two (or more?) facets of pride might all have a positive function, and all might be simulated and used to cheat. But what makes them different is the feeling they entail and the different function they serve in a person’s relationship with others.
3 Superiority, Arrogance and Dignity: Types of Pride and Their Mental Ingredients

In another work, following a model of mind, social actions and emotions in terms of goals and beliefs [7, 8, 11, 13, 16], pride was analyzed in terms of its “mental ingredients”, the beliefs and goals that are represented, whether in a conscious or an unconscious way¹, in a person who is feeling that emotion. In this analysis, some ingredients are common to all possible cases of pride, while others allow one to distinguish three types of pride, that we call “superiority”, “arrogance”, and “dignity” pride. All types of pride share the same core of ingredients:

1. A believes that ((A did p) or (A is p) or (p has occurred))
2. A believes p is positive
3. A believes p is connected to / caused by A
4. A wants to evaluate A as to p
5. A wants to evaluate A as valuable
6. A believes A is valuable (because of p)

These are the necessary conditions for a person to feel proud:
1. an event p has occurred (e.g., A’s party won the elections); or A did an action (she ran faster than others); or A has a property (she is stubborn, she has long dark hair);
2. A evaluates this action, property or event as positive, i.e., as something which fulfils some of her goals;
3. A sees p as caused by herself, or anyway as an important part of her identity. I can be proud of my son because I see what he is or does as something, in any case, stemming from myself; or proud of the good weather of my country because I feel it as my own country. In the prototypical cases of pride A can be proud only of things she attributes to internal controllable causes [10, 11]; but in other cases the action, property or event is simply connected to, not necessarily caused by, A;
4. the positive evaluation refers to something that is part of the self-image A wants to have: something with respect to which A wants to evaluate herself positively;
5. A wants to evaluate herself positively as a whole;
6. the positive evaluation of p causes a more positive self-evaluation of A as a whole: it has a positive effect on A’s self-image.

¹ The hypothesis of the model adopted is that the ingredients may be unconscious, that is, not meta-represented (you have that belief and that goal, but you do not have a meta-belief about your having that belief), but one cannot say that you are feeling that emotion unless those ingredients are there.
Superiority pride. In cases entailing actions or properties a possible ingredient is victory: doing or being p makes you win over someone else, and this implies that you are stronger or better than another. Further, if seen not as a single occurrence but as a steady property, this means you are superior to others:

7. A believes A once has been superior to B with respect to p
8. A believes A is always superior to B with respect to p
You have more power than another as to some p in a specific situation (ingredient 7), and you feel in general superior to others with respect to p (8). Sometimes, if a single fact or capacity is very relevant in your overall judgment of how people should be, believing yourself superior to another as to it can make you believe you are superior to others in general:

9. A believes judgment with respect to p is relevant for overall judgment of people
10. A believes A is in general superior to B

Ingredients 7 – 10 are in a sense the bulk of “narcissism”: a high consideration of one’s capacities and of oneself as a whole, a very positive self-image. If added to ingredients 1 – 6, they make up “superiority pride”, which is typically felt when the event p is an action that makes one win in a competition. But one can also feel superior when event p is simply one’s belonging to a category (a social class, a Nation, a group of people) that one thinks is superior to others. Superiority of an individual over another is relevant for adaptation because in case of competition it allows a more frequent and effective access to resources. But this holds particularly when others are aware of one’s superiority. This leads to the necessity for one who feels superior – in case he also wants his superiority to give him access to resources – to have others know and acknowledge it. In other words, one who is superior often does not only want to evaluate himself positively, but wants others to evaluate him as superior: he does not only want to have a positive self-image, but also to have a positive image before others:
11. A wants B to evaluate A as to p
12. A believes B believes A is valuable (because of p)
Often one is proud of something not only before himself but also before others. Yet, within the “core” ingredients of pride (1 – 6) the goal of projecting one’s positive image to others is not a necessary condition. In this, pride is symmetrical to shame. One is sincerely ashamed before others only if one is ashamed before oneself [14], that is, only if the value one is evaluated against is part not only of the image one wants to have before others but also of the evaluation one wants to have of oneself (self-image). In conclusion, one who feels genuine “superiority pride” is proud of something that others evaluate positively only if one also evaluates it positively.

Arrogance pride. “Superiority pride” is generally felt when, in a competition between people on the same level, one wins in the power comparison and thus becomes superior. But in other cases one is, at the start, on the “down” side of the power comparison; A has less power than B, but does not want to submit to B’s superiority: either he wants to challenge B’s power and possibly become superior, or he does not aspire to superiority, but wants his worth to be acknowledged and not to be considered inferior. We call the former “arrogance pride”, and the latter “dignity pride”. In arrogance the proud one challenges another person or institution having more power than he has, and possibly power over him. Thus he climbs the pyramid of power: he does not acknowledge the other’s power because he claims he has (or has the right to have) more power than the other. Here are the ingredients of “arrogance pride”:

13. A wants to have power over B
14. A believes A can have power over B
15. A wants B believe A can have power over B
A person feeling arrogance pride wants to have power over the other (13), he believes he can do so (14), and further wants the other to know that he can overcome his power (15). But while “superiority pride” sometimes is not even communicated to others (you may feel superior to such an extent that you do not even bother to make others know of your superiority), “arrogance pride” instead, encompassing an ingredient of challenge (15), is by definition communicative. The arrogant communicates: I am not afraid of you, though you claim to have more power than me and even power over me; but since I am superior to you (n.10), I want to have power over you (n.14) and want you to know I have the power thereof (n.15). Sometimes the challenge, at least apparently, does not come from the less powerful, but from the more powerful in a dyad. This is the case with the so-called “arrogance of power”: one who is powerful is arrogant as he abuses his power. For example, a politician from the government who insults an interviewer of a TV channel of the opposing side, or who blatantly violates general rules while displaying his not being subject to any other power. Here the powerful one does something more than he would be entitled to, according to the principle that rules and laws are for people who have no power, while one who has the power can establish rules himself. So even in this case there is, in a sense, a challenge to power: the power of law.
Dignity pride. Let us take the other case of unbalanced power: A is at a lower level than B. If A does not accept his inferiority, he feels “dignity pride”: the pride of human dignity. One who feels this type of pride does not claim to be superior, but not to be inferior either. He claims his right to be treated as a peer, with the same status, same rights and same freedom as the other: he wants his worth as a human being to be acknowledged, and the consequent right to be addressed respectfully and not to be a slave to anybody. One who feels “dignity pride” attributes a higher value to his self-image than to his image, and primarily cares about a self-image of both self-sufficiency and self-regulation. Being self-sufficient means you do not depend on others, since you have all the resources necessary to achieve your goals by yourself; but not being dependent, you also do not want anyone to have power over you; you claim your right to autonomy, i.e. self-regulation: your right to be free.

16. A wants A/B believes A has all the resources A needs
17. A wants A/B believes A does not depend on B
18. A wants A/B believes A has not less power than B
19. A wants A/B believes B has not power over A
20. A wants B believes A has the dignity of a human

A wants to be considered by others and himself as one who has all the resources he needs, i.e. he wants to have an image and self-image of an autonomous person (16), and of one who does not depend on B (17); he wants to be considered as not having less power than B (18), and as not being submitted to B (19): to be acknowledged in his dignity as a human (20). The three types of pride differ in the actual vs. ideal power relation aimed at by the proud person with respect to the other. In dignity, the proud one has less power than the other but wants to be considered equal to him; in superiority, A wants (considers it right) to be considered superior, whether or not he is so; in arrogance, A may be equal or inferior to B, but wants to become superior.
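As a purely illustrative aid, and not part of the authors’ model, the ingredient lists above can be encoded as simple data structures: each pride type is the shared core (ingredients 1–6) plus its type-specific ingredients. The sketch below uses hypothetical Python names and paraphrased ingredient labels.

```python
from dataclasses import dataclass
from typing import Dict

# Paraphrased labels for the shared mental ingredients 1-6 listed in the text.
CORE: Dict[int, str] = {
    1: "A believes A did p / A is p / p has occurred",
    2: "A believes p is positive",
    3: "A believes p is connected to or caused by A",
    4: "A wants to evaluate A as to p",
    5: "A wants to evaluate A as valuable",
    6: "A believes A is valuable (because of p)",
}

@dataclass
class PrideType:
    name: str
    specific: Dict[int, str]  # ingredients added on top of the shared core

    def ingredients(self) -> Dict[int, str]:
        """Full ingredient set: shared core (1-6) plus type-specific additions."""
        return {**CORE, **self.specific}

SUPERIORITY = PrideType("superiority pride", {
    7: "A believes A once has been superior to B with respect to p",
    8: "A believes A is always superior to B with respect to p",
    9: "A believes judgment on p is relevant for the overall judgment of people",
    10: "A believes A is in general superior to B",
    # Ingredients 11-12 (wanting a positive image before others) often accompany
    # superiority pride but, according to the text, are not a necessary condition.
})

ARROGANCE = PrideType("arrogance pride", {
    13: "A wants to have power over B",
    14: "A believes A can have power over B",
    15: "A wants B to believe A can have power over B",
})

DIGNITY = PrideType("dignity pride", {
    16: "A wants A/B to believe A has all the resources A needs",
    17: "A wants A/B to believe A does not depend on B",
    18: "A wants A/B to believe A has not less power than B",
    19: "A wants A/B to believe B has no power over A",
    20: "A wants B to believe A has the dignity of a human",
})

for pride in (SUPERIORITY, ARROGANCE, DIGNITY):
    print(f"{pride.name}: ingredients {sorted(pride.ingredients())}")
```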
4 Different Pride, Different Signals?

As shown by Tracy and Robins [7], the emotion of pride is generally expressed by a small smile, expanded posture, head tilted backward, and arms extended out from the body, possibly with hands on hips. But notwithstanding their attempts they did not find systematic differences in the expressions of “authentic” vs. “hubristic” pride. In this work we present two studies to test whether the three types of pride, superiority, arrogance and dignity pride, can be distinguished based on subtle differences in their facial expression.

4.1 First Study

We conducted an observational study on the expressions of pride in six Italian political debates (six hours in total). After selecting the fragments in which the politicians expressed their pride through their verbal behaviour, we carried out a qualitative
analysis of the multimodal communication that went along with their words, using an annotation scheme that described the signals in various modalities (pauses, voice pitch, intensity and rhythm, gestures, posture, facial expression, gaze behavior) and attributed a meaning to each of them. As argued by Poggi [18], in fact, for body behaviours too, if they are considered signals, it is by definition possible to attach meanings to them, and these meanings, just like those of verbal language, can be subject to introspection and can be paraphrased in words. Hypothesis. Based on this analysis [22], three fragments were selected as prototypical expressions of the three types of pride: in these, dignity pride is characterized by gaze directed at the interlocutor, no smile, no conspicuous gestures, and a serious frown; superiority pride includes gazing down at the other, possibly with slightly lowered eyelids, and either no smile or a smile accompanied by a head canting of ironic compassion, together with a distant posture. Arrogance entails ample gestures, gaze directed at the target, and a large smile, similar to a scornful laugh. We then hypothesized that subjects can distinguish the three types of pride from their expression. Experimental design and procedure. The experimental design is a 3 x 3 within-subjects design, with the independent variables being the facial display (Vendola, Scalfari, Brunetta) and the three types of pride (dignity, superiority, arrogance), and the dependent variable being participants' agreement, measured on a Likert scale, with interpreting the face as a specific type of pride. A forced-choice questionnaire was submitted to 58 participants (all females, to avoid the gender issue; range 18-32 years; mean age 22) with three pictures of speakers in political shows (Nichi Vendola, a former governor of an Italian region; Eugenio Scalfari, the founder of a famous newspaper; and Renato Brunetta, a minister), hypothesized as expressing dignity, superiority and arrogance, respectively; participants were asked to associate each picture with one of three sentences meaning dignity (voglio essere trattato da pari, I want to be treated as an equal), superiority (mi sento superiore, I feel superior) and arrogance (sto lanciando una sfida, I defy you), by expressing their agreement on a Likert scale (1-5). Results. As shown in Table 1, the results confirm the previous qualitative analysis [19]. An ANOVA [F(2, 114) = 14.36, p < .001; η² = .11] confirms significant differences in the meanings assigned to the three different speakers. Post hoc comparisons using the Tukey HSD test indicated that the mean score for "I want to be treated as an equal" is significantly different from "I defy you", while "I am superior" is significantly different from "I want to be treated as an equal". When participants had to assign the meaning of dignity pride, "I want to be treated as an equal", Vendola reached a higher mean (2.95) than both Scalfari (2.58) and Brunetta (2.07). On the other hand, the superiority pride item "I am superior to you" obtained a higher mean (3.69) for Scalfari than for Brunetta (3.64) and Vendola (2.51). The arrogance item "I defy you" was confirmed, even if only slightly, to be associated most frequently with Brunetta (3.29).
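For readers who want to run this kind of analysis on their own ratings, the sketch below shows one possible way to carry out a one-way repeated-measures ANOVA followed by a Tukey HSD post hoc test in Python. The data frame, the column names and the numbers are invented placeholders, not the study's data, and the original analysis may well have been carried out with different software; the Tukey step also treats the displays as independent groups, which simplifies the within-subjects design.

```python
# Illustrative sketch on invented Likert ratings (1-5); not the study's data.
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# One rating per participant for each facial display (within-subjects factor).
ratings = pd.DataFrame({
    "participant": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "display": ["vendola", "scalfari", "brunetta"] * 4,
    "agreement": [4, 2, 2, 3, 3, 1, 5, 2, 2, 4, 3, 2],
})

# Repeated-measures ANOVA: does mean agreement differ across the three displays?
anova = AnovaRM(data=ratings, depvar="agreement",
                subject="participant", within=["display"]).fit()
print(anova)

# Tukey HSD post hoc comparison of the three displays.
tukey = pairwise_tukeyhsd(endog=ratings["agreement"],
                          groups=ratings["display"], alpha=0.05)
print(tukey.summary())
```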
Limitations. A relevant limitation of this study was that the three pictures showed three well-known figures of Italian politics: answers might thus have been biased by previous knowledge of the politicians themselves and/or by their political position (e.g., their belonging to the majority or to the minority with respect to the government in office), as well as by the subjects' own political preferences. The reason why we used precisely these pictures was the need to test the results of our previous qualitative analysis [22]. Yet this first study did not allow us to isolate which specific aspects of the politicians' multimodal communication were responsible for the perception of the subtypes of pride; a further limitation is that the order of the politicians' pictures was not randomized. For these reasons, we performed the second study.

Table 1. Perception of different types of pride: mean agreement (Likert scale 1-5) with the items "I want to be treated as equal", "I feel superior", and "I am defying you" for each speaker (Brunetta, Scalfari, Vendola).
4.2 Second Study The first study led us to hypothesize that the three types of pride are conveyed by three different facial expressions. The next step was to single out which specific aspects of the face point to each of the three types. In view of a systematic investigation through a more refined experimental design, we ran an exploratory study focused on the role of smile and of eyebrow position. Hypothesis. The goal of the second study was to test whether different patterns of frown and smile (taken as independent variables) distinguish the three types of pride (dignity, arrogance, superiority). As to the variable eyebrow position, we expected that: 1. a frown (vs. asymmetrical eyebrows and absence of frown) directs interpretation towards dignity pride; 2. asymmetrical eyebrows (vs. frown and absence of frown) direct it towards superiority pride; 3. no frown (vs. asymmetrical eyebrows and symmetrical frown) directs it towards arrogance pride.
As to smile, we expected that: 1. a smile (vs. no smile) directs interpretation towards arrogance, while 2. no smile (vs. smile) directs it towards dignity or superiority. Experimental design and participants. The bifactorial design is a 3 x 2 between-subjects design, with the two independent variables being eyebrow position (frown, no frown, asymmetrical eyebrows) and smile (present or absent), and the three dependent variables being the perceived types of pride (dignity, superiority, arrogance). The questionnaire was submitted to 58 subjects (females, range 18-32 years, mean age 22). Figures 1 and 2 show the pictures submitted to the participants, presented in random order to avoid task-learning effects. Procedure: Embodied Conversational Agents as a tool for emotion research. The procedure of this study uses a conversational agent [26] to manipulate the variables related to the different interpretations of the meaning of emotions. Conversational agents are tools that certainly do not reach the ecological validity of human faces, but they offer the opportunity to manipulate variables in a precise way through the FAPs (Facial Animation Parameters), by controlling different parts of the facial expression and holding constant the variables that are not under investigation or that could interfere with the recognition of meaning. This method, rarely used in psychology, seems promising and favors a reliable interpretation of emotional meanings. By manipulating the variables above we constructed a multiple-choice questionnaire of six items. Using the face library of the Embodied Conversational Agent Greta [26] (a useful tool for building pictures that, unlike frames of real videos or actors' posed photographs, allows one to set the FAPs of mouth, eyes, eyebrows, eyelids and head movements very precisely), we combined the three eyebrow positions (frown, no frown, asymmetrical eyebrows) with the two smile conditions (present or absent), resulting in six facial-expression items. For each eyebrows-smile pattern we made a hypothesis about its meaning (dignity, arrogance or superiority pride), still under the assumption that the meanings can be consciously retrieved and phrased in words. Finally, for each item we constructed a multiple-choice question including the verbal phrasing of the hypothesized meaning and two distractors. Distractors were progressively more distant from the target meaning, with the most extreme one opposite to it. To test our main hypotheses, for each face resulting from the combination of the two chosen variables we proposed three verbal phrasings of the concepts of dignity, superiority and arrogance pride, respectively (I don't submit to you = non mi sottometto a te; I am superior to you = sono superiore a te; I will win over you = avrò la meglio su di te); for each, participants expressed their agreement on a Likert scale (1-5).
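Purely to make the 3 x 2 structure of this design explicit, the sketch below enumerates the six stimulus cells and the three pride phrasings attached to each. The labels are taken from the text, but the code is a hypothetical illustration, not part of the study's tooling, and Greta's actual face-library interface is not shown.

```python
# Hypothetical enumeration of the 3 x 2 design cells described above.
from itertools import product

eyebrow_positions = ["frown", "no frown", "asymmetrical eyebrows"]
smile_conditions = ["smile present", "smile absent"]

# Verbal phrasings of the three pride types used as response items.
pride_items = {
    "dignity": "I don't submit to you (non mi sottometto a te)",
    "superiority": "I am superior to you (sono superiore a te)",
    "arrogance": "I will win over you (avrò la meglio su di te)",
}

# Each cell is one Greta face; every face is rated on each item (Likert 1-5).
for face_id, (brows, smile) in enumerate(
        product(eyebrow_positions, smile_conditions), start=1):
    print(f"Face {face_id}: {brows}, {smile}")
    for pride_type, phrasing in pride_items.items():
        print(f"  rate agreement with: {phrasing}  [{pride_type} item]")
```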
Fig. 1. Greta's faces in the presence of a smile, combined with different eyebrow positions (asymmetrical eyebrows, no frown, frown)
Fig. 2. Greta's faces in the absence of a smile, combined with different eyebrow positions (frown, no frown, asymmetrical eyebrows)
Results. The results obtained from the questionnaire seem to confirm our hypotheses, even though for some conditions the distractors may have caused some problems for data interpretation. A MANOVA highlights a main effect of eyebrow position [F(2, 57) = 81.95; p < .001; η² = .11] on the meanings attributed to Greta's expressions. Table 2 shows the positive and negative polarizations of meaning resulting from the presence or absence of smile. Analyzing the pattern of eyebrow positions (frown, asymmetrical eyebrows, absence of frown), we may notice that the asymmetrical eyebrows without smile can be interpreted both as superiority pride ("I am superior to you", 3.64) and as dignity pride ("I don't submit to you", 3.60), while in the frown-with-smile condition the meaning of dignity (3.12) is very close to other meanings such as "I am resolute" (2.88), "I want to humiliate you" (2.84) and "I have won" (2.67). In the frown-without-smile condition, instead, the leading meanings of worry (3.86) and dignity pride ("I don't submit to you", 2.69) are followed by meanings of satisfaction (2.60), defiance (2.36) and amusement (2.16). In the no-frown condition, smile seems to direct the meaning towards an idea of power – "I am resolute" (3.03) and "I will win over you" (2.79) – while when smile is absent, discontent (3.66) and a sense of failure (3.28) prevail.
Table 2. Eyebrows × Smile: mean agreement (Likert scale 1-5) with the candidate meanings (e.g., "I do not submit to you", "I am superior to you", "I will win over you", "I am resolute", "I am worried", "I am satisfied") for each of the six Greta faces, i.e., each combination of eyebrow position (frown, asymmetrical, no frown) and smile (present, absent).
Let us now take into account the items constructed on the basis of our hypotheses. We consider the expression "I don't submit to you" as the dignity pride item, "I am superior to you" as the superiority pride item, and "I will win over you" as the arrogance pride item. As emerges from the MANOVA (Table 3), different eyebrow positions correspond significantly to different meanings of pride [F(2, 57) = 53.30; p < .001; η² = .11]. Compared to arrogance and superiority pride, the frown is interpreted primarily as dignity pride, "I don't submit to you" (2.69); according to the post hoc comparison with the Tukey HSD test, this eyebrow position differs significantly from the asymmetrical eyebrows condition (3.60), for which the dignity mean is also high. The asymmetrical-eyebrows face is oriented towards the superiority item "I am superior to you" (3.64), and the post hoc Tukey test shows that "I am superior" differs significantly from both the frown condition (1.74) and the no-frown condition (2.24). The item of arrogance pride, "I will win over you", shows a higher mean in the no-frown condition and, unexpectedly, also in the asymmetrical condition (2.79), but it differs most from the frown condition (2.07), and this difference is also supported by Tukey's test. This last result seems congruent with our hypothesis according to which the absence of frown is linked to a "frontal" (and in a sense "amused") kind of defiance; this in turn offers insights into the ironic nuances of the asymmetrical frown and, on the other side, into a possible link between irony and arrogance.

Table 3. Main effect of eyebrow position: mean agreement with the items "I do not submit to you", "I am superior to you", and "I will win over you" in the frown, asymmetrical, and no-frown conditions.

As for the manipulation of smile, no significant differences emerged from the MANOVA, probably also because in some conditions there were fewer than three cases. We therefore only present a descriptive analysis to better understand the effect of smile on the perception of pride. We can observe that smile, possibly interpreted as an ironic smile, reaches its highest mean (2.79) when associated with the choice "I will win over you", and this might confirm our hypothesis of smile as a signal of arrogance (Table 4). The absence of smile, on the other hand, is associated with dignity pride – "I don't submit to you" – and with superiority pride – "I am superior to you" (3.15 and 3.01, respectively). These results shed some light on the different roles of smile in pride expressions, detailing the hypothesis on the prototypical expressions of pride and allowing a more complex analysis of the pride display.

Table 4. Main effect of smile: mean agreement with the items "I do not submit to you", "I will win over you", and "I am superior to you" when smile is present and when it is absent.
The MANOVA shows that no interaction effect is significant. Descriptive results nonetheless illustrate a tendency consistent with our hypotheses on the different types of pride: the combination of the two variables goes in the direction of the main hypotheses (Table 5): frown without smile is interpreted as dignity pride (2.69), no frown with smile as arrogance pride (2.79), and asymmetrical eyebrows without smile as superiority pride (3.64).
Table 5. Combinations of eyebrow position and smile (three main hypotheses): mean agreement with the items "I do not submit to you", "I will win over you", and "I am superior to you" in the frown-without-smile, no-frown-with-smile, and asymmetrical-without-smile conditions.
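To make the 3 x 2 analysis concrete, the sketch below runs an ordinary two-way ANOVA, with interaction, on a single dependent item in Python. All numbers are invented placeholders; the study itself used a MANOVA over several dependent items, which this simplified univariate example does not reproduce.

```python
# Illustrative two-way ANOVA on invented data (not the study's data).
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# One row per participant: eyebrow condition, smile condition, and the 1-5
# agreement rating for a single item (e.g., "I will win over you").
df = pd.DataFrame({
    "eyebrows": ["frown"] * 4 + ["no_frown"] * 4 + ["asymmetrical"] * 4,
    "smile": ["yes", "yes", "no", "no"] * 3,
    "rating": [2, 2, 3, 2, 4, 3, 2, 2, 3, 3, 4, 3],
})

# Two-way factorial ANOVA with the eyebrows x smile interaction term.
model = ols("rating ~ C(eyebrows) * C(smile)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```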
Limitations. The second study sheds light on the role of these variables in the expression of pride, with an emphasis on facial expression, but further studies are needed and some limitations must be acknowledged. First, although Embodied Conversational Agents are an innovative tool for emotion research, further checks are needed to ascertain their ecological validity with respect to human faces. Second, the selected variables showed a trend attributing a relevant meaning to smile in pride, but its role was not entirely clear. A limitation may lie in the fact that we chose only two levels for smile (absence vs. presence); a subsequent study will therefore further investigate the role of smile by varying its levels and considering absence, a small smile, and a wide smile. The third limitation is that the forced-choice questions may have influenced the participants' answers. The multiple-choice test in this procedure included additional items to be selected by participants, and in two cases the meanings of the distractors reached a higher mean than the pride items (in particular the answer "I am worried" in the asymmetrical-eyebrows-without-smile condition, and the answer "I am satisfied" in the frown-without-smile condition). Future studies will be extended to a larger number of participants in order to obtain a more robust analysis.
5 Conclusion Pride is a positive emotion that we feel when we have a very positive evaluation of ourselves, due to our own achievements or positive qualities. The emotion of pride is strictly connected to a person's identity, and it has relevant effects on how one sees oneself and, consequently, on how others see him, thus importantly determining one's relationships with other people. Pride is also linked to the area of power comparison, and it conveys power relations: by expressing pride you claim that you are superior, or at least not inferior, to the other, and you refuse to submit or you challenge the other's power.
Three types of pride - dignity, superiority and arrogance - can be felt, and their expressions are distinguished by subtle differences in eyebrow position and smile. Subsequent work will study differences among three levels of smile (no smile, small and large smile) and further test other aspects of facial expression, such as eyelid openness and head position. The second study also tested the use of conversational agents as a research tool for studying the attribution of emotions and other meanings to facial expressions; they seem to be a good tool for a first manipulation phase, but to obtain tests with higher ecological validity, subsequent studies are needed to investigate how far the meanings correspond to those of real human faces. If these and other aspects of the bodily expression of pride prove to be systematically distinctive in manifesting its different types, signal processing systems for the recognition and interpretation of signals will be able to take these subtle differences into account for the detection of social emotions, social relations and social interaction. Acknowledgments. This research is supported by the 7th Framework Program, European Network of Excellence SSPNet (Social Signal Processing Network), Grant Agreement Number 231287. Acknowledgement is also due to the Project "Cognitive Agents and Social Simulation" at ISTC-CNR of Rome.
References 1. Pentland, A.: Social signal processing. IEEE Signal Processing Magazine 24(4), 108–111 (2007) 2. Pentland, A.: Honest Signals: how they shape our world. MIT Press, Cambridge (2008) 3. Poggi, I., D’Errico, F.: Social Signals: a Psychological Perspective. In: Salah, A.A., Gevers, T. (eds.) Computer Analysis of Human Behavior, pp. 185–226. Springer, Heidelberg (2011) 4. Lewis, M.: Self-conscious emotions: Embarrassment, pride, shame, and guilt. In: Lewis, M., Haviland-Jones, J.M. (eds.) Handbook of Emotions, 2nd edn., pp. 623–636. Guilford Press, New York (2000) 5. Castelfranchi, C., Poggi, I.: Blushing as a Discourse: Was Darwin Wrong? In: Crozier, R. (ed.) Shyness and Embarrassment. Perspectives from Social Psychology, pp. 230–251. Cambridge University Press, New York (1990) 6. Darwin, C.: The Expression of the Emotions in Man and Animals. Appleton and Company, New York (1872) 7. Tracy, J.L., Robins, R.W.: Show your pride: Evidence for a discrete emotion expression. Psychological Science 15, 194–197 (2004) 8. Tracy, J.L., Shariff, A.F., Cheng, J.T.: A Naturalist’s View of Pride. Emotion Review 2(2), 163–177 (2010) 9. Gladkova, A.: A Linguist’s View of Pride. Emotion Review 2(2), 178–179 (2010) 10. Tracy, J.L., Robins, R.W.: The prototypical pride expression: development of a nonverbal behavior coding system”. Emotion 7, 789–801 (2007) 11. Conte, R., Castelfranchi, C.: Cognitive and social action. University College, London (1995)
12. Castelfranchi, C.: Micro-Macro Constitution of Power. ProtoSociology, International Journal of Interdisciplinary Research 18-19, 208–265 (2003) 13. Poggi, I., D’Errico, F.: Dominance signals in debates. In: Salah, A.A., Gevers, T., Sebe, N., Vinciarelli, A. (eds.) HBU 2010. LNCS, vol. 6219, pp. 163–174. Springer, Heidelberg (2010) 14. Frijda, N.H.: The emotions. Cambridge University Press, Cambridge (1986) 15. Castelfranchi, C.: Affective appraisal versus cognitive evaluation in social emotions and interactions. In: Paiva, A. (ed.) Affective Interactions. Springer, Berlin (2000) 16. Scherer, K.: Handbook of Affective Sciences. Oxford University Press, Oxford (2003) 17. Poggi, I.: Types of emotions and types of goals. In: Proceedings of the Workshop AFFINE: Affective Interaction in Natural Environment, Proc. ICMI 2008, Chania, Crete, September 24 (2008) 18. Poggi, I.: Mind, hands, face and body. In: Goal and Belief View of Multimodal Communication. Weidler, Berlin (2007) 19. Castelfranchi, C., Miceli, M.: The cognitive-motivational compound of emotional experience. Emotion Review 1, 223–231 (2009) 20. Poggi, I., D’Errico, F.: The mental ingredients of Bitterness. Journal of Multimodal User Interface 3, 79–86 (2009) 21. Castelfranchi, C., Guerini, M.: Is it a Promise or a Threat? Pragmatics & Cognition Journal 15(2), 277–311 (2007) 22. Poggi, I., D’Errico, F.: Pride and its expression in political debates. In: Paglieri, F., Tummolini, L., Falcone, R., Miceli, M. (eds.) The Goals of Cognition, Festschrift for Cristiano Castelfranchi, London College Publications, London (forth.) 23. Peters, C., Pelachaud, C., Bevacqua, E., Ochs, M., Ech Chafai, N., Mancini, M.: Towards a Socially and Emotionally Attuned Humanoid Agent. In: Esposito, A., Bratanic, M., Keller, E., Marinaro, M. (eds.) Fundamentals of Verbal and Nonverbal Communication and the Biometric Issue. NATO HSD EAP ASI 982256, vol. 18. IOS Press, Amsterdam (2007) 24. Weiner, B.: An attributional theory of achievement motivation andemotion. Psychological Review 92, 548–573 (1985) 25. Henrich, J., Gil-White, F.J.: The evolution of prestige: Freely conferred deference as a mechanism for enhancing the benefits of cultural transmission. Evolution and Human Behavior 22(3), 165–196 (2001) 26. Bevacqua, E., Mancini, M., Niewiadomski, R., Pelachaud, C.: An expressive ECA showing complex emotions. In: Language, Speech and Gesture for Expressive Characters, AISB 2007, Newcastle, UK (2007)
People's Active Emotion Vocabulary: Free Listing of Emotion Labels and Their Association to Salient Psychological Variables Vanda Lucia Zammuner University of Padova, D.P.S.S., Via Venezia, 8, 35100 Padova, Italy [email protected]
Abstract. The study is on the 'working emotion vocabulary', i.e., words easily accessed when people are asked to list emotions. Participants (N 1146, 65.9% women, 15-30 year-olds), in an on-line task, listed 621 distinct words; 21 words were listed by 10%-65% (including joy, happiness, sadness, fear, anger, by 50% at least), 93 by 2%-9% (not including 'errors’, e.g., naming eliciting events), 507 by 1%. In sum, most listed words did refer to emotions and showed great variability. Women supplied more 'Correct' Emotion Words (CEW) than men. The active (CEW) and the 'passive' vocabulary (e.g., ability to recognize synonyms of emotion targets) were uncorrelated. Production of negative (4,95) and positive (3,76) CEW was significantly associated with emotion-related abilities and traits – e.g., recognition of facial expressions of emotions, expressive transparency, awareness of emotions, life satisfaction, loneliness, alexithymia and health. The results have implications for emotion communication and understanding. Keywords: Emotion words and concepts, free-listing, Italian language, lexicon, emotional traits and abilities.
1 Introduction In our lives we encounter a variety of events, many of which elicit in us an emotional experience. In addition to being expressed by a variety of nonverbal signals, such as facial, postural and vocal expressions [1-3], emotional experiences are in many cases labelled using emotion-related words or expressions - e.g., labelling the experience for oneself [4-7], when sharing it in everyday interactions [1, 4-5], or when trying to understand emotional episodes one observes or analyzes [8]. In all these cases, labelling might refer to either a discrete emotion (e.g., anger, joy), or a component aspect of it, including event appraisals (as in I can't face this or it will spoil my hopes), physiological responses (as in I blushed or I was paralysed), tendencies to act and actual behaviours (as in I feel like crying, I yelled at her), regulation attempts (as in I counted to ten before answering him or to calm down I tried to think of something else), causal attributions (as in he did it on purpose), or to yet other aspects of the experience.
The experimental research to be presented focussed on discrete emotion labels. The study of emotion labels, i.e., of the emotion lexicon, is crucial in many respects. It helps us to understand: (a) how people conceptualize the emotion domain (e.g., the literature shows that it is in terms of ‘families’ of emotions characterized by fuzzy boundaries [7, 9]); (b) the extent to which emotion concepts are shared by members of a culture [10-11]; (c) developmental acquisition trends [12-13]; and (d) intercultural differences [10,14-15]. Furthermore, good empirical databases of emotion-word usage, possibly in several languages [7, 9, 16-17] are necessary to understand emotion-related communication, such as that between people and information technology tools, to avoid misperceptions, to help develop friendly, emotionallytoned exchanges [1, 3, 8, 18]. However, the understanding of emotion-label usage is complicated by the fact that most languages include a huge number of emotion terms [7, 15, 17, 19-20]. Thus, we might want to know which words people preferably use to label emotions, which words are most and least frequently used in a given language, which words refer to culturally shared meanings, rather than to idiosyncratic ones, and which words are understood and used most easily and correctly.
2 Free Listing of Emotion Words To address the questions listed above, the present study focussed on the 'working emotion vocabulary', i.e., words that most easily come to mind, are recalled, when people are asked to quickly supply emotion labels, e.g., in a minute or two at most. An important and open question in studies of the emotion lexicon is whether we assume that words label discrete, distinguishable, distinct emotions (or emotional experiences), or whether we consider words simply as communicative tools, approximate descriptions of the 'gist' of an emotional experience (or even of selected aspects of it) as conceptualized by the individual using the words [6]. My assumption is that actual emotions do not coincide with their linguistic designation. However, linguistic labels are functional tools, useful means to refer to one's own and others' experiences, allowing us to provide others, if we so desire, (more or less) 'generic' descriptions of emotions, and to understand others’ use of words [6, 14, 17, 19]. To illustrate, people use (and understand) the word sadness to label an emotionally 'sad' experience, whatever 'sad' is taken to mean by this or that person in this or that cultural context - e.g., a fully scripted sadness experience, that includes tears, apathy, isolation, etc. [7] or simply a generic low-tone mood, or something else yet [13, 19, 21]. An undisputed assumption, instead, is that the task of listing emotion words implies that people will access both their emotion knowledge, their concepts of emotional experiences, and their linguistic knowledge, and will perform, either way, an emotion-word ‘matching task’ on the basis of some criteria. 2.1 Method To understand the 'working emotion vocabulary' issues discussed above, within a larger research project on emotion-related competencies and their correlates that collected data by means of online procedures [22], a large sample of Italian people
( N = 1.146, 65,9% women) was asked to free list 10 emotion words, writing them in ten numbered positions. Participants were mostly young adults in their twenties - age range: from 15 to 34 years; participants were later grouped into five age groups - 1520 years: 8,4%; 21-22 years: 16,8%; 23-24 years: 30,1%; 25-26 years: 23,8%; 27 or more years: 20,9%. To obtain unbiased results, the free listing task was the very first participants did, i.e., before they saw other test sections that included emotion names. Although in previous studies [9, 10, 13-14, 23] people were free to list as many emotions as they could think of, and/or within a pre-defined time interval (1, 2, or 5 minutes), this study set a 10 words limit in order to probe the 'working emotion vocabulary' of participants, so that they would list truly ‘easily activated’ emotion concepts. Data were collected from January 2009 to July 2010; 3 to 30 participants did not list all 10 words (writing something else; e.g., I don't know). Participants answered other lexicon-based tasks too, both production (e.g., Write a word referring to an emotion that is more (or less) intense than … target) and recognition ones (e.g., Which word is most similar to … target). Participants were tested also on a variety of emotionrelated abilities and traits (e.g., recognition of emotion expressions; emotion awareness) and on criteria variables that measured subjective well-being [22, 24]. 2.2 Hypotheses The most general hypothesis of this study was that people’s usage of the emotion lexicon is likely to be idiosyncratic at least to some extent, given the huge number of emotion words in Italian [17, 23], as it happens in many languages [9, 14-15, 19], the variety of emotional experiences people might be knowledgeable about [2, 4, 7, 11, 13, 20-21], and the cognitive processes (e.g., what information is active in memory) involved in a production task [7, 10, 17]. More specific hypotheses of the study were the following. Participants were expected to produce: (a) 'summary' words, i.e., words labelling higher order concepts referring to so-called ‘basic emotions’ [7, 9] (e.g., the words joy, anger, fear; individual Italian words are here reported in their closest English translation), (b) words labelling more specific concepts, designating so-called ‘complex’ or 'blended' emotions (e.g., pride, jealousy, anxiety), as well as (c) words and phrases that refer to parts of an emotion experience (e.g., feeling hot), to its antecedent event (e.g., listening to music), and to emotion-related dispositions and values (e.g., personality traits, such as sociability). Further, (d) both overall frequency and order of word production (i.e., at the aggregate, group level) were expected to reflect the distinction between emotion words (emotion concepts) that are typically most and least accessible ones. At the individual level, however, this distinction is likely to be mediated by individual differences in knowledge of the emotion lexicon, as well as in recently experienced emotions, current mood state, and so forth. Finally, (e) the nature of listed emotion words (i.e., valence of the designated experience; e.g., positive emotion words) at the individual level was expected to provide cues on what accounts for word accessibility in memory, especially as regards individual characteristics likely to affect task performance - e.g., emotion-related traits (emotional competence), level of wellbeing, and personal variables such as gender and age.
2.3 Analyses of Free Lists into Conceptually Based Word-Categories Let me mention already here that participants produced a huge variety of words and phrases (N = 882), in several grammatical forms and variants (verbs, nouns, adjectives, etc.) including spelling mistakes - see Table 1 for lists of most frequently listed words; a similar variety was obtained by [15] with 30 Japanese and English subjects, who however had 20 minutes for the task. The following are three examples of typical individual lists: Love, joy, surprise, passion, jealousy, serenity, envy, melancholy, anxiety, boredom; To love, enjoy, hate, admire, esteem, envy; happiness, feeling piety; Serenity, friendship, fear, sympathy, haste, happiness, sadness, nostalgia, making love, embarrassment. Several types of analyses, both qualitative and quantitative ones, were performed on the collected data.

Table 1. Most frequent emotion words, and total number of distinct words, listed within four order positions P (% frequency; N participants = 1.147)

1st P: joy 19,8; happiness 15,7; love 14,0; anger 9,2; cheerful 4,3; fear 4,1; serenity 3,9; anxiety 2,3; sadness 2,2; hope 1,3; surprise 1,2; boredom 1,0; passion 1,0; tiredness 1,0; disappointed 0,9; melancholy 0,8. N distinct words: 133
3rd P: sadness 9,2; joy 8,2; anger 7,6; fear 7,0; happiness 6,2; love 4,5; anxiety 3,1; cheerful 2,4; serenity 2,4; surprise 2,4; hope 2,4; melancholy 2,0; jealousy 1,9; pain 1,7; disappointed 1,3; nostalgia 1,0. N distinct words: 201
7th P: fear 5,1; sadness 4,5; anger 4,0; anxiety 3,8; joy 3,4; nostalgia 2,9; surprise 2,8; love 2,6; jealousy 2,6; happiness 2,5; disappointed 2,4; cheerful 2,3; hope 2,3; shame 2,1; disgust 1,9; serenity 1,9. N distinct words: 231
10th P: fear 4,1; nostalgia 3,3; anger 3,1; sadness 3,0; disappointed 2,9; joy 2,6; cheerful 2,5; anxiety 2,5; happiness 2,1; jealousy 2,0; love 1,7; remorse 1,7; surprise 1,7; envy 1,6; boredom 1,6; hope 1,6. N distinct words: 289
A first grammatical and morphological analysis of produced lists was performed by (a) checking for spelling mistakes (e.g., hatte instead of hate) and correcting them, and (b) finding grammatical variants (e.g., joy, joyful; love, in love, madly in love, falling in love; fury-furious-get furious) and grouping them. The results of this analysis showed that participants listed a total of 621 distinct words and phrases. Needless to say, individual lists contained 10 distinct words and phrases at most. To assess the psycho-cultural salience of emotion labels, i.e., how focal, important, or prominent they are for participants, a second analysis computed the frequencies of listed words across lists and order positions. A third analysis focused on word frequencies in each of the 10 order positions across lists (see Table 1).
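As an aside, the kind of frequency tallies just described (overall and per order position) can be illustrated in a few lines of code. The sketch below uses a toy set of lists and invented variable names; it is not the data-handling pipeline actually used in the study.

```python
# Illustrative tally of free-listed emotion words (toy data, not the study's).
from collections import Counter

# Each participant's free list: up to 10 words in the order they were produced.
free_lists = [
    ["joy", "happiness", "anger", "fear"],
    ["love", "joy", "sadness", "anxiety", "nostalgia"],
    ["happiness", "anger", "fear", "sadness"],
]

# Overall frequency of each word across all lists and positions.
overall = Counter(word for lst in free_lists for word in lst)

# Frequency of each word within a given order position (0-based index).
def position_counts(lists, position):
    return Counter(lst[position] for lst in lists if len(lst) > position)

print(overall.most_common(5))
print(position_counts(free_lists, 0))  # words produced first
print(position_counts(free_lists, 2))  # words produced third
```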
In order to examine in greater depth - as many suggest is necessary [1, 8, 27-29] the structure and content of the active emotion lexicon, and therefore of labelled emotion concepts, a semantic analysis of individual words and individuals' productions was performed. Distinct words and phrases were carefully analyzed for their meaning and coded into 50 conceptual higher-order categories, by means of various data-coding phases as detailed below. Forty-four of the 50 higher-order conceptual categories referred to emotions proper1, i.e., to experiences that emotion theorists consider as somewhat distinct, including not only 'basic emotions' such as fear and joy, but also so-called 'complex' emotions such as apathy, anxiety, boredom, contempt, doubt, jealousy, shame, serenity [7-9, 11, 15, 20]. The coding procedure was developed on the basis of theoretical and empirical accounts of discrete emotional experiences, as well as of accounts of 'families' of emotions, as defined by, for instance [7, 13, 25-27]; a similar procedure was used in previous studies to classify and sort adolescents' and adults' free lists [23, 13]. Each of the 44 categories was further coded for the valence (hedonic tone) of the experience it referred to, i.e., as positive (e.g., joy, pride), negative (e.g., fear, disgust), or neutral (e.g., surprise, interest, emotion). To allow comparisons between types of emotion referred to by participants, the 44 emotion categories were later analysed and coded, on the basis of theoretical and empirical accounts of their semantic and experiential similarity, into more abstractlevel categories. First, the 44 categories were grouped into the following 9 emotion type categories: Joy, Love, Surprise, Anxiety, Fear, Apathy, Sadness, Anger, and Emotion - the latter, actually referring to the super-ordinate domain category, was necessary to code such listed words as emotion, ambivalent, emotional, being moved, etc. Each category, as stated, grouped several sub kinds or subtypes of emotion. For instance, Joy included joy, calm, serenity, satisfaction, pride, hope, and their lexical variants. These 9 categories were further grouped into 5 'basic' emotion categories, i.e., grouping together Love with Joy, Anxiety with Fear, Apathy with Sadness [7, 25], plus the super-ordinate Emotion category as a sixth one (Table 2). Of the 50 coding categories, six grouped words that referred to one of the following: (1) ‘parts’ or consequences of emotional experiences - e.g., physiological and expressive reactions, like heart beating, blushing, smiling, crying, hot, cold, suffocating, yelling; (2) cognitive conditions - e.g., attentive, imagination, introspection; (3) eliciting events - e.g., driving, making love, kissing, difficulties, graduating, having a child, meeting friends; (4) personality traits, dispositions and modes of being or acting - e.g., autonomy, fantasy, sensitivity, openness, humour, independence, cooperative, energetic, stubborn, courageous, self-criticism, egoism, creativity, cruelty, fragility, rationality, instability; (5) values - e.g., honesty; and, finally, (6) the category ‘other’, coding items that could not unambiguously be assigned to any category - e.g., waiting, eternity, or distance could refer to a sentiment, an antecedent event or situation, or a behaviour; company could be an antecedent, but is ambiguous. 
The latter 6 categories were considered ‘errors’ in that they did not refer to “an emotion”, although most of them are related, in different ways, to emotional experiences - e.g., blushing is often an expression of either embarrassment or shame; introspection might help understand why one felt a certain emotion; energetic might be associated with positive emotions; and so forth. Note that similar ‘errors’ are
produced whenever people engage in a free-listing type of task (e.g., such errors occurred in the free lists obtained by [9, 23-24]), as well as, more generally, when people describe their own or others' emotional experiences [4, 5, 7, 13, 21]. Although most authors do not distinguish 'errors' from emotion words (e.g., consider arousal, frown, laughter, heart, violence, vulnerability, protective, all listed in [9]), nor comment upon their meaning or significance, the distinction between 'errors' and 'proper' emotion words is both necessary to adequately describe ‘proper’ emotion labels in lexicon usage, and useful to understand better - on the basis of the information errors supply - the conceptual organization of emotion knowledge. All productions coded in terms of the 44 categories were considered correct listings of emotion words, whereas all productions of the 6 categories were considered ‘errors’. Frequencies of the various coding categories detailed above were computed. Analyses of variance, with gender and age as independent variables, were performed on word listings coded into correct and error, and on correct words coded into the three valence categories. The three valence categories, and correctness of production, were further analysed to test the extent to which they correlated with emotion-related abilities, traits and perceptions. 2.4 Analyses of Correlates of Types of Free Listed Words A final set of analyses – correlational ones, computing Pearson’s r, was performed to check whether, and the extent to which, correctness of production on the one hand, and the nature of free listed words, i.e., their valence, on the other hand were related to participants' scores on a number of ability, traits and well-being variables [22], namely: (a) two emotion-lexicon based ability recognition tasks, i.e., recognizing Synonyms and Antonyms of a target emotion, (b) the ability of Recognition of Facial expressions of emotion, (c) the psychological emotion-related traits of Expressive transparency, Awareness of felt emotions, Alexithymia, and use of Suppression and Reappraisal regulation strategies, and (d) criterion variables that assessed subjective perception of well-being, i.e., Life satisfaction, perceived Emotional Loneliness and Social support, Felt Positive and Negative emotions, and Psycho-physical well-being or perceived Health level (see [22] for references on these measures).
3 Results: Frequencies and Order of Listed Words As stated, participants listed a total of 621 distinct words and phrases. Of the 621 distinct words, 5 were listed by 50% at least - joy 65%, anger, happiness, sadness, fear 51%; another 16 words were listed by at least 10% - e.g., love, anxiety, surprise, disgust, melancholy, shame, pain; 93 more words - e.g., envy, passion - were produced by 2%-9%. In sum, only 18,3% of all words were individually produced by at least 20 people, supporting the hypothesis of a great variability in what constitutes the active emotion lexicon. Finally, 507 words or phrases were listed by 10 or less people (63 words by 5 to 10, 444 by 1 to 5 people), that is, by 1% or less of the sample – e.g., irritation, interest, grudge. All ‘errors’ (with 2 exceptions), appeared within this group of extremely infrequent productions, and mostly referred to antecedent events and personality traits – but all error categories were present.
Analyses of variance of listed words, coded for their correctness (with reference to the 50 semantic categories), showed that on the average people listed about nine correct emotion words, with women (Mean 9,22, sd 1,39) performing significantly better than men (8,82, sd 1,91; F (1, 1143) 8,99, p < .001), and made one 'error', i.e., listed words and phrases not referring to "an emotion". Another set of analyses of variance showed that words referring to negative emotions, congruently with lexicon-based Italian frequencies (e.g., [17]), were more frequent (4,95, sd 1,77) than words referring to positive ones (3,76, sd 1,46), especially for women (negative: 5,08, sd 1,73, vs. men's 4,69, sd 1,82; F (1, 1133) 12,76, p < .001), and for adolescents (5,08, sd 1,73) in comparison to older participants. Neutrally and 'ambiguously' toned emotion words - e.g., surprise, indifference, emotion - were quite infrequent (0,43, sd 0,62), but listed especially by 23-24 year-olds (0,50, sd 0,64) in comparison to the other four age groups.
3.1 Order of Word Listing Table 1 reports, for four order positions (out of 10, due to space limitation), the 16 most frequent words within each position, ordered in decreasing percentage frequency, down to words produced by 1% or so of participants. The bottom row specifies how many different words were listed within each order position. Within each order position, about 30 distinct words were altogether listed by participants. The analysis of the 33 most frequently listed words in each of the 10 order positions, i.e., from first to tenth position, and their frequency-based rank within each position, showed that on the average basic emotion words obtained the highest ranks in the first 5 order positions, whereas their rank decreased in the last 5 positions. For instance, joy, happiness, love, anger obtained the highest 4 ranks in position 1 (see Table 1), totalling a frequency of 59% of all words participants listed as their very first one; in position 10, instead, their sum frequency of production was 9,5%, and their respective rank was: anger 3, joy 6, happiness 9, love 11. The negative emotions fear, anger and sadness actually increased their ranks from 2nd to 10th position; e.g., fear was 6th in position 1 (frequency: 4,1%), 5th in position 2 (6,8%), 4th in position 3 (7%), 1st in position 10 (4,1%); likewise, 9,2% of participants listed anger as their very first word, 11% as second, 7,6% as third, and 3,1% as tenth. Words denoting complex emotions, vice versa, increased their frequency, and thus their rank (but fluctuations were common), from beginning to end positions; e.g., nostalgia’s rank was 33rd in position 1, 6th in position 7, 2nd in position 10; hate oscillated from 24th in position 1, to 9th in position 2, to ranks below 15 in intermediate positions, and back to 9th in position 10. Analyses of the order with which words referred to six basic-emotion-type categories (Table 2) showed that, on the average, people started the task by thinking about positive emotions, i.e., comprised in the Joy-Love type (whose frequency decreased from 66% in position 1 to about 30% in the last positions) rather than about emotions belonging to the Anger, Sadness and Fear categories, and even less frequently to Surprise and Emotion. The 2nd to 10th listed words, however, did refer to Anger, Sadness and Fear - with a constant frequency in all positions: about 1520%. That is, although negative emotion labels in the lexicon greatly outnumber positive ones (reflecting a greater attention to, and processing of, the meaning and
implications of negative emotional experiences; e.g., see [20, 28-29]) and people congruently list more negative than positive words, the results evidenced a bias towards accessing positive experiences first, i.e., starting with a positive emotion.

Table 2. Order (1st to 10th) of basic-emotion-category word production (N participants = 1.147; % frequencies)

Basic categories    1     2     3     4     5     6     7     8     9    10
1 Emotion         0,1   0,4   0,4   0,3   0,4   0,4   0,3   0,9   0,6   0,7
2 Joy-Love       66,3  40,9  38,5  34,8  35,3  33,0  31,6  32,2  31,0  29,4
3 Surprise        2,0   2,4   3,2   3,7   5,0   4,4   4,9   5,2   3,9   3,7
4 Fear            8,9  14,9  16,2  18,0  18,3  19,0  19,3  16,7  19,4  16,5
5 Sadness         7,4  17,0  20,1  19,5  17,3  19,1  19,7  20,4  19,0  21,8
6 Anger          10,3  17,6  14,2  15,3  15,0  14,6  14,9  14,2  13,1  13,4
Total            94,9  93,3  92,8  91,5  91,3  90,6  90,7  89,6  87,0  85,4
Missing           5,1   6,7   7,2   8,5   8,7   9,4   9,3  10,4  13,0  14,6
3.2 Correlates of the ‘Working Emotion Vocabulary' As stated earlier on, correlational analyses were performed to test whether the nature of free listed words was related to emotion-related abilities, traits and perceptions – e.g., [30]. The results showed that participants’ number of listed correct emotion words did not correlate with their emotion-lexicon recognition abilities – i.e., recognizing Synonyms and Antonyms of a target emotion. In other words, as it happens with linguistic skills in general, the active emotion vocabulary is not predictive of people’s actual ('passive') knowledge. Likewise, correctness was not significantly associated with measured psychological emotion-related variables (see below), except for Recognition of Facial expressions of emotion (RFE; Pearson’s r .11; all reported values, with sex as a covariate, were significant at p < .01). Valence, i.e., the extent to which participants supplied positive and negative emotion words (in themselves highly and negatively correlated: r -.61), and neutral ones, instead significantly correlated with several participants' psychological features. Namely, listing negative emotion words correlated (Pearson's r) negatively with extent of own Expressive transparency (EE: r -.14), Awareness of emotions (AE r .17), Life satisfaction (LS r -.18), perceived Social support (SS r -.10), and Felt positive emotions (FPE r -.25) and correlated positively with RFE (r .09), the Suppression regulation strategy (SR r .09), Emotional Loneliness (EL r .14), Felt negative emotions (FNE r .22) Alexithymia (AL r .13) and psychophysical ill-being, i.e., perceived Low Health level (LH r .21). The frequency of positive emotion words showed a quite similar pattern of correlations, but in the opposite direction (EE r .09, AE r .15, LS r .16, SS r .12; FPE r .21; SR r -.10, EL r -.18, FNE r -.23, AL r -.13 and LH r -.20). Listing neutral emotion words, finally, showed fewer and lower associations, but in the same direction found for positive words (AE r .10, LS r .12, FPE r .10, EL r -.09, AL r -.09, FNE r -.09, LH r .11). In sum, the obtained associations form a coherent pattern: the experiences and labels that people access when performing the listing task reflect their present psychological state - indexed by level of life satisfaction, emotional and social
loneliness, alexithymia, health - as well as some of their emotion-related traits and abilities - such as their level of awareness of emotions, expressive transparency, and use of suppression to regulate emotions.
4 Conclusion As we saw, the great majority of words listed by participants did refer to emotions proper. The most frequently listed ones (about 6 to 10, including love, joy, happiness, anger, sadness, fear, hate, anxiety) correspond, on the whole, with the words listed by a variety of samples, both in Italian [17, 23] and in other languages - e.g., [9, 10, 14], pointing to substantial similarities in emotion conceptualization across and within cultural groups. However, differences rather than similarities are found, and might be expected in future studies, if the analysis of listed words is expanded beyond the 5-10 most frequent ones. Participants made 'errors' too, i.e., words referring to something else - especially referring to events and personality traits, and listing pars pro toto words. Thus, the data showed that when people access their emotion knowledge and lexicon quickly (or perhaps without a great communicative motivation), they actually display a ‘restricted’ Active Emotion Vocabulary (Aev) of 8 to 9 correct emotion words - with women performing better than men, and with rare age differences if post-adolescence individuals are considered. The Aev typically includes both negative (more frequent) and positive emotion words. What specific emotions come to mind is influenced not only by a person's active linguistic repertoire and her emotion knowledge, but also by her psycho-physical state, her salient experiences moment in her recent history (indexed by life satisfaction, loneliness level, etc.), and is, finally, related to aspects of her emotional competence (e.g., awareness of felt emotions). The results showed that Aev typically includes both basic and complex emotion words, i.e., both (a small set of) labels that are known to everyone, as well as (a large number of) idiosyncratically easy-to-retrieve ones. The analysis of the order with which emotions ‘come to mind’ showed that 'basic emotion' words have the greatest probability of coming to mind, and of doing so sooner rather than later (with many individual differences, however). What other emotion words are produced, and in what order, is subject to much more individual variation - as shown by the very high number of words listed by 1-2 participants only, or a small proportion of them. However, he data showed that many words referring to so-called complex or blended emotions – e.g., anxiety, cheerfulness, serenity, surprise, hope, delusion, jealousy, hate, boredom, disgust, melancholy, shame, amazement, pain, listed by 10% or more of the sample - are more likely to figure in an individual's Aev than some other ones – e.g., envy, guilt, relief, panic, stress, uneasiness, gratitude, apathy, depression, contempt, terror, frustration, passion, produced by 2% to 9%, or those listed by 1% or less of the sample, including irritation, illusion, ecstasy, upset, unhappy, indignation, interest, and grudge. In other words, the obtained results imply that, at any given moment in time, a few emotion labels are generally likely to be active or very easily accessed. Individual minds instead might differ much one from the other in what else they easily lend access to beyond the ‘basics’. The number of listed distinct words almost linearly increased in fact from the first (N = 133) to the end positions
(N = 289; see Table 1 for partial data), as did the ‘missing data’ (see Table 2), i.e., the number of people who had difficulty thinking of anything at all as they proceeded in the task beyond the first few productions. The study results indicate, I believe, that detailed analyses of listed words, i.e., analyses that are not limited to words produced by the majority of participants (as was done in most previous studies), help our understanding of how people conceptualize and verbalize emotion experiences. The results clearly show, in fact, that when people (have to) think about emotions, they activate different sets of knowledge. On the one hand, they in fact retrieve emotion-lexicon items, key ones (typically short words; e.g., [7, 17]) that constitute summary designations of so-called basic emotions (anger, fear, etc.) as well as words naming ‘complex’ emotions (jealousy, irritation, etc.). On the other hand, they retrieve a variety of words that refer to emotion-related aspects and experiences, such as expressive or visceral emotional reactions, events that led to an emotion, or personality traits that predispose to feel emotion x or emotion y. Both 'correct' and ‘error’ data thus support the hypotheses that emotion knowledge is structured in scripts or schemata containing information on various aspects of (an) emotion (e.g., [9, 7, 13]), and that the retrieving process is possibly one of spreading activation - e.g., [31]. The obtained Aev results, especially those related to the frequency and order of listing, might complement sorting studies of similarity between emotions (in Italian, English, or other languages; e.g., [7, 11, 25]), and might be useful in studying emotion concepts and experiences, for instance across ages and cultural groups e.g., [10, 1416]. Finally, by providing a large database of emotion-lexicon usage, the results have implications for issues such as communication and understanding of emotional states via linguistic labels by humans and robots alike, and in man-machine interaction studies and applications - e.g., [1, 3, 8]. Acknowledgments and Notes. The study was financed by Fondazione Cariparo. I wish to thank S. Andriolo, head technician of DPSS, for his precious work as concerns online task administration and monitoring. I also wish to thank M. Casnici, J. Tomelleri, L. Ronconi, T. Lanciano and V. Paganelli for their precious cooperation to aspects of the reported study, as well as several students, especially S. L’Abbate, M. Berto and E. Valle, who helped monitoring participants’ adhesion to the project. Note 1. More detailed information (e.g., on the 44 emotion categories, and on the words produced by at least 5% of the sample, together with their English translation) can be obtained, upon request, from the author.
References 1. Derks, D., Fischer, A.H., Bos, A.E.R.: The role of emotion in computer-mediated communication: A review. Computers in Human Behavior 24, 766–785 (2008) 2. Parkinson, B.: Emotions in direct and remote social interaction: getting through the spaces between us. Computers in Human Behavior 24(4), 1510–1529 (2008) 3. Zeng, Z., Pantic, M., Roisman, G.I., Huang, T.S.: A Survey of Affect Recognition Methods: Audio, Visual, and Spontaneous Expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(1), 39–58 (2009)
4. Zammuner, V.L.: Men’s and women’s lay theories of emotion. In: Fischer, A.H. (ed.) Gender and Emotion, pp. 48–60. Cambridge University Press, Cambridge (2000) 5. Zammuner, V.L.: Naive Theories of Emotional Experience: Jealousy. In: Russell, J.A., Fernandez Dols, J.M., Manstead, A.S.R., Wellenkamp, J.C. (eds.) Everyday Conceptions of Emotion: An Introduction to the Psychology, Anthropology and Linguistics of Emotion, pp. 435–456. Kluwer, Dordrecht (1995) 6. Frijda, N., Zammuner, V.L.: L’etichettamento delle proprie emozioni. Giornale Italiano di Psicologia 19(3), 389–423 (1992) 7. Shaver, P., Schwartz, J., Kirson, D., O’Connor, C.: Emotion Knowledge: further exploration of a prototype approach. Journal of Personality and Social Psychology 52, 1061–1086 (1987) 8. Douglas-Cowie, E., Campbell, N., Cowie, R., Roach, P.: Emotional speech: Towards a new generation of databases. Speech Communication. Special Issue Speech and Emotion 40(1-2), 33–60 (2003) 9. Fehr, B., Russell, J.A.: Concept of Emotion Viewed From a Prototype Perspective. Journal of Experimental Psychology: General 113(3), 464–486 (1984) 10. Schrauf, R.W., Sanchez, J.: Using Freelisting to Identify, Assess, and Characterize Age Differences in Shared Cultural Domains. Journal of Gerontology: Social Sciences 63(6), 385–393 (2008) 11. Zammuner, V.L., Bussolon, S., Peloso, O.: La conoscenza negli adolescenti delle somiglianze e differenze tra le emozioni. In: Grazzani, I., Riva Crugnola, C. (eds.), La competenza emotiva. Milano, Unicopli (in press, 2011) 12. Galli, C., Zammuner, V.: Concepts of emotion and dimensional ratings of Italian emotion words in pre-adolescents. In: XXVIII Annual Conference of the Cognitive Science Society, Vancouver (2006) 13. Galli, C., Zammuner, V., Romagnoli, G.: The conceptual organization of emotion concepts in pre-adolescents: a 2-task study. In: XXVII Annual Meeting of the Cognitive Science Society, Stresa (2005) 14. Van Goozen, S., Frijda, N.H.: Emotion words used in six European countries. European Journal of Social Psychology 23, 89–95 (1993) 15. Kobayashi, F., Schallert, D.L., Ogren, H.A.: Japanese and American folk vocabularies for emotions. Journal of Social Psychology 143, 451–478 (2003) 16. Niedenthal, P.M., Auxiette, C., Nugier, A., Dalle, N., Bonin, P., Fayol, M.: A prototype analysis of the French category “emotion”. Cognition and Emotion 18(3), 289–312 (2004) 17. Zammuner, V.L.: Concepts of emotion: ”Emotionness,” and dimensional ratings of Italian emotion words. Cognition and Emotion 12(2), 243–272 (1998) 18. Thelwall, M., Wilkinson, D., Uppal, S.: Data mining emotion in social network communication: Gender differences in MySpace. Journal of the American Society for Information Science and Technology 61(1), 190–199 (2010) 19. Russell, J.A.: Culture and the categorization of emotion. Psychological Bulletin 110, 426– 450 (1991) 20. Hupka, R.B., Lenton, A.P., Hutchison, K.A.: Universal development of emotion categories in natural language. Journal of Personality and Social Psychology 77, 247–278 (1999) 21. Zammuner, V.L., Cigala, A.: La conoscenza delle emozioni nei bambini in eta’ scolare. Età Evolutiva 69, 19–42 (2001) 22. Zammuner, V.L.: Measurement and training on-line of emotional intelligence and competencies, and relevant criterion and socio-demographic variables in career starters. In: INTED 2010 Proceedings, CD (2010)
Author Index
Al Moubayed, Samer  19
Altmann, Uwe  335
Arsić, Dejan  1
Babiloni, Fabio  294
Bachwerk, Martin  48
Bailly, Gérard  273
Balog, András  199
Beňuš, Štefan  346
Beskow, Jonas  19
Birkholz, Peter  287
Boháč, Marek  154
Bozkurt, Elif  36
Cambria, Erik  56
Campbell, Nick  163
Cerva, Petr  81
Chaloupka, Josef  88
Chanquoy, Lucile  316
Chetouani, Mohamed  368
Cifani, Simone  70
Čižmár, Anton  171
Cohen, David  368
De Looze, Céline  163
D'Errico, Francesca  434
Durrani, Tariq  56
Edlund, Jens  19
Erdem, A. Tanju  36
Erdem, Çiğdem Eroğlu  36
Erzin, Engin  36
Esposito, Anna  252, 316, 368
Esposito, Antonietta M.  252
Fàbregas, Joan  120
Faundez-Zanuy, Marcos  120
Fegyó, Tibor  199
Ferrara, Fabrizio  419
Gnisci, Augusto  355
Granström, Björn  19
Grassi, Marco  95
Graziano, Enza  355
House, David  19
Hussain, Amir  56
Imre, Viktor  229
Juhár, Jozef  171
Kannampuzha, Jim  287
Kaufmann, Emily  287
Kröger, Bernd J.  287
Lelong, Amélie  273
Majewski, Wojciech  104
Maskeliunas, Rytis  113
Matarazzo, Olimpia  419
Mazzocco, Thomas  56
Mekyska, Jiří  120
Mihajlik, Péter  199
Mlakar, Izidor  133, 185
Morbidoni, Christian  95
Mozsolics, Tamás  199
Navarretta, Costanza  309
Neuschaefer-Rube, Christiane  287
Nijholt, Anton  147
Nouza, Jan  81, 154
Nucci, Michele  95
Oertel, Catharine  163
Ondáš, Stanislav  171
Palecek, Karel  178
Piazza, Francesco  70
Poggi, Isabella  393, 434
Prazak, Jan  214
Přibil, Jiří  378
Přibilová, Anna  378
Principi, Emanuele  70
Riviello, Maria Teresa  368
Rojc, Matej  133, 185
Rossini, Nicla  406
Rotili, Rudi  70
Rudzionis, Vytautas  113
Rusko, Milan  346
Sárosi, Gellért  199
Scherer, Stefan  163
Schuller, Björn  1
Silovsky, Jan  81, 214
Smékal, Zdeněk  120
Squartini, Stefano  70
Staroniewicz, Piotr  104, 223
Sun, Xiaofan  147
Sztahó, Dávid  229
Tarján, Balázs  199
Vecchiato, Giovanni  294
Vích, Robert  240
Vicsi, Klára  229
Vincze, Laura  393
Vogel, Carl  48
Volpe, Rosa  316
Vondra, Martin  240
Wagner, Petra  163
Windmann, Andreas  163
Zammuner, Vanda Lucia  449