Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Moshe Y. Vardi, Rice University, Houston, TX, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
4122
Rainer Stiefelhagen John Garofolo (Eds.)
Multimodal Technologies for Perception of Humans
First International Evaluation Workshop on Classification of Events, Activities and Relationships, CLEAR 2006
Southampton, UK, April 6-7, 2006
Revised Selected Papers
Volume Editors

Rainer Stiefelhagen
Universität Karlsruhe (TH), Institut für Theoretische Informatik
Am Fasanengarten 5, 76131 Karlsruhe, Germany
E-mail: [email protected]

John Garofolo
National Institute of Standards and Technology
100 Bureau Drive, Stop 8940, Gaithersburg, MD 20899-8940, USA
E-mail: [email protected]
Library of Congress Control Number: 2006939517
CR Subject Classification (1998): I.4, I.5, I.2.10, I.3.5, I.2.6, F.2.2
LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
ISSN 0302-9743
ISBN-10 3-540-69567-2 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-69567-7 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2007 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 11970897 06/3142 543210
Preface
During winter and spring 2006, the first international CLEAR evaluation took place. CLEAR targets the evaluation of systems for the perception of people, their identities, activities, interactions and relationships in human–human interaction scenarios as well as related scenarios. As part of the evaluation, a two-day workshop was held during April 6–7, 2006, in Southampton, UK, at which the participating systems were presented and the evaluation results discussed in detail. This book contains the system description papers that were presented at the CLEAR 2006 workshop, as well as an overview of the evaluation tasks and the results obtained in each of them by the various participants. The book also includes two invited papers about related evaluation activities that were presented at the workshop. The CLEAR evaluation campaign and workshop was jointly organized by the Universität Karlsruhe, Germany, and the US National Institute of Standards and Technology (NIST). CLEAR 2006 was supported by the European Integrated Project CHIL (Computers in the Human Interaction Loop) as well as by the VACE (Video Analysis Content Extraction) program of the US DTO (Disruptive Technology Office), which jointly organized part of their perceptual technology evaluations within CLEAR 2006 for the first time. CLEAR 2006 was thus sponsored by the European Commission (Information Society Technologies priority of the Sixth Framework Programme) and the US DTO. CLEAR 2006 was also organized in cooperation with the NIST RT (Rich Transcription) Meeting Recognition evaluation, which focused more on the evaluation of content-related technologies, such as speech and video text recognition. CLEAR and RT shared some of their evaluation data sets, so that, for example, the speaker-localization results generated for CLEAR could be used for the far-field speech-to-text task in RT06. This was facilitated through the harmonization of the 2006 CLEAR and RT evaluation deadlines. The evaluation tasks conducted in CLEAR 2006 were drawn from the human activity analysis tasks of the CHIL and VACE perceptual technology evaluation activities. They can be categorized as follows:

– Tracking tasks (faces/persons/vehicles, 2D/3D, acoustic/visual/audio-visual)
– Person identification tasks (acoustic, visual, audio-visual)
– Head pose estimation (single-view studio data, multi-view lecture data)
– Acoustic scene analysis (acoustic event detection, acoustic environment classification)
Most of these tasks were evaluated on multi-modal multi-site recordings of seminars and meetings provided by the CHIL and VACE projects, as well as on
surveillance data provided by the UK Home Office i-LIDS (Imagery Library for Intelligent Detection Systems) program. Participation in the CLEAR 2006 evaluation and workshop was also open to any site interested in participating in at least one of the evaluation tasks. Participating sites received the necessary development and evaluation data sets, including scoring tools, free of charge. This first CLEAR evaluation and workshop – around 60 people participated in the workshop – was clearly a big success. Overall, nine major evaluation tasks, including more than 20 subtasks, were evaluated. Sixteen different institutions participated in the evaluation, including eight participants from the CHIL program, five participants from the VACE program and three external participants. We were also pleased to have a number of representatives from related evaluation programs and projects give presentations:

– David Cher (SILOGIC, FR), Topic: Evaluations in ETISEO
– James Ferryman (University of Reading, UK), Topic: PETS Evaluation and Perspective
– Daniel Gatica-Perez (IDIAP, CHE), Topic: Technology Evaluations in AMI
– Mats Ljungqvist (European Commission), Topic: EU-Funded Research Initiatives
– Jonathon Phillips (NIST), Topic: Do Evaluations and Challenge Problems Hinder Creativity?
– Alan Smeaton (Dublin City University, IRL), Topic: TrecVid

Based on the success of CLEAR 2006, it was decided to organize CLEAR 2007 during May 8–9, 2007, in Baltimore, USA. It will again be organized in conjunction with, and be co-located with, the NIST RT 2007 evaluations, May 10–11, 2007. Finally, we would like to take this opportunity to thank the sponsoring projects and funding agencies, all the participants of the evaluation and the workshop, the invited speakers, and everybody involved in the organization of the evaluations and the workshop.
September 2006
Rainer Stiefelhagen John Garofolo
Organization
Chairs
Rainer Stiefelhagen, Universität Karlsruhe, Germany
John Garofolo, National Institute of Standards and Technology (NIST), USA
Workshop Organization
Rachel Bowers, NIST
Margit Rödder, Universität Karlsruhe, Germany
Evaluation Task Organizers
Keni Bernardin, Universität Karlsruhe, Germany
Maurizio Omologo, ITC-IRST, Trento, Italy
John Garofolo, NIST
Hazim Ekenel, Universität Karlsruhe, Germany
Djamel Mostefa, ELDA, Paris, France
Aristodemos Pnevmatikakis, Athens Information Technology, Greece
Ferran Marques, Universitat Politècnica de Catalunya, Barcelona, Spain
Ramon Morros, Universitat Politècnica de Catalunya, Spain
Michael Voit, Universität Karlsruhe, Germany
Andrey Temko, Universitat Politècnica de Catalunya, Spain
Rob Malkin, Carnegie Mellon University, Pittsburgh, PA
Sponsoring Projects and Institutions

Projects:
– CHIL, Computers in the Human Interaction Loop, http://chil.server.de
– VACE, Video Analysis Content Extraction, https://control.nist.gov/dto/twiki/bin/view/Main/WebHome

Institutions:
– European Commission, through the Multimodal Interfaces objective of the Information Society Technologies (IST) priority of the Sixth Framework Programme
– US National Institute of Standards and Technology (NIST), http://www.nist.gov/speech
Table of Contents
Overview

The CLEAR 2006 Evaluation (p. 1)
  Rainer Stiefelhagen, Keni Bernardin, Rachel Bowers, John Garofolo, Djamel Mostefa, and Padmanabhan Soundararajan

3D Person Tracking

3D Audiovisual Person Tracking Using Kalman Filtering and Information Theory (p. 45)
  Nikos Katsarakis, George Souretis, Fotios Talantzis, Aristodemos Pnevmatikakis, and Lazaros Polymenakos

A Generative Approach to Audio-Visual Person Tracking (p. 55)
  Roberto Brunelli, Alessio Brutti, Paul Chippendale, Oswald Lanz, Maurizio Omologo, Piergiorgio Svaizer, and Francesco Tobia

An Audio-Visual Particle Filter for Speaker Tracking on the CLEAR'06 Evaluation Dataset (p. 69)
  Kai Nickel, Tobias Gehrig, Hazim K. Ekenel, John McDonough, and Rainer Stiefelhagen

Multi- and Single View Multiperson Tracking for Smart Room Environments (p. 81)
  Keni Bernardin, Tobias Gehrig, and Rainer Stiefelhagen

UPC Audio, Video and Multimodal Person Tracking Systems in the Clear Evaluation Campaign (p. 93)
  Alberto Abad, Cristian Canton-Ferrer, Carlos Segura, José Luis Landabaso, Dušan Macho, Josep Ramon Casas, Javier Hernando, Montse Pardàs, and Climent Nadeu

A Joint System for Single-Person 2D-Face and 3D-Head Tracking in CHIL Seminars (p. 105)
  Gerasimos Potamianos and Zhenqiu Zhang

Speaker Tracking in Seminars by Human Body Detection (p. 119)
  Bo Wu, Vivek Kumar Singh, Ram Nevatia, and Chi-Wei Chu

TUT Acoustic Source Tracking System 2006 (p. 127)
  Pasi Pertilä, Teemu Korhonen, Tuomo Pirinen, and Mikko Parviainen
Tracking Multiple Speakers with Probabilistic Data Association Filters (p. 137)
  Tobias Gehrig and John McDonough

2D Face Detection and Tracking

2D Person Tracking Using Kalman Filtering and Adaptive Background Learning in a Feedback Loop (p. 151)
  Aristodemos Pnevmatikakis and Lazaros Polymenakos

PittPatt Face Detection and Tracking for the CLEAR 2006 Evaluation (p. 161)
  Michael C. Nechyba and Henry Schneiderman

Person Tracking on Surveillance Data

The AIT Outdoors Tracking System for Pedestrians and Vehicles (p. 171)
  Aristodemos Pnevmatikakis, Lazaros Polymenakos, and Vasileios Mylonakis

Evaluation of USC Human Tracking System for Surveillance Videos (p. 183)
  Bo Wu, Xuefeng Song, Vivek Kumar Singh, and Ram Nevatia

Vehicle Tracking

Multi-feature Graph-Based Object Tracking (p. 190)
  Murtaza Taj, Emilio Maggio, and Andrea Cavallaro

Multiple Vehicle Tracking in Surveillance Videos (p. 200)
  Yun Zhai, Phillip Berkowitz, Andrew Miller, Khurram Shafique, Aniket Vartak, Brandyn White, and Mubarak Shah

Robust Appearance Modeling for Pedestrian and Vehicle Tracking (p. 209)
  Wael Abd-Almageed and Larry S. Davis

Robust Vehicle Blob Tracking with Split/Merge Handling (p. 216)
  Xuefeng Song and Ram Nevatia

Person Identification

A Decision Fusion System Across Time and Classifiers for Audio-Visual Person Identification (p. 223)
  Andreas Stergiou, Aristodemos Pnevmatikakis, and Lazaros Polymenakos

The CLEAR'06 LIMSI Acoustic Speaker Identification System for CHIL Seminars (p. 233)
  Claude Barras, Xuan Zhu, Jean-Luc Gauvain, and Lori Lamel
Person Identification Based on Multichannel and Multimodality Fusion (p. 241)
  Ming Liu, Hao Tang, Huazhong Ning, and Thomas Huang

ISL Person Identification Systems in the CLEAR Evaluations (p. 249)
  Hazım Kemal Ekenel and Qin Jin

Audio, Video and Multimodal Person Identification in a Smart Room (p. 258)
  Jordi Luque, Ramon Morros, Ainara Garde, Jan Anguita, Mireia Farrus, Dušan Macho, Ferran Marqués, Claudi Martínez, Verónica Vilaplana, and Javier Hernando

Head Pose Estimation

Head Pose Estimation on Low Resolution Images (p. 270)
  Nicolas Gourier, Jérôme Maisonnasse, Daniela Hall, and James L. Crowley

Evaluation of Head Pose Estimation for Studio Data (p. 281)
  Jilin Tu, Yun Fu, Yuxiao Hu, and Thomas Huang

Neural Network-Based Head Pose Estimation and Multi-view Fusion (p. 291)
  Michael Voit, Kai Nickel, and Rainer Stiefelhagen

Head Pose Estimation in Seminar Room Using Multi View Face Detectors (p. 299)
  Zhenqiu Zhang, Yuxiao Hu, Ming Liu, and Thomas Huang

Head Pose Detection Based on Fusion of Multiple Viewpoint Information (p. 305)
  Cristian Canton-Ferrer, Josep Ramon Casas, and Montse Pardàs

Acoustic Scene Analysis

CLEAR Evaluation of Acoustic Event Detection and Classification Systems (p. 311)
  Andrey Temko, Robert Malkin, Christian Zieger, Dušan Macho, Climent Nadeu, and Maurizio Omologo

The CLEAR 2006 CMU Acoustic Environment Classification System (p. 323)
  Robert G. Malkin

Other Evaluations

2D Multi-person Tracking: A Comparative Study in AMI Meetings (p. 331)
  Kevin Smith, Sascha Schreiber, Igor Potúcek, Vítezslav Beran, Gerhard Rigoll, and Daniel Gatica-Perez
Head Pose Tracking and Focus of Attention Recognition Algorithms in Meeting Rooms (p. 345)
  Sileye O. Ba and Jean-Marc Odobez

Author Index (p. 359)
The CLEAR 2006 Evaluation

Rainer Stiefelhagen (1), Keni Bernardin (1), Rachel Bowers (2), John Garofolo (2), Djamel Mostefa (3), and Padmanabhan Soundararajan (4)

(1) Interactive Systems Lab, Universität Karlsruhe, 76131 Karlsruhe, Germany
    {stiefel, keni}@ira.uka.de
(2) National Institute of Standards and Technology (NIST), Information Technology Lab - Information Access Division, Speech Group
    {rachel.bowers, garofolo}@nist.gov
(3) Evaluations and Language Resources Distribution Agency (ELDA), Paris, France
    [email protected]
(4) Computer Science and Engineering, University of South Florida, Tampa, FL, USA
    [email protected]
Abstract. This paper is a summary of the first CLEAR evaluation on CLassification of Events, Activities and Relationships, which took place in early 2006 and concluded with a two-day evaluation workshop in April 2006. CLEAR is an international effort to evaluate systems for the multimodal perception of people, their activities and interactions. It provides a new international evaluation framework for such technologies. It aims to support the definition of common evaluation tasks and metrics, to coordinate and leverage the production of the necessary multimodal corpora, and to provide the possibility of comparing different algorithms and approaches on common benchmarks, which should result in faster progress in the research community. This paper describes the evaluation tasks conducted in CLEAR 2006, including the metrics and databases used, and provides an overview of the results. The evaluation tasks in CLEAR 2006 included person tracking, face detection and tracking, person identification, head pose estimation, vehicle tracking, as well as acoustic scene analysis. Overall, more than 20 subtasks were conducted, which included acoustic, visual and audio-visual analysis for many of the main tasks, as well as different data domains and evaluation conditions.
1 Introduction
CLassification of Events, Activities and Relationships (CLEAR) is an international effort to evaluate systems that are designed to analyze people's identities, activities, interactions and relationships in human-human interaction scenarios, as well as related scenarios. The first CLEAR evaluation was conducted from around December 2005, when the first development data and scoring scripts were disseminated, until April 2006, when a two-day evaluation workshop took place in Southampton, UK, during which the evaluation results and system details of all participants were discussed.
1.1 Motivation
Many researchers, research labs and in particular a number of current major research projects worldwide – including the European projects CHIL, Computers in the Human Interaction Loop [1], and AMI, "Augmented Multi-party Interaction" [2], as well as the US programs VACE, "Video Analysis Content Extraction" [3], and CALO, "Cognitive Assistant that Learns and Organizes" [4] – are working on technologies to analyze people, their activities, and their interaction. However, common evaluation standards for such technologies are missing. Until now, most researchers and research projects have used their own data sets, annotations, task definitions, metrics and evaluation procedures. As a consequence, comparability of the research algorithms and systems is virtually impossible. Furthermore, this leads to a costly multiplication of data production and evaluation efforts for the research community as a whole. CLEAR was created to address this problem. Its goal is to provide a common international evaluation forum and framework for such technologies, and to serve as a forum for the discussion and definition of related common benchmarks, including the definition of common metrics, tasks and evaluation procedures. The outcomes for the research community that we expect from such a common evaluation forum are:

– the definition of widely adopted common metrics and tasks,
– a greater availability of resources by sharing the data collection and annotation burden,
– the provision of challenging multimodal data sets for the development of robust perceptual technologies,
– comparability of systems and approaches, and
– thus faster progress in developing better, more robust technology.

1.2 Background
The CLEAR 2006 evaluation emerged from the existing evaluation efforts of the European Integrated Project CHIL, which in previous years had conducted a number of evaluations on multimodal perceptual technologies, including tasks such as person tracking and identification, head pose estimation, gesture recognition and acoustic event detection, as well as from the technology evaluation efforts of the US VACE program, which conducted several similar evaluations in face, person and vehicle tracking. For CLEAR 2006, the technology evaluations of CHIL and VACE were combined for the first time, and the evaluations were also open to any site interested in participating. In order to broaden the participation and discussion of evaluation tasks and metrics, representatives from other related projects and evaluation efforts (AMI [2], the NIST RT evaluations [5], the NIST People-ID evaluations, PETS [6], TrecVid [7], ETISEO [8]) were actively invited to participate in the preparation of the workshop as well as to present an overview of their related activities at the workshop.
1.3 Scope and Evaluation Tasks in 2006
The CLEAR 2006 evaluation and workshop was organized in conjunction with the National Institute of Standards and Technology (NIST) Rich Transcription (RT) 2006 evaluation [5]. While the evaluations conducted in RT focus on content-related technologies, such as speech and text recognition, CLEAR is more about context-related multimodal technologies such as person tracking, person identification, head pose estimation, and the analysis of focus of attention, interaction, activities and events. CLEAR 2006 and RT06 in particular shared some of their evaluation data sets, so that, for example, the speaker-localization results generated for CLEAR could be used for the far-field speech-to-text task in RT06. The evaluation deadlines of CLEAR and RT 2006 were also harmonized to make this possible. This is an important first step towards developing a comprehensive multimedia evaluation program. The evaluation tasks in CLEAR 2006 can be broken down into four categories:

– tracking tasks (faces/persons/vehicles, 2D/3D, acoustic/visual/audio-visual)
– person identification tasks (acoustic, visual, audio-visual)
– head pose estimation (single-view studio data, multi-view lecture data)
– acoustic scene analysis (events, environments)
These tasks and their various subtasks will be described in Section 3. Due to the short time frame for preparing the joint technology evaluations in CLEAR, it was decided that the evaluation tasks that had already been defined in VACE and CHIL, respectively, would be kept as they were, and thus were run independently in parallel, with their slightly differing annotations and on different data sets. As a consequence there were, for example, several 3D person tracking tasks (CHIL) as well as 2D person tracking tasks (VACE) in CLEAR 2006. As a first step towards harmonizing the evaluation tasks, the participants from CHIL and VACE had, however, agreed on common metrics for multiple object tracking (see Section 3.3). The aim for upcoming evaluations is to further harmonize metrics and benchmarks.

1.4 Contributors
CLEAR 2006 would not have been possible without the help and effort of many people and institutions worldwide. CLEAR 2006 was supported by the projects CHIL [1] and VACE [3]. The organizers of CLEAR are the Interactive Systems Labs of the Universität Karlsruhe, Germany (UKA), and the US National Institute of Standards and Technology (NIST), with the support of the contractors University of South Florida (USF) and VideoMining Inc. The participants and contributors to the CLEAR 2006 evaluations included: the Research and Education Society in Information Technologies at Athens Information Technology, Athens, Greece (AIT), the Interactive Systems Labs at Carnegie Mellon University, Pittsburgh, PA, USA (CMU), the Evaluations and Language Resources Distribution Agency, Paris, France (ELDA), the IBM T.J. Watson Research
Center, RTE 134, Yorktown Heights, USA (IBM), the Project PRIMA of the Institut National de Recherche en Informatique et en Automatique, Grenoble, France (INRIA), the Centro per la Ricerca Scientifica e Tecnologica at the Istituto Trentino di Cultura, Trento, Italy (ITC-IRST), the Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur at the Centre National de la Recherche Scientifique, Paris, France (LIMSI), Pittsburgh Pattern Recognition, Inc., Pittsburgh, PA, USA (PPATT), the Department of Electronic Engineering of the Queen Mary University of London, UK (QMUL), the Institute of Signal Processing of the Technical University of Tampere, Finland (TUT), the Beckman Institute for Advanced Science and Technology at the University of Illinois Urbana-Champaign, USA (UIUC), and the Institute for Robotics and Intelligent Systems of the University of Southern California, USA (USC). UKA and ITC-IRST provided recordings of seminars (lectures), which were used for the 3D single person tracking tasks, the face detection task and for person recognition. AIT, IBM and UPC provided several recordings of "interactive" seminars (basically small interactive meetings), which were used for the multi-person tracking tasks, for face detection, for the person identification tasks and for acoustic event detection. INRIA provided the Pointing'04 database for head pose detection. UKA provided 26 seminar recordings with head pose annotations for the lecturer, which were used for the second head pose estimation task. UPC, ITC and CMU provided different databases with annotated acoustic events used for acoustic event classification. Visual and acoustic annotations of the CHIL seminar and interactive seminar data were mainly done by ELDA, in collaboration with UKA, CMU, AIT, IBM, ITC-IRST and UPC. ELDA also packaged and distributed the data coming from CHIL. The surveillance data coming from VACE was derived from a single source, i-LIDS. The meeting room data was a collection derived from data collected at CMU, the University of Edinburgh (EDI), NIST, TNO, and Virginia Tech (VT). The discussion and definition of the individual tasks and evaluation procedures were moderated by "task leaders", who coordinated all aspects surrounding the running of their given tasks. These were Keni Bernardin (UKA, 3D single- and multi-person tracking), Maurizio Omologo (ITC-IRST, 3D acoustic single-person tracking), John Garofolo/Rachel Bowers (NIST, 2D multi-person tracking tasks, VACE 2D face tracking, vehicle tracking), Hazim Ekenel (UKA, visual person identification), Djamel Mostefa (ELDA, acoustic identification), Aristodemos Pnevmatikakis (AIT, audio-visual identification), Ferran Marques and Ramon Morros (both UPC, CHIL 2D face detection), Michael Voit (UKA, head pose estimation), and Andrey Temko (UPC, acoustic event detection). The task leaders were also responsible for scoring the evaluation submissions, which in addition were also centrally scored by ELDA. This paper aims at giving an overview of the CLEAR 2006 evaluation. In the remainder of this paper we therefore give a brief overview of the data sets used (Section 2) and the various evaluation tasks (Section 3). In Section 4 we present an overview of the results and discuss some of the outcomes and potential implications for further evaluations.
Further details on the task definitions and data sets can be found in the CHIL and VACE evaluation plans [9], [10] and on the CLEAR webpage [11].
2 Datasets Used in CLEAR 2006
2.1 The CHIL Seminar Database
A large multimodal database has been collected by the CHIL consortium for the CLEAR 2006 evaluation, consisting of audiovisual recordings of regular lecture-like seminars and interactive small working group seminars. For some of the interactive seminars, scripts were used in order to elicit certain activities (e.g., opening doors, taking a coffee break), which were to be automatically detected in one or more evaluation tasks (e.g., acoustic event detection). The database contains audio and video recording segments from 47 seminars recorded at the following sites:

– AIT, Athens, Greece
– IBM, New York, USA
– ITC-IRST, Trento, Italy
– UKA, Karlsruhe, Germany
– UPC, Barcelona, Spain
These seminars were given by students and lecturers of each institution or by invited speakers on topics concerning technologies involved in the CHIL project, such as speech recognition, audio source localization, audio scene analysis, video scene analysis, person identification and tracking, etc. The language is English, spoken mostly by non-native speakers. A detailed description of the CLEAR database can be found in [9].

Non-interactive Seminars Versus Interactive Seminars

– Non-interactive seminars were provided by ITC-IRST and UKA. These seminars consist of lecture-like presentations in a small seminar room. One presenter is talking in front of an audience of 10 to 20 people. In these recordings, the focus is mainly on the presenter (he is the only one wearing a close-talking microphone, moving around, etc.). As a consequence, only the presenter has been annotated for the different tasks using this database (tracking, identification, etc.). An example of non-interactive seminars is given by the first two pictures in Fig. 1.
– Interactive seminars were recorded by AIT, IBM and UPC. The recording room is a meeting room and the audience is made up of only 3 to 5 people. The attendees are sitting around a table and are wearing close-talking microphones, just like the presenter. There is a higher degree of interaction between the presenter and the audience. During and after the presentation, there are questions from the attendees with answers from the presenter. Moreover, there is also activity in terms of people entering or leaving the room, and opening and closing the door. AIT and UPC seminars have been scripted in order
to elicit certain activities (e.g., opening doors, taking a coffee break). These activities were to be automatically detected in one or more evaluation tasks (e.g., acoustic event detection). The last 3 pictures of Fig. 1 show examples of interactive seminars.
Fig. 1. Scenes from non-interactive and interactive seminars (panels: ITC-IRST, AIT, UKA, IBM, UPC)
Data Description

– Raw data: Each seminar is composed of synchronized audio and video streams. The video streams consist of 4 to 5 JPEG sequences recorded at 15 to 30 frames per second by four fixed corner cameras and a ceiling camera. Audio is recorded using a great variety of sensors. High-quality close-talking microphones are used by every participant in interactive seminars and by the presenter only in non-interactive seminars. In addition, omnidirectional table-top microphones and several T-shaped arrays are used for far-field recordings. All these microphones are synchronized at the sample level by a dedicated sound card. Moreover, far-field recordings are also captured by a NIST MarkIII 64-channel microphone array. Fig. 2 shows an example of a recording room setup.
– Audio transcription: For a single audiovisual data element (a seminar), two transcriptions were produced. The first one is the speaker transcription, which contains the speech utterances of all intervening speakers, including human-generated noises accompanying speech. This is done by transcribing the close-talking microphone recording of the main speaker. The second one is the environment transcription, which contains all noises not produced by the
Fig. 2. Example recording room setup (source: UKA)
speaker(s). Environment transcriptions are realized on far-field recordings. All environmental noises (human and non-human) and all speaker utterances are transcribed. Both transcriptions were produced with Transcriber [12] and are in native XML format. Acoustic event noise annotations are made on the far-field recordings with the AGTK annotation tool [13]. This tool enables the annotation of overlapping noises in a simple XML format.
– Video labels: Video annotations were realized using an in-house developed tool. This tool sequentially displays the video frames to be annotated, for the 4 corner cameras. On each displayed picture, the annotator was to click on the head centroid (the estimated centre of the head), the left eye, right eye, and nose bridge of the annotated person. In addition to these four points, a rectangular face bounding box was used to delimit the person's face. These annotations were done on the lecturer for non-interactive seminars and on each participant for interactive seminars. The 2D coordinates within the camera planes were interpolated among all cameras in order to compute the real "ground truth" location of the speaker within the room. Fig. 3 shows an example of video labeling. Displayed are the head centroid, the left eye, the nose bridge, the right eye and the face bounding box.

Development Data. The development data is made of segments used in previous CHIL evaluations and of new seminars provided by new recording sites. 17 seminars from UKA used in the first CHIL evaluation and the NIST Rich Transcription
Fig. 3. Example of video annotations
2005 evaluation were used as development data for CLEAR 2006. For each UKA seminar, two segments of 5 minutes each were used. The first one is taken from the talk of the presenter and the other one is selected from the question-answering session at the end of the talk. The second segment usually contains more spontaneous speech and involves more speakers than the first one. In addition to the UKA seminars, around 1 hour of data coming from AIT, IBM, ITC-IRST and UPC was added to the development set. The first 15 minutes of the first seminar recorded by each site were used. In total, the development set duration is 204 minutes, with 80 % non-interactive seminars and 20 % interactive seminars. This imbalance is mainly due to the fact that only 3 interactive seminars had been recorded and labeled at the time the development set was released. Table 1 gives an overview of the composition of the development set.

Evaluation Data. As for the development set, the evaluation set is composed of segments from interactive and non-interactive seminars. Due to the availability of more data recorded at each site, the evaluation data is much more balanced between interactive and non-interactive seminars. The total duration of the CLEAR'06 evaluation set is 190 minutes, of which 14 seminars, representing 68 %, are non-interactive and 12 seminars, representing 32 %, are interactive. Table 2 gives an overview of the composition of the evaluation set.

2.2 VACE Related Databases
For tasks coordinated and led by the VACE community, the evaluations were conducted using two main databases, the Multi-Site Meetings and the i-LIDS Surveillance data (see Table 3).
Table 1. The CLEAR'06 development set

Site      Type             Number  Total length (in minutes)
ITC-irst  non-interactive  1       15
UKA       non-interactive  17      148
AIT       interactive      1       13
IBM       interactive      1       15
UPC       interactive      1       13
TOTAL                      21      204
Table 2. The CLEAR'06 evaluation set

Site      Type             Number  Total length (in minutes)
ITC-irst  non-interactive  2       10
UKA       non-interactive  12      120
AIT       interactive      4       20
IBM       interactive      4       20
UPC       interactive      4       20
TOTAL                      26      190
Table 3. The VACE related databases

Data                 Raw Data  Training                   Evaluation
Multi-Site Meetings  ≈ 160GB   50 Clips (Face)            45 Clips (Face)
i-LIDS Surveillance  ≈ 38GB    50 Clips (Person)          50 Clips (Person)
i-LIDS Surveillance  ≈ 38GB    50 Clips (Moving Vehicle)  50 Clips (Moving Vehicle)
All the raw data is in MPEG-2 format with either 12 or 15 I-frame rate encoding. The annotations were done by VideoMining using the ViPER tool developed by UMD. The Multi-Site Meetings are composed of datasets from different sites, samples of which are shown in Fig. 4:

1. CMU (10 Clips)
2. EDI (10 Clips)
3. NIST (10 Clips)
4. TNO (5 Clips)
5. VT (10 Clips)
Each site has its own independent camera setup, with different illuminations, viewpoints, people and topics in the meetings. Most of these datasets also featured High-Definition (HD) recordings, but they were subsequently converted to the MPEG-2 standard for evaluation purposes. Fig. 5 shows an example of the recording room setup for the NIST meeting data collection laboratory. The room has seven HD cameras, the table has one quad microphone and three omni-directional
Fig. 4. Scenes from Multi-Site Meetings (panels: (a) CMU, (b) EDI, (c) NIST, (d) TNO, (e) VT)
microphones. Each meeting room participant is equipped with one wireless lapel mic and one head-mounted mic. The room is equipped with both traditional and electronic whiteboards as well as a projector for presentations. All cameras are synchronized using the NIST Smart Data Flow synchronization software. For more details on the individual room setups for all the sites, please refer to [14]. Specific annotation and labeling details can be found in Section 3.4. i-LIDS is a video surveillance dataset that has been developed by the United Kingdom Government as a "benchmark for video-based detection systems" [15]. VACE has obtained permission to use this data for its person and vehicle detection and tracking evaluations. The dataset for the CLEAR evaluation includes outdoor views of roadways with walking paths. Though night scenes were available in the training set, the actual evaluation was limited to day scenes. The dataset was composed of two different scenes with various shapes and sizes of vehicles and people, making for a challenging evaluation task. Specific annotation and labeling details for a person or vehicle in the video can be found in Sections 3.5 and 3.6.

2.3 Other Databases
In addition to the two main databases mentioned above, specific datasets attuned to the head pose estimation and the acoustic scene analysis tasks were also used in the CLEAR’06 evaluation. These databases will be explained in more detail together with the corresponding task descriptions in section 3.
Fig. 5. Example recording room setup (source: NIST)
3 CLEAR Tasks and Metrics
This section gives an overview of the different tasks evaluated in the CLEAR'06 evaluation. Three main databases were used: The first is a series of recordings made in CHIL smartrooms, using a wide range of synchronized sensors, and useful for multimodal analysis in indoor environments. The second, originally used for the VACE tasks, comprises a set of single camera surveillance videos used for visual outdoor detection and tracking scenarios. The third is a set of multi-camera meeting room recordings used mainly for face detection tasks (see Section 2 for details on the used data sets). The CLEAR tasks can be broken down into four main categories: tracking tasks, identification tasks, head pose estimation tasks and acoustic scene analysis tasks. Table 4 shows the different CLEAR tasks.

3.1 3D Single Person Tracking
One of the main tasks in the 2006 CLEAR evaluation, in terms of participation, was the 3D single person tracking task. The task definition was inherited from previous evaluations made in the CHIL project. The objective was to track a presenter giving a talk in front of an audience in a small seminar room (see Fig. 6). The database to be evaluated on consisted of recordings made at two CHIL sites, UKA and ITC-IRST, with different room sizes and layouts, but with a common sensor setup. The video streams from the four corner cameras of the room and
Table 4. Overview of CLEAR'06 tasks

Task name                               Organizer  Database
Tracking
  3D Single Person Tracking (A,V,AV)    CHIL       Non-interactive Seminars
  3D Multi-Person Tracking (A,V,AV)     CHIL       Interactive Seminars
  2D Face Detection & Tracking (V)      CHIL/VACE  All Seminars / Multi-Site Meetings
  2D Person Tracking (V)                VACE       Surveillance Data
  Vehicle Tracking (V)                  VACE       Surveillance Data
Person Identification (A,V,AV)          CHIL       All Seminars
Head Pose Estimation (V)                CHIL       Seminars (1), Pointing04 DB
Acoustic Scene Analysis
  Acoustic Event Detection              CHIL       Isolated Events, UPC Seminars
  Acoustic Environment Classification   CHIL       AATEPS corpus
the audio streams from the four T-shaped arrays and the MarkIII microphone array were available to do the tracking. In addition to the raw data, only the calibration information for the cameras and the locations of the microphones could be used. No explicit knowledge about the initial position of the presenter, the location of the whiteboard, of the room doors, of the audience, etc. was provided. However, participants were able to tune their systems on data from a separate development set, showing different seminars recorded in the same rooms. Whereas in earlier CHIL evaluations the visual and acoustic tracking tasks were evaluated separately, here, for the first time, it was possible to compare the performance of trackers from both modalities, through the use of common datasets and metrics. A multimodal tracking task was also newly introduced, where the combined audio-visual streams could be used. As opposed to the CLEAR 2D person tracking task, or similar tasks from other evaluations, such as e.g. PETS [6], where the objective is typically to track the position or bounding box of moving objects in 2D images, the objective here was to track the actual location of a person in a room coordinate frame (typically with the origin at one of the bottom corners of the room and the axes parallel to the walls). This is possible because the CHIL seminar recordings offer 4 overlapping, synchronized and calibrated camera views, allowing for video triangulation, and at least 4 sets of microphone arrays, allowing for precise sound source localization. As it was not intended to track specific body regions, such as the head or the feet, a person's position was defined as his or her x,y-coordinates on the ground plane. This proved a reasonable approximation, usable for both standing and sitting persons, and allowed evaluating all types of trackers across modalities. The ground truth person locations for error calculations were obtained from manual annotation of the video streams. In each of the four corner camera streams, the presenter's head centroid was marked. Using calibration information,

(1) For this task, a number of non-interactive seminars, which were recorded in 2004, were annotated and used. These seminars, however, were not part of the dataset used for the tracking and identification tasks.
Fig. 6. Example scene from a UKA seminar recording (panels: (a) cam1, (b) cam2, (c) cam3, (d) cam4)
these 2D positions were triangulated to obtain the 3D head position, which was then projected to the ground to yield the person's reference position. If the presenter's head was not visible in at least 2 camera views, the frame was left unmarked. Note that due to this annotation scheme, slight errors could be introduced in the labeled positions, e.g. when the presenter bends forward to change his presentation slides. Nevertheless, the annotation of the head centroid was found to be the easiest, most precise, and least error-prone for this kind of task. To further reduce the cost of annotations, it was chosen to label video frames only at intervals of 1 s (i.e. every 15, 25, or 30 frames, depending on the actual framerate of the recording). Tracking systems could be run using all video frames and audio samples, but were to be evaluated only on labeled frames. This helped reduce the cost of evaluation dramatically, with only little impact on the accuracy of results. For the acoustic tracking task, an additional restriction was made. The evaluation of tracking performance was to be decoupled from that of speech detection and segmentation. That is why acoustic tracking systems, although run continuously on all data, were evaluated only on segments of non-overlapping speech where the presenter is speaking and no greater source of noise (e.g. clapping) is audible. These segments were defined by manual annotation.
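For illustration, the following Python sketch shows how a 3D head position can be triangulated from the 2D head-centroid annotations of several calibrated views and then projected onto the ground plane. This is a generic linear (DLT) triangulation, not the annotation tool actually used in CLEAR, and it assumes 3x4 projection matrices given in room coordinates with the z-axis pointing upwards.

```python
import numpy as np

def triangulate_head(points_2d, proj_mats):
    """Linear (DLT) triangulation of one 3D point from two or more calibrated views.

    points_2d: list of (u, v) head-centroid annotations, one per camera.
    proj_mats: list of 3x4 camera projection matrices, in the same order.
    Returns the 3D head position in room coordinates.
    """
    rows = []
    for (u, v), P in zip(points_2d, proj_mats):
        P = np.asarray(P, dtype=float)
        rows.append(u * P[2] - P[0])  # each view contributes two linear
        rows.append(v * P[2] - P[1])  # constraints on the homogeneous point
    A = np.stack(rows)
    # The solution is the right singular vector of A with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # de-homogenize -> (x, y, z)

def ground_reference(points_2d, proj_mats):
    """Drop the triangulated head position onto the ground plane (z = 0),
    yielding the (x, y) reference position used for scoring."""
    x, y, _ = triangulate_head(points_2d, proj_mats)
    return x, y
```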
For the multimodal tracking task, two separate conditions were defined, to offer better comparability to the visual and acoustic tracking tasks. In condition A, multimodal tracking systems were evaluated on segments of non-overlapping speech only, just as in the acoustic task. This could serve to measure what increase in precision the addition of the visual modality would bring to acoustic tracking, given an already accurate speech segmentation. In condition B, they were evaluated on all labeled time frames, as in the visual task, regardless of whether the speaker was active or not. This served to measure the enhancement brought by the fusion of modalities in the general case. The metrics used to evaluate single person tracking performance are explained in Section 3.3, and the results for all subtasks and conditions are summarized in Section 4.1.

3.2 3D Multi-person Tracking
As opposed to the 3D single person tracking task, where only the main speaker had to be accounted for, ignoring the audience, the objective in the 3D multiperson tracking task is to simultaneously track all the participants in a small interactive meeting. To this effect, a set of recordings was made at three CHIL sites, IBM, UPC, and AIT, with a slightly modified scenario involving 4 to 6 people (see Fig. 7). While there is still a main speaker presenting a topic to the other participants, there is much more interaction as participants take turns asking questions or move around while entering the room or during coffee breaks. These recordings proved quite challenging compared to the single person tracking task due to the number of persons to track, the relatively small size of the meeting rooms and the high variability of the scenario. The same sensor setup as for single person tracking was used. Additionally, video streams from a ceiling mounted panoramic camera were available. The annotations were also made in the same manner, with the exception that for each time frame, the head centroids of all participants were labeled. In contrast to single person tracking, the definition of the multi-person tracking task is quite dependent on the chosen modality. For visual tracking, the objective is to track every participant of the interactive seminar for all labeled frames in the sequence. For the acoustic tracking task, on the other hand, the objective was to track only one person at a time, namely the active speaker, because tracking during overlapping speech was considered to be too difficult at this time. While in single person tracking, this was limited to the presenter, here it could also be anyone in the audience. Systems are evaluated only on manually defined segments of non-overlapping speech with no considerable noise sources. For multimodal tracking, again, two conditions were introduced: In condition A, the objective is to audio-visually track only one person at each point in time, namely the active speaker. This is best comparable to the acoustic tracking task, and is evaluated only on manually defined active speech segments. In condition
Fig. 7. Example scene from a UPC interactive seminar recording (panels: (a) cam1, (b) cam2, (c) cam3, (d) cam4, (e) cam5)
B, the goal is to track all persons in all labeled time frames using streams from both audio and visual modalities. Evaluating the performance of systems for tracking multiple persons, and allowing for comparative results across modalities and tasks, required the definition of a specialized set of metrics. These same metrics are also used in single person tracking, and in modified form in most other tracking tasks. They are explained in detail in Section 3.3. The results for the 3D multi-person tracking task are summarized in Section 4.2.

3.3 Multiple Object Tracking Metrics
Defining measures to express all of the important characteristics of a system for continuous tracking of multiple objects is not a straightforward task. Various measures, all with strengths and weaknesses, currently exist and there is no consensus in the tracking community on the best set to use. For the CLEAR workshop, a small expressive set of metrics was proposed. In the following, these metrics are briefly introduced and a systematic procedure for their calculation is shown. A more detailed discussion of the metrics can be found in [16].

The MOT Precision and Accuracy Metrics. For the design of the CLEAR multiple object (person) tracking metrics, the following criteria were followed:
– They should allow one to judge a tracker's precision in determining exact object locations.
– They should reflect its ability to consistently track object configurations through time, i.e. to correctly trace object trajectories, producing exactly one trajectory per object (see Fig. 8).

Additionally, we expect useful metrics

– to have as few free parameters, adjustable thresholds, etc., as possible, to help make evaluations straightforward and keep results comparable;
– to be clear, easily understandable, and to behave according to human intuition, especially in the occurrence of multiple errors of different types or of an uneven distribution of errors throughout the sequence;
– to be general enough to allow comparison of most types of trackers (2D or 3D trackers, acoustic or visual trackers, etc.);
– to be few in number and yet expressive, so they may be used e.g. in large evaluations where many systems are being compared.

Based on the above criteria, we define a procedure for the systematic and objective evaluation of a tracker's characteristics. Assuming that for every time frame $t$ a multiple object tracker outputs a set of hypotheses $\{h_1 \ldots h_m\}$ for a set of visible objects $\{o_1 \ldots o_n\}$, we define the procedure to evaluate its performance as follows: Let the correspondence between an object $o_i$ and a hypothesis $h_j$ be valid only if their distance $dist_{i,j}$ does not exceed a certain threshold $T$ (for CLEAR'06, $T$ was set to 500 mm), and let $M_t = \{(o_i, h_j)\}$ be a dynamic mapping of object-hypothesis pairs. Let $M_0 = \{\}$. For every time frame $t$:

1. For every mapping $(o_i, h_j)$ in $M_{t-1}$, verify if it is still valid. If object $o_i$ is still visible and tracker hypothesis $h_j$ still exists at time $t$, and if their distance does not exceed the threshold $T$, make the correspondence between $o_i$ and $h_j$ for frame $t$.
2. For all objects for which no correspondence was made yet, try to find a matching hypothesis. Allow only one-to-one matches. To find optimal correspondences that minimize the overall distance error, Munkres' algorithm is used. Only pairs for which the distance does not exceed the threshold $T$ are valid. If a correspondence $(o_i, h_k)$ is made that contradicts a mapping $(o_i, h_j)$ in $M_{t-1}$, replace $(o_i, h_j)$ with $(o_i, h_k)$ in $M_t$. Count this as a mismatch error and let $mme_t$ be the number of mismatch errors for frame $t$.
3. After the first two steps, a set of matching pairs for the current time frame is known. Let $c_t$ be the number of matches found for time $t$. For each of these matches, calculate the distance $d_{i,t}$ between the object $o_i$ and its corresponding hypothesis.
4. All remaining hypotheses are considered false positives. Similarly, all remaining objects are considered misses. Let $fp_t$ and $m_t$ be the number of false positives and misses, respectively, for frame $t$. Let also $g_t$ be the number of objects present at time $t$.
Fig. 8. Matching multiple object tracks to reference annotations
5. Repeat the procedure from step 1 for the next time frame. Note that since for the initial frame the set of mappings $M_0$ is empty, all correspondences made are initial and no mismatch errors occur.

Based on the matching strategy described above, two very intuitive metrics can be defined: the Multiple Object Tracking Precision ($MOTP$), which shows the tracker's ability to estimate precise object positions, and the Multiple Object Tracking Accuracy ($MOTA$), which expresses its performance at estimating the number of objects and at keeping consistent trajectories:

$$MOTP = \frac{\sum_{i,t} d_{i,t}}{\sum_t c_t} \qquad (1)$$

$$MOTA = 1 - \frac{\sum_t (m_t + fp_t + mme_t)}{\sum_t g_t} \qquad (2)$$

The $MOTA$ can be seen as composed of 3 error ratios:

$$m = \frac{\sum_t m_t}{\sum_t g_t}, \qquad fp = \frac{\sum_t fp_t}{\sum_t g_t}, \qquad mme = \frac{\sum_t mme_t}{\sum_t g_t},$$

the ratios of misses, false positives and mismatches in the sequence, computed over the total number of objects present in all frames. For the current run of CLEAR evaluations, it was decided that for acoustic tracking, it was not required to detect speaker changes or to track speaker identities through time. Therefore, the measurement of identity mismatches is not meaningful for these systems, and a separate measure, the $A\text{-}MOTA$, is computed by ignoring mismatch errors in the global error computation:

$$A\text{-}MOTA = 1 - \frac{\sum_t (m_t + fp_t)}{\sum_t g_t} \qquad (3)$$

The above described $MOTP$ and $MOTA$ metrics were used in slightly modified form throughout the CLEAR tracking tasks and proved very useful for large-scale comparisons of tracker performance across tasks and modalities.
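To make the scoring procedure concrete, here is a simplified Python sketch of the per-frame matching and the MOTP/MOTA/A-MOTA accumulation. It is not the official CLEAR scoring tool: it uses SciPy's linear_sum_assignment as the Munkres implementation, assumes hypothetical per-frame dictionaries mapping IDs to (x, y) positions in millimeters, and glosses over some bookkeeping details of the full metric.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian/Munkres algorithm

def score_sequence(frames, threshold=500.0):
    """frames: iterable of (objects, hypotheses) pairs, each a dict id -> (x, y) in mm.
    Returns (MOTP, MOTA, A-MOTA) for the whole sequence."""
    mapping = {}  # object id -> hypothesis id, carried over from the previous frame
    dist_sum = matches = misses = false_pos = mismatches = total_objects = 0

    for objects, hypos in frames:
        total_objects += len(objects)
        # Step 1: keep previous correspondences that are still within the threshold.
        frame_map = {}
        for o, h in mapping.items():
            if o in objects and h in hypos:
                if np.linalg.norm(np.subtract(objects[o], hypos[h])) <= threshold:
                    frame_map[o] = h
        # Step 2: optimally match the remaining objects and hypotheses.
        free_o = [o for o in objects if o not in frame_map]
        free_h = [h for h in hypos if h not in frame_map.values()]
        if free_o and free_h:
            cost = np.array([[np.linalg.norm(np.subtract(objects[o], hypos[h]))
                              for h in free_h] for o in free_o])
            for i, j in zip(*linear_sum_assignment(cost)):
                if cost[i, j] <= threshold:
                    o, h = free_o[i], free_h[j]
                    if o in mapping and mapping[o] != h:
                        mismatches += 1  # the track switched to another hypothesis
                    frame_map[o] = h
        # Steps 3 and 4: distances for the matches, then misses and false positives.
        for o, h in frame_map.items():
            dist_sum += np.linalg.norm(np.subtract(objects[o], hypos[h]))
        matches += len(frame_map)
        misses += len(objects) - len(frame_map)
        false_pos += len(hypos) - len(frame_map)
        mapping = frame_map  # step 5: continue with the next frame

    motp = dist_sum / matches if matches else float("nan")
    mota = 1.0 - (misses + false_pos + mismatches) / total_objects
    a_mota = 1.0 - (misses + false_pos) / total_objects  # ignores mismatch errors
    return motp, mota, a_mota
```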
Multiple Object Detection Precision and Accuracy. In contrast to the point-wise distance metric described above, for the Multiple Object Detection Precision (MODP) the spatial overlap information between the ground truth and the system output is used to compute an Overlap Ratio as defined in Eq. 4. Here, the notation $G_i^{(t)}$ denotes the $i$th ground truth object in the $t$th frame and $D_i^{(t)}$ denotes the detected object for $G_i^{(t)}$.

$$Overlap\ Ratio = \sum_{i=1}^{N_{mapped}^{(t)}} \frac{|G_i^{(t)} \cap D_i^{(t)}|}{|G_i^{(t)} \cup D_i^{(t)}|} \qquad (4)$$
A threshold of 0.2 for the spatial overlap is used, primarily to compute the misses and false alarms (required while computing the MODA measure). Using the assignment sets, the Multiple Object Detection Precision (MODP) is computed for each frame $t$ as:

$$MODP(t) = \frac{Overlap\ Ratio}{N_{mapped}^{(t)}} \qquad (5)$$
where $N_{mapped}^{(t)}$ is the number of mapped object sets in frame $t$. This gives us the localization precision of objects in any given frame, and the measure can also be normalized by taking into account the total number of relevant evaluation frames. If $N_{mapped}^{(t)} = 0$, then the MODP is forced to a zero value.

$$N\text{-}MODP = \frac{\sum_{t=1}^{N_{frames}} MODP(t)}{N_{frames}} \qquad (6)$$
The thresholded approach for the Overlap Ratio is meant to minimize the importance of the spatial accuracy. The N-MODP hence gives the localization precision for the entire sequence. The Multiple Object Detection Accuracy (MODA) serves to assess the accuracy aspect of system performance. Here, only the missed counts and false alarm counts are used. Assuming that in each frame $t$ the number of misses is indicated by $m_t$ and the number of false positives by $fp_t$, the Multiple Object Detection Accuracy (MODA) can be computed as:

$$MODA(t) = 1 - \frac{c_m(m_t) + c_f(fp_t)}{N_G^{(t)}} \qquad (7)$$
where $c_m$ and $c_f$ are the cost functions for the missed detection and false alarm penalties. These cost functions are used as weights and can be varied based on the application at hand. If misses are more critical than false alarms, $c_m$ can be increased and $c_f$ reduced. $N_G^{(t)}$ is the number of ground truth objects in the $t$th frame. The N-MODA, the normalized MODA for the entire sequence, is computed as:

$$N\text{-}MODA = 1 - \frac{\sum_{i=1}^{N_{frames}} \left( c_m(m_i) + c_f(fp_i) \right)}{\sum_{i=1}^{N_{frames}} N_G^{(i)}} \qquad (8)$$
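The detection metrics of Eqs. 4-8 can be summarized in a few lines of Python. The sketch below assumes axis-aligned boxes given as (x1, y1, x2, y2) and, as in the text, a fixed 0.2 overlap threshold would be applied beforehand when deciding which ground-truth/output pairs count as mapped; it is an illustration of the formulas, not the official VACE scoring code.

```python
def overlap_ratio(box_a, box_b):
    """Spatial overlap (intersection over union) of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def modp(mapped_pairs):
    """Eq. 5: mean overlap ratio over the mapped ground-truth/output pairs of a frame."""
    if not mapped_pairs:
        return 0.0  # MODP is forced to zero when nothing is mapped
    return sum(overlap_ratio(g, d) for g, d in mapped_pairs) / len(mapped_pairs)

def n_modp(per_frame_modp):
    """Eq. 6: sequence-level MODP, averaged over all evaluated frames."""
    return sum(per_frame_modp) / len(per_frame_modp)

def moda(misses, false_pos, n_ground_truth, c_m=1.0, c_f=1.0):
    """Eq. 7: per-frame detection accuracy with cost weights c_m and c_f."""
    return 1.0 - (c_m * misses + c_f * false_pos) / n_ground_truth

def n_moda(per_frame_misses, per_frame_fps, per_frame_gt, c_m=1.0, c_f=1.0):
    """Eq. 8: sequence-level MODA, normalized by the total ground-truth count."""
    penalty = sum(c_m * m + c_f * f for m, f in zip(per_frame_misses, per_frame_fps))
    return 1.0 - penalty / sum(per_frame_gt)
```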
Differences in the VACE Detection and Tracking Metrics. In November 2005, the evaluation teams from the CHIL and VACE projects each had their own sets of individual metrics. It was decided that, in order to harmonize the CLEAR evaluation tasks, the metrics also had to be harmonized. In the CHIL project, the tracking metrics used were:

– MOTP (Multiple Object Tracking Precision)
– MOTA (Multiple Object Tracking Accuracy)

On the other hand, the VACE side used the following detection and tracking metrics:

– SFDA (Sequence Frame Detection Accuracy) for detection
– ATA (Average Tracking Accuracy) for tracking

as well as a whole set of diagnostic metrics to measure individual components of the performance. The key differences between the MODP/A and MOTP/A metrics used in VACE-related tasks and the standard MOTP/A used in CHIL-related tasks are:

– The metrics use the spatial overlap component instead of the distance. We believe that for this evaluation we can keep this additional dimensionality.
– The mapping differs in that a Hungarian matching is used to map ground truth and system output boxes, and this again uses the spatial component (i.e. the spatial overlap between two objects). The idea is to maximize the metric score based on these individual components.

3.4 2D Face Detection and Tracking
The goal of this evaluation task was to measure the quality and accuracy of face detection techniques, both for meeting and for lecture scenarios. As opposed to the person tracking tasks, the objective here was not to estimate the trajectories of faces in real world coordinates, but rather to correctly detect as many faces as possible within the separate camera views. To this effect, no triangulation or 3D computation between views and no continuous tracking were required. The main difficulty - and at the same time the scientific contribution - of this task stems from the nature of the database itself. In the CLEAR seminar and meeting databases, face sizes are extremely small, in some cases down to 10x10 pixels, faces are rarely oriented towards the cameras, lighting conditions are extremely difficult, and faces are often partly occluded, making standard skin color segmentation or template matching techniques inapplicable. This drives the development of new techniques that can handle very difficult data recorded under realistic wide camera view conditions. As in the person tracking tasks, for the lecture scenario only the presenter's face was to be found, whereas for the interactive seminar and meeting scenarios, all faces had to be detected (see Fig. 9). A correct face detection should deliver not only the position of the face in the image, but also its extension, as this information can be valuable for subsequent
Fig. 9. Scenes from the Face Detection & Tracking database: (a) UKA seminar, (b) AIT interactive seminar
identification or pose estimation processes. The output of face detection systems is therefore the bounding boxes of detected faces, which are compared to manual annotations. The guidelines for annotating the face bounding boxes differed very slightly for the CHIL and VACE databases, resulting in somewhat larger face boxes in the CHIL data. Also, the criteria for considering a face as visible differed: whereas in the VACE data it depended on the visibility of at least one eye, the nose, and part of the mouth, in the CHIL data only the visibility of at least one eye or the nose bridge was necessary. For future evaluations, it is planned to harmonize the annotation guidelines, to produce more uniform databases. As for the person tracking task, a face label was created only for every second of video. To evaluate the performance of face detection and tracking algorithms, five measures were used: the percentage of correctly detected faces, wrong detections, and non-detected (missing) faces, the mean weighted error (in pixels) of the estimated face center, and the mean (face) extension accuracy. For a correctly detected face in a frame i, the mean weighted error is defined as

we_i = \frac{\lVert C_i^d - C_i^l \rVert_2}{R_i}

with C_i^d and C_i^l the centers of the detected and labeled faces respectively, and R_i the face size, calculated as the average of the vertical and horizontal face bounding box lengths. The mean extension accuracy is defined as

\frac{A\big( (BB^l \cup BB^d) - (BB^l \cap BB^d) \big)}{A(BB^l)},

the ratio of the area A(.) of the symmetric difference of the detected and labeled bounding boxes BB^d and BB^l with respect to the labeled bounding box BB^l. The resulting errors are averaged over all faces in all frames. The results of the face detection and tracking task, evaluated on the CHIL recording database, are presented in section 4.3.
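A minimal sketch of these two measures for axis-aligned boxes is given below; the (x, y, width, height) box representation is an assumption for illustration only.

```python
# Sketch of the two geometric face-detection measures described above, for
# axis-aligned boxes given as (x, y, width, height). Illustrative only.
import math

def weighted_center_error(det, lab):
    dx, dy, dw, dh = det
    lx, ly, lw, lh = lab
    cd = (dx + dw / 2.0, dy + dh / 2.0)   # detected face center
    cl = (lx + lw / 2.0, ly + lh / 2.0)   # labeled face center
    r = (lw + lh) / 2.0                   # face size: mean of box side lengths
    return math.hypot(cd[0] - cl[0], cd[1] - cl[1]) / r

def extension_accuracy(det, lab):
    # area of the symmetric difference of the two boxes, relative to the label
    def inter_area(a, b):
        ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
        return ix * iy
    a_det = det[2] * det[3]
    a_lab = lab[2] * lab[3]
    a_int = inter_area(det, lab)
    sym_diff = (a_det + a_lab - a_int) - a_int   # union minus intersection
    return sym_diff / a_lab
```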
In the VACE Multi-Site Meeting database, the face is marked horizontally bound to the extent of the eyes and vertically bound from just above the eyes to the chin. For a face to be annotated, at least one eye, part of the nose, and the lips must be visible. For specific annotation guidelines, please refer to [17]. The MODA/MODP metrics are used for detection and the MOTA/MOTP metrics for tracking.
3.5 2D Person Detection and Tracking
The goal of the person detection task is to detect persons in a particular frame, while for the tracking task it is to track persons in the entire clip. The annotation of a person in the Surveillance domain comprises the full extent of the person (completely enclosing the entire body including the arms and legs). Specific annotation details about how a person is marked are given in the annotation guidelines document [17].
Fig. 10. Sample annotation for a person in the Surveillance domain
Fig. 10 shows a sample person annotation. A person is annotated when at least 25 % of the person is visible. Each person is marked with a bounding box and each box has a rich set of attributes to enable sub-scoring if needed. For formal evaluations, though, the simplest setting is used: the person must be clearly visible (not occluded by any other object, e.g. by the branches of a tree). If a person walks behind a larger object, the annotation is stopped temporarily until the person is visible again. Depending on how long it takes for the person to re-appear, the objectID is maintained accordingly. The specific guidelines can be found in [17]. The metrics used are the MODA/MODP and the MOTA/MOTP.
3.6 Vehicle Tracking
The goal of the moving vehicle task is to track any moving vehicle in a given clip. During annotation, only vehicles that have moved at any time during the clip are marked. Vehicles that are completely stationary are not marked. Vehicles are annotated at the first frame where they move. For specific details about the annotations, please refer to the annotation guidelines document [17]. For a vehicle to be annotated, at least 25 % of the vehicle must be visible, and it is marked with a bounding box. Each box has a rich set of attributes, essentially recording whether the vehicle is currently moving and whether it is occluded (a vehicle is marked as occluded if it is more than 50 % occluded). Fig. 11 shows a sample vehicle annotation. For this evaluation, the simplest setting was used: the vehicle has to be moving and must be clearly visible (not occluded by other objects). In the i-LIDS dataset there are regions where vehicles are not clearly visible due to tree branches or where the sizes of vehicles are very small. These particular regions are marked accordingly and are not evaluated. Also, since this is purely a tracking task, the metrics used here are the MOTA and MOTP.
Fig. 11. Sample from the moving vehicle tracking in Surveillance domain
3.7 Person Identification
In a smart meeting or lecture room environment, where many sensors and perceptual components cooperate to provide rich information about room activities, the tracking algorithms presented in the previous sections can serve as building blocks, providing necessary person locations, aligned faces, or localized speech segments for subsequent identification processes. The goal of the CLEAR person identification task is to measure what identification accuracies can be reached, and how fast they can be reached, using only far-field microphones and cameras, assuming person locations are already well known (see Fig. 12).
Fig. 12. Sample from the CLEAR person identification database
For this purpose, in addition to the head centers and the face bounding boxes, three additional marks have been annotated in the video images: the positions of the left eye, the right eye, and the nose bridge. These labels serve to achieve an exact alignment and cropping of the face images necessary for face identification routines, clearly decoupling the identification task from the detection and tracking task. While all other features were marked for every second of video, the eye labels were produced every 200 ms, for better precision. As for the face detection task, one of the big challenges - and the novelty - of the CLEAR visual identification task comes from the database itself. The seminar videos contain extremely low resolution faces, down to (10x10) pixels, with eye distances ranging from 4 to 16 pixels, which are very difficult to detect with conventional techniques, let alone to identify. This is also why a decoupling from the tracking task becomes necessary if the performance of identification techniques alone is to be accurately measured. Similarly, the acoustic identification is to be made solely from far-field microphones, arrays and tabletops, which can be very distant from the speaker and include all kinds of room noises, murmurs, cross-talk, etc. The above-mentioned difficulties in the data led to a task definition requiring identification over time windows of varying length, as opposed to identification on single frames, allowing enough evidence for correct recognition to be accumulated. For CLEAR 2006, a closed set identification task was proposed. The data consisted of synchronized audio-visual segments cut out from the CHIL seminar recordings and containing in total 26 different subjects. In the seminar scenario, only the presenter was to be identified, whereas in the interactive seminar scenarios, recognition was to be done for all participants. For the visual task, the images from the four corner cameras could be used for identification; for the acoustic task, all the signals from the far-field microphones. In the multimodal task, all information from the audio-visual streams was available.
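Such eye labels are typically exploited by warping each camera view so that the eyes land on canonical positions before cropping. The sketch below shows one common way to do this with OpenCV; the canonical eye position, output size, and function name are illustrative assumptions, not part of the CLEAR specification.

```python
# Sketch: aligning a face crop from annotated eye positions, as commonly done
# before identification. The canonical eye geometry and output size are
# illustrative choices.
import math
import cv2

def align_face(image, left_eye, right_eye, out_size=(64, 64), eye_y=0.35, eye_dx=0.3):
    (lx, ly), (rx, ry) = left_eye, right_eye
    angle = math.degrees(math.atan2(ry - ly, rx - lx))      # in-plane rotation
    dist = math.hypot(rx - lx, ry - ly)
    target_dist = eye_dx * out_size[0]                      # desired eye distance
    scale = target_dist / dist
    center = ((lx + rx) / 2.0, (ly + ry) / 2.0)
    M = cv2.getRotationMatrix2D(center, angle, scale)
    # shift so the eye midpoint lands at the canonical position in the crop
    M[0, 2] += out_size[0] / 2.0 - center[0]
    M[1, 2] += eye_y * out_size[1] - center[1]
    return cv2.warpAffine(image, M, out_size)
```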
The data for the person identification task was partitioned into training (enrollment) and test segments of varying lengths, to assess the effect of temporal information fusion: for training, two conditions, A and B, with segment lengths of (15 and 30)s respectively, were evaluated. The test conditions comprised segments of (1, 5, 10 and 20)s, making it possible to measure the increase in recognition accuracy as more information becomes available. Identification systems are required to output one recognized ID per test segment, which is compared to the labeled identity. The error measure used is the percentage of wrongly recognized persons for all training and testing conditions. The results of the person identification task are presented and discussed in detail in section 4.6.
3.8 Head Pose Estimation
As for the person identification tasks, the main condition in the CLEAR head pose estimation task builds on the results of person and head detection techniques and aims at determining the head orientations of seminar or meeting attendees using only the information provided by room corner cameras. The head pose estimation task in CLEAR’06 was split into two conditions, based on two very different databases. The first is the INRIA 2004 Pointing Database, featuring studio quality close-up recordings of 15 persons providing 93 images each (see Fig. 13). The objective for this database is to determine the pan and tilt of the user’s head in still images. The reference annotations are made in 15 degree intervals in the range from −90◦ to +90◦, and the error measures used are the mean absolute error in pan and tilt, and the rate of correct classification to one of the discrete pan and tilt classes.
Fig. 13. Samples from the INRIA Pointing’04 Database
A more natural and challenging problem is addressed in the second condition. Here, the goal is to estimate the pan orientation of the presenter’s head in a CHIL seminar room using the room corner cameras (see Fig. 14). Again, the low resolution of heads in the camera views and the difficult lighting conditions, as well as the availability of multiple synchronized video streams are what make this task novel and challenging. The recordings consist of 12 training and 14 test
Fig. 14. Sample from the CHIL seminar recordings for head pose estimation
seminars recorded in the Karlsruhe seminar room, with a length of 18min to 68min each. The manual annotations are made for every tenth frame of video, and mark the presenter’s head orientation as belonging to one of 8 pan classes (north, north-west, west, south-west, ...), of 45◦ width each. The goal in this subtask is to continuously track the presenter’s horizontal viewing direction in the global room coordinate frame. As for the visual person identification task, the problem of estimating the head pose is decoupled from the head tracking problem by the availability of manually annotated head bounding boxes in the camera images. The error measures used are the mean absolute pan error and the correct classification rate into one of the eight pan classes. In addition, the classification rate into either the correct pan class or one of its neighboring classes (representing at most 90◦ absolute estimation error) is also measured. The results for the head pose estimation task can be found in section 4.7.
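A small sketch of how such class-based scoring could be computed from continuous pan angles is shown below; the class indexing and angle convention are illustrative assumptions.

```python
# Sketch: mapping continuous pan angles (degrees) to eight 45-degree-wide
# classes and scoring correct / neighbouring-class classification rates.
def pan_class(pan_deg, n_classes=8):
    width = 360.0 / n_classes
    # shift by half a class width so that class 0 is centred on 0 degrees
    return int(((pan_deg + width / 2.0) % 360.0) // width)

def pan_scores(hyp_pans, ref_pans, n_classes=8):
    correct = neighbour = 0
    for h, r in zip(hyp_pans, ref_pans):
        ch, cr = pan_class(h, n_classes), pan_class(r, n_classes)
        diff = min((ch - cr) % n_classes, (cr - ch) % n_classes)
        correct += diff == 0
        neighbour += diff <= 1   # correct class or one of its two neighbours
    n = len(ref_pans)
    return correct / n, neighbour / n
```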
3.9 Acoustic Event Detection and Classification
To gain a better understanding of the situations occurring in a room and of the activities of its occupants, the recognition of certain events can be very helpful. In particular, the detection of acoustic events, such as keyboard clicks, door slams, speech, applause, etc., in a meeting or seminar can be used to focus the attention of other systems on particular persons or regions, to filter the output of speech recognizers, to detect phases of user interaction, and so forth. The CLEAR acoustic event detection (AED) task aims at measuring the accuracy of acoustic detection systems for this type of scenario, using the input from wall-mounted or tabletop microphones. A total of 12 semantic classes are to be recognized: knock (door, table), door slam, steps, moving chair, spoon (cup jingle), paper wrapping, key jingle, keyboard typing, phone ringing/music, applause, coughing, and laughing.
Two additional classes, namely speech and an “unknown event” class, are also considered. Two types of databases are used in this task: one consisting of isolated events, where the goal is solely to achieve a high classification accuracy, and another consisting of scripted seminars recorded in UPC’s smart meeting room, where the goal is to detect the time of occurrence of an event, in addition to making a correct classification. For the subtask of isolated AED, only the isolated event database is used in training and testing. For the subtask of AED in real environments, both databases are used in training, and testing is made on dedicated segments of scripted seminars. The error metric used is the Acoustic Event Error Rate (AEER):

AEER = \frac{D + I + S}{N} \cdot 100
with D, I, and S the number of deletions, insertions, and substitutions respectively, and N the number of events to detect. Here, an event is considered correctly detected when its hypothesized temporal center is situated in the appropriate time interval of one or more reference events and the hypothesized and reference labels match. If none of the labels match, it is counted as a substitution error. An insertion error occurs when the hypothesized temporal center of the detected event does not coincide with any reference event’s time interval. A deletion error is counted when a reference event was not detected at all. Section 4.8 sums up the results for the acoustic event detection task and briefly describes the challenges and difficulties encountered.
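A minimal sketch of the AEER computation is given below; the matching of hypothesis temporal centers to reference intervals, which produces D, I and S, is assumed to have been performed already as described above.

```python
# Sketch of the AEER computation from error counts.
def aeer(deletions, insertions, substitutions, n_reference_events):
    return (deletions + insertions + substitutions) / n_reference_events * 100.0

def center_matches(hyp_center, ref_interval):
    # helper illustrating the matching rule: the hypothesized temporal center
    # must fall inside the reference event's time interval
    start, end = ref_interval
    return start <= hyp_center <= end

print(aeer(deletions=3, insertions=2, substitutions=1, n_reference_events=50))  # 12.0
```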
3.10 Acoustic Environment Classification
In contrast to the acoustic event detection task, where the recognition of small, temporally constricted acoustic events is aimed at, the goal in this task is to gain a high level understanding of the type of recording environment itself using audio information. This high level knowledge can be used to provide context awareness in mobile settings where large suites of sensors are not available. One example of an application where such knowledge is useful is the CHIL Connector [18] service, in which the environment is used as an information source to help a smart mobile telephone decide whether the user is available for communication. Knowledge of the environmental type can also be useful to boost the performance of basic perceptual algorithms, e.g., by providing appropriate preprocessing or context dependent grammars for speech recognition modules. In the CLEAR’06 evaluation, classification was tested on a fairly specific set of environments. These environments included airport, bus, gallery, park, restaurant, street, plaza, train, and train platform. Many of these environmental types are self-explanatory. “Gallery” refers to a large indoor space in which people gather, e.g., malls, museums, etc. “Street” is any urban outdoor space with streets dominated by vehicular traffic, while “plaza” refers to an urban outdoor space with streets dominated by pedestrian traffic, e.g., a city square or
outdoor marketplace. “Park” is an outdoor space not dominated by urban accoutrements. Finally, “train platform” refers specifically to that part of a train or subway station where passengers board and exit train cars. The environmental recording database used for this evaluation, the Ambient Acoustic Textures in Enclosed Public Spaces (AATEPS) corpus, consisted of a set of 10min audio recordings made with identical recording equipment in these environments; recordings were made in 2004 and 2005 in North America, Europe, Asia, and Africa. A total of 10.5 h of data, divided into 5s segments, was used in this evaluation, with 5400 segments used for training and 2160 for testing; half of the test segments were taken from recordings not part of the training set. Classification results attained in this evaluation are reported in section 4.9.
4 CLEAR Results and Lessons Learned
This section gives an overview of the CLEAR evaluation results and offers a brief discussion based on the attributes of the evaluated systems and the underlying problems in the tasks and databases. It also hints at future directions to be followed in the next evaluation run, based on the experience gained. For each of the CLEAR tasks and conditions, participants were asked to submit hypothesis files, which were then centrally scored against the reference ground truths. Sites could submit several sets of results for each task, coming from different systems, with the condition that there were basic differences in the concerned systems’ algorithms themselves, as opposed to simple differences coming from parameter tweaking. Because of the great number of evaluated systems, no deep insight into the individual approaches can be given here. The interested reader is referred to the individual system publications for details.
4.1 3D Single Person Tracking
The 3D single person tracking task attracted the greatest interest and participation. A total of 21 systems were evaluated for the different audio and visual conditions. This was due in part to the traditional nature of the task - person tracking - allowing for a great variety of approaches, from well known techniques to cutting edge algorithms, to be applied, even though the difficulty of the data and the availability of multiple sensors posed new challenges which demanded their share of innovation. The evaluation was made for 4 conditions, the acoustic, the visual, as well as two audio-visual conditions, and the systems were scored using the MOT metrics described in section 3.3. The common database and metrics allowed for an easier comparison of the advantages of different modalities for tracking on the realistic CLEAR data. Fig. 15 shows the results for acoustic tracking. As the systems are only scored on segments of active speech without noticeable noise, and there is only one target to track, the acoustic subtask very closely resembles a source localization problem, with the difference that the actual detection of speech is not being evaluated. For this reason, and for easier analysis of the results, two additional
error measures to the MOT metrics are shown in Fig. 15: The rate of misses caused by localization errors exceeding the 500mm threshold, and the rate of misses attributed to missing speaker hypotheses. Many techniques were presented, mostly based on the calculation of a generalized cross correlation (GCC) or global coherence field (GCF) function, accompanied by Kalman, particle, or data association filtering. The best overall result was achieved by a joint probabilistic data association filtering technique using as features the TDOA between microphone pairs. Overall, the MOTP measure shows that, given correct speech segmentation, very high localization accuracies of up to 14cm can be achieved. For comparison, the expected error in manual annotation of the speaker’s head is also of the order of 8-10cm. The MOTA measure, on the other hand, shows us that even for the best systems, in roughly 20 % of all cases the presenter is still to be considered missed. While for most systems, this stems from gross localization errors in problematic segments, for others it comes from the failure to produce a location hypothesis, hinting at where considerable improvements could still be achieved.
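To illustrate the family of correlation-based techniques mentioned above, the following is a minimal GCC-PHAT sketch for estimating the TDOA of a single microphone pair; it is not the implementation of any particular CLEAR system, and the interpolation and search-range handling are simplifying choices.

```python
# Minimal GCC-PHAT sketch for estimating the TDOA between two microphone
# signals (numpy arrays), illustrating the GCC family of techniques.
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None, interp=1):
    n = sig.size + ref.size
    S = np.fft.rfft(sig, n=n)
    R = np.fft.rfft(ref, n=n)
    cross = S * np.conj(R)
    cross /= np.abs(cross) + 1e-12          # phase transform weighting
    cc = np.fft.irfft(cross, n=interp * n)
    max_shift = interp * n // 2
    if max_tau is not None:
        max_shift = min(int(interp * fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(interp * fs)        # delay in seconds
```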
Fig. 15. Results for the acoustic single person tracking task
The results for the visual condition can be found in Fig. 16. Overall, they are quite high, showing a good match of the task definition to the current state of the art, and the appropriateness of the available sensors. The highest accuracy (91 %) was reached by a particle filter based approach using color and shape features acquired prior to tracking by a fully automatic procedure. The advantage of particle filters for this type of task is that they are robust to noise and allow a variety of features from several sensor streams to be integrated easily. Indeed, they have enjoyed a steady growth in popularity over
the past years due to their flexibility. The appearance model adopted here allows efficient particle scoring, resulting in a fast system appropriate for online applications. The best localization precision (a noteworthy 88mm), on the other hand, was reached by a joint 2D-face and 3D-head tracking system, using adaboost-trained classifier cascades for finding faces in the 2D images. Using faces and heads as the base for tracking, as opposed to full-body tracking, ensures that the system hypothesis is always very close to the annotated ground truth, which explains the high score. It also explains the somewhat higher miss rate, as faces cannot always be found in the corner camera images. This system illustrates another popular trend: the use of boosted classifier cascades based on simple features (as presented in [19]), which are trained for specific detection tasks and serve as high-confidence initialization steps in combination with other fast but less reliable tracking techniques. It may be useful to recall here that no background images of the empty room were supplied for this task, and no training was allowed on the test set itself, which made it hard to use foreground segmentation based techniques. The evaluation also revealed a problem in the visual tracking task definition itself, namely the loose definition of the tracking object. In some cases, it cannot be unambiguously decided which of the room occupants is the presenter without
Fig. 16. Results for the visual single person tracking task
using prior scene knowledge or accumulating enough tracking statistics. While this is a minor problem, it will most likely lead to changes in the tracking task definition or annotations in future evaluations.
2 Results submitted one month after the official deadline and printed here for completeness.
Figs. 17 and 18 show the results for the multimodal tracking task, conditions B and A. As a reminder, for this task the two multimodal conditions differ only in the data segments to be evaluated. In condition B, all time frames, whether they contain speech or not, were scored. For condition A, only the time frames in which the presenter is speaking, without loud noise or crosstalk, were scored. This is to better decouple the task from the speaker segmentation problem, accounting for the fact that single modality acoustic trackers are not usable in longer periods of silence. Compared to the visual tracking results, the numbers for multimodal condition B show no significant improvement. This should by no means imply that audio-visual fusion bears no advantages, but rather that for this type of scenario, with the current visual sensor coverage, the addition of acoustic features could not help maintain tracks in the rare events where visual tracking fails. In contrast, condition A shows that, considering only cases where both modalities are present, the addition of visual features helps improve performance, compared to acoustic tracking alone. For comparison, the best system for this task, a realtime-capable system using a particle filter framework, reached 90 % accuracy using both streams, and just 55 % and 71 % respectively using only the acoustic and visual streams. These examples also show us that a modified task description, e.g. limiting the number of available cameras or making automatic speech segmentation a requirement, or a slightly more complex scenario, might be advantageous in order to better measure the improvement audio-visual fusion can bring when single modalities fail more frequently.
Fig. 17. Results for the multimodal single person tracking task, condition B
Fig. 18. Results for the multimodal single person tracking task, condition A
In conclusion, the results for the single person tracking task overall were quite satisfying, although there is still room for improvement. Accounting for the lessons learned in this evaluation run, a move towards a more complex task definition and a shift away from scenarios involving the tracking of just one person becomes very likely in the future.
4.2 3D Multi-person Tracking
Compared to the single person case, the multi-person tracking task offers a variety of new challenges requiring different systems and strategies. As the number of tracking objects is no longer fixed, new techniques for determining person configurations, for deciding when to create or destroy a track, for avoiding track mismatches, merges, etc., have to be designed. Compared to the seminar recordings, which were used for the single person case, the scenarios in the interactive seminar database used here are also more challenging, including e.g. coffee breaks where all tracked persons move and interact in very close proximity. Five sites participated in the various subtasks, submitting a total of 11 acoustic and visual tracking systems. For the acoustic tracking subtask, the objective was quite similar to the single person case, in the sense that only one speaking person needs to be tracked at every point in time. As a consequence, the presented approaches did not differ significantly from the algorithmic point of view. The results are shown in Fig. 19.
Fig. 19. Results for the acoustic multi-person tracking task
On the whole, the scores were quite low, compared to the single person case. Except for the leading system, which reached 64 % accuracy and 16cm precision, all other results were well below expectations. While for the second ranking system, this again comes from a large number of missing hypotheses, for all other systems, the error lies in large inaccuracies in localization itself. The comparatively poor performance of systems can be attributed to several factors: In part it
comes from the difficult data itself, including very small rooms with severe reverberations, and in part from the interactive seminar scenario, including frequent speaker switches, coffee breaks, etc. The visual subtask, requiring the simultaneous tracking of all room occupants, posed a problem of much higher complexity. Three sites participated in the evaluation, which was split into two conditions: the main condition involved data from three sites, for which no previously recorded background images of the empty room were available. This made it much harder for trackers based on conventional foreground segmentation to acquire clean tracks. The second condition involved data from just two sites, for which such background images were supplied. In addition to the four room corner cameras, a ceiling-mounted panoramic camera, delivering a wide angle view of the room, was available. The results can be found in Figs. 20 and 21.
Fig. 20. Results for the visual multi-person tracking task (3-site dataset)
Fig. 21. Results for the visual multi-person tracking task (2-site dataset)
Despite the associated problems, all submitted systems were based on foreground segmentation features at the lower level, with the main differences in the higher level data fusion and tracking schemes. The leading system was a realtime-capable foreground blob tracking algorithm using just the single input stream from the top view camera. It reached 51 % and 63 % MOT accuracies for the two conditions respectively, with precisions of about 20cm. The other approaches were based on the fusion of multiple camera streams and the results revealed the still not satisfactorily solved problem of data association for such
highly cluttered scenes. Perhaps the extension of one of the probabilistic tracking schemes, which proved very effective in the single person tracking task, to the multi-person case will allow a jump in performance in the next evaluation runs. Another important observation is that for all systems the relative amount of track identity mismatches made over a complete recording sequence is very low, compared to other error types. Although this is explained in part by the nature of the data itself, with only a few crossing person tracks, it does considerably diminish the influence of the mismatch rate on the general MOTA score. This observation is likely to lead to a redefinition or modification of the metric for future evaluations, e.g. by the addition of separate weighting factors for the different error ratios. Fig. 22 shows the results for the audio-visual condition B, which is very similar to the visual tracking subtask, with the exception that acoustic information could be used opportunistically whenever available to increase the confidence in the currently active speaker’s track. All presented systems used decision level fusion on the outputs of single modality trackers. The figures show no significant increase compared to visual tracking alone, which can in part be explained by the low accuracies of the acoustic systems, and by the fact that usually only one of the multiple persons to track is speaking at any point in time, considerably decreasing the importance of audio features for the global tracking task.
Fig. 22. Results for the multimodal multi-person tracking task, condition B (2-site dataset)
The results for condition A, in contrast, are better suited for analyzing the effectiveness of data fusion techniques, as the importance of the single modalities for tracking is better balanced. Here, the objective is to track just the active speakers and to keep a correct record of their identities through time. The results, on the whole, stay relatively poor, due to the low performance of the acoustic component in most systems, which did not allow the correct speaker track to be filtered out, and of the visual component for the leading system. More work is no doubt required on the single modalities before a synergetic effect can be obtained for the combined systems. It would also be interesting to see if a robust feature level
Fig. 23. Results for the multimodal multi-person tracking task, condition A (2-site dataset)
fusion scheme, such as the ones presented in the single person tracking scenario, could lead to heightened performance. In conclusion, it may be said that the CLEAR multi-person scenario still poses a number of unmet challenges, which will keep driving cutting edge research on new and versatile techniques. Although the CLEAR 3D multi-person tracking task featured a novel and unconventional problem definition, the submitted results for this first evaluation run were in part very encouraging and the experience gained should prove valuable for future runs.
4.3 2D Face Detection and Tracking
Three sites participated in the face detection and tracking task, where the evaluation was performed separately for the single person seminar scenario and the multi-person interactive seminar scenario. The results can be seen in Fig. 24. For both conditions, the leading systems built on the use of boosted classifier
Fig. 24. Results for the 2D face detection and tracking task
cascades, specially trained for use on CHIL recordings, delivering initial detection hints which were then used by more elaborate multiple pass tracking and filtering techniques. For the seminar scenario, the same system as already presented in the 3D visual single person tracking task achieved the best scores, with a correct detection rate of 54 %, and moderate miss and false positive ratios. For the interactive seminar scenario, a three-stage system involving high acceptance detection, motion-based tracking, and track filtering achieved a remarkable 72 % correct detection, with relatively low miss and false positive ratios. In both cases, the average localization error was in the sub-pixel domain at under 0.2 pixels and face extension errors ranged from 96 to 141 pixels. When judging these numbers, one must bear in mind that these results are averages computed over several seminars featuring multiple faces of different sizes. Detection accuracy was in fact nearly perfect for larger faces, which were located close to the recording camera, while small, far away faces were very often missed. This also explains why systems run on the seminar database, involving only the presenter’s face, tended to produce somewhat lower scores: the presenter’s face in this database was rarely visible (meaning an eye or the nose bridge is visible) from the closer cameras and face sizes were typically very small. To better assess the effectiveness of face detection and tracking techniques in future evaluations, perhaps a categorization of the visual data into classes of increasing difficulty, with annotated face sizes as the selection criterion, and the separate scoring of results for each class could be a worthwhile extension to the task definition. Similar conclusions were obtained in the VACE run evaluations, the results of which are shown in Fig. 25. Smaller faces are harder to detect and track. The best score is about 71 %. Further analysis on how the sites performed on different datasets from the Multi–Site Meetings revealed that the data from VT was the hardest, possibly because faces were smaller in that set.
Fig. 25. Results for the 2D face detection task (Multi-Site Meetings3 )
4.4 2D Person Tracking
The results for the 2D person detection and tracking task are shown in Fig. 26. Four sites participated in this challenging evaluation and the best performance for both detection and tracking in terms of accuracy is about 42 %. The dataset is challenging, featuring persons of different sizes and different viewpoints.
3 Scoring differs slightly from the method presented in Section 3.3; please see [9,10].
Fig. 26. Results for the 2D Person Detection and Tracking task (Surveillance)
A sub-analysis using the person size as a parameter revealed that eliminating small objects gave a boost to the scores compared to including all sizes. In conclusion, it can be said that smaller persons are harder to detect. Also, performance on one particular viewpoint was much better compared to the other, possibly because of lighting condition differences.
4.5 Vehicle Tracking
The evaluation results for vehicle tracking in the Surveillance domain are shown in Fig. 27. The best performance for tracking in terms of accuracy is about 64 %.
Fig. 27. Results for the Moving Vehicle Tracking Task (Surveillance)
The dataset is challenging, featuring different viewpoints and vehicle sizes. A sub-analysis using the vehicle size as a parameter revealed that eliminating small objects gave a boost to the scores compared to including all object sizes. In conclusion, it can be said that smaller vehicles, with respect to the frame, are harder to detect and track. Performance on both viewpoints was about equal, in contrast to the 2D person detection and tracking evaluation (where performance on one was better than on the other). This could possibly be due to the fact that vehicles are in general bigger, with respect to the frame, most of the time.
4 Problems with extracting video frames.
4.6 Person Identification
Among the 2006 CLEAR tasks, the person identification task was no doubt one of the most complex to organize and carry out, from the point of view of database preparation and annotation, task definition, harmonization of acoustic and visual metrics, and weighting and fusion of multiple audio-visual information streams. Six different sites participated in the evaluation and a total of 12 audio and visual systems were presented. For the acoustic identification subtask, most systems built on Mel-frequency cepstral analysis of a single microphone stream, combined with filtering, warping or reverberation cancellation, to reduce environmental effects. Fig. 28 shows the results for the 15s and 30s training conditions.
Fig. 28. Error rates for the acoustic person identification task
In both cases, identification systems show a big drop in error rates from the 1s to the 5s testing conditions, followed by a steady decrease as more data becomes available. For the 30s train and 20s test condition, the best systems already reach 0 % error. This shows us that for a closed set identification task, with the current indoor seminar scenario and even using just one microphone, acoustic speaker identification can be a very powerful and robust tool. The next worthwhile challenge would be an open set task involving also the automatic detection and segmentation of speech from multiple persons, and the evaluation of identification hypotheses e.g. on a speaker turn basis. The visual identification task proved much harder for the participating sites, in spite of manual face annotations to alleviate the alignment problem. There were three main difficulties:
– The dataset contained many tiny faces; the median eye distance was just 9 pixels (see Fig. 29).
– There was no regularity in the number or visibility of faces in the (1, 5, 10, and 20)s test sets. This is because the visual data was segmented synchronously to the acoustic data, in view of the multimodal task, and a higher priority was put on producing segments containing speech. Due to this fact,
some small segments contained few or no usable frontal faces in any of the four available camera views. This problem is especially severe for the 1s tests: more than 10 % of them contained no usable frontal faces.
– The frequency of the provided labels (every second for face bounding boxes and nose bridges, and every 200 ms for eyes) proved insufficient for problem-free face alignment.
Three systems were submitted for this subtask and the results are shown in Fig. 30.
Fig. 29. Examples of frontal faces at various eye distances and histogram of the eye distances in the training and testing faces of the CLEAR database: (a) face samples, (b) eye distance histogram
Fig. 30. Error rates for the visual person identification task
The best system for the 15s training case used two classifiers (PCA and LDA) fused together with temporal confidence accumulation and reached 20 % error rate for the 20s test condition. The leading system for the 30s training case used a local appearance technique based on DCT features. It reached a minimum 16 % error rate. Both systems showed the expected steady decrease in error rates
as the test segment lengths increase, although minimum rates still stayed well above those reached using the acoustic modality. Fig. 31 shows the results for the combined audio-visual identification systems. Four different approaches were presented, all using decision-level fusion of single modality system outputs. As could be expected from the single modality results, the weighting of the two modalities played an important role, with systems favoring the acoustic side clearly outperforming those which assigned equal weights. The best system, which was not represented in the acoustic subtask, used a fusion scheme incorporating streams from multiple microphones in addition to temporal information. It reached remarkably low error rates of 0.56 % for the 20s test condition, in both the 15s and 30s training cases.
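A minimal sketch of such weighted decision-level fusion is shown below; the identity score format and the weights favoring the acoustic side are purely illustrative assumptions, not those of any participating system.

```python
# Sketch of decision-level fusion of per-identity scores from an acoustic and
# a visual classifier; the weights (favouring audio) are purely illustrative.
def fuse_decisions(audio_scores, video_scores, w_audio=0.7, w_video=0.3):
    # scores: dict mapping identity -> normalized confidence in [0, 1]
    ids = set(audio_scores) | set(video_scores)
    fused = {i: w_audio * audio_scores.get(i, 0.0) + w_video * video_scores.get(i, 0.0)
             for i in ids}
    return max(fused, key=fused.get)

print(fuse_decisions({"spk01": 0.8, "spk02": 0.1}, {"spk01": 0.4, "spk02": 0.5}))
# -> "spk01"
```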
Fig. 31. Error rates for the multimodal person identification task
In conclusion, it may be said that although acoustic identification techniques seem to outperform visual ones, this is largely due to the nature of the data at hand and the definition of the task. Perhaps the only fair way of comparing modalities would imply completely automatic detection and segmentation of speech in an open set for the acoustic side, and fully automatic tracking, alignment, and identification for the visual side. This would, however, also greatly increase the complexity of the tasks and the required metrics. Failing that, a careful selection of the database, with equal audio and visual complexities, and a redefinition of the multimodal task to better reflect the advantage of fusion during single modality failure, could also help reduce the apparent imbalance and drive the development of novel fusion techniques.
4.7 Head Pose Estimation
The two subtasks of the CLEAR head pose estimation task offered two very distinct levels of challenge to evaluation participants. While the frame based
estimation on the studio database, featuring close-up images, was the more conventional task which has been addressed before, the pose tracking task on the seminar database with multi-view low resolution head captures opened a new field with new challenges for head pose estimation. For the former condition, three systems were presented, based on a variety of approaches, from PCA classification to feed-forward or auto-associative neural nets. The results can be seen in Fig. 32.
Fig. 32. Results for the head pose estimation task (Pointing’04 data)
The best systems reach an error rate of 10.1◦ and 12.8◦ for pan and tilt respectively, which is well within the range of a human’s estimation error on such images. The correct classification rate into 15◦ orientation classes is also shown in Fig. 32, with the leading system achieving 55 % pan and 84 % tilt classification accuracy. For the more difficult subtask involving real seminar data, a series of new techniques were explored, including 3D head texture modeling, fusion of neural net classifiers, and combination of boosted classifier cascades. The results are shown in Fig. 33. For the leading system, based on sequential multi-view face detection and HMM filtering, 45 % correct classification (into 45◦ classes) was reached. When also allowing classification into a neighboring class, the score reaches 87 %. To better view these numbers in context, it must be said that even human annotation of 45◦ head pose classes in the room corner camera images proved very difficult, since it was often ambiguous to the annotators which orientation class to choose. Here, an analysis of inter-annotator agreement is needed in the future. In conclusion, one can say that although the head pose estimation task on CHIL seminar data presented a novelty to the field, the results achieved in this first evaluation run proved very encouraging. The availability of several camera views alleviates the problem of small head sizes with respect to the frame and drives the development of more sophisticated fusion schemes. One must also note that a big part of the difficulty in the current recordings came from the difficult lighting conditions in the seminar room, affecting the performance of all algorithms.
Fig. 33. Results for the head pose estimation task (Seminar data)
4.8 Acoustic Event Detection and Classification
For the two conditions of the acoustic event detection task, the classification of isolated events, and the detection and classification of events in seminars, a total of 11 systems were presented by 3 sites. The systems are based on the HMM or SVM classification of spectral features gained from a single audio channel. The results are shown in Figs. 34 and 35.
Fig. 34. AEER error rates for the acoustic event detection task (classification only)
Fig. 35. AEER error rates for the acoustic event detection task (detection and classification)
The error rates show that, while for the recognition of isolated events current techniques are already appropriate, reaching about 4 % error in the best case, the detection of low-energy events in a complex seminar scenario, against a background of speech, is still an unsolved problem. The best system, using a two-step SVM approach for detection of silence/non-silence and subsequent recognition of the 12 event classes, delivered 97 % error rate on unsegmented seminar data, and about
60 % error on presegmented event databases. One of the main difficulties no doubt came from the presence of speech in the recordings, showing that a better coupling with SAD systems could yield some improvement. Additionally, the use of multiple microphones to better handle noise and room acoustics has yet to be explored, and may constitute one of the main research directions for the future.
4.9 Acoustic Environment Classification
For the acoustic environment classification task, only one site participated. The results for the seen test condition, the unseen test condition, and the average of these two conditions are shown in Fig. 36. The system performed much better in identifying environments from locales specifically seen in the training data; however, the error rate for unseen locales is still much better than chance. These results indicate that while practical systems might be fielded to identify a user’s frequently-visited locales, work still needs to be done on improving generality and adapting to new locales.
Fig. 36. Results for the Acoustic Environment Classification Task
5 Summary
This paper summarized the CLEAR 2006 evaluation, which started early in 2006 and was concluded with a two day workshop in April 2006. It described the evaluation tasks performed in CLEAR’06, including descriptions of the metrics and databases used, and also gave an overview of the individual results achieved by the evaluation participants. Further details on the individual systems used can be found in the respective system description papers in the proceedings of the evaluation workshop. The goal of the CLEAR evaluation is to provide an international framework to evaluate multimodal technologies related to the perception of humans, their activities and interactions. In CLEAR’06, sixteen international research laboratories participated in more than 20 evaluation subtasks. An important contribution of the CLEAR evaluation is the fact that it provides an international forum for the discussion and harmonization of related evaluation tasks, including the definition of procedures, metrics and guidelines for the collection and annotation of necessary multimodal datasets. CLEAR has been established through the collaboration and coordination efforts of the European Union (EU) Integrated Project CHIL - Computers in the Human Interactive Loop - and the United States (US) Video Analysis and
Content Extraction (VACE) programs. From a decision made in mid November 2005 by CHIL and VACE to establish CLEAR, to the actual CLEAR workshop in April 2006, over 20 evaluation subtasks were performed. In that period of four months, evaluation tracking metrics between CHIL and VACE were harmonized, several hours of multimedia data were annotated for the various evaluation tasks, large amounts of data were distributed to 16 participants worldwide, and dozens of teleconferences were held to help coordinate the entire evaluation effort. An additional important contribution of CLEAR 2006 and the supporting programs is that significant multimedia datasets and evaluation benchmarks have been produced and made available to the research community. Evaluation packages for the various tasks, including data sets, annotations, scoring tools, evaluation protocols and metrics, are available through the Evaluations and Language Distribution Agency (ELDA) [20] and NIST. While we consider CLEAR 2006 a remarkable success, we think that the evaluation tasks performed in CLEAR 2006 - mainly tracking, identification, head pose estimation and acoustic scene analysis - only scratch the surface of automatic perception and understanding of humans and their activities. As systems addressing such “lower-level” perceptual tasks are becoming more mature, we expect that more challenging tasks, addressing human activity analysis on higher levels, will become part of future CLEAR evaluations. In order to keep CLEAR focused, the coordinators are committed to working together to synergize more aspects of the CLEAR evaluations. This synergy will allow the evaluation assets developed to be greater than if they had been developed independently by each participating evaluation program. For instance, synergy in the areas of data annotation and formats will positively impact future evaluations by providing a lasting data resource whose development is cost-shared across evaluation programs and projects, while useful for numerous tasks due to the commonalities.
Acknowledgments
The authors would like to thank the following people for all their help and support in organizing the CLEAR evaluation and for their help in revising this paper: Matthew Boonstra, Susanne Burger, Josep Casas, Hazim Ekenel, Dmitry Goldgof, Rangachar Kasturi, Valentina Korzhova, Oswald Lanz, Uwe Mayer, Rob Malkin, Vasant Manohar, Ferran Marques, John McDonough, Dennis Moellmann, Ramon Morros, Maurizio Omologo, Aristodemos Pnevmatikakis, Gerasimos Potamianos, Cedrick Rochet, Margit Rödder, Andrey Temko, Michael Voit, Alex Waibel. The work presented here was partly funded by the European Union (EU) under the integrated project CHIL, Computers in the Human Interaction Loop (Grant number IST-506909), and partial funding was also provided by the US Government VACE program.
Disclaimer
The tests presented here are designed for local implementation by each participant. The reported results are not to be construed or represented as endorsements of any participant’s system, or as official findings on the part of NIST or the U.S. Government.
References
1. CHIL - Computers In the Human Interaction Loop, http://chil.server.de.
2. AMI - Augmented Multiparty Interaction, http://www.amiproject.org.
3. VACE - Video Analysis and Content Extraction, https://control.nist.gov/dto/twiki/bin/view/Main/WebHome.
4. CALO - Cognitive Agent that Learns and Organizes, http://caloproject.sri.com/.
5. NIST Rich Transcription Meeting Recognition Evaluations, http://www.nist.gov/speech/tests/rt/rt2006/spring/.
6. PETS - Performance Evaluation of Tracking and Surveillance, http://www.cbsr.ia.ac.cn/conferences/VS-PETS-2005/.
7. TREC Video Retrieval Evaluation, http://www-nlpir.nist.gov/projects/trecvid/.
8. ETISEO Video Understanding Evaluation, http://www.silogic.fr/etiseo/.
9. Mostefa, D., Garcia, M.N., Bernardin, K., Stiefelhagen, R., McDonough, J., Voit, M., Omologo, M., Marques, F., Ekenel, H.K., Pnevmatikakis, A.: CLEAR evaluation plan. Technical report, http://www.clear-evaluation.org/downloads/chilclear-v1.1-2006-02-21.pdf (2006)
10. The VACE evaluation plan, http://www.clear-evaluation.org/downloads/ClearEval-Protocol-v5.pdf.
11. CLEAR evaluation webpage, http://www.clear-evaluation.org.
12. Transcriber Labeling Tool, http://trans.sourceforge.net/.
13. AGTK: Annotation Graph Toolkit, http://agtk.sourceforge.net/.
14. Fiscus, J.G., Ajot, J., Michel, M., Garofolo, J.S.: The rich transcription 2006 spring meeting recognition evaluation. In: 3rd Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms (MLMI-06), Springer
15. The i-LIDS dataset, http://scienceandresearch.homeoffice.gov.uk/hosdb/physicalsecurity/detection-systems/i-lids/ilids-scenario-pricing/?view=Standard.
16. Bernardin, K., Elbs, A., Stiefelhagen, R.: Multiple object tracking performance metrics and evaluation in a smart room environment. In: Sixth IEEE International Workshop on Visual Surveillance, in conjunction with ECCV 2006, Graz, Austria (2006)
17. ViPER: The Video Performance Evaluation Resource, http://viper-toolkit.sourceforge.net/.
18. Danninger, M., Flaherty, G., Bernadin, K., Ekenel, H., Kohler, T., Malkin, R., Stiefelhagen, R., Waibel, A.: The Connector — facilitating context-aware communication. In: Proceedings of the International Conference on Multimodal Interfaces (2005)
19. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - CVPR 2001, Volume 1 (2001) 511–518
20. ELRA/ELDA’s Catalogue of Language Resources: http://catalog.elda.org/.
3D Audiovisual Person Tracking Using Kalman Filtering and Information Theory
Nikos Katsarakis, George Souretis, Fotios Talantzis, Aristodemos Pnevmatikakis, and Lazaros Polymenakos
Athens Information Technology, Autonomic and Grid Computing, P.O. Box 64, Markopoulou Ave., 19002 Peania, Greece
{nkat,gsou,fota,apne,lcp}@ait.edu.gr
http://www.ait.edu.gr/research/RG1/overview.asp
Abstract. This paper proposes a system for tracking people in three dimensions, utilizing audiovisual information from multiple acoustic and video sensors. The proposed system comprises a video and an audio subsystem combined using a Kalman filter. The video subsystem combines in 3D a number of 2D trackers based on a variation of Stauffer’s adaptive background algorithm with spatio-temporal adaptation of the learning parameters and a Kalman tracker in a feedback configuration. The audio subsystem uses an information theoretic metric upon a pair of microphones to estimate the direction from which sound is arriving. Combining measurements from a series of pairs, the actual coordinate of the speaker in space is derived.
1 Introduction
Three dimensional person tracking from multiple synchronized audiovisual sensors has many applications, like surveillance, security, smart spaces [1], pervasive computing, and human-machine interfaces [2], to name a few. In such trackers, body motion is the most widely used video cue, while speech is the audio cue. As speech is not always present for all people present in the monitored space, a stand-alone audio tracker cannot provide continuous tracks. A video tracker, on the other hand, can lose track of the people due to clutter from other people and the background. In this case the audio cue can help resolve the tracks. In this paper an audiovisual approach towards 3D tracking is employed. The stand-alone video and audio trackers are combined using a Kalman filter [3]. The audio tracker employs an information theoretic approach [4] for direction-of-arrival estimation, as this can be combined using multiple clusters of microphones [5]. The 3D video tracker takes advantage of multiple calibrated cameras [6] to produce 3D tracks from multiple 2D video trackers [7]. Each 2D tracker employs a variation of Stauffer’s adaptive background algorithm [8-10] with spatio-temporal adaptation of the learning parameters and a Kalman tracker [11] in a feedback configuration. In the feedforward path, the adaptive background module provides target evidence to the Kalman tracker. In the feedback path, the Kalman tracker adapts the learning parameters of the adaptive background module.
This paper is organized as follows: In section 2 the audio, video and multimodal combination modules of the tracker are detailed. The results on CLEAR evaluations are presented and discussed in section 3. Finally, in section 4 the conclusions are drawn, followed by some indications for further work.
2 Multimodal 3D Tracker
The block diagram of the tracking system is shown in Figure 1. It comprises three modules: adaptive background, measurement and Kalman filtering. The adaptive background module produces the foreground pixels of each video frame and passes this evidence to the measurement module. The measurement module associates the foreground pixels to targets, initializes new ones if necessary and manipulates existing targets by merging or splitting them based on an analysis of the foreground evidence. The existing or new target information is passed to the Kalman filtering module to update the state of the tracker, i.e. the position and size of the targets. The output of the tracker is the state information, which is also fed back to the adaptive background module to guide the spatio-temporal adaptation of the algorithm. In the rest of the section, we present the three modules in detail.
2.1 Audio Module
In AIT’s ASL system, collection of audio data is performed using a total of 80 microphones located at different positions inside the acoustic enclosure and organized in different topologies. More analytically, there is a 64 channel linear microphone array and four smaller clusters of microphones, each containing four microphones. Each of the microphone clusters has the microphones organized in an inverted T topology as in Fig. 1.
Fig. 1. Relative geometry of the microphone clusters. Distance between microphones 1, 3 and 2, 3 is 20cm. Distance between microphones 3, 4 is 30cm.
Localization of speakers is generally dealt with by estimating the Direction Of Arrival (DOA) of the acoustic source by means of time delay estimation (TDE) algorithms. Estimation of the DOA essentially provides us with the direction from which sound is arriving. Typically, audio data is collected in frames so that a current TDE estimate can be provided. The combination of several DOAs can then provide us with the actual source position. The practical and, in many ways, severely restricting disadvantage of traditional methods for TDE [12] is that if the system is used in reverberant environments, the
returned estimate could be a spurious delay created by the ensuing reflections. For the purposes of our system, we have proposed [4] a new mathematical framework that largely resolves the reverberation issues and generates robust estimates. It is thus of interest to briefly review the model used. Consider two of the microphones with a distance d between them. The sound source is assumed to be in the far field of the array. For the case in which the environment is non-reverberant, the assumption of a single source leads to the following discrete-time signal being recorded at the m-th microphone (m = 1, 2):
x_m(k) = s_m(k - \tau_m) + n_m(k)    (1)
where \tau_m denotes the time in samples that it takes for the source signal to reach the m-th microphone, and n_m is the respective additive noise (assumed to be zero mean and uncorrelated with the source signal). The overall geometry of the corresponding system can be seen in Fig. 2. Without loss of generality, this considers m_1 to be the reference microphone, i.e., \tau_1 = 0. The delay at m_2 is then the relative delay between the two recorded signals, and thus the relationship reduces to x_1(k) = x_2(k - \tau_2). The DOA is defined with respect to the broadside of the array as a function of any delay \tau as:

\theta = \arcsin\left(\frac{\tau c}{f_s d}\right)    (2)
where f_s is the sampling frequency, and c is the speed of sound (typically taken as 343 m/s). Thus, DOA estimation methods rely on successful estimation of \tau. However, in a real reverberant environment, each of the microphone recordings is the result of convolving the speech signal with a reverberant impulse response of significant length (depending on the reverberation level). In order to overcome the problems introduced by reverberation we make use of the concept of mutual information (MI) by tailoring it appropriately to the tracking of an acoustic source. A review of the concept can be found in the work of Bell et al. [13].
Fig. 2. Geometry of the recording system
Most of the DOA estimation techniques are required to operate in real time. We must, therefore, assume that the data at each sensor m are collected in frames x_m = [x_m(tL), x_m(tL+1), \ldots, x_m(tL+L-1)] of L samples, indexed by t. Since the analysis will be independent of the data frame, we can drop t and express frames simply as x_m for any t. In the context of our model, and for any set of frames, we may then write

x_1 = x_2(\tau)    (3)
where x_m(\tau) denotes a delayed version of x_m by \tau samples. Thus, the problem is to estimate the correct value of \tau, and hence the DOA, by processing two frames x_1 and x_2(\tau) only. If we were to neglect reverberation, only a single delay is present in the microphone signals. Thus, the measurement of information contained in a sample l of x_1 is only dependent on the information contained in sample l-\tau of x_2(\tau). In the case of the reverberant model, though, information contained in a sample l of x_1 is also contained in neighboring samples of sample l-\tau of x_2(\tau) due to the fact that the model is now convolutive. The same logical argument applies to the samples of x_2(\tau). In order to estimate the information between the microphone signals, we use the marginal MI that considers jointly N neighboring samples and can be formulated as follows [14] for the case where the recordings exhibit Gaussian behavior
I_N = -\frac{1}{2}\ln\frac{\det[C(\tau)]}{\det[C_{11}]\,\det[C_{22}]}    (4)
with the joint covariance matrix C(\tau) given as

C(\tau) \approx \begin{bmatrix} x_1 \\ x_1(1) \\ \vdots \\ x_1(N) \\ x_2(\tau) \\ x_2(\tau+1) \\ \vdots \\ x_2(\tau+N) \end{bmatrix} \begin{bmatrix} x_1 \\ x_1(1) \\ \vdots \\ x_1(N) \\ x_2(\tau) \\ x_2(\tau+1) \\ \vdots \\ x_2(\tau+N) \end{bmatrix}^{T} = \begin{bmatrix} C_{11} & C_{12}(\tau) \\ C_{21}(\tau) & C_{22} \end{bmatrix}    (5)
If N is chosen to be greater than zero, the elements of C(\tau) are themselves matrices. In fact, for any value of \tau, the size of C(\tau) is always 2(N+1) \times 2(N+1). For the purposes of the present paper, we call N the order of the tracking system. When I_N reaches a maximum as a function of \tau at a specific time shift, then there is at this point a joint process with a maximum transport of information between x_1 and x_2(\tau). According to the presented information-theoretical criterion, this is the delay that synchronizes the two recordings. In the context of DOA, this delay returns the correct angle \theta at which the signal impinges on the microphone array. The estimation of a DOA from a pair of microphones and the corresponding angle \theta cannot by itself determine the speaker location in space. For this we need to combine information from a set of DOAs from different pairs in the enclosure. In the following, we describe the method used for fusing information from a set of microphone pairs. The method breaks the task into two steps. We first estimate the X and Y coordinates of the speaker and then separately estimate a height Z for the derived pair of X and Y. Suppose we have employed m microphones in the acoustic enclosure, each of them placed at a geometric location r_m = [X_m, Y_m, Z_m]. Let us also assume that we organize the receivers in P pairs. We define the estimated DOA angle of the pair containing the i-th and j-th microphones as \theta_{ij}. Thus, after the DOA estimation is completed we obtain P angle values. First, for the 2D coordinates and given a pair of microphones r_i and r_j (assuming far-field conditions), we can define the line passing through the midpoint of the microphone pair and the estimated source location as a function of the derived angle \theta_{ij} as:
y_{ij} = a_{ij} x_{ij} + b_{ij}    (6)

where

a_{ij} = \tan(\theta_{ij})    (7)

b_{ij} = \bar{Y}_{ij} - a_{ij}\bar{X}_{ij}    (8)
In the above, \bar{X}_{ij} and \bar{Y}_{ij} are the coordinates of the geometric mid-point of the microphones at r_i and r_j. In real systems the location estimate is most often different from the actual position of the source due to noise, interfering acoustic sources, reverberation and the fact that the sources do not exhibit omni-directional characteristics. Thus, even though in ideal conditions the lines given by (6) would cross at a single point, in real environments we have a set of crossing points defining an area within which there is a series of candidate source locations. Most often the source location is derived by operating on the line equations according to some adaptive [15] or closed-form [16] criterion. An alternative approach is to operate upon the crossing points of these lines. In this case, localizing the acoustic source in 2D first requires the derivation of the line equations for the pairs of microphones for which DOA estimation is performed. In the sequel, the set of all crossing points between all lines is derived. The total number of crossing points is a function of P. The problem is then to choose an appropriate filtering mechanism that accepts the remaining crossing points as an input and returns a source location estimate. For the purposes of the present work we apply a median filter upon the crossing points. Thus, the acoustic source estimate s for each of the L frames is given as:
s = \mathrm{median}(u)    (9)
where u is the set of the derived crossing points. To assist the localization process further, all crossing points outside the enclosure dimensions can be neglected prior to filtering. Thus, after median filtering we obtain the X and Y coordinates of the source. The height Z is then estimated by using microphone pairs in different planes. The DOAs for these pairs are also estimated and the median of these is also calculated. The system then derives Z by calculating the height at which this angle crosses the derived source point s.
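The fusion of the DOA estimates into a 2D position, as in eqs. (6)-(9), can be sketched as follows. This is a minimal illustration under our own naming (line_params, localize_2d), assuming the bearing angles and the pair midpoints are already available; it is not the authors' implementation.

```python
import numpy as np

def line_params(theta_ij, midpoint):
    """Slope/intercept of the bearing line through a pair midpoint, eqs. (6)-(8)."""
    a = np.tan(theta_ij)
    b = midpoint[1] - a * midpoint[0]
    return a, b

def crossing(a1, b1, a2, b2):
    """Intersection of y = a1*x + b1 and y = a2*x + b2 (None if parallel)."""
    if np.isclose(a1, a2):
        return None
    x = (b2 - b1) / (a1 - a2)
    return np.array([x, a1 * x + b1])

def localize_2d(doas, midpoints, room_min, room_max):
    """Median of all pairwise line crossings inside the room, eq. (9)."""
    lines = [line_params(t, m) for t, m in zip(doas, midpoints)]
    points = []
    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            p = crossing(*lines[i], *lines[j])
            if p is not None and np.all(p >= room_min) and np.all(p <= room_max):
                points.append(p)
    if not points:
        return None
    return np.median(np.array(points), axis=0)  # component-wise median, eq. (9)

# Example: three microphone-pair midpoints observing a source near (2.0, 1.5)
mids = [np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([0.0, 3.0])]
src = np.array([2.0, 1.5])
doas = [np.arctan2(src[1] - m[1], src[0] - m[0]) for m in mids]  # ideal bearings
print(localize_2d(doas, mids, room_min=np.array([0, 0]), room_max=np.array([5, 4])))
```

With noisy DOAs the lines no longer meet in a single point, and the component-wise median over the surviving crossings plays the role of the outlier-rejecting filter described above.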
2.2 Video Module

The 3D video tracker employs multiple 2D video trackers [7], each operating on the synchronized video streams of multiple calibrated cameras [6]. The detected people are mapped from the camera image plane into 3D world coordinates using epipolar geometry. The block diagram of the 2D video tracker is shown in Figure 3. It comprises three modules: adaptive background, measurement and Kalman filtering. The adaptive background module produces the foreground pixels of each video frame. It employs a variation of Stauffer's adaptive background algorithm [8-9] in the sense that the learning rate and the threshold for the Pixel Persistence Map [10] are adapted based on the Kalman module. These variations, detailed in [7], allow the system to segment targets
Fig. 3. Block diagram of the 2D video tracker architecture
from the background even if they remain stationary for some time. The thresholded Pixel Persistence Map is the evidence passed to the measurement module. The measurement module associates the foreground pixels to targets using the Mahalanobis distance of the evidence segments from any of the known targets. Non-associated evidence segments are used to initialize new targets. Finally, existing targets are manipulated by merging or splitting them based on an analysis of the foreground evidence. The existing or new target information is passed to the Kalman filtering [11] module to update the state of the 2D video tracker, i.e., the position, velocity and size of the targets on the image plane of the particular camera. The output of the tracker is the state information, which is also fed back to the adaptive background module to guide the spatio-temporal adaptation of the algorithm. The mapping of the tracked person from two or more camera planes into 3D world coordinates is trivial in the case of single targets (i.e., the CLEAR evaluation seminars [17]). In this case, given K camera views, the system of K equations of the lines from the pinhole of each camera through the normalized camera-plane coordinates [x_c^{(i)}, y_c^{(i)}, 1]^T of the target [6] to the 3D coordinates of the target [x_o, y_o, z_o]^T can be solved using least squares:

c_i R_i \begin{bmatrix} x_c^{(i)} \\ y_c^{(i)} \\ 1 \end{bmatrix} + T_i = \begin{bmatrix} x_o \\ y_o \\ z_o \end{bmatrix}, \quad i = 1, \ldots, K    (10)
where the unknowns are the K multiplicative constants c_i scaling the normalized coordinates and the three world coordinates [x_o, y_o, z_o]^T. The 2K normalized coordinates [x_c^{(i)}, y_c^{(i)}]^T are obtained from the K 2D trackers, R_i are the rotation matrices and T_i are the displacement vectors obtained from camera calibration [6]. In order for the multiple-target case (CLEAR interactive seminars) to reduce to multiple systems like those of equation (10), the correspondence between the 2D targets in the various camera planes has to be established. This is the camera matching problem, in theory solved using the epipolar constraint [18], which relates the position of a point in space seen from two views: the two camera centers, the two projections on the respective camera planes and the 3D world point all lie on the same plane. Using the essential matrix E_{a,b} for cameras a and b, then for a target i on camera plane a matching a target j on camera plane b, the epipolar constraint is given as:

[x_c^{(i)}, y_c^{(i)}, 1] \cdot E_{a,b} \cdot [x_c^{(j)}, y_c^{(j)}, 1]^T = 0    (11)
Due to tracking or camera calibration inaccuracies, in practice the matching targets i and j never yield exactly zero in equation (11). Bounding equation (11) by a threshold is error-prone; instead, the trifocal tensor is used [18]. This combines the epipolar constraint for three camera views. Suppose that target i from camera plane a is found to match targets j and k from camera planes b and c respectively. If targets j and k also match, then a point-point-point correspondence of the three targets is established.
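Equation (10) can be stacked into a single linear least-squares problem in the unknowns (c_1, ..., c_K, x_o, y_o, z_o). The sketch below is our illustrative formulation of that step, not the project code; the rotations, translations and normalized coordinates are assumed to come from calibration [6] and the 2D trackers, and the synthetic example is hypothetical.

```python
import numpy as np

def triangulate(norm_coords, rotations, translations):
    """Solve eq. (10) in least squares for the scales c_i and the 3D point.

    norm_coords:  list of K arrays [x_c, y_c] (normalized image coordinates)
    rotations:    list of K 3x3 rotation matrices R_i
    translations: list of K 3-vectors T_i
    """
    K = len(norm_coords)
    A = np.zeros((3 * K, K + 3))
    b = np.zeros(3 * K)
    for i, (p, R, T) in enumerate(zip(norm_coords, rotations, translations)):
        ray = R @ np.array([p[0], p[1], 1.0])   # direction of the back-projected ray
        A[3*i:3*i+3, i] = ray                   # coefficient of the unknown scale c_i
        A[3*i:3*i+3, K:] = -np.eye(3)           # coefficient of the unknown 3D point
        b[3*i:3*i+3] = -T
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w[K:]                                # [x_o, y_o, z_o]

# Synthetic check with two views: a point at (1, 2, 1.6) observed by two cameras
X = np.array([1.0, 2.0, 1.6])
Rs = [np.eye(3), np.eye(3)]
Ts = [np.array([0.0, 0.0, 0.0]), np.array([-3.0, 0.0, 0.0])]
ps = []
for R, T in zip(Rs, Ts):
    ray = np.linalg.inv(R) @ (X - T)            # invert the camera model of eq. (10)
    ps.append(ray[:2] / ray[2])
print(triangulate(ps, Rs, Ts))                  # ~ [1.0, 2.0, 1.6]
```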
2.3 Audiovisual Combination Module

The location estimates provided by the audio and video modules can be recursively combined by the use of a decentralized Kalman filter [3]. The overall fusion system can be seen in Fig. 4 (block diagram of the decentralized Kalman filter used for audiovisual fusion: the audio and video 3D tracking modules feed two local KFs, whose outputs are combined by a two-input global KF). It comprises two linear local Kalman filters (KF) and a
two-input global one. The local KFs operate on the outputs of the modules for the two standalone modalities. The estimated audio and video states are then weighted according to the trust level assigned to every modality and fed to the global KF. The weights allow placing different trust on audio or video, according to the examined meeting scenario.
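To give the flavor of this stage, a simplified sketch follows: two constant-velocity Kalman filters run on the audio and video position estimates, and a trust-weighted combination of their outputs stands in for the two-input global filter. This is only a schematic stand-in for the decentralized filter of [3]; the class, the weights w_audio/w_video and the noise values are our own assumptions.

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal constant-velocity Kalman filter over the state [x, y, z, vx, vy, vz]."""
    def __init__(self, q=0.05, r=0.1):
        self.x = np.zeros(6)
        self.P = np.eye(6)
        self.q, self.r = q, r

    def step(self, z, dt=0.1):
        F = np.eye(6); F[:3, 3:] = dt * np.eye(3)     # state transition
        H = np.hstack([np.eye(3), np.zeros((3, 3))])  # position-only measurement
        Q = self.q * np.eye(6); R = self.r * np.eye(3)
        # predict
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + Q
        # update
        S = H @ self.P @ H.T + R
        K = self.P @ H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - H @ self.x)
        self.P = (np.eye(6) - K @ H) @ self.P
        return self.x[:3]

audio_kf, video_kf = ConstantVelocityKF(r=0.2), ConstantVelocityKF(r=0.05)
w_audio, w_video = 0.3, 0.7   # trust placed on each modality (scenario dependent)

def fuse(audio_pos, video_pos):
    a = audio_kf.step(np.asarray(audio_pos))
    v = video_kf.step(np.asarray(video_pos))
    return (w_audio * a + w_video * v) / (w_audio + w_video)

print(fuse([2.1, 1.4, 1.7], [2.0, 1.5, 1.7]))
```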
3 CLEAR Evaluation Results

The presented audiovisual 3D tracker has been tested in the CLEAR evaluations. The results are shown in Table 1. The ceiling camera has been used, when provided. The video-only results have not been expected to be good, because of the monitored rooms not being empty at the beginning of the evaluation segments and the violation of the far-field video conditions in some of the camera views. These reasons are explained in detail in [7]. The results are far better for the multi-person tracking subtask than the single-person one. This is a rather unexpected result; the reasons for it are under investigation. A preliminary explanation is that the results are influenced by the audience, who are not to be tracked. There is nothing in the system to stop it from tracking the audience instead of the presenter, as long as they enter the region of interest in the various camera views.

Table 1. Tracking performance on the CLEAR evaluation data for seminars

Condition                 MOTP (mm)  MISS (%)  FALSEPOS (%)  MISMATCH (%)  MOTA (%)  A-MOTA (%)
Single person, acoustic   226        51.16     51.16         -             -         -2.32
Single person, visual     246        91.03     88.75         0.00          -79.78    -
Single person, A/V (B)    377        93.92     93.90         0.00          -87.82    -
Single person, A/V (A)    379        94.41     94.41         0.00          -88.83    -88.83
Multiperson, acoustic     230        56.19     56.19         -             -         -12.38
Multiperson, visual       233        59.87     31.74         4.06          4.33      -
Multiperson, A/V (A)      252        59.79     59.79         1.03          -20.62    -19.59
The audio-only results are significantly better than the video results for the single-person tracking subtask, and comparable to them for the multi-person tracking subtask. Also, the presence of multiple speakers does not degrade audio tracking performance much. A second unexpected result is the failure of the audiovisual fusion module; the results are worse than those of either of the two modalities being fused. The reasons for this are again under investigation.
4 Conclusions In this paper we have presented and evaluated a 3D audiovisual tracking system that employs multiple audio and video sensors. Although the audiovisual module has exhibited unexpected performance, the results show that audio is far more robust for
tracking a single presenter who is the only one speaking, and is doing so continuously. Such a system does not suffer from clutter, as its video counterpart does. In more interactive scenarios, with multiple speakers, audio and video tracking performance is comparable. In this case, the effect of video clutter remains the same. Audio tracking, even though it does not suffer from clutter to the same extent as video (unless multiple people are speaking simultaneously), seems to be performing a bit worse than video tracking.
Acknowledgements This work is sponsored by the European Union under the integrated project CHIL, contract number 506909. The authors wish to thank the organizers of the CLEAR evaluations.
References
[1] A. Waibel, H. Steusloff, R. Stiefelhagen, et al.: CHIL: Computers in the Human Interaction Loop, 5th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), Lisbon, Portugal, (Apr. 2004).
[2] A. Pnevmatikakis, F. Talantzis, J. Soldatos and L. Polymenakos: Robust Multimodal Audio-Visual Processing for Advanced Context Awareness in Smart Spaces, Artificial Intelligence Applications and Innovations, Peania, Greece, (June 2006).
[3] N. Strobel, S. Spors and R. Rabenstein: Joint Audio-Video Signal Processing for Object Localization and Tracking, in M. Brandstein and D. Ward (eds.), Microphone Arrays, Springer.
[4] F. Talantzis, A. G. Constantinides, and L. Polymenakos: Estimation of Direction of Arrival Using Information Theory, IEEE Signal Processing Letters, 12, 8 (Aug. 2005), 561-564.
[5] F. Talantzis, A. G. Constantinides, and L. Polymenakos: Real-Time Audio Source Localization Using Information Theory, Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms (MLMI 2006), (May 2006).
[6] Z. Zhang: A Flexible New Technique for Camera Calibration, Technical Report MSR-TR-98-71, Microsoft Research, (Aug. 2002).
[7] A. Pnevmatikakis and L. Polymenakos: 2D Person Tracking Using Kalman Filtering and Adaptive Background Learning in a Feedback Loop, CLEAR 2006, (Apr. 2006).
[8] C. Stauffer and W. E. L. Grimson: Learning patterns of activity using real-time tracking, IEEE Trans. on Pattern Anal. and Machine Intel., 22, 8 (2000), 747–757.
[9] P. KaewTraKulPong and R. Bowden: An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection, in Proc. 2nd European Workshop on Advanced Video Based Surveillance Systems (AVBS01), (Sept 2001).
[10] J. L. Landabaso and M. Pardas: Foreground regions extraction and characterization towards real-time object tracking, in Proceedings of Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms (MLMI '05), (July 2005).
[11] R. E. Kalman: A New Approach to Linear Filtering and Prediction Problems, Transactions of the ASME – Journal of Basic Engineering, 82 (Series D), (1960), 35-45.
[12] C. H. Knapp and G. C. Carter: The generalized correlation method for estimation of time delay, IEEE Trans. Acoust., Speech, Signal Process., ASSP-24, 4 (Aug. 1976), 320–327.
[13] A. J. Bell and T. J. Sejnowski: An information maximization approach to blind separation and blind deconvolution, Neural Comput., 7 (1995), 1129–1159.
[14] T. M. Cover and J. A. Thomas: Elements of Information Theory. New York: Wiley, (1991).
[15] J. Benesty: Adaptive eigenvalue decomposition algorithm for passive acoustic source localization, Journal of the Acoustical Society of America, 107, 1 (2000), 384–391.
[16] M. S. Brandstein, J. E. Adcock and H. Silverman: A Closed-Form Location Estimator for Use with Room Environment Microphone Arrays, IEEE Trans. on Acoust. Speech and Sig. Proc., 5 (1997), 45-50.
[17] D. Mostefa et al.: CLEAR Evaluation Plan, document CHIL-CLEAR-V1.1-2006-02-21, (Feb 2006).
[18] R. Hartley and A. Zisserman: Multiple View Geometry in Computer Vision, 2nd Edition, Cambridge University Press, (March 2004).
A Generative Approach to Audio-Visual Person Tracking Roberto Brunelli, Alessio Brutti, Paul Chippendale, Oswald Lanz , Maurizio Omologo, Piergiorgio Svaizer, and Francesco Tobia ITC-irst, Via Sommarive 18, 38050 Povo di Trento, Italy
[email protected] Abstract. This paper focuses on the integration of acoustic and visual information for people tracking. The system presented relies on a probabilistic framework within which information from multiple sources is integrated at an intermediate stage. An advantage of the method proposed is that of using a generative approach which supports easy and robust integration of multi source information by means of sampled projection instead of triangulation. The system described has been developed in the EU funded CHIL Project research activities. Experimental results from the CLEAR evaluation workshop are reported.
1
Introduction
An essential component for a system monitoring people behaviour is given by the reliable and accurate detection of people position. The task is particularly difficult when people are allowed to move naturally in scenarios with few or no constraints. Audio (speaker) and visual localization experience specific shortcomings whose impact may be reduced, if not eliminated altogether, by a synergic use of cross modal information. The integration of multisensorial information is expected to be of significant relevance when sensor number and/or effectiveness is limited. This paper presents a principled framework for the integration of multimodal information in a Bayesian setting supported by an efficient implementation based on a particle filter. An advantage of the method proposed is that of using a generative approach which supports easy and robust integration of multi source information by means of sampled projection instead of triangulation. The system described has been developed in the EU funded CHIL Project research activities. Experimental results from the CLEAR evaluation workshop are reported.
2
Audio Tracking
The audio based speaker localization and tracking task addressed in CHIL is rather challenging. Since the evaluation data have been collected during real
seminars and meetings, they present some critical aspects to the localization process. First of all, seminar and meeting rooms are typically characterized by a high reverberation time (for example in the ITC-irst CHIL room the reverberation time is about 700 ms): in these conditions strong delayed replicas of the signal reach the microphones, generating virtual competitive sound sources. Then, in a real environment, the localization system has to deal with coherent noise sources like fans, door slamming, printers, chairs moving, etc. Finally, the degree of freedom left to the speaker and the audience is very high, since they can behave as if they were unaware of the presence of the audio and video sensors. As a consequence it is not possible to rely always on the presence of a direct path from the talker to the microphones or to formulate assumptions on the position and orientation of the speaker. In an effort to tackle these problems, we devised a localization system based on the Global Coherence Field (GCF), introduced in [5], in order to merge all the information gathered by the distributed microphone network available in the CHIL smart rooms.

2.1 Global Coherence Field
According to [3], localization systems can be roughly divided into three classes: steered-beamformer-based locators, high-resolution spectral estimation locators and localization systems based on Time Delay of Arrival (TDOA). GCF can be classified in the first class and aims at building a function representing the plausibility that a sound source is active at a given point in space. Given a grid \Sigma of potential source locations and the corresponding sets of steering delays, the GCF is defined by considering the average coherence between signals realigned by the beamformer. Any coherence metric can be used to compute a GCF; nevertheless, in this work we chose to adopt a CrossPowerSpectrum (CSP) based Coherence Measure (CM). CSP has been well exploited in the literature and it has been proved to be reliable and robust even in reverberant and noisy environments [10]. Let us consider a set \Omega of Q microphone pairs and denote with \delta_{ik}(S) the theoretical delay for the microphone pair (i, k) if the source is at position S = (x, y, z) \in \Sigma. Once the CM C_{ik}(\delta_{ik}(S)) has been computed for each microphone pair (i, k) belonging to \Omega, the GCF is expressed as:

GCF_\Omega(S) = \frac{1}{Q} \sum_{(i,k)\in\Omega} C_{ik}(\delta_{ik}(S))
Fig. 1 shows an example of the GCF restricted to a plane (x, y). Notice the brightest spot in correspondence with the speaker position. The underlying idea of the GCF is that when summing the information provided by all the sensors it is possible to reinforce the true sound source with respect to the virtual sound sources introduced by reflections. Another point worth mentioning is that there is no need of knowledge about the position and orientation of the speaker. Sensors which do not receive a direct path will in fact deliver a CM inconsistent with the others.
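A minimal sketch of how a GCF map can be built from per-pair coherence measures. Here the CM is a GCC-PHAT value sampled at the steering delay, which is one common CSP-style choice; the exact coherence measure, windowing and grid handling used in the actual system may differ, and all function names are ours.

```python
import numpy as np

def gcc_phat(x1, x2):
    """GCC-PHAT cross-correlation; index (lag + len(x2)) gives the CM at an integer lag."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)
    return np.concatenate((cc[-len(x2):], cc[:len(x1)]))  # lags -len(x2) .. len(x1)-1

def gcf(grid, mic_pairs, frames, fs, c=343.0):
    """Average coherence over pairs for every candidate position S (Sec. 2.1)."""
    gcf_map = np.zeros(len(grid))
    for (p1, p2), (s1, s2) in zip(mic_pairs, frames):
        cc = gcc_phat(s1, s2)
        zero_lag = len(s2)
        for g, S in enumerate(grid):
            # theoretical delay delta_ik(S) in samples for this pair
            delay = (np.linalg.norm(S - p1) - np.linalg.norm(S - p2)) / c * fs
            gcf_map[g] += cc[zero_lag + int(round(delay))]
    return gcf_map / len(mic_pairs)

# Toy usage: two pairs, random test signals, a small 2D grid at fixed height
fs = 16000
rng = np.random.default_rng(0)
pairs = [(np.array([0., 0., 1.5]), np.array([0.2, 0., 1.5])),
         (np.array([4., 0., 1.5]), np.array([4., 0.2, 1.5]))]
frames = [(rng.standard_normal(2048), rng.standard_normal(2048)) for _ in pairs]
grid = [np.array([x, y, 1.5]) for x in np.linspace(0.5, 3.5, 7) for y in np.linspace(0.5, 3.5, 7)]
print(grid[int(np.argmax(gcf(grid, pairs, frames, fs)))])
```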
Fig. 1. CSP-based 2-dimensional GCF computed in the CHIL room available at ITC-irst. GCF magnitude is represented by the brightness of the plotted points. The brightest spot in the center of the room corresponds to the active speaker.
2.2
The Tracking System
As described in Sec. 2.1, the main peak of the GCF represents the point in space with the highest plausibility that there is an active sound source. All the aspects described in the previous section, together with a high level of portability to different room setups and speaker behaviours, induced us to adopt a GCF approach rather than a more classical TDOA based localization system. Because of the non-stationarity of speech, some frames are more informative than others. For this reason we introduced a threshold on the GCF peak in order to establish whether a frame can be classified as informative or not. When dealing with real scenarios, disturbances due to coherent sound sources must be taken into account in order to design a robust speaker localization system. As a matter of fact, the typical assumption of the whiteness of noise does not hold in a real scenario. Our algorithm handles such noises with a consistency check based on the distance between successive localizations. If we assume that coherent noises are brief and located far from the speaker, this kind of check allows skipping isolated localization outputs and reduces the impact of outliers. Notice that this post processing, if tuned correctly, guarantees the tracking of multiple speakers in a question-answering context. The algorithm works frame by frame independently as follows:

1. a 2D GCF is computed using all the available horizontal microphone pairs;
2. the source position is estimated by maximizing the GCF;
3. given the 2D localization, the height of the speaker is estimated by maximizing a mono-dimensional GCF computed using vertical microphone pairs;
4. the GCF peak is compared with the threshold;
5. a consistency check is performed to validate the output.
The 3D localization was performed in two steps in an effort to reduce the computational complexity of the algorithm. As for the microphones involved in the localization process, the system exploited the horizontal and vertical microphone pairs of all the T-shaped arrays. The frame analysis length was 2^14 samples with an overlap of 75%. The room was sampled in space with a 5 cm resolution in all directions. The post-processing parameters, including the thresholding and the consistency check, were tuned in a way that the overall output rate was between 2 and 3 localizations per second during speech intervals. The tuning was performed empirically, running experiments on the development data set. The processing time depends on the number of microphones available and on the density of the grid investigated. As a consequence an average of the different processing times is not representative. Nevertheless, the system can work faster than real time for every sensor setup available in the CHIL consortium. A real-time demonstration implementing the described algorithm is operative at ITC-irst labs.
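The frame-level decision logic (peak thresholding plus the distance-based consistency check) might look roughly as follows. The threshold values and the gcf_peak/gcf_argmax inputs are placeholders for whatever the GCF front end provides; this is not the CHIL implementation.

```python
import numpy as np

class SpeakerLocalizer:
    """Keeps only GCF peaks that are informative and consistent with the previous output."""
    def __init__(self, peak_threshold=1.0, max_jump=0.8):
        self.peak_threshold = peak_threshold   # minimum GCF peak to call a frame informative
        self.max_jump = max_jump               # maximum plausible speaker displacement (m)
        self.last = None

    def update(self, gcf_peak, gcf_argmax):
        """gcf_peak: value of the GCF maximum; gcf_argmax: its (x, y, z) position."""
        if gcf_peak < self.peak_threshold:
            return None                        # non-informative frame (e.g. silence)
        pos = np.asarray(gcf_argmax, dtype=float)
        if self.last is not None and np.linalg.norm(pos - self.last) > self.max_jump:
            self.last = pos                    # remember it, but treat it as a possible outlier
            return None
        self.last = pos
        return pos

loc = SpeakerLocalizer()
print(loc.update(1.4, (2.0, 1.5, 1.6)))   # accepted
print(loc.update(1.6, (4.5, 0.2, 1.6)))   # rejected: jump larger than max_jump
print(loc.update(0.3, (2.1, 1.5, 1.6)))   # rejected: peak below threshold
```

Because the rejected position is still remembered, a genuine speaker change is accepted from the following frame on, while a single spurious peak (door slam, chair noise) is discarded.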
3
Visual Tracking
The evaluation scenario addressed in CLEAR single person tracking task exhibits some challenging peculiarities: – highly dynamic background scene due to non modeled audience motion and presentation slide changes; – changing illumination conditions, e.g. due to slide projection; – unreliable a priori characterization of skin color due to varying illumination and jpeg noise; – sometimes difficult detection of the true target. All these issues are of major concern to classical approaches based on the typical processing chain: background suppression - morphological noise filtering - blob classification (some examples are shown in Fig. 2). Robust solutions usually require the engineering of complex cascades of low-level filters whose behaviour is difficult to understand and which require the control of many parameters. We use instead a principled Bayesian approach which remains simple and whose performance seems to be largely unaffected by the issues mentioned above. In this section we first present the model we adopt to describe the target within the scene and then we show how this model is tracked throughout the video sequence, possibly captured non-synchronously from different viewpoints. 3.1
Visual Likelihood Model
Following Bayesian approaches to object tracking, we describe the target in terms of a generative model of its appearance. In the specific case of tracking the presenter in a seminar video we can characterize the target as: – having human-like shape; – standing upright most of the time;
Fig. 2. On the left two examples where a background model has not been built reliably. The first image shows the background model acquired on a sequence where the presenter did not move much during the whole sequence. The second image shows a background that does not contain much information about the slide projection area and on some people of the audience. The fourth image shows the output of a simple skin detector in a typical sequence shot shown in the third image.
– having consistent color throughout the sequence (or at least consistently different from the background scene);
– being the only object satisfying all of the requisites above.

According to these features we define an explicit, low-dimensional model of the presenter which has two components: shape and color.

Shape. A coarse 3D model identifying the scene volume covered by a person standing upright is adopted for shape, similar to the generalized-cylinder approach proposed in [8]. This model is shown in Fig. 3. In our implementation it has only 1 degree of freedom: target height. To obtain the image projection of this 3D model when placed in a specific position x of the scene, we proceed as follows. Firstly, we compute a pair of 3D points which represent the center of feet and top of head of the model. If x describes the 2D position w.r.t. the floor plane and h is the height of the target, these two points are simply given by x augmented with a third coordinate of value 0 and h, respectively. These two points are then projected onto the camera reference frame by means of a calibrated camera model. The segment joining these two image points defines the axis around which the contour is drawn with piece-wise linear offset from this axis. This rendering procedure is fast and sufficiently accurate on horizontal views such as the ones captured by cameras placed at the corners of a room (we do not use images from the ceiling camera for tracking).

Color. The projected silhouette is decomposed into three body parts: head, torso and legs. The appearance of the target within these parts is described by one color histogram per part. In our implementation we quantize the RGB color space uniformly in 8 × 8 × 8 bins (thus we have histograms of size 512).

Likelihood. We now describe how the likelihood of a given hypothesis x is computed on a calibrated image z. This involves two steps: candidate histogram extraction and hypothesis scoring. The first step makes use of the shape model introduced above, and is depicted in Fig. 4: hypothetic body parts are identified within the image by means of the shape model rendered at x, and candidate
Fig. 3. 3D shape model of the presenter and an approximate, but efficient, rendering implementation which still conveys imaging artefacts such as perspective distortion
RGB histograms are extracted from these areas. To assign the final score for x, these histograms are compared with the histograms of the model using a similarity measure derived from the Bhattacharyya-coefficient based distance [4]. If a^h, a^t, a^l are the areas of the body part projections and h_z^h, h_z^t, h_z^l and h_m^h, h_m^t, h_m^l denote the normalized extracted and modeled histograms respectively, the likelihood assigned is

\exp\left( - \frac{a^h d^2(h_z^h, h_m^h) + a^t d^2(h_z^t, h_m^t) + a^l d^2(h_z^l, h_m^l)}{2\sigma^2 (a^h + a^t + a^l)} \right)

with normalized histogram distance given by

d^2(h, k) = 1 - \sum_{i=1}^{512} \sqrt{h_i k_i}.
Parameter σ can be used to control the selectivity of this function and is set empirically to 0.12.
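The color part of this likelihood follows directly from the formulas above. The sketch below is an illustrative re-implementation (names such as color_likelihood are ours) that assumes the per-part pixel regions have already been extracted by the shape model.

```python
import numpy as np

def rgb_histogram(pixels, bins=8):
    """512-bin RGB histogram (8x8x8), normalized to sum to 1."""
    idx = (pixels // (256 // bins)).astype(int)
    flat = idx[:, 0] * bins * bins + idx[:, 1] * bins + idx[:, 2]
    h = np.bincount(flat, minlength=bins ** 3).astype(float)
    return h / max(h.sum(), 1.0)

def bhattacharyya_dist2(h, k):
    """d^2(h, k) = 1 - sum_i sqrt(h_i * k_i)."""
    return 1.0 - np.sum(np.sqrt(h * k))

def color_likelihood(parts_z, parts_m, areas, sigma=0.12):
    """Area-weighted combination of the per-part distances, as in Sec. 3.1.

    parts_z / parts_m: extracted and model histograms for (head, torso, legs)
    areas:             projected areas of the three body parts
    """
    num = sum(a * bhattacharyya_dist2(hz, hm)
              for a, hz, hm in zip(areas, parts_z, parts_m))
    return np.exp(-num / (2.0 * sigma ** 2 * sum(areas)))

# Example with synthetic pixel blocks standing in for the three body-part regions
rng = np.random.default_rng(1)
model = [rgb_histogram(rng.integers(0, 256, (500, 3))) for _ in range(3)]
observed = [rgb_histogram(rng.integers(0, 256, (500, 3))) for _ in range(3)]
print(color_likelihood(observed, model, areas=[200.0, 800.0, 600.0]))
```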
3.2 Appearance Model Acquisition
The performance of a tracker based on the likelihood model just described depends strongly on the quality of the acquired appearance model. In constrained scenarios such as the ones envisaged within the CHIL project it may be reasonable to assume that such a model can be acquired a priori, e.g. while each participant of a monitored meeting presents himself in front of a dedicated camera before the meeting starts. Since this assumption was not met for the CLEAR evaluation data, an automatic acquisition procedure was developed which runs as a separate task prior to tracking. The presenter is detected in each sequence as follows.
Fig. 4. Candidate histogram extraction procedure
Motion Edges. For a given sequence timestamp, forward and backward temporal difference images are computed for each view (except for the ceiling camera) and analyzed for motion edges. Precisely, a pixel is labeled as potentially belonging to a motion edge if its gray value differs from both its temporal neighbors by a value that exceeds 15. These candidates are enhanced by adding edges detected by a Canny filter on the reference image. This operation is followed by morphological noise cleaning to suppress contributions due to spurious noise (i.e. which have area less than 4 pixels).

Contour Likelihood. The motion boundaries are then analyzed for compatible silhouette shapes as follows. A regular 2D grid is defined on the floor plane¹. Each such grid point represents a possible target position. To compute the corresponding likelihood, the generalized-cylinder model is first rendered for each view as by Fig. 3. Each hypothetic contour so obtained is then assigned a score for each view according to how well the hypothesis is supported by the current edge map. This value is computed as the sum over contour pixels of their Euclidean distance to the nearest motion edge, normalized w.r.t. the length of the contour under consideration. To limit the influence of missing edges, the contribution of contour pixels that exceed 30% of body width is set to a constant value. Fast scoring can be achieved by precompiling Euclidean distances into lookup images obtained as the Distance Transform (DT) of edge images (see Fig. 5). If α(x) denotes the projected body width, C(x) describes its contour and D(u) is the DT of an edge image, the likelihood assigned for the corresponding view is

\exp\left( - \frac{1}{\mathrm{length}\{C(x)\}} \int_{C(x)} \min\{1, D(u)/(0.3\,\alpha(x))\} \, du \right).

Similar shape likelihoods have been proposed for tracking in [7], where edges are searched for only along a small predefined set of contour normals. While this formulation is computationally cheaper when the number of hypotheses to be tested is limited, the overhead introduced through the computation of the DT in our approach becomes negligible for a large search task such as target detection. In addition, our measure is more accurate as it considers continuous contours.
¹ If metric information about the room is not available, domain bounds may be found by considering camera calibration and looking for overlapping fields of view.
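The contour score can be approximated with a distance transform of the edge map along the lines just described. This sketch uses scipy's Euclidean distance transform and a discretized contour given as a list of pixel coordinates; it illustrates the scoring rule only and is not the project code.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def contour_likelihood(edge_map, contour_pixels, body_width_px):
    """exp of the negative mean (clipped) distance from contour pixels to the nearest edge.

    edge_map:       boolean image, True where a motion/Canny edge was detected
    contour_pixels: (N, 2) array of (row, col) positions of the rendered contour
    body_width_px:  projected body width alpha(x), used to clip missing-edge penalties
    """
    # distance of every pixel to the nearest edge pixel
    dt = distance_transform_edt(~edge_map)
    d = dt[contour_pixels[:, 0], contour_pixels[:, 1]]
    clipped = np.minimum(1.0, d / (0.3 * body_width_px))
    return np.exp(-clipped.mean())

# Tiny example: a vertical edge at column 10 and a hypothesized contour at column 12
edges = np.zeros((60, 40), dtype=bool)
edges[:, 10] = True
contour = np.stack([np.arange(60), np.full(60, 12)], axis=1)
print(contour_likelihood(edges, contour, body_width_px=20.0))
```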
Model Acquisition. A hypothesis is accepted if at least 2 views have a contour likelihood that lies below a predefined threshold (e^{-0.15} in our implementation). The search is then refined at this position by testing different target heights and orientations². The best scoring configuration among the different views is kept and used to define target height. Body part histograms are then extracted from the best view, and stored as model. If no acceptable hypothesis is found at time t, the same analysis is made on images at t + 15.
Fig. 5. Motion edges, their distance transform, and the acquired model
It is worth pointing out that additional research is needed to accomplish online target detection and acquisition, while tracking. This is of particular concern in the context of multiple target tracking where occlusions need to be taken into account. Although a robust and efficient tracker based on the same 1-body appearance likelihood has already been realized [9], we could not participate in the multi-person evaluation task because of the inability of automatically acquiring target models. 3.3
Tracking
A particle filter [6] was implemented to track the pre-acquired appearance model of the presenter. Target state is defined in terms of position and velocity on the floor plane, resulting in a 4-dimensional state space. Target orientation, which is used during contour rendering, is approximated as the direction of the state’s velocity component. The particle set representing the probabilistic estimate at time t is projected to time t + 1 by a first order autoregressive process: each particle is linearly propagated along its velocity component according to the time elapsed, and zero-mean Gaussian noise is added to both position and velocity component (variance is 0.6m/s and 0.3m/s2 , respectively). After prediction, likelihoods are computed for the different views and particle weights are assigned as the product of their likelihoods over the different views. The assumption underlying this simple fusion rule is that observations are conditionally independent once the state of the target is fixed. This is an approximation which seems acceptable in the light of the evaluation results reported in Sec. 5. If a particle 2
3D shape model has elliptical profiles and thus projection width changes with orientation.
renders outside the field of view of a camera, it is given a constant likelihood value (10^{-4} in our implementation). If the candidate silhouette is only partially visible, the Bhattacharyya score is linearly interpolated with this value according to the amount of hidden area. This allows assigning a likelihood for each available view so that all particles have comparable weights. Weighted resampling can then be applied in a straightforward manner. The initial particle set is sampled uniformly from the domain. The output of the probabilistic tracker is computed as the expectation over the current, weighted particle set.
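Leaving the appearance likelihood abstract, the filter loop itself can be sketched as below. The fusion across views as a product of per-view likelihoods follows the description in this section, while the noise magnitudes are only loosely based on the values quoted in the text and all names (and the placeholder likelihood) are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
N_PARTICLES = 200
ROOM = np.array([[0.0, 5.0], [0.0, 6.0]])   # x and y bounds of the floor plane (m)

def init_particles():
    pos = rng.uniform(ROOM[:, 0], ROOM[:, 1], size=(N_PARTICLES, 2))
    vel = np.zeros((N_PARTICLES, 2))
    return np.hstack([pos, vel])             # state: [x, y, vx, vy]

def predict(particles, dt):
    particles = particles.copy()
    particles[:, :2] += particles[:, 2:] * dt
    particles[:, :2] += rng.normal(0.0, 0.6 * dt, (N_PARTICLES, 2))   # position noise
    particles[:, 2:] += rng.normal(0.0, 0.3 * dt, (N_PARTICLES, 2))   # velocity noise
    return particles

def step(particles, per_view_likelihood, views, dt=1.0 / 15):
    particles = predict(particles, dt)
    # conditional independence across views: multiply the per-view likelihoods
    w = np.ones(N_PARTICLES)
    for view in views:
        w *= np.array([per_view_likelihood(p, view) for p in particles])
    w /= w.sum()
    estimate = np.average(particles, axis=0, weights=w)   # expectation over the set
    idx = rng.choice(N_PARTICLES, size=N_PARTICLES, p=w)  # importance resampling
    return particles[idx], estimate

# Placeholder likelihood: pretend every view softly prefers positions near (2, 3)
def fake_likelihood(particle, view):
    return np.exp(-np.sum((particle[:2] - np.array([2.0, 3.0])) ** 2))

p = init_particles()
for _ in range(10):
    p, est = step(p, fake_likelihood, views=["cam1", "cam2", "cam3", "cam4"])
print(est[:2])   # drifts towards (2, 3)
```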
4
Audio-Visual Tracking
The multi-modal tracker is an extension of the video tracker just described, where the audio signal is interpreted as an additional source of likelihood to be integrated when computing particle weights. The theory underlying GCF-based localization is in line with the generative approach taken to interpret the video signal. In fact, for a given target state x, an audio source can be hypothesized at (x, y, z) where x, y are the particle coordinates shifted by an offset of 15 cm along the direction of the particle velocity, and z is fixed to 90% of target height (which is known from the appearance model). This source would then render to a known TDOA at each microphone pair, and the conditional likelihood of this TDOA is verified within the real signal through a correlation measure. 4.1
Regularizing the GCF
Fig. 6 highlights a potential problem when using GCF in combination with particle filtering. The GCF can provide highly irregular optimization landscapes which do not marry well with discrete, local search methods such as particle filtering. It is a well known fact that particle filters do not behave well with sharply peaked and irregular likelihoods [11]. We solve this problem through smoothing the GCF which we perform in the following way. Given the 3D coordinate of a hypothetic audio source (carried by a particle) we compute the interval of TDOA which map inside a sphere of radius 50 cm centered at this point. The highest GCF response in this interval is found and weighted with the relative distance to the source location. With this choice the GCF becomes spatially averaged and continuous, thus making the likelihood response a smooth function of x. The acoustic likelihood of a particle is then computed by summing up individual microphone pair contributions (negative values are set to 0) and taking its exponential. 4.2
Integration
If the highest response over all particles for a given time frame is significant (i.e. if the cumulated response is above a given threshold), the frame is labeled
Fig. 6. The need for regularization of the GCF becomes evident by analyzing the left plot. It reports single GCF responses on two microphone pairs which capture speech from orthogonal directions w.r.t. the speaker (frame 335 of sequence UKA 20040420 A Segment1). Note that the peak in the first response is strong but very narrow. The response for the second microphone pair instead is very noisy. Exhaustive search, as performed by the audio tracker, is needed to pick the global optimum as becomes evident from the accumulated GCF (top right image). Since the width of the peak is about 5 cm, it is unlikely that a particle hits this peak. The regularized GCF (bottom right image) is more suitable for integration with a particle filter.
as speech frame and a likelihood is assigned to each particle according to the computed score. The joint particle weight is then taken as the product of the likelihoods computed from the different signal sources (cam1, cam2, ..., microphone arrays). If the speech activity threshold has not been reached on the support of the particle set, no GCF likelihood is assigned and tracking for this frame is supported by the video signal only.
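The smoothing and gating just described can be sketched as follows. Here gcf_response(pair, lag) stands for a lookup into the per-pair coherence function; the 50 cm radius, the zeroing of negative contributions and the speech-activity gate mirror the text, while the sphere sampling, the distance weighting and all names are our own simplification.

```python
import numpy as np

def audio_likelihood(point, mic_pairs, gcf_response, fs, radius=0.5, c=343.0, n_samples=50):
    """Regularized GCF score for a hypothesized source point (Sec. 4.1)."""
    rng = np.random.default_rng(0)
    # sample points inside a sphere of `radius` around the hypothesis
    offsets = rng.normal(size=(n_samples, 3))
    offsets = offsets / np.linalg.norm(offsets, axis=1, keepdims=True)
    offsets *= radius * rng.uniform(0, 1, (n_samples, 1)) ** (1 / 3)
    candidates = np.vstack([point, point + offsets])

    total = 0.0
    for k, (m1, m2) in enumerate(mic_pairs):
        lags = (np.linalg.norm(candidates - m1, axis=1)
                - np.linalg.norm(candidates - m2, axis=1)) / c * fs
        responses = np.array([gcf_response(k, int(round(l))) for l in lags])
        # weight each response by its relative distance to the hypothesis
        dist = np.linalg.norm(candidates - point, axis=1)
        weighted = responses * (1.0 - dist / radius)
        total += max(0.0, weighted.max())       # negative contributions are dropped
    return total

def particle_weights(points, mic_pairs, gcf_response, fs, activity_threshold=1.0):
    scores = np.array([audio_likelihood(p, mic_pairs, gcf_response, fs) for p in points])
    if scores.max() < activity_threshold:
        return None                             # no speech activity: fall back to video only
    return np.exp(scores)

# Toy usage with a synthetic response that peaks at lag 0 for every pair
pairs = [(np.array([0., 0., 1.5]), np.array([0.3, 0., 1.5])),
         (np.array([4., 0., 1.5]), np.array([4., 0.3, 1.5]))]
fake_gcf = lambda k, lag: np.exp(-0.5 * (lag / 8.0) ** 2)
pts = [np.array([2.0, 3.0, 1.6]), np.array([0.5, 0.5, 1.6])]
print(particle_weights(pts, pairs, fake_gcf, fs=16000, activity_threshold=0.5))
```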
5
Results and Discussion
Evaluation of the presented technologies has been carried out within the CLEAR 2006 evaluation campaign on the database described in [1]. The tasks in which we have participated are: 3D single person tracking - audio only (3DSPT A), video only (3DSPT V), audio-video (3DSPT AV), and 3D multiple person tracking - audio only (3DMPT A). 5.1
Setup and Processing Times
The sequences have been processed on a 3GHz Intel Xeon computer. Appearance model acquisition is yet an off-line process which takes 10 - 50 sec per multiview frame (4 images). Grid resolution is 5 cm, target height resolution is 2 cm,
target orientation resolution is 30 deg. Acquisition runtime depends mainly on the quality of motion edges. For evaluation data it ranges from less than 1 min to several minutes per sequence. Tracking is done in real time at full frame rate. Image resolution is down-sampled by a factor of 3. Images from the ceiling camera are not used for tracking. GCF coefficients have been computed off-line, prior to tracking. Speech activity threshold is set to 1.0 for ITC seminars (7 microphone arrays) and 0.5 for UKA seminars (4 microphone arrays). Number of particles is fixed to 200. CPU load is approximately 60% for video-only tracking and 70% for audio-visual tracking. 5.2
Audio-Only 3DSPT Results
Tab. 1 shows the evaluation results obtained running both the MOT scoring tool [1] and the SLOC scoring tool [2]. According to the so-called "MOT scores", the audio based tracking system tracks the speaker with a precision of 14 cm, which is very close to the reference precision (that can be assumed to be about 10 cm). The audio accuracy (A-MOTA) was 48%. This score is mostly degraded by a quite high "miss rate" due to an excessively strict post processing. As a confirmation of the good performance in terms of precision, the "SLOC scores" show that the overall error (fine+gross) is about 40 cm and the localization rate is 86%. It means that 86% of the localization outputs are very close (less than 50 cm) to the actual talker position.

Table 1. Audio based tracking performance for the 3DSPT task

MOT scores:   MOTP 14.4cm   Miss 46.6%   False P. 5.2%   A-MOTA 48.2%
SLOC scores:  Pcor 86%   AEE (fine+gross) 38.8cm   Deletion 40%   False Alarm 36%
5.3
Audio-Only 3DMPT Results
The multi person tracking task has been tackled adopting exactly the same approach. Tab. 2 shows the results provided by the scoring tools. Results in term of “MOT scores” show that the performance does not degrade too much when facing a more complex and challenging task with respect to the 3DSPT. Moreover, it is worth noting that the task was even more difficult because of some technical issues. In fact the number of available microphones was reduced with respect to the 3DSPT task due to different smart room setups and sensor failures. For instance, in 3 seminars there were only 2 arrays available and one of them was behind the speaker. As a consequence the localization precision was degraded leading to a higher fine+gross average error and a reduced localization rate (SLOC scores).
Table 2. Audio based tracking performance for the 3DMPT task

MOT scores:   MOTP 21.8cm   Miss 65%   False P. 19%   A-MOTA 15.6%
SLOC scores:  Pcor 63%   AEE (fine+gross) 61.7cm   Deletion 47%   False Alarm 39%
5.4
Video-Only 3DSPT Results
CLEAR evaluation scores for the video-based 3DSPT task are shown in Tab. 3. The generative approach taken shows all its potential. The way likelihoods are calculated is intrinsically 3D and takes into account image formation principles. In particular, we do not suffer from the under- or over-segmentation typically afflicting traditional approaches and deriving from the inability of setting up consistent system parameters (background suppression thresholds, etc.). The drawback of using low-dimensional shape models is shown in Fig. 7, where we report a frame of the sequence where we performed worst: this posture cannot be explained by the shape model. However, even if the presenter stays in this posture for more than 1/4th of the time, the tracker never loses the target over the whole sequence. This highlights another feature of the system: as particle filtering performs local search, it does not necessarily get distracted by significant responses on background clutter that may occur when the model does not match body posture well.

Table 3. Video based tracking performance for the 3DSPT task. Best case is UKA 20050525 B Segment1, worst case is UKA 20050504 A Segment2.

MOT scores    MOTP     Miss     False P.   A-MOTA
average       132mm    4.43%    4.34%      91.23%
best case     86mm     0.0%     0.0%       100.0%
worst case    151mm    33.7%    33.3%      33.0%

5.5
Audio-Video 3DSPT Results
CLEAR evaluation scores for the audio/video based 3DSPT task are shown in Tab. 4. In this case the evaluation results are presented for two different situations:

Condition A: track the speaker on segments where he is speaking
Condition B: track the speaker for every time point in the sequence

Inspection of the results shows that, in this tracking task, the integration of multimodal information did not improve, on average, on the results obtained from the best of the two trackers (the visual one). However, on some single sequences fusion did increase performance noticeably. Tab. 5 reports the best improvement.
Fig. 7. A multiview frame from sequence UKA 20050504 A Segment2, where we performed worst (see Tab. 3). Even though the filter is locked on the target, this estimate is classified as a false positive for evaluation purposes because of the significant head offset (which is taken as the reference target position).

Table 4. Audio-video based tracking performance for the 3DSPT task

MOT scores    MOTP     Miss     False P.   A-MOTA
condition A   132mm    9.78%    3.43%      86.80%
condition B   134mm    4.59%    4.27%      91.13%
Table 5. Multimodal vs. unimodal: best case sequence UKA 20050420 A Segment2

MOT scores      MOTP     Miss      False P.   A-MOTA
audio only      252mm    33.80%    14.94%     52.0%
video only      100mm    11.63%    11.60%     67.7%
multimodal A    101mm    3.03%     0.50%      96.5%
multimodal B    118mm    5.31%     5.00%      89.7%
The reason is that visual tracking, with four cameras providing complete overlapping coverage of the room and the motion patterns typical of a lecture, can actually always rely on enough information to obtain a robust, precise position estimate. Situations where dynamic occlusions are more frequent and where only a limited number of sensors is available for tracking (e.g. one or two active cameras switching from target to target) are expected to show an advantage of cross modal localization.
6
Conclusion
This paper has presented a system based on the integration of acoustic and visual information for people tracking. The system presented relies on a probabilistic framework within which information from multiple sources is integrated at an intermediate stage. An advantage of the method proposed is that of using a generative approach which supports easy and robust integration of multi source information by means of sampled projection instead of triangulation. The system described has been developed in the EU funded CHIL Project research activities. Experimental results from the CLEAR evaluation workshop are reported.
References 1. CLEAR 2006 evaluation campaign. [Online]: http://www.clear-evaluation.org/. 2. Rich transcription 2005 spring meeting recognition evaluation. [Online]: http:// www.nist.gov/speech/tests/rt/rt2005/spring/. 3. M. Brandstein and D. Ward. Microphone Arrays. Springer Verlag, 2001. 4. D. Comaniciu, V. Ramesh, and P. Meer. Real-time tracking of non-rigid objects using mean-shift. In Int. Conf. on Computer Vision and Pattern Recognition, volume 2, pages 142–149, 2000. 5. R. DeMori. Spoken Dialogue with Computers. Accademic Press, 1998. 6. A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer, 2001. 7. M. Isard and A. Blake. Condensation – conditional density propagation for visual tracking. Int. Journal of Computer Vision, 1(29):5–28, 1998. 8. M. Isard and J. MacCormick. BraMBLe: A Bayesian multiple-blob tracker. In Int. Conf. of Computer Vision, volume 2, pages 34–41, 2003. 9. O. Lanz. Approximate bayesian multibody tracking. IEEE Trans. Pattern Analysis and Machine Intelligence, 2006. (to appear). 10. M. Omologo and P. Svaizer. Acoustic event localization using a crosspowerspectrum phase based technique. In Int. Conf. on Acoustics, Speech, and Signal Processing, volume 2, pages 273–276, 1994. 11. J. Sullivan and J. Rittscher. Guiding random particles by deterministic search. In Int. Conf. of Computer Vision, volume 1, pages 323–330, 2001.
An Audio-Visual Particle Filter for Speaker Tracking on the CLEAR’06 Evaluation Dataset Kai Nickel, Tobias Gehrig, Hazim K. Ekenel, John McDonough, and Rainer Stiefelhagen Interactive Systems Labs - University of Karlsruhe Am Fasanengarten 5, 76131 Karlsruhe, Germany
[email protected]
Abstract. We present an approach for tracking a lecturer during the course of his speech. We use features from multiple cameras and microphones, and process them in a joint particle filter framework. The filter performs sampled projections of 3D location hypotheses and scores them using features from both audio and video. On the video side, the features are based on foreground segmentation, multi-view face detection and upper body detection. On the audio side, the time delays of arrival between pairs of microphones are estimated with a generalized cross correlation function. In the CLEAR’06 evaluation, the system yielded a tracking accuracy (MOTA) of 71% for video-only, 55% for audio-only and 90% for combined audio-visual tracking.
1
Introduction
Person tracking is a basic technology for realizing context-aware human-computer interaction applications. The scenario addressed in this work is a smart lecture room, where information about the lecturer's location helps to automatically create an audio-visual log of the presentation. As we have shown in [18], tracking accuracy has a direct impact on the recognition rate of beamformed speech. Other applications include active camera control in order to supply high-resolution images of the speaker, thus facilitating person identification and audio-visual speech recognition. The task of lecturer tracking poses two basic problems: localizing the lecturer (in terms of 3D head coordinates) and disambiguating the lecturer from other people in the room. In the proposed approach, we jointly process images from multiple cameras and the signal from multiple microphones in order to track the lecturer both visually and acoustically. The algorithm is based on the assumption that the lecturer - among all other people in the room - is the one that is speaking and moving most of the time, i.e. exhibiting the highest visual and acoustical activity. The central issue in audio-visual tracking is the question of how to combine different sensor streams in a beneficial way. In our approach, we integrate audio and video features such that the system does not rely on a single sensor or
a certain combination of sensors to work properly. In fact, each single camera and each microphone pair alone can contribute to the track. The core of the proposed algorithm is a particle filter for the computationally efficient integration of acoustic source localization, person detection (frontal face, profile face, upper body) and foreground segmentation. The 3D position of the lecturer is robustly being determined by means of sampled projection instead of triangulation. 1.1
Related Work
For acoustic source localization, several authors have proposed solving this optimization problem with standard gradient based iterative techniques. While such techniques typically yield accurate location estimates, they are computationally intensive and thus ill-suited for real-time implementation [2,3]. Other recent work on acoustic source localization includes that by Huang et al [7], who developed an iterative technique based on a spherical least square error criterion that is nonetheless suitable for real-time implementation, as well as the work by Ward et al [17], who proposed using a particle filter together with both time delay of arrival estimation and steered beamformers. In other work by the same authors [10], a variant of the extended Kalman filter was used for acoustic speaker tracking. This approach was extended in [9] to add video features. Particle filters [8] have previously been used for audio-visual tracking, for example by [15] for a video telephony application, by [4] for multi-person tracking or by [6] for multi-party conversation in a meeting situation. The particle filter's capability of representing arbitrary distributions is of central importance for the proposed feature fusion scheme. Concerning video features, it has often been proposed to use color models for the task of tracking articulated objects like the human body. Unfortunately, the appearance of color in real-world scenarios is fragile because of different light sources, shadowing, and – specific for our lecture scenario – the bright and colorful beam of the video projector that often overlays the lecturer. Color-invariant approaches that rely on background subtraction, e.g. Mikic et al. [13], often suffer from over- or under-segmentation as an effect of noisy foreground classification. In the proposed method, we avoid this problem: instead of triangulating connected foreground segments, our algorithm performs sampled projections of 3D hypotheses, as proposed by Zotkin et al. [19], and gathers support for the respective sample in the resulting image region in each view. It is thus less dependent on the quality of the segmentation. Face-detection cascades as proposed by Viola and Jones [16] are known to be both robust and fast, which makes them a good feature to support a person tracker. However, searching high-resolution camera images exhaustively for faces in multiple scales goes beyond the current possibilities of real-time operation. The particles, however, cluster around likely target positions and are thus a good approximation of the search space.
2
An Audio-Visual Particle Filter
Particle filters [8] represent a generally unknown probability density function by a set of m random samples s_{1..m}. Each of these particles is a vector in state space and is associated with an individual weight \pi_i. The evolution of the particle set is a two-stage process which is guided by the observation and the motion model:

1. The prediction step: From the set of particles from the previous time instance, an equal number of new particles is generated. In order to generate a new particle, a particle of the old set is selected randomly in consideration of its weight, and then propagated by applying the motion model.
2. The measurement step: In this step, the weights of the new particles are adjusted with respect to the current observation z_t: \pi_i = p(z_t|s_i). This means computing the probability of the observation given that the state of particle s_i is the true state of the system.

Each particle s_i = (x, y, z) hypothesizes the location of the lecturer's head centroid in 3D space. The particles are propagated by Gaussian diffusion, i.e. a 0-th order motion model. If a particle leaves the boundaries of the lecture room, it gets re-initialized to a random position within the room. A certain percentage of particles (in our case 5%) are not drawn from the previous particle set, but are also initialized randomly. This way it is guaranteed that the entire space is roughly searched - and the tracker does not stick to a local maximum. Using the features described in Sections 3 and 4, we calculate the weight \pi_i for each particle s_i by combining the normalized probabilities of the visual observation V_t and the acoustical observation A_t:

\pi_i = c_A \cdot p(A_t|s_i) + c_V \cdot p(V_t|s_i)    (1)
The dynamic mixture weights c_A and c_V can be interpreted as confidence measures for the audio and video channel, respectively. In order to determine the values of c_{A,V}, we consider the spread of the audio and video scores on the ground plane (x and y components of the state vector). Let, for example, σ_x^A denote the standard deviation of the particle set's x-components weighted with the audio scores; then the audio channel confidence is given by:

c_A = ( (σ_x^A)^2 + (σ_y^A)^2 )^{-1}    (2)

The video confidence c_V is calculated in the same way. In order to generate the final tracker output, we shift a 1 m² search window over the ground plane and look for the region with the highest accumulated particle scores. The weighted mean of all particles within that region is then the final hypothesis.
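A minimal Python sketch of this weighting scheme (Eqs. 1 and 2) is given below; the function names, the small regularizing constants and the toy particle set are illustrative assumptions and not part of the described system.

```python
import numpy as np

def channel_confidence(particles_xy, scores):
    """Confidence of one modality as in Eq. (2): inverse of the
    score-weighted variance of the particles on the ground plane."""
    w = scores / (scores.sum() + 1e-12)          # normalize to a distribution
    mean = (particles_xy * w[:, None]).sum(axis=0)
    var = ((particles_xy - mean) ** 2 * w[:, None]).sum(axis=0)
    return 1.0 / (var[0] + var[1] + 1e-12)       # (sigma_x^2 + sigma_y^2)^-1

def fuse_weights(particles_xyz, audio_scores, video_scores):
    """Combine normalized audio and video scores as in Eq. (1)."""
    p_a = audio_scores / (audio_scores.sum() + 1e-12)
    p_v = video_scores / (video_scores.sum() + 1e-12)
    c_a = channel_confidence(particles_xyz[:, :2], audio_scores)
    c_v = channel_confidence(particles_xyz[:, :2], video_scores)
    return c_a * p_a + c_v * p_v

# toy usage: 500 particles in a 6 m x 8 m room with heads between 1.0 m and 2.0 m
rng = np.random.default_rng(0)
particles = rng.uniform([0, 0, 1.0], [6, 8, 2.0], size=(500, 3))
audio = rng.random(500)
video = rng.random(500)
pi = fuse_weights(particles, audio, video)      # fused, unnormalized weights
print(pi.shape, pi.sum())
```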
3 Video Features
As lecturer and audience cannot be separated reliably by means of fixed spatial constraints such as a dedicated speaker area, we have to look for features that are
more specific to the lecturer than to the audience. Intuitively, the lecturer is the person that is standing and moving (walking, gesticulating) most, while people from the audience are generally sitting and moving less. In order to exploit this specific behavior, we use foreground segmentation based on adaptive background modeling as the primary feature, as described in Section 3.1. In order to support the track indicated by foreground segments, we use detectors for face and upper body (see Section 3.2). Both features – foreground F and detectors D – are linearly combined using a mixing weight β. (Note that p(D_t^j|s_i) and p(F_t^j|s_i) respectively have to be normalized before combination so that they sum up to 1.) So the probability of the visual information V_t^j in view j, given that the true state of the system is characterized by s_i, is set to be

p(V_t^j|s_i) = β · p(D_t^j|s_i) + (1 − β) · p(F_t^j|s_i)    (3)

By means of the sum rule, we integrate the weights from the v different views in order to obtain the total probability of the visual observation:

p(V_t|s_i) = (1/v) Σ_{j=1..v} p(V_t^j|s_i)    (4)

To obtain the desired (pseudo) probability value which tells us how likely this particle corresponds to the visual observation, we have to normalize over all particles:

p(V_t|s_i) ← p(V_t|s_i) / Σ_i p(V_t|s_i)    (5)

3.1 Foreground Segmentation
In order to segment the lecturer from the background, we use a simple background model b(x, y) that is updated with every new frame z(x, y) using a constant update factor α:

b(x, y) = (1 − α) · b(x, y) + α · z(x, y)    (6)
The foreground map m(x, y) is made up of pixel-wise differences between the current image z(x, y) and the background model b(x, y). It is scaled using minimum/maximum thresholds τ_0 and τ_1:

m(x, y) = (|z(x, y) − b(x, y)| − τ_0) / (τ_1 − τ_0) · 255    (7)
The values of m(x, y) are clipped to a range from 0 to 255. However, as Fig. 1 shows, the resulting segmentation of a crowded lecture room is far from perfect. Morphological filtering of the foreground map is generally not sufficient to remove the noise and to create a single connected component for the lecturer's silhouette. Nonetheless, the combination of the foreground maps from different views contains enough information to locate the speaker. Thus, our approach gathers support from all the views' maps without making any "hard" decisions like a connected component analysis.
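The background update (6) and foreground scaling (7) can be written compactly as in the following sketch; the parameter values α, τ_0 and τ_1 and the random test frames are placeholders, not the values used in the evaluated system.

```python
import numpy as np

class ForegroundSegmenter:
    """Running-average background model (Eq. 6) and scaled
    foreground map (Eq. 7); parameter values are illustrative."""
    def __init__(self, first_frame, alpha=0.02, tau0=10.0, tau1=60.0):
        self.bg = first_frame.astype(np.float32)
        self.alpha, self.tau0, self.tau1 = alpha, tau0, tau1

    def update(self, frame):
        frame = frame.astype(np.float32)
        # Eq. (6): blend the new frame into the background model
        self.bg = (1.0 - self.alpha) * self.bg + self.alpha * frame
        # Eq. (7): scale the absolute difference between the two thresholds
        diff = np.abs(frame - self.bg)
        m = (diff - self.tau0) / (self.tau1 - self.tau0) * 255.0
        return np.clip(m, 0, 255).astype(np.uint8)   # clipped to [0, 255]

# toy usage on random grayscale frames
frames = np.random.randint(0, 256, size=(5, 120, 160), dtype=np.uint8)
seg = ForegroundSegmenter(frames[0])
for f in frames[1:]:
    fg_map = seg.update(f)
print(fg_map.shape, fg_map.dtype)
```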
Fig. 1. Foreground segmentation is performed by means of an adaptive background model. A "3-boxes model" approximates the speaker's appearance.
As described in Section 2, the particle filter framework merely requires us to assign scores to a number of hypothesized head positions. In order to evaluate a hypothesis s_i = (x, y, z), we project a "3-boxes person model" (see Fig. 1) centered around the head position to the image plane of each camera view, and sum up the weighted foreground pixels m(x, y) inside the projected polygons. The topmost box, representing the head, has a height of 28 cm and a width/depth of 18 cm. The torso box has a width and depth of 60 cm, whereas the box for the legs spans 40 cm. The accumulated weights of the foreground pixels within the projected polygons are then used as the particle's score.

As this calculation has to be done for each of the particles in all views, we use the following simplification in order to speed up the procedure: we assume that all cameras are set upright with respect to the ground plane, so the projection of a cuboid can be approximated by a rectangle orthogonal to the image plane, i.e. the bounding box of the projected polygon (see Fig. 2). The sum of pixels inside a bounding box can be computed efficiently using the integral image introduced by [16]. Given the foreground map m(x, y), the integral image ii(x, y) contains the sum of the pixels above and to the left of (x, y):

ii(x, y) = Σ_{y'=0}^{y} Σ_{x'=0}^{x} m(x', y')    (8)
Thus, the sum of the rectangle (x_1, y_1, x_2, y_2) can be determined by four lookups in the integral image. So the particle score for the foreground feature is defined by the sum of pixels inside the bounding boxes, normalized by the size of the bounding boxes:

p(F_t^j|s_i) = Σ_{b=H,T,L} [ ii(x_2^b, y_2^b) − ii(x_1^b, y_2^b) − ii(x_2^b, y_1^b) + ii(x_1^b, y_1^b) ] / [ (x_2^b − x_1^b + 1)(y_2^b − y_1^b + 1) ]    (9)
The index b specifies the head box (H), torso box (T), and legs box (L). Using the recurrent formulation from [16], the generation of the integral image only takes one pass over the foreground map, so the complexity of the foreground
Fig. 2. For each particle and each body segment (head, torso, legs), a cuboid centered around the hypothesized head position (x, y, z) is projected into the views A and B. The resulting polygon is approximated by a bounding box (x_1, y_1, x_2, y_2)_{A/B}.
feature preparation is linear in the image size. The evaluation of one particle can then be done in constant time, and is thus independent of the image resolution and the projected size of the target.
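A possible implementation of the integral image (8) and the normalized box sums (9) is sketched below; the box coordinates in the usage example stand in for one particle's projected head, torso and legs rectangles and are purely illustrative.

```python
import numpy as np

def integral_image(m):
    """Eq. (8): ii(x, y) holds the sum of m over all pixels above and to the left."""
    return m.astype(np.float64).cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, x1, y1, x2, y2):
    """Sum of m inside (x1, y1, x2, y2) via four lookups (inclusive bounds)."""
    total = ii[y2, x2]
    if x1 > 0:
        total -= ii[y2, x1 - 1]
    if y1 > 0:
        total -= ii[y1 - 1, x2]
    if x1 > 0 and y1 > 0:
        total += ii[y1 - 1, x1 - 1]
    return total

def foreground_score(ii, boxes):
    """Eq. (9): sum of area-normalized box sums over head/torso/legs boxes."""
    score = 0.0
    for (x1, y1, x2, y2) in boxes:
        area = (x2 - x1 + 1) * (y2 - y1 + 1)
        score += box_sum(ii, x1, y1, x2, y2) / area
    return score

# toy usage with a random foreground map and one particle's projected boxes
m = np.random.randint(0, 256, size=(240, 320))
ii = integral_image(m)
boxes = [(100, 40, 120, 70), (90, 70, 130, 150), (100, 150, 120, 220)]
print(foreground_score(ii, boxes))
```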
3.2 Face and Upper Body Detection
As we aim at tracking the coordinates of the lecturer's head – serving as model point for the full body – we need a feature that gives evidence for the head position. The face detection algorithm proposed by Viola and Jones [16] is known to be both robust and fast: it uses Haar-like features that can be computed efficiently at any scale by means of the integral image. The features are organized in a cascade of weak classifiers that is used to classify the content of a search window as being a face or not. Typically, a variable-size search window is repeatedly shifted over the image, and overlapping detections are combined into a single detection. Exhaustively searching a W × W image region for an F × F sized face while incrementing the face size n times by the scale factor s requires the following number of cascade runs (not yet taking into account post-filtering of overlapping detections):

#cascade runs = Σ_{i=0}^{n−1} (W − F · s^i)^2    (10)
For example, in the case of a 100×100 pixel image region and a face size between 20 and 42 pixels (n = 8, s = 1.1), this results in 44368 cascade runs. In the proposed particle filter framework, however, it is not necessary to scan the image exhaustively: the places to search are directly given by the particle set. For each particle, a head-sized cuboid (30 cm edge length) centered around the hypothesized head position is projected to the image plane, and the bounding
box of the projection defines the search window that is to be classified. Thus, the evaluation of a particle takes only one run of the cascade:

#cascade runs = #particles    (11)
The face detector is able to locate the vertical and horizontal position of the face precisely with respect to the image plane. However, the distance to the camera, i.e. the scaling, cannot be estimated accurately from a single view. In order to achieve tolerance against scale variation and to smooth the scores of nearby particles, we set the i-th particle's score to the average overlap between the particle's head rectangle r_i = (x_1, y_1, x_2, y_2) and all the head rectangles r_{0..N} positively classified for any of the other particles (the auxiliary function overlap(a, b) calculates the ratio of the shared area of two rectangles a and b to the sum of the areas of a and b):

p(D_t^j|s_i) = (1/N) Σ_{n=0}^{N} overlap(r_i, r_n)    (12)
A detector that is trained on frontal faces only is unlikely to produce many hits in our multi-view scenario. In order to improve the performance, we used two cascades for face detection: one for frontal faces in the range of ±45° and one for profile faces (45°–90°); the profile face cascade has to be applied twice, to the original image and to a horizontally flipped image. Our implementation of the face detector is based on the OpenCV library, which implements an extended set of Haar-like features as proposed by [12]. This library also includes a pre-trained classifier cascade for upper body detection [11]. We used this detector in addition to face detection, and incorporated its results using the same methods as described for face detection.
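The overlap-based detector score (12) can be sketched as follows; in the real system the positively classified rectangles come from running the OpenCV cascades on the particles' projected head boxes, whereas the rectangles used here are hard-coded toy values.

```python
import numpy as np

def overlap(a, b):
    """Ratio of the shared area of rectangles a and b to the sum of their
    areas; rectangles are given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b)

def detector_scores(particle_rects, positive_rects):
    """Eq. (12): score of each particle is its average overlap with all
    positively classified head rectangles."""
    scores = np.zeros(len(particle_rects))
    if not positive_rects:
        return scores
    for i, r in enumerate(particle_rects):
        scores[i] = np.mean([overlap(r, rn) for rn in positive_rects])
    return scores

# toy usage: three particle head boxes and one "detection"
particles = [(100, 50, 130, 80), (200, 60, 230, 90), (105, 52, 135, 82)]
detections = [(102, 51, 132, 81)]
print(detector_scores(particles, detections))
```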
4 Audio Features
The lecturer is the person that is normally speaking; therefore we can use audio features derived from multiple microphones to detect the speaker position. Consider the j-th pair of microphones, and let m_{j1} and m_{j2} respectively be the positions of the first and second microphones in the pair. Let x denote the position of the speaker in three-dimensional space. Then the time delay of arrival (TDOA) between the two microphones of the pair can be expressed as

T_j(x) = T(m_{j1}, m_{j2}, x) = ( ‖x − m_{j1}‖ − ‖x − m_{j2}‖ ) / c    (13)

where c is the speed of sound. To estimate the TDOAs, a variety of well-known techniques [14,5] exist. Perhaps the most popular method is the phase transform (PHAT), which can be expressed as

R_{12}(τ) = (1/2π) ∫_{−π}^{π} [ X_1(e^{jωτ}) X_2^*(e^{jωτ}) / |X_1(e^{jωτ}) X_2^*(e^{jωτ})| ] e^{jωτ} dω    (14)
where X_1(ω) and X_2(ω) are the Fourier transforms of the signals of a microphone pair in a microphone array. Normally one would search for the highest peak in the resulting cross correlation to estimate the position. But since we are using a particle filter, as described in Section 2, we can simply take the PHAT value at the time delay position T_j(x = s_i) of the microphone pair j for a particular particle s_i as

p(A_t^j|s_i) = max(0, R_j(T_j(x = s_i)))    (15)

As the values returned by the PHAT can be negative, but probability density functions must be strictly nonnegative, we set negative values of the PHAT to zero. To get a better estimate, we repeat this over all m pairs of microphones, sum their values and normalize by m:

p(A_t|s_i) = (1/m) Σ_{j=1}^{m} p(A_t^j|s_i)    (16)
Just like for the visual features, we normalize over all particles in order to get the acoustic observation likelihood for each particle:

p(A_t|s_i) ← p(A_t|s_i) / Σ_i p(A_t|s_i)    (17)
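For illustration, the following sketch computes a PHAT-weighted cross correlation and the per-particle lookup of Eqs. (13)–(15) in Python; the FFT length, the small regularizer and the synthetic signals are assumptions made for this example only.

```python
import numpy as np

def gcc_phat(x1, x2, fs):
    """PHAT-weighted cross correlation (Eq. 14) of two microphone signals;
    returns the correlation function and the corresponding lag axis in seconds."""
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12                        # PHAT weighting
    r = np.fft.irfft(cross, n=n)
    r = np.concatenate((r[-(n // 2):], r[: n // 2 + 1]))  # center zero lag
    lags = np.arange(-(n // 2), n // 2 + 1) / float(fs)
    return r, lags

def audio_score(particle, mic1, mic2, r, lags, c=343.0):
    """Eqs. (13) and (15): look up the PHAT value at the TDOA predicted by the
    particle position; negative values are clipped to zero."""
    tdoa = (np.linalg.norm(particle - mic1) - np.linalg.norm(particle - mic2)) / c
    idx = np.argmin(np.abs(lags - tdoa))
    return max(0.0, r[idx])

# toy usage: one microphone pair, a particle hypothesis, white-noise signals
fs = 16000
sig = np.random.randn(fs)
r, lags = gcc_phat(sig, np.roll(sig, 8), fs)
print(audio_score(np.array([2.0, 3.0, 1.5]),
                  np.array([0.0, 0.0, 1.7]), np.array([0.2, 0.0, 1.7]), r, lags))
```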
5 Experiments on the CLEAR'06 Evaluation Dataset
The performance of the proposed algorithm has been evaluated on the single-person tracking tasks of the CLEAR'06 Evaluation Campaign [1]. The dataset consists of recordings of actual lectures and seminars that were held at different sites. The evaluation dataset comprises a total of 14 recordings, each featuring a different speaker (see Fig. 3). From each recording, typically two segments of 5 minutes length are processed (26 segments in total). The lectures are complemented by slides which are projected onto a whiteboard next to the speaker. Apart from the lecturer, there are about 5–20 people in the audience. In many recordings, there is no clear separation between speaker area and audience. It must further be noted that every once in a while, audience members cross the speaker area in order to enter or to leave the room.

The details of the sensor setup vary among the sites. There is, however, a common setup of 4 fixed cameras in the room corners with full room coverage, and 4–6 microphone arrays mounted on the walls. Each array consists of 4 microphones: 3 in a row with a distance of 20 cm, and one 30 cm above the center microphone. The lecturer's head centroid was labeled manually every 10th frame. By means of the calibration information, a 3D label file was generated and serves as ground truth for the evaluation. A separate development dataset has been provided to tune tracking parameters and to train the face detection
Fig. 3. Snapshot from a lecture showing all 4 camera views
cascades. This development set consists of different lectures that were collected using the same setup as for the evaluation set. The system has not been hand-tuned to the different data collection sites. This means in particular:

– no images of the empty scene have been used; the background model initializes automatically
– no cameras or microphones were excluded – all sensors were used
– no speaker area has been defined; the tracker scans the entire room.

5.1 Results
The evaluation results presented in Table 1 are average values over all 26 lecture segments that were provided in the single-person tracking task of CLEAR'06. Two scores were defined to rate the tracking systems:

– MOTA: the multiple object tracking accuracy accumulates misses and false positives and relates them to the number of labeled frames. In this evaluation, a miss is defined as a hypothesis outside a 500 mm radius around the labeled head position. Note that each hypothesis outside the radius is a miss and a false positive at the same time, and is thus counted as 2 errors.
– MOTP: the multiple object tracking precision is the mean error of the hypotheses within the 500 mm radius around the labeled head position.

Note that the most relevant scores in the table are the miss rate and the MOTA score respectively, whereas MOTP only measures the precision for those hypotheses that are inside the 500 mm range. For the video-only evaluation, all labeled frames were used for scoring, whereas the audio-only condition was scored exclusively on frames in which the lecturer actually speaks. The multi-modal system was scored on both conditions.

It can be seen in the table that the video-only tracker outperforms the audio-only tracker. The combination of both performs clearly better than the unimodal systems as long as scoring is done on speech frames only. When evaluated on all frames, the audio-only tracker performs much worse than the video tracker, so that the combination of both is not beneficial anymore. As a comparison, Table 2 shows the results on the CLEAR'06 development set. Here, the combination of audio and video is beneficial even when evaluated on all frames.
Table 1. Results in 3D speaker tracking on the CLEAR'06 evaluation set

Tracking mode                   Misses   MOTA    MOTP
Video only                      14.3%    71.4%   127mm
Audio only                      22.6%    54.8%   186mm
Video + Audio (speech frames)    5.1%    89.8%   140mm
Video + Audio (all frames)      14.6%    70.8%   143mm

Table 2. Comparative results on the CLEAR'06 development set

Tracking mode                   Misses   MOTA    MOTP
Video only                      16.7%    66.5%   141mm
Audio only                      15.3%    69.4%   138mm
Video + Audio (all frames)       8.1%    84.0%   125mm
In 3 of the 26 evaluation segments, the audio-visual system has a miss rate of 55% or higher, whereas the miss rate on the other segments is always below 30%. An in-depth look at the segments with the worst performance reveals some reasons for this behavior: in the first of these three underperforming segments, the speaker is often standing in a corner of the room, speaking in the direction of the wall. Both audio and video fail here. The other two segments actually show the question-and-answer phase of a presentation. The speaker is standing still, while the participants are having a discussion. When evaluated on all frames, the audio tracker tracks the current speaker, which is most of the time not the labeled presenter. Segments like this were not included in the development set.

5.2 Implementation and Complexity
For maximum precision, the experiments on the CLEAR'06 dataset have been conducted with full image resolution and 500 particles. The processing time for 1 s of data (all sensors together) on a single 3 GHz PC was 2.3 s (audio-only), 4.8 s (video-only) and 11.6 s (audio-visual).

On the video side, the proposed algorithm consists of two parts that can be characterized by their relation to the three factors that determine the runtime of the algorithm. The feature preparation part (foreground segmentation, integral image calculation) is related linearly to the image size S with a constant time factor t_VS. In contrast, the particle evaluation part is independent of S and related linearly to the number of particles P with a constant t_VP. Both parts are likewise related linearly to the number of views V. On the audio side, the runtime is linearly related to the number of microphone pairs M. As in the video case, this can be further decomposed into a constant preprocessing part t_AM and a part t_AP that has to be repeated for each particle. Thus, the total processing time per frame is determined by:

t_total = (t_VS · S + t_VP · P) · V + (t_AM + t_AP · P) · M    (18)
As this equation indicates, the visual part can be parallelized intuitively across the number of views V. We implemented such a video-only tracker using 4 desktop PCs, each connected to a camera, in a way that the image processing is done locally on each machine. Because only low-bandwidth data (particle positions and weights) are shared over the network, the overhead is negligible, and a speed of 15 fps (including image acquisition) could easily be achieved. In the live system, image downsampling by a factor of 2–4 and a number of 100–200 particles performs reasonably well.
6 Conclusion
We presented an algorithm for tracking a person using multiple cameras and multiple pairs of microphones. The core of the proposed algorithm is a particle filter that works without explicit triangulation. Instead, it estimates the 3D location by sampled projection, thus benefiting from each single view and microphone pair. The video features used for tracking are based on foreground segmentation and the response of detectors for upper body, frontal face and profile face. The audio features are based on the time delays of arrival between pairs of microphones, and are estimated with a generalized cross correlation function.

The audio-visual tracking algorithm was evaluated on the CLEAR'06 dataset and outperformed both the audio-only and the video-only tracker. One reason for this is that the video and audio features described in this paper complement one another well: the comparatively coarse foreground feature along with the audio feature guide the way for the face detector, which in turn gives very precise results as long as it searches around the true head position. Another reason for the benefit of the combination is that neither motion and face detection nor acoustic source localization responds exclusively to the lecturer rather than to people from the audience – so the combination of both increases the chance of actually tracking the lecturer.
Acknowledgments This work has been funded by the European Commission under Project CHIL (http://chil.server.de, contract #506909).
References

1. CLEAR 2006 Evaluation and Workshop Campaign. http://clear-evaluation.org. April 6-7 2006, Southampton, UK.
2. M. S. Brandstein. A framework for speech source localization using sensor arrays. PhD thesis, Brown University, Providence, RI, May 1995.
3. M. S. Brandstein, J. E. Adcock, and H. F. Silverman. A closed-form location estimator for use with room environment microphone arrays. IEEE Trans. Speech Audio Proc., 5(1):45–50, January 1997.
4. N. Checka, K. Wilson, V. Rangarajan, and T. Darrell. A probabilistic framework for multi-modal multi-person tracking. In IEEE Workshop on Multi-Object Tracking (in conjunction with CVPR), 2003.
5. J. Chen, J. Benesty, and Y. A. Huang. Robust time delay estimation exploiting redundancy among multiple microphones. IEEE Trans. Speech Audio Proc., 11(6):549–57, November 2003.
6. D. Gatica-Perez, G. Lathoud, I. McCowan, and J.-M. Odobez. A mixed-state i-particle filter for multi-camera speaker tracking. In Proc. IEEE ICCV Workshop on Multimedia Technologies in E-Learning and Collaboration (ICCV-WOMTEC), 2003.
7. Yiteng Huang, Jacob Benesty, Gary W. Elko, and Russell M. Mersereau. Real-time passive source localization: A practical linear-correction least-squares approach. IEEE Trans. Speech Audio Proc., 9(8):943–956, November 2001.
8. M. Isard and A. Blake. Condensation – conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1):5–28, 1998.
9. T. Gehrig, K. Nickel, H. K. Ekenel, U. Klee, and J. McDonough. Kalman filters for audio-video source localization. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct. 2005.
10. U. Klee, T. Gehrig, and J. McDonough. Kalman filters for time delay of arrival-based source localization. EURASIP Special Issue on Multichannel Speech Processing, submitted for publication.
11. H. Kruppa, M. Castrillon-Santana, and B. Schiele. Fast and robust face finding via local context. In IEEE Intl. Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, October 2003.
12. R. Lienhart and J. Maydt. An extended set of Haar-like features for rapid object detection. In ICIP, volume 1, pages 900–903, September 2002.
13. I. Mikic, S. Santini, and R. Jain. Tracking objects in 3D using multiple camera views. In ACCV, 2000.
14. M. Omologo and P. Svaizer. Acoustic event localization using a crosspower-spectrum phase based technique. Proc. ICASSP, II:273–6, 1994.
15. J. Vermaak, M. Gangnet, A. Blake, and P. Pérez. Sequential Monte Carlo fusion of sound and vision for speaker tracking. In Proc. IEEE Intl. Conf. on Computer Vision, volume 1, pages 741–746, 2001.
16. P. Viola and M. Jones. Robust real-time object detection. In ICCV Workshop on Statistical and Computation Theories of Vision, July 2001.
17. D. B. Ward, E. A. Lehmann, and R. C. Williamson. Particle filtering algorithms for tracking an acoustic source in a reverberant environment. IEEE Trans. Speech Audio Proc., 11(6):826–836, 2003.
18. M. Wölfel, K. Nickel, and J. McDonough. Microphone array driven speech recognition: Influence of localization on the word error rate. 2nd Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms, Edinburgh, 11-13 July 2005.
19. D. Zotkin, R. Duraiswami, and L. Davis. Joint audio-visual tracking using particle filters. EURASIP Journal on Applied Signal Processing, 2002(11), 2002.
Multi- and Single View Multiperson Tracking for Smart Room Environments

Keni Bernardin, Tobias Gehrig, and Rainer Stiefelhagen

Interactive Systems Lab, Institut für Theoretische Informatik, Universität Karlsruhe, 76131 Karlsruhe, Germany
{keni, tgehrig, stiefel}@ira.uka.de
Abstract. Simultaneous tracking of multiple persons in real-world environments is an active research field, and several approaches have been proposed, based on a variety of features and algorithms. In this work, we present two multimodal systems for tracking multiple users in a smart room environment. One is a multi-view tracker based on color histogram tracking and special person region detectors. The other is a wide-angle overhead view person tracker relying on foreground segmentation and model-based tracking. Both systems are complemented by a joint probabilistic data association filter-based source localization framework using input from several microphone arrays. We also very briefly present two intuitive metrics that allow for an objective comparison of tracker characteristics, focusing on their precision in estimating object locations, their accuracy in recognizing object configurations, and their ability to consistently label objects over time. The trackers are extensively tested and compared, for each modality separately and for the combined modalities, on the CLEAR 2006 Evaluation Database.
1 Introduction and Related Work
In recent years, there has been a growing interest in intelligent systems for indoor scene analysis. Various research projects, such as the European CHIL or AMI projects [17,18] or the VACE project in the U.S. [19], aim at developing smart room environments, at facilitating human-machine and human-human interaction, or at analyzing meeting or conference situations. To this effect, multimodal approaches that utilize a variety of far-field sensors, video cameras and microphones to gain rich scene information are becoming more and more popular. An essential building block for complex scene analysis is the detection and tracking of persons in the scene.

One of the major problems faced by indoor tracking systems is the lack of reliable features that allow persons to be tracked in natural, evolving and unconstrained scenarios. The most popular visual features in use are color features and foreground segmentation or movement features [2,1,3,6,7,16], each with their advantages and drawbacks. Doing e.g. blob tracking on background subtraction
maps is error-prone, as it requires a clean background and assumes that only persons are moving. In real environments, the foreground blobs are often fragmented or merged with others; they depict only parts of occluded persons or are produced by shadows or displaced objects. When using color information to track people, the problem is to create appropriate color histograms or models. Generic color models are usually sensitive and environment-specific [4]. If no generic model is used, one must at some point decide which pixels in the image belong to a person in order to initialize a dedicated color histogram [3,7,15,16]. In many cases, this still requires the cooperation of the users and/or a clean and relatively static background. On the acoustic side, although current techniques already allow for high accuracy in localization, they can still only be used effectively for the tracking of one person, and only when this person is speaking. This naturally leads to the development of more and more multimodal techniques.

Here, we present two multimodal systems for the tracking of multiple persons in a smart room scenario. A joint probabilistic data association filter is used in conjunction with a set of microphone arrays to determine active speaker positions. For the video modality, we investigate the advantages and drawbacks of two approaches, one relying on color histogram tracking in several corner camera images and subsequent triangulation, and one relying on foreground blob tracking in wide-angle top view images. For both systems, the acoustic and visual modalities are fused using a state-based selection and combination scheme on the single-modality tracker outputs. The systems are evaluated on the CLEAR'06 3D Multiperson Tracking Database, and compared using the MOTP and MOTA metrics, which will also be briefly described here.

The next sections introduce the multi-view and single-view visual trackers, and the JPDAF-based acoustic tracker. Section 6 gives a brief explanation of the used metrics. Section 7 shows the evaluation results on the CLEAR database, while Section 8 gives a brief summary and concludes.
2 Multi-view Person Tracking Using Color Histograms and Haar-Classifier Cascades
The developed system is a 3D tracker that uses several fixed cameras installed at the room corners [11]. It is designed to function with a variable number of cameras, with precision increasing as the number of cameras grows. It performs tracking first separately on each camera image, using color histogram models. Color tracks are initialized automatically using a combination of foreground maps and special object detectors. The information from several cameras is then fused to produce 3D hypotheses of the persons' positions. A more detailed explanation of the system's different components is given in the following.

2.1 Classifier Cascades and Foreground Segmentation
A set of special object detectors is used to detect persons in the camera images. They are classifier cascades that build on Haar-like features, as described in
[9,8]. For our implementation, the cascades were taken from the OpenCV [20] library. Two types of cascades are used: one trained to recognize frontal views of faces (face), and one to recognize the upper body region of standing or sitting persons (upper body). The image is scanned at different scales and bounding rectangles are obtained for regions likely to contain a person. By using these detectors, we avoid the drawbacks of creation/deletion zones and are able to initialize or recover a track at any place in the room.

Further, to reduce the amount of false detector hits, a preprocessing step is performed on the image. It is first segmented into foreground regions by using an adaptive background model. The foreground regions are then scanned using the classifier cascades. This combined approach offers two advantages: the cascades, on the one hand, increase robustness to segmentation errors, as foreground regions not belonging to persons, such as moved chairs, doors, shadows, etc., are ignored. The foreground segmentation, on the other hand, helps to decide which of the pixels inside a detection rectangle belong to a person, and which to the background. Knowing exactly which pixels belong to the detected person is useful to create accurate color histograms and improve color tracking performance.

2.2 Color Histogram Tracking and 2D Hypotheses
Whenever an object detector has found an upper or a full body in the image, a color histogram of the respective person region is constructed from the foreground pixels belonging to that region, and a track is initialized. The actual tracking is done based only on color features, by using the mean-shift algorithm [5] on histogram backprojection images. Care must be taken when creating the color histograms to reduce the negative effect of background colors that may have been mistakenly included in the person silhouette during the detection and segmentation phase. This is done by histogram division, as proposed in [12]. Several types of division are possible (division by a general background histogram, by the histogram of the background region immediately surrounding the person, etc.; see Fig. 1). The choice of the best technique depends on the conditions at hand and is made automatically at each track initialization step, by making a quick prediction of the effect of each technique on the tracking behavior in the next frame.

To ensure continued tracking stability, the histogram model for a track is also adapted every time a classifier cascade produces a detection hit on that track. Tracks that are not confirmed by a detection hit for some time are deleted, as they are most likely erroneous. The color-based tracker, as described above, is used to produce a 2D hypothesis for the position of a person in the image. Based on the type of cascade that triggered initialization of the tracker, and the original size of the detected region, the body center of the person in the image and the person's distance from the camera are estimated and output as hypothesis. When several types of trackers (face and upper body) are available for the same person, a combined output is produced.
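The color tracking step described above can be sketched with OpenCV as follows; the detection rectangle, the foreground mask and the random test frame are placeholders, and the histogram-division variants and adaptation logic are omitted.

```python
import cv2
import numpy as np

def init_histogram(frame_bgr, rect, fg_mask):
    """Build a hue histogram from the foreground pixels inside a detection
    rectangle (x, y, w, h)."""
    x, y, w, h = rect
    hsv = cv2.cvtColor(frame_bgr[y:y+h, x:x+w], cv2.COLOR_BGR2HSV)
    mask = fg_mask[y:y+h, x:x+w]
    hist = cv2.calcHist([hsv], [0], mask, [32], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    return hist

def track_step(frame_bgr, hist, window):
    """One mean-shift iteration on the histogram backprojection image."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    _, window = cv2.meanShift(backproj, window, crit)
    return window

# toy usage on a random frame; in the real system the rectangle comes from a
# face/upper-body detector hit and the mask from foreground segmentation
frame = np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8)
mask = np.full((240, 320), 255, dtype=np.uint8)
hist = init_histogram(frame, (100, 60, 40, 80), mask)
window = track_step(frame, hist, (100, 60, 40, 80))
print(window)
```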
Fig. 1. Color histogram creation, filtering and tracking. (a) Face, upper and full body detections (rectangles) in one camera view. (b) Foreground segmentation (in white); only foreground pixels inside the rectangles are used. (c) Histogram backprojection for the upper body track of the leftmost person. (d) DIV Background, (e) DIV Background², (f) DIV Border*Backg, (g) DIV Border*Backg²: effects of different types of histogram division. Background: overall background histogram. Border: histogram of the background region immediately surrounding the detected rectangle. (h) Tracker output as seen from another view.
2.3 Fusion and Generation of 3D Hypotheses
The 2D hypotheses produced for every camera view are triangulated to produce 3D position estimates. For this, the cameras must be calibrated and their position relative to a general room coordinate system known. The lines of view (LOV) coming from the optical centers of the cameras and passing through the 2D hypothesis points in their respective image planes are intersected. When no exact intersection point exists, a residual distance between LOVs, the triangulation error, can be calculated. This error value is used by an intelligent 3D tracking algorithm to establish likely correspondences between 2D tracks (as in [13]). When the triangulation error between a set of 2D hypotheses is small enough, they are associated to form a 3D track. Likewise, when it exceeds a certain threshold, the 2D hypothesis which contributes most to the error is dissociated again and the 3D track is maintained using the remaining hypotheses. The tracker requires a minimum of 2 cameras to produce 3D hypotheses, and becomes more robust as the number of cameras increases. Once a 3D estimate for a person’s position has been computed, it is further used to validate 2D tracks, to initiate color histogram tracking in camera views where the person has not yet been detected, to predict occlusions in a camera view and deactivate the involved 2D trackers, and to reinitialize tracking even in the absence of detector hits. The developed multiperson tracker draws its strength from the intelligent fusion of several camera views. It initializes its tracks automatically, constantly adapts its color models and verifies the validity of its tracks through the use
of special object detectors. It is capable of tracking several people, regardless of whether they are sitting, moving or standing still, in a cluttered environment with uneven lighting conditions.
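One way to intersect two lines of view and obtain the residual triangulation error is sketched below; the camera centers and viewing directions in the example are made up, and the track association and dissociation logic described above is not included.

```python
import numpy as np

def triangulate(o1, v1, o2, v2):
    """Midpoint of the shortest segment between two lines of view (camera
    center o, viewing direction v) and the residual distance between the
    lines, used as triangulation error."""
    v1 = v1 / np.linalg.norm(v1)
    v2 = v2 / np.linalg.norm(v2)
    w0 = o1 - o2
    a, b, c = v1 @ v1, v1 @ v2, v2 @ v2
    d, e = v1 @ w0, v2 @ w0
    denom = a * c - b * b
    if abs(denom) < 1e-9:                 # near-parallel lines of view
        s, t = 0.0, e / c
    else:
        s = (b * e - c * d) / denom
        t = (a * e - b * d) / denom
    p1 = o1 + s * v1                      # closest point on line 1
    p2 = o2 + t * v2                      # closest point on line 2
    return (p1 + p2) / 2.0, np.linalg.norm(p1 - p2)

# toy usage: two cameras whose lines of view meet near (3, 3, 1.6)
point, err = triangulate(np.array([0.0, 0.0, 2.5]), np.array([1.0, 1.0, -0.3]),
                         np.array([6.0, 0.0, 2.5]), np.array([-1.0, 1.0, -0.3]))
print(point, err)
```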
Fig. 2. The output of the top camera tracker. The colored circles represent the person models.
3 Single-View Model-Based Person Tracking on Panoramic Images
In contrast to the above-presented multi-view system, a single-view tracker working on wide-angle images captured from the top of the room was also designed. The advantage of such images is that they reduce the chance of occlusion by objects or overlap between persons. The drawback is that detailed analysis of the tracked persons is difficult, as person-specific features are hard to observe (see Fig. 2).

The tracking algorithm is essentially composed of a simple but fast foreground blob segmentation followed by a more complex EM algorithm based on person models. First, foreground patches are extracted from the images by using a dynamic background model. The background model is created on a few initial images of the room and is constantly adapted with each new image with an adaptation factor α. Background subtraction and thresholding yield an initial foreground map, which is morphologically filtered. A connected component analysis provides the foreground blobs for tracking. Blobs below a certain size are rejected as segmentation errors.

The subsequent EM tracking algorithm tries to find an optimal assignment of the detected blobs to a set of active person models, instantiating new models or deleting unnecessary ones if need be. A person model, in our case, is composed of a position (x, y), a velocity (vx, vy), a radius r and a track ID. In our implementation, the model radius was estimated automatically using the calibration
information for the wide-angle camera and rough knowledge about the room height. The procedure is as follows (a simplified code sketch of the assignment and update step is given at the end of this section):

– For all person models M_i, verify their updated positions (x, y)_{M_i}. If the overlap between two models exceeds a maximum value, fuse them.
– For each pixel p in each foreground blob B_j, find the person model M_k which is closest to p. If the distance is smaller than r_{M_k}, assign p to M_k.
– Iteratively assign blobs to person models: for every foreground blob B_j whose pixels were assigned to at most one model M_k, assign B_j to M_k and use all assigned pixels from B_j to compute a position update for M_k. Subsequently, consider all assignments of pixels in other blobs to M_k as invalid. Repeat this step until all unambiguous mappings have been made. Position updates are made by calculating the mean of assigned pixels (x, y)_m and setting (x, y)_{M_k,new} = α_M (x, y)_m + (1 − α_M) (x, y)_{M_k}, with α_M the learning rate for model adaptation.
– For every blob whose pixels are still assigned to several models, accumulate the pixel positions assigned to each of these models. Then make the position updates based on the respectively assigned pixels only. This is to handle the case where two person tracks coincide: the foreground blobs are merged, but both person models still subsist as long as they do not overlap too greatly, and can keep track of their respective persons when they part again.
– For each remaining unassigned foreground blob, initialize a new person model, setting its (x, y) position to the blob center. Make the model active only if it subsists for a minimum period of time. On the other hand, if a model stays unassigned for a certain period of latency, delete it.
– Repeat the procedure from step 1.

The two-stage approach results in a fast tracking algorithm that is able to initialize and maintain several person tracks, even in the event of moderate overlap. Relying solely on foreground maps as features, however, makes the system relatively sensitive to situations with heavy overlap. This could be improved by including color information, or with e.g. temporal templates, as proposed in [1].

By assuming an average height of 1 m for a person's body center, and using calibration information for the top camera, the positions in the world coordinate frame of all N tracked persons are calculated and output. The system makes no assumptions about the environment, e.g. no special creation or deletion zones, about the consistency of a person's appearance or the surrounding room. It runs in real time, at 15 fps, on a Pentium 3 GHz machine.
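Below is a much-simplified Python sketch of the pixel-to-model assignment and position update; it omits the iterative disambiguation of ambiguous blobs, model fusion and the creation/deletion timing, and all names and parameter values are illustrative.

```python
import numpy as np

def assign_and_update(models, blobs, alpha_m=0.5):
    """Simplified assignment/update step: models are dicts with 'pos' (x, y)
    and radius 'r'; blobs are (n, 2) arrays of foreground pixel coordinates."""
    assigned = [[] for _ in models]
    unassigned_blobs = []
    for blob in blobs:
        hit = False
        for k, m in enumerate(models):
            dist = np.linalg.norm(blob - m['pos'], axis=1)
            inside = blob[dist < m['r']]          # pixels within model radius
            if len(inside):
                assigned[k].append(inside)
                hit = True
        if not hit:
            unassigned_blobs.append(blob)          # would spawn a new model
    for k, m in enumerate(models):
        if assigned[k]:
            mean = np.concatenate(assigned[k]).mean(axis=0)
            # blend the pixel mean into the model position
            m['pos'] = alpha_m * mean + (1 - alpha_m) * m['pos']
    return models, unassigned_blobs

# toy usage: one model and one nearby blob of foreground pixels
models = [{'pos': np.array([50.0, 50.0]), 'r': 20.0}]
blob = np.array([[55, 52], [56, 53], [57, 51]], dtype=float)
models, new_blobs = assign_and_update(models, [blob])
print(models[0]['pos'], len(new_blobs))
```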
4 A JPDAF Source Localizer for Speaker Tracking
In parallel to the visual tracking of all room occupants, acoustic source localization was performed to estimate the position of the active speaker. For this, the system relies on the input from four T-shaped microphone clusters installed on the room walls. They allow a precise localization in the horizontal plane, as well as height estimation. Two subtasks are accomplished:
– Speaker localization and tracking. This is done by estimating time delays of arrival between microphone pairs using the Generalized Cross Correlation function (GCC).
– Speech detection and segmentation. This is currently done by thresholding the GCC function, but techniques more robust to non-speech noise and crosstalk are already being experimented with.

Our system uses a variant of the GCC, the GCC-PHAT, defined as follows:

R_{12}(τ) = (1/2π) ∫_{−π}^{π} [ X_1(e^{jωτ}) X_2^*(e^{jωτ}) / |X_1(e^{jωτ}) X_2^*(e^{jωτ})| ] e^{jωτ} dω

where X_1(ω) and X_2(ω) are the Fourier transforms of the signals of a microphone pair in a microphone array. As opposed to other approaches, Kalman- or particle-filter based, this approach uses a Joint Probabilistic Data Association Filter that directly receives as input the time delays that maximize the correlation results from the various microphone pairs, and performs the tracking in a unified probabilistic way for multiple possible target hypotheses, thereby achieving more robust and accurate results. The details of the source localizer can be found in [10].

The output of the speaker localization module is the tracked position of the active speaker in the world coordinate frame. This position is compared in the fusion module to those of all visually tracked persons in the room and a combined hypothesis is produced.
5 State-Based Fusion
The fusion of the audio and video modalities is done at the decision level. Track estimates coming from the visual and acoustic tracking systems are combined using a finite state machine approach, which considers the relative strengths and weaknesses of each modality. The visual trackers are generally very accurate at determining a person's position. In multiperson scenarios they can, however, miss persons completely because their faces are too small or invisible, or because they are not well discernible from the background by color, shape or motion. The acoustic tracker, on the other hand, can precisely determine a person's position when this person speaks. In the current implementation, it can, however, only track one active speaker at a time and can produce no estimates when several or no persons are speaking. Based on this, the fusion of the acoustic and visual tracks is made using a finite state machine weighing the availability or reliability of the single-modality tracks.

For multimodal tracking, two main conditions are to be evaluated: for condition A, only the position of the active speaker in a multi-participant scenario is to be estimated. For condition B, on the other hand, all participants have to be tracked. Consequently, the states for the fusion of modalities differ slightly depending on the task condition. For condition A, they are as follows:
– State 1: An acoustic estimate is available, for which no overlapping visual estimate exists. Here, estimates are considered overlapping if their distance is smaller than 500 mm. In this case, assume the visual tracker has missed the speaking person and output the acoustic hypothesis. Store the last received acoustic estimate and keep outputting it until an overlapping visual estimate is found.
– State 2: An acoustic estimate is available, and a corresponding visual estimate exists. In this case, output the average of the acoustic and visual positions.
– State 3: After an overlapping visual estimate had been found, an acoustic estimate is no longer available. In this case, we consider that the visual tracker has recovered the previously undetected speaker and keep outputting the position of the last overlapping visual track.

For condition B, where all participants must be tracked, the acoustic estimate serves to increase the precision of the closest visual track, whenever available. The states are:

– State 1: An acoustic estimate is available, for which no overlapping visual estimate exists. In this case, assume the visual tracker has missed the speaking person and output the acoustic hypothesis in addition to the visual ones. Store the last received acoustic estimate and keep outputting it until an overlapping visual estimate is found.
– State 2 and State 3 are similar to condition A, with the exception that here, all other visual estimates are output as well.

Using this fusion scheme, two multimodal tracking systems were designed: System1, fusing the JPDAF acoustic tracker with the single-view visual tracker, and System2, fusing it with the multi-view tracker. Both systems were evaluated on conditions A and B, and the results are compared in Section 7. To allow better insight into the evaluation scores, the following section gives a brief overview of the used metrics.
6 Multiple Object Tracking Metrics
Defining good measures to express the characteristics of a system for continuous tracking of multiple objects is not a straightforward task. Various measures exist, and there is no consensus in the literature on the best set to use. Here, we propose a small expressive set of metrics and show a systematic procedure for their calculation. A more detailed discussion of these metrics can be found in [14].

Assuming that for every time frame t a multiple object tracker outputs a set of hypotheses {h_1 ... h_m} for a set of visible objects {o_1 ... o_n}, we define the procedure to evaluate its performance as follows: let the correspondence between an object o_i and a hypothesis h_j be valid only if their distance dist_{i,j} does not exceed a certain threshold T, and let M_t = {(o_i, h_j)} be a dynamic mapping of object-hypothesis pairs.
Let M_0 = {}. For every time frame t,

1. For every mapping (o_i, h_j) in M_{t−1}, verify if it is still valid. If object o_i is still visible and tracker hypothesis h_j still exists at time t, and if their distance does not exceed the threshold T, make the correspondence between o_i and h_j for frame t.
2. For all objects for which no correspondence was made yet, try to find a matching hypothesis. Allow only one-to-one matches. To find optimal correspondences that minimize the overall distance error, Munkres' algorithm is used. Only pairs for which the distance does not exceed the threshold T are valid. If a correspondence (o_i, h_k) is made that contradicts a mapping (o_i, h_j) in M_{t−1}, replace (o_i, h_j) with (o_i, h_k) in M_t. Count this as a mismatch error and let mme_t be the number of mismatch errors for frame t.
3. After the first two steps, a set of matching pairs for the current time frame is known. Let c_t be the number of matches found for time t. For each of these matches, calculate the distance d_t^i between the object o_i and its corresponding hypothesis.
4. All remaining hypotheses are considered false positives. Similarly, all remaining objects are considered misses. Let fp_t and m_t be the number of false positives and misses respectively for frame t. Let also g_t be the number of objects present at time t.
5. Repeat the procedure from step 1 for the next time frame.

Note that since for the initial frame the set of mappings M_0 is empty, all correspondences made are initial and no mismatch errors occur.

Based on the matching strategy described above, two very intuitive metrics can be defined: the Multiple Object Tracking Precision (MOTP), which shows the tracker's ability to estimate precise object positions, and the Multiple Object Tracking Accuracy (MOTA), which expresses its performance at estimating the number of objects and at keeping consistent trajectories:

MOTP = Σ_{i,t} d_t^i / Σ_t c_t    (1)

MOTA = 1 − Σ_t (m_t + fp_t + mme_t) / Σ_t g_t    (2)

The MOTA can be seen as composed of 3 error ratios:

m = Σ_t m_t / Σ_t g_t,  fp = Σ_t fp_t / Σ_t g_t,  mme = Σ_t mme_t / Σ_t g_t,

the ratio of misses, false positives and mismatches in the sequence, computed over the total number of objects present in all frames. Alternatively, to compare systems for which measurement of identity mismatches is not meaningful, an additional measure, the A-MOTA, can be computed by ignoring mismatch errors in the global error computation:

A-MOTA = 1 − Σ_t (m_t + fp_t) / Σ_t g_t    (3)
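The accumulation of these scores can be sketched as follows; greedy matching within the threshold is used here instead of Munkres' algorithm, identity mismatches are not tracked (so mme_t = 0 and MOTA coincides with A-MOTA), and the frame data are toy values.

```python
import numpy as np

def evaluate(frames, threshold=500.0):
    """Accumulate per-frame misses, false positives and matched distances and
    compute MOTP (Eq. 1) and MOTA (Eq. 2) with mme_t = 0.
    Each frame is a pair (objects, hypotheses) of lists of 3D positions."""
    dist_sum = matches = misses = false_pos = objects = 0
    for gt, hyp in frames:
        objects += len(gt)
        used = set()
        for o in gt:
            best, best_d = None, threshold
            for j, h in enumerate(hyp):
                d = np.linalg.norm(np.array(o) - np.array(h))
                if j not in used and d < best_d:
                    best, best_d = j, d
            if best is None:
                misses += 1                       # no hypothesis within T
            else:
                used.add(best)
                matches += 1
                dist_sum += best_d
        false_pos += len(hyp) - len(used)          # unmatched hypotheses
    motp = dist_sum / matches if matches else float('nan')
    mota = 1.0 - (misses + false_pos) / objects
    return motp, mota

# toy usage: two frames; the second hypothesis is outside the 500 mm radius,
# so it counts as one miss and one false positive at the same time
frames = [([(0, 0, 0)], [(100, 0, 0)]),
          ([(0, 0, 0)], [(900, 0, 0)])]
print(evaluate(frames))
```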
7 Evaluation on the CLEAR'06 3D Multiperson Tracking Database
The above presented systems for visual and multimodal tracking were evaluated on the CLEAR'06 3D Multiperson Tracking Database. This database comprises recordings from 3 different CHIL smart rooms, involving up to 6 persons in a seminar scenario, for a total of approx. 60 min. Tables 1 and 2 show the results for the single- and multi-view based systems, System1 and System2, for the visual and the multimodal conditions A and B.

Table 1. Evaluation results for the visual and multimodal B conditions

System       MOTP    m      fp     mme    MOTA
1:Visual     217mm   27.6%  20.3%  1.0%   51.1%
1:AV CondB   226mm   26.1%  20.8%  1.1%   52.0%
2:Visual     203mm   46.0%  24.9%  2.8%   26.3%
2:AV CondB   223mm   44.4%  25.8%  3.3%   26.4%

Table 2. Evaluation results for the multimodal A condition

System       MOTP    m      fp     mme    MOTA
1:AV CondA   223mm   51.4%  51.4%  2.1%   -5.0%
2:AV CondA   179mm   51.4%  51.4%  5.3%   -8.2%
As Table 1 shows, the single-view tracker clearly outperforms the multi-view approach. As the scenario involved mostly people sitting around a table and occasionally walking, they were very clearly distinguishable from a top view, even when using simple features such as foreground blobs for tracking. The multi-view approach, on the other hand, had more moderate results, stemming from the considerably more difficult video data. The problems can be summed up in two categories:

– 2D tracking errors: In several seminars, participants were only hardly distinguishable from the background using color information, or detectable by the face and body detectors, due to low resolution. This accounts for the relatively high amount of missed persons.
– Triangulation errors: The low angle of view of the corner cameras and the small size of most recording rooms caused a considerable amount of occlusion in most seminars, which could not be completely resolved by the triangulation scheme. A more precise distance estimation, based on the size of detection hits, could help avoid many of the occurred triangulation errors and reduce the false positive count.

In all cases, the average MOTP error was about 20 cm, making the MOTA the more interesting metric for comparison. As can also be seen, although the
addition of the acoustic modality could bring a slight improvement in tracking accuracy, the gain is minimal, as it could only help improve tracking performance for the speaking person at each respective point in time.

Compared to these results, the scores for condition A are relatively low. Both systems produced a high amount of miss errors (around 50%), as the correct speaker could not be selected from the multiple available tracks. It is noticeable, though, that in case the correct speaker was tracked, the multi-view System2 achieved a higher precision, reaching 18 cm, as compared to 20 cm for System1. This suggests that for the tracking of clearly identifiable persons (such as the presenter in the seminars), the multi-view, face and body-detector based approach does have its advantages.
8 Summary
In this work, two systems for multimodal tracking of multiple users were presented. A joint probabilistic data association filter for source localization is used in conjunction with two distinct systems for visual tracking: one using multiple camera images, based on color histogram tracking and Haar-feature classifier cascades for upper bodies and faces; the other using only a wide-angle overhead view, and model-based tracking on foreground segmentation features. A fusion scheme is presented, using a 3-state finite-state machine to combine the output of the audio and visual trackers. The systems were extensively tested on the CLEAR 2006 3D Multiperson Tracking Database, for the visual and the audio-visual conditions A and B. The results show that under fairly controlled conditions, as can be expected of meeting situations with relatively few participants, an overhead wide-angle view analysis can yield considerable advantages over more elaborate multi-camera systems, even if only simple features, such as foreground blobs, are used. Overall, an accuracy of 52% could be reached for the audio-visual task, with position errors below 23 cm.
Acknowledgments The work presented here was partly funded by the European Union (EU) under the integrated project CHIL, Computers in the Human Interaction Loop (Grant number IST-506909).
References

1. Rania Y. Khalaf and Stephen S. Intille, "Improving Multiple People Tracking using Temporal Consistency", MIT Dept. of Architecture House n Project Technical Report, 2001.
2. Wei Niu, Long Jiao, Dan Han, and Yuan-Fang Wang, "Real-Time Multi-Person Tracking in Video Surveillance", Pacific Rim Multimedia Conference, Singapore, 2003.
3. A. Mittal and L. S. Davis, "M2Tracker: A Multi-View Approach to Segmenting and Tracking People in a Cluttered Scene Using Region-Based Stereo", European Conf. on Computer Vision, LNCS 2350, pp. 18-33, 2002.
4. Neal Checka, Kevin Wilson, Vibhav Rangarajan, Trevor Darrell, "A Probabilistic Framework for Multi-modal Multi-Person Tracking", Workshop on Multi-Object Tracking (CVPR), 2003.
5. Dorin Comaniciu and Peter Meer, "Mean Shift: A Robust Approach Toward Feature Space Analysis", IEEE PAMI, Vol. 24, No. 5, May 2002.
6. Ismail Haritaoglu, David Harwood and Larry S. Davis, "W4: Who? When? Where? What? A Real Time System for Detecting and Tracking People", Third Face and Gesture Recognition Conference, pp. 222–227, 1998.
7. Yogesh Raja, Stephen J. McKenna, Shaogang Gong, "Tracking and Segmenting People in Varying Lighting Conditions using Colour", 3rd Int. Conference on Face & Gesture Recognition, pp. 228, 1998.
8. Paul Viola and Michael Jones, "Rapid Object Detection using a Boosted Cascade of Simple Features", IEEE CVPR, 2001.
9. Rainer Lienhart and Jochen Maydt, "An Extended Set of Haar-like Features for Rapid Object Detection", IEEE ICIP 2002, Vol. 1, pp. 900–903, Sep. 2002.
10. T. Gehrig, J. McDonough, "Tracking of Multiple Speakers with Probabilistic Data Association Filters", CLEAR Workshop, Southampton, UK, April 2006.
11. Keni Bernardin, Alexander Elbs, Rainer Stiefelhagen, "Detection-Assisted Initialization, Adaptation and Fusion of Body Region Trackers for Robust Multiperson Tracking", IEEE International Conference on Pattern Recognition, 20-24 August 2006, Hong Kong.
12. Kai Nickel and Rainer Stiefelhagen, "Pointing Gesture Recognition based on 3D-tracking of Face, Hands and Head Orientation", 5th International Conference on Multimodal Interfaces, Vancouver, Canada, Nov. 2003.
13. Dirk Focken, Rainer Stiefelhagen, "Towards Vision-Based 3-D People Tracking in a Smart Room", IEEE International Conference on Multimodal Interfaces, Pittsburgh, PA, USA, October 14-16, 2002, pp. 400-405.
14. Keni Bernardin, Alexander Elbs and Rainer Stiefelhagen, "Multiple Object Tracking Performance Metrics and Evaluation in a Smart Room Environment", Sixth IEEE International Workshop on Visual Surveillance, in conjunction with ECCV 2006, May 13th 2006, Graz, Austria.
15. Hai Tao, Harpreet Sawhney and Rakesh Kumar, "A Sampling Algorithm for Tracking Multiple Objects", International Workshop on Vision Algorithms: Theory and Practice, pp. 53–68, 1999.
16. Christopher Wren, Ali Azarbayejani, Trevor Darrell, Alex Pentland, "Pfinder: Real-Time Tracking of the Human Body", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 7, pp. 780–785, July 1997.
17. CHIL - Computers In the Human Interaction Loop, http://chil.server.de
18. AMI - Augmented Multiparty Interaction, http://www.amiproject.org
19. VACE - Video Analysis and Content Extraction, http://www.ic-arda.org
20. OpenCV - Open Computer Vision Library, http://sourceforge.net/projects/opencvlibrary
UPC Audio, Video and Multimodal Person Tracking Systems in the Clear Evaluation Campaign

A. Abad, C. Canton-Ferrer, C. Segura, J.L. Landabaso, D. Macho, J.R. Casas, J. Hernando, M. Pardàs, and C. Nadeu

Technical University of Catalonia, Barcelona, Spain
{alberto,ccanton,csegura,jl,dusan,josep,javier,montse,climent}@gps.tsc.upc.es
Abstract. Reliable measures of person positions are needed for computational perception of human activities taking place in a smart-room environment. In this work, we present the Person Tracking systems developed at UPC for audio, video and audio-video modalities in the context of the EU-funded CHIL project research activities. The aim of the designed systems, and particularly of the new contributions proposed, is to deal robustly with both single- and multi-person localization tasks independently of the environmental conditions. Besides the technology description, experimental results obtained for the CLEAR evaluation workshop are also reported.
1 Introduction
Person localization and tracking is a basic functionality for computational perception of human activities in a smart-room environment. Additionally, reliable measures of the position of persons are needed for technologies that are often deployed in that environment and use different modalities, like microphone array beamforming or steering of pan-tilt-zoom cameras towards the active speaker. To locate persons with unobtrusive far-field sensors, either video or audio sources can be used, though eventually the most accurate and robust techniques will likely be based on multimodal information.

The degree of reliable information provided by person localization systems on the basis of the audio and video signals collected in a smart-room environment with a distributed microphone and video network depends on a number of factors, such as environmental noise, room reverberation, person movements and camera occlusions. These factors, among others, demand an effort on the development of new robust systems capable of dealing with adverse environments. In the present work, we give an insight into the development and design of robust Person Tracking systems based on audio, video and audio-video modalities in the framework of the CHIL [1] research activities conducted at UPC.
This work has been partially sponsored by the EC-funded project CHIL (IST-2002506909) and by the Spanish Government-funded project ACESCA (TIN2005-08852).
2 Audio Person Tracking System
Conventional acoustic person localization and tracking systems can be split into three basic stages. In the first stage, estimates of quantities such as the Time Difference of Arrival or the Direction of Arrival are usually obtained from the combination of the different microphones available. In the second stage, the set of relative delays or direction-of-arrival estimates is generally used to derive the source position that is in best accordance with them and with the given geometry. In the third, optional stage, tracking of the possible movements of the sources according to a motion model can be employed. The SRP-PHAT [2] algorithm (also known as Global Coherence Field [3]) performs and integrates the first two stages of localization in a robust and smart way. In general, the goal of localization techniques based on SRP (Steered Response Power) is to maximize the power of the received sound source signal using a delay-and-sum or a filter-and-sum beamformer. In the simplest case, the output of the delay-and-sum beamformer is the sum of the signals of each microphone with the adequate steering delays for the position that is explored. Thus, a simple localization strategy is to search for the energy peak over all the possible positions in 3D space. Concretely, the SRP-PHAT algorithm searches for the maximum of the contribution of the cross-correlations between all the microphone pairs across the space. The main strength of this technique is the combination of the simplicity of the steered beamformer approach with the robustness offered by the PHAT weighting. The proposed UPC system for Audio Person Tracking is based on the SRP-PHAT algorithm with some additional robust modifications. The system has been designed to be robust independently of the acoustic and room conditions, such as the number of sources, their maneuvering modes or the number of microphones.
2.1 Brief Description of the SRP-PHAT Algorithm
As already mentioned above, the SRP-PHAT algorithm searches for the maximum of the contribution of the cross-correlations between all the microphone pairs across the space. The process can be summarized into four basic steps: Step 1. The exploration space is first split into small regions (typically of 5-10 cm). Then, the theoretical delays from each possible exploration region to each microphone pair are pre-computed and stored. Step 2. The cross-correlations of each microphone pair are estimated for each analysis frame. Concretely, the Generalized Cross Correlation with PHAT weighting [4] is considered. It can be expressed in terms of the inverse Fourier transform of the estimated cross-power spectral density $\hat{G}_{x_1 x_2}(f)$ as follows:

$$\hat{R}_{x_1 x_2}(\tau) = \int_{-\infty}^{\infty} \frac{\hat{G}_{x_1 x_2}(f)}{|\hat{G}_{x_1 x_2}(f)|}\, e^{j 2 \pi f \tau}\, df \qquad (1)$$
Step 3. The contribution of the cross-correlations is accumulated for each exploration region using the delays pre-computed in Step 1. In this way, a kind of Sound Map, like the one shown in Figure 1, is obtained. Step 4. Finally, the position with the maximum score is selected as the estimated position.
Fig. 1. On the left, zenithal camera snapshot. On the right, example of the Sound Map obtained with the SRP-PHAT process.
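The four steps above can be condensed into a brief sketch. The code below is illustrative only and not the authors' implementation; the exploration grid, the microphone-pair list and the precomputed steering delays are assumed to be supplied by the caller.

```python
import numpy as np

def gcc_phat(x1, x2, nfft=4096):
    # Generalized Cross Correlation with PHAT weighting (Eq. 1), computed via FFT.
    X1 = np.fft.rfft(x1, nfft)
    X2 = np.fft.rfft(x2, nfft)
    cps = X1 * np.conj(X2)
    cps /= np.abs(cps) + 1e-12            # PHAT: keep phase, discard magnitude
    return np.fft.irfft(cps, nfft)        # correlation indexed by lag in samples

def srp_phat(frames, mic_pairs, delays, nfft=4096):
    """frames: dict mic_id -> windowed signal frame;
    mic_pairs: list of (mic_a, mic_b) tuples;
    delays: array (n_cells, n_pairs) of precomputed steering delays in samples."""
    score = np.zeros(delays.shape[0])
    for p, (a, b) in enumerate(mic_pairs):
        cc = gcc_phat(frames[a], frames[b], nfft)
        lags = np.round(delays[:, p]).astype(int) % nfft
        score += cc[lags]                 # Step 3: accumulate per exploration cell
    return int(np.argmax(score)), score   # Step 4: cell with the maximum score
```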
2.2 The Implementation of the Robust UPC Audio Person Tracker
On the basis of the conventional SRP-PHAT, a robust system for Audio Person Tracking is developed. The main novelties introduced, together with some other implementation details, are described in the following. Implementation Details. The analysis frame consists of Hanning-windowed blocks of 4096 samples, 50% overlapped, obtained at a sample rate of 44.1 kHz. The FFT computation dimension is fixed to 4096 samples. Adaptive Smoothing Factor for the Cross-Power Spectrum (CPS) Estimations. Smoothing the GCC-PHAT estimations over time is a simple and efficient way of adding robustness to the system. This smoothing can be done in the time domain (GCC-PHAT) or in the frequency domain (CPS). Considering the smoothed cross-power spectrum $\hat{G}_{x_1 x_2}(k, f)$ at time instant $k$ and the instantaneous estimation $G_{x_1 x_2}(k, f)$, our system performs the smoothing in the frequency domain as follows:

$$\hat{G}_{x_1 x_2}(k, f) = \beta\, \hat{G}_{x_1 x_2}(k-1, f) + (1 - \beta)\, G_{x_1 x_2}(k, f) \qquad (2)$$
From experimental observation it can be seen that the right selection of this β factor is crucial in the system design. A high smoothing value can greatly enhance the results obtained in an almost static scenario, while it can be dramatically inconvenient in a scenario with many moving speakers. Hence, an adaptive smoothing factor has been designed. This adaptive factor is obtained based on the velocity estimation provided by a Kalman filter.
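As a rough illustration of this adaptation, the smoothing factor can be interpolated between a "static" and a "moving" value according to the speed estimated by the Kalman filter. The mapping and the constants below are assumptions made for the sketch; the paper only states that β is adapted from the Kalman velocity estimate.

```python
import numpy as np

def smooth_cps(prev_cps, inst_cps, speed, beta_static=0.9, beta_moving=0.3, v_ref=1.0):
    """Apply Eq. (2) with a velocity-dependent smoothing factor.
    speed: speaker speed estimated by the Kalman filter;
    beta_static, beta_moving, v_ref: illustrative constants, not from the paper."""
    w = min(max(speed / v_ref, 0.0), 1.0)            # 0 = static, 1 = fast mover
    beta = (1.0 - w) * beta_static + w * beta_moving
    return beta * prev_cps + (1.0 - beta) * inst_cps
```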
Two-Pass SRP Search. It can be seen from experimental observations that most of the information for a rough localization is concentrated in the low-frequency bins of the GCC-PHAT, while high-frequency bins are useful in order to obtain a finer estimation given a first coarse estimation. Taking this observation into account, a two-pass SRP search has been designed: Coarse Search. This search is performed only in the x-y plane (z is assumed to be 1.5 m), with a searching cell dimension of 16 cm and only using the low-frequency information of the cross-correlations (f < 9 kHz). A first coarse estimation is obtained from this search, say (x1, y1, 150) cm. Fine Search. A new limited search area around the obtained coarse estimation is defined (x1 − 50 : x1 + 50, y1 − 50 : y1 + 50, 110 : 190) cm. In this new fine search, the dimension of the search cell is fixed to 4 cm for the x-y axes and to 8 cm for the z-axis. In the fine search, all the frequency information of the cross-correlations is used and a more accurate estimation is obtained. Moreover, the double SRP searching procedure is adequate to reduce the computational load, since the fine exploration is only performed across a very limited area. Confidence Threshold. In the SRP-PHAT algorithm, the position with the maximum value obtained from the accumulated contributions of all the correlations is selected (Step 4). This value is assumed to be well correlated with the likelihood of the given estimation. Hence, it is compared to a fixed threshold (depending on the number of microphone pairs used) to reject or accept the estimation. The threshold has been experimentally fixed to 0.5 for each 6 microphone pairs. Finally, it is worth noting that although a Kalman filter is used for the estimation of the adaptive CPS smoothing factor, it is not used for tracking purposes. The reason is that the Kalman filter design and the data association strategies adopted showed a different impact depending on the scenario; in other words, they proved to be too dependent on the number and velocities of the sources to perform correctly.
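A compact sketch of the coarse-to-fine search and the confidence gate follows. The function srp_score is a hypothetical wrapper around the accumulation of Section 2.1 (it takes candidate 3D points and a flag restricting the cross-correlations to the low-frequency band) and is not part of the paper.

```python
import numpy as np

def two_pass_search(srp_score, room_xy, n_pairs, thr_per_6_pairs=0.5):
    # Coarse pass: 16 cm x-y grid at z = 1.5 m, low-frequency band only.
    xs = np.arange(0.0, room_xy[0], 0.16)
    ys = np.arange(0.0, room_xy[1], 0.16)
    coarse = np.array([(x, y, 1.5) for x in xs for y in ys])
    s_coarse = srp_score(coarse, low_freq_only=True)
    x1, y1, _ = coarse[int(np.argmax(s_coarse))]

    # Fine pass: 4 cm (x-y) / 8 cm (z) grid in a +/-50 cm, 1.10-1.90 m neighbourhood.
    fx = np.arange(x1 - 0.5, x1 + 0.5, 0.04)
    fy = np.arange(y1 - 0.5, y1 + 0.5, 0.04)
    fz = np.arange(1.10, 1.90, 0.08)
    fine = np.array([(x, y, z) for x in fx for y in fy for z in fz])
    s_fine = srp_score(fine, low_freq_only=False)
    best = int(np.argmax(s_fine))

    # Confidence threshold: 0.5 for each 6 microphone pairs.
    threshold = thr_per_6_pairs * (n_pairs / 6.0)
    return fine[best] if s_fine[best] >= threshold else None
```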
3 Video Person Tracking System
For this task, we propose using the camera views to extract foreground voxels, i.e., the smallest distinguishable box-shaped parts of a three-dimensional image. Indeed, foreground voxels provide enough information for precise object detection and tracking. Shape from silhouette, which is a non-invasive and fast technique, is used to generate the foreground voxels. A calibrated [5] set of cameras must be placed around the scene of interest, and the camera pixels must be classified as either part of the shape (foreground) or background. Each foreground camera point defines a ray in the scene space that intersects the object at some unknown depth along this ray; the union of these visual rays for all points in the silhouette defines a generalized cone within which the 3D object must lie. Finally, the object is guaranteed to lie in the volume defined by the
intersection of all the cones. The main drawback of the method is that it does not always capture the true shape of the object, as concave shape regions are not expressed in the silhouettes. However, this is not a severe problem in a tracking application, as the aim is not to reconstruct photorealistic scenes.
Fig. 2. The system block diagram showing the chain of functional modules
After the voxelization process (see Figure 2), a connected component analysis (CCA) follows to cluster and label the voxels into meaningful 3D-blobs, from which some representative features are extracted. Finally, there is a template-based matching process aiming to find persistent blob correspondences between consecutive frames.
3.1 3D Blob Extraction
Once the foreground region has been extracted in each camera view by using a modified version of the Stauffer and Grimson algorithm [9, 6, 7, 8], the blobs in the 3D space are constructed. In our implementation, the bounding volume (the room) is discretized into voxels. Each of the foreground camera points defines a ray in the scene. Then, the voxels are marked as occupied when there are intersecting rays from at least MINC cameras out of the total N. The relaxation in the number of intersecting rays at a voxel prevents typical missing-foreground errors at the pixel level in a certain view, consisting of foreground pixels incorrectly classified as background. Besides, camera redundancy also prevents analogous false-foreground errors, since a wrongly defined ray in a view is unlikely to intersect with at least MINC − 1 rays from the rest of the cameras at any voxel. Voxel Connectivity Analysis. After marking all the occupied voxels with the process described above, a connectivity analysis is performed to detect clouds
of connected voxels, i.e. 3D-blobs, corresponding to tracking targets. We choose to group the voxels with 26-connectivity, which means that any possible contact between voxels (vertices, edges, and surfaces) makes them form a group. Then, from all the possible blobs, we consider only the ones with a number of connected voxels greater than a certain threshold B_SIZE, to avoid spurious detections. Voxel Coloring. After voxel grouping, the blobs are characterized with their color (dominant color, histogram, histogram at different heights, etc.), among other features. This characterization is employed later for tracking purposes. However, a trustworthy and fast voxel coloring technique has to be employed before any color extraction method is applied to the blob. We need to note that during the voxelization and labeling process, inter- and intra-object occlusions are not considered, as it is irrelevant whether a ray came from the occluded or the occluding object. However, in order to guarantee correct pixel-color mapping to visible voxels in a certain view, occlusions have to be determined first. We discard slow exhaustive search techniques, which project back all the occupied voxels to all the camera views to check for intersecting voxels along the projection ray. Instead, for the sake of computational efficiency, we propose a faster technique that makes use of the target localization, which can be obtained from the tracking system. As photorealistic coloring is not required in our application, intra-object occlusions are simply determined by examining whether the voxel is more distant to the camera than the centroid of the blob the voxel belongs to. On the other hand, inter-object occlusions at a voxel are simply determined by finding objects (represented by their centroid) in between the camera and the voxel. This is achieved by computing the closest distance between the voxel-to-camera segment and the objects' centroids (dist(vc, oc)). The process is schematized at the Voxel-Blob level in Figure 3. To reduce the computational complexity even further, the voxels can be approximated by the position of the centroid of the blob they belong to, as shown at the Blob level in Figure 3, and intra-object occlusions are not examined. Finally, the color of the voxels is calculated as an average of the projected colors from all the non-occluding views.
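A minimal sketch of the occupancy test and the 26-connectivity analysis described in this subsection is given below. It assumes a vote volume counting, for each voxel, how many camera silhouettes back-project through it; the thresholds are placeholders rather than the values used by UPC.

```python
import numpy as np
from scipy import ndimage

def extract_blobs(votes, min_cams=3, b_size=200):
    """votes: 3-D integer array, per-voxel count of foreground camera rays."""
    occupied = votes >= min_cams                     # MINC-out-of-N intersection test
    # 26-connectivity: any shared vertex, edge or face links two voxels.
    labels, n_blobs = ndimage.label(occupied, structure=np.ones((3, 3, 3)))
    blobs = []
    for lab in range(1, n_blobs + 1):
        idx = np.argwhere(labels == lab)
        if len(idx) >= b_size:                       # discard spurious small clouds
            blobs.append({"voxels": idx, "centroid": idx.mean(axis=0)})
    return blobs
```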
3.2 Object Tracking
After labeling and voxel coloring, the blobs are temporally tracked throughout their movements within the scene by means of temporal templates. Each object of interest in the scene is modeled by a temporal template of persistent features. In the current studies, a set of three significant features is used for describing them: the velocity at the centroid, the volume, and the histogram. Therefore, at time $t$ we have, for each object $l$ centered at $(p_x^l, p_y^l, p_z^l)$, a template of features $M_l(t)$. Prior to matching the template $l$ with a candidate blob $k$ in frame $t+1$, centered at $(p_x^k, p_y^k, p_z^k)$ with a feature vector $B_k(t+1)$,
Fig. 3. Voxel Coloring block diagram, showing the two proposed methods. On the left, the Voxel-Blob level, which addresses voxel coloring individually. On the right, a faster approach using only the centroids of the blobs.
Kalman filters are used to update the template by predicting its new velocity and size in $\hat{M}_l(t+1)$. The mean $M_l(t)$ and variance $V_l(t)$ vectors of the templates are updated when a candidate blob $k$ in frame $t+1$ is found to match it. The updates are computed using the latest corresponding $L$ blobs that the object has matched. For the matching procedure we choose to use a parallel matching strategy. The main issue is the use of a proper distance metric that best suits the problem under study. The template for each object being tracked has a set of associated Kalman filters that predict the expected value of each feature (except for the histogram) in the next frame. Obviously, some features are more persistent for an object while others may be more susceptible to noise. Also, different features normally assume values in different ranges with different variances. The Euclidean distance does not account for these factors, as it allows dimensions with larger scales and variances to dominate the distance measure. One way to tackle this problem is to use the Mahalanobis distance metric, which takes into account not only the scaling and variance of a feature, but also the variation of other features based on the covariance matrix. Thus, if there are correlated features, their contribution is weighted appropriately. However, with high-dimensional data, the covariance matrix can become non-invertible. Furthermore, matrix inversion is a computationally expensive process, not suitable for real-time operation. So, in the current work, a weighted Euclidean distance between the template $l$ and a candidate blob $k$ is adopted, assuming a diagonal covariance matrix. For a heterogeneous data set, this is a reasonable distance definition. Further details of the technique have been presented in the past [6].
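The following sketch illustrates the weighted Euclidean distance with a diagonal covariance and a simple template-to-blob assignment loop. The feature layout and the greedy assignment are illustrative assumptions; the exact feature set and matching bookkeeping of the UPC system are not reproduced here.

```python
import numpy as np

def weighted_distance(predicted, candidate, variances):
    # Euclidean distance with each feature dimension scaled by its variance
    # (i.e., a Mahalanobis distance with a diagonal covariance matrix).
    d = np.asarray(predicted) - np.asarray(candidate)
    return float(np.sqrt(np.sum(d * d / (np.asarray(variances) + 1e-9))))

def match_templates(templates, blobs):
    """templates: dict id -> {'pred': predicted features, 'var': feature variances};
    blobs: list of {'feat': measured features}. Returns id -> best blob index."""
    matches = {}
    for tid, tpl in templates.items():
        if not blobs:
            matches[tid] = None
            continue
        dists = [weighted_distance(tpl["pred"], b["feat"], tpl["var"]) for b in blobs]
        matches[tid] = int(np.argmin(dists))
    return matches
```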
4 Multimodal Person Tracking System
Multimodal Person Tracking is performed based on the audio and video person tracking technologies described in the previous sections. These two technologies are of a different nature: for example, they have different frame rates, and while the video tracking system is able to track several persons, usually only one position estimate is given by the audio tracking system, and only when that person is actively speaking. A multimodal system aiming at the fusion of the information provided by these two technologies has to take these differences into account. We expect to have far more position estimates from the video system than from the audio system, since persons in the smart room are visible by the cameras during most of the video frames; in contrast, the audio system can estimate a person's position only if she/he is speaking (a so-called active speaker). Thus, the presented multimodal approach relies more on the video tracking system and is extended to incorporate the audio estimates into the corresponding video tracks. This is achieved by first synchronizing the audio and video estimates and then using data association techniques. After that, a decentralized Kalman filter is used to provide a global estimate of the person's position. The frame rate of the multimodal tracking is the same as that of the video system.
4.1 UPC Implementation
The Kalman filter algorithm provides an efficient computational solution for recursively estimating the position in situations where the system dynamics can be described by a state-space model. A detailed description of the Kalman filter for tracking can be found in [10, 11]. The decentralized Kalman filter [12] is used for the fusion of the audio and video position estimates. As shown in Figure 4, the system can be divided into two modules associated with the audio and video systems. Each modality computes a local a-posteriori estimate $\hat{x}_i[k|k]$, $i = 1, 2$, of the person position using a local Kalman filter (KF1 and KF2, respectively), based on the corresponding observations $y_1[k]$, $y_2[k]$. These partial estimates are then combined to provide a global state estimate $\hat{x}[k|k]$ at the fusion center as follows:

$$\hat{x}[k|k] = P[k|k] \left( P^{-1}[k|k-1]\, \hat{x}[k|k-1] + \sum_{i=1}^{2} \left( P_i^{-1}[k|k]\, \hat{x}_i[k|k] - P_i^{-1}[k|k-1]\, \hat{x}_i[k|k-1] \right) \right) \qquad (3)$$

$$P^{-1}[k|k] = P^{-1}[k|k-1] + \sum_{i=1}^{2} \left( P_i^{-1}[k|k] - P_i^{-1}[k|k-1] \right) \qquad (4)$$
The global estimate of the system state is obtained by weighting the global and local state estimates with the global error covariance matrix $P[k|k]$ and its counterparts $P_i[k|k]$ at the audio and video systems.
Fig. 4. Structure of the decentralized Kalman filter. The fusion center combines the local estimates to compute a global estimate of the system state.
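A minimal sketch of the fusion step of Eqs. (3)-(4) is shown below. It assumes that each modality supplies its own prior and posterior estimates and covariances; it is a sketch of the principle, not the authors' implementation.

```python
import numpy as np

def fuse_estimates(x_pred, P_pred, local_filters):
    """x_pred, P_pred: global prediction x[k|k-1] and P[k|k-1];
    local_filters: list of (x_post, P_post, x_prior, P_prior) per modality.
    Returns the global posterior (x[k|k], P[k|k]) of Eqs. (3)-(4)."""
    info = np.linalg.inv(P_pred)                     # information matrix of the prediction
    vec = info @ x_pred
    for x_post, P_post, x_prior, P_prior in local_filters:
        info += np.linalg.inv(P_post) - np.linalg.inv(P_prior)                    # Eq. (4)
        vec += np.linalg.inv(P_post) @ x_post - np.linalg.inv(P_prior) @ x_prior  # Eq. (3)
    P = np.linalg.inv(info)
    return P @ vec, P
```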
5 Evaluation
The Person Tracking evaluation is run on the data collected by the CHIL consortium for the CLEAR 06 evaluation. Two tasks are considered: single- and multi-person tracking, based on non-interactive seminar (collected by ITC and UKA) and highly interactive seminar (collected by IBM, RESIT and UPC) recordings, respectively. A complete description of the data and the evaluation can be found in [13].
5.1 Summary of the Experimental Set-Up
Data Description. The room set-ups of the contributing sites share two basic groups of devices: the audio and the video sensors. The audio sensor set-up is composed of 1 (or more) NIST Mark III 64-channel microphone array, 3 (or more) T-shaped 4-channel microphone clusters and various table-top and close-talk microphones. The video sensor set-up is basically composed of 4 (or more) fixed cameras. In addition to the fixed cameras, some sites are equipped with 1 (or more) PTZ camera. Evaluation Metrics. Three metrics are considered for evaluation and comparison purposes: Multiple Object Tracking Precision (MOTP) [mm]. This is the precision of the tracker when it comes to determining the exact position of a tracked person in the room. It is the total Euclidean distance error for matched ground truth-hypothesis pairs over all frames, averaged by the total number of matches made. Multiple Object Tracking Accuracy (MOTA) [%]. This is the accuracy of the tracker when it comes to keeping correct correspondences over time, estimating the number of people, recovering tracks, etc. It is the sum of all errors made by the tracker (false positives, misses, mismatches) over all frames, divided by the total number of ground truth points. Acoustic Multiple Object Tracking Accuracy (A-MOTA) [%]. This is like the original MOTA metric, except that all mismatch errors are ignored and it is
used to measure tracker performance only for the active speaker at each point in time, for better comparison with the acoustic person tracking results (where identity mismatches are not evaluated).
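For the single-person case, the scoring behind these metrics can be sketched as follows; the multi-person hypothesis-to-reference mapping step is omitted, and the 500 mm match threshold is an assumption taken from the joint CLEAR protocol rather than from this paper.

```python
import numpy as np

def single_person_scores(hypotheses, references, match_thr=500.0):
    """hypotheses, references: per-frame 2-D positions in mm, or None when absent.
    Returns (MOTP in mm or None, MOTA as a fraction)."""
    dists, misses, false_pos = [], 0, 0
    for h, r in zip(hypotheses, references):
        if r is None:
            false_pos += int(h is not None)
        elif h is None:
            misses += 1
        else:
            d = float(np.linalg.norm(np.asarray(h) - np.asarray(r)))
            if d <= match_thr:
                dists.append(d)              # matched pair contributes to MOTP
            else:
                misses += 1                  # too far: counts as a miss ...
                false_pos += 1               # ... and as a false positive
    n_refs = sum(r is not None for r in references)
    motp = float(np.mean(dists)) if dists else None
    mota = 1.0 - (misses + false_pos) / max(n_refs, 1)
    return motp, mota
```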
5.2 Audio Person Tracking Results
We have decided to use all the T-clusters available in the different seminars, and to use the MarkIII data only at those sites where the MarkIII is located on a wall without a T-cluster (IBM, RESIT and UPC). In general, only microphone pairs of the same T-cluster or MarkIII array are considered by the algorithm. In the experiments where the MarkIII is used, 6 microphone pairs are selected for GCC-PHAT computation. The pairs selected out of the 64 microphones of the MarkIII are 1-11, 11-21, 21-31, 31-41, 41-51 and 51-61. Hence, an inter-microphone separation of 20 cm for each microphone pair is considered. Table 1 shows the individual results for each data set and the average results for both tasks. Notice that the task results are not directly the mean of the individual results, since the scores are recomputed jointly. The evaluated system is the same in both tasks, and the multi-person task is only evaluated when a single speaker is active. Hence, the mean performances obtained are, as could be expected, quite similar. There is in fact a drop in performance in the multi-person task, but it is related more to the particular characteristics of each data set than to the task itself. For instance, the UPC data is particularly noisy and presents some challenging situations such as coffee breaks. Hence, we can conclude that the acoustic tracker performs reasonably well in controlled scenarios with one or a few alternating and non-overlapping speakers, while it shows a considerable decrease in difficult noisy scenarios with many moving and overlapping speakers.

Table 1. Audio results for both single and multi-person tracking

Task            MOTP     Misses    False Positives    A-MOTA
ITC data        108mm    8.56%     1.46%              89.98%
UKA data        148mm    15.09%    10.19%             74.72%
Single Person   145mm    14.53%    9.43%              76.04%
IBM data        180mm    17.85%    10.54%             71.61%
RESIT data      150mm    12.96%    6.23%              80.80%
UPC data        139mm    32.34%    28.76%             38.89%
Multi Person    157mm    20.95%    15.05%             64.00%

5.3 Video Person Tracking Results
Seminar sequences from UPC and RESIT have been evaluated and results are reported in Table 2. Since our algorithm required empty room information, we
were constrained to evaluating only UPC and RESIT. By analyzing the results in detail, we reached the following conclusions. The False Positive (FP) measures are high because our algorithm detected many spurious foreground objects after the 3D reconstruction, caused by shadows and other lighting artifacts. Moreover, MOTA is related to the FP score and thus drops as FP increases. Further research to avoid such problems includes an improvement of the Kalman filtering and association rules. Since our tracking strategy relies on the 3D reconstruction, rooms with a reduced common volume seen by a number of cameras (typically fewer than N − 1 cameras) produce less accurate results. Other reconstruction schemes, better accommodated to different camera placement scenarios, are under research to generate reliable volumes even if a reduced number of cameras is viewing a given part of the room.

Table 2. Video results for the multi-person tracking

Task            MOTP     Misses    False Pos.    Mism.    MOTA
RESIT data      205mm    26.67%    74.62%        2.18%    -3.47%
UPC data        188mm    16.92%    23.56%        5.85%    53.67%
Multi Person    195mm    21.24%    46.16%        4.22%    28.35%

5.4 Multimodal Person Tracking Results
Only seminar sequences from RESIT and UPC have been evaluated, due to the constraints of the Video tracking system mentioned above. For the Multimodal Person Tracking task, two different scorings under two different conditions are defined. For condition A, the scoring shows the ability to track the active speaker during the time segments in which he is speaking, while under condition B the scoring measures the ability to track all the persons in the room during the whole seminar. The results are reported in Table 3 for each condition. It can be seen that the results are very similar to those of the Video Person Tracking task. This observation suggests that the multimodal algorithm is mainly influenced by the performance of the video tracking system.

Table 3. Multimodal results for Conditions A and B

Task               MOTP     Misses    False Pos.    Mism.    MOTA      A-MOTA
Cond. A (RESIT)    143mm    52.66%    7.14%         3.92%    −         40.20%
Cond. A (UPC)      101mm    29.48%    25.28%        6.35%    −         45.24%
Cond. A            118mm    41.18%    16.13%        5.12%    −         42.70%
Cond. B (RESIT)    201mm    26.43%    74.47%        2.20%    -3.10%    −
Cond. B (UPC)      190mm    17.95%    24.61%        5.98%    51.46%    −
Cond. B            195mm    21.71%    46.71%        4.31%    27.28%    −
6 Conclusions
In this paper we have presented the audio, video and audio-video Person Tracking systems developed by UPC for the CLEAR evaluation campaign. The novelties proposed in the three systems have been specially designed to add robustness to scenario and environment variabilities. Results show that the audio tracker performs reasonably well in situations with few non-overlapping speakers, while it shows a considerable loss of performance in some challenging and noisy situations that must be addressed. Improvements of the Kalman filtering and association rules are also expected to enhance the video system. Finally, the multimodal audio-video system shows a high dependence on the video results, caused by the fusion procedure. Thus, future efforts will be devoted to developing new fusion strategies at a higher level.
References
1. CHIL - Computers In the Human Interaction Loop. Integrated Project of the 6th European Framework Programme (506909). http://chil.server.de/, 2004-2007
2. DiBiase, J., Silverman, H., Brandstein, M.: Robust Localization in Reverberant Rooms. In: Microphone Arrays, Chapter 8, Springer, January 2001
3. Brutti, A., Omologo, M., Svaizer, P.: Oriented global coherence field for the estimation of the head orientation in smart rooms equipped with distributed microphone arrays. Proceedings of Interspeech 2005, Lisboa, September 2005
4. Knapp, C.H., Carter, G.C.: The Generalized Correlation Method for Estimation of Time Delay. IEEE Trans. on Acoustics, Speech, and Signal Processing, August 1976
5. Zhang, Z.: A flexible new technique for camera calibration. Technical report, Microsoft Research, August 2002
6. Landabaso, J.L., Xu, L-Q., Pardàs, M.: Robust Tracking and Object Classification Towards Automated Video Surveillance. Proceedings of ICIAR 2, 2004, 463-470
7. Xu, L-Q., Landabaso, J.L., Pardàs, M.: Shadow removal with blob-based morphological reconstruction for error correction. Proceedings of ICASSP 2005, Vol. 2, March 18-23, 2005, 729-732
8. Landabaso, J.L., Pardàs, M., Xu, L-Q.: Hierarchical representation of scenes using activity information. Proceedings of ICASSP 2005, Vol. 2, March 18-23, 2005, 677-680
9. Stauffer, C., Grimson, W.E.L.: Learning patterns of activity using real-time tracking. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(8), August 2000
10. Bar-Shalom, Y., Fortmann, T.E.: Tracking and Data Association. Academic Press, 1988
11. Sturim, D.E., Brandstein, M.S., Silverman, H.F.: Tracking Multiple Talkers Using Microphone-Array Measurements. Proceedings of ICASSP 1997, Munich, April 1997
12. Hashemipour, H.R., Roy, S., Laub, J.: Decentralized structures for parallel Kalman filtering. IEEE Transactions on Automatic Control, 33(1):88-93, 1988
13. The Spring 2006 CLEAR Evaluation and Workshop. http://www.clearevaluation.org/
A Joint System for Single-Person 2D-Face and 3D-Head Tracking in CHIL Seminars
Gerasimos Potamianos (1) and Zhenqiu Zhang (2)
(1) IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, U.S.A.
(2) Beckman Institute, University of Illinois, Urbana, IL 61801, U.S.A.
[email protected],
[email protected]
Abstract. We present the IBM systems submitted and evaluated within the CLEAR’06 evaluation campaign for the tasks of single person visual 3D tracking (localization) and 2D face tracking on CHIL seminar data. The two systems are significantly inter-connected to justify their presentation within a single paper as a joint vision system for single person 2D-face and 3D-head tracking, suitable for smart room environments with multiple synchronized, calibrated, stationary cameras. Indeed, in the developed system, face detection plays a pivotal role in 3D person tracking, being employed both in system initialization as well as in detecting possible tracking drift. Similarly, 3D person tracking determines the 2D frame regions where a face detector is subsequently applied. The joint system consists of a number of components that employ detection and tracking algorithms, some of which operate on input from all four corner cameras of the CHIL smart rooms, while others select and utilize two out of the four available cameras. Main system highlights constitute the use of AdaBoost-like multi-pose face detectors, a spatio-temporal dynamic programming algorithm to initialize 3D location hypotheses, and an adaptive subspace learning based tracking scheme with a forgetting mechanism as a means to reduce tracking drift. The system is benchmarked on the CLEAR’06 CHIL seminar database, consisting of 26 lecture segments recorded inside the smart rooms of the UKA and ITC CHIL partners. Its resulting 3D single-person tracking performance is 86% accuracy with a precision of 88 mm, whereas the achieved face tracking score is 54% correct with 37% wrong detections and 19% misses. In terms of speed, an inefficient system implementation runs at about 2 fps on a P4 2.8 GHz desktop.
1 Introduction
Visual detection and tracking of humans in complex scenes is a very interesting and challenging problem. Often, input from multiple calibrated cameras with
The main system development was performed during Zhenqiu Zhang’s two internships at the IBM T.J. Watson Research Center. This work was supported by the European Commission under the integrated project CHIL, “Computers in the Human Interaction Loop”, contract number 506909.
overlapping fields of view is available synchronously, and information about both the frame-view and space-level human location is desired. One such scenario of interest, considered in this paper, is human-computer interaction in smart rooms, where a speaker is presenting a seminar in front of an audience. The scenario is of central interest within the CHIL European Union integrated project, “Computers in the Human Interaction Loop” [1]. In data collected as part of the CHIL project, a minimum setup of four fixed calibrated cameras located at the corners of a smart room provides video data, with the goal of locating and identifying the seminar presenter. Hence, both three-dimensional head position estimation, as well as face detection at the available frame views is required. The information can be further utilized to obtain close-up views of the presenter, based on steerable pan-tilt-zoom cameras, in the seminar indexing and annotation, etc. Clearly therefore, in such a scenario, a visual system that combines face detection, tracking, and multi-camera processing is feasible and desirable. This paper presents such a system, developed for joint 2D-face and 3D-head tracking of a singe person (the lecturer) in CHIL seminars. The system is benchmarked on 26 seminar segments recorded at two CHIL sites, as part of the CLEAR’06 evaluation campaign. Much work has been devoted in the literature to the core problems of human detection and tracking that constitute the focus of this paper. For face detection, machine learning based approaches are widely considered as the most effective, for example based on neural networks [2], support vector machines [3], network of linear units [4], or the AdaBoost approach [5]. These methods can be readily extended to handle detecting faces under varying head pose, as for example in [6], where a pose-based technique within the appearance-based framework is proposed, or the multi-pose face detection work of Li et al. [7], where “FloatBoost”, an AdaBoost variant, is employed. For tracking faces, various target representations have been used in the literature, such as parameterized shapes [8], color distributions [9], image templates [10] and the eigen-space approach [11], to name a few. Tracking with fixed representations however is not reliable over long durations, and a successful tracker needs to allow appropriate model adaptation. Not surprisingly, a number of tracking methods have been developed to allow such adaptation online, for example the EM-algorithm based technique of [12], the feature selection mechanism of [13], and the parametric statistical appearance modeling technique in [14]. An interesting non-parametric approach appears in Lim et al. [15], where the appearance subspace is learned online by an efficient sequential algorithm for principal component analysis (PCA), updated employing the incoming data vectors. In general however, real human-computer interaction scenarios present significant challenges to most face detection and tracking algorithms, for example partially occluded and low-resolution faces, as well as lighting and head-pose variations. These difficulties can often be successfully addressed, only if additional information is available in the form of multi-camera input, to reduce spatial uncertainty in the scene of interest. The evaluated system described in this paper takes such information into account by providing a joint framework for 2D face
A Joint System for Single-Person 2D-Face and 3D-Head Tracking z
FLOOR (0 , 0 , 0)
y
( x , y , z) [cm]
z
FLOOR (0 , 0 , 0)
y
107
( x , y , z) [cm]
PROJ. SCREEN
Cam2 x
Cam1
Cam4
Approximate Speaker Area
Cam4 Microphone Array
x
Audi
PROJ. SCREEN
Approximate Speaker Area
TABLE
Cam2
Aud
ence
TABLE
TABLE
Cam3
ienc
Cam1
e A rea
Cam3
(475 , 592 , 300) CEILING
Microphone Array (590 , 710 , 300) CEILING
(a)
(b)
Fig. 1. Schematic diagrams of the CHIL smart rooms at the (a) Universität Karlsruhe (UKA), Germany and (b) Istituto Trentino di Cultura (ITC), Italy. The CLEAR'06 development and evaluation "seminar" data, used here for single-person tracking, have been collected at these two sites.
and 3D head tracking. The system consists of a number of inter-connected components that employ detection and tracking algorithms, some of which operate on input from all four corner cameras of the CHIL smart rooms, while others select and utilize two out of the four available cameras. In the developed system, face detection plays a pivotal role in 3D head tracking, being employed both in system initialization as well as in detecting possible tracking drift. Similarly, 3D tracking determines the 2D frame regions where a face detector is subsequently applied. The conduit of this inter-connection is provided by the camera calibration information, where a non-linear (fourth order) model is used to correct for lens distortion. The three main highlights of the evaluated system constitute the use of AdaBoost-like multi-pose face detectors [16], a spatio-temporal dynamic programming algorithm to initialize 3D head location hypotheses [17], and an adaptive subspace learning based tracking scheme with a forgetting mechanism as a means to reduce tracking drift [18]. The rest of the paper is organized as follows: Section 2 presents an overview of the whole system, with its components described in detail in Section 3. Experiments are presented in Section 4, with a brief summary in Section 5 concluding the paper.
2 System Overview
The developed system for the CLEAR evaluations provides both 3D head location and 2D face tracking in the CHIL seminar scenario within a joint framework. In this particular domain, multiple synchronized calibrated cameras are set up in smart rooms, among them four corner room cameras with widely overlapping fields of view. Schematics of two such rooms are depicted in Fig. 1. In our work,
Fig. 2. Block diagram of the developed multi-camera 3D head tracking system. (a) Overview; (b) Initialization.
the inputs of the four corner cameras are used to obtain over time the 3D head position and the 2D face locations in the individual frames of a person presenting a seminar in front of an audience inside the room.
2.1 The 3D Head Localization Subsystem
The overview diagram of the developed 3D head tracking system is given in Fig. 2(a). It basically consists of an initialization and a tracking component, with tracking drift detection controlling the switch between these two modes. For its initialization, multi-pose face detectors are first applied to four camera views in the smart room. Details are provided in Section 3.1. Subsequently, spatiotemporal information of the face detection results over 10 consecutive quadframes is integrated within a dynamic programming (DP) framework, to provide robust initialization. Details are described in Section 3.2 (see also Fig. 2(b)). If the optimal DP trajectory is accepted as a true object, a 2D tracking component kicks in, operating independently in two camera views, which are selected among the four views based on the DP result. Details of the tracking algorithm, which is based on online adaptive subspace learning, are presented in Section 3.3. Notice that as long as the DP trajectory is not acceptable, the initialization process is repeated with a shift of five frames, and no 3D position is returned. Finally, an important aspect of the system is the re-initialization decision, or equivalently, the drift detection. This is described in Section 3.4, and it is based on a combination of local face detection and calibration-based triangulation to test the consistency of independent tracking in the two (selected based on the DP results) camera views.
Fig. 3. Initial face detection result on four synchronized camera views (UKA seminar data), before any spatio-temporal information is considered
2.2 The 2D Face Localization Subsystem
In the developed system, 2D face localization is performed based on the 3D head tracking result. Such result provides the approximate region within the 2D frame views, where a visible face could be present, in the following manner: As mentioned above (and further explained in Sections 3.2 and 3.3), the 3D head tracking system uses 2D subspace tracking on two only camera views, selected based on the algorithm initialization stage. For these two camera views, the expected face location is therefore immediately available. For the remaining two camera views, the system considers the projection of the 3D head position estimate (by employing camera calibration information) to obtain an estimate of the head’s 2D location in the image frames. Following this step, multi-pose face detection (see Section 3.1) is applied around the estimated head center in each camera view. If the face detector locates a face, this is accepted. If there is no face detection result, then one of the following two cases occur: (a) If the camera view in question is one of the two views that have been used in tracking at that instant, the raw 2D tracking result (i.e., the tracked face box) is returned as the face detection output. (b) If however the camera is not a 2D tracking view, no face output is produced. The above face detection strategy has been selected after conducting a number of experiments on the CLEAR’06 development set, as described in Section 4.2.
3 System Components
In this section, the four main components of the 3D-head and 2D-face tracking sub-systems are described in more detail.
3.1 Multi-pose 2D Face Detection
Face detection is the most critical component of the evaluated system, being utilized at the initialization (Section 3.2) and drift detection stages (Section 3.4) of the 3D head tracking system, and in addition being the required step to produce the evaluated 2D face results based on the 3D head location estimate (see Section 2.2). The system adopts a multi-pose face detector approach, trained using the FloatBoost technique [7]. In particular, two face detectors are employed: one for the "frontal pose", which includes frontal and up to half-profile faces, and a second for the "left-side pose". A "right-side" pose face detector is subsequently obtained by mirroring the latter. Both are trained as a cascaded, multi-layer structure of weak classifiers. For system development, 2485 and 4994 faces for the "frontal" and "left profile" poses are used, respectively, pooled from all camera views of the available development data and their provided labels. The resulting detectors consist of a cascade of 19 layers and 2873 features (see also [16]). Not surprisingly, face detection by itself produces rather poor results in the challenging CHIL domain considered. This is illustrated in Fig. 3: the resolution of the presenter's face in each camera view is small, around 30×30 pixels or less, with significant pose and illumination change in the video sequence. Robust multi-pose face detection in this scenario is clearly hard, with high rates of missed face detections and false alarms observed. To remedy this problem, a novel algorithm proposed in [17] is adopted that integrates the spatial and temporal information available within the multi-camera video sequence setting. This replaces a previously employed motion-based framework [16]. The approach is described next.
3.2 Spatio-temporal Dynamic Programming for 3D Initialization
In summary (see also Fig. 2(b)), the trained multi-pose face detectors are first applied on all four camera views. Based on the spatial consistency of the detection result from different camera views, 3D hypotheses of the presenter’s head location are generated using the calibration information. Then, dynamic programming (DP) on the results over ten consecutive frames is used to search the optimal trajectory of the presenter’s head centroid in the 3D space, based on a local similarity measure and a transition cost. If the optimal trajectory is accepted compared with a threshold, the result is fed into the tracking component described in Section 3.3; otherwise the process is iterated with a five frame shift until an acceptable trajectory is determined. Details of this DP framework implementation are given next: Generating 3D Hypotheses. Assuming ni face detections per camera view, there could be
$$\frac{1}{2} \times \sum_{i,j:\, i \neq j} n_i \times n_j \qquad (1)$$
candidate 3D head locations, obtained via triangulation. Based on the resulting inter-ray distances of the 2D-to-3D maps, one can easily reject hypotheses with large inter-ray distances. In addition, collection-site-specific spatial constraints, learned from development data, are imposed to distinguish the seminar presenter from audience members (see also Fig. 1). These constraints result in about half of the room floor surface being allowable for the presenter's (x,y) location, whereas a 40 cm height range is imposed on the z-axis location coordinate. Generating an Optimal Dynamic Programming Trajectory. The DP framework contains two main parts: the 3D-path cost components, both intra- and inter-time-instant in the form of a local similarity measure and a transition cost, respectively, as well as the hypothesis search stage. Local Similarity Measure. This is used to evaluate a hypothesis at the current instant on the basis of the available four camera views. The color histograms of rectangles (approximately double the face height) in the different views are used for this task, with the Bhattacharyya coefficient employed over 30-bin histograms of the H component of the HSV color space. The assumption is that if the candidate hypothesis is a true target, then the corresponding rectangles in the different camera views should cover the same person, and the color histogram similarity should be high. Transition Cost. This penalizes non-smooth trajectories. In the adopted framework, the transition is defined as the 3D spatial distance between two hypotheses, with its cost specified using a Gaussian diffusion with a pre-set diagonal covariance matrix [17]. A new-trajectory generation cost is also defined, set to a constant. Hypothesis Search. The searching scheme employs the standard dynamic programming framework, as described in [17]. A few things to note: a total of six hypotheses are kept "alive" at each time instant, as a pruning mechanism; a maximum acceptable score (constant) is set, thus providing a mechanism to reject the final hypothesis (and hence trigger a new search, with a five quad-frame shift); and finally, the returned optimal trajectory defines the two camera views on which tracking is to commence, based on the views that generated the last-instant optimal trajectory hypothesis.
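As an illustration of the local similarity measure, the Bhattacharyya coefficient between two 30-bin hue histograms can be computed as below. The rectangle extraction, the hue normalization to [0, 1] and the averaging of pairwise coefficients are assumptions of this sketch; the DP bookkeeping is not reproduced.

```python
import numpy as np

def hue_histogram(patch_hsv, bins=30):
    # 30-bin histogram of the H channel of an HSV patch (hue assumed in [0, 1]).
    h = patch_hsv[..., 0].ravel()
    hist, _ = np.histogram(h, bins=bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

def bhattacharyya(p, q):
    # Similarity of two normalized histograms; 1.0 means identical.
    return float(np.sum(np.sqrt(np.asarray(p) * np.asarray(q))))

def hypothesis_similarity(patches_hsv):
    """Combine the rectangles that a 3D hypothesis projects to in the camera
    views by averaging their pairwise Bhattacharyya coefficients."""
    hists = [hue_histogram(p) for p in patches_hsv]
    scores = [bhattacharyya(hists[i], hists[j])
              for i in range(len(hists)) for j in range(i + 1, len(hists))]
    return float(np.mean(scores)) if scores else 0.0
```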
3.3 Adaptive Subspace 2D Tracking with Forgetting Mechanism
In [15], when a new observation is obtained, the PCA subspace is updated to take into consideration the variance contributed by the new observation. However, the method does not provide an updating algorithm for eliminating past observations during the tracking. This poses a problem when tracking video over long periods of time, as the noise introduced during tracking would eventually bias the PCA subspace away from the characteristic appearance of the desired tracked object. In [19], an L∞ norm subspace is fitted to the past frames incrementally by Gram-Schmidt orthogonalization. Though the subspace with the L∞
norm has the advantage of incorporating observation novelties into the subspace representation in a timely manner, as shown by many successful experiments [19], it runs the risk of tracking drift, as consistent noise and outliers may easily bias the subspace away from the object appearance space. Considering that PCA offers the user the freedom to perform dimensionality reduction, and thus to ignore tracking noise and assist outlier rejection based on the reconstruction error [11], the evaluated system adopts the incremental PCA subspace learning approach, with Hall's mechanism [20] used to incrementally update the PCA subspace given new observations. Furthermore, the method allows subspace adjustment by eliminating distant past observations from the subspace. This introduces a forgetting mechanism that is absent in Lim's approach [15]. The algorithm is presented in [18]. In this particular implementation, the evaluated system employs the most recent 50 frame observations to construct the PCA subspace. Hence, following tracking initialization, the forgetting mechanism does not commence until after 50 frames are observed. For this initial duration, the algorithm remains identical to [15]. The learned subspace has a dimensionality of up to 15, down from a normalized 20×20-pixel data "template" (the un-normalized template size depends on the detected face at the end of the initialization step). Notice – as already mentioned above – that this stage is performed in 2D, independently on two camera views, selected by the initialization stage of Section 3.2. Triangulation of the template centroids by means of the camera calibration information provides the 3D head estimate during this tracking stage, in conjunction with the tracking drift detection component described next.
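The effect of the 50-frame forgetting window can be illustrated with a simple sliding-window PCA. Note that this sketch recomputes the subspace from scratch at every update rather than applying the incremental update of Hall's mechanism, so it only mimics the behaviour of the evaluated tracker, not the actual algorithm of [18].

```python
from collections import deque
import numpy as np

class SlidingSubspace:
    """Appearance subspace over the most recent `window` observations.
    Illustrative only: the evaluated system updates the PCA incrementally."""
    def __init__(self, window=50, max_dim=15):
        self.buffer = deque(maxlen=window)   # forgetting: old patches drop out
        self.max_dim = max_dim
        self.mean = None
        self.basis = None                    # columns span the subspace

    def update(self, patch):
        # Vectorize the normalized 20x20 template into a 400-dimensional sample.
        self.buffer.append(np.asarray(patch, dtype=float).ravel())
        data = np.stack(list(self.buffer))
        self.mean = data.mean(axis=0)
        # PCA via SVD of the centered data; keep at most `max_dim` components.
        _, _, vt = np.linalg.svd(data - self.mean, full_matrices=False)
        self.basis = vt[: self.max_dim].T

    def reconstruction_error(self, patch):
        x = np.asarray(patch, dtype=float).ravel() - self.mean
        proj = self.basis @ (self.basis.T @ x)
        return float(np.linalg.norm(x - proj))   # used to score candidate locations
```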
3.4 Tracking Drift Detection in 3D
An important aspect of the system is the re-initialization decision, or equivalently, tracking drift detection on the basis of the two independent 2D tracking results in the two selected camera views. This is based on a combination of local face detection and calibration-based triangulation to test the consistency of the two tracks at the given time. In more detail, if the inter-ray distance of the two 2D-to-3D mapping rays is larger than a predetermined threshold, this indicates that the two tracked results are inconsistent, hence immediately prompting re-initialization. Furthermore, at each frame, the multi-pose face detectors of Section 3.1 are also applied around the two tracking results to determine whether there indeed exists a face object in the local regions of interest (for example, in the evaluated system, this is set to an 80×80 pixel region for UKA data). If faces cannot be detected in the local region for several frames (30 in our case) in either of the two camera views, a re-initialization decision is prompted.
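The triangulation consistency test can be illustrated by the closest distance between the two back-projected viewing rays. The threshold value below is a placeholder, and the conversion from a 2D track to a ray origin and direction is assumed to be provided by the calibration code.

```python
import numpy as np

def inter_ray_distance(o1, d1, o2, d2):
    """Closest distance between two 3-D rays given origins o and unit directions d."""
    o1, d1, o2, d2 = map(np.asarray, (o1, d1, o2, d2))
    n = np.cross(d1, d2)
    if np.linalg.norm(n) < 1e-9:                     # (nearly) parallel rays
        return float(np.linalg.norm(np.cross(o2 - o1, d1)))
    return float(abs(np.dot(o2 - o1, n)) / np.linalg.norm(n))

def drift_detected(o1, d1, o2, d2, threshold_mm=300.0):
    # Re-initialization is prompted when the two 2D tracks are inconsistent in 3D.
    return inter_ray_distance(o1, d1, o2, d2) > threshold_mm
```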
4 System Performance on the CHIL Seminar Data
System development and evaluation was performed on the CHIL seminar database, consisting of 19 development and 26 evaluation segments, collected inside the smart rooms of the UKA and ITC sites. Their majority has been
recorded at UKA (18 development and 24 evaluation segments), hence performance on UKA data dominates the reported results. Additional, so-called "interactive seminars" collected at three other CHIL sites (AIT, UPC, and IBM) have not been used, as the focus of these datasets has been on multi-person interaction and tracking, and the evaluated system has not yet been extended to handle multiple-person tasks. System training and fine-tuning was performed on the development set, in particular face detection training (see Section 3.1), setting up system parameters (e.g., spatial constraints and DP costs – see Section 3.2; inter-ray distance thresholds – Sections 3.2 and 3.4; tracking template sizes – Section 3.3, among others), as well as determining optimal tracking strategies, especially for the 2D face tracking evaluation task (see Section 4.2). In the following, we briefly describe the relevant metrics, experiments, and evaluation results.
4.1 Single-Person Visual Tracking Results – Task "3DSPT V"
Two metrics have been identified as relevant to all tracking evaluations on CHIL data, spanning both single- and multi-person, as well as single- and multi-modal tracking conditions: These are multiple object tracking accuracy (MOTA), and multiple object tracking precision (MOTP) [21]. MOTA is measured as the percentage (%) of correct correspondences (mappings) of estimated and ground truth persons over the evaluation set of time instants. Of course, in the case of single-person tracking, the mapping problem becomes trivial, since there is at most one hypothesized and one reference person, always assumed to correspond to the single seminar lecturer. In such a case, the hypothesis is considered correct based on the 2D Euclidean distance between the estimated location and the ground truth, as compared to a threshold set to 500 mm. Notice that only 2D distance is considered, although the evaluated head tracking system provides 3D location information. It is worth mentioning that the metric penalizes single-person trackers that output a default hypothesis, when for example failing to detect a person: Such a strategy would in most cases result to two errors for each default estimate: a false positive and a miss. This fact has been taken into consideration in the developed system: It returns no 3D hypothesis when initialization fails (see Section 3.2), as opposed to its earlier versions [16, 18] that always produced a 3D hypothesis either at the center of the presenter’s area “cube”, or the most recent non-default 3D estimate. The second adopted evaluation metric, MOTP, is measured in mm, and is simply the average 2D Euclidean distance computed over the correct reference-hypothesis mappings. Its value ranges between 0 and 500 mm. Table 1 presents the summary of the developed 3D head tracking system performance. A number of 2D face tracking metrics are also depicted (see Section 4.2). Results are reported on both development and evaluation sets, listed per collection site, and cumulatively. Performance is computed over the entire segments, but at time instants spaced every 1 sec (in order to reduce the associated labeling effort). Further details on the evaluation set performance per seminar segment can be found in Table 2.
Table 1. Performance of 3D head tracking (“3DSPT V”) and 2D face tracking (“2DFT S” – only part of the metrics are shown) on the CLEAR’06 development (DEV) and evaluation (EVA) sets, depicted per collection site and cumulatively. Number of seminar segments are also listed. All metrics are expressed in %, with the exception of MOTP that is expressed in mm.
Set    Site    #Sem    3DSPT V task        2DFT S task
                       MOTA     MOTP       Corr     Err      Miss
DEV    ITC     1       21.78    148        −        −        −
DEV    UKA     18      79.47    93         74.17    21.04    15.18
DEV    all     19      71.11    99         −        −        −
EVA    ITC     2       98.33    92         84.75    28.70    3.14
EVA    UKA     24      84.94    88         52.64    37.68    19.89
EVA    all     26      85.96    88         54.44    37.18    18.95
As depicted in Table 1, the developed system achieved a tracking accuracy of 85.96% on the CLEAR'06 evaluation set, with a tracking precision of 88 mm. Notice that the performance on the development set was significantly worse, at 71.11% MOTA and 99 mm MOTP, due to poor tracking on three development segments (UKA 20041124 B.Seg1, UKA 20050214.Seg1, and ITC 20050429.Seg1). By excluding them, performance on the development set becomes 94.44% MOTA and 90 mm MOTP. Similarly, performance on the evaluation set is unsatisfactory in a few segments (UKA 20050504 A.Seg[1-2] and UKA 20050615 A.Seg2). Excluding them boosts evaluation set performance to 93.00% MOTA and 86 mm MOTP. The above results represent a major improvement over earlier tracking work reported on the CHIL corpus [16]. In past CHIL single-person visual tracking evaluations, a metric similar to MOTP was reported, but assuming that all hypotheses were correct. Using such a metric, the newly evaluated system results in an average 2D error of 256 mm and 139 mm on the CLEAR'06 development and evaluation sets, respectively – as compared to the two earlier CHIL evaluation runs, where 228 and 441 mm of average error were reported (June 2004 and Jan 2005 evaluations, respectively). As a final remark, and as already mentioned, the evaluated system does not always return a 3D head location hypothesis. The exact approach was fine-tuned on the development set, where it boosted the MOTA metric significantly on the 18 UKA segments from an original 69.27% (when always outputting a hypothesis) to the 79.47% reported in Table 1.
4.2 Single-Person Face Tracking Results – Task "2DFT S"
A total of five metrics have been defined for the face tracking CLEAR'06 evaluation tasks (both single- and multi-person) on the CHIL seminar data: (i) Percentage of correctly detected faces ("Corr"), namely the percentage of detected faces with a hypothesis–reference face bounding-box centroid distance of no more than half the size of the reference face; (ii) Percentage of wrong face detections ("Err"), accounting for false positives (this includes detected faces with
Table 2. Detailed performance results on the CLEAR'06 evaluation set

Seminar Segment          3DSPT V task      2DFT S task
or Cumulative            MOTA     MOTP     Corr     Err      Miss     MWE     MEA
ITC 20050503.Seg1        97.32    98       81.58    31.58    5.26     0.19    118.00
ITC 20050607.Seg1        99.33    86       88.07    25.69    0.92     0.19    52.26
ITC all (2)              98.33    92       84.75    28.70    3.14     0.19    84.61
UKA 20050420 A.Seg1      95.35    88       59.46    35.14    16.89    0.23    63.29
UKA 20050420 A.Seg2      83.72    88       52.13    18.09    31.91    0.21    112.59
UKA 20050427 B.Seg1      96.68    78       5.00     87.86    16.43    0.30    308.86
UKA 20050427 B.Seg2      75.42    100      13.28    97.66    25.78    0.22    237.33
UKA 20050504 A.Seg1      31.23    97       18.44    76.60    24.11    0.23    144.04
UKA 20050504 A.Seg2      26.00    153      13.99    81.12    27.97    0.22    236.08
UKA 20050504 B.Seg1      84.00    112      8.76     89.05    10.22    0.24    264.12
UKA 20050504 B.Seg2      79.33    75       9.86     88.03    11.27    0.23    215.34
UKA 20050511.Seg1        90.67    89       66.21    39.31    9.66     0.21    64.23
UKA 20050511.Seg2        97.33    73       80.45    13.41    12.85    0.17    103.89
UKA 20050525 A.Seg1      97.33    62       76.43    21.43    9.29     0.24    57.27
UKA 20050525 A.Seg2      98.67    73       66.04    22.64    16.35    0.22    77.80
UKA 20050525 B.Seg1      91.36    76       40.54    45.95    17.57    0.28    85.77
UKA 20050525 B.Seg2      99.34    108      48.50    22.00    30.50    0.21    240.37
UKA 20050525 C.Seg1      98.01    59       68.16    16.20    22.35    0.22    99.79
UKA 20050525 C.Seg2      100.00   99       68.37    2.79     29.30    0.13    60.76
UKA 20050601.Seg1        89.37    80       71.43    22.86    12.00    0.17    72.04
UKA 20050601.Seg2        98.01    100      70.81    15.79    22.97    0.15    85.90
UKA 20050615 A.Seg1      77.70    115      57.35    32.35    24.26    0.22    71.59
UKA 20050615 A.Seg2      38.80    84       32.99    49.48    49.48    0.23    102.37
UKA 20050622 B.Seg1      96.01    73       66.29    16.00    18.86    0.18    123.55
UKA 20050622 B.Seg2      99.34    101      57.02    52.07    24.79    0.21    105.33
UKA 20050622 C.Seg1      98.67    63       81.10    17.07    7.32     0.23    75.89
UKA 20050622 C.Seg2      96.01    85       80.14    21.28    7.09     0.25    56.80
UKA all (24)             84.94    88       52.64    37.68    19.89    0.20    96.82
all 26 segments          85.96    88       54.44    37.18    18.95    0.20    95.76
hypothesis–reference bounding-box centroid distance larger than half the reference face size); (iii) Percentage of missed face detections (“Miss”) of the reference face; and finally, two metrics that further specify how accurate the detection is when a face is correctly detected, namely: (iv) Mean weighted error (MWE); and (v) Mean extension accuracy (MEA). A summary and a more detailed version of the system performance on the CLEAR’06 evaluation campaign are given in Tables 1 and 2, respectively. The system achieved 54.5% correct detections, with 37.2% erroneous detections and 18.9% misses. This performance is rather poor, and it is due to the extremely challenging nature of the task, the rather strict evaluation metrics, as well as lack of time for further system development. In particular, by comparing the UKA development and evaluation set performance in Table 1, one can notice that the performance drops significantly, due to the mismatch in seminar presenters (a
116
G. Potamianos and Z. Zhang
purely “speaker independent” evaluation framework is considered). Furthermore, errors and misses are relatively balanced on the development set, but not so on the evaluation data. Nevertheless, the achieved performance represents a small improvement over the CHIL 2005 evaluation run performance of the IBM system, that exhibited a 51% correct detection rate under a much more generous “multispeaker” training/testing scenario. A final remark concerns the adopted strategy described in Section 2.2 for face detection. A number of approaches have been considered for producing 2D face results from the 3D head location estimate in an effort to reduce and balance the false positive (“Err”) and negative (“Miss”) error rates. Among them, an interesting modification of the proposed method considered is to always return the 2D tracking result on the two selected camera views where the subspace tracking takes place (Section 3.3), and only apply multi-pose face detection to the two non-tracked camera views around a region of interest based on the 3D head estimate. This is in contrast to first applying the multi-pose face detector on all four views, and only resorting to the tracking result of the selected camera views when the detector fails to return a face. The performance of the former approach was measured on seven UKA development set seminars at 77.26% Corr, 18.67% Err, and 9.37% Miss, compared to 85.92% Corr, 9.95% Err, and 9.43% Miss of the adopted approach. 4.3
System Run-Time Performance
There has been no particular effort to optimize system implementation for this evaluation. To reduce the face detection overhead and allow speedier development, the whole system has been implemented in a cascade, where face detection is first applied on all frames and all camera views (as in Section 3.1), before feeding its output to the remaining system modules (described in Sections 3.2-3.4). In practice, this is of course rather wasteful, as the two 2D tracking processes (Section 3.3) can perform most of the required work in real time (20 frames per second on a P4 2.8 GHz, 512 desktop). In contrast, face detection over the entire frame in four camera views is significantly slower and runs only at about 2 frames per second.
5
Summary
In this paper, we presented our developed vision system for joint 3D head and 2D face tracking of the lecturer in CHIL smart rooms. We described details of the system components, and presented experimental results on the corresponding CLEAR’06 evaluation campaign data. The system achieved an 86% accuracy with a precision of 88 mm for the 3D single-person visual tracking task, and a 54% correct rate for the face tracking task. We plan to continue work towards improving system performance, and most importantly to expand its applicability to multi-person localization. Its more efficient implementation in order to achieve a faster run-time performance is also among our goals.
A Joint System for Single-Person 2D-Face and 3D-Head Tracking
117
References 1. CHIL: “Computers in the Human Interaction Loop,” http://chil.server.de 2. H.A. Rowley, S. Baluja, and T. Kanade, “Neural network-based face detection,” IEEE Trans. Pattern Anal. Machine Intell., 20(1):23–28, 1998. 3. E. Osuna, R. Freund, and F. Girosi, “Training support vector machines: An application to face detection,” in Proc. Conf. Computer Vision Pattern Recog., pp. 130–136, 1997. 4. D. Roth, M.-H. Yang, and N. Ahuja, “A SNoW-based face detector,” in Proc. NIPS, 2000. 5. P. Viola and M. Jones, “Robust real time object detection,” in Proc. IEEE ICCV Work. Statistical and Computational Theories of Vision, 2001. 6. A.P. Pentland, B. Moghaddam, and T. Starner, “View-based and modular eigenspaces for face recognition,” in Proc. Conf. Computer Vision Pattern Recog., pp. 84–91, 1994. 7. S.Z. Li and Z. Zhang, “FloatBoost learning and statistical face detection,” IEEE Trans. Pattern Anal. Machine Intell., 26(9): 1112–1123, 2004. 8. M. Isard and A. Blake, “Contour tracking by stochastic propagation of conditional density,” in Proc. Europ. Conf. Computer Vision, pp. 343–356, 1996. 9. D. Comaniciu, V. Ramesh, and P. Meer, “Real-time tracking of non-rigid objects using mean shift,” in Proc. Int. Conf. Computer Vision Pattern Recog., vol. 2, pp. 142–149, 2000. 10. H. Tao, H.S. Sawhney, and R. Kumar, “Dynamic layer representation with applications to tracking,” in Proc. Int. Conf. Computer Vision Pattern Recog., vol. 2, pp. 134–141, 2000. 11. M.J. Black and A. Jepson, “Eigentracking: Robust matching and tracking of articulated objects using a view-based representation,” Int. J. Computer Vision, 26(1): 63–84, 1998. 12. A.D. Jepson, D.J. Fleet and T.F. El-Maraghi, “Robust online appearance models for visual tracking,” IEEE Trans. Pattern Anal. Machine Intell., 25(10): 1296– 1311, 2003. 13. R.T. Collins, Y. Liu and M. Leordeanu, “Online selection of discriminative tracking features,” IEEE Trans. Pattern Anal. Machine Intell., 27(10): 1631–1643, 2005. 14. B. Han and L. Davis, “On-line density-based appearance modeling for object tracking,” in Proc. Int. Conf. Computer Vision, Beijing, 2005. 15. J. Lim, D. Ross, R.-S. Lin, and M.-H. Yang, “Incremental learning for visual tracking,” in Proc. NIPS, 2004. 16. Z. Zhang, G. Potamianos, A. Senior, S. Chu, and T. Huang, “A joint system for person tracking and face detection,” in Proc. Int. Wksp. Human-Computer Interaction, Beijing, China, 2005. 17. Z. Zhang, G. Potamianos, M. Liu, and T.S. Huang, “Robust multi-view multicamera face detection inside smart rooms using spatio-temporal dynamic programming,” in Proc. Int. Conf. Automatic Face Gesture Recog., Southampton, United Kingdom, 2006. 18. Z. Zhang, G. Potamianos, S.M. Chu, J. Tu and T.S. Huang, “Person tracking in smart rooms using dynamic programming and adaptive subspace learning,” in Proc. Int. Conf. Multimedia Expo, Toronto, Canada, 2006.
118
G. Potamianos and Z. Zhang
19. J. Ho, K.-C. Lee, M.-H. Yang, and D. Kriegman, “Visual tracking using learned linear subspaces,” in Proc. Int. Conf. Computer Vision Pattern Recog., vol. 1, pp. 782–789, 2004. 20. P. Hall, D. Marshall, and R. Martin, “Merging and splitting eigenspace models,” IEEE Trans. Pattern Anal. Machine Intell., 22(9): 1042–1049, 2000. 21. K. Bernardin, A. Elbs, and R. Stiefelhagen, “Multiple object tracking performance metrics and evaluation in a smart room environment,” in Proc. Int. Wksp. Visual Surveillance, Graz, Austria, 2006.
Speaker Tracking in Seminars by Human Body Detection Bo Wu, Vivek Kumar Singh, Ram Nevatia, and Chi-Wei Chu University of Southern California Institute for Robotics and Intelligent Systems Los Angeles, CA 90089-0273 {bowu,viveksin,nevatia,chuc}@usc.edu
Abstract. This paper presents evaluation results of a method for tracking speakers in seminars from multiple cameras. First, 2D human tracking and detection is done for each view. Then, 2D locations are converted to 3D based on the calibration parameters. Finally, cues from multiple cameras are integrated in a incremental way to refine the trajectories. We have developed two multi-view integration methods, which are evaluated and compared on the CHIL speaker tracking test set.
1
Task and Data Set
The task in this evaluation exercise is to track the 3D head locations of a speaker in seminars. In practice, only the ground plane projections of the 3D head locations are used to evaluate the performance. The test set contains 24 segments captured from the UKA site and 2 segments from the ITC site. For each segment, four side view cameras and one optional top-down camera are used to record the seminar. Camera calibration information including radial distortion, is provided. Each video contains about 4,500 frames. There are overall 26×4×4500 = 468, 000 frames to process. The frame sizes of the UKA videos and the ITC videos are 640 × 480 and 800 × 600 respectively. The sampling rate of all videos is 30 FPS. Fig.1 shows some sample frames. This task is made complex due to many reasons. First, the faces of the speakers are not always visible, so face or skin-color detection based methods can not be used in all cases. Second, the speaker does not move all the time and a clear scene shot is not available, hence moving object detection based on static/adaptive background modeling is difficult. Third, the scene is cluttered due to various scene objects, e.g. chairs and laptops. Based on the observation that the speaker is usually the only person standing/walking during the seminar, we use a 2D multi-view human body detector [1] to locate the speaker frame by frame and track in 2D. Then the 2D trajectories are converted to 3D based on the camera calibration information. Finally the cues from multiple cameras are integrated to refine the trajectories. The rest of the paper is organized as follows: Section 2 describes our tracking method; Section 3 shows the experimental results; and Section 4 sums up. R. Stiefelhagen and J. Garofolo (Eds.): CLEAR 2006, LNCS 4122, pp. 119–126, 2007. c Springer-Verlag Berlin Heidelberg 2007
120
B. Wu et al.
Cam1
Cam2
Cam3
Cam4
Cam3
Cam4
(a) UKA site
Cam1
Cam2
(b) ITC site Fig. 1. Sample frames
2
Methodology
We take the single frame human detection responses as the observation of human hypotheses. Tracking is done in 2D for individual views. The 3D head locations are calculated by triangulation or approximated with the 3D feet positions obtained from the calibration information and a ground plane assumption. Then cues from multiple cameras are integrated in an incremental way. 2.1
Multi-view Human Body Detection and Tracking
In the method of [1], four part detectors are learned for full-body, head-shoulder, torso, and legs. We only use the one for full-body here. Two detectors are learnt: one for the left profile view, and one for the frontal/rear view (the detector for right profile view is generated by flipping the left profile view horizontally). Nested cascade detectors are learned by boosting edgelet feature based weak classifiers, as in [1]. The training set contains 1,700 positive samples for frontal/rear views, 1,120 for left profile view, and 7,000 negative images. The positive samples are collected from the Internet and the MIT pedestrian set [2]. The negative images are general scene images collected from the Internet. The training sets are fully independent of the test sequences. For detection, the input image is scanned by all three detectors and the union of their responses is taken as the multi-view detection result. The speaker is tracked in 2D by associating the frame detection responses. This 2D tracking method is a modified version of that in [3]. In [3], the detection responses come from four part detectors and a combined detector. To start a trajectory, an initialization confidence InitConf is calculated from T consecutive responses, which correspond to one human hypothesis, based on the cues from
Speaker Tracking in Seminars by Human Body Detection
121
color, shape, and position. If InitConf is larger than a threshold θinit , a trajectory is started. To track the human, first data association with the combined detection responses is attempted; if this fails, data association with the part detection responses is attempted; if this fails again, a color based meanshift tracker [4] is used to follow the person. The strategy of trajectory termination is similar to that of initialization. A termination confidence EndConf is calculated when an existing trajectory has been lost by the detector for T time steps. If EndConf is larger than a threshold θend , the trajectory is terminated. In this work, only the full-body detector is used. We do not use the combined detection [1] for partial occlusion reasoning explicitly, as the local feature based full-body detector can work with partial occlusion to some extent and occlusions are not strong in this data set. The tracker in [3] tracks multiple persons simultaneously; while the tracker in this work is designed to track a single person. Once a human trajectory is initialized, it prohibits the initialization of other trajectories. The result of the 2D tracker is a set of 2D trajectories which are temporally disjoint with each other. These trajectories share the same identity, i.e. they are considered corresponding to the same object. Fig.2 shows some sample frames of 2D tracking results.
Frame 0
Frame 555
Frame 677
Frame 1955
Frame 2361
Frame 2966
Frame 3401
Frame 4198
Fig. 2. 2D speaker tracking result
2.2
Conversion from 2D to 3D
The 2D human detection and tracking only gives the rough 2D locations of the speaker. We need to extract the ground plane projections of the 3D positions of the speaker’s head. We propose two methods to do this. Approximation by Feet Position. As we have the camera calibration information, the 3D feet positions can be calculated from the 2D pixel locations for individual views based on an assumption that the speaker stands or walks on a ground plane. 3D feet positions are good approximation of the ground plane projections of 3D head positions. In practice, based on the human model of
122
B. Wu et al.
2D image
3D scene Calibration info
3D feet position 2D feet position ground plane Fig. 3. Computation of 3D feet positions
the positive training samples [1], we calculate the 2D feet positions from the rectangle-shaped detection responses, then project them to 3D space. Fig.3 illustrates the computation of the 3D feet positions. Head Position by Triangulation. Similar to the case of 2D feet positions, based on the human model we can get the 2D head positions from the detection responses. Then we use a motion segmentation based method to further refine the 2D head positions. When the speaker is detected as moving, we search for the peaks of the foreground blobs within the response rectangles and take the peaks as the image positions of the head top. When the speaker is detected as being stationary, we just use the head positions calculated based on the human model. As the height of the speaker is unknown, we do triangulation from two views to get the 3D head positions. Fig.4 illustrates the computation of 3D head positions. 2.3
Integration of Multiple Cameras
For one segment, 3D trajectories are obtained from each camera. Partial occlusion of the speaker by the background or other persons may result in tracking errors. Also the speaker is not always visible from a single camera. In order to refined the 3D trajectories, we combine the tracking results from the individual cameras to form a multi-camera result. Due to the errors in 2D tracking, the 3D trajectory may have some unnatural, sudden motions that we call peaks. We detect these peaks by thresholding the velocity of the trajectory. Denote by vi the maximum magnitudes of the velocity of the i-th point, Pi , in the trajectory, and denote by di the overall translation of a sub-window Wi around Pi , i.e. the distance between the start point and the end point of Wi . If vi is larger than a threshold θv and di is smaller than a threshold θd , Pi is classified as peak and all points in Wi are removed from the trajectory. This peak removal process reduces the false alarms in the tracking results but also creates some gaps (missed detections). Gaps may also be present if there is no detection from a single camera. We fill in these gaps by combining the trajectory
Speaker Tracking in Seminars by Human Body Detection
123
3D scene
2D image Calibration info
3D head position
2D head position
ground plane Cam2
Cam1
Fig. 4. Computation of 3D head positions
information from all the cameras in an incremental way. We assign priorities to the individual camera outputs based on their accuracy on a small fraction of the development data. Starting from the highest priority camera, we remove peaks in the output 3D trajectory, then fill in these gaps by using the information from the next highest priority camera and so on. For the triangulation based method the initial 3D trajectory is generated from the best two cameras. This process is continued until all cameras have been used. Fig.5 illustrates the multi-camera integration. Peak Gap
Gap Cam1
Cam2
Original tracks from individual camera
Tracks after peak removal
Track after multicamera integration
Fig. 5. Multi-camera integration
3
Experimental Results
The formal evaluation process defines four metrics for the speaker tracking task [5]: 1. “Miss” represents the of missing detection rate; 2. “FalsePos” represents the false alarm rate; 3. Multiple Object Tracking Precision (MOTP) reflects the 3D location precision of the tracking level; and 4. Multiple Object Tracking Accuracy (MOTA) is the tracking accuracy calculated from the number of false alarms and the number of missed detections.
124
B. Wu et al. Table 1. Evaluation scores with a default threshold of 500mm
Approximation by feet position Head position by triangulation
Miss
FalsePos
MOTP
MOTA
12.28%
12.22%
207mm
75.50%
9.71%
9.65%
161mm
80.64%
260
240
MOTP (mm)
220
200
180
160 Method 1 Method 2 140 400
500
600 700 800 Distance Threshold (mm)
900
1000
(a) MOTP 100 95 90
MOTA (%)
85 80 75 70 65 60 55 400
Method 1 Method 2 500
600 700 800 Distance Threshold (mm)
900
1000
(b) MOTA
Fig. 6. Scores with different distance thresholds. (Method 1: approximation by feet positions; Method 2: head positions by triangulation)
Speaker Tracking in Seminars by Human Body Detection
125
The first two metrics are for detection level and the last two for tracking level. If the distance between the tracked response and the ground truth is smaller than a threshold θpos , it is considered to be a successful match; otherwise a false alarm. The default value of θpos is 500mm. Table 1 lists the scores obtained with the default threshold, and Fig.6 shows the curves of MOTP and MOTA with different thresholds. The triangulation based method dominates the feet tracking based method, as the former locates the head directly. However the main advantage of the triangulation based method is the position accuracy; it can not improve the tracking level performance much when the threshold, i.e. the acceptable error, is close to one meter. Fig.7 shows the distribution of the tracking errors in 3D. Most of the errors are less than one meter, which is small compared to the size of the room. The speed of the system is about 2 FPS on a 2.8GHz Pentium CPU; the program is coded in C++ using OpenCV library functions; no attempt at code optimization has been made. 3000 Method 1 Method 2
No of Instances
2500
2000
1500
1000
500
0
0
500 1000 Distance Errors (mm)
1500
Fig. 7. Error distributions. (Method 1: approximation by feet positions; Method 2: head positions by triangulation)
4
Conclusion and Discussion
We applied a fully automatic single human tracking method to the task of speaker tracking. The system achieves good performance on the test sequences. The comparative results between two multi-view integration methods shows that the triangulation based method has better accuracy. Our current system does multi-view integration after the 2D trajectories are obtained. An alternative way is to do the integration after single frame detection, and then do tracking in 3D. This will remove some ambiguities at the detection level and make the tracking easier. We will explore this method in our future work.
126
B. Wu et al.
Acknowledgements. This research was partially funded by the Advanced Research and Development Activity of the U.S. Government under contract MDA-904-03-C-1786.
References 1. B. Wu, and R. Nevatia. Detection of Multiple, Partially Occluded Humans in a Single Image by Bayesian Combination of Edgelet Part Detectors. ICCV’05. Vol I: 90-97 2. C. Papageorgiou, T. Evgeniou, and T. Poggio. A Trainable Pedestrian Detection System. In: Proc. of Intelligent Vehicles, 1998. pp. 241-246 3. B. Wu, and R. Nevatia. Tracking of Multiple, Partially Occluded Humans based on Static Body Part Detection. To appear in CVPR’06. 4. D. Comaniciu, V. Ramesh, and P. Meer. The Variable Bandwidth Mean Shift and Data-Driven Scale Selection. ICCV’01. Vol I: 438-445 5. http://www.clear-evaluation.org/
TUT Acoustic Source Tracking System 2006 Pasi Pertil¨ a, Teemu Korhonen, Tuomo Pirinen, and Mikko Parviainen Tampere University of Technology, P.O.Box 553, 33101, Tampere, Finland {pasi.pertila, teemu.korhonen, tuomo.pirinen, mikko.p.parviainen}@tut.fi
Abstract. This paper documents the acoustic source tracking system developed by TUT for the 2006 CLEAR evaluation campaign. The described system performs 3-D single person tracking based on audio data received from multiple spatially separated microphone arrays. The evaluation focuses on meeting room domain. The system consists of four distinct stages. First stage is time delay estimation (TDE) between microphone pairs inside each array. Based on the TDE, direction of arrival (DOA) vectors are calculated for each array using a confidence metric. Source localization is done by using a selected combination of DOA estimates. The location estimate is tracked using a particle filter to reduce noise. The system is capable of locating a speaker 72 % of the time with an average accuracy of 25 cm.
1
Introduction
The motivation for this work is to evaluate the performance of an acoustic source tracking system. The described system was entered to CLEAR’06 evaluation [1]. The evaluation data comprises of several hours of audio and video recorded in different meeting room environments [2]. The data is precisely annotated and enables the calculation of system performance in terms of accuracy and other metrics. The idea behind the presented system is straightforward. Microphone arrays, consisting of multiple microphones are used to gather data. The data from each array is processed independently and the array outputs are combined. The system is scalable in terms of microphone array shape, placement and quantity. The baseline for the TUT system was first evaluated in the NIST Rich Transcription spring campaign (RT05s) [3,4]. For CLEAR’06 the localization system is developed to include location tracking to improve accuracy and robustness. Acoustic localization is an enabling technology for a number of applications, that require the speaker location information. Such systems range from automatic camera steering for video conferencing [5] to speech enhancement [6]. Other applications include surveillance applications [7,8,9] in which localization can also be used as a stand-alone technology. The next section discusses the evaluation tasks, data and metrics that are used for scoring. Then a brief description about the proposed system is given followed by a detailed discussion about each processing stage. In Sect. 4 results of the evaluation are given with the processing time. Sections 5 and 6 conclude the discussion. R. Stiefelhagen and J. Garofolo (Eds.): CLEAR 2006, LNCS 4122, pp. 127–136, 2007. c Springer-Verlag Berlin Heidelberg 2007
128
P. Pertil¨ a et al.
camera
Speaker
c Mic. array Tabletop mics Audience
a
b
(a) Microphone array
(b) Example room layout
Fig. 1. Geometry related to data gathering is illustrated. The geometry of a four microphone T-array is presented in panel (a). Dimensions a and b are 20 cm and c is 30 cm for every site except IBM, where the corresponding dimensions are 26 cm and 40 cm. Panel (b) illustrates a basic recording room layout for a seminar session, equipped with different sensors. Microphone arrays, used by TUT system, are mounted to the walls.
2
Evaluation
2.1
Tasks
The evaluation covers multiple tasks based on audio and video data. The tasks that the TUT system participates in are: “3-D Single Person Tracking (SPT-A)” and “3-D Multi Person Tracking (MPT-A)”. The objective of both tasks is to locate a single speaker using only the audio data from the microphone arrays. Other tasks are described in [2]. The tasks use two different data sets, seminar and interactive seminar respectively, see Table 1 and Sect. 2.2 for details. Table 1. Information about the data used in the evaluation is presented. The sites are University of Karlsruhe (UKA), Instituto Trentino di Cultura (ITC), IBM, Polit`ecnica de Catalunya (UPC) and Research and Education society in Information Technology / Athens Information Technology (RESIT/AIT).
Data length [minutes] Site T-arrays UKA ITC1 IBM UPC2 AIT 1 2
4 6 4 3 3
Room size [m]
Dev set
Test set
x
y
157 16 15 13 13
120 10 19 20 20
5.9 4.7 7.2 4.0 5.0
7.1 5.9 5.9 5.2 3.7
z
Recording type
3.0 Seminar 4.2 Seminar 2.7 Interactive Seminar 4.0 Interactive Seminar 2.6 Interactive Seminar
In one Test recording (ITC 20050503) only four T-Arrays are used (1,2,3 and 6) In all Test recordings, only two T-Arrays are used (A and C)
2.2
Data
A short description about the data used in system development and performance evaluation is given here. For further details refer to the evaluation plan [2].
TUT Acoustic Source Tracking System 2006
129
The data consists of audio and video recordings made at five different sites in “meeting room” type environments. The rooms are equipped with the basic recording equipment and configuration according to the CHIL minimum common sensor set-up described in [10]. The specification includes at least three T-shaped microphone arrays. Each array consists of four microphones in a two dimensional upside-down T-shaped form, see Fig. 1. The arrays are located parallel to the walls of the recording rooms at equal height. The height is between 1.6 and 2.4 meters depending on the room. Also other microphones including a linear microphone array and several video cameras are present, but they are not used by the TUT system. The data is divided into development (Dev) and testing (Test) sets, with respective durations of 3.6 and 3.2 hours. The sets are further divided into seminars and interactive seminars. In the seminars the presenter is speaking in front of a large audience. In the interactive seminars the audience is allowed to participate in the discussion, e.g., by asking questions. This corresponds to a multiple person tracking scenario (MPT). The audio data was recorded at 44.1 kHz sampling rate at 24 bit resolution. The number of arrays depends on the configuration of the site, see Table 1. Reference data is annotated with a time resolution of 1.0 s for the person tracking task. For each time instant of active speech segments there exists 3-D head coordinates and ID of the active speaker. 2.3
Metrics
System performance is evaluated with two set of metrics: multiple object tracking (MOT) metrics and sound source localization (SLOC) metrics. These are described briefly below, for detailed discussion see [2]. The MOT metric is a joint set of metrics between audio and video evaluation tasks. The MOT metrics evaluates the system in terms of accuracy and different types of detection errors in a multiple speaker environment: – – – –
MOTP [mm]: multiple object tracking precision MISS [%]: number of misses / number of ground truth points FALSEPOS [%]: number of false positives / number of ground truth points A-MOTA [%]: multiple object tracking accuracy (audio only)
The MOTP is the average distance error of estimates that have a matching reference. The match threshold is set to 500 mm. A false positive happens when no reference exists closer than 500 mm of hypothesis. A miss occurs when no unique matching to a reference point can be made. The A-MOTA is defined as 1 − (misses + false positives)/ground truth points. The SLOC metric is used in previous evaluations. This metric has more finer attributes in terms of accuracy: /N – Pcor : localization rate = N FE T – AEE fine{+gross}: Average Estimate Error in case of fine{+gross} errors – Bias fine{+gross}: bias in case of fine{+gross} errors
130
P. Pertil¨ a et al.
Mic.array
audio data
1
TD Estimation
time delays
DOA Estimation
DOA value k1
tracked location
location
^ x
^ s Localization Mic.array N
audio data
TD Estimation
time delays
DOA Estimation
Tracking
DOA value kN
Fig. 2. A block diagram presents the processing of audio data into a location estimate. Two spatially separated microphone arrays is a minimum requirement.
– Deletion Rate: deleted frames / frames where speaker was active – False Alarm Rate: false alarms / frames where no speaker was active The accuracy of the system is measured using two subtasks, that is, accurate and rough. In the accurate localization subtask, a threshold value of 500 mm between the estimate and reference separates a fine error from a non-anomal error. In the rough localization subtask, a threshold value of 1000 mm separates an estimate from a gross error and an anomal error. NFE is defined as the number of fine errors. The number of total output frames is NT.
3
System Description
The acoustic source tracking system consists of the four stages presented in Fig. 2. Each microphone array is used to calculate source direction (DOA) using time delays between microphone pairs inside each array. Hypothetical location estimates with different DOA combinations are calculated. The hypothesis resulting the best distance criterion value is chosen as the location estimate. The estimate is then tracked using a particle filter to reduce noise. 3.1
Time Delay Estimation
Time delays are produced between outputs of microphones as an acoustic wave travels through the array. The delay values are determined by direction (and speed) of wavefront propagation and thus also the source location. Therefore, it is possible to use estimated values of time delays to compute source direction. Time delay estimation is the first processing stage in the localization system, see Fig. 2. The input to the time delay estimation process are the actual audio signals captured by the microphone arrays. Pairwise time delays are computed within each array. That is, delays are not estimated between signals from two different arrays. Time delays are computed for all pairs available in a single array. For a four-microphone array, six pairs are available, and thus six delay estimates are computed per a processing frame for each array.
TUT Acoustic Source Tracking System 2006
131
The processing is done framewise using a 44100 sample window with 50 % overlap. This corresponds to a window length of one second and 500 ms overlap. Delays are estimated with the Generalized Cross-Correlation using Phase Transform weighting (GCC-PHAT) [11]. This method estimates the weighted cross-correlation function between two microphone channels. The delay estimate is set to the lag value giving the maximum of the correlation function. Despite some shortcomings of the GCC-based methods, such as reverberation robustness [12], the GCC-PHAT method was chosen for its simplicity and ease of implementation. The method has also demonstrated satisfactory performance in the previous version of the TUT localization system [3,4]. It has also been used for speech time delay estimation [13,14,15]. 3.2
Direction of Arrival Estimation
The direction of arrival estimation is based on a local planar wave assumption. That is, the wavefront is assumed to be planar within the dimensions of the microphone array. Within this model, the values of time delays between microphones are determined only by the direction, and not by the distance of the source. Because time delay estimates are prone to errors, especially in reverberant indoor conditions present in the evaluation data, a selection procedure is used to reduce error in DOA estimation. The selection is based on a confidence scoring approach that relies on the linear dependence of time delays. For each time delay, a normalized confidence score [16] is computed. Time delays are sorted according to their confidence, and two best are selected. Because of the T-geometry of the arrays, it may happen that the selected time delays and corresponding microphone pairs do not span a two-dimensional space. In such a case a third time delay is added to the processing. With the mentioned array configurations, three pairs are always sufficient for two-dimensional estimation. DOA estimation is done with the propagation vector technique [17]. This method is a closed form solution and it uses the fact that the propagation vector, k, is a least-squares linear transformation of the pairwise time delays: −1 XT τˆ . (1) k = XT X Here, τˆ is a vector of estimated time delays and matrix X contains the sensor vectors corresponding to the time delays. A sensor vector connects the microphones of a microphone pair. Because the arrays are planar and mounted parallel to the walls, the propagation vector estimates lie in the planes of the walls. To make the vectors threedimensional, a third component is added. The value of the third component is obtained by setting the norm of the propagation vector constant. The other two components are kept fixed and the third component is set, using the Pythagorean theorem, to produce the desired value of the norm. The DOA estimation is performed for all arrays in the room, except those mentioned in Table 1. This gives one estimate from each array, per time frame.
132
P. Pertil¨ a et al.
The one-bit quantization used in [17] was not used in this system. Instead, the files were processed with the original sampling accuracy. 3.3
Source Localization
The localization module computes the source location from the DOA estimates. However, it is unlikely that the DOA lines intersect at a single point in a three dimensional space. Therefore, localization is done using a distance metric as a minimization criterion. The criterion is defined as the distance between a hypothetical source location and its projection on the DOA measurements. An analytic solution that applies this localization criterion is used [18]. The orientation of the speaker’s head determines which microphone arrays are faced directly and which are not. The arrays not faced by the speaker may receive the reflection of a sound louder than the direct sound itself. Also, the recordings are made in a real environment where noise from the audience and devices such as the projector and computers are present. In case of multiple signals, the signal resulting in the largest correlation values between the microphone channel pairs determines the DOA estimate. Besides the absolute sound pressure caused by the speaker also the location of the receiver affects the correlation value. Therefore, it may happen that the DOA estimates made at spatially separated arrays do not point to the same sound source. The issues discussed above cause a situation where using all the available DOA estimates in the closed form solution may result in a location with a larger error compared to a solution that does not use all the DOA estimates. In the previous evaluation campaign, the DOA estimates of each array were filtered respectively with a median filter to remove outliers [3]. Here, an approach that utilizes a distance criterion as a measure of DOA estimate removal is adopted. First, analytic solutions s1 , s2 , ..., sN are calculated using combinations of three or more DOA arrays, where N is the number of combinations. For instance, with four arrays there exists 43 + 44 = 5 possible combinations, i.e., {1, 2, 3}, {1, 2, 4}, . . ., {1, 2, 3, 4} = Ω1 , Ω2 , . . . , Ω5 . Then for each hypothetical solution the average distance criterion value is calculated. More precisely, an array combination n with its location hypothesis sn are selected, where n ∈ 1, . . . , N . Then, a vector from an individual microphone array pi to the hypothesis location ˆi = sn − pi , where i ∈ 1, . . . , |Ωn |. Next, the array-to-hypothesis sn is calculated k ˆ vector ki is projected onto the DOA estimate vector ki . Finally, the distance from ˆi to the hypothesis location is calculated and averthe projection vector Projki k aged over all arrays in the combination. The distance criterion value of each hypothetical location sn is estimated by ˆi · ki k 1 ˆ 1 ˆ ˆi . ki − ki = ki − Projki k 2 |Ωn | |Ω | n k i i∈Ωn i∈Ωn
(2)
Equation (2) is evaluated for all combinations of three or more arrays, and the combination resulting in the smallest average distance criterion is selected as the location estimate ˆ s of the current time frame.
TUT Acoustic Source Tracking System 2006
133
Source locations outside of the room are discarded and are not processed by the tracking method discussed in the next section. 3.4
Source Tracking
Source position estimates received from the localization process are distributed around the true position due to measurement and estimation errors. If the estimate is assumed unbiased, the error due to variance can be reduced by integrating temporal correlation of consecutive location samples. Here, a sequential Monte Carlo method known as particle filtering is applied to location estimates. Specifically, the Sampling Importance Resampling (SIR) algorithm is used [19]. Particle filtering approximates a probability density function (pdf) with a set (n) (n) of M weighted random samples Xt = {xt , wt }M n=1 for each time instant t. The samples known as particles are propagated over time and resampled according to their fit on data. An approximation from the particle set can be evaluated with many different methods. Here, a weighted average of particles yields ˆt = x
M
(n)
(n)
xt wt ,
(3)
n=1 (n)
where weights wt are normalized. The initial set X0 is sampled from Gaussian distribution centered around the first location estimate with the number of particles chosen M = 500. During each iteration particles are sampled from Gaussian prior importance density function (n)
xt
(n)
(n)
∼ N (xt |xt−1 , σ 2 )
(4)
using σ 2 = (0.075 m)2 . The particle weights are evaluated directly from Gaussian pdf with mean from recent location estimate ˆ s and standard deviation of 500 mm. Furthermore, the filter uses four samples ahead of recent one, effectively rendering the filtering method non-causal. Causality can be achieved using a simple delay. 3.5
System Output
The described system produces outputs for both evaluation tasks (SPT-A and MPT-A) with the task specific predefined format [2]. The native output rate is 0.5 seconds and the final rate of the interpolated results is 0.1 seconds. A bias was observed between the system output and UKA ground truth coordinates. The bias was calculated by averaging the differences between output and reference. As a result 300 mm was reduced from every x-coordinate value of UKA data.
4
Results
The outputs were score with two sets of evaluation metrics: MOT and SLOC. The results of SPT-A and MPT-A tasks with both metrics are given in Table 2.
134
P. Pertil¨ a et al.
Table 2. TUT system’s evaluation scores for the testing data set. The tasks are defined in Sect. 2.1 and the metrics are defined in Sect. 2.3.
MOT Scores
MPT-A
SPT-A
MOTP [mm] MISS [%] FALSEPOS [%] A-MOTA [%]
334 83.32 83.22 -66.53
245 27.93 27.86 44.21
SLOC Scores
MPT-A
SPT-A
0.08 397 (68,-118,95) 1194 (146,-102,348) 0.00 1.00
0.68 279 (66,24,67) 533 (73,19,104) 0.00 1.00
Pcor AEE fine [mm] Bias fine [mm] AEE fine+gross [mm] Bias fine+gross [mm] Deletion Rate False Alarm Rate
4.1
Computation Time
The system was implemented and run completely in Matlab. No external binaries or libraries were used. All processing was done with a 3.2 GHz Intel Pentium 4 machine with no more than 2 GB of RAM. The processing time of 5 h 25 min was dominated by TD estimation (95 %). The Test data set contains roughly 3 h 10 min of multichannel data. The system performed the evaluation at approximately 1.63 × real-time. This value depends on the number of arrays. If all the used testing data was converted into a single mono signal and processed, this (1.7 · 105 s) signal would be processed 0.11 × real-time.
5
Discussion
The TUT system baseline is developed for tracking a single continuous acoustic source. The lack of speech activity detection (SAD) affects the utilization of some of the evaluation metrics. Frames that are annotated as non-speech are always counted as false positives. Nevertheless, the metrics related to localization accuracy and miss ratios are relevant for assessing system performance and also designing further improvements. The MOT scores indicate that the system is capable of single person tracking (SPT) with an average accuracy better than 25 cm more than 72 % of the time. The almost equal values of FALSEPOS and MISS metrics suggests that the system output contains large errors where an estimate is counted as a miss and also as a false positive detection. The SLOC scores indicate that 68 % of the time the error was less than 500 mm, with an average error of 279 mm. It is noteworthy that the Pcor score suffers from the lack of proper SAD subsystem if there exists non-speech segments. The lack of SAD also causes False Alarm Rate of 1.00 and Deletion Rate of 0.00.
TUT Acoustic Source Tracking System 2006
135
It is obvious that the system should work better for the intended purpose of single person tracking (SPT) compared to multiple person tracking (MPT). The results of the MPT-A task support this. The recording rooms are equipped with a different number of microphone arrays and have different audibility properties due to dimensions and materials. This type of diversity challenges the scalability of a localization system and limits the optimization. This is seen as healthy development basis for any system. The system was not aggressively tuned for a certain type of room and data from different rooms was essentially processed with the same system. Comparison to the previous evaluation [3] is difficult due to different data set and metrics. However, incorporating a tracking system increases the accuracy and robustness. Also the computational efficiency has increased. Overall performance is seen as satisfactory. However, including a SAD system is necessary for a speaker localization system, since there is nothing to locate without speech.
6
Summary
An acoustic source localization and tracking system was presented. The system comprises of spatially separated microphone arrays. Each array is able to measure the direction of the source. The selected DOA measurements are combined to produce a location estimate. The location is tracked with a particle filter to improve accuracy and robustness. The evaluation data was collected and scored by and outside party of TUT. The system is able to locate a single speaker with an average accuracy of 25 cm more than 72 % of the time.
References 1. Stiefelhagen, R., Garofolo, J.: CLEAR Evaluation Campaign and Workshop. http://www.clear-evaluation.org/ (2006) 2. Mostefa, D., et al.: Clear evaluation plan v.1.1. http://www.clear-evaluation.org/ downloads/chil-clear-v1.1-2006-02-21.pdf (2006) 3. Pirinen, T.W., Pertila, P., Parviainen, M.: The TUT 2005 Source Localization System. In: Proceedings of the Rich Transcription 2005 Spring Meeting Recognition Evaluation, Royal College of Physicians, Edinburgh, UK (2005) 93–99 4. Parviainen, M., Pirinen, T.W., Pertil¨ a, P.: A Speaker Localization System for Lecture Room Environment. In: 3rd Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms. (2006) (accepted for publication). 5. Huang, Y., Benesty, J., Elko, G.W.: Passive acoustic source localization for video camera steering. In: Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’00). Volume 2. (2000) 909–912 6. Roman, N., Wang, D.L., Brown, G.J.: Location-based sound segregation. In: Proceedings of the International Conference on Acoustics Speech and Signal Processing (ICASSP’02). (2002) 1013–1016 7. Blumrich, R., Altmann, J.: Medium-range localisation of aircraft via triangulation. Applied Acoustics 61(1) (2000) 65–82
136
P. Pertil¨ a et al.
8. Bass, H.E., et al.: Infrasound. Acoustics Today 2(1) (2006) 9–19 9. Pertil¨ a, P., Parviainen, M., Korhonen, T., Visa, A.: Moving sound source localization in large areas. In: 2005 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS 2005). (2005) 745–748 10. Omologo, M., Brutti, A., Svaizer, P.: Speaker Localization and Tracking - Evaluation Criteria. CHIL. (2005) v. 5.0. 11. Knapp, C., Carter, G.C.: The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing 24(4) (1976) 320–327 12. Champagne, B., B´edard, S., St´ephenne, A.: Performance of time-delay estimation in the presence of room reverberation. IEEE Transactions on Speech and Audio Processing 4(2) (1996) 148–152 13. Omologo, M., Svaizer, P.: Use of the crosspower-spectrum phase in acoustic event location. IEEE Transactions on Speech and Audio Processing 5(3) (1997) 288–292 14. Varma, K., Ikuma, T., Beex, A.A.: Robust TDE-based DOA-estimation for compact audio arrays. In: Proceedings of the Second IEEE Sensor Array and Multichannel Signal Processing Workshop (SAM). (2002) 214–218 15. Anguera, X., Wooters, C., Peskin, B., Aguil´ o, M.: Robust speaker segmentation for meetings: The ICSI-SRI spring 2005 diarization system. Lecture Notes in Computer Science 3869 (2006) 402–414 16. Pirinen, T.: Normalized confidence factors for robust direction of arrival estimation. In: Proceedings of the 2005 IEEE International Symposium on Circuits and Systems (ISCAS). (2005) 17. Yli-Hietanen, J., Kallioj¨ arvi, K., Astola, J.: Low-complexity angle of arrival estimation of wideband signals using small arrays. In: Proceedings of the 8th IEEE Signal Processing Workshop on Statistical Signal and Array Signal Processing. (1996) 109–112 18. Hawkes, M., Nehorai, A.: Wideband Source Localization Using a Distributed Acoustic Vector-Sensor Array. IEEE Transactions on Signal Processing 51(6) (2003) 1479–1491 19. Gordon, N., Salmond, D., Smith, A.: Novel approach to nonlinear/non-Gaussian Bayesian state estimation. Radar and Signal Processing, IEE Proceedings F 140(2) (1993) 107–113
Tracking Multiple Speakers with Probabilistic Data Association Filters Tobias Gehrig and John McDonough Institut f¨ ur Theoretische Informatik Universit¨ at Karlsruhe Am Fasanengarten 5, 76131 Karlsruhe, Germany {tgehrig, jmcd}@ira.uka.de
Abstract. In prior work, we developed a speaker tracking system based on an extended Kalman filter using time delays of arrival (TDOAs) as acoustic features. In particular, the TDOAs comprised the observation associated with an iterated extended Kalman filter (IEKF) whose state corresponds to the speaker position. In other work, we followed the same approach to develop a system that could use both audio and video information to track a moving lecturer. While these systems functioned well, their utility was limited to scenarios in which a single speaker was to be tracked. In this work, we seek to remove this restriction by generalizing the IEKF, first to a probabilistic data association filter, which incorporates a clutter model for rejection of spurious acoustic events, and then to a joint probabilistic data association filter (JPDAF), which maintains a separate state vector for each active speaker. In a set of experiments conducted on seminar and meeting data, we demonstrate that the JPDAF provides tracking performance superior to the IEKF.
1
Introduction
Most practical acoustic source localization schemes are based on time delay of arrival estimation (TDOA) for the following reasons: Such systems are conceptually simple. They are reasonably effective in moderately reverberant environments. Moreover, their low computational complexity makes them well-suited to real-time implementation with several sensors. Time delay of arrival-based source localization is based on a two-step procedure: 1. The TDOA between all pairs of microphones is estimated, typically by finding the peak in a cross correlation or generalized cross correlation such as the phase transform (PHAT) [1]. 2. For a given source location, the squared-error is calculated between the estimated TDOAs and those determined from the source location. The estimated source location then corresponds to that position which minimizes this squared error.
This work was sponsored by the European Union under the integrated project CHIL, Computers in the Human Interaction Loop, contract number 506909.
R. Stiefelhagen and J. Garofolo (Eds.): CLEAR 2006, LNCS 4122, pp. 137–150, 2007. c Springer-Verlag Berlin Heidelberg 2007
138
T. Gehrig and J. McDonough
If the TDOA estimates are assumed to have a Gaussian-distributed error term, it can be shown that the least squares metric used in Step 2 provides the maximum likelihood (ML) estimate of the speaker location [2]. Unfortunately this least squares criterion results in a nonlinear optimization problem that can have several local minima. In prior work [3], we employed an extended Kalman filter to directly update the speaker position estimate based on the observed TDOAs. In particular, the TDOAs comprised the observation associated with an extended Kalman filter whose state corresponded to the speaker position. Hence, the new position estimate came directly from the update formulae associated with the Kalman filter. We tested our algorithm on seminar data involving actual human subjects, and found that our algorithm provided localization performance superior to the standard techniques such as [4]. In other work [5], we enhanced our audio localizer with video information. We proposed an algorithm to incorporate detected face positions in different camera views into the Kalman filter without doing any triangulation. Our algorithm differed from that proposed by Strobel et al [6] in that no explicit position estimates were made by the individual sensors. Rather, as in the work of Welch and Bishop [7], the observations of the individual sensors were used to incrementally update the state of a Kalman filter. This combined approach yielded a robust source localizer that functioned reliably both for segments wherein the speaker was silent, which would have been detrimental for an audio only tracker, and wherein many faces appear, which would have confused a video only tracker. Our experiments with actual seminar data revealed that the audio-video localizer functioned better than a localizer based solely on audio or solely on video features. Although the systems described in our prior work functioned well, their utility was limited to scenarios wherein a single subject was to be tracked. In this work, we seek to remove this limitation and develop a system that can track several simultaneous speakers, such as might be required for meeting and small conference scenarios. Our approach is based on two generalizations of the IEKF, namely, the probabistic data association filter (PDAF) and the join probabistic data association filter (JPDAF). Such data association filters have been used extensively in the computer vision field [8], but have seen less widespread use in the field of acoustic person localization and tracking [9]. Compared with the IEKF, these generalizations provide the following advantages: 1. In the PDAF, a “clutter model” is used to model random events, such as door slams, footfalls, etc., that are not associated with any speaker, but can cause spurious peaks in the GCC of a microphone pair, and thus lead to poor tracking performance. Observations assigned with high probability to the clutter model do not affect the estimated position of the active target. 2. In the JPDAF, a unique PDAF is maintained for each active speaker and the peaks in the GCC are probabilistically associated with each of the currently active targets. This association is done jointly for all targets. Moreover, the feasible associations are defined such that a given GCC peak is associated with exactly one active speaker or the clutter model, and a target may be associated with at most one peak for a given microphone pair [10].
Tracking Multiple Speakers with Probabilistic Data Association Filters
139
Through these extensions, the JPDAF is able to track multiple, simultaneous speakers, which is not possible with the simple IEKF. As we show here, this capacity for tracking multiple active speakers is the primary reason why the JPDAF system provides tracking performance superior to that achieved with the IEKF. It is worth noting that similar work in speaker segmentation based on the output of a source localizer was attempted in [11], but without exploiting the full rigor of the Kalman and data association filters. The balance of this work is organized as follows. In Section 2, we review the process of source localization based on time-delay of arrival estimation. In particular, we formulate source localization as a problem in nonlinear least squares estimation, then develop an appropriate linearized model. Section 3 provides a brief exposition of the extended Kalman, as well as it variants, the IEKF, the PDAF and JPDAF. Section 4 presents the results of our initial experiments comparing the tracking performance of the IEKF and JPDAF.
2
Source Localization
Consider the i-th pair of microphones, and let mi1 and mi2 respectively be the positions of the first and second microphones in the pair. Let x denote the position of the speaker in R3 . Then the time delay of arrival (TDOA) between the two microphones of the pair can be expressed as T (mi1 , mi2 , x) =
x − mi1 − x − mi2 s
(1)
where s is the speed of sound. Denoting ⎤ ⎡ ⎡ ⎤ mij,x x x = ⎣y ⎦ mij = ⎣mij,y ⎦ mij,z z allows (1) to be rewritten as Ti (x) = T (mi1 , mi2 , x) = where dij =
1 (di1 − di2 ) s
(2)
(x − mij,x )2 + (y − mij,y )2 + (z − mij,z )2
= x − mij
(3)
is the distance from the source to microphone mij . Equation (2) is clearly nonlinear in x = (x, y, z). In the coming development, we will find it useful to have a linear approximation. Hence, we can take a partial derivative with respect to x on both sides of (2) and write 1 x − mi1,x x − mi2,x ∂Ti (x) = · − ∂x s di1 di2
140
T. Gehrig and J. McDonough
Taking partial derivatives with respect to y and z similarly, we find 1 x − mi1 x − mi2 ∇x Ti (x) = · − s di1 di2 We can approximate Ti (x) with a first order Taylor series expansion about ˆ (t − 1) as the last position estimate x ˆ (t − 1)) x(t − 1)) + ∇x Ti (x)(x − x Ti (x) ≈ Ti (ˆ ˆ (t − 1)) x(t − 1)) + ci (t)(x − x = Ti (ˆ
(4)
where we have defined the row vector T
ci (t) = [∇x Ti (x)] =
T 1 x − mi1 x − mi2 · − s di1 di2
(5)
Equations (4–5) are the desired linearization. Source localization based on a maximum likelihood (ML) criterion [2] proceeds by minimizing the error function (x) =
N −1 i=0
1 2 [ˆ τ i − Ti (x)] σi2
(6)
where τˆi is the observed TDOA for the i-th microphone pair and σi2 is the error covariance associated with this observation. The TDOAs can be estimated with a variety of well-known techniques [1,12]. Perhaps the most popular method involves phase transform (PHAT), a variant of the generalized cross correlation (GCC) which can be expressed as
π X1 (ejωτ )X2∗ (ejωτ ) jωτ 1 e dω (7) R12 (τ ) = 2π −π |X1 (ejωτ )X2∗ (ejωτ )| For reasons of computational efficiency, R12 (τ ) is typically calculated with an inverse FFT. Thereafter, an interpolation is performed to overcome the granularity in the estimate corresponding to the sampling interval [1]. Substituting the linearization (4) into (6) and introducing a time dependence provides N −1 1 τ i (t) − ci (t)x]2 (8) (x; t) ≈ 2 [¯ σ i i=0 where x(t − 1) τ¯i (t) = τˆi (t) − Ti (x(t − 1)) + ci (t)ˆ for i = 0, . . . , N − 1. Let us define
$$\bar{\boldsymbol{\tau}}(t) = \begin{bmatrix}\bar{\tau}_0(t) \\ \bar{\tau}_1(t) \\ \vdots \\ \bar{\tau}_{N-1}(t)\end{bmatrix}, \qquad \hat{\boldsymbol{\tau}}(t) = \begin{bmatrix}\hat{\tau}_0(t) \\ \hat{\tau}_1(t) \\ \vdots \\ \hat{\tau}_{N-1}(t)\end{bmatrix} \qquad (9)$$
and
$$\mathbf{T}(\hat{\mathbf{x}}(t)) = \begin{bmatrix}T_0(\hat{\mathbf{x}}(t)) \\ T_1(\hat{\mathbf{x}}(t)) \\ \vdots \\ T_{N-1}(\hat{\mathbf{x}}(t))\end{bmatrix}, \qquad \mathbf{C}(t) = \begin{bmatrix}\mathbf{c}_0(t) \\ \mathbf{c}_1(t) \\ \vdots \\ \mathbf{c}_{N-1}(t)\end{bmatrix} \qquad (10)$$
so that (9) can be expressed in matrix form as

$$\bar{\boldsymbol{\tau}}(t) = \hat{\boldsymbol{\tau}}(t) - \left[\mathbf{T}(\mathbf{x}(t-1)) - \mathbf{C}(t)\,\hat{\mathbf{x}}(t-1)\right] \qquad (11)$$
Similarly, defining

$$\boldsymbol{\Sigma} = \operatorname{diag}\left(\sigma_0^2, \sigma_1^2, \cdots, \sigma_{N-1}^2\right) \qquad (12)$$

enables (8) to be expressed as

$$\epsilon(\mathbf{x}; t) = \left[\bar{\boldsymbol{\tau}}(t) - \mathbf{C}(t)\,\mathbf{x}\right]^T \boldsymbol{\Sigma}^{-1}\left[\bar{\boldsymbol{\tau}}(t) - \mathbf{C}(t)\,\mathbf{x}\right] \qquad (13)$$
While (13) is sufficient to estimate the position of a speaker at any given time instant, it takes no account of past observations, which may also be useful for determining the speaker’s current position. This can be achieved, however, by defining a model of the speaker’s dynamics, and applying an extended Kalman filter to this nonlinear regression problem.
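To make the localization machinery concrete, the short Python sketch below evaluates the TDOA functional (1), the linearized row vector (5), and the ML error criterion (6) for a set of microphone pairs. It is only an illustrative rendering of the equations above: the speed-of-sound constant, function names, and data layout are assumptions, not taken from the authors' implementation.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value, not specified in the paper


def tdoa(m1, m2, x, s=SPEED_OF_SOUND):
    """Time delay of arrival T(m_i1, m_i2, x) as in Eq. (1)."""
    return (np.linalg.norm(x - m1) - np.linalg.norm(x - m2)) / s


def linearization_row(m1, m2, x, s=SPEED_OF_SOUND):
    """Row vector c_i from Eq. (5): gradient of T_i evaluated at x."""
    d1 = np.linalg.norm(x - m1)
    d2 = np.linalg.norm(x - m2)
    return ((x - m1) / d1 - (x - m2) / d2) / s


def ml_error(x, mic_pairs, tau_hat, sigma2):
    """Weighted least-squares error of Eq. (6) over N microphone pairs."""
    return sum(
        (tau - tdoa(m1, m2, x)) ** 2 / s2
        for (m1, m2), tau, s2 in zip(mic_pairs, tau_hat, sigma2)
    )
```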
3 Kalman Filters
Here we briefly review the extended Kalman filter (EKF) and its variations, the IEKF, PDAF and JPDAF.

3.1 Extended Kalman Filter
Let $\mathbf{x}(t)$ denote the current state of a Kalman filter and $\mathbf{y}(t)$ the current observation. As $\mathbf{x}(t)$ cannot be observed directly, it must be inferred from the time series $\{\mathbf{y}(t)\}_t$; this is the primary function of the Kalman filter. The operation of the Kalman filter is governed by a state space model consisting of a process and an observation equation, respectively,

$$\mathbf{x}(t+1) = \mathbf{F}(t+1, t)\,\mathbf{x}(t) + \boldsymbol{\nu}_1(t) \qquad (14)$$
$$\mathbf{y}(t) = \mathbf{C}(t, \mathbf{x}(t)) + \boldsymbol{\nu}_2(t) \qquad (15)$$
where $\mathbf{F}(t+1, t)$ is a known transition matrix, which, by definition, satisfies

$$\mathbf{F}(t+1, t)\,\mathbf{F}(t, t+1) = \mathbf{F}(t, t+1)\,\mathbf{F}(t+1, t) = \mathbf{I} \qquad (16)$$
The term C(t, x(t)) is the known observation functional, which can represent any arbitrary, nonlinear, time varying mapping from x(t) to y(t). In (14–15) the process and observation noise terms are denoted by ν 1 (t) and ν 2 (t) respectively.
These noise terms are by assumption zero mean, white Gaussian random vector processes with covariance matrices defined by
$$E\{\boldsymbol{\nu}_i(t)\,\boldsymbol{\nu}_i^T(k)\} = \begin{cases}\mathbf{Q}_i(t) & \text{for } t = k \\ \mathbf{0} & \text{otherwise}\end{cases}$$

for $i = 1, 2$. Moreover, $\boldsymbol{\nu}_1(t)$ and $\boldsymbol{\nu}_2(k)$ are statistically independent such that $E\{\boldsymbol{\nu}_1(t)\,\boldsymbol{\nu}_2^T(k)\} = \mathbf{0}$ for all $t$ and $k$.

In the sequel, it will prove useful to define two estimates of the current state: Let $\hat{\mathbf{x}}(t|Y^{t-1})$ denote the predicted state estimate of $\mathbf{x}(t)$ obtained from all observations $Y^{t-1} = \{\mathbf{y}(i)\}_{i=0}^{t-1}$ up to time $t-1$. The filtered state estimate $\hat{\mathbf{x}}(t|Y^t)$, on the other hand, is based on all observations $Y^t = \{\mathbf{y}(i)\}_{i=0}^{t}$ including the current one. The predicted observation is then given by

$$\hat{\mathbf{y}}(t|Y^{t-1}) = \mathbf{C}(t, \hat{\mathbf{x}}(t|Y^{t-1})) \qquad (17)$$
which follows readily from (15). By definition, the innovation is the difference

$$\boldsymbol{\alpha}(t) = \mathbf{y}(t) - \mathbf{C}(t, \hat{\mathbf{x}}(t|Y^{t-1})) \qquad (18)$$
between actual and predicted observations. Generalizing the Kalman filter to the EKF entails linearizing $\mathbf{C}(t, \mathbf{x}(t))$ about the predicted state estimate $\hat{\mathbf{x}}(t|Y^{t-1})$. Denote this linearization as

$$\mathbf{C}(t) = \left.\frac{\partial \mathbf{C}(t, \mathbf{x})}{\partial \mathbf{x}}\right|_{\mathbf{x} = \hat{\mathbf{x}}(t|Y^{t-1})} \qquad (19)$$

where entry $(i, j)$ of $\mathbf{C}(t, \mathbf{x})$ is the partial derivative of the $i$-th component of $\mathbf{C}(t, \mathbf{x})$ with respect to the $j$-th component of $\mathbf{x}$. Exploiting the statistical independence of $\boldsymbol{\nu}_1(t)$ and $\boldsymbol{\nu}_2(t)$, the correlation matrix of the innovations sequence can be expressed as

$$\mathbf{R}(t) = E\{\boldsymbol{\alpha}(t)\,\boldsymbol{\alpha}^T(t)\} = \mathbf{C}(t)\,\mathbf{K}(t, t-1)\,\mathbf{C}^T(t) + \mathbf{Q}_2(t) \qquad (20)$$

where

$$\mathbf{K}(t, t-1) = E\{\boldsymbol{\epsilon}(t, t-1)\,\boldsymbol{\epsilon}^T(t, t-1)\}$$

is the correlation matrix of the predicted state error,

$$\boldsymbol{\epsilon}(t, t-1) = \mathbf{x}(t) - \hat{\mathbf{x}}(t|Y^{t-1})$$

The Kalman gain for the EKF is defined as

$$\mathbf{G}_F(t) = \mathbf{F}^{-1}(t+1, t)\, E\{\mathbf{x}(t+1)\,\boldsymbol{\alpha}^T(t)\}\, \mathbf{R}^{-1}(t) = \mathbf{K}(t, t-1)\, \mathbf{C}^T(t)\, \mathbf{R}^{-1}(t) \qquad (21)$$
To calculate $\mathbf{G}(t)$, we must know $\mathbf{K}(t, t-1)$ in advance. The latter is available from the Riccati equation, which can be stated as

$$\mathbf{K}(t+1, t) = \mathbf{F}(t+1, t)\,\mathbf{K}(t)\,\mathbf{F}^T(t+1, t) + \mathbf{Q}_1(t) \qquad (22)$$
$$\mathbf{K}(t) = \left[\mathbf{I} - \mathbf{F}(t, t+1)\,\mathbf{G}(t)\,\mathbf{C}(t)\right]\mathbf{K}(t, t-1) \qquad (23)$$

where

$$\mathbf{K}(t) = E\{\boldsymbol{\epsilon}(t)\,\boldsymbol{\epsilon}^T(t)\}$$

is the correlation matrix of the filtered state error,

$$\boldsymbol{\epsilon}(t) = \mathbf{x}(t) - \hat{\mathbf{x}}(t|Y^t)$$

An update of the state estimate proceeds in two steps: First, the predicted state estimate

$$\hat{\mathbf{x}}(t|Y^{t-1}) = \mathbf{F}(t, t-1)\,\hat{\mathbf{x}}(t-1|Y^{t-1})$$

is formed and used to calculate the innovation $\boldsymbol{\alpha}(t)$ as in (18), as well as the linearized observation functional as in (19). Then the correction based on the current observation is applied to obtain the filtered state estimate according to

$$\hat{\mathbf{x}}(t|Y^t) = \hat{\mathbf{x}}(t|Y^{t-1}) + \mathbf{G}_F(t)\,\boldsymbol{\alpha}(t) \qquad (24)$$
These computations are summarized in Table 1.

We now consider a refinement of the extended Kalman filter. Repeating (25–28) of Table 1, we can write

$$\mathbf{R}(t, \hat{\mathbf{x}}(t|Y^{t-1})) = \mathbf{C}(t)\,\mathbf{K}(t, t-1)\,\mathbf{C}^T(t) + \mathbf{Q}_2(t) \qquad (32)$$
$$\mathbf{G}_F(t, \hat{\mathbf{x}}(t|Y^{t-1})) = \mathbf{K}(t, t-1)\,\mathbf{C}^T(t, \hat{\mathbf{x}}(t|Y^{t-1}))\,\mathbf{R}^{-1}(t, \hat{\mathbf{x}}(t|Y^{t-1})) \qquad (33)$$
$$\boldsymbol{\alpha}(t, \hat{\mathbf{x}}(t|Y^{t-1})) = \mathbf{y}(t) - \mathbf{C}(t, \hat{\mathbf{x}}(t|Y^{t-1})) \qquad (34)$$
$$\hat{\mathbf{x}}(t|Y^t) = \hat{\mathbf{x}}(t|Y^{t-1}) + \mathbf{G}_F(t, \hat{\mathbf{x}}(t|Y^{t-1}))\,\boldsymbol{\alpha}(t, \hat{\mathbf{x}}(t|Y^{t-1})) \qquad (35)$$

where we have explicitly indicated the dependence of the relevant quantities on $\hat{\mathbf{x}}(t|Y^{t-1})$. Jazwinski [13, §8.3] describes an iterated extended Kalman filter (IEKF), in which (32–35) are replaced with the local iteration

$$\mathbf{R}(t, \boldsymbol{\eta}_i) = \mathbf{C}(\boldsymbol{\eta}_i)\,\mathbf{K}(t, t-1)\,\mathbf{C}^T(\boldsymbol{\eta}_i) + \mathbf{Q}_2(t) \qquad (36)$$
$$\mathbf{G}_F(t, \boldsymbol{\eta}_i) = \mathbf{K}(t, t-1)\,\mathbf{C}^T(\boldsymbol{\eta}_i)\,\mathbf{R}^{-1}(t, \boldsymbol{\eta}_i) \qquad (37)$$
$$\boldsymbol{\alpha}(t, \boldsymbol{\eta}_i) = \mathbf{y}(t) - \mathbf{C}(t, \boldsymbol{\eta}_i) \qquad (38)$$
$$\boldsymbol{\zeta}(t, \boldsymbol{\eta}_i) = \boldsymbol{\alpha}(t, \boldsymbol{\eta}_i) - \mathbf{C}(\boldsymbol{\eta}_i)\left[\hat{\mathbf{x}}(t|Y^{t-1}) - \boldsymbol{\eta}_i\right] \qquad (39)$$
$$\boldsymbol{\eta}_{i+1} = \hat{\mathbf{x}}(t|Y^{t-1}) + \mathbf{G}_F(t, \boldsymbol{\eta}_i)\,\boldsymbol{\zeta}(t, \boldsymbol{\eta}_i) \qquad (40)$$

where $\mathbf{C}(\boldsymbol{\eta}_i)$ is the linearization of $\mathbf{C}(t, \boldsymbol{\eta}_i)$ about $\boldsymbol{\eta}_i$. The local iteration is initialized by setting

$$\boldsymbol{\eta}_1 = \hat{\mathbf{x}}(t|Y^{t-1})$$
Table 1. Calculations for extended Kalman filter

Input vector process: $\mathbf{y}(1), \mathbf{y}(2), \ldots, \mathbf{y}(t)$

Known parameters:
– state transition matrix: $\mathbf{F}(t+1, t)$
– nonlinear measurement functional: $\mathbf{C}(t, \mathbf{x}(t))$
– covariance matrix of process noise: $\mathbf{Q}_1(t)$
– covariance matrix of measurement noise: $\mathbf{Q}_2(t)$
– initial diagonal loading: $\sigma_D^2$

Initial conditions:

$$\hat{\mathbf{x}}(1|Y^0) = \mathbf{x}_0, \qquad \mathbf{K}(1, 0) = \frac{1}{\sigma_D^2}\,\mathbf{I}$$

Computation: $t = 1, 2, 3, \ldots$

$$\mathbf{R}(t) = \mathbf{C}(t)\,\mathbf{K}(t, t-1)\,\mathbf{C}^T(t) + \mathbf{Q}_2(t) \qquad (25)$$
$$\mathbf{G}_F(t) = \mathbf{K}(t, t-1)\,\mathbf{C}^T(t)\,\mathbf{R}^{-1}(t) \qquad (26)$$
$$\boldsymbol{\alpha}(t) = \mathbf{y}(t) - \mathbf{C}(t, \hat{\mathbf{x}}(t|Y^{t-1})) \qquad (27)$$
$$\hat{\mathbf{x}}(t|Y^t) = \hat{\mathbf{x}}(t|Y^{t-1}) + \mathbf{G}_F(t)\,\boldsymbol{\alpha}(t) \qquad (28)$$
$$\mathbf{K}(t) = \left[\mathbf{I} - \mathbf{G}_F(t)\,\mathbf{C}(t)\right]\mathbf{K}(t, t-1) \qquad (29)$$
$$\mathbf{K}(t+1, t) = \mathbf{F}(t+1, t)\,\mathbf{K}(t)\,\mathbf{F}^T(t+1, t) + \mathbf{Q}_1(t) \qquad (30)$$
$$\hat{\mathbf{x}}(t+1|Y^t) = \mathbf{F}(t+1, t)\,\hat{\mathbf{x}}(t|Y^t) \qquad (31)$$

Note: The linearized matrix $\mathbf{C}(t)$ is computed from the nonlinear functional $\mathbf{C}(t, \mathbf{x}(t))$ as in (19).
Note that $\boldsymbol{\eta}_2 = \hat{\mathbf{x}}(t|Y^t)$ as defined in (35). Hence, if the local iteration is run only once, the IEKF reduces to the EKF. Normally, however, (36–40) are repeated until there are no substantial changes between $\boldsymbol{\eta}_i$ and $\boldsymbol{\eta}_{i+1}$. Both $\mathbf{G}_F(t, \boldsymbol{\eta}_i)$ and $\mathbf{C}(\boldsymbol{\eta}_i)$ are updated for each local iteration. After the last iteration, we set

$$\hat{\mathbf{x}}(t|Y^t) = \boldsymbol{\eta}_f$$

and this value is used to update $\mathbf{K}(t)$ and $\mathbf{K}(t+1, t)$. Jazwinski [13, §8.3] reports that the IEKF provides faster convergence in the presence of significant nonlinearities in the observation equation, especially when the initial state estimate $\boldsymbol{\eta}_1 = \hat{\mathbf{x}}(t|Y^{t-1})$ is far from the optimal value.

Although the IEKF was used for all experiments reported in Section 4, in the descriptions of data association filters to follow, we will base our development on the extended Kalman filter. This is done only for the sake of simplicity of exposition; in all cases, the extension of the data association filters to use multiple iterations at each time instant, as described above, is straightforward.
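The local iteration (36)–(40) is straightforward to express in code. The sketch below is a schematic Python rendering of one IEKF measurement update under the definitions above; the function signature, convergence tolerance, and iteration cap are illustrative assumptions rather than the authors' settings.

```python
import numpy as np

def iekf_update(x_pred, K_pred, y, h, jacobian, Q2, n_iters=5, tol=1e-6):
    """One IEKF measurement update following Eqs. (36)-(40).

    x_pred, K_pred : predicted state and its error covariance
    y              : current observation vector
    h(x)           : nonlinear observation functional C(t, x)
    jacobian(x)    : linearization C(eta_i) of h about eta_i
    Q2             : observation-noise covariance
    """
    eta = x_pred.copy()                        # eta_1 = x_hat(t|Y^{t-1})
    for _ in range(n_iters):
        C = jacobian(eta)                      # C(eta_i)
        R = C @ K_pred @ C.T + Q2              # Eq. (36)
        G = K_pred @ C.T @ np.linalg.inv(R)    # Eq. (37)
        alpha = y - h(eta)                     # Eq. (38)
        zeta = alpha - C @ (x_pred - eta)      # Eq. (39)
        eta_next = x_pred + G @ zeta           # Eq. (40)
        converged = np.linalg.norm(eta_next - eta) < tol
        eta = eta_next
        if converged:
            break
    # Filtered covariance with the final linearization, as in Eq. (29)
    C = jacobian(eta)
    G = K_pred @ C.T @ np.linalg.inv(C @ K_pred @ C.T + Q2)
    K_filt = (np.eye(len(x_pred)) - G @ C) @ K_pred
    return eta, K_filt
```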
3.2 Speaker Tracking
In this section, we discuss the specifics of how the linearized least squares position estimation criterion (13) can be recursively minimized with the iterated extended Kalman filter presented in the prior section. We begin by associating the observation $\mathbf{y}(t)$ with the TDOA estimate $\boldsymbol{\tau}(t)$ for the audio features, and with the detected face position for the video features. Moreover, we recognize that the linearized observation functional $\mathbf{C}(t)$ required for the Kalman filter is given by (5). Furthermore, we can equate the TDOA error covariance matrix $\boldsymbol{\Sigma}$ in (12) with the observation noise covariance $\mathbf{Q}_2(t)$ and define a similar matrix for the video features. Hence, we have all relations needed on the observation side of the Kalman filter. We need only supplement these with an appropriate model of the speaker's dynamics to develop an algorithm capable of tracking a moving speaker, as opposed to finding his position at a single time instant.

Consider the simplest model of speaker dynamics, wherein the speaker is "stationary" inasmuch as he moves only under the influence of the process noise $\boldsymbol{\nu}_1(t)$. The transition matrix is then $\mathbf{F}(t+1|t) = \mathbf{I}$. Assuming the process noise components in the three directions are statistically independent, we can write

$$\mathbf{Q}_1(t) = \sigma^2 T^2 \mathbf{I} \qquad (41)$$
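Under this "stationary" dynamics model the prediction step of the filter is particularly simple, since $\mathbf{F}(t+1, t) = \mathbf{I}$: only the error covariance grows by the process noise (41). A minimal sketch (variable names and the value of $\sigma$ are assumptions for illustration):

```python
import numpy as np

def predict_stationary(x_filt, K_filt, sigma, T_elapsed):
    """Prediction step for the stationary speaker model: F(t+1, t) = I and
    process noise Q1(t) = sigma^2 * T^2 * I as in Eq. (41)."""
    x_pred = x_filt.copy()                               # state unchanged
    Q1 = (sigma ** 2) * (T_elapsed ** 2) * np.eye(len(x_filt))
    K_pred = K_filt + Q1                                 # Riccati update with F = I
    return x_pred, K_pred
```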
where $T$ is the time since the last state update. Although the audio sampling is synchronous for all sensors, it cannot be assumed that the speaker constantly speaks, nor that all microphones receive the direct signal from the speaker's mouth; i.e., the speaker sometimes turns so that he is no longer facing the microphone array. As only the direct signal is useful for localization [14], the TDOA estimates returned by those sensors receiving only the indirect signal reflected from the walls should not be used for position updates. This is most easily done by setting a threshold on the PHAT (7), and using for source localization only those microphone pairs returning a peak in the PHAT above the threshold [14]. This implies that no update at all is made if the speaker is not speaking.

3.3 Probabilistic Data Association Filter
The PDAF is a generalization of the Kalman filter wherein the Gaussian probability density function (pdf) associated with the location of the speaker or target is supplemented with a pdf for random false alarms or clutter [10, §6.4]. Depending on the formulation, the latter pdf may be specified by either a Poisson density or a uniform distribution. Through the inclusion of the clutter model, the PDAF is able to make use of several observations $\{\mathbf{y}_i(t)\}_{i=1}^{m_t}$ for each time instant, where $m_t$ is the total number of observations for time $t$. Each observation can then be attributed either to the target itself, or to the background model. Let us define the events

$$\theta_i(t) = \{\mathbf{y}_i(t) \text{ is the target observation at time } t\} \qquad (42)$$
$$\theta_0(t) = \{\text{all observations are clutter}\} \qquad (43)$$

and the posterior probability of each event

$$\beta_i(t) = P\{\theta_i(t)|Y^t\} \qquad (44)$$

As the events $\{\theta_i(t)\}_{i=0}^{m_t}$ are exhaustive and mutually exclusive, we have

$$\sum_{i=0}^{m_t} \beta_i(t) = 1$$
Moreover, invoking the total probability theorem, the filtered state estimate can be expressed as

$$\hat{\mathbf{x}}(t|Y^t) = \sum_{i=0}^{m_t} \hat{\mathbf{x}}_i(t|Y^t)\,\beta_i(t)$$

where

$$\hat{\mathbf{x}}_i(t|Y^t) = E\{\mathbf{x}(t)|\theta_i(t), Y^t\}$$

is the updated state estimate conditioned on $\theta_i(t)$. It can be readily shown that this state estimate can be calculated as

$$\hat{\mathbf{x}}_i(t|Y^t) = \hat{\mathbf{x}}(t|Y^{t-1}) + \mathbf{G}_F(t, \hat{\mathbf{x}}(t|Y^{t-1}))\,\boldsymbol{\alpha}_i(t, \hat{\mathbf{x}}(t|Y^{t-1}))$$

where

$$\boldsymbol{\alpha}_i(t, \hat{\mathbf{x}}(t|Y^{t-1})) = \mathbf{y}_i(t) - \mathbf{C}(t, \hat{\mathbf{x}}(t|Y^{t-1})) \qquad (45)$$

is the innovation for observation $\mathbf{y}_i(t)$. The combined update is then

$$\hat{\mathbf{x}}(t|Y^t) = \hat{\mathbf{x}}(t|Y^{t-1}) + \mathbf{G}_F(t, \hat{\mathbf{x}}(t|Y^{t-1}))\,\boldsymbol{\alpha}(t, \hat{\mathbf{x}}(t|Y^{t-1})) \qquad (46)$$
where the combined innovation is

$$\boldsymbol{\alpha}(t, \hat{\mathbf{x}}(t|Y^{t-1})) = \sum_{i=1}^{m_t} \beta_i(t)\,\boldsymbol{\alpha}_i(t, \hat{\mathbf{x}}(t|Y^{t-1})) \qquad (47)$$
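The PDAF update (45)–(47) can be sketched as follows, with the association probabilities $\beta_i(t)$ assumed to be supplied by the clutter model (their computation is omitted here). The names and data layout are illustrative assumptions only.

```python
import numpy as np

def pdaf_update(x_pred, G, y_obs, beta, h):
    """Combined PDAF state update of Eqs. (45)-(47).

    x_pred : predicted state x_hat(t|Y^{t-1})
    G      : Kalman gain G_F(t, x_pred)
    y_obs  : list of observations y_i(t), i = 1..m_t
    beta   : association probabilities beta_i(t), i = 0..m_t
             (beta[0] is the probability that all observations are clutter)
    h      : observation functional C(t, x)
    """
    y_hat = h(x_pred)
    # Per-observation innovations alpha_i(t), Eq. (45)
    alphas = [y - y_hat for y in y_obs]
    # Combined innovation, Eq. (47); the clutter event theta_0 contributes nothing
    alpha = sum((b * a for b, a in zip(beta[1:], alphas)), np.zeros_like(y_hat))
    return x_pred + G @ alpha                  # Eq. (46)
```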
The Riccati equation (22–23) must be suitably modified to account for the additional uncertainty associated with the multiple innovations $\{\boldsymbol{\alpha}_i(t)\}$, as well as the possibility of the null event $\theta_0(t)$; see Bar-Shalom and Fortmann [10, §6.4] for details.

3.4 Joint Probabilistic Data Association Filter
The JPDAF is an extension of the PDAF to the case of multiple targets. Consider the set $\mathbf{Y}(t) = \{\mathbf{y}_i(t)\}_{i=1}^{m_t}$ of all observations occurring at time instant $t$ and let $Y^{t-1} = \{\mathbf{Y}(i)\}_{i=0}^{t-1}$ denote the set of all past observations. The first step in the JPDA algorithm is the evaluation of the conditional probabilities of the joint association events

$$\theta = \bigcap_{i=1}^{m_t} \theta_{ik_i}$$

where the atomic events are defined as

$$\theta_{ik} = \{\text{observation } i \text{ originated from target } k\}$$

for all $i = 1, \ldots, m_t$; $t = 0, 1, \ldots, T$. Here, $k_i$ denotes the index of the target to which the $i$-th observation is associated in the event currently under consideration. A feasible event is defined as an event wherein

1. An observation has exactly one source, which can be the clutter model;
2. No more than one observation can originate from any target.

In the acoustic person tracking application, where the observations are peaks in the cross correlation function for pairs of microphones, the second point must be interpreted as referring to the observations for any given pair of microphones. Applying Bayes' rule, the conditional probability of $\theta(t)$ can be expressed as

$$P\{\theta(t)|Y^t\} = P\{\theta(t)|\mathbf{Y}(t), Y^{t-1}\} = \frac{P\{\mathbf{Y}(t)|\theta(t), Y^{t-1}\}\,P\{\theta(t)|Y^{t-1}\}}{P\{\mathbf{Y}(t)|Y^{t-1}\}} = \frac{P\{\mathbf{Y}(t)|\theta(t), Y^{t-1}\}\,P\{\theta(t)\}}{P\{\mathbf{Y}(t)|Y^{t-1}\}} \qquad (48)$$
where the marginal probability $P\{\mathbf{Y}(t)|Y^{t-1}\}$ is computed by summing the joint probability in the numerator of (48) over all possible $\theta(t)$. The conditional probability of $\mathbf{Y}(t)$ required in (48) can be calculated from

$$P\{\mathbf{Y}(t)|\theta(t), Y^{t-1}\} = \prod_{i=1}^{m_t} p(\mathbf{y}_i(t)|\theta_{ik_i}(t), Y^{t-1}) \qquad (49)$$
The individual probabilities on the right side of (49) can be easily evaluated given the fundamental assumption of the JPDAF, namely,

$$\mathbf{y}_i(t) \sim \mathcal{N}\!\left(\hat{\mathbf{y}}_{k_i}(t|Y^{t-1}), \mathbf{R}_{k_i}(t)\right)$$

where $\hat{\mathbf{y}}_{k_i}(t|Y^{t-1})$ is the predicted observation for target $k_i$ from (17), and $\mathbf{R}_{k_i}(t)$ is the innovation covariance matrix for target $k_i$ from (20). The prior probability $P\{\theta(t)\}$ in (48) can be readily evaluated through combinatorial arguments [10, §9.3]. Once the posterior probabilities of the joint events $\{\theta(t)\}$ have been evaluated for all targets together, the state update for each target can be made separately according to (45–47). For any given target, it is only necessary to marginalize out the effect of all other targets to obtain the required posterior probabilities $\{\beta_i(t)\}$.

As the JPDAF can track multiple targets, it was necessary to formulate rules for deciding when a new target should be created, when two targets should be merged and when a target should be deleted. A new target was always created as soon as a measurement could not be associated with any existing target. But if the time to initialize the filter exceeded a time threshold, the newly created target was immediately deleted. The initialization time of the filter is defined as the time required until the variance of each dimension of $\boldsymbol{\epsilon}(t, t-1)$ in (21) fell below a given threshold. Normally this initialization time is relatively short for a target that emits sufficient measurements and long for spurious noises. To merge two or more targets, a list was maintained with the timestamp when the two targets became closer than a given distance. If, after some allowed interval of overlap, the two targets did not move apart, then the target with the larger $|\mathbf{K}(t, t-1)|$ was deleted. In all cases, targets were deleted if their position estimate had not been updated for a given length of time. To detect the active sound source, we simply used the target with the smallest error covariance matrix, since an active sound source should emit enough measurements for its covariance to decrease, while the covariances of inactive targets increase at the same time.
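The target-management rules just described can be summarized schematically in code; the concrete thresholds and the track container below are assumptions, since the paper does not publish numeric values, and the proximity-based merge rule is omitted for brevity.

```python
from dataclasses import dataclass

# Illustrative thresholds only; the paper does not give concrete values.
INIT_TIME_LIMIT = 1.0   # s, maximum allowed initialization time
STALE_TIMEOUT = 5.0     # s, delete targets not updated for this long


@dataclass
class Track:
    state: object        # current state estimate
    cov_det: float       # |K(t, t-1)|, determinant of the predicted error covariance
    created_at: float
    initialized: bool    # True once the variance of each dimension fell below threshold
    last_update: float


def manage_tracks(tracks, now):
    """Apply the deletion rules: drop tracks that initialize too slowly
    (spurious noise) and tracks whose position has not been updated recently."""
    kept = []
    for trk in tracks:
        if not trk.initialized and now - trk.created_at > INIT_TIME_LIMIT:
            continue
        if now - trk.last_update > STALE_TIMEOUT:
            continue
        kept.append(trk)
    return kept


def active_source(tracks):
    """The active sound source is the track with the smallest error covariance."""
    return min(tracks, key=lambda trk: trk.cov_det) if tracks else None
```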
4 Experiments
The test set used to evaluate the algorithms proposed here contains approximately three hours of audio and video data recorded during 18 seminars held by students and faculty at the University of Karlsruhe (UKA) in Karlsruhe, Germany. An additional hour of test data was recorded at Athens Information Technology in Athens, Greece, IBM at Yorktown Heights, New York, USA, Instituto Trentino di Cultura in Trento, Italy, and Universitat Politecnica de Catalunya in Barcelona, Spain. These recordings were made in connection with the European Union integrated project CHIL, Computers in the Human Interaction Loop. In the sequel, we describe our speaker tracking and STT experiments.

Prior to the start of the recordings, four video cameras in the corners of the room had been calibrated with the technique of Zhang [15]. The location of the centroid of the speaker's head in the images from the four calibrated video cameras was manually marked every second. Using these hand-marked labels, the true position of the speaker's head in the three dimensions was calculated using the technique described in [15]. These "ground truth" speaker positions are accurate to within 10 cm. For the speaker tracking experiments described here, the seminars were recorded with several four-element T-shaped arrays. A precise description of the sensor and room configuration at UKA is provided in [3].

Tracking performance was evaluated only on those parts of the seminars where only a single speaker was active. For these parts, it was determined whether the error between the ground truth and the estimated position is less than 50 cm. Any instance where the error exceeded this threshold was treated as a false positive (FP) and was not considered when calculating the multiple object tracking precision (MOTP), which is defined as the average horizontal position error. If no estimate fell within 50 cm of the ground truth, it was treated as a miss. Letting $N_{\text{fp}}$ and $N_{\text{m}}$, respectively, denote the total number of false positives and misses, the multiple object tracking error (MOTE) is defined as $(N_{\text{fp}} + N_{\text{m}})/N$, where $N$ is the total number of ground truth positions. We evaluated performance separately for the portion of the seminar during which only the lecturer spoke, and that during which the lecturer interacted with the audience.

Shown in Table 2 are the results of our experiments. These results clearly show that the JPDAF provided better tracking performance for both the lecture and interactive portions of the seminar. As one might expect, the reduction in MOTE was largest for the interactive portion, where multiple speakers were often simultaneously active.

Table 2. Speaker tracking performance for IEKF and JPDAF systems

  Filter   Test Set      MOTP (cm)   % Miss   % FP    % MOTE
  IEKF     lecture       11.4        8.32     8.30    16.6
  IEKF     interactive   18.0        28.75    28.75   57.5
  IEKF     complete      12.1        10.37    10.35   20.7
  JPDAF    lecture       11.6        5.81     5.78    11.6
  JPDAF    interactive   17.7        19.60    19.60   39.2
  JPDAF    complete      12.3        7.19     7.16    14.3
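For reference, the scoring just described can be sketched for the single-speaker case as follows; an estimate farther than the 50 cm gate is counted both as a false positive and as a miss, which is consistent with the nearly equal miss and false-positive rates in Table 2. The function name and data layout are assumptions for illustration.

```python
import math

def score_tracking(ground_truth, estimates, gate=0.5):
    """Simplified MOTP / MOTE computation for single-speaker segments.

    ground_truth, estimates : time-aligned lists of (x, y) positions in metres;
                              an estimate of None denotes a frame with no output.
    gate                    : association threshold (50 cm in the evaluation).
    """
    errors, misses, false_positives = [], 0, 0
    for truth, est in zip(ground_truth, estimates):
        if est is None:
            misses += 1
            continue
        err = math.dist(truth, est)
        if err < gate:
            errors.append(err)           # counted toward MOTP
        else:
            false_positives += 1         # estimate too far from ground truth
            misses += 1                  # and no estimate within the 50 cm gate
    n = len(ground_truth)
    motp = sum(errors) / len(errors) if errors else float("nan")
    mote = (false_positives + misses) / n if n else float("nan")
    return motp, mote
```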
References

1. M. Omologo and P. Svaizer, "Acoustic event localization using a crosspower-spectrum phase based technique," in Proc. ICASSP, vol. II, 1994, pp. 273–6.
2. S. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Englewood Cliffs, NJ: Prentice-Hall, 1993.
3. U. Klee, T. Gehrig, and J. McDonough, "Kalman filters for time delay of arrival-based source localization," Journal of Advanced Signal Processing, Special Issue on Multi-Channel Speech Processing, to appear.
4. M. S. Brandstein, J. E. Adcock, and H. F. Silverman, "A closed-form location estimator for use with room environment microphone arrays," IEEE Trans. Speech Audio Proc., vol. 5, no. 1, pp. 45–50, January 1997.
5. T. Gehrig, K. Nickel, H. K. Ekenel, U. Klee, and J. McDonough, "Kalman filters for audio-video source localization," in Proc. Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, 2005.
6. N. Strobel, S. Spors, and R. Rabenstein, "Joint audio-video signal processing for object localization and tracking," in Microphone Arrays, M. Brandstein and D. Ward, Eds. Heidelberg, Germany: Springer Verlag, 2001, ch. 10.
7. G. Welch and G. Bishop, "SCAAT: Incremental tracking with incomplete information," in Proc. Computer Graphics and Interactive Techniques, August 1997.
8. G. Gennari and G. D. Hager, "Probabilistic data association methods in the visual tracking of groups," in Proc. CVPR, 2004, pp. 1063–1069.
9. D. Bechler, "Akustische Sprecherlokalisation mit Hilfe eines Mikrofonarrays," Ph.D. dissertation, Universität Karlsruhe, Karlsruhe, Germany, 2006.
10. Y. Bar-Shalom and T. E. Fortmann, Tracking and Data Association. San Diego: Academic Press, 1988.
11. J. Ajmera, G. Lathoud, and I. McCowan, "Clustering and segmenting speakers and their locations in meetings," in Proc. ICASSP, 2004, pp. I–605–8.
12. J. Chen, J. Benesty, and Y. A. Huang, "Robust time delay estimation exploiting redundancy among multiple microphones," IEEE Trans. Speech Audio Proc., vol. 11, no. 6, pp. 549–57, November 2003.
13. A. H. Jazwinski, Stochastic Processes and Filtering Theory. New York: Academic Press, 1970.
14. L. Armani, M. Matassoni, M. Omologo, and P. Svaizer, "Use of a CSP-based voice activity detector for distant-talking ASR," in Proc. Eurospeech, vol. II, 2003, pp. 501–4.
15. Z. Zhang, "A flexible new technique for camera calibration," IEEE Trans. Pattern Analysis Machine Intel., vol. 22, pp. 1330–1334, 2000.
2D Person Tracking Using Kalman Filtering and Adaptive Background Learning in a Feedback Loop Aristodemos Pnevmatikakis and Lazaros Polymenakos Athens Information Technology, Autonomic and Grid Computing, P.O. Box 64, Markopoulou Ave., 19002 Peania, Greece {apne,lcp}@ait.edu.gr http://www.ait.edu.gr/research/RG1/overview.asp
Abstract. This paper proposes a system for tracking people in video streams, returning their body and head bounding boxes. The proposed system comprises a variation of Stauffer’s adaptive background algorithm with spacio-temporal adaptation of the learning parameters and a Kalman tracker in a feedback configuration. In the feed-forward path, the adaptive background module provides target evidence to the Kalman tracker. In the feedback path, the Kalman tracker adapts the learning parameters of the adaptive background module. The proposed feedback architecture is suitable for indoors and outdoors scenes with varying background and overcomes the problem of stationary targets fading into the background, commonly found in variations of Stauffer’s adaptive background algorithm.
1 Introduction

Target tracking in video streams has many applications, like surveillance, security, smart spaces [1], pervasive computing, and human-machine interfaces [2] to name a few. In these applications the targets are either human bodies, or vehicles. The common property of these targets is that sooner or later they exhibit some movement which is evidence that distinguishes them from the background and identifies them as foreground targets. The segmentation of foreground objects can be accomplished by processing the difference of the current frame from a background image. This background image can be static [3] or can be computed adaptively [4]. The drawback of the static background image is that background does change. In outdoor scenes natural light changes and the wind causes movement of trees and other objects. In indoor scenes, artificial light flickers and pieces of furniture are moved around. All such effects can be learned by an adaptive background algorithm [5] and any of its modifications, like [6,7]. Such an algorithm detects targets as segments different from the learned background, but depends on the targets' movement to keep a fix on them. If they stop, the background learning process fades them into the background. Once a target is initialized, a tracking system should be able to keep a fix on it even when it remains immobile for some time. In this paper, we propose a novel tracking system that addresses many of the above mentioned limitations by utilizing a
feedback mechanism from the tracking module to the adaptive background module which in turn provides the evidence for each target to the tracking module. We control the adaptive background parameters on a pixel level for every frame (spacio-temporal adaptation), based on a prediction of the position of the target. Under the assumption of Gaussian-like targets, this prediction can be provided by a Kalman filter [8]. This paper is organized as follows: In section 2 the adaptive background, measurement and Kalman tracking modules are detailed. The results on CLEAR evaluations are presented and discussed in section 3. Finally, in section 4 the conclusions are drawn, followed by some indications for further work.
2 Tracking System

The block diagram of the tracking system is shown in Figure 1. It comprises three modules: adaptive background, measurement and Kalman filtering. The adaptive background module produces the foreground pixels of each video frame and passes this evidence to the measurement module. The measurement module associates the foreground pixels to targets, initializes new ones if necessary and manipulates existing targets by merging or splitting them based on an analysis of the foreground evidence. The existing or new target information is passed to the Kalman filtering module to update the state of the tracker, i.e. the position, velocity and size of the targets. The output of the tracker is the state information which is also fed back to the adaptive background module to guide the spacio-temporal adaptation of the algorithm. In the rest of the section, we present the three modules in detail.
Fig. 1. Block diagram of the complete feedback tracker architecture
2.1 Adaptive Background Module

The targets of the proposed system (vehicles and humans) are mostly moving. The changes in the video frames due to the movement are used to identify and segment the foreground (pixels of the moving targets) from the background (pixels without movement). If a background image were available, this segmentation is simply the difference of the current frame from the background image. The foreground pixels thus obtained are readily grouped into target regions. A static image of the empty scene viewed by the camera can be used for background [3]. Unfortunately this is not practical, and adaptive background approaches are adopted [4-7], primarily for two reasons: First, such an empty scene image might not be available due to system setup. Secondly and most importantly, background (outdoors and indoors) also changes: Natural light conditions change slowly as time goes by; the wind causes swaying movements of flexible background objects (e.g. foliage); fluorescent light flickers at the power supply frequency; objects on tabletops and small pieces of furniture are rearranged and projection areas display different content. All these changes need to be learnt into an adaptive background model. Stauffer's adaptive background algorithm [5] is capable of learning changes that occur at such different speeds by learning into the background any pixel whose color in the current frame resembles the colors that this pixel often has. So no changes, periodic changes or changes that occurred in the distant past lead to pixels that are considered background. To do so, a number of weighted Gaussians model the appearance of different colors in each pixel. The weights indicate the amount of time the modeled color is active in that particular pixel. The mean is a three-dimensional vector indicating the color modeled for that pixel, while the covariance matrix indicates the extent around the mean within which a color of that pixel is considered similar to the one modeled. Colors in any given pixel similar to that modeled by any of the Gaussians of that pixel lead to an update of that Gaussian, an increase of its weight and a decrease of all the weights of the other Gaussians of that pixel. Colors not matching any of the Gaussians of that pixel lead to the introduction of a new Gaussian with minimum weight. Hence the possible updates of the weight of the i-th Gaussian of the pixel located at (x, y) at time t are
$$w_i(x, y, t) = \begin{cases} a & \text{new Gaussian} \\ (1 - a)\,w_i(x, y, t-1) & \text{non-matching Gaussians} \\ (1 - a)\,w_i(x, y, t-1) + a & \text{matching Gaussians} \end{cases} \qquad (1)$$
where a is the learning rate. Some variations of the Stauffer algorithm found in the literature deal with the way covariance is represented (single value, diagonal of full matrix) and the way the mean and covariance of the Gaussians are updated [6]. Some further variations of the algorithm address the way the foreground information is represented. The original algorithm and most of the modifications lead to a binary decision for each pixel: foreground or background [5,6]. In [7], the Pixel Persistence Map (PPM) is used instead. This is a map of the same dimension as the frames with a value at each location (x, y) equal to the weight of the Gaussian matching the current color of the
pixel at (x, y). Small PPM values indicate foreground objects, while large values indicate background. The foreground/background threshold is left unspecified though. The drawback of all the existing variations of Stauffer's algorithm is that stationary foreground objects tend to fade into the background with rate a. Small rates fade foreground objects slowly, but are also slow in adapting to background changes, like the motion of a chair. Large rates favor background adaptation but tend to fade a target into the background when it stops. This fading progressively destroys the region of the tracked object, deforms its perceived shape and finally leads to losing track of the object altogether. When the target resumes moving, foreground pixels will be marked only at the locations not previously occupied by the stationary target. When the target has fairly uniform coloration, this can lead to track loss even in the presence of movement. We propose a feedback tracking architecture in order to address these problems. The thresholded PPM serves as target evidence to the Kalman tracker. The state of the Kalman tracker contains the ellipse that describes every target. The learning rate is modified in elliptical regions around these targets. Thus, instead of a constant value, a spacio-temporal adaptation of the learning rate is used:
$$a(x, y, t) = \begin{cases} \text{large} & \text{if } (x, y) \text{ not near target at time } t \\ \text{small} & \text{if } (x, y) \text{ near target at time } t \end{cases} \qquad (2)$$
This delays fading of the targets and, depending on the selection of the small learning rate and the motion of the targets, can be sufficient. In some cases, though, where targets stay put for very long periods, even the small learning rate will gradually fade them into the background. If this starts happening (the target becomes smaller while its mobility is small), the normal weight update mechanism of (1) is bypassed. The weight of the current Gaussian is decreased and that of all the rest is increased with a rate that is inversely proportional to the mobility of the target, as this is estimated from the state of the Kalman tracker for this particular target. This fading prevention mechanism is not always in effect; it is only activated when targets are small and rather immobile, since the tampering of the weights is very forceful and affects the whole elliptical disk around the target, regardless of whether the pixel is actually foreground or not.

The second major proposed modification of Stauffer's algorithm addresses extreme flickering situations often encountered in night vision cameras. In such scenes the PPM needs to be bounded by a very low threshold in order not to consider flickering pixels as foreground. The threshold, on the other hand, tends to discard actual foreground pixels as well. The proposed solution is to adapt the threshold T in a spacio-temporal fashion similar to the learning rate in (2), i.e.

$$T(x, y, t) = \begin{cases} \text{small} & \text{if } (x, y) \text{ not near target at time } t \\ \text{large} & \text{if } (x, y) \text{ near target at time } t \end{cases} \qquad (3)$$
This way flickering pixels are avoided far from the targets, while the targets themselves are not affected. The penalty of this strategy is the delayed detection of new very small targets.
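The two adaptations (2) and (3) amount to building per-pixel maps of the learning rate and of the PPM threshold from the predicted target ellipses. The sketch below illustrates this; the concrete "large"/"small" values and the ellipse parameterization are assumptions, as the paper does not specify them.

```python
import numpy as np

# Illustrative values only; the paper does not publish the actual constants.
A_LARGE, A_SMALL = 0.01, 0.001
T_SMALL, T_LARGE = 0.1, 0.3


def near_target_mask(shape, targets):
    """Boolean mask of pixels inside any predicted target ellipse.
    Each target is (cx, cy, rx, ry): ellipse centre and semi-axes in pixels."""
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    mask = np.zeros(shape, dtype=bool)
    for cx, cy, rx, ry in targets:
        mask |= ((xx - cx) / rx) ** 2 + ((yy - cy) / ry) ** 2 <= 1.0
    return mask


def adaptation_maps(shape, targets):
    """Per-pixel learning rate a(x, y, t) and threshold T(x, y, t), Eqs. (2)-(3)."""
    near = near_target_mask(shape, targets)
    a_map = np.where(near, A_SMALL, A_LARGE)     # Eq. (2)
    t_map = np.where(near, T_LARGE, T_SMALL)     # Eq. (3)
    return a_map, t_map
```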
These proposed feedback mechanisms on the learning rate lead to robust foreground regions regardless of the flickering in the images or the lack of target mobility, while they do not affect the adaptation of the background around the targets. When such flickering and mobility conditions occur, the resulting PPM is more suitable for target region forming than the original version of [7]. The forming of target regions is the goal of the measurement module, detailed next.

2.2 Measurement Module
The measurement module finds foreground segments, assigns them to known targets or initializes new ones and checks targets for possible merging or splitting. The information for new targets or targets to be updated is passed to the Kalman module. The measurement process begins by processing the adaptively thresholded PPM to obtain foreground segments. This involves shadow detection based on [9], dilation, filling of any holes in the segments and erosion. The obtained segments are checked for possible merging based on their Mahalanobis distance and are further considered only if they are large enough. These segments are associated to targets based on their Mahalanobis distance from the targets. Non-associated segments generate new target requests to the Kalman module. The targets are subsequently checked for possible merging based on how similar they are. Since we are using a Kalman tracker, the targets are described by two-dimensional Gaussians [8]. If two such Gaussians are too similar, the targets are merged. Finally, very large targets are checked for splitting. This is necessary as, for example, two monitored people can be walking together and then separate their tracks. Splitting is performed using the k-means algorithm on the pixels of the foreground segment comprising the target. Two parts are requested from the k-means algorithm. These parts are subsequently checked to determine if they are distinct. For this, the minimum Mahalanobis distance of the one with respect to the other is used. If the two parts are found distinct, then they form two targets. The one part of the foreground evidence is used to update the existing target, while the other part is used to request a new target from the Kalman tracker. All the found targets are then processed to identify the number of bodies in them and detect the heads. This is done by processing the height of the target as a function of its column number. The height is measured from the bottom of the box bounding the target. The processing identifies peaks that correspond to heads and valleys that correspond to points at which the target can be split into more than one body. The process is illustrated in Figure 2 and works well with upright people. Finally, heads are found by examining the smoothed derivative of the width of the detected peaks. As at the shoulders the width of the body increases rapidly, this point can be easily detected. If the lighting conditions are normal, the face position can be refined inside the head region using skin color histograms [10], as in [3]. Also, if resolution is adequate, an eye detector like the one in [2] can be used to estimate the eye positions and from those, infer the face position. Finally, frontal, upright faces can be detected using the boosting algorithm of Viola and Jones [11]. Since lighting conditions, resolution and face frontality are not guaranteed in the intended applications, none of these approaches are used to refine the face position.
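The segment-to-target association step can be illustrated with a simple Mahalanobis-distance gate against each target's Gaussian; the gate value, function names, and data layout below are assumptions for this sketch and are not taken from the paper.

```python
import numpy as np

MAHALANOBIS_GATE = 3.0   # assumed association gate, in standard deviations


def mahalanobis(point, mean, cov):
    """Mahalanobis distance of a segment centroid from a target Gaussian."""
    d = point - mean
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))


def associate_segments(segments, targets):
    """Assign each foreground segment centroid to the nearest target (by
    Mahalanobis distance); unassigned segments become new-target requests.

    segments : list of 2-D centroids
    targets  : list of (mean, covariance) pairs describing target Gaussians
    """
    assignments, new_targets = {}, []
    for idx, centroid in enumerate(segments):
        dists = [mahalanobis(centroid, mean, cov) for mean, cov in targets]
        if dists and min(dists) < MAHALANOBIS_GATE:
            assignments[idx] = int(np.argmin(dists))
        else:
            new_targets.append(idx)
    return assignments, new_targets
```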
Fig. 2. Processing a target to extract bodies and heads. (a) Scene of a target with the two associated bodies and heads marked. (b) PPM of the target and 2D Gaussian approximation. (c) Target height profile (blue line) used to identify the peaks and valleys that correspond to the head tops (red circles) and body splitting points (vertical red lines) respectively. Head estimates are also marked with black lines.
2.3 Kalman Tracking Module
The Kalman module maintains the states of the targets. It creates new targets should it receive a request from the measurement module and performs the measurement update based on the foreground segments associated to the targets. The states of the targets are fed back to the adaptive background module to adapt the learning rate and the threshold for the PPM binarization. States are also eliminated if they have no foreground segments associated to them for 15 frames. Every target is approximated by an elliptical disc, i.e. it can be described by a single Gaussian. This facilitates the use of a Kalman tracker. The target states are seven-dimensional; they comprise the mean of the Gaussian describing the target (horizontal and vertical components), the velocity of the mean (horizontal and vertical components) and the three independent terms of the covariance matrix. The prediction step uses a loose dynamic model of constant velocity [12] for the update of the mean position and velocity. As for the update of the three covariance terms, their exact model is non-linear, hence cannot be used with the Kalman tracker; instead of using linearization and an extended Kalman tracker, the covariance terms are modeled as constant. The variations of the velocity and the covariance terms are permitted by the state update variance term. This loose dynamic model permits arbitrary movement of the targets. It is very different from the more elaborate models used for tracking aircraft. Aircraft can perform a limited set of maneuvers that can be learned and expected by the tracking system. Further, flying aircraft can be modeled as rigid bodies; thus strict and multiple dynamic models are appropriate and have been used extensively in Interacting Multiple Model Kalman trackers [13-14]. Unlike aircraft, street vehicles and especially humans have more degrees of freedom in their movement, which includes, apart from speed and direction changes, maneuvering around obstacles arbitrarily, rendering the learning of a strict dynamic model impractical. A strict dynamic model in this case can mislead a tracker to a particular track even in the presence of contradicting evidence [15].
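The seven-dimensional state and the loose constant-velocity model translate directly into a standard Kalman prediction step; the state ordering and the process-noise value below are placeholders chosen for illustration, not the paper's settings.

```python
import numpy as np

def constant_velocity_transition(dt):
    """State transition for the 7-D target state
    [mx, my, vx, vy, cxx, cxy, cyy]: constant velocity for the mean,
    constant (random-walk) model for the three covariance terms."""
    F = np.eye(7)
    F[0, 2] = dt   # mean x advances by vx * dt
    F[1, 3] = dt   # mean y advances by vy * dt
    return F


def predict(state, P, dt, q=1e-2):
    """Kalman prediction step with an isotropic process-noise placeholder q,
    which absorbs the (unmodeled) variations of velocity and covariance terms."""
    F = constant_velocity_transition(dt)
    Q = q * np.eye(7)
    return F @ state, F @ P @ F.T + Q
```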
3 CLEAR Evaluation Results

The proposed feedback tracking architecture is tested on the CLEAR evaluations (video sequences coming from the CHIL and VACE projects). In this section we show the effect of the algorithm on the data, more specifically, its successes and problems when it is applied both in indoor and outdoor environments. Figure 3 shows the effect of the spacio-temporal adaptation of the threshold for the binarization of the PPM in the adaptive background module. The morphological processing of the thresholded PPM can result in false alarms if the threshold is not adapted by the states of the Kalman tracker.
Fig. 3. Tracking in outdoor night video (a) with (b) and without (c) the proposed feedback that spacio-temporally adapts the threshold for the binarization of the PPM in the adaptive background module. Without the proposed scheme, night camera flicker generates false alarm segments, one of which exceeds the size threshold and initiates a false target (marked by the yellow circle).
Figures 4 and 5 show the effect of the spacio-temporal learning rate adaptation on the PPM when a target remains stationary. When the proposed adaptation is not used, stationary targets fade, so that the system either loses track (Figure 4) or has reduced tracking accuracy (Figure 5). Problems do arise when employing the algorithm as is on the available data. They have to do with cases that the algorithm is not currently designed to cope with. Firstly, the algorithm assumes that the background can be learned, i.e. that at start-up, either the view is empty of foreground objects, or these objects move significantly. When a foreground object exists at start-up, it is learned into the background. It becomes foreground upon moving. Unfortunately the thresholded PPM in this case comprises the outline of the object, not a complete region (see Figure 6). This can be taken into account in the measurement module, utilizing a contour tracking algorithm instead of the morphological processing. This is not used in the current implementation, leading to partial target identification, or misses.
Fig. 4. Tracking with (a) and without (b) the proposed feedback that spacio-temporally adapts the learning rate of the adaptive background module. Without the proposed scheme, the top-right target is lost entirely, whereas the center-left one is in the process of fading (see the PPM on the right column of the figure). The moving bottom-right target is not affected.
Fig. 5. Tracking with (a) and without (b) the proposed feedback that spacio-temporally adapts the learning rate of the adaptive background module. Without the proposed scheme, the stationary target is no longer tracked. Instead, the system tracks the chair from which the target started moving.
Fig. 6. Effect of applying the proposed algorithm on scenes with little movement and the targeted people present at start-up. (a) The scene and (b) the PPM. The target outline is only present in the PPM, resulting in two partial detections and a miss.
The second problem arises with very large targets together with very small ones, like those of close-up vehicles together with far-away people. The current implementation attempts to segment the large targets, splitting the vehicles into parts. Some vehicle/person discrimination ability needs to be implemented in future versions of the algorithm. The third problem has to do with the use of the Kalman filter itself. Kalman filtering necessitates the approximation of the tracked objects by two-dimensional Gaussians. This can be troublesome depending on the nature of the tracked objects
and the camera position. Gaussians are sufficient approximations to vehicles. They are also sufficient approximations to human bodies when the camera viewing conditions are far-field (fisheye ceiling or road surveillance cameras) and the dynamic model of the Kalman tracker is loose. The limbs lead to important deviations from the Gaussian model for close viewing conditions. In such conditions, multiple occluding targets are common and the loose dynamic model is no longer capable of tracking. To overcome the problem of many, occluding, non-Gaussian-like targets, future extensions of the proposed tracking architecture will replace the Kalman tracker with the CONDENSATION algorithm [16]. These problems, coupled with the fact that, for the system to remain generic and not tied to particular conditions (indoors or outdoors), none of the specialized face detectors mentioned in section 2.2 have been used, lead to degraded performance in the CLEAR evaluations. Referring to [17] for the definitions of the evaluation metrics, the performance of the algorithm is shown in Table 1.

Table 1. Face detection performance on the CLEAR evaluation data for seminars
                          Correctly detected   Wrong            Non-detected          Mean weighted    Mean extension
                          faces (%)            detections (%)   (missing) faces (%)   error (pixels)   accuracy
  Seminars                12.08                137.38           1.39                  0.33             76.17
  Interactive seminars    11.08                94.50            17.62                 0.34             132.97
4 Conclusions

The proposed tracking architecture, with the adaptive background and the Kalman tracking modules in a feedback configuration, combines the immunity of Stauffer's algorithm to background changes (like lighting, camera flicker or furniture movement) with the target stability of a static background, no matter whether the targets move or not. Utilizing the Kalman tracker, gates are effectively built around the tracked targets that allow association of the foreground evidence to the targets. The CLEAR evaluations have shown that the proposed algorithm can be improved to handle initially non-empty monitored spaces and non-Gaussian-like targets. Also, the head detector can be improved following a combination of the techniques mentioned in section 2.2.
Acknowledgements

This work is sponsored by the European Union under the integrated project CHIL, contract number 506909. The authors wish to thank the organizers of the CLEAR evaluations and acknowledge the use of data coming from the VACE project for testing the algorithm.
References

[1] A. Waibel, H. Steusloff, R. Stiefelhagen, et al.: CHIL: Computers in the Human Interaction Loop, 5th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), Lisbon, Portugal, (Apr. 2004).
[2] A. Pnevmatikakis, F. Talantzis, J. Soldatos and L. Polymenakos: Robust Multimodal Audio-Visual Processing for Advanced Context Awareness in Smart Spaces, Artificial Intelligence Applications and Innovations, Peania, Greece, (June 2006).
[3] H. Ekenel and A. Pnevmatikakis: Video-Based Face Recognition Evaluation in the CHIL Project – Run 1, Int. Conf. Pattern Recognition, Southampton, UK, (Mar. 2006), 85-90.
[4] A. McIvor: Background Subtraction Techniques, Image and Vision Computing New Zealand, (2000).
[5] C. Stauffer and W. E. L. Grimson: Learning patterns of activity using real-time tracking, IEEE Trans. on Pattern Anal. and Machine Intel., 22, 8 (2000), 747–757.
[6] P. KaewTraKulPong and R. Bowden: An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection, in Proc. 2nd European Workshop on Advanced Video Based Surveillance Systems (AVBS01), (Sept 2001).
[7] J. L. Landabaso and M. Pardas: Foreground regions extraction and characterization towards real-time object tracking, in Proceedings of Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms (MLMI '05), (July 2005).
[8] R. E. Kalman: A New Approach to Linear Filtering and Prediction Problems, Transactions of the ASME – Journal of Basic Engineering, 82 (Series D), (1960) 35-45.
[9] L.-Q. Xu, J. L. Landabaso and M. Pardas: Shadow Removal with Blob-Based Morphological Reconstruction for Error Correction, IEEE International Conference on Acoustics, Speech, and Signal Processing, (March 2005).
[10] M. Jones and J. Rehg: Statistical color models with application to skin detection, Computer Vision and Pattern Recognition, (1999), 274–280.
[11] P. Viola and M. Jones: Rapid Object Detection using a Boosted Cascade of Simple Features, IEEE Conf. on Computer Vision and Pattern Recognition, (2001).
[12] S.-M. Herman: A particle filtering approach to joint passive radar tracking and target classification, PhD thesis, University of Illinois at Urbana-Champaign, (2002), 51-54.
[13] H. A. P. Blom and Y. Bar-Shalom: The interactive multiple model algorithm for systems with Markovian switching coefficients, IEEE Trans. Automatic Control, 33 (Aug. 1988), 780-783.
[14] G. A. Watson and W. D. Blair: IMM algorithm for tracking targets that maneuver through coordinated turns, in Proc. of SPIE Signal and Data Processing of Small Targets, 1698 (1992), 236-247.
[15] D. Forsyth and J. Ponce: Computer Vision - A Modern Approach, Prentice Hall, (2002), 489-541.
[16] M. Isard and A. Blake: CONDENSATION - conditional density propagation for visual tracking, Int. J. Computer Vision, 29, (1998), 5-28.
[17] D. Mostefa et al.: CLEAR Evaluation Plan, document CHIL-CLEAR-V1.1-2006-02-21, (Feb 2006).
PittPatt Face Detection and Tracking for the CLEAR 2006 Evaluation Michael C. Nechyba and Henry Schneiderman Pittsburgh Pattern Recognition 40 24th Street, Suite 240, Pittsburgh, PA 15222, USA
[email protected] http://www.pittpatt.com
Abstract. This paper describes Pittsburgh Pattern Recognition’s participation in the face detection and tracking tasks for the CLEAR 2006 evaluation. We first give a system overview, briefly explaining the three main stages of processing: (1) frame-based face detection; (2) motionbased tracking; and (3) track filtering. Second, we summarize and analyze our system’s performance on two test data sets: (1) the CHIL Interactive Seminar corpus, and (2) the VACE Multi-site Conference Meeting corpus. We note that our system is identically configured for all experiments, and, as such, makes use of no site-specific or domain-specific information; only video continuity is assumed. Finally, we offer some concluding thoughts on future evaluations.
1 System Description

1.1 Frame-Based Face Detection
In the first stage of processing, our system finds candidate faces in individual gray-scale video frames using the standard Schneiderman face finder [1][2] installed for the PittPatt web demo (http://demo.pittpatt.com). However, for the purposes of these evaluations, we have made two changes to its configuration. First, given the small face appearances in some of the video data, we configured the face finder to search for faces with an inter-ocular distance as small as 3-4 pixels. This is approximately 50% smaller than in the web demo. Second, we set our normalized detection threshold to be 0.25 (instead of 1.0). While this lower setting generates more false alarms, it also permits more correct detections. Later processing across frames (described in Secs. 1.2 and 1.3) is able to eliminate most of the introduced false alarms, while preserving more correctly detected faces. For each detected face, we retain the following meta-data: (1) face-center location $(x, y)$; (2) face size $s$, $s \geq -4$, where the approximate face dimensions are given by $(\sqrt{2})^s \times (32 \times 24)$; (3) one of five possible pose categories – namely, frontal, right/left profile, and ±24° tilted; and (4) classifier confidence $c$, $c \geq 0.25$.
1.2 Motion-Based Tracking
In motion-based tracking, we exploit the spatio-temporal continuity of video to combine single-frame observations into face tracks, each of which is ultimately associated with a unique subject ID. For space reasons, we cannot describe the tracking algorithm in great detail here. Below, we describe its major components and the overall algorithm; see [3] for a more thorough treatment.

Motion Model: Let $(\mathbf{z}_t, c_t)$, $\mathbf{z}_t = [x_t, y_t, s_t]^T$, denote the face location and size, and the classifier confidence in frame $t$ for a given person. Now, assume that we have a collection of these observations for that person for $t \in [0 \ldots T]$, and, furthermore, assume that the person's motion is governed by a second-order motion model:

$$\hat{\mathbf{z}}_t = \mathbf{a}_0 + \mathbf{a}_1 t + \mathbf{a}_2 t^2 \qquad (1)$$

the parameters of which – $\mathbf{a}_0$, $\mathbf{a}_1$ and $\mathbf{a}_2$ – must be updated with each new frame. To do this update for frame $t$, we minimize $J_{\text{forward}}$:

$$J_{\text{forward}}(\mathbf{a}_0, \mathbf{a}_1, \mathbf{a}_2) = \sum_{k=0}^{t} c_k\,\lambda^{t-k}\,\|\mathbf{z}_k - \hat{\mathbf{z}}_k\|^2, \quad t \in [0 \ldots T] \qquad (2)$$
if we are tracking forward in time, and $J_{\text{backward}}$:

$$J_{\text{backward}}(\mathbf{a}_0, \mathbf{a}_1, \mathbf{a}_2) = \sum_{k=t}^{T} c_k\,\lambda^{k-t}\,\|\mathbf{z}_k - \hat{\mathbf{z}}_k\|^2, \quad t \in [0 \ldots T] \qquad (3)$$
if we are tracking backward in time. Note that each term in the above sums is weighed by two factors: (1) the classifier confidence, thus giving more weight to higher-confidence detections, and (2) an exponential decay $\lambda$, $0 < \lambda < 1$, giving more weight to more recent observations (we set $\lambda = 0.75$). The minimization of eqs. (2) and (3) can be solved recursively through the square root information filter (SRIF) algorithm [4]. This algorithm is mathematically equivalent to weighted recursive least squares, but requires no matrix inversion. We define track confidence $m_t$ as:

$$m_t = 1/\sqrt{|\hat{\boldsymbol{\Sigma}}|} \qquad (4)$$

where $\hat{\boldsymbol{\Sigma}}$ denotes the estimated covariance in $\mathbf{z}_t$, thereby incorporating both classifier confidence and motion-model confidence into $m_t$.

Data Association: The above discussion assumes that the data association problem – the correct matching of IDs across frames – is solved; this is, however, not the case when multiple faces are present. Given a partial face track through frame $t$, we predict $\hat{\mathbf{z}}_{t+1}$ using the current motion-model parameters $(\mathbf{a}_0, \mathbf{a}_1, \mathbf{a}_2)$ and eq. (1). Then we associate a single-frame observation $\mathbf{z}_{t+1}$ in frame $t+1$ with that track if and only if:

$$(x_{t+1} - \hat{x}_{t+1})^2 + (y_{t+1} - \hat{y}_{t+1})^2 < d_{\text{thresh}} \quad \text{and} \quad (s_{t+1} - \hat{s}_{t+1})^2 < s_{\text{thresh}} \qquad (5)$$
and observation zt+1 has not yet been assigned to a different track. Here, dthresh is a size-dependent distance threshold, and sthresh is a logarithmic scale threshold. If no appropriate match is found, we set ct+1 = 0.
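For illustration, the exponentially weighted fit of the second-order motion model (1)–(2) can be computed with a direct weighted least-squares solve; the paper instead solves it recursively with the SRIF, so the sketch below is only an equivalent batch formulation with assumed argument names.

```python
import numpy as np

def fit_forward_motion(times, z, c, lam=0.75):
    """Batch weighted least-squares fit of the motion model (1), minimizing
    J_forward of Eq. (2) for the most recent frame in `times`.

    times : frame indices k = 0..t
    z     : (len(times), 3) array of observations [x_k, y_k, s_k]
    c     : classifier confidences c_k
    """
    times = np.asarray(times, dtype=float)
    z = np.asarray(z, dtype=float)
    w = np.asarray(c, dtype=float) * lam ** (times[-1] - times)  # c_k * lambda^(t-k)
    A = np.stack([np.ones_like(times), times, times ** 2], axis=1)
    W = np.sqrt(w)[:, None]
    # Solve min || W (A a - z) ||^2 for a = [a0; a1; a2]
    coeffs, *_ = np.linalg.lstsq(W * A, W * z, rcond=None)
    return coeffs            # rows a0, a1, a2; each a 3-vector for x, y, s


def predict_observation(coeffs, t):
    """Predicted z_hat at frame t from Eq. (1)."""
    a0, a1, a2 = coeffs
    return a0 + a1 * t + a2 * t ** 2
```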
Track Initiation and Termination: A track is initiated once the classifier confidence $c_t$ and the track confidence $m_t$ exceed acceptance thresholds $c_{\text{accept}}$ and $m_{\text{accept}}$, respectively. A track is terminated once $c_t$ and $m_t$ fall below rejection thresholds $c_{\text{reject}}$ and $m_{\text{reject}}$, respectively. In our system configuration:

$$c_{\text{reject}} < c_{\text{accept}} \quad \text{and} \quad m_{\text{reject}} < m_{\text{accept}} \qquad (6)$$
When tracking forward in time, these settings will tend to drop the initial segment of a track; conversely, when tracking backward in time, the final segment of a track will be dropped. Therefore, if on-line processing is not required (as in the case of recorded video), combining forward and backward tracking will result in the most complete tracks.

Overall Tracking Algorithm: Initially we treat each of the five poses as independent objects that get tracked separately. Thus, for each pose, we first track forward in time, and then track backward in time. This gives us a set of partial pose-dependent forward and backward tracks. Next, we merge tracks across pose, if (1) they temporally overlap, (2) are spatially consistent and (3) are compatible poses. For example, frontal and tilted poses are compatible; left and right profile poses are not. Finally, we merge forward and backward tracks, applying the same criteria as for the pose merges.

1.3 Track Filtering
After motion-based tracking, we finalize our results with a few additional processing steps. While motion-based tracking can successfully track through very short-term (i.e. a few frames) occlusions or missed detections, the track confidence $m_t$ deteriorates quickly for longer time gaps, due to the exponential decay $\lambda$. As a result, incorrect ID splits occur. Therefore, we merge the subject ID of tracks if they meet certain spatial consistency criteria and do not overlap temporally. We apply three principal spatial consistency tests: (1) mean distance between two tracks; (2) covariance-weighted mean distance between two tracks; and (3) distance between the start and end locations of two tracks. Second, we delete low-confidence tracks. Through extensive experiments on development data, we observe that false alarm tracks that survive motion-based tracking are typically characterized by low classifier confidence $c_t$ throughout. Therefore, we eliminate all tracks for which the maximum classifier confidence $c_t$ is less than 2.5 and does not rise above 2.0 for at least 10% of the track's existence. This two-tiered criterion was found to be the most discriminating between false alarm tracks and true face tracks. Third, we delete tracks that exhibit very little movement throughout the lifetime of the track, as long as they do not meet a more stringent confidence test. As with the confidence-based tests above, we observed through experiments that near-stationary tracks are much more likely to be persistent false alarm tracks than true positive tracks. Finally, we adjust the box sizes output by our system through a constant mapping to better conform to the annotation guidelines
for CHIL and VACE-supported tasks, respectively. This final step is the only difference in processing between the CHIL and VACE tasks.
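The two-tiered confidence test for eliminating false-alarm tracks can be written down directly from the thresholds quoted above; the function and argument names are assumptions for this sketch.

```python
def keep_track(confidences, peak_thresh=2.5, sustain_thresh=2.0, sustain_frac=0.10):
    """Two-tiered confidence test for surviving tracks: keep a track only if its
    maximum classifier confidence reaches peak_thresh, or its confidence stays
    above sustain_thresh for at least sustain_frac of the track's existence."""
    if not confidences:
        return False
    above = sum(1 for ct in confidences if ct > sustain_thresh)
    return max(confidences) >= peak_thresh or above / len(confidences) >= sustain_frac
```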
2 System Performance

2.1 Overview
Here, we report our system performance over the VACE and CHIL test data sets for face detection and tracking. Fig. 1 summarizes our results by data set1 , and sub-categorizes results by the site where the source video was originally recorded. The results in Fig. 1 are generated using version 5.1.1 of the USF evaluation software. For the CHIL data, we first convert our system output and the ground-truth annotations to ViPER format to enable use of the USF software. Since the criteria for matching system objects to ground truth objects differs somewhat between the USF software and the CHIL guidelines, Fig. 1 results differ slightly from those obtained by the CHIL scoring tool. Note that we report two different sets of results for the CHIL data set, since two different sets of ground truth annotations exist for these data. CHIL-1 results correspond to the system output for those video frames that we were asked to submit. These frames are spaced every 5 seconds, and, due to unforeseen circumstances, ground truth for these frames had to be generated in a very short time period by the organizers. CHIL-2 results, on the other hand, correspond to ground-truth annotations that were generated and distributed for follow-on tasks, such as head pose estimation. The CHIL-2 frames are spaced every second, and appear more consistent than the CHIL-1 annotations. We are able to report results for both sets of CHIL ground truth because we generated output for all frames during the evaluation. 2.2
2.2 Analysis
CHIL-1 vs. CHIL-2 results: From Fig. 1, we note a small but significant discrepancy between the CHIL-1 and CHIL-2 results, especially with respect to the AIT data set. This discrepancy appears to be caused by the way the CHIL-1 annotations were generated – namely, through interpolation from the higher-frequency CHIL-2 annotations. While interpolation works well when people are relatively stationary, it works much less well when people are moving around, as is most often the case in the AIT data set. For example, Fig. 2 (left) shows two CHIL-1 ground-truth annotations² for the AIT data that are clearly in error; Fig. 2 (right) shows our corresponding system output³ for those frames. Note how the incorrect ground truth distorts performance estimation. For these
¹ VACE results exclude clips #04 and #31, since no ground truth has yet been made available for those clips; also clips #21 through #25 were eliminated by the organizers during the evaluation.
² Box colors indicate the number of visible landmarks (1, 2 or 3).
³ Box colors indicate subject ID.
Fig. 1. Performance summary: NF = number of evaluation frames; NO = number of ground-truth objects; DP = detection percentage = 100 × ND/NO, where ND = number of detected objects; FAP = false alarm percentage = 100 × NFA/NO, where NFA = number of false alarms; AOR = average area overlap ratio for detected objects; SP = ID split percentage = 100 × NS/NO, where NS = number of incorrect ID splits
two frames, our system has 1 missed detection (MD) and 1 false alarm (FA); however, given the erroneous ground truth, our system is instead assigned 5 MDs and 6 FAs. It is therefore likely that the CHIL-2 results more accurately reflect performance, especially since the CHIL-2 results are also summed over a much larger sample (approx. 5 times as large). As such, for the remainder of this paper, we confine our discussion of CHIL performance to the CHIL-2 results. Three Stages of Processing: In Fig. 3, we illustrate the contribution of the three main stages of processing to overall detection performance. The blue ROC curves show the performance of the single-image face finder, while the magenta and green coordinates show how motion-based tracking and track filtering, respectively, improve performance. Note especially how the confidence-based track elimination in stage 3 radically reduces the number of false alarms in the system. Tracking performance – that is, the assignment of consistent subject IDs across frames – is improved even more dramatically than detection performance by the stage 2 and 3 processing. After stage 1, a unique ID is assigned to every face instance in every frame. This assignment of IDs represents the worst-case tracking performance, and corresponds to a split percentage (SP) slightly less than 100%. Following motion-based tracking (stage 2), the SP is reduced to 1.53% for the VACE data and 6.28% for the CHIL data. Track filtering (stage 3)
Fig. 2. (left) erroneous CHIL-1 ground truth; (right) corresponding system output
Fig. 3. Performance after each stage of processing: face finding (stage 1); motion-based tracking (stage 2); and track filtering (stage 3). The orange point on the ROC curves represents performance at the default detection threshold of 1.0.
then lowers the SP to the numbers in Fig. 1, namely, 0.04% for VACE and 0.81% for CHIL. Tracking error for CHIL is greater than for VACE, because the CHIL interactive seminars (especially at UPC) tend to be more dynamic than the VACE conference meetings. Data Complexity: From Fig. 1 we observe noticeable performance differences among the various sites. Sites EDI, VT and IBM, for example, exhibit a lower detection rate, while site UPC exhibits a larger-than-average false alarm rate.
Furthermore, comparing the two video corpora, we note a higher detection percentage (DP) and lower false alarm percentage (FAP) for the VACE data. The primary factor leading to these performance variations is source data complexity; simply put, some video presents greater challenges for automated analysis. Such video is typically characterized by poorer image quality, due to bad lighting, interlacing and more visible compression artifacts; the presence of smaller and poorly contrasted faces; and/or face poses that lie outside the range of poses for which our face finder has been explicitly trained. A secondary, but critically important factor is the ground-truth annotation protocol, which differs markedly for the VACE and CHIL data sets.
Fig. 4. Sample representative face images that are marked as “visible” and “unambiguous” in the ground truth. All images are drawn to scale, relative to EDI-1, where the face-box is 78 × 84 pixels in the source video.
Consider the representative sample images (with associated ground truth) in Fig. 4, drawn from the test data. All of these faces are marked as "visible" and "unambiguous" in the ground-truth annotations, yet all of them (and many more like them) are very challenging examples indeed.⁴ Small faces (e.g. EDI-5,6; IBM examples; UPC examples) are the most obvious instances of "difficult" faces; characteristic features (such as eyes, nose and mouth) are virtually invisible for these cases. Poor image quality (e.g. UPC examples) further exacerbates the challenge. Faces also become more difficult to detect when they occur near the image boundary (e.g. VT-3,4; AIT-1), since visual cues typically associated with the area around the face are missing. Our face finder is trained explicitly for specific poses (see Sec. 1.1) that span a wide range of common poses seen in video and images; as such, it is intended
⁴ EDI-1 is the exception and offers a contrasting example of a large, high-quality face.
to detect near-upright faces from ±90° profile to frontal, as well as tilted frontal faces. Therefore, substantially tilted profiles (e.g. VT-1, UPC-1, AIT-2), profiles turned substantially more than ±90° from the camera (e.g. UPC-7,8; AIT-3,5), downward-looking frontal faces (e.g. VT-2; UPC-3,4,6; AIT-4), and faces viewed from overhead (EDI-2,3,4) will likely be missed by the face finder. Ultimately, annotation guidelines for ground-truthing govern the complexity (i.e. difficulty) of a particular data set. The VACE and CHIL supported tasks both require the presence of certain landmarks for a face to be marked as "visible." For VACE, the criterion is that at least three landmarks are visible: one eye, the nose and part of the mouth. CHIL employs a less stringent criterion – namely, that only one of the following three landmarks be visible: the left eye, nose bridge and right eye. Another key difference in guidelines between VACE and CHIL is the existence of Don't Care Objects (DCOs). For VACE, poorly visible faces are marked as "ambiguous" and thus become DCOs; the CHIL protocol, however, does not provide for the marking of DCOs. The more relaxed "visibility" criterion, along with the absence of DCOs in the CHIL annotations, most likely accounts for most if not all of the observed performance difference between the VACE and CHIL tasks. Site-Dependent Performance Variations: Within the context of the discussion above, site-dependent performance variations in Fig. 1 can now be more easily understood. Here, we go through a few examples. First, consider the EDI test data, which comprise 10 video clips. Seven of these clips consist of close-up face shots (e.g. EDI-1), and for these clips our system achieves a 97.7% DP and a 1.2% FAP. However, the remaining three clips were recorded with fisheye cameras from an overhead view (e.g. EDI-2,3,4) and a far-corner view (e.g. EDI-5,6). Because of the unusual pose appearance and very small face sizes for these clips, our system's DP drops to 10.2%. Together, these numbers explain the comparatively low overall 67.2% DP for the EDI site data. Next, let us consider the VT site data, for which we observe a comparatively low DP of 67.6%. These data prove challenging because (1) partial faces near the image border (e.g. VT-3,4) and downward-looking poses (e.g. VT-1,2) are prevalent; and (2) many of the subjects are wearing head caps with markers for pose tracking (e.g. VT-1,2), obscuring potentially helpful features for detecting the presence of a face. Finally, let us consider the IBM site data. From several camera views, the face appearances of virtually all subjects are exceedingly small (e.g. IBM-1,2,3,4), with very poor facial feature differentiation. It is primarily the prevalence of these small faces that contributes to the low overall DP of 62.7%. Estimation of Data Complexity: From the sample images in Fig. 4, we observe that more difficult faces tend to be characterized by (1) an unusually small or large aspect ratio R and/or (2) a small minimum dimension smin of the bounding box. Therefore, we propose that data complexity can be measured roughly as a function of the distribution of these quantities in the ground truth. Depending on specific annotation guidelines, one or the other of these quantities may be the better predictor. In fact, the variance in R for ground-truth bounding
boxes is much smaller for the VACE data than for the CHIL data, due to the specific data sources as well as the more stringent VACE "visibility" criterion. Based on this and our extensive observations of the ground-truth annotations, we expect that for the VACE data, smin will be the better predictor of performance, while for the CHIL data, the aspect ratio R will be the better predictor. In Fig. 5, we plot detection error as a function of smin (VACE) and R (CHIL). To generate these plots, we divide the ground-truth data into eight quantiles, and compute the detection error for each quantile; therefore, each bar represents approximately 3000 ground-truth faces for the VACE data, and 5000 ground-truth faces for the CHIL data. Note from Fig. 5 that detection error decreases as a function of smin. Thus, for the smallest faces, 4 ≤ smin < 10, we correctly detect 53.2% of faces, while for the largest faces, smin > 60, we correctly detect 97.3% of faces. Also, note how detection error increases for more severe aspect ratios R, as these face instances are more likely to correspond to poses not accounted for in our face finder. One other important indicator of data complexity is the visibility of facial features (i.e. landmark locations). While the VACE annotations do not explicitly label landmark locations, the CHIL annotations do for the eyes and nose bridge. For these data, the missed-detection percentage drops from 27.1% for faces with one or more visible landmarks to 14.1% for faces with three or more visible landmarks.
Fig. 5. Detection error as a function of minimum bounding box dimension smin (VACE) and aspect ratio R (CHIL). Each bar represents approximately one eighth of the respective data sets, sorted from smallest to largest value. The pink boxes in the CHIL plot illustrate a representative R for the three bars.
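The breakdown behind Fig. 5 can be reproduced with a few lines of NumPy; the sketch below bins ground-truth boxes into eight quantiles of the minimum box dimension and reports the detection rate per bin. The input format is an assumption, and the same code applies to the aspect ratio R by switching the sort key.

```python
# Sketch of the quantile analysis of Fig. 5: sort ground-truth faces by the
# minimum bounding-box dimension s_min (or the aspect ratio R), split them into
# eight equally sized bins and compute the detection rate in each bin.
# The input arrays and their layout are assumptions.
import numpy as np

def detection_rate_by_key(gt_boxes, detected, key="smin", n_bins=8):
    """gt_boxes: (N, 4) array of [x, y, w, h]; detected: (N,) boolean array."""
    w, h = gt_boxes[:, 2], gt_boxes[:, 3]
    values = np.minimum(w, h) if key == "smin" else w / h   # s_min or aspect ratio R
    order = np.argsort(values)
    bins = np.array_split(order, n_bins)                    # eight quantile bins
    return [(values[idx].min(), values[idx].max(), detected[idx].mean()) for idx in bins]
```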
2.3 Conclusion
We conclude that the dominant failure modes for our system are (1) very small faces, (2) extreme poses (e.g. looking down, away from the camera, overhead camera views) and (3) partially visible faces at the image boundary. The relative frequency of occurrence of these factors determines data complexity, which, in turn, determines system performance. Aggregate performance statistics are
useful in assessing relative performance between systems; however, only a more detailed analysis of data complexity and system failures allows the end-user to judge how and when a system may be deployed effectively and reliably. Ground truth annotations play a critical role in this process, and the VACE and CHIL communities have arrived at different guidelines. We believe that both have advantages and disadvantages. On the one hand, VACE annotations handle DCOs, which is absolutely critical if we want to measure the false alarm rate of a system accurately. However, the VACE treatment of object visibility and ambiguity remains subjective, at least in practice. In contrast, CHIL guidelines attempt to define visibility more objectively in terms of the presence of annotated facial landmarks. While we may take issue with the specific landmarks chosen⁵ or the threshold criterion for visibility⁶, we do believe that a landmark-based approach offers the most benefits. First, a more objective visibility criterion in terms of annotated landmarks will lead to more consistent annotations across annotators and data sets. Obviously more landmarks are better than fewer, but at a minimum these landmarks should include the eyes, nose and mouth. Second, landmarks allow us to compute derivative information, such as head pose and face size, much more precisely than bounding-box annotations. As such, the analysis of data complexity and system failure modes can become much more rigorous than what is possible with present ground truth. Ultimately it is this kind of analysis that will guide an end-user in the proper application of automated video analysis systems.
⁵ The three co-linear CHIL landmarks (eyes, nose bridge) do not fully resolve pose or face size. Additional landmarks, such as the nose, mouth and ears, would reduce ambiguity substantially.
⁶ Defining a face to be "visible" if only one landmark is visible leads to many mostly occluded faces labeled as "visible."
The AIT Outdoors Tracking System for Pedestrians and Vehicles Aristodemos Pnevmatikakis, Lazaros Polymenakos, and Vasileios Mylonakis Athens Information Technology, Autonomic and Grid Computing, P.O. Box 64, Markopoulou Ave., 19002 Peania, Greece {apne,lcp,vmil}@ait.edu.gr http://www.ait.edu.gr/research/RG1/overview.asp
Abstract. This paper presents the tracking system from Athens Information Technology that participated in the pedestrian and vehicle surveillance task of the CLEAR 2006 evaluations. The proposed tracker introduces two novelties. First, we use a variation of Stauffer's adaptive background algorithm with spatiotemporal adaptation of the learning parameters and a Kalman filter in a feedback configuration. In the feed-forward path, the adaptive background module provides target evidence to the Kalman filter. In the feedback path, the Kalman filter adapts the learning parameters of the adaptive background module. Second, we combine a temporal persistence pixel map, together with edge information, to produce the evidence that is associated with targets. The proposed tracker performed well in the evaluations, and can also be applied to indoor settings and multi-camera tracking.
1 Introduction
Target tracking in video streams has many applications, such as surveillance, security, smart spaces [1], pervasive computing, and human–machine interfaces [2], to name a few. In these applications the objects to be tracked are either humans or vehicles. To track objects we first need to detect them. The detected objects are used to initialize the tracks and provide measurements to the tracking algorithm, usually of the recursive Bayesian filtering [3] type. This is a very hard problem, one that remains unsolved in the general case [3]. If a shape or a color model of the objects were known a priori, then detection could be done using active contours [4] or variations of the mean-shift algorithm [5]. Unfortunately such approaches can only be applied in limited application domains; the shape and color richness of all possible people and vehicles prohibits their use in unconstrained applications like surveillance or smart rooms. The solution to the detection problem is a common property of such targets: sooner or later they move, which produces evidence that distinguishes them from the background and identifies them as foreground objects. The segmentation of foreground objects can be accomplished by processing the difference of the current frame from a background image. This background image can be static [6] or can be computed adaptively [7]. The drawback of the static background image is that
background does change. In outdoor scenes natural light changes and the wind causes movement of trees and other objects. In indoor scenes, artificial light flickers and pieces of furniture may be moved around. All such effects can be learned by an adaptive background algorithm like Stauffer's [8] and its modifications [9,10]. Such an algorithm detects targets as segments that differ from the learned background, but depends on the targets' movement to keep a fix on them. If they stop, the learning process fades them into the background. Once a target is initialized, a tracking system should be able to keep a fix on it even when it remains immobile for some time. In this paper, we propose a novel tracking system that addresses this need by utilizing a feedback mechanism from the tracking module to the adaptive background module, which in turn provides the evidence for each target to the tracking module. We control the adaptive background parameters on a pixel level for every frame (spatiotemporal adaptation), based on a prediction of the position of the target. Under the assumption of Gaussian target states and linear dynamic models, this prediction can be provided by a Kalman filter [11]. The proposed tracking system comprises three modules in a feedback configuration, namely the adaptive background, the image processing for evidence generation and the Kalman filtering modules. A fourth module operates on the tracks in a temporal window of 1 second by checking their consistency. This paper is organized as follows: In section 2 the four modules of the system are detailed. The results on the VACE person and vehicle surveillance tasks of the CLEAR 2006 evaluations are presented and discussed in section 3. Finally, in section 4 the conclusions are drawn, followed by some indications for future enhancements.
2 Tracking System
The block diagram of the tracking system is shown in Figure 1. It comprises four modules: adaptive background, image processing for evidence generation, Kalman filtering and track consistency. Evidence for the targets is generated once a difference from the estimated background is detected. The estimation of the background is dynamic; the background is learnt in a different manner for different portions of the frame, depending on whether they belong to existing targets, and on the target size and speed. The evidence is used to initialize and update tracks. Tracks that are persistent for 10 out of the 15 past frames are promoted to targets, and are reported by the system. Given that the frame rate of all the VACE person and vehicle surveillance videos is 25 frames per second, the introduced lag is a small penalty to pay for the added robustness to false alarms. Initialized tracks have their new positions predicted by the prediction step of the Kalman filter. The predictions are used to associate evidence with tracks and perform the measurement update step of the Kalman filter. Tracks are also eliminated if they have no evidence supporting them for 15 frames. The states of the Kalman filter, i.e. the position, velocity and size of the targets, are fed back to the adaptive background module to spatiotemporally adapt the learning rate. They are also fed forward to the track consistency module to obtain the reported tracks of the system and the decision as to whether they correspond to vehicles or pedestrians. In the rest of the section, we present the four modules in detail.
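As an illustration of the track bookkeeping just described, the sketch below promotes a track to a reported target once it has supporting evidence in 10 of the past 15 frames and eliminates it after 15 consecutive frames without evidence; the class layout is an assumption, not the authors' code.

```python
# Sketch of the track life-cycle rules described above (10-of-15 promotion,
# elimination after 15 frames without evidence). Class layout is assumed.
from collections import deque

class Track:
    def __init__(self):
        self.support = deque(maxlen=15)   # 1 if evidence was associated in a frame
        self.misses = 0                   # consecutive frames without evidence
        self.is_target = False            # True once the track is reported

    def update(self, has_evidence):
        self.support.append(1 if has_evidence else 0)
        self.misses = 0 if has_evidence else self.misses + 1
        if sum(self.support) >= 10:
            self.is_target = True
        return self.misses < 15           # False means the track is eliminated
```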
Fig. 1. Block diagram of the complete feedback tracker architecture. Frames are input to the adaptive background and evidence generation modules, and targets are output from the track consistency module.
2.1 Adaptive Background Module
The targets of the proposed system (vehicles and pedestrians) are mostly moving. The changes in subsequent video frames due to movement are used to identify and segment the foreground (pixels of the moving targets) from the background (pixels without movement). If a background image is available, this segmentation is simply the difference of the current frame from the background image. The foreground pixels thus obtained are readily grouped into target regions. A static image of the empty scene viewed by the (fixed) camera can be used as background [6]. Unfortunately this is not practical for outdoor applications, or even for long-term indoor applications, hence adaptive background approaches are adopted [7-10], primarily for two reasons. First, such an empty scene image might not be available due to the system setup. Second and most important, the background (outdoors and indoors) also changes: natural light conditions change slowly as time goes by; the wind causes swaying movements of flexible background objects (e.g. foliage); fluorescent light flickers at the power supply frequency; objects on tabletops and small pieces of furniture are rearranged; and projection areas display different content. All these changes need to be learnt into an adaptive background model. Stauffer's adaptive background algorithm [8] is capable of learning such changes with different speeds of change by learning into the background any pixel whose color in the current frame resembles the colors that this pixel often had in the history of the recording. Pixels that exhibit no changes, periodic changes or changes that occurred only in the distant past are thus considered background. To do so, a number of weighted Gaussians model the appearance of different colors in each pixel. The weights indicate the amount of time the modeled color is active in that particular pixel. The mean is a three-dimensional vector indicating the estimated color for that model and that pixel, while the covariance matrix indicates the extent around the mean within which a color of that
pixel is to be considered similar to the one modeled. Colors in any given pixel similar to that modeled by any of the Gaussians of that pixel lead to an update of that Gaussian, an increase of its weight and a decrease of all the weights of the other Gaussians of that pixel. Colors not matching any of the Gaussians of that pixel lead to the introduction of a new Gaussian with minimum weight. Hence the possible updates of the weight of the i-th Gaussian of the pixel located at (x, y) at time t are
w_i(x, y, t) =
\begin{cases}
a & \text{new Gaussian} \\
(1 - a)\, w_i(x, y, t-1) & \text{non-matching Gaussians} \\
(1 - a)\, w_i(x, y, t-1) + a & \text{matching Gaussian}
\end{cases}
\qquad (1)
where a is the learning rate. Some variations of the Stauffer algorithm found in the literature deal with the way the covariance is represented (single value, diagonal or full matrix) and the way the mean and covariance of the Gaussians are updated [9]. Some further variations of the algorithm address the way the foreground information is represented. The original algorithm and most of the modifications lead to a binary decision for each pixel: foreground or background [8,9]. In [10], the Pixel Persistence Map (PPM) is used instead. This is a map of the same dimension as the frames, with a value at each location (x, y) equal to the weight of the Gaussian matching the current color of the pixel at (x, y). Small PPM values indicate foreground objects, while large values indicate background. The foreground/background threshold is left unspecified though. The drawback of all the existing variations of Stauffer's algorithm is that stationary foreground objects tend to fade into the background with rate a. Small background learning rates fade foreground objects slowly, but are also slow in adapting to the background changes. Large rates favor background adaptation but tend to fade a target into the background when it stops. This fading progressively destroys the region of the tracked object, deforms its perceived shape and finally leads to losing track of the object altogether. When the target resumes moving, foreground pixels will be marked only at the locations not previously occupied by the stationary target. When a target remains stationary long enough, or has fairly uniform coloration, the new evidence will be far apart from the last evidence of the track, either in time, in space or in both. The track is then lost: it is terminated and another is initiated when movement resumes. We address the problem of the fading of stationary foreground objects using a feedback tracking architecture. The edges of the frame that coincide with values of the PPM below a threshold serve as target evidence to the Kalman filter. The states of the Kalman filter provide ellipses that describe every target. The learning rate is modified in regions around these targets, based on their speed and size. Thus, instead of a constant value, a spatiotemporal adaptation of the learning rate is used:
a(x, y, t) =
\begin{cases}
0.04 & \text{if } (x, y) \text{ is not a target pixel at time } t \\
a(\hat{v}, \mathbf{C}) & \text{if } (x, y) \text{ is a target pixel at time } t
\end{cases}
\qquad (2)
where C is the covariance matrix of the target (hence det(C) relates to its size) and v̂ is the mobility of the target, which is related to the change of the position of its
centroid and the change of its size. The latter indicates an approaching or receding target and is quantified using the determinant of the covariance matrix of the target. Thus the mobility is defined as follows:
\hat{v} = T_f\, \|\mathbf{v}\|_2 + \frac{\max\big(\det(\mathbf{C}_t),\, \det(\mathbf{C}_{t-T_f})\big)}{\min\big(\det(\mathbf{C}_t),\, \det(\mathbf{C}_{t-T_f})\big)}
\qquad (3)
where v is the velocity vector and T_f is the inverse of the frame rate. The learning rate a(v̂, C) of a pixel belonging to a target is then:
a(\hat{v}, \mathbf{C}) =
\begin{cases}
0.04 & \text{if } \det(\mathbf{C}) \le 8 \cdot 10^5 \text{ and } \hat{v} \ge 2 \\
0.04 / 4 & \text{if } \det(\mathbf{C}) > 8 \cdot 10^5 \text{ and } \hat{v} \ge 2 \\
0.0044 \cdot \tan\!\left(\frac{\hat{v}\,\pi}{4.3}\right) & \text{if } \det(\mathbf{C}) \le 8 \cdot 10^5 \text{ and } \hat{v} < 2 \\
(0.0044 / 4) \cdot \tan\!\left(\frac{\hat{v}\,\pi}{4.3}\right) & \text{if } \det(\mathbf{C}) > 8 \cdot 10^5 \text{ and } \hat{v} < 2
\end{cases}
\qquad (4)
This choice for a(v̂, C) progressively delays fading of the targets as they become slower. It also delays fading of large targets by setting the learning rate to 1/4 of its value if the target is too large. This is useful for large vehicles, whose speed can be large but whose uniform colors can lead to fading into the background. The second major proposed modification of Stauffer's algorithm addresses extreme flickering situations often encountered in night vision cameras. In such scenes the PPM needs to be binarized by a high threshold in order not to consider flickering pixels as foreground. The high threshold, on the other hand, tends to discard actual foreground pixels as well. The proposed solution is to adapt the threshold T in a spatiotemporal fashion similar to the learning rate in (2), i.e.

T(x, y, t) =
\begin{cases}
0.25 & \text{if } (x, y) \text{ is not a target pixel at time } t, \text{ or belongs to a target with } \det(\mathbf{C}) < 500 \\
0.5 & \text{elsewhere}
\end{cases}
\qquad (5)
This way flickering pixels are avoided far from the targets, while the targets themselves are not affected. To avoid a delayed detection of new, very small targets, the threshold of pixels belonging to such targets with det(C) < 500 is not affected. These proposed feedback mechanisms on the learning rate and the PPM binarization threshold lead to robust foreground regions regardless of the flickering in the images or the lack of target mobility, while they do not affect the adaptation of the background around the targets. When such flickering and mobility conditions occur, the resulting PPM is more suitable for target region forming than the original version of [10]. The forming of target regions is the goal of the evidence generation module, detailed next.
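The following sketch collects equations (2)–(5), as reconstructed above, into a few helpers; the per-pixel bookkeeping and the interface to the Kalman state are omitted, and the code mirrors the equations rather than the authors' exact implementation.

```python
# Sketch of the spatiotemporal adaptation in Eqs. (2)-(5); the target state
# (velocity v and covariances C_t, C_{t-Tf}) is assumed to come from the
# Kalman module, and the per-pixel application is omitted.
import numpy as np

def mobility(v, C_t, C_prev, T_f=1.0 / 25.0):                     # Eq. (3)
    d_t, d_prev = np.linalg.det(C_t), np.linalg.det(C_prev)
    return T_f * np.linalg.norm(v) + max(d_t, d_prev) / min(d_t, d_prev)

def learning_rate(v_hat, C):                                      # Eqs. (2), (4)
    base = 0.04 if v_hat >= 2 else 0.0044 * np.tan(v_hat * np.pi / 4.3)
    return base / 4 if np.linalg.det(C) > 8e5 else base

def ppm_threshold(is_target_pixel, C=None):                       # Eq. (5)
    small_target = C is not None and np.linalg.det(C) < 500
    return 0.25 if (not is_target_pixel or small_target) else 0.5
```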
2.2 Evidence Generation Module
The evidence generation module finds foreground segments, assigns them to known tracks or initializes new ones, and checks targets for possible splitting. The information for new targets or targets to be updated is passed to the Kalman module. The binary mask obtained by adaptively thresholding the PPM is passed through a shadow detector based on [12]. It is then merged with the binary mask obtained by the edge detector using the AND operator. The resulting mask contains the foreground edges. Edges are used to add robustness to the system: the PPM can have filled segments if the object has entered the camera view after initialization and moves sufficiently. On the other hand, an object can manifest itself by parts of its outline if it has been present in the scene at initialization. The use of edges provides the contours of objects in both cases, so they no longer need to be treated by different image processing modules. The foreground edge map is dilated to form segments. The dilation is such that edges that lie up to 10 pixels apart are merged into a single segment. The detected segments are associated to tracks based on the Mahalanobis distance of segments from tracks. Segments with distances larger than 10 from any track are not associated, leading to a request to the Kalman module for the initialization of a new track. The association utilizes the Munkres (Hungarian) algorithm [13] for the assignment of M evidences to N known tracks, where generally M ≠ N. The algorithm is very fast and, contrary to an exhaustive search, requires no limitation of M or N. The evidence segments associated with tracks that are indicated as vehicles by the track consistency module are also checked for possible target splits, both vertically and horizontally. This is done in order to tell small crowds apart. Vertically, pedestrian evidence is split if the segment is too tall. If the camera were calibrated [14], then under the reasonable assumption that the segments touch the ground, their height could be estimated. In the absence of such calibration, the height in pixels of tall pedestrians as a function of the vertical coordinate of the bottom of their bounding box in the image (which corresponds to different depths) is estimated from the videos of the Dry Run surveillance task. When the height of a segment is much larger than that, the segment is split in half, resulting in two segments. The one closest matching the track remains associated to it, while the other issues an initialization request to the Kalman module. Horizontal candidate splits are evaluated by attempting to identify the number of bodies in them and detect the heads. This is done by processing the height of the target as a function of its width. The height is measured from the bottom of the box bounding the target. The processing identifies peaks that correspond to heads and valleys that correspond to points at which the target can be split into more than one body. The process is illustrated in Figure 2 and works well with upright people. Finally, heads are found by examining the smoothed derivative of the width of the detected peaks. As the width of the body increases rapidly at the shoulders, this point can be easily detected. Other approaches that are based on the detection of faces, like the skin color histograms [15] used in [6], eye detectors like the one in [2] or the boosting algorithm of Viola and Jones [16], are limited to frontal faces, good resolution and fairly constant lighting.
Since none of these conditions are guaranteed in the intended
Fig. 2. Processing a target to extract bodies and heads. (a) Scene of a target with the two associated bodies and heads marked. (b) PPM of the target and 2D Gaussian approximation. (c) Target height profile (blue line) used to identify the peaks and valleys that correspond to the head tops (red circles) and body splitting points (vertical red lines) respectively. Head estimates are also marked with black lines.
outdoor surveillance application, none of these approaches are used to find faces and then evaluate segment splits based on their position.
2.3 Kalman Filtering Module
The Kalman filtering module maintains the states of the targets. It creates new targets should it receive a request from the evidence generation module and performs the measurement update based on the foreground segments associated to the targets. The states of the targets are fed back to the adaptive background module to adapt the learning rate and the threshold for the PPM binarization. States are also eliminated if they have no foreground segments associated to them for 15 frames. Every target is approximated by an elliptical disc that is obtained from the mean m and the covariance matrix C of the target, i.e. it is described by a single Gaussian.
\mathbf{m} = [m_x, m_y]^T, \qquad
\mathbf{C} = \begin{bmatrix} C_{11} & C_{12} \\ C_{12} & C_{22} \end{bmatrix}
\qquad (6)
If the eigenvectors and the eigenvalues of C are v_i and λ_i respectively, with i = 1, 2, then the axes of the ellipse are along the v_i and the radii are 2√λ_i. The target states are seven-dimensional; they comprise the mean of the Gaussian describing the target (horizontal and vertical components), the velocity of the mean (horizontal and vertical components) and the three independent terms of the covariance matrix. Hence the state vector is:

\mathbf{s} = [m_x, m_y, v_x, v_y, C_{11}, C_{22}, C_{12}]^T \qquad (7)
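A minimal sketch of a Kalman filter operating on this seven-dimensional state is given below; the constant-velocity/constant-covariance transition follows the description in the next paragraph, while the noise covariances and the choice of measured quantities are placeholder assumptions.

```python
# Sketch of a Kalman filter on the state s = [mx, my, vx, vy, C11, C22, C12]^T.
# Position integrates velocity; velocity and the covariance terms are modeled
# as constant. Noise values and the measurement model are assumptions.
import numpy as np

dt = 1.0 / 25.0                      # frame period of the evaluation videos
F = np.eye(7)
F[0, 2] = dt                         # mx <- mx + vx * dt
F[1, 3] = dt                         # my <- my + vy * dt
Q = np.diag([1.0, 1.0, 10.0, 10.0, 50.0, 50.0, 50.0])      # process noise (assumed)

# Measurements: segment centroid and covariance terms (velocity is not observed).
H = np.zeros((5, 7))
H[0, 0] = H[1, 1] = 1.0              # mx, my
H[2, 4] = H[3, 5] = H[4, 6] = 1.0    # C11, C22, C12
R = np.diag([4.0, 4.0, 100.0, 100.0, 100.0])               # measurement noise (assumed)

def predict(s, P):
    return F @ s, F @ P @ F.T + Q

def update(s, P, z):
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    return s + K @ (z - H @ s), (np.eye(7) - K @ H) @ P
```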
The prediction step uses a loose linear dynamic model of constant velocity [17] for the update of the mean position and velocity. As for the update of the three covariance terms, their exact model is non-linear and hence cannot be used with the Kalman tracker; instead of using linearization and an extended Kalman tracker, the covariance terms are modeled as constant. The variations of the velocity and the covariance terms are permitted by the state update variance term. This loose dynamic model permits arbitrary movement of the targets. It is very different from the more elaborate models used for tracking aircraft. Aircraft can perform a limited set of maneuvers that can be learned and be expected by the tracking system. Further, flying aircraft can be modeled as rigid bodies; thus strict and multiple dynamic models are appropriate and have been used extensively in Interacting Multiple Model Kalman trackers [18,19]. Unlike aircraft, street vehicles and especially humans have more degrees of freedom in their movement, which includes, apart from speed and direction changes, avoiding obstacles arbitrarily, rendering the learning of a strict dynamic model impractical. A strict dynamic model in this case can mislead a tracker to a particular track even in the presence of contradicting evidence [3].
2.4 Track Consistency Module
The states of the Kalman filtering module are processed by the track consistency module for two reasons: first, to join tracks that are terminated with others that are initialized near them in terms of time, space and size; second, to decide on the target type, vehicle or pedestrian. Both operations require some memory. It is decided that a lag of 1 second (or 25 frames in the given evaluation videos) is acceptable. During this 1-second period the tracks are checked for consistency and the decision about vehicle or pedestrian is made. The track consistency check eliminates some of the misses due to lack of evidence and avoids segmentation of continuous tracks. Temporal proximity is restricted to 1 second. Spatial proximity is restricted to a Mahalanobis distance of 10. Finally, proximity in terms of size is restricted by requiring that the determinants of the last covariance matrix of the terminating track and the first covariance matrix of the initialized track differ by at most a factor of 3. The decision about the type of target is based on the velocity, size and location of the targets. The decision thresholds are trained using the Dry Run videos. Velocity and size thresholds utilize all of the Dry Run videos, while location thresholds are estimated separately for the two monitored sites. For every frame in the decision interval in which a target is classified as a vehicle, a counter is increased, while it is decreased for frames in which the target is classified as a pedestrian. Positive values indicate vehicles and negative values indicate pedestrians.
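A sketch of the voting scheme for the vehicle/pedestrian decision is shown below; only the counter mechanism follows the text, while the per-frame classification rule and its thresholds are illustrative assumptions (in the paper they are trained on the Dry Run videos from velocity, size and location cues).

```python
# Sketch of the vehicle/pedestrian vote over the 1 s (25-frame) decision window.
# The per-frame rule and thresholds are assumptions; the paper trains them on
# the Dry Run videos using velocity, size and location.
def classify_target(window, speed_thr=2.0, size_thr=5.0e5, road_mask=None):
    """window: list of (speed, size, position) tuples, one per frame."""
    votes = 0
    for speed, size, position in window:
        on_road = road_mask is None or road_mask(position)
        looks_like_vehicle = (speed > speed_thr or size > size_thr) and on_road
        votes += 1 if looks_like_vehicle else -1
    return "vehicle" if votes > 0 else "pedestrian"
```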
3 CLEAR Evaluation Results
The proposed feedback tracking architecture is tested on the CLEAR evaluations (video sequences coming from the CHIL and VACE projects). First we show the qualitative effect of the algorithm on outdoor and indoor videos. Then we present the quantitative results for the vehicle and pedestrian tracking tasks of CLEAR.
Figure 3 shows the effect of the spatiotemporal adaptation of the threshold for the binarization of the PPM in the adaptive background module. The thresholded PPM can result in false alarms if the threshold is not adapted by the states of the Kalman filter.
Fig. 3. Tracking in outdoor night video (a) with (b) and without (c) the proposed feedback that spatiotemporally adapts the threshold for the binarization of the PPM in the adaptive background module. Without the proposed scheme, night camera flicker generates false alarm segments, one of which exceeds the size threshold and initiates a false target (marked by the yellow circle).
Figures 4 and 5 show the effect of the spatiotemporal learning rate adaptation to the PPM when a target remains stationary. When the proposed adaptation is not used,
Fig. 4. Tracking with (a) and without (b) the proposed feedback that spatiotemporally adapts the learning rate of the adaptive background module. Without the proposed scheme, the top-right target is lost entirely, whereas the center-left one is in the process of fading (see the PPM on the right column of the figure). The moving bottom-right target is not affected.
Fig. 5. Tracking with (a) and without (b) the proposed feedback that spatiotemporally adapts the learning rate of the adaptive background module. Without the proposed scheme, the stationary person is no longer tracked. Instead, the system tracks the chair that the person has started moving.
stationary targets fade, so that the system either loses track (Figure 4) or has reduced tracking accuracy (Figure 5). The 50 video segments of the vehicle and pedestrian surveillance tasks of CLEAR are utilized for the quantitative evaluation of the algorithm. Referring to [20] for the definitions of the evaluation metrics, the performance of the algorithm is shown in Table 1.
Table 1. Quantitative performance evaluation of the proposed tracking algorithm on the CLEAR evaluation data for vehicle and pedestrian surveillance
         Pedestrians                                          Vehicles
         MODP    MODA    MOTP    MOTA    SFDA    ATA          MOTP    MOTA    ATA
Mean     0.529   0.036   0.527   0.025   0.510   0.242        0.367   0.301   0.300
Median   0.554   0.418   0.539   0.418   0.570   0.270        0.515   0.279   0.370
IQR      0.162   1.230   0.165   1.246   0.520   0.248        0.601   0.576   0.525
Fig. 6. Boxplot of the MOTA per target type and site. For site 1, the algorithm performs similarly for pedestrians and vehicles. For site 2, all targets are designated as pedestrians, increasing their false alarms. The effect on MOTP and ATA is similar.
Note that the proposed algorithm performed very well on one of the two recording sites. On the second site the algorithm failed for cars. This is due to the nature of the scene: in site 2 the road had cars parked or maneuvering to park, as well as speed bumps that forced cars to almost zero speed. These render the speed parameter useless for vehicle/pedestrian discrimination. The boxplot of the MOTA per site and target type is shown in Figure 6.
4 Conclusions
The proposed tracking algorithm, which places the adaptive background and Kalman filtering modules in a feedback configuration, combines the immunity of Stauffer's algorithm to background changes (such as lighting, camera flicker or furniture movement) with the target stability offered by a static background, regardless of whether the targets move or not. Utilizing the Kalman filter, gates are effectively built around the tracked targets that allow association of the foreground evidence to the targets. The evidence is obtained by processing a binary map that combines edge and temporal information. The performance of the proposed algorithm on the vehicle and pedestrian surveillance task of the CLEAR evaluations has shown that it is a very good candidate for outdoor surveillance. Also, qualitative results on indoor videos show the potential of the algorithm in that domain as well. Future plans follow two different directions. One direction is to replace the loose dynamic model for the linear update of the states in the Kalman filter with stricter dynamic models and the CONDENSATION [21] algorithm. This might prove particularly useful for indoor tracking, where the approximation of humans with elliptical discs is no longer sufficient. In near-field recording conditions the limbs of the people create large deviations from the elliptical discs; sticking to the latter can jeopardize the ability of the algorithm to handle occlusions and clutter. The second direction is towards multi-sensor tracking, where a number of synchronized and calibrated cameras can produce three-dimensional tracks.
Acknowledgements
This work is sponsored by the European Union under the integrated project CHIL, contract number 506909. The authors wish to thank the organizers of the CLEAR evaluations and acknowledge the use of data coming from the VACE and CHIL projects for testing the algorithm.
References
[1] A. Waibel, H. Steusloff, R. Stiefelhagen, et al.: CHIL: Computers in the Human Interaction Loop, 5th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), Lisbon, Portugal, (Apr. 2004).
[2] A. Pnevmatikakis, F. Talantzis, J. Soldatos and L. Polymenakos: Robust Multimodal Audio-Visual Processing for Advanced Context Awareness in Smart Spaces, in I. Maglogiannis, K. Karpouzis and M. Bramer (eds.), Artificial Intelligence Applications and Innovations (AIAI06), Springer, Berlin Heidelberg (June 2006), 290-301.
[3] D. Forsyth and J. Ponce: Computer Vision - A Modern Approach, Prentice Hall, (2002), 489-541.
[4] J. MacCormick: Probabilistic modelling and stochastic algorithms for visual localisation and tracking, PhD Thesis, University of Oxford (2000), section 4.6.
[5] G. Jaffré and A. Crouzil: Non-rigid object localization from color model using mean shift, International Conference on Image Processing (ICIP 2003), Barcelona, Spain, (Sept. 2003)
[6] H. Ekenel and A. Pnevmatikakis: Video-Based Face Recognition Evaluation in the CHIL Project – Run 1, Face and Gesture Recognition, Southampton, UK, (Mar. 2006), 85-90.
[7] McIvor: Background Subtraction Techniques, Image and Vision Computing New Zealand, (2000).
[8] C. Stauffer and W. E. L. Grimson: Learning patterns of activity using real-time tracking, IEEE Trans. on Pattern Anal. and Machine Intel., 22, 8 (2000), 747–757.
[9] P. KaewTraKulPong and R. Bowden: An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection, in Proc. 2nd European Workshop on Advanced Video Based Surveillance Systems (AVBS01), (Sept. 2001).
[10] J. L. Landabaso and M. Pardas: Foreground regions extraction and characterization towards real-time object tracking, in Proceedings of Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms (MLMI '05), (July 2005).
[11] R. E. Kalman: A New Approach to Linear Filtering and Prediction Problems, Transactions of the ASME – Journal of Basic Engineering, 82 (Series D), (1960) 35-45.
[12] L.-Q. Xu, J. L. Landabaso and M. Pardas: Shadow Removal with Blob-Based Morphological Reconstruction for Error Correction, IEEE International Conference on Acoustics, Speech, and Signal Processing, (March 2005).
[13] S. Blackman: Multiple-Target Tracking with Radar Applications, Artech House, Dedham, MA, (1986), chapter 14.
[14] Z. Zhang: A Flexible New Technique for Camera Calibration, Microsoft Research, Technical Report MSR-TR-98-71, (Aug. 2002).
[15] M. Jones and J. Rehg: Statistical color models with application to skin detection, Computer Vision and Pattern Recognition, (1999), 274–280.
[16] P. Viola and M. Jones: Rapid Object Detection using a Boosted Cascade of Simple Features, IEEE Conf. on Computer Vision and Pattern Recognition, (2001).
[17] S.-M. Herman: A particle filtering approach to joint passive radar tracking and target classification, PhD thesis, University of Illinois at Urbana-Champaign, (2002), 51-54.
[18] H. A. P. Bloom and Y. Bar-Shalom: The interactive multiple model algorithm for systems with Markovian switching coefficients, IEEE Trans. Automatic Control, 33 (Aug. 1988), 780-783.
[19] G. A. Watson and W. D. Blair: IMM algorithm for tracking targets that maneuver through coordinated turns, in Proc. of SPIE Signal and Data Processing of Small Targets, 1698 (1992), 236-247.
[20] R. Kasturi, et al.: Performance evaluation protocol for face, person and vehicle detection & tracking in video analysis and content extraction (VACE-II), University of South Florida (Jan 2006).
[21] M. Isard and A. Blake: CONDENSATION - conditional density propagation for visual tracking, Int. J. Computer Vision, 29, (1998), 5-28.
Evaluation of USC Human Tracking System for Surveillance Videos Bo Wu, Xuefeng Song, Vivek Kumar Singh, and Ram Nevatia University of Southern California Institute for Robotics and Intelligent Systems Los Angeles, CA 90089-0273 {bowu,xsong,viveksin,nevatia}@usc.edu
Abstract. The evaluation results of a system for tracking humans in surveillance videos are presented. Moving blobs are detected based on adaptive background modeling. A shape based multi-view human detection system is used to find humans in moving regions. The detected responses are associated to infer the human trajectories. The shape based human detection and tracking is further enhanced by a blob tracker to boost the performance on persons at a long distance from the camera. Finally the 2D trajectories are projected onto the 3D ground plane and their 3D speeds are used to verify the hypotheses. Results are given on the video test set of the VACE surveillance human tracking evaluation task.
1 Task and Data Set
The task in this evaluation exercise is to track the 2D locations and regions of multiple humans in surveillance videos. The videos are captured with a single static camera mounted a few meters above the ground looking down towards a street. The test set for the evaluation contains 50 sequences, overall 121,404 frames, captured from two different sites at various times. The frame size is 720 × 480; the sampling rate is 30 FPS. Fig.1 shows one shot of each site.
Fig. 1. Sample frames: (a) site 1; (b) site 2
This task is made complex for many reasons: the image appearance of pedestrians changes with the changing viewpoints and the clothing;
moving objects include not only humans but also vehicles; at the given resolution, detection of face patterns is infeasible; and the scene is cluttered by many scene objects, e.g. trees and traffic signs. We describe our method to overcome these difficulties in Section 2; Section 3 shows the experimental results; and Section 4 provides conclusions.
2 Methodology
We first detect person hypotheses in each frame, then we track them in 2D with a data association method. Shape based tracking is combined with a blob tracker to improve the performance. Based on computed camera parameters, the 2D trajectories are projected onto the 3D ground plane. The 3D speeds of the tracked objects are calculated and used to verify the hypotheses.
2.1 Shape Based Human Detection and Tracking
We learn full-body detectors for walking or standing humans by the method proposed in [2]. Nested structure detectors are learned by boosting edgelet feature based weak classifiers. To cover different viewpoints, two detectors are learned: one for the left profile view, and one for the frontal/rear view (the detector for the right profile view is generated by flipping the left profile view horizontally). The training set contains 1,700 positive samples for frontal/rear views, 1,120 for the left profile view, and 500 negative images. The negative images are all of street scenes. The samples in the training set are independent of the test sequences. We do not use the combined detection in [2] for explicit partial occlusion reasoning, as the local feature based full-body detector can work with partial occlusion to some extent and inter-human occlusions are not strong in this data set. We constrain the search for humans to regions around moving blobs. Motion is detected by comparing pixel colors to an adaptively learned background model. If the proportion of the moving pixels within an image sub-window is larger than a threshold, θm, the sub-window is considered a candidate for human hypotheses and sent to the detector for further processing; otherwise it is discarded directly. This reduces the search space of the human detector and prevents false alarms on static scene objects; however, it also prevents detection of static persons in the scene (in a real surveillance scenario, persons will always be expected to enter the scene at some time). Fig. 2 shows some detection results. Humans are tracked by forming associations between the frame detection responses. This 2D tracking method is a simplified version of that in [3], as only the full-body detector is applied. The affinity between a hypothesis and a response is calculated based on cues from distance, size, and color. A greedy algorithm is used to associate the hypotheses and the detection responses. The automatic initialization and termination of trajectories are based on the confidences calculated from associated detection responses. To track a human, first data association with the full-body detection responses is attempted; if this fails,
Fig. 2. Sample detection results
a color based meanshift tracker [4] is used to follow the person. Fig. 3 shows an example of shape based tracking.
Fig. 3. Examples of shape based human tracking results
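The motion gating described above can be sketched as follows; the value of θm, the window generation and the detector stub are assumptions for illustration.

```python
# Sketch of the motion gating: a sub-window is sent to the full-body detector
# only if the fraction of moving (foreground) pixels inside it exceeds theta_m.
# The threshold value and the detector interface are assumptions.
import numpy as np

def gated_detect(foreground_mask, windows, detector, theta_m=0.2):
    """foreground_mask: HxW binary array; windows: iterable of (x, y, w, h)."""
    detections = []
    for x, y, w, h in windows:
        patch = foreground_mask[y:y + h, x:x + w]
        if patch.size and patch.mean() > theta_m:     # enough moving pixels
            response = detector(x, y, w, h)           # e.g. edgelet-based classifier
            if response is not None:
                detections.append((x, y, w, h, response))
    return detections
```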
2.2 Motion Based Human Detection and Tracking
The shape based detector does not work well at resolutions where a person is less than 24 pixels wide, as is the case when the humans are far from the camera. Fig. 4 shows some examples of missed detections. We augment the shape based method with motion based blob tracking. Taking the motion detection results as input, we apply some morphological operations to connect foreground pixels and generate motion blobs. For simplicity, we model moving objects as rectangles. Each object is associated with an appearance model and a dynamic model. At each new frame, we predict the object's
Fig. 4. Examples of missed detections. (The persons marked by the red arrows are detected as moving blobs but not found by the human detectors.)
position with its dynamic model. The appearance model is used to distinguish objects when they are merged. A new object is created when a blob has no match with current hypotheses. A track ends when it has no blob match for more than a set number of frames. However, if multiple objects are merged in one blob from the beginning to the end, the blob tracker cannot segment them. Moving objects with relatively small size are classified as pedestrians; the others as vehicles. Fig. 5 shows an example of motion based tracking.
Fig. 5. Examples of motion based human tracking results
2.3 Combination of Shape and Motion Based Approaches
We use an integration method to combine the shape based tracking and the motion based tracking. For each human track segment hs from shape based tracking, we search for the motion blob track segments hm which have large overlap with hs . We then merge the motion blob track segments hm with the human track segment hs . This combination increases the accuracy of the trajectories. Fig.6 shows an example of the combination.
Fig. 6. Combination of shape based and motion based tracking
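One simple way to realize the "large overlap" test between a shape-based track segment hs and a motion-blob segment hm is sketched below; the use of intersection-over-union averaged over common frames, and the 0.5 threshold, are assumptions, since the paper does not specify the overlap measure.

```python
# Sketch of the overlap test for merging a shape-based segment h_s with a
# motion-blob segment h_m. IoU averaged over common frames and the 0.5
# threshold are assumptions; the paper only requires a "large overlap".
def iou(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def mergeable(seg_s, seg_m, thr=0.5):
    """Segments map frame index -> bounding box (x, y, w, h)."""
    common = set(seg_s) & set(seg_m)
    if not common:
        return False
    mean_iou = sum(iou(seg_s[f], seg_m[f]) for f in common) / len(common)
    return mean_iou > thr
```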
2.4 Verification by 3D Speed
False alarms of the detection system usually appear in cluttered or highly textured areas. Some of these false alarms are persistent ones, from which false trajectories may be generated. See Fig.7 for some examples.
Fig. 7. Examples of false alarm trajectories
We use speed to discriminate humans from vehicles (vehicles can move slowly, but human speed is limited). The image speed, however, depends on the position of the object in the image (faster motion near the camera). One approach could be to learn such speed patterns. We instead infer camera calibration parameters from observed motion by using the approach proposed in [1] (this requires interactive processing in the tracking stage). Based on the calibration parameters and the assumption that objects are moving on a known ground plane, we project the 2D image locations onto the 3D ground plane, and calculate the 3D speeds of all tracked objects. If the average speed of a hypothesis is lower than a threshold, θspeed, the hypothesis is accepted as human; otherwise, it is rejected.
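The 3D-speed verification can be sketched with a ground-plane homography; the homography itself (derived from the calibration of [1]) and the value of θspeed are assumptions in this sketch.

```python
# Sketch of the speed verification: project image foot points onto the ground
# plane with a homography H_g and accept the hypothesis as human if its average
# speed is below theta_speed. Both H_g and the threshold are assumptions here.
import numpy as np

def to_ground(H_g, pts):
    """pts: (N, 2) image points; returns (N, 2) ground-plane coordinates."""
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    proj = (H_g @ homog.T).T
    return proj[:, :2] / proj[:, 2:3]

def accept_as_human(H_g, foot_points, fps=30.0, theta_speed=4.0):
    """foot_points: (N, 2) image positions of the object, one per frame."""
    ground = to_ground(H_g, np.asarray(foot_points, dtype=float))
    step = np.linalg.norm(np.diff(ground, axis=0), axis=1)
    avg_speed = step.mean() * fps         # ground-plane units per second
    return avg_speed < theta_speed
```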
3 Experiments
We ran our system on the 50 test sequences. The formal evaluation process defines four main metrics for the human tracking task [5]:
1. Multiple Object Detection Precision (MODP) reflects the 2D location precision at the detection level;
2. Multiple Object Detection Accuracy (MODA) is the detection accuracy calculated from the number of false alarms and missed detections;
3. Multiple Object Tracking Precision (MOTP) reflects the 2D location precision at the tracking level; and
4. Multiple Object Tracking Accuracy (MOTA) is the tracking accuracy calculated from the number of false alarms, missed detections, and identity switches.
We repeated the experiment with six sets of parameters to observe tradeoffs in performance. Table 1 lists the scores of the six runs. It can be seen that our system achieves reasonable results.
Table 1. Final evaluation scores (the numbers in bold font are the best ones)
As the MODA and MOTA metrics integrate missed detections and false alarms into a single number, it is difficult to see the tradeoff from these numbers. Instead, we use the fraction of ground-truth instances missed and the number of false alarms per frame to draw an ROC curve; see Fig. 8. The videos of site 2 are more complicated than those of site 1 in terms of the variety of objects in the scene. Table 2 gives the scores of run 1 on these two sites. It can be seen that the performance on site 1 is much better than that on site 2. Site 2 is a street with a number of parking lots in front of a shopping area. Besides humans, there are many vehicles moving or parked on the street.
Fig. 8. ROC curve: number of missed detections per ground-truth object vs. number of false alarms per frame
Table 2. Scores on different sites
The detection rate is lower because humans are often occluded by cars, while the number of false alarms increases because the blob tracker gives false positives on cars, and the cars move slowly due to heavy traffic so that the speed based verification does not help much. The speed of the system is about 0.2 FPS on a 2.8 GHz Pentium CPU; the program is coded in C++ using OpenCV library functions; no attempt at code optimization has been made.
4 Conclusion and Discussion
We applied a fully automatic multiple human tracking method to surveillance videos. The system has achieved reasonable performance on the test sequences. However, the performance does depend strongly on the complexity of the environment. Our future work will attempt to combine motion and shape cues in a stronger way to improve performance in more complex situations. Our system does not run in real time. Some speedup can be obtained by code optimization and the use of commodity parallel hardware. We can also obtain significant improvements by taking advantage of context in our algorithms. Acknowledgements. This research was partially funded by the Advanced Research and Development Activity of the U.S. Government under contract MDA-904-03-C-1786.
References 1. Fengjun Lv, Tao Zhao and Ramakant Nevatia. Camera Calibration from Video of a Walking Human, to appear in IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2006 2. B. Wu and R. Nevatia. Detection of Multiple, Partially Occluded Humans in a Single Image by Bayesian Combination of Edgelet Part Detectors. ICCV'05. Vol I: 90-97 3. B. Wu and R. Nevatia. Tracking of Multiple, Partially Occluded Humans based on Static Body Part Detection. In: CVPR'06. 4. D. Comaniciu, V. Ramesh, and P. Meer. The Variable Bandwidth Mean Shift and Data-Driven Scale Selection. ICCV'01. Vol I: 438-445 5. R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, M. Boonstra, and V. Korzhova. Performance Evaluation Protocol for Face, Person and Vehicle Detection & Tracking in Video Analysis and Content Extraction (VACE-II) CLEAR - Classification of Events, Activities and Relationships. http://www.nist.gov/speech/tests/clear/2006/CLEAR06-R106-EvalDiscDoc/Data and Information/ClearEval Protocol v5.pdf
Multi-feature Graph-Based Object Tracking Murtaza Taj, Emilio Maggio, and Andrea Cavallaro Queen Mary, University of London Mile End Road, London E1 4NS (United Kingdom) {murtaza.taj,emilio.maggio,andrea.cavallaro}@elec.qmul.ac.uk http://www.elec.qmul.ac.uk/staffinfo/andrea/
Abstract. We present an object detection and tracking algorithm that addresses the problem of tracking multiple simultaneous targets in real-world surveillance scenarios. The algorithm is based on color change detection and multi-feature graph matching. The change detector uses statistical information from each color channel to discriminate between foreground and background. Changes of global illumination, dark scenes, and cast shadows are dealt with by a pre-processing and a post-processing stage. Graph theory is used to find the best object paths across multiple frames using a set of weighted object features, namely color, position, direction and size. The effectiveness of the proposed algorithm and the improvements in accuracy and precision introduced by the use of multiple features are evaluated on the VACE dataset.
1 Introduction
Object tracking algorithms aim at establishing the correspondence between object observations at subsequent time instants by analyzing selected object features. This problem is usually divided into two major steps: the detection of foreground regions and the association of these regions over time. A typical problem in the detection step is the definition of pre-processing and post-processing strategies under challenging lighting conditions, such as cast shadows, local and global illumination variations, and dark scenes. To address the problem of cast shadows, object and scene geometry, texture, brightness or color information can be used. Shadows can be modeled using Gaussians [1], multi-variate Gaussians [2] and mixture of Gaussians [3]. Texture analysis has also been used to detect shadows, based on the assumption that shadows do not alter the texture of the underlying surface [4]. A combination of features such as luminance, chrominance, gradient density and edges can also be used [5]. Moreover, edge and region information can be integrated across multiple frames [6]. Shadow color properties have also been used [7,8,3], based on the observation that a shadow cast on a surface equally attenuates the values of all color components. Although these methods succeed in segmenting shadows in a number of test sequences, they tend to fail when shadows are very dark. In this case, contextual information such as prior knowledge of object orientation can be used.
In traffic monitoring, detecting the lanes of a road can improve the performance of shadow detection algorithms [9]. Once objects are detected, the second step aims at linking different instances of the same object over time (i.e., data association). A typical problem for data association is to disambiguate objects with similar appearance and motion. For this reason data association for object tracking can be assimilated to the motion correspondence problem. Several statistical and graph-based algorithms for tracking dense feature points have been proposed in the literature. Two methods based on statistics are the Joint Probabilistic Data-Association Filter [10] and the Multiple Hypotheses Tracking [11]. The major drawbacks of these two methods are the large number of parameters that need to be tuned and the assumptions that are needed to model the state space [12]. An example of a graph-based method is Greedy Optimal Assignment [13], which requires batch processing to deal with occlusions and detection errors, and assumes that the number of objects is constant over time. A variable number of objects is allowed when dummy nodes are introduced in the graph in order to obtain a constant number of nodes per frame [14]. More elegant solutions have also been proposed: the best motion tracks are evaluated across multiple frames, based on a simple motion model. Next, node linking is performed after pruning unlikely motions [12]. Data association can also be performed by matching the blob contour using the Kullback-Leibler distance [15]. However, this method needs large targets to accurately compute the blob contour, and the correspondence is limited to two consecutive frames. Multi-frame graph matching [12] can be applied to motion correspondence using the appearance of regions around the points; then the global appearance of the entire object is computed with PCA over the point distribution [16]. Graph matching has also been used to find the object correspondence across multiple cameras [17], analyzing both color appearance and scene entry-exit object positions. Finally, two-frame bipartite graph matching can be used to track objects in aerial videos based on gray-level templates and centroid positions [18]. This paper proposes a tracking algorithm that copes with a variety of real-world surveillance scenarios and sudden changes of the environmental conditions, and that disambiguates objects with similar appearance. To achieve these goals, the algorithm combines a statistical color change detector with a graph-based tracker that solves the correspondence problem by measuring the coherency of multiple object features, namely, color histograms, direction, position, and size. The video is equalized in case of dark scenes, and the output of the background subtraction is post-processed to cope with shadows, global illumination changes caused by the passage of clouds and local changes caused by vehicle headlights. The paper is organized as follows. Section 2 describes the object detection algorithm. In Section 3 we present the graph matching strategy for multiple object tracking. Section 4 discusses the experimental results using different sets of features and validates the proposed approach on the VACE dataset [19]. Finally, Section 5 concludes the paper.
Fig. 1. Contrast enhancement for improving object detection. (a) Reference frame; (b) current frame; (c) image difference before contrast enhancement; (d) image difference after contrast enhancement.
2 Object Detection
Foreground segmentation is performed by a statistical color change detector [20], a model-based algorithm that assumes additive white Gaussian noise on each frame. The noise amplitude is estimated for each color channel. Challenging illumination conditions typical of long surveillance videos, such as dark scenes, global and local illumination changes, and cast shadows need to be addressed separately. Dark scenes are identified by analyzing the frame intensity distribution. A scene is classified as dark when more than 75% of the pixels in a frame are in the first quartile of the intensity range. In this case contrast and brightness are improved through image equalization. Rapid global illumination changes are often associated with the passage of clouds. This results in a large number of false positive detections, especially in regions in the shade of buildings or trees. To increase the contrast, the variance σ_0 of the difference image calculated between the reference and the first image should be similar to the variance σ_i of the difference between the reference I_ref(x, y) and the current frame I_i(x, y). Let β and ζ_0 be the brightness and the initial contrast, respectively, and let σ_i = σ(|I_ref(x, y) − I_i(x, y)|). The contrast of the current difference image is modified at each iteration k using ζ_k = ζ_{k−1} ± s until the condition |σ_{i,k} − σ_0| < ε is satisfied, for a small tolerance ε. The pixel values Γ_k^j in the difference image are modified, for an 8-bit image, according to

Γ_k^j = \begin{cases} 0 & \text{if } a_k \cdot j + b_k < 0 \\ 255 & \text{if } a_k \cdot j + b_k > 255 \\ a_k \cdot j + b_k & \text{otherwise} \end{cases}   (1)

where j ∈ [1, 255] is the pixel value, a_k = \frac{1}{1 − w \cdot Δ_k}, b_k = a_k \cdot (β − Δ_k), w = 2/255 and Δ_k = \frac{ζ_k}{w \cdot ζ_0}. Fig. 1(d) shows a sample frame with increased contrast. Vehicle headlights generate important local illumination changes. To address this problem, we perform an edge-based post-processing using selective morphology that filters out misclassified foreground regions by dilating strong foreground edges and eroding weak foreground edges. Next, 8-neighbor connected components analysis is performed to generate the foreground mask.
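A rough sketch of the iterative contrast adjustment described above for 8-bit images; our reading of Δ_k, the initialization of ζ, and the step and tolerance values are assumptions, not the authors' implementation.

```python
import numpy as np

W = 2.0 / 255.0  # constant w from Eq. (1)

def contrast_map(diff, zeta, zeta0, beta=0.0):
    """Clipped linear mapping of Eq. (1) applied to a difference image.
    The dependence of Delta_k on zeta_k follows our reading of the text."""
    delta = zeta / (W * zeta0)
    a = 1.0 / max(1.0 - W * delta, 1e-3)     # guard against the degenerate case
    b = a * (beta - delta)
    return np.clip(a * diff + b, 0, 255)

def enhance_difference(ref, cur, sigma0, zeta0=255.0, s=1.0, eps=0.5, max_iter=200):
    """Adjust the contrast zeta until the std. dev. of the enhanced difference
    image matches sigma0 (the std. dev. of the first difference image)."""
    diff = np.abs(ref.astype(np.float64) - cur.astype(np.float64))
    zeta = 0.0                               # illustrative initialization
    out = contrast_map(diff, zeta, zeta0)
    for _ in range(max_iter):
        if abs(out.std() - sigma0) < eps:
            break
        zeta += s if out.std() < sigma0 else -s
        out = contrast_map(diff, zeta, zeta0)
    return out
```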
Fig. 2. Example of shadow removal to improve the accuracy of object detection. (a) Example of strong shadow; (b) difference image; (c) foreground mask after shadow segmentation; (d) final bounding box. (e) Example of multiple shadows; (f) difference image; (g) foreground mask after shadow segmentation; (h) final bounding box.
Finally, cast shadows are frequent local illumination changes in real-world sequences (Fig. 2(a), (e)) that affect the estimation of an object's shape. Many surveillance scenarios are characterized by shadows that are too dark for a successful use of color-based techniques. For this reason, we use a model-based shadow removal approach that assumes that shadows are cast on the ground. Fig. 2(c), (g) shows sample results of shadow removal. The result of the object detection step is a bounding box for each blob (Fig. 2(d), (h)). The next step is to associate subsequent detections of the same object over time, as explained in the next section.
3 Graph Matching Using Weighted Features
Data association is a challenging problem due to track management issues such as appearance and disappearance of objects, occlusions, false detections due to clutter and noisy measurements. Furthermore, data association has to be verified throughout several frames to validate the correctness of the tracks. Let {Xi }i=1...K be K sets of target detections, and v(xai ) ∈ Vi the set of vertices representing the detected targets at time i. Each v(xai ) belongs to D, a bi-partitioned digraph (i.e., a directional graph), such as the one reported in Fig. 3 (a). The candidate correspondences at different observation times are described by the gain g associated to the edges e(v(xai ), v(xbj )) ∈ E that link the vertices. To obtain a bi-partitioned graph, a split of the graph G = (V, E) is performed and two sets, V + and V − , are created as copies of V . After splitting, each vertex becomes either a source (V + ) or a sink (V − ). Each detection xai ∈ Xi is therefore represented by twin nodes v + (xai ) ∈ V + and v − (xai ) ∈ V − (Fig. 3 (c)). The graph is formed by iteratively creating new edges from the
Fig. 3. Example of digraph D for 3 frames motion correspondence. (a) The full graph. (b) A possible maximum path cover. (c) Bi-partition of some nodes of the graph.
vertices v^+(x_i^a) ∈ V^+ to the sink nodes v^−(x_K^b) associated with the new object observations X_K of the last frame. Edges represent all possible track hypotheses, including missed detections and occlusions (i.e., edges between two vertices v(x_i^a) and v(x_j^b) with j − i > 1). The best set of tracks is computed by finding the maximum weight path cover of G, as in Fig. 3(b). This step can be performed using the algorithm by Hopcroft and Karp [21] with complexity O(n^{2.5}), where n is the number of vertices in G. After the maximization procedure, a vertex without backward correspondence models a new target, and a vertex without forward correspondence models a disappeared target. The depth of the graph, K, determines the maximum number of consecutive missed-detection or occlusion frames during which an object track can still be recovered. Note that although larger values of K allow dealing with longer-term occlusions, the larger the value of K, the higher the probability of wrongly associating different targets. The gain g between two vertices is computed using the information in X_i, where the elements of the set X_i are the vectors x_i^a defining x, the state of the object:

x = [x, y, \dot{x}, \dot{y}, h, w, H],   (2)

where (x, y) is the center of mass of the object, (\dot{x}, \dot{y}) are the vertical and horizontal velocity components, (h, w) are the height and width of the bounding box, and H is the color histogram. The velocity is computed based on the backward correspondences of the nodes. If a node has no backward correspondence (i.e., object appearance), then \dot{x} and \dot{y} are set to 0. The gain for each pair of nodes x_i^a, x_j^b is computed based on the position, direction, appearance and size of a candidate target. The position gain g_1, based on the predicted and observed position of the point, is computed as

g_1(x_i^a, x_j^b) = 1 - \frac{[x_j^b - (x_i^a + \dot{x}_i^a (j-i))]^2 + [y_j^b - (y_i^a + \dot{y}_i^a (j-i))]^2}{D_x^2 + D_y^2},   (3)
where D_x and D_y are the height and width of the image, respectively. Since the gain function depends on the backward correspondences (i.e., the speed at the previous step), the greedy suboptimal version of the graph matching algorithm is used [12]. The direction gain g_2, which penalizes large deviations in the direction of motion, is

g_2(x_i^a, x_j^b) = \frac{1}{2}\left(1 + \frac{(x_j^b - x_i^a)\dot{x}_i^a(j-i) + (y_j^b - y_i^a)\dot{y}_i^a(j-i)}{\sqrt{((x_j^b)^2 + (y_j^b)^2)\,((x_i^a)^2 + (y_i^a)^2)}}\right).   (4)

The appearance gain g_3 is the distance between the color histograms of the objects, computed using the correlation method:

g_3(x_i^a, x_j^b) = \frac{\sum_{k=0}^{N} (H'_{i,a}(k) \cdot H'_{j,b}(k))}{\sqrt{\sum_{k=0}^{N} (H'_{i,a}(k)^2 \cdot H'_{j,b}(k)^2)}},   (5)

where H'(k) = H(k) - \frac{1}{N}\sum_{z=0}^{N} H(z) and N is the number of histogram bins. Finally, the size gain g_4 is computed from the absolute differences between the widths and heights of the objects represented by the nodes:

g_4(x_i^a, x_j^b) = 1 - \frac{1}{2}\left(\frac{|w_j^b - w_i^a|}{\max(w_j^b, w_i^a)} + \frac{|h_j^b - h_i^a|}{\max(h_j^b, h_i^a)}\right).   (6)

The overall gain g is a weighted linear combination of the position, direction, size and appearance gains:

g(x_i^a, x_j^b) = \alpha \cdot g_1(x_i^a, x_j^b) + \beta \cdot g_2(x_i^a, x_j^b) + \gamma \cdot g_3(x_i^a, x_j^b) + \delta \cdot g_4(x_i^a, x_j^b) - (j-i-1) \cdot \tau,   (7)

where \alpha + \beta + \gamma + \delta = 1 and \tau is a constant that penalizes the choice of shorter tracks. Since graph matching links nodes based on the highest weights, two trajectory points far from each other can be connected. To overcome this problem, gating is used and an edge is created only if g > 0.
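A compact sketch of the combined gain of Eq. (7) between two detections, with detections represented as plain dictionaries holding the state of Eq. (2); the histogram term uses OpenCV's correlation comparison as a stand-in for Eq. (5), the normalization of the direction term follows our reading of Eq. (4), and the weights are the values reported in Sect. 4.

```python
import cv2
import numpy as np

def gain(det_a, det_b, dj, img_w, img_h,
         alpha=0.40, beta=0.30, gamma=0.15, delta=0.15, tau=0.043):
    """Combined gain between detection a at frame i and detection b at frame j,
    dj = j - i frames apart.  Each detection: dict with keys x, y, vx, vy, w, h
    and 'hist' (a float32 color histogram, as required by cv2.compareHist)."""
    px = det_a['x'] + det_a['vx'] * dj                    # predicted position (Eq. 3)
    py = det_a['y'] + det_a['vy'] * dj
    g1 = 1.0 - ((det_b['x'] - px) ** 2 + (det_b['y'] - py) ** 2) / (img_w ** 2 + img_h ** 2)

    dot = (det_b['x'] - det_a['x']) * det_a['vx'] * dj + \
          (det_b['y'] - det_a['y']) * det_a['vy'] * dj    # direction coherence (Eq. 4)
    norm = np.sqrt((det_b['x'] ** 2 + det_b['y'] ** 2) *
                   (det_a['x'] ** 2 + det_a['y'] ** 2)) + 1e-6
    g2 = 0.5 * (1.0 + dot / norm)

    g3 = cv2.compareHist(det_a['hist'], det_b['hist'], cv2.HISTCMP_CORREL)  # Eq. 5

    g4 = 1.0 - 0.5 * (abs(det_b['w'] - det_a['w']) / max(det_b['w'], det_a['w']) +
                      abs(det_b['h'] - det_a['h']) / max(det_b['h'], det_a['h']))  # Eq. 6

    return alpha * g1 + beta * g2 + gamma * g3 + delta * g4 - (dj - 1) * tau  # Eq. 7
```

An edge between the two nodes would then be created only when the returned gain is positive, implementing the gating described above.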
4 Experimental Results
We present experimental results on the VACE [19] dataset. The sequences are in CIF format at 25 Hz. To evaluate the benefits introduced by different features, four configurations are compared: C-T, the baseline system with center of mass only; CB-T, the system with center of mass and bounding box; CBD-T, the system with center of mass, bounding box and direction; and CBDH-T, the proposed system with all the previous features and the appearance model based on color histograms. The parameters used in the simulations are the same for all scenarios. The change detector has σ = 1.8 and a 3×3-pixel kernel. A 32-bin histogram is used for each color channel. The weights used for graph matching are: α = 0.40
Fig. 4. Comparison of objective results using different sets of features for detection and tracking on the Broadway/Church scenario, from the VACE dry run dataset (C-T: center of mass only; CB-T: center of mass and bounding box; CBD-T: center of mass, bounding box and direction; CBDH-T: the proposed system with all the previous features and the appearance model based on color histograms). Left: scores for person detection and tracking. Right: scores for vehicle detection and tracking.
(position), β = 0.30 (direction), γ = 0.15 (histogram), δ = 0.15 (size), and τ = 0.043. The objective evaluation is based on the 4 scores of the VACE protocol, namely Multiple Object Detection Accuracy (MODA), Multiple Object Detection Precision (MODP), Multiple Object Tracking Accuracy (MOTA) and Multiple Object Tracking Precision (MOTP) [19]. In order to use the VACE evaluation tool and the available ground truth, a simple pedestrian/vehicle classifier is added to the system, whose decision is based on the ratio of the width and the height of the bounding box, followed by a temporal voting mechanism. Scores obtained with the different combinations of features are shown in Fig. 4. The results on the 4 scores show that the proposed algorithm (CBDH-T) produces a consistent improvement, especially in the case of vehicle tracking. This performance is not surprising, as vehicles tend to have more distinctive colors than pedestrians. The use of direction as a feature improves detection and tracking precision more than detection and tracking accuracy (see Fig. 4, CBD-T vs. CB-T). Sample tracking results for CBDH-T are shown in Fig. 5. Detected objects are identified by a color-coded bounding box, their respective trajectories and an object ID (top left of the bounding box). The results of the classification into one of the two classes, namely pedestrians (P) and vehicles (V), are shown on the top of the bounding box. To conclude, in Fig. 6 we analyze the limits of the proposed algorithm. CBDH-T tends to merge tracks of small targets, such as vehicles far from the camera, when limited color information is available and the frame-by-frame motion direction is not reliable (Fig. 6 (a)). Another failure mode is due to the foreground
Fig. 5. Sample tracking results using the proposed detection and tracking algorithm (CBDH-T) on the VACE dataset
detector: when objects are too close to each other, such as pedestrians in groups or parked vehicles (Fig. 6 (b)), only one blob (i.e., one bounding box) is generated. We also noticed some instability in the detection of vehicles in dark scenes, due to variations in the illumination generated by the headlights. In Fig. 6 (d) the features used by the graph matching algorithm change drastically compared to Fig. 6 (c) because of a change in the object bounding box, thus generating an identity switch. A possible solution to both problems is to add to the system a detection algorithm based on prior knowledge (models) of the objects.
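The pedestrian/vehicle classifier used for scoring is only described at a high level; the sketch below shows one plausible realization, with a hypothetical aspect-ratio threshold and voting window.

```python
from collections import deque, Counter

class AspectRatioClassifier:
    """Label a track as pedestrian (P) or vehicle (V) from the width/height
    ratio of its bounding boxes, smoothed by majority voting over a window."""

    def __init__(self, ratio_threshold=1.2, window=25):
        self.ratio_threshold = ratio_threshold   # hypothetical value
        self.votes = deque(maxlen=window)        # temporal voting buffer

    def update(self, box_w, box_h):
        frame_label = 'V' if box_w / float(box_h) > self.ratio_threshold else 'P'
        self.votes.append(frame_label)
        return Counter(self.votes).most_common(1)[0][0]

# usage: call update() once per frame with the tracked bounding box size
clf = AspectRatioClassifier()
for w, h in [(30, 80), (32, 78), (90, 40)]:
    label = clf.update(w, h)
```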
Fig. 6. Examples of failure modes of the proposed algorithm. (a) Track ambiguity between two vehicles driving on opposite lanes far from the camera (see zoom). (b) Vehicles merged by the detector due to their proximity. (c),(d) Lost track due to variations in the object features caused by a significant change of the bounding box size (the two frames show the same vehicle at different time instants).
5 Conclusions
We presented a multiple object detection and tracking algorithm based on statistical color change detection and graph matching. The graph matching procedure uses multiple object features: position, color, size and direction. Experimental results showed that increasing the number of features and appropriately weighting them is an effective solution for improving tracking results in challenging real-world surveillance sequences, such as those of the VACE dataset. The algorithm demonstrated the ability to cope with changes in global and local illumination conditions, using the same set of parameters throughout the dataset. Future work includes the use of multiple views to increase the robustness of the detection and tracking algorithm and the integration of a state-of-the-art object classifier to improve the detection results.
References 1. Chang, C., Hu, W., Hsieh, J., Chen, Y.: Shadow elimination for effective moving object detection with gaussian models. In: Proc. of IEEE Conf. on Pattern Recog. Volume 2. (2002) 540–543 2. Porikli, F., Thornton, J.: Shadow flow: A recursive method to learn moving cast shadows. In: Proc. of IEEE International Conference on Computer Vision. Volume 1. (2005) 891–898 3. Martel-Brisson, N., Zaccarin, A.: Moving cast shadow detection from a gaussian mixture shadow model. In: Proc. of IEEE Conf. on Comp. Vis. and Pattern Recog. Volume 2. (2005) 643–648 4. Javed, O., Shah, M.: Tracking and object classification for automated surveillance. In: Proc. of the European Conference on Computer Vision, Copenhagen (2002) 5. Fung, G., Yung, N., Pang, G., Lai, A.: Effective moving cast shadow detection for monocular color image sequences. In: Proc. of IEEE International Conf. on Image Analysis and Processing. (2001) 404–409 6. Xu, D., Liu, J., Liu, Z., Tang, X.: Indoor shadow detection for video segmentation. In: IEEE Fifth World Congress on Intelligent Control and Automation (WCICA). Volume 4. (2004)
7. Huang, J., Xie, W., Tang, L.: Detection of and compensation for shadows in colored urban aerial images. In: IEEE Fifth World Congress on Intelligent Control and Automation (WCICA). Volume 4. (2004) 3098–3100 8. Salvador, E., Cavallaro, A., Ebrahimi, T.: Shadow identification and classification using invariant color models. In: Proc. of IEEE International Conf. on Acoustics, Speech, and Signal Processing. Volume 3. (2001) 1545–1548 9. Hsieh, J., Yu, S., Chen, Y., Hu, W.: A shadow elimination method for vehicle analysis. In: Proc. of IEEE Conf. on Pattern Recog. Volume 4. (2004) 372–375 10. Fortman, T., Bar-Shalom, Y., Scheffe, M.: Sonar tracking of multiple targets using joint probabilistic data association. IEEE J. Oceanic Eng. 8(3) (1983) 173–184 11. Reid, D.: An algorithm for tracking multiple targets. IEEE Trans. Automat. Contr. AC-24 (1979) 843–854 12. Shafique, K., Shah, M.: A noniterative greedy algorithm for multiframe point correspondence. IEEE Trans. Pattern Anal. Machine Intell. 27 (2005) 51–65 13. Veenman, C., Reinders, M., Backer, E.: Resolving motion correspondence for densely moving points. IEEE Trans. Pattern Anal. Machine Intell. 23(1) (2001) 54–72 14. Rowan, M., Maire, F.: An efficient multiple object vision tracking system using bipartite graph matching. In: FIRA. (2004) 15. Chen, H., Lin, H., Liu, T.: Multi-object tracking using dynamical graph matching. In: Proc. of IEEE Conf. on Comp. Vis. and Pattern Recog. Volume 2. (2001) II– 210–II–217 16. Mathes, T., Piater, J.: Robust non-rigid object tracking using point distribution models. In: British Machine Vision Conference, Oxford (2005) 17. Javed, O., Rasheed, Z., Shafique, K., Shah, M.: Tracking across multiple cameras with disjoint views. In: The Ninth IEEE International Conference on Computer Vision, Nice, France (2003) 18. Cohen, I., Medioni, G.G.: Detecting and tracking moving objects for video surveillance. In: CVPR, IEEE Computer Society (1999) 2319–2325 19. Kasturi, R.: Performance evaluation protocol for face, person and vehicle detection & tracking in video analysis and content extraction (VACE-II). Computer Science & Engineering University of South Florida, Tampa. (2006) 20. Cavallaro, A., Ebrahimi, T.: Interaction between high-level and low-level image analysis for semantic video object extraction. EURASIP Journal on Applied Signal Processing 6 (2004) 786–797 21. Hopcroft, J., Karp, R.: An n2.5 algorithm for maximum matchings in bipartite graphs. SIAM J. Computing 2(4) (1973) 225–230
Multiple Vehicle Tracking in Surveillance Videos Yun Zhai, Phillip Berkowitz, Andrew Miller, Khurram Shafique, Aniket Vartak, Brandyn White, and Mubarak Shah Computer Vision Laboratory School of Electrical Engineering and Computer Science University of Central Florida Orlando, Florida 32826, USA
Abstract. In this paper, we present KNIGHT, a Windows-based standalone object detection, tracking and classification software package, which is built upon Microsoft Windows technologies. The object detection component assumes stationary background settings and models background pixel values using a Mixture of Gaussians. Gradient-based background subtraction is used to handle scenarios of sudden illumination change. A connected-component algorithm is applied to the detected foreground pixels to find object-level moving blobs. The foreground objects are further tracked based on a pixel-voting technique with occlusion and entry/exit reasoning. Motion correspondences are established using the color, size, spatial and motion information of objects. We have proposed a texture-based descriptor to classify moving objects into two groups: vehicles and persons. In this component, feature descriptors are computed from image patches, which are partitioned by concentric squares. SVM is used to build the object classifier. The system has been used in the VACE-CLEAR evaluation forum for the vehicle tracking task. The corresponding system performance is presented in this paper.
1 Introduction
Object detection, tracking and classification are the key modules in most security systems and video surveillance applications. Given correctly detected and consistently tracked objects, further analysis can be performed for activity studies. In this paper, we present KNIGHT, an automated surveillance system for stationary camera settings. The system and its underlying mechanisms are solely based on the work by Javed et al. [8][9][10]. References and extensive result sequences can be found at http://www.cs.ucf.edu/∼vision/projects/Knight/Knight.html. KNIGHT has also been applied extensively in various research projects, including Crime Scene Detection (funded by Orlando Police Department), Night Time Surveillance (funded by DARPA/PercepTek Robotics) and Visual Monitoring of Railroad Crossings (funded by Florida Department of Transportation). The problems of detecting and tracking moving objects in surveillance videos have been widely studied. PFinder [19] uses a uni-modal background model to locate interesting objects. It tracks the full body of a person though it assumes that only a single person is present in the scene. In the approach proposed by Stauffer
and Grimson [17], an adaptive multi-modal background subtraction method that can deal with slow changes in illumination, repeated motion from background clutter and long-term scene changes is employed. Ricquebourg and Bouthemy [14] proposed tracking people by exploiting spatiotemporal slices. Their detection scheme involves the combined use of intensity, temporal differences between three successive images, and comparison of the current image to a background reference image that is reconstructed and updated online. W4 [5] uses dynamic appearance models to track people. Single persons and groups are distinguished using projection histograms. Each person in a group is tracked by tracking the head of that person. Our proposed KNIGHT system consists of three main components: object detection, tracking and classification. In the object detection module, background pixel values are modelled by a mixture of Gaussian distributions, by which foreground pixels are detected. Gradient information is used to handle situations where there are sudden illumination changes. A connected-component algorithm is then applied to obtain high-level moving objects. The detected foreground objects are then tracked across frames by establishing motion correspondences using the color, size, spatial and motion information of objects. Corresponding occlusion and entry/exit reasoning is applied to handle both full and partial occlusions. In KNIGHT, object classification is achieved by incorporating a texture-based feature descriptor. Given an image patch containing the target object, such as a car, a 2D Gaussian mask is applied to the gradient map of the patch. Then, the weighted gradient map is divided into a set of concentric circles. For each circle, a gradient histogram is computed using 8 directions. The intensity of the patch is also incorporated in the feature vector. The classifier is constructed using Support Vector Machines (SVM). The rest of this paper is organized as follows: Section 2 describes the KNIGHT system and its underlying mechanisms for object detection, tracking and classification. Section 3 presents the system performance on the VACE-CLEAR evaluation task of vehicle tracking in the surveillance domain. Finally, Section 4 concludes our work.
2 System Description
In this section, we describe the underlying object detection, tracking and classification mechanisms in KNIGHT and the system's graphical user interface.
2.1 Object Detection
In order to track a target object, the object must be detected in the acquired video frames. In our framework, an object refers to any type of moving blob in the imagery. It could be a person, a vehicle, an animal, a group of people, etc. In our system, we assume stationary background settings with minor dynamic motions allowed, such as moving tree leaves or water waves. We use an adaptive background subtraction method proposed by Stauffer and Grimson [17]. In this
method, the color values of each pixel across time are modelled by a mixture of K multi-variate Gaussian distributions. The probability of the k-th Gaussian at pixel p_{i,j} is computed as

N(x_{i,j} \mid m_{i,j}^k, \Sigma_{i,j}^k) = \frac{1}{(2\pi)^{n/2} |\Sigma_{i,j}^k|^{1/2}} \exp\left(-\frac{1}{2}(x_{i,j} - m_{i,j}^k)^T (\Sigma_{i,j}^k)^{-1} (x_{i,j} - m_{i,j}^k)\right),   (1)
where x_{i,j} is the color vector of pixel p_{i,j} in RGB color space, and m_{i,j}^k and \Sigma_{i,j}^k are the mean vector and the covariance matrix of the k-th Gaussian distribution, respectively. For each pixel, its new value x_{i,j}^t at time t is checked against all K Gaussian distributions, and the one that gives the minimum distance is updated accordingly. Mahalanobis distance is used as the matching metric. If a match between x_{i,j} and the target Gaussian distribution is found, i.e., the distance between them is less than the tolerance, the corresponding parameters of the matched Gaussian distribution are updated using an exponential decay scheme with a learning factor. In addition, the weight of the matched distribution is incremented by 1. If no match is found, the distribution with the lowest weight is replaced with a new distribution having x_{i,j}^t as the mean and a pre-defined value as the variance. Based on these two conditions, the Gaussian distributions are gradually updated, and the ones with weights greater than a threshold, T_w, are incorporated in the set of distributions belonging to the background. To achieve region-level foreground objects from detected foreground pixels, the connected-component algorithm is applied, and morphological filtering is performed for noise removal. It is well known that color-based background subtraction methods are not able to handle sudden light changes. To overcome this problem, we propose a gradient-based subtraction technique. For each pixel in the image, we compute \nabla = (\nabla_m, \nabla_d) as its feature vector, where \nabla_m is the gradient magnitude, i.e.,
\nabla_m = \sqrt{f_x^2 + f_y^2}, and \nabla_d is the gradient direction, i.e., \nabla_d = \tan^{-1}(f_y / f_x). The gradient information is computed from the gray-level images. Since the color value x_{i,j} of p_{i,j} is normally distributed, its corresponding gray-level value g_{i,j} is also normally distributed with mean \mu_{i,j} and standard deviation \sigma_{i,j}. Let f_x = g_{i+1,j} - g_{i,j} and f_y = g_{i,j+1} - g_{i,j}. We observe that f_x is distributed normally with mean \mu_{f_x} = \mu_{i+1,j} - \mu_{i,j} and variance \sigma_{f_x}^2 = \sigma_{i+1,j}^2 + \sigma_{i,j}^2. Similarly, f_y is also distributed normally with mean \mu_{f_y} = \mu_{i,j+1} - \mu_{i,j} and variance \sigma_{f_y}^2 = \sigma_{i,j+1}^2 + \sigma_{i,j}^2. Knowing the distributions of f_x and f_y and standard distribution transformation methods [4], we determine the distribution of the feature vector (\nabla_m, \nabla_d) as
F(\nabla_m, \nabla_d) = \frac{\nabla_m}{2\pi \sigma_{f_x} \sigma_{f_y} \sqrt{1-\rho^2}} \exp\left(-\frac{z}{2(1-\rho^2)}\right),   (2)

where

z = \left(\frac{\nabla_m \cos\nabla_d - \mu_{f_x}}{\sigma_{f_x}}\right)^2 - 2\rho \frac{(\nabla_m \cos\nabla_d - \mu_{f_x})(\nabla_m \sin\nabla_d - \mu_{f_y})}{\sigma_{f_x}\sigma_{f_y}} + \left(\frac{\nabla_m \sin\nabla_d - \mu_{f_y}}{\sigma_{f_y}}\right)^2

and \rho = \frac{\sigma_{i,j}^2}{\sigma_{f_x}\sigma_{f_y}}. All the parameters involved in Eqn. (2) can be calculated from the
Fig. 1. Sample results of background subtraction for one dry-run sequence. (a) Image from the input sequence, and (b) Background subtraction output.
means and variances of the color distributions. Given a new input image, gradient magnitude and direction values are computed. If the probability of a gradient vector being generated from the background gradient distributions is below the tolerance, then the corresponding pixel belongs to the foreground. Otherwise, it is a background pixel. An example of background subtraction using the color and gradient methods is shown in Fig. 1.
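For illustration, the sketch below obtains region-level moving blobs with OpenCV's built-in MOG2 background subtractor and connected components; it is a stand-in for KNIGHT's own per-pixel mixture model and does not reproduce the gradient-based handling of sudden illumination changes.

```python
import cv2

# OpenCV's MOG2 subtractor as a stand-in for the per-pixel K-Gaussian model
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                detectShadows=False)

def detect_blobs(frame, min_area=100):
    """Return bounding boxes of region-level foreground objects."""
    mask = subtractor.apply(frame)                        # per-pixel foreground mask
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN,         # morphological noise removal
                            cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3)))
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    boxes = []
    for i in range(1, n):                                 # label 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area:
            boxes.append((x, y, w, h))
    return boxes
```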
2.2 Object Tracking
The foreground objects detected by the previously described method are further tracked by establishing their correspondences using object color, size, spatial and motion models. For an object P_k with a size of n_k pixels, the shape is modelled by a Gaussian distribution, s_k(x), with variance equal to the sample variance of the person silhouette. The color is modelled by a normalized histogram, h_k(c(x)), where the function c(·) returns the color at pixel position x in the current frame. A linear velocity predictor is used to model the motion. Each pixel p_j in the incoming image, where p_j ∈ R_i (a detected foreground region), votes for the label of the object for which the joint probability of shape and color is maximum, arg max_k (s_k(x) h_k(c(x))). Then, the following tracking mechanisms with occlusion and entry/exit reasoning are used (a code sketch of this voting scheme follows below):
– if the number of votes V_{i,k} (votes from R_i for P_k) is a significant percentage, say T, of n_k, i.e., (V_{i,k}/n_k) > T, and also (V_{i,q}/n_q) < T, where k ≠ q, then all the pixels in R_i are used to update the models of P_k. In case more than one region satisfies this condition, all regions are used to update the object model. This case represents an object splitting into multiple regions.
– if (V_{i,k}/n_k) > T and (V_{i,q}/n_q) > T, then this case represents situations where two objects merge into a single region. In this case, only those pixels in R_i that voted for P_k will be used to update the models of P_k.
– if (V_{i,k}/n_k) < T, ∀i, i.e., no observation matches model k. This might be due to the complete occlusion of the object, or the object might have exited the field of view. If the predicted position of the object is near the frame boundary, the object is determined to be out of the frame. Otherwise, the
Fig. 2. Six images are shown, where a vehicle has been consistently tracked. Note that KNIGHT is able to handle occlusion, such as the situation in this example: the car is occluded by a street pole, but the tracking label remains consistent.
mean of the spatial model is updated by a linear velocity prediction. The rest of the parameters are kept constant.
– if (V_{i,k}/n_k) < T, ∀k, i.e., region R_i does not match any model. This means it is a new entry. A new object model is created for this region.
One example of tracking is shown in Fig. 2, where a vehicle is tracked with a consistent label. It should be noted that KNIGHT is able to handle occlusions, such as the one presented in this example, where the street pole divides the vehicle into halves, but the tracking is still consistent.
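A schematic rendering of the voting rules above, assuming each object model exposes callable shape and color probability functions; the threshold value and the split/merge bookkeeping are simplified.

```python
from collections import defaultdict

def vote_region(region_pixels, frame, models):
    """Each pixel of a foreground region votes for the object model that
    maximizes the joint shape/color probability s_k(x) * h_k(c(x)).
    Pixels are (row, col) tuples; frame[x] returns the pixel color."""
    votes = defaultdict(list)
    for x in region_pixels:
        scores = {k: m['shape'](x) * m['color'](frame[x]) for k, m in models.items()}
        best = max(scores, key=scores.get)
        votes[best].append(x)
    return votes

def assign_region(votes, models, T=0.3):
    """Apply the split/merge/entry rules: model k claims its voted pixels when
    its vote share V_k / n_k exceeds T; an unmatched region spawns a new model."""
    claimed = {k: pix for k, pix in votes.items()
               if len(pix) / float(models[k]['size']) > T}
    if not claimed:
        return 'new_object'      # region matches no model -> new entry
    return claimed               # pixels per matched model (split/merge handled upstream)
```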
2.3 Object Classification
In this evaluation, we have developed an object classification module. The proposed classification technique utilizes a texture-based feature descriptor. A Support Vector Machine (SVM) is applied to construct the classifier. Objects of a given category, such as humans or vehicles, share common shapes. Based on this fact, we utilize the textural information of the detected moving objects. Pixel gradient magnitude, gradient direction and normalized intensity value are incorporated in our model. The image patches are scaled to have a uniform dimension. The gradient patches are convolved with a 2D Gaussian mask with independent variances σ_x^2 and σ_y^2. The purpose of this convolution is to give a hierarchy of importance to the gradient patches. Given the convolved patches, a weighted histogram Θ is computed. The gradient directions are quantized into 36 bins. Each bin is accumulated by adding the magnitudes of the corresponding gradients. To handle rotation invariance, the gradient histogram is shifted by detecting all bins with significant weights, denoted by
Fig. 3. Examples of feature descriptors on vehicle and human. (a) Input image patches. (b) Gradient information. (c) Concentric filters.
{θ_1, ..., θ_k}. This shifting offsets the gradient directions and creates N matrices K, where K_n(i, j) = ω(i, j) − θ_n and ω is the gradient direction. The gradient magnitude map M is divided into k regions. The regions are concentric circles, with distance δ between adjacent circles. A histogram of gradient magnitudes in 8 directions is extracted from each of the circular regions. The histograms are then concatenated to form the region descriptor for the image patch. This descriptor is further combined with the intensity vector of the image patch. Some examples are shown in Fig. 3. Object classifiers are constructed using SVM. In our experiments, 150∼200 images are used in each of the positive and negative training sets.
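A sketch of a concentric-region gradient descriptor in this spirit; the number of rings, the number of direction bins and the Gaussian weighting are illustrative parameters rather than the values used by KNIGHT.

```python
import numpy as np

def concentric_descriptor(patch, num_rings=4, num_bins=8):
    """Build a texture descriptor from a grayscale patch: gradient-magnitude
    histograms over `num_bins` directions, one histogram per concentric ring,
    followed by the normalized intensity values of the patch."""
    gy, gx = np.gradient(patch.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)

    # Gaussian weighting: emphasize the center of the patch
    h, w = patch.shape
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    weight = np.exp(-(((yy - cy) / (0.5 * h)) ** 2 + ((xx - cx) / (0.5 * w)) ** 2))
    mag = mag * weight

    radius = np.hypot(yy - cy, xx - cx)
    ring = np.minimum((radius / (radius.max() / num_rings)).astype(int), num_rings - 1)
    direction = (ang / (2 * np.pi) * num_bins).astype(int) % num_bins

    desc = np.zeros((num_rings, num_bins))
    for r in range(num_rings):
        for b in range(num_bins):
            desc[r, b] = mag[(ring == r) & (direction == b)].sum()
    return np.concatenate([desc.ravel(), patch.ravel() / 255.0])
```

The resulting vectors would then be fed, together with their pedestrian/vehicle labels, to an off-the-shelf SVM trainer.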
3 Evaluation Results and Performance Analysis
KNIGHT has been applied in the vehicle tracking task of the VACE-CLEAR evaluation workshop. There were a total of fifty video sequences in the final testing set. Moving objects in the videos include both walking persons and moving vehicles. The task definition requires only the tracking of vehicles. Therefore, object classification is applied to filter out other moving objects, such as people. In the dry-run stage, four accuracy measures were used: (1) Multiple Object Detection Precision (MODP), (2) Multiple Object Detection Accuracy (MODA), (3) Multiple Object Tracking Precision (MOTP), and (4) Multiple Object Tracking Accuracy (MOTA). The detailed results are shown in Fig. 4.
File   MODP      MODA       MOTP      MOTA
1      0.20358   -0.48457   0.20000   -0.51977
2      0.25872   -1.11765   0.25872   -1.12945
3      0.30791   -0.26500   0.29504   -0.27457
4      0.27664    0.23810   0.26307    0.23209
5      0.18169    0.03807   0.18646    0.01402
Fig. 4. Dry run results for five testing videos. Four accuracy scores are shown for each video sequence: MODP, MODA, MOTP and MOTA.
Index   MOTP       MOTA
1       0.569311   0.725000
5       0.516373   0.408737
10      0.568950   0.412500
Fig. 5. Refined results on three of the final testing sequences. Significant improvement in performance is obtained by the correction of the video sampling rate.
In the final review workshop, only MOTP and MOTA were presented. The UCF team achieved relatively low performance in the final evaluation stage. The average tracking precision (MOTP) is 0.1980 when the sequences that give "not defined" scores are not considered. The average tracking accuracy (MOTA) is -0.0827 considering all 50 sequences and -0.1488 excluding the undefined sequence results. This was far from what we expected based on the dry-run results. After careful examination, we located the problem. The main cause of the low performance is the sampling rate of the MPEG videos. In KNIGHT, we feed the program with sequences of PPM images, which are extracted from the MPEG videos. The dry-run MPEG videos are at 29.97 FPS, while the testing sequences are at 25 FPS. The frame rate was incorrectly set when the frame extraction took place, causing a non-static offset between our submission files and the ground-truth files. This is the reason why several sequences returned "not defined" MOTPs. We re-ran the evaluation for three of the final testing sequences to verify our argument and achieved a large boost in both scores. Fig. 5 shows the refined results.
4 Conclusions
In this paper, we have presented KNIGHT, an automated object detection and tracking system for surveillance videos. The system is composed of two major components: object detection and object tracking. Background subtraction based on mixture of Gaussians is used in the detection of foreground image
pixels. A connected-component algorithm is applied to find region-level foreground objects, and a gradient-based subtraction technique is used to handle sudden illumination changes. Object tracking is achieved by establishing motion correspondences using the color, size, spatial and motion information of objects. KNIGHT has been used in the vehicle tracking task of the VACE-CLEAR evaluation workshop, and the corresponding system performance is presented in this paper.
Acknowledgement Some materials presented in the paper are based upon the work funded in part by the U.S. Government. Any opinions, findings and conclusions or recommendations expressed in these materials are those of the authors and do not necessarily reflect the views of the U.S. Government.
References 1. S. Ali and M. Shah, "A Supervised Learning Framework for Generic Object Detection in Images", ICCV, 2005. 2. H. Bischof, H. Wildenauer and A. Leonardis, "Illumination Insensitive Eigenspaces", ICCV, 2001. 3. A. Bobick and J. Davis, "The Recognition of Human Movements Using Temporal Templates", IEEE T-PAMI, Vol.23, No.3, March 2001. 4. G. Casella and R. Berger, Statistical Inference, 2nd Edition, 2001. 5. I. Haritaoglu, D. Harwood and L. Davis, "W4: Real Time Surveillance of People and Their Activities", IEEE T-PAMI, Vol.22, No.8, August 2000. 6. T. Horprasert, D. Harwood and L. Davis, "A Statistical Approach for Real-Time Robust Background Subtraction and Shadow Detection", IEEE Frame Rate Workshop, 1999. 7. D. Jacobs, P. Bellhumeur and R. Basri, "Comparing Images Under Variable Lighting", CVPR, 1998. 8. O. Javed, "Scene Monitoring with a Forest of Cooperative Sensors", Ph.D. Dissertation, 2005. 9. O. Javed and M. Shah, "Tracking and Object Classification for Automated Surveillance", ECCV, 2002. 10. O. Javed, K. Shafique and M. Shah, "A Hierarchical Approach to Robust Background Subtraction Using Color and Gradient Information", IEEE Workshop on Motion and Video Computing, 2002. 11. D. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints", IJCV, 2004. 12. K. Mikolajczyk and C. Schmid, "A Performance Evaluation of Local Descriptors", CVPR, 2003. 13. K. Rangarajan and M. Shah, "Establishing Motion Correspondences", CVGIP, July, 1991. 14. Y. Ricquebourg and P. Bouthemy, "Real Time Tracking of Moving Persons by Exploiting Spatiotemporal Image Slices", IEEE T-PAMI, Vol.22, No.8, August 2000.
15. R. Rosin and T. Ellis, "Image Difference Threshold Strategies and Shadow Detection", BMVC, 1995. 16. I.K. Sethi and R. Jain, "Finding Trajectories of Feature Points in Monocular Image Sequences", IEEE T-PAMI, January, 1987. 17. C. Stauffer and W.E.L. Grimson, "Learning Patterns of Activity Using Real Time Tracking", IEEE T-PAMI, Vol.22, No.8, August 2000, pp 747-767. 18. C.J. Veenman, M.J.T. Reinders and E. Backer, "Resolving Motion Correspondence for Densely Moving Points", IEEE T-PAMI, January, 2000. 19. C. Wren, A. Azarbayejani, T. Darrel and A. Pentland, "PFinder, Real Time Tracking of Human Body", IEEE T-PAMI, Vol.19, No.7, July 1997.
Robust Appearance Modeling for Pedestrian and Vehicle Tracking Wael Abd-Almageed and Larry S. Davis Institute for Advanced Computer Studies University of Maryland College Park, MD 20742 {wamageed, lsd}@umiacs.umd.edu
Abstract. This paper describes a system for tracking people and vehicles for stationary-camera visual surveillance. The appearance of objects being tracked is modeled using mixtures of mixtures of Gaussians. Particle filters are used to track the states of objects. Results show the robustness of the system to various lighting and object conditions.
1 Introduction
Detecting and tracking moving people and vehicles is a critical task in any visual surveillance system. Without robust detection and tracking, further video understanding tasks such as activity recognition or abnormal activity detection are not possible. Robust, real-time tracking algorithms must satisfy a few characteristics. First, an accurate appearance model must be estimated for the objects being tracked as well as the background. Second, the appearance models must be parameter-light or preferably nonparametric in order to increase the level of autonomy of the tracker. Finally, computing the appearance model must not be computationally expensive, to facilitate real-time performance. In this paper we use our previous work on density estimation using mixtures of mixtures of Gaussians [1] to model the appearance of the objects and the background. Tracking the state of the object is achieved using particle filters. This paper is organized as follows. Section 2 briefly discusses background subtraction as a classic method for detecting moving objects. For more details on this algorithm, the reader is referred to [1]. In Section 3 we discuss appearance modeling using mixtures of mixtures of Gaussians. The particle filter tracker is introduced in Section 4. Results are presented in Section 5 for tracking people and vehicles under different lighting conditions.
2 Moving Object Detection
Detecting moving objects in stationary camera surveillance is classically performed using background subtraction. To build a background image model, I_BG, we use a simple median filtering approach, as shown in Equation (1):
I_{BG}(x, y) = \mathrm{median}_{i=1,\dots,N_{BG}} \, I_i(x, y),   (1)
where N_{BG} is the number of images used to model the background. The probability that a given pixel belongs to the moving foreground F is given by Equation (2):

p((x, y) \in F) = 1 - \exp\left(-\frac{(I(x, y) - I_{BG}(x, y))^2}{\sigma_F^2}\right),   (2)

where \sigma_F is a motion-sensitivity system parameter. Background subtraction is followed by a series of morphological operations to remove noise and very small moving objects. Connected component analysis is then applied to the resulting image in order to find the independently moving objects. The appearance of each object is modeled using the algorithm described in Section 3, and a tracker is instantiated as will be shown in Section 4.
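A direct sketch of Equations (1) and (2), assuming grayscale frames stacked in a NumPy array and an illustrative value of σ_F.

```python
import numpy as np

def median_background(frames):
    """Eq. (1): pixel-wise median over N_BG frames (shape: N_BG x H x W)."""
    return np.median(frames, axis=0)

def foreground_probability(frame, background, sigma_f=20.0):
    """Eq. (2): probability that each pixel belongs to the moving foreground."""
    diff = frame.astype(np.float64) - background
    return 1.0 - np.exp(-(diff ** 2) / (sigma_f ** 2))

# usage with hypothetical shapes: a short initialization clip of 30 frames
frames = np.random.randint(0, 256, size=(30, 240, 320))
bg = median_background(frames)
p_fg = foreground_probability(frames[-1], bg)
mask = p_fg > 0.5          # to be cleaned by morphology and connected components
```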
3 Appearance Modeling Using Mixtures of Mixtures
Let Y = \{x_i\}_{i=1}^{M} be a set of M vectors to be modeled. If we apply the mean-shift mode finding algorithm, as proposed in [2], and only retain the modes with positive definite Hessian, we obtain a set of m modes Y^c = \{x_j^c\}_{j=1}^{m} which represent the local maxima of the density function, where m \ll M. For details on computing the Hessian, the reader is referred to [3]. To infer the structure of the data, we start by partitioning Y into m partitions, each of which corresponds to one of the detected modes. For all vectors of Y we compute a Mahalanobis-like distance δ defined by

\delta(x_i \mid j) = (x_i - x_j^c)^T P_j (x_i - x_j^c),\quad i = 1, 2, \dots, M \ \text{and}\ j = 1, 2, \dots, m,   (3)

where P_j is the Hessian of mode j. The rationale here, as explained in [3], is to replace the covariance matrix by the Hessian, which represents the local curvature around the mode x_j^c. Each vector is then assigned to a specific mode according to Equation (4):

C(i) = \arg\min_j \delta(x_i \mid j),\quad j = 1, 2, \dots, m.   (4)
The data set can now be partitioned as

Y = \bigcup_{j=1}^{m} Y_j,   (5)

where

Y_j = \{\forall x_i \in Y;\ C(i) \equiv j\}.   (6)
Each of the detected modes corresponds to either a single Gaussian or a mixture of more than one Gaussian, based on the complexity of the underlying density function. To determine the complexity of the density around a given mode x_j^c, we model the partition data Y_j using a mixture of Gaussians specific to partition j. In other words,

p(x \mid \Theta^j) = \sum_{i=1}^{k} \pi_i N(x, \mu_i, \Sigma_i),   (7)

where \Theta^j is the parameter set of a k-component mixture associated with mode x_j^c. The initial values for the mean vectors are all set to x_j^c. The initial values for the covariance matrices are all set to P_j. Since the structure of the data around x_j^c is unknown, we repeat the process for a search range of mixture complexities [k_{min}, k_{max}] and compute the Penaltyless Information Criterion (PIC) introduced in [1] for each complexity. The mixture that minimizes the PIC is chosen to represent the given partition. Applying PIC to all partitions results in m mixtures of Gaussians with different complexities. The underlying density of the entire data set Y is now modeled as a mixture of mixtures of Gaussians as follows:

p(x \mid \Theta) = \sum_{j=1}^{m} \omega_j \, p(x \mid \Theta^j),   (8)

where \Theta = \{\Theta^j, \omega_j;\ j = 1, 2, \dots, m\} is the set of all parameters. (Note that we extend the notation \Theta here.) Finally, the weights of the mixtures \omega_j are computed according to Equation (9):

\omega_j = \frac{\sum_{i=1}^{M} p(x_i \mid \Theta^j)}{\sum_{j=1}^{m}\sum_{i=1}^{M} p(x_i \mid \Theta^j)}.   (9)
There are two advantages of this algorithm. Firstly, the appearance model obtained is in closed-form representation. This enables the tracker to compute the likelihood values in O(1) time per feature vector, which significantly improves the speed of the tracker, as will be shown in Section 5. Secondly, the algorithm is totally non-parametric in the sense that it does not need manual setting of any of its parameters, compared to the popular Expectation Maximization model, which needs a priori setting of the number of mixture components and the initial means and covariances. The importance of modeling each partition using a separate mixture can be shown by modeling the color density of the human object in Figure 1.a. The estimated mixture of mixtures is shown in Figure 1.b. The green partition represents the colors of the pants. Since the pants area is a smooth, dark blue cluster, only one Gaussian is enough to model that partition. On the other hand, more than one Gaussian (precisely four) are needed to model the underlying density of the shirt area (blue partition) because of the different shades of gray in that area.
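A sketch of evaluating the final model of Eqs. (7)–(9), with each partition's mixture stored as a list of (weight, mean, covariance) triples; the Gaussian evaluation uses SciPy for brevity.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_pdf(x, mixture):
    """p(x | Theta^j): a single partition's mixture of Gaussians (Eq. 7)."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=cov)
               for pi, mu, cov in mixture)

def mixture_of_mixtures_pdf(x, model):
    """p(x | Theta): weighted sum over the per-partition mixtures (Eq. 8).
    model is a list of (omega_j, mixture_j) pairs."""
    return sum(w * mixture_pdf(x, mixture) for w, mixture in model)

def partition_weights(data, mixtures):
    """Eq. (9): weight of each partition mixture from its total likelihood."""
    totals = np.array([sum(mixture_pdf(x, m) for x in data) for m in mixtures])
    return totals / totals.sum()
```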
Fig. 1. (a) A moving object and (b) appearance model of the moving object (feature samples and iso-density lines in the normalized r–g color plane)
4 Particle Filter Tracking
4.1 Back/Foreground Appearance Models
Background subtraction results in a feature set of background pixels, Y_B, and a number of feature sets representing the detected moving objects, Y_{O_n}, n = 1, \dots, N, where N is the number of detected moving objects. These feature sets are used to build an appearance model for the background, p(x_j \mid \Theta^B), and N appearance models for the detected objects, p(x_j \mid \Theta^{O_n}).
4.2 Particle Filter Tracking
For each of the detected objects, the tracker is formulated as

\{s_i^t, \pi_i^t;\ i = 1, \dots, N_t;\ t = 1, \dots, T\},   (10)

where s_i^t and \pi_i^t represent particle number i at time t and its weight, respectively, and \sum_{i=1}^{N_t} \pi_i^t = 1. N_t represents the number of particles, the subscript t indicates that the number of particles may vary over time, and T is the length of the video stream. Each particle represents a combination of translation and scaling of the object being tracked, as shown in Equation (11):

s_i^t = (\delta_x, \delta_y, \alpha_x, \alpha_y),   (11)

where \delta_x and \delta_y represent the translation in the x and y directions, respectively, and \alpha_x and \alpha_y represent the scaling in the x and y directions, respectively. The propagation of the particles follows the state-transition model of Equation (12):

s_i^t \sim \hat{p}(s_i^t \mid s_i^{t-1}, w^{t-1}),   (12)

where \hat{p} is the probability density function of the states at the current time step and w^{t-1} is the covariance matrix of the zero-mean Gaussian process noise. The
values of \hat{s}^0 are set to (0\ 0\ 1\ 1), which represents no translation and no scaling, and \hat{p}(s_i^0) is assumed to be uniformly distributed. The four elements of the process noise are assumed to be uncorrelated, normally distributed random variables. The set of predicted particles \{s_i^t\}_{i=1}^{N_t} corresponds to a set of bounding boxes, \{B_i^t\}_{i=1}^{N_t}, on I_t. Each bounding box is evaluated using a Bayesian combination of appearance and motion, as shown in Equation (13):

p(B_i^t \mid \Theta^O, \Theta^B) = \log \prod_{j=1}^{K_i} \frac{p(x_j \mid \Theta^O)}{p(x_j \mid \Theta^B)} \cdot \frac{p(x_j \in F)}{1 - p(x_j \in F)},   (13)
and i = 1, 2, \dots, N_t, where K_i is the number of pixels in bounding box i. The bounding box with maximum goodness-of-fit represents the most likely particle, which in turn represents the state of the object being tracked at time t, as shown in Equation (14):

\hat{s}^t = \arg\max_{s_i^t} p(B_i^t \mid p(x \mid \Theta^O), \hat{p}(x \mid \Theta^B)),\quad i = 1, \dots, N_t.   (14)
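A schematic sketch of one tracking step following Eqs. (10)–(14), assuming the appearance and foreground models are wrapped in a single box-scoring callable; the particle count and noise levels are illustrative.

```python
import numpy as np

def track_step(prev_state, frame, score_box, n_particles=200,
               noise=(3.0, 3.0, 0.02, 0.02)):
    """One particle-filter update.  prev_state = (dx, dy, ax, ay) as in Eq. (11);
    score_box(frame, state) returns the log-likelihood of Eq. (13) for the
    bounding box induced by that state."""
    rng = np.random.default_rng()
    # Eq. (12): propagate particles with zero-mean Gaussian process noise
    particles = np.asarray(prev_state) + rng.normal(0.0, noise, size=(n_particles, 4))
    # Eq. (13): evaluate each induced bounding box
    scores = np.array([score_box(frame, s) for s in particles])
    # Eq. (14): the best-scoring particle becomes the new object state
    return particles[np.argmax(scores)]
```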
5 Experimental Results
In this section, a small number of results is presented on the VACE data. The data was processed on a cluster of 15 computers (i.e., nodes) running the Linux operating system. Each node has two 3.0 GHz processors and 8 GB of memory. Figure 2 shows tracking of people and vehicles in a night vision surveillance system. From the figure, we can see that the detector detects moving objects that move close to each other as one object, and hence tracking is done on the
Fig. 2. People and vehicle tracking. Detection is performed concurrently with object tracking.
Fig. 3. People tracking under severe background ambiguity for long time periods
Fig. 4. People tracking on daylight color data
same basis. Also, we can see that detecting new objects and tracking existing ones is performed automatically. Finally, the system does not absorb the white car which comes into the scene and stops indefinitely. In Figure 3, the system detects and tracks a single moving object in the video sequence. The performance of the tracker is shown to be very robust against scale changes. The tracker can keep track of the object for a long period of time, as well as keeping a relatively accurate estimate of the object's scale. Finally, Figure 4 shows another example of detecting and tracking moving people in a daylight camera. The figure shows that the system can accurately segment the independently moving objects even at very small scales and then track them robustly.
6 Conclusions
In this paper, a system for detecting and tracking moving objects in a stationary camera visual surveillance setting has been presented. Moving objects are detected using classical background subtraction methods. The appearance of the moving objects is modeled using a mixture of mixtures of Gaussians, rather than a simple mixture of Gaussians. This appearance model has two main advantages. Firstly, no a priori setting of mixture parameters (e.g., number of mixture components, initial means, etc.) is needed. Secondly, the computational complexity for computing appearance likelihoods is O(n), which is important to achieving real-time tracking. Object tracking is performed using a particle filter framework. Results on daylight color video sequences as well as night video sequences are presented in the paper. The results show very robust performance with respect to scale changes and lighting conditions.
References 1. Abd-Almageed, W., Davis, L.: Density estimation using mixture of mixtures of Gaussians. In: 9th European Conference on Computer Vision. (2006) 2. Comaniciu, D., Meer, P.: Mean Shift: A Robust Approach Toward Feature Space Analysis. IEEE Trans. Pattern Analysis and Machine Intelligence 24 (2002) 3. Han, B., Comaniciu, D., Zhu, Y., Davis, L.: Incremental density approximation and kernel-based Bayesian filtering for object tracking. In: IEEE International Conference on Computer Vision and Pattern Recognition. (2004)
Robust Vehicle Blob Tracking with Split/Merge Handling Xuefeng Song and Ram Nevatia Univ. of Southern California, Los Angeles, CA 90089, USA
[email protected],
[email protected]
Abstract. Evaluation results of a vehicle tracking system on a given set of evaluation videos of a street surveillance system are presented. The method largely depends on detection of motion by comparison with a learned background model. Several difficulties of the task are overcome by the use of general constraints from scene, camera and vehicle models. An analysis of results is also presented.
1 Task and Data Description
The task here is to evaluate the performance of our vehicle detection and tracking algorithm on a set of surveillance videos provided by the VACE project [1]. The objective includes accurate detection, localization and tracking while maintaining the identities of vehicles as they travel across different frames. The videos are of street scenes captured by cameras mounted at light-pole height looking down towards the ground. There is one road running from top to bottom of the image and another one from left to right near the top of the image. The provided videos are from three different cameras at three different sites. They include data captured at several different times of the day, including some at night. Some examples are shown in figure-1. Our basic approach is to detect moving vehicles in these videos by computing motion foreground blobs, obtained by comparison with a learned background model, and then to track by making data associations between detected blobs. However, there are several factors that make this task highly challenging. We summarize these in three groups below:
1. Camera effects: Cameras shake and create false foreground detections. Automatic gain control abruptly changes the intensity of the video, sometimes causing multiple false detections.
2. Scene effects: Ambient illumination changes, such as those due to passing clouds. Other moving objects like walking people, swinging trees or small animals all create motion foreground. An object may not be fully detected as foreground when its contrast against the background is low.
3. Object appearance effects: The shadow of vehicles creates foreground on sunny days. Blobs from different vehicles may merge into one, particularly in heavy traffic or near a stop sign.
Fig. 1. Sample frames of six cases: (a) camera #1 at daytime; (b) camera #2 at daytime; (c) camera #3 at daytime; (d) camera #1 at night with lighting; (e) camera #2 at dark night; (f) camera #3 at nightfall
The image sizes of the objects near the top of the image are much smaller than those near the bottom. Thus, the size of a vehicle changes substantially as it travels from the top to the bottom. An ambiguous or "don't care" zone near the top is provided by the evaluation scheme to exclude hard-to-see, smaller vehicles from the evaluation. However, this zone does not fully cover the vehicles traveling left to right, and the scores are impacted by these vehicles. The rest of the paper is organized as follows. The details of the proposed vehicle tracking method are presented in section 2. Section 3 describes the experiments and results. The quantitative evaluation and analysis are in section 4.
2 Vehicle Motion Blob Tracking
Figure-2 shows an overview of our method. We compute motion foreground blobs at each frame and then track these blobs based on appearance association. Knowledge of the scene, camera and vehicles is pre-learnt to enforce some general constraints.
2.1 Scene Constraints
We assume vehicles move on a ground plane; we use vehicle motion and vanishing points from scene features to compute an approximate camera model [4]. This process is performed once, in the training phase. To distinguish vehicles from walking humans and other motion, we set a minimum size for an object to be considered a vehicle; this size is set in 3-D, and its size in the image is computed using the camera parameters.
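As a rough illustration of this size gate (not the authors' implementation), the sketch below projects a minimum 3-D height through an assumed 3x4 camera projection matrix P to obtain a per-location pixel threshold; the matrix, the Z-up world convention and the 1 m value are all assumptions.

```python
import numpy as np

# Illustrative sketch only: project a minimum 3-D vehicle height standing at a
# ground-plane point into image pixels, and gate blobs by that projected size.

def project(P, X):
    """Project a 3-D point (X, Y, Z) with projection matrix P to pixel coords."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def min_blob_height_px(P, ground_point, min_height_m=1.0):
    """Image-space height of a vertical segment of min_height_m metres
    standing at ground_point (Z-up world coordinates assumed)."""
    bottom = project(P, ground_point)
    top = project(P, ground_point + np.array([0.0, 0.0, min_height_m]))
    return np.linalg.norm(top - bottom)

def is_vehicle_candidate(blob_bbox_height_px, P, ground_point):
    """Reject blobs (e.g. pedestrians) smaller than the projected 3-D minimum."""
    return blob_bbox_height_px >= min_blob_height_px(P, ground_point)
```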
Fig. 2. Method Overview (flowchart: background subtraction of the input frame yields foreground blobs; a track-blob association matrix drives new track initialization, track ending and vehicle blob split/merge handling, producing the object tracks; the process is guided by scene, camera and vehicle knowledge)
2.2 Background Subtraction
We learn a pixel-wise color model of the background pixels. The model is updated adaptively with new frames to adapt to illumination changes [2]. We do not assume that an empty background frame is available. Pixels that do not conform to the background model are hypothesized to be due to motion and are called "foreground" pixels. These foreground pixels are grouped into connected regions; we apply a sequence of morphological operations to remove small noise regions and fill in small holes in the regions. In an ideal case, every blob would correspond to one vehicle object; however, this is not always the case. Figure-3 shows three common problems that may be present.
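The following sketch illustrates this foreground-blob extraction stage. The paper relies on the adaptive per-pixel color model of [2]; here OpenCV's MOG2 subtractor merely stands in for it, and the morphological kernel size, history length and minimum blob area are assumptions.

```python
import cv2

# Sketch of the foreground-blob extraction step: adaptive background model,
# morphological cleanup, and grouping into connected regions (bounding boxes).
bg_model = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

def foreground_blobs(frame, min_area=100):
    mask = bg_model.apply(frame)                              # adaptive model update
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)     # remove small noise
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)    # fill small holes
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # Bounding rectangles of sufficiently large connected regions.
    return [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) >= min_area]
```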
Fig. 3. Common difficulties of vehicle blob tracking: (a) blob merge; (b) blob split; (c) other objects
Fig. 4. Track-blob association matrix between the predicted objects Ô_t^i and the detected blobs B_t^j
2.3 Track Vehicle Blobs
For simplicity, we model vehicle objects as rectangles: O_t^i = {(x_t^i, y_t^i, w_t^i, h_t^i), A_t^i, D_t^i}, where t is the frame number, i is the object id, (x_t^i, y_t^i, w_t^i, h_t^i) describes the location of the object, and A_t^i, D_t^i are the object appearance model and dynamic model. In our implementation, A_t^i is a color histogram, and D_t^i is a Kalman filter. Similarly, a detected blob is modeled as a rectangle with a color histogram: B_t^j = {(x_t^j, y_t^j, w_t^j, h_t^j), A_t^j}. Our tracking method processes the frames sequentially. At each new frame t, we first apply each tracked object's dynamic model to predict the object's new position:

\hat{O}_t^i = D_{t-1}^i(O_{t-1}^i)

The predicted objects and the detected blobs then generate an association matrix (see Figure-4). The association is based on the overlap between the predicted object rectangle and the blob rectangle:

M(i, j) = \begin{cases} 1, & \text{if } \frac{|\hat{O}_t^i \cap B_t^j|}{\min(|\hat{O}_t^i|, |B_t^j|)} > \tau \\ 0, & \text{otherwise} \end{cases}

If the track-blob match is one-to-one, we simply update the position of the track. A new tracked object is created when a blob has no match with the current tracked objects and its size is comparable to a regular vehicle. A track ends when it has no blob match for more than a threshold number of frames. When an object matches multiple blobs, we combine the split blobs into one to match with the tracked object. When multiple objects merge into one blob, we segment the blob based on the appearance model of each involved object. Specifically, we apply the mean-shift [3] color tracking method to locate each vehicle in the merged blob.
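A minimal sketch of the track-blob association step follows; rectangles are assumed to be (x, y, w, h) tuples and the threshold value is illustrative.

```python
import numpy as np

# Sketch of the association matrix M(i, j): the rectangle overlap is normalized
# by the smaller of the two areas and compared against a threshold tau.

def intersection_area(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    return iw * ih

def association_matrix(predicted, blobs, tau=0.5):
    M = np.zeros((len(predicted), len(blobs)), dtype=int)
    for i, o in enumerate(predicted):
        for j, b in enumerate(blobs):
            inter = intersection_area(o, b)
            smaller = min(o[2] * o[3], b[2] * b[3])
            if smaller > 0 and inter / smaller > tau:
                M[i, j] = 1
    return M

# A row with several 1s signals a split (blobs are re-merged); a column with
# several 1s signals a merge (the blob is segmented per object, e.g. by
# mean-shift color tracking); one-to-one matches update the track directly.
```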
Fig. 5. Sample result frames on training/testing sequences (seq1-seq6). The green rectangle is the ground truth, the red rectangle is the tracking output of our system, and the blue rectangle is the defined "ambiguous" zone.
Fig. 6. Typical tracking error samples: (a) low contrast; (b) shaking camera; (c) congestion
3 Experiments and Results
We tested our system on the videos provided by the VACE project [1]. The size of each frame is 720x480. The experiments were run on a regular PC with an Intel Pentium 2.6 GHz CPU. The average processing speed is 2.85 frames/second. During the training and testing stages, 100 sequences (about 165 minutes in total) were manually labeled by a third party. Figure-5 shows some examples under different conditions. In general, our system works well for daytime videos, though the detected vehicle size is usually larger than the ground truth when shadows are present. At night time, vehicle headlights create large change regions; our system has difficulty in locating the vehicle positions accurately in such cases. Figure-6 shows a few typical tracking errors. In case (a), two vehicles are not detected because of low contrast with the background. In case (b), there is
significant camera shaking, which creates many false foreground regions; also, one person is detected as a vehicle because its size is comparable to that of small vehicles. In case (c), four vehicles come together due to traffic congestion; this causes one missed detection, and the position of another detection is not accurate.
4 Evaluation and Discussion
We quantitatively evaluated our system according to the requirements of the test process. Table-1 lists the scores on 50 test video sequences. The metrics shown evaluate both the detection and tracking performance. We summarize the metric definitions below; more details may be found in [1]. (An illustrative sketch of these accuracy metrics, under simplifying assumptions, follows Table 1.)
1. MODP (Multiple Object Detection Precision) measures the position precision of single-frame detections;
2. MODA (Multiple Object Detection Accuracy) combines the influence of missed detections and false alarms;
3. MOTP (Multiple Object Tracking Precision) measures the position precision at the tracking level;
4. MOTA (Multiple Object Tracking Accuracy) is MODA at the tracking level, with consideration of ID switches.
One observation from the table is that the difference between MODP and MOTP, or between MODA and MOTA, is very small for all the test video sequences. This is mainly because the penalty on object ID changes is relatively small. There are in fact a number of ID changes in the output of our system; however, the currently defined MOTA and MOTP are not able to reflect this tracking error very well. To evaluate performance trade-offs, we repeated our experiments with 5 different sets of parameters. As MODA combines the influence of missed detections and false alarms, it is not easy to see a trade-off using this metric. Instead, we plot an ROC curve using the traditional detection and false alarm rates in Fig-7.

Table 1. Evaluation scores on 50 test video sequences

Scene Name | Num of Sequences | Average MODP | Average MODA | Average MOTP | Average MOTA
PVTRA102   | 24               | 0.653        | 0.675        | 0.645        | 0.667
PVTRA201   | 18               | 0.540        | 0.612        | 0.539        | 0.605
PVTRN101a  | 5                | 0.665        | 0.625        | 0.664        | 0.623
PVTRN102d  | 3                | 0.691        | 0.644        | 0.684        | 0.641
Average    | 50               | 0.616        | 0.645        | 0.165        | 0.639
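The exact metric formulas are specified in the evaluation protocol [1]; the sketch below only illustrates the general CLEAR-style form of MODA and MOTA (uniform miss and false-alarm costs, simple normalization by the number of ground-truth objects), which may differ in detail from the official VACE definitions.

```python
# Rough illustrative sketch of detection/tracking accuracy scores, assuming the
# common CLEAR-style definitions; the official VACE formulas in [1] may differ
# in cost weighting and normalization details.

def moda(misses, false_positives, num_ground_truth):
    """Detection accuracy over a sequence: 1 - (misses + false alarms) / #GT."""
    if num_ground_truth == 0:
        return 1.0
    return 1.0 - (misses + false_positives) / float(num_ground_truth)

def mota(misses, false_positives, id_switches, num_ground_truth):
    """Tracking accuracy: like MODA but also penalizing identity switches."""
    if num_ground_truth == 0:
        return 1.0
    return 1.0 - (misses + false_positives + id_switches) / float(num_ground_truth)

# Example with hypothetical counts: 120 ground-truth boxes, 10 misses,
# 8 false alarms, 2 ID switches -> MODA = 0.85, MOTA ~ 0.833.
print(moda(10, 8, 120), mota(10, 8, 2, 120))
```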
Fig. 7. ROC curve of the CLEAR-VACE evaluation on vehicle tracking (detection rate vs. false alarm rate)
5 Conclusions
We have presented evaluation results on the performance of our vehicle detection and tracking system on a provided set of surveillance videos. The data contains many highly challenging features. The performance of our system is promising, though many shortcomings exist. We feel that further improvements will require stronger models of vehicle shapes and modeling of shadow patterns in outdoor environments. Our system also does not run in real time; some of the needed speed-up can be obtained by more careful coding and use of faster commodity hardware, but further gains are also likely to require algorithmic improvements. Acknowledgements. This research was partially funded by the Advanced Research and Development Activity of the U.S. Government under contract MDA904-03-C-1786.
References
1. R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, M. Boonstra, and V. Korzhova. Performance Evaluation Protocol for Face, Person and Vehicle Detection & Tracking in Video Analysis and Content Extraction (VACE-II), CLEAR - Classification of Events, Activities and Relationships. http://www.nist.gov/speech/tests/clear/2006/CLEAR06-R106-EvalDiscDoc/Data and Information/ClearEval Protocol v5.pdf
2. Liyuan Li, Weimin Huang, Irene Y.H. Gu, and Qi Tian. "Foreground Object Detection from Videos Containing Complex Background," ACM MM 2003.
3. D. Comaniciu, V. Ramesh, and P. Meer, "Real-time tracking of non-rigid objects using mean shift," IEEE Conf. on Computer Vision and Pattern Recognition 2001, vol. 1, pp. 511-518, 2001.
4. Fengjun Lv, Tao Zhao and Ramakant Nevatia. "Self-Calibration of a Camera from Video of a Walking Human," 16th International Conference on Pattern Recognition (ICPR), Quebec, Canada, 2002.
A Decision Fusion System Across Time and Classifiers for Audio-Visual Person Identification Andreas Stergiou, Aristodemos Pnevmatikakis, and Lazaros Polymenakos Athens Information Technology, Autonomic and Grid Computing, Markopoulou Ave., 19002 Peania, Greece {aste,apne,lcp}@ait.edu.gr http://www.ait.edu.gr/research/RG1/overview.asp
Abstract. In this paper the person identification system developed at Athens Information Technology is presented. It comprises an audio-only (speech), a video-only (face) and an audiovisual fusion subsystem. Audio recognition is based on Gaussian mixture modeling of the principal components of the Mel-Frequency Cepstral Coefficients of speech. Video recognition is based on linear subspace projection methods and temporal fusion using weighted voting on the results. Audiovisual fusion is done by fusing the unimodal identities into the multimodal one, using a suitable confidence metric for the results of the unimodal classifiers.
1 Introduction
Person identification is of paramount importance in security, surveillance, human-computer interfaces and smart spaces. Hence, the evaluation of different recognition algorithms under common evaluation methodologies is very important. Even though the applications of person recognition vary, the evaluations have mostly focused on the security scenario, where training data are few but recorded under close-field conditions. An example of this for faces is the Face Recognition Grand Challenge [1], where facial images are of high resolution (about 250 pixels distance between the centers of the eyes). The CLEAR person identification evaluations, following the Run-1 evaluations [2] of the CHIL project [3], focus on the surveillance and smart-space applications, where training can be abundant, but on the other hand the recording conditions are far-field: wall-mounted microphone arrays record speech far from the speakers, and cameras mounted on room corners record faces. These two modalities are used, either stand-alone or combined [4], to recognize people in audiovisual streams. The person identification system implemented at Athens Information Technology operates on short sequences of the two modalities of the far-field data, producing unimodal identities and confidences. The identities produced by the unimodal subsystems are then fused into a bimodal one by the audiovisual subsystem. This paper is organized as follows: In section 2 the audio-only, video-only and audiovisual subsystems of the person identification system are detailed. The evaluation results are presented in section 3. Finally, in section 4 the conclusions are drawn.
2 Person Identification System
In this section, the three subsystems of the person identification system are detailed. The audio subsystem operates on speech segments, the video subsystem on faces extracted from the same segments of the multi-camera video streams, and the audiovisual fusion operates on the decisions of the unimodal subsystems. The system is trained automatically, in the sense that there is no manual operation for the selection of the speech or the faces to be used. This automatic selection occurs in the video subsystem, for the faces to be used in training.
2.1 Audio Subsystem
In the training phase of our system the goal is to create a model for each one of the supported speakers and ensure that these models accentuate the specific speech characteristics of each person. To this end, we first break up the training segments into frames of appropriate size (i.e., duration), with successive frames having a predefined overlap percentage. The samples belonging to each frame are used to calculate a vector of parameters that represents the given frame during the model estimation process. Specifically, a set of Mel Frequency Cepstral Coefficients (MFCC) is extracted from each frame and used to model the characteristics and structure of each individual's vocal tract. All MFCC vectors for a given person are collected and used to train a Gaussian Mixture Model (GMM), based on the Baum-Welch algorithm. A GMM is in essence a linear combination of multivariate Gaussians that approximates the probability density function (PDF) of the MFCC for the given speaker:

\lambda_k = \sum_{m=1}^{M} w_m N(o, \mu_m, \Sigma_m), \quad k = 1, \ldots, K \qquad (1)
where K is the number of speakers (i.e., 26) and λ_k is the GMM for the k-th speaker. This model is characterized by the number of Gaussians (M) that constitute the mixture, each having its own weight (w_m), mean vector (μ_m) and covariance matrix (Σ_m). For the identification part, testing samples are again segmented into frames with the same characteristics as the ones created during the training process, and we subsequently extract MFCCs from each frame. To perform identification, each of the K GMMs is fed with an array of the coefficients (one row per sample), based on which we calculate the log-likelihood that this set of observations was produced by the given model. The model that produces the highest log-likelihood is the most probable speaker according to the system:

k_1 = \arg\max_{k} \{L(O | \lambda_k)\}, \quad k = 1, \ldots, K \qquad (2)
where O is the matrix of MFCCs for the specific test segment and L(O | λ_k) is the log-likelihood that model λ_k produces this set of observations. All samples are broken up into frames of length 1024 with 75% overlap. Since the data are sampled at 44.1 kHz, each frame has a duration of a little over 23 msec. The size of the GMM is fixed at 16 Gaussians and the number of static MFCCs per frame
has been set to 12. To this we concatenate the log-energy of the frame to create 13-D vectors, and we also append the delta (first-order derivative) coefficients. A very crucial step for the creation of a successful GMM is the initialization of its parameters, which will be updated during the iterations of the EM training algorithm. The standard approach is to use the K-Means clustering algorithm to obtain some initial estimates for the Gaussian parameters; this strategy however suffers from the random characteristics of the outcome of K-Means, which in turn lead to a different GMM each time the same data are used for training. Moreover, the identification performance varies considerably across these different models. We have therefore utilized a deterministic initialization strategy for the EM algorithm, based on the statistics of the training data. Specifically, we compute a number of percentiles across all dimensions of the training data set and thus partition the data range in each dimension into as many subsets as the modes of the GMM. The K-Means algorithm is consequently run using the central values of each subset as initial cluster means, and the resulting clustered data are fed into the EM algorithm for parameter fine-tuning. Our experiments have shown that this strategy gives on average lower error rates than the random K-Means initialization, although there are a few runs using the standard approach that lead to better identification performance. Automatic identification systems are evaluated based on their response time and error rate. It is obviously important to minimize both these numbers; however, in many cases it is not easy or even possible to do that, and we must settle for a trade-off between speed and identification accuracy. We have addressed this issue by employing standard Principal Components Analysis (PCA) as a pre-processing step. Specifically, we compute a transformation (projection matrix) for each speaker based on their training data and use that matrix to perform a mapping to the PCA coordinate system prior to GMM calculation. In the testing phase, we compute the log-likelihood of each speaker by first projecting the MFCC vectors to the respective PCA space. The use of PCA introduces one further degree of freedom in the system, namely the dimensionality of the projection space. It is obvious that by keeping an increasingly smaller number of eigenvalues from the PCA scatter matrix we can reduce this dimensionality accordingly, thereby achieving a significant execution speed increase. The choice of the number of discarded eigenvalues will ultimately be dictated by the truncation error introduced due to the reduction of the projection space dimension. Specifically, if the initial space dimension is d and we discard the q smallest eigenvalues, the truncation error will be equal to

e = 1 - \frac{\sum_{i=d-q+1}^{d} \lambda_i}{\sum_{i=1}^{d} \lambda_i} \qquad (3)
We have implemented an automatic decision process that determines the number of retained eigenvalues in a way that ensures that the average truncation error across all speakers is no more than 0.2%. The maximum value of q that satisfies this condition is chosen, so that we achieve the greatest speed increase possible while retaining (mostly) optimal identification accuracies.
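A minimal sketch of this selection rule is given below, assuming eigenvalues sorted in descending order and the truncation error taken as the discarded fraction of the total eigenvalue energy; the variable names and the example eigenvalues are illustrative, not taken from the paper.

```python
import numpy as np

# Minimal sketch of the automatic choice of q (number of discarded eigenvalues),
# under the stated assumptions; the 0.2% target follows the text.

def choose_q(eigenvalues, max_error=0.002):
    lam = np.sort(np.asarray(eigenvalues))[::-1]   # descending order
    total = lam.sum()
    best_q = 0
    for q in range(1, len(lam)):
        discarded = lam[len(lam) - q:].sum()       # the q smallest eigenvalues
        if discarded / total <= max_error:
            best_q = q                             # keep the largest feasible q
        else:
            break
    return best_q

# Example with hypothetical eigenvalues: the trailing three can be dropped.
print(choose_q(np.array([5.0, 3.0, 1.0, 0.01, 0.005, 0.001])))
```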
Our experiments indicate that this selection strategy gives a value for q that is at most one above or below the number of eigenvalues that minimizes the error rates. Even if our choice of q leads to slightly sub-optimal solutions, the achieved error rates are still superior to using the standard GMM algorithm approach without PCA pre-processing. We have therefore achieved faster response times as well as enhanced identification performance.
2.2 Video Subsystem
The video subsystem for person identification utilizes all four camera streams to extract approximately frontal faces for training and testing of the system. The faces are extracted employing the provided label files, both those sampled at 1 sec intervals and those at 200 ms intervals. For face normalization, the eye positions are used, as marked in the label files. For most of the frames, these positions are linearly interpolated, leading to some inaccuracy in the eye detection. The eyes are then positioned on specific coordinates on a 34 by 42 template that contains mostly the face for approximately frontal views of the people. The normalized training images extracted for one person are shown in Figure 1. Evidently there are problems with the accuracy of the interpolated labels (or the 200 ms labels themselves) that lead to scaling, shifting and rotation of the faces. Such effects range from minor to major, the latter leading to segments that are definitely not faces. Also there are pose variations, both left-right (even extreme profile with only one eye visible) and up-down. Finally there are lighting variations. When the view is not approximately frontal, then other parts of the head, or even background, can be included in the template. Such views are not wanted, and some means for automatically discarding them is needed. Note at this point that automatic selection of faces is a prerequisite for testing of the recognition systems in the CLEAR evaluations, but the proposed visual recognition subsystem also employs the same mechanism for training, which is hence automatic. The automatic selection of faces employs a measure of frontality, based on the supplied face bounding boxes and eye positions. Frontal views should have both eyes symmetrically positioned around the vertical face axis. This symmetry is enumerated in the frontality measure. The measure can unfortunately be inaccurate for two reasons. The first has to do with the provided label files: eye positions are provided every 200 ms, while face bounding boxes every 1 sec, causing larger errors due to interpolation. The second reason has to do with the positioning of the head: when it is not upright, the major axis of the face does not coincide with the central vertical axis of the face bounding box. Nevertheless, employing the proposed frontality measure rids the system of most of the non-frontal faces, at the expense of missing some frontal but tilted ones. As for the threshold on frontality, it should not be so strict as to diminish the training and testing data. It is set to 0.1 for all training durations and for testing durations up to 10 sec. For testing durations of 20 sec, it is doubled, as the abundance of images in this case allows for a stricter threshold. A final problem with the application of the frontality threshold is that there are some testing segments for which both eyes are never visible. This leads to empty segments. Unfortunately, 13% of the 1 sec, 3.4% of the 5 sec, 1.7% of the 10 sec and 1.1% of the 20 sec testing segments are left empty.
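Since the paper does not spell out the frontality formula or the template eye coordinates, the sketch below is only one plausible reading of the description: a symmetry ratio of the eye positions about the vertical axis of the face bounding box, and a similarity warp that places the eyes at assumed positions in a 34x42 template.

```python
import numpy as np

# Illustrative sketch only: the symmetry measure and the template eye
# coordinates below are assumptions, not the authors' exact values.

def frontality(face_bbox, left_eye, right_eye):
    """Symmetry of the two eyes about the vertical axis of the face bounding
    box; 1 means perfectly symmetric (frontal-looking), values near 0 mean
    a strongly non-frontal view."""
    x, y, w, h = face_bbox
    axis_x = x + w / 2.0
    d_left = abs(axis_x - left_eye[0])
    d_right = abs(right_eye[0] - axis_x)
    if max(d_left, d_right) == 0:
        return 0.0
    return min(d_left, d_right) / max(d_left, d_right)

def normalize_face(image, left_eye, right_eye,
                   template_size=(34, 42), eye_y=16, eye_dx=8):
    """Rotate, scale and translate the face so the eyes land on fixed template
    coordinates (assumed positions), yielding a 34x42 crop. Requires OpenCV."""
    import cv2
    w, h = template_size
    src_c = (np.asarray(left_eye, float) + np.asarray(right_eye, float)) / 2.0
    dst_l = np.array([w / 2.0 - eye_dx, eye_y])
    dst_r = np.array([w / 2.0 + eye_dx, eye_y])
    src_v = np.asarray(right_eye, float) - np.asarray(left_eye, float)
    dst_v = dst_r - dst_l
    scale = np.linalg.norm(dst_v) / np.linalg.norm(src_v)
    angle = np.arctan2(src_v[1], src_v[0]) - np.arctan2(dst_v[1], dst_v[0])
    cos, sin = scale * np.cos(angle), scale * np.sin(angle)
    M = np.array([[cos, sin, 0.0], [-sin, cos, 0.0]])
    M[:, 2] = (dst_l + dst_r) / 2.0 - M[:, :2] @ src_c
    return cv2.warpAffine(image, M, (w, h))
```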
(These profile faces can in principle be classified by face recognizers trained on profile faces, but such classifiers have not been implemented in the scope of the CLEAR evaluations.)
Fig. 1. Training images from the 15 sec training intervals, as they are captured by any of the four cameras, for one person
The individual decisions for the faces that pass the frontality threshold are fused using the sum rule [5]. According to it, each decision ID_i in a testing segment casts a vote that carries a weight w_i. The weights w_i of all decisions with ID_i = k are summed to yield the weight W_k of each class:

W_k = \sum_{i: ID_i = k} w_i \qquad (4)

where k = 1, \ldots, K and K is the number of classes. Then the fused decision based on the N individual identities is:

ID^{(N)} = \arg\max_{k} (W_k) \qquad (5)
The weight w_i in the sum rule for the i-th decision is the sixth power of the ratio of the second-minimum distance d_i^{(2)} over the minimum distance d_i^{(1)}:
w_i = \left[ \frac{d_i^{(2)}}{d_i^{(1)}} \right]^6 \qquad (6)
This choice of weight reflects the classification confidence: if the two smallest distances from the class centers are approximately equal, then the selection of the identity leading to the smallest distance is unreliable. In this case the weight is close to unity, weighting down the particular decision. If, on the other hand, the minimum distance is much smaller than the second-minimum, the decision is heavily weighted, as the selection of the identity is reliable. The sixth power allows a few very confident decisions to be weighted more than many less confident ones. The face recognizers employed are of the linear subspace projection family. Both Principal Components Analysis (PCA) [6] and Linear Discriminant Analysis (LDA) [7] are employed. LDA is better for large faces with accurate eye labels [8], but PCA is more robust as size and eye label accuracy drop. To demonstrate the difficulties the far-field viewing conditions impose on face recognition, a comparison of the error rate of PCA and LDA as the eye distance drops is carried out in Figure 2. Note that the database used for these experiments is HumanScan [9], not the data of the CLEAR evaluations.
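A compact sketch of this temporal sum-rule fusion (equations (4)-(6)) is given below; the input format, a list of (best class, smallest distance, second-smallest distance) tuples, is an assumption made for illustration.

```python
from collections import defaultdict

# Sketch of the temporal sum-rule fusion: each per-face decision votes for its
# best class with weight (d2/d1)**6, and the class with the largest accumulated
# weight is reported.

def fuse_decisions(decisions):
    class_weights = defaultdict(float)
    for best_class, d1, d2 in decisions:
        w = (d2 / d1) ** 6 if d1 > 0 else 1.0   # eq. (6): confidence weight
        class_weights[best_class] += w           # eq. (4): accumulate per class
    fused_id = max(class_weights, key=class_weights.get)   # eq. (5)
    return fused_id, dict(class_weights)

# Example: three confident votes for class 3 outweigh five weak votes for class 7.
votes = [(3, 1.0, 2.0)] * 3 + [(7, 1.0, 1.05)] * 5
print(fuse_decisions(votes))
```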
Fig. 2. Effect of far-field viewing conditions on face recognition. (a) The probability of misclassification increases dramatically below 10 pixels of eye distance, even with perfect eye labeling. (b) Histogram of the eye distances of the faces in the testing segments; face recognition in the CLEAR evaluations has to cope with eye distances of 4 to 20 pixels.
LDA is robust to lighting changes [7]. To increase the robustness of PCA to lighting, histogram equalization is applied to the faces. Even though the performance of LDA and PCA at the face resolutions of interest is expected to be very close (see Figure 2(a)), when there are few testing images per testing segment, LDA is expected to be a better choice than PCA. The latter is expected to surpass LDA when there are fewer training images or more testing images over which to fuse the individual decisions. A note
is due at this point on the application of LDA. Contrary to the Fisherfaces algorithm [7], in this case the small sample size problem [10] does not apply. Hence no PCA step is used, without the need for a direct LDA algorithm [10]. The decisions ID^{(PCA)} and ID^{(LDA)} of the PCA and the LDA classifiers are again fused using the sum rule to yield the reported identity. For this fusion, the class weights W_k of equation (4) are used instead of the distances in equation (6). Setting:
k_1 \equiv [\text{best matching class}] = ID^{(N)}, \qquad k_2 \equiv [\text{second-best matching class}] \qquad (7)

the weights of the PCA and LDA decisions become:

w_i = \frac{W_{k_1}^{(i)}}{W_{k_2}^{(i)}}, \quad i \in \{\text{PCA}, \text{LDA}\} \qquad (8)

Then the fused PCA/LDA decision to be reported by the visual subsystem is:

ID^{(\text{visual})} = \begin{cases} ID^{(\text{PCA})} & \text{if } w_{\text{PCA}} \ge w_{\text{LDA}} \\ ID^{(\text{LDA})} & \text{if } w_{\text{PCA}} < w_{\text{LDA}} \end{cases} \qquad (9)
2.3 Audiovisual Subsystem
The audiovisual system is again based on post-decision fusion using the sum rule. In this case the decision is:
ID^{(\text{A/V})} = \begin{cases} ID^{(\text{audio})} & \text{if } w_{\text{audio}} \ge \min(\{w_{\text{thr}}, w_{\text{visual}}\}) \\ ID^{(\text{visual})} & \text{if } w_{\text{audio}} < w_{\text{visual}} \end{cases} \qquad (10)

where the audio weight is the ratio of the log-likelihood L(O | \lambda_{k_1}) that the best matching model \lambda_{k_1} produces the set of observations O, over the log-likelihood L(O | \lambda_{k_2}) that the second-best matching model \lambda_{k_2} produces O:

w_{\text{audio}} = \frac{L(O | \lambda_{k_1})}{L(O | \lambda_{k_2})} \qquad (11)

The visual weights are the maximum of the PCA and LDA weights of (8), transformed by a factor c so that they have the same mean value as the audio weights and remain greater than or equal to unity:

w_{\text{visual}} = c \left[ \max(\{w_{\text{PCA}}, w_{\text{LDA}}\}) - 1 \right] + 1 \qquad (12)
w_thr is an audio weight threshold above which the audio decision is absolutely trusted. This reflects the confidence in adequately weighted audio decisions, regardless of the
230
A. Stergiou, A. Pnevmatikakis, and L. Polymenakos
video ones. This is needed as the performance of video is not expected to be as good as the audio, due to the adverse effect of resolution, label interpolation and pose variation. The choice of this threshold is 1.016 for 15 seconds training, and 1.008 for 30 seconds, where experiments show that audio recognition should be error-free.
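The decision logic of equations (10)-(12) can be sketched as follows; the factor c and the example weights are hypothetical, and the fallback branch is an assumption since the two conditions of equation (10) are not complementary.

```python
# Sketch of the audio-visual decision fusion of equations (10)-(12). The weight
# definitions follow the paper; the tie-break when neither condition of eq. (10)
# holds is an assumption.

def fuse_audio_visual(id_audio, w_audio, id_visual, w_pca, w_lda, c, w_thr):
    w_visual = c * (max(w_pca, w_lda) - 1.0) + 1.0            # eq. (12)
    if w_audio >= min(w_thr, w_visual):                        # eq. (10), audio branch
        return id_audio
    if w_audio < w_visual:                                     # eq. (10), visual branch
        return id_visual
    return id_audio                                            # assumed tie-break

# Example call with hypothetical weights; w_thr = 1.016 corresponds to the
# 15-second training condition quoted in the text.
print(fuse_audio_visual(id_audio=5, w_audio=1.02,
                        id_visual=7, w_pca=1.3, w_lda=1.1, c=0.05, w_thr=1.016))
```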
3 Evaluation Results
The person identification algorithms have been tested on the CLEAR data, which comprise speech and four camera views of 26 individuals. The audiovisual conditions are far-field, in the sense that speech is recorded by microphone arrays and faces are captured by medium-resolution cameras mounted high on room corners, resulting in a median eye distance of 9 pixels. Some of the segmented faces at different eye distances are shown in Figure 3.
Fig. 3. Good examples of segmented faces at eye distances of 4, 9 and 16 pixels. The problems of too low resolution, pose changes and label interpolation inaccuracies are evident.
Two training conditions have been defined, one 15 and another 30 seconds long. Four testing durations are also defined: 1, 5, 10 and 20 seconds long. All these segments contain mostly speech, so a speech activity detection algorithm [11] has not been used. Even though the heads are always visible in the chosen segments, frontal face images are difficult to find in many of them (see section 2.2). Nevertheless, only frontal face recognition has been attempted, using those training and testing faces the system automatically marks as approximately frontal. The results of the audio-only person recognition are shown in Table 1 per training and testing duration. Similarly, results for video-only and audiovisual recognition are shown in Tables 2 and 3.

Table 1. Audio recognition performance on the CLEAR evaluation data

Training duration | Testing duration | Audio error rate (%)
15                | 1                | 26.92
15                | 5                | 9.73
15                | 10               | 7.96
15                | 20               | 4.49
30                | 1                | 15.17
30                | 5                | 2.68
30                | 10               | 1.73
30                | 20               | 0.56
Table 2. Video recognition performance on the CLEAR evaluation data (video error rate, %)

Training duration | Testing duration | LDA individual faces | LDA fused across time | PCA fused across time | Fused PCA/LDA
15                | 1                | 58.2                 | 51.4                  | 56.0                  | 50.6
15                | 5                | 57.8                 | 35.8                  | 31.1                  | 29.7
15                | 10               | 57.0                 | 29.4                  | 27.3                  | 23.2
15                | 20               | 56.0                 | 24.2                  | 20.2                  | 20.2
30                | 1                | 19.9                 | 47.0                  | 51.4                  | 47.3
30                | 5                | 48.4                 | 31.1                  | 33.6                  | 31.1
30                | 10               | 47.7                 | 29.4                  | 27.7                  | 26.6
30                | 20               | 49.6                 | 25.8                  | 23.0                  | 24.7
Table 3. Audiovisual recognition performance on the CLEAR evaluation data

Training duration | Testing duration | Audiovisual error rate (%)
15                | 1                | 23.65
15                | 5                | 6.81
15                | 10               | 6.57
15                | 20               | 2.81
30                | 1                | 13.70
30                | 5                | 2.19
30                | 10               | 1.73
30                | 20               | 0.56
4 Conclusions
The results of the person identification system of Athens Information Technology at the CLEAR evaluations are far superior for audio than for video recognition. This is a bit misleading: although the results show that video provides only a minor improvement over audio recognition, this is not generally true. One obvious reason is that speech is usually much sparser than face images in a multi-camera setup. In the CLEAR evaluations, care has been taken to have segments with speech available; this should be contrasted with the fact that, even though there are four available cameras, in 6.8% of the testing segments there is not a single frame where both eyes are visible. Adding to that figure another 2.2% of segments where there are only up to five faces with both eyes visible reveals the difficulties of visual identification in these segments. The second reason is segmentation. The segments have been selected so that they contain speech from only one person, so no speech segmentation is needed prior to recognition. This is not the case in realistic deployments. On the other hand, the segmentation of one from multiple faces has been tackled with the provided labels. Unfortunately these are provided only every 1 sec or 200 ms. The needed interpolation leads to large inaccuracies, especially in cases of person motion.
Hence audiovisual fusion is more imperative than the results in the CLEAR evaluations show, for any realistic deployment of a person identification system.
Acknowledgements This work is sponsored by the European Union under the integrated project CHIL, contract number 506909. The authors wish to thank the organizers of the CLEAR evaluations.
References
[1] P. Phillips et al.: Overview of the Face Recognition Grand Challenge, CVPR (2005).
[2] H. Ekenel and A. Pnevmatikakis: Video-Based Face Recognition Evaluation in the CHIL Project – Run 1, Face and Gesture Recognition 2006, Southampton, UK (Apr. 2006), 85-90.
[3] A. Waibel, H. Steusloff, R. Stiefelhagen, et al.: CHIL: Computers in the Human Interaction Loop, 5th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), Lisbon, Portugal (Apr. 2004).
[4] R. Brunelli and D. Falavigna: Person Recognition Using Multiple Cues, IEEE Trans. Pattern Anal. Mach. Intell., 17, 10 (Oct. 1995), 955-966.
[5] J. Kittler, M. Hatef, R.P.W. Duin and J. Matas: On combining classifiers, IEEE Trans. Pattern Anal. Mach. Intell., 20, 3 (March 1998), 226-239.
[6] M. Turk and A. Pentland: Eigenfaces for Recognition, J. Cognitive Neuroscience, 3 (March 1991), 71-86.
[7] P. Belhumeur, J. Hespanha and D. Kriegman: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection, IEEE Trans. Pattern Analysis and Machine Intelligence, 19, 7 (July 1997), 711-720.
[8] E. Rentzeperis, A. Stergiou, A. Pnevmatikakis and L. Polymenakos: Impact of Face Registration Errors on Recognition, Artificial Intelligence Applications and Innovations, Peania, Greece (June 2006).
[9] O. Jesorsky, K. Kirchberg and R. Frischholz: Robust Face Detection Using the Hausdorff Distance, in J. Bigun and F. Smeraldi (eds.), "Audio and Video based Person Authentication", Springer (2001), 90-95.
[10] H. Yu and J. Yang: A direct LDA algorithm for high-dimensional data with application to face recognition, Pattern Recognition, 34 (2001), 2067-2070.
[11] J. Sohn, N.S. Kim and W. Sung: A Statistical Model Based Voice Activity Detection, IEEE Sig. Proc. Letters, 6, 1 (Jan. 1999).
The CLEAR’06 LIMSI Acoustic Speaker Identification System for CHIL Seminars Claude Barras, Xuan Zhu, Jean-Luc Gauvain, and Lori Lamel Spoken Language Processing Group LIMSI-CNRS, BP 133, 91403 Orsay cedex, France {barras,xuan,gauvain,lamel}@limsi.fr
Abstract. This paper summarizes the LIMSI participation in the CLEAR'06 acoustic speaker identification task, which aims to identify speakers in CHIL seminars via the acoustic channel. The system is a standard Gaussian mixture model based system, similar to systems developed for the NIST speaker recognition evaluations, and includes feature warping of cepstral coefficients and MAP adaptation of a Universal Background Model. Several computational optimizations were implemented for real-time efficiency: stochastic frame subsampling for training, top-Gaussians scoring and auto-adaptive pruning for the tests, speeding up the system by more than a factor of ten.
1 Introduction
The European Integrated Project CHIL1 is exploring new paradigms for human-computer interaction and developing user interfaces which can track and identify people and take appropriate actions based on the context. One of the CHIL services aims to provide support for lecture and meeting situations, and automatic person identification is obviously a key feature of smart rooms. CHIL has supported the CLEAR'06 evaluation, where audio, video and multi-modal person identification tasks were evaluated in the context of CHIL seminars. Our work at LIMSI focuses on the acoustic modality. The CLEAR'06 acoustic speaker identification task is a text-independent, closed-set identification task with far-field microphone array training and test conditions. Enrollment data of 15 and 30 seconds are provided for the 26 target speakers and test segment durations of 1, 5, 10 and 20 seconds are considered [5]. This paper describes the LIMSI acoustic speaker identification system, evaluated in the CLEAR'06 benchmark. The system is a standard GMM-UBM system based on technology developed for use in NIST speaker recognition evaluations. In the next section, the LIMSI speaker recognition system is presented along with specific computation optimizations that were developed for this system. Section 3 gives experimental results on the CLEAR development data and evaluation data.
1 This work was partially financed by the European Commission under the FP6 Integrated Project IP 506909 CHIL. CHIL – Computers in the Human Interaction Loop, http://chil.server.de/
2 Speaker Recognition System
In this section, the LIMSI speaker recognition system and several computational optimizations that were implemented for real-time efficiency are described.
2.1 Front-End
Acoustic features are extracted from the speech signal every 10 ms using a 30 ms window. The feature vector consists of 15 PLP-like cepstrum coefficients computed on a Mel frequency scale, their Δ and Δ-Δ coefficients, plus the Δ and Δ-Δ log-energy, for a total of 47 features. Ten percent of the frames with the lowest energy are filtered out, on the assumption that they carry less information characteristic of the speaker. No speech activity detection (SAD) module is used in this configuration, since silences longer than one second according to the reference transcriptions are a priori removed from the evaluation data. Feature warping [6] is then performed over a sliding window of 3 seconds, in order to map the cepstral feature distribution to a normal distribution and reduce the non-stationary effects of the acoustic environment. In the NIST speaker recognition evaluations, feature warping was shown to outperform the standard cepstral mean subtraction (CMS) approach [1].
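A minimal sketch of such rank-based feature warping (in the spirit of [6]) is shown below; the 300-frame window (3 s at a 10 ms frame step) and the handling of window edges are assumptions, and a production system would update the ranks incrementally.

```python
import numpy as np
from scipy.stats import norm

# Sketch of rank-based feature warping: within a sliding window, each cepstral
# coefficient is replaced by the standard-normal quantile of its rank.

def feature_warp(features, window=300):
    T, D = features.shape
    warped = np.empty((T, D), dtype=float)
    half = window // 2
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        win = features[lo:hi]
        n = win.shape[0]
        for d in range(D):
            rank = 1 + np.sum(win[:, d] < features[t, d])   # rank within window
            warped[t, d] = norm.ppf((rank - 0.5) / n)        # map to N(0, 1)
    return warped
```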
2.2 Models and Identification
A Gaussian mixture model (GMM) with diagonal covariance matrices is used as a gender-independent Universal Background Model (UBM). For each target speaker, a speaker-specific GMM is trained by Maximum A Posteriori (MAP) adaptation [3] of the Gaussian means of the UBM. The GMM-UBM approach has proved to be very successful for text-independent speaker recognition, since it allows the robust estimation of the target models even with a limited amount of enrollment data [7]. During the identification phase, each test segment X is scored against all targets λ_k in parallel and the target model with the highest log-likelihood is chosen: k* = argmax_k log f(X|λ_k).
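The mean-only MAP adaptation can be sketched as follows, following the standard GMM-UBM recipe of [3] and [7]; the relevance factor value and the array layout are assumptions.

```python
import numpy as np

# Sketch of MAP adaptation of the UBM means to a target speaker: only the means
# are updated, with a relevance factor r. ubm_* are numpy arrays:
# weights (M,), means (M, D), diagonal variances (M, D); frames is (T, D).

def map_adapt_means(frames, ubm_w, ubm_mu, ubm_var, r=10.0):
    # posterior responsibility of each mixture for each frame (diagonal Gaussians)
    log_g = -0.5 * (((frames[:, None, :] - ubm_mu) ** 2) / ubm_var
                    + np.log(2 * np.pi * ubm_var)).sum(axis=2)
    log_p = np.log(ubm_w) + log_g
    log_p -= log_p.max(axis=1, keepdims=True)
    post = np.exp(log_p)
    post /= post.sum(axis=1, keepdims=True)                 # (T, M)

    n = post.sum(axis=0)                                     # soft counts per mixture
    ex = (post.T @ frames) / np.maximum(n[:, None], 1e-10)   # per-mixture data mean
    alpha = n / (n + r)                                      # data/prior interpolation
    return alpha[:, None] * ex + (1.0 - alpha[:, None]) * ubm_mu
```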
2.3 Optimizations
In the CHIL framework, target model training and speaker identification need to be performed efficiently, faster than real time, for realistic configurations. Several
Fig. 1. MAP adaptation of the background model (trained via GMM training on reference voices) to a target speaker, producing the target model λ
optimizations have thus been implemented addressing training and scoring computational requirements.
Stochastic Frame Subsampling. For speaker recognition, reducing the number of frames by a decimation factor of up to 10 on the test segment only results in a limited loss of accuracy [4]. This can be explained by the high correlation of neighboring frames and the fact that a temporal context of several frames is already taken into account by the delta and delta-delta coefficients. It can also be of interest to speed up the training of the models. The UBM needs to account for the largest possible speaker variability in the acoustic context of the application; but the amount of training data needs to be put in relation with the number of parameters in the UBM. For training a GMM with diagonal covariance matrices, a few hundred frames per Gaussian should be enough for a reliable estimation of the means and variances. A possible solution can be a fixed-rate subsampling as described above; in this situation, a subset of the frames is selected once for all. We have experimented with another scheme. For each Expectation-Maximization (EM) iteration of the GMM re-estimation, a random selection of frames is applied according to a target ratio. This way, each frame can possibly impact the training. Also, if we train the GMM using a splitting algorithm starting with a single Gaussian, the stochastic frame sampling dramatically speeds up the initial training phases by adapting the number of frames to the number of components.
Top-Gaussian Scoring. Top-Gaussian scoring is an optimization used for speaker verification in the context of the parallel scoring of a set of target models MAP-adapted from the same GMM-UBM [4]. For each frame, the top scoring components of the UBM are selected; then the log-likelihood estimation for all target models is restricted to the same set of components. The speedup increases along with the size of the models and with the number of target speakers.
Auto-Adaptive Pruning. During scoring, it is usual to exclude models with a too low likelihood relative to the best current hypothesis. However, in the context of top-Gaussian scoring, the computation is dominated by the initial UBM likelihood estimation, and a reduction in the number of target candidates only provides a minor improvement; the major gain is observed when a single model remains and the end of the test segment can thus be discarded. Taking an early decision about the current speaker is also of interest in the context of an online system, as required for some CHIL applications. In this situation, an a priori fixed threshold is not precise enough for such aggressive pruning because of the acoustic variability. We have thus implemented an auto-adaptive pruning, which takes into account the distribution of the best-hypothesis log-likelihood:
– at each frame x_t, for each model λ_k, compute its cumulated log-likelihood: l_k(t) = (1/t) log f(x_1 ... x_t | λ_k)
– choose the best cumulated score up to the current frame: l*(t) = max_k l_k(t)
– compute the statistics (μ_l(t), σ_l(t)) of l*(t) with an exponential decay factor α ∈ ]0; 1] in order to focus on the most recent acoustic context:
\mu_l(t) = \frac{1}{\sum_{i=0}^{t} \alpha^i} \sum_{i=0}^{t} \alpha^i\, l^*(t-i) \qquad \text{and} \qquad \sigma_l(t)^2 = \frac{1}{\sum_{i=0}^{t} \alpha^i} \sum_{i=0}^{t} \alpha^i\, l^{*2}(t-i) - \mu_l(t)^2
– initialize l*(t) on a minimal count d_min of a few tens to a few hundred frames
– during scoring, cut model λ_k if l_k(t) < μ_l(t) − λ(t) σ_l(t), with the standard deviation factor λ(t) either constant or decreasing in time.
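A minimal sketch of this auto-adaptive pruning is given below; it maintains the exponentially decayed statistics incrementally, which is equivalent to the normalized sums above, and the schedule for λ(t) is left to the caller.

```python
import numpy as np

# Sketch of the auto-adaptive pruning statistics: exponentially decayed mean and
# variance of the best cumulated log-likelihood l*(t). Parameter defaults follow
# the values quoted in Section 3 (alpha = 0.995, d_min = 200).

class AdaptivePruner:
    def __init__(self, alpha=0.995, d_min=200):
        self.alpha, self.d_min = alpha, d_min
        self.s0 = self.s1 = self.s2 = 0.0
        self.t = 0

    def update(self, l_star):
        """Feed the best cumulated score l*(t) of the current frame."""
        self.t += 1
        self.s0 = 1.0 + self.alpha * self.s0
        self.s1 = l_star + self.alpha * self.s1
        self.s2 = l_star ** 2 + self.alpha * self.s2

    def should_cut(self, l_k, lam):
        """Cut model k if l_k(t) < mu_l(t) - lam * sigma_l(t)."""
        if self.t < self.d_min:            # wait for a minimal frame count
            return False
        mu = self.s1 / self.s0
        var = max(self.s2 / self.s0 - mu ** 2, 0.0)
        return l_k < mu - lam * np.sqrt(var)
```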
3 Experiments
In this section the experimental conditions are described, and the impact of the optimization and development work using the CHIL'05 evaluation data is given. Results on the CLEAR'06 evaluation data are also provided.
3.1 Experimental Setup
Seminars recorded for the CHIL project were used for building the system. All processing was performed on 16 kHz, 16-bit single-channel audio files in the far-field microphone condition. CHIL jun'04 data (28 segments from 7 seminars recorded by UKA for a total of 140 min.) and dev'06 data (another 140 min. from UKA plus 45 min. from the AIT, IBM and UPC partners) were used for training a generic speaker model. Beamformed data were supplied by our CHIL partner ISL/UKA for both the jun'04 and dev'06 data sets. The data from the CHIL 2005 speaker identification evaluation (jan'05) was used for the development of the system. For the CLEAR'06 evaluation data, the 64 channels of a MarkIII microphone array were provided; however, only the 4th channel of the MarkIII microphone array, as extracted and downsampled to 16 kHz by ELDA, was used. A gender-independent UBM with 256 Gaussians was trained on speech extracted from the jun'04 and dev'06 CHIL data. The amount of data was limited to 2 min. per speaker in order to increase the speaker variability in the UBM, for a total duration of about 90 min. Target models were MAP-adapted using 3 iterations of the EM algorithm and a prior factor of 10. Computation times were estimated on a standard desktop PC/Linux with a 3 GHz Pentium 4 CPU and are expressed in Real-Time factor (xRT) when relevant.
3.2 Optimization Results
The effect of the stochastic frame subsampling was studied on the 90 min. of training data, which account for d ≈ 500,000 frames after filtering of low-energy frames. With M = 256 components in the GMM and f = 200 frames kept on average per Gaussian, the gain relative to the standard training using all the frames at each step of the EM estimation is g(f) = d/(M·f) = 500,000/(256 × 200) ≈ 10. Figure 2 shows the likelihood of the UBM on the training data as a function of the computation time for the stochastic subsampling with an average count of 200 frames per Gaussian, compared to the standard training and to a fixed-rate
subsampling with the corresponding 10% ratio; it was obtained by varying the number of EM iterations from 1 to 9. For a given computation time, the stochastic subsampling outperforms the standard training, and also the fixed-rate decimation, due to the faster initialization procedure. For a given EM iteration count, we also observed that the stochastic subsampling even outperforms the full training up to 5 EM iterations, and the fixed-rate subsampling in all configurations.
Fig. 2. Likelihood of UBM on training data as a function of computation time and of EM iteration count for standard training, stochastic subsampling and fixed-rate subsampling
The scoring was performed with the top Gaussians. With M = 256 components in the GMMs, T = 10 top components and N = 26 target models, the gain in computation is g(T) = (M·N)/(M + T·N) = (256 × 26)/(256 + 10 × 26) ≈ 13. The pruning, with α = 0.995, d_min = 200 frames and λ(t) linearly decreasing from 4 to 2 along the test segment, brings an additional factor-of-2 speed-up for the 5-20 sec. test conditions, with no difference on the development results. Figure 3 illustrates the evolution of the auto-adaptive pruning threshold on a test sample, in a case where an impostor provides a better likelihood than the true speaker at the beginning of the segment. Overall, the cepstral features were computed at 0.1xRT. Target model adaptation was performed at 0.1xRT, and test identification at 0.08xRT down to 0.04xRT with pruning.
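Top-Gaussian scoring can be sketched as follows; the array layout and the assumption that target models share the UBM weights and variances (only the means are MAP-adapted) follow the description above, while the helper names are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

# Sketch of top-Gaussian scoring: for each frame the T best-scoring UBM
# components are found once, and every MAP-adapted target model is evaluated
# only on that same small subset of components.

def diag_gauss_logpdf(x, mu, var):
    return -0.5 * (((x - mu) ** 2) / var + np.log(2 * np.pi * var)).sum(axis=-1)

def top_gaussian_scores(frames, ubm_w, ubm_mu, ubm_var, target_means, T=10):
    log_w = np.log(ubm_w)
    totals = np.zeros(len(target_means))
    for x in frames:
        ubm_ll = log_w + diag_gauss_logpdf(x, ubm_mu, ubm_var)
        top = np.argsort(ubm_ll)[-T:]                   # indices of top components
        for n, mu_n in enumerate(target_means):
            ll = log_w[top] + diag_gauss_logpdf(x, mu_n[top], ubm_var[top])
            totals[n] += logsumexp(ll)                  # frame log-likelihood
    return totals / len(frames)

# The identified speaker is the argmax over the returned scores, as in Sect. 2.2.
```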
3.3 Development Results
Developments were conducted on the CHIL'05 Speaker Identification evaluation database, restricted to the microphone-array matched condition, for the 30-second training condition and 1- to 30-second test segments. These are the most similar to the CLEAR'06 conditions, despite the use of only 11 target speakers instead of 26.
(Plot: cumulated log-likelihood vs. frame index for the true target, the nearest impostor and the auto-adaptive threshold.)
Fig. 3. Example of the evolution of the auto-adaptive pruning threshold during the recognition of a test segment
Results of the LIMSI'05 system for the CHIL'05 evaluation under these restricted conditions are reported in Table 1. That system used a UBM with 2048 Gaussians trained on meeting data from various sources (ICSI, ISL, NIST) recorded using close-talking microphones, and cepstral mean and variance normalization was performed instead of feature warping [8]. The LIMSI'06 system provides a dramatic improvement for all segment durations, due mainly to better matched training data for the UBM. Contrastive experiments on feature normalization show that mean and variance normalization very significantly improves upon standard CMS, while feature warping is still slightly better. Other improvements to the system were mainly computation optimizations, which do not show in the recognition scores.
3.4 CLEAR'06 Evaluation
Table 2 reports the LIMSI results for the CLEAR'06 evaluation. Note that for a few hundred trials, the precision of the identification error rates remains limited to ∼1%.

Table 1. Identification error rates on the CHIL'05 Speaker Identification task restricted to microphone-array matched conditions, for the LIMSI'05 and the LIMSI'06 systems associated with different feature normalizations

Test duration                 | 1 second | 5 seconds | 10 seconds | 30 seconds
# trials                      | 1100     | 682       | 341        | 110
LIMSI'05                      | 52.8     | 11.3      | 4.7        | 0.0
LIMSI'06 with CMS             | 33.4     | 5.6       | 1.8        | 0.9
LIMSI'06 with mean+variance   | 30.5     | 2.3       | 0.6        | 0.0
LIMSI'06 with feature warping | 29.6     | 2.6       | 0.0        | 0.0
Table 2. LIMSI’06 system error rates for CLEAR’06 Acoustic Speaker Identification task Test duration 1 second 5 seconds 10 seconds 20 seconds # trials 613 411 289 178 Train A (15 seconds) 51.7 10.9 6.6 3.4 Train B (30 seconds) 38.8 5.8 2.1 0.0
Identification error rate (log-scale)
50
Train A (15 sec.) Train B (30 sec.)
20
10
5
2
1 1
5
10
20
Test duration in seconds (log-scale) Fig. 4. LIMSI’06 system identification error rates by training and test duration for CLEAR’06 Acoustic Speaker Identification task
to ∼ 1%. The difference in speaker count does not allow a direct comparison with development results, but we can observe that the trends are similar. We observe especially high error rates on 1 sec. test segments. The effect of training and test durations are illustrated on a log-log scale in Figure 4.
4
Conclusions
The LIMSI CLEAR’06 system provides an over 50% relative reduction of the error rate compared to CHIL’05 Speaker Identification LIMSI results for a comparable configuration (matched array condition, 30 sec. training, 5 and 10 sec. test). Several optimizations were implemented and provided 10–20 acceleration factor in model training and speaker identification. The stochastic subsampling was shown to perform very efficiently compared to other existing approaches.
240
C. Barras et al.
With the current system, no errors were measured for 30 sec. training and 20 sec. test segments; a larger test database would be necessary to increase the precision of the measure. However, identification rate of 1 second test segments remains poor compared to other results in the CLEAR’06 evaluation; our system would need specific tuning for very short segments.
Acknowledgments Thanks are due to the CHIL partners for the seminar data, and in particular to ISL-UKA for making the audio beamforming available.
References
1. C. Barras and J.-L. Gauvain, "Feature and score normalization for speaker verification of cellular data," in Proc. of IEEE ICASSP, May 2003.
2. G. Doddington, M. Przybocki, A. Martin, and D. Reynolds, "The NIST speaker recognition evaluation - overview, methodology, systems, results, perspective," Speech Communication, vol. 31, pp. 225-254, 2000.
3. J.-L. Gauvain and C.H. Lee, "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," IEEE Transactions on Speech and Audio Processing, vol. 2(2), pp. 291-298, April 1994.
4. J. McLaughlin, D. Reynolds, and T. Gleason, "A Study of Computation Speed-Ups of the GMM-UBM Speaker Recognition System," in Proc. Eurospeech'99, pp. 1215-1218, Budapest, Sept. 1999.
5. D. Mostefa et al., "CLEAR Evaluation Plan v1.1," http://isl.ira.uka.de/clear06/downloads/chil-clear-v1.1-2006-02-21.pdf
6. J. Pelecanos and S. Sridharan, "Feature warping for robust speaker verification," in Proc. ISCA Workshop on Speaker Recognition - Odyssey, June 2001.
7. D. Reynolds, T. Quatieri, and R. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, pp. 19-41, 2000.
8. X. Zhu, C-C. Leung, C. Barras, L. Lamel, and J-L. Gauvain, "Speech activity detection and speaker identification for CHIL," in Workshop on Multimodal Interaction and Related Machine Learning Algorithms (MLMI), Edinburgh, July 2005.
Person Identification Based on Multichannel and Multimodality Fusion Ming Liu, Hao Tang, Huazhong Ning, and Thomas Huang IFP Group University of Illinois at Urbana-Champaign Urbana, IL 61801 {mingliu1,htang2,hning2,huang}@ifp.uiuc.edu
Abstract. Person ID is very useful information for high-level video analysis and retrieval. In some scenarios, the recording is not only multimodal but also multichannel (microphone array, camera array). In this paper, we describe a multimodal person ID system based on multichannel and multimodal fusion. The audio-only system combines the 7-channel microphone recordings at the decision outputs of the individual audio-only systems. The modeling technique of the audio system is the Universal Background Model (UBM) with Maximum a Posteriori adaptation, a framework which is very popular in the speaker recognition literature. The visual-only system works directly in the appearance space via the l1 norm and a nearest-neighbor classifier. Linear fusion then combines the two modalities to improve the ID performance. The experiments indicate the effectiveness of microphone array fusion and audio/visual fusion.
1 Introduction
Person identification, as its name suggests, is the task of identifying a particular person out of a group of people by the use of a computer. Over the decades, this topic has brought about many research and engineering efforts in both academia and industry. In the literature, there exist two primary categories of work for person identification. One category involves the work of identifying a person by his or her voice, and is known as acoustic (audio) person identification [1][2][3], speaker identification, or voiceprint identification. The other category involves the work of identifying a person by his or her visual appearance (i.e., face) [4], and thus is named visual person identification or face recognition. Either category has been extensively addressed, and is traditionally formulated as a pattern recognition problem in some feature vector space, tackled by statistical classification and machine learning algorithms. Since fusing audio and visual cues can potentially achieve better performance for the task of person identification than treating each modality alone, researchers have begun to explore the correlations between the audio and visual signals. The concept of multimodal person identification has been brought to the attention of the speech and computer vision communities. Also, multichannel recordings
are available for some scenarios, such as smart-room recordings. Fusing the microphone array recordings and multiple camera recordings is a challenging and interesting research problem. This paper describes a system that fuses multimodal cues as well as multichannel recordings, so that person identification achieves a significant boost in performance. The experiments are conducted on the CLEAR 2006 Evaluation corpus [5]. The results show that the fusion of multiple channels and multiple modalities does improve the performance significantly. The accuracy for 1-second test utterances is boosted from 60% to 84%. For longer test utterances, the fused system can achieve 99% accuracy. These results clearly demonstrate the effectiveness of multichannel and multimodal fusion. The detailed algorithms and implementation of the system are described in the following sections.
2 Audio Person Identification Subsystem
The use of the Gaussian Mixture Model (GMM) has dominated the area of text-independent speaker identification for over a decade. The GMM is among the earliest and most effective generative methods used for speaker identification. In this domain, Mel Frequency Cepstral Coefficients (MFCC) are often used as features. Although MFCC is not exclusively designed as a speaker-distinguishing speech feature, its discriminative power lies in the fact that it is derived from the envelope of the speech spectrum, which is largely determined by the vocal tract structure. Our audio person identification subsystem adopts an improved variation of the GMM algorithm, the Universal Background Model adapted GMM (UBM-GMM), originally developed at MIT Lincoln Lab [6][7].

2.1 GMM
An M-mixture GMM is defined as a weighted sum of M component Gaussian densities

p(\bar{x}|\lambda) = \sum_{m=1}^{M} w_m \mathcal{N}(\bar{x} \mid \bar{\mu}_m, \Sigma_m)   (1)
where \bar{x} is a D-dimensional feature vector, w_m is the m-th mixture weight, and \mathcal{N}(\bar{x} \mid \bar{\mu}_m, \Sigma_m) is a multivariate Gaussian density with mean vector \bar{\mu}_m and covariance matrix \Sigma_m. Note that \sum_{m=1}^{M} w_m = 1. A speaker model \lambda = \{w_m, \bar{\mu}_m, \Sigma_m\}_{m=1}^{M} is obtained by fitting a GMM to a training utterance X = \{\bar{x}_1, \bar{x}_2, ..., \bar{x}_T\} using the expectation-maximization (EM) algorithm. The log likelihood of a testing utterance Y = \{\bar{y}_1, \bar{y}_2, ..., \bar{y}_T\} on a given speaker model \lambda is computed as follows.

LL(Y|\lambda) = \frac{1}{T} \sum_{t=1}^{T} \log p(\bar{y}_t|\lambda)   (2)

where p(\bar{y}_t|\lambda) is the likelihood of the t-th frame of the utterance. To identify an utterance as having been spoken by a person out of a group of N people, we compute its utterance scores against all N speaker models and pick the maximum
\hat{\lambda} = \arg\max_{\lambda_n} LL(Y|\lambda_n)   (3)

where \lambda_n is the model of the n-th speaker.

2.2 UBM-GMM
The GMM algorithm described in the previous subsection requires that every speaker model be trained independently with the speaker's training data. When the available training data for a speaker is limited, the model is prone to singularity. In the UBM-GMM algorithm, a different scheme is adopted to train the speaker models. A single speaker-independent Universal Background Model (UBM) \lambda_0 is trained with a combination of the training data from all speakers, and a speaker model \lambda is derived by updating the well-trained UBM with that speaker's training data via Maximum A Posteriori (MAP) adaptation [7]. The final score of a testing utterance is computed as the log likelihood ratio between the target model and the background model:

LLR(Y) = LLR(\bar{y}_1^T) = \frac{1}{T} \sum_{t=1}^{T} \log \frac{P(\bar{y}_t|\lambda_1)}{P(\bar{y}_t|\lambda_0)}   (4)
where \bar{y}_1^T are the feature vectors of the observed (test) utterance Y, \lambda_0 is the parameter set of the UBM and \lambda_1 is the parameter set of the target model. Essentially, the verification task is to construct a generalized likelihood ratio test between hypothesis H1 (observation drawn from the target) and hypothesis H0 (observation not drawn from the target). The advantages of the UBM-GMM over the GMM are two-fold. First, the UBM is trained with a considerable amount of data and is thus quite well defined. A speaker model, obtained by adapting the parameters of the UBM with a small amount of new data, is expected to be well defined, too. Hence, the UBM-GMM approach should be robust to limited training data. Second, during adaptation, only a small number of Gaussian components of the UBM are updated. It follows that the model storage requirements can be reduced significantly by storing only the difference between a speaker model and the UBM. In our experiments, a 128-component UBM is trained with CHIL development data (approximately one hour of speech).
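The following sketch illustrates the UBM-GMM scoring pipeline of Equations (1)-(4). It is not the authors' implementation: it assumes a UBM already fitted with scikit-learn, adapts only the mixture means (a common simplification of full MAP adaptation), and uses a hypothetical relevance factor r.

```python
# Sketch of UBM-GMM enrollment and LLR scoring (eqs. 1-4); assumptions noted above.
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm: GaussianMixture, X: np.ndarray, r: float = 16.0) -> np.ndarray:
    """Return speaker-adapted means from enrollment frames X of shape (T, D)."""
    post = ubm.predict_proba(X)                              # responsibilities, (T, M)
    n_m = post.sum(axis=0)                                   # soft counts per mixture
    ex_m = post.T @ X / np.maximum(n_m[:, None], 1e-10)      # per-mixture data means
    alpha = n_m / (n_m + r)                                  # adaptation coefficients
    return alpha[:, None] * ex_m + (1.0 - alpha[:, None]) * ubm.means_

def llr_score(ubm: GaussianMixture, adapted_means: np.ndarray, Y: np.ndarray) -> float:
    """Average log-likelihood ratio of test frames Y: target model vs. UBM (eq. 4)."""
    target = GaussianMixture(n_components=ubm.n_components,
                             covariance_type=ubm.covariance_type)
    target.weights_, target.covariances_ = ubm.weights_, ubm.covariances_
    target.means_ = adapted_means
    target.precisions_cholesky_ = ubm.precisions_cholesky_   # valid: covariances are shared
    return float(np.mean(target.score_samples(Y) - ubm.score_samples(Y)))
```

Adapting only the means while sharing weights and covariances with the UBM is what makes the compact storage mentioned above possible.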
2.3 Multichannel Fusion
In order to combine the different microphone channels of the microphone array, a linear fusion is adopted. The MarkIII microphone array in our task has 64 channels in a linear configuration with 2 cm between adjacent channels. In order to have more variety between any two channels, we select roughly one channel out of every ten; the channels used in the fusion module are 00, 10, 22, 30, 40, 50, and 60. The fusion is conducted directly on the log-likelihood scores of the individual channels with equal weights.
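A minimal sketch of this decision-level channel fusion, assuming llr is a hypothetical (channels x speakers) array of per-channel log-likelihood ratio scores for one test segment:

```python
import numpy as np

def fuse_channels(llr: np.ndarray) -> int:
    """Equal-weight fusion of per-channel scores; returns the index of the identified speaker."""
    fused = llr.mean(axis=0)        # average the log-likelihood scores across channels
    return int(np.argmax(fused))
```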
3 Face Recognition
Our face recognition subsystem is based on the K-Nearest Neighbor (KNN) algorithm. Like a typical face recognition system, our system has the following modules: cropping, alignment, metric measurement, and KNN. The main difference is that, instead of determining the person ID from a single face image, our system makes the decision by processing all face samples in a video clip. In other words, our system has a module for fusing multiple face samples.

3.1 Face Cropping
For both training and testing videos, the faces are cropped according to the bounding boxes and nose bridge positions provided by the organizers. In spite of the large variation in view angles, the face images are then scaled to a fixed size (20 × 20 in our experiment) with the nose bridge fixed at the center of the image. Face images without nose bridge positions are omitted from the experiment because most of them have poor quality and may introduce extra errors into the system. Figure 1 shows some cropped face samples. These images have varying face angles, changing illumination, and varying backgrounds, which make the face recognition a big challenge.
Fig. 1. Examples of cropped face samples
3.2 Face Alignment
In a typical face recognition system, an alignment procedure is applied to the cropped faces such that the main facial feature points (such as eye corners, nose point, mouth corners) are aligned across images. However, face alignment is extremely difficult for this CHIL data, because the face angles vary a lot and the face resolution is very low. Therefore we use a shifting procedure to partly substitute for the alignment procedure. In detail, the training samples are repeatedly shifted by one or two pixels in all directions to generate new training samples. We assume that, after shifting, any test sample has a counterpart in the training data set (including the shifted samples) such that both come from the same person and have the same alignment.
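A minimal sketch of this shift-based augmentation, assuming `face` is a 20x20 grayscale array; the maximum shift of two pixels matches the description above, while the edge-handling mode is an assumption.

```python
import numpy as np
from scipy.ndimage import shift

def shifted_versions(face: np.ndarray, max_shift: int = 2):
    """Generate training samples shifted by 1..max_shift pixels in all directions."""
    out = []
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            if dx == 0 and dy == 0:
                continue
            out.append(shift(face, (dy, dx), mode="nearest"))  # replicate border pixels
    return out
```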
3.3 Affinity Measurement
As is well known, all face recognition algorithms depend heavily on the choice of the metric. In our work, we transform the color face images into gray scale, expand them into vectors, and finally calculate the l_p distances

d_p(f_1, f_2) = \left[ \sum_{i=1}^{D} |f_1(i) - f_2(i)|^p \right]^{1/p}   (5)
where f_1 and f_2 are the face samples and D is the total dimension; f_1(i) is the i-th dimension of the face sample. It is worth mentioning that the l_1 distance gives better performance than the l_2 distance in our work.

3.4 KNN and Fusion of Multiple Faces
As mentioned above, unlike typical face recognition systems, our system makes the decision by processing all face samples in a video clip. We call the face samples in the same clip a "test subset". To determine the person ID from the entire test subset, we first apply the KNN algorithm to each sample in the subset separately, and then fuse the outputs of the KNN algorithm to make the final decision. We use the standard KNN algorithm. For each face sample f in a test subset S, the K training samples with the smallest distance to f are selected as candidates. These K samples are called the candidate set \Omega(f) of sample f. Therefore, given that S contains N_S samples, the subset S will have K \times N_S candidates from the training set, which form a candidate set \Omega = \bigcup_{f \in S} \Omega(f). We then use voting over \Omega to generate the ID for the test subset.
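A sketch of the clip-level decision described in Sections 3.3-3.4, assuming hypothetical arrays train_faces (flattened gray faces), train_ids (their labels) and test_subset (the faces of one clip); K nearest neighbours under the l1 distance vote for the clip identity, with K = 1 reported as optimal later in the paper.

```python
import numpy as np
from collections import Counter

def identify_clip(test_subset, train_faces, train_ids, K=1):
    votes = Counter()
    for f in test_subset:
        d = np.abs(train_faces - f).sum(axis=1)      # l1 distances to all training faces
        nearest = np.argsort(d)[:K]                  # candidate set Omega(f)
        votes.update(train_ids[i] for i in nearest)
    return votes.most_common(1)[0][0]                # majority vote over the clip
```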
4 Audio Visual Fusion
In order to obtain better performance, an audio/visual fusion module is applied to combine the two modalities. Different kinds of fusion strategies have been proposed in the literature [8][9][10]. There are mainly three levels of fusion: feature level, state level and decision level. Fusion at the feature level mainly concatenates the features from the different modalities into a single large feature vector. Dimension reduction techniques such as PCA or LDA can be used to reduce the dimensionality of the final feature vector, and the modeling is then conducted on the resulting vectors. Feature-level fusion is usually the simplest strategy and often results in only moderate improvement. State-level fusion is reported to be the best strategy in the audio/visual speech recognition literature. The basic idea is to fuse the observation likelihoods of the different modalities on the same state; by searching for the right confidence measure between the two streams, the fusion can achieve the best improvement.
However, the text-independent ID task makes it difficult to find a common state for the audio and visual streams. To circumvent this difficulty, we explore decision-level fusion for this task. The decision outputs of the audio and visual streams are the similarity scores of the testing utterance against the 26 target speaker models. By tuning the weighting factor between the two streams, we obtain a very good improvement after fusion. Intuitively, the weighting factor between the audio and visual streams should not be static. In principle, the optimal weighting factor should be estimated from the SNRs of the different modalities, but reliable SNR estimates are also difficult to obtain. The duration of the speech utterance, however, is consistently correlated with the performance of the audio-only system, and the same holds for the number of face frames. In this task, we therefore search for the optimal weighting factor for each testing condition (1 sec, 5 sec, 10 sec, 20 sec) individually, based on experiments on the CHIL development dataset. The optimal weighting factors we obtained between the audio and visual modalities are (3:1), (40:1), (180:1) and (550:1) for the 1 sec, 5 sec, 10 sec and 20 sec conditions, respectively.
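A minimal sketch of this decision-level audio-visual fusion, assuming audio_scores and visual_scores hold the per-speaker similarity scores of one test segment; the duration-dependent audio:visual weights are the ones reported in the text.

```python
import numpy as np

AV_WEIGHTS = {1: 3.0, 5: 40.0, 10: 180.0, 20: 550.0}   # audio weight relative to visual (= 1)

def fuse_av(audio_scores, visual_scores, duration_sec):
    w = AV_WEIGHTS[duration_sec]
    fused = w * np.asarray(audio_scores) + np.asarray(visual_scores)
    return int(np.argmax(fused))                         # index of the identified speaker
```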
5 Experiment Results
The CHIL 2006 ID task corpus [5] contains 26 video sequences from 5 sites. The audio recordings are far-field microphone array recordings; in our experiments, only one microphone array, the MarkIII, is considered. It has 64 channels in a linear array configuration spaced 2 cm apart. The video recordings come from four cameras located at the four corners of the ceiling. Both the audio and the visual recordings are far-field, and therefore noisy and of low resolution, so the performance of each individual modality alone is not sufficient. The corpus contains seminar recordings as well as interactive discussion recordings. There are two training conditions that vary with respect to the enrollment duration: train set A has 15 sec of training enrollment while train set B has 30 sec. The testing conditions vary in terms of testing duration; the four testing conditions correspond to 1 sec, 5 sec, 10 sec and 20 sec. A 128-component UBM is trained from approximately one hour of CHIL development data. To improve the audio-only system with the multichannel recordings, we fuse the channels by decision-level fusion, and all 7 channels (00, 10, 22, 30, 40, 50, 60) are treated with equal weighting factors. The experiment results (Table 1 and Table 2) show that the improvement from multichannel fusion is significant, especially for the short testing utterance conditions (accuracy boost from 65% to 74%). For the visual-only part, we tried different distance measures (l1, l2 and normalized cross correlation) and different neighborhood sizes (N = 1, 3, 5, 7, 10). It turns out that the l1 norm combined with N = 1 is optimal on the CHIL development data (Table 3). The performance of the audio/visual fusion is listed in Table 4. The improvement due to A/V fusion is as large as 8% in absolute percentage (74% → 82%) compared to the multichannel fused audio-only system and 16% in absolute percentage (66% → 82%) compared to the single-channel audio-only system.
Table 1. Single Channel Audio-only System Performance

TrainSet   test1   test5   test10   test20
A          65.9    88.07   93.08    94.38
B          69.0    92.45   96.54    97.75
Table 2. Microphone Array Audio-only System Performance

TrainSet   test1   test5   test10   test20
A          74.06   95.86   97.23    98.88
B          79.12   96.84   98.27    99.44
Table 3. Visual-only System Performance

TrainSet   test1   test5   test10   test20
A          62.26   73.32   79.02    80.68
B          71.01   81.54   83.91    85.23
Table 4. Final Audio Visual Fusion System Performance

TrainSet   test1   test5   test10   test20
A          82.39   97.32   98.27    99.44
B          86.79   97.57   98.62    99.44

6 Conclusion and Future Work
In this paper, we described a multimodal person ID system based on multichannel and multimodal fusion. The audio-only system combines the recordings of 7 microphone channels at the decision outputs of the individual audio-only systems. The modeling technique of the audio system is UBM-GMM, and the visual-only system works directly in the appearance space via the l1 norm and a nearest neighbor classifier. A linear fusion then combines the two modalities to improve the ID performance. The experiments indicate the effectiveness of microphone array fusion and audio/visual fusion. Although the CHIL06 corpus is quite a large database (200 gigabytes for all evaluation data), the number of speakers is rather small. In the near future, we are going to include more speakers from the CHIL06 development corpus to further verify our framework. Also, linear fusion is a simple yet useful solution for multichannel and multimodal fusion; more sophisticated fusion schemes are under investigation.
Acknowledgments This work was supported in part by National Science Foundation Grant CCF 04-26627 and ARDA VACE II.
References
[1] Doddington, G.: Speaker recognition - identifying people by their voices. (1985) 1651–1664
[2] Reynolds, D.A.: Speaker identification and verification using Gaussian mixture speaker models. Speech Communication 17 (1995) 91–108
[3] Furui, S.: An overview of speaker recognition technology. (1996) 31–56
[4] Zhao, W., Chellappa, R., Phillips, P.J., Rosenfeld, A.: Face recognition: A literature survey. ACM Comput. Surv. 35(4) (2003) 399–458
[5] CLEAR evaluation website, http://clear-evaluation.org/
[6] Reynolds, D.A.: Comparison of background normalization methods for text-independent speaker verification. In: Proc. Eurospeech '97, Rhodes, Greece (1997) 963–966
[7] Reynolds, D., Quatieri, T., Dunn, R.: Speaker verification using adapted Gaussian mixture models. Digital Signal Processing (2000)
[8] Dupont, S., Luettin, J.: Audio-visual speech modelling for continuous speech recognition. IEEE Transactions on Multimedia (2000)
[9] Garg, A., Potamianos, G., Neti, C., Huang, T.S.: Frame-dependent multi-stream reliability indicators for audio-visual speech recognition. In: Proc. of International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2003)
[10] Potamianos, G.: Audio-Visual Speech Recognition. Encyclopedia of Language and Linguistics (2005)
ISL Person Identification Systems in the CLEAR Evaluations Hazım Kemal Ekenel1 and Qin Jin2 1
Interactive Systems Labs (ISL), Computer Science Department, Universität Karlsruhe (TH), 76131 Karlsruhe, Germany
[email protected] 2 Interactive Systems Labs (ISL), Computer Science Department Carnegie Mellon University, 15213 Pittsburgh, PA, USA
[email protected]
Abstract. In this paper, we present three person identification systems that we have developed for the CLEAR evaluations. Two of the developed identification systems are based on single modalities, audio and video, whereas the third system uses both of these modalities. The visual identification system analyzes the face images of the individuals to determine the identity of the person. It processes multi-view, multi-frame information to provide the identity estimate. The speaker identification system processes the audio data from different channels and tries to determine the identity. The multi-modal identification system fuses the similarity scores obtained by the audio and video modalities to reach an identity estimate.
1 Introduction
Person identification in smart environments is important in many respects. For instance, customization of the environment according to the person's identity is one of the most useful applications. However, until now, person identification research has focused on security-oriented authentication applications, and face recognition in smart rooms has been ignored to a great extent. In the CHIL project [1], aiming to encourage research efforts on person identification in smart environments, a data corpus and evaluation procedure have been provided. Following two successful uni-modal identification evaluations [2], this year multi-modal identification is also included in the person identification task. In this paper, the person identification systems that have been developed at the Interactive Systems Labs for the CLEAR evaluations are presented. The organization of the paper is as follows. In Section 2, the algorithms used in each system are explained. Experimental results are presented and discussed in Section 3. Finally, in Section 4, conclusions are given.
2 Methodology
In this section, the face recognition, speaker identification, and fusion algorithms that are used for the evaluations are presented.
2.1 Face Recognition
The face recognition system processes multi-view, multi-frame visual information to obtain an identity estimate. The system consists of the following building blocks:
- Image alignment
- Feature extraction
- Camera-wise classification
- Score normalization
- Fusion over camera-views
- Fusion over image sequence
The system receives an input image and the eye coordinates of the face in the input image. The face image is cropped and aligned according to the eye coordinates. If only one eye is visible, that image is not processed. The aligned image is then divided into non-overlapping 8x8 pixel image blocks. The discrete cosine transform (DCT) is applied to each local block. The obtained DCT coefficients are ordered using a zig-zag scan pattern. From the ordered coefficients, the first one is removed since it only represents the average value of the image block, and the first M coefficients are selected from the remaining ones [3]. To remove the effect of intensity level variations among the corresponding blocks of the face images, the extracted coefficients are normalized to unit norm. For detailed information please see [4]. Classification is performed by comparing the extracted feature vectors of the test image with the ones in the database. Each camera view is handled separately; that is, the feature vectors extracted from face images acquired by Camera 1 are compared only with the ones extracted from face images acquired by Camera 1 during training. This approach speeds up the system significantly: if we have N training images and R testing images from each of C cameras, exhaustive matching requires (C*N)*(C*R) similarity calculations between training and testing images, whereas camera-wise comparison requires only C*(N*R) comparisons, reducing the required computation to 1/C of the exhaustive case. In addition to the improvement in speed, it also provides a kind of view-based approach that separates the comparison of different views, which was shown to perform better than matching all face images without taking their view angles into consideration [5]. Distance values obtained from each camera view are normalized using the Min-Max rule, which is defined as:
ns = 1 - \frac{s - \min(S)}{\max(S) - \min(S)}
where, s corresponds to a distance value of the test image to one of the training images in the database, and S corresponds to a vector that contains the distance values of the test image to all of the training images. The division is subtracted from one, since the lower the distance is, the higher the probability that the test image belongs to that identity class. This way, the score is normalized to the value range of [0,1], closest
match having the score "1", and the furthest match having the score "0". These scores are then normalized by dividing them by the sum of the confidence scores. The obtained confidence scores are summed over camera views and over the image sequence. The identity of the face image is assigned to the person who has the highest accumulated score.
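As an illustration of the local appearance feature described above, the following sketch (not the authors' code) extracts unit-norm block-DCT features from an aligned face. The 40x32 face size, 8x8 blocks and M = 10 follow the experimental setup reported in Section 3.1; the zig-zag helper is a generic implementation.

```python
import numpy as np
from scipy.fftpack import dct

def zigzag_indices(n=8):
    """Indices of an n x n block in zig-zag (diagonal) order."""
    return sorted(((i, j) for i in range(n) for j in range(n)),
                  key=lambda p: (p[0] + p[1], p[1] if (p[0] + p[1]) % 2 else p[0]))

def local_dct_features(face: np.ndarray, block=8, M=10) -> np.ndarray:
    order = zigzag_indices(block)
    feats = []
    for r in range(0, face.shape[0], block):
        for c in range(0, face.shape[1], block):
            patch = face[r:r + block, c:c + block].astype(float)
            coeffs = dct(dct(patch, axis=0, norm="ortho"), axis=1, norm="ortho")  # 2D DCT
            zz = np.array([coeffs[i, j] for i, j in order])
            v = zz[1:M + 1]                                   # drop the DC term, keep first M
            feats.append(v / (np.linalg.norm(v) + 1e-10))     # unit norm per block
    return np.concatenate(feats)                              # 20 blocks x 10 = 200-dim vector
```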
2.2 Speaker Identification
In this section, the building blocks of the speaker identification system are explained.

2.2.1 Reverberation Compensation
A distant-talking speech signal is degraded by additive background noise and reverberation. Considering the room acoustics as a linear shift-invariant system, the received signal y[t] can be written as
y[t] = x[t] * h[t] + n[t]   (1)
where the source signal x[t] is the clean speech, h[t] is the impulse response of the room reverberation, and n[t] is recording noise. Cepstrum Mean Subtraction (CMS) has been used successfully to compensate the convolutional distortion. For CMS to be effective, the length of the channel impulse response has to be shorter than the short-time spectral analysis window, which is usually 16-32 ms. Unfortunately, the impulse response of reverberation usually has a much longer tail, often more than 50 ms, so traditional CMS is not as effective under these conditions. We separate the impulse response h[t] into two parts h1[t] and h2[t], where

h[t] = h1[t] + \delta(t - T) h2[t],
h1[t] = h[t] for t < T, and 0 otherwise,
h2[t] = h[t + T] for t >= 0, and 0 otherwise,

and rewrite formula (1) as

y[t] = x[t] * h1[t] + x[t - T] * h2[t] + n[t]

h1[t] is a much shorter impulse response, whose length is smaller than the DFT analysis window, so it can be compensated by conventional CMS. The term x[t - T] * h2[t] is treated the same as the additive noise n[t], and a noise reduction technique based on spectrum subtraction is applied. Assuming the noise x[t - T] * h2[t] + n[t] can be estimated from y[t - T], the spectrum subtraction is performed as

\hat{X}[t, w] = \max(Y[t, w] - a \cdot g(w) Y[t - T, w], \; b \cdot Y[t, w])
where a is the noise overestimation factor and b is the spectral floor parameter used to avoid negative or underflow values. The optimum a, b and g(w) can be estimated empirically on a development dataset. We found that the system performance is not sensitive to T: within the range of 20-40 ms there is no significant difference in the effect of the spectrum subtraction, but outside that range there is obvious performance degradation. We found a = 1.0, b = 0.1 and g(w) = 1 - 0.9e^{jw} to be optimal in most conditions based on the development data described in [6]. Standard CMS is applied after spectrum subtraction to eliminate the effect of h1[t].

2.2.2 Feature Warping
The feature warping method proposed in [7], which warps the distribution of a cepstral feature stream to a standardized distribution over a specified time interval, aims to make the features more robust to different channel and noise effects. The warping can be considered as a nonlinear transformation T that transforms the original feature X to a warped feature \hat{X}, i.e.,

\hat{X} = T(X)
This can be done by CDF matching, which warps a given feature so that its CDF matches a desired distribution, such as the standard normal distribution. The method assumes that the dimensions of the MFCC vector are independent, so each dimension is processed as a separate stream. The CDF matching is performed over short time intervals by shifting a window; only the central frame of the window is warped each time. The warping proceeds as follows, in the same way as in [8]:
• for i = 1, ..., d, where d is the number of feature dimensions:
  • sort the features in dimension i in ascending order within the given window
  • warp the raw feature value x in dimension i of the central frame to its warped value \hat{x}, which satisfies

\phi = \int_{-\infty}^{\hat{x}} f(y) \, dy
where f(y) is the probability density function (PDF) of the standard normal distribution, i.e.,

f(y) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{y^2}{2}\right)

and \phi is its corresponding CDF value. Suppose x has rank r within the window and the window size is N. Then the CDF value can be approximated as

\phi = \frac{r - \frac{1}{2}}{N}

• \hat{x} can be quickly found by lookup in a standard normal CDF table.
In our experiments, the window size is 300 frames and the window shifts one frame. Zeros are padded at the beginning and at the end of the raw feature stream.
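A minimal sketch of this short-time feature warping, assuming `features` is a float (T, d) array of cepstral coefficients; the 300-frame window, one-frame shift and zero padding follow the description above, and scipy's inverse normal CDF replaces the lookup table.

```python
import numpy as np
from scipy.stats import norm

def feature_warp(features: np.ndarray, win: int = 300) -> np.ndarray:
    T, d = features.shape
    half = win // 2
    padded = np.pad(features, ((half, half), (0, 0)), mode="constant")  # zero padding
    warped = np.empty_like(features)
    for t in range(T):
        window = padded[t:t + win]                     # frames centered on frame t
        rank = 1 + np.sum(window < features[t], axis=0)  # rank of central value per dimension
        phi = (rank - 0.5) / win                       # approximate CDF value
        warped[t] = norm.ppf(phi)                      # inverse standard normal CDF
    return warped
```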
2.2.3 Speaker Modeling
Over the past decades, the GMM has become the dominant approach for speaker modelling in speaker recognition systems which use untranscribed training data [9]. The recognition decision is made as follows
s = \arg\max_i \{ L(Y \mid \Theta_i) \}, \quad Y = (y_1, y_2, ..., y_N),

where s is the identified speaker and L(Y \mid \Theta_i) is the likelihood that the test feature set Y was generated by the GMM \Theta_i of speaker i, which contains M weighted mixtures of Gaussian distributions

\Theta_i = \sum_{m=1}^{M} \lambda_m N(X, U_m, \Sigma_m), \quad i = 1, 2, ..., S,
where X is the set of training feature vectors to be modelled, S is the total number of speakers, M is the number of Gaussian mixtures, \lambda_m is the weight of Gaussian component m, and N(X, U_m, \Sigma_m) is a Gaussian function with mean vector U_m and covariance matrix \Sigma_m. The parameters of a GMM are estimated from the speech samples of a speaker using the EM algorithm. In our system, 128 Gaussians and 32 Gaussians are trained per speaker for the training durations of 30 seconds and 15 seconds, respectively. We show how we chose these numbers of Gaussians in the experimental results section.

2.3 Multimodal Identification
Multimodal identification is performed by fusing the match scores of each single modality, audio and video. Since different classifiers are used in each modality (nearest neighbor vs. GMM), the confidence scores of each modality are normalized with a non-linear function to compensate for this mismatch; a sigmoid function is used for this purpose. After normalizing the match scores, they are fused via the sum rule. Since there is no common validation set available to the evaluation participants, no prior information about the performance of audio-only and video-only testing is used. Therefore, the audio and video modalities are equally weighted.
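A minimal sketch of this score-level fusion, assuming audio_scores and video_scores are arrays of per-identity match scores for one test segment; the sigmoid squashes both score ranges before the equally weighted sum rule.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def fuse_and_identify(audio_scores: np.ndarray, video_scores: np.ndarray) -> int:
    fused = sigmoid(audio_scores) + sigmoid(video_scores)  # equal weights for both modalities
    return int(np.argmax(fused))                           # index of the identified person
```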
3 Experiments
In this section the evaluation data are described and the experimental results are presented.

3.1 Face Recognition Experiments
The evaluation data for the visual identification task in the CLEAR evaluations consist of short video sequences taken from the Seminar 2005 database recorded at the various CHIL sites. There are 26 individuals in the database. In the face recognition experiments, face images are aligned according to eye-center coordinates and scaled to 40x32 pixels resolution. Only every fifth frame that has eye coordinate labels is used for training and testing. The aligned image is then
divided into non-overlapping 8x8 pixel blocks, yielding 20 local image blocks. From each image block, 10 unit-norm DCT-0 coefficients are extracted, and they are concatenated to construct the 200-dimensional final feature vector. Classification is performed using a nearest neighbor classifier. The L1 norm is selected as the distance metric, since it has been observed to consistently give the best correct recognition rates when unit-norm DCT-0 coefficients are used. The distance values are converted to matching scores using the Min-Max rule. The normalized matching scores are accumulated over the different camera views and over the image sequence. The identity candidate that has the highest score is assigned as the identity of the person. The false identification rates for different training and testing durations can be seen in Table 1. As can be observed from the table, an increase in the training segments' duration or in the testing segments' duration decreases the false identification rate.

Table 1. False visual identification rates

Test Duration (sec)   Segments   Train A (15 sec)   Train B (30 sec)
1                     613        46.8%              40.1%
5                     411        33.6%              23.1%
10                    289        28.0%              20.4%
20                    178        23.0%              16.3%
3.2 Speaker Identification Experiments
The evaluation data of the CHIL 2005 Spring Evaluation was used as our development data set. This data set is built on the union of the UKA-ISL_Seminar_2003 and UKA-ISL_Seminar_2004 databases. Non-speech segments have been manually removed both from the training and the testing segments. There are two microphone conditions: Close-Talking Microphone (CTM) and Microphone Array (ARR). The duration and number of segments selected for training and testing while improving our system are described in Table 2.

Table 2. Description of development data
          Duration (sec)   CTM Segments   ARR Segments
Train A   30               11             11
Train B   15               11             11
Test      5                1100           682
In order to find an optimal number of Gaussians for a speaker model, we conducted several speaker identification experiments with different number of Gaussians in a speaker model.
Table 3. False identification rate with different numbers of Gaussians for 30-sec training

Number of Gaussians      64      128     256
Misclassification Rate   0.36%   0.27%   0.36%
Table 4. False identification rate with different numbers of Gaussians for 15-sec training

Number of Gaussians      16      32      64
Misclassification Rate   2.82%   2.00%   2.23%
According to Tables 3 and 4, we chose 128 Gaussians for the 30-second training condition and 32 Gaussians for the 15-second training condition. Table 5 shows the system performance improvement obtained by applying reverberation compensation and feature warping under the 30-second training condition. We can see from the table that a significant improvement was achieved for both the CTM and ARR microphone conditions.

Table 5. Performance improvement by reverberation compensation and feature warping
       Baseline   RC+Warp   Relative Improvement
CTM    0.27%      0.18%     33.3%
ARR    6.74%      3.08%     54.3%
Table 6. False audio identification rates in the CLEAR 2006 evaluation

Test Duration (sec)   Segments   Train A (15 sec)   Train B (30 sec)
1                     613        23.7%              14.4%
5                     411        7.8%               2.2%
10                    289        7.3%               1.4%
20                    178        3.9%               0%
The overall system performances for different training and testing durations are given in Table 6. It is apparent that, as the duration of the training or testing segments increases, the error rate decreases.

3.3 Multimodal Identification Experiments
The evaluation data for multimodal identification task in CLEAR evaluations consists of short audio-video sequences taken from the Seminar 2005 database recorded in the various CHIL sites. There are 26 individuals in the database.
To perform multimodal identification, the individual modality matching scores are fused as explained in Section 2.3. The experimental results can be seen in Table 7. Again, it can be observed that an increase in the training segments' duration or in the testing segments' duration decreases the false identification rate. Due to the equal weighting of the modalities, the multimodal identification performance lies above the visual-only performance but below the audio-only performance.

Table 7. False multi-modal identification rates
Test Duration (sec)   Segments   Train A (15 sec)   Train B (30 sec)
1                     613        43.1%              35.7%
5                     411        29.2%              19.7%
10                    289        23.9%              16.6%
20                    178        20.2%              12.4%
4 Conclusions
In this paper, we presented the person identification systems that have been developed at the Interactive Systems Labs for the CLEAR evaluations. The experimental results showed that speaker identification performs better than face recognition for person identification in smart environments. The main reason for the performance difference is the low video quality. Multimodal identification performs worse than speaker identification alone. This result is expected, since the audio and video data are weighted equally, due to the missing prior performance information caused by the lack of a common validation set.
Acknowledgements We would like to thank Kenichi Kumatani for his contributions to the ISL visual identification evaluation effort. This work is sponsored by the European Union under the integrated project CHIL, contract number 506909.
References
[1] Computers in the Human Interaction Loop - CHIL, http://chil.server.de/
[2] H.K. Ekenel, A. Pnevmatikakis, "Video-Based Face Recognition Evaluation in the CHIL Project - Run 1", 7th International Conference on Automatic Face and Gesture Recognition (FG2006), Southampton, UK, April 2006.
[3] H.K. Ekenel, R. Stiefelhagen, "Local Appearance based Face Recognition Using Discrete Cosine Transform", 13th European Signal Processing Conference (EUSIPCO 2005), Antalya, Turkey, September 2005.
[4] H.K. Ekenel, R. Stiefelhagen, "Analysis of Local Appearance-based Face Recognition: Effects of Feature Selection and Feature Normalization", CVPR Biometrics Workshop, New York, USA, June 2006.
[5] A. Pentland, B. Moghaddam, T. Starner and M. Turk, "View based and modular eigenspaces for face recognition", Proceedings of IEEE CVPR, pp. 84-91, 1994.
[6] Q. Jin, Y. Pan and T. Schultz, "Far-field Speaker Recognition", International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2006.
[7] J. Pelecanos and S. Sridharan, "Feature warping for robust speaker verification", Proc. Speaker Odyssey 2001 conference, June 2001.
[8] B. Xiang, U. Chaudhari, J. Navratil, G. Ramaswamy and R. Gopinath, "Short-time Gaussianization for Robust Speaker Verification", in Proc. ICASSP, 2002.
[9] D. Reynolds, "Speaker Identification and Verification Using Gaussian Mixture Speaker Models", Speech Communication, Vol. 17, No. 1-2, pp. 91-108, August 1995.
Audio, Video and Multimodal Person Identification in a Smart Room J. Luque, R. Morros, A. Garde, J. Anguita, M. Farrus, D. Macho, F. Marqués, C. Martínez, V. Vilaplana, and J. Hernando Universitat Politècnica de Catalunya Jordi Girona 1-3, Campus Nord Edifici D5 08034 Barcelona, Spain {aluque, morros, agarde, jan, mfarrus, dusan, ferran, veronica, javier}@gps.tsc.upc.edu
Abstract. In this paper, we address the modality integration issue using the example of a smart room environment, aiming at enabling person identification by combining acoustic features and 2D face images. First we introduce the monomodal audio and video identification techniques, and then we present the use of combined input speech and face images for person identification. The sensory modalities, speech and faces, are processed both individually and jointly. It is shown that the multimodal approach results in improved identification performance.
1 Introduction
Person identification consists in determining the identity of a person from a data segment, such as a speech or video segment. Currently, there is high interest in developing person identification applications in the framework of smart room environments. In a smart room, the typical situation is to have one or more cameras and several microphones. Perceptually aware interfaces can gather relevant information to recognize, model and interpret human activity, behaviour and actions. Such applications face an assortment of problems, such as mismatched training and testing conditions or the limited amount of training data. In this work we present the audio, video and multimodal person identification techniques and the results obtained in the CLEAR'06 Evaluation Campaign inside the CHIL (Computers in the Human Interaction Loop) project [1]. The CLEAR'06 Person Identification evaluation is a closed-set task, that is, all the possible speakers are known. Matched training and testing conditions and far-field data acquisition are assumed, as well as no a priori knowledge about the room environment. For acoustic speaker identification, the speech signals are parameterized using Frequency Filtering (FF) [2] over the filter-bank energies, which is both computationally efficient and robust against noise. Then, in order to model the probability distribution of the parameters generated by each speaker, Gaussian Mixture Models (GMM) [3] with diagonal covariances are used.
In the case of visual identification, an appearance-based technique is used due to the low quality of the images. Face images of the same individual are gathered into groups. Frontal images within a group are jointly compared to the models for identification. These models are composed of several images representative of the individual. The joint recognition enhances the performance of a face recognition algorithm applied on single images. Individual decisions are based on a PCA [4] approach given that the variability of the users’ appearance is assumed to be low and so are the lighting variations. Multimodal recognition involves the combination of two or more human traits like voice, face, fingerprints, iris, hand geometry, etc. to achieve better performance than using monomodal recognition [5], [6]. In this work, a multimodal score fusion technique, Matcher Weighting with equalized scores, has been used. This technique has obtained an improvement for the correct identification rate on the closed-set 15/30 seconds training and 1/5/10/20 seconds testing conditions on the CLEAR’06 Evaluation task. This paper is organized as follows: In Sections 2 and 3 an overview of the audio and video algorithms and techniques is given. Section 4 presents the technique for multimodal fusion. Section 5 describes the evaluation scenario and the experimental results. Finally, Section 6 is devoted to provide conclusions.
2 Audio Person Identification
The speaker identification (SI) task consists in determining the identity of the speaker of a speech segment. In this task it is usually assumed that all the possible speakers are known. For this evaluation, recordings from 26 speakers using one microphone of an array have been provided. The first stage of current speaker recognition systems is a segmentation of the speech signal into regular segments. The speech signal is divided into frames of 30 ms at a rate of 10 ms. From each segment a vector of parameters that characterizes the segment is calculated. In this work we have used the Frequency Filtering (FF) parameters [2]. These parameters are calculated like the widely used Mel-Frequency Cepstral Coefficients (MFCC) [7], but replacing the final Discrete Cosine Transform of the logarithmic filter-bank energies of the speech frame with the following filter:
H(z) = z - z^{-1}   (1)
Fig. 1 shows the calculation procedure of the FF parameters. These features have several interesting characteristics:
− They are uncorrelated.
− They are computationally simpler than MFCCs.
− They have frequency meaning.
− They have generally shown an equal or better performance than MFCCs in both speech and speaker recognition.
Fig. 1. Calculation procedure of the FF parameters (block diagram: Window → FFT → |·| → Filter bank → Log(·) → H(z) → oFF)
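A minimal sketch of FF feature extraction, assuming log mel filter-bank energies computed with librosa; the frame length (30 ms), frame shift (10 ms) and the replacement of the final DCT by H(z) = z - z^{-1} follow the description above, while the sampling rate, FFT size and number of mel bands are assumptions.

```python
import numpy as np
import librosa

def ff_features(wav_path: str, n_bands: int = 24) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                         hop_length=160, win_length=480,
                                         n_mels=n_bands)         # 30 ms frames, 10 ms shift
    log_e = np.log(mel + 1e-10)                                  # log filter-bank energies
    # H(z) = z - z^{-1} applied across frequency: next band minus previous band
    ff = log_e[2:, :] - log_e[:-2, :]
    return ff.T                                                  # frames x features
```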
In order to capture the temporal evolution of the parameters, the first and second time derivatives of the features are generally appended to the basic static feature vector oFF. The so-called delta coefficients [8] are computed using the following regression formula

\Delta o_t(i) = \frac{\sum_{\theta=1}^{\Theta} \theta \, (o_{t+\theta}(i) - o_{t-\theta}(i))}{2 \sum_{\theta=1}^{\Theta} \theta^2}   (2)
where \Delta o_t(i) is the delta coefficient at time t computed in terms of the corresponding static coefficients o_{t-\Theta} to o_{t+\Theta}. The same formula is applied to the delta coefficients with another window size to obtain the acceleration coefficients. For each speaker that the system has to recognize, a model of the probability density function of the parameter vectors is estimated. These models are known as Gaussian Mixture Models (GMM) [3], each of which is a weighted sum of Gaussian distributions:

\lambda_j = \sum_{m=1}^{M} w_m N(o, \mu_m, \Sigma_m), \quad j = 1, ..., J   (3)
where \lambda_j is the model, o is the vector of parameters being modeled, J is the number of speakers, M is the number of Gaussian mixtures, w_m is the weight of Gaussian m, and N is a Gaussian function with mean vector \mu_m and covariance matrix \Sigma_m. The parameters of the models are estimated from speech samples of the speakers using the well-known Baum-Welch algorithm. In the testing phase of an SI system, a set of parameter vectors O = \{o_i\} is calculated from the testing signal. After that, the likelihood that the set O was produced by each speaker is calculated and the speaker with the largest likelihood is chosen, i.e.,

s = \arg\max_j \{ L(O \mid \lambda_j) \}   (4)
where s is the recognized speaker and L(O|λj) is the likelihood that the vector O was generated by the speaker of the model λj.
3 Video Person Identification
In this section, the Visual Person ID task is presented. Recognition is stand-alone, taking detection and tracking for granted; that is, the system is semi-automatic. We have developed for this task a technique for face recognition in smart environments. The technique takes advantage of the continuous monitoring of the scenario and combines the information of several images to perform the recognition. Appearance-based face recognition techniques are used, given that the scenario does not ensure high-quality images. As the visual identification evaluation is a closed-set identification task, models for all individuals in the database are created off-line using two sets of video segments: the first one consists of one segment of 15 s per individual in the database, while the second one consists of one segment of 30 s per individual. The proposed system works with groups of face images of the same individual. For each test segment, face images of the same individual are gathered into a group. Then, for each group, the system compares these images with the model of the person. We first describe the procedure for combining the information provided by a face recognition algorithm when it is applied to a group of images of the same person in order to, globally, improve the recognition results. Note that the proposed system is independent of the distance measure adopted by the specific face recognition algorithm.

Combining groups of images. Let \{x\}_i = \{x_1, x_2, ..., x_M\} be a group of M probe images of the same person, and let \{C\}_j = \{C_1, C_2, ..., C_S\} be the different models or classes stored in the (local or global) model database. S is the number of individual models. Each model C_j contains N_j images, \{y\}_{nj} = \{y_{1j}, y_{2j}, ..., y_{N_j j}\}, where N_j may be different for every class. Moreover, let

d(x_i, y_{nj}): \Re^Q \times \Re^Q \to \Re   (5)
be a certain decision function that applies to one element of {x}i and one element of {y}nj, where Q is the dimension of xi and ynj. It represents the decision function of any face recognition algorithm. It measures the similarity of a probe image xi to a test image ynj. We fix a decision threshold Rd so that xi and ynj represent the same person if d(xi, ynj) < Rd. If, for a given xi the decision function is applied to every ynj ∈ Cj, we can define the δ value of xi relative to a class Cj, δij as
\delta_{ij} = \#\{ y_{nj} \in C_j : d(x_i, y_{nj}) < R_d \}   (6)
That is, δij counts the number of times that the face recognition algorithm matches xi with an element of Cj. With this information, the δ-Table is built:
Table 1. δ-Table
       C1     C2     ...   CS
x1     δ11    δ12    ...   δ1S
x2     δ21    δ22    ...   δ2S
...    ...    ...    ...   ...
xM     δM1    δM2    ...   δMS
Based on the δ-Table above, we define the following concepts:
• Individual representation of x_i: It measures the representation of sample x_i by class C_j:

R(x_i, C_j) = \frac{\delta_{ij}}{N_j}   (7)
• Total representation of x_i: It is the sum of the individual representations of x_i over all the classes:

\hat{R}(x_i) = \sum_{j=1}^{S} R(x_i, C_j) = \sum_{j=1}^{S} \frac{\delta_{ij}}{N_j}   (8)
• Reliability of a sample x_i given a class C_j: It measures the relative representation of sample x_i by class C_j, considering that sample x_i could also be represented by other classes:

\rho(x_i, C_j) = \frac{R(x_i, C_j)}{\hat{R}(x_i)} = \frac{\delta_{ij}/N_j}{\sum_{k=1}^{S} \delta_{ik}/N_k} \le 1 \quad \text{if } \hat{R}(x_i) > 0; \qquad \rho(x_i, C_j) = 1 \quad \text{if } \hat{R}(x_i) = 0   (9)
The assignment \rho_{ij} = \rho(x_i, C_j) = 1 when the total representation is zero will be commented on when discussing the model updating.
• Representation of C_j: It estimates the relative representation of a group of samples \{x\}_i by a class C_j. Weighting is performed to account for the contribution of the group \{x\}_i to other classes:

R(C_j) = \frac{1}{M} \sum_{i=1}^{M} \rho_{ij} \cdot \delta_{ij}   (10)
Audio, Video and Multimodal Person Identification in a Smart Room
263
Fig. 2. Examples of ML for different σ values (N=5) −r
ML(C j ) =
1− e σ
2
2
−N j
1− e
2
(11)
σ2
where σ adjusts the range of R(C_j)'s values (see Fig. 2).
• Relative Match Likelihood for a class C_j: It relates the ML of a class C_j to the maximum ML of the other classes:
RML(C_j) = \frac{ML(C_j)}{\max_{k \ne j}(ML(C_k))} \quad \text{if } ML(C_j) \ge 0.5; \qquad RML(C_j) = 0 \quad \text{if } ML(C_j) < 0.5   (12)
This measure determines whether the selected class (that with maximum ML) is widely separated from the other classes. A minimum value of ML is required, to avoid analyzing cases with too low ML values. Relying on the previous concepts, the recognition process in identification mode is defined as follows:
1. Compute the δ-Table.
2. Compute the match likelihood (ML) for every model.
3. Compute the RML of the class with the highest ML(C_j).
The group is assigned to the class resulting in the highest RML value. In this work, a PCA-based approach [4] has been used. In this way, the decision function is the Euclidean distance between the projections of x_i and y_{nj} on the subspace spanned by the first eigenvectors of the training data covariance matrix:
d(x_i, y_{nj}) = \| W^T x_i - W^T y_{nj} \|   (13)

where W^T is the projection matrix.
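A sketch of the group-based identification scheme of Equations (6)-(12), assuming hypothetical helpers: `project` applies the PCA projection W^T, `models` is a list mapping each class to an array of projected gallery vectors, and the threshold Rd and parameter sigma are placeholder values that would be tuned on development data.

```python
import numpy as np

def identify_group(probe_imgs, models, project, Rd=1500.0, sigma=2.0):
    M, S = len(probe_imgs), len(models)
    X = [project(x) for x in probe_imgs]
    delta = np.zeros((M, S))                              # delta-table (eq. 6)
    for i, xi in enumerate(X):
        for j, Cj in enumerate(models):
            d = np.linalg.norm(Cj - xi, axis=1)           # Euclidean distance in PCA space (eq. 13)
            delta[i, j] = np.sum(d < Rd)
    N = np.array([len(Cj) for Cj in models], dtype=float)
    indiv = delta / N                                     # individual representation (eq. 7)
    total = indiv.sum(axis=1, keepdims=True)              # total representation (eq. 8)
    rho = np.where(total > 0, indiv / np.maximum(total, 1e-12), 1.0)       # reliability (eq. 9)
    Rj = (rho * delta).mean(axis=0)                       # class representation (eq. 10)
    ML = (1 - np.exp(-Rj**2 / sigma**2)) / (1 - np.exp(-N**2 / sigma**2))  # eq. 11
    best = int(np.argmax(ML))
    others = np.delete(ML, best)
    RML = ML[best] / others.max() if ML[best] >= 0.5 else 0.0              # eq. 12
    return best, RML
```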
The XM2VTS database [9] has been used as training data for estimating the projection matrix and the first 400 eigenvectors have been preserved. Due to the images being recorded continuously using the corner cameras, face images can not be ensured to be all frontal. Mixing frontal and non-frontal faces in the same models can be quite a problem for face recognition systems. To avoid this situation, eye coordinates are used to determine the face pose for each image. Only frontal faces are used for identification. Note that, in our system, models per each person have been automatically generated, without human intervention. All images for a given individual in the training intervals are candidates to form part of the model. Candidate face bounding boxes are projected on the subspace spanned by the first eigenvectors of the training data covariance matrix WT. The resulting vector is added to the model only if different enough from the vectors already present in the model.
4 Multimodal Person Identification
In a multimodal biometric system that uses several characteristics, fusion is possible at three different levels: the feature extraction level, the matching score level or the decision level. Fusion at the feature extraction level combines different biometric features in the recognition process, while decision-level fusion performs logical operations on the monomodal system decisions to reach a final resolution. Score-level fusion combines the individual scores of the different recognition systems to obtain a multimodal score. Fusion at the matching score level is usually preferred by most systems. Matching score level fusion is a two-step process: normalization and fusion itself [10], [11], [12], [13]. Since monomodal scores are usually non-homogeneous, the normalization process transforms the different scores of each monomodal system into a comparable range of values. One conventional affine normalization technique is z-score, which transforms the scores into a distribution with zero mean and unit variance [11], [13]. After normalization, the converted scores are combined in the fusion process in order to obtain a single multimodal score. Product and sum are the most straightforward fusion methods. Other fusion methods are min-score and max-score, which choose the minimum and the maximum of the monomodal scores as the multimodal score.

Normalization and Fusion Techniques. Scores must be normalized before being fused. One of the most conventional normalization methods is z-score (ZS), which normalizes the global mean and variance of the scores of a monomodal biometric. Denoting a raw matching score as a from the set A of all the original monomodal biometric scores, the z-score normalized score x_{ZS} is calculated according to Eq. 14:

x_{ZS} = \frac{a - \mathrm{mean}(A)}{\mathrm{std}(A)}   (14)
where mean(A) is the statistical mean of A and std(A) is the standard deviation. Histogram equalization (HE) is a non linear transformation whose purpose is to equalize the variances of two monomodal biometrics in order to reduce the non linear
effects typically introduced by speech systems. The HE technique matches the histogram obtained from the speaker verification scores and the histogram obtained from the face identification scores, both evaluated over the training data. The designed equalization takes as a reference the histogram of the scores with the best accuracy, which can be expected to have lower separate variances, in order to obtain a bigger variance reduction. In Matcher Weighting (MW) fusion, each monomodal score is weighted by a factor proportional to the recognition rate, so that the weights for more accurate matchers are higher than those of less accurate matchers. When using the Identification Error Rates (IER), the weighting factor for every biometric is proportional to the inverse of its IER. Denoting by w_m and e_m the weighting factor and the IER of the m-th biometric x_m, and by M the number of biometrics, the fused score u is expressed as

u = \sum_{m=1}^{M} w_m x_m   (15)

where

w_m = \frac{1/e_m}{\sum_{m=1}^{M} 1/e_m}   (16)
Before carrying out the fusion process, histogram equalization is applied over all the previously obtained monomodal scores. Since the best recognition results have been achieved in the acoustic recognition experiments, the histogram of the voice scores has been taken as a reference in the histogram equalization. After the equalization process, the weighting factors for both acoustic and face scores are calculated by using the corresponding Identification Error Rates, as in Eq. 16. Z-score normalization is also applied, and final fused scores are obtained by using Eq. 15.
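A minimal sketch of z-score normalization and Matcher Weighting fusion (Eqs. 14-16), assuming per-modality score matrices of shape (segments, identities) and identification error rates estimated on development data; the example IER values in the usage line are purely illustrative.

```python
import numpy as np

def z_score(scores: np.ndarray) -> np.ndarray:
    return (scores - scores.mean()) / scores.std()         # eq. 14 over all scores

def matcher_weighting(score_list, ier):
    inv = 1.0 / np.asarray(ier)
    weights = inv / inv.sum()                               # eq. 16
    normed = [z_score(s) for s in score_list]
    return sum(w * s for w, s in zip(weights, normed))      # eq. 15

# usage: fused = matcher_weighting([audio_scores, face_scores], ier=[0.05, 0.40])
#        identities = fused.argmax(axis=1)
```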
5 Experiments and Discussion

5.1 Experimental Set-Up
A set of audiovisual recordings of seminars and of highly interactive small working group seminars has been used. These recordings were collected by the CHIL consortium for the CLEAR 06 Evaluation. The recordings were done according to the "CHIL Room Setup" specification [1]. A complete description of the different recordings can be found in [14]. Data segments are short video sequences and matching far-field audio recordings taken from the above seminars. In order to evaluate how the duration of the training signals affects the performance of the system, two training durations have been considered: 15 and 30 seconds. Test segments of different durations (1, 2, 5, 10 and 20 seconds) have been used during the algorithm development and testing phases. A total of 26 personal identities have been used in the recognition experiments.
Each seminar has one audio signal from microphone number 4 of the Mark III array. Each audio signal has been divided into segments which contain information from a unique speaker. These segments have been merged to form the final testing segments of 1, 5, 10 and 20 seconds (see Table 2) and training segments of 15 and 30 seconds. Video is recorded in compressed JPEG format, with different frame rates and resolutions for the various recordings. Far-field conditions have been used for both modalities, i.e. corner cameras for video and the Mark III microphone array for audio. In the audio task only one array microphone has been considered for both the development and testing phases. In the video task, we have four fixed-position cameras that continuously monitor the scene. All frames in the 1/5/10/20 sec segments and all synchronous camera views can be used, and the information can be fused to find the identity of the concerned person. To find the faces to be identified, a set of labels is available with the position of the bounding box for each person's face in the scene. These labels are provided every 1 s. The face bounding boxes are linearly interpolated to estimate their position in intermediate frames. To help this process, an extra set of labels is provided, giving the position of both eyes of each individual every 200 ms.

Table 2. Number of segments for each test condition
                   Number of segments
Segment Duration   Development   Evaluation
1 sec              390           613
2 sec              182           0
5 sec              78            411
10 sec             26            289
20 sec             0             178
The metric used to benchmark the quality of the algorithms is the percentage of correctly recognized people over the test segments.

5.2 Results
In this section we summarize the results of the evaluation for the different modalities and the improvement obtained with the multimodal technique. Table 3 shows the correct identification rates for the audio and video modalities as well as the fusion identification rate, depending on the length of the test files. Regarding the acoustic identification task, it can be seen that the results, in general, improve as the segment length increases. Table 3 shows that for the different test segment lengths the recognition rate increases when more data is used to test the speaker models. Overall, using the 30-second training segments, an improvement of up to 8% in the recognition rate is obtained with respect to the case where 15-second segments are used. For the face identification evaluation, in general, the results show a low performance of the system. Results for training set B (using a segment of 30 s to
generate the models) show only a slight increase in performance with respect to training set A. It can also be seen that the results improve slowly as the segment length increases.

Table 3. Percentage of correct identification for the audio and video unimodal systems and the multimodal fusion. The first column shows the duration of the test segments in seconds, and the second one the number of tested segments. Train A and Train B are the training sets of 15 and 30 seconds, respectively.
                          Train A                   Train B
Duration   Segments   Speech   Video   Fusion   Speech   Video   Fusion
1          613        75.0     20.2    76.8     84.0     19.6    86.2
5          411        89.3     21.4    92.0     97.1     22.9    97.1
10         289        89.3     22.5    94.1     96.2     25.6    98.0
20         178        88.2     23.6    96.0     97.2     27.0    98.9
The reasons for this low performance are manifold. First of all, the system uses only frontal faces to generate the models and for recognition, but most of the face views found in the recordings are non-frontal. Another reason for the low percentage of correctly identified persons is the low quality of the images. The need to cover all the space in the room with four cameras results in small images, where the persons' faces are tiny; in the worst cases, face sizes are only 13x13 pixels. In addition, poor illumination conditions in some recordings cause the cameras to work at large diaphragm apertures. As a result, the depth of field is very shallow and several images are out of focus. Other recordings present interlacing errors. Figure 3 shows several examples of all these problems. Another problem is that, because the face bounding boxes are interpolated from the 1-second labels, our system is in many cases considering as 'frontal' faces that are not really frontal. Figures (a), (b), (c) and (e) are examples of this situation.
Fig. 3. Examples of face bounding boxes taken from several recordings, shown at their relative sizes. The smallest image (c) is 13x13 pixels and the largest image (a) is 29x47 pixels. The images are taken from the training segments of the AIT, IBM, ITC, UPC and UKA recordings. Images in the test sequences are similar.
This leads us to conclude that, under these conditions, a more elaborate technique should be used. For instance, non-frontal face views should be taken into account, as most of the views found in the recordings are non-frontal. Even then, person identification using face detection alone is probably not going to give good results under these conditions; identification should be performed by combining further features beyond those obtained from the face bounding boxes. The weighting factors for the multimodal fusion have been determined by using the 30-second training signals as a development set: the first 15 seconds have been used for training and the remaining 15 seconds for testing. The recognition results obtained in the evaluation for multimodal identification can also be seen in Table 3, where the fusion results are given for the different segment lengths. The fusion correct-identification rates are higher than both monomodal rates in all conditions.
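To make this fusion step concrete, the following Python sketch combines the per-identity audio and video scores by histogram equalization followed by a weighted sum. It is only a minimal sketch of the general scheme: the exact Matcher Weighting rule used to set the weights is not reproduced here, and the function names, the development-set reference scores and the example weight value are illustrative assumptions.

import numpy as np

def hist_equalize(scores, reference):
    # Map each score to its empirical quantile within a reference score
    # population, so both modalities live on a common [0, 1] scale.
    reference = np.sort(np.asarray(reference))
    ranks = np.searchsorted(reference, np.asarray(scores), side="right")
    return ranks / float(len(reference))

def fuse_scores(audio_scores, video_scores, audio_dev, video_dev, w_audio):
    # Weighted-sum fusion of per-identity scores after histogram equalization;
    # w_audio would be tuned on the development split (first/second 15 s halves).
    a = hist_equalize(audio_scores, audio_dev)
    v = hist_equalize(video_scores, video_dev)
    return w_audio * a + (1.0 - w_audio) * v

def identify(audio_scores, video_scores, audio_dev, video_dev, w_audio=0.7):
    # Hypothesised identity = index of the highest fused score.
    fused = fuse_scores(audio_scores, video_scores, audio_dev, video_dev, w_audio)
    return int(np.argmax(fused))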
6 Conclusions In this paper we have described two techniques for visual and acoustic person identification in smart-room environments. A Gaussian mixture model of the Frequency Filtering coefficients has been used to perform speaker recognition. For video, an approach based on joint identification over groups of images of the same individual, using PCA, has been followed. For the acoustic identification task, the results show that the presented approach is well adapted to the conditions of the experiments. For the visual identification task, the low quality of the images results in a low performance of the system; in this case, the results suggest that identification should be performed by combining further features beyond frontal face bounding boxes. To improve the obtained results, a multimodal score fusion technique has been used: Matcher Weighting with histogram-equalized scores is applied to the scores of the two monomodal tasks. The results show that this technique provides an improvement of the recognition rate in all train/test conditions.
Acknowledgements This work has been partially sponsored by the EC-funded project CHIL (IST-2002506909) and by the Spanish Government-funded project ACESCA (TIN2005-08852).
References 1. J. Casas, R. Stiefelhagen, et al, “Multi-camera/multi-microphone system design for continuous room monitoring,” CHIL-WP4-D4.1-V2.1-2004-07-08-CO, CHIL Consortium Deliverable D4.1, July 2004 2. Nadeu C., Macho D. and Hernando J., “Time and frequency filtering of filter-bank energies for robust HMM speech recognition”, Speech Communication, Vol. 34, pp. 93-114, 2001.
3. D. A. Reynolds, “Robust text-independent speaker identification using Gaussian mixture speaker models”, IEEE Transactions on Speech and Audio Processing, Vol. 3, Nº 1, pp. 72-83, January 1995 4. M. Kirby, L.Sirovich, “Application of the Karhunen-Loeve procedure for the characterization of human faces”, IEEE Trans. PAMI, vol 12, no. 1, pp. 103-108, Jan. 1990. 5. Bolle, R.M. Connell, J.H., Pankanti, S., Ratha, N.K., Senior, A.W., 2004. Guide to Biometrrics, Springer, New York. 6. R.Brunelli and D. Falavigna, “Person Identification Using Multiple Cues“, IEEE on PAMI Vol. 17, No. 10, pages 955-966, October 1995. 7. Davis S. B., Mermelstein P., “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences”, IEEE Transactions on Acoustic, Speech and Signal Processing, Vol. 28, pp. 357-366, 1980. 8. Furui S., “Speaker independent isolated word recognition using dynamic features of speech spectrum”, IEEE Transactions ASSP, No. 34, pp.52-59, 1986. 9. K. Messer, J. Matas, J.V. Kittler, J. Luettin and G. Maitre, XM2VTSDB: The extended M2VTS Database, AVBPA99, 1999. 10. Fox, N.A., Gross, R., Chazal, P., Cohn, J.F., Reilly, R.B., 2003. Person Identification Using Automatic Integration of Speech, Lip and Face Experts, Proc. of the ACM SIGMM 2003 Multimedia Biometrics Methods and Applications Workshop (WBMA’03), Berkeley, CA, pp. 25-32. 11. Indovina, M., Uludag, U., Snelik, R., Mink, A., Jain, A., 2003. Multimodal Biometric Authentication Methods: A COTS Approach, Proceedings of the MMUA 2003, Workshop on Multimodal User Authentication, Santa Barbara, CA, pp. 99-106. 12. Lucey, S., Chen, T., 2003. Improved Audio-visual Speaker Recognition via the Use of a Hybrid Combination Strategy, The 4th International Conference on Audio- and VideoBased Biometric Person Authentication, Guilford, U.K. 13. Wang, Yuan; Wangm Yunhong; Tan, T., 2004. Combining Fingerprint and Voiceprint Biometrics for Identity Verification: an Experimental Comparison, Proceedings of the ICBA, Hong Kong, China, pp. 663-670. 14. Djamel Mostefa et al. “CLEAR Evaluation Plan v1.1” http://www.clear-evaluation.org/downloads/chil-clear-v1.1-2006-02-21.pdf
Head Pose Estimation on Low Resolution Images Nicolas Gourier, Jérôme Maisonnasse, Daniela Hall, and James L. Crowley PRIMA, GRAVIR-IMAG INRIA Rhône-Alpes, 38349 St. Ismier, France
Abstract. This paper addresses the problem of estimating head pose over a wide range of angles from low-resolution images. Faces are detected using chrominance-based features. Grey-level normalized face imagettes serve as input to linear auto-associative memories. One memory is computed for each pose using the Widrow-Hoff learning rule. Head pose is classified with a winner-takes-all process. We compare results from our method with the abilities of human subjects to estimate head pose from the same data set. Our method achieves similar results in estimating orientation in the tilt (head nodding) angle, and higher precision in estimating orientation in the pan (side-to-side) angle.
1 Introduction Knowing the head pose of a person provides important cues concerning visual focus of attention [12]. Applications such as video surveillance, intelligent environments and human interaction modelling require head pose estimation from low-resolution face images. Unfortunately, most methods described in the research literature require high-resolution images, often using multiple views of the face. In this paper we address the problem of estimating head pose from low-resolution single images. The pose, or orientation, of a head is determined by 3 angles: slant, pan and tilt. The slant angle represents the person's head inclination with regard to the image plane, whereas the tilt and the pan angles represent the vertical and the horizontal inclination of the face. Our objective is to obtain a reliable estimation of head pose on unconstrained low-resolution images. We employ a fast, chrominance-based segmentation algorithm to isolate and normalize the face region in size and slant. We then project this region of the image into a small fixed-size imagette using a transformation that normalises size and slant orientation. Normalised face imagettes are used to train an auto-associative memory using the Widrow-Hoff correction rule. Classification of head pose is obtained by comparing normalised face imagettes with those reconstructed by the auto-associative memory; the head pose which obtains the highest score is selected. We compare the results of our method to human performance on head pose estimation using the same data set [13]. This process is described in section 3.
The comparison with human performance is described in section 4, and results and comparison are discussed in section 5.
2 Approaches to Head Pose Estimation Local or global approaches exist for head pose estimation. Local approaches usually estimate head pose from a set of facial features such as eyes, eyebrows and lips. The three-dimensional rotation of the head can be estimated from correspondences between such facial landmarks in the image and the face [1], [2], [3]. However, the detection of facial features tends to be sensitive to changes of illumination, person and pose variations. Robust techniques have been proposed to handle such variations [4], [5], but these require high-resolution images of the face, and tracking can fail when certain facial features are occluded. Some local-based systems, such as FaceLAB [21], have a precision smaller than one degree; this system uses stereo vision and requires high-resolution images of the face. Transformation-based approaches use geometric properties of facial landmarks to estimate the 3D rotation of the head [6], [7], [8]. However, such techniques remain sensitive to the precision of the detected regions and to the resolution of the face image. Such problems do not appear when using global approaches. Global approaches use the entire image of the face to estimate head pose. The principal advantage of global approaches is that only the face needs to be located; no facial landmarks or face model are required. Global approaches can accommodate very low resolution images of the face. Template matching is a popular method to estimate head pose. The best template is found via a nearest-neighbour algorithm, and the pose associated with this template is selected as the best pose. Template matching can be performed using Gabor wavelets and Principal Component Analysis (PCA) [9], or Support Vector Machines [10], but these approaches tend to be sensitive to alignment and are dependent on the identity of the person. Neural networks have also been used for head pose estimation [11]. Stiefelhagen [12] reports 10 degrees of precision on the Pointing'04 Head Pose Image Database [13]. However, some images were used both in training and testing, which biases the results. Furthermore, the number of cells in hidden layers is chosen arbitrarily, which prevents the creation of image class prototypes. In the method described in this paper, we adapt auto-associative memories based on the Widrow-Hoff learning rule. Auto-associative memories require very few parameters and contain no hidden layers [14]. Prototypes of image classes can be saved and reused. The Widrow-Hoff learning rule provides robustness to partial occlusions [22]. Each head pose serves to train an auto-associative network. Head pose is estimated by selecting the auto-associative network with the highest likelihood score.
3 Head Pose Estimation Using Linear Auto-associative Neural Networks 3.1 Linear Auto-associative Memories Linear auto-associative memories are a particular case of one-layer linear neural networks where input patterns are associated with each other. Auto-associative memories associate images with their respective class, even when the image has been degraded or partially occluded. With this approach, each cell corresponds to an input pattern. We describe a grey-level input image x' by a normalized vector x = x'/||x'||. A set of M images of the same class, each composed of N pixels, is stored in an N x M matrix X = (x1, x2, …, xM). The linear auto-associative memory is represented by a connection matrix W. The reconstructed image yk is obtained by computing the product between the source image x and the weighted connection matrix Wk: yk = Wk·x. The similarity between the source image and a class k of images is estimated as the cosine between x and yk: cos(x, yk) = x·ykT. A similarity of 1 corresponds to a perfect match. The connection matrix Wk0 is initialized with the standard Hebbian learning rule Wk0 = Xk·XkT. Reconstructed images with Hebbian learning are equal to the first eigenface of the image class. To improve the recognition abilities of the neural network, we learn Wk with the Widrow-Hoff rule. 3.2 The Widrow-Hoff Correction Rule The Widrow-Hoff correction rule is a local supervised learning rule. At each presentation of an image, each cell modifies its weights according to the others. Images Xk of the same class are presented iteratively with an adaptation step η until all images are classified correctly. As a result, the connection matrix Wk becomes spherically normalized. The Widrow-Hoff learning rule is described by:
Wk(t+1) = Wk(t) + η · (x − Wk(t) · x) · xT    (1)
In-class images are minimally deformed by multiplication with the connection matrix, while extra-class images are more strongly deformed. Direct comparison between the input and output normalized images gives a score between 0 and 1. This correction rule has shown good results on classic face analysis problems with images from a single camera, such as face recognition, sex classification and facial type classification [14]. The Widrow-Hoff correction rule increases the performance of PCA and provides robustness to partial occlusions [22]. All dimensions are used and few parameters are needed. There is no requirement to specify the structure or the number of cells in hidden layers. Furthermore, prototypes Wk of image classes can be saved, recovered and directly reused on other images, unlike non-linear memories or neural networks with hidden layers, where prototypes cannot be recovered.
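To make the training and classification procedure concrete, the following Python sketch implements one linear auto-associative memory per pose class, with Hebbian initialization, the Widrow-Hoff update of Equation (1), and the cosine-based winner-takes-all decision used later (Equation 2). It is a simplified illustration only: the fixed number of sweeps (instead of iterating until all training images are classified correctly) and the adaptation step value are our assumptions.

import numpy as np

def train_memory(X, eta=0.006, n_sweeps=50):
    # X: N x M matrix whose columns are L2-normalised grey-level imagettes
    # of one pose class.  Returns the connection matrix Wk.
    N, M = X.shape
    W = X @ X.T                                   # Hebbian initialisation Wk0 = Xk XkT
    for _ in range(n_sweeps):
        for i in range(M):
            x = X[:, i]
            W += eta * np.outer(x - W @ x, x)     # Widrow-Hoff update, Equation (1)
    return W

def cosine_score(W, x):
    # Similarity between an imagette x and a pose class: cos(x, Wk x).
    y = W @ x
    return float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12)

def classify_pose(memories, x):
    # Winner-takes-all over the per-pose memories.
    return int(np.argmax([cosine_score(W, x) for W in memories]))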
3.3 Head Pose Image Database The choice of a good database is crucial for learning. The Pointing’04 Head Pose Image database [13] consists of 15 sets of images of different people. Each set contains 2 series of 93 images of the same person at different poses. Subjects are 20 to 40 years old. Five people have facial hair and seven are wearing glasses.
Fig. 1. Sample of the Pointing’04 Image Database
Head pose is determined by pan and tilt angles. Each angle varies between -90 and +90 degrees, with a step of 15 degrees for pan, and steps of 30 and 15 degrees for tilt. Negative values of tilt correspond to bottom poses and positive values correspond to top poses. During the database acquisition, people were asked to look successively at 93 markers; each marker corresponds to a particular pose. A sample of the database can be seen in Figure 1. 3.4 Head Pose Prototypes The face region is normalized into a low-resolution grey-scale imagette of 23x30 pixels, as in [4]. Face normalization provides invariance to position, scale and slant [15]. This increases the reliability of the results, and the processing time becomes independent of the original face size. All further operations take place within this imagette. We consider each head pose as a class. A connection matrix Wk is computed for each pose k. The Pointing'04 database consists of 13 poses for pan and 9 poses for tilt. Two experiments have been performed using this approach. In the "separate" technique, we learn each angle on one axis while varying the angle of the other axis: each classifier corresponding to a pan angle is trained with varying tilt angle, and similarly each memory corresponding to a tilt angle is trained with varying pan angle. The "separate" experiment learns 22 classifiers: Pan = +90, …, Pan = -90, Tilt = +90, …, Tilt = -90. We use an adaptation step η of 0.008 for pan and 0.006 for tilt in this experiment; pan and tilt are trained separately. In the "grouped" experiment, pan and tilt angles are trained together and each classifier corresponds to a pan and a tilt angle. This experiment learns 93 classifiers: (Pan, Tilt) = (0, -90), …, (Pan, Tilt) = (+90, +75), (Pan, Tilt) = (0, +90). We use an adaptation step η of 0.006 for this experiment. To estimate head pose on a given face imagette, a simple winner-takes-all process is employed: we compute the cosine between the source image X and the reconstructed images Xk'. The pose whose memory obtains the best match is selected (2).
Pose = argmaxk (cos(X, Xk′))    (2)
4 Human Abilities for Head Pose Estimation To our knowledge, there is no data on human abilities for estimating head pose from images. Kersten [18] reports that front and profile poses are particularly well recognized by humans. These poses are used as key poses [19]. This observation holds not only for head poses, but also for other objects. However, these studies do not estimate intermediate poses. As a comparison to our artificial system, we measured the performance of a group of 72 human subjects on head pose estimation. In our experiment, we tested 36 men and 36 women, ranging in age from 15 to 60 years old. The experiment consisted of two parts: one for pan angle estimation, and the other for tilt angle estimation. Images from the Pointing'04 Head Pose Database were presented in random order to each subject for 7 seconds, with a different order for each subject. Subjects were asked to examine the image and to select a pose estimate from a fixed set of answers. The data set consists of 65 images for pan and 45 for tilt, which gives 5 images for each pose. The psycho-physical basis for human head pose estimation from static images is unknown: we do not know whether humans have a natural ability to estimate head pose from such images, or whether people must be trained for this task using annotated images. In order to avoid bias in our experiment, the subjects were divided into 2 groups: people in the first group could inspect the labelled training images of head poses as long as they wished before beginning the experiment, whereas people in the second group were not given an opportunity to see these images before the experiment. The first and second groups are referred to as "Calibrated" and "Non-Calibrated" subjects, respectively. Creating these two groups allows us to determine whether training significantly increases human performance on head pose estimation.
5 Results and Discussion In this section, we compare the results of the two variations of our method (separate and grouped) using the Pointing'04 Head Pose Image database. There are two ways of splitting the data for training and testing. By using the first set of the database as training data and testing on the second set, we measure the performance of our system on known users. By using the Jack-Knife method, also known as the leave-one-out algorithm, we measure the performance on unknown users. To get an idea of the efficiency of our system in human-computer interaction applications, we compare the performance of our system with human performance. 5.1 Evaluation Measures To evaluate the performance of our system, we must define evaluation criteria. The average absolute error for pan and tilt is the main evaluation metric; it is computed by averaging the difference between the expected pose and the estimated pose over all images. We also compute the average absolute error for pan and tilt per pose. The Pointing'04 database is well suited for such a measure, because it provides the same amount of data for each pose. The precise classification rate and the correct classification rate within 15 degrees error are also computed. We compare the results of our system on known and unknown users. Results are presented in Table 1.
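As a small illustration, these measures can be computed per axis as in the Python sketch below; the function names and the dictionary layout are ours, not part of the original evaluation code.

import numpy as np

def evaluate(pred, truth):
    # pred, truth: arrays of angles in degrees for one axis (pan or tilt).
    err = np.abs(np.asarray(pred, dtype=float) - np.asarray(truth, dtype=float))
    return {
        "mean_abs_error": float(err.mean()),         # average absolute error
        "exact_rate": float((err == 0).mean()),      # precise classification rate
        "within_15_rate": float((err <= 15).mean())  # correct within 15 degrees
    }

def mean_error_per_pose(pred, truth):
    # Average absolute error computed separately for every ground-truth pose.
    pred, truth = np.asarray(pred, dtype=float), np.asarray(truth, dtype=float)
    return {float(p): float(np.abs(pred[truth == p] - p).mean())
            for p in np.unique(truth)}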
5.2 Performances Our system works well with known users on both angles. With the separate technique, we achieve a mean error of 7.3 degrees in pan and 12.1 degrees in tilt. The grouped learning provides a mean error of 8.5 degrees in pan and 10.1 degrees in tilt. The pan angle can be correctly estimated with a precision of 15 degrees in more than 90% of the cases with both learning techniques. Results obtained with the Jack-Knife algorithm show that our system also generalizes well to previously unseen subjects and is robust to identity. With the separate technique, we see that the pan angle is well recognized, with an average error of 10.3 degrees; the average error decreases to 10.1 degrees using the grouped learning. The average tilt error is 15.9 degrees using the separate technique, and 16.8 degrees using the grouped technique. The average error per pose is shown in Figure 2. Concerning the pan angle, the average absolute error per pose is relatively stable with both techniques. Both techniques accommodate intermediate tilt angles. We achieve a precise classification rate of 50.4% for the pan angle and 21% for the tilt angle with the separate technique. Using the grouped technique provides a 50% classification rate for the pan angle and 45% for the tilt angle. The pan angle can be correctly estimated with a precision of 15 degrees in 88% of the cases with the second technique. These results tend to show that using the grouped technique does not significantly improve the results. Examples can be seen in Figure 4. Faces are not aligned in the Pointing'04 database, and normalizing face images produces small variations in alignment; the results show that our system can handle these alignment problems. Computing a score for each memory allows us to discriminate face and non-face images, so head detection and pose estimation are done in a single process. The system runs at 15 images/sec using the separate technique, and 3 images/sec with the grouped technique. As humans estimated the angles separately, we will use the separate learning for comparison with human performance.
Table 1. Performance evaluation on known and unknown users. AM refers to auto-associative memories.
Known Users              AM separate   AM grouped
Pan Average Error            7.3°          8.5°
Tilt Average Error          12.1°         10.1°
Pan Class. with 0°          61.3%         60.8%
Tilt Class. with 0°         53.8%         61.7%
Pan Class. with 15°         93.3%         90.1%

Unknown Users            AM separate   AM grouped
Pan Average Error           10.3°         10.1°
Tilt Average Error          15.9°         16.8°
Pan Class. with 0°          50.4%         50%
Tilt Class. with 0°         43.9%         44.5%
Pan Class. with 15°         88.1%         88.7%
Fig. 2. Average error per pose for known and unknown users
5.3 Comparison to Human Performances We computed the same evaluation measures for humans. Results for calibrated (C) and non-calibrated (NC) people are shown in Table 2. The global human average error for head pose estimation is 11.9 degrees in pan and 11 degrees in tilt. Creating two groups allows us to compare the performance of our system on unknown users to the best human performance. We apply a Student's t-test to compare the two populations.
Table 2. Human/Machine performance evaluation. C and NC stand for Calibrated and Non-Calibrated people.

                         C        NC      AM separate (unknown users)
Pan Average Error      11.8°    11.9°            10.3°
Tilt Average Error      9.4°    12.6°            15.9°
Pan Class. with 0°     40.7%    42.4%            50.4%
Tilt Class. with 0°      59%      48%            43.9%
Fig. 3. Human / System performance per pose
Calibrated people do not perform significantly better in pan. However, the difference is significant for the tilt angle. These results show that pan angle estimation appears to be natural for humans, whereas tilt angle estimation is not. This is due to the fact that people twist their head left and right more often than up and down during social interactions. In situations where people talk to each other, the pan angle provides good cues on visual focus of attention [12], [19], while head pose changes in tilt carry little meaning. This is even more relevant when people are seated, because their heads are roughly at the same height. People are thus more used to considering pose changes in pan. Seeing annotated training images does not improve pan angle estimation much, but it significantly improves tilt angle estimation. The best human performance is obtained by calibrated people. The average error per pose for human subjects can be seen in Figure 3. For the pan angle, we found that humans perform well for front and profile angles, but not for intermediate angles. The average error per pose in pan can be modelled by a Gaussian function centered at 45 degrees. The minimum error is found at 0 degrees, which corresponds to the front pose. Furthermore, during our experiment we observed that most people did not use intermediate angles such as 30, 45 and 60 degrees. These results suggest that the human brain uses front and profile as key poses, as suggested in [17].
Concerning the tilt angle, humans perform better for top angles than for bottom angles. The minimum error is found at +90 degrees, whereas the maximum error is at -90 degrees. This can be due to the fact that, when a face is nodding downward, hair dominates a large surface of the apparent face, providing more information about the side-to-side angle. With an average error of 10.3 degrees and a precise classification rate of 50.4%, our method performs significantly better than humans (11.9 degrees) at estimating the pan angle. The standard deviation of the average error per pose is low for the system and high for humans. The system achieves roughly the same precision for front and profile views, and higher precision for intermediate poses. With an average error of 11 degrees, humans perform better for the tilt angle. Our method performs well for top poses. This is due to the fact that hair becomes more visible in the image and the face appearance changes more between people when looking down, whereas such changes are less visible for up poses. Face region normalization also introduces a problem: the height of the neck varies between people, which introduces high variations in the face imagettes and can disrupt tilt angle estimation.
Fig. 4. Pan angle estimation on example images
6 Conclusion We have proposed a new method to estimate head pose on unconstrained low-resolution images. The face image is normalized in scale and slant into an imagette by a robust face detector. Face imagettes containing the same head pose are learned with the Widrow-Hoff correction rule to obtain a linear auto-associative memory. To estimate head pose, we compare source and reconstructed images using their cosine, and a simple winner-takes-all process is applied to select the head pose whose memory gives the best match. We achieved a precision of 10.3 degrees in pan and 15.9 degrees in tilt on unknown subjects of the Pointing'04 Head Pose Image database. Learning pan and tilt together does not provide significantly better results. Our method provides good results on very low resolution face images and can handle wide head movements, which is particularly well adapted to wide-angle or panoramic camera setups. The system generalizes well to unknown users, is robust to alignment and runs at 15 frames/sec.
We measured human performance on head pose estimation using the same data set. Our system performs significantly better than humans in pan, especially for intermediate angles, while humans perform better in tilt. The results of our system may be improved by fitting an ellipse to delimit the face more precisely. Our head pose estimation system can be adapted to video sequences for situations such as human interaction modelling, video surveillance and intelligent environments. Knowing a coarse estimate of the current head pose, the temporal context can help to limit the head pose search to neighbouring poses only. The use of head pose prototypes significantly reduces the computational time on video sequences.
References [1] A.H. Gee, R. Cipolla, “Non-intrusive gaze tracking for human-computer interaction,” Mechatronics and Machine Vision in Practise, pp. 112-117, 1994. [2] R. Stiefelhagen, J. Yang, A. Waibel, "Tracking Eyes and Monitoring Eye Gaze,” Workshop on Perceptual User Interfaces, pp. 98-100, Banff, Canada, October 1997. [3] A. Azarbayejani, T. Starner, B. Horowitz, A. Pentland, "Visually Controlled Graphics," IEEE Transactions on PAMI 15(6) 1993, pp. 602-605. [4] N. Gourier, D. Hall, J. Crowley, "Estimating Face Orientation using Robust Detection of Salient Facial Features," Pointing 2004, ICPR, Visual Observation of Deictic Gestures, Cambridge, UK. [5] J. Wu, J.M. Pedersen, D. Putthividhya, D. Norgaard, M.M. Trivedi, "A Two-Level Pose Estimation Framework Using Majority Voting of Gabor Wavelets and Bunch Graph Analysis," Pointing 2004, ICPR, Visual Observation of Deictic Gestures, Cambridge, UK. [6] Q. Chen, H. Wu, T. Fukumoto, M. Yachida, "3D Head Pose Estimation without Feature Tracking," AFGR, April 16/1998 Nara, Japan. pp. 88-93. [7] R. Brunelli, "Estimation of Pose and Illuminant Direction for Face Processing," Proceedings of IVC(15), No. 10, October 1997, pp. 741-748. [8] P. Yao, G. Evans, A. Calway, "Using Affine Correspondance to Estimate 3-D Facial Pose," 8th ICIP 2001, Thessaloniki, Greece, pp. 919-922. [9] S. McKenna, S. Gong, "Real-time face pose estimation," International Journal on Real Time Imaging, Special Issue on Real-time Visual Monitoring and Inspection, volume 4: pp.333-347, 1998. [10] J. Ng, S. Gong, "Multi-view Face Detection and Pose Estimation using a Composite Support Vector Machine across the View Sphere," International Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, p. 1421, Corfu, Greece, September 1999. [11] B. Schiele, A. Waibel, "Gaze tracking based on face-color," Workshop on Automatic Face and Gesture Recognition, pages 344-349, Zurich, June 26-28, 1995. [12] R. Stiefelhagen, "Estimating Head Pose with Neural Networks - Results on the Pointing04 ICPR Workshop Evaluation Data," Pointing 2004, ICPR, Visual Observation of Deictic Gestures, Cambridge, UK. [13] N. Gourier, J. Letessier, "The Pointing 04 Data Sets," Pointing 2004, ICPR, Visual Observation of Deictic Gestures, Cambridge, UK. [14] D. Valentin, H. Abdi, A. O'Toole, "Categorization and identification of human face images by neural networks: A review of linear auto-associator and principal component approaches," Journal of Biological Systems 2, pp. 413-429, 1994.
[15] K. Schwerdt, J. Crowley, "Robust face Tracking using Color," International Conference on Automatic face and Gesture Recognition pp. 90-95, 2000. [16] G.J. Klinker, S.A. Shafer, T. Kanade, "A Physical Approach to Color Image Understanding," International Journal on Computer Vision 1990. [17] H. Abdi, D. Valentin, "Modeles Neuronaux, Connectionistes et Numeriques de la Reconnaissance des Visages," Psychologie Francaise, 39(4), pp. 357-392, 1994. [18] D. Kersten, N.F. Troje, H.H. Bülthoff, "Phenomenal competition for poses of the human head," Perception, 25 (1996), pp. 367-368. [19] B. Steinzor, “The spatial factor in face to face discussions,” Journal of Abnormal and Social Psychology 1950 (45), pp. 552-555. [20] H.H. Bülthoff, S.Y. Edelmann, M.J. Tarr, “How are three-dimensional objects represented in the brain?,” Cerebral Cortex 1995 (5) 3, pp. 247-260. [21] Seeing Machines Company. “FaceLAB4”, http://www.seeingmachines.com [22] H. Abdi, D. Valentin, "Modeles Neuronaux, Connectionistes et Numeriques de la Reconnaissance des Visages," Psychologie Francaise, 39(4), pp. 357-392, 1994.
Evaluation of Head Pose Estimation for Studio Data Jilin Tu, Yun Fu, Yuxiao Hu, and Thomas Huang Beckman Institute, University of Illinois at Urbana-Champaign, 405 N Mathews Ave, Urbana, IL 61801,USA {jilintu, yunfu2, hu3, huang}@ifp.uiuc.edu
Abstract. This paper introduces our head pose estimation system, which localizes the nose-tips of faces and estimates head poses in studio-quality pictures. After the nose-tips in the training data are manually labeled, the appearance variation caused by head pose changes is characterized by a tensor model. Given images with unknown head pose and nose-tip location, the nose-tip of the face is localized in a coarse-to-fine fashion, and the head pose is estimated simultaneously by the head pose tensor model. The image patches at the localized nose-tips are then cropped and sent to two other head pose estimators based on LEA and PCA techniques. We evaluated our system on the Pointing'04 head pose image database. With the nose-tip location known, our head pose estimators achieve 94 ∼ 96% head pose classification accuracy (within ±15°). With the nose-tip unknown, we achieve 85% nose-tip localization accuracy (within 3 pixels from the ground truth), and 81 ∼ 84% head pose classification accuracy (within ±15°).
1 Introduction Locating human faces and determining head pose from video or images is one of the most important components of human-computer interaction systems, as knowing the location of a human face and its orientation allows the computer to determine human identity and focus of attention in the scene. At the Pointing'04 workshop, a static head pose image database was made public, and a number of research groups reported the performance of their head pose estimation systems on this set of data. Wu [1] proposed a two-level approach for estimating the head pose, but with the nose-tip manually localized in order to remove errors from misalignment. At the lower level, the image is down-sampled and Gabor wavelet features are computed; the head poses are then classified by majority voting on the classification results based on KDA and PCA subspace models. At this level, they achieved 90% accuracy for head pose errors of less than 15 degrees. Starting from the lower-level estimate, the head pose is further refined within a window of 3 by 3 neighboring poses (within 15 degrees) by Bunch Graph Analysis. In [2], the face location is obtained by skin color segmentation and edge detection, and the pose is estimated by an ANN classifier. They randomly took 80% of the data for training, 10% for cross-validation and 10% for testing. Their system achieved an average pan error of 9.5 degrees, an average tilt error of 9.7 degrees, 52% pan accuracy and 66% tilt accuracy. In [3], the face area is obtained by color segmentation and described by a weighted sum
of locally normalized Gaussian receptive fields. The eyes are further localized based on the learnt robust features, and the head pose is inferred from the relative eye locations under the symmetry assumption of the human head. Because the relative eye locations only give information about the pan angle, this system can only estimate pan angles. Applying their system to the Pointing'04 image database, they achieved a mean pose error of less than 15 degrees when the actual head pose is less than 45 degrees, but the mean error increases to 90 degrees when the actual head pose is greater than 45 degrees, due to occlusion of one of the eyes. In this paper, we introduce our system for automatically localizing the nose-tip of a human face in studio-quality images and estimating the head pose using tensor techniques [6]. In the training images, we locate the nose-tips manually and crop image patches of size 18x18 after Laplacian decomposition. As there are 15 subjects, each performing 13 pan poses and 7 tilt poses, together with two extreme cases with tilt angles of 90 and -90 degrees, we built a tensor model of size 324x15x13x7 plus 2 PCA models for the two extreme tilt angle poses (we will, however, only discuss the tensor model and ignore the two PCA models hereafter, as we consider them a technical workaround for special cases). Given an image patch from the testing set, it is projected into the tensor subspace, and the head pose can be estimated from the tensor coefficients obtained by HOSVD (High-Order SVD). The advantage of this approach is its potential computational efficiency. The typical way to estimate head pose is to build one model for each head pose and to obtain the head pose by nearest neighbor search; our approach avoids this brute-force search and obtains the head pose in a one-shot operation, and is thus potentially more efficient. With this tensor model, we can do a coarse-to-fine search in the image and locate the nose-tip and estimate the head pose simultaneously. We further crop the image patches at the automatically localized nose-tips and send them to a PCA pose classifier [9] and an LEA pose classifier [8]. The results are compared in the experiments section. In section 2, we give a brief description of the data. In section 3, we describe the framework of our system. In section 4, we introduce the pose estimators we developed based on tensor techniques. Section 5 describes how we do coarse-to-fine nose-tip localization. Section 6 briefly describes the pose estimators based on PCA and LEA techniques. Section 7 summarizes the performance of our system on the Pointing'04 image data set. Section 8 concludes with a discussion of some future work.
2 Pointing04 Head Pose Image Database The Pointing04 head pose image database contains head pose images for 15 subjects. Pictures of 93 head poses for each subject were taken in two sessions with different illumination and scale. The pictures taken in the first session are used as training data, and the pictures from the second session are used as testing data. The 93 head poses include the combinations of 13 pan poses and 7 tilt poses, together with two extreme cases with pan angle 0 degrees and tilt angles of 90 and -90 degrees, respectively. Figure 1-(a) shows the pictures of one subject in the 93 head poses, and Figure 1-(b) shows the 15 subjects with a head pan angle of 45 degrees and a tilt angle of 0 degrees.
Fig. 1. Two views of the Pointing '04 head pose image database: (a) the person view; (b) the pose view
Figure 1-(b) also indicates that there exist inconsistencies between the appearances and the head poses, as the appearance of some subjects looks more like a pose with a 90-degree pan angle.
3 The System Framework We developed a system to automatically localize the nose-tip and estimate the head pose using tensor techniques [6]. In the training images, we mark the nose-tips manually and crop image patches of size 18x18 after Laplacian decomposition. As there are 15 subjects, each performing 13 pan poses and 7 tilt poses, together with two extreme cases with tilt angles of 90 and -90 degrees, we built a tensor model of size 324x15x13x7 plus 2 PCA models for the two extreme tilt angle poses. Given an image patch from the testing set, it is projected into the tensor subspace, and the head pose can be estimated from the tensor coefficients obtained by HOSVD. The advantage of this approach is its potential computational efficiency, as it avoids a brute-force search. When a new image is provided, we first do skin color segmentation to locate the face area; then a Laplacian pyramid is built for the input image, so that appearance variations caused by illumination can be reduced. We then search for the nose-tip in a coarse-to-fine manner using our tensor model and estimate the head pose at the most probable nose-tip location. We further crop the image patches at the automatically localized nose-tips and send them to the PCA [9] and LEA [8] pose classifiers. The framework is shown in Figure 2.
4 Head Pose Estimation Based on Tensor Model 4.1 Basics of Tensor Model For the training data, we cropped image patches of size 18 by 18 pixels at the nose-tip after Laplacian decomposition. As there are 15 subjects, if we ignore the two extreme head poses with tilt angles of 90 and -90 degrees, respectively (which are taken care of by PCA models as
special cases), the remaining head poses are combinations of 13 pan angles and 7 tilt angles. We can therefore arrange the cropped data into a tensor of size 324 × 15 × 13 × 7, as shown in Figure 3-(a).
Fig. 2. The framework of our head pose estimators
Fig. 3. Tensor analysis: (a) the data arranged into a tensor; (b) the tensor faces
As the tensor data illustrates a multi-linear structure of the appearances resulting from the confluence of person identity, pan angle and tilt angle, the image tensor D can be decomposed into these factor coefficients by N-mode SVD, the so-called High-Order SVD (HOSVD) [5]:
D = Z ×1 Upixel ×2 Uperson ×3 Upan ×4 Utilt    (1)
where Z is known as the core tensor, analogous to the singular value matrix in SVD, and Upixel, Uperson, Upan, Utilt, analogous to the eigen-matrices in SVD, are mode matrices that span the spaces of pixel, person identity, pan angle and tilt angle variations, respectively. While the core tensor Z governs the interactions among the factors, the tensor faces can be obtained as B = Z ×1 Upixel, as shown in Figure 3-(b).
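A minimal NumPy sketch of this decomposition is given below. It computes each mode matrix from the SVD of the corresponding mode-n unfolding and recovers the core tensor Z; truncation of the mode matrices and the two extreme-pose PCA models are omitted, and the function names and unfolding convention are our own choices rather than the exact implementation used here.

import numpy as np

def unfold(T, mode):
    # Mode-n unfolding: the chosen mode becomes the row index.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, M, mode):
    # n-mode product T x_n M: multiply every mode-n fibre of T by M.
    out = M @ unfold(T, mode)
    new_shape = (M.shape[0],) + tuple(np.delete(T.shape, mode))
    return np.moveaxis(out.reshape(new_shape), 0, mode)

def hosvd(D):
    # Plain (untruncated) HOSVD of the data tensor D (pixel x person x pan x tilt).
    # Returns the core tensor Z and the mode matrices of Equation (1).
    U = [np.linalg.svd(unfold(D, n), full_matrices=False)[0] for n in range(D.ndim)]
    Z = D
    for n, Un in enumerate(U):
        Z = mode_product(Z, Un.T, n)   # Z = D x_1 U1^T x_2 U2^T ...
    return Z, U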
In [6], a multi-linear method is proposed for simultaneously inferring multiple factor coefficient vectors based on the tensor face B. Given a new input image d, from Equation 1 we have
d^T = B ×2 cperson^T ×3 cpan^T ×4 ctilt^T    (2)
where the coefficient vector cperson encodes the person identity, and cpan and ctilt encode the pan and tilt angles. Let P(pixel) = Bpixel^-T, where Bpixel is the pixel-mode flattening of the tensor B; we can then obtain the projection of d into the subspace spanned by P(pixel) as
R = P ×1 d^T = cperson ◦ cpan ◦ ctilt    (3)
where P is the tensor model unflattened from P(pixel). As R is a tensor of rank (1,1,1), we can retrieve cperson, cpan and ctilt simultaneously by decomposing R using a rank-1 N-mode SVD algorithm. In [5][6], face recognition is further carried out by nearest neighbor search in the Uperson subspace. 4.2 Finding Head Poses by HOSVD In our case, however, we wish to avoid estimating the head pose via nearest neighbor search in the subspace spanned by Upan ◦ Utilt, as that could become time consuming. Following [6], we propose to build a tensor subspace solely for the poses:
Bpose = Z ×1 Upixel ×3 Upan ×4 Utilt    (4)
As shown in Figure 4-(a), Bpose is actually a multi-linear representation of the PCA subspaces for the head poses. The leftmost layer shows the mean picture of the PCA subspace at each pose, and the column at each pose toward the right shows the eigenfaces that characterize the appearance variations caused by the 15 different subjects. Supposing that the appearance d of a new subject at a certain head pose can be linearly interpolated from the appearances of the subjects at the same head pose in the database, we have
d^T = D ×2 pperson^T ×3 ppan^T ×4 ptilt^T    (5)
where ppan and ptilt are boolean vectors whose elements for the corresponding head pose are set to 1, and pperson is the fusion vector that combines the subject appearances into the new subject's head appearance at the specified head pose. From Equations 1 and 4, we further obtain
d^T = Bpose ×2 Uperson ×2 pperson^T ×3 ppan^T ×4 ptilt^T = Bpose ×2 cperson^T ×3 cpan^T ×4 ctilt^T    (6)
Therefore, HOSVD decomposition with the kernel subspace Bpose, as described by Equation 2, generates
cperson = Uperson pperson    (7)
cpan = ppan    (8)
ctilt = ptilt    (9)
Fig. 4. Pose estimation by the pose tensor: (a) the pose tensor; (b) pose estimation by HOSVD
with cpan and ctilt being boolean vectors in the ideal situation. The intuition is illustrated in Figure 4-(b). In practice, after cpan and ctilt are obtained by N-mode SVD, the pan and tilt angles are obtained as follows:
p̂ = argmax{cpan},  t̂ = argmax{ctilt}    (10)
We can further reconstruct d̂ from the tensor model as d̂^T = Bpose ×2 cperson^T ×3 cpan^T ×4 ctilt^T; the distance from the input image d to the pose tensor model is then computed as
D(d, B) = 1 − corr(d, d̂)    (11)
We utilize this measure to localize the nose-tip in the test image.
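The following sketch shows, under our own simplifying assumptions, how a test patch can be scored against the pose tensor model: the projection tensor R is obtained with a precomputed pseudo-inverse of the pixel-mode unfolding of Bpose, its rank-(1,1,1) factors are approximated by the leading singular vectors of its unfoldings, the pan and tilt classes follow Equation (10), and the model distance follows Equation (11). The variable names and the rank-1 approximation are illustrative, not the exact algorithm of our system.

import numpy as np

def project(P_pixel, d):
    # P_pixel: pseudo-inverse of the pixel-mode unfolding of Bpose, reshaped to
    # (person, pan, tilt, pixel).  Returns the coefficient tensor R (Equation 3).
    return np.tensordot(P_pixel, d, axes=([3], [0]))

def rank1_factors(R):
    # Leading singular vector of each unfolding, standing in for the
    # rank-(1,1,1) N-mode SVD used in the text.
    return [np.linalg.svd(np.moveaxis(R, n, 0).reshape(R.shape[n], -1),
                          full_matrices=False)[0][:, 0] for n in range(R.ndim)]

def estimate_pose(P_pixel, d):
    # Pan/tilt classes from Equation (10); abs() because singular vectors
    # are only defined up to sign.
    c_person, c_pan, c_tilt = rank1_factors(project(P_pixel, d))
    return int(np.argmax(np.abs(c_pan))), int(np.argmax(np.abs(c_tilt)))

def tensor_distance(d, d_hat):
    # D(d, B) = 1 - corr(d, d_hat), Equation (11).
    return 1.0 - float(np.corrcoef(d, d_hat)[0, 1])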
5 Nose-Tip Localization When a test image is provided, we first perform skin color segmentation to locate the face area. After a Laplacian pyramid is built to reduce the variations caused by illumination, the nose-tip is localized in a coarse-to-fine fashion. 5.1 Skin Color Segmentation We implemented the skin color model proposed in [7]. It is a skin color model in RGB space, trained from a dataset of nearly 1 billion labeled pixels. The paper reports a skin color detection rate of 80% with 8.5% false positives on a testing set collected from the internet. We found that it applies well to the Pointing '04 dataset. After the skin color likelihood ratio image is computed based on the skin color model, we first smooth the likelihood ratio image with a low-pass filter to remove noise, and then we apply Otsu's automatic threshold selection algorithm [10] to decide
a threshold for the skin color segmentation. This automatic threshold selection procedure ensures the detection of the face area in the image even when the color of the skin area is quite far from the skin color model, as long as it is closer to the model than the background colors. As a last step, we carry out morphological open/close operations to eliminate outliers, and the holes in the segmented area are filled. 5.2 Laplacian Pyramid To build a Laplacian pyramid from an image, the image, denoted g0, is first repeatedly smoothed and down-sampled in order to obtain a Gaussian pyramid, denoted {g1, g2, ...}. The Laplacian pyramid levels {L0, L1, ...} are computed as the differences between adjacent Gaussian levels, i.e. Li = gi − UPSAMPLE(gi+1). By experiment, we empirically chose to do pose estimation at Laplacian pyramid levels 2 and 3, with image patch sizes 25 × 25 and 18 × 18. 5.3 Coarse-to-Fine Searching We first scan pixel by pixel through the face area segmented by the skin color model at Laplacian pyramid level 3 using our tensor model, and obtain the 10 nose-tip locations where the cropped image patches have locally minimal distance from the head pose tensor model according to Equation 11. We then scan through the neighborhoods of these candidate nose-tip locations at Laplacian pyramid level 2. Denoting by li^c the candidate nose-tip locations, and by (pi^c, ti^c) the head pose estimated with distance measure Di^c at location i in pyramid level c, the nose-tip likelihood measure is proposed as follows:
(l̂, p̂, t̂) = argmin_i {C(|pi^3 − pi^2|, |ti^3 − ti^2|) + αDi^3 + βDi^2 + γ‖li^3 − li^2‖}    (12)
The intuition is that, if the detected nose-tip is a true positive, the head poses estimated at the different Laplacian pyramid levels should be consistent, the distance between the nose-tip locations estimated at the different resolutions should be small, and the distance of the image patches cropped at the nose-tip location at the different pyramid levels from the head pose tensor model should also be small.
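A compact Python sketch of the Laplacian pyramid of Section 5.2 and of a candidate score in the spirit of Equation (12) is shown below. The smoothing and resampling details, the treatment of C(·,·) as a weighted sum of the absolute pan and tilt differences, and the weight values are all our own assumptions, since the exact choices are not given in the text.

import numpy as np
from scipy import ndimage

def gaussian_pyramid(img, levels):
    g = [np.asarray(img, dtype=float)]
    for _ in range(levels):
        g.append(ndimage.zoom(ndimage.gaussian_filter(g[-1], 1.0), 0.5))
    return g

def laplacian_pyramid(img, levels):
    # L_i = g_i - UPSAMPLE(g_{i+1}), as in Section 5.2.
    g = gaussian_pyramid(img, levels)
    return [g[i] - ndimage.zoom(g[i + 1],
                                np.array(g[i].shape) / np.array(g[i + 1].shape))
            for i in range(levels)]

def candidate_score(p3, t3, D3, l3, p2, t2, D2, l2,
                    alpha=1.0, beta=1.0, gamma=0.1, c_pose=1.0):
    # Combined nose-tip likelihood of one candidate (cf. Equation 12):
    # cross-level pose consistency + model distances + location drift.
    pose_term = c_pose * (abs(p3 - p2) + abs(t3 - t2))
    drift = np.linalg.norm(np.asarray(l3, dtype=float) - np.asarray(l2, dtype=float))
    return pose_term + alpha * D3 + beta * D2 + gamma * drift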
6 Head Pose Estimation Based on PCA and LEA Techniques After we locate the nose-tips in the testing images, we crop the image patches at the nose-tips and send the data to two other pose estimators based on PCA (Turk [9]) and LEA (Fu [8]) techniques. The PCA pose classifier trains a single PCA subspace that captures the appearance variance of all the poses; the face images of all the subjects in the different poses are projected into this subspace, and the pose of a new subject is estimated by nearest neighbor search in the PCA subspace. The intuition is illustrated by Fig. 5. Locally Embedded Analysis (LEA) trains a linear mapping for data lying on a manifold satisfying the Locally Linear Embedding (LLE) constraint: the local geometrical relationship (represented by a weight matrix for the K-nearest-neighbor linear combination approximation) should be maximally preserved. The intuition is illustrated by
Fig. 5. The intuition of PCA pose classifier
Fig. 6. The intuition of LEA pose classifier
Fig. 6. The head pose is then determined by nearest neighbor search after the training data and the query data are projected into this LEA subspace.
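As an illustration of the PCA pose classifier, the sketch below trains a single PCA subspace over patches from all poses and classifies a query patch by nearest-neighbour search among the projected training patches, in the spirit of the eigenface approach [9]. The class structure, the number of components and the Euclidean distance are our own choices for the sketch.

import numpy as np

class PCAPoseClassifier:
    # Nearest-neighbour pose classification in one PCA subspace trained
    # over all subjects and poses.
    def fit(self, X, pose_labels, n_components=50):
        # X: rows are vectorised training patches cropped at the nose-tip.
        self.mean = X.mean(axis=0)
        Xc = X - self.mean
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        self.components = Vt[:n_components]       # PCA basis (eigen-patches)
        self.train_proj = Xc @ self.components.T  # projections of training data
        self.labels = np.asarray(pose_labels)
        return self

    def predict(self, x):
        p = (x - self.mean) @ self.components.T
        nearest = int(np.argmin(np.linalg.norm(self.train_proj - p, axis=1)))
        return self.labels[nearest]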
7 Experiment Results 7.1 Nose-Tip Localization Accuracy We trained our system with the training set of the Pointing'04 dataset after the nose-tips had been manually marked. We also marked the nose-tips in the testing set as ground truth. The nose-tip localization accuracy against the ground truth for the testing dataset (1395 images) is shown in Figure 7. The distance between the estimated nose-tip and the ground truth is less than 3 pixels in 85% of the pictures.
Fig. 7. The histogram of the distance between the estimated nose-tips and the ground truth
7.2 Pose Estimation Evaluation For the pose estimation evaluation, we carried out two experiments. In the first experiment, we provided as input the image patches cropped at the manually marked nose-tips, so that we can evaluate how the head pose estimators perform when there is no error caused by misalignment. Table 1 summarizes the results; the LEA pose estimator seems to have the better performance when the data is noise free. In the second experiment, the nose-tips are automatically localized and the head pose estimators are provided with image patches cropped at the automatically localized nose-tips. Table 2 summarizes the results. Here the PCA pose classifier seems to achieve the better performance. The reason is that the PCA pose classifier was trained with image patches cropped not only at the nose-tip but also at one-pixel shifts from the nose-tip; therefore the PCA pose classifier is more robust to misalignment in the automatically cropped data.
Table 1. Evaluation of head pose estimation with known nose-tip locations

Metric                             Tensor    PCA      LEA
Mean (Pan err)                      6.18°    5.86°    5.70°
Mean (Tilt err)                     8.60°    5.49°    5.63°
Pan Classification                 72.40%   69.61%   71.18%
Tilt Classification                75.70%   73.62%   75.27%
Tilt Classification (15 degree)    94.48%   96.56%   96.77%
Table 2. Evaluation of head pose estimation with automatically localized nose-tip locations

Metric                             Tensor    PCA      LEA
Mean (Pan err)                     12.90°   14.11°   15.88°
Mean (Tilt err)                    17.97°   14.98°   17.44°
Pan Classification                 49.25%   55.20%   45.16%
Tilt Classification                54.84%   57.99%   50.61%
Tilt Classification (15 degree)    84.23%   84.30%   81.51%
8 Conclusion In this paper, we introduced our system for automatic nose-tip localization in images and head pose estimation based on tensor techniques. Instead of doing nearest neighbor searching, we proposed to estimate the head poses by N-mode SVD. We also implemented head pose estimators based on PCA and LEA techniques. The evaluation results show that our nose-tip localization algorithm achieves 85% accuracy (within 3 pixels of the ground truth), and that all of our pose estimators achieve good pose estimation accuracy given good nose-tip localization. We will further improve our tensor-based head pose estimator using ICA analysis.
References [1] J.W. Wu, J.M. Pedersen, D. Putthividhya, D. Norgaard, M.M. Trivedi: A Two-level Pose Estimation Framework Using Majority Voting of Gabor Wavelets and Bunch Graph Analysis. Pointing 2004. [2] Rainer Stiefelhagen: Estimating Head Pose with Neural Networks-Results on the Pointing04 ICPR Workshop Evaluation Data, Pointing 2004. [3] N. Gourier, D. Hall, J. L. Crowley: Estimating Face orientation from Robust Detection of Salient Facial Structures, Pointing 2004. [4] P. J. Burt, E. H. Adelson: The Laplacian pyramid as a compact image code. IEEE Trans. Commun., 31(4),(1983), 532–540 [5] M. A. O. Vasilescu, D. Terzopoulos: Multilinear Subspace Analysis for Image Ensembles. Proc. Computer Vision and Pattern Recognition Conf. (2003), 2, (2003), 93–99 [6] M. A. O. Vasilescu, D. Terzopoulos: Multilinear Independent Components Analysis. Proc. Computer Vision and Pattern Recognition Conf. (2005) [7] M. J. Jones, J. M. Rehg: Statistical Color Models with Application to Skin Detection. Int. J. of Computer Vision,46(1),(2002), 81-96. [8] Y. Fu, T.S. Huang: Graph Embedded Analysis for Head Pose Estimation. 7th IEEE International Conference Automatic Face and Gesture Recognition, (2006). [9] Turk M., Pentland A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1),(1991),71–86. [10] N. Ostu: A thresholding selection method from gray-level histograms, IEEE Trans. Systems, Man & Cybernetics, 9(1),(1979),62–66.
Neural Network-Based Head Pose Estimation and Multi-view Fusion Michael Voit, Kai Nickel, and Rainer Stiefelhagen Interactive Systems Lab, Universität Karlsruhe (TH), Germany {voit|nickel|stiefel}@ira.uka.de
Abstract. In this paper, we present two systems that were used for head pose estimation during the CLEAR06 Evaluation. We participated in two tasks: (1) estimating both pan and tilt orientation on synthetic, high resolution head captures, (2) estimating horizontal head orientation only on real seminar recordings that were captured with multiple cameras from different viewing angles. In both systems, we used a neural network to estimate the persons' head orientation. In the case of the seminar recordings, a Bayes filter framework is further used to provide a statistical fusion scheme, integrating every camera view into one joint hypothesis. We achieved a mean error of 12.3° on horizontal head orientation estimation in the monocular, high resolution task; vertical orientation estimation performed with a mean error of 12.77°. In the case of the multi-view seminar recordings, our system could correctly identify the head orientation (one of eight classes) in 34.9% of the frames. If neighbouring classes were allowed, even 72.9% of the frames were correctly classified.
1
Introduction
A lot of effort in today's research on human-computer interfaces is put into analysing human activities and human-human interaction. An important aspect of human interaction is the looking behavior of people, which can give insight into their focus of attention and to whom they are listening, as well as into the general dynamics of the interaction and the specific roles that people play. Since using special gear is prohibitive in real-life scenarios, visual analysis of people's head orientation has received more and more attention over the last years.
Related Work
In the last years, head pose estimation has received increasing attention due to the unobtrusive possibility it offers to estimate people's looking direction. A lot of different approaches have been presented which, in general, can be categorized into either model-based or appearance-based techniques. Model-based works such as [4,3,5] allow quite precise hypotheses about the orientation. Due to the necessary feature detection, however, they are only applicable in areas where near-frontal shots of people's faces are ensured. Further, high resolutions seem necessary, since building on detailed face features (nostrils, eyes) mostly becomes impossible
once the head's resolution decreases in its total dimensionality. Here, appearance-based approaches tend to achieve satisfactory results even with lower resolutions of the extracted head images. In [6], a neural-network-based approach was demonstrated for head pose estimation from rather low resolution facial images which were captured by a panoramic camera; the output covered head poses from the left to the right profile. Another interesting work is described in [2], where facial images are modeled by the responses of Gabor and Gaussian filters for a number of pose classes. An interesting contribution of that work is the combination of head detection and pose estimation in one joint particle filter framework. Integrating head detection and classification into one combined step makes it possible to overcome alignment problems, from which most appearance-based techniques suffer the most.
Paper Overview
This paper presents two independent systems that were used during the CLEAR 2006 evaluation. The provided task of head orientation estimation comprises two distinct datasets: one is a monocular, synthetic setup with high-resolution, frontal captures of different persons' rotated heads [1]; the other consists of real seminar recordings that were captured with four fixed, overhead cameras set up in the upper corners of a smart room. Due to the far distance of the cameras in the latter scenario, head regions mostly suffer from a rather poor resolution, and the main task herein lies in taking advantage of multiple views in order to stabilize the system's output by deriving a joint hypothesis. Section 2 of this paper gives an overview of the neural network architecture we used for evaluating the monocular setup. Section 3 adapts that system's idea of using one neural network for single-view estimation, extends it to use multiple views, and combines the single estimates into one joint hypothesis. Section 4 provides a short conclusion.
2 Monocular Head Pose Estimation
2.1 Task Overview
In the monocular head pose task, the Face Pointing04 database provided by the PRIMA team at INRIA Rhône-Alpes [1] was chosen. The database used for this evaluation consists of 15 sets of images. Each set contains 2 series of 93 images of the same person at different poses. The first series is used for learning, the second for testing. There are 15 people in the database, some wearing glasses and some not, and having various skin colors. The pose, or head orientation, is determined by 2 angles (h, v), which vary from −90° to +90°. To obtain the different poses, markers were placed throughout the room, at which the subjects had to look during data acquisition. A sample of the dataset is depicted in Figure 1. Further details regarding this dataset can be found on the corresponding website [1].
Fig. 1. Sample images from Face Pointing04 Database
2.2 System Overview
Compared to the second task, this dataset only consists of discrete head poses depicted in single still images. Since temporal smoothing is not applicable here, we adopted our previous work presented in [7], which suggests using one neural network to classify head orientation on a per-frame basis. We implemented one network with two output units, one for pan and one for tilt estimation.
Fig. 2. We used one neural network for estimating both horizontal and vertical head pose in the monocular task. We trained two output neurons to estimate both orientations continuously.
The network follows a three-layer, feed-forward topology with 100 hidden neurons in the second layer. As input, the cropped head region is downsampled to an image size of 64 × 64 pixels, converted to grayscale, and linearly contrast-stretched to compensate for small lighting changes. A Sobel operator is then applied to obtain the magnitude response of both the horizontal and the vertical derivative. Both response images are concatenated to obtain a feature vector of 8192 dimensions, which is fed into the network's input layer (as depicted in Figure 2). Since the database does not provide head bounding boxes, a head segmentation step is necessary in order to align a bounding box around the region of interest. We implemented a linear decision-boundary classifier in HSV color space to segment skin color clusters; the classifier was trained exclusively on the training images of the dataset. A connected-component search over the segmented skin pixels yields the head bounding box. In order to double the amount of training data, we mirrored the training images and added them to the training set.
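The preprocessing chain just described can be summarized in a few lines. The following is a minimal sketch assuming OpenCV and NumPy; the function name, the bounding-box format and the Sobel kernel size are our own choices and are not prescribed by the system description.

```python
import cv2
import numpy as np

def head_feature_vector(frame, bbox, size=64):
    """Build the 8192-dimensional input vector described above (sketch).

    frame : BGR image; bbox : (x, y, w, h) head bounding box obtained from
    the skin-colour segmentation step.  Names are illustrative only.
    """
    x, y, w, h = bbox
    head = frame[y:y + h, x:x + w]

    # Downsample to 64x64 and convert to grayscale.
    gray = cv2.cvtColor(cv2.resize(head, (size, size)), cv2.COLOR_BGR2GRAY)
    gray = gray.astype(np.float32)

    # Linear contrast stretch to compensate for small lighting changes.
    lo, hi = gray.min(), gray.max()
    gray = (gray - lo) / (hi - lo + 1e-6)

    # Magnitude of the horizontal and vertical Sobel responses.
    gx = np.abs(cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3))
    gy = np.abs(cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3))

    # Concatenate both response images: 2 * 64 * 64 = 8192 dimensions.
    return np.concatenate([gx.ravel(), gy.ravel()])
```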
The network was trained using standard error backpropagation and sigmoid activation functions. A cross-validation set was used to select the best-performing network among 100 training cycles. Since we used one output unit for each orientation angle, a final step discretizes the network's output to one of the defined head pose classes.
2.3 Results
Table 1 shows our results on the described dataset. As can be seen, our implementation achieved a mean error of 12.3° for the horizontal orientation hypotheses and 12.77° for the vertical orientation estimates. We believe the performance could be further increased by including shifted head bounding boxes in the training step; this way, the classifier might be able to compensate for inconsistent head alignment to some extent by itself. However, the linear decision boundary in HSV space subjectively showed sufficient quality for segmenting the head region.

Table 1. Results of our monocular head pose estimation system on the Face Pointing04 Database

Pan Avg. Error   Tilt Avg. Error   Correct Pan Class   Correct Tilt Class
12.3°            12.8°             41.8%               52.1%
3 Multi-view Head Pose Estimation
3.1 Task Overview
In the multi-view head pose estimation task, real seminar recordings provided by Universität Karlsruhe were used, and only the horizontal component of the lecturer's head orientation had to be estimated. The data consists of two separate sets, one used exclusively for training and the other for testing. The videos show real seminars recorded by four fixed cameras placed in the upper corners of a seminar room. The lecturer's head bounding box and head orientation were annotated manually in each of the four camera views. Figure 4 depicts one sample video frame from the four cameras. Since the cameras capture at a resolution of 640×480 pixels, the resolution of the annotated head regions is poor; the task of using multiple views therefore aims at stabilizing the system's output by combining views from different angles into one joint hypothesis. Since the lecturer's position varies, his or her head is exposed to strong lighting changes, such as the projector beam or the whiteboard illumination. The background is cluttered, which is why the task does not require automatic head tracking and alignment but provides manual annotations instead. Furthermore, the data is not equally distributed: as depicted in Figure 3, most of the time the lecturer turns his or her attention toward the audience or the whiteboard, whereas turns to the door or to the east in general occur rarely. The head orientation is classified into eight discrete classes: 0°, 45°, 90°, 135°, 180°, 225°, 270° and 315°.
Fig. 3. Setup of the smart room at Universität Karlsruhe, including a logarithmically scaled histogram depicting the head pose distribution of the complete UKA seminar dataset over the eight defined classes
3.2 System Overview
As in Section 2.2, we trained one neural network to estimate head orientation. Here, however, we trained the network to output the head orientation relative to the camera's line of sight: by using relative head pose angles, the very same neural network can be used for all camera views. The network follows a three-layer, feed-forward topology with 100 hidden neurons in the second layer. As input, the cropped head region is preprocessed in the same way as in Section 2.2. Due to the low resolution of the head captures, we only resampled to 32 × 32 pixels, so our feature space consists of 2048 dimensions in total. Furthermore, the original network topology was modified: instead of outputting a continuous estimate of the horizontal head orientation, the network outputs class-conditional probabilities p(c_k | z_j) over a discretization c_k of possible head rotations relative to camera j's line of view. The observation of camera j is denoted by z_j. Our experiments showed that a discretization into 36 classes, each 10° wide, performed best, allowing the network to give a hypothesis for the full range of observable head poses, from −180° to +180°. For the head orientation in room coordinates, we defined 360 states X = {x_i}, with 0 ≤ i ≤ 359, where every state describes one possible head rotation. We implemented a Bayes filter for the transition between these states. Thus, given the observations Z_t = {z_j} of all cameras, our fusion can be written as

    p(X_t = x_i | Z_t) = p(Z_t | x_i) \cdot \sum_{x' \in X} p(X_t = x_i | X_{t-1} = x') \, p(X_{t-1} = x' | Z_{t-1})    (1)
Fig. 4. Example video frame of the UKA Seminar database. The lecturer of the seminar is observed by four fixed, overhead video cameras. In all views, the lecturer's head bounding box and horizontal head orientation are manually annotated.
The observation model gathers the estimations of all n cameras into one combined measurement, such that

    p(Z_t | x_i) = \frac{1}{n} \sum_{j=1}^{n} p(Z_t | \phi_j(x_i))    (2)
given the current observations Z_t. Here, φ_j(x_i) serves as a mapping from the absolute head pose angle x_i to one of the camera-relative rotation classes c_k of camera j. The sum in Equation 1 is made up of two factors: p(X_t = x_i | X_{t-1} = x') describes the transition probability from state x' to x_i, and p(X_{t-1} = x' | Z_{t-1}) represents the posterior probability distribution at time t − 1. Having computed the distribution over all states and transitions, we accumulate the probabilities of all states that fall into the same output orientation class θ_l among those defined by the task (Θ = {0°, 45°, 90°, ...}). The final output is then given as the highest-scoring orientation θ̂ such that

    \hat{\theta} = \arg\max_{\theta_l \in \Theta} \sum_{x_i \in \theta_l} p(X_t = x_i | Z_t)    (3)
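To make the fusion concrete, the following NumPy sketch implements one filter update according to Equations 1–3. The Gaussian transition model and its width, as well as the convention that the eight output classes are centred on 0°, 45°, ..., 315°, are our own assumptions; the paper does not specify the transition probabilities.

```python
import numpy as np

def bayes_filter_step(prior, net_outputs, cam_angles, trans_sigma=5.0):
    """One update of the head-pose Bayes filter (Eqs. 1-3), as a sketch.

    prior       : length-360 posterior from the previous frame, p(x_i | Z_{t-1})
    net_outputs : list of length-36 arrays, p(c_k | z_j), one per camera
    cam_angles  : viewing direction of each camera in room coordinates (deg)
    trans_sigma : std. dev. (deg) of the assumed Gaussian transition model
    """
    states = np.arange(360)

    # Prediction: sum_x' p(x_i | x') p(x' | Z_{t-1}) with a circular Gaussian.
    diff = (states[:, None] - states[None, :] + 180) % 360 - 180
    trans = np.exp(-0.5 * (diff / trans_sigma) ** 2)
    trans /= trans.sum(axis=1, keepdims=True)
    predicted = trans @ prior

    # Observation model (Eq. 2): average the per-camera class likelihoods,
    # mapping each absolute pose to a camera-relative 10-degree class phi_j(x_i).
    likelihood = np.zeros(360)
    for out, cam in zip(net_outputs, cam_angles):
        rel = (states - cam + 180) % 360 - 180        # relative pose in [-180, 180)
        k = ((rel + 180) // 10).astype(int) % 36      # 36 classes, 10 degrees each
        likelihood += out[k]
    likelihood /= len(net_outputs)

    post = likelihood * predicted
    post /= post.sum()

    # Final output (Eq. 3): accumulate the posterior inside each 45-degree class
    # (classes assumed centred on 0, 45, ..., 315 degrees).
    class_idx = np.round(states / 45.0).astype(int) % 8
    class_scores = np.bincount(class_idx, weights=post, minlength=8)
    return post, 45 * int(np.argmax(class_scores))
```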
3.3 Experimental Results
The system was trained on the training dataset only; evaluation took place exclusively on the evaluation set. No further head alignment was performed; the annotated head bounding boxes were used directly to extract the head region.
Fig. 5. In the multi-view setup, we trained one neural network with 36 output neurons. Each of them represents one discrete head pose class, relative to the camera's line of view (in 10° steps). The network was trained to estimate the class-conditional likelihood of the corresponding output class given the observation of that camera.

Table 2. Results of our multi-view head pose estimation system on the UKA Seminar Database

Avg. Error   Correct Class   Correct + neighbouring class
49.2°        34.9%           72.9%
Our system achieved 34.9% correct classification; when allowing the system's output to lie within the correct or a neighbouring class, the performance increased to 72.9%. We believe that an additional alignment step would further improve the system's performance, since the manual labelling still varies in position and size.
4 Conclusion
In this work, we have presented two systems for estimating head pose under different conditions, which were used in the CLEAR 2006 evaluation. One task was to estimate head orientation in both the horizontal and the vertical direction on monocular, synthetic head captures (Face Pointing04 Database). The second task was to hypothesise the horizontal head orientation on multi-view, real seminar recordings (UKA Seminar Database). In both systems, head orientation is estimated per camera using a neural network. In the multi-view seminar scenario, an attached Bayes filter both fuses the individual cameras' estimates and provides temporal filtering to smooth the system's output over each video recording. Since one single neural network is applied to every camera, our approach is flexible and allows camera positions to be changed and additional sensors to be added without retraining the whole system. The Bayes filter framework is independent of the number of cameras and can easily be extended with further information from even more than the four views provided in the dataset.
In the monocular setup, our system estimated horizontal head orientation with a mean error of 12.3°, and vertical orientation estimation performed with a mean error of 12.77°. Since the dataset contains static face captures only, no temporal filtering was applied. Our multi-view head pose estimation system, used on the UKA Seminar Database, achieved a correct classification rate of 34.9%; when allowing for neighbouring classes, 72.9% are correctly classified.
Acknowledgement This work has been funded by the European Commission under contract nr. 506909 within the project CHIL (http://chil.server.de).
References
1. Pointing'04 ICPR Workshop, http://www-prima.inrialpes.fr/pointing04/.
2. S. O. Ba and J.-M. Odobez. A probabilistic framework for joint head tracking and pose estimation. In Proceedings of the 17th International Conference on Pattern Recognition, 2004.
3. A. H. Gee and R. Cipolla. Non-intrusive gaze tracking for human-computer interaction. In Proceedings of Mechatronics and Machine Vision in Practice, pages 112–117, 1994.
4. T. Horprasert, Y. Yacoob, and L. S. Davis. Computing 3-D head orientation from a monocular image sequence. In Proceedings of the 2nd International Conference on Automatic Face and Gesture Recognition, 1996.
5. R. Stiefelhagen, J. Yang, and A. Waibel. A model-based gaze tracking system. In Proceedings of the IEEE International Joint Symposia on Intelligence and Systems, pages 304–310, 1996.
6. R. Stiefelhagen, J. Yang, and A. Waibel. Simultaneous tracking of head poses in a panoramic view. In International Conference on Pattern Recognition, 2000.
7. M. Voit, K. Nickel, and R. Stiefelhagen. Multi-view head pose estimation using neural networks. In Second Workshop on Face Processing in Video (FPiV'05), in Proceedings of the Second Canadian Conference on Computer and Robot Vision (CRV'05), 9–11 May 2005, Victoria, BC, Canada, 2005.
Head Pose Estimation in Seminar Room Using Multi View Face Detectors
Zhenqiu Zhang, Yuxiao Hu, Ming Liu, and Thomas Huang
Beckman Institute, University of Illinois, Urbana, IL 61801, U.S.A.
{zzhang6,hu3,mingliu1}@uiuc.edu,
[email protected]
Abstract. Head pose estimation at low resolution is a challenging problem. Traditional pose estimation algorithms, which assume that faces have been well aligned before pose estimation, face much difficulty in this situation, since face alignment itself does not work well in such a low-resolution scenario. In this paper, we propose to estimate head pose using view-based multi-view face detectors directly. A naive Bayesian classifier is then applied to fuse the head pose information from multiple camera views. To model the temporal evolution of head pose, a Hidden Markov Model is used to obtain the head pose sequence with the greatest likelihood.
1 Introduction
Most previous works on pose estimation (PE) [1] [2] assume that the face has been well aligned before PE. However, in many real applications face alignment itself is a quite difficult problem, as can be seen in Fig. 1. Hence, much error is introduced in the face alignment stage, which decreases the performance of PE dramatically. View-based multi-view face detection is one of the successful applications of statistical learning of the past several years [3] [4]. It detects faces of different poses in an input image and, at the same time, provides an estimate of head pose based on which channel of the view-based face detectors gives the output. In scenarios where face alignment is difficult, a view-based face detector can thus be used directly as a pose estimator. In the task of head pose estimation in the seminar room of the CHIL evaluation, a five-channel multi-view face detector is applied to each of the four camera views. The head pose with respect to the seminar room can then be estimated using a naive Bayesian network [5]. The temporal evolution of head pose is modeled with a Hidden Markov Model [6]. The remainder of this paper is organized as follows: Section 2 briefly discusses FloatBoost, used in this work as the face detection algorithm. Pose estimation with local multi-view face detection is described in Section 3. Section 4 presents in detail the proposed framework for head pose estimation using a naive Bayesian network. The Hidden Markov Model used to model the temporal change of head pose is presented in Section 5. Experiments on the CHIL dataset are described in Section 6, and a brief summary is given in Section 7.
This work was supported by ARDA and DTO.
Fig. 1. Four camera view images of CHIL data
2 FloatBoost Multi-view Face Detection
Various algorithms have been proposed in the literature for face detection. Among those, in this paper we use the FloatBoost approach [3]. FloatBoost is a variant of AdaBoost [7], introduced to amend some of its limitations. AdaBoost is a sequential forward search procedure using a greedy selection strategy; its heuristic assumption is monotonicity. The premise offered by the sequential procedure breaks down when this assumption is violated. FloatBoost instead incorporates the idea of floating search [8] into AdaBoost to overcome the non-monotonicity problems associated with the latter. The sequential floating search (SFS) method [8] allows the number of backtracking steps to be controlled instead of being fixed beforehand. Specifically, it adds or deletes l = 1 feature and then backtracks r steps, where r depends on the current status. As a result, an improvement in the quality of the selected features is obtained at the cost of increased computation due to the extended search. These feature selection methods, however, do not address the problem of (sub-)optimal classifier design based on the selected features. FloatBoost combines them with AdaBoost for both effective feature selection and classifier design. Briefly, FloatBoost is an iterative procedure involving two main steps: in the forward inclusion step, the currently most significant weak classifier is added, one at a time, a step identical to AdaBoost. In the conditional exclusion step, FloatBoost removes the least significant weak classifier from the current ensemble, subject to the condition that the removal leads to a lower cost than the one incurred at the previous iteration. The classifiers following the removed one subsequently need to be re-trained. The above steps are repeated until no more removals can be performed.
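The control flow just described can be sketched as follows; `train_best_weak` and `ensemble_cost` are hypothetical callbacks standing in for the AdaBoost weak-learner training and the cost criterion, and the re-training of the classifiers that follow a removed one is only indicated by a comment.

```python
def floatboost_select(train_best_weak, ensemble_cost, n_rounds):
    """Schematic FloatBoost loop: forward inclusion + conditional exclusion.

    This is only a sketch of the control flow described above, not the
    detector training code used in the paper.
    """
    ensemble = []
    best_cost = {0: float("inf")}          # best cost seen for each ensemble size

    for _ in range(n_rounds):
        # Forward inclusion: add the currently most significant weak classifier.
        ensemble.append(train_best_weak(ensemble))
        best_cost[len(ensemble)] = min(best_cost.get(len(ensemble), float("inf")),
                                       ensemble_cost(ensemble))

        # Conditional exclusion: drop the least significant member while doing
        # so lowers the cost below the best cost recorded for that size.
        while len(ensemble) > 1:
            costs = [ensemble_cost(ensemble[:i] + ensemble[i + 1:])
                     for i in range(len(ensemble))]
            i_min = min(range(len(ensemble)), key=costs.__getitem__)
            if costs[i_min] < best_cost.get(len(ensemble) - 1, float("inf")):
                del ensemble[i_min]
                best_cost[len(ensemble)] = costs[i_min]
                # The classifiers after the removed one would be re-trained here.
            else:
                break
    return ensemble
```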
Fig. 2. Framework of local face detection
In the scenario of interest in this paper, face detection needs to accommodate the lecturer’s varying head pose, as captured in the fixed camera views inside the smart room. Therefore, a multi-view FloatBoost approach is used, where three face detectors are trained: One for frontal view, one for left half-profile view and another for left profile view, with the right side face detectors obtained by mirroring the left ones. All detectors are trained by the FloatBoost technique.
3 Pose Estimation with Local Multi-view Face Detection
As mentioned in the introduction, we estimate head pose with multi-view face detectors directly. As illustrated in Fig. 2, a sequence of face detectors is applied around the face region, given the bounding box of the face in each camera view. In the flow chart there are five view-based face detectors in total: frontal view, left half-profile, right half-profile, left profile and right profile. First, the frontal face detector is applied to the region around the bounding box. If a frontal face is detected, we stop the sequence of face detection and estimate the head pose as frontal view. If no frontal face is found in the local region, we continue the process of multi-view face detection with the left half-profile face detector. Similarly, if a left half-profile face is detected, we stop the sequence and estimate the head pose as left half-profile view. We continue the process following the flow chart shown in Fig. 2. If none of the five detectors can detect a face around the bounding box region, the label non-face is assigned to this local region, which means it is probably the back of the presenter's head seen from this view (this can also happen if we miss the detection of the face). With this local multi-view face detection, the head pose in each camera view is obtained. Let v1, v2, v3, v4 be the variables that represent the head pose in camera view 1, camera view 2, camera view 3 and camera view 4, respectively. For example, v1 has six possible values 1, 2, 3, 4, 5, 6, corresponding to frontal view, left half-profile, right half-profile, left profile, right profile and non-face. The basic assumptions of this framework of sequential local multi-view face detection are: (1) the frontal face detector is more robust than the half-profile face detectors; (2) the half-profile face detectors are more robust than the profile face detectors; (3) the right-side and left-side face detectors are independent, and it is improbable that we would detect a left-side and a right-side face simultaneously.
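As an illustration, the sequential decision just described might look as follows; the detector objects and their `detect` method are hypothetical placeholders, since the trained FloatBoost detectors themselves are not exposed here.

```python
# Pose labels returned per camera view; 6 = non-face (back of the head or a miss).
POSE_LABELS = {1: "frontal", 2: "left half-profile", 3: "right half-profile",
               4: "left profile", 5: "right profile", 6: "non-face"}

def estimate_view_pose(image, bbox, detectors):
    """Sketch of the sequential local multi-view face detection.

    `detectors` is assumed to be a dict of view-based detectors, each with a
    hypothetical detect(image, bbox) method returning True if a face of that
    pose is found in the region around the bounding box.
    """
    order = ["frontal", "left half-profile", "right half-profile",
             "left profile", "right profile"]
    for label, name in zip(range(1, 6), order):
        if detectors[name].detect(image, bbox):
            return label          # stop at the first detector that fires
    return 6                      # none fired: assign non-face to this view
```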
Fig. 3. Naive Bayesian network for head pose estimation
Fig. 4. Overall framework of pose estimation using Naive Bayesian network
4 Naive Bayesian Network for Head Pose Estimation
Recent work in supervised learning has shown that a surprisingly simple Bayesian classifier with strong independence assumptions among features, called naive Bayes [5], is competitive with state-of-the-art classifiers such as C4.5. This fact raises the question of whether a classifier with less restrictive assumptions can perform even better. The naive Bayes classifier learns from training data the conditional probability of each attribute Ai given the class label C. Classification is then done by applying Bayes' rule to compute the probability of C given the particular instance of A1, ..., An, and predicting the class with the highest posterior probability. This computation is rendered feasible by making a strong independence assumption: all attributes Ai are conditionally independent given the value of the class C. For the scenario of head pose estimation in the seminar room, let v1, v2, v3, v4 denote the head pose in each camera view and θ be the head pose with respect to the seminar room, which has eight possible values, such as east, west and northwest. The relation between v1, v2, v3, v4 and θ is modeled with naive Bayes, as shown in Fig. 3. Given an observed instance of v1, v2, v3, v4, θ can be estimated as the value with the highest posterior probability. The overall flow chart of head pose estimation with naive Bayes is illustrated in Fig. 4.
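A minimal sketch of this MAP fusion is given below; the array layout of the prior and of the per-camera conditional tables is our own choice for illustration.

```python
import numpy as np

def naive_bayes_pose(v, prior, cond):
    """MAP estimate of the room-relative pose theta (sketch).

    v     : tuple (v1, v2, v3, v4) of per-view pose labels in {1,...,6}
    prior : length-8 array p(theta), one entry per room direction
    cond  : cond[j][theta, v_j - 1] = p(v_j | theta) for camera j, learned by
            counting co-occurrences on the development data (our notation)
    """
    posterior = prior.copy()
    for j, vj in enumerate(v):
        posterior *= cond[j][:, vj - 1]   # independence assumption of naive Bayes
    return int(np.argmax(posterior))      # index of the most probable direction
```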
Table 1. Average pose estimation accuracy with leave-one-out strategy

S0     S1     H0     H1
54%    85%    58%    92%
5 Hidden Markov Model for Pose Estimation
In computational pattern recognition, temporal patterns are very common. In order to model the inherent temporal ordering of time sequences, some assumptions are made to simplify the analysis. The Markov assumption is one of the most common: it states that the temporal dependence can be broken down to a first-order approximation, i.e., the random variable vt at time t only depends on the previous time instant vt−1. A Hidden Markov Model is a finite set of states, each of which is associated with a (generally multidimensional) observation probability distribution. Transitions among the states are governed by a set of probabilities called transition probabilities. To specify a Hidden Markov Model (HMM) completely, three types of parameters are needed: the initial probabilities, the transition probabilities and the observation probability distribution of each state. In this paper, a Hidden Markov Model is applied to model the temporal change of head pose in the video sequence. In particular, the HMM is specified by:
– the initial probability p(θ)
– the transition probability p(θt | θt−1)
– the observation probability distribution of each state, p(V̄t | θt) = p(vt1 vt2 vt3 vt4 | θt)
Given the observations (V̄1, V̄2, ..., V̄T), the Viterbi algorithm is used to find the state sequence (θ1, θ2, ..., θT) with the greatest likelihood.
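For completeness, here is a compact sketch of Viterbi decoding over the eight pose states; the log-space formulation and the small smoothing constant are implementation choices of ours.

```python
import numpy as np

def viterbi_pose(obs_lik, init, trans, eps=1e-12):
    """Most likely pose sequence (theta_1, ..., theta_T) via Viterbi (sketch).

    obs_lik : T x 8 array, obs_lik[t, s] = p(V_t | theta_t = s)
    init    : length-8 array p(theta)
    trans   : 8 x 8 array, trans[i, j] = p(theta_t = j | theta_{t-1} = i)
    """
    T, S = obs_lik.shape
    delta = np.zeros((T, S))
    psi = np.zeros((T, S), dtype=int)

    delta[0] = np.log(init + eps) + np.log(obs_lik[0] + eps)
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(trans + eps)   # S x S
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(obs_lik[t] + eps)

    # Backtrack the best path.
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path
```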
6 Experimental Results
The meeting room considered in this paper corresponds to the smart room located at one of the CHIL project partners [9]. A number of sensors are installed in the room, including the four fixed cameras that provide the data used in this paper. The cameras capture color data at a 640×480 pixel resolution and at 15 frames per second, and are synchronized. For multi-view face detection, three face detectors are trained on the development data: one for the frontal view, one for the left half-profile view and another for the left profile view (the right-side face detectors are obtained by mirroring the left ones). A number of frontal, left half-profile and left profile view face images are cropped from selected images of the development set for this purpose. In addition, non-face training samples are cropped from an image database that does not include faces.
The estimates of p(θ), p(θt | θt−1) and p(v1 v2 v3 v4 | θt) are learned from the development set. The results of head pose estimation on the development set, with a leave-one-out strategy, are shown in Table 1. There, S0 denotes pose estimation using the naive Bayesian classifier without temporal information and without tolerance of neighbouring poses, S1 denotes pose estimation using the naive Bayesian classifier without temporal information but with tolerance of neighbouring poses, H0 is pose estimation using the Hidden Markov Model with temporal information but without tolerance of neighbouring poses, and H1 is pose estimation using the Hidden Markov Model with temporal information and with tolerance of neighbouring poses. Tested on the evaluation set, we obtained 87% correct classification within the range of neighbouring pose classes using the Hidden Markov Model described in Section 5, and a mean absolute error of 33.56°.
7 Summary
In this paper, multi-view face detectors are applied to estimate head pose in the seminar room scenario. A naive Bayesian classifier is used to fuse the head pose estimates from the four camera views, and an HMM is used to model the temporal change of head pose in the video sequence.
References
1. S. Gong, S. McKenna, and J. Collins, "An investigation into face pose distributions", FG 1996.
2. S. Li, X. Peng, X. Hou, H. Zhang and Q. Cheng, "Multi-view face pose estimation based on supervised ISA learning", FG 2002.
3. S. Li and Z. Zhang, "FloatBoost learning and statistical face detection," IEEE Trans. Pattern Anal. Machine Intell., 26(9), 2004.
4. H. Schneiderman and T. Kanade, "A statistical method for 3D object detection applied to faces and cars," In Proc. Conf. Computer Vision Pattern Recog., 2000.
5. N. Friedman, D. Geiger, and M. Goldszmidt, "Bayesian network classifiers", Machine Learning 29:131–163, 1997.
6. L. Rabiner, "A tutorial on Hidden Markov Models and selected applications in speech recognition", Proc. IEEE 77(2):257–286, 1989.
7. P. Viola and M. Jones, "Robust real time object detection," In Proc. IEEE ICCV Work. Statistical and Computational Theories of Vision, 2001.
8. P. Pudil, J. Novovicova, and J. Kittler, "Floating search methods in feature selection," Pattern Recog. Lett., 15:1119–1125, 1994.
9. CHIL "Computers in the Human Interaction Loop" project web-site: http://chil.server.de
Head Pose Detection Based on Fusion of Multiple Viewpoint Information
C. Canton-Ferrer, J.R. Casas, and M. Pardàs
Technical University of Catalonia, Barcelona, Spain
{ccanton,josep,montse}@gps.tsc.upc.es
Abstract. This paper presents a novel approach to the problem of estimating the head pose and 3D face orientation of several people in low-resolution sequences from multiple calibrated cameras. Spatial redundancy is exploited, and the head in the scene is detected and geometrically approximated by an ellipsoid. Skin patches from each detected head are located in each camera view. Data fusion is performed by back-projecting the skin patches from the single images onto the estimated 3D head model, thus providing a synthetic reconstruction of the head appearance. Finally, these data are processed in a pattern analysis framework, giving an estimate of the face orientation. Tracking over time is performed by Kalman filtering. Results of the proposed algorithm are provided for the SmartRoom scenario of the CLEAR evaluation.
1 Introduction
The current paper addresses the problem of estimating the head orientation of people present in a SmartRoom in the framework of multiple view geometry. Multi-camera systems are widely used for image and video analysis tasks in SmartRooms, surveillance, body analysis or computer graphics. From a mathematical viewpoint, multiple view geometry has been addressed in [4], but there is still work to do on the efficient fusion of information from redundant camera views and its combination with image analysis techniques for object detection, tracking or higher semantic level analysis such as the detection of attitudes and behaviors of individuals. A number of methods for head pose estimation have been proposed in the literature [1]. The general approach involves estimating the position of specific facial features in the image (typically eyes, nostrils and mouth) and then fitting these data to a head model. The accuracy and reliability of the feature extraction process plays an important role in the head pose estimation results. In practice, some of these methods still require manually selecting feature points, as well as assuming that near-frontal views and high-quality images are available. For the applications addressed in this work, such conditions are usually difficult to satisfy. Specific facial features are typically not clearly visible due to far-field conditions, inadequate lighting and wide-angle camera views. They may also be entirely unavailable when faces are not oriented towards the cameras.
Fig. 1. System flowchart: acquisition, spatial and color analysis, head model fitting, face orientation estimation and tracking
Furthermore, most of the existing approaches are based on monocular analysis of images, but few have addressed the multiocular case for face or head analysis [3]. We propose a method for 3D face orientation estimation which produces a fair estimate of the angle and is computationally simple enough for real-time applications. Redundancy among camera views is exploited to define a fusion process of color and spatial information in order to obtain a synthetic reconstruction of the face appearance in 3D. Finally, an analysis method on these data is proposed in order to obtain the orientation of the head. This method has been applied to a multi-camera SmartRoom scenario in the framework of a scene understanding project. Other fields where our algorithm has potential applicability are vehicle driver attention tracking, interfaces for disabled people, and face recognition.
2 Low Level Signal Analysis Modules
According to the flowchart depicted in Fig. 1, the system comprises two low-level image processing modules: spatial and color analysis. These modules provide data to the higher-level analysis module that performs the information fusion required to estimate the head orientation, as well as to the Kalman tracking module. For a given frame in the video sequence, a set of N images is obtained from the N cameras. Each camera is modeled using a pinhole camera model based on perspective projection, and accurate calibration information is available. Bounding boxes describing the head of a person in multiple views are used to segment the interest area where the colour module is applied. The center and size of the bounding box allow defining an ellipsoid model H = {c, R, s}, where c is the center, R the rotation along each axis centered on c, and s the length of each axis. Colour information is processed as described in Subsection 2.1. The information obtained by these two modules is combined in order to generate a 3D representation of the head and produce an estimate of its orientation. The final low-level signal analysis module employed in our system is a standard Kalman tracker with a constant velocity model. Given our model of parameter evolution, it computes predictions and adds the information coming from the measurements in an optimal way to produce a posteriori estimates of the parameters. The tracked parameters are the geometric parameters defining the head and the estimated face orientation angle. For the initialization
of this filter, hand-marked sequences were analyzed in order to estimate the noise correlation matrices.
2.1 Color Module
Interest regions, provided as a bounding box around the head, define 2D masks within the original images where skin color pixels are sought. The masked original images are processed in the CbCr color space, since different skin types mostly differ in the luminance component and not with regard to the hue value. Afterwards, a probabilistic classification is computed on the CbCr information [9], where the color distribution of skin is estimated from offline, hand-selected samples of skin pixels taken under the same lighting conditions as the online experiments and approximated by a Gaussian function. Let us denote by Sn all skin pixels in the n-th view. It should be noted that some sets Sn may be empty due to occlusions or under-performance of the skin detection technique. However, tracking information and the redundancy among views allow this problem to be overcome.
3 Multiple View Color and Spatial Information Fusion
Fusion of both color and space information is required in order to perform a high semantic level classification and estimation of face orientation. Our information fusion procedure takes as input the information generated by the low-level image analysis for each person: an ellipsoid estimate H of the head and a set of skin patches at each view belonging to this head, {Sn}, 0 ≤ n < N. The output of this technique is a fused set of color and space information denoted as Ω. An analysis technique for the data contained in Ω is provided in Sec. 4. The information fusion procedure we define is based on the assumption that all skin patches {Sn} are projections of a region of the surface of the estimated ellipsoid defining the head of a person. Hence, color and space information can be combined to produce a synthetic reconstruction of the head and face appearance in 3D. This fusion process is performed for each head separately, starting by back-projecting the skin pixels of Sn from all N views onto the 3D ellipsoid model. Formally, for each pixel p_n ∈ Sn, we compute

    \Gamma(p_n) \equiv P_n^{-1}(p_n) = o_n + \lambda v, \quad \lambda \in \mathbb{R}^+,    (1)

thus obtaining its back-projected ray in the world coordinate frame passing through p_n in the image plane, with origin in the camera center o_n and director vector v. In this equation, P_n(·) is the perspective projection operator from 3D to 2D coordinates in view n [4]. In order to obtain the back-projection of p_n onto the surface of the ellipsoid modelling the head, Eq. 1 is substituted into the equation of an ellipsoid defined by the set of parameters H [4]. This gives a quadratic in λ:

    a\lambda^2 + b\lambda + c = 0.    (2)
Fig. 2. In (a), scheme of the color and spatial information fusion process: pixels in the set Sn are back-projected onto the surface of the ellipsoid defined by H, generating the set S̄n with its weighting term αn. In (b), result of the information fusion, giving a synthetic reconstruction of the face appearance from the images in (c), where the skin patches are plotted in red and the ellipsoid fit in white.
The case of interest is when Eq. 2 has two real roots. This means that the ray intersects the ellipsoid twice, in which case the solution with the smaller value of λ is chosen for reasons of visibility consistency. See a scheme of this process in Fig. 2(a). This process is applied to all pixels of a given patch Sn, obtaining a set S̄n containing the 3D points that are the intersections of the back-projected skin pixels of view n with the ellipsoid surface. In order to perform a joint analysis of the sets {S̄n}, each set must have an associated weighting factor that takes into account the real surface of the ellipsoid represented by a single pixel in that view n, that is, to quantify the effect of the different distances from the center of the object to each camera. This weighting factor αn can be estimated by projecting a sphere with radius r = max(s) onto every camera plane and computing the ratio between the apparent area of the sphere and the number of projected pixels. To be precise, αn should be estimated for each element in S̄n but, since the far-field condition

    \max(s) \ll \| c - o_n \|_2, \quad \forall n,    (3)

is fulfilled, αn can be considered constant for all intersections in S̄n. A schematic representation of the fusion procedure is depicted in Fig. 2(a). Finally, after applying this process to all skin patches, we obtain a fused set of color and spatial information, Ω = {S̄n, αn, H}, 0 ≤ n < N, for the head under study in the scene. A result of this fusion is shown in Fig. 2(b).
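The back-projection of Eqs. 1–2 reduces to a ray–ellipsoid intersection, sketched below in NumPy under the assumption that the ray origin o_n and director vector v have already been obtained from the calibrated pinhole model; the function name and argument layout are ours.

```python
import numpy as np

def backproject_to_ellipsoid(ray_origin, ray_dir, c, R, s):
    """Intersect the back-projected ray o_n + lambda*v with the head ellipsoid
    H = {c, R, s} and return the visible (nearer) 3D point, or None.
    """
    A = np.diag(1.0 / np.asarray(s, dtype=float))   # scales the axes to a unit sphere
    o = A @ R.T @ (np.asarray(ray_origin) - np.asarray(c))
    d = A @ R.T @ np.asarray(ray_dir)

    # Quadratic a*lambda^2 + b*lambda + c0 = 0 in the normalized frame (Eq. 2).
    a = d @ d
    b = 2.0 * o @ d
    c0 = o @ o - 1.0
    disc = b * b - 4.0 * a * c0
    if disc < 0:                                    # the ray misses the ellipsoid
        return None

    roots = [(-b - np.sqrt(disc)) / (2 * a), (-b + np.sqrt(disc)) / (2 * a)]
    lambdas = [l for l in roots if l > 0]
    if not lambdas:
        return None
    lam = min(lambdas)                              # nearer intersection: visible side
    return np.asarray(ray_origin) + lam * np.asarray(ray_dir)
```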
4 Head and Face Orientation
The final part of our system deals with the identification of the head and face orientation using the output data of the previous fusion method. The angle of interest for our purposes in a SmartRoom scenario has been chosen as a direction in the xy plane. Since this angle gives information about where the person is looking in the scene, it can be used for further analysis such as tracking the focus of attention in meetings [8]. We propose a method to estimate the value of the orientation angle θ̂.
4.1 Weighted Centroid
An estimate of the orientation angle θ̂ can be obtained by computing the weighted centroid of the fusion data Ω as

    d = \frac{1}{\sum_{n=0}^{N-1} |\bar{S}_n|} \sum_{n=0}^{N-1} \alpha_n \sum_{p_n \in \bar{S}_n} (p_n - c),    (4)

    \hat{\theta} = \tan^{-1}(d_y / d_x),    (5)

where |S̄n| denotes the number of elements (pixels) in the set.
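Equations 4–5 can be computed directly from the fused set, as in the following sketch; using atan2 rather than a plain arctangent (to keep the correct quadrant) is our own choice.

```python
import numpy as np

def weighted_centroid_angle(patches, alphas, c):
    """Estimate the pan angle from the fused set Omega (Eqs. 4-5), as a sketch.

    patches : list of (k_n x 3) arrays, the back-projected points of each view
    alphas  : per-view weighting factors alpha_n
    c       : 3D centre of the head ellipsoid
    """
    total_points = sum(len(p) for p in patches)
    d = np.zeros(3)
    for pts, alpha in zip(patches, alphas):
        d += alpha * (np.asarray(pts) - np.asarray(c)).sum(axis=0)
    d /= total_points

    # Orientation as a direction in the xy plane.
    return np.degrees(np.arctan2(d[1], d[0]))
```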
5 Results and Conclusions
Sequences of the seminar type have been evaluated, and the results are reported in Table 1. By analyzing the results in detail we reached the following conclusions. The orientation estimate depends strongly on the detection of skin patches and is thus sensitive to its performance. Typically, skin detection underperforms when the face is illuminated by coloured light, e.g., the projector beam. In these cases we estimate a wrong orientation angle, and the Kalman filter loses track after a short while. On the other hand, our method is affected by hair style and the presence of a beard or baldness. Future research towards solving the aforementioned weak points of our algorithm would involve employing more sophisticated skin detectors robust to the bias introduced when the face is illuminated by coloured light [5]. Particle filtering tracking schemes could also be introduced to cope with fast changes in head orientation.
Table 1. Results of the proposed method for the UKA Seminars. M1 stands for Pan Mean Absolute Error, M3 for Pan Mean Absolute Error per Pose, M5 for Pan Correct Classification, M7 for Correct Pan Classification per Pan Pose Class and M6 for Pan Correct Classification within the range of neighbouring pose classes.
Seminar             M1       M5       M6
UKA 20050427 B      79.45°   17.39%   39.93%
UKA 20050622 A      96.01°   08.14%   31.60%
UKA 20050511        61.91°   16.96%   55.49%
UKA 20050601        79.52°   16.21%   48.63%
UKA 20050615 A1     74.68°   27.87%   48.72%
UKA 20050615 A2     67.65°   28.76%   59.80%
UKA 20050504 B      84.03°   09.31%   32.48%
UKA 20050525 C      53.20°   27.91%   70.74%
Average             73.63°   19.67%   48.83%

M3 Pose Average Error
0°       45°      90°      135°      180°      225°     270°     315°
40.63°   55.83°   67.50°   146.25°   112.97°   94.36°   54.41°   32.59°

M7 Pose Correct Classification
0°       45°      90°      135°     180°     225°     270°     315°
36.17%   14.81%   0%       0%       8.61%    2.42%    18.05%   40.52%
References
1. Brolly, X., Stratelos, C., Mulligan, J.: Model-based head pose estimation for air-traffic controllers. Proc. IEEE Int. Conf. on Image Processing, pp. 113–116, 2003.
2. Canton-Ferrer, C., Casas, J.R., Pardàs, M.: Towards a Bayesian Approach to Robust Finding Correspondences in Multiple View Geometry Environments. LNCS 3515:2, pp. 281–289, 2005.
3. Chen, M., Hauptmann, A.: Towards Robust Face Recognition from Multiple Views. Proc. IEEE Int. Conf. on Multimedia and Expo, 2004.
4. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. 2nd Edition. Cambridge University Press, 2004.
5. Martinkauppi, B.: Face colour under varying illumination - Analysis and applications. PhD Thesis, University of Oulu, 2002.
6. Mikic, I.: Human Body Model Acquisition and Tracking using Multi-Camera Voxel Data. PhD Thesis, Univ. of California, 2002.
7. Stauffer, C., Grimson, W.: Adaptive background mixture models for real-time tracking. Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition, pp. 252–259, 1999.
8. Stiefelhagen, R.: Tracking Focus of Attention in Meetings. Proc. IEEE Int. Conf. on Multimodal Interfaces, pp. 273–280, 2002.
9. Yang, J., Lu, W., Waibel, A.: Skin-colour modeling and adaptation. Technical Report CMU-CS-97-146, Carnegie Mellon University.
CLEAR Evaluation of Acoustic Event Detection and Classification Systems
Andrey Temko (1), Robert Malkin (2), Christian Zieger (3), Dusan Macho (1), Climent Nadeu (1), and Maurizio Omologo (3)
(1) TALP Research Center, UPC, Campus Nord, Ed. D5, Jordi Girona 1-3, 08034 Barcelona, Spain
{temko, dusan, climent}@talp.upc.es
(2) interACT, Carnegie Mellon University, 407 S. Craig St, Pittsburgh PA 15213, USA
[email protected]
(3) ITC-irst, via Sommarive 18, 38050, Povo (TN), Italy
{zieger, omologo}@itc.it
Abstract. In this paper, we present the results of the Acoustic Event Detection (AED) and Classification (AEC) evaluations carried out in February 2006 by the three participating partners from the CHIL project. The primary evaluation task was AED on the testing portions of the isolated-sound databases and seminar recordings produced in CHIL. Additionally, a secondary AEC evaluation task was designed using only the isolated-sound databases. The set of meeting-room acoustic event classes and the metrics were agreed upon by the three partners, and ELDA was in charge of the scoring task. In this paper, the various systems for the tasks of AED and AEC and their results are presented.
1 Introduction
Although speech is certainly the most informative acoustic event, other kinds of sounds may also carry useful information in a meeting-room environment. In fact, in that environment the human activity is reflected in a rich variety of acoustic events, produced either by the human body or by objects handled by humans. Consequently, detection or classification of acoustic events may help to detect and describe the human and social activity that takes place in the room: for example, clapping or laughter during a speech, a strong yawn in the middle of a lecture, a chair moving or door noise when the meeting has just started, etc. Additionally, the robustness of automatic speech recognition systems may be increased by a prior detection of the non-speech sounds present in the captured signals. Acoustic Event Detection/Classification (AED/C) is a recent sub-area of computational auditory scene analysis [1] that deals with processing acoustic signals and converting them into symbolic descriptions corresponding to a listener's perception of the different sound events that are present in the signals and their sources. While acoustic event classification deals with events that have already been isolated from their temporal context, acoustic event detection refers to both the identification and the localization in time of events in continuous audio streams.
In this paper, we present the results of the AED/C CLEAR evaluations carried out in February 2006 by the three participating partners from the CHIL project [2] that sign this paper (UPC, CMU and ITC). The primary evaluation task was AED on the testing portions of the two isolated-sound databases (from ITC and UPC) and four UPC seminar recordings produced in CHIL. Additionally, a secondary AEC evaluation task was designed using only the isolated-sound databases, and it is also included in this report. All the partners agreed on the set of acoustic classes a priori, before recording the databases. A common metric was also developed at the UPC and agreed upon with the other partners. ELDA was in charge of the scoring task. In this paper, the three participating sites present their own preliminary systems for the tasks of AED and AEC. Two of them are based on the classical Hidden Markov Model (HMM) [3] approach used in continuous speech recognition, and the other uses a Support Vector Machine (SVM) [4] as the basic classifier. Since the evaluation procedure was not strictly defined, there are some differences in the degree to which the systems are fitted to the testing data: two partners developed specific systems for each room but the third did not; one partner uses a system trained differently for the seminars and for the isolated-event databases, etc. If those differences are neglected, it is observed that the system closest to the usual speech recognition approach offers better average AED results. The paper is organized as follows: Section 2 gives the experimental setup. Specifically, the databases used in the evaluations are described in Subsection 2.1, while the evaluation scenario and metrics are given in Subsections 2.2 and 2.3, respectively. Section 3 reviews the systems used by each of the AED/C evaluation participants. The results obtained by the detection and classification systems in the CLEAR evaluations are shown and discussed in Section 4. Conclusions are presented in Section 5.
2 Evaluation Setup
2.1 Databases
The experiments were carried out on two different kinds of databases: two databases of isolated acoustic events, recorded at UPC and ITC-irst, and five interactive seminars recorded at UPC. The two former databases contain a set of isolated acoustic events that occur in a meeting-room environment and were recorded specifically for the CHIL AED/C task. The recorded sounds do not overlap in time, and no interfering noises were present in the room. The UPC database of isolated acoustic events [5] was recorded using 84 microphones, namely a Mark III array of 64 microphones, three T-shaped clusters (4 microphones per cluster), 4 tabletop directional and 4 omni-directional microphones. The database consists of 13 semantic classes plus "unknown". Approximately 60 sounds per sound class were recorded, as shown in Table 1. Ten people participated in the recordings: 5 men and 5 women. There are 3 sessions per participant; in each session, the participant took a different place in the room out of 7 fixed positions. The ITC database of isolated acoustic events [6] was recorded with 32 microphones, mounted in 7 T-shaped arrays (composed of 4 microphones each) plus 4 table microphones. The database contains 16 semantic classes of events. Approximately 50 sounds per class were recorded, with a few exceptions, as shown
in Table 1. Nine people participated in the recordings. For each recording session, 4 positions in the room were defined, and people swapped their positions after every session. During each session every person reproduced a complete set of acoustic events. Additionally, the AED techniques were applied to the database of interactive seminars [7] recorded at UPC, of which five have been collected. The difference with respect to the two databases of isolated acoustic events is that the seminars consist of real-environment events that may overlap in time with speech and/or other acoustic events. Each seminar consists of a 10–20 minute presentation to a group of 3–5 attendees in a meeting room. During and after the presentation there are questions from the attendees with answers from the presenter. There is also activity in terms of people entering/leaving the room, opening and closing the door, standing up and going to the screen, some discussion among the attendees, coffee breaks, etc. The database was recorded using 88 different sensors that include three 4-microphone T-shaped arrays, one 64-microphone Mark III array, 4 omni-directional tabletop microphones, 4 directional tabletop microphones, and 4 close-talk microphones. The number of events in one of the seminars is summarized in Table 1.

Table 1. Number of events for the UPC and ITC databases of isolated acoustic events, and the UPC interactive seminar
Event type           UPC-isolated   ITC-isolated   UPC-seminar
Door knock           50             47             4
Door open            60             49             7
Door slam            61             51             7
Steps                73             50             43
Chair moving         76             47             26
Spoon/cup jingle     64             48             15
Paper work           84             48             21
Key jingle           65             48             2
Keyboard typing      66             48             14
Phone ring           116            89             6
Applause             60             12             2
Cough                65             48             5
Laugh                64             48             8
Unknown              126            -              12
Mimo pen buzz        -              48             -
Falling object       -              48             13
Phone vibration      -              -              -
Speech               -              -              169
2.2 Evaluation Scenario
The AED/C evaluation is done on 12 semantic classes, defined as:
• Knock (door, table) [kn]
• Door slam [ds]
• Steps [st]
• Chair moving [cm]
• Spoon (cup jingle) [cl]
• Paper wrapping [pw]
• Key jingle [kj]
• Keyboard typing [kt]
• Phone ringing/Music [pr]
• Applause [ap]
• Cough [co]
• Laugh [la]
There are also two other possible events that are present but are not evaluated:
• Speech [sp]
• Unknown [un]
Actually, the databases of isolated acoustic events contain more semantic classes than the above list, as shown in Table 1. For that reason, the classes that are out of the scope of the current AED/C evaluation were marked as "unknown". Two main series of experiments are performed: AED and AEC. AED was done in both isolated and real-environment conditions. For the tasks of AEC and isolated AED, the databases of isolated acoustic events were split into training and testing parts: for the UPC database, sessions 1 and 2 were used for training and session 3 for testing; for the ITC database, sessions 1–3 were used for training and session 4 for testing. For the task of AED in a real environment, all the databases of isolated acoustic events and one of the five seminars could be used for training and development, while for testing a 5-minute extract from each of the remaining 4 seminars was proposed, forming in total four 5-minute segments. The selection of the extracted parts was done by ELDA. The primary evaluation task was defined as AED evaluated on both the isolated databases and the seminars.
Fig. 1. From a reference transcription with overlapping of level 2 to a single-level reference transcription
Table 2. Obtained single-level reference transcription and a list of events to detect
Single-level reference transcription    List of events to detect
1 – co1                                 1 – cough1
2 – la1                                 2 – laugh1
3 – la1_ds1                             3 – ds1
4 – la1                                 4 – spoon1
5 – la1_cl1                             5 – laugh2
6 – cl1                                 6 – keyboard1
7 – la2
2.3 Metrics
As mentioned above, the acoustic events that happen in a real environment may overlap in time. An appropriate metric was developed to score the system outputs. It consists of two steps: projecting all levels of overlapping events onto a single-level reference transcription, and comparing a hypothesized transcription with that single-level reference transcription. For instance, suppose we have a reference that contains overlapping of level 2 and can be represented as shown in Figure 1 by
REF_1: _la_kt_
REF_2: _co_ds_cl_la_
where REF_1 and REF_2 model two overlapping acoustic event sequences. Then we can form the single-level reference transcription and the list of events to detect as shown in Table 2. The following definitions are needed to compute the metric:
• An event is correctly detected when the hypothesized temporal centre is situated in the appropriate single-level reference interval and the hypothesized label is a constituent or the full name of that interval's single-level reference label. After an event is claimed to be correctly detected, it is marked as detected in the list of events to detect.
• Empty intervals are the reference intervals that contain speech, silence or events belonging to the "unknown" class.
• A substitution error occurs when the temporal centre of the hypothesized event is situated in the appropriate single-level reference interval and the label of the hypothesized event is not a constituent or the full name of the label of that single-level reference interval.
• An insertion error occurs when the temporal centre of the hypothesized event is not situated in any of the single-level reference intervals (i.e., it is situated in an empty interval).
• A deletion error occurs when there is an event in the list of events to detect that is not marked as detected.
Finally, the Acoustic Event Error Rate (AEER) is computed as

    AEER = (D + I + S) / N · 100,

where N is the number of events to detect, D the number of deletions, I the number of insertions, and S the number of substitutions.
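As an illustration, the scoring rules can be sketched as follows. This is a much-simplified rendering, not the official scoring tool: the representation of the reference intervals and of the list of events to detect, and the bookkeeping that marks a reference event as detected, are our own simplifications.

```python
def aeer_score(ref_intervals, events_to_detect, hyps):
    """Simplified sketch of the AEER scoring rules above.

    ref_intervals    : (t_start, t_end, label) tuples; label is e.g. "la1_ds1"
                       for overlapped events and None for empty intervals
    events_to_detect : dict mapping event ids (e.g. "la1", "ds1") to a detected
                       flag, initially False (our own representation)
    hyps             : (t_start, t_end, class_label) hypothesised events
    """
    D = I = S = 0
    for h_start, h_end, h_class in hyps:
        centre = 0.5 * (h_start + h_end)
        hit = next((lab for s, e, lab in ref_intervals if s <= centre < e), "none")
        if hit == "none" or hit is None:
            I += 1                           # centre lies in an empty interval
        elif h_class in hit:                 # constituent (or full name) of the label
            for ev in hit.split("_"):        # mark one matching event as detected
                if ev.startswith(h_class) and not events_to_detect.get(ev, True):
                    events_to_detect[ev] = True
                    break
        else:
            S += 1
    D = sum(1 for det in events_to_detect.values() if not det)
    N = len(events_to_detect)
    return 100.0 * (D + I + S) / N
```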
3 Acoustic Event Detection and Classification Systems
3.1 UPC AED/C Systems
A system based on SVM was used at UPC for the AED/C tasks. A DAG [8] multi-classification scheme was chosen to extend the SVM binary classifier to the multi-class problem. 5-fold cross-validation [4] on the training data was applied to find the optimal SVM hyperparameters, namely σ for the chosen Gaussian kernel and C, a parameter that controls the amount of data allowed to be misclassified during the training procedure. In all the experiments the third channel of the Mark III microphone array was used. First, the sound is downsampled from the initial 44 kHz sampling rate to 22 kHz and framed (frame length 25 ms, 50% overlap, Hamming window). For each frame, the set of spectral parameters that showed the best results in [9] is extracted. It consists of the concatenation of two types of parameters: 1) 16 Frequency-Filtered (FF) log filter-bank energies [10] taken from ASR, and 2) a set of other perceptual parameters: zero-crossing rate, short-time energy, 4 subband energies, the spectral flux calculated for each of the defined subbands, and pitch. The first and second time derivatives are also calculated for the FF parameters. In total, a vector of 59 components is built to represent each frame.

Fig. 2. UPC acoustic event detection system

AEC system. The mean, standard deviation, entropy and autocorrelation coefficient of the parameter vectors were computed along the whole event signal, thus forming one vector per
audio event with 4×59 elements. Then, that vector of statistical features was used to feed the SVM classifier, which was trained on the training set of the two databases of isolated acoustic events. The resulting system, herewith named "UPC-C", was used to test both the UPC and the ITC databases of isolated acoustic events, so neither feature nor system adaptation to a specific database was applied.

AED system. The scheme of the AED system, herewith named "UPC-D", is shown in Figure 2. Using a sliding window of one second with a 100 ms shift, a vector of 4×59 statistical features was extracted, as in the AEC system described in the previous subsection, for each position of the window (every 100 ms). The statistical feature vector is then fed to an SVM-based silence/non-silence classifier trained on silence and non-silence segments of the two isolated acoustic event databases. At the output, a binary sequence of decisions is obtained; a median filter of size 17 is applied to eliminate too-short silences or non-silences. Then, the SVM-based event classifier is applied to each detected non-silence segment. The event classifier was trained on parameters extracted from a sliding window with a 100 ms shift applied to each event, in such a way that the first and the last windows still include more than 50% of the event content. The event classifier is trained on both the isolated acoustic event and the seminar databases to classify the set of 12 defined acoustic classes, plus the classes "speech" and "unknown". A sequence of decisions, made on a 1-second window every 100 ms, is obtained within the non-silence segment. That sequence is smoothed by assigning to the current decision point the label that is most frequent in a string of five decision points around the current one. Also, a confidence measure is calculated for each point as the quotient between the number of times that the chosen label appears in the string and the number of labels in the string (five).
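Both UPC systems rely on the same 4×59 statistical feature vector computed over a signal segment (the whole event for AEC, a one-second window for AED). A minimal NumPy sketch of that computation is given below; the histogram-based entropy estimator and the lag-1 autocorrelation coefficient are our assumptions, since the paper does not specify how these statistics are computed.

```python
import numpy as np

def statistical_features(frames):
    """Collapse a (T x 59) frame-level feature matrix into the 4 x 59 vector
    of mean, standard deviation, entropy and autocorrelation (sketch)."""
    frames = np.asarray(frames, dtype=float)
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)

    # Histogram-based entropy of each feature trajectory (assumed estimator).
    def entropy(x, bins=10):
        counts, _ = np.histogram(x, bins=bins)
        p = counts / counts.sum()
        p = p[p > 0]
        return -(p * np.log(p)).sum()

    ent = np.array([entropy(frames[:, i]) for i in range(frames.shape[1])])

    # Lag-1 autocorrelation coefficient of each feature trajectory (assumed lag).
    centred = frames - mean
    denom = (centred ** 2).sum(axis=0) + 1e-12
    acorr = (centred[1:] * centred[:-1]).sum(axis=0) / denom

    return np.concatenate([mean, std, ent, acorr])   # 4 x 59 = 236 values
```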
Fig. 3. CMU acoustic event detection system
The sequence of decisions from the non-silence segment is then processed again to obtain the detected events. In that step, only the events whose length is equal to or larger than the average event length are kept, and the number of events kept in the non-silence segment is forced to be lower than a number which is proportional to the
length of the segment. The average length of the events is estimated from the training and development databases. Finally, if the average of the above-mentioned confidences within a detected event is less than a threshold, the hypothesized event is marked as "unknown"; otherwise, it keeps the assigned label.

3.2 CMU AED/C Systems
The CMU acoustic event classification and detection systems were based on continuous-density HMMs. We first downsampled the input signal from a single microphone to 16 kHz, 2-byte quality. From this signal, we extracted 15 Mel-Frequency Cepstral Coefficients (MFCCs) at a rate of 100 frames per second. We additionally normalized these MFCCs to zero mean and unit variance using means and variances specific to each site. We used custom HMM topologies for each sound class; these topologies were induced using the k-variable k-means algorithm due to Reyes-Gomez and Ellis [11]. The k-variable k-means algorithm is a greedy approach to topology induction based on the leader-follower clustering paradigm; it uses a threshold to control the tendency to add new states to a class HMM. We trained five complete sets of class HMMs using all available data from the isolated databases. After training these five complete HMM sets, we further trained site-specific feature space adaptation matrices, which are reflected in the systems "CMU-C1" and "CMU-C2". We used the maximum likelihood approach suggested by Leggetter and Woodland [12] and Gales [13]. Finally, as suggested by Reyes-Gomez and Ellis, we explored the combination of scores of HMMs trained with different thresholds on a per-site basis. We found that by combining three models for the ITC data and two for the UPC data, we were able to achieve a combined misclassification rate of less than 6% for the acoustic event classification task. For the acoustic event detection task, we wished to explore the possibility of pre-segmenting the data with a simple HMM before applying our more complex classification HMMs, which used more than one Viterbi path to assign a final score. The scheme of the system is presented in Figure 3. Hence, we trained segmentation HMMs which included three classes: speech, CHIL event, and other. To train these HMMs, we used the same approach as for the classification systems above, except that we added the UPC seminar data for training. The detection systems will herewith be named "CMU-D1" and "CMU-D2". We chose the optimal HMMs for segmentation on a per-site basis. Further, since we also needed to control the rate at which these HMMs created segments in the data, we optimized separate insertion penalties for the ITC isolated database, the UPC isolated database, and the UPC seminar database. This approach yielded poor results in the isolated condition, and very poor results in the seminar condition.

3.3 ITC AED/C Systems
The AED/C system studied at ITC-irst is based on continuous-density HMMs. The scheme of the system is presented in Figure 4. A signal acquired by a single microphone belonging to a T-shaped array was used in the experiments. The front-end processing is based on 12 Mel-Frequency Cepstral Coefficients (MFCCs) [3] and the log-energy of the signal. The analysis step is 10 ms
with a Hamming window of 20 ms. The resulting parameters, together with their first- and second-order time derivatives, are arranged into a single observation vector of 39 components. Each event is described by a 3-state HMM. All of the HMMs have a left-to-right topology and use output probability densities represented by means of 32 Gaussian components with diagonal covariance matrices. HMM training was accomplished through the standard Baum-Welch procedure.

For the AEC task, two different sets of models were created to fit the ITC and UPC rooms; the corresponding systems will herewith be named "ITC-C1" and "ITC-C2". The first is trained on the ITC isolated acoustic events database and the other on the UPC isolated acoustic events database. The selected training data refers to the recordings of a single microphone belonging to a T-shaped array.

For the AED task, the same models adopted in the AEC task were used, but models of speech and silence were added; the corresponding systems will herewith be named "ITC-D1" and "ITC-D2". To train the speech model, recordings of meetings in the ITC room were used, while the silence model was trained on the database of isolated acoustic events, exploiting the silence intervals between events.
Fig. 4. ITC acoustic event detection system (block diagram: feature extraction followed by HMM detection)
To cope with the detection of events that overlap with speech, which occur in the interactive seminars, a strategy based on contaminating the events with speech during training was exploited; this is reflected in the system "ITC-D3". An artificial database was created by adding speech to the isolated events at different SNR values, from 0 to 15 dB. At present, the system is not trained to detect events that overlap with other events; only overlap with speech is handled.
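A minimal sketch of this contamination step (our own illustration with assumed signal names, not the ITC tooling): speech is scaled so that the event-to-speech power ratio matches a target SNR before the two signals are mixed.

```python
import numpy as np

def mix_at_snr(event, speech, snr_db):
    """Add speech to an isolated-event signal at a prescribed event-to-speech SNR (in dB)."""
    n = min(len(event), len(speech))
    event, speech = event[:n].astype(float), speech[:n].astype(float)
    p_event = np.mean(event ** 2)
    p_speech = np.mean(speech ** 2) + 1e-12
    gain = np.sqrt(p_event / (p_speech * 10 ** (snr_db / 10.0)))   # scale speech to hit target SNR
    return event + gain * speech

# e.g., an artificial training pool covering the 0-15 dB range mentioned above:
# contaminated = [mix_at_snr(ev, sp, snr) for snr in (0, 5, 10, 15)]
```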
4 Results and Discussion

Table 3 shows the classification error rates obtained with the different classification systems described previously. Since the evaluation procedure was not strictly defined, there are some differences in the degree to which the systems are fitted to the two testing databases (the ITC and UPC isolated DBs): both the CMU and ITC systems use two sets of models, one for each testing database, while the UPC system uses one set of models for both testing databases. We can observe that the system based on SVMs obtained the same or better results than the systems based on HMM technology, despite the fact that database-specific systems were used in the HMM case.
Table 3. Error rates (in %) for AE classification task of the systems explained in Section 3
Databases          UPC-C   CMU-C1   CMU-C2   ITC-C1   ITC-C2
ITC isolated DB     4.1      7.5     ----     12.3     ----
UPC isolated DB     5.8     ----      5.8     ----      6.2
In the detection task, as explained in the previous sections, the participants took two different approaches:
a) first performing segmentation and then classification (UPC and CMU systems);
b) merging segmentation and classification in one step, as performed by the Viterbi search in state-of-the-art ASR systems (ITC systems).

Table 4 shows detection error rates for the two isolated event databases and the interactive seminar database. The lowest detection error rates are obtained by the ITC systems, which are based on approach b). Notice that both the CMU and UPC systems achieved better results than the ITC systems in the classification task (Table 3); however, they both rely on a previous segmentation step (approach a)). If we add up the results obtained for the detection task for both the isolated and seminar conditions, neglecting the test-specificities of the CMU and ITC systems, we obtain the following error rates: UPC: 69.6%, CMU: 80.5%, ITC: 46.8%. Although there might be a number of reasons to explain the differences across the systems, we conjecture that the initial segmentation step included in both the UPC and CMU systems, but not in the ITC systems, is the main cause of the lower overall detection performance of these systems. Further investigation is needed in the direction of approach a) to see whether it can outperform the well-established scheme b).

Besides, it can be seen from Table 4 that the error rates increase significantly for the UPC seminar database. One possible reason for such poor performance is that it is difficult to detect low-energy acoustic classes that overlap with speech, such as "chair moving", "steps", "keyboard typing", and "paper work". Actually, these classes cover the majority of the events in the UPC seminars and are probably the cause of the poor results obtained in the seminar task. Using multiple microphones might be helpful in this case.

Table 4. Error rates (in %) for AE detection task of the systems explained in Section 3
Databases          UPC-D   CMU-D1   CMU-D2   ITC-D1   ITC-D2   ITC-D3
ITC isolated DB     64.6     45.2    ----      23.6    ----     ----
UPC isolated DB     58.9    ----      52.5    ----      33.7    ----
UPC seminars DB     97.1    ----     177.3    ----     ----      99.3
5 Conclusions

The presented work focused on the CLEAR evaluation tasks concerning the detection and classification of acoustic events that may happen in a lecture/meeting room environment. In this context, we evaluated two different tasks, acoustic event classification (AEC) and acoustic event detection (AED), AED being the primary objective of
the evaluation. Two kinds of databases were used: two databases of isolated acoustic events, and a database of interactive seminars containing a significant number of acoustic events of interest. Preliminary detection and classification systems from three different participants were presented, which allowed an evaluation of different approaches for both classification and detection. The UPC system is based on the Support Vector Machine (SVM) discriminative approach and uses Frequency Filtering features and four kinds of perceptual features. Both the CMU and ITC systems are based on the Hidden Markov Model (HMM) generative approach and use MFCC features. In the classification task, the UPC SVM-based system showed better performance than the two systems based on HMMs. In the detection task, we could see two different approaches: a) first performing segmentation and then classification (UPC and CMU systems), and b) merging segmentation and classification in one step, as performed by the Viterbi search in state-of-the-art Automatic Speech Recognition (ASR) systems (ITC systems). In the presented results, approach b) showed better performance than approach a). Notice, however, that approach b) (and indeed the ITC systems) follows a well-established ASR scheme that has been developed over many years and can thus be considered a challenging reference for the other presented approaches/systems in the acoustic event detection task.
Acknowledgements

This work has been partially sponsored by the EC-funded project CHIL (IST-2002-506909). The authors wish to thank Djamel Mostefa and Nicolas Moreau from ELDA for their role in the transcription of the seminar data and in the scoring task. The UPC authors have been partially sponsored by the Spanish Government-funded project ACESCA (TIN2005-08852).
References
1. D. Wang, G. Brown, Computational Auditory Scene Analysis: Principles, Algorithms and Applications, Wiley-IEEE Press, 2006
2. CHIL - Computers in the Human Interaction Loop, http://chil.server.de/
3. L. Rabiner, B. Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993
4. B. Schölkopf, A. Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2002
5. A. Temko, D. Macho, C. Nadeu, C. Segura, "UPC-TALP Database of Isolated Acoustic Events", Internal UPC report, 2005
6. C. Zieger, M. Omologo, "Acoustic Event Detection - ITC-irst AED database", Internal ITC report, 2005
7. J. Casas, R. Stiefelhagen, et al., "Multi-camera/multi-microphone system design for continuous room monitoring", CHIL-WP4-D4.1-V2.1-2004-07-08-CO, CHIL Consortium Deliverable D4.1, July 2004
8. J. Platt et al., "Large Margin DAGs for Multiclass Classification", Proc. Advances in Neural Information Processing Systems 12, pp. 547-553, 2000
9. A. Temko, C. Nadeu, "Classification of meeting-room acoustic events with Support Vector Machines and Confusion-based Clustering", Proc. ICASSP'05, pp. 505-508, 2005
10. C. Nadeu et al., "On the decorrelation of filter-bank energies in speech recognition", Proc. Eurospeech'95, pp. 1381-1384, 1995
11. M. Reyes-Gomez, D. Ellis, "Selection, Parameter Estimation, and Discriminative Training of Hidden Markov Models for General Audio Modeling", Proc. ICME'03, 2003
12. C. Leggetter, P. Woodland, "Speaker Adaptation of Continuous Density HMMs using Multivariate Regression", Proc. ICSLP'94, 1994
13. M. Gales, "Maximum Likelihood Linear Transformations for HMM-based Speech Recognition", Computer Speech and Language, 1998
The CLEAR 2006 CMU Acoustic Environment Classification System

Robert G. Malkin

interACT, Carnegie Mellon University, Pittsburgh, PA 15213, USA
[email protected] http://www.is.cs.cmu.edu
Abstract. We describe the CLEAR 2006 acoustic environment classification evaluation and the CMU system used in the evaluation. Environment classification is a critical technology for the CHIL Connector service [1] in that Connector relies on maintaining awareness of user state to make intelligent decisions about the optimal times, places, and methods to deal with requests for human-to-human communication. Environment is an important aspect of user state with respect to this problem; humans may be more or less able to deal with voice or text communications depending on whether they are, for instance, in an office, a car, a cafe, or a cinema. We unfortunately cannot rely on the availability of the full CHIL sensor suite when users are not in the CHIL room; hence, we are motivated to explore the use of the only sensor which is reliably available on every mobile communication device: the microphone.
1 Introduction
User state has a large effect on what kinds of services need to be provided by context-aware computational systems, and also on how they are provided. In the CHIL room [2], many auditory and visual processing systems [3] are being developed to ensure that user state is properly modeled and that this information is used to deliver the correct services in the correct manner at the correct time. The Connector service [1] is an important aspect of the CHIL project; its goal is to provide an intelligent policy that will take into account user state, user relationships, and available methods of communication in order to optimize human-to-human communications. One very important aspect of user state with respect to communications is environment; users often have very different communications preferences depending on whether they are, for example, in a car, office, cafe, or cinema. However, a sensor gap exists precisely when users are outside the CHIL smartroom; multi-camera, multi-microphone sensor setups are unavailable when users are in arbitrary environments. As we cannot always rely on a full sensory suite in the Connector scenario, the device needs to be able to detect user environment using the only source of information that is reliably available on every mobile communications device: the audio signal. This paper describes the effort to implement and evaluate an auditory environment classification system for the Connector service. We first describe the task and database
in Sec. 2. We then describe the two approaches used at CMU for acoustic environment classification in Sec. 3. Experimental results can be found in Sec. 4 and conclusions in Sec. 5.
2 The Acoustic Environment Classification Task
The CHIL acoustic environment classification task involves identifying, from a single acoustic signal in isolation, a broadly-defined environmental class. This task stands in contrast to locale recognition, in which we attempt to identify a specific location. We are instead interested in a much more general problem whose solution can be used in many situations. Toward the goal of developing and evaluating such general purpose environment classification systems, we defined a set of classes and collected data from these classes as described in Sec. 2.1 below. We designed a balanced corpus of five-second audio segments for this evaluation. The evaluation criterion was simply the misclassification rate over all segments in the test set.

2.1 Database
The database we collected for this evaluation consists of approximately 20 hours of audio data recorded in 14 different kinds of environments in ten different countries on four continents.1 The data were recorded in ten-minute chunks using a Sony minidisc recorder with a Sony ECM-717 stereo microphone, and converted to mono 16-bit, 16 kHz raw format. From this database, we selected nine environments to study. These environments are airport, bus, gallery, park, plaza, restaurant, street, train, and train platform. These environments were selected to be representative of the environments encountered by typical business travelers and hence potentially relevant to the Connector platform. Most are self-explanatory. Gallery refers to any crowded indoor space not covered by the other environments; e.g., a mall. Plaza refers to any crowded outdoor space not covered by the other environments; e.g., a city square with no significant vehicle traffic. Train platform refers to the actual area with train tracks, where passengers board and disembark from subway cars or high-speed trains. Train refers to subways, high-speed trains, and street trolleys. From each environment, we selected seven recordings at random, for a total of 10.5 hours of data. We divided these recordings into 7,560 5-second segments and split them into two pools at random. The first pool consisted of 6 recordings from each environment, and is referred to as the seen pool. The second pool consisted of 1 recording from each environment, and is referred to as the unseen pool. The unseen recording from each environment was assigned to a development set, along with 120 segments from the seen pool. The training set was made up of the remaining segments in the seen pool. The training set thus contained 5,400 total 5-second segments, and the development set 2,160 segments. Additionally,
1 Thanks to Kornel Laskowski for creating this database during his travels in 2004 and 2005.
108 segments from the development set — 54 seen and 54 unseen, for a total of 12 per environment — were used to evaluate human performance.
3 The CMU Environment Classification Systems
For this evaluation, we explored two different approaches to environment classification: an optimal coding approach, and a generative modeling approach. The optimal coding approach involves the construction of one autoencoder for each environment; during testing, each segment is exposed to each autoencoder, and the environment whose autoencoder most faithfully reconstructs the input signal is taken to be the hypothesis. The generative modeling approach involves the construction of a continuous-density hidden Markov model (HMM) for each environment; during testing, the environment whose HMM yields the best log likelihood for a given segment is taken to be the hypothesis.

As in our previous work [4], we implemented both of these approaches for this evaluation using Mel-frequency cepstral coefficients (MFCCs) computed at 100 frames per second. Further, we added three perceptually-motivated features which, intuitively, should be helpful for environment classification. These features were spectral energy centroid, or brightness, signal-to-noise ratio, and spectral energy diffusion. All of these features describe certain gross characteristics of the sound field that in some way help us to discriminate between environmental types. In preliminary experiments, we compared 14-dimensional MFCCs to 11-dimensional MFCCs plus the three perceptual features; we found that unaugmented MFCCs yielded better performance in the optimal coding systems, while augmented MFCCs yielded better performance in the HMM systems. We now describe each modeling approach before moving on to evaluation results.
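As an illustration of one of these perceptual features, the sketch below computes a per-frame spectral energy centroid ("brightness"). The exact definitions used in the CMU system are not given here, so this is only one plausible formulation (librosa assumed; librosa's built-in spectral_centroid computes essentially the same quantity).

```python
import numpy as np
import librosa

def brightness(y, sr=16000, n_fft=400, hop=160):
    """Spectral energy centroid (in Hz) of each analysis frame, at 100 frames/sec."""
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2   # power spectrogram
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)[:, None]    # bin centre frequencies
    return (freqs * S).sum(axis=0) / (S.sum(axis=0) + 1e-10)
```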
3.1 Optimal Coding
The optimal coding approach to acoustic environment modeling rests on the perceptual fact that the optimal code for a suite of signals depends only on the statistics of that suite of signals [5], [6], [7]. If one suite of signals varies significantly from some other suite of signals, then the optimal codes for those suites will also be different. This difference can be used to discriminate between textures in the following way. Imagine that we have a suite of N -dimensional feature vectors I drawn from a single sound field. We can derive an optimal coding matrix Wc from I using principal component analysis (PCA), independent component analysis (ICA), or some numeric approximation thereof; e.g., an autoencoding multi-layer perceptron (MLP). No matter how Wc is estimated, applying Wc to I results in a coded matrix H whose elements are decorrelated or independent. If we produce a decoding matrix Wd and apply it to H, the result is the matrix O, which, if the coding and decoding procedure retained all the relevant information, is the same as I. If O is close to I, then Wc is a good code. The optimal coding matrix Wc
should be different for each class of sound field; hence the difference between O and I, or the coding error, denoted Δ(I, O), can serve as a discriminator between classes. The coding and decoding process is denoted:

$$H = I W_c, \qquad (1)$$
$$O = H W_d, \qquad (2)$$
$$\Delta(I, O) = \frac{1}{N}\sum_{i=1}^{N} \lVert O_i - I_i \rVert. \qquad (3)$$
The decoding matrix is derived from the input matrix and the coding matrix in the following way:

$$W_d = (H^T H)^{-1} H^T I \qquad (4)$$
$$\phantom{W_d} = \bigl((I W_c)^T (I W_c)\bigr)^{-1} (I W_c)^T I. \qquad (5)$$
If we have two sound field classes α and β, defined by coding and decoding matrices W_c^α, W_c^β, W_d^α, and W_d^β, then for data I drawn from class α, we expect that Δ(I, O^α) < Δ(I, O^β). In our approach, we use a numeric approximation to PCA which uses a 3-layer linear autoencoding MLP that is trained to reproduce the input signal using a limited number of hidden units. The MLP approach has a critical advantage over analytical PCA for classification tasks; namely, the MLP solution does not necessarily converge to the exact PCA solution. Instead, the MLP solution merely spans the same subspace as the PCA solution; that is, it is only identical up to a rotation. This means that the MLP solution can find some specific rotation [8] that leads to better classification performance than analytical PCA. One disadvantage of the MLP approach is that the number of hidden units, and hence parameters, is constrained to be less than the dimensionality of the input; otherwise, the MLP will simply learn the identity function. Analytical methods exist, e.g., overcomplete ICA [9], which do not suffer from this constraint.

The numeric autoencoder approach can be modified to allow for more parameters by constructing for each class a tree of MLPs instead of a single MLP. In the autoencoder tree approach, we build a binary tree of MLPs up to some given depth, and initialize them at random. Only the root node is trained using all available data. After the root node is trained, each training frame is assigned to either the left subtree or the right subtree depending on which MLP is initially a better code for that frame. This process is iterated until the leaf nodes are trained. In this way, we can construct a model with as many parameters as can be justified by the number of samples available.
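The following sketch illustrates the classification-by-coding-error idea of Eqs. (1)-(5). For brevity it derives W_c analytically with a truncated SVD rather than with the autoencoding MLP (or MLP tree) actually used; all names are illustrative.

```python
import numpy as np

def fit_coder(I, k):
    """k-dimensional coding matrix Wc for a (frames x N) feature matrix I, via truncated SVD."""
    _, _, vt = np.linalg.svd(I, full_matrices=False)
    return vt[:k].T                                      # N x k

def coding_error(I, Wc):
    H = I @ Wc                                           # Eq. (1): code
    Wd = np.linalg.lstsq(H, I, rcond=None)[0]            # Eqs. (4)/(5): least-squares decoder
    O = H @ Wd                                           # Eq. (2): reconstruction
    return np.mean(np.linalg.norm(O - I, axis=1))        # Eq. (3): mean coding error

def classify(I, coders):
    """coders: dict mapping class label -> Wc. Return the class with the smallest coding error."""
    return min(coders, key=lambda c: coding_error(I, coders[c]))
```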
3.2 Generative Modeling
The generative modeling approach involves training a single continuous-density HMM for each environmental class. As it is unclear a priori what kind of HMM
topology is appropriate for the kind of low-resolution modeling problem we describe here, topology estimation is an important aspect of the modeling problem. In this evaluation, we used the k-variable k-means (KVKM) algorithm as described by Reyes-Gomez and Ellis in [10]. KVKM is a greedy clustering approach based on the leader-follower paradigm; loosely, a single state is initialized with a single frame, and all other frames are iteratively assigned to an existing state if close enough in Euclidean space, or used as the basis for a new state if far enough away in Euclidean space from all other models. In this evaluation, we required 100 samples per parameter; hence, each state needed to have at least 2,800 samples (enough to estimate the means and variances of a single 14-dimensional diagonal-covariance Gaussian). In addition, we assigned to each state the maximum number of Gaussians allowed given the 100 samples per parameter constraint. We trained several different sets of systems in this fashion, varying the parameter which controls how many states are produced, and carried out our evaluations using the best such system.
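A loose sketch of the leader-follower idea behind KVKM (our simplification, not the algorithm of Reyes-Gomez and Ellis [10]): each frame either updates its nearest state centroid or, if it lies farther than a threshold from all of them, seeds a new state.

```python
import numpy as np

def leader_follower_states(frames, threshold, lr=0.05):
    """frames: (T, D) array of feature vectors. Returns a list of state centroids."""
    states = [frames[0].astype(float).copy()]
    for x in frames[1:]:
        d = [np.linalg.norm(x - s) for s in states]
        j = int(np.argmin(d))
        if d[j] <= threshold:
            states[j] += lr * (x - states[j])        # follower: nudge the winning centroid
        else:
            states.append(x.astype(float).copy())    # leader: create a new state
    return states

# The number of states (and hence Gaussians and parameters) grows as the
# threshold shrinks, which is the control parameter varied in the experiments.
```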
4 Evaluation Results
Our experiments indicated that the optimal HMM configuration resulted from using a control parameter of 1.0 for the k-variable k-means algorithm. The resulting system had a total of 187 HMM states and 850 Gaussians, for a total of over 28,000 parameters. We also found that the best autoencoder tree performance was achieved by using trees with 6 hidden units and a tree depth of 7, which resulted in a total of 64 leaf-level autoencoders per class, for a total of over 96,000 parameters.

Experimental results showed that the HMM approach yielded better performance on the evaluation data than the optimal coding approach. The results are summarized in Tab. 1. Note that all systems perform significantly better on segments drawn from recordings used for training than on segments which were held out. This result underscores the fact that many environments exhibit significant variability, making this task very difficult in the general case. Even so, the best system yielded an error rate on unseen data that is much better than chance, meaning that in practice unsupervised adaptation to new locales should be possible.

Table 1. Evaluation Results: HMMs vs. Autoencoder Trees

System                Params per Class   Total Error   Seen Error   Heldout Error
HMM                         3180            15.4%          5.4%         25.4%
Single Autoencoder           168            34.7%         26.4%         43.1%
Depth-2 Autoencoder          336            37.6%         26.4%         48.8%
Depth-3 Autoencoder          672            36.6%         25.5%         47.8%
Depth-4 Autoencoder         1344            37.1%         23.8%         50.4%
Depth-5 Autoencoder         2688            32.6%         21.2%         44.1%
Depth-6 Autoencoder         5376            32.1%         21.0%         43.1%
Depth-7 Autoencoder        10752            30.1%         18.9%         41.4%
4.1 Comparison to Human Performance
Establishing a human performance level in this evaluation serves two purposes. First, human performance serves as a benchmark for machine listening algorithms; automatic listening systems should be able to approach, if not exceed, human performance. Indeed, given the degree to which humans are unaccustomed to listening as a primary means of sensory awareness, we would expect a reasonable automatic listening system to perform better than the average sighted human on the environment recognition task. Second, we can examine the types of errors that humans make, and, if systematic errors are found, we can compare these to the types of errors made by automatic listening systems to gain insight into the differences between human and machine perception.

To perform this test, we selected 12 segments at random for each environment from the development evaluation set. We ensured that 6 of the 12 were from the seen condition and 6 from the heldout condition. We presented human subjects with these 108 recordings in random order and asked them to select which of the 9 possible environments best matched each recording. The subjects were not told that the corpus was balanced, nor were they told how many different recordings were present for any particular environment. They were also not given access to their own answers. A summary of human performance over all subjects is given in Tab. 2. As expected, performance was on average poor, with even the best-performing subjects giving an incorrect answer nearly 3/4 of the time. The average performance of 74% error was only about 14% better than chance.

Table 2. Human Performance on Environment Recognition Task

Subject    Total Error   Seen Error   Heldout Error
1             72.2%         70.3%         74.0%
2             75.0%         75.9%         74.0%
3             70.3%         75.9%         64.8%
4             75.9%         66.6%         85.1%
5             76.8%         81.4%         72.2%
6             70.3%         72.2%         68.5%
7             73.1%         77.7%         68.5%
8             76.8%         79.6%         74.0%
9             76.8%         74.0%         79.6%
10            73.7%         74.4%         68.5%
Average       73.7%         74.4%         72.9%
We evaluated both the HMM and Depth-6 Autoencoder systems on the human subset of the evaluation data. Results, shown in Tab. 3, indicate that the subset of the evaluation data chosen for human evaluation was actually much more difficult than the remainder of the evaluation data; HMM performance is nearly 15% worse than on the evaluation set as a whole. Nonetheless, the best system's performance is still 60% better, in relative terms, than human performance.
Table 3. System Performance on Human-Evaluated Subset

System                 Total Error   Seen Error   Heldout Error
Humans                    73.7%         74.4%         72.9%
HMM                       29.6%         22.2%         37.0%
Depth-6 Autoencoder       44.4%         29.6%         59.2%
Top confusions on the human subset are shown in Tab. 4; human confusion counts are summed across all subjects and are hence much higher than the machine confusion counts. Many of the human confusions shown are quite intuitive: one would expect bus and train, for instance, to be highly confusable classes. The confusions between park / plaza and street / plaza are also understandable in some sense; a quiet plaza might contain sound elements like birds chirping, which is normally taken to be a reliable feature of the park environment. Conversely, a loud plaza might contain sound elements much like those found in a city street: babble noise, and perhaps even some vehicle noise. Both machine systems seemed to confuse airport and street. This confusion is somewhat puzzling on its face. It is possible, however, that certain areas of airports contain sound elements similar to street noise. The baggage claim area, for instance, might contain both human-produced babble noise and mechanical sounds such as conveyor belts. In any case, the top confusions are similar across the machine systems and do include the bus / train confusion so common in human subjects.

Table 4. Top Confusions on Human-Evaluated Subset

Human                            HMM                              Depth-6 Autoencoder Tree
Reference  Hypothesis  Count     Reference  Hypothesis  Count     Reference  Hypothesis  Count
plaza      park        22        street     airport     3        train      bus         5
street     plaza       21        train      bus         3        platform   gallery     5
bus        train       19        airport    street      3        airport    street      5
train      bus         18        park       street      2        street     airport     4
gallery    street      18        airport    gallery     2        plaza      gallery     4
5 Conclusions
We have presented both the evaluation regime and specific system results for the 2006 CLEAR acoustic environment classification task. We further evaluated humans on this task using a subset of the test set, and found that, in general, machine environment classification systems performed much better than human listeners. Further, we found that HMM-based systems perform better than systems based on optimal coding, even those with many more parameters. The overall misclassification rate of 15.4%, with 5.4% on segments drawn from locales seen in training vs. 25.4% on segments drawn from locales not seen in training indicates that automatic environment classification could in principle be
used in a Connector-like system. Performance of these systems would doubtless be improved by collection of more data, as well as a larger pool of researchers working on this problem. One issue that needs to be addressed, though, is privacy. Some of the recordings contain understandable speech, and though it is not possible in general to identify the speakers, steps to ensure privacy must be taken in order to share data and hence to compare results. Suggestions made by Ellis and Lee [11] include labeling and scrambling segments containing intelligible speech in such a way that the words are no longer understandable, but it is still clear that speech is present; this seems to be the optimal manner in which to proceed in order to evaluate these technologies across multiple sites.
References
1. M. Danninger, G. Flaherty, R. Malkin, R. Stiefelhagen, and A. Waibel, "The Connector: facilitating context-aware communication," in Proceedings of the International Conference on Multimodal Interfaces, 2005.
2. A. Waibel, H. Steusloff, R. Stiefelhagen, and the CHIL Project Consortium, "CHIL: Computers in the human interaction loop," in Proceedings of the International Workshop on Image Analysis for Multimedia Interactive Services, 2004.
3. J. Casas, R. Stiefelhagen, et al., "Multi-camera / multi-microphone system design for continuous room monitoring," CHIL Consortium Deliverable D4.1, CHIL-WP4-D4.1, The CHIL Consortium, 2004.
4. R. Malkin and A. Waibel, "Classifying user environment for mobile applications using linear autoencoding of ambient audio," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 2005.
5. H.B. Barlow, "Possible principles underlying the transformation of sensory messages," in Sensory Communication, W.A. Rosenbluth, Ed. MIT Press, 1961.
6. J.J. Atick, "Could information theory provide an ecological theory of sensory processing?," Network: Computation in Neural Systems, 1992.
7. P. Smaragdis, Redundancy Reduction for Computational Audition, a Unifying Approach, Ph.D. thesis, MIT, 2001.
8. R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley and Sons, 2001.
9. M.S. Lewicki and T.J. Sejnowski, "Learning overcomplete representations," Neural Computation, 2000.
10. M. Reyes-Gomez and D. Ellis, "Selection, parameter estimation, and discriminative training of hidden Markov models for general audio modeling," in Proceedings of the International Conference on Multimedia and Expo, 2003.
11. D. Ellis and K.S. Lee, "Minimal-impact audio-based personal archives," in First ACM Workshop on Continuous Archiving and Recording of Personal Experiences, 2004.
2D Multi-person Tracking: A Comparative Study in AMI Meetings

Kevin Smith1, Sascha Schreiber2, Igor Potůček3, Vítězslav Beran3, Gerhard Rigoll2, and Daniel Gatica-Perez1

1 IDIAP Research Institute, Switzerland
2 Technische Universität München (TUM), Germany
3 Brno University of Technology (BUT), Czech Republic
Abstract. In this paper1 , we present the findings of the Augmented Multiparty Interaction (AMI) project investigation on the localization and tracking of 2D head positions in meetings. The focus of the study was to test and evaluate various multi-person tracking methods developed in the project using a standardized data set and evaluation methodology.
1 Introduction
One of the fundamental goals of the AMI project is to formally and consistently evaluate tracking methods developed by AMI members using a standardized data set and evaluation methodology. In a meeting room context, these tracking methods must be robust to real-world conditions such as variation in person appearance and pose, unrestricted motion, changing lighting conditions, and the presence of multiple self-occluding objects. In this paper, we present an evaluation methodology for gauging the effectiveness of various 2D multi-person head tracking methods and provide an evaluation of four tracking methods developed under the AMI framework in the context of a meeting room scenario. The rest of this paper is organized as follows: Section 2 describes the method of evaluation, Section 3 briefly describes the tracking methods, Section 4 presents the results of the evaluation, and Section 5 provides some concluding remarks.
2 Evaluation Methodology
To objectively compare the tracking methods, a common data set was agreed upon (Sec. 2.1) and an evaluation procedure [13] was adopted (Sec. 2.2).

2.1 Data Set
Testing was done using the AV16.7.ami corpus, which was specifically collected to evaluate localization and tracking algorithms.2 The corpus consists of 16
1 This paper originally appeared with minor changes in Proceedings of Multimodal Interaction and Related Machine Learning Algorithms (MLMI) 2006.
2 We are thankful to Bastien Crettol for his support with the collection, annotation, and distribution of the AV16.7.ami corpus, and to the participants for their time.
Fig. 1. Examples from seq14 of the AV16.7.ami data corpus. Left: Typical meeting room data with four participants (free to stand, sit, walk). Center: Participant heads near the camera are not fully visible and often move in and out of the scene. Right: The data set also contained challenging situations such as this (four heads appear and are annotated in this image).
sequences recorded from two camera angles in a meeting room using four actors. Seven sequences were designated as the training set, and nine sequences for testing. The sequences depict up to four people performing common meeting actions such as sitting down, discussing around a table, etc. (see Figure 1). Participants acted according to different predefined agendas for each scene (they were told the order in which to enter the room, sit, or pass each other), but the behavior of the subjects was otherwise natural. The sequences contain many challenging phenomena for tracking methods, including occlusion, cameras blocked by passing people, partial views of backs of heads, and large variations in head size (see Table 1). The corpus was annotated using bounding boxes for head location, for use in training and evaluation [3]. Annotators were instructed to fit the bounding boxes around the perimeters of the participants' heads, which were ambiguous in some cases. To reduce annotation time, every 25th frame was annotated (evaluations were performed only on annotated frames).

2.2 Measures and Procedure
Table 1. Challenges in the AV16.7.ami data corpus test set (yes = y, no = n); values are given as left / right camera

Sequence   Duration (sec)   Total # heads   Frontal heads   Rear heads   Occlusion   Camera blocked   Sit down
seq01            63              1 / 1           1 / 1          1 / 1       n / n         y / y          n / n
seq02            48              1 / 1           1 / 1          1 / 1       n / n         y / y          n / n
seq03           208              1 / 1           1 / 1          1 / 1       n / n         n / n          y / y
seq08            99              2 / 2           2 / 0          0 / 2       y / n         y / y          y / y
seq09            70              2 / 2           2 / 0          0 / 2       y / y         n / y          n / n
seq12           103              3 / 3           3 / 0          0 / 3       y / y         n / y          y / y
seq13            94              3 / 3           3 / 0          0 / 3       y / y         n / y          y / y
seq14           118              4 / 4           2 / 2          2 / 2       y / y         y / y          y / y
seq16            89              4 / 4           4 / 2          4 / 4       y / n         y / y          n / n

In [13], the task of evaluating tracker performance was broken into evaluating three tasks: fitting ground truth persons (or GTs) with tight bounding boxes (referred to as spatial fitting), predicting the correct number and placement of
people in the scene (referred to as configuration), and checking the consistency with which each tracking result (or estimate, E) assigns identities to a GT over its lifetime (referred to as identification). Several measures are defined to evaluate these tasks, each dependent on the fundamental coverage test. The tasks measured in [13] are similar in many ways to those in [7], but the methods for measuring differ in a fundamental way: the mapping of Es and GTs. Measures in [7] are computed using a one-to-one mapping, whereas [13] defines measures using many-to-one w.r.t. Es and many-to-one w.r.t. GTs. We believe the latter to be a superior method, since situations can arise where there is no clearly correct one-to-one mapping between the Es and GTs.

2.2.1 Coverage Test

The coverage test determines if a GT is being tracked by an E, if an E is tracking a GT, and reports the quality of the tracking result. For a given tracking estimate Ei and ground truth GTj, the coverage test measures the overlap between the two areas using the fitting F-Measure Fi,j [11]:
$$F_{i,j} = \frac{2\,\alpha_{i,j}\,\beta_{i,j}}{\alpha_{i,j} + \beta_{i,j}}, \qquad \alpha_{i,j} = \frac{|E_i \cap GT_j|}{|GT_j|}, \qquad \beta_{i,j} = \frac{|E_i \cap GT_j|}{|E_i|} \qquad (1)$$
where recall (α) and precision (β) are well-known information retrieval measures. If the overlap passes a fixed coverage threshold (Fi,j ≥ tc, with tc = 0.33), then it is determined that Ei is tracking GTj and GTj is tracked by Ei.

2.2.2 Configuration

In this context, configuration means the number, the location, and the size of all people in a frame. A tracking result is considered to be correctly configured if and only if exactly one Ei is tracking each GTj. Four types of errors may occur, which correspond to the four configuration measures:
– FN - False negative. A GT which is not tracked by an E.
– FP - False positive. An E exists which is not tracking a GT.
– MT - Multiple trackers. More than one E is tracking a single GT. An MT error is assigned for each excess E.
– MO - Multiple objects. An E is tracking multiple GTs. An MO error is assigned for each excess GT.
An example of each error type is depicted in Fig. 2, where the GTs are marked with green colored boxes, the Es with red and blue. One can also measure the difference between the number of GTs and the number of Es:
– CD - Counting distance. For a given frame, the difference between the number of Es (N_E^t) and GTs (N_GT^t), normalized by the number of GTs (N_GT^t):
$$CD = \frac{N_E^t - N_{GT}^t}{\max(N_{GT}^t,\,1)} \qquad (2)$$
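The per-frame bookkeeping behind the coverage test and the configuration measures can be sketched as follows (bounding boxes as (x1, y1, x2, y2) tuples and the helper names are our assumptions, not the evaluation toolkit itself):

```python
def area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def fitting_f(e, gt):
    """Fitting F-measure of Eq. (1) for one estimate box e and one ground-truth box gt."""
    inter = area((max(e[0], gt[0]), max(e[1], gt[1]), min(e[2], gt[2]), min(e[3], gt[3])))
    alpha = inter / (area(gt) + 1e-12)                  # recall
    beta = inter / (area(e) + 1e-12)                    # precision
    return 2 * alpha * beta / (alpha + beta + 1e-12)

def frame_config_errors(estimates, ground_truths, tc=0.33):
    """Per-frame FN, FP, MT, MO and CD (Eq. (2)) from the many-to-one coverage maps."""
    covers = [[fitting_f(e, g) >= tc for g in ground_truths] for e in estimates]
    fn = sum(not any(c[j] for c in covers) for j in range(len(ground_truths)))
    fp = sum(not any(row) for row in covers)
    mt = sum(max(0, sum(c[j] for c in covers) - 1) for j in range(len(ground_truths)))
    mo = sum(max(0, sum(row) - 1) for row in covers)
    cd = (len(estimates) - len(ground_truths)) / max(len(ground_truths), 1)
    return fn, fp, mt, mo, cd
```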
2.2.3 Identification

In the context of this evaluation, identification implies the persistent tracking of a GT by a particular E over time. Though several methods to associate identities
Fig. 2. The four types of configuration errors: false negative (FN), false positive (FP), multiple trackers (MT), and multiple objects (MO). GTs are represented by green boxes, Es by red and blue boxes.
exist, we adopt an approach based on a majority rule [13]. A GTj is said to be identified by the Ei which passes the coverage test for the majority of GTj's lifetime, and similarly Ei is said to identify the GTj which passes the coverage test for the majority of Ei's lifetime (this implies that associations between GTs and Es will not necessarily match). Two types of identification failures can arise, quantified by five measures:
– FIT - Falsely identified tracker. Occurs when an Ek which passed the coverage test for GTj is not the identifying tracker, Ei. FITs often result when Ei suddenly stops tracking GTj and another Ek continues tracking GTj.
– FIO - Falsely identified object. Occurs when a GTk which passed the coverage test for Ei is not the identified person, GTj. FIOs often result from swapping GTs, i.e. Ei initially tracks GTj and subsequently tracks GTk.
– OP - Object purity. If GTj is identified by Ei, then OP is the ratio of frames in which GTj and Ei passed the coverage test (n_{i,j}) to the overall number of frames GTj exists (n_j).
– TP - Tracker purity. If Ei identifies GTj, then TP is the ratio of frames in which GTj and Ei passed the coverage test (n_{j,i}) to the overall number of frames Ei exists (n_i).
– Identity F-Measure. Combines OP and TP using the F-measure such that if either component is low, the identity F-Measure is low: identity F-Measure = 2 · OP · TP / (OP + TP).

2.2.4 Procedure

To evaluate the ability of each tracking method for the tasks of spatial fitting, configuration, and identification over diverse data sets, the following procedure is followed for each sequence:

Evaluation procedure for a data sequence.
1. For each frame in the sequence:
   - determine tracking maps by applying the coverage test over all combinations of Es and GTs;
   - record the configuration measures (FN, FP, MT, MO, CD) and the fitting F-Measure from the tracking maps.
2. Determine identity maps for tracked Es and GTs using the majority rule.
3. For each frame in the sequence:
   - record the identification errors (FIT, FIO) from the identity maps.
4. Normalize the configuration and identification errors and compute the purity measures for the entire sequence (the instantaneous numbers of ground truths and estimates are N_GT^t and N_E^t respectively, and the total number of frames is T):

$$FP = \frac{1}{T}\sum_{t=1}^{T}\frac{FP_t}{\max(N_{GT}^t,1)}, \qquad FN = \frac{1}{T}\sum_{t=1}^{T}\frac{FN_t}{\max(N_{GT}^t,1)},$$
$$MT = \frac{1}{T}\sum_{t=1}^{T}\frac{MT_t}{\max(N_{GT}^t,1)}, \qquad MO = \frac{1}{T}\sum_{t=1}^{T}\frac{MO_t}{\max(N_{GT}^t,1)},$$
$$FIT = \frac{1}{T}\sum_{t=1}^{T}\frac{FIT_t}{\max(N_{GT}^t,1)}, \qquad FIO = \frac{1}{T}\sum_{t=1}^{T}\frac{FIO_t}{\max(N_{GT}^t,1)},$$
$$OP = \frac{1}{N_{GT}}\sum_{j=1}^{N_{GT}}\frac{n_{i,j}}{n_j}, \qquad TP = \frac{1}{N_E}\sum_{i=1}^{N_E}\frac{n_{j,i}}{n_i}, \qquad CD = \frac{1}{T}\sum_{t=1}^{T}|CD_t|$$
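A compact sketch of steps 2 and 4 (the data-structure names are assumed): identity maps are taken as a majority vote over the frames in which a pair passes the coverage test, and per-frame error counts are averaged with the max(N_GT, 1) normalization above.

```python
from collections import Counter, defaultdict

def identity_maps(cover):
    """cover: per-frame iterables of (e_id, gt_id) pairs that pass the coverage test."""
    per_gt, per_e = defaultdict(Counter), defaultdict(Counter)
    for pairs in cover:
        for e_id, gt_id in pairs:
            per_gt[gt_id][e_id] += 1
            per_e[e_id][gt_id] += 1
    gt_identified_by = {g: c.most_common(1)[0][0] for g, c in per_gt.items()}
    e_identifies = {e: c.most_common(1)[0][0] for e, c in per_e.items()}
    return gt_identified_by, e_identifies

def normalize(error_counts, n_gt_per_frame):
    """Average a per-frame error count, dividing each frame's count by max(N_GT, 1)."""
    T = len(error_counts)
    return sum(c / max(n, 1) for c, n in zip(error_counts, n_gt_per_frame)) / T
```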
Table 2. Properties of the various head tracking approaches

                     Method A             Method B                Method C           Method D
Learned Models       binary, color,       skin color, shape       skin color         face/nonface
                     head shape                                                      weak classifiers
Initialization       automatic            automatic               automatic          automatic
Features             background sub,      motion detection,       background sub,    skin color,
                     silhouette, color    skin color,             skin color,        gabor wavelets
                                          head/shoulder shape     local charact.
Mild Occ.            robust               robust                  robust             robust
Severe Occ.          semi-robust          semi-robust             sensitive          sensitive
Identity Recovery    swap, rebirth        swap, rebirth           rebirth            none
Comp. Exp.           ~1 frame/sec         ~3 frame/sec            ~20 frame/sec      ~0.2 frame/sec
Note that most measures are normalized by NGT and the number of frames (such as F P ). For these measures, the number reported could be thought of as a rate of error. For instance, F P = .25 could be interpreted as: “for a given person, at time t, 0.25 F P errors will be generated on average.”
3 Tracking Methods
Four head tracking methods built within AMI were applied to the data corpus and evaluated as described in Section 2. Each method approached the problem of head tracking differently, and it is noteworthy to list some of the qualitative differences (see Table 2). These methods are described briefly below.
3.1 Method A: Trans-Dimensional MCMC (developed at IDIAP)
Method A uses an approach based on a hybrid Dynamic Bayesian Network that simultaneously infers the number of people in the scene and their locations [12]. The state contains a varying number of interacting person models, each consisting of a head and body model. The person models evolve according to a dynamical model and a Markov Random Field (MRF) based interaction model (to prevent trackers from overlapping). The observation model consists of a set of global binary and color observations as well as individual head silhouette observations (to localize heads). The function of the global binary observation model is to predict the number of people in the scene. Inference is done by trans-dimensional Markov Chain Monte Carlo (MCMC) sampling (because of its ability to add/remove people from the scene and its efficiency).

3.2 Method B: Probabilistic Active Shape (developed at TUM)
Method B uses a double-layered particle filtering (PF) technique [5,6] consisting of a control layer (responsible for the detection of new people and evaluating the person configuration) and a basic layer (responsible for building a local probability distribution for each head). Locations for new people are derived from skin colored regions, which are detected using a normalized rg skin color model. Heads are modeled using a deformable active shape model consisting of 20 landmark points [1,2]. The basic layer PF samples and predicts a set of hypotheses for each person. Using the active shape model, a likelihood for the existence of a head in the image represented by the respective hypothesis can be computed. These sets of hypotheses are passed to the control layer PF, which evaluates and determines the configuration of heads by incorporating skin color validation and the local likelihood to verify the number of people being tracked.
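For orientation, the sketch below shows a generic bootstrap particle-filter step (predict, weight, resample) of the kind underlying the basic layer. It is only an illustration: the likelihood function is a stand-in for the active-shape and skin-color scoring actually used, and the double-layered control logic of Method B is not reproduced here.

```python
import numpy as np

def particle_filter_step(particles, likelihood_fn, motion_std=5.0, rng=None):
    """particles: (N, 2) head-centre hypotheses; returns resampled particles and weights."""
    rng = rng or np.random.default_rng(0)
    # 1. predict: diffuse hypotheses with a simple random-walk motion model
    particles = particles + rng.normal(scale=motion_std, size=particles.shape)
    # 2. weight: score each hypothesis under the observation model
    w = np.array([likelihood_fn(p) for p in particles], dtype=float)
    w = np.maximum(w, 1e-12)
    w /= w.sum()
    # 3. resample: draw particles in proportion to their weights
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```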
3.3 Method C: KLT (developed at BUT)
Method C, proposed in [4], is based on the KLT feature tracker [8]. The method works by searching for potential people through performing background subtraction and skin color detection (using an RG skin color model) on the raw image. Connected component analysis is performed on the segmented image to find patches suitable for head detection. Ellipse-like shapes are then fitted to the patches and define a set of head centers. A KLT tracker, which extracts meaningful image features at multiple resolutions and tracks them by using a Newton-Raphson minimization method to find the most likely position of image features in the next frame, is initialized at each head center. Additionally, a color cue and rules for flocking behavior (alignment, separation, cohesion, and avoidance) are used to refine the tracking.
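The sketch below illustrates only the KLT step, using OpenCV as an assumed stand-in; Method C's own implementation additionally uses multi-resolution features, a color cue, and the flocking rules mentioned above. Features found inside a detected head region are propagated to the next frame with pyramidal Lucas-Kanade.

```python
import cv2
import numpy as np

def track_head_centre(prev_gray, next_gray, head_mask):
    """prev_gray/next_gray: 8-bit grayscale frames; head_mask: uint8 mask of the head region."""
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=50, qualityLevel=0.01,
                                 minDistance=5, mask=head_mask)
    if p0 is None:
        return None                                   # no trackable features in the region
    p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, p0, None,
                                             winSize=(15, 15), maxLevel=3)
    good = p1[status.ravel() == 1]
    return good.reshape(-1, 2).mean(axis=0) if len(good) else None   # updated head centre
```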
3.4 Method D: Face Detector (developed at BUT)
Method D, proposed in [10], is based on skin color segmentation and face detection. A learned skin color model is used to segment the image. Connected
Fig. 3. Results for the three tracking tasks (spatial fitting, configuration, and identification). The fitting F-measure shows the spatial fitting, or tightness of the bounding boxes. The quantity 1 − CD is indicative of the ability of a method to estimate the configuration. The ability of a method to maintain consistent identities is measured by the identity F-Measure. The numbers above each bar represent the mean for the entire data set, and the lines represent the standard deviations.
component analysis and morphological operations on the skin color segmented image are used to propose head locations. Face detection is then applied to the skin color blobs to determine the likelihood of the presence of a face. The face detection is based on the well-known AdaBoost [14] algorithm which uses weak classifiers to classify an image patch as a face or non-face. Method D replaces the simple rectangular image features with more complex Gabor wavelets [9]. The face detector was trained on normalized faces from the CBCL data set (1500 face and 14000 non-face images) and outputs a confidence, which is then thresholded to determine if a face exists. Faces are associated between frames using a proximity association defined on the positions of the detected faces.
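A minimal sketch of that proximity association (the gating distance and names are our assumptions): each detection in the current frame inherits the identity of the nearest face centre from the previous frame if it lies within a gate, and otherwise starts a new track.

```python
import numpy as np

def associate_by_proximity(prev_faces, curr_centres, gate=60.0):
    """prev_faces: dict id -> (x, y); curr_centres: list of (x, y). Returns dict id -> (x, y)."""
    assigned, used = {}, set()
    new_id = max(prev_faces, default=-1) + 1
    for c in curr_centres:
        candidates = [i for i in prev_faces if i not in used]
        if candidates:
            best = min(candidates, key=lambda i: np.hypot(c[0] - prev_faces[i][0],
                                                          c[1] - prev_faces[i][1]))
            d = np.hypot(c[0] - prev_faces[best][0], c[1] - prev_faces[best][1])
        else:
            best, d = None, float("inf")
        if best is not None and d <= gate:
            assigned[best] = c
            used.add(best)
        else:
            assigned[new_id] = c          # unmatched detection starts a new identity
            new_id += 1
    return assigned
```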
4 Evaluation
The four methods were evaluated for their performance at the tasks outlined in Section 2: spatial fitting, configuration, and identification. Methods A and B were tested on 360×288 non-interlaced images; Methods C and D were tested on 720×576 interlaced images after applying an interpolating filter. This discrepancy may affect the relative performance of the methods, but we believe the effect to be minimal. In the following, we present a summary of the overall performance of the tracking methods, followed by a detailed discussion of each task.3

4.1 Overall Performance
The fitting F-Measure is an indicator of the spatial fitting (see Figure 3). Spatial fitting refers to how tightly the E bounding boxes fit the GT. The fitting F-Measure is only computed on correctly tracked people, and a value of one indicates perfectly fit bounding boxes. Lower numbers indicate looser, misaligned, or missized tracking estimates. Results for the fitting F-Measure indicate that Methods A and D performed comparably well at about .60. Methods B and C
3 Example videos and details can be found at http://www.idiap.ch/~smith/
performed at approximately .50. The spatial fitting depends on many aspects of the method, including the features, motion model, and method of inference. Intuition suggests that the boosted Gabor wavelets of Method D and the head silhouette feature of Method A were most precise in this case, but these results cannot be solely attributed to these features without further experiments.

The counting distance CD measures the difference between the number of GTs and Es for a given frame, and gives an imperfect estimation of the configuration performance, i.e. the ability of the method to place the correct number of Es in the correct locations. CD is an imperfect summary because some types of errors, such as FPs and FNs, may cancel in the calculation of CD, but it is still a good indicator. The quantity 1 − CD is reported so that higher numbers indicate better configuration performance (CD ∈ [0, ∞), but in our experiments it ranged from 0 to 1). Methods A and C performed best, at about .73, while Method D performed at .64, and B at .58. An alternative way to measure the overall configuration performance is to sort the methods by rankings of the individual configuration measures (see Section 4.3 and Figure 5). Doing so, we find that Method C performs the best, followed by Method A, Method D, and finally Method B. Though not necessarily so, in this case this result is consistent with the findings of the counting distance.

The identity F-Measure indicates how consistently a method was able to identify the GTs over time; it is a combination of the TP and OP measures. In this case, Method D clearly outperformed the others. This is somewhat surprising, as it uses the simplest procedure for maintaining identity (spatial proximity between frames). More sophisticated methods, such as the models for swapping identities in Methods A and B, are perhaps not suited for this data. On the other hand, because Method D relies on specialized face detection, its superior performance may not generalize to situations in which faces are not the target objects.

4.2 Spatial Fitting
As mentioned in Section 4.1, the fitting F-measure indicates the tightness of the fit of the bounding boxes to the GTs. From Figure 4, it is apparent that certain sequences presented much more of a challenge than others. Figure 4 illustrates the variation of performance on specific pieces of data, something hidden by all-inclusive measures. Typically, fitting F-Measure values were similar for all the trackers at approximately 0.80, but for more challenging sequences such as 08R, 09R, 12R, 13R, and 16R, differences were more pronounced and fitting F-Measure values dipped as low as 0 in one case. Method D was the most spatially robust for the challenging sequences.

4.3 Configuration
Results for the four configuration error types and CD can be found in Figure 5. The measure FN gives an estimation of the number of False Negatives (or undetected person ground truths) per ground truth, per frame. Method C performed the best in this respect, with .26 FNs per person, per frame. This low rate of missed
GTs may be attributed to the KLT tracker's selection of meaningful image features. Method B performed significantly worse, averaging approximately .49 FN, which may be due to difficulties in fitting the contour to the appearance of some heads. FNs were the most prominent type of configuration error among all four tracking methods, usually as a result of an unexpected change in the appearance of a head, partial views, lighting changes, entrances/exits, and size variations and occlusions (sometimes as extreme as in Figure 1).

The measure FP estimates the number of False Positive errors (or extraneous Es) per ground truth, per frame. This was the second most common type of configuration error. Typical causes for FP errors include face-like or skin colored objects in the background (texture or color), shadows, and background motion. Methods A and B were least prone to FP errors, with a rate of 0.08 FPs per person, per frame. Method A's low rate of FP errors can be attributed to the use of a body model, which only adds people when a body is detected (bodies are easier to detect than heads). This was followed by Method C with 0.21, and Method D with 0.23. Method D was particularly sensitive to FP-generating conditions, as the standard deviation was roughly twice the mean, 0.42. Method D's FPs were generated by face-like or skin colored objects in the background and exposed skin on the arms of the participants.

The measure MT estimates the number of Multiple Tracker errors (which occur when several estimates are tracking the same ground truth person). The only method significantly prone to this type of error was Method A. This susceptibility is due to the fact that Method A uses strong priors on the size of the body and head to help the foreground segmented image features localize the head. The priors of Method A are trained using participants in the far field of view, and are not robust to dramatic changes in size. When a participant appears close to the camera, Method A often fits multiple trackers to the larger head area. Methods B, C, and D do not suffer from this effect because they do not enforce constraints on the size of the head so strongly.

The measure MO estimates the number of Multiple Object errors (which occur when one estimate tracks several ground truths) per person, per frame. This type of error generally occurs when a tracker estimate is oversized and expands to cover large areas of the image, or occasionally when people are near one another.
Fig. 4. The fitting F-Measure shows how tightly the estimated bounding boxes fit the ground truth (when passing the coverage test)
Fig. 5. The configuration measures, FN, FP, MT, MO, and CD, normalized over the test set
All four methods tested were robust to this type of error. This robustness can be attributed to the modeling of head objects, interaction models, and motion models built into each of the methods. The counting distance measure CD is described in Section 4.1.

4.4 Identification
Results for the identification measures can be found in Figure 6. The FIO measure estimates the rate of Falsely Identified Object errors (when an E tracks a GTk which is not the GTj that the E identifies). Of the two types of identification errors (FIO and FIT), FIO errors occurred less frequently for all four methods. FIO errors are often generated when an E outlives the GT it is supposed to identify, and the E begins to track another GT, though this was rare in our experiments. The other common mode of failure occurred when Es confused GTs, often as a result of occlusion. This mode of failure was seen most in Methods A and B, with FIO rates of 0.05 and 0.04, respectively. Interestingly, both these methods modeled identity swapping, where Es switch labels in an attempt to maintain identity. Spurious identity swaps could account for the higher FIO rates. Method C was very robust to FIO errors, with a negligible FIO rate. Method D was nearly as robust, with an FIO of 0.01.

The FIT measure reports the rate of Falsely Identified Tracker errors (which occur when a GT person is being tracked by a non-identifying E). There are two typical sources of FIT errors. The first occurs, as with the FIO error, when Es
Fig. 6. The identification measures, FIO, FIT, TP, OP, and identity F-Measure, computed over the test set
swap or confuse GT s. The second error source occurs when several short-lived Es track the same GT s. Both of these sources caused F IT errors in our test set, though it can be expected that F IT contributions from the first error source should roughly match the F IO error rate (and thus, any increase in the F IT over the F IO is caused by short-lived Es). Methods A and D saw the most F IT errors, with F IT rates at 0.13 (0.13 F IT errors are generated per frame, per person). Method D’s F IT errors can be almost exclusively attributed to multiple, short-lived Es tracking the same GT . Method B was the most robust to F IT errors with a rate of 0.04. The T P measure evaluates the consistency with which an E identifies a particular GT . Mis-identified GT s cause F IO errors, but the T P measure gives equal weight to all tracking estimates. Es with a short lifetime will not significantly influence the F IO, and Es with long lifetimes will dominate. Typically, in our experiments, the methods reported a higher T P than OP . This indicates more Es were generated than the number of GT s in the sequence (in a temporal sense), and that they lasted for shorter lifetimes. Method D reported a T P of 0.93, which indicates that nearly all its Es perfectly identified their GT s. However, this does not indicate near-perfect identification. Method D’s OP , 0.46, while on par with the other methods, indicates that the GT s were often tracked by multiple short-lived Es. Method A reported the next highest T P , with a value of 0.68, followed by Method B (0.56) and Method C (0.24). Method C was the only method to report a lower T P than OP . The OP measure evaluates the consistency with which a GT is identified by the same E. Mis-identifying Es can cause F IT errors, but OP gives equal weight to all GT s in the sequence. Short-lived GT s will not significantly affect the F IT , and GT s with a long lifetime will dominate. Method C reported the best OP . 4.5
4.5 Summary and Qualitative Comments
Giving equal weight to the three tracking tasks described in this document (configuration, identification, and spatial fitting) and using a simple ranking system, the best performing tracking method is D, followed by A, C, and B. Method D is the most reliable at identification and exhibits the highest spatial fitting. However, it does have several drawbacks. It is the slowest of the four methods and the most sensitive to occlusion. The face detector is based on skin color detection and is more sensitive to lighting conditions than the other methods. Skin-colored segments of the background pose a problem for the face detector (Method D exhibits the highest FP), and the FN suffers as the detector struggles with non-frontal faces.

Ranked second among the four methods is Method A. Method A was the only method which did not model skin color, and was the only method which modeled the body to help localize the head. The use of a body model had several effects. First, Method A had the lowest FP rate, which can be attributed to the body model preventing spurious head Es. The body model assisted in detecting heads, which kept the FN rate low. However, because of strong size priors on the head and body models, Method A performed poorly when tracking heads near the
camera (resulting in MT errors). Method A was ranked second in spatial fitting and was also ranked second in maintaining identity, though incorrect swapping of E labels may have lowered this performance.

Fig. 7. Results for frames 307, 333, and 357 of sequence 09L from the AV16.7.ami data corpus. Method A (Trans-Dimensional MCMC): body and head results shown; an FP error appears in frame 357. Method B (Probabilistic Active Shape): head results appear as red bounding boxes; two FN errors and an FP error occur in 307, and one FN error occurs in 333. Method C (KLT): head results appear as grey bounding boxes. Method D (Face Detector): results appear as grey bounding boxes; participant arms are mistaken for heads in 307 and 333.
Method C was third overall among the four methods. It was the fastest computationally, and the only one approaching real-time frame rates. Method C had the highest configuration performance, boasting the lowest FN rate and negligible MT and MO errors. This can be attributed to the KLT's selection of meaningful image features. However, Method C performed worst in terms of spatial fitting and identification. The poor spatial fitting might be due to a lack of shape features or of features specialized to the face (as in the face detector). The problems with identification were due to the lack of an explicit way to manage identity among the trackers.

Finally, Method B fell last overall, but ranked third for each of the three tracking tasks. In terms of spatial fitting, Method B was the highest performing method for several of the sequences, but suffered from poor performance on some of the more difficult multi-person sequences (12R, 14R, and 16R). Among the four trackers, Method B was the most robust to partial occlusions. For Method B, identity was maintained by binning the gray values of the face shape. A lack of color information, poor shape adjustment, and a swapping mechanism like that of Method A may have caused identification problems for this method.

From this evaluation, we might draw the following conclusions:
1. Shape-based methods, such as B and C, perform as well as or better than the others at spatial fitting when stable, but are more prone to configuration failures, and less able to recover from such failures.
2. Methods employing background subtraction (such as A and C) seem to have an advantage in estimating the configuration of the scene.
3. Attempts to model identity changes in order to handle difficult tracking scenarios, such as dramatic changes in size and appearance or frequent occlusions, may do more harm than good (as for Methods A and B).
5 Conclusion and Future Work
The AV16.7.ami corpus contains many difficult real-life scenarios which remain challenging for state-of-the-art tracking methods. These results represent the first evaluation of methods for multi-person tracking in meetings using a common data set in the context of the AMI project. Future work might incorporate multimodal information or concentrate on tracking other objects in different scenarios.

Acknowledgements. This work was supported by the Swiss National Center of Competence in Research on Interactive Multimodal Information Management (IM2) and the EC project Augmented Multi-party Interaction (AMI, publication AMI-175).
Head Pose Tracking and Focus of Attention Recognition Algorithms in Meeting Rooms

Sileye O. Ba and Jean-Marc Odobez
IDIAP Research Institute, Martigny, Switzerland
Abstract. The paper presents an evaluation of both head pose and visual focus of attention (VFOA) estimation algorithms in a meeting room environment. Head orientation is estimated using a Rao-Blackwellized mixed-state particle filter to achieve joint head localization and pose estimation. The output of this tracker is exploited in a Hidden Markov Model (HMM) to estimate people's VFOA. Contrary to previous studies on the topic, in our set-up the potential VFOA of people is not restricted to the other meeting participants only, but includes environmental targets (table, slide screen), which renders the task more difficult due to more ambiguity between VFOA target directions. By relying on a corpus of 8 meetings of 8 minutes on average, featuring 4 persons involved in the discussion of statements projected on a slide screen, and for which head orientation ground truth was obtained using magnetic sensor devices, we thoroughly assess the performance of the above algorithms, demonstrating the validity of our approaches and pointing out further research directions.
1 Introduction
The automatic analysis of human interaction constitutes a rich research field. In particular, meetings exemplify the multimodal nature of human communication and the complex patterns that emerge from the interaction between multiple people [6]. Besides, in view of the amount of relevant information in meetings suitable for automatic extraction, meeting analysis has attracted attention in fields spanning computer vision, speech processing, human-computer interaction, and information retrieval [13]. In this view, the tracking of people and of their activity is relevant for high-level multimodal tasks that relate to the communicative goals of meetings. Experimental evidence in social psychology has highlighted the role of non-verbal behavior (e.g., gaze and facial expressions) in interactions [9], and the power of speaker turn patterns to capture information about the behavior of a group and its members [6,9]. Identifying such multimodal behaviors requires reliable people tracking. In the present work, we investigate the estimation of head pose from video, and its use in the inference of the VFOA of people. To this end, we propose two algorithms, one for each task, and the objective is to evaluate how well they perform and how well we can infer the VFOA solely from the head pose.
Fig. 1. Left: meeting room. Right: the set F of potential FOA comprises the other participants, the table, the slide screen, the whiteboard, and an unfocused label when none of the previous applies.
Many methods have been proposed to solve the problem of head tracking and pose estimation. They can be broadly separated into two groups. The first group considers head tracking and pose estimation as two separate and independent problems: the head location is found first, then processed for pose estimation [2,13,10,15,17]. The main advantage is usually fast processing, but head pose estimation then depends highly on the head tracking accuracy. Indeed, it has been shown that head pose estimation is very sensitive to head localization [2]. To address this issue, the second group of methods [3,5,14] considers the head tracking and pose estimation problems jointly, and we follow this approach.

For meeting data, it is often claimed that head pose can reasonably be used as a proxy for gaze (which usually calls for close-up views). In this paper, we evaluate the validity of this assumption by generalizing to more complex situations similar work that has already been conducted in [8,11]. Contrary to these previous works, the scenario we consider involves people looking at slides or writing on the table. As a consequence, in our set-up, people have more potential visual focus of attention targets (6 instead of 3 in [8,11]), leading to more ambiguities between VFOA targets, and the identification of the VFOA can only be done using the complete head pose representation (pan and tilt), instead of just the head pan as done previously. Thus our study reflects more complex, but realistic, meeting room situations in which people do not just focus their attention on the other people but also on other room targets.

In this work, we analyze the recognition of the VFOA of people from their head pose. VFOAs are recognized using either the Maximum A Posteriori (MAP) principle or a Hidden Markov Model (HMM), where in both cases the VFOAs are represented using Gaussian distributions. In our experiments, the head poses are either obtained using a magnetic sensor or estimated by a computer-vision-based probabilistic tracker, allowing us to evaluate the degradation in VFOA recognition when going from true values to estimated ones.

The remainder of this paper is organized as follows. Section 2 describes our database and the protocols used for evaluation. Sections 3 and 4 respectively present our head pose tracking and VFOA recognition algorithms. Results and analysis of the evaluation are reported in Section 5, and Section 6 concludes the paper.
2 Databases and Protocols
In this section, we describe the data and performance measures used to evaluate the head pose estimation algorithms and the VFOA recognition algorithms. In the latter case, the emphasis is on the recognition of a finite set F of specific FOA loci.

2.1 The Database
Our evaluation exploits the IDIAP Head Pose Database (available at http://mmm.idiap.ch/HeadPoseDatabase/). In view of the limitations of visual inspection for evaluation, and the inaccuracy of manually labeling head pose in real videos, we decided to record a video database with head pose ground truth produced by a flock-of-birds device. At the same time, as the database is also annotated with the discrete FOA of participants, we are able to evaluate the impact of using the true vs. an estimated head pose on VFOA recognition.

Content description: the database comprises 8 meetings involving 4 people (durations ranged from 7 to 14 minutes), recorded in IDIAP's smart meeting room. The scenario was to discuss statements displayed on the projection screen. There were restrictions neither on head motions nor on head poses. In each meeting, the head pose of two participants (the left and right persons in Fig. 1) was continuously annotated using 3D magnetic sensors attached to their heads, resulting in a video database of 16 different people.

Head pose annotation: the head pose configuration with respect to the camera was ground-truthed. This pose is defined by three Euler angles (α, β, γ) which parameterize the decomposition of the rotation matrix of the head configuration with respect to the camera frame. Among the possible decompositions, we have selected the one whose rotation axes are rigidly attached to the head to report and comment on the results. With this choice, α denotes the pan angle, a left/right head rotation; β denotes the tilt angle, an up/down head rotation; and finally γ, the roll, represents a left/right "head on shoulder" rotation.

VFOA set and annotation: for each of the two persons ('left' and 'right' in Fig. 1), the set of potential foci is composed of the other participants, the slide screen, the table, and an additional label (unfocused) used when none of the previous applies. As a person cannot focus on himself/herself, the set of foci differs from person to person. For instance, for the left person, we have: F = {right person, organizer1, organizer2, slide screen, table, unfocused}. The annotation guidelines are given in [7].
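For illustration, the sketch below composes a head rotation from the three Euler angles and derives a head pointing direction from it. The rotation order and axis conventions used here are assumptions made for the example and need not match the exact decomposition chosen for the database; note that, with this choice, the pointing direction depends only on pan and tilt.

```python
import numpy as np

def head_rotation(pan_deg, tilt_deg, roll_deg):
    """Compose a rotation matrix from pan (left/right), tilt (up/down) and
    roll ("head on shoulder") angles, in degrees. The order and axes used
    here (roll about z, tilt about x, pan about y) are one possible
    convention, chosen only for illustration."""
    a, b, g = np.radians([pan_deg, tilt_deg, roll_deg])
    Ry = np.array([[np.cos(a), 0, np.sin(a)], [0, 1, 0], [-np.sin(a), 0, np.cos(a)]])
    Rx = np.array([[1, 0, 0], [0, np.cos(b), -np.sin(b)], [0, np.sin(b), np.cos(b)]])
    Rz = np.array([[np.cos(g), -np.sin(g), 0], [np.sin(g), np.cos(g), 0], [0, 0, 1]])
    return Ry @ Rx @ Rz

def pointing_vector(pan_deg, tilt_deg, roll_deg):
    # Assume the head points along the rotated -z axis of the head frame;
    # the roll then cancels out and only pan and tilt matter.
    return head_rotation(pan_deg, tilt_deg, roll_deg) @ np.array([0.0, 0.0, -1.0])

if __name__ == "__main__":
    print(pointing_vector(30, -10, 0))
```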
2.2 Evaluation Protocol
Head pose protocol: Data and protocol. Amongst the 16 recorded people, we used half of the database (8 people) as a training set to learn the pose dynamic model, and the remaining half as a test set to evaluate the tracking algorithms. In addition, from the 8 meetings of the test set, we selected 1 minute of recording
(1500 video frames) for evaluation data. This decision was made to save computation time. Pan values range from -60 to 60 degrees (with a majority of negative values, corresponding to looking at the projection screen), tilt values range from -60 to 15 degrees (due to the camera looking down at people), and roll values from -30 to 30 degrees.

Performance measures: four error measures are used. The first three are the errors in the pan, tilt and roll angles, i.e., the absolute difference between the pan, tilt and roll of the ground truth (GT) and of the tracker estimate. In addition, the angle between the 3D pointing vectors (the vector indicating where the head is pointing, cf. Figure 2) defined by the head pose GT and by the pose estimated by the tracker is used as a fourth error measure. This vector depends only on the head pan and tilt values (given the selected representation). For each error, the mean, standard deviation and median (less sensitive to large errors due to erroneous tracking) values are reported.

VFOA protocol: Data and protocol. Experiments on FOA recognition are done separately for the left and right person (see Fig. 1). Thus, for each seating position, we have 8 sequences. We adopt a leave-one-out protocol, where for each sequence the parameters of the recognizer applied to that sequence are learned on the 7 other sequences.

Performance measures: two different types of measures are used.
- Frame-based recognition rate: this corresponds to the percentage of frames in the video whose estimated FOA matches the ground truth label. To avoid over-emphasizing events that are long (i.e., when someone is continuously focused), we propose below alternative measures that may better reflect how well an algorithm recognizes events, whether long or short, which might be better suited to understanding meeting dynamics and human interaction.
- Event-based recall/precision: we are given two sequences of FOA events: the recognized sequence of FOA, R = {R_i}_{i=1..N_R}, and the ground truth sequence G = {G_j}_{j=1..N_G}. To compare the two sequences, we first apply an adapted string alignment procedure that accounts for time overlap to match events in the GT and R. Given this alignment, we can then compute for each event type l in F the recall ρ, precision π, and F measure of that event, defined as:

\[
\forall l \in \mathcal{F}: \quad \rho(l) = \frac{N_{mat}(l)}{N_G(l)}, \qquad \pi(l) = \frac{N_{mat}(l)}{N_R(l)}, \qquad \frac{1}{F_{meas}(l)} = \frac{1}{2}\left( \frac{1}{\rho(l)} + \frac{1}{\pi(l)} \right) \qquad (1)
\]

where N_mat(l) represents the number of events l in the recognized sequence that match the same event type in the ground truth after the alignment, N_R(l) denotes the number of occurrences of event l in the recognized sequence, and N_G(l) denotes the number of occurrences of l in the ground truth. Qualitatively, the recall of l indicates the percentage of correctly recognized true looks at FOA l, while the precision indicates the percentage of recognized looks at l that indeed correspond to the ground truth. The F measure, defined as the harmonic mean of precision and recall, represents a composite value (often, increasing the recall tends to decrease the precision, and vice versa). Finally, performance measures for the whole database are obtained by averaging the recall, precision and F measures first over event types per person, then over individuals.
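A minimal sketch of Equation (1), assuming the event alignment has already been performed and that, for each event type, the matched count and the counts in R and G are available. The input format is hypothetical; only the formulas follow the text.

```python
def event_scores(n_match, n_gt, n_rec):
    """Recall, precision and F-measure for a single event type l, following
    Eq. (1): rho = N_mat/N_G, pi = N_mat/N_R, F = harmonic mean of rho and pi."""
    rho = n_match / n_gt if n_gt else 0.0
    pi = n_match / n_rec if n_rec else 0.0
    f = 2 * rho * pi / (rho + pi) if (rho + pi) else 0.0
    return rho, pi, f

def average_over_events(counts):
    """counts: dict event_label -> (N_mat, N_G, N_R). Averages the three
    scores over event types for one person, as described in the text."""
    scores = [event_scores(*c) for c in counts.values()]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))

if __name__ == "__main__":
    counts = {"slide_screen": (8, 10, 12), "table": (3, 5, 4)}
    print(average_over_events(counts))
```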
3 Head Pose Tracking
To address the tracking problem, we formulate the coupled problems of head tracking and head pose estimation in a Bayesian filtering framework, which is then solved through sampling techniques. In this section, we expose the main points of our approach; more details can be found in [1].

3.1 Head Pose Models
We use the Pointing'04 database to build our head pose models. Texture- and color-based head pose models are built from all the sample images available for each of the 93 discrete head poses θ in Θ = {θ_j = (α_j, β_j, 0), j = 1, ..., N_Θ}. In the Pointing database, there are 15 people per pose.

Head Pose Texture Model. The head pose texture is represented by the output of three filters: a Gaussian at coarse scale and two Gabor filters at two different scales (finer and coarser). Training patch images are resized to the same reference size 64 x 64, preprocessed by histogram equalization to reduce lighting variation effects, then filtered by each of the above filters. The filter outputs at sample locations inside a head mask are concatenated into a single feature vector. To model the texture of head poses, the feature vectors associated with each head pose θ in Θ are clustered into K = 2 clusters using a k-means algorithm. The cluster centers e_k^θ = (e_{k,i}^θ) are taken to be the exemplars of the head pose θ. The diagonal covariance matrix of the features, σ_k^θ = diag(σ_{k,i}^θ), inside each cluster is also exploited to define the pose likelihood models. The likelihood of an input head image, characterized by its extracted features z^text, with respect to an exemplar k of a head pose θ is then defined by:

\[
p_T(z \mid k, \theta) = \prod_i \frac{1}{\sigma_{k,i}^{\theta}} \max\left( \exp\left( -\frac{1}{2} \left( \frac{z_i^{text} - e_{k,i}^{\theta}}{\sigma_{k,i}^{\theta}} \right)^2 \right), T \right) \qquad (2)
\]

where T = exp(-9/2) is a lower threshold (corresponding to a truncation at three standard deviations) set to reduce the effects of outlier components of the feature vectors.

Head Pose Color Model. To gain robustness to background clutter and help tracking, a skin color model M_k^θ is learned from the training images belonging to each head pose exemplar e_k^θ. Training images are resized to 64 x 64, and their pixels are classified as skin (pixel value = 1) or non-skin (value = 0). The mask M_k^θ is the average of the training skin images. Additionally, we model the distribution of skin pixel values with a Gaussian distribution [16] in the normalized RG space, whose parameters are learned from the training images and continuously adapted during tracking. The color likelihood of an input patch image at time t w.r.t. the k-th exemplar of a pose θ is obtained by detecting the skin pixels on the 64 x 64 grid, producing the skin color mask z_t^col, from which the color likelihood is defined as:
\[
p_{col}(z \mid k, \theta) \propto \exp\left( -\lambda \, \| z_t^{col} - M_k^{\theta} \|_1 \right) \qquad (3)
\]
where λ is a hyperparameter learned from training data, and ||.||_1 denotes the L1 norm.
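The following sketch evaluates the exemplar likelihoods of Equations (2) and (3), and their product as used later in Equation (6), for one hypothesized head patch. The array shapes, the value of lambda, and the random inputs are illustrative assumptions; the feature extraction and skin detection themselves are not shown.

```python
import numpy as np

T = np.exp(-9.0 / 2.0)  # truncation threshold of Eq. (2)

def texture_likelihood(z_text, e_mean, e_std):
    """Eq. (2): product over feature components of a truncated, normalized
    Gaussian-like term; z_text, e_mean, e_std are 1-D arrays."""
    terms = np.maximum(np.exp(-0.5 * ((z_text - e_mean) / e_std) ** 2), T)
    return np.prod(terms / e_std)

def color_likelihood(z_col_mask, exemplar_mask, lam=0.01):
    """Eq. (3): exp(-lambda * L1 distance) between the detected skin mask and
    the exemplar's average skin mask (both 64x64, values in [0, 1])."""
    return np.exp(-lam * np.abs(z_col_mask - exemplar_mask).sum())

def joint_likelihood(z_text, z_col_mask, e_mean, e_std, exemplar_mask, lam=0.01):
    # Texture and color are assumed conditionally independent given the state.
    return texture_likelihood(z_text, e_mean, e_std) * \
           color_likelihood(z_col_mask, exemplar_mask, lam)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.normal(size=32); mean = rng.normal(size=32); std = np.ones(32)
    mask = (rng.random((64, 64)) > 0.5).astype(float)
    print(joint_likelihood(z, mask, mean, std, mask))
```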
3.2 Joint Head Tracking and Pose Estimation
The Bayesian formulation of the tracking problem is well known. Denoting by X_t the hidden state representing the object configuration at time t, and by z_t the observation extracted from the image, the objective is to estimate the filtering distribution p(X_t | z_{1:t}) of X_t given all the observations z_{1:t} = (z_1, ..., z_t) up to the current time. This can be done through a recursive equation, which can be approximated through sampling techniques (or particle filters, PF) in the case of non-linear and non-Gaussian models. The basic idea behind PF consists of representing the filtering distribution using a weighted set of samples {X_t^n, w_t^n}_{n=1}^{N_s}, and updating this representation as new data arrives. That is, given the particle set at the previous time step {X_{t-1}^n, w_{t-1}^n}, configurations at the current time step are drawn from a proposal distribution q(X_t) = \sum_n w_{t-1}^n p(X_t | X_{t-1}^n). The weights are then computed as w_t^n ∝ p(z_t | X_t^n). Four elements are important in defining a PF: a state model, a dynamical model, an observation model, and a sampling mechanism. We now describe each of them.

State Model: The mixed-state approach [12] allows discrete and continuous variables to be represented jointly in the same state variable. In our specific case the state X = (S, γ, l) is the conjunction of a discrete index l = (θ, k), which labels an element of the set of head pose models e_k^θ, while both the discrete variable γ and the continuous variable S = (x, y, s^x, s^y) parameterize the transform T_{(S,γ)} defined by:

\[
T_{(S,\gamma)} u = \begin{pmatrix} s^x & 0 \\ 0 & s^y \end{pmatrix} \begin{pmatrix} \cos\gamma & -\sin\gamma \\ \sin\gamma & \cos\gamma \end{pmatrix} u + \begin{pmatrix} x \\ y \end{pmatrix} \qquad (4)
\]

which characterizes the image object configuration. (x, y) specifies the translation position of the object in the image plane, (s^x, s^y) denote the width and height scales of the object according to a reference size, and γ specifies the in-plane rotation of the object.

Dynamic Model: This model represents the temporal prior on the evolution of the state. Figure 2 describes the dependencies between our variables, from which the equation of the process density can be defined:

\[
p(X_t \mid X_{t-1}) = p(S_t \mid S_{t-1})\, p(l_t \mid l_{t-1}, S_t)\, p(\gamma_t \mid \gamma_{t-1}, l_{t-1}) \qquad (5)
\]
The dynamical model of the continuous variable S_t, p(S_t | S_{t-1}), is modeled as a classical first-order autoregressive process. The other densities, learned from training sequences, allow us to set a prior on the head eccentricity, as well as to model the head rotation dynamics, as detailed in [1].
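As a small illustration of the state parameterization, the sketch below applies the image transform of Equation (4) to a grid of reference locations (the set C used later for patch cropping) and performs a first-order autoregressive step on S. The noise levels and patch size are made-up values for the example.

```python
import numpy as np

def transform(S, gamma, u):
    """Eq. (4): map reference-frame points u (Nx2) to the image plane using
    translation (x, y), scales (sx, sy) and in-plane rotation gamma."""
    x, y, sx, sy = S
    R = np.array([[np.cos(gamma), -np.sin(gamma)],
                  [np.sin(gamma),  np.cos(gamma)]])
    scale = np.diag([sx, sy])
    return (scale @ R @ u.T).T + np.array([x, y])

def ar1_step(S, rng, noise=(2.0, 2.0, 0.02, 0.02)):
    """First-order autoregressive proposal on S = (x, y, sx, sy): a random
    walk around the previous value (illustrative noise levels)."""
    return S + rng.normal(scale=noise)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    S = np.array([120.0, 80.0, 0.5, 0.6])      # hypothetical head state
    grid = np.stack(np.meshgrid(np.arange(64), np.arange(64)), -1).reshape(-1, 2)
    patch_locations = transform(S, 0.1, grid)   # C(S, gamma) of Section 3.2
    print(patch_locations.shape, ar1_step(S, rng))
```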
Fig. 2. Left: Mixed State Graphical Model. Middle: basis attached to the head (head pointing vector in red). Right: visual focus of attention graphical model.
The observation model: p(z_t | X_t) measures the adequacy between the observation and the state. This is an essential term, where data fusion occurs, and whose modeling accuracy can greatly benefit from additional discrete variables in the state space. In our case, observations z are composed of texture and color observations (z^text, z^col), and the likelihood is defined as:

\[
p(z \mid X = (S, \gamma, l)) = p_{text}(z^{text}(S,\gamma) \mid l)\, p_{col}(z^{col}(S,\gamma) \mid l), \qquad (6)
\]

where we have assumed that these observations are conditionally independent given the state. The texture likelihood p_text and the color likelihood p_col have been defined in Section 3.1. During tracking, the image patch associated with the image spatial configuration of the state, (S, γ), is first cropped from the image according to C(S, γ) = {T_{(S,γ)} u, u in C}, where C corresponds to the set of 64 x 64 locations defined in a reference frame. Then, the texture and color observations are computed using the procedure described in Section 3.1.

Sampling mechanism: the Rao-Blackwellization. The sampling should place new samples as close as possible to regions of high likelihood. The plain particle filter, denoted MSPF and described in the first paragraph of this subsection, could be employed. However, given that the exemplar label l is discrete, its filtering pdf can be computed exactly given the samples of the remaining variables. Thus we can apply the Rao-Blackwellization procedure, which is known to lead to more accurate estimates with a smaller number of particles [4]. Given the graphical model of our filter (Fig. 2), the Rao-Blackwellized particle filter (RBPF) consists of applying the standard PF algorithm over the tracking variables S and γ while applying an exact filtering step over the exemplar variable l, given a sample of the tracking variables. In this way, the likelihood of the state can be computed using:

\[
p(S_{1:t}, \gamma_{1:t}, l_{1:t} \mid z_{1:t}) = p(l_{1:t} \mid S_{1:t}, \gamma_{1:t}, z_{1:t})\, p(S_{1:t}, \gamma_{1:t} \mid z_{1:t}) \qquad (7)
\]
In practice, only the sufficient statistics p(l_t | S_{1:t}, γ_{1:t}, z_{1:t}) of the first term on the right-hand side is computed and is involved in the PF steps of the second term. Thus, in the RBPF modeling, the pdf in Equation 7 is represented by a set of particles

\[
\{ S_{1:t}^i, \gamma_{1:t}^i, \pi_t^i(l_t), w_t^i \}_{i=1}^{N_s} \qquad (8)
\]

where π_t^i(l_t) = p(l_t | S_{1:t}^i, γ_{1:t}^i, z_{1:t}) is the pdf of the exemplars given a particle and a sequence of measurements, and w_t^i ∝ p(S_{1:t}^i, γ_{1:t}^i | z_{1:t}) is the weight of the tracking state particle.
1. Initialization step: for all i, sample (S_0^i, γ_0^i) from p(S_0, γ_0), set π_0^i(.) uniform, and set t = 1.
2. Prediction of new head location configurations: sample \tilde{S}_t^i and \tilde{\gamma}_t^i from the mixture (\tilde{S}_t^i, \tilde{\gamma}_t^i) ∼ p(S_t | S_{t-1}^i) \sum_{l_{t-1}} π_{t-1}^i(l_{t-1}) p(γ_t | γ_{t-1}^i, l_{t-1}).
3. Head pose distribution of the particles: compute the exact step \tilde{\pi}_t^i(l_t) = p(l_t | S_{1:t}^i, γ_{1:t}^i, z_{1:t}) for all i and l_t.
4. Particle weights: for all i, compute the weights \tilde{w}_t^i = p(z_t | S_{1:t}^i, γ_{1:t}^i, z_{1:t-1}).
5. Selection step: resample N_s particles {S_t^i, γ_t^i, π_t^i(.), w_t^i = 1/N_s} from the set {\tilde{S}_t^i, \tilde{\gamma}_t^i, \tilde{\pi}_t^i(.), \tilde{w}_t^i}; set t = t + 1 and go to step 2.

Fig. 3. RBPF Algorithm
Figure 3 summarizes the steps of the RBPF algorithm, with the additional resampling step to avoid sampling degeneracy. In the following, we detail the methodology to derive the exact steps to compute π_t^i(l_t) and the PF steps to compute w_t^i.

Deriving the exact step: The goal here is to derive p(l_t | S_{1:t}, γ_{1:t}, z_{1:t}). As l_t is discrete, this can be done using prediction and update steps similar to those involved in a Hidden Markov Model (HMM), and generates as intermediate results Z_1(S_t, γ_t) = p(S_t, γ_t | S_{1:t-1}, γ_{1:t-1}, z_{1:t-1}) and Z_2 = p(z_t | S_{1:t}, γ_{1:t}, z_{1:t-1}).

Deriving the PF steps: The pdf p(S_{1:t}, γ_{1:t} | z_{1:t}) is approximated using particles whose weights are recursively computed using the standard PF approach. Using the discrete approximation of the pdf at time t - 1 with the set of particles and weights, the current pdf p(S_{1:t}, γ_{1:t} | z_{1:t}) can be approximated (up to the proportionality constant p(z_t | z_{1:t-1})) by:

\[
p(z_t \mid S_{1:t}, \gamma_{1:t}, z_{1:t-1}) \sum_{i=1}^{N_s} w_{t-1}^i\, p(S_t, \gamma_t \mid S_{1:t-1}^i, \gamma_{1:t-1}^i, z_{1:t-1}) \qquad (9)
\]
to which the standard PF steps can be applied. Indeed, the mixture in the second part of Equation 9 can be rewritten as:

\[
\sum_{i=1}^{N_s} w_{t-1}^i\, p(S_t \mid S_{t-1}^i) \sum_{l_{t-1}} \pi_{t-1}^i(l_{t-1})\, p(\gamma_t \mid \gamma_{t-1}^i, l_{t-1}) \qquad (10)
\]
which embeds the temporal evolution of the head configurations and allows us to draw new (S_t, γ_t) samples. Similarly, the weights of these new samples, defined by the observation likelihood p(z_t | S_{1:t}, γ_{1:t}, z_{1:t-1}), can be readily obtained from the exact step computation (cf. the computation of the Z_2 constant).
Filter output: As the set of particles defines a pdf over the state space, we can use as output the expectation value of this pdf, obtained by standard averaging over the particle set. Note that usually, with mixed-state particle filters, averaging over the discrete variable is not possible (e.g., if a discrete index represents a person identity). However, in our case there is no problem, since our discrete indices correspond to real Euler angles, which can be combined.
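To make the filter concrete, here is a compact sketch of one RBPF update in the spirit of Fig. 3, for a generic continuous state and a discrete label with L values: the continuous part is propagated by sampling, while the posterior over the label is updated exactly, HMM-style, per particle. The toy motion, transition, and likelihood functions are placeholders, not the models of Section 3.1, and the simple proposal ignores the label-dependent rotation dynamics.

```python
import numpy as np

def rbpf_step(particles, pis, weights, observation, rng,
              sample_motion, label_trans, obs_lik):
    """One RBPF update. particles: (N, d) continuous states; pis: (N, L)
    per-particle posteriors over the discrete label; weights: (N,).
    label_trans: (L, L) transition matrix; obs_lik(x, obs) -> (L,) likelihoods."""
    N, L = pis.shape
    new_particles = np.array([sample_motion(x, rng) for x in particles])

    # Exact (HMM-like) step over the discrete label, per particle.
    pred = pis @ label_trans                        # predicted label posterior
    lik = np.array([obs_lik(x, observation) for x in new_particles])  # (N, L)
    unnorm = pred * lik
    new_weights = weights * unnorm.sum(axis=1)      # particle weight = evidence
    new_pis = unnorm / unnorm.sum(axis=1, keepdims=True)

    # Resampling to avoid degeneracy.
    new_weights /= new_weights.sum()
    idx = rng.choice(N, size=N, p=new_weights)
    return new_particles[idx], new_pis[idx], np.full(N, 1.0 / N)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, L = 100, 3
    parts = rng.normal(size=(N, 1))
    pis = np.full((N, L), 1.0 / L)
    w = np.full(N, 1.0 / N)
    motion = lambda x, r: x + r.normal(scale=0.1, size=x.shape)
    A = np.full((L, L), 1.0 / L)
    lik = lambda x, z: np.exp(-0.5 * (z - x[0] - np.arange(L)) ** 2)
    out = rbpf_step(parts, pis, w, 0.7, rng, motion, A, lik)
    print(out[0].shape, out[1].shape)
```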
4 Visual Focus of Attention Tracking
Modelling VFOA with a Gaussian Mixture Model (GMM): Let us denote by F_t in F and by Z_t the VFOA and the head pointing vector (defined by its pan and tilt angles) of a person at time instant t. Estimating the VFOA can be posed in a probabilistic framework as finding the label maximizing the a posteriori (MAP) probability:

\[
\hat{F}_t = \arg\max_{F_t \in \mathcal{F}} p(F_t \mid Z_t), \quad \text{with} \quad p(F_t \mid Z_t) = \frac{p(Z_t \mid F_t)\, p(F_t)}{p(Z_t)} \propto p(Z_t \mid F_t)\, p(F_t). \qquad (11)
\]

For each possible VFOA f in F which is not unfocused, p(Z_t | F_t) is modeled as a Gaussian distribution N(Z_t; μ_f, Σ_f) with mean μ_f and full covariance matrix Σ_f, while p(Z_t | F_t = unfocused) is modeled as a uniform distribution. For p(F_t), we used no prior (i.e., the distribution was uniform), in order to obtain a more general model of the FOA and to avoid overfitting to the specific scenario with roles (organizers, participants) that we considered.

Modeling VFOA with a Hidden Markov Model (HMM): The GMM modelling does not account for the temporal dependencies between VFOA events. As a model of these dependencies, we considered the classical graphical model shown in Figure 2 (right). Given a sequence of VFOA states F_{0:T} = {F_t, t = 0, ..., T} and a sequence of observations Z_{1:T}, the joint posterior probability density function of the states and observations can be written:

\[
p(F_{0:T}, Z_{1:T}) = p(F_0) \prod_{t=1}^{T} p(Z_t \mid F_t)\, p(F_t \mid F_{t-1}) \qquad (12)
\]
The emission probabilities were modeled as in the previous case (i.e., Gaussian distributions for regular VFOA targets, and a uniform distribution for the unfocused label). Their parameters, along with the transition matrix p(F_t | F_{t-1}) modeling the probability of transiting from one VFOA to another, were learned using standard techniques. In the testing phase, the estimation of the optimal sequence of states given a sequence of observations was conducted using the Viterbi algorithm.
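A sketch of the two recognizers: frame-wise MAP with Gaussian likelihoods per focus target plus a uniform density for the unfocused label (Equation 11), and the HMM variant decoded with the Viterbi algorithm (Equation 12). The Gaussian parameters, the uniform density value, and the transition matrix are placeholders that would normally be learned from the training folds.

```python
import numpy as np

def gauss_logpdf(z, mean, cov):
    d = np.asarray(z, float) - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.inv(cov) @ d + logdet + len(mean) * np.log(2 * np.pi))

def frame_log_likelihoods(z, means, covs, unfocused_logpdf=np.log(1.0 / (360 * 180))):
    """z: (pan, tilt) in degrees. Returns log p(z | f) for each focus target,
    with the last entry being the uniform 'unfocused' density."""
    return np.array([gauss_logpdf(z, m, c) for m, c in zip(means, covs)] + [unfocused_logpdf])

def map_decode(Z, means, covs):
    # Eq. (11) with a uniform prior p(F): argmax of the likelihood per frame.
    return [int(np.argmax(frame_log_likelihoods(z, means, covs))) for z in Z]

def viterbi_decode(Z, means, covs, logA, logpi):
    # Eq. (12): Gaussian/uniform emissions, learned transitions, Viterbi decoding.
    B = np.array([frame_log_likelihoods(z, means, covs) for z in Z])  # (T, K)
    T, K = B.shape
    delta, psi = logpi + B[0], np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA       # scores[i, j] = best path ending i -> j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

if __name__ == "__main__":
    means = [np.array([-40.0, -10.0]), np.array([20.0, 0.0])]  # e.g. screen, person
    covs = [np.eye(2) * 100 for _ in means]
    Z = np.array([[-38, -12], [-35, -8], [22, 2]])
    K = len(means) + 1
    logA = np.log(np.full((K, K), 0.1) + np.eye(K) * 0.7)  # sticky placeholder transitions
    logpi = np.log(np.full(K, 1.0 / K))
    print(map_decode(Z, means, covs), viterbi_decode(Z, means, covs, logA, logpi))
```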
5 Results

5.1 Head Pose Evaluation
Experiments following the protocol described in Section 2.2 were conducted to compare head pose estimation based on the MSPF and the RBPF tracker.
Table 1. Mean, standard deviation and median of the errors (in degrees) on the different angles

            pan              tilt             roll             pointing vector
        mean  std  med   mean  std  med   mean  std  med   mean  std  med
  MSPF  10.0  9.6  7.8   19.4 12.7 17.5   11.5  9.9  8.8   22.5 12.5 20.1
  RBPF   9.1  8.6  7.0   17.6 12.2 15.8   10.1  9.9  7.5   20.3 11.3 18.2
The MSPF tracker was run with 200 particles and the RBPF with 100 particles. Apart from this difference, all the other models and parameters involved in the algorithms were the same (recall that both approaches are based on the same graphical model and involve setting/learning the same pdfs). Table 1 shows the pose errors for the two methods over the test set. Overall, given the small head size, and the fact that none of the heads in the test set were used for appearance training, the results are quite good, with a majority of head pan errors smaller than 10 degrees. Also, the errors in pan and roll are smaller than the errors in tilt. This is due to the fact that, even from a perceptual point of view, discriminating between head tilts is more difficult than discriminating between head pans or head rolls [2]. Besides, as can be seen, the errors are smaller for the RBPF than for the MSPF approach. This improvement is mainly due to a better exploration of the head pose configuration space by the RBPF, as illustrated in Figure 5, which displays sample tracking results for one person of the test set. Because of a sudden head turn, the MSPF lags behind in the exploration of the head pose configuration space, contrary to the RBPF approach, which nicely follows the head pose. The above results, however, hide a large discrepancy between individuals, as the mean errors for each person of the test set show (Fig. 4). This variance depends mainly on whether the tracked person resembles one of the persons of the training set used to learn the appearance model. It is worth noticing in this figure that the improvements due to the Rao-Blackwellisation are more consistent on the marginalized variables (pan and tilt) than on the sampled one (the roll).
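The pointing-vector error reported in Table 1 can be computed as below: convert each (pan, tilt) pair to a 3D direction and take the angle between the ground-truth and estimated directions. The pan/tilt-to-vector convention is an assumption chosen for illustration; only the relative angle matters for the error.

```python
import numpy as np

def pointing(pan_deg, tilt_deg):
    """Unit head pointing direction from pan and tilt (degrees); the roll does
    not affect this vector. Axis convention chosen for illustration."""
    a, b = np.radians([pan_deg, tilt_deg])
    return np.array([-np.sin(a) * np.cos(b), np.sin(b), -np.cos(a) * np.cos(b)])

def pointing_error(gt, est):
    """Angle in degrees between ground-truth and estimated (pan, tilt) pairs."""
    v, w = pointing(*gt), pointing(*est)
    return np.degrees(np.arccos(np.clip(np.dot(v, w), -1.0, 1.0)))

if __name__ == "__main__":
    # e.g. GT looking 30 deg left and 10 deg down, estimate off by a few degrees
    print(round(pointing_error((-30, -10), (-22, -16)), 1))
```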
5.2 Focus of Attention Recognition Evaluation
Tables 2 and 3 display the VFOA estimation results for the right and left person, respectively. VFOA and head pose correlation: The ML results correspond to the maximum likelihood (ML) estimation of the VFOA, which consists in estimating the VFOA model parameters using the data of a person and testing the model on the same data (with a GMM model). These results show the performance our model can achieve in an optimistic case, and illustrate the correlation between a person's head poses and his or her VFOA. As can be seen, this correlation is quite high for the left person (close to 80% FRR), showing the good accordance between pose and VFOA. However, it drops to only near 60% for the right person, mainly due to the stronger ambiguity between looking at the left person, the slide screen and, to a smaller extent, the left organizer.
Fig. 4. Pan, tilt, and roll errors over individual participants
Fig. 5. Sample of tracking failure for MSPF. First row : MSPF; Second row: RBPF.
VFOA prediction: While ML achieves the best results, its performance does not greatly outperform that of the GMM and HMM modeling using GT data, which shows the ability to learn a VFOA model applicable to new data. For both the right and the left person, the GMM modeling achieves better frame recognition rate and event recall performance, while the HMM gives better event precision. This can be explained by the fact that the HMM approach performs some data smoothing. As a result, some events are missed (lower recall), but the precision increases due to the elimination of short spurious detections. Overall, our results are comparable to other state-of-the-art VFOA estimation using sensor input. For instance, [8], with a VFOA target set composed of 3 people, obtained an average frame recognition rate of 68%, similar to our results.

Head pose estimates: As Tables 2 and 3 show, we observe a degradation in performance when using head pose estimates. This degradation is due to tracking errors (short periods when the tracker locks on a subpart of the face, tilt uncertainty) and to the different (but individually consistent) responses of the head pose tracker to inputs with similar poses but different appearances. While the HMM modeling had only a small impact on performance when using GT data, we observe from the event F-measure that, in the presence of noisier data, its smoothing effect is quite beneficial.
Table 2. Average VFOA estimation results for the right person using ML, GMM, and HMM modeling, and either gt (ground truth) or tr (pose tracking output) observations

  error measure     gt-ML  gt-gmm  gt-hmm  tr-ML  tr-gmm  tr-hmm
  frame rr (FRR)    62.1   53.6    53.9    42.8   38.2    38.4
  event rec         65.7   57.3    50.6    54.5   51.5    34.8
  event prec        43.6   43.6    52.2    18.5   17.1    40.6
  event F-meas      52.1   47.2    50.4    29.5   25.3    36.9
Table 3. Average VFOA estimation results for the left person using ML, GMM, and HMM modeling, and either gt (ground truth) or tr (pose tracking output) observations

  error measure     gt-ML  gt-gmm  gt-hmm  tr-ML  tr-gmm  tr-hmm
  frame rr (FRR)    78.4   73.0    73.0    53.6   49.5    50.1
  event rec         66.9   62.0    56.4    51.3   39.3    32.7
  event prec        53.2   56.8    63.8    26.8   18.9    44.9
  event F-meas      59.0   58.7    59.2    34.2   25.2    36.9
6 Conclusion
We have presented a system for the recognition of the VFOA of people in meetings. The method relies on the estimation of the head orientation of people, from which the VFOA is deduced. We obtained an average pose estimation error of around 10 degrees in the pan angle and 18 degrees in the tilt angle, with fluctuations due to variations in people's appearance. With respect to VFOA recognition, the obtained results are encouraging, but additional work is needed. A first direction is the use of individualized VFOA models obtained through unsupervised adaptation. Early results along this line exhibit an absolute performance increase of around 8%. The second research line addresses the ambiguity issues by modeling the interaction between people and different cues (e.g., speaking status, slide activity).
References

1. S. O. Ba and J. M. Odobez. A Rao-Blackwellized mixed state particle filter for head pose tracking. In ACM-ICMI Workshop on Multi-modal Multi-party Meeting Processing (MMMP), Trento, Italy, pages 9-16, 2005.
2. L. Brown and Y. Tian. A study of coarse head pose estimation. IEEE Workshop on Motion and Video Computing, Dec. 2002.
3. T. Cootes and P. Kittipanya-ngam. Comparing variations on the active appearance model algorithm. BMVC, 2002.
4. A. Doucet, S. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 2000.
5. L. Lu, Z. Zhang, H. Shum, Z. Liu, and H. Chen. Model and exemplar-based robust head pose tracking under occlusion and varying expression. CVPR, Dec. 2001.
6. J. McGrath. Groups: Interaction and Performance. Prentice-Hall, 1984.
7. J.-M. Odobez. Focus of attention coding guidelines. Technical Report 2, IDIAP-COM, Jan. 2006.
8. K. Otsuka, Y. Takemae, J. Yamato, and H. Murase. A probabilistic inference of multiparty-conversation structure based on Markov-switching models of gaze patterns, head directions, and utterances. In Proc. of the International Conference on Multimodal Interfaces (ICMI'05), pages 191-198, Trento, Italy, October 2005.
9. K. Parker. Speaking turns in small group interaction: a context sensitive event sequence model. Journal of Personality and Social Psychology, 1988.
10. R. Rae and H. Ritter. Recognition of human head orientation based on artificial neural networks. IEEE Trans. on Neural Networks, March 1998.
11. R. Stiefelhagen, J. Yang, and A. Waibel. Modeling focus of attention for meeting indexing based on multiple cues. IEEE Transactions on Neural Networks, Vol. 13, No. 4, 2002.
12. K. Toyama and A. Blake. Probabilistic tracking in metric space. ICCV, Dec. 2001.
13. A. Waibel, M. Bett, F. Metze, K. Ries, T. Schaaf, T. Schultz, H. Soltau, H. Yu, and K. Zechner. Advances in automatic meeting record creation and access. Proc. ICASSP, May 2001.
14. P. Wang and Q. Ji. Multi-view face tracking with factorial and switching HMM. Workshop on Applications of Computer Vision (WACV/MOTION'05), Breckenridge, Colorado, 2005.
15. Y. Wu and K. Toyama. Wide range illumination insensitive head orientation estimation. IEEE Conf. on Automatic Face and Gesture Recognition, Apr. 2001.
16. J. Yang, W. Lu, and A. Waibel. Skin color modeling and adaptation. ACCV, Oct. 1998.
17. L. Zhao, G. Pingali, and I. Carlbom. Real-time head orientation estimation using neural networks. Proc. of ICIP, Sept. 2002.
Author Index
Abad, Alberto 93 Abd-Almageed, Wael 209 Anguita, Jan 258
Ba, Sileye O. 345 Barras, Claude 233 Beran, Vítezslav 331 Berkowitz, Phillip 200 Bernardin, Keni 1, 81 Bowers, Rachel 1 Brunelli, Roberto 55 Brutti, Alessio 55 Canton-Ferrer, Cristian 93, 305 Casas, Josep Ramon 93, 305 Cavallaro, Andrea 190 Chippendale, Paul 55 Chu, Chi-Wei 119 Crowley, James L. 270 Davis, Larry S. 209
Ekenel, Hazım Kemal Farrus, Mireia Fu, Yun 281
258
Hall, Daniela 270 Hernando, Javier 93, 258 Hu, Yuxiao 281, 299 Huang, Thomas 241, 281, 299 249
Katsarakis, Nikos 45 Korhonen, Teemu 127
93
Macho, Dušan 93, 258, 311 Maggio, Emilio 190 Maisonnasse, Jérôme 270 Malkin, Robert G. 311, 323 Marqués, Ferran 258 Martínez, Claudi 258 McDonough, John 69, 137 Miller, Andrew 200 Morros, Ramon 258 Mostefa, Djamel 1 Mylonakis, Vasileios 171 Nadeu, Climent 93, 311 Nechyba, Michael C. 161 Nevatia, Ram 119, 183, 216 Nickel, Kai 69, 291 Ning, Huazhong 241
69, 249
Garde, Ainara 258 Garofolo, John 1 Gatica-Perez, Daniel 331 Gauvain, Jean-Luc 233 Gehrig, Tobias 69, 81, 137 Gourier, Nicolas 270
Jin, Qin
Lamel, Lori 233 Landabaso, José Luis Lanz, Oswald 55 Liu, Ming 241, 299 Luque, Jordi 258
Odobez, Jean-Marc 345 Omologo, Maurizio 55, 311
Pardàs, Montse 93, 305 Parviainen, Mikko 127 Pertilä, Pasi 127 Pirinen, Tuomo 127 Pnevmatikakis, Aristodemos 45, 151, 171, 223 Polymenakos, Lazaros 45, 151, 171, 223 Potamianos, Gerasimos 105 Potúcek, Igor 331 Rigoll, Gerhard 331
Schneiderman, Henry 161 Schreiber, Sascha 331 Segura, Carlos 93 Shafique, Khurram 200 Shah, Mubarak 200
Singh, Vivek Kumar 119, 183 Smith, Kevin 331 Song, Xuefeng 183, 216 Soundararajan, Padmanabhan 1 Souretis, George 45 Stergiou, Andreas 223 Stiefelhagen, Rainer 1, 69, 81, 291 Svaizer, Piergiorgio 55
Tobia, Francesco 55 Tu, Jilin 281
Taj, Murtaza 190 Talantzis, Fotios 45 Tang, Hao 241 Temko, Andrey 311
Zhai, Yun 200 Zhang, Zhenqiu 105, 299 Zhu, Xuan 233 Zieger, Christian 311
Vartak, Aniket 200 Vilaplana, Verónica 258 Voit, Michael 291 White, Brandyn 200 Wu, Bo 119, 183