Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen University of Dortmund, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
5707
Julian Fierrez Javier Ortega-Garcia Anna Esposito Andrzej Drygajlo Marcos Faundez-Zanuy (Eds.)
Biometric ID Management and Multimodal Communication Joint COST 2101 and 2102 International Conference BioID_MultiComm 2009 Madrid, Spain, September 16-18, 2009 Proceedings
Volume Editors Julian Fierrez Javier Ortega-Garcia Universidad Autonoma de Madrid Escuela Politecnica Superior C/Francisco Tomas y Valiente 11, 28049 Madrid, Spain E-mail: {julian.fierrez;javier.ortega}@uam.es Anna Esposito Second University of Naples, and IIASS Caserta, Italy E-mail:
[email protected] Andrzej Drygajlo EPFL, Speech Processing and Biometrics Group 1015 Lausanne, Switzerland E-mail:
[email protected] Marcos Faundez-Zanuy Escola Universitària Politècnica de Mataró 08303 Mataro (Barcelona), Spain E-mail:
[email protected]
Library of Congress Control Number: 2009934011
CR Subject Classification (1998): I.5, J.3, K.6.5, D.4.6, I.4.8, I.7.5, I.2.7
LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
ISSN 0302-9743
ISBN-10 3-642-04390-9 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-04390-1 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2009 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12752645 06/3180 543210
Preface
This volume contains the research papers presented at the Joint COST 2101 & 2102 International Conference on Biometric ID Management and Multimodal Communication, BioID_MultiComm 2009, hosted by the Biometric Recognition Group, ATVS, at the Escuela Politécnica Superior, Universidad Autónoma de Madrid, Spain, during September 16–18, 2009. BioID_MultiComm 2009 was a joint international conference organized cooperatively by COST Actions 2101 & 2102. COST 2101 Action focuses on "Biometrics for Identity Documents and Smart Cards (BIDS)," while COST 2102 Action is entitled "Cross-Modal Analysis of Verbal and Non-verbal Communication." The aim of COST 2101 is to investigate novel technologies for unsupervised multimodal biometric authentication systems using a new generation of biometrics-enabled identity documents and smart cards. COST 2102 is devoted to developing an advanced acoustical, perceptual and psychological analysis of verbal and non-verbal communication signals originating in spontaneous face-to-face interaction, in order to identify algorithms and automatic procedures capable of recognizing human emotional states. While each Action supports its own individual topics, there are also strong links and shared interests between them. BioID_MultiComm 2009 therefore focused on both Action-specific and joint topics. These included, but were not restricted to: physiological biometric traits (face, iris, fingerprint, hand); behavioral biometric modalities (speech, handwriting, gait); transparent biometrics and smart remote sensing; biometric vulnerabilities and liveness detection; data encryption for identity documents and smart cards; quality and reliability measures in biometrics; multibiometric templates for next generation ID documents; operational scenarios and large-scale biometric ID management; standards and privacy issues for biometrics; multibiometric databases; human factors and behavioral patterns; interactive and unsupervised multimodal systems; analysis of verbal and non-verbal communication signals; cross modal analysis of audio and video; spontaneous face-to-face interaction; advanced acoustical and perceptual signal processing; audiovisual data encoding; fusion of visual and audio signals for recognition and synthesis; identification of human emotional states; gesture, speech and facial expression analysis and recognition; implementation of intelligent avatars; annotation of the extended MPEG7 standard; human behavior and unsupervised interactive interfaces; and cultural and socio-cultural variability. We sincerely thank all the authors who submitted their work for consideration. We also thank the Scientific Committee members for their great effort and high-quality work in the review process. In addition to the papers included in the present volume, the conference program also included three keynote speeches from outstanding researchers: Prof. Anil K. Jain (Michigan State University, USA), Prof. Simon Haykin (McMaster University, Canada) and Dr. Janet Slifka (Harvard – MIT, USA). We sincerely thank them for accepting the invitation to give their talks.
The conference organization was the result of a team effort. We are grateful to the Advisory Board for their support at every stage of the conference organization. We also thank all the members of the Local Organizing Committee, in particular Pedro Tome-Gonzalez for the website management, Miriam Moreno-Moreno for supervising the registration process, and Almudena Gilperez and Maria Puertas-Calvo for taking care of the social program. Finally, we gratefully acknowledge the material and financial support provided by the Escuela Politécnica Superior and the Universidad Autónoma de Madrid.
August 2009
Javier Ortega-Garcia Julian Fierrez
Organization General Chair Javier Ortega-Garcia
Universidad Autonoma de Madrid, Spain
Conference Co-chair Joaquin Gonzalez-Rodriguez
Universidad Autonoma de Madrid, Spain
Advisory Board Anna Esposito Andrzej Drygajlo Marcos Faundez Mike Fairhurst Amir Hussain Niels-Christian Juul
Second University of Naples, Italy EPFL, Switzerland Escuela Universitaria Politécnica de Mataró, Spain University of Kent, UK University of Stirling, UK University of Roskilde, Denmark
Program Chair Julian Fierrez
Universidad Autonoma de Madrid, Spain
Scientific Committee Akarun, L., Turkey Alba-Castro, J.-L., Spain Almeida Pavia, A., Portugal Alonso-Fernandez, F., Spain Ariyaeeinia, A., UK Bailly, G., France Bernsen, N.-O., Denmark Bourbakis, N., USA Bowyer, K. W., USA Campbell, N., Japan Campisi, P., Italy Cerekovic, A., Croatia Chetouani, M., France Chollet, G., France Cizmar, A., Slovak Rep. Delic, V., Serbia Delvaux, N., France Dittman, J., Germany Dorizzi, B., France Dutoit, T., Belgium Dybkjar, L., Denmark
El-Bahrawy, A., Egypt Erzin, E., Turkey Fagel, S., Germany Furui, S., Japan Garcia-Mateo, C., Spain Gluhchev, G., Bulgaria Govindaraju, V., USA Granstrom, B., Sweden Grinberg, M., Bulgaria Harte, N., Ireland Kendon, A., USA Hernaez, I., Spain Hernando, J., Spain Hess, W., Germany Hoffmann, R., Germany Keus, K., Germany Kim, H., Korea Kittler, J., UK Koreman, J., Norway Kotropoulos, C., Greece Kounoudes, A., Cyprus
Krauss, R., USA Kryszczuk, K., Switzerland Laminen, H., Finland Laouris, Y., Cyprus Lindberg, B., Denmark Lopez-Cozar, R., Spain Majewski, W., Poland Makris, P., Cyprus Matsumoto, D., USA Mihaylova, K., Bulgaria Moeslund, T.-B., Denmark Murphy, P., Ireland Neubarth, F., Austria Nijholt, A., The Netherlands Pandzic, I., Croatia Papageorgiou, H., Greece Pavesic, N., Slovenia Pelachaud, C., France Pfitzinger, H., Germany Piazza, F., Italy Pitas, I., Greece Pribilova, A., Slovak Rep. Pucher, M., Austria Puniene, J., Lithuania Raiha, K.-J., Finland Ramos, D., Spain Ramseyer, F., Switzerland Ratha, N., USA Ribaric, S., Croatia Richiardi, J., Switzerland Rojc, M., Slovenia
Rudzionis, A., Lithuania Rusko, M., Slovak Rep. Ruttkay, Z., Hungary Sankur, B., Turkey Schoentgen, J., Belgium Schouten, B., Netherlands Sigüenza, J.-A., Spain Smekal, Z., Czech Rep. Staroniewicz, P., Poland Tao, J., China Tekalp, A.-M., Turkey Thorisson, K.-R., Iceland Tistarelli, M., Italy Toh, K.-A., Korea Toledano, D. T., Spain Tome-Gonzalez, P., Spain Trancoso, I., Portugal Tsapatsoulis, N., Cyprus Tschacher, W., Switzerland v. d. Heuvel, H., The Netherlands Veldhuis, The Netherlands Vich, R., Czech Republic Vicsi, K., Hungary Vielhauer, C., Germany Vilhjalmsson, H., Iceland Vogel, C., Ireland Wilks, Y., UK Yegnanarayana, B., India Zganec Gros, J., Slovenia Zhang, D., Hong Kong Zoric, G., Croatia
Local Organizing Committee (from the Universidad Autonoma de Madrid, Spain) Javier Galbally Pedro Tome-Gonzalez Manuel R. Freire Marcos Martinez-Diaz Miriam Moreno-Moreno Javier Gonzalez-Dominguez Ignacio Lopez-Moreno Javier Franco Alicia Beisner Javier Burgues Ruben F. Sevilla-Garcia Almudena Gilperez Maria Puertas-Calvo
Table of Contents
Face Processing and Recognition Illumination Invariant Face Recognition by Non-local Smoothing . . . . . . . Vitomir Štruc and Nikola Pavešić
1
Manifold Learning for Video-to-Video Face Recognition . . . . . . . . . . . . . . . Abdenour Hadid and Matti Pietikäinen
9
MORPH: Development and Optimization of a Longitudinal Age Progression Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Allen W. Rawls and Karl Ricanek Jr.
17
Verification of Aging Faces Using Local Ternary Patterns and Q-Stack Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Andrzej Drygajlo, Weifeng Li, and Kewei Zhu
25
Voice Analysis and Modeling Recognition of Emotional State in Polish Speech - Comparison between Human and Automatic Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Piotr Staroniewicz
33
Harmonic Model for Female Voice Emotional Synthesis . . . . . . . . . . . . . . . Anna Přibilová and Jiří Přibil
41
Anchor Model Fusion for Emotion Recognition in Speech . . . . . . . . . . . . . Carlos Ortego-Resa, Ignacio Lopez-Moreno, Daniel Ramos, and Joaquin Gonzalez-Rodriguez
49
Multimodal Interaction Audiovisual Alignment in a Face-to-Face Conversation Translation Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jerneja Žganec Gros and Aleš Mihelič Maximising Audiovisual Correlation with Automatic Lip Tracking and Vowel Based Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Andrew Abel, Amir Hussain, Quoc-Dinh Nguyen, Fabien Ringeval, Mohamed Chetouani, and Maurice Milgram Visual Context Effects on the Perception of Musical Emotional Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anna Esposito, Domenico Carbone, and Maria Teresa Riviello
57
65
73
Eigenfeatures and Supervectors in Feature and Score Fusion for SVM Face and Speaker Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pascual Ejarque, Javier Hernando, David Hernando, and David Gómez
81
Face and Expression Recognition Facial Expression Recognition Using Two-Class Discriminant Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Marios Kyperountas and Ioannis Pitas A Study for the Self Similarity Smile Detection . . . . . . . . . . . . . . . . . . . . . . David Freire, Luis Ant´ on, and Modesto Castrill´ on Analysis of Head and Facial Gestures Using Facial Landmark Trajectories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hatice Cinar Akakin and Bulent Sankur
89 97
105
Combining Audio and Video for Detection of Spontaneous Emotions . . . Rok Gajšek, Vitomir Štruc, Simon Dobrišek, Janez Žibert, France Mihelič, and Nikola Pavešić
114
Face Recognition Using Wireframe Model Across Facial Expressions . . . . Zahid Riaz, Christoph Mayer, Michael Beetz, and Bernd Radig
122
Body and Gait Recognition Modeling Gait Using CPG (Central Pattern Generator) and Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Arabneydi Jalal, Moshiri Behzad, and Bahrami Fariba
130
Fusion of Movement Specific Human Identification Experts . . . . . . . . . . . . Nikolaos Gkalelis, Anastasios Tefas, and Ioannis Pitas
138
CBIR over Multiple Projections of 3D Objects . . . . . . . . . . . . . . . . . . . . . . . Dimo Dimov, Nadezhda Zlateva, and Alexander Marinov
146
Biometrics beyond the Visible Spectrum: Imaging Technologies and Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Miriam Moreno-Moreno, Julian Fierrez, and Javier Ortega-Garcia
154
Poster Session Voice Analysis and Speaker Verification Formant Based Analysis of Spoken Arabic Vowels . . . . . . . . . . . . . . . . . . . . Yousef Ajami Alotaibi and Amir Husain
162
Key Generation in a Voice Based Template Free Biometric Security System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Joshua A. Atah and Gareth Howells
170
Fingerprint Biometrics Extending Match-On-Card to Local Biometric Identification . . . . . . . . . . . Julien Bringer, Herv´e Chabanne, Tom A.M. Kevenaar, and Bruno Kindarji A New Fingerprint Matching Algorithm Based on Minimum Cost Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ´ Andr´es I. Avila and Adrialy Muci
178
187
Handwriting Analysis and Signature Verification Invariant Fourier Descriptors Representation of Medieval Byzantine Neume Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dimo Dimov and Lasko Laskov
192
Bio-Inspired Reference Level Assigned DTW for Person Identification Using Handwritten Signatures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Muzaffar Bashir and J¨ urgen Kempf
200
Pressure Evaluation in On-Line and Off-Line Signatures . . . . . . . . . . . . . . Desislava Dimitrova and Georgi Gluhchev
207
Multimodal Biometrics Confidence Partition and Hybrid Fusion in Multimodal Biometric Verification System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chaw Chia, Nasser Sherkat, and Lars Nolle Multi-biometric Fusion for Driver Authentication on the Example of Speech and Face . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tobias Scheidat, Michael Biermann, Jana Dittmann, Claus Vielhauer, and Karl K¨ ummel Multi-modal Authentication Using Continuous Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . K.R. Radhika, S.V. Sheela, M.K. Venkatesha, and G.N. Sekhar
212
220
228
Biometric Systems and Knowledge Discovery Biometric System Verification Close to “Real World” Conditions . . . . . . . ´ Aythami Morales, Miguel Angel Ferrer, Marcos Faundez, Joan F` abregas, Guillermo Gonzalez, Javier Garrido, Ricardo Ribalda, Javier Ortega, and Manuel Freire
236
Developing HEO Human Emotions Ontology . . . . . . . . . . . . . . . . . . . . . . . . Marco Grassi Common Sense Computing: From the Society of Mind to Digital Intuition and beyond . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Erik Cambria, Amir Hussain, Catherine Havasi, and Chris Eckl
244
252
Biometric Systems and Security On Development of Inspection System for Biometric Passports Using Java . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Luis Ter´ an and Andrzej Drygajlo
260
Handwritten Signature On-Card Matching Performance Testing . . . . . . . . Olaf Henniger and Sascha M¨ uller
268
Classification Based Revocable Biometric Identity Code Generation . . . . Alper Kanak and Ibrahim So˜gukpinar
276
Vulnerability Assessment of Fingerprint Matching Based on Time Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Javier Galbally, Sara Carballo, Julian Fierrez, and Javier Ortega-Garcia A Matching Algorithm Secure against the Wolf Attack in Biometric Authentication Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yoshihiro Kojima, Rie Shigetomi, Manabu Inuma, Akira Otsuka, and Hideki Imai
285
293
Iris, Fingerprint and Hand Recognition A Region-Based Iris Feature Extraction Method Based on 2D-Wavelet Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nima Tajbakhsh, Khashayar Misaghian, and Naghmeh Mohammadi Bandari
301
A Novel Contourlet Based Online Fingerprint Identification . . . . . . . . . . . Omer Saeed, Atif Bin Mansoor, and M Asif Afzal Butt
308
Fake Finger Detection Using the Fractional Fourier Transform . . . . . . . . . Hyun-suk Lee, Hyun-ju Maeng, and You-suk Bae
318
Comparison of Distance-Based Features for Hand Geometry Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Javier Burgues, Julian Fierrez, Daniel Ramos, and Javier Ortega-Garcia
325
Signature Verification A Comparison of Three Kinds of DP Matching Schemes in Verifying Segmental Signatures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Seiichiro Hangai, Tomoaki Sano, and Takahiro Yoshida Ergodic HMM-UBM System for On-Line Signature Verification . . . . . . . . Enrique Argones R´ ua, David P´erez-Pi˜ nar L´ opez, and Jos´e Luis Alba Castro
333 340
Improving Identity Prediction in Signature-based Unimodal Systems Using Soft Biometrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M´ arjory Abreu and Michael Fairhurst
348
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
357
Illumination Invariant Face Recognition by Non-Local Smoothing Vitomir Štruc and Nikola Pavešić Faculty of Electrical Engineering, University of Ljubljana, Tržaška 25, SI-1000 Ljubljana, Slovenia {vitomir.struc,nikola.pavesic}@fe.uni-lj.si http://luks.fe.uni-lj.si/
Abstract. Existing face recognition techniques struggle with their performance when identities have to be determined (recognized) based on image data captured under challenging illumination conditions. To overcome the susceptibility of the existing techniques to illumination variations, numerous normalization techniques have been proposed in the literature. These normalization techniques, however, still exhibit some shortcomings and, thus, offer room for improvement. In this paper we identify the most important weaknesses of the commonly adopted illumination normalization techniques and present two novel approaches which make use of the recently proposed non-local means algorithm. We assess the performance of the proposed techniques on the YaleB face database and report preliminary results. Keywords: Face recognition, retinex theory, non-local means, illumination invariance.
1
Introduction
The performance of current face recognition technology with image data captured in controlled conditions has reached a level which allows for its deployment in a wide variety of applications. These applications typically ensure controlled conditions for the image acquisition procedure and, hence, minimize the variability in the appearance of different (facial) images of a given individual. However, when employed on facial images captured in uncontrolled and unconstrained environments, the majority of existing face recognition techniques still exhibit a significant drop in recognition performance. The reason for the deterioration in the recognition (or verification) rates can be found in the appearance variations induced by various environmental factors, among which illumination is undoubtedly one of the most important. The importance of illumination was highlighted in several empirical studies where it was shown that the illumination induced variability in facial images is often larger than the variability induced by the individual's identity [1]. Due to this susceptibility, numerous techniques have been proposed in
the literature to cope with the problem of illumination. These techniques try to tackle the illumination induced appearance variations at one of the following three levels: (i) at the pre-processing level, (ii) at the feature extraction level, and (iii) at the modeling and/or classification level. While techniques from the latter two levels represent valid efforts in solving the problem of illumination invariant face recognition, techniques operating at the pre-processing level exhibit some important advantages which make them a preferred choice when devising robust face recognition systems. One of their most essential advantages lies in the fact that they make no assumptions regarding the size and characteristics of the training set while offering a computationally simple and simultaneously effective way of achieving illumination invariant face recognition. Examples of normalization techniques operating at the pre-processing level (we will refer to these techniques as photometric normalization techniques in the remainder of this paper) include the single and multi scale retinex algorithms [2],[3], the self quotient image [4], anisotropic smoothing [5], etc. All of these techniques share a common theoretical foundation and exhibit some strengths as well as some weaknesses. In this paper we identify (in our opinion) the most important weaknesses of the existing normalization techniques and propose two novel techniques which try to overcome them. We assess the proposed techniques on the YaleB database and present encouraging preliminary results. The rest of the paper is organized as follows. In Section 2 the theory underlying the majority of photometric normalization techniques is briefly reviewed and some weaknesses of existing techniques are pointed out. The novel normalization techniques are presented in Section 3 and experimentally evaluated in Section 4. The paper concludes with some final comments in Section 5.
2
Background and Related Work
The theoretical foundation of the majority of existing photometric normalization techniques can be linked to the Retinex theory developed and presented by Land and McCann in [6]. The theory tries to explain the basic principles governing the process of image formation and/or scene perception and states that an image I(x, y) can be modeled as the product of the reflectance R(x, y) and luminance L(x, y) functions: I(x, y) = R(x, y)L(x, y). (1) Here, the reflectance R(x, y) relates to the characteristics of the objects comprising the scene of an image and is dependent on the reflectivity (or albedo) of the scene's surfaces [7], while the luminance L(x, y) is determined by the illumination source and relates to the amount of illumination falling on the observed scene. Since the reflectance R(x, y) relates solely to the objects in an image, it is obvious that (when successfully estimated) it acts as an illumination invariant representation of the input image. Unfortunately, estimating the reflectance from
the expression defined by (1) represents an ill-posed problem, i.e., it is impossible to compute the reflectance unless some assumptions regarding the nature of the illumination induced appearance variations are made. To this end, researchers introduced various assumptions regarding the luminance and reflectance functions; the most common, however, are that the luminance part of the model in (1) varies slowly with the spatial position and, hence, represents a low-frequency phenomenon, while the reflectance part represents a high-frequency phenomenon. To determine the reflectance of an image, and thus, to obtain an illumination invariant image representation, the luminance L(x, y) of an image is commonly estimated first. This estimate of L(x, y) is then exploited to compute the reflectance via the manipulation of the image model given by the expression (1), i.e.:

ln R(x, y) = ln I(x, y) − ln L(x, y)   or   R(x, y) = I(x, y)/L(x, y),   (2)
where the right hand side equation of (2) denotes an element-wise division of the input image I(x, y) with the estimated luminance L(x, y). We will refer to the reflectance computed with the left hand side equation of (2) as the logarithmic reflectance and to the reflectance computed with the right hand side equation of (2) as the quotient reflectance in the rest of this paper. As already emphasized, the luminance is considered to vary slowly with the spatial position [8] and can, therefore, be estimated as a smoothed version of the original image I(x, y). Various smoothing filters and smoothing techniques have been proposed in the literature, resulting in different photometric normalization procedures that were successfully applied to the problem of face recognition under severe illumination changes. The single scale retinex algorithm [2], for example, computes the estimate of the luminance function L(x, y) by simply smoothing the input image I(x, y) with a Gaussian smoothing filter. The illumination invariant image representation is then computed using the expression for the logarithmic reflectance. While such an approach generally produces good results with a properly selected Gaussian, its broader use in robust face recognition systems is still limited by an important weakness: at large illumination discontinuities caused by strong shadows that are cast over the face, halo effects are often visible in the computed reflectance [8]. To avoid this problem the authors of the algorithm extended their normalization technique to a multi scale form [3], where Gaussians with different widths are used and, basically, outputs of different implementations of the single scale retinex algorithm are combined to compute the final illumination invariant face representation. Another solution to the problem of halo effects was presented by Wang et al. [4] in the form of the self quotient image technique. Here, the authors approach the problem of luminance estimation by introducing an anisotropic smoothing filter. Once the anisotropic smoothing operation produces an estimate of the luminance L(x, y), the quotient reflectance R(x, y) is computed in accordance with the right hand side equation of (2). However, due to the anisotropic nature of the employed smoothing filter, flat zones in the images are not smoothed properly.
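To make the common pipeline just described concrete, the following minimal Python sketch (not taken from any of the cited papers; the function name, the default sigma and the small offset eps are our own illustrative choices) estimates the luminance by Gaussian smoothing, in the spirit of the single scale retinex, and returns the logarithmic reflectance of (2).

```python
import numpy as np
from scipy.ndimage import gaussian_filter  # Gaussian smoothing as the luminance estimator


def log_reflectance_ssr(image, sigma=15.0, eps=1.0):
    """Single-scale-retinex-style logarithmic reflectance (a sketch, not the authors' code).

    The luminance L(x, y) is estimated as a Gaussian-smoothed version of the input
    image and the logarithmic reflectance ln R = ln I - ln L of Eq. (2) is returned.
    """
    image = image.astype(np.float64) + eps            # offset to avoid log(0)
    luminance = gaussian_filter(image, sigma) + eps   # smoothed image used as luminance estimate
    return np.log(image) - np.log(luminance)          # logarithmic reflectance
```

The quotient reflectance of (2) would be obtained analogously by dividing the image by the smoothed luminance instead of subtracting the logarithms.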
Gross and Brajovic [5] presented a solution to the problem of reliable luminance estimation by adopting an anisotropic diffusion based smoothing technique. In their method the amount of smoothing at each pixel location is controlled by the image's local contrast. Adopting the local contrast as a means to control the smoothing process results in flat image regions being smoothed properly while still preserving image edges and, thus, avoiding halo effects. Despite the success of the normalization technique in effectively determining the quotient reflectance, one could still voice some misgivings. A known issue with anisotropic diffusion based smoothing is that it smoothes the image only in the direction orthogonal to the image's gradient [9]. Thus, it effectively preserves only straight edges, but struggles at edge points with high curvature (e.g., at corners). In these situations an approach that better preserves edges would be preferable. To this end, we present in the next section two novel algorithms which make use of the recently proposed non-local means algorithm.
3
Non-Local Means for Luminance Estimation
3.1
The Non-Local Means Algorithm
The non-local means (NL means) algorithm [9] is a recently proposed image denoising technique, which, unlike existing denoising methods, considers pixel values from the entire image for the task of noise reduction. The algorithm is based on the fact that for every small window of the image several similar windows can be found in the image as well, and, moreover, that all of these windows can be exploited to denoise the image. Let us denote an image contaminated with noise as In(x) ∈ R^{a×b}, where a and b are image dimensions in pixels, and let x stand for an arbitrary pixel location x = (x, y) within the noisy image. The NL means algorithm constructs the denoised image Id(x) by computing each pixel value of Id(x) as a weighted average of pixels comprising In(x), i.e. [9]:

Id(x) = Σ_{z ∈ In(x)} w(z, x) In(z),   (3)

where w(z, x) represents the weighting function that measures the similarity between the local neighborhoods of the pixel at the spatial locations z and x. Here, the weighting function is defined as follows:

w(z, x) = (1/Z(z)) exp(−‖Gσ · (In(Ωx) − In(Ωz))‖²₂ / h²)   and   Z(z) = Σ_{x ∈ In(x)} exp(−‖Gσ · (In(Ωx) − In(Ωz))‖²₂ / h²).   (4)

In the above expressions Gσ denotes a Gaussian kernel with the standard deviation σ, Ωx and Ωz denote the local neighborhoods of the pixels at the locations x and z, respectively, h stands for the parameter that controls the decay of the exponential function, and Z(z) represents a normalizing factor.
Fig. 1. The principle of the NL means algorithm: an input image (left), similar and dissimilar image windows (right)
From the presented equations it is clear that if the local neighborhoods of a given pair of pixel locations z and x display a high degree of similarity, the pixels at z and x will be assigned relatively large weights when computing their denoised estimates. Some examples of image windows used by the algorithm are presented in Fig. 1. Here, similar image windows are marked white, while dissimilar image windows are marked black. When computing the denoised value of the center pixel of each of the white windowed image regions, the center pixels of the similar windows will be assigned relatively large weights, while the center pixels of the dissimilar windows will be assigned relatively low weights. With a proper selection of the decay parameter h, the presented algorithm results in a smoothed image with preserved edges. Hence, it can be used to estimate the luminance of an input image and, consequently, to compute the (logarithmic) reflectance. An example of the deployment of the NL means algorithm (for a 5×5 local neighborhood and h = 10) for estimation of the logarithmic reflectance is shown in Fig. 2 (left triplet).
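As an illustration of (3) and (4), the naive Python sketch below computes the NL means smoothed image that can serve as the luminance estimate. It assumes a 2-D grey-scale array, uses the 5×5 neighborhood and h = 10 mentioned above, and the standard deviation of the Gaussian kernel as well as all implementation details (padding, the quadratic-cost loops) are our own simplifications rather than the authors' code.

```python
import numpy as np


def nl_means_luminance(image, patch=5, h=10.0, sigma=1.0):
    """Naive O(N^2) non-local means smoothing, usable as a luminance estimate.

    Every output pixel is a weighted average of all image pixels; the weights decay
    with the Gaussian-weighted Euclidean distance between the two local patches,
    as in Eqs. (3)-(4). Far from optimized; intended only to show the computation.
    """
    img = image.astype(np.float64)
    r = patch // 2
    padded = np.pad(img, r, mode='reflect')

    # Gaussian kernel G_sigma applied to the squared patch differences
    ax = np.arange(-r, r + 1)
    xx, yy = np.meshgrid(ax, ax)
    G = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    G /= G.sum()
    Gflat = G.ravel()

    # collect one patch per pixel: shape (num_pixels, patch*patch)
    H, W = img.shape
    patches = np.empty((H * W, patch * patch))
    for i in range(H):
        for j in range(W):
            patches[i * W + j] = padded[i:i + patch, j:j + patch].ravel()

    flat = img.ravel()
    out = np.empty(H * W)
    for p in range(H * W):
        d2 = np.sum(Gflat * (patches - patches[p]) ** 2, axis=1)  # weighted patch distances
        w = np.exp(-d2 / h ** 2)                                  # decaying weights of Eq. (4)
        out[p] = np.dot(w, flat) / w.sum()                        # normalized weighted average, Eq. (3)
    return out.reshape(H, W)
```

The logarithmic reflectance then follows from (2) by subtracting the logarithm of this smoothed estimate from the logarithm of the input image.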
Fig. 2. Two sample images processed with the NL means (left triplet) and adaptive NL means (right triplet) algorithms. Order of images in each triplet (from left to right): the input image, the estimated luminance, the logarithmic reflectance.
3.2
The Adaptive Non-Local Means Algorithm
The NL means algorithm assigns different weights to each of the pixel values in the noisy image In(x) when estimating the denoised image Id(x). As we have shown in the previous section, this weight assignment is based on the similarity of the local neighborhoods of arbitrary pixel pairs and is controlled by the decay parameter h. Large values of h result in a slow decay of the Gaussian weighted Euclidean distance (recall that this distance serves as the similarity measure between two local neighborhoods) and, hence, more neighborhoods are considered similar and are assigned relatively large weights. Small values of h, on the other hand, result in a fast decay of the Euclidean similarity measure and consequently only a small number of pixels is assigned a large weight for the estimation of the denoised pixel values. Rather than using the original NL means algorithm for estimation of the luminance of an image, we propose in this paper to exploit an adaptive version of the algorithm, where the decay parameter h is a function of local contrast and not a fixed and preselected value. At regions of low contrast, which represent homogeneous areas, the image should be smoothed more (i.e., more pixels should be considered for the estimation of the denoised pixel value), while in regions of high contrast the image should be smoothed less (i.e., fewer pixels should be considered for the estimation of the denoised pixel value). Following the work of Gross and Brajovic [5], we define the local contrast between neighboring pixel locations a and b as: ρa,b = |In(a) − In(b)| / |In(a) + In(b)|. Assuming that a is an arbitrary pixel location within In(x) and b stands for a neighboring pixel location above, below, left or right from a, we can construct four contrast images encoding the local contrast in one of the possible four directions. The final contrast image Ic(x) is ultimately computed as the average of the four (directional) contrast images. To link the decay parameter h to the contrast image we first compute the logarithm of the inverse of the (8-bit grey-scale) contrast image Iic(x) = log[1/Ic(x)], where 1 denotes a matrix of all ones and the operator "/" stands for the element-wise division. Next, we linearly map the values of our inverted contrast image Iic(x) to values of the decay parameter h, which now becomes a function of the spatial location: h(x) = [(Iic(x) − Iic,min)/(Iic,max − Iic,min)] ∗ hmax + hmin, where Iic,max and Iic,min denote the maximum and minimum values of the inverted contrast image Iic(x), respectively, and hmax and hmin stand for the target maximum and minimum values of the decay parameter h. An example of the deployment of the presented algorithm is shown in Fig. 2 (right triplet).
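A small sketch of this contrast-driven mapping is given below; the helper name, the epsilon used to avoid divisions by zero and the circular handling of the image borders are our own assumptions, and the default hmax simply echoes the value found to work best in the experiments of Section 4.

```python
import numpy as np


def decay_map(image, h_min=0.01, h_max=80.0):
    """Map local contrast to a spatially varying decay parameter h(x) (a sketch).

    Directional local contrasts are averaged into Ic(x), inverted through a
    logarithm and linearly mapped to the range [h_min, h_max].
    """
    I = image.astype(np.float64)
    eps = 1e-8

    def contrast(a, b):                       # rho_{a,b} = |a - b| / |a + b|
        return np.abs(a - b) / (np.abs(a + b) + eps)

    # contrast towards the pixel above, below, left and right
    # (np.roll wraps around at the borders; good enough for a sketch)
    up    = contrast(I, np.roll(I,  1, axis=0))
    down  = contrast(I, np.roll(I, -1, axis=0))
    left  = contrast(I, np.roll(I,  1, axis=1))
    right = contrast(I, np.roll(I, -1, axis=1))
    Ic = (up + down + left + right) / 4.0     # final (averaged) contrast image

    Iic = np.log(1.0 / (Ic + eps))            # inverted, log-compressed contrast image
    Iic = (Iic - Iic.min()) / (Iic.max() - Iic.min() + eps)
    return Iic * h_max + h_min                # low contrast -> large h -> stronger smoothing
```

The resulting h(x) map would then replace the global decay parameter h in the weight computation of (4).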
4
Experiments
To assess the two presented photometric normalization techniques we made use of the YaleB face database [10]. The database contains images of ten distinct subjects, each photographed under 576 different viewing conditions (9 poses × 64 illumination conditions). Thus, a total of 5760 images is featured in the database. However, as we are interested only in testing our photometric normalization techniques, we make use of a subset of 640 images with frontal pose in our experiments. We partition the 640 images into five image sets according to the extremity in illumination under which they were taken and employ the first image set for training and the remaining ones for testing. In the experiments we use principal component analysis as the feature extraction technique and the nearest neighbor (to the mean) classifier in conjunction with the cosine similarity measure. The number of features is set to its maximal value in all experiments. In our first series of recognition experiments we assess the performance of the NL means (NLM) and adaptive NL means (ANL) algorithms for varying values of their
Table 1. The rank one recognition rates (in %) for the NLM and ANL algorithms

Algorithm  Parameter value  Image set no. 2  Image set no. 3  Image set no. 4  Image set no. 5  Average
ANL        hmax = 40        100.0            98.3             90.7             87.9             94.2
ANL        hmax = 80        100.0            100.0            94.3             97.4             97.9
ANL        hmax = 120       100.0            100.0            94.3             92.6             96.7
ANL        hmax = 160       100.0            100.0            92.1             84.7             94.2
ANL        hmax = 200       100.0            100.0            92.9             82.1             93.8
NLM        h = 10           100.0            100.0            91.4             96.3             96.9
NLM        h = 30           100.0            100.0            95.0             99.5             98.6
NLM        h = 60           100.0            100.0            97.1             92.6             97.4
NLM        h = 120          100.0            100.0            95.7             85.3             95.3
parameters, i.e., the decay parameter h for the NLM algorithm and hmax for the ANL algorithm. It has to be noted that the parameter hmin of the ANL algorithm was fixed at the value of hmin = 0.01 and the local neighborhood of 5 × 5 pixels was chosen for the NLM and ANL algorithms in all experiments. The results of the experiments in terms of the rank one recognition rates for the individual image sets as well as their average values over the entire database are presented in Table 1. We can see that the best performing implementations of the NLM and ANL algorithms feature parameter values of h = 30 and hmax = 80, respectively. In our second series of recognition experiments we compare the performance of the two proposed algorithms (for h = 30 and hmax = 80) and several popular photometric normalization techniques. Specifically, the following techniques were implemented for comparison: the logarithm transform (LN), histogram equalization (HQ), the single scale retinex (SR) technique and the adaptive retinex normalization approach (AR) presented in [8]. For baseline comparisons, experiments on unprocessed grey scale images (GR) are conducted as well. It should be noted that the presented recognition rates are only indicative of the general performance of the tested techniques, as the YaleB database represents a rather small database, where it is possible to easily devise a normalization technique that effectively discriminates among different images of the small number of subjects. Several techniques were presented in the literature that normalize the facial images by extremely compressing the dynamic range of the images, resulting in the suppression of most of the image's variability, whether induced by illumination or by the subject's identity. The question of how to scale up these techniques for use with larger numbers of subjects, however, still remains unanswered. To get an impression of the scalability of the tested techniques we also present recognition rates obtained with the estimated logarithmic luminance functions (where applicable). These results provide an estimate of how much of the useful information was removed from the facial image during the normalization. For the experiments with the logarithmic luminance functions, logarithm-transformed images from the first image set were employed for training. The presented results show the competitiveness of the proposed techniques. Similar to the best performing AR technique, they achieve an average recognition rate of approximately 98%, but remove less of the useful information as shown by the results obtained on the luminance estimates. The results suggest that the proposed normalization techniques will perform well on larger databases as well.
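For completeness, the recognition back-end used in these experiments (PCA features with a nearest-neighbor-to-the-class-mean classifier and the cosine similarity measure) can be sketched as follows; this is our reading of the protocol described above, with scikit-learn's PCA standing in for whatever implementation was actually used, and all names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA


def pca_nearest_mean_cosine(train_X, train_y, test_X):
    """Sketch of the experimental back-end: PCA features + nearest class mean (cosine)."""
    pca = PCA()                                    # maximal number of components, as in the text
    F_train = pca.fit_transform(train_X)
    F_test = pca.transform(test_X)

    classes = np.unique(train_y)
    means = np.stack([F_train[train_y == c].mean(axis=0) for c in classes])

    # cosine similarity between every test feature vector and every class mean
    a = F_test / (np.linalg.norm(F_test, axis=1, keepdims=True) + 1e-12)
    b = means / (np.linalg.norm(means, axis=1, keepdims=True) + 1e-12)
    sims = a @ b.T
    return classes[np.argmax(sims, axis=1)]        # predicted identity per test image
```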
Table 2. Comparison of the rank one recognition rates (in %) for various algorithms

Representation            GR     HQ     LN     SR     AR     NLM    ANL
Normalized image
  Image set No. 2         100.0  100.0  100.0  100.0  100.0  100.0  100.0
  Image set No. 3         100.0  100.0  98.3   100.0  100.0  100.0  100.0
  Image set No. 4         57.9   58.6   58.6   92.1   97.1   95.0   94.3
  Image set No. 5         16.3   60.0   52.6   84.2   98.4   99.5   97.4
  Avg.                    68.6   79.7   77.4   94.1   98.9   98.6   97.9
Log. luminance ln L(x, y)
  Image set No. 2         n/a    n/a    n/a    100.0  100.0  100.0  100.0
  Image set No. 3         n/a    n/a    n/a    90.8   95.0   86.7   65.8
  Image set No. 4         n/a    n/a    n/a    46.4   49.3   39.3   36.4
  Image set No. 5         n/a    n/a    n/a    41.1   44.3   26.3   26.8
  Avg.                    n/a    n/a    n/a    69.6   72.1   63.1   57.3

5
Conclusion and Future Work
In this paper we have presented two novel image normalization techniques, which try to compensate for the illumination induced appearance variations of facial images at the preprocessing level. The feasibility of the presented techniques was successfully demonstrated on the YaleB database, where encouraging results were achieved. Our future work with respect to the normalization techniques will be focused on their evaluation on larger and more challenging databases.
References
1. Heusch, G., Cardinaux, F., Marcel, S.: Lighting Normalization Algorithms for Face Verification. IDIAP-com 05-03 (March 2005)
2. Jobson, D.J., Rahman, Z., Woodell, G.A.: Properties and Performance of a Center/Surround Retinex. IEEE Transactions on Image Processing 6(3), 451–462 (1997)
3. Jobson, D.J., Rahman, Z., Woodell, G.A.: A Multiscale Retinex for Bridging the Gap Between Color Images and the Human Observations of Scenes. IEEE Transactions on Image Processing 6(7), 897–1056 (1997)
4. Wang, H., Li, S.Z., Wang, Y., Zhang, J.: Self Quotient Image for Face Recognition. In: Proc. of the Int. Conference on Pattern Recognition, pp. 1397–1400 (2004)
5. Gross, R., Brajovic, V.: An Image Preprocessing Algorithm for Illumination Invariant Face Recognition. In: Proc. of AVPBA 2003, pp. 10–18 (2003)
6. Land, E.H., McCann, J.J.: Lightness and Retinex Theory. Journal of the Optical Society of America 61(1), 1–11 (1971)
7. Short, J., Kittler, J., Messer, K.: Photometric Normalisation for Face Verification. In: Kanade, T., Jain, A., Ratha, N.K. (eds.) AVBPA 2005. LNCS, vol. 3546, pp. 617–626. Springer, Heidelberg (2005)
8. Park, Y.K., Park, S.L., Kim, J.K.: Retinex Method Based on Adaptive Smoothing for Illumination Invariant Face Recognition. Signal Processing 88(8), 1929–1945 (2008)
9. Buades, A., Coll, B., Morel, J.M.: On Image Denoising Methods. Prepublication, http://www.cmla.ens-cachan.fr
10. Georghiades, A.G., Belhumeur, P.N., Kriegman, D.J.: From Few to Many: Illumination Cone Models for Face Recognition under Variable Lighting and Pose. IEEE TPAMI 23(6), 643–660 (2001)
Manifold Learning for Video-to-Video Face Recognition Abdenour Hadid and Matti Pietikäinen Machine Vision Group, P.O. Box 4500, FI-90014, University of Oulu, Finland
Abstract. We look in this work at the problem of video-based face recognition in which both training and test sets are video sequences, and propose a novel approach based on manifold learning. The idea consists of first learning the intrinsic personal characteristics of each subject from the training video sequences by discovering the hidden low-dimensional nonlinear manifold of each individual. Then, a target face video sequence is projected and compared to the manifold of each subject. The closest manifold, in terms of a recently introduced manifold distance measure, determines the identity of the person in the sequence. Experiments on a large set of talking faces under different image resolutions show very promising results (recognition rate of 99.8%), outperforming many traditional approaches.
1
Introduction
Recently, there has been an increasing interest in video-based face recognition (e.g. [1,2,3]). This is partially due to the limitations of still image-based methods in handling illumination changes, pose variations and other factors. The most studied scenario in video-based face recognition is having a set of still images as the gallery (enrollment) and video sequences as the probe (test set). However, in some real-world applications such as in human-computer interaction and content based video retrieval, both training and test sets can be video sequences. In such settings, performing video-to-video matching may be crucial for robust face recognition but this task is far from being trivial. There are several ways of approaching the problem of face recognition in which both training and test sets are video sequences. Basically, one could build an appearance-based system by selecting a few exemplars from the training sequences as gallery models and then performing still image-based recognition and fusing the results over the target video sequence [4]. Obviously, such an approach is not optimal as some important information in the video sequences may be left out. Another direction consists of using spatiotemporal representations for encoding the information both in the training and test video sequences [1,2,3]. Perhaps the most popular approach in this category is based on the hidden Markov models (HMMs) which have been successfully applied to face recognition from videos [2]. The idea is quite simple: in the training phase, an HMM is created to learn both the statistics and temporal dynamics of each individual. During the recognition process, the temporal characteristics of the face sequence are
analyzed over time by the HMM corresponding to each subject. The likelihood scores provided by the HMMs are compared. The highest score provides the identity of a face in the video sequence. Unfortunately, most methods which use spatiotemporal representations for face recognition have not yet shown their full potential as they suffer from different drawbacks, such as the use of only global features, while local information has also been shown to be important to facial image analysis [5], and the lack of discrimination between the facial dynamics which are useful for recognition and those which can hinder the recognition process [6]. Very recently, inspired by studies in neuroscience emphasizing manifold ways of visual perception, we introduced in [7] a novel method for gender classification from videos using manifold learning. The idea consists of clustering the face sequences in the low-dimensional space based on the intrinsic characteristics of men and women. Then, a target face sequence is projected into both the men and women manifolds for classification. The proposed approach reached excellent results not only in the gender recognition problem but also in age and ethnic classification from face video sequences. In this work, we extend the approach proposed in [7] to the problem of video-to-video face recognition. Thus, we propose to first learn and discover the hidden low-dimensional nonlinear manifold of each individual. Then, a target face sequence can be projected into each manifold for classification. The "closest" manifold will then determine the identity of the person in the target face video sequence. The experiments which are presented in Section 4 show that such a manifold-based approach yields excellent results, outperforming many traditional methods for video-based face recognition. The rest of this paper is organized as follows. Section 2 explains the notion of face manifold and discusses some learning methods. Then, we describe our proposed approach to the problem of video-to-video face recognition and the experimental analysis in Sections 3 and 4, respectively. Finally, we draw a conclusion in Section 5.
2
Face Manifold
Let I(P, s) denote a face image of a person P at configuration s. The variable s describes a combination of factors such as facial expression, pose, illumination, etc. Let ξ^p,

ξ^p = {I(P, s) | s ∈ S}   (1)

be the collection of face images of the person P under all possible configurations S. The ξ^p thus defined is called the face manifold of person P. Additionally, if we consider all the face images of different individuals, then we obtain the face manifold ξ:

ξ = ∪_p ξ^p.   (2)
Such a manifold ξ resides only in a small subspace of the high-dimensional image space. Consider the example of Fig. 1 showing face images of a person when moving his face from left to right. The only obvious degree of freedom in this case is the rotation angle of the face. Therefore, the intrinsic dimensionality of
Fig. 1. An example showing a face manifold of a given subject embedded in the high dimensional image space
the faces is very small (close to 1). However, these faces are embedded in a 1600-dimensional image space (since the face images have 40×40 = 1600 pixels) which is highly redundant. If one could discover the hidden low-dimensional structure of these faces (the rotation angle of the face) from the input observations, this would greatly facilitate the further analysis of the face images such as visualization, classification, retrieval etc. Our proposed approach to the problem of video-to-video face recognition, which is described in Section 3, exploits the properties of face manifolds. Neuroscience studies also pointed out the manifold ways of visual perception [8]. Indeed, facial images are not "isolated" patterns in the image space but lie on a nonlinear low-dimensional manifold. The key issue in manifold learning is to discover the low-dimensional manifold embedded in the high dimensional space. This can be done by projecting the face images into low-dimensional coordinates. For that purpose, there exist several methods. The traditional ones are Principal Component Analysis (PCA) and Multidimensional Scaling (MDS). These methods are simple to implement and efficient in discovering the structure of data lying on or near linear subspaces of the high-dimensional input space. However, face images do not satisfy this constraint as they lie on a complex nonlinear and nonconvex manifold in the high-dimensional space. Therefore, such linear methods generally fail to discover the real structure of the face images in the low-dimensional space. As an alternative to PCA and MDS, one can consider some nonlinear dimensionality reduction methods such as Self-Organizing Maps (SOM) [9], Generative Topographic Mapping (GTM) [10], Sammon's Mappings (SM) [11] etc. Though these methods can also handle nonlinear manifolds, most of them tend to involve several free parameters such as learning rates and convergence criteria. In addition, most of these methods do not have an obvious guarantee of convergence to the global optimum. Fortunately, in recent years, a set of new manifold learning algorithms has emerged. These methods are based on an eigen-decomposition and combine the major algorithmic features of PCA and MDS (computational efficiency, global optimality, and flexible
asymptotic convergence guarantees) with flexibility to learn a broad class of nonlinear manifolds. Among these algorithms are Locally Linear Embedding (LLE) [12], ISOmetric feature MAPping (ISOMAP) [13] and Laplacian Eigenmaps [14].
3
Proposed Approach to Video-Video Face Recognition
We approach the problem of video-to-video face recognition from a manifold learning perspective. We adopt the LLE algorithm for manifold learning due to its demonstrated simplicity and efficiency in recovering meaningful low-dimensional structures hidden in complex and high-dimensional data such as face images. LLE is an unsupervised learning algorithm which maps high-dimensional data onto a low-dimensional, neighbor-preserving embedding space. In brief, considering a set of N face images and organizing them into a matrix X (where each column vector represents a face), the LLE algorithm then involves the following three steps:

1. Find the k nearest neighbors of each point Xi.
2. Compute the weights Wij that best reconstruct each data point from its neighbors, minimizing the cost in Equation (3):

ε(W) = Σ_{i=1}^{N} ‖ Xi − Σ_{j ∈ neighbors(i)} Wij Xj ‖²,   (3)

while enforcing the constraints Wij = 0 if Xj is not a neighbor of Xi, and Σ_{j=1}^{N} Wij = 1 for every i (to ensure that W is translation-invariant).
3. Compute the embedding Y (of lower dimensionality d << D, where D is the dimension of the input data) best reconstructed by the weights Wij, minimizing the quadratic form in Equation (4):

Φ(Y) = Σ_{i=1}^{N} ‖ Yi − Σ_{j ∈ neighbors(i)} Wij Yj ‖²,   (4)

under the constraints Σ_{i=1}^{N} Yi = 0 (to ensure a translation-invariant embedding) and (1/N) Σ_{i=1}^{N} Yi Yi^T = I (normalized unit covariance).

The aim of the first two steps of the algorithm is to preserve the local geometry of the data in the low-dimensional space, while the last step discovers the global structure by integrating information from overlapping local neighborhoods. LLE is an efficient approach to compute the low-dimensional embeddings of high-dimensional data assumed to lie on a non-linear manifold. Its ability to deal with large sizes of high-dimensional data and its non-iterative way to find the embeddings make it attractive.
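A minimal sketch of the per-person manifold learning step is given below; the paper does not specify an implementation, so scikit-learn's LocallyLinearEmbedding is used here as a stand-in for the three steps above, the data layout (one flattened frame per row) is our assumption, and the values of k and d are placeholders.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding


def learn_person_manifolds(train_sequences, k=8, d=5):
    """Learn the LLE embedding xi^P of every person's training frames (a sketch).

    train_sequences: dict mapping person id -> ndarray of shape (N, D), one
    flattened face frame per row. Returns person id -> (training faces, their
    d-dimensional embedding xi^P of shape (N, d)).
    """
    manifolds = {}
    for person, X in train_sequences.items():
        lle = LocallyLinearEmbedding(n_neighbors=k, n_components=d)
        manifolds[person] = (X, lle.fit_transform(X))  # steps 1-3 applied to this person's frames
    return manifolds
```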
Given a set of training face video sequences with one or more sequences per person, we first apply, for each person, the LLE algorithm on all of his/her face images in the training set. We then obtain coordinates in the low-dimensional space, thus defining a face manifold of the person. Let us denote the obtained embedding for a given person P as ξ^P. Note that the calculation of ξ^P involves only two free parameters, namely the number of neighbors (k) and the dimension of the embedding space (d). To determine the identity of an unknown person in a given face sequence {Faceframe(1), Faceframe(2), ..., Faceframe(L)} we first project every face instance Faceframe(i) into the face manifold of each subject in the low-dimensional space. The "closest" manifold will then determine the identity of the person in the sequence. The projection of the target face sequence into the manifold of person P is done using the following steps:

a. Let Xi be the column vector representing the face image (Faceframe(i)) from the new sequence.
b. Find the k nearest neighbors of each point Xi among the training face samples of person P.
c. Compute the weights Wij that best reconstruct each data point Xi from its neighbors using Equation (3).
d. Use the obtained weights Wij to compute the embedding Yi^P of each point Xi (i.e. Faceframe(i)) as:

Yi^P = Σ_{j ∈ neighbors(Xi)} Wij ξj^P,   (5)

where ξj^P refers to the embedding point of the j-th neighbor of the point Xi in the face manifold of person P. As a result, we obtain the embedding Y^P of the new face sequence in every face manifold ξ^P. Then, we compute how close the embedding Y^P is to the face manifold ξ^P using:

D^P = (1/L) Σ_{i=1}^{L} ‖ Yi^P − ξj(i)^P ‖,   (6)

where L is the length of the target face sequence, Yi^P is the embedding of the point Xi in the low-dimensional space and ξj(i)^P is the closest point (in terms of Euclidean distance) from the manifold ξ^P to Yi^P. Finally, the identity of the person in the target face sequence is given by: argmin_P Σ_{i=1}^{L} ‖ Yi^P − ξj(i)^P ‖.
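The projection steps a–d and the decision rule of (5)–(6) can be sketched in plain NumPy as follows, consuming the per-person manifolds from the previous sketch; the regularization added to the local Gram matrix and all function names are our own choices, not part of the paper.

```python
import numpy as np


def project_onto_manifold(frame, X_train, xi_train, k=8, reg=1e-3):
    """Steps a-d: embed one probe frame into person P's manifold (a sketch).

    frame:    (D,) probe face vector
    X_train:  (N, D) person P's training faces in image space
    xi_train: (N, d) their LLE embedding xi^P
    """
    # a-b. k nearest neighbours of the frame among the training faces of person P
    dists = np.linalg.norm(X_train - frame, axis=1)
    nbrs = np.argsort(dists)[:k]

    # c. reconstruction weights minimising Eq. (3) under the sum-to-one constraint
    Z = X_train[nbrs] - frame                  # differences to the neighbours
    C = Z @ Z.T
    C += reg * np.trace(C) * np.eye(k)         # small regularisation for numerical stability
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()

    # d. Eq. (5): apply the same combination to the embedded neighbours xi^P
    return w @ xi_train[nbrs]


def manifold_distance(probe_frames, X_train, xi_train, k=8):
    """Eq. (6): average distance of the embedded probe sequence to manifold xi^P."""
    Y = np.array([project_onto_manifold(f, X_train, xi_train, k) for f in probe_frames])
    # for every embedded frame take its closest point on the manifold
    closest = np.min(np.linalg.norm(xi_train[None, :, :] - Y[:, None, :], axis=2), axis=1)
    return closest.mean()


def identify(probe_frames, manifolds):
    """Identity = argmin_P D^P; `manifolds` maps person id -> (X_train, xi_train)."""
    scores = {p: manifold_distance(probe_frames, X, xi) for p, (X, xi) in manifolds.items()}
    return min(scores, key=scores.get)
```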
4
Experimental Analysis
For experimental analysis, we considered the VidTIMIT [15] face video database containing 43 talking subjects (19 female and 24 male), reciting ten short sentences in three sessions with an average delay of a week between sessions, allowing for appearance and mood changes. In total, there are ten face sequences per person. From each sequence, we automatically detected the eye positions from
Fig. 2. Examples of facial images extracted from videos of three different subjects
the first frame. The determined eye positions are then used to crop the facial area in the whole sequence, yielding face images that are not perfectly aligned. Finally, we scaled the resulting images into four different resolutions: 20×20, 40×40, 60×60 and 80×80 pixels. Examples of face images from some sequences are shown in Fig. 2. For evaluation, we randomly selected one face sequence per person for training while the rest was used for testing. In all our experiments, we considered the average recognition rates of 100 random permutations. For comparative study, we also implemented some state-of-the-art methods including three still image-based methods (PCA, LDA and LBP [16]) and two spatiotemporal-based approaches (HMM [2] and ARMA [1]). For still image-based analysis, we adopted a scheme proposed in [4] to perform appearance-based face recognition from videos. The approach consists of performing unsupervised learning to extract a set of K most representative samples (or exemplars) from the raw gallery videos (K = 3 in our experiments). Once these exemplars are extracted, we build a view-based system and use a probabilistic voting strategy to recognize the individuals in the probe video sequences. The performance of our proposed approach and also those of the considered methods under four different resolutions are plotted in Fig. 3. From the results, we can notice that all the methods perform quite well but the proposed manifold-based approach significantly outperforms all other methods in all image resolution configurations. For instance, at an image resolution of 60 × 60, our approach yielded a recognition rate of 99.8%, while PCA, LDA, LBP, HMM, and ARMA yielded recognition rates of 94.2%, 94.0%, 97.6%, 92.9% and 95.8%, respectively. It is worth noting that, in addition to its efficiency, our approach involves only two free parameters which are quite easy to determine [7]. From the results, we can also notice that the spatiotemporal-based methods (HMM and ARMA) do not always perform better than PCA, LDA, and LBP
Fig. 3. Performance of the considered methods under four different resolutions
From the results, we can also notice that the spatiotemporal methods (HMM and ARMA) do not always perform better than the still-image-based methods (PCA, LDA, and LBP). This supports the conclusions of other researchers that using spatiotemporal representations does not systematically enhance recognition performance. Our results also show that low image resolutions affect all methods; the best results with the proposed manifold-based approach are obtained at an image resolution of 60 × 60 pixels.
5 Conclusion
To overcome the limitations of traditional video-based face recognition methods, we introduced a novel video-to-video matching approach based on manifold learning. Our approach consisted of first learning the hidden low-dimensional manifold of each individual. Then, a target face sequence is projected into each manifold for classification. The closest manifold determined the identity of the person in the target face video sequence. Experiments on a large set of talking faces under different resolutions showed excellent results, outperforming state-of-the-art approaches. Our future work consists of extending our approach to multi-view face recognition from videos and experimenting with much larger databases.
Acknowledgment. The financial support of the Academy of Finland is gratefully acknowledged. This work has been partially performed within the EU-funded project MOBIO (7th Framework Programme of the European Union, contract number IST-214324).
References
1. Aggarwal, G., Chowdhury, A.R., Chellappa, R.: A system identification approach for video-based face recognition. In: 17th International Conference on Pattern Recognition, August 2004, vol. 4, pp. 175–178 (2004)
2. Liu, X., Chen, T.: Video-based face recognition using adaptive hidden Markov models. In: IEEE Int. Conf. on Computer Vision and Pattern Recognition, June 2003, pp. 340–345 (2003)
3. Lee, K.C., Ho, J., Yang, M.H., Kriegman, D.: Video-based face recognition using probabilistic appearance manifolds. In: IEEE Int. Conf. on Computer Vision and Pattern Recognition, June 2003, pp. 313–320 (2003)
4. Hadid, A., Pietikäinen, M.: Selecting models from videos for appearance-based face recognition. In: 17th International Conference on Pattern Recognition, August 2004, vol. 1, pp. 304–308 (2004)
5. Heisele, B., Ho, P., Wu, J., Poggio, T.: Face recognition: Component based versus global approaches. Computer Vision and Image Understanding 91(1-2), 6–21 (2003)
6. Hadid, A., Pietikäinen, M.: An experimental investigation about the integration of facial dynamics in video-based face recognition. Electronic Letters on Computer Vision and Image Analysis (ELCVIA) 5(1), 1–13 (2005)
7. Hadid, A., Pietikäinen, M.: Manifold learning for gender classification from face sequences. In: Proc. 3rd IAPR/IEEE International Conference on Biometrics, ICB 2009 (2009)
8. Seung, H.S., Lee, D.: The manifold ways of perception. Science 290(12), 2268–2269 (2000)
9. Kohonen, T. (ed.): Self-Organizing Maps. Springer, Berlin (1997)
10. Bishop, C.M., Svensen, M., Williams, C.K.I.: GTM: The generative topographic mapping. Neural Computation 10(1), 215–234 (1998)
11. Sammon, J.: A nonlinear mapping for data structure analysis. IEEE Transactions on Computers 18(5), 401–409 (1969)
12. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326 (2000)
13. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000)
14. Belkin, M., Niyogi, P.: Laplacian eigenmaps and spectral techniques for embedding and clustering. In: Dietterich, T.G., Becker, S., Ghahramani, Z. (eds.) Advances in Neural Information Processing Systems 14, pp. 585–591. MIT Press, Cambridge (2002)
15. Sanderson, C. (ed.): Biometric Person Recognition: Face, Speech and Fusion. VDM Verlag (2008)
16. Ahonen, T., Hadid, A., Pietikäinen, M.: Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(12), 2037–2041 (2006)
MORPH: Development and Optimization of a Longitudinal Age Progression Database
Allen W. Rawls and Karl Ricanek Jr.
Department of Computer Science, University of North Carolina Wilmington
601 South College Road, Wilmington, NC 28403 USA
{rawlsa,ricanekk}@uncw.edu
Abstract. This paper details recent improvements to MORPH, a longitudinal face database developed for age progression and age estimation research. This database is primarily used to solve age-related problems of facial recognition systems. The data corpus provides the largest set of publicly available longitudinal adult images with supporting metadata and is still expanding; longitudinal spans range from several days to over twenty years. The metadata provided aids in classification by age, gender, and race and includes other key parameters that affect aging appearance. Keywords: biometric database, face database, longitudinal images, morphological, age progression, age estimation, adult faces, face aging.
1 Introduction
As an aspect of identification and verification, face recognition is less intrusive than other forms of biometrics: subjects can be observed without their knowledge. With other biometrics such as iris, retinal, signature recognition, finger geometry, etc., it is hard to acquire data without the subject's knowledge. For this reason, facial recognition is useful for unobtrusive security. Unfortunately, face recognition algorithms are not robust to aging. If the known image of a subject is five years older than the current image, the algorithms might fail to identify the subject. The algorithmic problems and solutions for facial aging are outside the scope of this paper, but can be found in [1, 2, 3, 4, 5]. To effectively address this problem in facial recognition, a longitudinal database will assist researchers in testing and developing solutions that are robust to aging.
2 Face Databases
Face-based biometrics research has yielded a multitude of data corpora, covering face and gesture recognition. Each database is tuned to an individual research niche, such as gender, ethnicity, pose, and lighting, or is developed to focus on face-based biometrics, face recognition, face modeling, and photo-realistic animation. For research primarily in facial aging, the data corpus must contain multiple images
of the same subject over a period of time. There are three known publicly available databases that contain longitudinal images of individuals at different ages: FERET [6], FG-NET [7], and MORPH, which is discussed extensively in this paper.
2.1 FERET
The Facial Recognition Technology (FERET) database consists of 14,126 images in 1,564 sets of 8-bit gray-scale portable gray map (PGM) images, containing several frontal and left and right profiles. The images were collected in 15 sessions between August 1993 and July 1996. The FERET database was used extensively in the Face Recognition Vendor Test (FRVT) 2000.
2.2 FG-NET
Whereas the FERET database contains numerous facial images, the images do not span a range of years for each individual. The Face and Gesture Recognition Research Network (FG-NET) (Lanitis 2002) database contains 1002 images spanning up to five years, but contains a small number of subjects (n=82); the images are user-submitted snapshots taken over the lifetime of the subject. The average number of images per subject is 12, with a lower bound of 6 images per subject. FG-NET provides no data elements on the key parameters that affect the appearance of the face during adult aging, such as race, gender, height, and weight.
3 MORPH Database
The MORPH Craniofacial Morphological Face Database (MORPH) was developed by researchers at the University of North Carolina Wilmington as a collection of longitudinal face images for institutional research. It was primarily developed to evaluate how age progression affects the efficacy of computer algorithms in correctly matching faces of the same adult individual at different ages. One of the attractions of the MORPH database is that it combines several features that are distinct in the previously mentioned databases. MORPH contains detailed metadata, or key parameters, for all images contained in the corpus to assist in forming face-aging hypotheses. MORPH also provides a large amount of data, both in the number of images per subject and in the number of subjects. MORPH images and the associated metadata are obtained from public record sources. The data are organized in two "albums," Album1 and Album2.
3.1 MORPH Album1
Album1 contains scanned photographs of individuals taken between October 26, 1962 and April 7, 1998. The images are 8-bit grayscale portable gray map (PGM) files, scanned with a consumer-grade scanner. Beyond the scan, minimal enhancement was employed to minimize artifacts, using median filtering and contrast stretching via histogram equalization. Figures 1 and 2 show three sample sets of these images, where aging is most evident. The image number is listed above each image, and the acquisition date and subject's age are listed below it. The time elapsed from a subject's first photograph in Album1 ranges from 46 days to 29 years.
Fig. 1. Album1: Sample Image Progression for African-American Male
Fig. 2. Album1: Sample Image Progression for African-American Female
Album1 contains 1,690 images from 515 individuals, including both males and females from several different ancestry groups, as shown in Table 1. Table 2 shows the number of duplicate images per subject, ranging from one to four or more. The individuals range in age from 15 to 68 years and are further grouped by "decade of life," where the most notable craniofacial age-related changes appear (Table 3). Table 4 outlines the metadata included with Album1. Additions to Album1 have been discontinued.
Table 1. Album1: Number of Facial Images by Gender and Ancestry (n=1,690)
          African   European   "Other"    Total
Male        1,037        365         3    1,405
Female        216         69         0      285
Total       1,253        434         3    1,690
Table 2. Album1: Number of Additional Images per Subject
             1      2      3     4+    Total
Male       526    297     71     10      904
Female     108     62     14      2      186
Total      634    359     85     12    1,090
Table 3. Album1: Number of Facial Images by Gender and Decade-of-Life (n=1,690)
          < 18   18–29   30–39   40–49   50+    Total
Male       142     803     345      93    22    1,405
Female      15     182      70      18     0      285
Total      157     985     415     111    22    1,690
Table 4. Album1: Metadata Details
Subject Identifier
Picture Identifier (beginning at 0 for each subject)
Date of Birth (mm/dd/yyyy)
Image Date (mm/dd/yyyy)
Race (African-American/Black, White, Other)
Gender (Male / Female)
Facial Hair Flag
Glasses Flag
Age Difference (years and months since last image)
Image Filename (ID_[picture_id][M/F][Age].JPG)

3.2 MORPH Album2
Following MORPH Album1, development began on a second album, Album2, containing images from a new public source. The images in Album2 are distributed as either 8-bit color 200x240 JPEG or 8-bit color 400x480 JPEG images, depending on the image acquisition date. The original release of Album2 contained more than 14,000 images, from the mid-1990s to the release date in 2005, covering more than 4,000 individuals with associated metadata for each image. For this original dataset, images and their respective metadata were gathered over several weeks. The dataset was recently updated by adding new images of existing individuals in Album2 as well as by acquiring new individuals, not in the initial Album2, with at least three images. All new images are stored as 8-bit color 400x480 JPEG images. Figures 3 and 4 show sample sets of images in Album2 that accurately portray the data, in the same format as the images for Album1. Album2 contains over 94,000 images of over 24,000 individuals. Ages range from 16 to 77 with a median age of 31. The average number of images per individual is 4, and the average time between photos is 195 days, with a minimum of 1 day and a maximum of 1980 days. The standard deviation of days between images is 211. The maximum age span between the first and last image for a single individual is 5.7 years, and the average pixel distance between eyes is 96. Table 5 shows the distribution of images by gender and ancestry; Table 6 shows the number of additional images that exist beyond the initial facial image; and Table 7 shows the number of facial images by decade-of-life.
Fig. 3. Album 2: Sample Image Progression for White Male
Fig. 4. Album 2: Sample Image Progression for African-American Female
Table 5. Album2: Number of Facial Images by Gender and Ancestry (n=94,155)
          African   European   Asian   Hispanic   "Other"    Total
Male       59,555     14,809   4,047        372       109    78,892
Female     10,375      4,601     217         37        33    15,263
Total      69,930     19,410   4,264        409       142    94,155
Table 6. Album2: Number of Additional Images per Subject
              1        2        3        4       5+     Total
Male      7,899    4,320    2,543    1,542    3,599    19,903
Female    1,992      914      454      269      608     4,237
Total     9,891    5,234    2,997    1,811    4,207    24,140
Table 7. Album2: Number of Facial Images by Gender and Decade-of-Life (n=94,155)
           < 18    18–29    30–39    40–49     50+     Total
Male      4,962   31,605   19,368   16,645   6,312    78,892
Female      792    5,604    4,708    3,395     764    15,263
Total     5,754   37,209   24,076   20,040   7,076    94,155
Unlike Album1, the metadata fields in Album2 are abbreviated to reduce the size of the database and to reduce errors during computation. The male and female gender identifiers have been reduced to M and F, respectively. Similarly, race is represented by a single character: B (African-American/Black), W (European/White), A (Asian), H (Hispanic), and O (Other). In Album1, the age difference field contained textual delimiters, e.g., 2y3m representing an age difference of 2 years and 3 months. For Album2, this has been converted to an integer number of days. Table 8 lists all metadata items associated with the image files.
Table 8. Album2: Metadata Details
Subject Identifier (six-digit ID with leading zeros)
Picture Identifier (beginning at 0 for each subject)
Date of Birth (mm/dd/yyyy)
Image Date (mm/dd/yyyy)
Age Difference (number of days between records)
Eye Coordinates
Image Filename (ID_[picture_id][M/F][Age].JPG)
Image File MD5 Checksum
Race (B, W, A, H, O)
Gender (M / F)
Height (in inches)
Weight (in US pounds)
Facial Hair Flag
Glasses Flag
Occlusions Flag
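As a hypothetical illustration of working with this metadata, the filename convention ID_[picture_id][M/F][Age].JPG listed in Table 8 could be parsed as follows; the example filename and the exact zero-padding of the picture identifier are assumptions, not part of the released specification:

import re

FILENAME_RE = re.compile(r"^(\d{6})_(\d+)([MF])(\d+)\.JPG$", re.IGNORECASE)

def parse_morph_filename(name):
    m = FILENAME_RE.match(name)
    if m is None:
        raise ValueError("unexpected filename: %s" % name)
    subject_id, picture_id, gender, age = m.groups()
    return {"subject": subject_id,        # six-digit ID with leading zeros
            "picture": int(picture_id),   # 0-based index per subject
            "gender": gender.upper(),     # M / F
            "age": int(age)}              # age in years at acquisition

print(parse_morph_filename("012345_0M31.JPG"))   # hypothetical example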
3.3 MORPH Public Release
A subset of Album2, in addition to the full Album1, is available for download by application for use in non-commercial research, see [8]. The subset of Album2 contains 55,000 unique images of more than 13,000 individuals, spanning from the mid-1990s to late 2007. Ages range from 16 to 77 with a median age of 33. The average number of images per individual is 4, and the average time between photos is 164 days, with a minimum of 1 day and a maximum of 1681 days. The standard deviation of days between images is 180. Table 9 shows the distribution of images by gender and ancestry; Table 10 shows the number of additional images that exist beyond the initial facial image; and Table 11 shows the number of facial images in this initial release by decade-of-life. Table 12 lists the metadata provided with this public release.
Table 9. Album 2 – Public Release: Number of Facial Images by Gender and Ancestry (n=55,608)
          African   European   Asian   Hispanic   "Other"    Total
Male       37,093      8,119     147      1,652        46    47,057
Female      5,803      2,617      14        101        16     8,551
Total      42,896     10,736     161      1,753        62    55,608
Table 10. Album 2 – Public Release: Number of Additional Images per Subject
               1         2        3        4       5+     Total
Male      11,157     8,797    5,187    3,196    7,208    34,545
Female     2,075     1,608      894      538    1,277     6,390
Total     13,232    10,405    6,081    3,732    8,485    41,935
Table 11. Album 2 – Public Release: Number of Facial Images by Gender and Decade-of-Life (n=55,608)
           < 18    18–29    30–39    40–49     50+     Total
Male      2,964   17,728   12,587   10,248   3,530    47,057
Female      373    2,783    2,924    2,017     454     8,551
Total     3,337   20,511   15,511   12,265   3,984    55,608
Table 12. Album 2 – Public Release: Metadata Details
Subject Identifier (six-digit ID with leading zeros)
Picture Identifier (beginning at 0 for each subject)
Date of Birth (mm/dd/yyyy)
Image Date (mm/dd/yyyy)
Age Difference (number of days since last image)
Image Filename (ID_[picture_id][M/F][Age].JPG)
Race (B, W, A, H, O)
Gender (M / F)
Facial Hair Flag
Glasses Flag
3.4 Challenges
While data from public information sources is abundant, a considerable amount of data cleaning must be performed with each new dataset update to ensure consistency for each subject. Often the metadata accompanying the images contain errors due to inconsistencies in self-reported information. The most common error is a race identification that is not consistent with existing records; occasionally, subjects are assigned the wrong gender, date of birth, or erroneous height and weight, which can be attributed to data-entry errors. These items need to be checked and corrected where applicable. Another important step is to optimize the metadata for use in facial recognition applications, such as identifying facial hair, glasses, and other occlusions that inhibit recognition algorithms. Additionally, eye coordinates are obtained and verified for each image, duplicate images are removed, and misidentified subjects are either reclassified or removed.
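A sketch of the kind of consistency check described here (using hypothetical field names taken from Table 8; this is not a released MORPH tool): race, gender and date of birth should not vary across a subject's records.

from collections import defaultdict

def find_inconsistent_subjects(records):
    # records: iterable of dicts with keys 'subject', 'race', 'gender', 'dob'
    seen = defaultdict(set)
    for r in records:
        seen[r["subject"]].add((r["race"], r["gender"], r["dob"]))
    # a subject with more than one distinct (race, gender, dob) tuple needs review
    return [s for s, variants in seen.items() if len(variants) > 1]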
4 Conclusion
The MORPH database outlined in this paper is an ongoing data collection effort. Work is being done to continually supplement the metadata with components useful in facial recognition applications. To date, it is by far the largest longitudinal face database publicly available and is used in the development of a variety of age progression and age estimation algorithms.
References 1. Patterson, E., Ricanek, K., Albert, E., Boone, E.: Automatic Representation of Adult Aging in Facial Images. In: Proc. 6th IASTED International Conference on Visualization, Imaging, and Image Processing, Palma de Mallorca, Spain, pp. 171–176 (2006)
2. Ricanek, K., Boone, E.: The Effect of Normal Adult Aging on Standard PCA Face Recognition Accuracy Rates. In: International Joint Conference on Neural Networks, Montreal, Canada, pp. 2018–2023 (2005)
3. Ricanek Jr., K., Patterson, E.K., Albert, A.M.: Age-related morphological changes: Effects on facial recognition technologies. UNCW Technical Report (2004)
4. Ricanek Jr., K., Tesafaye, T.: MORPH: A longitudinal image database of normal adult age-progression. In: Proceedings of the IEEE 7th International Conference on Automatic Face and Gesture Recognition, Southampton, England (2006)
5. Sethuram, A., Patterson, E., Ricanek, K., Rawls, A.: Improvements and Performance Evaluation Concerning Synthetic Age Progression and Face Recognition Affected by Adult Aging. In: International Conference on Biometrics, Sassari, Italy (2009)
6. Phillips, P.J., Rauss, P.J., Der, S.Z.: FERET (Face Recognition Technology) Recognition Algorithm Development and Test Results. Army Research Lab Technical Report (2006)
7. FG-NET Aging Database, http://www-prima.inrialpes.fr/FGnet
8. MORPH Application, Face Aging Research Group, http://www.faceaginggroup.com
Verification of Aging Faces Using Local Ternary Patterns and Q-Stack Classifier
Andrzej Drygajlo, Weifeng Li, and Kewei Zhu
Speech Processing and Biometrics Group, Swiss Federal Institute of Technology Lausanne (EPFL)
CH-1015 Lausanne, Switzerland
[email protected] http://scgwww.epfl.ch/
Abstract. This paper deals with the influence of age progression on the performance of face verification systems. This is a challenging and largely open research problem that deserves more and more attention. Aging affects both the shape of the face and its texture, leading to a failure in the face verification task. In this paper, the aging influence on the face verification system using local ternary patterns is managed by a Q-stack aging model, which uses the age as a class-independent metadata quality measure together with baseline classifier scores in order to obtain better recognition rates. This allows for increased long-term class separation by a decision boundary in the score-quality measure space using a short-term enrolment model. This new method, based on the concept of classifier stacking, compares favorably with the conventional face verification approach which uses decision boundary calculated only in the score space at the time of enrolment.
1 Introduction
It is a well-known fact that individual physical characteristics change with time. In particular, aging changes a person's face at a slow rate, albeit irreversibly. Extracted features and models relevant to a person's face, actual and up-to-date at the time of their creation, may eventually become outdated, leading to failures in the face verification task. This expectation is confirmed by recent research [1]. For this reason, it is of prime importance to understand and quantify the temporal reliability of face biometric features and their classification models. The problem of the time validity of biometric templates has received only marginal attention from researchers. The variation caused by face aging is often neglected compared with pose, lighting, and expression variations. Nowadays, digital face images are becoming prevalent in government-issued travel and identity documents (e.g., biometric e-passports and national identity cards). Developing face verification systems that are robust to age progression would enable the successful deployment of face verification systems in those large-scale applications.
Since faces undergo gradual variations due to aging, periodically updating (e.g., every six months) large-scale-application face databases with more recent images of subjects might be necessary for the success of face verification systems. Since periodically updating such large databases would be a tedious and very costly task, a better alternative would be to develop aging-aware face verification methods. Only such methods will have good prospects of success over longer stretches of time [2].
Aging is a complex process that affects both the shape of the face and its texture. Most of the reported efforts have focused on visualizing the changes in the appearance of face images as time progresses [3]. Very limited evidence is available as to the impact of these changes of appearance on actual recognition performance. The biometrics research community has realized the importance of temporal changes in individual features, and databases are created with predefined intervals between sessions of data collection. However, in commonly used benchmarking databases this period is within the range of weeks or months [4]. Such short intervals are unlikely to give a good understanding of the temporal dynamics of biometric traits, and the observed short-term face image and feature variability is more likely to be due to environmental factors rather than time-related changes. To examine the long-term reliability of biometric features, the collection sessions must cover periods measured in years. Unfortunately, such databases for research purposes are almost nonexistent.
To address this problem, in this paper we use daily photo recordings over years, publicly available on YouTube, and the MORPH database [5] collected for the investigation of face age progression. An inherent limitation of the YouTube recordings is the number of subjects who appear in the photos over a long enough time stretch (e.g., three or more years). The advantage of such recordings is that they provide a large number of face images of an individual, sampled at daily time intervals over a long period. In the experiments reported here, we have used recordings which covered 1200 days. Such recordings allow us to model age progression in human faces and to build face verification systems robust to age progression. A certain amount of early recordings is used to extract features (local ternary patterns, LTPs) and build models, whose temporal score dynamics are further analyzed based on recordings with later time stamps.
The paper specifically identifies a possible way of using age information as a class-independent quality measure. Age is a factor that directly impacts the comparative quality of face images recorded at different times. If the face samples being compared differ substantially in age, recognition accuracy is affected. More substantial physical degradation may become an issue as the difference in age increases. Aging can be considered metadata [8] because the quality of the samples themselves is not the issue: the age difference is the cause of the degradation of accuracy. Under this interpretation, the recently developed framework of classification with quality measures, Q-stack [9], is adopted to create a new face verification
Fig. 1. Sample face images of individuals with the age progression (about three years)
system robust to aging of biometric templates. Q-stack is a general framework of classification with quality measures that is applicable to uni-classifier, multi-classifier and multimodal biometric verification with one or more quality measures. This paper is focused on the Q-stack solution that allows for improved class separation using age as a metadata quality measure. The novelty of this approach is that it opens a new way for the combination of age information with different quality measures of face images and multiple baseline classifiers to further improve the verification performance. The paper is structured as follows. Section 2 provides a correlation analysis between the age metadata quality measure of example adult human faces and the distance scores of local ternary pattern (LTP) baseline classifiers. In Section 3 we propose a general framework, the Q-stack aging model, a stacking classifier for the task of biometric identity verification using face images and the associated metadata quality measure, age. Section 4 presents experimental results with their discussion, and Section 5 concludes the paper.
2 Aging Influence on the Face Classifiers
First, we analyze the influence of age progression on the baseline classifier scores. Figure 1 shows the daily face image samples for four people over 1200 days, pre-processed using the OpenCV face detector. For each person, an average face is obtained from the first 100 images, and an average face template is then created by applying the local ternary pattern (LTP) operator to the average face. The LTP operator [6] is an extension of the local binary pattern (LBP) operator from binary (0 and 1) codes to 3-value codes (-1, 0, 1). The most important properties of the LTP operator in real-world applications are its robust invariances and its computational simplicity. The measure of the influence of age progression on the classifier scores is based on the 2D Euclidean distance between the template image and the test image, as used in [6].
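For illustration, a minimal (hypothetical) implementation of the 3-value LTP coding on a 3×3 neighbourhood; the threshold t = 5 and the plain 8-neighbourhood are assumptions, not the exact settings used by the authors:

import numpy as np

# clockwise 8-neighbourhood offsets around the centre pixel
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def ltp_codes(img, t=5):
    # returns a per-pixel 8-element ternary code (-1, 0, 1) for the image interior
    img = img.astype(np.int32)
    h, w = img.shape
    codes = np.zeros((h - 2, w - 2, 8), dtype=np.int8)
    centre = img[1:h - 1, 1:w - 1]
    for n, (dy, dx) in enumerate(OFFSETS):
        nb = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes[..., n] = np.where(nb >= centre + t, 1,
                                 np.where(nb <= centre - t, -1, 0))
    return codes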
Fig. 2. The influence of age progression on the classifier scores (distances). The red lines in each sub-figure are linear fittings of the variations of log likelihood with the age progression. The four individuals correspond to the ones in Fig. 1.
Figure 2 shows the effects of age progression on the classifier scores for the four persons from Fig. 1. The plots in Fig. 2 show an evident conditional dependency between age progression and the distance measure. As the age increases, the distance values generally increase.¹ This motivates us to use age as a class-independent metadata quality measure in the Q-stack classifier in order to improve the long-term classification performance of face verification systems.
3 Q-Stack Aging Model
Figure 3 shows a diagram of the Q-stack classifier [9]. Identity-related information is composed of a biometric signal S, classified by a baseline classifier, resulting in a score x. At the same time, the analysed signal undergoes quality measurements, resulting in m quality signals qm = [qm_1, qm_2, ..., qm_m]. Multiple quality measures can be used to characterize one signal. The score x is concatenated with the quality measure vector to form an evidence vector e = [x, qm]. The evidence vector e becomes a feature vector for the stacked classifier. The proposed Q-stack method is a generalized framework which encompasses previously reported methods of using quality measures in biometric verification [7]. Age information can be used as one of the quality features in the
¹ Since the images are not taken under controlled conditions, there are other factors, such as illumination and hair style, which influence the classifier scores. Therefore, the distance values do not always increase as the age increases.
Fig. 3. Q-stack architecture, in which baseline classifier scores and quality measures jointly serve as features to a second-level, stacked classifier
framework of Q-stack. As a class-independent feature, age information does not provide information about the identity of the individual whose face appears in the image. However, if one face model (template) is used continuously, the time difference (age difference) has an obvious influence on the face matching scores, as shown in Section 2. This influence translates into a statistical dependence between the scores and the age information. This dependence is consequently modeled and exploited for greater classification accuracy in the Q-stack scheme.
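A minimal sketch of this scheme (an illustration under assumptions, not the authors' implementation): the evidence vector e = [x, qm] is formed from the baseline distance score and the age difference in days, and a second-level SVM with a linear kernel is trained on it; the use of scikit-learn is an assumption.

import numpy as np
from sklearn.svm import SVC

def train_q_stack(scores, ages, labels):
    # evidence vectors e = [x, qm]: baseline score concatenated with age metadata
    E = np.column_stack([scores, ages])
    clf = SVC(kernel="linear")
    clf.fit(E, labels)                  # labels: 1 = genuine claim, 0 = impostor
    return clf

def verify(clf, score, age):
    # decision in the joint score-quality space rather than on the score alone
    return clf.predict(np.array([[score, age]]))[0] == 1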
4 Experiments and Results in Aging Face Verification
4.1 Experiments on YouTube Data
In order to verify the claim that the inclusion of age progression in the evidence vector allows for more accurate classification in the Q-stack framework than using baseline classifiers alone, we performed preliminary face verification experiments on the daily recorded YouTube data. Our face verification system (for the first person) is defined as follows: Class A - genuine identity claims from the first person in Fig. 1; Class B - impostor attempts from the other three persons in Fig. 1. Figure 3 shows a diagram of the Q-stack framework for our face verification experiments, in which the distance scores and age progression information jointly serve as features to a second-level stacked classifier. At this level a Support Vector Machine (SVM) classifier with a linear kernel is employed. Figure 4 shows the decision threshold of the baseline classifier² (horizontal dashed lines) and the decision boundary of the Q-stack model (with a second-stage SVM linear-kernel classifier) for the training data set (the first 100 days of images) and the evaluation data set (the images from the 101st to the 1200th day). In the case of the short-term 100-day training data set, the Q-stack decision boundary does not deviate much from the baseline classification boundary (or threshold), and both the baseline classifier and the Q-stack classifier can almost perfectly separate the two classes. However, as the age progresses in the
² The threshold of the baseline classifier is determined by minimizing the HTER and maximizing the margin between the two classes.
Fig. 4. Class separation in the age-score space for the first 100 days. The ‘◦’ and ‘×’ marks represent the scores of the genuine user (first person in Fig. 1) and the impostors (the other three persons in Fig. 1), respectively. The dashed lines show the threshold of the baseline classifier. The bold lines denote the Q-stack decision boundary by using SVM-lin stacked classifiers.
evaluation data set, the baseline classifier cannot perform so well any more, resulting in considerable false rejections.³ On the other hand, the Q-stack classifier can track the tendency of the distance scores as age increases, which gives far fewer classification errors. The gains of the Q-stack method due to classification in the space of evidence e (score and age) rather than in the score space x are summarized in Table 1 for successive periods of 180 days (about 0.5 year) in terms of false acceptance rate (FAR), false rejection rate (FRR) and half total error rate (HTER). This shows an improvement over the baseline system, whose decision boundary is fixed only on the score distributions at enrolment. In summary of the pilot experiments, we can conclude that the use of age progression information as a metadata quality measure in the Q-stack classification scenario allows for improved classification with respect to the baseline classification results.
4.2 Experiments on MORPH Data
An inherent limitation of the YouTube data is the number of subjects. The MORPH database [5] is a publicly available database developed for investigating age progression. The images represent a diverse population with respect to age, gender,
³ By changing the decision threshold of the baseline classifier, one can expect that the classification accuracy may be improved, and even a Detection Error Trade-off (DET) curve can be obtained. However, with only short-term enrolment data at hand, we have no information on how to change the decision threshold in the long term. Consequently, the DET curve is not a realistic evaluation of the performance of a verification system without knowing how the operating point changes in time given the age information.
Table 1. Verification performance of baseline classifier and the Q-stack method (with SVM linear kernel classifier) over the last 1,100 days (about three years).
                   0–0.5 yr   0.5–1.0 yr   1.0–1.5 yr   1.5–2.0 yr   2.0–2.5 yr   2.5–3.0 yr
Baseline
  FAR [%]              0           0            0            0            0            0
  FRR [%]            6.67        5.56        11.67        24.44        23.33        23.00
  HTER [%]           3.33        2.78         5.83        12.22        11.67        11.50
Q-stack method
  FAR [%]              0         1.11         2.22         3.89         1.11        15.00
  FRR [%]            2.78        2.22            0         3.89            0            0
  HTER [%]           1.39        1.67         1.11         3.89         0.56         7.50
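For reference, the half total error rate in Table 1 is the average of the false acceptance and false rejection rates; e.g., for the baseline over the 1.5–2.0 year period:

\mathrm{HTER} = \tfrac{1}{2}\left(\mathrm{FAR} + \mathrm{FRR}\right) = \tfrac{1}{2}\left(0 + 24.44\,\%\right) = 12.22\,\%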
Fig. 5. The influence of age progression on the scores in the MORPH database
Table 2. Verification performance on the MORPH database (averaged over all the 14 individuals)
             FAR [%]   FRR [%]   HTER [%]
Baseline        1.53     17.18       9.35
SVM-lin         0.48     10.16       5.32
SVM-rbf         0.29     10.72       5.51
ethnicity, etc. This database contains 1,724 face images of 515 individuals. These photos were taken between 1962 and 1998, but not daily; therefore, there is a different number of photos for each individual. The face images are not taken under controlled conditions; many of them are not frontal and exhibit variations in head pose and facial expression. For our studies on age progression, we selected 14 individuals with more than 20 images each and without significant changes in head pose and facial expression. The average face templates are created for each of the 14 individuals by using the first 10 images and LTP operators. Besides the Support Vector Machine (SVM) Q-stacked classifier with a linear kernel (SVM-lin), an SVM classifier with a radial basis function kernel (SVM-rbf) is also employed as a Q-stacked classifier.
Figure 5 shows the influence of age progression on the classifier scores for all the selected individuals in the MORPH database. The classifier scores show a general dependence on age progression. Table 2 summarizes the verification performance on the MORPH database in terms of FAR, FRR, and HTER, averaged over all 14 individuals. All three errors (FAR, FRR, and HTER) are reduced significantly using the Q-stack method.
5 Conclusions
In this paper, we presented a novel approach to incorporating age, treated as a metadata quality measure, into the face verification process, based on the concept of classifier stacking (Q-stack). Our experiments with a baseline LTP classifier using the YouTube and MORPH databases demonstrated that the age metadata quality measure is causally linked to the classifier scores, which allows for increased long-term class separation in the score-quality measure space using a short-term enrolment model. We plan to perform exhaustive experiments with the whole MORPH database of 515 individuals, and with a new database under development in our laboratory, using the combination of age information with different quality measures of face images and multiple baseline classifiers to further improve the verification performance.
References 1. Ramanathan, N., Chellappa, R.: Face Verification Across Age Progression. IEEE Trans. Image Processing. 15, 3349–3361 (2006) 2. Patterson, E., Sethuram, A., Albert, M., Ricanek, K., King, M.: Aspects of Age Variation in Facial Morphology Affecting Biometrics. In: IEEE Conference on Biometrics: Theory, Applications, and Systems (BTAS 2007), Washington, D.C., USA, September 27-29 (2007) 3. Scandrett, C., Solomon, J., Gibson, S.J.: A person-specific, rigorous aging model of the human face. Pattern Recognition Letters 27, 1776–1787 (2006) 4. Flynn, P.: Biometrics databases. In: Jain, A., et al. (eds.) Handbook of Biometrics, ch. 25, pp. 529–548. Springer, New York (2008) 5. Ricanek, K., Tesafaye, T.: MORPH: A longitudinal image database of normal adult age-progression. In: 7th International Conference on Automatic Face and Gesture Recognition (FGR 2006), April 2006, pp. 341–345 (2006) 6. Tan, X., Triggs, B.: Enhanced local texture feature sets for face recognition under difficult lighting conditions. In: Zhou, S.K., Zhao, W., Tang, X., Gong, S. (eds.) AMFG 2007. LNCS, vol. 4778, pp. 168–182. Springer, Heidelberg (2007) 7. Kryszczuk, K.M., Drygajlo, A.: Improving classification with class-independent quality measures: Q-stack in face verification. In: Lee, S.-W., Li, S.Z. (eds.) ICB 2007. LNCS, vol. 4642, pp. 1124–1133. Springer, Heidelberg (2007) 8. Hicklin, A., Khanna, R.: The Role of Data Quality in Biometric Systems. White Paper. Mitretek Systems (February 2006) 9. Kryszczuk, K., Drygajlo, A.: Q-stack: uni- and multimodal classifier stacking with quality measures. In: International Workshop on Multiple Classifier Systems, Prague, Czech Republic (May 2007)
Recognition of Emotional State in Polish Speech
Comparison between Human and Automatic Efficiency
Piotr Staroniewicz
Wroclaw University of Technology
Institute of Telecommunications, Teleinformatics and Acoustics
Wybrzeze Wyspianskiego 27, 50-370 Wroclaw, Poland
[email protected]
Abstract. The paper presents a comparison of human (listening test) and automatic (SVM classifier) recognition of emotion in speech. The database of Polish emotional speech used in the tests includes recordings of six acted emotional states (anger, sadness, happiness, fear, disgust, surprise) and the neutral state from 13 amateur speakers (2118 utterances). The automatic classifier used a set of 31 features selected by attribute evaluation and the C-SVC algorithm with a Gaussian radial basis function kernel. The mean overall score for human recognition (57.25%) turned out to be lower than for automatic recognition (64.77%). Keywords: emotional speech, emotional state recognition.
1 Introduction
Emotions are important factors in speech-computer communication (i.e., speech and speaker recognition, speech synthesis), which motivates efforts to develop efficient algorithms for emotional speech synthesis and emotional speech recognition. The main source of complication in this task is the lack of a strict definition of emotion and of its classification rules. The literature describes emotions as emotion dimensions (e.g., pleasure, activation) or discrete concepts (e.g., anger, fear) [1,2,3]. Distinct terms which are easily understood by speakers are usually chosen for acted emotions. In order to be able to compare the results with older studies, and because they are generally considered the most common ones, it was decided to use six basic emotional states plus the neutral state. Despite the fact that there is no definitive list of basic emotions, there exists a general agreement on the so-called "big six" [1,2]: anger, sadness, happiness, fear, disgust and surprise, plus the neutral state. Another problem can be the variability of emotional categories among languages or cultures. A relatively rarely investigated aspect is the comparison of human and machine (automatic) emotion recognition abilities. The paper presents the results of experiments carried out on the Polish emotional speech database (presented earlier in [4,5]). The human (subjective) and automatic (objective) emotion recognition tests were carried out on the same speech material to allow the comparison of their effectiveness.
2 Database
Despite all the disadvantages of acted emotions in comparison with natural and elicited ones (i.e., recordings of spontaneous speech), only recording simulated (or semi-natural) emotions can guarantee control of the recordings that fulfils [6,7]:
- a reasonable number of subjects acting all emotions, to enable generalization over a target group,
- all subjects uttering the same verbal content, to allow comparison across emotions and speakers,
- high-quality recordings, to enable proper speech feature extraction later,
- unambiguous emotional states (only one emotion per utterance).
During the emotional speech recordings it was planned to rely on the speakers' ability of self-induction by remembering a situation when a certain emotion was felt (known as the Stanislavski method [3]). Since skilled actors can simulate emotions in a way that could be confused with truly natural behaviour, they are very often used in emotional speech database recordings [6]. On the other hand, actors can sometimes express emotions in quite an exaggerated way [3]. The amateur speakers who took part in the recordings were sex-balanced. All the subjects were recorded in separate sessions to prevent their influencing each other's speaking styles. The speakers were asked to use their own everyday way of expressing emotional states, not stage acting. The decision to use simulated emotional states enabled a free choice of utterances to be recorded. The most important condition is that all selected texts should be interpretable according to emotions and not contain an emotional bias. Two kinds of material could be used: nonsense text material or everyday-life sentences. Although nonsense material is guaranteed to be emotionally neutral, it would be difficult to produce natural emotional speech with it spontaneously, which can lead to overacting. The use of everyday-life speech has some important advantages:
- it is the natural form of speech under emotional arousal,
- speakers can immediately say it from memory, with no need for memorising or reading it, which could lead to a lecturing style.
The ten everyday life phonetically balanced sentences in Polish and their English translations are listed in Table 1.
Table 1. Ten everyday life sentences in Polish and their English translation
No   Sentence (in Polish)               Sentence (English translation)
1    Jutro pojdziemy do kina.           Tomorrow we'll go to the cinema.
2    Musimy sie spotkac.                We have to meet.
3    Najlepsze miejsca sa już zajete.   The best seats are already taken.
4    Powinnas zadzwonic wieczorem.      You should call in the evening.
5    To na pewno sie uda.               It must work out.
6    Ona koniecznie chce wygrac.        She simply must win.
7    Nie pij tyle kawy.                 Don't drink so much coffee.
8    Zasun za soba krzeslo.             Put the chair back.
9    Dlaczego on nie wrocil.            Why hasn't he come back.
10   Niech się pan zastanowi.           Think about it.
The database recordings were carried out in our recording studio with a T-Bone SCT 700 microphone and a Yamaha 03D digital mixer, where the analogue/digital conversion was done (44.1 kHz, 16 bit, mono). The data were then transferred in the S/PDIF standard to a Soundmax Integrated Digital Audio PC sound card. The group of speakers consisted of 13 subjects, 6 women and 7 men; each recorded 10 sentences in 7 emotional states in several repetitions. Altogether 2351 utterances were recorded, 1168 with female and 1183 with male voices. The average duration of a single utterance was around 1 second. After a preliminary validation, some doubtful emotional states and recordings with poor acoustic quality were rejected. The final set of 2118 utterances was then divided into training and testing sets for the later automatic recognition of emotional states [4].
3 Recognition Procedure
3.1 Human Recognition
During the subjective human recognition test the listeners were presented with the acoustic material in random order and listened to each sample, not being allowed to go back and compare it with earlier utterances. Each time, the listener decided which emotional state the speaker was in. The automated tests were done on a personal computer. 202 listeners participated in the tests.
3.2 Automatic Recognition
For automatic recognition the following speech features were chosen:
- F0,
- intensity,
- the first four formant frequencies (F1, F2, F3, F4),
- 12 LPC coefficients.
For F0 the following parameters were determined: minimum, maximum, median, mean, range, standard deviation, and the mean absolute slope, which is a measure of the mean local variability of F0. For the intensity, the minimum, maximum, median, mean, range and standard deviation were calculated. Similarly, for the formant frequencies the minimum, maximum, median, mean, median of the frequency range, range and standard deviation were determined. Altogether 53 parameters were obtained. Each parameter was linearly scaled to the range ⟨−1, +1⟩ to prevent attributes with greater numeric ranges from dominating those with smaller numeric ranges and to avoid difficulties during the calculations in the SVM classifier [8]. The total number of parameters was reduced to 31 after attribute evaluation with the chi-squared test [9]. For the classification the LibSVM library [10] was applied, which contains two classification algorithms (C-SVC, ν-SVC) and regression, and supports multi-class classification. The SVM is the classification method worked out by Vladimir Vapnik [11]. Support Vector Machines have been successfully applied to speech emotion recognition and in many cases appear to be superior to other classifiers [12,13]. The best classification
algorithm and its kernel function were determined experimentally. The C-SVC algorithm (penalty parameter C = 256) with the Gaussian radial basis function kernel (1) (kernel parameter γ = 0.6) was applied.

K(x_i, x_j) = \exp\left(-\gamma \, \|x_i - x_j\|^2\right)    (1)
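A compact sketch of the pipeline just described, with scikit-learn standing in for the LIBSVM interface (an assumption); features are first scaled to [0, 1] because the chi-squared test requires non-negative inputs, then the selected attributes are rescaled to the ⟨−1, +1⟩ range used for the SVM, which slightly reorders the original steps:

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("scale01", MinMaxScaler(feature_range=(0, 1))),    # chi2 needs non-negative input
    ("select", SelectKBest(chi2, k=31)),                 # keep the 31 best attributes
    ("rescale", MinMaxScaler(feature_range=(-1, 1))),    # LIBSVM-style scaling
    ("svm", SVC(kernel="rbf", C=256, gamma=0.6)),        # C-SVC with Gaussian RBF kernel
])
# pipeline.fit(X_train, y_train); pipeline.predict(X_test)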
4 Results and Discussion
Figure 1 presents a comparison of the recognition scores of the subjective (human) and objective (automatic) methods for the basic emotional states. The following figures depict the recognition results for the individual emotional states, including confusion information: Fig. 2 "Happiness", Fig. 3 "Anger", Fig. 4 "Fear", Fig. 5 "Sadness", Fig. 6 "Surprise", Fig. 7 "Disgust" and Fig. 8 the neutral state.
Fig. 1. Comparison between human and automatic recognition for all emotional states (white-human and grey-automatic)
Fig. 2. Comparison between human and automatic recognition scores for "Happiness" (white-human and grey-automatic)
Fig. 3. Comparison between human and automatic recognition scores for "Anger" (white-human and grey-automatic)
Fig. 4. Comparison between human and automatic recognition scores for "Fear" (white-human and grey-automatic)
Fig. 5. Comparison between human and automatic recognition scores for "Sadness" (white-human and grey-automatic)
The results presented in Fig. 1 reveal that the worst scores for the subjective method were obtained for "Disgust" (30.41%), "Fear" (40.52%) and "Sadness" (44.70%). This could be caused by the fact that the speakers found it extremely difficult to express "Disgust" naturally, and of course this emotional state can sometimes be
Fig. 6. Comparison between human and automatic recognition scores for "Surprise" (white-human and grey-automatic)
Fig. 7. Comparison between human and automatic recognition scores for "Disgust" (white-human and grey-automatic)
Fig. 8. Comparison between human and automatic recognition scores for “Neutral state” (white-human and grey-automatic)
better expressed by face than by voice. "Fear" and "Sadness" also caused trouble in proper identification; in this case, the expression of "Fear" can also be very difficult for speakers. The listeners confused "Sadness" with the neutral state (Fig. 5). The results for the remaining emotional states are significantly better: "Happiness" (68.18%), "Anger" (71.07%) and "Surprise" (72.49%). The best results were obtained for the recognition of the neutral state (73.41%).
In the case of the objective method, the lowest recognition scores were obtained for "Fear" (51.18%) and "Disgust" (58.33%). In both cases the reason can be similar to that for the subjective method. For all the remaining states the scores were higher: "Happiness" (61.90%), "Anger" (72.61%), "Sadness" (73.15%), "Surprise" (75.71%) and the neutral state (58.71%). The mean score for objective emotion recognition was 64.77%, higher than for the subjective one (57.25%), although listeners could better recognise "Happiness" and the neutral state. The SVM classifier recognised "Sadness" (28.45% higher) and "Disgust" (27.92% higher) substantially better. As presented in Figs. 2-8, the classifier confused particular emotions in a similar way to the listeners, with the exception of "Disgust" (Fig. 7), where listeners more often interpreted this emotion as the neutral state.
5 Concluding Remarks
The overall recognition score for the objective method (with the SVM classifier) is over 7% higher than the result obtained in the listeners' perception tests. The difference is even higher for specific emotions (sadness and disgust). It is worth mentioning that very similar confusions were obtained in both recognition procedures during the emotion classification. In both cases the recognition scores could possibly be higher if recordings of professional actors were used.
Acknowledgments. This work was partially supported by COST Action 2102 "Cross-modal Analysis of Verbal and Non-verbal Communication" [14] and by a grant from the Polish Minister of Science and Higher Education (decision nr 115/NCOST/2008/0).
References
1. Cowie, R.: Describing the Emotional States Expressed in Speech. In: Proc. of ISCA, Belfast, pp. 11–18 (2000)
2. Scherer, K.R.: Vocal communications of emotion: A review of research paradigms. Speech Communication 40, 227–256 (2003)
3. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., Weiss, B.: A Database of German Emotional Speech. In: Proc. of Interspeech 2005, Lisbon, Portugal (2005)
4. Staroniewicz, P., Majewski, W.: Polish Emotional Speech Database – Recording and Preliminary Validation. In: Cross-Modal Analysis of Speech, Gestures, Gaze and Facial Expressions. Springer, Heidelberg (accepted, 2009)
5. Staroniewicz, P.: Polish emotional speech database – design. In: Proc. of 55th Open Seminar on Acoustics, Wroclaw, Poland, pp. 373–378 (2008)
6. Douglas-Cowie, E., Campbell, N., Cowie, R., Roach, P.: Emotional speech: Towards a new generation of databases. Speech Communication 40, 33–60 (2003)
7. Ververidis, D., Kotropoulos, C.: A State of the Art on Emotional Speech Databases. In: Proc. of 1st Richmedia Conf., Lausanne, Switzerland, October 2003, pp. 109–119 (2003)
8. Hsu Ch.W., Chang Ch.-Ch., Lin Ch.-J.: A Practical Guide to Support Vector Classification. Department of Computer Science, National Taiwan University (2008), http://www.csie.ntu.edu.tw/~cjlin (last updated: May 21, 2008) 9. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kauffmann, San Francisco (2005) 10. Chang, Ch.-Ch., Lin, Ch.-J.: LIBSVM: a Library for Support Vector Machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm/ 11. Vapnik, V.N.: Statistical Learning Theory. Wiley, Chichester (1998) 12. Kwon, O., Chan, K., Hao, J., Lee, T.: Emotion Recognition by Speech Signals. In: Eurospeech, Geneva, Switzerland, September 1-3 (2003) 13. Zhou, J., Wang, G., Yang, Y., Chen, P.: Speech emotion recognition based on rough set and SVM. In: Cognitive Informatics, ICCI 2006. 5th IEEE International Conference, Beijing, July 17-19, vol. 1, pp. 53–61 (2006) 14. COST Action 2102, Cross-Modal Analysis of Verbal and Non-verbal Communication. Memorandum of Understanding, Brussels, July 11 (2006)
Harmonic Model for Female Voice Emotional Synthesis
Anna Přibilová 1 and Jiří Přibil 2,3
1 Department of Radio Electronics, Slovak University of Technology, Ilkovičova 3, SK-812 19 Bratislava, Slovakia
[email protected] 2 Institute of Photonics and Electronics, Academy of Sciences of the Czech Republic Chaberská 57, CZ-182 51 Prague 8, Czech Republic 3 Institute of Measurement Science, Slovak Academy of Sciences Dúbravská cesta 9, SK-841 04 Bratislava, Slovakia
[email protected]
Abstract. Spectral and prosodic modifications for emotional speech synthesis using harmonic modelling are described. Autoregressive parameterization of the inverse Fourier transformed log spectral envelope is used. Spectral flatness determines the voicing transition frequency dividing the spectrum of the synthesized speech into the minimum-phase and random-phase regions of the harmonic model. Female emotional voice conversion is evaluated by a listening test. Keywords: emotional speech, spectral envelope, harmonic speech model, emotional voice conversion.
1 Introduction
Expression of emotional states in the human voice has moved to the centre of attention of researchers involved in speech processing [1-8]. Our contribution to this area consists of female emotional voice conversion using a harmonic sine-wave model of the speech signal, i.e., a sinusoidal model with harmonically related sine waves [9, 10]. Although this model was originally used in speech coding, its modifications have also been successfully applied in speech synthesis [11-13]. For modelling voiced fricatives and other speech sounds with mixed excitation, the sine-wave phases are made random above the voicing transition frequency, which is determined by the voicing probability, a measure of how well the harmonic set of sine waves fits the measured set of sine waves by minimizing the mean squared error [9, 10]. In [13] the notion of a maximum voiced frequency is used for the same variable; however, the upper band of the spectrum is modelled using an all-pole filter driven by white Gaussian noise instead of a sum of sine waves with random phases. A rather elaborate technique for the decomposition of voiced speech into periodic and aperiodic components is described in [14]. Sinusoidal models have also been used for emotional speech analysis [15, 16]. Our approach to harmonic speech modelling with autoregressive (AR) parameterization of the spectral envelope is described in Section 2. Modification of the AR parameters according to different emotions is described in Section 3. Prosody modification is dealt with in Section 4. Listening test results are summarized in Section 5.
2 Harmonic Speech Model with AR Parameterization

Speech signal synthesized by the harmonic speech model with AR parameterization (Fig. 1) is represented by a sum of sine waves with frequencies {fm} corresponding to pitch harmonics, amplitudes {Am} given by sampling of the spectral envelope at these frequencies, and phases {ϕm} being samples of the Hilbert transform of the log spectral envelope corresponding to the minimum-phase model. For unvoiced speech and for voiced speech above the voicing transition frequency (fv-uv) the phases are randomized in the interval <-π, π>. Our approach to fv-uv determination uses spectral flatness SF [17] as a measure of the degree of voicing. The spectral envelope is represented by AR parameters (gain G and LPC coefficients {an}). Summed sine waves in two consecutive pitch periods are weighted by an asymmetric window in such a way that the left part of the current asymmetric window has the same length as the right part of the previous window, the right part of the current window has the same length as the left part of the next window, and the overlapped asymmetric windows are complementary. For the final synthesis the weighted overlapped consecutive pairs of pitch-synchronous frames are added to avoid discontinuities at the frame boundaries. Speech analysis is performed in equidistant overlapping weighted frames according to the block diagram in Fig. 2. To avoid disadvantages of standard AR modelling (bias of formant frequencies toward pitch harmonics, underestimation of formant bandwidth) the AR parameters are computed from the time-domain signal corresponding to the spectral envelope instead of the original speech signal. Spectral envelope estimation similar to that of [18] is used. We use spline interpolation [19] applied to local maxima at pitch harmonics of the log spectrum.

Fig. 1. Block diagram of the harmonic speech model with AR parameterization

Fig. 2. Determination of model parameters in one equidistant speech frame
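As an illustration only, the Fig. 2 analysis path can be sketched in a few lines of Python. The spline type, FFT size, LPC order and peak-picking rule below are assumptions made for the sketch and are not taken from the paper.

import numpy as np
from scipy.interpolate import CubicSpline
from scipy.linalg import solve_toeplitz
from scipy.signal import find_peaks

def analyse_frame(frame, fs, f0, order=20, nfft=2048):
    # Hamming window, magnitude spectrum and its logarithm
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), nfft))
    freqs = np.fft.rfftfreq(nfft, 1.0 / fs)
    log_spec = np.log(spec + 1e-12)

    # spectral flatness SF of the power spectrum (degree of voicing)
    power = spec ** 2
    sf = np.exp(np.mean(np.log(power + 1e-12))) / np.mean(power)

    # spectral envelope: spline through local maxima near the pitch harmonics
    min_dist = max(1, int(0.7 * f0 / fs * nfft))
    peaks, _ = find_peaks(log_spec, distance=min_dist)
    env_log = CubicSpline(freqs[peaks], log_spec[peaks])(freqs)

    # AR parameters from the time-domain signal corresponding to the envelope
    env_sig = np.fft.irfft(np.exp(env_log))
    r = np.correlate(env_sig, env_sig, "full")[env_sig.size - 1:][:order + 1]
    a = solve_toeplitz(r[:order], r[1:order + 1])                 # LPC coefficients {an}
    gain = np.sqrt(max(r[0] - np.dot(a, r[1:order + 1]), 1e-12))  # gain G
    return gain, a, sf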
3 Spectral Modifications for Emotional Synthesis

According to [20] larynx and pharynx expansion, vocal tract walls relaxation, and mouth corners retraction upward lead to falling first formant and rising higher formants during pleasant emotions. On the other hand, larynx and pharynx constriction, vocal tract walls tension, and mouth corners retraction downward lead to rising first formant and falling higher formants for unpleasant emotions. Thus, the first formant and the higher formants of emotional speech shift in opposite directions in the frequency ranges divided by a frequency between the first and the second formant. Although the formant frequencies differ to some extent for different languages and their ranges are overlapped [21] the male voice vowel formant areas without overlap can be determined: F1 ≈ 250 ÷ 700 Hz, F2 ≈ 700 ÷ 2000 Hz, F3 ≈ 2000 ÷ 3200 Hz, F4 ≈ 3200 ÷ 4000 Hz [22]. Using the general knowledge of [21] that females have on average 20 % higher formant frequencies than males, female voice vowel formant areas without overlap will be: F1 ≈ 300 ÷ 840 Hz, F2 ≈ 840 ÷ 2400 Hz, F3 ≈ 2400 ÷ 3840 Hz, F4 ≈ 3840 ÷ 4800 Hz. The border frequency between the first and the second formant for female voice will be F1,2 = 840 Hz.
Spectral parameters modification consists of spectral envelope modification by non-linear frequency scale transformation of the spectral envelope computed using AR parameters obtained during analysis. After spectral transformation, the inverse Fourier transform of the spectral envelope is treated as a real speech signal for modified AR parameters computation for a database of AR parameters corresponding to different emotions – see Fig. 3.

Fig. 3. Modification of AR parameters using frequency scale transformation
For shifting of the first formant and the higher formants in the opposite directions we use a smooth function of frequency representing formant ratio between emotional and neutral speech. For its better analytic representation the frequency scale is logarithmically warped so that the border frequency F1,2 corresponds to one fourth of the sampling frequency fs. Inverse of this log warping function is

f(f_t) = a \, b^{f_t} + c ,                                   (1)

where f represents the input frequency and f_t corresponds to the transformed frequency. Unknown variables a, b, c are determined using the points [f_t, f] = [0, 0], [f_s/4, F_{1,2}], [f_s/2, f_s/2]. The solution of the system of the three equations is
a = \frac{2 F_{1,2}^{2}}{f_s - 4 F_{1,2}}, \qquad
b = \exp\!\left( \frac{4}{f_s} \ln \frac{f_s - 2 F_{1,2}}{2 F_{1,2}} \right), \qquad
c = -a.                                                       (2)
Inversion of (1) gives the logarithmically warped frequency scale

f_t(f) = \log_b \frac{f - c}{a} = \frac{\ln\!\left( \frac{f - c}{a} \right)}{\ln b}.          (3)
Formant ratio γ(f_t) as a smooth function of the logarithmically warped frequency can be expressed by a fourth-order polynomial function

\gamma(f_t) = p f_t^4 + q f_t^3 + r f_t^2 + s f_t + t.        (4)
Coefficients of this polynomial are computed from equidistant points [f_t, γ] = [0, 1], [f_s/8, γ_1], [f_s/4, 1], [3f_s/8, γ_2], [f_s/2, 1]. Solution of the system of five equations is

p = \frac{2048 (-\gamma_1 - \gamma_2 + 2)}{3 f_s^4}, \qquad
q = \frac{256 (9\gamma_1 + 7\gamma_2 - 16)}{3 f_s^3}, \qquad
r = \frac{64 (-13\gamma_1 - 7\gamma_2 + 20)}{3 f_s^2},
s = \frac{32 (3\gamma_1 + \gamma_2 - 4)}{3 f_s}, \qquad
t = 1.                                                        (5)

Relation between the modified spectral envelope E'(f) and the original one E(f) is

E'(f) = E\!\left( \frac{f}{\gamma(f_t(f))} \right).           (6)
For 16-kHz sampling the transformation function (3) gives the frequency fs/8 corresponding to 214.3 Hz and the frequency 3fs/8 corresponding to 2666.7 Hz. Chosen female emotional-to-neutral formant ratios at these frequencies together with the spectral flatness ratio obtained by emotional speech analysis are shown in Table 1.

Table 1. Emotional-to-neutral formant ratios γ1, γ2, and spectral flatness ratio SF

                      γ1     γ2     SF
joyous-to-neutral     0.70   1.05   1.24
angry-to-neutral      1.35   0.85   1.11
sad-to-neutral        1.10   0.90   2.02
In the four vowel formant areas the mean formant ratios are computed using the formant transformation function (4). Their values are shown in Table 2. For joy the first formant is shifted to the left by about 10 %, the second and third formants are shifted to the right by about 3 % to 6 % and the shift gradually decreases. For anger the first formant is shifted to the right by about 13 %, the higher formants are shifted to the left by about 10 % to 14 %. For sadness the mean shift of the first formant is about 4 % to the right and the higher formants about 6 % to 10 % to the left.

Table 2. Mean female emotional-to-neutral formant ratios in formant areas for chosen γ1, γ2

                      300÷840 Hz   840÷2400 Hz   2400÷3840 Hz   3840÷4800 Hz
joyous-to-neutral     0.8982       1.0589        1.0334         0.9964
angry-to-neutral      1.1289       0.8849        0.8623         0.9012
sad-to-neutral        1.0432       0.9383        0.8991         0.9076
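For illustration, the warping and formant-ratio machinery of equations (1)-(5) can be written directly in Python. The band-averaging below (a uniform frequency grid over each formant area) is an assumption of the sketch, so the printed means should only approximately reproduce Table 2.

import numpy as np

def warp_coeffs(fs, f12=840.0):
    # a, b, c of equations (1)-(2)
    a = 2.0 * f12 ** 2 / (fs - 4.0 * f12)
    b = np.exp(4.0 / fs * np.log((fs - 2.0 * f12) / (2.0 * f12)))
    return a, b, -a

def gamma_ratio(f, fs, g1, g2, f12=840.0):
    # equations (3)-(5): formant ratio gamma(ft(f)) at frequency f
    a, b, c = warp_coeffs(fs, f12)
    ft = np.log((f - c) / a) / np.log(b)
    p = 2048.0 * (-g1 - g2 + 2.0) / (3.0 * fs ** 4)
    q = 256.0 * (9.0 * g1 + 7.0 * g2 - 16.0) / (3.0 * fs ** 3)
    r = 64.0 * (-13.0 * g1 - 7.0 * g2 + 20.0) / (3.0 * fs ** 2)
    s = 32.0 * (3.0 * g1 + g2 - 4.0) / (3.0 * fs)
    return np.polyval([p, q, r, s, 1.0], ft)

fs = 16000.0
bands = [(300, 840), (840, 2400), (2400, 3840), (3840, 4800)]
ratios = {"joyous": (0.70, 1.05), "angry": (1.35, 0.85), "sad": (1.10, 0.90)}
for name, (g1, g2) in ratios.items():
    means = [gamma_ratio(np.linspace(lo, hi, 2000), fs, g1, g2).mean() for lo, hi in bands]
    print(name, np.round(means, 4))

The modified envelope of equation (6) is then obtained by reading the original envelope at the warped frequency f / gamma_ratio(f, ...).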
4 Prosodic Modifications for Emotional Synthesis

For emotional speech conversion, the following prosodic parameters are modified: F0 mean, F0 range, energy, and duration. For the joyous/angry emotional styles a rising/falling linear trend (LT) of F0 is used at the end of sentences. Modification ratios between emotional and neutral speech were chosen experimentally as shown in Table 3.
Table 3. Prosodic parameters modification ratio values between emotional and neutral speech

          F0 mean   F0 range   energy   duration   LT type   LTstart
joy       1.18      1.30       1.30     0.81       rising    55 %
anger     1.16      1.30       1.70     0.84       falling   35 %
sadness   0.81      0.62       0.95     1.16       −         0

Fig. 4. Linear trend applied to VF0 contour for joyous (left) and angry (right) emotional styles. Source VF0 contour normalized by F0mean in the sentence “Vše co potřeboval” (“All he needed”) – female speaker, fs = 16 kHz, frame length = 8 ms.
The application of the LT to the virtual F0 (VF0) contour, obtained by cubic interpolation in unvoiced parts of speech, can be seen in Fig. 4. The starting point of the LT is determined by a parameter LTstart (in percentage of the distance to the end of the sentence).
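A minimal sketch of the Table 3 modifications applied to a VF0 contour follows. The trend depth and the reading of LTstart (as the portion of the sentence, counted from its end, over which the trend acts) are assumptions of the sketch, and duration scaling would be realised in the harmonic resynthesis itself rather than here.

import numpy as np

def modify_prosody(vf0, energy, f0_mean_ratio, f0_range_ratio, energy_ratio,
                   lt_type=None, lt_start=0.0, lt_depth=0.3):
    f0_mean = vf0.mean()
    # scale F0 range around the mean, then scale the mean itself
    f0 = (vf0 - f0_mean) * f0_range_ratio + f0_mean * f0_mean_ratio
    if lt_type in ("rising", "falling"):
        n = f0.size
        start = n - int(round(n * lt_start))      # LTstart read from the sentence end
        sign = 1.0 if lt_type == "rising" else -1.0
        f0[start:] += sign * lt_depth * f0_mean * np.linspace(0.0, 1.0, n - start)
    return f0, energy * energy_ratio

# e.g. joyous conversion of a neutral sentence (ratios from Table 3):
# f0_joy, e_joy = modify_prosody(vf0, energy, 1.18, 1.30, 1.30,
#                                lt_type="rising", lt_start=0.55)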
5 Listening Tests

Subjective evaluation called “Determination of emotion type” was realized by a listening test with the help of an automated listening test program located on the web page http://www.lef.um.savba.sk/scripts/itstposl2.dll. Every listening test consists of ten evaluation sets selected randomly from the testing corpus composed of 60 short sentences with durations varying between 1 and 3.5 seconds. The sentences were extracted from the Czech stories narrated by a female professional actor. For each sentence there is a choice from four possibilities: “joy”, “sadness”, “anger”, or “other”. Twenty listeners (16 Czechs and 4 Slovaks, 6 women and 14 men) took part in the listening test. The summary results are presented in the form of a confusion matrix in Table 4. Sadness is identified best, joy worst. An evaluation of successful determination of the emotion type in individual sentences was carried out, too. Table 5 shows summed relative values for all emotions (values in the column “not classified” represent the choice “other” in the listening test; “exchanged” corresponds to an incorrect choice).
Table 4. Confusion matrix of the listening test
          joy      anger    sadness   other
joy       59.0 %   0.5 %    16.0 %    24.5 %
anger     2.5 %    73.5 %   2.0 %     22.0 %
sadness   0.5 %    0.5 %    90.0 %    9.0 %
Table 5. Best and worst evaluated sentences (data summed for all emotions)

                   sentence   correct   not classified   exchanged
best evaluated     s13 *      88.1 %    11.9 %           0 %
worst evaluated    s12 **     57.6 %    30.3 %           12.1 %

*  “Vše co potřeboval.” (“All he needed.”)
** “Máš ho mít.” (“You ought to have it.”)
6 Conclusion

The listening tests performed have shown that the combination of spectral and prosodic modifications in female voice emotion conversion using harmonic speech modelling gives the best results for sadness and the worst results for joy. The best listening test results correspond to the sentences that had been uttered most neutrally in the original. Some sentences of the original speech extracted from the stories were emotionally coloured; their resyntheses with emotional modifications were therefore perceptually biased towards the original emotion. In our next research, we want to include microprosodic features in emotional voice conversion, and we intend to experiment with applying linear trend F0 modifications at the beginning of sentences as well.

Acknowledgment. This work has been done in the framework of the COST Action 2102. It has also been supported by the Ministry of Education of the Slovak Republic (MVTS COST 2102/STU/08), the Ministry of Education, Youth, and Sports of the Czech Republic (OC08010), the Grant Agency of the Czech Republic (GA102/09/0989), and the Grant Agency of the Slovak Academy of Sciences (VEGA 2/0142/08).
References 1. Navas, E., Hernáez, I., Luengo, I.: An Objective and Subjective Study of the Role of Semantics and Prosodic Features in Building Corpora for Emotional TTS. IEEE Transactions on Audio, Speech, and Language Processing 14, 1117–1127 (2006) 2. Tao, J., Kang, Y., Li, A.: Prosody Conversion from Neutral Speech to Emotional Speech. IEEE Transactions on Audio, Speech, and Language Processing 14, 1145–1154 (2006) 3. Ververidis, D., Kotropoulos, C.: Emotional Speech Recognition: Resources, Features, and Methods. Speech Communication 48, 1162–1181 (2006)
4. Tóth, S.L., Sztahó, D., Vicsi, K.: Speech Emotion Perception by Human and Machine. In: Esposito, A., Bourbakis, N.G., Avouris, N., Hatzilygeroudis, I. (eds.) HH and HM Interaction. LNCS (LNAI), vol. 5042, pp. 213–224. Springer, Heidelberg (2008) 5. Zainkó, C., Fék, M., Németh, G.: Expressive Speech Synthesis Using Emotion-Specific Speech Inventories. In: Esposito, A., Bourbakis, N.G., Avouris, N., Hatzilygeroudis, I. (eds.) HH and HM Interaction. LNCS (LNAI), vol. 5042, pp. 225–234. Springer, Heidelberg (2008) 6. Kostoulas, T., Ganchev, T., Fakotakis, N.: Study on Speaker-Independent Emotion Recognition from Speech on Real-World Data. In: Esposito, A., Bourbakis, N.G., Avouris, N., Hatzilygeroudis, I. (eds.) HH and HM Interaction. LNCS (LNAI), vol. 5042, pp. 235–242. Springer, Heidelberg (2008) 7. Ringeval, F., Chetouani, M.: Exploiting a Vowel Based Approach for Acted Emotion Recognition. In: Esposito, A., Bourbakis, N.G., Avouris, N., Hatzilygeroudis, I. (eds.) HH and HM Interaction. LNCS (LNAI), vol. 5042, pp. 243–254. Springer, Heidelberg (2008) 8. Callejas, Z., López-Cózar, R.: Influence of Contextual Information in Emotion Annotation for Spoken Dialogue Systems. Speech Communication 50, 416–433 (2008) 9. McAulay, R.J., Quatieri, T.F.: Low-Rate Speech Coding Based on the Sinusoidal Model. In: Furui, S., Sondhi, M.M. (eds.) Advances in Speech Signal Processing, pp. 165–208. Marcel Dekker, New York (1992) 10. McAulay, R.J., Quatieri, T.F.: Sinusoidal Coding. In: Kleijn, W.B., Paliwal, K.K. (eds.) Speech Coding and Synthesis, pp. 121–173. Elsevier Science, Amsterdam (1995) 11. Dutoit, T., Gosselin, B.: On the Use of a Hybrid Harmonic/Stochastic Model for TTS Synthesis-by-Concatenation. Speech Communication 19, 119–143 (1996) 12. Bailly, G.: Accurate Estimation of Sinusoidal Parameters in a Harmonic+Noise Model for Speech Synthesis. In: Eurospeech 1999, Budapest, pp. 1051–1054 (1999) 13. Stylianou, Y.: Applying the Harmonic Plus Noise Model in Concatenative Speech Synthesis. IEEE Transactions on Speech and Audio Processing 9, 21–29 (2001) 14. Yegnanarayana, B., d’Alessandro, C., Darsinos, V.: An Iterative Algorithm for Decomposition of Speech Signals into Periodic and Aperiodic Components. IEEE Transactions on Speech and Audio Processing 6, 1–11 (1998) 15. Drioli, C., Tisato, G., Cosi, P., Tesser, F.: Emotions and Voice Quality: Experiments with Sinusoidal Modeling. In: Proceedings of Voice Quality, Geneva, pp. 127–132 (2003) 16. Ramamohan, S., Dandapat, S.: Sinusoidal Model-Based Analysis and Classification of Stressed Speech. IEEE Transactions on Audio, Speech, and Language Processing 14, 737– 746 (2006) 17. Gray, A.H., Markel, J.D.: A Spectral-Flatness Measure for Studying the Autocorrelation Method of Linear Prediction of Speech Analysis. IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP 22, 207–217 (1974) 18. Vích, R., Vondra, M.: Speech Spectrum Envelope Modeling. In: Esposito, A., FaundezZanuy, M., Keller, E., Marinaro, M. (eds.) COST Action 2102. LNCS (LNAI), vol. 4775, pp. 129–137. Springer, Heidelberg (2007) 19. Unser, M.: Splines. A Perfect Fit for Signal and Image Processing. IEEE Signal Processing Magazine 16, 22–38 (1999) 20. Scherer, K.R.: Vocal Communication of Emotion: A Review of Research Paradigms. Speech Communication 40, 227–256 (2003) 21. Fant, G.: Acoustical Analysis of Speech. In: Crocker, M.J. (ed.) Encyclopedia of Acoustics, pp. 1589–1598. John Wiley & Sons, Chichester (1997) 22. Fant, G.: Speech Acoustics and Phonetics. 
Kluwer Academic Publishers, Dordrecht (2004)
Anchor Model Fusion for Emotion Recognition in Speech Carlos Ortego-Resa, Ignacio Lopez-Moreno, Daniel Ramos, and Joaquin Gonzalez-Rodriguez ATVS - Biometric Recognition Group, Universidad Autonoma de Madrid, Spain
[email protected],
[email protected]
Abstract. In this work, a novel method for system fusion in emotion recognition for speech is presented. The proposed approach, namely Anchor Model Fusion (AMF), exploits the characteristic behaviour of the scores of a speech utterance among different emotion models, by a mapping to a back-end anchor-model feature space followed by a SVM classifier. Experiments are presented in three different databases: Ahumada III, with speech obtained from real forensic cases; and SUSAS Actual and SUSAS Simulated. Results comparing AMF with a simple sum-fusion scheme after normalization show a significant performance improvement of the proposed technique for two of the three experimental set-ups, without degrading performance in the third one. Keywords: emotion recognition, anchor models, prosodic features, GMM supervectors, SVM.
1 Introduction
There is an increasing interest in the automatic recognition of emotional states in a speech signal, mainly due to its applications in human-machine interaction [1], [2]. As a result, a wide range of different algorithms for emotion recognition have emerged. This fact motivates the use of fusion schemes in order to improve the performance of the system by combining different approaches. It is common for this task to be stated as a multiclass classification problem. However, emotion recognition can also be stated as a verification or detection problem. In such a case, given an utterance x and a target emotion e for which an emotion model me from a set M is trained, the objective is to determine whether the dominant emotion that affects the speaker in the utterance is e (target class) or any other (non-target class). In such a scheme, for any model me ∈ M and utterance x, a similarity score denoted as sx,me can be computed. Detection is performed by comparing sx,me to a threshold, which is generally set according to the minimization of some cost function. It is important to remark that the behaviour of the scores of a given utterance from a given emotion is different and characteristic for each model in M. Therefore, it is expected that the detection of the emotion in x will benefit not only
from the target scores obtained from its comparison with me, but also from the scores of x compared to the rest of the models in M. This motivates a two-level architecture, where the models mj ∈ M, j ∈ {1, .., Nfe}, are denoted as front-end models, as opposed to back-end models, which are trained in advance using scores such as sx,mj as feature vectors. This nomenclature has been adopted from language recognition [3], which is a similar problem. This work proposes a novel back-end approach for the fusion of the information obtained by Nsys different emotion detectors. It is based on anchor model fusion (AMF) [4], which uses the information of the relationship among all the models in M for improving detection performance. The results presented validate the proposed approach based on an experimental set-up in substantially different databases: Ahumada III (speech from real forensic cases) [5], SUSAS Simulated and SUSAS Actual [6]. AMF has been used to combine scores from two prosodic emotion recognition systems denoted as GMM-SVM and statistics-SVM. Performance results will be measured in terms of equal error rate (EER) and its average among emotions. This work is organized as follows. The anchor model feature space is described in Section 2. In Section 3, the proposed AMF method is described in detail. Section 4 describes the front-end systems implemented as well as the prosodic feature extraction. The experimental work which shows the adequacy of the approach is presented in Section 5. Finally, conclusions are drawn in Section 6.
2 Anchor Models Feature Space
Given a speech utterance x from an unknown spoken emotion, and a front-end emotion recognition system with Nfe target emotion models mj ∈ M, j ∈ {1, .., Nfe}, a similarity score sx,mj can be obtained as a result of comparing x against each emotion model mj. Thus, for every utterance x we obtain an Nfe-dimensional vector S̄x,M = [sx,m1 · · · sx,mNfe] that stacks all possible scores for x. This scheme defines a derived similarity feature space known as the anchor model space [4], onto which every utterance x can be mapped. In this new feature space any classifier can be trained in order to discriminate any given emotion in utterance x with respect to the rest, by learning the relative behaviour of the scores of the speech utterance x with respect to the models in M. An example of this relative behaviour is shown in Fig. 1, where utterances from four emotions (angry, question, neutral, stressed) are compared with two different cohorts M of anchor models.
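The mapping into the anchor-model space can be sketched as follows. This is illustrative only; the GMM log-likelihood scoring is just one possible choice of front-end similarity score and is not prescribed by the paper.

import numpy as np
from sklearn.mixture import GaussianMixture

def anchor_vector(frames, cohort):
    """Score one utterance (frames x dims) against every model in the cohort M;
    GaussianMixture.score returns the average log-likelihood per frame."""
    return np.array([model.score(frames) for model in cohort])

# cohort = [GaussianMixture(...) fitted on each front-end emotion]
# S_x_M = anchor_vector(utterance_frames, cohort)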
3 Anchor Model Fusion (AMF) Back-End
AMF is a data-driven approach that has shown a satisfactory performance in language recognition [7]. In AMF, the cohort of models M is built by including all the available models from the Nsys emotion recognition systems in the
front-end. The resulting vector of scores for utterance x, denoted as S̄AM, stacks the Nsys vectors S̄^j_{x,M} over all emotion recognition systems j in the front-end:

\bar{S}_{AM}(x, M) = \left[ \bar{S}^{1}_{x,M}, \cdots, \bar{S}^{N_{sys}}_{x,M} \right]        (1)

Fig. 1. At the top, the distribution of angry (left) and question (right) utterances over a set M formed by the emotion models in SUSAS Simulated speech. At the bottom, the distribution range of neutral (left) and stressed (right) utterances over a set M formed by the emotion models in Ahumada III.
Fig. 2 illustrates the process in which S_AM(x, M) is obtained by projecting x into the anchor model space defined by M. Hence, the number of dimensions of the anchor model space is d = \sum_{j=1}^{N_{sys}} N_j, where Nj is the number of models in the front-end system j. At this point, the objective is to boost the probability of finding a characteristic behaviour of the speech pattern in the anchor model space by increasing d, within the limits of the curse of dimensionality. This objective can be achieved by including more anchor models and/or systems to fuse. It is important to note that any emotion can be trained in the anchor model space, not only those in M. This so-called back-end emotion set will be the actual set of target emotions. Thus, once every testing utterance is projected onto the anchor model space, any classifier can be used for training any back-end emotion in this set. In this work, SVMs were applied due to their robustness as the dimension of the anchor model space increases.
Fig. 2. Diagram of generation of features in the AMF space. S¯AM (x, M ) stacks the scores of xi over the set of models mlj , for all languages j and all subsystems l.
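The back-end itself reduces to stacking the score vectors of the Nsys front-end systems and training one SVM per target emotion on the resulting d-dimensional features. A minimal sketch follows; the kernel and C value are illustrative assumptions, not taken from the paper.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def amf_vector(score_vectors):
    # S_AM(x, M): concatenation of the anchor-model score vectors of all systems (eq. 1)
    return np.concatenate(score_vectors)

def train_backend_svm(amf_features, labels, target_emotion):
    # one-vs-all detector for one back-end target emotion
    y = (np.asarray(labels) == target_emotion).astype(int)
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
    clf.fit(np.asarray(amf_features), y)
    return clf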
4 Emotion Recognition Systems Front-End
This section details the prosodic features extracted from the audio signal and used as input vectors for both front-end systems implemented. Subsections 4.2 and 4.3 describe their implementation in more detail.

4.1 Prosodic Features for Emotion Recognition
Prosodic features are often considered as input signals for emotion recognition systems due to their relation with the emotional state information [8]. In this work prosodic features consist of a set of d = 4-dimensional vectors with the short-term coefficients of energy; the logarithm of the pitch; and their velocity coefficients, also known as Δ features. These coefficients are extracted only from voiced segments with an energy value higher than 90% of the dynamic range. Mean normalization has been used for the energy and Δ-energy coefficients. Pitch and energy have been computed using Praat [9].
4.2 Prosodic GMM-SVM
Previous works have shown the excellent performance of SVM-GMM supervectors in the tasks of language and speaker recognition, while the application of
this technique to the prosodic level of speech was first introduced in [10]. This technique can be seen as a secondary parametrization capable of summarizing the distribution of the feature vectors in x into a single high-dimensionality vector. This high-dimensionality vector is known as a GMM supervector. In order to build the GMM supervector, first the prosodic vectors of x are used to train an M-mixture GMM model λx. This model is obtained from a Universal Background Model (UBM) λUBM using Maximum-A-Posteriori (MAP) adaptation of means. The GMM supervector of the utterance x is the concatenation of the M vectors of means in λx. GMM supervectors are often considered as kernel functions μ(x) that map prosodic features from dimension d into a high-dimensional feature space of size L = M ∗ d. Once every utterance is mapped into this L-dimensional supervector space, linear SVM models are used to train the front-end emotion models. Therefore, any mj is an L-dimensional vector that represents a hyperplane that optimally separates supervectors of utterances from the target emotion j with respect to supervectors from other emotions.
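A compact sketch of the supervector construction follows (means-only relevance-MAP adaptation of a UBM, followed by concatenation); the relevance factor is an assumed value, not taken from the paper.

import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_supervector(ubm: GaussianMixture, frames, relevance=16.0):
    post = ubm.predict_proba(frames)                 # responsibilities, frames x M
    n = post.sum(axis=0)                             # soft counts per mixture
    first = post.T @ frames                          # first-order statistics, M x d
    alpha = (n / (n + relevance))[:, None]
    means = alpha * first / np.maximum(n, 1e-8)[:, None] + (1.0 - alpha) * ubm.means_
    return means.reshape(-1)                         # supervector of length M * d

Linear SVMs trained on such supervectors would then give the front-end emotion models mj.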
4.3 Prosodic Statistics-SVM
This scheme is based on previous work presented in [11]. It consists of a statistical analysis of each prosodic coefficient followed by an SVM. The distribution of the prosodic values is characterized by computing n = 9 statistical coefficients per feature (Table 1). Once every utterance is mapped into this derived feature space of dimension L = d ∗ n, front-end emotion models are obtained as linear one-vs-all SVM models. A test-normalization scheme has been used for score normalization prior to AMF. First, the score distribution of every testing utterance x with respect to M is estimated assuming Gaussianity. The mean and variance of this distribution are then used to normalize the similarity scores of x over any model m.

Table 1. Statistical coefficients extracted for every prosodic stream from each utterance in the statistics-SVM approach

Coefficients: maximum, minimum, mean, standard deviation, median, first quartile, third quartile, skewness, kurtosis
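The nine statistics of Table 1 and the test-normalization step can be sketched as follows (illustrative only):

import numpy as np
from scipy.stats import kurtosis, skew

def prosodic_statistics(stream):
    # the nine per-stream statistics of Table 1
    q1, median, q3 = np.percentile(stream, [25, 50, 75])
    return np.array([np.max(stream), np.min(stream), np.mean(stream), np.std(stream),
                     median, q1, q3, skew(stream), kurtosis(stream)])

def test_normalize(scores):
    # centre and scale the scores of one test utterance against the cohort M,
    # assuming a Gaussian score distribution
    return (scores - np.mean(scores)) / (np.std(scores) + 1e-8)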
5 Experiments

5.1 Databases
The proposed emotion recognition system has been tested on the Ahumada III and SUSAS (Speech Under Simulated and Actual Stress) databases. Ahumada III consists of real forensic cases recorded by the Spanish police forces (Guardia Civil) and authorized by Spanish law under confidence agreements. It includes speech from 69 speakers and 4 emotional states (neutral, neutral-low, neutral-stressed, stressed), with 150-second training utterances and testing utterances between 5 and 10 seconds long. The SUSAS database is divided into two subcorpora with simulated and real spoken emotions. The SUSAS Simulated subcorpus contains speech from 9 speakers and 11 speaking styles. They include 7 simulated styles (slow, fast, soft, question, clear enunciation, angry) and four other styles under different workload conditions (high, cond70, cond50, moderate). The SUSAS Actual subcorpus contains speech from 11 speakers and 5 different real stress conditions (neutral, medst, hist, freefall, scream). The Actual and Simulated subcorpora contain 35 spoken words, each with 2 realizations, for every speaker and speaking style.

5.2 Results
The GMM-SVM front-end system requires a set of development data for building the model λUBM. Therefore, every database was split into two different non-overlapping sets. The first one was used for training an M = 256-mixture GMM UBM λUBM. The second set was used for implementing a double 10-fold cross-validation scheme: the first cross-validation stage is for training and testing front-end models, while back-end models are trained and tested during the second one. The AMF cohort M is built with models from all databases and systems. Therefore, for each front-end system we obtained 4 models from the Ahumada III corpus, 11 models from the SUSAS Simulated corpus and 5 models from the SUSAS Actual corpus. A third system is included as the sum fusion of both front-end systems. Thus, this scheme leads to an AMF space of (4 + 11 + 5) × 3 = 60 dimensions. In order to compare AMF with a baseline fusion technique we performed a standard sum fusion between the scores of the GMM-SVM and statistics-SVM systems.

Table 2. AMF performance improvement vs. sum fusion for Ahumada III. Results are presented in EER (%) and its relative improvement (R.I.).

Emotion            Baseline   AMF     R.I. %
neutral-low        50.21      30.02   -40.21
neutral            33.77      33.92   0.44
neutral-stressed   38.12      33.22   -12.85
stressed           28.69      25.7    -10.42
Avg. EER           37.7       30.72   -18.51
Table 3. AMF performance improvement vs. sum fusion for SUSAS Simulated (a) and SUSAS Actual (b)

(a) SUSAS Simulated
Emotion    Baseline   AMF     R.I. %
angry      22.93      32.76   42.87
clear      42.91      41.89   -2.38
cond50     41.01      33.57   -18.14
cond70     48.3       30.55   -36.75
fast       30.21      16.81   -44.36
lombard    34.85      38.65   10.9
loud       27.65      13.2    -52.26
neutral    40.53      35.31   -12.88
question   3.86       3.52    -8.81
slow       26.75      20.35   -23.93
soft       22.07      22.54   2.13
Avg. EER   31.01      26.29   -15.22

(b) SUSAS Actual
Emotion    Baseline   AMF     R.I. %
neutral    36.54      35.26   -3.5
medst      46.95      50.08   6.67
hist       42.57      39.14   -8.06
freefall   25.86      24.66   -4.64
scream     11.15      14.6    30.94
Avg. EER   32.61      32.75   0.43
Note that sum fusion outperforms the results obtained from either of the two front-end systems individually. Tables 2 and 3 summarize the results of the proposed approach. It can be seen that the average EER for all the emotions in Ahumada III and SUSAS Simulated improves by 15.52% and 18.61%, respectively. Remarkably good results are obtained for the neutral-low, loud and fast emotion models, while for the scream and angry models a significant loss of performance is obtained, probably due to non-modeled variability factors such as the speaker identity. The results for SUSAS Actual show neither improvement nor degradation in the average EER. This can be due to the environmental conditions of the SUSAS Actual corpus (amusement park roller-coaster and helicopter cockpit recordings). Under such conditions, noise patterns can characteristically affect scores in such a way that AMF cannot improve front-end results.
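For reference, the EER figures reported above can be computed from target and non-target scores with a simple threshold sweep. This is a basic sketch; interpolation at the exact crossing point is omitted.

import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])  # false acceptance
    frr = np.array([(target_scores < t).mean() for t in thresholds])      # false rejection
    i = np.argmin(np.abs(far - frr))
    return 0.5 * (far[i] + frr[i])

def relative_improvement(baseline_eer, amf_eer):
    # R.I. as reported in Tables 2 and 3
    return 100.0 * (amf_eer - baseline_eer) / baseline_eer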
6 Conclusions
This work introduces Anchor Model Fusion (AMF), a novel approach for the fusion of systems in emotion recognition. The approach is based on the anchor model space, which maps scores from the so-called front-end detectors to a different back-end feature space where they can be classified by an SVM. Therefore, back-end emotion models are supported over the set of front-end models M, which may be trained with different emotions, databases, recording conditions, etc. In this work the proposed AMF approach has been used for fusing two different prosodic emotion recognition systems as well as a third one obtained as the result of the sum fusion of both systems. Thus M has been built with 3 systems, each one with 20 front-end models, leading to a 60-dimensional AMF space. Experiments have been carried out over three corpora: Ahumada III (speech
from real forensic cases), SUSAS Simulated (speech with acted emotions) and SUSAS Actual (speech with actual emotions). Results of the proposed AMF scheme are compared with the sum fusion of both front-end systems, showing an EER relative improvement larger than 15% for the Ahumada III and SUSAS Simulated corpora. Future work will focus on the optimal selection of the front-end models M, normalization techniques for the anchor-model space vectors, and new classification methods for the back-end such as Linear Discriminant Analysis.
Acknowledgements. This work has been financed under project TEC2006-13170-C02-01.
References 1. Ververidisa, D., Kotropoulos, C.: Emotional speech recognition: Resources, features, and methods. Speech Communication (9), 1162–1181 (2006) 2. Picard, R.W.: Affective Computing. MIT Press, Cambridge (1997) 3. Ramabadran, T., Meunier, J., Jasiuk, M., Kushner, B.: Enhancing distributed speech recognition with back-end speech reconstruction. In: Proceedings of Eurospeech 2001, pp. 1859–1862 (2001) 4. Collet, M., Mami, Y., Charlet, D., Bimbot, F.: Probabilistic anchor models approach for speaker verification, 2005–2008 (2005) 5. Ramos, D., Gonzalez-Rodriguez, J., Gonzalez-Dominguez, J., Lucena-Molina, J.J.: Addressing database mismatch in forensic speaker recognition with ahumada iii: a public real-case database in spanish. In: Proceedings of Interspeech 2008, September 2008, pp. 1493–1496 (2008) 6. Hansen, J., Sahar, E.: Getting started with susas: a speech under simulated and actual stress database. In: Proceedings of Eurospeech 1997, pp. 1743–1746 (1997) 7. Lopez-Moreno, I., Ramos, D., Gonzalez-Rodriguez, J., Toledano, D.T.: Anchormodel fusion for language recognition. In: Proceedings of Interspeech 2008 (September 2008) 8. Hansen, J., Patil, S.: Speech under stress: Analysis, modeling and recognition. In: M¨ uller, C. (ed.) Speaker Classification 2007. LNCS (LNAI), vol. 4343, pp. 108–137. Springer, Heidelberg (2007) 9. Boersma, P., Weenink, D.: Praat: doing phonetics by computer (version 5.1.04) [computer program] (April 2009), http://www.praat.org/ 10. Hu, H., Xu, M.X., Wu, W.: Gmm supervector based svm with spectral features for speech emotion recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing. ICASSP 2007, vol. 4, pp. IV-413–IV-416 (2007) 11. Kwon, O.W., Chan, K., Hao, J., Lee, T.W.: Emotion recognition by speech signals. In: EUROSPEECH 2003, pp. 125–128 (2003)
Audiovisual Alignment in a Face-to-Face Conversation Translation Framework Jerneja Žganec Gros and Aleš Mihelič Alpineon Research and Development, Ulica Iga Grudna 15 1000 Ljubljana, Slovenia
[email protected]
Abstract. Recent improvements in audiovisual alignment for a translating videophone are presented. A method for audiovisual alignment in the target language is proposed and the process of audiovisual speech synthesis is described. The proposed method has been evaluated in the VideoTRAN translating videophone environment, where an H.323 software client translating videophone allows for the transmission and translation of a set of multimodal verbal and nonverbal clues in a multilingual face-to-face communication setting. An extension of subjective evaluation metrics of fluency and adequacy, which are commonly used in subjective machine translation evaluation tests, is proposed for usage in an audiovisual translation environment. Keywords: nonverbal communication, facial expressions, speech-to-speech translation, translating videophone, subjective evaluation methods.
1 Introduction In the last decade, users have continually oscillated between the impersonal nature of technology offering anonymous electronic communication and the intimate reality of human relationships. No question, technology is the great enabler. But, paradoxically, now the human bit seems to be more, not less, important than ever before [1]. There are many situations—often those involving escalating conflict, sensitive feelings, high priority, important authority, or a great deal of money—that demand people to take the time and trouble to get into the same room to exchange information. Or at least they try to simulate face-to-face communication when individuals are in remote locations using videophones or web-phones [2]. Face-to-face behaviors have two important elements: verbal and nonverbal. According to various investigations, verbal communication accounts for less than 30% of communication, whereas nonverbal elements represent at least the remaining 70%. Nonverbal communication—communication that does not use words—takes place all the time. Smiles, frowns, who sits where at a meeting, the size of the office, how long someone keeps a visitor waiting—all these communicate pleasure or anger, friendliness or distance, power and status. Eye contact, facial expressions, gestures, posture, voice, appearance—all these nonverbal clues influence the way the message is interpreted, or decoded, by the receiver. J. Fierrez et al. (Eds.): BioID_MultiComm2009, LNCS 5707, pp. 57–64, 2009. © Springer-Verlag Berlin Heidelberg 2009
In order to automatically facilitate face-to-face communication among people that speak different languages, a framework for audiovisual translation of face-to-face communication, such as VideoTRAN, can be used [3]. There are many subtle cues provided by facial expressions and vocal intonation that let us know how what we are saying is affecting the other person. Transmission of these nonverbal cues is very important when translating conversations from a source language into a target language. State-of-the-art speech-to-speech translation enables subtitling of phone conversations only. The SYNFACE project made it possible to facilitate telephone communication for hard-of-hearing people: a synthesized talking face controlled by the telephone voice channel allows hearing- disabled users to better understand phone conversations by lip-reading [4], [5]. Therefore, VideoTRAN presents one of the first attempts to build an audiovisual translation system. The organization of this chapter is as follows. The next section describes the VideoTRAN conceptual framework of translations of audiovisual face-to-face conversations. Then, a method for audiovisual alignment in the target language is proposed, followed by a description of the audiovisual speech synthesis process. The audiovisual translation framework has been tested in a demonstrator, a translating videophone, which allows for the transmission and translation of a set of multimodal verbal and nonverbal clues in a multilingual face-to-face communication setting. An extension of subjective evaluation metrics of fluency and adequacy, which are commonly used in subjective machine translation evaluation tests, is proposed for usage in an audiovisual translation environment.
2 The VideoTRAN Audiovisual Translation Framework

The VideoTRAN framework is an extension of our speech-to-speech translation research efforts within the VoiceTRAN project [6]. The VideoTRAN system architecture consists of three major modules [3]:

1. Audiovisual speech analysis: automatic speech recognition in the source language, enhanced by lip-reading visemes, enables segmentation of the source language video signal into units corresponding to audio representations of words.
2. Verbal and video alignment: audiovisual alignments in the target language are achieved through a) machine translation for the translation of verbal communication and b) video alignment for nonverbal conversation elements.
3. Audiovisual speech synthesis in the target language also adds to the intelligibility of the spoken output, which is especially important for hearing-disabled persons that can benefit from lip-reading and/or subtitles.

The underlying automatic speech recognition, machine translation, and speech synthesis techniques are described in [7]. There are, however, major open research issues that challenge the deployment of natural and unconstrained face-to-face conversation translation systems, even for very restricted application domains, because state-of-the-art automatic speech recognition and machine translation systems are far from perfect. In addition, in comparison to translating written text, conversational spoken messages are usually conveyed with relaxed syntax structures and casual spontaneous speech.
In practice, a demonstrator is typically implemented by imposing strong constraints on the application domain and the type and structure of possible utterances; that is in both the range and the scope of the user input allowed at any point of the interaction. Consequently, this compromises the flexibility and naturalness of using the system. The remainder of this section describes the audiovisual alignment and the audiovisual speech synthesis in the target language in greater detail.
3 AudioVisual Alignment in the Target Language

In most speech-to-speech translation tasks, especially if the two languages are not closely related, the word order in the source language can differ significantly from the word order of aligned words in the target language. Below is an example for word alignments of two phrases in the source language (SL) English and the target language (TL) Slovenian:

SL>  He was really upset.     PAUSE.  But why?
TL>  Res je bil razburjen.    PAUSE.  Zakaj vendar?
The changed word order in the target language requires changes in the video sequence, unless the entire phrase has been pronounced with the same facial expression, apart from lip and jaw movements involved in the production of verbal speech. In real-situation face-to-face communication, many forms of nonverbal communication can be detected, including those representing nonverbal speech sounds [8], which have to be aligned in the word alignment phase and are strongly coupled to the changes in facial expressions. When facial expression changes do occur within a phrase as part of nonverbal communication, we propose the following video-sequence recombination method for alignment of the audio and the video signal in the target language. We base our video alignment method on prominent words, which are often reflected by significant changes in facial expressions. Experimental evidence shows that upper-face eye and eyebrow movements are strongly related to prominence in expressive modes [10], [11]. Numerous separate annotation schemes for coding verbal and nonverbal communication have been recently proposed in several multimodal interaction and face-to-face communication research projects [9], [10]; however, further dedicated work on consolidation and standards of the annotation schemes is necessary. We trace facial expressions through upper-face action units (AUs) associated with the eyes and eyebrows in the Facial Action Coding System (FACS) system [11], [12]. Eyebrow action units, for instance, have action unit labels 1, 2, and 4: AU1—Inner Brow Raiser, AU2—Outer Brow Raiser, AU4—Brow Lowerer. When more than one action unit is present, the combination of action units is listed in ascending order; for example, AU1+2 expresses a combination of the inner and outer brow raisers. Initially, AUs are manually labelled on test conversation recordings. Automatic recognition of action units (AUs) [15] will be additionally implemented.
We assume that the alignment pair of a prominent word in the source language will again be a prominent word, which is conveyed with the same facial expression and prosodic markedness as the prominent word in the source language. To prove this assumption, a parallel corpus of bilingual audiovisual conversations is needed. First we segment the video signal in the target language into short video clips and align them with the recognized words in the source language. We annotate each video clip with a facial expression code, which can be N (neutral), an action unit (AU), or a transition in case of a facial expression onset (N-AU) or offset (AU-N). Further, we mark the words in the source language where a new facial expression occurred as prominent words (here, "really" and "why"):

     N    N-AU4   AU4      AU4-N        PAUSE   N      AU1+2
SL>  He   was     really   upset.       PAUSE.  But    why?
TL>  Res  je      bil      razburjen.   PAUSE.  Zakaj  vendar?
Following the assumption that prominent words match in both languages, we can derive the recombination of the video clips for the target language in the following way. First we annotate the facial expressions (AUs) of prominent words and the words on phrase boundaries in the target language according to the corresponding aligned facial expressions (AUs) in the source language:

     N    N-AU4   AU4      AU4-N        PAUSE   N      AU1+2
SL>  He   was     really   upset.       PAUSE.  But    why?
TL>  Res  je      bil      razburjen.   PAUSE.  Zakaj  vendar?
     AU4                   N            PAUSE   AU1+2  N
In the second step, facial expressions for the remaining words are predicted. If a prominent facial expression in the source language started with an onset (offset) on the previous (next) word, the same procedure is followed in the target language:

     N    N-AU4   AU4      AU4-N        PAUSE   N      AU1+2
SL>  He   was     really   upset.       PAUSE.  But    why?
TL>  Res  je      bil      razburjen.   PAUSE.  Zakaj  vendar?
     AU4  AU4-N   N        N            PAUSE   AU1+2  N
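A simplified sketch of the first annotation step (copying the AUs of prominent source-language words to their aligned target-language words) is given below; the handling of onset/offset transitions, pause positions and phrase-boundary words from the second and third examples is only indicated in the comments. The data structures are assumptions made for the sketch.

def is_prominent(au):
    # a prominent word carries a plain AU label (a new facial expression),
    # i.e. neither neutral, a pause, nor an onset/offset transition
    return au not in ("N", "PAUSE") and "-" not in au

def annotate_target(sl_aus, sl_to_tl, n_tl_words):
    """sl_aus[i] is the AU code of source word i; sl_to_tl maps a source word
    index to the index of its aligned target word."""
    tl_aus = ["N"] * n_tl_words
    for i, au in enumerate(sl_aus):
        if is_prominent(au) and i in sl_to_tl:
            tl_aus[sl_to_tl[i]] = au
    # a full implementation would also copy PAUSE positions, annotate the
    # phrase-boundary words, and add N-AU / AU-N transitions on the words
    # preceding / following each prominent target word
    return tl_aus

# example with the sentence above:
# sl_aus = ["N", "N-AU4", "AU4", "AU4-N", "PAUSE", "N", "AU1+2"]
# sl_to_tl = {2: 0, 6: 5}      # really -> Res, why -> Zakaj
# annotate_target(sl_aus, sl_to_tl, 7)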
If the transition onsets into (or offsets out of) prominent facial expressions are not present in the original video signal, transition trajectories between the facial movements have to be modeled for transient features, such as nasolabial furrows, crowsfeet wrinkles, and nose wrinkles. The presented video-sequence recombination method for alignment of the audio and the video signal in the target language has several restrictions. It works best for
frontal views with limited sideward head rotation only; in other cases perspective alignments need to be implemented.
4 AudioVisual Speech Synthesis

The video alignment method based on matching prominent words described in the previous section provides the translation of only nonverbal elements between the source language and the target language. Lip and jaw movements involved in the production of verbal speech have to be modeled separately and integrated into the final video sequence. Speech-synchronous face animation takes advantage of the correlation between speech and facial coarticulation. It takes the target language speech stream as input and yields corresponding face animation sequences. In our first experiment, we performed rudimentary modeling of lip movements only. The mouth region is located and replaced by an artificial lip controlled by the lip-relevant MPEG-4 Facial Animation Standard [16] viseme facial animation parameters (which are commonly used in animated talking agents [17]) as shown in Figure 1. In order to allow for coarticulation of speech and mouth movement, transitions from one viseme to the next are defined by blending the two visemes with a weighting factor.
Fig. 1. VideoTRAN faces after processing. The mouth region in the image is covered by a rectangle with an artificial mouth, which is controlled by visemes and moves synchronously with the uttered phonemes.
During non-verbal periods such as pauses, grunts, laughing, the artificial mouth is replaced by the original video signal. As an alternative to artificial lip models, manipulation of video images of the original speaker’s lips can be used [18]. The resulting speaking voice can also be adapted to that of the original speaker by applying voice transformation techniques.
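Viseme blending with a weighting factor can be sketched as a simple linear interpolation of the lip-relevant FAP vectors. The frame counts and overlap below are illustrative assumptions, not values from the paper.

import numpy as np

def blend(fap_a, fap_b, alpha):
    # weighted mix of two viseme parameter vectors, alpha in [0, 1]
    return (1.0 - alpha) * np.asarray(fap_a, float) + alpha * np.asarray(fap_b, float)

def viseme_trajectory(visemes, frames_per_viseme=6, overlap=2):
    frames = []
    for k, vis in enumerate(visemes):
        nxt = visemes[k + 1] if k + 1 < len(visemes) else vis
        for j in range(frames_per_viseme):
            if j >= frames_per_viseme - overlap:
                alpha = (j - (frames_per_viseme - overlap) + 1) / (overlap + 1.0)
                frames.append(blend(vis, nxt, alpha))
            else:
                frames.append(np.asarray(vis, float))
    return np.array(frames)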
5 The VideoTRAN Videophone Client

The VideoTRAN demonstrator under development is an H.323 client softphone with videophone capabilities. It is built around the CPhone open-source voice-over-IP (VoIP) solution [19]. It is based on the OpenH.323 and Trolltech PWLib libraries. Standard G.711 audio codec and the H.261 video codec are supported. It supports full duplex audio and bi-directional video. The audiovisual processing of the first analysis unit or phrase inevitably introduces a short delay into the conversation. The audio stream is replaced by utterances in the
target language, whereas the video stream is cut into clips and recombined according to MT anchors corresponding to prominent words (significant AUs), as described in section 2.1. If the audio duration of a segment in the target language is longer than the source-language segment duration, the video signal is extrapolated to fit the target-language audio duration. Otherwise, the video signal is cut off at the end of the target-language audio. Transitional smoothing at the video concatenation points still needs to be implemented. During nonverbal speech segments such as pauses, grunts and laughing, the original audio and video signals are brought back into the foreground.
Fig. 2. Screenshot of the VideoTRAN VoIP software client. To the right of the window with the video image, utterances in both the source language (recognized utterances) and the target language (translated utterances) are displayed.
A simple, user-friendly user interface makes it possible to set up video calls. To the right of the window with the video image, utterances in both the source language, i.e. the recognized utterances, and the target language, i.e. the translated utterances, are displayed, as shown in Figure 2. VideoTRAN is a fully H.323-compliant video conferencing solution compatible with other H.323 video conferencing clients and devices, such as Microsoft NetMeeting, OhPhone, and GnomeMeeting.
6 Subjective Evaluation

The first competitive MT evaluations, like the series of DARPA MT evaluations in the mid 1990s, evaluated machine translation output with human reference translations on the basis of fluency and adequacy [20]. Fluency refers to the degree to which the translation is well-formed according to the grammar of the target language. Adequacy refers to the degree to which the translation communicates the information present in the reference output. We propose an adaptation of this subjective evaluation methodology for assessing the fluency and adequacy of the verbal and nonverbal translations in the target language message. Here we report some preliminary findings from the VideoTRAN evaluation experiment for the Slovenian–English language pair in the travel domain. The grading assignments for each grader were split into two parts. First, the MT output was displayed and the grader had to judge the fluency of the translation, along with the fluency of the displayed nonverbal information. In the second step, a reference translation was given and the grader had to evaluate the adequacy of the translation along with the displayed nonverbal information in the target language. In order to minimize grading inconsistencies between graders due to contextual misinterpretations of the translations, the situation in which the sentence was uttered was provided for the adequacy judgment with corpus annotations like “food” or “luggage”. The preliminary test results indicate a fluency of 2.3 (‘disfluent’ – ‘nonnative’), and an adequacy of 3.1 (‘much information’), both on a 5-point grading scale.
7 Conclusions Face-to-face communication remains the most powerful human interaction; electronic devices can never fully replace the intimacy and immediacy of people conversing in the same room, or via a videophone. In face-to-face conversation, there are many subtle cues provided by facial expressions and vocal intonation that let us know how what we are saying is affecting the other person. Therefore the transmission of these nonverbal cues is very important when translating conversations from a source language into a target language. We have presented the VideoTRAN conceptual framework for translating audiovisual face-to-face conversations. It provides translations for both verbal and nonverbal components of the conversation. A simple method for audiovisual alignment in the target language has been proposed and the process of audiovisual speech synthesis has been described. The VideoTRAN framework has been tested in a translating videophone demonstrator: an H.323 software client translating videophone allows for the transmission and translation of a set of multimodal verbal and nonverbal clues in a multilingual face-to-face communication setting. In order to evaluate the performance of the proposed face-to-face conversation translation approach, a combination of speech-to-speech translation performance metrics and facial expression performance metrics has been proposed. The VideoTRAN concept allows for automatic face-to-face tele-conversation translation. It can be used for either online or offline translation and annotation of audiovisual face-to-face conversation material. The exploitation potentials of the VideoTRAN framework are numerous. Cross-cultural universal features in the form of gestures and postures can be transmitted and translated along with the facial expressions into the target language.
Acknowledgements Part of the work presented in this paper was financed as part of the AvID project, contract number M2-0132, supported by the Slovenian Ministry of Defense and the Slovenian Research Agency. The paper was written as a part of the cooperation within COST Action 2102, Cross-Modal Analysis of Verbal and Nonverbal Communication.
References 1. Roebuck, C.: Effective Communication. American Management Association (1999) 2. Begley, A.K.: Face to Face Communication: Making Human Connections in a Technology-Driven World. In: Thomson Learning, Boston, MA (2004) 3. Žganec Gros, J.: VideoTRAN: A translation framework for audiovisual face-to-face conversations. In: Esposito, A., Faundez-Zanuy, M., Keller, E., Marinaro, M. (eds.) COST Action 2102. LNCS (LNAI), vol. 4775, pp. 219–226. Springer, Heidelberg (2007) 4. Spens, K.-E., Agelfors, E., Beskow, J., Granström, B., Karlsson, I., Salvi, G.: SYNFACE, a Talking Head Telephone for the Hearing Impaired. In: Proceedings of the IFHOH 7th World Congress, Helsinki, Finland (2004) 5. Agelfors, E., Beskow, J., Karlsson, I., Kewley, J., Salvi, G., Thomas, N.: User evaluation of the SYNFACE talking head telephone. In: Miesenberger, K., Klaus, J., Zagler, W.L., Karshmer, A.I. (eds.) ICCHP 2006. LNCS, vol. 4061, pp. 579–586. Springer, Heidelberg (2006) 6. Žganec Gros, J., Mihelič, F., Erjavec, T., Vintar, Š.: The VoiceTRAN Speech-to-Speech Communicator. In: Matoušek, V., Mautner, P., Pavelka, T. (eds.) TSD 2005. LNCS (LNAI), vol. 3658, pp. 379–384. Springer, Heidelberg (2005) 7. Žganec Gros, J., Gruden, S.: The VoiceTRAN Machine Translation System. In: Proceedings of the Interspeech 2007, Antwerpen, Belgium, pp. 1521–1524 (2007) 8. Campbell, N.: On the use of nonVerbal speech sounds in human communication. In: Esposito, A., Faundez-Zanuy, M., Keller, E., Marinaro, M. (eds.) COST Action 2102. LNCS (LNAI), vol. 4775, pp. 117–128. Springer, Heidelberg (2007) 9. Bernsen, N.O., Dybkjær, L.: Annotation schemes for verbal and non-verbal communication: Some general issues. In: Esposito, A., Faundez-Zanuy, M., Keller, E., Marinaro, M. (eds.) COST Action 2102. LNCS (LNAI), vol. 4775, pp. 11–22. Springer, Heidelberg (2007) 10. Ruttkay, Z.: A Presenting in Style by Virtual Humans. In: Esposito, A., Faundez-Zanuy, M., Keller, E., Marinaro, M. (eds.) COST Action 2102. LNCS (LNAI), vol. 4775, pp. 23– 36. Springer, Heidelberg (2007) 11. Ekman, P., Friesen, W.V.: Facial Action Coding System. Consulting Psychologists Press, Palo Alto (1978) 12. Ekman, P., Friesen, W.V., Hager, J.C. (eds.): Facial Action Coding System. Research Nexus, Network Research Information, Salt Lake City, UT (2002) 13. Krahmer, E., Ruttkay, Z., Swerts, M., Wesselink, W.: Perceptual Evaluation of Audiovisual Cues for Prominence. In: Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP 2002), Denver, CO, pp. 1933–1936 (2002) 14. Beskow, J., Granström, B., House, D.: Visual Correlates to Prominence in Several Expressive Modes. In: Proceedings of the Interspeech 2006, Pittsburg, PA, pp. 1272–1275 (2006) 15. Tian, Y.L., Kanade, T., Cohn, J.F.: Facial Expression Analysis. In: Li, S.Z., Jain, A.K. (eds.) Handbook of Face Recognition. Springer, New York (2005) 16. Pandzic, I., Forchheimer, R.: MPEG-4 Facial Animation – the Standard, Implementation and Applications. John Wiley & Sons, Chichester (2002) 17. Beskow, J., Granström, B., House, D.: Analysis and Synthesis of Multimodal Verbal and Non-verbal Interaction for Animated Interface Agents. In: Esposito, A., Faundez-Zanuy, M., Keller, E., Marinaro, M. (eds.) COST Action 2102. LNCS (LNAI), vol. 4775, pp. 250– 263. Springer, Heidelberg (2007) 18. Ezzat, T., Geiger, G., Poggio, T.: Trainable Videorealistic Speech Animation. In: Proceedings of the ACM SIGGRAPH 2002, San Antonio, TX, pp. 388–398 (2002) 19. 
Cphone project, http://sourceforge.net/projects/cphone 20. White, J., O’Connell, T., O’Mara, F.: The ARPA MT evaluation methodologies: evolution, lessons, and future approaches. In: Proc. of the AMTA, pp. 193–205 (1994)
Maximising Audiovisual Correlation with Automatic Lip Tracking and Vowel Based Segmentation
Andrew Abel1, Amir Hussain1, Quoc-Dinh Nguyen2, Fabien Ringeval2, Mohamed Chetouani2, and Maurice Milgram2
1 Dept. of Computing Science, University of Stirling, Scotland, UK
2 Institute of Intelligent Systems and Robotics, University Pierre and Marie Curie (Paris 6), 4 Place Jussieu, Paris, France
[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract. In recent years, the established link between the various human communication production domains has become more widely utilised in the field of speech processing. In this work, a state of the art Semi Adaptive Appearance Model (SAAM) approach developed by the authors is used for automatic lip tracking, and an adapted version of our vowel based speech segmentation system is employed to automatically segment speech. Canonical Correlation Analysis (CCA) on segmented and non segmented data in a range of noisy speech environments finds that segmented speech has a significantly better audiovisual correlation, demonstrating the feasibility of our techniques for further development as part of a proposed audiovisual speech enhancement system.
1 Introduction
The multimodal nature of both human speech production and perception is well established. The relationship between audio and visual speech has been investigated in the literature, demonstrating that speech acoustics can be estimated using visual information. Almajai et al. [1] recently investigated correlation between audio and visual features using Multiple Linear Regression (MLR), and expanded upon this to devise a visually derived Wiener filter for speech enhancement [2]. Sargin et al. [3] also performed correlation analysis of multimodal speech, but used Canonical Correlation Analysis (CCA) [4] as part of a speaker identification task. However, one aspect of speech processing that has not been researched much is the detailed analysis of multimodal correlation with noisy speech. Since the pioneering work by Girin et al. [5], which uses an Independent Component Analysis (ICA) approach to estimate “cleaned” spectral parameters, little additional correlation work in noisy environments has been published. The most relevant visual speech information is contained in the lip region, and so it is desirable to focus on lip features alone. Speakers very rarely remain motionless when talking, and so it is impractical to extract visual features from
the same location of a frame for a whole speech sequence. Visual feature extraction by manually labelling all frames in an image sequence is unsuitable for an integrated speech processing system. A novel Semi Adaptive Appearance Model (SAAM) lip tracking approach has been developed by the authors, and is used here to automatically track lip features in image sequences. In addition to our automated lip tracking approach, speech segmentation is utilised. Almajai et al. found improved audiovisual correlation when individual phonemes within a sentence were processed rather than a whole sentence, implying that a more nuanced approach to audiovisual data can improve correlation. Therefore, the authors have adapted a vowel based speech segmentation system [6], which detects vowels in speech and segments the data accordingly. In this paper, we combine our automatic vowel based speech segmentation and lip tracker to perform CCA on speech from the VidTIMIT Corpus [7] and maximise correlation between visual and audio speech signals. The performance of CCA on segmented vowels and automatically tracked images is compared to that of complete spoken sentences [3]. Additionally, the performance of this approach in a range of noisy environments is assessed by adding noise to speech from the corpus. The results produced with our tracking and segmentation techniques demonstrate the feasibility of extending this initial work and integrating these approaches into a future proposed audiovisual speech enhancement system. The rest of this paper is divided up as follows. Section 2 describes our feature extraction methods, firstly describing the vowel segmentation approach, and then the automated lip tracking method. Section 3 outlines the CCA used on audiovisual data in this work, with results discussed in section 4. Finally, section 5 summarises this paper and outlines future research directions.
2 Multimodal Feature Extraction
2.1 Audio Feature Extraction
In this work, our original vowel detection method [6] has been adapted to deal with noisy speech. Our new system is completely automatic and is both speaker and language independent. According to the source-filter model of speech production, vowels are characterised by a particular spectral envelope qualified as formantic, which reveals the positions of the formants. The detection of vowel-like segments is based on the characterisation of the spectral envelope. A spectral measure termed the “Reduce Energy Cumulating” (REC) function (equation 1) has been proposed for vowel spectrum characterisation [8] by comparing the energy computed from Mel bank filters. The speech is segmented into overlapping frames, N spectral energy values E_i are extracted for each frame k, and those that are higher than their respective mean value \bar{E}(k) are cumulated and weighted by the energy ratio between the low-frequency band E_LF and the total band E_T:

REC(k) = \frac{E_{LF}(k)}{E_{T}(k)} \sum_{i=1}^{N} \left[ E_i(k) - \bar{E}(k) \right]^{+}    (1)

where [\cdot]^{+} indicates that only the positive differences contribute to the sum.
For a given sentence, peak detection on the REC curve (smoothed with a simple moving average filter) allows vocalic nucleus detection. To reject low-energy peaks, which can be related either to spectral noise or to low-energy vowels, only those higher than half the mean of the REC values are kept. Due to local spectral variability, successive detected peaks of the REC curve can be very close; if two vowels are detected closer than 150 ms apart, only the single highest peak is kept. Segmental borders are then found depending on the local peak configuration of the REC curve: the first REC values that fall to half the height of their corresponding peaks are used to delimit the detected nucleus. Borders are set at 100 ms away from the nucleus to avoid overlapping vowels. Sentences from VidTIMIT are segmented into 32 ms frames with an overlap ratio of 50%. 22 Mel Frequency Cepstral Coefficients (MFCC) are computed on these frames, producing MFCC matrices for full sentences. To extract vowel-only data, the vowel localisation results are used to group the vowel-only segments together into a single MFCC file, providing the feature vector f_y.
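As a rough illustration of this vowel nucleus detection step (not taken from the paper), the following Python sketch computes the REC curve from a precomputed matrix of Mel filter-bank energies and peak-picks it; the number of bands counted as "low frequency" and the smoothing window length are assumptions.

```python
import numpy as np
from scipy.signal import find_peaks

def rec_curve(E, n_low_bands=8):
    """REC measure per frame from Mel-band energies E (n_bands x n_frames).

    Positive deviations of the band energies from their per-frame mean are
    cumulated and weighted by the low-to-total energy ratio (equation 1).
    `n_low_bands` is an illustrative assumption.
    """
    mean_e = E.mean(axis=0, keepdims=True)               # \bar{E}(k)
    positive_dev = np.clip(E - mean_e, 0.0, None).sum(axis=0)
    e_low, e_tot = E[:n_low_bands].sum(axis=0), E.sum(axis=0) + 1e-12
    return (e_low / e_tot) * positive_dev

def detect_vowel_nuclei(E, frame_step_ms=16.0, smooth_frames=5):
    """Peak-pick the smoothed REC curve to locate vowel nuclei (frame indices)."""
    rec = rec_curve(E)
    kernel = np.ones(smooth_frames) / smooth_frames      # simple moving average
    rec_smooth = np.convolve(rec, kernel, mode="same")
    # keep only peaks higher than half the mean REC value, at least 150 ms apart
    peaks, _ = find_peaks(rec_smooth,
                          height=0.5 * rec_smooth.mean(),
                          distance=max(1, int(150.0 / frame_step_ms)))
    return peaks, rec_smooth
```

Segment borders would then be placed around each detected nucleus, at 100 ms from it as described above, before the corresponding MFCC frames are grouped into the vowel-only feature file.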
2.2 Visual Feature Extraction with SAAM
Visual lip features are extracted by using our newly developed semi adaptive appearance models (SAAMs) [9]. Lip tracking essentially deals with non-stationary data, as the appearance of a target object may alter drastically over time due to factors like pose variation and illumination changes. Our lip tracking framework is based on Adaptive Appearance Models (AAMs) [10], which allow us to update the mean and eigenvectors of d-dimensional observation vectors x ∈ R^d. First, we extend the AAMs by inserting a supervisor model [11] that verifies AAM performance at each frame in the sequence, by using a Support Vector Machine (SVM) to filter the AAM result for an individual frame, as shown in (2):

f(x) = \mathrm{sgn}\left( \sum_{i} \alpha_i y_i K(x_i, x) + b \right)    (2)

where f(x) ∈ {−1, 1} signifies whether x is a good or a bad result, α and b are trained offline with the SVM in [12], K(·) is the Gaussian kernel function, and x_i and x are the training and observation vectors respectively. Each y_i represents the desired output of the example x_i from the offline training dataset. Secondly, shape models are constructed to allow our SAAM to track feature points in video sequences. To model deformation, we form a shape model S° = \bar{S}° + P_s b, where the normalized shape is S° = (x°_1, y°_1, …, x°_n, y°_n), \bar{S}° is its mean, and n represents the number of feature points. To track these, it is sufficient to find the parameters p = [b_1, …, T_x, T_y, θ, s], where b_i is the coefficient to deform S° and T_x, T_y, θ, s represent the translation, rotation and scale parameters. To track a target object, the aim is to maximize the cost function given in (3):

p^{*} = \arg\max_{p} (d_e)    (3)
where d_e is a negative exponential of the projection error between x_t and the Principal Component Analysis (PCA) subspace created by earlier observations, defined by equation (4):

d_e = \exp\left( - \left\| (x_t - \bar{x}) - U U^{T} (x_t - \bar{x}) \right\|^{2} \right)    (4)

Note that the distance d_e is a Gaussian distribution with eigenvectors U and mean \bar{x}, d_e = p(x_t | p) ∝ N(x_t; \bar{x}, U U^{T} + \varepsilon I) as \varepsilon → 0, and the inverse matrix can be solved by applying the Woodbury formula [11], given in equation (5):

(U U^{T} + \varepsilon I)^{-1} = \varepsilon^{-1} \left[ I - (1 + \varepsilon)^{-1} U U^{T} \right]    (5)

The optimal parameter p* is found over a number of iterations. Here, we use an empirical gradient, since we evaluate the cost function in the neighbourhood of the current parameter vector value. Our tracking algorithm works as follows:
1. Manually locate the target object in the first frame (t = 1). The eigenvectors U are initialized as empty, so the tracker initially works as a template-based tracker.
2. At the next frame, find the optimal parameters p* = \arg\max d_e(p_i^k) over a number of iterations:
   – For each parameter p_i, and for each Δp and k ∈ {−1, +1}, compute p_i^k = (p_1, …, p_i + kΔp, …, p_{k+4}).
   – Compute k* = \arg\max_k d_e(p_i^k).
   – Set p ← p_i^{k*} and store d_e(p_i^{k*}).
3. Check the observation vector x = x(W^{-1}(S°, p*)), where W is a transformation (warping) matrix, with the result estimation phase as shown in equation (2).
4. If f(x) = 1, this signifies a good result to add to the model. When the desired number of new images has been accumulated, perform an incremental update.
5. Return to step 2.
This technique is used to find the 2D-DCT vector f_x = 2D-DCT(x). VidTIMIT contains a number of image sequences of sentences recorded at 25 fps. The first 30 2D-DCT components of each image are vectorised in zigzag order to produce the vector for a single frame in an image sequence. The resulting 2D-DCT sequence is then interpolated to match the equivalent MFCC matrices.
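As an illustration of this visual feature step (not the authors' code), the Python sketch below extracts the first 30 zigzag-ordered 2D-DCT coefficients from a tracked lip region and linearly interpolates the resulting 25 fps feature sequence onto the audio frame times; the particular zigzag routine, interpolation choice and 16 ms audio step are assumptions.

```python
import numpy as np
from scipy.fft import dctn

def zigzag_dct_features(lip_roi, n_coeffs=30):
    """First `n_coeffs` 2D-DCT coefficients of a grayscale lip ROI, in zigzag order."""
    coeffs = dctn(lip_roi.astype(float), type=2, norm="ortho")
    h, w = coeffs.shape
    # enumerate (row, col) indices along anti-diagonals (a common zigzag ordering)
    order = sorted(((r, c) for r in range(h) for c in range(w)),
                   key=lambda rc: (rc[0] + rc[1],
                                   rc[1] if (rc[0] + rc[1]) % 2 else rc[0]))
    return np.array([coeffs[r, c] for r, c in order[:n_coeffs]])

def interpolate_to_audio_rate(visual_feats, video_fps=25.0, audio_step_s=0.016):
    """Resample a (n_video_frames x n_coeffs) sequence onto the MFCC frame times."""
    n_frames, n_coeffs = visual_feats.shape
    t_video = np.arange(n_frames) / video_fps
    t_audio = np.arange(0.0, t_video[-1] + 1e-9, audio_step_s)
    return np.stack([np.interp(t_audio, t_video, visual_feats[:, j])
                     for j in range(n_coeffs)], axis=1)
```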
3 Canonical Correlation Analysis
In this paper, CCA [4] is used to analyse the linear relationships between multidimensional audio and visual speech variables by attempting to identify a basis vector for each variable that then produces a diagonal correlation matrix. CCA maximises the diagonal elements of the correlation matrix. The main difference between CCA and other forms of correlation analysis is the independence of
analysis from the coordinate system describing the variables. Ordinary correlation analysis can produce different results depending on the coordinate system used, whereas CCA finds the optimal coordinate system. Let f_x and f_y represent the multidimensional visual and audio signal variables, with projection matrices u_x and u_y, calculated using the QR decomposition method [11], that mutually maximise the projections of f_x and f_y onto their respective basis vectors. When the linear combinations \hat{f}_x = u_x^{T} f_x and \hat{f}_y = u_y^{T} f_y are considered, we aim to maximise ρ as defined in equation (6):

\rho = \frac{E[\hat{f}_x^{T} \hat{f}_y]}{\sqrt{E[\hat{f}_x^{T} \hat{f}_x] \, E[\hat{f}_y^{T} \hat{f}_y]}}    (6)

with the total covariance block matrix given in equation (7):

C = \begin{pmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{pmatrix} = E\left[ \begin{pmatrix} f_x \\ f_y \end{pmatrix} \begin{pmatrix} f_x \\ f_y \end{pmatrix}^{T} \right]    (7)

C_{xx} and C_{yy} represent the within-set covariance matrices and C_{xy} = C_{yx}^{T} represents the between-set matrix. In order to find the canonical correlations between f_x and f_y, it is necessary to solve the eigenvalue equations in (8):

C_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx} \, u_x = \rho^{2} u_x,
C_{yy}^{-1} C_{yx} C_{xx}^{-1} C_{xy} \, u_y = \rho^{2} u_y    (8)

where the eigenvalues ρ² represent the squared canonical correlations. It is only necessary to solve one eigenvalue equation, since the two components of equation (8) are related as shown in (9):

C_{yx} u_x = \rho \lambda_y C_{yy} u_y,
C_{xy} u_y = \rho \lambda_x C_{xx} u_x,   where  \lambda_x = \lambda_y^{-1} = \sqrt{\frac{u_y^{T} C_{yy} u_y}{u_x^{T} C_{xx} u_x}}    (9)
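As a rough illustration of how equation (8) can be solved in practice (a minimal numpy sketch, not the authors' implementation; the small ridge term added for numerical stability is an assumption), the canonical correlations of two feature matrices can be obtained as follows.

```python
import numpy as np

def canonical_correlations(fx, fy, reg=1e-6):
    """Canonical correlations between fx (n_samples x dx) and fy (n_samples x dy).

    Solves C_xx^-1 C_xy C_yy^-1 C_yx u_x = rho^2 u_x (equation 8) and returns
    the canonical correlations rho sorted in decreasing order. `reg` keeps the
    covariance matrices invertible and is not part of the paper.
    """
    fx = fx - fx.mean(axis=0)
    fy = fy - fy.mean(axis=0)
    n = fx.shape[0]
    cxx = fx.T @ fx / n + reg * np.eye(fx.shape[1])
    cyy = fy.T @ fy / n + reg * np.eye(fy.shape[1])
    cxy = fx.T @ fy / n
    m = np.linalg.solve(cxx, cxy) @ np.linalg.solve(cyy, cxy.T)
    rho2 = np.clip(np.linalg.eigvals(m).real, 0.0, 1.0)
    return np.sort(np.sqrt(rho2))[::-1]

# the squared sum of these correlations gives the measure tau used in Section 4:
# tau = float(np.sum(canonical_correlations(fx, fy) ** 2))
```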
4 Results
4.1 Comparison of CCA with Sentences and Segments
Initially, the synchrony between the audio and visual signals was assessed. Existing work [3] found maximum audiovisual correlation with an audio delay of 40 ms. To corroborate this, CCA was applied to a 24-sentence dataset from VidTIMIT. The canonical correlations of each sentence were calculated, and the correlation measure used is defined as τ, where:

\tau = \sum_{i=1}^{N} \lambda_i^{2}    (10)
Fig. 1. (a) CCA feature synchrony results, showing maximum τ with a shift of approx 48ms. (b) Comparison of mean canonical correlations of complete VidTIMIT sentences and vowel segments.
Where N represents the number of canonical correlations found. τ was taken when shifting the visual data in relation to the equivalent audio data. The mean synchronisation results are shown in fig.1(a), confirming that audiovisual correlation is maximised when there is a small degree of asynchrony, in line with results found by Sargin et al. [3]. Accordingly, subsequent experiments are shifted by three frames. The performance of CCA with segmented speech was then assessed. Our proposed vowel segmentation technique was used to extract speech sentence segments containing vowels. CCA was then performed on these segments, and compared to results from full sentences. The process used is shown in fig.2. The mean canonical correlations of sentences and segments are shown in fig.1(b), which shows two lines. The solid line indicates the mean canonical correlation coefficients for whole sentences, while the dashed line represents mean
Fig. 2. Proposed multimodal speech enhancement system, showing role of lip tracking, speech segmentation, and CCA transforms described in this paper. Proposed fusion, speech estimation and wiener filtering components are also shown. Solid lines indicate components used in this work, and dashed lines represent proposed future work.
coefficients for vowel segments only. This shows a clear difference in correlation values and behaviour, with segmented speech producing a significantly higher multimodal correlation: as can be seen from Fig. 1(b), vowel segments produce much stronger initial canonical correlations. This is confirmed by comparing the squared sum of canonical correlations (τ) for sentences and segments, with results of 3.41 and 8.48 respectively, showing that speech segmentation significantly increases correlation.
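The synchrony check of Fig. 1(a) and the computation of τ can be sketched as below (illustrative only, not the authors' code); it assumes that the visual and audio feature matrices have already been interpolated to the same number of frames, and it shifts the visual frames against the audio frames before computing τ.

```python
import numpy as np

def tau(fx, fy, reg=1e-6):
    """Squared sum of canonical correlations (equation 10) between two feature sets."""
    fx = fx - fx.mean(axis=0)
    fy = fy - fy.mean(axis=0)
    n = fx.shape[0]
    cxx = fx.T @ fx / n + reg * np.eye(fx.shape[1])
    cyy = fy.T @ fy / n + reg * np.eye(fy.shape[1])
    cxy = fx.T @ fy / n
    # canonical correlations are the singular values of Lx^-1 Cxy Ly^-T,
    # where Cxx = Lx Lx^T and Cyy = Ly Ly^T (Cholesky factors)
    lx, ly = np.linalg.cholesky(cxx), np.linalg.cholesky(cyy)
    k = np.linalg.solve(lx, np.linalg.solve(ly, cxy.T).T)
    rho = np.clip(np.linalg.svd(k, compute_uv=False), 0.0, 1.0)
    return float(np.sum(rho ** 2))

def best_av_shift(visual, audio, max_shift=6):
    """Visual-frame shift (in frames) that maximises tau, as in Fig. 1(a)."""
    scores = {}
    for s in range(-max_shift, max_shift + 1):
        v = visual[max(0, s):len(visual) + min(0, s)]
        a = audio[max(0, -s):len(audio) + min(0, -s)]
        scores[s] = tau(v, a)
    return max(scores, key=scores.get), scores
```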
4.2 Noisy Speech Investigation
The previous section was extended by performing CCA in a variety of noisy environments. A number of noises (filtered pink noise, F16 aircraft noise, and incoherent speech babble) were added to the dataset at SNRs of -3, -6, and -9dB. Vowel segmentation was then carried out on the resulting noisy sentences, and the results of CCA on noisy sentences and segments are shown in table 1. Table 1 shows that two of the noisy speech mixtures, pink and F16, have a much lower correlation than clean speech, which is expected. However, due to the similarities between the incoherent human babble and the target speech, CCA appears to find inaccurate relationships between audio and visual data. In all cases though, there is a significant correlation increase when vowel specific information is used. With the exception of babble, the table shows that segmented speech produces a lower percentage drop in audiovisual correlation when the noise level is increased, showing that our speech segmentation approach is effective for increasing audiovisual speech correlation in noisy environments. It should be noted that these results suggest that our approach functions best in environments where the noise is suitably different from the target speech, such as aircraft or automobiles. In environments dominated by human babble, changing the SNR makes little difference, as similar inaccurate relationships between babble and visual speech features are still found irrespective of the SNR value, explaining the smaller change in correlation produced by babble in table 1. Table 1. Segment and Sentence τ comparison of noisy speech at -3, -6, -9 dB SNR
5 Conclusion
In this paper, we presented work that maximised audiovisual speech correlation by making use of our SAAM approach for automatic visual feature extraction and our vowel based segmentation technique to segment speech. To assess the performance of these techniques, CCA was used to investigate multimodal correlation. It was found that, in noisy environments, segmented speech produced much higher correlation than whole sentences, showing the potential of these techniques for future use as part of an integrated multimodal speech enhancement system, as shown in Fig. 2. This diagram shows the role that the lip tracking, speech segmentation, and CCA transforms discussed in this paper would play in such a speech enhancement system, as well as a proposed audiovisual Wiener filtering approach and the use of a beamformer (tested in previous work by the authors [13]) for pre-processing the noisy audio signal.
References
1. Almajai, I., Milner, B.: Maximising Audio-Visual Speech Correlation. In: AVSP 2007 (2007)
2. Almajai, I., Milner, B., Darch, J., Vaseghi, S.: Visually-Derived Wiener Filters for Speech Enhancement. In: ICASSP 2007, vol. 4, pp. 585–588 (2007)
3. Sargin, M.E., Yemez, Y., Erzin, E., Tekalp, A.M.: Audiovisual Synchronization and Fusion Using Canonical Correlation Analysis. IEEE Trans. on Mult. 9(7), 1396–1403 (2007)
4. Hotelling, H.: Relations between two sets of variates. Biometrika 28, 321–377 (1936)
5. Girin, L., Feng, G., Schwartz, J.L.: Fusion of Auditory and Visual Information For Noisy Speech Enhancement: A Preliminary Study of Vowel Transition. In: ICASSP 1998, vol. 2, pp. 1005–1008 (1998)
6. Ringeval, F., Chetouani, M.: A Vowel Based Approach For Acted Emotion Recognition. In: Proc. Interspeech 2008, pp. 2763–2766 (2008)
7. Sanderson, C.: Biometric Person Recognition: Face, Speech and Fusion. VDM Verlag (2008)
8. Pellegrino, F., André-Obrecht, R.: Automatic Language Identification: An Alternative Approach to Phonetic Modelling. Sig. Proc. 80(7), 1231–1244 (2000)
9. Nguyen, Q.D., Milgram, M.: Semi Adaptive Appearance Models For Lip Tracking. Submitted to ICIP 2009 (2009)
10. Levy, A., Lindenbaum, M.: Sequential Karhunen-Loeve basis extraction and its application to images. Image Proc., IEEE Trans. 9(8), 1371–1374 (2000)
11. Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. Johns Hopkins University Press (1996)
12. Cauwenberghs, G., Poggio, T.: Incremental and Decremental Support Vector Machine Learning. In: NIPS, pp. 409–415 (2000)
13. Cifani, S., Abel, A., Hussain, A., Squartini, S., Piazza, F.: An Investigation into Audiovisual Speech Correlation in Reverberant Noisy Environments. In: Cross-Modal Analysis of Speech, Gestures, Gaze and Facial Expressions. LNCS (2008) (in press)
Visual Context Effects on the Perception of Musical Emotional Expressions
Anna Esposito1,2, Domenico Carbone2, and Maria Teresa Riviello2
1 Seconda Università di Napoli, Dipartimento di Psicologia, and 2 IIASS, Italy
[email protected],
[email protected]
Abstract. Is there any evidence that context plays a role in the perception of the emotional feeling aroused by emotional musical expressions? This work tries to answer this question through a series of experiments where subjects were asked to label as positive or negative a set of emotionally assessed musical expressions played in combination with congruent or incongruent visual stimuli. The influence of context was measured through valence. The results showed that the agreement on valence was always higher when melodies were played without context, suggesting that music alone is more effective in arousing emotional feelings than music combined with either positive or negative visual stimuli. Visual stimuli (either congruent or incongruent) significantly affect the perception of happy and sad melodies, whereas their effects are weaker and not significant for angry and fearful musical expressions. Keywords: Embodiment, visual and auditory perception, emotion, music, context.
1 Introduction
Recent results in social psychology have shown that social information processing involves embodiment, intended here as the mutual influence of the physical environment and the human activities that unfold within it. The underlying idea is that embodiment emerges from the interaction between our sensory-motor systems and the inhabited environment (which includes people as well as objects) and dynamically affects and enhances our reactions, actions, and social perception. Several experimental findings seem to support this idea. For example, Schubert [24] showed that the act of making a fist influenced men's and women's automatic processing of words related to the concept of strength. Similar effects in different contexts have been described by several authors (see [2, 26]), suggesting that the body and the context rule the individual's social conduct as a practical ability to render the world sensible and interpretable in the course of everyday activities. Context interaction (the organizational, cultural, and physical context), therefore, plays a critical role in shaping social conduct, providing a means to interpret and understand individuals' choices, perceptions, and actions. Previous cognitive theories had not accounted for such findings. The popular metaphor about the mind is that cognitive processes (such as inference, categorization, and memory) are independent from their physical instantiations. As a consequence, mental operations are based on amodal representations performed by a central processing
unit that exploits the sensory (input) and motor (output) subsystems for collecting and sending representations of the external world and executing commands [4-5, 12, 16, 21], respectively. Only recently have new cognitive models been proposed that account for embodied knowledge acquisition and embodied knowledge use [3, 27]. The data we are going to illustrate aim to bring further support to these theories, providing results from a series of perceptual experiments that show how the perception of emotional expressions in music can be biased by the effect of the visual context. The proposed experiments are based on the assumption that there exists a small, universally shared set of discrete emotional categories (basic emotions) from which other emotions can be derived [6, 13, 20]. This set of emotional categories includes happiness, anger, sadness, and fear, which can reliably be associated with basic survival problems such as nurturing offspring, earning food, competing for resources, and avoiding and/or facing dangers. In this view, basic emotions are brief, intense and adapted reactions to urgent and demanding survival issues caused by goal-relevant changes in the environment. These reactions require “readiness to act” and “prompting of plans” in order to appropriately handle (under conditions of limited time) environmental changes, producing suitable mental states, physiological changes, feelings, and expressions [11]. The above categorization is, however, debated. Different theories have been proposed for its conceptualization, among them dimensional models [22, 25] that envisage a set of primary features (dimensions) into which emotions can be decomposed and suggest that different combinations of such features can arouse different affective states. Nevertheless, the discrete evolutionary perspective has received support from several sources of evidence, among them (1) the presence of basic emotional expressions in other mammalian species (such as the attachment of infant mammals to their mothers), and (2) the universal exhibition and/or recognition of emotional expressions (such as smiling, amusement, and irritability) by infants, adults, and blind and sighted people, independently of race and culture [17-19]. The present paper accepts the evolutionary perspective on emotion and investigates the ability to interpret emotions in instrumentally-presented melodies when these are presented contextually with a set of images that may arouse a subjective feeling of pleasantness or unpleasantness (valence). The introduction of a valence measure in the proposed experimental set-up does not contradict our evolutionary perspective, since there is general agreement among emotion theorists that basic emotional states can be organized along the dimensions of valence and arousal. Valence is defined as a subjective emotional experience and is considered positive if the subject's environmental conditions and internal states are favourable to goal accomplishment, and negative otherwise [23].
2 Materials and Methods
Two sets of stimuli were used in the experiment: a set of musical stimuli and a set of visual ones. The stimuli were considered representative of happiness, sadness, fear and anger. The musical stimuli consisted of eight 20-second-long musical pieces (two for each of the basic emotions considered), already assessed [1, 10, 14-15] as able to arouse emotional feelings of happiness, sadness, fear, and anger. The happy condition included a selection from the Awakening of happy feelings on arriving in the country from Beethoven's Symphony No. 6 (Pastoral), and the Eine Kleine Nachtmusik, Divertimento n.
36 by Mozart. The sad musical selections were the Adagio for Strings (from Platoon) by Barber and the Adagio from the Piano Concerto No. 2 in C minor by Rachmaninov. The fearful musical pieces were from Alexander Nevsky by Prokofiev and the Concerto for Piano No. 2 Op. 18 in C minor by Rachmaninov. The anger selections were from Beethoven's Symphony No. 9 in D minor Op. 125 and from Alexander Nevsky by Prokofiev. All the musical pieces came from digital recordings. The assessment of the musical pieces in terms of emotional labels was made by a group of 20 adults (10 males and 10 females) with no musical education, who were asked to label the melodies according to the feeling they perceived. No mention was made, and no label was given, of the emotional valence attributed to them. Table 1 displays the percentage of correct agreement on emotional labels and valence. Table 1. Adults' agreement on the musical pieces' emotional labels and valence
Musical Pieces   Adults' Label Accord   Valence
P1 Happy         90%                    100% positive
P2 Happy         100%                   100% positive
P3 Sad           90%                    90% negative
P4 Sad           90%                    90% negative
P5 Fear          83%                    100% negative
P6 Fear          78%                    100% negative
P7 Anger         72.2%                  90% negative
P8 Anger         61.1%                  78% negative
The visual stimuli consisted of 10 colour images, 5 judged to be positive and 5 negative. The negative and/or positive valence was assessed by 6 independent judges (among which the authors). For the 5 identified positive images the judge agreement was 100% on 4 of them and 83% on the fifth, whereas it was 100% on three of the negative images and 83% on the remaining two. 2.2 Procedures Two groups of participants, all with no musical education, each consisting of 38 adults (19 males and 19 females from 18 to 30 years old) were involved in the experiment. One group listened, through earphones, to the musical pieces played by Windows Media Player™ in combination with the positive visual stimuli and the other group the same pieces in combination with the negative ones. Each participant was asked to listen carefully and separately to the 8 melodies (played together with the visual stimuli) and rate each of them either as positive, negative, or neutral. When a positive or negative judgment was expressed, they were further asked to rate it on a 5 scale valence intensity: from 1 (very weak) to 5 (very strong). No mention was made to the visual stimuli played with the melodies. The possible answers were “positive”, “negative” and “I don’t know”. In the present paper only the percentage of agreement on the valence (positive or negative) attributed to the melodies was discussed.
3 Results
Figures 1, 2, 3, and 4 display the percentage of “positive”, “negative” and “I don’t know” answers for each of the two happy, sad, fearful, and angry musical pieces when played either with positive (top) or negative (bottom) visual stimuli or when the music was played alone.
[Figure 1 consists of four bar-chart panels ("Happy 1 with Positive Visual Context", "Happy 2 with Positive Visual Context", "Happy 1 with Negative Visual Context", "Happy 2 with Negative Visual Context"), each plotting the percentage of agreement (0-100%) on the "Positive", "Negative" and "I don't know" answers for the visual-context condition and for the music-only condition.]
Fig. 1. Emotional valence attributed to the happy musical selections when played either with positive or negative visual stimuli or in audio modality only
[Figure 2 consists of four bar-chart panels ("Sad 1 with Positive Visual Context", "Sad 2 with Positive Visual Context", "Sad 1 with Negative Visual Context", "Sad 2 with Negative Visual Context"), each plotting the percentage of agreement (0-100%) on the "Positive", "Negative" and "I don't know" answers for the visual-context condition and for the music-only condition.]
Fig. 2. Emotional valence attributed to the sad musical selections when played either with positive or negative visual stimuli or in audio modality only
[Figure 3 consists of four bar-chart panels ("Fear 1 with Negative Visual Context", "Fear 2 with Negative Visual Context", "Fear 1 with Positive Visual Context", "Fear 2 with Positive Visual Context"), each plotting the percentage of agreement (0-100%) on the "Positive", "Negative" and "I don't know" answers for the visual-context condition and for the music-only condition.]
Fig. 3. Emotional valence attributed to the fear musical selections when played either with positive or negative visual stimuli or in audio modality only
[Figure 4 consists of four bar-chart panels ("Anger 1 with Positive Visual Context", "Anger 2 with Positive Visual Context", "Anger 1 with Negative Visual Context", "Anger 2 with Negative Visual Context"), each plotting the percentage of agreement (0-100%) on the "Positive", "Negative" and "I don't know" answers for the visual-context condition and for the music-only condition.]
Fig. 4. Emotional valence attributed to the angry musical selections when played either with positive or negative visual stimuli or in audio modality only
Table 2. Statistics assessment of the context effects
χ² critical value = 5.99 (α = 0.05)
                 Happy   Sad     Fear   Anger
χ² for Piece 1   13.96   10.34   0.72   4.96
χ² for Piece 2   15.82   6.58    1.42   5.86
χ2 statistics were performed to assess if the answers were significantly different between the two contexts. The results are illustrated in Table 2. The statistics showed that the visual stimuli significantly affect both the perception of happy and sad musical pieces whereas they do not influence the musical expression of fear and anger.
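As an illustration of this kind of test (not the authors' analysis script; the answer counts below are hypothetical and only show the mechanics), a χ² test comparing the answer distributions obtained under the two visual contexts can be run as follows.

```python
from scipy.stats import chi2_contingency

# hypothetical answer counts for one musical piece (rows: visual context,
# columns: "positive", "negative", "I don't know" answers from 38 listeners each)
observed = [[25, 9, 4],    # positive visual context
            [12, 22, 4]]   # negative visual context

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")
# with dof = 2 the critical value at alpha = 0.05 is about 5.99,
# which matches the threshold reported in Table 2
```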
4 Discussion and Conclusions
What clearly appears from the data reported in the above figures is that context plays a role in the perception of the emotional feeling aroused by musical melodies. As can be observed, the agreement on valence was always higher when melodies were played without context, suggesting that music alone is more effective in arousing emotional feelings than music combined with either congruent or incongruent visual stimuli. Visual stimuli (either congruent or incongruent) significantly affect the perception of happy and sad melodies, whereas their effects are weaker and not significant for angry and fearful musical expressions. How this happens is still a matter of speculation. Some recent data reported in the literature have shown that the emotional information conveyed by the combined audio and video channels is less, or at most equally, emotionally informative than that conveyed by the audio or video alone, depending on the language. When the native language is used, audio alone is equally or more emotionally informative than audio and video together, whereas, when a non-native language is used, the same holds for video alone [7-9]. Esposito [7-9] suggested that such nonlinear processing of emotional information is caused by an increase in the cognitive processing load and by language-specific mechanisms. The exploitation of emotional information from a single channel in the bimodal presentation is due to the fact that the subject's cognitive load increases: she/he has to concurrently process gestures, facial and vocal expressions, dynamically evolving temporal information, and the semantics of the message. These concurrent elaborations divert the subject's attention from the perceived emotion. Therefore, in an attempt to reduce the cognitive load, the subject exploits her/his language expertise if the auditory signal is in her/his native language; otherwise, she/he relies on the ability to process visual stimuli if the audio input is in a foreign language. Does this hypothesis also hold for culture-specific musical emotional expressions? The above results seem to support it, but a cross-cultural comparison is needed for an appropriate assessment. Nevertheless, what the above experiments do show is that context plays a role and that cognitive models must account for embodied knowledge acquisition and embodied knowledge use.
Acknowledgements This work has been partially funded by COST 2102 “Cross Modal Analysis of Verbal and Nonverbal Communication”, http://cost2102.cs.stir.ac.uk/ and by Regione Campania, L.R. N.5 del 28.03.2002, Project ID N. BRC1293, Bando Feb. 2006. Acknowledgements go to Miss Tina Marcella Nappi for her editorial help.
References 1. Baumgartner, T., Esslen, M., Lutz Jäncke, L.: From Emotion Perception to Emotion Experience: Emotions Evoked by Pictures and Classical Music. International Journal of Psychophysiology 60, 34–43 (2006) 2. Bargh, J.A., Chen, M., Burrows, L.: Automaticity of Social Behavior: Direct Effects of Trait Construct and Stereotype Activation on Action. Journal of Personality and Social Psychology 71, 230–244 (1996) 3. Barsalou, L.W., Niedenthal, P.M., Barbey, A.K., Ruppert, J.A.: Social Embodiment. In: Ross, B.H. (ed.) The psychology of learning and motivation, vol. 43, pp. 43–92. Academic Press, San Diego (2003) 4. Block, N.: The Mind as the Software of the Brain. In: Smith, E.E., Osherson, D.N. (eds.) Thinking, pp. 377–425. MIT Press, Cambridge (1995) 5. Dennett, D.C.: Content and Consciousness. Humanities Press, Oxford (1969) 6. Ekman, P.: An Argument for Basic Emotions. Cognition and Emotion 6, 169–200 (1992) 7. Esposito, A.: The Perceptual and Cognitive Role of Visual and Auditory Channels in Conveying Emotional Information., Cognitive Computation (2009) http://www.springerlink.com/content/121361 8. Esposito, A.: Affect in Multimodal Information. In: Tao, J., Tan, T. (eds.) Affective Information Processing, pp. 211–234. Springer, Heidelberg (2008) 9. Esposito, A.: The Amount of Information on Emotional States Conveyed by the Verbal and Nonverbal Channels: Some Perceptual Data. In: Stylianou, Y., Faundez-Zanuy, M., Esposito, A. (eds.) COST 277. LNCS, vol. 4391, pp. 249–268. Springer, Heidelberg (2007) 10. Esposito, A., Serio, M.: Children’s Perception of Musical Emotional Expressions. In: Esposito, A., Faundez-Zanuy, M., Keller, E., Marinaro, M. (eds.) COST Action 2102. LNCS (LNAI), vol. 4775, pp. 51–64. Springer, Heidelberg (2007) 11. Frijda, N.H.: Moods, Emotion Episodes, and Emotions. In: Haviland, M., Lewis, J.M. (eds.) Handbook of Emotion, pp. 381–402. Guilford Press, New York (1993) 12. Fodor, J.A.: The modularity of mind. MIT Press, Cambridge (1983) 13. Izard, C.E.: Organizational and Motivational Functions of Discrete Emotions. In: Lewis, M., Haviland, J.M. (eds.) Handbook of Emotions, pp. 631–641. Guilford Press, New York (1993) 14. Nawrot, E.S.: The Perception of Emotional Expression in Music: Evidence from Infants, Children and Adults. Psychology of Music 31(I), 75–92 (2003) 15. Niedenthal, P.M., Setterlund, M.B.: Emotion Congruence in Perception. Personality and Social Psychology Bullettin 20, 401–410 (1993) 16. Newell, A., Simon, H.A.: Human problem solving. Prentice Hall, Oxford (1972) 17. Oatley, K., Jenkins, J.M.: Understanding Emotions, pp. 96–132. Blackwell Publishers, Malden (1996) 18. Panksepp, J.: Emotions as Natural Kinds Within the Mammalian Brain. In: Lewis, J.M., Haviland-Jones, M. (eds.) Handbook of Emotions, 2nd edn., pp. 137–156. Guilford Press, New York (2000) 19. Panksepp, J.: At the Interface of the Affective, Behavioral, and Cognitive Neurosciences: Decoding the Emotional Feelings of the Brain. Brain and Cognition 52, 4–14 (2003) 20. Plutchik, R.: Emotion and their Vicissitudes: Emotions and Psychopatology. In: Haviland, M., Lewis, J.M. (eds.) Handbook of Emotion, pp. 53–66. Guilford Press, New York (1993)
21. Pylyshyn, Z.W.: Computation and cognition: Toward a Foundation for Cognitive Science. MIT Press, Cambridge (1984) 22. Russell, J.A.: A Circumplex Model of Affect. Journal of Personality and Social Psychology 39, 1161–1178 (1980) 23. Russell, J.A.: Core Affect and the Psychological Construction of Emotion. Psychological Review 110, 145–172 (2003) 24. Schubert, T.W.: The Power in Your Hand: Gender Differences in Bodily Feedback from Making a Fist. Personality and Social Psychology Bulletin 30, 757–769 (2004) 25. Schlosberg, H.: Three Dimensions of Emotion. The Psychological Review 61(2), 81–88 (1953) 26. Stepper, S., Strack, F.: Proprioceptive Determinants of Emotional and Nonnemotional Feelings. Journal of Personality and Social Psychology 64, 211–220 (1993) 27. Smith, E.R., Semin, G.R.: Socially Situated Cognition: Cognition in its Social Context. Advances in Experimental Social Psychology 36, 53–117 (2004)
Eigenfeatures and Supervectors in Feature and Score Fusion for SVM Face and Speaker Verification
Pascual Ejarque1, Javier Hernando1, David Hernando2, and David Gómez2
1 TALP Research Center, Department of Signal Theory and Communications, Technical University of Catalonia, Barcelona, Spain
2 Biometric Technologies, S.L., Barcelona, Spain
[email protected],
[email protected],
[email protected],
[email protected]
Abstract. Eigenfaces are the classical features used in face recognition and have commonly been used with classification techniques based on Euclidean distance and, more recently, with Support Vector Machines. In speaker verification, GMMs have been widely used for the recognition task. Lately, the combination of the GMM supervector, formed by the means of the Gaussians of the GMM, with SVM has proved successful. In some works, dimensionality reduction transformations have been applied to the GMM supervectors, using Euclidean distance based classification methods, to obtain eigenvoices. In this paper, eigenvoices are used in an SVM system, and eigenfaces and eigenvoices are combined in a multimodal fusion. In addition, different feature and score normalization techniques are applied before the classification process. The results show that the dimensionality reduction techniques do not improve the error rates provided by the GMM supervector, and that the use of SVM and of multimodal fusion significantly increases the performance of the recognition systems. Keywords: multimodal, GMM supervector, SVM, eigenvoices, equalization.
1 Introduction
Automatic person recognition (APR) by means of biometric modalities like speech, face, fingerprint, iris, etc. can be performed in a four-step process that includes the collection of signals, the extraction of the relevant features for the modality, the computation of a score matching each feature vector with a client model or a template, and the obtaining of a final decision by comparing such a score with a threshold [1]. Multimodal fusion of several biometric modalities combines the information provided by each unimodal modality to obtain a final decision. It has been widely demonstrated that multimodal fusion increases the robustness of the recognition system to noise or signal distortions, and makes it possible to obtain a recognition decision even when one or more of the unimodal biometric decisions cannot be reached [2]. Multimodal fusion is possible at three different levels: the feature extraction level, the matching score level or the decision level [1, 2]. In this work, we perform fusion at the feature and the score levels.
Eigenfaces and fisherfaces are the most classical features used in face recognition and have been commonly used with classification techniques based on Euclidean distance [3]. More recently, Support Vector Machines have also been included as a classification technique for face recognition systems based in eigenfaces and fisherfaces [4]. In speaker recognition, cepstrum or LDA features have been widely used in GMM systems in combination with the Viterbi algorithm. Subsequently, in several works, the GMM supervector, a vector formed with the mean of all the Gaussians of an adapted GMM UBM, has been used as speech feature vector [5, 6]. When the GMM supervector is used, the dimension of the speech pattern becomes fixed and does not depend on the occurrence duration. This characteristic permits the use of these features with classification techniques as SVM [7] and is an advantage to include the speech information in multimodal fusion at the feature level. In addition to this, principal component analysis (PCA) and linear discriminant analysis (LDA) transformations have been applied upon the GMM supervectors to obtain eigenvoices and fishervoices. In these works, the classification is generally performed using Euclidean distance based methods like nearest neighbor classifier [6]. The aim of this work is two-fold. Firstly, eigenvoices and fishervoices will be used in a SVM system. Secondly, the face and speech features and scores will be combined to fuse the face and the speech information at the feature level and at the score level. Another important issue treated in this work is the normalization of the features or scores that is necessary before the classification process [8]. Histogram equalization based techniques are compared with a conventional normalization method. The experiments in this work have been performed upon the recordings of the XM2VTS database according to the LP1 protocol [9]. The paper is organized as follows. In section 2, the state-of-the-art for face and speaker recognition and for multimodal fusion is reviewed. In section 3, the experimental setup and the results are presented. Finally, section 4 contains the conclusions of the work.
2 State-of-the-Art
2.1 Face Recognition: Eigenfaces, Fisherfaces
As face recognition methods such as the nearest neighbor classifier in the image space are computationally expensive and require great amounts of storage, it is natural to pursue dimensionality reduction schemes [10]. A commonly used technique for dimensionality reduction in face recognition is principal component analysis (PCA), which chooses a dimensionality-reducing linear projection that maximizes the scatter of all projected samples. The features obtained from the application of the PCA technique to the image are called eigenfaces. On the other hand, under admittedly idealized conditions, the within-class variation lies in a linear subspace of the image space. One can perform dimensionality reduction using a linear projection and still preserve linear separability. By means of a Linear Discriminant transformation [4, 10] such a linear reduction can be performed, and the resultant features are called fisherfaces.
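As a rough sketch of the eigenface idea (not the specific implementation used in this paper), the following Python code derives a PCA projection from a set of vectorised training faces and uses it to produce low-dimensional eigenface features; 144 components matches the eigenface dimension reported later in the experiments, but is otherwise an arbitrary choice.

```python
import numpy as np

def train_eigenfaces(train_faces, n_components=144):
    """train_faces: (n_images, n_pixels) matrix of vectorised face images."""
    mean_face = train_faces.mean(axis=0)
    centered = train_faces - mean_face
    # right singular vectors of the centered data are the principal directions
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    eigenfaces = vt[:n_components]            # (n_components, n_pixels)
    return mean_face, eigenfaces

def project(face, mean_face, eigenfaces):
    """Eigenface feature vector of a single vectorised face image."""
    return eigenfaces @ (face - mean_face)
```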
For the classification of the feature vectors, face recognition systems usually make use of techniques based on the classical Euclidean distance [3]. Recently, Support Vector Machines (SVM) have been successfully introduced to perform the classification task in face recognition systems [4]. In the first case, the Euclidean distance is computed between the test vectors and a model vector for each occurrence. In the second case, the SVM based technique performs the classification with a separating hyperplane, trained by means of machine learning [11, 12]. Non-linear boundaries are achieved using a specific function, called the kernel function, that maps the data of the input space to a higher dimensional space. In this work, only linear SVMs have been used. The importance of normalizing the data before the SVM fusion process has been shown in several works [2, 13]. For the normalization of the face features, a standard normalization method and a histogram equalization method are used. The standard normalization normalizes the global mean and variance of the features of a unimodal biometric. The normalized scores x_SN are computed as
x_{SN} = \frac{a - \mathrm{mean}(a)}{\mathrm{std}(a)}    (1)
where mean(a) is the statistical mean of the set of scores a and std(a) is its standard deviation. Histogram equalization (HEQ) makes the statistical distribution of some given data match a reference distribution. This technique can be seen as extending the statistical normalization performed by the previous technique from the mean and variance alone to the whole statistics of the biometric modality. For the normalization of the face features, a Gaussian Equalization (GEQ) is used, in which the reference is a normal distribution with the variance set to 1.
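A minimal sketch of such a Gaussian equalization (an illustrative rank-based implementation, not necessarily the exact procedure used by the authors): each value is mapped through its empirical cumulative distribution and then through the inverse CDF of a unit-variance normal distribution.

```python
import numpy as np
from scipy.stats import norm, rankdata

def gaussian_equalization(values):
    """Map a 1-D array of feature values (or scores) onto a N(0, 1) distribution.

    The empirical CDF is estimated from the ranks of the values; the
    (rank - 0.5)/n convention avoids 0 and 1, where norm.ppf diverges.
    """
    values = np.asarray(values, dtype=float)
    u = (rankdata(values) - 0.5) / len(values)   # empirical CDF in (0, 1)
    return norm.ppf(u)                           # inverse Gaussian CDF

def standard_normalization(values):
    """Zero-mean, unit-variance normalization of equation (1)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()
```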
2.2 Speaker Recognition: GMM Supervector, Eigenvoices, Fishervoices
A Gaussian Mixture Model approximates the distribution of the observed features with a Gaussian mixture density function

g(x) = \sum_{i=1}^{N} \lambda_i \, N(x; m_i, \Sigma_i)    (2)
where λi are the mixture weights, N(.) is a Gaussian, mi and Σi are the mean and covariance matrix of the Gaussians, respectively, and N is the total number of Gaussian components. A GMM can be trained for each speech utterance by adapting a global Universal Background Model (UBM). The main advantage of this technique is that the parameters may be robustly estimated with a relatively small amount of training data. In particular, GMMs can be obtained by adapting the mean vectors of the global GMM using Maximum A Posteriori (MAP) criteria [5, 6]. The mixture weights and covariance matrices are retained for simplicity and robustness of parameter estimation. From the adapted model, a GMM supervector can be formed with the means of the GMMs, as shown in figure 1.
[Figure 1 is a block diagram: the input utterance goes through feature extraction and MAP adaptation of the GMM UBM, and the adapted means are stacked into the GMM supervector m = (m_1, m_2, …, m_N).]
Fig. 1. GMM supervector production
Taking into account the high dimensionality that the GMM supervector can reach, the concept of dimensionality reduction applied to eigenfaces and fisherfaces can be generalized to the GMM supervector case. In this way, principal component analysis (PCA) and linear discriminant analysis (LDA) can be applied to the GMM supervector to reduce the feature vector length. A common characteristic of the three sets of features (GMM supervector, eigenvoices and fishervoices) is that the length of the feature vector is fixed. This is an advantage for classification by means of pattern recognition techniques like SVM and for the feature-level fusion of the speech information with other modalities. For the speaker recognition classification process, Euclidean distance with standard normalization and SVM with standard and GEQ normalizations have been used.
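The following sketch (illustrative only; it uses scikit-learn's GaussianMixture as the UBM and is not the authors' implementation) shows the mean-only MAP adaptation and the supervector construction just described; the relevance factor of 16 is the value quoted later in the experimental setup.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_features, n_components=64):
    """Fit a GMM UBM on pooled background MFCC frames (n_frames x n_dims)."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=200, random_state=0)
    ubm.fit(background_features)
    return ubm

def gmm_supervector(ubm, utterance_features, relevance_factor=16.0):
    """MAP-adapt the UBM means to one utterance and stack them into a supervector."""
    gamma = ubm.predict_proba(utterance_features)          # (n_frames, n_components)
    n_i = gamma.sum(axis=0)                                # soft counts per Gaussian
    # first-order statistics: expected feature value under each Gaussian
    e_i = (gamma.T @ utterance_features) / np.maximum(n_i[:, None], 1e-10)
    alpha = n_i / (n_i + relevance_factor)                 # adaptation coefficients
    adapted_means = alpha[:, None] * e_i + (1.0 - alpha[:, None]) * ubm.means_
    return adapted_means.reshape(-1)                       # (n_components * n_dims,)
```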
2.3 Multimodal Fusion at Feature and Score Levels
In this work, multimodal fusion is performed at the feature level and at the score level. In the fusion at the feature level, the features extracted from the face and the speech recordings are concatenated to obtain a joint feature vector. Taking into account the results achieved in the unimodal experiments, a linear SVM with GEQ has been chosen as the recognition technique. When score level fusion is performed, the importance of normalizing the scores before the fusion process has also been shown [2, 8]. For this reason, the Bi-Gaussian equalization introduced by the authors in [13] has been used prior to a linear SVM classification system. In the Bi-Gaussian equalization process the reference distribution is the sum of two Gaussian functions, whose standard deviation σ models the overlap between the genuine and impostor lobes of the original distributions, i.e.,

f_{ref}(\beta) = \frac{1}{2\sigma\sqrt{2\pi}} \left[ e^{-(\beta+1)^2 / 2\sigma^2} + e^{-(\beta-1)^2 / 2\sigma^2} \right]    (3)
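One possible way to implement this bi-Gaussian score equalization is sketched below, under the assumption that, like GEQ, the mapping is done by matching the empirical score CDF to the CDF of the reference distribution of equation (3); the grid limits and the value of σ are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm, rankdata

def bi_gaussian_equalization(scores, sigma=0.3, grid=np.linspace(-4, 4, 2001)):
    """Map raw matching scores onto the two-lobe reference density of equation (3)."""
    # reference pdf: average of two Gaussians centred at -1 (impostor) and +1 (genuine)
    ref_pdf = 0.5 * (norm.pdf(grid, loc=-1.0, scale=sigma)
                     + norm.pdf(grid, loc=+1.0, scale=sigma))
    ref_cdf = np.cumsum(ref_pdf)
    ref_cdf /= ref_cdf[-1]
    # empirical CDF of the input scores, then inverse reference CDF by interpolation
    u = (rankdata(scores) - 0.5) / len(scores)
    return np.interp(u, ref_cdf, grid)
```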
3 Recognition Experiments
3.1 Databases
The experiments in this work were performed on the XM2VTS database [9]. Only the frontal views for each of the 295 subjects were used for face recognition and both,
digit sequences and text sentences, for speaker recognition. The data of each user are organized in 4 sessions of two shots, resulting in 8 face samples and 24 voice signals per user. The LP1 protocol was followed for all the experiments. Speech data from the BANCA database [14] was used for training the speaker recognition UBM.
3.2 Face Recognition
In a first stage, an evaluation of the results of each unimodal modality has been done. As explained in 2.1, eigenfaces and fisherfaces vectors have been extracted for each frontal face sample in the XM2VTS database. The vector dimension of the eigenfaces features is 144 while fisherfaces features are 64. Two different classification methods have been used: a Euclidean distance with no feature normalization, in order to preserve the original eigenfaces and fisherfaces, and a linear SVM with previous standard normalization (SVM) and Gaussian Equalization (GEQ-SVM). Table 1 shows the equal error rate (EER) obtained for each of these combinations. Table 1. Face recognition (Equal Error Rate)
              Euclidean distance   SVM      GEQ-SVM
Eigenfaces    4.10 %               1.50 %   1.50 %
Fisherfaces   2.93 %               1.25 %   1.25 %
For all the tested classification techniques, the use of fisherfaces has outperformed the results obtained with eigenfaces, to a greater extent when the classification is performed with the Euclidean distance. In addition to this, SVM classification performs better than the classical Euclidean distance for both kinds of features and for both normalization techniques, which obtain the same EER.
3.3 Speaker Recognition
For speaker recognition, a Universal Background Model (UBM) has been built using the speech data of the BANCA database. For each input signal, a Voice Activity Detector (VAD) is used in order to discard non-speech segments. Speech frames are then processed in 25 ms segments, Hamming windowed and pre-emphasized. The feature set is formed by a 12th order Mel-Frequency Cepstral Coefficients (MFCC) and the normalized log energy. Cepstral Mean Subtraction (CMS) is also applied. Using speech data from 208 speakers, recorded over 12 sessions, a GMM UBM with 64 mixture components has been trained. The same pre-processing was done to each of the three speech signals in each shot of the XM2VTS before using MAP adaptation to make a speaker dependent GMM. In this step, only the mixture means with a relevance factor of 16 were considered. Following the procedure described in 2.2, a GMM supervector was formed from each adapted model resulting in a feature vector of 832 coefficients. Finally, in a similar way than for face recognition, dimensionality of the feature vector was reduced by means of PCA (eigenvoices) and LDA (fishervoices). The transformation matrix for both techniques was estimated using the impostor speech data in the evaluation set of LP1.
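The speech front end described above can be approximated with the short Python sketch below (an illustration using librosa, not the authors' code); the 10 ms hop and the simple energy-percentile voice activity threshold are assumptions.

```python
import numpy as np
import librosa

def speech_features(signal, sr=16000, frame_len_s=0.025, hop_s=0.010):
    """12 MFCCs + normalized log energy per 25 ms frame, with CMS applied."""
    n_fft, hop = int(frame_len_s * sr), int(hop_s * sr)
    emphasized = librosa.effects.preemphasis(signal)
    mfcc = librosa.feature.mfcc(y=emphasized, sr=sr, n_mfcc=13, n_fft=n_fft,
                                hop_length=hop, window="hamming", center=False)
    frames = librosa.util.frame(emphasized, frame_length=n_fft, hop_length=hop)
    log_energy = np.log(np.sum(frames ** 2, axis=0) + 1e-10)
    log_energy = (log_energy - log_energy.mean()) / (log_energy.std() + 1e-10)
    feats = np.vstack([mfcc[1:, :log_energy.shape[0]], log_energy[None, :]])
    # crude energy-based voice activity detection (an assumption, not the paper's VAD)
    voiced = log_energy > np.percentile(log_energy, 30)
    feats = feats[:, voiced]
    # Cepstral Mean Subtraction, then return one feature vector per frame
    return (feats - feats.mean(axis=1, keepdims=True)).T
```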
Table 2. Voice recognition (Equal Error Rate)
                  Euclidean distance   SVM      GEQ-SVM
GMM supervector   6.46 %               0.53 %   0.50 %
Eigenvoices       6.59 %               1.25 %   1.23 %
Fishervoices      7.40 %               1.50 %   1.00 %
The same classification techniques as in the previous section have been used, except that a standard normalization of the speech features is applied before the Euclidean distance based verification system. Table 2 shows how the use of SVM significantly reduces the EER for all the feature types. Both normalization techniques obtain similar results with SVM, but GEQ obtains the lowest error rates for the three feature sets. In addition to this, the dimensionality reduction produced by eigenvoices and fishervoices does not favor the recognition process.
3.4 Multimodal Evaluation
In order to increase the performance of the unimodal modalities, fusion of face and speech data was done at two levels: score level fusion and feature level fusion. In the fusion at the feature level, the face and voice features were concatenated to obtain a single feature vector. All the features were normalized by means of GEQ and were introduced into an SVM recognition system. The feature level fusion results are shown in Table 3. Table 3. Feature level fusion (Equal Error Rate)
                                GEQ-SVM
Eigenfaces + GMM supervector    0.008 %
Eigenfaces + Eigenvoices        0.25 %
Fisherfaces + GMM supervector   0.026 %
Fisherfaces + Fishervoices      0.25 %
The best result is obtained by the joint use of eigenfaces and the GMM supervector features, and is the minimum EER obtained in the experiments of this article. As in the unimodal case, the use of the whole GMM supervector outperforms the eigenvoices and fishervoices features. It is worth noting that, in the XM2VTS database, while the number of impostors is high enough to give a fine resolution in the false alarm error rate (lower than 0.001%), the minimum step for the miss probability is 0.25% (LP1 uses 400 client test samples). The results in Table 3 therefore correspond to only one or zero miss errors. For score level fusion, scores were obtained from the GEQ-SVM classifier of each unimodal modality and then introduced into a third, BGEQ-SVM system with two inputs: the face and the speech scores. The results are shown in Table 4.
Table 4. Score level fusion (Equal Error Rate)
                                BGEQ-SVM
Eigenfaces + GMM supervector    0.019 %
Eigenfaces + Eigenvoices        0.25 %
Fisherfaces + GMM supervector   0.017 %
Fisherfaces + Fishervoices      0.25 %
The scores of the eigenfaces and fisherfaces systems in combination with the scores of the GMM supervector system achieve similar results. The results obtained with the “fisherfaces + GMM supervector” score fusion system outperform those obtained in the corresponding feature fusion case. Finally, the best overall results are obtained with the “eigenfaces + GMM supervector” combination.
4 Conclusions In this paper, in the same manner that eigenfaces and fisherfaces are obtained for face recognition, PCA and LDA transformations have been applied upon GMM supervectors to obtain eigenvoices and fishervoices. The face and speech features have been used in the corresponding unimodal recognition systems based on the Euclidean distance and in SVM. In addition to this, the features introduced in the SVM system have been normalized before the classification process with classical normalization and histogram equalization techniques. Furthermore, feature level and score level fusion methods have been developed to obtain multimodal results. The results show that the use of SVM reduces the error rates for both face and speaker unimodal recognition systems. On the other hand, in the speaker recognition system, the dimensionality reduction techniques do not improve the error rates provided by the GMM supervector. The multimodal fusion significantly increases the performance of the recognition systems and the best result is obtained for the feature level fusion of eigenfaces and GMM supervectors. Acknowledgments. This work has been supported by the Spanish Government projects TSI-020100-2008-537 and TEC2007-65470. The authors would like to thank to the Instituto Tecnológico de Informática in Valencia for their contribution with the face recognition features.
Facial Expression Recognition Using Two-Class Discriminant Features

Marios Kyperountas 1 and Ioannis Pitas 1,2

1 Department of Informatics, Aristotle University of Thessaloniki, Greece
2 Informatics and Telematics Institute, CERTH, Greece
{mkyper,pitas}@aiia.csd.auth.gr
Abstract. This paper presents a novel facial expression recognition methodology. In order to classify the expression of a test face to one of seven predetermined facial expression classes, multiple two-class classification tasks are carried out. For each such task, a unique set of features is identified that is enhanced, in terms of its ability to help produce a proper separation between the two specific classes. The selection of these sets of features is accomplished by making use of a class separability measure that is utilized in an iterative process. Fisher’s linear discriminant is employed in order to produce the separation between each pair of classes and train each two-class classifier. In order to combine the classification results from all two-class classifiers, the ‘voting’ classifier-decision fusion process is employed. The standard JAFFE database is utilized in order to evaluate the performance of this algorithm. Experimental results show that the proposed methodology provides a good solution to the facial expression recognition problem. Keywords: facial expression classification, two-class discriminant analysis, facial features, feature extraction.
1 Introduction

In recent years, developing facial expression recognition (FER) technology has received great attention [1, 2]. In the FER problem, the true match to the expression of a test face, out of a number of C different pre-determined facial expressions, is sought. This type of non-verbal communication is useful when developing automatic and, in some cases, real-time human-centered interfaces, where the face plays a crucial role [3]. Examples of applications that use FER are facial expression cloning in virtual reality applications, video-conferencing, and user profiling, indexing, and retrieval from image and video databases. Facial expressions play a very important role in human face-to-face interpersonal interaction [4]. In fact, facial expressions represent a direct and naturally preeminent means of communicating emotions [5]. Recently, various methods have attempted to solve the FER problem. In [6], two hybrid FER systems are proposed that employ the 'one-against-all' classification strategy. The first system decomposes the facial images into linear combinations of several basis images using Independent Component Analysis (ICA). Subsequently,
the corresponding coefficients of these combinations are fed into Support Vector Machines (SVMs) that carry out the classification process. The second system performs feature extraction via a set of Gabor Wavelets (GWs). The resulting features are then classified using CSC, MCC, or SVMs that employ various kernel functions. The method in [7] uses Supervised Locally Linear Embedding (SLLE) to perform feature extraction. Then, a minimum-distance classifier is used to classify the various expressions. SLLE computes low-dimensional, neighborhood-preserving embeddings of high-dimensional data and is used to reduce data dimension and extract features. The basic idea of LLE is the global minimization of the reconstruction error of the set of all local neighbors in the data set. This technique expects the construction of a local embedding from a fixed number of nearest neighbors to be more appropriate than from a fixed subspace. The Supervised-LLE algorithm uses class label information when computing neighbors to improve the performance of classification. The work in [8] introduces the ICA-FX feature extraction method that is based on ICA and is supervised in the sense that it utilizes class information for multi-class classification problems. Class labels are incorporated into the structure of standard ICA by being treated as input features, in order to produce a set of class-label-relevant and a set of class-label-irrelevant features. The learning rule being used applies the stochastic gradient method to maximize the likelihood of the observed data. Then, only class-label-relevant features are retained, thus reducing the feature space dimension in line with the principle of parsimony. This improves the generalization ability of the nearest-neighbor classifier that is used to perform FER.

This paper presents a novel FER methodology which attempts to classify any facial image into one of the following C = 7 basic [9] facial expression classes: happiness (E1), sadness (E2), anger (E3), fear (E4), surprise (E5), disgust (E6), and the neutral state (E7). To do so, proper and unique sets of features are identified for each pair of classes. The selected features are concatenated to produce the Enhanced Feature Vectors (EFVs). This is done individually for all C(C − 1)/2 pair-wise comparisons between the C facial expression classes. Initially, features are extracted by convolving the facial images with a set of 2-D Gabor filters of different scales and orientations. Then, a class separability measure is utilized in order to select the proper subset of features for each distinct pair of classes. Then, two-class Linear Discriminant Analysis (LDA) is applied to the EFVs in order to produce the Discriminant Hyper-planes (DHs), which are essentially projections onto which large class separations are attained. The DHs are used to train the C(C − 1)/2 two-class classifiers, and the corresponding two-class separations are measured. Next, the 'voting' [10] classifier-decision fusion process is employed to produce the final classification decision. This completes the proposed EFV-Classification (EFV-C) FER framework.
2 Producing Enhanced Feature Vectors

This section presents the feature extraction process and the iterative process that is utilized in order to produce the subsets of enhanced features that compose the EFVs.
2.1 Gabor-Based Feature Extraction

Initially, a feature set that contains M features is extracted from each training facial image. These M features correspond to the image being convolved with M 2-D Gabor filters of different scales and orientations. A 2-D Gabor filter is produced by modulating a complex exponential by a Gaussian envelope, and the direction of oscillation can be set to any angle in the 2-D Cartesian plane. Thus, a filter is produced with local support that is used to determine the image's oscillatory component in a particular direction at a particular frequency. This is particularly useful for FER since different facial expressions (e.g. happiness vs. disgust, or neutral) produce these components at different directions and/or frequencies. A complex-valued 2-D Gabor function can be defined as [10]:
\[
\Psi(\mathbf{k},\mathbf{x}) \;=\; \frac{\mathbf{k}^{2}}{\sigma^{2}}\,
\exp\!\left(-\frac{\mathbf{k}^{2}\mathbf{x}^{2}}{2\sigma^{2}}\right)
\left[\exp(j\,\mathbf{k}\cdot\mathbf{x}) \;-\; \exp\!\left(-\frac{\sigma^{2}}{2}\right)\right]
\qquad (1)
\]
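For concreteness, the following is a minimal NumPy sketch of the complex 2-D Gabor function of Eq. (1) and of a small filter bank built from the scales k_i = π/2^i and evenly spaced orientations that are described next. The kernel size and σ = 2π are assumptions chosen only for illustration.

```python
import numpy as np

def gabor_kernel(k_mag, theta, size=31, sigma=2 * np.pi):
    """Complex 2-D Gabor kernel of Eq. (1) with wave-vector magnitude k_mag and angle theta."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    kx, ky = k_mag * np.cos(theta), k_mag * np.sin(theta)
    envelope = (k_mag ** 2 / sigma ** 2) * np.exp(-(k_mag ** 2) * (x ** 2 + y ** 2) / (2 * sigma ** 2))
    carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2.0)
    return envelope * carrier

def gabor_bank(Ms=4, Mo=6, size=31):
    """Ms*Mo kernels: scales k_i = pi / 2**i, orientations evenly spaced in [0, pi)."""
    return [gabor_kernel(np.pi / 2 ** i, o * np.pi / Mo, size)
            for i in range(1, Ms + 1) for o in range(Mo)]
```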
To produce M = M_s · M_o different Gabor functions, let us assume that M_s different scales and M_o different orientations are investigated. The different scales can be obtained by setting k_i = π/2^i, where i = 1, …, M_s. The different angular orientations can be obtained by selecting M_o angles between 0 and 180 degrees.

2.2 Class Separability Measure for Feature Selection

Next, a combination of the N most useful features, out of the M total, is selected when the task at hand is to discriminate between a specific pair of facial expression classes. Since different facial expressions produce more oscillatory components at particular directions and frequencies, it is expected that, for a given pair of classes, certain Gabor features can produce a larger class separation than the rest of the features can. In total, there exist C(C − 1)/2 = 21 distinct pair-wise class combinations for the C = 7 facial expression classes: E1-E2, E1-E3, E1-E4, E1-E5, E1-E6, E1-E7, E2-E3, E2-E4, E2-E5, E2-E6, E2-E7, E3-E4, E3-E5, E3-E6, E3-E7, E4-E5, E4-E6, E4-E7, E5-E6, E5-E7, and E6-E7. Thus, the feature selection process presented next creates 21 sets of enhanced features. First, the M features are converted to M 1-D vectors via row-concatenation, f_i, i = 1, …, M. Let us assume that we need to classify between a specific pair E_x and E_y. In order to select the subset of the N most useful feature vectors, where N < M, a class separability measure that is based on the maximum value of Fisher's criterion is employed. For our purposes, this is a suitable measure since the discriminant hyper-planes that we later produce to train each two-class classifier stem from Fisher's criterion. When examining the i-th feature vector, this separability measure is defined as:
\[
J^{\max}_{E_{x,y}}(i) \;=\; J\!\left(\mathbf{w}_{0,E_{x,y},i}\right) \;=\;
\frac{\left(\mu_{0,E_x,i} - \mu_{0,E_y,i}\right)^{2}}
     {\sigma^{2}_{0,E_x,i} + \sigma^{2}_{0,E_y,i}},
\qquad (2)
\]
where μ_{0,E_x,i} and μ_{0,E_y,i} denote the sample mean and σ²_{0,E_x,i} and σ²_{0,E_y,i} the sample variance of the training feature vectors of classes E_x and E_y, respectively, when projected to the subspace defined by w_{0,E_{x,y},i}. The discriminant vector w_{0,E_{x,y},i} is given by [11]

\[
\mathbf{w}_{0,E_{x,y},i} \;=\; S^{-1}_{W,E_{x,y},i}\left(\mathbf{m}_i^{E_x} - \mathbf{m}_i^{E_y}\right),
\qquad (3)
\]
where m_i^{E_x} and m_i^{E_y} denote the sample mean of the feature vectors of classes E_x and E_y, respectively, for the i-th feature. Moreover, S_{W,E_{x,y},i} is the within-class scatter matrix for the i-th feature, and is defined as

\[
S_{W,E_{x,y},i} \;=\;
\sum_{\mathbf{f}^i_j \in E_x} \left(\mathbf{f}^i_j - \mathbf{m}_i^{E_x}\right)\left(\mathbf{f}^i_j - \mathbf{m}_i^{E_x}\right)^{\mathsf{T}}
\;+\;
\sum_{\mathbf{f}^i_j \in E_y} \left(\mathbf{f}^i_j - \mathbf{m}_i^{E_y}\right)\left(\mathbf{f}^i_j - \mathbf{m}_i^{E_y}\right)^{\mathsf{T}},
\qquad (4)
\]

where j indicates the class (either E_x or E_y) to which the i-th feature vector, f^i_j, belongs. Using (2), we now have a class separability measure that indicates how useful each of the M features is. However, it is not sufficient to select the N best features as the ones that produce the N largest values of this separability measure. This is because each EFV, which is comprised of the N features, is subsequently processed by two-class LDA to produce the discriminant hyper-plane. The concatenation of the N selected feature vectors to produce one large column vector, the EFV, is as follows:

\[
\mathbf{f}_j^{EFV:E_{x,y}\,\mathsf{T}} \;=\;
\left[\,\mathbf{f}_j^{i(1)\,\mathsf{T}}, \ldots, \mathbf{f}_j^{i(N)\,\mathsf{T}}\,\right],
\qquad (5)
\]

where i ⊂ {1, …, M}^N, and f_j ∈ E_x or f_j ∈ E_y.
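The following is a minimal sketch of how the separability measure of Eqs. (2)-(4) and the EFV concatenation of Eq. (5) could be computed with NumPy; the small ridge added to the scatter matrix and the function names are assumptions made for the sketch, not part of the paper.

```python
import numpy as np

def fisher_separability(feats_x, feats_y, eps=1e-6):
    """Eqs. (2)-(4): per-feature FLD direction and the resulting separability J.
    feats_x, feats_y: arrays of shape (n_samples, dim) holding the i-th feature
    vectors of classes Ex and Ey. A small ridge (eps) regularizes S_W (assumed)."""
    m_x, m_y = feats_x.mean(axis=0), feats_y.mean(axis=0)
    Sw = ((feats_x - m_x).T @ (feats_x - m_x) +
          (feats_y - m_y).T @ (feats_y - m_y))                       # Eq. (4)
    w0 = np.linalg.solve(Sw + eps * np.eye(Sw.shape[0]), m_x - m_y)  # Eq. (3)
    px, py = feats_x @ w0, feats_y @ w0                              # 1-D projections
    return (px.mean() - py.mean()) ** 2 / (px.var() + py.var())      # Eq. (2)

def build_efv(selected, sample_feats):
    """Eq. (5): concatenate the N selected feature vectors of one sample."""
    return np.concatenate([sample_feats[i] for i in selected])
```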
As a result, notions such as linear dependency between the feature vectors should be taken into account when selecting the N best features. For example, if two feature vectors are linearly dependent, or close to being linearly dependent, then the selection of both these vectors, rather than only one of them, would not provide any additional benefit to the discriminant ability of the hyper-plane being produced to train the two-class classifier. For this reason, an iterative feature selection process that is again based on the class separability measure (2) is developed in order to define the group of feature vectors that should compose each pair of EFVs, for all two-class problems. Specifically, the separability measure is not applied independently to each feature but, rather, to groups of features, in order to identify the feature combination that produces the largest class separation.
2.3 Creating EFVs to Produce DHs

The following feature selection methodology is applied in order to construct the C(C − 1)/2 DHs, where each hyper-plane is associated with two realizations (one per class) of N selected features. The first feature vector to be selected is the one that produces the maximum J^max_{E_{x,y}} value, out of all the original M feature vectors, when attempting to discriminate between the two facial expression classes E_x and E_y. Subsequently, each feature vector to be selected next is identified by creating groups of features in vector form, i.e. f_group, where each group contains the feature vectors that were previously selected and a new candidate feature vector. A candidate feature vector is simply a feature vector that has not yet been selected as being one of the N vectors that compose the EFV. In general, if this is the i-th feature vector to be selected, then M − i + 1 distinct groups of features are created:

\[
\mathbf{f}^{\,j}_{group_i} \;=\;
\left[\, \mathbf{f}^{\mathsf{T}}_{selected_{1,\ldots,i-1}},\; \mathbf{f}^{\,j\,\mathsf{T}}_{candidate} \,\right]^{\mathsf{T}},
\qquad j = 1, \ldots, M - i + 1.
\qquad (6)
\]
Next, for each group of features in (6), the corresponding FLD hyper-plane that is used to discriminate between the facial expression classes E_x and E_y is produced:

\[
\mathbf{w}_{0,E_{x,y},group^{j}_{i}} \;=\;
S^{-1}_{W,E_{x,y},group^{j}_{i}}
\left(\mathbf{m}_{E_x,group^{j}_{i}} - \mathbf{m}_{E_y,group^{j}_{i}}\right),
\qquad (7)
\]
where group^j_i indicates that this expression only uses the group of features that are currently under consideration. Then the value of the corresponding separability measure for this group of features is calculated via (2). The selected feature is set to be the one whose corresponding group produces the maximum value of this separability measure, i.e. J^max_{E_{x,y},group_i}. To select all N feature vectors, this process is iterated N times, and at its completion the N selected feature vectors are concatenated to form the EFV of each class, as (5) indicates. The two EFVs that correspond to classes E_x and E_y are also related to a specific DH, via (7). By using this feature selection process, the two-class LDA algorithm can potentially evade problems relating to non-linear class separability. This is because multiple combinations of groups of features are examined and the group that produces the largest class separation J^max_{E_{x,y},group_i} is selected. Since the separability value is based on Fisher's criterion, it is expected that a combination of features that would form a non-linear separation between the classes would produce a small J_{E_{x,y},group^j_i} value; thus, this combination of features would be rejected. For the same reason, each EFV that is produced should not contain features that are, or are close to being, linearly dependent.
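A possible sketch of the greedy selection loop described in this subsection follows: each remaining candidate feature is appended to the already selected group as in Eq. (6), the group separability is evaluated through Eqs. (7) and (2), and the best candidate is kept. It reuses the hypothetical fisher_separability() helper from the previous sketch and is not the authors' implementation.

```python
import numpy as np

def select_efv_features(feats_x, feats_y, N=4):
    """Greedy selection of Sec. 2.3. feats_x[i], feats_y[i]: (n_samples, dim)
    arrays of the i-th feature for classes Ex and Ey; returns the N chosen
    feature indices. Reuses fisher_separability() from the previous sketch."""
    M = len(feats_x)
    selected = []
    for _ in range(N):
        best_j, best_score = None, -np.inf
        for j in range(M):
            if j in selected:
                continue                       # only candidates not yet selected
            group = selected + [j]             # Eq. (6): previous picks + candidate
            gx = np.hstack([feats_x[i] for i in group])
            gy = np.hstack([feats_y[i] for i in group])
            score = fisher_separability(gx, gy)   # Eqs. (7) and (2) on the group
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```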
3 Integrating the Classification Results

Let us assume that a two-class classifier needs to produce a decision on whether the expression of a test image r should be assigned to either the facial expression class E_x or E_y. To do so, the test image and the two class means are projected onto the discriminant hyper-plane, w_{0,E_{x,y}}. Then, the L2 norm can be utilized to calculate the distance between the projected r and the two projected class means. Subsequently, r is assigned to the class associated with the smallest of the two distances. To produce the final classification decision, i.e. determine which of the C facial expression classes r belongs to, results from all C(C − 1)/2 two-class classifiers need to be integrated. To do so, the widely used voting scheme is utilized, where the winning class for each two-class problem receives a vote and the class that accumulates the most votes is set as the best match to the expression of the test face [10].
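The following sketch illustrates the two-class decisions and the voting fusion just described; the way the trained pair-wise models and the per-pair test EFVs are stored is an assumption of the sketch.

```python
import numpy as np
from collections import Counter
from itertools import combinations

def classify_by_voting(test_efvs, pair_models, num_classes=7):
    """pair_models maps a class pair (x, y) to (w0, mean_x, mean_y): the trained
    DH and the two projected class means; test_efvs maps the same pair to the
    test image's EFV for that pair (both storage layouts are assumptions)."""
    votes = Counter()
    for (x, y) in combinations(range(num_classes), 2):
        w0, mean_x, mean_y = pair_models[(x, y)]
        r = float(test_efvs[(x, y)] @ w0)            # projection onto the DH
        # assign to the class whose projected mean is closer (L2 on the 1-D projection)
        votes[x if abs(r - mean_x) <= abs(r - mean_y) else y] += 1
    return votes.most_common(1)[0][0]                # class with the most votes
```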
4 Experimental Results

In this section, the performance of the proposed EFV-C method is evaluated and compared against contemporary state-of-the-art FER methods. The JAFFE [10] facial expression database, which contains images captured at disjoint temporal instances, has been extensively used when evaluating the classification performance of spatial facial expression algorithms. Hence, our method, as well as the spatial FER methods of [6, 7, and 8] that it is compared against, is evaluated on the JAFFE database. A simple preprocessing step is applied to the JAFFE images before performing FER. Each face is manually cropped by taking as reference the hairline, the left and right cheek and the chin of each face. Next, the average ratio between the vertical and horizontal dimensions of all the cropped images was calculated to be 1.28 and used to resize/down-sample the cropped images to 50 × 39 pixels using bicubic interpolation. To experimentally evaluate the proposed method, we set M_o = 6 and M_s = 4, which results in obtaining M = 24 2-D Gabor features that correspond to 6 different orientations and 4 different scales. Furthermore, for the enhanced-feature selection process, we set N = 4, so each EFV is comprised of 4 selected Gabor features, out of the 24 total that are extracted. The testing protocol that is used to evaluate the FER algorithms is the common 'leave-one-sample-out' evaluation strategy [6, 7, and 8]. During each run of this strategy, one specific image is selected as the test data, whereas the remaining images are used to train the classification system. The strategy makes maximal use of the available data for training. This process is repeated 213 times so that every image in the database serves as the test set once. Then, the 213 classification results are averaged and a more statistically significant result, the final FER rate, is produced. The FER rate of the EFV-C algorithm is calculated to be 95.11%. Table 1 summarizes the results for all competing methods and shows that EFV-C competes well with the state-of-the-art solutions.
Table 1. Classification performance of various facial expression recognition methods
  Method           Leave-one-sample-out FER rate
  GWs+SVMs [6]     90.34%
  SLLE [7]         92.90%
  ICA-FX [8]       94.97%
  EFV-C            95.11%
5 Conclusion

A FER methodology that produces expression-pair-specific features is proposed and its performance is evaluated. These enhanced features are utilized by a two-class discriminant analysis process in order to train all two-class classifiers. The EFV-C methodology was tested on the well-established JAFFE database under the common leave-one-out evaluation strategy. Results indicate that it provides a good solution to the FER problem by producing a classification rate of 95.11%, and that it compares well with state-of-the-art methods. It is anticipated that the performance of other FER methods can be enhanced by utilizing processes that stem from this framework in order to produce high-quality features.

Acknowledgments. The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no 211471 (i3DPost) and COST Action 2101 on Biometrics for Identity Documents and Smart Cards.
References 1. Fasel, B., Luettin, J.: Automatic facial expression analysis: a survey. Pattern Recognition 36(1), 259–275 (2003) 2. Pantic, M., Rothkrantz, J.M.: Automatic Analysis of Facial Expressions: The State of the Art. IEEE Trans. Pattern Analysis and Machine Intelligence 22(12), 1424–1445 (2000) 3. Pandzic, I.S., Forchheimer, R. (eds.): MPEG-4 Facial Animation. Wiley, New York (2002) 4. Pantic, M., Patras, I.: Dynamics of facial expression: Recognition of facial actions and their temporal segments from face profile image sequences. IEEE Trans. on Systems, Man, and Cybernetics-Part B: Cybernetics 36(2), 433–449 (2006) 5. Keltner, D., Ekman, P.: Facial expression of emotion. In: Lewis, M., Haviland-Jones, J.M. (eds.) Handbook of Emotions, pp. 236–249. Guildford, New York (2000) 6. Buciu, I., Kotropoulos, C., Pitas, I.: ICA and Gabor representation for facial expression recognition. In: Proc. IEEE Int. Conf. on Image Processing, Barcelona, Spain, September 14-17, vol. 2(3), pp. 855–858 (2003) 7. Liang, J.Y., Zheng, Z., Chang, Y.: A facial expression recognition system based on supervised locally linear embedding. Pattern Recognition Letters 26(15), 2374–2389 (2005)
8. Kwak, N.: Feature extraction for classification problems and its application to face recognition. Pattern Recognition 41(5), 1718–1734 (2008) 9. Ekman, P., Friesen, W.V.: Emotion in the Human Face. Prentice-Hall, Englewood Cliffs (1975) 10. Lyons, M.J., Budynek, J., Akamatsu, S.: Automatic classification of single facial images. IEEE Trans. on Pattern Analysis and Machine Intelligence 21(12), 1357–1362 (1999) 11. Kyperountas, M., Tefas, A., Pitas, I.: Weighted piecewise LDA for solving the small sample size problem in face verification. IEEE Trans. on Neural Networks 18(2), 506–519 (2007)
A Study for the Self Similarity Smile Detection

David Freire, Luis Antón, and Modesto Castrillón

SIANI, Universidad de Las Palmas de Gran Canaria, Spain
Abstract. Facial expression recognition has been the subject of much research in the last years within the Computer Vision community. The detection of smiles, however, has received less attention. Its distinctive configuration may pose fewer problems than other, at times subtle, expressions. On the other hand, smiles can still be very useful as a measure of happiness, enjoyment or even approval. Geometrical or local-based detection approaches like the use of lip edges may not be robust enough, and thus researchers have focused on applying machine learning to appearance-based and self-similarity descriptors. This work makes an extensive experimental study of smile detection, testing Local Binary Patterns (LBP) combined with a self-similarity descriptor (LAC) as the main image representations, along with the powerful Support Vector Machines classifier. Results show that error rates can be acceptable and that the self-similarity approach for the detection of smiles is suitable for real-time interaction, although there is still room for improvement.
1 Introduction

It is now known that emotions play a significant role in human decision making processes [14]. The ability to show and interpret emotions is therefore also important for human-machine interaction. In this context face analysis is currently a topic of intensive research within the Computer Vision community. Facial expression recognition research has studied geometry-based features [3], appearance [1] and hybrid approaches [7]; see [11] for a survey. Commercial products that are able to perform expression recognition in real time are currently available. Potential applications include evaluation of behavior, human-robot interaction, intelligent tutoring systems, perceptual user interfaces, etc. Some facial expressions can be very subtle and difficult to recognize even between humans. Besides, in human-computer interaction the range of expressions displayed is typically reduced. In front of a computer, for example, subjects rarely display accentuated surprise or anger expressions as they would when interacting with another human subject. The human smile is a distinct facial configuration that could be recognized by a computer with greater precision and robustness. Besides, it is a significantly useful facial expression, as it allows one to sense happiness or enjoyment and even approval (and also the lack of them) [8]. As opposed to facial expression recognition, smile detection research has produced less literature. Lip edge features and a perceptron were used in [10]. The lip zone is obviously the most important, since human smiles involve mainly
the Zygomatic muscle pair, which raises the mouth ends. Edge features alone, however, may be insufficient. We present an image descriptor based on self-similarities which is able to capture the general structure of an image. Computed descriptors are similar for images with the same layout, even if textures and colors are different. Similarly to [16], images are partitioned into smaller cells which, conveniently compared with a patch located at the image center, yield a vector of values that describes local aspect correspondences (LAC). This paper makes an extensive experimental study of the smile detection problem, and is organized as follows. Section 2 describes the codification algorithms used for the experiments. The different classification approaches used in the study are briefly presented in Section 3. The experimental results and conclusions are described in Sections 4 and 5 respectively.

Fig. 1. The basic version of the Local Binary Pattern computation (c) and the Simplified LBP codification (d)
2 Representation

The Local Binary Pattern (LBP) is an image descriptor commonly used for classification and retrieval. Introduced by Ojala et al. [13] for texture classification, it is characterized by invariance to monotonic changes in illumination and by a low processing cost. Given a pixel, the LBP operator thresholds the circular neighborhood within a distance by the pixel gray value, and labels the center pixel considering the result as a binary pattern. The basic version considers the pixel as the center of a 3 × 3 window and builds the binary pattern based on the eight neighbors of the center pixel, as shown in Figure 1-c. The LBP definition can be easily extended to any radius R, considering P neighbors [13]. Rotation invariance is achieved in the LBP-based representation by considering the local binary pattern as circular. More recently, LBPs have been used to describe facial appearance. Once the LBP image is obtained, most authors apply a histogram-based representation approach [15]. However, as pointed out by some recent works, the histogram-based representation loses relative location information [15,17]; thus LBP can also be used as a preprocessing method. Using LBP as a preprocessing method has the effect of emphasizing edges and noise. To reduce the noise influence, Tao et al. [17] recently proposed a modification of the basic version of the local binary pattern computation. Instead of weighting
the neighbors differently, their weights are all the same, obtaining the so-called Simplified LBPs, see Figure 1-d. Their approach has shown some benefits when applied to facial verification, due to the fact that by simplifying the weights the image becomes more robust to illumination changes, having a maximum of nine different values per pixel. The total number of local patterns is largely reduced, so the image has a more constrained value domain. In the experiments described in Section 4, both approaches will be adopted, i.e. using the histogram-based approach, but also using Uniform LBP and Simplified LBP as a preprocessing step.

Raw face images are highly dimensional. A classical technique applied for face representation to avoid the consequent processing overload problem is Principal Components Analysis (PCA) decomposition [12]. PCA decomposition is a method that reduces data dimensionality, without a significant loss of information, by performing a covariance analysis between factors. As such, it is suitable for highly dimensional data sets, such as face images. A normalized image of the target object, i.e. a face, is projected into the PCA space. The appearance of the different individuals is then represented in a space of lower dimensionality by means of a number of the resulting coefficients, v_i [18].

We also present an image descriptor based on self-similarities which is able to capture the general structure of an image. Computed descriptors are similar for images with the same layout, even if textures and colors are different, similarly to [16]. Images are partitioned into smaller cells which, conveniently compared with a patch located at the image center, yield a vector of comparison results that describes local aspect correspondences (LAC). A LAC descriptor is computed from a square-shaped image subdivided into n × n cells, where each cell corresponds to an m × m pixel image patch. The number of cells and their pixel size have an effect on how much an image structure is generalized. A low number of cells will not capture many structural details, while too many small cells will produce an overly detailed descriptor. The present approach will consider overlapping cells, which may be required to capture subtle structural details. Once an image is partitioned, an m × m patch located in the exact image center (which does not have to correspond to a cell in the image partition) is compared with all partition cells. In order to achieve greater generalization, image patches are compared by computing the Sum of Squared Differences (SSD) between pixel values (or the Sum of Absolute Differences (SAD), which is computationally less expensive). Each cell-center comparison is consecutively stored in an n × n-dimensional LAC descriptor vector. Such a description overcomes color, contrast and texture differences. Images are described in terms of their general structure, similarly to [16]. An image showing a white upper half and a black lower half will produce exactly the same descriptor as an image showing a black upper half and a white lower half. Local aspect correspondences are exactly the same: the upper half is different from the lower half. Rotations, however, are not considered. LAC descriptors are especially useful to describe points defined by a scale salient point detector (like DoG or SURF [2]).
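Before describing how LAC is applied here, the following is a minimal sketch of the basic and Simplified LBP codings introduced at the beginning of this section; the threshold convention (neighbor ≥ center) and the zeroing of border pixels are assumptions of the sketch.

```python
import numpy as np

def lbp_3x3(img, simplified=False):
    """Basic 3x3 LBP: threshold the 8 neighbors against the center pixel.
    Basic LBP weights the bits by powers of two; Simplified LBP gives every
    neighbor the same weight, so each pixel takes one of nine values (0..8).
    Border pixels are left at zero for brevity."""
    img = img.astype(np.int32)
    out = np.zeros_like(img)
    # neighbor offsets around the center, clockwise from the top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = np.roll(np.roll(img, -dy, axis=0), -dx, axis=1)
        mask = (neighbor >= img).astype(np.int32)
        out += mask if simplified else (mask << bit)
    out[0, :] = out[-1, :] = 0
    out[:, 0] = out[:, -1] = 0
    return out
```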
In the present work, however, they are applied to classify mouths found by a face detector [5] into smiling or non-smiling gestures. Smiling mouths look similar no matter the skin color or the presence of facial hair. This generality can be registered by a self-similarity descriptor like LAC. However, images containing smiling mouths require local brightness to be preserved: teeth are always brighter than the surrounding skin and that must be captured by the descriptor. Thus, instead of using SSD, patches are compared using the Sum of Differences. Otherwise, a closed mouth would produce the same descriptor as a smiling mouth: lips are surrounded by differently colored skin, exactly as teeth are surrounded by differently colored lips. Figure 2 shows an example with an 11 × 11 cell partition, each cell sized 10 × 10 pixels. The LAC descriptor is shown as a barcode for representation purposes. Thus, given an input image (i.e. a scale salient point or a known region like a detected mouth), a number of cells n and their pixel size m, LAC is computed as follows:

1. The image is resized to a template sized (n × m) × (n × m) pixels.
2. The template is partitioned into n × n cells, each of them sized m × m pixels.
3. A central patch sized m × m pixels is captured from the center of the template image.
4. The central patch is compared with each template cell, and each result is consecutively stored in the n × n LAC descriptor vector.

In order to tell whether two images have a similar structure, their corresponding LAC descriptors can be compared by computing the SAD between both vectors. However, given that the present work aims at classifying mouth images into two categories, a Support Vector Machine approach is used.

Fig. 2. LAC Descriptor example using an 11x11 partition. The barcode-like vector represents all comparisons between each cell and the central patch.
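A possible NumPy sketch of the four LAC steps listed above follows, using a signed sum of differences as the cell-versus-center comparison (as prescribed for mouth images); the nearest-neighbour resizing is an assumption made to keep the sketch self-contained.

```python
import numpy as np

def lac_descriptor(img, n=11, m=10):
    """Steps 1-4 above: resize to (n*m) x (n*m), partition into n x n cells of
    m x m pixels, take the central m x m patch, and store one cell-vs-centre
    comparison per cell. The comparison is a signed sum of differences so local
    brightness is preserved; resizing is plain nearest-neighbour indexing."""
    side = n * m
    img = img.astype(np.float64)
    ys = np.arange(side) * img.shape[0] // side
    xs = np.arange(side) * img.shape[1] // side
    tpl = img[np.ix_(ys, xs)]                        # step 1: resized template
    c0 = side // 2 - m // 2
    centre = tpl[c0:c0 + m, c0:c0 + m]               # step 3: central m x m patch
    desc = np.empty(n * n)
    for r in range(n):                               # steps 2 and 4: compare cells
        for c in range(n):
            cell = tpl[r * m:(r + 1) * m, c * m:(c + 1) * m]
            desc[r * n + c] = np.sum(cell - centre)  # signed sum of differences
    return desc
```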
3 Classification

Support Vector Machines (SVMs) [4] are a set of related supervised learning methods used for classification and regression. They belong to a family of generalized linear classifiers. A property of SVMs is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers. LIBSVM [6] has been the library employed in the experiments described below.
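A hedged sketch of this classification stage: the paper uses LIBSVM directly, whereas the sketch below relies on scikit-learn's SVC (which wraps LIBSVM) and on default hyper-parameters, both of which are assumptions; the 50/50 split and the [0,1] scaling mirror the experimental description in the next section.

```python
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

def train_smile_classifier(descriptors, labels, test_size=0.5, seed=0):
    """Train and evaluate an RBF-SVM smile classifier on precomputed descriptors
    (e.g. LAC vectors or LBP histograms). Inputs are scaled to [0, 1]; the
    50/50 random split mirrors the experimental protocol."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        descriptors, labels, test_size=test_size, stratify=labels, random_state=seed)
    clf = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))
    clf.fit(X_tr, y_tr)
    error_rate = 1.0 - clf.score(X_te, y_te)   # fraction of wrongly classified samples
    return clf, error_rate
```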
4 Experiments

The dataset of images used for the experimental setup is separated into two classes: smiling and not smiling. This classification has been performed by humans, who labeled each normalized image of 59 × 65 pixels. The first set contains 2421 images of different smiling faces, while the second set contains 3360 non-smiling faces. As briefly mentioned above, the experimental setup considers one possibility as input: the mouth. The input image is a grayscale image, and for representation purposes we have used the following approaches for the tests:

– A PCA space obtained from the original gray images of 59 × 65 pixels.
– A PCA space computed on the resulting images after preprocessing the original images using LBP. Two different approaches, i.e. simplified LBP (SLBP) and uniform LBP (ULBP), have been used.
– A concatenation of histograms based on the gray image or the resulting LBP image (both the simplified and the uniform approach were used).
– A concatenation of the image values based on the gray images or the resulting LBP image (again, both the simplified and the uniform approach were used).
– The LAC descriptor obtained from the original gray images of 59 × 65 pixels.
– The LAC descriptor computed on the resulting images after preprocessing the original images using LBP. Two different approaches, i.e. simplified LBP (SLBP) and uniform LBP (ULBP), have been used.

Similar experimental conditions have been used for every approach considered in this setup. The test sets are built randomly, having an identical number of images for both classes. The results presented correspond to the percentage of wrongly classified samples over all test samples. Average results presented in this paper are obtained for each configuration after ten random selections with 50% of the samples for training and 50% for testing. Therefore, 2000 images (1000 of each class) have been used for training and 2000 images (1000 of each class) for testing. As can be seen in Figure 3, the best results in almost every situation are achieved with no preprocessing at all, directly using grayscale images. None of the LBP-based representations outperforms that approach. However, even if the Uniform LBP approach shows a larger improvement when normalized histograms are used, the Simplified LBP approach reported better results than Uniform LBP in every other situation. As already stated in [17], this preprocessing provides benefits in the context of facial analysis. When the normalized histogram based representation is used, the Uniform LBP error rate is the lowest. This approach seems to model the smile texture properly even though the histogram loses the relative location information. This also holds for the Simplified LBP approach: its histogram loses location information, but the achieved rates are similar. The grayscale image test achieved its highest error rate in this case, higher than the Uniform LBP and Simplified LBP approaches, which means that the grayscale approach is very sensitive to the relative location of the information. When the normalized image values vector based representation is used, the grayscale image test achieved the lowest error rate. On the other hand, the Uniform LBP test achieved the highest error rate in this case.
Fig. 3. Mouth processing results with different approaches using SVM for classification. Six different methods were applied to each preprocessing method: Histogram, Image Values, Principal Components Analysis and Local Aspect Correspondence respectively. It is important to mention that the number next to PCA refers to the dimension of the representation space, i.e. it indicates the number of eigenvectors used for projecting the face image.
For the PCA approach, the error-rate behavior is quite similar to that obtained previously with the image values test. Again, the grayscale image test achieved the lowest error rates. PCA deserves an additional observation: increasing the dimension of the PCA space does not always yield better results. For the LAC approach, the error-rate behavior is also quite similar to that obtained previously with the PCA and image values tests. Again, the lowest error rates were achieved by the grayscale image test. For LAC, it is important to mention that overlap between cells is considered. Firstly, several tests without overlapping were made in order to find the optimum LAC parameters (the number of cells and the cell size yielding the lowest error rate). For smile detection it was found that 10 × 10 cells of 3 × 3 pixels performed best, thanks to the closeness between the size of the extracted LAC patch (30 × 30 pixels) and the original size of the mouth capture (20 × 12 pixels). The worst results are achieved for configurations with fewer than 10 × 10 cells because of the loss of information due to resizing in the normalization step. Beyond that number of cells, and for bigger cell sizes, the behavior is irregular because the extracted information is not reliable, owing to the false information introduced when the mouth is resized to fit the LAC patch. When images are upsampled, redundant and useless information is created. Unfortunately, when overlap was introduced, the achieved error rates were higher than without overlapping. The images used were too small for the overlapping regions to be significant. It should also be mentioned that, in terms of error rates, Simplified LBP behaves as an intermediate approach between grayscale and Uniform LBP. That is, Simplified LBP achieved better rates than the grayscale approach, and worse than Uniform LBP, for the normalized histogram test. Simplified LBP also achieved better rates than Uniform LBP, and worse than grayscale, for the image values and PCA tests. Unlike the study in [9], where the whole face was considered for smile detection, in the SVM setting already explained the strategic choice of the mouth block translates into
a reduction of dimensions, and the improvement is due to this fact. Of course, it should be mentioned that PCA reduces dimensions too, which is the reason why the PCA tests achieved better results than the image values test for grayscale images. Among the approaches used in the tests, the difference in rates cannot come from the value domain, since every input to the SVM is previously normalized within the range [0,1]. The normalized histograms deserve an additional observation. For each representation approach, a normalized histogram is built for the selected area, the mouth. For the results in this case, there is a remarkable improvement of Uniform LBP over Simplified LBP.
5 Conclusions

This paper described smile detection using different LBP approaches, as well as a grayscale image representation, combined with SVM. The potential of the LAC-based representation for smile verification has been shown. The LAC-based representation presented in this paper outperforms the other approaches, with an improvement of over 5% for each preprocessing method. Overlap does not perform better due to the small size of the mouth area. Uniform LBP does not preserve the spatial locality of the statistical patterns. This means that there is no gradual change between adjacent blocks preprocessed with Uniform LBP. Depending on the value of a pixel inside one of the blocks, the codification of two adjacent pixels can jump, for example, from pattern 2 to pattern 9. Translated to the space domain of the SVM, this means that the corresponding dimensions can be too far apart. Simplified LBP keeps the spatial locality of the statistical patterns: there is a gradual change between adjacent preprocessed pixels. Translated to the SVM's space domain, this gradual change means that similar points are closer in this space. The main reason for getting worse results using a histogram of Simplified LBP is that there is a loss of information related to location. That is not important for Uniform LBP because, as said before, Uniform LBP looks for texture and the histogram window gives it the chance to show this fact. Our future work focuses on the potential of the LAC descriptor for generic applications such as image retrieval. In this paper we have developed a static smile classifier achieving, in some cases, a 93% success rate. Due to this success rate, smile detection in video streams, where temporal coherence is implicit, will be studied in the short term, as a cue towards recognizing the dynamics of the smile expression.
References 1. Bartlett, M., Littlewort, G., Frank, M., Lainscsek, C., Fasel, I., Movellan, J.: Fully automatic facial action recognition in spontaneous behavior. In: Proceedings of the Seventh IEEE International Conference on Automatic Face and Gesture Recognition (2006) 2. Bay, H., Tuytelaars, T.: Surf: Speeded up robust features. In: Proceedings of the Ninth European Conference on Computer Vision (May 2006) 3. Bourel, F., Chibelushi, C., Low, A.: Robust facial expression recognition using a state-based model of spatially-localised facial dynamics. In: Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition (2002)
4. Burges, C.: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2), 121–167 (1998) 5. Castrillón Santana, M., Déniz Suárez, O., Hernández Tejera, M., Guerra Artal, C.: ENCARA2: Real-time detection of multiple faces at different resolutions in video streams. Journal of Visual Communication and Image Representation, 130–140 (April 2007) 6. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm 7. Datcu, D., Rothkrantz, L.: Automatic recognition of facial expressions using bayesian belief networks. Proceedings of IEEE Systems, Man and Cybernetics (2004) 8. Ekman, P., Friesen, W.: Felt, false, and miserable smiles. Journal of Nonverbal Behavior 6(4), 238–252 (1982) 9. Freire, D., Castrillon, M., Deniz, O.: Smile detection using local binary patterns and support vector machines. In: Proceedings of the Fourth International Joint Conference on Computer Vision and Computer Graphics Theory and Applications (2009) 10. Ito, A., Wang, X., Suzuki, M., Makino, S.: Smile and laughter recognition using speech processing and face recognition from conversation video. In: Procs. of the 2005 IEEE Int. Conf. on Cyberworlds, CW 2005 (2005) 11. Khan, M., Ingleby, M., Ward, R.: Automated facial expression classification and affect interpretation using infrared measurement of facial skin temperature variations. ACM Trans. Auton. Adapt. Syst. 1(1), 91–113 (2006) 12. Kirby, M., Sirovich, L.: Application of the Karhunen-Loève procedure for the characterization of human faces. IEEE Trans. on Pattern Analysis and Machine Intelligence 12(1) (July 1990) 13. Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. on Pattern Analysis and Machine Intelligence 24(7), 971–987 (2002) 14. Picard, R.W.: Affective Computing. MIT Press, Cambridge (1997) 15. Marcel, S., Rodriguez, Y., Heusch, G.: On the recent use of local binary patterns for face authentication. In: International Journal of Image and Video Processing, Special Issue on Facial Image Processing (2007) 16. Shechtman, E., Irani, M.: Matching local self-similarities across images and videos. In: IEEE Conference on Computer Vision and Pattern Recognition 2007 (CVPR 2007) (June 2007) 17. Tao, Q., Veldhuis, R.: Illumination normalization based on simplified local binary patterns for a face verification system. In: Proc. of the Biometrics Symposium, pp. 1–6 (2007) 18. Turk, M., Pentland, A.: Eigenfaces for recognition. J. Cognitive Neuroscience 3(1), 71–86 (1991)
Analysis of Head and Facial Gestures Using Facial Landmark Trajectories Hatice Cinar Akakin and Bulent Sankur Bogazici University, Electrical and Electronics Engineering Department, 34342, Istanbul, Turkiye {hatice.cinar,bulent.sankur}@boun.edu.tr
Abstract. Automatic analysis of head and facial gestures is a significant and challenging research area for human-computer interfaces. We propose a robust face- and head-gesture analyzer. The analyzer exploits trajectories of facial landmark positions during the course of the head gesture or facial expression. The trajectories themselves are obtained as the output of an accurate feature detector and tracker algorithm, which uses a combination of appearance- and model-based approaches. A multi-pose deformable shape model is trained in order to handle shape variations under varying head rotations and facial expressions. Discriminative observation symbols extracted from the landmark trajectories drive a continuous HMM with mixture-of-Gaussian outputs, which is used to recognize a subset of head gestures and facial expressions. For seven gesture classes we achieve an 86.4% recognition rate. Keywords: Automatic facial feature tracking, head gesture and facial expression recognition.
1 Introduction

The human face not only provides identity information but also clues to understand social feelings, and it can be instrumental in revealing mental states via facial expressions. It has been stated that we express ourselves by words about 10%, by voice tone about 30% and by body language about 60% [15]. Therefore body language is our main channel in communicating the emotional content of our messages. Gestures, eye and head movements, body movements, facial expressions and touch constitute the non-verbal message of our body language. These non-verbal messages can be more essential than words in revealing our true mental states and feelings such as trust, agreement, enjoyment, hostility and worry. The main tasks in human-computer interfaces are: i) detection and localization of the face; ii) accurate facial feature detection and tracking; iii) face modeling and animation; iv) facial expression analysis; v) recognition of mental states from sequences of facial expressions. In this work we focus on the last task, namely, expression analysis and mental state modeling. Our approach for the automatic analysis of the speaker's current state of mind consists first in detecting and tracking certain fiducial facial features on the
eyebrows, eyes, mouth and nose in video frames. A combination of appearance- and model-based approaches is utilized to increase the effectiveness of the tracker. Second, a multi-pose deformable shape model is trained in order to handle shape variations under varying head rotations and facial expressions. Thirdly, we recognize a subset of head gestures and facial expressions using discriminative observation symbols extracted from the tracked landmark trajectories and a continuous HMM with mixture-of-Gaussian outputs. The rest of the paper is organized as follows. Section 1 concludes with a literature review of recent papers in the area. Section 2 describes the facial landmark detection algorithm. The facial landmark tracking method is presented in Section 3. In Section 4 the head gesture and facial expression recognition procedures are described. Section 5 presents the experimental results of the proposed methods. Section 6 concludes the paper.

1.1 Background on Facial Landmark Tracking and Automatic Expression Recognition

Recent studies on facial landmark tracking can be categorized as appearance-based [1] and model-based approaches [2-6,8]. The appearance-based approaches are general-purpose point trackers without prior knowledge of the intent; on the other hand, model-based approaches concentrate on explicitly modeling the shape of objects. In [1] a face template is represented as a linear combination of Gabor wavelet functions. The face image and the facial landmarks are affinely repositioned in the subsequent frames. McKenna et al. [2] propose an approach to track facial motion based on a PDM and Gabor wavelets. The ASM [3] is a popular statistical approach to represent deformable objects. PCA is applied to analyze the modes of shape variation so that the object shape can only deform in specific ways that are found in the training data. The AAM [4] is another popular approach that combines constraints on both shape variation and texture variation. In [6], the detection and tracking algorithm utilizes a set of feature templates in combination with a shape-constrained search technique. The authors of [8] propose a framework to detect emblems that combines an ASM, based on NMF, with a predictive face aspect model. Most of the work in the literature on facial expression analysis is focused on recognizing the six basic facial expressions, i.e., happiness, surprise, sadness, fear, anger and disgust [16]. In addition to basic expressions, complex mental states such as confused, thinking, and interested are also analyzed [14,17]. Automatic analysis of complex mental states from the face is a challenging task when compared with the recognition of basic facial expressions. First, mental state inference has to combat uncertainties, since mental states are hidden and can only be inferred indirectly by analyzing the behavior of the person. Secondly, while basic emotions are mostly identifiable solely from facial action units, complex mental states additionally involve purposeful head gestures and eye-gaze direction. Besides this, whereas basic emotions are identifiable from a small number of frames or even still images, complex mental states can only be recognized by analyzing the temporal dependencies across consecutive facial and head displays [17]. The majority of facial expression recognition systems attempt to identify Facial Action Units (FAUs) [16-17] based on the Facial Action Coding System (FACS) [18].
2 Facial Landmark Initialization

We use a two-tier architecture for facial landmark initialization [7], where the first tier detects seven fiducial landmarks and the second tier adds ten other ancillary landmarks. First the face is located using a modified version of the Viola-Jones face detector [19]; then the seven facial landmark points, consisting of the four eye corners, two mouth corners and the nose tip, are detected via Support Vector Machines (SVMs) trained separately on DCT features of the corresponding landmark (Fig. 1b). A probabilistic graph model (PGM) improves the landmark estimates, which on the one hand eliminates false alarms and on the other hand increases the precision of their location [7]. Since seven facial feature points are not sufficient to interpret facial expressions, in the next tier of the algorithm we estimate ten ancillary feature points. These additional points are the inner and outer corners and middle points of the eyebrows, points on the left and right side of the nose, and the middle points of the upper and lower lips. To this effect we adapt a seventeen-point face mesh to the seven fiducial landmarks, to initialize the positions of the ten ancillary landmarks (Fig. 1c). Finally, we refine the ancillary landmark positions using a bounding box around the estimated positions and searching for an improved match in its neighborhood via SVM-DCT (Fig. 1d). Once the facial landmarks have been initialized, we can start tracking them.
Fig. 1. The flowchart of initialization of facial landmarks. (b) Fiducial Landmarks detected via DCT-features and refined with PGM, (c) Ancillary landmarks initialized via an adapted 17node mesh, (d) Ancillary landmarks refined.
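A minimal sketch of how a candidate landmark position could be scored with an SVM over local DCT features, as in the first tier described above; the patch size, the low-frequency coefficient block and the use of a scikit-learn classifier are assumptions of the sketch, not the authors' implementation.

```python
import numpy as np
from scipy.fftpack import dct
from sklearn.svm import SVC

def dct_features(patch, block=8):
    """2-D DCT of a patch; keep the low-frequency block x block coefficients."""
    d = dct(dct(patch.astype(float), axis=0, norm="ortho"), axis=1, norm="ortho")
    return d[:block, :block].flatten()

def best_landmark_position(image, candidates, svm, patch_size=16):
    """Score every candidate (x, y) with the landmark's trained SVC on local DCT
    features and return the highest-scoring position; 'candidates' would come
    from a coarse search region inside the detected face."""
    half = patch_size // 2
    best, best_score = None, -np.inf
    for (x, y) in candidates:
        patch = image[y - half:y + half, x - half:x + half]
        if patch.shape != (patch_size, patch_size):
            continue                       # skip candidates too close to the border
        score = svm.decision_function([dct_features(patch)])[0]
        if score > best_score:
            best, best_score = (x, y), score
    return best
```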
3 Facial Feature Tracking

The locations of the identified landmark points are predicted with a Kalman filter in the following frames. However, their positions need to be further refined after the Kalman prediction by searching for the best-matching template. Meanwhile, the template library for every landmark is updated after a test frame, provided the appearance of the landmark in that frame is sufficiently different from any in the present library. The DCT coefficients of an N×N block around each tracked feature point constitute candidate templates. At the initial frame, all DCT templates extracted around the detected landmarks are saved in the corresponding libraries. In the subsequent frames, DCT templates are compared with the template library by checking their distance in terms of the normalized correlation coefficient, as in Eqn. 1. If a test template for the j-th landmark (j = 1, …, 17), T_j^C, differs from every existing library template T_j^{L,k} by more than a threshold (here 0.3), then the test template is included in the template library of the corresponding landmark. Template library replenishment is allowed only after the feature points have been reliably detected.
H.C. Akakin and B. Sankur
more than a threshold (here it is 0.3) then the test template is included in the template library of the corresponding landmark. Template library replenishment is allowed only after that the feature points have been reliably detected.
\[
\mathrm{Dist}\!\left(T^{L}_{test},\, T^{L}_{library}\right) \;=\;
1 \;-\; \max_{k}\left\{ \left\langle T^{L}_{test},\, T^{L,k}_{library} \right\rangle
\cdot \left( \left\| T^{L,k}_{library} \right\| \, \left\| T^{L}_{test} \right\| \right)^{-1} \right\};
\qquad k = 1, \ldots, \left| T^{L}_{library} \right|
\qquad (1)
\]
There are 17 landmark libraries (hence L = 1, …, 17) and each library contains a dynamic number of templates, k, given by the cardinality of the L-th set T^L_library.
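A minimal sketch of the template distance of Eq. (1) and of the library-update rule with the 0.3 threshold; templates are assumed to be flattened DCT coefficient vectors.

```python
import numpy as np

def template_distance(t_test, library):
    """Eq. (1): one minus the largest normalized correlation between the test
    template and any template in the landmark's library."""
    sims = [np.dot(t_test, t_k) / (np.linalg.norm(t_k) * np.linalg.norm(t_test))
            for t_k in library]
    return 1.0 - max(sims)

def maybe_update_library(t_test, library, threshold=0.3):
    """Add the test template to the library when it differs from every stored
    template by more than the threshold, as described above."""
    if template_distance(t_test, library) > threshold:
        library.append(t_test)
    return library
```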
Fig. 2. The flowchart of the proposed tracking algorithm
Recapitulating, the steps of the tracking process, also illustrated in Fig. 2, are:

Step 1 Initialization: Initialize the landmark points and their template libraries.
Step 2 Prediction: Predict the landmark locations using a Kalman filter.
Step 3 Refinement: Refine the predicted landmark locations with a search strategy using the present DCT template library. The position of the template yielding the lowest distance score is the new landmark location.
Step 4 Regularization: Landmarks can go astray due to occlusion, interfering patterns or the inadequacy of templates which do not represent subtle variations. To minimize this risk, we project the ensemble of all 17 feature points onto the shape subspace following the ASM paradigm [3]. The regularized shape with minimum Euclidean distance to the tracked shape is generated.

Multi-pose Shape Model: A three-pose deformable shape model is built in order to handle shape variations under severe head rotations. Using the principle of the ASM, the PDM is constructed from a training set of manually labeled face images. The first category represents the shape model for frontal faces, neutral or with expressions, as well as mildly rotated faces (yaw rotations up to ±30º, slight pitch rotations). The second and third categories of shape models represent faces rotated to the left and right sides, respectively, with yaw angles of 20º to 45º. The shape models were trained using a subset of the Bosphorus Face Database [11], which includes a rich set of expressions and systematic variation of poses, as shown in Fig. 3.
Fig. 3. Images from Bosphorus Database [15]; Head poses and facial actions
Shape Regularization: The shape regularization method for the landmarks is given by the formula x = \bar{x} + \Phi b, where \bar{x} is the mean shape, \Phi is the matrix of eigenvectors (with eigenvalues \lambda_i) defining the linear subspace, and b = \Phi^{T}(x - \bar{x}). In [3] the variation limits are set at \pm 3\sqrt{\lambda_i} for the b_i parameters to ensure that the generated shape is similar to the training shapes, that is, the parameters \{b_i\}_{i=1}^{20} are allowed to change within -3\sqrt{\lambda_i} < b_i < 3\sqrt{\lambda_i}. This range was found adequate to accommodate shape and expression variations for frontal faces. However, we found that this interval was too restrictive for the wider range of rotations and expressions in our database. Therefore we extend the limits to allow higher variations (e.g., the limit of the first parameter is \pm 12.5\sqrt{\lambda_1}, while it is \pm 3\sqrt{\lambda_{20}} for the b_{20} parameter) by setting the interval to -15(1 + 0.2i)^{-1}\sqrt{\lambda_i} < b_i < 15(1 + 0.2i)^{-1}\sqrt{\lambda_i}, i = 1, \ldots, 20.
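A minimal sketch of this widened regularization, assuming the mean shape, eigenvector matrix and eigenvalues come from the trained point distribution model, could look as follows; it projects a tracked shape onto the subspace, clamps each parameter to the extended bound, and reconstructs the regularized shape.

```python
import numpy as np

def regularize_shape(x, x_mean, Phi, eigvals):
    """Project onto the shape subspace and clamp the parameters b_i."""
    b = Phi.T @ (x - x_mean)                     # shape parameters b
    i = np.arange(1, len(b) + 1)                 # i = 1, ..., 20
    limit = 15.0 / (1.0 + 0.2 * i) * np.sqrt(eigvals[:len(b)])
    b = np.clip(b, -limit, limit)                # widened +/- limits
    return x_mean + Phi @ b                      # regularized shape
```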
4 Recognition of Head and Facial Gestures Using HMMs

Robust and accurate tracking of facial feature points in a face video sequence enables one to classify head and facial gestures and expressions. The HMM (Hidden Markov Model) is one of the most basic and common methods for time-sequence classification, and hence can be used for expression and gesture recognition [16,17]. Given a training set of observation sequences, the HMM model parameters (A, B, π) can be learned for that sequence class. For classification, we select an adequate HMM and then learn the model parameters separately for each gesture class. The number of states is estimated by considering the complexity of the gesture classes. Discriminative observable symbols (features) are chosen for each class. Thus, while our raw data consist of the trajectories of the seventeen tracked facial landmarks (34 variables), they can be judiciously reduced in an informed way for each gesture class. The extracted symbols are as follows: (1,2) mean of the y and x coordinates (ymean, xmean); (3,4) number of peaks in ymean and xmean over the observation epoch; (5) Euclidean distance between the two lip corners; (6) Euclidean distance of the lip corners in the y coordinate between instants t and t-1; (7) Euclidean distance between eyes and eyebrows; (8) Euclidean distance between the inner eyebrow corners; (9) Euclidean distance of the eyebrows between the initial frame and the current instant; (10,11) Euclidean distance of the nose (x, y) between the initial frame and the current instant; (12) ratio of the right and left eye widths; (13) inner-eye corner distance. All distances are normalized to the inter-pupil distance.
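A few of these symbols can be computed directly from the landmark trajectories, for example as in the sketch below; the landmark index assignments (lip corners, pupils) are purely illustrative assumptions, since the text does not fix a numbering.

```python
import numpy as np

# landmarks: array of shape (T, 17, 2) holding (x, y) per frame.
# The index constants below are illustrative assumptions.
LIP_LEFT, LIP_RIGHT = 12, 13
PUPIL_LEFT, PUPIL_RIGHT = 4, 5

def observation_symbols(landmarks):
    # inter-pupil distance used to normalize every measure
    ipd = np.linalg.norm(landmarks[:, PUPIL_LEFT] - landmarks[:, PUPIL_RIGHT],
                         axis=1).mean()
    y_mean = landmarks[..., 1].mean(axis=1) / ipd          # symbol (1)
    x_mean = landmarks[..., 0].mean(axis=1) / ipd          # symbol (2)
    lip_width = np.linalg.norm(landmarks[:, LIP_LEFT] - landmarks[:, LIP_RIGHT],
                               axis=1) / ipd               # symbol (5)
    lip_dy = np.abs(np.diff(landmarks[:, LIP_LEFT, 1],
                            prepend=landmarks[0, LIP_LEFT, 1])) / ipd  # symbol (6)
    return np.column_stack([y_mean, x_mean, lip_width, lip_dy])
```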
5 Experimental Results

Landmark Tracking Results: The effectiveness of the proposed tracking algorithm was tested on the BUHMAP video database [10]. This database includes head and facial gesture classes, namely, head shaking, head up (simultaneously raising the eyebrows), head forward (accompanied by raised eyebrows), sadness (lips turned
[Fig. 4 consists of four panels (Head shaking; Head up and eyebrow raise; Head forward and eyebrow raise; Smiling) plotting the proportion of successfully tracked feature points (eyes, nose, mouth, eyebrows) against the normalized error.]
Fig. 4. Cumulative distribution of landmark errors vis-à-vis their ground-truth data
Fig. 5. Tracked landmarks on sample image sequences; first row: head-up-and-eyebrow-raise, second row: head-up-down-and-smile, third row: head-shaking
down, eyebrows down), head up-down (nodding the head continuously), happiness, and happy up-down (head up-down + happiness). 11 subjects performed the gesture classes with 5 repetitions of each gesture. The videos are recorded at 30 fps at a resolution of 640x480, and the duration of the gestures is between 1 and 2 seconds. Only 3 repetitions of the 4 gestures (head shaking, head up, head forward, happiness) performed by 4 subjects have been manually landmarked in all frames. Since all point-to-point distances between the tracked points and the ground truth are normalized by the inter-pupil distance, the error measure is invariant to face size. Fig. 4 displays the tracking performance results. In these curves the horizontal axis denotes the deviation from the ground truth as a percentage of the inter-pupil distance; the vertical axis is the percentage of successful tracking cases within that tolerance. Notice that head shaking is the best tracked gesture class, since at the 0.1 tolerance point almost all facial features are accurately tracked over the frames. Head-up-and-eyebrow-raise is the worst tracked action, especially for the eyebrows and the mouth. Fig. 5 shows the tracked landmarks on the faces. Overall, the proposed tracking algorithm was able to track facial landmarks even under large head rotations and facial expressions.
Table 1. Test sets and performed experiments (S: number of subjects, R: number of repetitions, C: number of classes)
Test sets:
Test set-1: S = 4, C = 7, R = 5
Test set-2: S = 4, C = 4 (Head L-R, Head Up, Head F and Happiness), R = 3 (with ground truths)

Experiments performed:
Exp.1: test set 2; training 4 S, 2 R (32 videos); testing 4 S, 1 R (16 videos); leave-one-R-out cross validation
Exp.2: test set 1; training 3 S, 5 R (105 videos); testing 1 S, 5 R (35 videos); leave-one-S-out cross validation
Exp.3: test set 2 for training, test set 1 for testing; training 4 S, 3 R (48 videos); testing 4 S, 2 R (32 videos)
Table 2. Confusion matrix for Exp.2 (105 training videos, 35 test videos)
(Rows: performed gesture; columns: recognized gesture; entries in %)

            Head L-R  Head U  Head F  Sadness  Head U-D  Happiness  Happy U-D
Head L-R       100       0       0        0         0          0          0
Head U           0     100       0        0         0          0          0
Head F           0       0     100        0         0          0          0
Sadness          5       0      10       85         0          0          0
Head U-D         0       0       0        0        75          0         25
Happiness        0       0       0       15         0         85          0
Happy U-D        0       0       5        0        35          0         60
Head and Facial Gesture Recognition Results: The extracted observable symbols (feature vectors) are continuous, so we run the head and facial gesture recognition procedure using continuous HMMs with mixtures of Gaussian outputs [13]. The numbers of hidden states and Gaussian mixtures are chosen in accordance with the complexity of the action classes (e.g., for the head-nodding-and-smiling class five hidden states and six Gaussian mixtures are used, while for the sadness and smiling classes only three states and five mixtures are used). The closest work to which we can compare the performance of our algorithm is that of Ari's test sets [14]. The conducted experiments and test sets are shown in Table 1. It is observed that the first and third experiments, which involve Head L-R, Head Up, Head Forward and Happiness, achieve 100% recognition rates. The confusion matrix of the second experiment is given in Table 2. In the second experiment our recognition rate is 86.4%, which is better by about 13 percentage points than Ari's highest recognition rate of 72.9% [14]. As seen in Table 2, 25% of the Head U-D class is misclassified as Happy U-D and 35% of the Happy U-D class is misclassified as Head U-D. If we merge these two problematic classes into one class, the total recognition rate rises to 94%. Sadness is another difficult expression to classify, since the acting of the sadness state differs significantly between the subjects in the database. For example, some of the subjects activated the "raise inner eyebrow points" and "lip corners down" actions, while others activated the "eyebrow fall" and "lip pucker" actions.
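A per-class continuous HMM classifier of this kind can be sketched as follows; the snippet uses the hmmlearn library's GMMHMM as a stand-in for the Matlab toolbox [13] cited in the text, the state and mixture counts mirror the examples given above, and all other settings are assumptions.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM   # one possible continuous-HMM implementation

def train_gesture_models(sequences_per_class, n_states=5, n_mix=6):
    """Fit one HMM with Gaussian-mixture outputs per gesture class."""
    models = {}
    for label, seqs in sequences_per_class.items():
        X = np.vstack(seqs)                     # stacked observation symbols
        lengths = [len(s) for s in seqs]        # per-sequence lengths
        m = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type='diag', n_iter=50)
        m.fit(X, lengths)
        models[label] = m
    return models

def classify(models, sequence):
    """Pick the gesture class whose HMM gives the highest log-likelihood."""
    return max(models, key=lambda label: models[label].score(sequence))
```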
6 Conclusions

In this study, we have introduced a robust face-and-head gesture analyzer based on accurate tracking of facial landmarks and a dynamic sequence classifier for face videos. The proposed algorithm is capable of tracking facial landmarks even under large head rotations (mostly yaw and tilt) and under facial expressions from subtle to strong valence. The tracker has generalization potential in that it performed satisfactorily on test sequence types not seen in the training set. The gesture analyzer is capable of differentiating head shaking, head nodding, head forward, facial expressions such as smiling and sadness, and combinations of head and facial expressions in face video sequences. The analyzer extracts discriminative observation symbols (features) from the trajectories of the facial landmarks and uses continuous HMMs with mixtures of Gaussian outputs. Future work will focus on improving the current analyzer performance and developing methodologies to handle more challenging tasks such as complex mental state inference from video sequences.
References 1. Feris, R., Cesar Junior, R.M.: Tracking Facial Features Using Gabor Wavelet Networks, pp. 22–27. IEEE, Sibgraphi (2000) 2. McKenna, S., Gong, R.P., Wurtz, J., Tanner, D.: Tracking facial feature points with Gabor wavelets and shape models. In: Proceedings of the International Conference on Audio-and Video-based Biometric Person Authentication (1997) 3. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active shape models - their training and application. Computer Vision Image Understanding 61(1), 38–59 (1995) 4. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. IEEE Trans. PAMI 23(6), 681–684 (2001) 5. Dornaika, F., Davoine, F.: Online Appearance-based Face and Facial Feature Tracking. In: Proceedings of the 17th International Conference on Pattern Recognition (2004) 6. Cristinacce, D., Cootes, T.F.: Facial Feature Detection and Tracking with Automatic Template election. In: Int. Conf. on Automatic Face and Gesture Recognition, FGR (2006) 7. Çınar Akakın, H., Akarun, L., Sankur, B.: 2D/3D Face Landmarking. 3DTV Con., Kos (2007) 8. Kanaujia, A., Huang, Y., Metaxas, D.: Emblem Detections by Tracking Facial Features. In: Proceedings of the Conference on Computer Vision and Pattern Recognition Workshop, CVPRW (2006) 9. Tong, Y., Wang, Y., Zhu, Z., Ji, Q.: Robust facial feature tracking under varying face pose and facial expression. Pattern Recognition 40, 3195–3208 (2007) 10. Aran, O., Arı, İ., Güvensan, M.A., Haberdar, H., Kurt, Z., Türkmen, H.İ., Uyar, A., Akarun, L.: A Database of Non-Manual Signs in Turkish Sign Language. In: IEEE Signal Processing and Communications Applications (SIU 2007), Eskişehir (2007) 11. Savran, A., Alyüz, N., Dibeklioğlu, H., Çeliktutan, O., Gökberk, B., Sankur, B., Akarun, L.: Bosphorus database for 3D face analysis. In: Schouten, B., Juul, N.C., Drygajlo, A., Tistarelli, M. (eds.) BIOID 2008. LNCS, vol. 5372, pp. 47–56. Springer, Heidelberg (2008), http://www.busim.ee.boun.edu.tr/~bosphorus/ 12. Wang, T.H., James Lien, J.J.: Facial expression recognition system based on rigid and nonrigid motion separation and 3D pose estimation. Pattern Recognition 42, 96–977 (2009)
13. Murphy, K.: Hidden Markov Model (HMM) Toolbox for Matlab, http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html 14. Arı, İ., Akarun, L.: Facial Feature Tracking and Expression Recognition for Sign Language. In: IEEE, Signal Processing and Communications Applications, Antalya (2009) 15. Cooper, K.: Nonverbal Communication for Business Success. Amacom (January 1979) 16. Bailenson, J.N., et al.: Real-time classification of evoked emotions using facial feature tracking and physiological responses. International Journal of Human Machine Studies 66, 303–317 (2008) 17. Kaliouby, R.A.: Mind-reading machines: automated inference of complex mental states. Tech. Report, UCAM-CL-TR-636 (July 2005) 18. Ekman, P., Friesen, W.V.: Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press (1978) 19. Demirkir, C., Sankur, B.: Face Detection Using Look-up Table Based Gentle AdaBoost. In: Audio and Video-based Biometric Person Authentication (AVBPA), Terrytown, New York (2005)
Combining Audio and Video for Detection of Spontaneous Emotions

Rok Gajšek, Vitomir Štruc, Simon Dobrišek, Janez Žibert, France Mihelič, and Nikola Pavešić

Faculty of Electrical Engineering, University of Ljubljana, Tržaška 25, SI-1000 Ljubljana, Slovenia {rok.gajsek,vitomir.struc,simon.dobrisek,janez.zibert,france.mihelic, nikola.pavesic}@fe.uni-lj.si http://luks.fe.uni-lj.si/
Abstract. The paper presents our initial attempts in building an audio video emotion recognition system. Both, audio and video sub-systems are discussed, and description of the database of spontaneous emotions is given. The task of labelling the recordings from the database according to different emotions is discussed and the measured agreement between multiple annotators is presented. Instead of focusing on the prosody in audio emotion recognition, we evaluate the possibility of using linear transformations (CMLLR) as features. The classification results from audio and video sub-systems are combined using sum rule fusion and the increase in recognition results, when using both modalities, is presented. Keywords: bimodal emotion database, emotion recognition, linear transformations.
1 Introduction
Recent years have seen an increase in the analysis of the psycho-physical state of the users of biometric systems, especially in the fields of audio and video processing. Recognising human emotions represents a complex and difficult task, the more so if the emotions recorded are of a spontaneous nature. While human observers might have difficulties correctly recognising the emotions of another individual in these situations, the task poses an even bigger challenge to automated systems relying on audio and video data. Various techniques have been proposed in the literature for the recognition of emotions from audio-video sequences; however, only a handful of them were tested on sequences of spontaneous emotions. In this article we present separately the video and audio sub-systems for emotion recognition. For video, a discrete cosine transformation (DCT) was applied, yielding a set of features which were then used, together with their deltas, and classified using a nearest neighbour classifier. In emotion recognition from audio, different prosody features are generally used to capture emotion-specific properties of the speech signal. Instead, we
focused on using maximum likelihood linear transformations, otherwise heavily used in speaker or environment adaptation, as features for emotion classification.
2 Video Emotion Recognition
Many of the existing techniques make use of the coding system developed by Eckman [4] to encode different facial expressions in terms of so-called action units, which relate to the muscular configuration across the human face. Automated visual (emotion) recognition systems typically track a set of facial features across a video sequence trying to determine the action units based on which facial expressions can be categorised into emotional states. An overview of existing techniques for emotion recognition from video data can be found in [5]. Regardless of the technique used for recognising the emotional states of an individual from a video sequence, a number of issues needs to be solved in the design of a robust emotion recognition system. These issues commonly include: detecting the facial area in each video frame of the video sequence, photometrically and geometrically normalising the facial region, extracting and/or tracking of facial features in each video frame and classification of the final feature vector sequence into an emotional category. As we are concerned only with the detection and recognition of neutral and aroused emotional states in this paper, the final classification procedure represents a two class problem. 2.1
Face Detection, Tracking and Normalisation
To detect the facial region in each video frame of the given video sequence we adopted the popular Viola-Jones face detector [6]. During the face detection stage all artifacts not belonging to the facial region are removed from the video frames. The key element of the employed face detector is an image representation called the integral image, which allows visual features of image sub-windows of arbitrary sizes to be computed in constant time. Once these features over a predefined number of sub-window sizes have been computed, AdaBoost is adopted to select a small set of relevant features that are ultimately fed to a cascade classifier. The classifier then performs the actual face detection by classifying each query face window into either the class of "faces" or the class of "non-faces". While the face detector usually results in good performance, it still exhibits some shortcomings. The most troublesome issue we encountered when testing the face detector was the false assignment of image windows to the class of faces. To overcome this problem, and consequently to improve the face detector's performance, we further processed the detector's output using a skin-colour filter of the following form:

f_{sc}(I(x, y)) = \begin{cases} 1, & \text{if } A \,\&\, B \,\&\, C \\ 0, & \text{otherwise,} \end{cases}    (1)

where I(x, y) denotes the input image, f_{sc} denotes the skin-colour filter, the operator & denotes the logical AND, and the expressions A, B, and C represent
conditions concerning the red (R), green (G) and blue (B) colour components of I(x, y). The skin-colour filter f_{sc} produces a binary mask with pixel values equal to one at every position that corresponds to the colour of human skin. This binary mask is combined with the face detector by discarding all image sub-windows but the one with the highest number of skin-colour pixels. An example of a successful deployment of the employed face detector is shown in Fig. 1.
Fig. 1. An example of the Viola-Jones face detector output
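Since the paper does not spell out the conditions A, B and C, the sketch below uses a common rule-of-thumb RGB skin test purely as an illustrative assumption; it builds the binary mask of Eqn. (1) and keeps the detector window with the most skin-coloured pixels.

```python
import numpy as np

def skin_mask(rgb):
    """Binary skin-colour mask; the thresholds for A, B, C are assumptions."""
    r = rgb[..., 0].astype(int)
    g = rgb[..., 1].astype(int)
    b = rgb[..., 2].astype(int)
    A = (r > 95) & (g > 40) & (b > 20)               # bright enough skin tone
    B = (r - np.minimum(g, b)) > 15                  # red channel dominates
    C = (np.abs(r - g) > 15) & (r > g) & (r > b)     # typical skin chromaticity
    return (A & B & C).astype(np.uint8)

def best_skin_window(candidates, rgb):
    """Keep only the detector window (x, y, w, h) containing the most skin pixels."""
    mask = skin_mask(rgb)
    return max(candidates,
               key=lambda w: mask[w[1]:w[1] + w[3], w[0]:w[0] + w[2]].sum())
```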
The presented face detector was applied to each frame of the video sequence, resulting in a sequence of detected facial regions that formed the foundation for the normalisation step. During the normalisation procedure, the facial regions of each frame were first aligned (in such a way that the faces were in an upright position), their histograms were equalised to enhance the contrast, and finally they were cropped to a standard size of 128 × 128 pixels.

2.2 Feature Extraction and Classification
There are several options as to which features to extract from a video sequence for the task of emotion recognition. Since this work describes our initial attempts in the design of an emotion recognition system, we decided to use simple holistic features extracted from each video frame by means of the discrete cosine transform (DCT). The DCT is a popular image processing tool commonly used for image compression. When adopted as a feature extraction approach it is used to reduce the dimensionality of the input frames from N to d, where d << N and N stands for the number of pixels in the normalised facial region, i.e., N = n × n = 128 × 128. In our case, the selected value of d was 300. Formally, the 1-dimensional DCT transform of an n-dimensional sequence u(i), where i = 0, 1, ..., n-1, is defined as follows:

v(k) = \alpha(k) \sum_{i=0}^{n-1} u(i) \cos\left( \frac{(2i+1)\pi k}{2n} \right),    (2)

where

\alpha(0) = \sqrt{\tfrac{1}{n}}, \quad \alpha(k) = \sqrt{\tfrac{2}{n}}, \ \text{for } 1 \le k \le n-1.    (3)

In the above expressions v(k) denotes the DCT-transformed sequence u(i). Since the DCT represents a separable transform, its 2-dimensional variant is obtained by first applying the 1-dimensional DCT to all image rows (which act as the sequences u(i)) and then to all image columns. To encode some of the dynamic information contained in the video frames, the feature vector of each frame computed via the 2-dimensional DCT was augmented with delta features, defined as the difference of the DCT feature vectors of two consecutive frames. The presented procedure resulted in a sequence of feature vectors of length 2d. The sequence was finally classified into one of the two emotional classes using a variant of the nearest neighbour classifier.
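The DCT-plus-delta feature extraction can be sketched as follows; how the d = 300 coefficients are selected is not stated in the paper, so the low-frequency square crop used here is an assumption.

```python
import numpy as np
from scipy.fftpack import dct

def dct_features(face, d=300):
    """2-D DCT of a normalised 128 x 128 face; keep d low-frequency coefficients
    (a simple square crop of the coefficient matrix is used here, as an assumption)."""
    coeffs = dct(dct(face.astype(float), axis=0, norm='ortho'), axis=1, norm='ortho')
    k = int(np.ceil(np.sqrt(d)))
    return coeffs[:k, :k].ravel()[:d]

def sequence_features(frames, d=300):
    """Per-frame DCT features augmented with deltas between consecutive frames."""
    feats = np.array([dct_features(f, d) for f in frames])
    deltas = np.vstack([np.zeros((1, d)), np.diff(feats, axis=0)])
    return np.hstack([feats, deltas])          # feature vectors of length 2d
```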
3 Audio Emotion Recognition
Analysis of prosody has been the main focus of research in the field of emotion recognition from speech. Features such as pitch and energy, with their means, medians, standard deviations, minimum and maximum values [3], are generally combined with some higher-level features, such as speaking rate, phone or word duration, etc. Language-model-based features have also been studied [7]. In our work we evaluated the possibility of using linear transformations as emotion features.

3.1 Linear Transformations of HMMs
Linear transformations of Hidden Markov Models are widely used in the fields of environment or speaker adaptation. Speaker- or environment-specific information is hidden in the linear transformation, which is then used in combination with the global acoustical model. Although there are other ways of calculating the transformation matrix, in our work we concentrated on applying maximum likelihood estimation for determining the parameters of the transformation. This type of linear transformation is named Maximum Likelihood Linear Regression (MLLR) [8]. Even though all parameters of an HMM model can be transformed using a linear transformation, usually only the means and the variances of the Gaussian distributions are considered. A constrained version of MLLR (CMLLR), where the same transformation matrix is used for both the means and the variances, was used. The transformation of a vector of means μ and the covariance matrix Σ is given by the following equation (the primes distinguish the model-space transform from its feature-space equivalent below):

\hat{\mu} = A'\mu - b', \qquad \hat{\Sigma} = A'\Sigma A'^{T}    (4)

The above equations present the CMLLR transformation in model space, but by applying equation (5), the same transformation can be applied in feature space as well:

\hat{o}(\tau) = A'^{-1} o(\tau) + A'^{-1} b' = A\, o(\tau) + b.    (5)
The matrix A and the vector b are usually combined in one matrix W = [A b], which represents the CMLLR transformation.

3.2 Training Procedure for the Estimation of CMLLR Transformations

Our goal was to evaluate the usage of CMLLR transformations as a feature for emotion recognition. In order to estimate these transformations for a particular emotion or arousal state, a global speaker-independent acoustical model is required. The Voicetran database [9], composed of broadcast weather news and read speech, was used to build the monophone acoustic model. The following steps describe the procedure used to train the acoustical model.
• Calculation of the global means μ0(s) and covariance matrices Σ0(s) for each speaker s in the database.
• Initialisation of the matrix W0(s) for each speaker s using equation (6) with the μ0(s) and Σ0(s) from the first step.
• Training of the first acoustical model AM1 using the W0(s) transformations.
• Estimation of the new MLLR transformations W1(s).
• Alternation between the training of the acoustical model AMi and the estimation of the Wi(s) transformation matrices; the above two steps are repeated four more times.
• After the fifth iteration the final acoustical model AM5 and the transformations W5(s) are obtained.

A(s) = L_0(s)^{T}, \qquad b(s) = -L_0(s)^{T}\mu_0(s).    (6)

In equation (6), L0(s) is the Cholesky decomposition matrix of the inverse of the global covariance matrix Σ0(s)^{-1}. The result of the above procedure relevant to the calculation of the CMLLR matrices for emotions is the set of five acoustical models. These are used unchanged in a similar procedure for the estimation of the CMLLRs for emotion classification. The parameter s, previously describing a particular speaker, now represents an emotional class. Again, the five-iteration process described above is applied, except for the training of the acoustical model in each step. The final set of emotion-specific CMLLR transformations is acquired after the fifth iteration.
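Once a transformation W = [A b] has been estimated (e.g., with an HTK-style adaptation tool; the estimation itself is outside this sketch and the shapes are assumptions), applying it in feature space as in Eqn. (5) and turning it into a fixed-length feature vector is straightforward:

```python
import numpy as np

def apply_cmllr(obs, W):
    """Feature-space CMLLR: o_hat = A o + b, with W = [A b] of shape (n, n+1)."""
    A, b = W[:, :-1], W[:, -1]
    return obs @ A.T + b          # obs: (T, n) observation vectors

def cmllr_feature_vector(W):
    """Vectorize the CMLLR matrix so it can be fed to PCA and a classifier."""
    return W.ravel()
```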
4 Bimodal Emotional Database
The audio-video database of spontaneous emotions, AvID [2], forms the basis for the evaluation of the proposed procedures. A description of the process of labelling the data into different emotion groups is given below.
4.1 Emotion Labelling
When working with databases of spontaneous emotions, the task of labelling the emotions presents a big challenge, as opposed to databases with acted emotions. In the case of acted emotions, the person labelling the data has a priori knowledge about which emotional state is represented in the recording, whereas in the case of a spontaneous database the decision needs to be based solely on the recording. Furthermore, the amount of different emotions represented in acted databases can be controlled, and they are normally equally distributed. In databases of a spontaneous nature, this can be controlled to some extent (the second session) or cannot be influenced directly at all (the first session). For the task of labelling the AvID database we employed five students of psychology. Due to the time-consuming nature of the labelling task, which is currently under way, not all recordings will be labelled by all five students. Still, the task of evaluating the agreement between the labellers presents the first obstacle we had to overcome. In the case of only two people labelling the emotions, only the recordings where they both agreed on the label were initially used. Recordings labelled by three or more people provide other options that can be evaluated. For the initial tests we used the majority vote, meaning that if the majority of the labellers agreed on a particular label, their decision was set as the reference. Since the recordings were transcribed in sentences, this formed the basis for the evaluation of the agreement between labellers. Here it should be noted that spontaneous speech, especially emotionally coloured speech, cannot always be segmented into proper sentences. Thus, some sentences can be very short or can contain just a few syllables. The matching times between the labellers were analysed and are presented in Section 5.
5 Experimental Results
From the AvID database, six sessions, fully transcribed and labelled as normal or aroused, were used as the basis for the evaluation. From these recordings a set of 668 MLLR matrices was calculated. Due to the spontaneous nature of the recordings, there was a strong prevalence of normal speech (529 MLLRs) over aroused speech (39 MLLRs). At least 15 seconds of audio was used for the estimation of each 39x39-element MLLR matrix. The MLLR matrices were converted into vectors, and linear and non-linear versions of Principal Component Analysis (PCA) were evaluated. Since this paper represents our initial attempts in audiovisual emotion recognition, a simple nearest neighbour classifier was selected for the classification phase. The full data set was divided into a training set (80%) and a testing set (20%) during five-fold cross validation. Using the training data, a class prototype was built for both the normal and the emotional state by taking the mean feature vector over all subjects in the class. For each MLLR-derived feature vector, a Euclidean distance is calculated to both the neutral and the emotional class prototype. Based on the ratio between the two distances, the feature vector was classified. A comparison of linear and non-linear PCA (a polynomial kernel of
[DET curves: emotional state error versus neutral state error, for non-linear and linear PCA.]
Fig. 2. Comparison of linear and non-linear PCA
the fourth order was used) is presented using Detection Error Trade-off (DET) curves in Fig. 2, which shows the average values of the five-fold cross validation. Superior results are achieved with the non-linear version, which is reasonable, since the non-linear dependencies in the data are also considered. The video features, extracted as described in Sec. 2.2, were analysed using a similar procedure as above and classified using a variant of the nearest neighbour classifier. Fig. 3 presents, in the form of a DET plot, the improvement after a sum-rule fusion of audio and video. For a more realistic assessment of the classification results, the agreement between the labellers should be discussed. One session, out of the six in question, was labelled individually by five people, and agreement between all five labellers was reached in only 59.1% of the utterances. Agreement increases if only two people are labelling the data, which was the case for the rest of the sessions evaluated here, yielding an average of 89.3%. The lowest equal error rate (EER) that can be achieved by our current system, as shown in Fig. 3, is 24.2%.
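A possible realisation of this classification scheme is sketched below using scikit-learn's KernelPCA with a fourth-order polynomial kernel; the class labels, the number of retained components and the decision threshold are assumptions, not values taken from the paper.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

def fit_models(train_vectors, train_labels, n_components=20):
    """Non-linear PCA on vectorized MLLR matrices, then class mean prototypes."""
    labels = np.asarray(train_labels)
    kpca = KernelPCA(n_components=n_components, kernel='poly', degree=4)
    z = kpca.fit_transform(train_vectors)
    prototypes = {c: z[labels == c].mean(axis=0) for c in np.unique(labels)}
    return kpca, prototypes

def classify(kpca, prototypes, vector, threshold=1.0):
    """Label by the ratio of distances to the 'neutral' and 'aroused' prototypes."""
    z = kpca.transform(vector.reshape(1, -1))[0]
    d_neu = np.linalg.norm(z - prototypes['neutral'])
    d_emo = np.linalg.norm(z - prototypes['aroused'])
    return 'aroused' if d_emo / d_neu < threshold else 'neutral'
```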
[DET curves: emotional state error versus neutral state error, for audio alone and for audio-video fusion.]
Fig. 3. Increase in classification accuracy of audio-video Fusion
6 Conclusion
Our initial attempts in building an audio video emotion recognition system were presented. The task of labelling the AvID database of spontaneous emotions, which forms the basis for the evaluation of the emotion recognition system, was discussed. The issue of emotional labelling of spontaneous recordings was discussed and the agreement between the labellers was presented. The performance of our system was presented and compared to the agreement achieved between different labellers.
Acknowledgement This work was supported by the Slovenian Research Agency (ARRS), development project M2-0210 (C) entitled “AvID: Audiovisual speaker identification and emotion detection for secure communications.”
References 1. Song, M., Chen, C., You, M.: Audio-visual based emotion recognition using tripled hidden Markov model. In: Proceedings of Acoustics, Speech, and Signal Processing (ICASSP 2004), vol. 5, pp. 877–880 (2004) 2. Gajˇsek, R., et al.: Multi-Modal Emotional Database: AvID. Informatica 33, 101–106 (2009) 3. Busso, C., et al.: Analysis of emotion recognition using facial expressions, speech and multimodal information. In: ICMI 2004: Proceedings of the 6th international conference on Multimodal interfaces, pp. 205–211. ACM, New York (2004) 4. Eckman, P.: Strong Evidence for universals in facial expressions. Psychol. Bull. 115(2), 268–287 (1994) 5. Pantic, M., Rothkrantz, L.J.M.: Automatic analysis of facial expressions: the state of the art. IEEE TPAMI 22(12), 1424–1445 (2000) 6. Viola, P., Jones, M.: Robust real-time object detection. In: Proc. of the Second Intenrnational Workshop on Statistical and Computational Theories of Vision Modeling, Learning, Computing and Sampling, Vancouver, Canada (2001) 7. Ang, J., et al.: Prosody-based automatic detection of annoyance and frustration in human-computer dialog. In: Proc. ICSLP 2002, vol. 3, pp. 2037–2040 (2002) 8. Gales, M.J.F.: Maximum likelihood linear transformations for HMM-based speech recognition. Computer Speech and Language 12(2), 75–98 (1998) 9. Miheliˇc, F., et al.: Spoken language resources at LUKS of the University of Ljubljana. Int. J. of Speech Technology 6(3), 221–232 (2006)
Face Recognition Using Wireframe Model Across Facial Expressions

Zahid Riaz, Christoph Mayer, Michael Beetz, and Bernd Radig

Institut für Informatik, Technische Universität München, Boltzmannstr. 3, 85748 Garching, Germany {riaz,mayerc,beetz,radig}@in.tum.de
Abstract. This paper describes face recognition across facial expression variations. We focus on an automatic feature extraction technique which is not only efficient but also accurate for person identification. A 3D wireframe model is fitted to face images using a robust objective function. Furthermore, we extract structural and textural information, which is coupled with temporal information from the motion of local facial features. The extracted information is combined to form a feature vector descriptor for each image. This set of features has been tested on two databases for face recognition across facial expressions. We use Bayesian Networks (BN) and Binary Decision Trees (BDT) as classifiers. The developed system is automatic, real-time capable and efficient. Keywords: Face Recognition, Feature Extraction, Model Based Image Analysis, Image Classification.
1 Introduction
In future technologies, machines are an essential part of the individual's life, and we are often confronted with scenarios where humans and machines interact with each other and perform joint activities. This requires building intelligent and user-friendly machines to ensure convenient cooperation. A good example is assistive robots [1] that serve as attendant nurses to the elderly. These intelligent machines should be able to interact with humans of all categories and perform equally well with untrained and new users. Machine intelligence is measured by its ability to perceive and adapt to changes in the environment without human intervention. For this purpose, an intuitive approach is to borrow the concepts of human-human interaction and teach them to machines for comparable performance in real-world situations. Human faces play an important role in daily-life interaction. Therefore, context and identity awareness in robots improves the performance of the systems in joint activities. A system aware of person identity information can better employ user habits and store current interaction knowledge for improving future interactions. When recognizing human faces, an important factor is facial expressions, which vary from person to person and change the facial appearance drastically.
For two decades, many commercially available systems have existed to identify human faces. However, face recognition is still an outstanding challenge against different kinds of variations like facial expressions, poses, non-uniform illumination, occlusions and aging effects. Meanwhile this technology has extended its role from biometrics and security applications to Human-Robot Interaction (HRI). Person identification is one of the key tasks while interacting with robots, providing unobtrusive system security and authentication of the human interacting with the system. In such scenarios the acquired face images may contain various types of facial expressions. This problem has been addressed in this paper, resulting in a face recognition system which is robust against facial expressions. This publication focuses on one of the aspects of natural human-computer interfaces: our goal is to build a real-time system for face recognition that can robustly run in real-world environments. We develop it using model-based image interpretation techniques, which have proven their great potential to fulfill current and future demands of real-world image understanding. Our approach comprises methods that robustly localize facial features and seamlessly track them through image sequences, in order to include robustness to facial expressions in face recognition. The remainder of the paper is organized as follows. In Section 2, related work relevant to our application is discussed. In Section 3 we discuss our approach in detail. In Section 4 the extraction of higher-level features from model-based image interpretation is described, from model fitting to the face image through to feature vector formation. Section 5 discusses the evaluation of our results on the databases. Finally, we conclude our results with some future directions.
2 Related Work
Traditional recognition systems have the abilities to recognize the human using various techniques like feature based recognition, face geometry based recognition, classifier design and model based methods. In [2] the authors give a comprehensive survey of face recognition and some commercially available face recognition software. Subspace projection method like Principal Components Analysis (PCA) was firstly used by Sirvovich and Kirby [3], which were latterly adopted by M. Turk and A. Pentland introducing the famous idea of eigenfaces [4]. This paper focuses on the modeling of human face using a three dimensional model for shape model fitting, texture and temporal information extraction and then low dimensional parameters for recognition purposes. The model using shape and texture parameters is called Active Appearance Model (AAMs), introduced by Cootes et. al. [5][6]. For face recognition using AAM, Edwards et al [7] use weighted distance classifier called Mahalanobis distance. In [8] the authors used separate information for shape and gray level texture. They isolate the sources of variation by maximizing the interclass variations using discriminant analysis, similar to Linear Discriminant Analysis (LDA), the technique which was used for Fisherfaces representation [9]. Fisherface approach is similar to the eigenface approach however outperforms in the presence of
124
Z. Riaz et al.
illuminations. In [10] the authors have utilized shape and temporal features collectively to form a feature vector for facial expressions recognition. These models utilize the shape information based on a point distribution of various landmarks points marked on the face image. Blanz et al. [12] use state-of-the-art morphable model from laser scaner data for face recognition by synthesizing 3D face. This model is not as efficient as AAM but more realistic. In our approach a wireframe model known as Candide-III [13] has been utilized. In order to perform face recognition applications many researchers have applied model based approach. Riaz et al [14] apply similar features for explaining face recognition using 2D model. They use expression invariant technique for face recognition, which is also used in 3D scenarios by Bronstein et al [15] without 3D reconstruction of the faces and using geodesic distance. Park et. al. [16] apply 3D model for face recognition on videos from CMU Face in Action (FIA) database. They reconstruct a 3D model acquiring views from 2D model fitting to the images.
3 Our Approach
In this section we explain in detail the approach adopted in this paper, including model fitting, image warping and parameter extraction for shape, texture and temporal information. We use a wireframe 3D face model known as Candide-III. The model is fitted to the face image using an objective function approach. A detailed process flow for learning the objective function is shown in Figure 1.
Fig. 1. We learn objective functions to ensure an accurate model fitting [11]. This provides an accurate feature extraction.
After fitting the model to the example face image, we use the 2D projections of the 3D landmarks for texture mapping. Texture information is mapped from the example image to a reference shape, which is the mean shape of all the shapes available in the database; however, this is an arbitrary choice. The image texture is extracted using planar subdivisions of the reference and the example shapes. We use Delaunay triangulations of the distribution of our model points. Texture warping between the triangulations is performed using an affine transformation. Principal Component Analysis (PCA) is used to obtain the texture and shape parameters of the example image. This approach is similar to extracting AAM parameters. In addition to the AAM parameters, temporal features of the facial changes are also calculated. The local motion of the feature points is observed using
optical flow. We use reduced descriptors by trading off between accuracy and run-time performance. Finally, structural information is obtained by exploiting the model parameters directly. These features are then used for classification. Our approach achieves real-time performance and provides robustness against facial expressions in real-world scenarios. This computer vision task comprises the various phases shown in Figure 2.
Fig. 2. Our Approach: Sequential flow for feature extraction
4 Features Extraction
In this section, we present the extraction of three sets of features. Structural features are obtained by a model fitting approach and describe the shape of the face currently visible. Textural features are obtained by mapping the current image to a reference image. This is similar to the AAM approach. Temporal features inspect the facial motion over a short time and therefore take facial expressions into consideration.

4.1 Structural Features
Model parameters describe various properties of the modeled object, such as position or deformation. Face model parameters might consider aspects like the aspect ratio of the face or facial components, the distance of the eyes or the opening angle of the mouth. Although such face properties are influenced by various factors like facial expressions or head pose, they still contain valuable information for person identification. However, in order to determine this structural information, correct model parameters have to be estimated.
We integrate a state-of-the-art fitting algorithm that is based on machine-learning techniques. An objective function yields a comparable value that specifies how accurately a parameterized model matches an image. A fitting algorithm searches for the minimum of this objective function to determine the model parameters that best describe the face visible in the image. This approach is based on the general properties of ideal objective functions. The key idea behind the approach is that if the function used to generate training data is ideal, the function learned from the data will also be approximately ideal. A simple example of such a function computes the Euclidean distance between the correct location of a model point and a chosen location in the image. Note that the correct contour point must be specified manually, and therefore this example function is not applicable to previously unseen images, but it is applied to generate ideal training data. We annotate a set of training images with the correct contour points. For each annotation, the ideal objective function returns the minimum by definition. Further coordinate-to-value correspondences are automatically acquired by varying the contour points along the contour perpendicular and recording the value returned by the ideal objective function in the second step. The evaluation presented in this paper considers Haar-like features for objective function learning, but any other feature representation that turns out to be descriptive for the locations of the model points can be integrated as well. Finally, the calculation rules of the objective function that map the extracted image feature values onto an objective function value are learned. For an intensive evaluation of this approach we refer to [20]. The shape x is parametrized by using the mean shape x_m and the matrix of eigenvectors P_s to obtain the parameter vector b_s [17]:

x = x_m + P_s b_s    (1)

4.2 Textural Features
Once we have the structural information of the image, we extract the texture from the face region by mapping it to a reference shape. The reference shape is obtained as the mean shape over the dataset. The image texture is extracted using planar subdivisions of the reference and the example shapes. We use Delaunay triangulations for the convex hull of the facial landmarks. Texture warping between the triangles is performed using an affine transformation. The extracted texture is parametrized using PCA, with the mean texture g_m and the matrix of eigenvectors P_g giving the parameter vector b_g [17]:

g = g_m + P_g b_g    (2)

4.3 Temporal Features
Facial expressions cause large changes of the face shape in the image and therefore challenge the identification process. We tackle this challenge by explicitly considering facial expressions in the feature extraction process. Since facial expressions emerge from muscle activity, the motion of particular feature points
Table 1. Recognition results comparison

          BN       BDT
CKFED   90.66%   98.50%
MMI     90.32%   99.29%
within the face gives evidence about the facial expression. Real-time capability is important, and therefore only a small number of feature points is considered. The relative location of these points is connected to the structure of the face model and normalized by the affine transformation of the entire face in order to separate the facial motion from the rigid head motion. In order to determine robust descriptors, PCA determines the H most relevant motion patterns (principal components) visible within the set of training sequences. A linear combination of these motion patterns describes each observation approximately correctly. This reduces the number of descriptors while enforcing robustness towards outliers as well. We combine all extracted features into a single feature vector. Single-image information is considered by the structural and textural features, whereas image-sequence information is considered by the temporal features. The overall feature vector becomes:

u = (b_{s_1}, \ldots, b_{s_m}, b_{g_1}, \ldots, b_{g_n}, b_{t_1}, \ldots, b_{t_h})    (3)
where b_s, b_g and b_t are the structural, textural and temporal parameters, respectively. The face feature vector consists of the shape, texture and temporal variations, which sufficiently define the global and local variations of the face. All the subjects in the database are labeled for classification.
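The assembly of the combined descriptor u can be sketched as follows, assuming the three PCA bases (a mean vector and an eigenvector matrix each) have been learned offline as described above; the function and parameter names are illustrative.

```python
import numpy as np

def face_descriptor(shape, texture, motion, shape_pca, texture_pca, motion_pca):
    """Concatenate structural, textural and temporal parameters as in Eqn. (3).
    Each *_pca argument is a (mean, eigenvector-matrix) pair learned offline."""
    b_s = shape_pca[1].T @ (shape - shape_pca[0])        # structural parameters
    b_g = texture_pca[1].T @ (texture - texture_pca[0])  # textural parameters
    b_t = motion_pca[1].T @ (motion - motion_pca[0])     # temporal parameters
    return np.concatenate([b_s, b_g, b_t])               # feature vector u
```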
5 Experimentation Evaluation
For experimentation purposes, we benchmark our results on the Cohn Kanade Facial Expression Database (CKFED) [18] and the Man-Machine-Interaction (MMI) [19] database. Both databases contain image sequences of different persons with six standard facial expressions. The images are taken in controlled lighting and frontal poses. The database are designed for facial expressions analysis. We utilize the classifier implementation of WEKA [21] and evaluate the results with 10-fold cross validation. In order to identify persons in the presence of strong facial expressions, we use our extracted features set with Bayesian Networks (BN) and Binary Decision Tree (BDT). The recognition results are obtained in the presence of strong facial expressions but restricted to frontal face views only. However, there exists slight in-plane rotation of the head in some cases, but the features set works well in this case because all the images have been normalized to a standard shape. The recognition rate is recorded to be 90.66% and 90.32% for BN with 10-fold cross validation on CKFED and MMI respectively and improved to 98.5% and 99.29% using BDT on CKFED and MMI respectively. Table 1 shows the results comparatively. Figure 3 shows the true positive and false positive rates for subjects in our databases.
Fig. 3. True positive and false positive rates for the CKFED and MMI databases
6 Conclusions
This paper describes a feature extraction technique for a face recognition application which is robust against facial deformations. We exploit a 3D model-based approach for extracting three types of features. The shape and textural features define the necessary person identity information, whereas the additionally calculated temporal parameters sufficiently define local facial feature variations for training our classifier for face recognition. Although the extracted features are highly applicable for face recognition in the presence of facial expressions, the quality of the algorithm can be further improved by considering the natural facial expressions which arise in real-world environments. Future work includes applying the extracted features to facial expression recognition and gender classification.
References 1. Beetz, M., et al.: The Assistive Kitchen — A Demonstration Scenario for Cognitive Technical Systems. In: Proceedings of the 4th COE Workshop on Human Adaptive Mechatronics, HAM (2007) 2. Zhao, W., Chellapa, R., Rosenfeld, A., Philips, P.J.: Face Recognition: A Literature Survey. ACM Computing Surveys 35(4), 399–458 (2003) 3. Sirovich, L., Kirby, M.: Low-dimensional procedure for the characterization of human faces. J. Opt. Soc. Am. A 4(3), 519–524 (1987) 4. Turk, M.A., Pentland, A.P.: Eigenfaces for Recognition. Journal of Cognitive Neuroscience 3(1), 71–86 (1991) 5. Edwards, G.J., Taylor, C.J., Cootes, T.F.: Interpreting Face Images using Active Appearance Models. In: Proceedings of International Conference on Automatic Face and Gesture Recognition, pp. 300–305 (1998)
6. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active Appearance Models. In: Proceedings of European Conference on Computer Vision, vol. 2, pp. 484–498. Springer, Heidelberg (1998) 7. Edwards, G.J., Cootes, T.F., Taylor, C.J.: Face Recognition using Active Appearance Models. In: Proceeding of European Conference on Computer Vision, vol. 2, pp. 581–695. Springer, Heidelberg (1998) 8. Edwards, G.J., Lanitis, A., Taylor, C.J., Cootes, T.: Statistical Models of Face Images: Improving Specificity. In: British Machine Vision Conference, Edinburgh, UK (1996) 9. Belheumeur, P.N., Hespanha, J.P., Kreigman, D.J.: Eigenfaces vs Fisherfaces: Recognition using Class Specific Linear Projection. IEEE Transaction on Pattern Analysis and Machine Intelligence 19(7) (July 1997) 10. Wimmer, M., Riaz, Z., Mayer, C., Radig, B.: Recognizing Facial Expressions Using Model-Based Image Interpretation. Advances in Human-Computer Interaction 11. Wimmer, M., Stulp, F., Tschechne, S., Radig, B.: Learning Robust Objective Functions for Model Fitting in Image Understanding Applications. In: Proceedings of the 17th British Machine Vision Conference, BMVA, Edinburgh, UK, pp. 1159– 1168 (2006) 12. Blanz, V., Vetter, T.: Face Recognition Based on Fitting a 3D Morphable Model. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(9), 1063–1074 (2003) 13. Ahlberg, J.: An Experiment on 3D Face Model Adaptation using the Active Appearance Algorithm. Image Coding Group, Deptt. of Electric Engineering Link¨ oping University 14. Riaz, Z., et al.: A Model Based Approach for Expression Invariant Face Recognitio. In: 3rd International Conference on Biometrics, Italy (June 2009) 15. Bronstein, A., Bronstein, M., Kimmel, R., Spira, A.: 3D face recognition without facial surface reconstruction. In: Proceedings of European Conference of Computer Vision Prague, Czech Republic, May 11-14 (2004) 16. Park, U., Jain, A.K.: 3D Model-Based Face Recognition in Video. In: 2nd International Conference on Biometrics, Seoul, Korea (2007) 17. Li, S.Z., Jain, A.K.: Handbook of Face recognition. Springer, Heidelberg (2005) 18. Kanade, T., Cohn, J.F., Tian, Y.: Comprehensive database for facial expression analysis In. In: Proceedings of Fourth IEEE International Conference on Automatic Face and Gesture Recognition (FGR 2000), Grenoble, France, pp. 46–53 (2000) 19. The MMI Facial Expression Database, www.mmifacedb.com 20. Wimmer, M., Stulp, F., Pietzsch, S., Radig, B.: Learning local objective functions for robust face model fitting. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI 2007) (2007) 21. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
Modeling Gait Using CPG (Central Pattern Generator) and Neural Network Arabneydi Jalal, Moshiri Behzad, and Bahrami Fariba
Abstract. In this study, we utilize CPG (Central Pattern Generator) concept in modeling a bipedal gait. For simplicity, only lower extremity body of a biped is considered and modeled. Actually, gait is a result of a locomotor which is inherent in our bodies. In other words, the locomotor applies appropriate torques to joints to move our bodies and generate gait cycles. Consequently, to overcome the gait modeling problem, we should know structure of locomotor and how it works. Actually, each locomotor mainly consists of two parts: path planning and controlling parts. Task of path planning part is to generate appropriate trajectories of joint angles in order to walk properly. We use CPG to generate these proper trajectories. Our CPG is a combination of several oscillators because of the fact that gait is a periodic or semi-periodic movement and it can be represented as sinusoidal oscillators using Fourier transform. Second part is to design a controller for tracking above-mentioned trajectories. We utilize Neural Networks (NNs) as controllers which can learn inverse model of the biped. In comparison with traditional PDs, NNs have some benefits such as: nonlinearity and adjusting weights is so much faster, easier and fully automatically. Lastly, to do this, someone is asked to walk on a treadmill. Trajectories are recorded and collected by six cameras and CPG can then be computed by Fourier transform. Next, Neural Networks will be trained in order to use as controllers. Keywords: Gait, Locomotion, CPG (Central Pattern Generator), Biped robot, Control, Neural Network.
1 Introduction

The biped locomotion research field has been pretty active over the past ten years and has produced many different approaches to make robots walk. But the ability of robots to walk in an unknown environment is still very limited at present. That is why in this project we are using a bio-inspired approach. We are convinced that this approach has many advantages over the traditional methods tried up to now. But to explain why, it is crucial to first understand these traditional methods and their limitations. The biped locomotion problem is really difficult to solve. Many different approaches have been tried during the past twenty years, but even the better solutions found at this time are totally inefficient compared to the natural ability of a human being to walk. Of course Moore's law helps, and today many robots are able to walk or even to climb stairs. But still, all these robots are lost if their environment suddenly becomes
different from what the programmer expected. A bump on the ground or a little slope can easily make a robot fall that was walking very well on perfectly flat ground. So the first difficulty in biped locomotion comes from the environment. The second most challenging difficulty comes from the robot itself: a particular solution will only work on the robot it has been designed for. The locomotion and the robot's model are directly connected and cannot be separated. This fact alone makes the problem really difficult, because all the algorithms are targeted and fine-tuned for a particular robot. That explains why most of the approaches used today to make robots walk are built on the exact robot's model. This project completely differs from these traditional approaches.

1.1 The Different Approaches Commonly Used

In order to provide an insight and understand the importance of this approach, it is worthwhile to take a look at the common methods and review their advantages and disadvantages.

1.1.1 Trajectory Based Methods

The main idea here is to generate a set of predefined kinematic trajectories. Then, by applying the dynamic equations of the robot to these trajectories, it is possible to test the stability of the locomotion. But these methods do not provide any methodology to design a controller; they are only a method to prove the stability of the motion. Therefore, the trajectories have to be designed by trial-and-error or from human recordings.

The Zero Moment Point method

The most successful approach of this kind is called the Zero Moment Point (ZMP) [3]. This method is to dynamic walking what the center of gravity method is to static walking. The dynamic equations are used to compute the Zero Moment Point, which is the point on the ground where the moment of the inertial forces and the gravity forces has no component along the horizontal planes. This point is also called the Center of Pressure (CoP). So, for the locomotion to be stable, the ZMP must always be located within the robot's stability shape. This method works well and is used on many well-known robots like the Honda robot [5] or Sony's brand new QRIO. But the stability is ensured only in perfectly known situations. To deal with perturbations, an external system has to adjust the position of the ZMP online. Typically the correction is made by adapting the hip and torso motion of the robot.

The advantages
1) Well defined methodology to prove stability.
2) Well suited for expensive robots that should never fall.
3) Easy to implement in real robots.
4) Can deal with dynamic walking.
The drawbacks
1) Requires a perfect knowledge of the robot's model and its dynamic.
2) External online control is needed to deal with perturbations.
3) Difficult to know where to resume the motion after a fall recovery.
4) No way to continue walking if a part of the robot breaks.
5) ZMP proves the stability but furnishes no methodology to find the good trajectories.
6) Difficult to respect the robot's natural dynamic.

1.1.2 Heuristic Based Methods

The major problem with the ZMP method is the complexity of the equations used to compute the robot's dynamic. With such a level of complexity directly exposed to the robot's programmer, it becomes very difficult to have a global view over the robot's behavior. It is why heuristics were developed to hide the complexity of the problem and make it more accessible.

Virtual Model Control

Among all the approaches using heuristic control algorithms, the most successful one might be Virtual Model Control (VMC). Developed by Jerry E. Pratt [1] at the MIT leg laboratory, VMC relies on virtual components like springs and dampers that are used to control the position, the equilibrium and the speed of the robot. The virtual forces created by these elements are then mapped to physical torques at each of the robot's joints. Typically this is done by computing the transpose of the Jacobian relating the two attachment frames of the virtual element. Therefore it creates the illusion that the simulated components are connected to the real robot. This allows fast and relatively simple online control. But of course this method requires a perfect knowledge of the robot's dynamic. With this heuristic, designing the controller becomes pretty easy. First some virtual elements must be placed to maintain an upright posture and to ensure stability. The springs maintain the robot in a standing position and dampers are used to prevent oscillations. There is now one last problem to solve to make VMC efficient. The biped locomotion can be decomposed into many different stages, each of them having its own dynamic (e.g. one stage for the two-feet support phase and another one for the single-foot support phase). So for each stage different virtual elements are required. It is why a finite state machine is needed to cycle through the various stages of the gait. This method has been successfully applied on the MIT Spring Turkey and Spring Flamingo. Compared to the trajectory based methods, VMC is less sensitive to external perturbations and unknown terrain, since these can to some extent be compensated by the virtual elements.

The advantages
1) Provides a real methodology to describe controllers.
2) Intuitive way of describing the controller.
Modeling Gait Using CPG (Central Pattern Generator) and Neural Network
3) 4) 5) 6)
133
The difficulty of the problem is hidden behind the virtual elements. Robust against small perturbation. Does not need an exact model of the environment. The control is fast and can be efficiently done online.
The drawbacks:
1) Requires an exact knowledge of the robot's model.
2) Due to mathematical singularities, the mapping from virtual elements to joint torques is not always possible.
3) Can be harmful for the robot, because the virtual elements can generate high torques on the joints.
4) The finite state machine makes the transitions between walking phases very abrupt.
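As mentioned above, the core of VMC is the Jacobian-transpose mapping from a virtual force to joint torques, tau = J^T F. The sketch below (our own illustration; the link lengths, gains and states are made-up example values, not parameters of the robots cited in the text) applies a virtual spring-damper between the hip of a planar two-link leg and a desired set point.

```python
import numpy as np

# Planar two-link leg: joint angles q, link lengths l1, l2 (example values).
l1, l2 = 0.4, 0.4

def hip_position(q):
    """Forward kinematics of the hip (end point of the two-link chain)."""
    return np.array([l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1]),
                     l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])])

def jacobian(q):
    """Jacobian of the hip position with respect to the joint angles."""
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])

def virtual_spring_damper_torques(q, q_dot, target, k=300.0, b=30.0):
    """Map a virtual spring-damper force at the hip to joint torques: tau = J^T F."""
    x = hip_position(q)
    J = jacobian(q)
    x_dot = J @ q_dot                    # hip velocity from joint velocities
    F = k * (target - x) - b * x_dot     # virtual spring-damper force
    return J.T @ F                       # Jacobian-transpose mapping

q = np.array([np.pi / 3, -np.pi / 4])    # example joint configuration
q_dot = np.array([0.1, -0.2])
tau = virtual_spring_damper_torques(q, q_dot, target=np.array([0.2, 0.7]))
print("joint torques:", tau)
```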
2 Main
2.1 Data Collection
Gait trajectories are required to build the CPG. In other words, as discussed later, we need trajectories as a first step; we then extract the main frequencies from them, and finally the CPG is represented by its Fourier components. There are two ways to generate trajectories. One of them employs mathematical models; these approaches calculate the trajectories based on criteria such as minimizing the energy consumption during the gait [5]. The other way is to use actual data from human motion. The latter is more reliable and is the one recommended here. As shown in Figure 1, an individual is asked to walk on a treadmill at a constant speed. The treadmill is surrounded by six cameras, which detect, collect and record the positions of reflective markers. We use the Visual3D software, a view of which is shown on the left side of Figure 1. Apart from visualizing the trajectories, this software enables us to fit a skeleton body to them.
Fig. 1. Cameras detect, collect and record the positions of the markers
Fig. 2. Limit cycle behavior of the joint angles. The left plot shows the limit cycle of the right and left hips; the right plot shows the joint behavior of the right hip and left knee.
2.2 CPG (Central Pattern Generator)
Central pattern generators (CPGs) are neural circuits found in both invertebrate and vertebrate animals that can produce rhythmic patterns of neural activity without receiving rhythmic inputs. The term central indicates that sensory feedback (from the peripheral nervous system) is not needed for generating the rhythms [2]. CPGs are known to exist in animals, and their task is to generate rhythmic movements almost independently of the central nervous system. There are many CPG formulations, but the best suited to robotic applications is the one proposed by A.J. Ijspeert [4]. Since gait is a periodic or semi-periodic movement, it can be represented by sinusoidal oscillators. As a first step, the main frequencies are extracted from the trajectories. Next, we apply distance and phase constraints to these oscillators. The distance constraints reflect the fact that the distance between two joints is always constant, while, because gait is a symmetric motion, we impose a phase difference between the two oscillators corresponding to symmetric joints. Figure 3 illustrates these constraints; a minimal sketch of such coupled oscillators is given after the figure caption.
Fig. 3. The values on the arrows correspond to the phase differences we imposed between the different oscillators
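The following sketch shows, in spirit, how a set of phase-coupled sinusoidal oscillators can generate phase-locked joint trajectories. It is our own minimal illustration: the frequency, amplitudes, phase offsets and coupling gain are placeholder values, not the ones extracted from the motion-capture data or the formulation of [4].

```python
import numpy as np

def run_cpg(freq_hz, amplitudes, phase_offsets, coupling=5.0, dt=0.001, T=4.0):
    """Phase-coupled sinusoidal oscillators, one per joint.

    Each oscillator runs at the common frequency and is pulled toward a fixed
    phase offset relative to oscillator 0, so the joints stay phase-locked
    (e.g. left and right hips half a cycle apart).
    """
    rng = np.random.default_rng(0)
    n = len(amplitudes)
    omega = 2.0 * np.pi * freq_hz
    phases = rng.uniform(0, 2 * np.pi, n)        # arbitrary initial phases
    steps = int(T / dt)
    out = np.zeros((steps, n))
    for k in range(steps):
        for i in range(n):
            # phase dynamics: own frequency + coupling toward the desired offset
            err = (phases[0] + phase_offsets[i]) - phases[i]
            phases[i] += dt * (omega + coupling * np.sin(err))
        out[k] = amplitudes * np.cos(phases)     # joint-angle set points
    return out

# Example: hips in anti-phase, a knee lagging its hip by a quarter cycle.
traj = run_cpg(freq_hz=1.0,
               amplitudes=np.array([0.4, 0.4, 0.6]),
               phase_offsets=np.array([0.0, np.pi, np.pi / 2]))
print(traj.shape)   # (4000, 3) joint-angle samples
```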
The major drawback of the two previous methods is that they both require an exact knowledge of the robot's model. For this project we wanted to use a less specific approach, based not on the robot's model but on its behavior. The advantages and drawbacks of this method are summarized here:
The advantages
1) Does not require an exact knowledge of the robot's model.
2) The CPG recovers smoothly and quickly after a perturbation.
3) The method has already been successfully applied to simulate the locomotion of simple vertebrates.
4) No complicated control input is required; the CPG spontaneously generates the locomotion patterns.
5) The control is no longer centralized in a black box connected to the robot; instead, the CPG is distributed over the whole robot's body, like a virtual spinal cord.
6) It can independently generate new trajectories for various speeds, without needing to be trained again.
The drawbacks
1) The number of parameters needed to describe the CPG is large.
2) It is difficult to know how feedback should be introduced into the CPG.
3) It is difficult to predict how the robot will behave in an unknown situation.
4) Stability is not guaranteed as rigorously as with the mathematical approaches.
Fig. 4. A sample trajectory of the knee joint. As can be seen, the real data contain noise and some undesired frequencies; the CPG removes them and extracts a simple rhythm from the real trajectory, which is complex and noisy.
2.3 Neural Network
Unfortunately, many methods only generate trajectories and do not propose a way to obtain the torques needed to track them. As a matter of fact,
Fig. 5. This diagram presents the idea of our work. First, the desired speed is set by the CNS (Central Nervous System). The CNS commands the lower level (the CPG) to walk at the specified velocity, and the CPG generates the trajectories needed to reach the desired speed. Then, the neural network applies the proper torques to the joints in order to walk at the desired velocity.
the inputs of the robot's joints are torques, not angle trajectories. In other words, we must apply torques to the joints to track the desired trajectories produced by the CPG. A PD (proportional-derivative) controller is commonly used in this situation, but it has some disadvantages that led us to use a neural network instead:
1) Finding the proper gains is a very time-consuming procedure.
2) There is a structural dependency between the joints. Consequently, it is not easy to set the gains of one joint while its movement depends on the movements of the other joints.
To address these issues, we employ a neural network that learns the inverse model of the biped. It has some advantages over the traditional PD controller: first, it is nonlinear; second, adjusting the weights is much faster, easier and fully automatic.
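To illustrate the idea of learning an inverse model instead of hand-tuning PD gains, the sketch below trains a small one-hidden-layer network to map the state and desired acceleration of a single pendulum-like joint to the torque that produces it. The dynamics, network size and training settings are our own illustrative assumptions, not the biped model or network actually used in the paper, and the fit is only approximate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy inverse dynamics of one joint: tau = I*qdd + b*qd + m*g*l*sin(q).
# All physical constants are made-up example values.
I, b, m, g, l = 0.05, 0.1, 1.0, 9.81, 0.3
def true_torque(q, qd, qdd):
    return I * qdd + b * qd + m * g * l * np.sin(q)

# Training data: random joint states and desired accelerations.
N = 2000
lim = np.array([1.5, 3.0, 10.0])
S = rng.uniform(-lim, lim, size=(N, 3))          # columns: q, qd, qdd
y = true_torque(S[:, 0], S[:, 1], S[:, 2]).reshape(-1, 1)
X = S / lim                                      # normalize inputs to [-1, 1]

# One-hidden-layer network trained with plain batch gradient descent.
H, lr = 32, 0.05
W1 = rng.normal(0, 0.5, (3, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.5, (H, 1)); b2 = np.zeros(1)
for epoch in range(5000):
    h = np.tanh(X @ W1 + b1)                     # hidden activations
    pred = h @ W2 + b2                           # predicted torque
    err = pred - y                               # gradient of 0.5 * MSE
    gW2 = h.T @ err / N;  gb2 = err.mean(0)
    dh = (err @ W2.T) * (1.0 - h ** 2)           # backprop through tanh
    gW1 = X.T @ dh / N;   gb1 = dh.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

s = np.array([[0.3, 0.5, 2.0]])
pred = np.tanh((s / lim) @ W1 + b1) @ W2 + b2
print("predicted torque:", pred[0, 0], "  true torque:", true_torque(*s[0]))
```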
3 Conclusion
In this paper we presented a procedure for making a robot walk. As reviewed above, there are many methods in the gait field, and we argued why we chose a CPG for our purpose by listing its benefits. To avoid the disadvantages of a PD controller, we employed a neural network as the controller. Experimental data were collected with cameras, and a CPG was fitted to the recorded trajectories. Finally, the CPG was applied to the model built in MATLAB (Simulink), and the neural network was then trained. As the simulation of the bipedal robot model shows, this approach makes the robot walk well.
References
1. Pratt, J., Dilworth, P., Pratt, G.: Virtual model control of a bipedal walking robot. In: Proceedings of the IEEE International Conference on Robotics & Automation, pp. 193–198 (1997)
2. Ijspeert, A.J.: Central pattern generators for locomotion control in animals and robots: a review. Neural Networks 21(4), 642–653 (2008)
3. Vukobratovic, M., Borovac, B., Surla, D., Stokic, D.: Biped Locomotion: Dynamics, Stability, Control and Applications. Springer, Heidelberg (1990)
4. Righetti, L., Ijspeert, A.J.: Programmable central pattern generators: an application to biped locomotion control. In: Proc. of the 2006 IEEE International Conference on Robotics and Automation (2006)
5. Ren, L., Jones, R.K., Howard, D.: Predictive modelling of human walking over a complete gait cycle. Journal of Biomechanics 40, 1567–1574 (2007)
6. Mojon, S.: Using nonlinear oscillators to control the locomotion of a simulated biped robot. Diploma Thesis, EPFL / Computer Science (February 2004)
7. Chan, C.Y.A.: Dynamic Modeling, Control and Simulation of a Planar 5-Link Bipedal Walking System. M.Sc. Thesis, Department of Mechanical and Industrial Engineering, University of Manitoba (2000)
8. Huang, Q., Yokoi, K., Kajita, S., Kaneko, K., Arai, H., Koyachi, N., Tanie, K.: Planning Walking Pattern for a Biped Robot. IEEE Trans. on Robotics and Automation 17(3), 280–289 (2001)
Fusion of Movement Specific Human Identification Experts
Nikolaos Gkalelis 1,2, Anastasios Tefas 2, and Ioannis Pitas 1,2
1 Informatics and Telematics Institute, CERTH, Greece
2 Department of Informatics, Aristotle University of Thessaloniki, Greece
Abstract. In this paper a multi-modal method for human identification is proposed that exploits the discriminant features derived from several movement types performed by the same human. Utilizing a fuzzy vector quantization (FVQ) and linear discriminant analysis (LDA) based algorithm, an unknown movement is first classified, and then the person performing the movement is recognized by a movement specific person recognition expert. When the unknown person performs more than one movement, a multi-modal algorithm combines the scores of the individual experts to yield the final decision on the identity of the unknown human. Using a publicly available database, we provide promising results regarding the human identification strength of movement specific experts, and indicate that the combination of the experts' outputs increases the robustness of the human recognition algorithm. Keywords: Multi-modal human identification, movement recognition, movement specific person recognition expert, fuzzy vector quantization, linear discriminant analysis.
1 Introduction
Identification of humans from video sources using gait has recently attracted increasing attention in several application domains, e.g., video surveillance, content-based video annotation and retrieval, and other applications, as this technology is the only one offering non-invasive, unobtrusive human identification [1,2,3,4,6]. However, the vast majority of researchers in this topic consider only walk as a human biometric, while only very few works have been reported that utilize run as a second gait biometric [7]. To the best of our knowledge, identification of humans from other gait types or, in general, other human movements different from walk or run is still an unexplored topic. Exploiting more than one movement for the task of human identification may be realistic and beneficial in many applications for several reasons. First of all, some humans may not be considerably different from others in the way they walk, but rather in the way they perform another movement, e.g., skip or jump. Moreover, in many applications the human that should be identified may perform more than one movement, and walking may not even be included. For instance, a thief captured by a hidden camera during a pursuit
may be depicted running, jumping, or performing some other movement. Similarly, in a video retrieval application, it may be necessary to semantically label and retrieve videos of a specific actor performing a specific movement, not necessarily walking. To allow the use of movement-based person classifiers, the different movements contained in a test video should first be extracted and recognized. Currently, many promising movement recognition algorithms have been proposed [8], a development which can considerably advance, as well as advocate, the use of movement specific person classifiers and their combination for the task of human identification. Motivated by the above discussion, we propose the use of a number of human identification experts, each of them trained to recognize a human from a specific movement type, and exploit a fusion algorithm at the score level to provide a robust human identification system. The components of this system are presented in Section 2, while various experimental results regarding the discrimination power of the individual classifiers as well as of the overall identification system are presented in Section 3. Finally, conclusions regarding the proposed approach are given in Section 4.
2 Proposed Method
In video-based biometric systems a movement is represented by the so-called movement video, i.e., a video that depicts a person performing a single period of a movement, e.g., a step of walk. Consequently, a movement video is a sequence of frames, where each frame depicts a unique posture of the movement. A basic requirement of our system is that the binary body mask at each frame has been retrieved. This requirement can be relatively easily satisfied in cases of a static/constant background. From the body masks, the body regions of interest (ROIs) are extracted, centered with respect to the centroid of the bodies along the whole movement sequence, and scaled to the same dimension using bicubic interpolation. A ROI is scanned column-wise to produce the respective vector x ∈ ℝ^F, where F is the number of pixels in the ROI; thus, a movement video is represented by a sequence of such vectors. We model the different movements and design movement specific human recognition experts using a method that combines FVQ and LDA (FVQLDA). Finally, we combine the outputs of the experts in order to provide a robust estimate of the identity of the human depicted in a test video. We briefly review FVQLDA and describe the fusion strategy in the next subsections.
2.1 FVQ Plus LDA for Video Classification
Let U be a training database of videos {x_{i,j}} belonging to one of R different classes, {{x_{i,j}}, y_i ∈ {1, . . . , R}}, where x_{i,j} ∈ ℝ^F is the vector derived by scanning raster-wise the j-th frame of the i-th video, and y_i is its class label. The task of a classification algorithm is to use the above database to construct a classifier such that, given a test video {s_{t,j}}, its class label u_t is identified.
The intrinsic dimensionality of the biometric data is much lower than the dimension of the data in the image space ℝ^F, and a dimensionality reduction method is commonly used to avoid the high computational cost of classification in the image space. Fisher's linear discriminant analysis (FLDA) [5], one of the most popular subspace techniques, cannot be directly used in most video-based recognition algorithms, as the number of frames in the training database is usually much smaller than the dimension of the data in the image space. To alleviate this problem several LDA variants have been proposed [10], which are usually computationally expensive. Moreover, a computationally demanding distance metric, e.g., the Hausdorff distance [4,12], is commonly used to compare non-aligned video sequences. In [9] a method that combines FVQ and LDA has been used for movement recognition, which addresses the above issues and provides a computationally efficient algorithm. In this method, the labelling of the data is initially ignored, and the FCM algorithm is used to provide C centroid vectors {v_1, . . . , v_C}. Then, FVQ is applied to compute a membership vector for each frame, x_{i,j} → φ_{i,j} ∈ ℝ^C, φ_{i,j} = [φ_{i,j,c}], where the c-th component of the membership vector is given by

$$\phi_{i,j,c} = \frac{\left(\lVert x_{i,j} - v_c \rVert_2^2\right)^{\frac{1}{1-m}}}{\sum_{\ell=1}^{C}\left(\lVert x_{i,j} - v_\ell \rVert_2^2\right)^{\frac{1}{1-m}}}, \qquad (1)$$
and, consequently, the i-th video with L_i frames is represented by the arithmetic mean of the membership vectors derived from its frames,

$$s_i = \frac{1}{L_i}\sum_{j=1}^{L_i}\phi_{i,j}. \qquad (2)$$
In the C-dimensional space conventional LDA can be applied to further reduce the dimensionality, as in most cases the dimensionality of the membership vectors will be smaller than the number of training videos. Thus, the final representation of the video will be y_i = W^T s_i, where W ∈ ℝ^{C×(R−1)} is the projection matrix computed with LDA. The r-th class with cardinality N_r can then be represented by the mean of all its feature vectors,

$$\zeta_r = \frac{1}{N_r}\sum_{y_i \in U_r} y_i. \qquad (3)$$
Assuming equiprobable priors as well as a common covariance matrix Σ for all classes, the feature vector z_t of a test movement video is first retrieved and then R Mahalanobis distance values are computed,

$$g_r(z_t) = (z_t - \zeta_r)^T \Sigma^{-1} (z_t - \zeta_r), \quad r = 1, \ldots, R. \qquad (4)$$

The test video is assigned to the class according to the following rule:

$$u_t = y(z_t) = \arg\min_{r \in \{1,\ldots,R\}} g_r(z_t). \qquad (5)$$
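As a compact illustration of Eqs. (1)-(5), the sketch below computes the fuzzy membership vectors of the frames, averages them into a video descriptor, applies the LDA projection and assigns the nearest class under the Mahalanobis distance. All inputs here are random placeholders: in practice the codebook would come from FCM, W from LDA, and the class means and covariance from the training set.

```python
import numpy as np

def fuzzy_memberships(frames, codebook, m=1.1):
    """Eq. (1): fuzzy membership of each frame to each codebook vector."""
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (n_frames, C)
    d2 = np.maximum(d2, 1e-12)                  # avoid division by zero
    u = d2 ** (1.0 / (1.0 - m))
    return u / u.sum(axis=1, keepdims=True)

def video_descriptor(frames, codebook, W, m=1.1):
    """Eqs. (1)-(2) followed by the LDA projection y = W^T s."""
    s = fuzzy_memberships(frames, codebook, m).mean(axis=0)
    return W.T @ s

def classify(z, class_means, Sigma_inv):
    """Eqs. (4)-(5): nearest class mean under the Mahalanobis distance."""
    g = [(z - zeta) @ Sigma_inv @ (z - zeta) for zeta in class_means]
    return int(np.argmin(g))

# Placeholder data: 30 frames of 3072-dim vectors, C = 20 dynemes, R = 7 classes.
rng = np.random.default_rng(1)
frames = rng.random((30, 3072))
codebook = rng.random((20, 3072))               # would come from FCM in practice
W = rng.random((20, 6))                         # would come from LDA (R - 1 = 6)
class_means = [rng.random(6) for _ in range(7)] # would be the training-class means
Sigma_inv = np.eye(6)                           # common covariance assumed identity here

z = video_descriptor(frames, codebook, W)
print("assigned class:", classify(z, class_means, Sigma_inv))
```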
The number of dynemes C and the fuzzification parameter m are initially not known. The LOOCV procedure is combined with the global-to-local search strategy, e.g., similar to [10], in order to identify the optimal parameters C and m that best discriminate the R classes.
2.2 Fusion of Movement Specific Human Recognition Experts
Let U be an annotated movement video database that contains P persons and R movement types, {{x_{i,j}}, u_i ∈ {1, . . . , R}, k_i ∈ {1, . . . , P}}, i.e., each movement video has two labels, u_i and k_i, according to the movement type and the person it belongs to, respectively. Our target is to devise an algorithm that recognizes a person using a number of movement videos, where each movement video depicts the same person performing one of the R different movement types. Using all the movement videos of the database and utilizing only the movement type labelling information u_i, the FVQLDA method is used to train a movement type classifier y(). Then, we break the database into R distinct subsets U_r, r = 1, . . . , R, i.e., subset U_r contains only movement videos of the r-th movement type, e.g. of the movement walk. Each subset is then used to train a movement specific person recognition expert h_r(). The training of each expert is again done using FVQLDA, where now only the person specific labelling information k_i is exploited. At the testing phase, it is assumed that R movement videos, each of them depicting the same unknown person performing a different movement type, are available. This can be obtained, for example, by temporally segmenting a test video that depicts the same person performing the R different movements sequentially. Each movement video is classified by the movement classifier y() and routed to the respective expert h_r(). Therefore, for each expert h_r() a feature vector z_t^r is computed, while all feature vectors z_t^r, r = 1, . . . , R, possess the same but unknown label k_t. The training and test videos of the movement specific person recognition experts are different, and, thus, it can be assumed that the feature vectors of the test movement videos are conditionally statistically independent. In this case, the sum rule proposed in [11] can be applied to fuse the matching scores produced by each expert, as we explain below. Each expert h_r(), according to equation (4), produces P Mahalanobis distance values, g_{r,p} = g_{r,p}(z_t^r), p = 1, . . . , P, each referring to one of the P persons in the database. The distance values are transformed to matching scores by taking their reciprocal, and then normalized to produce an estimate of the a posteriori probability that the test video belongs to the class ω_p given the feature vector z_t^r:

$$P(\omega_p \mid z_t^r) = \frac{1/g_{r,p}}{\sum_{i=1}^{P} 1/g_{r,i}}, \qquad (6)$$
where ωp is the class representing the p-th person. Then, assuming that the posterior probabilities do not deviate dramatically from the prior probabilities, the sum rule can be applied to yield the identity kt of the person in the test video
Fig. 1. Recognition of humans from their movements
$$k_t = \arg\max_{p \in \{1,\ldots,P\}} \frac{1}{R}\sum_{r=1}^{R} P(\omega_p \mid z_t^r). \qquad (7)$$
The algorithm is summarized in Figure 1. We should note that the same algorithm can be applied in the case that the test video depicts the same person performing a fraction of the R movement types and not necessarily all of them.
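The following sketch illustrates the score fusion of Eqs. (6)-(7): distances are inverted into matching scores, normalized per expert, and averaged before taking the argmax. The distance values are made-up toy numbers, not results from the database used in the paper.

```python
import numpy as np

def fuse_sum_rule(distances):
    """Eqs. (6)-(7): fuse the Mahalanobis distances of R experts.

    distances: array of shape (R, P); distances[r, p] is g_{r,p}, the
    distance produced by expert r for person p.
    Returns the index of the identified person and the fused scores.
    """
    scores = 1.0 / distances                           # reciprocal matching scores
    post = scores / scores.sum(axis=1, keepdims=True)  # eq. (6), per-expert posteriors
    fused = post.mean(axis=0)                          # eq. (7), sum rule scaled by 1/R
    return int(np.argmax(fused)), fused

# Toy example with R = 3 experts and P = 4 persons: person 1 is closest
# (smallest distance) for two of the three experts.
g = np.array([[2.1, 0.6, 1.9, 2.4],
              [1.8, 0.9, 2.2, 2.0],
              [1.5, 1.7, 0.8, 2.3]])
identity, fused = fuse_sum_rule(g)
print("fused scores:", np.round(fused, 3), "-> identified person:", identity)
```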
3 Experimental Results
In this section we present experimental results on the database reported in [12]. From this database we used low resolution videos (180 × 144 pixels, 25 fps), containing nine persons, namely, Daria (dar), Denis (den), Eli (eli), Ido (ido), Ira (ira), Lena (len), Lyova (lyo), Moshe (mos), Shahar (sha), and seven movements, i.e., walk (wk), run (rn), skip (sp), gallop sideways (sd), jump jack (jk), jump forward (jf) and jump in place (jp). Some videos depict a person performing more than one cycle of a specific movement, e.g., the videos of walk. We break such videos into their constituent single-period movement videos to create a database of 193 movement videos. Each video is labelled according to the person and movement it belongs to, and preprocessed as described at the beginning of Section 2 to yield 3072-dimensional vector sequences, where the vectors are formed by scanning raster-wise 64 × 48 pixel ROIs.
3.1 Human Recognition from a Single Movement
In order to assess the human characterization ability of each movement type in the database, we created seven disjoint datasets, one for each movement type, and then applied the procedure described in Section 2.1 to train seven movement specific human recognition experts, h_wk(), h_rn(), . . . , h_jp(). During the design of the experts we found that the optimal range for the fuzzification parameter was m ∈ [1.1, 1.2], while the optimum number of dynemes varied depending on the movement type, i.e., from C = 20 for jump forward to C = 49 for run. The correct classification rate (CCR) for each expert is shown in Table 1, while the confusion matrix of the h_sp() expert is shown in Table 2. Surprisingly, the worst CCRs were given by the experts based on the movements of walk and jack, while a CCR above 90% was obtained using the experts based on the movements of skip, run and jump in place.

Table 1. CCR for each movement specific person classifier
Classifier  wk  rn  sp  sd  jk  jf  jp
CCR (%)     78  92  93  81  77  89  92

Table 2. Identification of nine persons from the way they skip
dar den eli ido ira len lyo mos sha dar 4 den 3 eli 3 ido 2 ira 2 len 7 lyo 2 mos 3 sha 3
3.2 Human Recognition from Multiple Movements
The experts computed above can be combined using the framework presented in section 2.2. To evaluate this algorithm we performed 26 experiments. At each experiment we removed from the database five to seven movement videos to form a test case for the fusion algorithm, i.e., each video depicted the same person performing one different movement from the movements in the database. The movement types in the test videos were recognized correctly, and the respective test videos were routed correctly to the corresponding experts. For each experiment, the score values derived from the experts were fused using the sum rule. In some cases, one or more experts misclassified a test person. However, using
Table 3. Evaluation of the fusion algorithm

        dar     den     eli     ido     ira     len     lyo     mos     sha     median
dar     4.6455  0.2932  0.2607  0.2517  0.3040  0.3308  0.2929  0.3101  0.3112  0.3040
den     0.3168  3.5587  0.2700  0.2538  0.3934  0.3162  0.2845  0.3275  0.2791  0.3162
eli     0.4148  0.3387  2.1828  0.2670  0.4354  0.3452  0.3402  0.3623  0.3137  0.3452
ido     0.3757  0.3379  0.3409  1.6952  0.3946  0.4051  0.3020  0.7964  0.3523  0.3757
ira     0.3596  0.4019  0.3705  0.2694  4.2343  0.4130  0.2849  0.3271  0.3392  0.3596
len     0.2538  0.2142  0.2123  0.2583  0.2752  4.1482  0.2197  0.2285  1.1898  0.2538
lyo     0.3646  0.4095  0.3263  0.3211  0.4245  1.0701  3.3649  0.3598  0.3591  0.3646
mos     0.5434  0.5527  0.5500  0.4781  0.6905  0.5825  0.5085  2.5942  0.5000  0.5500
sha     0.5784  0.5373  0.3941  0.4598  0.4764  0.5784  0.4240  0.5627  2.9889  0.5373
the sum rule, in all 26 test cases the depicted human was identified correctly with high confidence over the median of the scores. In Table 3 we present the recognition results for nine test cases, one for each person in the database. Each row of the table corresponds to a specific evaluation of the fusion algorithm. The first column depicts the actual identity of the person for the specific test case, the next nine columns provide the confidence score for the respective person in the database, computed using the fusion algorithm, and the last column provides the median of the score values for the current evaluation.
4 Conclusions
We pursued the idea of identifying humans using not only walk but also other movement types. Using a video database containing a number of persons and movement types, we train a movement type classifier as well as a number of movement-specific person recognition experts. At the testing phase, the movement type depicted in each video within a set of movement videos performed by the same test person is recognized, and each video is routed to the respective expert. The scores yielded by each expert are then combined using the sum rule to yield the final decision on the identity of the test person. Several experiments have been performed, indicating that movement types other than walk may contain considerable discriminant information for the task of human identification, which may be further exploited to design suitable fusion algorithms that enhance the accuracy and robustness of human recognition systems.
Acknowledgment The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no 211471 (i3DPost) and COST Action 2101 on Biometrics for Identity Documents and Smart Cards.
References 1. Kale, A., Sundaresan, A., Rajagopalan, A.N., Cuntoor, N.P., Roy-Chowdhury, A.K., Kruger, V., Chellappa, R.: Identification of Humans Using Gait. IEEE Trans. Image Process. 13(9), 1163–1173 (2004) 2. Boulgouris, N.V., Hatzinakos, D., Plataniotis, K.N.: Gait recognition: a challenging signal processing technology for biometric identification. IEEE Signal Processing Magazine 22(6), 78–90 (2005) 3. Sarkar, S., Phillips, P.J., Liu, Z., Vega, I.R., Grother, P., Bowyer, K.W.: The humanID gait challenge problem: data sets, performance, and analysis. IEEE Trans. Pattern Anal. Mach. Intell. 27(2), 162–177 (2005) 4. Xu, D., Yan, S., Tao, D., Lin, S., Zhang, H.J.: Marginal fisher analysis and its variants for human gait recognition and content-based image retrieval. IEEE Trans. Image Process 16(11), 2811–2821 (2007) 5. Funkunaka, K.: Statistical Pattern Recognition. Academic, San Diego (1990) 6. Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval at the end of the early years. IEEE Trans. Pattern Anal. Mach. Intell. 22(12), 1349–1380 (2000) 7. Yam, C.Y., Nixon, M.S., Carter, J.N.: Gait Recognition By Walking and Running: A Model-Based Approach. In: Proceedings Asian Conference on Computer Vision, ACCV (2002) 8. Turaga, P., Chellappa, R., Subrahmanian, V.S., Udrea, O.: Machine recognition of human activities: A survey. IEEE Trans. Circuits Syst. Video Technol. 18(11), 1473–1488 (2008) 9. Gkalelis, N., Tefas, A., Pitas, I.: Combining fuzzy vector quantization with linear discriminant analysis for continuous human movement recognition. IEEE Trans. Circuits Syst. Video Technol. 18(11), 1511–1521 (2008) 10. Yang, J., Frangi, A.F., Yang, J.Y., Zhang, D., Jin, Z.: KPCA plus LDA: a complete kernel Fisher discriminant framework for feature extraction and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 27(2), 230–244 (2005) 11. Kittler, J., Hatef, M., Duin, R.P.W., Matas, J.: On combining classifiers. IEEE Trans. Pattern Anal. Mach. Intell. 20(3), 226–239 (1998) 12. Gorelick, L., Blank, M., Shechtman, E., Irani, M., Basri, R.: Actions as Space-Time Shapes. IEEE Trans. Pattern Anal. Mach. Intell. 29(12), 2247–2253 (2007)
CBIR over Multiple Projections of 3D Objects Dimo Dimov, Nadezhda Zlateva, and Alexander Marinov Institute of Information Technologies (IIT) at Bulgarian Academy of Science (BAS) Acad. G. Bonchev Str., Block 29-A, 1113 Sofia, Bulgaria {dtdim,nzlateva,amarinov}@iinf.bas.bg
Abstract. This paper presents a heuristic approach to 3D object recognition by considering multiple 2D projections (appearances) of the objects of interest. Thus, 3D object identification is interpreted as a conventional Content Based Image Retrieval (CBIR) problem. An arbitrary input image of a given object is treated as a search sample within a database (DB) of a large enough set of images, i.e. appearances from a sufficient number of viewpoints for each object. The CBIR method to access the image DB should be both fast enough and sufficiently noise-tolerant. The method we propose is described over two cases of recognition, namely human faces and hand signs of a given sign-language alphabet. Analogically, the method can also be applied to recognition of a large number of 3D objects of different types. We are briefly covering the data gathering technique, its structuring into a DB of image samples, and the experimental study for the noise-resistance of the applied CBIR method. The latter is used to acknowledge the applicability of the proposed approach. Keywords: CBIR, appearance-based object recognition, face recognition, sign language recognition, image databases, image and video analysis.
1 Introduction Over the last decade and at present, vision-based approaches to object recognition are being vastly explored. Some of the active areas of research include face recognition (FR) and hand posture recognition (HPR). FR is primarily being addressed in connection with the need for highly reliable security systems, by providing an extra medium next to the typically used biometrical data [1]. On the other hand, automated HPR systems permit the replacement of conventional interface devices and provide a natural interface to communicate with computerized devices. Examples of such HPR systems that require robustness and fine-grained control are: virtual or highly sterile environments, smart homes, sign language recognition, etc. A brief survey of the commonly used methods for FR is listed in [2], [9], and a review for HPR can be found in [6], [10], [11]. The problems of face identification and hand posture recognition can be addressed by a common, appearance-based approach. For instance, such a method is presented in [2] and [6] (for FR and HPR, respectively), and uses a similar approach to gathering 3D object appearances. The used technique is IPCA-ICA (Incremental Principal Component Analysis followed by Independent Component Analysis). The principal J. Fierrez et al. (Eds.): BioID_MultiComm2009, LNCS 5707, pp. 146–153, 2009. © Springer-Verlag Berlin Heidelberg 2009
components of all appearances of a given object are calculated to form a lowdimensional object eigenspace, and from the image sets of all considered objects (classes) − a universal eigenspace. An input object’s identity and pose are determined based on its location over these two eigenspaces. Since identity information is distributed over all eigenvectors, this method does not guarantee good class separability. In this paper we propose a simpler approach, which interprets 3D object recognition as a conventional CBIR problem.
2 Problem Formulation and Assumptions The task of identification or recognition of either a human face or a hand posture in an image is interpreted as a task for direct comparison of an input example with samples from an image DB (IDB), i.e. as a CBIR problem. We assume that the images of the objects for recognition (face or hand) have been appropriately extracted from an input video sequence. We propose a vision-based approach for Face or Hand Posture Recognition, which uses one video camera for capturing the object. For the sake of simplicity, we consider the recognition of a single object on the image scene, i.e. we assume that the face or hand sign to be recognized is ‘the biggest spot’ in the video frames, or at least in the majority of frames incoming from the camera. 2.1 Essence of the Proposed Approach A given object in front of the video camera is considered as a dynamic 3D object, represented by a series of 2D appearance projections, i.e. static images from the camera. If an appropriate part of these images, or similar to them, is already stored in an IDB, then we can search into this IDB for the image that is closest (most similar) to the observed input one. Moreover, we can also locate a series of images, sorted in descending order of their similarity to the input image. To achieve this, the IDB from above needs to be built as an object dictionary with a large amount of representative object samples/images. To assure operation in real-time, the comparison time of the input image with all samples from the dictionary has to be quick enough. This is possible if we have a fast enough and simultaneously noise-tolerant CBIR method for accessing the large IDB of object projections. Such CBIR methods are available with the EFIRS system (Effective and Fast Image Retrieval System), developed by IIT-BAS team [4], [5]. Their noise-resistance covers cases of accidental linear transformations (translation, rotation, and scaling), of regular noise, and of rougher (to a certain level) artifact-noise in the input. In essence, EFIRS uses an IDB to store a large set of images, by representing each of them with a unique, fixed length key (a global image descriptor), [4]. The IDB access is organized over these keys by using conventional DBMS indexing techniques. EFIRS supports several competitive key generation strategies, e.g. by a 2D Fourier transform (2DFT); by a combination of Log-Polar mapping, Fourier and Wavelet transforms (PFWT) [5], and other [4]. The primary goal of the conducted experimental study is to clarify if our CBIR approach is appropriate for the recognition of dynamic objects in a video-clip, in favor
of different recognition applications. Preliminary results in this aspect have been reported for the cases of face and sign language alphabet recognition [7], [8].
2.2 Expected Problems and Allowed Limitations
Typical problems to be faced when recognizing objects on an image scene are: (1) segmentation or isolation of the main object of interest from an input image (static 2D scene) or from a video sequence of similar scenes (video-clip); (2) object specific normalization at both stages, the learning and recognition ones; (3) accumulation of a representative enough IDB, i.e. a dictionary with images of representative appearances of the object, to face the well-known classifier's learning problem. The first two problems are characteristic of most of the approaches in the areas of computer vision, image processing and recognition; therefore, at this stage, we consider them resolved a priori. Furthermore, the appropriateness of one method (for object isolation or segmentation) over another depends on the concrete use case and problem setup. We concentrate here on the third problem – gathering the representative samples in the IDB and performing an experiment that provides evidence for the chosen concept. For the preparation of the experimental IDB, it is important that it contains a large enough number of image samples that also adhere to the general limitation of the available CBIR methods [4], [5] – the images need to be relatively "clean", i.e. to contain the whole object of interest (e.g. the human face or palm), in color or gray scale, over a uniform (white) background and, if possible, be devoid of noise-artifacts from the natural surroundings. All these limitations, related to the well known problems of segmentation, as addressed by Zhao et al. [3] for FR and in [11], [12] for HPR, are resolved here in their light form, according to the experiment's specifics. We are using a simplified scene for IDB gathering: a motionless object of interest in front of a dark blue curtain, i.e. a background whose color is opposite to the color of the actor's face and palm skin; the object faces the camera and is centrally positioned.
3 Our Approach to IDB The methodology of gathering data in an IDB is based on the recording of a short video-clip that traces the object through uniform scanning (by position and time, vertically and horizontally) in a spatial sector − wide enough and in front of the object. Three possible approaches have been considered: (1) Static object and moving camera that uniformly scans the needed spatial sector. (2) Static camera and uniformly moving object that expose itself well enough, [14]. (3) Static object and static camera, [13]. The three mentioned approaches are valid at the stage of regular exploitation of a given recognition system, e.g. Fig. 4. Yet, only the first one could provide the regularity
we need for the image sequence of object appearances from multiple, uniformly spread viewpoints within the considered spatial sector. This requires an auxiliary construction for recording a video-clip with a conventional camera. Fig.1 illustrates an idea for a construction of this type. Here, the responsibility for providing the needed uniformity of motion within the video-clip falls on the operator-researcher. This is acceptable and natural as we are speaking of a single session of unvaried scanning procedures. 3.1 Video Capture At this stage, we use the construction illustrated on Fig. 1 and suggest the following procedure for gathering the experimental data for our IDB: (s1) Fix the object − actor’s face or palm − in the necessary pose: in front of the camera and approximately towards its central position. (s2) Scan (capture) the needed spatial sector in front of the object of interest: row by row, top-down, moving in “zigzag” uniformly along the rows, at nearly constant speed. In vertical transitions (from row to row) use an assistant to provide the “synchronous” release of the vertical restriction (a cord) through equally spaced distances d, among the arcs with radius R=const (Fig.1). During these vertical transitions, the camera operator can simply cover the lens by hand for better separation of the significant row frames (“non black” frames, containing the entire object of interest) at the later video processing step. For operative reasons, it is advisable that the time for transition does not go beyond 1÷2 seconds, while the scan speed along the rows is tolerated to vary within some not very large bounds. (s3) If there is another object to capture, go to (s1). The results of the above procedure are primary video-clips. Similar clips should be captured for each object of interest (human face or hand posture). Apparently, the above procedure and the construction of Fig.1 can be relatively easily elaborated to the status of an automated “photo-kiosk”, in case of necessity.
Fig. 1. (a) Kinematics’ schema of the construction for taking “primary” video-clips, using the static-object-and-moving-camera approach, and (b) the construction in action
3.2 Primary Video-clips Processing Stages ♦ Separate the significant frames from the entire video-clip. Arrange them row by row, following the original scanning/capture path.
Fig. 2. Spherical sector in front of the camera, scanned row by row. Two main variants for the uniform grid of representative frames: (a) a square grid, and (b) a triangular one.
♦ Derive a representative set of frames from the video sequence by, for example, obtaining those frames that fall within a uniform grid (“square” or “triangular” one) over the spherical sector of scanning. The linear parameter D of this grid (Fig.1 and Fig. 2) represents the differences between the consecutive representative frames, each of which contains the object sample from the corresponding viewpoint. On a given row, D can be measured in degrees, in centimeters (for the concrete construction, Fig.1), or even in number of frames (assuming a regular scanning speed). Fig. 3 provides an example of such a set of representative frames to be stored in IDB.
Fig. 3. Two IDBs of representative frames: (a) for a palm sign, and (b) for a face
Our experimental study was carried out with the following construction parameters: • radius R of the scanned spatial sector, R =51cm; • average scanned spatial sector size – horizontally ≈ 80° and vertically ≈ 115°;
• distance d between consecutive scan rows over the spherical sector, fixed to d = 10 cm (as the distance between cord gaps on a fixative); at the chosen R, d ≈ 9.6° (as chord angle);
• number N of scan rows (arcs over the spherical sector), N = 8.
Thus, for the two uniform grids shown in Fig. 2, we have the following possible values of D:

$$D = kd, \quad \text{for the square grid (Fig. 2a)}, \qquad (1)$$
$$D = \frac{2kd}{\sqrt{3}}, \quad \text{for the triangular grid (Fig. 2b)}, \qquad (2)$$

where k is the scan row number, k ∈ {1, 2, ..., (N−1)}. We assume that the noise-resistance of the used CBIR method can also be measured by means of D (the representative frame distance) and d (the row distance). Thus, in a uniform grid with D > D0, where D0 is an admissible lower boundary, the CBIR method would start to fail when searching for similarity within the samples of the IDB. Therefore, the D0 boundary can also serve as a noise-resistance measure. It also determines the optimal number of grid nodes P for storing the representative samples within the IDB (see Fig. 2):

$$P \approx \pi \left(\frac{N}{2}\right)^2 \frac{D\,d}{D_0^2}. \qquad (3)$$
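As a quick sanity check of Eq. (3) (our own worked example, not taken from the paper), the snippet below evaluates P for the construction parameters listed above, with D = d and D0 = d:

```python
import math

N = 8          # number of scan rows
d = 10.0       # row spacing (cm)
D = d          # representative-frame spacing along a row
D0 = d         # admissible lower boundary of the grid spacing

P = math.pi * (N / 2) ** 2 * D * d / D0 ** 2
print(round(P, 1))   # ~50 grid nodes per object, of the same order as the
                     # ~57 representative images per face reported in Sect. 4.2
```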
4 Experiments and Results The experimental analysis of the proposed approach has been carried out with EFIRS: a C++ Windows XP application, operating on a Pentium 4 PC at 2.8 GHz with 2GB RAM, and a 9ms HDD. For the generation of the experimental IDB of object appearances, we used the existing IDB structure of EFIRS. The chosen CBIR access method was PFWT, [5]. 4.1 Essence of the Experiments For each possible value of the examined parameter D, do: • Generate a separate IDB with the representative frames for all objects (hand poses or human faces) of interest. Initially, it is recommended that the representative frames for each object be chosen at regular positions (at D ≈ d). • For each couple of consecutive representative frames (from a given row) on the grid, associated with a given primary video-clip, define the closest frame to the center of this couple. These central frames are obtained from the set of motion frames along each row of the clip, and are uniformly the most distant ones, at distances ≈d/2 from their corresponding 2 neighboring frames, which have already been stored in the IDB. These central frames are used to provide “the heaviest case” of input precedents for the CBIR search within the IDB. • Carry on a Simple Locate Test (SLT) [5] for a CBIR search within the IDB over all centers, i.e. over “the heaviest cases” of input precedents. Summarize the results for the successful and unsuccessful retrievals from the IDB. More robust results could be obtained when testing with all significant image frames, extracted from the primary video-clips.
4.2 Experiments
The EFIRS experimental schema is illustrated in Fig. 4, including the three basic regimes of operation: (i) the Learning regime, i.e. IDB loading, (ii) the Test regime, i.e. our experiment to prove the proposed method, and (iii) the conventional Recognition regime, which is still under development. We have carried out two separate tests with two different object types:
Test for Face Recognition: The IDB is loaded with the representative frames of 22 faces of 16 persons - some have been filmed twice to capture different facial expressions. The total number of acquired significant frames was 8177, i.e. on average 378 per face. 1251 of the frames have been defined as representative ones and registered in the IDB, with an average of 57 representative images per face. The generated IDB applies to the case of D = d. The generalized result of testing with all central frames for all 22 faces gives an average error rate of ~6.6%, i.e. the FR rate is around 93%. The average search time per input face is ~0.2 s. The FR rate is comparable to the success rate reported in [2].
Test for Hand Pose Recognition: This experiment is similar to the FR one, but the IDB is filled with the representative frames of 7 hand poses (letters from the Bulgarian sign language alphabet). The number of significant frames was 344, i.e. on average 49 per hand pose. The generalized result for this test shows an average error rate of ~4%, i.e. the HPR rate is around 96%.
Fig. 4. Our experimental schema: VC (Video camera), IP (Image processing), ORI (Output result interpretation), IDB (Image DB), IDP (Image data pool). The three basic regimes are outlined.
Experiments' Discussion: The recognition rates, ~93% (faces) and ~96% (hands), achieved at D0 = 1d, are not bad in view of the early stage of this work. Refinements of the image preprocessing step, e.g. object specific geometrical normalization, lighting normalization, removal of extra attributes [8], are envisioned in the near future.
5 Conclusion The paper proposes an appearance-based 3D object recognition approach. The 3D recognition problem is interpreted as a CBIR one by presenting a 3D object by a set
of 2D appearances from different, uniformly spaced viewpoints. The objects of interest in the experimental study were human faces and static hand postures. Based on the gathered experience, we believe this method can be applied to different object classes, e.g. identifying the brand model of a car, items of ceramic art, etc. Since the IDB key size that EFIRS uses for images is ~0.5 KB, the IDB itself can store millions of objects. From the viewpoint of efficiency, the search time remains logarithmic, ~log2 N, with N the number of objects represented in the IDB, [4]. As a next step, our approach needs image preprocessing refinements, while its performance calls for a comparison with experiments, as in [2] and [6], over similar IDBs. Acknowledgments. This work was partially supported by the following grants of IIT-BAS: Grant # DO-02-275/2008 and Grant # VU-MI-204/2006 of the National Science Fund at the Bulgarian Ministry of Education & Science, and Grant # 010088/2007 of BAS.
References 1. Jain, A.K., Ross, A., Prabhakar, S.: An Introduction to Biometric Recognition. IEEE Transactions on Circuits and Systems for Video Tech. 14(1), 4–19 (2004) 2. Dagher, I., Nachar, R.: Face Recognition Using IPCA-ICA Algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(6), 996–1000 (2006) 3. Zhao, W., Chellappa, R., Rosenfeld, A., Phillips, P.J.: Face Recognition: A Literature Survey. ACM Computing Surveys, 399–458 (2003) 4. Dimov, D.: Rapid and Reliable Content Based Image Retrieval. In: Lefebvre, E. (ed.) NATO ASI, Multisensor Data and Information Processing for Rapid and Robust Situation and Threat Assessment, Albena, Bulgaria, pp. 384–395. IOS Press, Amsterdam (2007) 5. Dimov, D.: A Polar-Fourier-Wavelet Transform for Effective CBIR. In: Morzy, T., Morzy, M., Nanopoulos, A. (eds.) ADMKD 2007 (2007); 11th ADBIS 2007, Varna, BG, pp. 107– 118 (2007) 6. Dagher, I., Kobersy, W., Nader, W.A.: Human Hand Recognition Using IPCA-ICA Algorithm. EURASIP J. on Advances in Signal Processing 2007, ID 91467, 7 pages 7. Dimov, D., Marinov, A., Zlateva, N.: CBIR Approach to the Recognition of a Sign Language Alphabet. In: CompSysTech 2007, Rousse, Bulgaria, pp. V.2.1–9 (2007) 8. Dimov, D., Zlateva, N., Marinov, A.: CBIR Approach to Face Recognition. In: Workshop on Multisensor Signal, Image and Data Processing. In: A&I 2008, Sofia, pp. IV.21–IV.26 (2008) 9. Gross, R., Shi, J., Cohn, J.F.: Quo vadis Face Recognition? In: 3th Workshop EEMCV. IEEE Conf. Computer Vision and Pattern Recognition (2001) 10. Malassiotis, S., Aifanti, N., Strintzis, M.G.: A Gesture Recognition System Using 3D Data. In: 1st IEEE Symp. 3D Data Processing Visualiz, and Transmission, Padova, Italy (2002) 11. Jennings, C.: Robust finger tracking with multiple cameras. In: International Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems (1999) 12. Binh, N.D., Shuichi, E., Ejima, T.: Real-Time Hand Tracking and Gesture Recognition System. In: GVIP 2005, CICC, Cairo, Egypt (2005) 13. Thomas Moeslund’s Gesture Recognition Database (1996), http://www-prima.inrialpes.fr/FGnet/data/12MoeslundGesture/database.html 14. Extended Multi-Modal Face DB (2003), http://www.ee.surrey.ac.uk/Research/VSSP/xm2vtsdb
Biometrics beyond the Visible Spectrum: Imaging Technologies and Applications Miriam Moreno-Moreno, Julian Fierrez, and Javier Ortega-Garcia Escuela Politécnica Superior, Universidad Autónoma de Madrid C/ Francisco Tomás y Valiente, 11 – Cantoblanco – 28049 Madrid, Spain {miriam.moreno,julian.fierrez,javier.ortega}@uam.es
Abstract. Human body images acquired at visible spectrum have inherent restrictions that hinder the performance of person recognition systems built using that kind of information (e.g. scene artefacts under varying illumination conditions). One promising approach for dealing with those limitations is using images acquired beyond the visible spectrum. This paper reviews some of the existing human body imaging technologies working beyond the visible spectrum (X-ray, Infrared, Millimeter and Submillimeter Wave imaging technologies). The benefits and drawbacks of each technology and their biometric applications are presented. Keywords: Imaging Technologies, X-ray, Infrared, Millimeter Waves, Submillimeter Waves, Thermograph, Passive Imaging, Active Imaging, Terahertz Imaging, Biometrics, Face Recognition, Hand Vein Recognition.
1 Introduction The ability to capture an image of the whole human body or a part of it has attracted much interest in many areas such as Medicine, Biology, Surveillance and Biometrics. Biometric Recognition, or simply Biometrics, is a technological field in which users are identified through one or more physiological and/or behavioural characteristics [1]. Many biometric characteristics are used to identify individuals: fingerprint, signature, iris, voice, face, hand, etc. Biometric traits such as the ear, face, hand and gait are usually acquired with cameras working at visible frequencies of the electromagnetic spectrum. Such images are affected by, among other factors, lighting conditions and occlusions (e.g., clothing, make up, hair, etc.) In order to circumvent the limitations imposed by the use of images acquired at the visible spectrum (VIS), researchers in biometrics and surveillance [2] have proposed acquiring images at other spectral ranges: X-ray (XR), Infrared (IR), Millimeter (MMW) and Submillimeter (SMW) waves (see Fig. 1). In addition to overcoming to some extent some of the limitations of visible imaging, the images captured beyond the visible spectrum present another benefit: they are more robust to spoofing than other biometric images/traits [3]. In this work, we present an overview of the state of the art in body imaging beyond the visible spectrum, with a focus on biometric recognition applications. In particular, J. Fierrez et al. (Eds.): BioID_MultiComm2009, LNCS 5707, pp. 154–161, 2009. © Springer-Verlag Berlin Heidelberg 2009
Fig. 1. Electromagnetic spectrum showing the different spectral bands between the microwaves and the X-rays. The IR band is sometimes considered to extend to 1 mm including the SMW region.
Imaging technologies beyond VIS for biometrics: X-Ray (Transmitted XR, Backscatter XR); Infrared (Near IR, Medium Wave IR, Long Wave IR); MMW-SMW (MMW: Passive, Active; SMW: Passive, Active)
Fig. 2. A taxonomy of imaging technologies beyond visible spectrum. The figure only shows the technologies adequate for biometrics.
a taxonomy followed by the properties and the biometric applications of each imaging technology beyond the visible spectrum is presented.
2 Fundamentals and Taxonomy of Imaging Technologies beyond the Visible Spectrum
Many imaging technologies beyond the visible spectrum have been used to capture a body part: IR, magnetic resonance, radioisotope, XR, acoustical, MMW- and SMW-imaging, etc. Not all of them can be used for biometric purposes because of their high level of intrusiveness. The imaging technologies more adequate for biometric applications are: XR, IR, MMW and SMW imaging. A taxonomy of them is shown in Fig. 2. Imagery can be classified in two architectures: passive or active. In the former group the image is generated by receiving natural radiation which has been emitted and reflected from the scene, obtaining a map of brightness temperature. On the other hand, in active imaging the radiation is transmitted to the scene and then collected after reflection to form the image, which is a map of reflectivity. The contrast in the scene in any part of the spectrum is a function of the optical properties of the object being imaged and its background. In particular, the apparent temperature of an object, T0, is defined by:

$$T_0 = T\varepsilon + T_s r + T_b t \qquad (1)$$
where T is the physical temperature of the object, ε its emissivity, Ts is the temperature of the background which is reflected by the object with reflectivity r, Tb is the temperature of the background behind the object and t the object’s transmissivity [4].
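As a small numeric illustration of Eq. (1) (our own example; the emissivity, reflectivity, transmissivity and temperature values are invented), the apparent temperature can be evaluated directly:

```python
def apparent_temperature(T, emissivity, T_s, reflectivity, T_b, transmissivity):
    """Eq. (1): T0 = T*eps + T_s*r + T_b*t."""
    return T * emissivity + T_s * reflectivity + T_b * transmissivity

# Example: a skin-like object (high emissivity, opaque) against a cooler background.
T0 = apparent_temperature(T=310.0, emissivity=0.98,
                          T_s=290.0, reflectivity=0.02,
                          T_b=290.0, transmissivity=0.0)
print(T0)   # ~309.6 K, dominated by the object's own emission
```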
3 X-Ray Imaging X-radiation have a wavelength in the range of 10-0.01 nm (3·1016-3·1019 Hz) and enough energy to pass through cloth and human tissues. In addition to cloth penetration, XR imaging provides high image resolution. On the other hand, this technology presents some disadvantages: low speed, limitation to very short distances and the health safety concerns it raises because of using ionizing radiation. The natural background X-radiation is too weak to form an image, therefore active imaging is required in both XR imaging modalities: transmission and backscatter Xray imaging. X-rays are commonly produced by accelerating charged particles. Transmission X-ray Imaging. Conventional X-ray radiographic systems used for medical purposes produce images relying on this kind of imaging: a uniform X-ray beam incident on the patient interacts with the tissues of the body, producing a variable transmitted X-ray flux dependent on the attenuation along the beam paths. An Xray-sensitive detector captures the transmitted fraction and converts the X-rays into a visible projection image. Only a few works on biometric identification making use of the conventional Xrays can be found: Shamir et al. [5] perform biometric identification using knee Xrays while Chen et al. [6] present an automatic method for matching dental radiographs (see Fig. 3a-c). These knee or dental X-rays are difficult to forge and present additional advantages: they can be used in forensic identification where the soft tissues are degraded. Backscatter X-ray Imaging. In this technique the XR scattered photons, instead of transmitted photons, are used to construct the image [7]. This technology utilizes high energy X-rays that are more likely to scatter than penetrate materials as compared to lower-energy X-ray used in medical applications. However, this kind of radiation is able to penetrate some materials, such as cloth. A person is scanned by moving a single XR beam over her body. The backscattered beam from a known position allows a realistic image to be reconstructed. As only scattered X-rays are used, the registered image is mainly a view of the surface of the scanned person, i.e. her nude form. As the image resolution is high, these images present privacy issues. Some companies (e.g. AS&E) ensure privacy by applying an algorithm to the raw images so that processed images reveal only an outline of the scanned individual. Raw and processed backscatter XR images are shown in Fig. 3d and e. According to our knowledge, there are no works on biometrics using backscatter Xray images. The application of this technique includes medical imaging [8] and passenger screening at airports and homeland security [9]. There are currently different backscatter X-ray imaging systems available on the market to screen people (e.g. AS&E, Rapiscan Systems).
Fig. 3. (a-c) Dental radiographs used for human identification. (d) A backscatter XR image of a man; it shows the skin surface and objects hidden by clothing. (e) A backscatter XR image processed to ensure privacy. These figure insets are extracted from: [6] (a-c), [http://www.elpais.com] (d) and [http://www.as-e.com/] (e).
Fig. 4. Infrared band of the electromagnetic spectrum showing the different IR sub-bands: the Reflected IR region (NIR, SWIR) and the Thermal IR region (MWIR, a low-transmittance window, LWIR, VLWIR), spanning roughly 0.7 µm to 1 mm.
4 Infrared Imaging The infrared band of the electromagnetic spectrum lies between the SMW and VIS regions, with wavelengths in the range of 0.7 µm – 1 mm (see Fig. 1). The human body emits IR radiation with wavelengths between 3–14 µm, hence both active and passive architectures can be used in IR imaging. As indicated in Eq. (1), the radiation that is actually detected by an IR sensor depends on the surface properties of the object (ε, r, t) and on the transmissivity of the medium (atmosphere). According to the properties of the medium and the spectral ranges of the currently available IR detectors, the IR spectrum is divided into five sub-bands, summarized in Fig. 4. The limits of these sub-bands are not completely fixed and depend on the specific application. In practice, IR imaging systems usually operate in one of the three following IR sub-bands: the near infrared (NIR), the medium wave infrared (MWIR) or the long wave infrared (LWIR), where the windows of high atmospheric transmissivity are located.
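The sub-band boundaries sketched in Fig. 4 can be captured in a small lookup function. The sketch below is only an illustration; the boundary values follow Fig. 4 and, as noted above, are not completely fixed in practice.

```python
def ir_subband(wavelength_um):
    """Map a wavelength in micrometres to the IR sub-band of Fig. 4."""
    if wavelength_um < 0.7:
        raise ValueError("below the IR band")
    bands = [(1.0, "NIR"), (3.0, "SWIR"), (5.0, "MWIR"),
             (8.0, "low-transmittance window"), (14.0, "LWIR"), (1000.0, "VLWIR")]
    for upper_limit_um, name in bands:
        if wavelength_um <= upper_limit_um:
            return name
    raise ValueError("beyond the IR band (> 1 mm)")

print(ir_subband(0.85))   # NIR, typical of active near-infrared vein imaging
print(ir_subband(10.0))   # LWIR, typical of passive thermography
```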
Table 1. Properties of the most important IR sub-bands
Near IR (NIR): range 0.7–1 µm; active architecture; low IR camera cost (VIS cameras are also sensitive); good image quality, body condition invariant; applications in biometrics: face [10] and hand vein [11] recognition.
Medium Wave IR (MWIR): range 3–5 µm; passive architecture; high IR camera cost; good image quality, sensitive to body conditions; applications in biometrics: face [12] and hand vein [13] recognition.
Long Wave IR (LWIR): range 8–14 µm; passive architecture; low IR camera cost; low contrast, sensitive to body conditions; applications in biometrics: face [14] and hand vein [11] recognition.
Fig. 5. Face and hand images acquired at NIR, MWIR and LWIR: (a) face at NIR, (b) face at MWIR, (c) face at LWIR, (d) palm at NIR, back of the hand at MWIR (e) and at LWIR (f). The images are extracted respectively from [10], [12], [14], [11], [13] and [11].
In Fig. 4 the IR band is split into two sub-bands: the Reflected IR band (0.7–2.4 µm) and the Thermal IR band (beyond 2.4 µm). The Reflected IR band is associated with reflected solar radiation, which contains no information about the thermal properties of materials. The Thermal IR band is associated with the thermal radiation emitted by objects. This division into two bands is also related to the two kinds of imaging architecture: active and passive imaging. In the Reflected IR band external illumination is needed, while in the Thermal IR band passive imaging is preferred (the natural IR radiation emitted by the person is captured). The three practical IR sub-bands mentioned above (i.e. NIR, MWIR and LWIR) present different characteristics. A summary of the properties of these sub-bands is given in Table 1, while Fig. 5 shows face and hand images acquired at NIR, MWIR and LWIR. Many research works have used NIR, MWIR and LWIR imaging systems for biometrics. Face and hand vein pattern recognition are the most important biometric modalities investigated in these three bands (see references in Table 1). Images acquired at any of these bands (see Fig. 5) are, to some extent, invariant to environmental illumination. Specifically, images at NIR are body condition invariant and
can provide good quality vein patterns near the skin surface [11], but external NIR illumination is required. Images acquired at MWIR and LWIR show patterns of radiated heat from the body's surface (often called thermograms). Very few biometric works have been developed in MWIR [12, 13], probably due to the high cost of MWIR cameras. LWIR cameras are much cheaper but, in contrast with NIR, LWIR can only capture large veins. Additionally, most LWIR images have low levels of contrast and are also sensitive to ambient and body conditions [11].
5 Millimeter and Submillimeter Wave Imaging MMW and SMW radiation fill the gap between the IR and the microwaves (see Fig. 1). Specifically, the MMW band spreads from 30 to 300 GHz (10–1 mm) while the SMW band lies in the range of 0.3–3 THz (1–0.1 mm) [4]. The use of MMW/SMW radiation within imaging systems is currently an active field of research due to some of its interesting properties [15-20]. Penetration through clothing and other non-polar dielectric materials, even at stand-off ranges, is one of the most important abilities of MMWs and SMWs. Hence, security screening (detection of concealed weapons under clothing), non-destructive inspection, and medical and biometric imaging are the most relevant applications of MMW/SMW imaging. Another application of MMW imaging is low-visibility navigation (due to the low attenuation of MMWs under adverse weather). Images acquired at MMW/SMW frequencies have lower resolution than IR or VIS images due to their longer wavelength. Further, MMW and, especially, SMW imaging technologies are not as mature as the IR or VIS imaging technologies, which restricts the quality of the resulting images. On the other hand, SMW images have better spatial resolution than MMW images (the SMW wavelength is smaller), but SMW clothing penetration is lower compared to MMW. In any case, images acquired with passive or with active systems present different characteristics (see Table 2).
Table 2. Properties of MMW and SMW imaging operating with passive or active architecture. The spatial resolution depends on conditions such as the distance between the target and the detector; the given resolution corresponds to a target-detector distance of some meters.
Passive MMW: low resolution compared to VIS and IR, highly affected by sky illumination; relative spatial resolution very low (indoors), low-medium (outdoors); operating frequencies 35 GHz, 94 GHz; commercial systems: Qinetiq, Brijot, Alfa Imaging, Sago Systems.
Active MMW: higher quality than passive MMW images; relative spatial resolution medium; operating frequencies 30 GHz, 60 GHz, 190 GHz; commercial systems: Agilent, L-3 Communications.
Passive SMW: good quality, partial clothing opacity; relative spatial resolution medium-high; operating frequencies 0.1–1 THz, 1.5 THz; commercial systems: Thruvision.
Active SMW: higher quality than passive SMW images, partial clothing opacity; relative spatial resolution high (at a distance), very high (near); operating frequencies 0.6–0.8 THz, 310 GHz, > 1 THz; commercial systems: Picometrix, Teraview.
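Given the band definitions above (MMW: 30–300 GHz, SMW: 0.3–3 THz), a small helper can label an operating frequency and give its free-space wavelength. This is only an illustration; the function name and structure are ours.

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s

def mmw_smw_band(frequency_hz):
    """Classify a frequency as MMW (30-300 GHz) or SMW (0.3-3 THz); also return its wavelength in mm."""
    wavelength_mm = SPEED_OF_LIGHT / frequency_hz * 1e3
    if 30e9 <= frequency_hz < 300e9:
        band = "MMW"
    elif 300e9 <= frequency_hz <= 3e12:
        band = "SMW"
    else:
        band = "outside MMW/SMW"
    return band, round(wavelength_mm, 3)

for f_hz in (94e9, 0.6e12, 1.5e12):   # operating frequencies mentioned in Table 2
    print(f_hz / 1e9, "GHz ->", mmw_smw_band(f_hz))
```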
Fig. 6. Images acquired with MMW and SMW imaging systems. (a) Outdoors PMMW image (94 GHz) of a man carrying a gun in a bag. (b-c) Indoors and outdoors PMMW images of a face. (d) AMMW image of a man carrying two handguns acquired at 27-33 GHz. (e) PSMW image (0.1-1 THz) of a man with concealed objects beneath his jacket. (f) PSMW image (1.5 THz) of a man with a spanner under his T-shirt. (h) ASMW image (0.6 THz) of a man hiding a gun beneath his shirt. (g) Full 3-D reconstruction of the previous image after smoothing of the back surface and background removal. (i) Terahertz reflection mode image of a thumb. These figure insets are extracted from: www.vision4thefuture.org (a-c), [16] (d), [17] (e), [18] (f), [19] (g-h) and [20] (i).
Different Passive MMW (PMMW), Active MMW (AMMW), Passive SMW (PSMW) and Active SMW (ASMW) images are shown in Fig. 6. As MMW and SMW images measure the different radiometric temperatures in the scene (see Eq. (1)), images acquired indoors and outdoors have very different contrast when working in passive mode, especially with MMW (see Fig. 6b and c). In spite of the significant advantages of MMW and SMW radiation for biometric purposes (clothing penetration, low intrusiveness, health safety), no biometric applications have been developed yet.
6 Conclusions We have provided a taxonomy of the existing imaging technologies operating at frequencies beyond the visible spectrum that can be used for biometric purposes. The advantages and challenges of each imaging technology, as well as their image properties, have been presented. Although only the X-ray and infrared spectral bands have been used for biometric applications so far, there is another kind of radiation with promising applications in the biometric field: millimeter and submillimeter waves. However, MMW and SMW technology is not yet completely mature.
Acknowledgments. This work has been supported by Terasense (CSD2008-00068) Consolider project of the Spanish Ministry of Science and Technology. M. M.-M. is supported by a CPI Fellowship from CAM, and J. F. is supported by a Marie Curie Fellowship from the European Commission.
References 1. Jain, A.K., et al.: An Introduction to Biometric Recognition. IEEE Trans. on CSVT 14(1), 4–20 (2004) 2. National Research Council, Airline Passenger Security Screening: New Technologies and Implementation Issues. National Academy Press, Washington, D.C (1996) 3. Galbally, J., et al.: Fake Fingertip Generation from a Minutiae Template. In: Proc. Intl. Conf. on Pattern Recognition, ICPR. IEEE Press, Los Alamitos (2008) 4. Appleby, R., et al.: Millimeter-Wave and Submillimeter-Wave Imaging for Security and Surveillance. Proc. of the IEEE 95(8), 1683–1690 (2007) 5. Shamir, L., et al.: Biometric identification using knee X-rays. Int. J. Biometrics 1(3), 365– 370 (2009) 6. Chen, H., Jain, A.K.: Dental Biometrics: Alignment and Matching of Dental Radiographs. IEEE Transactions on PAMI 27(8), 1319–1326 (2005) 7. Bossi, R.H., et al.: Backscatter X-ray imaging. Materials Evaluation 46(11), 1462–1467 (1988) 8. Morris, E.J.L., et al.: A backscattered x-ray imager for medical applications. In: Proc. of the SPIE, vol. 5745, pp. 113–120 (2005) 9. Chalmers, A.: Three applications of backscatter X-ray imaging technology to homeland defense. In: Proc. of the SPIE, vol. 5778(1), pp. 989–993 (2005) 10. Li, S.Z., et al.: Illumination Invariant Face Recognition Using Near-Infrared Images. IEEE Trans. on PAMI 29(4), 627–639 (2007) 11. Lingyu, W., et al.: Near- and Far- Infrared Imaging for Vein Pattern Biometrics. In: Proc. of the AVSS, pp. 52–57 (2006) 12. Buddharaju, P., et al.: Physiology-Based Face Recognition in the Thermal Infrared Spectrum. IEEE Trans. on PAMI 29(4), 613–626 (2007) 13. Fann, C.K., et al.: Biometric Verification Using Thermal Images of Palm-dorsa Veinpatterns. IEEE Trans. on CSVT 14(2), 199–213 (2004) 14. Chen, X., et al.: IR and visible light face recognition. Computer Vision and Image Understanding 99(3), 332–358 (2005) 15. Kapilevich, B., et al.: Passive mm-wave Sensor for In-Door and Out-Door Homeland Security Applications. In: SensorComm 2007, pp. 20–23 (2007) 16. Sheen, D.M., et al.: Three-dimensional millimeter-wave imaging for concealed weapon detection. IEEE Trans. on Microwave Theory and Techniques 49(9), 1581–1592 (2001) 17. Shen, X., et al.: Detection and Segmentation of Concealed Objects in Terahertz Images. IEEE trans. on IP 17(12) (2008) 18. Luukanen, A., et al.: Stand-off Contraband Identification using Passive THz Imaging. In: EDA IEEMT Workshop (2008) 19. Cooper, K.B., et al.: Penetrating 3-D Imaging at 4- and 25-m Range Using a Submillimeter-Wave Radar. IEEE Trans. on Microwave Theory and Techniques 56(12) (2008) 20. Lee, A.W., et al.: Real-time imaging using a 4.3-THz Quantum Cascade Laser and a 320 x 240 Microbolometer Focal-Plane Array. IEEE Photonics Technology Letters 18(13), 1415–1417 (2006)
Formant Based Analysis of Spoken Arabic Vowels
Yousef Ajami Alotaibi1 and Amir Husain2
1 Computer Eng. Dept., College of Computer & Information Sciences, King Saud University
2 Department of Computing Science, Stirling University
[email protected],
[email protected]
Abstract. In general, speech sounds are classified into two categories: vowels, which contain no major air restriction through the vocal tract, and consonants, which involve a significant restriction and are therefore weaker in amplitude and often "noisier" than vowels. This study is specifically concerned with the modern standard Arabic dialect. Whilst there has been disagreement between linguists and researchers on the exact number of Arabic vowels, here we consider the case of eight Arabic vowels, comprising the six basic ones in addition to two diphthongs. The first and second formant values of these vowels are investigated, and the differences and similarities between the vowels are researched using consonant-vowel-consonant (CVC) utterances. The Arabic vowels are analyzed in both the time and frequency domains, and the results of the analysis will facilitate future Arabic speech processing tasks such as vowel and speech recognition and classification. Keywords: Arabic, Vowels, Analysis, Speech, Recognition, Formants.
1 Introduction Arabic is a Semitic language, and is one of the oldest languages in the world. Currently it is the second most spoken language in terms of number of speakers. Modern Standard Arabic (MSA) has basically 36 phonemes, of which six are vowels, two are diphthongs, and 28 are consonants. A phoneme is the smallest element of speech units that indicates a difference in the meaning of a word or a sentence. In addition to the two diphthongs, the six vowels are /a, i, u, a:, i:, u:/, where the first three are short vowels and the last three are their corresponding longer versions (that is, the three short vowels are /a, i, u/, and their three long counterparts are /a:, i:, u:/) [1][2][3]. As a result, vowel duration is phonemic in the Arabic language. Some researchers consider Arabic vowels to number eight in total by counting the two diphthongs as vowels, and this is normally considered to be the case for MSA [4]. MSA has fewer vowels than the English language. Arabic phonemes comprise two distinctive classes, termed pharyngeal and emphatic phonemes. These two classes can be found only in Semitic languages such as Hebrew and Persian [1], [5], [4]. Arabic dialects may have different vowels - for instance, the Levantine dialect has at least two extra types of diphthongs, /aj/ and /aw/. Similarly, the Egyptian dialect has other extra vowels [3]. Arabic is comparatively much less researched than other languages such as English and Japanese. Most of the reported studies have been conducted on
Arabic language and speech digital processing in general, with only a few on Arabic vowels in particular. Some research works have been carried out on the MSA, classical and Quraanic versions of Arabic. More recently, Iqbal et al. [6] reported a new preliminary study on vowel segmentation and identification using formant transitions occurring in continuous recitation of Quraanic Arabic. The paper provided an analysis of cues to identify Arabic vowels. Their algorithm extracted the formants of already segmented recitation audio files and recognized the vowels on the basis of these extracted formants. The investigation was applied in the context of recitation principles of the Holy Quraan. The vowel identification system developed showed up to 90% average accuracy on continuous speech files comprising around 1000 vowels. In other related work, Razak et al. [7] have investigated Quraanic verse recitation feature extraction using the Mel-Frequency Cepstral Coefficient (MFCC) approach. Their paper explored the viability of the MFCC technique for extracting features from Quranic verse recitation. Feature extraction is crucial to prepare data for the classification process. The authors were able to recognize and differentiate Quranic Arabic utterances and pronunciation based on the extracted feature vectors. Tolba et al. [8] have also reported a new method for Arabic consonant/vowel segmentation using the wavelet transform. In their paper, a new algorithm for Arabic speech consonant and vowel segmentation without linguistic information was presented. The method was based on the wavelet transform and spectral analysis and focused on searching for the transition between the consonant and vowel parts in certain levels of the wavelet packet decomposition. The accuracy rate was about 88.3% for consonant/vowel segmentation and the rate remained fixed at both low and high signal to noise ratios (SNR). Previously, Newman et al. [9] worked on a frequency analysis of Arabic vowels in connected speech. Their findings do not confirm the existence of a high classical style as an acoustically ‘purer’ variety of modern standard Arabic. Alghamdi [10] carried out an interesting spectrographic analysis of Arabic vowels based on a cross-dialect study. He investigated whether Arabic vowels are the same at the phonetic level when spoken by speakers of different Arabic dialects, including Saudi, Sudanese, and Egyptian dialects. The author found that the phonetic implementation of the standard Arabic vowel system differs according to dialect. Previously, Al-Otaibi [11] also developed an automatic Arabic vowel recognition system. Isolated Arabic vowel and isolated Arabic word recognition systems were implemented. The work investigated the syllabic nature of the Arabic language in terms of syllable types, syllable structures, and primary stress rules. In this study, we carry out a formant based analysis of the six Arabic vowels as used in MSA. By changing the vocal tract shape, different forms of a perfect tube are produced, which in turn change the preferred frequencies of vibration. Each of the preferred resonating frequencies of the vocal tract (corresponding to the relevant bump in the frequency response curve) is known as a formant. These are usually referred to as F1 for the first formant, F2 for the second formant, F3 for the third formant, etc. That is, by moving around the tongue body and the lips, the position of the formants can be changed [2].
In vowels, F1 can vary from 300 Hz to 1000 Hz. The lower it is, the closer the tongue is to the roof of the mouth. F2 can vary from 850 Hz to 2500 Hz. The value of F2 is proportional to the frontness or backness of the highest part of the tongue during the production of the vowel. In addition, lips' rounding causes a lower F2 than with
unrounded lips. F3 is also important in determining the phonemic quality of a given speech sound, and the higher formants such as F4 and F5 are thought to be significant in determining voice quality. The rest of this paper is organized as follows: Section 2 introduces the experimental framework employed in this study. The results are described and discussed in Section 3, and some concluding remarks and future work suggestions are given in Section 4.
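The paper reports formant values but does not list its extraction code. Purely as an illustration of how F1-F3 can be estimated from a vowel frame, the sketch below uses the standard autocorrelation (LPC) method with numpy/scipy; the LPC order, windowing and thresholds are our own assumptions, not the authors' settings.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_polynomial(frame, order=12):
    """Autocorrelation-method LPC: return the prediction-error polynomial A(z)."""
    frame = frame * np.hamming(len(frame))
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])   # normal equations R a = r
    return np.concatenate(([1.0], -a))            # A(z) = 1 - a1 z^-1 - ... - ap z^-p

def estimate_formants(frame, fs, order=12, n_formants=3):
    """Estimate the first formants (Hz) from the angles of the LPC roots."""
    roots = [z for z in np.roots(lpc_polynomial(frame, order)) if np.imag(z) > 0]
    freqs = sorted(np.angle(roots) * fs / (2.0 * np.pi))
    return [f for f in freqs if f > 90.0][:n_formants]   # drop near-DC roots

# Toy usage on a synthetic 30 ms frame at the paper's 16 kHz sampling rate.
fs = 16000
t = np.arange(int(0.03 * fs)) / fs
frame = np.sin(2 * np.pi * 700 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
print(estimate_formants(frame, fs))
```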
2 Experimental Framework The allowed syllables in the Arabic language are: consonant-vowel (CV), consonant-vowel-consonant (CVC), and consonant-vowel-consonant-consonant (CVCC), where V indicates a (long or short) vowel while C indicates a consonant. Arabic utterances can only start with a consonant [1]. Table 1 shows the eight Arabic vowels along with their names, examples, and IPA symbols. In this paper the formants of the Arabic vowels are analyzed to determine their values. These are expected to prove helpful in subsequent speech processing tasks such as vowel and speech recognition and classification. In carrying out the analysis, the Arabic vowels have been viewed as if they were patterns on paper. Specifically, the vowels were plotted on paper or a computer screen in the form of their time waveforms, spectrograms, formants, and LPC spectra. Comparisons and investigations are used as the vehicle to accomplish the goals of this research. At a later stage, these Arabic phonemes will be employed as input to a speech recognition system for classification purposes. An in-house database was built to help in investigating the Arabic vowels, based on well-selected and fixed phonemes. The utterances of ten male Arabic speakers, all aged between 23 and 25 years with the exception of one child, were recorded. Nine of the speakers are from different regions in Saudi Arabia and the remaining one is from Egypt. Each of the ten speakers participated in five different trials for every carrier word in the data set, using all eight intended Arabic phonemes. Some of the speakers recorded the words in one session and others in two or three sessions. The carrier words were chosen to represent different consonants before and after the intended vowel. These carrier words are displayed in Table 2 using the second vowel /a:/. The sampling rate used in recording these words is 16 kHz with 16-bit resolution, mono. The total number of recorded audio tokens is 4000 (i.e., eight phonemes times ten speakers times ten carrier words times five trials for each speaker). These audio tokens are used for analyzing the intended phonemes in frequency, during the training phase of the recognition system, and in its testing phase. Table 1. Arabic vowels
Table 2. Carrier words used in the study, using the second vowel /a:/
3 Results The first part of the experiments is to evaluate values of the first and second formants, namely F1 and F2, in all considered Arabic vowels. This study considered frames in the middle of each vowel to minimize the co-articulation effects. Distribution of all vowels for all speakers with respect to the values of F1 and F2 is shown in Figure 1, from which we can see that the location of the /ay/ vowel distribution is far from both vowel /a/ and vowel /i/. This implies that there is a big difference between this vowel, /ay/, and the two components that form it (/a/ and /i/). In Figure 1 we can see an overlap between the vowel /aw/ and its constituent vowels, /a/ and /u/. Based on the figure, we can estimate the Arabic vowel triangle’s location as (400,800), (700,1100), and (400,2100) where the first value is for F1 and the second value is for F2. Figure 2 shows a plot for all short vowels for one of the speakers for three different trials. It can be seen from Figure 2 that the F1 value is relatively high for /a/, medium for /u/, and minimum for /i/. But in the case of F2, it is medium for /a/, minimum for /u/ and high for /i/. For the long vowels it is clear that the same situation is true as observed for their short counterparts. The long vowels are peripheral while their short counterparts are close to center when the frequencies of the first two formants are plotted on the formant chart. The position of /aw/ can be seen to be between /a:/ and /u:/ and the position of /ay/ is between /a:/ and /i:/. F1 can be used to classify between /a:/ and /u:/ and between /a/ and /u/. F2 can be used to classify between /i:/ and /u:/ and between /i/ and /u/ as can be inferred from Figure 2.
Fig. 1. Vowels distribution depending on F1 and F2 for all speakers
From the obtained results, it can be seen that F1 of /a/ is smaller than F1 of /a:/, while F2 of /a/ is smaller than F2 of /a:/, and the values of F3 are close for both of the durational counterparts. Also, F1 of /u/ is larger than F1 of /u:/ except for the child, whereas F2 of /u/ is smaller than F2 of /u:/ for all speakers and the values of F3 are close for both of them. Similarly, it has been found that F1 of /i/ is larger than F1 of /i:/ and F2 of /i/ is smaller than F2 of /i:/. To conclude these specific outcomes, it can be said that F1 in a given short vowel is larger than F1 in its long counterpart except for /a/ and /a:/; and F2 in a given short vowel is larger than F2 in its long counterpart except for /a/ and /a:/. These findings confirm those reported in previous studies [10]. Further, it can be seen from Figure 2 that, on the basis of F3, Arabic vowels can be classified into three groups: group 1 contains the vowels /u/, /u:/ and /aw/, where F3 is less than 2700 Hz; group 2 contains the vowels /a/, /a:/ and /i/, where F3 is more than 2700 Hz and less than 2760 Hz; and group 3 contains the vowels /i:/ and /ay/, where F3 is more than 2760 Hz. Also, F1 can be used to classify between /a/ and /u/, and F2 can be used to classify between /i/ and /u/. The vowel /a/ has the largest value of F1 and /i/ has the largest value of F2. The vowel /i/ has the smallest value of F1 and /u/ has the smallest value of F2. F1 can be used to classify between /a:/ and /u:/. F2 can be used to classify between /i:/ and /u:/. The vowel /a:/ has the largest value of F1 and /i:/ has the largest value of F2. The vowel /i:/ has the smallest value of F1 and /u:/ has the smallest value of F2.
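The F3-based grouping stated above translates directly into a small rule, shown below as an illustration (the 2700 Hz and 2760 Hz thresholds are the values quoted in the text).

```python
def f3_group(f3_hz):
    """Assign an Arabic vowel to one of the three F3-based groups described above."""
    if f3_hz < 2700:
        return "group 1: /u/, /u:/, /aw/"
    elif f3_hz <= 2760:
        return "group 2: /a/, /a:/, /i/"
    else:
        return "group 3: /i:/, /ay/"

print(f3_group(2660.2))   # average F3 of /u/ from Table 3 -> group 1
print(f3_group(2788.3))   # average F3 of /i:/ from Table 3 -> group 3
```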
Fig. 2. Values of F1 and F2 for short vowels for Speaker 6 over three trials
In Arabic vowels, as mentioned earlier, F1 can vary from 300 Hz to 1000 Hz, and F2 can vary from 850 Hz to 2500 Hz. F3 is also important in determining the phonemic quality of a given speech sound, and the higher formants such as F4 and F5 are thought to be significant in determining voice quality. In this case, as can be seen from Table 3, which shows the Arabic vowel formants averaged over all speakers, for short vowels /a/ has the largest value of F1 and /i/ has the largest value of F2. In /a/ and /a:/ the whole tongue goes down, so the vocal tract becomes wider than in producing other Arabic vowels. In /u/ and /u:/ the end of the tongue comes near to the palate while the other parts of the tongue are in the regular position. In /i/ and /i:/ the front of the tongue comes near to the palate whereas other parts remain in their regular position. The lips are more rounded for /u/ and /u:/ than for /i/ and /i:/. Figure 3 shows the signal in the time domain and the spectrograms of the specific carrier word used with all eight vowels. The time domain representations in Figure 3 confirm that all utterances are CVC, as the signal energy in the middle is mostly at its maximum. Also, the formants' usual patterns can be noticed in the spectrograms. In addition, the similarities of the first (and final) consonant in all words can be clearly noticed, since the same first and final consonants are used in all the plots (with just the vowel being varied).
Table 3. Arabic vowel formants averaged over all speakers
Formant   /a/      /a:/     /u/      /u:/     /i/      /i:/
F1        590.8    684.4    488.8    428.7    479.3    412.2
F2        1101.9   1193.3   975.2    858.6    1545     2131.9
F3        2755     2750.5   2660.2   2594.5   2732.5   2788.3
F4        3581.5   3665.7   3534.7   3426.9   3573.2   3599.8
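Using the averaged formant values of Table 3 as centroids, a simple nearest-centroid classifier over (F1, F2) can be sketched as follows; the Euclidean distance and the restriction to F1/F2 are our own simplifications, not part of the study.

```python
import math

# (F1, F2) averages from Table 3, in Hz.
CENTROIDS = {
    "/a/":  (590.8, 1101.9), "/a:/": (684.4, 1193.3),
    "/u/":  (488.8, 975.2),  "/u:/": (428.7, 858.6),
    "/i/":  (479.3, 1545.0), "/i:/": (412.2, 2131.9),
}

def classify_vowel(f1, f2):
    """Return the vowel whose (F1, F2) centroid is closest in Euclidean distance."""
    return min(CENTROIDS, key=lambda v: math.hypot(f1 - CENTROIDS[v][0], f2 - CENTROIDS[v][1]))

print(classify_vowel(600, 1100))   # close to /a/
print(classify_vowel(420, 2000))   # close to /i:/
```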
Fig. 3. Time waveforms and spectrogram for Speaker 9, Word 2, and Vowels 1,5,7 and 8
In summary, these formants can thus be seen to be very effective in classifying vowels correctly and can be used in a future speech recognition system. Formants of the vowels can be included explicitly in the feature extraction module of the recognizer. If such a system is able to recognize the different vowels then this will tremendously assist in the Arabic speech recognition process. The reason behind this is that every word and syllable in Arabic language must contain at least one vowel; hence vowel recognition will play a key role in identifying the spoken utterance.
4 Conclusions This paper has presented a new formant based analysis of Arabic vowels using a spectrogram technique. The Arabic vowels were studied as if they were patterns shown on screen or paper. Modern standard Arabic has six basic vowels and two diphthongs, which are considered by some linguists to be vowels rather than diphthongs. Thus the number of vowels in this study was considered to be eight (including the two diphthongs). All these eight Arabic phonemes were included in constructing the database deployed in the investigation, which has shown that the formants are very effective in classifying vowels correctly. In the near future, a recognition system will be built for classifying these eight phonemes, and an error performance analysis of the recognition system will be carried out to acquire further knowledge and infer related conclusions about Arabic vowels and diphthongs. Other planned future work will extend the present study to include vowels of classical Arabic (used in the Quraan).
Acknowledgment The authors would like to acknowledge the British Council (in Riyadh, Saudi Arabia) for funding this collaborative research between King Saud University and the University of Stirling.
References 1. Alkhouli, M.: Alaswaat Alaghawaiyah. Daar Alfalah, Jordan (1990) (in Arabic) 2. Deller, J., Proakis, J., Hansen, J.H.: Discrete-Time Processing of Speech Signal. Macmillan, Basingstoke (1993) 3. Alghamdi, M.: Arabic Phonetics. Al-Toubah Bookshop, Riyadh (2001) (in Arabic) 4. Omar, A.: Study of Linguistic phonetics. Aalam Alkutob, Eygpt (1991) (in Arabic) 5. Elshafei, M.: Toward an Arabic Text-to -Speech System. The Arabian Journal for Scince and Engineering 16(4B), 565–583 (1991) 6. Iqbal, H.R., Awais, M.M., Masud, S., Shamail, S.: New Challenges in Applied Intelligence Technologies. In: On Vowels Segmentation and Identification Using Formant Transitions in Continuous Recitation of Quranic Arabic, pp. 155–162. Springer, Berlin (2008) 7. Razak, Z., Ibrahim, N.J., Tamil, E.M., Idris, M.Y.I., Yakub, M., Yusoff, Z.B.M.: Quranic Verse Recitation Feature Extraction Using Mel-Frequency Cepstral Coefficient (MFCC). In: Proceedings of the 4th IEEE International Colloquium on Signal Processing and its Application (CSPA), Kuala Lumpur, MALAYSIA, March 7-9 (2008) 8. Tolba, M.F., Nazmy, T., Abdelhamid, A.A., GadallahA, M.E.: A Novel Method for Arabic Consonant/Vowel Segmentation using Wavelet Transform. International Journal on Intelligent Cooperative Information Systems, IJICIS Vol 5(1), 353–364 (2005) 9. Newman, D.L., Verhoeven, J.: Frequency Analysis of Arabic Vowels in Connected Speech, pp. 77-87 10. Alghamdi, M.M.: A spectrographic analysis of Arabic vowels: A cross-dialect study. Journal of King Saud University 10(1), 3–24 (1998) 11. Al-Otaibi, A.: Speech Processing. The British Library in Association with UMI (1988)
Key Generation in a Voice Based Template Free Biometric Security System
Joshua A. Atah and Gareth Howells
Department of Electronics, University of Kent, Canterbury, Kent, CT2 7NT, United Kingdom
{JAA29,W.G.J.Howells}@kent.ac.uk
Abstract. Biometric systems possess a major drawback in their inability to be revoked and re-issued, as would be the case with passwords, if lost or stolen. The implication is that once a biometric source has been compromised, the owner of the biometric as well as the data protected by the biometric is compromised for life. These concerns have motivated research in template-free biometrics, which exploits the possibility of directly encrypting the biometric data provided by the individual and therefore eliminates the need for storing templates used for data validation, thus increasing the security of the system. Template-free systems function in two stages: (a) Calibration, during which feature distribution maps of typical users are generated from known biometric samples without storing any personal information, and (b) Operation, which uses the feature distribution maps as a reference to generate encryption keys from samples of previously unseen users and also rebuilds the key when needed from new sets of previously unseen samples. In this report, we use a combination of stable features from the human voice to directly generate biometric keys, using a novel method of feature concatenation to combine the binary information at the operation phase. The stability of the keys is superior to current key generation methods, and the elimination of biometric templates improves the safety of personal information and in turn increases users' confidence in the system. Keywords: Biometrics, Security, Template-free, Voice features.
1 Introduction In all current biometric systems, which are based on templates and operate by measuring an individual's physical features in an authentication inquiry and comparing this data with stored biometric reference data [6], there are concerns, particularly about the safety of the personal biometric information that is stored on the template [1, 7]. One of these concerns is that once a biometric source has been compromised, it cannot be re-issued, unlike passwords, which can be cancelled and re-issued if lost or stolen. Therefore, the owner of the biometric as well as the data protected by the biometric is compromised for life, because users cannot ever change their features. Although cancellable biometrics seeks to address the re-issuability of compromised biometric keys, it is still based on the storage of users' information on templates and therefore it is challenged by the integrity of the owner of the biometric sensor, who may pre-record
biometric samples, as well as those with access to the algorithm, who may have, and can use, privileged rights at the pre- and post-template stages of the system. Our current research in template-free biometrics principally addresses all concerns associated with the template storage of an individual's physical features as reference data for authentication/verification [6, 10]. The idea is novel and exploits the possibility of directly encrypting the biometric data provided by the individual, therefore eliminating the need for storing templates used for data validation. Template-free systems have two stages, called the 'Calibration' and 'Operation' phases. Abstractly, in the Calibration phase, users provide samples of the given biometric to generate typical user normalisation maps, then allowing encryption keys to be generated directly from certain features in the samples. As a result, the proposed system requires no storage of biometric templates and no storage of any private encryption keys. In the Operation phase, a new set of samples is provided by individual users, from which new keys may be generated directly. These are previously unseen samples (that have not been stored anywhere). Message data is encrypted first with the receiver's asymmetric public key, and a digest of the message is then encrypted with the sender's asymmetric private key, regenerated via new biometric samples, to form a digital signature. The encrypted message is then sent to the receiver. Figure 1 below shows a conceptual key generation system.
Fig. 1. A conceptual biometric key generation system
In the operation phase, the encrypted message is decrypted first with the sender's asymmetric public key to verify the sender. Thereafter, the decrypted message is further decrypted with the receiver's asymmetric private key, again regenerated via further biometric samples.
The novelty of the current proposal lies in the development of techniques for the direct encryption of data extracted from biometric samples which characterise the identity of the individual. Such a system offers the following significant advantages:
• The removal of the need to store any form of template for validating the user, hence directly addressing the disadvantages of template compromise.
• The security of the system will be as strong as the biometric and encryption algorithm employed (there is no back door). The only mechanisms to gain subsequent access are to provide another sample of the biometric or to break the cipher employed by the encryption technology.
• The (unlikely) compromise of a system does not release sensitive biometric template data which would allow unauthorised access to other systems protected by the same biometric, or indeed any system protected by any other biometric templates present.
A further significant advantage relates to the asymmetric encryption system associated with the proposed technique. Traditional systems require that the private key for decrypting data be stored in some way (memorising a private key is not feasible). With the proposed system, the key will be uniquely associated with the given biometric sample, and a further biometric sample will be required to generate the required private key. As there is no physical record of the key, it is not possible to compromise the security of sensitive data via unauthorised access to the key. Our previous publication [6] identified some of the voice features that are considered suitable for template-free biometrics: maximum, average and minimum power spectral density (PSD); minimum FFT; mean and minimum amplitude; minimum and mean cepstrum; maximum, minimum and mean IFFT; and maximum, minimum and mean Hilbert function.
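As a rough illustration of how such statistics could be obtained with standard signal-processing tools, the sketch below computes a subset of the listed features; scipy/numpy and the exact estimators (Welch PSD, real cepstrum, Hilbert envelope) are our assumptions, since [6] does not prescribe an implementation.

```python
import numpy as np
from scipy.signal import welch, hilbert

def voice_features(signal, fs):
    """Compute a few of the voice statistics listed above."""
    _, psd = welch(signal, fs=fs)                       # power spectral density
    spectrum = np.abs(np.fft.fft(signal))
    cepstrum = np.real(np.fft.ifft(np.log(spectrum + 1e-12)))
    envelope = np.abs(hilbert(signal))                  # Hilbert envelope
    return {
        "max_psd": psd.max(), "mean_psd": psd.mean(), "min_psd": psd.min(),
        "min_fft": spectrum.min(),
        "mean_amplitude": np.mean(np.abs(signal)), "min_amplitude": np.abs(signal).min(),
        "min_cepstrum": cepstrum.min(), "mean_cepstrum": cepstrum.mean(),
        "max_hilbert": envelope.max(), "min_hilbert": envelope.min(), "mean_hilbert": envelope.mean(),
    }

# Toy usage on a synthetic one-second signal sampled at 8 kHz.
fs = 8000
x = np.sin(2 * np.pi * 150 * np.arange(fs) / fs)
print(voice_features(x, fs)["max_psd"])
```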
2 Feature Distribution Maps The initial phase in template free biometrics is the Calibration phase, which yields feature distribution maps for typical users of the system. Calibration begins with the presentation of known biometric samples by all the users.
Fig. 3. Schematic representation of the Calibration phase: signal input → pre-processing → feature extraction (features are represented as integers and the original samples are deleted) → feature score normalisation and quantisation → generation of feature distribution/normalisation maps → data/files are encrypted from the normalised and combined biometric features for operating the system.
All voice samples are pre-processed to determine the sampling frequency and framing, and to ensure relative stability/standardisation in spite of the variances in the capture device. Useful features [6] are extracted in the form of measurable integers and then normalised over all users within a given feature space in order to reduce the effect of score variability. For a defined quantisation interval, users' probability density functions are calculated within the feature space. These are used to generate the normalisation maps, which for useful features are bell-shaped curves. The probability values
within a pre-defined percentage of the interval from the mean are considered for generating the biometric key. This is because of overlaps in users' information within the curve. For the purpose of determining the keys, the probability values are further converted into binary form to provide more specific information about the stable bits, which are then combined to build the encryption keys for each user.
Normalisation and Quantisation. The min-max normalisation technique is employed, bearing in mind that the voice modality, to a reasonable extent, does not generate unusual distributions (multiple modes). Suppose that x_i, i = 1, 2, …, n, is a score to be fused; then the min-max method is given by
x_norm = (x_i − min(x_i)) / (max(x_i) − min(x_i))   [5], [7]
The min and max sample score values are those of users within each feature space, where xi is the individual sample value; min( xi ) is the smallest value of x in all the users for that feature space, and
max( xi ) is the largest value of x in all the users for
that feature space. A fixed quantisation interval between 0 and 1 is used per feature space. For each value in the quantisation interval, the mean μ and standard deviation σ per user are used to calculate the normal probability density function, given by p(x) = (1/(σ√(2π))) · exp(−(x − μ)² / (2σ²)).
The p(x) values are plotted against the quantisation space. It should, however, be noted that there are overlaps in users' characteristics, increasing the likelihood that one user's key is similar to that of another. The p(x) values are further converted into Gray code to provide more specific information about the samples. Values within the range of 10% from the mean on both sides of the curve are considered most useful and are then combined to build the encryption keys for each user. The stored user distribution maps form the basis for the operation phase of the system.
Fig. 4. User distribution maps
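A compact sketch of the calibration computation described above (min-max normalisation across all users of a feature space, then a per-user Gaussian evaluated on a fixed quantisation grid) is given below; the grid size and variable names are illustrative choices of ours.

```python
import numpy as np

def min_max_normalise(scores):
    """Normalise one feature's scores across all users to the range [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.min()) / (scores.max() - scores.min())

def user_distribution_map(user_scores, grid=np.linspace(0.0, 1.0, 11)):
    """Evaluate the user's normal pdf p(x) on the quantisation grid."""
    mu, sigma = np.mean(user_scores), np.std(user_scores) + 1e-12
    return np.exp(-(grid - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Toy example: one feature space, three enrolled users with five samples each.
raw = np.array([[3.1, 3.0, 3.2, 2.9, 3.1],
                [5.0, 5.2, 4.9, 5.1, 5.0],
                [4.0, 4.1, 3.9, 4.2, 4.0]])
normalised = min_max_normalise(raw)
print(user_distribution_map(normalised[0]))
```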
3 Key Generation Key generation for encryption and decryption takes place in the operation phase of a template-free biometric system. In all cases, the signal pre-processing and feature extraction processes take place as described earlier. The system then references the feature distribution maps to generate the bits used as biometric keys per feature space that represent individual users. The keys are then combined using a novel method of concatenation to produce a single key, which is used to encrypt messages/data. For decryption, the same process is followed and a new key is generated from previously unseen samples provided by the user, which have not been stored anywhere.
Fig. 5. Schematic representation of the Operation phase: signal input → pre-processing → feature extraction (features are represented as integers and the original samples are deleted) → feature score normalisation and quantisation → the system references the feature distribution maps to generate binarised keys → combination of bits from the various features → encryption and decryption keys are generated as required.
Binarisation. The binarisation process is introduced to convert the probability distribution scores within the quantised intervals into Gray code. This ensures precision and absolute score stability for the algorithm at every instance of encryption and decryption. The acceptable quantisation intervals used in this case correspond to the probability distribution values within the range of 10% deviation from the mean. Gray code is used because of its unit-distance code property, i.e. adjacent code words differ in one digit position only.
Feature bits concatenation. For each user, the keys generated are mostly stable within a given region of percentage deviation on either side of the mean, since the keys begin to degenerate when bits beyond a certain range are considered. As a result, the key generated per feature space is very short, and therefore all the bits from all the features are required to produce a single long key. A novel key combination method using concatenation is used to produce a long biometric key.
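The Gray-coding and concatenation steps can be illustrated as follows; quantising the retained p(x) values to fixed-width integers before Gray coding is our own assumption about a detail left open above.

```python
def gray_code(n: int) -> int:
    """Binary-reflected Gray code: adjacent integers differ in exactly one bit."""
    return n ^ (n >> 1)

def feature_bits(p_values, n_bits=4):
    """Quantise the retained p(x) values to n_bits integers and Gray-code them."""
    top = max(p_values) or 1.0
    levels = [int(round(v / top * (2 ** n_bits - 1))) for v in p_values]
    return "".join(format(gray_code(level), f"0{n_bits}b") for level in levels)

def build_key(per_feature_p_values):
    """Concatenate the bits obtained from every feature space into one long key."""
    return "".join(feature_bits(p_values) for p_values in per_feature_p_values)

# Toy example: two feature spaces, each contributing its four highest p(x) values.
key = build_key([[4.3e-8, 7.3e-8, 2.6e-79, 2.0e-222], [0.9, 0.7, 0.2, 0.1]])
print(key, len(key), "bits")
```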
The concept of the research is that, rather than referencing a stored template of users' information, user candidates must always present their samples every time a file needs to be encrypted or decrypted, but none of the candidate's samples will be recorded. Thus the keys used by the system are reproduced at every instance of operation, but neither these keys nor the samples from which they are derived are stored on any form of template.
4 Experiments and Results Our datasets are: (i) the VALID database [3] and (ii) the Biosecure database [9]. The VALID database contains 530 samples (consisting of five recordings for each of 106 individuals over a period of one month), each one uttering the sentence "Joe Took Father's Green Shoebench Out". The recordings were acquired across different environments/backgrounds – noisy, real world, and others in an office scenario with no control on acoustic noise. The Biosecure database is a multimodal biometric dataset, a product of the Biosecure foundation. Each user candidate presented two sessions, which were collected on different occasions, and the sessions are used for calibration and operation respectively. The calibration and operation tests are carried out using features that have been established as suitable. For all probability values within the quantisation interval in the distribution, values beyond 10% on either side of the mean generate bits equal to or close to zero and are therefore not considered useful in generating the keys. A typical user distribution graph and table is shown below.
Quantization interval:  0            0.1          0.2          0.3           0.4  0.5  0.6  0.7  0.8  0.9  1
Normal PDF:             4.32459E-08  7.32018E-08  2.63221E-79  2.0107E-222   0    0    0    0    0    0    0
For a quantisation interval on a scale of 10, the four highest probability values within the quantisation space are considered for the key generation bits. Although this is the first attempt at using voice signals to generate template-free biometrics, the experiments on the 106 users in the VALID database generate unique
biometric keys representing the users. This is an improvement over the previous use of the same database in a template-based system.
5 Conclusion This research introduces a novel method of biometric key generation in a template-free biometric system. The two-stage process of calibration and operation enables new biometric keys to be built at every instance of operation, thereby transferring the safe custody of the personal information to the individual users. A template-free biometric encryption system addresses the concerns associated with the compromise of stored template data. It is a technique that directly encrypts the biometric data provided by the individual and therefore eliminates the need for storing templates used for data validation, thus increasing the security of, and in turn the confidence in, the system. Its application ranges from secure document exchange over electronic media to instant encryption of mobile telephone conversations based on the voice samples provided by the speaker.
References [1] Maltoni, Anil, Wayman, Dario (eds.): Biometric Systems: Technology, Design and Performance Evaluation. Springer, Heidelberg (2002) [2] Wayman, J.: Fundamentals of biometric authentication technologies. Int. J. Imaging and Graphics 1(1) (2001) [3] http://ee.ucd.ie/validdb/datasets.html [4] Poh, N., Bengio, S.: A study of the effect of score normalization prior to fusion in Biometric Authentication Tasks (December 2004) [5] Rumsey, D.: Statistics for Dummies. Wiley publishing Inc., Indiana (2003) [6] Atah, J.A., Howells, G.: Score Normalisation of Voice Features for Template Free Biometric Encryption. In: The 2008 multi-conference in computer science, information technology, computer engineering, control and automation technology, Orlando, FL, USA (July 2008) [7] Nandakumar, K.: Integration of multiple cues in biometric systems, PhD thesis, Michigan State University (2005) [8] Koutsoyiannis, A.: Theory of Econometrics, 2nd edn. Palgrave, New York [9] http://biosecure.it-sudparis.eu/AB/ [10] Sheng, W., Howells, G., Fairhurst, M.C., Deravi, F.: Template-free Biometric Key Generation by means of Fuzzy Genetic Clustering. Information Forensics and Security 3(2), 183–191 (2008) [11] Deravi, F., Lockie, M.: Biometric Industry Report - Market and Technology Forecasts to 2003, Elsevier Advanced Technology (December 2000) [12] Bolle, R.M., Connell, J.H., Ratha, N.K.: Biometric perils and patches. Pattern Recognition 35(12) (December 2002) [13] Howells, W.G.J., Selim, H., Hoque, S., Fairhurst, M.C., Deravi, F.: An Autonomous Document Object (ADO) Model. In: Proceedings of the 6th International Conference on Document Analysis and Recognition (ICDAR 2001), Seattle, Washington, USA, September 2001, pp. 977–981 (2001)
[14] Hoque, S., Selim, H., Howells, G., Fairhurst, M.C., Deravi, F.: SAGENT: A Novel Technique for Document Modeling for Secure Access and Distribution. In: Proceedings of the 7th International Conference on Document Analysis and Recognition (ICDAR 2003), Edinburgh, Scotland (2003) [15] Howells, G., Selim, H., Fairhurst, M.C., Deravi, F., Hoque, S.: SAGENT: A Model for Security of Distributed Multimedia. Submitted to IEEE Transactions on System, Man and Cybernetics [16] Rahman, A.F.R., Fairhurst, M.C.: Enhancing multiple expert decision combination strategies through exploitation of a priori information sources. IEE Proc. on Vision, Image and Signal Processing 146, 1–10 (1999) [17] Sirlantzis, K., Hoque, S., Fairhurst, M.C.: Trainable multiple classifier schemes for handwritten character recognition. In: Proc. 3rd Int. Workshop on Multiple Classifier Systems, Cagliari, Italy, pp. 169–178 [18] Chibelushi, C.C., Mason, J.S.D., Deravi, F.: Audio-Visual Person Recognition: An Evaluation of Data Fusion Strategies. In: Proc. European Conference on Security, London, April 28-30, 1997, pp. 26–30. IEE (1997) [19] Jain, A.K., Prabakar, S., Ross, A.: Biometrics Based Web Access. Technical Report TR98-33, Michigan State University (1998)
Extending Match-On-Card to Local Biometric Identification
Julien Bringer1, Hervé Chabanne1,2, Tom A.M. Kevenaar3, and Bruno Kindarji1,2
1 Sagem Sécurité, Osny, France
2 Télécom ParisTech, Paris, France
3 priv-ID, Eindhoven, The Netherlands
Abstract. We describe the architecture of a biometric terminal designed to respect privacy. We show how to rely on a secure module equipped with Match-On-Card technology to ensure the confidentiality of biometric data of registered users. Our proposal relies on a new way of using the quantization functionality of Secure Sketches that enables identification. Keywords: Biometric terminal, Privacy, Identification.
1
Introduction
This paper aims at giving a solution to the problem of securing a biometric access control terminal. For privacy reasons, we want to protect the confidentiality of the biometric data stored inside the terminal. We follow the example of payment terminals, which rely on a Secure Access Module (SAM, think of a dedicated smartcard) to provide secure storage for secret elements. The same goes for GSM phones and their SIM card. Our work follows this example and is adapted to biometric specificities to handle local identification, i.e. access control through a biometric terminal. The sensitive operation of matching fresh biometric data against the biometric references of the registered users is made inside a SAM equipped with Match-On-Card (MOC) technology. To optimize the performance of our process, we reduce the number of these MOC comparisons by using a faster pre-comparison of biometric data based on their quantization. This way of proceeding partially comes from [10], where the computations needed for identifying an iris among a large database are sped up with a similar trick. To quantize biometric data, we mostly rely on the work already done in the context of Secure Sketches [2, 6, 8, 9, 11, 13, 17, 18, 19]. We focus on fingerprint-based identification, though more biometrics are possible, e.g. iris or face recognition, etc. However, Secure Sketches are not fit for biometric identification; moreover, their security is defined only in an information-theoretic setting. A direct application of such a scheme is thus leaky [3, 16], and our proposal takes these weaknesses into account.
The novelty introduced in this paper concerns both identification and privacy. We fully describe the architecture of a biometric access-control terminal that is based on Match-On-Card, and thus extend the security properties of MOC.
2
Preliminaries
We focus on biometric identification. In such a setting, one of the main concerns is the privacy of the database containing all the biometric records. Indeed, we do not want such a collection of personal characteristics to be used against their owners. This requires deploying an adequate solution so that no one is able to obtain information about a biometric feature stored within the database. It leads to an architectural issue: how is it possible to have a system in which the biometric data remains secure from capture, through storage, until it is used by a trusted matching device? Another concern is the efficiency of such a system: we wish to identify a user with a number of comparisons sublinear in the database size. 2.1
Biometric Templates
In the following, we use two different kinds of biometric templates. The first one is used by classical biometric matching algorithms. For example, for fingerprints, one may think of the coordinates of the minutiae points. There is a large literature on this subject and we refer to [7] for more details. We also introduce what we call quantized templates. Let k be a natural integer, and B = {0,1}^k be the binary Hamming space, equipped with the Hamming distance d. d returns the number of coordinates on which two vectors differ. In the following, we consider that our biometric terminal captures a biometric feature from an individual u and translates it into a binary vector - the quantized template v ∈ B. Quantized templates must verify the following properties. Two different quantized templates v, v' of the same user are with high probability "close", i.e. at a small Hamming distance d(v, v') ≤ m; quantized templates v1, v2 of different users u1, u2 are with high probability "far away", i.e. at a Hamming distance d(v1, v2) > m', with m < m'. A summary of the notations used throughout the paper is given in Table 1. Comparisons between quantized templates are fast, as they simply consist in computing a Hamming distance. Storage of a quantized template is also less demanding, as we envisage a few hundred bits for representing it (see Sec. 4).
Table 1. Summary of Variables and Parameters
Parameters: B = {0,1}^k: the Hamming space; d: the Hamming distance; n: number of registered users; c: maximal number of MOC comparisons.
Variables: ui: the user number i; bi: enrolled template for ui; vi: quantized version of bi; b'j, v'j: candidate templates.
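As an illustration of how cheap the quantized-template comparison is, a minimal sketch follows; templates are held as Python integers and the threshold m is an arbitrary placeholder, since no concrete value is fixed at this point.

```python
def hamming_distance(v1: int, v2: int) -> int:
    """Number of differing bits between two k-bit quantized templates stored as integers."""
    return bin(v1 ^ v2).count("1")

def same_user(v, v_prime, m=48):
    """Accept the pair as coming from the same user if d(v, v') <= m."""
    return hamming_distance(v, v_prime) <= m

# Toy 16-bit templates (real quantized templates use a few hundred bits).
enrolled, fresh = 0b1011001110001111, 0b1011001010001101
print(hamming_distance(enrolled, fresh), same_user(enrolled, fresh, m=3))
```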
Remark 1. Quantized templates have been introduced in the context of Secure Sketches [11, 13, 18, 19]. Given a quantized template x, we can indeed compute a secure sketch as c ⊕ x, where c is a random codeword from a given code. [17] explicitly reports the construction of quantized templates for fingerprints. The theory behind the quantization of biometrics, and issues such as pre-alignment, are not within the scope of this paper; see Section 4 for our implementation or [5, 4, 12] for more recent background on this subject.
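To make the code-offset construction of Remark 1 concrete, here is a toy sketch in which a 3-fold repetition code stands in for a real error-correcting code; it is only meant to show that a close template recovers the hidden codeword, not to reflect the codes actually used with fingerprints.

```python
import secrets

def repetition_encode(message_bits, rep=3):
    """Toy error-correcting code: repeat each message bit 'rep' times."""
    return [b for b in message_bits for _ in range(rep)]

def repetition_decode(code_bits, rep=3):
    """Majority-vote decoding of the toy repetition code."""
    return [int(sum(code_bits[i:i + rep]) > rep // 2) for i in range(0, len(code_bits), rep)]

def make_sketch(x, rep=3):
    """Secure sketch of a quantized template x: c XOR x for a random codeword c."""
    message = [secrets.randbelow(2) for _ in range(len(x) // rep)]
    c = repetition_encode(message, rep)
    return [ci ^ xi for ci, xi in zip(c, x)], message

def recover(sketch, x_noisy, rep=3):
    """Recover the hidden message from the sketch and a close template x'."""
    return repetition_decode([si ^ xi for si, xi in zip(sketch, x_noisy)], rep)

x = [1, 0, 1, 1, 0, 1, 0, 0, 1]             # enrolled quantized template (9 bits)
sketch, message = make_sketch(x)
x_noisy = x[:]; x_noisy[4] ^= 1             # one bit differs at verification time
print(message == recover(sketch, x_noisy))  # True: the single error is corrected
```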
2.2
Match-On-Card Technology
A classical way to use biometrics with enhanced security is to store the biometric template on a smartcard. Match-On-Card (MOC) is usually used for biometric authentication. In such a setting, a person is authenticated by first inserting a smartcard into a terminal, and then by presenting his biometrics. The biometric terminal sends the resulting template to the smartcard, which computes a matching score between the fresh template and a previously stored one, and decides whether the two templates come from the same user. Typically, a MOC fingerprint template is stored in about 512 bytes. As the computing power is limited, the matching algorithms for Match-On-Card suffer from more restrictions than usual matching functions. However, the performances are still good enough for real-life applications. As an example, the NIST MINEX test [15] reports a False Reject Rate of 4.7·10^-3 for a False Accept Rate of 10^-2, and a False Reject Rate of 8.6·10^-3 for a False Accept Rate of 10^-3. More detailed results can be found on the project website. The next section describes how to use MOC for biometric identification.
3
A Step by Step Implementation
3.1
Entities Involved
The system architecture depicted in this work aims to combine the efficiency of biometric recognition and the physical security of a hardware-protected component. In practice, we build a biometric terminal (cf. Figure 1) that includes distinct entities:
– a main processing unit,
– a sensor,
– some non-volatile memory; this memory contains what we call the encrypted database, which contains the encryption of all the templates of the registered users,
– a SAM dedicated to the terminal; it can be physically attached to the terminal as a chip with connections directly soldered to the printed circuit board of the terminal. Another possibility is to have a SIM card and a SIM card reader inside the terminal.
Afterwards, when we mention the computations of the biometric terminal, we designate those made by its main processor.
Fig. 1. Our Biometric Terminal
Remark 2. Coming back to the Introduction of this paper and the analogy with payment terminals, we consider in the following that our terminal is tamper-evident [20]. Therefore, attempts at physical intrusion will be detected afterwards. 3.2
Setup
We choose a symmetric encryption scheme, such as, for instance, the AES. It requires a cryptographic key κ which is kept inside the SAM. The SAM thus performs the encryption and decryption. The encryption of x under the key κ is denoted by Enc(x) (we omit κ in order to lighten the notations). Note that no user owns the key κ: there is only one key, independent of the user. We ensure the confidentiality of the templates by encrypting the content of the database under the key k. For n registered users, the database of the terminal stores their n encrypted templates {Enc(b1 ), . . . , Enc(bn )}. Identification through our proposal is made in two steps. To identify the owner of a biometric template b , we first roughly select a list of the most likely templates (bi1 , . . . , bic ) from the database, c < n . This is done by comparing quantized templates, as the comparison of these binary vectors is much faster than a MOC comparison. In a second step, the identification is comforted by doing the c matching operations on the MOC. 3.3
3.3 Enrolment Procedure

The enrolment of a user u_i associated to a (classical) template b_i takes two steps:
1. compute and store the encryption of the template, Enc(b_i), in the database;
2. then compute and store a quantized template v_i in the SAM memory.
Although not encrypted, the quantized templates are stored in the SAM memory and are thus protected from eavesdroppers.

3.4 Access-Control Procedure

When a user u_j presents his biometrics to the sensor, he is identified in the following way:
1. The processor encodes the biometric feature into the associated template b′_j.
2. The processor computes the quantized template v′_j and sends it to the SAM.
3. The SAM compares v′_j with the stored v_1, ..., v_n. It obtains a list of c candidates v_{i1}, ..., v_{ic} for the identification of b′_j.
4. The SAM sequentially requests each of the Enc(b_i) for i ∈ {i_1, ..., i_c}, and decrypts the result into b_i.
5. The SAM completes its task by performing the c MOC comparisons, and finally validates the identity of the owner of b′_j if one of the MOC comparisons leads to a match.
Proposition 1. As the biometric information of the enrolled users remains either in the SAM, or encrypted outside the SAM and decrypted only inside the SAM, our access-control biometric terminal architecture ensures the privacy of the registered users.
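A minimal sketch of this two-step identification flow is given below (an illustration, not the authors' implementation); sam_decrypt and moc_match are hypothetical stand-ins for the SAM's decryption and the Match-On-Card comparison.

```python
# Assumptions: enc_db is the list of encrypted enrolled templates Enc(b_i),
# sam_quantized is the list of quantized templates v_i kept inside the SAM,
# sam_decrypt(ct) and moc_match(b_enrolled, b_fresh) are placeholders.

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def identify(b_fresh, v_fresh, enc_db, sam_quantized, sam_decrypt, moc_match, c=10):
    # Step 1 (fast): pre-screen the n enrolled users by Hamming distance.
    ranked = sorted(range(len(sam_quantized)),
                    key=lambda i: hamming(v_fresh, sam_quantized[i]))
    # Step 2 (at most c MOC comparisons): confirm on decrypted templates.
    for i in ranked[:c]:
        if moc_match(sam_decrypt(enc_db[i]), b_fresh):
            return i          # index of the identified enrolled user
    return None               # rejection
```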
4 Performance of This Scheme

The main observation is that the MOC comparison is the most costly operation here, as a naive identification executes n of them. Based on this fact, we reduce the number of MOC comparisons and focus on selecting the best candidates. For an identification, we trade the time needed for n MOC comparisons within the SAM for the (Hamming) comparison with the n quantized templates, followed by a sort to select the c best candidates, and at most c MOC comparisons. Let μ_MOC (resp. μ_HD(k); Sort(k, n); μ_Dec) be the computation time for a MOC comparison (resp. a Hamming distance computation on k-bit vectors; sorting n integers of size k; a template decryption). Additionally, the feature extraction and quantization of the fresh biometric image is managed outside the SAM by the main processor of the terminal. Neglecting this latter part, the pre-screening of candidates through quantized biometrics improves the identification time as soon as (n − c) × (μ_MOC + μ_Dec) > n × μ_HD(k) + Sort(k, n). Assuming that μ_HD(k) is 2 ms for k ≤ 1000 and that the comparison of two integers of size k also takes 2 ms, this yields the condition (n − c) × (μ_MOC + μ_Dec) > 2(n + n × log2(n)) ms. μ_MOC is generally within 100–500 ms; assume that μ_MOC + μ_Dec takes 200 ms here. Then, for instance with n = 100 and c = 10, this leads to an improvement by a factor of 5.6.
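For concreteness, the back-of-the-envelope comparison above can be reproduced as follows (a sketch using the illustrative timings assumed in the text):

```python
import math

# ~2 ms per Hamming comparison, ~2 ms per sort comparison (n*log2(n) of them),
# and mu_MOC + mu_Dec ~ 200 ms per candidate, as assumed in the text.
def identification_times(n=100, c=10, mu_moc_dec=200.0, mu_hd=2.0):
    naive = n * mu_moc_dec                              # n MOC comparisons
    prescreen = n * mu_hd + 2.0 * n * math.log2(n)      # Hamming pass + sort
    proposed = prescreen + c * mu_moc_dec               # plus c MOC comparisons
    return naive, proposed, naive / proposed

print(identification_times())   # roughly (20000.0, 3528.8, 5.7)
```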
4.1 A Practical Example

To validate our solution for enhancing the security of an access-control terminal, we ran experiments on different fingerprint datasets, based on a modification of the quantization algorithm studied in [17]. Some of our results are highlighted here for the second FVC2000 fingerprint database [14].
Adaptation of the algorithm of [17] towards identification. Tuyls et al. apply an algorithm based on reliable-component quantization to extract stable binary vectors from fingerprints and apply secure sketches to them. From a fingerprint image, a real vector is extracted via the computation of a squared directional field and the Gabor responses of 4 complex Gabor filters.
Table 2. Notations for the algorithm

  n users {u_1, ..., u_n}, M captures per user
  L = 17952: the number of extracted values per capture
  X_{i,j} ∈ (R ∪ {⊥})^L: capture no. j for user u_i; ⊥ is a null component
  (M_i)_t: for t ∈ {1, ..., L}, number of real values among {(X_{i,1})_t, ..., (X_{i,M})_t}
  μ_i, μ: mean (vector) per user, and overall
  Σ^w, Σ^b: within-class and between-class covariance of the X_{i,j}
  Q_i ∈ {0, 1, ⊥}^L: L-long quantized vector for u_i; ⊥ denotes an erasure
  V_i, VM_i: k-long binary quantized vector for u_i, and its mask
Before this encoding, fingerprints are considered as pre-aligned, based mainly on core detection and registration [17]. We adapt their algorithm for identification purposes and to avoid any loss of information due to re-alignment. On the second FVC2000 database, we obtain real vectors with 1984 components of information embedded in a vector of length L = 17952. All the 15968 null components are marked as erasures (i.e. positions where no value is known) for the sequel. To increase the stability of the vectors, an enrolment database containing n users and M images per user is considered. From the nM real vectors (X_{i,j}), i = 1..n, j = 1..M, obtained as above, fixed-length binary strings are generated following some statistics and a reliable-bit selection. This step is similar to the one described in [17], but adapted to take null positions into account, and to select the same choice of coordinates for all the users, in order to remove the user-specific aspect of the original approach. For a given user i, the number (M_i)_t of non-erased components at an index t is not constant: for 1 ≤ i ≤ n, 1 ≤ t ≤ L, we have 0 ≤ (M_i)_t ≤ M. When (M_i)_t = 0, the position t is considered as an erasure for the user i. When (X_{i,j})_t or (M_i)_t are erased, they are not counted in the mean computations. For each coordinate, the means μ_i per user and the mean μ of all the enrolment vectors are computed. The within-class covariance Σ^w and the between-class covariance Σ^b are also estimated. Then we construct a binary string Q_i as follows: for 1 ≤ t ≤ L, if (M_i)_t ≥ 1, then (Q_i)_t = 0 if (μ_i)_t ≤ (μ)_t, and 1 if (μ_i)_t > (μ)_t. If (M_i)_t = 0 then (Q_i)_t is marked as an erasure. Let k < L be the number of reliable components to select. The Signal-to-Noise Ratio (SNR) of the coordinate t is defined by (ξ)_t = (Σ^b)_{t,t} / (Σ^w)_{t,t}. Here, we pick the k coordinates with highest SNR. These indexes are saved in a vector P, and a new vector V_i of length k is constructed with the corresponding reliable bit values for each user u_i, together with a mask VM_i – a second k-bit vector distinguishing the known coordinates from the positions where no value is known. With respect to Section 3, this procedure enables us to manage the enrolment of a set of users u_1, ..., u_n and outputs for each user a quantized template v_i = (V_i, VM_i). As for the access-control procedure, when a new fingerprint image is captured for a user u_j, a fingerprint template b′_j based on minutiae is extracted together with another template Y_j based on the pattern as above; the quantization then builds the quantized vector Q(Y_j), computed by comparing Y_j with the enrolment mean μ, and constructs the vector v′_j = (V_j, VM_j) by keeping only the indexes contained in P. We stress again that all these computations are performed by the main processing unit of the terminal.
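A rough NumPy rendering of this reliable-bit selection is sketched below (not the authors' code); it assumes enrolment data of shape (n, M, L) with NaN encoding erasures, and estimates only the diagonal of the within/between-class covariances.

```python
import numpy as np

def quantize_enrolment(X, k):
    """X: (n, M, L) array, NaN marking erasures.  Returns per-user reliable
    bits V (n, k), masks VM (n, k), and the selected index vector P (k,)."""
    mu_i = np.nanmean(X, axis=1)                          # per-user means, (n, L)
    mu = np.nanmean(X.reshape(-1, X.shape[-1]), axis=0)   # overall mean, (L,)
    within = np.nanmean(((X - mu_i[:, None, :]) ** 2).reshape(-1, X.shape[-1]), axis=0)
    between = np.nanmean((mu_i - mu[None, :]) ** 2, axis=0)
    snr = between / (within + 1e-12)
    snr[np.isnan(snr)] = -np.inf                          # skip all-erased coordinates
    P = np.argsort(snr)[::-1][:k]                         # k most reliable coordinates
    V = (mu_i[:, P] > mu[P]).astype(np.uint8)             # per-user reliable bits
    VM = ~np.isnan(mu_i[:, P])                            # mask: known coordinates
    return V, VM, P
```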
Performance. On the second FVC2000 database, for the 100 users, we randomly choose M = 6 images per user for enrolment and keep the remaining 2 for the identification tests. We construct binary vectors v_i of length 128 (i = 1, ..., 100) at enrolment and, for each v′_j obtained at the access-control step, we observe the rank of the correct candidate by sorting the v_i with respect to an adapted Hamming distance between v′_j and v_i. This distance is computed as the number of differences plus half the number of positions where no value is known. In that case, 90% of the correct candidates are among the 8 closest results and almost all are reached before rank 20. To further reduce the number of MOC comparisons needed, we can increase the length of the quantized templates. The experiments validate this for 256-bit templates: 81% of the correct candidates are reached at rank 2 and 90% at rank 5. The list of candidates is then almost always confirmed by very few MOC comparisons. Figure 2 illustrates the results with a quantization on 256 bits and 128 bits.
Fig. 2. Accuracy with 128 bits and 256 bits
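One plausible reading of the adapted distance used above (differences plus half the unknown positions) can be sketched as follows for mask-aware quantized templates (an illustration, not the authors' code):

```python
import numpy as np

def adapted_hamming(v1, m1, v2, m2):
    """v1, v2: 0/1 arrays; m1, m2: boolean masks, True where the bit is known.
    Counts disagreements on commonly known positions, plus half a unit for
    every position unknown in at least one of the two templates."""
    both_known = m1 & m2
    diffs = np.count_nonzero((v1 != v2) & both_known)
    return diffs + 0.5 * np.count_nonzero(~both_known)
```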
5 Conclusion

This paper describes how to locally improve privacy inside a biometric terminal for the purpose of identification. We can go further and change the scale of the setting. Indeed, the same idea can be applied at a system level: we only need to replace our Match-On-Card SAM by a more powerful hardware component, such as, for instance, a Hardware Security Module (HSM) [1]. This leads us to study the applicability of our quantized-template speed-up to many users; this change of scale in the size of the database needs further investigation. Acknowledgment. This work is supported by funding under the Seventh Research Framework Programme of the European Union, Project TURBINE (ICT-2007-216339). This document has been created in the context of the TURBINE project. It describes one of the protocols which are envisaged to be developed in the TURBINE demonstrators. All information is provided as is and no guarantee or warranty is given that the information is fit for any particular purpose. The user thereof uses the information at its sole risk and liability. The European Commission has no liability in respect of this document, which merely represents the authors' view.
References 1. Anderson, R., Bond, M., Clulow, J., Skorobogatov, S.P.: Cryptographic processorsa survey. Proceedings of the IEEE 94(2), 357–369 (2006) 2. Boyen, X., Dodis, Y., Katz, J., Ostrovsky, R., Smith, A.: Secure remote authentication using biometric data. In: Cramer, R. (ed.) EUROCRYPT 2005. LNCS, vol. 3494, pp. 147–163. Springer, Heidelberg (2005) 3. Bringer, J., Chabanne, H., Cohen, G., Kindarji, B., Zemor, G.: Theoretical and practical boundaries of binary secure sketches. IEEE Transactions on Information Forensics and Security 3(4), 673–683 (2008) 4. Chen, C., Veldhuis, R.N.J., Kevenaar, T.A.M., Akkermans, A.H.M.: Biometric binary string generation with detection rate optimized bit allocation. In: IEEE CVPR 2008, Workshop on Biometrics, June 2008, pp. 1–7 (2008) 5. Chen, C., Veldhuis, R.N.J.: Performances of the likelihood-ratio classifier based on different data modelings. In: ICARCV, pp. 1347–1351. IEEE, Los Alamitos (2008) 6. Crescenzo, G.D., Graveman, R., Ge, R., Arce, G.: Approximate message authentication and biometric entity authentication. In: Patrick, A.S., Yung, M. (eds.) FC 2005. LNCS, vol. 3570, pp. 240–254. Springer, Heidelberg (2005) 7. Maltoni, A.K.J.D., Maio, D., Prabhakar, S. (eds.): Handbook of Fingerprint Recognition. Springer, Heidelberg (2003) 8. Dodis, Y., Katz, J., Reyzin, L., Smith, A.: Robust fuzzy extractors and authenticated key agreement from close secrets. In: Dwork, C. (ed.) CRYPTO 2006. LNCS, vol. 4117, pp. 232–250. Springer, Heidelberg (2006) 9. Dodis, Y., Reyzin, L., Smith, A.: Fuzzy extractors: How to generate strong keys from biometrics and other noisy data. In: Cachin, C., Camenisch, J. (eds.) EUROCRYPT 2004. LNCS, vol. 3027, pp. 523–540. Springer, Heidelberg (2004) 10. Hao, F., Daugman, J., Zielinski, P.: A fast search algorithm for a large fuzzy database. IEEE Transactions on Information Forensics and Security 3(2), 203–212 (2008) 11. Juels, A., Wattenberg, M.: A fuzzy commitment scheme. In: ACM Conference on Computer and Communications Security, pp. 28–36 (1999) 12. Kelkboom, E.J.C., Molina, G.G., Kevenaar, T.A.M., Veldhuis, R.N.J., Jonker, W.: Binary biometrics: An analytic framework to estimate the bit error probability under gaussian assumption. In: IEEE BTAS 2008, pp. 1–6 (2008) 13. Linnartz, J.-P.M.G., Tuyls, P.: New shielding functions to enhance privacy and prevent misuse of biometric templates. In: Kittler, J., Nixon, M.S. (eds.) AVBPA 2003. LNCS, vol. 2688, pp. 393–402. Springer, Heidelberg (2003) 14. Maio, D., Maltoni, D., Cappelli, R., Wayman, J.L., Jain, A.K.: FVC2000: fingerprint verification competition. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(3), 402–412 (2002) 15. National Institute of Standards and Technology (NIST). Minex ii - an assessment of match-on-card technology, http://fingerprint.nist.gov/minex/ 16. Simoens, K., Tuyls, P., Preneel, B.: Privacy weaknesses in biometric sketches. In: IEEE Symposium on Security and Privacy (to appear, 2009) 17. Tuyls, P., Akkermans, A.H.M., Kevenaar, T.A.M., Schrijen, G.-J., Bazen, A.M., Veldhuis, R.N.J.: Practical biometric authentication with template protection. In: Kanade, T., Jain, A., Ratha, N.K. (eds.) AVBPA 2005. LNCS, vol. 3546, pp. 436– 446. Springer, Heidelberg (2005)
18. Tuyls, P., Goseling, J.: Capacity and examples of template-protecting biometric authentication systems. In: Maltoni, D., Jain, A.K. (eds.) BioAW 2004. LNCS, vol. 3087, pp. 158–170. Springer, Heidelberg (2004) 19. Tuyls, P., Verbitskiy, E., Goseling, J., Denteneer, D.: Privacy protecting biometric authentication systems: an overview. In: EUSIPCO 2004 (2004) 20. Weingart, S.H.: Physical security devices for computer subsystems: A survey of attacks and defenses. In: Paar, C., Koç, Ç.K. (eds.) CHES 2000. LNCS, vol. 1965, pp. 302–317. Springer, Heidelberg (2000)
A New Fingerprint Matching Algorithm Based on Minimum Cost Function

Andrés I. Ávila and Adrialy Muci

Departamento de Ingeniería Matemática, Universidad de La Frontera, Chile

Abstract. We develop new minutia-based fingerprint matching algorithms that minimize a cost function of distances between matching pairs. First, using the minutia type or minutia quality, we choose a reference set of points from each minutia set. Next, we create the set of combinations of pairs to perform the best alignment, and finally the matching by distances is computed. We tested our algorithm on the DB2A FVC2004 database, extracting the minutia information with the mindtct program provided by NBIS, and we compare against the performance of the bozorth3 algorithm.
1 Introduction

Among all fingerprint matching methods, minutia-based ones have proved to be the most widely implemented in real-life applications, due to their low computational requirements and cheap sensor technologies. One of the problems of this family is the dependence on the quality of the image captured by different sensor technologies. Figure 5 of [13] shows the effect of solid-state and digital sensor technologies on the number of extracted minutiae, noting that the solid-state sensor tends to capture fewer minutiae. Also, [3] shows the effect of capacitive, optical and thermal sensor technologies on image quality. Finally, in [1] the authors studied the effect of sensor technology on image quality, defining five quality indexes and considering different pressures and dryness levels. In addition to these two effects, temperature was also considered in [9] for four types of sensors. All these works highlight the need for new algorithms that use robust information related to image quality. In this work, we first use information about the minutia type to select a set of minutiae for alignment, and then use a cost-minimizing algorithm for matching. Next, we consider the minutia quality given by the mindtct files .xyt and .min for selecting the best minutiae for alignment.
2 Fingerprint Matching

2.1 Main Ideas

Minutia-based matching is performed by comparing a number of characteristics of each minutia. Most algorithms consider the coordinates (x_i, y_i), the rotation angle θ_i with respect to the horizontal, and the type t_i. In [6], they considered eight characteristics
including type information of neighboring minutiae. In [5], more characteristics were included, such as distances, lines between minutiae, and the relative orientation with respect to the central minutia. In [8] and [12], a local characteristic vector is first generated to find the best pair by a weighted distance of vectors; next, the rest of the minutiae are aligned. In [14], a set of angles is also generated for the characteristic vector in order to compute a score matrix with the best minutiae, and a greedy algorithm is performed to find the maximum number of pairs. In our approach, we consider only three characteristics: the first two are (x_i, y_i), and the third is either the type t_i or the quality q_i given by the mindtct algorithm. The main idea is to minimize the number of characteristics used in matching. In the first stage of the algorithm, we perform an alignment for each pair of minutiae from a reference set, as was done in [7], and solve a least-squares problem to find the translation, orientation and scaling. The reference set is selected either by type or by quality; in the second case, a new parameter is needed, the quality threshold. In the second stage, for each alignment we compute the distance matrix between minutiae from the template and input images, and solve a task assignment problem by Munkres' Assignment Algorithm [10], also called the Hungarian Algorithm, which was extended to rectangular matrices in [2]. We give more details in the next sections.

2.2 Reference Set and Parameters

In this stage, we want to solve the problem of alignment between minutiae. Because testing all rotations, translations and scalings is too time-consuming, most minutia-based methods use two reference points, such as the core or deltas, from each image to perform one alignment. In our approach, we do not use information about these special points; instead, we consider pairs of special minutiae to perform the alignment, the so-called principal minutiae of [4]. Because not all combinations are useful, we first select a reference set. Denote the template minutiae by T = {m^t_i}_{i=1..n} and the input minutiae by I = {m^i_j}_{j=1..m}, where each minutia is given by m_k = (x_k, y_k, t_k) for the coordinates and type, and by m_k = (x_k, y_k, q_k) when the quality q_k is considered. Regarding the reference set, the authors of [7] mention that false minutiae tend to be located closer together than real minutiae. We therefore extend each minutia with a fourth characteristic w_k, defined as the minimum distance to all other minutiae of the same image. In the case of type, we sort from highest to lowest distance and select a fixed number L of minutiae from each image to build the reference set A = {(m^t_i, m^i_j) : 1 ≤ i, j ≤ L and t_i = t_j}, where the paired minutiae have the same type. This step avoids useless pairings for the alignment. In the case of quality, we use a quality threshold U and select for the alignment only the minutiae of best quality, q_k > U. In both cases, we choose L such that the set A is non-empty and not too large. Notice that there are at most L² such pairs. Now, we build a set of pairs of pairs for the transformation, A_T = {((m^t_{i1}, m^i_{j1}), (m^t_{i2}, m^i_{j2})) : both pairs belong to A, m^t_{i1} ≠ m^t_{i2} and m^i_{j1} ≠ m^i_{j2}}.
Thus, we obtain O(L⁴) possible alignments. It is clear that a small L gives few possibilities for alignment and a large L gives too many options. Thus, L must be tuned.
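An illustrative sketch of this reference-set construction (not the authors' code) is given below; the minutia tuple layout (x, y, type) is an assumption made for the sketch.

```python
from itertools import combinations

def isolation(m, minutiae):
    # w_k: distance from m to its nearest neighbour in the same image.
    return min(abs(m[0] - o[0]) + abs(m[1] - o[1]) for o in minutiae if o != m)

def reference_alignment_pairs(template, query, L):
    # Keep the L most isolated minutiae of each image (largest w_k).
    pick = lambda ms: sorted(ms, key=lambda m: isolation(m, ms), reverse=True)[:L]
    T, I = pick(template), pick(query)
    # A: same-type pairs, one minutia from each image.
    A = [(mt, mi) for mt in T for mi in I if mt[2] == mi[2]]
    # A_T: pairs of pairs with distinct minutiae, each defining one alignment.
    return [(p, q) for p, q in combinations(A, 2)
            if p[0] != q[0] and p[1] != q[1]]
```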
2.3 Alignment and Rotation

The main idea is to transform the template minutiae into the coordinate system of the input minutiae. To obtain this transformation, we search for the optimal set of parameters T = (t_x, t_y, θ, ρ) for the translation, rotation angle and scaling. Consider a reference pair of pairs ((p_{i1}, q_{j1}), (p_{i2}, q_{j2})) ∈ A_T, where p_{i1} = (x_{i1}, y_{i1}), p_{i2} = (x_{i2}, y_{i2}) ∈ P and q_{j1} = (x_{j1}, y_{j1}), q_{j2} = (x_{j2}, y_{j2}) ∈ Q. The transformation is

\begin{pmatrix} u \\ v \end{pmatrix} = \rho \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}.   (1)

Using standard least squares, it is possible to obtain closed-form formulas for the parameters. Because these computations are fast, we can perform several transformations, avoiding the use of cores or deltas.
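As an illustration (not the authors' implementation), the least-squares estimation of (t_x, t_y, θ, ρ) under the model of Eq. (1) can be done compactly with complex numbers:

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares (tx, ty, theta, rho) mapping src points onto dst points
    under the model of Eq. (1).  src, dst: arrays of shape (N, 2), N >= 2."""
    src = np.asarray(src, float)
    dst = np.asarray(dst, float)
    zs = src[:, 0] + 1j * src[:, 1]
    zd = dst[:, 0] + 1j * dst[:, 1]
    zs_c, zd_c = zs - zs.mean(), zd - zd.mean()
    # Eq. (1) in complex form is  u + jv = rho * exp(-j*theta) * (x + jy) + t,
    # so the complex multiplier a = rho * exp(-j*theta) follows from a standard
    # least-squares projection on the centred points.
    a = np.vdot(zs_c, zd_c) / np.vdot(zs_c, zs_c)
    t = zd.mean() - a * zs.mean()
    return t.real, t.imag, -np.angle(a), abs(a)

# With a reference pair of pairs one would call, e.g.:
# fit_similarity([p_i1, p_i2], [q_j1, q_j2])
```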
2.4 Minimum Cost Function

After the alignment is done, we have two sets of minutiae in the same coordinate system,

P = {p_1, p_2, ..., p_n}, p_i = (x^p_i, y^p_i, z^p_i), i = 1, ..., n,
Q = {q_1, q_2, ..., q_m}, q_j = (x^q_j, y^q_j, z^q_j), j = 1, ..., m,

where z^p_i (resp. z^q_j) is either the type or the quality of the minutia. Next, we compute the cost matrix C = (c_{ij}) with the distance c_{ij} = |x^p_i − x^q_j| + |y^p_i − y^q_j| for i = 1, ..., n and j = 1, ..., m. This distance is faster to compute than the standard Euclidean distance. Considering that the two sets of minutiae have different sizes, we assume that n < m. For each template minutia i, we search for the input minutia j such that the minimum distance c*_{ij} is attained. This is represented by the following minimum cost problem, where z_{ij} is a binary variable indicating whether the pair (i, j) is chosen:

\min \sum_{i=1}^{n} \sum_{j=1}^{m} c_{ij} z_{ij}
\text{s.t. } \sum_{j=1}^{m} z_{ij} = 1 \ (i = 1,\dots,n), \quad \sum_{i=1}^{n} z_{ij} \le 1 \ (j = 1,\dots,m), \quad z_{ij} \in \{0, 1\}.

The first constraint represents the fact that each template minutia has a matching, and the second constraint states that each input minutia cannot be associated with more than one template minutia. Because the size of the matrix is not large, we can use non-heuristic methods to solve this problem. Munkres' algorithm is the most efficient algorithm to solve this problem.
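In practice, this rectangular assignment problem can be solved with an off-the-shelf Hungarian/Munkres solver; the sketch below uses SciPy's linear_sum_assignment as a stand-in (an illustration, not the authors' implementation):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_minutiae(P, Q):
    """P: (n, 2) template coordinates, Q: (m, 2) input coordinates, n <= m.
    Returns the matched (i, j) pairs and the total L1 matching cost."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    # Cost matrix c_ij = |x_i^p - x_j^q| + |y_i^p - y_j^q|.
    C = np.abs(P[:, None, :] - Q[None, :, :]).sum(axis=2)
    rows, cols = linear_sum_assignment(C)     # handles rectangular C directly
    return list(zip(rows.tolist(), cols.tolist())), C[rows, cols].sum()
```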
2.5 Munkres' Algorithm

We assume the matrix C has dimensions n × m, where n ≤ m. We sketch the algorithm:
1. For each row, find the minimum element and subtract it from the whole row.
2. Find a zero element and mark it with an ∗.
3. Cover the columns containing an ∗ and count them. If there are n covered columns, the matching is found; else, go to step 4.
4. Find a non-covered zero and mark it with a ′. If there is no ∗ in its row, go to step 5. If there is an ∗, uncover its column and cover its row, store the lowest non-covered value, and go to step 6.
5. Build an alternating sequence of ∗ and ′ marks to find a new assignment and go to step 3.
6. Add the stored value to each covered row and subtract it from each uncovered column. Go to step 4.
The whole matching procedure is described in Algorithm 1.

Algorithm 1. Matching with alignment
Data: sets of minutiae from the template and input images
Result: matching score after alignment
 1  begin
 2    Build characteristic vectors P, Q from the input data;
 3    Compute the weight w_k for each minutia of P and Q;
 4    Sort the minutiae of P and Q from highest to lowest weight;
 5    Find the reference set A;
 6    foreach reference pair in A do
 7      Compute the transformation by least squares;
 8      Align the two characteristic sets using the transformation;
 9      Perform the matching by Munkres' algorithm and compute the matching score;
10    end
11    The matching score is the lowest score obtained among the transformations;
12  end
The final score is computed as the percentage of template minutiae found by each run.
3 Experiments for Minutia Quality

We performed an experiment with the DB2A FVC2004 fingerprints: eight images for each of one hundred fingers. We selected a quality parameter U = 50 and a reference set size of L = 10. Using these parameters, we performed on average 134.4 alignments (with a standard deviation of 201.7), which shows how sensitive the algorithm is to the selection of pairs. The selected parameter for minimum EER was 78. We obtained an FNMR of 89.3% and an FMR of 11.5%.
4 Conclusions
It is important to study the sensitivity of the matching to the parameters. It is clear that the right combination of parameters will be more complicated to obtain, but it allows more possibilities for adjusting the algorithm. The parameter L should depend on the image quality: if the quality is high, fewer pairs are needed to perform the matching. Because of the poor quality of some fingerprints, we noticed that in some cases no alignment was possible, so a rejection is made. Also, the score depends on the number of minutiae matched in the comparison.
References 1. Alonso-Fernandez, F., Roli, F., Marcialis, G.L., Fierrez, J., Ortega-Garcia, J.: Comparison of fingerprint quality measures using an optical and a capacitive sensor. In: IEEE Conference on Biometrics: Theory, Applications and Systems, 6 p. (2007) 2. Burgeois, F., Lasalle, J.-C.: An extension of the Munkres algorithm for the assignment problem to rectangular matrices. Communications of the ACM 142, 302–806 (1971) 3. Blomeke, C., Modi, S., Elliott, S.: Investigating The Relationship Between Fingerprint Image Quality and Skin Characteristics. In: IEEE International Carnahan Conference on Security Technology ICCST 2008, 4 p. (2008) 4. Chang, S.H., Cheng, F.H., Hsu, W.H., Wu, G.Z.: Fast algorithm for point pattern matching: invariant to translations, rotations and scale changes. Pattern Recognition 30(2), 311–320 (1997) 5. Chen, Z., Kuo, C.H.: A Topology-Based Matching Algorithm for Fingerprint Authentication. In: Proc. Int. Carnahan Conf. on Security Technology (25th), pp. 84–87 (1991) 6. Hrechak, A., McHugh, J.: Automated Fingerprint Recognition Using Structural Matching. Pattern Recognition 23(8), 893–904 (1990) 7. Jia, J., Cai, L., Lu, P., Liu, X.: Fingerprint matching based on weighting method and the SVM. Neurocomputing 70, 849–858 (2007) 8. Jiang, X., Yau, W.Y.: Fingerprint Minutiae Matching Based on the Local and Global Structures. Proc. Int. Conf. on Pattern Recognition (15th) 2, 1042–1045 (2000) 9. Kang, H., Lee, B., Kim, H., Shin, D., Kim, J.: A Study on Performance Evaluation of Fingerprint Sensor. In: Kittler, J., Nixon, M.S. (eds.) AVBPA 2003. LNCS, vol. 2688, pp. 574–583. Springer, Heidelberg (2003) 10. Munkres, J.: Algorithms for Assignment and Transportation Problems. Journal of the SIAM 5(1) (March 1957) 11. Watson, C., Garris, M., Tabassi, W., Wilson, C., McCabe, R., Janet, S., Ko, K.: User’s Guide to NIST Biometric Image Software (NBIS), NIST, 217 p. (2009) 12. Ratha, N.K., Pandit, V.P., Bolle, R.M., Vaish, V.: Robust Fingerprint Authentication Using Local Structural Similarity. In: Proc. Workshop on Applications of Computer Vision, pp. 29–34 (2000) 13. Ross, A., Jain, A.: Biometric Sensor Interoperability: A Case Study in Fingerprints. In: Maltoni, D., Jain, A.K. (eds.) BioAW 2004. LNCS, vol. 3087, pp. 134–145. Springer, Heidelberg (2004) 14. Tico, M., Kuosmanen, P.: Fingerprint Matching using an Orientation-based Minutia Descriptor. IEEE Trans. on Pattern Analysis and Machine Intelligence 25(8), 1009–1014 (2003)
Invariant Fourier Descriptors Representation of Medieval Byzantine Neume Notation

Dimo Dimov¹ and Lasko Laskov¹,²

¹ Institute of Information Technologies (IIT) at the Bulgarian Academy of Sciences (BAS)
² Institute of Mathematics and Informatics (IMI) at BAS
Acad. G. Bonchev Str., Block 29-A, 1113 Sofia, Bulgaria
{dtdim,llaskov}@iinf.bas.bg

Abstract. During the last decade a lot of effort has been put into studying Fourier descriptors (FDs) and their application to 2D shape representation and matching. FDs have often been preferred to other approaches (moments, wavelet descriptors) because of properties that allow invariance to translation, scale, rotation and change of the contour start-point. However, the literature lacks an extensive theoretical treatment of these properties, which can result in inaccuracies in the methods' implementation. In this paper we propose a detailed theoretical exposition of the FDs' invariance, with special attention paid to the corresponding proofs. A software demonstration has been developed with an application to medieval Byzantine neume notation as part of our OCR system.

Keywords: Fourier descriptors, historical document image processing, OCR.

1 Introduction

Byzantine neume notation is a form of musical notation used by the Orthodox Christian Church to denote music and musical forms in sacred documents from ancient times until nowadays. The variety and number of historical documents containing neume notation is vast, and they are not only a precious historical record, but also an important source of information and an object of intense scientific research [8]. Naturally, most of the research on neume notation in historical documents is connected with the content of the documents itself, including searching for fragments or patterns of neumes, comparison between them, searching for similarities, etc. These and other technical activities are a good argument in favor of creating a software tool to support the research of medieval neume notation. Such a software tool can be an OCR (Optical Character Recognition) based system. In the literature there are only a few attempts described at creating a software system for processing and recognition of documents containing neume notation [4], [1]. Both works were designed to work with the contemporary neume notation in printed documents. Our goal is to develop methods and algorithms for processing and recognition of medieval manuscripts containing Byzantine neume notation with no
binding to a particular notation. The main stages of the processing include: (i) preliminary processing and segmentation; (ii) symbol agglomeration into classes based on unsupervised learning of the classifier; (iii) symbol recognition. For the goals of unsupervised learning and recognition, we need a suitable representation of the neumatic symbols which will be used to define a feature space supporting the comparison between neume representatives. Since the neumes have relatively simple shapes and rarely contain cavities, in the proposed approach each neume is represented by its outer contour. For the feature space definition, the Fourier transform (FT) of the contour is used, with a number of high frequencies removed, resulting in a reduced frequency contour representation. Such a representation of 2D shapes is often called Fourier descriptors (FDs) [5], [6]. During the last decade FDs have been investigated in detail and applied with success to different problems, such as OCR system design [2], Content Based Image Retrieval (CBIR) [3], [5], etc. One of the main reasons FDs are preferred to other approaches for 2D shape representation, such as moments and wavelet descriptors, is the comparatively simple methods for translation, scale, rotation and start-point normalization of FDs. This is the reason why a lot of effort has been put in the literature into investigating these properties. Nevertheless, the corresponding analytical proofs are rarely given, which can be the reason for inaccuracies and even errors in the implementation of the corresponding methods for linear frequency normalization of the contours. The goal of this paper is to investigate in detail the properties of FDs that achieve their translation, scale, rotation and start-point invariance. Special attention is paid to the analytical proofs of these properties and to a method for the construction of linearly normalized reduced FDs (LNRFDs) for 2D shape representation, in particular for Byzantine neume notation. The LNRFD representation of the neume notation can be used effectively for the goals of unsupervised learning.
2 FD Representation of Byzantine Neume Notation

For each segmented neume symbol, the algorithm for contour finding in bi-level images [7] is applied. The resulting contour is a closed and non-self-crossing curve. For our purposes we will represent the contour z as a sequence of Cartesian coordinates, ordered in the counterclockwise direction:

z \equiv \left( z(i) \mid i = 0,1,\dots,N-1 \right) \equiv \left( (x(i), y(i)) \mid i = 0,1,\dots,N-1 \right).   (1)

Besides, the contour is a closed curve, i.e.:

z(i) = z(i+N), \quad i = 0,1,\dots,N-1.   (2)

We will also assume that z is approximated with line segments between its neighboring points z(i) = (x(i), y(i)), which are equally spaced, i.e.:

\|z(i+1) - z(i)\| = \|z(N-1) - z(0)\| = \Delta, \quad i = 0,1,\dots,N-1,   (3)

where Δ is a constant for which we can assume Δ = 1.
Fig. 1. (a) Fragment of a neume contour, represented in the complex plane; (b) The contour represented as a sum of pairs of radius-vectors. The sum of the first pair gives the base ellipse of the neume symbol.
2.1 Fourier Transform of a Contour
For the sake of the FT and in correspondence with (1), we will consider the contour z as a complex function:

z(i) = x(i) + j\,y(i) = |z(i)|\,e^{j\varphi(i)} = |z(i)|\exp(j\varphi(i)), \quad i = 0,1,\dots,N-1,   (4)

where x and y are its real and imaginary components in Cartesian representation, |z| and φ = arg(z) are the respective module and phase in polar representation, and j = √−1 is the imaginary unit (see Fig. 1,a). Thus, according to (2) and (3), the conditions for the DFT are fulfilled:

\hat z(k) = F(z)(k) = \frac{1}{N}\sum_{i=0}^{N-1} z(i)\exp(-j\Omega k i), \quad k = 0,1,\dots,N-1, \quad \Omega = \frac{2\pi}{N},   (5)
where ẑ is the spectrum of z, ẑ(k), k = 0, 1, …, N−1, are the respective harmonics, also called FDs, and the values Ω|k| have the sense of angular velocities. The Inverse DFT relates the spectrum ẑ to the contour z:

z(i) = F^{-1}(\hat z)(i) = \sum_{k=0}^{N-1} \hat z(k)\exp(j\Omega k i), \quad i = 0,1,\dots,N-1,   (6)

which after equivalent transformations can be written in the form:

z(i) = \mathrm{rest} + \sum_{k=-(N_2-1)}^{N_2-1} \hat z(k)\exp(j\Omega k i), \quad i = 0,1,\dots,N-1, \quad N_2 = \lceil N/2 \rceil,
\mathrm{rest} = \begin{cases} \hat z(N_2), & \text{if } N \text{ is even} \\ 0, & \text{if } N \text{ is odd.} \end{cases}   (7)
Considered in polar coordinates, (6) and (7) lead to a useful interpretation:

Interpretation 1. The contour z is represented as a sum of pairs of radius-vectors (r_k, r_{−k}), k = 1, 2, …, N₂−1, N₂ = ⌈N/2⌉, rotating with the same angular velocity Ωk but in different directions: r_k in the positive and the symmetric r_{−k} in the negative direction, where r_k ⇔ ẑ(k) and r_{−k} ⇔ ẑ(−k) ⇔ ẑ(N−k). To this vector sum we also add the static CoG (Center of Gravity) vector r_0 ≡ ẑ(0), as well as the residual vector r_{N₂} ≡ ẑ(N₂), which is different from zero only if N is even (see Fig. 1,b).

Apparently the terms harmonics ẑ(k), descriptors ẑ(k), and radius-vectors r_k, k = 0, 1, …, N−1, are almost identical, but express different interpretations of the contour spectrum ẑ. Thus, according to Interpretation 1, each separate pair outlines an ellipse with a variable speed, whose direction depends on which of the two radius-vectors dominates by module. The following practical rules can be derived:

Rule 1. The base harmonics ẑ(1) and ẑ(−1) cannot be zero at the same time, i.e. |r_1| + |r_{−1}| ≠ 0. The opposite would mean that the contour is traced more than once, which is impossible with the used algorithm for contour tracing.

Rule 2. If the direction of the contour trace is positive (counterclockwise), then |r_1| ≥ |r_{−1}|, otherwise |r_1| ≤ |r_{−1}| (clockwise).

For concreteness we assume that the direction of the contour trace is positive, i.e. |ẑ(1)| ≥ |ẑ(−1)|, which corresponds to our case. An important property of FDs is that the harmonics which correspond to the low frequencies contain the information about the more general features of the contour, while the high frequencies correspond to the details. In this sense we give:

Definition 1. A reduced FD of length L is the following spectral representation of the contour z:
\tilde{\hat z}(k) = \begin{cases} \hat z(k), & 0 \le k \le L \\ 0, & L < k < N_2 \end{cases}, \quad N_2 = \lceil N/2 \rceil,   (8)

for a boundary value L, 0 ≤ L ≤ ⌈N/2⌉. L, and respectively the frequency ΩL, can be evaluated using the least-squares criterion:

\varepsilon^2 = \frac{1}{N}\sum_{i=0}^{N-1} |z(i) - \tilde z(i)|^2 < \varepsilon_0^2,   (9)

where z̃ is the approximation of the contour z which corresponds to the reduced frequency representation (8), and ε₀² is some permissible value of the criterion ε².
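For illustration (not from the paper), computing FDs with the FFT and truncating them per Definition 1 might look as follows in NumPy; the sketch keeps the negative harmonics as well, which is one common reading of the truncation, and assumes L ≥ 1.

```python
import numpy as np

def fourier_descriptors(contour_xy):
    """contour_xy: (N, 2) array of equally spaced contour points.
    Returns the spectrum z_hat with the 1/N scaling of Eq. (5)."""
    z = contour_xy[:, 0] + 1j * contour_xy[:, 1]
    return np.fft.fft(z) / len(z)

def reduce_descriptors(z_hat, L):
    """Keep the harmonics with |frequency index| <= L, zero out the rest."""
    reduced = np.zeros_like(z_hat)
    reduced[:L + 1] = z_hat[:L + 1]      # k = 0 .. L
    reduced[-L:] = z_hat[-L:]            # k = -1 .. -L, stored at N-1 .. N-L
    return reduced

def reduction_error(z_hat, reduced):
    """Criterion (9) evaluated in the frequency domain via Parseval's identity."""
    return np.sum(np.abs(z_hat - reduced) ** 2)
```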
2.2 Linear Normalization of Contour in the Frequency Domain

For the aim of creating a self-learning classifier for the neume symbols, a measure of similarity between the normalized individual representatives is needed. These normalizations can be performed relatively easily in the frequency domain, using the FDs.
Translational normalization. Given (6), for the contour z translated by the vector ẑ(0) we have

z(i) - \hat z(0) = \sum_{k=1}^{N-1} \hat z(k)\exp(j\Omega k i), \quad i = 0,1,\dots,N-1,   (10)

where ẑ(0) = (1/N) Σ_{i=0}^{N−1} z(i) according to (5). Obviously the new contour ν(i) ≡ z(i) − ẑ(0) coincides with the original z, but the coordinate system is translated to its CoG, i.e. the static harmonic of ν is equal to zero, F(ν)(0) = 0, while all the others remain unchanged: F(ν)(k) = F(z)(k), k = 1, 2, …, N−1. Hence, the translational normalization can be achieved by z(i) := z(i) − ẑ(0), i = 0, 1, …, N−1, where ":=" denotes the assignment operation.

Scale normalization. Assume that we have a contour v which is a version of z scaled by an unknown coefficient s, i.e.:

v(i) = s\,z(i), \quad i = 0,1,\dots,N-1, \quad s \ne 0.   (11)
Thus, the spectral representation of v will be scaled by the same coefficient. Indeed, for the forward DFT of (11), it follows from (5) that

\hat v(k) = \frac{1}{N}\sum_{i=0}^{N-1} s\,z(i)\exp(-j\Omega k i) = \frac{s}{N}\sum_{i=0}^{N-1} z(i)\exp(-j\Omega k i) = s\,\hat z(k), \quad k = 0,1,\dots,N-1.   (12)

Therefore, the scale invariance of the contour can be achieved by dividing the modules of its harmonics by some non-zero linear combination of them. In the case of the algorithm of Pavlidis [7], which we use for the neume contour trace, the first positive or the first negative harmonic is different from zero, depending on the contour trace direction. Thus, without loss of generality we may consider the module of the first harmonic to be non-zero, i.e. the scale invariance can be achieved by dividing all the harmonics by it. Thus, for the spectrum v̂_s of the scale-normalized contour ν_s:

\hat v_s(k) = \frac{|\hat v(k)|}{|\hat v(1)|} = \frac{s\,|\hat z(k)|}{s\,|\hat z(1)|} = \frac{|\hat z(k)|}{|\hat z(1)|}, \quad k = 0,1,\dots,N-1.   (13)

Hence, scale normalization can be done by ẑ(k) := |ẑ(k)| / |ẑ(1)|, k = 0, 1, …, N−1.
The case of irregular scaling s_x ≠ s_y is not interesting for our application. For completeness, we mention that in this case we can preliminarily calculate the 2D ellipsoid of inertia for the given neume, reshape it along the main axes of the ellipsoid so that it is transformed into a circle, and then continue with (13).

Rotational normalization. Suppose we have a contour v which is a version of the contour z rotated by an unknown angle α. If the contours are preliminarily normalized with respect to translation, i.e. their common CoG coincides with the origin
of the coordinate system, the rotation by α corresponds to multiplication of the complex representation of z by e^{jα}:

v(i) = e^{j\alpha} z(i), \quad i = 0,1,\dots,N-1.   (14)

The spectrum of the contour will be rotated by the same angle α. Indeed, because of the linearity of the DFT (5), and similarly to (12), for (14) we have:

\hat v(k) = \frac{1}{N}\sum_{i=0}^{N-1} e^{j\alpha} z(i)\exp(-j\Omega k i) = e^{j\alpha}\,\hat z(k), \quad k = 0,1,\dots,N-1.   (15)

And so, a rotation by an angle α in the object domain corresponds to a rotation by the same angle α of the phases of the contour spectrum. Therefore, there are two approaches to provide the rotational invariance of the final contour representation. The first is to ignore the phases of the spectrum, which leads to a rotationally invariant representation, but also to a big loss of information. The second approach is to normalize the spectrum phases by the phase of one of the harmonics, for example the first one, v̂(1), for which we again consider v̂(1) ≠ 0. Thus, for the spectrum v̂_α of the rotationally normalized contour ν_α, we have:

\hat v_\alpha(k) = \frac{\hat v(k)}{\exp(j\arg(\hat v(1)))} = \frac{e^{j\alpha}\hat z(k)}{e^{j\alpha}\exp(j\arg(\hat z(1)))} = \frac{\hat z(k)}{\exp(j\arg(\hat z(1)))}, \quad k = 0,1,\dots,N-1.   (16)

Hence, the rotational normalization is: ẑ(k) := ẑ(k) / exp(j arg(ẑ(1))), k = 0, 1, …, N−1.
Starting point normalization. The algorithm of Pavlidis does not guarantee that the contour traces of two identical symbols will start from one and the same start-point. The contour start-point change can be simply examined in the frequency domain. Suppose we have a contour v which is a version of the contour z with the start-point shifted by Δ positions:

v(i) = z(i + \Delta), \quad i = 0,1,\dots,N-1.   (17)

Statement 1. Let two contours z and v, given in the complex plane, correspond to each other as in (17). Then their correspondence in the frequency domain is given by:

\hat v(k) = e^{j\Omega k\Delta}\,\hat z(k), \quad k = 0,1,\dots,N-1.   (18)

Proof: see the Appendix. ♦

According to this statement, an integer shift Δ of the start-point of the contour in the object domain corresponds to multiplication of the phases of its spectrum by the constant exp(jΩkΔ), or equivalently to rotations of the phases as follows: the k-th phase is rotated by an angle δ(k) = ΩΔk, k = 0, 1, …, N−1. This normalization can be treated analogously to the rotational one, again with two approaches. The invariance in the first approach is trivial. To achieve invariance in
the second approach, we propose the following procedure: normalize each harmonic of the spectrum v̂ with the phase of the first non-zero harmonic v̂(m) ≠ 0, where m > 1:

\hat v_\Delta(k) = \frac{\hat v(k)}{\exp(j\arg(\hat v(m))\,k/m)} = \frac{e^{j\Omega k\Delta}\,\hat z(k)}{e^{j(\Omega m\Delta)k/m}\exp(j\arg(\hat z(m))\,k/m)} = \frac{\hat z(k)}{\exp(j\arg(\hat z(m))\,k/m)},   (19)

for k = 0, 1, …, N−1. Then the modified spectrum v̂_Δ corresponds uniquely to all the contours that are isomorphic to the original z but with an arbitrarily selected start-point. Hence, we normalize: ẑ(k) := ẑ(k) / exp(j arg(ẑ(m)) k/m), k = 0, 1, …, N−1.

Definition 2. We call linearly normalized reduced FD (LNRFD) of the original contour z the reduced FD of z after its processing by (10), (13), (16) and (19). Besides, (10) has to be applied first, and in case of s_x ≠ s_y, (13) has to be applied second, else the order is arbitrary, while (16) and (19) have to be applied one after another at least q times, q ≥ (ln|arg(ẑ(m)) − arg(ẑ(1))| − ln(ε)) / ln(m), in order to obtain finally |arg(ẑ(1))| ≤ ε and |arg(ẑ(m))| ≤ ε, where ε is chosen arbitrarily close to zero.
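To make the chain of normalizations concrete, here is an illustrative NumPy sketch (not the authors' code) following one plausible reading of Eqs. (10), (13), (16) and (19); for simplicity it keeps complex descriptors and divides by |ẑ(1)| for the scale step, and it assumes L ≥ 1 and ẑ(1) ≠ 0.

```python
import numpy as np

def lnrfd(contour_xy, L, m=2, iters=3):
    """Illustrative LNRFD sketch: translation, scale, rotation and start-point
    normalization applied to reduced Fourier descriptors."""
    z = contour_xy[:, 0] + 1j * contour_xy[:, 1]
    zh = np.fft.fft(z) / len(z)
    zh[0] = 0.0                                   # Eq. (10): remove the CoG term
    zh = zh / np.abs(zh[1])                       # scale step: divide by |z_hat(1)|
    k = np.arange(len(zh))
    for _ in range(iters):                        # alternate Eq. (16) and Eq. (19)
        zh = zh * np.exp(-1j * np.angle(zh[1]))               # rotation normalization
        zh = zh * np.exp(-1j * np.angle(zh[m]) * k / m)       # start-point normalization
    reduced = np.zeros_like(zh)
    reduced[:L + 1] = zh[:L + 1]
    reduced[-L:] = zh[-L:]                        # keep the +/- L lowest harmonics
    return reduced
```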
3 Conclusion

In the paper we propose an approach for constructing LNRFDs for medieval Byzantine neume notation which are invariant with respect to translation, scaling, rotation and change of the contour start-point. The theoretical grounds of the considered normalizations are described in detail. For the aims of the experiment, original software has been developed to extract the LNRFDs of each neume segmented in a document. These LNRFDs play the role of an index into a database of neume objects. The next stage of the proposed methodology for medieval neume notation processing and recognition will be the organization of unsupervised learning on the basis of the above-described LNRFDs. After sorting the database through the LNRFD index, the problem will be reduced to a 1D clustering problem.

Acknowledgments. This work was partially supported by the following grants of IIT-BAS: # DO-02-275/2008 of the National Science Fund at the Bulgarian Ministry of Education & Science, and Grant # 010088/2007 of BAS.
References 1. Dalitz, C., Michalakis, G.K., Pranzas, C.: Optical recognition of psaltic byzantine chant notation. IJDAR 11(3), 143–158 (2008) 2. Dimauro, G.: Digital Transforms in Handwriting Recognition. In: Impedovo, S. (ed.) FHWR. NATO ASI Series “F”, vol. 124, pp. 113–146. Springer, Heidelberg (1994) 3. Dimov, D.: Fast Image Retrieval by the Tree of Contours’ Content, Cybernetics and Information Technologies, BAS, Sofia, 4(2), pp. 15–29 (2004) 4. Gezerlis, V., Theodoridis, S.: Optical character recognition of the orthodox hellenic byzantine music notation. Pattern Recognition 35(4), 895–914 (2002)
5. Folkers, A., Samet, H.: Content-based image retrieval using Fourier descriptors on a logo database. In: 16th Int. Conf. on Pattern Recogn., vol. 3, pp. 521–524 (2002) 6. Zhang, D., Lu, G.: A comparative study on shape retrieval using Fourier descriptors with different shape signatures. In: Intelligent Multimedia, Computing and Communications Technologies and Applications of the Future, Fargo, ND, USA, June 2001, pp. 1–9 (2001) 7. Pavlidis, T.: Algorithms for Graphics and Image Processing. Springer, Heidelberg (1982) 8. DDL of Chant Manuscript Images, http://www.scribeserver.com/NEUMES
Appendix: Proof of Statement 1

If Δ = 0, then the statement is obviously true. Let Δ ≠ 0. Then, for each harmonic v̂(k), k = 0, 1, …, N−1, of the spectrum of the contour v, according to (17):

\hat v(k) = \frac{1}{N}\sum_{i=0}^{N-1} v(i)\exp(-j\Omega k i) = \frac{\exp(j\Omega k\Delta)}{N}\sum_{i=0}^{N-1} z(i+\Delta)\exp(-j\Omega k (i+\Delta)).

Using the substitution l = i + Δ, we get:

\hat v(k) = \frac{\exp(j\Omega k\Delta)}{N}\sum_{l=\Delta}^{N+\Delta-1} z(l)\exp(-j\Omega k l) = \frac{\exp(j\Omega k\Delta)}{N}\left( \sum_{l=\Delta}^{N-1} z(l)\exp(-j\Omega k l) + \sum_{l=N}^{N+\Delta-1} z(l)\exp(-j\Omega k l) \right).

Because of the periodicity (2) of the contour, z(l) = z(l ± N), we have:

\hat v(k) = \frac{\exp(j\Omega k\Delta)}{N}\left( \sum_{l=\Delta}^{N-1} z(l)\exp(-j\Omega k l) + \sum_{m=0}^{\Delta-1} z(m)\exp(-j\Omega k m)\exp(-j\Omega k N) \right).

But, according to (5), ΩN = 2π, and hence exp(−jΩkN) = 1. Thus, finally:

\hat v(k) = \frac{\exp(j\Omega k\Delta)}{N}\left( \sum_{l=\Delta}^{N-1} z(l)\exp(-j\Omega k l) + \sum_{m=0}^{\Delta-1} z(m)\exp(-j\Omega k m) \right) = \frac{\exp(j\Omega k\Delta)}{N}\sum_{l=0}^{N-1} z(l)\exp(-j\Omega k l) = \exp(j\Omega k\Delta)\,\hat z(k), \quad k = 0,1,\dots,N-1,

which is what we had to prove.
Bio-Inspired Reference Level Assigned DTW for Person Identification Using Handwritten Signatures Muzaffar Bashir and Jürgen Kempf Faculty of Electronics and Information Technology University of Applied Sciences Regensburg, Germany
[email protected],
[email protected] www.bisp-regensburg.de
Abstract. Person identification or verification is becoming an important issue in our life with the growth of information technology. Handwriting features of individuals during signing are used as behavioral biometrics. This paper presents a new method for recognizing persons using online signatures, based on a reference level assigned Dynamic Time Warping (DTW) algorithm. The acquisition of online data is carried out by a digital pen equipped with pressure and inclination sensors. The time series obtained from the pen during handwriting provide valuable insight into the unique characteristics of the writers. The obtained standard deviation values of the time series are found to be person-specific and are used as so-called reference levels. In the proposed method, reference levels are added to the time series of the query and sample datasets before the dynamic time warping distance calculations. Experimental results show that the accuracy of person authentication is improved and the computational time is reduced. Keywords: online signature authentication, dynamic time warping, biometric person authentication, signature normalization.

1 Introduction and Related Work

Person identification or verification is an important issue in our life with the growth of the digital age. Because of the wide acceptance of person verification by handwritten signature, and because of the history of its use in transactions and document authorization, online signature verification is a topic of present research. The generation of a handwritten signature is a process predetermined by brain activity and reflects the neuro-motor characteristics of the signer [1]. Natural variations of a person's genuine handwritten signatures can be minimized by consistency and practice; therefore a recognizable signature pattern can be made available for biometric identification. Unlike static signature verification methods, in which the image of the signature is digitized and taken into account for comparison, dynamic signature verification methods consider how the handwritten signature is generated. Therefore the signing process, in terms of x-y coordinates, pressures, speeds, timing, inclinations, etc., is recorded and used for signature comparison, so it is difficult for a forger to recreate this signing information [4], [5], [6]. Person verification or identification by handwritten PIN can also be considered in this regard [6]. Signature identification or verification can be categorized into two main
types: parametric and functional approaches. In a parametric approach, only a set of parameters or features abstracted from the complete signals is used for signature matching. Because of the higher level of data abstraction, these approaches are generally very fast, but it is difficult to select the right parameters. On the other hand, functional approaches use the complete signals as the feature set, in terms of time series, which essentially contain more signing information and hence provide more accurate results [5]. Dynamic Time Warping based classifiers have been applied successfully in this regard for a couple of decades. DTW is computationally expensive. There are some speed-up techniques which reduce the number of data point comparisons by the introduction of bands like the Sakoe-Chiba band or the Itakura parallelogram [7], [8], piecewise aggregate representation of time series for DTW (PDTW) [2], data down-sampling [6], segment-to-segment matching [4], extreme points warping [5], etc. In this paper we present an essentially functional approach based on classic dynamic time warping, which also includes one parametric feature. In our study, the acquisition of online signature data is carried out by a digital pen in terms of five time series. The time series obtained from handwriting provide valuable insight into the unique characteristics of the writers. In our study it is noted that the reference level obtained from the standard deviation values of the time series is person-specific; we name it the bio-inspired reference level. In the proposed method, reference levels are added to the time series of the reference and sample datasets before the dynamic time warping distance calculations; therefore the amplitude values of the sequences are shifted to different base levels. Bashir and Kempf [3] introduced a simple method for converting multi-dimensional channel data to one time series by a direct sum of all channels with no loss of accuracy. We also take advantage of this dimension reduction by direct sum. This paper deals with person identification by applying Dynamic Time Warping and its variant, Bio-inspired Reference Level Assigned Dynamic Time Warping, a technique which provides fast and accurate classification results. In Section 2, the database and classifier used for our experiment are described, and the concept of the proposed method and the speed-up by dimension reduction of time series are discussed. Then, in Section 3, experimental results are presented. Section 4 finally summarizes the major findings and highlights the future prospects of the application.

2 Database and DTW Based Classifier

The signature database consists of 420 signatures from 42 writers (10 signatures from each writer). Signatures are captured by a digital pen. The pen is equipped with a diversity of sensors measuring two refill pressures, one finger grip pressure of the hand holding the pen, and the acceleration and tilt angle of the pen during handwriting on a commonly used paper pad. A captured signature can be represented by the time series of five sensor channels: x(t) horizontal pressure, y(t) finger grip pressure, z(t) vertical pressure, α(t) longitudinal angle and β(t) vertical angle. For the evaluation task, the database is divided into query (test) and reference (prototype) samples. The Dynamic Time Warping (DTW) based classifier measures a distance for matching two signature time series; the minimum distance determines the similarity. Natural variations of a person's genuine handwritten signatures, in terms of non-linear distortions in the time domain, are minimized before the DTW distance calculations. Generally, the Euclidean distance is determined for the optimally aligned time series. A review of DTW is omitted; we refer to [2] and [7] for details.
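For reference, a textbook O(mn) DTW distance between two one-dimensional sequences can be sketched as follows (an illustration, not the authors' implementation):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping distance between 1-D sequences a and b,
    with unit step pattern and absolute-difference local cost (O(m*n))."""
    m, n = len(a), len(b)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]
```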
2.1 Preprocessing of Time Series

After acquisition, the data is pre-processed in order to eliminate potential sensor noise. The essential pre-processing steps are segmentation of the data, smoothing, normalizing, and down-sampling of the data without discarding valuable information. The signatures are captured separately, so no separate segmentation of the signals is required in our study. Smoothing of the data, based on local regression, is done to minimize sensor noise. In order to partly compensate for large variations in time duration, the normalization of two signature signals is done in such a way that the time is normalized to the shorter signature signal. In order to reduce the complexity of the DTW based classifier, the five-dimensional time series of one signature are converted to one dimension by a direct sum, in such a way that the amplitude of each channel is normalized to [-1, 1] before the conversion. The data is further down-sampled to a lower sampling rate. We use the smooth() and decimate() functions of MATLAB for smoothing and data down-sampling. Data processing and the implementation of the DTW algorithms in MATLAB were done on a Pentium 4 processor (2.4 GHz, 3 GB RAM).

2.2 Reference Level Assignment to Time Series

In DTW based classifiers, the signature time series are generally normalized in the time and amplitude domains, but we propose a special treatment for the amplitude shifting. In our study it is found that the Reference Level (RL) is unique for a writer. We add the person-specific, so-called reference level to the corresponding time series; consequently, the amplitude values are shifted to new base levels.
Fig. 1. Person specific reference level values are shown against number of writers. This shows the distribution of values for 42 writers.
The distribution of the reference level values is shown in Fig. 1. Standard deviation (STD) values are calculated for each channel signal, and different combinations of these values were tested for best performance. The RL value giving the best accuracy in person identification is given by equation (1):

RL = mean{ STD(x(t)), STD(β(t)) }   (1)

where x(t) is the refill pressure and β(t) is the vertical angle of the pen during handwriting.

2.3 Architecture of Proposed System

Fig. 2 shows the proposed system. It can be described as segmentation, noise removal, normalization of each individual channel amplitude to [-1, 1], dimension reduction by sum, and down-sampling of the data to a lower sampling rate. The two schemes for the classification of the data are as follows: in the first scheme, two sequences are normalized in the time domain in such a way that the time is normalized to the shorter signature signal; in the second scheme, besides the time normalization, the amplitude values of the time series are shifted to their bio-inspired reference levels.
Fig. 2. Proposed person identification system shows data acquisition, pre-processing: Normalization, noise removal and dimension conversion, data down-sampling and classification of signatures based on (1) standard DTW and (2) proposed reference level assigned DTW technique
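An illustrative sketch of the reference-level assignment of Eq. (1) combined with the DTW comparison is given below (not the authors' code); it assumes the channels are NumPy arrays ordered as [x, y, z, alpha, beta], that each series is shifted by its own reference level, and that dtw_distance is a function such as the sketch in Section 2.

```python
import numpy as np

def reference_level(x, beta):
    """Bio-inspired reference level, Eq. (1): mean of the standard deviations
    of the refill pressure x(t) and the vertical angle beta(t)."""
    return 0.5 * (np.std(x) + np.std(beta))

def to_single_series(channels):
    """Normalize each channel to [-1, 1] and sum them into one time series."""
    norm = [2 * (c - c.min()) / (c.max() - c.min()) - 1 for c in channels]
    return np.sum(norm, axis=0)

def rl_assigned_distance(query_channels, ref_channels, dtw_distance):
    """DTW1-style comparison: shift each combined series by its writer-specific
    reference level before computing the DTW distance."""
    q = to_single_series(query_channels) + reference_level(query_channels[0],
                                                           query_channels[4])
    r = to_single_series(ref_channels) + reference_level(ref_channels[0],
                                                         ref_channels[4])
    return dtw_distance(q, r)
```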
3 Experiments and Results

The signature database consists of 420 signature samples from 42 writers (10 signatures from each writer). For the evaluation task, the database is divided into query (test) and reference (prototype) samples. We are interested in the classification of sequences
obtained from handwritten signatures, so one out of the 420 samples is repeatedly selected as the query and matched against all remaining samples. The minimum DTW distance determines the accuracy of the match. In DTW based classifiers, the signature time series are generally normalized in the time and amplitude domains. In order to evaluate the performance of the proposed method we performed two experiments.
(a) DTW1: The DTW technique is applied to two signature sequences in such a way that the time is normalized to the shorter signature signal and the amplitude base levels are shifted to the person-specific reference levels, as described in Section 2.
(b) DTW2: The DTW technique is applied to two signature sequences in such a way that the time is normalized to the shorter signature signal, with no shifts in the amplitude values.
The time complexity of both DTW techniques is O(mn), or O(m²) with m = n, where m is the length of the signature sequence [2]. The average number of data points per signature in our study was about 3000 ± 660. A speedup of the computations is obtained by data down-sampling, as shown in Table 1. The speedup obtained over classic DTW in our experiments is O(m²)/D², where D is the down-sampling factor. The classification accuracy of the signature sequences in terms of Error Rate is shown in Table 1. At D = 6, the Error Rate of the proposed method DTW1 is about 3 times lower than that of DTW2, and a similar situation of lower error rates is shown for the other values of D in Table 1. Another perspective on our experimental results is the reduction of computations: for DTW2, the Error Rate at D = 30 is 0.2265 with a computational complexity of O(m²)/900, while the proposed method is faster and has about the same error rate at D = 40, with a computational complexity of O(m²)/1600. A similar effect is shown for the other D values in Table 1.

Table 1. Average performance over 42 writers in terms of Error Rate: DTW1 (proposed method, with shifted amplitude values) and DTW2 (without amplitude shifting)
D   | Error Rate DTW1 | Error Rate DTW2
6   | 0.0058          | 0.0174
10  | 0.0116          | 0.0232
20  | 0.0290          | 0.0581
30  | 0.1161          | 0.2265
40  | 0.2371          | 0.5923
50  | 0.4123          | 1.2602
60  | 0.8188          | 1.9744
70  | 1.1789          | 2.8513
100 | 1.7422          | 4.1115
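As a rough sketch (not the authors' code) of the comparison carried out in DTW1 and DTW2, the sequences can be down-sampled by a factor D and, for DTW1, shifted by the reference level of equation (1) before a standard DTW distance is computed. The function names and the simple absolute-difference local cost are assumptions of this sketch.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(m*n) dynamic time warping distance between two 1-D sequences."""
    m, n = len(a), len(b)
    cost = np.full((m + 1, n + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[m, n]

def compare_signatures(query, reference, rl_query=0.0, rl_reference=0.0, D=30):
    """Down-sample by factor D and (for DTW1) shift each sequence by its
    person-specific reference level before computing the DTW distance."""
    q = np.asarray(query, dtype=float)[::D] + rl_query
    r = np.asarray(reference, dtype=float)[::D] + rl_reference
    return dtw_distance(q, r)
```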
The receiver operating characteristic (ROC) curves for DTW1 and DTW2 are shown in Fig. 3. The higher value of the area under the curve (AUC) for DTW1 compared with the AUC value of DTW2 shows the better classification of signature sequences achieved with the proposed method DTW1.
Fig. 3. ROC curves for DTW1 (line with cross as marker) and DTW2 (line with dot as marker) are shown. The figure is zoomed to lower scale in order to increase readability.
4 Conclusion
In this paper we introduced a new reference level assigned dynamic time warping technique for person identification based on handwritten signatures. The acquisition of online data is carried out with the help of a digital pen during handwriting on a commonly used paper pad. Generally, in DTW based classifiers the signature sequences are normalized in the time and amplitude domains. We introduced a special approach to amplitude normalization: a useful feature of handwriting during signing, the reference level, is found to be unique for each writer. In the proposed method the reference levels are added to the amplitude of the signature sequences, so that the base levels of the amplitudes are shifted to different levels. We achieve the goal of lower errors in signature classification with the help of the proposed method. The experimental results show a speedup of computations over classical DTW, because the proposed method allows a high level of data abstraction in terms of data down-sampling without loss of accuracy. A further speedup of computations can be achieved by involving state of the art fast DTW algorithms. The focus of the present study is signature classification for person identification. The effectiveness of the proposed method needs to be evaluated as future work for person identification when data is sampled from the same writer in different sessions on different days, and for verification tasks in the presence of forgery tests.
Acknowledgment The support given by G. Scharfenberg, G. Schickhuber and BiSP team from the University of Applied Sciences Regensburg is highly acknowledged.
References
1. Impedovo, D., Modugno, R., Pirlo, G., Stasolla, E.: Handwritten Signature Verification by Multiple Reference Set. In: Int. Conf. on Frontiers in Handwriting Recognition, ICFHR (2008)
2. Keogh, E.J., Pazzani, M.J.: Scaling up Dynamic Time Warping for Data Mining Applications. In: Proc. 6th Int. Conf. on Knowledge Discovery and Data Mining, KDD (2000)
3. Bashir, M., Kempf, J.: Reduced Dynamic Time Warping for Handwriting Recognition Based on Multi-dimensional Time Series of a Novel Pen Device. In: IJIST, WASET, Paris, vol. 3.4 (2008)
4. Zhang, J., Kamata, S.: Online Signature Verification Using Segment-to-Segment Matching. In: Int. Conf. on Frontiers in Handwriting Recognition, ICFHR (2008)
5. Hao, F., Wah, C.C.: Online signature verification using a new extreme points warping technique. In: Pattern Recognition, vol. 24. Elsevier Science, NY (2003)
6. Bashir, M., Kempf, J.: Person Authentication with RDTW using Handwritten PIN and Signature with a Novel Biometric Smart Pen Device. In: SSCI Computational Intelligence in Biometrics. IEEE, Nashville (2009)
7. Keogh, E., Ratanamahatana, C.A.: Exact indexing of dynamic time warping. In: Knowledge and Information Systems, pp. 358–386. Springer, London (2004)
8. Henniger, O., Muller, S.: Effects of Time Normalization on the Accuracy of Dynamic Time Warping. In: BTAS, pp. 27–29. IEEE, Los Alamitos (2007)
Pressure Evaluation in On-Line and Off-Line Signatures Desislava Dimitrova and Georgi Gluhchev Institute of Information Technologies - BAS, 2, Acad. G. Bonchev Str., 1113 Sofia, Bulgaria {ddimitrova,gluhchev}@iinf.bas.bg
Abstract. This paper presents a comparison of pressure of signatures acquired from a digitizing tablet using an inking pen. The written signature is digitized with a scanner and a pseudo pressure is evaluated. The experiments have shown that the obtained histograms are very similar and the two modalities could be used for pressure evaluation. Also, the data from the scanned image proved to be more stable which justifies its use for signature authentication. Keywords: graphical tablet, scanner, signature, pressure evaluation, pressure distribution, sensor interoperability.
1 Introduction
Signatures are a recognized and accepted modality for authentication. That is why developing reliable methods for signature authentication is a subject of high interest in biometrics. There are two methods for signature acquisition: off-line and on-line [6]. The off-line method uses an optical scanner or CCD camera to capture written signatures, and specially designed software to measure geometric parameters such as shape, size, angles, and distances. However, important dynamic parameters and pressure cannot be measured directly. Nevertheless, one can estimate dynamic information from a scanned image using pseudo dynamic features [2,3,5] derived from pixel intensities in a grayscale image. It is intuitively acceptable to interpret dark zones in a grayscale image as zones of high pressure. The on-line method, where the signature is captured during signing, seems to be more appropriate due to the straightforward measurement of pressure, speed and other writer specific parameters like pen tilt, azimuth, velocity, acceleration, etc. Devices used in the on-line method are touch screens, Tablet PCs, PDAs, and graphical tablets. Often signature verification systems are designed to work with a particular input sensor, and changing the sensor leads to a decrease in the system's performance. Various input sensors exist, so the problem of sensor interoperability arises. Sensor interoperability can be defined as the capability of a biometric system to adapt to the data obtained from different sensors. In [1] sensor interoperability is evaluated on a signature verification system by using two different Tablet PC brands with similar hardware and 256 levels of pressure. In this paper we investigate the sensor interoperability problem by means of a comparison of pressure data taken from a set of signatures. We use an on-line
(graphical tablet) and an off-line (image scanner) input sensor with different pressure levels (1024 and 256, respectively). The paper is structured as follows. Section 2 discusses the proposed approach. Section 3 presents some experimental results. Finally, Section 4 includes some concluding remarks.
2 The Approach
In off-line signature investigations experts try to use pressure differences as a reliable identification feature due to their immunity to forgery. As a rule they distinguish between three levels of pressure: high, medium and low. However, it is quite difficult to define quantitative thresholds for the automatic detection of the zones of specific pressure. One possibility is to try to automatically construct three clusters of pressure and classify the strokes according to them. Another way, which seems more appropriate when a comparison between different devices is needed, and which was used in this paper, is to present pressure by a relatively small number of levels, say 16, that could be displayed in pseudo colors if required. It is especially interesting to see what the situation is in the case of different modalities, like on-line and off-line signatures. To investigate the problem, we used a graphical tablet with an inking pen. Thus, the same signature, written on a sheet of paper placed on the tablet, can be captured directly by the tablet software and scanned afterwards. In both cases triplets (x, y, pressure) are registered and saved in a database. For the comparison between the two data sets, histograms of 16 bins were used.
2.1 Data Acquisition
The digitized signature consists of a set of sample points, captured at a device dependent frequency. For the tablet the number of points is directly proportional to the time of signing. In this investigation we used a digitizing tablet WACOM Intuos3 A5 PTZ-630 with a resolution of 5080 lines per inch, 1024 levels of pressure, an acquisition area of 152.4 x 210.6 mm, and a sampling rate of 200 points per second, coupled with an inking pen which allows writing on a white sheet of paper placed on the tablet pad. Signature data is obtained using the Tablet SDK 1.7 in a C# software program. After that the sheet is scanned at a resolution of 200 dpi and 256 grey levels. For this an HP ScanJet 3400C scanner and a software program written in MATLAB were used.
2.2 Data Preprocessing
Due to the high sampling rate of the digitizing tablet (200 pps), we have to perform re-sampling in order to get rid of redundant data. In this way we lose information concerning the writing speed, which is implicitly incorporated in the data, but this is not of vital importance because in this study only the x and y coordinates and the pressure are used. All of the repeated points come with different levels of pressure, so we set the pressure to their average. The graphical tablet reports pressure values as integers in the [0, 1023] interval. The higher the pressure value, the darker the corresponding pixel appears. Grayscale pixel
intensities of the scanned signature image fall in the range [0, 255]. Here the lowest gray level is associated with the pixel of highest pressure, which is the opposite of the tablet. So, we have to adjust the two ranges and invert one of them. For this the tablet's range was compressed to 0–255 and the values were inverted. But this is not sufficient. There is one more thing that has to be taken into account: the different width of the lines in the two signatures. While the tablet delivers lines of one pixel width, the lines in the scanned image may be a few pixels wide, each pixel with a different intensity. This is because the ink is spread around the central line and the border pixels are brighter than the central ones. To overcome this, only the gray levels along the skeleton are used in the evaluation (Fig. 1). The second pitfall comes from the repetition of the same pixels captured by the tablet. To avoid this, only one of the repeated pixels was preserved and the average pressure value of all of its duplicates was used.
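A minimal sketch of the two adjustments just described, assuming the tablet and scanner data are available as arrays; the function names are illustrative, not part of the authors' software.

```python
import numpy as np

def align_tablet_pressure(tablet_pressure):
    """Map tablet pressure (integers in [0, 1023], high = strong) onto the
    scanner's grayscale range [0, 255] and invert it, so that low values
    correspond to high pressure as in the scanned image."""
    squeezed = np.asarray(tablet_pressure, dtype=float) * 255.0 / 1023.0
    return 255.0 - squeezed

def average_repeated_points(points, pressures):
    """Keep one entry per (x, y) point and replace its pressure by the
    average pressure of all repetitions of that point."""
    averaged = {}
    for (x, y), p in zip(points, pressures):
        averaged.setdefault((x, y), []).append(p)
    coords = list(averaged.keys())
    mean_pressure = [float(np.mean(v)) for v in averaged.values()]
    return coords, mean_pressure
```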
Fig. 1. From left to right: original signature and its skeleton
Fig. 2. From left to right: histogram of the scanned signature and histogram of the signature captured by the tablet
2.3 Pressure Presentation
In many cases histograms are used for a general presentation of data. They are appropriate when data has to be presented according to its magnitude, which is the case for pressure. Thus, two 16-bin histograms were built for the signatures obtained by the scanner and by the tablet, dividing the dynamic range into 16 intervals of equal length (Fig. 2).
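As a rough sketch, assuming the pressure values have already been brought to the common 0–255 range, the 16-bin histogram and the two comparison measures described in Section 3 (global Euclidean distance and largest bin-by-bin difference) could be computed as follows; the function names are assumptions.

```python
import numpy as np

def pressure_histogram(pressures, bins=16, value_range=(0, 255)):
    """16-bin histogram of pressure values over the common dynamic range."""
    hist, _ = np.histogram(pressures, bins=bins, range=value_range)
    return hist.astype(float)

def global_distance(h1, h2):
    """Global comparison: Euclidean distance between two histograms."""
    return float(np.linalg.norm(h1 - h2))

def max_bin_distance(h1, h2):
    """Local comparison: largest bin-by-bin difference."""
    return float(np.max(np.abs(h1 - h2)))
```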
3 Experimental Results
A group of three individuals was used to collect signatures. All the signatures of a particular signer were captured at the same time; thus, no changes are involved due to time delay. The histograms can be compared either globally, by evaluating the distance between them (using the Euclidean distance), or locally, bin by bin, looking for the largest difference. The values of both distances between the histograms of the same signatures are shown in Table 1. Table 2 and Table 3 show the results of the histogram comparison (global distance and max bin distance) within the set of signatures of a given signer, for the scanned signature images and for the signatures captured by the tablet, respectively. All pairs of signatures used in the experiments have similar histogram forms (Fig. 2) and some variation in the bin values. An interesting and unexpected observation is that the distances between signatures of the same writer captured by the tablet are much higher than the distances between the corresponding signatures captured by the scanner. This points out that the pressure evaluation in the case of a scanned image is more stable. Even the distances between the histograms from the scanned images and the tablet are smaller than the distances between the tablet histograms.
Table 1. Global distance and max bin distance between the corresponding histograms
Signature #      | 1     | 2    | 3    | 4     | 5     | 6     | 7     | 8
Global distance  | 16.36 | 8.90 | 8.24 | 28.39 | 23.55 | 29.50 | 18.49 | 21.57
Max bin distance | 9.06  | 6.91 | 5.13 | 17.01 | 10.30 | 12.70 | 11.13 | 13.62
Table 2. Histogram comparisons carried out for each individual for scanned signature images. Each cell (i,j) contains global distance / max bin distance between i-th and j-th signatures of the signer.
1 2
2 4.47/ 2.46
3 9.72/ 4.83 12.56/ 7.29
4 6
5 8.06/ 4.78
6 7.27/ 4.93 3.33/ 1. 90
7 8
8 5.57/ 3.89
Table 3. Histogram comparisons carried out for each individual for the signatures captured by the tablet. Each cell (i,j) contains global distance / max bin distance between i-th and j-th signatures of the signer.
1
2 20.32/ 13.35
2
3 12.82/ 8.93 15.82/ 10.13
4 6
5 21.8/ 14.40
6 17.8/ 13.01 20.8/ 12.29
7 8
8 29.85/ 15.31
4 Conclusions
In this paper a comparison has been carried out of the similarity in pressure obtained from the same signature, captured simultaneously by a graphical tablet and a scanner. The observed similarity justifies the use of pressure from scanned images, which is usual practice in forensic investigations and document authentication. But while being similar in shape, the obtained histograms are not interchangeable, i.e. it does not seem possible to do identification/verification based on the pressure evaluation of scanned signatures if the comparison is carried out with pressure values obtained by a tablet. Further investigation will be required for the estimation of the pressure distribution of signatures of the same individuals in both modalities. Acknowledgments. This investigation was supported by the Ministry of Education and
Sciences in Bulgaria, contract No BY-TH-202/2006.
References
1. Alonso-Fernandez, F., Fierrez-Aguilar, J., Ortega-Garcia, J.: Sensor interoperability and fusion in signature verification: A case study using Tablet PC. In: Li, S.Z., Sun, Z., Tan, T., Pankanti, S., Chollet, G., Zhang, D. (eds.) IWBRS 2005. LNCS, vol. 3781, pp. 180–187. Springer, Heidelberg (2005)
2. Ammar, M., Yoshida, Y., Fukumura, T.: A new Effective Approach for Automatic Off-line Verification of Signatures by using Pressure Features. In: 8th International Conference on Pattern Recognition, pp. 566–569. IEEE Press, Paris (1986)
3. Fierrez-Aguilar, J., Alonso-Hermira, N., Moreno-Marquez, G., Ortega-Garcia, J.: An off-line signature verification system based on fusion of local and global information. In: Maltoni, D., Jain, A.K. (eds.) BioAW 2004. LNCS, vol. 3087, pp. 295–306. Springer, Heidelberg (2004)
4. Hennebert, J., Loeffel, R., Humm, A., Ingold, R.: A new forgery scenario based on regaining dynamics of signature, pp. 366–375. IEEE Press, Seoul, Korea (2007)
5. Nestorov, D., Shapiro, V., Veleva, P., Gluhchev, G., Angelov, A., Stoyanov, I.: Towards objectivity of handwriting pressure analysis for static images. In: 6th Int. Conf. on Handwriting and Drawing, ICOHD 1993, Paris, July 5-7, pp. 216–218 (1993)
6. Plamondon, R., Lorette, G.: Automatic signature verification and writer identification – the state of the art. Pattern Recognition 22(2), 107–131 (1989)
Confidence Partition and Hybrid Fusion in Multimodal Biometric Verification System Chaw Chia, Nasser Sherkat, and Lars Nolle School of Computing and Technology Nottingham Trent University, Nottingham, UK {chaw.chia,nasser.sherkat,lars.nolle}@ ntu.ac.uk
Abstract. Sum rule fusion is a very promising multimodal biometrics fusion approach. However, we propose not to apply it uniformly across the multimodal biometric score space. By examining the score distributions of each biometric matcher, it can be seen that there exist confidence regions which enable the introduction of Confidence Partitions in the multimodal biometric score space. It is proposed that the Sum rule can be replaced by the Min or the Max rule in the Confidence Partitions to further increase the overall verification performance. The proposed idea, which is to apply the fusion rules in a hybrid manner, has been tested on two publicly available databases, and the experimental results show a 0.3%–2.3% genuine accept rate improvement at relatively low false accept rates.
1 Introduction
Multimodal biometrics have attracted great interest in the biometric research field in recent years. Given their potential to outperform single biometrics verification, many researchers have put their efforts into the exploration of different integration techniques. Integration at the score level is the most preferred approach due to its effectiveness and ease of implementation [1]. The Sum rule, one of the well known score level fusion rules, is a method that simply uses the addition of the individual biometric scores as the fusion result. Surprisingly, it appears to outperform many complicated fusion algorithms [2] and is widely employed in biometric research [3, 4, 5, 6, 7, 8]. Through sensitivity analysis, Kittler concluded that the superior performance of the Sum rule is due to its resilience to estimation errors [9]. In this paper, the assignment of Confidence Partitions (CP) in the multimodal biometric score space is introduced. Instead of applying the Sum rule over the complete region of the multimodal biometric score space, we suggest replacing the Sum rule in the different CPs with more appropriate rules (the Min and Max rules in this paper). This scheme enables the fusion of multimodal biometrics in a hybrid manner including the Sum rule. Figure 1 illustrates a typical biometric matcher score distribution that includes a genuine user and an impostor score distribution. There is a significant overlap region of the curves that causes the main difficulty in classifying the claimant into the genuine user or impostor group. The shaded regions outside the overlap part are confidence
regions. They represent the regions where only a single class of users can be found. Although the Sum rule performs well to produce reliable fusion scores, when the biometric scores are located in a confidence region it is suggested to apply a more appropriate rule instead of the Sum rule for a more reliable fusion score, for example the Min, Max rule [9] or the decision fusion rule [10]. The rest of the paper is organised as follows: Section 2 provides details about the proposed integration method. Section 3 presents the databases used, experiments, results and their analysis. Finally section 4 concludes the paper.
Fig. 1. Biometric matcher score distribution
2 Confidence Partition and Hybrid Fusion
Even though the proposed idea is feasible in a higher dimensional score space, it is only used for bimodal biometrics fusion in this paper. First of all, the score distributions of the bimodal matchers are constructed (the distributions will be modeled by a density estimation algorithm in future research). The regions within the distribution where only one type of user (either genuine user or impostor) is present are marked. Within the genuine user score distribution, the marked region is termed the genuine user confidence region, whereas the region within the impostor score distribution is termed the impostor confidence region. Consequently, a two dimensional score space is created. The Genuine User Confidence Partition (GCP) in the score space is assembled from both modalities' genuine user confidence regions. Also, the Impostor Confidence Partition (ICP) in the score space is formed by both modalities' impostor confidence regions. Prior to applying the fusion rule, we need to normalise the scores from the different biometric matchers into a common domain before they can be effectively combined [11]. The simplest normalisation technique is the Min-max normalisation [11], which is shown in (1). It is a rule that maps the biometric scores into the interval between 0 and 1. The minimum value (min) and the maximum value (max) of the score distribution can be estimated from a set of matching scores. The notation used in the equations is as follows: Si is the biometric score of user i, S'i represents the normalised score for user i, Sfi is the fused score for the particular user, and K represents the total number of matchers.
S'i = (Si − min) / (max − min)   (1)
By introducing the CP, multiple rules can be applied over the multimodal biometric system in a hybrid manner. In this work, the rules (2)–(4) have been applied. The hybrid fusion scheme is implemented according to the scenario shown in (5).

1. Sum Rule:
Sfi = Σ_{k=1..K} S'i,k , ∀i   (2)

2. Min Rule:
Sfi = min(S'i,1 , S'i,2 , ..., S'i,K) , ∀i   (3)

3. Max Rule:
Sfi = max(S'i,1 , S'i,2 , ..., S'i,K) , ∀i   (4)

4. Hybrid Rule:
Sfi = Min Rule, when <S'i,1 , S'i,2 , ..., S'i,K> falls in the ICP;
      Max Rule, when <S'i,1 , S'i,2 , ..., S'i,K> falls in the GCP;
      Sum Rule, elsewhere.   (5)

As shown in equation (5), for the partitions where we have high confidence from the biometric matchers we can apply the Min or Max rule, which is considered more appropriate than the Sum rule. The non-confidence partition, which is the complement of the CP, is the part that can easily be misclassified. Due to the superior performance of the Sum rule in dealing with the estimation error mentioned in Section 1, we employ this rule in these non-confidence partitions.
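As a rough sketch (not the authors' code), the hybrid rule of equation (5) can be written as below. Representing the confidence partitions by per-matcher thresholds mirrors the manual assignments given later in Table 1 and is an assumption of this sketch, as are the function names.

```python
import numpy as np

def minmax_normalise(score, s_min, s_max):
    """Min-max normalisation of a raw matcher score into [0, 1] (equation (1))."""
    return (score - s_min) / (s_max - s_min)

def hybrid_fuse(scores, icp_thresholds, gcp_thresholds):
    """Hybrid rule of equation (5) for one claimant.

    scores         -- normalised scores, one per matcher
    icp_thresholds -- upper bounds per matcher; all scores below => ICP
    gcp_thresholds -- lower bounds per matcher; all scores above => GCP
    """
    s = np.asarray(scores, dtype=float)
    if np.all(s < np.asarray(icp_thresholds)):
        return float(np.min(s))      # Min rule inside the impostor CP
    if np.all(s > np.asarray(gcp_thresholds)):
        return float(np.max(s))      # Max rule inside the genuine user CP
    return float(np.sum(s))          # Sum rule elsewhere
```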
3 Experimental Results
The proposed method has been tested on two publicly available databases, the NIST-BSSR1 multimodal database [12] and the XM2VTS benchmark database [13]. In the NIST-BSSR1 multimodal database there are 517 genuine user scores and 266,772 impostor scores, whereas the XM2VTS database (evaluation set) includes 400 genuine user scores and 111,800 impostor scores. Both databases are truly multimodal (the chimeric assumption is not used [14]). The performance graphs of each matcher in the databases are depicted in figure 2.
Fig. 2. Performance of baseline matchers (a) NIST-BSSR1 Multimodal Matchers and (b) XM2VTS Matchers Performance
Only the best and the worst biometric matchers from each modality are chosen for the experiments. In the NIST-BSSR1 multimodal database, the right index fingerprint has been paired with the facial matcher C and the left index fingerprint has been paired with the facial matcher G to develop the best and worst multimodal biometrics fusion respectively. For the XM2VTS database, the best facial matcher DCTb-GMM is paired with the best speech matcher LFCC-GMM whereas the worst DCTb-MLP facial matcher is paired with the worst speech matcher PAC-GMM in the experiments.

Table 1. Assignment of Confidence Partitions in the experiments

                          | Impostor Confidence Partition | Genuine User Confidence Partition | Non-Confidence Partition
NIST-BSSR1 Best Matchers  | Sface < 0.55, Sfinger < 0.15  | Sface > 0.34, Sfinger > 0.20      | Other than Confidence Partitions
NIST-BSSR1 Worst Matchers | Sface < 0.35, Sfinger < 0.09  | Sface > 0.20, Sfinger > 0.20      | Other than Confidence Partitions
XM2VTS Best Matchers      | Sspeech < 0.48, Sface < 0.44  | Sspeech > 0.41, Sface > 0.60      | Other than Confidence Partitions
XM2VTS Worst Matchers     | Sspeech < 0.43, Sface < 1.00  | Sspeech > 0.67, Sface > 0.79      | Other than Confidence Partitions
The GCP and ICP are assigned manually according to the values in Table 1. All the fusion results based on the best and worst multimodal matcher combinations are graphically shown in figure 3 and figure 4. Their numerical results are also presented in Table 2 and Table 3. It is worth mentioning that the genuine accept rate (GAR) listed in the tables is reported at a false accept rate (FAR) of 0.001%. From the graphical and numerical results shown in figures 3 and 4 and Tables 2 and 3, we can conclude that the proposed method outperforms Sum rule fusion, especially at lower FAR, even though there are no significant improvements of the equal error rate (EER), which is the rate where the FAR is equal to the false reject rate (FRR). The best matchers hybrid fusion for the NIST-BSSR1 dataset achieved 93% GAR, which is 0.7% more than the Sum rule, whereas in XM2VTS the best matchers hybrid fusion achieved 96.3% GAR, which is 0.3% better than the Sum rule. The GAR
Fig. 3. Performance of the NIST-BSSR1 bimodal biometrics fusion on (a) the best multimodal matchers and (b) the worst multimodal matchers

Table 2. Accept rates and error rates of the NIST-BSSR1 Multimodal database single biometrics and the combined multimodal biometrics

               | Fingerprint EER / GAR | Face EER / GAR | Sum EER / GAR | Hybrid EER / GAR
Best Matchers  | 8.6% / 70.0%          | 5.8% / 61.1%   | 1.6% / 92.3%  | 1.3% / 93.0%
Worst Matchers | 4.5% / 82.7%          | 4.3% / 56.9%   | 0.5% / 91.9%  | 0.5% / 94.0%
improvement becomes more obvious in the worst matchers hybrid fusion in both databases. The hybrid fusion gains an additional 2.1% and 2.3% GAR improvement compared to the Sum rule in the NIST-BSSR1 and XM2VTS databases respectively. The respective Sum rule performances are 91.9% and 62.0% in NIST-BSSR1 and XM2VTS. As can be observed from the scatter plots, the best matchers achieve a very good separation between the genuine user and impostor score distributions. Therefore the Sum rule is able to produce a very reliable fusion score. As a result no significant hybrid fusion improvement can be obtained when comparing it with the Sum rule. However, the Sum rule performs worse when fusing multimodal biometrics with lower authentication rates. In this case, the use of the hybrid fusion rule leads to an improvement over Sum rule fusion. Like the work shown in [4], our work confirms again that a higher accuracy biometric system leaves less room for improvement.
Fig. 4. Performance of the XM2VTS bimodal biometrics fusion on (a) the best multimodal matchers and (b) the worst multimodal matchers
In a bimodal biometric system, the Sum fusion score is simply the sum of the Min fusion score and the Max fusion score. Further, within the confidence partition the difference between the minimum score and the maximum score will not be significant. As a result, the improvements of the GAR achieved in the experiments are within the range of 0.3%–2.3%. It is assumed that the improvement can be further increased when the Min and Max rules are replaced by a higher confidence fusion rule, for example the decision fusion rule. In fact, the improvement also relies on a more accurate assignment of the CP and depends on the number of claimants whose multimodal biometric scores fall in the confidence partitions. The more scores fall in the CP, the more improvement of the hybrid fusion can be obtained.
Table 3. Accept rates and error rates of XM2VTS single biometrics and their combined multimodal biometrics
               | Face EER / GAR | Speech EER / GAR | Sum EER / GAR | Hybrid EER / GAR
Best Matchers  | 1.8% / 81.3%   | 1.1% / 58.3%     | 0.5% / 96.0%  | 0.5% / 96.3%
Worst Matchers | 6.4% / 0.0%    | 6.4% / 19.0%     | 2.5% / 62.0%  | 2.5% / 64.3%
4 Conclusions After the introduction of the confidence partition, we have proposed to use more appropriate fusion rules (Min and Max rule in this paper) in the confidence partitions instead of Sum rule. This approach enables the rule based fusion to be applied in a hybrid manner that includes Sum, Min and Max rules. In the preliminary experiments, we showed that the manually operated hybrid rule performed better than the Sum rule. The future exploration will be focusing on automatic assignment of confidence partitions across the biometric score space. An investigation into integration of decision rule in the developed hybrid fusion framework will also be conducted.
References
1. Ross, A., Jain, A.K.: Multimodal Biometrics: An Overview. In: 12th European Signal Processing Conference (EUSIPCO), September 2004, pp. 1221–1224 (2004)
2. Ross, A., Jain, A.K.: Information Fusion in Biometrics. Pattern Recognition Letters 24, 2115–2125 (2003)
3. Jain, A.K., Ross, A.: Learning User Specific Parameters in A MultiBiometric System. In: IEEE ICIP, pp. 57–60 (2002)
4. Indovina, M., Uludag, U., Snelick, R., Mink, A., Jain, A.K.: Multimodal Biometric Authentication Methods: A COTS Approach. In: Proc. MMVA, Workshop Multimodal User Authentication, December 2003, pp. 99–106 (2003)
5. Ailisto, H., Vildjiounaite, E., Lindholm, K., Makela, S., Peltola, J.: Soft Biometrics - Combining Body Weight and Fat Measurements with Fingerprint Biometrics. Pattern Recognition Letters 27, 325–334 (2006)
6. Lu, X., Wang, Y., Jain, A.K.: Combining Classifiers for Face Recognition. In: ICME 2003, vol. 3, pp. 13–16 (2003)
7. Bouchaffra, D., Amira, A.: Structural Hidden Markov models for biometrics: Fusion of face and fingerprint. Pattern Recognition 41(3), 852–867 (2008)
8. Nanni, L., Lumini, A.: A Hybrid Wavelet-based Fingerprint Matcher. Pattern Recognition 40(11), 3146–3151 (2007)
9. Kittler, J.: On Combining Classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3), 226–239 (1998)
10. Lam, L., Suen, C.Y.: Application of Majority Voting to Pattern Recognition: An Analysis of Its Behaviour and Performance. IEEE Trans. Systems Man Cybernet. Part A: Systems Humans 27(5), 553–568 (1997)
11. Jain, A.K., Nandakumar, K., Ross, A.: Score Normalization in Multimodal Biometric Systems. Pattern Recognition 38(12), 2270–2285 (2005)
12. National Institute of Standards and Technology: NIST Biometric Scores Set, http://www.itl.nist.gov/iad/894.03/biomtricscores
13. Poh, N., Bengio, S.: Database, Protocol and Tools for Evaluating Score-Level Fusion Algorithms in Biometrics Authentication. Pattern Recognition 39(2), 223–233 (2006)
14. Poh, N., Bengio, S.: Can Chimeric Persons Be Used in Multimodal Biometric Authentication Experiments? In: Renals, S., Bengio, S. (eds.) MLMI 2005. LNCS, vol. 3869, pp. 87–100. Springer, Heidelberg (2006)
Multi-biometric Fusion for Driver Authentication on the Example of Speech and Face Tobias Scheidat1, Michael Biermann1, Jana Dittmann1, Claus Vielhauer1,2, and Karl Kümmel2 1
Otto-von-Guericke University of Magdeburg, Universitätsplatz 2, 39106 Magdeburg, Germany 2 Brandenburg University of Applied Sciences, PSF 2132, 14737 Brandenburg, Germany {tobias.scheidat,claus.vielhauer, jana.dittmann}@iti.cs.uni-magdeburg.de, {claus.vielhauer,kuemmel}@fh-brandenburg.de
Abstract. Nowadays biometrics is becoming an important field in IT security, safety and comfort research for the automotive domain. Aims are automatic driver authentication or recognition of spoken commands. In this paper an experimental evaluation of a system is presented which uses a fusion of three biometric modalities to verify the authorized drivers out of a limited group of potential persons, such as a family or small company, which is a common use case for the automotive domain. The goal is to show the tendency of the biometric verification performance in such a scenario. Therefore a multi-biometric fusion is carried out based on the biometric modalities face and voice in combination with the body weight. The fusion of the three modalities results in a relative improvement of 140% compared to the best individual result with regard to the used measure, the equal error rate. Keywords: Automotive, multi-biometric fusion, face, voice, body weight, compensational biometrics
1 Introduction
The automatic authentication of persons and information plays an important role in IT security research. There are three main concepts for user authentication: secret knowledge, personal possession and biometrics. Methods based on secret knowledge use information only known by the authorized person, such as a password. A special physical token is used for authentication in personal possession scenarios, which can be a smart card for example. A main problem of both strategies is the possibility that another person may get the authentication object (information, token), which can be stolen or handed over. In biometric applications the authentication object is based on a physical (e.g. fingerprint) or behavioral (e.g. voice) characteristic of the person, thus it cannot easily be misused by another person.
Biometric information has been used in cars for a couple of years in a simple way, to detect whether persons are sitting on the driver's and/or passenger's seats in order to activate the corresponding airbag, or to detect whether the seat belt is fastened. In most cases the biometric used here is the body weight, which is acquired by a binary sensor in the seat. Figure 1 shows a simplified car and different networks corresponding to different functions of a car. As shown in figure 1, in addition to the conventional networks for power train, instrumentation, chassis and body electrics, two new components are necessary for biometric systems which provide higher levels of safety, security and comfort: the biometrics network and the biometric database. The biometrics network is used to acquire the data from drivers and/or passengers and process it. The next step depends on the application of the system. One aim may be to decide whether a person is authorized to use the vehicle in a given way, while another goal can be to set up the infotainment and entertainment systems as well as the positions of seat, mirrors, heater and/or other comfort settings. The new biometric database component is used to store biometric reference data and fusion parameters such as general weights.
Fig. 1. Simplified car ([1], [2]) with the four conventional networks and two new biometric components biometrics network and biometrics database
In this paper we focus on automatic driver authentication based on the biometric modalities face, speech and body weight, combined with compensational biometrics and environmental sensor values. As compensational biometrics we use steering wheel pressure, pedal pressure and body volume. In contrast to static biometric systems, an automotive system suffers from differing environmental influences such as noise or changing illumination, as well as differing availability of sensors (e.g. dirty, broken or vibrating sensors). As shown in [3], existing sensor information could also support the adaptive calculation of the fusion result. In our scenario, such sensor information could be the light level, current speed or window position, used to determine a confidence value for the corresponding biometric data. This paper is structured as follows: the next section describes fundamentals of biometric fusion, while its second subsection gives a short overview of the biometric algorithms used for speech and face recognition and the third subsection explains the fusion strategy of our approach. The setup, methodology and results of the experimental evaluation are described in section three. The fourth section concludes the paper and gives a short outlook on future work in this area of biometric research.
2 Biometric Fusion
Despite their advantages in comparison to authentication systems based on secret knowledge and personal possession, biometric systems suffer from false recognition probabilities due to the variability of data from the same person (intra-class variability). Another problem exists in the form of similarities between data of different persons (inter-class similarity). In both cases, a combination of at least two authentication factors can be used to improve the security and/or authentication performance. While combinations of biometrics, knowledge and possession are also possible, in recent years the fusion of multiple biometric components has become more important in biometric research. There are a number of possibilities to improve the authentication performance of the single components involved in the fusion process. For example, it is possible to combine biometric modalities with each other as well as algorithms of one biometric modality. This section provides a short overview of the fundamentals of multi-biometric fusion, briefly introduces the algorithms for face and speech verification, and discusses the fusion strategy adapted for the evaluation described in this paper to show the recognition tendency in a small use case scenario of a family or small company.
2.1 Fundamentals of Multi-biometric Fusion
In order to decrease the influence of the drawbacks of biometric systems (i.e. intra-class variability, inter-class similarity), some current approaches use more than one biometric component, such as sensor, modality, algorithm of one modality or instance of one trait. In recent work on the combination of biometric components, fusion is carried out mainly on one of the following levels within the authentication process: feature extraction, matching score computation and decision. In fusion on the feature extraction level, each subsystem extracts the features from raw data, and fusion is done by combining the feature vectors of all particular subsystems into one single combined feature vector. In fusion on the matching score level, each system involved determines its own matching score; then all single scores are fused in order to obtain a joint score as the basis for the authentication decision. Fusion on the decision level is carried out at a late stage within the authentication process, because the single decisions are made by each system separately, followed by the fusion of all individual results. In [4] Ross et al. suggest a classification based on the number of biometric traits, sensors, classifiers and units involved: The fusion in multi-sensor systems is based on different sensors which acquire the data for one biometric modality. Multi-algorithmic systems use multiple algorithms for the same biometric trait. In the case of multi-instance systems, multiple representations of the same modality are used for the fusion process. Besides multiple physical traits such as fingertips, behavioral modalities also provide the usage of multiple units. Multi-sample systems use multiple samples of the same modality, e.g. different positions of the same fingerprint. Multimodal systems combine at least two different biometric modalities to improve the authentication performance of the single systems involved. By using such multi-biometric fusion approaches, higher levels of authentication performance, security and user acceptance can be reached. However, for some strategies, the complexity of usability increases with each additional fusion component (e.g.
sensor, modality or trait). Here a user has to present more than one biometric trait or instance of the same trait to the biometric system. But problems caused by disabilities, illness and/or age can be compensated by some of these systems. For example, a trait that cannot be recognized to a sufficient degree can be ignored.
2.2 Biometric Algorithms
In the evaluation described here, two existing algorithms were used to calculate a matching score: Speech Alize for speech recognition and 2DFace for face recognition. Both methods are shortly introduced in the next subsections. To perform the verification using the speech modality based on Speech Alize ([5], see figure 2, p2), 16 Mel Frequency Cepstral Coefficients (MFCC), their first order deltas, and also the energy and delta-energy are calculated for the first audio channel. For this the audio material is divided into Hamming windows of 20 ms with a shift of 10 ms. Frequencies between 300 Hz and 8000 Hz are analyzed. Afterwards preprocessing is done, such as normalization of the energy coefficients, silence removal and feature normalization. For the enrollment one world model GMM (Gaussian Mixture Model) is generated, which is later adapted to the claimed user as that user's own GMM. Finally a score for each test feature vector is calculated. In general the 2DFace algorithm ([6]) uses the standard Eigenfaces approach for face recognition (see figure 2, p1). First the used images are normalized: a fixed mask is used to crop the image to 51x55 pixels, so that the located eyes are at the same position for all reference and test data. Afterwards the face space is built, where Principal Component Analysis is used to reduce the dimensionality of the space while 99% of the variance is retained. For both target and test images, features are extracted which correspond to the projection of the images onto the face space. Finally the score is generated using the L1-norm as a distance measure. Both systems are based on world models for user verification. To generate the Speech Alize world model, 24 randomly chosen audio samples from [7] were used. For 2DFace an existing world model from [8] was used in order to have more than the four registered users for the model.
2.3 Fusion Strategy
The approach used for fusion in this paper is based on the Enhanced Fusion Strategy (EFS) introduced in [2] by Biermann et al. In total three biometric modalities are used to form a fused matching score in our automotive scenario. The matching scores of the body weight and the compensational biometrics were simulated because the corresponding sensors were not installed in the generalized car and/or have not been developed yet. The two remaining main modalities are face and voice. In general the final matching score is the weighted sum of the individual matching scores of the single biometric modalities involved. Figure 2 shows the scheme of a possible biometric fusion of the biometrics face, speech and body weight, which are based directly on the driver, as suggested in [1]. Additionally, since biometric authentication in the automotive scenario depends on many environmental factors, the figure shows a selection of such factors.
Fig. 2. Biometric authentication of the driver [1]
As suggested in [2], our evaluation considers the biometrics face, speech and body weight, and the compensational biometrics steering wheel pressure, pedal pressure and body volume (not shown in figure 2). The steering wheel pressure is the pressure of the hands on the steering wheel and replaces the speech modality if that is broken. The pedal pressure is the pressure-based behavior of the pedal usage, which compensates the face modality. The biometric body volume is measured by the length of the seat belt used and replaces the main biometric body weight. In this initial experimental evaluation additional factors such as environmental conditions are not taken into account. Thus, the matching score calculation of the Enhanced Fusion Strategy (EFS) introduced in [2] is applied as shown in equation (1).
MS_fus,t = Σ_{j=1..3} B_j,t * A_j,t * [ {MBU_j,t * WN_t * W_j * MS_j,t} + {(1 − MBU_j,t) * V_CB_j,t * F_CB_j,t * WN_t * W_CB_j * MS_CB_j,t} ]   (1)

Σ_{j=1..3} W_j = 1;  W_j ∈ [0, 1];  V_CB_j,t ∈ [0, 1];  F_CB_j,t ∈ [0, 1]   (2)
Here the variables are defined as follows: MSj,t is the matching score of modality j at time t, while MSfus,t is the fused matching score at time t. Bj,t are binary operands, and the related part consisting of main modality j and compensational modality j becomes zero if the corresponding operand is set to 0. In order to have the possibility to manipulate the standard weighting, the operands Aj,t were introduced. Using this parameter, the individual weighting of a modality can be decreased or increased by the system. MBUj,t is a parameter which describes the influence of the main biometrics (MBj, here face, speech and body weight) on MSfus,t. MBUj,t is a parameter which describes the functionality of the sensors of the main biometrics (MBj, here face, speech and body weight). Wj are constant weights that are based on an estimation of the equal error rate (EER) of the related modality j. Since the body weight fluctuates with a higher magnitude than the other two modalities, it is weighted with 0.2. The main biometrics face and speech are weighted with 0.4. MS_CBj denotes the matching
score of the compensational biometrics, which compensates a breakdown of the main biometrics. For the compensational biometrics three additional values are used: the functionality of the sensor F_CBj, the confidence factor V_CBj and the weight W_CBj. Please note that, as shown in equation (2), the sum of the main biometrics' weights amounts to 1.
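A simplified sketch of the weighted fusion in equation (1), deliberately ignoring the time index and the additional operands A, WN, V_CB and F_CB for readability; the function and parameter names are assumptions of this sketch, not the authors' implementation.

```python
def efs_fuse(main_scores, comp_scores, weights, comp_weights,
             main_usable, active=(1, 1, 1)):
    """Simplified Enhanced Fusion Strategy score in the spirit of equation (1).

    main_scores  -- matching scores of the main biometrics (face, speech, body weight)
    comp_scores  -- scores of the compensational biometrics (steering wheel
                    pressure, pedal pressure, body volume)
    weights      -- constant weights W_j of the main biometrics (sum to 1)
    comp_weights -- weights W_CB_j of the compensational biometrics
    main_usable  -- 1 if the main sensor j works, 0 if its compensational
                    biometric has to take over (MBU_j)
    active       -- binary operands B_j switching a modality pair on or off
    """
    fused = 0.0
    for j in range(3):
        main_part = main_usable[j] * weights[j] * main_scores[j]
        comp_part = (1 - main_usable[j]) * comp_weights[j] * comp_scores[j]
        fused += active[j] * (main_part + comp_part)
    return fused

# Hypothetical example: body weight sensor broken, replaced by body volume
score = efs_fuse(main_scores=[0.7, 0.6, 0.0], comp_scores=[0.0, 0.0, 0.5],
                 weights=[0.4, 0.4, 0.2], comp_weights=[0.0, 0.0, 0.2],
                 main_usable=[1, 1, 0])
```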
3 Experimental Evaluation
This section introduces the database and methodology which were used for the experimental evaluation of the single systems and their fusion. The evaluation results are presented and discussed in the third part of this section.
3.1 Database
The number of test persons is limited since the scenario is based on a family of four. Thus, video and audio data were acquired from four test persons. This simulated small family consists of one woman and three men. To record speech and face we used a webcam (Logitech Quickcam Pro 5000) mounted at the position of the rear-view mirror in a generalized car in our laboratory. For each user and modality one reference sample and two verification samples were acquired. The faces were captured frontally. The initial idea was to collect speech data of different content (so-called semantics, see also [9]), but the first evaluations showed that Speech Alize needs audio samples with a duration of more than 10 seconds to determine sufficient results. Thus, we decided to use longer spoken sequences which consist of several commands to the power train and instrumentation networks. Hence, the spoken semantic was '
| Start Engine | Yes | No | Cancel’. Please note, due to the German data protection law, the user’s name is a pseudonym. 3.2 Methodology Biometric error rates are applied, to determine the authentication performance of the fusion scenario. They have to be determined empirically, since it is not possible to measure these error rates from the system directly. In order to do so, for each threshold, the numbers of acceptances or rejections for authorized and non authorized persons are determined experimentally. On one hand, the false rejection rate (FRR) calculates the ratio between the number of false rejections of authorized persons and the total number of tests. On the other hand, the ratio between the number of false acceptances of non-authentic persons and the entire number of authentication attempts is the false acceptance rate (FAR). For a comparative analysis of verification performance of the fusion components, as well as those of their fusion the equal error rate (EER) is used. The EER is a common measurement in biometrics and denotes the point in error characteristics, where FRR and FAR yield identical values. However, the EER is not to be interpreted as the optimal operating point of a biometric system, it is mainly used as a normalized reference point for comparisons between algorithms or systems.
3.3 Results
As shown in the first row of Table 1, the fusion of the three biometrics face, speech and body weight leads to a better result compared to the results of the individual modalities. The best individual result was determined by the modality face with an EER of 15.00%, while speech and body weight result in EERs of 29.17% and 49.70% respectively. The fusion results in an equal error rate of approx. 6.24%. This corresponds to a relative improvement of approx. 140% compared to the best result of the single modalities (face: EER=15.00%).

Table 1. Evaluation results of the single modalities speech, face, weight and the compensational biometric body volume

Speech EER / weight | Face EER / weight | Body weight EER / weight | Body volume EER / weight | Fusion EER
29.17% / 0.4        | 15.00% / 0.4      | 49.70% / 0.2             | -                        | 6.24%
29.17% / 0.5        | 15.00% / 0.5      | -                        | -                        | 12.49%
29.17% / 0.4        | 15.00% / 0.4      | -                        | 41.48% / 0.2             | 6.25%
The case where the body weight system is out of order is shown in the second and third rows of Table 1. The second row illustrates the behavior of the fusion if no compensational biometric is used to replace the broken component. The fusion result of speech and face is 12.49%, which is on the one side a relative improvement of approx. 20% compared with the best single modality's result. On the other side, the fusion performance declines by approx. 50% compared to the result of the fusion of all three main biometrics. The last row of Table 1 shows the fusion results determined by substituting the compensational biometric body volume for the broken body weight system. While the individual result of the body volume amounts to 41.48%, a corresponding fusion result of 6.25% can be reached. This observation shows that, if the broken body weight system is compensated by the body volume system, a similar authentication performance is reached.
4 Conclusions and Future Work
In this paper we show an experimental evaluation of a biometric fusion based on the modalities speech, face and body weight, and the compensational biometrics steering wheel pressure, pedal pressure and body volume. Firstly, the results show that the fusion of speech, face and body weight leads to a better verification result compared to the individual results of the single modalities. The relative improvement amounts to approx. 140%, with an EER of 6.24%. Secondly, in the simulated case of a broken body weight recognition system, we use the corresponding compensational biometric body volume to provide a fall-back possibility. Here a fusion result of approx. 6.25% was reached with the compensational biometric, and an EER of 12.49% without it (bimodal fusion of speech and face). An evaluation of the suggested system based on a higher number of users is one of the main parts of our future work. This would correspond to other use cases in the automotive domain such as car pools of medium and/or big companies or car rental
Multi-biometric Fusion for Driver Authentication on the Example of Speech and Face
227
companies. Another of the next steps will be the integration of a speech recognition system in order to recognize the spoken content. This can be helpful to decide if a spoken text has to be used for driver authentication or as a command. The position of the speaker within the car may be evidence whether he or she is authorized to give commands. In the case a command is not spoken by the driver it should be ignored. Possibilities to detect the position of the speaker we see in the combination of speech and video by an analysis of the lip movement of the occupants or by using the independent component analysis.
Acknowledgements This work has been supported in part by the European Commission through the EFRE Programme "Competence in Mobility" (COMO) under Contract No. C(2007)5254.
References
1. Makrushin, A., Dittmann, J., Kiltz, S., Hoppe, T.: Exemplarische Mensch-Maschine-Interaktionsszenarien und deren Komfort-, Safety- und Security-Implikationen am Beispiel von Gesicht und Sprache. In: Alkassar, S. (ed.) Sicherheit 2008; Sicherheit - Schutz und Zuverlässigkeit; Beiträge der 4. Jahrestagung des Fachbereichs Sicherheit der Gesellschaft für Informatik e.V (GI), April 2-4, 2008, pp. 315–327 (2008)
2. Biermann, M., Hoppe, T., Dittmann, J., Vielhauer, C.: Vehicle Systems: Comfort & Security Enhancement of Face/Speech Fusion with Compensational Biometrics. In: MM&Sec 2008 - Proceedings of the Multimedia and Security Workshop 2008, Oxford, UK, pp. 185–194 (2008)
3. Nandakumar, K., Chen, Y., Jain, A.K., Dass, S.C.: Quality-based Score Level Fusion in Multibiometric Systems. In: Proceedings of the 18th International Conference on Pattern Recognition, ICPR, vol. 04. IEEE Computer Society, Washington (2006)
4. Ross, A., Nandakumar, K., Jain, A.K.: Handbook of Multibiometrics. Springer, New York (2006)
5. Reference System based on speech modality (2008), http://share.int-evry.fr/svnview-eph/filedetails.php?repname=ref_syst&path=%2FSpeech_Alize%2Fdoc%2FhowTo.pdf
6. A biometric reference system for 2D face (2008), http://share.int-evry.fr/svnview-eph/filedetails.php?repname=ref_syst&path=%2F2Dface_BU%2Fdoc%2FhowTo.pdf
7. bodalgo. Voice Over Market Place for Voice Overs, Voiceover Talents – Find Voice Overs The Easy Way! (2009), http://www.bodalgo.com (download: February 2009)
8. Subversion server of BioSecure reference systems, 2DFace (2009), http://share.int-evry.fr/svnvieweph/filedetails.php?repname=ref_syst&path=%2F2Dface_BU%2Fresults%2FmodelBANCA.dat (download: February 2009)
9. Vielhauer, C.: Biometric User Authentication for IT Security: From Fundamentals to Handwriting. Springer, New York (2006)
Multi-modal Authentication Using Continuous Dynamic Programming K.R. Radhika1 , S.V. Sheela1 , M.K. Venkatesha2 , and G.N. Sekhar1 1 2
B M S College of Engineering, Bangalore, India R N S Institute of Technology, Bangalore, India
Abstract. Storing and retrieving the behavioral and physiological templates of a person for authentication using a common algorithm is indispensable in on-line applications. This paper deals with the authentication of on-line signature data and textural iris information using continuous dynamic programming [CDP]. The kinematic derived feature, acceleration, is considered, and the shape of the acceleration plot is analysed. The experimental study depicts that, as the number of training samples considered for the CDP algorithm increases, the false rejection rate decreases.
1
Introduction
Research issues are based on iris localization, nonlinear normalization, occlusion, segmentation, liveness detection and large scale identification. Signature authentication is strongly affected by user dependencies, as it varies from one signing instance to another in a known way. Signature and iris, as biometric security technologies, have the great advantages of variability and scalability. The variability of the signature can be described as constant variation, which aids in the rejection of a duplicate in a modern self certified security system. Even though the captured iris image is a constant signal, it provides scalability for a variety of applications. In this paper the term sample refers to both an on-line signature and an iris sample.
2
Background
CDP aids the recognition process with a concept of grouping items with similar characteristics together. A part of a registered pattern given as an input pattern can be verified using the spotting method called CDP [3]. CDP, developed by R. Oka to classify real world data by a spotting method, allows ignoring the portions of data which lie outside of the task domain [4]. CDP is a nice tool to tackle the problems of time warping and spotting for classification [5]. Two-dimensional [2D] CDP allows characteristic transformation and is a quasi-optimal algorithm for the row axis and column axis, which combines spotting recognition with a reference image for tracking a target and making a segmented image the reference image for the next frame [6,7]. Incremental reference-interval-free CDP is applied for speech recognition, which gives us the idea of detecting similar
segments within one discourse sample, which can be extended to detecting segments that are similar across two independent samples [8]. The work done by T. Nishimura gives a clear on-line scenario, which detects a frame sequence in the map that matches an input frame sequence with real-time localization [9]. Shift CDP applies CDP to a constant length of unit reference patterns and provides a fast match between arbitrary sections in the reference pattern and the input pattern [10]. The spatio-temporal approach handles the direction change to obtain flexibility of the recognition system [11]. Reference-interval-free CDP is a technique that has been proposed as a way to assign labels, analyze content, and search time-series data. It works by detecting similar passages in two different sets of time-series data. H. Kameya et al. developed a writer authentication system for text independent verification [2].
3 Proposed System
In this experiment we have used a novel method for pupil extraction. The first peak of the histogram provides the threshold 't' for the lower intensity values of the eye image, as shown in Fig. 1(e). We label all the connected components in the sample eye image with intensity values less than 't'. Selecting the maximum-area component, we arrive at the pupil area of the eye, as shown in Fig. 1(a)-(d). A normalised bounding rectangle centred on the pupil is used to crop the iris biometry from the eye image. In the extracted 2D pupil area, the pixel intensity values are sorted in ascending order. The unique top 'g' gray-scale values, $I_{top} = \{I_1, I_2, I_3, I_4, \ldots, I_g\}$, which signify the lower intensity values, are considered. In this experiment the value of 'g' is 50; this value specifies the darkest part of the eye image. The set of co-ordinate values with $I_{top}$ intensity values forms the texture template. The larger the 'g' value, the larger the size of the texture template, which extends from the pupil into the iris.
Fig. 1. (a)-(d) Iris texture template formation (e) Histogram of eye image
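A minimal sketch of the pupil localisation and texture-template step described above is given below, assuming an 8-bit grayscale eye image held in a NumPy array; the use of SciPy, the peak-detection call and all function and variable names are illustrative assumptions, not part of the paper.

import numpy as np
from scipy import ndimage
from scipy.signal import find_peaks

def texture_template(eye_img, g=50):
    # Threshold 't' from the first histogram peak, as described in the text
    hist, _ = np.histogram(eye_img, bins=256, range=(0, 256))
    peaks, _ = find_peaks(hist)
    t = int(peaks[0]) if len(peaks) else int(np.argmax(hist))
    dark = eye_img < t                                # low-intensity candidate pixels
    labels, n = ndimage.label(dark)                   # connected components below 't'
    sizes = ndimage.sum(dark, labels, range(1, n + 1))
    pupil = labels == (int(np.argmax(sizes)) + 1)     # largest dark component = pupil area
    centre = ndimage.center_of_mass(pupil)            # centre for the bounding-rectangle crop
    vals = np.unique(eye_img[pupil])[:g]              # unique 'g' darkest gray values, I_top
    coords = np.argwhere(np.isin(eye_img, vals))      # co-ordinates forming the texture template
    return centre, coords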
3.1 Segmentation
Using each of the co-ordinate values confined in the texture template, the velocity and acceleration values are calculated using (1) and (2) in two dimensions.

$\mathbf{v} = \dot{r}\,\hat{r} + r\dot{\theta}\,\hat{\theta}$   (1)
$\mathbf{a} = \hat{r}\,(\ddot{r} - r\dot{\theta}^{2}) + \hat{\theta}\,\frac{1}{r}\frac{d}{dt}(r^{2}\dot{\theta})$   (2)

where $r = \sqrt{x^{2}+y^{2}}$, $\hat{r} = \hat{x}\cos\theta + \hat{y}\sin\theta$ and $\hat{\theta} = -\hat{x}\sin\theta + \hat{y}\cos\theta$. From signature samples, 3D on-line features are extracted. The x, y, pressure (z), pen azimuth and pen inclination coordinate sequences are the features considered [1]. Azimuth is the angle between the z-axis and the radius vector connecting the origin and the point of interest; inclination is the angle between the projection of the radius vector onto the x-y plane and the x-axis; the z-axis is the pressure axis. The velocity and acceleration values of a sample are calculated using (3) and (4) in three dimensions.

$\mathbf{v} = \hat{r}\,\dot{r} + \hat{\theta}\,r\dot{\theta} + \hat{\varphi}\,r\dot{\varphi}\sin\theta$   (3)

$\mathbf{a} = \hat{r}\,(\ddot{r} - r\dot{\theta}^{2} - r\dot{\varphi}^{2}\sin^{2}\theta) + \hat{\theta}\left(\frac{1}{r}\frac{d}{dt}(r^{2}\dot{\theta}) - r\dot{\varphi}^{2}\sin\theta\cos\theta\right) + \hat{\varphi}\,\frac{1}{r\sin\theta}\frac{d}{dt}(r^{2}\dot{\varphi}\sin^{2}\theta)$   (4)

where $r = \sqrt{x^{2}+y^{2}+z^{2}}$, $\hat{r} = \hat{x}\sin\theta\cos\varphi + \hat{y}\sin\theta\sin\varphi + \hat{z}\cos\theta$, $\hat{\theta} = \hat{x}\cos\theta\cos\varphi + \hat{y}\cos\theta\sin\varphi - \hat{z}\sin\theta$ and $\hat{\varphi} = -\hat{x}\sin\varphi + \hat{y}\cos\varphi$. Here r is the radial distance of a point from the origin; $\hat{x}$, $\hat{y}$ and $\hat{z}$ are unit vectors; $\theta$ is the azimuth angle and $\varphi$ is the angle of inclination; $\hat{r}$, $\hat{\theta}$ and $\hat{\varphi}$ are unit vectors in spherical co-ordinates; $\dot{r}$, $\dot{\theta}$ and $\dot{\varphi}$ are the time derivatives of the spherical co-ordinates.

The top 50 velocity values, taken in descending order, were considered for a signature sample and plotted against their positional values. From Fig. 2(a)-(b) it can be seen that the velocity scatter plots of two genuine samples look similar. From Fig. 2(b)-(c) it can be seen that the velocity scatter plot of a forged sample does not match the velocity plots of the two genuine samples. The numerical velocity values of genuine samples did not match, but the shape of the scatter plot looked similar; therefore the second-order derivative, acceleration, was considered. Similar results were found with the velocity values of the iris texture template. Let n be the total number of pixels of a sample, let $a_1, a_2, a_3, \ldots, a_n$ be the acceleration values generated using (4), and let $p_1, p_2, p_3, \ldots, p_n$ denote the positional values of the n pixels. The acceleration values are normalized with respect to the positional values using (5). It is observed that, irrespective of the length of the signature and the number of gray values, the scatter of acceleration values forms the same shape pattern for genuine signatures and genuine iris samples.

$p_i^{norm} = (p_i \times 100)/n, \quad \forall\, p_i^{norm} \leq 100$   (5)
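The following sketch shows how equations (3)-(5) might be evaluated numerically from a sampled (x, y, pressure) sequence, using finite differences for the time derivatives; the sampling period, the composition of the components into magnitudes and the helper names are assumptions made for illustration.

import numpy as np

def spherical_kinematics(x, y, z, dt=1.0):
    r = np.sqrt(x**2 + y**2 + z**2)
    theta = np.arccos(np.clip(z / np.maximum(r, 1e-9), -1.0, 1.0))  # angle from the pressure (z) axis
    phi = np.unwrap(np.arctan2(y, x))                               # angle in the x-y plane
    d = lambda s: np.gradient(s, dt)                                # first time derivative
    rd, thd, phd = d(r), d(theta), d(phi)
    v = np.sqrt(rd**2 + (r*thd)**2 + (r*phd*np.sin(theta))**2)      # Eq. (3), magnitude
    a_r = d(rd) - r*thd**2 - r*phd**2*np.sin(theta)**2              # Eq. (4), radial component
    a_t = d(r**2*thd)/r - r*phd**2*np.sin(theta)*np.cos(theta)
    a_p = d(r**2*phd*np.sin(theta)**2)/(r*np.sin(theta) + 1e-9)
    a = np.sqrt(a_r**2 + a_t**2 + a_p**2)
    return v, a

def normalise_positions(n):
    # Eq. (5): positional indices mapped to the range (0, 100]
    return (np.arange(1, n + 1) * 100.0) / n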
By normalizing position we obtain acceleration values in class intervals such as 10-19, 20-29, 30-39 and so on. The sample can now be segmented into m equal parts with respect to the acceleration values. In this paper we have considered m = 10. In some cases the pressure of the writing device may change in the starting and ending stages for the same person due to emotional variation, which in turn varies the acceleration scatter shape. Figure 3(a)-(b) shows the starting-stage acceleration variation. The same is observed in some cases of the iris texture template, due to white illumination spots and dark shadow spots.
Fig. 2. Scatter of velocity values pertaining to different percentages of signature length
Fig. 3. Initial part of the plot for two genuine samples of the same person
Of the m segments, the first and the last are not considered, in order to improve the system performance: they contain the settling-down acceleration components. Variations exist in the numerical acceleration values within genuine sample segments, with fewer variations in the shape of the acceleration plot. Figure 4(a) shows the shape similarity of the acceleration plot between two genuine samples of the same person, and Fig. 4(b) shows the large variation in shape of the acceleration plot when a genuine and a forged sample of the same person are compared. The proposed system matches the shape of the acceleration plot segment-wise; this matching of a partial reference to a partial input is achieved by CDP.
Fig. 4. (a) Depicts similar local minima direction changes for two genuine samples of the same person. (b) Depicts dissimilar local minima direction changes for one genuine sample and one forged sample
In each percentile range of acceleration values, a distance measure is generated, based on the count of directional changes:

$\forall\, a_i < a_{i+1},\ D_i = -1; \quad \forall\, a_i > a_{i+1},\ D_i = +1; \quad \text{else } D_i = 0$   (6)
The direction sequence set D is generated as D = {1, -1, 0, 1, 1, 1, -1, ...} using equation (6). The number of change-overs from '-1' to '+1' is counted in each range, e.g. $c_{40-49}$ for the 40-49 percentile range; these correspond to the local minima acceleration values in the considered percentile. The set of 8 counts, $[c_{10-19}, c_{20-29}, c_{30-39}, c_{40-49}, c_{50-59}, c_{60-69}, c_{70-79}, c_{80-89}]$, leaving out the counts of the first and last parts, forms the value array $v_a = \{v_a(1), v_a(2), \ldots, v_a(8)\}$, where $v_a(1) = c_{10-19}$, $v_a(2) = c_{20-29}$, ..., $v_a(8) = c_{80-89}$. Each training sample generates a value array. The reference genuine sample in the set of P training samples must be selected for subsequently testing a given genuine or forged sample; this is explained in Sect. 3.3.
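A sketch of the value-array computation is given below; it follows Eq. (6) and the percentile segmentation described above, but the exact binning and tie handling are assumptions.

import numpy as np

def value_array(acc, pos_norm):
    va = []
    for lo in range(10, 90, 10):                      # the eight inner segments 10-19, ..., 80-89
        seg = acc[(pos_norm >= lo) & (pos_norm < lo + 10)]
        diff = np.diff(seg)
        D = np.where(diff > 0, -1, np.where(diff < 0, 1, 0))    # Eq. (6)
        count = int(np.sum((D[:-1] == -1) & (D[1:] == 1)))      # change-overs from -1 to +1
        va.append(count)
    return np.array(va)                               # [c10-19, ..., c80-89]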
3.2 Algorithm
CDP accumulates minimum local distances [2]. For a sample r(t) (1 ≤ t ≤ S) and another sample i(τ) (1 ≤ τ ≤ T), the value arrays $v_a$ and $v_b$ are generated, which are bounded with τ. The values of S and T are 8, corresponding to the 8 percentile components: $v_a = \{v_a(1), \ldots, v_a(8)\}$ and $v_b = \{v_b(1), \ldots, v_b(8)\}$. The dynamic programming method follows a scan-line algorithm over the (t, τ) plane from the line t = 1 to the line t = S. R(t, τ) contains the minimum accumulated distance. The weight normalises the value of R(t, τ) to a locus of 3T; this uses the theorem that, between two fixed points A and B, the circle of Apollonius is the locus of the points P such that |PA| = 3·|PB|, where |PA| denotes the distance from point P to point A. For the cases τ = -1 and τ = 0, the accumulation is defined by R(t, -1) = R(t, 0) = ∞. For t = 1, R(1, τ) = 3·d(1, τ), where d is the local distance measure between r(t) and i(τ), d(1, τ) = |v_a(1) - v_b(τ)|.
For t = 2, R(2, τ) = min{R(1, τ-2) + 2·d(2, τ-1) + d(2, τ), R(1, τ-1) + 3·d(2, τ), R(1, τ) + 3·d(2, τ)}. For t = 3 to S, R(t, τ) = min{R(t-1, τ-2) + 2·d(t, τ-1) + d(t, τ), R(t-1, τ-1) + 3·d(t, τ), R(t-2, τ-1) + 3·d(t-1, τ) + 3·d(t, τ)}. Given the two value arrays $v_a$ and $v_b$, the cdp-value is found as cdp-value = min{R(1, 8), R(2, 8), R(3, 8), R(4, 8), R(5, 8), R(6, 8), R(7, 8), R(8, 8)}·(1/(3·T)).
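A direct transcription of the above recurrences is sketched below (the 1-indexed formulas are mapped onto arrays; the handling of τ-1 and τ-2 falling below 1 is an assumption).

import numpy as np

def cdp_value(va, vb):
    S, T = len(va), len(vb)                           # both 8 here
    d = lambda t, tau: abs(va[t - 1] - vb[tau - 1])
    INF = float('inf')
    R = np.full((S + 1, T + 1), INF)                  # R[t, tau]; row/column 0 stand for R(t, 0) = inf

    def get(t, tau):
        return R[t, tau] if 1 <= t <= S and 1 <= tau <= T else INF

    for tau in range(1, T + 1):                       # t = 1
        R[1, tau] = 3 * d(1, tau)
    for tau in range(1, T + 1):                       # t = 2
        c1 = get(1, tau - 2) + ((2 * d(2, tau - 1) + d(2, tau)) if tau >= 2 else INF)
        R[2, tau] = min(c1, get(1, tau - 1) + 3 * d(2, tau), get(1, tau) + 3 * d(2, tau))
    for t in range(3, S + 1):                         # t = 3 ... S
        for tau in range(1, T + 1):
            c1 = get(t - 1, tau - 2) + ((2 * d(t, tau - 1) + d(t, tau)) if tau >= 2 else INF)
            c2 = get(t - 1, tau - 1) + 3 * d(t, tau)
            c3 = get(t - 2, tau - 1) + 3 * d(t - 1, tau) + 3 * d(t, tau)
            R[t, tau] = min(c1, c2, c3)
    return min(R[t, T] for t in range(1, S + 1)) / (3 * T)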
3.3 Leave-One-Out Method
The training set [T-SET] consists of P samples. The $v_a$ array is calculated for the first sample, and all the samples in the T-SET lead to P $v_b$ arrays. The P cdp-values formed with the first sample's $v_a$ array and the P $v_b$ arrays form the first row. Similarly, the remaining P-1 rows of cdp-values are formed by taking each of the other P-1 samples in turn to form the $v_a$ array and computing its cdp-values against the P $v_b$ arrays. This forms a P x P matrix of cdp-values. The training sample whose row has the minimum average forms the reference/template value array $v_{a-person}$ for one person. The threshold value is the average of the P row averages multiplied by a constant Z. Z accounts for the constant behavioural variation in one's signature and for the angle of capture of the iris sample. CDP works on the differences between the $v_{a-person}$ and $v_b$ value arrays in the respective eight percentile parts of a sample. The $v_b$ array is obtained from any input sample, either a testing sample or a forged sample. If the cdp-value obtained is less than the threshold, the given input is considered a genuine sample; otherwise it is termed a forged sample.
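The enrolment step just described can be sketched as follows, reusing the cdp_value function from the previous sketch; the function names and the verification helper are illustrative.

import numpy as np

def enroll(training_value_arrays, Z=2.0):
    P = len(training_value_arrays)
    M = np.array([[cdp_value(va, vb) for vb in training_value_arrays]
                  for va in training_value_arrays])   # P x P matrix of cdp-values
    row_avg = M.mean(axis=1)
    ref = training_value_arrays[int(np.argmin(row_avg))]   # reference value array v_a-person
    threshold = row_avg.mean() * Z                    # average of the P row averages times Z
    return ref, threshold

def verify(ref, threshold, test_value_array):
    # genuine if the cdp-value against the reference falls below the threshold
    return cdp_value(ref, test_value_array) < threshold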
4 Experimental Results
100 x 25 x 2 signature samples were used from the MCYT-100 Signature Baseline Corpus, an on-line database. This database provides 25 genuine and 25 forged samples per person. For P = 10, 10 genuine samples form the T-SET, which is 40% of the genuine sample set. The remaining 15 genuine samples form the testing set, used to find the false rejection rate [FRR]. The 25 forged samples were used to find the false acceptance rate [FAR]. An acceptance rate of 97% and a rejection rate of 92% are obtained. The UBIRIS.v1 database is composed of 1877 iris images collected from 241 people in two distinct sessions, with 5 genuine samples per person. P = 3 genuine samples form the T-SET, which is 60% of the genuine sample set. An acceptance rate of 98% and a rejection rate of 97% are obtained. The CASIA Iris Image Database version 2.0 includes 1200 iris images from 60 eyes. For each eye, 20 images are captured in one session; two such sets are provided, with images captured using different devices. For P = 10, 10 genuine samples form the T-SET, which is 50% of the genuine sample set. An acceptance rate of 98% and a rejection rate of 97% are obtained. For large databases, storing and retrieving the reference templates will be a major issue. The experiment was further conducted for P = 7, 8 and 9. The results show that the FRR decreases and the FAR increases with increasing P values for Z = 1, 2, 3, 4, as shown in Table 1 and Fig. 5. FAR and FRR trade off against one another [12]. Keeping track of the receiver operating characteristic, the proposed system suggests P = 10 as a reliable value for authentication applications using CDP.
Table 1. FRR and FAR values for Signature and Iris

Signature     P=7     P=8     P=9     P=10
FRR Z=1       8.8%    8.72%   8.91%   8.73%
FRR Z=2       6.7%    6.01%   6.33%   5.71%
FRR Z=3       4.59%   4.02%   3.75%   3.37%
FRR Z=4       3.25%   2.55%   2.63%   2.37%
FAR Z=1       3.03%   3.02%   2.93%   3.07%
FAR Z=2       4.51%   4.51%   4.58%   5.06%
FAR Z=3       6.07%   6.07%   6.08%   6.6%
FAR Z=4       7.55%   7.52%   7.73%   8.02%

Iris          P=7     P=8     P=9     P=10
FRR Z=1       7.15%   6.06%   5.2%    5.16%
FRR Z=2       4.71%   3.46%   3.31%   1.71%
FRR Z=3       2.56%   1.78%   1.56%   0.58%
FRR Z=4       1.51%   1.05%   0.91%   0.2%
FAR Z=1       1.81%   3.2%    2.95%   3.71%
FAR Z=2       5.98%   6.53%   6.55%   6.83%
FAR Z=3       7.15%   8.68%   9.46%   9.71%
FAR Z=4       8.18%   8.48%   9.21%   9.9%
Fig. 5. (a) Depicts FRR decreasing and FAR increasing for P=10 as compared to P=7 for the MCYT database. (b) Depicts FRR decreasing and FAR increasing for P=10 as compared to P=7 for the CASIA database
5 Conclusion
The proposed system is a function-based approach dealing with local shape analysis of a kinematic value using CDP, both for on-line handwritten signatures and for the eye texture template. Instead of working on primary features or the raw image, the system works on a derived feature, which leads to a more robust security system. The moments of each segment can be considered as a reference feature for CDP to achieve global optimization.
Acknowledgment. The authors would like to thank J. Ortega-Garcia for the provision of the MCYT Signature database from the Biometric Recognition Group, B-203, Universidad Autonoma de Madrid, Spain [1], and H. Proenca and L.A. Alexandre, Portugal [13], for the UBIRIS database. Portions of the research in this paper use the CASIA-V2 database collected by the Chinese Academy of Sciences Institute of Automation (CASIA) [14].
References
1. Ortega-Garcia, J., Fierrez-Aguilar, J., et al.: MCYT baseline corpus: A bimodal biometric database. IEE Proc. Vision, Image and Signal Processing 150(6), 395–401 (2003)
2. Kameya, H., Mori, S., Oka, R.: A segmentation-free biometric writer verification method based on continuous dynamic programming. Pattern Recognition Letters 27(6), 567–577 (2006)
3. Kameya, H., Mori, S., Oka, R.: A method of writer verification without keyword registration using feature sequences extracted from on-line handwritten sentences. In: Proc. MVA2002 IAPR Workshop, vol. 1, pp. 479–483 (2002)
4. Oka, R.: Spotting Method for Classification of Real World Data. Computer Journal 41(8), 559–565 (1998)
5. Zhang, H., Guo, Y.: Facial Expression Recognition using Continuous Dynamic Programming. In: Proc. IEEE ICCV Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, pp. 163–167 (2001)
6. Iwasa, Y., Oka, R.: Spotting Recognition and Tracking of a Deformable Object in a Time-Varying Image Using Two-Dimensional Continuous Dynamic Programming. In: Proc. Fourth International Conference on Computer and Information Technology, pp. 33–38 (2004)
7. Itoh, Y., Kiyama, J., Oka, R.: A Proposal for a New Algorithm of Reference Interval-free Continuous DP for Real-time Speech or Text Retrieval. In: Proc. International Conference on Spoken Language Processing, vol. 1, pp. 486–489 (1996)
8. Kiyama, J., Itoh, Y., Oka, R.: Automatic Detection of Topic Boundaries and Keywords in Arbitrary Speech Using Incremental Reference Interval-free Continuous DP. In: Proc. International Conference on Spoken Language Processing, vol. 3, pp. 1946–1949 (1996)
9. Nishimura, T., Kojima, H., Itoh, Y., Held, A., Nozaki, S., Nagaya, S., Oka, R.: Effect of Time-spatial Size of Motion Image for Localization by using the Spotting Method. In: Proc. 13th International Conference on Pattern Recognition, pp. 191–195 (1996)
10. Nishimura, T., Sogo, T., Ogi, S., Oka, R., Ishiguro, H.: Recognition of Human Motion Behaviors Using View-Based Aspect Model Based on Motion Change. IEICE Transactions on Information and Systems J84-D 2(10), 2212–2223 (2001)
11. Itoh, Y.: Shift continuous DP: A fast matching algorithm between arbitrary parts of two time-sequence data sets. Systems and Computers in Japan 36(10), 43–53 (2005)
12. Jain, A.K., Bolle, R., Pankanti, S.: BIOMETRICS: Personal Identification in Networked Society
13. Proenca, H., Alexandre, L.A.: UBIRIS: A Noisy Iris Image Database. In: Roli, F., Vitulano, S. (eds.) ICIAP 2005. LNCS, vol. 3617, pp. 970–977. Springer, Heidelberg (2005)
14. CASIA-IrisV3, http://www.cbsr.ia.ac.cn/IrisDatabase.htm
Biometric System Verification Close to "Real World" Conditions
Aythami Morales1, Miguel Ángel Ferrer1, Marcos Faundez2, Joan Fàbregas2, Guillermo Gonzalez3, Javier Garrido3, Ricardo Ribalda3, Javier Ortega4, and Manuel Freire4
1 GPDS, Universidad de Las Palmas de Gran Canaria
2 Escola Universitària Politècnica de Mataró (Adscrita a la UPC)
3 HCTLab, Universidad Autónoma de Madrid
4 ATVS, Universidad Autónoma de Madrid
[email protected]
Abstract. In this paper we present an autonomous biometric device developed in the framework of a national project. This system is able to capture speech, hand-geometry, on-line signature and face, and can open a door when the user is positively verified. Nevertheless, the main purpose is to acquire a database without supervision (normal databases are collected in the presence of a supervisor who tells the user what to do in front of the device, which is an unrealistic situation). This system allows us to explain the main differences between what we call "real conditions" and "laboratory conditions". Keywords: Biometrics, hand-geometry verification, contact-less, on-line signature verification, face verification, speech verification.
1 Introduction

Biometric system developments are usually achieved by means of experimentation with existing biometric databases, such as the ones described in [1]. System performance is usually measured using the identification rate (percentage of users whose identity is correctly assigned) and verification errors: the False Acceptance Rate (FAR, percentage of impostors permitted to enter the system), the False Rejection Rate (FRR, percentage of genuine users whose access is denied) and combinations of these two basic ratios, such as the Equal Error Rate (EER, the adjusting point where FAR = FRR) and the Detection Cost Function (DCF) [2]. A serious problem in system comparison is that, most of the time, the experimental conditions of different experiments performed by different teams are not straightforwardly comparable. In order to illustrate this problem, let us look at a simple example from the motoring sector. Imagine two cars with the fuel consumption depicted in Table 1. According to this table, looking at the distance (which is equal in both cases) and the speed (which is also equal), we could conclude that car number 1 is more efficient. Nevertheless, if we look at Figure 1, we realize that the experimental conditions are very different and, in fact, nothing can be concluded. This is an unfair comparison. It is well known that car makers cannot do that: slope, wind, etc., must be carefully controlled and are not up to the car maker. Nevertheless, the situation is not the same in
biometrics, because there is no "standard" database to measure performance. Each manufacturer can use its own database. This can lead to unfair comparisons, as we explain next. We will assume that the training and testing of a given biometric system are done using different training and testing samples, because this is the situation in real operating systems in normal life. Otherwise, this is known as "testing on the training set": the test scores are obtained using the training data, which is an optimal and unrealistic situation. This is a trivial problem where the system only needs to memorize the samples, and the generalization capability is not evaluated. The comparison of different biometric systems is then quite straightforward: if a given system shows a higher identification rate and a lower verification error than its competitor, it will be considered better.

Table 1. Toy example for car fuel consumption comparison
                    Car 1        Car 2
Distance            100 km       100 km
Speed               100 km/h     100 km/h
Fuel consumption    8 liters     12 liters
Fig. 1. Experimental conditions corresponding to Table 1: car 1 covers its 100 km on a flat road (8 liters), while car 2 covers its 100 km on a slope (12 liters)
Nevertheless, there is a set of facts that must be considered, because they can lead to a wrong conclusion. We will describe these situations in the next sections.

A. Comparison of results obtained with different databases
When comparing two biometric systems performing over different databases, it must be taken into account that one database can be more trivial than the other. For instance, it is not equally difficult to identify people in the ORL database [3] (which contains 40 people) and in the FERET database [4] (around 1000 people). For a study of this situation, see [5]. Thus, as a conclusion, a given system A performing better on database DB1 is not necessarily better than another system B performing worse on database DB2, because the comparison can be unfair.
B. Comparison of results obtained with the same database
When comparing two biometric systems performing over the same database, and following the same protocol (same samples for training both competing systems and the remaining samples for testing), it seems that the comparison is fair. In fact it is, but there is a problem: how can you be sure that these results will hold up when using a different database? Certainly you cannot. For this reason, researchers usually test their systems with different databases acquired by different laboratories. In the automobile example, you would probably measure the fuel consumption in several situations (urban, highway, different speeds, etc.), because one car can be more efficient in a particular scenario but worse in a different one. Of course, the car must be the same in all the scenarios; it would be unfair to tune the car design before each test (one design for the urban route, one for the rural route, another for the highway, etc.). Which comparison is the system seller interested in? Probably the most favorable one for his/her product. Which comparison are we (the buyers) interested in? Obviously, the best characterization of a biometric system is the one achieved with a fully operating system, where users interact with the biometric system in a "normal" and "real" way, for instance in a door-opening system such as the one described in [6-7]. In this paper we want to emphasize the main differences between databases collected under "real conditions" as opposed to "laboratory conditions". This is a milestone towards producing applications able to work in civilian scenarios. The next sections summarize the main differences between our proposed approach and classical approaches.

1.1 Classic Design (Step 1)

Biometric system design implies the availability of some biometric data to train the classifiers and test the results. Figure 2, on the left, summarizes the flow chart of the procedure, which consists of the following steps:

1. A database is acquired in laboratory conditions. There is a human supervisor who tells the user what to do. Alternatively, in some cases, programs exist for creating synthetic databases, such as SFINGE [8] for fingerprints; another example is the software Faces 4.0 [10] for synthetic face generation. Nevertheless, synthetic samples have limited validity for training classifiers that are applied to classify real data.
2. After database acquisition, a subset of the available samples is used for training a classifier, user model, etc. The algorithm is tested and tuned using some other samples of the database (testing subset).
3. The developed system jumps from the laboratory to real-world operation (physical access, web access, etc.).

This procedure is certainly useful for developing a biometric system, for comparing several different algorithms under the same training and testing conditions, etc., but it suffers from a set of drawbacks, such as:

a) In real-world conditions the system will be autonomous.
b) Laboratory databases have removed the samples with low quality, because if the human supervisor detects a noisy speech recording, a blurred face image, etc., he/she will discard the sample and ask the user for a new one.
Fig. 2. Classic design (on the left) versus proposed approach (on the right)
c) Database acquisition with a human supervisor is a time-consuming task.
d) Real systems must manage a heterogeneous number of samples per user. Laboratory system developments will probably ignore this situation and will thus provide suboptimal performance, due to the mismatch between the conditions present during development and those of normal operation.

1.2 Proposed Approach (Step 2)

A more sophisticated approach involves two main steps (see Figure 2, on the right). The operation can be summarized as follows:

1. Based on algorithms developed under the "classical approach", a physical access control system is operated.
2. Simultaneously with system operation, the acquired biometric samples are stored in a database.

This procedure provides the following characteristics:
a) In general, the number of samples per user and the time interval between acquisitions will be different for each user. While this can be seen as a drawback, it is in fact a chance to develop algorithms in conditions similar to the "real world", where the users' accesses are not necessarily regular.
b) While supervised databases contain a limited number of recording sessions, this approach makes it possible to obtain, in an easy way, a long-term evolution database.
c) Biometric samples must be checked and labeled a posteriori, while this task is easier in supervised acquisitions.
d) While incorrect (noisy, blurred, etc.) samples are discarded in supervised databases, they are of great interest when trying to program an application able to manage the Failure-to-Acquire rate. In addition, these bad-quality samples are obtained in a realistic situation that can hardly be reproduced in laboratory conditions.
Fig. 3. Multimodal interface for biometric database acquisition (hand-geometry, speech, face and on-line signature). Frontal view (top).
Fig. 4. Physical installation (at EUPMt) in a wall for door opening system
2 Multimodal Interface for Biometric Recognition and Database Acquisition

In this section we present a multimodal device specially designed to acquire speech, on-line signature, hand-geometry and face. The system is prepared for four biometric traits; the acquisition protocol asks the user to provide his/her identity and two biometric traits (randomly selected). If both biometric traits are positively identified, the user is declared "genuine"; otherwise, a third biometric trait is checked. The core of this system is a Hewlett-Packard notebook with a touch screen (suitable for on-line signature acquisition). The technological solutions behind each biometric trait are DCT-NN [9] for face recognition, SVM for hand-geometry, HMM for signature and GMM for speaker recognition. Figure 4 shows a physical installation in a wall for a door-opening system.
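One plausible reading of this acquisition protocol is sketched below; the paper does not spell out the exact decision rule when the first two traits disagree, so the majority vote and all names here are assumptions.

import random

def verify_user(claimed_id, matchers):
    # matchers maps a trait name (speech, hand, signature, face) to a function
    # that returns True when the claimed identity is positively verified
    traits = list(matchers)
    first_two = random.sample(traits, 2)              # two randomly selected traits
    results = [matchers[t](claimed_id) for t in first_two]
    if all(results):
        return True                                   # both verified: declared genuine
    third = random.choice([t for t in traits if t not in first_two])
    results.append(matchers[third](claimed_id))       # a third trait is checked
    return sum(results) >= 2                          # assumed majority decision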
3 Real World: One Step Further from Laboratory Conditions

The goal of research should be to develop applications useful for daily usage. However, nowadays most research is performed in laboratory conditions, which are far from "real world" conditions. While these laboratory conditions are interesting and necessary in the first steps, it is important to jump from the laboratory to real-world conditions. This implies finding a solution for a large number of problems that never appear inside the laboratory. In conclusion, the goal is not a fine tuning that provides a very small error in laboratory conditions; the goal is a system able to generalize (to manage new samples not seen in the laboratory). It is important to emphasize that the classical Equal Error Rate (EER) criterion for biometric system adjustment implies that the verification threshold is set a posteriori (after knowing the whole set of test scores). While this is possible in laboratory conditions, it makes no sense in a real-world operating system. Thus, system performance measured by means of the EER offers limited utility. Table 2 shows the performance of the multimodal biometric system with two different set-up methods. During four months, 102 people (70 genuine users and 32 impostors) used the system, and more than 900 unsupervised accesses were achieved. In the laboratory set-up method, we process the database acquired in "real world" conditions using the set-up configuration obtained from previous laboratory-conditions experiments. In the "real world" method we use an a posteriori set-up configuration to obtain the lowest EER.

Table 2. "Real World" system performance with different set-up methods
Verification Method     FAR     FRR     EER
Laboratory set up       5.1%    15.3%   10.2%
"Real world" set up     2.5%    2.3%    2.4%
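The difference between an a posteriori EER set-up and a fixed laboratory threshold can be illustrated with the following sketch, assuming arrays of genuine and impostor scores where higher scores mean a better match (the score convention is an assumption).

import numpy as np

def far_frr(genuine, impostor, thr):
    far = float(np.mean(np.asarray(impostor) >= thr))   # impostors accepted
    frr = float(np.mean(np.asarray(genuine) < thr))     # genuine users rejected
    return far, frr

def eer_threshold(genuine, impostor):
    # a posteriori threshold where FAR and FRR cross; in a deployed system
    # the threshold has to be fixed beforehand, e.g. from laboratory data
    thrs = np.unique(np.concatenate([genuine, impostor]))
    rates = [far_frr(genuine, impostor, t) for t in thrs]
    i = int(np.argmin([abs(far - frr) for far, frr in rates]))
    return thrs[i], rates[i]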
Figure 5 shows an example of the difference between the laboratory set-up and the "real world" set-up. In this example we use the hand-geometry classifier threshold versus the FAR and FRR curves. On the left we use a 30-person database obtained in laboratory conditions; the best EER is obtained for a hand-geometry classifier threshold of -0.06. Working with the "real world" database used for Table 2, we observe an optimum threshold of -0.33.
Fig. 5. Hand Geometry Classifier Threshold versus FRR and FAR, Laboratory Database (on the left) “Real World” database (on the right)
In Figure 6 we divide the "real world" database into two different databases of the same length. We obtain similar thresholds in both databases. In this case, we can use database 1 to obtain the set-up of the system.
Fig. 6. Hand Geometry Classifier Threshold versus FRR and FAR, “Real World” database 1 (on the left) “Real World” database 2 (on the right)
4 Conclusions

In this paper we have presented a multimodal interface for biometric database acquisition. This system makes feasible the acquisition of four different biometric traits: hand-geometry, voice, on-line signature and still face image. The results obtained
using the laboratory set-up in a "real world" system show that we are far from the best set-up options. Using set-up information obtained from laboratory-conditions experiments in "real world" systems may not be advisable. In this paper we have emphasized the convenience of unsupervised database acquisition.

Acknowledgments. This work has been supported by FEDER and MEC, TEC2006-13141-C03/TCM, and COST-2102.
References
1. Faundez-Zanuy, M., Fierrez-Aguilar, J., Ortega-Garcia, J., Gonzalez-Rodriguez, J.: Multimodal biometric databases: an overview. IEEE Aerospace and Electronic Systems Magazine 21(9), 29–37 (2006)
2. Martin, A., Doddington, G., Kamm, T., Ordowski, M., Przybocki, M.: The DET curve in assessment of detection performance. In: European Speech Processing Conference Eurospeech 1997, vol. 4, pp. 1895–1898 (1997)
3. Samaria, F., Harter, A.: Parameterization of a stochastic model for human face identification. In: 2nd IEEE Workshop on Applications of Computer Vision, Sarasota, Florida (December 1994)
4. Color FERET Facial Image Database, Image Group, Information Access Division, ITL, National Institute of Standards and Technology (October 2003)
5. Roure-Alcobé, J., Faundez-Zanuy, M.: Face recognition with small and large size databases. In: IEEE 39th International Carnahan Conference on Security Technology ICCST 2005, Las Palmas de Gran Canaria, October 2005, pp. 153–156 (2005)
6. Faundez-Zanuy, M.: Door-opening system using a low-cost fingerprint scanner and a PC. IEEE Aerospace and Electronic Systems Magazine 19(8), 23–26 (2004)
7. Faundez-Zanuy, M., Fabregas, J.: Testing report of a fingerprint-based door-opening system. IEEE Aerospace and Electronic Systems Magazine 20(6), 18–20 (2005)
8. http://biolab.csr.unibo.it/research.asp?organize=Activities&select=&selObj=12&pathSubj=111%7C%7C12&
9. Faundez-Zanuy, M., Roure-Alcobé, J., Espinosa-Duró, V., Ortega, J.A.: An efficient face verification method in a transformed domain. Pattern Recognition Letters 28(7), 854–858 (2007)
Developing HEO Human Emotions Ontology
Marco Grassi
Department of Biomedical, Electronic and Telecommunication Engineering, Università Politecnica delle Marche, Ancona, Italy
[email protected]
Abstract. A big issue in the task of annotating multimedia data about dialogs and the associated gestures and emotional states is the great variety of intrinsically heterogeneous metadata and the impossibility of standardizing the descriptors used, in particular for the emotional state of the subject. We propose to tackle this problem using the instruments and the vision offered by the Semantic Web, through the development of an ontology for human emotions that can be used in the annotation of emotion in multimedia data, supplying a structure that grants at the same time flexibility and interoperability, allowing an effective sharing of the encoded annotations between different users.
1 Introduction

A great research effort has been made in recent years in the field of multimodal communication, asserting that human language, gestures, gaze, facial expressions and emotions are not entities amenable to study in isolation. A cross-modal analysis of both the verbal and the non-verbal channel has to be carried out to capture all the relevant information involved in the communication, starting from different perspectives, ranging from advanced signal processing applications to psychological and linguistic analysis. When annotating dialogues and the associated gestures and emotional states in multimedia data, this means encoding a great variety of metadata (data about data) that are intrinsically heterogeneous, and making such metadata effectively sharable amongst different users. This represents a big issue, in particular for the emotional state of the subject, which represents a key feature in non-verbal communication, due to the impossibility of standardizing the descriptors used. Different kinds of media mean different kinds of extracted features. The approach used for feature extraction and emotion classification also strongly determines the kind of information to deal with. A first distinction is between manual and automatic systems. The difference is twofold, both in the kind of features used for the description and in the grain of the description. Human annotation uses all the nuances in description that are present in the literature while, at least at the moment, automatic recognition systems usually describe just a small number of different states. Even inside these two groups, a high variability exists in the description of an emotion. Different automatic recognition methods, even when related to the same medium, are interested in different features. On the other side, in the scientific community the debate over human
emotions is still open, and there is no common agreement about which features are the most relevant in the definition of an emotion, or about which are the relevant emotions and their names. Due to this great and intrinsic heterogeneity, it is impossible to define a standard and unique set of descriptors for human emotions; what is needed is a structure that can grant at the same time flexibility (the possibility to use a large set of descriptors suitable for every kind of description) and interoperability (the possibility to share understandable information between different users) [1]. We propose to tackle this problem using the instruments and the vision offered by the Semantic Web in the encoding of the metadata. This means not only encoding the information in a machine-processable language but also associating semantics with it, a well-defined and unambiguous meaning that allows the information to be interpreted univocally even between different users. To this purpose, we are working on the development of HEO (Human Emotion Ontology), a high-level ontology for human emotions, which supplies the most significant concepts and properties for the description of human emotions and which can be extended, according to the user's purpose, by defining lower-level concepts and properties related to more specific descriptions or by linking it to other more specialized ontologies. The purpose is to create a description framework that grants at the same time flexibility and interoperability and that can be used, for example, in a software application for video annotation, allowing an effective sharing of the encoded information between different users.
2 Semantic Web and Ontologies

The Semantic Web [2] is an initiative that aims at improving the current state of the World Wide Web. The main idea is to represent information in a proper way that also encodes semantics in a machine-processable format, and to use intelligent techniques to take advantage of these representations, allowing more powerful and cross-linked information retrieval, through semantic queries and data connections that are far more evolved than the ones that simply rely on the textual representation of the information. Implementing the Semantic Web requires adding semantic metadata, data that describes data, to describe information resources. To such purpose, the Semantic Web uses RDF (Resource Description Framework) [3], which provides a foundation for representing and processing metadata. Although often called a language, RDF is essentially a data model, whose basic building block is an object-attribute-value triple, called a statement. A resource is an object, a "thing" we want to talk about. Resources may be authors, books, publishers, places, people, hotels, rooms, search queries, and so on. Properties are a special kind of resources that describe relations between resources, for example "written by", "age", "title", and so on. Statements assert the properties of resources. Values can either be resources or literals (strings). In order to provide machine accessibility and machine processability, the RDF triples (x, P, y) can be represented in an XML syntax. RDF is a powerful language that lets users describe resources using their own vocabularies. However, RDF makes no assumptions about any particular application domain, nor does it define the semantics of any domain. To assure an effective sharing of the encoded information it is necessary to provide a shared understanding of a domain, to overcome differences in terminology. For this purpose the introduction of ontologies is necessary, and ontologies therefore play a fundamental role in the Semantic Web.
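As an illustration of the triple model, the statement "this book is written by these authors" could be encoded and serialised in the XML-based RDF syntax with the rdflib Python library (the library, the namespace and the resource URIs are illustrative choices, not part of the paper):

from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/terms/")           # illustrative vocabulary
g = Graph()
book = URIRef("http://example.org/books/semantic-web-primer")
g.add((book, EX.title, Literal("A Semantic Web Primer")))          # object-attribute-value triple
g.add((book, EX.writtenBy, Literal("Antoniou and van Harmelen")))  # a second statement
print(g.serialize(format="xml"))                      # RDF/XML representation of the triples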
Ontologies basically deal with knowledge representation and can be defined as formal explicit descriptions of concepts in a domain of discourse (named classes or concepts), of the properties of each concept describing various features and attributes of the concept (roles or properties), and of restrictions on properties (role restrictions) [4]. An ontology together with a set of individual instances of classes constitutes a knowledge base. Ontologies make possible the sharing of a common understanding of the structure of information among people or software agents. Once aggregated through an ontology, this information can be used to answer user queries or as input data to other applications. Ontologies allow the reuse of knowledge. This means that it is possible to start from an existing ontology and extend it for one's own purpose, or that, in building a large ontology, it is possible to integrate several existing ontologies describing portions of the large domain. In addition, through the definition of a taxonomical organization of concepts and properties, expressed in a hierarchical classification of super/sub concepts and properties, ontologies make reasoning possible. This means that, starting from the data and the additional information expressed in the form of an ontology, it is possible to infer new relationships between data. In the Semantic Web, different languages have been developed for writing ontologies, in particular RDF Schema and OWL. RDF Schema [5] can be seen as an RDF vocabulary and is a primitive ontology language. It offers certain modelling primitives with fixed meaning. Key concepts of RDF Schema are class, subclass relations, property, sub-property relations, and domain and range restrictions. OWL (Ontology Web Language) [6] is a language more specifically conceived for ontology creation. OWL builds upon RDF and RDF Schema: the XML-based RDF syntax is used, instances are defined using RDF descriptions, and most RDFS modelling primitives are used. However, OWL introduces a number of features that are missing in RDF Schema, like local scope of properties, disjointness of classes, Boolean combinations of classes (like union, intersection and complement), cardinality restrictions and special characteristics of properties (like transitive, unique or inverse). The OWL language provides three increasingly expressive sublanguages designed for use by specific communities of implementers and users. OWL Lite is intended for easy use and light computability. OWL DL offers maximum expressiveness without losing computational completeness (all entailments are guaranteed to be computed) and decidability (all computations will finish in finite time) of reasoning systems. OWL Full is meant for users who want maximum expressiveness and the syntactic freedom of RDF, with no computational guarantees. OWL DL in particular is based on description logics [7], a solid logical formalism for knowledge representation and reasoning.
3 Ontologies and Languages for the Description of Human Emotion

In the last years the study of emotions has drawn growing attention from different research fields, ranging from advanced signal processing to psychology and linguistics. It is commonly agreed among researchers in the emotion field that a small fixed set of archetypal emotion categories is too limiting to fully describe the wide variety of human emotions. The great research effort of the last years has led to
great advances in the understanding of emotions and to the development of many different models and classification techniques for them. As a result, a standardization of the knowledge about emotions is becoming more and more important but, at the same time, more difficult. At the state of the art, little work has been done on the development of complete and formal emotion ontologies, probably due to the difficulties mentioned above in standardizing what defines an emotion and how to encode it. Some interesting works exist in the field of virtual human animation. In [8], an ontology of emotional face expression profiles has been developed for virtual humans, starting from the facial animation concepts standardized in MPEG-4 (FAPs) and defining some relationships with emotions through expression profiles that utilize psychological models of emotions. Even if not standardized into an ontology, very interesting work on the definition of proper languages for emotion description has been done, in particular by the Humaine Project (http://emotion-research.net/) and by the W3C's Emotion Markup Language Incubator Group (http://www.w3.org/2005/Incubator/emotion/). Inside the Humaine Project, the EARL language (Emotion Annotation and Representation Language) [9] has been developed for the annotation of audio-video databases and used with the ANVIL software [11]. Even if not standardized into an ontology but expressed in XML Schema, EARL offers a powerful structure for emotion description. It makes it possible to specify the emotion categories, the dimensions, the intensity and even the appraisals, selecting the most appropriate case from a predefined list. Moreover, EARL includes elements to describe mixed emotions as well as regulation mechanisms, like for example the degree of simulation or suppression. The Emotion Markup Language Incubator Group of the W3C (World Wide Web Consortium) has recently published a first report on the preliminary work on the definition of EmotionML, a markup language for the description and annotation of emotion [10]. Even if it is not certain whether the group's activity will lead in the future to a W3C standardization, it represents the most complete and authoritative reference for emotion descriptors in a machine-interpretable language. Starting from 39 use cases in emotion annotation, emotion recognition and emotion generation, and investigating the existing markup languages (in particular EARL), the report proposes a collection of high-level descriptors for the encoding of information about emotions and their related components. The proposed language allows emotion to be described both by category and by dimension, together with the modality in which it is expressed (face, voice, body, text), and deals with appraisals, triggering events and action tendencies.
4 An Overview of HEO

The HEO ontology has been developed in the OWL language with a threefold purpose. It provides a standardization of the knowledge about emotion that can be useful in particular for people with little expertise in the emotion field. It allows the definition of a common vocabulary that can be used to describe emotion with fully machine-accessible semantics. Finally, it can be used for the creation of the menus of a multimedia annotation software application: the taxonomical organization of classes and properties defines a clear hierarchy of the descriptors, and the restrictions introduced for properties and allowed datatypes limit the set of possible instances, allowing in this way the structuring of the annotation menus and submenus.
Fig. 1. Graphical representation of HEO’s main classes and properties
As mentioned above, developing an ontology means defining a set of classes and properties describing a knowledge domain and expressing their relationships. HEO introduces the basic class Emotion and a set of related classes and properties expressing the high-level descriptors of emotions. They represent the general structure for describing emotion, which can be refined by introducing subclasses and subproperties representative of the specific model used in the description, and can be extended by linking other ontologies. In the definition of such classes and properties we rely on the main descriptors introduced by the W3C Emotion Incubator Group. Figure 1 shows a representative graph of the most relevant classes and properties of the ontology. An emotion can be described both in a discrete way, by using the property hasCategory to classify the category of the emotion, and in a dimensional way, by using the property hasDimension. In describing the emotion's category it is possible to refer to the 6 archetypal emotions (anger, disgust, fear, joy, sadness, surprise) introduced by Ekman [12] (using the ArchetypalCategory class), particularly used for automatic emotion classification through face expression recognition, or to the wider set of 48 categories defined by Douglas-Cowie et al. (using the DouglasCowieCategory class) [13]. When dealing with a discrete classification of the emotion, it is necessary to supply a measure of the intensity of the emotion; this can be done using the hasIntensity property of the Category class, with values ranging between [0,1]. In the literature there exist different sets of dimensions for the representation of an emotion. A commonly used set of emotion dimensions is the arousal, valence, dominance
set, which is known in the literature also by different names, including "evaluation, activation and power" and "pleasure, arousal, dominance". For this reason we introduced the ArousalValenceDominance subclass of the Dimension class, which has the properties hasArousal, hasValence and hasDominance. To overcome the different terminology, these properties have been mapped to the hasEvaluation, hasActivation and hasPower properties, by stating in OWL that they are equivalent object properties. Emotion-related features have also been introduced into the ontology. The appraisal, the evaluation process leading to an emotional response, can be described using the hasAppraisal property. The appraisal set proposed by Scherer [14] can be used by defining the properties novelty, intrinsicPleasantness, goal-needSignificance, copingPotential and norm-selfSignificance of the subclass SchererAppraisal, which can assume values from -1 to 1 to describe positive or negative values of the properties. Action tendencies also play an important role in the description of an emotion, because emotions have a strong influence on the motivational state of a subject: for example, anger is linked to the tendency to attack, while fear is linked to the tendency to flee or freeze. Action tendencies can be viewed as a link between the outcome of an appraisal process and actual actions. The model by Frijda [15] can be used to describe the action tendencies through the ActionTendency's subclass FrijdaActionTendency and its properties approach, avoid, being-with, attending, rejecting, non-attending, agonistic, interrupting, dominating and submitting, with values ranging between [0,1]. The description of emotions rarely deals with full-blown emotions; usually emotions are regulated in the emotion generation process, and for this reason HEO uses the hasRegulation property. A description of the regulation process can be supplied by describing how far the emotion is genuine, masked or simulated, through the class Regulation and its properties genuine, masked and simulated, with values ranging between [0,1]. Emotions can also be expressed through different channels, like face, voice, gesture and text. To this purpose HEO introduces the hasModality property and the subclasses Face, Voice, Gesture and Text of the Modality class. Such subclasses can be further refined by the introduction of specific ontologies. An important parameter that has to be introduced in the annotation is the confidence of the annotation itself. Such a parameter should be associated separately with each of the descriptors of emotions and related states. This can be done by defining a superclass EmotionDescriptor for every class describing the emotion (Category, Dimension, Appraisal, etc.) with the hasConfidence property, whose values range between [0,1]. Important information should also be supplied about the person affected by the emotion. To this purpose we propose to reuse the FOAF (Friend Of A Friend) ontology (http://www.foaf-project.org/), an ontology commonly used to describe persons, their activities and their relations to other people and objects (figure 2). Such an ontology presents a wide set of descriptors for information related to persons and in particular for their Web resources, like firstName, familyName, title, gender, mbox (email) and weblog.
Such descriptors can be extended by adding to the Person main class other properties that are relevant for emotion annotation, like age, culture, language, education and personality traits, and by defining the ObservedPerson subclass of the class Person, which can be connected with HEO's Emotion class through the property affectPerson. For the observed person, information should be supplied about the subject with whom he/she interacts (using the interactWith property of the
Fig. 2. Linking HEO to FOAF ontology
ObservedPerson subclass), who could be another person or a device. In the former case it could be relevant to supply information about the kind of relationship that exists between the persons, as for example the degree of kinship or friendship, or a working relationship. In the latter case, information about the device should be supplied. Information can also be supplied about who is performing the annotation, using the isAnnotatedBy property of the Emotion class, in particular whether it is made by a human or by a machine. In the former case, the descriptors added to the Person class could be relevant to analyze how emotions are perceived differently by a person, for example according to their culture. Another interesting piece of information concerns the annotator's experience. To such purpose we propose the definition of a HumanAnnotator subclass of the Person class, with a hasAnnotationExperience property whose value ranges between [0, 1].
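To give a flavour of how an annotation built on HEO and FOAF might look, the following sketch encodes one emotion instance with the rdflib Python library; the HEO namespace URI, the literal values and the use of rdfs:label are placeholders, since the paper does not publish the ontology's actual URIs.

from rdflib import BNode, Graph, Literal, Namespace
from rdflib.namespace import FOAF, RDF, RDFS

HEO = Namespace("http://example.org/heo#")            # placeholder namespace
g = Graph()

person = BNode()
g.add((person, RDF.type, HEO.ObservedPerson))
g.add((person, FOAF.firstName, Literal("Anna")))      # FOAF descriptors reused as in Fig. 2

category = BNode()
g.add((category, RDF.type, HEO.ArchetypalCategory))
g.add((category, RDFS.label, Literal("joy")))
g.add((category, HEO.hasIntensity, Literal(0.7)))
g.add((category, HEO.hasConfidence, Literal(0.9)))    # confidence attached per descriptor

dimension = BNode()
g.add((dimension, RDF.type, HEO.ArousalValenceDominance))
g.add((dimension, HEO.hasArousal, Literal(0.6)))
g.add((dimension, HEO.hasValence, Literal(0.8)))

emotion = BNode()
g.add((emotion, RDF.type, HEO.Emotion))
g.add((emotion, HEO.affectPerson, person))
g.add((emotion, HEO.hasCategory, category))
g.add((emotion, HEO.hasDimension, dimension))
print(g.serialize(format="turtle"))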
5 Conclusions and Future Efforts

In this paper we have presented the HEO ontology that we are currently developing, describing its main classes and properties. The present ontology shall undergo a revision process by experts in the field of emotion annotation, and more classes and properties will be added. Further efforts will also be focused on linking HEO with other ontologies, for example with the existing MPEG-7 ontologies, which could be used to encode information about the multimedia content in which the emotion occurs. We are currently carrying out a survey of the existing software applications for multimedia annotation to discover their main features and requirements, before proceeding to the development of a web application for semantic multimedia annotation. Such an application should allow the annotation of generic knowledge domains by importing descriptors from ontology files, and should encode the annotations in the form of RDF statements, in order to grant an intelligent management of the acquired information through advanced queries and data connections.
References
1. Scientific Description of Emotion - W3C Incubator Group Report, July 10 (2007), http://www.w3.org/2005/Incubator/emotion/XGR-emotion/#ScientificDescriptions
2. Antoniou, G., Van Harmelen, F.: A Semantic Web Primer. MIT Press Cooperative Information Systems (2004)
3. Manola, F., Miller, E. (eds.): RDF Primer. W3C Recommendation, February 10 (2004), http://www.w3.org/TR/REC-rdf-syntax
4. Kompatsiaris, Y., Hobson, P. (eds.): Semantic Multimedia and Ontologies: Theory and Applications. Springer, Heidelberg (2008)
5. Brickley, D., Guha, R.V. (eds.): RDF Vocabulary Description Language 1.0: RDF Schema. W3C Recommendation, February 10 (2004), http://www.w3.org/TR/rdf-schema/
6. McGuinness, D.L., van Harmelen, F.: OWL Web Ontology Language Overview. W3C Recommendation, February 10 (2004), http://www.w3.org/TR/owl-features/
7. Baader, F., et al.: The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, Cambridge (2003)
8. Garcia-Rojas, A., Raouzaiou, A., Vexo, F., Karpouzis, K., Thalmann, D., Moccozet, L., Kollias, S.: Emotional face expression profiles supported by virtual human ontology. Computer Animation and Virtual Worlds Journal, 259–269 (2006)
9. Schröder, M., Pirker, H., Lamolle, M.: First suggestions for an emotion annotation and representation language. In: Proceedings of LREC 2006 Workshop on Corpora for Research on Emotion and Affect, Genoa, Italy, pp. 88–92 (2006)
10. Elements of an EmotionML 1.0 - W3C Incubator Group Report, November 20 (2008), http://www.w3.org/2005/Incubator/emotion/XGR-emotionml-20081120
11. Kipp, M.: Spatiotemporal Coding in ANVIL. In: Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008 (2008)
12. Ekman, P.: The Face Revealed. Weidenfeld & Nicolson, London (2003)
13. Douglas-Cowie, E., et al.: HUMAINE deliverable D5g: Mid Term Report on Database Exemplar Progress (2006), http://emotion-research.net/deliverables/D5g%20final.pdf
14. Scherer, K.R., Shorr, A., Johnstone, T. (eds.): Appraisal Processes in Emotion: Theory, Methods, Research. Oxford University Press (2001)
15. Frijda, N.: The Emotions. Cambridge University Press, Cambridge (1986)
Common Sense Computing: From the Society of Mind to Digital Intuition and beyond
Erik Cambria1, Amir Hussain1, Catherine Havasi2, and Chris Eckl3
1 Dept. of Computing Science and Maths, University of Stirling, Scotland, UK
2 MIT Media Lab, MIT, Massachusetts, USA
3 Sitekit Labs, Sitekit Solutions Ltd, Scotland, UK
{eca,ahu}@cs.stir.ac.uk, [email protected], [email protected]
http://cs.stir.ac.uk/~eca/commonsense
Abstract. What is Common Sense Computing? And why is it so important for the technological evolution of humankind? This paper presents an overview of past, present and future efforts of the AI community to give computers the capacity for Common Sense reasoning, from Minsky's Society of Mind to the Media Laboratory's Digital Intuition theory, and beyond. Is it actually possible to build a machine with Common Sense, or is it just a utopia? This is the question this paper tries to answer. Keywords: AI, Semantic networks, NLP, Knowledge base management.
1 Introduction
"What magical trick makes us intelligent?", Marvin Minsky was wondering more than two decades ago. "The trick is that there is no trick. The power of intelligence stems from our vast diversity, not from any single, perfect principle" [1]. The human brain is in fact a very complex system, maybe the most complex in nature. The functions it performs are the product of thousands and thousands of different subsystems working together at the same time. Such a perfect system is very hard to emulate: nowadays there are plenty of expert systems around, but none of them is actually intelligent; they just have the veneer of intelligence. The aim of Common Sense Computing is to teach computers the things we all know about the world and to give them the capacity to reason about these things.
2 The Importance of Common Sense
Communication is one of the most important aspects of human life. Communicating always has a cost in terms of energy and time, since information needs to be encoded, transmitted and decoded. This is why people, when communicating with each other, provide just the useful information and take the rest for granted. This 'taken for granted' information is what we call Common Sense: obvious things people normally know and usually leave unstated.
We are not talking about the kind of knowledge you can find in Wikipedia, but about all the basic relationships among words, concepts, phrases and thoughts that allow people to communicate with each other and face everyday life problems. It is a kind of knowledge that sounds obvious and natural to us, but it is actually daedal and multifaceted: the illusion of simplicity comes from the fact that, as each new group of skills matures, we build more layers on top of them and tend to forget about the previous layers. Today computers lack this kind of knowledge. They do only what they are programmed to do: they have only one way to deal with a problem and, if something goes wrong, they get stuck. This is why we now have programs that exceed the capabilities of world experts, yet not one that can do what a three-year-old child can do. Machines can only do logical things, but meaning is an intuitive process: it cannot be reduced to zeros and ones. To help us work, computers must get to know what our jobs are. To entertain us, they need to know what we like. To take care of us, they have to know how we feel. To understand us, they must think as we think. We need to transmit to computers our Common Sense knowledge of the world, because soon there will not be enough human workers to perform the necessary tasks for our rapidly aging population. To face this AI emergency, we will have to give them physical knowledge of how objects behave, social knowledge of how people interact, sensory knowledge of how things look and taste, psychological knowledge about the way people think, and so on. But having a database of millions of Common Sense facts will not be enough: we will also have to teach computers how to handle this knowledge, retrieve it when necessary, and learn from experience; in a word, we will have to give them the capacity for Common Sense reasoning.
3 The Birth of Common Sense Computing
It is not easy to state when exactly Common Sense Computing was born. Before Minsky, many AI researchers had started to think about the implementation of a Common Sense reasoning machine: the very first one was perhaps Alan Turing when, in 1950, he first raised the question "can machines think?". But he never managed to answer that question; he just provided a method to gauge artificial intelligence: the famous Turing test.
3.1 The Advice Taker
The notion of Common Sense in AI actually dates back to 1958, when John McCarthy proposed the 'advice taker' [2], a program meant to automatically deduce for itself a sufficiently wide class of immediate consequences of anything it was told and of what it already knew. McCarthy stressed the importance of finding a proper method of representing expressions in the computer and developed the idea of creating a property list for each object, listing the specific things people usually know about it. It was the first attempt to build a Common Sense knowledge base but, more importantly, it was the epiphany of the need for Common Sense in order to move forward in the technological evolution.
3.2 The Society of Mind Theory of Human Cognition
While McCarthy was more concerned with establishing logical and mathematical foundations for Common Sense reasoning, Minsky was more involved with theories of how we actually reason using pattern recognition and analogy. These theories were organized in 1986 with the publication of The Society of Mind, a masterpiece of AI literature containing an illuminating vision of how the human brain works. Minsky sees the mind as made of many little parts called 'agents', each mindless by itself but able to lead to true intelligence when working together. These groups of agents, called 'agencies', are responsible for performing some type of function, such as remembering, comparing, generalizing, exemplifying, analogizing, simplifying, predicting, and so on. The most common agents are the so-called 'K-lines', whose task is simply to activate other agents: this is a very important issue, since agents are all highly interconnected and activating a K-line can cause a significant cascade of effects. To Minsky, in fact, mental activity ultimately consists in turning individual agents on and off: at any time only some agents are active, and their combined activity constitutes the 'total state' of the mind. K-lines are a very simple but powerful mechanism, since they allow entering a particular configuration of agents that formed a useful society in a past situation: this is how we build and retrieve problem-solving strategies in our mind, and this is how we should develop problem-solving strategies in our programs.
4 Towards Programs with Common Sense
Minsky's theory was welcomed with great enthusiasm by the AI community and gave birth to many attempts to build Common Sense knowledge bases and exploit them to develop intelligent systems, e.g., Cyc and WordNet.
4.1 The Cyc Project
Cyc [3] is one of the first attempts to assemble a massive knowledge base spanning human Common Sense knowledge. Initially started by Doug Lenat in 1984, this project relies on knowledge engineers who handcraft assertions and place them into a logical framework using CycL, Cyc's proprietary language. Cyc's knowledge is represented redundantly at two levels: a frame language representation (the epistemological level), adopted for its efficiency, and a predicate calculus representation (the heuristic level), needed for its expressive power to represent constraints. While the first level keeps a copy of the facts in the uniform user language, the second level keeps its own copy in different languages and data structures suitable for being manipulated by specialized inference engines. Knowledge in Cyc is also organized into 'microtheories', resembling Minsky's agencies, each with its own knowledge representation scheme and set of assumptions.
4.2 WordNet
Begun in 1985 at Princeton University, WordNet [4] is a database of words, primarily nouns, verbs and adjectives. It has been one of the most widely used
resources in computational linguistics and text analysis because of the ease of interfacing it with any kind of application and system. The smallest unit in WordNet is the word/sense pair, identified by a 'sense key'. Word/sense pairs are linked by a small set of semantic relations such as synonyms, antonyms, is-a superclasses, and other relations such as part-of. Each synonym set, in particular, is called a synset: it consists of the representation of a concept, often explained through a brief gloss, and represents the basic building block for hierarchies and other conceptual structures in WordNet.
5 From Logic-Based to Common Sense Reasoning
Using logic-based reasoning can solve some problems in computer programming. However, most real-world problems need methods that are better at making decisions based on previous experience with examples, or at generalizing from the types of explanations that have worked well on similar problems in the past. In building intelligent systems we have to try to reproduce our way of thinking: we turn ideas around in our mind to examine them from different perspectives until we find one that works for us. Since computers appeared, our approach to solving a problem has always consisted in first looking for the best way to represent the problem, then looking for the best way to represent the knowledge needed to solve it, and finally looking for the best procedure for solving it. This problem-solving approach is good when we have to deal with a specific problem, but there is something basically wrong with it: it leads us to write only specialized programs that cope with solving only that kind of problem.
5.1 The Open Mind Common Sense Project
Initially born from an idea of David Stork, the Open Mind Common Sense (OMCS) project [5] is a kind of second-generation Common Sense database: knowledge is represented in natural language, rather than using a formal logical structure, and information is not handcrafted by expert engineers but spontaneously contributed by online volunteers. The reason why Lenat decided to develop an ad-hoc language for Cyc is that vagueness and ambiguity pervade English, and computer reasoning systems generally require knowledge to be expressed accurately and precisely. But, as expressed in the Society of Mind, ambiguity is unavoidable when trying to represent the Common Sense world. No single argument is always completely reliable, but if we combine multiple types of arguments we can improve the robustness of reasoning, just as we can improve a table's stability by providing it with many small legs in place of just one very big leg. This way information is not only more reliable but also more robust: if a piece of information is lost, we can still access the whole meaning, exactly as the table keeps standing if we cut off one of the small legs. Diversity is in fact the key to the OMCS project's success: the problem is not choosing one representation over another, but finding a way for them to work together in one system.
5.2 Acquiring Common Sense by Analogy
In 2003 Timothy Chklovski introduced the cumulative analogy method [6]: a class of analogy-based reasoning algorithms that leverage existing knowledge to pose knowledge acquisition questions to the volunteer contributors. Chklovski's Learner first determines which other topics are similar to the topic the user is currently inserting knowledge for, and then uses cumulative analogy to generate and present new specific questions about this topic. Because each statement consists of an object and a property, the entire knowledge repository can be visualized as a large matrix, with every known object of some statement being a row and every known property being a column. Cumulative analogy is performed by first selecting a set of nearest neighbors, in terms of similarity, of the treated concept and then by projecting known properties of this set onto the unknown properties of the concept, presenting them as questions. The replies to the knowledge acquisition questions formulated by analogy are immediately added to the knowledge repository, affecting the similarity calculations.
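To make the procedure concrete, the following sketch shows one possible realization of cumulative analogy over such an object-property matrix. The class and method names are ours, and the similarity measure (simple property overlap) is an assumption for illustration; this is not a description of Chklovski's actual Learner code.

import java.util.*;

/** Schematic sketch of cumulative analogy over an object-property matrix. */
public class CumulativeAnalogy {

    /** knowledge.get(object).get(property) = assertion weight */
    private final Map<String, Map<String, Integer>> knowledge;

    public CumulativeAnalogy(Map<String, Map<String, Integer>> knowledge) {
        this.knowledge = knowledge;
    }

    /** Similarity = number of properties two objects share (a simple overlap measure). */
    private int similarity(String a, String b) {
        Map<String, Integer> pa = knowledge.getOrDefault(a, Collections.emptyMap());
        Map<String, Integer> pb = knowledge.getOrDefault(b, Collections.emptyMap());
        int shared = 0;
        for (String p : pa.keySet()) if (pb.containsKey(p)) shared++;
        return shared;
    }

    /** Proposes properties of the k nearest neighbours that the topic does not yet have. */
    public List<String> proposeQuestions(String topic, int k) {
        // rank all other objects by similarity to the topic and keep the k nearest
        List<String> neighbours = new ArrayList<>(knowledge.keySet());
        neighbours.remove(topic);
        neighbours.sort(Comparator.comparingInt((String o) -> similarity(topic, o)).reversed());
        neighbours = neighbours.subList(0, Math.min(k, neighbours.size()));

        // count how many neighbours assert each property unknown for the topic
        Map<String, Integer> votes = new HashMap<>();
        Map<String, Integer> known = knowledge.getOrDefault(topic, Collections.emptyMap());
        for (String n : neighbours)
            for (String p : knowledge.get(n).keySet())
                if (!known.containsKey(p)) votes.merge(p, 1, Integer::sum);

        // properties supported by more neighbours are asked about first
        List<String> candidates = new ArrayList<>(votes.keySet());
        candidates.sort(Comparator.comparingInt(votes::get).reversed());
        return candidates;   // e.g. rendered as "Is it true that <topic> <property>?"
    }
}

The returned candidates correspond to the knowledge acquisition questions posed to contributors; the answers would be written back into the matrix, changing the similarity ranking for subsequent questions.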
5.3 ConceptNet
In 2004 Hugo Liu and Push Singh refined the ideas of the OMCS project in ConceptNet [7], a semantic resource structurally similar to WordNet, but whose scope of contents is general world knowledge in the same vein as Cyc. While WordNet is optimised for lexical categorisation and Cyc is optimised for formalised logical reasoning, ConceptNet is optimised for making practical context-based inferences over real-world texts. In ConceptNet, WordNet's notion of a node in the semantic network is extended from purely lexical items (words and simple phrases with atomic meaning) to include higher-order compound concepts, e.g. 'satisfy hunger' or 'follow recipe', in order to represent knowledge around a greater range of concepts found in everyday life. Most of the facts interrelating ConceptNet's semantic network are in fact dedicated to making rather generic connections between concepts. This type of knowledge can be traced back to Minsky's K-lines, as it increases the connectivity of the semantic network and makes it more likely that concepts parsed out of a text document can be mapped into ConceptNet. In ConceptNet version 2.0 a new system for weighting knowledge was implemented, which scores each binary assertion based on how many times it was uttered in the OMCS corpus and on how well it can be inferred indirectly from other facts. In ConceptNet version 3.0 [8] users can also participate in the process of refining knowledge by evaluating existing statements.
5.4 Digital Intuition
The best way to solve a problem is to already know a solution for it, but if we have to face a problem we have never met before, we need to use our 'intuition'. Intuition can be explained as the process of making analogies between the current problem and the ones solved in the past in order to find a suitable solution. Minsky
attributes this property to the so-called 'difference-engines', a particular kind of agent that recognizes differences between the current state and the desired state and acts to reduce each difference by invoking K-lines that turn on suitable solution methods. To emulate this 'reasoning by analogy' we use AnalogySpace [9], a process which generalizes Chklovski's cumulative analogy. In this process, ConceptNet is first mapped into a sparse matrix and then truncated Singular Value Decomposition (SVD) is applied over it to reduce its dimensionality and capture the most important correlations. The entries in the resulting matrix are numbers representing the reliability of the assertions, and their magnitude increases logarithmically with the confidence score. Applying SVD on this matrix causes it to describe other features that could apply to known concepts by analogy: if a concept in the matrix has no value specified for a feature owned by many similar concepts, then by analogy the concept is likely to have that feature as well. This process is naturally extended by the 'blending' technique [10], a new method to perform inference over multiple sources of data simultaneously, taking advantage of the overlap between them. This enables Common Sense to be used as a basis for inference in a wide variety of systems and applications so that they can achieve Digital Intuition about their own data.
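As a rough illustration of the underlying linear algebra, the toy sketch below computes only the leading singular component of a small dense concept-by-feature matrix by power iteration and uses the rank-1 reconstruction to score entries that were zero in the input. The real AnalogySpace implementation retains many components and operates on large sparse matrices, so this is an assumption-laden illustration, not the actual code.

import java.util.Random;

/** Toy rank-1 "AnalogySpace-like" reconstruction of a concept-by-feature matrix. */
public class RankOneAnalogy {

    /** Returns the rank-1 approximation sigma * u * v^T of matrix a. */
    public static double[][] rankOneApproximation(double[][] a, int iterations) {
        int rows = a.length, cols = a[0].length;

        // power iteration on A^T A to approximate the top right singular vector v
        double[] v = new double[cols];
        Random rnd = new Random(42);
        for (int j = 0; j < cols; j++) v[j] = rnd.nextDouble() - 0.5;
        normalize(v);
        for (int it = 0; it < iterations; it++) {
            v = multiplyTransposed(a, multiply(a, v));   // v <- A^T (A v)
            normalize(v);
        }

        // u = A v / sigma, where sigma = ||A v||
        double[] av = multiply(a, v);
        double sigma = norm(av);
        double[] u = new double[rows];
        for (int i = 0; i < rows; i++) u[i] = av[i] / sigma;

        // reconstruction: entries that were zero in A receive an "inferred" score
        double[][] approx = new double[rows][cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                approx[i][j] = sigma * u[i] * v[j];
        return approx;
    }

    private static double[] multiply(double[][] a, double[] x) {
        double[] y = new double[a.length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < x.length; j++) y[i] += a[i][j] * x[j];
        return y;
    }

    private static double[] multiplyTransposed(double[][] a, double[] x) {
        double[] y = new double[a[0].length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < y.length; j++) y[j] += a[i][j] * x[i];
        return y;
    }

    private static double norm(double[] x) {
        double s = 0;
        for (double xi : x) s += xi * xi;
        return Math.sqrt(s);
    }

    private static void normalize(double[] x) {
        double n = norm(x);
        if (n > 0) for (int i = 0; i < x.length; i++) x[i] /= n;
    }
}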
6 Applications of Common Sense Computing
We are involved in an EPSRC project whose main aim is to further develop and apply the above-mentioned technologies in the field of Common Sense Computing to build a novel intelligent software engine that can auto-categorise documents, and hence enable the development of future semantic web applications whose design and content can dynamically adapt to the user.
6.1 Enhancing the Knowledge Base
The key to performing Common Sense reasoning is to find a good trade-off for representing knowledge: since in life no two situations are ever the same, no representation should be too concrete, or it will not apply to new situations, but, at the same time, no representation should be too abstract, or it will suppress too many details. ConceptNet already supports different representations by maintaining different ways of conveying the same idea with redundant concepts, but we plan to enhance this multiple representation by connecting ConceptNet with dereferenceable URIs and RDF statements to enlarge the Common Sense knowledge base on a different level. We also plan to improve ConceptNet by giving a geospatial reference to all those pieces of knowledge that are likely to have one, and hence make it suitable for exploitation by geographically oriented applications. Knowledge retrieval is one of the main strengths of ConceptNet: we plan to improve it by developing games to train the semantic network and by building on social networking. We also plan to improve ConceptNet on the level of what Minsky calls 'self-reflection', i.e. on the capability of reflecting about its internal structure and
cognitive processes. ConceptNet currently focuses on the kinds of knowledge that might populate the A-brain of a Society of Mind: it knows a great deal about the kinds of objects, events and other entities that exist in the external world, but it knows far less about how to learn, reason and reflect. We plan to give the semantic network the ability to self-check its consistency, e.g. by looking for words that appear together in ways that are statistically implausible and asking users for verification, or by keeping track of successful learning strategies. The system is also likely to remember attempts that led to particularly negative conclusions in order to avoid unproductive strategies in the future. To this end we plan to improve the 'negative expertise' of the semantic network, which is now only partially implemented by asking users to judge inferences, in order to give the system the capability to learn from its mistakes.
6.2 Understanding the Knowledge Base
Whenever we try to solve a problem, we continuously and almost instantly switch between different points of view in our mind to find a solution. Minsky argues that our brains may use special machinery, which he calls 'panalogy' (parallel analogy), that links corresponding aspects of each view to the same 'slot' in a larger-scale structure that is shared across several different realms. ConceptNet's current knowledge representation does not allow us to think about how to implement such a strategy yet, but, once we manage to give ConceptNet a multiple representation as planned, we will have to start thinking about it. At the moment SVD-based methods are used on the graph structure of ConceptNet to build the vector space representation of the Common Sense knowledge. Principal Component Analysis (PCA) is an optimal way to project data in the mean-square sense, but the eigenvalue decomposition of the data covariance matrix is very expensive to compute. Therefore we plan to explore new methods such as Sparse PCA, a technique that formulates PCA as a regression-type optimization problem, and Random Projection, a method less reliable than PCA but computationally cheaper.
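A random projection of the kind mentioned above can be sketched as follows; the Gaussian projection matrix and the 1/sqrt(k) scaling are standard textbook choices, not a description of the project's eventual implementation.

import java.util.Random;

/** Minimal random-projection sketch: maps d-dimensional concept vectors to k
 *  dimensions with a random Gaussian matrix. This avoids any eigendecomposition
 *  and, by the Johnson-Lindenstrauss lemma, approximately preserves pairwise distances. */
public class RandomProjection {
    private final double[][] r;   // k x d projection matrix

    public RandomProjection(int d, int k, long seed) {
        Random rnd = new Random(seed);
        r = new double[k][d];
        double scale = 1.0 / Math.sqrt(k);
        for (int i = 0; i < k; i++)
            for (int j = 0; j < d; j++)
                r[i][j] = rnd.nextGaussian() * scale;   // entries drawn once, then reused
    }

    /** Projects a d-dimensional vector down to k dimensions. */
    public double[] project(double[] x) {
        double[] y = new double[r.length];
        for (int i = 0; i < r.length; i++)
            for (int j = 0; j < x.length; j++)
                y[i] += r[i][j] * x[j];
        return y;
    }
}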
6.3 Exploiting the Knowledge Base
Our primary goal is to build an intelligent auto-categorization tool which uses Common Sense Computing, together with statistical methods, to make document classification more accurate and reliable. We also plan to apply Common Sense Computing to the development of emotion-sensitive systems. By analysing users' Facebook personal messages, emails, blogs, etc., the engine will be able to extract users' emotions and attitudes and use this information to better interact with them. In the sphere of e-games, for example, such an engine could be employed to implement conversational agents capable of reacting to changes in the user's frame of mind and thus enhance players' level of immersion. In the field of enterprise 2.0 the engine could be used to develop customer care applications capable of measuring users' level of satisfaction. In the field of e-health, finally, PDA-microblogging
techniques could be applied to capture clinical information from patients by studying their chat messages, tweets or SMS, and an e-psychologist could be developed to provide help for mild psychological problems.
7 Conclusions
It is hard to measure the total extent of a person's Common Sense knowledge, but a machine that does humanlike reasoning might only need a few dozen million items of knowledge. Thus we would be tempted to give a positive answer to the question "is it actually possible to build a machine with Common Sense?". But, as we saw in this paper, Common Sense Computing is not just about collecting Common Sense knowledge: it is about how we represent it and how we use it to make inferences. We have made very good progress on this since McCarthy's 'advice taker', but we are actually still scratching the surface of human intelligence. So we cannot give a concrete answer to that question, not yet. The road to the creation of a machine with the capacity for Common Sense reasoning is still long and tough, but we feel that the path undertaken so far is a good one. And, even if we fail in making machines intelligent, we believe we will at least be able to teach them who we are and thus make them able to better contribute to the technological evolution of humankind.
References 1. Minsky, M.: The Society of Mind. Simon and Schuster (1986) 2. McCarthy, J.: Programs with Common Sense. In: Proceedings of the Teddington Conference on the Mechanization of Thought Processes (1959) 3. Lenat, D.: Cyc: toward programs with common sense. ACM Press, New York (1990) 4. Fellbaum, C.: WordNet: An Electronic Lexical Database. MIT Press, Cambridge (1998) 5. Singh, P.: The Open Mind Common Sense Project. KurzweilAI.net (2002) 6. Chklovski, T.: Learner: a system for acquiring commonsense knowledge by analogy. K-CAP (2003) 7. Liu, H., Singh, P.: ConceptNet: a practical commonsense reasoning toolkit. BT Technology Journal (2004) 8. Havasi, C., Speer, R., Alonso, J.: ConceptNet 3: a Flexible, Multilingual Semantic Network for Common Sense Knowledge. RANLP (2007) 9. Speer, R., Havasi, C., Lieberman, H.: AnalogySpace: Reducing the Dimensionality of Common Sense Knowledge. AAAI (2008) 10. Havasi, C., Speer, R., Pustejovsky, J., Lieberman, H.: Digital Intuition: Applying Common Sense Using Dimensionality Reduction. IEEE Intelligent Systems (2009)
On Development of Inspection System for Biometric Passports Using Java Luis Terán and Andrzej Drygajlo Speech Processing and Biometrics Group Swiss Federal Institute of Technology Lausanne (EPFL) Switzerland {luis.terantamayo,andrzej.drygajlo}@epfl.ch http://scgwww.epfl.ch/
Abstract. Currently it is possible to implement Biometric Passport applets according to ICAO specifications. In this paper, an ePassport Java Card applet, according to ICAO specifications using the Basic Access Control security, is developed. A system for inspection of the ePassport applet, using Java, in order to test its functionalities and capabilities is also implemented. The simulators, which are developed in this paper, can display the communication between the inspection system and the Java Cards, which could be real or emulated cards. Keywords: Biometrics, ePassport, Java Card, Inspection System, Basic Access Control.
1 Introduction

Over the last two years, Biometric Passports have been introduced in many countries to improve the security of Inspection Systems and to enhance procedures and systems that prevent identity and passport fraud. Along with the deployment of new technologies, countries need to test and evaluate their systems, since the International Civil Aviation Organization (ICAO) provides the guidelines but the implementation is up to each issuing country. The specific choice of each country as to which security features to include or not include makes a major difference in the level of security and privacy protection available.

Table 1. ePassport Deployments

Country      RFID Type  Deployment  Security                             Biometric
Italy        14443      2006        Passive, Active Authentication, BAC  Photo
U.S.         14443      2005        Passive, Active Authentication, BAC  Photo
Netherlands  14443      2005        Passive, Active Authentication, BAC  Photo
Germany      14443      2005        Passive, Active Authentication, BAC  Photo
In this paper we focus on the development of an Inspection System for Biometric Passports. The standards and specifications for Machine Readable Travel Documents (MRTDs) are described in Section 2. The technologies used for the implementation of the inspection system for Biometric Passports, and a description of the implementation and its utilization, are presented in Section 3. The conclusions are drawn in Section 4. Table 1 shows some examples of ePassport implementations and the security features selected.
2 ICAO Specifications

The Machine Readable Travel Document (MRTD) is an international travel document which contains machine-readable data of the travel document holder. MRTDs facilitate the identification of travelers and enhance security levels. MRTDs are developed with the assistance of the ICAO Technical Advisory Group for Machine Readable Travel Documents (ICAO TAG/MRTD) and the ISO Working Group 3 (JTC1/SC17/WG3). MRTDs such as passports, visas or other travel documents have the following key issues to be considered: Global Interoperability, Uniformity, Technical Reliability, Practicality, and Durability. ICAO has elaborated different technical reports. The main technical reports, which are part of the ICAO specifications and considered in this paper, are the Logical Data Structure Technical Report and the Public Key Infrastructure Technical Report.

2.1 Logical Data Structure

The Logical Data Structure (LDS) technical report specifies a global and interoperable data structure for recording identity details including biometric data. The data stored in the LDS is read electronically and designed to be flexible and expandable for future needs. A series of mandatory and optional elements has been defined for the LDS, which are used in MRTDs. The use of biometric data is optional except for the encoded face. Each data group is stored in a different Elementary File (EF). The structure and coding of data objects are defined in ISO/IEC 7816-4 [1]. Each data object is encoded using Tag - Length - Value (TLV). A data object is denoted {T - L - V}, with a tag field followed by a length field encoding a number which represents the size of the value field. If the size is equal to zero, the data field is absent. A constructed data object is denoted {T - L - V {T1 - L1 - V1} ... {Tn - Ln - Vn}}, which represents a concatenation and interweaving of data objects. This type of structure is used in Data Groups containing more than one value field, which are preceded by a specific Tag and Length field. Figure 1 shows the complete structure of the LDS, which includes the mandatory and optional data elements defined for LDS version 1.7.
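The following simplified encoder illustrates the {T - L - V} structure described above. It handles only single- and two-byte length fields and assumes the caller supplies the raw tag bytes, so it is a sketch of the encoding rule rather than a full ISO/IEC 7816-4 BER-TLV implementation.

import java.io.ByteArrayOutputStream;

/** Simplified TLV encoder: tag bytes, length field, then the value bytes. */
public class TlvEncoder {

    public static byte[] encode(byte[] tag, byte[] value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(tag, 0, tag.length);
        int len = (value == null) ? 0 : value.length;
        if (len < 0x80) {                        // short form: one length byte
            out.write(len);
        } else if (len <= 0xFF) {                // long form: 0x81 + one length byte
            out.write(0x81);
            out.write(len);
        } else {                                 // long form: 0x82 + two length bytes
            out.write(0x82);
            out.write((len >> 8) & 0xFF);
            out.write(len & 0xFF);
        }
        if (len > 0) out.write(value, 0, len);   // value field is absent when the length is zero
        return out.toByteArray();
    }

    /** A constructed data object simply nests encoded TLVs inside its own value field. */
    public static byte[] encodeConstructed(byte[] tag, byte[]... children) {
        ByteArrayOutputStream value = new ByteArrayOutputStream();
        for (byte[] child : children) value.write(child, 0, child.length);
        return encode(tag, value.toByteArray());
    }
}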
Fig. 1. Logical Data Structure
2.2 Public Key Infrastructure

The aim of the Public Key Infrastructure (PKI) report scheme is to enable MRTD-inspecting authorities (Receiving States) to verify the authenticity and integrity of the data stored in the MRTD. ICAO has specified two types of authentication: Passive and Active Authentication. The main threats to a contactless IC chip, compared with a traditional contact chip, are that the information stored in a contactless chip could be read without opening the document, and that an unencrypted communication between a chip and a reader could be eavesdropped. The use of Access Control is optional. If it is implemented, an inspection system must prove that it has access to the IC chip. The ICAO technical report PKI for Machine Readable Travel Documents offering ICC Read-Only Access [4] provides specifications of Basic Access Control and Secure Messaging. Basic Access Control consists of three steps, as described next.

Derivation of Document Basic Access Keys (KENC and KMAC)
In order to get access to an IC chip and set up a Secure Channel, an inspection system derives the Document Basic Access Keys (KENC and KMAC) from the MRZ as follows:
- The inspection system reads the MRZ. A field called MRZ_information consists of a concatenation of the fields document number, date of birth, and date of expiry, as shown next:

LINE 1: P<USAAMOSS<<FRANK<<<<<<<<<<<<<<<<<<<<<<<<<<<
LINE 2: 0000780043USA5001013M1511169100000000<381564

MRZ_information = 000078004500101151116

- The inspection system computes the SHA-1 value of MRZ_information.
- The inspection system uses the 16 most significant bytes of the hash value of MRZ_information to compute KIFD/ICC:
KIFD/ICC = Trunc16(SHA-1(MRZ_information))

Compute session keys from seed key KIFD/ICC

- The inspection system concatenates the key seed KIFD/ICC with a value c in order to compute KENC and KMAC (c = 1 for KENC and c = 2 for KMAC) and takes the hash value of the result:

HASH1 = Trunc16(SHA-1(KIFD/ICC || 00 00 00 01))
HASH2 = Trunc16(SHA-1(KIFD/ICC || 00 00 00 02))

- The first 8 bytes of HASH1 are set as Ka(ENC) and the last 8 bytes of HASH1 are set as Kb(ENC) in a two-key triple DES cryptographic algorithm. The first 8 bytes of HASH2 are set as Ka(MAC) and the last 8 bytes of HASH2 are set as Kb(MAC) in a two-key triple DES cryptographic algorithm.
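A sketch of this key derivation in Java is shown below. The MRZ_information string is the example reconstructed above, and the ICAO-mandated adjustment of the DES key parity bits is omitted for brevity.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

/** Sketch of the Basic Access Control key derivation described above. */
public class BacKeyDerivation {

    /** SHA-1 of the input, truncated to the 16 most significant bytes. */
    static byte[] trunc16Sha1(byte[] data) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        return Arrays.copyOf(sha1.digest(data), 16);
    }

    /** c = 1 derives K_ENC, c = 2 derives K_MAC; the result is the two-key 3DES key Ka||Kb. */
    static byte[] deriveKey(byte[] kSeed, int c) throws Exception {
        byte[] d = new byte[kSeed.length + 4];
        System.arraycopy(kSeed, 0, d, 0, kSeed.length);
        d[d.length - 1] = (byte) c;                 // counter 00 00 00 01 or 00 00 00 02
        return trunc16Sha1(d);                      // Ka = bytes 0..7, Kb = bytes 8..15
    }

    public static void main(String[] args) throws Exception {
        String mrzInformation = "000078004500101151116";   // example taken from the text above
        byte[] kSeed = trunc16Sha1(mrzInformation.getBytes(StandardCharsets.US_ASCII));
        byte[] kEnc = deriveKey(kSeed, 1);
        byte[] kMac = deriveKey(kSeed, 2);
        System.out.printf("K_ENC: %d bytes, K_MAC: %d bytes%n", kEnc.length, kMac.length);
    }
}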
Authentication and Establishment of Session Keys

- The inspection system requests an 8-byte challenge (RND.ICC) from the IC chip, and generates a random 8-byte value (RND.IFD) together with a random 16-byte triple DES key (KIFD).
- The inspection system concatenates RND.ICC, RND.IFD, and KIFD:

S = RND.IFD || RND.ICC || KIFD

- The inspection system sends to the MRTD the encrypted version of S using KENC and KMAC, via the MUTUAL AUTHENTICATE function of ISO 7816-4:

cmd_data = E[KENC](S) || MAC[KMAC](E[KENC](S))

- The MRTD decrypts and verifies the received data. The MRTD computes a 16-byte triple DES key (KICC) and concatenates RND.ICC, RND.IFD, and KICC:

R = RND.ICC || RND.IFD || KICC

- The MRTD sends to the inspection system the encrypted version of R using KENC and KMAC, via the MUTUAL AUTHENTICATE function of ISO 7816-4:

resp_data = E[KENC](R) || MAC[KMAC](E[KENC](R))

- The inspection system decrypts and verifies the received data. Both the inspection system and the MRTD create a 16-byte key seed KIFD/ICC, which is the exclusive or of KICC and KIFD:

KIFD/ICC = KICC ⊕ KIFD

- Both the inspection system and the MRTD compute the 16-byte Encryption and Authentication Session Keys (KSENC and KSMAC) using KIFD/ICC as described previously in the section "Compute session keys from seed key KIFD/ICC".
The Secure Messaging should be implemented for the communication of Application Protocol Data Units (APDUs) between MRTDs and the inspection systems. Secure Messaging has been specified according to CWA 14890-1: 2004, Application Interface for smart cards used as Secure Signature Creation Devices [5]. It is required for Passive Authentication and it is optional for Active Authentication. Figure 2 shows an example of Secure Messaging using session keys KSENC and KSMAC computed previously.
Fig. 2. Secure Messaging
3 Java Implementation

The main technologies considered in this paper are Smart Cards, Java Card versions 2.2.2 and 2.2.1, and Java version 1.6, in order to provide an application that allows the user to emulate and evaluate an ePassport applet.

The Smart Card is a portable and tamper-resistant computer. It has both processing power and memory. The physical appearance and properties of smart cards are defined in the standard ISO 7816-1. Smart cards usually contain three types of memory: ROM, EEPROM, and RAM. The ROM memory is used for storing the fixed program of the card and contains the operating system routines as well as permanent data and user applications. The EEPROM memory is similar to ROM memory in that it can preserve information when the power is turned off. The main difference is that the content of this memory can be modified. We could compare EEPROM memory with the hard disk of a PC. The RAM memory is used as temporary working space for storing and modifying data.

The second technology, Java Card Technology, allows Java-based applications (called applets) to run on smart cards, using the card's memory and processing capabilities. The Sun Microsystems distribution for smart cards (Java Card Development Kit V2.2.1 and V2.2.2) is used in this paper in order to develop an ePassport applet according to ICAO specifications. Java Card Technology defines three parts: the Java Card Virtual Machine (JCVM), the Java Card Runtime Environment (JCRE) and the Java Card Application Programming Interface (API). The JCVM defines a subset of the Java programming language. The JCVM is implemented as two separate pieces. The on-card portion of the JCVM includes the Java Card byte-code interpreter. The Java Card converter runs on a PC; it converts and loads the class files, generating a CAP (converted applet) file. The second part, the JCRE, is the Java Card system that runs inside the card. It is responsible for card resource management, network communication, applet execution and security. The third part, the Java Card API, is a set of classes for programming smart cards according to the ISO 7816 model.
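For orientation, a minimal Java Card applet skeleton is shown below. It contains only the JCRE entry points (install and process) and a single example instruction; the actual ePassport applet additionally implements the LDS file system, Basic Access Control and secure messaging, and its command set is not reproduced here.

import javacard.framework.APDU;
import javacard.framework.Applet;
import javacard.framework.ISO7816;
import javacard.framework.ISOException;

/** Minimal Java Card applet skeleton; not the ePassport applet developed in this paper. */
public class MinimalPassportApplet extends Applet {

    private MinimalPassportApplet() {
        register();                        // make the applet selectable by the JCRE
    }

    public static void install(byte[] bArray, short bOffset, byte bLength) {
        new MinimalPassportApplet();       // called once when the CAP file is installed
    }

    public void process(APDU apdu) {
        if (selectingApplet()) {
            return;                        // SELECT APDU: nothing else to do here
        }
        byte[] buffer = apdu.getBuffer();
        switch (buffer[ISO7816.OFFSET_INS]) {
            case (byte) 0xB0:              // READ BINARY (simplified: returns no data)
                return;
            default:
                ISOException.throwIt(ISO7816.SW_INS_NOT_SUPPORTED);
        }
    }
}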
Finally, the last technology used is Java. One of the most important decisions to be taken before starting the development of the prototype is the programming language. The requirements of the application to be developed are a Graphical User Interface (GUI), communication with a database, and communication with the Java Card applet. Thus, in order to fulfill all requirements, Java version 1.6 is used for the implementation.

3.1 JCWDE Simulator

The Java Card platform Workstation Development Environment (JCWDE) tool allows emulating the running of an applet. JCWDE is not an emulation of the JCVM; it uses the JCVM in order to emulate the JCRE. This simulator cannot support some of the JCRE features, such as package installation, persistent card state, firewall, transactions, transient array clearing, object deletion, applet deletion, and package deletion. The JCWDE simulator used in this paper allows the user to test the ePassport applet. In order to generate the File System required by the ePassport applet, which contains the personal information of the ePassport owner, the Inspection System accesses a database, and the user selects one person from the list of people included in the system. After the selection, the Inspection System configures the required environment variables and generates the necessary files for the simulation. When the configuration process has been performed, the user can test the ePassport applet using the information of the person selected. This simulator does not allow the use of the Basic Access Control security method due to the limitations mentioned before. The JCWDE is used to read information from the emulated ePassport over an insecure channel. This simulator can also display the communication generated between the emulated ePassport and the Inspection System. Figure 3 shows the architecture of the JCWDE simulator used in this paper.
Fig. 3. JCWDE Simulator
3.2 CREF Simulator

The C-language Java Card Runtime Environment (CREF) tool allows emulating a real Java Card technology-based implementation. CREF has the ability to simulate persistent memory (EEPROM) and to save and restore the contents of the EEPROM. The CREF simulator used in this paper allows the user to test the ePassport applet. In order to generate the File System required by the ePassport applet, which contains the personal information of the ePassport owner, the Inspection System accesses the database, and the user selects one person from the list of people included in the system. After the selection, the Inspection System configures the required environment variables and generates the necessary files for the simulation.
When the configuration process has been performed, the user can test the ePassport applet using the information of the person selected. The CREF is used to read information from the emulated ePassport in an insecure channel. This simulator can also display the communication generated between the emulated ePassport and the Inspection System. Figure 4, shows the architecture of the CREF simulator used in this paper.
Fig. 4. CREF Simulator
3.3 Card Interface

The Inspection System can also test ePassport applets installed in a smart card. The Card Interface is used to read information from an ePassport applet installed in a smart card over the insecure contactless channel, using the Basic Access Control security method. In order to communicate with the smart card we use an Alya reader provided by COVADIS [8], which uses a PC/SC interface. The Card Interface can also display the communication generated between the ePassport applet installed in the smart card and the Inspection System. Figure 5 shows the architecture of the simulator with card interface used in this paper.
Fig. 5. Simulator with Card
4 Conclusions

In this paper, an Inspection System for Biometric Passports was designed according to ICAO specifications. The resulting system covers most of the user requirements, for instance upload and update of information, verification, and security. The ePassport applet was built using the Java Card Development Kit version 2.2.1, which has limitations such as restricted handling of APDU commands and fewer cryptographic algorithms. The current system can test an ePassport applet with or without a real smart card, using the JCWDE and CREF simulators included in the Java Card Development Kit 2.2.2. In order to improve the security of the system, it is recommended to consider the newer version of the Java Card Development Kit (version 2.2.2), which provides better cryptographic algorithms.
References
1. ISO: Identification cards - Integrated circuit cards with contacts - Part 4: Organization, security and commands for interchange. ISO/IEC 7816-4. International Organization for Standardization, Geneva, Switzerland (2005)
2. Development and Specifications of Globally Interoperable Biometric Standards for Machine Assisted Identity Confirmation using Machine Readable Travel Documents, V2.0, International Civil Aviation Organization (2004)
3. Development of a Logical Data Structure for Optional Capacity Expansion Technologies, V1.7, International Civil Aviation Organization (2004)
4. PKI for Machine Readable Travel Documents Offering ICC Read-Only Access, V1.1, International Civil Aviation Organization (2004)
5. CWA 14890-1: 2004, Application Interface for smart cards used as Secure Signature Creation Devices, Version 1 release 9 rev. 2 (December 22, 2003)
6. Chen, Z.: Java Card Technology for Smart Cards: Architecture and Programmer's Guide (2000)
7. Juels, A., Molnar, D., Wagner, D.: Security and Privacy Issues in E-Passports. RSA Labs (2005)
8. ALYA reader, http://www.covadis.ch/Alya.239.0.html
Handwritten Signature On-Card Matching Performance Testing
Olaf Henniger (1) and Sascha Müller (2)
(1) Fraunhofer Institute for Secure Information Technology, Darmstadt, Germany
[email protected]
(2) Technische Universität Darmstadt, Darmstadt, Germany
[email protected]
Abstract. This paper presents equipment and procedures for on-card (in-situ) performance testing of biometric on-card comparison implementations using pre-existing databases of biometric samples. A DTW-based on-line signature on-card comparison implementation serves as an example test object. The test results presented are false match rates and false non-match rates over a range of decision thresholds on a per-test-subject basis. The results reveal considerable differences in the comparison-score frequency distribution among test subjects, which necessitates the setting of user-dependent decision thresholds or comparison-score normalization. Keywords: On-card comparison, biometric performance testing, handwritten signature.
1 Introduction
If sufficiently resistant against direct and indirect attacks, biometric methods can be deployed for user authentication purposes in smart cards that provide security-relevant functions (like the creation of electronic signatures or the authorization of transactions) or carry data worthy of protection (like medical data). An important component of evaluating the security of a biometric system is testing its performance in terms of error rates. This paper presents means of testing the performance of on-card comparison implementations using databases of biometric samples. In contrast to [1], on-card comparison is not emulated in a software library installed on a PC; the testing is rather conducted on physical cards. The advantage of this is that, by moving the point of control and observation closer to the implementation under test (IUT), the confidence in its correct functioning may increase. The paper also reports performance test results for an on-line signature on-card comparison prototype. Moreover, it closely investigates why behavioral biometric characteristics require either the setting of user-dependent decision thresholds [2] or the normalization of comparison scores [3]. The rest of the paper is organized as follows: Section 2 introduces the on-card comparison IUT. Section 3 reports which aspects of performance are measured. Section 4 introduces the test database used. Section 5 deals with the means of
testing, i.e. equipment and procedures used. Section 6 details the test results for the IUT. Section 7 compares strengths and weaknesses of actual on-card testing and off-card emulation. Section 8 summarizes the results and gives an outlook.
2 Implementation under Test
The IUT is a prototype of an on-card comparison algorithm for handwritten signatures [4]. It uses a dynamic time warping (DTW) algorithm with a Sakoe/Chiba band [5]. DTW can be considered among the best approaches for on-line signature verification [6,7]. The data sent to the card are time series for the x and y coordinates in the compact format of [8]. The sample rate is 100 per second. The implementation platform is Java cards, i.e. smart cards with a Java Card Virtual Machine. Java cards are good for prototyping smart card applications, but their computing speed is limited because Java byte code is interpreted at run-time. The same comparison algorithm has been installed on different Java cards with different computing speeds. At any one time, a single signature is stored on the card as the reference signature. The storage space for the reference signature is restricted and cannot hold signatures that take longer than 5 sec to sign. For testing purposes, no retry counter is used, i.e. the biometric verification method is not blocked after a number of failed verification attempts. For faster testing, the IUT also does not prompt for user authentication prior to changing the reference data. For testing purposes, the IUT returns the comparison score of each comparison of a probe signature with a reference signature as a two-octet unsigned integer. The comparison scores are distance (or dissimilarity) scores, i.e. they decrease with similarity. The IUT tries (in vain) to normalize the distance scores by dividing each DTW distance by the number of sample points of the reference signature.
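The following sketch illustrates a DTW distance with a Sakoe/Chiba band for two (x, y) time series. It mirrors the kind of comparison described above but is not the on-card implementation itself; the per-reference-length normalization mentioned in the text would be applied to the returned value.

/** Sketch of a banded DTW distance between two (x, y) time series. */
public class DtwDistance {

    public static double distance(double[][] a, double[][] b, int band) {
        int n = a.length, m = b.length;
        double inf = Double.POSITIVE_INFINITY;
        double[][] d = new double[n + 1][m + 1];
        for (double[] row : d) java.util.Arrays.fill(row, inf);
        d[0][0] = 0;

        for (int i = 1; i <= n; i++) {
            // only cells within the Sakoe-Chiba band around the diagonal are visited;
            // the band must be at least |n - m| wide for the end cell to be reachable
            int jFrom = Math.max(1, i - band), jTo = Math.min(m, i + band);
            for (int j = jFrom; j <= jTo; j++) {
                double cost = pointDistance(a[i - 1], b[j - 1]);
                d[i][j] = cost + Math.min(d[i - 1][j - 1], Math.min(d[i - 1][j], d[i][j - 1]));
            }
        }
        return d[n][m];
    }

    /** Euclidean distance between two sample points given as {x, y}. */
    private static double pointDistance(double[] p, double[] q) {
        double dx = p[0] - q[0], dy = p[1] - q[1];
        return Math.sqrt(dx * dx + dy * dy);
    }
}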
3 Performance Metrics
The false non-match rate (FNMR) of a biometric verification system is the proportion of genuine attempts falsely declared not to match the biometric reference. The false match rate (FMR) of a biometric verification system is the proportion of impostor attempts falsely declared to match the biometric reference [9]. Error rates involving impostor attempts can be measured using either – zero-effort attempts where impostors present their own biometric characteristics as if attempting successful verification against their own reference or – active impostor attempts where impostors intentionally imitate the biometric characteristics of other persons. In cases where impostors may easily imitate aspects of the required biometric characteristics, such as handwritten signatures, an FMR measurement based on active-impostor attempts is more predictive of the performance in practice. Therefore, here the active-impostor FMR is considered.
FNMR and FMR depend on an adjustable decision threshold determining the required degree of similarity of probe sample and reference. The lower the FMR at a certain threshold value, i.e. the fewer forgeries are accepted, the higher is the FNMR, i.e. the fewer genuine samples are accepted as well, and vice versa. Depending on the concrete requirements, an appropriate threshold value has to be chosen, reconciling security (low FMR) with usability (low FNMR). A receiver operating characteristic (ROC) curve shows the rate of impostor attempts accepted (FMR) on the x axis against the corresponding rate of genuine attempts accepted (1−FNMR) on the y axis, plotted parametrically as a function of the decision threshold [9]. To find out the worst-case performance for individuals, the ROC curves are plotted on a per-test-subject basis. This requires multiple genuine and impostor samples for each test subject. A significant performance metric of a biometric system is the equal error rate (EER), even though it does not summarize all characteristics of an ROC curve. The EER is the error rate at that threshold value where FNMR = FMR. When the threshold is moved away from that value, then either FNMR or FMR deteriorate beyond EER. Because FMR and FNMR of a biometric system could be improved by abstaining from processing low quality samples, also the failure-to-enrol (FTE) rate is reported. The FTE rate of a biometric system is the proportion of enrolment attempts for which the system fails to complete the enrolment process.
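The following sketch shows how FMR, FNMR and an approximate EER can be computed from recorded genuine and impostor distance scores. A probe is accepted when its distance score falls below the threshold; scanning only the observed genuine scores as candidate thresholds is a simplification, not the evaluation procedure prescribed by [9].

import java.util.Arrays;

/** Sketch of FMR/FNMR/EER estimation from distance (dissimilarity) scores. */
public class ErrorRates {

    /** Returns {FMR, FNMR} at the given threshold: accept if score < threshold. */
    public static double[] rates(double[] genuine, double[] impostor, double threshold) {
        long falseNonMatches = Arrays.stream(genuine).filter(s -> s >= threshold).count();
        long falseMatches = Arrays.stream(impostor).filter(s -> s < threshold).count();
        return new double[] {
            (double) falseMatches / impostor.length,
            (double) falseNonMatches / genuine.length
        };
    }

    /** Scans candidate thresholds and returns the error rate where FMR and FNMR are closest. */
    public static double equalErrorRate(double[] genuine, double[] impostor) {
        double[] candidates = Arrays.copyOf(genuine, genuine.length);
        Arrays.sort(candidates);
        double best = 1.0, bestGap = Double.MAX_VALUE;
        for (double t : candidates) {
            double[] r = rates(genuine, impostor, t);
            double gap = Math.abs(r[0] - r[1]);
            if (gap < bestGap) {
                bestGap = gap;
                best = (r[0] + r[1]) / 2;   // approximate EER at this threshold
            }
        }
        return best;
    }
}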
4 Test Corpus
The error rates can be estimated experimentally with some statistical significance using large test databases. The corpus of samples used is a publicly available subset of the database [10] consisting of signature samples of n = 100 test subjects. For each test subject, there are mg = 25 genuine samples and mf = 25 skilled forgeries. For the forgery attempts, the impersonators had the original signatures available on paper and were allowed to imitate them to the best of their ability. They were allowed to practice the signatures to be forged, to look at the original while forging, and even to retrace the original.
5 Test System
The tests are controlled by a software called KARMASYS (card-manipulation system). KARMASYS allows, via an off-the-shelf card terminal, sending commands to a smart card and recording the responses received from the card. Fig. 1 shows an overview of the test equipment. A KARMASYS test script comprises a sequence of application protocol data units (APDUs) to be sent to the card under test. Change Reference Data APDUs are used for changing the reference data stored on the smart card; Verify APDUs are used for transmitting probe samples to the card [11]. For compliance with [11], signature time series data blocks that are too large for a single regular-length APDU are split into chains of APDUs.
Fig. 1. Architecture of the test system: the host PC runs the KARMASYS test tool, which communicates with the implementation under test through the card-terminal application programming interface, the card-terminal driver, the card terminal, and the card interface
For each of the n = 100 test subjects, a test script was created using awk scripts and batch files. The output is a sequence of APDUs for testing pairs of signature samples of a test subject. The test script for a test subject includes
- mg = 25 Change Reference Data APDU chains,
- Σ_{i=1}^{mg} (i − 1) = 300 Verify APDU chains containing genuine signatures (as DTW is commutative, there is no need to compare the genuine signatures (gk, gj) after comparing (gj, gk)), and
- mg · mf = 625 Verify APDU chains containing forged signatures.
6 Performance Test Results

6.1 Failure-to-Enrol Rate
For 30% of all test subjects a failure to enrol occurred, because at least one of their genuine signature time series data blocks was too long to be stored on the smart card as reference data. Provided that long signatures are more difficult to forge than shorter ones, these failures are detrimental to the overall FMR.
6.2 Per-Test-Subject Performance
Fig. 2(a) through 2(c) show examples of per-test-subject performance test results in ROC curves. Considerable individual differences show up. The per-test-subject EERs of all enrolled test subjects range from 0% to 44.7%: – For four test subjects (5.7% of all 70 enrolled test subjects) whose signatures are long and of high complexity as in Fig. 3(a), perfect discrimination between genuine and forged signatures appears possible, i.e. all forgery attempts resulted in a higher distance score than any genuine signature verification attempt. The EER is 0% in this case, and the ROC curve passes through the upper left corner of the diagram, cf. Fig. 2(a).
– For two test subjects whose signatures are short and of low complexity as in Fig. 3(b), the EER is above 40%, cf. Fig. 2(c). For these test subjects the outcome of signature verification is nearly as random as tossing a coin.
Fig. 2. ROC curves (1−FNMR over FMR) for (a) test subject 42, (b) test subject 91, and (c) test subject 15

Fig. 3. Signature examples: (a) test subject 42 (long, complex); (b) test subject 15 (short, simple)

Fig. 4. Histogram of per-test-subject EERs (relative frequency over per-test-subject EER)

These differences may be due to inter-individual differences in the stability (intra-individual invariance) and the forgeability of signatures. They may also be due to differences in the skills applied when attempting to forge the signatures. Quality scores [12] can be used for deciding whether a handwritten signature is long and complex enough to be hard to forge. Fig. 4 shows the relative frequency distribution of per-test-subject EERs. For a third of all enrolled test subjects, the EER amounts to less than 5%. The median value of the per-test-subject EERs of all enrolled test subjects (i.e. the
value separating the higher half of the values from the lower half) is 8.3%. The mean value of the per-test-subject EERs of all enrolled test subjects is 10.7%; their standard deviation is 9.6%.
6.3 Overall Performance
The decision threshold can be set individually for each user or chosen to be identical for all users. A common decision threshold for all users has the advantage that it needs to be set only once for all. However, a common threshold can achieve an acceptable overall performance only if the frequency distribution of the comparison scores is nearly the same for all users. Otherwise, the optimal thresholds for distinguishing between genuine and forgery samples lie at different threshold values for different users. In order to achieve an acceptable overall performance in case of different comparison-score frequency distributions among users,
- the threshold should be set individually for each user based on the similarity of the samples presented during enrolment [2], or
- the comparison scores should be normalized in such a way that their frequency distribution becomes similar for all users [3].

For illustration, see the distance score distributions for test subjects 42 and 62 in Fig. 5(a) and (b). For each of them, all genuine distance scores are smaller than any forgery distance score, and an optimal threshold value can be found that allows perfect discrimination between genuine signatures and forgeries. The two optimal threshold values, however, are different. If a common threshold value is to be used for both test subjects, then, no matter which value is chosen, false matches and false non-matches are unavoidable, see Fig. 5(c). This effect is aggravated when all test subjects are taken into consideration. If the tested prototype operated with a common decision threshold for all test subjects, its overall EER would be 19.1%, i.e. significantly worse than the median and mean EER values reported in Section 6.2.

Fig. 5. Distance score histograms: (a) test subject 42; (b) test subject 62; (c) test subjects 42 and 62 together, where the genuine and forgery score distributions overlap

For score normalization the user should present several signatures at enrolment time. The one with the smallest average distance to all other presented signatures should be chosen as the reference signature. At verification time, each distance score should be divided by the average distance of the reference signature to that user's other signatures presented at enrolment [6].
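A sketch of this enrolment-time selection and normalization is given below. The pairwise distance matrix of the enrolment samples is assumed to be available (e.g. from DTW comparisons), and the class and field names are illustrative.

/** Sketch of reference selection and user-specific score normalization at enrolment. */
public class ScoreNormalization {
    public final int referenceIndex;       // index of the chosen reference sample
    private final double normalizer;       // its average distance to the other enrolment samples

    /** distances[i][j] = pairwise distance between enrolment samples i and j. */
    public ScoreNormalization(double[][] distances) {
        int n = distances.length;
        int best = 0;
        double bestAvg = Double.MAX_VALUE;
        for (int i = 0; i < n; i++) {
            double sum = 0;
            for (int j = 0; j < n; j++) if (j != i) sum += distances[i][j];
            double avg = sum / (n - 1);
            if (avg < bestAvg) { bestAvg = avg; best = i; }   // smallest average distance wins
        }
        referenceIndex = best;
        normalizer = bestAvg;
    }

    /** Divides a raw verification distance by the user-specific normalizer. */
    public double normalize(double rawDistance) {
        return rawDistance / normalizer;
    }
}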
6.4 Time Needed
The time needed for transmitting probe samples to the card and for on-card comparison using the DTW algorithm grows linearly with the size of the probe samples. That means, if it takes 10 sec on the fastest tested Java card to verify a signature which it took 1 sec to write, then it takes 50 sec to verify a signature which it took 5 sec to write. As different Java cards with different speed have been used during the test campaign, here the time needed is not evaluated statistically.
7 On-Card Testing vs. Off-Card Emulation
While off-card emulation needs less time than actual on-card (i.e. in-situ) testing, in-situ testing allows measuring additional performance and robustness indicators, such as the actual time needed, and detecting errors that remain hidden in case of off-card emulation. For instance, the following error in the IUT was revealed by in-situ testing: a frequently used loop-control variable was allocated EEPROM space instead of RAM space. This slowed down operations and repeatedly led, after several thousand signature comparisons, to the destruction of the misused EEPROM cells and thus of the cards. After allocating RAM space to the loop-control variable, the lifetime of the cards greatly increased (no outage since then).
8 Summary and Outlook
The performance of biometric systems depends on the algorithms, the environment, and the population. Therefore, caution must be exercised when comparing these results with those of other systems tested using different test data. There is ample potential for improvement of the prototype IUT:
- Processing time: 10 sec per 100 sample points is too long. A speed-up is expected from faster smart cards and optimizations of the algorithm.
- FTE rate: To avoid failures to enrol, more memory space is needed for reference signatures.
- Frequency distribution of distance scores: This differs from person to person for the IUT. The distance scores need to be normalized to allow using a uniform decision threshold.
- Quality control: Too short and too simple signatures should be rejected at enrolment time to make forgeries more difficult.

The performance of future versions of the handwritten signature on-card comparison implementation can be tested, largely automatically, using the presented test equipment and procedures. The test equipment and procedures can also be used, with different corpora of biometric samples, to test the performance of on-card comparison implementations for other biometric characteristics.
References 1. Grother, P., Salamon, W., Watson, C., Indovina, M., Flanagan, P.: MINEX II – Performance of fingerprint match-on-card algorithms – Phase II report. NIST Interagency Report NISTIR 7477, NIST, Gaithersburg, MD, USA (2008) 2. Jain, A., Griess, F., Connell, S.: On-line signature verification. Pattern Recognition 35, 2963–2972 (2002) 3. Fi´errez-Aguilar, J., Ortega-Garcia, J., Gonzalez-Rodriguez, J.: Target dependent score normalization techniques and their application to signature verification. In: [13], pp. 498–504 4. Henniger, O., Franke, K.: Biometric user authentication on smart cards by means of handwritten signatures. In: [13], pp. 547–554 5. Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-26(1), 43–49 (1978) 6. Kholmatov, A., Yanikoglu, B.A.: Identity authentication using improved online signature verification method. Pattern Recogn. Letters 26(15), 2400–2408 (2005) 7. Yeung, D.Y., Chang, H., Xiong, Y., George, S., Kashi, R., Matsumoto, T., Rigoll, G.: SVC2004: First international signature verification competition. In: [13], pp. 16–22 8. Information technology – Biometric data interchange formats – Part 7: Signature/sign time series data. International Standard ISO/IEC 19794-7 (2007) 9. Information technology – Biometric performance testing and reporting – Part 1: Principles and framework. International Standard ISO/IEC 19795-1 (2006) 10. Ortega-Garcia, J., Fi´errez-Aguilar, J., Simon, D., Gonzalez, J., Faundez-Zanuy, M., Espinosa, V., Satue, A., Hernaez, I., Igarza, J.J., Vivaracho, C., Escudero, D., Moro, Q.I.: MCYT baseline corpus: a bimodal biometric database. IEE Proceedings Visual Image Processing 150(6), 395–401 (2003) 11. Information technology – Identification cards – Part 4: Organization, security and commands for interchange. International Standard ISO/IEC 7816-4 (2004) 12. M¨ uller, S., Henniger, O.: Evaluating the biometric sample quality of handwritten signatures. In: Lee, S.-W., Li, S.Z. (eds.) ICB 2007. LNCS, vol. 4642, pp. 407–414. Springer, Heidelberg (2007) 13. Zhang, D., Jain, A.K. (eds.): ICBA 2004. LNCS, vol. 3072. Springer, Heidelberg (2004)
Classification Based Revocable Biometric Identity Code Generation
Alper Kanak1,2 and İbrahim Soğukpınar2
1 National Research Institute of Electronics and Cryptology, TUBITAK-UEKAE, Kocaeli, Turkiye
2 Gebze Institute of Technology, Kocaeli, Turkiye
[email protected], [email protected]
Abstract. Recent biometric template protection methods often propose salting or one-way transformation functions, or biometric cryptosystems capable of key binding or key generation, to provide revocability of the templates. Moreover, the use of multiple instances of a biometric trait provides more robust features, which are then combined with the well-known template protection methods. In this study, a normal densities based linear classifier is proposed to distinguish the features associated with each user and cluster them to generate an identity code by mapping the center of the cluster to an N-dimensional quantized bin. The resulting code is converted to a bit stream by a hashing mechanism to let the user revoke his biometric in case of key compromise. This method has the advantage of representing an individual by many of his features instead of a single one, in a supervised manner.
1 Introduction
Traditional cryptosystems often require users to protect a secret key by selecting a password or carrying a medium such as a smart card. However, using such storage media may cause security problems due to stolen, forgotten or easily guessed passphrases. The idea of using personal entropy instead has promoted the integration of biometric authentication into conventional security solutions. Biometric-based keys have several advantages: they eliminate the need for a user to memorize long bit streams and provide a good way of protecting the user's privacy. However, the main obstacle to generating a biometric key is that all cryptosystems mandate these keys to be unique and exactly the same on every use. This obligation directs researchers to propose biometric template protection mechanisms which should possess the following properties: (1) they should be independent from uncertain characteristics of human nature due to aging, illnesses, environmental factors, sensor differences, etc. (accuracy); (2) a compromised template should be revocable while keeping the original biometric trait the same (revocability); (3) revealing the original biometric data should be computationally hard (security); and (4) the biometric information should be concealed with some other user-specific information (privacy).
Research on biometric template protection has evolved especially in the last decade and can be classified into two categories: (1) feature transformation and (2) biometric cryptosystems [1]. In the feature transformation approach, a transformation function is used to extract a new biometric template with the help of randomly generated information associated with each user. Depending on the characteristics of the transformation function, the template protection mechanism either lets an invertible transformation function recover the original biometric template (salting) [3][4] or applies a one-way non-invertible hash function to conceal the biometric trait [5][6][7]. Salting and hash functions present low error rates and provide biometric cancelability. However, the security of these methods mostly relies on the security of the key distribution, and the secret key may dominate the discriminative information of a user such that an attacker might be able to gain access to the system using just his own biometric trait. Biometric cryptosystems, on the other hand, concentrate on storing public helper data which is not supposed to reveal any significant information about the original biometric and which is needed during matching to extract a cryptographic key. Biometric cryptosystems can be categorized into two classes: (1) key binding and (2) key generation. In key binding, the helper data is obtained by binding a key with the template. Matching is based on the fuzzy commitment scheme, in which the associated codeword (and then the embedded key) is recovered within a certain error tolerance [8]. An overview can be found in [1][2]. Note that key binding approaches provide an error-tolerant way of mitigating intra-user variations, but they are still sensitive to error correction performance and weak against privacy and security threats. The key generation methods instead concentrate on generating a key directly from biometrics in spite of the high intra-user variability. The proposed key generation methods may either apply user-specific quantization schemes [9] or secure sketches and fuzzy extractors [11][12], which are inspired by [10]. The main limitation of these methods is that it is difficult to generate keys with high stability and entropy. In order to solve the problems mentioned above, the use of statistical features of biometric data has been proposed, considering data clusters or distributions of features [9][13][14][15][16]. These methods focus on producing exact values of unordered sets using quantization schemes. Modeling a person with his most robust features assists in separating users, in the sense that biometric keys produced by the same user are the same while those produced by different users are different. Note that implementing such an approach is problematic due to the difficulties in modeling a person and quantizing the feature space reliably. In this work, a secure salting-based template protection mechanism is proposed to produce biometric identity codes of a person. The biometric identity code generation scheme is based on statistical analysis of fingerprint features. Here a normal densities based linear classifier is applied to extract representative information about the user. The representative data associated with each user is the mean vector of the normally distributed features in the training set. The mean vector is mapped onto a quantized space to form the identity codes.
The information related to the quantized bin is then converted to a binary identity code by applying a projection onto multiple user-specific random subspaces. For the sake of simplicity, a simple spectral feature extraction algorithm is used on 750 × 10 fingerprint images after an image enhancement and registration algorithm is applied. Compared to traditional biometric authentication systems, the proposed method presents better accuracy on a 750-subject population after a strengthening algorithm is applied.
2 The Proposed Template Protection Mechanism
The proposed system is comprised of two phases: enrollment and verification (see Fig. 1). In the enrollment phase the strengthened identity codes (hI), which are extracted from the biometric features of user I, are encrypted with a private key k. The strengthening function utilizes a randomly generated array (RI), associated specifically with user I, to hash the corresponding identity code (ρI). Here, the encrypted hI and RI are stored in the smart card. In verification, a query identity code is extracted from the acquired image and hashed with the same RI to obtain h′I. Finally, the decision is made by comparing the Hamming distance between hI and h′I. The key methods used in the proposed scheme basically focus on two procedures:
Fig. 1. The Proposed Method
1. Generating Identity Codes: The proposed method concentrates on describing a user in a high-dimensional space by using the parameters of the distribution of the training features that belong to the corresponding user. The method is based on deriving nearly optimal quantization of the feature space via a normal densities based linear classifier. In order to generate identity codes, a spectral feature extraction algorithm is applied to enhanced and
aligned fingerprint images. The extracted features of each person are then modeled by a Gaussian after the outliers are eliminated. Meanwhile, scalar quantization is applied to the feature space to define the codewords associated with the cells (bins in the quantized space). The mean vector of each distribution is then mapped to the nearest codeword; these codewords are named identity codes. This method provides good recognition performance since it uses multiple instances of biometric traits, which decreases the intra-class variation and increases the inter-class variation.
2. Strengthening Identity Codes: The generated codes are then strengthened with a user-specific randomly generated key. Here, random multispace quantization [3] is applied to project the identity codes onto a set of user-specific orthogonal random spaces, and then a bit sequence is obtained by thresholding the values. The resulting sequence is accepted as the strengthened identity of the user and provides revocability of the biometric trait, since a compromised template can be canceled by just changing the key.
2.1 Generating Identity Codes
The identity code generation module consists of the following three submodules:
A. Fingerprint Processing and Feature Extraction: The feature extraction method used in this paper is based on the Fourier transform, which takes advantage of the fact that distinguishing characteristics of a fingerprint among wedges and rings (such as ridge orientation and minutiae) manifest themselves as small deviations from the dominant spatial frequency of the ridges [17]. Here the harmonics of each individual sectoral region are accumulated (Fig. 2). The resulting feature vector χ = [ν11, ν12, ..., ν1R, ν21, ..., ν2R, ..., νWR] denotes the power spectrum in each sector, where W = 12 and R = 12 (the numbers of wedges and rings). There are thus W × R = 144 sectors on the frequency image. Each νij is computed as νij(r, Φ) = Σ_{k1} Σ_{k2} |F(k1, k2)|², where r ≤ (k1² + k2²)^{1/2} ≤ r + Δr and Φ ≤ arctan(k2/k1) ≤ Φ + ΔΦ. Here, Δr and ΔΦ represent the sector thickness and segment width, respectively. This method represents a global description of a fingerprint and therefore seems to be fairly robust to small changes in fingerprints and to problems due to damaged or low-quality fingertips. Note that this type of global representation may lose most of the available spatial information while decreasing the computational cost. Moreover, since minutiae- or ridge-based features highly depend on image quality, such spectral methods have become useful.

Fig. 2. The Sector Geometry

B. Extracting User Model Parameters: After extracting features, a normal densities based linear classifier (NDLC) is applied to find the conditional probability densities of each class. Suppose that the measurement vectors coming from an object with class ωI, here associated with each user, are normally distributed with mean vector μI and covariance matrix CI. Note that NDLC is a special case of the quadratic classifier in which the covariance matrices do not depend on the classes, i.e., CI = C for all ωI. In the case of class independency the classifier is formulated as

i = argmax_{k=1,...,K} { 2 ln P(ωk) − (z − μk)ᵀ C⁻¹ (z − μk) }.

Note that in this approach all the class-specific information is contained in the class-dependent mean vectors μI. Thus, these class-dependent vectors μI are used for identifying each person. Note that for any N-dimensional data the μI are N-dimensional as well and present N-dimensional identity features. In order to increase accuracy, the outliers which are not in the neighborhood of the center of the feature scatter associated with each user are eliminated. To realize this, the Euclidean distances between the center of the scatter and the samples are compared. If the measured distance is higher than a threshold value, the corresponding sample is accepted as an outlier. The outliers are discarded from the training data and a new model is created with the rest of the samples, yielding a more robust model.
C. Computing the Identity Codes: After determining the class-dependent vectors in the N-dimensional space, each dimension of the space is partitioned into M cells, resulting in M^N cells in total and forming the quantized discrete domain. The discrete domain forms the codeword space C and each cell is coded with an integer (Fig. 3). For generating the identity code of user I, the corresponding μI is mapped onto the codeword space C by the mapping function ρI = f(μI), where ρI is an integer array of length N. Note that f(·) is a simple function that returns the corresponding discrete midpoint of the quantized cell in which the user model lies. The resulting ρI is now used as the identity code of person I. In order to generate long codes the space must be partitioned into tiny cells. However, in this case the reliability of the generated code will correspondingly decrease due to overlapping class-dependent vectors that fall into the same cell. For instance, in Fig. 3, a 2-dimensional 4-class case is depicted. Here, the resulting user identities are ρ1 = [1, 3], ρ2 = [3, 2], ρ3 = [1, 1], and ρ4 = [4, 1] for the classes denoted with the signs ∗, ◦, + and ×, respectively.

Fig. 3. Clusters Used to Generate User Identity Codes
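To make submodules A–C concrete, the following sketch shows how sectoral power-spectrum features and a quantized identity code could be computed. It is a minimal illustration, not the authors' implementation: the image enhancement, registration and outlier-elimination steps are omitted, the per-dimension bounds lo/hi are assumed to be estimated on the training population, and cell indices are returned instead of cell midpoints (the two differ only by a fixed affine mapping).

```python
import numpy as np

W, R, M = 12, 12, 40  # wedges, rings, quantization levels (M = 40 per the paper)

def sector_features(img):
    """Accumulate the power spectrum |F(k1,k2)|^2 over W x R polar sectors."""
    F = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(F) ** 2
    h, w = img.shape
    ky, kx = np.indices((h, w))
    ky, kx = ky - h // 2, kx - w // 2
    radius = np.hypot(kx, ky)
    angle = np.mod(np.arctan2(ky, kx), 2 * np.pi)
    r_edges = np.linspace(0, radius.max() + 1e-9, R + 1)
    a_edges = np.linspace(0, 2 * np.pi, W + 1)
    feats = np.empty(W * R)
    for i in range(W):
        for j in range(R):
            mask = ((angle >= a_edges[i]) & (angle < a_edges[i + 1]) &
                    (radius >= r_edges[j]) & (radius < r_edges[j + 1]))
            feats[i * R + j] = power[mask].sum()
    return feats

def identity_code(train_images, lo, hi):
    """Map the mean feature vector of a user onto the quantized codeword space.

    lo, hi: per-dimension bounds of the feature space (estimated on the whole
    training population); the returned cell indices play the role of rho_I.
    """
    mu = np.mean([sector_features(im) for im in train_images], axis=0)
    return np.clip(((mu - lo) / (hi - lo) * M).astype(int), 0, M - 1)

# Hypothetical usage with random data standing in for enrolment samples:
rng = np.random.default_rng(0)
samples = [rng.random((96, 96)) for _ in range(10)]
pop = np.array([sector_features(s) for s in samples])
print(identity_code(samples, pop.min(0), pop.max(0))[:8])
```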
2.2 Strengthening Identity Codes
In order to strengthen and protect the biometric feature, a salting approach [3] is applied to identity codes. In this method identity codes are transformed into a strengthened bit stream using a set of randomly generated user-specific strings. The algorithm is as follows:
Algorithm:
– Create T instances of random arrays, each of length N, for user I by using a private key kI: ΓI = {τ1^I, τ2^I, ..., τT^I}.
– Orthonormalize ΓI by applying the Gram–Schmidt method to make the τz^I linearly independent of each other, where z = 1, ..., T. The resulting orthonormalized set is denoted ΓI⊥ = {τ1^I⊥, τ2^I⊥, ..., τT^I⊥}.
– Take the inner product of ρI with each element of ΓI⊥: ΥI = {⟨τ1^I⊥, ρI⟩, ⟨τ2^I⊥, ρI⟩, ..., ⟨τT^I⊥, ρI⟩}, where ⟨·,·⟩ denotes the dot product.
– Finally, apply thresholding to each ΥI^z to convert it to binary: hI^z = 1 if ΥI^z > t, and hI^z = 0 if ΥI^z ≤ t.
This method has the advantage that the introduction of the private key kI results in low false accept rates. Moreover, codes can be revoked by just changing kI, which provides cancelability. However, if the user key is compromised, the template becomes insecure, depending on the usage of the randomly generated data ΓI. As T increases, the discrimination between the imposter and genuine distributions increases. However, T also affects the dominance of the synthetic data ΓI in the recognition process, and this may cause false accepts independent of the biometric information in the key-compromise case. The proposed system addresses this issue by adjusting the parameters of the strengthening algorithm to keep the biometric information accurate enough even if the key is compromised.
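A minimal sketch of the strengthening step follows, assuming the random multispace quantization of [3] can be approximated by seeding a pseudo-random generator with the private key kI and orthonormalizing the random directions via a QR decomposition (equivalent to Gram–Schmidt). The parameters T, the binarization threshold t and the Hamming acceptance margin are illustrative, not the authors' settings.

```python
import numpy as np

def strengthen(rho, key, T, t=0.0):
    """Project the identity code onto T random orthonormal directions and binarize.

    rho : identity code (length-N vector)
    key : user-specific private key k_I, used here to seed the PRNG
    T   : number of random directions (keeping T < N makes the projection non-invertible)
    t   : binarization threshold
    """
    rng = np.random.default_rng(key)
    gamma = rng.standard_normal((T, len(rho)))     # Gamma_I
    q, _ = np.linalg.qr(gamma.T)                   # orthonormalization (Gram-Schmidt equivalent)
    gamma_orth = q[:, :T].T                        # orthonormal rows tau_z^I
    upsilon = gamma_orth @ np.asarray(rho, float)  # inner products Upsilon_I
    return (upsilon > t).astype(np.uint8)          # strengthened code h_I

def verify(h_enrolled, h_query, max_hamming):
    """Accept when the Hamming distance between the bit strings is small enough."""
    return int(np.sum(h_enrolled != h_query)) <= max_hamming

# Hypothetical example: revoking a compromised template only needs a new key.
rho = np.arange(144) % 40
h_old = strengthen(rho, key=1234, T=100)
h_new = strengthen(rho, key=5678, T=100)     # re-issued template after revocation
print(verify(h_old, h_new, max_hamming=10))  # typically False: old and new codes differ
```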
3 Experimental Results and Discussions
The experimental fingerprint data were collected during a pilot smartcard-based electronic identity management project of the Turkish Government for the purpose of research and development (MTRD database). The overall database consists of 750 fingerprints × 10 samples per finger, resulting in 7500 images of size 96 × 96 pixels, acquired with the Authentec 8600 sensor. The performance of the proposed scheme is reported in terms of the Equal Error Rate (EER), i.e. the operating point where the False Acceptance Rate (FAR) and the False Rejection Rate (FRR) are equal. To assess the accuracy of the identity codes, matching is implemented by measuring the Euclidean distance between each ρI and the query identities ρJ (I ≠ J and I, J = 1, ..., P, with P the number of people in the population). Similarly, the recognition performance of the strengthened codes is evaluated by computing the Hamming distances among the hI. For the genuine population each test image of a user is matched against the identity code of the same user, leading to 750 × 5 = 3750 comparisons. For the imposter tests, each ρI is matched against all other identity codes, leading to 750 × 749 × 5 = 2808750 comparisons. The accuracy of pure identity codes depends on the number of quantized bins (M). The proposed system should be optimized for the best value of M, because the EER increases for tiny or large N-dimensional cells (N = 144) due to internal or external fragmentation of the clusters. Moreover, the quantization affects the strengthening performance, since it depends on the statistical method of identity code generation. In order to evaluate the strengthening and recognition performance, EERs for various bit
Table 1. Strengthening Performance w.r.t. Bit String Length (T) where M = 50

String Length (T)   Strengthened Codes EER (%)   Attack Scenario EER (%)
 40                 14.80                        20.51
 50                  9.93                        15.42
 60                  6.34                        11.98
 70                  4.46                         9.91
 80                  3.22                         8.83
 90                  2.30                         7.81
100                  1.91                         7.31

Table 2. Strengthening Performance w.r.t. Various Quantization Levels (M) where T = 100

Quantization Level (M)   Pure Codes EER (%)   Strengthened Codes EER (%)   Attack Scenario EER (%)
20                       4.24                 2.38                         7.89
30                       4.20                 2.26                         7.65
40                       3.89                 1.87                         7.16
50                       3.96                 1.91                         7.31
60                       4.11                 2.11                         7.40
lengths (T) are presented together with the performance in the case where a malicious user has somehow stolen the private key and the user-specific randomly generated arrays (attack scenario). According to the EERs in Table 1, where M = 50, the proposed system performs better for longer bit strings as the amount of randomly generated information (ΓI) increases. However, the outstanding discrimination performance observed here also brings the danger of the random information dominating the authentication process due to the entropy gained from the random arrays. This may diminish the discrimination capability of the biometric templates, such that an attacker may present his own biometric if he somehow steals RI or kI. The proposed system should still be accurate enough to perform well when the synthetic data is stolen (attack scenario). In Table 2, the EERs (where T = 100) are presented for three cases: (1) only identity codes are compared (no biohashing), (2) strengthened codes are compared, and (3) the attack scenario, in which the attacker has somehow stolen all the private information of each person. As observed from the EERs for these three cases, where N = 144, M = 40 and T = 100, the strengthening method improves the system to 2.02% EER while the performance degradation in the attack scenario is about 5.29%. This shows that, if the system administrator adjusts the parameters N, M and T, the biometric information may still carry sufficient significance compared to the synthetically generated random data. The EERs presented in Table 2 show that the performance degrades for small or large values of the quantization level. This is a common problem of quantization. For tiny cells, where M is large, the cluster centers are falsely mapped to neighboring cells, resulting in wrong code representations. On the other hand, for large cells, where M is small, multiple cluster centers associated with different users might be mapped to the same quantized bin. In both cases the performance degrades. Finding an optimum value of M is an engineering problem, and M = 40 is proposed in this study. Security Analysis: Note that recovering ρI from the inner products of the biometric code ρI and the random vectors τz^I is an intractable problem if 1 < T < length(ρI), even if ΓI and kI are known. The iterative inner products form a system of T equations where (τi^I⊥) ⊥ (τj^I⊥) for i ≠ j and τi^I ≠ τj^I. Since there are length(ρI) unknowns and only T < length(ρI)
equations, the system has an infinite number of solutions. Moreover, since τi^I ≠ τj^I, every bit hI^i depends on all of ρI, so each change in the resulting bit sequence depends on the entire input biometric. Note that the discretization process quantizes the randomly projected vector space into 2^T regions and hence is an irreversible process.
4 Conclusion
In this study, the spectral fingerprint features of a person are used to generate identity codes by applying a normal densities based linear classifier. The identity codes are strengthened by a set of orthonormal arrays, and finally a biohash is extracted after thresholding. This method provides cancelability of biometric templates and good discrimination between genuine and imposter distributions. The main advantage of the system is that it makes use of statistical features of the training data, giving a chance to eliminate outlier data in the enrollment stage and thereby increase reliability. In future work, the strengthened identity codes might be used to extract secure sketches to increase the security and privacy of the user. Moreover, more reliable fingerprint features might be used, and multimodal biometric schemes can be considered as well.
References
1. Jain, A.K., Nandakumar, K., Nagar, A.: Biometric Template Security. EURASIP Jour. on Adv. in Sig. Proc., Special Issue on Biometrics (2008)
2. Uludag, U., Pankanti, S., Prabhakar, S., Jain, A.K.: Biometric Cryptosystems: Issues and Challenges 92(6), 948–960 (2004)
3. Teoh, A.B.J., Goh, A., Ngo, D.C.L.: Random Multispace Quantization as an Analytic Mechanism for Biohashing of Biometric and Random Identity Inputs. IEEE Trans. on Pat. An. & Mach. Intelligence 28(12), 1892–1901 (2006)
4. Kanak, A., Soğukpınar, İ.: Fingerprint Hardening with Randomly Selected Chaff Minutiae. In: Kropatsch, W.G., Kampel, M., Hanbury, A. (eds.) CAIP 2007. LNCS, vol. 4673, pp. 383–390. Springer, Heidelberg (2007)
5. Sutcu, Y., Sencar, H.T., Memon, N.: A Secure Biometric Authentication Scheme Based on Robust Hashing. In: Conf. MM-SEC, New York, USA (2005)
6. Savvides, M., Vijaya Kumar, B.V.K., Khosla, P.K.: Cancelable Biometric Filters for Face Recognition. In: Proc. of Int. Conf. on Pattern Rec., pp. 922–925 (2004)
7. Ratha, N.K., Chikkerur, S., Connell, J.H., Bolle, R.M.: Generating Cancelable Fingerprint Templates. IEEE Trans. on Pat. An. & Mac. Intelligence 29(4), 561–572 (2007)
8. Juels, A., Wattenberg, M.: A Fuzzy Commitment Scheme. In: Tsudik, G. (ed.) Proc. ACM Conf. Comp. & Comm. Security, pp. 28–36 (1999)
9. Chang, Y.J., Zhang, W., Chen, T.: Biometric-based Cryptographic Key Generation. In: Int. Conf. on Multimedia and Expo (2004)
10. Dodis, Y., Reyzin, L., Smith, A.: Fuzzy Extractors: How to Generate Strong Keys from Biometrics and Other Noisy Data. In: Cachin, C., Camenisch, J.L. (eds.) EUROCRYPT 2004. LNCS, vol. 3027, pp. 523–540. Springer, Heidelberg (2004)
11. Sutcu, Y., Li, Q., Memon, N.: Protecting Biometric Templates with Sketch: Theory and Practice. IEEE Trans. on Inf. Forensics & Security 2(3), 502–512 (2007)
12. Buhan, I.R., Doumen, J.M., Hartel, P.H., Veldhuis, R.N.J.: Fuzzy Extractors for Continuous Distributions. In: Proc. of ACM Symp. on Inf., Comp. & Comm. Security, Singapore, pp. 353–355 (2007)
13. Fairhust, M., Hoque, S., Howells, W.G.J., Deravi, F.: Evaluating Biometric Encryption Key Generation. In: Proc. of 3rd Cost 275 Workshop Biometrics Internet, pp. 93–96 (2005)
14. Yamazaki, Y., Komatsu, N.: A Secure Communication System Using Biometric Identity Verification. IEICE Trans. Inf. Syst. E84-D(7), 879–884 (2001)
15. Sheng, W., Howells, G., Fairhust, M., Deravi, F.: Template-Free Biometric Key Generation by Means of Fuzzy Genetic Clustering. IEEE Trans. Inf. Forensics & Security 3(2), 183–191 (2008)
16. Lee, Y.J., Park, K.R., Lee, S.J., Bae, K., Kim, J.: A New Method for Generating an Invariant Iris Private Key Based on the Fuzzy Vault System. IEEE Trans. on Sys., Man. & Cyber., Part B 38(5), 1302–1313 (2008)
17. Willis, A.J., Myers, L.: A Cost-Effective Fingerprint Recognition System for Use with Low-Quality Prints and Damaged Fingertips. Jour. Pattern Recog. 34, 255–270 (2001)
Vulnerability Assessment of Fingerprint Matching Based on Time Analysis
Javier Galbally, Sara Carballo, Julian Fierrez, and Javier Ortega-Garcia
Biometric Recognition Group–ATVS, EPS, Universidad Autonoma de Madrid, C/ Francisco Tomas y Valiente 11, 28049 Madrid, Spain
{javier.galbally,sara.carballo,julian.fierrez,javier.ortega}@uam.es
Abstract. A time analysis of a reference minutiae-based fingerprint matching system is presented. We study the relation between the score generated by the system (NFIS2 from NIST) and the time required to produce the matching score. Experiments are carried out on a subcorpus of the MCYT database and show a clear correlation between both matching variables (time and score). Thus, a new threat against biometric systems arises, as attacks based on the matching score could be largely simplified if the time information is used instead.
1 Introduction
In recent years important research efforts have been conducted to study the vulnerabilities of biometric systems to direct attacks on the sensor (carried out using synthetic biometric traits such as gummy fingers) [1], and to indirect attacks (carried out against some of the inner modules of the system) [2,3]. These research efforts have led to an enhancement of the security level offered by biometric systems through the proposal of new countermeasures to the attacks analyzed. Furthermore, the interest in the analysis of security vulnerabilities has surpassed the scientific field, and different standardization initiatives at the international level have emerged in order to deal with the problem of security evaluation in biometric systems, such as the Common Criteria through different Supporting Documents [4], or the Biometric Evaluation Methodology [5]. Within the vulnerabilities that have been studied, special attention has been paid to hill-climbing attacks [6,7,8]. These attacking algorithms produce a number of synthetic templates which are iteratively modified according to the score they produce: if the score increases the changes are kept, otherwise the modifications are discarded. This way the score rises until the acceptance threshold is reached and the system is broken. Although hill-climbing attacks have proven their efficiency against biometric systems, they still present the strong restriction of needing the score produced by the matcher to be able to break the system. Even if the attacker is able to access the similarity measure (which is not always the case), the attack can still be countered by quantizing the score so that the hill-climbing algorithm does not get the necessary feedback to iteratively increase the similarity measure until the threshold is reached.
A bigger threat to biometric systems would arise if they could be attacked using some easily measurable information such as the matching time, or the power consumed by the system in the matching process. This type of information, which has already been used to successfully attack cryptographic security systems [9,10], is always accessible to a possible attacker and difficult for the system designer to manipulate or distort (in contrast to the similarity score used in traditional hill-climbing algorithms). In the present work we carry out a time analysis of a reference fingerprint-based recognition system (NFIS2 from NIST [11]), in order to determine whether there exists a relation between the similarity score produced by the matching module and the time required to generate that score. Such a study makes it possible to determine the feasibility of developing attacks against this type of system based on the time information, and whether this threat needs to be taken into account not only when designing biometric systems, but also when evaluating their level of security. The rest of the paper is structured as follows. In Sect. 2 the system analyzed in the experiments is described. The database used in the experiments and the performance evaluation of the studied system are presented in Sect. 3. Experimental results of the temporal analysis are given in Sect. 4, and conclusions are finally drawn in Sect. 5.
2 Reference System Analyzed (NFIS2)
The system analyzed in the experiments is the minutiae-based NIST Fingerprint Image Software 2 (NFIS2) [11]. This publicly available software is used in many works as a reference system with which to compare new fingerprint verification solutions. It is a PC-based fingerprint processing and recognition system formed of independent software modules. The feature extractor generates a text file containing the location (x and y coordinates) and orientation (angle with respect to the positive x axis) of each minutia from the fingerprint. The matcher uses this file to generate the score. The matching algorithm is rotation and translation invariant since it computes only relative distances and orientations between groups of minutiae.
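The following sketch only illustrates the kind of minutiae template the matcher consumes (location and orientation per minutia). The file layout used here, one whitespace-separated "x y angle" triple per line, is an assumption made for the example and does not reproduce the actual NFIS2 output format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Minutia:
    x: int        # column coordinate
    y: int        # row coordinate
    angle: float  # orientation w.r.t. the positive x axis, in degrees

def load_minutiae(path: str) -> List[Minutia]:
    """Read a simplified minutiae listing (one 'x y angle' triple per line).

    Illustrative only; NOT the real NFIS2 template format.
    """
    minutiae = []
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) >= 3:
                minutiae.append(Minutia(int(parts[0]), int(parts[1]), float(parts[2])))
    return minutiae
```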
3 Database and Performance Evaluation
The temporal analysis has been carried out using a subcorpus of the MCYT database [12]. The subcorpus comprises 10 impressions of the right and left index fingers of 75 users (75 × 2 × 10 = 1, 500 images), captured electronically with the optical sensor UareU from Digital Persona (500 dpi, 256 × 400 images). Six of the samples of each finger were acquired with a high control level (small rotation or displacement of the finger core from the center of the sensor was permitted), another two with a medium control level, and the remaining two with low control level (see Fig. 1 for examples of fingerprint images).
Fig. 1. Examples of typical fingerprint images that can be found in MCYT acquired with a low (left), medium (center), and high degree of control (right)
Fig. 2. Score distributions (left), and FA and FR curves (right), for the MoC system
In order to estimate the verification performance of the system, a set of genuine and impostor scores was computed. We used one of the low-control samples as a template and the other 9 samples from the same finger as probes to test genuine matches, leading to 150 × 9 = 1,350 genuine user scores. Impostor scores are obtained by comparing each template to one sample from each of the remaining fingers of the database, so we have 150 × 149 = 22,350 impostor scores. In Fig. 2 we show the two score distributions (left), and the FA and FR curves of the evaluated system (right). The EER of the system is 1.47%.
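A simplified stand-in for this comparison protocol is sketched below; the finger identifiers and sample lists are hypothetical, but the construction reproduces the 150 × 9 genuine and 150 × 149 impostor trial counts.

```python
import itertools

def build_trials(fingers, n_probes=9):
    """Enumerate genuine and impostor comparisons.

    fingers: dict mapping a finger id to its list of 10 sample ids; the first
    sample is taken as template, samples 1..n_probes as genuine probes
    (a simplified stand-in for the actual MCYT protocol described above).
    """
    genuine, impostor = [], []
    for fid, samples in fingers.items():
        template = samples[0]
        genuine += [(template, probe) for probe in samples[1:1 + n_probes]]
    for fid, other in itertools.permutations(fingers, 2):
        impostor.append((fingers[fid][0], fingers[other][1]))  # one probe per pair
    return genuine, impostor

# With 150 fingers this yields 150*9 = 1350 genuine and 150*149 = 22350 impostor trials.
demo = {f: [f"{f}_{i}" for i in range(10)] for f in range(150)}
g, i = build_trials(demo)
print(len(g), len(i))
```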
4 Time Analysis
The objective of the experimental study is to determine if there exists a correlation between the score given by the analyzed system and the matching time required to produce that score. In order to reach this goal, two experiments have been carried out: the first one to find out if there is a correspondence between score and time, and the second one to identify the behavior of the matching time when the score increases or decreases.
Fig. 3. Division of the total range of scores ST used in experiment 1
4.1 Experiment 1: Relation between Score and Time
As explained in Sect. 3, a set of genuine and impostor scores was generated with the database used, comprising 1,350 and 22,350 similarity measures respectively. Each of these scores has an associated time value corresponding to the matching time used by the system to produce that score. Thus, there are a total of 23,700 scores (irrespective of genuine or impostor) and the same number of time values. In this experiment, the whole range of scores (ST) is divided into ten equally spaced bands (corresponding to different percentages of the total range), as shown in Fig. 3. The distributions of the time values associated with the scores falling in those bands are plotted in Fig. 4, from the most distant regions (the time distributions corresponding to the first and last 10% of ST) to the closest bands (time distributions corresponding to the scores in bands [40-50]% and [50-60]%). This way we are able to determine whether more distant score bands correspond to more separated time distributions. No distinction is made between impostor and genuine scores, as we want to find the relation between a certain score s and its associated time value ts, regardless of whether that score has been produced by a genuine or impostor fingerprint. In Fig. 4 we can see that the degree of overlap of the time distributions increases through plots (a) to (e). That is, the time distributions corresponding to scores that belong to more distant bands are more separated than those corresponding to scores of closer bands. Furthermore, in all the plots, the distribution corresponding to the higher band (drawn with a dashed line) has a higher mean value than that corresponding to the lower band of scores (drawn with a plain line). These observations suggest that bigger scores produce higher time values.
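The band-wise grouping of experiment 1 can be sketched as follows; the synthetic scores and times in the usage example are placeholders, since the real values come from the NFIS2 matcher.

```python
import numpy as np

def time_distributions_by_band(scores, times, n_bands=10):
    """Group matching times by the 10%-band of the total score range in which
    their associated score falls (experiment 1)."""
    scores, times = np.asarray(scores, float), np.asarray(times, float)
    edges = np.linspace(scores.min(), scores.max(), n_bands + 1)
    band = np.clip(np.digitize(scores, edges) - 1, 0, n_bands - 1)
    return [times[band == b] for b in range(n_bands)]

# Hypothetical usage: compare, e.g., the lowest and highest bands.
rng = np.random.default_rng(1)
s = rng.uniform(0, 350, 23700)
t = 0.5 + 0.003 * s + rng.normal(0, 0.2, s.size)   # synthetic (score, time) pairs
bands = time_distributions_by_band(s, t)
print(bands[0].mean(), bands[-1].mean())
```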
4.2 Experiment 2: Relation between Score and Time Variations
This second experiment has a twofold objective: first, to confirm the conclusions extracted from the previous experiment (i.e., that higher time values correspond to higher scores), and second, to determine whether there exists a correlation between a variation in the score and a variation in the time needed to produce that score.
Fig. 4. Distributions of the time values associated with the scores corresponding to: (a) bands [0-10]% and [90-100]%, (b) bands [10-20]% and [80-90]%, (c) bands [20-30]% and [70-80]%, (d) bands [30-40]% and [60-70]%, and (e) bands [40-50]% and [50-60]%
With this purpose, 50 templates corresponding to 50 different users of the database were slowly degraded following an iterative process. At each iteration of the algorithm one of two random modifications, A) perturbing a minutia or B) substituting a minutia, is applied to the fingerprint, and the modified template is matched against the original one while the time needed to produce the score is measured. The two modifications A and B were taken from [2]. The fact that only perturbing or substituting a minutia is permitted (and not other modifications such as adding or deleting a minutia) assures that the number of minutiae in
the template remains constant through the iterations and that the changes in the matching time are not due to variations in the number of singular points. This way, in the first iteration of the process the fingerprint is matched against itself, reaching the highest possible score for that template. Through the remaining iterations, until reaching the maximum permitted (300), the score slowly decreases as a result of the changes introduced in the template. At the end of the experiment we have 50 sets of 300 matching scores (corresponding to the evolution of the score for each of the 50 fingerprints), and their associated 50 sets of 300 time values. In Fig. 5 we show the evolution of the score (black) and the time (grey) for four of the cases studied. As the score and time ranges are different, the score scale is depicted on the left of each plot (also in black), and the time scale on the right (in grey). In the top row of Fig. 5 we have depicted two cases in which a clear relation between the score and the time can be seen: the higher the score, the higher the time needed by the system to generate it. However, in the two examples of the bottom row that relation cannot be observed, and score and time seem to be totally uncorrelated. In order to study the behavior of the system from a statistical point of view, and not for each particular case, the mean of the 50 score sets (with 300 matching scores each) and of the corresponding 50 time sets were computed. Both means are shown in Fig. 6 in an analogous way to the one used in the plots of Fig. 5.

Fig. 5. Evolution of the score (black) and the time (grey) for four of the fingerprints used in experiment 2
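A sketch of the degradation loop of experiment 2 is given below, with the matcher abstracted as a callable returning a score; the perturbation magnitudes and image size are assumptions made for the example, and timing uses a generic high-resolution clock rather than the measurement setup of the actual experiments.

```python
import random
import time

def degrade_and_time(template, matcher, iterations=300, img_size=(256, 400)):
    """Iteratively perturb or substitute one random minutia and record
    (score, matching time) at every step; `matcher(probe, template)` is an
    abstract callable standing in for the evaluated matching module."""
    probe = [list(m) for m in template]          # start from the template itself
    history = []
    for _ in range(iterations):
        i = random.randrange(len(probe))
        if random.random() < 0.5:                # A) perturb a minutia slightly
            probe[i][0] += random.randint(-5, 5)
            probe[i][1] += random.randint(-5, 5)
        else:                                    # B) substitute it by a random one
            probe[i] = [random.randrange(img_size[0]),
                        random.randrange(img_size[1]),
                        random.uniform(0, 360)]
        start = time.perf_counter()
        score = matcher(probe, template)
        history.append((score, time.perf_counter() - start))
    return history
```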
Fig. 6. Mean of the score (black) and the time (grey) evolution for all the 50 fingerprints considered in experiment 2
In the score and time evolution depicted in Fig. 6 we can distinguish two zones (separated by a vertical dashed line) where the system presents different behaviors. On the left of the vertical dashed line we can see that there is a clear correlation between score and time variations: an increase in the score causes a rise in the time (on average). However, on the right of the separating line (once the score has fallen below approximately 30), time and score seem to be uncorrelated, and while the score keeps decreasing the time fluctuates around a constant value.
5 Conclusions
A time analysis of a reference fingerprint-based verification system (NFIS2 from NIST) has been presented. Experiments were carried out on a subcorpus of the publicly available MCYT database and show that there exists a clear correlation between the score given by the system and the time needed to produce that score. These findings reveal a new type of vulnerability of biometric systems, as the matching time (easy to measure) could be used to simplify attacks initially designed to exploit the matching scores (not always easy to access). The present work might be of special interest not only for designers (in order to include the necessary countermeasures), but also for evaluators of security systems (in order to take into account this new vulnerability), and for those institutions developing evaluation standards such as the Common Criteria [4], or the associated Biometric Evaluation Methodology [5].
Acknowledgements J. G. is supported by a FPU Fellowship from Spanish MEC and J. F. is supported by a Marie Curie Fellowship from the European Commission. This work was supported by Spanish MEC under project TEC2006-13141-C03-03.
References
1. Galbally, J., Fierrez, J., et al.: On the vulnerability of fingerprint verification systems to fake fingerprint attacks. In: Proc. IEEE International Carnahan Conference on Security Technology (ICCST), pp. 130–136 (2006)
2. Uludag, U., Jain, A.K.: Attacks on biometric systems: a case study in fingerprints. In: Proc. SPIE-IE, vol. 5306, pp. 622–633 (2004)
3. Hill, C.J.: Risk of masquerade arising from the storage of biometrics. B.S. Thesis, Australian National University (2001)
4. CC: Common Criteria for Information Technology Security Evaluation, v3.1 (2006)
5. BEM: Biometric Evaluation Methodology, v1.0 (2002)
6. Adler, A.: Sample images can be independently restored from face recognition templates. In: Proc. Canadian Conference on Electrical and Computer Engineering (CCECE), vol. 2, pp. 1163–1166 (2003)
7. Martinez-Diaz, M., Fierrez, J., et al.: Hill-climbing and brute force attacks on biometric systems: a case study in match-on-card fingerprint verification. In: Proc. IEEE International Carnahan Conference on Security Technology (ICCST), pp. 151–159 (2006)
8. Galbally, J., Fierrez, J., Ortega-Garcia, J.: Bayesian hill-climbing attack and its application to signature verification. In: Lee, S.-W., Li, S.Z. (eds.) ICB 2007. LNCS, vol. 4642, pp. 386–395. Springer, Heidelberg (2007)
9. Kocher, P.C.: Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems. In: Koblitz, N. (ed.) CRYPTO 1996. LNCS, vol. 1109, pp. 104–113. Springer, Heidelberg (1996)
10. Kocher, P.C., Jaffe, J., Jun, B.: Differential power analysis. In: Wiener, M. (ed.) CRYPTO 1999. LNCS, vol. 1666, p. 388. Springer, Heidelberg (1999)
11. Watson, G.I., Garris, M.D., et al.: User's guide to NIST Fingerprint Image Software 2 (NFIS2). National Institute of Standards and Technology (2004)
12. Ortega-Garcia, J., Fierrez-Aguilar, J., et al.: MCYT baseline corpus: a bimodal biometric database. IEE Proc. Vision, Image and Signal Processing 150, 391–401 (2003)
A Matching Algorithm Secure against the Wolf Attack in Biometric Authentication Systems
Yoshihiro Kojima1, Rie Shigetomi1,2, Manabu Inuma1,2, Akira Otsuka1,2, and Hideki Imai1,2
1 Graduate School of Science and Engineering, Chuo University, 1-13-27 Kasuga, Bunkyo-ku, Tokyo 112-8551, Japan, [email protected]
2 Research Center for Information Security (RCIS), National Institute of Advanced Industrial Science and Technology (AIST), Akihabara-Daibiru Room 1003, 1-18-13 Sotokanda, Chiyoda-ku, Tokyo 101-0021, Japan
Abstract. We propose a matching algorithm secure against the wolf attack in one-to-one biometric authentications. Our proposed algorithm embeds a wolf-judgement function in a traditional matching algorithm. We show that our proposed algorithm is accurate and secure. Moreover we remark that our proposed algorithm is efficient in a framework proposed by Inuma, Otsuka, and Imai [1].
1 Introduction
Recently, the use of biometric authentication systems has spread in various services, and it is therefore increasingly important to explicitly evaluate the security of biometric authentication systems. In this paper, we consider the evaluation of security against intentional impersonation attacks on biometric authentication systems. We focus on one-to-one biometric authentication systems. The false acceptance rate (FAR) in a biometric verification system is the probability that a user claiming a wrong identity will be incorrectly accepted. The false rejection rate (FRR) is the probability that a user claiming a correct identity will be rejected. FAR and FRR are traditional measures to evaluate the recognition accuracy of the system. FAR can also be considered as a measure of security against the zero-effort impersonation attack. The zero-effort attack means that the attacker presents his own biometric characteristic to the sensor of the system. However, Une, Otsuka, and Imai [2] showed that for some biometric authentication algorithms there exists an input value which shows high similarity to many enrolled biometric templates. We call such an input value a "wolf". If the attacker finds a wolf sample, then he can succeed in impersonating many genuine users with a higher probability than FAR by presenting the wolf sample to the system. The wolf attack is defined as an attempt to impersonate a genuine user by presenting a wolf, where the attacker has no information about a biometric sample of the genuine user to be impersonated. The wolf attack probability (WAP) is defined as the maximum success probability of the wolf attack with one wolf sample.
Inuma, Otsuka, and Imai [1] presented a framework for the construction of a secure matching algorithm against the wolf attack on biometric authentication systems. They proposed an (ideal) matching algorithm which determines a threshold for each input value from the distribution of the comparison scores with all individual templates. Moreover, they showed that if the information about the distribution for each input value is perfectly given, then the proposed matching algorithm is secure against the wolf attack. However, in the real world, it might be very hard to precisely calculate the distribution of the comparison scores over the random variations of all individual templates. More precise calculation of the distribution makes WAP lower; namely, there exists a trade-off between the achievable WAP and the efficiency. Almost every conventional matching algorithm employs a fixed threshold for all input values. This is secure under the assumption that the distributions for all input values are the same. However, this assumption is not practical, since human biometric features are biasedly distributed. Therefore, a wolf sample can easily be found for almost every matching algorithm employing a fixed threshold, and hence such algorithms are not secure against the wolf attack. Actually, as far as we know, there still exists no efficient and secure matching algorithm against the wolf attack. The purpose of this paper is to construct an efficient and secure matching algorithm. We embed a wolf-judgement function WOLF in a traditional one-to-one matching algorithm. Let M be a finite set and let U be the set of all valid users. For each user u ∈ U, the biometric features extracted from a biometric characteristic of u are represented as an element of M. Assume that the traditional one-to-one matching algorithm employs a decision algorithm dec : M × M → {sim, dissim} defined as follows: for any input value s ∈ M and any template t ∈ M, if the algorithm dec decides that s and t are generated from the same biometric characteristic (of a user), then dec outputs sim, otherwise it outputs dissim. Our proposed matching algorithm prop-match is defined as follows. Let WOLF be a function which, for any s ∈ M and any user v ∈ U, outputs the number of valid users w such that w ≠ v and dec(s, tw) = sim, where tw is the stored template of w. First, a user u claims the identity id and presents his/her biometric sample b. For an input value s generated from b and the template t stored with id, if dec(s, t) = dissim, then prop-match outputs "reject". Otherwise prop-match calculates WOLF(s, t). If WOLF(s, t) ≤ T − 1, then prop-match outputs "accept", and otherwise prop-match outputs "reject". For an appropriately chosen T, our proposed algorithm can successfully reject all suspicious samples which show similarity to T or more templates and are therefore (probably) wolves. We show that our proposed matching algorithm is secure, namely WAP ≤ T/n, where n is the number of templates in the database of the system (Theorem 1). We also show that our proposed algorithm is accurate, namely the false rejection rate (FRR) is small enough (Theorem 2). Moreover, we remark that our proposed algorithm can be more efficient than that proposed by Inuma, Otsuka, and Imai [1].
This paper continues as follows. Section 2 describes the model of biometric authentication. Section 3 defines accuracy and security. Section 4 discusses our proposed matching algorithm and shows its accuracy and security. Section 5 summarizes our results.
2 Model of Biometric Authentication
A biometric authentication system can be used for verification or identification of individuals. In verification, a user of the system claims to have a certain identity and the biometric system performs a one-to-one comparison between the offered biometric features and the template which is linked to the claimed identity. In identification, a one-to-all comparison is performed between the offered features and all available templates stored in the database to reveal the identity of an individual. In this paper, we focus on verification systems. In this section, we define a model of a biometric verification system. Let U be the set of all valid users, and let M be a finite set. We assume that the biometric features of a user u ∈ U are represented as an element s of M, which is called a feature vector of u. In the enrollment phase, a valid user u ∈ U presents his/her own biometric characteristic bu to the sensor of the system. Assume that bu is constant for each u ∈ U. Then the system generates a feature vector t ∈ M from bu. Let ID be a finite set of the identities of all valid users. The system publishes the identity idu ∈ ID of u and stores the pair (t, idu) in the database. The identity idu is sent to u, and the user u keeps idu for later verification. We assume that the biometric authentication system employs a decision algorithm dec : M × M → {sim, dissim} defined as follows: for any offered feature vector s ∈ M and any stored feature vector t ∈ M, if the algorithm dec decides that s and t are generated from the same biometric characteristic (of a user), then dec outputs sim, otherwise it outputs dissim. A traditional verification phase is defined as follows. A user v ∈ U claims an identity idw ∈ ID and presents his/her biometric sample bv to the sensor of the system. Note that the user v might claim a wrong identity idw other than his/her own identity idv. A feature vector s ∈ M is generated from bv. The decision algorithm dec of the system compares s with the stored feature vector t associated with the identity idw and outputs a message, either sim or dissim. Then a function match : U × U → {accept, reject} is defined by

match(v, w) = accept if dec(s, t) = sim, and match(v, w) = reject if dec(s, t) = dissim.

For each user u ∈ U, let Xu be a random variable on M representing noisy versions of feature vectors of u in both the enrollment and verification phases; namely, P(Xu = s) denotes the probability that the feature vector s ∈ M will be generated from bu of u. Assume that the Xu, u ∈ U, are independent. Moreover, assume that Xu in enrollment and Xu in verification are independent for each u ∈ U. The enrollment and verification phases are depicted in Fig. 1 and Fig. 2.
Fig. 1. Enrollment phase in the traditional algorithm

Fig. 2. Verification phase in the traditional algorithm
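A minimal sketch of the traditional decision and matching functions defined above, assuming a generic similarity measure and a fixed user-independent threshold (the setting that the rest of the paper argues is vulnerable to wolves):

```python
def dec(s, t, threshold, similarity):
    """Decision algorithm: 'sim' when the similarity of s and t reaches the threshold."""
    return "sim" if similarity(s, t) >= threshold else "dissim"

def match(s, t, threshold, similarity):
    """Traditional one-to-one verification with a fixed threshold."""
    return "accept" if dec(s, t, threshold, similarity) == "sim" else "reject"

# Hypothetical usage with a toy similarity on equal-length feature vectors.
sim = lambda a, b: sum(x == y for x, y in zip(a, b)) / len(a)
print(match([1, 0, 1, 1], [1, 0, 0, 1], threshold=0.7, similarity=sim))  # accept
```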
3 Accuracy and Security
In order to explicitly evaluate the security, we will formulate three measures: FAR, WAP and FRR. First we prepare some notation. For any finite set S, let x ← S denote that x is chosen uniformly from S. For any random variable X, let x ← X denote that x is chosen according to X. For any real-valued function f and any finite set (or any random variable) Z, let Ave_{z←Z} f(z) denote the expected (average) value of f over the distribution z ← Z, namely

Ave_{z←Z} f(z) = (1/|Z|) Σ_{z∈Z} f(z) if Z is a finite set, and Ave_{z←Z} f(z) = Σ_{z∈Z} P(Z = z) f(z) if Z is a random variable,

where |Z| denotes the number of elements of Z.

3.1 The Zero-Effort Attack and FAR
The false acceptance rate (FAR) is the probability that a user claiming a wrong identity is incorrectly accepted. The zero-effort attack is an attacker's attempt to impersonate a valid user u ∈ U by claiming the identity idu and presenting the attacker's own biometric characteristic to the system. The success probability then equals the false acceptance rate (FAR), defined by

FAR = Ave_{(u,v)←(U×U)^×} P[match(u, v) = accept],   (1)

where (U × U)^× = {(u, v) ∈ U × U | u ≠ v}. Then FAR can be written as follows:

FAR = Ave_{(u,v)←(U×U)^×} Ave_{s←Xu} P_s(v),   (2)

where P_s(v) = Ave_{t←Xv} P[dec(s, t) = sim].
3.2 The Wolf Attack Probability
Let A be a set consisting of all possible users, including attackers or malicious users who might present non-biometric samples to the system (cf. [3]). Note that in the verification phase an honest user v ∈ U claims his/her own identity idv, whereas an attacker (or a malicious user) a ∈ A\U claims an identity idu of a randomly chosen valid user u ∈ U.

Definition 1. For any a ∈ A, put p = Ave_{v←U} P[match(a, v) = accept]. If p is (extremely) large, then a is called a "p-wolf."

Definition 2. [2] Assume the following two conditions: (i) the attacker has no information about a biometric sample of a valid user to be impersonated; (ii) the attacker has complete information about the matching algorithm in the biometric authentication system to be attacked. The wolf attack is defined as an attempt to impersonate a randomly chosen valid user v ∈ U by presenting a p-wolf with a large p, in order to minimize the complexity of the impersonation attack.

Definition 3. [2] The wolf attack probability (WAP) is defined by

WAP = max_{a∈A} Ave_{v←U} P[match(a, v) = accept].   (3)

Definition 4. [1] For any δ > 0, a matching algorithm is δ-secure against the wolf attack if WAP < δ.

Une, Otsuka, and Imai [2] showed that there exist strong wolves with extremely large p's for typical matching algorithms of two modalities, fingerprint minutiae and finger-vein patterns (cf. [4], [5]). If we use only FAR as a security measure, then we cannot precisely evaluate the security against the wolf attack. Therefore we should use not only FAR but also WAP to evaluate the security against strong intentional impersonation attacks such as the wolf attack.
3.3 The False Rejection Rate
The false rejection rate (FRR) is the probability that a user claiming a correct identity will be rejected, namely

FRR = Ave_{u←U} P[match(u, u) = reject] = Ave_{u←U} Ave_{s←Xu} (1 − P_s(u)).   (4)
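For illustration, the measures (1), (3) and (4) can be estimated empirically when the accept/reject outcomes of the comparisons are available; the boolean-array representation below is an assumption made for the sketch, not part of the paper's formal model.

```python
import numpy as np

def error_rates(accept, genuine_mask):
    """Empirical FAR/FRR from boolean trial outcomes.

    accept[i]       : True when trial i was accepted
    genuine_mask[i] : True when trial i was a genuine (correct-identity) attempt
    """
    accept, genuine_mask = np.asarray(accept), np.asarray(genuine_mask)
    far = accept[~genuine_mask].mean()      # accepted impostor trials, cf. (1)
    frr = (~accept[genuine_mask]).mean()    # rejected genuine trials, cf. (4)
    return far, frr

def wap_estimate(accept_matrix):
    """accept_matrix[a, v]: candidate sample a accepted against the template of user v.
    WAP is the best average acceptance over users achieved by any candidate, cf. (3)."""
    return np.asarray(accept_matrix).mean(axis=1).max()
```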
4 Our Proposed Matching Algorithm
In this section, we construct a matching algorithm secure against the wolf attack, and show the accuracy and security of the constructed algorithm. We embed a wolf-judgement function WOLF in a traditional one-to-one matching algorithm. Our proposed matching algorithm prop-match is defined as follows. The enrollment phase of prop-match is the same as that of the traditional match. Let WOLF be a function which, for any feature vector s ∈ M and any user w ∈ U, outputs the number of valid users w′ such that w′ ≠ w and dec(s, t_{w′}) = sim, where t_{w′} is the stored template of w′, namely WOLF(s, w) = |{w′ ∈ U | w′ ≠ w, dec(s, t_{w′}) = sim}|. First, a user v claims the identity idw and presents his/her biometric sample bv. For an input value s generated from bv and the template t stored with idw, if dec(s, t) = dissim, then prop-match outputs "reject". Otherwise prop-match calculates WOLF(s, w). If WOLF(s, w) ≤ T − 1, then prop-match outputs "accept", and otherwise prop-match outputs "reject". The proposed algorithm prop-match rejects all samples which show similarity to T or more templates. Intuitively, prop-match recognizes such samples as wolf samples and rejects them. The verification phase of our proposed matching algorithm is depicted in Fig. 3.
Fig. 3. Verification phase in our proposed algorithm
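A sketch of the proposed verification phase follows; it assumes the enrolled templates are available as a dictionary keyed by identity and reuses a dec routine like the one sketched after Fig. 1 and Fig. 2. Note that computing WOLF requires one comparison against every other enrolled template, which is the price paid for wolf rejection.

```python
def wolf_count(s, claimed_id, templates, dec):
    """WOLF(s, w): number of OTHER enrolled templates that s also matches."""
    return sum(1 for uid, t in templates.items()
               if uid != claimed_id and dec(s, t) == "sim")

def prop_match(s, claimed_id, templates, dec, T):
    """Proposed verification: reject when the probe matches the claimed template
    but also looks similar to T or more other templates (a likely wolf)."""
    t = templates[claimed_id]
    if dec(s, t) != "sim":
        return "reject"
    return "accept" if wolf_count(s, claimed_id, templates, dec) <= T - 1 else "reject"
```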
4.1 The False Acceptance Rate for Prop-Match
Let FAR^prop denote the false acceptance rate of our proposed algorithm, which is calculated as follows:

FAR^prop = Ave_{(u,v)←(U×U)^×} Ave_{s←Xu} P_s(v) Σ_{k=0}^{T−1} Σ_{I⊆U\{v}, |I|=k} Π_{v′∈I} P_s(v′) Π_{v′∈(U\{v})\I} (1 − P_s(v′)),   (5)

where Π_{x∈S} f_x denotes the product of the f_x, x ∈ S. From (2) and (5) it is clear that FAR^prop ≤ FAR, since

Σ_{k=0}^{T−1} Σ_{I⊆U\{v}, |I|=k} Π_{v′∈I} P_s(v′) Π_{v′∈(U\{v})\I} (1 − P_s(v′)) ≤ 1.
4.2 The Wolf Attack Probability for Prop-Match
Let WAPprop denote the wolf attack probability of our proposed algorithm, namely, which as calculated as follows : T −1 WAPprop = max Ave Ave Ps (v) Ps (v ) (1 − Ps (v )). a∈A $ $ I⊂U \{v} s← −Xa v← −U v ∈I k=0 v ∈(U \{v})\I I=k
(6) Theorem 1. Our proposed matching algorithm is attack, namely we have WAPprop ≤
T . |U |
T -secure against the wolf |U |
Proof. From (6), we have

\mathrm{WAP}^{prop} = \max_{a \in A}\, \mathrm{Ave}_{s \leftarrow X_a}\, \mathrm{Ave}_{v \leftarrow U}\, P_s(v) \sum_{k=0}^{T-1} \sum_{\substack{I \subset U\setminus\{v\} \\ |I|=k}} \prod_{v' \in I} P_s(v') \prod_{v' \in (U\setminus\{v\})\setminus I} \bigl(1 - P_s(v')\bigr)

= \max_{a \in A} \frac{1}{|U|} \sum_{s \in M} P(X_a = s) \sum_{v \in U} P_s(v) \sum_{k=0}^{T-1} \sum_{\substack{I \subset U\setminus\{v\} \\ |I|=k}} \prod_{v' \in I} P_s(v') \prod_{v' \in (U\setminus\{v\})\setminus I} \bigl(1 - P_s(v')\bigr)

= \max_{a \in A} \frac{1}{|U|} \sum_{s \in M} P(X_a = s) \sum_{k=0}^{T-1} (k+1) \sum_{\substack{I \subset U \\ |I|=k+1}} \prod_{v' \in I} P_s(v') \prod_{v' \in U\setminus I} \bigl(1 - P_s(v')\bigr)

\le \max_{a \in A} \frac{1}{|U|} \sum_{s \in M} P(X_a = s)\, T \sum_{k=0}^{T-1} \sum_{\substack{I \subset U \\ |I|=k+1}} \prod_{v' \in I} P_s(v') \prod_{v' \in U\setminus I} \bigl(1 - P_s(v')\bigr) \le \frac{T}{|U|}. \qquad (7)
The result follows.

4.3 The False Rejection Rate for Prop-Match
Let \mathrm{FRR}^{prop} denote the false rejection rate of our proposed algorithm, which is calculated as follows:

\mathrm{FRR}^{prop} = \mathrm{Ave}_{u \leftarrow U}\, P[\mathrm{match}(u, u) = \mathrm{reject}]

= \mathrm{Ave}_{u \leftarrow U}\, \mathrm{Ave}_{s \leftarrow X_u} \Bigl( \{1 - P_s(u)\} + P_s(u) \sum_{k=T}^{|U|-1} \sum_{\substack{I \subset U\setminus\{u\} \\ |I|=k}} \prod_{v' \in I} P_s(v') \prod_{v' \in (U\setminus\{u\})\setminus I} \{1 - P_s(v')\} \Bigr)

= \mathrm{FRR} + \mathrm{Ave}_{u \leftarrow U}\, \mathrm{Ave}_{s \leftarrow X_u}\, P_s(u) \sum_{k=T}^{|U|-1} \sum_{\substack{I \subset U\setminus\{u\} \\ |I|=k}} \prod_{v' \in I} P_s(v') \prod_{v' \in (U\setminus\{u\})\setminus I} \{1 - P_s(v')\}. \qquad (8)
Let Y be a random variable representing the distribution of \mathrm{WOLF} on \bigcup_{u \in U} \{u\} \times X_u, namely

P(Y = k) = \mathrm{Ave}_{u \leftarrow U}\, \mathrm{Ave}_{s \leftarrow X_u}\, P[\mathrm{WOLF}(s, u) = k] = \mathrm{Ave}_{u \leftarrow U}\, \mathrm{Ave}_{s \leftarrow X_u} \sum_{\substack{I \subset U\setminus\{u\} \\ \#I = k}} \prod_{v' \in I} P_s(v') \prod_{v' \in (U\setminus\{u\})\setminus I} \bigl(1 - P_s(v')\bigr). \qquad (9)
From (8), we have

\mathrm{FRR}^{prop} \le \mathrm{FRR} + \sum_{k=T}^{|U|-1} P(Y = k) = \mathrm{FRR} + P(Y \ge T). \qquad (10)
Theorem 2. The mean and the standard deviation of Y are μ and σ, respectively. If T = μ + aσ, then we have

\mathrm{FRR}^{prop} \le \mathrm{FRR} + \frac{1}{a^2}. \qquad (11)

Proof. From Chebyshev's inequality, we have P(|Y - \mu| \ge a\sigma) \le 1/a^2. If Y \ge T, then |Y - \mu| \ge a\sigma. Therefore we have

P(Y \ge T) \le P(|Y - \mu| \ge a\sigma) \le \frac{1}{a^2}. \qquad (12)

From (10), the result immediately follows.
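As an illustration of how Theorem 2 can guide the choice of T in practice, the sketch below is an assumption-laden example rather than part of the paper: it estimates μ and σ from WOLF counts observed on genuine attempts and sets T = μ + aσ, so that the additional false rejections are bounded by 1/a² by Chebyshev's inequality.

```python
import numpy as np

def choose_threshold(wolf_counts, a=3.0):
    """Pick T = mu + a*sigma from empirical WOLF counts (Theorem 2 bound:
    extra FRR <= 1/a^2).  wolf_counts: iterable of WOLF(s, u) values observed
    on genuine attempts; 'a' trades accuracy against wolf-attack tolerance."""
    counts = np.asarray(list(wolf_counts), dtype=float)
    mu, sigma = counts.mean(), counts.std(ddof=1)
    T = int(np.ceil(mu + a * sigma))
    return max(T, 1)  # T must be at least 1 so genuine one-to-one matches can pass

# By Theorem 1 the wolf attack probability is then bounded by T / |U|:
# T = choose_threshold(observed_counts, a=3.0); wap_bound = T / num_users
```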
5 Conclusions
By Theorem 1 and Theorem 2, if T is carefully chosen, then our proposed algorithm prop-match is not only secure but also accurate. We remark that our proposed algorithm does not need to calculate the entropy of the distributions during the verification phase. Therefore it can be more efficient than the algorithm proposed by Inuma, Otsuka, and Imai [1].
References
1. Inuma, M., Otsuka, A., Imai, H.: Theoretical Framework for Constructing Matching Algorithms in Biometric Authentication Systems. In: ICB, pp. 806–815 (2009)
2. Une, M., Otsuka, A., Imai, H.: Wolf Attack Probability: A Theoretical Security Measure in Biometric Authentication Systems. IEICE Transactions on Information and Systems 91(5), 1380–1389 (2008)
3. Matsumoto, T., Matsumoto, H., Yamada, K., Hoshino, S.: Impact of Artificial Gummy Fingers on Fingerprint Systems. In: Proceedings of SPIE, vol. 4677, pp. 275–289 (2002)
4. Ratha, N., Connell, J., Bolle, R.: Enhancing Security and Privacy in Biometrics-based Authentication Systems. IBM Systems Journal 40(3), 614–634 (2001)
5. Miura, N., Nagasaka, A., Miyatake, T.: Feature Extraction of Finger Vein Patterns Based on Iterative Line Tracking and Its Application to Personal Identification. Systems and Computers in Japan 35(7) (2004)
A Region-Based Iris Feature Extraction Method Based on 2D-Wavelet Transform Nima Tajbakhsh1, Khashayar Misaghian2, and Naghmeh Mohammadi Bandari3 1
Control and Intelligent Processing Center of Excellence, School of Electrical and Computer Engineering, University of Tehran, Iran 2 Biomedical Engineering Department, Iran University of Science and Technology, Tehran, Iran 3 Biomedical Engineering Department, Amirkabir University of Technology, Tehran, Iran [email protected], [email protected], [email protected]
Abstract. Despite significant progress made in iris recognition, handling noisy and degraded iris images is still an open problem and deserves further investigation. This paper proposes a feature extraction method to cope with degraded iris images. The method is founded on applying the 2D-wavelet transform on overlapped blocks of the iris texture. The proposed approach enables us to select the most informative wavelet coefficients, providing both essential texture information and enough robustness against the degradation factors. Our experimental results on the UBIRIS database demonstrate the effectiveness of the proposed method, which achieves 4.10% FRR (@ FAR=0.01%) and 0.66% EER. Keywords: Noisy iris recognition, Region-based image processing, 2D-wavelet transform.
1 Introduction

The importance of security is undeniable and plays a crucial role in our societies. The need for a high level of security against terrorist attacks prompts governments to tighten security measures, and employing biometric traits constitutes a fundamental part of governments' efforts to provide national security. Among the proposed biometrics, the iris is known as the most accurate one and is broadly deployed in commercial recognition systems. Pioneering work on iris recognition, the basis of many commercial systems, was done by Daugman [1]. In this algorithm, 2D Gabor filters are adopted to extract orientation-based texture features from a given iris image. After Daugman's work, several researchers [2-6] proposed their own feature extraction methods to achieve more compact codes and to accelerate the decision-making process. Although their efforts have led to great progress in terms of computational time and accuracy, iris-based recognition systems still suffer from a lack of acceptability. This mainly originates from the fact that subjects are reluctant to participate repeatedly in the image acquisition process until the system manages to record an ideal iris image. Accordingly, devising methods that are
capable of handling low-quality iris images can be considered an effective approach to increase the acceptability of iris-based recognition systems. In this paper, we propose a region-based feature extraction method based on the 2D Discrete Wavelet Transform (2D-DWT) that aims at giving a general presentation of the iris texture in a way less affected by degradation factors. The rest of the paper is organized as follows: Section 2 introduces related work from the literature. Section 3 presents the proposed feature extraction method. Experimental results are given in Section 4; finally, Section 5 concludes this paper.
2 Related Works

In this section, we focus our attention on methods developed based on the wavelet transform; for more information about state-of-the-art methods, readers are referred to a comprehensive survey [7] conducted by Bowyer et al. In the iris recognition literature, the wavelet transform constitutes the basis of many well-known feature extraction methods. These methods can roughly be divided into two categories. The first one is comprised of methods utilizing the 1D-wavelet transform as the core of the feature extraction module [3, 5, 6, 8-12]. For instance, Boles and Boashash [3] apply the 1D-wavelet transform at various resolution levels of a virtual circle on an iris image, and Ma et al. [5, 6] utilize the wavelet transform to extract sharp variations of the intensity signals. Methods in the second group utilize the 2D-wavelet transform to extract the iris texture information [4, 13-18]. For instance, Lim et al. [4] use the 2D-wavelet decomposition and generate, for each given pattern, a feature vector comprised of the resulting coefficients from the HH_4 sub-image and the average of the coefficients contained in the HH_1, HH_2, HH_3 sub-images. A rather similar approach based on the 2D-wavelet decomposition is also proposed by Poursaberi and Araabi [18], in which the wavelet coefficients of the LH_3, HL_3 and LH_4, HL_4 sub-images are extracted and coded based on the signs of the coefficients. In both methods, global information of the iris texture is obtained by applying the 2D-DWT to the whole iris texture. However, analyzing the texture in this way provides no specific solution for local noisy regions of the texture or motion-blurred iris images. Furthermore, such a general texture presentation cannot reveal region-based iris information that plays a crucial role in the decision-making process. To compensate for the mentioned drawbacks, a possible solution is to apply the 2D-DWT to sub-regions of the iris, which provides more textural information and enables us to select the most discriminating coefficients with regard to the quality of the captured images.
3 Proposed Feature Extraction Method

In the proposed method, we make use of a region-based approach founded on the 2D-DWT for the following reasons:
• Region-based image decomposition gives abundant textural information and consequently achieves a more accurate recognition system.
• Partitioning the iris texture splits noisy regions into several blocks, which potentially reduces the adverse effects of local noisy regions to a minimum.
• The region-based approach permits us to benefit from local and global information of every block, thus providing both essential texture features and a high level of robustness against the degradation factors.
• Selecting just a few coefficients from a block facilitates registration between two iris samples and, as a result, the inherent similarity between images of a subject is better revealed.

The feature extraction method begins with partitioning the iris texture into 32x32-pixel blocks with 50% overlap in both directions. Then, the 2D-wavelet decomposition is performed on every block. Through an optimization process on the training set, six wavelet coefficients with the most discriminating power and least affected by the degradation factors are selected. Positions of the selected coefficients in the resulting sub-images are depicted in Figure 1. By putting together the extracted values for each coefficient, a matrix is created. To encode the six generated matrices, we follow two coding strategies to achieve the best results on the training set. The first coding method is based on the signs of the extracted coefficients and the second one is founded on the zero crossings of the first derivative of the matrices along the horizontal direction. We apply both coding strategies to encode the two matrices created for vertical details at the fourth level, which results in four binary matrices. The remaining matrices, generated from the approximation coefficients of the fourth level of decomposition, are coded based on the second coding approach. At last, the eight binary matrices are concatenated and the final binary matrix corresponding to an iris pattern is produced.
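The block-wise procedure described above can be sketched as follows using the PyWavelets package. The Haar wavelet, the choice of level-4 approximation and vertical-detail coefficients, and the simplified derivative-sign code below are placeholders; the actual six coefficient positions and the exact sign/zero-crossing coding of the paper (Fig. 1) are not reproduced.

```python
import numpy as np
import pywt  # PyWavelets

def iris_block_codes(norm_iris, block=32, step=16, wavelet="haar", level=4):
    """Slide 32x32 blocks with 50% overlap over the normalized iris, run a
    2D-DWT per block, keep a few level-4 coefficients and binarize them.
    This is a sketch under assumptions, not the paper's exact coding."""
    feats = []
    H, W = norm_iris.shape
    for y in range(0, H - block + 1, step):
        for x in range(0, W - block + 1, step):
            coeffs = pywt.wavedec2(norm_iris[y:y + block, x:x + block],
                                   wavelet, level=level)
            cA4 = coeffs[0]                  # approximation at level 4
            cH4, cV4, cD4 = coeffs[1]        # detail sub-bands at level 4
            feats.append(np.concatenate([cA4.ravel(), cV4.ravel()]))
    feats = np.array(feats)
    sign_code = (feats > 0).astype(np.uint8)      # sign-based code
    # derivative-sign code stands in for the paper's zero-crossing coding
    deriv_code = (np.diff(feats, axis=1, prepend=0) > 0).astype(np.uint8)
    return np.concatenate([sign_code.ravel(), deriv_code.ravel()])
```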
4 Experiments

In this section, we first briefly describe the UBIRIS database used in our experiments; then the feature extraction methods used for comparison are introduced, and finally the experimental results are presented.

4.1 UBIRIS Database

The UBIRIS database is composed of 1877 images from 241 European subjects captured in two different sessions. The images in the first session are gathered in a way that the adverse effects of the degradation factors are reduced to a minimum, whereas the images captured in the second session have irregularities in reflection, contrast, natural luminosity, and focus. Figure 2 shows some samples of the UBIRIS database. The rationale behind choosing this database is to examine the effectiveness of our method in dealing with a variety of degradation factors that may occur as a result of relaxing constraints on subjects' behavior.
Fig. 1. An overview of the proposed feature extraction method. In this figure, generating two binary matrices corresponding to a coefficient of vertical detail sub-image is depicted.
Fig. 2. Iris samples from the UBIRIS database
4.2 Methods Used for Comparison

To compare our approach with other methods, we use the feature extraction strategies suggested in [18, 19]. We select these methods because the wavelet-based method [18] yields results comparable with several state-of-the-art methods, and the method based on the Gauss-Laguerre filter [19], a filter-based one generating a binary matrix similar to the IrisCode [1], can be considered a Daugman-like algorithm. Furthermore, the corresponding authors of both papers provided us with all source codes, thus permitting a fair comparison. During our experiments, the segmentation method is the same for all methods, no strategy is adopted for detecting the eyelids and eyelashes, and we just discard the upper half of the iris to eliminate the eyelashes. Moreover, there are a few iris images suffering from nonlinear texture deformation because of mis-localization of the iris. We deliberately do not correct them and let them enter the feature extraction and
matching process modules. Although segmentation errors can significantly increase the overlap between inter- and intra-class distributions [20], this can simulate what happens in practical applications and also permits us to evaluate the robustness of the proposed method and of those suggested in [18, 19] in dealing with the texture deformation.

4.3 Results

We evaluate the performance of the proposed method in verification mode. To do this, we create the training set by choosing one high-quality and one low-quality iris image per subject and put all remaining iris images (eight images per subject) in the test set.
Fig. 3. Distribution of inter- and intra-class comparisons obtained through evaluating the proposed method on the UBIRIS database
Fig. 4. ROC plots of the methods suggested in [18, 19] and the proposed one for the UBIRIS database
Table 1. Comparison between the error rates obtained from the proposed method and the algorithms suggested in [18, 19] for the UBIRIS database

Method           | EER  | FR    | FRR (@ FAR=.01%)
Proposed         | 0.66 | 18.41 | 4.10
Poursaberi [18]  | 2.65 | 6.95  | 7.80
Ahmadi [19]      | 2.46 | 6.63  | 10.15
To compensate for rotation of the eye during the acquisition process, we store twelve additional iris codes for six rotations on either side, obtained by horizontal shifts of 2, 4, 6, 8, 10, and 12 pixels each way in the normalized images. Therefore, to measure the dissimilarity of two iris patterns, thirteen comparisons are made and the minimum distance is considered as the dissimilarity value. To form the inter- and intra-class distributions, the test samples are compared against the two training samples of each subject according to the mentioned shifting strategy and the minimum distance is chosen. Figure 3 shows the resulting distributions for inter- and intra-class comparisons. The outliers in the tail of the intra-class distribution are generated when comparing two iris samples of a subject of which at least one seriously suffers from the degradation factors. In other words, despite taking the proposed course of action, there have been some degraded iris images that are mistakenly rejected by the system. The Receiver Operating Curves (ROCs) of the proposed method and the other ones are depicted in Figure 4. As can be seen, the proposed method outperforms the others and gives the highest performance. Table 1 enables a quantitative comparison between our method and the implemented ones. From Table 1, it is seen that our method achieves the lowest equal error rate (EER) and the smallest False Rejection Rate (FRR) while providing the maximum Fisher ratio (FR), which indicates greater separability of the inter- and intra-class distributions.
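The shifting strategy just described can be sketched as follows; extract_code is assumed to be any function mapping a normalized iris image to a binary code, and the shift amounts follow the 2 to 12 pixel steps mentioned above, giving thirteen comparisons in total.

```python
import numpy as np

def min_shift_distance(extract_code, enrolled_norm, test_code,
                       shifts=(0, 2, 4, 6, 8, 10, 12)):
    """Rotation compensation (sketch): codes are extracted from horizontally
    shifted versions of the enrolled normalized iris and the smallest
    normalized Hamming distance to the test code is kept as dissimilarity."""
    best = 1.0
    for s in shifts:
        for signed in {s, -s}:                            # both rotation directions
            shifted = np.roll(enrolled_norm, signed, axis=1)  # circular shift in angle
            code = extract_code(shifted)
            dist = np.count_nonzero(code != test_code) / code.size
            best = min(best, dist)
    return best
```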
5 Conclusion

This paper proposed a new feature extraction method to deal with iris images that suffer from local noisy regions and other degradation factors like motion blurriness and lack of focus. On the understanding that region-based image processing makes it possible to handle noisy images, we utilized the 2D-DWT to obtain an informative regional presentation of the iris. Analyzing the texture in this way enabled us to locate the most discriminating coefficients while providing enough robustness against the degradation factors. Although selecting just a few reliable coefficients leads to missing some important details of the iris structure, this is the price we have to pay to achieve a robust presentation. We evaluated the performance of the proposed approach using the UBIRIS database in verification mode. The experimental results confirmed the efficiency of our approach compared with the methods suggested in [18, 19], where 0.66% EER and 4.10% FRR (@ FAR=0.01%) were obtained.
References
[1] Daugman, J.: High Confidence Visual Recognition of Persons by a Test of Statistical Independence. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(11), 1148–1161 (1993)
[2] Wildes, R.P.: Iris Recognition: An Emerging Biometric Technology, vol. 85, pp. 1348–1363. IEEE Press, Los Alamitos (1997)
[3] Boles, W.W., Boashash, B.: A Human Identification Technique Using Images of the Iris and Wavelet Transform. IEEE Transactions on Signal Processing 46(4), 1085–1088 (1998)
[4] Lim, S., Lee, K., Byeon, O., Kim, T.: Efficient Iris Recognition through Improvement of Feature Vector and Classifier. ETRI Journal 23(2), 61–70 (2001)
[5] Ma, L., Tan, T., Wang, Y., Zhang, D.: Efficient Iris Recognition by Characterizing Key Local Variations. IEEE Transactions on Image Processing 13(6), 739–750 (2004)
[6] Ma, L., Tan, T., Wang, Y., Zhang, D.: Local Intensity Variation Analysis for Iris Recognition. Pattern Recognition 37(6), 1287–1298 (2004)
[7] Bowyer, K.W., Hollingsworth, K., Flynn, P.J.: Image Understanding for Iris Biometrics: A Survey. Computer Vision and Image Understanding (2), 281–307 (2008)
[8] Chena, C., Chub, C.: High Performance Iris Recognition based on 1-D Circular Feature Extraction and PSO-PNN Classifier. Expert Systems with Applications (Article in press), doi:10.1016/j.eswa.2009.01.033
[9] Huang, H., Hu, G.: Iris Recognition Based on Adjustable Scale Wavelet Transform. In: 27th Annual International Conference of the Engineering in Medicine and Biology Society, Shanghai, pp. 7533–7536 (2005)
[10] Huang, P.S., Chiang, C., Liang, J.: Iris Recognition Using Fourier-Wavelet Features. In: 5th International Conference on Audio- and Video-Based Biometric Person Authentication, Hilton Rye Town, pp. 14–22 (2005)
[11] Tajbakhsh, N., Araabi, B.N., Soltanian-Zadeh, H.: Noisy Iris Verification: A Modified Version of Local Intensity Variation Method. Accepted in 3rd IAPR/IEEE International Conference on Biometrics, Alghero (2009)
[12] Chena, C., Chub, C.: High performance iris recognition based on 1-D circular feature extraction and PSO-PNN classifier. Expert Systems with Applications (Article in press)
[13] Son, B., Kee, G., Byun, Y., Lee, Y.: Iris Recognition System Using Wavelet Packet and Support Vector Machines. In: 4th International Workshop on Information Security Applications, Jeju Island, pp. 365–379 (2003)
[14] Kim, J., Cho, S., Choi, J., Marks, R.J.: Iris Recognition Using Wavelet Features. The Journal of VLSI Signal Processing (38), 147–156 (2004)
[15] Cho, S., Kim, J.: Iris Recognition Using LVQ Neural Network. In: International Conference on Signals and Electronic Systems, Porzan, pp. 155–158 (2005)
[16] Alim, O.A., Sharkas, M.: Iris Recognition Using Discrete Wavelet Transform and Artificial Neural Networks. In: IEEE International Symposium on Micro-Nano Mechatronics and Human Science, Alexandria, pp. 337–340 (2005)
[17] Narote, S.P., Narote, A.S., Waghmare, L.M., Kokare, M.B., Gaikwad, A.N.: An Iris Recognition Based on Dual Tree Complex Wavelet Transform. In: IEEE TENCON 2007, pp. 1–4 (2007)
[18] Poursaberi, A., Araabi, B.N.: Iris Recognition for Partially Occluded Images: Methodology and Sensitivity Analysis. EURASIP Journal on Advances in Signal Processing 2007(1), 12 pages (2007)
[19] Ahmadi, H., Pousaberi, A., Azizzadeh, A., Kamarei, M.: An Efficient Iris Coding Based on Gauss-Laguerre Wavelets. In: Second IAPR/IEEE International Conference on Biometrics, Seoul, pp. 917–926 (2007)
[20] Proença, H., Alexandre, L.A.: A Method for the Identification of Inaccuracies in the Pupil Segmentation. In: First International Conference on Availability, Reliability, and Security, Vienna, pp. 227–230 (2006)
A Novel Contourlet Based Online Fingerprint Identification Omer Saeed, Atif Bin Mansoor, and M Asif Afzal Butt National University of Sciences and Technology, Pakistan [email protected]
Abstract. Biometric-based personal identification is regarded as an effective method for automatically recognizing an individual's identity. As a method for preserving the security of sensitive information, biometrics has been applied in various fields over the last few decades. In our work, we present a novel core-based global matching approach for fingerprint matching using the Contourlet Transform. The core and delta points, along with the ridge and valley orientations, carry strong directionality or directional information. This directionality has been exploited as the feature set considered for matching. The obtained region of interest (ROI) is analyzed for its texture using the Contourlet transform, which divides the 2-D spectrum into fine slices by employing Directional Filter Banks (DFBs). Distinct features are then extracted at different resolutions by calculating directional energies for each sub-block from the decomposed subband outputs and given to a Euclidean distance classifier. Finally, an adaptive majority vote algorithm is employed in order to further narrow down the matching criterion. The algorithm has been tested on a developed database of 126 individuals, enrolled with 8 templates each.
1 Introduction
Fingerprint identification is an important biometric technique for personal recognition. Fingerprints are graphical flow-like ridges and valleys on human fingers. Due to their uniqueness, fingerprints are the most widely used biometric modality today. Fingerprints vary in quality depending upon scanning conditions, sensor quality, surface of the sensor, etc. A pre-processing stage is necessary in order to mitigate these effects. It may consist of employing image enhancement techniques like histogram equalization, adaptive histogram, deblurring, the Fourier transform and 2D adaptive noise removal filtering. In local feature extraction based approaches, minutiae such as ridge endings and bifurcations are used to represent a fingerprint and a matching scheme is adopted. These approaches require very accurate minutiae extraction; additionally, they contain limited information compared to the complete fingerprint. This paper investigates a holistic personal identification approach using the Contourlet
Transform. It can further be combined with other biometric modalities so as to form a highly accurate and reliable biometric-based personal identification system.

1.1 Algorithm Development
The approach followed for the development of fingerprint based identification system is shown in Figure 1.
Fig. 1. Algorithm Development
2 Image Acquisition
For our work, we have employed Digital Persona's "U are U 4000-B" to capture fingerprint scans. It is a miniature USB-interfaced optical fingerprint reader that gives on-line scans. It has a resolution of 512 dpi (dots per inch). Scans capture an area of 14.6 mm × 18.1 mm and the image output is an 8-bit grayscale image. The image acquisition software was developed on the Java platform. The captured image is then displayed on the monitor. The acquired image contains unwanted reflections of the scanner at its base. These unnecessary rows, shown in Figure 2(a), are excluded, as shown in Figure 2(b), by rejecting the bits at the base of the image. The image is then saved bit by bit in Windows bitmap (BMP) format. The acquired image has dimensions of 512 × 460 pixels. While giving scans, the users were guided to place their fingers on the scanner such that the vertical axis of the finger is aligned with the vertical axis of the scanner. Provisions have been made in the saving module to visually inspect the fingerprint scan prior to saving it for enrollment and generating templates. If the image quality is not satisfactory, the scan can be discarded and a new scan is taken subsequently and saved.
Fig. 2. (a) Unwanted reflections at the image's base (b) Reflections removed
2.1 Pre-processing

Before an image can be further processed for reference point localization, a pre-processing stage is required. As the fingerprint images acquired from the fingerprint reader are not assured of perfect quality, pre-processing aids in their enhancement; although there is some loss of image integrity, this trade-off is usually made. Fingerprint image enhancement makes the image clearer for the subsequent operations. In our work we have applied pre-processing by histogram equalization, adaptive thresholding, the Fourier transform and adaptive binarization.

2.2 Histogram Equalization
Histogram equalization expands the pixel-value distribution of an image so as to increase the perceptual information [1]. The original histogram of a fingerprint image is bimodal. After histogram equalization, the histogram occupies the full range from 0 to 255 and the visualization effect is enhanced.

2.3 Adaptive Thresholding
It is observed that the acquired image is illuminated unevenly. Under a factor like uneven illumination, the histogram cannot be easily segmented [2]. In order to counter this problem, we divided the complete image into sub-images and then, for each of these sub-images, used a separate threshold in order to segment it. The threshold was selected based on the average intensity value of the respective sub-image.
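A minimal sketch of this block-wise thresholding is given below; the 32-pixel block size is an assumption made for illustration, not a value taken from the paper.

```python
import numpy as np

def adaptive_threshold(img, block=32):
    """Split the image into sub-images and threshold each one with its own
    mean intensity, as described above (sketch)."""
    out = np.zeros_like(img, dtype=np.uint8)
    H, W = img.shape
    for y in range(0, H, block):
        for x in range(0, W, block):
            sub = img[y:y + block, x:x + block]
            out[y:y + block, x:x + block] = (sub > sub.mean()).astype(np.uint8)
    return out
```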
2.4 Enhancement Using Fourier Transform
We divided the image into small processing blocks (32 by 32 pixels) and performed the Fourier transform, so that the image segments are transformed into the Fourier frequency domain. For each segment, in order to enhance a specific block by its dominant frequencies (frequencies with a greater number of components for a given threshold), we convolve the FFT of each segment with the respective dominant frequency of that block. Taking the inverse Fourier transform gives the enhanced image. In the resulting enhanced image, the falsely broken ridges are connected while some of the spurious connections are disconnected. Figure 3(a) depicts the acquired fingerprint and 3(b) illustrates the enhanced image after taking the Fourier transform of the respective blocks.

Fig. 3. Enhanced Image (a) Acquired Image (b) Image after Fourier Enhancement
2.5 Adaptive Image Binarization
We performed fingerprint image binarization in order to transform the 8-bit grayscale image into a 1-bit image with value zero for the ridges and value one for the valleys. After the operation, ridges in the fingerprint are highlighted in black while the valleys are white. We have employed a local adaptive binarization method. The procedure is the same as that of the adaptive thresholding used for segmentation: after segmentation, if the value of a pixel is larger than the mean intensity value of its block, a value of one is assigned to it; otherwise it is zero.
3 Reference Point Localization
In order to extract a region of interest, there has to be a reference point or a characteristic feature which can be located every time a fingerprint image is processed by the algorithm. This is called reference point localization or singularity detection. In our work, as a global feature extraction technique is adopted, the core point is the reference point. The core point is the point on the innermost ridge that has the maximum curvature.

3.1 Ridge Flow Estimation
Finally, the enhanced image obtained is segmented into non-overlapping blocks of size 16 × 16 pixels. In each of the segmented blocks, the value of the gradients is calculated first in the x direction as (gx) and then in the y direction as (gy), where
gx are values for the cosine and gy are values for the sine [3]. A least-squares approximation value for each of the blocks is then calculated. An average value for the gradients is calculated and a threshold is set. Once the ridge orientation estimation for each of the blocks is complete, the insignificant ridges are rejected. The point for which the value of this sum is minimum is the point on the innermost ridge, designated as the core point.
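Since the paper does not spell out the exact least-squares formula, the sketch below uses the standard gradient-based block orientation estimate as an illustration of this step; it should not be read as the authors' exact implementation.

```python
import numpy as np

def block_orientations(img, block=16):
    """Standard least-squares ridge-orientation estimate per 16x16 block,
    given as an assumption-based illustration of the gradient step above."""
    gy, gx = np.gradient(img.astype(float))        # image gradients (y then x)
    H, W = img.shape
    thetas = np.zeros((H // block, W // block))
    for i in range(H // block):
        for j in range(W // block):
            sl = np.s_[i * block:(i + 1) * block, j * block:(j + 1) * block]
            bx, by = gx[sl], gy[sl]
            # dominant ridge direction from the gradient field of the block
            thetas[i, j] = 0.5 * np.arctan2(2 * np.sum(bx * by),
                                            np.sum(bx ** 2 - by ** 2))
    return thetas
```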
4 Region of Interest (ROI) Extraction
Once the core point is localized, we extract the Region of Interest (ROI). The size of the region of interest is of considerable importance, as it is this area from which the features are extracted and considered for subsequent matching. The region around the core point contains the greatest variations and directional information, so an area of 128 × 128 pixels is cropped around it. There are certain issues pertaining to the ROI extraction. One such problem arises when the core point is located at the extreme margin of the image: cropping a region centered on the core would then include a region outside the boundary of the image, as shown in Figure 4.
Fig. 4. Core point located to the extreme margin of the image
In this research work, when the ROI drawn around the singularity has any portion extending outside the image boundary, it is shifted inwards to contain the relevant area only. The difference between the values of the ROI boundary and the image boundary is calculated for this shift. The entire ROI is moved inwards until it is completely included inside the image's margin. Figure 5 depicts this process of shifting the ROI to include only the appropriate area. The advantage of the scheme is that no external data is added, such as averaging intensity values or filling the ROI; the data used in the ROI is the actual data contained inside the original image. Additionally, instead of decreasing the size of the ROI, the same size (128 × 128) has been retained.
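A minimal sketch of this inward-shifting crop is given below, assuming a grayscale image larger than the ROI in both dimensions.

```python
import numpy as np

def crop_roi(img, core_xy, size=128):
    """Crop a size x size ROI centred on the core point, shifting it inwards
    when it would cross the image boundary (behaviour of Fig. 5; no padding
    or external data is introduced)."""
    H, W = img.shape
    cx, cy = core_xy
    x0 = int(np.clip(cx - size // 2, 0, W - size))   # shift inwards if needed
    y0 = int(np.clip(cy - size // 2, 0, H - size))
    return img[y0:y0 + size, x0:x0 + size]
```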
5 Feature Vector Generation
In our work, we have employed the contourlet transform for the feature extraction.

5.1 Contourlet Transform
The contourlet transform is an efficient directional multi resolution image representation, utilizing non-separable filter banks developed in the discrete form;
Fig. 5. ROI included inside the image boundary (a) ROI outside image margin (b) ROI moved inwards (c) Complete ROI
conceptually, the contourlet transform first utilizes a wavelet-like transform for edge detection, such as the Laplacian pyramid, and then utilizes a local directional transform for contour segment detection, such as the directional filter bank, to link point discontinuities into linear structures. Therefore contourlets have elongated supports at various scales, directions and aspect ratios [4], [5]. Contourlets can efficiently handle the intrinsic geometrical structure of images containing contours. The transform was proposed by Minh Do and Martin Vetterli [5] and provides sparse representations at both spatial and directional resolutions. The reconstruction is perfect and almost critically sampled, with a small redundancy factor of up to 4/3. The contourlet transform uses a structure similar to that of curvelets [6], [7]: a double filter bank structure comprising the Laplacian pyramid with a directional filter bank, as shown in Figure 6.
Fig. 6. Contourlet Transform Structure
5.2 Feature Vector Generation
The transformed ROI is decomposed into sub-bands at four different resolution levels. Figure 7 gives a pictorial view of the decomposition at just two levels for different directional sub-bands. In the actual decomposition, level 1 corresponds to an ROI of 128 × 128 pixels. At further levels, the size of the ROI is determined by the expression log2 N, i.e., for level 2 it is 64 × 64; similarly, the level 3 and 4 sizes are 32 × 32 and 16 × 16, respectively. At each resolution level "k" the ROI is decomposed into 2 sub-bands.
Fig. 7. Decomposition of the transformed ROI
5.3 Selected Features
As fingerprints contain strong directionality, the related directional energies may be exploited as fingerprint features. The image is decomposed using the DFB (Directional Filter Bank). A five-level decomposition is considered, which yields a total of 60 blocks. Let S_k^θ denote the subband image at level k and direction θ. Similarly, let σ_k^θ denote the standard deviation of the k-th block in the θ-direction subband image, and let c_k^θ(x, y) be the contourlet coefficient value at pixel (x, y) in the subband block S_k^θ. Then the directional energy E_k^θ for that subband block can be calculated by Equations 1 and 2 [8]:

E_k^\theta = \mathrm{nint}\!\left(\frac{255\,(\sigma_k^\theta - \sigma_{\min})}{\sigma_{\max} - \sigma_{\min}}\right) \qquad (1)

where

\sigma_k^\theta = \sqrt{\frac{1}{n}\sum_{(x,y)\in S_k^\theta}\bigl(c_k^\theta(x,y) - \bar{c}_k^\theta\bigr)^2} \qquad (2)

nint(x) is the function that returns the nearest integer value to x; σ_max and σ_min are the maximum and minimum standard deviation values for a particular sub-block; n is the number of pixels in subband S_k^θ; and \bar{c}_k^θ is the mean of the contourlet coefficients c_k^θ(x, y) in the subband block S_k^θ.

5.4 Normalized Energies Calculations
The normalized energy for each block is computed from Equation 3, where E_k^θ represents the directional energy of sub-band θ at level k, E_k^{θ(t)} represents the total directional energy of all sub-blocks at level k, and E is the normalized energy:

E = \frac{E_k^\theta}{E_k^{\theta(t)}} \qquad (3)
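Equations (1)-(3) can be sketched as follows; the contourlet decomposition itself is assumed to be provided by some external implementation, so the function only takes the resulting sub-band coefficient arrays of one level as input.

```python
import numpy as np

def directional_energies(subbands):
    """Compute the directional energies of Eq. (1)-(2) and the normalized
    energies of Eq. (3) for a list of 2D contourlet coefficient arrays
    (one array per directional sub-band of a given level)."""
    sigmas = np.array([np.std(c) for c in subbands])                  # Eq. (2), 1/n variance
    smin, smax = sigmas.min(), sigmas.max()
    energies = np.rint(255 * (sigmas - smin) / (smax - smin + 1e-12))  # Eq. (1), nint scaling
    normalized = energies / (energies.sum() + 1e-12)                   # Eq. (3)
    return energies, normalized
```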
6 Fingerprint Matching
Matching is performed by calculating the normalized Euclidean distance between the input feature vector and the template feature vector. The Euclidean distance between two vectors is calculated from the squared differences between corresponding elements of the feature vectors.
7 Experimental Results
The performance evaluation criterion was adopted from that of the Fingerprint Verification Competition 2002 [9]. The end result of the calculations is a genuine and an impostor distribution. These distributions, when plotted, give us the genuine and impostor curves. Each point on the genuine and impostor curves corresponds to the value for a single matching score. A total of 126 individuals were enrolled. Each individual is enrolled with eight templates in the database, constituting 1008 images. The results for the contourlet-based matching are given in Figure 8. An Equal Error Rate of 1.2% has been achieved.
Fig. 8. (a) ROC based analysis (b) Threshold vs. FMR and FNMR (c) Genuine and Imposter Distributions
7.1 Speed
The registration of an individual takes about 25 seconds, whereas the identification takes about 7.8 seconds on a 1.6 GHz Intel Core Duo processor with 1 GB RAM running the Windows XP operating system. The speed may be increased manifold by a specialized hardware implementation.
8 Adaptive Majority Vote Algorithm
To further narrow our criteria for matching, we have implemented an Adaptive Majority Vote Algorithm. The values of the normalized Euclidean distances are preserved as a matrix: the number of templates per enrolled individual is a constant, the rows correspond to the people enrolled, and the columns are grouped into intervals of the number of templates per person. A classifying threshold is defined, whose value is kept greater than that of the matching threshold attained from the ROCs (Receiver Operating Characteristics). The algorithm searches each row, interval by interval, for values less than the classifying threshold. The row with the maximum number of values below the classifying threshold is considered a match. In case there is more than one row with the same number of values below the classifying threshold, a slight decrement is made to the value of this threshold and the algorithm continues its search with the mentioned criteria. Subsequently, a single interval is located and the smallest value in the interval is checked: if it is less than the final threshold (the threshold achieved from the ROCs), the template is considered a match; otherwise it is rejected. If more than one interval has the same maximum number of values below the classifying threshold and this threshold is decreased, it may happen that, even though the decrement made is very small, all rows are rejected and no value is returned, so there has to be a plausible solution to this occurrence. Although the probability of this occurrence is very low, the issue has been addressed: once the mentioned situation occurs, the last decrement made to the value of the classifying threshold is incremented again by a factor of one half of the decrement. This continues until a single interval is returned, and then the previously mentioned criteria for matching are applied.
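A sketch of this adaptive majority vote is given below. The distance matrix layout, the decrement size and the iteration cap are assumptions made for illustration; only the overall control flow follows the description above.

```python
import numpy as np

def adaptive_majority_vote(distances, t_class, t_match, step=0.01, max_iter=100):
    """'distances' is a (people x templates) matrix of normalized Euclidean
    distances, t_class the classifying threshold (larger than the ROC
    threshold t_match); step and max_iter are assumed values."""
    winners = np.arange(distances.shape[0])
    for _ in range(max_iter):
        votes = (distances < t_class).sum(axis=1)       # hits below threshold per person
        winners = np.flatnonzero(votes == votes.max())
        if len(winners) == 1:
            break                                       # a single interval remains
        if votes.max() == 0:                            # everything rejected:
            t_class += step / 2                         # undo half of the last decrement
            step /= 2
        else:
            t_class -= step                             # tie: tighten the threshold
    person = winners[0]
    # final decision against the matching threshold obtained from the ROCs
    return person if distances[person].min() < t_match else None
```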
9 Conclusion
This paper investigates novel contourlet-based fingerprint identification and matching. The extracted energy features exhibit the pattern of ridge and valley flow at various resolutions because of the multi-resolution decomposition through the Contourlet Transform. The results depict the effectiveness of the proposed scheme by demonstrating an EER of 1.2%. In the future, we intend to investigate the combined operation of fingerprint and palm print identification for a multimodal system.
References
1. Gonzalez, R.C., Woods, R.E., Eddins, S.L.: Digital Image Processing Using MATLAB. Prentice-Hall, Inc., Saddle River (2003)
2. Jain, A.K., Maltoni, D.: Handbook of Fingerprint Recognition. Springer, New York (2003)
3. Zhang, Q., Yan, H.: Fingerprint classification based on self organizing maps. Pattern Recognition (2002)
4. Do, M.N.: Directional multi-resolution image representation. Ph.D. dissertation, Department of Communication Systems, Swiss Federal Institute of Technology Lausanne (2001)
5. Do, M.N., Vetterli, M.: Beyond Wavelets, ch. Contourlets. Academic Press, London (2003)
6. Candes, E.J., Donoho, D.L.: Curvelets – a surprisingly effective nonadaptive representation for objects with edges. Stanford Univ. CA Dept. of Statistics, pp. 1–16 (2000), http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.2419
7. Candes, E.J., Donoho, D.L.: Curvelets, multiresolution representation, and scaling laws. In: Aldroubi, A., Laine, A.F., Unser, M.A. (eds.) SPIE, vol. 4119(1), pp. 1–12 (2000), http://dx.doi.org/10.1117/12.408568
8. Park, C.-H., Lee, J.-J., Smith, M., il Park, S., Park, K.-H.: Directional filter bank-based fingerprint feature extraction and matching. IEEE Transactions on Circuits and Systems for Video Technology 14(1), 74–85 (2004)
9. Maio, D., Maltoni, D., Cappelli, R., Wayman, J.L., Jain, A.K.: FVC2002: Second fingerprint verification competition. In: 3rd International Conference on Pattern Recognition, pp. 811–814 (2002)
Fake Finger Detection Using the Fractional Fourier Transform Hyun-suk Lee, Hyun-ju Maeng, and You-suk Bae Dept. of Computer Engineering, Korea Polytechnic University 2121, JeongWang-dong, SiHeung-Si, KyongGi-do, South Korea {leehsuk,hjmaeng,ysbae}@kpu.ac.kr
Abstract. This paper introduces a new method for detecting fake fingers using the fractional Fourier transform (FrFT). The advantage of this method is that it requires only one fingerprint image. The fingerprint is a texture with the interleaving of ridges and valleys. When the fingerprint is transformed into the spectral domain, we can measure the energy of the fingerprint; generally, the energy of live fingerprints is larger than the energy of fake fingerprints. Therefore, the energy in the spectral image of a fingerprint can be a feature for the detection of fake fingers. We transform the fingerprint image into the spatial frequency domain using the 2D fast Fourier transform and extract a specific line in the spectrum image. This line is transformed into the fractional Fourier domain using the fractional Fourier transform, and its standard deviation is used to discriminate between fake and live fingerprints. For the experiments, we made fake fingers of silicone, gelatin, paper and film, and a fake finger database was created, by which the performance of a fingerprint verification system can be evaluated with higher precision. The experimental results demonstrate that the proposed method can detect fake fingers. Keywords: fake finger detection, fractional fourier transform, fingerprint recognition system.
1 Introduction

Fingerprint recognition systems have become popular these days and have been adopted in a wide range of applications because their performance is better than that of other biometric systems and low-cost image acquisition devices have become available [1]. However, the feasibility of fake fingerprint attacks has been reported by some researchers [2]. Some previous work showed that fake fingerprints can actually spoof some fingerprint recognition systems. Since fake fingerprints can cause very serious security troubles and even crimes, a reliable method to detect fake fingerprint attacks is strongly required. Practically, there are two ways to detect fake fingerprints in fingerprint recognition systems [3]. One is the hardware-based way, including odor detection [4], blood pressure detection [5], body temperature detection [6], pulse detection [7], and human skin analysis [8]. These methods need additional hardware [3], so their implementations are expensive and bulky. The other is the software-based way, including analysis of perspiration and shade changes between ridges and valleys, comparison of
fingerprint image sequences [9], and observation of sweat pores [10]. These methods need complicated algorithms to analyze fingerprint images; however, they do not require any extra hardware or costs, and they can react against fake fingerprint attacks much more flexibly. While many physiological approaches to detect fake fingerprints have been proposed, in this paper we propose a novel method based on the fractional Fourier transform (FrFT). This paper is organized as follows. Section 2 describes the fabrication of fake fingerprints. Section 3 provides a brief overview of the fractional Fourier transform. In sections 4 and 5, the proposed method is discussed and evaluated with some experimental results. Finally, in section 6, we give a short conclusion.
2 Fabrication of Fake Fingers

There are two ways of fabricating fake fingerprints. One is produced by cloning with a plastic mold under personal agreement; the other is produced by cloning from a residual fingerprint. Because fabricating a fake fingerprint needs appropriate materials and appropriate processing, it is hard to make one; in particular, fabricating a fake fingerprint from a latent fingerprint requires professional skill. Material and procedure are two essential factors when fabricating fake fingerprints. The common materials are paper, film, and silicone. Gelatin and synthetic rubber are also used very often for fake fingerprints because they have physical and electrical properties very similar to human skin. Recently, prosthetic fingers, clones of whole fingers, have appeared. Prosthetic fingers are still expensive, but they are almost the same
Fig. 1. The process of fake finger making according to each material
as real fingers and can be used semi-permanently. Figure 1 shows the fabrication procedure of fake fingerprints from four materials: silicone, gelatin, film, and paper. In the experiments, we made the four kinds of fake fingerprints using paraffin (candle) molds rather than direct printing of fingerprints.
3 Fractional Fourier Transform

The fractional Fourier transform is a generalized form of the conventional Fourier transform. By adding a degree of freedom, the order, the fractional Fourier transform allows spectral analysis in a certain space-frequency domain. Equation 1 gives the basic formula of f_a(u), the a-th order fractional Fourier transform of a certain function f(u). The order a, which ranges from 0 to 1, determines the space-frequency domain of the transform. When a equals 0, the transform becomes the original function f(u), the exact space domain, and when a equals 1, the transform becomes the conventional Fourier transform of f(u), the exact frequency domain [11].
f_a(u) = \sqrt{1 - i\cot\alpha}\int_{-\infty}^{\infty}\exp\!\bigl[i\pi\bigl(u^2\cot\alpha - 2uu'\csc\alpha + u'^2\cot\alpha\bigr)\bigr]\,f(u')\,du', \qquad \alpha = a\pi/2. \qquad (1)
The fractional Fourier transform can be implemented very simply in optical systems, and this greatly reduces computing time [12]. In this paper, we apply the transform to a fingerprint recognition system and attempt to find the transform's optimal domain, in which fake fingerprints can be identified most clearly.
4 Proposed Method

In this paper, we propose a fake finger detection method using the fractional Fourier transform. A fingerprint is a texture with the interleaving of ridges and valleys [3].
Fig. 2. Diagram of the proposed algorithm
Since an original live fingerprint and its fake have different clearness of the ridge-valley texture, we can notice the difference between the two fingerprints' energies in the space-frequency domain. Therefore, the energy difference in the space-frequency domain can be a useful indicator to identify fake fingerprints. Figure 2 diagrams the computing procedure of the proposed method. First, the fingerprint image is transformed into the frequency domain through the fast Fourier transform (FFT). To find the energy distribution of the Fourier-transformed image from the center to the edge, the horizontal line is extracted. Next, this one-dimensional signal is transformed into the fractional Fourier domain, and the standard deviation is calculated from Equation 2:

\mathrm{Std} = \sqrt{\sum_{i=1}^{n}(x_i - \mu)^2 / (n-1)} \qquad (2)
Fake fingerprints can be identified according to this standard deviation. To be more specific, if the standard deviation is greater than a certain threshold value, then the input fingerprint is from a live finger; otherwise it is from a fake finger. Figure 3 graphically summarizes the standard deviations for the respective persons and materials.
Fig. 3. Standard Deviation Average for each material (live, silicone, gelatin, paper and film fingerprints)
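The pipeline of Fig. 2 can be sketched as follows. A fractional Fourier transform routine frft(signal, a) is assumed to be supplied by the user (it is not part of NumPy), and the order 0.7 simply reflects the best value reported in Section 5.

```python
import numpy as np

def liveness_score(fingerprint, frft, order=0.7):
    """2D FFT of the image, extraction of the horizontal line through the
    spectrum centre, fractional Fourier transform of that 1D signal, and its
    standard deviation (Eq. 2) as a liveness indicator (sketch)."""
    spectrum = np.fft.fftshift(np.fft.fft2(fingerprint))
    center_row = np.abs(spectrum[spectrum.shape[0] // 2, :])   # horizontal line
    transformed = np.abs(frft(center_row, order))
    return np.std(transformed, ddof=1)                         # Eq. (2), sample std

# Decision rule: live if liveness_score(img, frft) > threshold, else fake.
```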
5 Experimental Results

5.1 Fake Finger Database

The fake finger database used in this paper consists of fake fingerprint images and real fingerprint images for comparison with the fake fingerprints.
In order to evaluate the performance of the proposed method, a database of fake fingerprints was collected from 15 persons using an optical fingerprint sensor, without activating the sensor's fake finger detection function. The database contains a total of 3,750 fingerprint images from 15 persons: 75 fingers (15 live, 15 silicone, 15 gelatin, 15 paper, and 15 film) x 50 impressions. Sample images of the fake fingerprint database are shown in figure 4.
Fig. 4. Sample images of the fake finger database
5.2 Results

In this paper, we evaluated the proposed method using 750 live and 3,000 fake fingerprints. Experimental results of the proposed method are shown in figures 5 and 6. Figure 5 shows the error rates of overall, live and fake fingerprints according to the threshold value. Figure 6 reports the error rates of overall fingerprints according to the order ('a'). In this result, the best error rate of about 11.4% is obtained when the order is 0.7. Table 1 compares the performance of the proposed method with other methods. The error rate (best error rate) of the proposed method is similar to the lowest error rate among the compared methods. While the perspiration-based method, the method with the lowest error rate, needs to capture image pairs and its performance is susceptible to various factors, the proposed method only needs one image to compute the fractional Fourier energy.
Fig. 5. Error rate for each threshold value (overall, live and fake fingers)
Fig. 6. Error rate for each order 'a' (overall)

Table 1. Performance comparison with other methods

Method                                            | Error rate (%)   | Precondition
Perspiration-based methods [2,5]                  | Approximately 10 | Need to capture image pairs
Skin deformation-based method [9]                 | Approximately 16 | Hardness hypothesis. Need a fingerprint scanner capable of capturing and delivering frames at the proper rate
Band-selective Fourier spectrum based method [3]  | Approximately 16 | Only need one image
Proposed method                                   | Approximately 11 | Only need one image
6 Conclusions

This paper proposed a fake fingerprint detection method based on the fractional Fourier transform. This method can discriminate real fingerprints from fake fingerprints using a single fingerprint image. The result of the experiment using the fake fingerprint database shows an error rate of about 11.4%, obtained by using a certain region after the fractional Fourier transform. In conclusion, we could verify that it is possible to detect fake fingerprints by using the fractional Fourier transform. Although it was not applied in this paper, we expect better results if additional pre-processing is included. In order to detect fake fingerprints reliably, more research is needed to reduce the error rate and to define each step of the algorithm exactly. We are also planning additional tests using various sensors and a database that includes more fake fingerprint images in order to obtain reliable results.
This research proposes a software-based method for detecting fake fingers. The method can detect fake fingerprints from a single fingerprint image. More research is needed to reduce the processing time by analyzing only the required parts. Further tests will be performed by building databases that include more fake fingerprint images.
References
[1] Maltoni, D., Maio, D., Jain, A.K., Prabhakar, S.: Handbook of Fingerprint Recognition. Springer, New York (2003)
[2] Matsumoto, T., Matsumoto, H., Yamada, K., Hoshino, S.: Impact of artificial "Gummy" fingers on fingerprint systems. In: Proc. SPIE, vol. 4677 (2002)
[3] Jin, C., Kim, H., Elliott, S.: Liveness Detection of Fingerprint Based on Band-Selective Fourier Spectrum. In: Nam, K.-H., Rhee, G. (eds.) ICISC 2007. LNCS, vol. 4817, pp. 168–179. Springer, Heidelberg (2007)
[4] Baldisserra, D., Franco, A., Maio, D., Maltoni, D.: Fake Fingerprint Detection by Odor Analysis. In: ICBA 2006, Proceedings International Conference on Biometric Authentication, Hong Kong (2006)
[5] Drahansky, M., Notzel, R., Funk, W.: Liveness Detection based on Fine Movements of the Fingertip Surface. In: 2006 IEEE Information Assurance Workshop, June 21-23, 2006, pp. 42–47 (2006)
[6] van der Putte, T., Keuning, J.: Biometrical fingerprint recognition: don't get your fingers burned. In: Proceedings of IFIP TC8/WG8.8 Fourth Working Conference on Smart Card Research and Advanced Applications, pp. 289–303. Kluwer Academic Publishers, Dordrecht (2000)
[7] Reddy, P.V., Kumar, A., Rahman, S.M.K., Mundra, T.S.: A New Method for Fingerprint Antispoofing using Pulse Oxiometry. In: IEEE on Biometrics: Theory, Applications, and Systems, Washington DC (2007)
[8] Jia, J., Cai, L., Zhang, K., Chen, D.: A New Approach to Fake Finger Detection Based on Skin Elasticity Analysis. In: Lee, S.-W., Li, S.Z. (eds.) ICB 2007. LNCS, vol. 4642, pp. 309–318. Springer, Heidelberg (2007)
[9] Antonelli, A., Cappelli, R., Maio, D., Maltoni, D.: Fake Finger Detection by Skin Distortion Analysis. IEEE Transactions on Information Forensics and Security 1(3), 360–373 (2006)
[10] Parthasaradhi, S.T.V., Derakhshani, R., Hornak, L.A., Schuckers, S.A.C.: Time-Series Detection of Perspiration as a Liveness Test in Fingerprint Devices. IEEE Trans. on Systems, Man, and Cybernetics - Part C 35(3) (2005)
[11] Ozaktas, H.M., Zalevsky, Z., Alper Kutay, M.: The Fractional Fourier Transform: With Applications in Optics and Signal Processing. Wiley, New York (2001)
[12] Wilson, C.L., Watson, C.I., Paek, E.G.: Effect of resolution and image quality on combined optical and neural network fingerprint matching. In: PR, vol. 33(2) (2000)
Comparison of Distance-Based Features for Hand Geometry Authentication Javier Burgues, Julian Fierrez, Daniel Ramos, and Javier Ortega-Garcia ATVS - Biometric Recognition Group, EPS, Universidad Autonoma de Madrid, Campus de Cantoblanco, C/ Francisco Tomas y Valiente 11, 28049 Madrid, Spain {javier.burgues,julian.fierrez,daniel.ramos, javier.ortega}@ uam.es
Abstract. A hand-geometry recognition system is presented. The development and evaluation of the system includes feature selection experiments using an existing publicly available hand database (50 users, 500 right hand images). The obtained results show that using a very small feature vector high recognition rates can be achieved. Additionally, various experimental findings related to feature selection are obtained. For example, we show that the least discriminative features are related to the palm geometry and thumb shape. A comparison between the proposed system and a reference one is finally given, showing the remarkable performance obtained in the present development when considering the best feature combination. Keywords: Hand geometry, biometrics, feature selection.
1 Introduction

Nowadays, people identification to control access to certain services or facilities is a very important task. The traditional method to assert that a person is authorized to perform an action (e.g., using a credit card) was the use of a password. This kind of identification method has the problem of usually requiring long and complicated passwords to augment the security level, at the cost of user inconvenience. People identification through biometric traits is a possible solution to enable secure identification in a user-convenient way [1]. In biometric systems, users are automatically recognized by their physiological or behavioral characteristics (e.g., fingerprint, iris, face, hand, signature, etc.). In the present work, we focus on hand biometrics. Traditional hand recognition systems can be split into three modalities: geometry, texture and hybrid. We concentrate our efforts on the first one due to its simplicity. In the literature, several hand geometry recognition systems have been developed [2-4]. For example, in [2] a hand recognition system is presented based on various finger widths, heights, deviations and angles. The work described in [3] treats the
fingers individually by rotating and separating them from the hand. Oden et al. [4] used the finger shapes represented with fourth-degree implicit polynomials. On the other hand, in [5] only palm texture information of the hand is used to identify a user. Finally, a third kind of hand recognition method employs the fusion of hand geometry and texture, as for example [6]. As mentioned before, the present work is focused on hand geometry. In particular, we implement and study a distance-based hand verification system based on hand geometry features inspired by previous works [2, 8]. These features are compared in order to find new insights into their discriminative capabilities. As a result, we obtain a series of experimental findings, such as the instability of features related to the thumb shape and location. A comparison between the proposed system and a reference one is finally given, showing the remarkable performance obtained in the present development when considering the best feature combination. The rest of the paper is structured as follows. In section 2 we describe the processing blocks of our authentication system based on hand geometry. Section 3 describes the experimental results and observations obtained related to feature selection. Finally, conclusions are drawn in section 5, together with the future work.
2 Distance-Based Hand Geometry Authentication

The global architecture of our system is shown in Fig. 1. The first step is a hand boundary extraction module, from which the hand silhouette is obtained. The radial distance from a fixed reference point is then computed for the silhouette to find, for all fingers, their valley and tip coordinates. Then, some distance-based measures considering these reference points are calculated to conform the feature vector representation of the hands. Given test and enrolled hands, the matching is based on a distance measure between their feature vectors.

2.1 Boundary Extraction

Input images are first converted to grayscale and then binarized using Otsu's method. A morphological closing with a small circle used as structuring element removes spurious irregularities. After that, we search for the connected components present in the image, assuming that the largest component is the hand and the others (if any) are potentially disconnected fingers or noise. Various shape measures are computed for the disconnected components found in order to detect disconnected fingers (e.g., due to rings), in which case we reconnect the finger to the hand using morphological operations. Once the hand boundary is extracted, we detect the wrist region. To do so, we search for the segment perpendicular to the major axis of the hand, closest to the center of the palm, with a length equal to or less than half of the maximum palm width (see Fig. 1 for example images).
Fig. 1. Block diagram of the main steps followed in our system to extract features and match two hands. The original image (a) is first binarized (b). The boundary is then calculated, and the plot (c) of the radial distance from a reference point lets us estimate the coordinates of tips and valleys (d). After that, feature extraction is done by measuring some finger lengths and widths (e). Last, given two hands, their matching is based on a distance between their feature sets.
2.2 Tips and Valleys Detection Once the boundary of the hand is available, we fix a reference point on the wrist, from which the boundary is scanned clockwise, calculating the Euclidean distance to the reference point. The resulting one-dimensional function is examined to find local maxima and minima. Maxima of the curve correspond to finger tips and minima are associated with finger valleys. Depending on the hand acquisition, the first maximum will correspond either to the thumb or to the little finger. This process is depicted in Fig. 1c. Before feature extraction, we compute a valley point for every finger at each side of its base (left and right). The only two fingers for which a simple analysis of the previous minima yields both valley points are the middle and ring fingers. For each of the other fingers we take as reference point the only available valley associated with the finger, and then compute the Euclidean distance between this point and the boundary points on the other side of the finger. The point that yields the minimum distance is selected as the remaining valley point for that finger.
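A minimal sketch of the tip/valley localisation from the radial-distance profile is given below; the peak-spacing parameter and the use of SciPy's peak detector are illustrative assumptions rather than the exact procedure of the system described above.

```python
import numpy as np
from scipy.signal import find_peaks

def tips_and_valleys(boundary, reference_point, min_separation=50):
    """Locate candidate finger tips and valleys from the radial distance profile.

    boundary: (N, 2) array of (x, y) boundary points ordered along the contour.
    reference_point: (x, y) point fixed on the wrist.
    min_separation is an assumed spacing constraint (in boundary samples).
    """
    radial = np.linalg.norm(boundary - np.asarray(reference_point, dtype=float), axis=1)
    # Local maxima of the radial distance -> finger tips.
    tip_idx, _ = find_peaks(radial, distance=min_separation)
    # Local minima -> finger valleys (peaks of the negated profile).
    valley_idx, _ = find_peaks(-radial, distance=min_separation)
    return boundary[tip_idx], boundary[valley_idx]
```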
Fig. 2. Set of features studied in the proposed hand geometry authentication system
2.3 Feature Extraction We define the reference point of a finger as the middle point between the two finger valleys. The length of the finger is calculated as the Euclidean distance from the tip to the finger reference point. Fig. 2 shows the notation used to name the hand features we propose. For each finger, its length is denoted with the letter 'L' and a number that identifies the finger (1 for index, 2 for middle, 3 for ring, 4 for little and 5 for thumb). Finger widths ('W') keep the same numbering, with an additional character indicating whether it is the upper ('u') or the lower ('d') width. See Fig. 3. The thumb only has one width measure, at the middle of the finger, denoted as W5. There are also some palm distance features, named P1, P2 and P3 (see Fig. 2). In the experimental section we will study various combinations of these features. 2.4 Similarity Computation Once the feature vector has been generated, the next step is to compute the matching score between two hands. Since our system is based on a distance measure, lower values of the matching score represent hands with higher similarity; the matching score therefore represents dissimilarity. If we denote the feature vector of one hand as m1[i], i = 1,…,N, and the feature vector of another hand as m2[i], i = 1,…,N, then their dissimilarity is computed as:
d(m_1, m_2) = \sum_{i=1}^{N} \lvert m_1[i] - m_2[i] \rvert    (1)

with N being the length of the feature vectors.
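The matching score of Eq. (1) reduces to an L1 distance between feature vectors; a minimal sketch is shown below. The decision threshold is a hypothetical parameter, since its value would in practice be set experimentally from the DET curve.

```python
import numpy as np

def dissimilarity(m1, m2):
    """Matching score of Eq. (1): sum of absolute feature differences (L1 distance)."""
    m1 = np.asarray(m1, dtype=float)
    m2 = np.asarray(m2, dtype=float)
    return float(np.sum(np.abs(m1 - m2)))

def verify(test_features, enrolled_features, threshold):
    """Accept the identity claim when the dissimilarity is below the threshold."""
    return dissimilarity(test_features, enrolled_features) < threshold

# Example with two hypothetical 4-feature vectors (L1, L2, L3, L4 in pixels):
print(dissimilarity([180.0, 210.5, 200.2, 150.1], [182.3, 208.0, 201.0, 149.0]))
```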
3 Experiments In the first subsection, the database used in this work is detailed and the protocol used to generate genuine and impostor scores is explained. The reference system is summarized in Section 3.2. Finally, the results obtained in the feature selection experiments are presented. The best combination found is included in the final system to evaluate its performance in comparison with the reference system. 3.1 Database and Experimental Protocol The experiments have been carried out using a publicly available database, captured by the GPDS group of the Univ. de Las Palmas de Gran Canaria in Spain [8]. This database contains 50 users with 10 right-hand samples per user. The image acquisition was supervised: users could not place the hand on the scanner in an arbitrary position, the scanner surface was clean, the illumination was constant, etc. Hence, high-quality images were obtained. To fairly compare the performance of our system with the reference one, both systems were tested on the same database using the same protocol. Impostor scores are obtained by comparing the user model to one hand sample (the sixth one) of each of the remaining users. Genuine scores are computed by comparing the last 5 available samples per user with that user's own model (which is constructed with the first hand sample). This protocol uses one sample per user for enrollment and five samples per user for testing. Overall system performance is reported by means of DET plots [10]. 3.2 Reference System Fig. 3 shows the processing steps of the recognition system used as a reference for comparison with our development. This reference system is fully described in and available through [9]. In the reference system the image is first preprocessed and then,
Fig. 3. Processing steps and feature extraction for the reference system (extracted from [7])
for each finger, the histogram of the Euclidean distances of the boundary points to the major axis of the finger is computed. The features of the hand boundary are the five normalized histograms. Then, given two hands, the symmetric Kullback-Leibler distance between the finger probability densities is calculated in order to measure the degree of similarity. 3.3 Experiments The set of features presented in Sect. 2.3 consists of 17 measures from different zones of the hand. Specifically, there are five finger lengths, nine finger widths and three palm widths. This set of features is based on a selection of the best features proposed in [8] and some features studied in [2]. In our first experiment, some subsets of features were manually chosen and then tested to check their performance. Table 1 shows the results. We observe that not considering the information of the thumb in the feature set (feature subset 2 vs. feature subset 1) provides a significant performance improvement (from more than 9.6% to less than 1.7% EER). This is in accordance with the results presented in [4], and may be due to the freedom of movement of this finger, which makes it hard to estimate its valley points correctly. Because of this, for the rest of the experiments we discard the features related to the thumb (i.e., L5 and W5). Also of interest, the lengths of the four remaining fingers are all useful, because removing any of them deteriorates the system performance (subsets 3, 4 and 5 vs. subset 2). On the other hand, the palm lengths considered (P1 to P3) do not provide any benefit (subset 6 vs. subset 2). This may be due to the fact that these palm-related features use the three exterior valley points, which are the most difficult to estimate precisely. Finally, in Table 1, we can see that the basic information provided by the finger lengths (subset 2) benefits from the incorporation of the finger widths (subset 7). The system performance for the feature sets in Table 1 is analyzed across all verification threshold operating points by means of the DET plots in Fig. 4. Fig. 4a shows the DET plot of: (i) the five finger lengths (feature subset 1), (ii) four finger lengths, excluding the thumb (feature subset 2), and (iii) the reference system. Table 1. EER (%) for different subsets of features. Feature nomenclature is the same as that used in Fig. 2.
Feature subset ID   Features                                                                Equal Error Rate (%)
1                   L1, L2, L3, L4, L5                                                      9.66
2                   L1, L2, L3, L4                                                          1.68
3                   L1, L4                                                                  5.70
4                   L2, L3                                                                  4.83
5                   L2, L3, L4                                                              3.06
6                   L1, L2, L3, L4, P1, P2, P3                                              5.54
7                   L1, L2, L3, L4, W1u, W1d, W2u, W2d, W3u, W3d, W4u, W4d                  1.24
8                   L1, L2, L3, L4, W1u, W1d, W2u, W2d, W3u, W3d, W4u, W4d, P1, P2, P3      5.09
Reference system                                                                            2.97
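The EER values reported in Table 1 can be estimated from lists of genuine and impostor matching scores. The sketch below assumes a distance-based matcher (lower score means more similar) and uses a simple threshold sweep rather than the DET-based procedure of [10].

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER for a distance-based matcher.

    Sweeps the decision threshold over all observed scores and returns the
    operating point where FRR and FAR are (approximately) equal.
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    best_gap, eer = np.inf, None
    for thr in thresholds:
        frr = np.mean(genuine >= thr)   # genuine rejected: distance too large
        far = np.mean(impostor < thr)   # impostor accepted: distance small enough
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2.0
    return eer
```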
Fig. 4. (a) Performance obtained using three different feature sets. This experiment reports results on which fingers should be included in the feature set. (b) DET comparison of four proposed feature sets. In this plot, the influence of the palm and finger widths is examined. Fig. 4b shows the results of the system evaluation with: (i) four finger lengths, excluding the thumb (feature subset 2), (ii) the set used in (i) plus the palm widths (P1 to P3) (feature subset 6), (iii) four finger lengths and their associated widths (feature subset 7), and (iv) the reference system. Also of interest, the best Equal Error Rate achieved by the proposed system (1.24%) is lower than that of the reference system (2.97%).
4 Conclusions and Future Work A new recognition system based on hand geometry has been proposed. In this work, different sets of features have been evaluated and some experimental findings have been obtained. We have observed that the features based on the thumb are the least discriminative. This may be due to its freedom of movement, which makes it hard to estimate correctly the valley points that define this finger. For the four remaining fingers, we have concluded that their lengths and widths are the most discriminative features. Also of interest, the palm widths yield poor results, perhaps due to their dependence on the thumb valley points. Finally, the results obtained for the best feature combination (1.24% EER) improve on the reference system performance (2.97% EER) over the same database and experimental protocol, a relative improvement of more than 50% in the EER. Future work includes applying feature subset selection methods to the proposed set of features, and the development of quality detection algorithms to automatically discard low-quality images which worsen the system performance.
Acknowledgements. J. F. is supported by a Marie Curie Fellowship from the European Commission. This work was supported by Spanish MEC under project TEC2006-13141-C03-03.
References
1. Jain, A.K., Ross, A., Prabhakar, S.: An introduction to biometric recognition. IEEE Trans. on Circuits and Systems for Video Technology 14, 4–20 (2004)
2. Sanchez-Reillo, R., Sanchez-Avila, C., Gonzalez-Marcos, A.: Biometric identification through hand geometry measurements. IEEE Trans. on Pattern Analysis and Machine Intelligence 22, 1168–1171 (2000)
3. Yörük, E., Konukoglu, E., Sankur, B.: Shape-Based Hand Recognition. IEEE Trans. on Image Processing 15, 1803–1815 (2006)
4. Oden, C., Ercil, A., Buke, B.: Combining implicit polynomials and geometric features for hand recognition. Pattern Recognition Letters 24, 2145–2152 (2003)
5. Zhang, D., Kong, W.K., You, J., Wong, M.: Online Palmprint Identification. IEEE Trans. on Pattern Analysis and Machine Intelligence 25, 1041–1050 (2003)
6. Kumar, A., Wong, D.C.M., Shen, H.C., Jain, A.K.: Personal authentication using hand images. Pattern Recognition Letters 27, 1478–1486 (2006)
7. Geoffroy, F., Likforman, L., Darbon, J., Sankur, B.: The Biosecure geometry-based system for hand modality. In: ICASSP, vol. 147, pp. 195–197 (2007)
8. González, S., Travieso, C.M., Alonso, J.B., Ferrer, M.A.: Automatic biometric identification system by hand geometry. In: Proc. IEEE 37th Annual International Carnahan Conference on Security Technology, pp. 281–284 (2003)
9. Dutagaci, H., Fouquier, G., Yoruk, E., Sankur, B., Likforman-Sulem, L., Darbon, J.: Hand Recognition. In: Petrovska-Delacretaz, D., Chollet, G., Dorizzi, B. (eds.) Guide to Biometric Reference Systems and Performance Evaluation. Springer, London (2008)
10. Martin, A., Doddington, G., Kamm, T., Ordowski, M., Przybocki, M.: The DET curve in assessment of detection task performance. In: EUROSPEECH 1997, pp. 1895–1898 (1997)
A Comparison of Three Kinds of DP Matching Schemes in Verifying Segmental Signatures Seiichiro Hangai, Tomoaki Sano, and Takahiro Yoshida Dept. of Electrical Engineering, Tokyo University of Science, 1-14-6 Kudankita, Chiyoda-ku, Tokyo, 102-0073, Japan {hangai,yoshida}@ee.kagu.tus.ac.jp, [email protected]
Abstract. In on-line signature verification, DP (Dynamic Programming) matching is an essential technique for making a reference signature and/or calculating the distance between two signatures. For continuous signatures such as Western ones, it is reasonable for the matching to be applied over the whole signature, i.e., from the start of the signature to its end. However, for segmental signatures such as Eastern ones (e.g., Japanese or Chinese), such DP matching generates unrecoverable errors caused by wrong correspondences between segments, and leads to a worse EER (Equal Error Rate). In this study, we compare three kinds of DP matching schemes: (1) based on the First name part and the Second name part, (2) based on characters, and (3) based on strokes. From the experimental results, the 1.8% average EER obtained with whole-signature DP matching decreases to 0.5% in scheme (1), 0.105% in scheme (2) and 0.00244% in scheme (3), for 14 writers' signatures with intersession variability over one month. In this paper, the problems in DP matching and the results of signature verification with XY position, pressure, and inclination are given under the proposed matching schemes. Keywords: signature verification, DP matching, segmental signature, EER.
1 Introduction Recently, authentication systems using handwritten signatures have been realized and are starting to be used [1,2]. In order to improve the security level, many schemes have been proposed, some of which use features selected by genetic algorithms (GA) [3] or a combination of global and local features [4]. We have proposed an on-line signature verification system and shown that pen inclination (azimuth and altitude) data improve the security level [5]. For instance, by combining the pen inclination data with pen position data and pen pressure data, we can achieve a 100% verification rate for 24 persons after two weeks have passed. However, the intersession variability of signatures decreases the verification rate as days go by. In general, lowering the threshold can compensate for this decrease, at the sacrifice of robustness to forgery. A main reason for this decrease is that the reference pattern cannot adapt to the change of signatures over a long term. To reduce this degradation, we proposed a reference renewal method and
obtained a verification rate of 98.5% after 9 weeks had passed [6,7]. In these methods, we have applied DP matching to make a reference signature from multiple samples and/or to calculate the distance between two signatures in the verification process. For continuously stroked signatures by Western people, it is reasonable to match signatures over the whole stroke, i.e., from the start of the signature to its end, by normal DP matching with a narrow matching window. However, for segmental signatures by Eastern (e.g., Japanese or Chinese) people, DP matching over all segments generates unrecoverable errors caused by wrong correspondences between segments in different samples. This disregard for the segmental structure degrades the EER. In this study, we compare three kinds of DP matching schemes for verifying Japanese signatures: (1) matching after dividing into the First name part and the Second name part, (2) matching after dividing into characters, and (3) matching after dividing into strokes. In this paper, we discuss the problems in applying DP matching to segmental signatures. Experimental results and a comparison of the three kinds of matching for 14 writers' signatures over one month are also given.
2 Writing Information Acquired by Tablet and Verification Process Fig. 1 shows the five acquired writing signals, i.e., the x-y position x(t) and y(t), the pen pressure p(t), the pen azimuth θ(t), and the pen altitude φ(t). Fig. 2 shows an example of an acquired signature and its time series data. The errors in matching the i-th test signature Si to the reference signature Sr are calculated by the following equations:
L(t) = \sqrt{\{x_r(t) - x_i(t)\}^2 + \{y_r(t) - y_i(t)\}^2}    (1)

P(t) = \lvert p_r(t) - p_i(t) \rvert    (2)

A(t) = \cos^{-1}\left( \frac{V_r(t) \cdot V_i(t)}{\lVert V_r(t) \rVert \, \lVert V_i(t) \rVert} \right)    (3)
Fig. 1. The five acquired writing signals (x-y position, pressure p, azimuth θ and altitude φ)
where L(t), P(t), and A(t) are the errors of the XY position, pen pressure, and pen inclination, respectively. V(t) is a 3D vector which gives the feature of pen inclination, as follows:
V(t) = \begin{bmatrix} \sin\theta(t)\cos\varphi(t) \\ -\cos\theta(t)\cos\varphi(t) \\ \sin\varphi(t) \end{bmatrix}    (4)
In making the reference and/or verifying, we use the accumulated errors given by the following equations:
L = \frac{1}{\Delta T} \int_{t=T}^{T+\Delta T} L(t)\, dt    (5)

P = \frac{1}{\Delta T} \int_{t=T}^{T+\Delta T} P(t)\, dt    (6)

A = \frac{1}{\Delta T} \int_{t=T}^{T+\Delta T} A(t)\, dt    (7)
where T is the start time of the accumulation, and ΔT is the accumulation time, which depends on the matching scheme described later.
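A minimal discrete-time sketch of Eqs. (1)-(7) is given below; the dictionary-based data layout is an assumption made for illustration, not the format produced by the tablet.

```python
import numpy as np

def inclination_vector(theta, phi):
    """3D pen-inclination vectors of Eq. (4), one row per sample."""
    return np.stack([np.sin(theta) * np.cos(phi),
                     -np.cos(theta) * np.cos(phi),
                     np.sin(phi)], axis=1)

def matching_errors(ref, test):
    """Per-sample errors of Eqs. (1)-(3) between two time-aligned signatures.

    ref and test are dictionaries with equal-length arrays 'x', 'y', 'p',
    'theta', 'phi' (an assumed layout).
    """
    L = np.hypot(ref['x'] - test['x'], ref['y'] - test['y'])   # Eq. (1)
    P = np.abs(ref['p'] - test['p'])                           # Eq. (2)
    vr = inclination_vector(ref['theta'], ref['phi'])
    vi = inclination_vector(test['theta'], test['phi'])
    cos_angle = np.sum(vr * vi, axis=1) / (
        np.linalg.norm(vr, axis=1) * np.linalg.norm(vi, axis=1))
    A = np.arccos(np.clip(cos_angle, -1.0, 1.0))               # Eq. (3)
    return L, P, A

def accumulated_error(err, start, length):
    """Discrete version of Eqs. (5)-(7): mean error over one accumulation window."""
    return float(np.mean(err[start:start + length]))
```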
Fig. 2. An example of acquired signature and time series data
For the purpose of comparing the EER with respect to each feature, we use three kinds of data sets (an XY position data set, a pen pressure data set, and a pen inclination data set) for the three kinds of DP matching schemes. The references are made as follows:
1) Inspect the matching parts of the signatures used for registration, and define the sample with the longest writing time as the 'parent' data. The others are defined as 'child' data.
2) Compand the time axis of each child sample to match the time axis of the parent data by DP matching based on (1) the First name part and Second name part, (2) characters, or (3) strokes (a DP-matching sketch is given after this list).
3) After companding, make the reference data set by averaging within each time slot.
Verification is done by comparing the error between the reference data and the acquired data while varying the threshold.
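A minimal sketch of the DP (DTW) alignment and of the companding of a child sample onto the parent time axis is shown below; the unconstrained warping window and the per-slot averaging are simplifications of our own, not necessarily the exact settings used in the experiments reported here.

```python
import numpy as np

def dtw_path(parent, child):
    """Classic DP alignment between two 1D sequences.

    Returns the list of (parent_index, child_index) pairs on the optimal path.
    """
    n, m = len(parent), len(child)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(float(parent[i - 1]) - float(child[j - 1]))
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack the optimal warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def compand_to_parent(parent, child):
    """Warp a child signal onto the parent time axis (one value per parent slot).

    When several child samples map to one parent slot they are averaged, which
    is one plausible reading of 'averaging in each time slot'.
    """
    parent = np.asarray(parent, dtype=float)
    child = np.asarray(child, dtype=float)
    warped = np.zeros(len(parent))
    counts = np.zeros(len(parent))
    for i, j in dtw_path(parent, child):
        warped[i] += child[j]
        counts[i] += 1
    return warped / np.maximum(counts, 1)
```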
3 Problem in Segmental Signature Verification and Proposal of Three Kinds of DP Matching Schemes In making a reference using segmental signatures, we often face a problem caused by points being matched incorrectly. Fig. 3 shows the accumulated XY position error given by Eq. (5), the XY position error, and the pen pressure along the signature time. It can be seen that the error increases rapidly at the times when the pen pressure becomes zero (marked by asterisks). From that point on, the accumulated error is on an upward trend. This means that correct matching becomes difficult once the matching has failed. In order to solve this problem, we propose three kinds of matching schemes.
Fig. 3. Accumulated L(t), L(t) and P(t) along time
3.1 DP Matching After Dividing Signature into First Name Part and Second Name Part By using the frequency of x coordinates, we divide a signature into characters as shown in Fig. 4. After dividing, we form the First name part and the Second name part manually. DP matching is done for each part by minimizing the accumulated error given in Eq. (5), (6), or (7).
Fig. 4. Dividing a signature into characters using the frequency of x coordinates
3.2 DP Matching After Dividing Signature into Characters After dividing the signature into characters as shown in Fig. 4, DP matching is done for each character separately. The accumulated error is reset at the start of each character.
3.3 DP Matching After Dividing Signature into Strokes By using the pen pressure data, we divide the signature into multiple strokes; no turning points [8,9] are used in the segmentation. DP matching is done for each stroke, with the accumulated error reset at the start of each stroke.
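A minimal sketch of the pressure-based stroke segmentation and of the per-stroke matching with error reset is shown below; the minimum stroke length is an illustrative filter, and dtw_path() refers to the alignment sketch given earlier.

```python
import numpy as np

def split_into_strokes(pressure, signal, min_length=5):
    """Split a signature into pen-down strokes using the pressure channel.

    A stroke is a maximal run of samples with non-zero pressure; min_length is
    an assumed filter for spurious one- or two-sample strokes.
    signal: 1D array (e.g. the x coordinate sequence) sliced per stroke.
    """
    pen_down = np.asarray(pressure) > 0
    strokes, start = [], None
    for t, down in enumerate(pen_down):
        if down and start is None:
            start = t
        elif not down and start is not None:
            if t - start >= min_length:
                strokes.append(np.asarray(signal[start:t], dtype=float))
            start = None
    if start is not None and len(pen_down) - start >= min_length:
        strokes.append(np.asarray(signal[start:], dtype=float))
    return strokes

def stroke_wise_distance(ref_strokes, test_strokes):
    """Per-stroke DP matching: the accumulated error is reset at every stroke.

    Assumes both signatures were segmented into the same number of strokes.
    """
    total = 0.0
    for r, s in zip(ref_strokes, test_strokes):
        path = dtw_path(r, s)
        total += sum(abs(r[i] - s[j]) for i, j in path) / len(path)
    return total / max(len(ref_strokes), 1)
```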
4 Performance of Proposed DP Matching Schemes The three kinds of DP matching schemes are compared using a Japanese signature database with 14 writers, 4 signatures per day, over 30 days. The total number of signatures is 1680. The intersession variability of the EER over the days is used as a measure, as shown in Fig. 5, Fig. 6 and Fig. 7. In Fig. 5, the EERs of the three schemes using the XY position are compared with the EER of the whole-signature DP matching scheme. From the figure, the effect of the division is clear; in the case of stroke division, the EER between sessions is almost zero.
Fig. 5. Comparison of Intersession variability along days using XY position
Fig. 6. Comparison of Intersession variability along days using pressure
Fig. 6 shows the EERs using pressure. These EERs are worse than those obtained using the XY position. However, we can still see the effect of dividing the signature. As with the pen position, stroke division shows the best performance. Fig. 7 shows the EERs using inclination. These EERs are not much improved even by stroke division.
Fig. 7. Comparison of Intersession variability along days using inclination
5 Conclusion In this paper, we divided a signature into name parts, characters and strokes, and thereby improved the performance of DP matching and reduced the EER. Using a Japanese signature dataset, the XY position with stroke division achieves an EER of 0.00244% for 14 writers' signatures with intersession variability over one month.
References
[1] Cyber-Sign, http://www.cybersign.com/
[2] SOFTPRO, http://www.signplus.com/en/
[3] Galbally, J., Fierrez, J., Freire, M.R., Ortega-Garcia, J.: Feature Selection Based on Genetic Algorithms for On-Line Signature Verification. In: IEEE Workshop on AIAT, pp. 198–203 (2007)
[4] Tanaka, M., Bargiela, A.: Authentication Model of Dynamic Signatures using Global and Local Features. In: IEEE 7th Workshop on Multimedia Signal Processing, pp. 1–4 (2005)
[5] Hangai, S., Yamanaka, S., Hamamoto, T.: On-Line Signature Verification Based on Altitude and Direction of Pen Movement. In: IEEE ICME 2000 (August 2000)
[6] Yamanaka, S., Kawamoto, M., Hamamoto, T., Hangai, S.: Signature Verification Adapting to Intersession Variability. In: IEEE ICME 2001 (August 2001)
[7] Kawamoto, M., et al.: Improvement of On-line Signature Verification System Robust to Intersession Variability. In: Tistarelli, M., Bigun, J., Jain, A.K. (eds.) ECCV 2002. LNCS, vol. 2359, pp. 168–175. Springer, Heidelberg (2002)
[8] Bierhals, N., Hangai, S., Scharfenberg, G., Kempf, J., Hook, C.: Extraction of Target Areas Based on Turning Points for Signature Verification – A Study on Signature/Sign Processed Dynamic Data Format SC37 N2442. Tech. Rep. of Biometrics Security Group (June 2008)
[9] ISO/IEC JTC1/SC37 N2442 (2008)
Ergodic HMM-UBM System for On-Line Signature Verification Enrique Argones Rúa, David Pérez-Piñar López, and José Luis Alba Castro Signal Theory Group, Signal Theory and Communications Department, University of Vigo {eargones,dperez,jalba}@gts.tsc.uvigo.es
Abstract. We propose a novel approach for on-line signature verification based on building HMM user models by adapting an ergodic Universal Background Model (UBM). State initialization of this UBM is driven by a dynamic signature feature. This approach inherits the properties of the GMM-UBM mechanism, such as minimizing overfitting due to scarcity of user training data and allowing a world-model type of likelihood normalization. This system is experimentally compared to a baseline state-of-the-art HMM-based online signature verification system using two different databases: the well known MCYT-100 corpus and a subset of the signature part of the BIOSECURE-DS2 corpus. The HMM-UBM approach obtains promising results, outperforming the baseline HMM-based system on all the experiments.
1 Introduction
Behavioral biometrics are based on measurements extracted from an activity performed by the user, in a conscious or unconscious way, and they are inherent to his/her own personality or learned behavior. In this sense, behavioral biometrics have some interesting advantages, like user acceptance and cancelability, but they still lack the same level of uniqueness as physiological biometrics. Among all the biometric traits that can be categorized as purely behavioral, the signature, and the way we sign, is the one that has the widest social acceptance for identity authentication. Furthermore, learning the dynamics of the real signer is a very difficult task when compared to replicating the shape of a signature. This is the main reason behind the research efforts conducted over the last decade on dynamic or on-line signature verification. On-line signature verification approaches can be coarsely categorized depending on the feature extraction process and on the signature representation and matching strategy. A wide variety of features can be extracted from a dynamic signature, and they are usually divided into local and global [1,2,3]. Signature representation and matching strategies must cope with the large intra-user variability inherent to this problem, and they can be divided into template- and statistical-based.
This work has been partially supported by Spanish Ministry of Science and Innovation through project TEC2008-05894.
Template-based matching approaches tend to rely on Dynamic Time Warping to perform training–test signature alignment with different distance measurements [4][5][6]. Statistical-based approaches mostly rely on representing the underlying distribution of dynamic data with Gaussian Mixture Models [7] or, exploiting the analogy to speech recognition, on representing the signature as a first-order Hidden Markov Model with continuous output observations modeled as Gaussian mixtures [8][9]. We propose a novel approach based on building HMM user models by adapting a Universal Background Model (UBM). This UBM is an ergodic HMM with a fixed number of states driven by a dynamic signature feature. This approach has two main advantages: i) it inherits the properties of the GMM-UBM mechanism successfully applied to speaker verification [10], like minimizing overfitting due to scarcity of user training data and allowing a “world-model” type of likelihood normalization that has also shown good results in signature verification [7], and ii) it allows a great deal of flexibility to automatically accommodate signature-dependent dynamics by adapting the ergodic HMM-UBM to each user's data. The rest of the paper is organized as follows. Section 2 is dedicated to introducing the baseline system to compare with. This is largely based on the work in [8]. Here we use the same set of features and the HMM topology that gave them the best performance, but we do not use any type of score normalization, to avoid introducing the same improvements into the compared systems, which could distort the objective of this work. Section 3 is the theoretical core of the article and is devoted to explaining the ergodic HMM-UBM scheme applied for user adaptation. Section 4 explains the experimental framework, where we have used two databases: the publicly available MCYT-100 and a subset of the signature part of the BIOSECURE-DS2 database1. Section 5 shows the experimental results and analyses them.
2 Baseline System: User-Specific HMMs
An HMM [12] is a statistical model with an unobserved first-order discrete Markov process. Each state of this hidden (non-observable) Markov chain is associated with a probability distribution. The observations of an HMM are produced according to these distributions, and hence according to the hidden state. An HMM is formally defined by the following parameters:
– S = {S1, . . . , SN}: the state set of the Markov chain, where N is the number of states.
– A = {aij}, i, j ∈ {1, . . . , N}: the state transition matrix, with aij = P(qt = Sj | qt−1 = Si).
– B = {bi(x)}, i ∈ {1, . . . , N}, where bi(x) is the probability density function of the observations at state Si.
– Π = {πi}: the set of initial state probabilities, where πi = P(q0 = Si).
This database will soon be publicly available and comprises three different multimodal datasets DS1, DS2 and DS3 [11].
Output distributions bi(x) are commonly modeled as Gaussian Mixture Models (GMMs): b_i(x) = \sum_{m=1}^{M} w_{i,m} N(x, \mu_{i,m}, \Sigma_{i,m}), where N(x, \mu_{i,m}, \Sigma_{i,m}) denotes a multivariate normal distribution with mean \mu_{i,m} and covariance matrix \Sigma_{i,m}. An HMM is fully characterized by the set Λ = {A, B, Π}. In a state-of-the-art HMM-based on-line signature verification system, an HMM Λu is used to model the K reference signatures provided by the user u at the enrolment stage. The verification score of a given signature O = {o0, . . . , oT−1} claiming the identity u is calculated as P(O|Λu). This probability is usually approximated using the Viterbi algorithm. For the baseline experiment the number of states was 2, like the best-performing system in [8], but we allowed different numbers of mixtures per state, M = {16, 32, 64, 128}. This baseline system and the HMM-UBM presented in the next Section share the same feature space. The feature extraction proposed in [8] is used, resulting in vectors with 14 parameters.
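A minimal sketch of such a user-specific model is given below, using the hmmlearn library; the diagonal covariances, the number of training iterations and the length normalization of the score are assumptions made for illustration, not the configuration of [8].

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_user_hmm(enrolment_features, n_states=2, n_mix=32):
    """Train an HMM with GMM output densities on K enrolment signatures.

    enrolment_features: list of (T_k, 14) arrays, one per reference signature.
    Note: no topology constraint is enforced in this sketch.
    """
    X = np.vstack(enrolment_features)
    lengths = [len(f) for f in enrolment_features]
    model = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type='diag', n_iter=20)
    model.fit(X, lengths)
    return model

def verification_score(model, test_features):
    """Log-likelihood of the test signature under the claimed user's HMM.

    The division by the signature length is a common practical choice for
    making scores of different signatures comparable; it is an assumption here.
    """
    return model.score(test_features) / len(test_features)
```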
3 HMM-UBM System: Adapted HMMs
In an UBM-HMM verification system an user-independent HMM is trained using signatures from many different users. This HMM plays the role of UBM. Then, user models are obtained adapting the HMM statistics to the enrolment signatures. However, the use of an UBM-HMM approach in on-line signature verification faces us to a new problem: signatures must be labeled as a previous step to the Baum-Welch training. This labeling must be designed on a user-independent basis, i.e., an user-independent approach must be adopted. The usual arbitrary partition of the signature in contiguous segments of approximately the same length is not valid here, since these partitions, and therefore the resulting output distributions, are strongly user-dependent. One solution is to cluster the feature space on the basis of dynamic characteristics. An HMM trained using this strategy can play the role of Universal Background Model (UBM), since the meaning of the states is user independent. This model will characterize signatures on a common basis, grouping features in meaningful states. The HMM-UBM system proposed in this paper uses an activity detector for the task of state labeling. The use of an activity detector to drive the initial state labeling results in an ergodic structure in the trained UBM, where states are related to signature dynamics. This activity detector can generate output sequences of states according to the level and persistence of the dynamic input. In our case, the activity detector can be set up to produce sequences with two or five different states. This is illustrated in Figure 1. Two different dynamic characteristics are proposed as activity detector inputs for state labeling: log-velocity and pressure. Logarithms are applied to velocity in order to give the activity detector a better sensitivity to low velocity values. An important advantage of UBM systems is that they are more resiliant to overfitting effects, since MAP-adapted versions of the UBM [13] can work as user
Fig. 1. Activity detector: two different granularity levels are provided. Output state sequences have two or five differentiated states, which will produce two- or five-state UBMs, respectively.
models. MAP adaptation can produce accurate models despite the scarcity of training samples [10]. Verification scores are obtained as the log-likelihood ratio between the user HMM and the UBM, in the same manner as in a GMM-UBM system [14]. Only adaptation of the output distribution means \mu_{i,m} is performed. Assuming we have observed

c_{i,m} = \sum_{s=0}^{S-1} \sum_{t=0}^{T-1} \gamma_t(i,s) \, \frac{w_{i,m} N(x_t, \mu_{i,m}, \Sigma_{i,m})}{\sum_{k=1}^{M} w_{i,k} N(x_t, \mu_{i,k}, \Sigma_{i,k})}

soft
counts of samples associated with the mth Gaussian component of the state Si, where γt(i, s) is the probability of the state i at time t in the enrolment sequence s, and S is the number of available enrolment sequences for user u, then the MAP estimates of the model means for user u can be written as:

\hat{\mu}_{i,m} = \hat{\mu}^{ML}_{i,m} \, \frac{c_{i,m} r_{i,m}}{r_{i_0,m} + c_{i,m} r_{i,m}} + \mu_{i,m} \, \frac{r_{i_0,m}}{r_{i_0,m} + c_{i,m} r_{i,m}},    (1)
where \hat{\mu}^{ML}_{i,m} is the ML mean estimate given the enrolment data, \mu_{i,m} is the value of the mean in the UBM, r_{i,m} = 1/\sigma^2_{i,m} are the precisions, and r_{i_0,m} are the prior precisions. A more complete description of the MAP formulae can be found in [13]. The experiments carried out in this paper involve UBM-HMM systems with initial state labeling driven by pressure or log-velocity, and with several complexity levels. The number of states of the UBMs will be two or five, to account for a coarse or a more precise dynamic activity clustering. Besides, different numbers of mixtures per state, M = {16, 32, 64, 128}, are used, as in the baseline system.
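A minimal sketch of the MAP adaptation of the means in Eq. (1) is given below; the array shapes are an assumed layout for a diagonal-covariance UBM, not a description of any particular implementation.

```python
import numpy as np

def map_adapt_means(ml_means, ubm_means, soft_counts, precisions, prior_precisions):
    """MAP adaptation of the output-distribution means, following Eq. (1).

    ml_means, ubm_means: arrays of shape (n_states, n_mixtures, dim).
    soft_counts: shape (n_states, n_mixtures, 1), the c_{i,m} values.
    precisions, prior_precisions: r_{i,m} and r_{i0,m}; either scalars per
    mixture (broadcast) or per-dimension arrays for diagonal covariances.
    Only the means are adapted, as in the paper.
    """
    c_r = soft_counts * precisions            # c_{i,m} * r_{i,m}
    denom = prior_precisions + c_r            # r_{i0,m} + c_{i,m} r_{i,m}
    return ml_means * (c_r / denom) + ubm_means * (prior_precisions / denom)
```

With a large soft count the adapted mean moves towards the ML estimate from the enrolment data; with little data it stays close to the UBM mean, which is exactly the robustness-to-scarcity behaviour described above.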
4 Experimental Framework
Experiments outlined in this paper were run on the dynamic signature subcorpus of two databases: MCYT-100 and Biosecure Data Set 2 (DS2). Both databases share several common characteristics, but differ in key aspects, such as the skill level of the forgeries. Both databases are described in this section, along with the experimental protocol used.
4.1 MCYT-100
The MCYT bimodal biometric database [15] consists of fingerprint and on-line signature modalities. Dynamic signature sequences were acquired with a WACOM INTUOS A6 USB pen tablet capable of detecting pen height, so that pen-up movements are also considered. Data was captured at 100 Hz and includes the position on both axes, pressure, azimuth angle and altitude angle, both referred to the tablet. The sampling frequency used for acquisition leads to a precise discrete-time signature representation, taking into account the frequency distribution of the biomechanical sequences. MCYT-100 is a subcorpus of 100 users from the MCYT database. The signature modality used in our experiments includes both genuine signatures and shape-based skilled forgeries with natural dynamics, generating low-force forgeries, as defined in [16]: impostors were asked to imitate the shape while trying to generate a smooth and natural signature. Each user in the dataset has 50 associated signatures, of which 25 are genuine and 25 are skilled forgeries generated by subsequent contributors.
4.2 Biosecure DS2 On-Line Signature Database
BIOSECURE-DS2 (Data Set 2) is part of the Biosecure Multimodal Database captured by 11 international partners under the BioSecure Network of Excellence. For our experiments, we have used the dynamic signature modality of a subset of 104 contributors. Signature acquisition was carried out using a procedure similar to the one conducted in MCYT [15]. Pen coordinates, pressure, azimuth and altitude signals are available. Signatures were captured in two acquisition sessions, producing 30 genuine signatures and 20 forgeries for each user. Forgery generation involved training impostors with static and dynamic information by means of a dynamic representation of the signature to imitate, obtaining brute-force skilled forgeries, as defined in [16]. This characteristic gives this dataset a higher level of difficulty.
4.3 Experimental Protocols
The user-specific HMM system is used as a baseline to assess the capabilities of the UBM-HMM system. Tests are performed on the MCYT-100 and the DS2 databases. All users from MCYT-100 are pooled in a single group. Only a posteriori results are provided for this dataset. In contrast, users from BIOSECURE-DS2 are divided into two disjoint groups of 50 users each to build an open-set protocol. Two-fold cross-validation is then used to provide a priori results. The MCYT-100 corpus is divided into two partitions: 10 genuine signatures for each user are defined as the training partition, and the remaining 15 genuine signatures and all 25 low-force skilled forgeries [16] are used for testing. The Biosecure DS2 corpus is also divided into a training partition with 10 genuine
Table 1. WER%(1.0) of the different systems on the MCYT-100 and BIOSECURE-DS2 datasets. LV and P stand for LogVelocity- and Pressure-driven state labeling, respectively.
Dataset                      M    HMM-UBM    HMM-UBM    HMM-UBM    HMM-UBM    User-specific
                                  LV, N = 2  LV, N = 5  P, N = 2   P, N = 5   HMM
MCYT-100                     16   4.71       3.08       5.31       3.75       6.86
(a posteriori WER%(1.0))     32   4.11       3.76       4.43       3.74       5.90
                             64   3.29       3.94       3.47       3.60       4.97
                             128  3.46       3.49       3.52       3.34       7.72
BIOSECURE-DS2                16   6.83       6.99       7.90       7.18       13.24
(a priori WER%(1.0))         32   6.34       8.58       6.44       6.35       8.75
                             64   5.92       5.82       6.64       6.57       7.90
                             128  8.59       6.11       6.30       5.77       8.75
signatures for each user and a test partition with 20 genuine signatures and 20 brute-force skilled forgeries. Both user-specific HMM training for the baseline system and HMM adaptation in the HMM-UBM system are performed on the training partitions. In the HMM-UBM system, the second group from BIOSECURE-DS2 is used to train the UBM for the MCYT-100 dataset, whereas for the BIOSECURE-DS2 dataset the first group's UBM is trained using the second group's data and vice versa.
5 Experimental Results
Performance of both systems is evaluated in two ways. DET curves [17] are provided for a visual comparison of the systems, and Weighted Error Rates (WER) are also provided for a numerical comparison of system performance. The WER is defined as

WER(R) = \frac{FRR + R \cdot FAR}{1 + R}.

Table 1 shows the WER%(1.0) of the HMM-UBM and user-specific HMM systems on the MCYT-100 and BIOSECURE-DS2 datasets. Comparable results are obtained for all HMM-UBM systems in each dataset. In the case of the MCYT-100 dataset, the best results are obtained for the HMM-UBM with LogVelocity-driven initialization, five states and 16 mixtures per state, with an a posteriori WER%(1.0) = 3.08. In the case of the BIOSECURE-DS2 dataset, the best results correspond to the HMM-UBM with Pressure-driven initialization, five states and 128 mixtures per state. In all cases, HMM-UBM systems obtain better results than the baseline system. For a fair comparison between datasets, the HMM-UBM system with LogVelocity-driven initialization, M = 64, N = 5 was evaluated on the MCYT-100 dataset, using the BIOSECURE-DS2 dataset for threshold computation. A WER%(1.0) = 4.49 is obtained, which is still lower than the best WER%(1.0) = 5.77 obtained for the BIOSECURE-DS2 dataset. Figure 2 shows DET curves of the best HMM-UBM and user-specific HMM systems on both the BIOSECURE-DS2 and MCYT-100 datasets. It is easy to see that the HMM-UBM DET curves are below the baseline DET curves for both the
Fig. 2. DET curves of the best HMM-UBM and best user-specific HMM systems on the MCYT-100 and BIOSECURE-DS2 databases (user-specific HMMs with 64 GM/state on both datasets; P-driven HMM-UBM with 5 states and 128 GM/state on DS2; LV-driven HMM-UBM with 5 states and 16 GM/state on MCYT-100)
MCYT-100 and BIOSECURE-DS2 datasets, which reinforces the superiority of the proposed UBM-HMM system. A second outcome emerges from these curves: the systems obtain worse a posteriori performance on BIOSECURE-DS2 than on the MCYT-100 dataset. This behaviour confirms the BIOSECURE-DS2 dataset as a challenging scenario for current on-line signature verification systems, due to the quality of its forgeries. These high-quality brute-force skilled forgeries cause an error rate increase of about 87% for the best HMM-UBM systems and of 59% for the user-specific HMM system, both relative to their respective MCYT-100 counterparts.
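For reference, the WER(R) criterion defined in Section 5 is trivially computed from the FAR and FRR at a given operating point; the minimal sketch below uses illustrative error rates, not values from Table 1.

```python
def weighted_error_rate(frr, far, r=1.0):
    """WER(R) = (FRR + R*FAR) / (1 + R); r = 1.0 gives the WER%(1.0) used here."""
    return (frr + r * far) / (1.0 + r)

# Hypothetical operating point: FRR = 4%, FAR = 2%, equally weighted.
print(weighted_error_rate(0.04, 0.02))  # 0.03, i.e. WER%(1.0) = 3%
```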
6 Conclusions and Future Work
In this paper we have presented an HMM-UBM approach for on-line signature verification. This approach is compared to a state-of-the-art on-line verification system based on user-specific HMMs on two different databases: MCYT-100 and BIOSECURE-DS2. The HMM-UBM approach obtains promising results, outperforming the user-specific HMM system in all the experiments. Furthermore, the on-line signature part of the BIOSECURE-DS2 database is presented. The brute-force skilled forgeries contained in this corpus make it a challenging scenario for on-line signature verification, as demonstrated in the experiments, where error rates are significantly worse on the BIOSECURE-DS2 database than on the MCYT-100 corpus. Future work will include new experimental comparisons of the HMM-UBM system and other state-of-the-art on-line signature verification systems, including score normalization techniques, such as cohort normalization, that can significantly improve verification performance, as demonstrated in [4,15,8].
References
1. Fiérrez-Aguilar, J., Nanni, L., López-Peñalba, J., Ortega-García, J., Maltoni, D.: An on-line signature verification system based on fusion of local and global information. In: IEEE International Conference on Audio- and Video-Based Person Authentication, pp. 523–532 (2005)
2. Richiardi, J., Ketabdar, H., Drygajlo, A.: Local and global feature selection for on-line signature verification. In: Proc. IAPR 8th International Conference on Document Analysis and Recognition, ICDAR (2005)
3. Vielhauer, C., Steinmetz, R., Mayerhofer, A.: Biometric hash based on statistical features of online signatures. In: Proceedings of the 16th International Conference on Pattern Recognition, vol. 1, pp. 123–126 (2002)
4. Jain, A.K., Griess, F.D., Connell, S.D.: On-line Signature Verification. Pattern Recognition 35(1), 2963 (2002)
5. Faúndez-Zanuy, M.: On-line signature recognition based on VQ-DTW. Pattern Recognition 40(3), 981–992 (2007)
6. Schimke, S., Vielhauer, C., Dittmann, J.: Using adapted Levenshtein distance for on-line signature authentication. In: ICPR (2), pp. 931–934 (2004)
7. Richiardi, J., Drygajlo, A.: Gaussian mixture models for on-line signature verification. In: Intl. Multimedia Conf., Proc. ACM SIGMM Workshop on Biometrics Methods and Applications, pp. 115–122 (2003)
8. Fiérrez-Aguilar, J., Ortega-García, J., Ramos, D., González-Rodríguez, J.: HMM-based on-line signature verification: Feature extraction and signature modeling. Pattern Recognition Letters 28(16), 2325–2334 (2007)
9. Van, B.L., Garcia-Salicetti, S., Dorizzi, B.: Fusion of HMM's likelihood and Viterbi path for on-line signature verification. In: European Conference on Computer Vision, pp. 318–331 (2004)
10. Reynolds, D.A., Quatieri, T.F., Dunn, R.B.: Speaker Verification using Adapted Gaussian Mixture Models. Digital Signal Processing 10, 19–41 (2000)
11. Ortega-García, J., et al.: The multi-scenario multi-environment BioSecure Multimodal Database (BMDB). IEEE Transactions on Pattern Analysis and Machine Intelligence (2009)
12. Rabiner, L.R.: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, vol. 77, pp. 257–286. IEEE Computer Society Press, Los Alamitos (1989)
13. Huang, X., Acero, A., Hon, H.W.: Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall PTR, Upper Saddle River (2001)
14. Martínez-Díaz, M., Fiérrez-Aguilar, J., Ortega-García, J.: Universal background models for dynamic signature verification. In: IEEE International Conference on Biometrics: Theory, Applications, and Systems, pp. 1–6 (2007)
15. Ortega-Garcia, J., Fierrez-Aguilar, J., Simon, D., Gonzalez, J., Faundez-Zanuy, M., Espinosa, V., Satue, A., Hernaez, I., Igarza, J.-J., Vivaracho, C., Escudero, D., Moro, Q.-I.: MCYT baseline corpus: a bimodal biometric database. IEE Proceedings: Vision, Image and Signal Processing 150, 395–401 (December 2003)
16. Wahl, A., Hennebert, J., Humm, A., Ingold, R.: Generation and Evaluation of Brute-Force Signature Forgeries. In: Gunsel, B., Jain, A.K., Tekalp, A.M., Sankur, B. (eds.) MRCS 2006. LNCS, vol. 4105, pp. 2–9. Springer, Heidelberg (2006)
17. Martin, A., Doddington, G., Kamm, T., Ordowski, M., Przybocki, M.: The DET Curve in Assessment of Detection Task Performance. In: European Conference on Speech Communication and Technology, pp. 1895–1898 (1997)
Improving Identity Prediction in Signature-based Unimodal Systems Using Soft Biometrics Márjory Abreu and Michael Fairhurst Department of Electronics, University of Kent, Canterbury, Kent CT2 7NT, UK {mcda2,M.C.Fairhurst}@kent.ac.uk
Abstract. System optimisation, where even small individual system performance gains can often have a significant impact on applicability and viability of biometric solutions, is an important practical issue. This paper analyses two different techniques for using soft biometric information (which is often already available or easily obtainable in many applications) to improve identity prediction accuracy of signature-based tasks. It is shown that such a strategy can improve performance of unimodal systems, supporting high usability profiles and low-cost processing.
1 Introduction
Biometrics-based systems for individual identity authentication are now well established [1]. Many factors will influence the design of a biometric system and, in addition to obvious performance indicators such as accuracy rates, issues concerning reliability, flexibility, security and usability are extremely important. It is necessary to understand and evaluate each part of the process in developing these systems. The advantages and disadvantages of the many different modalities available (fingerprint, face, iris, voice, handwritten signature, etc.) are well documented, and a wide variety of different classification/matching techniques have been extensively explored. There are many proposed approaches which use multimodal solutions to improve accuracy, flexibility and security, though these solutions can be of relatively high cost and diminished usability [2]. Thus, where possible, unimodal solutions are still important and, indeed, attempts to improve the performance of such configurations are widely reported. One approach which aims to improve the accuracy of unimodal systems is to include non-biometric (soft biometric) information in the decision-making process. In the main, the reported work in this area incorporates relevant information (such as gender, age, handedness, etc.) in order to help to narrow the identity-related search space [3], [4], [5], [6], [7], [8], but the precise methodology adopted is sometimes quite arbitrary. This paper presents an investigation of two rather different but potentially very effective techniques for including soft biometric information into the overall identification decision. Although we will focus in this paper specifically on the
handwritten signature as the biometric modality of choice, only one of our proposed approaches is modality-dependent, giving a valuable degree of generality to our study.
2 Techniques for Exploiting Soft Biometrics
The available literature shows a relatively modest number of examples of using soft biometrics to improve accuracy; yet, when this approach is adopted, it generally leads to improved performance. In this paper we propose two different ways of using such information in a more efficient way than has been adopted hitherto.
2.1 Soft Biometrics as a Tool for Feature Selection
In this approach, the soft biometric information works as a feature selector, the selection being related to the demographic information that is saved in a Knowledge Database of the system. Fig. 1 shows how the process is realised. During the training phase, it is important to understand the relationship between the dynamic features and the soft biometric information, from which the system is able to choose the most suitable features for each user. This is the information that will be stored in the Knowledge Database module and will be used in the feature selection. The feature analysis is carried out, after training the system with all the features, as follows:
1. Select all the static features and test the system, saving the partitioned error rates related to the soft biometrics.
2. For each dynamic feature:
– Add this feature to the vector of static features,
– Test this new vector in the system,
– Save the error rates for each related soft biometric value.
3. Once the system has been tested with all the different feature combinations and the partitioned error rates saved, the analysis of each dynamic feature is carried out with respect to each item of soft biometric information:
– Save in the Knowledge Database module each dynamic feature that generated better performance than using only the static features.
As an example to illustrate the general operation of this method, a hypothetical three-feature (fea1, fea2 and fea3) biometric-based identity prediction task is considered. In this task let us assume that the features fea1 and fea2 are static features and fea3 is a dynamic feature. The soft-biometric information of the user is known and is recorded as either “X” or “Y” (these labels representing appropriate values depending on the particular instance of soft biometric information; for instance, for gender the labels will be “male” and “female”). After the training phase with all the features (fea1, fea2 and fea3), the system is first tested with fea1 and fea2 (only static features). The partitioned error rates for each of “X” and “Y” might then appear as
Table 1. Example: Fictional illustrative data

Static Features
System Error Mean    Soft-Biometric “X”    Soft-Biometric “Y”
10.21%               5.40%                 4.81%

Static Features + Dynamic Feature
System Error Mean    Soft-Biometric “X”    Soft-Biometric “Y”
9.54%                4.54%                 5.00%
in Table 1. The next step is to test the system with fea1, fea2 and fea3 and also save the partitioned error rates. This information can also be seen in Table 1. If there is any gain in accuracy with respect to the partitioned error rates, then this dynamic feature is seen to improve performance. The information stored in the Knowledge Database module (Fig. 1) during the training phase is as follows:
– if the soft-biometric information is tagged “X”, then fea3 can improve accuracy;
– if the soft-biometric information is tagged “Y”, then fea3 does not contribute to improved accuracy.
Once the features are selected, the corresponding input values are presented to the system and the system computes its result.
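A minimal sketch of the partitioned-error-rate comparison and of the resulting Knowledge Database entry for the example above is given below; the per-sample 0/1 error representation is an assumption made for illustration.

```python
import numpy as np

def partitioned_error_rates(errors, groups):
    """Mean error rate for each soft-biometric group label (e.g. 'X' / 'Y').

    errors: array of 0/1 classification errors per test sample.
    groups: array of soft-biometric labels per test sample.
    """
    errors, groups = np.asarray(errors), np.asarray(groups)
    return {g: float(np.mean(errors[groups == g])) for g in np.unique(groups)}

def build_knowledge_database(static_errors, combined_errors, groups):
    """Record, per soft-biometric group, whether the dynamic feature helps.

    static_errors: per-sample errors using only static features.
    combined_errors: per-sample errors using static features + the dynamic feature.
    """
    base = partitioned_error_rates(static_errors, groups)
    with_feature = partitioned_error_rates(combined_errors, groups)
    return {g: with_feature[g] < base[g] for g in base}

# With the fictional data of Table 1 the stored rule would be {'X': True, 'Y': False}:
# fea3 is selected for users tagged 'X' and discarded for users tagged 'Y'.
```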
Fig. 1. Soft Biometric as feature selector
Fig. 2. Soft Biometric as input features
2.2 Soft Biometric as an Extra Input Feature
In this approach, the soft biometric information functions, in essence, as an extra input feature. The soft biometric information is used in the same way as any other biometric feature and is simply added to the input vector. Fig. 2 shows a schematic illustration of this process. This information can be added into our analysis of system performance, where these additional characteristics effectively define sub-groups within our overall
test population. These new information sources contribute to the system input in the same way as the extracted sample features, which requires the integration of the additional features. Using the same example introduced in Section 2.1, the input of the system will be the three features (fea1, fea2 and fea3) and the additional soft biometric feature, designated soft-fea here. In this case, the soft-fea input will be a binary value, 10 for “X” and 01 for “Y”, using Hamming Distance coding [9].
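A minimal sketch of this augmentation is shown below; the two-bit codes follow the “10”/“01” convention just described, and the numeric feature values in the example are hypothetical.

```python
import numpy as np

# Hamming-style binary codes for the soft-biometric labels ('10' / '01').
SOFT_CODES = {'X': [1.0, 0.0], 'Y': [0.0, 1.0]}

def augment_with_soft_biometric(feature_vector, soft_label):
    """Append the coded soft-biometric value to the biometric feature vector."""
    return np.concatenate([np.asarray(feature_vector, dtype=float),
                           SOFT_CODES[soft_label]])

# Example: a hypothetical three-feature sample from a user tagged 'X'.
print(augment_with_soft_biometric([0.31, 1.25, 0.07], 'X'))
# -> [0.31 1.25 0.07 1.   0.  ]
```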
3 A Case Study
In order to analyse how these two approaches compare in an application based on real data, we now present a practical Case Study.
3.1 Database
The multimodal database used in this work was collected in the Department of Electronics at the University of Kent [10] as part of the Europe-wide BioSecure Project. In this database, there are samples of Face, Speech, Signature, Fingerprint, Hand Geometry and Iris biometrics of 79 users collected in two sessions. In the work reported here we have used the signature samples from both sessions.

Table 2. Signature features

Feature                                  Type
Execution Time                           Dynamic
Pen Lift                                 Dynamic
Signature Width                          Static
Signature Height                         Static
Height to Width Ratio                    Static
Average Horizontal Pen Velocity in X     Dynamic
Average Horizontal Pen Velocity in Y     Dynamic
Vertical Midpoint Pen Crossings          Dynamic
Azimuth                                  Dynamic
Altitude                                 Dynamic
Pressure                                 Dynamic
Number of points comprising the image    Static
Sum of horizontal coordinate values      Static
Sum of vertical coordinate values        Static
Horizontal centralness                   Static
Vertical centralness                     Static
The database contains 25 signature samples for each subject, where 15 are samples of the subject’s true signature and 10 are attempts to imitate another user’s signature. In this investigation we have used only the 15 genuine signatures of each subject. The data were collected using an A4-sized graphics tablet with a density of 500 lines per inch. There are 16 representative biometric features extracted from each signature sample, as identified in Table 2. These features are chosen to be representative of those known to be commonly adopted in signature processing applications. All the available biometric features are used in the classification process as input to the system. During the acquisition of this database, the subjects were required to provide some additional information which constitutes exactly the sort of soft biometric data discussed above.
In particular, we recorded the following information about each user:
– Gender: male or female.
– Age information: for our purposes three age groups were identified, namely under 25 years, 25–60 years and over 60 years.
– Handedness: right or left.
In the study reported here, we have explored the users’ characteristics with respect to the age information and the handedness information. The possible values assigned to the age features use Hamming Distance coding [9] and are defined as 100 (< 25), 010 (25–60) or 001 (> 60), depending on the age group. In the same way, the possible values assigned to the handedness features are 10 (right) and 01 (left).
3.2 Classification Methods
In order to evaluate the effectiveness of the two proposed approaches, we chose four different classifiers (to allow a comparative study), as described below.
– Fuzzy Multi-Layer Perceptron (FMLP) [11]: this classifier incorporates fuzzy set theory into a multi-layer Perceptron framework, and results from direct “fuzzification” at the network level of the MLP, at the learning level, or at both.
– Support Vector Machines (SVM) [12]: this approach embodies a functionality very different from that of more traditional classification methods and is based on an induction method which minimizes the upper limit of the generalization error related to uniform convergence.
– K-Nearest Neighbours (KNN) [13]: in this method, the training set is seen as composed of n-dimensional vectors and each element represents a point in the n-dimensional space. The classifier estimates the k nearest neighbours in the whole dataset based on an appropriate distance metric (Euclidean distance in the simplest case).
– Optimized IREP (Incremental Reduced Error Pruning) (JRip) [14]: decision trees usually use pruning techniques to decrease the error rates on a dataset with noise, one approach to which is the Reduced Error Pruning method.
In order to guarantee robustness in the classification process, we chose a ten-fold cross-validation approach because of its relative simplicity, and because it has been shown to be statistically sound in evaluating the performance of classification tasks [15]. In ten-fold cross-validation, the training set is equally divided into ten different subsets. Nine out of the ten subsets are used to train the classifier and the tenth subset is used as the test set. The procedure is repeated ten times, with a different subset being used as the test set each time.
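A minimal sketch of the ten-fold cross-validation protocol using scikit-learn is given below; the classifier hyperparameters are illustrative, scikit-learn's MLPClassifier stands in for the fuzzy MLP, and JRip is omitted because no rule learner of that kind is available in scikit-learn.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def evaluate_classifiers(X, y, n_folds=10, k=3):
    """Ten-fold cross-validated error rate (%) for several standard classifiers.

    X: (n_samples, n_features) feature matrix; y: user identity labels.
    """
    classifiers = {
        'MLP': MLPClassifier(max_iter=1000),
        'SVM': SVC(kernel='rbf'),
        'KNN': KNeighborsClassifier(n_neighbors=k),
    }
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    results = {}
    for name, clf in classifiers.items():
        accuracy = cross_val_score(clf, X, y, cv=cv, scoring='accuracy')
        results[name] = 100.0 * (1.0 - accuracy.mean())   # error rate in %
    return results
```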
3.3 Experimental Results
In order to analyse the performance of our proposed new techniques for using soft biometric information to improve accuracy, it is important first to evaluate the performance of the classifiers without this additional information. Table 3 shows the error rates of the classifiers when all features (All-F) are used and when only static features (Sta-F) are used. The classifier that presents the best results, highlighted in bold in Table 3, is FMLP. It is important to note that for all the classifiers, using all features leads to an improvement of around 5% (corresponding to approximately 60 additional samples that are classified correctly). This is perhaps not in itself surprising, but confirms what can readily be determined from the available literature: that using dynamic features improves the accuracy of a signature-based classification task.

Table 3. Results without and with soft biometrics (error rates, %)

Classification results without soft biometrics
Methods   All-F          Sta-F
FMLP      10.21 ± 2.37   15.97 ± 3.69
Jrip      11.67 ± 2.96   18.22 ± 3.54
SVM       10.92 ± 2.31   16.31 ± 3.97
KNN       14.12 ± 2.31   22.84 ± 4.01

Age-based results
Methods   S-F/Age        All-F+Age      Sta-F+Age
FMLP      8.84 ± 1.98    9.54 ± 2.91    12.91 ± 2.57
Jrip      11.87 ± 1.74   10.28 ± 2.57   16.84 ± 2.91
SVM       8.69 ± 1.67    8.21 ± 2.64    13.73 ± 2.54
KNN       15.34 ± 1.39   12.47 ± 2.83   18.81 ± 2.88

Handedness-based results
Methods   S-F/Hand       All-F+Hand     Sta-F+Hand
FMLP      10.27 ± 2.89   10.81 ± 3.69   12.09 ± 1.58
Jrip      12.64 ± 2.81   11.27 ± 3.47   15.74 ± 1.79
SVM       9.57 ± 2.36    9.83 ± 3.59    11.94 ± 1.67
KNN       14.37 ± 2.47   13.97 ± 3.77   16.33 ± 1.88
Table 3 also shows the results for the same classifiers when the additional soft biometric information (age and handedness) is incorporated. The three columns of each block show, respectively, the error rate measured for selected features combined with soft biometrics (S-F), all features plus soft biometrics (All-F+Age / All-F+Hand) and static features plus soft biometrics (Sta-F+Age / Sta-F+Hand). From an analysis of Table 3, it is possible to see that adding selected dynamic features to the static features, or adding soft biometrics either to all features or only to static features, produces better performance than using only static features. The overall error rates are broadly related to the sophistication (and, generally therefore, complexity) of the classifiers, with the best results being obtained with the SVM and FMLP classifiers. Analysing the results based on the incorporation of the age-based additional information and using the t-test, it is possible to note that there is no statistical difference between the S-F/Age results and the All-F+Age results, while both of these results are statistically better than the Sta-F+Age results. Analysing the results when the handedness information is incorporated and using the t-test, it can be seen that all the results are statistically comparable. Fig. 3 and Fig. 4 show the comparison among all five different configurations according to the partitioned age bands and left and right handedness.
Fig. 3. Result of all classifiers in all configurations partitioned into age bands
Fig. 4. Result of all classifiers in all configurations partitioned into handedness
Careful examination of these results shows, however, how the choice of classifier and configuration can lead directly to an optimised process with respect to a given task objective (for example, when working within a particular age band, or when focusing on a particular population group).
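As an illustration of the fold-wise statistical comparison referred to above, the following sketch assumes SciPy together with hypothetical per-fold error-rate arrays; the exact t-test variant used by the authors is not specified, so a paired test over the ten folds is one plausible reading.

```python
from scipy import stats

def compare_configurations(errors_a, errors_b, alpha=0.05):
    """Paired t-test on the per-fold error rates of two configurations."""
    t_stat, p_value = stats.ttest_rel(errors_a, errors_b)
    return t_stat, p_value, p_value < alpha

# Hypothetical per-fold error rates (ten values each), e.g. S-F/Age vs. Sta-F+Age:
# t, p, significant = compare_configurations(errors_sf_age, errors_staf_age)
```

A significance level of 0.05 is assumed here; the paper does not state the level actually used.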
3.4 Enhancing Off-Line Processing
An issue of particular interest here relates to off-line signature processing and how it might be enhanced. Most implementations of signature verification systems use a wide range of feature types, but almost all assume the availability of the dynamic features, which are only extractable when on-line sample acquisition is possible. Yet there are still important applications where only off-line sampling is possible (remote bank cheque clearing, many document-based applications, etc.), and it is therefore instructive to consider the performance achievable when only static features can be used for processing. It is generally expected that static-only solutions will return poorer levels of performance than can be achieved with the much richer feature set available when dynamic features are incorporated and, indeed, the results shown above confirm this. It is interesting to consider the effect of enhancing the processing by incorporating soft biometrics, as can be seen when we add in the age-based information, with the results shown in Fig. 3 and Fig. 4. The improvement in performance which this brings about suggests a very valuable further benefit of an approach which seeks to exploit the availability of soft biometrics as a means of enhancing performance. In this case, such an approach provides the opportunity for significant enhancement in a scenario which has important practical implications, yet which is often especially limited by the inherent nature of the task domain.
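A minimal sketch of the enhancement just described, assuming NumPy and an illustrative static feature matrix (this is an assumed reconstruction, not the authors' code), appends the 3-bit age code to each off-line feature vector before classification:

```python
import numpy as np

def augment_with_age(static_features, ages):
    """Append the 3-bit age-group code (100 / 010 / 001) to each static feature row."""
    ages = np.asarray(ages)
    codes = np.zeros((len(ages), 3))
    codes[ages < 25, 0] = 1
    codes[(ages >= 25) & (ages <= 60), 1] = 1
    codes[ages > 60, 2] = 1
    return np.hstack([static_features, codes])

# X_sta_age = augment_with_age(X_sta, user_ages)   # i.e. the Sta-F+Age configuration
```

The same helper, with the 2-bit handedness code in place of the age code, would give the Sta-F+Hand configuration.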
4 Concluding Remarks
This paper has introduced two new techniques for incorporating soft biometric information so as to improve identity prediction accuracy. The results presented are very encouraging, and show how additional information that is often available, or can be explicitly collected, in practical scenarios can be exploited to enhance the identification process. Although the accuracy improvements tend to be modest (perhaps not surprisingly, given the small scale of this initial experimental study), the gains afforded can nevertheless make an impact in practical situations and provide a basis for further development of the general strategy proposed. Further work is required to develop optimisation procedures for configurations such as those investigated here, and to extend the analysis to integrate different types of soft biometric information. Already, however, the work reported here is beginning to offer some options to a system designer seeking to improve error-rate performance in unimodal systems, providing alternatives to the increased complexity and reduced usability incurred by multibiometric systems.
Acknowledgment

The authors gratefully acknowledge the financial support given to Mrs Abreu from CAPES (Brazilian Funding Agency) under grant BEX 4903-06-4.
References

1. Jain, A., Nandakumar, K., Nagar, A.: Biometric template security. EURASIP 8(2), 1–17 (2008)
2. Toledano, D.T., Pozo, R.F., Trapote, A.H., Gómez, L.H.: Usability evaluation of multi-modal biometric verification systems. Interacting with Computers 18(5), 1101–1122 (2006)
3. Franke, K., Ruiz-del-Solar, J.: Soft-biometrics: Soft-computing technologies for biometric-applications. In: AFSS, pp. 171–177 (2002)
4. Jain, A., Dass, S., Nandakumar, K.: Can soft biometric traits assist user recognition? In: ICBA, pp. 561–572 (2004)
5. Jain, A., Nandakumar, K., Lu, X., Park, U.: Integrating faces, fingerprints, and soft biometric traits for user recognition. In: ECCV Workshop BioAW, pp. 259–269 (2004)
6. Zewail, R., Elsafi, A., Saeb, M., Hamdy, N.: Soft and hard biometrics fusion for improved identity verification. In: MWSCAS, vol. 1, pp. I-225–I-228 (July 2004)
7. Ailisto, H., Vildjiounaite, E., Lindholm, M., Mäkelä, S., Peltola, J.: Soft biometrics: combining body weight and fat measurements with fingerprint biometrics. Pattern Recogn. Lett. 27(5), 325–334 (2006)
8. Jain, A.K., Dass, S.C., Nandakumar, K.: Soft biometric traits for personal recognition systems. In: Zhang, D., Jain, A.K. (eds.) ICBA 2004. LNCS, vol. 3072, pp. 731–738. Springer, Heidelberg (2004)
9. Hamming, R.W.: Error detecting and error correcting codes. The Bell System Technical Journal 29(2), 147–160 (1950)
10. Ortega-Garcia, J., Alonso-Fernandez, F., Fierrez-Aguilar, J., Garcia-Mateo, C., Salicetti, S., Allano, L., Ly-Van, B., Dorizzi, B.: Software tool and acquisition equipment recommendations for the three scenarios considered. Technical Report D6.2.1, Contract No. IST-2002-507634 (June 2006)
11. Canuto, A.: Combining Neural Networks and Fuzzy Logic for Applications in Character Recognition. PhD thesis, Department of Electronics, University of Kent, Canterbury, UK (May 2001). Adviser: Fairhurst, M.C.
12. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge (2000)
13. Arya, S., Mount, D., Netanyahu, N., Silverman, R., Wu, A.: An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. J. ACM 45(6), 891–923 (1998)
14. Fürnkranz, J., Widmer, G.: Incremental reduced error pruning. In: ICML 1994, New Brunswick, NJ, pp. 70–77 (1994)
15. Mitchell, T.: Machine Learning. McGraw-Hill, New York (1997)
Author Index
Abel, Andrew 65
Abreu, Márjory 348
Akakin, Hatice Cinar 105
Alba Castro, José Luis 340
Alotaibi, Yousef Ajami 162
Antón, Luis 97
Argones Rúa, Enrique 340
Ávila, Andrés I. 187
Atah, Joshua A. 170
Bae, You-suk 318
Bandari, Naghmeh Mohammadi 301
Bashir, Muzaffar 200
Beetz, Michael 122
Behzad, Moshiri 130
Biermann, Michael 220
Bringer, Julien 178
Burgues, Javier 325
Butt, M. Asif Afzal 308
Cambria, Erik 252
Carballo, Sara 285
Carbone, Domenico 73
Castrillón, Modesto 97
Chabanne, Hervé 178
Chetouani, Mohamed 65
Chia, Chaw 212
Dimitrova, Desislava 207
Dimov, Dimo 146, 192
Dittmann, Jana 220
Dobrišek, Simon 114
Drygajlo, Andrzej 25, 260
Eckl, Chris 252
Ejarque, Pascual 81
Esposito, Anna 73
Fàbregas, Joan 236
Fairhurst, Michael 348
Fariba, Bahrami 130
Faundez, Marcos 236
Ferrer, Miguel Ángel 236
Fierrez, Julian 154, 285, 325
Freire, David 97
Freire, Manuel 236
Gajšek, Rok 114
Galbally, Javier 285
Garrido, Javier 236
Gkalelis, Nikolaos 138
Gluhchev, Georgi 207
Gonzalez, Guillermo 236
Gonzalez-Rodriguez, Joaquin 49
Grassi, Marco 244
Gómez, David 81
Gros, Jerneja Žganec 57
Hadid, Abdenour 9
Hangai, Seiichiro 333
Havasi, Catherine 252
Henniger, Olaf 268
Hernado, Javier 81
Hernando, David 81
Howells, Gareth 170
Husain, Amir 162
Hussain, Amir 65, 252
Imai, Hideki 293
Inuma, Manabu 293
Jalal, Arabneydi 130
Kanak, Alper 276
Kempf, Jürgen 200
Kevenaar, Tom A.M. 178
Kindarji, Bruno 178
Kojima, Yoshihiro 293
Kyperountas, Marios 89
Kümmel, Karl 220
Laskov, Lasko 192
Lee, Hyun-suk 318
Li, Weifeng 25
Lopez-Moreno, Ignacio 49
Maeng, Hyun-ju 318
Mansoor, Atif Bin 308
Marinov, Alexander 146
Mayer, Christoph 122
Mihelič, Aleš 57
Mihelič, France 114
Milgram, Maurice 65
Misaghian, Khashayar 301
Morales, Aythami 236
Moreno-Moreno, Miriam 154
Muci, Adrialy 187
Müller, Sascha 268
Nguyen, Quoc-Dinh 65
Nolle, Lars 212
Ortega, Javier 236
Ortega-Garcia, Javier 154, 285, 325
Ortego-Resa, Carlos 49
Otsuka, Akira 293
Pavešić, Nikola 1, 114
Pérez-Piñar López, David 340
Pietikäinen, Matti 9
Pitas, Ioannis 89, 138
Přibil, Jiří 41
Přibilová, Anna 41
Radhika, K.R. 228
Radig, Bernd 122
Ramos, Daniel 49, 325
Rawls, Allen W. 17
Riaz, Zahid 122
Ribalda, Ricardo 236
Ricanek Jr., Karl 17
Ringeval, Fabien 65
Riviello, Maria Teresa 73
Saeed, Omer 308
Sankur, Bulent 105
Sano, Tomoaki 333
Scheidat, Tobias 220
Sekhar, G.N. 228
Sheela, S.V. 228
Sherkat, Nasser 212
Shigetomi, Rie 293
Soğukpinar, Ibrahim 276
Staroniewicz, Piotr 33
Štruc, Vitomir 1, 114
Tajbakhsh, Nima 301
Tefas, Anastasios 138
Terán, Luis 260
Venkatesha, M.K. 228
Vielhauer, Claus 220
Yoshida, Takahiro 333
Zhu, Kewei 25
Žibert, Janez 114
Zlateva, Nadezhda 146