Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen University of Dortmund, Germany Madhu Sudan Massachusetts Institute of Technology, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Moshe Y. Vardi Rice University, Houston, TX, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
4872
Domingo Mery Luis Rueda (Eds.)
Advances in Image and Video Technology Second Pacific Rim Symposium, PSIVT 2007 Santiago, Chile, December 17-19, 2007 Proceedings
Volume Editors
Domingo Mery
Pontificia Universidad Católica de Chile, Department of Computer Science
Avda. Vicuña Mackenna 4860, Santiago 6904411, Chile
E-mail: [email protected]
Luis Rueda
Universidad de Concepción, Department of Computer Science
Edmundo Larenas 215, Concepción 4030000, Chile
E-mail: [email protected]
Library of Congress Control Number: 2007940435
CR Subject Classification (1998): H.5.1, H.5, I.4, I.3, H.3-4, E.4
LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web and HCI
ISSN 0302-9743
ISBN-10 3-540-77128-X Springer Berlin Heidelberg New York
ISBN-13 978-3-540-77128-9 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2007 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12200894 06/3180 543210
Preface
These proceedings are a manifestation of the excellent scientific contributions presented at the Second IEEE Pacific Rim Symposium on Video and Image Technology (PSIVT 2007), held in Santiago, Chile, during December 17–19, 2007. The symposium provided a forum for presenting and exploring the latest research and developments in image and video technology. By discussing the possibilities and directions in these fields, it offered a venue where both academic research and industrial activities were presented for mutual benefit. The aim of the symposium was to promote and disseminate ongoing research on multimedia hardware and image sensor technologies, graphics and visualization, image analysis, multiple view imaging and processing, computer vision applications, image and video coding, and multimedia processing. The volume is a realization of the ongoing success of the Pacific Rim Symposium on Video and Image Technology, whose first edition (PSIVT 2006) was held last year in Hsinchu, Taiwan, Republic of China. PSIVT 2007 provides evidence of the growing stature of the Pacific Rim scientific community in video and image technology and of its impact worldwide. The symposium received contributions from 31 countries, registering a total of 155 papers, out of which 75 were accepted for publication in these proceedings, which is equivalent to an acceptance rate of 48.4%. The review process was carried out in seven different themes, each with its own theme Co-chairs and a Program Committee composed of internationally recognized scientists, all experts in their respective theme. Each paper was peer-reviewed by two to five reviewers. Besides oral and poster presentations of high-quality papers, interesting keynote talks on a mix of topics in the theory and applications of image and video technology were presented by internationally renowned scientists:
– Yi-Ping Hung, Image and Vision Laboratory, National Taiwan University
– Hiromichi Fujisawa, Hitachi Central Research Laboratory, Japan
– Pablo Irarrazaval, Department of Electrical Engineering and Magnetic Resonance Imaging Research Center, Pontificia Universidad Católica, Chile
– Peter Kovesi, The Vision Research Group, The University of Western Australia
PSIVT 2007 was organized by the Department of Computer Science at Pontificia Universidad Católica de Chile (PUC). The symposium was sponsored by IEEE, Microsoft Research, the Chilean Society for Computer Science, the Chilean Association for Pattern Recognition (AChiRP) and Pontificia Universidad Católica de Chile. This conference would not have been such a success without the efforts of many people. First of all, we are very grateful to the authors who contributed
their high-quality research work, sharing their knowledge with our scientific community. We are especially indebted to our theme Co-chairs for their efforts in ensuring a high-quality review and selection process. We would like to thank the Program Committee members and the reviewers, who generously spent their precious time providing useful and detailed comments, offering the authors an excellent opportunity to improve the work presented at this symposium and their future research. Additionally, we appreciate the meticulous work done by Miguel Carrasco, our Editor Assistant, who put together all camera-ready papers in this book, ensuring that every single paper strictly followed the required style and format. Finally, we would like to express our gratitude to all members of the Organizing and Steering Committees, especially to Reinhard Klette and Wen-Nung Lie, for their support and help in bringing this symposium to Chile for the first time.
October 2007
Domingo Mery Luis Rueda
PSIVT 2007 Organization
Organizing Committee
General Co-chairs
Domingo Mery (Pontificia Universidad Católica, Chile)
Luis Rueda (Universidad de Concepción, Chile)
Reinhard Klette (University of Auckland, New Zealand)
Program Co-chairs
Wen-Nung Lie (National Chung Cheng University, Taiwan)
René Vidal (Johns Hopkins University, USA)
Alvaro Soto (Pontificia Universidad Católica, Chile)
Steering Committee
Wen-Nung Lie (National Chung Cheng University, Taiwan)
Kap Luk Chan (Nanyang Technological University, Singapore)
Yung-Chang Chen (National Tsing Hua University, Taiwan)
Yo-Sung Ho (Gwangju Institute of Science and Technology, Korea)
Reinhard Klette (The University of Auckland, New Zealand)
Mohan M. Trivedi (University of California, San Diego, USA)
Domingo Mery (Pontificia Universidad Católica, Chile)
Editor Assistant and Webmaster
Miguel Carrasco (Pontificia Universidad Católica, Chile)
Theme Co-chairs
Multimedia Hardware and Image Sensor Technologies: Miguel Torres (Pontificia Universidad Católica, Chile), Charng-Long Lee (Sunplus Inc., Taiwan), Jose Luis Gordillo (Instituto Tecnológico de Monterrey, Mexico)
Graphics and Visualization: Bedrich Benes (Purdue University, USA), Nancy Hitschfeld (Universidad de Chile, Chile)
Image Analysis: Cristián Tejos (Pontificia Universidad Católica, Chile), Luis Pizarro (Saarland University, Germany)
Multiple View Imaging and Processing: Fernando Rannou (Universidad de Santiago, Chile), Hideo Saito (Keio University, Japan), Yo-Sung Ho (Gwangju Institute of Science and Technology, Korea)
Computer Vision Applications: Javier Ruiz-del-Solar (Universidad de Chile, Chile), Luis Enrique Sucar (INAOE, Mexico), Pablo Zegers (Universidad de Los Andes, Chile)
Image and Video Coding: Byeungwoo Jeon (Sung Kyun Kwan University, Korea), Ramakrishna Kakarala (Avago Technologies, San Jose, USA)
Multimedia Processing: Xuelong Li (University of London, UK), Hyoung-Joong Kim (Korea University, Korea)
Invited Speakers
Yi-Ping Hung (Image and Vision Lab., National Taiwan University, Taiwan)
Hiromichi Fujisawa (Hitachi Central Research Laboratory, Japan)
Pablo Irarrázaval (MRI Research Center, Pontificia Universidad Católica, Chile)
Peter Kovesi (The Vision Research Group, The University of Western Australia)
Program Committee
Multimedia Hardware and Image Sensor Technologies
Mariano Aceves (INAOE, Mexico) Jose Atoche (ITMERIDA, Mexico) Oscal T.-C. Chen (National Chung Cheng University, Taiwan) Tihao Chiang (National Chiao Tung University, Taiwan) Michael Cree (University of Waikato, New Zealand) Chiou-Shann Fuh (National Taiwan University, Taiwan) José Luis Gordillo (Instituto Tecnológico de Monterrey, Mexico) Marcelo Guarini (Pontificia Universidad Católica, Chile) Andrés Guesalaga (Pontificia Universidad Católica, Chile) Jiun-In Guo (National Chung Cheng University, Taiwan) Charng-Long Lee (Sunplus Inc., Taiwan) Gwo Giun Lee (National Cheng Kung University, Taiwan) Chia-Wen Lin (National Tsing Hua University, Taiwan) Bruce MacDonald (University of Auckland, New Zealand) José Luis Paredes (Universidad de Los Andes, Venezuela) Javier Vega Pineda (Instituto Tecnológico de Chihuahua, Mexico) Ramón M. Rodríguez (Tecnológico de Monterrey, Mexico) Ewe Hong Tat (Multimedia University, Malaysia) Miguel Torres (Pontificia Universidad Católica, Chile) Flavio Torres (Universidad de la Frontera, Chile) Y. Tim Tsai (ITRI, Taiwan) Kazunori Umeda (Chuo University, Japan)
Graphics and Visualization
Laura Arns (Purdue University, USA) Bedrich Benes (Purdue University, USA) Sanjiv K. Bhatia (University of Missouri St. Louis, USA) Xiaofei He (Yahoo Research, USA) Heiko Hirschmueller (DLR Munich, Germany) Nancy Hitschfeld (Universidad de Chile, Chile) Reinhard Koch (Kiel University, Germany) Ivana Kolingerova (University of West Bohemia, Czech Republic) Ngoc-Minh Le (HCMC University of Technology, Vietnam) Damon Shing-Min Liu (National Chung Cheng University, Taiwan) Kellen Maicher (Purdue University, USA) Ryan Pedela (Purdue University, USA) Maria Cecilia Rivara (Universidad de Chile, Chile) Isaac Rudomin (ITESM CEM, Mexico) John Rugis (Manukau Institute of Technology, New Zealand) Ashok Samal (University of Nebraska-Lincoln, USA) Jose Serrate (Universidad Politecnica de Catalunya, Spain) Mingli Song (Zhejiang University, China)
Masahiro Takatsuka (ViSLAB, The University of Sydney, Australia) Matthias Teschner (Freiburg University, Germany) Michael Wilkinson (Groningen University, Groningen, The Netherlands)
Image Analysis
Luis Alvarez (Universidad de Las Palmas de Gran Canaria, Spain) Humberto Sossa Azuela (National Polytechnic Institute, Mexico) Ricardo Barrón (Centro de Investigación en Computación, Mexico) Josef Bigun (Halmstad University, Sweden) Thomas Brox (University of Bonn, Germany) Li Chen (The University of the District of Columbia, USA) Kamil Dimililer (Near East University, Turkey) Mohamed El Hassouni (Mohammed V University, Morocco) Giovani Gomez Estrada (Universitaet Stuttgart, Germany) Alejandro Frery (Universidade Federal de Alagoas, Brazil) Andrés Fuster Guilló (Universidad de Alicante, Spain) Vaclav Hlavac (Czech Technical University, Czech Republic) Pablo Irarrázaval (Pontificia Universidad Católica, Chile) Kazuhiko Kawamoto (Kyushu Institute of Technology, Japan) Pierre Kornprobst (INRIA, France) Fajie Li (University of Groningen, The Netherlands) Jorge Azorín López (Universidad de Alicante, Spain) Joan Marti (Universitat de Girona, Spain) Nicolai Petkov (University of Groningen, The Netherlands) Hemerson Pistori (Universidade Católica Dom Bosco, Brazil) Luis Pizarro (Saarland University, Germany) Arturo Espinosa Romero (Universidad Autónoma de Yucatán, Mexico) Mikael Rousson (Siemens Corporate Research, USA) Xiaowei Shao (Tokyo University, Japan) Nir Sochen (Tel Aviv University, Israel) Juan Humberto Sossa (National Polytechnic Institute, Mexico) Cristián Tejos (Pontificia Universidad Católica, Chile) Petra Wiederhold (CINVESTAV, Mexico City, Mexico)
Multiple View Imaging and Processing
Daisaku Arita (Institute of Systems and Information Technologies, Japan) Chi-Fa Chen (I-Shou University, Taiwan) Gianfranco Doretto (General Electric Research, USA) Paolo Favaro (Heriot-Watt University, UK) Clinton Fookes (Queensland University of Technology, Australia) Toshiaki Fujii (Nagoya University, Japan) Jens Gregor (University of Tennessee, USA) Yo-Sung Ho (Gwangju Institute Science & Tech., Korea) Fay Huang (Ilan University, Taiwan)
Kun Huang (Ohio State University, USA) Hailin Jin (Adobe Research, USA) Makoto Kimura (Advanced Industrial Science and Technology, Japan) Nahum Kiryati (Tel Aviv University, Israel) Itaru Kitahara (University of Tsukuba, Japan) Akira Kubota (Tokyo Institute of Technology, Japan) Huei-Yung Lin (National Chung Cheng University, Taiwan) Le Lu (Siemens Corporate Research, USA) Brendan McCane (University of Otago, New Zealand) Vincent Nozick (Keio University, Japan) Fernando Rannou (Universidad de Santiago de Chile, Chile) Bodo Rosenhahn (MPI, Saarbruecken, Germany) Hideo Saito (Keio University, Japan) Yasuyuki Sugaya (Toyohashi University of Technology, Japan) Keita Takahashi (University of Tokyo, Japan) Carlos Vazquez (Communications Research Centre, Canada) Shuntaro Yamazaki (Advanced Industrial Science and Technology, Japan) Allen Yang (University of California, Berkeley, USA)
Computer Vision Applications
Hector-Gabriel Acosta-Acosta (University of Veracruz, Mexico) John Atkinson (Universidad de Concepción, Chile) Héctor Avilés (INAOE, Mexico) Olivier Aycard (INRIA, France) Jacky Baltes (University of Manitoba, Canada) John Barron (University of Western Ontario, Canada) Marcelo Bertalmío (Universidad Pompeu Fabra, Spain) Bubaker Boufama (University of Windsor, Canada) Thomas Bräunl (The University of Western Australia, Australia) Miguel Carrasco (Pontificia Universidad Católica, Chile) Roberto Marcondes Cesar (Universidade de Sao Paulo, Brazil) Kap Luk Chan (Nanyang Technological University, Singapore) Raul Pinto Elias (CENIDET, Mexico) How-lung Eng (Institute of Infocomm Research, Singapore) Maria-Jose Escobar (INRIA, France) Giovani Gomez Estrada (Universitaet Stuttgart, Germany) David Fofi (Institut Universitaire de Technologie, France) Uwe Franke (DaimlerChrysler AG - Machine Perception, Germany) Juan Manuel García Chamizo (Universidad de Alicante, Spain) Duncan Gillies (Imperial College London, UK) Pablo Guerrero (Universidad de Chile, Chile) Adlane Habed (University of Bourgogne, France) Sergio Hernandez (Victoria University, New Zealand) Jesse Hoey (University of Dundee, UK) Luca Iocchi (Università La Sapienza, Rome, Italy)
Jesse Jin (University of Newcastle, Australia) Valérie Kaftandjian (Non Destructive Testing Laboratory, France) Ron Kimmel (Computer Science Department, Israel) Mario Koeppen (Kyushu Institute of Technology, Japan) Patricio Loncomilla (Universidad de Chile, Chile) Brian Lovell (Brisbane, Australia) Joan Marti (Universitat de Girona, Spain) Fabrice Meriaudeau (Institut Universitaire de Technologie, France) Rodrigo Palma-Amestoy (Universidad de Chile, Chile) Henry Pang (Aureon, USA) Hemerson Pistori (Universidade Católica Dom Bosco, Brazil) Gregory Randall (Universidad de la República, Uruguay) Arturo Espinosa Romero (Universidad Autónoma de Yucatán, Mexico) Javier Ruiz-del-Solar (Universidad de Chile, Chile) Xiaowei Shao (Tokyo University, Japan) Aureli Soria-Frisch (Universidad Pompeu Fabra, Spain) Alvaro Soto (Pontificia Universidad Católica, Chile) Mohan Sridharan (University of Texas at Austin, USA) Christophe Stolz (Université de Bourgogne, France) Luis Enrique Sucar (INAOE, Mexico) João Manuel Tavares (Universidade do Porto, Portugal) Rodrigo Verschae (Universidad de Chile, Chile) Pascal Vasseur (University of Picardie Jules Verne, France) Alfredo Weitzenfeld (ITAM, Mexico) Su Yang (Fudan University, China) Wei Yun Yau (Institute of Infocomm Research, Singapore) Kaori Yoshida (Kyushu Institute of Technology, Japan) Pablo Zegers (Universidad de Los Andes, Chile)
Image and Video Coding
John Arnold (Australian Defense Force Academy, Australia) Yuk Hee Chan (The Hong Kong Polytechnic University, Hong Kong) Homer Chen (National Taiwan University, Taiwan) Mei-Juan Chen (National Dong-Hwa University, Taiwan) Gerardo F. Escribano (Universidad de Castilla-La Mancha, Spain) Xiaodong Fan (Microsoft, USA) Markus Flierl (Stanford University, USA) Wenjen Ho (Institutes of Information Industry, Taiwan) Byeungwoo Jeon (Sung Kyun Kwan University, Korea) Ramakrishna Kakarala (Avago Technologies, San Jose, USA) Chang-Su Kim (Korea University, Korea) Hae Kwang Kim (Sejong University, Korea) Andreas Koschan (University of Tennessee, USA) Shipeng Li (Microsoft Research Asia, China) Yan Lu (Microsoft Research Asia, China)
Kai-Kuang Ma (Nanyang Technological University, Singapore) Shogo Muramatsu (Niigata University, Japan) Philip Ogunbona (University of Wollongong, Australia) Dong Kyu Sim (Kwang Woon University, Korea) Byung Cheol Song (Samsung Electronics Co., Ltd, Korea) Gary Sullivan (Microsoft Corporation, USA) Alexis M. Tourapis (Dolby Corporation, USA) Carlos Vazquez (Communications Research Centre, Canada) Ye-Kui Wang (Nokia Research Center, Finland) Mathias Wien (RWTH Aachen University, Germany) Jar-Ferr Yang (National Cheng Kung University, Taiwan) Chia-Hung Yeh (National Dong-Hwa University, Taiwan) Yuan Yuan (Aston University, UK)
Multimedia Processing
Imran Ahmad (University of Windsor, Canada) Amr Ahmed (Lincoln University, UK) Oscar Au (Hong Kong University of Science and Technology, Hong Kong) Berlin Chen (National Taiwan Normal University, Taiwan) Shyi-Chyi Cheng (National Taiwan Ocean University, Taiwan) Kevin Curran (University of Ulster, UK) Xinbo Gao (Xidian University, China) Hyoung-Joong Kim (Korea University, Korea) Yung-Lyul Lee (Sejong University, Korea) Jing Li (Sheffield University, UK) Xuelong Li (University of London, UK) Guo-Shiang Lin (Da-Yeh University, Taiwan) Yanwei Pang (Tianjin University, China) Laiyun Qing (Institute of Computing Technology, China) Day-Fann Shen (National Yunlin University of Science and Technology, Taiwan) Jialie Shen (Singapore Management University, Singapore) Chien-Cheng Tseng (National Kaohsiung First University of Science and Tech., Taiwan) Huiqiong Wang (Zhejiang University, China) Ya-Ping Wong (Multimedia University, Malaysia) Marcel Worring (University of Amsterdam, The Netherlands) Hsien-Huang P. Wu (National Yunlin University of Science and Tech., Taiwan) Qingxiang Wu (Ulster University, UK) Tianhao Zhang (Shanghai Jiaotong University, China) Huiyu Zhou (University of London, UK) Xingquan Zhu (Florida Atlantic University, USA)
Additional Reviewers M. Abdel-Maquid Imran Ahmad Soria-Frisch Aureli Anna Bosch Sylvie Chambon Y.H. Chan Ying Chen Mauro Conti P. Kresimir Delac Stephan Didas Fadi Dornaika Hong Tat Ewe Torres Flavio Ruben García Mei Guo Yang Guo A. Ben Hamza Jeff Han Jin Huang
El Hassan Ibn El Haj Reinhard Klette Mohamed Chaker Larabi Chang-Ming Lee Thomas Lemaire Houqiang Li Wei-Yang Lin Ligang Liu Xavier Llado Chun-Shien Lu Sujeet Mate Remi Megret Jesus Mena-Chalco Domingo Mery Romuald Mosqueron Valguima Odakura Arnau Oliver Ricardo Pastrana V. Patrick Perez
David Silva Pires Milton Romero Luis Rueda Mohammed Rziza Li Su Yang Su Truong Cong Thang Alexandros Tourapis Mejdi Trimeche Kemal Ugur Anna Ukovich René Vidal Demin Wang Jiaping Wang Ruixuan Wang Peter Wu Cixun Zhang Liang Zhang
Sponsoring Institutions IEEE Microsoft Research The Chilean Society for Computer Science The Chilean Association for Pattern Recognition (AChiRP) Pontificia Universidad Católica de Chile (PUC)
Table of Contents
Keynote Lectures An Image-Based Approach to Interactive 3D Virtual Exhibition . . . . . . . Yi-Ping Hung
1
Information Just-in-Time: An Approach to the Paperless Office . . . . . . . Hiromichi Fujisawa
2
Sampling Less and Reconstructing More for Magnetic Resonance Imaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pablo Irarrazaval
3
Phase is an Important Low-Level Image Invariant . . . . . . . . . . . . . . . . . . . . Peter Kovesi
4
Multimedia Hardware and Image Sensor Technologies
A Pipelined 8x8 2-D Forward DCT Hardware Architecture for H.264/AVC High Profile Encoder . . . Thaísa Leal da Silva, Cláudio Machado Diniz, João Alberto Vortmann, Luciano Volcan Agostini, Altamiro Amadeu Susin, and Sergio Bampi
A Real Time Infrared Imaging System Based on DSP & FPGA . . . Babak Zamanlooy, Vahid Hamiati Vaghef, Sattar Mirzakuchaki, Ali Shojaee Bakhtiari, and Reza Ebrahimi Atani
Motion Compensation Hardware Accelerator Architecture for H.264/AVC . . . Bruno Zatt, Valter Ferreira, Luciano Agostini, Flávio R. Wagner, Altamiro Susin, and Sergio Bampi
High Throughput Hardware Architecture for Motion Estimation with 4:1 Pel Subsampling Targeting Digital Television Applications . . . Marcelo Porto, Luciano Agostini, Leandro Rosa, Altamiro Susin, and Sergio Bampi
5
16
24
36
Graphics and Visualization Fast Directional Image Completion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chih-Wei Fang and Jenn-Jier James Lien
48
Out-of-Order Execution for Avoiding Head-of-Line Blocking in Remote 3D Graphics . . . John Stavrakakis and Masahiro Takatsuka
A Fast Mesh Deformation Method for Neuroanatomical Surface Inflated Representations . . . Andrea Rueda, Álvaro Perea, Daniel Rodríguez-Pérez, and Eduardo Romero
Mosaic Animations from Video Inputs . . . Rafael B. Gomes, Tiago S. Souza, and Bruno M. Carvalho
62
75
87
Image Analysis
Grayscale Template-Matching Invariant to Rotation, Scale, Translation, Brightness and Contrast . . . Hae Yong Kim and Sidnei Alves de Araújo
100
Bimodal Biometric Person Identification System Under Perturbations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Miguel Carrasco, Luis Pizarro, and Domingo Mery
114
A 3D Object Retrieval Method Using Segment Thickness Histograms and the Connection of Segments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yingliang Lu, Kunihiko Kaneko, and Akifumi Makinouchi
128
Facial Occlusion Reconstruction: Recovering Both the Global Structure and the Local Detailed Texture Components . . . . . . . . . . . . . . . . . . . . . . . . Ching-Ting Tu and Jenn-Jier James Lien
141
Cyclic Linear Hidden Markov Models for Shape Classification . . . Vicente Palazón, Andrés Marzal, and Juan Miguel Vilar
152
Neural Network Classification of Photogenic Facial Expressions Based on Fiducial Points and Gabor Features . . . Luciana R. Veloso, João M. de Carvalho, Claudio S.V.C. Cavalcanti, Eduardo S. Moura, Felipe L. Coutinho, and Herman M. Gomes
166
Image In-painting by Band Matching, Seamless Cloning and Area Sub-division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Subin Lee and Yongduek Seo
180
Image Feature Extraction Using a Method Derived from the Hough Transform with Extended Kalman Filtering . . . . . . . . . . . . . . . . . . . . . . . . . Sergio A. Velastin and Chengping Xu
191
Nonlinear Dynamic Shape and Appearance Models for Facial Motion Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chan-Su Lee, Ahmed Elgammal, and Dimitris Metaxas
205
Direct Ellipse Fitting and Measuring Based on Shape Boundaries . . . . . . Milos Stojmenovic and Amiya Nayak
221
Approximate ESPs on Surfaces of Polytopes Using a Rubberband Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fajie Li, Reinhard Klette, and Xue Fu
236
Sub-grid Detection in DNA Microarray Images . . . . . . . . . . . . . . . . . . . . . . Luis Rueda
248
Modelling Intermittently Present Features Using Nonlinear Point Distribution Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gerard Sanroma and Francesc Serratosa
260
Measuring Linearity of Ordered Point Sets . . . . . . . . . . . . . . . . . . . . . . . . . . Milos Stojmenovic and Amiya Nayak
274
Real-Time Color Image Watermarking Based on D-SVD Scheme . . . . . . . Cheng-Fa Tsai and Wen-Yi Yang
289
Recognizing Human Iris by Modified Empirical Mode Decomposition . . . Jen-Chun Lee, Ping S. Huang, Tu Te-Ming, and Chien-Ping Chang
298
Segmentation of Scanned Insect Footprints Using ART2 for Threshold Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bok-Suk Shin, Eui-Young Cha, Young Woon Woo, and Reinhard Klette Meshless Parameterization for Dimensional Reduction Integrated in 3D Voxel Reconstruction Using a Single PC . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yunli Lee, Dongwuk Kyoung, and Keechul Jung An Efficient Biocryptosystem Based on the Iris Biometrics . . . . . . . . . . . . Ali Shojaee Bakhtiari, Ali Asghar Beheshti Shirazi, and Babak Zamanlooy
311
321 334
Subjective Image-Quality Estimation Based on Psychophysical Experimentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gi-Yeong Gim, Hyunchul Kim, Jin-Aeon Lee, and Whoi-Yul Kim
346
Adaptive Color Filter Array Demosaicking Based on Constant Hue and Local Properties of Luminance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chun-Hsien Chou, Kuo-Cheng Liu, and Wei-Yu Lee
357
Multiple View Imaging and Processing Automatic Multiple Visual Inspection on Non-calibrated Image Sequence with Intermediate Classifier Block . . . . . . . . . . . . . . . . . . . . . . . . . Miguel Carrasco and Domingo Mery
371
Image-Based Refocusing by 3D Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . Akira Kubota, Kazuya Kodama, and Yoshinori Hatori
385
Online Multiple View Computation for Autostereoscopic Display . . . . . . . Vincent Nozick and Hideo Saito
399
Horizontal Human Face Pose Determination Using Pupils and Skin Region Positions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shahrel A. Suandi, Tie Sing Tai, Shuichi Enokida, and Toshiaki Ejima
413
Segmentation-Based Adaptive Support for Accurate Stereo Correspondence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Federico Tombari, Stefano Mattoccia, and Luigi Di Stefano
427
3D Reconstruction of a Human Body from Multiple Viewpoints . . . . . . . . Koichiro Yamauchi, Hideto Kameshima, Hideo Saito, and Yukio Sato
439
3D Posture Representation Using Meshless Parameterization with Cylindrical Virtual Boundary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yunli Lee and Keechul Jung
449
Using the Orthographic Projection Model to Approximate the Perspective Projection Model for 3D Facial Reconstruction . . . . . . . . . . . . Jin-Yi Wu and Jenn-Jier James Lien
462
Multi-target Tracking with Poisson Processes Observations . . . . . . . . . . . . Sergio Hernandez and Paul Teal
474
Proposition and Comparison of Catadioptric Homography Estimation Methods . . . Christophe Simler, Cédric Demonceaux, and Pascal Vasseur
484
External Calibration of Multi-camera System Based on Efficient Pair-wise Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chunhui Cui, Wenxian Yang, and King Ngi Ngan
497
Computer Vision Applications Fast Automatic Compensation of Under/Over-Exposured Image Regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Vassilios Vonikakis and Ioannis Andreadis Motion Estimation Applied to Reconstruct Undersampled Dynamic MRI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Claudia Prieto, Marcelo Guarini, Joseph Hajnal, and Pablo Irarrazaval
510
522
Real-Time Hand Gesture Detection and Recognition Using Boosted Classifiers and Active Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hardy Francke, Javier Ruiz-del-Solar, and Rodrigo Verschae
533
Spatial Visualization of the Heart in Case of Ectopic Beats and Fibrillation . . . Sándor M. Szilágyi, László Szilágyi, and Zoltán Benyó
548
A Single-View Based Framework for Robust Estimation of Height and Position of Moving People . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Seok-Han Lee and Jong-Soo Choi
562
Robust Tree-Ring Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mauricio Cerda, Nancy Hitschfeld-Kahler, and Domingo Mery
575
A New Approach for Fingerprint Verification Based on Wide Baseline Matching Using Local Interest Points and Descriptors . . . . . . . . . . . . . . . . Javier Ruiz-del-Solar, Patricio Loncomilla, and Christ Devia
586
SVM with Stochastic Parameter Selection for Bovine Leather Defect Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Roberto Viana, Ricardo B. Rodrigues, Marco A. Alvarez, and Hemerson Pistori
600
Incremental Perspective Motion Model for Rigid and Non-rigid Motion Separation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tzung-Heng Lai, Te-Hsun Wang, and Jenn-Jier James Lien
613
Vision-Based Guitarist Fingering Tracking Using a Bayesian Classifier and Particle Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chutisant Kerdvibulvech and Hideo Saito
625
Accuracy Estimation of Detection of Casting Defects in X-Ray Images Using Some Statistical Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Romeu Ricardo da Silva and Domingo Mery
639
A Radial Basis Function for Registration of Local Features in Images . . . Asif Masood, Adil Masood Siddiqui, and Muhammad Saleem
651
Hardware Implementation of Image Recognition System Based on Morphological Associative Memories and Discrete Wavelet Transform . . . Enrique Guzmán, Selene Alvarado, Oleksiy Pogrebnyak, Luis Pastor Sánchez Fernández, and Cornelio Yáñez
664
Detection and Classification of Human Movements in Video Scenes . . . . . A.G. Hochuli, L.E.S. Oliveira, A.S. Britto Jr., and A.L. Koerich
678
Image Registration by Simulating Human Vision . . . . . . . . . . . . . . . . . . . . . Shubin Zhao
692
Face and Gesture-Based Interaction for Displaying Comic Books . . . . . . . Hang-Bong Kang and Myung-Ho Ju
702
Better Foreground Segmentation for 3D Face Reconstruction Using Graph Cuts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anjin Park, Kwangjin Hong, and Keechul Jung
715
Practical Error Analysis of Cross-Ratio-Based Planar Localization . . . . . Jen-Hui Chuang, Jau-Hong Kao, Horng-Horng Lin, and Yu-Ting Chiu
727
People Counting in Low Density Video Sequences . . . . . . . . . . . . . . . . . . . . J.D. Valle Jr., L.E.S. Oliveira, A.L. Koerich, and A.S. Britto Jr.
737
Simulation of Automated Visual Inspection Systems for Specular Surfaces Quality Control . . . Juan Manuel García-Chamizo, Andrés Fuster-Guilló, and Jorge Azorín-López
Low Cost Virtual Face Performance Capture Using Stereo Web Cameras . . . Alexander Woodward, Patrice Delmas, Georgy Gimel'farb, and Jorge Marquez
Hidden Markov Models Applied to Snakes Behavior Identification . . . Wesley Nunes Gonçalves, Jonathan de Andrade Silva, Bruno Brandoli Machado, Hemerson Pistori, and Albert Schiaveto de Souza
749
763
777
Image and Video Coding SP Picture for Scalable Video Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jie Jia, Hae-Kwang Kim, and Hae-Chul Choi
788
Studying the GOP Size Impact on the Performance of a Feedback Channel-Based Wyner-Ziv Video Codec . . . Fernando Pereira, João Ascenso, and Catarina Brites
801
Wyner-Ziv Video Coding with Side Matching for Improved Side Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bonghyuck Ko, Hiuk Jae Shim, and Byeungwoo Jeon
816
On Digital Image Representation by the Delaunay Triangulation . . . . . . . Josef Kohout
826
Low-Complexity TTCM Based Distributed Video Coding Architecture . . . J.L. Martínez, W.A.C. Fernando, W.A.R.J. Weerakkody, J. Oliver, O. López, M. Martinez, M. Pérez, P. Cuenca, and F. Quiles
841
Adaptive Key Frame Selection for Efficient Video Coding . . . . . . . . . . . . . Jaebum Jun, Sunyoung Lee, Zanming He, Myungjung Lee, and Euee S. Jang
853
Multimedia Processing Application of Bayesian Network for Fuzzy Rule-Based Video Deinterlacing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gwanggil Jeon, Rafael Falcon, Rafael Bello, Donghyung Kim, and Jechang Jeong
867
Markov Random Fields and Spatial Information to Improve Automatic Image Annotation . . . Carlos Hernández-Gracidas and L. Enrique Sucar
879
Shape-Based Image Retrieval Using k-Means Clustering and Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiaoliu Chen and Imran Shafiq Ahmad
893
Very Fast Concentric Circle Partition-Based Replica Detection Method . . . Ik-Hwan Cho, A-Young Cho, Jun-Woo Lee, Ju-Kyung Jin, Won-Keun Yang, Weon-Geun Oh, and Dong-Seok Jeong
Design of a Medical Image Database with Content-Based Retrieval Capabilities . . . Juan C. Caicedo, Fabio A. González, Edwin Triana, and Eduardo Romero
905
919
A Real-Time Object Recognition System on Cell Broadband Engine . . . . Hiroki Sugano and Ryusuke Miyamoto
932
A Study of Zernike Invariants for Content-Based Image Retrieval . . . Pablo Toharia, Oscar D. Robles, Ángel Rodríguez, and Luis Pastor
944
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
959
Keynote Lecture An Image-Based Approach to Interactive 3D Virtual Exhibition Yi-Ping Hung Image and Vision Laboratory Graduate Institute of Networking and Multimedia Department of Computer Science and Information Engineering National Taiwan University, Taiwan
[email protected]
Abstract. With the advance of 3D digitization and rendering technologies, interactive virtual exhibition can now be realized for applications such as virtual museum, virtual showcase, and virtual mall. There are two major approaches to implementing a 3D interactive virtual exhibition application. One approach is the geometry-based approach, which reconstructs geometric models for 3D objects by using laser scanners or other 3D digitization equipment. Another approach is the image-based approach, which renders the 3D object directly using a large set of pre-acquired images without reconstructing geometric models. While the geometry-based approach provides better interaction and smaller data size, its cost-effectiveness is not as good as the image-based approach. We have developed a new image-based approach to 3D interactive virtual exhibition based on a technique named augmented panoramas. With augmented panoramas, the 3D exhibition space is represented by panoramas and can be augmented by the 3D objects to be exhibited, which can be represented either by geometric models or by object movies. Compared with other image-based techniques, such as light field rendering, object movies have the advantage of easier image acquisition and rendering. Here, a major challenge for augmented panoramas is how to integrate two sources of 2D images, the panoramas and the object movies, in a 3D-consistent way. In Taiwan, with the support of the National Digital Archives Program, we have designed and implemented a 3D stereoscopic kiosk system for virtually exhibiting 3D artifacts in the National Palace Museum, the National Historical Museum, and the Museum of the Institute of History and Philology, Academia Sinica. Also, we have built a few other interactive display systems, which will be shown in this presentation. For example, we shall show the Magic Crystal Ball, which allows the user to see a virtual object appearing inside a transparent ball and to rotate the virtual object by barehanded interaction. Our goal is to transform different concepts from movies and fiction into the development of a new medium for the users to access multimedia in an intuitive, imaginative and playful manner.
Keynote Lecture Information Just-in-Time: An Approach to the Paperless Office Hiromichi Fujisawa Hitachi Central Research Laboratory, Japan
[email protected]
Abstract. Information Just-in-Time or iJIT is a concept that denotes the idealistic state where information is always ready for use when it is necessary. Despite the great success of search engines for the Internet, we still have problems in dealing with information just around us. We still have a lot of paper documents on the desk, in the cabinets and drawers. The reasons are well described by A. Sellen in her book on "The Myth of the Paperless Office." By borrowing the concept of hot/warm/cold documents from the book, we will introduce a system that allows users to work on 'hot' paper documents, keeping the writings in the computer simultaneously. The system, called the iJIT system, uses Anoto's digital pen technology with on-demand printing capability to capture the history of usage of documents and annotations. Electronically produced documents can be printed with the Anoto dots, and handwriting made on the papers by a digital pen is digitized and stored in the computer. The digitized writing information also holds encrypted date, time, and pen ID information, by which we can identify who wrote what and when in a secured way. Because the equivalent information is kept in the computer, such paper documents may be discarded at any time. A prototype system being used by about 200 people at our laboratory will be described.
Keynote Lecture Sampling Less and Reconstructing More for Magnetic Resonance Imaging Pablo Irarrazaval Department of Electrical Engineering, Pontificia Universidad Cat´ olica, Chile Magnetic Resonance Imaging Research Center, Chile
[email protected]
Abstract. Magnetic Resonance Imaging (MRI) is one of the modalities for medical imaging with the fastest growth in recent years. The rapid adoption of MRI is explained by its high soft tissue sensitivity, unprecedentedly high resolution and contrast for some anatomies, wide variety of contrast mechanisms and functionality, high geometric flexibility, and lastly because of its innocuousness. Nevertheless, there are several issues that remain challenging for the scientific community. These are typically related to two linked characteristics: high cost (investment and operation) and long scans, particularly for dynamic imaging. One of the promising approaches currently investigated for reducing the scan time is under-sampling. This is a fascinating area of research in which the Nyquist sampling theorem is defied: the data are sparsely sampled in the Fourier domain and later on reconstructed with a minimum of artefacts (aliasing, for instance). This talk will review some of the techniques currently proposed, which are at different stages of applicability, such as partial Fourier and key-hole, kt-BLAST, UNFOLD, Obels and Compressed Sensing. All of these employ some kind of a priori knowledge to reconstruct fairly high-quality images from data under-sampled by factors of 4, 16 and more.
Keynote Lecture Phase is an Important Low-Level Image Invariant Peter Kovesi The Vision Research Group The University of Western Australia
[email protected]
Abstract. The performance of low-level image operators has always been less than satisfactory. The unreliability of the output, sensitivity to noise and the difficulty in setting thresholds have long frustrated those working in high-level vision. Much of the recent success of high-level image processing techniques has come about from the use of robust estimation techniques, such as RANSAC, and the use of effective optimization algorithms. These techniques have allowed the deficiencies of low-level operators to be largely ignored. However, problems still remain. Most of the existing low-level operators for feature detection and feature description are based on the use of local image derivatives. This is problematic because image gradients are affected by image contrast and scale. There is much we do not know about the low-level structure and statistics of images. This is especially so for the newer classes of images such as X-ray, MRI, and geological aeromagnetic images. It is too simplistic to think of image features as consisting of only step edges or lines. There is a continuum of feature types between these two. These hybrid feature types occur just as frequently as do lines and steps. Gradient-based operators are unable to properly detect or localize these other feature types. To overcome these problems I argue that local phase information should be the building block for low-level feature detectors and descriptors. Phase encodes the spatial structure of an image, and crucially, it is invariant to image contrast and scale. Intriguingly, while phase is important, phase values can be quantized quite heavily with little penalty. This offers interesting opportunities with regard to image compression and for devising compact feature descriptors. I will present some approaches that show how features can be detected, classified and described by phase information in a manner that is invariant to image contrast.
A Pipelined 8x8 2-D Forward DCT Hardware Architecture for H.264/AVC High Profile Encoder Thaísa Leal da Silva1, Cláudio Machado Diniz1, João Alberto Vortmann2, Luciano Volcan Agostini2, Altamiro Amadeu Susin1, and Sergio Bampi1 1
UFRGS – Federal University of Rio Grande do Sul - Microelectronics Group Porto Alegre - RS, Brazil {tlsilva, cmdiniz, bampi}@inf.ufrgs.br,
[email protected] 2 UFPel – Federal University of Pelotas – Group of Architectures and Integrated Circuits Pelotas - RS, Brazil {agostini, jvortmann}@ufpel.edu.br
Abstract. This paper presents the hardware design of an 8x8 bi-dimensional Forward Discrete Cosine Transform used in the high profiles of the H.264/AVC video coding standard. The designed DCT is computed in a separable way as two 1-D transforms. It uses only add and shift operations, avoiding multiplications. The architecture contains one datapath for each 1-D DCT with a transpose buffer between them. The complete architecture was synthesized to Xilinx Virtex II Pro and Altera Stratix II FPGAs and to a TSMC 0.35μm standard-cell technology. The synthesis results show that the 2-D DCT transform architecture reached the necessary throughput to encode high-definition videos in real time for all target technologies. Keywords: Video compression, 8x8 2-D DCT, H.264/AVC standard, Architectural Design.
1 Introduction H.264/AVC (MPEG 4 part 10) [1] is the latest video coding standard developed by the Joint Video Team (JVT), which is formed by the cooperation between the ITU Video Coding Experts Group (VCEG) and the ISO/IEC Moving Pictures Experts Group (MPEG). This standard achieves significant improvements over the previous standards in terms of compression rates [1]. The H.264/AVC standard was initially organized into three profiles: Baseline, Extended and Main. A profile defines a set of coding tools or algorithms which can be used to generate a video bitstream [2]. Each profile is targeted at specific classes of video applications. The first version of the H.264/AVC standard was focused on "entertainment-quality" video. In July 2004, an extension was added to this standard, called the Fidelity Range Extensions (FRExt). This extension focused on professional applications and high-definition videos [3]. Then, a new set of profiles was defined and this set was generically called High profile, which is the focus of this work. There are four different profiles in the High profile set, all targeting high-quality videos: High profile (HP) includes support for video with 8 bits per sample and with a YCbCr
color relation of 4:2:0. High 10 profile (Hi10P) supports videos with 10 bits per sample and also with a 4:2:0 color relation. High 4:2:2 profile (H422P) supports a 4:2:2 color relation and videos with 10 bits per sample. Finally, High 4:4:4 profile (H444P) supports a 4:4:4 color relation (without color subsampling) and videos with 12 bits per sample. One improvement present in the High profiles is the inclusion of an 8x8 integer transform in the forward transform module. This transform is an integer approximation of the 8x8 2-D Discrete Cosine Transform (DCT) and it is commonly referred to as the 8x8 2-D DCT in this standard [3]. This new transform is used to code luminance residues in some specific situations. The other profiles support only the 4x4 DCT transform. However, significant compression performance gains were reported for Standard Definition (SD) and High Definition (HD) solutions when transforms larger than 4x4 are used [4]. So, in the High profiles, the encoder can choose adaptively between the 4x4 and 8x8 transforms when the input data were not intra- or inter-predicted using sub-partitions smaller than 8x8 samples [5][6]. Fig. 1 presents a block diagram of the H.264 encoder. The main blocks of the encoder [7], as shown in Fig. 1, are: motion estimation (ME), motion compensation (MC), intra prediction, forward and inverse (T and T-1) transforms, forward and inverse quantization (Q and Q-1), entropy coding and the de-blocking filter. This work focuses on the design of an 8x8 2-D forward DCT hardware architecture, which composes the T block of H.264/AVC coders when the High profile is considered. The T module is highlighted in Fig. 1. This architecture was designed without multiplications, using only shift-add operations, aiming to reduce the hardware complexity. Moreover, the main goal of the designed architecture was to reach the throughput needed to process HDTV 1080 frames (1080x1920 pixels) in real time, allowing its use in H.264/AVC encoders targeting HDTV. This architecture was synthesized to Altera and Xilinx FPGAs and to TSMC 0.35µm standard cells, and the synthesis results indicated that the 2-D DCT designed in this work reaches a very high throughput, making its use possible in a complete video coder for high-resolution videos. We did not find any solution in the literature which presents an H.264/AVC 8x8 2-D DCT completely designed in hardware.
Fig. 1. Block diagram of a H.264/AVC encoder
This paper is organized as follows: section two presents a review of the 8x8 2-D forward DCT transform algorithm. The third section presents the designed architecture. Section four presents the validation strategy. The results of this work and the discussions about these results are presented in section five. Section six presents comparisons of this work with related works. Finally, section seven presents the conclusions of this work.
2 8x8 2-D Forward DCT Algorithm The 8x8 2-D forward DCT is computed in a separable way as two 1-D transforms: a 1-D horizontal transform (row-wise) and a 1-D vertical transform (column-wise). The 2-D DCT calculation is achieved through the multiplication of three matrices, as shown in Equation (1), where X is the input matrix, Y is the transformed matrix, Cf is the transformation matrix and CfT is the transpose of the transformation matrix. The transformation matrix Cf is shown in Equation (2) [5][8].
Y = Cf X CfT    (1)

Cf = (1/8) ·
    [  8    8    8    8    8    8    8    8 ]
    [ 12   10    6    3   -3   -6  -10  -12 ]
    [  8    4   -4   -8   -8   -4    4    8 ]
    [ 10   -3  -12   -6    6   12    3  -10 ]
    [  8   -8   -8    8    8   -8   -8    8 ]
    [  6  -12    3   10  -10   -3   12   -6 ]
    [  4   -8    8   -4   -4    8   -8    4 ]
    [  3   -6   10  -12   12  -10    6   -3 ]    (2)
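For reference, Equation (1) can be checked in software by plain matrix multiplication. The following C sketch is only an illustrative model (it is not the hardware described in this paper): it applies Cf = M/8 in floating point, so its output agrees with the integer butterfly of Table 1 below only up to the truncation introduced by the right shifts.

```c
/* Integer part M of the transformation matrix; Cf = M / 8, see Equation (2). */
static const int M[8][8] = {
    {  8,   8,   8,   8,   8,   8,   8,   8},
    { 12,  10,   6,   3,  -3,  -6, -10, -12},
    {  8,   4,  -4,  -8,  -8,  -4,   4,   8},
    { 10,  -3, -12,  -6,   6,  12,   3, -10},
    {  8,  -8,  -8,   8,   8,  -8,  -8,   8},
    {  6, -12,   3,  10, -10,  -3,  12,  -6},
    {  4,  -8,   8,  -4,  -4,   8,  -8,   4},
    {  3,  -6,  10, -12,  12, -10,   6,  -3}
};

/* Reference model of Y = Cf * X * Cf^T, computed as two separable 1-D passes. */
void dct8x8_matrix_ref(const int X[8][8], double Y[8][8])
{
    double tmp[8][8];

    /* Column pass: tmp = Cf * X */
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++) {
            double s = 0.0;
            for (int k = 0; k < 8; k++)
                s += (M[i][k] / 8.0) * X[k][j];
            tmp[i][j] = s;
        }

    /* Row pass: Y = tmp * Cf^T */
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++) {
            double s = 0.0;
            for (int k = 0; k < 8; k++)
                s += tmp[i][k] * (M[j][k] / 8.0);
            Y[i][j] = s;
        }
}
```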
This transform can be calculated through fast butterfly operations according to the algorithm presented in Table 1 [5], where in denotes the vector of input values, out denotes the transformed output vector, and a and b are internal variables.
Table 1. 2-D Forward 8x8 DCT Algorithm
Step 1 a[0] = in[0] + in[7]; a[1] = in[1] + in[6]; a[2] = in[2] + in[5]; a[3] = in[3] + in[4]; a[4] = in[0] - in[7]; a[5] = in[1] - in[6]; a[6] = in[2] - in[5]; a[7] = in[3] - in[4];
Step 2 b[0] = a[0] + a[3]; b[1] = a[1] + a[2]; b[2] = a[0] - a[3]; b[3] = a[1] - a[2]; b[4] = a[5] + a[6]+ ((a[4]>>1) + a[4]); b[5] = a[4] - a[7] - ((a[6]>>1) + a[6]); b[6] = a[4] + a[7] - ((a[5]>>1) + a[5]); b[7] = a[5] - a[6] + ((a[7]>>1) + a[7]);
Step 3 out[0] = b[0] + b[1]; out[1] = b[4] + (b[7]>>2); out[2] = b[2] + (b[3]>>1); out[3] = b[5] + (b[6]>>2); out[4] = b[0] - b[1]; out[5] = b[6] - (b[5]>>2); out[6] = (b[2]>>1) - b[3]; out[7] = - b[7] + (b[4]>>2);
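Translated directly into C, the butterfly of Table 1 reads as follows (an illustrative transcription; variable names follow the table). Only additions, subtractions and shifts appear, which is what allows the multiplier-free hardware described in Section 3.

```c
/* 1-D 8-point forward DCT butterfly (Table 1): shift-add only, no multiplications. */
void dct8_1d(const int in[8], int out[8])
{
    int a[8], b[8];

    /* Step 1: input butterflies */
    a[0] = in[0] + in[7];  a[4] = in[0] - in[7];
    a[1] = in[1] + in[6];  a[5] = in[1] - in[6];
    a[2] = in[2] + in[5];  a[6] = in[2] - in[5];
    a[3] = in[3] + in[4];  a[7] = in[3] - in[4];

    /* Step 2: even/odd intermediate terms; x + (x >> 1) realizes 1.5x */
    b[0] = a[0] + a[3];
    b[1] = a[1] + a[2];
    b[2] = a[0] - a[3];
    b[3] = a[1] - a[2];
    b[4] = a[5] + a[6] + ((a[4] >> 1) + a[4]);
    b[5] = a[4] - a[7] - ((a[6] >> 1) + a[6]);
    b[6] = a[4] + a[7] - ((a[5] >> 1) + a[5]);
    b[7] = a[5] - a[6] + ((a[7] >> 1) + a[7]);

    /* Step 3: output combinations */
    out[0] = b[0] + b[1];
    out[1] = b[4] + (b[7] >> 2);
    out[2] = b[2] + (b[3] >> 1);
    out[3] = b[5] + (b[6] >> 2);
    out[4] = b[0] - b[1];
    out[5] = b[6] - (b[5] >> 2);
    out[6] = (b[2] >> 1) - b[3];
    out[7] = (b[4] >> 2) - b[7];
}
```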
This algorithm was derived from Equation (1) and it needs three steps to compute the 1-D DCT transform. However, in this work the algorithm presented in [5] was modified in order to reduce the critical path of the designed architecture and to allow a better-balanced pipeline. The modified algorithm was divided into five steps, so that the architecture could be designed as a five-stage pipeline. The modified algorithm is presented in Table 2 and it computes the 1-D DCT transform in five steps, using at most one addition or subtraction to generate each result, which allows the desired balancing between the calculation stages.
Table 2. 2-D Forward 8x8 DCT Modified Algorithm
Step 1 a[0] = in[0] + in[7]; a[1] = in[1] + in[6]; a[2] = in[2] + in[5]; a[3] = in[3] + in[4]; a[4] = in[0] - in[7]; a[5] = in[1] - in[6]; a[6] = in[2] - in[5]; a[7] = in[3] - in[4];
Step 2 b[0] = a[0] + a[3]; b[1] = a[1] + a[2]; b[2] = a[0] - a[3]; b[3] = a[1] - a[2]; b[4] = a[5] + a[6]; b[5] = a[4] - a[7]; b[6] = a[4] + a[7]; b[7] = a[5] - a[6]; b[8] = a[4]; b[9] = a[5]; b[10] = a[6]; b[11] = a[7];
Step 3 c[0] = b[0]; c[1] = b[1]; c[2] = b[2]; c[3] = b[3]; c[4] = b[4] + (b[8]>>1); c[5] = b[5] - (b[10]>>1); c[6] = b[6] - (b[9]>>1); c[7] = b[7] + (b[11]>>1); c[8] = b[8]; c[9] = b[9]; c[10] = b[10]; c[11] = b[11];
Step 4 d[0] = c[0]; d[1] = c[1]; d[2] = c[2]; d[3] = c[3]; d[4] = c[4] + c[8]; d[5] = c[5] - c[10]; d[6] = c[6] - c[9]; d[7] = c[7] + c[11];
Step 5 out[0] = d[0] + d[1]; out[1] = d[4] + (d[7]>>2); out[2] = d[2] + (d[3]>>1); out[3] = d[5] + (d[6]>>2); out[4] = d[0] - d[1]; out[5] = d[6] - (d[5]>>2); out[6] = (d[2]>>1) - d[3]; out[7] = (d[4]>>2) - d[7];
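The same computation, written stage by stage as in Table 2, is sketched below in C (again an illustrative model, not the VHDL design). Each stage produces every result with at most one addition or subtraction, plus shifts that cost only wiring in hardware, and the five stages produce exactly the same outputs as the three-step algorithm of Table 1.

```c
/* Five-stage variant of the 1-D butterfly (Table 2), one function per pipeline stage. */
static void stage1(const int in[8], int a[8]) {
    a[0] = in[0] + in[7];  a[4] = in[0] - in[7];
    a[1] = in[1] + in[6];  a[5] = in[1] - in[6];
    a[2] = in[2] + in[5];  a[6] = in[2] - in[5];
    a[3] = in[3] + in[4];  a[7] = in[3] - in[4];
}
static void stage2(const int a[8], int b[12]) {
    b[0] = a[0] + a[3];  b[4] = a[5] + a[6];  b[8]  = a[4];
    b[1] = a[1] + a[2];  b[5] = a[4] - a[7];  b[9]  = a[5];
    b[2] = a[0] - a[3];  b[6] = a[4] + a[7];  b[10] = a[6];
    b[3] = a[1] - a[2];  b[7] = a[5] - a[6];  b[11] = a[7];
}
static void stage3(const int b[12], int c[12]) {
    for (int i = 0; i < 4; i++)  c[i] = b[i];     /* pass-through values */
    c[4] = b[4] + (b[8]  >> 1);
    c[5] = b[5] - (b[10] >> 1);
    c[6] = b[6] - (b[9]  >> 1);
    c[7] = b[7] + (b[11] >> 1);
    for (int i = 8; i < 12; i++) c[i] = b[i];     /* pass-through values */
}
static void stage4(const int c[12], int d[8]) {
    for (int i = 0; i < 4; i++)  d[i] = c[i];
    d[4] = c[4] + c[8];
    d[5] = c[5] - c[10];
    d[6] = c[6] - c[9];
    d[7] = c[7] + c[11];
}
static void stage5(const int d[8], int out[8]) {
    out[0] = d[0] + d[1];          out[4] = d[0] - d[1];
    out[1] = d[4] + (d[7] >> 2);   out[5] = d[6] - (d[5] >> 2);
    out[2] = d[2] + (d[3] >> 1);   out[6] = (d[2] >> 1) - d[3];
    out[3] = d[5] + (d[6] >> 2);   out[7] = (d[4] >> 2) - d[7];
}

/* Full 1-D transform obtained by chaining the five stages. */
void dct8_1d_5stage(const int in[8], int out[8]) {
    int a[8], b[12], c[12], d[8];
    stage1(in, a); stage2(a, b); stage3(b, c); stage4(c, d); stage5(d, out);
}
```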
3 Designed Architecture Based on the modified algorithm presented in Section 2, a hardware architecture for the 8x8 2-D Forward DCT transform was designed. The architecture uses the 2-D DCT separability property [9], where the 2-D DCT transform is computed as two 1-D DCT transforms, one row-wise and the other column-wise. The transposition is made by a transpose buffer. The 2-D DCT block diagram is shown in Fig. 2. The architecture was designed to consume and produce one sample per clock cycle. This decision was made to allow an easy integration with the other
transforms designed in our research group for the T module of the H.264/AVC main profile [10], which were designed with these production and consumption rates.
Fig. 2. 2-D DCT Block Diagram
The two 1-D DCT modules are similar; the difference is the number of bits used in each pipeline stage of these architectures and, consequently, in the number of bits used to represent each sample. This occurs because each addition operation can generate a carry out, so the number of bits needed to represent the data increases by one bit. Both 1-D DCT modules were designed in the same way; only the input and output bit widths change for the second 1-D DCT module. The control of this architecture was hierarchically designed and each sub-module has its own control. A simple global control is used to start the sub-modules' operation in a synchronous way. The designed architecture for the 8x8 1-D DCT transform is shown in Fig. 3. The hardware architecture implements the modified algorithm presented in Section 2. It has a five-stage pipeline and it uses ping-pong buffers, adders/subtractors and multiplexers. This architecture uses only one operator in each pipeline stage, as shown in Fig. 3. Ping-pong buffers are two register lines (ping and pong) of n registers each. The data enter the ping buffer serially, one sample per clock cycle. When n samples are ready in the ping buffer, they are sent to the pong buffer in parallel [11]. There are five ping-pong buffers in the architecture and these registers are necessary to allow the pipeline synchronization. The 1-D DCT was the first designed module. A Finite State Machine (FSM) was designed to control the architecture datapath.
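A minimal behavioral sketch of one such ping-pong register line, written in C purely for illustration (the actual design is in VHDL, and its register widths grow along the pipeline, which is not modeled here):

```c
/* Behavioral model of a ping-pong register line: samples are shifted serially
 * into the "ping" line, one per clock cycle; every 8th cycle the whole line is
 * copied in parallel into the "pong" line, which then feeds the stage operator. */
typedef struct {
    int ping[8];
    int pong[8];
    int count;              /* samples accumulated so far in the ping line */
} pingpong_t;

/* Returns 1 when pong has just been reloaded with a complete 8-sample row. */
int pingpong_clock(pingpong_t *pp, int sample_in)
{
    pp->ping[pp->count++] = sample_in;
    if (pp->count == 8) {
        for (int i = 0; i < 8; i++)     /* parallel ping -> pong transfer */
            pp->pong[i] = pp->ping[i];
        pp->count = 0;
        return 1;
    }
    return 0;
}
```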
Fig. 3. 1-D 8x8 DCT Architecture
A transpose buffer [11] was designed to transpose the resulting matrix from the first 1-D DCT, generating the input matrix to the second 1-D DCT transform. The transpose buffer is composed of two 64-word RAMs and three multiplexers, besides various control signals, as presented in Fig. 4. The RAM memories operate in an interleaved way: while one of them is used for writing, the other one is used for reading. Thus, the first 1-D DCT architecture writes the results line by line in one memory (RAM1 or RAM2) and the second 1-D DCT architecture reads the input values column by column from the other memory (RAM2 or RAM1). The signals Wad and Rad define the memory addresses, and the signals Control1 and Control2 define the read/write signals of the memories. The main signals of this architecture are also controlled by a local FSM.
Fig. 4. Transpose Buffer Architecture
Each 1-D DCT architecture has its own FSM to control its pipeline. These local FSMs control the data synchronization among these modules. The first 1-D DCT architecture has an 8-bit input and a 13-bit output. The transpose buffer has a 13-bit input and output. In the second 1-D DCT architecture a 13-bit input and an 18-bit output are used. Finally, the 2-D DCT architecture has an 8-bit input and an 18-bit output. The two 1-D DCT architectures each have a latency of 40 clock cycles. The transpose buffer latency is 64 clock cycles. Then, the global 8x8 2-D DCT latency is 144 clock cycles.
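The alternating operation of the two RAMs can be modeled behaviorally as follows (an illustrative C sketch, not the VHDL implementation; the addresses and control signals Wad, Rad, Control1 and Control2 are abstracted away):

```c
/* Behavioral model of the transpose buffer: two 64-word memories used alternately.
 * While the first 1-D DCT writes one 8x8 block row by row into one RAM, the second
 * 1-D DCT reads the previously stored block column by column from the other RAM,
 * which performs the transposition implicitly. */
typedef struct {
    int ram[2][64];
    int wr_sel;                         /* RAM currently selected for writing */
} transpose_buf_t;

void tb_write_sample(transpose_buf_t *tb, int idx, int value)
{
    /* idx = 0..63 in row-major order (row*8 + col), as produced by DCT #1 */
    tb->ram[tb->wr_sel][idx] = value;
}

int tb_read_sample(const transpose_buf_t *tb, int idx)
{
    /* idx = 0..63 in the order consumed by DCT #2: column by column, i.e. the
     * sample at (row, col) = (idx % 8, idx / 8) of the previously stored block. */
    int row = idx % 8, col = idx / 8;
    return tb->ram[tb->wr_sel ^ 1][row * 8 + col];
}

void tb_swap(transpose_buf_t *tb)       /* called after each complete 8x8 block */
{
    tb->wr_sel ^= 1;
}
```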
4 Architecture Validation

The reference data for the validation of the designed architecture were extracted directly from the H.264/AVC encoder reference software, and the ModelSim tool was used to run the simulations. A testbench was designed in VHDL to generate the input stimuli and to store the output results in text files. The input stimuli used were the input data extracted from the reference software. The first simulation considers just a behavioral model of the designed architecture. The second simulation considers a post place-and-route model of the designed architecture. In this step the ISE tool was used together with
ModelSim to generate the post place-and-route information. The target device selected was a Xilinx VP30 Virtex-II Pro FPGA. After some corrections in the VHDL descriptions, the comparison between the simulation results and the reference software results indicated no differences between them. The designed architecture was also synthesized for standard cells with the Leonardo Spectrum tool, targeting the TSMC 0.35µm technology. Afterwards, the ModelSim tool was used again to run new simulations considering the files generated by Leonardo and to validate the standard-cell version of this architecture.
5 Synthesis Results

The architectures of the two 1-D DCTs and the transpose buffer were described in VHDL and synthesized to an Altera Stratix II EP2S15F484C3 FPGA, a Xilinx VP30 Virtex-II Pro FPGA and the TSMC 0.35µm standard-cell technology. These architectures were grouped to form the 2-D DCT architecture, which was also synthesized for these target technologies. The 2-D DCT architecture was designed to reach real time (24 fps) when processing HDTV 1080 frames, considering the HP, Hi10P and H422P profiles. Hence, color relations of 4:2:0 and 4:2:2 are allowed and 8 or 10 bits per sample are supported. In this case, the target throughput is 100 million samples per second. This section presents the synthesis results obtained considering a 2-D DCT input bit width of 8 bits.

The synthesis results of the two 1-D DCT modules, the transpose buffer module and the complete 2-D DCT targeting Altera and Xilinx FPGAs are presented in Tables 3 and 4, respectively. From Table 3 and Table 4 it is possible to notice the differences in hardware resource usage and maximum operation frequency between the two 1-D DCT modules, since the second 1-D DCT module uses a higher bit width than the first one. It is also possible to notice in both tables that the transpose buffer uses few logic elements and reaches a high operation frequency, since it is basically two block RAMs and a little control. From Table 3 it is important to notice that the 8x8 2-D DCT uses 2,718 LUTs of the Altera Stratix II FPGA and reaches a maximum operation frequency of 161.66 MHz. With these results, this 2-D DCT is able to process 161.66 million samples per second.

Table 3. Synthesis results for the Altera Stratix II FPGA

Blocks                      LUTs    Flip Flops   Mem. Bits   Period (ns)   Throughput (Msamples/s)
First 1-D DCT Transform     1,072   877          -           5.03          198.77
Transpose Buffer            40      16           1,664       2.00          500
Second 1-D DCT Transform    1,065   1,332        -           5.18          193.09
2-D DCT Integer Transform   2,718   2,225        1,664       6.18          161.66
Selected Device: Stratix II EP2S15F484C3
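As a quick sanity check of the 100 Msamples/s target mentioned above, the required sample rate for HDTV 1080 at 24 fps can be estimated as follows. This is a rough sketch: the 1920x1080 frame size and the 1.5x/2x luma-plus-chroma factors for 4:2:0/4:2:2 are the usual values, assumed here rather than taken from the paper.

```c
#include <stdio.h>

int main(void)
{
    const double luma = 1920.0 * 1080.0 * 24.0;  /* luma samples per second */
    /* 4:2:0 adds 0.5x chroma samples, 4:2:2 adds 1.0x chroma samples */
    printf("4:2:0: %.1f Msamples/s\n", luma * 1.5 / 1e6);  /* ~74.6 */
    printf("4:2:2: %.1f Msamples/s\n", luma * 2.0 / 1e6);  /* ~99.5 */
    return 0;
}
```

The 4:2:2 case gives roughly 99.5 Msamples/s, which is consistent with the 100 Msamples/s target stated above and well below the 161.66 Msamples/s reached on the Stratix II.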
This rate is enough to process HDTV 1080 frames in real time (24 fps) when the 4:2:0 or 4:2:2 color relations are considered. Table 4 presents the results for the Xilinx Virtex-II Pro FPGA: this synthesis reported a use of 1,430 LUTs and a maximum operation frequency of 122.87 MHz, allowing a processing rate of 122.87 million samples per second. This processing rate is also enough to reach real time when processing HDTV 1080 frames.

Table 4. Synthesis results for the Xilinx Virtex-II Pro FPGA

Blocks                      LUTs    Flip Flops   Mem. Bits   Period (ns)   Throughput (Msamples/s)
First 1-D DCT Transform     562     884          -           6.49          153.86
Transpose Buffer            44      17           2           2.31          432.11
Second 1-D DCT Transform    776     1,344        -           7.09          141.02
2-D DCT Integer Transform   1,430   2,250        2           8.13          122.87
Selected Device: Virtex II - Pro 2vp30ff896-7
Table 5 shows the synthesis results targeting the TSMC 0.35µm standard-cell technology for all designed blocks. This table also distinguishes the synthesis results of the 2-D DCT architecture with and without the block RAM synthesis. From these results it is possible to notice that the number of gates used in the architecture including the block RAMs is almost double that of the architecture without them. This difference arises because the memories were mapped directly to register banks. Nevertheless, this architecture is able to process 124.1 million samples per second, also reaching the throughput needed to process HDTV 1080 frames in real time. The presented synthesis results indicate that the 2-D DCT architecture designed in this work reaches a processing rate of 24 HDTV 1080 frames per second for all technology targets. This processing rate allows the use of this architecture in H.264/AVC encoders for the HP, Hi10P and H422P profiles, which target high-resolution videos.

Table 5. Synthesis results for the TSMC 0.35µm standard-cell technology

Blocks                            Total Logic Elements (Gates)   Period (ns)   Throughput (Msamples/s)
First 1-D DCT Transform           7,510                          6.33          158.1
Transpose Buffer                  15,196                         4.65          215.2
Second 1-D DCT Transform          11,230                         7.58          131.9
2-D DCT Transform (without RAM)   19,084                         7.58          131.9
2-D DCT Transform (with RAM)      33,936                         8.05          124.1
6 Related Works

There are many papers in the literature presenting dedicated hardware designs for the 8x8 2-D DCT, but no papers were found targeting the complete 8x8 2-D DCT defined in the H.264/AVC High profile. There are some papers about the 4x4 2-D DCT of the H.264/AVC standard, but not about the 8x8 2-D DCT. Only three papers were found about the High profile transforms, and none of them reports a complete hardware design of the 8x8 2-D DCT defined in the standard. The first work [12] proposes a new encoding scheme to compute the classical 8x8 DCT coefficients using error-free algebraic integer quantization (AIQ). The algorithm was described in Verilog and synthesized for a Xilinx VirtexE FPGA. This work presented an operation frequency of 101.5 MHz and a consumption of 1,042 LUTs, and did not present throughput data. The second work [13] proposes a hardware implementation of the H.264/AVC simplified 8x8 2-D DCT and quantization. However, this work implements just the 1-D DCT architecture and not the 8x8 2-D DCT architecture. The comparison with the first paper [12] shows that the architecture designed in this paper presents a higher operation frequency and a small increase in hardware resource consumption. A comparison in terms of throughput was not viable, since this data is not presented in [12]. A comparison with the second paper is not possible, since it reports only an 8x8 1-D DCT and quantization design, while this work presents an 8x8 2-D DCT. Finally, the third work [14] proposes a fast algorithm for the 8x8 2-D forward and inverse DCT and also proposes an architecture for these transforms. However, this architecture was not implemented in hardware; therefore, it is not possible to make comparisons with that work. Other 8x8 2-D solutions presented in the literature were also compared with the architecture presented in this paper. These other solutions are not compliant with the H.264/AVC standard. Solutions [11], [15], [16], [17] and [18] present hardware implementations of the 8x8 2-D DCT using some type of approximation to use only integer arithmetic instead of the floating-point arithmetic originally present in the 2-D DCT. A comparison of our design with others, in terms of throughput and target technology, is presented in Table 6. The differences between those implementations will not be discussed, as they use completely different technologies, physical architectures and techniques to reduce area and power. The throughputs in Table 6 show that our 8x8 2-D DCT implemented in the Stratix II surpasses all other implementations. Our standard-cell based 8x8 2-D DCT is able to process 124 million samples per second and presents the highest throughput among the standard-cell designs. Our FPGA-based results could be better had we used macro function adders, which are able to use the special fast carry chains present in the FPGAs. Based on these comparisons, it is possible to conclude that the 8x8 2-D forward DCT architecture designed in this paper presents significant advantages over other published works.
Table 6. Comparative results for 8x8 2-D DCT

Design                      Technology   Throughput (Msamples/s)
Our standard-cell version   0.35µm       124
Fu [15]                     0.18µm       75
Agostini [11]               0.35µm       44
Katayama [16]               0.35µm       27
Hunter [17]                 0.35µm       25
Chang [18]                  0.6µm        23.6
Our Stratix II version      Stratix II   162
Agostini [11]               Stratix II   161
Our Virtex II version       Virtex II    123
7 Conclusions and Future Works

This work presented the design and validation of a high-performance H.264/AVC 8x8 2-D DCT architecture. The implementation details and the synthesis results targeting FPGAs and standard cells were also presented. This architecture was designed to reach high throughputs and to be easily integrated with the other H.264/AVC modules. The modules which compose the 2-D DCT architecture were synchronized, and a constant processing rate of one sample per clock cycle is achieved. This constant processing rate is independent of the data type and makes the integration of this architecture with other modules easier. The synthesis results showed a minimum period of 8.13 ns considering FPGAs and a minimum period of 8.05 ns considering standard cells. These results indicate that the global architecture is able to process 122.87 million samples per second when mapped to FPGAs and 124.1 million samples per second when mapped to standard cells, allowing its use in H.264/AVC encoders targeting HDTV 1080 at 24 frames per second. As future work, we plan to explore other design strategies for the 8x8 DCT of the H.264/AVC standard and to compare the obtained results. The first design strategy to be explored is to implement another 8x8 2-D DCT in a parallel fashion with a processing rate of 8 samples per clock cycle. Another planned work is the integration of this module into the forward transform module of the H.264/AVC encoder.
References 1. Joint Video Team of ITU-T, and ISO/IEC JTC 1: Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264 or ISO/IEC 14496-10 AVC). JVT Document, JVT-G050r1 (2003) 2. Wiegand, T., Sullivan, G.J., Bjontegaard, G., Luthra, A.: Overview of the H.264/AVC Video Coding Standard. IEEE Transactions on Circuits and Systems For Video Technology 13, 560–576 (2003)
3. Sullivan, G.J., Topiwala, P.N., Luthra, A.: The H.264/AVC Advanced Video Coding Standard: Overview and Introduction to the Fidelity Range Extensions. In: SPIE Conference on Application of Digital Image Processing, Denver, CO, vol. XXVII (5558), pp. 454–474 (2004) 4. Gordon, S., Marpe, D., Wiegand, T.: Simplified Use of 8x8 Transforms. JVT Document, JVT-I022 (2004) 5. Gordon, S., Marpe, D., Wiegand, T.: Simplified Use of 8x8 Transforms - Updated Proposal & Results. JVT Document, JVT-K028 (2004) 6. Marpe, D., Wiegand, T., Gordon, S.: H.264/MPEG4-AVC Fidelity Range Extensions: Tools, Profiles, Performance, and Application Areas. In: International Conference on Image Processing, ICIP 2005, Genova, Italy, vol. 1, pp. 593–596 (2005) 7. Richardson, I.E.G.: H.264 and MPEG-4 Video Compression - Video Coding for NextGeneration Multimedia. John Wiley & Sons, Chichester, UK (2003) 8. Malvar, H.S., Hallapuro, A., Karczewicz, M., Kerofsky, L.: Low-Complexity Transform and Quantization in H.264/AVC. IEEE Transactions on Circuits and Systems for Video Technology 13, 598–603 (2003) 9. Bhaskaran, V., Konstantinides, K.: Image and Video Compression Standards: Algorithms and Architectures, 2nd edn. Kluwer Academic Publishers, Norwell, MA (1997) 10. Agostini, L.V., Porto, R.E.C., Bampi, S., Rosa, L.Z.P., Güntzel, J.L., Silva, I.S.: High Throughput Architecture for H.264/AVC Forward Transforms Block. In: Great Lake Symposium on VLSI, GLSVLSI 2006, New York, NY, pp. 320–323 (2006) 11. Agostini, L.V., Silva, T.L., Silva, S.V., Silva, I.S., Bampi, S.: Soft and Hard IP Design of a Multiplierless and Fully Pipelined 2-D DCT. In: International Conference on Very Large Scale Integration, VLSI-SOC 2005, Perth, Western Australia, pp. 300–305 (2005) 12. Wahid, K., Dimitrov, V., Jullien, G.: New Encoding of 8x8 DCT to make H.264 Lossless. In: Wahid, K., Dimitrov, V., Jullien, G. (eds.) Asia Pacific Conference on Circuits and Systems, APCCAS 2006, Singapore, pp. 780–783 (2006) 13. Amer, I., Badawy, W., Jullien, G.: A High-Performance Hardware Implementation of the H.264 Simplified 8X8 Transformation and Quantization. In: International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2005, Philadelphia, PA, vol. 2, pp. 1137–1140 (2005) 14. Fan, C.-P.: Fast 2-D Dimensional 8x8 Integer Transform Algorithm Design for H.264/AVC Fidelity Range Extensions. IEICE Transactions on Informatics and Systems E89-D, 2006–3011 (2006) 15. Fu, M., Jullien, G.A., Dimitrov, V.S., Ahmadi, M.: A Low-Power DCT IP Core Based on 2D Algebraic Integer Encoding. In: International Symposium on Circuits and Systems, ISCAS 2004, Vancouver, CA, vol. 2, pp. 765–768 (2004) 16. Katayama, Y., Kitsuki, T., Ooi, Y.: A Block Processing Unit in a Single-Chip MPEG-2 Video Encoder LSI. In: Workshop on Signal Processing Systems, Shanghai, China, pp. 459–468 (1997) 17. Hunter, J., McCanny, J.: Discrete Cosine Transform Generator for VLSI Synthesis. In: International Conference on Acoustics, Speech, and Signal Processing, ICASSP 1998, Seattle, WA, vol. 5, pp. 2997–3000 (1998) 18. Chang, T.-S., Kung, C.-S., Jen, C.-W.: A Simple Processor Core Design for DCT/IDCT. IEEE Transactions on Circuits and Systems for Video Technology 10, 439–447 (2000)
A Real Time Infrared Imaging System Based on DSP & FPGA Babak Zamanlooy, Vahid Hamiati Vaghef, Sattar Mirzakuchaki, Ali Shojaee Bakhtiari, and Reza Ebrahimi Atani Department of Electrical Engineering Iran University of Science and Technology Narmak, 16846, Tehran, Iran {Babak_Zamanlooe, Vahid_Hamiiativaghef, Ali_Shojaeebakhtiari}@ee.iust.ac.ir {M_Kuchaki, Rebrahimi}@iust.ac.ir
Abstract. The principle, configuration, and the special features of an infrared imaging system are presented in this paper. The work has been done in two parts. First, the nonuniformity of IRFPA is detected using a processing system based on FPGA & microcontroller. The FPGA generates system timing and performs data acquisition, while the microcontroller reads the IRFPA data from FPGA and sends them to the computer. Afterwards the infrared imaging system is implemented based on DSP & FPGA. The DSP executes high level algorithms such as two–point nonuniformity correction. The FPGA here performs two functions: the first one is reading the IRFPA video output and sending it to DSP; the second function is reading the corrected data from DSP and sending them to video encoder which converts the digital data to the analog video signal. The experimental results show that the system is suitable for the real time infrared imaging with high quality and high precision. Keywords: IRFPA, Nonuniformity Detection, Nonuniformity Correction.
1 Introduction

With the development of Infrared Focal Plane Array (IRFPA) technology, the advantages of high density, excellent performance, high reliability and miniaturization have become available in Infrared (IR) imaging systems [1]. At present, the acquisition of high-quality images has become the key problem of IR imaging systems. Such systems generally need to process mass data in real time. The processing includes various algorithms such as nonuniformity correction, image segmentation, local characteristics extraction, image de-noising, image enhancement, etc.; hence there must be a well-integrated high-speed information processing system [2]. Another important problem of infrared imaging systems is fixed-pattern noise (also known as spatial nonuniformity noise), which arises because of the difference in response characteristics of each photodetector in an IRFPA [3], [4]. To solve this problem, photoresponse nonuniformity correction must be applied by software or hardware [5].
Because of these requirements, a system has been designed which has the capability of detecting and correcting nonuniformity and displaying high-quality infrared images. This imaging system is based on DSP & FPGA and fulfills the requirements of infrared imaging systems. The nonuniformity detection system is described in Section 2, while the infrared imaging system is presented in Section 3. Next, the experimental results are shown in Section 4. Finally, conclusions are drawn in Section 5.
2 Nonuniformity Detection System

2.1 Hardware Configuration of the Nonuniformity Detection System

The schematic diagram of the signal processing system for IRFPA nonuniformity detection based on FPGA & microcontroller is shown in Fig. 1. This system consists of an IRFPA, a driving circuit, an ADC, an FPGA and a microcontroller. The IRFPA is an infrared opto-electronic device sensitive to radiation in the 7.7 to 10.3 micrometer spectral region. It includes a high-sensitivity focal plane array formed by photovoltaic Mercury Cadmium Telluride diodes connected to a silicon CMOS readout integrated circuit. The driving circuit unit provides the necessary signals and supply voltage for the IRFPA's proper operation. This board also acts as a buffer so that the ADC board has no effect on the IRFPA's video output signal. The output of the IRFPA is an analog signal and the signal processing system is digital, so this analog signal must first be converted to digital format. This is done using the ADC, which transforms the analog video signal to digital. In order to support image data processing with high speed and precision, a 12-bit ADC with a sampling frequency of up to 25 MHz was selected, so that a high-resolution output of digitized data is obtained. The FPGA used in the IRFPA nonuniformity detection system has two functions. It acts as the synchronization and timing control module and harmonizes the other units in the system, including the output circuit unit of the IRFPA and the ADC sampling unit. It also acts like an SRAM and stores the IRFPA's video output. Another part of the system is a microcontroller that reads the data stored in the FPGA and then sends this data to a computer using the RS232 standard.

2.2 Software of the Nonuniformity Detection System

The software of the FPGA has been written using the Verilog hardware description language. It causes the FPGA to store the IRFPA output data and also produces the necessary synchronization signals. The software for the microcontroller has been written in the C language; it activates the microcontroller's serial interface, reads the data stored in the FPGA and sends it to the computer. In addition, a program has been written in MATLAB which reads the IRFPA data from the microcontroller using the computer's serial port and saves them in a lookup table.
Fig. 1. Schematic diagram of signal processing system for IRFPA nonuniformity detection
3 Infrared Imaging System

3.1 Hardware Configuration of the Infrared Imaging System

The schematic diagram of the real-time IRFPA imaging system based on DSP & FPGA is shown in Fig. 2.

Fig. 2. Schematic diagram of the real-time infrared imaging system

This system consists of an IRFPA, a driving circuit, an ADC, an FPGA and a high-speed DSP. The ADC transforms the analog output of the IRFPA to digital format. The FPGA reads the digital video data from the ADC and stores them. When one complete frame has been read, the DSP reads this data through its external memory interface (EMIF) unit. The DSP used here is Texas Instruments' TMS320VC5510. This DSP achieves high performance and low power consumption through increased parallelism and a strong focus on reducing power dissipation, and it operates at a frequency of 200 MHz [6]. The EMIF unit of the DSP offers configurable timing parameters so that it can be used as an interface to a variety of asynchronous memory types, including flash memory, SRAM, and EPROM [7]; the FPGA here acts like an SRAM. The DSP reads the video data using the EMIF unit and then applies the nonuniformity correction coefficients to the data read, correcting them. After applying nonuniformity correction, the video data is ready for display. It should be noted, however, that the digital data cannot be displayed directly on a TV and must be converted to a standard television signal. To do this, the FPGA reads the corrected data from the DSP using the host port interface (HPI). The HPI unit of the DSP provides a 16-bit-wide parallel port through which an external host processor can directly access the memory of the DSP [8]. The conversion of the digital data to a standard television signal is done using the ADV7177, an integrated digital video encoder that converts digital video data into a standard analog baseband television signal [9].

3.2 Software of the Infrared Imaging System

The software written for the infrared imaging system consists of FPGA and DSP programs. The FPGA program is written in the Verilog hardware description language. It causes the FPGA to read the digital output of the ADC and store it like an SRAM, which can be read by the DSP. The program also causes the FPGA to read the corrected data from the DSP using the host port interface (HPI) and send them to
video encoder. The software of the DSP is written in the C language. It activates the EMIF and HPI units of the DSP. This program also applies the nonuniformity correction algorithm to the video data.

3.3 Nonuniformity Correction Algorithm

The so-called nonuniformity of an IRFPA is caused by the variation in response among the detectors in the IRFPA under uniform background illumination. There are several factors causing nonuniformity. The main sources of nonuniformity are: (1) response nonuniformity, including spectral response nonuniformity; (2) nonuniformity of the readout circuit and of the coupling between the detector and the readout circuit; and (3) nonuniformity of the dark current [5]. Without nonuniformity correction (NUC), the images from the IRFPA are distorted and are not suitable for image formation [10]. There are two main types of nonuniformity correction (NUC) techniques. The first is to calibrate each pixel by the signal obtained when the FPA views a flat-field calibration target (usually a blackbody radiation source) held at several known temperatures; it assumes that the response characteristics of the detectors are constant over time, and this method is usually called calibration-based correction. The second is to use an image sequence to estimate the correction factors or to estimate the corrected signal directly; this kind of method is based on the scene and requires no calibration of the FPA, and therefore it is called scene-based correction. Although the latter method is convenient and has developed greatly recently, it has two disadvantages. The first one is that it does not reveal the correspondence between the signal output and the thermal radiation (or temperature) of the object observed. The other is that, for lack of prior information about the FPA, many scene-based techniques are sophisticated and need a procedure to estimate the correction factors, which makes their realization impractical in some real-time systems, especially where the correction needs to be implemented in hardware. Consequently, calibration-based NUC methods are still the main compensation method in many IR imaging systems, especially systems used to measure the accurate thermal radiation or temperature of the scene [5].
The algorithm used here is a two-point correction method, which is a calibration-based method. In this algorithm the detector outputs are assumed to be linear and stable in time, as shown in Fig. 3. The detector output can be expressed as [11]:

S_{ij}(\phi) = K_{ij} \phi + Q_{ij}    (1)

where \phi represents the incident irradiance on detector (i, j), S_{ij}(\phi) is the output of detector (i, j), and K_{ij} and Q_{ij} are the gain and the offset of detector (i, j), respectively.
Fig. 3. The linear model of response curve of detector in IRFPA
According to the radiation range of the scene that the IRFPA observes, two irradiances \phi_1 and \phi_2 are chosen as the correction points, and the detector response data at these two points are recorded using the nonuniformity detection system which was investigated in Section 2. Then the average values of all detector outputs S_{ij}(\phi_1) and S_{ij}(\phi_2) in the IRFPA are calculated, respectively:

S_1 = \frac{1}{N \times M} \sum_{i=1}^{N} \sum_{j=1}^{M} S_{ij}(\phi_1)    (2)

S_2 = \frac{1}{N \times M} \sum_{i=1}^{N} \sum_{j=1}^{M} S_{ij}(\phi_2)    (3)

The line determined by (S_{ij}(\phi_1), S_1) and (S_{ij}(\phi_2), S_2), illustrated in Fig. 4, is used as the normalized line for the correction of the response of all pixels. Then the output value S_{ij}(\phi) and its corrected value S'_{ij}(\phi) are related as follows:
S'_{ij}(\phi) = \frac{S_2 - S_1}{S_{ij}(\phi_2) - S_{ij}(\phi_1)} S_{ij}(\phi) + S_1 - \frac{(S_2 - S_1)\, S_{ij}(\phi_1)}{S_{ij}(\phi_2) - S_{ij}(\phi_1)}, \qquad i = 1, 2, \ldots, N, \; j = 1, 2, \ldots, M    (4)
Fig. 4. Sketch map of the two-point correction
The normal two-point NUC based on the linearity model has the advantage of requiring little online computation. IRFPA imaging systems need to process data in real time; therefore, this method was selected to correct the nonuniformity. Equation (4) can be rewritten as:

S'_{ij}(\phi) = G_{ij} S_{ij}(\phi) + O_{ij}    (5)

where G_{ij} and O_{ij} are the correction coefficients for the gain and offset of detector (i, j). G_{ij} and O_{ij} are precalculated and then stored in the FLASH memory unit. When the system is operating, it reads them out of the flash and corrects the data in real time.
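For illustration, a minimal C sketch of this two-point scheme is given below: it derives G_ij and O_ij from the two recorded calibration responses (Eqs. (2)-(4)) and then applies Eq. (5) per pixel. This is a software model only; the array layout, data types and function names are assumptions, not part of the DSP implementation described in the paper.

```c
#include <stddef.h>

/* Offline step: compute gain/offset per detector from the responses recorded
 * at the two calibration irradiances (Eqs. (2)-(4)).
 * Assumes S_ij(phi2) != S_ij(phi1) for every detector. */
static void nuc_calibrate(const float *s_phi1, const float *s_phi2,
                          float *gain, float *offset, size_t n_pixels)
{
    double s1 = 0.0, s2 = 0.0;
    for (size_t k = 0; k < n_pixels; k++) { s1 += s_phi1[k]; s2 += s_phi2[k]; }
    s1 /= (double)n_pixels;                  /* S1, Eq. (2) */
    s2 /= (double)n_pixels;                  /* S2, Eq. (3) */

    for (size_t k = 0; k < n_pixels; k++) {
        double d  = (double)s_phi2[k] - (double)s_phi1[k];
        gain[k]   = (float)((s2 - s1) / d);             /* G_ij */
        offset[k] = (float)(s1 - gain[k] * s_phi1[k]);  /* O_ij */
    }
}

/* Online step, Eq. (5): correct one frame in place. */
static void nuc_apply(float *frame, const float *gain, const float *offset,
                      size_t n_pixels)
{
    for (size_t k = 0; k < n_pixels; k++)
        frame[k] = gain[k] * frame[k] + offset[k];
}
```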
4 Experimental Results

The performance and capabilities of the IRFPA signal processing system were validated by procedures that connect the image processing system to the IRFPA. The IRFPA is made of Mercury Cadmium Telluride with 4x288 detectors, operating at a frame rate of 100 frames per second. It should be noted that the operation of the IRFPA at 100 frames per second is due to limitations of the imaging system. The results are shown in Fig. 5(a) and Fig. 5(b), respectively. Fig. 5(a) is the infrared image before nonuniformity correction; the nonuniformity has distorted the image of the hand. Fig. 5(b) is the infrared image after nonuniformity correction; the imaging quality is greatly improved compared with the raw image.
Fig. 5. (a) Infrared image before nonuniformity correction (b) Infrared image after nonuniformity correction
5 Summary and Conclusion

The IR imaging industry is rapidly expanding; thus, the need to improve the performance of processing systems for such applications is also growing. Nonuniformity detection, nonuniformity correction and the display of high-quality infrared images, which are carried out in this paper, are challenging tasks in IR imaging systems. The proposed IRFPA imaging system has the capability of detecting and correcting nonuniformity and displaying infrared images, and fulfills these complex tasks.
References 1. Scribner, D.-A., Kruer, M.-R., Killiany, J.-M.: Infrared focal plane array technology. Proceedings of IEEE 79, 66–85 (1991) 2. Zhou, H.X., Lai, R., Liu, S.Q., Wang, B.J.: A new real time processing system for the IRFPA imaging signal based on DSP&FPGA. Journal of Infrared Physics & Technology 46, 277–281 (2004) 3. Harris, J.G., Chiang, Y.M.: Nonuniformity correction of infrared image sequences using the constant-statistics constraint. IEEE Transactions on Image Processing 8, 1148–1151 (1999) 4. Milton, A.F., Barone, F.R., Kruer, M.R.: Influence of nonuniformity on infrared focal plane array performance. Journal of Optical Engineering 24, 855–862 (1985) 5. Shi, Y., Zhang, T., Zhigou, C., Hui, L.: A feasible approach for nonuniformity correction in IRFPA with nonlinear response. Journal of Infrared Physics & Technology 46, 329–337 (2004) 6. TMS320VC5510 Fixed-Point Digital Signal Processors, http://www.dspvillage.ti.com 7. TMS320VC5510, D.S.P.: External Memory Interface (EMIF) Reference Guide, http://www. dspvillage.ti.com 8. TMS320VC5510 DSP Host Port Interface (HPI) Reference Guide, http://www.dspvillage. ti.com
9. Integrated Digital CCIR-601 to PAL/NTSC Video Encoder, http://www.analog.com 10. Sui, J., Jin, W., Dong, L.: A scene-based nonuniformity correction technique for IRFPA using perimeter diaphragm strips. In: International Conference on Communication, Circuits and Systems, pp. 716–720 (2005) 11. Zhou, H.X, Rui, L., Liu, S.Q., Jiang, G.: New improved nonuniformity correction for infrared focal plane arrays. Journal of Optics Communications 245, 49–53 (2005)
Motion Compensation Hardware Accelerator Architecture for H.264/AVC Bruno Zatt1, Valter Ferreira1, Luciano Agostini2, Flávio R. Wagner1, Altamiro Susin3, and Sergio Bampi1 1 Informatics Institute Federal University of Rio Grande do Sul Porto Alegre – RS – Brazil 2 Informatics Department Federal University of Pelotas Pelotas – RS – Brazil 3 Electrical Engineering Department Federal University of Rio Grande do Sul Porto Alegre – RS – Brazil {bzatt, vaferreira, flavio, bampi}@inf.ufrgs.br
[email protected] [email protected]
Abstract. This work presents a new hardware acceleration solution for the H.264/AVC motion compensation process. A novel architecture is proposed to perform the luminance interpolation task, which accounts for the highest computational complexity in the motion compensator. The accelerator module was integrated into the VHDL description of the MIPS Plasma processor, and its validation was accomplished by simulation. A performance comparison was made between a software implementation and a hardware-accelerated one. This comparison indicates a reduction of 94% in processing time. The obtained throughput is enough to reach real time when decoding H.264/AVC Baseline Profile motion compensation for luminance at Level 3. Keywords: Video Coding, H.264/AVC, MPEG-4 AVC, Motion Compensation, Hardware Acceleration.
1 Introduction

Currently, the development of embedded devices that include some kind of video player is growing. Such systems need to find a balance between the computational complexity required to execute their functions and the resulting increase in energy consumption. On the other hand, the H.264/AVC video compression standard [1,2], due to its high complexity, needs powerful processors and hardware support to meet the application requirements. Furthermore, the motion compensation operation presents one of the highest computational complexities in an H.264/AVC decoder [3]. This high complexity also implies large energy consumption. This work intends to provide an efficient embedded solution for H.264/AVC motion compensation.
In this work, a general-purpose processor was used together with a specifically designed hardware accelerator to meet the embedded motion compensation requirements. The processor used was the MIPS Plasma, and a two-dimensional FIR filter was designed as the accelerator hardware. A satisfactory real-time performance was then obtained for the motion compensation process. As the operation frequency of the Plasma is relatively low (74 MHz), the energy consumption of this solution could be lower than that obtained through the design of the complete motion compensation in hardware. Another advantage is time-to-market, since processor-based systems are designed more quickly than specific integrated circuits. This paper is organized as follows. Section 2 presents the H.264/AVC standard. The motion compensation process in H.264/AVC and its main features are presented in the third section. In Section 4, the proposed MC hardware accelerator architecture is presented in detail. The integration with the MIPS processor is shown in Section 5. Section 6 presents the synthesis results and the performance comparison. Finally, Section 7 concludes the work.
2 The H.264/AVC Standard H.264/AVC [1] is the latest video coding standard of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). H.264/AVC provides higher compression rates than earlier standards as MPEG-2, H.263, and MPEG-4 part 2 [2]. The H.264/AVC decoder uses a structure similar to that used in the previous standards, but each module of a H.264/AVC decoder presents many innovations when compared with previous standards as MPEG-2 (also called H.262 [4]) or MPEG-4 part 2 [5]. Fig. 1 shows the schematic of the decoder with its main modules. The input bit stream first passes through the entropy decoding. The next steps are the inverse quantization and inverse transforms (Q-1 and T-1 modules in Fig. 1) to recompose the prediction residues. Motion compensation - MC (also called INTER prediction) reconstructs the macroblock (MB) from neighbor reference frames, while INTRA prediction reconstructs the macroblock from the neighbor macroblocks in the same frame. INTER or INTRA prediction reconstructed macroblocks are added to the residues, and the results of this addition are sent to the deblocking filter. Finally, the reconstructed frame is filtered by the deblocking filter, and the result is sent to the frame memory. This work focuses on the motion compensation module, which is highlighted in Fig. 1.
Fig. 1. H.264/AVC decoder diagram
H.264/AVC was standardized in 2003 [1] and defines four profiles, targeting different applications. These profiles are called: Baseline, Main, Extended, and High. The Baseline profile (which is the focus of this work) focuses on low delay applications and was developed to run on low-power platforms. The Main profile is oriented to high image quality and HDTV applications. It added some different features with regard to the Baseline profile, like: bi-prediction, weighted prediction (WP), direct prediction, CABAC, and interlaced video capabilities [1, 2]. The Extended profile was developed for streaming video applications. Finally, the High profile, which was defined in 2005 by the FRExt (Fidelity Range Extension) [2] extension, provides support to different color sub-sampling (4:2:2 and 4:4:4), besides all Main profile features. The standard also defines sixteen operation levels [1, 2], which are classified in accordance to the desired processing rate. This work presents an embedded solution for motion compensation of an H.264/AVC decoder considering the Baseline profile at Level 3.
3 Motion Compensation in H.264/AVC The operation of motion compensation in a video decoder can be regarded as a copy of predicted macroblocks from the reference frames. The predicted macroblock is added to the residual macroblock (generated by inverse transforms and quantization) to reconstruct the macroblock in the current frame. The motion compensator is the most demanding component of the decoder, consuming more than half of its computation time [3]. Intending to increase the coding efficiency, the H.264/AVC standard adopted a number of relatively new technical developments. Most of these new developments rely on the motion prediction process, like: variable block-size, multiple reference frames, motion vector over picture boundaries, motion vector prediction, and quarter-sample accuracy. This paper will explain in more details just the features that are used in the Baseline profile. Quarter-sample accuracy: Usually, the motion of blocks does not match exactly in the integer positions of the sample grid. So, to find good matches, fractional position accuracy is used. The H.264/AVC standard defines half-pixel and quarter-pixel
Fig. 2. (a) Half-sample luma interpolation and (b) Quarter-sample luma interpolation
accuracy for luma samples. When the best match is an integer position, only a 4x4-sample reference is needed to predict the current partition. However, if the best match is a fractional position, an interpolation is used to predict the current block. A matrix of 4x9 samples is needed to allow the interpolation of a fractional vector in the 'X' direction, while a matrix of 9x4 samples is needed to allow the interpolation of a fractional vector in the 'Y' direction. When fractional vectors occur in both directions, the interpolation needs a matrix of 9x9 samples. This need for extra samples has a direct impact on the number of memory accesses. Fig. 2(a) shows the half-sample interpolation, which is made by a six-tap FIR filter. Then, a simple average of integer and half-sample positions is used to generate the quarter-sample positions, as shown in Fig. 2(b). Multiple reference frames: In H.264/AVC, slices are formed by motion compensated blocks from past and future (in temporal order) frames. The past and future frames are organized in two lists of frames, called List 0 and List 1. The past and future frames are not restricted to the immediately neighboring frames, as in earlier standards. Fig. 3 presents an example of this feature.
Fig. 3. Multiple Reference Frames
4 MC Hardware Accelerator Architecture The hardware accelerator module for MC targets the bi-dimensional FIR filter, which is used in the luminance quarter-pixel interpolation process. This filter was designed using the 1-D separability property of 2-D FIR filters. Other MC filter implementations presented in the literature [6, 7, 8] use a combination of different vertical and horizontal FIR filters serially and target an ASIC implementation. In the architecture of this work, the 2-D interpolation is done by only four FIR filters used for vertical and horizontal filtering. The bilinear interpolation used to generate quarter-sample accuracy is done by bilinear filters embedded in the FIR filters. The hardware accelerator was designed to process 4x4 samples blocks. A six-tap filter is used to generate a block of 4x4 interpolated samples. An input block of up to 9x9 samples is necessary to generate the interpolation. The motion compensation luminance filtering considers eight different cases in this accelerator architecture, as listed below and presented in Fig. 4:
(a) No interpolation: The samples by-pass the filters; (b) No vertical interpolation without ¼ samples: the samples pass the FIR filters once with FIR interpolation; (c) No vertical interpolation with ¼ samples: the samples pass the filters once with FIR and bilinear interpolation; (d) No horizontal interpolation without ¼ samples: the samples by-pass the filters and are stored in the transposition memory, then the memory columns are sent to the FIR filters once with FIR interpolation; (e) No horizontal interpolation with ¼ samples: the samples by-pass the filters and are stored in the transposition memory, then the memory columns are sent to the FIR filters once with FIR and bilinear interpolation; (f) Horizontal and vertical interpolations without ¼ samples: the samples pass the filters twice with FIR interpolation; (g) Horizontal and vertical interpolations with ¼ samples: the samples pass the filters twice with FIR interpolation in the first time and with FIR and bilinear interpolation in the second one; (h) Horizontal and two vertical interpolations with ¼ samples: the samples pass the filters three times with FIR interpolation in the first and second times and with FIR and bilinear interpolation in the third time;
Fig. 4. Filtering cases
Fig. 5 presents the proposed MC hardware accelerator organization as well as its main modules and connections. The number above each connection indicates its width considering the number of 8-bit samples. The first procedure to start each block processing is to set up the filtering parameters. In this case, the parameters are only the motion vector coordinates ‘X’ and ‘Y’. This information defines the kind of filtering that will be used and spends one clock cycle. The ‘X’ and ‘Y’ coordinates are stored in two specific registers inside the architecture, which are omitted in Fig. 5.
Fig. 5. MC unit architecture
The input receives a line with 9 samples, and these samples are sent to the FIR filter module. This module contains four 6-tap FIR filters working in parallel. After the interpolation, which generates four half-pixel samples, these results are sent to a transposition memory together with four non-interpolated samples. If just 1-D interpolation is needed, then the filter output is sent directly to the output memory. This process is repeated up to nine times to process the whole 9x9 input block, completing the first 1-D filtering and filling the 8x9-sample transposition memory. After filling the transposition memory by shifting the lines vertically, the columns are shifted horizontally to the left side, sending the left column samples to the filter to be interpolated in the second dimension. Each column is composed of 9 full or half-samples. The quarter-samples are generated together with the half-samples during the same filter pass. Each filter can perform the FIR filtering and the bilinear filtering in the same clock cycle, since the bilinear filter is embedded in the FIR filter, as shown in Fig. 6. However, when quarter-sample accuracy is needed, four other samples must be sent to the filters. Depending on the filtering case, the transposition memory is filled in a different order to simplify the multiplexing logic for the FIR input. When just the half-samples need to be interpolated again in the second filter loop, they are sent to the four left memory columns (columns 0 to 3) and the full samples are sent to the four right columns (columns 4 to 7). When just full samples need to be filtered in the second filter loop, these samples are sent to the left columns (columns 0 to 3) and half-samples to the right columns (columns 4 to 7). Finally, when both half and full samples need to be filtered, the columns are interleaved: even columns are filled with full samples while
odd columns are filled with half-samples (columns 0, 2, 4, and 6 for full samples and 1, 3, 5, and 7 for half-samples). When just half or full samples are interpolated, after the second filtering loop the results are sent to the output memory, which stores the output block of 4x4 interpolated samples. If both half and full samples must be filtered again, the full samples are processed and the outputs are stored in four delay registers to create one cycle of delay. So, in the next cycle, when the half-sample columns are filtered by the FIR filter, the interpolated samples processed in the previous cycle are sent to the embedded bilinear filter to generate the quarter-samples. After the interpolation is completed, the output is also sent to the output memory. The output memory can be read by columns or by lines, depending on whether the input was transposed or not. The kind of output depends on the type of interpolation and is controlled through an output multiplexer. Each FIR filter is composed of six taps with coefficients (1, -5, 20, 20, -5, 1). Fig. 6 shows the FIR filter hardware, which was designed using only additions and shifts to eliminate the multiplications. With six 8-bit inputs (E, F, G, H, I, J), the FIR block includes five adders, two shifters, one subtractor, and a clipping unit to keep the values in the range [0..255]. A bilinear filter was embedded in the FIR filter. As inputs, the bilinear filter uses the FIR output and an 8-bit input (Y) to perform the bilinear filtering.
Fig. 6. FIR filter block diagram
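As a software illustration of this datapath, the following C sketch computes one half-sample with the six-tap filter (coefficients 1, -5, 20, 20, -5, 1, implemented with shifts and adds) followed by the optional embedded bilinear step. It is a behavioral model only; the rounding and shift convention (+16, >>5) is the usual H.264/AVC one and is assumed here rather than taken from Fig. 6.

```c
#include <stdint.h>

static inline int clip255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

/* Six-tap FIR: E - 5F + 20G + 20H - 5I + J, with the multiplications
 * replaced by shifts and adds, then rounded, shifted and clipped to [0..255]. */
static uint8_t fir6(int E, int F, int G, int H, int I, int J)
{
    int a = G + H;                 /* 20*(G+H) = (a<<4) + (a<<2) */
    int b = F + I;                 /*  5*(F+I) = (b<<2) + b      */
    int acc = E + J + (a << 4) + (a << 2) - (b << 2) - b;
    return (uint8_t)clip255((acc + 16) >> 5);
}

/* Embedded bilinear step: quarter-sample as the rounded average of the
 * FIR result and a neighboring full/half sample Y. */
static uint8_t quarter(uint8_t fir_out, uint8_t Y)
{
    return (uint8_t)((fir_out + Y + 1) >> 1);
}
```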
The MC unit was described in VHDL and validated by simulation using the Mentor Graphics ModelSim software. The simulation was controlled by a testbench also written in VHDL.
5 Integration

To evaluate the designed hardware accelerator, it was integrated with a general-purpose processor. The MIPS Plasma core was chosen because of its simple RISC
organization and because its VHDL description was available from the "Opencores" community [9]. The MIPS Plasma organization and its integration with the MC hardware accelerator module can be seen in Fig. 7. The MIPS Plasma processor is fully compliant with the MIPS I(TM) ISA, but without support for non-aligned memory access instructions. It was designed using a Von Neumann architecture with a 32x32-bit register bank and three pipeline stages. The integration demanded some modifications to the MC accelerator and to the MIPS Plasma core. Since the processor data bus is 32 bits wide and the input of the MC is 72 bits wide, a hardware wrapper was designed to make them compatible. Finally, some changes in the processor control block were made to insert new instructions and to call the new module tasks. The MC spends a variable number of cycles to process a 4x4 block, and the Plasma processor does not support parallel instruction execution. Therefore, the processor pipeline is kept frozen while this module is working.
Fig. 7. Integration architecture among MIPS Plasma and MC unit
The MC unit architecture uses an input interface modeled as a ping-pong buffer (see Fig. 8), which receives a 32-bit word per cycle, storing up to three words. After the module received the appropriate number of words, a specific instruction sends the signal to start the processing. Each word is read from the Plasma memory, sent to the processor register bank, and finally sent to the ping-pong buffer. This happens up to three times to process each MC input line. Finally, the words are sent to the MC accelerator. The ping-pong buffer filling process happens up to nine times for each block. Many clock cycles are spent to feed the MC unit. After loading, the data processing occurs relatively fast. After the processing, the results can be read from MC registers to the Plasma register bank.
Fig. 8. Ping-Pong buffer
Some new instructions were added to the MIPS Plasma instruction set to control the MC module. Each operation of reading, writing, or setting the MC hardware accelerator originated a new processor instruction. The new instructions use the MIPS type R format (as shown in Fig. 9), composed of a 6-bit op-code field, two 5-bit source register index fields, one 5-bit target register index field, and one 6-bit function field. The new instructions use the value "111111" in the op-code field, while in the function field the values from "000000" to "000100" were used. This op-code value is reserved for eventual instruction set expansions. The new instructions are listed in Table 1. The MC_WRITE instruction uses the "Rt" field to indicate the source register, while the MC_READ instruction uses the "Rd" field to indicate the target register. The other register fields need no specific value.
Fig. 9. Type R MIPS Instruction

Table 1. New Instructions

Function   Name       Description
000000     MC_SET     Sets motion vector coordinates
000001     MC_WRITE   Writes a word
000010     MC_PROC    Starts the filtering
000100     MC_READ    Reads a word
The final integration step was the validation of the integrated modules. An assembly code was written using the new instructions to feed and control the MC hardware accelerator. This assembly was loaded to the Plasma ROM memory, and its correct execution was verified through simulation using the Mentor ModelSim software.
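In C-like form, the control sequence exercised by that assembly code looks roughly as follows. The four placeholder functions stand for the new instructions of Table 1 (in a real build each call would be a single custom instruction with opcode 111111 and the corresponding function code); the 9-line/3-word feeding pattern follows the ping-pong input buffer description above, while the number of result words read back is an assumption of this sketch.

```c
#include <stdint.h>

/* Placeholders for the new instructions of Table 1; the dummy bodies exist
 * only so this sketch compiles. Funct codes: 000000, 000001, 000010, 000100. */
static inline void     mc_set(uint32_t mv_x, uint32_t mv_y) { (void)mv_x; (void)mv_y; }
static inline void     mc_write(uint32_t word)              { (void)word; }
static inline void     mc_proc(void)                        { }
static inline uint32_t mc_read(void)                        { return 0; }

/* Interpolate one 4x4 block: feed up to 9 reference lines, each sent as up
 * to three 32-bit words (the 72-bit MC input line), then start the filtering
 * and read the results back. The 4-word result size is an assumption
 * (4x4 8-bit samples = 16 bytes). */
static void mc_interpolate_block(uint32_t mv_x, uint32_t mv_y,
                                 const uint32_t ref_lines[9][3],
                                 uint32_t result[4])
{
    mc_set(mv_x, mv_y);                      /* set the filtering parameters */
    for (int line = 0; line < 9; line++)     /* fill the ping-pong buffer    */
        for (int w = 0; w < 3; w++)
            mc_write(ref_lines[line][w]);
    mc_proc();                               /* start the filtering          */
    for (int w = 0; w < 4; w++)              /* read the interpolated block  */
        result[w] = mc_read();
}
```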
6 Results and Comparisons

The MIPS Plasma processor and the MC accelerator architectures were synthesized targeting a Xilinx Virtex-2 Pro FPGA device (XC2VP30-7) [10] using the Xilinx ISE 8.1 software. Table 2 shows the resource utilization for the Plasma processor and the MC module in the second and third columns, respectively. The fourth column presents the synthesis results for the processor integrated with the MC hardware accelerator. Finally, the last column shows the ratio between the synthesis results obtained for the Plasma version with the hardware accelerator and for its original version. The synthesis results show a significant increase in hardware resource utilization, besides a degradation of the maximum operation frequency. The large increase in register utilization occurs because the memories inside the MC module were implemented as banks of registers and not as the Block RAMs available in this FPGA family.

Table 2. Synthesis Results

          Plasma    MC        MC + Plasma   Increase
LUTs      2599      1594      3966          52%
Reg       402       758       1262          213%
Slices    1378      891       2172          57%
Freq.     ~90 MHz   ~74 MHz   ~74 MHz       -21%
Two different software codes were described to compare the performances of the standard MIPS Plasma and the modified MIPS Plasma. The first code makes the motion compensation task in software without any hardware acceleration. The second solution is a HW/SW solution using the hardware acceleration instructions to call the new MC module. For a fair comparison, the software solution was optimized to have no multiplications, since a multiplication spends 32 clock cycles in this processor. Additions and shifts were used to eliminate multiplications. The first software solution (without MC accelerator) was described in C language based on the H.264/AVC reference software (JM 12.2) [11]. GCC was used as a cross compiler to generate MIPS machine code. This machine code was mapped to the MIPS Plasma ROM memory. The software was analyzed through simulations using Mentor Graphics ModelSim 6.0. These simulations were used to count the number of clock cycles necessary to process a 4x4 samples block at each different interpolation case. The HW/SW solution demanded an assembly description to use the new instructions. The same method of simulation used in the first code was applied to the HW/SW solution. Another way to generate the code using MC accelerating instructions is adapting the GCC compiler to use these instructions, but this solution was not implemented in this paper. The results obtained in the simulation process are shown in Tables 3 and 4. These tables present a considerable increase in performance with the use of the MC acceleration hardware. The performance increase reaches more than 95% in clock cycles and 94% in execution time, when comparing the average gains of the HW/SW solution in
relation to the SW one. As expected, because of the simplicity of the Plasma processor, the increase in area was relatively high and the performance gains were significant. The first and second columns of Tables 3 and 4 present the different interpolation cases and their probability of occurrence. In Table 3, the third column presents the total number of clock cycles spent to process each kind of block using the SW solution. The three following columns show the number of cycles spent to process a block in the HW/SW solution, considering also the cycles used for memory accesses and for effective processing. The seventh column presents the percentage of reduction in the number of clock cycles when using the MC accelerator. Table 4 shows the total execution time for the SW and HW/SW solutions in the third and fourth columns, respectively. The last column presents the percentage of reduction in terms of execution time, considering 90 MHz for the SW solution and 74 MHz for the HW/SW one.

Table 3. Results and Comparison (clock cycles)

Interpolation Cases     Prob.   SW Total Cycles   HW/SW Memory Cycles   HW/SW Processor Cycles   HW/SW Total Cycles   Clock Cycles Reduction
No Interpolation        1/16    187               24                    8                        32                   82.89%
No Vertical S/1/4       1/16    802               44                    8                        52                   93.52%
No Vertical C/1/4       1/8     1069              44                    8                        52                   95.14%
No Horizontal S/1/4     1/16    811               62                    17                       79                   90.26%
No Horizontal C/1/4     1/8     1084              62                    17                       79                   92.71%
Vert. & Hor. S/1/4      1/16    2245              62                    17                       79                   96.48%
Vert. & Hor. C/1/4      1/4     1717              62                    17                       79                   95.40%
Vert. & 2 Hor. C/1/4    1/4     2667              62                    21                       83                   96.89%
Weighted Average        -       1617.9            56.25                 15.75                    72                   95.55%
Table 4. Results and Comparison (execution time)

Interpolation Cases     Prob.   SW Time (ns)   HW/SW Time (ns)   Total Time Reduction
No Interpolation        1/16    207.78         43.24             79.19%
No Vertical S/1/4       1/16    891.11         70.27             92.11%
No Vertical C/1/4       1/8     1187.78        70.27             94.08%
No Horizontal S/1/4     1/16    901.11         106.76            88.15%
No Horizontal C/1/4     1/8     1204.44        106.76            91.14%
Vert. & Hor. S/1/4      1/16    2494.44        106.76            95.72%
Vert. & Hor. C/1/4      1/4     1907.78        106.76            94.40%
Vert. & 2 Hor. C/1/4    1/4     2963.33        112.16            96.22%
Weighted Average        -       1797.71        97.30             94.59%
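The "Weighted Average" rows follow directly from the per-case values and their probabilities of occurrence; the short C sketch below reproduces the cycle-count averages as an illustrative cross-check, using the numbers exactly as listed in Table 3.

```c
#include <stdio.h>

int main(void)
{
    const double prob[8] = {1/16., 1/16., 1/8., 1/16., 1/8., 1/16., 1/4., 1/4.};
    const double sw[8]   = {187, 802, 1069, 811, 1084, 2245, 1717, 2667};
    const double hwsw[8] = {32, 52, 52, 79, 79, 79, 79, 83};

    double avg_sw = 0.0, avg_hw = 0.0;
    for (int i = 0; i < 8; i++) { avg_sw += prob[i] * sw[i]; avg_hw += prob[i] * hwsw[i]; }

    printf("SW: %.1f cycles, HW/SW: %.2f cycles, reduction: %.2f%%\n",
           avg_sw, avg_hw, 100.0 * (1.0 - avg_hw / avg_sw));
    /* prints ~1617.9, 72.00 and 95.55%, matching the table */
    return 0;
}
```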
7 Conclusions This work presented a new architectural solution for a hardware accelerator for the motion compensation of an H.264/AVC Baseline Profile video decoder. The applicability in embedded devices was demonstrated. The MC accelerator was validated and successfully integrated to the MIPS Plasma VHDL description. Through simulations, data were extracted to evaluate the performance increase of the proposed solution. Results indicate sufficient performance to execute the luminance motion compensation decoding task in real time for H.264/AVC Baseline Profile at level 3. H.264/AVC at level 3 demands decoding 525SD (720x480) video sequences at 30 fps or 625SD (720x576) video sequences at 25 fps. The HW/SW performance gains were compared to a SW solution running in the MIPS Plasma processor. These results indicate a reduction of 95% in the number of necessary clock cycles and a reduction of 94% in execution time when using the MC accelerator. This architecture working at 74MHz and using an average number of 72 clock cycles to decode each 4x4 block can process up to 64.2K P-type macroblocks (16x16 samples) per second, reaching an average processing rate of 39.6 P-frames per second for 625SD (720x576).
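The frame-rate figures quoted above can be rederived from the 74 MHz clock and the 72-cycle average per 4x4 block. The following rough sketch considers luminance 4x4 blocks only, as in the throughput discussion of the text.

```c
#include <stdio.h>

int main(void)
{
    const double clock_hz  = 74e6;
    const double cyc_block = 72.0;              /* average cycles per 4x4 block */
    const double blocks_s  = clock_hz / cyc_block;
    const double mbs_s     = blocks_s / 16.0;   /* 16 luma 4x4 blocks per MB    */
    const double mbs_625sd = (720.0 * 576.0) / 256.0;  /* macroblocks per frame */

    printf("%.1fK MB/s, %.2f fps for 625SD\n", mbs_s / 1e3, mbs_s / mbs_625sd);
    /* ~64.2K MB/s and ~39.6 fps, consistent with the figures in the text */
    return 0;
}
```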
References 1. JVT, Wiegand, T., Sullivan, G., Luthra, A.: Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec.H.264 ISO/IEC 14496-10 AVC). JVT-G050r1, Geneva (2003) 2. International Telecommunication Union.: Advanced Video Coding for Generic Audiovisual Services. ITU-T Recommendation H(264) (2005) 3. Wiegand, T., Schwarz, H., Joch, A., Kossentini, F., Sullivan, G.: Rate-constrained Coder Control and Comparison of Video Coding Standards. IEEE Transactions on Circuits and Systems for Video Technology 13, 688–703 (2003) 4. International Telecommunication Union: Generic Coding of Moving Pictures and Associated Audio Information - Part 2. ITU-T Recommendation H(262) (1994) 5. International Organization For Standardization. Coding of Audio Visual Objects - Part 2 ISO/IEC 14496-2 - MPEG-4 Part 2 (1999) 6. Azevedo, A., Zatt, B., Agostini, L., Bampi, B.: Motion Compensation Decoder Architecture for H.264/AVC Main Profile Targeting HDTV. In: IFIP International Conference on Very Large Scale Integration, VLSI SoC, Nice, France, pp. 52–57 (2006) 7. Wang, S.-Z., Lin, T.-A., Liu, T.-M., Lee, C.-Y.: A New Motion Compensation Design for H.264/AVC Decoder. In: International Symposium on Circuits and Systems. In: ISCAS, Kobe, Japan, pp. 4558–4561 (2005) 8. Chen, J.-W., Lin, C.-C., Guo, J.-I., Wang, J.-S.: Low Complexity Architecture Design of H.264 Predictive Pixel Compensator for HDTV Applications. In: Proc. 2006 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, Toulouse, France, pp. 932–935 (2006) 9. OPENCORES.ORG (2007), Available from: URL: http://www.opencores.org/projects.cgi/ web/ mips/overview 10. Xilinx Inc. (2007), Availabe from: http://www.xilinx.com 11. H.264/AVC JM Reference Software (2007), Available from: URL: http://iphome.hhi.de/ suehring/tml
High Throughput Hardware Architecture for Motion Estimation with 4:1 Pel Subsampling Targeting Digital Television Applications Marcelo Porto1, Luciano Agostini2, Leandro Rosa2, Altamiro Susin1, and Sergio Bampi1 1
Microeletronics Groups (GME), UFRGS – Porto Alegre, RS, Brazil {msporto,bampi}@inf.ufrgs.br
[email protected] 2 Group of Architectures and Integrated Circuits (GACI),UFPel – Pelotas, RS, Brazil {agostini, lrosa.ifm}@ufpel.edu.br
Abstract. Motion estimation is the most important and complex operation in video coding. This paper presents an architecture for motion estimation using the Full Search algorithm with 4:1 Pel Subsampling, combined with the SAD distortion criterion. This work is part of the investigations to define the future Brazilian digital television broadcast system. The quality of the algorithm was compared with Full Search through software implementations. The quality of the 4:1 Pel Subsampling results was considered satisfactory, since its impact on the SAD results is below 4.5% when compared with the Full Search results. The designed hardware considers a search range of [-25, +24], with blocks of 16x16 pixels. The architecture was described in VHDL and mapped to a Xilinx Virtex-II Pro VP70 FPGA. Synthesis results indicate that it is able to run at 123.4 MHz, reaching a processing rate of 35 SDTV frames (720x480 pixels) per second. Keywords: Motion estimation, hardware architecture, FPGA design.
1 Introduction

Nowadays, the compression of digital videos is a very important task. Industry has a very high interest in digital video codecs because digital videos are present in many current applications, such as cell phones, digital television, DVD players, digital cameras and many others. This important position of video coding in current technology development has boosted the creation of various standards for video coding. Without the use of video coding, processing digital videos is almost impossible, due to the very high amount of resources necessary to store and transmit these videos. Currently, the most used video coding standard is MPEG-2 [1] and the latest and most efficient standard is H.264/AVC [2]. These standards drastically reduce the amount of data necessary to represent digital videos. A current video coder is composed of eight main operations, as shown in Fig. 1: motion estimation, motion compensation, intra-frame prediction, forward and inverse transforms (T and T-1), forward and inverse quantization (Q and Q-1) and entropy coding. This work focuses on motion estimation, which is highlighted in Fig. 1.
Fig. 1. Block diagram of a modern video coder
The motion estimation (ME) operation tries to reduce the temporal redundancy between neighboring frames [3]. One or more frames that were already processed are used as reference frames. The current frame and the reference frame are divided into blocks to allow the motion estimation. The idea is to replace each block of the current frame with one block of the reference frame, reducing the temporal redundancy. The block of the reference frame with the best similarity to each block of the current frame is selected. This selection is done through a search algorithm, and the similarity is defined through some distortion criterion [3]. The search is restricted to a specific area in the reference frame called the search area. When the best similarity is found, a motion vector (MV) is generated to indicate the position of this block inside the reference frame. These steps are repeated for every block of the current frame. The motion compensation operation reconstructs the current frame using the reference frames and the motion vectors generated by the motion estimation. The difference between the original and the reconstructed frame (called the residue) is sent to the transform and quantization calculations. Motion estimation is the video coder operation that provides the highest gain in terms of compression rate. However, motion estimation has a very high degree of computational complexity, and software implementations cannot reach real time (24-30 frames per second) when high-resolution videos are processed. This paper presents an FPGA-based architecture dedicated to the motion estimation operation. This architecture uses Full Search with 4:1 Pel Subsampling (also called Pel Decimation) [3] as the search algorithm, and the Sum of Absolute Differences (SAD) [3] as the distortion criterion. The ME design considers a search area of 64x64 pixels and blocks of 16x16 pixels, which implies a search range of [-25, +24]. The architecture was described in VHDL and mapped to Xilinx Virtex-II Pro FPGAs. This work was developed as part of an effort to develop intellectual property and to carry out an evaluation for the future Brazilian digital television broadcast system, the SBTVD [4]. The presented architecture was specifically designed to reach real time when processing standard definition television frames (720x480 pixels). Section 2 of this paper presents the 4:1 Pel Subsampling search algorithm and the SAD criterion. Section 3 presents a software evaluation of the search algorithms used.
Section 4 presents the designed architecture, detailing its main modules. Section 5 presents the designed architecture for SAD calculation. Section 6 presents the synthesis results and a comparison with related works. Finally, Section 7 presents the conclusions.
2 Description of the Used Algorithm

This section presents some details about the Full Search with 4:1 Pel Subsampling search algorithm and about the SAD distortion criterion. The architectural design presented in this paper is based on these two algorithms.

2.1 Full Search with 4:1 Pel Subsampling Algorithm

The Full Search with 4:1 Pel Subsampling algorithm is based on the traditional Full Search algorithm; however, the distortion criterion is not calculated for all samples. In 4:1 Pel Subsampling, for each pixel used in the calculation, three pixels are discarded [3]. With this algorithm only a quarter of the block samples are calculated, increasing the performance and decreasing the complexity of the motion estimation operation. Fig. 2 shows the 4:1 Pel Subsampling relation for a block with 8x8 samples. In Fig. 2, the black dots are the samples used in the SAD calculation and the white dots are the samples that are discarded.
Fig. 2. 4:1 Pel Subsampling in an 8x8 block
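To make the sampling pattern concrete, the short sketch below (our own illustration in Python/NumPy, not the authors' code) builds a 4:1 subsampling mask for a block; the choice of keeping the even-row/even-column samples is an assumption consistent with the memory organization described later, while the exact kept positions follow Fig. 2.

    import numpy as np

    def pel_subsample_mask(height, width):
        # 4:1 Pel Subsampling: keep one sample out of every 2x2 group.
        mask = np.zeros((height, width), dtype=bool)
        mask[0::2, 0::2] = True
        return mask

    mask = pel_subsample_mask(8, 8)
    print(int(mask.sum()))  # 16 -> a quarter of the 64 samples are used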
2.2 SAD Criterion

The distortion criterion defines how the differences between the regions are evaluated. Many distortion criteria have been proposed [3]; however, the most used for hardware design is the Sum of Absolute Differences (SAD). Equation (1) shows the SAD criterion, where SAD(x, y) is the SAD value for the (x, y) position, R is the reference sample, P is the search area sample and N is the block size.

SAD(x, y) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| R_{i,j} - P_{i+x,\, j+y} \right|    (1)
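As a reference for Equation (1), a direct (unoptimized) software formulation could look like the following sketch (our illustration; R is the NxN current block and P the search area, with row/column indexing assumed):

    import numpy as np

    def sad(R, P, x, y):
        # Eq. (1): SAD of the NxN block R against the candidate block of the
        # search area P whose top-left corner is at (x, y).
        N = R.shape[0]
        candidate = P[x:x + N, y:y + N]
        return int(np.abs(R.astype(np.int32) - candidate.astype(np.int32)).sum())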
3 Quality Evaluation

The search algorithm defines how the search for the best match is done in the search area. The choice of search algorithm has a direct impact on the motion vector quality and on the motion estimator performance. There are many algorithms to define the search method; however, the Full Search algorithm [5] is the most used for hardware implementations. The Full Search algorithm is the only one that presents optimal results in terms of best matching. All the others are fast algorithms designed to reduce the computational complexity of the motion estimation process. These algorithms produce sub-optimal results, because many positions are not compared. A good strategy for ME hardware implementation is to use the Full Search algorithm with pixel subsampling (also called Pel Subsampling), because this can reduce the number of pixel comparisons while keeping good quality results. A software analysis was developed to evaluate and compare the quality of the Full Search and Full Search with 4:1 Pel Subsampling algorithms. The main results are shown in Table 1. The search algorithms were developed in C and results for quality and computational cost were generated. The search area used was 64x64 pixels with a 16x16-pixel block size. The algorithms were applied to 10 real video sequences with a resolution of 720x480 pixels and the average results are presented in Table 1. The quality results were evaluated through the percentage of error reduction and the PSNR [3]. The percentage of error reduction is measured by comparing the results generated by the motion estimation process with the results generated by the simple subtraction between the reference and current frames. Table 1 also presents the number of SAD operations used by each algorithm.

Table 1. Software evaluation of Full Search and 4:1 Pel Subsampling
Search Algorithm                        Error reduction (%)   PSNR (dB)   # of SAD operations (Goperations)
Full Search                                   54.66              28.48              82.98
Full Search with 4:1 Pel Subsampling          50.20              27.25              20.74
The Full Search algorithm presents the optimal results for quality, generating the highest error reduction and the highest PSNR. However, it uses four times more SAD operations than Full Search with 4:1 Pel Subsampling. The quality losses generated by the use of 4:1 Pel Subsampling are small: only 4.46% in error reduction and only 1.23 dB in PSNR. It is important to notice that the Full Search with 4:1 Pel Subsampling algorithm can significantly reduce the computational costs of the motion estimation process with small losses in the quality results. Full Search based algorithms (including the version with 4:1 Pel Subsampling) are regular algorithms and they do not present data dependencies. These features are
important when a hardware design is considered. The regularity is important because it allows reuse of the designed basic modules, and the absence of data dependencies allows free exploration of parallelism. Another important characteristic of Full Search based algorithms is that this type of algorithm is deterministic in terms of the clock cycles used to generate a new motion vector. This characteristic allows easy integration and synchronization of this module with the other encoder modules. The parallelism exploration is important to generate a solution tuned to the application requirements. Considering these features and the good quality results, the Full Search with 4:1 Pel Subsampling algorithm was chosen to be designed in hardware. Using 4:1 Pel Subsampling it is possible to simplify the architecture while keeping the desired high performance.
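The software evaluation described above can be reproduced, in spirit, with a brute-force search such as the sketch below (our own illustration, not the authors' C code); it returns the position with the minimum SAD inside the search area and optionally uses the 4:1 subsampling mask.

    import numpy as np

    def full_search(block, search_area, mask=None):
        # Exhaustive block matching; 'mask' can be the 4:1 subsampling mask,
        # in which case only a quarter of the samples enter the SAD.
        n = block.shape[0]
        if mask is None:
            mask = np.ones((n, n), dtype=bool)
        best_sad, best_mv = None, (0, 0)
        for dy in range(search_area.shape[0] - n + 1):
            for dx in range(search_area.shape[1] - n + 1):
                cand = search_area[dy:dy + n, dx:dx + n]
                s = int(np.abs(block[mask].astype(np.int32)
                               - cand[mask].astype(np.int32)).sum())
                if best_sad is None or s < best_sad:
                    best_sad, best_mv = s, (dy, dx)
        return best_mv, best_sad  # displacement relative to the search-area origin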
4 Designed Architecture

There are many hardware architectures proposed in the literature that are based on the Full Search algorithm, such as [6] and [7]. These solutions are able to find the optimal results in terms of block matching. However, this type of architecture uses a very high amount of hardware resources. The Full Search complexity can be reduced, with little loss in result quality, using the subsampling technique. This complexity reduction implies an important reduction in the hardware resource cost. The architecture designed in this paper uses the Full Search with 4:1 Pel Subsampling algorithm with the SAD distortion criterion. The block diagram of the proposed architecture is presented in Fig. 3. This architecture was designed to operate on blocks of 16x16 samples with a search range of [-25, +24] samples. The internal memory is organized into five different memories, as presented in Fig. 3. One memory is used to store the current frame block and the other four memories are used to store the search area. The current block memory has 8 words and each word has 8 samples of 8 bits, for a total of 64 bits per memory word. This memory stores 64 samples (8x8) instead of 256 samples (16x16) because of the 4:1 subsampling relation. The four memories used to store the search area have 32 words and each word has 32 samples of 8 bits, for a total of 256 bits per memory word. The data from the search area were divided into four memories, considering the frame as a bi-dimensional matrix of samples: samples from even lines and even columns, samples from even lines and odd columns, samples from odd lines and even columns, and samples from odd lines and odd columns. Each word of each search area memory stores half of a line of the search area. Thus, the memory that stores the samples from even lines and even columns stores the samples (0, 2, 4, ..., 62), while the memory that stores the samples from even lines and odd columns stores the samples (1, 3, 5, ..., 63). This division was made to allow more efficient 4:1 Pel Subsampling processing. The architecture presented in Fig. 3 was designed to explore parallelism and to minimize local memory accesses. The data read from the memories are reused and each stored sample is read only once. The search area was divided into exactly four different memories to allow data reuse and the minimization of the number of local memory accesses. The data read from the search area memories are
[Fig. 3 block diagram: current block memory (CB); search area memories EE, EO, OE and OO; control; memory manager with the search line register (SLR) and block line registers (BLR0-BLR3); search area selector (SS) and block selector (BS); SAD lines 0 to 24, each built from processing units (PU); comparators (Comp); result register; motion vector output.]
Fig. 3. Motion Estimation Architecture
sent to all SAD lines (see Fig. 3), which use these data when necessary. The data read from the current block memory are shifted through the SAD lines (using the BLR registers in Fig. 3) and are used to generate the SADs. When the ME architecture starts, the memory manager reads half of a line of the search area (one word of one search area memory) and one line of the current block (one word from the current block memory). With these data, half of a line of the search area and one line of the current block are available to be processed. These lines are stored in the search line register (SLR in Fig. 3) and in the block line register (BLR in Fig. 3). The processing unit (PU in Fig. 3) calculates the distortion between two samples of the current block and two samples of a candidate block from the search area. Five PUs are used to form a SAD line. A set of 25 SAD lines forms the SAD matrix, as presented in Fig. 3. The control manages all these modules. Four iterations over the 25 SAD lines are necessary to process a complete search area, one iteration for each search area memory. The result memory register (RMR in Fig. 3) stores the best match from each iteration. The final best match is
[Fig. 4 datapath: two absolute-difference units operating on the sample pairs (R0, P0) and (R1, P1), whose results form the partial SAD.]
Fig. 4. PU architecture
generated after the comparison between the four results stored in the RMR, and then the motion vector is generated. All the operations necessary to generate one motion vector (MV) take 2615 clock cycles.
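The even/odd memory organization described above can be illustrated with the following sketch (ours, for illustration only): the 64x64 search area is split into the four 32x32 sub-sampled sub-areas that feed one iteration each.

    import numpy as np

    def split_search_area(area):
        # Split the 64x64 search area into the four 32x32 sub-sampled
        # sub-areas: even/even (EE), even/odd (EO), odd/even (OE), odd/odd (OO).
        return {"EE": area[0::2, 0::2], "EO": area[0::2, 1::2],
                "OE": area[1::2, 0::2], "OO": area[1::2, 1::2]}

    area = np.arange(64 * 64, dtype=np.uint16).reshape(64, 64)
    print({name: sub.shape for name, sub in split_search_area(area).items()})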
5 SAD Calculation Architecture

The SAD calculation architecture was hierarchically designed. The highest hierarchical level instance is the SAD matrix, which is formed by 25 SAD lines. Each SAD line is formed by five processing units (PUs), as presented in Fig. 3. Fig. 4 shows the PU architecture. When the 4:1 Pel Subsampling algorithm is used, the number of SAD calculations per line of the current block decreases to half of that of the Full Search algorithm, since the block is sub-sampled. This reduction in the number of calculations allows a reduction of the parallelism level of each PU without reducing the global ME performance. The PU architecture designed in this work is able to process a quarter of a line of each candidate block (two samples) per cycle. The subsampling reduces the size of the current block from 16x16 to 8x8 samples. The division of the search area into four sub-areas (stored in four different memories) implies sub-areas of 32x32 samples, since the complete area has 64x64 samples. It is then possible to conclude that there are 25 candidate blocks starting in each line of the search sub-area, because there are 32 samples per search line and 8 samples per block line. The partial SAD of the candidate block (a quarter of a line) must be stored and added to the SADs of the other parts of the same block to generate the total SAD for the block. The total SAD covers 8 lines with 8 samples in each line (64 SADs must be accumulated). The SAD lines, presented in Fig. 5, group five PUs and perform the accumulation to generate the final value of the candidate block SAD. Each PU is responsible for the SAD calculation of five different candidate blocks, at distinct times. Thus, a SAD line calculates the SAD of 25 different candidate blocks. Fig. 5 presents the five PUs (highlighted in gray) of one SAD line and also the accumulators used to keep the partial and final values of the SAD calculations for each block. As each PU processes the SAD calculation of two samples in parallel, each PU generates 32 partial SAD results for each processed block. These 32 partial results must be added to generate the final block SAD. A simple adder-and-accumulator structure is enough to generate this result.
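A behavioral software model of one PU and its accumulation, as we understand the description above, could look like this sketch (our illustration; the real design is a VHDL datapath, not software):

    def pu_partial_sad(r0, p0, r1, p1):
        # One PU cycle: two sample pairs produce one partial SAD.
        return abs(r0 - p0) + abs(r1 - p1)

    def candidate_block_sad(block_samples, candidate_samples):
        # Accumulate the 32 partial results (64 sub-sampled samples, two per
        # cycle) into the final SAD of one candidate block, as one ACC register does.
        acc = 0
        for i in range(0, len(block_samples), 2):
            acc += pu_partial_sad(block_samples[i], candidate_samples[i],
                                  block_samples[i + 1], candidate_samples[i + 1])
        return acc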
[Fig. 5 structure: five PUs (PU0-PU4), each with a demultiplexer/multiplexer pair and five accumulator registers (ACC0-ACC24 across the SAD line), producing the SAD line outputs C0-C4.]
Fig. 5. Block Diagram of a SAD Line
Each SAD line calculates the SADs of 25 candidate blocks, so a register is used to store the SAD of each block (ACC0 to ACC24 in Fig. 5) and a demux/mux pair is necessary to control the correct access to the registers. When a SAD line concludes its calculations, the registers ACC0 to ACC24 contain the final SADs of the 25 candidate blocks from one specific line of the search sub-area. The search area selector (SS in Fig. 3) and the block selector (BS in Fig. 3) choose the correct data for each PU in a SAD line. New, valid data are available to the PUs at each clock cycle. The comparator (Comp modules in Fig. 3) receives the output from the SAD lines and can make five SAD comparisons in parallel, in a pipeline with five stages. This module is responsible for comparing the 25 SADs from one SAD line (5 SADs per clock cycle) and for comparing the best SAD (lowest value) of this SAD line with the best SAD of the previous SAD line, as shown in Fig. 3. The result of each comparator consists of the best SAD among all SADs previously processed and a motion vector indicating the position of the block which generated this best SAD. This result is sent to the next comparator level (see Fig. 3). The five SAD line outputs (C0 to C4 in Fig. 5) generate five SAD values in each clock cycle. In five clock cycles, all 25 SAD values from the SAD line are ready and these values are used in the comparator. The comparator architecture is shown in Fig. 6. A motion vector generator is associated with each SAD line to generate the motion vector for each candidate block. These motion vectors are sent to the comparator with the corresponding SAD result.
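Functionally, the comparator keeps a running minimum over the SADs of a SAD line, carried over from the previous line together with its motion vector; a simple software model (ours, ignoring the five-stage pipelining) is:

    def compare_sad_line(line_sads, line_vectors, best_from_previous=None):
        # line_sads / line_vectors: the 25 SADs of one SAD line and their motion
        # vectors (the hardware receives them five per clock cycle).
        best = best_from_previous          # (sad, vector) from the previous line
        for s, v in zip(line_sads, line_vectors):
            if best is None or s < best[0]:
                best = (s, v)
        return best                        # lowest SAD so far and its motion vector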
[Fig. 6 structure: the SAD line outputs C0-C4 and their motion vectors (vector 0 to vector 4) enter a tree of two-input comparators whose MSB signals drive selection multiplexers; an accumulator stores the best SAD of the current SAD line, which is compared with the selected SAD and vector from the previous SAD line to output the best SAD and the motion vector of the best SAD.]
Fig. 6. Comparator Block Diagram
6 Synthesis Results

The synthesis results of the proposed architecture are summarized in Table 2. The synthesis was targeted to a Xilinx Virtex-II Pro VP70 FPGA and the ISE synthesis tool was used [8]. The synthesis results indicate that the designed architecture uses 30,948 LUTs (46% of the total device resources), 19,194 slices (58% of the total device resources) and 4 BRAMs (1% of the device resources). This architecture is able to run at 123.4 MHz and a new motion vector is generated every 2615 clock cycles. The synthesis results show that the designed architecture can reach a processing rate of 35 SDTV frames (720x480 pixels) per second. This processing rate is enough to process SDTV frames in real time. The performance could be better if the parallelism level were increased or if a faster target device were used. Some related works using the Full Search algorithm with 4:1 Pel Subsampling can be found in the literature, such as [9], [10] and [11]. However, these works target standard cell technology and a comparison with our FPGA results is not easily made. Other related works on Full Search algorithms targeting FPGA implementations, such as [12], [13] and [14], were also found. The published solutions consider a search range of [-16, +15], while our solution considers a search range of [-25, +24]. The higher search range was defined to allow better quality results when processing high resolution videos. We did not find any published solution based on the Full Search algorithm with a search range larger than or equal to [-25, +24].
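As a rough consistency check (our arithmetic, not the authors'): an SDTV frame of 720x480 pixels contains (720/16) x (480/16) = 1350 blocks of 16x16 pixels; at 2615 clock cycles per motion vector this is 1350 x 2615 ≈ 3.53 million cycles per frame, so 123.4 MHz / 3.53M ≈ 35 frames per second, matching the reported processing rate.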
We calculated the number of cycles that our solution needs to generate a motion vector considering the range [-16, +15], to allow a comparison of our architecture with the published solutions. This calculation results in 634 clock cycles to generate a motion vector. The operation frequency was estimated at the same 123.4 MHz, since the architecture would only be reduced to work in the [-16, +15] range.

Table 2. Synthesis results for the [-25, +24] search range
ME Module               Frequency (MHz)   CLB Slices     LUTs
Global Control                269.2             91         164
Processing Unit               341.5             38          67
SAD Line                      341.5            489         918
Comparator                    224.6            317         235
Vector Generator              552.7              6          10
Memory Manager                291.4            311         613
Search Area Selector          508.3            343         596
Block Selector                541.4             33          58
SAD Matrix                    143.7         19,083      30,513
Motion Estimator              123.4         19,194      30,948
Device: Virtex-II Pro VP70
The comparison with these related works, including Full Search and Full Search with 4:1 Pel Subsampling algorithms, is presented in Table 3. Table 3 presents the Pel Subsampling rate, the technology used, the operation frequency and the throughput. The throughput considers the number of HDTV 720p frames processed per second. Our architecture presents the second highest operation frequency, just below [13]; however, our throughput is about 120% higher than that of [13]. This is the highest throughput among the FPGA based architectures. The architecture presented in [11] can reach a higher throughput than ours; however, this result was expected since that architecture was designed in 0.18um standard cell technology.

Table 3. Comparative results for the [-16, +15] search range
Solution   Pel Subsampling   Technology              Freq. (MHz)   HDTV 720p (fps)
[9]              4:1         0.35 um                     50.0            8.75
[10]             4:1         0.35 um                     50.0           22.56
[11]             4:1         0.18 um                     83.3           63.58
[12]             No          Altera Stratix             103.8            5.15
[13]             No          Xilinx Virtex-II           191.0           13.75
[14]             No          Xilinx XCV3200e             76.1           20.98
Our              4:1         Xilinx Virtex-II Pro       123.4           54.10
7 Conclusions

This paper presented an FPGA-based hardware architecture for motion estimation using the Full Search algorithm with 4:1 Pel Subsampling and SAD as the distortion criterion. This architecture considers blocks of 16x16 samples and uses a search area of 64x64 samples, i.e., a search range of [-25, +24]. This solution was specifically designed to meet the requirements of standard definition television (SDTV) with 720x480 pixels per frame, focusing on solutions for the future Brazilian digital television broadcast system. The synthesis results indicate that the motion estimation architecture designed in this paper uses 30,948 LUTs of the target FPGA and is able to operate at a maximum frequency of 123.4 MHz. This operation frequency allows a processing rate of 35 SDTV frames per second. Comparisons with related works were also presented: our architecture has the highest throughput among the FPGA based solutions and the second highest throughput among all solutions.
References

1. International Telecommunication Union. ITU-T Recommendation H.262 (11/94): generic coding of moving pictures and associated audio information - part 2: video. [S.l.] (1994)
2. Joint Video Team of ITU-T and ISO/IEC JTC 1. Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264 or ISO/IEC 14496-10 AVC) (2003)
3. Kuhn, P.: Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation. Kluwer Academic Publishers, Dordrecht (1999)
4. Brazilian Communication Ministry, Brazilian digital TV system (2006), Available at: http://sbtvd.cpqd.com.br/
5. Lin, C., Leou, J.: An Adaptative Fast Full Search Motion Estimation Algorithm for H.264. In: IEEE International Symposium Circuits and Systems, ISCAS 2005, Kobe, Japan, pp. 1493–1496 (2005)
6. Zandonai, D., Bampi, S., Bergerman, M.: ME64 - A highly scalable hardware parallel architecture motion estimation in FPGA. In: 16th Symposium on Integrated Circuits and Systems Design, São Paulo, Brazil, pp. 93–98 (2003)
7. Fanucci, L., et al.: High-throughput, low complexity, parametrizable VLSI architecture for full search block matching algorithm for advanced multimedia applications. In: International Conference on Electronics, Circuits and Systems, ICECS 1999, Pafos, Cyprus, vol. 3, pp. 1479–1482 (1999)
8. Xilinx Inc.: Xilinx: The Programmable Logic Company (2006), Available at: www.xilinx.com
9. Huang, Y., et al.: An efficient and low power architecture design for motion estimation using global elimination algorithm. In: International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2002, Orlando, Florida, vol. 3, pp. 3120–3123 (2002)
10. Lee, K., et al.: QME: An efficient subsampling-based block matching algorithm for motion estimation. In: International Symposium on Circuits and Systems, ISCAS 2004, Vancouver, Canada, vol. 2, pp. 305–308 (2004)
11. Chin, H., et al.: A bandwidth efficient subsampling-based block matching architecture for motion estimation. In: Asia and South Pacific Design Automation Conference, ASPDAC 2005, Shanghai, China, vol. 2, pp. D/7–D/8 (2005) 12. Loukil, H., et al.: Hardware implementation of block matching algorithm with FPGA technology. In: 16th International Conference on Microelectronics, ICM 2004, Tunis, Tunisia, pp. 542–546 (2004) 13. Mohammadzadeh, M., Eshghi, M., Azadfar, M.: Parameterizable implementation of full search block matching algorithm using FPGA for real-time applications. In: Fifth International Caracas Conference on Devices, Circuits and Systems, ICCDCS 2004, Punta Cana, Dominican Republic, pp. 200–203 (2004) 14. Roma, N., Dias, T., Sousa, L.: Customisable core-based architectures for real-time motion estimation on FPGAs. In: Cheung, P.Y.K., Constantinides, G.A. (eds.) FPL 2003. LNCS, vol. 2778, pp. 745–754. Springer, Heidelberg (2003)
Fast Directional Image Completion

Chih-Wei Fang and Jenn-Jier James Lien

Robotics Laboratory, Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.
{nat, jjlien}@csie.ncku.edu.tw
http://robotics.csie.ncku.edu.tw
Abstract. We developed a fast image completion system that uses a multi-resolution approach to accelerate the convergence of the system. The down-sampling approach is used for the texture-eigenspace training process based on the multi-level background region information. The up-sampling approach is used for the image completion process to synthesize the replaced foreground region. To avoid a discontinuous texture structure, we developed directional and non-directional image completion to reconstruct the global geometric structure and to maintain the local detailed features of the replaced foreground region in the lower- and higher-resolution levels, respectively. In addition, the Hessian matrix decision value (HMDV) is generated to decide the priority order and direction of the synthesized patch in the replaced region. To avoid the rim effect in the synthesized result, the border of each patch, defined as an O-shaped pattern, is selected for matching comparison instead of using the entire patch. Finally, an additional texture refinement process guarantees a high-resolution result. Keywords: Texture Analysis, Texture Synthesis, Image Completion, Hessian Matrix, Eigenspace.
1 Introduction

Photographs sometimes include unwanted objects. After removing the unwanted foreground objects, holes are left in the photograph. Although many existing image completion techniques can fill those holes, there still exists a discontinuity problem in the texture structure between the newly filled-in foreground regions and the original background regions. One major factor causing this kind of texture discontinuity is the priority of the synthesis (or fill-in) order for each hole. Texture structure is the most important clue for judging completion performance in the general appearance of a photo, and edges are the most important components for constructing a complete texture structure. Once the texture structure is damaged, the discontinuities of the edges become obvious. In order to synthesize (or reconstruct) a complete texture structure, the authors in [4], [5], [16], [19] proposed texture synthesis approaches that start along the damaged edges to fill the hole, while the work in [3] divided the image into structure components and texture components and synthesized them separately. Bertalmio et al. [2], [3] expand the structure from the boundary, so the structure continuity can be maintained, particularly for long and thin removal regions. The work in [13] segmented the image into several regions and synthesized each region individually. Sun et al. [16] manually drew the
curve first, and then compared the matching patches along the curve. Moreover, the work in [5] is based on the inner product of the direction perpendicular to the pixel gradient and the normal vector of the border of the removal region; the result is then multiplied by the proportion of known information that the patch keeps in order to find the priority. The purpose is to find the starting point for filling in the matching patch found by exhaustive search. However, this approach is easily influenced by high-frequency components, such as noise or complicated texture components, because these high-frequency components easily obtain a higher synthesis order than the structure components. Therefore, it can mislead the priority and destroy the integrity of the structure. The other factor is the type of synthesis unit, whether it is pixel-based [1], [17], [18] or patch-based [8], [9], [11], [12], [14], [15] (or exemplar-based), which has been used most. The pixel-based approach is slower and its synthesis results tend to be fuzzy. The patch-based approach is faster but leaves obvious discontinuity flaws between neighboring patches. Efros and Freeman [8] used dynamic programming (DP) to find the minimum-error cut in the overlaps between two discontinuous neighboring patches. Drori et al. [6] mixed patches of different scales according to different frequencies of texture complexity; the synthesis result is good but the computational time is very slow. To improve the computational time of the image completion process, Freeman et al. [11] and Liang et al. [14] modified the similarity measure between patches by comparing only the border pixels of the patches instead of the entire patches. In addition, comparing the pixels of the entire patch makes it harder to connect neighboring patches, because the number of pixels inside the patch is larger than the number at the border of the patch. Considering the above existing problems, we develop a novel image completion system based on a patch-based multi-resolution approach. This approach not only accelerates the computational time and is capable of handling large removed regions, but also maintains both the texture structure and the detailed features. We develop directional and non-directional image completion to maintain the texture structure and the detailed features of the replaced foreground region, respectively. Moreover, to solve the priority problem of the synthesis order and avoid being affected by noise and complicated texture components, the Hessian matrix is employed to obtain a stable and correct priority of the synthesis order. Section 2 describes the texture analysis module we develop for analyzing and training on the input image. In Section 3, we apply the Hessian matrix to decide the synthesis order and the image completion method: either directional image completion to propagate the structure, or non-directional image completion to synthesize detailed features. Subsequently, texture refinement is used to recover the features lost in the training process. Section 4 presents the current experimental results by evaluating the time complexity of the training process and image completion, and by analyzing our developed approaches. Finally, Section 5 presents the conclusions.
2 Training Process Based on Background Region

This training process is similar to our previous work in [10]; here we describe it only briefly. An input image is given and is annotated as I0, as shown in Fig. 1.
[Fig. 1 flowchart: the input image I0 and inverse matte α0 are down-sampled from levels 0 to L (I1, I2, I3, ..., IL=4 and α1, ..., αL=4); O-shaped patterns are selected for training from the total M patches in the background regions of levels 1 to L; the eigenspace Ψ (eigenvector matrix E) is created; the M O-shaped patterns are projected onto the first N eigenvectors to obtain the corresponding weight vectors; the weight vectors are clustered using VQ.]
Fig. 1. Flowchart of the texture training (or analysis) process
The mask image, which manually labels the replaced foreground region, is called the inverse matte, α0, as shown in Fig. 1. This matte is a binary image having the same size as the input image. The pixels in the white regions, which are the known regions, are set to 1, while the pixels in the black regions, which are going to be removed and then synthesized based on the background information, are set to 0. In addition, the replaced regions can comprise many sub-regions, but they must contain the removable objects; they can exceed the boundaries of the removable objects, and they can be of any shape. However, too many or too large replaced regions will cause the quality of the synthesis result to become worse. Moreover, the known background regions serve as the source for the replaced regions. For the training process, we initially have the input image I0 and the corresponding inverse matte α0, and then we down-sample ↓ the original image I0 L times to obtain lower resolution images Ii and corresponding inverse mattes αi at level i, where i=1~L. The background regions, corresponding to the known regions with value 1 in the inverse matte, are extracted into several patches to be the training data. In order to reduce the computational time of the similarity measure in the synthesis process, principal component analysis (PCA) is employed to reduce the dimensions of the training data, and vector quantization (VQ) is adopted for clustering the projection weight vectors in the eigenspace so as to reduce the comparison time. Fig. 1 illustrates the system flowchart of the training process, extending from the input image to the weight vector of each patch.

2.1 Multi-resolution Preprocessing from Levels 0 to L

We apply the multi-resolution approach [20] to our system for three purposes. The first purpose is to avoid the situation where the computation fails to converge and
a poor initial result for the synthesized image. The second purpose is to have more training patches with various scales [7] and texture features. At different levels, the patches have different properties: the patches of the lower-resolution images (e.g., IL-1~IL) contain stronger structural information, while the patches of the higher-resolution images (e.g., I0~I1) have more detailed feature information. The third purpose is to reduce the computational time. The multi-resolution approach down-samples ↓ the input image I0 and the corresponding inverse matte α0 L times by a factor of 2 to obtain the input image Ii and inverse matte αi for the ith level, where i=1~L. The background known region is denoted as Bi:
B_i = I_i \alpha_i, \quad i = 0 \sim L    (1)

and preserved for the training data. The foreground unknown region Fi is denoted as:

F_i = I_i \bar{\alpha}_i, \quad i = 0 \sim L    (2)
and is going to be synthesized patch by patch by utilizing the information of the background Bi. In addition, the smallest image size at level L depends on the relation between the patch size and the foreground region at level L; that is, the patch size needs to be big enough to cover most of the foreground region. More details will be discussed later.

2.2 Create Eigenspace Ψ Based on O-shaped Patterns from Levels 1 to L

During the training process, one WpxHp-pixel (width x height = 15x15-pixel) search window shifts pixel by pixel from the top-left corner to the bottom-right corner in the background regions of images I1 to IL to extract training patch data. In total there are M WpxHp-pixel patches. The reason we do not include the image I0 at level 0 is based on the following empirical observations: (1) It is unnecessary to use all the patches of image I0 at level 0, i.e., 100000~300000 patches per 320x240-pixel image, because this would include many unnecessary patches and require a large amount of template matching operations. In addition, the training time of the vector quantization (VQ) increases exponentially with the number of clusters. It also increases the probability of matching an incorrect patch. (2) Image I0 at level 0 contains more noise, which affects the result of the PCA process and increases the mismatch probability. When we exclude the image I0 from the training process, the computational time reduces from 15 seconds to 2 seconds for a 320x240-pixel image. However, this approach causes the synthesized result to lose detailed feature information and decreases the high-resolution quality of the image completion result. So an additional texture refinement process at the 0th level (Section 3.2) is required in order to make the image completion result have the same high resolution as the input image. Furthermore, including all patch elements in the training data for further matching comparison may result in a discontinuous structure of the patch and will certainly increase the training time. In addition, during the synthesis process, the similarity
Fig. 2. Acquire the four borders with thickness ω (ω=2) pixels for each search patch. Wp (Wp=15) is the width of the patch, and Hp (Hp=15) is the height of the patch. There are K pixels in each O-shaped pattern, where K = 2ω(Wp+Hp) – 4ω².
\begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1M} \\ P_{21} & P_{22} & \cdots & P_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ P_{K1} & P_{K2} & \cdots & P_{KM} \end{bmatrix} \;\xRightarrow{\text{PCA}}\; \begin{bmatrix} E_{11} & E_{12} & \cdots & E_{1K} \\ E_{21} & E_{22} & \ddots & E_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ E_{K1} & E_{K2} & \cdots & E_{KK} \end{bmatrix} \;\xRightarrow{98\%}\; \Psi = \begin{bmatrix} E_{11} & E_{12} & \cdots & E_{1K} \\ E_{21} & E_{22} & \ddots & E_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ E_{N1} & E_{N2} & \cdots & E_{NK} \end{bmatrix}
Fig. 3. Total M O-shaped patterns can be obtained from levels L-1 to 0. Each pattern vector has K elements (see Fig. 2). During training process, PCA is used to transform the original K×M matrix of all training pattern vectors to a N×K eigenvector matrix in the eigenspace, where the first N eigenvectors corresponding to 98% energy of total eigenvalues and N < K << M.
Fig. 4. (a) The first several eigenvectors of the O-shaped patterns control the global geometrical structure of the photographs. (b) The middle eigenvectors of the O-shaped patterns control the local detailed features. (c) The last few eigenvectors of the O-shaped patterns control some noises.
measure that considers the whole contents of the patch will generally produce unsatisfactory results, and it may cause the rim effect to become distinct. Thus, this work adopts the O-shaped pattern instead of the whole patch for the training data, as shown in Fig. 2.
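The O-shaped pattern of a patch can be extracted as in the following sketch (our illustration in Python/NumPy, not the authors' code); with Wp = Hp = 15 and ω = 2 it yields K = 2ω(Wp+Hp) − 4ω² = 104 border pixels.

    import numpy as np

    def o_shaped_pattern(patch, w=2):
        # Return the border pixels of 'patch' with thickness w as a 1-D vector.
        mask = np.zeros(patch.shape, dtype=bool)
        mask[:w, :] = mask[-w:, :] = True    # top and bottom borders
        mask[:, :w] = mask[:, -w:] = True    # left and right borders
        return patch[mask]

    print(o_shaped_pattern(np.zeros((15, 15))).size)   # 104 pixels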
PCA is applied to the entire set of O-shaped training patterns to obtain the eigenspace Ψ, as shown in Fig. 3. Two important properties of PCA are employed to obtain the best performance for the image completion result: (1) The PCA process can reduce the dimensions of the data representation from K dimensions to N dimensions, where N < K, without losing the significant characteristics of the original data, as shown in Fig. 3. (2) The PCA process recombines the features of the O-shaped pattern: after sorting the eigenvalues with their corresponding eigenvectors, we found that the first several eigenvectors, as shown in Fig. 4, control the global geometrical structure of the photographs, while the middle eigenvectors control the local detailed features, and some noises are controlled by the last few eigenvectors. This study uses only the first N eigenvectors, whose corresponding eigenvalues occupy 98% of the total eigenvalues, for the comparison used to identify the suitable matching patch.

2.3 Create Corresponding Weight Vectors by Projecting O-shaped Patterns onto Eigenspace Ψ

Each K-dimensional O-shaped pattern vector P (P=[P1…PK]T) is projected onto the N-dimensional eigenspace, which consists of N eigenvectors (E1i…ENi, where i=1~K), as:
\begin{bmatrix} E_{11} & E_{12} & \cdots & E_{1K} \\ E_{21} & E_{22} & \ddots & E_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ E_{N1} & E_{N2} & \cdots & E_{NK} \end{bmatrix} \begin{bmatrix} P_1 \\ P_2 \\ \vdots \\ P_K \end{bmatrix} = \begin{bmatrix} W_1 \\ W_2 \\ \vdots \\ W_N \end{bmatrix}    (3)
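In software, Equation (3) is a single matrix-vector product, e.g. (our sketch, assuming the eigenvectors are stored as the rows of an N x K matrix E; mean-centring is usual for PCA but is not spelled out in Eq. (3), so it is treated as optional here):

    import numpy as np

    def project_pattern(E, p, mean=None):
        # Eq. (3): project the K-dimensional O-shaped pattern vector p onto
        # the N-dimensional eigenspace spanned by the rows of E (N x K).
        if mean is not None:
            p = p - mean
        return E @ p   # the N-dimensional weight vector W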
The corresponding N-dimensional weight vector W (W=[W1…WN]T) for each pattern vector can thus be obtained to represent the original patch. Therefore, there are a total of M N-dimensional weight vectors in the training database.

2.4 Cluster Weight Vectors Using Vector Quantization

For the similarity measure during the synthesis process, it is extremely time-consuming to compare the projection weight vector of each unknown pattern with those of all training patterns. To reduce the comparison time, vector quantization (VQ) is applied to cluster the training patterns into several clusters, e.g., c (c=32) clusters. Thus, the comparison time is reduced from O(M) to O(√M). Tree-structured vector quantization (TSVQ) [17] has the potential to reduce the computation cost further. However, there is a risk in applying this technique in the present case: any misclassification at a parent node will affect the final classification result. Therefore, in our work, the closest cluster is first identified directly according to its mean vector, and this cluster is then searched exhaustively to locate the best matching vector.
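The two-stage search described above (pick the nearest cluster by its mean, then search that cluster exhaustively) can be sketched as follows (our illustration):

    import numpy as np

    def find_best_match(w, cluster_means, clusters):
        # Two-stage search: nearest cluster by its mean vector, then an
        # exhaustive search inside that cluster only.
        c = int(np.argmin(np.linalg.norm(cluster_means - w, axis=1)))
        members = clusters[c]                      # (M_c, N) weight vectors
        best = int(np.argmin(np.linalg.norm(members - w, axis=1)))
        return c, best                             # cluster index, index inside it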
3 Directional and Non-directional Image Completion

The synthesis process of the image completion starts from image IL at level L and proceeds to image I0 at level 0, as shown in Fig. 5. Since the patches of the lower-resolution images contain more texture structure information and the patches of the higher-resolution images contain more detailed feature information, different synthesis processes are applied at different resolutions. In addition, the pixel gray values at the lower level i can serve as the initial pixel gray values at the higher level i-1 for the corresponding foreground region.
[Fig. 5 flowchart stages: Level L: initialize synthesis values. Level L-1: directional and non-directional IC, starting from the patch with the larger Hessian matrix decision value (HMDV); if the HMDV is at or above the threshold, search along the direction of the 1st eigenvector v1 of the Hessian matrix (directional), else perform texture synthesis (non-directional). Levels L-2 to 1: non-directional IC, starting from the patch with the larger HMDV, followed by texture synthesis. Level 0: texture refinement, starting from the patch with the larger HMDV, texture synthesis, and finding the best matching patch from image I0. IC: Image Completion.]
Fig. 5. Flowchart of the image completion from level L to level 0. (a) Fi is the removed foreground region and Bi is the background region at level i. Each 2x2 Hessian matrix G is constructed based on a 7x7-pixel decision window. (b) The decision window goes along the boundary of the foreground region to decide the priority of the synthesis order. (c) Search along the direction of the 1st eigenvector v1. (d) The matching patch is found at the blue point, but it does not contain high-resolution content. Therefore, the blue point is mapped to the green point at level 0, and then we search for the best matching patch near the green point.
3.1 Level L: Initialize Synthesis Values

For image IL at level L, most pixels of the foreground region are located inside the O-shaped pattern. The search window (or patch) scans from the top-left corner to the bottom-right corner of the foreground region; usually there is only one search window for image IL. Because of the missing data problem for the PCA process, the O-shaped pattern of each search window is not projected onto the eigenspace to produce the corresponding weight vector. Instead, the O-shaped pattern of each search window is directly compared with the pixel gray values of each O-shaped pattern over the entire set of training patches. The similarity measure uses the Euclidean distance. Then the patch of the best matching pattern in the training patches is directly pasted onto the
corresponding location of the foreground region. The computational time of the comparison process at level L is very short because there is usually only one search window, so the PCA projection is not applied here. After synthesizing the foreground regions, the composite image Ci at level i is defined as:
C_i = I_i \alpha_i + I_i' \bar{\alpha}_i, \quad i = 0 \sim L    (4)

where Ii is the original image and Ii' is the synthesized image. Subsequently, the up-sampling ↑ process is applied to the composite image Ci from level i to level i-1. The following process is then applied to update the image Ii-1 at level i-1:

I_{i-1} = I_{i-1} \alpha_{i-1} + (C_i \uparrow) \bar{\alpha}_{i-1}, \quad i = 0 \sim L-1    (5)
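Equations (4) and (5) amount to compositing the synthesized pixels into the known background and propagating the result one level up, e.g. (our sketch; pixel-replication up-sampling is assumed here, since the paper does not state the interpolation used):

    import numpy as np

    def composite(I, I_synth, alpha):
        # Eq. (4): keep the known background, take synthesized pixels elsewhere.
        return I * alpha + I_synth * (1.0 - alpha)

    def init_next_level(I_prev, alpha_prev, C):
        # Eq. (5): up-sample the composite C of level i (here by pixel
        # replication) and use it to initialize the foreground of level i-1.
        C_up = np.kron(C, np.ones((2, 2)))[: I_prev.shape[0], : I_prev.shape[1]]
        return I_prev * alpha_prev + C_up * (1.0 - alpha_prev)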
That is, each pixel gray value of the foreground region of image Ii-1 can be assigned an initial value, which is obtained from the composite image Ci at level i.

3.2 Levels L-1 to 0: Hessian Matrix Decision for Directional and Non-Directional Image Completion

Synthesizing the foreground regions must maintain the continuity of the texture structure extending from the background regions. Thus, the priority of the synthesis order becomes very important. We apply the gradient-directional property of the Hessian matrix to play the role of the priority decision maker. Each 2x2 Hessian matrix G is constructed from a 7x7-pixel decision window W as (see also Fig. 5(b)):
G = \begin{bmatrix} \sum_{(x,y)\in W} \dfrac{\partial^2 I(x,y)}{\partial x^2} & \sum_{(x,y)\in W} \dfrac{\partial^2 I(x,y)}{\partial x \partial y} \\[2ex] \sum_{(x,y)\in W} \dfrac{\partial^2 I(x,y)}{\partial x \partial y} & \sum_{(x,y)\in W} \dfrac{\partial^2 I(x,y)}{\partial y^2} \end{bmatrix}    (6)
where I(x,y) is the gray value at location (x,y) belonging to the decision window W. The centroid of the decision window goes along the boundary of the foreground region, and both eigenvalues of the Hessian matrix are used to decide the priority of the synthesis order, as shown in Fig. 5(a). That is, assuming the eigenvalues λ1 ≥ λ2, the Hessian matrix decision value (HMDV), V, is defined as:

V = \frac{\lambda_1 + \varepsilon}{\lambda_2 + \varepsilon}    (7)
where ε is a very small value (ε=0.001) used to avoid the denominator becoming zero. The HMDV distinguishes three conditions: (1) A high HMDV value V (V >> 1.0) means that the decision window is directional and contains a strong edge. (2) If the HMDV
Fig. 6. Two groups of images: in each group, the left image is the level L-1 image and the right image is the distribution of HMDV. The red peak is the starting point for filling in the matching patch.
value V is close to or equal to 1, there are two sub-conditions: (2.1) if both λ1 and λ2 have high values, the decision window contains more detailed features or high-frequency noise; (2.2) if both λ1 and λ2 have low values, the patch of the decision window is smooth. Therefore, when the HMDV value V of a decision window is greater than or equal to the predefined threshold value, the search patch of the corresponding centroid has a higher priority in the synthesis order, and a higher HMDV value means a higher priority. The directional image completion is then applied to this search patch, as shown in Fig. 6. Conversely, if the HMDV value V of the decision window is smaller than the predefined threshold, it will be synthesized by non-directional image completion after the directional image completion.

Level L-1: Directional Image Completion for Texture Structure. To synthesize the foreground region at level L-1, the centroid of the decision window initially goes along the boundary of the foreground region and the HMDV value V of each decision window is recorded. If the HMDV value V of a decision window is greater than or equal to the threshold, the corresponding search patch contains stronger edge components. After sorting the HMDV values that are greater than or equal to the threshold, the synthesis process starts from the search patch whose decision window has the maximum HMDV value. The search patch then scans along the direction of the eigenvector v1 corresponding to the eigenvalue λ1, or the opposite direction, depending on the location of the background region, as shown in Fig. 5(c). The direction of eigenvector v1 is the tangent direction of the edge. Subsequently, the patch of the best matching pattern in the background region is directly pasted onto the location of the search patch. Again, the centroid of the decision window goes along the inner boundary of the replaced foreground patch to calculate the HMDV values. The same thresholding, sorting (if necessary), matching (or comparison), pasting and HMDV calculation steps are computed iteratively until no HMDV value is greater than or equal to the threshold, or until the patches of the entire foreground region have been updated. Since this comparison process involves only a few training patches for each search patch, the computational time of the similarity measure is small; thus, the similarity measure is based on the gray values instead of the projection weights. The above procedure for structure synthesis is defined as the directional image completion. For the remaining decision windows, with HMDV values smaller than the threshold, the texture synthesis procedure [10], [14] is defined as the non-directional image completion and is described in the next section.
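Equations (6) and (7) and the priority test can be sketched as follows (our illustration with NumPy; np.gradient is used for the discrete second derivatives, whereas the paper does not specify the operators, and the window centre must lie at least three pixels from the image border):

    import numpy as np

    def hmdv(I, cy, cx, half=3, eps=1e-3):
        # Eqs. (6)-(7) for the 7x7 decision window centred at (cy, cx).
        W = I[cy - half:cy + half + 1, cx - half:cx + half + 1].astype(float)
        gy, gx = np.gradient(W)                  # first derivatives
        gyy, gyx = np.gradient(gy)               # second derivatives
        gxy, gxx = np.gradient(gx)
        G = np.array([[gxx.sum(), gxy.sum()],
                      [gxy.sum(), gyy.sum()]])   # Eq. (6)
        evals, evecs = np.linalg.eigh(G)         # eigenvalues in ascending order
        lam2, lam1 = evals                       # so lambda1 >= lambda2
        V = (lam1 + eps) / (lam2 + eps)          # Eq. (7)
        v1 = evecs[:, 1]                         # eigenvector of lambda1
        return V, v1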
Levels L-1 to 1: Non-Directional Image Completion for Detailed Features. For image IL-1 at level L-1, after the directional image completion process, the remaining search patches located at the foreground boundary are synthesized in order of decreasing HMDV in the foreground region. Initially, the O-shaped pattern of each search patch is projected onto the eigenspace Ψ to obtain the corresponding weight vector. Based on the Euclidean-distance similarity measure between the weight vector of this search pattern and those of the cluster centers, this search pattern is classified to the nearest cluster. This search pattern is then compared with all patterns within the same cluster to find the best matching pattern. The patch corresponding to the best matching pattern is directly pasted onto the location of the search patch. This texture synthesis process iterates until all remaining search patches are updated. Since the directional image completion is able to construct the texture structure for image IL-1 at level L-1, we can concentrate on the enhancement of detailed features for the remaining texture synthesis process from level L-2 to level 1. A procedure similar to that of image IL-1 at level L-1 is applied, except that the search patch does not need to scan along the eigenvector direction. The priority of the synthesis order still relies on the HMDV value of each decision window, and the similarity measure is the same as in the non-directional image completion in the eigenspace. Thus, the images from levels L-2 to 1 must use the HMDV to determine the order of image completion and avoid destroying the edges already created at level L-1.

Level 0: Texture Refinement. Because of the above-mentioned considerations in the training and synthesis processes, the patches at the highest-resolution level 0 are not included in the training database, so the most detailed texture information would be lost in the final synthesized result. Therefore, at level 0, when the matching patch is found from the training database, we do not paste the matching patch directly. Instead, we search for more detailed features in the patches neighboring the position at the highest-resolution level that corresponds to the position of the matching patch, as shown in Fig. 5(d). This texture refinement process makes the removed region (foreground region) and the reserved region (background region) have consistent resolution.
4 Experimental Results

In acquiring the experimental statistics presented below, each process was performed ten times and the average time was calculated. The experiments were performed on a personal computer with an Intel Core 2 Duo E6300 (1.86 GHz) processor. The computational times from the training process to the image completion for various kinds of images are shown in Fig. 7 and Fig. 8. The information for Fig. 7 and Fig. 8 is given in Table 1, together with the processing times of other existing methods. In addition, the exemplar-based image inpainting of [5] is unable to converge in the case of a larger removal region. When the removal region is not narrow and long, the
Fig. 7. The image size is 392 x 364 pixels. The ratio of the removal region is 7.2%. (a) Shows our reconstructed images Ci from the lowest level L to the original level 0, i=0~4. Ci is up-sampled ↑ and then serves as the initial value of Ii-1 for searching the matching patch to fill the removed (or replaced) region of Ci-1. (b) The result of exemplar-based image inpainting by [5]. (c) The result of image inpainting by [2].

Table 1. Image size and ratio of removal region for each test image, and comparison of the processing times with other existing methods. Units of time: seconds (s).

Image                   Image Size (pixel)   Ratio of removal region   Our method   Exemplar-based image inpainting by [5]   Image inpainting by [2]
Windmill  (Fig. 7)          392 × 364                7.2%                  11                     104                                  2
Slope     (Fig. 8(a))       213 × 284                8.7%                   4                      46                                  1
Diving    (Fig. 8(b))       206 × 308               12.6%                   2                      38                                  1
Mother    (Fig. 8(c))       538 × 403               25.5%                  57                     724                                  6
Wall      (Fig. 8(d))       400 × 400               28.4%                  35                     420                                  4
Mountain  (Fig. 8(e))       392 × 294                5.3%                   7                      60                                  1
result of image inpainting in [2], which expands the structure from the boundary, tends to be blurred, without obvious structure and edges.
Fig. 8. Rows (a)-(e): the images in the first column are the input images; the second column shows the results of our method; the third column shows the results of the exemplar-based image inpainting of [5]; the fourth column shows the results of the image inpainting of [2].
5 Conclusions

The multi-resolution approach is applied to image completion. The down-sampling approach is used for the analysis process, such as compiling the training data, and the up-sampling approach is used for the synthesis process, such as providing the initial values of level i-1. This approach enables the system to handle large removed regions and to converge quickly. We take only the border of each patch (the O-shaped pattern) for training. In addition, the patches at the highest-resolution level are not included in the training data, which speeds up the training process, reduces the impact of noise on PCA and improves the matching results. This training process further reduces the time spent comparing and searching patches. Subsequently, the Hessian matrix is used to decide the synthesis order; it is more stable than the existing methods based on differentiation [5] for patches with more noise or detailed features. During the synthesis process, the developed HMDV is applied to decide the synthesis order and direction in order to propagate the structure continuity from the background region. For the directional image completion, for each decision window with a high HMDV, we search for the matching patch along the direction of the eigenvector of the Hessian matrix. This directional process decreases the time of exhaustive search and yields better structure continuity between the background scene and the replaced foreground region. Finally, we use texture refinement to recover the detailed features lost during the training process.
References 1. Ashikhmin, M.: Synthesizing Natural Textures. ACM Symposium Interactive 3D Graphics, 217–226 (2001) 2. Bertalmio, M., Sapiro, G., Ballester, C., Caselles, V.: Image Inpainting. ACM SIGGRAPH, 417–424 (2000) 3. Bertalmio, M., Vese, L., Sapiro, G., Osher, S.: Simultaneous Structure and Texture Image Inpainting. IEEE Trans. on Image Processing 12(8), 882–889 (2003) 4. Chan, T., Shen, J.: Non-Texture Inpainting by Curvature-Driven Diffusions (CDD). Jour. of Visual Communication and Image Representation 12(4), 436–449 (2001) 5. Criminisi, A., Perez, P., Toyama, K.: Region Filling and Object Removal by ExemplarBased Image Inpainting. IEEE Trans. on Image Processing 13(9), 1200–1212 (2004) 6. Drori, I., Cohen-Or, D., Yeshurun, H.: Fragment-Based Image Completion. ACM SIGGRAPH, 303–312 (2003) 7. De Bonet, J.S.: Multiresolution Sampling Procedure for Analysis and Synthesis of Texture Images. ACM SIGGRAPH, 361–368 (1997) 8. Efros, A.A., Freeman, W.T.: Image Quilting for Texture Synthesis and Transfer. ACM SIGGRAPH, 341–346 (2001) 9. Efros, A.A., Leung, T.K.: Texture Synthesis by Non-parametric Sampling. International Conf. on Computer Vision, 1033–1038 (1999) 10. Fang, C.-W., Lien, J.-J.: Fast Image Replacement Using Multi-resolution Approach. Asian Conference on Computer Vision, 509–520 (2006) 11. Freeman, W.T., Jones, T.R., Pasztor, E.C.: Example-Based Super-Resolution. IEEE Computer Graphics and Applications 22(2), 56–65 (2002)
Fast Directional Image Completion
61
12. Igehy, H., Pereira, L.: Image Replacement through Texture Synthesis. IEEE International Conf. on Image Processing 3, 186–189 (1997) 13. Jia, J., Tang, C.K.: Image Repairing: Robust Image Synthesis by Adaptive ND Tensor Voting. IEEE Conf. on Computer Vision and Pattern Recognition 1, 643–650 (2003) 14. Liang, L., Liu, C., Xu, Y., Guo, B., Shum, H.-Y.: Real-Time Texture Synthesis using Patch-Based Sampling. ACM Trans. on Graphics 20(3), 127–150 (2001) 15. Liu, Y., Lin, W.-C., Hays, J.: Near-Regular Texture Analysis and Manipulation. ACM SIGGRAPH, 368–376 (2004) 16. Sun, J., Yuan, L., Jia, J., Shum, H.-Y.: Image Completion with Structure Propagation. ACM SIGGRAPH, 861–868 (2005) 17. Wei, L.-Y., Levoy, M.: Fast Texture Synthesis using Tree-structured Vector Quantization. ACM SIGGRAPH, 479–488 (2000) 18. Wexler, Y., Shechtman, E., Irani, M.: Space-Time Video Completion. IEEE Conf. on Computer Vision and Pattern Recognition 1, 120–127 (2004) 19. Wu, Q., Yu, Y.: Feature Matching and Deformation for Texture Synthesis. ACM SIGGRAPH, 362–365 (2004) 20. Yamauchi, H., Haber, J., Seidel, H.-P.: Image Restoration using Multiresolution Texture Synthesis and Image Inpainting. Computer Graphics International, 120–125 (2003)
Out-of-Order Execution for Avoiding Head-of-Line Blocking in Remote 3D Graphics John Stavrakakis and Masahiro Takastuka 1
ViSLAB, Building J12 The School of IT, The University of Sydney, Australia 2 National ICT Australia, Bay 15 Locomotive Workshop, Australian Technology Park, Eveleigh NSW, Australia
[email protected],
[email protected]
Abstract. Remote 3D graphics can become both process and network intensive. The Head-of-Line Blocking(HOLB) problem exists for an ordered stream protocol such as TCP. It withholds any available data from the application until the proper ordered segment arrives. The HOLB problem will cause the processor to have unnecessary idle time and non-uniform load patterns. In this paper we evaluate how the performance of an immediate mode remote 3D graphics system is affected by the HOLB and how the out-of-order execution can improve the performance. Keyword: Distributed rendering, Network graphics, Load balancing.
1 Introduction Interactive multimedia applications over a network demand both realtime delivery and excellent response time for a high quality end user experience. Networked applications using 3D graphics have difficulties ensuring this level of quality is maintained. The biggest problem is the sheer volume of network traffic they would generate. For audio and video multimedia, this is greatly reduced through appropriate down-sampling and discardment of perceptually unimportant data. In addition, they are able to exploit lossy network protocols that can continue without all the data[1]. This scenario is contrast to that of 3D graphics, as the the majority of data be preserved correctly which would otherwise cause severe artifacts to appear in the rendered images. Specifically, the complexity of graphics data creates greater network traffic and thus making it difficult to maintain desirable rendering frame rates. Networked 3D graphics has been well researched for high network utilisation, compression [2][3] and several techniques in reducing computational load [4]. Despite the attention, another bottleneck existing in the network graphics systems occurs between the passing of data from the network layer to the rendering engine. As the data is received from the network in the form of segments (a one-to-one correspondence to a packet sent/delivered), the segments are held in buffers until reassembly of the application data can take place. Following the reassembly of fragments to a segment, the segment will need to meet an ordering requirement prior to being passed on for rendering. Such a process is typically handled between decoupled networking and rendering modules within the system. This problem is also known as Head-of-Line Blocking, and exists in the Transmission Control Protocol[5] (TCP) protocol. It occurs when a TCP segment is lost and a D. Mery and L. Rueda (Eds.): PSIVT 2007, LNCS 4872, pp. 62–74, 2007. c Springer-Verlag Berlin Heidelberg 2007
Out-of-Order Execution for Avoiding Head-of-Line Blocking
63
subsequent TCP segment arrives out of order. The subsequent segment is held until the first TCP segment is retransmitted and arrives at the receiver[6]. This is not a problem for most software using TCP, as the requirement is not realtime and operation order is essential. Immediate mode graphics works differently; each command is simply defining the rendered scene, this requires no order until the drawing takes place. As such, 3D graphics is able to avoid the HOLB problem as its operations need not execute in order. The advantage of doing so allows the the graphics processor to avoid idling when data is available to execute. This paper investigates the avoidance of the HOLB for a subset of applications using immediate mode 3D graphics. The following section will briefly introduce motivation for remote 3D graphics and currently available systems. Section 3 will detail the out of order nature in 3D graphics and its theoretical impact. We follow with experimental methods and results to address HOLB, and finally conclude with a brief summary of our findings and future work.
2 Remote 3D Visualisation Systems Remote visualisation allows users to interact with graphics data over a network. There is a need for remote visualisation as it may be more convenient by location, more efficient by compute power to process data remotely, or to aid the infrastructure for remote collaboration such systems include AVS[7], Collaborative VisAD[8] and Iris Explorer[9]. Following our previous work, our concern lay within network graphics technology, which is transparent to the application. In order to enable existing 3D applications to be made available in this fashion, there are two approaches: modify the application source to internalise the network component; or, extract the graphics from the application transparently, and then send it over the network. Example of systems belonging to the first approach can be found in a survey paper by Brodlie[10]. Advantages and disadvantages of these approaches were discussed in our previous work[11]. For reasons beyond the scope of this paper we look at the latter method. Known methods for achieving this remote graphics distribution are typically done using OpenGL[12] an immediate mode graphics library. Immediate mode graphics asks the application to hold the graphics data and (re)transmit the data to the rendering engine to render the 3D graphics into a 2D image(frame). This type of remote rendering method is ideal for highly dynamic scenes where no simple predefining of the entities can be described or updated efficiently. Such examples include the rendering of large animated models[13] and also scale with fluid simulation[14]. GLX. A standard technique for remote display of 3D applications is GLX [15], the “OpenGL Extension to the X Window System”. The X Window System[16] (often referred to as X) was developed to provide a network transparent user interface with rendering on a remote server. The X protocol is a client-server protocol in which applications are run on the client machine and display requests are sent to a server. This design allows application data and processing to be performed locally with window management and drawing to be handled transparently at the server. It is highly portable
64
J. Stavrakakis and M. Takastuka
Fig. 1. GLX is composed of a client-side API, a protocol for GL command stream encoding, and an X server extension (all components shown in orange). The application resides on the client machine and the display is connected to the server (both indicated in light blue).
because any X client can connect to any X server, regardless of platform or operating system. Moreover, since all communication is handled by the client-side X library (Xlib) and the X server, applications do not have to be network aware. GLX enables OpenGL applications to draw to windows provided by the X Window System. It is comprised of an API, an X protocol extension, and an X server extension. When using GLX for remote rendering, GL commands are encoded by the client-side API and sent to the X server within GLX packets. These commands are decoded by the X server and submitted to the OpenGL driver for rendering on graphics hardware at the server. Importantly, GLX provides a network transparent means of remote rendering to any X server that supports the extension. It also specifies a standard encoding for GL commands. Figure 1 illustrates GLX. An application that uses the GLX API can send GL render requests to a remote X server that supports the GLX server extension. GL commands are encoded in the GLX packet, which is itself inserted into an X packet. Any number of clients can connect to an X server, but a client will only ever connect to a single X server. GLX is limited because of these characteristics. Rendering is always necessarily server-side and it cannot support GL command streaming to multiple remote displays. However, GLX is important because it establishes the fundamental capabilities required for OpenGL command streaming for remote rendering. Chromium. Chromium is a well established and widely used distributed rendering system based on another technology called WireGL [17]. One of the major advantages of Chromium is that it enables users to construct a high-performance rendering system, which is capable of large scale complex rendering. Moreover, it can drive a large multi-display system to display high-resolution images. However, it was not designed to support sharing of 3D graphics following a dynamic producer/subscriber model. Figure 2 illustrates Chromium’s distribution model. It is general node-based model for stream processing. A node accepts one or more OpenGL command stream (GL stream) as input and outputs a GL stream to one or more other nodes. Each node contains one or more Stream Processing Units (SPUs) that modify the GL stream.
Out-of-Order Execution for Avoiding Head-of-Line Blocking
65
Application Chromium Client Chromium SPU
Chromium protocol Chromium Node Chromium SPU
client
node 0 ...
node n Chromium Node Chromium SPU
Chromium Server Chromium SPU
node 0
OpenGL
...
Graphics hardware Display server node n
Fig. 2. Chromium has a flexible configuration system that arranges nodes in a directed acyclic graph (DAG). Each node can take input from, and send output to, multiple nodes. A client node takes GL commands from an application and creates a GL command stream. A server node takes GL commands from another node (and usually renders the stream on local hardware). Because Chromium is usually used for remote display, this diagram shows rendering and display at a server node. However, it is important to note that rendering (and display) can occur at the any node.
Rendering or display can occur at any node in the graph. This depends entirely on whether the SPUs on the node perform rendering or display. One characteristic of Chromium that is not illustrated in the figure is that configuration of the graph is centralized and set at system initialization. This is suitable for dedicated clusters (with fixed, single-purpose resources), but not ideal for grid computing (with heterogeneous resources that are dynamically added and removed, and also available for other services). Chromium follows the OpenGL Stream Codec (GLS)[18] to encode GL commands. GLS defines methods to encode and to decode streams of 8-bit values that describe sequences of GL commands invoked by an application. Chromium, however, employs its optimized protocol to pack all opcodes into a single byte. As the underlying delivery mechanisms of both Chromium and GLX rely on a TCPlike streaming protocols. The head of line problem becomes a greater issue when the amount of data increases, hence causing more network traffic. This would result in data being withheld for longer periods due to segment losses. Such segment losses are illustrated by figures 3(a)-3(c). In figure 3(a), the system does not experience any packet loss. It must wait for the entire segment to arrive before it can execute it. Moreover, the processing of that segment can overlap with the retrieval
66
J. Stavrakakis and M. Takastuka Network Transmission Time
Segment #
Execution Time
time
Network Transmission Time
Network Transmission Time
Execution Time
Execution Time
Segment #
Segment #
(a) No packet loss.
transmission failed
time
(b) Packet loss with ordering restraint.
transmission failed
time
(c) Packet loss with no ordering restraint.
Fig. 3. Figure 3(a) shows the system where no packet loss occurs, the system must wait for each packet to transfer and it can overlap processing intermediately. Figure 3(b) shows what will happen in an ordered sequence, the delay of execution in one segment will cause the processing of other segments to be delayed. Finally, figure 3(c) shows the unordered model, allowing any ready segments to be executed. The dark region denotes the time that processing was unnecessarily deferred to.
of the next segment. Figure 3(b) demonstrates what will happen when a segment fails to arrive in an ordered sequence. The execution delay of that single segment will forbid the processing of other segments until the first it has completed execution. Finally, figure 3(c) shows the effect of packet loss when no ordering is imposed on the execution of each segment. As a segment is received the only barrier to prevent its execution, is contention for graphics processing resources. Note the dark region denotes the time that processing was unnecessarily deferred to.
3 Out of Order Execution An immediate mode graphics library can execute most commands out of order purely because, within rendering a frame, the commands being sent are defining the scene. The actual rendering of the final image requires all data to be present, thus requiring no specific ordering to define it in. This implies that synchronisation must at least occur at the frame boundaries, and was also found true for parallel graphics systems in order for the rendering to to maintain ordered consistency[19]. Data can be defined in this way due to the rasterisation process. In the pipeline processing view of the data, it is treated independently and is typically processed in parallel with one another until a later stage. It is only until the latter stage in the render-
Out-of-Order Execution for Avoiding Head-of-Line Blocking
67
ing engine, which would require intermediate results to be processed in order to satisfy dependencies between them. An example of such is depth sorting. Unfortunately, not all commands can be executed out of order. The transformations, colours and textures that define an object need to be specified prior to executing the commands to define the object. This limits the applicability of out of order execution to either: components within an object sharing the same state, or whole objects that can be defined in with state very easily (discussed later). We have chosen to use Lumino[20], a distributed rendering framework to transmit and execute OpenGL commands over a network. In a typical OpenGL application there are commands to define the beginning and end of a sequence of polygon data, these are glBegin and glEnd respectively. Within this Begin/End clause the expected data can take several formats, they can be points, lines, triangles or quadrilaterals1. No one command defines a polygon, rather we identify the sequence of commands as a component. The component is independent to any other component. As a polygon it may contain applicable state information such as a normal, color and texture coordinates. It is independent to other components since there are no shared vertices and any differing state information is available. For example in defining a triangle mesh, the receiving end would simply obtain the segment containing a whole number of triangles, and immediately pass it on to the graphics processor. This is advantageous as the event of losing a segment, which contains other triangles, will not uphold the next arriving segment of triangles to be executed. It is also important to note that any transformations, depth ordering, lighting calculations etc. can be done during the reception of the remaining data of that frame rather than when the ordered segment has been delivered. The alternative to OOO contiguous blocks are independent blocks that can be executed independently with respect to one another. This case would appear when a burst of packets for an object are missed, but segments are available and hold enough state information to allow execution. This situation is quite complicated and is not considered in this study at this time. The network communication was implemented as a reliable ACK-based UDP protocol. The advantage of doing so allows us to control the packet loss probabilities, segment sizes, window size (for flow control) and fine control over command alignment in the data packing of segments. Additional information that was packed in the header was a byte that indicates the Out-of-Order(OOO) sequence number. This number increments when switching between applicable OOO blocks and non-OOO blocks. Alternatively, the OOO segments could be determined by the receiver alone. However, it would be difficult to detect the appropriate OOO regions if segments are dropped and arrive out of order. By using a sender side method, the barriers between OOO and non-OOO sections are understood directly. Let us consider a 3D model viewer program. For every frame that is rendered, the model is retransmitted from the source location be it system memory or file to the graphics processor. The primary geometric data will be included in Begin/End clauses, the size of these clauses scales with the geometric complexity of the data. For example, 1
We isolate the study to independent components. GL LINE STRIP, GL TRIANGLE STRIP and GL QUAD STRIP cannot be executed out of order simply because there needs to be a reference point.
68
J. Stavrakakis and M. Takastuka
a reduced Stanford bunny model[21] has over 17,000 vertices. For the application to display the full model, all vertices are sent via begin/end clauses. In one frame of rendering commands the size of the geometric data clearly dominates the remaining data (such as transformation matrices, lighting, miscellaneous items). The vertices alone account for: 17,000 * (12+2) = 232KB. Adding a normal for each triangle incurs an additional 77KB. The bandwidth required to retransmit this every frame at 24 frames per second is 7.3MB/s. As uncompressed data, there are circumstances by which there is not enough processing capacity to apply compression like that of Deering[22].
4 Experiment The aim of the experiment is to observe the cumulative sum of waiting times from segment arrival to segment execution. We implemented this functionality as a plugin to Lumino using the aforementioned custom reliable UDP protocol. The sending side would identify the appropriate out-of-order sequences and selectively push component data into individual segments, as well as the necessary header information. The receiver would obtain each segment in no particular order and either withhold or execute segments. This depended on the out-of-order mode being disabled or applicable/enabled. The sender is also configurable for packet loss probability, as well as the flow control window size and the segment size, which had remained fixed. The receiving end performs statistical monitoring of the arrival and the beginning of when a segment is to be executed. Note that the actual length of the execution time was not recorded as the resulting time difference to execute the segment data was comparable to making the expensive system call to gettimeofday twice. The application under test was a simple ply model loader based on the clean[23] library. The ply format is a 3D format developed at Stanford University. The program begins by shaping the camera to fit the model in view, the animation then commences where the model rotates in a predetermined fashion without any randomisation. Using the display lists option in this program changes the rendering method from immediate to retained mode. In retained mode the model is cached at the graphics processor, and would avoid all bandwidth limitations. It is not always possible to use a display list and for our purposes this was disabled. Tests were conducted between two machines within the local Gigabit LAN, the round trip time (RTT) between them yields the statistics min/avg/max/mdev = 0.076/0.108/0.147/0.014 ms. This test were all performed when no other users were active in the system. Table 1. Machines used in tests name Specification moth Dual AMD Opteron 242 (1594MHz), 1GB DDR SDRAM Nvidia GeForce FX 5950 Ultra AGP. Ubuntu Linux 2.6.20 kernel. cat Intel Pentium 4 2.8GHz, 512MB DDR SDRAM , Nvidia GeForce FX 5700/AGP. Ubuntu Linux 2.6.20 kernel.
Out-of-Order Execution for Avoiding Head-of-Line Blocking
(a) Buddha.
(b) Dragon.
69
(c) Bunny17k.
Fig. 4. Screen shots of the model viewing application with the models chosen from by experiments. Time difference from packet arrival to execution 7000
Time difference from packet arrival to execution 8
Dragonres2 NON-OOO P/L:1000 W:100
Dragonres2 OOO P/L:1000 W:100
7
6000
6 5000
Microseconds
Microseconds
5 4000
3000
4
3 2000 2 1000
1
0
0 0
2000
4000
6000
8000
Packet #
(a) Without OOO execution.
10000
0
2000
4000
6000
8000
10000
Packet #
(b) With OOO execution.
Fig. 5. The HOLB issue(left) is evident as the loss of single/burst of segments causes the execution delay of subsequent packets. The right shows an out of order protocol demonstrating that segment losses are isolated and independent to other segments.
The test models included decimated Stanford models: Dragon(res2), Buddha(res2) and a further reduced bunny (17k polygons) (see Figure 4. The use of the models in the experiment is to demonstrate how the large ooo segment can incur unnecessary delays in execution. The pattern observed in figure 5(a) demonstrates the concept of head-of-line blocking problem. The single/burst of packets lost at one instant will not be delivered until there is a chance to access the network. The network in these cases are highly utilised, thus the waiting time is significant. At closer examination the vertical stripes have slope that is left to right, these are the segments immediately following the missing one, who are not able to continue, thus their waiting time is directly proportional to its ’packet’ distance from the missing segment. Figure 5(b) shows how an out of order execution model allows almost every packet to execute immediately on arrival. The actual delays visible here is the remainder of processing other out of order commands first. It is important to note that some segments require appropriate ordering, however, as the packet loss probability is 1 in a 1000 the chances to observe similar delay from 5(b) is too small.
70
J. Stavrakakis and M. Takastuka
Time difference from packet arrival to execution 7000
Time difference from packet arrival to execution 12
Dragonres2 NON-OOO P/L:1000 W:50
6000
Dragonres2 OOO P/L:1000 W:50
10
5000
Microseconds
Microseconds
8 4000
3000
6
4 2000
2
1000
0
0 0
2000
4000
6000
8000
10000
0
2000
4000
Packet #
6000
8000
10000
Packet #
(a) Without OOO execution.
(b) With OOO execution.
Fig. 6. The HOLB issue(left) is affected by sliding window like flow control methods. The larger the sliding window, the longer it will take before retransmission can take place. Either due to delaying the negative acknowledgement or by having to wait on network availability.
Time difference from packet arrival to execution 9000
Time difference from packet arrival to execution 14
Dragonres2 NON-OOO P/L:3000 W:50
Dragonres2 OOO P/L:3000 W:50
8000 12 7000 10
Microseconds
Microseconds
6000 5000 4000
8
6
3000 4 2000 2 1000 0
0 0
2000
4000
6000
8000
10000
0
2000
4000
Packet #
6000
8000
10000
Packet #
(a) Without OOO execution.
(b) With OOO execution.
Fig. 7. The HOLB issue(left) is only an issue on packet loss. At this time the retransmission delay will cause a queue of unprocessed segments. The benefits of out of order execution appear much lower when segment loss is every 1/3000.
Time difference from packet arrival to execution 8000
Time difference from packet arrival to execution 12
Happyres2 NON-OOO P/L:1000 W:50
Happyres2 OOO P/L:1000 W:50
7000 10 6000 8 Microseconds
Microseconds
5000
4000
6
3000 4 2000 2 1000
0
0 0
2000
4000
6000
8000
Packet #
(a) Without OOO execution.
10000
0
2000
4000
6000
8000
10000
Packet #
(b) With OOO execution.
Fig. 8. The HOLB issue(left) with a different model Happy Buddha. The large contiguous sequence of geometry helps avoid the HOLB problem.
Out-of-Order Execution for Avoiding Head-of-Line Blocking Bunny17k NON-OOO P/L:2000 W:50 RTT+:0
71
Bunny17k NON-OOO P/L:2000 W:50 RTT+:500
4500
6000 Time Difference from packet arrival to execution
Time Difference from packet arrival to execution
4000 5000 3500 4000 Microseconds
Microseconds
3000 2500 2000
3000
1500
2000
1000 1000 500 0 1600
1650
1700
1750
0 1600
1800
1650
Packet #
(a) Without OOO execution.
1800
Bunny17k OOO P/L:2000 W:50 RTT+:10000
12000
6 Time Difference from packet arrival to execution
Time Difference from packet arrival to execution
10000
5
8000
4 Microseconds
Microseconds
1750
(b) Without OOO execution.
Bunny17k NON-OOO P/L:2000 W:50 RTT+:10000
6000
3
4000
2
2000
1
0 1600
1700 Packet #
1650
1700
1750
Packet #
(c) Without OOO execution.
1800
0 1600
1650
1700
1750
1800
Packet #
(d) With OOO execution.
Fig. 9. This figure shows the time interval between delivery and execution for an increasing round trip time. The non OOO cases 9(a), 9(b), 9(c) have added delays of 0, 500, 10000 microseconds respectively. These areas denote the time that the segment was waiting for ordering to occur before being executed. The OOO case for delay 10000s is shown in 9(d), the effect of RTT does not interfere with the time between arrival and execution.
The influence of round trip time (RTT) is depicted in figure 9. A simulated delay for packet delivery was added for the non OOO cases: 9(a), 9(b), 9(c) as well as the OOO case:9(d). They each had an increased RTT of 0, 500, 10000 microseconds respectively. In the non OOO cases the distance between consecutive points (after a loss) represents the time that is taken to parse and process it, note that this gradient is machine dependent (machine cat4 used here). Such an area represents the time that the receiver was unnecessarily idle and/or the waiting on the ordering of segments for execution. With zero additional RTT9(a), the delay is matching that of the protocol and the network alone, the response time in this case is within 4500 microseconds. Introducing a delay of 500 microseconds9(b) makes the effect of network delay visible. The recovery time worsened up to 6000 microseconds. The trend continues for a typical inter-city RTT(Sydney to Melbourne) time of 10,000 microseconds9(c). In this case, the stepping down represents the time at which the sender was recovering from the lost segment. During this time the sending window was full and no newer segments were being transmitted. It was only until the recovery occurred that the latter segments began to be transmitted, however as they arrived the ordering constraint on executing previous segments caused them to be delayed. The effects of HOLB will continue to ripple the
72
J. Stavrakakis and M. Takastuka
waiting time in this way for future arriving segments. Alternative protocols will offer various other distances to step down (denoting recovery), regardless, the loss of the single segment will incur an RTT penalty for consecutive segments within its own sliding window as shown in this protocol. Figure 9(d) is the OOO case where the RTT does not affect the interval between arrival and execution of the segment. It can be said to be invariant of RTT as it scales to larger networks (of higher RTT) for when out-of-order execution is applicable. It is difficult to avoid the ordering constraint for non-contiguous segments as certain functions will change the state of OpenGL and would thus break the scene rendering. This is most apparent in parallel rendering systems such as Pomegranate[24], where there are several commands being executed simultaneously. This out of order execution occurs within the same one-to-one interaction between single consumer/producer model. It is also important to note that the performance of rendering or network throughput is not a valid metric, as OOO execution in 3D graphics will not change the total time to transfer a frame of commands or to render them2 . This is due to the frame synchronisation which requires all data to be available before the next segment of the newer frame can be executed. The observation made is that without OOO execution the load of the graphics processing unit appears non-continuous (surge of usage), however when it is used, load is continuous and more effective in competing for 3D graphics/CPU time share in the system.
5 Conclusion By observing the flow of data in remote 3D graphics systems, we have discovered a potential bottleneck that can impact specific remote 3D applications. Our design decisions in building the experiment have been justified such that the nature of the bottleneck is clearly exposed. Using a controlled network protocol and large volumes of contiguous OOO graphics data, we are able to show that the HOLB problem presents unnecessary idle time for the graphics processor. This idle time is also dependent on the parameters of the network protocol, where increased sliding window sizes accumulated greater delay times for more segments, while single or burst of packet losses can still impact utilisation. We found that by having less restrictive ordering requirements, immediate mode graphics can alleviate the HOLB problem when the amount of geometric data becomes large. To further exploit the HOLB with graphics, either: the specification of the graphics is required to change such that dependencies are satisfied in advance (moving towards retained mode) or the time of processing the segment data exceeds that of network costs. Such circumstances can exist when receiving systems are under a high load from competing 3D graphics clients, embedded devices may also take more time computing than utilising the network speed. Our future work will aim at utilising this protocol for the two cases: load balancing in limited computing resources; and performing a dynamic trade off between idle process2
The time taken to execute a segment was too insignificant to be considered an improvement, less the negligible overhead.
Out-of-Order Execution for Avoiding Head-of-Line Blocking
73
ing time to perform other functionality. For example, we could easily apply transparent compression on the components such that the bandwidth requirement is lowered, in doing so we raise the computational cost per segment and can move closer towards a more efficient network graphics system.
References 1. Liang, Y.J., Steinbach, E.G., Girod, B.: Real-time voice communication over the internet using packet path diversity. In: MULTIMEDIA 2001: Proceedings of the ninth ACM international conference on Multimedia, pp. 431–440. ACM Press, New York (2001) 2. Peng, J., Kuo, C.-C.J.: Geometry-guided progressive lossless 3d mesh coding with octree (ot) decomposition. In: SIGGRAPH 2005: ACM SIGGRAPH 2005 Papers, pp. 609–616. ACM Press, New York (2005) 3. Purnomo, B., Bilodeau, J., Cohen, J.D., Kumar, S.: Hardware-compatible vertex compression using quantization and simplification. In: HWWS 2005: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, pp. 53–61. ACM Press, New York (2005) 4. Varakliotis, S., Hailes, S., Ostermann, J.: Progressive coding for QoS-enabled streaming of dynamic 3-D meshes. In: 2004 IEEE International Conference on Communications, vol. 3(20-24), pp. 1323–1329 (2004) 5. Information Sciences Institute, U.o.S.C.: Transmission control protocol (September 1981) 6. Stevens, W.R., Fenner, B., Rudoff, A.M.: UNIX Network Progrmaming, vol. 1. AddisonWesley, Reading (2004) 7. Advanced Visual Systems (AVS), Information visualization: visual interfaces for decision support systems (2002), http://www.avs.com/ 8. Hibbard, B.: Visad: connecting people to computations and people to people (GET THIS). SIGGRAPH Comput. Graph. 32(3), 10–12 (1998) 9. Brodlie, K., Duce, D., Gallop, J., Sagar, M., Walton, J., Wood, J.: Visualization in grid computing environments. In: VIS 2004: Proceedings of the conference on Visualization 2004, pp. 155–162. IEEE Computer Society Press, Los Alamitos (2004) 10. Brodlie, K., Duce, D., Gallop, J., Walton, J., Wood, J.: Distributed and collaborative visualization. Computer Graphics Forum 23(2), 223–251 (2004) 11. Stavrakakis, J., Takatsuka, M.: Shared geometry-based collaborative exploratory visualization environment. In: Workshop on Combining Visualisation and Interaction to Facilitate Scientific Exploration and Discovery, British HCI, London, pp. 82–90 (2006) R graphics system: A 12. Segal, M., Akeley, K., Frazier, C., Leech, J., Brown, P.: The opengl specification. Technical report, Silicon Graphics, Inc (October 2004) 13. Meta VR, Inc.: Meta VR virtual reality scene (2007), http://www.metavr.com 14. Liu, Y., Liu, X., Wu, E.: Real-time 3d fluid simulation on gpu with complex obstacles. In: Proceedings of 12th Pacific Conference on Computer Graphics and Applications, pp. 247– 256 (2004) R graphics with the X Window System R : Version 1.3. 15. Womack, P., Leech, J.: OpenGL Technical report, Silicon Graphics, Inc. (October 1998) 16. The X.Org Foundation.: About the x window system, http://www.x.org/X11.html 17. Humphreys, G., Buck, I., Eldridge, M., Hanrahan, P.: Distributed rendering for scalable diaplays. In: Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, vol. 30, IEEE Computer Society Press, Los Alamitos (2000) R stream codec: A specification. Technical report, Silicon 18. Dunwoody, C.: The openGL Graphics, Inc. (October 1996)
74
J. Stavrakakis and M. Takastuka
19. Igehy, H., Stoll, G., Hanrahan, P.: The design of a parallel graphics interface. In: SIGGRAPH 1998: Proceedings of the 25th annual conference on Computer graphics and interactive techniques, pp. 141–150. ACM Press, New York (1998) 20. Stavrakakis, J., Lau, Z.-J., Lowe, N., Takatsuka, M.: Exposing application graphics to a dynamic heterogeneous network. In: WSCG 2006: The Journal of WSCG, Science Press (2006) 21. Stanford University.: The stanford 3d scanning repository (2007) 22. Deering, M.: Geometry compression. In: SIGGRAPH 1995: Proceedings of the 22nd annual conference on Computer graphics and interactive techniques, pp. 13–20. ACM Press, New York (1995) 23. Real-time Rendering Group, The University of Western Australia. The clean rendering libraries (2005), http://60hz.csse.uwa.edu.au/libraries.html 24. Eldridge, M., Igehy, H., Hanrahan, P.: Pomegranate: a fully scalable graphics architecture. In: SIGGRAPH 2000: Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pp. 443–454. ACM Press, New York (2000)
A Fast Mesh Deformation Method for Neuroanatomical Surface Inflated Representations ´ Andrea Rueda1 , Alvaro Perea2 , Daniel Rodr´ıguez-P´erez2, and Eduardo Romero1 1
BioIngenium Research Group, Universidad Nacional de Colombia, Carrera 30 45-03, Bogot´ a, Colombia {adruedao,edromero}@unal.edu.co 2 Department of Mathematical Physics and Fluids, Universidad Nacional de Educaci´ on a Distancia, c/ Senda del Rey, 9, 28040 Madrid, Spain {aperea,daniel}@dfmf.uned.es
Abstract. In this paper we present a new metric preserving deformation method which permits to generate smoothed representations of neuroanatomical structures. These surfaces are approximated by triangulated meshes which are evolved using an external velocity field, modified by a local curvature dependent contribution. This motion conserves local metric properties since the external force is modified by explicitely including an area preserving term into the motion equation. We show its applicability by computing inflated representations from real neuroanatomical data and obtaining smoothed surfaces whose local area distortion is less than a 5 %, when comparing with the original ones. Keywords: area-preserving deformation model, deformable geometry, surface inflating.
1
Introduction
Computational technologies have recently invaded medical practice, changing in many ways clinical activities and becoming an important tool for patient diagnosis, treatment and follow-up. In particular, the use of three-dimensional models, obtained from medical images such as Magnetic Resonance, Positron Emission Tomography or Computed Tomography, improves visualization and functional analysis of anatomical structures with intricate geometry. Morphometrical studies require a high degree of precision and reproducibility [1], tasks which are difficult to achieve because of the complexity of such structures. Assessment of lengths and areas on these surfaces is a very high time-consuming process, a factor which limits large anatomical studies. This problematic situation is worsen when one considers that morphometrical studies show variabilities which can reach a 30 % [1], an unacceptable figure for many investigations. This D. Mery and L. Rueda (Eds.): PSIVT 2007, LNCS 4872, pp. 75–86, 2007. c Springer-Verlag Berlin Heidelberg 2007
76
A. Rueda et al.
kind of procedures can be improved by assistance of recent semi-automated or full-automated developed techniques, but they are generally addressed to measure deep brain structures through voxel-based [2] or object-based [3] strategies. Lately, a deformation of the original surface into a simpler one turns out to overcome most of these difficulties since measurements can be carried out on simpler and smoother surfaces [4, 5]. These morphometric tasks may thus be simplified if lengths and areas are calculated over topologically equivalent, smoother versions of the surface, subjected to the condition that the metric properties must be appropriately preserved. Deformable models, those where a 2D contour or a 3D surface evolves towards a target contour or surface, have been extensively studied during the last 20 years. 2D deformable models were introduced by Kass et al. [6] and extended to 3D applications by Terzopoulos et al. [7]. A broad range of applications of these models includes pattern recognition, computer animation, geometric modeling, simulation and image segmentation, among others [8]. A number of previous works [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20] have focused on application of deformable models to cerebral cortex analysis. In these papers, several models are able to generate different types of smoothed representations. These works introduce different surface-based analyses of the cerebral cortex, such as surface smoothing, geometric transformations, projections or mappings, surface flattening and other types of deformations. Recent work in surface deformation has been addressed to generate three types of representations: unfolded, inflated, and spherical. Unfolding methods transform the original 3D surface into a planar representation, which sometimes require inserting cutting paths onto the surface boundary, a simple strategy which permits to reduce the stretch of the flattened surface. Surface smoothing methods iteratively inflate the surface, retaining its overall shape and simplifying its geometry. Spherical or ellipsoidal representations can be obtained by defining a conformal mapping from the original surface onto a sphere or ellipsoid. These representations generally attempt to facilitate visualization of a particular structure, disregarding metric preserving restrictions. Concerning surface deformation methods under the restriction of metric preservation, a lot of work has been dedicated to build conformal or quasiconformal mappings of the brain surface. These approaches use an important theorem from the Riemannian geometry, namely, that a surface of genus zero (without any handles, holes or self-intersections) can be conformally mapped onto a sphere and any local portion thereof onto a disk. Conformal mappings are angle preserving and many attempts have been made at designing also quasilength or area-preserving mappings [21,22,23,24,25,26], reporting metric distortions close to 30 % of the original surface. On the other hand, different methods propose a set of local forces, which guarantees an approximated metric preservation1 while smoothing the surface [19, 22, 27, 28]. It is well known that it is impossible to exactly preserve distances, or to preserve angles and areas simultaneously because the original surface and its smoothed version will have 1
These methods report metric distortions up to a 20 % of the original surface.
A Fast Mesh Deformation Method
77
different Gaussian curvature [29]. Pons et al. [30] presented a method to deform the cortical surface while area distortion is less than a 5 %. The whole approach uses a tangential motion which depends on the normal motion, constructed for ensuring area preservation. However, implementation is performed using high computational cost representations such as level sets [31]. Our main goal was to develop an efficient and accurate deformation method constrained by tough area preservation conditions for mesh representations. In the present work a novel formulation, meeting the requirement of area preservation for evolving surfaces, is introduced. The original surface, approximated by a triangulated mesh, is modified by applying an external velocity field, which iteratively drives the surface towards a desired geometrical configuration. This velocity field is composed of a smoothness force, used to move each point in the surface towards the centroid of its neighbors and a radial expansion term. This last term combines a radial velocity (such as the distance to a desired sphere surface) and a geometrical component which depends on the local curvature and is charged of maintaining local area properties. This paper is organized as follows. Section 2 describes the proposed method, giving details related to the local area preservation condition and to the deformation process. Results obtained by applying the deformation method to phantom and actual surfaces are presented in Section 3 and conclusions in Section 4.
2
Methods and Models
In this section we develop the mathematical model which includes conditions for the preservation of local area of an evolving surface, represented by a triangulated mesh. We then introduce the smoothing and radial expansion terms, which guide the deformation process. Finally we describe the complete velocity model, together with a description of the surface evolution algorithm. 2.1
Local Area Preservation
The whole approach can be formulated as follows: consider a surface which is represented by a triangulated mesh, composed of N nodes. This mesh is forced to deform into a desired surface subjected to the condition that Euclidean the local metric must be preserved. Let us define the total area S {xi }N i=1 as a function of the node coordinates xi , while the global conservation condition of S upon motion of the xi , parameterized as xi (t), is N i=1
x˙ i ·
∂ S {xk }N k=1 = 0 ∂xi
(1)
where the x˙ i denotes the derivative of xi with respect to the parameter t (that we will refer to as time). The total area S, written as the sum of the mesh triangle areas, can be decomposed for a given node of coordinates xi as S= sil + Si l
78
A. Rueda et al.
where the sil represents the area of all the triangles having xi as one of their vertices, and Si stands for the area of the rest of the triangles, none of which has xi as a vertex. Thus, we can rewrite (1) on a per vertex basis as N
x˙ i ·
i=1
∂ i sl = 0 ∂xi l
so that a convenient solution to this equation is ∂ i sl = 0 x˙ i · ∂xi
(2)
l
which clearly fulfills the searched condition. Let us then define from expression (2) a vector κi =
∂ i sl ∂xi l
which can be seen as a “curvature vector” associated to the i-th vertex so that local area preservation is guaranteed when the variation of xi is perpendicular to κi . This local estimation of the curvature κi is a vector for each mesh point, expressed as a − b cos α κi = |xj − xj+1 | 2 sin α j∈Ni
where Ni is the set of vertices neighboring the i-th vertex, a = xj −xj+1 |xj −xj+1 |
xi −xj |xi −xj |
and
are unit vectors with directions defined in the triangle with vertices b= i, j and j + 1, and α is the angle between them which satisfies a · b = cos α. 2.2
Smoothing Terms
The local smoothing term fSL (i) is calculated for each point xi as the difference between the center of mass xCM of the triangles that share the vertex xi and the position of this vertex, that is to say 1 fSL (i) = xCM (i) − xi , where xCM (i) = xj . Ni j∈Ni
On the other hand, a global smoothing term is calculated as fSG
N 1 = ni · (xi − xj )ni N i j∈Ni
where ni is the average normal vector on every triangle which shares the i-th vertex and the sum on j is on all neighbors of the i-th vertex. The total smoothing term fS (i) = fSL (i) + fSG , proposed by Fischl et al. [20], drives each vertex in the direction of the centroid of its neighbors.
A Fast Mesh Deformation Method
2.3
79
Radial Expansion Motion
Overall, the expansion movement is imposed by three different components: a radial velocity which is defined by the user, a geometrical component which depends on the local curvature and a radial expansion which forces the surface towards a hypothetical sphere. All these forces point out to a direction on the average normal ni of the i-th vertex, as follows hR (i) = [vradial (i) + F (κi ) + (Rext − xi · ni )] ni where F (κi )) = −κi , so that if the curvature is positive (belly shape), the surface is flattened towards the interior and when the curvature is negative (hole shape) the surface is flattened towards the exterior. The reference value Rext corresponds to the maximum distance between the whole surface and its center of mass. This term forces out the points towards the circumscribed sphere. 2.4
Velocity Model
Let us now assume that we impose a deformation field such that every vertex xi is moving with a particular “velocity” v(xi ), which is dependent on the vertex position. Then, the evolution equation for each point is x˙ i = fSL (i) + fSG + λi hR (i) where λi is a local parameter which takes into account the relative weight of the radial expansion and smoothing. Such weight function is estimated from the local conservation relationship x˙ i · κi = 0 so that λi = −
κi · (fSL (i) + fSG ) . κi · hR (i)
(3)
In order to prevent stiffness phenomena during the surface evolution, an additional parameter β is introduced into the expansion term x˙ i = fSL (i) + fSG + [(1 + β)λi − βλ]hR (i)
(4)
also, a global weight function λ is included κi · (fSL (i) + fSG ) . λ=− i i κi · hR (i) This evolution equation combines, in an arithmetic proportion, local and global preservation effects; which is the constrained motion model used in this paper. 2.5
Surface Evolution Process
This external velocity field imposes an expansion or contraction movement driven by the radial force. Physically, this amounts to an internal pressure which acts on a smoothed surface, result of a re-distribution effect of the surface tension
80
A. Rueda et al.
caused by the pressure changes. The radial expansion movement is then a consequence of the resultant pressure excess. According to this scheme, a two-phase surface evolution process is proposed. In the first stage, only the local and global smoothing terms are applied, updating the mesh position points. Once the surface is smoothed, the radial expansion factor which includes the local preservation term, is calculated and applied to the point coordinates of the smoothed surface. Algorithm 1 summarize the whole process and is hereafter presented. Algorithm 1. Surface Evolution Set the time step dt and the β parameter repeat Calculate the global smoothing force fSG for i = 1 to N do Calculate the local smoothing force fSL (i) ˜ i (t) = xi (t) + [fSL (i) + fSG ]dt Update the point coordinates x end for for i = 1 to N do Calculate the radial expansion force hR (i) Calculate the local weighting parameter λi end for Calculate the global weighting parameter λ for i = 1 to N do ˜ i (t) + [(1 + β)λi − βλ]hR (i)dt Update the point coordinates xi (t + dt) = x end for until Some convergence criterion is met
3
Results and Discussion
In this section we compute inflated representations from both phantom and actual neuroanatomical data. Then, a description of the actual surfaces is introduced together with implementation and evaluation issues. 3.1
Phantoms
At a first stage, the deformation model was evaluated over phantom surfaces, generated and modified using a MathematicaTM routine (version 5.0), which implements the surface evolution process presented in Algorithm 1. These surfaces were obtained by mixing up simple shapes as illustrated in Figure 1, with similar discontinuities to the actual neuroanatomical data. The number of triangles varied between 192 and 288 and the area units were defined for each surface from the isotropical cartesian space generated for each case. Figure 1 illustrates our technique with a phantom surface, constructed via two spheres which results in a single discontinuity. The initial surface is displayed at the left panel and the resultant surface, after 25 iterations, at the right panel. In this example, the total area of both surfaces is 20, 95 area units so the total area was
A Fast Mesh Deformation Method
(a)
81
(b)
Fig. 1. Result of applying our deformation model on a phantom image. (a) Initial surface. (b) Deformed surface.
(a)
(b)
(c)
(d)
Fig. 2. Result of applying our deformation model, using a different β, on a phantom image. (a) Initial surface. (b) With β = 0.5. (c) With β = 1.0. (d) With β = 1.5.
preserved. Note that the smooth force, subjected to the preservation condition, redistributes point positions, an effect which can be here observed as a twist of the main surface direction. Figure 2 shows, upon a phantom surface similar to a brain stem, how the model was proved using different values of the β parameter. Figure 2 presents the initial surface (panel 2(a)) and results obtained with β values of 0.5, 1.0 and 1.5 are shown at panels 2(b), 2(c) and 2(d), respectively. After 10 iterations and a set time step of dt = 0.1, the three resulting surfaces presents variable degrees of deformation, without any metric distortion. 3.2
Real Surfaces
Performance of the surface evolution process was also assessed on 3D surfaces, obtained from actual neuroanatomical data. The whole implementation was written in C++, using the VTK (www.vtk.org) functions for interaction and visualization of these structures. All these routines run on a Linux system with a 2.13 GHz Intel Core 2 Duo processor and 2GB in RAM memory.
82
A. Rueda et al.
Datasets. Brain stem and cerebellum triangulated surfaces, segmented and reconstructed from medical images were used as input of the algorithm. The former was obtained from a 512 × 512 × 50 computed tomographic image and the resulting mesh was composed of 2800 points and 5596 triangles, while the latter was obtained from a 512 × 512 × 40 computed tomographic image which resulted in a mesh composed of 4650 points and 9296 triangles. Implementation Issues. The simple over-relaxation scheme proposed in Equation 4 for integration, was replaced by a one step predictor-corrector scheme [32]. The local preserving condition is introduced through a λi parameter which obliges the curvature vector κi to be perpendicular to the direction of the smoothing force. Denominator of the λi parameter (see Equation 3) was forced to be larger than 0.001 for avoiding discontinuities. A global area preservation factor λ facilitates a proper handling of the general preserving contribution while relaxes the local conservation condition. A β parameter is also introduced for managing the balance between local and global contributions (see Equation 4). All the examples use a β parameter set at 0.2 and dt = 0.001. Finally, the total smoothness force was also weighted using a factor set to 1.2 in the same equation. Evaluation Issues. For evaluation purposes a local area factor Ji at point xi of the surface is introduced as Ji = A0pi /Atpi , where A0pi is the initial area and Atpi is the current area of the patch around this point, defined by the area of the triangles which share the point xi . A decreasing Ji indicates a local area
(a)
(b)
(c)
(d)
(e)
Fig. 3. Result of applying our deformation model on a brain stem surface. (a) Initial surface. (b) Iteration 500. (c) Iteration 1000. (d) Iteration 2000. (e) Iteration 3000.
A Fast Mesh Deformation Method
83
Table 1. Normalized area factor J/J¯ for the brain stem surface J/J¯ Number of patches % of patches 0.0 − 0.75 0 0% 0.76 − 0.95 0 0% 0.96 − 1.04 2722 97.21 % 1.05 − 1.24 71 2.54 % 1.25 − 2.0 7 0.25 %
(a)
(b)
(c)
(d)
(e)
Fig. 4. Result of applying our deformation model on a cerebellum surface. (a) Initial surface. (b) Iteration 500. (c) Iteration 1000. (d) Iteration 2000. (e) Iteration 3000.
expansion while an increasing Ji , a local area shrinkage. Also, let us define the N average area factor as J¯ = 1/N i Ji and the normalized area factor as the ¯ This factor ratio between the local area factor Ji and the average area factor J. ¯ J/J gives an estimation of the local area changes related to the distortion of the total area of the surface. Results. Figure 3 illustrates the whole process on actual brain stem data. Upper panel (Figure 3(a)) corresponds to the original data and lower Figures 3(b), 3(c), 3(d) and 3(e) stand for the resulting meshes after 500, 1000, 2000, and 3000 iterations, respectively. The deformation method, applied on the brain stem, presented a local area distortion of about a 4% in the 97% of the patches after 3000 iterations. Regarding performance time, image 3(e) is obtained after 223 s, a time which can be considered as adequate for measures in actual morphometrical studies. Table 1 shows the normalized area factor between the interval [0, 2] for the brain stem surface. A ratio close to one indicates little area changes, that is to say that the local and overall changes are comparable. Figures indicate small changes since most patches present a ratio close to 1.
84
A. Rueda et al. Table 2. Normalized area factor J/J¯ for the cerebellum surface J/J¯ Number of patches % of patches 0.0 − 0.75 0 0% 0.76 − 0.95 12 0.26 % 0.96 − 1.04 4622 99.40 % 1.05 − 1.24 0 0% 1.25 − 2.0 16 0.34 % Table 3. CPU time needed to generate the example surfaces Number CPU Time Dataset of points Iteration 500 Iteration 1000 Iteration 2000 Iteration 3000 Brain stem 2800 37 s 74 s 147 s 223 s Cerebellum 4650 61 s 122 s 244 s 366 s
For the cerebellum mesh, local area distortion is close to 4 % in the 99 % of the patches until iteration 3000. Figure 4 presents images of the surface deformation: upper panel (Figure 4(a)) corresponds to the original data set and lower Figures 4(b), 4(c), 4(d) and 4(e) stand for resulting meshes after 500, 1000, 2000, and 3000 iterations, respectively. Calculation time for 3000 iterations is 366 s. The normalized area factor, as shown in Table 2, is again consistent with little area changes. Performance analysis. Table 3 summarizes the calculation time required for generating the resulting meshes presented before. It is important to point out that the intermediate meshes obtained after each iteration are not visualized and only the final surface is rendered. As observed in Table 3, CPU performs 8.2 iterations for the cerebellum mesh in one second, while the brain stem surface demands 13.5 iterations for the same time, perfectly compatible with actual clinical practice.
4 Conclusions
We have presented a deformation method which permits generating smoothed representations of neuroanatomical structures. These structures are represented as surfaces approximated by triangulated meshes, in such a way that these rather simple representations allow us to obtain an efficient and fast deformation under a local area preservation restriction. This approach is efficient because of the small area changes and fast in terms of adequate visualization in actual clinical practice. Each node velocity is given by a geometrically varying motion field to which area preserving constraints are applied. We use a mixed local-global area preservation constraint to enhance the success of the algorithm. The mathematical structure of the constrained motion model allows us to simply integrate the motion on a per-node basis, with no need to solve large systems of equations at
each integration step. Finally, we have shown the applicability of this algorithm for computing inflated representations of neuroanatomical structures from real data. Future work includes a parameter analysis for better tuning of the algorithm's performance. Also, a clinical evaluation of this method in actual morphometrical studies is needed.
References

1. Filippi, M., Horsfield, M., Bressi, S., Martinelli, V., Baratti, C., Reganati, P., Campi, A., Miller, D., Comi, G.: Intra- and inter-observer agreement of brain MRI lesion volume measurements in multiple sclerosis. A comparison of techniques. Brain 6, 1593–1600 (1995)
2. Tapp, P.D., Head, K., Head, E., Milgram, N.W., Muggenburg, B.A., Su, M.Y.: Application of an automated voxel-based morphometry technique to assess regional gray and white matter brain atrophy in a canine model of aging. NeuroImage 29, 234–244 (2006)
3. Mangin, J., Riviere, D., Cachia, A., Duchesnay, E., Cointepas, Y., Papadopoulos-Orfanos, D., Collins, D., Evans, A., Regis, J.: Object-based morphometry of the cerebral cortex. IEEE Trans. Med. Imaging 23, 968–982 (2004)
4. Filipek, P.A., Kennedy, D.N., Jr., V.S.C., Rossnick, S.L., Spraggins, T.A., Starewicz, P.M.: Magnetic resonance imaging-based brain morphometry: Development and application to normal subjects. Annals of Neurology 25, 61–67 (1989)
5. Ashtari, M., Zito, J., Gold, B., Lieberman, J., Borenstein, M., Herman, P.: Computerized volume measurement of brain structure. Invest. Radiol. 25, 798–805 (1990)
6. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: active contour models. International Journal of Computer Vision 1, 321–331 (1988)
7. Terzopoulos, D., Witkin, A., Kass, M.: Constraints on deformable models: recovering 3D shape and nonrigid motion. Artificial Intelligence 36, 91–123 (1988)
8. Montagnat, J., Delingette, H., Ayache, N.: A review of deformable surfaces: topology, geometry and deformation. Image and Vision Computing 19, 1023–1040 (2001)
9. Carman, G.J., Drury, H.A., Essen, D.C.V.: Computational methods for reconstructing and unfolding the cerebral cortex. Cerebral Cortex 5, 506–517 (1995)
10. Drury, H., Van Essen, D., Anderson, C., Lee, C., Coogan, T., Lewis, J.: Computerized mappings of the cerebral cortex: a multiresolution flattening method and a surface-based coordinate system. Journal of Cognitive Neuroscience 1, 1–28 (1996)
11. Essen, D.C.V., Drury, H.A.: Structural and functional analyses of human cerebral cortex using a surface-based atlas. The Journal of Neuroscience 17, 7079–7102 (1997)
12. Essen, D.C.V., Drury, H.A., Joshi, S., Miller, M.I.: Functional and structural mapping of human cerebral cortex: Solutions are in the surfaces. Neuroimaging of Human Brain Function 95, 788–795 (1998)
13. Drury, H.A., Corbetta, M., Shulman, G., Essen, D.C.V.: Mapping fMRI activation data onto a cortical atlas using surface-based deformation. NeuroImage 7, S728 (1998)
14. Joshi, M., Cui, J., Doolittle, K., Joshi, S., Essen, D.V., Wang, L., Miller, M.I.: Brain segmentation and the generation of cortical surfaces. NeuroImage 9, 461–476 (1999)
15. Essen, D.C.V., Drury, H.A., Dickson, J., Harwell, J., Hanlon, D., Anderson, C.H.: An integrated software suite for surface-based analyses of cerebral cortex. Journal of the American Medical Informatics Association 8, 443–459 (2001)
16. Harwell, J., Essen, D.V., Hanlon, D., Dickson, J.: Integrated software for surface-based analyses of cerebral cortex. NeuroImage 13, 148 (2001)
17. Fischl, B., Sereno, M.I., Tootell, R.B., Dale, A.M.: High-resolution intersubject averaging and a coordinate system for the cortical surface. Human Brain Mapping 8, 272–284 (1999)
18. Dale, A.M., Fischl, B., Sereno, M.I.: Cortical surface-based analysis I: Segmentation and surface reconstruction. NeuroImage 9, 179–194 (1999)
19. Fischl, B., Sereno, M.I., Dale, A.M.: Cortical surface-based analysis II: Inflation, flattening, and a surface-based coordinate system. NeuroImage 9, 195–207 (1999)
20. Fischl, B., Liu, A., Dale, A.M.: Automated manifold surgery: Constructing geometrically accurate and topologically correct models of the human cerebral cortex. IEEE Transactions on Medical Imaging 20, 70–80 (2001)
21. Angenent, S., Haker, S., Tannenbaum, A., Kikinis, R.: On the Laplace-Beltrami operator and brain surface flattening. IEEE Transactions on Medical Imaging 18, 700–711 (1999)
22. Haker, S., Angenent, S., Tannenbaum, A., Kikinis, R., Sapiro, G., Halle, M.: Conformal surface parameterization for texture mapping. IEEE Transactions on Visualization and Computer Graphics 6, 181–189 (2000)
23. Gu, X., Yau, S.-T.: Computing conformal structure of surfaces. CoRR: Graphics (2002)
24. Hurdal, M.K., Stephenson, K.: Cortical cartography using the discrete conformal approach of circle packings. NeuroImage 23, s119–s128 (2004)
25. Ju, L., Stern, J., Rehm, K., Schaper, K., Hurdal, M., Rottenberg, D.: Cortical surface flattening using least square conformal mapping with minimal metric distortion. 2004 2nd IEEE International Symposium on Biomedical Imaging: Macro to Nano 1, 77–80 (2004)
26. Wang, Y., Gu, X., Chan, T.F., Thompson, P.M., Yau, S.T.: Intrinsic brain surface conformal mapping using a variational method. Proceedings of SPIE - The International Society for Optical Engineering 5370, 241–252 (2004)
27. Hermosillo, G., Faugueras, O., Gomes, J.: Cortex unfolding using level set methods. Technical report, INRIA: Institut National de Recherche en Informatique et en Automatique (1999)
28. Tasdizen, T., Whitaker, R., Burchard, P., Osher, S.: Geometric surface smoothing via anisotropic diffusion of normals. In: 13th IEEE Visualization 2002 (VIS 2002), IEEE Computer Society Press, Los Alamitos (2002)
29. DoCarmo, M.P.: Differential Geometry of Curves and Surfaces. Prentice-Hall, Englewood Cliffs (1976)
30. Pons, J.-P., Keriven, R., Faugeras, O.: Area preserving cortex unfolding. In: Medical Image Computing and Computer-Assisted Intervention MICCAI, Proceedings, pp. 376–383 (2004)
31. Sethian, J.A.: Level Set Methods and Fast Marching Methods: Evolving Interfaces in Computational Geometry, Fluid Mechanics, Computer Vision, and Materials Science. Cambridge University Press, Cambridge (1999)
32. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C: The Art of Scientific Computing, 2nd edn. Cambridge University Press, Cambridge (1992)
Mosaic Animations from Video Inputs

Rafael B. Gomes, Tiago S. Souza, and Bruno M. Carvalho

Departamento de Informática e Matemática Aplicada
Universidade Federal do Rio Grande do Norte
Campus Universitário, S/N, Lagoa Nova
Natal, RN, 59.072-970 - Brazil
[email protected], [email protected], [email protected]
Abstract. Mosaic is a Non-Photorealistic Rendering (NPR) style for simulating the appearance of decorative tile mosaics. To simulate realistic mosaics, a method must emphasize edges in the input image, while placing the tiles in an arrangement that minimizes the visible grout (the substrate used to glue the tiles, which appears between them). This paper proposes a method for generating mosaic animations from input videos (extending previous works on still image mosaics) that uses a combination of a segmentation algorithm and an optical flow method to enforce temporal coherence in the mosaic videos, thus preventing the tiles from moving back and forth across the canvas, a problem known as swimming. The result of the segmentation algorithm is used to constrain the result of the optical flow, restricting its computation to the areas detected as being part of a single object. This intra-object coherence scheme is applied to two methods of mosaic rendering, and a technique for adding and removing tiles for one of the mosaic rendering methods is also proposed. Some examples of the renderings produced are shown to illustrate our techniques.
1 Introduction
Non-Photorealistic Rendering (NPR) is a class of techniques defined by what they do not aim at: the realistic rendering of artificial scenes. NPR techniques, on the other hand, aim to reproduce the renderings of artistic techniques, trying to express feelings and moods in the rendered scenes. Another way of defining NPR is that it is the processing of images or videos into artwork, generating images or videos that can have the visual appeal of pieces of art, expressing the visual and emotional characteristics of artistic styles (e.g. brush strokes). Animation techniques can convey information that cannot be simply captured by shooting a real scene with a video camera. However, such animation is labor intensive and requires a fair amount of artistic skill. NPR techniques can be used to generate highly abstracted animations with little user intervention, thus making it possible for non-artist users to create their own animations with little effort. Mosaic is a Non-Photorealistic Rendering (NPR) style for simulating the appearance of decorative tile mosaics. To simulate realistic mosaics, a method must
emphasize edges in the input image, while placing the tiles in an arrangement that minimizes the visible grout (the substrate used to glue the tiles, which appears between them), i.e., maximizing the area covered by the tiles, as defined initially in [1]. Another characteristic common to most real mosaic styles is that the tiles are convex. If one wants to generate mosaic animations, one has to track the tile locations and enforce that their geometrical relation maintains temporal coherence, i.e., does not suffer from abrupt changes, to avoid discontinuities over time, or swimming, where drawn features move around the canvas. The method proposed in this paper creates temporally coherent mosaic animations from input videos, extending previous works on still image mosaics. This paper introduces a method for enforcing temporal coherence in mosaic videos that is based on a combination of a segmentation algorithm and an optical flow method. The result of the segmentation algorithm is used to constrain the result of the optical flow, restricting its results to the areas detected as being part of a single object. The main contributions of this paper are the extensions of two still mosaic techniques [2,3] for generating mosaic animations. These extensions include a method for moving tiles in a temporally coherent way, as well as methods for adding and removing tiles.
2 Still Image Mosaics
The generation of artificial still mosaics must follow a few rules if the intent is to generate images similar to common man-made mosaics, such as the maximization of the area covered by the tiles, the use of convex tiles, and the emphasizing of edges by orienting the tiles according to the edge orientation. The use of Voronoi diagrams to generate artificial mosaics is very popular, since they discretize the 2D space into finite convex regions (tiles) and maximize the space covered by the tiles. The first attempt to produce images with a mosaic style effect was proposed by Haeberli [4], which worked by creating random sites for the Voronoi diagram and painting each region with a color sampled from the input image. In order to produce a smoother flow of tiles that follow the edges detected in the input images, Hausner [1] proposed to use a generalization of Centroidal Voronoi diagrams (CVDs), which are Voronoi diagrams with the additional property that each site is located at the center of mass of its region. The CVDs are calculated using an iterative algorithm that updates the centroid positions and recomputes the Voronoi diagram until it converges, and it can be implemented in hardware, thus speeding up its execution. The orientations of the tiles are controlled by a direction field that can be created using the Euclidean distance or the Manhattan distance from the edges, if the desired tile shapes are hexagonal or square, respectively. Dobashi et al. [5] proposed a method where the initial sites are positioned approximately at the centers of a hexagonal mesh, and thus are approximately centroidal. The sites are then moved to minimize a metric defined based on the color of the pixels contained in each tile and the Voronoi diagram is recomputed,
thus representing global features of the input image. In [3], Faustino and Figueiredo proposed an adaptive mosaic method where tiles of different sizes are used according to the feature size in the input image, as given by an image density function. Recently, Di Blasi and Gallo [2] proposed a mosaic method based on the Distance Transform Map (DTM) to create chains spaced periodically according to a pre-defined tile size, where the DTM is produced based on a guideline (edge) image that encodes the edges to be emphasized in the rendering. The tiles are then placed following these chains. This technique has problems handling noisy images, since the existence of small guidelines (edges) produces small cut tiles and images that do not resemble a typical man-made mosaic. In this work we use Voronoi diagrams and CVDs to create an animated mosaic rendering style, since the mosaics obtained from the CVDs are more similar to traditional man-made mosaics. Another problem with the animations produced by using the Voronoi diagrams is that the distances between neighboring tiles can vary throughout the animation, generating a distracting effect. We also propose a new initial site distribution scheme for computing the CVDs, followed by the use of a Constrained Optical Flow method to maintain intra-object temporal coherence throughout the animation.
3 Animated Mosaics
One of the objectives of NPR techniques for video stylization is to provide automatic or semi-automatic procedures that mimic real-life artistic styles, thus allowing a user to stylize real movie sequences captured with a camera with little effort when compared to the task of creating an animation from scratch. Video stylization also offers the choice of mixing real movies with stylized objects, rendering with one or more NPR techniques only parts of a movie sequence, leaving the rest of the video sequence intact. When producing an NPR video from a modeled 3D scene, it is important to maintain temporal coherence, moving the elements of the drawing (e.g. brush strokes) with the surfaces of the objects being drawn; otherwise, these elements stick to the view plane and the animation appears as if it is seen through a textured glass. Due to this, it is referred to as the shower door effect, named by Meier in [6]. However, if the input for the NPR video is a normal video, it has been reported in the literature that not maintaining temporal coherence results in swimming, where features of the animation move within the rendered animation. This flickering comes not only from changed objects being rendered with elements that follow the object movement but also from static areas being rendered differently each time. To solve part of this problem, Litwinowicz [7] introduced a method for maintaining temporal coherence in video sequences stylized using an impressionist style. The method uses optical flow to track movement in the scene and to move, add or remove brush strokes from frame to frame. An approach for
coherent rendering of static areas in successive frames was proposed by Hertzmann [8], by detecting areas of change from frame to frame and painting over them, i.e., keeping the brush strokes of the static areas. Intra-object temporal coherence is achieved by warping the brush strokes' control points using the output of an optical flow method. Wang et al. proposed in [9] a method for creating cartoon animations from video sequences by using a mean shift segmentation algorithm for end-to-end video shot segmentation. After the segmentation is performed, the user specifies constraint points on keyframes of the video shot through a graphical interface. These points are then used for interpolating the region boundaries between keyframes. The animated mosaic technique described by Smith et al. in [10] proposes a method for moving groups of 2D primitives in a coordinated way, thus allowing a user to create mosaic animations with temporal coherence. The tiles are geometric shapes that are fitted inside 2D containers (polygons) with the help of the user in the first frame. Then, the system automatically advects the containers' tiles to the other frames, in a way that enforces temporal coherence; a step that can be followed by manual refinement. However, since the method of [10] takes as input an animated scene represented as a collection of polygons, it cannot be directly applied to a real video. An extension of this method was proposed in [11], where a Fast Fourier Transform based method was used to perform effective tile placements, allowing the packing of 3D volumes using temporally repeating animated shapes. Our method for producing intra-object temporally coherent NPR animations is divided into three parts: the segmentation of the input video, followed by the calculation of the Constrained Optical Flow map, and the rendering of objects using some NPR style (in this case, animated mosaics), which we now describe. In fact, our method could be used to generate the polygon collection representation needed by the method of [10]. The interactions between these parts can be seen in Figure 1.
3.1 Video Segmentation
As mentioned above, the segmentation images are used to delimit the extent of the objects, in other words, the search area for the optical flow algorithm. The system described by Collomosse et al. in [12] uses 2D algorithms for segmenting objects in the frames independently, followed by the application of a region association algorithm with an association heuristic. This results in a set of temporally convex objects that can then be rendered. In this paper, the video shots were treated as a 3D volume and interactively segmented using a variant [13] of the fast fuzzy segmentation algorithm introduced by Carvalho et al. in [14], which was extended for segmenting color 3D volumes. The algorithm works by computing, for every voxel of the 3D volume I(x, y, z) (considering the frames as z slices), a grade of membership value, between 0 and 1, to a number of objects in the scene, i.e., a segmentation map S(x, y, z). The user interaction of the segmentation algorithm is the selection of seed voxels for the objects to be segmented. This interaction allows the user to solve
Fig. 1. Diagram showing the interactions between the parts of our method for generating intra-object temporally coherent NPR animations
problems pointed out by Collomosse in [12] as drawbacks of end-to-end 3D segmentation of the video sequence, such as the segmentation of small fast-moving objects, gradual shape changes, and texture segmentation, since the user can put seeds throughout the video sequence to capture such object changes. The fuzzy nature of the segmentation algorithm allows us to render a single object using different styles, according to the grades of membership; e.g., small features inside an object may be detected by their low grade of membership to the surrounding object and be rendered using the original input values. Here, the objects were segmented based on their color information, but these end-to-end segmentations can be made more robust by using not only intensity and color information, but also motion cues, as in the algorithms presented by Galun et al. [15] and Khan and Shah [16], allowing the algorithm to differentiate between foreground and background objects of similar color as one occludes the other.
3.2 Constrained Optical Flow
In video stylization, some authors have used optical flow techniques for enforcing temporal coherence, such as the work of Litwinowicz [7] or the work of Hertzmann [8]. However, the local characteristic of optical flow techniques and their sensitivity to noisy images somewhat limit their applicability. To overcome
Fig. 2. Application of the optical flow algorithm to two subsequent frames of the Pooh sequence, on the whole image (left) and to the segmented object only (right). Looking at the original sequence, one can see that the Constrained Optical Flow yields better results, especially close to the borders of the Pooh.
those problems, segmentation algorithms have been applied to video shot segmentation to produce end-to-end segmentations that are later used to enforce temporal coherence, as done by Collomosse et al. [12] and Wang et al. [9]. Wang et al. [9] proposed a method where the user selects keyframe points to guide the segmentation, with a typical keyframe interval of 10 to 15 frames, and no intra-object coherence is needed, since the NPR styles used are cartoon styles. If sub-regions within an object are needed, the user has to add them using keyframe point selection. In the approach proposed by Collomosse et al. [12], intra-object temporal coherence is achieved by computing a homography, with the strong assumption that the object in question is approximately flat. This may cause severe intra-object distortion in areas with high curvature values. In this work, we advocate the usage of an optical flow algorithm for enforcing temporal coherence in NPR video sequences, but with the search area for the pixel matching restricted by the object boundaries obtained during the segmentation phase. Thus, the optical flow information can be used to enforce intra-object temporal coherence on these sequences. The use of high-level knowledge, in the form of a segmented image, provides important information regarding the relationship of different objects through time, but can also encode information about the type of animation sought by the user. Figure 2 shows two optical flow maps of a frame of the Pooh video sequence. In order to detect parts of the Pooh object that are moving in adjacent frames, a high value has to be used for the smoothness criterion of Proesmans' algorithm, propagating flow vectors to the background area, even though it is not moving. To use such information would cause the background tiles to move unnecessarily.
Furthermore, it can be observed from the input sequence that the Constrained Optical Flow map is much more accurate than the global optical flow map. The optical flow algorithm chosen for computing the intra-object optical flow was the one published by Proesmans et al. [17], because it produces a very dense optical flow map (with one motion estimate per pixel). An evaluation performed by McCane et al. [18] with three complex synthetic scenes and one real scene showed that the algorithm of Proesmans et al. [17] was the only one of the evaluated algorithms to produce accurate and consistent flow vectors for every pixel of the image. The algorithm uses a system of 6 non-linear diffusion equations that computes a disparity map and also a depth discontinuity map, containing information about occluded parts. This depth discontinuity map may be useful in maintaining the temporal coherence in parts of objects that are occluded for short periods. The Constrained Optical Flow can be defined as follows: given, for every voxel of the 3D image I(x, y, z) (considering the frames as z slices), a grade of membership value, between 0 and 1, to a number of objects in the scene, in the form of a segmentation map S(x, y, z), we have that Sk(x, y, z) = 1 if the pixel (x, y) of the z slice belongs to the kth object, and Sk(x, y, z) = 0 otherwise. Based on the membership information of the segmentation, we define the image Ik as

    Ik(x, y, z) = I(x, y, z), if Sk(x, y, z) = 1;  T, otherwise,    (1)

where T is a value outside the range of the images. This ensures that the optical flow is computed only inside a particular object. Thus, the Constrained Optical Flow calculated from two successive frames is given by the union of the non-null flow vectors of the Constrained Optical Flow calculated for the individual objects. It is important to note that we do not have to compute the Constrained Optical Flow for all objects, since we can choose not to render an object using an NPR technique, or to render it using a technique that needs only temporal coherence between the borders of objects.
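A minimal Python sketch of this per-object constrained flow is given below, assuming grayscale uint8 frames and integer label images as segmentation maps. OpenCV's Farnebäck algorithm is used here only as a stand-in for the Proesmans et al. method, and the fill value playing the role of T is an assumption; this is an illustration of Equation 1, not the authors' implementation.

```python
import numpy as np
import cv2

def constrained_flow(frame_a, frame_b, seg_a, seg_b, objects, fill=0):
    """Constrained Optical Flow between two grayscale frames (Eq. 1).

    seg_a, seg_b: integer label images (one label per segmented object).
    objects: labels for which the constrained flow is actually needed.
    """
    flow = np.zeros(frame_a.shape + (2,), np.float32)
    for k in objects:
        # Build I_k for both frames: original values inside object k, `fill` outside.
        a = np.where(seg_a == k, frame_a, fill).astype(np.uint8)
        b = np.where(seg_b == k, frame_b, fill).astype(np.uint8)
        # Stand-in dense flow (Farnebäck); Proesmans et al. is used in the paper.
        fk = cv2.calcOpticalFlowFarneback(a, b, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        # Union of the non-null vectors: keep only the vectors inside object k.
        mask = seg_a == k
        flow[mask] = fk[mask]
    return flow
```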
3.3 Rendering
The rendering phase is divided into the definition of the initial distribution, followed by the application of Lloyd's algorithm and the final rendering of the tiles, which we now describe. Centroidal Voronoi diagrams tend to fill the space uniformly, creating regions that are approximately regular polygons. In our work, as was done by Hausner [1] and Faustino and Figueiredo [3], we transform a Voronoi diagram obtained from an initial site distribution into a CVD using Lloyd's algorithm [19]. The initial distribution greatly influences the convergence of Lloyd's algorithm, and starting from an initial guess that is approximately centroidal usually requires fewer site movements and iterations to reach convergence. The initial site distribution can be used to emphasize image characteristics, for example, by using regions of different sizes, especially close to edges, as was done by Faustino and
Fig. 3. Distance Transform Matrix (left) and the initial point distribution for the initial frame of the Pooh video input sequence (right). (For visualization purposes, the histogram of the DTM image has been equalized and the gray levels have been inverted.)
Figueiredo [3]. On the other hand, successive iterations of Lloyd's algorithm will tend towards a uniform region distribution, a result that goes against the desired emphasis of some image characteristics. If a close to centroidal distribution is used, Lloyd's algorithm can be applied without substantially affecting the initial non-uniform point distribution. In our method, we use point chains formed from Distance Transform Matrices (DTM), as done by Di Blasi and Gallo [2], to distribute the tiles. Thus, we can render mosaics using CVDs, as done by Hausner [1] and Faustino and Figueiredo [3], as well as quadrilateral tiles, as done by Di Blasi and Gallo [2]. A DTM is calculated by evaluating, at each pixel, its distance from an object border, as can be seen on the left side of Figure 3, where distance zero is white and the farthest pixels are black. Based on the DTM M, the gradient matrix G can be computed by

    G(x, y) = arctan[ (M(x, y+1) − M(x, y−1)) / (M(x+1, y) − M(x−1, y)) ],    (2)
which will be used to determine the tile orientations in the mosaic. Then the DTM M is used to determine the level line matrix L, computed by

    L(x, y) = 1, if mod(M(x, y), 2·tSize) = 0;  2, if mod(M(x, y), 2·tSize) = tSize;  0, otherwise,    (3)

where tSize is the tile size. This matrix then determines the lines on which the centers of the tiles can be located (pixels (x, y) such that L(x, y) = 2), as can be seen on the right side of Figure 3. However, here we use their technique to compute an initial site distribution that is approximately centroidal. Figure 3 shows the DTM and the initial point distribution of an input video sequence.
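The following Python sketch illustrates how the DTM, the gradient matrix of Equation 2 and the level line matrix of Equation 3 could be computed from a binary guideline (edge) image. SciPy's Euclidean distance transform and the use of arctan2 (to avoid an explicit division) are implementation choices of this example, not necessarily those of the original method.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def placement_maps(edges, t_size):
    """edges: boolean image, True on the guideline (edge) pixels.

    Returns the DTM M, the orientation matrix G (Eq. 2) and the
    level-line matrix L (Eq. 3); tile centers may lie where L == 2.
    """
    M = distance_transform_edt(~edges)        # distance of each pixel to the nearest edge
    gy, gx = np.gradient(M)                   # central differences, as in Eq. (2)
    G = np.arctan2(gy, gx)                    # tile orientation at each pixel
    r = np.mod(np.rint(M).astype(int), 2 * t_size)
    L = np.where(r == 0, 1, np.where(r == t_size, 2, 0))
    return M, G, L
```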
However, the method of Di Blasi and Gallo [2] handles only tiles of the same size. This is not the case with our method, since we segment the video sequence into disjoint objects that can have different characteristics associated with them, such as the tile size, emphasizing regions close to borders, as was done by Faustino and Figueiredo [3]. We could even render different objects using different NPR styles, even though this is not done here.
3.4 Adding and Removing Tiles
As objects move closer to or further away from the camera, or when new parts of the scene appear in the video, we have to insert or remove tiles in the animation to maintain a consistent appearance of the tiles, i.e., a homogeneously dense animated mosaic. We now describe a technique we developed to maintain this homogeneous tile packing in animated mosaics.
Fig. 4. DTM with the guidelines for tile placement (for the background object) of a frame from the Frog sequence (left) and the areas not covered by tiles of the previous frame moved using the Constrained Optical Flow information (right).
Tile Removal. Tile removal must be used when areas visualized in the previous frame are occluded by the movement of some object in the video or when an object moves further away from the camera. The latter case of tile removal happens because the technique of Di Blasi and Gallo [2] uses tiles of the same size, so the decrease in area of the object in question means that fewer tiles will be used to render it. In both cases, we use a threshold that specifies the maximal superposition that two tiles can have. The superposition of two tiles appears as if the one in the back has been slightly cut to fit in the area left by the other tiles. Remember that we do have information about the object delineations from the segmentation result. Thus, we render the objects and compute, based on the Constrained Optical Flow information, which tiles moved to different segmented objects. These tiles, together with the tiles that moved outside the image, are removed and not rendered. As mentioned above, their removal is decided by comparing their intersection area with other segmented object areas, or with areas outside the frame, against the specified threshold.
Tile Addition. The addition of tiles may become necessary when the area not covered by the tiles grows. This happens when areas not seen in the previous frame appear in the current frame or when no tile is mapped to some area due to an object becoming bigger. In this last case, tiles from an object that is moving closer to the camera are moved away from each other, using the Constrained Optical Flow information, and, at some point, the area between them is big enough for a new tile to be rendered. The addition of a tile is done in the following way: working object by object, first we compute the DTM of the object. Then, as done before, we compute the lines on which the centers of the tiles can be located. Finally, using a map with 0 where there is a tile and 1 where there is no tile, we insert a new tile if its intersection with other tiles is smaller than a specified threshold. The maps used in this process can be seen in Figure 4. By adjusting the threshold, we can achieve more or less tightly packed tiles in areas where the video is changing.
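A simplified Python sketch of this tile addition rule is shown below. The coverage map, the candidate tile footprint and the 50% threshold used in Section 4 are made explicit as parameters; border handling is simplified and the names are hypothetical, so this only illustrates the decision rule, not the actual implementation.

```python
import numpy as np

def try_add_tiles(coverage, L, tile_mask, t_add=0.5):
    """coverage: boolean map, True where some tile is already rendered.
    L: level-line matrix (candidate tile centers are the pixels with L == 2).
    tile_mask: boolean footprint of a candidate tile, centered in its own array.
    A tile is added when the fraction of its footprint overlapping existing
    tiles is smaller than t_add; the coverage map is updated accordingly.
    """
    h, w = tile_mask.shape
    added = []
    for y, x in zip(*np.where(L == 2)):
        ys = slice(y - h // 2, y - h // 2 + h)
        xs = slice(x - w // 2, x - w // 2 + w)
        window = coverage[ys, xs]
        if window.shape != tile_mask.shape:   # candidate falls off the frame: skip (simplified)
            continue
        overlap = np.logical_and(window, tile_mask).sum() / tile_mask.sum()
        if overlap < t_add:
            coverage[ys, xs] |= tile_mask
            added.append((y, x))
    return added
```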
Fig. 5. The 1st and 25th frames of the Frog video, on the top row, were rendered using our techniques for enforcing temporal coherence and for adding and removing tiles. The bottom row shows the 1st and 15th frames of the Mug video, rendered using the same techniques.
4 Experiments
The first and second experiments shown here demonstrate the use of our technique for adding and removing tiles. The top row of Figure 5 shows the first and 22nd frames of the Frog video, where the thresholds set for removing existing
Fig. 6. Three frames of the Pooh input video sequence (left) and the corresponding frames of the mosaic animation (right), where only the Pooh object has been rendered in the mosaic NPR style
tiles and adding new tiles are both 50%. Note that we chose to render the frog with smaller tiles than the background. We have this flexibility because we segment the video into semantic regions, or temporal objects. As a matter of fact, we can even render different objects using different NPR styles. The bottom row of Figure 5 shows the first and 15th frames of the Mug video, which was
rendered using the same thresholds for tile additions and removals as the Frog video. Note how the tile placements of the background object change very little when comparing both frames. In the third experiment, shown in Figure 6, we rendered the Pooh object of the 70-frame-long Pooh input video sequence as an animated mosaic while rendering the other areas with their original values. This is only possible due to the flexibility allowed by segmenting the video sequence end-to-end and then treating each object as a layer of a video frame. The choice of tile size is very important in determining the overall look of the output video, since tiles that are too big will remove important characteristics from the animation (the same is true for still image mosaics). After selecting the initial site distribution using the DTM and gradient matrix, Lloyd's algorithm was run for 10 iterations and the approximated CVD was tracked using the result of the Constrained Optical Flow method proposed here. It is very important for the success of our method that the segmentation of the objects be of good quality; otherwise, the flexibility of our method turns against us, rendering erroneously the parts of the objects that were mistakenly segmented. Of course, noisy videos will affect the quality of the Constrained Optical Flow result, even to the point of making it useless. To better handle noisy input videos, a multi-scale approach, such as the one proposed by Galun et al. [15], may be useful. However, the segmentation method described here has been successfully used to segment very diverse videos, some of which contained several overlapping objects and moving shadows [20].
5 Conclusion
We presented a method for generating mosaic animations while maintaining intra-object temporal coherence. Our method is based on the use of a segmentation algorithm for segmenting a video shot, followed by the application of an optical flow algorithm that produces a dense flow map, allowing it to be used to move the tiles between successive frames with reduced coherence problems. The segmentation of the video shot into objects, which are treated as different layers in the rendering process, also provides many options in the rendering phase, such as the use of tiles of different sizes to emphasize characteristics of the input movie, or the use of completely different NPR styles for different objects. We also presented a method for adding and removing tiles in mosaic animations, and showed some frames of mosaic movies generated using our techniques. The user can influence the addition/removal of tiles by adjusting the thresholds for both tasks. Future work includes the use of weighted Voronoi diagrams, allowing new tiles to grow gradually and current tiles to shrink down to a minimum size that would trigger their removal, and the addition of mathematical morphology tools to the segmentation program, thus allowing the user to manually correct small segmentation errors in a post-processing step.
References

1. Hausner, A.: Simulating decorative mosaics. In: Proc. of ACM SIGGRAPH, pp. 207–214. ACM Press, New York (2001)
2. Blasi, G.D., Gallo, G.: Artificial mosaics. The Vis. Comp. 21, 373–383 (2005)
3. Faustino, G., Figueiredo, L.: Simple adaptive mosaic effects. In: Proc. of SIBGRAPI, pp. 315–322 (2005)
4. Haeberli, P.: Paint by numbers: Abstract image representations. In: Proc. of ACM SIGGRAPH, pp. 207–214. ACM Press, New York (1990)
5. Dobashi, Y., Haga, T., Johan, H., Nishita, T.: A method for creating mosaic images using Voronoi diagrams. In: Proc. of Eurographics, pp. 341–348 (2002)
6. Meier, B.: Painterly rendering for animation. In: Proc. of ACM SIGGRAPH, pp. 477–484. ACM Press, New York (1996)
7. Litwinowicz, P.: Processing images and video for an impressionist effect. In: Proc. of ACM SIGGRAPH, pp. 407–414. ACM Press, New York (1997)
8. Hertzmann, A., Perlin, K.: Painterly rendering for video and interaction. In: Proc. of NPAR, pp. 7–12 (2000)
9. Wang, J., Xu, Y., Shum, H.-Y., Cohen, M.: Video tooning. ACM Trans. on Graph. 23, 574–583 (2004)
10. Smith, K., Liu, Y., Klein, A.: Animosaics. In: Proc. of 2005 ACM SIGGRAPH/Eurograph. SCA, pp. 201–208. ACM Press, New York (2005)
11. Dalal, K., Klein, A.W., Liu, Y., Smith, K.: A spectral approach to NPR packing. In: Proc. of NPAR, pp. 71–78 (2006)
12. Collomosse, J., Rowntree, D., Hall, P.: Stroke surfaces: Temporally coherent artistic animations from video. IEEE Trans. on Visualiz. and Comp. Graph. 11, 540–549 (2005)
13. Carvalho, B., Oliveira, L., Silva, G.: Fuzzy segmentation of color video shots. In: Kuba, A., Nyúl, L.G., Palágyi, K. (eds.) DGCI 2006. LNCS, vol. 4245, pp. 402–407. Springer, Heidelberg (2006)
14. Carvalho, B.M., Herman, G.T., Kong, T.Y.: Simultaneous fuzzy segmentation of multiple objects. Disc. Appl. Math. 151, 55–77 (2005)
15. Galun, M., Apartsin, A., Basri, R.: Multiscale segmentation by combining motion and intensity cues. In: Proc. of IEEE CVPR, pp. 256–263. IEEE Computer Society Press, Los Alamitos (2005)
16. Khan, S., Shah, M.: Object based segmentation of video using color, motion and spatial information. In: Proc. of IEEE CVPR, vol. 2, pp. 746–751. IEEE Computer Society Press, Los Alamitos (2001)
17. Proesmans, M., Gool, L.V., Pauwels, E., Oosterlinck, A.: Determination of optical flow and its discontinuities using non-linear diffusion. In: Eklundh, J.-O. (ed.) ECCV 1994. LNCS, vol. 2, pp. 295–304. Springer, Heidelberg (1994)
18. McCane, B., Novins, K., Crannitch, D., Galvin, B.: On benchmarking optical flow. Comp. Vis. and Image Underst. 84, 126–143 (2001)
19. Lloyd, S.: Least square quantization in PCM. IEEE Trans. on Inform. Theory 28, 129–137 (1982)
20. Oliveira, L.: Segmentação fuzzy de imagens e vídeos. Master's thesis, Universidade Federal do Rio Grande do Norte, Natal, Brazil (2007)
Grayscale Template-Matching Invariant to Rotation, Scale, Translation, Brightness and Contrast

Hae Yong Kim and Sidnei Alves de Araújo

Escola Politécnica, Universidade de São Paulo, Brazil
{hae,saraujo}@lps.usp.br
Abstract. In this paper, we consider the grayscale template-matching problem, invariant to rotation, scale, translation, brightness and contrast, without previous operations that discard grayscale information, like detection of edges, detection of interest points or segmentation/binarization of the images. The obvious “brute force” solution performs a series of conventional template matchings between the image to analyze and the template query shape rotated by every angle, translated to every position and scaled by every factor (within some specified range of scale factors). Clearly, this takes too long and thus is not practical. We propose a technique that substantially accelerates this searching, while obtaining the same result as the original brute force algorithm. In some experiments, our algorithm was 400 times faster than the brute force algorithm. Our algorithm consists of three cascaded filters. These filters successively exclude pixels that have no chance of matching the template from further processing. Keywords: Template matching, RST-invariance, segmentation-free shape recognition.
1 Introduction

In this paper, we consider the problem of finding a query template grayscale image Q in another grayscale image to analyze A, invariant to rotation, scale, translation, brightness and contrast (RSTBC), without previous "simplification" of A and Q that discards grayscale information, like detection of edges, detection of interest points and segmentation/binarization. These image-simplifying operations throw away the rich grayscale information, are noise-sensitive and prone to errors, decreasing the robustness of the matching. Moreover, these simplifications cannot be used to find smooth grayscale templates. The "brute force" solution to this problem performs a series of conventional (BC-invariant) template matchings between the image to analyze A and the query template Q. Image Q must be rotated by every angle, translated to every position and scaled by every factor (within some specified range of scale factors), and a conventional BC-invariant template matching is executed for each instance of the transformed Q. Possibly, the brute force algorithm yields the most precise solution to this problem. However, it takes too long and thus is not practical. Our technique, named Ciratefi, substantially accelerates this search, while obtaining exactly the same result as the
original brute force algorithm (disregarding incidental numerical imprecision). In some experiments, our algorithm was 400 times faster than the brute force algorithm and obtained exactly the same results. Fast grayscale RSTBC-invariant template matching is a useful basic operation for many image processing and computer vision tasks, such as visual control [1], image registration [2], and computation of visual motion [3]. Consequently, it has been the object of an intense and thorough study. However, surprisingly, we could not find any technique similar to Ciratefi in the literature. Some approaches that achieve RST-invariance using detection of interest points and edges include: generalized Hough transform [4]; geometric hashing [5, 6]; graph matching [7]; and curvature scale space [8], adopted by MPEG-7 as standard shape descriptor. These operations and Ciratefi seem to occupy different hierarchies in image processing and computer vision. Indeed, low-level Ciratefi can be used to detect interest points, to be used later by high-level techniques such as geometric hashing and graph matching. Techniques that achieve RST-invariance using previous segmentation/binarization are described, for example, in [9, 10]. They are in fact algorithms designed to search for binary templates in binary images. So, given a grayscale image to analyze A, they first convert it into a binary image using some thresholding algorithm. Then, they separate each connected component from the background and compute some RST-invariant features for each component. These features are compared with the template's features. The most commonly used rotation-invariant features include Hu's seven moments [11] and Zernike moments [12]. In recent years, many other rotation-invariant features have been developed [13, 14, 15, 16]. All these features are not truly RST-invariant, but only rotation-invariant. They become scale-invariant by isolating each component and normalizing its area to one. Unfortunately, in many practical grayscale cases, the template Q and the analyzed image A cannot be converted into binary images and thus the above techniques cannot be applied. On the contrary, the Ciratefi technique does not need to isolate individual shapes and can be used directly in grayscale (and also binary) template matchings. Ullah and Kaneko [17] and Tsai and Tsai [18] present two different segmentation-free RTBC-invariant template-matching techniques. However, their techniques are not scale-invariant. Hence, the key problem seems to be: "How to obtain the scale-invariance without isolating the shapes or components?" Or, in other words: "How can we estimate the scale of a shape without determining first its boundaries?" Our Ciratefi algorithm consists of three cascaded filters. Each filter successively excludes from further processing the pixels that have no chance of matching the template, while keeping the "candidate pixels" that can match the template for further refined classifications. The first filter, called Cifi (circular sampling filter), uses the projections of images A and Q on circles to divide the pixels of A into two categories: those that have no chance of matching the template Q (to be discarded) and those that have some chance (called first grade candidate pixels). This filter is responsible for determining the scale without isolating the shapes. It determines a "probable scale factor" for each first grade candidate pixel.
The second filter, called Rafi (radial sampling filter), uses the projections of images A and Q on radial lines and the “probable scale factors” computed by Cifi to upgrade some of the first grade candidate pixels to second grade. It also assigns a “probable rotation angle” to each second grade candidate
pixel. The pixels that are not upgraded are discarded. The third filter, called Tefi (template matching filter), is a conventional BC-invariant template matching. The second grade candidate pixels are usually few in number, and Cifi and Rafi have already computed their probable scales and rotation angles. Thus, the template matching can quickly categorize all the second grade candidate pixels into true and false matchings. There are some other papers that use circular or radial projections, like [19, 20]. However, their objectives (fingerprint and Chinese character recognition) are completely different from ours, and they intend to obtain neither scale-invariance nor segmentation-free recognition. Ciratefi is not robust to occlusions (neither is the brute force algorithm). However, in the presence of occlusions, the template can apparently be divided into smaller sub-templates and the results of the sub-matchings can be combined to detect the original template. Finally, Ciratefi (as well as the brute force algorithm) can easily be parallelized.
2 The Brute Force Algorithm

In this section, we describe the "brute force" algorithm. This algorithm makes use of BC-invariant template matching.

2.1 BC-Invariant Template Matching

Template matching uses some difference measuring function to evaluate how well the template Q matches a given position of image A. Usually, the sum of absolute differences, the sum of squared differences, cross-correlation and the correlation coefficient are used as difference measuring functions. We have adopted the correlation coefficient, because it always ranges from −1 to +1 and is BC-invariant. However, let us make the following reasoning to make the brightness/contrast-invariance explicit. Let x be the columnwise vector obtained by copying the grayscales of Q's pixels and let y be the vector obtained by copying the grayscales of the pixels of A's region to be correlated with Q. Then, the brightness/contrast correction can be written as a least squares problem:

    y = βx + γ1 + ε,    (1)

where 1 is a vector of 1's, ε is the vector of residual errors, β is the contrast correction factor and γ is the brightness correction factor. The problem consists in finding β and γ that minimize ‖ε‖². This problem has a computationally fast solution. Let x̃ = x − x̄ be the mean-corrected vector, where x̄ is the mean of x. Similar definitions are applicable to y. Then:

    β = (x̃·ỹ)/‖x̃‖²,  γ = ȳ − βx̄,  and  ε = ỹ − βx̃.    (2)
The correlation coefficient r_xy can be computed as:

    r_xy = (x̃·ỹ)/(‖x̃‖ ‖ỹ‖) = (β ‖x̃‖²)/(‖x̃‖ ‖ỹ‖).    (3)
We assume that the correlation is zero if a large brightness or contrast correction is required, because in this case the template and the image are likely quite different. The correlation is assumed to be zero if |β| ≤ t_β or 1/t_β ≤ |β|, where 0 < t_β ≤ 1 is a
chosen contrast correction threshold. For example, t_β = 0.5 means that regions of A with contrast less than half or more than twice Q's contrast will be considered as not correlated with Q. This also avoids divisions by zero in regions of A with almost constant grayscale (where the values of ỹ are almost zero). The correlation is also assumed to be zero if |γ| > t_γ, where 0 < t_γ ≤ 1 is a chosen brightness correction threshold (we assume that the grayscales of the images are real numbers within the interval [0, 1]). We define Corr as the correlation that takes into account the contrast and brightness corrections:

    Corr(x, y) = 0, if |β| ≤ t_β, 1/t_β ≤ |β| or |γ| > t_γ;  r_xy, otherwise.    (4)
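The following Python sketch computes the contrast/brightness-aware correlation of Equations (2)–(4); grayscale patches are assumed to be flattened into vectors with values in [0, 1], and the default thresholds are the ones mentioned later in the text. This is an illustration, not the authors' code.

```python
import numpy as np

def corr(x, y, t_beta=0.1, t_gamma=1.0):
    """Contrast/brightness-aware correlation Corr of Eqs. (2)-(4).

    x: template patch as a 1-D vector; y: image patch as a 1-D vector.
    Returns 0 when the required contrast (beta) or brightness (gamma)
    correction is too large, otherwise the correlation coefficient r_xy.
    """
    xt, yt = x - x.mean(), y - y.mean()
    xx, yy = np.dot(xt, xt), np.dot(yt, yt)
    if xx == 0 or yy == 0:                      # constant patch: no meaningful correlation
        return 0.0
    beta = np.dot(xt, yt) / xx                  # contrast correction factor (Eq. 2)
    gamma = y.mean() - beta * x.mean()          # brightness correction factor (Eq. 2)
    if abs(beta) <= t_beta or abs(beta) >= 1.0 / t_beta or abs(gamma) > t_gamma:
        return 0.0                              # rejection rule of Eq. (4)
    return float(np.dot(xt, yt) / np.sqrt(xx * yy))   # r_xy (Eq. 3)
```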
Depending on the application, we can use either the absolute value |Corr| (to allow matching negative instances of the template) or the value of Corr with its sign (negative instances will not match the template).

2.2 RSTBC-Invariant Template Matching
To obtain an RSTBC-invariant template matching, as said above, the query shape Q must be rotated by every angle and scaled by every factor. In practice, it is not possible to rotate and scale Q by every angle and scale, but only by some discrete set of angles and scales. Figure 1 depicts some of the "frog" templates rotated at m=36 different angles (α0=0, α1=10, ..., α35=350) and scaled by n=6 different factors (s0=0.6, s1=0.7, ..., s5=1.1). To avoid a small misalignment causing a large mismatching, a low-pass filter (for example, the Gaussian filter) smoothes both images A and Q. This low-pass filtering lessens the errors introduced by using discrete scales and angles. Then, each pixel p of A is tested for matching against all the transformed templates (6×36=216 templates, in our case). If the largest absolute value of the contrast/brightness-aware correlation Corr at pixel p is above some threshold tf, the template is considered to be found at p. Figure 2 depicts the detection of the frog shape, using tf=0.9, t_β = 0.1 and t_γ = 1. Besides detecting the shape, the brute force algorithm also returns the precise scale factor and rotation angle for each matching. The only problem is that this process takes 9173 s, or two and a half hours, using a 3 GHz Pentium 4 (image A has 465×338 pixels and image Q has 52×51 pixels). Our Ciratefi algorithm does the same task in only 22 s.
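A sketch of this exhaustive search is given below in Python. OpenCV's normalized correlation coefficient (cv2.matchTemplate with TM_CCOEFF_NORMED) stands in for the BC-invariant matching, and the t_β/t_γ rejection and the low-pass pre-filtering are omitted, so this only illustrates the overall loop structure, not the exact procedure of the paper.

```python
import numpy as np
import cv2

def brute_force_match(A, Q, scales, angles, t_f=0.9):
    """Exhaustive RSTBC-invariant matching: try every scale and angle of Q.

    Returns a list of (x, y, scale, angle) detections whose normalized
    correlation exceeds t_f. Corners clipped by the rotation are ignored.
    """
    detections = []
    for s in scales:
        Qs = cv2.resize(Q, None, fx=s, fy=s, interpolation=cv2.INTER_LINEAR)
        for ang in angles:
            center = (Qs.shape[1] / 2.0, Qs.shape[0] / 2.0)
            M = cv2.getRotationMatrix2D(center, ang, 1.0)
            Qr = cv2.warpAffine(Qs, M, (Qs.shape[1], Qs.shape[0]))
            R = cv2.matchTemplate(A, Qr, cv2.TM_CCOEFF_NORMED)
            for y, x in zip(*np.where(np.abs(R) >= t_f)):
                detections.append((int(x), int(y), s, ang))
    return detections
```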
Fig. 1. Some of the rotated and scaled templates
Fig. 2. Frog shapes detected by the brute force algorithm. Each matching is marked with a red “x”.
3 Circular Sampling Filter

The circular sampling filter (Cifi) uses the projections of the images A and Q on a set of rings (figure 3a) to detect the first grade candidate pixels and their probable scales. As we show experimentally in subsection 7.2, the correct choice of the number of circles l is not essential to our algorithm, because Rafi and Tefi will further filter the first grade candidate pixels. Figure 3b depicts the output of the Cifi filtering, where the first grade candidate pixels are depicted in magenta. Given an image B, let us define the circular sampling Cis_B(x, y, r) as the average grayscale of the pixels of B situated at distance r from the pixel (x, y):

    Cis_B(x, y, r) = ∫_0^{2π} B(x + r cos θ, y + r sin θ) dθ.    (5)
In practice, a sum must replace the integral, and a computer graphics algorithm for drawing circles, such as [21], can be used to efficiently find all the pixels that belong to a specific circle. Given the template image Q and the set of n scales (in our example, s0=0.6, s1=0.7, ..., s5=1.1), the image Q is resized to each scale si, obtaining the resized templates Q0, Q1, ..., Qn−1. Then, each resized template Qi is circularly sampled at a set of l predefined circle radii (in our example, l=13, and r0=0, r1=2, ..., r12=24 pixels), yielding a 2-D matrix of multi-scale rotation-invariant features CQ with n rows (scales) and l columns (radii):

    CQ[i, k] = Cis_{Qi}(x0, y0, rk),  0 ≤ i < n and 0 ≤ k < l,    (6)

where (x0, y0) is the central pixel of Q.
Fig. 3. Circular sampling filter Cifi. (a) Circles where the image is sampled. (b) The output of Cifi with the first grade candidate pixels in magenta.
Given the image to analyze A, we build a 3-D image C_A[x, y, k]:

    C_A[x, y, k] = Cis_A(x, y, rk),  0 ≤ k < l and (x, y) ∈ domain(A).    (7)
Cifi uses the matrices CQ and C_A and the contrast and brightness thresholds t_β and t_γ to detect the circular sampling correlation CisCorr at the best matching scale for each pixel (x, y):

    CisCorr_{A,Q}(x, y) = MAX_{i=0}^{n−1} [ Corr(CQ[i], C_A[x, y]) ].    (8)
A pixel (x, y) is classified as a first grade candidate pixel if CisCorr_{A,Q}(x, y) ≥ t1 for some threshold t1 (in the example, t1=0.95). As we show in subsection 7.1, the adequate choice of t1 is not critical, provided that it is low enough to not discard the real matching pixels. The probable scale CisPS of a first grade candidate pixel (x, y) is the best matching scale:

    CisPS_{A,Q}(x, y) = ARGMAX_{i=0}^{n−1} [ Corr(CQ[i], C_A[x, y]) ].    (9)
In our example, the computation of the 3-D image C_A[x, y, k] took 2.5 s and the computation of CisCorr_{A,Q}(x, y) for all pixels of A took 4.5 s. The remaining Cifi operations are almost instantaneous.
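The Cifi stage could be prototyped in Python as below, reusing the corr() sketch given earlier. The circle sampling uses a fixed number of angular samples instead of a circle-drawing algorithm, and the nested loops are purely illustrative (a real implementation would be vectorized); names and parameters are assumptions.

```python
import numpy as np

def circular_sample(B, x, y, r, n_samples=64):
    """Average grayscale of B on the circle of radius r centered at (x, y) (Eq. 5)."""
    if r == 0:
        return float(B[y, x])
    th = np.linspace(0.0, 2.0 * np.pi, n_samples, endpoint=False)
    xs = np.clip(np.rint(x + r * np.cos(th)).astype(int), 0, B.shape[1] - 1)
    ys = np.clip(np.rint(y + r * np.sin(th)).astype(int), 0, B.shape[0] - 1)
    return float(B[ys, xs].mean())

def cifi(A, CQ, radii, t1=0.95):
    """First grade candidates and probable scales (Eqs. 7-9).

    CQ: n x l matrix of circular features of the n resized templates (Eq. 6).
    Returns a boolean candidate map and a probable-scale index map.
    """
    h, w = A.shape
    cand = np.zeros((h, w), bool)
    scale = np.zeros((h, w), int)
    for y in range(h):
        for x in range(w):
            ca = np.array([circular_sample(A, x, y, r) for r in radii])  # C_A[x, y] (Eq. 7)
            scores = [corr(cq, ca) for cq in CQ]                          # Corr of Eq. (4)
            i = int(np.argmax(scores))
            cand[y, x] = scores[i] >= t1                                  # Eq. (8)
            scale[y, x] = i                                               # Eq. (9)
    return cand, scale
```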
4 Radial Sampling Filter

The second filter is called the radial sampling filter (Rafi) and uses the projections of images A and Q on a set of radial lines to upgrade some of the first grade candidate pixels to second grade. The pixels that are not upgraded are discarded. It also assigns a "probable rotation angle" to each second grade candidate pixel. Figure 4a marks in blue the radial lines and figure 4b marks with a red "x" each second grade candidate
pixel. The set of inclinations of the radial lines must be equal to the m chosen rotation angles (in our example, α0=0, α1=10, ..., α35=350). As we show in subsection 7.2, the choice of m is not critical, provided that it is not too small.
Fig. 4. Radial sampling filter Rafi. (a) The radial lines where the image is sampled. (b) The output of Rafi, where each second grade candidate pixel is marked with a red “x”.
Given an image B, let us define the radial sampling Ras^λ_B(x, y, α) as the average grayscale of the pixels of B located on the radial line with one vertex at (x, y), length λ and inclination α:

    Ras^λ_B(x, y, α) = ∫_0^λ B(x + t cos α, y + t sin α) dt.    (10)
In practice, the integral must be replaced by a sum and a line drawing algorithm (such as [22]) can be used to efficiently find all the pixels that belong to a line. Given the template Q and the set of m angle inclinations (α0, α1, ..., αm−1), Q is radially sampled using λ = r_{l−1} (the largest sampling circle radius), yielding a vector RQ with m features:
    RQ[j] = Ras^{r_{l−1}}_Q(x0, y0, αj),  0 ≤ j < m,    (11)
where (x0, y0) is the central pixel of Q. For each first grade candidate pixel (x, y), A is radially sampled at its probable scale i = CisPS_{A,Q}(x, y). The largest radius r_{l−1} resized to the probable scale si becomes λ = si r_{l−1}. Thus:

    R_A[x, y, j] = Ras^{si r_{l−1}}_A(x, y, αj),  0 ≤ j < m and (x, y) ∈ f_gr_cand(A).    (12)
At each first grade candidate pixel (x,y), Rafi uses the vectors RA[x,y], RQ and contrast and brightness thresholds tβ and tγ to detect the radial sampling correlation RasCorr at the best matching angle:
    RasCorr_{A,Q}(x, y) = MAX_{j=0}^{m−1} [ Corr(R_A[x, y], cshift_j(RQ)) ],  (x, y) ∈ f_gr_cand(A),    (13)
where "cshift_j" means circularly shifting the argument vector by j positions. A first grade pixel (x, y) is upgraded to second grade if

    RasCorr_{A,Q}(x, y) ≥ t2    (14)
for some threshold t2 (in the example, t2=0.9). As we show in subsection 7.2, the adequate choice of t2 is not critical, provided that it is low enough to not discard the real matching pixels. The probable rotation angle RasAng at a second grade candidate pixel (x, y) is the best matching angle:

    RasAng_{A,Q}(x, y) = ARGMAX_{j=0}^{m−1} [ Corr(R_A[x, y], cshift_j(RQ)) ].    (15)
In the example, the computation of RasCorr_{A,Q}(x, y) at all pixels (x, y) of A took 13 s. The remaining Rafi operations are almost instantaneous.
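A per-pixel Python sketch of the Rafi test, reusing the corr() function defined earlier, could look as follows; the sampling density along each radial line and the shift direction of cshift are assumptions of this illustration.

```python
import numpy as np

def radial_sample(B, x, y, length, angle, n_samples=32):
    """Average grayscale of B along a radial line starting at (x, y) (Eq. 10)."""
    t = np.linspace(0.0, length, n_samples)
    xs = np.clip(np.rint(x + t * np.cos(angle)).astype(int), 0, B.shape[1] - 1)
    ys = np.clip(np.rint(y + t * np.sin(angle)).astype(int), 0, B.shape[0] - 1)
    return float(B[ys, xs].mean())

def rafi_at(A, RQ, x, y, s_i, r_max, angles, t2=0.9):
    """Rafi test at one first grade candidate pixel (Eqs. 12-15).

    RQ: vector of m radial features of the template (Eq. 11).
    s_i: probable scale factor from Cifi; r_max: largest sampling radius.
    Returns (is_second_grade, probable_angle_index).
    """
    RA = np.array([radial_sample(A, x, y, s_i * r_max, a) for a in angles])  # Eq. (12)
    scores = [corr(RA, np.roll(RQ, j)) for j in range(len(RQ))]              # cshift_j(RQ)
    j = int(np.argmax(scores))
    return scores[j] >= t2, j                                                # Eqs. (13)-(15)
```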
5 Template Matching Filter

The third filter is called Tefi and is simply the BC-invariant template matching, applied only at the second grade candidate pixels, using the probable scale and angle determined by Cifi and Rafi, respectively. Figure 5 depicts its output (which is also the final output of the Ciratefi algorithm). Similarly to the RSTBC-invariant template matching, Tefi first resizes and rotates the template Q to all m angles and n scales. Let (x, y) be a second grade candidate pixel, with probable scale i = CisPS_{A,Q}(x, y) and probable angle j = RasAng_{A,Q}(x, y). Then, Tefi computes the contrast/brightness-aware correlation Corr between the template image Q at scale si and angle αj and the image A at pixel (x, y). If the absolute value of the correlation is above some threshold t3, the template is considered to be found at pixel (x, y).
Fig. 5. The final output of Ciratefi. Each matching pixel is marked with a red “x”.
Adopting the same threshold used in the brute-force algorithm (that is, t_3 = t_f), the output is usually equal to or very similar to the output of the brute-force algorithm. For even more robustness, it is possible to test the matchings at a set of scales around i (for example, i−1, i, i+1) and at a set of angles around j (for example, j−1, j, j+1, where the addition and subtraction must be computed modulo m). In our example, Tefi took 1 s to compute.
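The neighbourhood test described above could be sketched as follows; template_match is a hypothetical callback returning the BC-invariant correlation between A around (x, y) and the template resized to scale index s and rotated to angle index a.

def tefi_accepts(x, y, i, j, n, m, t3, template_match):
    # Test the probable scale/angle plus their neighbours: scales i-1, i, i+1
    # (clamped to the n valid indices) and angles j-1, j, j+1 taken modulo m.
    scales = [s for s in (i - 1, i, i + 1) if 0 <= s < n]
    angles = [(j + d) % m for d in (-1, 0, 1)]
    best = max(abs(template_match(x, y, s, a)) for s in scales for a in angles)
    return best >= t3   # the template is considered found at (x, y) if this holds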
6 Complexity Analysis

The precise computational complexity of Ciratefi depends on many unforeseeable factors, such as the number of first and second grade candidate pixels. However, we will make some assumptions and approximations to analyze its complexity. Let N be the number of pixels of the image to be analyzed, A, and M the number of pixels of the template image Q. To make our analysis, we will assume that the number of scales n, the number of angles m, and the number of sampling circles l are all O(√M). We will ignore all operations that do not depend on N, because usually N is much larger than M. The brute force algorithm makes n×m template matchings for each pixel of A. Considering that each template matching makes O(M) operations, this algorithm's complexity is O(NnmM), or approximately O(NM²). Ciratefi has four operations that depend on N:

• The generation of the 3-D image C_A[x, y, k] takes O(NM), considering that almost all pixels of the domain of Q must be scanned for each pixel (x, y) of A.
• The computation of CisCorr for all pixels of A takes O(Nnl), or approximately O(NM).
• The computation of R_A[x, y, j] and RasCorr for all first grade candidate pixels takes O(N₁m√M) and O(N₁m²), respectively, where N₁ is the number of first grade candidate pixels. O(N₁m√M + N₁m²) can be approximated by O(NM).
• The computation of Tefi takes O(N₂M), where N₂ is the number of second grade candidate pixels, and O(N₂M) ≤ O(NM).

Consequently, the complexity of Ciratefi is O(NM), while the complexity of the brute force algorithm is O(NM²). This makes a lot of difference! In our example, M ≈ 2500, justifying why Ciratefi was 400 times faster than the brute force algorithm.
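As a rough sanity check of these bounds, the expected speed-up is of the order of the template size M, up to the constants hidden in the O-notation:

\[
\frac{T_{\text{brute force}}}{T_{\text{Ciratefi}}} \;\approx\; \frac{c_1\, N M^2}{c_2\, N M} \;=\; \frac{c_1}{c_2}\, M, \qquad M \approx 2500,
\]

so the measured factor of about 400 corresponds to a constant ratio \(c_1/c_2\) of roughly 0.16.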
7 Experimental Results

7.1 Experiments
We performed three experiments to evaluate Ciratefi, using a total of 145 images. In all images, shape instances appear in different rotations, scales, brightnesses and contrasts. We do not compare the accuracy of our algorithm with other techniques because, as we discussed in section 1, there seems to be no rotation- and scale-invariant grayscale template matching in the literature (except the brute-force algorithm). Choosing adequate parameters, Ciratefi and the brute-force algorithm yield exactly the same results, which makes the behavior of our algorithm predictable.

In the first experiment, we took 70 pictures of 16 toy figures randomly scattered on the floor. Then, we searched in the 70 images (each one with 512×384 pixels) for 5 templates: frog, dog, palm_tree, bear and letter_s (figure 6), extracted from one of the 70 images. 10 instances of the searched templates appear in each image. Figure 7 shows one final output where the matching positions are marked with "x". All 700 Ciratefi matchings were perfect, without any false positive or false negative. Note that there are faintly visible shapes (dog and bear). These shapes were also successfully detected, in spite of their low contrast, using tβ = 0.1.
Fig. 6. Template images (51×51 pixels)
Fig. 7. Result of detection of the 5 templates
Fig. 8. Detection of McDonald's symbols. (a) Template. (b) Perfect matching. (c) False negative case encircled in yellow.
In the second experiment, we searched for the McDonald's® symbol (figure 8a) in 60 images taken from different places and objects. The smallest image has 96×94 pixels, and the largest has 698×461 pixels. Figures 8b and 8c show two sample images matched against the template. Each matching is marked with a red "x". This experiment presented only one false positive and two false negatives. The bright background of the symbol probably caused one of the false negatives, marked with a yellow circle in figure 8c. Note that we did not use the color information; this task would probably become much easier using it.

Finally, in the third experiment we tested the detection of buildings with a specific shape in 15 remote sensing images with 515×412 pixels, provided by Google Earth, using a grayscale template with 40×40 pixels (figure 9a). Figures 9b and 9c depict some examples of the experiment. In this experiment, the building appears 187 times in the 15 analyzed images. We detected 18 false positive and 16 false negative cases, caused mainly by shadows, different illumination angles, occlusions and blurred images. The results of the three experiments are summarized in Table 1.
Fig. 9. Detection of a building in remote sensing images. (a) Template image. (b) Perfect matching. (c) False negative case encircled in yellow.

Table 1. Summary of the tests

Experiment    Instances of the shape   Correct detections   False positives   False negatives
Toys          700                      700                  0                 0
McDonald's    116                      114                  1                 2
Buildings     187                      171                  18                16
7.2 Parameters
We tested the sensitivity of Ciratefi to different choices of parameters, such as number of circles (l), number of radial lines (m) and thresholds t1, t2, t3, tβ and tγ. We demonstrate below that the only really important parameter is t3. We searched for the frog template in one image of the toy experiment (figure 7). We used 10 scale factors (s0=0.4, s1=0.5,..., s9=1.3). In each table, the fixed parameters appear in the first line.
Table 2. Sensitivity to the number of circles l (fixed: m = 36, t1 = 0.95, t2 = 0.9, t3 = 0.9, tβ = 0.1, tγ = 1)

Number of circles l   False positives   False negatives   First grade candidate pixels
 5                    0                 0                 83,310
10                    0                 0                 69,389
15                    0                 0                 50,519
20                    0                 0                 74,970
25                    0                 0                 77,375
Table 2 shows that the number of circles l does not have a strong influence on the final result, because no error was detected even when varying its value. However, the suitable choice of l is important to minimize the number of first grade candidate pixels and accelerate the processing.

Table 3. Sensitivity to the number of radial lines m (fixed: l = 13, t1 = 0.95, t2 = 0.9, t3 = 0.9, tβ = 0.1, tγ = 1)

Number of radial lines m   False positives   False negatives   Second grade candidate pixels
 8                         0                 2                 433
15                         0                 2                 42
20                         0                 0                 30
30                         0                 0                 35
40                         0                 0                 41
Table 3 shows that too small a number of radial lines m can produce false negatives by eliminating the true matching pixels. In this experiment, no error was detected for m ≥ 20. However, the algorithm becomes slower using large m.

Table 4. Sensitivity to the thresholds t1, t2 and t3 (fixed: l = 13, m = 36, tβ = 0.1, tγ = 1)

Thresholds t1, t2, t3   False positives   False negatives
0.50, 0.50, 0.50        8376              0
0.50, 0.50, 0.75        286               0
0.50, 0.50, 0.95        0                 0
0.75, 0.75, 0.50        1325              0
0.75, 0.75, 0.75        104               0
0.75, 0.95, 0.95        0                 0
0.95, 0.75, 0.75        104               0
0.95, 0.75, 0.95        0                 0
0.95, 0.95, 0.95        0                 0
0.95, 0.95, 0.98        0                 2
0.95, 0.98, 0.95        0                 2
0.98, 0.95, 0.95        0                 2
Table 4 shows that an incorrect choice of t3 may produce false negatives or false positives. However, the choices of t1 and t2 are not critical to the detection of the shape, as long as their values are not so high that they discard the true matchings. Indeed, the detection was errorless for t3 = 0.95, for any t1 ≤ 0.95 and t2 ≤ 0.95. However, small values of t1 and t2 make the algorithm slower.
Table 5. Sensitivity to the thresholds tβ and tγ (fixed: l = 13, m = 36, t1 = 0.95, t2 = 0.9, t3 = 0.9)

Thresholds tβ, tγ   False positives   False negatives
0.10, 0.10          0                 1
0.10, 0.50          0                 0
0.10, 1.00          0                 0
0.25, 0.10          0                 1
0.25, 0.50          0                 0
0.25, 1.00          0                 0
0.50, 0.10          0                 1
0.50, 0.50          0                 1
0.50, 1.00          0                 1
0.75, 0.10          0                 1
0.75, 0.50          0                 1
0.75, 1.00          0                 1
1.00, 0.10          0                 2
1.00, 1.00          0                 2
As expected, Table 5 shows that too large tβ or too small tγ yields false negatives. However, there are large ranges of values that do not produce errors (0.1 ≤ tβ ≤ 0.25 and 0.5 ≤ tγ ≤ 1.0).
8 Conclusions and Future Works

In this paper, we have presented a new grayscale template matching algorithm, invariant to rotation, scale, translation, brightness and contrast, named Ciratefi. Differently from many other techniques, Ciratefi does not discard the rich grayscale information through operations like detection of edges, detection of interest points or segmentation/binarization of the images. The proposed algorithm was about 400 times faster than the brute force algorithm in the experiments, while yielding practically the same output. Complexity analysis has shown that Ciratefi is indeed superior to the brute force algorithm. Experimental results demonstrate the efficiency and the robustness of the proposed technique. A straightforward generalization of this technique is to use the color information together with the luminance. Another possible generalization is to use other features besides the mean grayscales on circles and radial lines, such as standard deviations, and maximum or minimum values.
References 1. Hutchinson, S., Hager, G.D., Corke, P.I.: A tutorial on visual servo control. IEEE Trans. on Robotics and Automation 13(5), 651–670 (1996) 2. Brown, L.G.: A survey of image registration techniques. ACM Computing Surveys 24(4), 325–376 (1992) 3. Anandan, P.: A computational framework and an algorithm for the measurement of visual motion. Int. J. Comput. Vision 2(3), 283–310 (1989) 4. Ballard, D.H.: Generalizing the hough transform to detect arbitrary shapes. Pattern Recognition 13(2), 111–122 (1981) 5. Lamdan, Y., Wolfson, H.J.: Geometric hashing: a general and efficient model-based recognition scheme. In: Int. Conf. on Computer Vision, pp. 238–249 (1988) 6. Wolfson, H.J., Rigoutsos, I.: Geometric hashing: an overview. IEEE Computational Science & Engineering, 10–21 (October-December 1997)
7. Leung, T.K., Burl, M.C., Perona, P.: Finding faces in cluttered scenes using random labeled graph matching. In: Int. Conf. on Computer Vision, pp. 637–644 (1995) 8. Mokhtarian, F., Mackworth, A.K.: A Theory of Multi-scale, Curvature Based Shape Representation for Planar Curves. IEEE T. Pattern Analysis Machine Intelligence 14(8), 789–805 (1992) 9. Kim, W.Y., Yuan, P.: A practical pattern recognition system for translation, scale and rotation invariance. In: Computer Vision and Pattern Recognition, pp. 391–396 (1994) 10. Torres-Méndez, L.A., Ruiz-Suárez, J.C., Sucar, L.E., Gómez, G.: Translation, rotation and scale-invariant object recognition. IEEE Trans. Systems, Man, and Cybernetics - part C: Applications and Reviews 30(1), 125–130 (2000) 11. Hu, M.K.: Visual Pattern Recognition by Moment Invariants. IRE Trans. Inform. Theory 1(8), 179–187 (1962) 12. Teh, C.H., Chin, R.T.: On image analysis by the methods of moments. IEEE Trans. on Pattern Analysis and Machine Intelligence 10(4), 496–513 (1988) 13. Li, J.H., Pan, Q., Cui, P.L., Zhang, H.C., Cheng, Y.M.: Image recognition based on invariant moment in the projection space. In: Int. Conf. Machine Learning and Cybernetics, Shangai, vol. 6, pp. 3606–3610 (August 2004) 14. Flusser, J., Suk, T.: Rotation moment invariants for recognition of symmetric objects. IEEE T. Image Processing 15(12), 3784–3790 (2006) 15. Dionisio, C.R.P., Kim, H.Y.: A supervised shape classification technique invariant under rotation and scaling. In: Int. Telecommunications Symposium, pp. 533–537 (2002) 16. Tao, Y., Ioerger, T.R., Tang, Y.Y.: Extraction of rotation invariant signature based on fractal geometry. IEEE Int. Conf. Image Processing 1, 1090–1093 (2001) 17. Ullah, F., Kaneko, S.: Using orientation codes for rotation-invariant template matching. Pattern Recognition 37, 201–209 (2004) 18. Tsai, D.M., Tsai, Y.H.: Rotation-invariant pattern matching with color ring-projection. Pattern Recognition 35, 131–141 (2002) 19. Chang, D.H., Hornak, J.P.: Fingerprint recognition through circular sampling. The Journal of Imaging Science and Technology 44(6), 560–564 (2000) 20. Tao, Y., Tang, Y.Y.: The feature extraction of chinese character based on contour information. In: Int. Conf. Document Analysis Recognition (ICDAR), pp. 637–640 (September 1999) 21. Bresenham, J.E.: A linear algorithm for incremental digital display of circular arcs. Comm. ACM 20(2), 100–106 (1977) 22. Bresenham, J.E.: Algorithm for computer control of a digital plotter. IBM Systems Journal 4(1), 25–30 (1965)
Bimodal Biometric Person Identification System Under Perturbations

Miguel Carrasco¹, Luis Pizarro², and Domingo Mery¹

¹ Pontificia Universidad Católica de Chile, Av. Vicuña Mackenna 4860(143), Santiago, Chile
[email protected], [email protected]
² Mathematical Image Analysis Group, Faculty of Mathematics and Computer Science, Saarland University, Bldg. E11, 66041 Saarbrücken, Germany
[email protected]
Abstract. Multibiometric person identification systems play a crucial role in environments where security must be ensured. However, building such systems must jointly encompass a good compromise between computational costs and overall performance. These systems must also be robust against inherent or potential noise on the data-acquisition machinery. In this respect, we propose a bimodal identification system that combines two inexpensive and widely accepted biometric traits, namely face and voice information. We use a probabilistic fusion scheme at the matching score level, which linearly weights the classification probabilities of each person-class from both face and voice classifiers. The system is tested under two scenarios: a database composed of perturbation-free faces and voices (ideal case), and a database perturbed with variable Gaussian noise, salt-and-pepper noise and occlusions. Moreover, we develop a simple rule to automatically determine the weight parameter between the classifiers via the empirical evidence obtained from the learning stage and the noise level. The fused recognition system exceeds in all cases the performance of the face and voice classifiers alone.

Keywords: Biometrics, multimodal, identification, face, voice, probabilistic fusion, Gaussian noise, salt-and-pepper noise, occlusions.
1 Introduction
Human beings possess a highly developed ability for recognising certain physiological or behavioral characteristics of different persons, particularly under high levels of variability and noise. Designing automatic systems with such capabilities comprises a very complex task with several limitations. Fortunately, in the last few years a large amount of research has been conducted in this direction. Particularly, biometric systems aim at recognising a person based on a set of intrinsic characteristics that the individual possesses. There exist many attributes that can be utilised to build an identification system depending on the application domain [1,2]. The process of combining information from multiple biometric
traits is known as biometric fusion or multimodal biometrics [3]. Multibiometric systems are more robust since they rely on different pieces of evidence before taking a decision. Fusion could be carried out at three different levels: (a) fusion at the feature extraction level, (b) fusion at the matching score level, and (c) fusion at the decision level [4]. Over the last fifteen years several multimodal schemes have been proposed for person identification [5,6,7]. It is known that the face and voice biometrics have lower performance compared to other biometric traits [8]. However, these are among the most widely accepted by people, and the low cost of the equipment for face and voice acquisition makes the systems inexpensive to build. We refer to [9] for a relatively recent review on identity verification using face and voice information. We are interested in setting up a bimodal identification system that makes use of these two biometrics.

Traditional recognition systems are built assuming that the biometrics used in the learning (or training) process are noiseless. This ideal condition implies that all variables¹ susceptible to noise must be regulated. However, keeping all these variables under control might be very hard or unmanageable under the system's operation conditions. There are two alternatives to handle this problem. On the one hand, if the nature of the noise is known, a suitable filter can be used in a preprocessing step. On the other hand, without any filtering, it is possible to build the recognition system with noisy data and make the biometric classifiers as robust as possible to cope with the perturbations. In this paper we are concerned with the latter alternative. We propose a probabilistic fusion scheme performed at the matching score level, which linearly combines the classification probabilities of each authenticated person in both the face and the voice matching processes. The identity of a new input is associated with the identity of the authenticated person with the largest combined probability. We assess the robustness of the proposed bimodal biometric system against different perturbations: face images with additive Gaussian and salt-and-pepper noise, as well as with partial occlusions, and voice signals with additive white Gaussian noise. The performance of the fused system is tested under two scenarios: when the database is built on perturbation-free data (ideal case), and when it is built considering variable perturbations. Moreover, we develop a simple rule to automatically determine the weight parameter between the classifiers via empirical evidence obtained from the learning stage and the noise level. We show that combining two lower performance classifiers is still a convenient alternative in terms of computational costs/overall performance.

In Section 2 we describe classical techniques utilised in face and voice recognition. Section 3 details the proposed fused biometric system, which is tested under several perturbation conditions in Section 4. We conclude the paper in Section 5 summarising our contribution and delineating some future work.
¹ In the case of face and voice signals: calibration of audio/video recording devices, analog-digital data conversion, illumination conditions, background noise and interference, among others.

2 Face and Voice Recognition
Face recognition. At present there are three main approaches to the problem of face recognition: i) based on appearance, ii) based on invariant characteristics, and iii) based on models [10,11]. In the first approach the objective is to extract similar characteristics present in all faces. Usually statistical or machine learning techniques are used, and dimensionality reduction tools are very important for improving efficiency. One of the most widely used unsupervised tools in this respect is principal component analysis (PCA) [12]. This method linearly projects the high-dimensional input space onto a lower-dimensional subspace containing all the relevant image information. This procedure is applied over all the face images –training set– used for the construction of the identification system. This projection space is known as eigenfaces space. To recognize a new face the image is transformed to the projection space, and the differences between that projection and those of the training faces are evaluated. The smallest of these differences, which in turn is smaller than a certain threshold, gives the identity of the required face. The second approach is based on the invariant characteristics of the face, e.g., color, texture, shape, size and combinations of them. The objective consists in detecting those patterns that allow the segmentation of the face or faces contained in an image [13]. The third approach consists in the construction of models in two and three dimensions. Control points that identify specific face positions are determined robustly, and they are joined to form a nonrigid structure. Then this structure is deformed iteratively to make it coincide with some of the structures recognized by the identification system [14]. Unfortunately, this technique is very slow and requires the estimation of precise control points, and therefore the image must have high resolution. Also, because of the iteration process, it can be trapped in local optima, and is therefore dependent on the position of the control points chosen initially. The different face recognition algorithms depend on the application’s domain. There is no system that is completely efficient under all conditions. Our study is limited to developing an identification mechanism considering images captured in controlled environments. The approach chosen is that based on appearance and its implementation through PCA-eigenfaces. Voice recognition. Voice recognition is the process of recognizing automatically who is speaking by means of the information contained in the sound waves emitted [15,16]. In general, voice recognition systems have two main modules: extraction of characteristics, which consists in obtaining a small but representative amount of data from a voice signal, and comparison of characteristics, which involves the process of identifying a person by comparing the characteristics extracted from its voice with those of the persons recognized by the identification system. Voice is a signal that varies slowly with time. Its characteristics remain almost stationary when examined over a sufficiently short period of time (ca. 5-100 ms). However, over longer time periods (more than 0.2 s) the signal’s characteristics change, reflecting the different sounds of voice. Therefore, the most natural way of characterizing a voice signal is by means of the so-called short-time
spectral analysis. One of the most widely used techniques in voice recognition is mel-frequency cepstrum coefficients (MFCC) [17,18], which we also use in this study. Basically, MFCC imitates the processing by the human ear in relation to frequency and bandwidth. Using filters spaced linearly at low frequencies (below 1000 Hz) and logarithmically at high frequencies, MFCC captures the main characteristics of the voice. This is expressed in the literature as the mel-frequency scale. We use this approach for voice characterisation.

Fig. 1. Proposed framework for person identification
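For illustration, MFCC features of this kind can be computed with standard audio libraries; the following minimal sketch assumes the librosa package and 13 coefficients per frame, both of which are our assumptions and not choices made by the authors.

import librosa

def mfcc_features(wav_path, n_mfcc=13):
    # Load the recording at its native sampling rate and compute the
    # mel-frequency cepstrum coefficients, one vector per analysis frame.
    signal, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T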
3 Fusion of Face and Voice Under Perturbations
As previously mentioned, face and voice biometrics have lower performance compared to other biometric traits [8]. Nevertheless, it is relatively inexpensive to set up systems based on such biometrics. Moreover, PCA-eigenfaces and MFCC techniques require simple computation compared to other more sophisticated techniques [11].

Probabilistic fusion framework. Our proposal consists in fusing these lower performance classifiers by means of a simple probabilistic scheme, with the aim of obtaining an identification system with better performance and robust against different perturbations. The construction of such a system consists of the following five phases, outlined in Fig. 1 and described next.

I. Preprocessing. In this phase k face images and k voice signals are considered for each one of the t persons in the system. With the purpose of examining the behaviour of the classifiers constructed with altered data, both signals are intentionally contaminated with different kinds of perturbations. The face images are contaminated with Gaussian noise, salt-and-pepper noise, or partial occlusions, while the voice signals are perturbed with additive white Gaussian noise. This also allows us to verify the performance of the algorithms used in our study for the extraction of characteristics. All signals belonging to a person j, perturbed or not, are associated with the person-class C(j), for all j = 1, ..., t.
Fig. 2. Vector transformation of each image of the training set and later normalization and calculation of the mean image of the set of training images

Fig. 3. Generation of the eigenfaces by means of PCA using the normalized data
II. Face feature extractor. To extract the face features we use the method known as eigenfaces [19]; see figures 2 and 3. All the images of the training set are transformed into column vectors and are concatenated in a matrix S. This matrix is normalized (N) by subtracting the mean and dividing by the standard deviation of each column. This improves contrast and decreases the effect of changes in illumination. Then, by averaging its rows, matrix N is reduced to a column vector M which represents the mean image of the training set. Then, applying PCA to the normalization matrix N, we obtain the eigenfaces matrix P. The column vectors e_1, ..., e_n represent the eigenfaces, and they are ordered from more to less information content. Finally, the matrix W is obtained, which contains the characteristics of the training set. This is calculated as the product between corresponding columns in the normalization and projection matrices, i.e. W_i = N_i · P_i, for all columns i = 1, ..., n.

Voice feature extractor. The process of generation of the MFCC coefficients requires a set of steps that transform a voice signal into a matrix that contains its main characteristics. Initially, the audio signal is divided into a set of adjacent frames. Then each frame is filtered through a Hamming window, allowing the spectral distortion to be minimized both at the beginning and at the end of each frame. Then a transformation is made in each frame to the spectral domain with the Fourier transform; the result of this transformation is known as a spectrum. The next step transforms each spectrum into a signal that simulates the human ear, known as a mel scale. Finally, all the mel-spectra are transformed into the time domain by means of the discrete cosine transform. The latter step generates as a result the mel-frequency cepstrum coefficients (MFCC) of the voice signal. For details we refer to [20].
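A minimal NumPy sketch of the training-stage computation just described (per-column normalisation, mean image, PCA, and the products W_i = N_i · P_i) is shown below; the use of the SVD for the PCA step and the reading of N_i · P_i as an inner product of matching columns are our assumptions, not details prescribed by the authors.

import numpy as np

def eigenface_training(S):
    # S: g x n matrix whose n columns are the vectorised training faces (g pixels each, g > n).
    N = (S - S.mean(axis=0)) / (S.std(axis=0) + 1e-12)   # normalised matrix N
    M = N.mean(axis=1)                                   # mean image of the training set
    # PCA of N: the eigenfaces are the left singular vectors of the centred data,
    # ordered from more to less information content.
    U, _, _ = np.linalg.svd(N - M[:, None], full_matrices=False)
    P = U                                                # g x n eigenfaces matrix (e_1, ..., e_n)
    W = np.einsum('gi,gi->i', N, P)                      # W_i = N_i . P_i for every column i
    return N, M, P, W

In the matching stage (Fig. 4), the unknown normalised face I_i gives the difference D_i = I_i − M, its feature vector is obtained by projecting D_i onto the columns of P, and the Euclidean distances to the stored characteristics W are then computed.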
Fig. 4. Determination of the difference image using the general mean of the training set and the calculation of the characteristics vector Wi of face i
III. Template generation. The process of storing the biometric characteristics extracted before is called enrolment. In the case of the face, the characteristics matrix W is stored. For the voice, a compressed version of the signals of each person is stored. For that purpose, we make use of the LBG clustering algorithm [21], which generates a set of vectors called a VQ-codebook [22]. The registered features are considered as templates with which the features of an unknown person must be compared in the identification process.

IV. Face matching. To determine probabilistically the identity of an unknown person i, first the difference D_i between its normalized image I_i and the mean image M of the training set is calculated. Then the characteristics vector W_i is generated as the dot product between D_i and each column of the projection matrix P; see Fig. 4. Later, the Euclidean distances between the vector W_i and all the columns of the characteristics matrix W are computed. The k shortest distances are used to find the most likely person-class C(j) with which the unknown person i is associated.

Voice matching. In the case of voice, the process consists in extracting the cepstrum coefficients of the unknown speaker i by means of the calculation of the MFCCs, and calculating their quantized vector qv_i. Then the Euclidean distances between qv_i and all the vectors contained in the VQ-codebook are determined. As with the face, the k shortest distances are used to find the most likely person-class C(j) with which the unknown speaker i is associated.

V. Fusion. Finally, the response of the fused recognition system is given as a linear combination of the probabilistic responses of both the face classifier and the voice classifier. Since each person in the database has k signals of face and voice, the k nearest person-classes associated to an unknown person i represent those that are most similar to it. Thus, if the classification were perfect, these k classes should be associated with the same person, such that the classification probability would be k/k = 1. The procedure consists of two steps. Firstly, we determine the classification probability of each person j for face matching, P_f(j), as well as for voice matching, P_v(j):

P_f(j) = V_f(j) / k,   P_v(j) = V_v(j) / k,   for all j = 1, ..., t;    (1)

where V_f(j) and V_v(j) are the number of representatives of the person-class C(j) out of the k previously selected candidates in the face matching and
in the voice matching stages, respectively. Secondly, we infer the identity of an unknown person i with the person-class C(j) associated with the largest value of the combined probability

P(j) = α · P_f(j) + (1 − α) · P_v(j),   for all j = 1, ..., t.    (2)
The parameter α ∈ [0, 1] weights the relative importance associated with each classifier. In the next section we present a simple rule to estimate this parameter.

Estimation of the weight parameter α. The weight α is the only free parameter of our probabilistic fusion model, and it reflects the reliability that the recognition system assigns to each classifier. Therefore, its estimation must intrinsically capture the relative performance between the face classifier and the voice classifier in the application scenario. In general, as will be shown in the experimental section, estimating this parameter depends on the input data. Heuristically, the feature learning process provides empirical evidence about the performance of the face and voice classifiers. Once the learning is done, the identification capabilities of the system are tested on faces and voices belonging to the set of t recognisable persons, though these data have not been previously used for learning. In this way, we have quantitative measurements of the classifiers' performance at our disposal. Thus, a simple linear rule for estimating α based on these measurements is given by

α̂ = (1 + (q_f − q_v)) / 2    (3)
where q_f, q_v ∈ [0, 1] are the empirical performances of the face and voice classifiers, respectively. This formula assigns more importance to the classifier that performs better under a given testing scenario. When both classifiers obtain nearly the same performance, their responses are equally considered in equation (2). This scheme agrees with the work by Sanderson and Paliwal [9], since assigning a greater weight to the classifier with better performance clearly increases the performance of the fused recognition.
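A compact sketch of equations (1)-(3); the vote counts and the empirical accuracies q_f, q_v are placeholders for the quantities produced by the face and voice matchers.

import numpy as np

def fuse_and_identify(face_votes, voice_votes, k, q_face, q_voice):
    # face_votes[j], voice_votes[j]: how many of the k nearest neighbours belong
    # to person-class C(j) in the face and in the voice matching stage.
    Pf = np.asarray(face_votes, dtype=float) / k          # eq. (1), face classifier
    Pv = np.asarray(voice_votes, dtype=float) / k         # eq. (1), voice classifier
    alpha = (1.0 + (q_face - q_voice)) / 2.0              # eq. (3), estimated weight
    P = alpha * Pf + (1.0 - alpha) * Pv                   # eq. (2), combined probability
    return int(np.argmax(P)), alpha                       # class index with the largest P(j)

When q_face = q_voice the rule gives α = 0.5, so both classifiers are weighted equally, as stated above.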
4 Experimental Results
The database used consists of 18 persons, with eight different face and voice versions for each one. The faces used are those provided by the Olivetti Research Laboratory (ORL) [23]. The faces of a given person vary in their facial expressions (open/closed eyes, smiling/serious), facial details (glasses/no glasses), and posture changes. The voices were generated using an electronic reproducer in MP3 format at 128 kbps. A total of 144 recordings (8 per person) were made.
Fig. 5. One of the voice signals used in the experiments and some of its noisy versions (original audio; SNR = 20.9 dB, σ = 20; SNR = 17.9 dB, σ = 40; SNR = 16.2 dB, σ = 60; SNR = 14.9 dB, σ = 80; SNR = 13.9 dB, σ = 100)
Fig. 6. (a) Original face sequence of an individual with eight different expressions. (b) Sample with variable Gaussian noise. (c) Sample with variable salt-and-pepper noise. (d) Sample with variable textured occlusions.
Face and voice classifiers alone. We performed two types of experiments to analyse the effect of noisy data on the performance of the face and voice classifiers without fusion. In the first experiment (Exp.1 ), the recognition system is constructed with perturbation-free data, but later it is tested on noisy data. In the second experiment (Exp.2 ), the recognition system is constructed considering various perturbations of the face and the voice signals, and tested then on perturbation-free data. Different perturbation levels were considered. The voice signals contain additive white Gaussian noise with zero mean and variable standard deviation σ = {0, 10, . . . , 100} of their mean power weighted by a factor of 0.025. The faces contain additive Gaussian noise with zero mean and standard deviation σ = {0, 10, . . . , 100} of the maximal grey value, additive salt-and-pepper noise that varies between 0% and 100% of the number of pixels, or randomly located textured occlusions whose size varies between 0% and 100%
of the image area [24]. Figures 5 and 6 show examples of the data utilised in testing. The experiment Exp.1 in Fig. 7(a) shows, on the one hand, that the MFCC method has a low capability of recognising noisy data when only clean samples have been used for training. On the other hand, we observe that PCA-eigenfaces² deals quite well with all types of noise up to 70% of perturbation, and is especially robust against Gaussian noise. Surprisingly, the experiment Exp.2 in Fig. 7(b) reveals an improvement in the voice recognition when this classifier is constructed considering noisy samples. However, the face recognition is now able to satisfactorily manage only up to 30% of perturbations. Notice that when no perturbations at all are considered (ideal case), the performance of the classifiers is around 90%.

Fused voice and face recognition. In this section we aim at combining the responses of both the face classifier and the voice classifier using the relation (2). A crucial aspect of this objective is the proper estimation of the weight parameter α. In the experiments of the previous section we varied the noise level over a large range, and the results logically depended on the amount of noise. We would like to use the formula (3) to adjust the computation of the parameter α to the noise level. This assumes that we have quantitative measurements of the noise level on the voice and face samples, but in a real application the amount of noise is not known in advance. The estimation of these quantities for the different signals used here is out of the scope of this paper. However, several strategies for noise estimation in audio [25,26,27] and image [28,29,30,31,32,33] signals appear in the literature. If we assume that we have reliable estimations of the noise level in voice and face signals, and since the empirical performances of the classifiers are known from the learning stage under different testing scenarios, it is possible to compute the parameter α using the relation (3). For example, considering voice signals with variable white Gaussian noise and face images with salt-and-pepper noise, figures 8(a) and 8(b) show the estimated α̂ curves for the experiments Exp.1 and Exp.2 of figures 7(a) and 7(b), respectively. Evidently, the weight α increases as the noise in the voice signal increases, because voice recognition is more sensitive to noise than face recognition. Again, we measure the performance of the fused recognition under two experimental scenarios: Exp.3: system built with perturbation-free data and then tested on noisy samples; and Exp.4: system built with noisy data and then tested on noiseless samples. Figures 9 and 10 show the recognition performance for these two operation settings, respectively. The missing α̂ curves have been omitted for the sake of space and readability. Notice that the performance of the ideal case now reaches 100%. Similarly, under the same experimental settings, the fused recognition outperforms the voice and face classifiers alone.

² Although PCA may require a precise localisation of the head, the set of faces used in the experiments were not perfectly aligned, as shown in Fig. 6. However, satisfactory results are still achievable.
Fig. 7. Independent performance of voice recognition (VR) and face recognition (FR) systems. (a) Exp.1: Recognition systems built with perturbation-free data and tested on samples with variable noise. (b) Exp.2: Recognition systems built with variable noisy data and tested on noiseless samples.

Fig. 8. Estimated α̂ curves when voice signals with variable white Gaussian noise and face images with salt-and-pepper noise are considered in (a) Exp.1, and (b) Exp.2
The performance stability of experiment Exp.3 is in accordance with the much larger influence of the face classifier, as outlined in Fig. 7(a). Although such an influence is not as large in Fig. 7(b), experiment Exp.4 also enjoys a certain stability. With respect to the robustness of the arbitrarily chosen feature extraction tools, it was shown that occlusions cause a greater impact than Gaussian or salt-and-pepper noise on the eigenfaces analysis, and that the analysis of the voice signals via MFCC is much more sensitive to white noise. However, even when the face and voice classifiers might reach a low performance independently, it is possible
Fig. 9. Exp.3: Performance of a bimodal person identification system by fusing voice and face classifiers. The system is built with perturbation-free data and tested then on noisy samples. Voice signals with white Gaussian noise and image faces with (a) Gaussian noise, (b) salt-and-pepper noise, and (c) textured occlusions, are considered.

Fig. 10. Exp.4: Performance of a bimodal person identification system by fusing voice and face classifiers. The system is built with noisy data and tested then on noiseless samples. Voice signals with white Gaussian noise and image faces with (a) Gaussian noise, (b) salt-and-pepper noise, and (c) textured occlusions, are considered.
to obtain a much better recognition system when the responses of both classifiers are fused in a probabilistic manner. Similarly, by improving the performance of the independent classifiers, the overall performance increases too. It has been shown that, depending on the learning and operation conditions of the identification system, it might be worthwhile to consider not only ideal noiseless samples when building the classifiers, but also inherent or potential sources of noise, which may improve the whole identification process. For a particular application, the impact of every source of noise in the learning step as well as in the operation step should be evaluated before the identification system is set up. In the light of such a study, the decision of whether or not to build the system with noisy samples should be taken.
5 Conclusions and Future Work
This work presents a biometric person identification system based on fusing two common biometric traits: face and voice. The fusion is carried out by a simple probabilistic scheme that combines the independent responses from both face and voice classifiers. The performance of the recognition system is assessed under different types of perturbations: Gaussian noise, salt-and-pepper noise and textured occlusions. These perturbations might affect the samples used to build the classifiers, and/or the test samples the system must identify. It is shown that the proposed probabilistic fusion framework provides a viable identification system under different contamination conditions, even when the independent classifiers have low individual performance. We present a simple formula to automatically determine the weight parameter that combines the independent classifiers' responses. This formula considers the empirical evidence derived from the learning and testing stages, and it depends in general on the noise level. As future work, we will investigate more robust feature extraction tools that provide better results under this probabilistic scheme. We will also seek alternative ways to estimate the weight parameter.

Acknowledgments. This work was partially funded by CONICYT project ACT-32 and partially supported by a grant from the School of Engineering at Pontificia Universidad Católica de Chile. The authors would like to thank the G'97-USACH Group for their voices utilized in this research.
References 1. Prabhakar, S., Pankati, S., Jain, A.K.: Biometric recognition: Security and privacy concerns. IEEE Security and Privacy 01(2), 33–42 (2003) 2. Jain, A.K.: Biometric recognition: How do i know who you are? In: Roli, F., Vitulano, S. (eds.) ICIAP 2005. LNCS, vol. 3617, pp. 19–26. Springer, Heidelberg (2005) 3. Ross, A., Jain, A.: Multimodal biometrics: An overview. In: Proc. 12th European Signal Processing Conference, EUSIPCO 2004, Vienna, Austria, pp. 1221–1224 (September 2005) 4. Ross, A., Jain, A.K.: Information fusion in biometrics. Pattern Recognition Letters 24(13), 2115–2125 (2003) 5. Brunelli, R., Falavigna, D.: Person identification using multiple cues. IEEE Trans Pattern Anal Mach Intell 17(10), 955–966 (1995) 6. Big¨ un, E., Big¨ un, J., Duc, B., Fischer, S.: Expert conciliation for multi modal person authentication systems by bayesian statistics. In: Big¨ un, J., Borgefors, G., Chollet, G. (eds.) AVBPA 1997. LNCS, vol. 1206, pp. 291–300. Springer, Heidelberg (1997) 7. Snelick, R., Uludag, U., Mink, A., Indovina, M., Jain, A.: Large-scale evaluation of multimodal biometric authentication using state-of-the-art systems. IEEE Trans Pattern Anal Mach Intell 27(3), 450–455 (2005) 8. Jain, A.K., Ross, A.: Multibiometric systems. 47(1), 34–40 (2004) 9. Sanderson, C., Paliwal, K.K.: Identity verification using speech and face information. Digit Signal Process 14(5), 449–480 (2004)
10. Yang, M.-H., Kriegman, D.J., Ahuja, N.: Detecting faces in images: A survey. IEEE Trans Pattern Anal Mach Intell 24(1), 34–58 (2002) 11. Lu, X.: Image analysis for face recognition: A brief survey. Personal Notes (May 2003) 12. Ruiz-del-Solar, J., Navarrete, P.: Eigenspace-based face recognition: a comparative study of different approaches. IEEE Trans Syst Man Cybern C Appl Rev 35(3), 315–325 (2005) 13. Guerfi, S., Gambotto, J.P., Lelandais, S.: Implementation of the watershed method in the hsi color space for the face extraction. In: IEEE Conference on Advanced Video and Signal Based Surveillance, 2005. AVSS 2005, pp. 282–286. IEEE Computer Society Press, Los Alamitos (2005) 14. Lu, X., Jain, A.: Deformation analysis for 3d face matching. In: Proc. Seventh IEEE Workshops on Application of Computer Vision, WACV/MOTION 2005, pp. 99–104. IEEE Computer Society Press, Los Alamitos (2005) 15. Doddington, G.R.: Speaker recognition identifying people by their voices. Proc. IEEE 73(11), 1651–1664 (1985) 16. Furui, S.: Cepstral analysis technique for automatic speaker verification. IEEE Trans Acoust Speech Signal Process 29(2), 254–272 (1981) 17. Murty, K.S.R., Yegnanarayana, B.: Combining evidence from residual phase and mfcc features for speaker recognition. IEEE Signal Process Lett 13(1), 52–55 (2006) 18. Picone, J.W.: Signal modeling techniques in speech recognition. Proc. IEEE 81(9), 1215–1247 (1993) 19. Kirby, M., Sirovich, L.: Application of the karhunen-loeve procedure for the characterization of human faces. IEEE Trans Pattern Anal Mach Intell 12(1), 103–108 (1990) 20. Wei, H., Cheong-Fat, C., Chiu-Sing, C., Kong-Pang, P.: An efficient mfcc extraction method in speech recognition. In: Proc. 2006 IEEE International Symposium on Circuits and Systems, ISCAS 2006, may 2006, pp. 145–148. IEEE Computer Society Press, Los Alamitos (2006) 21. Linde, Y., Buzo, A., Gray, R.M.: An algorithm for vector quantizer design. IEEE Trans Comm 28(1), 84–95 (1980) 22. Kinnunen, I., K¨ arkk¨ ainen, T.: Class-discriminative weighted distortion measure for vq-based speaker identification. In: Caelli, T.M., Amin, A., Duin, R.P.W., Kamel, M.S., de Ridder, D. (eds.) SPR 2002 and SSPR 2002. LNCS, vol. 2396, pp. 681–688. Springer, Heidelberg (2002) 23. Samaria, F., Harter, A.: Parameterisation of a stochastic model for human face identification. In: Proc. 2nd IEEE Workshop on Applications of Computer Vision, pp. 138–142. IEEE Computer Society Press, Los Alamitos (1994) 24. Dana, K.J., Van-Ginneken, B., Nayar, S.K., Koenderink, J.J.: Reflectance and texture of real world surfaces. ACM Transactions on Graphics (TOG) 18(1), 1–34 (1999) 25. Yamauchi, J., Shimamura, T.: Noise estimation using high frequency regions for speech enhancement in low snr environments. In: Proc. of the 2002 IEEE Workshop on Speech Coding, pp. 59–61. IEEE Computer Society Press, Los Alamitos (2002) 26. Reju, V.G., Tong, Y.C.: A computationally efficient noise estimation algorithm for speech enhancement. In: Proc. of the 2004 IEEE Asia-Pacific Conference on Circuits and Systems, vol. 1, pp. 193–196. IEEE Computer Society Press, Los Alamitos (2004) 27. Wu, G.D.: A novel background noise estimation in adverse environments. In: Proc. of the 2005 IEEE International Conference on Systems, Man and Cybernetics, vol. 2, pp. 1843–1847. IEEE Computer Society Press, Los Alamitos (2005)
28. Starck, J.L., Murtagh, F.: Automatic noise estimation from the multiresolution support. Publications of the Astronomical Society of the Pacific 110(744), 193–199 (1998) 29. Salmeri, M., Mencattini, A., Ricci, E., Salsano, A.: Noise estimation in digital images using fuzzy processing. In: Proc. of the 2001 International Conference on Image Processing, vol. 1, pp. 517–520 (2001) 30. Shin, D.H., Park, R.H., Yang, S., Jung, J.H.: Block-based noise estimation using adaptive gaussian filtering. IEEE Transactions on Consumer Electronics 51(1), 218–226 (2005) 31. Liu, C., Freeman, W.T., Szeliski, R., Kang, S.B.: Noise estimation from a single image. In: Proc. of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 901–908. IEEE Computer Society Press, Los Alamitos (2006) 32. Grammalidis, N., Strintzis, M.: Disparity and occlusion estimation in multiocular systems and theircoding for the communication of multiview image sequences. IEEE Transactions on Circuits and Systems for Video Technology 8(3), 328–344 (1998) 33. Ince, S., Konrad, J.: Geometry-based estimation of occlusions from video frame pairs. In: Proc. of the 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 933–936. IEEE Computer Society Press, Los Alamitos (2005)
A 3D Object Retrieval Method Using Segment Thickness Histograms and the Connection of Segments

Yingliang Lu¹, Kunihiko Kaneko, and Akifumi Makinouchi²

¹ Graduate School of Information Science and Electrical Engineering, Kyushu University, 744 Motooka Nishi-ku, Fukuoka, Japan
[email protected]
http://www.db.is.kyushu-u.ac.jp/
² Department of Information and Network Engineering, Kurume Institute of Technology, 2228-66, Kamitsu-machi, Kurume, Fukuoka-Ken, Japan
Abstract. We introduce a novel 3D object retrieval method that is based not only on the topological information but also on the partial geometry features of a 3D object. Conventional approaches for 3D object similarity search depend only on global geometry features or topological features. We use the thickness distribution along the segments of the curve-skeleton as the partial geometry feature of a 3D object, and define the connection of the segments of the curve-skeleton as its topological feature. In order to retrieve 3D objects, we match 3D objects by the thickness distributions along the segments of their curve-skeletons. Furthermore, we use the connection information of segments to improve the accuracy of partial similarity retrieval. The experimental evaluation shows that our approach yields meaningful results on an articulated object database.

Keywords: 3D object retrieval, content-based similarity search, 3D object partial matching.
1 Introduction
Since 3D models are increasingly created and designed using computer graphics, computer vision, CAD, medical imaging, and a variety of other applications, a large number of 3D models are being shared and offered on the Web. Large databases of 3D objects, such as the Princeton Shape Benchmark Database [1], the 3D Cafe repository [2], and the Aim@Shape network [3], are now publicly available. These datasets are made up of contributions from the CAD community, computer graphic artists, and the scientific visualization community. The problem of searching for a specific shape in a large database of 3D objects is an important area of research. Text descriptors associated with 3D shapes can be used to drive the search process [4], as is the case for 2D images [5]. However, text descriptions may not be available, and furthermore may not apply for part-matching or similarity-based matching. Several content-based 3D shape retrieval algorithms have been proposed [6] [7] [8].
For the purpose of content-based 3D shape retrieval, various features of 3D shapes have been proposed [6] [7] [8] [9]. However, these features are global features. In addition, it is difficult to implement them effectively on relational databases because they include topological information. The shock graph comparison based retrieval method described in a previous paper [10] is based only on the topological information of the shape. However, those methods rely on topological information alone. A geometrical, efficient, partial-similarity-based method is needed to retrieve 3D shapes from a 3D shape database.

In this paper, we propose a novel method to retrieve shapes based on their partial similarity. The proposed method is based on geometrical information and on the topological connection information of parts of 3D objects, rather than on topological information alone. We compare the similarity of two shapes by the connection information and thickness information of their curve-skeleton segments. To retrieve similar shapes, when a key shape is input, we first thin it to a curve-skeleton and attach the thickness of the corresponding partial shape to each curve-skeleton voxel. Secondly, in our implementation, we define the key segment as the segment with the largest volume among the segments of the key object, and we find and retrieve the most similar segments belonging to other objects by means of the Segment Thickness Histogram (STH) of the key segment. Thirdly, we find similar segments that connect to the retrieved segment and are similar to the segments connecting to the key segment. Fourthly, we repeat the third step until every segment of the key object has been used as a retrieval key. If most of the corresponding segments of two 3D objects are similar, we consider the two 3D objects to be partially similar. Finally, we retrieve the 3D objects that have the most similar segments.

Our proposed method has a few important properties. First, our method is invariant to changes in the orientation (translation, rotation and reflection) and scale of 3D objects. For instance, given a human object, we expect to retrieve other human objects, whether they bend, fold their legs, point forward or take other poses, as illustrated in Table 6 and Table 7. Second, our 3D shape retrieval method supports partial similarity retrieval. For instance, given a part of an animal object, we expect to retrieve the whole animal body. Third, an efficient index can be implemented in a spatial database using the connection information of the segments of the curve-skeleton.

The remainder of the paper is organized as follows. In Section 2 we briefly review existing 3D shape retrieval algorithms. Section 3 provides an overview of the Curve-Skeleton Thickness Histogram (CSTH). In Section 4, we describe the novel partial similarity retrieval method. A discussion of an empirical study and the results thereof are presented in Section 5. Finally, in Section 6, we conclude the paper and present ideas for future study.
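One possible reading of the retrieval loop just outlined is sketched below; it is not the authors' exact algorithm, and the segment containers, the connected attribute and the seg_similarity function (an STH comparison defined later in the paper) are hypothetical placeholders.

def partial_similarity(key_obj, db_obj, seg_similarity, threshold):
    # Start from the key segment (largest volume) of the key object, match it to
    # the most similar segment of the database object, then expand the matching
    # along the connections of the curve-skeleton.
    key_seg = max(key_obj.segments, key=lambda s: s.volume)
    best = max(db_obj.segments, key=lambda s: seg_similarity(key_seg, s))
    matched = {key_seg: best}
    frontier = [key_seg]
    while frontier:
        seg = frontier.pop()
        for nb in seg.connected:                  # segments connected to an already matched one
            if nb in matched:
                continue
            cand = max(matched[seg].connected,
                       key=lambda s: seg_similarity(nb, s), default=None)
            if cand is not None and seg_similarity(nb, cand) >= threshold:
                matched[nb] = cand
                frontier.append(nb)
    # Fraction of key segments that found a similar, consistently connected segment;
    # two objects are considered partially similar if this fraction is large.
    return len(matched) / len(key_obj.segments)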
2 Related Works
A number of different approaches have been proposed for the similarity searching problem. Using a simplified description of a 3D object, usually in lower
dimensions (also known as a shape signature), reduces the 3D shape similarity searching problem to comparing these different signatures. The dimensional reduction and the simple nature of these shape descriptors make them ideal for applications involving searching in large databases of 3D objects. Osada et al. [8] propose the use of a shape distribution, sampled from one of many shape functions, as the shape signature. Among the shape functions, the distance between two random points on the surface proved to be the most effective at retrieving similar shapes. In [11], a shape descriptor based on 2D views (images rendered from uniformly sampled positions on the viewing sphere), called the Light Field Descriptor, performed better than descriptors that use the 3D properties of the object. In [12], Kazhdan et al. propose a shape description based on a spherical harmonic representation. Unfortunately, these previous methods cannot be used for partial matching, because their descriptions are based on global features. Another popular approach to shape analysis and matching is based on comparing graph representations of shape. Nicu et al. [9] develop a many-to-many matching algorithm to compute shape similarity from the topological information of the curve-skeleton. Sundar et al. [6] develop a shape retrieval system based on the shape's skeleton graph. These previous methods focus only on the shape's topological information. Unfortunately, the most important information of a shape, its geometric information, is neglected. Moreover, matching shapes by graph comparison is computationally expensive. We previously proposed a novel shape feature of a 3D object, named CSTH (mentioned in Section 1) [13]. It is based on the shape's geometric information. In this paper we add a topological connection comparison process to our 3D shape retrieval process.
3 Curve-Skeleton Thickness Histogram
In this section, we briefly describe the methods used to build the thickness of a curve-skeleton from 3D polygonal models. We also introduce a novel method by which to break a curve-skeleton into independent parts, called segments, according to its topology. In addition, we describe in detail the normalization of the thickness histogram of a single segment of the curve-skeleton.

3.1 Skeleton Extraction
A number of methods of skeleton extraction have been reported [14] [15] [16] [17]. The electrostatic field function [14] can extract well-behaved curves on medial sheets. The algorithm is based upon computing a repulsive force field over a 3D voxel object and computing the divergence of the vector field at each voxel. Then, the topological characteristics of the resulting vector field, such as critical points, are found. Finally, the critical points are connected along the direction of the vector field at the voxels. Even though the result is connected, the extracted curves are divided into a number of segments based on electrostatic concentration. However, we need to split the skeleton into parts based on topology rather than on electrostatic concentration. In our implementation, the initial curve-skeleton based on
the method described in a previous study [14] is first extracted. In Reference [13], we introduced a similarity computation method for 3D shape models based on the thickness distribution along the curve-skeleton of the entire shape model, for the case in which the curve-skeleton of the shape is connected and has no branches. However, the curve-skeleton of a complex shape usually has several branches (Figure 1). We must first merge all of the parts that are separated from the curve-skeleton by the electrostatic concentration into a connected curve. Then, we must break the connected curve into parts according to its topology (Figure 3). Of course, there are a few algorithms that divide a shape into parts. Differently from those algorithms, in this paper we break the curve-skeleton into segments at the connection points of its branches. To compute the thickness of the segments of a curve-skeleton, we use the algorithm proposed in [15]. Because the distance transform (DT) computation algorithm proposed in [15] performs well at computing the approximate least Euclidean distance from a voxel to the boundary surface, we use it to compute the DT value of all voxels on the curve-skeleton extracted in the first step (Figure 2). In this paper, we define the DT value of a voxel of the extracted curve-skeleton as the thickness of the curve-skeleton at this voxel. The Segment Thickness Histogram (STH) is composed of the DT values of all voxels of a segment. The STH is used as a geometrical feature of a 3D shape in our 3D shape similarity retrieval system.
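The following minimal sketch illustrates this thickness computation. It is not the authors' implementation: it substitutes SciPy's Euclidean distance transform for the parameter-controlled DT algorithm of [15], and it assumes the object is given as a boolean voxel volume together with the coordinates of its already-extracted curve-skeleton voxels.

import numpy as np
from scipy.ndimage import distance_transform_edt

def skeleton_thickness(volume, skeleton_voxels):
    # volume: 3D boolean array, True for voxels inside the object.
    # skeleton_voxels: iterable of (x, y, z) integer coordinates on the curve-skeleton.
    # The DT value of a skeleton voxel (its distance to the nearest boundary voxel)
    # is taken as the local thickness of the shape at that voxel.
    dt = distance_transform_edt(volume)
    return [float(dt[x, y, z]) for (x, y, z) in skeleton_voxels]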
Fig. 1. A 3D shape model used to extract the skeleton
3.2 Normalize the Segment Thickness
In order to obtain a Segment Thickness Histogram (STH) representation that is invariant to the scale of a 3D object for similarity measuring, a normalization step is needed. The horizontal axis of the distribution is normalized to a fixed length. Moreover, the vertical axis is scaled by the same ratio as that used for the horizontal normalization. With this normalization strategy, we use the variation of each Segment Thickness Histogram (STH) of the shape as a feature of the shape. Furthermore, this method makes the proportion between the length of a segment and the thickness distribution along the segment a component of the feature.
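One possible reading of this normalization is sketched below: the histogram is resampled to a fixed number of samples, and the thickness values are multiplied by the same resampling ratio, so that the proportion between segment length and thickness is preserved. The fixed length of 64 samples is an arbitrary illustrative choice, not a value given in the paper.

import numpy as np

def normalize_sth(sth, target_len=64):
    # sth: 1D sequence of thickness (DT) values along one segment.
    sth = np.asarray(sth, dtype=float)
    ratio = target_len / len(sth)
    # normalize the horizontal axis to a fixed length by linear interpolation
    x_new = np.linspace(0.0, len(sth) - 1.0, target_len)
    resampled = np.interp(x_new, np.arange(len(sth)), sth)
    # scale the vertical axis by the same ratio
    return resampled * ratio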
Fig. 2. The curve-skeleton with thickness of the 3D model in Figure 1
Fig. 3. The segments of curve-skeleton after splitting the curve-skeleton in Figure 2
4 Retrieval Algorithm
In this section, we describe our retrieval algorithm, which retrieves partially similar shapes by means of their STHs and their segment connections. Using this algorithm, we can retrieve shapes that are partially similar, using only some parts of the key shape rather than all of them.

4.1 The Similarity of Two Different Segments
Having constructed the Segment Thickness Histograms (STHs) of the parts of two 3D objects, we are left with the task of comparing them in order to produce a dissimilarity measure. In our implementation, we have experimented with a simple dissimilarity measure based on the L_N norm with N = 2, shown in Formula 1:

Dissimilarity = Σ_i (X_i − Y_i)²    (1)

where X and Y represent two STHs and X_i represents the thickness of the i-th voxel of STH X. In addition, since there are two different ways to align two STHs (a segment can be traversed from either of its ends), the two alignments produce different dissimilarity values. For convenience, we use the minimum dissimilarity value in our experiments.
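A small sketch of this comparison, assuming the two STHs have already been normalized to a common length; interpreting the two "alignments" as the two traversal directions of a segment is our reading of the text, not something the paper states explicitly.

import numpy as np

def sth_dissimilarity(x, y):
    # Squared L2 dissimilarity of Formula (1), evaluated for both traversal
    # directions of one of the segments; the smaller value is kept.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    forward = float(np.sum((x - y) ** 2))
    backward = float(np.sum((x - y[::-1]) ** 2))
    return min(forward, backward)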
Table 1. Two 3D example objects to be matched
Table 2. The curve-skeleton with thickness of the two 3D example objects shown in Table 1
4.2 Topological Connection Based Retrieval Algorithm
In general, the similarity of objects can be evaluated using only the distances between STHs. We can consider the key object to be similar to the candidate object, or to a part of the candidate object, when each STH of the key object has a similar STH on the candidate object. Conversely, if each STH of a candidate has a similar STH on the key object, we can say that the candidate object is similar to a part of the key object. Obviously, this method is invariant to changes in the orientation (translation, rotation and reflection) and scale of 3D objects. However, the major drawback of an evaluation that depends only on STH similarity is that the topological connection information of the segments is neglected.
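This STH-only baseline could be written as follows; it reuses the sth_dissimilarity sketch of Section 4.1, and the dissimilarity threshold is an assumed parameter that the paper does not specify.

def sth_only_similar(key_sths, cand_sths, threshold):
    # The key object is taken as similar to (a part of) the candidate object
    # when every STH of the key object has a sufficiently similar STH among
    # the candidate's STHs, ignoring how the segments are connected.
    return all(
        min(sth_dissimilarity(k, c) for c in cand_sths) <= threshold
        for k in key_sths
    )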
Table 3. The STHs of the two 3D example objects (cf. Table 1)
To overcome this problem, we define two 3D objects as partially similar if a selected segment is similar according to the distance between STHs and, furthermore, most of the segments connected to the selected segment are similar to the segments connected to the corresponding segment of the other object. In our implementation, to match the two 3D objects shown in Table 1, we first generate the curve-skeleton with thickness of each of them (Table 2). Second, we divide each curve-skeleton with thickness into segments with thickness at the branch points of the curve-skeleton. Third, we normalize the segments with thickness and generate the STHs of the 3D shapes, as shown in Table 3: Table 3(a) is the set of STHs generated from object A (Table 1(a)), and Table 3(b) is the set of STHs generated from object B (Table 1(b)). Finally, we select the STH whose original volume is largest and use it as the key STH to retrieve similar STHs from the spatial database. In this example, we select the STH of the trunk of object A as the key STH of object A. In our experiment, we find that the STH of the trunk of object B is the most similar among the six STHs of object B. Then, we select the STHs that connect to the key STH as new key STHs and retrieve similar STHs that are topologically connected to the previously retrieved STHs, rather than the retrieved STHs themselves. For instance, the STHs of the head and the four limbs of object A are taken as new key STHs to retrieve similar STHs from the spatial database. As illustrated in Table 4, the STH of the head of object A is similar to the STH of the head of object B; furthermore, the head of object A is connected to the trunk of object A, and the head of object B is connected to the trunk of object B. The four limbs of the two objects are also similar in Table 2. Therefore, we consider the two 3D objects to be similar.
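The matching loop described above can be sketched as follows. The record layout, the "sizes" field used to pick the largest segment, the dissimilarity threshold and the final "most segments matched" rule are all illustrative assumptions; the paper leaves these details open and stores the STHs and connections in a spatial (PostgreSQL) database rather than in plain dictionaries.

def partially_similar(key_obj, cand_obj, threshold):
    # key_obj / cand_obj are assumed records of the form
    #   {"sths": [normalized STH per segment],
    #    "sizes": [volume of the partial shape of each segment],
    #    "connections": {segment index: set of connected segment indices}}
    sths_k, sths_c = key_obj["sths"], cand_obj["sths"]
    # key segment = segment with the largest associated volume
    start = max(range(len(sths_k)), key=lambda i: key_obj["sizes"][i])
    best = min(range(len(sths_c)),
               key=lambda j: sth_dissimilarity(sths_k[start], sths_c[j]))
    if sth_dissimilarity(sths_k[start], sths_c[best]) > threshold:
        return False
    matched = {start: best}            # key segment -> candidate segment
    frontier = [start]
    while frontier:
        i = frontier.pop()
        for i2 in key_obj["connections"][i]:
            if i2 in matched:
                continue
            # only candidate segments connected to the segment already matched to i
            options = [j for j in cand_obj["connections"][matched[i]]
                       if j not in matched.values()]
            if not options:
                continue
            j2 = min(options, key=lambda j: sth_dissimilarity(sths_k[i2], sths_c[j]))
            if sth_dissimilarity(sths_k[i2], sths_c[j2]) <= threshold:
                matched[i2] = j2
                frontier.append(i2)
    # "most of the corresponding segments are similar": a 50% rule is assumed here
    return len(matched) >= 0.5 * len(sths_k)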
5 Experiment and Discussion
In order to test the feasibility of the proposed similar object retrieval strategy, we implemented the present algorithms on a Linux system using C++ and PostgreSQL.
Table 4. The most similar objects to the key chess retrieved by ascending order of the similarity
Table 5. The most similar objects to the key chess retrieved by ascending order of the similarity
Table 6. The mostly partially similar objects (the number of similarity STHs = the number of STHs of the query object)
We set the resolution of the volume data in the volume voxelization procedure. We used the Princeton shape database [1] as the test data in the present study and found that the proposed method works well for similar object retrieval. In order to test the feasibility of the proposed similar object retrieval strategy, we evaluated the proposed algorithms in two ways. First, we tested the similar object retrieval strategy using STHs only, without the segment connection information. The curve-skeleton of the key object in Table 4 and Table 5 has only one STH, and the STH of each result object shown in Table 4 matches the STH of the key object. In addition, in order to improve accuracy, we only retrieve objects whose STH count is the same as that of the key object. Another test result obtained by searching from one STH is shown in Table 5: here we retrieve 3D objects that have only one segment on their curve-skeleton and whose STHs are similar to the STH of the key object. It turned out that the STH similarity algorithm yields meaningful results when the key object has only one segment on its curve-skeleton.
Table 7. The mostly partially similar objects (the number of similarity STHs < the number of STHs of the query object)
The query object of the experiments (Tables 6-8) has six segments on its curve-skeleton (shown in Table 2(a)). These segments correspond to a head (segment number 4), the trunk of the body (segment number 5), and four limbs (segment numbers 0, 1, 2, and 3). Since each segment has its own thickness histogram, the key object has six independent thickness histograms (Table 3). In order to find the objects whose STHs match the key object for the head, the trunk of the body, and the four limbs, we need to find the best objects from each result set of the six parts. We retrieve 3D objects that each have the
Table 8. The mostly partially similar objects (the connection information between segments is used)
same number of segments as the key object and whose STHs are similar to the corresponding STHs of the key object, as shown in Table 6. To analyze the performance of the partial similarity retrieval, we also retrieved 3D objects such that each STH of the key object has a similar STH on their curve-skeleton. Therefore, the number of segments of the key object is equal to or less than the number of segments of each retrieved object. For instance, in Table 7, result objects such as 12, 15 and 19 are animal objects with seven segments; obviously, the tail segment on their curve-skeletons does not exist on the curve-skeleton of the key object.
Second, we tested the partial similarity retrieval using the similarity of corresponding STHs together with the connections between segments. The results shown in Table 8 indicate that this approach yields more meaningful results when the topological connection of parts is used. In addition, the partial similarity retrieval is shown to be effective in our method; for instance, results 5 and 7 in Table 8 are partially similar to the key object. It also turned out that our method is invariant to changes in the orientation, translation, rotation, pose and scale of 3D articulated objects.
6 Conclusions and Future Studies
The 3D object retrieval method proposed in this paper is based on partial geometry similarity between 3D objects. First, the proposed method extracts a curve-skeleton with thickness. Second, we compute the dissimilarity of the STH (mentioned in Section 1) of each part with respect to the objects. Third, we propose a novel 3D object partial similarity retrieval strategy using the computed dissimilarity and the topological connection information of the parts of a 3D object. Finally, we implement our method on a 3D shape database. The present experiments show that it is possible to effectively retrieve 3D objects by partial similarity. Since the STH is extracted from 3D objects using the geometrical information of a part of the 3D object, the 3D objects can be compared based on geometrical information rather than on topological information alone. Since each STH is a partial feature of a 3D object, the STH allows two 3D objects to be compared based on their partial features, rather than on their global features alone. Furthermore, since the topological connection information of STHs is a simple topological feature of a 3D object, an efficient index can be implemented in a spatial database using the connections. The index can improve the efficiency of partial similarity retrieval on the STH feature. Good efficiency and good results were obtained in the present experiments using the proposed method. In the future, we intend to develop an algorithm that can generate a curve-skeleton with thickness from a 2D sketch of a shape, and then develop an efficient algorithm that can search for 3D objects from a 2D sketch.
Acknowledgment Special thanks to Dr. Nicu D.Cornea for the voxelization code. This research is partially supported by the Special Coordination Fund for Promoting Science and Technology, and Grant-in-Aid for Fundamental Scientific Research 16200005, 17650031 and 17700117 from Ministry of Education, Culture, Sports, Science and Technology Japan, and by 21st century COE project of Japan Society for the Promotion of Science.
References 1. Shilane, P., Min, P., Kazhdan, M., Funkhouser, T.: The Princeton Shape Benchmark. Shape Modeling Applications, 2004. Proceedings, 167–178 (2004) 2. 3D Cafe, http://www.3dcafe.com/asp/freestuff.asp
3. AIM@SHAPE, Network of Excellence, http://www.aimatshape.net/ 4. Princeton Shape Retrieval and Analysis Group, 3D Model Search Engine, http:// shape.cs.princeton.edu/search.html 5. Google Image Search, http://www.google.com/ 6. Sundar, H., Silver, D., Gagvani, N., Dickinson, S.: Skeleton based shape matching and retrieval. Shape Modeling International, 2003, 130–139 (2003) 7. Hilaga, M., Shinagawa, Y., Kohmura, T., Kunii, T.L.: Topology matching for fully automatic similarity estimation of 3D shapes. In: Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pp. 203–212 (2001) 8. Osada, R., Funkhouser, T., Chazelle, B., Dobkin, D.: Shape distributions. ACM Transactions on Graphics (TOG) 21(4), 807–832 (2002) 9. Cornea, N., Demirci, M., Silver, D., Shokoufandeh, A., Dickinson, S.J., Kantor, P.B.: 3D Object Retrieval using Many-to-many Matching of Curve Skeletons. Shape Modeling and Applications (2005) 10. Siddiqi, K., Shokoufandeh, A., Dickinson, S.J., Zucker, S.W.: Shock Graphs and Shape Matching. International Journal of Computer Vision 35(1), 13–32 (1999) 11. Chen, D., Tian, X., Shen, Y., Ouhyoung, M.: On Visual Similarity Based 3 D Model Retrieval. Computer Graphics Forum 22(3), 223–232 (2003) 12. Kazhdan, M., Funkhouser, T., Rusinkiewicz, S.: Rotation invariant spherical harmonic representation of 3D shape descriptors. In: Proceedings of the Eurographics/ACM SIGGRAPH symposium on Geometry processing, pp. 156–164. ACM Press, New York (2003) 13. Lu, Y., Kaneko, K., Makinouchi, A.: 3D Shape Matching Using Curve-Skeletons with Thickness. In: 1st Int. Workshop on Shapes and Semantics (June 2006) 14. Cornea, N.D., Silver, D., Yuan, X., Balasubramanian, R.: Computing hierarchical curve-skeletons of 3D objects. The Visual Computer 21(11), 945–955 (2005) 15. Gagvani, N., Silver, D.: Parameter-controlled volume thinning. Graphical Models and Image Processing 61(3), 149–164 (1999) 16. Wu, F., Ma, W., Liou, P., Liang, R., Ouhyoung, M.: Skeleton Extraction of 3D Objects with Visible Repulsive Force. Computer Graphics Workshop (2003) 17. Sharf, A., Lewiner, T., Thomas Lewiner, A.S., Kobbelt, L.: On-the-fly curveskeleton computation for 3d shapes. Eurographics (2007)
Facial Occlusion Reconstruction: Recovering Both the Global Structure and the Local Detailed Texture Components Ching-Ting Tu and Jenn-Jier James Lien Robotics Laboratory, Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C. {vida, jjlien}@csie.ncku.edu.tw http://robotics.csie.ncku.edu.tw
Abstract. An automatic facial occlusion reconstruction system based upon a novel learning algorithm called the direct combined model (DCM) approach is presented. The system comprises two basic DCM modules, namely a shape reconstruction module and a texture reconstruction module. Each module models the occluded and non-occluded regions of the facial image in a single, combined eigenspace, thus preserving the correlations between the geometry of the facial features and the pixel grayvalues, respectively, in the two regions. As a result, when shape or texture information is available only for the nonoccluded region of the facial image, the optimal shape and texture of the occluded region can be reconstructed via a process of Bayesian inference within the respective eigenspaces. To enhance the quality of the reconstructed results, the shape reconstruction module is rendered robust to facial feature point labeling errors by suppressing the effects of biased noises. Furthermore, the texture reconstruction module recovers the texture of the occluded facial image by synthesizing the global texture image and the local detailed texture image. The experimental results demonstrate that compared to existing facial reconstruction systems, the reconstruction results obtained using the proposed DCM-based scheme are quantitatively closer to the ground truth. Keywords: Facial reconstruction, facial synthesis, eigenspace, facial occlusion.
1 Introduction

The performance of automatic face recognition, facial expression analysis and facial pose estimation schemes is largely dependent upon the amount of information available in the input facial images. However, in real life, facial images are invariably occluded to a greater or lesser extent, and hence the performance of such schemes is inevitably degraded. It is necessary to develop the means to recover the occluded region(s) of the facial image such that the performance of these applications can be improved. Saito et al. [14] proposed a method for removing eyeglasses and reconstructing the facial image by applying principal component analysis (PCA) to eigenspaces having
no eyeglass information. Similarly, Park et al. [13] removed eyeglasses from facial images by repainting the pixels in the occluded region of the image with the grayvalues of the corresponding region of the mean facial image prior to the PCA reconstruction process. However, in both studies, the reconstruction process was performed based upon eigenspaces derived from the entire facial image rather than from the occluded and non-occluded regions, respectively. As a result, the two schemes are capable only of reconstructing facial images with long and thin occluded regions, e.g. occlusion by a pair of eyeglasses. If the major facial features, e.g. the eyes or the nose, are occluded, the schemes yield highly unpredictable and unrealistic reconstruction results. Furthermore, the reconstructed images tend to be notably blurred since both schemes use the Gaussian-distributed PCA process to model the facial images, whereas such images typically have a non-Gaussian distribution. To resolve this problem, the facial reconstruction systems presented in [7], [8] and [9] separated each facial image into its facial shape and facial texture, respectively, utilizing the face models introduced in [1], [3] and [15]. In contrast to the iterative facial reconstruction process presented in [9], Hwang et al. [7], [8] proposed a noniterative process for reconstructing the occluded region of an input face using facial shape and facial texture models. Each model consisted of one eigenspace and one sub-eigenspace, with the former containing the whole facial shape or texture information and the latter containing only the shape or texture information of the nonoccluded region. In the proposed approach, the shape or texture information of the non-occluded region was reconstructed via a linear combination of the sub-eigenspace and the corresponding weight vector. The whole facial image was then reconstructed by applying the same weight vector to the whole-face eigenspace. However, the significant characters of the two eigenspaces are different, and thus inherent variances between two different subjects may be suppressed if the same weight vectors are applied to both. In contrast to the methods described above, apply a Gaussian distributed PCA process, the patch-based non-parametric sampling methods presented in [2], [5] and [11] synthesize facial images based upon local detailed features. In the psychological evaluations performed in [12], it was shown that facial features are correlated rather than independent. However, the localized characteristic of patch-based approaches results in a loss of information describing the overall geometric relationships between the individual facial features. This paper proposes a learning-based facial occlusion reconstruction system comprising two DCM modules, namely a shape reconstruction module and a texture reconstruction module. Adopting a similar approach to that used in [3], the proposed system normalizes the texture image by warping the facial image to the mean-shape coordinates. The DCM approach used in the two modules facilitates the direct analysis of the geometric and grayvalue correlations of the occluded and nonoccluded regions of the face by coupling the shape and texture of the two regions within single shape and texture eigenspaces, respectively. Given the shape or texture of the non-occluded region of the face, the DCM modules enable the optimal shape or texture of the occluded region to be reconstructed even though the two regions of the
face are modeled within a single eigenspace. In practice, the quality of the reconstructed facial shape is adversely effected by errors in the facial feature positions when labeling the features in the non-occluded region of the face. However, the shape reconstruction module developed in this study is specifically designed to tolerate such misalignments by taking account of these noise sources. Furthermore, the quality of the texture reconstruction results is enhanced by synthesizing the global texture image, i.e. a smooth texture image containing the global geometric facial structure, and a local detailed texture image, i.e. a difference image between the texture image and the corresponding global texture image.
2 Direct Combined Model Algorithm

The DCM algorithm assumes the existence of two related classes, i.e. X ∈ R^m and Y ∈ R^n. Given an observable (or known) vector x ∈ X, such as the shape or pixel grayvalues of the non-occluded facial region, the objective of the DCM modules developed in this study is to estimate (i.e. recover) the corresponding unobservable (or unknown) vector y ∈ Y, i.e. the shape or pixel grayvalues of the occluded region, based on training datasets X and Y. According to the maximum a posteriori (MAP) criterion, the optimal solution of the unknown y can be obtained by maximizing the posterior probabilistic distribution P(y | x, θ), i.e.

$\hat{y} = \arg\max_y P(y \mid x, \theta) = \arg\max_y P(y \mid \theta)\, P(x \mid y, \theta), \qquad (1)$
where θ denotes the model parameters, i.e. x̄, ȳ, C_XY (or C_YX) and C_XX (or C_YY), in which x̄ and ȳ denote the mean vectors of classes X and Y, respectively, C_XY (or C_YX) is the cross-covariance matrix of X and Y (or Y and X), and C_XX (or C_YY) is the covariance matrix of X (or Y), respectively. Assuming that P(y | x, θ) has a Gibbs distribution [4], [11], then

$P(y \mid x, \theta) \propto \exp\{-E_G(y, x, \theta)\}, \qquad (2)$
where E_G(·) is the Gibbs potential function, which describes the strength of the correlation between x and y based on the information contained within the training dataset and the model parameters θ. Thus, Eq. (1) can be reformulated as an energy minimization problem of the form

$\hat{y} = \arg\min_y E_G(y, x, \theta). \qquad (3)$
In the reconstruction system presented in this study, the two training datasets, i.e. X and Y, are modeled by combining them into a single joint Gaussian distribution using the PCA method. As a result, the combined training dataset, comprising p training samples, can be represented as an (m+n)×p matrix, [X^T Y^T]^T, in which each column corresponds to an unbiased, concatenated sample vector [(x − x̄)^T (y − ȳ)^T]^T.
Applying the singular value decomposition (SVD) process, the covariance matrix of the coupled training dataset can be expressed as

$\begin{bmatrix} X \\ Y \end{bmatrix}\begin{bmatrix} X \\ Y \end{bmatrix}^T = \begin{bmatrix} C_{XX} & C_{XY} \\ C_{YX} & C_{YY} \end{bmatrix} = \begin{bmatrix} \begin{bmatrix} U_X \\ U_Y \end{bmatrix} & U_\Delta \end{bmatrix} \begin{bmatrix} \Sigma_K^2 & 0 \\ 0 & \Sigma_\Delta^2 \end{bmatrix} \begin{bmatrix} \begin{bmatrix} U_X \\ U_Y \end{bmatrix} & U_\Delta \end{bmatrix}^T = \begin{bmatrix} U_X \Sigma_K^2 U_X^T & U_X \Sigma_K^2 U_Y^T \\ U_Y \Sigma_K^2 U_X^T & U_Y \Sigma_K^2 U_Y^T \end{bmatrix} + U_\Delta \Sigma_\Delta^2 U_\Delta^T, \qquad (4)$
where U, Σ and U_Δ represent the combined eigenvector matrix, the combined eigenvalue matrix and the matrix of the remaining (m+n)−K eigenvectors, respectively. According to the general properties of PCA, the linear combination of the first K (K ≪ m+n) eigenvectors, [U_X^T U_Y^T]^T, sorted in descending order of their corresponding eigenvalues, sufficiently represents all of the significant variances within the training dataset, and thus the remaining eigenvectors, U_Δ, can be discarded. The resulting K-dimensional combined eigenspace, i.e. the DCM, can be used to reconstruct any new feature pair (x̂_|w, ŷ_|w) via a linear combination of the eigenspace and the corresponding K-dimensional weight vector, w, i.e.

$\begin{bmatrix} x \\ y \end{bmatrix} \approx \begin{bmatrix} \hat{x}_{|w} \\ \hat{y}_{|w} \end{bmatrix} = \begin{bmatrix} U_X \\ U_Y \end{bmatrix} w + \begin{bmatrix} \bar{x} \\ \bar{y} \end{bmatrix}, \quad \text{where } w = \begin{bmatrix} U_X \\ U_Y \end{bmatrix}^T \left( \begin{bmatrix} x \\ y \end{bmatrix} - \begin{bmatrix} \bar{x} \\ \bar{y} \end{bmatrix} \right). \qquad (5)$
In the PCA technique, a set of orthogonal eigenvectors is obtained by minimizing the mean-square error (MSE) between the input data and the corresponding reconstruction result. In the current study, the minimum mean-square error (MMSE) is used as the criterion for the energy minimization problem given in Eq. (3). Hence, Eq. (3) can be expressed in terms of the expected posterior distribution P(y | x, θ) as follows:

$\hat{y} = \arg\min_{\hat{y}} \int_Y \| y - \hat{y} \|^2\, P(y \mid x, \theta)\, dY = \arg\min_{y, w} \frac{1}{2} \left\| \begin{bmatrix} x \\ y \end{bmatrix} - \begin{bmatrix} \hat{x}_{|w} \\ \hat{y}_{|w} \end{bmatrix} \right\|^2. \qquad (6)$
Substituting the parameters of the combined model, θ, into Eq. (6) and applying the Penrose conditions method [6], Eq. (6) becomes

$\hat{y} = C_{YX} C_{XX}^{-1} (x - \bar{x}) = \bar{y} + U_Y U_X^{\dagger} (x - \bar{x}), \qquad (7)$

where U_X^† is the right inverse matrix of the non-square matrix U_X. In contrast to the schemes presented in [7] and [16], in which the SVD algorithm is applied to approximate the inverse of this non-square matrix indirectly, the current study directly uses the following procedure to substitute the matrix inverse, U_X^†. Since U_X U_X^† = I and the combined eigenspace, [U_X^T U_Y^T]^T, is an orthonormal matrix, then U_X^T U_X = I − U_Y^T U_Y. As a result, it can be inferred that:
$U_X^T I = U_X^T \;\Leftrightarrow\; U_X^T (U_X U_X^{\dagger}) = U_X^T \;\Leftrightarrow\; (U_X^T U_X) U_X^{\dagger} = U_X^T \;\Leftrightarrow\; (I - U_Y^T U_Y) U_X^{\dagger} = U_X^T \;\Leftrightarrow\; U_Y (I - U_Y^T U_Y) U_X^{\dagger} = U_Y U_X^T \;\Leftrightarrow\; (I - U_Y U_Y^T) U_Y U_X^{\dagger} = U_Y U_X^T, \qquad (8)$

in which the square matrix (I − U_Y U_Y^T) is invertible since classes X and Y are correlated. Hence, the U_Y U_X^† term in Eq. (7) can be replaced by (I − U_Y U_Y^T)^{-1} U_Y U_X^T, with the result that Eq. (7) becomes

$\hat{y}(x) = \bar{y} + (I - U_Y U_Y^T)^{-1} U_Y U_X^T (x - \bar{x}). \qquad (9)$
Here, the inverse of the residual covariance matrix, i.e. (I − U_Y U_Y^T)^{-1}, is a normalization term which renders the correlation between X and Y insensitive to variances within each class. For example, if X represents the grayvalues of the non-eye regions of the facial image and Y represents the grayvalues of the eye region, the dynamic ranges of X and Y are clearly different. However, the normalization process renders the two ranges approximately equal to one another. In addition, the DCM algorithm assumes that classes X and Y are correlated. If this assumption is not made, i.e. X and Y are considered to be statistically uncorrelated, then U_Y U_X^T becomes a zero matrix, i.e. ŷ(x) = ȳ.
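The following numerical sketch illustrates Eqs. (4)-(9): the combined eigenspace is estimated from paired training vectors and Eq. (9) is then used to recover the unknown part from the known part. It is only an illustration of the algebra, not the authors' implementation; the row-wise data layout, the helper names and the choice of K are assumptions.

import numpy as np

def fit_dcm(X, Y, K):
    # X: (p, m) observable training vectors, Y: (p, n) unobservable ones.
    # Returns the means and the first K combined eigenvectors, split into
    # their U_X and U_Y blocks (cf. Eq. (4)).
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Z = np.hstack([X - x_mean, Y - y_mean])           # (p, m+n) combined data
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)  # rows of Vt: eigenvectors
    U = Vt[:K].T                                      # (m+n, K)
    m = X.shape[1]
    return x_mean, y_mean, U[:m, :], U[m:, :]

def dcm_infer(x, x_mean, y_mean, U_X, U_Y):
    # Eq. (9): y_hat(x) = y_mean + (I - U_Y U_Y^T)^(-1) U_Y U_X^T (x - x_mean)
    n = U_Y.shape[0]
    M = np.linalg.solve(np.eye(n) - U_Y @ U_Y.T, U_Y @ U_X.T)
    return y_mean + M @ (x - x_mean)

The right singular vectors of the centred data matrix are the same directions that diagonalize the combined covariance in Eq. (4), which is why a single SVD suffices here.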
3 Reconstruction System

The proposed reconstruction system for partially-occluded facial images is based on a joint Gaussian distribution assumption. However, in practice, the distribution of facial images actually has the form of a complicated manifold in a high-dimensional space, and thus it is inappropriate to model this distribution using a Gaussian distribution model. To resolve this problem, the current system separates the facial shape and texture of each image, rendering both facial properties more suitable for modeling using a Gaussian approach. As shown in Fig. 1, the proposed facial occlusion reconstruction system comprises two separate DCM modules, namely the shape reconstruction module and the texture reconstruction module. In the training process, the facial feature points of each facial image in the training set are manually labeled to generate the corresponding facial shape, S, and the mean facial shape coordinates, S̄, are then derived. Thereafter, each facial texture image with facial shape coordinates S is warped to the mean facial shape S̄ using a texture-warping transformation function W [3] to generate the corresponding normalized texture image T. The resulting facial shapes {S} and canonical textures {T} of the training images are then used in the shape and texture modules, respectively, as described in the following.
(Figure 1 block diagram; abbreviations: OD&A = occlusion detection and ASM search [10], W / W⁻¹ = forward / inverse texture-warping transformation, SR = shape reconstruction, TR = texture reconstruction.)
Fig. 1. Framework of partially-occluded facial image reconstruction system comprising shape reconstruction module and texture reconstruction module
In the shape reconstruction module, the occluded region of the input image and the facial shape of the non-occluded region, i.e. Sno, are detected automatically using the method prescribed in [10]. The facial shape of the occluded region, i.e. So, is then reconstructed by the shape reconstruction DCM algorithm by applying a process of Bayesian inference to the facial shape of the non-occluded region to give the complete facial shape, S. Meanwhile, in the texture reconstruction module, the input texture image of the non-occluded region is warped from its original shape coordinates S to the mean shape coordinates S̄ using the transformation function W to generate the corresponding normalized texture image of the non-occluded region, i.e. Tno. The canonical facial texture of the occluded region, To, is then reconstructed from Tno using the texture reconstruction DCM algorithm. Finally, the complete canonical facial texture T (i.e. Tno+To) is warped from the mean facial shape coordinates S̄ back to the original facial shape coordinates S in order to generate the final reconstruction result. The reconstruction process illustrated in Fig. 1 presents the particular case in which both eyes are occluded. However, due to the combined model approach, the reconstruction system developed in this study is capable of reconstructing frontal-view facial images containing occluded regions in other facial features, such as the nose and the mouth, without modeling an additional combined model.

3.1 Robustness of DCM Shape Reconstruction Module to Facial Feature Point Labeling Errors
As shown in Fig.2, the DCM shape reconstruction module comprises a training process and a reconstruction process. In the training process, a K-dimensional shape eigenspace is constructed based upon a total of p manually-labeled facial shapes S. The performance of the facial shape reconstruction module is highly dependent on the
Fig. 2. Workflow of DCM shape reconstruction module
accuracy with which the individual facial feature points in the non-occluded region of the face image are identified. To improve the robustness of the shape reconstruction module, each training facial shape S is perturbed by q random biased noise vectors to generate a total of q biased facial shapes S'. The biased noise is randomly generated and is bounded by σ_cn²(Σ_K² − σ_cn² I)^{-1} in accordance with the recommendations of the subspace sensitivity analyses presented in [6], where Σ_K is the matrix of the first K eigenvalues and σ_cn² is the norm of the covariance matrix of the expected residual vector based on the p training shape vectors S. Note that this residual vector is defined as the distance between the input facial shape and the corresponding reconstructed shape obtained using the K-dimensional shape eigenspace. A new facial shape eigenspace is then constructed based on the total of p×q facial shapes S'. Once the non-occluded and occluded regions of the input image have been detected and separated, the new facial shape eigenspace can be rearranged according to the combined eigenspace formula of the DCM algorithm, i.e. the non-occluded part, X, should be in the upper rows of the combined model, while the occluded part, Y, should be in the lower rows. Importantly, rearranging the eigenspace has no effect on the reconstruction result, since exchanging any two row vectors in the combined eigenspace changes only their relative position in the eigenspace, i.e. the values of their elements are unchanged. Finally, the rearranged combined eigenspace is used to reconstruct the shape of the occluded region So by replacing x in Eq. (9) with Sno.
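A sketch of this augmentation step is given below. The isotropic Gaussian noise with a user-chosen scale stands in for the subspace-sensitivity bound quoted above, and the seed, noise scale and data layout are illustrative assumptions only.

import numpy as np

def augment_training_shapes(S, q, noise_scale):
    # S: (p, 2 * n_points) matrix of manually labeled facial shapes.
    # Each shape is replicated q times with a small random perturbation,
    # yielding p * q biased shapes from which the new shape eigenspace is
    # re-estimated so that it tolerates feature-point labeling errors.
    rng = np.random.default_rng(0)
    noisy = [S + rng.normal(scale=noise_scale, size=S.shape) for _ in range(q)]
    return np.vstack(noisy)              # (q * p, 2 * n_points)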
3.2 Recovery of Global Structure and Local Detailed Texture Components Using DCM Texture Reconstruction Module

As shown in Fig. 3, the texture of an input image is reconstructed by integrating the global texture DCM and the local detailed texture DCM. The global texture image, i.e. Tg, is a smooth texture image containing the global geometric facial structure, while the local detailed texture image, i.e. Tl, represents the difference between Tg and the normalized texture image T, and contains the subtle details of the facial texture. The objective function of the DCM texture reconstruction module can be formulated as

$T_o^g = \arg\max P(T_{no}^g \mid T_o^g, \theta)\, P(T_o^g \mid \theta), \qquad T_o^l = \arg\max P(T_{no}^l \mid T_o^l, \theta)\, P(T_o^l \mid \theta), \qquad (10)$
(Figure 3 block diagram; abbreviations: OD = occlusion detection, GL = generation of the local detailed texture image, TRg / TRl = global / local detailed texture reconstruction.)
Fig. 3. Workflow of DCM texture reconstruction module
where T_no^g and T_no^l are the global and local detailed texture components of Tno, respectively, and T_o^g and T_o^l are the global and local detailed texture components of To, respectively. In the training process, the texture image training dataset, {T}, is used to construct a K-dimensional global texture eigenspace. The local detailed texture images of the training dataset {Tl} are then calculated and used to construct the local detailed texture eigenspace. Each local detailed texture image is derived by calculating the difference between its texture image, T, and the corresponding global texture image, Tg. In the reconstruction process, the texture of the occluded region To is inferred via the following procedure:
1. According to the occluded region and the non-occluded region of the input texture image, the global eigenspace and the local detailed eigenspace are rearranged using the combined eigenspace formula of the DCM given in Eq. (4).
2. Replacing x in Eq. (9) with Tno, the global texture of the occluded region, i.e. T_o^g, is reconstructed using the global texture DCM.
3. The image T', which contains the texture of the non-occluded region Tno and the current reconstruction result T_o^g, is projected onto the K-dimensional global texture eigenspace, and the corresponding projection weight is then used to reconstruct T''. The local detailed texture components of the non-occluded region are then extracted by calculating T_no^l = T'' − T'.
4. Replacing x in Eq. (9) with T_no^l, the local detailed texture of the occluded region T_o^l is reconstructed using the local detailed DCM.
5. The final texture result is obtained by synthesizing Tno with the reconstruction results T_o^g and T_o^l, i.e. T = Tno + T_o^g + T_o^l.
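These steps can be sketched with the dcm_infer() helper from the Section 2 sketch above. The concatenation order (non-occluded pixels first), the tuple packaging of the fitted models and the vectorized image representation are assumptions made for illustration; this is not the authors' code.

import numpy as np

def reconstruct_occluded_texture(T_no, global_dcm, local_dcm, global_eigs):
    # global_dcm / local_dcm: (x_mean, y_mean, U_X, U_Y) tuples already
    # rearranged for the detected occlusion (step 1).
    # global_eigs: (mean, U) of the full K-dimensional global texture eigenspace,
    # stored with the non-occluded pixels first.
    T_o_g = dcm_infer(T_no, *global_dcm)              # step 2
    mean, U = global_eigs                             # step 3: project and rebuild
    T_prime = np.concatenate([T_no, T_o_g])
    w = U.T @ (T_prime - mean)
    T_double_prime = mean + U @ w
    T_no_l = (T_double_prime - T_prime)[: len(T_no)]  # local detail, non-occluded part
    T_o_l = dcm_infer(T_no_l, *local_dcm)             # step 4
    return np.concatenate([T_no, T_o_g + T_o_l])      # step 5: synthesis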
4 Experimental Results

The performance of the proposed reconstruction system was evaluated by performing a series of experimental trials using training and testing databases comprising 205 and
60 facial images, respectively. The images were acquired using a commercial digital camera at different times, in various indoor environments. Eighty-four facial feature points were manually labeled on each training and testing facial image to represent the ground truth of the facial shape. Specific facial feature regions of the testing images were then occluded manually. Figure 4 presents representative examples of the reconstruction results obtained using the proposed method for input images with a variety of occlusion conditions. Figures 4(a) and 4(b) show the occluded facial images and the original facial images, respectively. Figure 4(c) presents the reconstruction results obtained using the shape and global texture DCMs. Meanwhile, Fig. 4(d) presents the reconstruction results obtained when the texture is reconstructed using not only the global texture DCM, but also the local detailed texture DCM. Comparing the images presented in Fig. 4(d) with
Fig. 4. Reconstruction results obtained using DCM method: (a) occluded facial images, (b) original facial images, (c) reconstructed facial images using global texture DCM only, and (d) final reconstruction results using both global texture DCM and local texture DCM

Table 1. Average and standard deviation of facial shape and facial texture reconstruction errors for images in testing database with different levels of occlusion. Note that the occlusion rate data indicate the ratio of the occluded area to the non-occluded area in the facial image; Ave: Average; Std. Dev: Standard deviation of errors.

Facial features   Ave. Error Shape (pixels)   Ave. Error Texture (grayvalues)   Std. Dev. Shape (pixels)   Std. Dev. Texture (grayvalues)   Occlusion Rate
Left Eye          1.2                         6.6                               1.1                        1.7                              10%
Right Eye         1.3                         6.5                               1.0                        1.8                              10%
Both Eyes         1.4                         8.0                               1.7                        3.6                              24%
Nose              1.0                         7.2                               1.4                        3.0                              16%
Mouth             1.6                         6.8                               1.5                        3.2                              20%
Fig. 5. Reconstruction results: (a) occluded facial images, (b) original facial images, (c) reconstructed texture images using method presented in [13], (d) reconstructed texture images using method presented in [8], and (e) reconstructed texture images using current DCM method. The digits within the images represent the average grayvalue evaluation error of the corresponding pixels in the original non-occluded image, while the digits in the columns next to these images represent the average grayvalue error over all of the images in the test database. Note that each facial texture image has a size of 100*100 pixels.
those presented in Fig. 4(b), it can be seen that the use of the two texture DCMs yields a highly accurate reconstruction of the original facial image. Table 1 presents the average reconstructed shape and texture errors computed over all the images in the testing database. In general, the results show that the magnitudes of both errors increase as the level of occlusion increases or as the geometrical complexity of the occluded facial feature increases. Figure 5 compares the reconstruction results obtained using the proposed DCM method with those obtained using the occlusion recovery schemes presented in [8] and [13], respectively. The data presented within the reconstructed images indicate the average difference between the grayvalues of the pixels in the restored region of the reconstructed image and the grayvalues of the corresponding pixels in the original non-occluded image, while the data in the columns next to these images indicate the average grayvalue error of the restored pixels in the occluded region computed over all 60 texture images within the test database. Overall, the results demonstrate that the images reconstructed using the current DCM-based method are closer to the original un-occluded facial images than those obtained using the schemes presented in [8] or [13].
5 Conclusions

This study has presented an automatic facial occlusion reconstruction system comprising two DCM-based modules, namely a shape reconstruction module and a
texture reconstruction module, respectively. The experimental results have demonstrated that the images reconstructed by the proposed system closely resemble the original non-occluded images. The enhanced reconstruction performance of the proposed system arises as a result of its robustness toward misalignments of the facial features when constructing the facial shape and its ability to recover both the global structure and the local detailed facial texture components of the input image. Unlike PCA-based methods, the DCM-based system presented in this study provides the ability to reconstruct the occluded region of an input image directly from an image of the non-occluded region even though they are initially combined in a single eigenspace. Overall, the experimental results indicate that the DCM-based facial occlusion reconstruction system presented in this study represents a promising means of enhancing the performance of existing automatic face recognition, facial expression recognition, and facial pose estimation applications.
References 1. Blanz, V., Romdhani, S., Vetter, T.: Face Identification across Different Poses and Illuminations with a 3D Morphable Model. In: IEEE Intl. Conf. on FG, pp. 202–207 (2002) 2. Chen, H., Xu, Y.Q., Shum, H.Y., Zhu, S.C., Zhen, N.N.: Example-Based Facial Sketch Generation with Non-Parametric Sampling. In: Proceedings of ICCV, pp. 433–438 (2001) 3. Cootes, T.F., Taylor, C.J.: Statistical Models of Appearance for Computer Vision. Technical Report, Univ. of Manchester (2000) 4. Fisker, R.: Making Deformable Template Models Operational. PhD Thesis, Informatics and Mathematical Modelling, Technical University of Denmark (2000) 5. Freeman, W.T., Pasztor, E.C.: Learning Low-Level Vision. In: ICCV, pp. 1182–1189 (1999) 6. Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore (1996) 7. Hwang, B.W., Blanz, V., Vetter, T., Lee, S.W.: Face Reconstruction from a Small Number of Feature Points. In: Proceedings of ICPR, pp. 842–845 (2000) 8. Hwang, B.W., Lee, S.W.: Reconstruction of Partially Damaged Face Images Based on a Morphable Face Model. IEEE Trans. on PAMI 25(3), 365–372 (2003) 9. Jones, M.J., Poggio, T.: Multidimensional Morphable Models: A Framework for Representing and Matching Object Classes. IJCV 29(2), 107–131 (1998) 10. Lanitis, A.: Person Identification from Heavily Occluded Face Images. In: Handschuh, H., Hasan, M.A. (eds.) SAC 2004. LNCS, vol. 3357, pp. 5–9. Springer, Heidelberg (2004) 11. Liu, C., Shum, H.Y., Zhang, C.S.: A Two-Step Approach to Hallucinating Faces: Global Parametric Model and Local Nonparametric Model. In: Proc. of CVPR, pp. 192–198 (2001) 12. Mo, Z., Lewis, J.P., Neumann, U.: Face Inpainting with Local Linear Representations. In: Proceedings of British Machine Vision Conference, vol. 1, pp. 347–356 (2004) 13. Park, J.S., Oh, Y.H., Ahn, S.C., Lee, S.W.: Glasses Removal from Facial Image Using Recursive PCA Reconstruction. IEEE Trans. on PAMI 27(5), 805–811 (2005) 14. Saito, Y., Kenmochi, Y., Kotani, K.: Estimation of Eyeglassless Facial Images Using Principal Component Analysis. In: Proceedings of ICIP, vol. 4, pp. 197–201 (1999) 15. Vetter, T., Poggio, T.: Linear Object Classes and Image Synthesis from a Single Example Image. IEEE Trans. on PAMI 19(7), 733–742 (1997) 16. Wu, C., Liu, C., Shum, H.Y., Xu, Y.Q., Zhang, Z.: Automatic Eyeglasses Removal from Face Images. IEEE Trans. PAMI 26, 322–336 (2004)
Cyclic Linear Hidden Markov Models for Shape Classification Vicente Palaz´on, Andr´es Marzal, and Juan Miguel Vilar Dept. Llenguatges i Sistemes Inform` atics. Universitat Jaume I de Castell´ o. Spain {palazon,amarzal,jvilar}@lsi.uji.es
Abstract. In classification tasks, shape descriptions, combined with matching techniques, must be robust to noise and invariant to transformations. Most of these distortions are relatively easy to handle, particularly if we represent contours by sequences. However, starting point invariance seems to be difficult to achieve. The concept of cyclic sequence, a sequence that has no initial/final point, can be of great help. We propose a new methodology to use HMMs to classify contours represented by cyclic sequences. Experimental results show that our proposal significantly outperforms other methods in the literature. Keywords: Cyclic Sequences, Hidden Markov Models, Shape Classification.
1
Introduction
Shape classification is a very important problem with applications in several areas such as industry, medicine, biometrics and even entertainment. The first step towards the design of a shape classifier is feature extraction. Shapes can be represented by their contours or by their regions [1]. However, contour based descriptors are widely used as they preserve the local information which is important in the classification of complex shapes. The next step is shape matching. Dynamic Time Warping (DTW) based shape matching is being increasingly applied [2,3,4]. A DTW-based dissimilarity measure seems natural for optimally aligning contours, since it is able to align parts instead of points and is robust to deformations. Hidden Markov Models (HMMs) are also being used as a possible shape modelling and classification approach [5,6,7,8]. Hidden Markov Models are a general approach to model sequences. They are stochastic generalizations of finite-state automata, where transitions between states and generation of output symbols are modelled by probability distributions. HMMs have the properties of DTW matching, but also provide a probabilistic framework for training and classification. Shape descriptors, combined with shape matching techniques, must be invariant to many distortions, including scale, rotation, noise, etc. Most of these
Work partially supported by the Ministerio de Educaci´ on y Ciencia (TIN200612767), the Generalitat Valenciana (GV06/302) and Bancaixa (P1 1B2006-31).
distortions are relatively easy to deal with, particularly when we convert contours into sequences and use DTW or HMMs to match them. However, no matter what representation is used, starting point invariance seems to be difficult to achieve. A contour can be transformed into a sequence by choosing an appropriate starting symbol, but its selection is always based on heuristics which seldom work in unrestricted scopes. The most suitable solution to this problem is to measure distances between every possible initial symbol of the sequences. The concept of cyclic sequence arises here. Figure 1 depicts the coding of a character contour as a cyclic sequence with an 8-direction code. A cyclic sequence is a sequence of symbols or values that has neither beginning nor end, i.e., a cyclic sequence models the set of every possible cyclic shift of a sequence; thus, measuring distances between two cyclic sequences is equivalent to measuring distances between every possible cyclic shift of both sequences.
A = aaaahggeffhaheeeeedbbbabceeefecb
Fig. 1. A shape of a character “two” is coded as a cyclic sequence of directions along the contour
So the question is now: how can we train HMMs for cyclic sequences? There is a time order, but we do not know where the sequences begin. An immediate idea is to use a cyclic topology as in [5], but this is not the best solution, as we will see. To overcome this problem, in the following sections, we will propose a new methodology to properly work with HMMs in order to classify cyclic sequences. The paper is organised as follows: Section 2 and Section 3 give an overview of Hidden Markov Models and Cyclic sequences. The proposed approach is discussed in Section 4 and Section 5. Experimental results are presented in Section 6. Finally, the paper ends with conclusions in Section 7.
2 Hidden Markov Models
A Hidden Markov Model (HMM) [9] contains a set of states, each one with an associated emission probability distribution. At any instant t, an observable event is produced from a particular state and it only depends on the state. The transition from one state to another is a random event only depending on the departing state. Without loss of generality, in the following we will only consider discrete HMMs, i.e., the set of observable events is finite.
Fig. 2. (a) An HMM state that can emit any of four symbols according to the probability distribution depicted as a pie chart. (b) A complete HMM.
Given an alphabet Σ = {v1, v2, . . . , vw}, an HMM with n states is a triplet (A, B, π) where (1) A = {aij}, for 1 ≤ i, j ≤ n, is the state transition probability matrix (aij is the probability of being in state i at time t and being in state j at time t + 1); (2) B = {bik}, for 1 ≤ i ≤ n and 1 ≤ k ≤ w, is the observation probability matrix (bik is the probability of observing vk while being in state i); and (3) π = {πi}, for 1 ≤ i ≤ n, is an initial state probability distribution (πi is the probability of being in state i when t = 1). The following conditions must be satisfied: for all i, Σ_{1≤j≤n} aij = 1 and Σ_{1≤k≤w} bik = 1; and Σ_{1≤i≤n} πi = 1. Figure 2 (a) depicts a state and Figure 2 (b) shows a complete HMM (transitions with null probability are not shown). Apart from this definition, there is another one that has been popularised by the toolkit HTK [10]. This definition has two non-emitting states, the initial and the final one. The initial state, which we will identify with the number 0, has no input transitions (it eliminates the need for an explicit initial state distribution π since a0i can be interpreted as πi) and the final state, which we will identify with n + 1, has no output transitions. These special non-emitting states simplify some computations and ease HMM composition. In the following, we will use this alternative definition. There are efficient iterative algorithms for training the parameters of an HMM [11,12]. Unfortunately, there are no effective methods for estimating the number of states and the topology of the model. These are usually chosen heuristically depending on the application features. For example, when the sequence of symbols can be segmented, all the symbols in a segment are emitted by the same state, and consecutive segments are associated to consecutive states, a so-called Linear HMM (LHMM), i.e., a left-to-right topology like the ones shown in Figure 3 (a) and (b), can be used. For an HMM λ = (A, B) and a sequence of observed symbols, x = x1 x2 . . . xm, there are three basic problems that must be solved for HMMs to be useful in applications: (1) the evaluation problem, i.e., the probability of x, given λ; (2) the decoding problem, i.e., obtaining the sequence of states that most likely produced x; and (3) the learning problem, i.e., estimating λ to maximise the probability of generating x. There are well-known, efficient algorithms for the first two problems. The Viterbi algorithm [13] solves the decoding problem by evaluating φ_{n+1}(m + 1),
Fig. 3. (a) A Linear HMM. (b) A Linear HMM using the HTK definition. (c) Trellis for a Linear HMM and a sequence of length 4. The optimal alignment is shown with thicker arrows.
where

$\phi_j(t) = \begin{cases} 1, & \text{if } t = 0 \text{ and } j = 0;\\ 0, & \text{if } t = 0 \text{ and } j \neq 0;\\ \max_{1 \le i \le N} (\phi_i(t-1) \cdot a_{ij}) \cdot b_j(x_t), & \text{if } 1 \le t \le m \text{ and } 1 \le j \le n;\\ \max_{1 \le i \le N} (\phi_i(m) \cdot a_{i,n+1}), & \text{if } t = m+1 \text{ and } j = n+1. \end{cases}$
The Forward algorithm solves the evaluation problem by computing a similar recursive expression with summations instead of maximizations. Both recursive equations can be solved iteratively by Dynamic Programming in O(n²m) time (O(nm) for LHMMs). The iterative version of the Viterbi algorithm computes an intermediate value at each node of the trellis graph (see Figure 3 (c)). Each node (j, t) corresponds to a state (j) and a time instant (t) and stores φ_j(t). The value at (n + 1, m + 1) is the final result. The Viterbi algorithm solves the decoding problem by recovering the optimal alignment (sequence of states) in the trellis (see Figure 3 (c)). There is no algorithm that optimally solves the training problem. The Baum-Welch [11] and the segmental K-means [12] procedures are used to iteratively improve the parameter estimation until a local maximum is found. In practice, both methods offer a comparable performance regarding classification rates.
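A minimal sketch of the Viterbi score for the HTK-style formulation above (non-emitting states 0 and n+1): it computes φ_{n+1}(m+1) for a discrete HMM and is meant only to make the recursion concrete. For an LHMM, only the self-loop and the transition to the next state are non-zero, so the inner maximization reduces to two terms.

import numpy as np

def viterbi_score(A, B, obs):
    # A: (n+2, n+2) transition matrix; states 0 (initial) and n+1 (final) are non-emitting.
    # B: (n, w) emission matrix for the emitting states 1..n.
    # obs: sequence of symbol indices in 0..w-1.
    n = B.shape[0]
    # first emission: phi_j(1) = a_{0j} * b_j(x_1)
    prev = np.array([A[0, j] * B[j - 1, obs[0]] for j in range(1, n + 1)])
    for t in range(1, len(obs)):
        cur = np.empty(n)
        for j in range(1, n + 1):
            cur[j - 1] = max(prev[i - 1] * A[i, j] for i in range(1, n + 1)) * B[j - 1, obs[t]]
        prev = cur
    # phi_{n+1}(m+1): transition into the non-emitting final state
    return max(prev[i - 1] * A[i, n + 1] for i in range(1, n + 1))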
3 Hidden Markov Models for Cyclic Sequences
A cyclic sequence can be seen as the set of sequences obtained by cyclically shifting a conventional sequence: Definition 1 ([14]). Let x = x1 . . . xm be a sequence from an alphabet Σ. The cyclic shift σ(x) of a sequence x is defined as σ(x1 . . . xm) = x2 . . . xm x1. Let σ^s denote the composition of s cyclic shifts and let σ^0 denote the identity. Two sequences x and x′ are cyclically equivalent if x = σ^s(x′), for some s. The equivalence class of x is [x] = {σ^s(x) : 0 ≤ s < m} and it is called a cyclic
Fig. 4. (a) Cyclic HMM as proposed in [5]. (b) The contour of a shape is segmented and each segment is associated to a state of the HMM. Ideally, each state is responsible for a single segment.
sequence. Any of its members is a representative of the cyclic sequence. For instance, the set {wzz, zzw, zwz} is a cyclic sequence and wzz (or any other sequence in the set) can be taken as its representative. Since cyclic sequences have no initial/final point, Linear HMMs seem inappropriate to model them. In [5], the authors proposed a circular HMM topology to model cyclic sequences. Figure 4 (a) shows this topology (the initial and final non-emitting states are not shown for the sake of clarity). This topology can be seen as a modification of the left-to-right one where the “last” emitting state is connected to the “first” emitting state. The proposed structure eliminates the need to define a starting point: the cyclic sequence can be segmented to associate consecutive states to consecutive segments in the cyclic sequences, but no assumption is made on which is the first state/first segment (see Figure 4 (b)); therefore, there is an analogy with Linear HMMs. However, there is a problem that breaks this analogy: the model is ergodic (all states can be reached from any state) and the cyclic sequence symbols can “wrap” the model, i.e., the optimal sequence of states can contain non-consecutive, repeated states and, therefore, a single state can be responsible for the emission of several nonconsecutive segments in the cyclic sequence. Thus, the problem that we have is the following: Problem 1. To properly model cyclic sequences, HMMs should take into account that any symbol of the sequence can be emitted by the first emitting state and when such a symbol has been chosen as emitted by this state, its previous symbol must be emitted by the last state.
4
Cyclic Linear HMMs
We can treat Linear HMMs in a way analogous to cyclic sequences. A cyclic Linear HMM (CLHMM) can be seen as the set of LHMMs obtained by cyclically shifting a conventional LHMM:
Definition 2. Let λ = (A, B) be an LHMM. Given A, let σ(A) be the following transformation:

$$
A = \begin{pmatrix}
0 & 1 & 0 & \cdots & \cdots & 0\\
0 & a_{11} & a_{12} & 0 & \cdots & 0\\
0 & 0 & a_{22} & a_{23} & \cdots & 0\\
\vdots & & & \ddots & \ddots & \vdots\\
0 & \cdots & \cdots & 0 & a_{nn} & a_{n\,n+1}\\
0 & \cdots & \cdots & \cdots & \cdots & 0
\end{pmatrix},
\qquad
\sigma(A) = \begin{pmatrix}
0 & 1 & 0 & \cdots & \cdots & 0\\
0 & a_{22} & a_{23} & 0 & \cdots & 0\\
0 & 0 & a_{33} & a_{34} & \cdots & 0\\
\vdots & & & \ddots & \ddots & \vdots\\
0 & \cdots & \cdots & 0 & a_{11} & a_{12}\\
0 & \cdots & \cdots & \cdots & \cdots & 0
\end{pmatrix}.
$$
Let σ(B) be σ(b1 . . . bn) = b2 . . . bn b1 (where the bi are rows of matrix B). The composition of r cyclic shifts of λ is defined as σ^r(λ) = (σ^r(A), σ^r(B)). Two LHMMs λ and λ′ are cyclically equivalent if λ = σ^r(λ′), for some r. The equivalence class of λ is [λ] = {σ^r(λ) : 0 ≤ r < n} and it is called a Cyclic LHMM. Any of its members is a representative of the Cyclic LHMM (see Figure 5 (a)). Then, to solve Problem 1: Definition 3. The Viterbi score for a cyclic sequence [x1 x2 . . . xm] and a
CLHMM [λ] is defined as P([x]|[λ]) = max_{0≤r<n} …
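As an illustration of Definition 2, the sketch below applies the cyclic shift σ to the parameters of a discrete LHMM. The representation (a per-state array of self/next transition probabilities plus an emission matrix) is a simplification of ours, not the exact matrix layout of the paper.

```python
import numpy as np

def sigma(self_next, B):
    """One cyclic shift of an LHMM's emitting-state parameters.

    self_next : (n, 2) array, row i = (a_ii, a_i,i+1) for emitting state i
    B         : (n, w) emission matrix, row i = emission distribution of state i
    Returns the shifted pair (Definition 2: state parameters rotate by one).
    """
    return np.roll(self_next, -1, axis=0), np.roll(B, -1, axis=0)

def sigma_r(self_next, B, r):
    """Composition of r cyclic shifts, sigma^r(lambda)."""
    for _ in range(r % len(B)):
        self_next, B = sigma(self_next, B)
    return self_next, B
```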
Fig. 5. (a) A CLHMM represented by its set of LHMMs. (b) Extended trellis for a Linear HMM and a cyclic sequence of length 4. The optimal alignments for each starting point are shown with thicker arrows; one of them is the optimal cyclic alignment.
Maes' algorithm exploits the “non-crossing paths” property [14]: let ς_i be the optimal alignment beginning at node (i, 0) and ending at node (m + i + 1, n + 1) in the extended trellis, and let j, k, and l be three integers such that 0 ≤ j < k < l ≤ m; then there is an optimal path starting at node (k, 0) and arriving at (k + m + 1, n + 1) that lies between ς_j and ς_l. This property leads to a Divide-and-Conquer recursive procedure: when ς_j and ς_l are known, ς_{(j+l)/2} is computed by only taking into account those nodes of the extended trellis lying between ς_j and ς_l; then, optimal alignments bounded by ς_j and ς_{(j+l)/2} and optimal alignments bounded by ς_{(j+l)/2} and ς_l can be recursively computed. The recursive procedure starts after computing ς_0 (by means of a standard Viterbi computation) and ς_m, which is ς_0 shifted m positions to the right. Each recursive call generates up to two more recursive calls and all the calls at the same recursion depth amount to O(mn) time; therefore, the algorithm runs in O(mn log m) time. This adaptation of Maes' algorithm comes naturally after defining the Viterbi score as in Lemma 1. In principle, we could adopt a symmetric approach, defining a cyclic shift on the states of the Linear HMM to obtain the same Viterbi score. This is appealing because “doubling” the HMM in the extended trellis instead of the sequence would lead to an O(mn log n) algorithm, which would be better than O(mn log m) since n < m (and, usually, n ≪ m). However, it cannot be done directly: Lemma 2. In general, P([x]|[λ]) ≠ max_{0≤r<n} P(x|σ^r(λ)).
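For reference, the cyclic Viterbi score can also be computed by brute force, running a standard Viterbi pass for every rotation of the sequence; this O(m²n) baseline, sketched below in Python, is what the Maes-style divide-and-conquer on the extended trellis reduces to O(mn log m).

```python
import numpy as np

def viterbi_logscore(obs, trans, emis, init):
    """Log of the best-path probability for a discrete HMM (score only)."""
    log = lambda a: np.log(np.where(a > 0, a, 1e-300))
    phi = log(init) + log(emis[:, obs[0]])
    for t in range(1, len(obs)):
        phi = np.max(phi[:, None] + log(trans), axis=0) + log(emis[:, obs[t]])
    return float(np.max(phi))

def cyclic_viterbi_logscore(obs, trans, emis, init):
    """Brute-force cyclic score: best score over all m rotations of the sequence (obs is a list)."""
    m = len(obs)
    return max(viterbi_logscore(obs[s:] + obs[:s], trans, emis, init)
               for s in range(m))
```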
Fig. 6. (a) A CLHMM [λ] represented by a representative (an LHMM). (b) The corresponding LHMMs for the ι^r(λ) operation, for 0 ≤ r < n (where n = 3). From top to bottom, ι(λ), ι^2(λ) and ι^3(λ).
σ^2(v1 v2 v1) = v1 v1 v2). If we try to perform a cyclic shift in the Linear HMM, we have two possible cyclic shifts, and both possibilities give us 0 as the Viterbi score. Let [λ] = (A, B) be a CLHMM, let [x] be a cyclic sequence, and let ι(λ) be an operation that performs a cyclic shift (σ(λ)) and inserts a copy of the first emitting state just before the final non-emitting state (i.e., after the last emitting state), where the inserted state's transition to the next state takes the value of its self-transition (see Figure 6). Then:
Theorem 1. P([x]|[λ]) = max_{0≤r<n} max{ P(x|σ^r(λ)), P(x|ι^r(λ)) }.
Proof (sketch): Each alignment induces a segmentation on x. All the symbols in a segment are aligned with the same state of the CLHMM. There is a problem when x_{m−p} x_{m−p+1} . . . x_m and x_1 x_2 . . . x_q, for some p, q ≥ 0, belong to the same segment of x. In that case, the optimal alignment cannot be obtained by simply cyclically shifting λ, since x_m must be aligned with state n and x_1 must be aligned with state 1, i.e., they never fall in the same segment. The LHMM ι^r(λ), formed by inserting into σ^r(λ) a copy of the first emitting state after the last one, makes it possible to align x_{m−p} x_{m−p+1} . . . x_m and x_1 x_2 . . . x_q with the first state, since this state also appears at the end of ι^r(λ). There is another subtlety: suppose the complete segment, with p + q symbols, lies at the beginning of the sequence; then the first self-transition must be executed p + q − 1 times. If instead the segment wraps around as explained above, the first self-transition is executed just p + q − 2 times, and the transition to the last non-emitting state provides the necessary extra transition. Fortunately, for each value of r, P(x|σ^r(λ)) can be obtained as a by-product of the computation of P(x|ι^r(λ)): the trellis underlying P(x|σ^0(λ)) is a subgraph of the one underlying P(x|ι^0(λ)).
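The construction of ι(λ) can be sketched in code. The snippet below builds ι^r(λ) from the shifted parameters, using the same simplified per-state array representation as before; the handling of the appended state's outgoing transition follows our reading of the description above, not code from the paper.

```python
import numpy as np

def iota_r(self_next, B, r):
    """Build iota^r(lambda): shift r times, then append a copy of the (shifted)
    first emitting state at the end; the appended state's transition towards the
    final non-emitting state takes the value of its self-transition.

    self_next : (n, 2) array, row i = (a_ii, a_i,i+1)
    B         : (n, w) emission matrix
    """
    s, b = np.roll(self_next, -r, axis=0), np.roll(B, -r, axis=0)
    first = s[0].copy()
    first[1] = first[0]            # outgoing transition = its self-transition value
    return np.vstack([s, first]), np.vstack([b, b[0]])
```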
The value of P (x|σ r (λ)) and P (x|ιr (λ)), for each r, can be obtained by computing optimal alignments in an extended trellis similar to the one in Figure 5 (b), but now “doubling” the LHMM. It should be taken into account that, unlike in Maes’ algorithm, the optimal path starting at (r, 0) can finish either at node (r + n − 1, m) or (r + n, m) and the recursive computation can be applied just using the optimal alignments between σ r (λ) and x as a new left or right border.
5
Segmental K-means for Cyclic Linear HMMs
The proposed algorithm to compute the Viterbi score for a CLHMM cannot be extended to the Forward-value computation because there is no optimal alignment on the trellis on which the “non-crossing paths” property holds. Since the Baum-Welch training procedure is based on the Forward (and Backward) values, we cannot use it for cyclic strings without requiring n times more time, which is too expensive. However, for this purpose, we can adapt the segmental K-means algorithm [12]. In creating the CLHMM for each class, we should guarantee that the parameters we obtain are optimal for a given set of training cyclic sequences. Since our decision rule is the state-optimised likelihood function, this requires that the estimated parameter [λ̄] be such that P([x]|[λ̄]) is maximised for the training set. Starting from an initial model [λ^0] (the superscripts indicate the iteration number), this procedure takes us from [λ^k] to [λ^{k+1}] such that P([x], ς^k|[λ^k]) ≤ P([x], ς^{k+1}|[λ^{k+1}]), where ς^k is the optimal cyclic alignment for [x] = [x1 x2 . . . xm] and [λ^k]. According to Lemma 1, this is equivalent to P([x], ς^k|λ^k) ≤ P([x], ς^{k+1}|λ^{k+1}), i.e., training a representative LHMM λ is equivalent to training the CLHMM [λ] (a symmetric approach, applying the cyclic shift to λ instead of x, could also be adopted here, but we use this one to make the exposition easier). Thus, this procedure to train the LHMM λ (Lemma 1) requires a number of training cyclic sequences. Each cyclic sequence [x] = [x1 x2 . . . xm] consists of m observation symbols. The algorithm then consists of the following steps¹:

1. The process is started by taking uniform cyclic alignments as the optimal ones (for all states) of an arbitrary representative x of each training cyclic sequence [x].

2. Compute the transition probability matrix Â according to the optimal cyclic alignments, for 0 ≤ i ≤ n and 0 ≤ j ≤ n:

â_ij = (Number of occurrences of {x_t ∈ i and x_{(t+1) mod m} ∈ j} for all t) / (Number of occurrences of {x_t ∈ i} for all t)
3. Compute the observation probability matrix B̂ according to the optimal cyclic alignments, for 0 ≤ i ≤ n and 0 ≤ k ≤ w:
(¹ We only consider discrete models. These steps can be extended to the continuous ones; refer to [12] for further details.)
b̂_ik = (Number of occurrences of {x_t ∈ i and x_t = v_k} for all t) / (Number of occurrences of {x_t ∈ i} for all t)

4. Compute P([x]|λ̂) = max_{0≤s<m} P(σ^s(x)|λ̂) (where λ̂ = (Â, B̂)) for each training cyclic sequence [x]. This Viterbi score gives us an optimal cyclic alignment ς̂ for each [x]; ς̂ defines the sequence of states visited and which symbols are emitted by each state (and it can be computed in O(mn log m) time, Section 4).

5. If these optimal cyclic alignments differ from those of the previous iteration, repeat steps 2 through 5; otherwise, stop.

Following the line of thought in [12]:

Theorem 2. The adapted segmental K-means for CLHMMs and cyclic sequences converges in Zangwill's global convergence sense [15].

Proof (sketch): What needs to be shown is that the algorithm (which obtains λ̄ from λ) is closed and that P([x]|λ) is an ascent function for the algorithm: (i) the algorithm is closed because we assume that the function P([x]|λ) is continuously differentiable in λ for almost all [x] in a totally finite measurable space; and (ii) let ς* and ς̄ be two optimal cyclic alignments such that ς* = arg max_ς P([x], ς|λ) and ς̄ = arg max_ς P([x], ς|λ̄); then:

max_ς P([x], ς|λ̄) ≥ P([x], ς*|λ̄) = max_λ P([x], ς*|λ) = max_λ max_r max_ς P(σ^r(x), ς|λ)    (1)
≥ max_ς P([x], ς|λ).
The maximization over λ in (1) can be replaced by the adapted segmental K-means method explained above. Finally, since this iterative training method only reaches a local maximum, a good starting LHMM plays an important role. Step 1 obtains this initial LHMM, but it is an almost random procedure. We propose a different, automatic method as a Step 0: finding an optimal starting point for all cyclic sequences through comparison with a reference cyclic sequence via the Cyclic Dynamic Time Warping (CDTW) [2] algorithm. The starting point for each sequence is chosen as a function of the optimal alignment in the extended CDTW graph. Thus, in Step 1 the chosen representatives of the cyclic strings all have a consistent relative order and the uniform alignments produce a better initial LHMM.
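A compact sketch of one re-estimation pass of the adapted segmental K-means (steps 2-3) is given below in Python; the count-based updates follow the formulas above, while the cyclic alignments of step 4 are assumed to be computed elsewhere and passed in.

```python
import numpy as np

def reestimate(alignments, sequences, n_states, n_symbols):
    """One segmental K-means re-estimation pass (steps 2-3).

    alignments : list of lists, alignments[s][t] = state emitting symbol t of sequence s
    sequences  : list of lists of symbol indices (each a chosen rotation of a cyclic sequence)
    Returns (A_hat, B_hat) built from the count ratios above (cyclic successor via (t+1) mod m).
    """
    A = np.zeros((n_states, n_states))
    B = np.zeros((n_states, n_symbols))
    for states, x in zip(alignments, sequences):
        m = len(x)
        for t in range(m):
            i = states[t]
            A[i, states[(t + 1) % m]] += 1.0
            B[i, x[t]] += 1.0
    occ = A.sum(axis=1, keepdims=True)       # number of occurrences of {x_t in state i}
    occ[occ == 0] = 1.0                      # avoid division by zero for unused states
    return A / occ, B / occ
```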
6
Experiments
In order to assess the behaviour of the algorithms presented, we performed comparative experiments on the MPEG-7 Core Experiment CE-Shape-1 (part B)
database. This database consists of 1400 images divided into 70 shape classes with 20 images per class (see Figure 7). All the images were clipped, scaled to 32×32 and 50×50 pixel matrices and binarized, and their outer contours were represented by 8-directional chain codes (the average length is 120 for the 32×32 images and 200 for the 50×50 images). All the classification experiments (70 equiprobable classes) were cross-validated (20 partitions, 70 samples for testing and 1330 samples for training).
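As an illustration of the contour coding used here, the sketch below converts an ordered list of contour pixels into an 8-directional chain code; the direction numbering is a common convention and an assumption on our part, since the paper does not specify it.

```python
# Map a step between neighbouring pixels to an 8-directional chain code.
# Convention assumed: 0=E, 1=NE, 2=N, 3=NW, 4=W, 5=SW, 6=S, 7=SE (y grows upwards).
DIRS = {(1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
        (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7}

def chain_code(contour):
    """contour: ordered list of (x, y) pixels of a closed outer contour."""
    codes = []
    for (x0, y0), (x1, y1) in zip(contour, contour[1:] + contour[:1]):
        codes.append(DIRS[(x1 - x0, y1 - y0)])
    return codes

# Example: a unit square traversed counter-clockwise gives the code [0, 2, 4, 6].
print(chain_code([(0, 0), (1, 0), (1, 1), (0, 1)]))
```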
Fig. 7. Some images in the MPEG-7 CE-Shape-1 (part B) database
The first experiments aim to show that Cyclic Linear HMMs produce better classification rates on cyclic sequences than those obtained with a conventional LHMM or with a cyclic topology [5], and also that a good initial LHMM can improve the training of the CLHMM. We thus have four methods: (i) LHMM, a conventional LHMM with a conventional Viterbi score and segmental K-means²; (ii) Cyclic Topology, the method from [5]; (iii) CLHMM, our approach; and finally (iv) CLHMM(CDTW), our approach using a CDTW-based initialisation. Since we are interested in cyclic sequences, the contours were coded as conventional sequences with a random starting point. All HMMs were trained (with the HTK toolkit [10]) with randomly chosen starting points for all the sequences in the training set. Since the results can depend on the number of states, we performed experiments varying this parameter. Figure 8 (a) shows the classification rate for the four methods with random starting points as a function of the number of states for the 32×32 images. Figure 8 (b) shows equivalent experiments for the 50×50 images. It can be seen that both methods proposed in this work improve on the other two. CLHMMs provide better results than [5] because of the
² Obviously, the results of this method are very poor; they are shown only to illustrate how important starting-point invariance is.
Fig. 8. (a) Classification rate for random starting point sequences as a function of the number of states for the 32 × 32 images. (b) Idem for the 50 × 50 images.
problems that a cyclic topology has (Section 3). The highest classification rate is always obtained with CLHMM(CDTW), confirming that the adapted segmental K-means trains better from a good initial LHMM. We also performed classification timing experiments (for one partition of the 50×50 images) on a 2.4 GHz Pentium 4 running Linux 2.4 (the algorithm was implemented in C++). As explained in Section 4, cyclically shifting (or doubling in the extended trellis) the LHMM instead of the sequence is advisable. Figure 9 shows this fact empirically.
Fig. 9. Time comparison (in seconds) on the 50 × 50 images between both extended trellises: doubling the LHMM or doubling the sequence
7
Conclusions
In this work, we have presented a new methodology to use HMMs for dealing with cyclic sequences, called Cyclic Linear HMMs. We have formulated a framework to use these models for classification and training, adapting the Viterbi and the Segmental K-means algorithms. Experiments performed on a shape classification task show that our methods outperform other proposals from the literature.
References

1. Zhang, D., Lu, G.: Review of shape representation and description techniques. Pattern Recognition 37(1), 1–19 (2004)
2. Marzal, A., Palazón, V.: Dynamic time warping of cyclic strings for shape matching. In: Singh, S., Singh, M., Apte, C., Perner, P. (eds.) ICAPR 2005. LNCS, vol. 3687, pp. 644–652. Springer, Heidelberg (2005)
3. Adamek, T., O'Connor, N.E.: A multiscale representation method for nonrigid shapes with a single closed contour. IEEE Trans. Circuits Syst. Video Techn. 14(5), 742–753 (2004)
4. Milios, E.E., Petrakis, E.G.M.: Shape retrieval based on dynamic programming. IEEE Transactions on Image Processing 9(1), 141–147 (2000)
5. Arica, N., Yarman-Vural, F.: A shape descriptor based on circular hidden Markov model. In: International Conference on Pattern Recognition, vol. I, pp. 924–927 (2000)
6. Bicego, M., Murino, V.: Investigating hidden Markov models' capabilities in 2D shape classification. IEEE Trans. Pattern Anal. Mach. Intell. 26(2), 281–286 (2004)
7. Cai, J., Liu, Z.Q.: Hidden Markov models with spectral features for 2D shape recognition. IEEE Trans. Pattern Anal. Mach. Intell. 23(12), 1454–1458 (2001)
8. He, Y., Kundu, A.: 2-D shape classification using hidden Markov model. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(11), 1172–1184 (1991)
9. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2) (1989)
10. Young, S., Odell, J., Ollason, D., Valtchev, V., Woodland, P.: The HTK Book. Cambridge University 1996 (1995)
11. Baum, L.E., Petrie, T., Soules, G., Weiss, N.: A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat. 41, 164–171 (1970)
12. Juang, B.H., Rabiner, L.R.: The segmental K-means algorithm for estimating parameters of hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing 38(9), 1639 (1990)
13. Forney, G.D.: The Viterbi algorithm. Proceedings of the IEEE 61, 268–278 (1973)
14. Maes, M.: On a cyclic string-to-string correction problem. Information Processing Letters 35, 73–78 (1990)
15. Zangwill, W.I.: Nonlinear Programming. A Unified Approach. Prentice-Hall, Englewood Cliffs (1969)
Neural Network Classification of Photogenic Facial Expressions Based on Fiducial Points and Gabor Features

Luciana R. Veloso, João M. de Carvalho, Claudio S.V.C. Cavalvanti, Eduardo S. Moura, Felipe L. Coutinho, and Herman M. Gomes

Universidade Federal de Campina Grande, Av. Aprigio Veloso s/n, 58109-970 Campina Grande, PB
{veloso,carvalho}@dee.ufcg.edu.br, {csvcc,edumoura,felipelc,hmg}@dsc.ufcg.edu.br
Abstract. This work reports a study on the use of Gabor coefficients and coordinates of fiducial (landmark) points to represent facial features and allow the discrimination between photogenic and non-photogenic facial images, using neural networks. Experiments were performed using 416 images from the Cohn-Kanade AU-Coded Facial Expression Database [1]. In order to extract fiducial points and classify the expressions, manual processing was performed. The facial expression classifications were obtained with the help of the Action Unit information available in the image database. Various combinations of features were tested and evaluated. The best results were obtained with a weighted sum of a neural network classifier using Gabor coefficients and another using only the fiducial points. These results indicate that fiducial points are a very promising feature for the classification performed. Keywords: facial expression analysis, Gabor coefficients, facial fiducial points, neural networks.
1
Introduction
A major task for the Human Computer Interaction (HCI) community is to equip the computer with the ability to recognize the user's affective states, intentions and needs from a set of non-verbal cues, hence significantly enhancing the interaction between human and machine. One of the most difficult tasks investigated in this area is the recognition of the emotional state of the user. Many systems have been investigated and proposed, both by industry and by the scientific community, with this objective. This effort is motivated by the relevance that emotional expressions have for human communication [2], constituting the most powerful, natural, and immediate means for human beings to communicate their emotions and intentions. Mehrabian affirms in his article [3] that facial expressions contribute 55% to the effect of the message as a whole, while voice intonation and verbal content contribute 38% and 7%, respectively. Despite the large number of possible facial expressions, only a small set of prototype emotional expressions have been investigated: joy, sadness, surprise,
anger, fear and disgust. This categorization was first discussed by Darwin in 1872 [4] and is generally accepted by psychologists working on facial expression analysis. Although there is some objection to the idea that these expressions signal similar emotions in people of different cultures, the cross-cultural consistency of the combinations of facial movements (behavioral phenotypes) that compose the six basic expressions is generally conceded. In this work, we investigate the relationship between emotional face expressions and the concept of a photogenic expression, which is new to the Computer Vision literature but well known in the Photography field. The concept is normally associated with the attractiveness of a person as a subject for photography. A neural network classifier was utilized to discriminate between photogenic (labeled from neutral and happy emotions) and non-photogenic (labeled from disgust, anger, sadness, fear and surprise emotions) facial expressions. Within the above context, this work reports a study on the viability of using a set of fiducial points extracted from a face image and represented by their (x, y) coordinates, together with Gabor features, to classify photogenic and non-photogenic facial expressions. Section 2 presents a review of related work and Section 3 describes the methodology employed in the work. Section 4 describes the experiments performed and the results obtained. Conclusions and proposals of further research are presented in Section 5.
2
Related Work
The photogeny classification problem was first addressed in a paper [5] that is discussed in the next paragraph. Nonetheless, there is a body of other related work on general facial expression recognition that is equally worth discussing in this section. Batista et al. [5] explored an appearance-based approach to photogenic expression discrimination. PCA features extracted from the grey-level information of the face images were used as input to SVM and MLP neural network classifiers. The best recognition rate for an experiment involving re-labelled images from the Cohn-Kanade database [1], when using an MLP neural network classifier, was 87.5%. In the present paper, a completely different set of features (fiducial points) has been investigated, with promising results. The next paragraphs discuss the more general facial expression recognition related work. Essa and Pentland [6] presented the results of facial expression recognition based on optical flow, coupled with geometric, physical and motion-based face models. They used 2D motion energy and history templates that encode both the magnitude and the direction of motion. By learning the ideal 2D motion views for four emotional expressions (anger, disgust, happiness and surprise), they defined spatio-temporal templates for those expressions, from which recognition can be performed by template matching. Although the approach proposed by Essa and Pentland is not fully validated, it constitutes a unique method for facial expression classification. Kanade et al. [7] proposed a system that recognizes Action Units (AUs) and AU combinations [8] in facial image sequences using Hidden Markov Models.
Initially, a manual marking of facial feature points around the contours of the eyebrows, eyes, nose and mouth in the first frame of an image sequence is performed. Next, the Lucas-Kanade optical flow algorithm is used to automatically track the feature points in the remaining frames. In the case of the upper face, the Wu-Kanade dense optical flow algorithm and high gradient component detection are used to include detailed information from the larger region of the forehead. Texture appearance provides an important visual cue for classifying a variety of facial expressions. Working on this hypothesis, Wang and Yin [8] presented an approach to represent and classify facial expressions based on Topographic Context (TC). Topographic Context describes the distribution of topographic labels in the face region of interest. The face image is split into a number of expressive regions and the facial topographic surface is labeled to form a terrain map. Statistics of the terrain map are then extracted to derive the TC for each of the pre-defined face expressive regions. Finally, a topographic feature vector is created by concatenating the TCs of all expressive regions. With the extracted TC features, facial expressions were recognized using four classifiers: quadratic discriminant classifier (QDC), linear discriminant analysis (LDA), naive Bayesian network classifier (NBC) and support vector classifier (SVC). The LDA classifier achieved the best average result of 82.68% correct recognition rate. Lanitis et al. [9] proposed a unified approach to the problems of face image coding and interpretation, using flexible models which represent both shape and gray-level appearance. For shape models, the shapes of facial features and the spatial relations between them are captured in single models, using 152 coordinates of landmark points in the face. For the training examples, those points were manually located. The model can accurately approximate the shape of any face in the training set using 16 shape parameters (eigenvector weights). The shape-free gray-level appearance model was obtained by deforming each face image to match a mean shape in such a way that changes in gray-level intensities are kept to a minimum. Landmarks (14 in total) were used to deform the face images and gray-level intensities within the face area were extracted. Only 12 variables were needed to account for 95% of the training set variation in the flexible gray-level model. The last model, local gray-level appearance, was built from the projection profile at each model point. The shape model and the local gray-level models can be used to automatically locate all the modeled features, using Active Shape Model search. The facial expression recognition problem was investigated by establishing the distribution of appearance parameters over a selected training set for each expression category, so that the appearance parameters calculated for a new face image could be used for determining the expression. The best results were obtained with the shape-free gray-level model: the recognition rate was 74% with shape-free gray level, 70% with shape + shape-free model, and 53% with the shape model. In the work of Zhang et al. [10], geometric positions of a set of fiducial points and Gabor wavelet coefficients were applied to recognize facial expressions. Each image was convolved with 18 Gabor filters, comprising 3 scales and 6 orientations, at the location of the fiducial points. Therefore, the images were represented
by a vector of 612 (18×34) elements each. The classifier architecture is based on a two-layer perceptron. Experiments were performed using 10-fold cross-validation. Gabor filters reached a better recognition rate than that obtained using only geometric positions (coordinates of the points). The recognition rate was 73.3% with geometric positions, 92.2% with Gabor filters and 93.3% with combined information. An expert system for automatic analysis of facial expressions, called Integrated System for Facial Expression Recognition (ISFER), was developed by Pantic and Rothkrantz [11]. The system consists of two parts. The first part applies multiple techniques in parallel: detection of nose and chin, curve fitting of the eyebrow, chain-code description of the eyebrow, a neural network approach to eye tracking, curve fitting for mouth detection, a fuzzy mouth detector and profile contour search. The second part of the system is its inference engine, called HERCULES, which converts low-level face geometry into high-level facial actions, and then these into highest-level weighted emotion labels. For model building, shapes are usually represented by a set of reference points, sometimes called fiducial points, taken from well-defined image edges. The most direct way of obtaining these points is by hand marking on all images of the training and test sets. Although quite simple, this procedure is a rather tedious and error-prone task. Besides modeling, reference points are also used for face and facial expression recognition tasks, as they allow detection of significant face features. Much work has been reported on the development of systems for automatic marking, or annotation, of reference points, but no efficient method for that is thus far available. Locating and tracking facial features in image sequences is the objective of the work presented by Zhu and Ji [12]. An Adaboost classifier is initially utilized for face and eye detection in an initial image. The eyes' location is then used to position a trained face mask, which allows a first (rough) detection of 28 fiducial points. Next, a vector of Gabor coefficients is calculated at each detected point and compared (similarity search) to vectors in the training set. The most similar vectors in the training set provide new estimates for the fiducial points, and so on, until convergence is achieved. Appearance parameters and geometric relationships from the extracted facial features are utilized by a mechanism based on correlation and constraints of face shape to track the fiducial points in subsequent image frames. A probabilistic approach based on multi-modal models has been proposed by Tong and Ji [13], aiming to capture relationships between features extracted from different face view angles, using PCA. An eye detector is initially utilized to determine starting points for a feature point search procedure. A set of multi-scale and multi-directional wavelets is employed for local appearance modeling around the detected feature points. An EM algorithm is finally utilized to estimate the multi-modal parameters. Experimental results show that this technique performs well for facial feature tracking in cases of large pose variations. In this paper, the geometric positions of a set of fiducial points and Gabor wavelet coefficients, initially utilized by Zhang [10], have been investigated
for the photogenic classification problem. Other differences of the proposed approach when compared to previous work are the small number of fiducial points utilized and the combination of classifiers - one using Gabor coefficients and the other using the fiducial points themselves.
3
System Description
Our main goal was to train a classifier to learn the relationships between emotional face expressions and the concept of a photogenic expression, normally associated with a good picture of a person. In this section a system is proposed to perform facial expression recognition from facial fiducial points, using a Neural Network (NN) classifier. The system's architecture is composed of four main modules: fiducial point extractor, Gabor wavelet extractor, mask normalization and Neural Network classifier (NN). The fiducial point extraction module is responsible for locating and extracting the coordinates of discriminant (landmark) points on each face. In the next module, the Gabor wavelet extractor, a set of Gabor coefficients is extracted at each fiducial point. The fiducial points are normalized in terms of Cartesian origin and face orientation by the mask normalization module. Each image is represented by two feature sets: the coordinates of the fiducial points and the coefficients of the Gabor wavelets. After normalization, the two sets are used to train the NN module, which performs the recognition of facial expression patterns.

3.1 Fiducial Points
For the present work, the fiducial points were hand-marked for all images in the Cohn-Kanade database [1]. The development of an automatic fiducial point extractor is currently under way, using the Active Appearance Model (AAM) technique (see the Appendix). All face images in the database have been previously and originally labelled as belonging to one of the following facial expressions: happiness, sadness, anger, fear, surprise, disgust and neutral. For our work, these labels have been mapped into just two labels: photogenic and non-photogenic. On each face, 29 points are marked, as illustrated in Figure 1. A software tool was developed with the objective of facilitating the annotation of fiducial points and class labeling for the images in the database. This tool was named Face Descriptor. The Face Descriptor incorporates a face detector and a graphical interface. Upon examination of its facial features, each face in a given image is labelled as belonging to one of the following categories: happiness, anger, fear, surprise, sadness, disgust and neutral. With the Face Descriptor, the user can either work with a previously annotated and labeled image or perform those operations on a new image. For new images, the menu option "File Open image without XML" must be selected, which prompts the face detected (if there is one) in the opened image to be shown within a square, as illustrated in Figure 2. Annotation (marking) of the fiducial points is done in a pre-defined order, by just positioning the cursor on
Fig. 1. Image with marked fiducial points
the selected point and pressing the mouse button, which causes the point coordinates to be stored in the proper field in the "Facial Regions Coordinates" frame (Figure 2). As an aid to the user, a reference face is displayed in the interface, indicating the next point to be marked.
Fig. 2. Face Descriptor interface window
Face labeling is achieved by selecting the appropriate menu options in the "Facial Features" frame, also shown in Figure 2. Once the labeling is finished, the "ADD FACE" button is pressed to indicate that the processing of the current face is concluded. If other faces are present in the analyzed image, they will be automatically detected and displayed for further marking and labeling. When an invalid face is detected, the "NO FACE" button has to be pressed, to discard that detection and proceed to the next. When no more faces are detected, the user presses the "OK" button to conclude the operation, which prompts the Face Descriptor software to save the selected point coordinates and labels in XML (eXtensible Markup Language) files, one file for each face. The "UNDO" button is used to discard wrongly marked points. The "CLEAR" button erases all saved information, allowing the user to start all over again. The "File Exit" menu option closes the software.
For each marked point, Cartesian coordinates are extracted and stored in an XML file. Table 1 shows the number of images used for each facial expression. The set of extracted points forms a mask representing the face image.

Table 1. Image distribution among expressions

Expression   Number of Images
Anger        92
Fear         128
Surprise     156
Sadness      150
Happiness    205
Disgust      84
Neutral      537
3.2
Gabor Wavelets
Gabor wavelets are receiving great attention in the area of Facial Expression Recognition [14] [15] [10]. Gabor wavelets capture the properties of spatial localization, orientation selectivity, spatial frequency selectivity and quadrature phase relationship [16]. In the face image, the convolution of the image with a family of Gabor filters produces salient local features, such as eyes, nose and mouth. The two-dimensional Gabor function describes a sinusoid of frequency W modulated by a Gaussian:
g(x, y) = (1 / (2π σ_x σ_y)) · exp[ −(1/2)(x²/σ_x² + y²/σ_y²) + 2π√(−1) W x ]    (1)
where σ_x and σ_y are the widths of the Gaussian in the spatial domain, that is, along the x and y axes. The Gabor functions form a complete, but non-orthogonal, basis set and any given function f(x, y) can be decomposed in terms of these basis functions. Such a decomposition results in a family of Gabor filters, making it possible to detect features at various scales and orientations. In this work, a set of multi-scale and multi-orientation Gabor wavelets is employed to model local appearances around ten fiducial points, localized around the eyes and the mouth, except the eyes' inner corners. Each image is convolved with filters at four orientations and two frequencies. Therefore, the feature vector is composed of 80 Gabor wavelet coefficients, 8 coefficients at each fiducial point analyzed.
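As a hedged illustration of this step, the sketch below builds a small Gabor filter bank with OpenCV and samples the filter responses at given fiducial points; the kernel size, the two wavelengths and the use of cv2.getGaborKernel are our own choices, since the paper does not state its implementation.

```python
import cv2
import numpy as np

def gabor_features(gray, points, wavelengths=(4.0, 8.0), n_orient=4):
    """Sample Gabor responses (2 frequencies x 4 orientations = 8 per point)."""
    feats = []
    for lam in wavelengths:                      # wavelength ~ 1/frequency
        for k in range(n_orient):
            theta = k * np.pi / n_orient
            kern = cv2.getGaborKernel((21, 21), sigma=lam / 2.0, theta=theta,
                                      lambd=lam, gamma=1.0, psi=0.0)
            resp = cv2.filter2D(gray.astype(np.float32), cv2.CV_32F, kern)
            feats.append([resp[y, x] for (x, y) in points])
    return np.array(feats).T                     # shape: (n_points, 8)

# Usage (hypothetical image and fiducial point locations):
# img = cv2.imread("face.png", cv2.IMREAD_GRAYSCALE)
# F = gabor_features(img, [(120, 90), (180, 92), (150, 160)])
```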
3.3 Mask Normalization
In the third module of the system, all fiducial point coordinates are normalized by the length of the line segment joining the fiducial points corresponding to the eyes inner corners (this length becomes unitary). Additionally, mask orientation is
normalized (slope correction) and the origin of the Cartesian coordinate system is translated to the middle point of that line segment, i.e., to the middle point between the eyes. These operations are depicted in Equations 2-5:

x' = x − x_c    (2)
y' = y − y_c    (3)
x'' = x' cos θ − y' sin θ    (4)
y'' = y' cos θ + x' sin θ    (5)

where (x'', y'') and (x, y) are the normalized and original coordinates, respectively, (x_c, y_c) are the coordinates of the middle point between the eyes, and θ is the slope (angle) of the line joining the eyes' inner corners, used for slope correction.
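A sketch of the whole normalization (translation to the mid-point between the inner eye corners, slope correction, and scaling by the inter-eye distance mentioned above) is given below; the array layout and function names are ours.

```python
import numpy as np

def normalize_mask(points, left_eye_inner, right_eye_inner):
    """Apply the normalization of Eqs. (2)-(5) plus the unit inter-eye scaling.

    points : (N, 2) array of (x, y) fiducial coordinates
    left_eye_inner, right_eye_inner : (x, y) of the eyes' inner corners
    Note: the sign of theta that actually levels the eye line depends on the
    image coordinate convention (y usually points downwards in image formats).
    """
    p = np.asarray(points, dtype=float)
    a = np.asarray(left_eye_inner, dtype=float)
    b = np.asarray(right_eye_inner, dtype=float)
    xc, yc = (a + b) / 2.0                        # midpoint between the eyes
    theta = np.arctan2(b[1] - a[1], b[0] - a[0])  # slope of the inter-eye segment
    x1, y1 = p[:, 0] - xc, p[:, 1] - yc           # Eqs. (2)-(3): translation
    x2 = x1 * np.cos(theta) - y1 * np.sin(theta)  # Eq. (4)
    y2 = y1 * np.cos(theta) + x1 * np.sin(theta)  # Eq. (5)
    scale = np.linalg.norm(b - a)                 # inter-eye distance becomes unitary
    return np.column_stack([x2, y2]) / scale
```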
3.4 Neural Network Training
All neural network training was performed using MLPs (Multilayer Perceptrons) with varying sizes of input and hidden layers. The sizes utilized in the experiments were determined experimentally as a function of the type and amount of input features. In total, five types of neural networks were used: two neural networks receiving only point coordinates, one neural network receiving Gabor features, another receiving Gabor features and point coordinates, and a last one receiving PCA (Principal Component Analysis) features computed from the set of Gabor features and point coordinates.
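For illustration, a minimal training setup in the spirit described here can be written with scikit-learn's MLPClassifier; the layer sizes follow experiment type 1 of Table 2 (58 inputs, 29 hidden nodes, 2 classes), while the data, solver and iteration settings are arbitrary stand-ins of ours.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical data: 416 samples of 58 normalized fiducial coordinates each,
# labeled photogenic (1) or non-photogenic (0).
rng = np.random.default_rng(0)
X = rng.normal(size=(416, 58))
y = rng.integers(0, 2, size=416)

clf = MLPClassifier(hidden_layer_sizes=(29,), max_iter=500, random_state=0)
clf.fit(X[:208], y[:208])                 # 50% training split, as in the paper
print("validation accuracy:", clf.score(X[208:312], y[208:312]))
```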
4
Experiments and Results
The experiments with photogenic versus non-photogenic faces were performed using 416 images from the Cohn-Kanade AU-Coded Facial Expression Database [1]. From the total image set, 208 images were labeled as photogenic and 208 as non-photogenic. The subset was further separated into 50% for training, 25% for validation and 25% for testing. Figure 3 shows some examples of this image set.
Fig. 3. Examples of (a) Photogenic and (b) Non-photogenic faces
Facial expression classification experiments were performed using the following five types of feature sets: 1) coordinates of all 29 fiducial points (58 coordinates); 2) only eye and mouth fiducial point coordinates (32 coordinates); 3) 10 Gabor features calculated for eight fiducial points of each face (80 features); 4) Gabor features combined with all fiducial point coordinates (138 features); 5) 78 features obtained by applying Principal Component Analysis (PCA) [17] to the previous feature set (4). The main goal of PCA is to reduce the dimension of a data set while retaining as much information as possible. Essentially, a set of correlated variables is transformed into a set of uncorrelated variables (by linear combination) and ordered by decreasing variability. The resulting set of components in this work accounts for 90% of the total data variance. A different neural network (NN) was trained for each type of feature set/experiment. Because random weight initialization can lead to different error energy minima, each experiment (training and testing) was repeated 10 times, using an Intel Xeon 5130 processor (dual core, 2.00 GHz, 1333 MHz FSB) with 4 GB of RAM. Training time varied between 843.3 and 14.5 seconds, depending on the feature set size and initial parameters. Table 2 summarizes the experiment results. It shows the number of nodes at the three (input, hidden and output) layers of the NN, as well as the maximum, mean and standard deviation values obtained for the correct classification rates, calculated from the 10 repetitions of each experiment type. The average classification (execution) time is also shown in Table 2.

Table 2. Results for the five types of experiments performed. Input, Hidden and Output are the number of nodes at each layer of the Neural Network. Max, Mean and STD are the Maximum, Mean, and Standard Deviation of the classification rates, respectively, calculated for 10 repetitions of each experiment. Execution time (in seconds) is the average time (over 10 repetitions) required by each NN to produce a classification result.

Type  Input  Hidden  Output  Max    Mean   STD   Exec. time
1     58     29      2       75.50  74.60  0.65  9.2 × 10^-4
2     32     16      2       73.50  71.70  0.90  6.6 × 10^-4
3     80     40      2       68.50  64.60  2.21  7.2 × 10^-4
4     138    69      2       72.00  69.00  1.29  1.2 × 10^-1
5     78     38      2       71.00  68.70  1.28  1.4 × 10^-1
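The dimensionality reduction used for feature set 5 can be sketched as follows with scikit-learn; retaining 90% of the variance is the criterion quoted above, while the data here is a random stand-in, not the paper's features.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
features = rng.normal(size=(416, 138))      # stand-in for Gabor + coordinate features

pca = PCA(n_components=0.90)                # keep enough components for 90% variance
reduced = pca.fit_transform(features)
print(reduced.shape, pca.explained_variance_ratio_.sum())
```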
4.1 Classifiers Fusion
The idea of combining classifiers in order to compensate for their individual weaknesses and to enhance their individual strengths has been widely used in recent pattern recognition applications [18]. Using this approach, different types of features can be used independently to classify a given pattern, and the classifiers' outputs are combined to achieve an overall performance which is better than that of any individual classifier.
This section presents a multiple-classifier algorithm based on two different photogenic classifiers: Coordinates-NN (type 1 in Table 2) and Gabor-NN (type 3 in Table 2). For this hybrid classifier we needed to define a combination rule for the classifiers' outputs. In this work, three combining strategies have been considered. Initially, assume that an object Z must be assigned to one of the possible classes and that L classifiers are available, each representing the given pattern by a distinct measurement vector. Denote the measurement vector used by the i-th classifier as x_i and the a posteriori probability as P(w_j|x_1, ..., x_L). The combination rules are then:

Sum (S): assign Z to class w_j if
Σ_{i=1}^{L} P(w_j|x_i) = max_{k=1}^{K} Σ_{i=1}^{L} P(w_k|x_i)    (6)

Product (P): assign Z to class w_j if
Π_{i=1}^{L} P(w_j|x_i) = max_{k=1}^{K} Π_{i=1}^{L} P(w_k|x_i)    (7)

Weighted sum (WS): assign Z to class w_j if
Σ_{i=1}^{L} α_i P(w_j|x_i) = max_{k=1}^{K} Σ_{i=1}^{L} α_i P(w_k|x_i)    (8)

where the α_i are weights for the classifiers. To guarantee that the classifier outputs represent probabilities, output normalization is performed:

P*(w_j|x_i) = P(w_j|x_i) / Σ_k P(w_k|x_i)    (9)
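A direct NumPy sketch of the three rules, including the normalization of Eq. (9), is shown below; the equal-weight default for the weighted sum is a placeholder of ours, since the actual weights were found by the search described next.

```python
import numpy as np

def normalize(P):
    """Eq. (9): make each classifier's outputs sum to one over the classes."""
    return P / P.sum(axis=1, keepdims=True)

def combine(P_list, rule="sum", weights=None):
    """P_list: list of (n_samples, K) posterior arrays, one per classifier."""
    P = np.stack([normalize(p) for p in P_list])      # shape (L, n_samples, K)
    if rule == "sum":
        scores = P.sum(axis=0)                         # Eq. (6)
    elif rule == "product":
        scores = P.prod(axis=0)                        # Eq. (7)
    elif rule == "weighted_sum":
        w = np.ones(len(P_list)) if weights is None else np.asarray(weights)
        scores = np.tensordot(w, P, axes=1)            # Eq. (8)
    else:
        raise ValueError(rule)
    return scores.argmax(axis=1)                       # class with maximum combined score
```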
For the weighted sum rule, the optimum weights were obtained by an exhaustive search procedure in which, for each classifier combination, 2,000 different weight vectors with random adaptation are tested. The average recognition rates obtained for the combination of the two different classifiers (types 1 and 3) are as follows: 77.30%, 77.50% and 78.00% for the sum, product and weighted sum rules, respectively. Thus, the best classification rate was achieved when using the weighted sum rule. Table 3 shows the confusion matrix of the best results of the experiments combining the classifiers.

Table 3. Confusion matrix of the best results for the experiments combining the classifiers

                 Photogenic  Non-photogenic
Photogenic       77          21
Non-photogenic   23          79

Table 4. Results for the Gabor-NN and Coordinates-NN photogenic classifiers with a lower learning rate. The classification rates are expressed in %.

Classifier    Result
Gabor-NN      68.50
Coord-NN      81.00
Combination   82.00
Note that the false acceptance and false rejection errors are similar (23 and 21). A new training was performed for the Gabor-NN and Coordinates-NN photogenic classifiers with a lower learning rate, aiming at a better recognition rate for the system (see Table 4).
5
Conclusion
Preliminary experimental results indicate the potential of fiducial points to distinguish between photogenic and non-photogenic facial expressions. A combination of classifiers performed slightly better (by just 1%) than the best individual classifier investigated (a neural network using a set of fiducial points). This can be attributed to the poor performance of the other combined classifier, Gabor-NN (with only 68.50% correct recognition). Further work on this issue will include techniques for automatic fiducial point extraction. In a preliminary study in that direction, we compared the points identified by an AAM (Active Appearance Model) with the available manually marked fiducial points (see the Appendix at the end of the paper). The results showed that, for the points most significant for facial expression analysis, the compared coordinates were not significantly distinct from each other. This is a good indication that the approach will not suffer degradation when using automatically located points. The next step is to redo all the experiments performed, this time with the fiducial points automatically located. Acknowledgments. This work was developed in collaboration with HP Brazil R&D. The authors would like to thank Professor Jeffrey Cohn for granting access to the Cohn-Kanade AU-coded Facial Expression Database.
References

1. Cohn, J.F., Zlochower, A., Lien, J., Kanade, T.: Automated face analysis by feature point tracking has high concurrent validity with manual FACS coding. Psychophysiology 36, 35–43 (1999)
2. Pantic, M., Rothkrantz, L.: Automatic analysis of facial expressions: The state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(12), 1424–1445 (2000)
3. Mehrabian, A.: Communication without words. Psychology Today 2(4), 53–56 (1968)
4. Darwin, C.: The Expression of the Emotions in Man and Animals. Appleton and Company, New York (1872)
5. Batista, L.B., Gomes, H.M., Carvalho, J.M.: Photogenic facial expression discrimination. In: International Conference on Computer Vision Theory and Applications, pp. 166–171 (2006)
6. Essa, I.A., Pentland, A.P.: Coding analysis interpretation and recognition of facial expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7), 757–763 (1997)
7. Cohn, J., Zlochower, A., Lien, J., Kanade, T.: Feature-point tracking by optical flow discriminates subtle differences in facial expression. In: IEEE International Conference on Automatic Face and Gesture Recognition, pp. 396–401 (1998)
8. Wang, J., Yin, L.: Static topographic modeling for facial expression recognition and analysis. Computer Vision and Image Understanding 108(1-2), 19–34 (2007)
9. Lanitis, A., Taylor, C.J., Cootes, T.F.: Automatic interpretation and coding of face images using flexible models. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7), 743–756 (1997)
10. Zhang, Z., Lyons, M., Schuster, M., Akamatsu, S.: Comparison between geometry-based and Gabor-wavelets-based facial expression recognition using multi-layer perceptron. In: IEEE International Conference on Automatic Face and Gesture Recognition, pp. 454–461 (1998)
11. Pantic, M., Rothkrantz, L.J.M.: Expert system for automatic analysis of facial expression. Image and Vision Computing 18, 881–905 (2000)
12. Zhu, Z., Ji, Q.: Robust pose invariant facial feature detection and tracking in real-time. In: International Conference on Pattern Recognition, vol. 1, pp. 1092–1095 (2006)
13. Tong, Y., Ji, Q.: Multiview facial feature tracking with a multi-modal probabilistic model. In: International Conference on Pattern Recognition, vol. 1, pp. 307–310 (2006)
14. Bartlett, M., Littlewort, G., Braathen, B., Sejnowski, T., Movellan, J.: A prototype for automatic recognition of spontaneous facial actions. Advances in Neural Information Processing Systems 15, 1271–1278 (2002)
15. Tian, Y.: Evaluation of face resolution for expression analysis. In: Computer Vision and Pattern Recognition Workshop, pp. 82–82 (2004)
16. Lin, D.T., Yang, C.M.: Real-time eye detection using face-circle fitting and dark-pixel filtering. In: IEEE International Conference on Multimedia and Expo, vol. 2, pp. 1167–1170 (2004)
17. Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd edn. Prentice-Hall, Englewood Cliffs (1998)
18. Carvalho, J.M., Oliveira, J., Freitas, C.O.A., Sabourin, R.: Handwritten month word recognition using multiple classifiers. In: Brazilian Symposium on Computer Graphics and Image Processing, pp. 82–89 (2004)
19. Cootes, T.F., Cooper, D.H., Graham, J.: Active shape models - their training and application. Computer Vision and Image Understanding 61(1), 38–59 (1995)
20. Cootes, T.F., Taylor, C.J.: Statistical models of appearance for computer vision. Technical report, University of Manchester, UK, Imaging Science and Biomedical Engineering (2004)
Appendix: Comparing Manual and Automatic Fiducial Point Extraction

The Active Appearance Model (AAM) approach, developed by Cootes et al. [19] [20], was used for automatically locating fiducial points in untrained faces of the available dataset. After generating the Active Appearance Model from a training face set, the AAM software can be used to automatically detect the
fiducial points. This is performed by a search process, which searches within the image for the points that best match the learned model. This process, however, is done through a graphical user interface which searches for points on one image at a time. In order to speed up the experiments, the modeling software was used to provide the fiducial points only for the test face images. Facial images under several expressions were used for training the AAM with the purpose of obtaining a face model. From this model, the software can extract, for each image, the fiducial points, as described before. Three different metrics were used to compare the points: the mean of the Euclidean distances between automatic and manual points for (a) individual points and (b) all points; and (c) the Root Mean-Squared Error (RMSE). In the first evaluation, the mean and the standard deviation of the Euclidean distances for each point were computed according to the equations below:

∀ fid | 1 ≤ fid ≤ 29:    (10)

mean_fid = ( Σ_{i=1}^{nImages} d(m_{i,fid}, c_{i,fid}) ) / nImages    (11)
where m is the set of manually selected fiducial points, c is the set of automatically selected fiducial points, nImages is the number of images, fid indexes the fiducial points, and d(a, b) is

d(a, b) = √( (a.x − b.x)² + (a.y − b.y)² )    (12)
In the second evaluation, the mean and the standard deviation of all Euclidean distances were computed as follows:

sumAllPoints = Σ_{i=1}^{nImages} Σ_{fid=1}^{29} d(m_{i,fid}, c_{i,fid})    (13)

meanPoints = sumAllPoints / (nImages · 29)    (14)
The standard deviation was evaluated similarly to the mean, but due to space constraints it is not shown here. In the third evaluation, the Root Mean-Squared Error was calculated according to:

∀ fid | 1 ≤ fid ≤ 29:    (15)

RMSE = √( Σ_{i=1}^{n} (m_{i,fid} − c_{i,fid})² / n )    (16)

where n is the number of images. The RMSE can be computed individually for each fiducial point, or for all points as in Equation 17.
RMSE = √( Σ_{i=1}^{n} Σ_{fid=1}^{29} (m_{i,fid} − c_{i,fid})² / n )    (17)
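These three metrics can be computed directly from two (nImages × 29 × 2) coordinate arrays, as in the sketch below; the division by n in the overall RMSE follows Equation 17 as written, and the array names are ours.

```python
import numpy as np

def compare_points(manual, auto):
    """manual, auto: arrays of shape (n_images, 29, 2) with (x, y) coordinates."""
    d = np.linalg.norm(manual - auto, axis=2)           # Euclidean distances, Eq. (12)
    per_point_mean = d.mean(axis=0)                     # Eq. (11), one value per point
    per_point_std = d.std(axis=0)
    overall_mean = d.mean()                             # Eqs. (13)-(14)
    diff_sq = ((manual - auto) ** 2).sum(axis=2)        # squared point differences
    rmse_per_point = np.sqrt(diff_sq.mean(axis=0))      # Eq. (16)
    rmse_all = np.sqrt(diff_sq.sum() / manual.shape[0]) # Eq. (17), divided by n as written
    return per_point_mean, per_point_std, overall_mean, rmse_per_point, rmse_all
```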
The results of the evaluation of the mean and standard deviation of the Euclidean distances are shown in Table 5. The experiment was run taking 45 unseen images (5% of the total used) and detecting 29 fiducial points in each image. The distances shown are measured in pixels and the images have dimensions of 640×490 pixels. A distance of 7 pixels has been used to decide that the automatic detection was too divergent from the manual one at a given point. Within this scenario, only the points P1, P9, P10 and P11 were considered too divergent. This can be easily explained since all of them are from the eyebrows: P1 is from the left eyebrow and the remaining ones are from the right eyebrow. Eyebrows are slightly more difficult to manually locate than other facial points due to variations in width and texture within the image dataset. These variations may degrade the quality of the trained AAM model. Despite this fact, in an automatic fiducial point extraction scenario, the proposed approach would not be invalidated, for two reasons: first, the eyebrow locations are not the main information used for facial expression recognition, and, second, the detected differences are not very significant given the image dimensions. The mean and the standard deviation of all Euclidean distances were 4.88 and 4.20 pixels, respectively. The Root Mean-Squared Error was 4.55.

Table 5. Mean and standard deviation of the Euclidean distances for each individual point

Point  Mean  StdDev    Point  Mean  StdDev
P1     6.98  4.40      P16    2.76  2.08
P2     5.83  3.21      P17    6.43  5.55
P3     6.14  4.99      P18    4.54  3.10
P4     5.07  4.64      P19    5.37  3.93
P5     3.48  2.53      P20    4.61  2.56
P6     4.42  2.61      P21    2.57  1.43
P7     3.95  2.92      P22    4.35  2.40
P8     3.35  2.64      P23    6.54  6.10
P9     7.44  6.55      P24    4.83  2.98
P10    7.79  7.50      P25    3.37  2.11
P11    8.16  5.50      P26    3.37  2.56
P12    3.66  2.57      P27    3.41  3.23
P13    2.70  1.95      P28    5.89  4.48
P14    4.48  4.43      P29    6.39  4.13
P15    3.56  2.34
Image In-painting by Band Matching, Seamless Cloning and Area Sub-division

Subin Lee and Yongduek Seo

Sogang University, Seoul, Korea
{adrift, yndk}@sogang.ac.kr
Abstract. We propose a novel image in-painting method composed of two parts: band matching and seamless cloning. In band matching, a band enclosing the boundary of a missing region is compared to those from the other parts of the image. The inner area of the minimum-difference band is then copied to the missing region. Even though this band matching results in successful in-painting in many practical applications, a brightness discontinuity (a seam) may appear between the filled missing region and its neighborhood. We apply seamless cloning to remove such discontinuity between the two regions. Examples show that this two-step approach can provide very fast and effective image in-painting. However, since this basic method using one patch may not deal with cases where there are abrupt changes of color or brightness along the boundary, we furthermore devise one more step: target sub-division. The target area is subdivided into small sub-areas, and band matching and seamless cloning are applied to each of them. This sub-division is also done when the missing region is too large or the user wants to see more candidates in order to choose a better one. The multiple results from the sub-division are then ordered according to in-painting quality, which is measured based on the edge map or discontinuity map along the boundary band. Our algorithm is demonstrated with various experiments using real images. Keywords: image in-painting, band matching, seamless cloning, area sub-division.
1 Introduction

Digital image in-painting means the restoration or repair of image areas which are missing or degraded, accidentally or intentionally. This technique is also suitable for removing unwanted object regions or areas from movie frames as well as digital images; those areas are filled with similar textures or replaced by a digitally synthesized new object so as to, for example, ease the need for re-shooting. Image in-painting techniques can be categorized into two groups based on their fundamental approaches. The first category, such as [1-3], adopts an image intensity interpolation scheme based on PDEs (Partial Differential Equations), which is effective for filling in small or narrow image regions because PDE-based interpolation provides a smooth continuation along the boundary very naturally. This method is more appropriate for filling smoothly colored regions.
The second category [4,5,9,11,13,14,15,16] is about filling relatively larger regions using methods of texture synthesis. For example, the image region may be filled with a synthesized output texture image generated from a small input image using a technique of [6] or [10], trying to keep the continuity of the boundary. This approach works well on regions with structural continuity. However, it is not effective on non-textured and smooth regions where structural continuity can hardly be defined. Among the work of the second category, this paper compares its algorithm and results with those of [4]. The seminal method of [4] fills the target region with many small patches chosen one by one on the basis of priority and confidence values computed from image gradients or discontinuities. However, in order to attain a satisfactory result, the user needs to find a proper size for the small patch. Therefore, at least a few trial-and-decision loops are required for the user to reach the final selection. The result of in-painting depends mainly on the size of the covering patch, but the way of choosing the patch size is not discussed in [4]. Our study was inspired by this point. This paper, on the contrary, tries to fill a large missing region with only one large textured patch. That is, the proposed method tries to replace the whole target region with another whole image whose shape is the same as the target region. Some differences between [4] and ours are listed:

• In [4], the shape of the image patch is rectangular and the image area for source-target matching is different from patch to patch. The patch size is determined by the user at the beginning of in-painting. We compare the colors of pixels in the boundary band and find the best match according to the comparison. Only one band is considered at the beginning; the user need not choose the patch size. Furthermore, the computation speed becomes much faster in this case. We will present examples that show that even this uncomplicated, plain procedure works well.

• There is no process to guarantee smoothness along the boundary in [4], which is not appropriate for in-painting when the target area is surrounded by smoothly graded colors. We overcome such cases by applying a seamless-cloning technique which is appropriate for stitching textured areas, too.

• When the user wants to see whether or not a better in-painting can be obtained, she/he has to increase or decrease the patch size and run the algorithm again in the case of [4]. Our method allows the user to choose the number of total patches to cover the in-painting area. For example, when it is two, the target is split vertically to form two patches and our band-based in-painting is performed; the same procedure is then applied to horizontally sub-divided patches; these result in two in-painted images in total. The user may choose one of them or increase the number to see more.

Practically, as the number of sub-divisions increases, our method becomes similar to [4]. However, the basic philosophy of solving the in-painting problem is different, as explained. Ours may be seen as a top-down approach in this sense, whereas [4] and its followers can be taken as a bottom-up approach. The sub-division approach also provides a way of dealing with discontinuities inside the boundary band; a priority map was developed in [4] for this purpose.
Section 2 explains our fundamental method: band matching and seamless cloning. Experimental results of the method are also presented; it is fast in computation yet shows better results. Section 3 gives the details of the area sub-division method, which has two aims. The first is to provide practical usability: various in-painting results are given to the user within a short computation time. The second is to guarantee smoothness along the boundary band and hence reduce structural discontinuities along the boundary. Section 4 concludes this paper.
2 The Proposed In-Painting Our image in-painting method consists of two basic components: band matching and seamless cloning. In band matching, a band enclosing the boundary of a missing region is defined. The band is compared to bands from the other parts of the image. The inner image of the minimum-difference band is then copied to the missing area. When the region is enclosed by textured surroundings, this procedure of band matching and copying the inside produces a good in-painting result. The computation is very fast because band matching is the only necessary procedure. If the surroundings have smoothly colored parts, a noticeable non-smooth boundary (a seam) may appear. In this case, the seam between the filled missing region and its neighborhood is removed by the procedure of seamless cloning. We found seamless cloning algorithms applicable to our case in the literature [7,8,12] and we use the method of [7] due to its simplicity and effectiveness.
Fig. 1. (a) Input image. (b) Enlarged view of the input image: the white region is the target region and the red contour is its boundary. (c) Marked target band: the blue region is the target band.
2.1 Band Matching As the initial step, the missing region (target region) is manually marked. For example, the white region in Figure 1 is the target region and the red contour is its boundary. The blue region around the target boundary contour is the band we are going to use for searching a source patch. That is, our idea is that a patch in the source image of a similar color or texture inside the boundary band will be a good patch for the area filling. The thickness of the band is set to five pixels for all the experiments in this paper. The target band is then compared to all the possible source bands
Fig. 2. An example of the proposed in-painting: (a) The target band (blue) is compared to the source ones (green). (b) The source band with minimum Dk (green) is found. (c) The target region is filled with the inner image of the optimal source band. (d) The result image.
throughout the whole image. The shape and thickness of a source band are the same as those of the target band. In Figure 2(a), the green regions show some of the source bands. Among the candidate source bands, the optimal band is chosen as the one with the minimum difference of color and gradient. Let k be the index of a source band, let C_k denote the difference over all three color channels, and let G_k be the difference of gradients for the k-th source band, defined as follows:

C_k = Σ_{i=1}^{N} ( |R_i^s − R_i^t| + |G_i^s − G_i^t| + |B_i^s − B_i^t| )   (1)

G_k = Σ_c ‖∇I_c‖,   c ∈ {R, G, B}   (2)

where s and t stand for the source and target bands, respectively, N is the number of pixels belonging to the band, R, G and B denote the red, green and blue color channels, respectively, and ‖∇I‖ = ‖(∇t_x − ∇s_x, ∇t_y − ∇s_y)‖; ∇t_x − ∇s_x is the difference of horizontal gradients and ∇t_y − ∇s_y is the difference of vertical gradients between the target band and the source band. The similarity measure D_k between the target band and the k-th source band is given by a weighted sum of C_k and G_k:

D_k = λ1 C_k + λ2 G_k   (3)
where λ1 and λ2 are weighting constants. If the difference of color is more important in the image, λ1 is set larger than λ2. Among all candidate source bands, we choose the one with minimum D_k. The region inside the selected source band is then copied into the target region. In Figure 2(b), the green region is the source band with minimum D_k, which is called the optimal source band. In Figure 2(c), the target region is filled with the inner image of the optimal source band, and Figure 2(d) is the final result. Figure 3(a) shows a magnified view of the in-painted area. Note that the in-painting was done by copying only, without any boundary smoothing. In addition, our algorithm needs to match the pixels inside the band against the source image only once, whereas such an image scan is required for every patch in [4]. Accordingly, our algorithm results in a fast computation. For example,
our method took 14 seconds for the computation with 812 pixels in the band area, whereas our implementation of [4] took 4 minutes with a 7x7 patch (49 pixels). Figure 4 shows more examples of our method. Band matching performs well on images of repeated textures or similar colors. The middle example in Figure 4 shows that when all the guitars are different, our method selects a whole guitar rather than parts of guitars to fill the input image. The right-hand example in Figure 4 shows that when a discontinuity appears, it can be resolved by target subdivision, which is explained in Section 3.
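As a concrete illustration of the matching score in Eqs. (1)-(3), the following Python sketch evaluates D_k for one candidate source band. It is only a minimal sketch, not the authors' implementation: the weights, the use of a single grayscale gradient image (the paper sums gradient differences over all three channels), and the array layout are assumptions.

import numpy as np

def band_score(img, grad, band_offsets, target_pos, source_pos, lam1=1.0, lam2=1.0):
    """Similarity D_k = lam1*C_k + lam2*G_k between the target band and one source band.

    img          : HxWx3 float array (RGB values)
    grad         : HxWx2 float array (x/y gradients of a grayscale version of img)
    band_offsets : (N, 2) integer (row, col) offsets of the band pixels
    target_pos, source_pos : (row, col) reference positions of the two bands
    """
    t = band_offsets + np.asarray(target_pos)   # coordinates of the target-band pixels
    s = band_offsets + np.asarray(source_pos)   # coordinates of the candidate source-band pixels
    # C_k (Eq. 1): sum of absolute colour differences over the band
    c_k = np.abs(img[s[:, 0], s[:, 1]] - img[t[:, 0], t[:, 1]]).sum()
    # G_k (Eq. 2): sum of gradient-difference magnitudes over the band
    g_diff = grad[s[:, 0], s[:, 1]] - grad[t[:, 0], t[:, 1]]
    g_k = np.linalg.norm(g_diff, axis=1).sum()
    return lam1 * c_k + lam2 * g_k

# The optimal source band is simply the candidate position with the smallest score:
# best = min(candidates, key=lambda pos: band_score(img, grad, band_offsets, target_pos, pos))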
Fig. 3. Comparison of the proposed method and that of [4]: (a) result of the proposed method (processing time: about 14 sec.); (b) result of [4] (processing time: about 4 min.)
Fig. 4. Results of band matching: the upper row shows the input images and the lower row the results of band matching. Computation times were 15, 26 and 200 seconds, respectively, with 1396, 1576 and 2324 band pixels, respectively. When [4] was applied, it took 420, 780 and 1800 seconds.
2.2 Seamless Cloning
Seams along the stitching boundary are a major concern in developing an image in-painting algorithm. In particular, a boundary seam occurs when the structure of the image is mainly composed of smoothly graded colors. For example, direct stitching (copying) by band in-painting caused a visible seam in Figure 5. The main reason for this discontinuity is the difference of gradients, as addressed in [7], [8] and [12]. In this paper, the discontinuity is smoothed by the seamless cloning algorithm proposed in [7]. We briefly introduce the method here.
Fig. 5. An example of a discontinuous boundary: left is the input, middle is the result of band-based in-painting, to which seamless cloning is applied, and right is the final result
For seamless cloning, we first compute h(x,y), the ratio of the pixel value of the target band to the pixel value of the optimal source band, along every boundary pixel (x,y):

h(x, y) = f(x, y) / g(x, y)   (4)

where f(x,y) and g(x,y) are the pixel values of the target band and the optimal source band, respectively. The initial h(x,y) inside the target region is set to 1. Then, the boundary values are propagated into the target area by repeatedly applying a Laplacian operator K:

h^n(x, y) = K × h^{n−1}(x, y)   (5)

where n is the repetition index. Here we use a K of size 3×3. The final pixel value inside the target band is given by the product of the updated h(x,y) and the pixel value of the optimal source band:

f̂(x, y) = h^n(x, y) g(x, y)   (6)
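The following sketch illustrates the ratio-propagation idea of Eqs. (4)-(6) on a single-channel image. It is an assumption-laden illustration rather than the implementation of [7]: the 3×3 operator K is approximated by simple four-neighbour averaging, and the band/boundary bookkeeping is simplified to binary masks.

import numpy as np

def seamless_clone(source_patch, target_img, mask, boundary, n_iter=200):
    """Blend the copied source values into the surrounding target image.

    source_patch : HxW float array, the values copied from the optimal source band
    target_img   : HxW float array, the original image (valid outside the target region)
    mask         : HxW bool array, True inside the target region
    boundary     : HxW bool array, True on the boundary pixels of the target region
    """
    eps = 1e-6
    h = np.ones_like(source_patch, dtype=float)          # h = 1 inside the target region
    h[boundary] = target_img[boundary] / (source_patch[boundary] + eps)   # Eq. (4)
    for _ in range(n_iter):
        # Eq. (5): propagate boundary ratios inward with a simple smoothing operator
        smoothed = 0.25 * (np.roll(h, 1, 0) + np.roll(h, -1, 0)
                           + np.roll(h, 1, 1) + np.roll(h, -1, 1))
        h[mask] = smoothed[mask]                          # update only inside the target region
        h[boundary] = target_img[boundary] / (source_patch[boundary] + eps)  # keep boundary fixed
    # Eq. (6): final values are the propagated ratio times the copied source values
    result = target_img.copy()
    result[mask] = h[mask] * source_patch[mask]
    return result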
Figure 6 shows two more examples of band-based in-painting together with seamless cloning. The upper row shows the results of in-painting by band matching, with seams visible on the boundary, and the lower row shows the results of seamless cloning after band matching. It took 44 and 150 seconds, respectively. Note that this is still a very fast computation; when we applied the method of [4], it took 960 and 1260 seconds, respectively.
Fig. 6. Results of seamless cloning: the upper row shows the results of band matching and the lower row the results of seamless cloning after band matching
3 Target Sub-division and Quality Measure There are some cases that even the pair of band-based in-painting and seamless cloning may not resolve. If a line passes through the target region, as shown on the left of Figure 7(a), for example, then the line may appear discontinuous after in-painting, as shown on the right of Figure 7(a). This is mainly due to the image structure and needs to be overcome for a realistic in-painting. On the other hand, in-painting is impossible when the target region is too large for our band matching, because no source band can be found. To solve these problems, we divide the target region into smaller ones, as shown in Figure 7, and apply band matching and seamless cloning to each sub-divided target region. This gives several in-painting results according to the number of sub-divided target regions. For example, when the target is divided into two, we have two cases, as shown in Figure 7(b)-(c). Therefore, we devise a method for measuring the quality of in-painting to decide which is best, or to rank the results and help the user choose among them. The number of subdivisions is given by the number of horizontal and vertical cuts, respectively. Figure 7 is an example: one horizontal cut results in two vertically divided pieces; one vertical cut, in two horizontally divided pieces. Therefore, in this case we get four results: one without subdivision (Figure 7(a)), two with two sub-divided targets (Figure 7(b)-(c)), and finally one with four sub-targets (Figure 7(d)). In Figure 7(a)-(d), each left image shows the way the target is subdivided and each right image is the result of band in-painting. Notice that Figure 7(c) and Figure 7(d) look much more natural than the others because of the structural continuity provided by the subdivision band in-painting. Actually, our advantage in computational cost is partly due to the small number of pixels inside the band. If the degree of subdivision is high, our method will have many subdivided patches and the computational cost will become similar to that
of [4]. However, from the user's viewpoint, there is a chance of obtaining a good in-painting result with a short computation time when the degree is low. On the other hand, the total amount of computation until the degree becomes high is relatively low, which is very advantageous to practical users who have to execute an algorithm repeatedly to get a satisfactory result. In order to measure the quality of each subdivision in-painting, for automatic selection on one hand and for user assistance on the other, this paper proposes two methods for computing the quality of results, i.e. the goodness of structural continuity. The first method uses edge maps of the original image and the in-painted image. The second uses segmentation instead of the edge map. In both methods, we check how the edges or the borders of different colors match along the in-painting boundary. This provides a measure of the structural continuity according to which we can sort the multiple in-painting results.
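For the subdivision step itself, the sketch below shows one plausible way to generate the candidate sub-targets (whole region, two halves in each direction, and four quadrants) from a binary target mask; splitting at the centre of the bounding box is our assumption, and each sub-mask would then be in-painted and scored by the quality measures described next.

import numpy as np

def subdivide_target(mask):
    """Yield candidate subdivisions of a binary target mask.

    Yields lists of sub-masks: the whole mask, two halves from one cut in each
    direction, and four quadrants, split at the centre of the mask's bounding box.
    Each sub-mask is in-painted separately by band matching and seamless cloning.
    """
    rows, cols = np.nonzero(mask)
    r_mid = (rows.min() + rows.max()) // 2
    c_mid = (cols.min() + cols.max()) // 2

    left = mask.copy()
    left[:, c_mid + 1:] = False
    right = mask & ~left
    top = mask.copy()
    top[r_mid + 1:, :] = False
    bottom = mask & ~top

    yield [mask]                          # no subdivision
    yield [left, right]                   # one vertical cut
    yield [top, bottom]                   # one horizontal cut
    yield [m1 & m2 for m1 in (top, bottom) for m2 in (left, right)]  # four pieces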
Fig. 7. An example of subdividing the target region into two, horizontally and vertically, and the corresponding results: band matching using (a) one band, (b) and (c) two bands, (d) four bands
Fig. 8. (a) Edge image (b) Segmentation image
The method using edge images is as follows. First, we compute binary edge images by thresholding the magnitude of Sobel filtering; edges are marked white and the rest black. Figure 8(b) is the edge image of the original image and
Figure 9(b) is that of the in-painted image. Second, we define the comparison region in which to examine the edges of the original image and the in-painting results. The comparison region is defined by the target region's boundary and the pixels immediately in front of and behind it. Third, for each pixel in the comparison region, we compute u(x,y), an indicator of structural agreement; it is set to 1 if both edge maps have the same pixel value, and to 0 otherwise. Finally, we sum up u(x,y) to compute S_k, the quality measure for the k-th in-painting result. The larger the value of S_k, the better the in-painting result (or the better the structural continuity). Instead of the edge maps, we may use the results of color-based segmentation. Then the borders of the region segments play the same role as the binary edges. The remaining calculation process is identical. Figures 8(c) and 9(c) are segmentation images of the original image and the in-painting results, respectively.
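A minimal sketch of the edge-map quality measure S_k is given below; the Sobel threshold, the width of the comparison region, and the use of SciPy's filtering and morphology routines are assumptions made for illustration.

import numpy as np
from scipy import ndimage

def edge_map(gray, thresh=0.2):
    """Binary edge image by thresholding the Sobel gradient magnitude."""
    gx = ndimage.sobel(gray, axis=1)
    gy = ndimage.sobel(gray, axis=0)
    mag = np.hypot(gx, gy)
    return mag > thresh * mag.max()

def quality_score(edge_orig, edge_inpainted, boundary, width=1):
    """S_k: number of comparison-region pixels where the two edge maps agree.

    edge_orig, edge_inpainted : HxW bool arrays (edge maps of original / in-painted image)
    boundary : HxW bool array marking the target region's boundary contour
    width    : how far in front of / behind the boundary the comparison region extends
    """
    comparison = ndimage.binary_dilation(boundary, iterations=width)   # comparison region
    agree = (edge_orig == edge_inpainted) & comparison                 # u(x, y) = 1 where maps agree
    return int(agree.sum())                                            # S_k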
Fig. 9. (a) Original image (b) Edge image (c) Segmentation image
Fig. 10. Ordered results based on edge maps. Left was found to be the best.
Fig. 11. Ordered results based on color-based segmentation. Left was found to be the best.
Figures 10 and 11 show the results of the quality measure using edge maps and segmentation images, respectively. The left image had the highest score. For the examples in Figures 10 and 11, the two quality measures produced the same result. The rightmost in-painting result came from the degree-four subdivision; its in-painted boundaries included a highly textured area around the border of the rocks and the sea, which lowered the quality. However, the result judged best by inspection was also chosen as the best by our algorithm, as shown in the figures.
4 Conclusion We proposed an image in-painting algorithm using a kind of template matching between the target band and source bands. A band is a thick contour enclosing the region of interest, and band matching itself was found to be a very fast and useful in-painting algorithm. Seamless cloning was adopted to smooth the boundary. Our algorithm showed fast and good performance compared to the previous method. Furthermore, we adopted the method of target subdivision in order to deal with cases where the target region is very large or where the result of band-based in-painting is not good enough, especially due to structural discontinuities. We presented the performance of our algorithm with various experimental results. Acknowledgments. This research was accomplished as a result of the research project for culture contents technology development supported by KOCCA, and was also supported by the Seoul R&BD Program.
References 1. Bertalmio, M., Sapiro, G., Caselles, V., Ballester, C.: Image Inpainting. In: Proc. ACM SIGGRAPH, New Orleans, Louisiana, USA, pp. 417–424 (2000) 2. Bertalmío, M., Bertozzi, A., Sapiro, G.: Navier-Stokes, Fluid-Dynamics and Image and Video Inpainting. In: Proc. Conference on Computer Vision and Pattern Recognition, Hawaii, USA, pp. 355–362 (2001) 3. Chan, T., Shen, J.: Mathematical Models for Local Nontexture Inpaintings. SIAM Journal on Applied Mathematics, 1019–1043 (2002)
4. Criminisi, A., Perez, P., Toyama, K.: Object Removal By Exemplar-Based Inpainting. In: Proc. Conference on Computer Vision and Pattern Recognition, Wisconsin, USA, pp. 721–728 (2003) 5. Drori, I., Cohen-Or, D., Yeshurun, H.: Fragment-Based Image Completion. In: Proc. ACM Transactions on Graphics, SIGGRAPH, San Diego, California, USA, pp. 303–312 (2003) 6. Efros, A.A., Leung, T.K.: Texture Synthesis by Non-parametric Sampling. In: Proc. International Conference on Computer Vision, Kerkyra, Corfu, Greece, pp. 1033–1038 (1999) 7. Georgiev, T.: Covariant Derivatives and Vision. In: Proc. European Conference on Computer Vision, Graz, Austria, pp. 56–69 (2006) 8. Jia, J., Sun, J., Tang, C.K., Shum, H.Y.: Drag-and-Drop Pasting. In: Proc. ACM SIGGRAPH, Los Angeles, California, USA, pp. 631–637 (2005) 9. Komodakis, N., Tziritas, G.: Image Completion Using Global Optimization. In: Proc. Conference on Computer Vision and Pattern Recognition, New York, NY, USA, pp. 442– 452 (2006) 10. Kwatra, V., Schodl, A., Essa, I., Turk, G., Bobick, A.: Graphcut Textures: Image and Video Synthesis Using Graph Cuts. In: Proc. ACM SIGGRAPH, San Diego, USA, pp. 277–286 (2003) 11. Perez, P., Gangnet, M., Blake, A.: PatchWorks: Example-Based Region Tiling for Image Editing, Technical Report MSR-TR-2004-04 12. Perez, P., Gangnet, M., Blake, A.: Poisson Image Editing. In: Proc. ACM SIGGRAPH, San Diego, California, USA, pp. 313–318 (2003) 13. Shen, J., Jin, X., Zhou, C., Wang, C.C.L.: Gradient Based Image Completion by Solving Poisson Equation. PCM, 257–268 (2005) 14. Sun, J., Yuan, L., Jia, J., Shum, H.Y.: Image Completion with Structure Propagation. In: Proc. ACM SIGGRAPH, Los Angeles, California, USA, pp. 861–868 (2005) 15. Wilczkowiak, M., Brostow, G.J., Tordoff, B., Cipolla, R.: Hole Filling Through Photomontage. In: British Machine Vision Conference, Oxford, UK, pp. 492–501 (2005) 16. Yamauchi, H., Harber, J., Seidel, H.P.: Image Restoration using Multiresolution Texture Synthesis and Image Inpainting. In: Proc. Computer Graphics International, Tokyo, Japan, pp. 120–125 (2003)
Image Feature Extraction Using a Method Derived from the Hough Transform with Extended Kalman Filtering Sergio A. Velastin and Chengping Xu Digital Imaging Research Centre, Kingston University, Kingston upon Thames, KT1 2EE, United Kingdom
[email protected]
Abstract. The conventional implementation of the Hough Transform is inadequate in many cases due to the integrative effects of its discrete spaces. The design of an algorithm to extract optimal parameters of curves passing through image points requires a measure of statistical fitness. A strategy for image feature extraction called the Tracking Hough Transform (THT) is presented that combines Extended Kalman Filtering with a Hough voting scheme incorporating a formal noise model. The minimum mean-squares filtering process leads to high accuracy. The computing cost for real-time applications is addressed by introducing a converging sampling scheme. Extensive performance tests show that the algorithm can achieve faster speed, lower storage requirements and higher accuracy than the Standard Hough Transform. Keywords: Hough Transform, Parametric curve detection, line detection, Kalman Filtering.
1 Introduction The Standard Hough Transform (SHT) [1] provides a technique in image processing for extracting the parameters of a straight line from its feature points (or edgels). It involves applying a coordinate transformation to the image such that all the feature points belonging to a contour of a given type in the image space map into a single location in the transformed space. Although this technique has been widely studied [2], the conflicts between accuracy, computing cost and memory requirement are still serious. The accuracy achieved depends on the resolution of the parameter space or the number of accumulator cells, but higher resolution brings a corresponding increase in computation cost. A class of solutions has been proposed [2] which employs non-uniform or multiple-resolution techniques, based on the observation that it is only necessary to have high accumulator resolution where a high density of votes accumulates. Alternatively, higher accuracy can also be achieved by a curve-fitting post-process [3] or interpolation [4]. However, the conventional least-squares distance method of fitting a line to a set of feature points is unreliable when feature points due to noise and to other edges are present [5-6]. A class of Hough-like techniques has appeared in the literature [2] to reduce the computing cost by exploiting the redundancy of the parameter or image space, such as the Randomised Hough Transform (RHT) [7] and the subimage processing strategy [8]. The RHT only uses feature points that have a
high probability of forming lines. This also avoids the use of a conventional quantisation scheme, which greatly influences the accuracy and detection capabilities of the algorithm, as well as the computational and storage requirements. However, in practice, due both to noise in the coordinates of the feature points and to the quantisation of the parameter space, the sampled, quantised line segments do not in general intersect precisely at a common point in the parameter space. Thus, variations of the RHT [9,10] have been proposed, such as a curve-fitting post-process after the accumulation. In the subimage strategy, as only partial features or subsets of the whole image are used, performance depends on the quality of the image and on the type and fraction of the image feature points that is selected [2]. Further work has been carried out to derive stopping rules for the selection of image feature points [11]. Behrens et al. [12] report a method to Kalman-track a set of features (ellipses) resulting from a prior Hough transform stage. Accuracy is determined by the accuracy of the preliminary stage, whereas in our case Kalman filtering is used as an integral part of the voting process. Hills et al. [13] and French et al. [14] report how objects can be tracked in a sequence of frames using groups of features detected with a Hough Transform; in their case tracking denotes correspondence between frames rather than a trajectory in a single image. In this paper a Hough transform method (based on previous results [15][16], but explained in more detail here), called the Tracking Hough Transform, is presented, combining the Extended Kalman filter technique with a converging sampling scheme, which achieves faster speed, lower memory requirements and higher accuracy than the SHT.
2 Standard Hough Transform

A straight line in 2D space can be represented by the equation

f(Z_k, a_i) = x_k cos θ_i + y_k sin θ_i − ρ_i = 0   (1)

where Z_k = [x_k  y_k]^T ∈ ℜ², (k = 1, ..., M), a_i = [θ_i  ρ_i] ∈ ℜ², (i = 1, ..., N), f is a function from ℜ² × ℜ² into ℜ¹, ρ_i is the length of the normal vector from the image origin to the line, θ_i is the orientation of the normal vector and (x_k, y_k) are the coordinates of the image feature points. The SHT uses the idea that the constraint equation above can be viewed as a mutual constraint between image feature points and parameter points. Therefore, it can be interpreted as defining a many-to-one coordinate transformation from the space of image points to the space of possible parameter values and a one-to-many mapping from an image feature point to a set of possible parameter values. The intersection points in the parameter space represent the parameters of the possible lines in the image space. Thus, the objective is to find a subset of significant a_i from the superset of all possible a in the parameter space, represented by a discrete "accumulator" array whose elements contain a count of the
193
number of image feature points associated with a straight line and whose size depends on the required resolution for the parameter space. The procedure is performed in two stages: an incrementation or voting stage followed by an exhaustive search for maximum counts in the accumulator array. The accumulating function can be expressed formally as
H (a i ) =
M
∑
M
∑ I [f ]
I [f (Z k , a i )] =
i
k =1
(i = 1,..., N )
(2)
k =1
where M is the total number of image feature points. The indication function I [f i ] is defined by ⎧1 f i = 0 I [f i ] = ⎨ . ⎩0 f i ≠ 0
(3)
The search stage is a peak detection process on the a i stored in the accumulator array, where the number of significant peaks is selected using global thresholding, local peak enhancement, etc. In this way, the SHT converts a difficult global detection problem in the image space into a more easily solved local peak detection problem in the parameter space.
3 Tracking by Extended Kalman Filtering 3.1 Models The Kalman Filter (KF) is a minimum mean squares filtering technique based on a state-space formulation, whose recursive nature makes it appropriate for use in systems without large data storage capabilities. In this paper, the EKF is used for tracking a sequence of positions (coordinates of feature points) in the image space. The special characteristic in this case is that the trajectory of the feature points followed is restricted to be a straight line (or any particular known curve) and the final results from the "tracking" will be the parameters of the trajectory. 1 2
Fig. 1. Viewing line detection as a position tracking process
The KF is based on three probabilistic models: the system or state model, the measurement model and the prior model. The system model considers the process as the result of passing white noise through some system with linear or non-linear dynamics. For straight line detection by the Hough transform, the probabilistic models can be considered as follows:

• The measurement model, which relates the measurement vector Z_k to the state vector a, is non-linear. Therefore, the Extended Kalman Filter (EKF) is used.
• During a sequence of independent detections of a particular straight line, the state vector representing the line remains constant. Thus, the system model can be thought of as a deterministic random process that satisfies the differential equation ȧ = 0.

The prior model, which describes the knowledge about the initial system state â_0, is obtained by sampling or digitising the parameter space for the HT accumulation.

3.2 Basic Formulation

Considering a static case only (single image), let the feature points Z_k associated with an accumulator cell vector a_i satisfy the general non-linear relationship

f(Z_k, a_i) = 0   (4)

where Z_k ∈ ℜ^m, a_i ∈ ℜ^n and f is a function from ℜ^m × ℜ^n into ℜ^p. It is assumed that Z_k and a_i are independent zero-mean stochastic processes for which only estimated values Ẑ_k and â_i are available, i.e.

E[Z_k − Ẑ_k] = 0   (5)

E[(Z_k − Ẑ_k)(Z_k − Ẑ_k)^T] = R_k   (6)

E[a_i − â_i] = 0   (7)

E[(a_i − â_i)(a_i − â_i)^T] = P_i   (8)
Here, R_k is the measurement covariance matrix (directly related to image space resolution) and P_i is the model error covariance matrix (directly related to parameter space resolution and deviations from an ideal model). Using a first-order Taylor expansion around Ẑ_k, â_i,

f(Z_k, a_i) = 0 ≈ f(Ẑ_k, â_i) + (∂f/∂Z)(Z_k − Ẑ_k) + (∂f/∂a)(a_i − â_i)   (9)

which can be rewritten as the linear measurement equation

Y = M a_i + U   (10)
where

Y = −f(Ẑ_k, â_i) + (∂f/∂a) â_i,   the (p × 1) modified measurement on the process   (11)

M = ∂f/∂a,   the (p × n) measurement-state transformation matrix   (12)

U = (∂f/∂Z)(Z_k − Ẑ_k),   the (p × 1) modified measurement noise   (13)
It can then be shown that U is a random variable with zero-mean uncorrelated noise, i.e.

E[U] = 0   (14)

E[U U^T] = W_k = (∂f/∂Z) R_k (∂f/∂Z)^T   (15)
Therefore, the linear Kalman Filter equations can now be applied directly, leading to the recursive EKF algorithm:

K_k = P_k M^T (M P_k M^T + W_k)^{−1},   Kalman "gain"   (16)

â_ik = â_ik + K_k (Y_k − M â_ik),   update state   (17)

P_k = (I − K_k M) P_k,   update state covariance matrix   (18)
where k is the iteration number. It can be seen that the previously estimated parameter â_ik is corrected by an amount proportional to the current error (Y_k − M â_ik), called the innovation. The proportionality factor (Kalman gain) K_k minimises the mean-square estimation error [17] (i.e. the terms along the major diagonal of the P matrix that represent the estimation error variances for the elements of the state vector being estimated).

3.3 Voting Function of the THT

The clustering criterion used to reject outliers when dealing with multiple lines is the Mahalanobis distance (MD) test, defined as

d(Ẑ_k, â_i) = d_ik = f(Ẑ_k, â_i)^T D_ik^{−1} f(Ẑ_k, â_i) < ε   (19)

where ε is a suitable threshold (normally selected from a χ² distribution table), and

D_ik = E[f(Ẑ_k, â_i) f(Ẑ_k, â_i)^T] = (∂f/∂Z) R_k (∂f/∂Z)^T + (∂f/∂a) P_i (∂f/∂a)^T   (20)
Thus, the voting stage of the THT takes place through the computation of the MD accumulating function

M(â_i) = Σ_{k=1}^{M} I[d(Ẑ_k, â_i)] = Σ_{k=1}^{M} I[d_ik],   (i = 1, ..., N)   (21)

where I[d_ik] = 1 if d_ik ≤ ε, and I[d_ik] = 0 if d_ik > ε.   (22)
During the EKF, a feature point is rejected if it does not satisfy the MD test. Otherwise, the point is processed to update the tracked trajectory. The refined value of a_i and the updated P_i (Eqs. (17) and (18)) are fed back to the MD test to be used for the next image point. At the same time, the vote in the accumulating cell a_i is incremented by one unit. All the points used by the EKF are then removed if the vote is larger than a threshold. Because of the central role played by the EKF, we have called this approach the "Tracking Hough Transform" (THT).
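The following sketch puts Eqs. (16)-(22) together for the straight-line case of Eq. (1): each feature point is screened with the MD test and, if accepted, used to refine the state and cast a vote. The measurement covariance follows the value used later in the tests (σ = 0.5 per pixel), while the χ² threshold and the initial covariance scale are illustrative assumptions, not values prescribed by the paper.

import numpy as np

def thtrack(points, theta0, rho0, R=None, eps=3.84, p0=1e4):
    """Track one candidate line with the EKF and the MD test.

    points       : (M, 2) array of feature-point coordinates (x, y)
    theta0, rho0 : initial parameters from a random pair of points (Eqs. 23-24)
    R            : 2x2 measurement covariance (image quantisation noise)
    eps          : MD threshold, here ~95th percentile of chi-square with 1 d.o.f.
    Returns the refined parameters, the vote count and the indices of accepted points.
    """
    R = np.eye(2) * 0.25 if R is None else R
    a = np.array([theta0, rho0], dtype=float)        # state [theta, rho]
    P = np.eye(2) * p0                               # large prior covariance (no prior knowledge)
    votes, inliers = 0, []
    for k, (x, y) in enumerate(points):
        f = x * np.cos(a[0]) + y * np.sin(a[0]) - a[1]                 # residual of Eq. (1)
        M = np.array([[-x * np.sin(a[0]) + y * np.cos(a[0]), -1.0]])   # df/da (1x2)
        J = np.array([[np.cos(a[0]), np.sin(a[0])]])                   # df/dZ (1x2)
        W = float(J @ R @ J.T)                                         # Eq. (15)
        D = W + float(M @ P @ M.T)                                     # Eq. (20)
        if f * f / D > eps:                                            # Eq. (19): reject outlier
            continue
        K = (P @ M.T) / (float(M @ P @ M.T) + W)                       # Eq. (16), scalar innovation here
        a = a + (K * (-f)).ravel()                                     # Eq. (17): innovation equals -f
        P = (np.eye(2) - K @ M) @ P                                    # Eq. (18)
        votes += 1
        inliers.append(k)
    return a, votes, inliers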
4 Converging Tracking Strategy

4.1 Converging Sampling of Parameter Space

The SHT is a one-to-many transform where the whole parameter space is sampled, i.e. each image point is mapped to a curve in the Hough space (intersections of such curves indicate the presence of a significant feature). This exhaustive sampling wastes a great deal of computing time, especially for low line-density images, as the voting stage of the SHT usually dominates the execution time. A converging feature tracking strategy is proposed by combining the EKF with a converging sampling strategy. This maps a set of image feature points into a single location in the parameter space and achieves high accuracy. In other words, the scheme combines voting and feature refinement in a single stage. The converging sampling mechanism is based on the fact that a pair of feature points defines a straight line and hence a single value a_i (c.f. Randomised Hough Transform [9]). During the sampling process, two feature points (x1, y1) and (x2, y2) are selected to obtain initial parameter values (θ, ρ):

θ = arctan((x1 − x2) / (y2 − y1))   (23)

ρ = x_i cos θ + y_i sin θ,   i = 1 or i = 2   (24)
Obviously, these two selected image feature points must be different. Then, the EKF process is started from these initial parameter values. The method reduces storage requirement and accumulating time significantly. An overview of the THT algorithm is shown in diagrammatic form in Figure 2.
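A minimal sketch of the converging sampling step of Eqs. (23)-(24) is shown below; using arctan2 instead of arctan (to avoid division by zero when the two points share a y-coordinate) is an implementation choice of ours, not part of the original description.

import numpy as np

def initial_parameters(points, rng=None):
    """Pick two distinct feature points and return (theta, rho) via Eqs. (23)-(24)."""
    rng = rng or np.random.default_rng()
    i, j = rng.choice(len(points), size=2, replace=False)   # two different feature points
    (x1, y1), (x2, y2) = points[i], points[j]
    theta = np.arctan2(x1 - x2, y2 - y1)            # Eq. (23)
    rho = x1 * np.cos(theta) + y1 * np.sin(theta)   # Eq. (24) with i = 1
    return theta, rho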
[Fig. 2 block diagram: feature points → select random pair → candidate parameters → select points → EKF (track) → MD test; if votes > threshold, points are removed; refined parameters and votes are kept in a dynamic accumulator.]
Fig. 2. THT (overview)
4.2 Criteria

Random Selection: As there is no prior knowledge about the features in an image, any pair of image points is equally likely to form part of a significant feature. Therefore, these points are selected at random.

Stopping Criteria: Selection and tracking is carried out until the candidate image points are exhausted. Two situations are considered to satisfy the exhausted condition (Exhausted Iteration Control): (1) there are no feature points left in the image; (2) all the image points have been selected at least once. Using these criteria, even if no feature points are removed, every feature point is selected at least once. However, some feature points might be selected again when other feature points have been removed from the image space after a few selection and tracking processes. Thus, the computing cost of the selection and tracking process can be further reduced if the following criterion is applied (Minimum Iteration Control): "No repetitive selection of any image feature point will be allowed even if it can never be removed from the image", e.g. noise points.
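Here is a rough sketch of the outer selection-and-tracking loop under minimum iteration control, reusing the initial_parameters and thtrack sketches above; the exact bookkeeping of seeds, the vote threshold and the termination conditions are simplifications, not the authors' implementation.

import numpy as np

def tht(points, vote_threshold=30, eps=3.84):
    """Outer THT loop with minimum iteration control: each feature point is used
    as a seed at most once, and points of accepted lines are removed."""
    remaining = list(range(len(points)))     # indices of candidate feature points
    tried = set()                            # seeds already used (minimum iteration control)
    lines = []                               # one (theta, rho, votes) entry per detected line
    rng = np.random.default_rng()
    while len(remaining) >= 2 and len(tried) < len(points):
        seed = [i for i in remaining if i not in tried]
        if len(seed) < 2:
            break
        i, j = rng.choice(seed, size=2, replace=False)
        tried.update((int(i), int(j)))
        theta, rho = initial_parameters(points[[i, j]])
        a, votes, inliers = thtrack(points[remaining], theta, rho, eps=eps)
        if votes > vote_threshold:
            lines.append((a[0], a[1], votes))
            removed = {remaining[k] for k in inliers}
            remaining = [idx for idx in remaining if idx not in removed]
    return lines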
5 Properties of the THT

5.1 Computing Complexity

One of the effects of removing feature points from the image space is that the subsequent THT process can be regarded as operating on a "subimage" of the original image, but without losing any useful feature points. This is an advantage over Kiryati's subimage strategy. With fewer feature points in the subimage, the computing cost is also reduced. Combined with the converging sampling process, the THT can
reduce the computing complexity even further by continuously reducing the number of feature points in the image, subimage, sub-subimage, …, and so on.
5.2 Memory Requirement

In the THT, once the tracking process on a candidate line is finished, the accumulated peak value is compared with a threshold to decide if a line has been detected. This can be regarded as "on-line" peak detection. Thus, there is no need in the THT to use a multi-dimensional accumulator array to register the parameter values for a subsequent peak detection process or for further accumulation when the counts are not large enough, as in the RHT. Instead, only a one-dimensional array is needed to register the counts. In this way, the THT reduces the memory requirement significantly, from a multi-dimensional accumulator array to a one-dimensional array. Furthermore, converging sampling reduces the range of the parameter space. By using EKF tracking and feature point removal, the THT reduces the range of the parameter space and hence memory requirements even further. The memory required is only proportional to the number of lines (curves) in the image. For example, suppose that the resolution of the parameter space is high enough for detection, e.g. (Δρ = 1, Δθ = 180°/I_size) where I_size = I_x = I_y. Typical accumulator sizes required for the SHT are shown in Table 1. These are typically much larger than the expected number of lines in the image. As the dimensionality of the accumulator array in the SHT is proportional to the dimensionality of the curves to be detected, the situation is aggravated for curves other than straight lines (circles, ellipses and so on).

Table 1. Accumulator size required by the SHT
Image Size (I_x × I_y)    64 × 64            128 × 128           256 × 256
Accumulator Size          4K (= 2^6 · 2^6)   16K (= 2^7 · 2^7)   64K (= 2^8 · 2^8)
5.3 Ease of Peak Detection

In the SHT, the peak detection is carried out after the accumulation. As the SHT is a one-to-many mapping from the image space to the Hough space, votes from a feature point spread among several accumulator cells. When there are other edges present in the image, the peak detection process sometimes becomes difficult. In the THT, on the other hand, the selection and tracking is a detection-rejection process. When a parameter is obtained, the THT tracks all the feature points on the trajectory. If the number of feature points tracked is high enough, these feature points are removed from the image space; otherwise a new selection and tracking process starts. Removing feature points from the image space completely cancels any contribution from these feature points to parameter cells detected later. This has an effect similar to the back-projection strategy for peak detection proposed elsewhere [18,19].
5.4 Accuracy

The tracking of feature points by the EKF uses a stochastic model and avoids the need for a quantised parameter space. Therefore, it provides higher accuracy compared to conventional HTs or other algorithms such as the RHT or the subimage strategy.
5.5 Connectivity

The connectivity problem can be directly addressed in the THT during tracking of a line from the starting to the terminating positions, as the positions of the feature points are tracked continuously and recorded in a dynamic array. Thus, after the THT the end-points of lines are automatically obtained [20]. Linking and merging techniques reported in the literature, such as in [21], can be further used to locate the end-points.
6 Tests

The performance of the algorithm presented here has been studied using the HT Test Framework (HTTF) developed by Hare and Sandler [22]. The HTTF generates a large number of images with randomly distributed geometric features (e.g. position and length of straight lines) for gathering statistical data (parameter accuracy, detection and false alarm rates) on the behaviour of a given HT algorithm. This avoids the problem of choosing a representative data set to compare different algorithms which, unless carefully selected, might result in little more than "anecdotal" evidence. In the HTTF, characteristics such as detection rates, false alarm rates, average errors and relative computing cost are used to characterise detection capability, location accuracy and speed. From a signal detection theory point of view, the performance of a HT algorithm can be considered as a composite of detection capabilities and location accuracy [23]. Thus, alternative HT algorithms can be applied to each random image set and compared in terms of these characteristics.
Fig. 3. Detection rates (%) as a function of line density
Fig. 4. False alarm rates (%) as a function of line density
The SHT is used here as a reference as it is the algorithm with the best performance reported in [22]. In these tests R_k (Eq. (6)) is set to a 2×2 matrix where the diagonal elements are equal to 0.25 (corresponding to a pixel standard deviation σ of 0.5) and the non-diagonal elements are set to zero (i.e. the x and y image components are uncorrelated). Following usual practice, the P matrix is set initially to have all elements set to a high value (10,000), implying there is no prior knowledge of the process. Figures 3 to 5 show detection rates, false alarm rates, and relative computing cost for the SHT and the THT (using exhausted and minimum iteration control) respectively, for a randomly generated set of 12000 images (image size 128 × 128, parameter space resolution (Δθ = π/128, Δρ = 1), threshold set to 30 and 40 votes for low and high line densities respectively [22]). The Relative Computing Cost in Figure 5 is defined as the ratio of CPU time between the SHT and the THT.
Fig. 5. Relative Computing Cost (THT/SHT)
Table 2. Average parameter errors
Lines/Image   THT (Exhausted)    THT (Minimum)      SHT
1             (0.057, 0.141°)    (0.057, 0.142°)    (0.246, 0.429°)
2             (0.063, 0.152°)    (0.062, 0.152°)    (0.254, 0.444°)
4             (0.067, 0.173°)    (0.071, 0.171°)    (0.256, 0.450°)
8             (0.093, 0.221°)    (0.092, 0.221°)    (0.262, 0.462°)
12            (0.087, 0.196°)    (0.084, 0.189°)    (0.258, 0.448°)
It can be seen that the computing cost of the THT (using either exhausted or minimum iteration control) is less than that of the SHT for detection in low density images. For the high density case, the THT with the minimum iteration control criterion still costs less. It also achieves similar detection performance to the THT that uses the exhausted iteration control criterion. It should be noted (Figure 3, Figure 4 and Table 2) that the computing cost saved by the minimum iteration control does not sacrifice performance. Table 2 shows average parameter errors, between generated and measured features, obtained by the SHT and the THT over the complete test (12000 images). Since the THT provides sub-parameter-space accuracy, for direct comparison votes are assigned to discrete cells using tolerance bands of Δρ and Δθ. Detection performance or capability can be characterised by the interaction between the probability of detection and the probability of false alarm [24]. For a given detection scheme, these two probabilities will depend on the value of the threshold used. Broadly speaking, higher thresholds improve detection but, at the same time, increase false alarms. For practical applications, therefore, a compromise has to be reached to achieve acceptable levels of detection and false alarms. To visualise the performance of a HT algorithm in this context, we use "Performance Characteristic Curves" (PCC).
Fig. 6. PCCs of the THT and the SHT (1 line/image)
Fig. 7. PCCs of the THT and the SHT (12 lines/image)
The PCC is a plot of the probability of detection versus the probability of false alarm for different thresholds and shows the maximum detection rate that can be achieved for a given maximum false alarm rate. Since the performances of the THT with the minimum or exhausted iteration control criterion are similar except in terms of computing cost, only the THT with exhausted iteration control is used here. In the low density case (1 line/image) the superiority of the THT over the SHT is clear. High detection rates can be achieved for low false alarm rates (e.g. 99.92% detection for 0.008% false alarm). Thus the PCC of the THT is close to the ideal case, essentially on the y-axis (Figure 6). In the higher density case (12 lines/image) the PCC of the THT also illustrates the superiority of this algorithm over the SHT (Figure 7).
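A PCC can be traced by sweeping the vote threshold and recording the two rates. The sketch below assumes the detection and false-alarm rates are simple fractions of matched and spurious candidates, which is a simplification of the HTTF's definitions and only meant to convey the idea.

import numpy as np

def pcc(scores_true, scores_false, thresholds):
    """Trace a Performance Characteristic Curve: detection rate vs. false-alarm rate.

    scores_true  : vote counts of candidates matched to true lines (within tolerance bands)
    scores_false : vote counts of spurious candidates (no matching true line)
    """
    scores_true = np.asarray(scores_true)
    scores_false = np.asarray(scores_false)
    det = np.array([(scores_true >= t).mean() for t in thresholds])   # detection rate per threshold
    fa = np.array([(scores_false >= t).mean() for t in thresholds])   # false-alarm rate per threshold
    return fa, det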
7 Conclusions

The THT algorithm presented here achieves faster speed, lower memory requirement, and higher accuracy than the SHT. The minimum iteration control strategy achieves even faster speed without sacrificing performance. This has been demonstrated by extensive statistical performance tests. The method is based on a converging sampling scheme which avoids sampling the whole image space, thus saving significant storage and reducing iteration times. Unlike the usual post-processing strategies combined with HT algorithms, high accuracy is obtained by a single-stage combination of voting with feature refinement based on Extended Kalman filtering. A Mahalanobis Distance test is used to reject outliers, so that points which are far from the candidate line do not contribute to voting or refinement. This addresses one of the common practical weaknesses of least-squares methods, as the MD test is dynamically updated by the refinement process. The incorporation of a noise model for image quantisation deals with the so-called "errors in the variables" problem. A formal noise model for parameter space quantisation allows control of Hough space "coarseness" (e.g. when establishing a compromise between accumulator array size and accuracy) while maintaining accuracy. Computing cost will increase with line density, in a way
similar to the SHT, but the effect can be minimised by dividing the image into several subimages for parallel processing. This algorithm has also been extended to deal with features of higher dimensionality (e.g. circles, ellipses, etc.).
References 1. Duda, R.O., Hart, P.E.: Use of the Hough transform to detect lines and curves in pictures. Communications of the Association of Computing Machinery 15, 11–15 (1972) 2. Leavers, V.F.: Which Hough Transform? Computer Vision, Graphics and Image Processing: Image Understanding 58, 250–264 (1993) 3. Liang, P.: A new and efficient transform for curve detection. Journal of Robotic Systems 8, 841–847 (1991) 4. Niblack, W., Petkovic, D.: On improving the accuracy of the Hough transform. Machine Vision and Applications 3, 87–106 (1990) 5. Weiss, I.: Line fitting in a noisy image. IEEE Transactions on Pattern Analysis and Machine Intelligence 11, 325–329 (1989) 6. Kiryati, N., Bruckstein, A.M.: Antialiasing the Hough Transform. Computer Vision. Graphics and Image Processing: Graphical Models Image Processing 53, 213–222 (1991) 7. Xu, L., Oja, E., Kultanen, P.: A new curve detection method: Randomized Hough Transform (RHT). Pattern Recognition Letters 11, 328–331 (1990) 8. Kiryati, N., Eldar, Y., Bruckstein, A.M.: A probabilistic Hough transform. Pattern Recognition 24, 303–316 (1991) 9. Xu, L., Oja, E.: Randomized Hough transform (RHT): Basic mechanisms, algorithms, and computational complexities. Computer Vision, Graphics and Image Processing: Image Understanding 57, 131–154 (1993) 10. Kalviainen, H., Hirvonen, P., Xu, L., Oja, E.: Comparisons of probabilistic and nonprobabilistic Hough transforms. In: Eklundh, J.-O. (ed.) ECCV 1994. LNCS, vol. 801, pp. 351–360. Springer, Heidelberg (1994) 11. Shaked, D., Yaron, O., Kiryati, N.: Deriving stopping rules for the probabilistic Hough transform by Sequential Analysis. Computer Vision and Image Understanding 63, 512–526 (1996) 12. Behrens, T., Rohr, K., Stiehl, H.S.: Using an extended Hough transform combined with a Kalman filter to segment tubular structures in 3D medical images. In: Proceedings of the Vision, Modelling, and Visualization Conference 2001, pp. 491–498 13. Hills, M., Pridmore, T., Mills, S.: Object tracking through a Hough space. In: VIE 2003. Visual Information Engineering Conference, pp. 53–56. IEE, Guildford (2003) 14. French, A., Mills, S., Pridmore, T.: Condensation tracking through a Hough space. In: ICPR 2004. 17th International Conference on Pattern Recognition, vol. 4, pp. 195–198. 15. Xu, C., Velastin, S.A.: A comparison between the standard Hough transform and the Mahalanobis distance Hough transform. In: Eklundh, J.-O. (ed.). LNCS, vol. 800, pp. 95– 100 (1994) 16. Xu, C., Velastin, S.A.: The Mahalanobis Distance Hough Transform with Extended Kalman Filter Refinement. IEEE International Symposium on Circuits & Systems 3, 5–8 (1994) 17. Brown, R.G., Hwang, P.Y.C.: Introduction to Random Signal Analysis and Kalman Filtering, 2nd edn. John Wiley and Sons Inc, Chichester (1992)
18. Gerig, G.: Linking image-space and accumulator-space: A new approach for objectrecognition. In: Proceedings of IEEE 1st International Conference on Computer Vision, pp. 112–115 (1987) 19. Dambra, C., Serpico, S.B., Vernazza, G.: A new technique for peak detection in the Hough-transform parameter space. In: Proceedings of Signal Processing V: Theories and Applications, pp. 705–708 (1990) 20. Xu, C.: The Mahalanobis Distance Hough Transform with Kalman Filter Refinement. PhD thesis, King’s College London, University of London (1995) 21. Weiss, R., Boldt, M.: Geometric grouping applied to straight lines. In: ICPR 1986. IEEE International Conference on Pattern Recognition, pp. 489–495. 22. Hare, A.R., Sandler, M.B.: General test framework for straight-line detection by Hough transforms. In: ISCAS 1993. IEEE International Symposium on Circuits and Systems, pp. 239–242. 23. Hunt, D.J., Nolte, L.W.: Performance of the Hough transform and its relationship to statistical signal detection theory. Computer Vision, Graphics and Image Processing 43, 221–238 (1988) 24. Peterson, W.W., Birdsall, T.G., Fox, W.C.: The theory of signal detectability. IRE Transactions on Information Theory 4, 171–211 (1954)
Nonlinear Dynamic Shape and Appearance Models for Facial Motion Tracking Chan-Su Lee, Ahmed Elgammal, and Dimitris Metaxas Rutgers University, Piscataway, NJ, USA {chansu,elgammal,dnm}@cs.rutgers.edu
Abstract. We present a framework for tracking large facial deformations using nonlinear dynamic shape and appearance model based upon local motion estimation. Local facial deformation estimation based on a given single template fails to track large facial deformations due to significant appearance variations. A nonlinear generative model that uses low dimensional manifold representation provides adaptive facial appearance templates depending upon the movement of the facial motion state and the expression type. The proposed model provides a generative model for Bayesian tracking of facial motions using particle filtering with simultaneous estimation of the expression type. We estimate the geometric transformation and the global deformation using the generative model. The appearance templates from the global model then estimate local deformation based on thin-plate spline parameters. Keywords: Nonlinear Shape and Appearance Models, Active Appearance Model, Facial Motion Tracking, Adaptive Template, Thin-plate Spline, Local Facial Motion, Facial Expression Recognition.
1
Introduction
Recently there has been extensive research on modeling and analyzing dynamic human motions for human computer interaction, visual surveillance, autonomous driving, computer graphics, and virtual reality. Facial motions intentionally or unintentionally display internal emotional states explicitly through facial expressions. Accurate facial motion analysis is required for affective computer interaction, stress analysis of users or vehicle drivers, and security systems such as deception detection. However, it is difficult to accurately model global facial motions since they undergo nonlinear shape and appearance deformations, which vary across different people and expressions. Local facial motions are also important for detecting subtle emotional states for stress analysis and recognizing deception. Active Shape Models (ASMs) are a well-known statistical model-based approach that uses point distribution models in a linear subspace [1]. By constraining shape deformation to the linear subspace of training shapes, the model achieves robust estimation of the shape contour [1]. Active Appearance Models (AAMs) [2] combine the shape model and the linear appearance subspace after aligning the
appearance into a normalized shape. It employs an iterative model refinement algorithm based on a prediction model that is learned as a regression model. A variety of approaches have been proposed to improve the update scheme, such as compositional update schemes [3,4], direct AAMs [5], and adaptive gradient methods [6]. However, all these methods have limitations in modeling facial shape and appearance deformations since they approximate nonlinear deformations in shape and appearance using a linear subspace. As a result of the linear approximation, the model requires high dimensional parameters to model nonlinear data. The high dimensionality makes it difficult to find the optimal shape and appearance parameters. In addition, it is difficult to generate accurate facial animations using linear approximations since a linear subspace requires a large amount of data in order to model shape and appearance variations accurately [7]. Template-based approaches are frequently used for estimating geometric transformations [8,9,10,11] that are invariant to shape and appearance variations. Recently, templates based on nonlinear warping parameter estimation have been used for tracking nonrigid shape deformation [12]. Although the method provides effective facial motion tracking under small facial deformations, it loses track for large deformations. We propose a nonlinear facial motion tracking framework that can accurately estimate the local and global shape deformation in addition to the geometric transformation. We estimate the geometric transformation and global facial motions based on a global nonlinear appearance model. The global nonlinear appearance model provides a compact low dimensional representation of the facial motion state using an embedded representation of the motion manifold. Our system also factorizes the shape variations into different expressions. We achieve tracking of large facial motions using a particle filter within the Bayesian framework based on the global nonlinear appearance model. The global model is not enough for accurate tracking of local deformation and of shape deformations that are limited in training. The global nonlinear appearance model, however, provides accurate appearance templates according to the estimated embedding state and the expression type. Local facial deformation estimation using single-template TPS parameter estimation fails to track large facial motion deformations. The global model, which supports large shape deformations, provides normalized-appearance models for local deformation estimation. By combining the global appearance model and local deformation, we can achieve accurate estimation of facial motions under large deformations. Our contributions are as follows:
Modeling nonlinear shape and appearance deformations: We propose a nonlinear shape and appearance model of facial expressions that factorizes facial expression type and facial motion state. A low dimensional representation of the facial motion state is achieved using an embedded representation of the motion manifold. For an accurate facial appearance model, we employ nonlinear warping of the appearance templates based on TPS warping (Sec. 3.1).
Tracking global facial motions using particle filter: Using the global nonlinear shape and appearance model in conjunction with low dimensional facial motion separation, we estimate the geometric transformation and facial deformations. We use the global model for global facial motion tracking within the Bayesian particle-tracking framework (Sec. 4.1).
Local facial motion estimation using adaptive appearance templates: We extend the tracking of motion deformations using a single template in [12] with adaptive templates to cover large facial deformations. After estimating the state of the global shape and appearance, local nonrigid deformation is estimated using TPS warping control (Sec. 4.2). The local facial deformation is directly estimated using shape landmark points from the adaptive normalized-appearance templates.
2
Framework
We develop a facial shape and appearance model for large facial deformations with different expressions. A dynamic facial deformation for a given facial expression is a nonlinear function of the facial configuration; as the facial configuration varies over time, the corresponding observed facial shape and appearance changes according to the given facial configuration. In addition, the facial deformation is variant in different expressions. Different facial expressions undergo different facial deformations in shape and appearance. Let facial configuration at time t be xt , and the corresponding observed nonlinear shape and appearance at the same time be y t for given expression sequence k, then the nonlinear facial shape and appearance can be represented by y t = f k (xt ) = g(ek , xt ),
(1)
where f_k(·) is a nonlinear function that varies with the expression type e_k, and g(·) is a nonlinear mapping with a factorization of the expression type parameter e_k in addition to the facial configuration x_t. Hence, to develop the nonlinear shape and appearance model, we need to find a representation of the facial configuration x_t and a factorization of the nonlinear function f_k(·) across different expressions. In Sec. 3.2 we present a nonlinear generative model with a low dimensional embedded representation of the motion manifold and a factorization of the nonlinear mapping using an empirical kernel map and decomposition. For a given nonlinear shape and appearance model, tracking facial motion means estimating the facial configuration, the expression type, and the geometric transformation that match the generated shape and appearance to the observed image frame. For a given observation z_t and state s_t, we can represent Bayesian tracking as a recursive update of the posterior P(s_t | z^t) over the object state s_t given all the observations z^t = z_1, z_2, ..., z_t up to time t:

P(s_t | z^t) ∝ P(z_t | s_t) Σ_{s_{t−1}} P(s_t | s_{t−1}) P(s_{t−1} | z^{t−1})
Rao-Blackwellized particle filtering is applied for efficient state estimation in Sec. 4.1.
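For readers unfamiliar with the recursion above, the following sketch shows one generic sampling-importance-resampling update of the posterior with particles; it is a plain particle filter step, not the Rao-Blackwellized variant used in the paper, and the transition and likelihood functions are placeholders for the dynamic and observation models.

import numpy as np

def particle_filter_step(particles, weights, transition, likelihood, rng=None):
    """One recursive update of P(s_t | z^t) with a particle approximation.

    particles  : (N, d) array of state samples representing P(s_{t-1} | z^{t-1})
    weights    : (N,) normalised importance weights
    transition : function drawing s_t ~ P(s_t | s_{t-1}) for an array of states
    likelihood : function returning P(z_t | s_t) for an array of states
    """
    rng = rng or np.random.default_rng()
    n = len(particles)
    idx = rng.choice(n, size=n, p=weights)        # resample according to the previous posterior
    particles = transition(particles[idx])        # propagate through the dynamic model
    weights = likelihood(particles)               # weight by the observation model
    weights = weights / weights.sum()
    return particles, weights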
Fig. 1. The block diagram of nonlinear facial motion tracking system
Facial state estimation based on the global facial shape and appearance model provides global facial motion tracking for the training data. The estimated states, however, are not sensitive to local deformations such as small misalignments in the geometric transformation. Hence we enhance our facial motion tracking system with local nonrigid deformation estimation using TPS-warping parameter estimation and adaptive appearance templates in Sec. 4.2. The appearance templates for TPS-warping parameter estimation are provided by the global facial shape and appearance model, which supports different appearance models according to the facial motion state and expression type. For accurate estimation of local nonrigid deformation from the appearance template, we need an accurate appearance representation in the global shape and appearance model. We use TPS warping for an accurate shape-normalized appearance template (Sec. 3.1). The local estimation of facial motion is used to update the global shape model by a linear combination of expression weights (Sec. 4.3). Our facial motion tracking system consists of three stages: data acquisition, normalization and learning of the nonlinear facial motion model, and tracking facial motion from the video sequence. Fig. 1 shows the block diagram of our facial motion tracking system. First, we collect multiple video sequences with different expressions and manually mark some of the frames. Prior to learning the nonlinear facial shape and appearance model, we collect normalized-shape and -appearance
using similarity transformation and TPS warping, respectively (Sec. 3.1). The collected normalized shapes and corresponding normalized appearances from the different expression sequences are used to learn the nonlinear shape and appearance model. For tracking, we use particle filtering together with the nonlinear shape and appearance model to estimate the global deformation and the geometric transformation. Based on the estimated global state, we generate the appearance template for local nonrigid deformation estimation. The estimated local deformation is then used to refine the global model state estimation, as shown in Fig. 1.
3 Nonlinear Global Shape and Appearance Models
In this section, we explain how to achieve accurate shape and appearance normalization for a normal shape, and how to learn the nonlinear shape and appearance model using an embedded representation of the motion manifold.

3.1 Facial Shape and Appearance Normalization
Facial shape normalization: We align the collected landmark shape points using a weighted similarity transformation for shape normalization. The ith shape with n landmark points is represented by a vector p_i = (x_{i1}, y_{i1}, x_{i2}, y_{i2}, ..., x_{in}, y_{in}). Given two shapes p_i and p_j, we find the similarity transformation S(δ_j) for shape j that minimizes the weighted sum E_j = (p_i − S(δ_j)p_j)^T D (p_i − S(δ_j)p_j), where D is a diagonal weighting matrix. The mean shape, denoted p_0, is computed by averaging the shape landmark points after shape normalization. This mean shape is used as the normal shape for the normalized-appearance representation.
Facial appearance normalization: The normalized appearance is a vector representation of the appearance on the normal shape. It is important to have a precise normalized appearance, since we use it as an adaptive appearance template for local deformation estimation (Sec. 4.2) in addition to the observation model in Bayesian tracking with particle filtering (Sec. 4.1). We use TPS warping [13] for non-rigid registration of the appearance image to the mean shape estimated after shape normalization. TPS warping yields smooth deformations of shape controlled by control points. Although piecewise-affine warping is frequently used in linear appearance models [14,4], it can cause artifacts around the boundaries under the non-rigid deformations induced by facial motion [15]. The normalized appearances, computed by TPS warping of the given landmark points to the mean shape, are used to represent appearance variations in a vector space. We compute the normalized facial appearance precisely using TPS backward warping. Given the image frames I_1, I_2, ..., I_{N_K}, we collect the manually marked corresponding shape vectors p_1, p_2, ..., p_{N_K}, where N_K is the number of image frames for training. A normalized-appearance template for training image j is generated from the image I_j with corresponding shape vector p_j by TPS warping the shape p_j to the mean shape p_0. We denote this normalized-appearance
computation for the given image I_j and shape vector p_j by Ī_j = I_j(W(p_0, p_j)), where W(·) denotes a TPS warping from the control landmark points p_j to p_0. In the actual computation, we apply backward warping, owing to the discrete nature of raster images and for computational efficiency; in backward warping we map output-image coordinates into input-image coordinates and interpolate the intensity values. The TPS warping W(·) needs to be computed only once for the mean shape p_0.
Normalized shape-appearance representation: In an image sequence, the kth image I_k can be represented by its aligned shape p_k and the TPS-warped normalized appearance a_k. We combine the normalized-shape vector and the normalized-appearance vector into a new shape-appearance vector y_k = [p_k^T a_k^T]^T. We extract the normalized-appearance vector a_k as the pixels inside the contour of the mean shape p_0 after TPS warping of the original image I_k from the original shape p_k to the mean shape p_0. We denote this procedure as

a_k = [ I_k(W(ξ, p_0; p_k)) ]_{ξ∈p_0} = Υ(I_k, p_k).    (2)

Thus Υ(I_j, p_i) returns a normalized-appearance vector for the given image I_j, with TPS warping from a shape vector p_i to the mean shape p_0. If the number of pixels within the mean shape is N_a, then the dimension of the shape-appearance vector y_k is N_as = 2n + N_a.
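The backward TPS warping used in this normalized-appearance computation can be sketched as follows. This is a minimal illustration in Python/NumPy, not the authors' implementation: the kernel U(r) = r^2 log r^2, the grayscale image, the nearest-neighbour sampling, and all function names are our own assumptions.

```python
import numpy as np

def tps_coefficients(src, dst):
    """Solve for TPS coefficients mapping 2D control points src -> dst.

    src, dst: (n, 2) arrays. Returns (w, a) with w of shape (n, 2) and a of
    shape (3, 2) so that warp(x) = a0 + A x + sum_i w_i U(|x - src_i|).
    """
    n = src.shape[0]
    d = np.linalg.norm(src[:, None, :] - src[None, :, :], axis=2)
    K = np.where(d > 0, d**2 * np.log(d**2 + 1e-12), 0.0)   # U(r) = r^2 log r^2
    P = np.hstack([np.ones((n, 1)), src])                    # affine part [1 x y]
    L = np.zeros((n + 3, n + 3))
    L[:n, :n] = K
    L[:n, n:] = P
    L[n:, :n] = P.T
    rhs = np.zeros((n + 3, 2))
    rhs[:n] = dst
    sol = np.linalg.solve(L, rhs)
    return sol[:n], sol[n:]                                  # w (n,2), a (3,2)

def tps_warp_points(pts, src, w, a):
    """Apply the learned TPS mapping to query points pts (m, 2)."""
    d = np.linalg.norm(pts[:, None, :] - src[None, :, :], axis=2)
    U = np.where(d > 0, d**2 * np.log(d**2 + 1e-12), 0.0)
    return a[0] + pts @ a[1:] + U @ w

def normalized_appearance(image, shape_pts, mean_shape, grid):
    """Backward-warp `image` so that landmarks `shape_pts` align with `mean_shape`.

    grid: (m, 2) pixel coordinates inside the mean-shape contour.
    Returns the normalized-appearance vector (nearest-neighbour sampling).
    """
    # Backward warp: map mean-shape coordinates to input-image coordinates.
    w, a = tps_coefficients(mean_shape, shape_pts)
    coords = np.rint(tps_warp_points(grid, mean_shape, w, a)).astype(int)
    x = np.clip(coords[:, 0], 0, image.shape[1] - 1)
    y = np.clip(coords[:, 1], 0, image.shape[0] - 1)
    return image[y, x]

# Tiny usage example with a synthetic image and three landmarks.
img = np.arange(100.0).reshape(10, 10)
mean = np.array([[2.0, 2.0], [7.0, 2.0], [4.5, 7.0]])
shape = mean + np.array([[0.5, 0.0], [0.0, 0.5], [-0.5, 0.0]])
grid = np.array([[4.0, 4.0], [5.0, 5.0]])
print(normalized_appearance(img, shape, mean, grid))
```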
3.2 Nonlinear Generative Models with Manifold Embedding and Factorization
Facial motion embedding and nonlinear mapping: We propose a nonlinear facial shape and appearance model based on low-dimensional manifold embedding and an empirical kernel map in order to track accurate nonlinear appearance deformations under different facial motions. Since dynamic facial expressions lie on low-dimensional manifolds, we use a conceptual unit circle as an embedded representation of the facial motion manifold for each facial expression cycle [16]. Sets of image sequences representing a full cycle of a facial expression are used for the embedded representation of the motion manifold. We denote each expression sequence by y^e = {y^e_1, ..., y^e_{N_e}}, where e denotes the expression type and N_e the number of frames of the given expression sequence. Each sequence is temporally embedded on a unit circle at equal distances. Given a set of distinct representative embedding points {x_i ∈ R^2, i = 1, ..., N}, we can define an empirical kernel map [17] ψ_N(x): R^2 → R^N, where ψ_N(x) = [φ(x, x_1), ..., φ(x, x_N)]^T for a given kernel function φ(·). For each input y^e and its embedding x^e, we learn a nonlinear mapping function f^e(x) that satisfies f^e(x_i) = y^e_i, i = 1, ..., N_e, and minimizes a regularized risk criterion. Such a function admits a representation of the form f^e(x) = Σ_{i=1}^{N} w_i φ(x, x_i), i.e., the whole mapping can be written as

f^e(x) = B^e · ψ(x),    (3)
where B^e is a d × N coefficient matrix. The mapping coefficients can be obtained by solving the linear system [y^e_1 ··· y^e_{N_e}] = B^e [ψ(x^e_1) ··· ψ(x^e_{N_e})]. Using this nonlinear mapping, we capture the nonlinearity of facial expressions in each sequence.
Expression type factorization: Given the learned nonlinear mapping coefficients B^1, B^2, ..., B^K of K different expression-type sequences, the nonlinear mappings are factorized by fitting an asymmetric bilinear model to the coefficient space [18]. As a result, we can generate a nonlinear shape and appearance instance y^k_t for a particular expression type k at any configuration x_t as

y^k_t = A × e^k × ψ(x_t) = g(e^k, x_t),    (4)

where A is a third-order tensor and e^k is the expression type vector for expression class k. We can analyze and represent nonlinear facial expression sequences by estimating the facial motion state vector x_t and the expression type e in this generative model.
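As a concrete illustration of learning the mapping of Eq. (3), the sketch below builds the unit-circle embedding and the empirical kernel map for one toy expression sequence and solves for the coefficient matrix B^e by least squares. It is written in Python/NumPy with random stand-in data; the Gaussian kernel, its bandwidth, and all dimensions are our assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: one expression sequence of Ne shape-appearance vectors of dimension d.
Ne, d, N = 20, 50, 12                                 # frames, vector dim, kernel centers
Y = rng.normal(size=(d, Ne))                          # columns are y_1 ... y_Ne

# Temporal embedding of the sequence on the unit circle (equal spacing).
t = 2 * np.pi * np.arange(Ne) / Ne
X = np.stack([np.cos(t), np.sin(t)], axis=1)          # (Ne, 2)

# Representative centers x_1 ... x_N, also on the unit circle.
tc = 2 * np.pi * np.arange(N) / N
C = np.stack([np.cos(tc), np.sin(tc)], axis=1)        # (N, 2)

def psi(x, centers, sigma=0.5):
    """Empirical kernel map psi(x) = [phi(x, x_1), ..., phi(x, x_N)] with Gaussian phi."""
    d2 = np.sum((x[None, :] - centers) ** 2, axis=1)
    return np.exp(-d2 / (2 * sigma ** 2))

Psi = np.stack([psi(x, C) for x in X], axis=1)        # (N, Ne)

# Solve [y_1 ... y_Ne] = B [psi(x_1) ... psi(x_Ne)] in the least-squares sense.
B = Y @ np.linalg.pinv(Psi)                           # (d, N) coefficient matrix

# Generate a shape-appearance instance at an arbitrary configuration x_t.
x_t = np.array([np.cos(0.3), np.sin(0.3)])
y_t = B @ psi(x_t, C)
print(y_t.shape)                                      # (50,)
```

Stacking the learned B^e matrices of the K expression sequences and decomposing the stack (e.g., with an asymmetric bilinear model) would then yield the tensor A and the expression vectors e^k of Eq. (4).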
4 Tracking Global and Local Facial Motions
In order to track nonrigid local facial deformations as well as large global facial deformations across different expression types, we first estimate the global facial motion and the geometric transformation. We then apply local nonrigid facial deformation estimation using the appearance template generated from the global facial motion estimate. Finally, the estimated global facial motion parameters are updated to reflect the local facial deformation.

4.1 Global Facial Motion Estimation
Our global facial motion tracking routine incorporates two components: the geometric transformation and the global deformation. The geometric transformation accounts for the rigid movement of the face due to head motion. The global deformation captures the nonlinear facial deformation across different expression types and motion states (configurations). If we denote the geometric transformation parameters by T_{α_t} and the global shape and appearance deformation by y_t, i.e., a_t and p_t, then the goal of our global tracking algorithm for a given image I_t is to estimate the sub-state vectors α*_t, p*_t and a*_t that minimize

E(α*_t, p*_t, a*_t) = min_{α_t, p_t, a_t} (Υ(I_t, T_{α_t} · p_t) − a_t)
                    = min_{α_t, p_t, a_t} (Υ(I_t, T_{α_t} · q(y_t)) − a(y_t)),    (5)

where a(y*_t) = a*_t = y*_t(2n+1 : N_as) is the appearance sub-vector and q(y*_t) = p*_t = y*_t(1 : 2n) is the shape sub-vector of the shape-appearance vector y*_t. The shape-appearance vector y*_t = A × e* × ψ(x*_t) is computed from the estimated expression type e* and facial motion state x*_t by Eq. 4. Therefore, tracking the global deformation of facial motion essentially involves estimating e* and x*_t,
which yield the best-fitting global shape-appearance template after the geometric transformation α_t.
Global facial motion tracking: particle filtering. Given the nonlinear generative shape and appearance model, we can describe an observed shape and appearance instance z_t by the geometric transformation and the global shape-appearance vector, i.e., by the state parameters α_t and y_t. The global shape-appearance vector is defined by the expression type e_t and the facial configuration x_t in Eq. 4. Therefore, tracking facial motion is effectively inferring the configuration x_t, the facial expression type parameter e_t, and the global transformation T_{α_t} given the observation z_t at time t. In our model, the state s_t = [α_t, x_t, e_t] uniquely describes the state of the tracked facial deformation. The observation z_t is composed of the shape vector z^p_t and the appearance vector z^a_t for the given image at time t. The global transformation parameters are independent of the global deformation state, since any shape and appearance model can be combined with any geometric transformation to synthesize a new shape and appearance in the image space; they are, however, dependent given the observation. We approximate the joint posterior distribution P(α_t, x_t, e_t | z_t) = P(α_t, y_t | z_t) by two marginal distributions, P(α_t | y*_t, z_t) and P(y_t | α*_t, z_t), where α*_t and y*_t are representative values such as the MAP (maximum a posteriori) estimates. We estimate the likelihood of the observation z_t for a given state s_t = (α_t, y_t) by

P(z_t | α_t, y_t) ∝ exp( −||Υ(I_t, T_{α_t} · p_t) − a_t|| / σ ),    (6)

where p_t = y_t(1 : 2n), a_t = y_t(2n+1 : N_as), and σ is a scaling factor for the measured image distance. In particle filtering, the state s_t is updated by estimating the weight π^{(i)}_t of each particle using the observation likelihood: π^{(i)}_t ∝ P(z_t | s^{(i)}_t) = P(z_t | α^{(i)}_t, y^{(i)}_t).
Particle filter for the geometric transformation: We estimate the geometric transformation using a particle filter based on the predicted global shape and appearance. We assume that the expression state varies smoothly and that the predicted configuration explains the temporal variation of the estimated expression state. The predicted global shape and appearance at time t is computed from the previous expression state e_{t−1} and the predicted configuration, which is obtained from the previously estimated embedding x*_{t−1} using the dynamics of the configuration along the embedded representation of the motion manifold [19]. This predicted shape and appearance is used as the representative value y*_t. Given this global shape and appearance template, we estimate the best geometric transformation α_t for the observation z_t at time t. The geometric transformation state α_t consists of the parameters γ_t, θ_t, and τ_t for scaling, rotation, and translation. The marginal probability distribution is represented by N_α particles {α^{(i)}_t, π^{α(i)}_t}_{i=1}^{N_α}. We update the weights π^{α(i)}_t, i = 1, 2, ..., N_α, with the representative value y*_t using Eq. 6.
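A minimal sketch of the weight update of Eq. (6) for the geometric-transformation particles follows (Python/NumPy). The observation residual here is a plain landmark distance on a synthetic point set rather than the TPS-normalized appearance difference, and the initial particle distribution plays the role of the motion model; these simplifications, and all names and parameter values, are ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def transform(points, alpha):
    """Apply a similarity transform alpha = (scale, theta, tx, ty) to (n, 2) points."""
    s, th, tx, ty = alpha
    R = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
    return s * points @ R.T + np.array([tx, ty])

# Toy template shape (the predicted global shape y*_t would play this role).
template = rng.normal(size=(38, 2))
true_alpha = np.array([1.1, 0.2, 3.0, -2.0])
observation = transform(template, true_alpha)         # z_t (noise-free toy)

# Particle set over alpha_t and one particle-filter step.
Na, sigma = 200, 5.0
particles = np.column_stack([
    rng.normal(1.0, 0.2, Na),     # scale
    rng.normal(0.0, 0.3, Na),     # rotation
    rng.normal(0.0, 4.0, Na),     # tx
    rng.normal(0.0, 4.0, Na),     # ty
])
weights = np.ones(Na) / Na

# Weight update: pi_t^(i) proportional to exp(-||residual|| / sigma).
residual = np.array([np.linalg.norm(transform(template, a) - observation)
                     for a in particles])
weights *= np.exp(-residual / sigma)
weights /= weights.sum()

# Systematic resampling and a crude point estimate of the transformation.
idx = np.searchsorted(np.cumsum(weights), (np.arange(Na) + rng.random()) / Na)
particles = particles[idx]
print("estimated alpha:", particles.mean(axis=0))
```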
Rao-Blackwellized particle filtering for global deformation tracking: For the state estimation of the global deformation, we utilize Rao-Blackwellized particle filtering. In order to estimate global deformations using the generative model in Eq. 4, we need to estimate the state vectors x_t and e_t, whose dimensions are 2 and N_e. The dimension of the expression state, N_e, depends on the number of expression types and can be high. When the configuration vector x_t is known, we can obtain an approximate solution for the expression vector, as explained below. The original Rao-Blackwellized particle filtering for dynamic Bayesian networks [20] assumes an exact solution for the part of the state that is not represented by particles; we instead use an approximate solution for the expression type vector to avoid sampling a high-dimensional state density, which would require a large number of particles for accurate approximation.
The facial motion state x_t is embedded in a two-dimensional space with one constraint due to the unit-circle embedding. The embedding is therefore effectively one-dimensional, and we can represent the embedding parameter β_t as a one-dimensional state. We represent the distribution of the facial motion embedding β by N_β particles {β^{(i)}_t, π^{β(i)}_t}_{i=1}^{N_β}. If we denote the approximate estimate of the expression vector by e*_t, we can approximate the marginal distribution as

P(e*_t | y_t) = Σ_β P(e*_t | β_t, y_t) P(β_t | y_t)
             = Σ_β P(e*_t | β_t, y_t) Σ_{i=1}^{N_β} π^{β(i)}_t δ(β_t, β^{(i)}_t)
             = Σ_{i=1}^{N_β} π^{β(i)}_t P(e*_t | β^{(i)}_t, y_t),

where δ(x, y) = 1 if x = y and 0 otherwise. We represent the estimated expression vector by a linear weighted sum of the known expression vectors: we assume that the optimal expression vector can be represented by a linear combination of the expression classes in the training data, so that the global deformations as the configuration changes along the manifold can be generated through a linear weighted sum of expression classes. We therefore need to solve for linear regression weights κ such that e_new = Σ_{k=1}^{K_e} κ_k e^k, where each e^k is one of the K_e expression classes. For a given configuration β_t, that is, x_t = h(β_t), we can obtain the expression conditional class probability p(e^k | y_t, x_t), which is proportional to the observation likelihood p(y_t | x_t, e^k). This likelihood can be estimated as a Gaussian density centered around A × e^k × ψ(x_t), i.e., p(y_t | x_t, e^k) ≈ N(A × e^k × ψ(x_t), Σ_{e^k}). Given the expression class probabilities, we set the weights of the expression classes to κ^{(i)}_k = p(e^k | y_t, x^{(i)}_t). The estimated expression vector is then the weighted sum over expression types,

e*_t = ( Σ_{i=1}^{N_β} Σ_{k=1}^{N_e} κ^{(i)}_k e^k ) / ( Σ_{i=1}^{N_β} κ^{(i)}_k ).
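The per-particle expression-weight computation can be illustrated as follows (Python/NumPy, with a random toy tensor A, random expression class vectors, and an arbitrary Gaussian bandwidth — all our own choices, not the paper's). The sketch evaluates κ_k for one particle and forms the corresponding weighted expression vector.

```python
import numpy as np

rng = np.random.default_rng(2)

d, N, Ke = 50, 12, 4                      # data dim, kernel centers, expression classes
A = rng.normal(size=(d, Ke, N))           # third-order tensor of Eq. (4)
E = rng.normal(size=(Ke, Ke))             # rows are expression class vectors e^1 ... e^Ke

def psi(x, sigma=0.5):
    """Empirical kernel map on the unit-circle embedding (Gaussian kernel)."""
    tc = 2 * np.pi * np.arange(N) / N
    centers = np.stack([np.cos(tc), np.sin(tc)], axis=1)
    return np.exp(-np.sum((x - centers) ** 2, axis=1) / (2 * sigma ** 2))

def generate(e, x):
    """y = A x e x psi(x): contract the tensor with an expression vector and psi."""
    return np.einsum('dkn,k,n->d', A, e, psi(x))

# One particle beta_t^(i) -> configuration x_t = h(beta_t), plus an observed y_t.
beta = 0.7
x_t = np.array([np.cos(2 * np.pi * beta), np.sin(2 * np.pi * beta)])
y_t = generate(0.7 * E[1] + 0.3 * E[2], x_t)          # toy observation

# kappa_k proportional to a Gaussian likelihood centered on A x e^k x psi(x_t).
sigma_e = 1.0
logk = np.array([-np.sum((y_t - generate(E[k], x_t)) ** 2) / (2 * sigma_e ** 2)
                 for k in range(Ke)])
kappa = np.exp(logk - logk.max())
kappa /= kappa.sum()

e_star = kappa @ E                         # weighted sum of expression class vectors
print("kappa:", np.round(kappa, 3))
print("e*:", np.round(e_star, 3))
```

Across particles, these per-particle estimates would then be combined with the particle weights π_t^{β(i)} according to the marginalization displayed above.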
4.2 Local Facial Motion Estimation
We perform local facial motion tracking to estimate local deformations that differ from the global facial model and to refine inaccurate estimates of the geometric transformation. The global facial motion state estimated with a particle filter using a limited number of particle samples sometimes shows misalignment in the geometric transformation and inaccurate estimates of the global deformation. In addition, the facial deformation of a new sequence can differ from the learned global model, even for the same person performing the same expression type. Therefore, we need local facial motion tracking to refine the global tracking result.
We propose template-adaptive local facial motion tracking with a shape description based on thin plate spline (TPS) warping. We utilize the landmark points of the facial shape description as control points of the TPS. The shape-normalized appearance is used as a template for local facial motion tracking. The proposed local facial motion tracking is similar to non-rigid object motion tracking using TPS parameters and image gradients [12]. In our case, the result of global deformation tracking with the nonlinear shape and appearance model provides a new appearance template for each frame. In addition, the landmark shape estimated from the global deformation after applying the geometric transformation provides the initial shape for local facial motion tracking.
Let the estimated global shape and appearance be y^g_{t0}, its shape vector p^g_{t0}, its appearance vector a^g_{t0}, and the current input image I_t. The objective of the local deformation fitting is to minimize the error function

E(δp_t) = || Υ(I_t, p^g_{t0} + δp_t) − a^g_{t0} ||
        = Σ_{ξ∈p_0} || I_t(W(ξ, p_0; p^g_{t0} + δp)) − I^g_{t0}(ξ) ||^2,    (7)

where I^g_{t0} is the image in the normalized shape with global appearance vector a^g_{t0}. Since we use the shape-normalized appearance as the template in local tracking, the TPS warping W(ξ, p_0; p^g_{t0} + δp) is determined by the coordinate control points p^g_{t0} + δp. For the given p^g_{t0} from global deformation tracking, the warping function is solely determined by the local deformation δp. A gradient descent technique is applied to find the local deformation parameter δp that minimizes Eq. 7, similar to [12,8]. Linearization is carried out by expanding I_t(W(ξ, p_0; p^g_{t0} + δp)) in a Taylor series about δp:

I_t(W(ξ, p_0; p^g_{t0} + δp)) = I_t(W(ξ, p_0; p^g_{t0})) + δp^T M_t + h.o.t.,    (8)

where M_t = [∂I_t/∂p_1 | ∂I_t/∂p_2 | ··· | ∂I_t/∂p_{2n}]. Each term ∂I_t/∂p_k can be computed using the warped image coordinate W(ξ, p_0; p^g_{t0}) by applying the chain rule: ∂I_t/∂p_k = (∂I_t/∂ξ)(∂ξ/∂p_k). The term ∂I_t/∂ξ is the gradient of the current input image I_t after TPS warping to the mean shape. The warping coefficients are fixed and can be pre-computed, since we use the common mean shape in all the normalized-appearance templates. The
solution for Eq. 7 can be computed when the higher-order terms in Eq. 8 are ignored:

δp = (M_t^T M_t)^{-1} M_t^T δI_t,    (9)

where δI_t is the image difference between the template appearance image and the current image warped to the template shape. We achieve a better fit of the shape to local image features by iteratively updating the local shape model. This local fitting provides better alignment of the shape and the normalized appearance for a given input image.
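Eq. (9) is a linear least-squares step. The toy sketch below (Python/NumPy) arranges M_t with one row per pixel so that δI_t ≈ M_t δp, and recovers δp with a numerically safer least-squares solve; the random Jacobian and difference image are stand-ins for the TPS-warped image gradients and the template difference.

```python
import numpy as np

rng = np.random.default_rng(3)

Npix, n = 4000, 38                        # pixels inside the mean shape, landmarks
M_t = rng.normal(size=(Npix, 2 * n))      # stand-in for the image-gradient Jacobian
dp_true = rng.normal(scale=0.5, size=2 * n)
dI_t = M_t @ dp_true + rng.normal(scale=0.1, size=Npix)   # difference image

# Eq. (9): dp = (M^T M)^{-1} M^T dI, computed here via least squares.
dp, *_ = np.linalg.lstsq(M_t, dI_t, rcond=None)
print("max abs error:", np.abs(dp - dp_true).max())
```

In the actual tracker this solve would be repeated inside the iterative local-fitting loop, with δI_t recomputed after each shape update.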
4.3 Combining Global Facial Motion Estimation and Local Facial Motion Estimation
We update the global deformation state using the new shape-normalized appearance image obtained after local fitting. As a result of accurate local fitting, the new shape-normalized appearance vector represents the appearance more accurately than the one estimated by global facial motion tracking. Using this new shape-normalized appearance vector, we update the expression state. First, we estimate a new expression weight κ_l based on the appearance vector after local fitting. The combined expression weight is then computed as a linear combination of the local expression weight κ_l and the global expression weight κ_g,

κ_new = (1 − ε)κ_g + εκ_l.    (10)

This process enhances the robustness of the expression parameter estimation. The combining parameter ε, which is estimated empirically, depends on the reliability of the local fitting; for example, local fitting is less reliable for an unknown subject, so we assign a small value to ε. Even though the combination is a linear interpolation, the overall system preserves the nonlinear characteristics of the facial motions. The refined global state estimate improves the accuracy of the geometric transformation in subsequent frames.
5 Experimental Results
In order to build the global shape and appearance model for different expressions, we use the Cohn-Kanade AU-coded facial expression database [21]. Each frame image has 38 landmark points (n = 38). The appearance vector is represented by 35965 pixels (N_a = 35965) inside the landmark shape contour in the mean shape. This appearance vector size depends on the mean shape size; by reducing the mean shape size, we can reduce the appearance vector dimension. We manually marked the shape landmarks of every other frame to learn the shape and appearance model. Since the database contains expression sequences running from the neutral expression to the peak expression, we embed the frames of each sequence on a half circle at equal distances.
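For reference, the equal-distance half-circle embedding of a neutral-to-peak sequence can be generated as in the following small sketch (Python/NumPy; our own code, not the authors').

```python
import numpy as np

def half_circle_embedding(num_frames):
    """Embed a neutral-to-peak expression sequence on a half circle, equally spaced."""
    t = np.pi * np.arange(num_frames) / (num_frames - 1)    # angles from 0 to pi
    return np.stack([np.cos(t), np.sin(t)], axis=1)         # (num_frames, 2)

print(half_circle_embedding(5))
```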
Fig. 2. Facial expression tracking with global and local fitting: (a) Best fitting global appearance in normalized shape. (b) Global shape tracking of facial motion. (c) Expression weights in global facial motion estimation. (d) Image error in the local fitting. (e) Local facial motion tracking with the adaptive template provided by the global appearance model. (f) Expression weights in local facial motion estimation.
Facial motion tracking with expression type estimation: The estimated expression type shows how well the facial motion tracking discriminates variations in facial motion across different expressions. Fig. 2 shows tracking of a smile expression sequence with local fitting. At each frame, global facial motion tracking estimates the expression weights (c) and the facial shape after the geometric transformation. The best fitting shape-appearance parameters provide the shape-normalized
Fig. 3. Comparison of facial expression tracking: (a) Comparison of tracking results: yellow - global fitting, red - local fitting. (b) Update of the estimated expression weights by combining local and global expression estimation. (c) Best fitting global model using the updated expression state.
appearance template (a) and the facial shape tracked after global deformation (b). After local nonrigid deformation estimation, the tracking result (e) shows a better fit of the shape deformation to the input image and a better estimate of the facial
Fig. 4. Tracking a surprise expression: (a) Error image based on a template after local fitting. (b) Tracking result of local deformation estimation with an initial frame as the template. (c) Tracking result with the adaptive template from the global shape and appearance model: yellow - global fitting, red - local fitting. (d) Estimated global expression weights.
expression type (f). The expression weights from the global deformation were similar for 'surprise' and 'happy (smile)'; after local deformation estimation, the estimated expression type correctly assigned a higher weight to the happy expression. However, some points, such as the left eyebrow, show inaccurate local fitting. Fig. 3 compares the tracking accuracy. After updating the estimated expression type by combining the global and local deformations, we obtained a new estimate of the expression weights, Fig. 3(a). Based on the new expression weights, we accurately estimated global facial motion tracking (c).
Tracking large facial deformations: We compared the tracking accuracy of a single template and of adaptive templates under large facial deformations. Fig. 4(b) shows the facial motion tracking result based on a single-frame template. It tracks facial motion appropriately for small deformations, but fails to track large facial deformations around the mouth region. Fig. 4(c) shows the facial motion tracking result using an adaptive template at each frame. As the global deformation model provides an updated appearance template in addition to the initial shape for tracking, it achieves more accurate tracking of large facial deformations.
6 Conclusion and Future Work
We have proposed a new framework for facial motion tracking that handles large facial deformations. Global deformation tracking based on the nonlinear shape and appearance model provides an adaptive appearance template under large facial deformations. Local fitting with these adaptive appearance templates enables accurate refinement of the coarse global estimate of facial motion. Here, we account for facial deformation through a combination of expression type and configuration change. We plan to extend our system to consider variations of facial shape and appearance across different people by applying multilinear analysis. TPS warping is computationally expensive; this computation could be implemented efficiently using general-purpose computing on graphics processing units (GPGPU), which provides efficient parallel processing.
References

1. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active shape models: Their training and applications. CVIU 61(1), 38–59 (1995)
2. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. In: Proc. of ECCV, vol. 2, pp. 484–498 (1998)
3. Baker, S., Matthews, I.: Equivalence and efficiency of image alignment algorithms. In: Proc. of CVPR, vol. 1, pp. 1090–1097 (2001)
4. Matthews, I., Baker, S.: Active appearance models revisited. IJCV 60(2), 135–164 (2004)
5. Hou, X., Li, S., Zhang, H., Cheng, Q.: Direct appearance models. In: Proc. of CVPR, vol. 1, pp. 828–833 (2001)
6. Batur, A.U., Hayes, M.H.: A novel convergence scheme for active appearance models. In: Proc. of CVPR, vol. 1, pp. 359–366 (2003)
7. Blanz, V., Vetter, T.: A morphable model for the synthesis of 3D faces. In: SIGGRAPH 1999, pp. 187–194. ACM Press/Addison-Wesley Publishing Co., New York (1999)
8. Hager, G.D., Belhumeur, P.N.: Efficient region tracking with parametric models of geometry and illumination. IEEE Trans. PAMI 20(10) (1998)
9. Black, M.J., Jepson, A.D.: Eigentracking: Robust matching and tracking of articulated objects using view-based representation. Int. J. Computer Vision, 63–84 (1998)
10. Ho, J., Lee, K.-C., Yang, M.-H., Kriegman, D.: Visual tracking using learned linear subspaces. In: Proc. of CVPR, pp. 782–789 (2004)
11. Elgammal, A.: Learning to track: Conceptual manifold map for closed-form tracking. In: Proc. of CVPR, pp. 724–730 (2005)
12. Lim, J., Yang, M.H.: A direct method for modeling non-rigid motion with thin plate spline. In: Proc. of CVPR, vol. 1, pp. 1196–1202 (2005)
13. Bookstein, F.L.: Principal warps: Thin-plate splines and the decomposition of deformations. IEEE Trans. PAMI 11(6), 567–585 (1989)
14. Stegmann, M.B.: Analysis and segmentation of face images using point annotations and linear subspace techniques. Technical Report TMM-REF-2002-22, Technical University of Denmark (2002)
15. Cootes, T.F.: Statistical models of appearance for computer vision. Technical report, University of Manchester (2004)
16. Lee, C.S., Elgammal, A.: Facial expression analysis using nonlinear decomposable generative models. In: Zhao, W., Gong, S., Tang, X. (eds.) AMFG 2005. LNCS, vol. 3723, pp. 17–31. Springer, Heidelberg (2005)
17. Schölkopf, B., Smola, A.: Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge (2002)
18. Elgammal, A., Lee, C.S.: Separating style and content on a nonlinear manifold. In: Proc. of CVPR, vol. 1, pp. 478–485 (2004)
19. Lee, C.-S., Elgammal, A.: Style adaptive Bayesian tracking using explicit manifold learning. In: Proc. of British Machine Vision Conference (2005)
20. Murphy, K., Russell, S.: Rao-Blackwellised particle filtering for dynamic Bayesian networks. In: Sequential Monte Carlo Methods in Practice, pp. 499–515. Springer, Heidelberg (2001)
21. Kanade, T., Tian, Y., Cohn, J.F.: Comprehensive database for facial expression analysis. In: Proc. of FGR, pp. 46–53 (2000)
Direct Ellipse Fitting and Measuring Based on Shape Boundaries

Milos Stojmenovic and Amiya Nayak

SITE, University of Ottawa, Ottawa, Ontario, Canada K1N 6N5
{mstoj075, anayak}@site.uottawa.ca
Abstract: Measuring ellipticity is an important problem in computer vision systems. Most existing ellipticity measures are area based and cannot easily be applied to point sets such as edges extracted from real-world images. We are interested in ellipse fitting and ellipticity measures that rely exclusively on shape boundary points, which makes them practical in computer vision. They should also be calculated very quickly and be invariant to rotation, scaling, and translation. Direct ellipse fitting methods are guaranteed to return specifically an ellipse as the fit rather than an arbitrary conic. We argue that the only existing direct ellipse fitting method does not work properly and propose a new, simple scheme. It determines the optimal location of the foci of the fitted ellipse along the orientation line (symmetrically with respect to the shape center) so as to minimize the variance of the sums of distances of points to the foci. We next propose a novel way of measuring the accuracy of ellipse fits against the original point set. The evaluation of fits proceeds by our novel ellipticity measure, which transforms the point data into a polar representation where the radius equals the sum of distances from the point to both foci, and the polar angle equals the angle the original point makes with the center relative to the x-axis. The linearity of the polar representation then corresponds to the quality of the ellipse fit for the original data. We also propose an ellipticity measure based on the average ratio of distances to the ellipse and to its center. The choice of center for each shape impacts the overall ellipticity measure, and we discuss two ways of determining the center of a shape. The measures are tested on a set of shapes. The proposed algorithms work well on both open and closed curves.

Keywords: Ellipticity, ellipse fitting, linearity, shape analysis.
1 Introduction

Classifying a shape as a certain primitive is important in image processing applications and computer vision systems. Popular shape measures such as elongation, convexity and orientation exist in the literature. Like other measures for primitive geometric shapes, the measure of ellipticity and ellipse fitting are motivated by real-world image processing problems. Ellipticity is common in nature and industry, and finding a way of identifying it can be important to both. Applications of
ellipse identification are found in agricultural and medical imaging systems for identifying certain grains, onions, watermelons, cells, and even human faces. In this article we study two related problems: ellipse fitting and measuring shape ellipticity. Finding an ellipse that best represents a set of points is called ellipse fitting. We are also interested in measuring how elliptical a finite set of points is. The two main approaches to ellipse fitting and measuring ellipticity are shape (boundary) based and area based. Shape based approaches consider only the points on the boundary or perimeter of a shape, while area based ones take into consideration all of the pixels within a closed shape. One of the main advantages of our algorithm is that it works on both open and closed curves. Most other ellipticity measures found in the literature apply exclusively to closed curves, because they are area based. Our algorithms have image processing applications in mind, which deal with pixels as points, but they also accept real numbers as input. In analyzing various algorithms, we restrict ourselves to the following criteria. Ellipticity values are assigned to sets of points, and these values shall be numbers in the range [0, 1]. The ellipticity measure equals 1 if and only if the shape is an ellipse, and equals 0 when the shape is highly non-elliptical. A shape's ellipticity value should be invariant under similarity transformations of the shape, such as scaling, rotation and translation. Ellipticity values should also be computed by a simple and fast shape based algorithm. As in the existing literature, the points in the set are not ordered; that is, ellipse fits and ellipticity measures do not depend on the ordering of points in the data set or along the boundary. Ellipse fitting has been widely studied in the literature. Voss and Süße [16] described an area and moments based method of fitting geometric primitives. Rosin [8, 9] discussed various ellipse fitting methods and various distance measures of points in data sets from the corresponding ellipse fit [11]. We will discuss a shape sampling based method [9] by Rosin because his papers indicate it as the ultimate selection. Fitzgibbon, Pilu and Fisher [4] introduced the only existing direct ellipse fitting method, which is heavily based on the work of Bookstein [1]. We will discuss the performance of this method in Section 2. Various ellipticity measures have been proposed in the literature. An ellipse fit is a prerequisite for some of them. These include the DFT based, shape based measure by Proffitt [6], the set operations area based method [5], and the Euclidean ellipticity shape based and orthogonal hyperbolae area based measures by Rosin [10]. Other measures are not based on a prior ellipse fit, such as the elliptic variance shape based method [7], and the area and moment based method [10]. Here, we propose and analyze an algorithm that fits an ellipse to a set of points, and one that finds an ellipticity value for the fitted ellipse. The ellipticity measure can be used to compare the quality of several ellipse fits. The ellipse fit is done by first choosing a shape center and finding the orientation line of the shape. The best ellipse fit is chosen by determining the locations of the foci of this fit along the orientation line, in opposite directions from the shape center. The best locations of the foci are those that minimize the variance of the sums of distances of points to the foci. Our new ellipticity measure is based on a measure of linearity used in [14].
Any one of the six linearity measures in [14] can be used as the basis of our ellipticity algorithm, since the input set of points is transformed from a planar representation to a polar representation, in which highly elliptical input point sets become highly linear. This transformation is done by using the center of the shape as the
origin in the polar representation, and by maintaining the angle each point forms with the center. The radius in polar form is modified so that it equals the sum of distances from the point to both foci. The choice of center of each shape influences its overall ellipticity value. The center of each shape is traditionally taken as its center of gravity. We propose to also use a true center (X_tc, Y_tc) of a shape, defined as the center of a circle C that best fits the shape. It is determined by sampling k triplets of points from the point set and finding their true centers; the median center value of the k samples is taken as the shape's true center. Our algorithms were tested in several ways. The new foci distance variance fit method is compared to the sampling [9] and EllipVos [16] methods using our new polar linearity and average distance ratio measures of ellipticity. The two fit measures from [5, 10] are compared with our two measures on a set of open or closed shapes, as appropriate. Overall in this paper, we propose two novel algorithms which, when combined, fit ellipses to sets of planar 2D points and assign ellipticity values to them. The literature review is given in Section 2. The new measures are presented in Section 3. Section 4 is reserved for the presentation of our test set along with a general discussion of our results.
2 Literature Review

2.1 Central Moments and Orientation of a Point Set

The central moment of order pq of a set of points Q is defined as:
μ_pq = Σ_{(x,y)∈Q} (x − x_c)^p (y − y_c)^q,
where S is the number of points in the set Q. The center of gravity (xc, yc) is the average value of each coordinate in set Q, and is determined as follows:
(x_c, y_c) = ( (1/S) Σ_i x_i, (1/S) Σ_i y_i ),
where (xi, yi), 1≤i≤S, are real coordinates of points from Q. The angle of orientation of the set of points Q is determined by [2]:
angle = 0.5 arctan( 2μ_11 / (μ_20 − μ_02) ).
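A direct implementation of these definitions is sketched below (Python/NumPy; our own code). It uses arctan2 rather than arctan so that the quadrant of the orientation angle is resolved.

```python
import numpy as np

def central_moment(points, p, q):
    """mu_pq = sum over Q of (x - xc)^p (y - yc)^q for a set of 2D points (n, 2)."""
    xc, yc = points.mean(axis=0)
    return np.sum((points[:, 0] - xc) ** p * (points[:, 1] - yc) ** q)

def orientation(points):
    """Orientation angle = 0.5 * arctan(2*mu11 / (mu20 - mu02)), in radians."""
    mu11 = central_moment(points, 1, 1)
    mu20 = central_moment(points, 2, 0)
    mu02 = central_moment(points, 0, 2)
    return 0.5 * np.arctan2(2 * mu11, mu20 - mu02)

# Example: noisy points along a line at 30 degrees recover roughly that angle.
rng = np.random.default_rng(4)
t = rng.uniform(-1, 1, 300)
pts = np.column_stack([t * np.cos(np.pi / 6), t * np.sin(np.pi / 6)]) \
      + rng.normal(scale=0.01, size=(300, 2))
print(np.degrees(orientation(pts)))      # close to 30
```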
2.2 Ellipse Fitting by Sampling

Rosin [9] described the least median of squares (LMedS) shape based approach to ellipse fitting. Several (k) minimal subsets (i.e., five points) of the data are selected at random and used to generate the ellipses through them. Only subsets that generate coherent results are considered. When enough minimal subsets have been acquired to satisfy a pre-determined error estimation threshold, the median value over the k subsets of each ellipse coefficient is taken to define the ellipse fit. This method is not guaranteed to return an ellipse.
2.3 Moment Based Ellipse Fitting

Voss and Süße [16] described a moments and area based method of fitting geometric primitives by ellipses. The data is normalized into a unique canonical frame (a circle with a specific radius in the case of ellipse fitting) by applying an affine transform. Given the input points, they first find the centroid and translate all points so that the new centroid is at the origin. Then μ_01 = 0 and μ_10 = 0 for the translated shape. They then apply an x-shearing transform which maps (x, y) to (x', y') as follows: x' = x + βy, y' = y. The goal is to obtain an elliptical shape with horizontal and vertical axes. For this position, the new moment value μ'_11 is 0, because such an ellipse is an odd function. This means that
∫∫_S (x + βy) y dx dy = 0, which gives β = −μ_11/μ_02,

where μ_11 and μ_02 are taken before the transform [12]. The elliptical shape should now be scaled to obtain a circular shape with new unit moments M_02 = M_20 = 1. The circle with such unit-value moments has radius r = (4/π)^{1/4} [12]. The new transformation is x'' = αx', y'' = δy'.

1 = M_20 = ∫∫_{S''} x''^2 dx'' dy'' = ∫∫_{S'} (αx')^2 α dx' δ dy' = α^3 δ μ'_20,
and similarly 1 = M_02 = αδ^3 μ'_02. The moments μ'_02 and μ'_20 of the shape are calculated after x-shearing and before scaling. Multiplication gives (αδ)^4 μ'_02 μ'_20 = 1, which leads to α = (μ'_02/μ'_20^3)^{1/8}, δ = (μ'_20/μ'_02^3)^{1/8}. This determines the new transform, which brings the data to a circular form in a unique canonical frame. Applying the inverse transforms to the circle gives the elliptical fit for the original data. A circle with radius r is scaled back to an ellipse with axes a = r/α, b = r/δ. The equation of that ellipse is x'^2/a^2 + y'^2/b^2 = 1. After inverse x-shearing, the equation of the original ellipse is (x + βy)^2/a^2 + y^2/b^2 = 1.

2.4 Direct Least Square Fitting

Fitzgibbon, Pilu and Fisher [4] described a direct method for least squares fitting of ellipses. The method is based on the method of Bookstein [1] for fitting conic sections to scattered data. The method in [4] is ellipse-specific, so it always returns an ellipse. The equation a·x = ax^2 + bxy + cy^2 + dx + ey + f = 0, where a = [a b c d e f]^T and x = [x^2 xy y^2 x y 1]^T, describes an arbitrary conic section a. This conic section is an ellipse if b^2 − 4ac < 0. Given a point (x_i, y_i), its algebraic distance to the conic a is a·x_i, where x_i = [x_i^2 x_i y_i y_i^2 x_i y_i 1]^T. The method [1] minimizes the sum of squared algebraic distances G = Σ_{i=1}^{N} (a·x_i)^2, where the x_i are the input points.
Let D = [x_1 x_2 … x_n]^T and S = D^T D (the scatter matrix, a symmetric 6×6 matrix). Then G = (Da)^T(Da) = a^T D^T D a = a^T S a. Therefore the problem is to minimize G = a^T S a. However, a constraint on the coefficients needs to be placed, since otherwise
the solution is not unique. Bookstein [1] suggested using the constraint a^2 + b^2/2 + c^2 = 2. We follow here the description and solution of [1], since [4] only discusses the differences from that algorithm. This solution is generally applicable for any constraint of the form a^T C a = constant. The main contribution of [4] is to replace the diagonal constraint matrix C by a matrix which corresponds to the condition 4ac − b^2 = 1, which then always produces an ellipse. Such a matrix C satisfies a^T C a = 1 and has all zeros except C_13 = C_31 = 2, C_22 = −1. Let a^T = (a_1^T | a_2^T) = [a b c | d e f]. Let S be decomposed into four blocks, where S_11, S_12 = S_21^T and S_22 are 3×3 matrices. Then G = a^T S a = a_1^T S_11 a_1 + 2 a_1^T S_12 a_2 + a_2^T S_22 a_2. The constraint is on a_1, not on a_2. Assuming a_1 is fixed, the minimum of G is obtained when d(a^T S a)/d(a_2) = 0. Using theorems of matrix calculus, this means that 2 a_1^T S_12 + 2 a_2^T S_22 = 0, or a_2^T = −a_1^T S_12 S_22^{-1}. Then G = a_1^T S_11 a_1 + a_1^T S_12 a_2 = a_1^T (S_11 − S_12 S_22^{-1} S_21) a_1 = a_1^T S_1 a_1, where S_1 = S_11 − S_12 S_22^{-1} S_21. The problem now is to minimize a_1^T S_1 a_1 subject to a_1^T C' a_1 = 1, where C' is the top-left 3×3 sub-matrix of C (nonzero elements C'_13 = C'_31 = 2, C'_22 = −1). A Lagrange multiplier λ is introduced to minimize a_1^T S_1 a_1 − λ a_1^T C' a_1. Setting the derivative with respect to a_1^T to zero gives 2 S_1 a_1 − 2λ C' a_1 = 0. Thus λ is a relative eigenvalue of S_1 with respect to C', i.e., a solution of |S_1 − λ C'| = 0, and a_1 is the corresponding eigenvector. The determinant |S_1 − λ C'| = 0 is a cubic polynomial in λ. The eigenvectors for each real solution λ are obtained from (S_1 − λ C') a_1 = 0. Usually the best solution corresponds to the smallest λ [1]. Let H = S_1 − λ C' and let u = (a b c)^T be a solution of the homogeneous 3×3 system Hu = 0 (one can fix c = 1). Then μu is also a solution for any μ, and a_1 = μu must satisfy the constraint a_1^T C' a_1 = 1. Thus μ^2 u^T C' u = 1 and μ = (1/(u^T C' u))^{1/2}. The conic we looked for is a_1 = μu = μ(a b c)^T, while a_2^T = (d e f) is obtained from a_2^T = −(μu)^T S_12 S_22^{-1}. It can be observed, however, that scaling the input data D leads to a scaling of the matrices S and S_1 without changing the constraint C', and therefore changes the cubic polynomial |S_1 − λ C'| = 0. This changes the claimed optimal values for λ and a_1, as confirmed by our implementation, without ever producing a visually good fit. We believe that the optimal fit should be scalable and therefore conclude that the method [1, 4] does not work.

2.5 DFT Based Ellipticity Measure

Proffitt [6] described a shape based approach for measuring ellipticity based on the discrete Fourier transform (DFT). An ellipse is fitted to the shape by centering it on the shape's centroid. The ellipse is then scaled such that the mean square of the lengths of the lines from the centroid to the boundary points matches the shape's. Let u[j] = a[j] + ib[j], i = √−1, and v[j] = x[j] + iy[j] be the corresponding points on the ellipse and on the shape boundary, with the line between them passing through the centroid. Then the ellipticity measure D [6] (not normalized to [0, 1]) is

D^2 = (1/(2N)) Σ_{j=1}^{N} |u[j] − v[j]|^2.
Note that |u[j] − v[j]|^2 = (a[j] − x[j])^2 + (b[j] − y[j])^2 is the squared Euclidean distance.

2.6 Area Comparison to Measure Ellipse Fit

Koprnicky, Ahmed, and Kamel [5] defined ellipticity measures based on comparing the area of a shape S, the area of its ellipse fit R, and the areas of the set differences S\R
and R\S between the two. The measure that is closest to our criteria is (area(S\R) + area(R\S))/area(S∪R). The authors do not elaborate on how to determine the best ellipse fit. This measure, however, might produce results outside of the interval [0, 1] when the ellipse fit is much bigger than the shape it is trying to fit. We therefore modified their method to measure ellipticity via area(S∩R)/area(S∪R).

2.7 Moment Based Ellipticity Measures

An area based ellipticity measure of a fit is obtained by comparing the differences between the central moments μ_ij and μ'_ij of the shape and the corresponding ellipse fit [10, 13]. The measure is

1 / ( 1 + Σ_{i,j=0; i+j≤4} (μ'_ij − μ_ij)^2 ).
This method relies on normalizing the μ_ij coefficients, normally gives numbers close to 0, and can be used to rank shapes. Rosin [10] defined another moment based ellipticity measure as follows. Since any ellipse can be obtained by applying an affine transform to a circle, the simplest affine moment invariant of the circle can be used to characterize ellipses. It is defined as I = (μ_20 μ_02 − μ_11^2)/μ_00^4, where μ_pq are central moments. The value for the unit-radius circle is I_c = 1/(16π^2). Thus I is measured for a given shape, and the ellipticity measure of that shape in [10] is: E = 16π^2 I if I ≤ 1/(16π^2), and 1/(16π^2 I) otherwise.

2.8 Elliptic Variance Based Ellipticity Measure

Peura and Iivarinen [7] described an 'elliptic variance' which they used to measure ellipticity based on shape perimeters. The center of gravity u = (u_1, u_2)^T and the covariance C of the N data points p_i = (x_i, y_i)^T are calculated. The covariance is a 2×2 matrix computed by the following formula:

C = (1/N) Σ_{i=1}^{N} (p_i − u)(p_i − u)^T.

The mean radius of the contour is

v = (1/N) Σ_{i=1}^{N} √( (p_i − u)^T C^{-1} (p_i − u) ).

The elliptic variance is then

EVAR = (1/(N v^2)) Σ_{i=1}^{N} ( √( (p_i − u)^T C^{-1} (p_i − u) ) − v )^2.
Rosin [10] modified it to get a measure in [0, 1] as follows: PI = 1/(1+EVAR). He observed that EVAR suffers from the same problems as many distance approximations used for ellipse fitting. In particular, similar to standard algebraic distance, it exhibits a curvature bias (distances near the pointed ends of the ellipse are underestimated relative to distances at the flatter sections). This leads to irregularities at the ends having less of an effect than at the sides. However, it is not prone to the asymmetry between distances inside and outside the fitted ellipse.
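A minimal implementation of the elliptic variance and of the normalization PI = 1/(1 + EVAR) is sketched below (Python/NumPy; our own code, assuming the boundary points are given as an n×2 array).

```python
import numpy as np

def elliptic_variance(points):
    """Return (EVAR, PI) of Peura and Iivarinen / Rosin for boundary points (n, 2)."""
    u = points.mean(axis=0)
    diff = points - u
    C = diff.T @ diff / len(points)                               # 2x2 covariance
    Cinv = np.linalg.inv(C)
    r = np.sqrt(np.einsum('ij,jk,ik->i', diff, Cinv, diff))       # Mahalanobis radii
    v = r.mean()
    evar = np.mean((r - v) ** 2) / v ** 2
    return evar, 1.0 / (1.0 + evar)

# Example: a clean ellipse boundary gives EVAR near 0 and PI near 1.
t = np.linspace(0, 2 * np.pi, 400, endpoint=False)
ellipse = np.column_stack([3 * np.cos(t), 1 * np.sin(t)])
print(elliptic_variance(ellipse))
```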
2.9 Orthogonal Hyperbolae Distance and Ellipticity Measure

Rosin described a number of metrics for calculating the approximate error of each point, which represents the distance the point deviates from the fit [8, 11]. The true point error, the distance from the point along the line normal to the ellipse, involves solving a quartic equation. Rosin recommended [10] using the method that approximates the normal distance to an ellipse by the distance to the intersection with the confocal orthogonal hyperbola passing through the point [8]. The ellipse fit error is then robustly and accurately determined by the summed errors of the ellipse fit, SE = Σ_{i=1}^{N} d_i, where the d_i are the orthogonal hyperbolae based distance approximations [8] and N is the number of points to fit. Rosin [10] then proposed the following area based ellipticity measure: EE = (1 + SE/(N√A))^{-1}, where A is the area of the given shape.

2.10 Detecting Elliptical Shapes

[15] described a variety of shape measures, applying them as shape descriptors of saccular otoliths. They list a number of shape measures from the literature for measuring aspect ratio (elongation), compactness, convexity, eccentricity, rectangularity, ellipticity, circularity [6], triangularity [16], intrusiveness, protrusiveness, and several further measures. [3, 17] consider the problem of detecting ellipses, but do not measure ellipticity in the process. David [3] studied the problem of detecting cereal grains, considering them as ellipses with aspect ratios (elongation) close to 2:1. They detect ellipses based on the property that the line passing through the centers of two horizontal chords also passes through the center of the ellipse. A Hough transform was applied to handle irregularities when grains touch each other. Zhang and Liu [17] described an ellipse detector that may be used for real-time face detection. They first extract edges from the image using a robust edge detector. The center of an ellipse is detected by intersecting two lines that pass through the intersection of tangent lines and the midpoint of the corresponding chord. The parameter space of the Hough transform is decomposed to achieve computational efficiency.

2.11 Measuring Linearity

Stojmenovic, Nayak and Zunic [14] proposed six linearity measures for finite planar point sets. Their measures are quickly calculated and are invariant to scale, translation and rotation. All of their methods give linearity estimates in the range [0, 1] after some normalization. Their average orientations measure takes k random pairs of points along the curve, finds their slopes m, and finds the normals (−m, 1) to these slopes. These normals are averaged, and the resulting normal (A, B) is deemed to be the normal to the orientation of the curve; the averaging is done separately for each vector coordinate. The measure of linearity is defined as √(A^2 + B^2). The eccentricity measure is determined by first finding the center of gravity (X_c, Y_c) of the set of points. All of the points are translated by (−X_c, −Y_c) so that the center of gravity is (0, 0). The linearity is determined to be
linearity = √( (μ_20 − μ_02)^2 + 4μ_11^2 ) / (μ_20 + μ_02),
where μ_11, μ_02, and μ_20 are the second order moments of the shape. The triangle heights linearity measure is found by taking k triplets of random points from the set and computing the height h to the longest side of the triangle that each triplet forms. This h value is divided by the longest side c of the triangle to normalize the measure, giving h_c = h/c; the average of these k h_c values is used as the linearity measure of the set of points. The triangle perimeters method is similar, in the sense that k triplets of random points are taken from the set; each triplet defines the vertices of a triangle whose three sides are labelled a, b and c, where a ≤ b ≤ c. The linearity measure derived from these three sides is p = (2c − a − b)/c. The contour smoothness measure is formed by taking triplets of points and averaging their triangle smoothness values, each of the form area/perimeter^2. The maximum value of area divided by the squared triangle perimeter is √3/36 (attained for an equilateral triangle). After the smoothness values are averaged to produce a value sum_s, the result is adjusted as sum_s = 36·sum_s/√3, and the complement of the obtained sum_s value is taken as the linearity value. The ellipse axis ratio measure is determined by first finding the center of mass and the first and second order moments of the set of input points, and then finding the major and minor axes of the best fit ellipse as determined by the formulas in [9]. The linearity value is given as 1 − minor axis/major axis.
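The eccentricity-based linearity measure above is straightforward to compute directly from the second order moments; a small self-contained sketch follows (Python/NumPy; our own code, not taken from [14]).

```python
import numpy as np

def eccentricity_linearity(points):
    """Linearity = sqrt((mu20 - mu02)^2 + 4*mu11^2) / (mu20 + mu02) for 2D points."""
    c = points.mean(axis=0)
    x, y = points[:, 0] - c[0], points[:, 1] - c[1]
    mu20, mu02, mu11 = np.sum(x * x), np.sum(y * y), np.sum(x * y)
    return np.sqrt((mu20 - mu02) ** 2 + 4 * mu11 ** 2) / (mu20 + mu02)

# A straight segment scores near 1, a full circle near 0.
t = np.linspace(0, 1, 200)
print(eccentricity_linearity(np.column_stack([t, 2 * t])))                    # ~1
ang = np.linspace(0, 2 * np.pi, 200, endpoint=False)
print(eccentricity_linearity(np.column_stack([np.cos(ang), np.sin(ang)])))    # ~0
```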
3 Measuring Ellipticity

Our new ellipse measures are presented here in two sections. The first deals with fitting an ellipse to point data, and the second measures the ellipticity of the point set by rating the accuracy of the ellipse fit. The quality of the fit relies on an accurate method of finding the center of the shape. We propose two ways of finding the shape center: the standard way, which uses the center of gravity, and another, which finds the true center of the shape.

3.1 Fitting an Ellipse to a Set of Points

Here we describe the algorithm that fits an ellipse to a set of points. Its input is just the set of points, and it outputs the optimal foci locations, along with the major and minor axes of the fitted ellipse and the orientation angle of its major axis. We begin by finding the angle of orientation α of the point set via moments. The moment based algorithm sometimes produces orientation angles that are normal to the actual shape orientation. To verify the correctness of the obtained value α, the linearity of the set is measured twice: once assuming the orientation angle is α, and once assuming it is α + 90°. Linearity can be measured using any one of the linearity measures from [14]. The higher of the two linearity values corresponds to the actual
orientation of the shape. The orientation line passes through the selected center and has slope α . We then project all points onto the orientation line, resulting in a new array. The two extremity points min and max along the orientation line of the new array are found. In Figure 1 we see the blue orientation line which is also the line on which the points on the shape are projected. G is the center of the shape. Foci f1 and f2 will be determined by the foci finding procedure to follow.
Fig. 1. Orientation line with foci, min, max, a, b, c, and G

Fig. 2. Variance of summed foci distances
3.1.1 Finding Optimal Foci for Ellipse Fitting

For simplicity, we assume that the point set has been translated such that its center is at the origin, and rotated such that its orientation line lies on the x axis. In Figure 2, the distances to the foci are:

d_1 = √( (x_i − c)^2 + y_i^2 )  and  d_2 = √( (x_i + c)^2 + y_i^2 ).

Therefore, we have

D_i = d_1 + d_2 = √( (x_i − c)^2 + y_i^2 ) + √( (x_i + c)^2 + y_i^2 ).

We need to find the c for which the values D_i have the smallest possible variance. The variance is defined by

f(c) = (N − 1)σ^2 = Σ_{i=1}^{N} D_i^2 − (1/N) ( Σ_{i=1}^{N} D_i )^2.

This is a continuous function and thus has a minimum value c such that f'(c) = 0:

f(c) = Σ_{i=1}^{N} ( √((x_i − c)^2 + y_i^2) + √((x_i + c)^2 + y_i^2) )^2 − (1/N) ( Σ_{i=1}^{N} ( √((x_i − c)^2 + y_i^2) + √((x_i + c)^2 + y_i^2) ) )^2

⇒ f'(c) = 2 Σ_{i=1}^{N} ( √((x_i − c)^2 + y_i^2) + √((x_i + c)^2 + y_i^2) ) · ( −(x_i − c)/√((x_i − c)^2 + y_i^2) + (x_i + c)/√((x_i + c)^2 + y_i^2) )
 − (2/N) ( Σ_{i=1}^{N} ( √((x_i − c)^2 + y_i^2) + √((x_i + c)^2 + y_i^2) ) ) · Σ_{i=1}^{N} ( −(x_i − c)/√((x_i − c)^2 + y_i^2) + (x_i + c)/√((x_i + c)^2 + y_i^2) ).
This is a continuous function of one variable, which can be solved by a standard root-finding technique of numerical analysis, such as the bisection method. Note that f(c) may have local minima, and thus the equation f'(c) = 0 may have multiple solutions, some of which could even correspond to local maxima. Therefore solving the problem in this direction is not straightforward, and we opted for a simple approximate solution corresponding to a linear search with pixel-unit steps. In our implementation, the foci are found by inspecting each possible location of the pair, between min and G for f1 and between G and max for f2 simultaneously, so that |Gf1| = |Gf2|. For each candidate foci pair, the sums of distances from each point on the shape to both foci are stored in an array. The variance of this array is calculated for each candidate pair, and the pair with the lowest variance is the best choice for the foci of the ellipse fit. Once the foci are found, we take the median of the sums of distances from each point on the shape to both foci in order to find the length of the major axis a, which equals half of this median. The distance c from focus f1 to the center G is used to find the length of the minor axis b = √(a^2 − c^2). We now have all of the necessary components of the ellipse fit to be able to evaluate it.
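The foci search just described can be sketched as follows (Python/NumPy). The code assumes the points have already been centered and rotated as stated, and uses a configurable step instead of strict pixel-unit steps; it is an illustration of the described procedure, not the authors' implementation.

```python
import numpy as np

def fit_foci(points, step=1.0):
    """Linear search for the focal distance c minimizing the variance of D_i = d1 + d2.

    `points` (n, 2) are assumed centered at the chosen shape center and rotated so
    the orientation line is the x axis. Returns (c, a, b) of the fitted ellipse.
    """
    xmax = np.abs(points[:, 0]).max()
    best_c, best_var, best_D = 0.0, np.inf, None
    for c in np.arange(0.0, xmax, step):
        d1 = np.hypot(points[:, 0] - c, points[:, 1])
        d2 = np.hypot(points[:, 0] + c, points[:, 1])
        D = d1 + d2
        v = D.var()
        if v < best_var:
            best_c, best_var, best_D = c, v, D
    a = np.median(best_D) / 2.0                        # axis a = half the median sum
    b = np.sqrt(max(a * a - best_c * best_c, 0.0))     # axis b = sqrt(a^2 - c^2)
    return best_c, a, b

# Example: recover the parameters of a synthetic axis-aligned ellipse (a=5, b=2).
t = np.linspace(0, 2 * np.pi, 300, endpoint=False)
pts = np.column_stack([5 * np.cos(t), 2 * np.sin(t)])
print(fit_foci(pts, step=0.05))      # c ~ sqrt(21) ~ 4.58, a ~ 5, b ~ 2
```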
3.2 Assessing the Fit Quality: Minimal Variance of Summed Foci Distances

Once the foci of the ellipse fit have been determined, the quality of the fit can be assessed. This is done by first transforming the original point set into a polar representation, as illustrated in Figure 3. Figure 3 also shows the inherent ellipse property that the sum of distances from any point on the ellipse to the two foci is constant; we exploit this property when transforming the point set to polar coordinate form. As seen in the bottom part of Figure 3, the polar distance value for each point x is the sum of distances from x to both foci, r = d_1 + d_2, measured from the center G. The angle α that the vector Gx' forms with the x-axis remains the same as the angle Gx formed with the x-axis. For a perfect ellipse, the resulting shape can be drawn as a circle, but if its polar coordinates are plotted as Cartesian coordinates, they look highly linear, as seen in Figure 4. Applying a linearity measure to this polar representation yields a linearity value for the modified set of points, and this linearity value represents the ellipticity value of the original set. A normalization is applied to the polar coordinate transformation, since the angles the points form in the new representation are limited to the interval [0, 360], whereas the lengths of the radii are unbounded; this can result in polar representations that are not proportional to the original shape. To normalize the polar representation, the following information is gathered from it: the smallest and largest radii r_min and r_max, and the variance of the radii, r_var. The normalization factor is norm = (360·r_var)/N; it represents the size of the interval each radius is fit into. The larger the variance of the radii, the less the points fit the ellipse, so the interval they are placed in is larger, which makes the linearity value of this set smaller. Each radius value is normalized by R = norm·(r − r_min)/(r_max − r_min), which fits the radius into the interval [0, norm]. A sketch of this transformation follows Fig. 4.
Fig. 3. Transforming the input set to polar representation
Fig. 4. Polar point set on a planar graph
3.3 Average Distance Ratio to Ellipse and the Center
We propose a shape-based ellipticity measure that can be applied to open and closed shapes. Let O be the center of the fitted ellipse, and let V be a point from the original set. Let U be the intersection of line VO with the fitted ellipse that is closer to V, so that |VU| < |VO|. This intersection can be found by using the equations of a line and an ellipse, which leads to a quadratic equation. The error measure used is the average of the distance ratio to the ellipse and to the center, that is, of |VU|/|VO|. It always returns a number between 0 and 1.
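Because the line passes through the ellipse center, the quadratic reduces to a single scale factor once V is expressed in the ellipse's own frame. The following sketch (our illustration; the fitted parameters a, b and orientation theta are assumed known) computes the ratio |VU|/|VO| for one point:

```python
import numpy as np

def distance_ratio(V, O, a, b, theta):
    """Ratio |VU| / |VO|, where U is the intersection of line VO with the
    ellipse (center O, semi-axes a, b, rotated by theta) on the side of V."""
    d = V - O
    ct, st = np.cos(theta), np.sin(theta)
    dx, dy = ct * d[0] + st * d[1], -st * d[0] + ct * d[1]   # V - O in the ellipse frame
    t = 1.0 / np.sqrt((dx / a) ** 2 + (dy / b) ** 2)         # U = O + t (V - O)
    U = O + t * d
    return np.linalg.norm(V - U) / np.linalg.norm(V - O)
```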
Fig. 5. Finding intersect U on line VO
Fig. 6. Choosing the correct shape center
3.4 Finding the Center of a Shape
The trivial way of choosing a shape's center is to take the per-coordinate average of all pixels, which is the center of gravity. This is the method that is usually chosen when measuring any shape property such as linearity, orientability or elongation. Choosing the appropriate center of a shape when measuring ellipticity is more delicate and heavily influences the result of the ellipticity measure. To illustrate this point, we turn to Figure 6. Here, we see a semi-elliptical shape where the red dot represents the center of the shape as determined by the traditional method. Transferring the shape to polar coordinates with respect to the red dot would not yield a straight line, but rather a curved one. This would result in a much lower ellipticity measure than expected for the given shape. However, had the green dot been chosen as the center, the resulting polar coordinate representation would have looked similar to the line seen in Figure 4, and a much higher ellipticity measure
would have been awarded. We have experimented with both types of center finding methods in this work. In order to find the ‘true center’ of a shape, as opposed to its traditional center of gravity, we sampled k quintuplets of points from its point set. From each quintuplet, we found the center (Xtc, Ytc) that the points define via the method proposed in [9].
4 Experimental Data
The ellipticity algorithms were tested on a set of 20 closed shapes, shown in Figure 7, and 15 open shapes, shown in Figure 8. These shapes were assembled by hand and are meant to cover a wide variety of non-trivial curves. Each of them is comprised of between 100 and 500 points; this number of pixels is common in the extracted edges of some computer vision systems. The center of gravity was close to the true center in almost all of the closed test curves seen in Figure 7. There are, however, big differences in the ellipticity values of the open shapes seen in Figure 8, because there the two centers were generally not close. Table 1 holds the ellipticity values for closed curves. The three algorithms for ellipse fitting found in Table 1 are the new foci variance fitting measure proposed here, the fitting measure proposed by Voss and Süße [16], and Rosin's sampling LMedS technique [9]. This set of three ellipse fitting techniques was evaluated by two separate ellipticity measures: the ellipticity via linearity measure proposed here (the average orientation linearity measure [14] was applied in polar space), and the area-based measure proposed by Koprnicky, Ahmed and Kamel (KAK) [5]. The rightmost column in Table 1 shows the ellipticity values of the E measure in [10], which does not rely on an ellipse fit of a shape.
Fig. 7. Closed test curves
Fig. 8. Open Test Curves
The VS [16] fit algorithm is area based, so to be able to compare it to the shape-based ones, the fit it produced from the area pixels of a shape was evaluated for ellipticity against the perimeter pixels of the same shape. The perimeter pixels were extracted using a 3×3 mask that checked whether a pixel had at least one neighbour which is not included in the shape; if it had at least one, it was considered part of the perimeter. Most of the algorithms agree on the ellipticity results of the figures which are highly elliptic to the naked eye, as seen in Table 1. Shape 8 in Figure 7 is the first major disagreement between the ellipticity fits and measures. The left two shapes of Figure 9 show the ellipse fit of our algorithm, and the general fit of the rest of the
algorithms for this triangular shape. Our fit is more elongated, and closely follows the triangle on the left hand side. The ellipticity values for this shape vary considerably between measures. Our measure produces the lowest ellipticity value for this shape and we feel that such a low score is merited given that a triangle should not be closely associated with an ellipse.

Table 1. Ellipticity results for closed curves
(first three value columns: ellipticity via linearity; next three: KAK measure; last: E [10])

Shape | Foci var gc | VS fit [16] | LMedS [9] | Foci var gc | VS fit [16] | LMedS [9] | E [10]
1     | 1    | 0.98 | 1    | 0.95 | 0.99 | 0.96 | 1
2     | 1    | 0.97 | 1    | 0.94 | 0.99 | 0.94 | 1
3     | 1    | 0.98 | 1    | 0.96 | 0.99 | 0.97 | 1
4     | 0.92 | 0.83 | 1    | 0.82 | 0.92 | 0.68 | 0.91
5     | 1    | 0.98 | 1    | 0.98 | 1    | 0.98 | 1
6     | 0.83 | 0.73 | 0.75 | 0.68 | 0.9  | 0.64 | 0.72
7     | 0.99 | 0.84 | 0.97 | 0.8  | 0.94 | 0.93 | 0.9
8     | 0.27 | 0.68 | 0.83 | 0.59 | 0.88 | 0.31 | 0.68
9     | 0.51 | 0.54 | 0.49 | 0.44 | 0.87 | 0.73 | 0.35
10    | 0.84 | 0.77 | 0.92 | 0.77 | 0.91 | 0.86 | 0.76
11    | 0.99 | 0.89 | 0.99 | 0.84 | 0.96 | 0.97 | 0.87
12    | 0.75 | 0.73 | 0.79 | 0.69 | 0.89 | 0.93 | 0.78
13    | 0.77 | 0.63 | 0.69 | 0.24 | 0.87 | 0.64 | 0.55
14    | 0.4  | 0.63 | 0.14 | 0.34 | 0.93 | 0.99 | 0.91
15    | 0.85 | 0.77 | 0.82 | 0.89 | 0.95 | 0.15 | 0.97
16    | 0.45 | 0.6  | NA   | 0.49 | 0.86 | NA   | 0.54
17    | 0.89 | 0.37 | NA   | 0.45 | 0.87 | NA   | 0.38
18    | 0.99 | 0.45 | 0.99 | 0.95 | 0.44 | 0.94 | 0.99
19    | 0.89 | 0.32 | 0.85 | 0.92 | 0.33 | 0.93 | 0.98
20    | 0.91 | 0.39 | 0.74 | 0.55 | 0.94 | 0.95 | 0.92
Shape 16 in Figure 7 shares a similar discrepancy in awarded ellipticity scores. The star shape has similar disagreements in ellipticity scores to the triangular shape, and the explanation for these disagreements is also similar. The difference in fits can be seen in the right part of Figure 9. The rest of the closed shapes have ellipse fits and values that very closely follow the original shapes.

Table 2. Ellipticity results for open curves
(first three value columns: ellipticity via polar linearity; last three: distance ratio to ellipse and center)

Shape | Foci var tc | Foci var cg | LMedS [9] | Foci var tc | Foci var cg | LMedS [9]
1     | 0.94 | 0.54 | 0.88 | 0.97 | 0.83 | 0.86
2     | 0.98 | 0.66 | 0.93 | 0.98 | 0.82 | 0.96
3     | 0.61 | 0.39 | 0.40 | 0.90 | 0.82 | 0.88
4     | 1.00 | 1.00 | 1.00 | 0.97 | 0.97 | 0.98
5     | 0.96 | 0.98 | 0.99 | 0.72 | 0.75 | 0.66
6     | 0.10 | 0.71 | NA   | 0.74 | 0.69 | NA
7     | 0.82 | 0.83 | 0.80 | 0.74 | 0.76 | 0.66
8     | 0.82 | 0.56 | 0.71 | 0.92 | 0.89 | 0.79
9     | 0.91 | 0.62 | 0.91 | 0.95 | 0.92 | 0.96
10    | 0.96 | 0.22 | 0.97 | 0.89 | 0.79 | 0.78
11    | 0.93 | 0.92 | 0.96 | 0.92 | 0.82 | 0.88
12    | 0.79 | 0.92 | 0.87 | 0.86 | 0.83 | 0.87
13    | 0.96 | 0.98 | 1.00 | 0.84 | 0.86 | 0.66
14    | 0.92 | 0.93 | 0.87 | 0.77 | 0.83 | 0.78
15    | 0.61 | 0.69 | 0.95 | 0.71 | 0.72 | 0.56
Table 2 shows the results of the ellipticity measures as applied to the open curves in Figure 8. Our ellipse fits with true and gravity centers, based on minimizing the variance of summed distances to the foci, are compared with the sampling LMedS method [9]. Fit accuracies are measured by our two newly proposed methods: polar linearity and average distance ratio to the fitted ellipse and its center. The choice of center significantly impacts the fit and ellipticity values of the shapes tested here. Shapes 1 and 2 in Figure 8 best illustrate the impact of center selection. In Figure 10, we see these two shapes coupled with their ellipse fits for both the center of gravity and true center approaches. The light blue dots represent the chosen centers for each shape. The ellipticity values for the shapes in Figure 10 are 0.54, 0.94, 0.66, and 0.98, respectively.
Fig. 9. Ellipticity of a triangle and a star
Fig. 10. Center of gravity vs. true center impact
5 Conclusion
Finding the true center of an ellipse greatly improved the fit for open curves. However, the orientation line also heavily influences the overall quality of the fit. If the orientation line is found not to overlap with the actual visual orientation of the shape, the foci-finding procedures are forced to place the foci along this erroneous orientation line. This diminishes the ellipticity values for highly elliptical shapes. The LMedS sampling method [9] for finding the true center of a shape does not always return a center, since some shapes cannot be easily fit with an ellipse. This is especially true considering the 5-point sampling technique, which might not produce quintuplets of points that can be inscribed in just one ellipse. The other problem with the sampling technique is that it might not return an ellipse, but instead either a hyperbola or a parabola. We have made an extensive literature review, investigation, and implementation of existing methods. Most ellipticity methods from the literature do not fit our criteria, for a variety of reasons. Most of the methods are area based [5, 16, 4], and some of the ones that are shape based do not return ellipticity values that reflect how elliptical a shape is; rather, they can just be used to rank shapes (moment based [10], [6, 7]). Other than LMedS [9] for fitting, our methods appear to be the only ones that are boundary based, guaranteed to return an ellipse, work with open and closed curves, and return a meaningful number in the interval [0, 1].
References
1. Bookstein, F.L.: Fitting conic sections to scattered data. Computer Graphics and Image Processing 9, 56–71 (1979)
2. Csetverikov, D.: Basic algorithms for digital image analysis, Course, Institute of Informatics, Eotvos Lorand University, visual.ipan.sztaki.hu
3. Davies, E.R.: Algorithms for ultra-fast location of ellipses in digital images. Proc. IEE Image Processing and its Applications 2, 542–546 (1999)
4. Fitzgibbon, A.W., Pilu, M., Fisher, R.B.: Direct least square fitting of ellipses. IEEE T-PAMI 21(5), 476–480 (1999)
5. Koprnicky, M., Ahmed, M., Kamel, M.: Contour description through set operations on dynamic reference shapes. In: Campilho, A., Kamel, M. (eds.) ICIAR 2004. LNCS, vol. 3211, pp. 400–407. Springer, Heidelberg (2004)
6. Proffitt, D.: The Measurement of Circularity and Ellipticity on a Digital Grid. Pattern Recognition 15(5), 383–387 (1982)
7. Peura, M., Iivarinen, J.: Efficiency of simple shape descriptors. In: Arcelli, C., Cordella, L.P., Sanniti di Baja, G. (eds.) Advances in Visual Form Analysis, World Sci. (1997)
8. Rosin, P.L.: Ellipse fitting using orthogonal hyperbolae and Stirling's oval. CVGIP: Graph Models Image Process 60(3), 209–213 (1998)
9. Rosin, P.L.: Further five-point fit ellipse fitting. CVGIP: Graph Models Image Proc. 61(5), 245–259 (1999)
10. Rosin, P.L.: Measuring shape: ellipticity, rectangularity, and triangularity. Machine Vision and Applications 14, 172–184 (2003)
11. Rosin, P.L.: Assessing error of fit functions for ellipses. CVGIP: Graph Models Image Process 58(5), 494–502 (1996)
12. Rothe, I., Süße, H., Voss, K.: The method of normalization to determine invariants. IEEE T-PAMI 18(4), 366–376 (1996)
13. Rosin, P.L., Zunic, J.: 2D shape measures for computer vision. In: Handbook of Applied Algorithms: Solving Scientific, Engineering and Practical Problems, Wiley, Chichester (2007)
14. Stojmenovic, M., Nayak, A., Zunic, J.: Measuring Linearity of a Finite Set of Points. In: IEEE International Conference on Cybernetics and Intelligent Systems (CIS), Bangkok, Thailand, pp. 222–227 (2006)
15. Tuset, V.M., Rosin, P.L., Lombarte, A.: Sagittal otolith shape used in the identification of fishes of the genus Serranus. Fisheries Research 81, 316–325 (2006)
16. Voss, K., Süße, H.: Invariant fitting of planar objects by primitives. IEEE T-PAMI 19, 80–84 (1997)
17. Zhang, S.C., Liu, Z.Q.: A robust, real time ellipse detector. Pattern Recognition 38, 273–287 (2005)
Approximate ESPs on Surfaces of Polytopes Using a Rubberband Algorithm
Fajie Li¹, Reinhard Klette², and Xue Fu³,⁴
¹ Institute for Mathematics and Computing Science, University of Groningen, P.O. Box 800, 9700 AV Groningen, The Netherlands
² Computer Science Department, The University of Auckland, Private Bag 92019, Auckland 1142, New Zealand
³ Faculty of Economics, University of Groningen, P.O. Box 800, 9700 AV Groningen, The Netherlands
⁴ School of Public Finance, Jiangxi University of Finance and Economy, Nanchang, 330013, China
Abstract. Let p and q be two points on the surface of a polytope Π. This paper provides a rubberband algorithm for computing a Euclidean shortest path between p and q (a so-called surface ESP) that is contained on the surface of Π. The algorithm has κ1 (ε)·κ2 (ε)·O(n2 ) time complexity, where n is the number of vertices of Π, κi (ε) = (L0i −Li )/ε, for the true length Li of some shortest path with initial (polygonal path) length L0i (used when approximating this shortest path), for i = 1, 2. Rubberband algorithms follow a straightforward design strategy, and the proposed algorithm is easy to implement and thus of importance for applications, for example, when analyzing 3D objects in 3D image analysis, such as in biomedical or industrial image analysis, using 3D image scanners. Keywords: Rubberband algorithm, Euclidean shortest path, surface ESP.
1 Introduction
Let Π be a connected polyhedral domain such that its frontier is a union of a finite number of triangles. An obstacle is a connected, bounded polyhedral component in the complement R³\Π of Π. Let p, q ∈ Π such that p ≠ q. The general Euclidean shortest-path problem (ESP) asks to find a shortest polygonal path ρ(p, q) which is either completely contained in Π, or just not intersecting any (topological) interior of a finite number of given obstacles. This problem is actually a special case of the problem of planning optimal collision-free paths for a robot system; for its specification and a first result, see [1]. That paper presented, in 1984, a doubly exponential time algorithm for solving the general obstacle avoidance problem. [2] improved this by providing a singly exponential time algorithm. The result was further improved by a PSPACE algorithm in [3]. Since the general ESP problem is known to be NP-hard [4], special cases of the problem have been studied afterwards. [5] gave a polynomial time algorithm for ESP calculations for cases where all obstacles are convex and
the number of obstacles is small. [6] solved the ESP problem with an O(n^{6k−1}) algorithm, assuming that all obstacles are vertical "buildings" with k different values for height. [1] is the first publication considering the special case that the shortest polygonal path ρ(p, q) is constrained to stay on the surface of Π. [1] presented an O(n³ log n) algorithm where Π was assumed to be convex. [7] improved this result by providing an O(n² log n) algorithm for the surface of any bounded polyhedron Π. The time complexity was even reduced to O(n²) in [8]. So far, the best known result for the surface ESP problem is due to [9]; it improved in 1999 the time complexity to O(n log² n), assuming that there are O(n) vertices and edges on Π.
This paper provides a rubberband algorithm (RBA) for computing approximate surface ESPs. The algorithm has κ1(ε)·κ2(ε)·O(n²) time complexity, where n is the number of vertices of Π, and
κi(ε) = (L0i − Li)/ε
for the true length Li of some kind of shortest path, with length L0i of the used initial polygonal path, for i = 1, 2. Although this rubberband algorithm is not the most efficient, it follows a straightforward design strategy, and the proposed algorithm is easy to implement. (See [10] for results on implementing rubberband algorithms for various shortest path problems.) We generalize a rubberband algorithm from solving the 2D ESP of a simple polygon (see [11] for this 2D algorithm) to a solution for the surface ESP of polytopes. Considering the difficulty of the general ESP problem, our approach is very important for applications, e.g., when analyzing 3D objects in 3D image analysis (such as in biomedical or industrial image analysis, using 3D image scanners). For shortest paths on digital surfaces (in the context of 3D picture analysis), also known as geodesics, see the monograph [12]. One of the earlier publications related to the calculation of surface geodesics is [13].
The rest of this paper is organized as follows: Section 2 provides the definitions of some useful notions. Section 3 presents four procedures being subroutines of the Main Algorithm. Section 4 proves the correctness of the Main Algorithm. Section 5 analyses the time complexities of the involved procedures and the Main Algorithm. Section 6 illustrates the main ideas behind the steps of the Main Algorithm by a simple example. Section 7 summarizes the paper.
2 Definitions
Let Π be a polytope (see Figure 1 for an example). Let T = {△1, △2, ..., △m} be a set of triangles such that ∂Π = ∪_{i=1}^{m} △i and △i ∩ △j = ∅, or = eij, or = vij,
Fig. 1. A unit cube within the xyz-coordinate system, where p = (0.76, 0.12, 1), q = (0.9, 0.24, 0)
where eij (vij) is an edge (vertex) of both △i and △j, i ≠ j, respectively, with i, j = 1, 2, ..., m. We construct a corresponding simple graph GΠ = [VΠ, EΠ] where VΠ = {v1, v2, ..., vm}. Each vi represents a triangle. Edges e ∈ EΠ are defined as follows: If △i ∩ △j = eij ≠ ∅, then we have an edge e = vivj (where eij is an edge of both triangles △i and △j); and if △i ∩ △j = ∅ or a vertex, then there is no edge between vi and vj, i < j and i, j = 1, 2, ..., m. In such a case we say that GΠ is a corresponding graph with respect to the triangulated polytope Π. See Figure 2 for an example. Analogously, we can define a corresponding graph for a connected surface segment (a subsurface) of a polytope. Abbreviated, we may also speak about "the graph for a polytope" or "the graph for a subsegment of a surface".
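As a small illustration (our own sketch, using a plain adjacency-set representation rather than any data structure prescribed by the paper), such a corresponding graph can be built from a list of triangles given as vertex-index triples:

```python
from itertools import combinations

def corresponding_graph(triangles):
    """Dual graph: one node per triangle, and an edge whenever two triangles
    share a full edge (not merely a vertex)."""
    edge_owners = {}                          # undirected edge -> triangle ids owning it
    for t, (a, b, c) in enumerate(triangles):
        for e in ((a, b), (b, c), (a, c)):
            edge_owners.setdefault(frozenset(e), []).append(t)
    adj = {t: set() for t in range(len(triangles))}
    for owners in edge_owners.values():
        for t1, t2 in combinations(owners, 2):
            adj[t1].add(t2)
            adj[t2].add(t1)
    return adj
```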
Fig. 2. A (3-regular) corresponding graph of the polytope in Figure 1
A triangulated polytope Π can also be thought of as being a graph such that each vertex of Π is a vertex of this graph, and each edge of a triangle is an edge of this graph. We denote this graph by GΠ. Let p ≠ q, p, q ∈ V(GΠ); if ρ is a cycle of GΠ such that GΠ\ρ has two components, denoted by G1 and G2 with p ∈ V(G1) and q ∈ V(G2), then ρ is called a cut cycle of GΠ or Π. For example, in Figure 1, ABCDA or AFGDA are cut cycles of Π. An approximate cycle is a graph such that it consists of a cycle plus a few more vertices, each of which is of degree one only, and (thus) adjacent to a vertex on the cycle. (The graph later shown in Figure 4 is an approximate cycle.) A band is a subsurface of a polytope Π such that the corresponding graph of it is a cycle or an approximate cycle. A band can also be thought of as being a subgraph of GΠ. Let E be the subset of all the edges of a triangulated band such that each edge belongs to a unique triangle. Then E consists of two cycles. Each of them is called a frontier of the band. For example, in Figure 1, ABCDA and EFGHE are two frontiers of a band whose triangles are perpendicular to the xoy-plane. If two triangulated bands share a common frontier, then they are called continuous (in the sense of "continuation").
3 Algorithms
Without loss of generality, we can assume that p ≠ q and that p and q ∈ V(Π).
3.1 Separation
The following procedure finds a cut cycle to separate p and q such that either p or q is not a vertex of the cut cycle. (This procedure will be used in Step 1 of the Main Algorithm below.)
Procedure 1
Input: GΠ = [V(Π), E(Π)], and two vertices p ≠ q ∈ V(Π).
Output: The set of all vertices of a cycle ρ in GΠ such that, if we cut the surface of Π along ρ into two separated parts, then p and q are on different parts, respectively.
1. Let Np = {v : vp ∈ E(Π)} (i.e., the set of all neighbors of p).
2. Select u, v ∈ Np such that ∠upv ≠ 180°; in other words, uv ∈ E(Π).
3. Let V = {p, v}.
4. Let Nv = {w : wv ∈ E(Π)} (i.e., the set of all neighbors of v).
5. Take a vertex w ∈ Nv \ V.
6. If w = u, then stop. Otherwise, let V = V ∪ {w}, v = w, and go to Step 4.
7. If q ∉ V, then output V. Otherwise, output V \ {q}.
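A direct, heavily simplified transcription of Procedure 1 in Python (assuming an adjacency-set representation `adj` of the vertex-edge graph; the choices in Steps 2 and 5 are made greedily here, whereas an actual implementation may need a more careful selection rule or backtracking):

```python
def cut_cycle(adj, p, q):
    """Sketch of Procedure 1: grow a vertex set starting from a neighbor v of p
    until another chosen neighbor u of p is reached; the visited vertices trace
    a cycle that is meant to separate p from q."""
    Np = adj[p]
    # Step 2: two distinct neighbors of p that are themselves adjacent (assumed to exist)
    u, v = next((a, b) for a in Np for b in Np if a != b and b in adj[a])
    V = [p, v]                                           # Step 3
    while True:
        candidates = [w for w in adj[v] if w not in V]   # Step 5
        if not candidates:
            return None     # dead end: a different choice of u, v or w is needed
        w = candidates[0]
        if w == u:          # Step 6: cycle closed
            break
        V.append(w)
        v = w
    return [x for x in V if x != q]                      # Step 7
```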
For example, in Figure 1, ρ can be either ABCDA or AFGDA, but it cannot be AEHDA.
3.2 Step Set Calculation
The following procedure computes step bands (i.e., the step set for the second-level RBA). It will be used in Step 2 of the Main Algorithm below.
Procedure 2
Input: GΠ = [V(Π), E(Π)], and ρ: the cut cycle obtained with Procedure 1. Without loss of generality, we can assume that p ∈ V(ρ) and q ∉ V(ρ).
Output: The set of step bands S = {B1, B2, ..., Bm} such that p ∈ V(B1) and q ∈ V(Bm).
1. Let S = ∅, Π1 = Π and ρ1 = ρ.
2. While q ∉ V(ρ1), do the following:
2.1. Let Π2 = Π1 − ρ1 such that q ∈ V(Π2). (Note: the "minus" used in graph theory can also be written as Π1\ρ1; in other words, we delete each vertex in ρ1 and each edge of Π1 which is incident with a vertex of ρ1.)
2.2. Let ρ2 be the frontier of Π2.
2.3. With Π1, ρ1 and ρ2 as the input, compute a band B = GΠ1(V(ρ1) ∪ V(ρ2)) (the induced subgraph of GΠ1).
2.4. Update ρ1 and Π1 by letting ρ1 = ρ2 and Π1 = Π2.
2.5. Let S = S ∪ {B}.
3. Output S.
For example, in Figure 1, if a single vertex can be thought of as being a band, then we can have S = {B1, B2, B3}, where B1 = p, B2 is the band such that V(B2) = {A, B, C, D, E, F, G, H}, and B3 = q.
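One way to picture Step 2 is as a breadth-first peeling of frontiers towards q. The sketch below is our illustration on the vertex-edge graph (adjacency sets), not the paper's exact construction; it returns the successive frontiers, and each pair of consecutive frontiers induces one step band.

```python
def frontier_layers(adj, rho, q):
    """Peel the surface graph layer by layer, from the cut cycle rho towards q."""
    removed = set(rho)
    comp, stack = {q}, [q]        # connected component of q in Pi - rho (i.e. Pi_2)
    while stack:
        for w in adj[stack.pop()]:
            if w not in removed and w not in comp:
                comp.add(w)
                stack.append(w)
    layers = [set(rho)]
    while q not in layers[-1]:
        nxt = {w for v in layers[-1] for w in adj[v] if w in comp}
        if not nxt:               # should not happen on a connected surface
            break
        layers.append(nxt)        # next frontier (rho_2 in Step 2.2)
        comp -= nxt               # shrink Pi_2 before the next peeling step
    return layers                 # consecutive layers induce the step bands
```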
3.3 Step Segments in a Single Band
The following procedure computes step segments in a single band (i.e., a subset of the step set for the initialization of the RBA). It will be used in Step 1.1 of Procedure 4 below.
Procedure 3
Input: The triangulated band B and two vertices u, v ∈ V(B) such that u and v are on two different frontiers of B, denoted by ρ1 and ρ2 (i.e., u ∈ V(ρ1) and v ∈ V(ρ2)).
Output: Two step sets of segments (edges), S1 and S2, such that either S1 or S2 contains the vertices of a surface ESP of B from u to v.
Let △u, △v be the triangles such that u ∈ ∂△u and v ∈ ∂△v, respectively. Let wu and wv ∈ V(GB) be the vertices corresponding to △u and △v, respectively.
Fig. 3. A unit cube such that u = p, and v is the center of EF
By the definition of a band (see Section 2), there is a cycle, denoted by ρB, such that either wu (respectively, wv) ∈ V(ρB) or the unique neighbor of wu (respectively, wv) is in V(ρB). For example, in Figure 3, the frontier of B consists of the two cycles uABCDu and EFGHE. We have that △u = △pDA, △v = △AEF, S1 = {AD, AE} and S2 = {DA, DE, DH, DG, CG, CF, BF, AF}.
Case 1: Both wu and wv are in V(ρB). In this case, ρB can be decomposed into two paths from wu to wv, denoted by P1 and P2. Let {△1, △2, ..., △m1} be the sequence of triangles corresponding to the sequence of the vertices of P1. Let {e1, e2, ..., em1−1} be a sequence of edges such that ei = △i ∩ △i+1, where i = 1, 2, ..., m1 − 1. Let {e′1, e′2, ..., e′m1−1} be a sequence of edges such that e′i is obtained by removing a sufficiently small segment (assume that the length of the removed segment equals δ) from both endpoints of ei, where i = 1, 2, ..., m1 − 1. The set {e′1, e′2, ..., e′m1−1} is the approximate step set we are looking for.
Case 2: Neither wu nor wv is in V(ρB). Again, by the definition of a band (see Section 2), let w′u (w′v) be the unique neighbor of wu (wv); note that wu and wv ∉ V(ρB). In this case, ρB can be decomposed into two paths from w′u to w′v, denoted by P1 and P2. Appending wu and wv to both ends of P1 and P2, we obtain
Fig. 4. The corresponding graph with respect to B; the two frontiers of B are pABCDp and EFGHE in Figure 3. v9 corresponds to △pDA, and v2 corresponds to △AEF.
two paths, denoted by P′1 and P′2. Analogously to Case 1, we can compute the approximate step set.
Case 3: Only one of wu and wv is not in V(ρB). We can compute the approximate step set analogously to Cases 1 and 2.
3.4 Initializations
The following procedure is the initialization procedure of the RBA. It will be used in Steps 7.2 and 8.2 of the Main Algorithm below.
Procedure 4
Input: Two continuous triangulated bands B1 and B2, and three vertices u1, u2 and u3, all three in V(B1 ∪ B2), such that u1 and u2 are on two different frontiers of B1, denoted by ρ1 and ρ2; u3 is on the frontier, denoted by ρ3 (≠ ρ2), of B2.
Output: The set of vertices of an approximate ESP on the surface of B1 ∪ B2, from u1 to u3.
Let eu2 ∈ E(ρ2) such that u2 ∈ eu2; let l be a sufficiently large integer; and let E = E(ρ2).
1. While E ≠ ∅, do the following:
1.1. Let GBi and ui, ui+1 be the input; apply Procedure 3 to compute step segments in band Bi, denoted by SBi, where i = 1, 2.
1.2. Let S12 = SB1 ∪ SB2 be the input; apply Algorithm 1 of [11] to compute an approximate ESP on the surface of B1 ∪ B2. This is denoted by ρeu2, and it connects u1 with u3.¹
1.3. Let l(ρeu2) denote the length of ρeu2.
1.4.1. If l(ρeu2) < l, then let l = l(ρeu2) and V = V(ρeu2).
1.4.2. If l(ρeu2) = l, then let V = minlexi{V, V(ρeu2)} (minimum with respect to lexicographic order).
1.5. Let E = E\{eu2}.
1.6. Take an edge e ∈ E and let u2 be one endpoint of e; let eu2 = e; go to Step 1.1.
2. Output V.
3.5 The Main Algorithm
The main algorithm now defines the iteration steps of the RBA.
Input: GΠ = [V(Π), E(Π)], two vertices p ≠ q, p, q ∈ V(Π), and an accuracy constant ε.
¹ Note that Algorithm 1 of [11] still works when the line segments in the step set are in 3D space.
Output: The set of vertices of an approximate ESP on the surface of Π.
1. Let GΠ, p and q be the input; apply Procedure 1 to compute a cut cycle which separates p and q, denoted by ρpq.
2. Let GΠ and ρpq be the input; apply Procedure 2 to compute step bands S = {B1, B2, ..., Bm} such that p ∈ V(B1) and q ∈ V(Bm).
3. Let pi be a point on the frontier of Bi, where i = 1, 2, ..., m, p = p1 and q = pm. We obtain an initial path ρ = <p1, ..., p2, ..., pm> [note: it is very likely that there exist further points between pi and pi+1!]. The following steps are modified from Algorithm 1 of [11] (note: only Steps 7.2 and 8.2 are modified!).
4. Let ε = 10⁻¹⁰ (the chosen accuracy).
5. Compute the length L1 of the initial path ρ.
6. Let q1 = p and i = 1.
7. While i < k − 1, do:
7.1. Let q3 = pi+1.
7.2. Apply Procedure 4 to compute a point q2 on the frontier of Bi such that q2 is a vertex of an approximate ESP on the surface of Bi−1 ∪ Bi from qi−1 to qi+1.
7.3. Update ρ by replacing pi by q2 [possibly also by some additional points between pi−1 and pi, and between pi and pi+1!].
7.4. Let q1 = pi and i = i + 1.
8.1. Let q3 = q.
8.2. Apply Procedure 4 to compute a point q2 on the frontier of Bm such that q2 is a vertex of an approximate ESP on the surface of Bm−1 ∪ Bm, from qm−1 to qm+1.
8.3. Update ρ by replacing pk by q2 [note: possibly also by additional points between pm−1 and pm, and between pm and pm+1!].
9. Compute the length L2 of the updated path ρ.
10. Let δ = L1 − L2.
11. If δ > ε, then let L1 = L2 and go to Step 7. Otherwise, stop.
We provide a proof of correctness, an analysis of run-time complexity, and an example for this algorithm. It is basically another illustration of the general comments (e.g., in [14,10]) that the basic idea of rubberband algorithms may be applied efficiently to a diversity of shortest path problems.
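Steps 7–11 repeatedly improve the path by local moves, in the spirit of Algorithm 1 of [11]. The following generic sketch is our simplification, with hypothetical inputs (a list of 3D step segments, one per intermediate path vertex): each vertex is moved along its own segment so as to shorten the two incident path edges, and sweeps repeat until the improvement per sweep drops below ε.

```python
import numpy as np

def rubberband(p, q, segments, eps=1e-10):
    """Generic rubberband relaxation: 'segments' holds one 3D segment (a, b)
    per intermediate path vertex, ordered from p to q; each vertex stays on
    its segment and is moved to shorten the two incident path edges."""
    segs = [(np.asarray(a, float), np.asarray(b, float)) for a, b in segments]
    p, q = np.asarray(p, float), np.asarray(q, float)
    pts = [a.copy() for a, _ in segs]                     # initial path vertices

    def place(a, b, left, right):
        # minimize |left - x| + |x - right| over x = a + t (b - a), t in [0, 1];
        # the objective is convex in t, so a ternary search is sufficient
        f = lambda t: (np.linalg.norm(left - (a + t * (b - a))) +
                       np.linalg.norm(right - (a + t * (b - a))))
        lo, hi = 0.0, 1.0
        for _ in range(60):
            m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
            lo, hi = (lo, m2) if f(m1) < f(m2) else (m1, hi)
        return a + 0.5 * (lo + hi) * (b - a)

    length = lambda: sum(np.linalg.norm(x - y)
                         for x, y in zip([p] + pts, pts + [q]))
    L1 = length()
    while True:
        for i, (a, b) in enumerate(segs):
            left = p if i == 0 else pts[i - 1]
            right = q if i == len(pts) - 1 else pts[i + 1]
            pts[i] = place(a, b, left, right)
        L2 = length()
        if L1 - L2 <= eps:
            return [p] + pts + [q]
        L1 = L2
```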
4 Proof of Correctness
Theorem 1. The approximate path computed by the Main Algorithm is an approximate ESP on the surface of Π.
Proof. Let {B1, B2, ..., Bm} be the step bands computed by Step 2 of the Main Algorithm. Let ρi = Bi ∩ Bi+1, where i = 1, 2, ..., m − 1. For each point pi ∈ ρi, where i = 1, 2, ..., m − 1, the length of the surface path ρ = <p1, ..., p2, ..., pm> is a continuous function defined on ρ1 · ρ2 · ... · ρm−1, denoted by ∏_{i=1}^{m−1} ρi. Since each ρi is a closed set, where i = 1, 2, ..., m − 1, ∏_{i=1}^{m−1} ρi is a closed set as well. Since ρ is continuous, for each ε > 0 and for each P = (p1, p2, ..., pm−1) ∈ ∏_{i=1}^{m−1} ρi, there exists a δε > 0 such that, for each P′ ∈ U(P, δε), the difference between the lengths (i.e., of the outputs) of the Main Algorithm by using either P or P′ as an initial path is not more than ε. We can now construct an open cover of ∏_{i=1}^{m−1} ρi as follows:
Oε = {U(P, δε) : P ∈ ∏_{i=1}^{m−1} ρi}
By the finite cover principle of mathematical analysis, there exists a finite subcover of Oε which covers ∏_{i=1}^{m−1} ρi. This implies that the number of approximate ESPs obtained by the Main Algorithm is finite. In analogy to the proof of Lemma 24 of [14], the number of approximate ESPs obtained by the Main Algorithm is only one. This proves the theorem.
5 Time Complexity
This section analyses, step by step, the time complexity for each of the procedures and the Main Algorithm as presented in the previous section. Lemma 1. Procedure 1 can be computed in time O(|V (Π)|2 ). Proof. In our data structure we identify adjacent vertices for each vertex; so Steps 1 and 4 can be computed in time O(|V (Π)|). Step 2 can be computed in time O(|Np |). Step 3 can be computed in time O(1). Step 5 can be computed in time O(|Nv |). Step 6 can be computed in time O(1). The loop, from Step 4 to Step 6, is computed in time O(|V (Π)|2 ). Step 7 can be computed in time O(|V |). Therefore, Procedure 1 can be computed in time O(|V (Π)|2 ). Lemma 2. Procedure 2 can be computed in time O(|Π1 |2 ). Proof. Step 1 can be computed in time O(1). The test in Step 2 can be computed in time O(|V (ρ1 )|). Step 2.1 can be computed in time O(|V (Π1 )|). Step 2.2 can be computed in time O(|V (Π2 )|). Step 2.3 can be computed in time O(|V (Π1 )|). Steps 2.4 and 2.5 can be computed in time O(1). The loop, from Step 2 to Step 2.5, is computed in time O(|Π1 | · |V (ρ1 )|) ≤ O(|Π1 |2 ). Step 3 can be computed in time O(|S|). Therefore, Procedure 2 can be computed in time O(|Π1 |2 ).
The following lemma is obvious.
Lemma 3. Procedure 3 can be computed in time O(|V(GB)|).
Lemma 4. Procedure 4 has time complexity κ1 · O(|V(ρ2)| · |V(B1 ∪ B2)|).
Proof. By Lemma 3, Step 1.1 can be computed in time O(|V(Bi)|), where i = 1, 2. By Theorem 1.4 of [11], Step 1.2 has time complexity κ1 · O(|V(ρ2)| · |V(B1 ∪ B2)|), where κ1 = (L0 − L)/ε, ε is the accuracy, and L0 and L are the lengths of the initial and true path, respectively. Step 1.3 can be computed in time O(|V(ρeu2)|). Step 1.4.1 can be computed in time O(1). Step 1.4.2 can be computed in time O(|V(B1 ∪ B2)|). Step 1.5 can be computed in time O(|V(ρ2)|). Step 1.6 can be computed in time O(1). The loop, from Step 1.1 to 1.6, can be computed in time κ1 · O(|V(ρ2)| · |V(B1 ∪ B2)|). Step 2 can be computed in time O(|V|). Therefore, Procedure 4 can be computed in time κ1 · O(|V(ρ2)| · |V(B1 ∪ B2)|).
Theorem 2. The Main Algorithm has time complexity κ1 · κ2 · O(|V(GΠ)|²).
Proof. By Lemma 1, Step 1 can be computed in time O(|V(GΠ)|²). According to Lemma 2, Step 2 can be computed in time O(|V(GΠ)|²). Step 3 can be computed in time O(|V(ρ)|). Steps 4, 6, 10 and 11 can be computed in time O(1). Steps 5 and 9 can be computed in time O(|V(ρ)|). Steps 7.1, 7.4 and 8.1 can be computed in time O(1). By Lemma 4, Steps 7.2 and 8.2 can be computed in time κ1 · O(|V(ρj2)| · |V(Bi−1 ∪ Bi)|), where j = i, m. Steps 7.3 and 8.3 can be computed in time O(|V(ρ)|). The loop, from Step 7 to 11, can be computed in time κ1 · κ2 · O(|V(GΠ)|²). Therefore, the Main Algorithm can be computed in time κ1 · κ2 · O(|V(GΠ)|²).
6 An Example
The following example illustrates the steps of the Main Algorithm. Let Π be the unit cube in Figure 5. Step 1 computes a cut cycle (which may not be uniquely defined) ρpq = ABCDA. Step 2 computes step bands S = {B1, B2, B3}, where B1 = p, B2's frontiers are the two cycles ABCDA and EFGHE, and B3 = q. Step 3 decides that we use pIJq as an initial surface path from p to q (see Figure 5).
Fig. 5. A unit cube within an xyz-coordinate system, where p = (0.76, 0.001, 1), q = (0.999, 0.001, 0). pIJq is an initial surface path from p to q, while pMLKq is an approximate surface ESP from p to q, where I ∈ AB, J, K ∈ EF, L ∈ AE and M ∈ AD.
In Step 7.2, the algorithm applies Procedure 4 (the initialization procedure of the RBA) and searches each edge of the polygon ABCDA; it finds a point M ∈ AD to update the initial point I, and it also inserts a new point L ∈ AE into the segment between M and J. Analogously, in Step 8.2, the algorithm searches each edge of the polygon EFGHE and finds a point K ∈ EF for updating the initial point J; it also updates point L ∈ AE by a point L′ ∈ AE which is between M and K. The algorithm iterates (note: the iteration steps are defined in the Main Algorithm) until the required accuracy is reached.
7 Conclusions
The paper presented a rubberband algorithm for computing an approximate surface ESP of a polytope. Although it is not the most efficient, it follows a straightforward design strategy and is thus easy to implement. This algorithm generalized a rubberband algorithm designed for solving a 2D ESP of a simple polygon (see [11]) to one which solves the surface ESP of polytopes. This approach is a contribution towards the exploration of efficient approximate algorithms for solving the general ESP problem. This will allow more detailed studies of computer-represented surfaces as typical (e.g.) in biomedical or industrial 3D image analysis.
Acknowledgement. The authors thank the PSIVT reviewers whose comments have been very helpful for revising an earlier version of this paper.
References
1. Sharir, M., Schorr, A.: On shortest paths in polyhedral spaces. SIAM J. Computation 15, 193–215 (1986)
2. Reif, J.H., Storer, J.A.: A single-exponential upper bound for shortest paths in three dimensions. J. ACM 41, 1013–1019 (1994)
3. Canny, J., Reif, J.H.: Some algebraic and geometric configurations in PSPACE, 460–467 (1988)
4. Canny, J., Reif, J.: New lower bound techniques for robot motion planning problems, 49–60 (1987)
5. Sharir, M.: On shortest paths amidst convex polyhedra. SIAM J. Computation 16, 561–572 (1987)
6. Gewali, L.P., Ntafos, S., Tollis, I.G.: Path planning in the presence of vertical obstacles. Technical report, Computer Science, University of Texas at Dallas (1989)
7. Mitchell, J.S.B., Mount, D.M., Papadimitriou, C.H.: The discrete geodesic problem. SIAM J. Computation 16, 647–668 (1987)
8. Chen, J., Han, Y.: Shortest paths on a polyhedron, 360–369 (1990)
9. Kapoor, S.: Efficient computation of geodesic shortest paths. In: Proc. ACM Symp. Theory Computation, vol. 1, pp. 770–779 (1999)
10. Li, F., Klette, R.: Rubberband algorithms for solving various 2D or 3D shortest path problems. In: Proc. Computing: Theory and Applications, Platinum Jubilee Conference of The Indian Statistical Institute, pp. 9–18. IEEE, Los Alamitos (2007)
11. Li, F., Klette, R.: Euclidean shortest paths in simple polygons. Technical report CITR-202, Computer Science Department, The University of Auckland, Auckland (2007), http://www.citr.auckland.ac.nz/techreports/2007/CITR-TR-202.pdf
12. Klette, R., Rosenfeld, A.: Digital Geometry. Morgan Kaufmann, San Francisco (2004)
13. Kiryati, N., Szekely, G.: Estimating shortest paths and minimal distances on digitized three dimensional surfaces. Pattern Recognition 26, 1623–1637 (1993)
14. Li, F., Klette, R.: Exact and approximate algorithms for the calculation of shortest paths. Report 2141, IMA, The University of Minnesota, Minneapolis (2006)
Sub-grid Detection in DNA Microarray Images
Luis Rueda
Department of Computer Science, University of Concepción, Edmundo Larenas 215, Concepción, 4030000, Chile
Phone: +56 41 220-4305, Fax: +56 41 222-1770
[email protected]
Abstract. Analysis of DNA microarray images is a crucial step in gene expression analysis, as it influences the whole process for obtaining biological conclusions. When processing the underlying images, accurately separating the subgrids is of supreme importance for subsequent steps. A method for separating the sub-grids is proposed, which aims to first, detect rotations in the images independently for the x and y axes, corrected by an affine transformation, and second, separate the corresponding sub-grids in the corrected image. Extensive experiments performed in various real-life microarray images from different sources show that the proposed method effectively detects and corrects the underlying rotations and accurately finds the sub-grid separations. Keywords: Microarray image gridding, image analysis, image feature and detectors.
1 Introduction
One of the most important technologies used in molecular biology is microarrays. They constitute a way to monitor gene expression in cells and organisms under specific conditions, and have many applications. Microarrays are produced on a chip (slide) on which DNA extracted from a tissue is hybridized with the one on the slide, typically in two channels. The slide is then scanned at a very high resolution, generating an image composed of sub-grids of spots (in the case of cDNA microarrays) [1,2]. Image processing and analysis are two important aspects of microarrays, since the aim of obtaining meaningful biological conclusions depends on how well these stages are performed. Moreover, many tasks are carried out sequentially, including gridding [3,4,5,6,7], segmentation [8,9], quantification [2], normalization and data mining [1]. An error in any of these stages is propagated to the rest of the process. When producing DNA microarrays, many parameters are specified, such as the number and size of spots, number of sub-grids, and even their exact location. However, many physical-chemical factors produce so much noise, misalignment, and even deformation in the sub-grid template that it is virtually impossible to know the exact location of the spots after scanning is performed, at least with the current technology. The first stage in the analysis is to find the location of the sub-grids (or gridding), which is the focus of our paper. Roughly speaking, gridding consists of determining the spot locations in a microarray image (typically, in a sub-grid). The problem, however, is that microarray images are divided in sub-grids,
which is done to facilitate locating the spots. While quite a few works have been done on locating the spots in a sub-grid, they all assume the sub-grids are known, and this is the problem considered in this paper, more formally stated as follows. Consider an image (matrix) A = {aij}, i = 1, ..., n and j = 1, ..., m, where aij ∈ Z+, and A is a sub-grid of a cDNA microarray image¹ [1] (usually, aij is in the range [0..65,535] in a TIFF image). In what follows, we use A(x, y) to denote aij. The aim is to obtain a matrix G (grid) where G = {gij}, i = 1, ..., n and j = 1, ..., m, gij = 0 or gij = 1 (a binary image), with 0 meaning that gij belongs to the grid. This image could be thought of as a "free-form" grid. However, in order to strictly use the definition of a "grid", our aim is to obtain vectors v and h, v = [v1, ..., vm]t, h = [h1, ..., hn]t, where vi ∈ [1, m] and hj ∈ [1, n]. The vertical and horizontal vectors are used to separate the sub-grids. As seen later, rotation correction facilitates finding this template.
Many approaches have been proposed for spot detection and image segmentation, which basically assume that the sub-grids are already identified. That is, they typically work on a sub-grid, rather than on the entire microarray image. The Markov random field (MRF) is a well-known approach that applies different application-specific constraints and heuristic criteria [3,10]. Another gridding method is mathematical morphology, which represents the image as a function and applies erosion operators and morphological filters to transform it into other images, resulting in shrinkage and area opening of the image, which further helps in removing peaks and ridges from the topological surface of the images [11]. Jain's [8], Katzer's [12], and Stienfath's [13] models are integrated systems for microarray gridding and quantitative analysis. A method for detecting spot locations based on a Bayesian model has been recently proposed; it uses a deformable template model to fit the grid of spots in such a template using a posterior probability model which learns its parameters by means of a simulated-annealing-based algorithm [3,5]. Another method for finding spot locations uses a hill-climbing approach to maximize the energy, seen as the intensities of the spots, which are fit to different probabilistic models [7]. Fitting the image to a mixture of Gaussians is another technique that has been applied to gridding microarray images by considering radial and perspective distortions [6]. All these approaches, though efficient, assume the sub-grids have already been identified, and hence they proceed on a single sub-grid, which has to be specified by the user. A method used for gridding that does not rely on this assumption has been proposed in [14,15]. It performs a series of steps, including rotation detection based on a simple method that compares the running sum of the topmost and bottommost parts of the image. It performs rotations locally and applies morphological opening to find sub-grids. This method, which detects rotation angles wrt one of the axes, either x or y, has not been tested on images having regions with high noise (e.g., when the bottommost 1/3 of the image is quite noisy).
In this paper, we focus on automatically detecting the sub-grids given the entire microarray image. The method proposed here uses the well-known Radon transform and an information-theoretic measure to detect rotations (wrt the x and y axes), which
¹ The aim is to apply this method to a microarray image that contains a template of rows and columns of sub-grids.
are corrected by an affine transformation. Once corrected, the image is passed through a mathematical-morphology approach to detect valleys, which are then used to separate the sub-grids. Section 2 discusses the details of the proposed method, while Section 3 presents the experiments on real-life images, followed by the conclusions to the paper.
2 Sub-grid Detection
The proposed sub-grid detection method aims to first correct any rotation of the image by means of the Radon transform [4,16]. After this, the (x- or y-axis) running sum of pixels is passed through morphological operators to reduce noise, and the detected valleys then denote the separation between sub-grids. Note, however, that in order to process the microarray image in subsequent steps (i.e., gridding and segmentation), the image does not necessarily have to be rotated. Although the method proposed herein performs the rotation correction, this can be avoided by generating the horizontal and vertical lines, v and h, for the corrected image, applying the inverse affine transformation to v and h, and obtaining the sub-grids in the original image, hence not degrading the quality of the image.
Rotations of the image are considered in two different directions, wrt the x and y axes, with the aim of finding two independent angles of rotation for an affine transformation, and for this the Radon transform is applied. Roughly speaking, the Radon transform, which can be seen as a particular case of the Hough transform, is the integral of an n-dimensional function over all types of functions of dimension smaller than n. In two dimensions, as in the case of images, the Radon transform is the integral of the 2D function over all types of real-valued functions, e.g., polynomials, exponentials, etc. In particular, when the latter function is a line, the Radon transform can be seen as the projection of the two-variable function (the integral) over a line having a direction (slope) and a displacement (wrt the origin of the coordinate system); this is the case considered in this work. The Radon transform has been found quite useful in the past few decades in many applications, including medicine (for computed axial tomography, or CAT), geology, and other fields of science. For two-dimensional functions projected onto lines, it works as follows. Given an image A(x, y), the Radon transform performs the following transformation:
R(p, t) = ∫_{−∞}^{+∞} A(x, t + px) dx ,   (1)
where p is the slope and t its intercept. The rotation angle of the image with respect to the slope p is given by φ = arctan p. For the sake of notation, R(φ, t) is used to denote the Radon transform of image A. Each rotation angle φ gives a different one-dimensional function, and the aim is to obtain the angle that gives the best alignment with the rows and columns. This will occur when the rows and columns are parallel to the x- or y-axis. There are many ways to state this as an optimization problem, and different objective functions have been used (cf. [3]). In this work, an entropy-based function is used. Assuming the sub-grids are (or should be²) aligned wrt the y-axis (and x-axis),
² The aim is to detect the correct alignment. While the assumption made here is to formalize the model, such an alignment is indeed what is found in the proposed approach.
the one-dimensional projected function will show well-accentuated peaks, each corresponding to a column (row) of spots, and deep valleys corresponding to the background separating the spots and sub-grids. Assuming the experimental setup in the microarray fabrication considers a reasonable separation between the sub-grids (otherwise it would not be possible to detect such a separation), deeper and wider valleys are expected between sub-grids, and these are then used to detect the corresponding sub-grids. To compute the entropy function, the R(φ, t) function is normalized and renamed R′(φ, t), such that Σt R′(φ, t) = 1. The best alignment will thus occur at the angle φmin that minimizes the entropy as follows:
H(φ) = − ∫_{−∞}^{+∞} R′(φ, t) log R′(φ, t) dt .   (2)
One problem the entropy function has, however, is that, depending on the rotation angle φ, the sides of the one-dimensional function tend to diminish the "uniformity" of the function, and hence bias the entropy measure. This occurs when φ is near π/4. Since only reasonably small rotations are expected to occur, small angles are considered (no more than 10 degrees of rotation). Also, the resulting signal function is on a discrete domain, i.e., φ takes discrete values, and hence the entropy is computed as follows:
H(φ) = − Σ_{t=−∞}^{+∞} R′(φ, t) log R′(φ, t) .   (3)
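A compact sketch of this angle search using scikit-image's radon function (an illustration under our own assumptions; the exact correspondence between the Radon angle parameter and the image axes depends on the library convention, and only the relative offset matters here):

```python
import numpy as np
from skimage.transform import radon

def best_rotation_angle(image, max_angle=5.0, step=0.1):
    """Score each candidate angle by the entropy of the normalized Radon
    projection and return the offset with minimal entropy (best alignment)."""
    offsets = np.arange(-max_angle, max_angle + step, step)
    # probe a small window of angles around a nominal 90 degrees (library-dependent axis)
    sinogram = radon(image.astype(float), theta=90.0 + offsets, circle=False)
    entropies = []
    for j in range(sinogram.shape[1]):
        r = sinogram[:, j]
        r = r / r.sum()                    # normalize: sum_t R'(phi, t) = 1
        r = r[r > 0]                       # avoid log(0)
        entropies.append(-np.sum(r * np.log(r)))
    return offsets[int(np.argmin(entropies))]
```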
Note that R(φ, t) is normalized into R′(φ, t), such that Σt R′(φ, t) = 1. The image is checked for rotations in both directions, wrt the x and y axes, obtaining two different angles of rotation, φminx and φminy respectively. The positions of the pixels in the new image, [u v], are obtained as follows:
[u v] = [x y 1] T ,   (4)
where T is the following 3 × 2 matrix:
    ⎡ α    β  ⎤
T = ⎢ β    α  ⎥   (5)
    ⎣ γ1   γ2 ⎦
The first two parameters, α and β, are given by the best angles of rotation found by the Radon transform, φminx and φminy, and are computed as follows:
α = s cos φminx   (6)
β = s sin φminy   (7)
where s is a scaling factor (in this work, s is set to 1), and γ1 and γ2 are translation factors, which are set to 0. The transformed image, A′, is reconstructed by interpolating the intensities of pixels x and y and their neighbors; in this work, bicubic interpolation is used.
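The correction itself can be sketched, for instance, with scipy's affine_transform; the fragment below builds the 2×2 part of T from the two detected angles in the spirit of (5)–(7), with s = 1 and no translation. It is an illustration, not the implementation used in the paper.

```python
import numpy as np
from scipy.ndimage import affine_transform

def correct_rotation(image, phi_x_deg, phi_y_deg, s=1.0):
    """Resample the image with the 2x2 part of the transform in (5):
    [[alpha, beta], [beta, alpha]], alpha = s*cos(phi_x), beta = s*sin(phi_y)."""
    alpha = s * np.cos(np.deg2rad(phi_x_deg))
    beta = s * np.sin(np.deg2rad(phi_y_deg))
    M = np.array([[alpha, beta],
                  [beta, alpha]])
    # scipy's affine_transform maps output coordinates back to input coordinates,
    # so the inverse of M is supplied; order=3 gives cubic (bicubic-like) interpolation
    return affine_transform(image.astype(float), np.linalg.inv(M), order=3)
```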
(a) Original.
(b) Transformed.
(c) Entropy function for rotation angles φ between −5 and 5, wrt the y-axis.
Fig. 1. A sample DNA microarray image (AT-20385-ch1) drawn from the Stanford microarray database, along with the transformed (by means of the affine transformation) image, and the entropy function wrt the y-axis
A sample image from the Stanford microarray database is shown in Fig. 1(a), namely image AT-20385-ch1. This image has been reduced in size, and the whole image can
(a) Original sx(x) function.   (b) Resulting sx(x).
Fig. 2. The original running sum function, sx (x), before and after applying the morphological operators, for image AT-20385-ch1
be found in the database³. This image contains 48 sub-grids arranged in 12 rows and 4 columns. This image is rotated −0.8 degrees wrt the y-axis and 1.5 degrees wrt the x-axis (the latter not easily visualized by eye). These rotations are accurately detected by means of horizontal and vertical Radon transforms, which are performed independently, and the resulting image after applying the affine transformation as per (4) is shown in Fig. 1(b). Fig. 1(c) depicts the entropy function as per (3) for all angles φ between −5 and 5 degrees wrt the y-axis. The global minimum at φminy = −0.8 is clearly visible in the plot.
The next step consists of finding the lines separating the sub-grids. For this, it is assumed that the angles that give the optimal affine transformation are φminx and φminy, and the "transformed" image is A′. To detect the vertical lines⁴, the running sum of pixel intensities of A′ is computed over all values of y, obtaining the function sx(x) = Σy A′(x, y). To detect the lines separating the sub-grids, the n deepest and widest valleys are found, where n is the number of columns of sub-grids (a parameter given by the user). The function sx(x) is passed through morphological operators (dilation, sx(x) ⊕ b, followed by erosion, sx(x) ⊖ b, with b = [0, 1, 1, 1, 1, 0]) in order to remove noisy peaks. After this, n potential centers for the sub-grids are found by dividing the range of sx(x) into nearly equal parts. Since sx(x) contains many peaks (each representing a spot), and the aim is to detect sub-grids, the function is passed again through morphological operators (dilation, sx(x) ⊕ b, followed by erosion, sx(x) ⊖ b), where b depends on the spot width in pixels (scanning resolution), and is computed as follows. The number of pixels p for each spot is found by means of a "hill-climbing" procedure that finds the lowest valleys (in this paper, six are found) around the potential centers for each sub-grid. Averaging the distances between the valleys found gives a resolution r (width of each spot), and the morphological operand b is set as follows: b = [0 1r 0],
4
The full image is electronically available at smd.stanford.edu, in category “Hormone treatment”, subcategory “Transcription factors”, experiment ID “20385”, channel “1”. The details for detecting the horizontal lines are similar and omitted to avoid repetition.
where 1r is a run of r ones. Once the morphological operators are applied, the lines separating the sub-grids are obtained as the centers of the deepest and widest valleys between the potential centers found previously. Fig. 2 shows the running sum of pixel intensities along the x-axis for image AT-20385-ch1. The original sx(x) function is plotted in Fig. 2(a); it contains many sharp peaks corresponding to each column of spots, which makes it difficult to detect the separation between grids (the widest and deepest valleys). The resulting function after applying the morphological operators is depicted in Fig. 2(b), in which the sharp peaks tend to "disappear", while the deepest valleys are preserved. The three deepest and widest valleys, which can be easily visualized by eye, correspond to the three lines separating the four columns of sub-grids. Note that it is not difficult to detect these three valleys even though the image does not clearly show the separation between columns of sub-grids.
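The smoothing and valley search can be sketched with 1-D grey-scale morphology (an illustration of the described steps; the valley-selection rule is simplified here, and the hill-climbing estimate of the spot width r is assumed to be given):

```python
import numpy as np
from scipy.ndimage import grey_dilation, grey_erosion

def subgrid_separators(image, n, r):
    """Column-wise running sum, then dilation followed by erosion (a closing)
    with a flat structuring element of roughly one spot width r, and finally
    the deepest valley between each pair of tentative sub-grid centers."""
    sx = image.sum(axis=0).astype(float)                 # s_x(x) = sum_y A'(x, y)
    closed = grey_erosion(grey_dilation(sx, size=r), size=r)
    # tentative sub-grid centers: split the x-range into n nearly equal parts
    centers = [(2 * k + 1) * len(sx) // (2 * n) for k in range(n)]
    separators = []
    for c1, c2 in zip(centers[:-1], centers[1:]):
        valley = c1 + int(np.argmin(closed[c1:c2]))      # deepest point between centers
        separators.append(valley)
    return separators
```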
3 Experimental Results
For the experiments, two different kinds of cDNA microarray images have been used. The images have been selected from different sources, and have different scanning resolutions, in order to study the flexibility of the proposed method to detect sub-grids under different spot sizes. The first set of images has been drawn from the Stanford Microarray Database (SMD), and corresponds to a study of the global transcriptional factors for hormone treatment of Arabidopsis thaliana⁵ samples. Ten images were selected for testing the proposed method, and they correspond to channels 1 and 2 of experiment IDs 20385, 20387, 20391, 20392 and 20395. The images used for testing are listed in Table 1. The images have been named using AT (which stands for Arabidopsis thaliana), followed by the experiment ID and the channel number (1 or 2). The images have a resolution of 1910 × 5550 pixels and are in TIFF format. The spot resolution is 24 × 24 pixels per spot, and the separation between sub-grid columns is about 40 pixels, which is very low. Each image contains 48 sub-grids, arranged in 12 rows and 4 columns. Also listed in the table are, for each image, the angles of rotation wrt the x and y axes, φminx and φminy respectively, found by minimizing (3). The last column lists the accuracy as a percentage, which represents the number of sub-grids correctly detected.
All the images are rotated with respect to both the x and y axes. Also, the angles of rotation are different for the two axes, x and y, for all the images. These rotations are detected and corrected by the proposed method. Note that even when the angles of rotation are small, e.g., 0.5 and 0.2 for AT-20395 ch1 and ch2, it is critical to detect these angles and correct them, since the resolution of the images is very high and a small angle variation will produce a displacement of a vertical line by a large number of pixels. For example, a rotation angle of 0.8 degrees wrt the y-axis will produce a displacement of 25 pixels (for images AT-20385 ch1 and ch2). Since the separation between sub-grids is about 40 pixels, it is quite difficult, though possible, to detect the vertical lines separating the sub-grids, while after detecting and correcting the rotations,
⁵ The images can be downloaded from smd.stanford.edu, by searching "Hormone treatment" as category and "Transcription factors" as subcategory.
Table 1. Test images drawn from the SMD, angles of rotation and percentage of sub-grids detected

Image          φminx   φminy   Accuracy
AT-20385-ch1   1.5     -0.8    100%
AT-20385-ch2   1.5     -0.8    100%
AT-20387-ch1   0.8     -0.1    100%
AT-20387-ch2   0.8     -0.1    100%
AT-20391-ch1   0.9     -0.2    100%
AT-20391-ch2   0.9     -0.2    100%
AT-20392-ch1   1.0     -0.2    100%
AT-20392-ch2   1.0     -0.2    100%
AT-20395-ch1   0.5      0.2    100%
AT-20395-ch2   0.5      0.2    100%
it is rather easy to separate the sub-grids – this is reflected in the 100% accuracy the method yields on all the images of the SMD.
To observe visually how the method performs, Figs. 3 and 4 show two images, AT-20385-ch1 and AT-20387-ch1, in their original form, and the resulting images obtained after applying the proposed method (Figs. 3(b) and 4(b)). For AT-20385-ch1, the rotation wrt the y-axis is clearly visible in the picture, and it is seen how it is corrected. It is also clear how the sub-grids are accurately detected, especially the vertical lines separating the grids, even though the image contains many noisy artifacts resulting from the microarray experimental stages – some sub-grids on the bottom right part of the image are even quite noisy. For AT-20387-ch1 the angle of rotation wrt the y-axis is very small, φminy = 0.1; however, it is detected and corrected by the proposed method. It is clear from Figs. 3 and 4 how the sub-grids are detected and well separated by the vertical and horizontal lines.
The second test suite consists of a set of images produced in a microarray study of a few genes in an experiment where a human cultured cell line was used to look at the toxicogenomic effects of two pesticides that were found in rural drinking water [17], namely the human toxicogenomic dataset (HTD). Ten images were selected for testing the proposed method, which correspond to five different experiments in two channels, Cy3 and Cy5. The images are listed in Table 2, and are named using HT (which stands for human toxicogenomic), followed by the channel number (Cy3 or Cy5) and the experiment ID. The images have a resolution of 7013 × 3514 pixels, and are in TIFF format. The spot resolution is 40 × 40 pixels per spot, and the separation between sub-grid columns and rows is about 400 pixels. Each image contains 32 sub-grids, arranged in 8 rows and 4 columns. The second, third and fourth columns have the same meaning as in Table 1. As in the other set of images, all the sub-grids are detected with 100% accuracy, denoting the efficiency of the proposed method. While the angles of rotation for the images of the HTD are quite small, they are detected and corrected by the proposed method. However, a small angle for these images produces a large displacement in terms of pixels, since their resolution is higher than that of the images of the SMD. For example, for image HT-Cy3-12667177, a rotation of −0.2 degrees produces a displacement of 8 pixels, which in turn affects the process of detecting the sub-grid separation.
Fig. 3. (a) Original image and (b) sub-grids detected by the proposed method, for image AT-20385-ch1 drawn from the SMD
An image, HT-Cy5-12663787, drawn from the HTD is shown in Fig. 5. The original image and the sub-grids detected are shown in (a) and (b), respectively. Even though the sub-grids are separated by a large number of pixels, the image contains a lot of noise in the separating area, which makes it difficult to detect the sub-grid separation (this is nonetheless done accurately by the proposed method). The noise present in the separating area does produce a displacement of the separating lines, but each box still perfectly encloses the corresponding sub-grid. All 20 images tested can be downloaded from the sources given above. To conclude, the advantages of the proposed method are summarized as follows. First, it automatically detects the angles of rotation (independently for the x and y axes) and performs a correction based on an affine transformation. Second, rotations are detected by mathematically sound principles involving the Radon transform and information-theoretic measures. Third, once the affine transformation is performed, the method detects the sub-grids accurately, as shown on two sets of images from different sources and with different parameters (resolution,
Fig. 4. (a) Original image and (b) sub-grids detected by the proposed method, for image AT-20387-ch1 drawn from the SMD

Table 2. Test images drawn from the HTD, angles of rotation and percentage of sub-grids detected

Image             φminx  φminy  Accuracy
HT-Cy3-12663787   0.3    -0.1   100%
HT-Cy5-12663787   0.3    -0.1   100%
HT-Cy3-12667177   0.3    -0.2   100%
HT-Cy5-12667177   0.3    -0.2   100%
HT-Cy3-12667189   0.3    -0.1   100%
HT-Cy5-12667189   0.3    -0.1   100%
HT-Cy3-12667190   0.4    -0.2   100%
HT-Cy5-12667190   0.4    -0.2   100%
HT-Cy3-12684418   0.0     0.0   100%
HT-Cy5-12684418   0.0     0.0   100%
Fig. 5. (a) Original image and (b) sub-grids detected for image HT-Cy5-12663787 drawn from the HTD
number of sub-grids, spot width, etc.). Fourth, the method provides the right orientation of the sub-grids detected so that they can be processed in the subsequent steps required to continue the microarray data analysis, namely detecting the spot centers (or gridding), and separating the background from foreground (segmentation).
4 Conclusions

A method for separating sub-grids in cDNA microarray images has been proposed. The method performs two main steps: the Radon transform is used to detect rotations with respect to the x and y axes, and morphological operators are used to detect the valleys that separate the sub-grids. The proposed method has been tested on real-life, high-resolution microarray images drawn from two sources, the SMD and the HTD. The results show that (1) the rotations are effectively detected and corrected by affine transformations, and (2) the sub-grids are accurately detected in all cases, even under abnormal conditions such as extremely noisy areas in the images. Future work involves the use of nonlinear functions in the Radon transform, in order to detect curvilinear rotations. This is far from trivial, as it involves a number of possible nonlinear functions, e.g. polynomials or exponentials. Another topic to investigate is
to fit each sub-grid into a perfect box, eliminating any surrounding background and hence providing advantages for the subsequent steps; this problem is currently being addressed.
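As an illustration of the rotation-detection step, the sketch below estimates a rotation angle of a microarray image by scanning candidate angles of the Radon transform and picking the one that makes the projection profile sharpest. It is a minimal sketch only: the variance-of-projections criterion used here is a stand-in for the information-theoretic criterion (3) maximized in the paper, and image loading and the subsequent morphological valley detection are omitted.

```python
import numpy as np
from scipy.ndimage import rotate
from skimage.transform import radon

def estimate_rotation(image, max_angle=2.0, step=0.1):
    """Estimate the small rotation angle (in degrees) of grid lines.

    For each candidate angle, project the image with the Radon transform;
    when the projection direction is aligned with the grid, the profile
    alternates sharply between spot columns and background valleys, which
    we score here by its variance (a stand-in for the paper's criterion (3)).
    """
    angles = np.arange(-max_angle, max_angle + step, step)
    # radon() projects along directions 'theta'; 90 + angle probes near-vertical lines.
    sinogram = radon(image.astype(float), theta=90.0 + angles, circle=False)
    scores = sinogram.var(axis=0)          # one sharpness score per candidate angle
    return angles[int(np.argmax(scores))]

def correct_rotation(image):
    """Rotate the image back by the estimated angle (affine correction)."""
    phi = estimate_rotation(image)
    return rotate(image, -phi, reshape=False, order=1), phi
```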
References

1. Drăghici, S.: Data Analysis Tools for DNA Microarrays. Chapman & Hall (2003)
2. Schena, M.: Microarray Analysis. John Wiley & Sons, Chichester (2002)
3. Antoniol, G., Ceccarelli, M.: A Markov random field approach to microarray image gridding. In: Proceedings of the 17th International Conference on Pattern Recognition, Cambridge, UK, pp. 550–553 (2004)
4. Brandle, N., Bischof, H., Lapp, H.: Robust DNA microarray image analysis. Machine Vision and Applications 15, 11–28 (2003)
5. Ceccarelli, B., Antoniol, G.: A deformable grid-matching approach for microarray images. IEEE Transactions on Image Processing 15(10), 3178–3188 (2006)
6. Qi, F., Luo, Y., Hu, D.: Recognition of perspectively distorted planar grids. Pattern Recognition Letters 27(14), 1725–1731 (2006)
7. Rueda, L., Vidyadharan, V.: A hill-climbing approach for automatic gridding of cDNA microarray images. IEEE Transactions on Computational Biology and Bioinformatics 3(1), 72–83 (2006)
8. Jain, A., Tokuyasu, T., Snijders, A., Segraves, R., Albertson, D., Pinkel, D.: Fully automatic quantification of microarray data. Genome Research 12(2), 325–332 (2002)
9. Noordmans, H., Smeulders, A.: Detection and characterization of isolated and overlapping spots. Computer Vision and Image Understanding 70(1), 23–35 (1998)
10. Katzer, M., Kummert, F., Sagerer, G.: A Markov random field model of microarray gridding. In: Proceedings of the 2003 ACM Symposium on Applied Computing, pp. 72–77 (2003)
11. Angulo, J., Serra, J.: Automatic analysis of DNA microarray images using mathematical morphology. Bioinformatics 19(5), 553–562 (2003)
12. Katzer, M., Kummert, F., Sagerer, G.: Automatische Auswertung von Mikroarraybildern. In: Proceedings of Workshop Bildverarbeitung für die Medizin, Cambridge, UK (2002)
13. Steinfath, M., Wruck, W., Seidel, H.: Automated image analysis for array hybridization experiments. Bioinformatics 17(7), 634–641 (2001)
14. Wang, Y., Ma, M., Zhang, K., Shih, F.: A hierarchical refinement algorithm for fully automatic gridding in spotted DNA microarray image processing. Information Sciences 177(4), 1123–1135 (2007)
15. Wang, Y., Shih, F., Ma, M.: Precise gridding of microarray images by detecting and correcting rotations in subarrays. In: Proceedings of the 8th Joint Conference on Information Sciences, Salt Lake City, USA, pp. 1195–1198 (2005)
16. Helgason, S.: The Radon Transform, 2nd edn. Springer, Heidelberg (1999)
17. Qin, L., Rueda, L., Ali, A., Ngom, A.: Spot detection and image segmentation in DNA microarray data. Applied Bioinformatics 4(1), 1–12 (2005)
Modelling Intermittently Present Features Using Nonlinear Point Distribution Models

Gerard Sanroma and Francesc Serratosa

Dept. of Computer Engineering and Maths, Universitat Rovira i Virgili, Av. Països Catalans 26, E-43007 Tarragona, Catalonia, Spain
[email protected],
[email protected]
Abstract. We present in this paper a new compact model to represent a data set, based on the main idea of the Point Distribution Model (PDM). Our model improves on the PDM in two aspects. First, it does not require all the objects to have the same number of points. This is a very important feature, since in real applications not all the landmarks are represented in the images. Second, the model captures the nonlinearity of the data set. Research has been presented that addresses each of these aspects separately, but no model presented until now couples them as a whole. A case study shows the efficiency of our model and its improvement with respect to the models in the literature.

Keywords: Point Distribution Models, intermittently present landmarks, missing data, statistical shape modelling, imputation.
1 Introduction
One of the main challenges in locating and recognising objects in images is that objects of the same class are often not identical in appearance. In such cases, deformable models, which allow for variability in the imaged objects, are appropriate [1]. An important issue is limiting the shape variability to that which is consistent with the class of objects to be modelled. Point Distribution Models (PDMs) [2] are one of the most important methods used to represent such shape variability. Objects are defined by landmark points which are placed in the same way on each of the examples of the set. A statistical analysis estimates the mean shape and the main modes of variation. These linear models have proved successful in capturing the variability present in some applications; however, there are situations where they fail due to the nonlinearity of the shapes of the objects that compose the class. To overcome this problem, the Polynomial Regression PDM (PRPDM) [3] was defined. It is based on a polynomial regression that captures the nonlinearity of the shape variability. All the methods mentioned above need the same number of landmarks to be present in all the objects that compose the class. However, some objects can present patterns of presence and absence in some of their features, which gives shapes with different numbers of points. Little research has been carried out
that considers intermittently present landmarks. The Structured PDM (SPDM) [4] by M. Rogers and J. Graham is a method based on the well-known PDM, augmented with a vector indicating the inclusion or exclusion of each landmark. This method was shown to capture the variability and the presence of landmarks on shapes both in artificial images and in electron microscope images of diabetic nerve capillaries. In this paper, we present a new model that permits intermittently present landmarks as in the SPDM and captures nonlinear variability as in the PRPDM. We have called it the Polynomial Regression Structured Point Distribution Model (PR-SPDM), since we consider that both problems are addressed: nonlinear variability and intermittently present landmarks. In section 2 we present the topic of modelling intermittently present landmarks and describe an existing approach. In section 3 we describe an approach to modelling nonlinear variability in shapes. In section 4 we present our approach, which allows for shapes with intermittently present landmarks and captures nonlinear variability. In sections 5 and 6 we present some results and discuss the relevance of our work.
2 Modelling Intermittently Present Landmarks
When modelling a class of objects with a PDM, all the objects need to be described by the same number of landmark points. Nevertheless, in some examples this is not possible due to the nature of the objects. Consider the class of objects in figure 1. There is a dependency between the aspect ratio of the external square and the size of the internal square. While the vertices of the external square move along a straight line, the internal square appears, grows, shrinks and finally disappears. There is a quadratic dependency between the aspect ratio of the external square and the area of the internal one, with a threshold that forces the internal square to disappear.
Fig. 1. Some objects of the training set. There are 16 points representing the figures with only the external square, and 32 points representing the figures with both squares (there are 3 points along each line segment).
If we want to model this kind of variability with a PDM, a first process is needed that estimates the positions of the missing landmarks so that no points are missing. We refer to this process as data imputation
(section 2.1). Then a classical PDM may be applied. Thus, for the training data after the imputation step, x = [x_1 | ... | x_N] (where x_i is the i-th training shape and N the number of training shapes), we get a set of parameterizations b = [b_1 | ... | b_d]^T (where b_i is the i-th principal mode and d is the number of modes of the PDM). Nevertheless, the model would have lost the knowledge of the presence or absence of some of the landmarks. For this reason, it seems reasonable to define a model that considers this source of knowledge. We refer to this model as a model for intermittently-present data (section 2.2).

2.1 Data Imputation
In this section, two schemes for data imputation are described. In the first one, the missing data is replaced by the mean; usually, this approximation underestimates the variance of the data. The second one was presented by Rogers and Graham [4] and is known as iterated PCA. The basic idea is to iteratively impute values in such a way as to retain the relationships present in the original data.

Mean Imputation. In this scheme, the positions (x,y) of the landmark points that are not present in some objects are replaced by the mean (x,y) of the corresponding landmark points that are present in the other objects. We show in figure 2 the scattergram of the first two modes of variation of the PDM when the missing data is imputed with the mean. The vertical axis is represented by b2, which codes the aspect of the external square, and the horizontal axis by b1, which codes the size of the internal square. Low (or negative) values in b2 represent a big size and high values represent a small size. The nonlinear correlation between these two features can be observed. It represents the process of enlargement and reduction of the internal square throughout the process of varying the aspect of the external square. The objects with no interior square originally are projected in the scattergram as two horizontal lines at the extrema of b2. They are assigned to zero in b1, since this corresponds to an interior square of mean size. Figure 3 shows the reconstructed objects obtained by varying the two principal modes of variation between ±2 standard deviations (s.d.). The first mode, b1 (first row), encodes the size of the interior square and the second mode, b2 (second row), encodes the aspect of the external square. The linear PDM does lead to the generation of invalid shapes (not similar to the ones in the training set).

Iterated PCA. This is an iterative method to impute missing data developed by Rogers and Graham [4]. There is an initial imputation based on the mean (section 2.1) and then the data is re-estimated iteratively until convergence. At the beginning, the first principal component of the data is used for re-estimation, and subsequent principal components are added one by one. The aim of this process is to impute the data so as to retain the relationships represented by a new eigenvector at each step. The goal is to end up with imputed data consistent with the data patterns represented by the eigenvectors.
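For concreteness, the mean imputation scheme above can be sketched as follows (a minimal sketch; the array layout and NaN convention are assumptions, and Procrustes alignment is not shown):

```python
import numpy as np

def mean_impute(shapes):
    """Replace missing landmark coordinates (marked as NaN) by the mean of the
    corresponding coordinate over the shapes in which it is present.

    shapes: (N, 2k) array of N training shapes stored as (x_1..x_k, y_1..y_k);
            absent landmarks are NaN.
    """
    col_means = np.nanmean(shapes, axis=0)          # per-coordinate mean of present values
    return np.where(np.isnan(shapes), col_means, shapes)
```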
Fig. 2. Scattergram of the two main modes of variation. Data imputed with the mean.

Fig. 3. Two principal modes of variation of PDM with mean imputation. From -2 to +2 s.d.
This method obtains practically the same results as mean imputation on the example set given in figure 1; for this reason we do not show the results explicitly. Since the imputation only affects the size of the internal square, it can be seen as moving the points along b2. Setting the points to zero in b2 is then in accordance with the linear relationship defined by the first eigenvector b1, and is equivalent to placing a mean-size internal square.

2.2 A Model for Intermittently-Present Data
In this work, we will use the model for intermittently-present data defined by Rogers and Graham, called the Structured Point Distribution Model (SPDM) [4]. The aim of this model is to build a PDM that deals with intermittently present landmarks. Let x be an initial training set composed of N configurations of k landmarks in two dimensions,

x_i = [(x_i1, y_i1), (x_i2, y_i2), ..., (x_ik, y_ik)].

Let each configuration be represented by the data vector

x_i = (x_i1, x_i2, x_i3, ..., x_ik, y_i1, y_i2, y_i3, ..., y_ik)^T,

and let miss(i) be the set of missing landmarks in the i-th configuration. The procedure to build an SPDM is the following:

1. Replace each missing value by a placeholder such as NaN (a computational representation of Not a Number):

   x_i,miss(i) = y_i,miss(i) = NaN,   1 ≤ i ≤ N.

2. Align the configurations using the landmarks that are available, with Generalized Procrustes Analysis [5].

3. Impute the missing landmarks by some imputation scheme (see Section 2.1 for further details). Then, an initial training shape i with miss(i) = {2, 3} becomes

   x̂_i = (x'_i1, x̂_i2, x̂_i3, ..., x'_ik, y'_i1, ŷ_i2, ŷ_i3, ..., y'_ik)^T,

   where primed elements are aligned and hatted elements are imputed.

4. Perform a shifting and scaling in order to make the data x̂ lie between 0 and 1:

   x̃_i = (x̂_i − x_min) / (x_max − x_min),   1 ≤ i ≤ N,   (1)

   where x_max and x_min are the vectors with the maximum and minimum values of x̂ for each dimension, respectively. This shifting and scaling avoids problems associated with shape and structure being measured on different scales; it corresponds to treating shape and structure as equally important.

5. Build a classical PDM [2], getting a parameterization b^d for the normalized training shapes:

   x̃ = x̄ + Φ b^d,   (2)

   where Φ are the eigenvectors and x̄ is the Procrustes mean shape [5]. While this gives us a model of shape variation, we have lost the structural information about which landmarks are present and which are not.

6. Augment the shape vector with a structure vector informing about the presence or absence of each landmark, x^s_i = (x^s_i1, x^s_i2, ..., x^s_ik)^T, where x^s_ij ∈ {0, 1} depending on whether the j-th landmark of the i-th shape, (x_ij, y_ij), is present or not.

7. Apply PCA to these structure vectors in order to reduce the redundancy, obtaining a reduced parameter vector b^s representing the structure vector:

   x^s = x̄^s + P b^s,   (3)

   where P are the eigenvectors of the covariance matrix of the structural data x^s, and x̄^s is the mean structure vector of the N samples.

8. Build a combined model of shape and structure, i.e., for each training shape generate the concatenated vector

   b = ( Φ^T (x̃ − x̄) ; P^T (x^s − x̄^s) ) = ( b^d ; b^s ).   (4)

9. Apply PCA again to obtain a combined model of shape and structure,

   b = Q c,   (5)

   where Q are the eigenvectors and c is a vector of structural shape parameters controlling both the position and the presence or absence of the shape points. Since the shape and structural parameters have zero mean, c does too. Note that the linear nature of the model allows the shape and its structural information to be expressed directly as functions of c:

   x̃ = x̄ + Φ Q^d c,   x^s = x̄^s + P Q^s c,   (6)

   where Q^d and Q^s are the sub-matrices of Q corresponding to the shape and structure parameters, respectively,   (7)

   and the original shape is given by

   x = x̃ (x_max − x_min) + x_min.   (8)

An example shape can be synthesised for a given c by generating the shape from the vector x and removing landmarks according to x^s and a given threshold. The SPDM, like the PDM, is a generative model, i.e., given a parameter vector we can recreate the structure vector for a particular instance. The main drawback comes from representing a binary process (presence or absence) by a linear model. Thus, to recover binary values in the reconstructed structure vector, a threshold representing the probability of presence or absence must be applied. By including the structure vector x^s we allow for arbitrary patterns of inclusion/exclusion. Figure 4 shows the scattergrams obtained using this model when the data is imputed with the mean.
Fig. 4. Scattergrams of the three main modes of variation. Data imputed by the mean and using the SPDM.
The horizontal axes of both scattergrams, b1, represent the mode that captures the variability in the presence or absence of the points: the objects represented by the points at b1 ≈ −0.5 have the interior square present, and those at b1 ≈ 3.5 do not. The mode that captures the variability in the aspect ratio of the exterior square, b2, is represented by the vertical axis of figure (a), and the mode that captures the variability in the size of the interior square, b3, by the vertical axis of figure (b). Figure 5 shows the objects generated by varying the mode b1 of the SPDM between ±2 s.d., which represents the presence or absence of the internal square. The objects generated by varying the mode that represents the size of the internal square and the mode that represents the aspect of the external square are not shown, because they are identical to the ones generated by a PDM with mean imputation (figure 3).
Fig. 5. First principal mode of variation of SPDM. Data with mean imputation. From -2 to +2 s.d.
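The combined shape-and-structure model of steps 5-9 above can be sketched with an off-the-shelf PCA implementation as follows (a minimal sketch under the assumption that the shapes have already been aligned and imputed; the variable names and the use of scikit-learn are illustrative, not the authors' implementation):

```python
import numpy as np
from sklearn.decomposition import PCA

def build_spdm(shapes_imputed, structure, n_shape=3, n_struct=1, n_combined=3):
    """Build an SPDM-style combined model of shape and structure.

    shapes_imputed: (N, 2k) aligned shapes with missing landmarks already imputed.
    structure:      (N, k) binary matrix, 1 where a landmark is present, 0 where absent.
    """
    # Step 4: rescale shape data to [0, 1] so shape and structure share a scale.
    x_min, x_max = shapes_imputed.min(axis=0), shapes_imputed.max(axis=0)
    x_tilde = (shapes_imputed - x_min) / (x_max - x_min + 1e-12)

    # Steps 5 and 7: separate PCAs on shape and on structure.
    shape_pca = PCA(n_components=n_shape).fit(x_tilde)
    struct_pca = PCA(n_components=n_struct).fit(structure)
    b_d = shape_pca.transform(x_tilde)        # shape parameters b^d
    b_s = struct_pca.transform(structure)     # structure parameters b^s

    # Steps 8-9: PCA on the concatenated parameter vectors -> combined parameters c.
    b = np.hstack([b_d, b_s])
    combined_pca = PCA(n_components=n_combined).fit(b)
    return shape_pca, struct_pca, combined_pca, (x_min, x_max)

def synthesise(c, model, threshold=0.5):
    """Recreate a shape and its present/absent pattern from combined parameters c."""
    shape_pca, struct_pca, combined_pca, (x_min, x_max) = model
    b = combined_pca.inverse_transform(c[None, :])
    b_d, b_s = b[:, :shape_pca.n_components_], b[:, shape_pca.n_components_:]
    x = shape_pca.inverse_transform(b_d) * (x_max - x_min) + x_min
    present = struct_pca.inverse_transform(b_s) > threshold   # threshold the linear structure model
    return x.ravel(), present.ravel()
```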
3 Modelling Nonlinear Dependencies
When the modes of variation obtained from the objects of a class have nonlinear dependencies, as is the case in our example, the specificity of the model obtained by a classical PDM is very poor (figure 3). To solve this problem, Sozou et al. defined a nonlinear generalization of the PDM called the Polynomial Regression Point Distribution Model (PRPDM) [3]. The aim is to model all the modes b_i as nonlinear functions of a given mode: b*_i = f_ik(b_k), where b_k is the mode being fitted, f_ik(·) is the polynomial that models the i-th mode as a function of the k-th mode, and b*_i are the modelled values. Next, the new residuals b^new_i are computed as the difference between the data values of b_i and the modelled values b*_i: b^new_i = b_i − b*_i.
Fig. 6. Scattergram of the linear decomposition of the training data with mean imputation. The first principal mode of variation of PRPDM is superimposed.
This procedure is repeated until there are no more residuals or until we have explained the desired variance. Note that at each step one mode is removed, since we ignore the mode being fitted, b_k. At each step we append the values of b_k to a new parameter vector b', and store the set of functions f_ik(·). This model assumes that there is no missing data. In our data set there are some missing points; in these experiments the missing points have been imputed using the mean imputation scheme presented in section 2.1. Figure 6 shows the scattergram of the first two modes of the linear decomposition of the training data with mean imputation. The data is the same as that presented in figure 2, but the first automatically obtained polynomial mode (which fits b1 to b2) is superimposed. The polynomial mode does not adjust to the data, due to the noise introduced by the mean imputation, and so the model does not properly retain the variability of the training set. Figure 7 shows the objects generated by varying the first mode of the PRPDM between ±1.5 s.d. The generated objects are more similar to the original ones (figure 1) than those generated by the linear PDM (figure 3), but the results are not good enough. In the following section, we present a new model with the aim of solving the two main problems described in the sections above: it has to capture the nonlinear variability in a more accurate way and it also has to capture the presence or absence of some of the data.
Fig. 7. First principal mode of PRPDM. Data with mean imputation. From -1.5 to +1.5 s.d.
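The polynomial-regression step of the PRPDM can be sketched as follows (a minimal sketch using one polynomial mode fitted against a chosen driving mode; the polynomial degree and the stopping rule are simplifications, not the exact procedure of Sozou et al.):

```python
import numpy as np

def fit_prpdm_mode(b, k=0, degree=2):
    """Fit one polynomial mode of a PRPDM.

    b: (N, d) matrix of linear PDM parameters, one row per training shape.
    k: index of the mode being fitted (the 'driving' mode).
    Returns the driving values, the fitted polynomials and the residual parameters.
    """
    driver = b[:, k]
    others = [i for i in range(b.shape[1]) if i != k]
    polys, residuals = {}, np.empty((b.shape[0], len(others)))
    for j, i in enumerate(others):
        coeffs = np.polyfit(driver, b[:, i], degree)              # b*_i = f_ik(b_k)
        polys[i] = coeffs
        residuals[:, j] = b[:, i] - np.polyval(coeffs, driver)    # b_i^new = b_i - b*_i
    return driver, polys, residuals
```

The procedure would then be repeated on the residual matrix (fitting the next polynomial mode) until the desired fraction of variance is explained.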
4 Nonlinear Modelling of Intermittently Present Data
The main idea of this new model is to impute the missing data according to the nonlinear relationships of the present data. In this way, we expect to obtain a better approximation of the polynomial regression on the present data. To that end, we have defined a new imputation scheme called iterated nonlinear PCA, inspired by the iterated PCA scheme explained in section 2.1. Moreover, with the aim of not losing the knowledge of the presence or absence of the data, we have defined a new model for intermittently present data inspired by the one described in section 2.2. The new model captures nonlinear variability as in the PRPDM (section 3) and intermittently present data as in the SPDM (section 2.2). For this reason, we have called it the Polynomial-Regression Structured Point Distribution Model (PR-SPDM).

4.1 A New Imputation Scheme: Iterated Nonlinear PCA
We have developed this method as an adaptation of iterated PCA, intended to fit our purposes. The idea is to impute the missing data according to the nonlinear relationships present in the original data. The algorithm can be described with the following equations:

(P_xd, μ_x, b_xd) = pca(x, d)   (9)
(A_xm, c_xm) = prpca(b_xd, m)   (10)
b̂^m_xd = prreconstr(c_xm, A_xm)   (11)
x̂ = μ_x + P_xd · b̂^m_xd   (12)
x_i,miss(i) = x̂_i,miss(i)   (13)
where x is the original data, x_ij is the j-th value in example i, pca is a function that computes the first d principal components P_xd and the mean μ_x, together with the associated reconstruction parameters b_xd for each training example, prpca is a function that computes the projections c_xm of the parameters b_xd onto the space spanned by the first m principal polynomial modes A_xm, and prreconstr is the function that reconstructs the d-dimensional data b̂^m_xd from the projections c_xm using the first m polynomial axes A_xm. miss(i) is the set of variables missing in example i and x̂_i,miss(i) is the set of estimated missing values of x_i. The method consists of a two-step procedure. First, b_xd is obtained by projecting x onto the first d principal linear components (eq. (9)). Next these data are used to compute the nonlinear modes of variation, and are then projected using the first m of these modes, obtaining c_xm (eq. (10)). Then, we reconstruct the
original data x̂ (eqs. (11)-(12)) and replace the imputed values (eq. (13)). The algorithm starts by mean imputation of the missing data, m is set to 1, and equations (9)-(13) are cycled through until convergence. Next, we increment m by 1 and repeat the procedure. At each step the data conforms to the nonlinear relationships retained by the first m principal nonlinear components. The process is repeated until Σ_m σ²_xm / σ²_x > 0.9, where σ²_xm is the variance of the m-th mode and σ²_x is the variance of all modes. Figure 8 shows the scattergram of the linear decomposition of the training data with iterated nonlinear PCA imputation, together with the first principal mode of the PRPDM. We set d = 3 in the execution of the algorithm, since 99% of the variance is explained by 3 eigenvectors. Using this imputation scheme, we observe two main properties. First, the objects with missing data (points with b1 < −0.6 or b1 > 0.6) conform to the first principal nonlinear mode, instead of having a mean-size square (which would conform to the first principal linear mode). And second, the obtained polynomial approximates the whole dataset better, which means that the model is more specific.
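A compact sketch of the imputation loop of equations (9)-(13) is given below (assumptions: a single driving mode is used for the polynomial regression, scikit-learn's PCA stands in for pca, and the convergence test is a simple change threshold rather than the variance criterion above):

```python
import numpy as np
from sklearn.decomposition import PCA

def iterated_nonlinear_pca_impute(shapes, d=3, degree=2, n_iter=50, tol=1e-6):
    """Impute missing values (NaN) so that they follow the nonlinear
    relationships of the present data, as in eqs. (9)-(13).

    shapes: (N, 2k) training shapes with NaN at missing coordinates.
    """
    missing = np.isnan(shapes)
    x = np.where(missing, np.nanmean(shapes, axis=0), shapes)   # start from mean imputation

    for _ in range(n_iter):
        pca = PCA(n_components=d).fit(x)                 # eq. (9): linear decomposition
        b = pca.transform(x)
        driver = b[:, 0]                                 # eq. (10): one polynomial mode
        b_hat = b.copy()
        for i in range(1, d):                            # eq. (11): polynomial reconstruction
            coeffs = np.polyfit(driver, b[:, i], degree)
            b_hat[:, i] = np.polyval(coeffs, driver)
        x_hat = pca.inverse_transform(b_hat)             # eq. (12): back to data space
        x_new = np.where(missing, x_hat, shapes)         # eq. (13): replace only missing entries
        if np.max(np.abs(x_new - x)) < tol:
            break
        x = x_new
    return x
```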
Fig. 8. Scattergram of the linear decomposition of the training data. Data imputed by the iterated nonlinear PCA and modelled by the PRPDM. The first principal mode of the PRPDM is superimposed.
Figure 9 shows the objects generated by varying the first mode of the PRPDM between ±1.5 s.d. In this example the nonlinear models are more compact, since only one mode of variation is needed to generate good approximations of the objects. This is due to the fact that the nonlinear dependency between the size of the interior square and the position of the exterior one is properly acquired by the PRPDM. Moreover, when imputing with iterated nonlinear PCA, the generated objects are closer to the original ones than those generated by any of the other models presented before. Nevertheless, in the generated objects the interior square is always present: the knowledge of the presence or absence of some points has not been considered. This is the topic of the following section.

Fig. 9. First principal mode of PRPDM. Data imputed with iterated nonlinear PCA. From -1.5 to +1.5 s.d.

4.2 A New Model for Intermittently-Present Data: PR-SPDM
This model captures sets of objects with missing data (or landmarks) and also captures the nonlinearity in the relationships between the landmarks. Let x be an initial training set composed of N configurations of k landmarks in two dimensions, and let miss(i) be the set of missing landmarks in the i-th configuration. The procedure to build a PR-SPDM is the following:

1. Follow steps 1-4 of the procedure to build an SPDM in Section 2.2. After these steps, we end up with an aligned set of shapes with the missing landmarks imputed, and with the whole set normalized to lie between 0 and 1. We denote this set as x̃.

2. Build a classical PRPDM with the dataset x̃, as indicated in Section 3, and get the nonlinear parameterizations b^d_pr for each shape x̃_i.

3. Augment the shape vector with a structure vector x^s_i informing about the presence or absence of each landmark, as in step 6 of the SPDM procedure in section 2.2.

4. Apply PCA to the structure vectors x^s_i in order to reduce the redundancy, and get an eigenvector matrix P for the structure parameterization, as in step 7 of the SPDM procedure in section 2.2.

5. Build a concatenated vector of shape and structural parameterizations,

   b = ( b^d_pr ; b^s ).   (14)

6. Apply PCA again to obtain a combined model of shape and structure,

   b = Q c,   (15)

   where Q are the eigenvectors and c is a vector of structural shape parameters controlling both the position and the presence or absence of the shape points. Since the shape and structural parameters have zero mean, c does too. Note that a shape and its structural information can be recreated for a given c:

   b^d_pr = Q^d c,   x^s = x̄^s + P Q^s c,   (16)

   where Q^d and Q^s are the sub-matrices of Q corresponding to the shape and structure parameters, respectively,   (17)

   x̃ is computed by reconstructing the PRPDM parameters, and the original shape is finally given by

   x = x̃ (x_max − x_min) + x_min.   (18)

An example shape can be synthesised for a given c by generating the shape from the vector x and removing landmarks according to x^s and a given threshold. Figure 10 shows the scattergrams of the PR-SPDM with mean (a) and iterated nonlinear PCA (b) imputation of the missing data. The horizontal axes, b1, encode the inclusion and exclusion of points. The vertical axes, b2, correspond to the first nonlinear modes of the PRPDMs.
Fig. 10. Scattergrams of the two first principal components of PR-SPDM for data with (a) mean imputation, and (b) iterated nonlinear PCA imputation
Figure 11 shows the objects generated by varying the first mode of the PR-SPDM between ±1.5 s.d., for data with both mean imputation and iterated nonlinear PCA imputation. As in the case of the SPDM, the first principal mode of the PR-SPDM represents the inclusion or exclusion of points. The remaining two modes capture deformations in shape, as seen in figure 7 for mean imputation and figure 9 for nonlinear PCA imputation.
Fig. 11. First principal mode of PR-SPDM. It is the same for mean imputation and iterated nonlinear PCA imputation. From -1.5 to +1.5 s.d.
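Combining the nonlinear shape parameters with the structure parameters (steps 5-6 above) differs from the SPDM sketch only in what is fed to the final PCA; a minimal sketch is given below (the helper names and the use of scikit-learn are assumptions for illustration, and the PRPDM parameterization b_pr is assumed to come from a routine such as the polynomial-regression sketch in section 3):

```python
import numpy as np
from sklearn.decomposition import PCA

def build_pr_spdm(b_pr, structure, n_struct=1, n_combined=2):
    """Combine nonlinear shape parameters with structure parameters (eqs. (14)-(15)).

    b_pr:      (N, d) PRPDM parameterizations of the normalized shapes.
    structure: (N, k) binary presence/absence matrix of the landmarks.
    """
    struct_pca = PCA(n_components=n_struct).fit(structure)
    b_s = struct_pca.transform(structure)                 # structure parameters b^s
    b = np.hstack([b_pr, b_s])                            # eq. (14): concatenated parameters
    combined_pca = PCA(n_components=n_combined).fit(b)    # eq. (15): b = Qc
    return struct_pca, combined_pca

def presence_from_c(c, struct_pca, combined_pca, d, threshold=0.5):
    """Recover the present/absent pattern for a given c (the shape part would be
    reconstructed through the PRPDM polynomials, omitted here)."""
    b = combined_pca.inverse_transform(c[None, :])
    b_s = b[:, d:]
    return (struct_pca.inverse_transform(b_s) > threshold).ravel()
```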
5 Evaluation Experiments
We have perturbed the (x, y) coordinates of the training images with noise and calculate the mean square error between the objects reconstructed using each of the models described and the original images. We present the results in two plots, corresponding to objects composed of one and of two squares (Figs. 12(a) and 12(b)). We use the first two principal modes of each model.
Fig. 12. Reconstruction errors for (a) 1-part and (b) 2-part objects. The curves compare SPDM with mean imputation, SPDM with iterated PCA imputation, PR-SPDM with mean imputation and PR-SPDM with iterated nonlinear PCA imputation.
While the differences for the objects which have only one square are negligible, the models using the PR-SPDM present better results than those using the SPDM for objects which have both squares. This is because two modes are enough for the PR-SPDM to explain the three sources of variability (presence/absence, aspect ratio of the external square and size of the internal square), while the SPDM only explains the first two with the same number of modes. Among the nonlinear models, the one with iterated nonlinear PCA imputation clearly achieves the best results.
6 Conclusions and Future Work

We have presented a new scheme for imputing missing data and a new model of the presence or absence of data. The imputation scheme, called iterated nonlinear PCA, aims to generate the missing data according to the nonlinear relationships of the present data. The model defined to capture the presence or absence of data is called the Polynomial-Regression Structured Point Distribution Model (PR-SPDM). When this model is applied, a new variability mode appears that represents the presence or absence of data. Results show that the imputed data plays an important role in the resulting model, making it more specific. Moreover, our model achieves the best representation of the data set, and we believe that it is portable to real examples, e.g. medical images. We leave this study as future
work. It is a topic of future research to model the clearly nonlinear relationships between the inclusion/exclusion of landmarks and shape deformation as nonlinear modes of variation of a PR-SPDM.
References

1. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1(4), 321–331 (1987)
2. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active shape models – their training and application. Computer Vision and Image Understanding 61(1), 38–59 (1995)
3. Sozou, P.D., Cootes, T.F., Taylor, C.J., Dimauro, E.C.: Nonlinear generalization of point distribution models using polynomial regression. Image and Vision Computing 13(5), 451–457 (1995)
4. Rogers, M., Graham, J.: Structured point distribution models: Modelling intermittently present features. In: Proceedings of the British Machine Vision Conference, vol. 1, pp. 33–42 (2001)
5. Goodall, C.: Procrustes methods in the statistical analysis of shape. Journal of the Royal Statistical Society, Series B (Methodological) 53(2), 285–339 (1991)
Measuring Linearity of Ordered Point Sets

Milos Stojmenovic and Amiya Nayak

SITE, University of Ottawa, Ottawa, Ontario, Canada K1N 6N5
{mstoj075, anayak}@site.uottawa.ca
Abstract. It is often practical to measure how linear a certain ordered set of points is. We are interested in linearity measures which are invariant to rotation, scaling, and translation. These linearity measures should also be calculated very quickly and be resistant to protrusions in the data set. No such measures exist in the literature. We propose several such measures here: average sorted orientations, triangle sides ratio, and the product of a monotonicity measure and one of the existing measures for linearity of unordered point sets. The monotonicity measure is also a contribution here. All measures are tested on a set of 25 curves. Although they appear to be conceptually very different approaches, the six monotonicity-based measures are mutually highly correlated (all correlations are over .93). Average sorted orientations and triangle sides ratio appear as effectively different measures from them (correlations are about .8) and mutually relatively close (correlation .93). When compared to human measurements, the average sorted orientations and triangle sides ratio methods prove to be closest. We also apply our linearity measures to design new polygonal approximation algorithms for digital curves. We develop two polygonization algorithms: a linear polygonization and a binary search polygonization. Both methods search for the next break point with respect to a known starting point. The break point is decided by applying threshold tests based on a linearity measure.

Keywords: Linearity, ordered point sets, polygonization.
1 Introduction

The main motivation for this work is in image processing. Measuring the linearity of a finite set of points can be an interesting way of identifying the important components of a picture. Linear points often represent a region of interest in an image. By dissecting an object into an ordered collection of lines, the object becomes more easily identifiable, both visually and computationally. Polygonization is the natural extension and cleanest application of measuring linearity. Simple objects are often a collection of straight lines: a square, a triangle or even a star can be represented using a few vertices rather than a large number of points. For instance, polygonization is the basic technique of template matching in production facilities. Newly manufactured parts need to look like the master template to be acceptable, and polygonization is used to quickly compare the master to the copy.
In general, we are interested in measuring how linear a finite set of points is, and how well we can produce polygonizations based on these linearity measures. In analyzing various linearity algorithms, we align ourselves with the following criteria. We are interested in assigning linearity values to sets of points. The linearity value is a number in the range [0, 1]. It equals 1 if and only if the shape is linear, and equals 0 when the shape is circular or has another form which is highly non-linear such as a spiral. A shape’s linearity value should be invariant under similarity transformations of the shape, such as scaling, rotation and translation. The algorithms should also be resistant to protrusions in the data set. Linearity values should also be computed by a simple and fast algorithm. It is very important to stress that points in the set are ordered. This means that figures such as ellipses or rectangles which are very flat (long and thin) are considered to be highly nonlinear. If we were to consider unordered sets of points, such ellipses would be highly linear. The only concrete discussion on measuring linearity was found in [13], but it addresses unordered sets. In [13], six linearity measures of unordered sets were described. Here, we will propose and analyze 8 algorithms that assign linearity values to ordered sets of points. The linearity algorithms are called: average sorted orientations, triangle sides ratio, and the last 6 deal with monotonicity multiplied by corresponding linearity measures from [13] for unordered point sets. Average sorted orientations finds the unit vectors along the selected ordered pairs, and their average vector. The linearity measure is the length of that vector. Triangle sides ratio method takes random triplets of ordered points A
possible point segment that has a satisfactory linearity measure by applying a binary search to the other end of the next polygon edge. Thus a longer edge may emerge, leveraging out some local irregularities. The polygonization algorithms were tested using polygonization measures defined in [7, 8, S, 11]. Although they appear to be conceptually very different approaches, we show that most of the linearity algorithms provide similar polygonization results. Overall in this paper, one monotonicity measure, 8 new linearity measures for ordered point sets and two new polygonization techniques are proposed. The literature review is given in section 2. Linearity and monotonicity measures and polygonal approximation algorithms are presented in section 3. The algorithms were tested on selected curves in section 4.
2 Literature Review

We will describe here several well-known functions on finite sets of points that are used in our linearity measures. Existing linearity measures for unordered sets will be covered along with other relevant measures. We also discuss existing polygonal approximation methods for digitized curves based on linearity tests. Finally, we describe some existing measures of the quality of a polygonization.

2.1 Moments and Orientation

The central moment of order pq of a set of points Q is defined as:
μ_pq = Σ_{(x,y)∈Q} (x − x_c)^p (y − y_c)^q,

where S is the number of points in the set Q, and (x_c, y_c) is the center of mass of the set Q. The center of mass is the average value of each coordinate in the set, and is determined as follows:

(x_c, y_c) = ( (1/S) Σ x_i , (1/S) Σ y_i ),

where (x_i, y_i), 1 ≤ i ≤ S, are the real coordinates of the points of Q. The angle of orientation of the set of points Q is determined by [1]:

angle = 0.5 arctan( 2μ_11 / (μ_20 − μ_02) ).

2.2 Linearity Measures for Unordered Sets

The most relevant and applicable shape measure to our work is the measurement of linearity of unordered data sets. [13] is the only source in the literature that deals directly with measuring linearity for unordered sets of points. Six linearity measures were proposed in [13], all of which we will adapt here. The average orientation scheme first finds the orientation line of the set of points using moments. The method takes k pairs of points and finds the unit normals to the lines that they form. The unit normals all point in the same direction (along the normal to the orientation line). The average normal value (A, B) of all of the k pairs is found, and the linearity value is calculated
as √(A² + B²). Triangle heights takes the average value of the relative heights of triangles formed by taking random triplets of points. Relative heights are heights divided by the longest side of the triangle, then normalized so that we obtain a linearity value in the interval [0, 1]. Triangle perimeters takes the normalized average value of the area divided by the square of the perimeter of triplets of points as its linearity measure. Contour smoothness and eccentricity were adapted from measures of circularity. They are simple formulas involving moments that were found in the literature and adapted to finding linearity [1]. The idea remained the same, but the resulting measurements were interpreted differently. In the original scheme in [1], a measure of circularity was proposed by dividing the area of a shape by the square of its perimeter; for circles this yields circularities of 1, and values less than 1 for other objects. Ellipse axis ratio is based on the minor/major axis ratio of the best ellipse that fits the set of points.

2.3 Relevant Measures

The standard method for measuring rectangularity is to use the ratio of the region's area to the area of its minimum bounding rectangle (MBR) [5]. A weakness of using the MBR is that it is very sensitive to protrusions from the region. A narrow spike out of the region can vastly inflate the area of the MBR, and thereby produce very poor rectangularity estimates. This goes against our stated criteria. Three new methods for measuring the rectangularity of regions were developed by Rosin [5]. Zunic and Rosin [16] described shape measures intended to describe the extent to which a closed polygon is rectilinear (each corner angle is 90° or 270°). The most frequently used convexity measure in practice is the ratio between the area of a polygon and the area of its convex hull [12]. Zunic and Rosin [17] discussed two measures that have advantages when measuring the convexity of shapes with holes. Rosin [4] described several measures of sigmoidality. Broder and Rosenfeld [BR] define a measure of collinearity merit for two adjacent line segments AB and CD (A
Williams [W] used the cone intersection method to find the maximal possible line segment. Circles of fixed radius are drawn around each point. Points are then added one after another to the initial point until the intersection of all the cones with their vertex at the initial point and touching these circles is an empty set. The test used by Pavlidis [P] is based on the maximal distance to a specific line segment and the number of sign changes of points with respect to that line, and has three parameters. Wall and Danielsson [WD] find the maximal line segment by merging points one after another until the area deviation per unit length of the approximating line segment exceeds a maximum allowed threshold value. Ray and Ray [9] define three points to be collinear if the radius of the largest inscribed circle inside their triangle is below a certain threshold. They assume that a set of points is collinear if the set containing the initial point and the last two points is collinear, and, recursively, the same set of points without the last one is collinear. A common feature of all of the described methods is that they are based on a test (without measuring how linear a segment is) with parameters that are not scale invariant. Thus the same curve will yield different polygonizations on digitizations with different resolutions or with different curve sizes.

2.5 Evaluation of Polygonization Methods

Rosin [7, 8] identified an enormous number of available methods for finding polygonal approximations, and discussed techniques for evaluating them. One popular metric is the figure of merit, FOM = N/(M·ISE) [S], where N is the number of points on the curve, M is the number of vertices of the obtained polygon, and ISE is the integral square error. ISE is the sum of squares of the distances of each point on the curve from the corresponding edge of the polygon. [11] observes that ISE depends on the size of the object, but does not change the definition. Marji and Siu [3] modified FOM by adding a parameter n, defining it as (N/M)^n/ISE, where n = 1, 2, 3. Rosin [7] proposed several new measures of polygonization quality: fidelity, efficiency and merit. Fidelity is measured as Eo/E, where E is the ISE produced by the measured algorithm, and Eo is the ISE produced by the optimal algorithm, i.e. the one that produces the same number of lines but minimizes ISE. Efficiency is measured as M0/M, where M is the number of lines produced by the measured algorithm, and M0 is the number of lines produced by an optimal algorithm, i.e. the minimal possible number of lines of an algorithm that produces the same ISE. Merit is then defined as √(fidelity × efficiency). Rosin [8] proposed monotonicity and consistency measures of polygonization stability. Monotonicity measures how well the number of lines generated and the ISE produced by a given method follow a selected parameter of the method. In a number of dominant-point-based methods, the selected parameter is M, the desired number of lines. A perfectly monotone method is then supposed to always reduce ISE when M is increased, and monotonicity measures the amount of discrepancy in such behavior. Consistency measures the effect of deleting a certain number of pixels from the beginning of the curve on breakpoint positions, the number of breakpoints, ISE, merit, etc.
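The ISE and figure-of-merit quantities used throughout these evaluations can be computed directly from a curve and its polygonal approximation; the sketch below is a straightforward implementation of the definitions above (the break points are assumed to be given as indices into the curve, and M is taken here as the number of polygon edges):

```python
import numpy as np

def point_segment_dist(p, a, b):
    """Distance from point p to the line segment a-b."""
    ab, ap = b - a, p - a
    t = np.clip(np.dot(ap, ab) / (np.dot(ab, ab) + 1e-12), 0.0, 1.0)
    return np.linalg.norm(ap - t * ab)

def ise_and_fom(curve, break_idx, n=1):
    """Integral square error and (N/M)^n / ISE for a polygonal approximation.

    curve:     (N, 2) array of ordered curve points.
    break_idx: sorted indices of the polygon vertices along the curve.
    """
    curve = np.asarray(curve, dtype=float)
    ise = 0.0
    for s, e in zip(break_idx[:-1], break_idx[1:]):
        a, b = curve[s], curve[e]
        ise += sum(point_segment_dist(curve[i], a, b) ** 2 for i in range(s, e + 1))
    N, M = len(curve), len(break_idx) - 1
    fom = (N / M) ** n / ise if ise > 0 else float("inf")
    return ise, fom
```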
3 Measuring Linearity and Polygonization

Here, we will describe the measures of linearity of ordered sets, the monotonicity measure, and the two polygonization schemes which constitute the contributions of this article. The monotonicity measure is the only one not used independently in either linearity measurement or polygonization. It is combined with other measures to achieve usable results.

3.1 Linearity of Ordered Sets of Points and Monotonicity

A. Average Sorted Orientations

Average sorted orientations works much the same as its average orientations predecessor [13], but points are taken such that the second comes after the first in the ordered set. Also, the slopes are directly averaged, without any reference to the orientation line. Figure 1 shows that only the magnitude of the average ordered unit vectors determines the linearity measure. This makes this metric faster to compute than the average orientations protocol in [13].
Fig. 1. Average Sorted Orientations
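A direct implementation of the average sorted orientations measure might look as follows (a minimal sketch; the number of sampled pairs k and the random pairing strategy are illustrative assumptions):

```python
import numpy as np

def average_sorted_orientations(points, k=1000, seed=0):
    """Linearity of an ordered point set: the length of the average unit vector
    over randomly selected ordered pairs (i, j) with i < j.

    points: (N, 2) array of ordered points. Returns a value in [0, 1].
    """
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float)
    pairs = rng.integers(0, len(pts), size=(k, 2))
    pairs = pairs[pairs[:, 0] != pairs[:, 1]]            # discard degenerate pairs
    i, j = pairs.min(axis=1), pairs.max(axis=1)          # respect the point ordering
    v = pts[j] - pts[i]
    v /= np.linalg.norm(v, axis=1, keepdims=True) + 1e-12
    return float(np.linalg.norm(v.mean(axis=0)))         # magnitude of the mean unit vector
```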
B. Triangle Sides Ratio

Here, we take k ordered triplets of points and measure their average collinearity. For each triplet, we divide the length of the side of the triangle whose endpoints are furthest apart in the sorted point array by the sum of the lengths of the other two sides. Figure 2 shows that the red side of the triangle, AC, is the one whose endpoints are furthest apart in the ordered array. The length of the side AC is divided by the sum of the lengths of the other two blue sides.
Fig. 2. Triangle sides ratio example
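The triangle sides ratio measure translates almost line-for-line into code (a minimal sketch; the sample size k is an assumption):

```python
import numpy as np

def triangle_sides_ratio(points, k=1000, seed=0):
    """Average, over random ordered triplets i < j < l, of |P_i P_l| divided by
    |P_i P_j| + |P_j P_l|, giving a linearity value in (0, 1]."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float)
    ratios = []
    for _ in range(k):
        i, j, l = np.sort(rng.choice(len(pts), size=3, replace=False))
        far = np.linalg.norm(pts[l] - pts[i])             # endpoints furthest apart in the order
        near = np.linalg.norm(pts[j] - pts[i]) + np.linalg.norm(pts[l] - pts[j])
        ratios.append(far / (near + 1e-12))
    return float(np.mean(ratios))
```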
C. Monotonicity

Monotonicity measures the behavior of curves with respect to their orientation line. It is expected that monotonic curves define a more linear ordered set of points than non-monotonic curves. Here, we first find the orientation line of the set of points in which we are interested, and measure the monotonicity of this set relative to this line. Since
the moment function sometimes produces orientation lines which are 90 degrees offset from what is visually the actual orientation line, we repeat the entire procedure for angle=angle+π/2, and select the higher of the two measures. In Figure 3, we see that the curve on the left is completely monotone with respect to its blue orientation line, whereas the curve on the right is not monotone.
Fig. 3. Monotonicity examples
The algorithm works by taking N-4 pairs of points which are 4 positions apart. We chose to take points which are 4 positions apart in the array of ordered points, since points which were closer together did not represent the natural slope of the curve, due to digitization. A vector v is found for each pair of points. Each v is multiplied by the orientation line vector of the whole set of points via a dot product. If the dot product is positive, the sign s associated with the magnitude mag of v is positive; otherwise it is negative. The sum of all s·mag is divided by the sum of the absolute values of all mag to form a monotonicity value. Monotonicity is multiplied by each linearity measure to make combined metrics that measure the linearity of sorted point sets. Figure 4 shows a concrete yet simplified example of how the monotonicity algorithm works. We take only 4 pairs of points to capture the spirit of the method. The orientation line is seen below the figure. All of the dark blue arrows show pairs of points. The only pair that will have a negative dot product with the orientation line is the pair A5A6. The monotonicity value for these 4 pairs is the absolute value of (|A1A2|+|A3A4|−|A5A6|+|A7A8|) / (|A1A2|+|A3A4|+|A5A6|+|A7A8|). Note that the light blue orientation line should actually be placed right over the curve, but was shifted to the side to avoid complicating the figure.
Fig. 4. Monotonicity functional example
The ordered linearity values of a shape are the result of the monotonicity measure of that shape multiplied with one of the six linearity measures from [13] for unordered data.
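Putting the monotonicity description into code gives roughly the following (a minimal sketch; the moment-based orientation is computed as in section 2.1, and the 4-position spacing follows the text):

```python
import numpy as np

def orientation_angle(points):
    """Orientation of a point set from central moments (section 2.1)."""
    pts = np.asarray(points, dtype=float)
    xc, yc = pts.mean(axis=0)
    dx, dy = pts[:, 0] - xc, pts[:, 1] - yc
    mu11, mu20, mu02 = (dx * dy).sum(), (dx * dx).sum(), (dy * dy).sum()
    return 0.5 * np.arctan2(2 * mu11, mu20 - mu02)

def monotonicity(points, spacing=4):
    """Signed fraction of motion along the orientation line, in [0, 1]."""
    pts = np.asarray(points, dtype=float)
    base = orientation_angle(pts)
    best = 0.0
    for angle in (base, base + np.pi / 2):          # guard against the 90-degree ambiguity
        u = np.array([np.cos(angle), np.sin(angle)])
        v = pts[spacing:] - pts[:-spacing]          # vectors between points 4 positions apart
        mag = np.linalg.norm(v, axis=1)
        sign = np.sign(v @ u)
        value = abs(np.sum(sign * mag)) / (np.sum(mag) + 1e-12)
        best = max(best, value)
    return float(best)
```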
3.2 Polygonal Approximation of Curves

Both polygonization algorithms are designed to be invariant to rotation. This is achieved through a mechanism that always selects the same starting point. The starting point is chosen such that it encompasses the longest linear segment that does not violate the linearity threshold of that curve.

A. Linear Extension

This method uses a brute-force approach to polygonization, comparing the linearity of growing line segments to a threshold which depends on the length of the line segment and a linearity value. The threshold is in the form D(1-L)
int start = 1;
while start <> N do {
  start = polygonize(start, N, T, Points);
  memorize Points[start] as the next polygonization point;
}

procedure polygonize(start, end, T, Points) {
  if (start = end) then return start;
  mid = (start + end) / 2;
  find the linearity lin of the segment start to mid by one of the linearity methods;
  find the monotonicity mon of the segment start to mid;
  tempLinearity = lin * mon;
  dis = Euclidean distance between points start and mid;
  tempLinearity = dis * (1 - tempLinearity);
  if (tempLinearity <= T) {
    start = polygonize(mid, end, T, Points);
  } else {
    start = polygonize(start, mid, T, Points);
  }
  return start;
}
Fig. 5. Binary polygonization process
The whole process is seen in Figure 5. In Fig. 5a), we see that the starting point is the green dot labeled start, the ending point is the red dot labeled end, and the midpoint between them is represented by a blue dot, and labeled mp. The algorithm first checks the linearity value of segment start to MP. If this linearity value satisfies the threshold, the next breakpoint is sought after in the segment MP to end. In this example, we see that the segment Start to MP is not linear enough, and therefore, the binary algorithm further searches for a breakpoint here. This is seen in Figure 5b), where the new mid and end points are marked. By continuing the binary search, we find the break point in Figure 5c). The segment Start to Break point is stored as part of the polygonization. The polygonization algorithm will restart from this break point in Fig. 5d).
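For readers who prefer runnable code, the binary-search polygonization of Fig. 5 can be sketched in Python as follows (a sketch under assumptions: a simple endpoint-distance-over-path-length ratio stands in for the linearity*monotonicity measure, and any of the section 3.1 measures can be plugged in instead; T plays the role of the linearity*distance threshold described above):

```python
import numpy as np

def simple_linearity(seg):
    """Stand-in linearity in [0, 1]: endpoint distance divided by path length.
    Replaceable by e.g. triangle_sides_ratio(seg) * monotonicity(seg)."""
    path = np.sum(np.linalg.norm(np.diff(seg, axis=0), axis=1))
    return np.linalg.norm(seg[-1] - seg[0]) / (path + 1e-12)

def segment_score(points, s, m, linearity_fn=simple_linearity):
    """D * (1 - linearity) for the sub-curve points[s..m], as in the threshold test."""
    if m - s < 2:
        return 0.0                                   # too short: trivially linear
    seg = points[s:m + 1]
    dis = np.linalg.norm(points[m] - points[s])
    return dis * (1.0 - linearity_fn(seg))

def next_break(points, start, end, T):
    """Binary search for the farthest break point reachable from 'start' within threshold T."""
    if end - start <= 1:
        return end
    mid = (start + end) // 2
    if segment_score(points, start, mid) <= T:
        return next_break(points, mid, end, T)       # linear enough: look farther out
    return next_break(points, start, mid, T)         # otherwise: look for a closer break point

def binary_polygonize(points, T=1.0):
    points = np.asarray(points, dtype=float)
    breaks, start = [0], 0
    while start < len(points) - 1:
        start = next_break(points, start, len(points) - 1, T)
        breaks.append(start)
    return breaks
```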
4 Experimental Data

The testing of the algorithms was done on two sets. The linearity algorithms were tested on a set of 25 non-trivial shapes, shown in Figure 6. These shapes were assembled by hand and are meant to cover a wide variety of non-trivial curves. The polygonization algorithms were executed on a set of 21 standard test curves found in [8], shown in Figure 7. Each curve comprises between 100 and 500 points. Several linearity measurement algorithms are based on selecting k pairs or triplets of points from the set. When such a procedure is called frequently by a linear-search-based polygonization algorithm, speed may become a concern. We first added speedup procedures that determine the sample size k so that the linearity measure has sufficient accuracy. In applications such as polygonization, it may be more common to make a judgment on whether or not the set is linear; this provided a further speedup. Both speed improvements were made using basic probability theory for confidence intervals [2]. Table 1 shows the linearity testing results for the Average Sorted Orientations (ASO) and Triangle Sides Ratio (TSR) algorithms. It also shows the linearity measures for the six methods AO, EC, TH, TP, CS, and EAR. These were proposed in [13] for unsorted data sets, and were multiplied by the monotonicity values of each curve (computed with the algorithm described in this article) to make them applicable to the sorted data sets which are relevant here. The monotonicity values (MON) of each curve are given in the last column.
Fig. 6. Linearity test set
To achieve higher consistency (as defined by [8]) for both polygonization methods, and therefore rotation independence, the initial point has been selected by first searching for the longest line segment in the polygon. Starting from each point, the linearity measure was applied by linear search, extending the line as far as possible. The starting point of the longest such line segment was selected as the initial point.
The polygonization results of the binary polygonization method can be inspected visually in Figure 8. The binary search method was used here. The only parameter in this polygonization run is the linearity*distance threshold, which was set to T = 1; this provided similar values of M to those seen in the cited references. The linearity measure used to generate the polygonization was the eccentricity (EC) measure. The light red dot on each curve represents the starting point of the algorithm. Tables 2 and 3 show the polygonization results of the linear extension algorithm as tested on the curves in Figure 7. Table 2 shows the number of points in each polygonization, and Table 3 shows the measure N³/(M³·ISE), where N is the number of points in the original curve and M is the number of points in the approximation. This measure is taken from [3] for n = 3. It is selected for the following reason.
Fig. 7. Polygonization set
Fig. 8. Binary polygonization

Table 1. Linearity results
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
AO .95 .35 .34 .71 .01 .36 .10 .83 .35 .04 .05 .61 .23 .57 .09 .35 .05 .26 .59 .08 .44 1 .64 .04 .41
EC .99 .56 .47 .83 .02 .55 .12 .96 .47 .06 .09 .79 .36 .71 .16 .53 .07 .43 .73 .13 .62 1 .66 .22 .50
TH .85 .29 .25 .61 0 .24 .07 .73 .25 .02 .02 .50 .13 .44 .07 .35 .03 .28 .56 .06 .42 .99 .56 .24 .36
TP .97 .37 .31 .79 0 .32 .08 .85 .31 .03 .03 .64 .17 .53 .08 .39 .04 .36 .68 .08 .54 1 .61 .28 .42
CS .80 .23 .20 .50 0 .19 .06 .66 .20 .02 .01 .42 .11 .38 .05 .32 .02 .22 .49 .04 .34 .99 .53 .22 .32
EAR .93 .47 .40 .70 .02 .46 .10 .88 .39 .05 .08 .68 .29 .63 .14 .44 .05 .35 .61 .11 .52 1 .63 .20 .45
ASO .98 .71 .85 .90 .27 .82 .26 .94 .81 .45 .47 .89 .71 .91 .70 .79 .25 .76 .85 .61 .86 1 .85 .64 .91
TSR .99 .71 .91 .92 .18 .90 .49 .95 .83 .59 .25 .92 .79 .95 .71 .90 .26 .69 .91 .50 .82 1 .96 .69 .97
MON 1 .65 .54 .98 .16 .65 .13 .98 .69 .20 .16 .87 .53 .75 .32 .70 .09 .65 1 .52 .80 1 .67 .78 .51
Table 2. Linear polygonization - break points

Curve 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | av
N   366 413 377 352 482 317 484 474 384 372 674 290 376 366 483 427 329 326 450 386 419 | 407
AO  43 55 41 39 59 22 53 66 48 40 73 34 35 40 56 29 38 34 49 46 50 | 45
EC  18 20 12 13 13 6 18 33 24 15 23 8 7 18 23 14 14 13 21 12 32 | 17
TH  28 36 28 27 39 23 39 43 34 27 54 27 35 30 38 21 29 27 35 30 33 | 33
TP  22 35 17 19 22 13 35 44 32 19 43 18 28 28 35 20 25 23 35 29 37 | 28
CS  29 36 29 28 39 23 41 43 34 27 56 28 36 30 38 22 29 27 35 30 34 | 33
EAR 34 48 28 36 46 27 52 60 45 33 70 34 40 37 49 27 40 37 45 39 48 | 41.7
ASO 17 21 13 13 16 9 25 31 23 15 29 13 14 17 26 15 16 16 23 16 27 | 19
TSR 31 40 31 35 40 25 44 37 36 31 59 30 29 35 52 33 32 33 38 34 40 | 36
Our goal is to design and evaluate polygonization methods that are invariant to curve scaling and/or the resolution applied. The curve size is described by a single parameter N, the number of pixels in a given digitization. Distances are measured in this 'pixel' geometry, where the distance between two neighboring pixels in the horizontal/vertical direction is 1. Consider two digitizations of the same curve, with N1=N and N2=mN pixels, respectively. Assuming that polygonization is invariant to scaling, they should yield the same M, the number of selected polygon vertices, and these vertices should be at corresponding positions with respect to the curve.

Table 3. Linear polygonization measure N³/(M³·ISE)
Curve 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | av
N   366 413 377 352 482 317 484 474 384 372 674 290 376 366 483 427 329 326 450 386 419 | 407
AO  1.42 0.89 0.34 0.99 0.09 3.51 0.90 0.00 0.68 1.55 0.69 1.89 2.12 1.45 0.66 0.85 1.02 0.93 0.59 0.97 0.81 | 1.06
EC  3.08 1.93 2.72 1.64 1.10 2.32 3.46 0.92 1.29 1.72 1.80 3.87 5.52 1.52 0.64 2.67 2.63 2.67 0.82 3.19 1.76 | 2.25
TH  1.07 1.43 0.00 2.44 2.09 3.72 1.74 0.00 1.28 2.05 1.13 4.38 1.74 1.14 1.68 2.90 1.19 3.03 1.84 3.28 1.40 | 1.88
TP  1.06 1.29 1.02 0.70 1.75 6.96 1.40 0.66 0.44 1.40 1.30 3.26 2.20 2.27 0.97 1.79 1.94 2.44 0.00 1.72 1.05 | 1.70
CS  1.06 1.43 0.00 2.57 2.13 3.78 1.46 0.91 1.36 1.93 1.09 2.86 0.00 1.09 1.60 2.81 1.19 2.92 1.80 3.14 0.00 | 1.67
EAR 0.94 0.16 0.85 1.80 0.00 0.00 0.00 0.00 0.82 0.00 0.00 0.00 2.04 1.29 0.97 1.90 1.18 1.23 0.54 1.49 0.88 | 0.77
ASO 2.41 2.34 3.31 3.07 0.00 4.33 1.93 1.00 1.33 3.95 1.47 4.52 5.27 0.35 1.31 2.93 1.92 0.00 2.25 5.14 1.45 | 2.39
TSR .00 .45 0 .23 .18 0 .27 .39 .15 .31 .15 .45 .41 .32 .21 0 .30 .86 0 .36 0 | .24
The distances of corresponding pixels in the two digitizations are then e and me, respectively. Then ISE₂ ≈ Σ_{i=1}^{mN} (me)² ≈ m³·ISE₁. Thus N₂³/(Mᵗ·ISE₂) ≈ (mN)³/(Mᵗ·m³·ISE₁) = N₁³/(Mᵗ·ISE₁). This means that the measure N³/(Mᵗ·ISE) is invariant to curve scaling or resolution, for any t. A similar analysis also shows that some popularly used evaluation measures are not scalable, and that those evaluation measures depend on the number of pixels taken in a particular digitization.

Table 4. Binary polygonization - break points
Curve 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | av
N   366 413 377 352 482 317 484 474 384 372 674 290 376 366 483 427 329 326 450 386 419 | 407
AO  51 81 48 53 68 33 79 87 68 51 94 45 39 55 82 38 46 50 71 61 80 | 61
EC  18 20 12 13 13 6 18 33 24 15 23 8 7 18 23 14 14 13 21 12 32 | 17
TH  29 36 25 25 42 26 44 46 30 33 70 24 32 35 38 25 32 29 33 27 39 | 34
TP  26 34 24 24 28 15 38 50 40 23 46 25 30 27 39 26 25 28 38 26 46 | 31
CS  29 36 29 30 42 31 40 45 34 33 73 28 37 35 39 26 32 30 41 31 38 | 36
EAR 40 54 36 40 44 33 60 71 54 38 79 39 49 49 64 38 45 40 55 46 58 | 49.1
ASO 18 21 18 16 22 11 25 35 25 20 31 14 18 21 26 17 15 17 21 14 29 | 21
TSR 42 53 43 55 60 38 58 53 51 43 79 39 38 53 70 44 43 46 55 48 52 | 51
Table 5. Binary polygonization measure N³/(M³·ISE)

Curve 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | av
N   366 413 377 352 482 317 484 474 384 372 674 290 376 366 483 427 329 326 450 386 419 | 407
AO  0.11 0.32 0.09 0.37 0.05 0.11 0.20 0.28 0.21 0.46 0.41 0.69 0.43 0.17 0.20 0.67 0.75 0.27 0.19 0.09 0.20 | 0.30
EC  3.08 1.93 2.72 1.64 1.10 2.32 3.46 0.92 1.29 1.72 1.80 3.87 5.52 1.52 0.64 2.67 2.63 2.67 0.82 3.19 1.76 | 2.25
TH  2.40 0.70 1.54 0.20 0.41 0.60 0.31 0.30 0.19 0.88 0.28 1.52 0.10 0.58 0.58 2.25 0.21 1.90 0.26 0.20 0.77 | 0.77
TP  1.27 1.85 0.43 1.13 0.89 2.59 1.07 0.64 0.53 2.21 1.21 1.26 0.81 0.57 1.80 1.83 1.12 0.82 1.05 0.42 0.47 | 1.14
CS  0.95 0.83 0.52 1.08 0.33 1.49 0.13 0.38 0.67 0.48 0.35 1.65 0.28 0.39 0.54 2.03 0.06 1.76 0.97 1.65 0.79 | 0.82
EAR 1.48 1.46 1.81 1.44 1.36 4.24 1.21 0.50 0.68 1.07 0.35 1.75 0.72 0.50 1.03 0.76 2.15 0.90 0.36 0.15 0.47 | 1.16
ASO 2.13 1.96 1.59 0.96 1.99 0.73 1.98 0.84 0.73 0.48 0.71 0.95 1.56 1.06 1.61 6.20 2.95 1.92 1.39 3.98 1.21 | 1.76
TSR 0.37 0.17 0.14 0.17 0.19 0.59 0.40 0.17 0.19 0.25 0.31 0.41 0.18 0.72 0.44 0.09 0.61 0.46 0.29 0.56 0.28 | 0.33
Measures which are not invariant to the number of pixels on the curve include FOM [10], which is used very frequently (it behaves as the inverse of M²), and (N/M)ⁿ/ISE [3] for n≠3. Tables 4 and 5 show the results of the binary polygonization algorithm as tested on the curves in Figure 7. They follow the format and meaning of the previous two tables. The best algorithm for polygonization according to the N³/(M³·ISE) measure was the linear polygonization method combined with the average sorted orientations linearity metric. The closest competitor was binary polygonization combined with the eccentricity linearity metric.
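For reference, the scale-invariant evaluation measure can be computed directly from a digitized curve and its polygonal approximation, as sketched below. The sketch assumes the curve is given as a list of pixel coordinates and that ISE is the sum of squared distances from curve points to the approximating segments; it illustrates the measure, not the authors' evaluation code.

```python
import numpy as np

def point_segment_dist(p, a, b):
    """Euclidean distance from point p to segment ab."""
    p = np.asarray(p, float); a = np.asarray(a, float); b = np.asarray(b, float)
    ab = b - a
    denom = np.dot(ab, ab)
    t = 0.0 if denom == 0 else np.clip(np.dot(p - a, ab) / denom, 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def scale_invariant_measure(curve, vertex_idx, t=3):
    """N^t-style measure N^3 / (M^t * ISE) for a curve (list of pixel coords) and
    the indices of the selected polygon vertices (assumed to include the first
    and last curve point)."""
    N, M = len(curve), len(vertex_idx)
    ise = 0.0
    for k in range(M - 1):
        i, j = vertex_idx[k], vertex_idx[k + 1]
        a, b = curve[i], curve[j]
        for p in curve[i:j + 1]:
            ise += point_segment_dist(p, a, b) ** 2   # integral squared error
    return N ** 3 / (M ** t * ise) if ise > 0 else float('inf')
```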
5 Conclusion

Since the values of N were not reported in the given references, and the measures used there were either not scalable or required further information about the parameters used in testing, a direct comparison of our polygonization schemes with other existing methods was left for future work. This is part of our plan for further extensions of this work in several directions. We will study new linearity measures and new evaluation measures for polygonization, including new ways to evaluate ISE. We will then be in a position to properly compare some of the existing methods and our new ones, using scale-invariant measures.
References

1. Csetverikov, D.: Basic algorithms for digital image analysis, Course, Institute of Informatics, Eotvos Lorand University, visual.ipan.sztaki.hu
2. Hogg, R.V., Tanis, E.A.: Probability and Statistical Inference. Prentice Hall, Englewood Cliffs (1997)
3. Marji, M., Siy, P.: Polygonal representation of digital planar curves through dominant point detection - a nonparametric algorithm. Pattern Recognition 37, 2113-2130 (2004)
4. Rosin, P.: Measuring sigmoidality. In: Petkov, N., Westenberg, M.A. (eds.) CAIP 2003. LNCS, vol. 2756, pp. 410-417. Springer, Heidelberg (2003)
5. Rosin, P.: Measuring rectangularity. Machine Vision and Applications 11, 191-196 (1999)
6. Rosin, P.: Measuring shape: ellipticity, rectangularity, and triangularity. Machine Vision and Applications 14, 172-184 (2003)
7. Rosin, P.: Techniques for assessing polygonal approximations of curves. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(6), 659-666 (1997)
8. Rosin, P.: Assessing the behavior of polygonal approximation algorithms. Pattern Recognition 36, 505-518 (2003)
9. Ray, B., Ray, S.: An algorithm for polygonal approximation of digitized curves. Pattern Recognition Letters 13, 489-496 (1992)
10. Sarkar, D.: A simple algorithm for detection of significant vertices for polygonal approximation of chain-coded curves. Pattern Recognition Letters 14, 959-964 (1993)
11. Sarfraz, M., Asim, M.R., Masood, A.: Piecewise polygonal approximation of digital curves. In: Proceedings of the Eighth International Conference on Information Visualization, vol. 14, pp. 991-996 (2004)
12. Sonka, M., Hlavac, V., Boyle, R.: Image Processing, Analysis, and Machine Vision. Chapman & Hall (1993)
13. Stojmenovic, M., Nayak, A., Zunic, J.: Measuring linearity of a finite set of points. In: IEEE International Conference on Cybernetics and Intelligent Systems (CIS), Bangkok, Thailand, June 7-9, pp. 222-227 (2006)
14. Teh, C., Chin, R.: On the detection of dominant points on digital curves. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(8), 859-872 (1989)
15. Ventura, J., Chen, J.M.: Segmentation of two-dimensional curve contours. Pattern Recognition 25(10), 1129-1140 (1992)
16. Zunic, J., Rosin, P.: Rectilinearity measurements for polygons. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(9), 1193-1200 (2003)
17. Zunic, J., Rosin, P.: A new convexity measure for polygons. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(7), 923-934 (2004)
Real-Time Color Image Watermarking Based on D-SVD Scheme

Cheng-Fa Tsai and Wen-Yi Yang

Department of Management Information Systems, National Pingtung University of Science and Technology, 91201 Pingtung, Taiwan
[email protected]
Abstract. This investigation presents a robust digital watermarking scheme for copyright protection, called D-SVD. The proposed scheme integrates the discrete cosine transform (DCT) with singular value decomposition (SVD). In contrast to traditional DCT-based watermarking schemes, in which watermark messages are embedded directly in the DCT frequency coefficients, in the proposed approach each watermark message bit is embedded in the singular values of the DCT coefficient blocks of the original color image. Experimental results demonstrate that the quality of the watermarked image is robust under compression, noise, filtering and various other attacks. In addition, it is observed that the proposed D-SVD algorithm can obtain larger NC and PSNR values than some existing well-known methods, and can successfully resist attacks such as cropping, blurring, reshaping, adding noise and JPEG compression. Keywords: digital watermarking, copyright protection, image processing.
1 Introduction
Digital watermarking algorithms are generally developed to verify authorship or to protect copyright [1], [4]-[18]. Numerous watermarking schemes that embed the watermark into transformed frequency coefficients, such as DCT [1], [3], [13], [16] and wavelet [2], [18] coefficients, have been presented recently, due to robustness considerations. Unlike such conventional approaches, which operate directly on the transformed coefficients, this study proposes a new method that embeds the watermark into the singular values of the DCT coefficient blocks of the original color image, and does not require the original image when performing the watermark extraction procedure. The proposed scheme first calculates the DCT frequency coefficients of the luminance of the original color image, and then decomposes the coefficient blocks to compute their singular values. The watermark bits are then embedded in the singular values. Experimental results indicate that the presented technique produces a high-quality watermarked image and a robust embedded watermark.
Fig. 1. The proposed watermark embedding process
2 Embedding Process
As illustrated in Fig. 1, the proposed process embeds an n × n watermark image W into an N × N original image H, using secret keys (key) for security. The steps of the watermark embedding process are as follows:
Step 1: Transform the RGB color space of the original image H to the YUV color space. The Y (luminance) component is selected from YUV for the DCT frequency transformation, since the luminance has the same range as a gray image, between 0 and 255. The luminance component Y of the original image is split into non-overlapping 8 × 8 or 4 × 4 pixel blocks B_k, where k = 1, 2, 3, ..., M, and M is the number of blocks. Each block B_k is then individually transformed to frequency coefficients using the two-dimensional DCT.
Step 2: Compute the singular values of each frequency coefficient block B_k by singular value decomposition.
Step 3: Compute Ns = ‖s‖ + 1 and D = ⌊Ns/d_k⌋, where s = (λ₁ᵏ, λ₂ᵏ, ..., λ_Nᵏ) denotes the vector formed from the singular values of each block B_k, and d_k is the quantization level used to quantize Ns for B_k.
Step 4: Shuffle the watermark image for embedding, to prevent tampering with the watermark form. The proposed approach employs the Torus automorphism, an image permutation mapping function, to break up the watermark image randomly. The number of Torus permutation transformations applied to the watermark image must be kept as a secret key for the extracting process.
Step 5: Embed each bit b_i (0 or 1) of the watermark image by adjusting the integer number D: if b_i = 0, then D is made odd, while if b_i = 1, then D is made even.
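A per-block sketch of Steps 1-6 follows, under the reading that Ns is the norm of the singular-value vector plus one and that D is the integer quotient of Ns and the quantization level d_k; the use of scipy for the 2-D DCT and the parameter names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.fft import dctn, idctn

def embed_bit_in_block(block, bit, dk):
    """Embed one watermark bit into one luminance block (D-SVD idea as read
    from Steps 2-6; the norm-based interpretation of Ns is an assumption)."""
    C = dctn(block.astype(float), norm='ortho')      # 2-D DCT of the block
    U, s, Vt = np.linalg.svd(C)                      # singular values of the DCT block
    Ns = np.linalg.norm(s) + 1.0
    D = int(Ns // dk)                                # quantization index
    # Force the parity of D to carry the bit: odd for 0, even for 1.
    if (bit == 0 and D % 2 == 0) or (bit == 1 and D % 2 == 1):
        D += 1
    Ns_new = dk * D + dk / 2.0                       # re-quantized value (Step 6)
    s_new = s * (Ns_new / Ns)                        # rescaled singular values
    C_new = U @ np.diag(s_new) @ Vt                  # rebuild the DCT coefficients
    return idctn(C_new, norm='ortho')                # back to the luminance domain
```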
Step 6: Compute N′s = d_k × D + d_k/2 and the modified singular values (δ₁ᵏ, δ₂ᵏ, ..., δ_Nᵏ), where (δ₁ᵏ, δ₂ᵏ, ..., δ_Nᵏ) = (λ₁ᵏ, λ₂ᵏ, ..., λ_Nᵏ) × (N′s/Ns).
Step 7: Compute the watermarked block from the modified singular values (δ₁ᵏ, δ₂ᵏ, ..., δ_Nᵏ).
Step 8: Inversely transform each watermarked block to the luminance image domain using the IDCT.
Step 9: Reconfigure the watermarked image from all blocks and convert back to the RGB color space.
3 Extracting Process
Fig. 2 displays the extracting process utilized in the proposed scheme. The steps for extracting the watermark from the watermarked image are as follows:
Step 1: Transform the watermarked image Y_w to the YUV color space, and then split the luminance component Y into non-overlapping 8 × 8 or 4 × 4 pixel blocks B̃_k.
Step 2: Transform each block B̃_k to frequency coefficients by the two-dimensional DCT, and then compute the singular values of each frequency coefficient block B̃_k using singular value decomposition.
Step 3: Compute Ñs = ‖s̃‖ + 1 and D̃ = ⌊Ñs/d_k⌋, where s̃ = (δ₁ᵏ, δ₂ᵏ, ..., δ_Nᵏ) is the vector formed from the singular values of each block B̃_k.
Step 4: If D̃ is odd, then the embedded watermark bit is 0; otherwise, if D̃ is even, the embedded bit is 1.
Step 5: Reshape the randomized watermark W̃ by performing the Torus permutation transformation once, reconfiguring the recovered watermark using the appropriate secret key value.
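A matching per-block extraction sketch, under the same assumptions as the embedding sketch above (the parity of the quantization index carries the bit):

```python
import numpy as np
from scipy.fft import dctn

def extract_bit_from_block(block, dk):
    """Recover one watermark bit from a watermarked luminance block
    (Steps 1-4 of the extraction process; same interpretation as the embedding sketch)."""
    C = dctn(block.astype(float), norm='ortho')
    s = np.linalg.svd(C, compute_uv=False)           # singular values only
    Ns = np.linalg.norm(s) + 1.0
    D = int(Ns // dk)
    return 0 if D % 2 == 1 else 1                    # odd -> bit 0, even -> bit 1
```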
Fig. 2. The proposed watermark extracting process
Fig. 3. Comparison of the proposed D-SVD algorithm with Wang's and Liu's methods in terms of the extracted-watermark NC value and the PSNR of the watermarked image under JPEG attacks with different compression qualities (using Lena as the test image)
Fig. 4. Comparison of the proposed D-SVD algorithm with Wang's method in terms of the extracted-watermark NC value and the PSNR of the watermarked image under blurring attacks (using Lena as the test image)
Fig. 5. The extracted watermarks after three successive blurring attacks, using the proposed D-SVD approach and Wang's method [3]
Fig. 6. Comparison of the proposed D-SVD algorithm with Wang's method [3] in terms of the extracted-watermark NC value and the PSNR of the watermarked image under noise attacks (using Lena as the test image and Salt & Pepper as the noise)
Fig. 7. The extracted watermarks after six successive noise-adding attacks, using the proposed D-SVD approach and Wang's method [3]
4 Simulation Results
An experiment was performed using an original image of size 512 × 512 in true color and a binary watermark image of size 64 × 64. The experiment was performed on a personal computer with an Intel Pentium 4 3-GHz CPU and 512 MB RAM. To verify the performance of the proposed D-SVD algorithm, several experiments were conducted to compare it with Wang's and Liu's methods [3], [15]. Figs. 3-7 show some of the experimental results. Fig. 3 presents the comparison of the proposed D-SVD algorithm with Wang's and Liu's methods in terms of the extracted-watermark NC (Normalized Correlation) value and the PSNR (Peak Signal-to-Noise Ratio) of the watermarked image under JPEG attacks with different compression qualities (using Lena as the test image). The human eye can generally judge the quality of a processed image, but not very objectively. Some objective measures are available to verify the quality of a compressed image; for instance, the mean square error (MSE), the signal-to-noise ratio (SNR) and the peak signal-to-noise ratio (PSNR) are generally used, and are formulated as follows:

MSE = (1/(M×N)) Σ_{x=0}^{M−1} Σ_{y=0}^{N−1} [Î(x,y) − I(x,y)]²    (1)

SNR = 10 log₁₀ ( (1/(M×N)) Σ_{x=0}^{M−1} Σ_{y=0}^{N−1} [Î(x,y)]² / MSE )    (2)

PSNR = 10 log₁₀ (255² / MSE)    (3)
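The three quality measures of Eqs. (1)-(3) translate directly into code; a sketch for 8-bit images follows.

```python
import numpy as np

def mse(original, processed):
    """Mean square error between an original image I and a processed image, Eq. (1)."""
    diff = processed.astype(float) - original.astype(float)
    return np.mean(diff ** 2)

def snr(original, processed):
    """Signal-to-noise ratio in dB, Eq. (2)."""
    return 10 * np.log10(np.mean(processed.astype(float) ** 2) / mse(original, processed))

def psnr(original, processed):
    """Peak signal-to-noise ratio in dB for 8-bit images, Eq. (3)."""
    return 10 * np.log10(255.0 ** 2 / mse(original, processed))
```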
Fig. 8. The extracted watermarks after cropping attacks with different percentages (25%, 50%, and 75%) using the proposed D-SVD approach (cropping 25%: PSNR=11.246 dB, NC=0.8999; cropping 50%: PSNR=7.5894 dB, NC=0.7969; cropping 75%: PSNR=5.692 dB, NC=0.6919)
Additionally, the Normalized Correlation (NC) is used to evaluate the degree of similarity between the original watermark and the extracted watermark. Fig. 4 shows the comparison of the proposed D-SVD algorithm with Wang's method in terms of the extracted-watermark NC value and the PSNR of the watermarked image under attack by
blurring (using Lena as the test image). Fig. 5 illustrates the extracted watermarks after three successive blurring attacks using the proposed D-SVD approach and Wang's method [3]. Fig. 6 displays the result after an attack by adding noise. Fig. 8 depicts the extracted watermarks after cropping attacks with different percentages (25%, 50%, and 75%) using the proposed D-SVD approach (notably, cropping 25%: PSNR=11.246 dB, NC=0.8999; cropping 50%: PSNR=7.5894 dB, NC=0.7969; cropping 75%: PSNR=5.692 dB, NC=0.6919). It is found that the watermark extracted after a high-percentage cropping attack using the D-SVD approach is still clear. However, it is difficult to recognize the watermark after a high-percentage cropping attack using Wang's method.
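The NC formula itself is not given in this excerpt; a commonly used normalized correlation for watermark similarity is sketched below as an assumption, not necessarily the exact definition used by the authors.

```python
import numpy as np

def normalized_correlation(w, w_extracted):
    """A commonly used NC definition (an assumption, since the paper's exact
    formula is not given here): correlation of the two watermark arrays
    normalized by the energy of the original watermark."""
    w = np.asarray(w, dtype=float)
    w_extracted = np.asarray(w_extracted, dtype=float)
    return float(np.sum(w * w_extracted) / np.sum(w * w))
```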
Fig. 9. The extracted watermarks after a reshaping attack using the proposed D-SVD approach and Wang's method [3]: (a) F16 used as the reshaping test image, (b) result of the D-SVD algorithm, (c) result of Wang's method
Fig. 9 shows the watermarks extracted after a reshaping attack using the proposed D-SVD approach (PSNR=20.4762, NC=0.6895) and Wang's method [3] (PSNR=20.3242, NC=0.66040); the F16 airplane image was used as the reshaping test image. From Figs. 3-9, it is observed that the proposed D-SVD algorithm can obtain larger NC and PSNR values than Wang's and Liu's methods, and can successfully resist attacks such as cropping, blurring, reshaping, adding noise and JPEG compression. Hence, the proposed image watermarking technique is robust and imperceptible.
5 Conclusion
This investigation presents a new scheme for embedding digital watermark into a color image. The embedding and extracting processes are based on D-SVD. Experimental results reveal that the proposed method can successfully resist attacks such as cropping, blurring, reshaping, adding noise and JPEG compression.
Therefore, the proposed image watermarking technique is robust and imperceptible. Additionally, the proposed approach can retrieve the embedded information without accessing the original image. Experimental results demonstrate that the proposed schemes outperform Wang’s and Liu’s schemes. Acknowledgments. The author would like to thank the National Science Council of Republic of China for financially supporting this research under contract no. NSC 94-2213-E-020-002.
References

1. Narges, A., Reza, S.: A Novel DCT-based Approach for Secure Color Image Watermarking. In: IEEE Computer Society Proceedings of the International Conference on Information Technology: Coding and Computing (2004)
2. Paul, B., Ma, X.-H.: Image Adaptive Watermarking Using Wavelet Domain Singular Value Decomposition. IEEE Transactions on Circuits and Systems for Video Technology 15, 96-102 (2005)
3. Wang, Y., Pearmain, A.: Blind image data based on self reference. Pattern Recognition Letters, 1681-1689 (2004)
4. Chen, L.-H., Lin, J.-J.: Mean quantization based image watermarking. Image and Vision Computing, 717-727 (2003)
5. Chung, K.-L., Shen, C.-H., Chang, L.-C.: A novel SVD and VQ-based image hiding scheme. Pattern Recognition Letters, 1051-1058 (2001)
6. Chu, W.-C.: DCT-Based Image Watermarking Using Sub-sampling. IEEE Transactions on Multimedia 5, 34-38 (2003)
7. Voyatzis, G., Pitas, I.: Chaotic Mixing of Digital Images and Applications to Watermarking. In: Proceedings of the European Conference on Multimedia Applications, Services and Techniques, vol. 2, Belgium (1996)
8. Hsu, C.-T., Wu, J.-L.: Hidden Digital Watermarks in Images. IEEE Transactions on Image Processing 8 (1999)
9. Huang, F., Guan, Z.-H.: A hybrid SVD-DCT watermarking method based on LPSNR. Pattern Recognition Letters 25, 1769-1775 (2004)
10. Hwang, M.-S., Chang, C.-C., Hwang, K.-F.: A Watermarking Technique Based on One-way Hash Functions. IEEE Transactions on Consumer Electronics 45, 286-294 (1999)
11. Huang, J., Shi, Y.-Q., Shi, Y.: Embedding Image Watermarks in DC Components. IEEE Transactions on Circuits and Systems for Video Technology 10, 974-979 (2000)
12. Cox, I.-J., Kilian, J., Leighton, F.-T., Shamoon, T.: Secure spread spectrum watermarking for multimedia. IEEE Transactions on Image Processing 6, 1673-1687 (1997)
13. Lin, S.-F., Chen, C.-F.: A Robust DCT-Based Watermarking for Copyright Protection. IEEE Transactions on Consumer Electronics 46, 415-421 (2000)
14. Li, C.-T.: Digital fragile watermarking scheme for authentication of JPEG images. IEEE Image Signal Processing 151, 460-466 (2004)
15. Liu, R., Tan, T.: An SVD-Based Watermarking Scheme for Protecting Right Ownership. IEEE Transactions on Multimedia 4, 121-128 (2002)
16. Li, X.-Q., Xue, X.-Y.: Improved Robust Watermarking in DCT Domain for Color Images. In: IEEE Computer Society Proceedings of the 18th International Conference on Advanced Information Networking and Applications (2004)
17. Shih, F.-Y., Wu, Y.-T.: Enhancement of image watermark retrieval based on genetic algorithms. Journal of Visual Communication and Image Representation, 115-133 (2005)
18. Tsai, M.-J., Yu, K.-Y., Chen, Y.-Z.: Joint Wavelet and Spatial Transformation for Digital Watermarking. IEEE Transactions on Consumer Electronics 46, 241-245 (2000)
Recognizing Human Iris by Modified Empirical Mode Decomposition

Jen-Chun Lee¹, Ping S. Huang², Te-Ming Tu¹, and Chien-Ping Chang¹

¹ Department of Electrical and Electronic Engineering, Institute of Technology, National Defense University, Taoyuan, Taiwan, Republic of China
² Department of Electronic Engineering, Ming Chuan University, Taoyuan, Taiwan, Republic of China
{i923002, cpchang, tutm}@yahoo.com.tw
Abstract. With the increasing needs in security systems, iris recognition is reliable as one important solution for biometrics-based identification systems. Empirical Mode Decomposition (EMD), a multi-resolution decomposition technique, is adaptive and appears to be suitable for non-linear, non-stationary data analysis. This paper presents an effective approach for iris recognition using the proposed scheme of Modified Empirical Mode Decomposition (MEMD) to analyze the iris signals locally. Since MEMD is a fully data-driven method without using any pre-determined filter or wavelet function, MEMD is used as a low-pass filter to extract the iris features for iris recognition. To verify the efficacy of the proposed approach, three different similarity measures are evaluated. Experimental results show that those three metrics have achieved promising and similar performance. Therefore, the proposed method demonstrates to be feasible for iris recognition and MEMD is suitable for feature extraction. Keywords: Biometrics, iris recognition, Empirical Mode Decomposition (EMD), multi-resolution decomposition.
1 Introduction

Biometrics is inherently a more reliable and capable technique for authenticating a person's identity from his or her own physiological or behavioral characteristics. The features used for personal identification by current biometric applications include facial features, fingerprints, iris, palm-prints, retina, handwritten signature, DNA, gait, etc. [1], [2], and the lowest error recognition rate is achieved by iris recognition [3]. With the increasing interest, more and more researchers have turned their attention to the field of iris recognition. Recent iris recognition approaches can be roughly divided into four categories: phase-based approaches [4], zero-crossing representation [5], texture analysis [6], [7], and intensity variation analysis [8], [9]. Daugman's algorithm [4] adopted 2D Gabor filters to demodulate the phase information of the iris. Each phase structure is quantized to one of the four quadrants in the complex plane. The Hamming distance is then used to compare the 2048-bit iris codes. Boles and Boashash [5]
proposed the zero-crossing of 1D wavelet transform to represent distinct levels of a concentric circle for an iris image, and then two dissimilarity functions were used for matching. Wildes et al. [6] analyzed the iris texture using the Laplacian pyramids to combine features from four different resolutions. Normalized correlation is selected to decide whether the input image and the enrolled image belong to the same class. L. Ma et al. [8], [9] proposed a local intensity variation analysis-based method and adopted the Gaussian-Hermite moments [8] and dyadic wavelet [9] to characterize the iris image for recognition. Feature extraction is a crucial processing stage for pattern recognition. Traditionally, basis decomposition techniques such as Fourier decomposition or Wavelet decomposition are selected to analyze real world signals [10]. Also, Fourier and Wavelet descriptors have long been used as powerful tools for feature extraction [10], [11], [12]. However, the main drawback of those approaches is that the basis functions are fixed, and do not necessarily match varying nature of signals. The Empirical Mode Decomposition (EMD) was firstly proposed by Huang et al. [13] for analyzing nonlinear and non-stationary time series. Any complicated data set can be decomposed into a finite and often small number of intrinsic mode function (IMF) components representing the data features. Those extracted components can match the signal itself very well. Motivated by that EMD provides a decomposition method to analyze the signal locally and separate the component holding locally the highest frequency from the rest into a separate IMF, in this paper, EMD technique is modified and refined to extract distinguishable features from iris images, called Modified Empirical Mode Decomposition (MEMD). There are two merits for using MEMD to extract features for iris recognition. First, MEMD is a fully data driven method without using any pre-determined filter [8], wavelet function or Fourier-wavelet basis [12]. Second, MEMD can be easily implemented, the matching time is greatly reduced and the achieved recognition rate is better than the method using EMD for feature extraction. Therefore, the proposed MEMD approach is used to extract residual components from iris images as features for recognition. This paper is organized as follows. Section 2 introduces preprocessing procedures for iris images. Section 3 and Section 4 describe the details of our proposed approach for feature extraction and matching. Experimental results are demonstrated and discussed in Section 5, prior to Conclusions in Section 6.
2 Iris Image Preprocessing

To ensure that correct iris features can be easily extracted from the eye image, it is essential to perform preprocessing on eye images. The human iris is an annular portion between the pupil (inner boundary) and the sclera (outer boundary). The image preprocessing procedure to extract the iris from the eye image operates in three steps. The first step is to locate the iris area. Then, the located iris is normalized to a rectangular window of a fixed size in order to achieve approximate scale invariance. Finally, illumination and contrast problems are eliminated from the normalized image by image enhancement, and the most irrelevant parts (such as eyelids and eyelashes) are removed from the normalized image as much as possible.
2.1 Locating the Iris Area

In an iris recognition system, iris location is an essential step. Here, we propose a method for iris location based on Thales' theorem: the diameter of a circle always subtends a right angle to any point on the circle's circumference. Fig. 1 shows how Thales' theorem is applied to find the inner and outer boundary of the iris. The iris location method is not detailed here, because it is not the focus of this paper; we sum up the main points as follows. Firstly, the dilation and erosion basic morphological operators are used in order to remove the influence of illumination inside the pupil. Then, a point inside the pupil is found using the method of the minimum local block mean, since the pupil area gives the minimum average gray value in the eye image. Thus, the target point P₀ inside the pupil is found and its coordinates can be computed. Secondly, we rely on the target point and cooperate with the specialized boundary detection mask (SBDM) to locate three points (P₁, P₂, P₃ and P₄, P₅, P₆) along the inner and outer iris boundaries, respectively. The SBDM is constructed as an a × b matrix, as shown in Fig. 2. During processing, each pixel (x, y) in the search range is considered as the center of the SBDM and the corresponding edge intensity is calculated by

e = Σ_{i=x−(b−1)/2}^{x+(b−1)/2} Σ_{j=y−(a−1)/2}^{y+(a−1)/2} f(i, j) · w(i, j),    (1)
where f(i, j) represents the pixel value in the image and w(i, j) is a weighting value of the SBDM. Within the search range, the boundary point appears at the position with the largest edge intensity variation value. Finally, we apply Thales' theorem to calculate the circle parameters, namely the circle centers (P_p and P_i) and the radii (R_p and R_i).
Fig. 1. Thales' theorem is applied to find (a) the inner and (b) the outer boundary of iris
 1  1  1  1  1   0  -1  -1  -1  -1  -1
 1  1  1  1  1   0  -1  -1  -1  -1  -1
 1  1  1  1  1   0  -1  -1  -1  -1  -1

Fig. 2. An example of the SBDM (3x11)
2.2 Iris Normalization
The pupil dilates or constricts when eye images are captured with flash light or in dark conditions. To achieve more accurate recognition performance, it is necessary to compensate for such deformation. Before using the proposed method for iris recognition, the iris image must be normalized, so that the representation is common to all images and has similar dimensions. The normalization process involves unwrapping the iris and converting it into its equivalent polar coordinates. We transform the circular iris area into a rectangular block using Daugman's rubber sheet model [14]. The pupil center is considered as the reference point and a remapping formula is used to convert the points from Cartesian to polar coordinates. In our experiments, the radial resolution and the angular resolution are set to 64 and 512 pixels, respectively.
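A minimal sketch of the rubber-sheet unwrapping described above, assuming the pupil and iris circles are given as (center x, center y, radius) tuples and using nearest-neighbour sampling; the 64 x 512 resolution matches the values quoted in the text, while the sampling details are simplifying assumptions.

```python
import numpy as np

def rubber_sheet_normalize(image, pupil, iris, radial_res=64, angular_res=512):
    """Unwrap the annular iris region into a radial_res x angular_res rectangle
    (Daugman's rubber sheet idea; pupil/iris are (cx, cy, r) circles)."""
    out = np.zeros((radial_res, angular_res), dtype=image.dtype)
    px, py, pr = pupil
    ix, iy, ir = iris
    for j in range(angular_res):
        theta = 2.0 * np.pi * j / angular_res
        # Boundary points on the pupil and iris circles for this angle.
        x0, y0 = px + pr * np.cos(theta), py + pr * np.sin(theta)
        x1, y1 = ix + ir * np.cos(theta), iy + ir * np.sin(theta)
        for i in range(radial_res):
            r = i / (radial_res - 1.0)
            xi = int(np.clip(round((1 - r) * x0 + r * x1), 0, image.shape[1] - 1))
            yi = int(np.clip(round((1 - r) * y0 + r * y1), 0, image.shape[0] - 1))
            out[i, j] = image[yi, xi]            # nearest-neighbour sample
    return out
```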
2.3 Image Enhancement

After normalization, iris templates still have low contrast and non-uniform illumination problems. To eliminate the background brightness, the iris template is divided into non-overlapping 16 × 16 blocks and their means constitute coarse estimates of the background illumination for the individual blocks. By using bicubic interpolation, each estimated value is expanded to the size of a 16 × 16 block. Then each template block can be enhanced to a uniform lighting condition by subtracting the background illumination. After that, the lighting-corrected images are enhanced by histogram equalization, which shows clearer iris texture characteristics than those in Fig. 3(c). Figure 3 illustrates the preprocessing process for the iris image.
Fig. 3. Preprocessing of the iris image: (a) the original iris image, (b) the image with the located iris area, (c) the normalized iris image, and (d) the region of interest (ROI) from the enhanced image
3 Feature Extraction

Although all normalized iris templates have the same size and uniform illumination, there are still eyelashes and eyelids on the templates, and these will influence the performance of iris recognition. Therefore, the region of interest (ROI) shown in Fig. 3(d) is selected to remove the influence of eyelashes and eyelids. The features are extracted only from the upper half region (32 × 512), which is closer to the pupil and provides the most discriminating information [15].

3.1 Empirical Mode Decomposition
Huang et al. [13] introduced a multi-resolution decomposition technique, Empirical Mode Decomposition (EMD), that is adaptive and appears to be suitable for non-linear and non-stationary signal processing. The major advantage of EMD is that the basis functions are derived directly from the signal itself. Its principle is to adaptively decompose a given signal into components called intrinsic mode functions (IMFs). An IMF is characterized by some specific properties. One is that the number of zero crossings and the number of extrema points are equal or differ only by one. Another property of an IMF is that the mean of its upper and lower envelopes must equal zero. Hence, for a given signal X, EMD ends up with a representation of the form:

X = Σ_{i=1}^{n} h_i + r,    (2)

where h_i is the ith mode (or IMF) of the signal, and r is the residual trend (a low-order polynomial component). The sifting procedure generates a finite (and limited) number of IMFs that are nearly orthogonal to each other [13].
Huang’s solution is to find a mean envelope by using cubic spline interpolation through the respective local extrema. It can be argued that repeated iterations using cubic splines in EMD cause the loss of amplitude and frequency information. In this paper, the technique of Modified Empirical Mode Decomposition (MEMD) is proposed to improve EMD for iris feature extraction. The local mean of a signal is accomplished by progressively smoothing the signal using moving averaging. By considering the sample portion of iris data shown in Fig. 4, the local mean involves calculating the mean of the maximum and minimum points of half-wave oscillation of the signal. So the ith mean value mi of each two successive extrema ni and ni +1 is given by
m_i = (n_i + n_{i+1}) / 2,    (3)
The local mean function is then repeatedly smoothed using this length of moving average until there are no two successive points with the same value. The smoothing process is shown in Fig. 5.
Fig. 4. A sample portion of iris data is displayed as the black line. The local means, computed from the means of successive extrema, are shown as horizontal lines. The smoothed local mean, calculated by moving averaging, is shown as the bold line.
Fig. 5. The smoothing process ((a)-(c)) of the local mean function using successive applications of a moving average, (c) the final smoothed local mean function
The MEMD principle is similar to that of EMD: a signal is decomposed into a sum of intrinsic mode functions (IMFs), and the IMFs have to satisfy the same two conditions as in EMD. Specifically, the first condition is similar to the narrow-band requirement, whereas the second condition modifies a global requirement to a local one by using the local mean defined by the local maxima and local minima points, and is necessary to ensure that the instantaneous frequency will not have unnecessary fluctuations induced by asymmetric waveforms. To make use of MEMD in practical applications, the signal must have at least two extrema, one maximum and one minimum, to be successfully decomposed into individual IMFs. These IMF components are obtained from the signal by means of an algorithm called the sifting process. This algorithm extracts locally, for each mode, the highest-frequency oscillations out of the original signal. Given those two definitive requirements for an IMF, the sifting process to extract IMFs from a given signal z(t), t = 1, ..., T, is described as follows.
1) Identify all the maxima and minima in z(t).
2) Calculate the local mean of each two successive extrema using formula (3).
3) Smooth the local means using moving averaging to form a smoothly varying continuous local mean function m(t).
4) Extract the details by d(t) = z(t) - m(t).
5) Check the properties of d(t):
• If the above-defined two conditions are met, an IMF is derived and z(t) is replaced with the residual r(t) = z(t) - d(t);
• If d(t) is not an IMF, then replace z(t) with d(t).
6) Repeat Steps 1)-5) until the residual satisfies the pre-defined stopping criteria.
At the end of this process, the original signal z(t) can be reconstructed using the following equation:
z(t) = Σ_{i=1}^{n} c_i(t) + r_n(t),    (4)
where n is the number of IMFs, r_n(t) denotes the final residue, which can be interpreted as the DC component of the signal, and the c_i(t) are nearly orthogonal to each other and all have nearly zero means. In fact, after a certain number of iterations, the produced signals do not carry significant physical information. To avoid this situation, we can stop the sifting process by limiting the normalized standard deviation (SD), computed from two consecutive sifting results. The SD is defined as
SD = Σ_{t=1}^{T} |z_j(t) − z_{j+1}(t)|² / z_j²(t),    (5)
The SD is usually set between 0.2 and 0.3. As the decomposition process proceeds, the time scale increases and, hence, the mean frequency of the mode decreases. Based on this observation, we may devise a general-purpose time-space filtering as
z_l^h(t) = Σ_{i=l}^{h} c_i(t),    (6)
where l, h ∈ [1, ..., n], l ≤ h. For example, when l = 1 and h < n, it is a high-pass filtered signal; when l > 1 and h = n, it is a low-pass filtered signal; when 1 < l ≤ h < n, it is a band-pass filtered signal. The above equation forms the basis of our application to iris data described below, where we use it as a low-pass filter. To relate this to iris recognition, we also present the results of the MEMD decomposition for iris images in Fig. 6. Note that the ROI of the normalized iris image is converted into a 1-D feature sequence by concatenating its rows. For easy comparison, Figure 6 shows only the first 500 components of the original feature sequences. The similarity of two iris images from the same person captured at different times can easily be seen by checking the corresponding circles marked in Fig. 6(a) and 6(b). Likewise, the circles marked in Fig. 6(a) and 6(c) point out the differences between two iris images from two different persons.
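A simplified sketch of the sifting procedure in steps 1)-6) above and of the time-space filter of Eq. (6) follows; the extrema detection, the moving-average window length, and the stopping criteria are simplifying assumptions rather than the authors' exact implementation.

```python
import numpy as np

def _smoothed_local_mean(z):
    """Local mean from successive extrema (Eq. (3)), smoothed by moving averaging.
    End handling and the averaging window are simplified assumptions."""
    ext = [0] + [i for i in range(1, len(z) - 1)
                 if (z[i] - z[i - 1]) * (z[i + 1] - z[i]) < 0] + [len(z) - 1]
    vals = [(z[ext[k]] + z[ext[min(k + 1, len(ext) - 1)]]) / 2.0
            for k in range(len(ext))]
    mean = np.interp(np.arange(len(z)), ext, vals)
    kernel = np.ones(5) / 5.0
    for _ in range(3):                               # repeated moving averaging
        mean = np.convolve(mean, kernel, mode='same')
    return mean

def memd(z, max_imfs=6, sd_stop=0.25, max_sift=30):
    """Simplified MEMD sifting (steps 1-6 above); returns (list of IMFs, residue)."""
    z = np.asarray(z, dtype=float)
    imfs, residue = [], z.copy()
    for _ in range(max_imfs):
        h = residue.copy()
        for _ in range(max_sift):
            m = _smoothed_local_mean(h)
            d = h - m                                # step 4: extract the detail
            sd = np.sum((h - d) ** 2 / (h ** 2 + 1e-12))   # Eq. (5)
            h = d
            if sd < sd_stop:
                break
        imfs.append(h)
        residue = residue - h
        if len(np.unique(np.sign(np.diff(residue)))) < 2:  # no more oscillations
            break
    return imfs, residue

def timespace_filter(imfs, l, h):
    """Eq. (6): partial reconstruction as the sum of IMFs c_l..c_h (1-indexed)."""
    return np.sum(imfs[l - 1:h], axis=0)
```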
Fig. 6. (a) and (b) show the MEMD decomposition results of two iris images from the same person. (a) and (c) show the MEMD results of two iris images from two different persons.
3.3 Feature Vector
For the ROI of each normalized iris image I, pixel sequences from adjacent rows are concatenated to form the 1-D vector V represented by

V = {I_1, ..., I_x, ..., I_K} = {v_1, v_2, ..., v_j, ..., v_n},    (7)

where I_x denotes the gray values of the xth row of the image I, v_j is the pixel value at position j inside the vector V, and n is the total number of components, here n = 32 × 512 = 16384. After concatenation and before applying MEMD, linear re-scaling [16] is applied to each vector to adjust the average of each data set to zero and to normalize the standard deviation to unity before further using the ROI vector. After the MEMD calculation, the feature vector of each MEMD residual of the 1-D vector is obtained as

R^m = {R_1^m, R_2^m, ..., R_j^m, ..., R_n^m},    (8)

where R^m represents the mth residual result from MEMD and R_j^m denotes the feature value at the jth position of R^m.
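A sketch of how the feature vector of Eqs. (7)-(8) can be formed: row concatenation, linear re-scaling, and keeping a low-pass MEMD residual. Which residual index m is used is not specified in this excerpt, so it is left as a parameter here; memd_func is assumed to return (list of IMFs, final residue) as in the sifting sketch above.

```python
import numpy as np

def iris_feature_vector(roi, memd_func, residual_index=1):
    """Build the 1-D feature of Eqs. (7)-(8): concatenate ROI rows, re-scale to
    zero mean and unit standard deviation, then keep the m-th MEMD residual
    (read here as the signal with the first m IMFs removed)."""
    v = roi.astype(float).ravel()                 # concatenate rows into a 1-D vector
    v = (v - v.mean()) / (v.std() + 1e-12)        # linear re-scaling [16]
    imfs, _residue = memd_func(v)
    low = imfs[:residual_index]
    return v - (np.sum(low, axis=0) if low else 0.0)
```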
4 Iris Matching

A suitable similarity measure is essential for precise matching between feature vectors. In this article, the three different similarity measures used as matching criteria are:

1) The mean of the Euclidean distances (MED) measure:

d_1(p, q) = (1/M) Σ_{i=1}^{M} (p_i − q_i)²,    (9)

where M = K × L is the dimension of the feature vector, p_i is the ith component of the sample feature vector, and q_i is the ith component of the unknown sample feature vector.
2) The cosine similarity measure:

d_2(p, q) = 1 − (p · q)/(‖p‖ ‖q‖),    (10)

where p and q are two different feature vectors and ‖·‖ denotes the Euclidean norm. The range of (p · q)/(‖p‖ ‖q‖) is [0, 1]. The more similar the two vectors are, the smaller the d_2(p, q) value is.
3) The binary Hamming distance (HD) measure:

d_3(p, q) = (1/M) Σ_{i=1}^{M} p_i ⊕ q_i,    (11)

where ⊕ denotes the Exclusive-OR operation and M is the length of the binary sequence; p_i is the ith component of the database sample feature vector, and q_i is the ith component of the unknown sample feature vector.
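The three matching criteria of Eqs. (9)-(11) translate directly into code; the sketch below assumes real-valued feature vectors for MED and cosine, and binary (0/1) sequences for HD.

```python
import numpy as np

def med_distance(p, q):
    """Eq. (9): mean of the squared component-wise differences."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.mean((p - q) ** 2)

def cosine_dissimilarity(p, q):
    """Eq. (10): 1 minus the cosine of the angle between the two feature vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return 1.0 - np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

def hamming_distance(p, q):
    """Eq. (11): fraction of differing bits between two binary feature sequences."""
    p, q = np.asarray(p, int), np.asarray(q, int)
    return np.mean(p ^ q)
```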
5 Experimental Results

To evaluate the performance of the proposed approach for iris recognition, various experiments were conducted, as described in this section. In the verification mode, the ROC curve, which depicts the relationship of the false acceptance rate (FAR) versus the false rejection rate (FRR), is used. The ROC curve is normally used to measure the accuracy of the matching process, showing the achieved performance of an algorithm. Meanwhile, the equal error rate (EER) is also used for performance evaluation. In the recognition mode, the correct recognition rate (CRR) is adopted to assess the efficacy of the algorithm.

5.1 Iris Database

In our experiments, the test data set is from the CASIA Iris Database [17]. Each image has a resolution of 320 × 280 with 8-bit gray levels. This database includes 1,992 iris images from 249 different eyes (hence, 249 different classes) with 8 images each. The images were acquired during different sessions and the time interval between two collections is at least one month. In our experiments, three images from each class are randomly selected to constitute the training set, so the entire training set has 747 images. The other five images of each class are used as the test set, with a total of 1,245 images. Using those 1,992 different iris images from the CASIA Iris Database, the experiments reported below were run on a 1.8 GHz PC with 736 MB RAM using Matlab 6.5.

5.2 Recognition Results

Table 1 demonstrates the promising recognition results achieved by our proposed MEMD method using the three similarity measures from (9)-(11). Note that performance
differences are not very significant when different similarity measures are used. Only a slightly higher recognition rate of 99.31% is accomplished by using the MED similarity measure in the identification tests. The verification results are also shown in Figure 7. The achieved Az value (the area under the ROC curve) is up to 0.9927 for the MED similarity measure. Therefore, the experimental results show that the proposed iris representation is effective for recognition and that the MEMD approach can indeed extract promising features from each iris image.

Table 1. Recognition rates of MEMD with different similarity measures

Similarity measure    Correct recognition rate (CRR)
MED                   99.31%
Cosine                98.78%
HD                    98.32%
[ROC plot: False Positive Rate (False Match Rate) on the horizontal axis vs. True Positive Rate (100 - False Non-Match Rate) on the vertical axis, with curves for the HD, Cosine and MED measures]
Fig. 7. The ROC curve of MEMD method with different similarity measures
5.3 Comparison Between MEMD and EMD

This paper presents the proposed MEMD method and illustrates its performance in iris recognition. MEMD can separate the iris signal into a small set of components, each of which could be associated with some aspect of the underlying signal. Although using MEMD on sample test signals produces results similar to those generated by EMD, significant differences between the two schemes should be noted. The MEMD iteration process using smoothed local means appears to be a gentler way of decomposing the data than the cubic spline approach used by EMD. This can be seen in Fig. 8, which compares the intrinsic mode functions calculated by MEMD with the equivalent EMD IMFs.
Experimental results demonstrated in Section 5.2 reveal that the proposed MEMD technique is an effective scheme for feature extraction from iris images and the MED similarity measure can achieve a correct recognition rate up to 99.31%. Here, we also use the EMD method to extract the iris feature for iris recognition in order to compare with the MEMD method. For the results shown in Table 2 and Fig. 9, we have also implemented the other iris recognition algorithms, the methods of the Fourier-wavelet feature [12] and the Gaussian-Hermite moments [8]. Together with our proposed scheme, four approaches are tested using the 249 classes of the CASIA Iris Database and the cosine similarity measure. Although a slightly lower recognition rate than the approach of Gaussian-Hermite moments is achieved, the proposed method still can fulfill the demand of high accuracy suitable for very high security environments.
Fig. 8. Comparison between MEMD and EMD intrinsic mode functions. This figure compares four MEMD IMFs (shown as black lines) with the four EMD IMFs (shown as bold lines) generated from the same iris signal.

Table 2. Typical operating states of the different methods

Methods                         CRR (%)   Az       EER (%)
Fourier-wavelet feature [12]    94.37     0.9683   5.24
Gaussian-Hermite moments [8]    99.64     0.9989   0.29
EMD                             97.22     0.9812   1.82
Proposed method                 98.78     0.9915   0.54
To evaluate the computational complexity, Table 3 shows the computational costs of the four methods, including the CPU time for feature extraction and matching. Although the feature extraction time of EMD is slightly lower than that of MEMD, the computational efficiency of MEMD is still better than that of the other two methods. Our proposed MEMD method for feature extraction is thus shown to have the desired recognition performance. This can be a potential advantage for iris matching in a large database.
Based on the previous experimental results and the corresponding analysis, we can conclude:
1. The proposed method can achieve high accuracy and fast performance for iris recognition. This indicates that the MEMD technique can extract discriminating features suitable for iris recognition.
2. Although our proposed method has better computational efficiency than the method of Gaussian-Hermite moments [8], its recognition performance still needs to be improved. Therefore, feature selection is an important research issue for the near future.

Table 3. Comparison of the computational complexity

Methods                         Feature extraction (s)   Matching (s)   Total time (s)
Fourier-wavelet feature [12]    1.297                    0.116          1.413
Gaussian-Hermite moments [8]    2.191                    0.553          2.744
EMD                             0.521                    0.109          0.63
Proposed method                 0.985                    0.108          1.093
[ROC plot: False Positive Rate (False Match Rate) vs. True Positive Rate (100 - False Non-Match Rate), with curves for the proposed method, Gaussian-Hermite moments, EMD, and the Fourier-wavelet feature]
Fig. 9. The ROC curve of different methods using the cosine similarity measure
6 Conclusions

In this paper, a novel and effective method of feature extraction for iris recognition is presented, which operates using the Modified Empirical Mode Decomposition (MEMD) technique. The performance of iris recognition achieved by the MEMD approach associated with three different similarity measures has been evaluated. Experimental results have shown that the proposed method can demonstrate eminent
performance for iris recognition. The best similarity metric is the MED measure and the other two measures also have achieved similar performance more than 98%. Therefore, the proposed method has demonstrated to be promising for iris recognition and MEMD is suitable for feature extraction. In the future, we will ameliorate the template processing method to reduce the influence of light, eyelid, and eyelash. We are also working at increasing the database in order to further verify the performance and trying other possible approaches to improve the classification accuracy.
References

1. Yanushkevich, S.N.: Synthetic Biometrics: A Survey. In: Proceedings of International Joint Conference on Neural Networks, pp. 676-683 (2006)
2. Miller, B.: Vital Signs of Identity. IEEE Spectrum 31, 22-30 (1994)
3. Mansfield, T., Kelly, G., Chandler, D., Kane, J.: Biometric Product Testing Final Report. Issue 1.0, National Physical Laboratory (2001)
4. Daugman, J.: High Confidence Visual Recognition of Persons by a Test of Statistical Independence. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(11), 1148-1161 (1993)
5. Boles, W.W., Boashash, B.: A Human Identification Technique Using Images of the Iris and Wavelet Transform. IEEE Transactions on Signal Processing 46(4), 1185-1188 (1998)
6. Wildes, R., Asmuth, J., Green, G., Hsu, S., Kolczynski, R., Matey, J., McBride, S.: A Machine-Vision System for Iris Recognition. Machine Vision and Applications 9(1), 1-8 (1996)
7. Ma, L., Tan, T., Wang, Y., Zhang, D.: Personal Identification Based on Iris Texture Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(12), 1519-1533 (2003)
8. Ma, L., Tan, T., Wang, Y., Zhang, D.: Local Intensity Variation Analysis for Iris Recognition. Pattern Recognition 37, 1287-1298 (2004)
9. Ma, L., Tan, T., Wang, Y., Zhang, D.: Efficient Iris Recognition by Characterizing Key Local Variations. IEEE Transactions on Image Processing 13(6), 739-750 (2004)
10. Daubechies, I.: Ten Lectures on Wavelets. SIAM, Philadelphia, Pennsylvania (1992)
11. Wang, S.S., Chen, P.C., Lin, W.G.: Invariant Pattern Recognition by Moment Fourier Descriptor. Pattern Recognition 27, 1735-1742 (1994)
12. Huang, P.S., Chiang, C.S., Liang, J.R.: Iris Recognition Using Fourier-Wavelet Features. In: Kanade, T., Jain, A., Ratha, N.K. (eds.) AVBPA 2005. LNCS, vol. 3546, pp. 14-22. Springer, Heidelberg (2005)
13. Huang, N., Shen, Z., Long, S., Wu, M., Shih, H., Zheng, Q., Yen, N., Tung, C., Liu, H.: The Empirical Mode Decomposition and Hilbert Spectrum for Nonlinear and Nonstationary Time Series Analysis. Proc. of the Royal Society of London 454, 903-995 (1998)
14. Daugman, J.: How Iris Recognition Works. IEEE Transactions on Circuits and Systems for Video Technology 14(1), 21-30 (2004)
15. Sung, H., Lim, J.J., Lee, Y.: Iris Recognition Using Collarette Boundary Localization. In: Proceedings of the 17th International Conference on Pattern Recognition, vol. 4, pp. 857-860 (2004)
16. Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press, Oxford (1996)
17. Institute of Automation, Chinese Academy of Science, CASIA Iris Image Database, http://www.sinobiometrics.com/chinese/chinese.htm
Segmentation of Scanned Insect Footprints Using ART2 for Threshold Selection

Bok-Suk Shin¹, Eui-Young Cha¹, Young Woon Woo², and Reinhard Klette³

¹ Dept. of Computer Science, Pusan National University, Busan, Korea
[email protected], [email protected]
² Dept. of Multimedia Engineering, Dong-Eui University, Busan, Korea
[email protected]
³ Dept. of Computer Science, The University of Auckland, Auckland, New Zealand
[email protected]
1
Introduction
Modern transportation also means that various kinds of insects change places by vehicle, aircraft or ship. There are no problems in cases where native insects travel within their habitat, but it may cause harm to the ecosystem or the environment if insects enter an area outside of their habitat. In order to monitor movements or presence of insects (e.g., in containers in airplanes or ships, or in defined areas such as an island), special methods have been designed for the monitoring of insects, taking their characteristics into account. Examples of monitoring devices are illustrated in Figure 1. Such tunnels are widely used for collecting footprints of small animals (such as rats or mice) and of various kinds of insects. The acquired footprints are visually inspected or scanned for automated reading; they are used for various monitoring tasks, for example for verifying the presence of some insects, or for more detailed ecological or biological studies as supported by those footprints [1,2]. D. Mery and L. Rueda (Eds.): PSIVT 2007, LNCS 4872, pp. 311–320, 2007. c Springer-Verlag Berlin Heidelberg 2007
312
B.-S. Shin et al.
TM
Fig. 1. Tracking tunnel devices of varying sizes using Black Trakka tracking cards [3]: insects or small animals are attracted by a lure to walk into the tunnel, across the inked area, leaving tracks on the white card
The acquired insect footprints, using such tracking tunnel devices for collection, are then typically identified by entomologists having expert knowledge about insect’s morphology [4]. The identification requires that individual footprints are extracted (e.g., by using morphological features of each kind of insect [5]) and then clustered into meaningful track patterns, but it may be hard to extract, analyze and classify insect footprints even for the experienced human specialist if available knowledge about entomology and visible patterns do not match (e.g., if too many insects left traces on the same card). For automated reading of such cards, we start with a method to extract automatically segments for later classification, with the aim to remove unnecessary human preprocessing, improve time efficiency, and increase accuracy for insect footprint recognition, possibly even for situations where expert knowledge about entomology is not accessible. In this paper, we propose a method for insect footprint segmentation using an improved ART2 (Adaptive Resonance Theory) algorithm, regardless of size and stride of each type of insect footprint. In the improved ART2 algorithm, the threshold value for clustering is determined automatically using contour shape of the graph created by accumulating distances between all of the “spots” of an insect footprint pattern image scanned from one of those tracking cards. The paper improves a method that was proposed in [6].
2 Improved Footprint Segmentation
First, we define four terms for describing our methodology. We define a "spot" as a set of connected pixels in a binarized footprint image and a "region" as a set of spots for each foot. We then define a "segment" as a set of three regions, for front foot, mid foot and hind foot, and a "pattern" as a set of segments in a footprint image. This paper considers the following steps for collecting tracks, scanning, and extracting footprint segments (for a graphical sketch of the overall process, see Figure 2): Step 1: Insect footprints are acquired on tracking cards, placed into tracking tunnel devices for some time.
Fig. 2. Step graph of automatic insect footprint segmentation
Step 2: Collected tracking cards are scanned at 1200 dpi resolution and with 256 gray levels. The scanned images are binarized using Abutaleb's higher-order entropy algorithm, which was already identified as a useful method for the binarization of scanned insect footprint images [6]. Step 3: An initial threshold value for ART2, a neural network algorithm for clustering, is decided using the contour shape of the graph created by accumulating distances between all the spots of a footprint pattern image. Step 4: The decided threshold value is used in ART2, and basic segments of footprints are extracted automatically by the improved ART2 clustering algorithm regardless of the size and stride of each type of insect footprint.

2.1 Abutaleb's Higher-Order Entropy Binarization Algorithm
Quality of the binarization process is crucial for the overall performance of recognition. Various kinds of binarization methods have been developed during the past 20 or more years; see, for example, [7,8]. In this paper we use the binarization method proposed by Abutaleb [9], which can be briefly explained as follows: The histogram and the probability mass function (PMF) of the image are given, respectively, by h(g) and by p(g), for g = 0 . . . G, where G is the maximum luminance value in the image (typically G = 255). If the gray value range is not explicitly indicated as being equal to [g_min, g_max], it is assumed to extend from 0 to G. The cumulative probability function is defined as follows:

P(g) = \sum_{i=0}^{g} p(i)   (1)
The foreground (i.e., object) and background PMFs are expressed as P_f(g), for 0 <= g <= T, and P_b(g), for T + 1 <= g <= G, respectively, where T is the threshold value. The foreground and background area probabilities are calculated as follows:

P_f(T) = P_f = \sum_{g=0}^{T} p(g), \qquad P_b(T) = P_b = \sum_{g=T+1}^{G} p(g)   (2)
The Shannon entropy, parametrically dependent on the threshold value T, is formulated for the foreground and background as:

H_f(T) = -\sum_{g=0}^{T} p_f(g) \log p_f(g), \qquad H_b(T) = -\sum_{g=T+1}^{G} p_b(g) \log p_b(g)   (3)
The mean and variance of the foreground and background, as functions of the thresholding level T, are denoted as:

m_f(T) = \sum_{g=0}^{T} g \cdot p(g), \qquad \sigma_f^2(T) = \sum_{g=0}^{T} [g - m_f(T)]^2 p(g),

m_b(T) = \sum_{g=T+1}^{G} g \cdot p(g), \qquad \sigma_b^2(T) = \sum_{g=T+1}^{G} [g - m_b(T)]^2 p(g)   (4)
Abutaleb's binarization algorithm assumes the joint entropy of two related random variables, namely, the image gray value g at a pixel, and the average gray value \bar{g} of a neighborhood centered at that pixel. Using the 2-D histogram p(g, \bar{g}), for any threshold pair (T, \bar{T}), one can calculate the cumulative distribution P(T, \bar{T}), and then define the foreground entropy as follows:

H_f = -\sum_{i=1}^{T} \sum_{j=1}^{\bar{T}} \frac{p(g, \bar{g})}{P(T, \bar{T})} \log \frac{p(g, \bar{g})}{P(T, \bar{T})}   (5)

Similarly, one can define the background region's second-order entropy. Under the assumption that the off-diagonal terms, that is, the two quadrants [(0, T), (\bar{T}, G)] and [(T, G), (0, \bar{T})], are negligible and contain elements only due to image edges and noise, the optimal pair (T, \bar{T}) can be found as the minimizing value of the 2-D entropy functional. In this algorithm, the following equation is used for finding an optimal threshold value:

(T_{opt}, \bar{T}_{opt}) = \arg\min \{ \log[P(T, \bar{T})\,[1 - P(T, \bar{T})]] + H_f / P(T, \bar{T}) + H_b / [1 - P(T, \bar{T})] \}   (6)

where

H_f = -\sum_{i=1}^{T} \sum_{j=1}^{\bar{T}} \frac{p(g, \bar{g})}{P(T, \bar{T})} \log \frac{p(g, \bar{g})}{P(T, \bar{T})} \quad and \quad H_b = -\sum_{i=T+1}^{G} \sum_{j=\bar{T}+1}^{G} \frac{p(g, \bar{g})}{1 - P(T, \bar{T})} \log \frac{p(g, \bar{g})}{1 - P(T, \bar{T})}
2.2 Clustering Method
Insect footprint patterns are composed of sets of segments made by the insect's feet, and these segments appear repeatedly and dispersedly in the footprint image. In general, it is hard to detect the segments which identify a footprint from a scanned footprint image. Meaningful groups of regions, i.e., segments identifying a single footprint, can be extracted using specific morphological characteristics defined by the species, body size, leg positions and stride of an insect (see conventional research [4,5,10] on insects). In this paper, we propose a method for the extraction of footprint segments using an ART2 algorithm, a neural network algorithm with good clustering performance [11,12]. The ART2 algorithm can cluster footprint spots easily without any morphological features. With the ART2 algorithm, the clustering process can be performed in real time regardless of the amount of generated data, as clusters are created dynamically. The ART2 algorithm is an unsupervised learning neural network in which stability (a known weakness of conventional competitive learning algorithms) is supplemented. The ART2 algorithm automatically integrates new learning results into former learning results in order to preserve the former learning results. The ART2 learning algorithm used in this paper is as follows: Step 1: The k-th input datum is defined as x_k, and the center of the i-th cluster is defined as w_i. Step 2: A cluster j*, which has the minimum distance to the new input datum x_k, is selected as the winner cluster. The distance between the center of a cluster and the input datum is calculated using the Euclidean distance, as shown in the following equation:

\| x_k - w_{j^*} \| = \min_j \| x_k - w_j \|   (7)
Step 3: We perform the vigilance test for the input datum. If the distance between the input datum and the winner cluster is smaller than the threshold value σ, then this input datum is accepted as similar to the winner cluster, and the center of the winner cluster is updated using this input datum. If the distance is not smaller than the threshold value σ, then a new cluster is created from this input datum. This process is performed using the following equation:

if \| x_k - w_{j^*} \| < \sigma, \quad w_{j^*}^{new} = \frac{Cluster_{j^*}^{old} \cdot w_{j^*}^{old} + x_k}{Cluster_{j^*}^{old} + 1}   (8)

where Cluster_{j^*}^{old} denotes the number of input data included in the j*-th cluster. Step 4: Steps 1 to 3 are repeated until no input datum remains. If the whole learning process has been iterated as predefined (e.g., until the number or centers of clusters no longer change), the learning process is terminated.
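The following Python sketch illustrates the vigilance-test clustering loop of equations (7) and (8). It is a minimal illustration under our own naming and data-layout assumptions, not the authors' implementation.

import numpy as np

def art2_like_clustering(points, sigma, max_iter=20):
    """Cluster 2-D spot centers with a simple ART2-style vigilance test.

    points : (N, 2) array of spot centers of gravity
    sigma  : distance threshold (vigilance parameter)
    Returns the cluster centers and a label for every point.
    """
    centers = [points[0].astype(float)]   # the first datum creates the first cluster
    counts = [1]
    labels = np.zeros(len(points), dtype=int)

    for _ in range(max_iter):
        changed = False
        for k, x in enumerate(points):
            # Step 2: winner cluster = nearest center (Euclidean distance), eq. (7)
            dists = [np.linalg.norm(x - c) for c in centers]
            j = int(np.argmin(dists))
            if dists[j] < sigma:
                # Step 3: vigilance passed -> update the winner's center, eq. (8)
                centers[j] = (counts[j] * centers[j] + x) / (counts[j] + 1)
                counts[j] += 1
            else:
                # vigilance failed -> the datum creates a new cluster
                centers.append(x.astype(float))
                counts.append(1)
                j = len(centers) - 1
            if labels[k] != j:
                labels[k] = j
                changed = True
        if not changed:   # Step 4: stop when the cluster assignment is stable
            break
    return np.array(centers), labels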
2.3 Automatic Threshold Selection
We use the ART2 algorithm, an unsupervised learning algorithm, for clustering insect footprint spots. However, the threshold value σ in the ART2 algorithm is usually set heuristically according to the characteristics of the input data, and the threshold value is of crucial importance for the performance of clustering. When we cluster insect footprint spots with the ART2 algorithm, it is difficult to preselect an initial threshold value because the sizes of the feet and the strides vary with the kind of insect. For example, the Black Cockroach, one of our test insects, has dense foot intervals, and the Native Bush Cockroach, another test insect, has sparse foot intervals. So, if we set a threshold value for the Black Cockroach to obtain good clustering results, it has an improper effect on the Native Bush Cockroach segmentation, and if we set a threshold value for the Native Bush Cockroach, it also has an improper effect on the Black Cockroach processing. In order to solve this difficulty, we use the contour shape of the graph created by accumulating distances between all the spots of a footprint pattern image to set the threshold value used in the ART2 algorithm automatically. However, the acquired accumulation graph has undesirable peaks due to noisy spots, so we apply a median filter to smooth the contour of the graph.
Fig. 3. Accumulation graph of distances between footprint spots
Figure 3 shows a graph of the typical contour shape obtained by accumulating distances between all the spots imprinted in a general insect footprint pattern image; this graph reflects the stride and the foot density of an insect. In Figure 3, "zone a", which contains the first peak value, represents the accumulated distances within each front foot ("zone a1" in Figure 4), mid foot ("zone a2" in Figure 4), and hind foot ("zone a3" in Figure 4), and "zone b", which contains the second peak value, represents the accumulated distances between the spots in "zone b" of Figure 4. In this paper, we used the second peak value in "zone b", which includes front, mid, and hind legs, as the initial threshold value σ for accurate clustering in the ART2 algorithm. Figure 5 shows a graph generated from a test insect footprint image, and Figure 6 shows the graph after applying a median filter in order to find the local maxima of the graph effectively.

Fig. 4. A segment of an insect footprint image
Fig. 5. Graph created by accumulating distances between all the spots
Fig. 6. Graph after applying a median filter to the graph of Figure 5
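A minimal sketch of the automatic threshold selection described above: accumulate pairwise distances between spot centers into a histogram, smooth it with a median filter, and take the second peak as the initial vigilance threshold σ. The function name, histogram bin width and filter size are our own assumptions, not parameters reported by the authors.

import numpy as np
from scipy.signal import medfilt

def select_initial_threshold(spot_centers, bin_width=5.0, kernel=9):
    """Estimate sigma from the accumulated distance graph (its second peak)."""
    pts = np.asarray(spot_centers, dtype=float)
    # accumulate all pairwise distances between spots into a histogram
    diff = pts[:, None, :] - pts[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))
    d = d[np.triu_indices(len(pts), k=1)]
    bins = np.arange(0.0, d.max() + bin_width, bin_width)
    hist, _ = np.histogram(d, bins=bins)
    # the median filter removes spurious peaks caused by noisy spots
    smooth = medfilt(hist.astype(float), kernel_size=kernel)
    # local maxima of the smoothed accumulation graph
    peaks = [i for i in range(1, len(smooth) - 1)
             if smooth[i] > smooth[i - 1] and smooth[i] >= smooth[i + 1]]
    if len(peaks) < 2:
        raise ValueError("could not find two peaks in the accumulation graph")
    second_peak = peaks[1]
    return bins[second_peak] + 0.5 * bin_width   # distance value used as sigma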
2.4 Segment Extraction

In this paper, we proposed a method to set an initial threshold value in the ART2 algorithm using the graph created by accumulating distances between all the spots of a footprint pattern for effective segment extraction. For segment extraction using the clustering method, the center of gravity and the size of each spot area, found from the linked pixels of the binarized image, are utilized. The position of the center of gravity is used as the center coordinates for clustering, and the size information is used for extracting the final segments. Figure 7 shows a sample spot area together with the center of gravity and the radii (width and height) of the spot area. P(L, R) denotes the coordinates in the 2-dimensional plane of a scanned footprint image; the whole spot area is given by the coordinates P(L_min, R_min) and P(L_max, R_max). These coordinates are utilized as boundary coordinates of the segments extracted from the clustering results. Figure 8 shows each step from clustering by center of gravity with the proposed ART2 algorithm to segment extraction using the size information.
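As an illustration of this step, the sketch below extracts, for each connected spot of a binarized footprint image, the center of gravity used for clustering and the bounding coordinates P(L_min, R_min), P(L_max, R_max) used for the final segment boxes. It relies on standard connected-component labelling and is only an assumed implementation of the described bookkeeping, not the authors' code.

import numpy as np
from scipy import ndimage

def extract_spot_features(binary_img):
    """Return (centers, boxes): center of gravity and bounding box per spot."""
    labels, n = ndimage.label(binary_img)            # connected pixels form one spot
    centers = ndimage.center_of_mass(binary_img, labels, range(1, n + 1))
    boxes = []
    for sl in ndimage.find_objects(labels):
        (r0, r1), (c0, c1) = (sl[0].start, sl[0].stop), (sl[1].start, sl[1].stop)
        boxes.append(((r0, c0), (r1 - 1, c1 - 1)))   # P(L_min, R_min), P(L_max, R_max)
    return np.array(centers), boxes

The cluster label assigned to each center then determines which bounding boxes are merged into one extracted segment.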
Fig. 7. Center of gravity and used radii within a spot area
Fig. 8. Steps for segment extraction using the center of gravity and radii
3 Experimental Results and Analysis
We restricted our experiments to two kinds of insects (Black Cockroaches and Native Bush Cockroaches), which illustrate the typical difficulty of dealing with different sizes of feet and stride lengths. Figure 9 shows 256 gray-level insect footprints; the image (left) is acquired by scanning a tracking card, and the binarized image (right) is obtained by using Abutaleb's binarization algorithm. Figure 10 shows extracted segment images from a Black Cockroach footprint image using the proposed method, and Figure 11 shows extracted segment images from a Native Bush Cockroach image. Table 1 shows the experimental results of segment extraction, using three scanned cards each for Black Cockroaches and Native Bush Cockroaches. If there are noisy ink traces (e.g., caused by the abdomen of an insect, or by insect foot dragging during tracking card acquisition), then we obtain too many noise spots in the binarized images. Some low success ratios in Table 1 are caused by such noisy spots. Thus, the next step is to develop an effective noise removal method.
Fig. 9. A sample image (left) and the binarized image (right) using Abutaleb’s algorithm
Fig. 10. Results of segment extraction for the Black Cockroach
Fig. 11. Results of segment extraction for the Native Bush Cockroach
Table 1. Results of footprint segment extraction

Type of Insect               Native Bush Cockroach          Black Cockroach
                             N1       N2       N3           B1       B2       B3
Initial Threshold Value      205      235      400          295      290      330
Number of Test Images        9        11       11           18       15       14
# of Correct Extraction      7        11       10           17       15       11
# of Incorrect Extraction    2        0        1            1        0        3
Success Ratio                77.8%    100%     90.9%        94.4%    100%     78.6%

4 Conclusions
In this paper, we proposed a clustering method for extracting insect footprint segments as a preprocessing stage of insect footprint recognition. We improved the ART2 algorithm by setting its threshold value automatically, using the contour shape of the graph created by accumulating distances between all the spots of a footprint pattern, in the proposed clustering method. In the experiments, using two kinds of insect footprint patterns (with clearly understandable differences), the clustering results of the proposed method were almost the same as those of a human expert. Acknowledgment. Data used in this paper were provided by the insect track recognition project at CITR, The University of Auckland. The project was initiated in 2003 by Warren Agnew (Connovation Ltd., Auckland).
References
1. Russel, J.: A recent survey of methods for closed populations of small mammals. Unpublished report, The University of Auckland, Auckland (2003)
2. Whisson, D.A., Engeman, R.M., Collins, K.: Developing relative abundance techniques (RATs) for monitoring rodent population. Wildlife Research 32, 239–244 (2005)
3. Connovation Ltd.: (last visit: August 2007), see www.connovation.co.nz
4. Deng, L., Bertinshaw, D.J., Klette, R., Klette, G., Jeffries, D.: Footprint identification of weta and other insects. In: Proc. Image Vision Computing New Zealand, pp. 191–196 (2004)
5. Gray, J.: Animal Locomotion. Weidenfeld & Nicolson, London (1968)
6. Woo, Y.W.: Performance evaluation of binarizations of scanned insect footprints. In: Klette, R., Žunić, J. (eds.) IWCIA 2004. LNCS, vol. 3322, pp. 669–678. Springer, Heidelberg (2004)
7. Rosenfeld, A., De la Torre, P.: Histogram concavity analysis as an aid in threshold selection. IEEE Transactions on Systems, Man, and Cybernetics 13, 231–235 (1983)
8. Sezgin, M., Sankur, B.: Survey over image thresholding techniques and quantitative performance evaluation. J. Electronic Imaging 13, 146–165 (2004)
9. Abutaleb, A.S.: Automatic thresholding of gray-level pictures using two-dimensional entropy. Computer Vision, Graphics, and Image Processing 47, 22–32 (1989)
10. Hasler, N., Klette, R., Rosenhahn, B., Agnew, W.: Footprint recognition of rodents and insects. In: Proc. Image Vision Computing New Zealand, pp. 167–173 (2004)
11. Carpenter, G.A., Grossberg, S.: The ART of adaptive pattern recognition by a self-organizing neural network. Computer 21, 77–88 (1988)
12. Haykin, S.: Neural Networks: A Comprehensive Foundation. Macmillan (1994)
Meshless Parameterization for Dimensional Reduction Integrated in 3D Voxel Reconstruction Using a Single PC Yunli Lee, Dongwuk Kyoung, and Keechul Jung∗ School of Media, College of Information Technology, Soongsil University, Seoul, South Korea {yunli, kiki227, kcjung}@ssu.ac.kr http://hci.ssu.ac.kr
Abstract. Shape-From-Silhouettes (SFS) is one of the most popular ideas for reconstructing the 3D voxels of an object from silhouette images. This paper presents a method based on SFS and pre-computing for 3D voxel reconstruction using a single PC; the method reduces memory usage. Towards this approach, a meshless parameterization for dimensional reduction is integrated into the process to obtain the object representation in 2D form. Since meshless parameterization requires the solution of a large linear system over the whole 3D voxel set, the meshless parameterization is computed locally by taking advantage of the 3D voxel reconstruction process. The proposed method makes it possible to implement an optimized system and to utilize it in various applications. Keywords: Meshless parameterization, dimensional reduction, 3D voxel reconstruction, shape from silhouettes, single PC.
1 Introduction

3D voxel reconstruction using multiple views of silhouette images is an active research topic in 3D video, 3D animation, 3D gesture recognition, etc. 3D voxel reconstruction refers to the problem of extracting the parameters of an object from a set of sequence images. Visual Hull (VH) construction and Shape-From-Silhouettes (SFS), which approximate the shape using silhouette images, have been proposed in the last two decades [1, 2]. SFS is one of the most popular approaches for shape estimation since it has many advantages for estimating 3D shape. For better shape estimation, multiple cameras are used to increase the number of silhouette images. The 3D voxel reconstruction involves many operations, and therefore PC clusters are used in order to achieve a fast processing speed. However, PC clusters require high cost, a high-speed network, and synchronization of the multiple-view images. This motivated us to propose a pre-computing method using a look-up table for 3D voxel reconstruction on a single PC. The proposed system consists of one PC, four web cameras and a graphic card. This method optimizes the performance and speed of 3D voxel reconstruction using a single PC. ∗
Corresponding author.
In addition, interest in the dimensional reduction of 3D voxels has grown rapidly. The 3D voxel data is rich in information that supports various kinds of applications. However, this information causes high computational time for real-time performance. Dimensional reduction therefore becomes very important to reduce the curse of dimensionality. There are various approaches to dimension reduction, such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Multi-Dimensional Scaling (MDS), etc. [3, 10, 11]. These techniques are popular in pattern recognition. Yet, we chose meshless parameterization as the technique for dimensional reduction of the reconstructed 3D voxels. Meshless parameterization is well known for parameterizing and triangulating a single patch of an unorganized point set [4-8]. The basic idea of meshless parameterization is to map the points into some convex parameter domain in the plane. This mapping is independent of any given topological structure. However, this method requires the solution of a large linear system, and the complexity of the linear system increases rapidly when the number of 3D voxels increases. Therefore, in this paper we adopt the advantages of the 3D voxel reconstruction: the meshless parameterization is computed locally within the process of generating the 3D voxels. This approach overcomes the linear system complexity. On top of this, the operations for the neighborhood and weight computations are performed faster. An overview of our proposed system is illustrated in Fig. 1. In this paper, the 3D voxels are reconstructed using the SFS method. The silhouette images are extracted from four web cameras using a single PC. In the process of generating the 3D voxels, meshless parameterization is computed for every three base-plane slices. At the end of the process, the system outputs a 3D voxel object and a 2D pixel object representation.
Fig. 1. The overview of the proposed system (silhouette generation, base plane generation, cross section generation, voxel extraction for 3 base-plane slices, meshless parameterization, and continued 3D voxel reconstruction)
The Shape-From-Silhouettes method for 3D voxel reconstruction using the pre-computing approach is described in Section 2. Section 3 explains the basic idea of meshless parameterization. The meshless parameterization within the 3D voxel reconstruction, used to solve the large linear system, is presented in Section 4. The experimental results of the proposed system are elaborated in Section 5. Conclusions and proposed future work are presented in Section 6.
2 3D Voxel Reconstruction

Matsuyama et al. [1] presented a real-time dynamic 3D object shape reconstruction method using sequential processes: silhouette image generation (SIG), base-plane projection (BPP), parallel-plane projection (PPP) and intersection (INT). They introduced a forward computation of all silhouette pixels for 3D voxel generation using a PC cluster approach. However, this process requires much computational time. We adopted the Shape-From-Silhouettes (SFS) method to reconstruct the 3D object. However, we propose a fast and efficient method, which runs on a single PC, for 3D voxel reconstruction by pre-computing the 2D silhouette image for one slice of the plane and storing it in a look-up table. Section 2.1 gives an overview of the general SFS method for 3D voxel reconstruction and Section 2.2 explains our proposed pre-computing approach.

2.1 Basic Method of Shape-From-Silhouettes

Shape-From-Silhouettes is implemented in the proposed system. In this paper, the 3D object is reconstructed by volume intersection. To realize efficient volume intersection, the plane-based volume intersection method is introduced. The 3D voxel space is partitioned into a group of parallel planes and the cross section of the 3D object volume on each plane is reconstructed. Then, we devised the plane-to-plane perspective projection algorithm to realize efficient plane-to-plane projection computation. A PC cluster is introduced to realize real-time processing. Fig. 2 shows the forward computation of all silhouette pixels for 3D voxel generation on a PC cluster.
Fig. 2. 3D voxel reconstruction based on silhouette pixels using PC clusters
2.2 Pre-computing Method
The 3D voxel reconstruction consists of four steps: silhouette generation, plane generation, cross section generation, and voxel extraction. The result is represented in volumetric form. The silhouettes are generated using background subtraction and shadow detection from the color difference [2]. The proposed approach relies only on the volumetric voxel data of the object in order to infer shape information. The threshold value of the RGB color is derived to separate the background and foreground images. The silhouette generation consists of two methods: first, compute the distance in order to separate silhouette pixels from background pixels, and second, compute the difference to separate shadow from non-shadow. We present a fast and efficient method of 3D voxel reconstruction in order to reduce memory storage and computational time. This approach uses pre-computed plane data (pre-computing) and a post-computing method. Fig. 3 shows the proposed pre-computing method for 3D voxel reconstruction.
Fig. 3. Work flow of 3D voxel reconstruction using pre-computing approach
The projection of each voxel in the 3D volume is constrained to a 2D plane. The plane is the x-y plane with the z-axis equal to zero. Then, the 2D silhouette points of this plane are pre-computed and saved in a look-up table. The voxel reconstruction is fast and efficient with the assistance of the look-up table and the post-computing process. In the proposed system, the relationship between a 3D point (X, Y, Z) and its image projection to the 2D point (x1/x3, x2/x3) is given by equation (1).
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = H_z \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix}, \qquad H_z = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} = A \, [\, r_1 \;\; r_2 \;\; (r_3 z + t) \,]   (1)
The computation of x1, x2 and x3 is given in Table 1. We can represent the general perspective projection from a 3D voxel point (X, Y, Z) to the 2D point (x1/x3, x2/x3). This process requires 22 multiplications, 14 additions and 2 divisions for each voxel point. However, in equations (2) and (3), if the z-axis value is zero for each voxel, the parameters A_1 and A_2 can be pre-computed and stored in the look-up table. Therefore, this process reduces the computational time and only requires 5 multiplications, 5 additions and 2 divisions for each computation of all voxel data from plane 0 to plane k. The data set (B_1, B_2, r_{31}, r_{32}, t_3, r_{33}), derived from equations (2), (3) and (4), is also saved for each web camera [12, 13].

Table 1. List of equations for x1, x2 and x3

x_1 = h_{11} X + h_{12} Y + h_{13}
    = (f_x r_{11} + c_x r_{31}) X + (f_x r_{12} + c_x r_{32}) Y + f_x (r_{13} Z + t_1) + c_x (r_{33} Z + t_3)
    = [(f_x r_{11} + c_x r_{31}) X + (f_x r_{12} + c_x r_{32}) Y + (f_x t_1 + c_x t_3)] + (f_x r_{13} + c_x r_{33}) Z
    = A_1 + B_1 Z   (2)

x_2 = h_{21} X + h_{22} Y + h_{23}
    = (f_y r_{21} + c_y r_{31}) X + (f_y r_{22} + c_y r_{32}) Y + f_y (r_{23} Z + t_2) + c_y (r_{33} Z + t_3)
    = [(f_y r_{21} + c_y r_{31}) X + (f_y r_{22} + c_y r_{32}) Y + (f_y t_2 + c_y t_3)] + (f_y r_{23} + c_y r_{33}) Z
    = A_2 + B_2 Z   (3)

x_3 = h_{31} X + h_{32} Y + h_{33} = r_{31} X + r_{32} Y + r_{33} Z + t_3   (4)
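The following Python sketch illustrates the look-up-table idea of Table 1: the Z-independent terms A1, A2 and the base-plane part of x3 are pre-computed once per camera for every (X, Y) cell, and only the few Z-dependent terms are evaluated per voxel at run time. The names and data layout are our own assumptions based on equations (1)-(4), not the authors' code.

import numpy as np

def build_lut(fx, fy, cx, cy, R, t, xs, ys):
    """Pre-compute A1, A2 and the z=0 part of x3 for every (X, Y) on the base plane."""
    X, Y = np.meshgrid(xs, ys, indexing='ij')
    A1 = (fx * R[0, 0] + cx * R[2, 0]) * X + (fx * R[0, 1] + cx * R[2, 1]) * Y + (fx * t[0] + cx * t[2])
    A2 = (fy * R[1, 0] + cy * R[2, 0]) * X + (fy * R[1, 1] + cy * R[2, 1]) * Y + (fy * t[1] + cy * t[2])
    D0 = R[2, 0] * X + R[2, 1] * Y + t[2]            # x3 at Z = 0
    B1 = fx * R[0, 2] + cx * R[2, 2]                 # Z-dependent coefficients (scalars)
    B2 = fy * R[1, 2] + cy * R[2, 2]
    return A1, A2, D0, B1, B2, R[2, 2]

def project_plane(lut, Z):
    """Project every base-plane cell lifted to height Z using the stored table."""
    A1, A2, D0, B1, B2, r33 = lut
    x3 = D0 + r33 * Z
    return (A1 + B1 * Z) / x3, (A2 + B2 * Z) / x3    # image coordinates (x1/x3, x2/x3)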
3 Meshless Parameterization

The meshless parameterization method [4-8] is used to map the 3D voxel data into a 2D pixel representation; it inherits the good characteristics of convex combinations, such as fast computation and one-to-one mapping. We briefly describe the basic idea of meshless parameterization in the following sections.
3.1 Basic Idea

Meshless parameterization is a 2D representation over some convex parameter domain in which 3D voxels are mapped one-to-one into 2D pixels without using mesh information [4-8]. The method consists of two basic steps. First, the boundary points P_B are mapped onto the boundary of a planar domain D; the corresponding parameter points U = {u_{n+1}, u_{n+2}, …, u_N} are laid around the domain D in counter-clockwise order, and chord-length parameterization is used for the distribution of the parameter points U. In this paper, we use the boundary-following algorithm to search for and order the boundary points of the 3D voxels; the details of the algorithm are described in Section 3.3. In the second step, the interior points are mapped into the domain D. Before the mapping, a neighborhood is chosen for each interior point p_i in P_I from points that are, in some sense, close by; let N_i be the set of neighborhood points of p_i. In this case, a constant radius r is chosen, and the points that fall within the ball of radius r are considered the neighborhood points of each interior point. Then, the reciprocal distance weights method is used to compute the weights λ_ij for each interior point p_i. The parametric points u_i of the interior points can be obtained by solving a linear system of n equations, where n is the number of interior points, as shown in equations (5) and (6) below.
u_i = \sum_{j \in N_i} \lambda_{ij} u_j, \quad i = 1, \dots, n.   (5)

u_i - \sum_{j \in N_i \cap P_I} \lambda_{ij} u_j = \sum_{j \in N_i \cap P_B} \lambda_{ij} u_j, \quad i \in P_I   (6)
From these equations, we rewrite the linear system in the form Au = b, where A is an n×n weight matrix, u holds the parameter points and b is the weighted sum of the neighbor points of u. The linear system can be written in matrix form:

A = \begin{bmatrix} 1 & -\lambda_{12} & \cdots & -\lambda_{1n} \\ -\lambda_{21} & 1 & \cdots & -\lambda_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ -\lambda_{n1} & -\lambda_{n2} & \cdots & 1 \end{bmatrix},   (7)

where a_{ij} = 1 if j = i, a_{ij} = -\lambda_{ij} if j \in N_i, and a_{ij} = 0 otherwise, and

u = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}, \qquad b = \begin{bmatrix} \sum_{j \in N_1} \lambda_{1j} u_j \\ \sum_{j \in N_2} \lambda_{2j} u_j \\ \vdots \\ \sum_{j \in N_n} \lambda_{nj} u_j \end{bmatrix}.

We use Gaussian elimination to compute the inverse of the matrix A. Solving equation (7) then yields the parametric values U for all interior points, which are used to represent the gesture data, and novel features are extracted for gesture recognition.
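A minimal dense-matrix sketch of this step, assuming the neighborhoods N_i and weights λ_ij have already been computed (see Section 3.2); a production implementation would use a sparse solver, and all function names here are our own.

import numpy as np

def solve_parameterization(n_interior, neighbors, weights, boundary_uv):
    """Solve A u = b of eq. (7) for the 2-D parameters of the interior points.

    neighbors[i]  : indices (interior: 0..n-1, boundary: >= n_interior) near point i
    weights[i][k] : reciprocal-distance weight of neighbors[i][k], summing to 1
    boundary_uv   : dict mapping a boundary index to its fixed 2-D parameter point
    """
    A = np.eye(n_interior)
    b = np.zeros((n_interior, 2))
    for i in range(n_interior):
        for j, lam in zip(neighbors[i], weights[i]):
            if j < n_interior:           # interior neighbor -> off-diagonal entry -lambda_ij
                A[i, j] -= lam
            else:                        # boundary neighbor -> contributes to the right-hand side
                b[i] += lam * np.asarray(boundary_uv[j])
    return np.linalg.solve(A, b)         # 2-D parameter point u_i for every interior point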
Fig. 4 illustrates 3D hand voxel data and its 2D representation obtained by meshless parameterization. This meshless parameterization has two drawbacks: first, a large sparse linear system must be solved, and second, the neighborhood computation takes linear search time. Both drawbacks lead to high computational cost when the number of 3D voxels increases, which is not efficient for real-time applications. In order to solve these problems, the meshless parameterization is integrated into the 3D voxel reconstruction. The details of the proposed approach are presented in Section 4.
Fig. 4. Result of meshless parameterization for 3D hand voxel
3.2 Neighborhood Points and Weight

In this section, we describe the method used to compute the neighborhood points and weights. The neighbor points are determined using a ball neighborhood with a constant radius r, and the reciprocal distance weights method is used to compute the weight λ_ij for each interior point p_i. In this process, we reduce the computation time by considering only local interior points instead of all interior points. Equation (8) explains the choice of neighborhoods and weights. The weights are chosen positive, λ_ij > 0 for j ∈ N_i, such that \sum_{j \in N_i} \lambda_{ij} = 1, so that the parameter point u_i of an interior point is some convex combination of its neighbors u_j. Let N_i be the ball neighborhood

N_i = \{ j : 0 < \| p_j - p_i \| < r \},   (8)

for some radius r > 0, and let the λ_ij be the reciprocal distance weights

\lambda_{ij} = \frac{1 / \| p_j - p_i \|}{\sum_{k \in N_i} 1 / \| p_k - p_i \|}.
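A short sketch of equation (8), again with our own function name; it returns, for every point, its ball neighborhood of radius r and the normalized reciprocal-distance weights.

import numpy as np

def ball_neighborhoods(points, r):
    """Ball neighborhoods N_i and reciprocal-distance weights lambda_ij of eq. (8)."""
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    neighbors, weights = [], []
    for i in range(len(pts)):
        idx = np.where((dist[i] > 0.0) & (dist[i] < r))[0]   # 0 < ||p_j - p_i|| < r
        w = 1.0 / dist[i, idx]
        neighbors.append(idx)
        weights.append(w / w.sum())                          # weights sum to 1 (convex combination)
    return neighbors, weights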
3.3 Boundary-Following Algorithm

The boundary-following algorithm is used for surface point extraction and for ordering the boundary points counterclockwise. The 3D voxel data is generated from the silhouette images and intersection points; it contains the whole volume of the 3D object, but the voxels are extracted without any particular order. Meshless parameterization only needs the 3D surface points instead of the whole volume. The boundary-following algorithm [9] is widely used for boundary extraction in 2D images. Thanks to the slice-based base plane generation of the 3D voxels, each slice can be treated as a 2D domain image. The boundary of a connected component S is defined in the 2D domain image as the set of pixels of S that are adjacent to Ŝ. A simple local operation may be used to find pixels on the boundary. For meshless parameterization, we want to track the pixels on the boundary in counterclockwise order. A simple boundary-following algorithm is given as follows (see the code sketch after this list):

1. Find the starting pixel s ∈ S for the region using a systematic scan, say from left to right and from bottom to top of the image.
2. Let the current pixel in boundary tracking be denoted by c. Set c = s and let the 4-neighbor to the west of s be b ∈ Ŝ.
3. Let the eight 8-neighbors of c, starting with b in counterclockwise order, be n1, n2, …, n8.
4. Find n_i, for the first i that is in S.
5. Set c = n_i and b = n_{i-1}. Repeat steps 3 and 4 until c = s.

This process is very important for determining the boundary points. The starting point of the boundary is the leftmost bottom position, and the boundary point sequence is ordered counterclockwise.
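The sketch below implements the Moore-style boundary following described above for a single binary slice; the stopping rule (return to the start pixel) is the simple one given in the steps, and the function name is our own.

import numpy as np

# 8-neighbor offsets (row, col) in counterclockwise order, starting at the west neighbor
CCW = [(0, -1), (1, -1), (1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1)]

def follow_boundary(slice_img):
    """Return the boundary pixels of the foreground region in counterclockwise order."""
    S = np.asarray(slice_img, dtype=bool)
    rows, cols = np.nonzero(S)
    if len(rows) == 0:
        return []
    # step 1: scan bottom-to-top, left-to-right for the starting pixel
    start = min(zip(rows, cols), key=lambda rc: (-rc[0], rc[1]))
    c, b = start, (start[0], start[1] - 1)           # step 2: backtrack pixel is the west neighbor
    boundary = [c]
    while True:
        # step 3: enumerate the 8-neighbors of c counterclockwise, starting at b
        k0 = CCW.index((b[0] - c[0], b[1] - c[1]))
        for k in range(1, 9):                        # step 4: first neighbor that lies in S
            dy, dx = CCW[(k0 + k) % 8]
            n = (c[0] + dy, c[1] + dx)
            if 0 <= n[0] < S.shape[0] and 0 <= n[1] < S.shape[1] and S[n]:
                pdy, pdx = CCW[(k0 + k - 1) % 8]
                b = (c[0] + pdy, c[1] + pdx)         # step 5: keep the previously examined pixel
                c = n
                break
        else:                                        # isolated pixel: no foreground neighbor
            break
        if c == start:                               # stop when back at the start pixel
            break
        boundary.append(c)
    return boundary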
4 Meshless Parameterization in 3D Voxel Reconstruction

The objective of the proposed system is to map the 3D voxels into a 2D pixel representation that is useful for object recognition or related applications. At the same time, the integration of meshless parameterization in 3D voxel reconstruction reduces the computational complexity and memory storage. The overall proposed system of meshless parameterization for dimensional reduction integrated in pre-computing 3D voxel reconstruction using a single PC is described by the following algorithm and Fig. 5:

1. Precompute the 2D silhouette points corresponding to the 3D voxels and store them in a look-up table.
2. Synchronize the multiple images by taking them simultaneously with multiple cameras.
3. Extract the silhouettes by background subtraction based on the RGB color threshold.
4. Reconstruct the 3D voxel data volume using post-computing based on the look-up table.
   A. Every 3 slices, compute the neighbor points and weights for local meshless parameterization.
5. Repeat step 4 until the 3D voxel reconstruction is complete.
The existing meshless parameterization method requires solving a large linear system to compute the 2D parameterization. In order to solve the large sparse linear system, an iterative solver such as Gauss-Seidel or Krylov subspace methods [8] is used. Our proposed method overcomes the large linear system by using local 3D voxels for the meshless parameterization process. This local approach is introduced during the process of reconstructing the 3D voxels. The 3D reconstruction method is based on shape from silhouettes and generates the 3D voxels plane by plane. The original method of meshless parameterization of 3D voxels into a 2D representation gathers all the 3D voxel points and splits them into 2 disjoint sets. This process generates a large sparse linear system during the 2D parameterization computation. In addition, the processing time of the neighborhood computation increases rapidly with the number of voxel points. Therefore, we introduce a meshless parameterization integrated into the process of reconstructing the 3D object.

Fig. 5. The overall process of the proposed system

Fig. 5 shows the process of 3D reconstruction, where the voxels are generated plane by plane. The first generated plane corresponds to the boundary points, one of the two disjoint subsets of points for meshless parameterization; therefore the boundary points are discriminated easily from the interior points. Only in the first plane are the boundary points ordered counter-clockwise using the boundary-following algorithm. The process is carried out for the next plane of 3D voxels, and these voxels are stored as interior points. The meshless parameterization using local 3D voxels partially computes the 2D parameterization whenever three consecutive planes have been generated in the 3D reconstruction, from the bottom to the upper plane. This process of meshless parameterization is illustrated by the following pseudocode:
1. Assume the first generated plane of 3D voxel data to be the boundary points.
2. Order the boundary points using the boundary-following algorithm.
3. Map the boundary points into the 2D domain using the chord-length parameterization.
4. Obtain the next two consecutive planes of 3D voxel data as interior points.
5. Compute the neighbor points and weights.
6. Solve the linear system to compute and update the interior point parameters.
7. Treat the 2D parameter points of the first plane of interior points as mapped boundary points in the 2D domain.
8. Obtain the following plane and keep the last plane of 3D voxels as interior points.
9. Repeat steps 5 to 9 until the last plane has been generated.
5 Experimental Results

In this paper, we proposed two methods and integrated them into one system. The first method is pre-computing for 3D voxel reconstruction, and the second is meshless parameterization for dimension reduction. Four cameras are placed around a defined workspace to capture the object. The system runs on a personal computer (Intel® Xeon™ CPU 3.20 GHz, 2.56 GB RAM, Windows XP OS). We tested the overall performance of the general method, the look-up table method and our proposed method for reconstructing 3D voxels. Fig. 6 shows the total processing time for each method. The general method took the longest processing time compared to the look-up table method and our proposed method. The look-up table method is slightly faster than our proposed method; however, it uses more memory storage than our proposed method.
Fig. 6. Total processing time (sec) versus the number of voxels (40*40*40 to 100*100*100) for the proposed method, the look-up table method, and the general method
We have explained the theoretical method of meshless parameterization for the existing approach and for the proposed approach integrated in 3D voxel reconstruction. In this experimental section, we evaluate the similarity of the proposed approach with the original approach. Three virtual 3D opened cubes of 5x5x5, 7x7x7 and 11x11x11 voxels were created for validation purposes; the 3D opened cube voxels are saved in a binary format file. Fig. 7 shows the result of meshless parameterization for the 5x5x5 3D opened cube: Fig. 7(a) is the result of the original approach and Fig. 7(b) is the result of the proposed approach.
Fig. 7. Results of meshless parameterization for both approaches on a 5x5x5 voxel 3D opened cube

Table 2 shows the error difference between the proposed method and the existing method. The absolute error is used to measure the similarity of the proposed approach with the existing one. The results show that when the number of voxel points increases, the average absolute error decreases. The average minimum distance is computed between the resulting 2D parametric points; it also decreases with increasing voxel size.

Table 2. Results of error difference between existing and proposed method

3D opened cube (volume size)   Average of absolute error   Average of minimum distance
5x5x5                          0.00225052                  0.141804
7x7x7                          0.00108899                  0.031601
11x11x11                       0.00063468                  0.025904
In the meshless parameterization method, we need to solve a large linear system and compute the inverse of a matrix whose size depends on the number of interior points. When solving the system of linear equations Au = b, where A is an n×n weight matrix, u holds the parameter points and b is the weighted sum of the neighbor points of u, the inverse A^{-1} of the matrix A can be determined in Θ(n³) time. The existing meshless parameterization method considers all interior points when solving the linear system, so computing A^{-1} consumes much running time since all interior points are counted.
In our proposed approach, the meshless parameterization is executed for every three slices of the 3D voxel reconstruction. It has less running time since the number of interior points is reduced from all points to only some of the points. The inverse of the matrix A in our proposed approach can be determined in Θ((n − g)³) time, where g < n. The parameter g is the large number of interior points which are not included in the derived slices. With this approach, we reduce the running time complexity and the memory space of the computation process. In addition, this approach performs two tasks in one system: we adopt the advantages of the 3D voxel reconstruction approach to perform the dimension reduction using meshless parameterization. This approach solves the sparse linear system and is able to maintain the existing 3D information for any related applications.
6 Conclusions and Future Work

Our primary aim is to reduce the 3D voxel complexity without losing the generality of the information, while satisfying the critical requirements of speed and robustness in representing 3D objects for related applications. This paper presented a solution to the large sparse linear system of meshless parameterization, integrated into a pre-computing 3D voxel reconstruction. In addition, we proposed an efficient and fast method for reconstructing the 3D voxels in real time. The proposed method, using a look-up table and post-computing to reconstruct the 3D voxels, shows a total processing time nearly equal to that of the look-up table method; however, our proposed method greatly reduces memory storage compared to the look-up table. For the inverse matrix of the meshless parameterization operation, the running time is Θ((n − g)³), where g < n and the parameter g is close to n; therefore, the running time is much smaller than in the existing approach. The integration of 3D reconstruction and dimension reduction makes this system not only efficient in terms of processing time, it also greatly reduces memory usage. Therefore, this method promises better performance in real-time applications. For future work, more experiments need to be carried out on various kinds of 3D objects to test and validate the robustness and efficiency of the proposed system. Object recognition and classification remain to be performed in order to judge whether our proposed 3D voxel reconstruction and meshless parameterization for dimensional reduction preserve the shape information. Acknowledgments. This work was supported by the Soongsil University Research Fund.
References
1. Matsuyama, T., Wu, X., Takai, T., Wada, T.: Real-Time Dynamic 3D Object Shape Reconstruction and High-Fidelity Texture Mapping for 3D Video. IEEE Transactions on Circuits and Systems for Video Technology 14(3), 357–369 (2004)
2. Kong, G., Kanade, T., Bouguet, J., Holler, M.: A Real Time System for Robust 3D Voxel Reconstruction of Human Motion. In: Proceedings IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head Island, SC, USA, vol. 2, pp. 714–720 (2000)
3. Van der Maaten, L.J.P., Postma, E.O., Van den Herik, H.J.: Dimensionality Reduction: A Comparative Review (2007)
4. Floater, M.S.: Meshless Parameterization and B-spline Surface Approximation. In: Cipolla, R., Martin, R. (eds.) The Mathematics of Surfaces IX, pp. 1–18. Springer, Heidelberg (2000)
5. Floater, M.S., Reimers, M.: Meshless Parameterization and Surface Reconstruction. Computer Aided Geometric Design, 77–92 (2001)
6. Floater, M.S., Hormann, K.: Surface Parameterization: a Tutorial and Survey. Advances in Multiresolution for Geometric Modelling, 157–186 (2004)
7. Lee, Y., Kyoung, D., Han, E., Jung, K.: Dimension Reduction in 3D Gesture Recognition Using Meshless Parameterization. In: Chang, L.-W., Lie, W.-N., Chiang, R. (eds.) PSIVT 2006. LNCS, vol. 4319, pp. 64–73. Springer, Heidelberg (2006)
8. Voledine, T., Roose, D., Van der straeten, D.: Efficient Triangulation of Point Clouds using Floater Parameterization. Report TW385 (2004)
9. Jain, R., Kasturi, R., Schunck, B.G.: Machine Vision. McGraw-Hill, Inc., New York (1995)
10. Jolliffe, I.: Principal Component Analysis. Technical report, Springer (October 2002)
11. Kruskal, J.B., Wish, M.: Multidimensional Scaling, Ch. 1, 3, 5, pp. 7–19, 48, 73. Sage Publications Inc., Newbury Park, CA (1978)
12. Kyoung, D., Lee, Y., Beak, W., Han, E., Yang, J., Jung, K.: Efficient 3D Voxel Reconstruction using Precomputing Method for Gesture Recognition. In: First Korea Japan Joint Workshop on Pattern Recognition (2006)
13. GML C++ Camera Calibration Toolbox, http://research.graphicon.ru/calibration/gml-c++cameracalibration-toolbox-3.html
An Efficient Biocryptosystem Based on the Iris Biometrics Ali Shojaee Bakhtiari, Ali Asghar Beheshti Shirazi, and Babak Zamanlooy Department of Electrical Engineering Iran University of Science and Technology Narmak, 16846, Tehran, Iran {Ali_Shojaeebakhtiari, Babak_Zamanlooe}@ee.iust.ac.ir,
[email protected]
Abstract. A new and efficient method for combining iris biometrics with custom cryptographic schemes to obtain an efficient biocryptosystem is proposed in this paper. Though the structure of the method is basically derived from a previously described biocryptosystem scheme, the introduction of new image processing methods, alongside the efficient utilization of traditional methods, shows promising developments compared with the previous biocryptosystem, especially regarding the generation of longer cryptographic key strings while keeping the system quality. Keywords: Authentication system, biocryptosystem, error correction coding, iris segmentation.
1 Introduction

Traditionally, the research fields of biometric and cryptographic authentication have been considered two distinct areas with different operational backgrounds. Biometric authentication systems are generally based on a fuzzy comparison of the claimed data with previously stored reference data, where adequate similarity between the claiming and the original data results in positive authentication. The logic behind cryptography-based authentication, on the other hand, is absolutely exact: only an exact match between the claiming and reference data, the authentication key in this respect, results in positive authentication. The deep difference between the natures of the two areas, one relying on fuzzy and the other on exact comparisons, has made their effective combination face various obstacles. The obstacles facing the combined method and the counteractions proposed to overcome each of them were first analyzed in [1]. Overall, three different challenges facing the combined structure have been classified, and different approaches have been taken to overcome each of them. The first work to consider an effective and concrete implementation of a biocryptosystem on this basis was proposed by Hao, Anderson and Daugman [1]. The proposed system performs considerably well with key lengths of up to 140 bits; however, as shown in [1], the efficiency of the system starts to deteriorate once the chosen key length becomes longer.
Therefore, in order to achieve longer key lengths, which are required by the ever-growing need for more security, in this paper we propose a new biocryptosystem based on iris biometrics. Considering the successful implementation results of the system proposed in [1], the basis of our system is derived from that system, which from now on shall be called Hao's system. However, in order to reach the desired objectives, new code extraction and generation methods are designed and utilized in our system. The paper is organized as follows: in Section 2, Hao's system [1] is described and the strengths and weaknesses of the method, from the view of the authors, are analyzed; in Section 3, the proposed iris image processing and code generation methods are introduced and analyzed; in Section 4, the proposed biocryptosystem is introduced and its different modules are described; and finally, in Section 5, the results of the designed system are analyzed and compared with other systems.
2 Hao's System

Fig. 1 shows the basic two-factor scheme of Hao's system. In this system the key depends on a combination of a biometric and a token, in which the information required for error correction of the received code is stored. In order to make the system resilient against the three drawbacks mentioned in the introduction of this paper, the designers of Hao's system considered a series of provisions. In order to overcome the contradicting natures of the fuzziness of biometrics and the exactitude of cryptography, a set of error correction codes, which effectively detect and correct the errors caused by the presence of noise, is added to the design.
Fig. 1. Two factor scheme for biometric key generation [1]
Hao's system works as follows: In the first step a random key k is generated in the random key generation module. Afterwards, the generated key is entered into the coding unit, which encodes the random key string into a 2048-bit string, adapted to the length of Daugman's iris code [7]. The encoder block consists of two blocks, a random error correction coding unit and a burst error correction coding unit. During the decoding process, the sample iris code is delivered to the system to unlock the locked iris code. The obtained pseudo iris code (not necessarily equal to the original one) is then entered into the decoding module. The decoding module applies random error correction and burst error correction decoding to regenerate the random key string k̂. Should the claiming iris and the reference iris match each other to the extent that the minute differences between the two codes are correctable by the decoding module, the obtained key k̂ will be equal to the original key string k. A simple comparison between the hash value of k̂ and the stored hash value of k reveals whether the authentication is positive or not. The entire decoding process is formalized as:

\langle \theta_{sam}, \tau \rangle \Rightarrow \hat{k}   (1)
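To make the lock/unlock flow concrete, the sketch below shows a simplified key-binding scheme of the kind described above, in which the encoded key is XORed with the reference iris code to form the locked code stored on the token, and unlocking XORs the sample iris code back in before error-correction decoding and hash comparison. The XOR binding, the placeholder encode/decode functions and all names are our own illustrative assumptions; they are not the exact construction of [1].

import hashlib
import numpy as np

def lock(key_bits, ref_iris_code, encode):
    """Bind a random key to the reference iris code; returns the token contents."""
    codeword = encode(key_bits)                       # ECC-encoded key, same length as the iris code
    locked = np.bitwise_xor(codeword, ref_iris_code)  # locked code
    h = hashlib.sha256(np.packbits(key_bits).tobytes()).hexdigest()
    return locked, h                                  # token tau = (locked code, hash of k)

def unlock(token, sample_iris_code, decode):
    """Attempt key recovery from a sample iris code, eq. (1): <theta_sam, tau> => k_hat."""
    locked, h = token
    pseudo_codeword = np.bitwise_xor(locked, sample_iris_code)  # errors = iris-code differences
    k_hat = decode(pseudo_codeword)                             # ECC decoding corrects small differences
    ok = hashlib.sha256(np.packbits(k_hat).tobytes()).hexdigest() == h
    return k_hat if ok else None                                # positive authentication only on exact hash match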
Table 1 shows the implementation results of Hao's method for different key lengths.

Table 1. Performance of Hao's system for k_hadamard = 6 [1]

RS corrected blocks   Length of biocrypto key   FRR% (false reject rate)   FAR% (false accept rate)
0                     224                       12.22                      0
1                     210                       6.5                        0
2                     196                       3.65                       0
3                     182                       2.06                       0
4                     168                       1.26                       0
5                     154                       0.79                       0
6                     140                       0.47                       0
Hao's system shows great improvement compared with previously proposed biocryptosystems, notably regarding FRR%. The obtainable key length has also improved greatly compared with previous works. However, from the authors' point of view, better results are obtainable with some reconsiderations, which are the basis of this paper. The following section introduces the reconsiderations proposed by the authors.
3 The Proposed Image Processing and Code Generation

In order to fulfill the needs of the proposed system mentioned in Section 2, a group of image processing methods is utilized for our purposes. A new iris segmentation method designed for the purpose of this paper is introduced in this section. The aim of the segmentation method is to choose the areas most suitable for biocryptosystem purposes.

3.1 Image Processing and Feature Enhancement
After the iris image has been captured and the initial preprocessing has been done, the classic approach is to apply an efficient segmentation method such as the Hough transform or the Canny operator [11] to the image in order to separate the iris data from the rest of the image. The segmentation method developed by the authors for
selecting the desired regions of the iris automatically omits the need for this step. However, the segmentation method, which shall be explained later in this section and which shall be called the local entropy method from now on, initially requires special features of the image to be enhanced in order to operate properly. For two distinct reasons related to the objectives of this work, the phase congruency based iris feature enhancement described in [4], based on the work of Kovesi [5],[12] and Morrone [8], is chosen as the basis for the feature enhancement in this work. The first reason for selecting the method is its robustness against unwanted brightness and contrast changes in the received image, which act as an important source of burst errors in the biocryptosystem. The other reason is the ability of the method to remove unwanted random noise while simultaneously enhancing the required edge features of the image, which is a unique feature of phase-based methods compared with the majority of spatial methods. In order to support the claim of the efficiency of the phase congruency based feature enhancement, a group of experiments on the ability of the method to recover the desired features of the image in the presence of different interferences was performed: initially a group of features in the unenhanced images was chosen, the unenhanced images were distorted by different kinds of distortions, and the selected features were sought again in the images. In the next step, a group of selected features in the enhanced images was chosen, the enhanced images were distorted by the same kinds of distortions, and the selected features were again sought in the images. The results are presented in Table 2. As can be seen from Table 2, the phase congruency method works well in the presence of deep contrast changes and is able to recover the majority of the assigned feature points correctly. The method also shows some improvement in feature extraction in the presence of image noise, which, considering the fact that the method enhances the edges of the image, is quite noteworthy.

Table 2. Different distortions and related percent of recovered feature points

Distortion                    Recovered feature points
                              no enhancement %    Phase congruency %
Addition of 2% Noise          50%                 93.35%
Addition of 10% Noise         28.57%              38.71%
Rotation by 2 degrees         35.71%              61.62%
Rotation by 10 degrees        33.24%              46.86%
Contrast reduction by 10%     21.42%              95.2%
Contrast reduction by 50%     20.87%              92.98%
Contrast Increase by 10%      42.57%              93.35%
Contrast Increase by 50%      22.58%              52.39%

3.2 Image Segmentation and the Local Entropy Method
After the features of the image have been enhanced, it is time to segment the iris image in order to derive the segments suitable for biocryptosystem purposes. In order to choose the best regions for the purpose of this paper, the authors propose a method named the local entropy method. The method, which is explained in this subsection, is based on locating the high-entropy regions of the surface of the iris for code generation purposes. According to Shannon's 2nd theorem [2], if the event i occurs from a set of valid events with probability p_i, the amount of uncertainty related to the event is equal to:

H_i = -\log_2(p_i) \quad (bits/symbol)   (2)

and the amount of uncertainty that the source of the events generates is equal to:

H = -\sum_i p_i \log_2(p_i) \quad (bits)   (3)
From equation (2) it can be seen that the highest amount of uncertainty from an information source is realized when the output symbols of the source are equally probable. The idea behind the local entropy method is to divide the processed image into separate regions and then to analyze each region separately as an information source. The amount of entropy calculated for each region gives an overview of the level of correlation between the individual blocks (bits) in the selected region. Research by J. Daugman [3] has shown that, on average, the iris image has a discrimination entropy of 3.2 bits per square millimeter, which is indeed a suitably high value for identity recognition purposes and comparison-based structures. The process of deriving the local entropy of the image begins with the acquisition of the image; after this step, the initial preprocessing is performed to prepare the image for the main processing; after preprocessing, the extracted image is further processed to reveal its hidden features; after the features are revealed and enhanced, the local entropy segmentation process divides the obtained result into separate regions, each containing a portion of the enhanced features; in this step, treating each section as an information source, the entropy of each of the segments is calculated separately; and finally the obtained entropy values are sorted to deliver the entropy function of the image. In the rest of the paper, unless stated otherwise, Fig. 2 is considered the reference image.
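A minimal sketch of the local entropy computation described above: the enhanced image is divided into square segments, each segment's gray-level histogram is treated as an information source, its Shannon entropy (3) is computed, and the values are sorted to form the entropy curve. Function and parameter names are our own assumptions.

import numpy as np

def sorted_local_entropy(image, n_blocks=10, levels=256):
    """Split the image into n_blocks x n_blocks segments and return their sorted entropies."""
    img = np.asarray(image)
    h_step = img.shape[0] // n_blocks
    w_step = img.shape[1] // n_blocks
    entropies = []
    for by in range(n_blocks):
        for bx in range(n_blocks):
            block = img[by * h_step:(by + 1) * h_step, bx * w_step:(bx + 1) * w_step]
            hist, _ = np.histogram(block, bins=levels, range=(0, levels))
            p = hist / hist.sum()                              # treat the segment as an information source
            p = p[p > 0]
            entropies.append(float(-(p * np.log2(p)).sum()))   # Shannon entropy, eq. (3)
    return np.sort(entropies)[::-1]                            # sorted from highest to lowest entropy

For the biocryptosystem setting of Table 3, only the top 10% of this curve (with the coarse 10-by-10 segmentation) would then be retained.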
Fig. 2. Reference image
Fig. 3 shows the result of calculating the entropy curve of the reference image for 100 segments. After the entropy of the surface is calculated, it is the necessity of the application that dictates how much of the iris surface is to be segmented for the best performance. Another question that arises here is the size of the segmentation blocks. Apparently, the finer the blocks are chosen, the better the entire surface is segmented; however, making the segmentation too fine also results in the selection of regions where image noise is still present, as the presence of noise acts as a source of entropy in a region.
Fig. 3. Sorted Local entropy in bits/segment, for 100 local segments of fig.2
In biocryptosystem applications it is necessary not to let phony regions enter the process of key generation. Experimental results therefore show that the best practice when segmenting the iris for such purposes is to choose the highest-entropy regions and to use large segments in order to prevent noisy regions from masquerading as desirable regions. For comparison-based purposes, in which not every single bit but the iris surface as a whole is important, it is a good practice to choose fine regions and also a larger portion of the entropy curve, so as to segment most of the iris surface. Table 3 shows the experimentally obtained results for the best segmentation practices and the percent of the local entropy curve chosen for different applications. One point that is necessary to mention here is the potential weakness of the method in the presence of eyelashes in the captured image. As eyelashes generally occur in the image with a random pattern, this pattern generally leads to unwanted uncertainty inside the image, and therefore an undesired source of entropy is generated
by the eyelashes. Moreover, in designing biocryptosystems, the presence of eyelashes is modeled as channel burst errors [1], and overcoming their negative effects requires sacrificing a great deal of system capacity to cover the burst noise. It is therefore necessary to detect the areas of the image affected by eyelashes and to blacklist them. Various eyelash detection methods have been proposed in the iris processing literature, from which the method proposed by Kong and Zhang [6] is chosen by the authors for its efficiency and ease of implementation.

Table 3. Experimentally obtained results for best segmentation and surface selection
Application | No. of segmentation blocks | Percent of sorted curve used (from higher to lower entropy)
Biocryptosystem | 100 (10 × 10) | 10%
Comparison-based purposes (eyelashes present in the image database) | 900 (30 × 30) | 30%
Comparison-based purposes (eyelashes absent from the image database) | 900 (30 × 30) | 50%
3.3 Iris Code Generation
After the local entropy segmented iris data are chosen, the iris code is generated from the sorted local entropy data. The process begins with normalizing the sorted local entropy data to values between 0 and 1. The data are then quantized against a group of assigned thresholds to limit the number of output bits. Depending on the thresholds chosen for code generation, a group of non-overlapping bit strings is associated with each threshold; for example, a 10-level thresholding requires the assignment of a 4-bit string to each quantized value. For the database consisting of 320 × 280 images, this corresponds to 89,600 pixels in total. As mentioned in the next section, for the system to work with the standard 2048-bit strings, a set of provisions must be made. The detailed procedure for obtaining the iris code by this method is presented in [9] and is briefly described in the next section.
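The normalization and thresholding step can be sketched as follows: the sorted local-entropy values are scaled to [0, 1], quantized with a step of 0.1 (ten levels), and each level is mapped to a 4-bit string. This is only an illustrative sketch; the exact bit-string assignment used by the authors is described in [9], so the plain binary level index used here is an assumption.

```python
import numpy as np

def entropy_to_bits(entropy_values, step=0.1):
    """Quantize normalized entropy values into 4-bit strings (10 levels)."""
    v = np.asarray(entropy_values, dtype=np.float64)
    v = (v - v.min()) / (v.max() - v.min() + 1e-12)                  # normalize to [0, 1]
    levels = np.minimum((v / step).astype(int), int(1 / step) - 1)   # levels 0..9
    return "".join(format(int(level), "04b") for level in levels)

if __name__ == "__main__":
    demo = [3.1, 2.8, 2.2, 1.9, 0.7]
    bits = entropy_to_bits(demo)
    print(bits, len(bits), "bits")   # 5 values -> 20 bits
```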
4 The Biocryptosystem

In this section the proposed biocryptosystem is introduced and analyzed. As mentioned in the previous sections, the design is based on the biocryptosystem proposed by Hao, Anderson and Daugman in [1]. However, in order to fulfill the objectives of this paper, several modules of that system have been altered. Fig. 4 shows the block diagram of the system proposed by the authors.
Fig. 4. Block diagram of the proposed biocryptosystem (the key generation unit and Hamming coding produce the token containing the locked code and h(k) from the enhancement, local entropy and code stage applied to the reference image; the sample image passes through the same stage, followed by Hamming decoding and a match test that ends either in successful key recovery or in failure)
Comparing Hao's system with the presented system, some modules have been either omitted or altered in our proposal. The first and most important alteration is the replacement of the iris-code module of Hao's system with the local-entropy-based code generation unit. The second change is the omission of the burst error correction coding and decoding modules. As already mentioned, and as will be shown in the results section, omitting burst error correction gives a free hand in choosing longer, though still limited, key string lengths. Also, in order to analyze the efficiency of the image processing part of the system, the strong Hadamard code described in Hao's system has been replaced by a simple Hamming code.
Apart from these structural changes, the operation of the proposed system is identical to Hao's system. As in Hao's system, the presented system initially generates a random key string, which is passed through the coder bank to generate a pseudo iris-code string. This string is then XORed with the iris code to generate the locked iris code. During the decoding phase, the sample iris code is again XORed with the locked iris code stored in the token to reveal the supposed pseudo iris code, which is delivered to the decoding module. If the difference between the supposed pseudo iris code and the original pseudo iris code is within the correction ability of the decoder bank, the correct key string is extracted, and a comparison between the hash value of the derived key and the stored hash value of the original key confirms the whole process.

It should be mentioned that since the standard iris-code generation method is not used here, we are not obliged to follow the traditional 2048-bit scheme. However, in order to make a fair comparison with Hao's work, the iris-code generation module is designed to generate the standard 2048-bit iris-code length. The procedure for generating the 2048-bit string is as follows. The CASIA database consists of images of 320 × 280 pixels each, or equivalently 89,600 total pixels. The local entropy method selects 10% of the total pixels, which yields 8,960 pixels. In this work the threshold step is chosen to be 0.1 in the 0-to-1 normalized space, which gives 10 threshold levels and therefore 4 bits of data per pixel, for a total of 35,840 bits of data. However, considering the (15, 11) Hamming code used in the system, the system requires only
$2048 \times \frac{11}{15}$, or around 1502, bits of data. In order to reduce the available data to the required number of bits, a 6 × 4 averaging mask is applied to the sorted local entropy data. This operation results in a 1493-bit string. The difference between the 1502- and 1493-bit strings is filled with a randomly generated bit string, or with a simple null string for simulation purposes. The resulting string is fed to the encoding block and the 2048-bit string is obtained. Excluding the burst error correction module from the system structure immediately raises the question of how to deal with burst errors. As will be shown in the next section, the image processing methods applied in addition to the local entropy method greatly reduce the negative effects of the typical causes of burst noise.
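The lock/unlock flow described above can be sketched as follows with a systematic (15, 11) Hamming code and SHA-256 standing in for the hash h(k). This is a simplified illustration under stated assumptions, not the authors' implementation: the averaging mask, the random-fill provisioning and the permutation coding are omitted, and the 154-bit demo key (14 blocks of 11 bits) is just one of the key lengths listed in Table 4.

```python
import hashlib
import secrets

PARITY_POS = (1, 2, 4, 8)
DATA_POS = tuple(p for p in range(1, 16) if p not in PARITY_POS)   # 11 data positions

def hamming15_11_encode(data_bits):
    """Encode 11 bits into a 15-bit Hamming codeword (even parity)."""
    code = [0] * 16                          # positions 1..15 used
    for pos, bit in zip(DATA_POS, data_bits):
        code[pos] = bit
    for p in PARITY_POS:                     # each parity covers positions with bit p set
        code[p] = 0
        for i in range(1, 16):
            if i != p and (i & p):
                code[p] ^= code[i]
    return code[1:]

def hamming15_11_decode(code15):
    """Correct up to one bit error and return the 11 data bits."""
    code = [0] + list(code15)
    syndrome = 0
    for i in range(1, 16):
        if code[i]:
            syndrome ^= i
    if syndrome:                             # syndrome equals the error position
        code[syndrome] ^= 1
    return [code[pos] for pos in DATA_POS]

def lock(key_bits, iris_bits):
    """Encode the random key into a pseudo iris-code and XOR it with the iris code."""
    pseudo = []
    for i in range(0, len(key_bits), 11):
        pseudo += hamming15_11_encode(key_bits[i:i + 11])
    locked = [a ^ b for a, b in zip(pseudo, iris_bits)]
    return locked, hashlib.sha256(bytes(key_bits)).hexdigest()      # token = locked code + h(k)

def unlock(token, sample_iris_bits):
    """XOR with a fresh iris code, decode, and verify the key hash."""
    locked, key_hash = token
    noisy_pseudo = [a ^ b for a, b in zip(locked, sample_iris_bits)]
    key_bits = []
    for i in range(0, len(noisy_pseudo), 15):
        key_bits += hamming15_11_decode(noisy_pseudo[i:i + 15])
    ok = hashlib.sha256(bytes(key_bits)).hexdigest() == key_hash
    return key_bits if ok else None

if __name__ == "__main__":
    key = [secrets.randbits(1) for _ in range(154)]      # 14 blocks of 11 bits
    iris = [secrets.randbits(1) for _ in range(14 * 15)]
    token = lock(key, iris)
    sample = list(iris)
    sample[3] ^= 1                                       # one flipped bit is corrected
    assert unlock(token, sample) == key
```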
5 Experimental Results

In order to analyze the efficiency of the proposed system, a series of experiments was carried out. The CASIA database [10] was used as the image database. It consists of 108 distinct iris classes, each captured in 7 different states, for a total of 756 images. The results of applying the system to the image database for different chosen key lengths are presented in Table 4. In designing the system, three alterations are proposed for improving the operation of Hao's system. As already mentioned in the paper, the first alteration is
the replacing of the Hao’s image processing method with the phase based feature enhancement method, the 2nd is the introduction of the local entropy method combined with eyelash detection methods for choosing the suitable regions for biocrypto purposes and the 3rd alteration is the replacement of the heavy loading error correction blocks of the Hao’s system with simple linear error correction codes to reduce the overload imposed to the system by the coders. Table 41. False reject rate of the system for different chosen key lengths Length of biocrypto key 224
FRR% before applying error correction coding(Hamming) 16.6
FRR% after applying error correction coding(Hamming) 1.6
210
23.6
1.3
196
29.3
2
182
31.3
1.3
168
24
2.6
154
22
1.3
140
23
1.6
Fig. 5. Comparison of the FRR% of Hao's system and the proposed system
Analyzing Table 4 and Fig. 5, and comparing the results of Table 4 with those of Table 1, leads to the following conclusions. To analyze the efficiency of the image processing and burst-prevention methods proposed in the system, a fair point of comparison in Hao's system must be chosen, because the burst error correction module is practically omitted in our system. The only key length of Hao's system for which the effect of burst error correction coding is not present is the 224-bit key length.
¹ It must be noted and emphasized here that the results in Table 4 are strictly for qualitative comparison; as the authors of [1] have not shared their original database with the public, no quantitative comparison between the results in this paper and reference [1] is logically valid.
Comparison between the results in Table 4 and Table 2 shows that the system error at a key length of 224 is on the scale of 2%, compared with an error level of 12% for Hao's system. As the main cause of error in Hao's system is the presence of burst errors, this result shows that the system has been able to overcome the negative effect of unwanted burst noise.

Further supporting evidence for the strength of the method against burst noise can also be deduced from Table 4. As can be seen from the second column of Table 4, before applying the error correction coding block the FRR% of the system is on the scale of about 25%; after introducing the error correction coding module, which consists only of a simple Hamming code, the error rate is reduced to the scale of about 2%. Considering that the Hamming code is not capable of correcting burst errors, this implies that burst error noise was not present in the first place; otherwise, its negative effects would have been reflected in the system output.

As can also be seen from Table 4, the FRR of the system for different key lengths is nearly constant, on the scale of about 2%, whereas Hao's system shows an FRR ranging from 0.47% for a 140-bit key to 12.22% for a 224-bit key. It can therefore be deduced that Hao's system performs better at shorter key lengths, but as the chosen key length increases, the proposed system shows superiority over Hao's system. From the complexity point of view, the system is superior to Hao's system in the encoding process, because the relatively computationally complicated Hadamard and Reed-Solomon codes are replaced by simple Hamming and permutation coding.
6 Conclusion

In this paper the possibility of implementing an effective biocryptosystem was analyzed and confirmed. Comparison between the system and previously implemented systems with different structural backgrounds shows promising results. The system also provides a strong basis for obtaining the longer key lengths required by higher-security applications.
Acknowledgment. This work was supported by the Iran Telecommunication Research Center under contract agreement number T/500/150/50.
References

1. Hao, F., Anderson, R., Daugman, J.: Combining crypto with biometrics effectively. IEEE Transactions on Computers 55, 1081–1088 (2006)
2. Shannon, C.E.: A Mathematical Theory of Communication. Bell System Technical Journal 27, 379–423, 623–656 (1948)
3. Chen, W.: Linear Networks and Systems. Wadsworth, Belmont, CA, USA (1993)
4. Daugman, J.: How Iris Recognition Works. IEEE Transactions on Circuits and Systems for Video Technology 14, 21–30 (2004)
5. Shojaee Bakhtiari, A., Beheshti, A.-A., Zamanlooy, B.: Phase congruency based image enhancement method and its application in enhancing iris feature extraction. In: Proceedings of the Iranian Conference on Electrical Engineering (2007)
6. Kovesi, P.: Phase Congruency Detects Corners and Edges. School of Computer Science & Software Engineering, The University of Western Australia (2003)
7. Kong, W.-K., Zhang, D.: Accurate iris segmentation based on novel reflection and eyelash detection model. In: Proceedings of the International Symposium on Intelligent Multimedia, Video and Speech Processing (2001)
8. Daugman, J.: Biometric personal identification system based on iris analysis. US Patent 291560 (1994)
9. Morrone, M.C., Owens, R.A.: Feature detection from local energy. Pattern Recognition Letters 6, 303–313 (1987)
10. Shojaee Bakhtiari, A.: Design of a Biocryptosystem Based on the Key Extracted from the Iris Biometrics. MSc Thesis, School of Electrical Engineering, Iran University of Science and Technology, Tehran, Iran (2007)
11. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, http://www.cbsr.ia.ac.cn
12. Canny, J.: A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 8, 679–714 (1986)
13. Kovesi, P.D.: MATLAB Code for Calculating Phase Congruency and Phase Symmetry/Asymmetry (1996), http://www.cs.uwa.edu.au/_pk/Research/MatlabFns/
Subjective Image-Quality Estimation Based on Psychophysical Experimentation

Gi-Yeong Gim¹, Hyunchul Kim¹, Jin-Aeon Lee², and Whoi-Yul Kim¹

¹ Department of Electronics and Computer Engineering, Hanyang University, Haengdang-Dong, Sungdong-Gu, Seoul, 133-792, Korea
{gygim, hckim}@vision.hanyang.ac.kr, [email protected]
² Samsung Electronics, Giheung-Eup, Yongin-Si, Gyeonggi-Do, 449-712, Korea
[email protected]
Abstract. The purpose of estimating subjective image quality is to provide the best-quality image content to users in diverse fields. Image quality is subjective and therefore very difficult to estimate accurately, so many researchers have proposed psychophysical experimentation as a means of estimating it. Conventional methods describe the relationship between subjective preference and perceived contrast as an "inverted U" shape. However, that relationship was derived from only a few high-quality images, so it is inadequate for general image-quality estimation. In this paper, we carried out two experiments using a dataset of untransformed and varied images. We discovered an important property: the preference increases in proportion to the perceived contrast. The results show not only that our experimentation can reduce the MSE of image-quality estimation by approximately 40% over previous methods, but also that it can be applied in various applications.

Keywords: Image quality, Psychophysical experimentation, Subjective quality model.
1 Introduction

Image quality determines the observer's level of psychophysical satisfaction. Measuring and evaluating image quality make it possible to verify the performance of image processing methods and thereby ensure that only image content of very high quality is provided for multimedia applications. The demand for accurate image-quality estimation has been increasing. However, objectively quantifying the quality of an image is a challenging task, because it is difficult to find the quality attribute that is most highly related to the perception of image quality. Thus, many researchers have proposed psychophysical-experimentation-based subjective image-quality estimation as a means of estimating image quality [1-6].

Conventional methods [1-3] conduct experimentation using datasets generated from transformations of the following commonly employed image attributes: lightness, chroma, sharpness, noise distribution, resolution and compression. The perception of image quality is modeled with such transformed images, and thus the
experimental results can be helpful in assessing the performance of image-enhancement algorithms. However, those experimental results are inadequate for image-quality estimation in general applications, because the image-quality models are derived from only a few high-quality images.

As described in this paper, we carried out two types of experiments. The first was performed to verify the conventional methods, using a dataset generated from transformations of various image attributes. The second was based on our proposed method, designed to overcome the drawback of the conventional methods. To estimate subjective image quality accurately, we used datasets of various kinds of untransformed images. All experiments were conducted by the pair-comparison method, and observers were asked to choose the better image according to subjective contrast and preference. To estimate the subjective quality, the perceived contrast was then modeled using image attributes, because it is difficult to approximate subjective image quality directly from the quality attributes of an image. Even when many quality attributes are used in modeling, only chroma and sharpness correlate with contrast. Hence the perceived contrast is estimated by regression analysis using the standard deviations of chroma and sharpness. Using this model, subjective image quality is estimated according to the relationship between the perceived contrast and the preference. As a result, we found, contrary to previous experimental results, that most people prefer images of higher contrast. Moreover, compared with the conventional methods, the proposed model reduces the mean square error (MSE) by nearly 40%. This means that our model can serve as a significant image-content model in diverse fields.

The rest of this paper is organized as follows. In Section 2.1, we review the conventional experimental methods of estimating image quality. In Section 2.2, we describe the proposed model in detail. In Section 3, we discuss the experimental results. Finally, in Section 4 we conclude the paper.
2 Image-Quality Experimentation

2.1 Related Experimentation

Calabria and Fairchild conducted experiments to yield information relating to the perception of contrast and to develop a metric of perceived image contrast as it relates to observer preferences [1][2]. Their dataset consisted of images transformed in lightness, chroma, and sharpness. Because contrast is an important attribute determining image quality, perceived contrast was modeled from image-attribute perception. To model the perceived contrast, they employed the standard deviations of lightness, chroma, and sharpness. Their experimental results showed that perceived image contrast has a nonmonotonic, "inverted U"-shaped relationship to the preference. Using this relationship, the preference was measured as subjective image quality. The significance of this research is that subjective quality can be estimated from the attributes of an input image.
The Electronics and Telecommunications Research Institute (ETRI)'s CG research team in Korea proposed a method that estimates human perception based on an experimental approach [3]. Their dataset was generated by transformations of input images for the attributes of lightness, chroma, contrast, noise, sharpness, and compression. They conducted two types of psychophysical experiments: 'categorical judgment' and 'pair comparison'. The first was carried out to estimate the preference, and the second to measure the sensitivity of human perception to image-quality attributes. The Z-scores of the results were calculated to compare the quality attributes. The Z-scores increased as a function of the transformation level until they reached a maximum, and then decreased. This means that people prefer images with middle-level lightness and chroma. They also showed that most people are more sensitive to lightness variation than to chroma variation.

These studies are valuable for performance evaluation and for setting the parameters of image processing algorithms. However, their qualitative models cannot guarantee the estimated subjective image-quality result for general applications, because the dataset consists of images transformed from only a few source images, in fact fewer images than are used in modeling. Therefore, when evaluating images that were not used for modeling, the conventional experiments may yield inaccurate results.

2.2 Proposed Experimentation

The aim of this paper is to provide an experimental scheme by which subjective image quality can be estimated with high certainty. We carried out two types of experiments: Experiment I (Exp I) and Experiment II (Exp II). Exp I was performed to verify the effectiveness of the conventional methods; its dataset consisted of images transformed in the image attributes. Unlike the experiments of the conventional methods [1-3], we made up a dataset that consisted of images transformed from images of various qualities, in order to obtain a more general dataset than one containing only high-quality images.
Fig. 1. Flowchart of subjective image-quality estimation (image factor extraction from the test image, followed by perceived contrast estimation and subjective image quality; the psychophysical experiment, with its dataset and answer sheets, feeds the perceived contrast modeling and preference modeling)
The dataset for Exp II, the proposed method, therefore consisted of untransformed images of various qualities. The rationale behind Exp II is that when humans estimate the quality of a certain image, they generally compare it with different images of various qualities whose image attributes have not been transformed.

Figure 1 shows the entire procedure for estimating the subjective image quality. As can be seen, perceived contrast modeling is performed to estimate the subjective image quality [1][2]. In Exp I, to evaluate the pair-comparison result, a total of 660 observations are generated.¹ The 12 transformations for lightness, chroma, and sharpness are listed in Table 1. In order to understand how each image attribute is independently related to image quality, no image-quality transformation is allowed to affect other image attributes. To that end, the lightness and sharpness transformations are performed on the luminance channel, and the chroma transformation is performed on the CIELAB C*ab channel. Figure 2 shows examples of the transformed images.

Table 1. Image-quality transformations

Attributes | Transformations
Lightness (5 images) | Two sigmoidal functions of different shape; an inverse sigmoidal function; two linear functions of different slope
Chroma (3 images) | Three scaling functions with 20%, 60%, and 120% of the original chroma
Sharpness (4 images) | Four unsharp mask filters in Adobe Photoshop® with a radius of 2.0 and amounts of 25, 75, 150, and 250
The dataset of Exp II consists of images of various qualities, and in order to cover images ranging from the worst to the best quality, we paid particular attention to selecting the dataset images. Three hundred observations were made using 25 different digital-camera images collected from DPChallenge® (http://www.dpchallenge.com) and Corbis® (http://pro.corbis.com). They contain a diverse set of high- and low-quality photos from many different photographers. Moreover, the images have been rated by the website users. In order to achieve highly reliable results, we performed Exp II twice with different datasets; therefore, a total of 600 observations were used in modeling. The sample images used in Exp II are shown in Fig. 3. Overall, 40 observers with normal color vision, consisting mainly of our faculty along with graduate and undergraduate students, participated in the experiments.
¹ 10 images × $C_2^{12}$ = 660, where $C_k^n = \binom{n}{k} = \frac{n!}{k!(n-k)!}$.
Fig. 2. Transformation examples. (a)-(e) Result images for lightness: (a) sigmoidal function with an exponent of 10, (b) sigmoidal function with an exponent of 25, (c) inverse sigmoidal function with an exponent of 20, (d) linear function with a slope of 0.85, and (e) linear function with a slope of 1.15; (f)-(h) result images for chroma; (i)-(l) result images for sharpness
Observer information is shown in Table 2. The observers were seated one meter from the monitor screen and were given no information about the purpose of the experiment. They were directed to select the better of the two displayed images for both contrast and preference. The two-image combinations were shown at the center of a calibrated LCD monitor (Samsung SyncMaster 178B 17") in random order. The luminance of the monitor is 300 cd/m², and the screen brightness level was set to 50% of the maximum [4]. The dot pitch of the monitor is 0.264 mm and the response time is 4 ms. All of the experiments were performed in the same room under fluorescent lighting, and observers were given comfortable breaks (e.g., 5-10 minutes).
Fig. 3. Sample images from Exp II's dataset

Table 2. Observer information

Number of observers: 40 (27 male, 13 female)
Ethnic background: Korean 35, Pakistani 4, Malaysian 1
Age range: 21-35
3 Experimental Results

To analyze the experimental results, both the preference scores and the perceived contrast scores were scaled within five levels. First of all, we analyzed the relationship between the perceived contrast and the preference. In Fig. 4, we plot the perceived contrast versus the preference, based on the answer sheets. Figures 4(a) and (b) show the results of Exp II using different datasets. In order to evaluate the similarity between the results, their correlation was calculated. The correlation was 0.94, which means that Exp II has high consistency between its results. Hence we adopted the averaged result for Exp II, as shown in Fig. 4(d). The result of Exp I in Fig. 4(c) showed that the relationship between the perceived contrast and the preference is the same as that determined by the existing methods [1-3]. The "inverted U" shape reflects the fact that the highest preference score occurred around level 3–4.
As shown in Fig. 4(d), the result of Exp II indicated that the preference is proportional to the perceived contrast. The perceived contrast of excessively transformed images leads to a decreasing preference, but an untransformed image with good contrast is regarded as a high-quality image.
Fig. 4. The relationship between perceived contrast and preference: (a)(b) results of Exp II using different datasets, (c) result of Exp I, (d) averaged result of Exp II
After the analysis of the relationship between the perceived contrast and the preference was completed, information on the lightness, chroma, and sharpness was used for contrast modeling: the standard deviation of the luminance channel for lightness, the standard deviation of the C*ab channel for chroma, and the standard deviation of the Sobel-filtered [7] luminance channel for sharpness. The results of the perceived contrast modeling are illustrated in Fig. 5. As shown in Figs. 5(b) and (c), the standard deviations of chroma and sharpness represent the relationship with the perceived contrast well. However, the standard deviation of lightness showed less correlation with the perceived contrast, a result that differs from [1-2]. The reason is that we used images of various lightness qualities, whereas the conventional experiments use only images of high lightness quality.
Fig. 5. The perceived contrast modeling, plotted against (a) lightness std. dev., (b) chroma std. dev., (c) sharpness std. dev., (d) absolute luminance, (e) MDR, and (f) TEN value. As the standard deviations of chroma (b) and sharpness (c) increase, the perceived contrast also increases, but it is difficult to find any relationship between contrast and the other values (a)(d)(e)(f).
To determine the relationship between the contrast and the luminance, we applied the absolute luminance [8] and the mean dynamic range (MDR) [9] as lightness factors. Still, no good relationship could be derived. The TEN value [10] was also calculated as a sharpness measure, but the standard deviation of sharpness showed a better result.
Finally, the standard deviations of chroma and sharpness were chosen for modeling owing to their high correlation with the perceived contrast. Using least squares estimation (LSE), the estimated contrast for each experiment was defined as follows:
$\tilde{C}_I = 0.0864 \times \sigma_{chroma} + 0.0487 \times \sigma_{sharpness} - 1.7983$   (1)

$\tilde{C}_{II} = 0.0365 \times \sigma_{chroma} + 0.0329 \times \sigma_{sharpness} - 0.205$   (2)
Eqs. (1) and (2) are the approximations of the results of Exp I and Exp II, respectively. Using these equations, we can model the contrast from the standard deviations of an image. Next, we estimated the preference using its relationship with the perceived contrast. Using the least squares regression method, the preference can be approximated by:

$\tilde{P}_I = -0.3554 \times \tilde{C}_I^2 + 2.9004 \times \tilde{C}_I - 1.7915$   (3)

$\tilde{P}_{II} = 0.9679 \times \tilde{C}_{II} + 0.0844$   (4)
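A sketch of how Eqs. (2) and (4) could be applied to an input image is given below. The chroma and sharpness factors follow the description in the text (standard deviation of C*ab and of the Sobel-filtered luminance), but the exact color conversion, Sobel settings and value scaling used by the authors are not specified, so OpenCV defaults are assumed and the estimates should be read as illustrative rather than calibrated.

```python
import cv2
import numpy as np

def image_factors(bgr):
    """Standard deviations of chroma (CIELAB C*ab) and sharpness (Sobel of luminance)."""
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB).astype(np.float64)
    L, a, b = lab[..., 0], lab[..., 1] - 128.0, lab[..., 2] - 128.0
    chroma = np.sqrt(a ** 2 + b ** 2)
    gx = cv2.Sobel(L, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(L, cv2.CV_64F, 0, 1, ksize=3)
    sharpness = np.sqrt(gx ** 2 + gy ** 2)
    return chroma.std(), sharpness.std()

def estimate_quality(bgr):
    """Estimated contrast (Eq. 2) and preference (Eq. 4) of the Exp II model."""
    s_chroma, s_sharp = image_factors(bgr)
    contrast = 0.0365 * s_chroma + 0.0329 * s_sharp - 0.205
    preference = 0.9679 * contrast + 0.0844
    return contrast, preference

if __name__ == "__main__":
    img = cv2.imread("test.jpg")     # hypothetical input image path
    if img is not None:
        print(estimate_quality(img))
```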
Eqs. (3) and (4) are calculated with a polynomial and a linear regression, respectively. Therefore, we can estimate the subjective image quality of an input image using these procedures.

To evaluate the validity of the experimental results, we procured test images and measured the image quality directly. The observers were asked to provide their subjective quality and contrast preferences as discrete scores on a five-level adjective scale: Bad = 1, Poor = 2, Fair = 3, Good = 4, Excellent = 5. It was not a pair-comparison test; rather, one image at a time was displayed on the monitor. The test conditions were the same as in Exp I and II. A total of 24 observers participated in the test, and a total of 85 test images were measured.
Fig. 6. The relationship between the contrast and the preference with MOS
The test set comprised 60 images transformed from five images similar to those used in Exp I, and 25 distinct images as in Exp II. With the resulting score as the criterion of quality verification, the mean opinion score (MOS) was calculated for each test image [4]. The MOS for contrast and the MOS for preference are plotted in Fig. 6. The dashed and solid lines indicate the estimated results using Eqs. (3) and (4), respectively. We can easily see that the solid line of Eq. (4) represents the MOS distribution with less error. This means that the contrast of an image is the main factor determining subjective image quality. For a more accurate comparison, we calculated the MSE between the estimated image quality and the MOS. At the same time, the coefficient of determination (R²) was computed to assess the suitability of the modeling. The MSE and R² results are listed in Table 3.

Table 3. MSE and R² values for each experiment

Experiment type | MSE (Contrast modeling) | MSE (Preference estimation) | R²
Exp I | 0.8729 | 0.9685 | 0.7528
Exp II | 0.6615 | 0.5769 | 0.9475
We compared the performance of Exp II against Exp I within 95% confidence intervals. The MSE of Exp II was reduced to 0.5769, which is 40.4% lower than the result obtained by Exp I. We also found that Exp II, with a higher R², offers better modeling performance. In other words, these results show that the accuracy of Exp II is superior to that of Exp I. This means that an image with good contrast is considered a high-quality image, and that modeling with a dataset of untransformed images yields a more reasonable result. Moreover, instead of an "inverted U" shape, approximating the preference as proportional to the contrast is more reliable when estimating subjective image quality.
4 Conclusions

In this paper, we conducted two types of experiments for more accurate image-quality estimation. Whereas the conventional methods use only high-quality images for modeling, the dataset of Exp I also included degraded images. In Exp II, we compiled the dataset from different untransformed images and modeled the subjective image quality. As a result of Exp I, the relationship between the perceived contrast and the preference manifested the "inverted U" shape, identical to the existing experiments. However, the Exp II result showed that the relationship between the perceived contrast and the preference is represented by a monotonically increasing function. Perceived contrast modeling was then carried out using the standard deviations of chroma and sharpness. Finally, we were able to estimate the subjective image quality, and the approximated image quality showed an approximately 40% lower MSE compared with the models using transformed datasets.
Further, our image-quality model can be applied in various fields in which digital image content is relevant, such as digital cameras and broadcasting systems.
References

1. Calabria, A.J., Fairchild, M.D.: Perceived image contrast and observer preference I. The effects of lightness, chroma, and sharpness manipulations on contrast perception. Journal of Imaging Science and Technology 47, 479–493 (2003)
2. Calabria, A.J., Fairchild, M.D.: Perceived image contrast and observer preference II. Empirical modeling of perceived image contrast and observer preference data. Journal of Imaging Science and Technology 47, 494–508 (2003)
3. Kim, J.-S., Cho, M.-S., Koo, B.-K.: Experimental Approach for Human Perception Based Image Quality Assessment. In: Harper, R., Rauterberg, M., Combetto, M. (eds.) ICEC 2006. LNCS, vol. 4161, pp. 59–68. Springer, Heidelberg (2006)
4. Brotherton, M.D., Huynh-Thu, Q., Hands, D.S., Brunnstrom, K.: Subjective Multimedia Quality Assessment. IEICE Trans. Fundamentals E89-A, 2920–2931 (2006)
5. Ke, Y., Tang, X., Jing, F.: The Design of High-Level Features for Photo Quality Assessment. In: Proc. of IEEE Computer Vision and Pattern Recognition, pp. 419–426. IEEE Computer Society Press, Los Alamitos (2006)
6. Horita, Y., Sato, M., Kawayoke, Y., Parvez Sazzad, Z.M., Shibata, K.: Image Quality Evaluation Model Based on Local Features and Segmentation. In: Proc. of IEEE International Conference on Image Processing, pp. 405–408. IEEE Computer Society Press, Los Alamitos (2006)
7. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 2/E. Prentice-Hall, New Jersey (2002)
8. Akyuz, A.O., Reinhard, E.: Color appearance in high-dynamic-range imaging. Journal of Electronic Imaging 15(3), 033001 (2006)
9. Jourlin, M., Pinoli, J.C.: Image dynamic range enhancement and stabilization in the context of the logarithmic image processing model. Signal Processing 41(2), 225–237 (1995)
10. Buerkle, A., Schmoeckel, F., Kiefer, M., Amavasai, B.P., Caparreli, F., Selvan, A., Travis, J.R.: Vision-based closed-loop control of mobile microbots for microhandling tasks. Proc. SPIE 4568, 187–198 (2001)
Adaptive Color Filter Array Demosaicking Based on Constant Hue and Local Properties of Luminance

Chun-Hsien Chou¹, Kuo-Cheng Liu¹,²,*, and Wei-Yu Lee¹

¹ Department of Electrical Engineering, Tatung University, Taiwan
² Foreign Language and Information Educating Center, Taiwan Hospitality and Tourism
[email protected]*
Abstract. Most commercial digital cameras use a single electronic sensor overlaid with a color filter array (CFA) to capture imagery. Since only one primary color is sampled at each pixel, the missing color primaries must be reconstructed by interpolation. In this paper, an adaptive demosaicking scheme for CFA interpolation that uses intra-channel correlation, color difference correlation, constant hue, and luminance-color difference correlation is proposed. A rough interpolation is first implemented by bilinear interpolation. Then the color difference correlation and the constant hue are successively used to update the missing color primaries. To obtain high-quality color images, an adaptive algorithm using luminance-color difference correlation and edge-direction information is iteratively applied to improve the image quality around edges. Simulation results demonstrate that the image quality of the proposed algorithm is better than that of the approach using color difference correlation in terms of peak signal-to-noise ratio (PSNR).

Keywords: Color filter array, demosaicking, hue, luminance.
1 Introduction

Digital still cameras have become indispensable in people's lives and are widely used as image input devices. To reduce cost and size, most commercial digital cameras use a single electronic sensor overlaid with a color filter array to capture imagery. Since only one primary color is sampled at each pixel, the missing color primaries must be reconstructed by interpolation from the sampled color primaries of the adjacent pixels. This color plane interpolation is known as demosaicking or CFA interpolation. The Bayer color filter array is a popular format for digital acquisition of color images. The Bayer pattern [1] is shown in Fig. 1. Half of the total number of pixels are green, while a quarter of the total number is assigned to each of red and blue. More pixels are dedicated to green than to red and blue because the human eye is more sensitive to that color.

An immense number of demosaicking methods have been proposed in the literature [2]-[13]. To obtain more visually pleasing results, many adaptive CFA demosaicking methods that exploit the spectral and spatial correlations among neighboring
Fig. 1. Bayer pattern
Fig. 2. Reference Bayer CFA pattern
pixels have been proposed. In [2], Pei and Tam present a color difference (green-red and green-blue) model for the demosaicking technique. Based on this observation, several schemes (e.g., [3], [4]) have been devised to estimate missing color values. These methods normally do not perform satisfactorily around sharp edges and fine details. Besides the color difference rule, many methods make use of color ratios [5], [6]: the ratios between the red and green values are highly similar, as are the ratios between the blue and green values. In [7], Gunturk et al. have proposed an effective scheme that exploits spectral correlation by alternately projecting the estimates of the missing color values onto constraint sets. In general, the spatial correlation of neighboring pixels around edges is used to perform color interpolation. In [8], [9], several edge classifiers are proposed to identify the best directions for interpolating the missing color values. In [10], [11], the authors attempt to estimate the luminance of an image and obtain a good demosaicked image based on the luminance and chrominance signals of a CFA image in the frequency domain.

In this paper, the proposed demosaicking algorithm is divided into four parts. The first part is an initial process using bilinear interpolation. The second part is designed to update the green channel by the constant color difference model. In the third part, the red and blue channels are respectively updated using the constant hue model. The fourth part is an adaptive iterative update algorithm developed from the luminance-color difference model and local properties.

The paper is organized as follows. The interpolation algorithms for demosaicking are reviewed in Section 2. In Section 3, three image models, including the constant color difference model, the constant hue model, and the luminance-color difference model, are presented for the design of the proposed demosaicking algorithm. The proposed demosaicking algorithm, comprising a CFA interpolation and an adaptive iterative update method, is described in Section 4. The CFA interpolation is based on the constant color difference model and the constant hue model. The adaptive iterative update algorithm improves the image quality in edge regions using the luminance-color difference model and the edge information. Simulation results and conclusions are described in Sections 5 and 6, respectively.
2 Reviews of Interpolation Algorithms

In this section, the interpolation algorithms used for CFA data are discussed. Common interpolation algorithms can be classified into two categories: non-adaptive and adaptive. Non-adaptive methods have low complexity and are easy to implement because all pixels are processed in the same way. Adaptive methods use special processing for pixels on edges; such algorithms reduce artifacts and increase image quality.

Bilinear interpolation is the simplest method. It determines the value of a missing pixel as a weighted average of the adjacent pixels in the CFA image, and it is not considered good enough for photo-quality images. Take the Bayer CFA pattern shown in Fig. 2 for example. The average of the upper, lower, left, and right pixel values is assigned as the G value of the interpolated pixel, and the average of two adjacent pixel values of the corresponding color is assigned to the interpolated pixel at a green position:

$G_7 = \frac{G_3 + G_6 + G_8 + G_{11}}{4}$  and  $R_6 = \frac{R_2 + R_{10}}{2}$   (1)

The average of the four adjacent diagonal pixel values is assigned to the interpolated pixel at a red/blue position:

$R_7 = \frac{R_2 + R_4 + R_{10} + R_{12}}{4}$   (2)
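For reference, this non-adaptive baseline can be written compactly as a set of convolutions, one per color plane. The sketch below assumes an RGGB Bayer layout (the exact layout of Fig. 2 is not reproduced here) and uses the standard bilinear kernels, so it illustrates the baseline rather than the method proposed in this paper.

```python
import numpy as np
from scipy.ndimage import convolve

def bilinear_demosaic(cfa):
    """Bilinear interpolation of an RGGB Bayer mosaic (HxW float array)."""
    h, w = cfa.shape
    r_mask = np.zeros((h, w)); r_mask[0::2, 0::2] = 1.0
    b_mask = np.zeros((h, w)); b_mask[1::2, 1::2] = 1.0
    g_mask = 1.0 - r_mask - b_mask

    k_g = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]]) / 4.0    # green kernel
    k_rb = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 4.0   # red/blue kernel

    G = convolve(cfa * g_mask, k_g, mode="mirror")
    R = convolve(cfa * r_mask, k_rb, mode="mirror")
    B = convolve(cfa * b_mask, k_rb, mode="mirror")
    return np.stack([R, G, B], axis=-1)

if __name__ == "__main__":
    mosaic = np.random.default_rng(0).random((8, 8))
    print(bilinear_demosaic(mosaic).shape)   # (8, 8, 3)
```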
Edge-sensing interpolation is an adaptive method used to recover the missing green elements only. This method detects edges before the interpolation, classifies pixels into several categories using the edge orientation, and applies a different interpolation scheme to each category. By the edge patterns, the missing color elements can be reconstructed. The algorithm is given by

if ($\Delta H < T$) and ($\Delta V > T$): $G = G_H$
else if ($\Delta H > T$) and ($\Delta V < T$): $G = G_V$
else: $G = G_{AVG}$   (3)

where $T$ denotes the threshold. Referring to Fig. 2, take $G_7$ for example. The horizontal and vertical gradients are calculated as

$\Delta H = |G_6 - G_8|$, $\Delta V = |G_3 - G_{11}|$   (4)
The three estimates are calculated as

$G_{H7} = \frac{G_6 + G_8}{2}$   (5)
$G_{V7} = \frac{G_3 + G_{11}}{2}$   (6)

$G_{AVG7} = \frac{G_3 + G_6 + G_8 + G_{11}}{4}$   (7)

Smooth-hue-transition interpolation is used to recover the missing chrominance (R and B) elements only. It performs the bilinear interpolation algorithm in the hue domain. First the green channel is interpolated by bilinear interpolation, and then the red and blue values are calculated so as to deliver a smooth transition in hue from pixel to pixel. First of all, a red hue value and a blue hue value must be defined as

$H_R = \frac{R}{G}$, $H_B = \frac{B}{G}$   (8)

where $H_R$ denotes the red hue value and $H_B$ denotes the blue hue value. As shown in Fig. 2, the missing red channel can be reconstructed as

$R_3 = \frac{G_3}{2}\left(\frac{B_2}{G'_2} + \frac{B_4}{G'_4}\right)$   (9)

$R_8 = \frac{G_8}{2}\left(\frac{B_4}{G'_4} + \frac{B_{12}}{G'_{12}}\right)$   (10)

$R_7 = \frac{G_7}{4}\left(\frac{B_2}{G'_2} + \frac{B_4}{G'_4} + \frac{B_{10}}{G'_{10}} + \frac{B_{12}}{G'_{12}}\right)$   (11)

where $G'$ denotes the interpolated G value. The blue plane is interpolated in the same manner as the red channel.
3 Color Image Models

The color image models presented in this section are used to develop the demosaicking algorithm. These models include the constant color difference model, the constant hue model, and the luminance-color difference model. We introduce the basic ideas of these models and detail how they are utilized in our interpolation algorithm.

3.1 The Constant Color Difference Model

Because of the high correlation between the red, green and blue signals, the interpolation of the green signal can take advantage of the red and blue information. In [2], Pei and Tam developed an image model for the correlation between the green and chrominance signals. The authors define $K_R$ as green minus red and $K_B$ as green minus blue:

$K_R = G - R$, $K_B = G - B$   (12)

For natural images, $K_R$ and $K_B$ are quite flat over small regions, and this property is suitable for interpolation. Figure 3 illustrates an example of the green channel image and the corresponding $K_R$ and $K_B$ images. Based on this observation, the authors reduce the interpolation error and improve the image quality. From another viewpoint, this concept is similar to that of the smooth-hue-transition method.
Fig. 3. (a) G channel image, (b) KR image, (c) KB image
Fig. 4. (a) G channel image, (b) HR image, (c) HB image
3.2 The Constant Hue Model

Although CFA interpolation has been studied for a long time, most algorithms focus on the color ratio rule and the inter-channel color difference rule. Based on these observations, several schemes have been devised to estimate missing color values with the aid of other color planes. However, these methods normally do not perform satisfactorily around sharp edges and fine details. By taking advantage of both types of demosaicking methods, an image model using color difference ratios is proposed. Through extensive experiments, we found an image domain that is suitable for interpolation. The image model is called the constant hue model. In our hue model, we define $H_R$ and $H_B$ as

$H_R = \frac{G - R}{G - B}$, $H_B = \frac{G - B}{G - R}$   (13)
The same observation can be found in our model. For real-world images, $H_R$ and $H_B$ are quite flat over small regions, and this property is suitable for interpolation. Figure 4 helps account for these results. As the figures indicate, the gray parts are the resulting $H_R$ and $H_B$ values. There are a few exceptions to the rule: the white parts are where the denominator is zero or where the absolute values of $H_R$ and $H_B$ are greater than one. These exceptions are not processed in our algorithm.
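A short sketch of computing the hue planes of Eq. (13), including the exceptions just mentioned, is given below. Applying the exclusion rule separately to each plane is an interpretation of the text, so the masks should be treated as an assumption rather than the authors' exact rule.

```python
import numpy as np

def constant_hue_planes(R, G, B, eps=1e-6):
    """H_R = (G-R)/(G-B) and H_B = (G-B)/(G-R) of Eq. (13), with per-plane
    exclusion masks for zero denominators and magnitudes greater than one
    (the 'white parts' of Fig. 4, which are not processed)."""
    gr, gb = G - R, G - B
    safe_gb = np.where(np.abs(gb) > eps, gb, 1.0)   # avoid division by zero
    safe_gr = np.where(np.abs(gr) > eps, gr, 1.0)
    HR = gr / safe_gb
    HB = gb / safe_gr
    valid_R = (np.abs(gb) > eps) & (np.abs(HR) <= 1.0)
    valid_B = (np.abs(gr) > eps) & (np.abs(HB) <= 1.0)
    return HR, HB, valid_R, valid_B
```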
Fig. 5. (a) Luminance image, (b) LR image, (c) LG image, (d) LB image
Fig. 6. Flowchart of the proposed algorithm
3.3 The Luminance Color Difference Model

Luminance information plays an important role in dictating the quality of a color image. In [11], the luminance color difference model is proposed, defining three luminance color difference planes as

$L_R = L - R$, $L_G = L - G$, $L_B = L - B$   (14)

These three chrominance planes are generally smooth because the high-frequency contents of the luminance plane are highly correlated with those of each color plane. Figure 5 shows the luminance and the three chrominance planes of a test image. By exploiting the strong spatial smoothness of these luminance color difference planes, the missing values can be estimated by adaptively combining the neighboring luminance color difference values.
4 The Proposed Demosaicking Algorithm

The proposed algorithm is shown in Fig. 6. The whole algorithm can be divided into four parts. The first part is the initial interpolation, for which bilinear interpolation is used in step 1. The second part is the update of the green channel; it employs the constant color difference model to update the result of the green channel from step 1. In the third part, the red and blue channels are updated by the constant hue model. The fourth part is an iterative update algorithm using the luminance-color difference model and edge information. This step aims to suppress visible artifacts residing in the demosaicked images obtained from the aforementioned interpolation steps.
4.1 Initial Interpolation

In this step, we roughly interpolate the red, green, and blue channels to obtain initial estimates. The bilinear interpolation algorithm can be used for this initial interpolation. Because the CFA image carries only one channel of information per pixel, the algorithm could not exploit inter-channel correlation in the next step without this initial interpolation.

4.2 Update the Green Channel

Since the green channel plays an important role, the green component is updated first with the help of the red and blue channels. The reason is that the green channel is sampled at a higher rate than the red and blue channels in the Bayer color filter array. To find the missing green values, the constant color difference model presented before is used. We update the green channel as follows. Before updating the green channel, we have to calculate the $K_R$ and $K_B$ values around the updated pixel. Figure 7 shows the reference neighboring samples of the G channel. Referring to Fig. 7(a), take the G value at pixel $(x, y)$ for example. We can transform the operation into the $K_R$ domain:

$G(x, y) - R(x, y) = K_R(x, y)$   (15)

Since our algorithm has initialized the CFA image, we can use the average of the four surrounding values at locations $(x+1, y)$, $(x-1, y)$, $(x, y+1)$, and $(x, y-1)$ to estimate $K_R$ at pixel $(x, y)$ directly. The four surrounding $K_R$ values are calculated as green minus red at each position:

$K_R(x+1, y) = G(x+1, y) - R(x+1, y)$   (16)
$K_R(x-1, y) = G(x-1, y) - R(x-1, y)$   (17)
$K_R(x, y+1) = G(x, y+1) - R(x, y+1)$   (18)
$K_R(x, y-1) = G(x, y-1) - R(x, y-1)$   (19)

The G value at pixel $(x, y)$ can then be expressed as

$G(x, y) = R(x, y) + \frac{1}{4}\big(K_R(x+1, y) + K_R(x-1, y) + K_R(x, y+1) + K_R(x, y-1)\big)$   (20)

The update of the green value at a blue pixel is similar to the update at a red pixel.
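The green-channel update of Eqs. (15)-(20) can be sketched as below: at each position where a red (or blue) sample is available, the colour differences K of the four cross neighbours are averaged and added back to the known sample. The sketch assumes full R, G, B planes from the initial bilinear step and a boolean mask of the CFA positions where the chrominance channel was actually sampled; it is a vectorised illustration rather than the authors' code.

```python
import numpy as np

def update_green(G, C, c_mask):
    """Update G at CFA positions of chrominance plane C (red or blue).

    G, C   : HxW float planes from the initial interpolation
    c_mask : boolean HxW array, True where C was actually sampled
    Implements G(x,y) = C(x,y) + mean of K = G - C over the 4-neighbourhood.
    """
    K = G - C
    Kp = np.pad(K, 1, mode="edge")   # borders handled by edge replication
    K_avg = (Kp[:-2, 1:-1] + Kp[2:, 1:-1] + Kp[1:-1, :-2] + Kp[1:-1, 2:]) / 4.0
    G_new = G.copy()
    G_new[c_mask] = C[c_mask] + K_avg[c_mask]
    return G_new
```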
Fig. 7. Reference neighboring samples (x and y are row and column indexes, respectively)
Fig. 8. (a) Zipper effect, (b) false color
4.3 Update Red and Blue Channels
In step 2 the green channel was updated using Pei's constant color difference model; the red and blue channels are updated next, following the green channel. In this step, the proposed constant hue model is used to update the red and blue channels. The basic idea is similar to that of the constant color difference model: we transform the operation into the hue domain. Referring to Fig. 7(b), the red value at location $(x, y)$ is expressed as

$R(x, y) = G(x, y) - H_R(x, y)\big(G(x, y) - B(x, y)\big)$   (21)

The $H_R$ value at the updated pixel is calculated first. We can use the average of its surrounding $H_R$ values to estimate $H_R$ at the updated pixel directly. The $H_R$ values at the three locations $(x+1, y)$, $(x, y-1)$, and $(x, y)$ are taken as examples:

$H_R(x+1, y) = \frac{1}{2}\big(H_R(x+1, y-1) + H_R(x+1, y+1)\big)$   (22)
$H_R(x, y-1) = \frac{1}{2}\big(H_R(x-1, y-1) + H_R(x+1, y-1)\big)$   (23)
$H_R(x, y) = \frac{1}{4}\big(H_R(x-1, y-1) + H_R(x-1, y+1) + H_R(x+1, y-1) + H_R(x+1, y+1)\big)$   (24)

The interpolation of the blue channel is similar to that of the red channel.
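The red-plane update of Eq. (21) can be sketched as follows. This is a simplified sketch: instead of the position-dependent neighbour sets of Eqs. (22)-(24), it averages the hue over the red sample positions in each 3×3 neighbourhood, which captures the idea without reproducing the exact neighbour selection.

```python
import numpy as np

def update_red(R, G, B, r_mask, eps=1e-6):
    """Update the red plane with the constant-hue model (Eq. 21).

    H_R = (G - R)/(G - B) is computed at positions where R was sampled
    (r_mask True), averaged over the 3x3 neighbourhood, and used to
    re-estimate R = G - H_R * (G - B) at the other positions.
    """
    gb = G - B
    safe_gb = np.where(np.abs(gb) > eps, gb, 1.0)
    hr = ((G - R) / safe_gb) * r_mask            # hue kept only at red samples

    H, W = R.shape
    Hp = np.pad(hr, 1, mode="edge")
    Mp = np.pad(r_mask.astype(float), 1, mode="edge")
    hue_sum = np.zeros_like(hr)
    hue_cnt = np.zeros_like(hr)
    for dr in (-1, 0, 1):                        # 3x3 neighbourhood average
        for dc in (-1, 0, 1):
            hue_sum += Hp[1 + dr:1 + dr + H, 1 + dc:1 + dc + W]
            hue_cnt += Mp[1 + dr:1 + dr + H, 1 + dc:1 + dc + W]
    hr_avg = hue_sum / np.maximum(hue_cnt, 1.0)

    R_new = R.copy()
    R_new[~r_mask] = G[~r_mask] - hr_avg[~r_mask] * gb[~r_mask]
    return R_new
```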
4.4 Improve the Image Quality by Using Adaptively Iterative Update Algorithm
If demosaicking is not performed appropriately, the produced image will suffer from highly visible artifacts such as the zipper effect and false color. As shown in Fig. 8(a), the zipper effect refers to abrupt or unnatural changes in color or intensity between neighboring pixels. False color corresponds to areas containing streaks of colors that do not exist in the original image, as shown in Fig. 8(b).
Fig. 9. The basic block diagram of the proposed iterative algorithm (the image updated by step 1 undergoes edge detection, the R/G/B channels are updated at edge pixels, and the loop repeats until the stopping criterion is satisfied, yielding the full-color demosaicked image)

Fig. 10. The Sobel convolution masks Gx and Gy used for edge detection
To restore more accurate and visually pleasing results, an adaptive iterative method based on the luminance-color difference model is proposed. We describe directly how the luminance-color difference model is used to update the red, green, and blue channels. Because the artifacts usually occur in edge regions, the iterative algorithm updates the red, green, and blue channels around edges. The basic block diagram of the proposed iterative algorithm is given in Fig. 9.

The Sobel method is used to detect edges in this step. Typically it is used to find the approximate absolute gradient magnitude at each point of an input grayscale image. Here we apply the Sobel method to a luminance image. The luminance at pixel $(x, y)$ is expressed as

$L(x, y) = \frac{1}{4}\big(R(x, y) + 2G(x, y) + B(x, y)\big)$   (25)

The Sobel operator consists of a pair of 3×3 convolution masks, as shown in Fig. 10. The kernels are applied separately to the input image to produce separate measurements of the gradient component in each orientation ($G_x$ and $G_y$). These are then combined to find the absolute magnitude of the gradient at each point and the orientation of that gradient. The approximate gradient magnitude is given by

$Grad = |G_x| + |G_y|$   (26)

If the gradient magnitude is greater than a predetermined threshold, the pixel is considered an edge pixel.
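The edge map of Eqs. (25)-(26) can be sketched directly; the default threshold of 60 below is the value reported in Section 5, and the two masks match Fig. 10.

```python
import numpy as np
from scipy.ndimage import convolve

def edge_mask(R, G, B, threshold=60.0):
    """Edge map used in step 4: Sobel on the luminance L = (R + 2G + B)/4;
    a pixel is an edge pixel if |Gx| + |Gy| exceeds the threshold."""
    L = (R + 2.0 * G + B) / 4.0
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    gx = convolve(L, kx, mode="mirror")
    gy = convolve(L, ky, mode="mirror")
    return (np.abs(gx) + np.abs(gy)) > threshold
```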
Fig. 11. The absolute difference value of the update in different images (absolute difference value versus iteration number, shown for Image 3, Image 14, and Image 15)
Fig. 12. Images used in the experiments. (These images are referred to as Image 1 to Image 16 in this paper, enumerated from left to right and top to bottom.)
After edge detection, we update the red, green, and blue channels at these edge pixels. Apart from the channel originally present in the CFA image, the other two channels are updated based on the luminance-color difference model. The green channel is updated using the luminance-color difference correlation together with the edge direction, whereas the red and blue channels are updated using only the luminance-color difference correlation. The green update according to the edge direction is similar to edge-sensing interpolation. Referring to Fig. 7(a) and (b), the algorithm is given by

if ($\Delta H < T$) and ($\Delta V > T$): $G(x, y) = L(x, y) - \frac{1}{2}\big(G(x+1, y) + G(x-1, y)\big)$
else if ($\Delta H > T$) and ($\Delta V < T$): $G(x, y) = L(x, y) - \frac{1}{2}\big(G(x, y+1) + G(x, y-1)\big)$
else: $G(x, y) = \frac{1}{4}\big(G(x, y+1) + G(x, y-1) + G(x+1, y) + G(x-1, y)\big)$   (27)

The horizontal and vertical gradients are calculated as

$\Delta H = |G(x, y+1) - G(x, y-1)|$   (28)
$\Delta V = |G(x+1, y) - G(x-1, y)|$   (29)
The red channel is updated based on the luminance-color difference model. Referring to Fig. 7(b), the red values at the three locations $(x+1, y)$, $(x, y-1)$, and $(x, y)$ are taken as examples and are calculated as

$R(x+1, y) = L(x+1, y) - \frac{1}{2}\big(L_R(x+1, y-1) + L_R(x+1, y+1)\big)$   (30)
$R(x, y-1) = L(x, y-1) - \frac{1}{2}\big(L_R(x-1, y-1) + L_R(x+1, y-1)\big)$   (31)
$R(x, y) = L(x, y) - \frac{1}{4}\big(L_R(x-1, y-1) + L_R(x+1, y-1) + L_R(x-1, y+1) + L_R(x+1, y+1)\big)$   (32)

The interpolation of the blue channel is similar to that of the red channel; the blue values can be estimated analogously using the $L_B$ values. Based on the above discussion, we conclude that iterative demosaicking requires a stopping criterion in order to minimize the risk of artifacts.
Table 1. Comparison of PSNR obtained by different methods
Img 1
2
3
4
5
6
7
8
Bilinear 25.1983 29.3453 24.7342 25.9942 32.9579 36.5939 31.9 33.4014 25.7452 29.3343 25.6496 26.6073 27.6818 32.0801 27.8705 28.7968 33.5907 37.4494 33.2664 34.402 23.4831 28.3377 23.4509
ECI 31.2458 35.2724 33.0498 32.8873 35.0892 41.4552 38.5411 37.6007 32.3681 36.5174 34.3285 34.0828 34.6458 36.981 34.0785 35.0655 40.2112 42.129 39.6653 40.5467 31.004 33.9382 30.9946
Proposed 33.1116 36.1268 31.4955
24.5725 29.2908 33.2913 29.2758
31.7767 35.5608 38.124 36.0815
32.233 34.3917 40.1086 36.2374
30.2572 22.7615 26.2456 22.8442
36.4559 29.9026 32.2465 30.5098
36.3256 31.4633 33.6793 29.709
23.6776
30.7778
31.3236
Img 9
33.1842 37.5982 40.8904 36.0545
10
37.7523 33.7775 37.4681 33.3527
11
34.5171 34.9566 38.7807 32.637 34.7856 39.9079 43.2795 38.2019 39.9938 31.1662 35.4533 31.2809
12
13
14
15
16
Bilinear 30.5707 34.4121 29.1981 30.8942 31.2803 34.4937 31.1881 32.0745 26.9332 30.3703 27.2461 27.9319 28.1475 32.8665 28.1213 29.2202 29.4347 35.3101 31.413 31.4353 28.7127 32.581 28.5333
ECI 33.141 39.2096 36.7883 35.6616 39.1357 40.3678 38.4157 39.2333 34.88 36.2609 34.3573 35.0945 34.8555 37.4168 34.5623 35.4366 38.8646 40.53 35.2614 37.6435 35.6157 37.5707 34.9765
Proposed 33.802 39.3059 36.176
35.3595 38.9585 33.3298 35.3129 38.0733 41.442 35.783 37.8547 36.0041 39.7511 35.1737
29.5878 35.6084 39.1438 35.2393
35.9214 40.3241 43.8424 41.1522
36.5713 40.0832 45.5646 41.392
36.3454 27.0526 30.2049 26.1341
41.5327 34.368 35.542 33.1528
41.7948 34.4993 37.757 32.9299
27.479
34.2452
34.6351
35.877 37.6562 41.4204 36.8205 38.2237 34.1626 37.4117 33.358 34.6577
Since the content of different images is not the same, different numbers of update iterations are needed. Our proposed stopping criterion is suitable for all images. We exploit the amount of update in the first iteration as the stopping criterion. It should be noted that the stopping criterion is based on the absolute difference of color channels. The stopping criterion is defined as

s = (1/α) · Σ_{(x,y) ∈ edge pixels} |P_c^1(x, y) − P_c(x, y)|,   c = R, G, or B    (33)

where P_c^1 denotes the color channel updated for the first time in step 4, P_c denotes the color channel of the input image from step 3, and α defines the stopping coefficient.
The stopping coefficient controls the number of iterations of the process. The iteration is terminated when the following condition is satisfied:

Σ_{(x,y) ∈ edge pixels} |P_c^{n+1}(x, y) − P_c^n(x, y)| < s,   c = R, G, or B    (34)

where P_c^n denotes the nth updated color channel and P_c^{n+1} denotes the (n+1)th updated color channel. From experiments and observations, we find that our iterative update algorithm converges. Figure 11 shows the absolute difference value of the update for different images. The convergence of the iterative interpolation is demonstrated by the decrease of the absolute difference value between the interpolated images obtained by two successive updates (refer to Fig. 11).
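The iterative refinement and its stopping rule can be sketched as follows; the packaging of one update pass into `update_once`, the per-channel handling of the condition, and the safety cap on iterations are assumptions of this sketch rather than details given in the paper.

```python
# Hedged sketch of the iterative loop with the stopping criterion of
# Eqs. (33)-(34); alpha = 16 mirrors the value used in the experiments.
import numpy as np

def iterative_demosaick(channels, edge_mask, update_once, alpha=16.0, max_iter=20):
    prev = {c: channels[c].copy() for c in "RGB"}
    curr = update_once(prev)                                  # first update (step 4)
    # Eq. (33): stopping threshold derived from the first update
    s = {c: np.sum(np.abs(curr[c][edge_mask] - prev[c][edge_mask])) / alpha
         for c in "RGB"}
    for _ in range(max_iter):
        prev, curr = curr, update_once(curr)
        diff = {c: np.sum(np.abs(curr[c][edge_mask] - prev[c][edge_mask]))
                for c in "RGB"}
        if all(diff[c] < s[c] for c in "RGB"):                # Eq. (34)
            break
    return curr
```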
(a) Original image
(b) Bilinear method.
(c) Method in [1].
(d) Our method.
Fig. 13. The original and demosaicked results of a cropped region from image 8
(a) Original image.
(b) Bilinear method.
(c) Method in [1]
(d) Our method.
Fig. 14. The original and demosaicked results of a cropped region from image 14
5 Simulation Results
In our experiments, we used the images shown in Fig. 12 [14]. These images are film captures digitized with a photo scanner. Full color channels are available, and the CFA is simulated by sampling the channels. The sampled channels are used to test the demosaicking algorithms. We use bilinear interpolation for the red, green, and blue channels to get initial estimates. Then the green channel is updated with the constant color difference model, and the red and blue channels are updated with the constant hue model. Finally, the data is updated repeatedly until the stopping criterion is satisfied. The stopping coefficient was set to 16 for all images, and the edge threshold was set to 60 for all images. The performance in terms of PSNR can be seen in Table 1 for our and other demosaicking algorithms. The results for the red, green, blue, and all planes of each image are shown from top to bottom, respectively. The better PSNR result of each row is shown in bold. We also provide some examples from the images used in the experiments for visual comparison. Figures 13 and 14 show cropped segments
from original images (Images 8 and 14 in Fig. 12), and the corresponding reconstructed images from the demosaicking algorithms that were used in comparison.
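For completeness, a minimal sketch of the evaluation protocol used above (simulating the CFA from the full-color images and scoring with PSNR) is given below; the Bayer phase (GRBG here), the peak value, and the function names are assumptions of the sketch.

```python
# Hedged sketch of CFA simulation by channel sampling and PSNR measurement.
import numpy as np

def bayer_sample(img):
    """img: HxWx3 float array. Returns the three sparsely-sampled planes."""
    H, W, _ = img.shape
    R = np.zeros((H, W)); G = np.zeros((H, W)); B = np.zeros((H, W))
    G[0::2, 0::2] = img[0::2, 0::2, 1]; G[1::2, 1::2] = img[1::2, 1::2, 1]
    R[0::2, 1::2] = img[0::2, 1::2, 0]
    B[1::2, 0::2] = img[1::2, 0::2, 2]
    return R, G, B

def psnr(ref, est, peak=255.0):
    mse = np.mean((ref.astype(float) - est.astype(float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```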
6 Conclusions
In this paper, an adaptive CFA interpolation algorithm with better performance is proposed. The proposed constant hue model based on the color difference ratio rule is successively used to estimate the missing color primaries. Since most false colors and zipper artifacts occur in edge regions, an adaptive iterative algorithm using luminance-color difference correlation and edge direction information is used to improve the image quality around the edges. The proposed stopping criterion is suitable for all images. Simulation results on 16 photographic color images show that the proposed algorithm obtains better image quality than the demosaicking method based on the color difference rule.
References 1. Bayer, B.: Color Imaging Array. U.S. Patent, no. 3,971,065 (1976) 2. Pei, S.C., Tam, I.K.: Effective color interpolation in CCD color filter array using signal correlation. IEEE Trans. on Circuits Systems Video Technology 13, 503–513 (2003) 3. Li, X.: Demosaicing by successive approximation. IEEE Trans. on Image Processing 14, 370–379 (2005) 4. Chang, L., Tan, Y.P.: Effective use of spatial and spectral correlations for color filter array demosaicking. IEEE Trans. on Consumer Electronics 50, 355–365 (2004) 5. Kimmel, R.: Demosaicing: Image reconstruction from CCD samples. IEEE Trans. on Image Processing 8, 1221–1228 (1999) 6. Cok, D.R.: Signal processing method and apparatus for producing interpolated chrominance values in a sampled color image signal. U.S. Patent, no. 4,642,678 (1987) 7. Gunturk, B.K., Altunbasak, Y., Mersereau, R.M.: Color plane interpolation using alternating projections. IEEE Trans. on Image Processing 11, 997–1013 (2002) 8. Li, X., Orchard, M.T.: New edge-directed interpolation. IEEE Trans. on Image Processing 10, 1521–1527 (2001) 9. Chang, H.A., Chen, H.: Directionally weighted color interpolation for digital cameras. ISCAS 6, 6284–6287 (2005) 10. Alleysson, S., Susstrunk, S., Herault, J.: Linear demosaicking inspired by human visual system. IEEE Trans. on Image Processing 14, 439–449 (2005) 11. Lian, N., Chang, L., Tan, Y.P.: Improved color filter array demosaicking by accurate luminance estimation. ICIP 1, 41–44 (2005) 12. Wu, X., Zhang, N.: Primary-consistent soft-decision color demosaicking for digital cameras. IEEE Trans. on Image Processing 13, 1263–1274 (2004) 13. Lu, W., Tan, Y.P.: Color Filter Array Demosaicking: New Method and Performance Measures. IEEE Trans. on Image Processing 12, 1194–1210 (2003) 14. Kodak Lossless True Color Image Suite (2007), http://www.r0k.us/graphics/kodak/
Automatic Multiple Visual Inspection on Non-calibrated Image Sequence with Intermediate Classifier Block Miguel Carrasco and Domingo Mery Departamento de Ciencia de la Computaci´ on Pontificia Universidad Cat´ olica de Chile Av. Vicu˜ na Mackenna 4860(143), Santiago de Chile
[email protected],
[email protected]
Abstract. Automated inspection using multiple views (AMVI) has been recently developed to automatically detect flaws in manufactured objects. The principal idea of this strategy is that, unlike the noise that appears randomly in images, only the flaws remain stable in a sequence of images because they remain in their position relative to the movement of the object being analyzed. This investigation proposes a new strategy based on the detection of flaws in a non-calibrated sequence of images. The method uses a scheme of elimination of potential flaws in two and three views. To improve the performance, intermediate blocks are introduced that eliminate those hypothetical flaws that are regular regions and real flaws. Use is made of images captured with a non-calibrated vision system, in which optical, geometric and noise disturbances in the image are not corrected, forcing the proposed method to be robust so that it can be applied in industry as a quality control method in non-calibrated vision systems. The results show that it is possible to detect the real flaws and at the same time eliminate most of the false alarms. Keywords: computer vision, multiple view geometry, automated visual inspection, defect detection, industrial applications.
1
Introduction
Since the early 1980s various authors have shown the need to introduce Automatic Visual Inspection (AVI) systems in production processes [1,2,3]. According to them, there is no methodology applicable to all cases, since development is an ad hoc process for each industry. However, there is clear consensus that the use of AVI technologies can reduce significantly the cost and time spent in the process of inspection, allowing the replacement of a large number of inspectors of average training by a limited group of highly trained operators [4]. This has led in recent years to increased productivity and profitability, and to a reduction in labor costs [5]. In spite of their advantages, AVI systems have the following problems: i) they lack precision in their performance, since there is no balance between undetected flaws (false negatives) and false alarms (false positives); ii)
they are limited by the mechanical rate required for placing the object in the desired position; iii) they require a high computer cost for determining whether the object is defective or not; and iv) they generate high complexity in the configuration and lack of flexibility for analyzing changes in parts design. For those reasons, AVI remains as a problem open to the development of new applications. To counteract the difficulties mentioned above, in recent years a new methodology has been developed to detect flaws automatically making use of the potential of multiple views called Automatic Multiple View Inspection [6,7,8,9] (AMVI). The main objective of AMVI is to exploit the redundancy of information in multiple views that have corresponding parts of the object that is being analyzed, so the information captured from different viewpoints can reinforce the diagnosis made with a single image. This original strategy, presented in [6], requires prior calibration of the image sequence acquisition system. In the calibration we seek to establish the transfer function that projects a 3D point in the object onto a 2D point on the image. Unfortunately, the calibration process is difficult to carry out in industrial environments due to the vibrations and random movements that vary in time and are not considered in the original estimated transfer function. An alternative method for carrying out the AMVI strategy in non-calibrated sequences was presented in [7] for sequences with two images, and in [8] for sequence with three images. In order to achieve an adequate performance, the number of false alarms to be tracked must be reduced. For that reason, the objective of our research is to improve the performance of the original AMVI scheme by introducing intermediate classifiers between the changes of views in order to reduce the number of false alarms and increase the performance in the detection of real flaws, and additionally to perfect the method of detection of control points to avoid calibration. The remainder of this document is organized as follows: Section 2 includes background information on AMVI methodology; Section 3 deals with the proposed method, includes a description of the methodology used to generate artificial control points, and to estimate the fundamental matrix and trifocal tensors robustly; Section 4 includes the new intermediate classifier methodology; Section 5 shows the experimental results; and finally, Section 6 presents the conclusions and future work.
2
Background of Multiple Automatic Visual Inspection
Geometric analysis with multiple views represents a new field of analysis and development of machine vision [11,12]. The main idea is to get more information on the test object by using multiple views taken from different viewpoints. Using this idea, a new methodology for detecting flaws automatically, called Automatic Multiple View Inspection (AMVI) was developed in [6]. AMVI methodology is based mainly on the fact that only real flaws and not false alarms can be seen in the image sequence because their position remains stable relative to the object’s motion. Therefore, having two or more views of the same object from different viewpoints makes it possible to discriminate between real flaws and false
Fig. 1. Visualization of the general model of AMVI for the identification and tracking of hypothetical flaws in two and three views
Fig. 2. Flaw detection: a) section of a radioscopic image with a flaw inscribed on the edge of a regular structure; b) application of the Laplacian filter on an image with σ = 1.25 pixels (kernel = 11x11); c) zero crossing image; d) image gradient; e) detection of edges after increasing them to the highest levels in the gradient; and f) detection of flaws using the variance of the crossing line profile (see details in [10])
alarms by means of a geometric tracking process in multiple views. AMVI has been developed under two schemes: calibrated and non-calibrated. Both methods detailed below share the following two steps: identification and tracking. Identification: It consists in detecting all the anomalous regions or hypothetical flaws in each image of a motion sequence of the object, without a priori knowledge of its structure. The segmentation of hypothetical flaws allows the identification of regions in each image of the sequence which may correspond to real flaws (Fig.2) (see details in [10]). The process that follows is to extract the characteristics of each hypothetical flaw after identifying the regions by the previous procedure. Various investigations of the extracted characteristics have been described in [10,13]. This information makes it possible to determine if a flaw is corresponding in the multiple view analysis, according to the new intermediate classification system described in Section 4. Tracking: It consists in following in each image of the sequence the hypothetical flaws detected in the first step, using the positions forced by the geometric restrictions in multiple views. If the hypothetical flaws continue through the image sequence, they are identified as real flaws, and the object is classified as defective. On the other hand, if the hypothetical flaws do not have correspondence in the sequence, they will be considered as false alarms (Fig.1). The AMVI
methodology has as its main foundation the fact that only real flaws, and not false alarms, can be seen throughout the image sequence, because their position remains stable relative to the object’s motion. AMVI is currently a useful tool and a powerful alternative for examining complex objects. It provides two independent approaches: those based on the calibration of a 3D→2D transfer function in the multiple view projection [6], and those based on the estimation of the motion of the control points in correspondence for pairs [7] and triplets of views [8] without prior calibration. A brief description of each is given below. i) Calibrated method: The calibrated image sequence flaw tracking method was first used as a quality control method for aluminum castings [6]. This approach consists in the estimation of the 3D→2D model through an off-line process called calibration [14], which is the process that allows the determination of the model’s parameters to establish the projection matrix of a 3D point of the object at a 2D point of the digital image. Unfortunately, the model’s parameters are usually nonlinear, which implies that the optimization problem does not have a closed solution. For that reason, it is finally impractical in industrial environments, where there are vibrations and random movements that are not considered in the original transfer function, i.e., the calibration is not stable, and the computer vision system must be calibrated periodically to avoid this error. ii) Non-calibrated method: To avoid the problems involved in the calibrated method, a new system was developed using a sequence of non-calibrated images for two views [7]. This system does not require prior calibration. On the contrary, it can estimate the model of the motion using the images of the sequence in a procedure that can be carried out in line with the computer vision system. In general, to achieve high precision in the motion model it is necessary to determine a large number of correspondences of control points in pairs and triplets of images in sequence. Many times this condition is difficult to achieve, and for that reason the RANSAC algorithm [11] was used in [8]. In our work, as will be seen in following sections, the attempt is made to improve the non-calibrated approach. First, our proposed method does not involve affine transformations, on the contrary, it aims at estimating correspondences through a geometric process in multiple views. Second, we use intermediate blocks that eliminate those hypothetical flaws that are regular regions and real flaws in order to increase the performance.
3
Proposed Method
Below is an explanation of each of the stages of the non-calibrated AMVI process with intermediate classifiers. The proposed scheme has three steps (A, B and C) detailed in Fig.3. They correspond to the stages of identification (A), extraction of control points (B), and tracking (C). A. Identification of Hypothetical Flaws: The identification stage allows the detection of the hypothetical flaws by means of an algorithm without a priori knowledge of the object that is analyzed. Its most important characteristic is
the high percentage in the detection of real flaws, in spite of the existence of false alarms. Using the method of segmentation and extraction of characteristics described in [10] we determine all the regions with a high probability of being real flaws. The next step is to determine the position of the center of mass for each hypothetical flaw. For each image, mi will be used to denote the center of mass of the segmented region ri. In homogenous coordinates, mi = [xi, yi, 1]T represents the 2D spatial position of a point i. This information makes it possible to analyze the trajectories of the hypothetical flaws in the subsequent stages of the proposed method. B. Robust Control Points: The control point extraction stage allows the determination of corresponding points in multiple views. The process has two general steps: identification of control points, and matching of control points. The first step consists in determining the possible regions that can be in correspondence. The second step allows discarding possible combinations of correspondence that have a large error, storing a subset of correspondences with the highest precision. In our investigation we propose a new curve alignment system by maximizing Pearson's correlation coefficient [15] in the correspondence between 2D curves, using an isometric transformation between the curves. We use this scheme because in the analysis of manufactured products the object that is analyzed is usually not deformable. This premise justifies the use of a rigid transformation method with which, given a rotation and displacement, it is possible to estimate a correspondence between the object's control points. However, due to the object's rotation, some regions can remain occluded, and therefore the proposed system must consider that only some regions retain this transformation.
Fig. 3. Visualization of the general model for the identification, control point estimation, and tracking of hypothetical flaws in two and three views
The proposed robust system of control points consists of two stages that are detailed below: matching of regions, and matching of control points. B.1) Matching of regions: It consists in establishing the correspondence between regions of each view and not the control points. The process designed consists of four stages: First, segmenting those regions in which the intensity of the object is distinguishable from the background, using the method of Otsu [16]. Second, extracting a set of characteristics for each segmented region. This consists in extracting the centers of mass, the area, and the order moments of Flusser-and-Suk [17] of each region in the three views. Third, determining the matching between the segmented regions using the characteristics extracted before, and relating those regions having greater similarity according to a Euclidean distance metric. Fourth, once the correspondences between the regions have been determined, extracting the edges of each region and smoothing them to decrease the noise of each curvature. For that we calculate the perimeter of each segmented region and generate a list in parametric form as Zs = [xs, ys], where s = 0, ..., L − 1 is the index of the list of pixels ordered in a turning direction, and L is the number of pixels of the region's perimeter. Using this parametric form, we generate the Fourier descriptors [18], transforming the Zs coordinates into a complex value us = xs + j·ys. This signal with period L is transformed into the Fourier domain by means of a discrete Fourier transform (DFT) [19]:

F_n = Σ_{s=0}^{L−1} u_s · e^{−j·2π·s·n/L}    (1)
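A minimal sketch of this Fourier-descriptor smoothing is given below; the way the "above 98%" energy criterion is applied (described in the next paragraph) is an assumption of the sketch, as are the NumPy FFT conventions and function names.

```python
# Hedged sketch: contour as a complex signal, DFT per Eq. (1), keep only the
# highest-energy coefficients, and transform back to obtain a smoother curve.
import numpy as np

def smooth_contour(xs, ys, energy=0.98):
    u = np.asarray(xs, dtype=float) + 1j * np.asarray(ys, dtype=float)
    F = np.fft.fft(u)                                  # Eq. (1), up to DFT convention
    energies = np.abs(F) ** 2
    order = np.argsort(energies)[::-1]                 # strongest coefficients first
    cum = np.cumsum(energies[order]) / energies.sum()
    keep = order[: int(np.searchsorted(cum, energy)) + 1]
    Fs = np.zeros_like(F)
    Fs[keep] = F[keep]
    c = np.fft.ifft(Fs)                                # smoothed curve C_s
    return c.real, c.imag
```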
The modulus of the complex Fourier coefficients describes the energy of each descriptor. Therefore, if we choose the highest-energy coefficients (above 98%) and return to real space with the inverse discrete Fourier transform (IDFT), we get a smoother curve with less noise. However, when applying the elimination of some Fourier coefficients, the original curve is transformed into a new curve Cs = [xs, ys], where Cs ≠ Zs. B.2) Matching of Control Points: It is a process in which the correspondence of points of the curve is established for each view. Using the Fourier processing described above, we define a curve C1 corresponding to a region in the first view, and a curve C2 corresponding with C1 in the second view. For both curves to keep the same distance and be aligned it is necessary to select a section of each list having equal length. Let P, a section of curve C, be such that P = C(δ), where δ = [si, ..., sj], with i, j ∈ [1, ..., n]. In this way there is a section P1 in the first view that has the same length as section P2 in the second view. These sections of the curve do not necessarily have a correspondence, and for that we define a shift operator Θ(P, λ) that displaces list P by λ positions in a turning direction. Operator Θ uses the function "mod" (modulus after division) to determine the λ relative positions that list C, of length P, must turn. Using the above definitions, we design an alignment function as the maximization of Pearson's correlation
coefficient ρ(α, β) [15] between the isometric transformation of a section P1 and the shift of a section P2 with a jump λ:

{θ, Δsx, Δsy, λ} = arg max ρ([R, t][P1], Θ(P2, λ)),    (2)

where

R = [ cos θ  −sin θ ;  sin θ  cos θ ],    t = [ Δsx, Δsy ]^T.    (3)
This maximization function must find parameters {θ, Δsx , Δsy , λ} to estimate an alignment between sections P1 and P2 . The main advantage of using this function is that it does not require a perfect alignment because the correlation coefficient is maximum when the displacement is linear. Another advantage is that curves P1 and P2 are open, so the alignment determines only sections that are corresponding, allowing control points to be obtained for curves that have partial occlusion in corresponding regions. Also, the use of parameter λ allows finding a position relation for curve C2 with P1 , and in this way, while curve P2 adjusts its shift, curve P1 adjusts its translation and rotation to become aligned. C. Tracking of Potential Flaws: The tracking stage allows tracking of hypothetical flaws obtained in the identification stage. The method for carrying out the tracking takes place through multiple view geometric analysis [11]. C.1). Two Views: The mathematical formulation that allows relating two points in stereo images is called the fundamental matrix [11]. Within the AMVI field its use is vital because it allows the trajectories of hypothetical flaws to be analyzed in two views and to verify if the flaws are corresponding. In this case, if point mp of the first view corresponds to mq , in the second view, the following relationship is established: mT q · Fpq · mp = 0
(4)
where Fpq is the fundamental matrix of the projection of points mp and mq in homogenous coordinates as [xp , yp , 1]T and [xp , yp , 1]T , respectively. Once the set of corresponding positions has been generated in each region in both views by the method proposed in section 3-B, we use the robust RANSAC algorithm to estimate the fundamental matrix [11]. In our investigation we used the method proposed by Chen et al. [20] to make an initial estimation of the fundamental matrix. The modification of Chen’s method consists in choosing a subset of candidate points by means of the epipolar restriction. So in our method we use a combination of the algorithm of Hartley and Zisserman [11] with the normalization of the 2D coordinates, followed by an estimation of the fundamental matrix through the biepipolar restriction [20]. Therefore, using the centers of mass for each hypothetical flaw generated in section 3-A, we generate the epipolar line thus lqi = FT pq · mpi = [lx , ly , lz ]i , where lqi is the epipolar line of flaw i in the second view, and mpi is the center of mass of flaw i in the first view. Once the epipolar line of flaw i of the first view has been generated, it is necessary to determine the distance between the
corresponding flaw in the second view and the epipolar line. This distance is determined through the practical bifocal constraint [12] as

d(m_pi, F, m_qj) = |m_qj^T · F · m_pi| / √(l_x² + l_y²) < ε    (5)
For any flaw i in the first view and flaw j in the second view, we define mpi and mqj to be the centers of mass of the regions rpi and rqj in each view, respectively. If the Euclidean distance between mqj and the epipolar line of mpi is less than a given ε, this implies that the hypothetical flaw in the second view is related to mpi (Fig.4). If the hypothetical flaw is found in both images, then it is considered to be a flaw in the bifocal correspondence, if this is not the case, the region is discarded. C.2) Three views: The initial estimation of the tensors is carried out with Shashua’s four trilinearities [21]. In particular we use an estimation of the tensors that maximizes the number of inliers according to the RANSAC trifocal algorithm. Furthermore, the estimation of the tensors was made with the normalized linear algorithm [11, pp.383]. Once the trifocal tensors have been determined, it is possible to verify whether three points mp , mq and ms are corresponding in the first, second, and third view, respectively. For that we use the re-projection of the trifocal tensor in the third view using the positions mp and mq in the first two views applying the point-line-point method [11, pp.373]. We use only the centers of mass of the first two views which fulfill the bifocal relationship from section 3-B. Let us define ms as the center of mass of region rs from the third view. If the Euclidean distance between the real position of the hypothetical flaw ms and that which is estimated with the trifocal tensors, m ˆ s , is less than some value ε, we take the hypothetical flaw to be a real flaw, since it complies with the correspondence in three views as ds = m ˆ s − ms < ε. Should the hypothetical flaw in the third view not agree with the projection of the tensor, it is discarded, as it does not fulfill the trifocal condition [21].
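A minimal sketch of this bifocal test is shown below, assuming the convention m_q^T · F · m_p = 0 so that the epipolar line in the second view is l = F·m_p; the threshold value and function names are illustrative, and the trifocal re-projection step is not shown.

```python
# Hedged sketch of the epipolar-distance check of Eq. (5): a hypothetical flaw
# in the second view is kept only if its centre of mass lies close enough to
# the epipolar line induced by a flaw in the first view.
import numpy as np

def epipolar_distance(F, m_p, m_q):
    """m_p, m_q: homogeneous 2D points [x, y, 1] in views 1 and 2."""
    l = F @ m_p                                   # epipolar line in the second view
    return abs(m_q @ l) / np.hypot(l[0], l[1])    # Eq. (5)

def bifocal_matches(F, pts1, pts2, eps=3.0):      # eps in pixels (assumed value)
    return [(i, j) for i, p in enumerate(pts1)
                   for j, q in enumerate(pts2)
                   if epipolar_distance(F, p, q) < eps]
```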
Fig. 4. Epipolar line generated automatically from the fundamental matrix: a) first view; b) zoom-identification of a hypothetical flaw; c) intersection of the epipolar line in the second view with one or more corresponding hypothetical flaws
4
Intermediate Classifier Block Method
The Intermediate Classifier Block (ICB) method proposed uses the classifier ensemble methodology [22], in which different linear classifiers do the classification and then, through the majority of votes technique, the final classification decision is made. The objective of the ICB method is to eliminate those correspondences between hypothetical flaws that have a low probability of being the same flaw in the different views. The ICB method has as input the distribution of two classes: Flaws (F) and Regular Structures (RS). According to this distribution, the classifier must determine the region of space where there are actually flaws only starting from point θF , and regular structures from θS (Fig.5a). Once these regions are extracted, only the hypothetical flaws contained in the region in which the classifier cannot verify with high probability the kind of class to which they belong are assigned to a new class called Potential Flaw (PF) (Fig.5b). This reduction avoids the analysis of the trajectories of all the flaws in correspondence, thereby improving the performance. The simplest form of the previous classifier is reflected in the linear separation of the RS, PF and F regions, using the V1 and V2 features (Fig.5b). In the case of having three features [V1 , V2 , V3 ], the separation between them generates a three-dimensional volume bounded by the cuts of the two-dimensional separations, containing only the hypothetical flaws considered as potential flaws (PF) (Fig.5c). This three-dimensional volume generated from the combination of the two-dimensional features [V1 , V2 ], [V1 , V3 ] and [V2 , V3 ] contains the potential flaws that will be analyzed in the following phases of the multiple views analysis. On the other hand, the regions outside the three-dimensional volume can be flaws or regular structures, depending on the position in which the hyperplanes are projected. Our analysis considers the combination of two to seven features, giving rise to multidimensional section maps generated from the two-dimensional combinations. The methodology used by the ICB consists of a series of stages detailed below. V3
Fig. 5. (a) distribution of classes of hypothetical flaws between the views; b) distribution of classes in two dimensions with the linear separation of the RS, PF and F regions; c) Three dimensional representation of the ICB classification system
i) Assessment method of the ICB classifier: Our problem falls within the framework of supervised classification problems, since the class to which each potential flaw belongs is known. Using this information, the classification model is designed by means of the cross-validation method [23]. To compare the results of the various configurations of the classifier we use two parameters known from the ROC curves [24], sensitivity and 1-specificity, which allow the measurement of the performance of a classification of two classes. The main characteristic of the ROC curve is that it allows the comparison to be independent of the sample. The objective is for the sensitivity to be maximum (100%) and at the same time the 1-specificity to be minimum (0%), and in this way the classifier guarantees an ideal classification for two classes. In practice this is difficult to achieve, because it depends on the classifier's internal parameters, and it can be quite variable with respect to the noise existing in the data. ii) Selection of characteristics: The characteristics selected by the ICB classifier are determined automatically using the information contained in each potential flaw, each of which has an associated characteristics vector. To determine the combination of characteristics that separates the classification space we use the Take-L-Plus-R characteristics selection algorithm [25]. The objective of this algorithm is to determine the best characteristics that allow a greater separation of the space between the classes. In our research we used Fisher's discriminant as the criterion function [26]. iii) Linear Classification: We use a linear classification system that allows finding the hyperplanes that best separate the solution space. For that, the classification process must fit the linear equation w^T·v + w0 > 0, where w = Σ_w^{−1}·(v̄1 − v̄2) are the hyperplane parameters, Σ_w is the interclass covariance matrix, and v corresponds to the characteristics vector chosen earlier. Finally, the factor w0 for two classes is determined from the means of the characteristics, v̄1 and v̄2, and the probabilities of each class, p_e1 and p_e2, according to

w0 = −(1/2)·(v̄1 + v̄2)·Σ_W^{−1}·(v̄1 − v̄2) − log(p_e1 / p_e2)    (6)

Once an initial solution is obtained for parameters w, the optimization problem tries to fit the hyperplanes so that (7) is maximum, and in that way always ensure that we are obtaining a high performance for each subselection of characteristics:

{w, w0} = arg max {Sn(w, w0)}   s.t.   Sp(w, w0) = 1    (7)
This problem has been solved by the Nelder-Mead Simplex method [27]. Then the information from the selected straight lines and characteristics is used to evaluate the performance of the classifier on the test data. At some time each register is used to build the model or to be part of the test. This is necessary because of the low number of registers available at the time of identifying the hypothetical flaws, and for that reason we used the ensemble of classifiers [22].
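A minimal sketch of this linear discriminant is given below; the pooled covariance estimate and the class priors taken from sample proportions are assumptions, and the subsequent hyperplane refinement of Eq. (7) via Nelder-Mead is not shown.

```python
# Hedged sketch of the w and w0 expressions around Eq. (6).
import numpy as np

def linear_icb(V1, V2):
    """V1, V2: (n_i x d) feature matrices for the two classes."""
    m1, m2 = V1.mean(axis=0), V2.mean(axis=0)
    Sw = np.cov(V1, rowvar=False) + np.cov(V2, rowvar=False)   # pooled scatter (assumed)
    w = np.linalg.solve(Sw, m1 - m2)                            # w = Sw^{-1} (v1 - v2)
    p1 = len(V1) / (len(V1) + len(V2)); p2 = 1.0 - p1
    w0 = -0.5 * (m1 + m2) @ np.linalg.solve(Sw, m1 - m2) - np.log(p1 / p2)   # Eq. (6)
    return w, w0

# classification rule: assign to the first class if w @ v + w0 > 0
```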
5
Experimental Results
This section presents the results of experiments carried out on a sequence of 70 radioscopic images of aluminum wheels (see some of them Fig.6). There are twelve known real flaws in this sequence. Three of them are flaws detected by human visual inspection (∅ = 2.0 ∼ 7.5 mm). The remaining nine were generated by a drill which made small holes (∅ = 2.0 ∼ 4.0 mm) positions that would make their detection difficult. The method was applied to 70 radioscopic images (578 x 768 pixels) of aluminum wheels generated in [6] for which the angle of rotation of 5◦ is known for each sequence in the image. We separated the analysis into three steps. i) identification: in this step potential flaws are automatically identified in each image. The result of the identification generates a data base that contains 424 registers with 11 characteristics of the total hypothetical flaws detected in the sequence, 214 registers are real flaws, and 210 registers are regular structures or false alarms that must be reduced. ii) tracking: in this step we track the identified potential flaws in the image sequence in two and three views. iii) ICB method : finally, we analyzed the performance of the classifiers inserted in two and three views, to filter the hypothetical flaws between the views. The last two steps are detailed below. i) Performance with two views: The results indicate that the model detects 100% of the real flaws that are corresponding in two views (Table 1, Track 2 Views). This validates the assumption of correspondence between the position of the real flaws and implies that automatic detection with the fundamental matrix allows the detection of corresponding flaws that are contained on the epipolar line, and this agrees with the results given in [6] and [7]. There is, however, a large number of false alarms in sequence (198/388=51%), which must be reduced using a third view. ii) Performance with three views: After completing the matching of possible pairs of flaws in both images, we extend the detection of flaws to the third image in the sequence. In this case the performance remains at 100% of real flaws detected in sequence, but it is seen, however, that it has not been possible to eliminate all the false alarms (Table 1, False Alarms). Furthermore, it is seen that the ICB method in two and three views has allowed the detection of a large part of the real flaws (F) and regular structures (RS) with high probability, allowing them to be separated from the multiple views analysis. Table 1. Performance of the Uncalibrated Tracking Step
                 Flaws   Regular Structure   Real Flaws   False Alarms
Track 2-Views     190          198              100%          51.0%
ICB-2             151           94              100%          24.2%
Track 3-Views     137           45              100%          11.6%
ICB-3              18           17              100%           4.4%
Table 2. Comparison between different calibrated and non-calibrated tracking
Method         Tracked   Year-Reference   Analyzed Images   True Positives   False Positives
Calibrated        3        2002 [6]             70               100%              25%
Calibrated        4        2002 [6]             70               100%               0%
Calibrated        5        2002 [6]             70                83%               0%
Uncalibrated      2        2005 [7]             24              92.3%              10%
Uncalibrated      2        2006 [8]             70               100%             32.9%
Uncalibrated      3        2006 [8]             70              98.8%              9.9%
Uncalibrated      2        2007 [9]             70              86.7%              14%
Uncalibrated      2        2007 new             70               100%             24.2%
Uncalibrated      3        2007 new             70               100%              4.4%
iii) Performance of ICB: The greatest advantage of ICB classifiers for two and three views is the extraction of flaws and regular structures with high probability. The results indicate a clear relation between the performance of the ICB method for two views and the number of characteristics chosen. In this way, with the five best combinations of characteristics the performance in the classification is ideal, but there is a clear decrease in the number of flaws extracted by the ICB method (Fig.7a). In the case of three views, the number of correspondences is drastically reduced because the correspondence of a hypothetical flaw in three images has a lower probability of occurrence (Fig.7b). iv) Comparison with other methods: Finally, we present a summary of the performance obtained with the calibrated and non-calibrated AMVI approximation (see Table 2). It shows the performances corresponding to the tracking phase of the different investigations carried out with an non-calibrated sequence of X-ray images designed in [6]. According to the results generated in two and three views by the same authors in 2006 [8], it is seen that the intermediate classification block (ICB) technique has allowed a reduction of 8.7% in the correspondence number in two views, and of 5.5% in the case of three views, with a 4.4% remainder that it has not been possible to eliminate by geometric analysis. 1
Fig. 6. Generalized flaw estimation process in one sequence of three views: a) segmentation of hypothetical flaws; b) projection of the epipolar line in the second view using the robust fundamental matrix; c) projection of the coordinates of images 1 and 2 using trifocal tensors over the third view
(a) Two views — # feature combinations 2 to 7:
    Sn:            92.65%  95.59%  98.36%  100.00%  100.00%  100.00%
    1-Sp:           2.41%   0.00%   0.00%    0.00%    0.00%    0.00%
    Reduction ICB: 52.25%  49.48%  43.60%   17.30%   12.80%   10.03%
(b) Three views — # feature combinations 2 to 7:
    Sn:            95.83%  98.04%  96.00%   99.14%   99.11%  100.00%
    1-Sp:          25.00%  16.67%  26.67%   22.22%    0.00%    6.67%
    Reduction ICB: 81.55%  61.17%  79.65%   89.38%   75.29%   75.72%
Fig. 7. Sensitivity and 1-Specificity performance of the ICB classifier, and percentage of fails reduction of ICB classifier, (a) for two views , b) for three views
6
Conclusions
This investigation presents the development of a new flaw detection algorithm in manufactured goods using an non-calibrated sequence of images. Using new AMVI methodology with the ICB elimination system, we have designed a novel system of automatic calibration based only on the spatial positions of the structures. We based our investigation on the assumption that hypothetical flaws are real flaws if their positions, in a sequence of images, are in correspondence because they remain stable in their position relative to the movement of the object. With respect to the investigation carried out in [8], we have introduced the calculation of corresponding points generated artificially through the maximization of Pearson’s correlation coefficient [15] for two curves. Our results indicate that it is possible to generate an automatic model for a sequence of images which represent the movement between the points and the regions contained in them. In this way we can use as reference points the edges of the structures or areas with no loss of information using a nonlinear method. The main advantage of our model is the automatic estimation of movement. Our future aim is to reduce the number of false alarms by means of a method of final verification of the flaws in correspondence, and an analysis of the ICB classification method with other ensemble classification and probabilistic techniques. Acknowledgments. This work was partially supported by a grant from the School of Engineering at Pontificia Universidad Cat´ olica de Chile.
References 1. Malamas, E.N., Petrakis, E.G., Zervakis, M.: A survey on industrial vision systems, applications and tools. Image and Vision Computing 21(2), 171–188 (2003) 2. Newman, T.S., Jain, A.K.: A survey of automated visual inspection. Computer Vision and Image Understanding 61(2), 231–262 (1995) 3. Chin, R.T.: Automated visual inspection: 1981-1987. Computer Vision Graphics Image Process 41, 346–381 (1988)
M. Carrasco and D. Mery
4. Nakagawa, M., Ohnishi, K., Nakayasu, H.: Human-oriented image recognition for industrial inspection system. In: IEEE International Workshop on Robot and Human Interactive Communication, pp. 52–56. IEEE Computer Society Press, Los Alamitos (2000) 5. Brandt, F.: The use of x-ray inspection techniques to improve quality and reduce costs. The eJournal of Nondestructive Testing and Ultrasonics 5(5) (2000) 6. Mery, D., Filbert, D.: Automated flaw detection in aluminum castings based on the tracking of potential defects in a radioscopic image sequence. IEEE Trans. Robotics and Automation 18(6), 890–901 (2002) 7. Mery, D., Carrasco, M.: Automated multiple view inspection based on uncalibrated image sequence. In: Kalviainen, H., Parkkinen, J., Kaarna, A. (eds.) SCIA 2005. LNCS, vol. 3540, pp. 1238–1247. Springer, Heidelberg (2005) 8. Carrasco, M., Mery, D.: Automated visual inspection using trifocal analysis in an uncalibrated sequence of images. Materials Evaluation 64(9), 900–906 (2006) 9. Pizarro, L., Mery, D., Delpiano, R., Carrasco, M.: Robust automated multiple view inspection. Pattern Analysis and Applications 10. Mery, D.: Crossing line profile: a new approach to detecting defects in aluminium castings. In: Bigun, J., Gustavsson, T. (eds.) SCIA 2003. LNCS, vol. 2749, pp. 725–732. Springer, Heidelberg (2003) 11. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 1st edn. Cambridge University Press, Cambridge, UK (2000) 12. Faugueras, O., Luong, Q.: The geometry of multiples images. MIT Press, Cambridge (2001) 13. Mery, D.: High contrast pixels: a new feature for defect detection in x-ray testing. Insight 46(12), 751–753 (2006) 14. Mery, D.: Exploiting multiple view geometry in x-ray testing: Part I, theory. Materials Evaluation 61(11), 1226–1233 (2003) 15. Dunn, O.J., Clark, V.A.: Applied statistics: analysis of variance and regression. Wiley, Chichester (1974) 16. Haralick, R., Shapiro, L.: Computer and Robot Vision. Addison-Wesley Publishing Co., New York (1992) 17. Sonka, M., Hlavac, V., Boyle, R.: Image Processing, Analysis and Machine Vision, 2nd edn. PWS Publishing, Pacific Grove, CA (1999) 18. Persoon, E., Fu, K.S.: Shape discrimination using fourier descriptors. IEEE Transactions on System, Man and Cybernetics, 170–179 (1977) 19. Castleman, K.R.: Digital Image Processing. Prentice-Hall, New Jersey (1996) 20. Chen, Z., Wu, C., Shen, P., Liu, Y., Quan, L.: A robust algorithhm to estimate the fundamental matrix. Patter Recognition Letters 21, 851–861 (2000) 21. Shashua, A., Werman, M.: Trilinearity of three perspective views and its associated tensor. In: 5th International Conference on Computer Vision (ICCV 1995), Boston MA, p. 920 (1995) 22. Polikar, R.: Ensemble systems in decision making. IEEE Circuits and Systems Magazine 6(3), 21–45 (2006) 23. Mitchel, T.M.: Machine Learning. McGraw-Hill, Boston (1997) 24. Egan, J.: Signal detection theory and ROC analysis. Academic Press, New York (1975) 25. Duda, R.O., Hart, P.E., Stork, D.G: Pattern Classification, 2nd edn. John Wiley & Sons, Inc., New York (2001) 26. Stearns, S.D.: On selecting features for patterns classifiers. In: IAPR International Conference on Pattern Recognition, pp. 71–75 (1976) 27. Lagarias, J.C., Reeds, J.A., Wright, M.H., Wright, P.E.: Convergence properties of the nelder-mead simplex method in low dimensions. SIAM Journal of Optimization 9(1), 112–147 (1998)
Image-Based Refocusing by 3D Filtering Akira Kubota1 , Kazuya Kodama2 , and Yoshinori Hatori1 1
Interdisciplinary Graduate School of Science and Technology, Tokyo Institute of Technology, Nagatsuta, Midori-ku, Yokohama 226-8502, Japan 2 Research Organization of Information and Systems, National Institute of Informatics, Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
[email protected],
[email protected],
[email protected] Abstract. This paper presents a novel spatial-invariant filtering method for rendering focus effects without aliasing artifacts from undersampled light fields. The presented method does not require any scene analysis such as depth estimation and feature matching. First, we generate a series of images focused on multiple depths by using the conventional synthetic aperture reconstruction method and treat them as a 3D image. Second we convert it to the alias-free 3D image. This paper shows this conversion can be achieved simply by a 3D filtering in the frequency domain. The proposed filter can also produce depth-of-field effects. Keywords: Refocus, Light Field, 3D Filtering, Aliasing.
1
Introduction
Using acquired multiview images or light fields, synthetically post-producing 3D effects such as disparity and focus has attracted attention recently [2,8]. If dense light fields data is available, these effects can be easily produced with high quality by light field rendering (LFR) method [6,4]. LFR method essentially performs resampling the acquired light fields, independent of the scene complexity. This paper addresses an image-based refocusing problem in the case when input light fields data was undersampled. In this case, applying LFR method to the undersampled data introduces aliasing or ghosting artifacts in the rendered image; hence the scene analysis such as depth estimation is needed to reduce the artifacts. The objective of this paper is to present a novel spatial-invariant filter that can produce both focal depth and depth-of-field effects with less aliasing artifacts, requiring no scene analysis. 1.1
Problem Description of Image-Based Refocusing
Consider a XY Z world coordinate system. We use multiview images f(i,j) (x, y) captured with a 2D array of pin-hole cameras on the XY plane as input light field data. The coordinate (x, y) denotes the image coordinate in every image and (i, j) ∈ Z2 represents the camera positions on the XY plane (both distance between cameras and focal length are normalized to be 1 for simpler notation). The scene is assumed to exist in the depth range of Zmin ≤ Z ≤ Zmax.
The goal of image-based refocusing in this paper is to reconstruct the image f(0,0) (x, y; Z, R), that is, the image at the origin of the XY plane focused at depth Z with an aperture of radius R. Since the target camera position is fixed at the origin, we simply represent the desired image as f (x, y; Z, R). 1.2
Synthetic Aperture Reconstruction
Refocusing effects can be generated by taking the weighted average of the multiview images that are properly shifted. This shifting-and-averaging method, called the synthetic aperture reconstruction (SAR) method, was presented by Haeberli [5] and first applied to real scenes by Isaksen et al. [4]. In the SAR method, as illustrated in Fig. 1(a) where parameters y and Y are fixed, the refocused image g(x, y; Z, R) is synthesized by

g(x, y; Z, R) = Σ_{i,j} w(i,j) · f(i,j)(x − i/Z, y − j/Z),    (1)

where w(i,j) is the weighting value for the multiview image f(i,j) and is a function of the aperture radius R:

w(i,j)(R) = (1/(πR²)) · e^{−(i² + j²)/R²}.    (2)
These weighting values are sampled from the point spread function (PSF) that we desire on the refocused image. In this paper, we use a Gaussian-type PSF function as shown in equation (2), because it produces visibly natural blur effects. The shift amounts, i/Z and j/Z, correspond respectively to the horizontal and vertical disparities between g (or f(0,0)) and f(i,j) with respect to the focal depth Z. In the resultant image g, objects on the focal plane appear sharp, while objects not on the focal plane appear blurred. Changing the aperture radius R produces depth-of-field effects on the image. In addition, the SAR method can be efficiently performed in the 4D Fourier domain by a 2D slice operation [3].
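A minimal sketch of this shifting-and-averaging synthesis is given below; integer-pixel shifting with np.roll, the sign/axis conventions, and the normalization by the weight sum are simplifying assumptions of the sketch rather than part of the method as stated.

```python
# Hedged sketch of Eqs. (1)-(2): 'views' maps integer camera positions (i, j)
# to images; weights follow the Gaussian aperture of radius R.
import numpy as np

def sar_refocus(views, Z, R):
    acc, wsum = None, 0.0
    for (i, j), img in views.items():
        w = np.exp(-(i * i + j * j) / (R * R)) / (np.pi * R * R)        # Eq. (2)
        shifted = np.roll(img, shift=(int(round(i / Z)), int(round(j / Z))),
                          axis=(0, 1))
        acc = w * shifted if acc is None else acc + w * shifted          # Eq. (1)
        wsum += w
    return acc / wsum       # normalised so the discrete weights sum to one
```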
1.3
Motivation
The SAR method requires a large number of images taken with densely arranged cameras. With sparsely spaced cameras, in the resulting image, the blurred regions suffer from aliasing or ghosting artifacts. The blur effects rendered in these regions are not the desired effects but unnatural blur effects due to sparsely sampled PSF (see fig. 1(a)). Our goal in this paper is to reduce aliasing artifacts in the blurred regions, improving the quality of synthetic aperture images. If continuous light fields f(X,Y ) (x, y) are available (see fig 1(b)), the desired refocused image can be generated as an integration of all light rays with appropriate weights of Gaussian function:
f (x, y; Z, R) =
w(X,Y ) f(X,Y ) (x−X/Z, y −Y /Z)dXdY, where w(X,Y ) =
2 2 2 1 e−(X +Y )/R ; (3) (πR2 )
hence the goal is to reconstruct f from g. This is however generally a difficult problem: one has to identify the aliased regions in g and change them into those with desired blur effects. One reasonable way is to estimate missing light rays densely and then apply the SAR method to the obtained dense light fields. This approach essentially requires estimating the scene structure to some extent, according to plenoptic sampling theory [7]. For instance, Georgeiv et al. [8] have used a morphing technique based on feature correspondences to fill all necessary light fields; as a result, they successfully produced focus effects with less aliasing artifacts.
388
A. Kubota, K. Kodama, and Y. Hatori
In this paper, in contrast to computer vision based approaches, we present a novel reconstruction method that does not require any scene analysis. Our idea is that we treat a series of images refocused on multiple depths by the conventional SAR method as a 3D image and convert it to the alias-free 3D image. This conversion can be achieved simply by a 3D filtering in the frequency domain; hence the computation is constant independent of the scene complexity. This approach can cope with rendering depth-of-field effects as well. Related to our method, Stewart [1] presented a novel reconstruction filter that can reduce artifacts by blurring aliased regions much more; however it cannot handle rendering effects of focal depth and depth-of-focus.
2
The Proposed Image-Based Refocusing Method
We derive a novel reconstruction filter that converts g into f . In our method, instead of g in (1), we use the following image series generated by SAR method with the circular aperture of fixed radius Rmax . g(x, y; Z, Rmax ) =
1 N
f(i,j) (x − i/Z, y − j/Z),
(4)
i,j∈Amax
where Amax denotes the aperture region defined as a set {(i, j)| i2 + j 2 ≤ Rmax }, and N is the number of cameras inside the aperture. 2.1
3D Image Formation and Its Modeling
By changing the focal depth Z, we synthesize the refocused image sequence and treat it as a 3D image. Introducing the parameter z that is inversely proportional to Z, we represent g(x, y; Z, Rmax ) in (4) and f (x, y; Z, R) in (3) as 3D image g(x, y, z; Rmax ) and f (x, y, z; R), respectively. The synthesis process of these 3D images can be modeled by g(x, y, z; Rmax ) = h(x, y, z) ∗ s(x, y, z) f (x, y, z; R) = b(x, y, z) ∗ s(x, y, z),
(5) (6)
where h(x, y, z) and b(x, y, z) are the 3D PSF, s(x, y, z) is the color intensity distribution in the 3D scene, and ∗ denotes a 3D convolution operation. These model are derived in similar manner to the observation model of multi-focus images in microscopic imaging [9]. Note that we assume here that z ranges (−∞, ∞) and effects of occlusions and lighting are negligible. The 3D PSF h is given by h(x, y, z) =
1 N
δ(x − iz, y − jz),
(7)
i,j∈Amax
where δ is Dirac delta function. Unlike the case of microscopic imaging, the PSF does not need to be estimated; it can be computed correctly based only on the
Image-Based Refocusing by 3D Filtering
389
camera arrangement set. The 3D PSF b is the desired PSF and can be ideally modeled by b(x, y, z) =
1 πR2
e−(X
2
+Y 2 )/R2
δ(x − zX, y − zY )dXdY.
(8)
The scene information s can be defined as .
s(x, y, z) = f(0,0) (x, y) · δ(d(0,0) (x, y) − 1/z)
(9)
where d(0,0) (x, y) is the depth map from the origin. The scene information is a stack of 2D textures at depth Z visible form the origin. Neither the scene information nor the depth map are known.
2.2
Reconstruction Filter
Taking 3D Fourier transform in both sides in equations (5) and (6) yields respectively G(u, v, w; Rmax ) = H(u, v, w)S(u, v, w),
(10)
F (u, v, w; R) = B(u, v, w)S(u, v, w).
(11)
where capital-letter functions denote the Fourier transform of the corresponding small-letter functions and (u, v, w) represents the frequency domain counterparts of the spatial variables (x, y, z). Both spectra H(u, v, w) and B(u, v, w) are calculated as real functions. By eliminating the unknown S from the above equations, we derive the following relationship if H(u, v, w) is not zero. F (u, v, w; R) =
B(u, v, w) · G(u, v, w; Rmax ). H(u, v, w)
(12)
To avoid error amplification due to division by the value of H(u, v, w) close to zero, we employ Wiener type regularization to stably reconstruct the desired 3D image B(u, v, w)H(u, v, w) Fˆ (u, v, w; R) = · G(u, v, w; Rmax ), (13) H 2 (u, v, w) + γ where γ is a positive constant. By taking 3D inverse Fourier transform of Fˆ , we can finally obtain the desired refocused 3D image fˆ. This suggests that image-based refocusing can be achieved simply by this linear and spatially invariant filtering. The coefficient of G is the reconstruction filter. It consists of known PSFs independent of the scene; hence no scene analysis such as depth estimation is required in our method.
3 Experimental Results

3.1 Rendering Algorithm
Let us first give the algorithm of our method when applied to a discrete dataset of multiview images:

1. synthesize the 3D image g based on the discrete version of eq. (4);
2. compute the reconstruction filter with the desired aperture radius R;
3. take the 3D Fourier transform of both g and the filter, and compute F̂ by eq. (13);
4. take the inverse 3D Fourier transform of F̂ to reconstruct the refocused images f̂.

In step 1, assuming L focal planes discretely located at Zl (l = 1, ..., L) from near to far, we synthesize L images focused at these depths. The 3D image g is formed as the set of synthesized images with parameter zl, the inverse of Zl. Each depth Zl is determined as

Zl = ( 1/Z1 − (1/Z1 − 1/ZL) · (l − 1)/(L − 1) )^(−1),   (14)

such that the disparities i/Zl and j/Zl (i.e., izl and jzl) change with an equal interval. The number L should be chosen so that this interval is less than 1 pixel, as suggested by plenoptic sampling theory. The depth range over which we arrange the focal planes was determined so that it satisfies both conditions 1/Z1 − 1/ZL = 2(1/Zmin − 1/Zmax) and (1/Z1 + 1/ZL)/2 = (1/Zmin + 1/Zmax)/2, which impose that the range be twice as wide as that of the scene in the inverse dimension z, with the centers of both ranges coinciding. In step 2, letting the focal-plane range 1/Z1 − 1/ZL be z̄, we set the support range in z for both PSFs h and b to −z̄/2 ≤ z ≤ z̄/2 (see Fig. 7(a)). That is, the range in z should be the same z̄ for all 3D data g, f, h and b, which ensures the DC energy conservation G(0, 0, 0) = F(0, 0, 0) = H(0, 0, 0) = B(0, 0, 0). A small sketch of this plane placement is given below.
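The following sketch (ours, with a hypothetical function name) evaluates eq. (14) together with the two depth-range conditions above, taking only the scene depth range Zmin–Zmax and the number of planes L as input.

```python
import numpy as np

def focal_depths(z_min, z_max, L):
    """Focal depths Zl, l = 1..L, spaced uniformly in 1/Z over a range twice
    as wide as the scene range in the inverse dimension (eq. (14))."""
    center = 0.5 * (1.0 / z_min + 1.0 / z_max)       # common center in 1/Z
    half = 1.0 / z_min - 1.0 / z_max                 # half-width of the plane range
    inv_z1, inv_zL = center + half, center - half    # 1/Z1 (near), 1/ZL (far)
    l = np.arange(L)
    return 1.0 / (inv_z1 - (inv_z1 - inv_zL) * l / (L - 1))

# e.g. focal_depths(30.0, 50.0, 16) for a scene depth range of 30-50 and L = 16
```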
3.2 Results for a Synthetic Scene
The proposed method was tested using synthetic 9×9 multiview images (640×480 pixels each) of a synthetic scene. The scene structure and camera array setting are illustrated as a top view (from the Y axis) in Fig. 2. The scene consists of a slanted wall with a wood texture and a ball with a checker pattern, which lie in the depth range of 30–50. The cameras are regularly arranged on a 2D lattice with an equal spacing of 1, and the horizontal field of view is set to 50 degrees for all the cameras. Figure 3 shows the reconstructed images that were refocused by our method and those by the conventional SAR method. In this simulation, we fixed R at 2 and used the following parameters: L = 16, Rmax = 4 and γ = 0.1. In these results, from the top, the focal depths were 33.9, 37.4, 41.8, 50.6 and 58.9, which
Fig. 2. Synthetic scene and camera array setting in our simulation
corresponds to Z4, Z6, Z8, Z11, and Z13. In the results of our method, blur effects were rendered without artifacts in all cases, whereas in those of the conventional method, blurred regions appear ghosted in the latter two cases; the artifacts are especially visible in the checker-pattern regions. Rendering of depth-of-field effects is demonstrated in Fig. 4, where the focal depth was fixed at Z11 = 50.6 and the aperture radius R was varied from 0.5 to 2.5 in increments of 0.5. In the case of R = 0.5, the image refocused by the conventional method (the top image in Fig. 4(b)) was in focus at every depth and the desired blur effects were not correctly rendered. This is another drawback of the conventional method. In contrast, our method allows rendering of smaller blur effects, as shown in the top image of Fig. 4(a). In the other cases, it can be clearly seen that our method produced alias-free blur effects with high quality, while the conventional method introduced aliasing artifacts in the blurred regions.

3.3 Results for a Real Scene
We used 81 real images captured with a 9×9 camera array, provided by "The Multiview Image Database," courtesy of the University of Tsukuba, Japan. The image resolution is 320×240 pixels (downsampled from the original resolution of 640×480) and the distance between cameras is 20 mm in both the horizontal and vertical directions. The scene contains an object (a "Santa Claus doll") in the depth range of 590–800 mm, which is the target depth range in this experiment. We used the following parameters: L = 16, Rmax = 4×20 mm and γ = 0.1. One problem arises from the narrow field of view (27.4 degrees) of the cameras used: it is too narrow to produce refocused images with a sufficient field of view. To overcome this problem, we applied a Neumann expansion to all the multiview images, virtually filling the pixel values outside the field of view with the pixel value at the nearest edge. The reconstructed images are shown in Figures 5 and 6. The former demonstrates the effects of changing the focal depth, the latter those of depth of field. The results show the advantage of our method over the conventional one; our method works well, significantly suppressing the aliasing artifacts that were visible
(a) our method
(b) the conventional method
Fig. 3. Refocusing results when the focal depth was varied from near to far (from the top image) with fixed aperture radius at R = 2
(a) our method
(b) the conventional method
Fig. 4. Refocusing results demonstrating depth-of-field effects when the aperture radius R is varied from 0.5 to 2.5 (from the top image) with the focal depth fixed
in the images reconstructed by the conventional method. In the images produced by our method, some vertical scratch-like lines are slightly visible. This is due to the effect of the Neumann expansion, not to aliasing artifacts. Note that noise amplification, which is generally a critical issue, is not as severe as in a general inverse filtering method, because the noise in the multiview images is averaged out and reduced in the 3D image g, which is a well-known property of synthetic aperture methods.
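The Neumann expansion used above amounts to edge-replication padding of each input image; a one-function sketch (ours, assuming H×W×3 color images):

```python
import numpy as np

def neumann_expand(image, pad):
    # Replicate the nearest edge pixel outside the original field of view.
    return np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
```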
4
Discussion
We have shown that we can effectively create alias-suppressed refocused images from the aliased ones generated by the conventional method. This conversion was achieved by a spatially invariant filtering, without estimating any scene information, even for undersampled light field data. This section explains the reason through a frequency-domain analysis.

Figure 7(b) shows the reason schematically. We can consider that not only h but also b is composed of a set of delta functions (lines). The Fourier transform of a delta line is the plane perpendicular to it; hence the Fourier transforms of both PSFs, H and B, are composed of sets of planes that are rotated about the w axis and slanted according to the corresponding camera positions (i, j). As a result, H and B become double-cone-shaped functions, as illustrated in Fig. 7(b), whose radial sizes are proportional to Rmax and R, respectively. We have set Rmax ≥ R; therefore, it ideally holds in most regions of the frequency domain that B is zero wherever H is zero, which makes B/H stable. Note that, due to the limited support range z̄, each plane in H becomes "thick" because of the multiplication with a sinc function, resulting in more non-zero components. This property helps make B/H more stable. Mathematically precise arguments on this property are left for future work.

An alternative interpretation of our method is that, by generating G, it extracts all the information needed for reconstructing F. It is not necessary to recover the whole 3D scene S, since our goal is only to reconstruct F. The underlying idea is the same as in computerized tomography, where the frequency components of the desired image are extracted from projections taken from many directions. In our method, by generating g from many multiview images at different positions, we extract the frequency components needed to reconstruct the desired 3D image f.

Figure 8 shows an example of the numerically computed frequency characteristics H and B as a series of cross sections at different frequencies w. They were used for reconstructing the images in Fig. 3(a). Whiter regions indicate higher amplitude (black represents zero). The whiter regions in B indicate the frequency components needed for reconstructing F, and the white regions in H are the frequency components that can be extracted from the 3D scene. A comparison of the two shows that the latter regions almost cover the former; hence stable reconstruction is possible.
(a) our method
(b) the conventional method
Fig. 5. Results for real multiview images. The left is focused on the near region; the right on the far region. The expanded images are also shown for comparison. The aperture radius was set to R = 1.5×20 mm.
To extract the necessary information, we have to set Rmax larger than R. It is true that the larger we set the aperture, the more stable and accurate we could obtain the scene information; however the larger aperture setting picks up the occluded regions much more often. This causes undesirable transparent effects at occluded boundaries, which are observed in the top image in fig. 5 (a) and
(a) our method
(b) the conventional method
Fig. 6. Results for real multiview images when the aperture radius R is varied at 0.5, 1.0, 2.0 and 3.0 times of the distance between cameras. The focal depth was fixed at 590.
(a) The spatial domain
(b) The frequency domain
Fig. 7. The spatial and frequency analysis in our method. Image formation model in the x-z spatial domain and our reconstruction method in the u-w frequency domain.
Fig. 8. Computed frequency characteristic of point spread functions
more visible in the top image of Fig. 6(a). We cannot completely avoid this effect, because we do not estimate depth in our method. This is a disadvantage of our approach. Another disadvantage is that the method requires much more computation than the conventional SAR method, mainly due to the 3D filtering process. One limitation of our approach is that the virtual view position must be on the XY plane. This is because, otherwise, the shift amounts in eq. (4) depend on the image coordinates; hence we cannot use the Fourier transform in the image model. To handle spatially varying shift amounts, we will have to consider using a suitable orthogonal transformation such as the wavelet transform.
5
Conclusions and Future Work
What we have shown in this paper can be summarized as follows: by arranging an array of cameras or microlenses such that the reconstruction filter B/H is stable, alias-free image refocusing is possible through spatially invariant filtering, without analyzing any scene information, even if the acquired light field data is undersampled in the sense of plenoptic sampling. We believe that our method can be applied to data acquired with an integral camera and produce synthetic refocused images of higher quality. The underlying methodology is similar to the idea used in computerized tomography; as future work, we could adopt sophisticated methods developed in that field, taking regularization into account, to further enhance the quality even for sparser light field data.
References

1. Stewart, J., Yu, J., Gortler, S.J., McMillan, L.: A new reconstruction filter for undersampled light fields. In: Eurographics Symposium on Rendering 2003, EGSR 2003, pp. 150–156 (2003)
2. Ng, R., Levoy, M., Bredif, M., Duval, G., Horowitz, M., Hanrahan, P.: Light Field Photography with a Hand-held Plenoptic Camera. Stanford Tech Report CTSR 2005-02 (2005)
3. Ng, R.: Fourier slice photography. SIGGRAPH 2005, 735–744 (2005)
4. Isaksen, A., McMillan, L., Gortler, S.J.: Dynamically reparameterized light fields. SIGGRAPH 2000, 297–306 (2000)
5. Haeberli, P.E., Akeley, K.: The accumulation buffer: Hardware support for high-quality rendering. SIGGRAPH 1990, 309–318 (1990)
6. Levoy, M., Hanrahan, P.: Light field rendering. SIGGRAPH 1996, 31–42 (1996)
7. Chai, J.-X., Tong, X., Chan, S.-C., Shum, H.-Y.: Plenoptic sampling. SIGGRAPH 2000, 307–318 (2000)
8. Georgiev, T., Zheng, K.C., Curless, B., Salesin, D., Nayar, S., Intwala, C.: Spatio-Angular Resolution Tradeoff in Integral Photography. In: Eurographics Symposium on Rendering, EGSR 2006, pp. 263–272 (2006)
9. Castleman, K.R.: Digital Image Processing, pp. 566–569. Prentice Hall, Englewood Cliffs (1996)
10. Sarder, P., Nehorai, A.: Deconvolution methods for 3-D fluorescence microscopy images. IEEE Signal Processing Magazine 23(3), 32–45 (2006)
Online Multiple View Computation for Autostereoscopic Display Vincent Nozick and Hideo Saito Graduate School of Science and Technology, Keio University, Japan {nozick,saito}@ozawa.ics.keio.ac.jp
Abstract. This paper presents a new online Video-Based Rendering method that creates simultaneously multiple views at every new frame. Our system is especially designed for communication between mobile phones using autostereoscopic display and computers. The new views are computed from 4 webcams connected to a computer and are compressed in order to be transfered to the mobile phone. Thanks to GPU programming, our method provides up to 16 images of the scene in realtime. The use of both GPU and CPU makes our method work on only one consumer grade computer. Keywords: video-based rendering, autostereoscopic, view interpolation.
1
Introduction
In recent years, steoroscopic technology has advanced from stereoscopic to autostereoscopic displays. This latter family does not involve any glasses or specific device for the user. Such a screen displays several images of the same scene and provides to the user an adequate stereoscopic view according to his relative position from the screen. This ability makes autostereoscopic displays very convenient to use, especially for multi-user applications. Autostereoscopic displays can have various applications like 3D TV, games, 3D teleconference or medical applications. In this paper, we will focus on the communication between mobile phones using 3D display and computers using cameras. Indeed, mobile TV is now a commercial reality and the next mobile phone evolution will include 3D display. Autostereoscopic displays require multiple views of the scene at every frame. Stereoscopic animations or menus can easily be achieved by computer graphics methods. 3D content can also be generated from videos of real-scenes using several cameras. However this approach is not suited for mobil e phone applications due to their restricted bandwidth and low capacity to use complex decompression algorithms. Furthermore, systems using more than 10 cameras will probably not be attractive for commercial applications. Video-Based Rendering (VBR) methods can provide new views of a scene from a restricted set of videos and thus decrease the number of required cameras. Nevertheless few VBR methods provide online rendering and most of these methods are not suited for D. Mery and L. Rueda (Eds.): PSIVT 2007, LNCS 4872, pp. 399–412, 2007. c Springer-Verlag Berlin Heidelberg 2007
multiple image rendering. Indeed, autostereoscopic applications require several new images simultaneously, and most VBR methods would have to compute these new views independently, decreasing the frame rate. This article presents a new VBR algorithm that can create multiple new views online and simultaneously using GPU programming. This method requires 4 webcams connected to a consumer grade computer and can provide up to 16 images in real time, including compression and network transfer. In the following parts, we introduce recent autostereoscopic devices and propose a survey of the latest online VBR methods. Then we explain the plane sweep algorithm and our contribution. Finally, we detail our implementation and present experimental results.
2
Autostereoscopic Displays
A stereoscopic display requires at least two views of the same scene to create depth perception. During more than a century, the three most popular methods have been anaglyph, polarization and shutter methods [1]. These three techniques involve the use of adapted glasses. Recent research on 3d display technologies made stereoscopic system advance from stereoscopic to autostereoscopic displays. This latter category presents tree major advantages. First, users do not need any special devices like glasses. Second, these displays can provide more than two views simultaneously. The user receives an adequate pair of stereoscopic images according to his position from the screen. Finally, these displays can support multi-user applications. Currently, commercial autostereoscopic displays are available from several famous companies. Spatial multiplex system is a common method to provide stereoscopic images. A lenticular sheet is laid on the screen such every lens covers a group of pixels. According to the user’s position, the lens will show the adequate pixel. The major part of commercial autostereoscopic displays requires around 10 views. Some recent devices can display up to 60 views. More informations about autostereoscopic displays can be found on [2]. The main purpose of stereoscopic display is to increase the user’s immersion sensation. Stereoscopic display applications includes scientific visualization, medical imaging, telepresence or gaming. In this paper, we will focus on mobile phone applications. Indeed, some companies already propose 3D autostereoscopic display for mobile phone designed for display 3D menu, images and videos. This paper especially focuses on communications between a mobile phone and a computer for real scenes. Such situation occurs when someone call his family or company. The computer side provides multiple images of its environment to the mobile phone user. Harrold and Woodgate [3] describe such device and present a method to transfer and render computer graphics stereoscopic animations. However stereoscopic video of real scenes remains a problem if the display supports more than 6 or 7 views. Indeed, using one camera per view involves a huge quantity of data and hence storage and transfer issues.
The main restriction of our application concerns the bandwidth limitation during the video stream transfer from the computer to the mobile phone. However, mobile phone technology can easily support standard online video decompression. Moreover, the bandwidth issue is lessened by the low resolution of the mobile phone screen. Finally, the system providing the views should work on a consumer grade computer to be attractive for commercial applications. Using 10 or more webcams could provide enough views for a stereoscopic display, but even at a low resolution, real-time video-stream acquisition is a serious issue for a single computer. In the following parts, we propose an online video-based rendering method that provides multiple images of the same scene from a small set of webcams using only one consumer grade computer.
3
Online Video-Based Rendering
Given a set of videos taken from video cameras, Video-Based Rendering methods provide new views of the scene from new viewpoints. Hence these methods are well suited to reduce the number of input cameras for autostereoscopic display systems. VBR methods are divided into two families: off-line and online methods. Off-line methods focus on the visual quality rather than on the computation time. They first calibrate the cameras and record the video streams. Then they process these videos to extract scene information. The rendering step can start only when the data computation is completed. Most of the off-line methods provide real-time rendering but are not suited for live rendering. On the other hand, online methods are fast enough to record, compute and render a new view in real time; however, few VBR methods reach online rendering. Finally, autostereoscopic display applications require not only online VBR methods but also methods that can create several new views of the scene simultaneously for every frame. The most popular online VBR method is the Visual Hulls algorithm. This method extracts the silhouette of the main object of the scene in every input image. The shape of this object is then approximated by the intersection of the projected silhouettes. There exist several online implementations of the Visual Hulls, described in [5]. The most accurate online Visual Hulls method seems to be the Image-Based Visual Hulls presented by Matusik et al. [6]. This method creates new views in real time from 4 cameras. Each camera is controlled by one computer and an additional computer creates the new views. The methods proposed by Li et al. [7,8] may run on a single computer, but their ability to compute several images simultaneously remains to be demonstrated. Furthermore, the Visual Hulls method is suited for an "all around" camera configuration but not for a dense aligned camera configuration. Finally, the Visual Hulls algorithm requires a background extraction, thus only the main "objects" can be rendered. Another possibility to reach online rendering is to use a distributed Light Field as proposed by Yang et al. [9]. They present a 64-camera device based on a client-server scheme. The cameras are clustered into groups controlled by several
computers. These computers are connected to a main server and transfer only the image fragments needed to compute the new view requested. This method provides real-time rendering but requires at least 8 computers for 64 cameras and additional hardware. Thus this method is incompatible with a commercial use of stereoscopic applications. Finally, some Plane-Sweep methods reach online rendering using graphic hardware (GPU). The Plane-Sweep algorithm introduced by Collins [10] was adapted to online rendering by Yang et al. [11]. They compute new views in real-time from 5 cameras using 4 computers. Geys et al. [12] also use a Plane-Sweep approach to find out the scene geometry and render new views in real-time from 3 cameras and one computer. The Plane-Sweep algorithm is effective when the input cameras are close to each other and hence is highly capable with an aligned camera configuration. Since our method follows a Plane-Sweep approach, we will expose the basic Plane-Sweep algorithm in the next section. Then we will detail our method for both single and multiple new views creation.
4
Single View Computation
In this section, we present the Plane-Sweep algorithm and the contributions of [11,12]. Then we detail our new scoring method.

4.1 Plane-Sweep Algorithm Overview
The Plane-Sweep algorithm provides new views of a scene from a set of calibrated images. Considering a scene where objects are exclusively diffuse, the user should place the virtual camera camx around the real video cameras and define a near plane and a f ar plane such that every object of the scene lies between these two planes. Then, the space between near and f ar planes is divided by parallel planes Di as depicted in Figure 1. Consider a visible object of the scene lying on one of these planes Di at a point p. This point will be seen by every input camera with the same color (i.e. the object color). Consider now another point p lying on a plane but not on the surface of a visible object. This point will probably not be seen by the input cameras with the same color. Figure 1 illustrates these two configurations. Therefore, the Plane-Sweep algorithm is based on the following assumption : a point lying a plane Di whose projection on every input camera provides a similar color potentially corresponds to the surface of an object. During the new view creation process, every plane Di is computed in a back to front order. Each pixel p of a plane Di is projected onto the input images. Then, a score and a representative color are computed according to the matching of the colors found. A good score corresponds to similar colors. This process is illustrated on Figure 2. Then, the computed scores and colors are projected on the virtual camera camx . The virtual view is hence updated in a z-buffer style : the color and score (depth in a z-buffer) of a pixel of this virtual image is updated only if the projected point p provides a better score than the current score. This
Fig. 1. Plane-Sweep : guiding principle
process is depicted in Figure 2. Then the next plane Di+1 is processed. The final image is obtained once every plane has been computed.

4.2 Scoring Stage
Yang et al. [11] propose an implementation of the Plane-Sweep algorithm using register combiners. The system chooses a reference camera that is closest to camx . During the process of a plane Di , each point p of this plane is projected on both the reference image and the other input images. Then, pair by pair, the color found in the reference image is compared to the color found in the other images using a SSD (Sum of Squared Difference). The final score of p is the sum of these SSD. This method provides real-time and online rendering using 5 cameras and 4 computers, however the input cameras have to be close to each other and the navigation of the virtual camera should lie between the viewpoints of the input cameras, otherwise the reference camera may not be representative of camx . Lastly, moving the virtual camera may change the reference camera and induce discontinuities in the computed video during this change. Geys et al.’s method [12] begins with a background extraction. The background geometry is supposed to be static. This assumption restricts the application of the Plane-Sweep algorithm to the foreground part. The scoring method used is similar to the method proposed by Yang et al. but they only compute
Fig. 2. Left : Every point of the current plane is projected on the input images. A score and a color are computed for these points according to the matching of the colors found. Right : The computed scores and colors are projected on the virtual camera.
a depth map. Then, an energy minimization method based on a graph cut algorithm cleans up the depth map. A triangle mesh is extracted from the new depth map and view dependent texture mapping is used to create the new view. This method provides real-time and online rendering using 3 cameras and only one computer. However, the background geometry must be static. Our main contribution to the Plane-Sweep algorithm concerns the score computation. Indeed, this operation is a crucial step since both visual results and time computation depend on it. Previous methods computes scores by comparing input images with the reference image. We propose a method that avoids the use of such reference image image that may not be representative of the virtual view. Our method also use every input image together rather than to compute images by pair. Since the scoring stage is performed by the graphic hardware, only simple instructions are supported. Thus a suitable solution is to use variance and average tools. During the process of a plane Di , each point p of Di is projected on every input image. The projection of p on each input image j provides a color cj . The score of p is then set as the variance of the cj . Thus similar colors cj will provide a small variance which corresponds to a high score. On the contrary, mismatching colors will provide a high variance corresponding to a low score. In our method, the final color of p is set as the average color of the cj . Indeed, the average of similar colors is very representative of the colors set. The average color computed from mismatching colors will not be a valid color for our method however, since these colors also provide a low score, this average color will very likely not be selected for the virtual image computation.
This plane sweep implementation can be summarized as follows:

◦ reset the scores of the virtual camera
◦ for each plane Di from far to near
  • for each point (fragment) p of Di
    → project p on the n input images; cj is the color obtained from this projection on the j-th input image
    → compute the color of p: color_p = (1/n) Σ_{j=1...n} cj
    → compute the score of p: score_p = Σ_{j=1...n} (cj − color_p)²
  • project all the Di's scores and colors on the virtual camera
  • for each pixel q of the virtual camera
    → if the projected score is better than the current one, then update the score and the color of q
◦ display the computed image

This method does not require any reference image and all input images are used together to compute the new view. The visual quality of the computed image is thus noticeably increased. Moreover, this method avoids the discontinuities that could appear in the virtual video when the virtual camera moves and changes its reference camera. Finally, this method is not limited to foreground objects.
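For illustration only, the scoring and z-buffer-style update above can be sketched on the CPU with NumPy as follows. The actual method runs in fragment shaders on the GPU, and the projection of each plane into the input images is not shown; the sketch assumes the warped images are already available, so the array layout and function name are our own.

```python
import numpy as np

def plane_sweep_single_view(warped):
    """Variance/average scoring for one virtual view.

    warped: array of shape (n_planes, n_cams, H, W, 3) holding, for every
    plane Di, the n input images projected onto that plane and then into the
    virtual view (the projective warping itself is not shown here).
    Returns the synthesized (H, W, 3) image."""
    n_planes, n_cams, h, w, _ = warped.shape
    best_score = np.full((h, w), np.inf)            # lower variance = better score
    out = np.zeros((h, w, 3))
    for i in range(n_planes):                       # back-to-front plane order
        colors = warped[i]                          # (n_cams, H, W, 3)
        mean = colors.mean(axis=0)                  # representative color of p
        score = ((colors - mean) ** 2).sum(axis=(0, 3))   # summed squared deviation
        better = score < best_score                 # z-buffer style update
        best_score[better] = score[better]
        out[better] = mean[better]
    return out
```

The representative color is the per-camera mean and the score is the summed squared deviation from it, i.e., the variance criterion described above.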
5
Multiple View Computation
A basic approach to render multiple views would be to compute every virtual view independently. However, most online VBR methods already fully use the available computer capability to reach real-time rendering, thus we can hardly expect real-time rendering of multiple views without any optimization. The Plane-Sweep algorithm is well suited for such an optimization thanks to its space decomposition using planes. Indeed, the scores and colors computed on every plane represent local information about the scene. This score and color computation, which is a central task in the Plane-Sweep algorithm, can be shared among all virtual views and hence provides a substantial gain in computation time. Therefore, our single view Plane-Sweep method can be modified into a k + 1 passes algorithm, where k is the number of virtual cameras. For every plane Di, the score and color of every point are computed in a first pass. This pass is absolutely independent of the number of virtual views to create. The information computed during this pass is then projected on every virtual view in k passes. During these last k passes, color and score information is updated on every successive virtual camera. The k + 1 passes are repeated until every plane Di is computed. Hence our previous method can be modified as follows:

◦ reset the scores and colors of the virtual cameras' memory Vj (j ∈ {1, ..., k})
◦ for each plane Di from far to near
  • for each point (fragment) p of Di
    → compute a score and a color
    → store the color and score in an array T(p) = (color, score)
  • for each virtual camera camj
    → for each point (fragment) p of Di
      · find the projection qj,p of p on camj ; Vj(qj,p) contains the previous color and score information of camj at the position qj,p
      · if the score in T(p) is better than the score stored in Vj(qj,p), then Vj(qj,p) = T(p)
◦ convert each Vj into images

As in the single view method, the score and color are computed only once for every point of each plane. Since the projection of this information on every virtual view differs, the final views will be different. These information projections are very fast compared to the score and color computation. Hence sharing the score and color computation speeds up the application and avoids redundant processing without any loss of visual quality.
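In the same schematic style as before, the shared k + 1-pass structure can be sketched as below: the per-plane score and color are computed once (first pass) and then redistributed to every virtual camera (k further passes). The per-camera pixel mappings stand in for the projective texture mapping of the real implementation and are an assumption of ours.

```python
import numpy as np

def plane_sweep_multi_view(warped, mappings):
    """warped: (n_planes, n_cams, H, W, 3) as in the single-view sketch.
    mappings[j][i]: integer index arrays (ys, xs), each of shape (H, W), telling
    where virtual camera j reads the scores/colors of plane Di."""
    n_planes, n_cams, h, w, _ = warped.shape
    k = len(mappings)
    best = [np.full((h, w), np.inf) for _ in range(k)]
    views = [np.zeros((h, w, 3)) for _ in range(k)]
    for i in range(n_planes):
        colors = warped[i]
        mean = colors.mean(axis=0)                        # pass 1: shared color
        score = ((colors - mean) ** 2).sum(axis=(0, 3))   # pass 1: shared score
        for j in range(k):                                # passes 2 .. k+1
            ys, xs = mappings[j][i]
            s, c = score[ys, xs], mean[ys, xs]
            better = s < best[j]
            best[j][better] = s[better]
            views[j][better] = c[better]
    return views
```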
6
Implementation
Since our webcams have a fixed focal length, we can compute accurately their internal parameters using Zhang calibration [14]. Then we can freely move them for our experimentations and only a single view of a calibration chessboard is required to perform a full calibration. Color calibration can be performed by the method proposed by Magnor [5, page 23]. This method is effective only for small corrections. We usually set the f ar plane as the calibration marker plane. The user should then determine the depth of the scene to define the near plane. These two planes can also be set automatically using a precise stereo method as described in Geys et al. [12]. We use OpenGL for the rendering part. For each new view, we perform a first off-screen pass for every input image to correct the radial distortion and the color using Frame Buffer Objects. Implementation indications can be found on [16]. During the score and color computation, each plane Di is drawn as a textured GL QUADS. The scoring stage is performed thanks to fragment shaders. First, Di ’s points (fragments) are projected onto the input images using projective texture mapping. The texture coordinates are computed from the projection matrices of each input camera. Multi-texturing provides an access to every texture simultaneously during the scoring stage. Then, this fragment program computes each score and color using the algorithm described in section 4. For the single view method, the scores are stored in the gl FragDepth and the colors in the gl FragColor. Then we let OpenGL select the best scores with the z-test and update the color in the frame buffer. The use of the the z-test for the multiple view method would imply that every new view is rendered on the screen. Thus the screen resolution would limit
the number of new views that can be computed. We propose a method where every process is done off-screen using Frame Buffer Objects. RGBA textures are assigned to every virtual view and an additional texture is used for the color and score computation. The color is stored in the RGB components and the score in the alpha component. The virtual cameras' textures replace the frame buffer used in the single view method. As illustrated in Figure 3 (a), the score and color computation of a plane does not differ from the single view method except that the rendering is performed on a texture. Naturally, the rendering has to be associated with a projection matrix. We select the central virtual camera as the reference camera for this projection (Figure 3 (b)). Then, every virtual camera involves an additional rendering pass. During such a pass, the score and color texture is projected onto the current plane using the reference camera projection matrix (Figure 3 (c)). The textured plane is then projected onto the virtual camera (Figure 3 (d)) using fragment shaders. The texture associated with the current virtual camera is used both for rendering and for reading the last selected scores and colors. The fragment program decides whether to update a fragment's information or to keep the current texture value according to the last selected scores, as described in Section 5. After the last plane has been computed, the virtual cameras' textures can be extracted as images of the virtual views. The computation time depends linearly on the number of planes used, on the number of virtual cameras and on the output image resolution. The number of input cameras has a repercussion both on the image transfer from the main memory to the GPU and on the score computation performance. Most of the computation is done by the graphics card, hence the CPU is free for the video stream acquisition and for the compression and transfer of the virtual views.
7
Compression and Image Transfer
Since 3d display and 3d video broadcasting services became feasible, 3d video data compression has been an active research field. Indeed, without any compression, the transmission bandwidth linearly increases with the number of views and becomes a severe limitation for the display frame-rate. Nevertheless, stereoscopic views represent the same scene and contain a huge amount of redundancies. Thus the basic concept of 3d video compression is to remove these redundancies among the views. There exist several stereoscopic compression methods. For more informations, the reader can refer to Kalva et al. [17]. Since we want our system to be used with mobile phones, the problem is a bit different. The screen resolution is lower than for standard 3d displays but the available bandwidth is also restricted by the mobile phone communication system. Furthermore, the compression part achieved by the computer should be fast and should not require too many CPU capabilities. In our tests, we chose a MPEG2 compression. Indeed, the views to be transfered consist of the input images and the virtual images. These views can be sorted by position (from left to right for example) such they become suited to be compressed with a standard video compression method. Such compression is performed by the CPU and
Fig. 3. (a) : The points of the current plane are projected on every input camera to read the corresponding color. (b) : The colors found are used to compute a score and a color during a rendering process on the reference camera . (c) and (d) : For every virtual camera, the scores and colors are projected on the current plane using the reference camera projection matrix (c). The scores and colors are projected from the current plane to the virtual camera (d).
hence is compatible with real-time computation alongside our VBR method, which mainly uses the GPU. In addition, MPEG2 decompression is not a problem for mobile phone hardware.
Fig. 4. Cameras configuration
The compressed images are then transferred to the user. Since we consider that the data transfer should be handled by the mobile phone operator, we simply tested our compressed video transfer over a UDP network protocol with another PC. There exist more powerful tools for such video streaming, but this is not the main purpose of our article.
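As a toy illustration of this transfer test, sending an already-encoded frame over UDP can be done with the standard socket module; the chunk size and function name below are arbitrary choices of ours, and the MPEG2 encoding itself (performed by an external encoder) is not shown.

```python
import socket

def send_encoded_frame(data: bytes, host: str, port: int, chunk: int = 1400):
    """Send one already-encoded (e.g. MPEG2) frame over UDP in small chunks."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for offset in range(0, len(data), chunk):
            sock.sendto(data[offset:offset + chunk], (host, port))
    finally:
        sock.close()
```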
8
Results
We have implemented our system on a PC with an Intel Core Duo 1.86 GHz and an nVidia GeForce 7900 GTX. The video acquisition is performed by 4 USB Logitech fusion webcams connected to the computer via a USB hub. With a 320×240 resolution, the acquisition frame rate reaches 15 frames per second. Our camera configuration is depicted in Figure 4. As explained in Section 6, the computation time depends, among other factors, on the number of planes, on the number of virtual cameras and on the virtual view resolution. In our tests, we set the output image resolution to 320×240. Since our system is designed for stereoscopic display, the base-line between the extreme cameras is restricted. Under such conditions, our tests showed that with fewer than 10 planes the visual results become unsatisfactory, and that using more than 60 planes does not improve the visual result. Hence, we used 60 planes in our experiments to ensure an optimal visual quality. The number of virtual views depends on the application. In our case, we tested our system with 6, 9, 12, 15 and 18 virtual cameras set between adjacent input cameras. The speed results obtained with this configuration are shown in Table 1. This computation includes compression and transfer of both virtual views and input images. Table 1 also includes the frame rate of the classic method, which computes every virtual view independently.
Our tests indicate that our method provides especially good results for a large number of virtual views. Compared to the classic method, our method is more than twice as fast for 6 virtual views and four times as fast for 18 virtual views, without any loss of quality. Figure 5 depicts a sample result for a 12 virtual views configuration. Input images are displayed on the diagonal. The visual quality of a virtual view varies with its distance from the input cameras and decreases for a virtual view located exactly between two input cameras. However, an autostereoscopic display provides 2 views per user (right and left eyes) and the fusion of the two images decreases the impact of these imperfections. As shown in Figure 5, the stereoscopic pairs (parallel-eyed viewing) are very comfortable. In addition, the base-line between the extreme right and left views is perfectly suited to the autostereoscopic display application. In our tests, we compressed and sent the images to a client computer. The compression is done by an MPEG2 method and reaches a 1:41 compression rate. Thus the transferred data is highly compressed and well suited to be decompressed by mobile phones.
Fig. 5. Sample result of 16 views: 12 virtual views and 4 input images on the diagonal. These images have been computed using 60 planes at 8.7 frames per second. Parallel-eyed viewing provides stereoscopic images.
Table 1. Frame rate and number of virtual views

number of       number of     frame rate            classic method
virtual views   total views   (frames per second)   (frames per second)
6               10            11.2                  3.8
9               13            10                    2.9
12              16            8.7                   2.4
15              19            7.6                   1.9
18              22            7                     1.6
9
Conclusion
This article presents a live video-based rendering method that provides multiple simultaneous views of a scene from a small set of webcams. We propose a new scoring method that provides images of good visual quality in real time thanks to fragment shaders. Our multiple view method shares the 3D data computation among all virtual views and speeds up the computation more than four times compared to the single view method for the same number of new views. The rendering is online and provides high quality stereoscopic views. This method is especially designed for autostereoscopic display on mobile phones communicating with a computer. The use of only one computer and a few webcams makes this system low cost and well suited for commercial applications, particularly for the latest mobile phone autostereoscopic displays that require more than 15 images per frame. To our knowledge, no other VBR method provides equivalent results with such a configuration. Concerning other extensions of this method, we believe that our multiple-view system can easily be adapted for multi-user stereoscopic teleconference applications. The system would work as a server that provides stereoscopic views for several clients from the desired viewpoints.
Acknowledgment

This work has been supported by "Foundation of Technology Supporting the Creation of Digital Media Contents" project (CREST, JST), Japan.
References

1. Okoshi, T.: Three-Dimensional Imaging Techniques. Academic Press, San Diego (1977)
2. Dodgson, N.A.: Autostereoscopic 3D Displays. Computer 38(8), 31–36 (2005)
3. Harrold, J., Woodgate, G.: Autostereoscopic display technology for mobile 3DTV applications. In: Proc. of the SPIE, vol. 6490 (2007)
4. Goldlucke, B., Magnor, M.A., Wilburn, B.: Hardware accelerated Dynamic Light Field Rendering. Modelling and Visualization VMV 2002, Germany, 455–462 (2002)
5. Magnor, M.A.: Video-Based Rendering. A K Peters Ltd (2005)
6. Matusik, W., Buehler, C., Raskar, R., Gortler, S.J., McMillan, L.: Image-Based Visual Hulls. ACM SIGGRAPH 2000, 369–374 (2000)
7. Li, M., Magnor, M.A., Seidel, H.P.: Online Accelerated Rendering of Visual Hulls in Real Scenes. In: International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG 2003), pp. 290–297 (2003)
8. Li, M., Magnor, M.A., Seidel, H.P.: Hardware-Accelerated Visual Hull Reconstruction and Rendering. Graphics Interface GI 2003, Canada, 65–71 (2003)
9. Yang, J.C., Everett, M., Buehler, C., McMillan, L.: A real-time distributed light field camera. In: 13th Eurographics Workshop on Rendering, Italy, pp. 77–86 (2002)
10. Collins, R.T.: A Space-Sweep Approach to True Multi-Image Matching. Computer Vision and Pattern Recognition Conf., 358–363 (1996)
11. Yang, R., Welch, G., Bishop, G.: Real-Time Consensus-Based Scene Reconstruction using Commodity Graphics Hardware. Pacific Graphics, 225–234 (2002)
12. Geys, I., De Roeck, S., Van Gool, L.: The Augmented Auditorium: Fast Interpolated and Augmented View Generation. In: European Conference on Visual Media Production, CVMP 2005, pp. 92–101 (2005)
13. Billinghurst, M., Campbell, S., Chinthammit, W., Hendrickson, D., Poupyrev, I., Takahashi, K., Kato, H.: Magic book: Exploring transitions in collaborative AR interfaces. SIGGRAPH 2000 (2000)
14. Zhang, Z.: A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 1330–1334 (2000)
15. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge, UK (2004)
16. Pharr, M., Fernando, R.: GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation. Addison-Wesley Professional, Reading (2005)
17. Kalva, H., Christodoulou, L., Mayron, L., Marques, O., Furht, B.: Challenges and opportunities in video coding for 3D TV. In: IEEE International Conference on Multimedia & Expo (ICME), Canada, pp. 1689–1692 (2006)
Horizontal Human Face Pose Determination Using Pupils and Skin Region Positions Shahrel A. Suandi1,2 , Tie Sing Tai1 , Shuichi Enokida2 , and Toshiaki Ejima2 1
School of Electrical & Electronic Engineering, Universiti Sains Malaysia, Engineering Campus, 14300 Nibong Tebal, Pulau Pinang, Malaysia
[email protected],
[email protected] http://ee.eng.usm.my/eeacad/shahrel/index.html 2 Intelligence Media Laboratory, Department of Artificial Intelligence, Kyushu Institute of Technology, Kawazu 680-4, Iizuka City, Fukuoka Pref., 820-8502 Japan {shahrel,enokida,toshi}@mickey.ai.kyutech.ac.jp http://www.mickey.ai.kyutech.ac.jp/
Abstract. This paper describes a novel real-time technique to determine horizontal human face pose from a video color sequence. The idea underlying this technique is that when head is at an arbitrary pose to the right or left, there are significant relationships between the distance from center of both pupils to head center, and the distance between both pupils. From these distances, we compute a ratio known as ”horizontal ratio”. This ratio, besides being advantageous in the sense that it reduces the dependency on facial features tracking accuracy and robust to noise, is actually the quantity that is used to determine the horizontal human face pose. The technique is simple, computational cheap and requires only information that is usually retrievable from a face and facial feature tracker. Keywords: Multiple view image and processing, tracking and motion, face pose, horizontal ratio, eyes and skin region.
1
Introduction
Face pose determination task has become one of the challenging tasks in face related researches. While being able to detect, locate and track a face and its facial features, one would expect some other factors such as face pose, gender, facial expressions and so on. These are some additional factors that create current computer vision systems to interact more intelligently with environments. As for face pose, it is an important task for some computer vision applications such as gaze tracking [1,2], human-computer interaction [3,4], monitoring driver alertness [5,6], best shot for face recognition [7] and multimedia retrieval [8]. To date, pose determination task can be categorized into active and passive methods. The former requires some special devices like sensors to be equipped on users face while the latter is usually vision-based method, non-intrusive and therefore, more preferable for human-computer interaction applications due to its convenience. D. Mery and L. Rueda (Eds.): PSIVT 2007, LNCS 4872, pp. 413–426, 2007. c Springer-Verlag Berlin Heidelberg 2007
Apart from this, while passive method is gaining more attention from computer vision researchers, it can be generally classified into two primer approaches like being described in [9] – model-based approach and appearance-based approach. Model-based approach usually employs the 3D positions of facial features and recovers the pose by first, making assumptions on the projection, e.g., perspective projection, weak projection; second, deriving a few equations corresponding to the projection, and finally, computing the poses with respect to x, y and z axes by solving the equations derived. These poses are referred to as yaw, pitch and roll, respectively.Refs. [1],[10],[9],[2] are some of the examples employing this approach. On the other hand, appearance-based approach assumes that there is a unique causal-effect relationship between 3D face pose and certain facial image properties, for instance, appearances of facial features from frontal straight forward pose are different from frontal right pose. Refs. [11],[12],[13] are some of the examples from this appearance-based approach using Support Vector Machine(SVM), Boosting and Modular Eigenspaces, respectively. In our work, we employ model-based approach. In contrast to some existing methods [1,10,14] that employ inner eyes corners and/or outer eyes corners for horizontal human face pose determination, we use pupils and face region for this purpose. Pupils are used due to the reason that it may also reveal the direction of gaze which can also be used for monitoring one’s vigilance. These features are acquired from EMoTracker (Eyes and Mouth Tracker), which is developed by Suandi et al. [15]. The ideas underlying our proposed method is that when the head is at an arbitrary pose to the right or left, there are significant relationships between the distance from center of both pupils to head center and the distance between both eyes. Additionally, in order to achieve real-time system, no additional image processing or complex tasks are required even though only limited information (pupils, mouth corners and face region) are retrievable from the image. As a result, we have achieved a simple, computational cheap and real-time system. From the detected pupils and face region, our system determines the horizontal face pose by executing three main tasks shown in Fig. 1.
Fig. 1. Pose determination algorithms flow
Firstly, the head center is computed using nonlinear regression method by taking into account the ratio of pupils distance to pupil to skin edge (from the side where face is facing) distance. Secondly, the distance between both pupils (Dpp ) and distance from pupils center to head center (Dch ) is computed. Subsequently,
“horizontal ratio (H)” is determined using Dch and Dpp . Finally, the horizontal pose is determined from this ratio. For the rest of this paper, we first describe the analysis results of anthropometric statistics in Section 2. Detail explanations on horizontal face pose determination are presented in Section 3. Experimental results are presented in Section 4 and the discussion on the observations are given in Section 5. Finally, the conclusion is described in Section 6.
2
Anthropometric Statistics
Our pose determination task is tailored using a monocular camera with weak projection assumption. It is well known that this approach, although simple, lacks the capability to estimate the depth of an object. Many efforts have been done to estimate the object pose by using monocular camera, for instance, by making assumptions on the 3D object shape and dimensions prior to estimating the pose, or some may require a sophisticated calibration technique. Similarly, we make assumptions on human 3D head shape by analyzing human anthropometrics statistics. Compared to others, ours only considers minimum features, pupils and face region. According to the work reported by Davis and Vaks [3], there are actually significants in the positions of human facial features and head geometry. This is supported by the report on anthropometric statistics which is written by Joseph W. Young [16]. In our work, we refer to this report and analyze the anthropometric statistics for United States of America (US) male. Results from the analysis are used to derive a “mean model” – a model made from the mean value of meaningful items; head circumference, head breadth and bipupils breadth. Using this model, further investigation on the relationship between head center and face pose is performed. 2.1
Analysis Results of US Male Citizens Anthropometric Data
Our analysis results show that merely three items are important in our work. These are head circumference, head breadth and bipupils breadth. The summary of these data is presented in Table 1. Statistics that have been considered are mean (μ), standard deviation (σ), minimum and maximum. Due to there is no head radius statistics data available in the report, we determine the head radius manually by computing this value from the mean head circumference. Let P denotes the head circumference and r denotes the head radius, P is given as P = 2πr. Therefore, the head diameter, d is d = 2r = 182.88mm. Our mean model is made from these data, in which head breadth is considered as the face region in image plane (in a forward straight frontal pose). This mean model is referred to as head cylindrical model , which is discussed further in Section 3.1. Making assumptions that a head horizontal motion when viewed from top is a circular motion and eyes are located on this circle where the radius is equal to head radius, and when looking from forward straight ahead, the head diameter d (attributed from head radius, r) is always bigger than frontal face region width,
Table 1. Summary of anthropometric statistic data from 110 US adult male citizens

                          μ (mm)   σ (mm)   min (mm)   max (mm)
Head Breadth, H           152.39   5.34     138.94     166.12
Head Circumference, P     574.54   16.22    541.02     620.01
Bipupil Breadth, Dpp      61.39    3.63     54.99      70.99
and a significant relationship between the bipupil breadth (Dpp) and the head diameter (d) can be given, with a ratio of approximately 1:3 (61.39:182.88). Please refer to Fig. 3 for a graphical representation of these values. r and d are invariant quantities, since they are observed from the top of the same person, whereas Dpp is a variant quantity which relies on the face pose. From these observations, we obtain the geometrical relationship between Dpp and d, which can be given as d = γDpp, where γ = 3.0. Using the mean model, we study how to compute the head center from the head cylindrical model. This is explained in Section 3.1.
3 Horizontal Face Pose Determination

3.1 Head Center Computation
As the information we have is too limited to compute the head depth, it is impossible to compute the head center directly. Therefore, we introduce the "head cylindrical model" as the solution.

Head Cylindrical Model. The head cylindrical model (HCM) contributes in providing a reliable model to compute the head center. Our ideal model of the HCM is the mean model. It carries two main properties: invariance to head motions such as rotations and side-to-side motion – regardless of the face pose, the head center shall remain at the same position with respect to the face position; and invariance to scale – when the face moves nearer to or farther from the camera, the head center shall remain at the same position with respect to the face size. Considering these two properties, the head center is determined as follows:

1. As the only observable quantities are the pupils and the skin-like region (face region), we first compute two quantities, X0 and X1. X0 is the distance between both pupils on the image plane, which equals Dpp, while X1 is the distance from the pupil (on the side the face is facing) to the edge of the face region on the same side. X2 and X3, which are the distances from the face region edge to the HCM edge and from the other-side pupil to its side of the HCM edge, respectively, are determined indirectly from X0 and X1. This is depicted in Fig. 2.
2. To handle scaling problems, we next compute R1, R2 and R3, defined as follows: R1 = X1/X0, R2 = X2/X0 and R3 = X3/X0.
Fig. 2. Examples of HCM (mean model) viewed from top in four different horizontal poses, 0◦ , 15◦ , 30◦ and 45◦ from left to right. The observable quantities are only X0 and X1 . X2 and X3 are determined using nonlinear regression method. Notice that X0 = Dpp .
3. Since only R1 is determinable, we determine R2 and R3 by deriving each of them using nonlinear least squares regression (NLLSR) once R1 is known. To establish the relationships between R1, R2 and R3, the observations made from the mean model rotated to 15°, 30° and 45° are utilized.
4. Then, R1, R2 and R3 are computed. NLLSR provides the relationships of R1 to R2 and of R1 to R3 in terms of ratios, which are given in Eq. (1) and (2), respectively:

R2 = 0.334552 − 0.0502925 R1 − 0.0253305 R1^(−1)   (1)
R3 = 7.23527 − 25.4088 R1 + 33.7887 R1² − 14.6841 R1³   (2)
These equations show that when R1 is known, then R2 and R3 may also be determined, which will consequently provide the values of X2 and X3 because X2 = R2 X0 and X3 = R3 X0 . 5. When X2 and X3 are known, both edges of HCM are determined and finally, Dch is computed. Besides providing the head center, HCM is actually has the advantage to distinguish motions of rotations or side-to-side motion, that is, when changes in Dch is observed, a rotation is happening, whereas, when changes is observed in x− or y−axes while at the same time there is no changes in Dch , then it is a sideto-side motion. Such capabilities might be useful for monitoring one’s vigilance systems. 3.2
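The sketch below (an illustrative Python fragment of our own; the function names and the assumed left-to-right layout of X3, X0, X1, X2 are ours, with Fig. 2 defining the exact arrangement) chains steps 1-5: it forms R1 from the observables, evaluates the regression polynomials of Eq. (1) and (2), recovers X2 and X3, and estimates the HCM head center and Dch.

def r2_from_r1(r1):
    # Eq. (1): NLLSR fit relating R2 to R1
    return 0.334552 - 0.0502925 * r1 - 0.0253305 / r1

def r3_from_r1(r1):
    # Eq. (2): NLLSR fit relating R3 to R1
    return 7.23527 - 25.4088 * r1 + 33.7887 * r1**2 - 14.6841 * r1**3

def head_center_and_dch(x0, x1):
    """x0: inter-pupil distance Dpp on the image plane;
    x1: distance from the leading pupil to the face-region edge on the same side."""
    r1 = x1 / x0
    x2 = r2_from_r1(r1) * x0          # face-region edge -> HCM edge
    x3 = r3_from_r1(r1) * x0          # trailing pupil -> opposite HCM edge
    # Assumed layout from one HCM edge to the other: X3 | pupil | X0 | pupil | X1 | X2
    width = x3 + x0 + x1 + x2         # full HCM width on the image plane
    head_center = width / 2.0         # HCM midpoint, measured from the left HCM edge
    eye_center = x3 + x0 / 2.0        # midpoint between the two pupils
    dch = abs(head_center - eye_center)
    return head_center, dch

# Example: a mildly rotated face with X0 = 58 px and X1 = 30 px
print(head_center_and_dch(58.0, 30.0))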
3.2 Horizontal Ratio Computation
After computing the two important cues shown in Fig. 3, the horizontal ratio, defined as H in Eq. (3), is computed. It is the ratio of the two cues, Dch and Dpp, that are observable from the image plane. A profile face yields an infinite value of H, whereas a frontal face yields a value of 0. Details on H are presented below.

H = Dch / Dpp   (3)
Models to Define the Horizontal Ratio. The models shown in Fig. 3 and 4 are derived from the HCM introduced in Section 3.1. Assuming that the head is an ellipse when viewed from the top, with rotation radius r and distance d between both eyes, two right triangles can be observed from the top of the model when it rotates horizontally with respect to its origin O. The angle at time t is defined as θt, and θt0 is defined as the initial (frontal) pose, where θt0 = 0. This initialization is required so that the system can compute and initialize the individual parameters of the person it tracks. The two triangles are given as Triangle I and Triangle II. The values of r and d are always constant because tracking is performed on the same person, whereas Dch and Dpp are the observable values yielded from the image plane. Using the relationships between these two triangles, H is defined as follows.
Fig. 3. HCM shown from the top view. The two triangles shown in the model can be drawn when the head moves horizontally to an arbitrary pose to the left. These relations are used to derive the horizontal face pose.
Triangle I. Triangle I is yielded only when a variation in Dch is observed. Dch is defined as the distance from the center of both eyes to the head center on the image plane; therefore, its value ranges from 0 to l, corresponding to poses from frontal to profile. Notice that l < r, where l can be computed indirectly from a frontal
Fig. 4. Top view of the HCM during initialization (left) and after a rotation of θt to the left (right). The right triangle shown in the figure is pose invariant when observed from the top view. From this model, r can be approximated from the analysis of anthropometric statistics of the human head and facial features. This information is then used to derive H.
pose during initialization. For convenience of reading and reference, the following mathematical definitions refer to the quantities presented in Fig. 3 and 4. The properties of the triangle are given as follows:
– l – this value is computed indirectly from the triangle observed during initialization, shown in the left model of Fig. 4. r and a are given in Eq. (4). Since l and r are constant, once computed they remain usable throughout the tracking process until another initialization. γ in Eq. (4) is the coefficient discussed in Section 2.1.

r = γd/2 = γDpp/2   and   a = Dpp/2 = d/2.   (4)

During initialization, d = Dpp. From these two equations, we use the theorem of Pythagoras to compute l:

r² = a² + l² = (Dpp/2)² + l²,   where   l = √( r² − (Dpp/2)² ) = (Dpp/2) √(γ² − 1).   (5)

As l is a constant, we replace Dpp = d in Eq. (5) so that it is more appropriate. Therefore, we yield

l = (d/2) √(γ² − 1).   (6)
– θt – this value defines the pose angle at time t. Knowing the value of l given in Eq. (6), the mathematical relationship yielded from Triangle I is given as follows:

sin θt = Dch / l = 2Dch / ( d √(γ² − 1) ).   (7)

Therefore, at the initialization stage, where Dch = 0, sin θt0 = 0. Furthermore, computing θt from Eq. (7) yields

θt = sin⁻¹( 2Dch / ( d √(γ² − 1) ) ).   (8)
To indicate a left or right pose, positive (left) and negative (right) values are used.

Triangle II. In contrast to Triangle I, the relationship observed in Triangle II is simpler to derive, because all its quantities are directly computable from the model. d is a constant observed during initialization and Dpp is the observable value on the image plane. From these quantities, we know that cos θt = Dpp/d. Therefore, θt can be computed from Eq. (9),

θt = cos⁻¹( Dpp / d ),   (9)

where the given angle θt is actually equal to the angle derived by Eq. (8). Although either Eq. (8) or Eq. (9) can give the face pose, considering both of them simultaneously has been shown empirically to be more robust. Since we have the sin and cos relations, we compute tan from these quantities:

tan θt = sin θt / cos θt = [ 2Dch / ( d √(γ² − 1) ) ] ÷ [ Dpp / d ] = 2Dch / ( Dpp √(γ² − 1) ).   (10)
Rearranging Eq. (10), we yield

Dch / Dpp = ( √(γ² − 1) / 2 ) tan θt.   (11)

As γ = 3.0, Eq. (11) becomes Dch/Dpp = √2 tan θt. The horizontal ratio H is then defined as

H = Dch / Dpp = √2 tan θt.   (12)

Therefore, considering Eq. (12), the yaw angle at time t, θt, can be straightforwardly computed as

θt = tan⁻¹( Dch / ( √2 Dpp ) ),   (13)
which means the yaw angle (horizontal pose) can be computed using only the tan⁻¹ relation, as shown in Eq. (13). Apart from this, the sin and cos relations from Eq. (8) and (9) may also reveal θt. However, unlike the tan⁻¹ relation, neither the sin nor the cos relation alone is appropriate, due to the influence of noise during tracking. For example, when Dch takes a value greater than l due to noise, sin θt becomes greater than 1, which is impossible. The same reasoning applies to Dpp in Eq. (9) as well.
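A brief numerical illustration of this point is sketched below in Python (our own variable names, not code from the paper): it computes θt with the tan⁻¹ form of Eq. (13) and shows that a noisy Dch slightly larger than l breaks the arcsin form of Eq. (8), while the ratio-based estimate stays finite.

import math

GAMMA = 3.0                                    # d = gamma * Dpp (Section 2)

def yaw_from_ratio(dch, dpp):
    """Eq. (12)-(13): theta_t = atan(Dch / (sqrt(2) * Dpp)) for gamma = 3.0."""
    return math.degrees(math.atan(dch / (math.sqrt(2.0) * dpp)))

def yaw_from_sin(dch, d):
    """Eq. (8): valid only while Dch <= l; noise can push Dch past l."""
    l = (d / 2.0) * math.sqrt(GAMMA**2 - 1.0)  # Eq. (6)
    return math.degrees(math.asin(dch / l))

d = 61.39                                      # bipupil distance fixed at initialization
l = (d / 2.0) * math.sqrt(GAMMA**2 - 1.0)
noisy_dch = 1.02 * l                           # tracking noise overshoots l
print(yaw_from_ratio(noisy_dch, d * math.cos(math.radians(80))))  # finite, near profile
# yaw_from_sin(noisy_dch, d) would raise ValueError: math domain error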
4 Experimental Results
Three main experiments have been carried out in this work. The first experiment evaluates the validity of the horizontal ratio as the cue to represent the horizontal face pose; the second and third experiments evaluate the horizontal ratio in determining the horizontal face pose manually and automatically, respectively, from video sequences. For automatic detection purposes, we use EMoTracker. All experiments are performed on a 2.2 GHz Celeron CPU machine running Linux. Figure 5 shows the databases that have been used in our experiments.
Fig. 5. Samples of the databases used in the experiments. From top to bottom: Boston University Database, Pointing ICPR'04 Workshop Database and two video sequences taken at our lab.
4.1 H Validity Check Experiment
In this experiment, the database provided by the Pointing'04 ICPR Workshop [17] is used. This database consists of 15 sets of images; each set contains two further series of 93 images. We only consider images within 0◦ vertical and ±45◦ horizontal range from this database. From this data set, pupil and face region (left, top, right and bottom) positions are recorded manually. Then Dch, Dpp, H and θt are computed from the recorded positions. The results are summarized in Table 2.
Table 2. Results of evaluating H to represent the horizontal face pose θt (in degrees) using the database provided by the Pointing ICPR'04 Workshop

Face pose, θt    −45◦     −30◦     −15◦     0◦      15◦     30◦     45◦
Mean, μ         −43.92   −29.02   −15.27   0.00    15.13   29.46   46.09
Std. Dev., σ      6.00     4.72     4.65    0.00     3.58    3.91    5.62
Table 3. Total differences (in degrees) between ground truth data and actual experiment data using the Boston University Database

                JAM5    JIM1    LLM8    SSM8    VAM8
Mean, μ        −0.09   −0.55   −0.19    0.10    0.06
Std. Dev., σ   −0.15    0.23    0.84    0.75    0.85
For each face pose category, we compute the mean and standard deviation of θt. Analyzing the mean for each pose shows that H, as defined in our proposed method, is feasible for determining θt. Considering the standard deviation results, θt determination using H is best for the frontal pose. For the other poses, the standard deviations are larger than for the frontal pose but, overall, they are smaller than 15. This ensures that the given results are within the range of ±3◦ ∼ ±6◦ and therefore suggests the validity of our proposed method.
4.2 Determining θt from Video Sequence Database - Manual Features Detection
For this experiment, we use the video sequence database provided by Boston University [18]. This database provides ground truth data to benchmark our proposed method. However, since we concentrate only on horizontal pose under uniform lighting conditions, only part of the data is used in the experiment. The database consists of five different video sequences, JAM5, JIM1, LLM8, SSM8 and VAM8. Each is taken from a different subject and contains about 200 frames. As in the preceding experiment, pupil and face region (left, top, right and bottom) positions were recorded manually and then Dch, Dpp, H and θt were computed. The total difference (in mean) between ground truth and experimental data is shown in Table 3. The results show that the difference is very low, i.e. within −0.15◦ ∼ 0.85◦. When plotted as graphs, the results can be observed in Fig. 6. Nearly exact results have been achieved. From this experiment, we conclude that when good facial features are detected in the video sequence, a good face pose angle can be determined.
Fig. 6. Comparison between pose given in ground truth and pose computed using proposed method using Boston University Database
4.3 Determining θt from Video Sequence Database – Automatic Facial Features Detection
In this experiment, we use EMoTracker to automatically detect and track the corresponding facial features. The main purpose of this experiment is to observe how the automatic detection results influence the face pose determination results. We prepared two video sequences taken at our lab as the data set. Each is taken from a different subject and contains about 400 frames. The subjects were asked to start with a frontal pose and, after a while, rotate their faces horizontally for about one or two cycles (for example, left-right-left-right), followed by vertical motion for about the same number of cycles. For comparison purposes, data for the pupils and face region were also taken manually beforehand, and θt was computed from this manual data. The results given by the manually and automatically acquired data are referred to as "manual" and "auto", respectively, and are shown in Fig. 7. Analyzing the results for both subjects, we observe that there is not much difference between the ground truth and the experimental data. An obvious difference can be observed in the Subject 2 results when the pose is nearly 45◦. This is due to inconsistent pupil tracking by EMoTracker when the face is within this range. Moreover, it is also difficult to track pupils within this pose range if the subject wears spectacles. We are currently improving EMoTracker to solve this problem using a separability filter [19,20,21]. For this particular experiment, we achieved rates of about 25-30 fps for tracking and face pose determination using the proposed method.
Fig. 7. From top row: manual and automatic pupils and face region data, horizontal ratio and horizontal face pose results plotted from subject 1 (left column) and 2 (right column)
5 Discussions
H is generated using the pupil and face region positions. As described in Section 3.2, H is defined from the values of Dch and Dpp. Failing to detect the pupils precisely will cause erroneous Dch and Dpp values to be given to Eq. (3), which consequently affects the results. This also explains the disadvantage of considering Dch or Dpp independently to compute the pose using Eq. (8) or (9). Considering Eq. (8), a small observation noise ΔDch will contribute a large difference in θt due to the sec θt behaviour of ∂θt/∂Dch. The same observation can be made for Eq. (9), but with a csc² θt behaviour of ∂θt/∂Dpp. In contrast, when considering H as the ratio of the quantities Dch and Dpp, a small observation noise ΔH contributes a smaller difference in θt due to the cos² θt behaviour of ∂θt/∂H. This confirms that using H as defined in our work is robust against observation noise and is therefore appropriate for
Horizontal Human Face Pose Determination
425
this kind of framework. Besides, it also reduces the dependency on tracking accuracy, which is one of the most difficult tasks in pose determination work.
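A small finite-difference check of this sensitivity argument is sketched below (an illustration of our own; the two derivatives are taken with respect to different quantities, so only their trends, not their absolute values, are comparable). The slope of θt with respect to Dch grows towards profile poses, whereas the slope with respect to H shrinks.

import math

GAMMA = 3.0
d = 61.39
l = (d / 2.0) * math.sqrt(GAMMA**2 - 1.0)      # Eq. (6)

def theta_from_dch(dch):                       # Eq. (8)
    return math.asin(dch / l)

def theta_from_h(h):                           # Eq. (13), with H = Dch / Dpp
    return math.atan(h / math.sqrt(2.0))

eps = 1e-4
for pose in (15.0, 30.0, 45.0, 60.0, 75.0):
    t = math.radians(pose)
    dch, h = l * math.sin(t), math.sqrt(2.0) * math.tan(t)
    d_dch = (theta_from_dch(dch + eps) - theta_from_dch(dch)) / eps   # grows like sec(t)
    d_h = (theta_from_h(h + eps) - theta_from_h(h)) / eps             # shrinks like cos(t)^2
    print(pose, round(d_dch, 4), round(d_h, 4))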
6 Conclusions
A novel technique to determine the horizontal human face pose from the pupils and the face region has been introduced in this paper. Based on the analysis of anthropometric statistics, we derive a model known as the head cylindrical model and use it to compute the head center. The head center provides additional information for computing a ratio known as the horizontal ratio, which is used to determine the face pose. Although the desired pose can be computed directly without taking the ratio into account, it has been shown that using the ratio is more robust and reduces the dependency on tracking accuracy. A comparison between ground truth and experimental data has also been performed, in which very satisfactory results have been achieved. We have encountered two major problems: inconsistent tracking when the face pose is greater than 30◦, and an asymmetric face region during initialization. Solutions to these problems will be addressed in our future work.
Acknowledgements. This work is partially funded by a Universiti Sains Malaysia short-term grant.
References 1. Gee, A., Cipolla, R.: Determining the gaze of faces in images. Image and Vision Computing 12(10), 639–647 (1994) 2. Park, K.R., Lee, J.J., Kim, J.: Gaze position detection by computing the three dimensional facial positions and motions. Pattern Recognition 35(11), 2559–2569 (2002) 3. Davis, J.W., Vaks, S.: A perceptual user interface for recognizing head gesture acknowledgements. In: PUI 2001: Proceedings of the 2001 workshop on Perceptive user interfaces, pp. 1–7. ACM Press, New York (2001) 4. Heinzmann, J., Zelinsky, A.: Robust real-time face tracking and gesture recognition. In: International Joint Conference on Artificial Intelligence, IJCAI 1997, vol. 2, pp. 1525–1530 (1997) 5. Smith, P., Shah, M., da Vitoria Lobo, N.: Monitoring head/eye motion for driver alertness with one camera. In: IEEE International Conference on Pattern Recognition (ICPR 2000), pp. 4636–4642. IEEE Computer Society Press, Los Alamitos (2000) 6. Ji, Q., Yang, X.: Real-time eye, gaze and face pose tracking for monitoring driver vigilance. Real-Time Imaging 8(5), 357–377 (2002) 7. Yang, Z., Ai, H., Wu, B., Lao, S., Cai, L.: Face pose estimation and its application in video shot selection. In: IEEE International Conference on Pattern Recognition (ICPR 2004), vol. 1., pp. 322–325 (2004)
8. Garcia, C., Tziritas, G.: Face detection using quantized skin colour regions merging and wavelet packet analysis. IEEE Transactions on Multimedia MM-1, 264–277 (1999) 9. Ji, Q., Hu, R.: 3d face pose estimation and tracking from a monocular camera. Image and Vision Computing 20(7), 499–511 (2002) 10. Horprasert, T., Yacoob, Y., Davis, L.S.: Computing 3-d head orientation from a monocular image sequence. In: IEEE International Conference on Automatic Face and Gesture Recognition (FGR 1996), pp. 242–247. IEEE Computer Society Press, Los Alamitos (1996) 11. Osuna, E., Freund, R., Girosit, F.: Training support vector machines: An application to face detection. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR 1997), pp. 130–136. IEEE Computer Society Press, Los Alamitos (1997) 12. Schneiderman, H.W.: Learning statistical structure for object detection. In: Computer Analysis of Images and Pattern (CAIP), pp. 434–441. Springer, Heidelberg (2003) 13. Pentland, A., Moghaddam, B., Starner, T.: View-based and modular eigenspaces for face recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 1994), IEEE Computer Society Press, Los Alamitos (1994) 14. Ho, S.Y., Huang, H.L.: An analytic solution for the pose determination of human faces from a monocular image. Pattern Recognition Letters 19(11), 1045–1054 (1998) 15. Suandi, S.A., Enokida, S., Ejima, T.: Emotracker: Eyes and mouth tracker based on energy minimizaton criterion. In: 4th Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP 2004), IAPR, pp. 269–274 (2004) 16. Young, J.W.: Head and face anthropometry of adult u.s. citizens. Technical Report R0221201, Beta Research Inc. (1993) 17. Gourier, N., Hall, D., Crowley, J.L.: Estimating Face Orientation from Robust Detection of Salient Facial Features. In: Proceedings of Pointing 2004, ICPR International Workshop on Visual Observation of Deictic Gestures (2004) 18. Cascia, M.L., Sclaroff, S., Athitsos, V.: Fast, reliable head tracking under varying illumination: An approach based on registration of texture-mapped 3d models. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 22(4), 322–336 (2000) 19. Loy, C.Y., Suandi, S.A.: Precise pupils detection using separability filter. In: International Conference on Robotics, Vision, Information and Signal Processing (ROVISP) ( to be published, 2007) 20. Fukui, K., Yamaguchi, O.: Facial feature point extraction method based on combination of shape extraction and pattern matching. Systems and Computers in Japan 29(6), 49–58 (1998) 21. Kawaguchi, T., Rizon, M.: Iris detection using intensity and edge information. Pattern Recognition 36(2), 549–562 (2003)
Segmentation-Based Adaptive Support for Accurate Stereo Correspondence Federico Tombari1,2 , Stefano Mattoccia1,2 , and Luigi Di Stefano1,2 1
Department of Electronics Computer Science and Systems (DEIS) University of Bologna Viale Risorgimento 2, 40136 - Bologna, Italy 2 Advanced Research Center on Electronic Systems (ARCES) University of Bologna Via Toffano 2/2, 40135 - Bologna, Italy {ftombari, smattoccia, ldistefano}@deis.unibo.it
Abstract. Significant achievements have been attained in the field of dense stereo correspondence by local algorithms based on an adaptive support. Given the problem of matching two corresponding pixels within a local stereo process, the basic idea is to consider as support for each pixel only those points which lie on the same disparity plane, rather than those belonging to a fixed support. This paper proposes a novel support aggregation strategy which includes information obtained from a segmentation process. Experimental results on the Middlebury dataset demonstrate that our approach is effective in improving the state of the art. Keywords: Stereo vision, stereo matching, variable support, segmentation.
1 Introduction

Given a pair of rectified stereo images Ir, It, the problem of stereo correspondence is to find for each pixel of the reference image Ir the corresponding pixel in the target image It. The correspondence for a pixel at coordinates (x̄, ȳ) can only be found at the same vertical coordinate ȳ and within the range [x̄ + dm, x̄ + dM], where D = [dm, dM] denotes the so-called disparity range. The basic local approach selects, as the best correspondence for a pixel p in Ir, the pixel of It which yields the lowest score of a similarity measure computed on a (typically square) fixed support (correlation window) centered on p and on each of the dM − dm candidates defined by the disparity range. The use of a spatial support, compared to a pointwise score, increases the robustness of the match, especially in the presence of noise and low-textured areas, but a fixed support is prone to errors because it blindly aggregates pixels belonging to different disparities. For this reason, incorrect matches tend to be generated along depth discontinuities. In order to improve this approach, many techniques have been proposed which try to select for each pixel an adaptive support which best aggregates only those neighbouring pixels at the same disparity [1], [2], [3], [4], [5], [6] (see [7] and [8] for a review). Recently very effective techniques [8], [9] were proposed, which represent the state of the art
for local stereo algorithms. The former technique weights each pixel of the correlation window on the basis of both its spatial distance and its colour distance in the CIELAB space from the central pixel. Though this technique provides in general excellent results, outperforming [9] on the Middlebury dataset1, in the presence of highly textured regions the support can shrink to a few pixels, thus dramatically reducing the reliability of the matches. Unreliable matches can also be found near depth discontinuities, as well as in the presence of low-textured regions and repetitive patterns. This paper proposes a novel adaptive support aggregation strategy which deploys segmentation information in order to increase the reliability of the matches. By means of experimental results we demonstrate that this approach is able to improve the quality of the disparity maps compared to the state of the art of local stereo algorithms. In the next section we review the state of the art of adaptive support methods for stereo matching. For a more comprehensive survey on stereo matching techniques see [10].
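To make the baseline concrete, the sketch below is an illustrative Python implementation of our own (not the authors' code): it aggregates a truncated absolute difference over a fixed square support with a box filter and selects the disparity by winner-take-all. SciPy's uniform_filter is used as the box filter and image borders are handled only approximately.

import numpy as np
from scipy.ndimage import uniform_filter

def fixed_window_wta(ir, it, d_min, d_max, radius=4, trunc=80.0):
    """Baseline local matcher: truncated absolute difference (TAD) aggregated
    over a fixed square window, disparity chosen by winner-take-all (WTA).
    ir, it: grayscale reference / target images as float arrays of equal shape."""
    h, w = ir.shape
    best_cost = np.full((h, w), np.inf)
    disparity = np.zeros((h, w), dtype=np.int32)
    for d in range(d_min, d_max + 1):
        # the candidate in the target image lies at x + d on the same scanline
        shifted = np.roll(it, -d, axis=1)                 # border columns are approximate
        cost = np.minimum(np.abs(ir - shifted), trunc)    # pointwise TAD score
        agg = uniform_filter(cost, size=2 * radius + 1)   # fixed-support aggregation
        better = agg < best_cost
        best_cost[better] = agg[better]
        disparity[better] = d
    return disparity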
2 Previous Work

In [9] Gerrits and Bekaert propose a support aggregation method based on the segmentation of the reference image (Ir) only. When evaluating the correspondence between two points, p ∈ Ir and q ∈ It, both correlation windows are identically partitioned into two disjoint regions, R1 and R2. R1 coincides with the segment of the reference image including p, R2 with its complement. Points belonging to R1 get a high constant weight, those belonging to R2 a low constant weight. Cost computation relies on an M-estimator. A major weakness of the method is that the support aggregation strategy is not symmetrical (i.e. it relies on Ir only) and hence does not deploy the useful information which may be derived from the segmentation of the target image (It). Experimental results show that [9] is clearly outperformed by the algorithm of Yoon and Kweon [8], which is currently the best local stereo algorithm. The basic idea of [8] is to extract an adaptive support for each possible correspondence by assigning a weight to each pixel which falls into the current correlation window Wr in the reference image and, correspondingly, in the correlation window Wt in the target image. Let pc and qc be, respectively, the central points of Wr and Wt, whose correspondence is being evaluated. The pointwise score, which is selected as the Truncated Absolute Difference (TAD), for any point pi ∈ Wr corresponding to qi ∈ Wt is weighted by coefficients wr(pi, pc) and wt(qi, qc), so that the total cost for the correspondence (pc, qc) is given by summing all the weighted pointwise scores belonging to the correlation windows and normalizing by the sum of the weights:

C(pc, qc) = [ Σ_{pi∈Wr, qi∈Wt} wr(pi, pc) · wt(qi, qc) · TAD(pi, qi) ] / [ Σ_{pi∈Wr, qi∈Wt} wr(pi, pc) · wt(qi, qc) ]   (1)
1 The image pairs together with the groundtruth are available at: http://cat.middlebury.edu/stereo/data.html
Each point in the window is weighted on the basis of its spatial distance as well as its distance in the CIELAB colour space from the central point of the window. Hence, each weight wr(pi, pc) for points in Wr (and similarly each weight wt(qi, qc) for points in Wt) is defined as:

wr(pi, pc) = exp( − dc(Ir(pi), Ir(pc)) / γc − dp(pi, pc) / γp )   (2)

where dc and dp are, respectively, the Euclidean distance between two CIELAB triplets and the Euclidean distance between two coordinate pairs, and the constants γc, γp are two parameters of the algorithm. This method provides excellent results but also has some drawbacks, which will be highlighted in the following by analysing the results obtained by [8]2 on stereo pairs belonging to the Middlebury dataset and shown in Fig. 1.

Depth discontinuities. The idea of a variable support is mainly motivated by depth discontinuities: in order to detect depth borders accurately, the support should separate "good" pixels, i.e. pixels at the same disparity as the central point, from "bad" pixels, i.e. pixels at a different disparity from the central point. It is easy to understand that within these regions the concept of spatial distance is prone to lead to wrong separations, since by definition border points always have close-by pixels belonging to different depths. Therefore "bad" pixels close to the central point might receive higher weights than "good" ones far from the central point, this effect becoming more significant the more the chromatic similarity between the regions at different disparities increases. Moreover, as for "good" pixels, far ones might receive a significantly smaller weight than close ones, while ideally one should try to aggregate as many "good" pixels as possible. Generally speaking, weights based on spatial proximity from the central point are constant for each correlation window and hence drive toward fixed, no longer variable, supports, with all the negative consequences of such an approach. Fig. 2 shows a typical case where the use of spatial distance would wrongly determine the support. Imagine that the current point (the blue point in the figure) is on the border of two planes at different depths, characterized by slightly different colour or brightness. The central image shows the correlated pixels (circles coloured from red, high correlation, to yellow, low correlation) on the basis of spatial proximity; it can be seen that many "bad" pixels would receive a high weight because of their close spatial distance from the central point. The right image depicts in red the correct support that should ideally be extracted. This effect leads to mismatches on some depth borders of the Tsukuba and Venus datasets, as indicated by the blue boxes of Fig. 1 (the ground truth is shown in Fig. 6).

Low textured surfaces. A further drawback of [8] concerns matching ambiguities which arise when matching points belonging to low-textured areas at constant depth. When considering the correspondence of points in these areas, the support should ideally enlarge itself as much as possible in order to maximize the signal-to-noise ratio. Instead, the combined use of the spatial and colour proximities forces the
2 The results shown in this paper were obtained running the authors' code available at: http://cat.middlebury.edu/stereo/code.html
Fig. 1. Some typical artifacts caused by the cost function adopted by [8] on high textured regions (red), depth discontinuities (blue), low textured regions (green), repetitive patterns (yellow). [This image is best viewed with colors].
Fig. 2. Example of a correlation window along depth borders (left), corresponding weights assigned by [8] on the basis of spatial proximity (center) and ideal support (right). [This image is best viewed with colors].
Fig. 3. Examples where the support shrinks to a few elements due to the combined use of spatial and colour proximity. The coloured circles indicate the region correlated to the central pixels on the basis of the spatial proximity.
support to be smaller than the correlation window. This effect is particularly evident in the Venus, Cones and Teddy datasets, where the low-textured regions denoted by the green boxes of Fig. 1 lead to remarkable artifacts in the corresponding disparity map.

High textured surfaces. Suppose we have a highly textured region lying on a constant disparity plane. Then, for all those points without enough chromatic similarity in their surroundings, the aggregated support tends to reduce to a very small number of points. This effect is due to the weights decreasing exponentially with the spatial and colour distances, and it tends to notably reduce the robustness of the matching as the support tends to become pointwise. It is important to note that in these situations the support should ideally enlarge itself and aggregate many elements in the window because of the constant depth.
Fig. 4. Typical example of a repetitive pattern along epipolar lines where the aggregation step of [8] would lead to ambiguous match. Red-to-yellow colours are proportional to the weights assigned to the supports.
In order to get an idea of the behaviour of the aggregated support, consider the situation of Fig. 3, where some particular shapes are depicted. In the upper row, the blue point represents the current element for which the support aggregation is computed and the blue square represents the window whose elements concur in the computation of the support. In the lower row, the coloured circles denote the points correlated to the central point on the basis of the spatial proximity criterion, where red corresponds to high correlation and yellow to low correlation. As can be clearly seen, the combined use of spatial and colour proximity would lead in these cases to very small aggregated supports compared to the whole area of the shapes as well as to the correlation window area. Typical artifacts induced by this circumstance are evident in the Venus, Cones and Teddy datasets, as highlighted by the red boxes in Fig. 1, where it is easy to see that they are often induced by the presence of coloured writing on objects in the scene and that they produce notable mistakes in the corresponding regions of the disparity maps.

Repetitive patterns. Finally, a further problem due to the use of the weight function (1) arises in the presence of repetitive patterns along the epipolar lines. As an example consider the situation depicted in Fig. 4. In this case, the blue point in the top-left image has to be matched with two candidates at different disparities, centered on two similar patterns and shown in the top-right image. In this situation, the combined use of spatial and colour proximities in the weight function would extract supports similar to the ones shown in the bottom part of the figure, where red corresponds to high weight values and
yellow to low weight values. It is easy to see that the pixels belonging to both candidate supports are similar to the reference support, which would lead to an ambiguous match. This would not happen, e.g., with the use of the common fixed square support, which includes the whole pattern. In Fig. 1 a typical case of a repetitive pattern along epipolar lines is shown by the yellow box in the Tsukuba dataset, which leads to mismatches in the disparity map. The case depicted by the yellow box in the Cones dataset also seems due to a similar situation.
3 Proposed Approach

The basic idea behind our approach is to employ information obtained from the application of segmentation within the weight cost function in order to increase the robustness of the matching process. Several methods have recently been proposed based on the hypothesis that disparity varies smoothly within each segment yielded by an (over)segmentation process applied to the reference image [9], [11], [12]. As the cost function (1) used to determine the aggregated support is symmetrical, i.e. it computes weights based on the same criteria on both images, we propose to apply segmentation to both images and to include the resulting information in the cost function. The use of segmentation allows the aggregation stage to also include information about the connectedness of pixels and the shape of the segments, rather than relying blindly on colour and proximity only. Because our initial hypothesis is that each pixel lying in the same segment as the central pixel of the correlation window must have a similar disparity value, its weight has to be equal to the maximum value of the range (i.e. 1.0). Hence we propose a modified weight function as follows:

wr(pi, pc) = 1.0                                   if pi ∈ Sc
wr(pi, pc) = exp( − dc(Ir(pi), Ir(pc)) / γc )      otherwise       (3)

with Sc being the segment on which pc lies. It is important to note that for all pixels outside segment Sc the proximity term has been eliminated from the overall weight computation, and all pixels belonging to the correlation window have the same importance independently of their distance from the central point, because of the negative drawbacks of such a criterion shown in the previous section. Instead, the use of segmentation plays the role of an intelligent proximity criterion. It is easy to see that this method is less subject to the negative aspects of method [8] outlined in the previous section. The problem of having very small supports in the presence of shapes such as the ones depicted in Fig. 3 is alleviated by segmentation. In fact, as segmentation allows segments to grow as long as chromatic similarity is assessed, the aggregated supports extracted by the proposed approach are likely to correctly coincide with the shapes depicted in the figure. Moreover, the use of segmentation in place of spatial proximity allows the support to be extracted correctly also for border points, such as in the situation described in Fig. 2, with the extracted support tending to coincide with the one shown on the right of that figure. Improvements are also yielded in the presence of low-textured areas: as they tend to correspond to a single segment because of the low texture, the support correctly enlarges to include all points of these regions. Finally, in
the presence of repetitive patterns such as the ones shown in Fig. 4, the exclusion of spatial proximity from the weight computation allows only the correct candidate to have a support similar to that of the reference point. Moreover, from experimental results we found that the use of a colour space such as CIELAB helps the aggregation of pixels which are chromatically distant but closer in the sense of that colour space. Unfortunately this renders the colour distance measure less selective, and tends to produce more errors along depth discontinuities. Conversely, the use of the RGB colour space appeared more selective, decreasing the chance that pixels belonging to different depths are aggregated in the same support, but also increasing the number of artifacts along textured regions which lie at the same depth. As the use of segmentation adds robustness to the support, we found it more convenient to operate in the RGB space in order to enforce smoothness over textured planes as well as to increase the accuracy of depth border localization. Finally, it is worth pointing out that there are two main differences between our method and that proposed in [9]: first, we apply segmentation to both the reference and target images, hence the support aggregation strategy is symmetric. Besides, rather than using two constant weights, we exploit the concept of colour proximity, with all the benefits of such an approach shown in [8]. A minimal sketch of this aggregation scheme is given below.
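As a concrete illustration, the following minimal Python sketch (our own; function and parameter names are ours, border handling is omitted, and the segment labels are assumed to come from a prior mean-shift segmentation of each image) evaluates the cost of Eq. (1) for one candidate correspondence using the segmentation-based weights of Eq. (3), with the colour distance taken in RGB as discussed above.

import numpy as np

def support_weights(img, seg, pc, radius, gamma_c=22.0):
    # Eq. (3): full weight inside the segment of the central pixel pc,
    # otherwise a weight decreasing with the Euclidean colour distance.
    r0, c0 = pc
    win = img[r0 - radius:r0 + radius + 1, c0 - radius:c0 + radius + 1]
    lab = seg[r0 - radius:r0 + radius + 1, c0 - radius:c0 + radius + 1]
    dc = np.linalg.norm(win - img[r0, c0], axis=2)   # colour distance to the centre
    w = np.exp(-dc / gamma_c)
    w[lab == seg[r0, c0]] = 1.0                      # same segment -> weight 1.0
    return w

def correspondence_cost(ir, it, seg_r, seg_t, pc, qc, radius, trunc=80.0):
    # Eq. (1): weighted and normalised sum of truncated absolute differences.
    wr = support_weights(ir, seg_r, pc, radius)
    wt = support_weights(it, seg_t, qc, radius)
    pr = ir[pc[0] - radius:pc[0] + radius + 1, pc[1] - radius:pc[1] + radius + 1]
    qt = it[qc[0] - radius:qc[0] + radius + 1, qc[1] - radius:qc[1] + radius + 1]
    tad = np.minimum(np.abs(pr - qt).sum(axis=2), trunc)
    return float((wr * wt * tad).sum() / (wr * wt).sum())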
4 Experimental Results

In this section we present some experimental results of the proposed method. First we compare our results on the Middlebury dataset with those yielded by [8] using a Winner-Take-All (WTA) strategy. The parameter set is kept constant for all image pairs: the set used for the algorithm by Yoon and Kweon is the one proposed in the experimental results in [8], while the set used for the proposed approach is: γc = 22.0, window size = 51 × 51, T (parameter for TAD) = 80. For the segmentation step of the proposed approach, we use the Mean-Shift algorithm [13] with the same constant parameter set, that is: σS = 3 (spatial radius), σR = 3 (range radius), minR = 35 (minimum region size). Figure 5 shows the output of the segmentation stage on both images of each of the 4 stereo pairs used for testing. Fig. 6 compares the disparity maps obtained by [8] with the proposed approach. Significant improvements can be clearly noticed, since the artifacts highlighted in Fig. 1 are less evident or no longer present. In particular, errors within the considered highly textured regions on Venus and Teddy are greatly reduced and almost disappear on Cones. Accuracy along the depth borders of Tsukuba is significantly enhanced, while the error along the depth border in Venus shrinks to the true occluded area. Moreover, the highlighted artifacts present in low-textured regions notably decrease on Venus and disappear on Teddy and Cones. Finally, the artifacts due to the presence of repetitive patterns shown on Tsukuba and Cones also definitely disappear. In addition, Table 1 shows the error percentages with regard to the ground truth, with the error threshold set to 1, computed on the maps of Fig. 6. For each image pair two error measures are proposed: the former is relative to all image area except for occlusions (N.O.), the latter only to discontinuities except for occlusions (DISC). The error on all image area including occlusions has not been reported because occlusions
Fig. 5. Output of the segmentation stage on the 4 stereo pairs of the Middlebury dataset
are not handled by the WTA strategy. As can be seen from the table, the use of the proposed approach yields notable improvements in the error measure on the whole N.O. area. Moreover, by looking only at discontinuities, we can see that the proposed approach generally allows for a reduction of the error rate (all cases except Cones). Benefits are most evident on Venus and Tsukuba. Finally, we show the results obtained by our method after application of the Left-Right consistency check and interpolation of those points which were determined as
Fig. 6. Reference images (first column), disparity maps computed by [8] (second column) and our approach (third column), ground truth (last column)

Table 1. Comparison between the proposed approach and method [8] on the Middlebury dataset using a WTA strategy

              Tsukuba        Venus          Teddy          Cones
              N.O. - DISC    N.O. - DISC    N.O. - DISC    N.O. - DISC
Proposed      2.05 - 7.14    1.47 - 10.5    10.8 - 21.7    5.08 - 12.5
[8]           4.66 - 8.25    4.61 - 13.3    12.7 - 22.4    5.50 - 11.9
inconsistent. The obtained disparity maps were submitted and are available at the Middlebury website. We report, in Tab. 2, the quantitative results of our method (referred to as SegmentSupport) compared to the submitted results of method [8] (referred to as AdaptWeight), together with the overall ranking assigned by Middlebury to the two approaches. The table also reports the results published in [9], which consist only of the error rates on the ALL ground truth maps (all image area including occlusions), since no submission has been made so far to Middlebury. As is clear from the table and the
Table 2. Disparity error rates and rankings obtained on Middlebury website by the proposed approach (referred to as SegmentSupport) compared to method [8] (referred to as AdaptWeight) and (where available) [9]
                  Rank   Tsukuba              Venus                Teddy                Cones
                         N.O. - ALL - DISC    N.O. - ALL - DISC    N.O. - ALL - DISC    N.O. - ALL - DISC
SegmentSupport      9    1.25 - 1.62 - 6.68   0.25 - 0.64 - 2.59   8.43 - 14.2 - 18.2   3.77 - 9.87 - 9.77
AdaptWeight        13    1.38 - 1.85 - 6.90   0.71 - 1.19 - 6.13   7.88 - 13.3 - 18.6   3.97 - 9.79 - 8.26
[9]               n.a.   n.a. - 2.27 - n.a.   n.a. - 1.22 - n.a.   n.a. - 19.4 - n.a.   n.a. - 17.4 - n.a.
Middlebury website, our approach is currently the best-performing known local method, ranking 9th overall (as of July 2007).
5 Conclusions

In this paper a novel support aggregation strategy has been proposed, which embodies the concept of colour proximity as well as segmentation information in order to obtain accurate stereo correspondence. By means of experimental comparisons it was shown that the proposed contribution, deployed within a WTA-based local algorithm, is able to improve the accuracy of disparity maps compared to the state of the art. It is likely that the proposed strategy might also be usefully exploited outside a local framework: this is currently under study.
References 1. Xu, Y., Wang, D., Feng, T., Shum, H.: Stereo computation using radial adaptive windows. In: Proc. Int. Conf. on Pattern Recognition (ICPR 2002), vol. 3, pp. 595–598 (2002) 2. Boykov, Y., Veksler, O., Zabih, R.: A variable window approach to early vision. IEEE Trans. PAMI 20(12), 1283–1294 (1998) 3. Gong, M., Yang, R.: Image-gradient-guided real-time stereo on graphics hardware. In: Proc. 3D Dig. Imaging and modeling (3DIM), Ottawa, Canada, pp. 548–555 (2005) 4. Hirschmuller, H., Innocent, P., Garibaldi, J.: Real-time correlation-based stereo vision with reduced border errors. Int. Jour. Computer Vision (IJCV) 47(1-3) (2002) 5. Kanade, T., Okutomi, M.: Stereo matching algorithm with an adaptive window: theory and experiment. IEEE Trans. PAMI 16(9), 920–932 (1994) 6. Veksler, O.: Fast variable window for stereo correspondence using integral images. In: Proc. Conf. on Computer Vision and Pattern Recognition (CVPR 2003), pp. 556–561 (2003) 7. Wang, L., Gong, M.W., Gong, M.L., Yang, R.G.: How far can we go with local optimization in real-time stereo matching. In: Proc. Third Int. Symp. on 3D Data Processing, Visualization, and Transmission (3DPVT 2006), pp. 129–136 (2006) 8. Yoon, K.J., Kweon, I.S.: Adaptive support-weight approach for correspondence search. IEEE Trans. PAMI 28(4), 650–656 (2006) 9. Gerrits, M., Bekaert, P.: Local stereo matching with segmentation-based outlier rejection. In: Proc. Canadian Conf. on Computer and Robot Vision (CRV 2006), pp. 66–66 (2006) 10. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. Jour. Computer Vision (IJCV) 47(1/2/3), 7–42 (2002)
11. Klaus, A., Sormann, M., Karner, K.: Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure. In: Proc. Int. Conf. on Pattern Recognition (ICPR 2006), vol. 3, pp. 15–18 (2006) 12. Bleyer, M., Gelautz, M.: A layered stereo matching algorithm using image segmentation and global visibility constraints. Jour. Photogrammetry and Remote Sensing 59, 128–150 (2005) 13. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. IEEE Trans. PAMI 24, 603–619 (2002)
3D Reconstruction of a Human Body from Multiple Viewpoints Koichiro Yamauchi, Hideto Kameshima, Hideo Saito, and Yukio Sato Graduate School of Science and Technology, Keio University Yokohama 223-8522, Japan {yamauchi,kameshima,saito}@ozawa.ics.keio.ac.jp,
[email protected]
Abstract. A human body measurement system using multiple viewpoints is proposed. Systems developed with only a few viewpoints have not been able to acquire whole human body data successfully due to occlusion. We propose that such data can be obtained by a method using multiple, appropriately arranged rangefinders. Four compact rangefinders are installed in a pole, and three such pole units, with 12 rangefinders in total, are arranged around a person. Multiple-viewpoint range images allow the 3D shape reconstruction of a human body. A morphable model is then adapted to the whole human body data. The measurement time is 2 seconds and the average error is found to be 1.88 mm. In this paper, the system configuration, calibration, morphable model and experimental results are described. Keywords: human body, 3d reconstruction, modeling, rangefinder.
1 Introduction
A rangefinder acquires the 3D shape (i.e. a range image and a surface image) of a target object by observing it. It has become a suitable device for practical use thanks to continually improving accuracy, miniaturization and price reduction. Recently, human body measurement has attracted attention in both research and business fields. For example, whole human body data can be used to animate digital human models with motion capture data, based on a prediction model or kinematics. Other applications are health management, surgical simulation, augmented reality, computer-aided design (CAD), and custom-made clothing. Some human measurement systems have already been developed and have reached the commercial market. One such product is the Whole Body Color 3D Scanner by Cyberware [1]. The time for the measuring instruments to move down from head to toe is 17 seconds. The VITUS 3D Body Scanner is a product of VITRONIC [2]. It is based on the light-section method, acquiring whole human body data with a resolution of 4-5 mm in 12 seconds. Another product, which is composed of four range sensors, measures only from the front and the rear [3]. Projectors emit structured light patterns in turn and the respective cameras
Fig. 1. Configuration of pole unit
capture images of those views. The measurement time is about 8 seconds. Because there are only a few viewpoints, a person must stand a little further away and occlusion will also occur. Although these systems are useful in terms of accuracy and resolution, they take a long time to obtain whole human body data. In contrast, stereo methods for 3D shape reconstruction with multiple viewpoints have already been presented. A pertinent instance is the "3D Room", which has 49 cameras mounted inside the room [4], [5], [6]. Virtual views are generated by model-based rendering and image-based rendering. It is a large facility for digitizing dynamic events. Generally speaking, compact 3D human measurement systems that acquire a large number of accurate data points are desired. There are some important problems we should resolve. The first of them is the measurement time. It is very hard for a person to keep standing motionless during the measurement, and they move more and more every second. Therefore, we must complete the measurement in the shortest possible time. Another problem is occlusion, i.e. regions hidden from the views. When we utilize only a couple of viewpoints, occlusion will occur easily. It is especially difficult to capture data of the submental, axillary, or groin regions, which are often hidden from the views. Therefore, it is better to obtain whole human body data using multiple rangefinders. In this paper, we use appropriately positioned multiple rangefinders to resolve the previous problems. We have developed a compact rangefinder and installed four rangefinders in a pole. The human body measurement system is configured with three poles in a compact space. Multiple range images allow the 3D shape reconstruction of a human body at high speed. Then a morphable model is adapted to the whole human body data. In the following sections, the system configuration, calibration, morphable model and experimental results are described.
Fig. 2. Pole unit
Fig. 3. Measurement result by pole unit

2 Human Body Measurement System

2.1 Pole Unit
The compact rangefinder that we have developed is composed of a CCD camera and a laser scanner [7]. The CCD camera captures an image at 640×480 pixels resolution with a 6 mm lens. The light source of the laser scanner is a red semiconductor laser with a wavelength of 650 nm. A light pattern is generated by emitting and switching a slit light made by the laser scanner. The range image is obtained by the space encoding method within 0.5 seconds [8], [9]. The 3D coordinate points are computed by the triangulation principle. When we measure a target placed one meter ahead, the measurement error is found to be 2 mm. Four compact rangefinders are installed in a pole as shown in Fig. 1, and Fig. 2 shows the pole unit and the control PC. It is an independent and movable unit. The size is 300 mm wide, 265 mm long and 2135 mm high. The CCD camera is placed under the laser scanner in each rangefinder. The baseline is 330 mm between the CCD camera and the laser scanner. The pole unit makes the measuring range much wider than that of one rangefinder. When we measure a target placed one meter ahead, the measuring range is 800×1800 mm. Each rangefinder is connected to the control PC by a USB 2.0 cable. The control PC takes control of each rangefinder and synchronizes the actions of the four rangefinders. Fig. 3 is a measurement result by one pole unit when we measured a man from the front. The left is the texture display and the right is the point cloud display. We have obtained the 3D shape of a human body from head to toes. Whole human body data is acquired by using more than one pole unit.
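As a generic illustration of the triangulation principle for a camera/laser-scanner pair (a simplified geometry of our own choosing, not the authors' exact calibration model; all parameter names are hypothetical), a decoded light-plane angle and a camera pixel determine a 3D point as follows.

import numpy as np

def triangulate_laser_plane(u, v, theta, fx, fy, cx, cy, baseline=330.0):
    """Illustrative camera/laser-plane triangulation (not the authors' exact model).
    Camera at the origin looking along +Z; laser source offset by `baseline` (mm)
    along +X; the decoded light plane makes angle `theta` (rad) with the baseline.
    (u, v): pixel of the illuminated point; returns its 3D position in mm."""
    # Back-project the pixel into a viewing ray (pinhole model).
    dx, dy, dz = (u - cx) / fx, (v - cy) / fy, 1.0
    # Light plane: all points satisfying X + Z / tan(theta) = baseline.
    s = baseline / (dx + 1.0 / np.tan(theta))   # ray parameter at the intersection
    return np.array([s * dx, s * dy, s * dz])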
2.2 System Configuration
When we measure human bodies, it is difficult to capture data of the submental, axillary and groin regions. If there are only a few viewpoints, occlusion will be
Fig. 4. Human body measurement system
Fig. 5. Timing diagram
found easily. Moreover, the measuring range of one rangefinder is narrow. These problems are solved by using multiple rangefinders. Three pole units, with 12 rangefinders, are arranged around a person. Fig. 4 shows a scene of the measurement. The 12 viewpoint range images allow the 3D shape reconstruction of a human body. Whole human body data, a 1.5 million point cloud and 12 surface textures, is obtained. We must complete the measurement as quickly as possible, preferably within one second, because it is hard for a person to keep standing motionless. If the 12 rangefinders of the three pole units were operated sequentially one by one, the measurement time would be too long. Furthermore, if some rangefinders were operated concurrently, coarse whole human body data would be generated because of light pattern interference among rangefinders. This adverse effect is suppressed by a combination and control of the measurement timing. We acquire whole human body data in four stages. Fig. 5 is the timing diagram. Four rangefinders placed at diagonal corners or at non-interfering heights are operated at the same time; each stage is equal to the measurement time of one rangefinder. Therefore, the measurement time is 2 seconds for the four stages. It is possible to improve the accuracy by increasing the number of pole units and, conversely, to make the system more compact with two pole units. The assignment of the three pole units is not fixed and they can be moved flexibly.
3 Calibration
A simple calibration is performed for the rangefinder. Our rangefinder is composed of the CCD camera and the laser scanner. The camera parameters are the focal length, image center, skew, and coefficients of the radial distortion. The scanner parameters are the projection angles and the baseline between the optical center of the camera and the light source of the scanner. We execute Tsai's camera calibration program to acquire the camera parameters [10]. The scanner parameters are obtained from the theoretical figures of its design. These two parameter sets enable the
Fig. 6. Calibration model
rangefinder to capture a range image and a surface image of a target object. 3D coordinates are computed by the triangulation principle. The human body measurement system is configured with three pole units. If the geometric positions of the rangefinders are not known, the 12 range images cannot be combined into the 3D shape reconstruction of a human body. Our solution to this problem is an alignment approach using a calibration target of known 3D shape, such as a cuboid or a cylinder. The calibration model is shown in Fig. 6. The camera coordinate systems of the 12 rangefinders are integrated into the world coordinate system of the calibration target. A 3D measurement point from the rangefinder is denoted by c̃ = [xc, yc, zc, 1]^T. A 3D calibration point on the calibration target is denoted by w̃ = [xw, yw, zw, 1]^T. The relationship between the camera coordinate system and the world coordinate system is given by

w̃ = He c̃,   with   He = | r11 r12 r13 t1 |
                          | r21 r22 r23 t2 |
                          | r31 r32 r33 t3 |
                          |  0   0   0   1 |   (1)

where He is the Euclidean transformation; the 12 parameters [r11, . . . , t3]^T are solved by the least squares method. When the 12 Euclidean transformation matrices are obtained, the 12 range images can be integrated into whole human body data. The assignment of pole units has no constraint as long as all rangefinders can observe the calibration target.
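The paper only states that the 12 parameters are obtained by the least squares method; the sketch below is one possible Python realisation of that step (names are ours, and unlike a strict Euclidean fit this plain linear solve does not enforce orthonormality of the rotation block).

import numpy as np

def estimate_euclidean_transform(points_c, points_w):
    """Solve Eq. (1), w~ = He c~, for the 12 unknowns [r11..t3] by linear least
    squares from N >= 4 calibration correspondences.
    points_c, points_w: N x 3 arrays of camera / world coordinates."""
    n = points_c.shape[0]
    ch = np.hstack([points_c, np.ones((n, 1))])      # homogeneous camera points
    # Each world coordinate gives an independent linear system ch @ x = b.
    rows = [np.linalg.lstsq(ch, points_w[:, k], rcond=None)[0] for k in range(3)]
    he = np.vstack(rows + [np.array([0.0, 0.0, 0.0, 1.0])])
    return he                                        # 4 x 4 transformation matrix

# Usage sketch: he = estimate_euclidean_transform(cam_pts, world_pts)
# integrated = (he @ np.hstack([cam_pts, np.ones((len(cam_pts), 1))]).T).T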
4 Morphable Model
Many researchers are studying the 3D modeling of human bodies. For example, the Stanford Digital Michelangelo Project [11], [12] is famous for protecting 3D graphics content. B. Allen et al. [13], [14] proposed a method for creating a whole-body morphable model based on 3D scanned examples.
(a) female
(b) male
Fig. 7. Morphable models
In this section, a modeling technique using whole human body data obtained by multiple rangefinders is presented. We consider the problems involved with bodily habitus and absent parts. A human body is commonly treated as a closed surface, i.e. the same topological object for every person. It is necessary to emphasize that few 3D human body models have the capacity to represent the figure of all persons. Therefore, we have designed two 3D human body models based on Poser 5.0 [15] figures for the representation of human bodies. A 3D female model (7,973 vertices and 8,409 faces) and a 3D male model (8,994 vertices and 8,582 faces) are generated as shown in Fig. 7. Twenty feature curves are defined in these models as closed curves which indicate the boundaries of some body parts. Whole human body data can be associated with other obtained data using the 3D human body models, which include region information. A human body is treated as a rigid object, but our models are treated as an elastic object, like a wetsuit. To adapt the proposed 3D model to whole human body data, a mass-and-spring-based fitting technique is utilized for deforming elastic objects [16]. The operator is able to handle some feature curves interactively so that the whole human body data is wrapped around the proposed model. In addition, the operator can interactively adjust mismatched parts of the adapted model along the surface shape. Consequently, the 3D human body model can be adapted to whole human body data as if it had been dressed in a wetsuit. The adapted model is useful for various applications thanks to the region information.
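A heavily simplified sketch of one mass-and-spring deformation step is shown below (our own illustration; the actual fitting in [16], the feature curves and the interactive adjustment are considerably richer). Each model vertex is pulled toward its nearest scanned point while springs to its mesh neighbours keep the surface coherent.

import numpy as np

def fit_step(verts, neighbours, targets, k_data=0.5, k_spring=0.2, dt=0.1):
    """One explicit integration step of a mass-and-spring fit.
    verts: V x 3 model vertices; neighbours: list of neighbour index lists;
    targets: V x 3 nearest scanned points for each vertex (data attachment)."""
    forces = k_data * (targets - verts)                 # pull towards the scan
    for i, nbrs in enumerate(neighbours):
        for j in nbrs:                                  # springs keep the mesh smooth
            forces[i] += k_spring * (verts[j] - verts[i])
    return verts + dt * forces                          # damped explicit update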
5 Experimental Results

5.1 Calibration
A cylinder of known 3D shape (415 mm diameter and 1800 mm height) is utilized for calibration. It is placed in the center of the human body measurement system. This cylinder roughly matches the size of a standard human body and aids the improvement of the calibration precision.
Fig. 8. Calibration cylinder

Table 1. Measurement accuracy

Number of points          33
Average error [mm]        1.88
Standard deviation [mm]   0.79
A lattice pattern is drawn on the surface of the cylinder as shown in Fig. 8. The intersection of a row bar and a column bar is defined as a calibration point. We utilize 80 or more calibration points for the calibration. The measurement error is evaluated using two rangefinders. When the two rangefinders measure the same calibration point, the two measured 3D coordinates are denoted by p_i and p'_i. The measurement error is defined by

$$\text{Error} = \frac{1}{N} \sum_{i=1}^{N} \left\| p_i - p'_i \right\| \qquad (2)$$
where ||p_i − p'_i|| is the Euclidean distance between p_i and p'_i. Table 1 shows the average error and the standard deviation; 33 calibration points are used in this evaluation. This result is within 0.2 percent of the distance to the target.
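As a concrete illustration, the error of Eq. (2) and the statistics reported in Table 1 can be computed as in the short sketch below; the NumPy-based function and its name are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def measurement_error(p, p_prime):
    """Average Euclidean distance and standard deviation between
    corresponding 3D calibration points p_i and p'_i (Eq. 2).

    p, p_prime: (N, 3) arrays of matched points from the two rangefinders.
    """
    dists = np.linalg.norm(p - p_prime, axis=1)
    return dists.mean(), dists.std()
```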
5.2 Measurement and Modeling
We have measured a female mannequin (Fig. 9), a male (Fig. 10), and a clothed male (Fig. 11). The measurement data are displayed in front and back views. Whole human body data is successfully acquired, especially at the submental, axillary, and groin regions. Wrinkles in the fabric can also be captured. The head hair shape contains only a few 3D points due to the low specular reflectivity of brunet hair. A person inevitably moves slightly from second to second; because the proposed system completes the measurement so quickly, this motion has little effect.
Fig. 9. Measurement result of a female (front and back views)

Fig. 10. Measurement result of a male (front and back views)
Experimental results show that the occlusion problem is solved by the multiple rangefinders, although some dead space remains at the sides of the body because of its elongated vertical shape. The morphable models are then adapted to the measurement data of the female and the male. Fig. 12 and Fig. 13 show the modeling results of the female and the male, respectively. Because the proposed morphable models are closed surfaces, holes and absent parts are automatically covered. Using the region information (the positions of arms, elbows, knees, and so forth), the adapted model can be utilized for various applications. At the present stage, basic motions, such as walking or running, have been realized [17].
6 Conclusion
We have introduced the 3D shape reconstruction of a human body using multiple viewpoints. Whole human body data is obtained from 12 viewpoints, with only a small amount of occlusion. The proposed morphable model is then adapted to the whole human body data. The effect of body sway during capture is reduced thanks to the high-speed measurement in 2 seconds.
Fig. 11. Measurement result of a clothed male (front and back views)

Fig. 12. Modeling result of a female

Fig. 13. Modeling result of a male
The average error of 1.88 mm is within 0.2 percent of the distance to the target. Unlike other human measurement systems, our system is configured with three poles in a compact space. The assignment of the pole units is not fixed; each pole unit is independent and movable, so the configuration can be changed flexibly. Whenever the pole units are placed in a different installation location, the calibration must be executed again. Streamlining this calibration is a challenge that lies ahead. Increasing the number of pole units improves the accuracy and removes the remaining occlusion. Conversely, if the non-measured regions of the whole human body data can be complemented with traditional sculptured-surface or other techniques, a configuration with two pole units is sufficient. Our approach is thus a flexible strategy for every situation. Acknowledgments. This work is supported in part by a Grant-in-Aid for the Global Center of Excellence for High-Level Global Cooperation for Leading-
Edge Platform on Access Spaces from the Ministry of Education, Culture, Sport, Science, and Technology in Japan.
References

1. Cyberware: Whole body color 3D scanner, http://www.cyberware.com/
2. VITRONIC: Vitus 3D body scanner, http://www.vitronic.de/
3. Treleaven, P.: Sizing us up. IEEE Spectrum 41, 28–31 (2004)
4. Kanade, T., Saito, H., Vedula, S.: The 3D room: digitizing time-varying 3D events by synchronized multiple video streams. Tech. rep. CMU-RI-TR-98-34, Robotics Institute, Carnegie Mellon University (1998)
5. Saito, H., Baba, S., Kanade, T.: Appearance-based virtual view generation from multicamera videos captured in the 3-D room. IEEE Trans. Multimedia 5(3), 303–316 (2003)
6. Vedula, S., Baker, S., Kanade, T.: Image-based spatio-temporal modeling and view interpolation of dynamic events. ACM Trans. Graphics 24(2), 240–261 (2005)
7. SPACEVISION: Handy 3D camera Cartesia, http://www.space-vision.jp/
8. Hattori, K., Sato, Y.: Accurate rangefinder with laser pattern shifting. In: Proc. International Conference on Pattern Recognition, vol. 3, pp. 849–853 (1996)
9. Sato, Y., Otsuki, M.: Three-dimensional shape reconstruction by active rangefinder. In: IEEE Conf. Computer Vision and Pattern Recognition, pp. 142–147. IEEE Computer Society Press, Los Alamitos (1993)
10. Tsai, R.: A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal of Robotics and Automation 3(4), 323–344 (1987)
11. Koller, D., Turitzin, M., Levoy, M., Tarini, M., Croccia, G., Cignoni, P., Scopigno, R.: Protected interactive 3D graphics via remote rendering. ACM Trans. on Graphics 23(3), 695–703 (2004)
12. Koller, D., Levoy, M.: Protecting 3D graphics content. Communications of the ACM 48(6), 74–80 (2005)
13. Allen, B., Curless, B., Popović, Z.: The space of all body shapes: reconstruction and parameterization from range scans. ACM Trans. on Graphics 22(3), 587–594 (2003)
14. Allen, B., Curless, B., Popović, Z., Hertzmann, A.: Learning a correlated model of identity and pose-dependent body shape variation for real-time synthesis. In: Proc. of the 2006 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 147–156 (2006)
15. e frontier: Poser, http://www.e-frontier.com/
16. Miyazaki, S., Hasegawa, J., Yasuda, T., Yokoi, S.: A deformable object model for virtual manipulation based on maintaining local shapes. In: Proc. World Multi-Conference on Systemics, Cybernetics and Informatics, vol. 6, pp. 100–105 (2001)
17. Kameshima, H., Sato, Y.: Interactive adaptation for 3-D human body model to range data. In: Proc. SICE-ICASE International Joint Conference, pp. 3523–3526 (2006)
3D Posture Representation Using Meshless Parameterization with Cylindrical Virtual Boundary

Yunli Lee and Keechul Jung∗

School of Media, College of Information Technology, Soongsil University, Seoul, South Korea
{yunli, kcjung}@ssu.ac.kr
http://hci.ssu.ac.kr

Abstract. 3D data is becoming popular because it offers more detailed and accurate information for posture recognition. However, it leads to computational hurdles and is not suitable for real-time applications. Therefore, we introduce a dimension reduction method using meshless parameterization with a cylindrical virtual boundary for 3D posture representation. The meshless parameterization is based on a convex combination approach, which has good properties such as fast computation and a one-to-one mapping characteristic. This method depends on the number of boundary points; however, 3D posture reconstruction using silhouette extraction from multiple cameras yields a varying number of boundary points. Therefore, cylindrical virtual boundary points are introduced to overcome the inconsistency of the 3D reconstruction boundary points. The proposed method generates five slices of 2D parametric appearance to represent a 3D posture for recognition purposes.

Keywords: 3D voxel, dimension reduction, meshless parameterization, posture recognition, cylindrical virtual boundary.
1 Introduction

The latest advances in computer vision have attracted much attention, especially for 3D posture recognition applications. 3D data offers more detailed and accurate posture information compared to 2D posture data. However, it leads to computational hurdles and is not suitable for real-time posture recognition. Therefore, 2D posture recognition applications have attracted more researchers [10, 15, 16], mainly because of their simplicity and reasonable processing time. Still, 2D posture recognition is restricted to particular applications or methods of delivering the input pose. For example, sign-language recognition applications [8, 14] capture the 2D posture from a single camera. However, the 2D input cannot estimate some poses because of image projection and self-occlusion: the user might be facing away from the camera, hiding the pose, or some objects could block the camera's view of the user. Therefore, the input pose limits the space in which posture can be recognized, and it places an additional burden on the user of staying alert to this restriction.
∗ Corresponding author.
In order to make posture recognition applications more meaningful and resourceful, 3D posture recognition has become a challenging and active research topic in computer vision. The use of multiple cameras is introduced to remove the limitation on the model's position. There are various kinds of approaches for 3D posture data reconstruction, including well-known methods such as space carving, Shape-From-Silhouettes (SFS), visual-hull reconstruction, and voxel-based methods [9]. In this paper, we focus on a dimension reduction method for 3D posture representation. The key idea of the dimension reduction method is to overcome the computational hurdles of 3D posture recognition while preserving the 3D information. There are various approaches for reducing 3D data to 2D, such as principal component analysis (PCA), multidimensional scaling (MDS), and local linear embedding (LLE) [14, 17]. PCA constructs a low-dimensional representation of the data that describes as much of the variance in the data as possible, by finding a linear basis of reduced dimensionality for the data. The main drawback of PCA is that the size of the covariance matrix is proportional to the dimensionality of the data points. MDS represents a collection of nonlinear techniques that map the high-dimensional data representation to a low-dimensional representation while retaining the pairwise distances between the data points as much as possible. The quality of the mapping is expressed by the stress function, a measure of the error between the pairwise distances in the low-dimensional and high-dimensional representations of the data. LLE is a local nonlinear technique for dimension reduction which constructs a graph representation of the data points; it attempts to preserve local properties of the data manifold in the low-dimensional representation of the data points. However, these approaches are not able to preserve the posture information. Since 3D posture reconstruction using SFS yields voxels in point-cloud form, a 3D posture representation over a 2D domain using meshless parameterization is introduced. This method is known for parameterizing and triangulating a single patch of unorganized point sets. The convex combination approach in meshless parameterization has good properties, such as fast computation and a one-to-one mapping characteristic [1-4]. Therefore, we chose meshless parameterization instead of the other approaches for dimension reduction. However, the existing meshless parameterization makes it hard to analyze the posture because of a drawback of 3D voxel reconstruction: the boundary points are inconsistent, since the boundary shape is deformed when captured from multiple cameras. In this paper, a cylindrical virtual boundary is introduced to overcome this boundary-point drawback. The cylindrical virtual boundary provides a consistent boundary shape. The results of the meshless parameterization over 2D are studied and analyzed for matching purposes. An overview of our proposed approach is illustrated in Fig. 1. In this paper, the 3D posture data is reconstructed using the SFS method, where the silhouette images are extracted from four web cameras. We introduce meshless parameterization using a cylindrical virtual boundary and divide the 3D posture data into five segments for dimension reduction. This process overcomes the complexity of 3D posture data computation and makes the recognition process more accurate and robust. Related work on posture and gesture recognition is presented in Section 2.
Section 3 describes the details of posture modeling, in which the dimension reduction method converts the 3D posture data into five slices of 2D parametric appearance. The
dimension reduction process uses meshless parameterization with a cylindrical virtual boundary. Posture analysis for matching purposes is described in Section 4. The experimental results of posture recognition are elaborated in Section 5. Conclusions and future work are presented in Section 6.
(Fig. 1 blocks: Multiple Cameras – Silhouettes Extraction – 3D Voxel Reconstruction – Cylindrical Boundary – Posture Modeling: Meshless Parameterization – Posture Analysis)
Fig. 1. Overview of the proposed system: from 3D voxel reconstruction, through dimension reduction using meshless parameterization, to posture analysis for recognition purposes
2 Related Work

There are various aspects involved in posture or gesture recognition, such as modeling, analysis, and recognition, so recognizing posture is a complex task. In this section, we discuss the vision-based methods that have been proposed for posture or gesture recognition [5-16]. Generally, vision-based posture recognition methods can be divided into two categories: 3D model-based and 2D appearance-based modeling. A 3D model provides more detailed and precise posture data than 2D; however, this approach is complex and computationally expensive. In contrast, 2D appearance has low computational complexity and many applications have
adopted this approach. However, 2D appearance provides limited posture information due to self-occlusion and projection error. Morrison et al. [10] made an experimental comparison between trajectory-based and image-based representations for gesture recognition. The trajectory-based representation depends on a tracking system which provides the temporal features of the movement. The image-based recognition computes pixel histories from the image sequence and performs a matching algorithm, such as statistical matching. Both approaches have their strengths and weaknesses. Usually, a Hidden Markov Model (HMM) is used for recognizing gestures, where a 3D model is fitted to silhouette images or extracted data, or the raw data are analyzed; this makes the HMM process complex and computationally expensive. Chu and Cohen [9] introduced a method for posture identification using components called atoms. By modeling the atom transitions and observations, the state transitions and the HMM computational complexity are reduced. Shin et al. [6] proposed a 3D Motion History Model (MHM) for gesture recognition. Their method uses stereo input sequences that contain motion history information in 3D space and overcomes 2D motion limitations such as viewpoint and scalability. Ye et al. [5] presented a 3D gesture recognition scheme that combines 3D appearance and motion features and reduces the 3D features by unsupervised learning. The proposed method is a flexible and efficient way to capture the 3D visual cues in a local neighborhood around the object. Weinland et al. [11] introduced motion descriptors based on motion history volumes, with the advantage of fusing action cues from different viewpoints over short periods into a single three-dimensional representation. Teng et al. [14] proposed a real-time vision system to recognize hand gestures for sign language using a linear embedding approach; they identify the hand gesture from normalized hand images and use local linear embedding for feature extraction. In our proposed approach, we use 2D silhouette images to reconstruct the 3D voxels and apply dimension reduction on the 3D voxels using meshless parameterization with a cylindrical virtual boundary. The resulting five slices of the 2D parametric appearance model are used for posture analysis and recognition.
3 Posture Modeling

The posture modeling process is difficult and complex when represented directly as 3D voxels. Meshless parameterization is introduced to map the 3D point data into a 2D representation, and it adopts the good characteristics of the convex combination approach, such as fast computation and one-to-one mapping. However, this approach only works well for 3D voxels with a consistent boundary shape. In the process of 3D voxel reconstruction, deformation of the boundary shape occurs quite often, and since the meshless parameterization method depends on the boundary shape information, this causes a poor dimension reduction of the 3D voxels into the 2D representation. Section 3.1 briefly describes the basic idea of meshless parameterization, and Section 3.2 introduces the cylindrical virtual boundary in meshless parameterization to solve this drawback of the existing approach.
3.1 Basic Idea: Meshless Parameterization

Meshless parameterization is a 2D parametric representation with convex parameters in which the 3D voxels are mapped one-to-one onto a 2D domain without using mesh information [1-4]. The method consists of two basic steps. First, the boundary points PB are mapped onto the boundary of the planar domain D: the corresponding parameter points U = {u_{n+1}, u_{n+2}, …, u_N} are laid around the domain D in counterclockwise order, and chord-length parameterization is used for the distribution of the parameter points U. In the second step, the interior points are mapped into the domain D. Before mapping, a neighborhood is chosen for each interior point in PI, containing points that are in some sense close by; let Ni denote the set of neighborhood points of pi. In this case, a constant radius r is chosen, and the points that fall within the ball of radius r are considered the neighborhood points of each interior point. Then, the reciprocal distance weights method is used to compute the weights λij for each interior point pi. The parametric points ui for the interior points can be obtained by solving a linear system of n equations, where n is the number of interior points. Fig. 2 illustrates the 2D parametric representation of 3D voxel data using the existing meshless parameterization method. However, the existing method has two drawbacks: first, the initial starting point is different for each posture generated from the 3D voxels, and second, the boundary extracted from the silhouettes varies in shape. Both drawbacks cause difficulties for posture analysis and recognition. In order to solve these problems, a cylindrical virtual boundary is generated on the 3D voxels before performing meshless parameterization. The detailed approach of the cylindrical virtual boundary in meshless parameterization is presented in Section 3.2.
Fig. 2. The process of 2D parametric representation for 3D hand posture using the existing meshless parameterization
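To make the two steps above concrete, the following sketch maps the boundary points to a circle by chord-length parameterization and solves the convex-combination linear system with reciprocal-distance weights. It is an illustrative implementation under our own assumptions (a unit-circle domain, NumPy/SciPy routines, and a simple radius-based neighborhood search); it is not the authors' code.

```python
import numpy as np
from scipy.spatial import cKDTree

def meshless_parameterization(interior, boundary, r):
    """Map 3D points onto a 2D domain by convex-combination parameterization.

    interior: (n, 3) interior points P_I
    boundary: (m, 3) boundary points P_B, ordered along the boundary
    r:        neighborhood radius used for the interior points
    Returns (n, 2) and (m, 2) parameter points for interior and boundary.
    """
    n = len(interior)

    # Step 1: place boundary parameter points on the unit circle,
    # spaced by chord length between consecutive boundary points.
    chord = np.linalg.norm(np.diff(boundary, axis=0, append=boundary[:1]), axis=1)
    t = np.concatenate(([0.0], np.cumsum(chord)[:-1])) / chord.sum()
    u_boundary = np.stack([np.cos(2 * np.pi * t), np.sin(2 * np.pi * t)], axis=1)

    # Step 2: build the system u_i = sum_j lambda_ij u_j over the neighbors,
    # with reciprocal-distance weights lambda_ij, and solve it.
    pts = np.vstack([interior, boundary])
    tree = cKDTree(pts)
    A = np.eye(n)
    b = np.zeros((n, 2))
    for i in range(n):
        nbrs = [j for j in tree.query_ball_point(interior[i], r) if j != i]
        if not nbrs:
            continue
        w = 1.0 / (np.linalg.norm(pts[nbrs] - interior[i], axis=1) + 1e-12)
        w /= w.sum()
        for j, wj in zip(nbrs, w):
            if j < n:                      # neighbor is another interior point
                A[i, j] -= wj
            else:                          # neighbor is a boundary point
                b[i] += wj * u_boundary[j - n]
    u_interior = np.linalg.solve(A, b)
    return u_interior, u_boundary
```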
3.2 Cylindrical Virtual Boundary

The cylindrical virtual boundary is introduced to overcome the inconsistent shape of the 3D posture boundary. It is derived by computing the bounding area of the 3D voxels and taking the center of the voxel data as the center point of the cylinder. The cylinder radius is derived from the distance between the minimum and maximum
voxel points along the x-axis. The x-axis is chosen as the reference axis for the cylindrical virtual boundary in our system. The cylindrical virtual boundary does not apply to the whole 3D voxel data at once; instead, five cylindrical virtual boundaries are placed within the 3D voxels. This creates five segments, each consisting of some interior points as the interior point set and a cylindrical virtual boundary as the boundary point set. Thus, for each segment, the radius of the cylindrical virtual boundary depends on the voxel data size of that particular segment. In our experiments, we use an artificial hand model and real human postures, whose sizes are suitable for division into five segments. The meshless parameterization method is applied to each segment. The voxel data in each segment form the interior point set PI = {p1, p2, …, pn} with n points, and PB = {pn+1, pn+2, …, pN} is the boundary point set with N − n points, corresponding to the number of cylindrical virtual boundary points. The constant radius r of Section 3.1, used for finding the neighbors of each interior point, is set based on the radius of the cylindrical virtual boundary. Therefore, meshless parameterization with the cylindrical virtual boundary generates five slices of 2D parametric appearance representation. Fig. 3 shows the basic idea of the cylindrical virtual boundary for 3D voxels divided into five segments, where each segment's cylindrical virtual boundary acts as its corresponding boundary point set. The number of cylindrical virtual boundary points is equal for all five segments.
Fig. 3. The 3D voxel data of a human pose is divided into five segments, and each segment has a cylindrical virtual boundary. Each segment's cylindrical virtual boundary and 3D interior points are transformed onto the 2D parametric domain using meshless parameterization.
3.3 Meshless Parameterization with Cylindrical Virtual Boundary Algorithm

Meshless parameterization works well on a surface patch of the 3D posture data that is topologically an open disc. Our proposed approach for meshless parameterization with a cylindrical virtual boundary is described by the following algorithm:
1. Find the minimum and maximum voxel data of the 3D voxels
2. Compute the center point of the 3D voxels
3. Divide the 3D voxels into 5 segments based on the min–max of the z-axis
4. For each segment with n voxel points:
   i. Find the minimum and maximum of the voxel data
   ii. Compute the radius for the cylindrical virtual boundary
   iii. Generate the cylindrical virtual boundary with a constant distribution
   iv. Set the cylindrical virtual boundary as boundary points and the voxel points as interior points
   v. Map the cylindrical virtual boundary points into a 2D domain of 1 unit size
   vi. Set the constant radius r = radius of the cylindrical virtual boundary to find the neighbor points of each interior point, and use reciprocal distances to compute the weights
   vii. Solve the n linear equations
   viii. Map the interior parameter values onto the 2D domain
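The sketch below illustrates steps 1–4.iii of this algorithm: segmenting the voxels along the z-axis and generating uniformly distributed cylindrical virtual boundary points for one segment. The segment count, the number of boundary samples, the choice of placing the boundary ring at the segment's mid-height, and the NumPy-based helper functions are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def segment_voxels(voxels, n_segments=5):
    """Split (N, 3) voxel points into segments along the z-axis (step 3)."""
    z_min, z_max = voxels[:, 2].min(), voxels[:, 2].max()
    edges = np.linspace(z_min, z_max, n_segments + 1)
    idx = np.clip(np.digitize(voxels[:, 2], edges) - 1, 0, n_segments - 1)
    return [voxels[idx == s] for s in range(n_segments)]

def cylindrical_boundary(segment, n_boundary=64):
    """Generate uniformly spaced virtual boundary points around one segment.

    The cylinder is centered on the segment's centroid; its radius is taken
    from the x-extent of the segment (steps 4.i-4.iii).
    """
    center = segment.mean(axis=0)
    radius = 0.5 * (segment[:, 0].max() - segment[:, 0].min())
    theta = np.linspace(0.0, 2.0 * np.pi, n_boundary, endpoint=False)
    z = np.full(n_boundary, center[2])          # boundary ring at mid-height
    return np.stack([center[0] + radius * np.cos(theta),
                     center[1] + radius * np.sin(theta),
                     z], axis=1)
```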
4 3D Posture Representation

Meshless parameterization with the cylindrical virtual boundary generates five slices of 2D parametric appearance separately, which we call the multi-layer 2D appearance. This result is then used for analysis and matching. In posture recognition, template matching using 2D pixel points is a simple approach, obtained by dividing the 2D domain into eight regions. However, the multi-layer 2D appearance has a different orientation for each pose. Thus, the 2D appearance is divided into eight regions from the center point. Because the cylindrical virtual boundary is uniform for the five segments, this region division makes pixel-point matching possible. The whole matching process is based on the same clockwise orientation starting from the region with the highest number of pixel points.

4.1 Multi Layers of 2D Appearance

The multi-layer 2D appearance represents a 3D posture. It consists of five slices of 2D parametric appearance. Each slice is divided into 8 regions through the center of the 2D domain.
Fig. 4. (a) One segment slice of the 2D parametric appearance, divided into eight regions from the domain center point; (b) graph of the normalized distribution of the pixel points of each region in the 2D slice segment

We chose eight regions to obtain the best matching regions, because the Euclidean distances within the voxel data distribution and the cylindrical virtual boundary distribution are small. Another reason is fast processing, which makes real-time posture applications possible. The number of pixel points in each region is computed and represented as a graph, as shown in Fig. 4 for one segment.

4.2 Synchronization of Starting Region

The eight regions by themselves do not provide the posture orientation information needed for matching. Therefore, the distribution of the number of pixel points in each region is re-ordered to ease the matching. From the graph distribution, the region with the highest number of pixel points is taken as the starting region, and from this starting region the matching proceeds through the regions of the 2D parametric appearance in clockwise order. Fig. 5 shows the method of choosing the starting region based on the region with the highest number of pixels; the resulting clockwise-ordered distribution is shown in the corresponding graph. A sketch of this region counting and re-ordering is given below.
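The following sketch counts the parameter points that fall in each of the eight angular regions around the domain center and re-orders the counts starting from the largest region. The angular definition of the regions, the default domain center, the traversal direction, and the NumPy implementation are assumptions for illustration; the authors' region boundaries may differ.

```python
import numpy as np

def region_histogram(u_points, center=(0.0, 0.0), n_regions=8):
    """Count 2D parameter points per angular region around the domain center."""
    d = np.asarray(u_points) - np.asarray(center)
    angles = np.arctan2(d[:, 1], d[:, 0]) % (2 * np.pi)
    bins = (angles / (2 * np.pi / n_regions)).astype(int) % n_regions
    return np.bincount(bins, minlength=n_regions)

def reorder_from_max(hist):
    """Re-order the region counts starting from the largest region.

    Regions are assumed to be numbered by increasing angle, so stepping
    through them with a decreasing index corresponds to a clockwise order.
    """
    start = int(np.argmax(hist))
    idx = (start - np.arange(len(hist))) % len(hist)
    return hist[idx]
```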
Fig. 5. Segment 1 of the multi-layer 2D appearance: (a) original distribution, where the 2nd region has the highest number of pixel points and is re-ordered to become the starting (1st) region; (b) ordered distribution in clockwise sequence; (c) graph of the original pixel distribution; (d) graph of the re-ordered distribution based on the highest number of pixels
Table 1. Hand posture database and re-ordered distribution of the 2D graph. Each row (No. 1–4) shows a database pose (Pose DB) and its re-ordered distribution of the 2D graph.
5 Experimental Results

In order to validate the proposed method for posture recognition applications, experiments with artificial hand gestures were carried out. Table 1 shows part of the database for hand
pose and the re-ordered distribution of the 2D graph for each pose. The re-ordered distribution of the 2D graph shows the pixel-point distribution for each region of each segment, where a segment refers to a cylindrical virtual boundary and the 3D voxels of the corresponding segment division. For this hand posture experiment, a total of 10 poses are created in the database (see Fig. 6). Table 2 shows two examples of test hand poses used to recognize the test pose from the defined database. The matching results for the test hand poses are shown in Fig. 7 and Fig. 8. Fig. 7 shows the detailed process of matching the hand pose of Pose Test 1 against each segment of the poses in the database.

Table 2. Test hand pose and re-ordered distribution of the 2D graph. Each row (No. 1–2) shows a test pose (Pose Test) and its re-ordered distribution of the 2D graph.
Fig. 6. The 10 hand posture poses defined in the proposed system database
Fig. 7. Example of the matching process for Pose Test 1 against the poses in the database: the lowest error difference for Pose Test 1 is Pose 1 DB, with a total error of 1.983 over the five segments
Fig. 8. Example of the matching process for Pose Test 2 against the poses in the database: the lowest error difference for Pose Test 2 is Pose 4 DB, with a total error of 2.819 over the five segments
The error difference is computed against all 10 poses, and the pose with the lowest total error difference is taken as the match. The figure shows only four database poses together with the test pose. The experimental result shows that Pose Test 1 is matched with Pose 1 DB, with a total error difference of 1.983. Fig. 8 shows another experimental result, for Pose Test 2, with four poses from the database; the matched result is Pose 4 DB, with the lowest total error difference of 2.819. These experimental results show that matching 3D hand postures using the multi-layer 2D appearance is a reasonable and simple approach for posture recognition applications. A sketch of this matching step is given below.
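As an illustration of the matching step, the sketch below compares a test posture, represented by its five re-ordered region histograms, with every database posture and returns the one with the lowest total error. Normalizing each histogram and using the summed absolute difference as the per-segment error are our own assumptions; the paper does not specify the exact error function.

```python
import numpy as np

def segment_error(hist_a, hist_b):
    """Error between two re-ordered region histograms of one segment."""
    a = hist_a / max(hist_a.sum(), 1)      # normalized distributions
    b = hist_b / max(hist_b.sum(), 1)
    return float(np.abs(a - b).sum())

def match_posture(test_segments, database):
    """Return the database pose with the lowest total error over 5 segments.

    test_segments: list of 5 re-ordered histograms for the test pose
    database:      dict mapping pose name -> list of 5 re-ordered histograms
    """
    totals = {name: sum(segment_error(t, d)
                        for t, d in zip(test_segments, segments))
              for name, segments in database.items()}
    return min(totals, key=totals.get), totals
```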
6 Conclusions and Future Work

This paper presented a dimension reduction method using meshless parameterization with a cylindrical virtual boundary for 3D posture representation. The method models a 3D posture as a multi-layer 2D appearance representation. Meshless parameterization with the cylindrical virtual boundary overcomes the inconsistent boundary shape of the 3D posture, and it also makes it easy to identify the starting position on the 2D domain for matching. The experimental results show that the proposed system can recognize postures with a simple matching method at reasonable performance: the database posture whose 2D representation graph has the lowest total error difference is recognized as the candidate posture. Moreover, the system is simple to implement and recognizes 3D postures easily. As future work, we will continue to study and upgrade the system in order to recognize human hand postures and series of temporal 3D gesture data. We intend to make the algorithm extract the specific features of each pose automatically, and we also plan to evaluate the recognition performance using specific features for each posture based on the multi-layer 2D appearance. Acknowledgments. This work was supported by the Soongsil University Research Fund.
References 1. Lee, Y., Kyoung, D., Han, E., Jung, K.: Dimension Reduction in 3D Gesture Recognition Using Meshless Parameterization. In: Chang, L.-W., Lie, W.-N., Chiang, R. (eds.) PSIVT 2006. LNCS, vol. 4319, pp. 64–73. Springer, Heidelberg (2006) 2. Floater, M.S.: Meshless Parameterization and B-spline Surface Approximation. In: Cipolla, R., Martin, R. (eds.) The Mathematics of Surfaces IX, pp. 1–18. Springer, Heidelberg (2000) 3. Van Floater, M.S., Reimers, M.: Meshless Parameterization and Surface Reconstruction. Computer Aided Geometric Design, 77–92 (2001) 4. Floater, M.S., Hormann, K.: Surface Parameterization: a Tutorial and Survey. Advances in Multiresolution for Geometric Modelling, 157–186 (2004) 5. Ye, G., Corso, J.J., Hager, G.D.: Gesture Recognition Using 3D Appearance and Motion Features. In: Proceeding IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, IEEE Computer Society Press, Los Alamitos (2004)
6. Shin, H.-K., Lee, S.-W., Lee, S.-W.: Real-Time Gesture Recognition Using 3D Motion History Model. In: Huang, D.-S., Zhang, X.-P., Huang, G.-B. (eds.) ICIC 2005. LNCS, vol. 3644, pp. 888–898. Springer, Heidelberg (2005) 7. Malassiotis, S., Aifanti, N., Strintzis, M.G.: A Gesture Recognition System Using 3D Data. In: Proceedings of the First International Symposium on 3D Data Processing Visualization and Transmission, pp. 190–193 (2002) 8. Huang, T.S., Pavlovic, V.I.: Hand Gesture Modeling, Analysis, and Synthesis. Int. Workshop on Automatic Face-and Gesture-Recognition, Zurich, pp. 26–28 (1995) 9. Chu, C.-W., Cohen, I.: Posture and Gesture Recognition using 3D Body Shapes Decomposition. IEEE Workshop on Vision for Human-Computer Interaction (2005) 10. Morrison, K., McKenna, S.J.: An Experimental Comparison of Trajectory-Based and History-Based Representation for Gesture Recognition. In: Camurri, A., Volpe, G. (eds.) GW 2003. LNCS (LNAI), vol. 2915, pp. 152–163. Springer, Heidelberg (2004) 11. Weiland, D., Ronfard, R., Boyer, E.: Motion History Volumes for Free Viewpoint Action Recognition. IEEE International Workshop on modeling People and Human Interaction PHI 2005 (2005) 12. Sato, Y., Saito, M., Koike, H.: Real-time Input of 3D Pose and Gestures of a User’s Hand and Its Applications for HCI. In: Proceeding IEEE Virtual Reality Conference, pp. 79–86. IEEE Computer Society Press, Los Alamitos (2001) 13. Wu, Y., Huang, T.S.: Vision-Based Gesture Recognition: A Review. In: Braffort, A., Gibet, S., Teil, D., Gherbi, R., Richardson, J. (eds.) GW 1999. LNCS (LNAI), vol. 1739, Springer, Heidelberg (2000) 14. Teng, X., Wu, B., Yu, W., Liu, C.: A Hand Gesture Recognition System based on Local Linear Embedding. Journal of Visual Languages & Computing (2005) 15. Dong, Q., Wu, Y., Hu, Z.: Gesture Recognition Using Quadratic Curves. In: Narayanan, P.J., Nayar, S.K., Shum, H.-Y. (eds.) ACCV 2006. LNCS, vol. 3851, pp. 817–825. Springer, Heidelberg (2006) 16. Mori, G., Ren, X., Efros, A.A., Malik, J.: Recovering Human Body Configurations: Combining Segmentation and Recognition. In: CVRP 2004, Washington, DC, vol. 2, pp. 326– 333 (2004) 17. de Silva, V., Tenenbaum, J.B.: Global versus Local Methods in Nonlinear Dimensionality Reduction. Advances in Neural Information Processing Systems (2003)
Using the Orthographic Projection Model to Approximate the Perspective Projection Model for 3D Facial Reconstruction

Jin-Yi Wu and Jenn-Jier James Lien

Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan 70101, ROC
{curtis, jjlien}@csie.ncku.edu.tw
Abstract. This study develops a 3D facial reconstruction system, consisting of five modules, that uses the orthographic projection model to approximate the perspective projection model. The first module identifies a number of feature points on the face and tracks these feature points over a sequence of facial images with the optical flow technique. The second module applies the factorization method to the orthographic model to reconstruct a 3D human face. The facial images are acquired using a pinhole camera, which is based on a perspective projection model, whereas the face is reconstructed using an orthographic projection model. To compensate for the difference between these two models, the third module implements a simple and efficient method for approximating the perspective projection model. The fourth module overcomes the missing point problem, which commonly arises in 3D reconstruction applications. Finally, the fifth module implements a smoothing process for the 3D surface by interpolating additional vertices.

Keywords: 3D reconstruction, factorization, orthographic projection, and perspective projection.
1 Introduction

The goal of 3D facial reconstruction, which has been studied for decades, is to reconstruct a 3D face model from either a single image or a set of images taken from known or unknown camera viewpoints. Lee et al. [9] developed a technique for constructing 3D facial models using laser scanners, and their method achieves very accurate reconstruction results. However, the method is time-consuming and the equipment is expensive, so its applicability in the public domain is limited. Various researchers have presented different approaches using single or multiple cameras for 3D reconstruction. We organize those approaches into several categories. One category applies the Bundle Adjustment (BA) approach, such as [5], [11], [12], [14], [15], [22], and [27]. These methods model the 3D reconstruction problem as a minimization problem between the 2D ground-truth feature point locations and a 2D location estimating function, which consists of a 3D-to-2D projection function (the intrinsic and extrinsic parameters of the camera motion) and the 3D shape of the object. By using the Levenberg-Marquardt (LM) algorithm, which takes advantage of both
the Gauss-Newton algorithm and the steepest descent algorithm, they can solve for the parameters of the 3D camera motion and the shape of the 3D object simultaneously. However, applying the LM algorithm to this problem requires computing the inverse of a Hessian matrix whose size is dominated by the number of estimated parameters, so LM takes a long time to converge. In addition, the BA approach itself is a large sparse geometric parameter estimation problem. There are many sparse optimization methods to accelerate the BA algorithm, but it still needs a long time to solve for the parameters, especially for very long image sequences. Therefore, [11], [12], and [27] use local optimization to accelerate the solution. Another category is the shape from shading (SfS) approach [17], [28], which can reconstruct the 3D shape as long as the scene or object satisfies the Lambertian reflection model. However, not every scene or object satisfies this constraint. Therefore, a number of studies [8], [16], [20] turn to Bidirectional Reflectance Distribution Functions (BRDFs), which model reflection more generally, to reconstruct the 3D structure. The other category is the shape from motion (SfM) approach, which can be solved by using the factorization approach [13], [18], [21]. By introducing a rank constraint and factorizing by singular value decomposition (SVD), the factorization approach with the orthographic projection model [18] or the para-perspective projection model [13] can factorize the location matrix of the 2D feature points in the image plane into the 3D rotation matrix of each 2D image frame and the 3D shape matrix of the object. Moreover, the work in [21] generalizes [18] and [13] to recover both the 3D rotation information and the 3D shape of the object under the perspective projection model. All of the above-mentioned methods reconstruct static or rigid objects. Recently, some research has focused on the 3D reconstruction of non-rigid objects over an image sequence [1], [2], [3], [19], [25], [26]. These methods model the 3D shapes as a linearly weighted combination of a set of shape vectors, so that they can represent the different shapes corresponding to different images by assigning different weights.
Fig. 1. Workflow of the proposed 3D facial reconstruction system: (A) compiling successive facial images and finding corresponding points; (B) factorization process based on the orthographic projection model; (C) approximating the perspective projection model; (D) smoothing the 3D facial surface; (E) solving the missing point problem
In view of the existing works above, our goal is to reconstruct a human face with a common PC camera and a regular PC; it is easy for a person to perform a pan rotation without any non-rigid motion, so we choose a method suited to reconstructing a rigid 3D object. Among the methods mentioned above, the factorization method is relatively simple and can produce a good reconstruction result. Thus, the current study develops a straightforward and efficient 3D reconstruction approach in which the perspective projection model is approximated by applying the factorization method to the orthographic projection model. A solution is also presented for the missing point problem that arises when the face moves through large pan-rotation angles. Finally, a smoothing method is presented that interpolates additional 3D vertices in order to give the surface of the reconstructed 3D face a smoother and more realistic appearance.
2 System Description

Fig. 1 shows the five major modules of the proposed 3D facial reconstruction system. We discuss each module in the following sections; the difference between the results of modules C and D is shown more clearly in Section 3.

2.1 1st Module: Compiling Successive Facial Images and Finding Corresponding Points

A conventional PC camera is fixed at a stationary position in front of the subject and is used to capture a sequence of facial images as the subject gradually turns his or her head from left to right or right to left in the pan-rotation direction. In the first frame of the facial image sequence, N facial feature and contour points pm(u, v) are automatically located using the method proposed in [23]. The optical flow technique [10] is then used to track the corresponding points in the remaining image frames. However, tracking errors may occur for some of the feature points located in textureless regions of the face; therefore, a manual correction procedure is applied to remedy those particular feature points. Subsequently, Delaunay triangulation [4], [6] is applied to construct the 2D triangles from the feature and contour points.

2.2 2nd Module: Factorization Process Based on the Orthographic Projection Model

In the proposed system, the 3D face is reconstructed using the factorization approach based on the orthographic projection model [18], which is a simple model compared with the perspective projection model [13], [21]. Based on the locations of the N corresponding 2D feature points over F facial image frames, a 2F × N point matrix W is created. By applying the factorization method [18], the point matrix W becomes:
$$W_{2F \times N} \;\xrightarrow{\text{factorization}}\; R_{2F \times 3} \times S_{3 \times N} \qquad (1)$$
where R is the 3D rotation matrix consisting of the x- and y-axis components and S is the 3D shape matrix. The 2D triangles are then used to construct corresponding 3D polygons, where the position of each vertex sm is defined as (xm, ym, zm).

2.3 3rd Module: Approximating to the Perspective Projection Model

The reconstruction method used in this study is based on the orthographic projection model. However, the pinhole camera used to acquire the facial images is based on a perspective model. Therefore, if the orthographic model is to provide an accurate approximation of the perspective model, the ratio of the object depth, Δd, to the distance between the camera and the object, d, should be very small, i.e., Δd ≪ d. To satisfy this condition, the feature points are divided into n groups gi according to their depths, so that the Δd/d ratio within each group is small, each group consisting of a set
of corresponding 2D projection points pi. Registering two 3D coordinate systems requires the presence of at least 4 common points in the two coordinate systems in order to solve for the 12-parameter rotation and translation transformation matrix. Applying the factorization approach to each set pi, we obtain pi → ri' gi' (by factorization), for i = 1 to n. Since each group gi' has its own 3D coordinate system, it is necessary to register all of the groups into the same 3D coordinate system. To achieve this, the coordinate system of g(i+1)' is aligned with that of gi'. This procedure, called the group registration procedure, registers the different coordinate systems of all the groups gi' into the same world coordinate system to form a complete reconstructed 3D model. The basic steps of the group registration procedure can be summarized as follows: using the overlapping regions o_i(i+1) in gi' and o_(i+1)i in g(i+1)', the 12-parameter transformation matrix τ_(i+1) between o_i(i+1) and o_(i+1)i can be found from o_i(i+1) = τ_(i+1) o_(i+1)i, where τ = [R_3×3 | T_3×1] and R, T are the rotation matrix and the translation vector. Based on the transformation matrix τ, the new coordinate g(i+1)' = τ_(i+1) g(i+1)' has
the same coordinate system as gi'. By finding the transformation matrices τ between all n−1 successive group pairs, we can transform the coordinates of all groups into the same coordinate system. The registered 3D facial model (or facial shape) S' is then described by S' = g1' ∪ g2' ∪ … ∪ gn'. Therefore, the depth zm of each vertex of the 3D object in the world coordinate system is more accurate. After improving the depth value zm of each vertex, the accuracies of the x and y values of each vertex also need to be improved. Analyzing the reconstructed 3D facial model S', it is found that the x-axis and y-axis components of the 3D coordinate (x'm, y'm, zm) of every vertex s'm in S' are virtually the same as the corresponding 2D projection point (um, vm) in the image coordinate system (Fig. 2.a).
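For concreteness, the following sketch shows the rank-3 factorization of Eq. (1) via SVD, which is used in Module 2 and again for the point matrix of each group above. It follows the general SVD scheme but omits the metric (orthonormality) upgrade of the rotation rows, so the returned R and S are only determined up to an affine transformation; the NumPy implementation and the function name are our own assumptions, not the authors' code.

```python
import numpy as np

def factorize_orthographic(W):
    """Rank-3 factorization of a 2F x N measurement matrix W into
    R (2F x 3) and S (3 x N), as in Eq. (1).

    The image coordinates are first centered per row (per frame axis),
    which removes the translation under the orthographic model.
    """
    W_centered = W - W.mean(axis=1, keepdims=True)
    U, sigma, Vt = np.linalg.svd(W_centered, full_matrices=False)
    # Keep the three dominant singular values (rank-3 constraint).
    sqrt_sig = np.sqrt(sigma[:3])
    R = U[:, :3] * sqrt_sig          # 2F x 3 "rotation" rows
    S = sqrt_sig[:, None] * Vt[:3]   # 3 x N shape matrix
    return R, S
```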
Fig. 2. (a) Point pm in the image plane represents the 2D projection point of both the 3D reconstructed point s'm in the orthographic model and the 3D object point Om in the perspective model. (b) The difference Δy between the reconstruction results in the y-axis direction obtained from the orthographic model and the perspective model, respectively.
This is understandable because the factorization approach uses the orthographic projection model. Under the perspective model, however, the situation x'm = um and y'm = vm occurs only when the 3D object and the 2D image plane are both located at a distance of twice the focal length f, which can be known in advance, from the origin of the camera coordinate system. Therefore, approximating the perspective projection model using the orthographic projection model requires the assumption that the x-y plane of the 3D object's coordinate system overlaps the 2D u-v image plane at a distance of 2f from the origin of the camera coordinate system, as shown in Fig. 2.b. It can then be seen that the reconstructed vertex s'm and the real vertex sm have an error Δy along the y-axis, where:
$$\Delta y = \frac{y'_m z_m}{2f} \qquad (2)$$

Therefore, the improved value ym for vertex sm is given by:

$$y_m = y'_m + \Delta y = y'_m + \frac{y'_m z_m}{2f} \qquad (3)$$

By the same logic, the improved value xm for vertex sm is expressed as:

$$x_m = x'_m + \Delta x = x'_m + \frac{x'_m z_m}{2f} \qquad (4)$$
Applying (3) and (4), an improved reconstruction result for the xm and ym components of the vertex sm in S can be obtained. Note that hereafter, the result of this “improved reconstruction process” is referred to as “the improved result.”
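A minimal sketch of the correction in Eqs. (2)-(4) is shown below; it assumes the registered model S' is stored as an (M, 3) NumPy array of (x'_m, y'_m, z_m) vertices whose x-y plane already overlaps the image plane at distance 2f, as described above. The function name is our own.

```python
import numpy as np

def perspective_correction(vertices, f):
    """Apply Eqs. (3) and (4): scale x' and y' by (1 + z / (2f)).

    vertices: (M, 3) array of registered vertices (x'_m, y'_m, z_m)
    f:        camera focal length
    Returns the improved (M, 3) vertices (x_m, y_m, z_m).
    """
    x, y, z = vertices[:, 0], vertices[:, 1], vertices[:, 2]
    scale = 1.0 + z / (2.0 * f)
    return np.stack([x * scale, y * scale, z], axis=1)
```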
Using the Orthographic Projection Model
NB
B
PB B NPB NPC
NPB
NPC
PC C
NC
PB D M B
dCL dDL
dBL
C (2)
L
PC C (1)
2
1
(1)
PD
467
L
(3)
(a)
(b)
(2)
Fig. 3. (a) Application of smoothing method. (b) Smoothing result
2.4 4th Module: Smoothing the 3D Facial Surface
The higher the density of the 3D feature points in the facial image, the smoother the appearance of the reconstructed 3D facial shape. However, a high feature point density may cause a tracking problem, in that some points may be erroneously matched with one of their neighbors. Therefore, in the proposed 3D facial reconstruction system, a small number of 2D feature points are used initially to create a draft 3D facial model containing only a limited number of polygon vertices. A smoothing method is then applied in which additional vertices (D, E, F, and G in Fig. 3.b.2) are interpolated to create a mesh of finer polygons.
For each vertex v, the normal direction $\vec{N}_v$ can be found by calculating the mean of the normal directions of those meshes which contain v as one vertex. As shown in Fig. 3.b.1 and 3.a.1, on edge $\overline{BC}$, vertices B and C have normal directions $\vec{N}_B$ and $\vec{N}_C$, and the vectors $\vec{BC}$ and $\vec{CB}$ can be calculated. Therefore, a vector $\vec{N}_{P_B}$, which lies on the plane formed by $\vec{N}_B$ and $\vec{BC}$, can be found by solving the following equations:

$$\begin{cases} \vec{N}_{P_B} \cdot \vec{N}_B = 0 \\ \vec{N}_{P_B} \cdot (\vec{N}_B \times \vec{BC}) = 0 \end{cases} \qquad (5)$$

Similarly, the vector $\vec{N}_{P_C}$ can be found in the same way. Having found $\vec{N}_{P_B}$ and $\vec{N}_{P_C}$, the two corresponding planes, $P_B$ and $P_C$, can be determined. As shown in Fig. 3.a.2, $P_B$ and $P_C$ are planes, which intersect at line L, with normal directions $\vec{N}_{P_B}$ and $\vec{N}_{P_C}$, passing through B and C, respectively. An arbitrary point M on edge $\overline{BC}$ (see Fig. 3.a.3) and line L form a plane $P_D$ with a normal direction $\vec{N}_{P_D}$. The vector $\vec{N}_M$, the normal direction of point M lying on $P_D$, can be found by solving:

$$\begin{cases} \vec{N}_M = \alpha \vec{N}_B + (1-\alpha)\vec{N}_C \\ \vec{N}_M \cdot \vec{N}_{P_D} = 0 \end{cases} \qquad (6)$$

where $\alpha$ is a scaling factor used to control the weights of $\vec{N}_B$ and $\vec{N}_C$. Finally, the interpolated vertex D can be determined from:

$$\begin{cases} \vec{MD} = \beta \vec{N}_M \\ d(D,L) = \dfrac{\theta_1}{\theta_1+\theta_2}\, d(B,L) + \dfrac{\theta_2}{\theta_1+\theta_2}\, d(C,L) \end{cases} \qquad (7)$$
where β is a scaling factor and d(X, L) is the distance from X to L, where X is the vertex B, C, or D, respectively. Similarly, the interpolated vertices E and F can be obtained from (5) to (7) for the edges $\overline{CA}$ and $\overline{AB}$, respectively, as shown in Fig. 3.b.2. In this work, the point M in Fig. 3.a.3 is the mid-point used for the interpolated vertices D, E, and F. Similarly, to locate vertex G in Fig. 3.b.2, the lines AD, BE, and CF are employed to find the interpolated vertices GAD, GBE, and GCF, respectively, from (5) to (7). Because G should be located at the centroid of the triangle ABC, the position of M for GAD, GBE, and GCF is chosen such that d(X, M) : d(M, Y) = 2 : 1, where (X, Y) is (A, D), (B, E), or (C, F). Vertex G is then given by G = (GAD + GBE + GCF) / 3. After the smoothing process, the original polygon ABC is transformed into six individual sub-polygons in 3D space, and hence the 3D surface has a smoother appearance. As a result, when all the polygons in the original 3D model S have been processed with the smoothing method, the reconstructed facial surface is far smoother and more lifelike than the original reconstructed model. A simplified sketch of this subdivision is given below; the reconstruction errors are summarized in Table 1.
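The sketch below illustrates only the 1-to-6 subdivision structure of a triangle, using plain midpoint and centroid interpolation of the positions. It is a simplified stand-in for the full construction of Eqs. (5)-(7), since it does not displace the new vertices along the interpolated normals; it is therefore our own illustrative variant rather than the authors' smoothing method.

```python
import numpy as np

def subdivide_triangle(A, B, C):
    """Split triangle ABC into six sub-triangles using the edge mid-points
    D (on BC), E (on CA), F (on AB) and the centroid G.

    A, B, C: (3,) vertex positions. Returns (vertices, triangles), where
    triangles index into [A, B, C, D, E, F, G].
    """
    A, B, C = map(np.asarray, (A, B, C))
    D = 0.5 * (B + C)          # mid-point of edge BC
    E = 0.5 * (C + A)          # mid-point of edge CA
    F = 0.5 * (A + B)          # mid-point of edge AB
    G = (A + B + C) / 3.0      # centroid
    vertices = np.stack([A, B, C, D, E, F, G])
    triangles = np.array([[0, 5, 6], [5, 1, 6], [1, 3, 6],
                          [3, 2, 6], [2, 4, 6], [4, 0, 6]])
    return vertices, triangles
```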
Table 1. Reconstruction error (%) of each axis using the factorization method with the orthographic model and the improved method (separated by "/")

Angles    -1° to +1°    -5° to +5°    -10° to +10°    -15° to +15°
X-axis    2.60 / 0.74   2.57 / 0.70   2.54 / 0.63     2.47 / 0.57
Y-axis    3.49 / 1.15   3.49 / 1.13   3.47 / 0.93     3.46 / 0.92
Z-axis    7.49 / 6.19   7.47 / 6.17   7.40 / 6.14     7.29 / 6.05

Fig. 4. Average errors for different Δd/d ratios: (a) 1/3, (b) 1/5, and (c) 1/7 (error (%) versus number of frames, comparing the factorization result and the improved result). (d) Average improvement (%) of the improved method for different Δd/d ratios:

Ratio (Δd/d)      1/3    1/4    1/5    1/6    1/7
Improvement (%)   43.3   41.5   40.0   38.6   37.8
2.5 5th Module: Solving the Missing Point Problem
When the pan-rotation angle of the face is large, some facial feature points are occluded, i.e., some of the feature points disappear. This presents problems when applying the factorization method, because some of the elements of the 2D position matrix W will be missing. From observation, it is found that when the face turns to the left or right through an angle of approximately 10° to 15°, the feature points adjacent to the nose are occluded by the nose. It is also observed that the ears become clearly visible when the head is rotated to the left or right by more than 30°. Therefore, the captured video sequence is segmented into three parts, namely the left side-view V1 (from -45° to -30°), the frontal view V2 (from -10° to 10°), and the right side-view V3 (from 30° to 45°), according to [7] and [24]. The improved reconstruction process is then used to reconstruct these three parts, V1, V2, and V3, individually. The side views, V1 and V3, are then registered to the frontal view using the same registration procedure as that described in the previous section.
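A small sketch of this view segmentation is shown below; it assumes a per-frame pan-angle estimate (e.g., from the head-orientation methods of [7], [24]) is already available and simply bins each frame into V1, V2, or V3 using the angle ranges quoted above. Discarding frames outside those ranges is our own assumption.

```python
def segment_by_pan_angle(pan_angles):
    """Assign frame indices to the three views used for reconstruction.

    pan_angles: iterable of per-frame pan angles in degrees
                (negative = turned left, positive = turned right).
    Returns a dict with frame indices for V1, V2, and V3.
    """
    views = {"V1": [], "V2": [], "V3": []}
    for idx, angle in enumerate(pan_angles):
        if -45.0 <= angle <= -30.0:      # left side-view
            views["V1"].append(idx)
        elif -10.0 <= angle <= 10.0:     # frontal view
            views["V2"].append(idx)
        elif 30.0 <= angle <= 45.0:      # right side-view
            views["V3"].append(idx)
    return views
```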
3 Experimental Results

To evaluate the performance of the improved reconstruction method (excluding the smoothing process), 10 different known 3D head models were used to evaluate the reconstruction results obtained for different values of the Δd/d ratio. Fig. 4 shows the estimation errors. Figs. 4.a, 4.b, and 4.c present the errors obtained for different numbers of frames in the reconstruction process for Δd/d ratios of 1/3, 1/5, and 1/7. It can be seen that as the Δd/d ratio decreases, i.e., as the perspective projection model more closely approximates the orthographic projection model, the estimation errors of both methods decrease. Furthermore, the improved method provides a better reconstruction performance than the original factorization method with the orthographic model. Fig. 4.d shows that the improvement obtained by the improved method decreases as the ratio decreases. This result is to be expected, since the orthographic model used to create the reconstruction more closely approximates the perspective projection model at lower values of the Δd/d ratio. Table 1 shows the reconstruction errors using a Δd/d ratio of 1/5. Comparing the errors of the factorization method with the orthographic model and of the improved method, the reconstruction results of the latter are 78%, 70%, and 17% better in the x-axis, y-axis, and z-axis directions, respectively. Therefore, the improved method yields an effective improvement in the reconstruction results in the x-axis and y-axis directions. However, it fails to provide an obvious improvement in the z-axis direction. The reason for this is that the Δd/d ratio used in the layered reconstruction approach is insufficiently small. Nonetheless, if an adequate number of feature points are assigned such that the Δd/d ratio can be reduced, it is reasonable to expect that the improved reconstruction method will provide a more accurate reconstruction result. The robustness of the proposed reconstruction system to tracking errors was evaluated by adding white Gaussian noise with various standard deviations to the tracking results, i.e., the 2D feature point positions, to simulate tracking errors. The reconstruction results obtained using the improved method are shown in Figs. 5.b, 5.c, and 5.d for white Gaussian noise with standard deviations of 1, 3, and 5, respectively.
Fig. 5. (a) Reconstruction result obtained using the improved method in the absence of noise. Note that the Δd/d ratio is 1/5. (b), (c), and (d) show the reconstruction results obtained when white Gaussian noise is added with a standard deviation of 1, 3, and 5, respectively. (e), (f), and (g) show the reconstruction errors of the 3D facial model after applying the factorization method with the orthographic model and the improved method, when white Gaussian noise is added to the tracking results with a standard deviation of 1, 3, and 5, respectively.
The estimation errors obtained from the factorization method with the orthographic model and from the improved method are shown in Figs. 5.e, 5.f, and 5.g for Gaussian noise with standard deviations of 1, 3, and 5, respectively. Figs. 5.e, 5.f, and 5.g show that the reconstruction performance of both methods degrades as the standard deviation of the Gaussian noise added to the tracking results increases. In addition, from Figs. 5.b, 5.c, and 5.d it can be seen that the reconstruction errors in the z-axis direction (i.e., the depth direction) increase more significantly than the errors in the x- and y-axis directions as the standard deviation of the Gaussian noise added to the tracking results is increased. In other words, the reconstructed 3D face becomes flatter as the tracking errors increase. Comparing the errors of the 3D reconstruction results shown in Figs. 5.e, 5.f, and 5.g with those shown in Fig. 4.b for the same Δd/d ratio of 1/5, it is observed that the noise increases the reconstruction error considerably. In other words, as the noise increases, the tracking accuracy of the feature points decreases, and the 3D reconstruction performance deteriorates accordingly. However, unlike the noise-free estimation errors shown in Fig. 4.b, when noise is taken into consideration the error reduces markedly as the number of frames considered in the reconstruction process increases. This is because more feature point information is included in the point matrix W, and hence the overdetermined optimization process of the factorization approach has an improved ability to reduce the effects of noise interference. Nonetheless, the reconstruction error is still greater than that obtained when the tracking results are free of noise. Fig. 6.a compares the improved results with the smoothed results. It is clearly seen that the boundaries of the facial features, such as the forehead, nose, mouth, and philtrum, are much smoother after the smoothing process than before. Hence, the effectiveness of the proposed smoothing method is confirmed.
Fig. 6. (a) Upper row: the improved results; lower row: the smoothed results. (b) Upper: the result obtained with a well-chosen registration region (a region with small depth change); lower: the result obtained with a poorly chosen registration region (a region with larger depth change, such as the cheek region).
Fig. 7. 3D facial reconstruction results obtained from three image sequences by using our improved method and smoothing process.
However, in creating the complete model, an appropriate choice of the overlapping regions is essential. Specifically, the chosen regions should have a small depth change, such as the face near the nose, to ensure that the group registration procedure provides the optimum reconstruction results, as shown in Fig. 6.b. Fig. 7 shows some reconstruction results obtained from real data.
4 Conclusions

This study has developed a straightforward and efficient system for reconstructing 3D faces by using the factorization method and the orthographic projection model to provide a simple approximation to the perspective projection model. The experimental results have shown that the proposed method provides a promising technique for
reconstructing 3D faces. However, some manual refinements are required to compensate for the tracking errors. The missing point problem occurs frequently in 3D reconstruction applications. Accordingly, this study has proposed a solution in which the facial image is divided into three discrete parts in accordance with the rotation angle of the head. This study has also developed a smoothing method based on the linear interpolation of additional 3D vertices to improve the smoothness of the reconstructed facial surface. However, linear interpolation fails to provide an optimum result for regions of the face with large curvatures, e.g. the surface of the nose. In future work, the authors intend to explore the feasibility of using various non-linear curvature functions, e.g. natural splines or B-splines, to improve the smoothing results.
Multi-target Tracking with Poisson Processes Observations

Sergio Hernandez¹ and Paul Teal²

¹ Victoria University of Wellington, School of Mathematics, Statistics and Computer Science, Wellington, New Zealand
[email protected]
² Victoria University of Wellington, School of Chemical and Physical Sciences, Wellington, New Zealand
[email protected]
Abstract. This paper considers the problem of Bayesian inference in dynamical models with time-varying dimension. These models have been studied in the context of multiple target tracking problems and for estimating the number of components in mixture models. Traditional solutions for the single target tracking problem become infeasible when the number of targets grows. Furthermore, when the number of targets is unknown and the number of observations is influenced by misdetections and clutter, the problem becomes even more complex. In this paper, we consider a marked Poisson process for modeling the time-varying dimension problem. Another solution which has been proposed for this problem is the Probability Hypothesis Density (PHD) filter, which uses a random set formalism for representing the time-varying nature of the state and observation vectors. An important feature of the PHD and the proposed method is the ability to perform sensor data fusion by integrating the information from the multiple observations without an explicit data association step. However, the method proposed here differs from the PHD filter in that it uses a Poisson point process formalism with discretized spatial intensity. The method can be implemented with techniques similar to the standard particle filter, but without the need for specifying birth and death probabilities for each target in the update and filtering equations. We show an example based on ultrasound acoustics, where the method is able to represent the physical characteristics of the problem domain.

Keywords: Bayesian inference, marked Poisson process, multi-target tracking, sequential Monte Carlo methods, particle filters.
1 Introduction
Tracking multiple objects with multiple sensors is a problem where the dimensionality of the posterior distribution is evolving in time. Such problems are called
“transdimensional” [1]. Furthermore, the procedure of making inference about the current state of the system involves the estimation of non-linear and non-Gaussian functions. State space models provide a sound framework for building a probabilistic model from a sequence of observations contaminated with noise. However, not all the solutions for fixed dimensionality are directly applicable to variable dimension models. Transdimensional problems are common in inverse problems, signal processing and statistical model selection. In this paper, a method for handling such problems is provided, and multiple target tracking is considered as an example of such a problem. The method is then applied to parameter estimation for acoustic sources.

Sequential Monte Carlo (SMC) methods have been proposed [2,3] as a framework for dealing with non-linear and non-Gaussian target tracking problems. These methods use the Bayesian approach for simulating several hypotheses taken from an easy-to-sample proposal distribution and weighting them using their likelihood with the observed data. The most standard implementation of an SMC method, called the particle filter, has demonstrated good performance in several domains as well as having theoretical convergence properties [4].

In the multiple target tracking problem, the number of targets may be time-varying and unknown. The number of measurements or observations may also be time-varying. The traditional approaches for performing measurement-target data association include multiple hypothesis tracking (MHT) and joint probabilistic data association (JPDA), which separate the problem into different sub-levels. These approaches use thresholding heuristics to associate each measurement to an existing target, create a new target, merge measurements or mark them as false alarms [5]. Particle filtering has been successfully used for multiple target tracking [6], where the data association problem is formulated in a non-explicit way using a joint multiple-target probability density which comprises the estimation of the state of several targets, the number of the targets and the likelihood of the observations with each of them.

More recently, the Probability Hypothesis Density (PHD) filter has been proposed to deal with the problem of the unknown number of targets, using random sets theory [7]. The PHD filter propagates the first moment of the multi-target posterior distribution, which (for known dimensionality) results in an analytically tractable Bayesian solution for the multi-target posterior. If the multi-target likelihood can be approximated as a Poisson process, then the integral of the PHD in any region of the space gives the expected number of targets in that region [8].

In this paper, we work further on the spatial Poisson process model presented in [9,10] for tracking extended targets. The model is able to perform seamlessly in the multi-target tracking problem as well as solving the data association problem. We propose an SMC approximation to the conditional spatial intensity of the point process. The approximation uses a discretization of the spatial intensity and a modification of the standard particle filter to take into account spontaneous target births. The model is tested with unthresholded physical measurements, where the number of measurements is dependent on the unknown number of
targets. An example in the bearings-only tracking problem is provided, as well as an example in range estimation from ultrasound acoustics. The example shows that the model is well suited to characterize physical features from acoustic measurements in uncertain environments. The paper is structured as follows. In Section 2 we outline the PHD filter. In Section 3 the spatial Poisson process model is presented, and then in Section 4 we present a new sequential Monte Carlo method for an unknown number of targets. Section 5 provides examples that show the feasibility of the model in two different problems. Section 6 presents the conclusions.
2 PHD Filter
First-order moment of a multi-target posterior distribution: The PHD filter represents the first moment of the intensity function λ of the multi-target posterior distribution. For any measurable subset S of the space of possible random sets, the regions of high intensity indicate the actual locations of the targets, and the total mass of the same function can provide an estimate of the number of targets N:

E[N(S)] = \int_S \lambda(x)\, dx    (1)

The PHD filter operates using the following procedure:

1. Filtering:

\lambda_t(x_{n,t}) = \int_S f(x_{n,t} \mid x_{n,t-1})\, P_s(x_{n,t-1})\, \hat{\lambda}_{t-1}(x_{n,t-1})\, dx_{n,t-1} + \gamma(x_{n,t})    (2)

where f(x_{n,t} | x_{n,t-1}) represents the probability density of motion of the nth single target from state x_{n,t-1} to state x_{n,t} at time t, P_s(x_{n,t}) is the probability of survival for the nth target at time t, and γ() is the birth probability.

2. Update:

\hat{\lambda}_t(x_{n,t}) = \Bigl[\, 1 - P_d(x_{n,t}) + \sum_{m=1}^{M_t} \frac{\psi_{m,t}(x_{n,t})}{\langle \psi_{m,t}, \lambda_t \rangle + \kappa(z_{m,t})} \,\Bigr] \lambda_t(x_{n,t})    (3)

where

\psi_{m,t}(x) = P_d(x)\, L(z_{m,t} \mid x)    (4)

\langle \psi_{m,t}, \lambda \rangle = \int_S \psi_{m,t}(x)\, \lambda(x)\, dx    (5)

and where P_d(x_{n,t}) is the probability of detection, κ() is the clutter rate, L(z_{m,t}|x) is the likelihood of the mth observation and M_t is the total number of observations at time t.
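As a purely illustrative aid (not part of the original PHD formulation), the following Python sketch evaluates one filtering/update cycle of equations (2)–(5) on a discretized one-dimensional state space; the grid, the callable models and all function names are assumptions chosen for the example.

```python
import numpy as np

# Minimal numerical sketch of the PHD recursion (2)-(5) on a 1-D grid.
# All function names and the discretization are illustrative assumptions.

def phd_predict(grid, intensity_prev, motion_kernel, p_survive, birth_intensity):
    """Eq. (2): integrate f(x|x') Ps(x') lambda_{t-1}(x') over the grid, then add births."""
    dx = grid[1] - grid[0]
    predicted = np.zeros_like(intensity_prev)
    for i, x in enumerate(grid):
        f = motion_kernel(x, grid)                     # f(x | x') for every grid cell x'
        predicted[i] = np.sum(f * p_survive(grid) * intensity_prev) * dx
    return predicted + birth_intensity(grid)           # + gamma(x)

def phd_update(grid, intensity_pred, observations, p_detect, likelihood, clutter):
    """Eqs. (3)-(5): scale the predicted intensity by the PHD update factor."""
    dx = grid[1] - grid[0]
    factor = 1.0 - p_detect(grid)
    for z in observations:
        psi = p_detect(grid) * likelihood(z, grid)              # Eq. (4)
        denom = np.sum(psi * intensity_pred) * dx + clutter(z)  # <psi, lambda> + kappa(z), Eq. (5)
        factor = factor + psi / denom
    return factor * intensity_pred
```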
SMC implementation of the PHD: The particle filter implementation of the PHD recursion was proposed in [11,12] as a method for simulating the conditional expectation of the unknown number of targets given the current observations. The SMC implementation of the PHD approximates the intensity function λ(x) with a set of particles. Further results on the convergence properties of the particle implementation were given in [13], and an extension to Gaussian mixture models for solving the integral in the PHD recursion was proposed in [14].
3 State Space Model with Poisson Process Observations
A marked point process is a random collection of events that happens in time and some other dimensions (in this case space) [15]. The number of observations M_t received in the interval Δt = [t − 1, t) can be written as a marked point process with intensity λ(A, t) dependent on time and the spatial parameter A:

p(M_t \mid \lambda(A,t)) = \lambda(A,t)^{M_t}\, \frac{\exp[-\lambda(A,t)]}{M_t!}    (6)

The mean rate of observations λ(A, t) can be decomposed into a rate for clutter components with homogeneous (i.e., constant) spatial intensity function λ_c, and an intensity component for each of the N_t targets:

\lambda(A,t) = \Bigl( \lambda_c + \sum_{n=1}^{N_t} \lambda_n(A) \Bigr) \Delta t    (7)

The spatial intensity λ_n(A) can also be written in terms of the spatial posterior distributions of the targets X_t = x_1, .., x_{N_t} and the observations Z_t = z_1, .., z_{M_t}:

\lambda_n(A) = \int_A \lambda(z \mid x_n)\, dz    (8)
So the conditional likelihood of the observations given the target states can be written as:

p(Z_t \mid X_t) = \frac{\exp(-\lambda(A,t))}{M_t!} \prod_{m=1}^{M_t} \lambda(z_m \mid X) = \frac{\exp(-\lambda(A,t))}{M_t!} \prod_{m=1}^{M_t} \Bigl( \lambda_c + \sum_{n=1}^{N_t} \lambda_n(z_m \mid x_n) \Bigr)    (9)
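The likelihood (9) is simple to evaluate once the per-target intensities are available; the short sketch below is an illustration under the assumption that λ_n(z | x_n) is supplied as a callable, and is not code from the paper.

```python
import numpy as np
from math import factorial

def multi_target_likelihood(observations, targets, target_intensity,
                            clutter_rate, total_intensity):
    """p(Z_t | X_t) as in Eq. (9): Poisson normalisation times a product of the
    superposed clutter and per-target intensities at each observation z_m."""
    m = len(observations)
    norm = np.exp(-total_intensity) / factorial(m)      # exp(-lambda(A,t)) / M_t!
    prod = 1.0
    for z in observations:
        prod *= clutter_rate + sum(target_intensity(z, x) for x in targets)
    return norm * prod
```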
4 SMC Approximation for the Spatial Poisson Process Model
We decompose A using some suitable spatial discretization into K disjoint unions (here called bins) so as to transform the problem into counting in each bin. The
intensity function λ_n(z_k | x_n) defines the rate of observations of a particular target x_n in a measurement bin k. Given that the overall intensity is a superposition of a discrete number of Poisson processes, it is difficult to calculate the intensity for a single target. A possible solution is to approximate the intensity with a quantity proportional to the likelihood of the nth target and inversely proportional to the number of observations in that bin:

\int_A \lambda(z \mid x_n)\, dz = \sum_{k=1}^{K} p(z_k \mid x_n)/m_k    (10)
A particle filter model is then used for propagating the conditional spatial intensity of each target, which can be used by the multi-target likelihood (9) for calculating the posterior distribution of the observations. The conditional spatial intensity can be approximated with the update and filtering equations of the standard SMC methods, but in order to take into account targets appearing and disappearing we use resampling without replacement. In this setup, only r particles survive and the remaining samples give birth to new samples with probability b(x).
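A schematic sketch of this resampling-without-replacement step is given below; the weight-proportional draw of the r survivors, the birth proposal standing in for b(x) and the uniform re-weighting are illustrative choices, not a prescription from the paper.

```python
import numpy as np

def resample_without_replacement(particles, weights, r, birth_proposal,
                                 rng=np.random.default_rng()):
    """Keep r distinct particles drawn without replacement (probability
    proportional to their weights) and let the remaining slots give birth to
    new samples drawn from the birth proposal b(x)."""
    n = len(particles)
    w = np.asarray(weights, dtype=float)
    survivors_idx = rng.choice(n, size=r, replace=False, p=w / w.sum())
    survivors = particles[survivors_idx]
    newborn = birth_proposal(n - r)                 # n - r fresh samples from b(x)
    new_particles = np.concatenate([survivors, newborn])
    new_weights = np.full(n, 1.0 / n)               # reset to uniform weights
    return new_particles, new_weights
```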
5 Examples
In this section, two examples are provided for the spatial Poisson process model.

Multi-target bearings-only tracking: The bearings-only tracking problem is a non-linear Gaussian problem where a passive observer takes measurements of the angle of a target moving in two dimensions. If the number of targets is dynamically changing, then the multi-target Gaussian likelihood becomes intractable and the optimal filter needs to resort to additional techniques for estimating the system state. Figure 1 shows the target trajectories for this example. Each target is represented by its 2-D position and velocity xn,t = (xn,t, yn,t, dxn,t, dyn,t). The target dynamic model can be written as xn,t = G xn,t−1 + ηt. The state equation innovation is a bivariate zero-mean Gaussian process η ∼ N(0, σ²I). The observation model is the non-linear function of the position zn,t = tan⁻¹(xn,t/yn,t) + αt. A finite hidden Markov model can be used for the observed number of measurements given an unknown number of targets. This model can represent the probability of a change in dimension given the history of the number of targets and observations. The example shows a model jump from 5 targets to 6 targets, with constant clutter probability λc = 1 and probability of detection Pd = 1 for each state. Figure 2 shows the time-varying dimensionality of the state and observations. Figure 3 shows the superimposed observations received at each time step. The number of observations received is a point process with inhomogeneous spatial intensity, and there is not enough information for an explicit measurement-to-track data association.
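For concreteness, a small simulation sketch of these dynamic and observation models is shown below; the sampling period, the noise levels and the assumption that the innovation acts on the position components are illustrative and not taken from the paper.

```python
import numpy as np

dt, sigma_pos, sigma_bearing = 1.0, 0.5, 0.05   # assumed sampling period and noise levels
G = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)       # constant-velocity transition for (x, y, dx, dy)

def propagate(state, rng):
    """x_{n,t} = G x_{n,t-1} + eta_t, with a bivariate innovation on the position."""
    eta = np.concatenate([rng.normal(0.0, sigma_pos, 2), np.zeros(2)])
    return G @ state + eta

def bearing(state, rng):
    """z_{n,t} = atan(x / y) + alpha_t, observed by a passive sensor at the origin."""
    return np.arctan2(state[0], state[1]) + rng.normal(0.0, sigma_bearing)
```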
Fig. 1. Multi-target Tracking Model
Fig. 2. Poisson Hidden Markov Model: number of observations, estimated target rate, and target number ground truth over time
The particle filter model uses 100 particles to represent the intensity function λ(z|xn ). The particles are resampled without replacement. Thus the low weighted particles are killed but not replaced with copies of the best weighted particles.
Fig. 3. Spatial Poisson Process
Fig. 4. Conditional Intensity Monte Carlo Estimate
They are used to give birth to new samples. The particles are used to represent a weighted conditional intensity. Figure 4 shows the evolution of the particle system.
Fig. 5. Single Channel Acoustic Data (Channel 1)
Fig. 6. Single Channel Acoustic Data (Channel 10)
Range estimation from acoustic measurements: The location of targets using acoustic measurements in reverberant environments is difficult to characterize because of the multi-path propagation. Estimating range from time-delay measurements from multiple reflecting objects using an array of sensors can be a challenging problem when the model dimension is unknown. Time-delay measurements are a superposition of the first reflection of a target, the multi-path reflections, and the background noise. In this example, ultrasound measurements are received by an array of sensors. A known signal is propagated in the air, and the received acoustic energy is represented by the complex response of the sensors. The observations here correspond to any samples above a threshold such as the variance of the complex signal. The range of the target is represented as a function of the time delay of arrival of the wavefront for the first reflection of each target. A Monte Carlo importance sampling method is used for estimating the range of the target for each measured time delay. The range is calculated as r = c·td + w, where c is the velocity of sound propagation in air, w ∼ N(0, 10⁻⁶) and td is the measured time delay. Successful target detection is shown in Figure 6.
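A minimal sketch of this range computation is given below; the numerical value of c and the plain Monte Carlo averaging (the paper's importance-sampling weights are not reproduced) are assumptions for illustration.

```python
import numpy as np

C_AIR = 343.0      # assumed speed of sound in air [m/s]
SIGMA_W = 1e-3     # std of w, i.e. w ~ N(0, 1e-6)

def monte_carlo_range(time_delay, n_samples=1000, rng=np.random.default_rng(0)):
    """Draw samples of r = c * t_d + w for one measured time delay."""
    r = C_AIR * time_delay + rng.normal(0.0, SIGMA_W, n_samples)
    return r.mean(), r
```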
6 Conclusion
This paper has presented an SMC approximation to the spatial Poisson process model for multi-target tracking problems. The model uses a Poisson point process in a multidimensional space, therefore it can be thought of as being part of Mahler's Finite Set Statistics framework. The formulation presented does not make use of an explicit birth and death strategy, but uses a modified sequential Monte Carlo method that performs resampling without replacement. In that way, new particles are born in the place of the resampled particles, allowing a non-explicit birth and death formulation. The case of multiple target tracking is illustrated as a motivating example. Although the method proposed has no apparent improvement over the PHD recursion for the tracking problem, the example in acoustic parameter estimation has shown the feasibility of the model for representing the measurement superposition due to reverberation. This is an important feature that makes the model practical to use with unthresholded physical measurements. On the other hand, it shares the problems of the PHD filter of being highly dependent on the SNR and being difficult to interpret. Further work will be done on calculating the target state estimate and comparing the results with the ground truth data. For the acoustic problem presented, new results will be extended to multi-channel observation data.
Acknowledgement. The authors would like to acknowledge Industrial Research Limited for providing the facilities for capturing the data used in this paper.
References 1. Sisson, S.A.: Transdimensional Markov chains: A decade of progress and future perspectives. Journal of the American Statistical Association 100(471), 1077–1089 (2005) 2. Gordon, N.J., Salmond, D.J., Smith, A.F.M.: Novel approach to nonlinear/nonGaussian Bayesian state estimation. IEEE Proceedings 2(140), 107–113 (1993) 3. Liu, J.S., Chen, R.: Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association 443(93), 1032–1044 (1998) 4. Doucet, A., de Freitas, N., Gordon, N.: Sequential Monte Carlo methods in practice. Springer, New York (2001) 5. Blackman, S., Popoli, R.: Design and Analysis of Modern Tracking Systems. Artech House, Norwood (1999) 6. Hue, C., Le Cadre, J.P., Perez, P.: Sequential Monte Carlo methods for multiple target tracking and data fusion. IEEE Trans. on Signal Processing 2(50), 309–325 (2002) 7. Mahler, R.P.S.: Multitarget bayes filtering via first-order multitarget moments. IEEE Transactions on Aerospace and Electronic Systems 39(4), 1152–1178 (2003) 8. Mahler, R.P.S.: Statistical Multisource-Multitarget Information Fusion. Artech House, Norwood (2007) 9. Godsill, S., Li, J., Ng, W.: Multiple and extended object tracking with Poisson spatial processes and variable rate filters. In: IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing, IEEE Computer Society Press, Los Alamitos (2005) 10. Gilholm, K., Godsill, S., Maskell, S., Salmond, D.: Poisson models for extended target and group tracking. In: Signal and Data Processing of Small Targets 2005, SPIE (2005) 11. Vo, B.N., Singh, S., Doucet, A.: Sequential Monte Carlo methods for multitarget filtering with random finite sets. IEEE Transactions on Aerospace and Electronic Systems 41(4), 1224–1245 (2005) 12. Sidenbladh, H.: Multi-target particle filtering for the probability hypothesis density. In: Proceedings of the Sixth International Conference of Information Fusion (2003) 13. Clark, D.E., Bell, J.: Convergence results for the particle PHD filter. IEEE Transactions on Signal Processing 54(7), 2652–2661 (2006) 14. Vo, B., Ma, W.: The Gaussian mixture probability hypothesis density filter. IEEE Transactions on Signal Processing 54(11), 4091–4104 (2006) 15. Daley, D., Vere-Jones, D.: An Introduction to the Theory of Point Processes. Elementary Theory and Methods, vol. I. Springer, New York (2003)
Proposition and Comparison of Catadioptric Homography Estimation Methods

Christophe Simler, Cédric Demonceaux, and Pascal Vasseur

C.R.E.A, E.A. 3299, 7, rue du moulin neuf, 80000 Amiens, France
chris [email protected], {cedric.demonceaux,pascal.vasseur}@u-picardie.fr
Abstract. Homographies are widely used in tasks like camera calibration, tracking, mosaicing or motion estimation and numerous linear and non linear methods for homography estimation have been proposed in the case of classical cameras. Recently, some works have also proved the validity of homography for catadioptric cameras but only a linear estimator has been proposed. In order to improve the estimation based on correspondence features, we suggest in this article some non linear estimators for catadioptric sensors. Catadioptric camera motion estimation from a sequence of a planar scene is the proposed application for the evaluation and the comparison of these estimation methods. Experimental results with simulated and real sequences show that non linear methods are more accurate. Keywords: Omnidirectional Vision, Homography estimation.
1 Introduction
For thirty years, many computer vision studies have been performed in order to obtain information on the trajectory of a mobile perspective camera, using only the image sequence and the intrinsic parameters (calibrated camera) [1],[2],[3]. Without prior knowledge about the scene, this motion is always only partially recovered because the translation is known up to a scale factor. In the case of a planar scene or a pure rotation motion or both, two images are related by a homography. From such a homography, the rotation, the direction of the translation and the direction of the normal to the plane can be computed [4]. Homographies also have multiple other applications such as camera calibration [5], mosaicing [6], and visual servo-control law computation in robotics [7].

Since the estimation of a homography requires data matching between two images, different kinds of primitives can be used. Thus, in [8] a dense matching based on the grey levels of pixels is proposed. However, this approach is iterative and a good initial value is required. If the motion between the two images is too large, the method becomes inadequate, and a solution then consists in performing the estimation with other kinds of features such as contours [9], lines or points [10]. In this way, lines or points can be used with a linear estimator in order to provide the initial value to an iterative non linear approach which provides more stability with respect to correspondence coordinate noise [11].
Some recent works deal with motion estimation from central catadioptric image sequences. Some motion and structure reconstruction methods have been proposed in [12] and in [13]. However, when the scene is planar and if only the motion is required, some less computationally demanding methods can be applied. In [14] the authors mention that the epipolar geometry is non linear between two omnidirectional images. Then, in order to recover a similar epipolar geometry as in the perspective case, the solution consists in projecting the images onto the unit sphere [15]. In this way, if two catadioptric images of a planar scene are projected onto the sphere, a homography relates them and it is then possible to use a linear estimation algorithm almost similar to the one used with a perspective camera. However, the non linear estimation of a homography has not yet really been studied in the catadioptric case, except in [16] where a non linear approach based on grey level matching has been suggested. However, in this case only small displacements are allowed in order to perform the iterative process. In [17], the authors also present a non linear estimation technique, but only for the case of catadioptric camera calibration.

In order to test our different estimation algorithms, we consider that a catadioptric camera moves in an urban area or in an indoor scene. Such environments are generally composed of planes. In fact, we consider just a planar scene; a study with several planes will be performed in further work. We have a set of matched image points (noisy inliers) and our aim is to optimise the homography estimation process. The motion computation is then optimised because it depends directly on the homography. In order to perform the optimisation, we suggest in this article four non linear homography estimators for catadioptric sensors. The estimations are done from matched points. Their stabilities with respect to correspondence noise are quantified and compared with the results of the catadioptric linear estimator by simulations. Some tests with matched points from real omnidirectional images validate these simulations. We also perform some simulation tests in the perspective case by quantifying the precision of the perspective non linear estimator and by comparing it with the results of the perspective linear one. It is well known that in the pinhole case the non linear approach is better [11], but this experiment above all enables us to compare the precisions of the catadioptric estimators with their counterparts in the pinhole case.

This paper is divided into four main parts. After introducing catadioptric projection, we derive the homography equations for catadioptric images in section 2. Then, we present in section 3 our different linear and non linear estimators. Section 4 is devoted to the evaluation and comparison of the methods. Finally, the estimators are tested on real sequences in section 5.
2 Perspective and Catadioptric Homography

2.1 The Unifying Catadioptric Projection Model
The projection model defined in [18] covers all the catadioptric cameras with a single point of view. Catadioptric systems can be modelled with the following generic steps (Fig. 1):
1. Projection of an object point M of the scene to a point Ms on the unit sphere centred on the inner focus F of the mirror.
2. Non linear projection of the 3D point Ms of the sphere, with respect to the point C, to a 2D point m on a virtual plane (with the mirror parameter ξ).
3. Projection of the point m of the virtual plane to a point p on the image plane (with the camera intrinsic parameters and the 2 mirror parameters).

Due to the non linearity of the projection in step 2, it is difficult to model the geometrical relation between 2 images. However, by projecting them on the unit sphere (from p to Ms) an epipolar geometry similar to the perspective case is recovered.
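A compact sketch of these three steps is given below, together with the standard inverse ("lifting") of the unified model that maps an image point back onto the unit sphere; the intrinsic matrix K and all names are illustrative assumptions rather than the exact notation of [18].

```python
import numpy as np

def project_unified(M, xi, K):
    """3D point -> unit sphere -> virtual plane (mirror parameter xi) -> image point."""
    Ms = M / np.linalg.norm(M)                               # step 1: point on the unit sphere
    x, y, z = Ms
    m = np.array([x / (z + xi), y / (z + xi), 1.0])          # step 2: non linear projection w.r.t. C
    p = K @ m                                                # step 3: mapping to the image plane
    return p[:2] / p[2]

def lift_to_sphere(p, xi, K):
    """Back-project an image point onto the unit sphere (standard unified-model inverse)."""
    m = np.linalg.solve(K, np.array([p[0], p[1], 1.0]))
    m = m / m[2]
    x, y = m[0], m[1]
    eta = (xi + np.sqrt(1.0 + (1.0 - xi**2) * (x**2 + y**2))) / (x**2 + y**2 + 1.0)
    return np.array([eta * x, eta * y, eta - xi])
```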
Fig. 1. Central catadioptric camera projection model
2.2 Homography Between Two Images
In this part we consider a pair of perspective images and a pair of catadioptric images. The motion between the two images is composed of a rotation R and a translation t. In the two cases, an image normalisation was done. In the perspective case, the pixel coordinates were converted into meters, and in the omnidirectional case the projection on the unit sphere was done (see part 2.1). In the perspective case, the motion means the camera frame motion, and in the omnidirectional case it means the mirror frame motion. Figure 2 shows an illustration of these two configurations. M is a planar scene point, and m1 and m2 are its images before and after the motion. With the pinhole, m1 = (x1, y1, 1)T and m2 = (x2, y2, 1)T in the image frame. In the catadioptric case, m1 = (x1, y1, z1)T and m2 = (x2, y2, z2)T in the mirror frame. It is shown in [7] and [15] that, both in the perspective and in the omnidirectional case, m1 and m2 are linked by a homography, called H, which can be written as follows:
Fig. 2. Homography between perspective and catadioptric images: (a) perspective case, homography between two planar images; (b) catadioptric case, homography between two spherical images
\begin{pmatrix} x_2 \\ y_2 \\ z_2 \end{pmatrix} = \frac{1}{s} H \begin{pmatrix} x_1 \\ y_1 \\ z_1 \end{pmatrix} = \frac{1}{s} \begin{pmatrix} h_{11} x_1 + h_{12} y_1 + h_{13} z_1 \\ h_{21} x_1 + h_{22} y_1 + h_{23} z_1 \\ h_{31} x_1 + h_{32} y_1 + h_{33} z_1 \end{pmatrix} = \frac{1}{s} \begin{pmatrix} H_1^T m_1 \\ H_2^T m_1 \\ H_3^T m_1 \end{pmatrix}    (1)
where s is an unknown scale factor and Hi = (hi1, hi2, hi3)T. In the pinhole case, z1 = z2 = 1. H is defined up to a scale factor, thus it has only eight independent parameters. To cope with the scale factor ambiguity, we impose an arbitrary constraint, h33 = 1, in order to have a unique solution.

2.3 Motion from Homography
The homography H is expressed in terms of the motion (R, t) and the scene plane normal n. If the norm of n is equal to the inverse of the initial distance to the plane, we have: H = R + t nT. R, the direction of t and n can be computed with the singular value decomposition of H [4]. However, this leads to 4 solutions, thus the right one has to be selected.
3 Homography Estimators
It has been seen in part 2 that two matched points are related by a homography H in the case of a planar scene. We consider that the planar scene provides N (N ≥ 8) matched image points between two views. In the pinhole case, z1 = z2 = 1. In part 3.1 we present a brief state of the art of homography estimators, and in part 3.2 we introduce the non linear ones we suggest for catadioptric cameras.

3.1 Main Estimator Overview
The linear estimator is commonly used. Several non linear estimators also exist for the pinhole case, but since their performances are quite similar [19], we just present the one most generally used.
The linear estimator (for pinhole and catadioptric camera): If we isolate s in the third equation of expression (1), we obtain:

\begin{cases} x_2 = \dfrac{H_1^T m_1}{H_3^T m_1}\, z_2 \\[2mm] y_2 = \dfrac{H_2^T m_1}{H_3^T m_1}\, z_2 \end{cases}    (2)

After linearization, the ith correspondence provides:

\begin{pmatrix} x_{1i} z_{2i} & y_{1i} z_{2i} & z_{1i} z_{2i} & 0 & 0 & 0 & -x_{1i} x_{2i} & -y_{1i} x_{2i} \\ 0 & 0 & 0 & x_{1i} z_{2i} & y_{1i} z_{2i} & z_{1i} z_{2i} & -x_{1i} y_{2i} & -y_{1i} y_{2i} \end{pmatrix} \bar{h} = \begin{pmatrix} x_{2i} z_{1i} \\ y_{2i} z_{1i} \end{pmatrix}    (3)

where \bar{h} = (H_1^T, H_2^T, h_{31}, h_{32})^T. Expression (3) does not hold with strict equality because of the correspondence noise. With more than four matches it is possible to solve an overdetermined linear system of 2 × N equations which can be written as follows: A \bar{h} ≈ b. The 2N × 8 matrix A and the 2N vector b are built from the correspondences. The solution obtained using linear least squares is the following:

\bar{h} = \arg\min L_1(\bar{h}) = \arg\min \sum_{i=1}^{N} (A_i \bar{h} - b_i)^2 = (A^T A)^{-1} A^T b.    (4)
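A direct implementation of the linear estimator (4) is sketched below for matched points already lifted onto the unit sphere (or with z = 1 in the pinhole case); this is an illustration, not the authors' code.

```python
import numpy as np

def linear_homography(m1, m2):
    """m1, m2: (N, 3) arrays of matched points; returns H with h33 = 1 (Eqs. (3)-(4))."""
    rows, rhs = [], []
    for (x1, y1, z1), (x2, y2, z2) in zip(m1, m2):
        rows.append([x1*z2, y1*z2, z1*z2, 0, 0, 0, -x1*x2, -y1*x2]); rhs.append(x2*z1)
        rows.append([0, 0, 0, x1*z2, y1*z2, z1*z2, -x1*y2, -y1*y2]); rhs.append(y2*z1)
    A, b = np.asarray(rows, dtype=float), np.asarray(rhs, dtype=float)
    h_bar, *_ = np.linalg.lstsq(A, b, rcond=None)            # (A^T A)^{-1} A^T b
    return np.append(h_bar, 1.0).reshape(3, 3)               # impose h33 = 1
```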
This estimator has the advantage of providing a closed-form and unique solution. However, in the pinhole case this estimator is unstable with respect to the data noise [1]. This instability is due to the linearization of equations (2), which complicates the distributions of the error terms, and thus the linear estimator (4) deviates from the maximum likelihood estimator in the presence of noise. In fact, the maximum likelihood estimator is optimal because it is unbiased with minimal variance. The estimators which are close to it are generally stable, and the estimators far from it are generally unstable. In order to improve the stability of the linear estimator, a solution consists in improving the conditioning of the matrix A^T A [2]. The techniques suggested in [2] are efficient, however they do not reach the performance of the non linear estimator for the pinhole camera (5) (see below). It can be noted that if the linear estimator (4) is used with a catadioptric camera, there is no need to improve the conditioning of A^T A: the projection on the unit sphere automatically provides a low condition number. The linear algorithm, with the pinhole as well as with the catadioptric camera, is not optimal because it is not a maximum likelihood estimator. This is the reason why some non linear estimators exist for the pinhole camera, and why we suggest some catadioptric non linear estimators.

The non linear estimator for the pinhole camera: Let us consider expression (2) with z1 = z2 = 1. The non linear least squares solution consists in minimising the following criterion:

J_1(H) = \sum_{i=1}^{N} \left( x_{2i} - \frac{H_1^T m_{1i}}{H_3^T m_{1i}}\, z_{2i} \right)^2 + \left( y_{2i} - \frac{H_2^T m_{1i}}{H_3^T m_{1i}}\, z_{2i} \right)^2.    (5)
This function is generally minimised with the iterative Levenberg-Marquardt algorithm. This procedure needs to be initialised, and it is better to have a correct initial value in order to limit the risk of convergence toward a local minimum. The procedure is generally initialised with the linear least squares solution. The advantage of this criterion with respect to the linear one is that it minimises the sum of the Euclidean reprojection errors. This means that it can generally be assumed that each error term is independent and has the same Gaussian centered distribution with respect to the exact solution. In other words it is (almost) the maximum likelihood estimator, thus it is optimal in terms of stability with respect to the noise.
3.2 Propositions of Catadioptric Non Linear Estimators
Our aim is to estimate the catadioptric homography between two views with a good precision, because the uncertainties of H directly affect the estimated motion, which is always recovered by SVD in our work (see part 2.3). Because the linear algorithm (4) is not optimal, we suggest some catadioptric non linear estimators.

The first proposition is the minimisation of criterion (5) in the catadioptric case. However, equation (2) is not a point-to-point relation on the sphere, thus it is not the sum of the Euclidean reprojection errors which is minimised but a quantity which has no physical interpretation. In this case, nothing is known about the error term distributions. Thus, we do not know whether this estimator is close to the maximum likelihood estimator (we do not know whether it is stable or not).

The second proposition ensures that we work with the maximum likelihood estimator. For this, we propose coming back to equation (1). The first problem with this equation is to determine the unknown scale factor s. We set s = \sqrt{(H_1^T m_1)^2 + (H_2^T m_1)^2 + (H_3^T m_1)^2} because it forces m2 to be on the unit sphere. In this condition, we suggest minimising the sum of the Euclidean reprojection errors (proposition 2):

J_2(H) = \sum_{i=1}^{N} \left( x_{2i} - \frac{H_1^T m_{1i}}{s_i} \right)^2 + \left( y_{2i} - \frac{H_2^T m_{1i}}{s_i} \right)^2 + \left( z_{2i} - \frac{H_3^T m_{1i}}{s_i} \right)^2.    (6)
The properties of criterion (6) are the same as those of the non linear estimator for the pinhole camera (see the end of part 3.1). In summary, it is optimal because it is the maximum likelihood estimator. However, in [17] the authors suggest an estimator which applies a spherical metric to spherical images in the context of calibration. The idea is attractive because it enables working with the metric which corresponds to our images. The adaptation of this estimator to the context of homography estimation leads us to suggest minimising the sum of the spherical reprojection errors (proposition 3):

J_3(H) = \sum_{i=1}^{N} \arccos\left[ \frac{1}{s_i} \left( x_{2i} H_1^T m_{1i} + y_{2i} H_2^T m_{1i} + z_{2i} H_3^T m_{1i} \right) \right].    (7)
In our opinion, this estimator is theoretically equivalent to the estimator (6), because the Euclidean reprojection error is proportional to the spherical reprojection error. It will be interesting to compare them. The drawback of criterion (7) is not its quality, but if the Levenberg-Marquardt algorithm is used to minimise it, the singularity of the derivative of arccos could be prejudicial. To cope with this problem, two solutions are mentioned in [17]: the minimisation can be done with the simplex method, or we can minimise the rope error sum (proposition 4):

J_4(H) = \sum_{i=1}^{N} \left[ 2 - \frac{2}{s_i} \left( x_{2i} H_1^T m_{1i} + y_{2i} H_2^T m_{1i} + z_{2i} H_3^T m_{1i} \right) \right]^2.    (8)
This criterion has been introduced in [17] as a solution to the drawback of the previous one. However, it is not the sum of the (Euclidean or spherical) reprojection errors which is minimised, thus the same remarks apply as for the first suggested estimator J1.
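As an illustration of proposition 2, the sketch below refines the linear solution by a Levenberg-Marquardt minimisation of J2; it relies on SciPy and on the linear_homography() sketch given with equation (4), and should be read as one possible implementation under these assumptions, not as the authors' code.

```python
import numpy as np
from scipy.optimize import least_squares

def j2_residuals(params, m1, m2):
    """Euclidean reprojection residuals of Eq. (6): H m1 / ||H m1|| - m2."""
    H = np.append(params, 1.0).reshape(3, 3)          # h33 kept fixed to 1
    proj = (H @ m1.T).T
    proj = proj / np.linalg.norm(proj, axis=1, keepdims=True)   # division by s_i
    return (proj - m2).ravel()

def nonlinear_homography(m1, m2):
    """Proposition 2: initialise with the linear estimator, then minimise J2."""
    H0 = linear_homography(m1, m2)
    res = least_squares(j2_residuals, H0.ravel()[:8], args=(m1, m2), method='lm')
    return np.append(res.x, 1.0).reshape(3, 3)
```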
4 Simulations

4.1 Simulation Conditions
We use 3 planar patterns, containing 9, 25 and 81 points in a square of side respectively equal to 80, 120 and 160m. These patterns are centered on the camera initial optical axis, perpendicular to this axis and situated from 100m to the imaging device projection centre. The scene frame coincides with the initial camera frame. The intrinsic parameters of our pinhole are: f = 1m, sx = 768pixels/m, sy = 768pixels/m, xc = 511.5pixels, yc = 383.5pixels. Our catadioptric camera is composed of a parabolic mirror and an orthographic camera. The latus rectum of the parabolic mirror is equal to 2. The actual motion between the two acquisitions is: roll= −5◦ , pitch= 10◦ , yaw= 20◦ , tx = 2m, ty = 5m, tz = 3m. With the 3 patterns and the 2 devices, we build 6 correspondence sets. A central Gaussian noise is added to the matching point coordinates. We work in fact with 5 Gaussian noises of standard deviation equal to 1/3, 1, 5/3, 7/3, 3 pixels. The eventual outliers are rejected. The matches are then normalised (see part 2.2). After, the homography is estimated with the estimators of part 3, and the motion and the normal of the plane are computed (see part 2.3). Among the different solutions obtained by SVD, we retain the roll, pitch and yaw angles corresponding to the smallest quadratic error with the reference. Also, we retain the translation and the normal of the plane corresponding to the highest scalar product in absolute value with the reference (after normalising the vectors). The arccos of the selected scalar product provides us the angular deviation with respect to the exact direction, αT for the translation and αN for the normal of the plane. Finally, we are able to estimate and to select the right solution for the parameters: Λ( roll= −5◦ , pitch= 10◦ , yaw= 20◦ , αT = 0◦ , αN = 0◦ ).
4.2 Comparisons and Quantification of the Estimator Precision
We evaluate the 7 estimators of part 3: the linear ones for the perspective and catadioptric cameras, L1, the non linear one for the perspective camera, J1, and the non linear ones for the catadioptric camera (propositions 1, 2, 3 and 4 of part 3.2): J1, J2, J3 and J4. Because the homography parameters have no obvious physical interpretation, we compare the 5 parameters Λ (roll, pitch, yaw, αT, αN). For each of the three patterns and for each of the five image point noise variances, the error on Λ is computed with 20000 estimations as follows:
Err_{N,σ2}(Λ) = |bias(Λ)| + var(Λ).

In fact, Err_{N,σ2} can be seen as a five-component vector, where each component is not a scalar but a 3 × 5 matrix. For each estimator, the mean values with respect to N and σ2 of the five components of the estimation error (4.2) are given in Table 1.

Table 1. Mean values of the error matrices (mean error [degree]) for each estimator

         Perspective Camera      Catadioptric Camera
         L1        J1            L1         J1         J2         J3         J4
  roll   0.2593    0.2584        0.7077     0.6921     0.7058     0.7058     0.7398
  pitch  0.2541    0.2540        0.6376     0.6401     0.6382     0.6386     0.6666
  yaw    0.1130    0.1127        0.2720     0.2687     0.2690     0.2689     0.2845
  αT     7.8027    7.7959        18.0361    17.9363    18.0032    18.0038    18.9386
  αN     6.0727    6.0872        14.0271    13.7943    13.7378    13.7374    14.6840
According to the presentation of part 3.2, it is assumed that the estimators J2 and J3 are optimal (maximum likelihood estimators) and equivalent. Thus they should be better than the linear estimator L1 (which is not optimal, since it does not minimise a physical quantity). However, it was also mentioned that we are not sure about the stabilities of the estimators J1 and J4. It can be seen in Table 1 that, as predicted in part 3.2, the suggested non linear estimators J2 and J3 are more stable than the linear estimator L1. Thus, it has been checked that they are closer to the maximum likelihood estimator than L1, which is a very encouraging result. It can also be seen in Table 1 that they present very similar results, thus the Euclidean metric is not penalizing with respect to the spherical one, and the equivalence has been checked. The small difference is due to the computation round-off errors. It can be noted that because we use the Levenberg-Marquardt algorithm to minimise each criterion, the singularity of the derivative of arccos could have deteriorated the results of J3, but it was not the case in our experiment. It can be seen in Table 1 that the estimator J4 is by far the worst and even the linear estimator L1 works better. Thus, the error term distribution is certainly very far from a Gaussian; then this estimator is far from the maximum likelihood estimator and that explains its instability. Because of the bad results obtained by J4 in the simulations of Table 1, we consider that this estimator should not be used and we do not consider it in the comparisons with real images of section 5.
Surprisingly, the results of the estimator J1 seem to be as good as the results of J2. Thus, the error term distribution is certainly close to a Gaussian; then this estimator is close to the maximum likelihood estimator and that explains its stability. That could be checked in further work. Also, an advantage of this criterion is that its simplicity may enable a more accurate convergence in the Levenberg-Marquardt iterative minimisation process. In summary, J1, J2 and J3 give better results than L1 and are very similar in quality.

With the perspective camera, it can be seen in Table 1 that the non linear estimator J1 provides some better results than the linear one L1. It is not surprising because it is well known in the literature [11]. It is interesting to compare the precisions of the catadioptric estimators L1 and J1 with their counterparts in the pinhole case. According to Table 1 the perspective estimators are more precise. However, it was assumed in our simulations that the image plane of the pinhole camera is not limited, and thus the advantage of the large field of view provided by the catadioptric camera was cancelled (in practice the huge field of view provided by a catadioptric device is sometimes essential to perform the matching between two views). In fact, what is interesting to retain about the perspective-catadioptric comparisons is that the projection of noisy data onto the unit sphere is prejudicial for the estimations.
5 Experimental Results
In the simulations of section 4 the seven estimators of part 3 were evaluated. We performed some simulations with a perspective camera which provided some useful additional information. However, in this part we compare only the catadioptric estimators, because this is the central point of our study. A sequence of seven images of a room is taken with a calibrated catadioptric camera (Fig. 3). The relative attitude of the mirror frame at each acquisition with respect to the initial acquisition is given in Table 2. The homography is estimated with 18 matched points (Harris corners) belonging to a plane of the scene. The non linear estimator J4 is not considered because of the bad results obtained in simulation in part 4. As was the case with the simulations of part 4, with real images it was also noticed that the non linear estimators J1, J2 and J3 give some very similar results. Thus it is difficult both in simulation and with real images to separate them because their performances are very similar. In addition, with real images the imprecisions on the reference attitude complicate the selection of a possible best among them. In terms of performance we are not able to separate them, but conceptually J2 has some advantages: it minimises the Euclidean reprojection error sum and there is no risk of singularity when using the Levenberg-Marquardt algorithm. This is the reason why in this section only the results of J2 are presented and compared with the linear estimator L1. Table 3 shows the roll, pitch, yaw errors, and the translation angular deviation between each couple of successive images using 18 matched points. Figure 4 shows the errors between each image and the first.
Fig. 3. Example of real scene images used in our experiments, together with the 18 matched points. The reference plane is composed of the locker on the right of the image.

Table 2. Mirror frame real attitude at each acquisition with respect to its initial position

             Image2/1  Image3/1  Image4/1  Image5/1  Image6/1  Image7/1
  tx [m]     0.7       0.7       0.7       0.7       0.7       0.7
  ty [m]     0         0         0.7       0.7       0.7       0.7
  tz [m]     0         0.1       0.1       0.1       0.1       0.1
  roll [°]   0         0         0         0         5         5
  pitch [°]  0         0         0         10        10        10
  yaw [°]    0         0         0         0         0         10
Table 3. Roll, pitch, yaw absolute errors, and translation angular deviation between each couple of successive images [degree] using 18 matched points

         Image2/1    Image3/2    Image4/3    Image5/4    Image6/5    Image7/6
         L1    J1    L1    J1    L1    J1    L1    J1    L1    J1    L1    J1
  roll   0.7   0.3   0.1   0.2   0.2   0.1   3     2     0.3   0.1   0.9   0.3
  pitch  3     2     1.6   1.5   0.7   0.7   2     2     2     2     0.8   2
  yaw    0.3   1     0.4   0.1   0.2   0.2   0.8   0.7   0.7   0.4   0.3   0.5
  αT     5     4     21    10    3     3     8     5     34    28    78    30
The results show that the non linear criterion J2 is more precise than the linear criterion L1. Thus the results with real images are consistent with the results obtained in simulation in part 4. However, it can be noticed in Table 3 and in Figure 4 that rarely, but sometimes, L1 provides better results than J2. This can be explained by the fact that a poorer estimator can provide a better estimate than a better one, with a probability which is low but not null in general. Also, the non linear criterion is minimised with the Levenberg-Marquardt iterative algorithm, and there is always a small risk of converging toward a local minimum.
Fig. 4. Motion error between each image of the sequence and the first using 18 matched points
It can be noticed in Figure 4 that the errors do not always increase with respect to the image number. That is normal because a larger motion does not mean a poorer estimation. The estimation depends on the correspondences, and they are always established whatever the motion, owing to the large field of view of the catadioptric camera.
6 Conclusion and Perspective
In this paper four non linear catadioptric homography estimators were suggested and compared in a quantitative way. It has been noticed both in simulation and with real images that the performances of three of them are very similar, and above all better than that of the linear estimator. Our tests do not enable us to separate these three winners, but we advise using the one called J2 because it is the only one which has these two qualities: it minimises the sum of the reprojection errors and there is no singularity problem when the minimisation is performed with the
Levenberg-Marquardt algorithm. It is thus (almost) the maximum likelihood estimator. The motion estimation is now optimised with a planar scene. Because we are going to work with scenes composed of several planes (urban scenes), the next step consists in optimally exploiting the different planar data sets in order to improve the motion estimation.
References 1. Longuet-Higgins, H.C.: A computer algorithm for reconstructing a scene from two projections. Nature 293(5828), 133–135 (1981) 2. Hartley, R.I.: In defense of the eight-point algorithm. IEEE Trans. Pattern Anal. Mach. Intell 19(6), 580–593 (1997) 3. Zhang, Z.: Determining the epipolar geometry and its uncertainty: A review. Int. J. Comput. Vision 27(2), 161–195 (1998) 4. Tsai, R., Huang, T., Zhu, W.: Estimating 3-d motion parameters of a rigid planar patch ii: Singular value decomposition. ASSP 30(8), 525–533 (1982) 5. Zhang, Z.: Flexible camera calibration by viewing a plane from unknown orientations. In: IIEEE Int. Conf. on Computer Vision, pp. 666–673. IEEE Computer Society Press, Los Alamitos (1999) 6. Brown, M., Lowe, D.G.: Recognising panoramas. In: IEEE Int. Conf. on Computer Vision, vol. 02, pp. 1218–1225. IEEE Computer Society, Los Alamitos (2003) 7. Benhimane, S., Malis, E.: Homography-based 2d visual servoing. In: IEEE International Conference on Robotics and Automation, IEEE Computer Society Press, Los Alamitos (2006) 8. Szeliski, R., Shum, H.Y.: Creating full view panoramic image mosaics and environment maps. In: SIGGRAPH 1997: Proceedings of the 24th annual conference on Computer graphics and interactive techniques, pp. 251–258. ACM Press/AddisonWesley Publishing Co, New York (1997) 9. Jain, P.K.: Homography estimation from planar contours. In: Third Int. Symp. on 3D Data Processing, Visualisation and Transmission, pp. 877–884 (2006) 10. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000) 11. Faugeras, O.: Three-dimensional computer vision: a geometric viewpoint. MIT Press, Cambridge (1993) 12. Lhuillier, M.: Automatic structure and motion using a catadioptric camera. In: Proceedings of 6th Workshop on Omnidirectional Vision OMNIVIS 2005 (2005) 13. Makadia, A., Geyer, C., Sastry, S., Daniilidis, K.: Radon-based structure from motion without correspondences. In: CVPR, pp. 796–803. IEEE Computer Society Press, Los Alamitos (2005) 14. Geyer, C., Daniilidis, K.: Mirrors in motion: Epipolar geometry and motion estimation. In: IEEE Int. Conf. on Computer Vision, pp. 766–773. IEEE Computer Society Press, Los Alamitos (2003) 15. Benhimane, S., Malis, E.: A new approach to vision-based robot control with omnidirectional cameras. In: IEEE International Conference on Robotics and Automation, IEEE Computer Society Press, Los Alamitos (2006) 16. Mei, C., Benhimane, S., Malis, E., Rives, P.: Homography-based tracking for central catadioptric cameras. In: IROS (2006)
17. Mei, C., Rives, P.: Single view point omnidirectional camera calibration from planar grids. In: IEEE International Conference on Robotics and Automation, IEEE Computer Society Press, Los Alamitos (2007) 18. Barreto, J.: A unifying geometric representation for central projection systems. Comput. Vis. Image Underst. 103(3), 208–217 (2006) 19. Kosecka, J., Ma, Y., Sastry, S.: Optimization criteria, sensitivity and robustness of motion and structure estimation. In: ICCV 1999: Proceedings of the International Workshop on Vision Algorithms, pp. 166–182. Springer, London, UK (2000)
External Calibration of Multi-camera System Based on Efficient Pair-Wise Estimation

Chunhui Cui, Wenxian Yang, and King Ngi Ngan

Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong
{chcui,wxyang,knngan}@ee.cuhk.edu.hk
Abstract. In this paper, we present an external calibration technique for typical multi-camera system. The technique is very handy in practice using a simple planar pattern. Based on homography, an efficient pair-wise estimation method is proposed to recover the rigid rotation and translation between neighboring cameras. By registering all these partial calibrated structures, complete calibration of the multi-camera system is accomplished. Experiments with both simulated and real data show that accurate and stable calibration results can be achieved by the proposed method. Keywords: Camera calibration, multi-camera system, homography.
1 Introduction Virtual immersive environments usually require multiple cameras distributed over a wide area, so as to capture scenes of considerable extent in large rooms or even outdoors. A complete multi-camera calibration is an inevitable and important step towards the efficient use of such systems. In recent years, many multi-camera calibration methods [1][2][3] have been developed based on factorization and global constraints. Usually the whole projection matrix P is estimated instead of distinguishing intrinsic and extrinsic parameters. The method proposed in [2] relies on the planar pattern and assumes it to be visible to all cameras. Its applications are therefore limited; e.g., it is unsuitable for wide-baseline cases. Other approaches [3] using a laser pointer or a virtual calibration object are more flexible, but usually involve elaborate feature detection and tracking, or place particular requirements on the captured scene. Some researchers [4][5][6] focus their efforts on external camera calibration, where the intrinsic and distortion parameters are estimated beforehand and regarded as fixed. In [4], positions and orientations of the model planes relative to the camera are estimated by Zhang's method [7]. Using this information, rigid transforms between two cameras are then determined through an arbitrarily chosen plane. Besides, a RANSAC procedure is applied to remove possible outliers. A more elaborate
approach is presented in [5], where a virtual calibration object is used instead of the planar pattern. A structure-from-motion algorithm is employed to compute the rough pair-wise relationship between cameras. Global registration in a common coordinate system is then performed iteratively using a triangulation scheme. The method proposed in [6] estimates the pair-wise relationship based on the epipolar geometry. Translation and rotation between two cameras are recovered by decomposing the associated essential matrix. In this paper, we present an external calibration method for a typical multi-camera system designed for real-time 3D video acquisition. The technique is simple to use, only requiring the user to present the pattern to the cameras in different locations and orientations. Generality in camera positioning is offered; only a reasonable overlap in the FOV (field of view) between neighboring cameras is necessary. Based on homography, a robust pair-wise estimation method is proposed to recover the rotation and translation between cameras. Four different estimation algorithms are proposed, namely the Linear, Two-step, Nonlinear and Three-step methods. The four algorithms impose the orthogonality constraint on the rotation to different degrees and accordingly achieve calibration results with different accuracy and stability. To calibrate the multi-camera system, the proposed pair-wise calibration method is first applied to estimate the relative relationship between neighboring cameras. Then the complete external calibration is accomplished by registering all these partially calibrated structures. The validity of the proposed method is verified through experiments with both simulated and real data.
2 Basic Equations from Homography

Suppose two cameras capture a planar pattern simultaneously as shown in Fig. 1. Let m1 and m2 denote the projections of the same 3D point onto camera 1 and camera 2, and let m̃1 = [u1, v1, 1]^T and m̃2 = [u2, v2, 1]^T denote their homogeneous coordinates. The homography introduced by the plane π is given by

    m̃2 ≃ H m̃1,                                  (1)

where '≃' indicates equality up to scale, and H is the 3×3 homography matrix. Let C1 and C2 denote the coordinate systems of camera 1 and camera 2, respectively, and let A1, A2 denote their intrinsic matrices. R and t represent the rotation and translation from C1 to C2 as shown in Fig. 1. π represents the planar pattern surface with the plane normal n = N/d, where N is the unit vector in the direction of the plane normal expressed in C1, and d is the distance from the C1 origin to the plane π. We then have

    λ H = A2 (R + t·n^T) A1^{-1},                (2)

where λ is an unknown arbitrary scalar.
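As a concreteness check of Eqns. (1)-(2), the short Python/NumPy sketch below builds a plane-induced homography from made-up intrinsics, pose and plane parameters (all values are illustrative assumptions, not taken from the paper) and verifies that it maps the projection of a point lying on the plane from camera 1 to camera 2 up to scale.

import numpy as np

# Illustrative intrinsics and relative pose (values are made up for this sketch).
A1 = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
A2 = np.array([[780.0, 0, 330], [0, 780.0, 235], [0, 0, 1]])
theta = np.deg2rad(10.0)
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])   # rotation from C1 to C2
t = np.array([[0.2], [0.0], [0.05]])                 # translation from C1 to C2
N = np.array([[0.0], [0.0], [1.0]])                  # unit plane normal in C1
d = 2.0                                              # distance from C1 origin to the plane
n = N / d

# Plane-induced homography of Eqn. (2), up to the scalar lambda.
H = A2 @ (R + t @ n.T) @ np.linalg.inv(A1)

# A 3D point on the plane (here Z = d in C1) and its two projections.
X1 = np.array([[0.3], [-0.2], [d]])                  # point expressed in C1
m1 = A1 @ X1; m1 /= m1[2]
X2 = R @ X1 + t                                      # same point expressed in C2
m2 = A2 @ X2; m2 /= m2[2]

m2_from_H = H @ m1; m2_from_H /= m2_from_H[2]
print(np.allclose(m2, m2_from_H))                    # True: Eqn. (1) holds up to scale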
Fig. 1. Homography between two views
3 Multi-camera Calibration Based on Pair-Wise Estimation

In most multiview applications, the intrinsic and distortion parameters of the cameras are fixed. Therefore it is reasonable to estimate these parameters for each camera independently and elaborately, so as to achieve accurate and stable calibration. We apply Zhang's method [7] to do the intrinsic calibration for each individual camera. Thus only the external calibration is necessary every time the cameras are moved and refocused to capture a new 3D video. A planar pattern such as a checkerboard is widely used in calibration due to its flexibility and convenience. The main drawback of using the planar pattern is that it cannot be made visible to all cameras simultaneously. However, we only use the planar pattern to estimate the relative relationship between two neighboring cameras. It is practical to make the pattern visible to both cameras, because in most multiview systems two neighboring cameras generally have sufficient common FOV. As for global registration, the transform from one camera to another can be easily computed by chaining the associated neighboring transforms together. It could be argued that the chaining procedure is prone to errors. However, accurate and stable pair-wise calibration benefits the accuracy of chaining. Our experiments show that there is no obvious error accumulation during transform chaining using the proposed pair-wise estimation method. An easy way to do the pair-wise (R, t) estimation is to utilize the single-camera calibration results. Intrinsic calibration [7] can also recover the positions and orientations of the model planes relative to the camera, such as (R1, t1) and (R2, t2) in Fig. 1. Using this information, (R, t) between two cameras can be determined through an
arbitrarily chosen model plane. Ideally, (R, t) between cameras should be invariant irrespective of the plane through which they are computed. However, in the presence of noise, the (R, t) estimates computed through different planes actually differ from each other. Simply combining these estimates is not robust and may lead to erroneous or unstable calibration results. Based on homography (2), we propose a robust pair-wise estimation method to recover the relative relationship between two cameras. First, the homography is estimated from point correspondences, followed by the calculation of the unknown scale λ and the plane normal n. Finally, (R, t) between the two cameras can be estimated by four different algorithms: the Linear, Two-step, Nonlinear and Three-step methods.

3.1 Homography Estimation

With sufficient point correspondences, the homography matrix can be computed based on (1). The algorithm described in [7] is applied to do the homography estimation. As shown in Fig. 1, each image pair, one view from camera 1 and the other from camera 2, leads to a homography H. Suppose there are P image pairs in total; we can then estimate P homographies Hk (k = 1, 2, …, P) induced by the different planes.
Fig. 2. Geometry between the model plane and camera center
3.2 Calculation of n and λ

The plane normal n also varies with the moving pattern; thus P different stereo views lead to P different normals nk (k = 1, 2, …, P). To compute each plane normal w.r.t. the C1 coordinate system, we first use Zhang's method [7] to estimate the plane position and orientation relative to C1, i.e. (R1, t1). Let us express R1 by means of its column vectors [r1, r2, r3]. As shown in Fig. 2, the third column vector r3 is a unit vector parallel to the plane normal, therefore N = r3. The translation t1 is a vector connecting the C1 origin to a specific point on the plane π. Since N is orthogonal to the plane π, the distance d from the C1 origin to the plane can be computed by

    d = |t1| cos θ = |N| |t1| cos θ = N · t1,    (3)

where θ denotes the angle between N and t1. Therefore the plane normal n can be calculated from N and t1 as

    n = N / d = N / (N · t1).                    (4)
The collineation matrix R + t·n^T has an important property: its median singular value is equal to one [8]. This property can be used to compute the unknown scalar λ. Let us define G = λ A2^{-1} H A1; following Eqn. (2) we have

    G = R + t·n^T.                               (5)

Let σ1, σ2, σ3 denote the singular values of the matrix A2^{-1} H A1 in descending order (σ1 ≥ σ2 ≥ σ3 ≥ 0). According to (5), the matrix λ A2^{-1} H A1 has median singular value equal to one; thus we have

    λ σ2 = 1,  and  λ = 1 / σ2.                  (6)

Note that the matrices A1, A2 and H are known, so we can compute λ according to (6) and then recover the matrix G.
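A minimal NumPy sketch of Eqns. (5)-(6) is given below. It assumes a noise-free homography known only up to scale and recovers λ from the median singular value; the function name and the synthetic values are our own, not from the paper.

import numpy as np

def normalize_homography(H, A1, A2):
    # Sketch of Eqns. (5)-(6): A2^-1 H A1 equals (R + t n^T)/lambda, and
    # R + t n^T has median singular value 1, so lambda = 1/sigma_2.
    W = np.linalg.inv(A2) @ H @ A1
    sigma = np.linalg.svd(W, compute_uv=False)   # descending order
    lam = 1.0 / sigma[1]
    return lam * W, lam

# Synthetic, noise-free check with made-up values: G should equal R + t n^T.
A1 = np.diag([800.0, 800.0, 1.0]); A2 = np.diag([760.0, 760.0, 1.0])
c, s = np.cos(0.2), np.sin(0.2)
R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
t = np.array([[0.1], [0.05], [0.02]]); n = np.array([[0.0], [0.3], [0.4]])
H = 3.7 * A2 @ (R + t @ n.T) @ np.linalg.inv(A1)   # homography known only up to scale
G, lam = normalize_homography(H, A1, A2)
print(np.allclose(G, R + t @ n.T))                 # True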
3.3 (R, t) Estimation

3.3.1 Linear Method
From Eqn. (5), we can derive the following linear equation (7), where vec(X) denotes the vectorization of matrix X formed by stacking the columns of X into a single column vector, I3 and I9 denote the 3×3 and 9×9 identity matrices, respectively, and ⊗ denotes the Kronecker product:

    [I9  n ⊗ I3] [vec(R); t] = vec(G).           (7)

As described earlier, each stereo view of the pattern can generate a homography Hk and a normal nk, and from each the normalized matrix Gk can then be recovered. As each pair of Gk and nk forms an equation (7), by stacking the P equations we have

    [I9  n1 ⊗ I3; … ; I9  nP ⊗ I3] [vec(R); t] = [vec(G1); … ; vec(GP)].    (8)

Let D denote the left 9P×12 matrix and d denote the right 9P×1 vector. The least-squares solution to (8) is given by

    [vec(R); t] = (D^T D)^{-1} D^T d.            (9)
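The following NumPy sketch illustrates Eqns. (7)-(9): it stacks one 9×12 block per view, solves the least-squares system for vec(R) and t, and then projects R onto the closest rotation with an SVD (a common choice; the paper follows the procedure of [7]). Variable and function names are ours.

import numpy as np

def linear_Rt(Gs, ns):
    # Eqns. (7)-(9): stack one 9x12 block per view and solve for vec(R) and t.
    D_blocks, d_blocks = [], []
    for G, n in zip(Gs, ns):
        n = n.reshape(3, 1)
        D_blocks.append(np.hstack([np.eye(9), np.kron(n, np.eye(3))]))
        d_blocks.append(G.reshape(-1, order='F'))      # vec() stacks columns
    D = np.vstack(D_blocks)
    d = np.concatenate(d_blocks)
    x, *_ = np.linalg.lstsq(D, d, rcond=None)
    R = x[:9].reshape(3, 3, order='F')
    t = x[9:].reshape(3, 1)
    # Orthogonal approximation of R (SVD-based projection onto the rotations).
    U, _, Vt = np.linalg.svd(R)
    R_hat = U @ Vt
    return R, R_hat, t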
Because of noise in the data, the computed matrix R does not in general satisfy the orthogonality property of a rotation matrix. We need to find the best orthogonal matrix R̂ to approximate R. The method described in [7] is adopted here. However, the orthogonal approximation causes a severe problem here. The (R, t) computed by (9) is the best solution to equation (8) in the least-squares sense. After orthogonal approximation, the obtained (R̂, t) no longer fit this equation well and may lead to erroneous calibration results. Therefore it is necessary to impose the orthogonality constraint in the (R, t) estimation procedure so that the matrix R is as close to orthogonal as possible; consequently, less deviation will be caused by the orthogonal approximation from R to R̂.

3.3.2 Two-Step Method with Implicit Orthogonal Constraint
In this section, we first derive an implicit constraint imposed on the vector t based on the homography and the orthogonality of matrix R. Then a Two-step method is proposed to estimate the pair-wise (R, t), where the implicit orthogonal constraint is imposed, leading to better calibration results compared with the linear method.

Implicit Orthogonal Constraint. Following Eqn. (5) we have

    gi = ri + ti n,   i = 1, 2, 3,               (10)

where gi and ri (i = 1, 2, 3) denote the three row vectors of G and R, respectively, and ti is the i-th element of t. As matrix R is orthogonal, the three row vectors ri (i = 1, 2, 3) form an orthonormal basis of R3, i.e., we have

    ri^T ri = 1,  i = 1, 2, 3;    ri^T rj = 0,  i ≠ j.                     (11)

Note that ri = gi − ti n (i = 1, 2, 3); we then have

    gi^T gi − 2 ti n^T gi + ti^2 n^T n = 1,  i = 1, 2, 3;
    gi^T gj − ti n^T gj − tj n^T gi + ti tj n^T n = 0,  i ≠ j.             (12)

By eliminating the terms involving n, we can derive the quadratic equation (13) with one unknown quantity k = ti / tj. Note that Eqn. (13) no longer involves the normal n, indicating less noise disturbance:

    (gj^T gj − 1) k^2 − 2 (gi^T gj) k + (gi^T gi − 1) = 0,   i, j = 1, 2, 3,  i ≠ j.    (13)

Two-Step Method. The proposed Two-step method is based on the implicit orthogonal constraint derived above. At the first step, we gather P such equations as (13) corresponding to the different G matrices and compose simultaneous quadratic equations. Solving this problem in the least-squares sense, we obtain the uniform ratio of the three elements of the t vector, t1 : t2 : t3 = 1 : a : b. Therefore, the original 3-DOF (degrees of freedom) t vector is reduced to a single scale s as

    t = s [1, a, b]^T.                           (14)

Based on (14), we rewrite Eqn. (7) as (15). At the second step, we solve the simultaneous linear equations generated by stacking P such equations as (15). Once s is estimated, the vector t is readily computed by (14):

    [I9  (n ⊗ I3)[1, a, b]^T] [vec(R); s] = vec(G).                        (15)
The Two-step method imposes the implicit orthogonal constraint (13) in the estimation explicitly, while keeping the problem linear. The (R, t) estimated by this method not only conform to the homography geometry, but also satisfy the orthogonality constraint better. Still, R is not perfectly orthogonal and a further orthogonal approximation is necessary. However, less deviation will be induced by the approximation, because R is much closer to its corresponding orthogonal approximation R̂.

3.3.3 Nonlinear Method
If we expect the estimated matrix R to be orthogonal without further orthogonal approximation, we should impose the constraint R^T R = I explicitly in the estimation, and it becomes a constrained nonlinear optimization problem:
    min over (R, t) of  Σ_{k=1..P} || Gk − (R + t·nk^T) ||_F,   subject to  R^T R = I,    (16)
where the optimum is in the sense of the smallest sum of Frobenius norms. We may use Lagrange multipliers to solve the constrained problem, but a better choice is to utilize the angle-axis representation of rotation. As we know, in three dimensions a rotation can be defined by a single angle of rotation θ and the direction of a unit vector ω = [ωx, ωy, ωz]^T about which to rotate. Thus the rotation matrix R can be written as

    R = [ cos θ + ωx^2 (1 − cos θ)        ωx ωy (1 − cos θ) − ωz sin θ    ωx ωz (1 − cos θ) + ωy sin θ ]
        [ ωy ωx (1 − cos θ) + ωz sin θ    cos θ + ωy^2 (1 − cos θ)        ωy ωz (1 − cos θ) − ωx sin θ ]
        [ ωz ωx (1 − cos θ) − ωy sin θ    ωz ωy (1 − cos θ) + ωx sin θ    cos θ + ωz^2 (1 − cos θ)    ].   (17)
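For reference, a small Python helper equivalent to Eqn. (17), written through the skew-symmetric matrix of ω (Rodrigues' form, which also reappears in Eqn. (21) below); the axis and angle used in the check are arbitrary.

import numpy as np

def rotation_from_axis_angle(w, theta):
    # Axis-angle rotation; w is assumed to be a unit 3-vector.
    w = np.asarray(w, dtype=float).reshape(3)
    W = np.array([[0.0, -w[2], w[1]],
                  [w[2], 0.0, -w[0]],
                  [-w[1], w[0], 0.0]])
    return np.eye(3) + np.sin(theta) * W + (1.0 - np.cos(theta)) * (W @ W)

R = rotation_from_axis_angle([0.0, 0.0, 1.0], np.pi / 6)
print(np.allclose(R.T @ R, np.eye(3)), np.isclose(np.linalg.det(R), 1.0))   # True True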
Then we substitute this compact representation into the minimization term of (16), and solve the nonlinear problem with the Levenberg-Marquardt algorithm as implemented in Minpack [9]. The required initial guess of (R, t) can be obtained either by the linear method or by the Two-step method. Experimental results show that the matrix R estimated by the nonlinear method is already orthogonal, thus further orthogonal approximation is unnecessary, consequently avoiding the problem caused by it.

3.3.4 Three-Step Method
Although the nonlinear method is expected to achieve the best result, it is indeed time-consuming. Another choice is to make a reasonable trade-off, i.e., to develop a method that has much less computational complexity and can meanwhile achieve comparable performance. By a good calibration result, we mean that the (R, t) should not only conform to the homography geometry, but also satisfy the orthogonality constraint. Though we cannot obtain this kind of result completely by a linear method, it is possible to achieve a good (R, t) estimation with close performance through several linear optimization steps. As mentioned before, a rotation matrix in three dimensions can be represented by a unit vector ω = [ωx, ωy, ωz]^T and an angle θ. According to Euler's rotation theorem, the 3×3 rotation matrix R has one real eigenvalue equal to unity, and the unit vector ω is the corresponding eigenvector, i.e.

    R ω = ω.                                     (18)

It follows from (5) and (18) that

    (Gk − I − t·nk^T) ω = 0.                     (19)
If we know the matrices Gk (k = 1, …, P) and t, the vector ω can be estimated by solving the linear equation (20), which is the accumulation of P such equations as (19):

    B ω = 0,   with  B = [G1 − I − t·n1^T; … ; GP − I − t·nP^T].           (20)

According to the Eckart–Young–Mirsky (EYM) theorem, the solution to Eqn. (20), written in the matrix form B ω = 0, is the right singular vector of B associated with its smallest singular value.
To recover the rotation matrix, we further estimate the parameter θ based on the rotation representation (21), where [ω]× indicates the 3×3 skew-symmetric matrix corresponding to ω:

    R = I + sin θ · [ω]× + (1 − cos θ) · [ω]×^2.                           (21)
In order to guarantee a linear optimization, we estimate two parameters α = sin θ and β = 1 − cos θ instead of the single θ, and the constraint α^2 + (1 − β)^2 = 1 is not imposed in the estimation. Experimental results show that the computed α and β basically satisfy this constraint. At this step, we may retain the original result of the vector t, or we can refine it together with α and β, while still keeping the optimization problem a linear one. In order to achieve robust results, we choose the latter scheme and estimate α, β and the vector t together by the linear equation (22), which is derived from (5) and (21):

    Gk − I = α [ω]× + β [ω]×^2 + t·nk^T.                                   (22)

By stacking P such equations, expressed in vectorized form, we have the linear system

    [vec([ω]×)  vec([ω]×^2)  n1 ⊗ I3; … ; vec([ω]×)  vec([ω]×^2)  nP ⊗ I3] [α; β; t] = [vec(G1 − I); … ; vec(GP − I)].    (23)
Based on the above description, the Three-step method is outlined as follows:
1. Use the Linear method or the Two-step method to compute the initial estimate of t;
2. Estimate the vector ω based on (20) with t fixed;
3. Estimate the parameters α, β and refine t together based on (23) with ω fixed.
With ω, α and β known, the rotation matrix R can be recovered by (21).
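A sketch of step 2 of the outline, under the relation reconstructed in Eqns. (19)-(20): the rotation axis is taken as the right singular vector of the stacked matrix B associated with its smallest singular value. Function and variable names are assumptions.

import numpy as np

def estimate_axis(Gs, ns, t):
    # Stack (G_k - I - t n_k^T) over all views and take the right singular
    # vector of B associated with the smallest singular value.
    t = t.reshape(3, 1)
    B = np.vstack([G - np.eye(3) - t @ n.reshape(1, 3) for G, n in zip(Gs, ns)])
    _, _, Vt = np.linalg.svd(B)
    w = Vt[-1]                     # defined up to sign
    return w / np.linalg.norm(w)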
4 Experimental Results

4.1 MPLD
The mean point-line distance (MPLD) is used as the metric to evaluate the extrinsic calibration results. Suppose two cameras capture the same scene from different views, and a pair of images I1 and I2 is obtained. If the two cameras are calibrated both intrinsically and extrinsically, the associated fundamental matrix F can be recovered by

    F = A2^{-T} [t]× R A1^{-1}.                  (24)
With F known, given a point m1 in I1, we can determine the corresponding epipolar line l1 in I2. Ideally, the correspondence of m1 in I2, i.e. m2, should be located on the epipolar line l1. However, it may deviate from l1 due to data noise or inaccurate calibration. Therefore the distance of m2 to l1 can be used reasonably to evaluate the calibration performance. The mean point-line distance is computed over all the pattern corners of all the test images.

4.2 Simulated Data
This simulation is to evaluate the performance of the different (R, t) estimation algorithms, especially their robustness to data noise. We consider the scenario that two cameras capture a planar pattern with 9×12 corners. In total, 20 different stereo views of the plane are used. We specify the translations and rotations from the plane to the first camera and from the first camera to the second one. Gaussian noise is added to the true projected image points (we assume no distortion), with a standard deviation from 0.05 pixels up to 0.45 pixels in steps of 0.05. Fig. 3 shows the mean of MPLD from 50 simulations by the different algorithms. We also give the simulation results of Zhang's method [7] for comparison. As Zhang's method can estimate the transformation between two cameras from a single stereo image pair, in our experiment we simply take the mean of the (R, t) estimates from different image pairs and denote this method as Zhang_mean. For both the experiments here and those in Section 4.3, the required t for the Three-step method is initialized by the Two-step method.
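A possible NumPy implementation of the MPLD metric is sketched below; it assumes F follows the convention m2^T F m1 = 0 of Eqn. (24) and that the corner correspondences are given as N×2 arrays. Names are ours.

import numpy as np

def mean_point_line_distance(F, pts1, pts2):
    # For each correspondence, distance of m2 to the epipolar line F m1 in I2.
    m1 = np.hstack([pts1, np.ones((len(pts1), 1))])   # homogeneous, N x 3
    m2 = np.hstack([pts2, np.ones((len(pts2), 1))])
    lines = (F @ m1.T).T                              # one epipolar line per row
    num = np.abs(np.sum(lines * m2, axis=1))          # |l^T m2|
    den = np.linalg.norm(lines[:, :2], axis=1)        # sqrt(a^2 + b^2)
    return np.mean(num / den)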
Fig. 3. Simulation results of the different methods: mean MPLD (pixel) versus the standard deviation of the Gaussian noise (pixel), for Zhang_mean, Linear, Two-step, Nonlinear and Three-step
As shown in Fig. 3, the performance of the original linear method is bad, mainly due to the orthogonal approximation problem. With the implicit orthogonal constraint imposed, the Two-step method performs much better, although still not satisfactorily. As
expected, the nonlinear method achieves the best results in the sense of robustness against data noise. The Three-step method has very close performance and meanwhile much lower complexity. In a rough test on a PC with a 1.83 GHz Duo CPU and 1.5 GB RAM, the nonlinear (R, t) estimation procedure may take dozens of seconds, while the time consumed by all the other algorithms stays within 10 ms. Both the nonlinear and Three-step methods are less sensitive to noise compared with Zhang_mean.

4.3 Real Data
Experiments are also performed using real data captured by our multi-camera system, which consists of five Prosilica GC650C cameras with 8 mm lenses. The cameras are indexed from 0 to 4 sequentially, thus there are four neighboring pairs in total: (0-1), (1-2), (2-3) and (3-4). In the experiment, the five cameras are placed focusing on the same scene. The angle between the optical axes of two neighboring cameras is approximately 30 degrees. Beforehand, intrinsic calibration was done for the five cameras individually using Zhang's method [7]. The proposed method is then applied to recover the extrinsic relationship among the cameras. A checkerboard pattern with 9×12 corners is presented to the four camera pairs. For each pair, 30 different stereo views are captured. To verify the calibration results, another checkerboard with 6×9 corners is used and similarly 30 views are captured for each camera pair. Calibration is performed by the different (R, t) estimation algorithms, while using the same training and testing images. As shown in Table 1, the experimental results on real data are similar to those on the synthesized data. Both the nonlinear and Three-step methods outperform Zhang_mean. Note that for all methods here, we do not apply any processing to remove possible data outliers. Better results can be expected by adopting such a procedure.

Table 1. Results of pair-wise (R, t) estimation: MPLD (pixel) by the different methods

  Camera Pairs | Zhang_mean | Linear | Two-step | Nonlinear | Three-step
  0-1          | 0.0825     | 0.5297 | 0.0851   | 0.0781    | 0.0799
  1-2          | 0.0814     | 0.3526 | 0.2400   | 0.0655    | 0.0617
  2-3          | 0.0692     | 0.3359 | 0.0610   | 0.0488    | 0.0540
  3-4          | 0.0769     | 0.4112 | 0.1505   | 0.0654    | 0.0693
  Average      | 0.0775     | 0.4074 | 0.1342   | 0.0644    | 0.0662
To investigate the error induced by transform chaining, we test the calibration results of the camera pairs (0-2), (0-3) and (0-4). The (R, t) between these camera pairs is computed by chaining the (R, t) results across the neighboring pairs (0-1), (1-2), (2-3) and (3-4). As shown in Fig. 4, there is no obvious error accumulation during the (R, t) chaining procedure, especially when the pair-wise calibration results are sufficiently accurate and stable, as estimated by the Three-step or nonlinear method. Clearly the calibration results can be further improved by using a more robust registration method such as [6], rather than the simple chaining used here.
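The chaining itself reduces to composing rigid transforms; the sketch below (names are ours) accumulates the neighboring (R, t) estimates so that pair (0-k) is obtained from pairs (0-1) … (k-1, k).

import numpy as np

def chain_transforms(pairwise):
    # pairwise[k] maps camera k -> camera k+1; the output maps camera 0 -> camera k+1.
    R_acc, t_acc = np.eye(3), np.zeros((3, 1))
    chained = []
    for R, t in pairwise:
        R_acc = R @ R_acc
        t_acc = R @ t_acc + t.reshape(3, 1)
        chained.append((R_acc.copy(), t_acc.copy()))
    return chained                 # [(R_01, t_01), (R_02, t_02), ...]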
Fig. 4. Results of transform chaining: MPLD (pixel) for the camera pairs (0-1), (0-2), (0-3) and (0-4), comparing Zhang_mean, nonlinear and Three-step
5 Conclusion

In this paper, we present a convenient and efficient method to calibrate a typical multi-camera system. The relative relationship between neighboring cameras is first recovered by the proposed pair-wise estimation method. The complete multi-camera calibration is then accomplished by chaining these pair-wise estimates together. Four estimation algorithms are proposed based on homography. The linear method does not impose any constraint and thus leads to inaccurate calibration results due to the problem caused by the orthogonal approximation. Better results are obtained by the Two-step method with the implicit orthogonal constraint imposed. The orthogonal constraint is fully imposed in the nonlinear method, which achieves the best calibration results in the sense of noise robustness. The Three-step method has very close performance to the nonlinear method, while having much lower computational complexity. Extra outlier removal and a more robust registration method may further improve the accuracy and stability of the external calibration.
References

1. Sturm, P., Triggs, B.: A Factorization Based Algorithm for Multi-Image Projective Structure and Motion. In: European Conference on Computer Vision, pp. 709–720 (1996)
2. Ueshiba, T., Tomita, F.: Plane-based Calibration Algorithm for Multi-camera Systems via Factorization of Homography Matrices. In: International Conference on Computer Vision, vol. 2, pp. 966–973 (2003)
3. Svoboda, T., Martinec, D., Pajdla, T.: A Convenient Multicamera Self-calibration for Virtual Environments. PRESENCE: Teleoperators and Virtual Environments 14(4) (2005)
4. Prince, S., Cheok, A.D., Farbiz, F., Williamson, T., Johnson, N., Billinghurst, M., Kato, H.: 3D Live: Real Time Captured Content for Mixed Reality. In: International Symposium on Mixed and Augmented Reality, pp. 307–317 (2002)
5. Chen, X., Davis, J., Slusallek, P.: Wide Area Camera Calibration Using Virtual Calibration Objects. Computer Vision and Pattern Recognition 2, 520–527 (2000)
6. Ihrke, I., Ahrenberg, L., Magnor, M.: External Camera Calibration for Synchronized Multi-video Systems. Journal of WSCG 12 (2004)
7. Zhang, Z.Y.: A Flexible New Technique for Camera Calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(11), 1330–1334 (2000)
8. Zhang, Z., Hanson, A.R.: Scaled Euclidean 3D Reconstruction Based on Externally Uncalibrated Cameras. In: IEEE Symposium on Computer Vision, pp. 37–42 (1995)
9. Moré, J.: The Levenberg–Marquardt Algorithm: Implementation and Theory. In: Watson, G.A. (ed.) Numerical Analysis. Lecture Notes in Mathematics, vol. 630, Springer, Heidelberg (1977)
Fast Automatic Compensation of Under/Over-Exposured Image Regions Vassilios Vonikakis and Ioannis Andreadis Democritus University of Thrace, Department of Electrical and Computer Engineering, Laboratory of Electronics, Section of Electronics and Information Systems Technology, Vas. Sofias, GR-67100 Xanthi, Greece {bbonik, iandread}@ee.duth.gr http://electronics.ee.duth.gr
Abstract. This paper presents a new algorithm for spatially modulated tone mapping in Standard Dynamic Range (SDR) images. The method performs image enhancement by lightening the tones in the under-exposured regions while darkening the tones in the over-exposured, without affecting the correctly exposured ones. The tone mapping function is inspired by the shunting characteristics of the center-surround cells of the Human Visual System (HVS). This function is modulated differently for every pixel, according to its surround. The surround is calculated using a new approach, based on the oriented cells of the HVS, which allows it to adapt its shape to the local contents of the image and, thus, minimize the halo effects. The method has low complexity and can render 1MPixel images in approximately 1 second when executed by a conventional PC. Keywords: Image Enhancement, Tone Mapping, Human Visual System.
1 Introduction Important differences often exist between the direct observation of a scene and the captured digital image. This comes as a direct result of the low dynamic range of the capturing device, compared to the dynamic range of natural scenes. Conventional SDR images (8 bits/channel) cannot acceptably reproduce High Dynamic Range (HDR) scenes, which are usual in outdoor conditions. As a result, recorded images suffer from loss in clarity of visual information within shadows (under-exposured regions), or near strong light sources (over-exposured regions). A straight-forward solution to this problem is the use of HDR capturing devices instead of the conventional SDR ones. Nevertheless, HDR cameras cannot always provide a practical solution. Their increased cost has limited their use, while the majority of the existing vision systems are already designed to use SDR cameras. Another possible solution is to acquire an HDR image by combining multiple SDR images with varying exposures [1]. However efficient, this solution is by definition time-consuming and thus cannot be used in time-critical applications. Consequently, an unsupervised tone-enhancement algorithm for the under/over-exposured regions of SDR images could at least partially solve the problem, by enhancing the visual information of those regions, while minimally affecting the others.
Many algorithms have been proposed in this direction over the past decades. The most important of all is the Retinex family of algorithms. Retinex was first presented by Edwin Land in 1964 [2] and was motivated by some attributes of the HVS, which also defined its name (Retina & Cortex). The initial algorithm inspired several others and until today the most widespread version of Retinex is the Multi Scale Retinex with Color Restoration (MSRCR) [3], which has been extensively used by NASA and has been established in the market as commercial enhancement software (PhotoFlair). In MSRCR, the new pixel values are given by the logarithmic ratio of the initial value and a weighted average of its surround. Gaussian surround functions of three different scales are used in order to simultaneously achieve local contrast enhancement and tonal rendition. Recently, a new algorithm that has some common attributes with Retinex, called ACE (Automatic Color Equalization), has been reported in [4]. It uses a form of lateral inhibition mechanism, which adjusts every pixel value according to the local and global contents of the image. The new pixel values are then scaled to the dynamic range of the channel (0-255). As a result, it enhances the under/over-exposured image regions while achieving White-Patch and Gray-World color correction. The main weakness of both algorithms is their computational burden. This derives from the convolution of the image with Gaussians of radii up to 240 pixels for the MSRCR, or the participation of a great number of pixels for the ACE, respectively. Additionally, halo effects tend to appear in regions where strong intensity transitions exist, degrading the final output of the algorithms. The thrust of the proposed method is to enhance the tones in the under/over-exposured regions of SDR images, with minimal computational burden and without affecting the correctly exposured ones. This is achieved by a spatially modulated tone mapping in which the main tone mapping function is inspired by the shunting characteristics of the center-surround cells of the HVS. The calculation of the surround is based on a new approach inspired by the oriented cells of the HVS. According to this, the shape of the surround function is not constant and it is adapted to the local intensity distribution of the image. As a result, it avoids the strong intensity transitions, which lead to the emergence of halo effects. The results of the proposed method outperform the existing ones in this category. The execution time of the algorithm is lower than that of the other algorithms, and it can render 1-MPixel images in approximately 1 sec when executed by a conventional PC. The rest of the paper is organized as follows: Section 2 presents a detailed description of the proposed algorithm. Section 3 demonstrates the experimental results. Finally, concluding remarks are made in section 4.
2 Description of the Method Fig. 1 shows the block diagram of the proposed method. The input image is first converted to the YCbCr space, in order to decorrelate the chromatic and achromatic information. The algorithm enhances only the Y component. The enhanced component Yout is used in the inverse color-space transformation to produce the final image.
Fig. 1. The block diagram of the proposed method
2.1 Orientation Map In order to design a surround function that is adaptable to the local distribution of the intensity, the orientation map OM of the Y component must be calculated. Instead of intensity values, OM contains the outputs of a set of orientation elements, similar to the orientation cells of the HVS. The orientation elements are a set of 60 binary kernels (Fig. 2a). Their size is 10×10 pixels and they are applied to non-overlapping regions of the Y component. Consequently, for an image of size py×px, the size of OM will be (py/10)×(px/10). Every kernel K has two distinct parts (AK and BK). The two parts form a step edge in 12 different orientations (every 15°) and in all possible phases within the 10×10 kernel.
Fig. 2. a. The 60 kernels and the 44th kernel in magnification. b. The Y component of the original image. c. The orientation map OM of image b.
Let (i, j) be the coordinates of the upper-left pixel of one of the non-overlapping regions of the Y component in which the kernels are applied. Then OM is calculated as follows:

    outKu,v = MA,Ku,v − MB,Ku,v,   u = i/10,  v = j/10,  u, v ∈ Z                               (1)

    MA,Ku,v = (1/NA,K) Σy=i..i+9 Σx=j..j+9 Yy,x  ∀ Yy,x ∈ AK,
    MB,Ku,v = (1/NB,K) Σy=i..i+9 Σx=j..j+9 Yy,x  ∀ Yy,x ∈ BK                                    (2)

    OMAu,v = MA,K'u,v,   OMBu,v = MB,K'u,v,   where K' maximizes outKu,v over K = 1, …, 60      (3)
where K is the number of the kernel, MA,Ku,v and MB,Ku,v are the mean intensities of AK and BK, respectively, NA,K and NB,K are the number of pixels of AK and BK, respectively, and u and v are the coordinates of the OM. Equation (3) selects the kernel K' with the highest output, whose phase and orientation match the local contents of region (i, j). Every position (u, v) of the OM contains the number of the winner kernel K' and the average intensity values MA,K'u,v and MB,K'u,v. Fig. 2b and Fig. 2c depict the original image and its OM, respectively. For every pixel Yn,m two surround values are calculated: one for the Y component (S1,n,m) and one for the OM (S2,n,m). The final surround value Sn,m is calculated by an interpolation between the two values and will be discussed later. The surround, in the center of which the pixel Yn,m is located, was selected to be square, with a side of 51 pixels. The S1,n,m surround is the classic surround that has also been used in [3,4]. In these cases, a weighting function (Gaussian for MSRCR, Euclidean or Manhattan for ACE) is used to determine the weight of every pixel in connection to its distance from the central pixel. In the proposed method, for simplicity, no such function was used, and the S1,n,m surround is a simple average, as described by equation (4):

    S1,n,m = (1/51^2) Σy=n−25..n+25 Σx=m−25..m+25 Yy,x                                          (4)
The S1,n,m surround tends to extract unwanted halo effects when calculated in a region with a sharp intensity transition (Fig. 3a). For this reason, the S2,n,m surround is also calculated, which adapts its size to the local intensity distribution of the image. The S2,n,m surround belongs to a region H of 5×5 kernels in the OM, whose central kernel is the one in which pixel Yn,m is located (Fig. 3b). Region H is segmented into two distinct regions E1 and E2 (Fig. 3c) by a threshold thH. These regions define the value of S2,n,m, as the following equations indicate:

    OMHmax = max[OMGz],   OMHmin = min[OMGz],   ∀ G ∈ {A, B},  z ∈ H                            (5)

    thH = (OMHmax + OMHmin) / 2                                                                 (6)

    ME1 = (1/NE1) Σ OMGz  for OMGz < thH,       ME2 = (1/NE2) Σ OMGz  for OMGz ≥ thH            (7)

    S2,n,m = ME1  if Yn,m < thH;      S2,n,m = ME2  if Yn,m ≥ thH                               (8)
where NE1 and NE2 are the total number of kernel parts (A or B) that constitute regions E1 and E2, respectively. Equation (5) defines the maximum and the minimum of all the kernel parts that belong to region H. These two extremes determine the threshold thH, in equation (6), which is necessary to segment region H into E1 and E2. Equation (7) shows that ME1 and ME2 are the mean intensities of E1 and E2, respectively. The final value of the surround S2,n,m is determined by the value of the central pixel Yn,m. If Yn,m is lower than the threshold, it belongs to region E1 (the darker region) and thus S2,n,m acquires the value of ME1. On the contrary, if Yn,m is higher than the threshold, it belongs to region E2 (the brighter region) and thus S2,n,m acquires the value of ME2. Consequently, S2,n,m is allowed to obtain values only from one region but never from both, and thus it does not contribute to the formation of halo effects. The difference difH between OMHmax and OMHmin is the factor that determines the final surround value of pixel Yn,m by interpolation:

    difH = OMHmax − OMHmin,   difH ∈ [0, 255]                                                   (9)

    Sn,m = [S1,n,m (255 − difH) + S2,n,m · difH] / 255                                          (10)
When there is not a sharp intensity transition located within the surround window of pixel Yn,m, difH has a low value and as a result, S1,n,m contributes more to the final surround value Sn,m. Alternatively, if a sharp intensity transition is located within the surround window, difH has a high value and S2,n,m is the main contributor to Sn,m.
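The per-pixel surround computation of Eqns. (4)-(10) can be sketched as follows in Python/NumPy. The construction of the 60 oriented kernels is omitted; the sketch assumes the orientation map is already available as two arrays OM_A and OM_B holding the two mean intensities stored in each OM cell, and it lets the window average adapt at the image borders.

import numpy as np

def blended_surround(Y, OM_A, OM_B, n, m):
    # S1: plain 51x51 average around (n, m) (Eqn. 4; border windows are smaller).
    win = Y[max(n - 25, 0):n + 26, max(m - 25, 0):m + 26]
    S1 = win.mean()
    # Region H: the 5x5 block of OM cells centred on the cell containing (n, m).
    u, v = n // 10, m // 10
    HA = OM_A[max(u - 2, 0):u + 3, max(v - 2, 0):v + 3]
    HB = OM_B[max(u - 2, 0):u + 3, max(v - 2, 0):v + 3]
    H = np.concatenate([HA.ravel(), HB.ravel()])
    OM_max, OM_min = H.max(), H.min()                       # Eqn. (5)
    th = 0.5 * (OM_max + OM_min)                            # Eqn. (6)
    ME1 = H[H < th].mean() if np.any(H < th) else OM_min    # Eqn. (7)
    ME2 = H[H >= th].mean()
    S2 = ME1 if Y[n, m] < th else ME2                       # Eqn. (8)
    dif = OM_max - OM_min                                   # Eqn. (9)
    return (S1 * (255.0 - dif) + S2 * dif) / 255.0          # Eqn. (10)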
Fig. 3. a. A pixel Yn,m and its surround S1,n,m in the Y component, located near an edge where halo effects may appear. b. The H region of pixel Yn,m in the OM. c. Segmentation of region H into regions E1 and E2. d. The position of Yn,m determines the value of S2,n,m.
2.2 Tone Mapping
Center-surround cells of the HVS have to encode extremely high dynamic range visual signals into an output of variable, yet finite frequency. This is succeeded by modulating their response according to the following equation [5]:
    F(x) = B·x / (A + x),   ∀ x ∈ [0, ∞),  A, B ∈ ℜ+                                            (11)
Equation (11) maps all inputs from [0,∞) to the interval [0,B] with a nonlinear degree, varying according to the A/B ratio. However, in the present application, all the inputs are bounded to the interval [0,255]. Fig. 4a depicts the graphical representation of equation (11) for inputs raging in the interval [0,B]. In this interval the maximum output of equation (11) is not constant (within the [0,B] interval) and depends on the A/B ratio. For this reason, equation (11) is transformed to equations (12) and (13), which retain a constant output, within the [0,B] interval, for all A/B ratios, as depicted in Fig. 4b and 4c.
    G(x) = (B + A)·x / (A + x),   ∀ x ∈ [0, B],  A, B ∈ ℜ+                                      (12)

    H(x) = A·x / (A + B − x),     ∀ x ∈ [0, B],  A, B ∈ ℜ+                                      (13)
Equations (12) and (13) can be used as adjustable mapping functions. Once B is defined according to the range of input data, varying A can result to different nonlinear curves, controlling the mapping between input and output. In the proposed method, the input range is [0,255] and thus B=255.
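A direct transcription of the two mapping curves (our function names), useful for experimenting with the effect of the parameter A:

import numpy as np

def G_map(x, A, B=255.0):
    # Eqn. (12): boosts low values, leaves high values almost unchanged.
    return (B + A) * x / (A + x)

def H_map(x, A, B=255.0):
    # Eqn. (13): compresses high values, leaves low values almost unchanged.
    return A * x / (A + B - x)

x = np.array([0.0, 64.0, 128.0, 192.0, 255.0])
print(G_map(x, A=50.0))   # strong lift of the dark tones, G(255) = 255
print(H_map(x, A=50.0))   # strong cut of the bright tones, H(0) = 0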
Fig. 4. a. Graphical representation of equation (11). b. Graphical representation of equation (12). c. Graphical representation of equation (13).
Equations (12) and (13) are the basis of the proposed tone mapping function. In order to have spatially modulated tone mapping, factor A is substituted by a function of the surround Sn,m and x by Yn,m. The equations of the method are:
    Yout n,m(Y, S) = [B + A(Sn,m)] · Yn,m / [A(Sn,m) + Yn,m]        if Sn,m < 128
    Yout n,m(Y, S) = A(Sn,m) · Yn,m / [A(Sn,m) + B − Yn,m]          if Sn,m ≥ 128                (14)

    A(Sn,m) = [Mdark + q(Sn,m)] · d(Sn,m)                           if Sn,m < 128
    A(Sn,m) = [Mbright + q(255 − Sn,m)] · d(255 − Sn,m)             if Sn,m ≥ 128                (15)

    q(x) = x^2 / lobe,   ∀ x ∈ [0, 128)                                                          (16)

    d(x) = 128 / (128 − x)                                                                       (17)
Equation (14) is the basic equation of the method. It combines equation (12) for surround values lower than 128, and equation (13) for surround values greater than 128. This is selected because under-exposured regions are darker than the middle of the range (i.e. 128) and need to increase their intensity. Equation (12) increases nonlinearly the lower intensity values, while barely affecting the higher ones (Fig. 5a). On the contrary, over-exposured regions are brighter than the middle of the range and need to decrease their intensity. Equation (13) decreases non-linearly the higher intensity values, while barely affecting the lower ones. Equation (15) is the modulation function that controls the non-linearity degrees of equation (14). It comprises the two adaptation characteristics of the HVS: the global and local adaptation. Global adaptation is expressed by the constants Mdark and Mbright which depend on the global statistics of the image. They determine the initial non-linearity degrees a* (for Sn,m=0) and b* (for Sn,m=255), respectively (Fig. 5a). Local adaptation is expressed by the use of equation (16) in equation (15), which determines the transition of the two nonlinearities a* and b* to the linearity a, in the middle of the surround axis (Sn,m=128). This transition is regulated by the local surround value Sn,m, since it is a local variable parameter, and the constant lobe, which depends on the global statistics of the image (Fig. 5b). Equation (17) is a necessary correction factor which ensures the smooth continuity of equation (14) in the middle of the surround axis (Sn,m=128), in the transition point between equations (12) and (13). If equation (17) is omitted from equation (15), equation (14) will not be smooth in the middle of the surround axis, and artifacts will be created in the mapping of input values with Sn,m≈128.
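A compact NumPy sketch of Eqns. (14)-(17) is given below; it clips the argument of d(·) just below 128 only to avoid a division by zero in the branch that is not selected, and otherwise follows the equations as reconstructed above. Names are ours.

import numpy as np

def tone_map(Y, S, M_dark, M_bright, lobe, B=255.0):
    # Y: luminance image, S: per-pixel surround, both in [0, 255].
    Y = Y.astype(float)
    S = S.astype(float)
    q = lambda x: x ** 2 / lobe                                   # Eqn. (16)
    d = lambda x: 128.0 / (128.0 - np.minimum(x, 127.999))        # Eqn. (17)
    dark = S < 128.0
    A = np.where(dark,
                 (M_dark + q(S)) * d(S),
                 (M_bright + q(255.0 - S)) * d(255.0 - S))        # Eqn. (15)
    out = np.where(dark,
                   (B + A) * Y / (A + Y),                         # Eqn. (12) branch of (14)
                   A * Y / (A + B - Y))                           # Eqn. (13) branch of (14)
    return out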
Fig. 5. a. 3D representation of equation (14). b. Different view-angle of a.
The coefficients Mdark, Mbright and lobe, are statistically determined by the Y component and adapt globally the mapping function to the input image, according to its dominant tones. The dominant tones of the image are defined by calculating the % percentage of pixels that belong to the interval [0, 85) for the dark tones, [85, 170) for the middle tones and [170, 255] for the light tones. These percentages indicate roughly whether the image is subjectively light, normal or dark. Mdark and Mbright determine the higher non-linearity degrees a* and b*, which can be applied in the underexposured and over-exposured regions, respectively (Fig. 5a). Their values are in the interval [10,255] and are inversely proportional to the % percentage of the dark tones (bin_low) for Mdark and light tones (bin_high) for Mbright.
    Mdark = 245 · (100 − bin_low) / 100 + 10                                                     (18)

    Mbright = 245 · (100 − bin_high) / 100 + 10                                                  (19)
A high percentage of dark tones indicates a globally dark image, resulting in a low Mdark value and a stronger non-linear correction for the dark regions. A high percentage of light tones indicates a globally bright image, resulting in a low Mbright value and a stronger non-linear correction for the bright regions. The coefficient lobe determines the shape of the transition between the two non-linearities (Fig. 5b). Its value is in the interval [1, 30] and is inversely proportional to the percentage of the middle tones (bin_middle):

    lobe = 29 · (100 − bin_middle) / 100 + 1                                                     (20)
Low lobe values limit the non-linearities to low surround values, leaving the middle tones intact. High lobe values allow the non-linearities to affect the middle tones as well.
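The three global coefficients can be computed directly from the histogram of the Y component, as in the following sketch (tone thresholds 85 and 170 as stated above; the function name is ours):

import numpy as np

def global_coefficients(Y):
    Y = np.asarray(Y, dtype=float).ravel()
    total = Y.size
    bin_low = 100.0 * np.count_nonzero(Y < 85) / total
    bin_middle = 100.0 * np.count_nonzero((Y >= 85) & (Y < 170)) / total
    bin_high = 100.0 * np.count_nonzero(Y >= 170) / total
    M_dark = 245.0 * (100.0 - bin_low) / 100.0 + 10.0       # Eqn. (18)
    M_bright = 245.0 * (100.0 - bin_high) / 100.0 + 10.0    # Eqn. (19)
    lobe = 29.0 * (100.0 - bin_middle) / 100.0 + 1.0        # Eqn. (20)
    return M_dark, M_bright, lobe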
3 Experimental Results The proposed method was compared to MSRCR and ACE. These algorithms were selected because they belong to the same category as the proposed method and they are also inspired by the HVS. The proposed method was implemented in C. All algorithms were executed by an Intel Pentium 4 at 3 GHz, under Windows XP. The implementation of the MSRCR that was used in the evaluation is the commercial software PhotoFlair [6]. The parameters of the MSRCR are the default ones as reported in [3] (3 scales of 5, 20 and 240 pixels with equal weights, a color restoration process and no auto-level post-processing). The implementation of ACE can be found in [7]. The parameters used in the testing are the defaults: sub-sampling factor = 8, slope = 5, distance alpha = 0.01, and a dynamic tone reproduction scaling of WP+GW. The implementation of the proposed algorithm is available in [8]. All the results of the proposed method were derived with the parameters described in section 2.2, without any manual tuning. Fig. 6 depicts the results of the algorithms for three images with strong under-exposured or over-exposured regions. In the first image, the proposed method extracts more details in the under-exposured region. MSRCR and ACE are affected by the strong intensity transition and fail to extract details in some of the under-exposured regions. The second image has both an under-exposured and an over-exposured region. MSRCR is heavily affected by the sharp intensity transition and extracts a strong halo effect. ACE tends to lighten the correctly exposured regions. The proposed method does not extract a halo effect and is the only one that compensates for the over-exposured region of image 2. The proposed method has a faster execution time compared to the other two methods and needs approximately 1 sec for the
Fig. 6. Line 1: Results for image 1. Line 2: Magnified portion of image 1. Line 3: Results for image 2. Halo effects are marked with "H". Line 4: Magnified portion of image 2. Compensated over-exposured area is marked with a circle. Line 5: Results for image 3. Line 6: Magnified portion of image 3.
rendition of a 1-MPixel image. Image 3 has both under/over-exposured regions. It is clear that the proposed method compensates better for both these regions, in comparison to the other methods.
Fig. 7. Naturalness and colorfulness of the images of Fig. 8
Fig. 8. The four images that were used in the experiment of Fig. 7 and the results of the compared methods
The main objective of the proposed method is to enhance the original image and produce a better-looking image for the human observer. For this reason, two image quality metrics, which were proposed after psychophysical experimentation with human observers, were used in the evaluation. The first metric is naturalness, which is the degree of correspondence between human perception and the real world; it has a value in the interval [0,1], with 1 accounting for the most natural image. The second metric is colorfulness, which represents the degree of color vividness of an image. High colorfulness values indicate high color vividness. These metrics were found to have strong correlation with the perception of the human observer [9] and were successfully used in the enhancement algorithm of [10]. The algorithms were applied to the four images of Fig. 8, and both metrics were calculated for their results and for the original image. The four images were selected because they have under-exposured and over-exposured regions. The calculated naturalness and colorfulness are depicted in Fig. 7. The proposed method achieved higher degrees of naturalness and colorfulness for all four images, indicating that its results are more likely to be ranked first by a human observer.
4 Conclusions A new method for spatially modulated tone mapping has been presented in this paper. Its main objective is to enhance the under/over-exposured regions of images, while leaving the correctly exposured ones intact. The method utilizes orientation kernels, similar to the orientation cells of the HVS, in order to achieve varying surrounds that adapt their shape to the local intensity distribution of the image. Thus, no averaging is performed between two regions with strong intensity differences. This results in the elimination of halo effects, which are a consequence of wide surrounds. The surround of every pixel regulates the tone mapping function that will be applied to the pixel. This function is inspired by the shunting characteristics of the center-surround cells of the HVS. The proposed method exhibits at least comparable and often better results than other established methods in the same category. More importantly, the execution times of the proposed method are lower than those of the existing ones.
References

1. Battiato, S., Castorina, A., Mancuso, M.: High dynamic range imaging for digital still camera: an overview. Journal of Electronic Imaging 12, 459–469 (2003)
2. Land, E.: The Retinex. American Scientist 52(2), 247–264 (1964)
3. Jobson, D.J., Rahman, Z., Woodell, G.A.: A multi-scale Retinex for bridging the gap between color images and the human observation of scenes. IEEE Trans. Image Processing 6, 965–976 (1997)
4. Rizzi, A., Gatta, C., Marini, D.: A new algorithm for unsupervised global and local color correction. Pattern Recognition Letters 24, 1663–1677 (2003)
5. Ellias, S., Grossberg, S.: Pattern formation, contrast control and oscillations in the short term memory of shunting on-center off-surround networks. Biological Cybernetics 20, 69–98 (1975)
6. Truview (2007), http://www.truview.com/
7. Eidomatica (2007), http://eidomatica.dico.unimi.it/ita/ricerca/ace.html
8. Electronics, http://electronics.ee.duth.gr/vonikakis.htm
9. Hasler, S., Susstrunk, S.: Measuring colorfulness in real images. In: Proc. SPIE Electron. Imag.: Hum. Vision Electron. Imag. VIII, SPIE 5007, pp. 87–95 (2003)
10. Huang, K.-Q., Wang, Q., Wu, Z.-Y.: Natural color image enhancement and evaluation algorithm based on human visual system. Computer Vision and Image Understanding 103, 52–63 (2006)
Motion Estimation Applied to Reconstruct Undersampled Dynamic MRI Claudia Prieto1, Marcelo Guarini1, Joseph Hajnal2, and Pablo Irarrazaval1 1
Pontificia Universidad Católica de Chile, Departamento de Ingeniería Eléctrica, Vicuña Mackenna 4860, Chile
[email protected] 2 Hammersmith Hospital, Imperial College London, Du Cane Road W12 ONN, UK
Abstract. Magnetic Resonance Imaging (MRI) has become an important tool for dynamic clinical studies. Regrettably, the long acquisition time is still a challenge in dynamic MRI. Several undersampled reconstruction techniques have been developed to speed up the acquisition without significantly compromising image quality. Most of these methods are based on modeling the pixel intensity changes. Recently, we introduced a new approach based on the motion estimation of each object element (obel, a piece of tissue). Although the method works well, the outcome is a trade-off between the maximum undersampling factor and the motion estimation accuracy. In this work we propose to improve its performance through the use of additional data from a multiple-coil acquisition. Preliminary results on cardiac MRI show that further undersampling and/or improved reconstruction accuracy is achieved using this technique. Furthermore, an approximation of the vector field of motion is obtained. This method is appropriate for sequences where the obels' intensity through time is nearly constant. Keywords: motion estimation, MRI, dynamic images, undersampling.
1
Introduction
Over the last years, Magnetic Resonance Imaging (MRI) has become an important tool for dynamic clinical studies. Applications are wide-ranging, including cardiac MRI [1]-[2], real-time interventional imaging [3] and kinematics of joints [4]. The long time involved in the acquisition of an image sequence is still a challenge in dynamic MRI. An active line of research in this area has been aimed at speeding up the information capture phase without significantly compromising the image quality. The MRI signal corresponds to the Fourier transform of the imaged object density map, along a given sample path. The frequency space where data is acquired is called k-space. The way in which this space is covered during the acquisition phase is known as the k-space trajectory. The ideal approach to acquiring dynamic objects would be to collect the fully sampled k-space repeatedly for
each time frame (k-t space [5]) and then to reconstruct the desired number of image frames independently. However, in practice the temporal resolution depends on the speed of acquisition of the k-space for each frame, which is limited by hardware constraints1. There is little scope for improving temporal resolution through hardware advancement. As a consequence, considerable research is being aimed towards undersampled reconstruction techniques in k-space or k-t space. These techniques reduce the sampling by a certain factor, and later the aliasing artifacts are reduced by estimating the missing data from prior information or through the use of information redundancy in dynamic MRI. Traditional methods to reconstruct undersampled dynamic images use models based on time-varying pixel intensities or are based on temporal frequencies to recover the non-acquired data. Among others, those methods include keyhole [6][7], reduced field of view [8], unaliasing by Fourier-encoding the overlaps using the temporal dimension (UNFOLD) [9], the k-t broad-use linear acquisition speedup technique (k-t BLAST) [10], and reconstruction employing temporal registration [11]. Recently we have introduced a method which recovers the non-acquired data by estimating the motion of each object element or obel [12]. An obel is defined as a piece of tissue of the object whose intensity remains constant over time. The supporting assumption for this method is that the displacement of an obel has lower bandwidth than the intensity changes through time of a stationary pixel, and therefore it can be described with fewer parameters. A pictorial description of this idea, for a pixel near the edge of a dynamic object, is depicted in Fig. 1. The proposed method performs quite well, although there is a trade-off between the undersampling factor and the accuracy of the motion estimation. A known technique to speed up acquisition in MRI is to use multiple receiver coils in parallel (parallel imaging) as a complementary encoding [13]-[14]. In this work we propose to use the additional information provided by multiple coils to increase the available data, in a scheme that combines the above method based on obels with parallel imaging. The additional information can be used to reach higher undersampling factors or to improve the motion estimation. Here we describe the proposed method and the preliminary results of applying it to cardiac images. We begin by summarizing the basics of the reconstruction technique based on the motion estimation of obels for single-coil images. Then, the concept is extended to incorporate parallel imaging by exploiting the additional data from multiple receiver coils. Finally, we provide an analysis of results showing the potential of the method.
2
Reconstruction by Obels in Single Coil Acquisition
Considering that an obel does not change its intensity over time, it is possible to reconstruct any frame of a dynamic sequence from a reference frame and the motion model of each obel initially defined in this frame. This statement is also valid for undersampled dynamic images. Solving the inverse problem from the
1 The gradient strength and slew rate of modern MR scanners are limited by safety concerns related to peripheral nerve stimulation and by the associated cost.
Fig. 1. Comparison between the intensity fluctuation of a pixel in the edge of an object and the displacement of the obel at the same starting position. a) Pixel intensity fluctuation through time. b) Obel ’s displacement.
undersampled k-t space, a fully sampled reference frame and the motion model of the sequence are obtained, allowing the reconstruction of a fully sampled dynamic sequence. For simplicity, we first review the fully sampled case. Let m0 be the reference frame and Pt , t = 1 . . . Nt the matrix that describes the spatial displacement over time for each obel initially defined in m0 . In this way, any frame mt of the dynamic sequence can be computed using mt = Pt m0
(1)
where Pt represents both, the permutation and the interpolation matrices [15]. Since this matrix is large and seldom invertible, a more efficient implementation of the image transformation is achieved by representing it as a spatial transformation Ft (x) = x + ut (x)
(2)
where x is the position vector defined in the coordinate system of m0 , and ut holds the obels’ displacement in the image dimensions for each time frame t. Employing this spatial transformation Eq.1 becomes mt (y) = Pt m0 (x) = m0 (F−1 t (y))
(3)
where y is the position vector in any frame of the dynamic sequence mt , in contrast to x which is defined in m0 .
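A minimal sketch of the warping in Eq. 3, using SciPy's map_coordinates for the interpolation: it assumes the field inv_disp stores, for every pixel y of the target frame, the offset Ft^{-1}(y) − y back to the reference frame (the toy usage shifts a square two pixels to the right). Names are ours.

import numpy as np
from scipy.ndimage import map_coordinates

def warp_frame(m0, inv_disp):
    # m_t(y) = m0(F_t^{-1}(y)); inv_disp has shape (2, H, W).
    H, W = m0.shape
    yy, xx = np.mgrid[0:H, 0:W].astype(float)
    coords = np.stack([yy + inv_disp[0], xx + inv_disp[1]])
    return map_coordinates(m0, coords, order=1, mode='nearest')

# Toy usage: a uniform shift of two pixels to the right.
m0 = np.zeros((32, 32)); m0[12:20, 12:20] = 1.0
inv_disp = np.stack([np.zeros((32, 32)), -2.0 * np.ones((32, 32))])
m1 = warp_frame(m0, inv_disp)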
Since we need to model the displacement for each obel using few parameters, we do not use Ft (x) for each time frame t, but the parameterised version, F(e). Each row of the matrix e corresponds to a vector of parameters that describes the displacement of one particular obel initially defined in m0 . Now, we can write mt (m0 , e) to state that it is possible to reconstruct any frame of the dynamic sequence from a reference frame m0 and a set of parameters e. This representation can be applied to undersampled dynamic images. As the data in MRI is acquired in the k-space, we multiply the dynamic sequence mt by the Fourier transform W obtaining Wmt (y) = Wmt (m0 , e)
(4)
The desired samples are collected from the k-space using an undersampling pattern St (k). This matrix has elements one and zero indicating if a sample in the position k of the k-space was or was not collected at time frame t, respectively. Let Bt (k) represent the samples acquired in k-t space, thus Bt (k) = St (k)Wmt (m0 , e)
(5)
which represents a non-linear system. The equations correspond to the acquired samples Bt (k), and the unknowns correspond to the reference frame m0 and the parameteres needed to model the obels’ motion e. In order to improve solution stability the system in Eq.5 is solved in the image domain rather than in the k-space domain. To bring the data to the image domain we multiply Bt (k) by the inverse Fourier transforms WH , thus bt (y) = WH St (k)Wmt (m0 , e)
(6)
where bt (y) is the acquired aliased data. Let Nd be the spatial dimensions of the image sequence, Nd = (Nx , Ny ) if the sequence is bidimensional (2D) and Nd = (Nx , Ny , Nz ) if it is three-dimensional (3D). Letting Nt be the temporal dimension of the image sequence, Ne the size of the matrix e and Q the undersampling factor, conceptually the system becomes fully determined if the following relation holds

    Nd Nt / Q ≥ Nd + Ne
(7)
It defines an upper bound for the undersampling factor, given the image sequence size and the degrees of freedom required to model the obels' motion over all the image dimensions. Therefore, there is a trade-off between the undersampling factor Q and the accuracy of the motion estimation given by N_e. This trade-off becomes less critical when multiple coils are used, as we describe in the next section.
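The relations in Eqs. 6 and 7 (and Eq. 10 below) are easy to express directly. The following minimal sketch assumes a 2D Cartesian acquisition, a binary sampling mask, and that N_d is taken as the total number of spatial samples; the function names are ours and not part of the original method.

```python
import numpy as np

def aliased_frame(m_t, mask_t):
    """b_t(y) = W^H S_t W m_t: keep only the acquired k-space samples of
    frame m_t and return the aliased image of Eq. 6."""
    k_full = np.fft.fft2(m_t)             # W m_t
    return np.fft.ifft2(k_full * mask_t)  # W^H S_t W m_t

def max_undersampling(n_d, n_t, n_e, n_c=1):
    """Largest Q satisfying N_d N_t N_c / Q >= N_d + N_e (Eqs. 7 and 10)."""
    return n_d * n_t * n_c / (n_d + n_e)
```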
3 Reconstruction by Obels in Multiple-Coil Acquisition
In single-coil acquisitions of MRI, the imaged object properties are encoded using a magnetic field gradient, which allows collecting only one k-space sample at a time.
Complementary encoding can be achieved by employing multiple receiver coils. This is possible because the signal read by each coil varies appreciably with its relative position over the object. In this way, the information about the spatial sensitivity of each receiver can be used to obtain more than one sample at a time [13]. The spatial sensitivity of each receiver does not depend on the imaged object, and thus the unknowns of our method (described in the previous section) do not change with coil sensitivities. Therefore, it is possible to use the extra information from multiple coils in our method based on obels without modifying the number of unknowns of the reconstruction process. Let C_i be the sensitivity of coil i, i = 1…N_c, where N_c is the number of parallel coils considered. Let m_{t,i} be the image acquired with coil i at time frame t, thus m_{t,i} = C_i m_t,
(8)
Repeating the procedure described in the previous section for each single-coil image m_{t,i}, we obtain b_{t,i}(y) = W^H S_t(k) W C_i m_t(m_0, e)
(9)
where b_{t,i} represents the aliased single-coil data. We can obtain the image sequence m_t from the data acquired with each receiver b_{t,i} by solving the non-linear system of Eq. 9. The advantage of this approach is that the system has more equations and the same number of unknowns as in the single-coil case. The system becomes fully determined if

N_d N_t N_c / Q ≥ N_d + N_e
(10)
Clearly then, the higher the number of coils N_c, the higher the upper bound for the undersampling factor Q or for the number of parameters of the motion model N_e. We can solve Eq. 9 by setting it up as an optimization problem. Let the sum over all the coils and time frames of the difference between the acquired and the estimated data be the cost function to be minimized, then

min Δb = Σ_{t=1}^{N_t} Σ_{i=1}^{N_c} || b_{t,i}(y) − b̂_{t,i}(y) ||²    (11)
where b̂_{t,i}(y) = W^H S_t(k) W C_i m_t(m̂_0, ê)
(12)
The problem can be solved by two nested optimization loops. In the inner loop, e is considered known and Δb is minimized as a function of m_0. In the outer loop, the inner estimate of m_0 is known and the minimum of Δb is found as a function of e. After convergence, m_0 and e are found, representing the best-fitted model for the acquired multiple-coil data.
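A minimal sketch of the multiple-coil cost of Eqs. 11-12 and of the two nested loops is given below. It is illustrative only: the paper solves the inner problem with LSQR and the outer one with a trust-region method, whereas this sketch uses a generic optimizer, assumes real-valued images, and relies on a hypothetical caller-supplied `motion_model` that warps the reference frame according to the parameters e.

```python
import numpy as np
from scipy.optimize import minimize

def cost(m0, e, b, masks, coils, motion_model):
    """Sum over frames t and coils i of ||b_{t,i} - W^H S_t W C_i m_t(m0, e)||^2."""
    total = 0.0
    for t, mask in enumerate(masks):
        m_t = motion_model(m0, e, t)                     # warp the reference frame
        for i, c in enumerate(coils):
            est = np.fft.ifft2(np.fft.fft2(c * m_t) * mask)
            total += np.sum(np.abs(b[t, i] - est) ** 2)
    return total

def reconstruct(b, masks, coils, motion_model, m0, e, n_outer=5):
    """Alternate the two nested minimizations described above."""
    for _ in range(n_outer):
        # Inner loop: e fixed, update the reference frame m0.
        res = minimize(lambda v: cost(v.reshape(m0.shape), e, b, masks, coils,
                                      motion_model), m0.ravel(), method='L-BFGS-B')
        m0 = res.x.reshape(m0.shape)
        # Outer loop: m0 fixed, refine the motion parameters e.
        res = minimize(lambda p: cost(m0, p.reshape(e.shape), b, masks, coils,
                                      motion_model), e.ravel(), method='L-BFGS-B')
        e = res.x.reshape(e.shape)
    return m0, e
```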
Fig. 2. Reconstructed images from undersampled data using motion estimation. 2D short-axis cardiac sequence with a post-acquisition undersampling factor of 8. Frames 4 and 27 of a 50-frame sequence are shown. a) Fully sampled frame 4. b) Reconstructed image frame 4 with our method from a 5-coil acquisition. c) Reconstructed image frame 4 with our method from a 1-coil acquisition. d) Fully sampled frame 27. e) Reconstructed image frame 27 with our method from a 5-coil acquisition. f) Reconstructed image frame 27 with our method from a 1-coil acquisition.
4 Experimental Results
The proposed algorithm was used to reconstruct a 2D cardiac sequence. A fully sampled sequence was collected on a Philips Intera 1.5T scanner with a 5-channel cardiac coil. The acquired raw data was undersampled post-acquisition by factors of 8 and 16. The images were reconstructed considering two reference frames, m_0 and m_c, at the beginning and at the middle of the cardiac sequence. Each pixel in m_0 was considered an obel, and B-splines with three coefficients were used to describe their displacement in every Cartesian direction through time. For an undersampling factor of 8, the results of the single-coil method were compared against those obtained using the multiple-coil approach. The results of the proposed method for an undersampling factor of 16 were only compared with those obtained using sliding window (SW) reconstruction [16]. This level of undersampling is not possible using the method based on obels with a single coil because of constraints in the amount of data (Eq. 7).
Fig. 3. Difference images. 2D short-axis cardiac sequence with a post-acquisition undersampling factor of 8. The images show the differences between the fully sampled and the reconstructed images. a) Fully sampled image. b) Our reconstruction using multiple coils and the difference with respect to the fully sampled image. c) Our reconstruction using a single coil and the difference with respect to the fully sampled image.
4.1 Method
A steady-state free precession (SSFP) cardiac sequence was acquired with the following scanning parameters: 2D balanced fast field echo (B-FFE) cardiac gated, short-axis acquisition, TR/TE = 3 ms/1.46 ms, flip angle = 50°, FOV = 400 × 320 mm², resolution = 1.56 × 2.08 mm², slice thickness = 8 mm, acquisition matrix = 256 × 154, 50 frames, five-channel synergy coil and breath-hold duration close to 25 seconds. The acquired data was undersampled post-acquisition by factors of 8 and 16 using a lattice pattern (similar to the one used in [10]). This pattern samples 1/Q of the samples available in k-space. Two reference frames (m_0 and m_c) were considered to reconstruct the dynamic sequence, thus the optimization problem solved was

min Δb = Σ_{t=1}^{50} Σ_{i=1}^{5} || b_{t,i}(y) − b̂_{t,i}(y) ||²    (13)
where b̂_{t,i}(y) = W^H S_t(k) W C_i m_t(m̂_0, m̂_c, ê)
(14)
is a simple extension of Eq. 12. This arrangement is convenient for 2D sequences where some obels move in a direction normal to the slice. In the same way as in [12], the displacement of each obel was fitted using periodic quadratic B-splines with three control points for every spatial direction. Then
Fig. 4. Reconstructed images from undersampled data using motion estimation. 2D short-axis cardiac sequence with a post-acquisition undersampling factor of 16. Frames 4 and 27 of a 50-frame sequence are shown. a) Fully sampled frame 4. b) Reconstructed image frame 4 with our method from a 5-coil acquisition. c) Reconstructed image frame 4 with sliding window from a 5-coil acquisition. d) Fully sampled frame 27. e) Reconstructed image frame 27 with our method from a 5-coil acquisition. f) Reconstructed image frame 27 with sliding window from a 5-coil acquisition.

u(e_i) = Σ_{n=0}^{N_e/N_d − 1} e_{ni} A_n    (15)
where u(e_i) is the displacement of obel i, e_{ni} is the weight applied to the n-th B-spline basis function A_n, and N_e/N_d is the number of parameters needed considering one obel per pixel. The minimization in Eq. 13 was computed with MATLAB (R2007a, The MathWorks, Natick) routines. The inner loop was solved efficiently using a conjugate-gradient algorithm (Least Squares QR factorization, LSQR). The outer loop was solved employing a trust-region method and an approximation of the analytic gradient of the objective function. This optimization represents an expensive computational load due to the large number of unknowns and the non-linear nature of the problem. The reconstruction process took around 6 hours on a regular PC for an image of 128 × 128 pixels and 50 time frames.
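As an illustration of Eq. 15, the sketch below evaluates a periodic quadratic B-spline displacement model with three control points per spatial direction for one obel. The exact basis and parameterisation used in [12] may differ, so the function names and the wrap-around convention are our own assumptions.

```python
import numpy as np

def quad_bspline(x):
    """Uniform quadratic B-spline basis supported on [0, 3)."""
    x = np.asarray(x, dtype=float)
    return np.where((x >= 0) & (x < 1), 0.5 * x ** 2,
           np.where((x >= 1) & (x < 2), 0.5 * (-2 * x ** 2 + 6 * x - 3),
           np.where((x >= 2) & (x < 3), 0.5 * (3 - x) ** 2, 0.0)))

def obel_displacement(weights, phase):
    """Displacement u(e_i) of one obel at a cardiac phase in [0, 1).

    weights : array of shape (3, n_dims) with control-point weights e_{ni},
              one column per spatial direction.
    """
    n_ctrl = weights.shape[0]
    u = np.zeros(weights.shape[1])
    for n in range(n_ctrl):
        # Periodic shift so the basis functions wrap around the cardiac cycle.
        u += weights[n] * quad_bspline((phase * n_ctrl - n) % n_ctrl)
    return u
```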
4.2 Results
The reconstruction results for an undersampling factor of 8 are shown in Fig. 2 for two cardiac phases. We have included the fully sampled images (Fig. 2a, d), the reconstructed images using our method with multiple-coil acquisition (Fig. 2b, e) and the reconstructed images using our method applied to single-coil acquisition (Fig. 2c, f). Both reconstructions are in good agreement with the fully sampled image, with root mean square (RMS) errors of 2.65% and 2.45% for single and multiple coil, respectively. A zoomed view of the differences between the fully sampled and the reconstructed images for a particular frame is shown in Fig. 3. The difference images show that the main errors are due to small displacements of the edges, with a better estimation achieved using the multiple-coil approach. The reconstruction results for an undersampling factor of 16 are shown in Figs. 4 and 5. We have included in Fig. 4 two selected fully sampled time frames (Fig. 4a, d), the corresponding reconstructions using our method applied to multiple-coil acquisition (Fig. 4b, e) and the corresponding reconstructions using SW (Fig. 4c, f). The image sequence obtained with the proposed method has an RMS error of 1.72% compared to the fully sampled image, while the one reconstructed with SW has an RMS error of 2.03%. These results show that most of the aliasing was eliminated and only a slight spatial and temporal blurring remains, which is dependent on the quality of the reference frames. The temporal blurring is more evident in Fig. 5, which shows the evolution over time of a line
Fig. 5. Time evolution. 2D short-axis cardiac sequence with a post-acquisition undersampling factor of 16. The images show the temporal evolution of a line passing through the left ventricle. a) Temporal evolution for the fully sampled image. b) Temporal evolution for the reconstructed image using our method from a 5-coil acquisition. c) Temporal evolution for the reconstructed image using sliding window from a 5-coil acquisition.
passing through the left ventricle. Again, we have included the temporal evolution for the fully sampled sequence and the reconstructions using our method and SW.
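The RMS error figures quoted above can be reproduced with a comparison of this kind. The paper does not state its exact normalisation, so the choice of normalising by the RMS of the fully sampled reference is our assumption.

```python
import numpy as np

def rms_error_percent(reconstructed, reference):
    """RMS error of a reconstruction relative to the fully sampled reference."""
    diff = np.abs(reconstructed - reference)
    return 100.0 * np.sqrt(np.mean(diff ** 2)) / np.sqrt(np.mean(np.abs(reference) ** 2))
```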
5 Summary and Conclusions
An application of motion estimation to reconstruct undersampled dynamic MRI was presented. It is an extension of the method based on modeling the motion of object elements (obels) to parallel imaging using multiple coils. Higher undersampling factors and/or improved reconstruction accuracy are possible with the proposed method. This was demonstrated using 2D cardiac images acquired with 5 coils and undersampling factors of 8 and 16 (i.e., acquisitions 8 and 16 times faster, respectively). For an undersampling factor of 8, images reconstructed with the proposed method display better quality than those obtained using only one coil with the same undersampling. An undersampling factor of 16 is quite feasible using multiple coils; this level of undersampling is not possible using one coil because of constraints in the amount of data. The proposed method does not require the motion to be confined to a portion of the field of view or to a portion of the temporal frequency. Moreover, an approximation of the vector field of motion is obtained as an additional result.
References 1. Abd-Elmoniem, K.Z., Osman, N.F., Prince, J.L., Stuber, M.: Three-dimensional magnetic resonance myocardial motion tracking from a single image plane. Magn. Reson. Med. 58, 92–102 (2007) 2. Sakuma, H.: Magnetic resonance imaging for ischemic heart disease. J. Magn. Reson. Imaging. 26, 3–13 (2007) 3. Raman, V.K., Lederman, R.J.: Interventional cardiovascular magnetic resonance imaging. Trends Cardiovasc. Med. 17, 196–202 (2007) 4. Gupta, V., Khandelwal, N., Mathuria, S., Singh, P., Pathak, A., Suri, S.: Dynamic magnetic resonance imaging evaluation of craniovertebral junction abnormalities. J. Comput. Assist. Tomogr. 31, 354–359 (2007) 5. Xiang, Q.S., Henkelman, R.M.: k-space description for MR imaging of dynamic objects. Magn. Reson. Med. 29, 422–428 (1993) 6. Jones, R.A., Haraldseth, O., Muller, T.B., Rinck, P.A., Oksendal, A.N.: k-space substitution: a novel dynamic imaging technique. Magn. Reson. Med. 29, 830–834 (1993) 7. van Vaals, J.J., Brummer, M.E., Dixon, W.T., Tuithof, H.H., Engels, H., Nelson, R.C., Gerety, B.M., Chezmar, J.L., den Boer, J.A.: Keyhole method for accelerating imaging of contrast agent uptake. J. Magn. Reson. Imaging 3, 671–675 (1993) 8. Hu, X., Parrish, T.: Reduction of field of view for dynamic imaging. Magn. Reson. Med. 31, 691–694 (1994) 9. Madore, B., Glover, G.H., Pelc, N.J.: Unaliasing by Fourier-encoding the overlaps using the temporal dimension (UNFOLD), applied to cardiac imaging and fMRI. Magn. Reson. Med. 42, 813–828 (1999)
10. Tsao, J., Boesiger, P., Pruessmann, K.P.: k-t BLAST and k-t SENSE: dynamic MRI with high frame rate exploiting spatiotemporal correlations. Magn. Reson. Med. 50, 1031–1042 (2003) 11. Irarrazaval, P., Boubertakh, R., Razavi, R., Hill, D.: Dynamic three-dimensional undersampled data reconstruction employing temporal registration. Magn. Reson. Med. 54, 1207–1215 (2005) 12. Prieto, C., Batchelor, P., Hill, D., Hajnal, J., Guarini, M., Irarrazaval, P.: Reconstruction of undersampled dynamic images by modeling the motion of object elements. Magn. Reson. Med. 57, 939–949 (2007) 13. Pruessmann, K.P., Weiger, M., Scheidegger, M.B., Boesiger, P.: SENSE: Sensitivity encoding for fast MRI. Magn. Reson. Med. 42, 952–962 (1999) 14. Sodickson, D.K., Manning, W.J.: Simultaneous acquisition of spatial harmonics (SMASH): ultra-fast imaging with radiofrequency coil arrays. Magn. Reson. Med. 38, 591–603 (1997) 15. Batchelor, P.G., Atkinson, D., Irarrazaval, P., Hill, D.L.G., Hajnal, J., Larkman, D.: Matrix description of general motion correction applied to multishot images. Magn. Reson. Med. 54, 1273–1280 (2005) 16. d'Arcy, J.A., Collins, D.J., Rowland, I.J., Padhani, A.R., Leach, M.O.: Applications of sliding window reconstruction with Cartesian sampling for dynamic contrast-enhanced MRI. NMR Biomed. 15, 174–183 (2002)
Real-Time Hand Gesture Detection and Recognition Using Boosted Classifiers and Active Learning Hardy Francke, Javier Ruiz-del-Solar, and Rodrigo Verschae Department of Electrical Engineering, Universidad de Chile {hfrancke,jruizd,rverscha}@ing.uchile.cl Abstract. In this article a robust and real-time hand gesture detection and recognition system for dynamic environments is proposed. The system is based on the use of boosted classifiers for the detection of hands and the recognition of gestures, together with the use of skin segmentation and hand tracking procedures. The main novelty of the proposed approach is the use of innovative training techniques - active learning and bootstrap -, which allow obtaining a much better performance than similar boosting-based systems, in terms of detection rate, number of false positives and processing time. In addition, the robustness of the system is increased due to the use of an adaptive skin model, a colorbased hand tracking, and a multi-gesture classification tree. The system performance is validated in real video sequences. Keywords: Hand gesture recognition, hand detection, skin segmentation, hand tracking, active learning, bootstrap, Adaboost, nested cascade classifiers.
1 Introduction Hand gestures are extensively employed in human non-verbal communication. They allow to express orders (e.g. “stop”, “come”, “don’t do that”), mood state (e.g. “victory” gesture), or to transmit some basic cardinal information (e.g. “one”, “two”). In addition, in some special situations they can be the only way of communicating, as in the cases of deaf people (sign language) and police’s traffic coordination in the absence of traffic lights. An overview about gesture recognition can be found in [18]. Thus, it seems convenient that human-robot interfaces incorporate hand gesture recognition capabilities. For instance, we would like to have the possibility of transmitting simple orders to personal robots using hand gestures. The recognition of hand gestures requires both hand’s detection and gesture’s recognition. Both tasks are very challenging, mainly due to the variability of the possible hand gestures (signs), and because hands are complex, deformable objects (a hand has more than 25 degrees of freedom, considering fingers, wrist and elbow joints) that are very difficult to detect in dynamic environments with cluttered backgrounds and variable illumination. Several hand detection and hand gesture recognition systems have been proposed. Early systems usually require markers or colored gloves to make the recognition easier. Second generation methods use low-level features as color (skin detection) [4][5], shape [8] or depth information [2] for detecting the hands. However, those systems are not robust enough for dealing with dynamic environments; they usually require uniform background, uniform illumination, a single person in the camera view [2], D. Mery and L. Rueda (Eds.): PSIVT 2007, LNCS 4872, pp. 533–547, 2007. © Springer-Verlag Berlin Heidelberg 2007
and/or a single, large and centered hand in the camera view [5]. Boosted classifiers allow the robust and fast detection of hands [3][6][7]. In addition, the same kind of classifiers can be employed for detecting static gestures [7] (dynamic gestures are normally analyzed using Hidden Markov Models [4]). 3D hand model-based approaches allow the accurate modeling of hand movement and shapes, but they are time-consuming and computationally expensive [6][7]. In this context, we are proposing a robust and real-time hand gesture detection and recognition system, for interacting with personal robots. We are especially interested in dynamic environments such as the ones defined in the RoboCup @Home league [21] (our UChile HomeBreakers team participates in this league [22]), with the following characteristics: variable illumination, cluttered backgrounds, real-time operation, large variability of hands’ pose and scale, and limited number of gestures (they are used for giving the robot some basic information). In this first version of the system we have restricted ourselves to static gestures. The system we have developed is based on the use of boosted classifiers for the detection of hands and the recognition of gestures, together with the use of skin segmentation and hand tracking procedures. The main novelty of the proposed approach is the use of innovative training techniques - active learning and bootstrap -, which allow obtaining a much better performance than similar boosting-based systems, in terms of detection rate, number of false positives and processing time. In addition, the robustness of the system is increased thanks to the use of an adaptive skin model, a color-based hand tracking, and a multi-gesture classification tree. This paper is organized as follows. In section 2 some related work in hand gesture recognition and active learning is presented. In section 3 the proposed hand gesture detection and recognition system is described. In sections 4 and 5 the employed learning framework and training procedures are described. Results of the application of this system in real video sequences are presented and analyzed in section 6. Finally, some conclusions of this work are given in section 7.
2 Related Work Boosted classifiers have been used for both hand detection and hand gesture detection. In [3] a hand detection system that can detect six different gestures is proposed. The system is based on the use of Viola&Jones’ cascade of boosted classifiers [16]. The paper’s main contributions are the addition of new rectangular features for the hand detection case, and the analysis of the gesture’s separability using frequency spectrum analysis. The classifiers are trained and tested using still images (2,300 in total), which contains centered hands, with well-defined gestures. The performance of the classifiers in real videos is not analyzed. In [6] an extension of [3] is proposed, in which boosted classifiers are employed for hand detection, while gestures are recognized using scale-space derived features. The reported experiments were carried out in a dynamic environment, but using single, large and centered hands in the camera view. In [7] a real-time hand gesture recognition system is proposed, which is also based on the standard Viola&Jones system. New rectangular features for the hand detection case are added. The recognition of gestures is obtained by using several single gesture detectors working in parallel. The final system was validated in a very controlled environment (white wall as background); therefore, its performance in
dynamic environment is uncertain. In [9] a system for hand and gesture detection based on a boosted classifier tree is proposed. The system obtains very high detection results, however, the system is very time consuming (a tree classifier is much slower than a single cascade), and not applicable for interactive applications. Our main contribution over previous work are the use of a much powerful learning machine (nested cascade with boosted domain-partitioning classifiers), and the use of better training procedures, which increase the performance of the classifiers. The performance of a statistical classifier depends strongly on how representative the training sets are. The common approach employed for constructing a training set for a learning machine is to use human labeling of training examples, which is a very time-consuming task. Very often, the amount of human power for the labeling process limits the performance of the final classifier. However, the construction of training sets can be carried out semi-automatically using active learning and the bootstrap procedure. This allows building larger training sets, and therefore to obtain better classifiers. Thus, the bootstrap procedure can be employed in the selection of negative samples [17]. The procedure requires that the human expert selects a large amount of images that do not contain object instances. During training, the bootstrap procedure automatically selects image areas (windows) that will be used as negative examples. In [11] the bootstrap procedure is extended for the particular case of the training of cascade classifiers. On the other hand, active learning is a procedure in which the system being built is used to lead the selection of the training examples. For instance, in [14] an interactive labeling system is used to select examples to be added to the training set. Initially, this system takes a rough classifier and later, interactively adds both, positive and negative examples. In the here-proposed approach both, bootstrap and active learning, are employed.
Fig. 1. Proposed hand gesture detection and recognition system
3 Real-Time Hand Gesture Detection and Recognition System 3.1 System Overview The main modules of the proposed hand gesture detection and recognition system are shown in figure 1. The Skin Segmentation module allows obtaining skin blobs from
the input image. The use of a very reliable face detector (Face Detection module) allows the online modeling of the skin, which makes possible to have an adaptive segmentation of the skin pixels. The Hand Detection and Hand Tracking modules deliver reliable hand detections to the gesture detectors. Hand detection is implemented using a boosted classifier, while hand tracking is implemented using the mean shift algorithm [1]. Afterwards, several specific Gesture Detectors are applied in parallel over the image’s regions that contain the detected hands. These detectors are implemented using boosted classifiers [12]. Finally, a Multi-Gesture Classifier summarizes the detections of the single detectors. This multi-class classifier is implemented using a J48 pruned tree (Weka’s [19] version of the C4.5 classifier). In the next subsections these modules are described in detail. 3.2 Adaptive Skin Segmentation Adaptive skin segmentation is implemented using a procedure similar to the one described in [10]. The central idea is to use the skin color distribution in a perceived face to build a specific skin model. In other words, the skin model uses the context information from the person, given by its face, and the current illumination. With this we manage to have a robust skin detector, which can deal with variations in illumination or with differences in the specific skin’s colors, in comparison to offline trained skin detectors. This approach requires having a reliable face detector. We employed a face detector that uses nested cascades of classifiers, trained with the Adaboost boosting algorithm, and domain-partitioning based classifiers. This detector is detailed described in [11]. With the aim of making the model invariant to the illumination level to a large degree, the skin modeling is implemented using the RGB normalized color space:
I = R + G + B;   r = R / I;   g = G / I    (1)
After a new face is detected, a subset of the face pixels is selected for building the skin model (see figure 2). After pixel selection and normalization, the r, g and I skin variables are modeled with Gaussian functions. The skin model parameters correspond to the variables' mean values and standard deviations: μ_r, σ_r, μ_g, σ_g, μ_I and σ_I. In order to lighten the computational burden, this modeling is carried out only once for every detected face (the first time that the face is detected). As long as there is no major change in the illumination, there is no need to update the model. Having the skin model, the classification of the pixels is carried out as follows:
f(i, j) = skin,      if |c̃ − μ_c| < α_c · σ_c  for  c = r, g, I
f(i, j) = non-skin,  otherwise    (2)
where i and j represent the coordinates of the pixel being analyzed, and α_r, α_g and α_I are adjustment constants of the classifier. For simplicity, all these constants are made equal. In practice we have observed that this value needs to be adjusted depending on the brightness of the input image, increasing it when the brightness decreases, and vice versa.
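A minimal sketch of this adaptive skin model (Eqs. 1-2) is shown below. The function names and the default value of α are our assumptions; as described above, the published system adjusts α with the image brightness.

```python
import numpy as np

def fit_skin_model(face_pixels_rgb):
    """Fit the Gaussian skin model (mean and std of r, g and I, Eq. 1)."""
    rgb = face_pixels_rgb.reshape(-1, 3).astype(float)
    intensity = rgb.sum(axis=1) + 1e-6
    feats = np.column_stack([rgb[:, 0] / intensity,   # r
                             rgb[:, 1] / intensity,   # g
                             intensity])              # I
    return feats.mean(axis=0), feats.std(axis=0)

def skin_mask(image_rgb, mu, sigma, alpha=2.5):
    """Classify a pixel as skin when |c - mu_c| < alpha * sigma_c (Eq. 2)."""
    rgb = image_rgb.astype(float)
    intensity = rgb.sum(axis=2) + 1e-6
    feats = np.dstack([rgb[..., 0] / intensity,
                       rgb[..., 1] / intensity,
                       intensity])
    return np.all(np.abs(feats - mu) < alpha * sigma, axis=2)
```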
After the skin pixels are detected, they are grouped together in skin blobs, according to their connectivity. In order to diminish the false positives from the skin detection, blobs that have an area below a certain threshold are discarded. Finally, all skin blobs are given to the next stage of the process except the ones containing faces. 3.3 Hand Detection and Tracking In order to detect hands within the skin blobs, a hand detector is implemented using a cascade of boosted classifiers. Although this kind of classifiers allows obtaining very robust object detectors in the case of face or car objects, we could not build a reliable generic hand detector. This is mainly because: (i) hands are complex, highly deformable objects, (ii) hand possible poses (gestures) have a large variability, and (iii) our target is a fully dynamic environment with cluttered background. Therefore we decided to switch the problem to be solved, and to define that the first time that the hand should be detected, a specific gesture must be made, the fist gesture. Afterwards, that is, in the consecutive frames, the hand is not detected anymore but tracked. The learning framework employed for training the fist detector is described in section 4 and the specific structure of the detector in section 6. The hand-tracking module is built using the mean shift algorithm [1]. The seeds of the tracking process are the detected hands (fist gestures). We use RGB color histograms as feature vectors (model) for mean shift, with each channel quantized to 32 levels (5 bits). The feature vector is weighted using an Epanechnikov kernel [1]. As already mentioned, once the tracking module is correctly following a hand, there is no need to continue applying the hand detector, i.e. the fist gesture detector, over the skin blobs. That means that the hand detector module is not longer used until the hand gets out of the input image, or until the mean shift algorithm loses track of the hand, case where the hand detector starts working again. At the end of this stage, one or several regions of interest (ROI) are obtained, each one indicating the location of a hand in the image.
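The sketch below shows one way to implement such colour-histogram tracking with OpenCV's built-in mean shift. It is a stand-in rather than the authors' code: the paper quantizes each RGB channel to 32 levels and additionally weights the model with an Epanechnikov kernel, which OpenCV's back-projection does not do.

```python
import cv2

HIST_BINS = [32, 32, 32]
HIST_RANGES = [0, 256, 0, 256, 0, 256]

def init_hand_model(frame_bgr, roi):
    """Colour histogram of the detected fist region (x, y, w, h)."""
    x, y, w, h = roi
    patch = frame_bgr[y:y + h, x:x + w]
    hist = cv2.calcHist([patch], [0, 1, 2], None, HIST_BINS, HIST_RANGES)
    return cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

def track_hand(frame_bgr, hist, roi):
    """One mean-shift update of the hand window in a new frame."""
    back_proj = cv2.calcBackProject([frame_bgr], [0, 1, 2], hist, HIST_RANGES, 1)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    _, new_roi = cv2.meanShift(back_proj, roi, criteria)
    return new_roi
```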
Fig. 2. Left: The green (outer) square corresponds to the detected face. The orange (inner) square determines the pixels employed for building the skin model. Right: the orange square cropping formula:
x_0,orange = x_0,green + 0.25 · width_green
y_0,orange = y_0,green + 0.25 · height_green
width_orange = 0.5 · width_green
height_orange = 0.5 · height_green
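The cropping rule of Fig. 2 translates directly into code; the following helper is only a restatement of the formula above.

```python
def skin_sampling_box(x0, y0, width, height):
    """Inner (orange) square of Fig. 2 used to sample skin pixels."""
    return (x0 + 0.25 * width, y0 + 0.25 * height, 0.5 * width, 0.5 * height)
```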
3.4 Hand Gesture Recognition In order to determine which gesture is being expressed, a set of single gesture detectors are applied in parallel over the ROIs delivered as output of the tracking module.
Each single gesture detector is implemented using a cascade of boosted classifiers. The learning framework employed for building and training these classifiers is described in section 4. Currently we have implemented detectors for the following gestures: fist, palm, pointing, and five (see Figure 3). The specific structure of each detector is given in section 6. Due to noise or gesture ambiguity, it could happen that more than one gesture detector gives positive results in a ROI (more than one gesture is detected). For discriminating among these gestures, a multi-gesture classifier is applied. The used multi-class classifier is a J48 pruned tree (Weka's [19] version of C4.5), built using the following four attributes that each single gesture detector delivers (a small sketch of how they are computed follows the list):
- conf: sum of the cascade confidence values of the windows where the gesture was detected (a gesture is detected at different scales and positions),
- numWindows: number of windows where the gesture was detected,
- meanConf: mean confidence value given by conf/numWindows, and
- normConf: normalized mean confidence value given by meanConf/maxConf, with maxConf the maximum possible confidence that a window could get.
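The sketch below computes these four attributes from the windows reported by one gesture detector; it only restates the definitions in the list.

```python
def gesture_attributes(window_confidences, max_conf):
    """conf, numWindows, meanConf and normConf for one gesture detector."""
    num_windows = len(window_confidences)
    conf = sum(window_confidences)
    mean_conf = conf / num_windows if num_windows else 0.0
    norm_conf = mean_conf / max_conf if max_conf else 0.0
    return conf, num_windows, mean_conf, norm_conf
```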
Fig. 3. Hand gestures detected by the system: fist, palm, pointing, and five
4 Learning Framework The learning framework used to train the hand detector and single gesture detectors is presented in the next subsections. An extensive description of this framework can be found in [11]. 4.1 Learning Using Cascade of Boosted Classifiers The key concepts used in this framework are nested cascades, boosting, and domainpartitioning classifiers. Cascade classifiers [16] consist of several layers (stages) of increasing complexity. Each layer can reject or let pass the inputs to the next layer, and in this way a fast processing speed together with high accuracy are obtained. Nested cascades [13] allow high classification accuracy and higher processing speed by reusing in each layer the confidence given by its predecessor. Adaboost [12] is employed to find highly accurate hypotheses (classification rules) by combining several weak hypotheses (classifiers). A nested cascade of boosted classifiers is composed by several integrated (nested) layers, each one containing a boosted classifier. The cascade works as a single classifier that integrates the classifiers of every layer. Weak classifiers are linearly combined, obtaining a strong classifier. A nested
cascade, composed of M layers, is defined as the union of M boosted classifiers H_C^k, each one defined by:

H_C^k(x) = H_C^{k−1}(x) + Σ_{t=1}^{T_k} h_t^k(x) − b_k    (3)
with H C0 (x) = 0 , htk the weak classifiers, T k the number of weak classifiers in layer k, and bk a threshold (bias) value that defines the operation point of the strong classifier. At a layer k, processing an input x, the class assigned to x corresponds to the sign of H Ck (x) . The output of H Ck is a real value that corresponds to the confidence of the classifier and its computation makes use of the already evaluated confidence value of the previous layer of the cascade. 4.2 Design of the Strong and Weak Classifiers The weak classifiers are applied over features computed in every pattern to be processed. To each weak classifier a single feature is associated. Following [12], domainpartitioning weak hypotheses make their predictions based on a partitioning of the input domain X into disjoint blocks X1,…,Xn, which cover all X, and for which h(x)=h(x’) for all x, x’ ∈ Xj. Thus, a weak classifier´s prediction depends only on which block, Xj, a given sample instance falls into. In our case the weak classifiers are applied over features, therefore each feature domain F is partitioned into disjoint blocks F1,…,Fn, and a weak classifier h will have an output for each partition block of its associated feature f: h( f ( x)) = c j s.t f ( x) ∈ F j (4) For each classifier, the value associated to each partition block (cj), i.e. its output, is calculated so that it minimizes a bound of the training error and at the same time a loss function on the margin [12]. This value depends on the number of times that the corresponding feature, computed on the training samples (xi), falls into this partition block (histograms), on the class of these samples (yi) and their weight D(i). For minimizing the training error and the loss function, cj is set to [12]: cj =
(1/2) · ln( (W^j_{+1} + ε) / (W^j_{−1} + ε) ),   where   W^j_l = Σ_{i: f(x_i) ∈ F_j ∧ y_i = l} D(i) = Pr[ f(x_i) ∈ F_j ∧ y_i = l ],   l = ±1    (5)
where ε is a regularization parameter. The outputs c_j of each of the weak classifiers, obtained during training, are stored in a LUT to speed up their evaluation. The real Adaboost learning algorithm is employed to select the features and train the weak classifiers h_t^k(x). For details on the cascade's training algorithm see [11]. 4.3 Features Two different kinds of features are used to build the weak classifiers, rectangular features (a kind of Haar-like wavelet) and mLBP (modified Local Binary Pattern). In both cases the feature space is partitioned so that it can be used directly with the
domain-partitioning classifier previously described. Rectangular features can be evaluated very quickly, independently of their size and position, using the integral image [16], while mLBP corresponds to a contrast invariant descriptor of the local structure of a given image neighborhood (see [15]).
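To illustrate how Eqs. 3-5 fit together at detection time, the sketch below evaluates a nested cascade whose weak classifiers are LUTs indexed by the partition block of their feature. The data layout is our own choice and not the structure used in [11].

```python
def evaluate_nested_cascade(x, cascade, feature_block):
    """Nested cascade evaluation (Eq. 3).

    cascade       : list of layers; each layer is (weak_list, bias), where
                    weak_list holds (feature_index, lut) pairs and every LUT
                    stores the c_j outputs of Eq. 5, one per partition block.
    feature_block : callable mapping (x, feature_index) to a block index.
    Returns (accepted, confidence).
    """
    h = 0.0  # the confidence of layer k-1 is reused by layer k (nesting)
    for weak_list, bias in cascade:
        h = h + sum(lut[feature_block(x, f)] for f, lut in weak_list) - bias
        if h < 0:            # early rejection: stop processing this window
            return False, h
    return True, h
```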
5 Training Procedures The standard procedure to build training sets of objects and non-objects for training a statistical classifier requires that an expert (a human operator) obtains and annotates training examples. This procedure is usually very time-consuming; more importantly, it is very difficult to obtain representative examples. In the following, two procedures for solving these problems are presented. 5.1 Bootstrap Procedure Every window of any size in any image that does not contain an object (e.g. a hand) is a valid non-object training example. Obviously, to include all possible non-object patterns in the training database is not an alternative. To define such a boundary, nonobject patterns that look similar to the object should be selected. This is commonly solved using the bootstrap procedure [17], which corresponds to iteratively train the classifier, each time increasing the negative training set by adding examples of the negative class that were incorrectly classified by the already trained classifier. When training a cascade classifier, the bootstrap procedure can be applied in two different situations: before starting the training of a new layer (external bootstrap) and for retraining a layer that was just trained (internal bootstrap). It is important to use bootstrap in both situations [11]. The external bootstrap is applied just one time for each layer, before starting its training, while the internal bootstrap can be applied several times during the training of the layer. For details on the use of bootstrapping in the training of a cascade see [11]. 5.2 Active Learning As mentioned, the selection of representative positive training examples is costly and very time consuming, because a human operator needs to be involved. However, these training examples can be semi-automatically generated using active learning. Active learning is a procedure in which the system being built is used to lead the selection of the training examples. In the present work we use active learning to assist the construction of representative positive training sets, i.e. training sets that capture the exact conditions of the final application. To generate training examples of a specific hand gesture detector, the procedure consists of asking a user to make this specific hand gesture for a given time. During this time the user hand is automatically tracked, and the bounding boxes (ROI) are automatically incorporated to the positive training sets of this gesture. If the hand is tracked for a couple of minutes, and the user maintains the hand gesture while moving the hand, thousands of examples can be obtained with the desired variability (illumination, background, rotation, scale, occlusion, etc.). Thus, all windows classified as positive by the hand tracker are taken as positive training examples. This pro-
cedure can be repeated for several users. A human operator only has to verify that these windows were correctly detected, and to correct the alignment of the windows, when necessary. Later, all these windows are downscaled to the window size (24x24 or 24x42 pixels in our case) to be used during training. In a second stage, active learning can also be employed for improving an already trained specific gesture detector. In this last case, the same procedure is employed (the user makes the hand gesture and the hand is tracked), but the already trained gesture detector is in charge of generating the training examples. Thus, every time that the gesture detector classifies a hand bounding box coming from the hand tracker as a non-object (the gesture is not detected), this bounding box is incorporated in the positive training set for this gesture.
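A compact sketch of the negative-mining step behind the bootstrap procedure is given below. `classifier` and `sample_windows` are hypothetical caller-supplied callables, and the details (threshold, window sampling, stopping rule) are our assumptions rather than those of [11].

```python
import random

def bootstrap_negatives(classifier, background_images, sample_windows,
                        n_new, threshold=0.0, max_draws=10000):
    """Harvest hard negatives: windows from object-free images that the
    current classifier still accepts become new negative examples."""
    negatives = []
    for _ in range(max_draws):
        image = random.choice(background_images)
        for window in sample_windows(image):
            if classifier(window) > threshold:   # false positive -> keep it
                negatives.append(window)
        if len(negatives) >= n_new:
            break
    return negatives[:n_new]
```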
6 Evaluation In the present section an evaluation and analysis of the proposed system is presented. In this evaluation the performance of the system, as well as, its modules are analyzed. We also analyze the effect over the detector’s performance of using Active learning during training. The detection results are presented in terms of Detection Rate (DR) versus Number of False Positives (FP), in the form of ROC curves. An analysis of the processing speed of the system is also presented. The cascade classifiers were trained using three kinds of hand databases: (i) the IDIAP hand database [20], (ii) images obtained from the Internet, and (iii) images obtained using active learning and our hand gesture detection and recognition system. Table 1 and Table 2 summarize information about these training sets and the obtained cascade classifiers. For the other gesture’s databases, the amount of data used to train the classifiers is similar. On Table 1 and Table 2 we can also observe information about the structure of the obtained classifiers (number of layers and total number of weak classifiers). This information gives us an idea of the complexity of the detection problem, where large values indicate higher complexity and also larger processing times. These numbers are a result of the training procedure of the cascade [11] (they are not set a priori). As mentioned, we have selected a J48 pruned tree as multi-gesture’s classifier. This classifier was trained using the training sets described in Table 3, using the Weka package, and 10-fold cross-validation. The obtained tree structure has 72 leaves and 143 tree nodes. In the validation dataset we obtained 90.8% of correct classifications. To evaluate each single detector, a dataset consisting of 200 examples per class was used. This database contains images presenting a large degree of variability in the shape and size of the hands, the illumination conditions, and in the background. As a reference, this database contains more variability than the IDIAP database [20], and therefore is more difficult. The complete system was evaluated using a database that consists of 8,150 frames coming from 5 video sequences, where 4 different persons performed the 4 considered gestures. The sequences were captured by the same camera used to perform the active learning, and emphasis was given to produce a cluttered background and varying illumination conditions. To analyze how active learning improves the performance of the boosted classifiers, we studied two cases, a fist detector, and a palm detector. For each case we
Fig. 4. Fist and Palm detector ROC curves (detection rate [%] vs. number of false positives), using active learning (D1) and not using active learning (D2). In all cases the tracking system was not used.
Fig. 5. ROC curves (detection rate [%] vs. number of false positives) of the gesture detectors (trained using active learning) applied without using the tracking system
trained two classifiers, the first one using active learning and the second one without using it. The training of these detectors was done using the datasets presented in Table 1. The effect of using active learning in the performance of the detector is shown in Figure 4. To better show the effect of using active learning, the evaluation was performed by applying the detectors directly over the skin blobs in the input images that do not correspond to the face, i.e., not over the results of the handtracking module. As it can be noticed, the use of active learning during training
largely improves the performance of the detectors, with up to a 90 % increase for operation points with low false positive rates. When using the tracking system, the number of false positives is reduced even more, so the complete system has much lower false positive rates than the ones observed here. Even though in the case of using active learning the obtained classifiers have a larger number of weak classifiers, the processing time is not much larger, because there is not a large increase on the number of weak classifier for the first layers of the cascade. As a consequence of this result, we choose to train all our gesture detectors using active learning.
Fig. 6. Example results of the system. The five and victory gestures are detected and recognized. Notice the cluttered background, the highlights, and skin-like colors.
An evaluation of the gesture detectors, trained using active learning, is shown in figure 5. In this case the results were obtained by applying the detectors directly over the detected skin blobs not corresponding to the face, not over the results of the hand-tracking module. The use of the hand tracking before applying the detectors largely reduces the number of false positives. The training was done using the datasets described in Table 2, and as in the previous experiment, the evaluation was done using a dataset consisting of 200 examples per class, which contains all gestures and a large degree of variability. As it can be observed, the fist gesture detector obtains a very high performance, achieving a detection rate of 99%, with just 2 false positives. The other detectors show a lower performance, having a higher number of false positives, which is reduced when the tracking module is used. The main reason for the large number of false positives is the large variability of the illumination conditions and background of the place where the detectors were tested. Figure 6 shows some images from the test dataset, where it can be observed that it is an environment with several different light sources, and a lot of reflections, shadows, and highlights. An evaluation of the complete system, that is, the hand tracking and detection module together with the gesture detection and gesture recognition modules, is summarized in Table 4. The results are presented by means of a confusion matrix. The first thing that should be mentioned here is that the hand detection together with the tracking system did not produce any false positive out of the 8150 analyzed frames, i.e. the hands were detected in all cases. From Table 4 it can be observed that the gesture detection and recognition modules worked best on the five gesture, followed by the pointing, fist and palm gestures, in that order. The main problem is the confusion of the fist and pointing gestures, which is mainly due to the similarity of the
Table 1. Training sets for the fist and palm detectors. The D1 detectors are built using active learning, while the D2 detectors are built using standard hand databases.

Gesture   | Size of training images (pixels) | Database        | Training set size | Validation set size | Num. negative images | Num. layers | Num. weak classifiers
Fist (D1) | 24x24                            | Active learning | 1194              | 1186                | 46746                | 9           | 612
Fist (D2) | 24x24                            | IDIAP [20]      | 795               | 606                 | 47950                | 10          | 190
Palm (D1) | 24x42                            | Active learning | 526               | 497                 | 45260                | 10          | 856
Palm (D2) | 24x24                            | IDIAP [20]      | 597               | 441                 | 36776                | 8           | 277
Table 2. Training sets and classifier structure for the definitive gesture detectors

Gesture  | Size of training images (pixels) | Num. positive training images | Num. positive validation images | Num. negative (no-hand) images | Num. layers | Num. weak classifiers
Fist     | 24x24                            | 1194                          | 1186                            | 46746                          | 9           | 612
Palm     | 24x42                            | 526                           | 497                             | 45260                          | 10          | 856
Pointing | 24x42                            | 947                           | 902                             | 59364                          | 12          | 339
Five     | 24x24                            | 651                           | 653                             | 41859                          | 9           | 356
Table 3. Training sets for the multi-gesture classifier

Gesture  | Number of training examples | Number of training attributes
Fist     | 3838                        | 15352
Palm     | 3750                        | 15000
Pointing | 3743                        | 14972
Five     | 3753                        | 15012
Table 4. Confusion matrix of the complete system

Class\Predicted | Fist | Palm | Pointing | Five | Unknown | Detection and recognition rate [%]
Fist            | 1533 |    2 |      870 |    9 |      15 | 63.1
Palm            |   39 | 1196 |       10 |  659 |      15 | 62.3
Pointing        |  436 |   36 |     1503 |   27 |      86 | 72.0
Five            |  103 |   32 |        6 | 1446 |     127 | 84.3
Table 5. Average processing time of the main modules, in milliseconds. The size of the frames is 320x240 pixels.

Skin detection | Face detection | Face tracking | Hand detection | Gesture recognition + Hand tracking
4.456          | 0.861          | 1.621         | 2.687          | 78.967
gestures. On average, the system correctly recognized the gestures in 70% of the cases. If the pointing and fist gestures are considered as one gesture, the recognition rate goes up to 86%.
We also evaluated the processing time of the whole system. This evaluation was carried out on a PC powered by a Pentium 4 at 3.2 GHz with 1 GB RAM, running Windows XP, and the system was implemented using the C language. The observed average processing time required for processing a 320×240-pixel image, without considering the time required for image acquisition, was 89 milliseconds (see details in Table 5). With this, the system can run at about 11 fps for frames of 320×240 pixel size.
7 Conclusions One of the ways humans communicate with each other is through gestures, in particular hand gestures. In this context, a framework for the detection of hands and the recognition of hand gestures was proposed, with the aim of using it to interact with a service robot. The framework is based on cascade classifiers, a J48 tree classifier, an adaptive skin detector and a tracking system. The main module of the system corresponds to a nested cascade of boosted classifiers, which is designed to carry out fast detections with high DR and very low FPR. The system makes use of a face detector to initialize an adaptive skin detector. Then, a cascade classifier is used to initialize the tracking system by detecting the fist gesture. Afterwards, the hands are tracked using the mean shift algorithm. Afterwards, several independent detectors are applied within the tracked regions in order to detect individual gestures. The final recognition is done by a J48 classifier that allows to distinguishing between gestures. For training the cascade classifiers, active learning and the bootstrap procedure were used. The proposed active learning procedure allowed to largely increase the detection rates (e.g., from 17% up to 97% for the Palm gesture detector) maintaining a low false positive rate. As in our previous work [11], the bootstrap procedure [17] helped to obtain representative training sets when training a nested cascade classifier. Out of the hand detectors, the best results were obtained for the fist detection (99% DR at 1 FP), probably because this gesture has the lower degree of variability. The worst results were obtained for the gesture five detector (85% DR at 50 FP), mainly because under this gesture the hand and the background are interlaced, which greatly difficult the detection process in cluttered backgrounds. In any case, it should be stressed that these results correspond to a worst case scenario, i.e. when no tracking is performed, and that when using the tracking the FPR is greatly reduced. The system performs with a reasonable high performance in difficult environments (cluttered background, variable illumination, etc.). The tracking module has a detection rate over 99%, the detection module a 97% detection rate, and the gesture recognition rate is 70%. The main problem is the confusion of the fist with the pointing gesture and vice-versa. When these two gestures are considered as one, the global recognition rate goes up to 86%. We think that the recognition could be improved by using the history of the detection. The system presents a high processing speed (about 11 fps), and therefore it can be applied in dynamical environments in real time. As future research we would like to extend our system for recognizing dynamic gestures and to improve the detection module by integrating the classifiers’ cascades.
Acknowledgements This research was funded by Millenium Nucleus Center for Web Research, Grant P04-067-F, Chile.
References 1. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-Based Object Tracking. IEEE Trans. on Pattern Anal. Machine Intell. 25(5), 564–575 (2003) 2. Liu, X.: Hand gesture recognition using depth data. In: Proc. 6th Int. Conf. on Automatic Face and Gesture Recognition, Seoul, Korea, pp. 529–534 (2004) 3. Kolsch, M., Turk, M.: Robust hand detection. In: Proc. 6th Int. Conf. on Automatic Face and Gesture Recognition, Seoul, Korea, pp. 614–619 (2004) 4. Binh, N.D., Shuichi, E., Ejima, T.: Real-Time Hand Tracking and Gesture Recognition System. In: Proc. GVIP 2005, Cairo, Egypt, pp. 19–21 (2005) 5. Manresa, C., Varona, J., Mas, R., Perales, F.: Hand Tracking and Gesture Recognition for Human-Computer Interaction. Electronic letters on computer vision and image analysis 5(3), 96–104 (2005) 6. Fang, Y., Wang, K., Cheng, J., Lu, H.: A Real-Time Hand Gesture Recognition Method. In: Proc. 2007 IEEE Int. Conf. on Multimedia and Expo, pp. 995–998 (2007) 7. Chen, Q., Georganas, N.D., Petriu, E.M.: Real-time Vision-based Hand Gesture Recognition Using Haar-like Features. In: IMTC 2007. Proc. Instrumentation and Measurement Technology Conf, Warsaw, Poland (2007) 8. Angelopoulou, A., García-Rodriguez, J., Psarrou, A.: Learning 2D Hand Shapes using the Topology Preserving model GNG. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 313–324. Springer, Heidelberg (2006) 9. Ong, E.-J., Bowden, R.: A boosted classifier tree for hand shape detection. In: Proc. 6th Int. Conf. on Automatic Face and Gesture Recognition, Seoul, Korea, pp. 889–894 (2004) 10. Wimmer, M., Radig, B.: Adaptive Skin Color Classificator, Int. Journal on Graphics, Vision and Image Processing. Special Issue on Biometrics 2, 39–42 (2006) 11. Verschae, R., Ruiz-del-Solar, J., Correa, M.: A Unified Learning Framework for object Detection and Classification using Nested Cascades of Boosted Classifiers, Machine Vision and Applications (in press) 12. Schapire, R.E., Singer, Y.: Improved Boosting Algorithms using Confidence-rated Predictions. Machine Learning 37(3), 297–336 (1999) 13. Wu, B., Ai, H., Huang, C., Lao, S.: Fast rotation invariant multi-view face detection based on real Adaboost. In: Proc. 6th Int. Conf. on Automatic Face and Gesture Recognition, Seoul, Korea, pp. 79–84 (2004) 14. Abramson, Y., Freund, Y.: Active learning for visual object detection, UCSD Technical Report CS2006-0871 (November 19, 2006) 15. Fröba, B., Ernst, A.: Face detection with the modified census transform. In: Proc. 6th Int. Conf. on Automatic Face and Gesture Recognition, Seoul, Korea, pp. 91–96 (2004) 16. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 511–518 (2001) 17. Sung, K., Poggio, T.: Example-Based Learning for Viewed-Based Human Face Deteccion. IEEE Trans. Pattern Anal. Mach. Intell. 20(1), 39–51 (1998) 18. The Gesture Recognition Home Page (August 2007), Available at: http://www.cybernet.com/~ccohen/
19. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005) 20. IDIAP hand gesture database (August 2007), Available at: http://www.idiap.ch/resources/gestures/ 21. RoboCup @Home Official website (August 2007), Available at: http://www.robocupathome.org/ 22. UChile RoboCup Teams official website (August 2007), Available at: http://www.robocup.cl/
Spatial Visualization of the Heart in Case of Ectopic Beats and Fibrillation
Sándor M. Szilágyi 1,2, László Szilágyi 1,2, and Zoltán Benyó 2
1
Sapientia - Hungarian Science University of Transylvania, Faculty of Technical and Human Science, Târgu-Mureş, Romania
[email protected] 2 Budapest University of Technology and Economics, Dept. of Control Engineering and Information Technology, Budapest, Hungary
Abstract. This paper presents a dynamic heart model based on a parallelized space-time adaptive mesh refinement algorithm (AMRA). The spatial and temporal simulation method of the anisotropic excitable media has to achieve great performance in a distributed processing environment. The accuracy and efficiency of the algorithm were tested for anisotropic and inhomogeneous 3D domains using ten Tusscher's and Nygren's cardiac cell models. During propagation of the depolarization wave, the kinetic, compositional and rotational anisotropy is included in the tissue, organ and torso model. The inverse ECG generated with the conventional and the parallelized algorithm has the same quality, but a speedup of a factor of 200 can be reached using AMRA modeling and single instruction multiple data (SIMD) programming of the video cards. These results suggest that a powerful personal computer will be able to perform a one-second-long simulation of the spatial electrical dynamics of the heart in approximately five minutes. Keywords: spatial visualization, heart wall movement analysis, parallel processing.
1 Introduction
Sudden cardiac death, caused mostly by ventricular fibrillation, is responsible for at least five million deaths in the world each year. Despite decades of research, the mechanisms responsible for ventricular fibrillation are not yet well understood. It would be important to understand how the onset of arrhythmias that cause fibrillation depends on details such as heart’s size [15], geometry [11], mechanical and electrical state, anisotropic fiber structure and inhomogeneities [1]. The main difficulty in development of a quantitatively accurate simulation of an entire three-dimensional human heart is that the human heart muscle is a strongly excitable medium whose electrical dynamics involve rapidly varying, highly localized fronts [2]. Ectopic heartbeats are arrhythmias involving variations in a normal heartbeat. Sometimes they may occur without obvious cause and are not harmful. However, they are often associated with electrolyte abnormalities in the blood D. Mery and L. Rueda (Eds.): PSIVT 2007, LNCS 4872, pp. 548–561, 2007. c Springer-Verlag Berlin Heidelberg 2007
that should be treated. Many times ectopic beats can be associated with ischemia, or local reduction in blood supply to the heart. Once an ectopic beat appears, the underlying reversible reasons should be investigated, even if no further treatment is needed. An important aspect of ectopic beats caused by the altered depolarization of cardiac tissue is the significantly altered displacement of the heart during the whole beat. This special movement is easily visible in echocardiography image sequences. Each ectopic beat has a patient-dependent special waveform caused by the irregular depolarization order of the cardiac tissue. The formation of an ectopic beat and the generated mechanical movement can be simulated with computers. In ventricular tissues the width of a depolarization front is usually less than half a millimeter. A simulation approximating the dynamics of such a front requires a spatial resolution of Δx ≤ 0.1 mm. Since the muscle in an adult heart has a volume of 250 cm³, a uniform spatial representation requires at least 2.5·10⁸ nodes. Taking into account that each node's state is described with at least 50 floating-point numbers, the necessary storage space rises above 50 GB, which exceeds by far the available memory of personal computers. The rapid depolarization of the cell membrane is the fastest event in the heart; it is over in a few hundred microseconds, which implies a time step Δt ≤ 25 μs. Since dangerous arrhythmias may require several seconds to become established, the 10¹⁰ floating-point numbers associated with the spatial representation would have to be evolved over 10⁵–10⁶ time steps. Such a huge uniform mesh calculation currently exceeds all existing computational resources [3]. The spatiotemporal structure of wave dynamics in excitable media suggests an automatically adjustable resolution in time and space. The basic idea of this improvement [2,3] is deduced from experiments and simulations [4], which suggest that the electrical membrane potential of a ventricular cell, f_V(t, x, y, z), in the fibrillating state consists of many spirals or of many scroll waves. An interesting property of these spatiotemporally disordered states is that the dynamics is sparse: at any given moment, only a small volume fraction of the excitable medium is depolarized by the fronts, and away from them, the dynamics is slowly varying in space and time. This idea permits decreasing the necessary computational effort and storage space for regular beats, but the total front volume can greatly increase in the fibrillating state. By varying the spatiotemporal resolution to concentrate computational effort primarily along the areas with large spatial and temporal gradients, it is possible to reduce the computational load and memory needs by orders of magnitude. The rest of the paper describes the applied human cell and tissue model, the time- and position-dependent heart and torso model, the position of the ectopic beat generators, the adaptively variable resolution wave-propagation method, and the parallel processing of these algorithms aided by graphics cards. Using this algorithm, we can simulate the electrical and mechanical formation of ectopic beats on a parallel functioning platform.
2 Materials and Methods

2.1 Human Cell and Tissue Model
We used the ten Tusscher heart cell model [14] for ventricular cells and Nygren's model [9] for atrial cells to investigate the accuracy and efficiency of the simulation algorithm. These models are based on recent experimental data on most of the major ionic currents, such as the fast sodium, L-type calcium, transient outward, rapid and slow delayed rectifier, and inward rectifier currents. With the inclusion of basic calcium dynamics, the contraction and restitution mechanism of the muscle cells can be investigated. The model is able to reproduce human epicardial, endocardial and M cell action potentials, to modify the internal state of the cells, and to show that the differences can be explained by differences in the transient outward and slow delayed rectifier currents. These properties allow us to study the evolution of reentrant arrhythmias. The conduction velocity restitution of this model is broader than in other models and agrees better with the available data. We conclude that the applied model can reproduce a variety of electrophysiological behaviors and provides a basis for studies of reentrant arrhythmias in human heart tissue. As described in [14], the cell membrane can be modeled as a capacitor connected in parallel with variable resistances and batteries representing the different ionic currents and pumps. The electrophysiological behavior of a single cell is described as:

\frac{dV}{dt} = -\frac{I_{ion} + I_{stim}}{C_{memb}},   (1)

where V is the voltage, t is time, I_{ion} is the sum of all transmembrane ionic currents, I_{stim} is the externally applied stimulus current, and C_{memb} is the cell capacitance per unit surface area. The ionic current is given as the following sum:

I_{ion} = I_{Na} + I_{K1} + I_{to} + I_{Kr} + I_{Ks} + I_{CaL} + I_{NaCa} + I_{NaK} + I_{pCa} + I_{pK} + I_{bCa} + I_{bK},   (2)
where I_{NaCa} is the Na+/Ca2+ exchanger current, I_{NaK} is the Na+/K+ pump current, I_{pCa} and I_{pK} are the plateau Ca2+ and K+ currents, and I_{bCa} and I_{bK} are the background Ca2+ and K+ currents. The fast Na+ current, which is responsible for the fast depolarization of the cardiac cells, is formulated as:

I_{Na} = G_{Na} \cdot m^3 \cdot h \cdot j \cdot (V - E_{Na}),   (3)

where G_{Na} is the sodium conductance, m represents the activation gate, h the fast and j the slow inactivation gate. All detailed equations are described in [8]. These gates have a mainly voltage-dependent behavior. The maximal value of the first derivative of the L-type calcium current I_{CaL}, the transient outward current I_{to}, the slow delayed rectifier current I_{Ks}, the rapid delayed rectifier current I_{Kr}, the inward rectifier K+ current I_{K1}, and all other described currents is at least two orders of magnitude lower than that of the fast Na+ current I_{Na}.
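To make the role of Eqs. (1)–(3) concrete, the sketch below advances a single cell by one explicit Euler step. It is only an illustration: the conductance and reversal-potential values, the lumping of all non-sodium currents into a single term, and the omission of the gate ODEs are simplifying assumptions, not the actual ten Tusscher formulation.

```cpp
// Minimal single-cell integration sketch for Eqs. (1)-(3).
struct CellState {
    double V;        // membrane potential [mV]
    double m, h, j;  // Na+ gate variables (dimensionless, in [0,1])
};

// Eq. (3): I_Na = G_Na * m^3 * h * j * (V - E_Na).
double fastSodiumCurrent(const CellState& s, double G_Na, double E_Na) {
    return G_Na * s.m * s.m * s.m * s.h * s.j * (s.V - E_Na);
}

// One explicit Euler step of Eq. (1): dV/dt = -(I_ion + I_stim) / C_memb.
// 'otherCurrents' stands in for the remaining terms of the sum in Eq. (2).
void eulerStep(CellState& s, double I_stim, double otherCurrents,
               double C_memb, double dt /* e.g. 0.025 ms */) {
    const double G_Na = 14.838;  // placeholder conductance
    const double E_Na = 60.0;    // placeholder reversal potential [mV]
    double I_ion = fastSodiumCurrent(s, G_Na, E_Na) + otherCurrents;
    s.V += dt * (-(I_ion + I_stim) / C_memb);
    // The gate variables m, h, j would be advanced here with their
    // voltage-dependent rate equations; they are omitted in this sketch.
}
```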
A homogeneous piece of cardiac tissue can be modeled in space as a continuous system, using the following partial differential equation:

\frac{dV}{dt} = \frac{1}{C_{memb}}\left(-I_{ion} - I_{stim} + \frac{1}{\rho_x S_x}\frac{\partial^2 V}{\partial x^2} + \frac{1}{\rho_y S_y}\frac{\partial^2 V}{\partial y^2} + \frac{1}{\rho_z S_z}\frac{\partial^2 V}{\partial z^2}\right),   (4)

where ρ_x, ρ_y, ρ_z are the cellular resistivities and S_x, S_y, S_z are the surface-to-volume ratios in the x, y and z directions (a minimal finite-difference sketch of this update is given at the end of this subsection). Computational modeling of cardiac tissue is a useful tool for developing mechanistic insights into cardiac dynamics. The most important parts of human cardiac analysis are atrial and ventricular tissue modeling. In this study, the tissue-level excitation mechanism is based on Fast's work [6]. In this stage, each tissue element works as a secondary generator element. These elements can generate a depolarization wave if the adjacent elements are repolarized; otherwise, the wave propagation dies out. Our study uses Harrild's atrial model [8], which is the first membrane-based description of spatial conduction in a realistic human atrial geometry. This model includes both the left and right atria, with representations of the major atrial bundles and a right-sided endocardial network of pectinate muscle. The membrane kinetics is governed by the Nygren formulation [9] for the human atrial cell. An advantage of this model is that it provides an easily interpretable picture of atrial activation, particularly in regions that cannot be easily recorded in patients. It has long been appreciated that cardiac ventricular fibers are arranged as counter-wound helices encircling the ventricular cavities, and that the orientation of these fibers depends on transmural location. Fibers tend to lie in planes parallel to the epicardium, approaching a longitudinal orientation on the ventricular surfaces and rotating toward the horizontal near the mid-wall. Direct anatomical reconstructions are labor-intensive and time-consuming tasks. In our study, we applied Winslow's ventricular tissue model [17].
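The finite-difference sketch referred to above is given here. It advances Eq. (4) by one explicit step on a uniform grid; the flat indexing, the neglect of boundary nodes and of the stimulus current, and the constant diffusion coefficients D_x = 1/(ρ_x S_x) etc. are assumptions made for brevity, not the adaptive scheme of Sect. 2.5.

```cpp
#include <vector>

// One explicit step of Eq. (4) on a uniform Nx x Ny x Nz grid.
// Dx = 1/(rho_x*S_x), Dy = 1/(rho_y*S_y), Dz = 1/(rho_z*S_z); dx is the spacing.
void tissueStep(std::vector<double>& V, const std::vector<double>& Iion,
                int Nx, int Ny, int Nz,
                double dt, double dx, double Cmemb,
                double Dx, double Dy, double Dz) {
    std::vector<double> Vnew = V;
    auto at = [&](int i, int j, int k) { return (k * Ny + j) * Nx + i; };
    for (int k = 1; k < Nz - 1; ++k)
        for (int j = 1; j < Ny - 1; ++j)
            for (int i = 1; i < Nx - 1; ++i) {
                int c = at(i, j, k);
                double d2x = V[at(i+1,j,k)] - 2.0*V[c] + V[at(i-1,j,k)];
                double d2y = V[at(i,j+1,k)] - 2.0*V[c] + V[at(i,j-1,k)];
                double d2z = V[at(i,j,k+1)] - 2.0*V[c] + V[at(i,j,k-1)];
                double diffusion = (Dx*d2x + Dy*d2y + Dz*d2z) / (dx*dx);
                // Stimulus current I_stim is omitted here for brevity.
                Vnew[c] = V[c] + dt * (-Iion[c] + diffusion) / Cmemb;
            }
    V.swap(Vnew);
}
```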
2.2 Heart and Torso Model
There are many possible heart structures [10]. To describe various representative cases, we studied our chest MRI records (42 examples) and numerous CT images. These samples led us to construct a morphological heart structure for simulation, using the segmentation method presented in [5]. The obtained results were classified by physiologists and used to identify each atrial and ventricular region. The identification process is based on Harrild's atrial model [8] and Winslow's ventricular tissue model [17]. From the correctly segmented images, we constructed a spatial representation of the heart using an averaging technique. Such a prototype heart representation must be adjusted taking into consideration the ECG data. The ECG has an important role, as it describes the electrical properties of the heart. For example, the mechanical data obtained from MRI and CT images cannot give us any information about some malfunctions, such as the presence of numerous ectopic beats. An ultrasound image sequence, due to the relation between
the electrical and mechanical properties of the heart, may hold mechanical information that can be used to identify various electrical dysfunctions. The obtained heart model prototype contains most mechanical characteristics of the heart, such as tissue mass, wall thickness, and the internal structure of the atria and ventricles. Some electrical properties, such as the conduction speed of the depolarization wave, cannot be deduced from the captured images, and the only information sources are the parameters determined from the ECG signal. For example, the activation delay between atria and ventricles can be determined from the P–R wave distance. The rate of rise of the R wave determines the conduction speed in the ventricular tissue. This information was used to construct the electrical-mechanical heart model as described in [12]. The anatomical structure of the atria [8] and ventricles [13] was incorporated into the geometrical model of the heart and torso. The torso, lung, endocardial and epicardial surfaces were initially divided into 23647, 38844, 78809 and 89723 sub-units, respectively. For each of the units, the constant properties were determined (mass and tissue type, but not the tissue state). During an ordinary simulation, the number of these sub-units can vary on demand. The only restriction is that the ratio among the numbers of sub-units pre-determined for each heart region must be preserved. Such a heart model can have a maximal spatial resolution of 0.025 mm (restricted by the size of the computer's main memory), which means more than ten billion individual compartments at the highest decomposition. To allow flexible simulation, the minimal time-slice may be chosen between 0.01 ms and 2 ms. Each of these units may contain various variable properties, such as tissue state, ionic concentrations, or malfunction information such as ischemia. Starting from the anatomical information and the selected resolution in both time and space, the heart is constructed using tetrahedral meshes. During a simulation with a selected spatial resolution, the number of meshes remains constant. However, the mechanical displacement of the heart modifies the shape of each created mesh structure. The structure of the torso, its spatial position, the relative position and distance of the compartments with respect to the electrodes, and the electrical behavior of the torso's contents must all be known. As the model has to take into consideration a very large number of parameter values, the problem cannot be solved in a deterministic way (there are far more unknown values than equations). That is why a stochastic method (genetic algorithms, adaptive neural networks, fuzzy systems) should be applied to determine the values of the parameters. The search space of the optimization problem was reduced using the genetic algorithm (GA) presented in [7].
2.3 Mathematical Description of the Compartments
The heart is represented as a set of finite homogeneous elements, called compartments. Since their size is obviously much larger than that of actual biological cells, these units effectively represent small tetrahedron-shaped groups of biological cells, and must capture their macroscopic behavior rather than the microscopic behavior of individual cells. Microscopic inter- and intracellular interactions, such as ionic flow across membrane boundaries, were described in the cell
model presentation. Compartment connectedness was defined as the set of rules that establish which units are considered directly connected to a given one for the purposes of electrophysiological simulation, such that myocardial activation may be directly propagated between them. These rules are based on the anatomy of the atria and ventricles; they define the neighborhood of each unit. Each compartment was considered homogeneous, constructed of only one type of tissue with well-defined properties, such as cell type, cell state, and cell activation potential (AP) function. The type of cells determines the electrical propagation properties, but no additional considerations, such as tissue fiber torsion, were taken into account. The environmental parameters, such as 4D position (x, y, z spatial coordinates and time), conduction speed of the stimulus, weight, and connections with neighboring structures, localize each unit. The heart's behavior is characterized by the following parameters:

1. Type of cells: T (such as ventricular muscle cell or Purkinje fiber cell);
2. State (time varying): S (normal, ischemic);
3. Function of activation potential variation: AP(T, S, t) (each compartment has a specific activation potential function that depends on cell type and state);
4. Spatial position in time: Pos_C(x, y, z, t) (at every moment a given compartment has a spatial position);
5. Conduction speed of the stimulus: CS(T, S) (type and state dependent);
6. Weight of the contents of the compartment: M;
7. Connections with other compartments;
8. The position of the electrode: Pos_E(x, y, z, t) (the measuring electrode has a time-dependent spatial position);
9. The relative resistance of the electrode: R_{E,C}(Pos_C, Pos_E) (the time-dependent electric resistance of the human tissue from the studied compartment to a given electrode).

Because the main ion channels situated inside the cells have a rather complicated behavior (with many unknown parameters), the activation potential function of the compartment was considered a basic input parameter (we determine an AP function, based on the cell model, with a static shape for each cell type and state). Due to the contractions of the heart, respiration, and other disturbing phenomena, the position of the compartments was considered time varying. The mathematical expressions presented in the following, which describe compartment behavior, are time variant. Let V_C be the potential of an arbitrary compartment C: V_C(t) = AP(T, S, t − τ_C), where τ_C is the time the stimulus needs to reach compartment C. The activation potential function, which varies with cell type T and state S, has a short delay τ_C due to activation propagation until compartment C. The measured potential E_j generated by compartment C_i is:

E_{j,C_i}(t) = V_{C_i}(t) \cdot R_{E_j,C_i}(t) - E_{GND,C_i}(t),   (5)
where R_{E_j,C_i}(t) represents the time-varying resistance from compartment C_i to electrode E_j. Using bipolar electrodes, the value measured on the reference electrode E_{GND} is subtracted. As all compartments contribute to the potential measured on each electrode, the measured voltage on electrode E_j becomes the sum of the contributions E_{j,C_i}(t) generated by the compartments C_i:

E_j(t) = \sum_{i=0}^{N-1} [V_{C_i}(t) \cdot R_{E_j,C_i}(t) - E_{GND,C_i}(t)],   (6)

where N is the number of compartments. These equations determine the measured electrical potentials and the inner mechanism of the heart. During the simulation, these voltages were determined for each compartment and electrode at every time-slice (mostly between 0.1 ms and 2 ms, depending on the phase of the AP function).
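The parameter list and Eqs. (5)–(6) map naturally onto a small data structure. The sketch below shows one possible encoding of a compartment and the electrode summation; all names and the per-compartment ground term are illustrative assumptions, not the authors' actual implementation.

```cpp
#include <vector>
#include <functional>

struct Compartment {
    int    type;                          // T: cell type (ventricular, Purkinje, ...)
    int    state;                         // S: normal, ischemic, ...
    double tau;                           // tau_C: activation delay [ms]
    double mass;                          // M
    double x, y, z;                       // Pos_C at the current time step
    std::vector<int>    neighbours;       // connections to other compartments
    std::vector<double> R_to_electrode;   // R_{E_j,C}(t) for each electrode j
    std::function<double(double)> AP;     // AP(T, S, t): action-potential template
};

// Eq. (6): E_j(t) = sum_i [ V_{C_i}(t) * R_{E_j,C_i}(t) - E_{GND,C_i}(t) ],
// with V_{C_i}(t) = AP(T, S, t - tau_{C_i}) as in the text. The ground-electrode
// contribution is modeled here as a single value per compartment.
double electrodePotential(const std::vector<Compartment>& comps,
                          std::size_t j, double t, double E_gnd) {
    double Ej = 0.0;
    for (const Compartment& c : comps)
        Ej += c.AP(t - c.tau) * c.R_to_electrode[j] - E_gnd;
    return Ej;
}
```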
2.4 Connections Between Electrical and Mechanical Properties
The time-varying evolution of the cardiac volume is determined by the interconnection of electrical and mechanical phenomena. In a whole cardiac cycle there are two extreme values. The maximal volume can be coupled with the starting moment of ventricular contraction. The depolarization wave normally starts from the sino-atrial (SA) node and propagates through the atrioventricular (AV) node and the ventricles. The moment of minimal volume shortly precedes the termination of ventricular contraction, but is much more difficult to identify, due to the dead time of a normal cardiac cell. This delay is caused by a particular property of the regular cardiac cell: its electrical response is most directly caused by the depolarization wave (fast Na+ channels), but the mechanical contraction is controlled by the much slower Ca2+ channels. The calcium channel opens 10–20 ms after depolarization, and the maximal contraction follows in about 80 ms [16].
2.5 Adaptively Varied Resolution
As presented earlier, the simulation of each compartment at each small time-slice needs a powerful computer. To enhance the simulation performance we can increase the computational power of the simulation platform and modify the algorithm in such a way that it determines the most important data more accurately. In any case, due to the limited computational power of the computer, the simulation must contain approximations. In our formulation, the simulation task can be performed in the following manners:

– determine a pre-defined time and space resolution (not adaptive);
– guarantee an estimation error that is lower than a pre-defined threshold value (adaptive);
– guarantee a pre-defined processing speed (adaptive).

In the first case nothing is known about the simulation speed and the estimation error. The simulation algorithm uses a pre-defined time and space resolution, and the performance of the result (speed and accuracy) can only be estimated. The second processing manner has an adaptive behavior. The resolution is not important,
but the estimation error is. This approach leads to a low estimation error, but there is no guaranteed processing speed. The processing speed may highly depend on the heart's state (see the Discussion section for details). The third approach is useful for creating an on-line processing system. In this situation, however, we have a pre-defined simulation speed but no control over the simulation accuracy. In both adaptive approaches, the scalability of the simulation is realized in the same manner. During the simulation, the key element is the compartment. Each compartment has a time-dependent voltage that is increased by depolarization waves and decreased by the self-repolarization process. Both a high voltage increment and a high diversity of the compartments (adjacent compartments having significantly different voltage levels) increase the estimation error. This error is estimated by the following formula:

err(C, t) = \lambda_d \frac{dV_C}{dt} + \lambda_v \sum_{i=0}^{N-1} \{\lambda_{C,C_i}(t) \cdot [V_{C_i}(t) - V_C(t)]\}^2.   (7)
The estimation error is weighted by λ_d (derivative weight) and λ_v (voltage weight). The derivative term contains the voltage increment caused by the fast Na+ ionic current during depolarization and by the Ca2+ and K+ currents during the repolarization phase. In the second term, λ_{C,C_i} represents a weight between compartments C and C_i. This weight is considered time dependent, as the distance between compartments may vary during the simulation. A high voltage difference may increase the estimation error dramatically. From this formula it emerges that the most sensitive moments are those of depolarization, especially in the presence of multiple depolarizing fronts. From the determined estimation error, its variation in time, and the initial settings referring to the error threshold or simulation speed, the necessary time and space resolutions are determined. However, during a whole heartbeat the estimation error may vary, which implies spatial and temporal modification of the resolution used. In order to assure good alignment among the differently selected resolution slices, each high-resolution value must be selected 2^i times shorter than the initial reference resolution (this is the coarsest resolution in both time and space). To assure proper resolution values, each determined variable is rounded down to the closest allowed level. Data on all resolution levels are synchronized only after one full time step on the coarsest grid level is completed. The efficiency of the method arises from its ability to refine or coarsen the spatial and temporal representations of sub-units automatically and locally. The approximation error is estimated for each sub-unit to determine the lowest resolution necessary to keep it under a pre-defined tolerance value. The most important factor that demands a finer temporal and spatial resolution to keep the estimation errors under the pre-determined tolerance level is the fast depolarization wave of the atrial and ventricular tissue cells. The simulation program varies the resolution in accordance with the first derivative of the activation potential.
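The sketch below shows how the error measure (7) might drive the choice of a power-of-two refinement level, following the description above. The halving heuristic inside chooseLevel and all weights are placeholder assumptions, not the rule used by the authors.

```cpp
#include <vector>

// Eq. (7): err(C,t) = lambda_d * dV_C/dt
//                   + lambda_v * sum_i { lambda_{C,Ci}(t) * [V_Ci(t) - V_C(t)] }^2
double estimationError(double dVdt, double V_C,
                       const std::vector<double>& V_neighbour,
                       const std::vector<double>& lambda_CCi,
                       double lambda_d, double lambda_v) {
    double err = lambda_d * dVdt;   // dominated by the depolarization upstroke
    for (std::size_t i = 0; i < V_neighbour.size(); ++i) {
        double d = lambda_CCi[i] * (V_neighbour[i] - V_C);
        err += lambda_v * d * d;
    }
    return err;
}

// Select a refinement level so that dt_level = dt_ref / 2^level (and likewise
// for the spatial step), assuming the error shrinks roughly by half per level.
int chooseLevel(double err, double tolerance, int maxLevel) {
    int level = 0;
    while (err > tolerance && level < maxLevel) {
        err *= 0.5;   // crude assumption about the error decay per refinement
        ++level;
    }
    return level;
}
```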
2.6 Parallel Processing
The implementation of the method allows a high degree of parallelization. As sub-unit potential values are determined independently from each other at all possible resolution levels, these tasks can be processed on separate processors with reduced communication needs. The hardware-accelerated programmability of graphics processing units (GPUs), which may contain up to 320 individual processor units, allows the development of programs called shaders (vertex and fragment shaders), which are loaded into the graphics card memory to replace the fixed functionality. The fragment shaders are used in our method to perform the SIMD commands for each sub-unit. From these architectural concepts it follows that the GPUs are most efficient in the case of more than 1000 similar tasks, which is caused by the relatively long video memory latency.
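The same independence argument can be illustrated on a multi-core CPU; the following is a hedged CPU analogue of the GPU mapping (one kernel invocation per sub-unit, double-buffered state, synchronization only at the end of the time step), using std::thread in place of fragment shaders.

```cpp
#include <thread>
#include <vector>

// Apply 'kernel' to every sub-unit of the current resolution level in parallel.
// Reads come from oldState, writes go to newState (double buffering), so the
// threads never race; they are joined before the next time step begins.
void parallelUpdate(const std::vector<double>& oldState,
                    std::vector<double>& newState,
                    void (*kernel)(const std::vector<double>&,
                                   std::vector<double>&, std::size_t),
                    unsigned nThreads) {
    std::vector<std::thread> pool;
    const std::size_t n = oldState.size();
    for (unsigned t = 0; t < nThreads; ++t) {
        pool.emplace_back([&, t]() {
            for (std::size_t i = t; i < n; i += nThreads)  // static partitioning
                kernel(oldState, newState, i);
        });
    }
    for (std::thread& th : pool) th.join();  // barrier at the end of the step
}
```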
2.7 Simulation Platforms
Experiments were performed on four platforms with the different configurations shown in Table 1. Each computer has 1 GB of memory, which admits a maximal resolution between 0.01 mm (normal beats) and 0.1 mm (ventricular fibrillation) for the most critical area (of highly restricted size). The program was developed in a C++ programming environment and the shader programs were developed using ATI's graphics library (shader model 3.0).
3 Results
Table 2 reports simulation times using the various configurations. All simulation tasks were performed with adaptive time and spatial resolution, and Table 2 shows the finest resolutions among them. The simulated normal and pathological cases have a duration of one second. In all cases, the number of simultaneously performable tasks is of the order of thousands. The conventional simulation (constant resolution) was performed only for 1 mm spatial and 0.2 ms temporal units, and was about 200 times slower in the normal case and 35 times slower in the fibrillating case.

Table 1. Configuration of the simulation platforms involved in this study

Configuration   CPU              GPU                 RAM
1st             Athlon 3000+     nVidia 6600         1GB DDR
2nd             Core2 Duo 6400   ATI 1950 Pro        1GB DDR2
3rd             Core2 Duo 6400   2 × ATI 1950 Pro    1GB DDR2
4th             Pentium D805     nVidia 7600GT       1GB DDR2
Figure 1(a) shows the relation between estimation error and spatial resolution. A lower spatial resolution increases the estimation error of the depolarization wave. The obtained results are very similar for pathological cases. The propagation of the depolarization wave in an anisotropic tissue is presented in
Table 2. The whole simulation and visualization time of a one-second event performed on the involved platforms, using two spatial and temporal resolutions

Configuration and Resolution    Normal beat   Ectopic beat   Ventricular fibrillation
1st - (1mm, 0.2ms)              11.3s         37.15s         4min 2s
2nd - (1mm, 0.2ms)              1.32s         5.21s          33.11s
3rd - (1mm, 0.2ms)              0.7s          2.68s          17.13s
4th - (1mm, 0.2ms)              2.37s         9.03s          57.48s
1st - (0.1mm, 0.05ms)           1h 11min      4h 22min       29h 11min
2nd - (0.1mm, 0.05ms)           9min 20s      37min 10s      3h 53min
3rd - (0.1mm, 0.05ms)           5min 3s       19min 17s      1h 59min
4th - (0.1mm, 0.05ms)           15min 40s     1h 6min        6h 42min
Fig. 1. (a) The estimation error plotted against the chosen spatial resolution in case of a normal beat, (b) The simulated depolarization wave in anisotropic ventricular tissue: from the pace origin, the wave front propagation is plotted with 5ms resolution
Fig. 2. Propagation of the depolarization wave in a ventricular tissue area during ventricular fibrillation. The visualized area contains four ectopic points. The white excitable area is rapidly depolarized and the resulting wave fronts extinguish each other. The gray level of each point represents the voltage level of each individual cell. Each image represents a 50 mm wide square; the time step between neighboring squares is 10 ms.
Fig. 1(b). Figure 2 presents the collision of the depolarization waves. The depolarization of various ventricular slices for the normal case is presented in Fig. 3. The simulated ECG signal in a normal and an abnormal case (ventricular hypertrophy) can be seen in Fig. 4. The spatial representation of the ventricles during a normal heart beat is presented in Fig. 5. The resting and contracting tissue
Fig. 3. The propagation of the depolarization wave in the ventricular and atrial tissue. On the left side of the image, consecutive ventricular slices are presented from the ventricular top region to the apex (with 5 mm distance between consecutive slices). The propagation of the depolarization wave is presented, simulating a normal heart beat. The two images on the right present atrial slices (5 mm distance).
is visible in the first and second rows, respectively. In this simulation, a 0.2mm spatial resolution was used.
4 Discussion and Conclusion
Table 1 presents four configurations with shader model (SM) 3.0 capable GPUs. The 3rd configuration is the most powerful one, due to the CrossFire-connected ATI 1950 Pros. The type of CPU (Intel or AMD), the clock speed (1.86–2.66 GHz), the number of cores (single or dual) and the memory bandwidth (DDR or DDR2) did not play an important role, because a powerful video card has a much higher floating-point computing power (internally it has 8–36 pixel shader units). In all cases, the memory size was 1 GB, which restricts the applicable maximal resolution. Table 2 summarizes the simulations for the normal beat, ectopic beat and ventricular fibrillation states. The finest spatial and temporal resolution was 16 times greater in the case of the normal beat, 32 times greater for the ectopic beat, and 64 times greater for ventricular fibrillation. This result is in perfect concordance with the complexity of the studied events. A more complex event implies a larger depolarization wave front, which forces the processing algorithm to choose smaller spatial and temporal steps. From the data of Table 2 we can observe the clear dominance of the GPUs. Although the spatial and temporal resolution limit the necessary simulation time, in all cases a massive parallelization could be performed. All shader programs were created using a low-level programming environment. We could observe that in normal cases the active depolarization wave front is much smaller than in the case of an ectopic beat or ventricular fibrillation. In a complex biological situation, as the wave front size grows, the parallelization becomes harder. This is reflected by the simulation times in Table 2. It is observable that a normal heart has an at least 20 times smaller front area than a fibrillating one. As the cardiac muscle (especially the left ventricle) becomes less homogeneous, the relative simulation speed decreases. Some basic characteristics of the heart, such
Fig. 4. Simulated ECG signal in: (a) normal case, and (b) abnormal case (presence of accessory pathway)
Fig. 5. The spatial representation of the ventricles in the resting (images (a), (b) and (c)) and contracted (images (d), (e) and (f)) state during a normal beat, as follows: (a) and (d) upper view, (b) and (e) sectioned bottom view, (c) and (f) sectioned frontal view
as size, maximal tissue volume and left wall thickness, significantly influence the maximal performance. Figure 1(a) represents the estimation error as a function of the spatial resolution. The temporal resolution has a similar effect, but with a lower impact. From the
measurements we can deduce that the estimation error is largely independent of the physiological state. In normal and pathological cases, we measured almost the same error values. In this paper, we have discussed new features and new capabilities of a space-time adaptive heart-modeling algorithm. We have shown the algorithm's ability to simulate inhomogeneous and strongly anisotropic tissue regions (see Fig. 1(b)). This method can provide a variety of advances in addition to reductions in time and memory requirements. For example, the algorithm allows a more complex ionic model and a higher spatial resolution of a non-linear tissue model. Similarly, it allows the use of higher spatial and temporal resolution to reduce the angle dependence of propagation patterns in spatial domains with rotational anisotropy, or to verify that a calculation is sufficiently resolved, so that an increase in resolution does not affect the results (see Fig. 1). From Fig. 2 we can conclude that the various depolarizing wave fronts unify and the resulting wave fronts extinguish each other. The simulation was done on a simple ventricular tissue surface in order to verify the obtained results and to compare them with other simulation methods, such as the one presented in [2]. We can affirm that the obtained front shapes were almost the same. The propagation of the depolarization wave in the ventricular and atrial tissue is presented in Fig. 3. The propagation of the depolarization wave can be seen in the consecutive slices. Using this view, we can supervise the propagation of the depolarizing waves in various circumstances, such as normal beats, ectopic beats, Wolff-Parkinson-White syndrome and ventricular fibrillation. Besides the wave propagation, the simulated ECG can be visualized (see Fig. 4). The simulation model combined with the forward heart model presented in [12] can yield a simulated ECG. It is important to study the shape of the heart during a whole cycle. Despite various perturbing phenomena, it was possible to realize the spatial representation of the heart, or of some segments of it (see Fig. 5). Using this kind of approach, we can balance performance and accuracy. The optimal solution may depend on the platform used, the studied events and the available time. We have presented a massively parallelized, flexible and efficient heart simulation method that uses almost all features of modern processing hardware. We have also demonstrated that the processor of a modern graphics card can provide better performance than a modern CPU under certain conditions, in particular when data are allocated in a regular and parallel manner. In these situations, the GPU should operate in a SIMD fashion to achieve the best performance. Experimental results show that the graphics card can be exploited to perform non-rendering tasks. Acknowledgements. This research was supported by the Hungarian National Research Funds (OTKA) under Grant No. T069055, the Sapientia Institute for Research Programmes and the Communitas Foundation.
References

1. Antzelevitch, C., Shimizu, W., Yan, G.-X., Sicouri, S., Weissenburger, J., Nesterenko, V.V., Burashnikov, A., Di Diego, J., Saffitz, J., Thomas, G.P.: The M cell: Its contribution to the ECG and to normal and abnormal electrical function of the heart. J. Cardiovasc. Electrophysiol. 10, 1124–1152 (1999)
2. Cherry, E.M., Greenside, H.S., Henriquez, C.S.: A Space-Time Adaptive Method for Simulating Complex Cardiac Dynamics. Phys. Rev. Lett. 84, 1343–1346 (2000)
3. Cherry, E.M., Greenside, H.S., Henriquez, C.S.: Efficient simulation of three-dimensional anisotropic cardiac tissue using an adaptive mesh refinement method. Chaos 13, 853–865 (2003)
4. Courtemanche, M.: Complex spiral wave dynamics in a spatially distributed ionic model of cardiac electrical activity. Chaos 6, 579–600 (1996)
5. Dumoulin, S.O., Hoge, R.D., Baker Jr., C.L., Hess, R.F., Achtman, R.L., Evans, A.C.: Automatic volumetric segmentation of human visual retinotopic cortex. Neuroimage 18, 576–587 (2003)
6. Fast, V.G., Rohr, S., Gillis, A.M., Kléber, A.G.: Activation of Cardiac Tissue by Extracellular Electrical Shocks: Formation of 'Secondary Sources' at Intercellular Clefts in Monolayers of Cultured Myocytes. Circ. Res. 82, 375–385 (1998)
7. Godefroid, P., Khurshid, S.: Exploring Very Large State Spaces Using Genetic Algorithms. In: Katoen, J.-P., Stevens, P. (eds.) ETAPS 2002 and TACAS 2002. LNCS, vol. 2280, pp. 266–280. Springer, Heidelberg (2002)
8. Harrild, D.M., Henriquez, C.S.: A Computer Model of Normal Conduction in the Human Atria. Circul. Res. 87, 25–36 (2000)
9. Nygren, A., Fiset, C., Firek, L., Clark, J.W., Lindblad, D.S., Clark, R.B., Giles, W.R.: Mathematical Model of an Adult Human Atrial Cell: The Role of K+ Currents in Repolarization. Circul. Res. 82, 63–81 (1998)
10. Quan, W., Evans, S.J.: Efficient Integration of a Realistic Two-dimensional Cardiac Tissue Model by Domain Decomposition. IEEE Trans. Biomed. Eng. 45, 372–384 (1998)
11. Panfilov, A.V.: Three-dimensional organization of electrical turbulence in the heart. Phys. Rev. E 59, R6251–R6254 (1999)
12. Szilágyi, S.M., Szilágyi, L., Benyó, Z.: Spatial Heart Simulation and Analysis Using Unified Neural Network. Ser. Adv. Soft Comput. 41, 346–354 (2007)
13. ten Tusscher, K.H.W.J., Bernus, O., Hren, R., Panfilov, A.V.: Comparison of electrophysiological models for human ventricular cells and tissues. Prog. Biophys. Mol. Biol. 90, 326–345 (2006)
14. ten Tusscher, K.H.W.J., Noble, D., Noble, P.J., Panfilov, A.V.: A model for human ventricular tissue. Amer. J. Physiol. Heart. Circ. Physiol. 286, H1573–H1589 (2004)
15. Winfree, A.T.: Electrical turbulence in three-dimensional heart muscle. Science 266, 1003–1006 (1994)
16. Winslow, R.L., Hinch, R., Greenstein, J.L.: ICCS 2000. Lect. Notes Math, vol. 1867, pp. 97–131 (2005)
17. Winslow, R.L., Scollan, D.F., Holmes, A., Yung, C.K., Zhang, J., Jafri, M.S.: Electrophysiological Modeling of Cardiac Ventricular Function: From Cell to Organ. Ann. Rev. Biomed. Eng. 2, 119–155 (2000)
A Single-View Based Framework for Robust Estimation of Height and Position of Moving People

Seok-Han Lee and Jong-Soo Choi

Dept. of Image Engineering, Graduate School of Advanced Imaging Science, Multimedia, and Film, Chung-Ang University, 221 Huksuk-Dong, Dongjak-Ku, 156-756, Seoul, Korea
{ichthus, jschoi}@imagelab.cau.ac.kr
Abstract. In recent years, there has been increased interest in characterizing and extracting 3D information from 2D images for human tracking and identification. In this paper, we propose a single-view based framework for robust estimation of height and position. In the proposed method, the 2D features of the target object are back-projected into the 3D scene space, whose coordinate system is given by a rectangular marker. The position and the height are then estimated in the 3D space. In addition, the geometric error caused by inaccurate projective mapping is corrected by using geometric constraints provided by the marker. The accuracy and the robustness of our technique are verified by experimental results on several real video sequences from outdoor environments. Keywords: Video surveillance, height estimation, position estimation, human tracking.
1 Introduction

Vision-based human tracking is steadily gaining in importance due to the drive from many applications, such as smart video surveillance, human-machine interfaces, and ubiquitous computing. In recent years, there has been increased interest in characterizing and extracting 3D information from 2D images for human tracking. Emergent features are height, gait (an individual's walking style), and trajectory in 3D space [10, 11, 12]. Because they can be measured at a distance and from coarse images, considerable research effort has been devoted to using them for human identification and tracking. An important application is in forensic science, to measure the dimensions of objects and people in images taken by surveillance cameras [1, 2, 5]. Because of the bad quality of the images (taken by cheap security cameras), it is quite often not possible to recognize the face of a suspect or distinct features on his or her clothes. The height of the person may therefore become a very useful identification feature. Such a system is typically based upon 3-dimensional metrology or reconstruction from two-dimensional images. Accordingly, it is extremely important to compute accurate 3-dimensional coordinates using the projection of the 3D scene space onto 2D image planes. In general, however, one view alone does not provide enough information for complete three-dimensional reconstruction. Moreover, the 2D-3D projection determined by the linear projective camera model is defined up to an arbitrary scale; i.e., its
scale factor is not defined by the projective camera model. Most single view-based approaches, therefore, rely on geometric structures present in the images, such as parallelism and orthogonality. Vanishing points and vanishing lines are powerful cues, because they provide important information about the direction of lines and the orientation of planes. Once these entities are identified in an image, it is then possible to make measurements on the original plane in three-dimensional space. In [1], [2], and [5], excellent plane metrology algorithms to measure distances or length ratios on planar surfaces parallel to the reference plane are presented. If an image contains sufficient information to compute a reference plane vanishing line and a vertical vanishing point, then it is possible to compute a transformation which maps the identified vanishing points and lines to their canonical positions. The projective matrix which achieves this transformation allows reconstruction of the affine structure of the perspectively imaged scene. By virtue of the affine properties, one can compute the relative ratio of lengths of straight line segments in the scene. This technique is relatively simple and does not require that the camera calibration matrix or pose be known. However, the geometric cues are not always available, and these methods are not applicable in the absence of such scene structures. Alternatively, the position of an object on a planar surface in 3D space can be computed simply by using a planar homography. In this method, however, it is not possible to recover the original coordinates of a point which is not in contact with the plane in the scene. A more popular approach to reconstructing three-dimensional structure is to employ multiple cameras [13, 14, 15, 16, 17]. By using multiple cameras, the area of surveillance is expanded, and information from multiple views is quite helpful to handle issues such as occlusions. But multiple camera-based approaches may bring problems such as correspondence between the cameras, inconsistency between images, and camera installation. For example, feature points of an object extracted from different views may not correspond to the same 3D point in the world coordinate system. This may make the correspondence of feature point pairs ambiguous. Furthermore, the calibration of multiple cameras is not a simple problem.
Fig. 1. An example of the procedure (a) Estimation of the projective camera matrix using a marker (b) Real-time input image (c) Extraction of the object (d) Final result
In this paper, we propose a single view-based technique for the estimation of human height and position. In our method, the target object is a human walking along the ground plane. Therefore a human body is assumed to be a vertical pole. Then we back-project the 2D coordinates of the imaged object into the three-dimensional scene to compute the height and location of the moving object. This framework requires a reference coordinate frame of the imaged scene. We use a rectangular marker to give the world coordinate frame. This marker is removed from the scene after the
Fig. 2. Block diagram of the proposed method
initialization. Finally, we apply a refinement approach to correct the estimated result by using geometric constraints provided by the marker. The proposed method allows real-time acquisition of the position of a moving object as well as its height in 3D space. Moreover, as the projective camera mapping is estimated by using the marker, our method is applicable even in the absence of geometric cues. The remainder of this paper is structured in the following way: the proposed method is discussed in Section 2, experimental results are given in Section 3, and conclusions are drawn in Section 4.
2 Proposed Method

2.1 Foreground Blob Extraction

An assumption throughout the proposed method is the linear projective camera model. This assumption is violated by wide-angle lenses, which are frequently used in surveillance cameras. Those cameras tend to distort the image, especially near its boundaries. In such cases, the grossest deviation from the linear camera model is usually radial distortion, and this may affect the metrology algorithm considerably. Therefore, we apply the radial distortion correction method introduced in [8] before the main process. After this preprocessing step, we are given a quartic polynomial function which transforms the distorted feature points into corrected ones. In the proposed method, only the feature points (not the entire image) are corrected, because of the processing time. The foreground region is extracted by the statistical background subtraction technique presented in [9], which is robust to the presence of shadows. The main idea of this method is to learn the statistics of the properties of each background pixel over N pre-captured background frames and to obtain a statistical model of the background. Based on this, the algorithm can classify each pixel into "moving foreground," "original background," "highlighted background," or "shadow/shaded background" after obtaining its new brightness and chromaticity color values. After the background subtraction, we use simple morphological operators to remove small misclassified blobs. Humans are roughly vertical while they stand or walk. In order to measure the height of a human in the scene, a vertical line should be detected in the image. However, a vertical line in the image may not be vertical to the ground plane in real-world space. Therefore, the human body is assumed to be a vertical pole that is
Fig. 3. Extraction of head and feet locations (a) Captured image (b) Estimation of principal axis using eigenvectors (c) Extraction of the head and feet points
a vertical principal axis of the foreground region. We first compute the covariance matrix of the foreground region and estimate the two principal axes of the foreground blob, and the bounding rectangle of the foreground blob in the image is detected. Then we compute the intersections of the vertical principal axis with the vertical bounds of the blob. These two intersections are considered as the apparent positions of the head and feet, which are back-projected for the estimation of the height and position. As shown in Fig. 3, let (e_{1,t}, e_{2,t}) be the first and second eigenvectors of the covariance matrix of the foreground region at frame t, respectively. Then e_{1,t} and the center of the object blob P_{o,t} give the principal axis l_{ve,t} of the human body at t. Given l_{ve,t}, the intersections can be computed by cross products of the corresponding lines. The head and feet positions are then p'_{h,t} and p'_{f,t}, respectively.

2.2 Back-Projection

In our method, the height and position are measured by using the back-projected features in the three-dimensional scene. Let \tilde{M} = [X Y Z 1]^T be the 3D homogeneous coordinates of a world point and \tilde{m} = [x y 1]^T be the 2D homogeneous coordinates of its projection in the image plane. This 2D-3D mapping is defined by a linear projective transformation as follows:

\tilde{m} = \lambda \tilde{P} \tilde{M} = \lambda K [R \mid t] \tilde{M} = \lambda K [r_1\; r_2\; r_3 \mid t] \tilde{M},   (1)

where λ is an arbitrary scale factor, and the 3 x 4 matrix \tilde{P} is called the projective camera matrix, which represents the projection of the 3D scene space onto a 2D image. R is a 3 x 3 rotation matrix, t denotes the translation vector of the camera, and r_i denotes the i-th column vector of the rotation matrix R. We use the '~' notation to denote the homogeneous coordinate representation. The non-singular matrix K represents the camera calibration matrix, which consists of the intrinsic parameters of the camera. In our method, we employ the calibration method proposed by Zhang in [7]. This method computes the IAC (image of the absolute conic) ω by using the invariance of the circular points, which are the intersections of a circle and the line at infinity l∞. Once the IAC ω is computed, the calibration matrix K can be computed from ω^{-1} = K K^T. This method requires at least three images of a planar calibration pattern observed at three different orientations. From the calibrated camera matrix K and (1), the projective
transformation between the 3D scene and its image can be determined. In particular, the projective transformation between a plane of the 3D scene and the image plane can be defined by a general 2D homography. Consequently, if four points on the world plane and their images are known, then it is possible to compute the projection matrix \tilde{P}. Suppose that π_0 is the XY-plane of the world coordinate frame in the scene, so that points on the scene plane have zero Z-coordinate. If four points \tilde{X}_1 ~ \tilde{X}_4 of the world plane are mapped onto their image points \tilde{x}_1 ~ \tilde{x}_4, then the mapping between \tilde{M}_p = [\tilde{X}_1\; \tilde{X}_2\; \tilde{X}_3\; \tilde{X}_4] and \tilde{m}_p = [\tilde{x}_1\; \tilde{x}_2\; \tilde{x}_3\; \tilde{x}_4], which consist of \tilde{X}_n = [X_n\; Y_n\; 0\; 1]^T and \tilde{x}_n = [x_n\; y_n\; 1]^T respectively, is given by

\tilde{m}_p = K [R \mid t] \tilde{M}_p = [p_1\; p_2\; p_3\; p_4] \tilde{M}_p.   (2)

Here, p_i is the i-th column of the projection matrix. In this paper, \tilde{X}_n is given by the four vertices of the rectangular marker. From the vertex points and (2), we have

K^{-1} \begin{bmatrix} x_n \\ y_n \\ 1 \end{bmatrix} = \begin{bmatrix} r_{11} X_n + r_{12} Y_n + t_x \\ r_{21} X_n + r_{22} Y_n + t_y \\ r_{31} X_n + r_{32} Y_n + t_z \end{bmatrix},   (3)

where (x_n, y_n) is the n-th vertex detected in the image, r_{ij} represents an element of the rotation matrix R, and t_x, t_y, t_z are the elements of the translation vector t. From (3) and the four vertices, we obtain the translation vector t and the elements r_{ij} of the rotation matrix. By the property of the rotation matrix, the third column of R is computed as r_3 = r_1 \times r_2. Assuming that the rectangular marker is a square whose sides have length w_m, and defining \tilde{M}_p as in (4), the origin of the world coordinate frame is the center point of the square marker. In addition, the global scale of the world coordinate frame is determined by w_m. The geometry of this procedure is shown in Fig. 4.

\tilde{M}_p = \begin{bmatrix} w_m/2 & w_m/2 & -w_m/2 & -w_m/2 \\ w_m/2 & -w_m/2 & -w_m/2 & w_m/2 \\ 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 \end{bmatrix}   (4)

Fig. 4. Projective mapping between the marker and its image
Fig. 5. Back-projection of 2D features
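For reference, the pose extraction implied by Eqs. (2)–(4) is commonly implemented via the plane-to-image homography of the marker. The sketch below (using Eigen for the linear algebra, with the homography H assumed to be estimated elsewhere, e.g. by DLT from the four vertex correspondences) shows one standard way to do it; it is not taken verbatim from the paper.

```cpp
#include <Eigen/Dense>

// Given the calibration matrix K and the homography H mapping marker-plane
// points (X, Y, 1) to image points, recover R = [r1 r2 r3] and t,
// with r3 = r1 x r2 as described in the text.
void poseFromMarkerHomography(const Eigen::Matrix3d& K, const Eigen::Matrix3d& H,
                              Eigen::Matrix3d& R, Eigen::Vector3d& t) {
    Eigen::Matrix3d A = K.inverse() * H;     // A ~ [lambda*r1, lambda*r2, lambda*t]
    double lambda = 1.0 / A.col(0).norm();   // scale fixed by ||r1|| = 1
    Eigen::Vector3d r1 = lambda * A.col(0);
    Eigen::Vector3d r2 = lambda * A.col(1);
    Eigen::Vector3d r3 = r1.cross(r2);
    t = lambda * A.col(2);
    R.col(0) = r1;
    R.col(1) = r2;
    R.col(2) = r3;
    // R is subsequently re-orthogonalized via SVD (R = U V^T), as described next.
}
```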
In general, the computed rotation matrix R does not exactly satisfy the properties of a rotation matrix. Let the singular value decomposition of R be UΣV^T, where Σ = diag(σ_1, σ_2, σ_3). Since a pure rotation matrix has Σ = diag(1, 1, 1), we set R = UV^T,
which is the best approximation to the estimated rotation matrix [6]. An image point m = (x, y) back-projects to a ray in 3D space, and this ray passes through the camera center, as shown in Fig. 5. Given the camera projection matrix \tilde{P} = [P \mid \tilde{p}], where P is a 3 x 3 submatrix, the camera center is C = -P^{-1}\tilde{p}, and the direction of the line L formed by the join of C and m can be determined by its point at infinity \tilde{D} as follows:

\tilde{P}\tilde{D} = \tilde{m}, \quad \tilde{D} = [D^T\; 0]^T,   (5)

D = P^{-1}\tilde{m}, \quad \tilde{m} = [m^T\; 1]^T.   (6)

Then we have the back-projection of m given by

L = -P^{-1}\tilde{p} + \lambda P^{-1}\tilde{m} = C + \lambda D, \quad -\infty < \lambda < \infty.   (7)
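A short sketch of the back-projection of Eqs. (5)–(7) is given below; Eigen is used for the matrix arithmetic, which is an assumption of this illustration rather than a statement about the authors' code.

```cpp
#include <Eigen/Dense>

struct Ray {                  // L(lambda) = C + lambda * D, Eq. (7)
    Eigen::Vector3d C;        // camera centre, C = -P^{-1} p~
    Eigen::Vector3d D;        // direction,     D =  P^{-1} m~, Eq. (6)
};

// P34 is the 3x4 projective camera matrix [P | p~]; (x, y) is the image point.
Ray backProject(const Eigen::Matrix<double, 3, 4>& P34, double x, double y) {
    Eigen::Matrix3d P    = P34.leftCols<3>();
    Eigen::Vector3d ptil = P34.col(3);
    Eigen::Matrix3d Pinv = P.inverse();
    Ray r;
    r.C = -Pinv * ptil;
    r.D =  Pinv * Eigen::Vector3d(x, y, 1.0);
    return r;
}
```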
2.3 Estimation of Height and Position

In our method, a human body is approximated as a vertical pole. As shown in Fig. 5, the height of the object is the distance between M_0 and M_h, and its position is M_0, which is the intersection of the reference plane π_0 and the line L_1. Assuming that the line segment M_0–M_h is mapped onto its image m_0–m_h, the intersection can be denoted as M_0 = C + \lambda_0 P^{-1}\tilde{m}_0, where λ_0 is the scale coefficient at the intersection point. As M_0 is always located on the reference plane π_0, we have

\tilde{\pi}_0^T \tilde{M}_0 = 0, \quad \tilde{\pi}_0 = [0\; 0\; 1\; 0]^T, \quad \tilde{M}_0 = [M_0^T\; 1]^T.   (8)

Then, from \tilde{\pi}_0^T \tilde{M}_0 = \tilde{\pi}_0^T (C + \lambda_0 P^{-1}\tilde{m}_0), we can uniquely determine λ_0 as follows:

\lambda_0 = -\frac{\tilde{\pi}_0^T C}{\tilde{\pi}_0^T P^{-1}\tilde{m}_0}.   (9)

Fig. 6. Distortion of 2D-3D projective mapping due to inaccurate camera calibration (a) Projective relationship (b) Side view of (a)
The height of the object is given by the length of M0 ~ Mh, and Mh is the intersection of the vertical pole Lh and the line L2 which passes through mh. The vertical pole Lh and the line L2 can be denoted as follows
L_2 = -P^{-1}\tilde{p} + \lambda P^{-1}\tilde{m}_h = C + \lambda D_h, \quad -\infty < \lambda < \infty,   (10)

\tilde{L}_h = \tilde{M}_0 + \mu \tilde{D}_v, \quad \tilde{D}_v = [0\; 0\; 1\; 0]^T, \quad -\infty < \mu < \infty.   (11)

From L_h = L_2 = M_h, we obtain

M_0 + \mu D_v = C + \lambda D_h.   (12)

We rearrange (12), so that a set of linear equations on λ and μ is given as follows:

\begin{bmatrix} m_1 - c_1 \\ m_2 - c_2 \\ m_3 - c_3 \end{bmatrix} = \begin{bmatrix} d_{h1} & -d_{v1} \\ d_{h2} & -d_{v2} \\ d_{h3} & -d_{v3} \end{bmatrix} \begin{bmatrix} \lambda \\ \mu \end{bmatrix}.   (13)
Here, m_i, c_i, d_{hi}, and d_{vi} represent the i-th elements of M_0, C, D_h, and D_v, respectively. This can easily be solved via simple least-squares estimation. Finally, from (10) and (11), we obtain the height and position.

2.4 Correction of Back-Projection Error

Inaccurate projective mapping, which is often caused by inaccurate estimation of the camera projection matrix, affects the estimation of 3D points and consequently the measurement results. Fig. 6 shows an example of the back-projection error. Suppose that the camera is fixed and π_0 is the ideal reference plane. In general, the detected plane π' does not coincide with π_0 perfectly because of the back-projection error. Fig. 6(b) is the side view of Fig. 6(a), which illustrates that the measurements are significantly affected by perspective distortions. This problem is often solved by implementing a nonlinear optimization algorithm such as the Levenberg-Marquardt iteration. However, there is normally a significant trade-off between processing time and the reliability of the result. In order to correct this perspective distortion, therefore, we use the four vertices of the square marker, as shown in Fig. 7. Assuming that the projective mapping is ideal, x_1 ~ x_4 are mapped to X_1 ~ X_4 of the ideal plane. In practice, however, the vertex images are back-projected onto X'_1 ~ X'_4 of π'. From X'_1 ~ X'_4
Fig. 7. Correction of geometric distortion using vertices of the marker
and X_1 ~ X_4, we can estimate the homography which transforms the points of π' to those of π_0. The measured position of the object can then be corrected simply by applying this homography. On the other hand, the height of the object cannot be corrected in this way, because the intersection M_h is not in contact with the reference plane. Therefore, we rectify the measured height as follows.

1) Compute the intersection M'_C of L'_2 and π' as follows:

M'_C = P^{-1}(-\tilde{p} + \lambda_C \tilde{m}_h), \quad \lambda_C = -\frac{\tilde{\pi}_0^T C}{\tilde{\pi}_0^T P^{-1}\tilde{m}_h}.   (14)

2) Transform M'_C to M_C of π_0 by applying the homography:

\tilde{M}_C = H_p \tilde{M}'_C, \quad \tilde{M}_C = [M_C^T\; 1]^T,   (15)

where H_p denotes the homography defined by the quadruple of point pairs.

3) Finally, estimate M_h, which is the intersection of the vertical pole L_h and the line L_2 formed by the join of C and M_C. The height is obtained from h = ||M_h - M_0||.
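Putting Sections 2.2–2.3 together, a hedged sketch of the core estimation step is shown below: the feet ray is intersected with the ground plane (Eqs. (8)–(9)), and the height follows from the least-squares solution of Eq. (13). The homography-based correction of this section is omitted, and Eigen is again assumed for the linear algebra.

```cpp
#include <cmath>
#include <Eigen/Dense>

// C: camera centre; D0 = P^{-1} m0~ (feet ray); Dh = P^{-1} mh~ (head ray).
void estimateHeightAndPosition(const Eigen::Vector3d& C,
                               const Eigen::Vector3d& D0,
                               const Eigen::Vector3d& Dh,
                               Eigen::Vector3d& M0, double& height) {
    // Eq. (9): with pi0 = [0 0 1 0]^T the intersection with Z = 0 reduces to
    // lambda0 = -C.z / D0.z, and M0 = C + lambda0 * D0 is the ground position.
    double lambda0 = -C.z() / D0.z();
    M0 = C + lambda0 * D0;

    // Eq. (13): (M0 - C) = [Dh  -Dv] * [lambda; mu], with Dv = (0,0,1)^T.
    Eigen::Matrix<double, 3, 2> A;
    A.col(0) = Dh;
    A.col(1) = -Eigen::Vector3d::UnitZ();
    Eigen::Vector2d sol = A.colPivHouseholderQr().solve(M0 - C);
    // From Eq. (11), Mh = M0 + mu*Dv, so the height is |mu|.
    height = std::abs(sol(1));
}
```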
3 Experimental Results

To evaluate the performance of the proposed method, two sets of experiments were conducted. The first experiment was carried out under ideal conditions in the laboratory, and we then validated the proposed method on outdoor video sequences. All experiments were performed with a CCD camera which produces 720 x 480 image sequences at 30 FPS. The first experiment was performed in the following way. Against a uniform background, we located and moved a rod of length 30 cm. Then, at every 25 cm along the horizontal direction and at every 10 cm from the camera, we measured its position and height. To give the reference coordinates, we used a square marker whose sides have length w_m = 30 cm. The measurement errors are shown in Fig. 8. Fig. 8(a) and Fig. 8(b) illustrate that the results are affected significantly by the perspective distortion. From Fig. 8(c) and Fig. 8(d), however, we verify that the measurements are considerably improved by applying the correction algorithm. We note that the measurement error grows as the distance in each direction is increased. Considering the dimensions of the object and the distance from the camera, however, the measurement errors can be regarded as relatively small. Therefore, we can conclude that our method achieves reliable estimation of the height and position without critical error. The second experiment was carried out using several outdoor video sequences. For the outdoor experiments, we preset an experimental environment: on every rectangular area of dimension 280 cm x 270 cm, we placed a recognizable landmark. During the experiment, a participant walks along preset paths, and the height and position are measured at each frame. The reference coordinate system is given by a square marker whose sides have length w_m = 60 cm. Fig. 9(a) illustrates the input video streams, which also show the measured height and position, the reference coordinate frame, and a vector pointing to the individual. Fig. 9(b) shows the measured heights at each frame. In general, human walking involves a periodic up-and-down displacement.
Fig. 8. Measurement errors: (a), (b) Height and position estimation errors before the distortion compensation (c), (d) After the distortion compensation
Fig. 9. Experiment #1 (a) Input video stream (b) Estimated heights (c) Bird's eye view which illustrates estimated positions
The maximum height occurs at the leg-crossing phase of walking, while the minimum occurs when the legs are furthest apart. Therefore we refine the results with a running-average filter. As shown in Table 1, the height estimate is accurate to within σ = 2.15–2.56 cm. Fig. 9(c) shows a bird's eye view of the scene, which illustrates the trajectory of
Fig. 10. Experiment #2 (a) Input video stream (b) Height estimates (c) Bird's eye view of (a) which illustrates measured positions
Fig. 11. Experiment #3 (a) Input video stream (b) Height estimates (c) Bird's eye view of (a) which illustrates measured positions
the individual, the principal ray, the position of the camera, and the position of the marker. The trajectory, which exactly coincides with the landmarks, clearly indicates that our method can recover the original position of the moving individual accurately. Fig. 10 and Fig. 11 show results on several outdoor scenes, which also confirm the accuracy and
Fig. 12. Experiment #4 (a) Input video stream (b) Height estimates (c) Bird's eye view of (a) which illustrates measured positions

Table 1. Height estimation results

                          Real Height (cm)   Mean (cm)   Std. Dev. (cm)   Median (cm)
Experiment 1   Path 1     185.00             184.83      2.56             184.89
               Path 2     185.00             185.88      2.33             185.79
               Path 3     185.00             185.58      2.15             185.47
Experiment 2              168.00             170.08      3.08             169.65
Experiment 3              176.00             178.24      2.46             178.19
the robustness of the proposed method. Fig. 12 demonstrates experimental results with multiple targets. In this case, P3 is occluded by P2 between frames 92 and 98. As shown in Fig. 12(b) and Fig. 12(c), this occlusion may affect the estimates of P2 and P3. This problem can, however, be avoided by using a prediction algorithm, and we hope to report on this in the near future. The processing speed of the proposed method is roughly 12 frames/sec, but this may depend on the image quality and the number of targets in the scene. In summary, the experimental results suggest that the proposed method allows recovering the trajectories and heights with high accuracy.
4 Conclusion

We have presented a single view-based framework for robust and real-time estimation of human height and position. In the proposed method, the human body is assumed to be a vertical pole, and the 2D features of the imaged object are back-projected into the real-world scene to compute the height and location of the moving object. To give the reference coordinate frame, a rectangular marker is used. In addition, a refinement approach is employed to correct the estimated result by using the geometric constraints of the marker. The accuracy and robustness of our technique were verified by experimental results on several real video sequences from outdoor environments. The proposed method is applicable to surveillance/security systems which employ a simple monocular camera. Acknowledgment. This work was supported by the Korean Research Foundation under the BK21 project, the Korean Industrial Technology Foundation under the LOE project, and the SFCC Cluster established by the Seoul R&BD Program.
Robust Tree-Ring Detection

Mauricio Cerda 1,3, Nancy Hitschfeld-Kahler 1, and Domingo Mery 2

1 Department of Computer Science, University of Chile, Blanco Encalada 2120, Santiago, Chile
  mcerda, [email protected]
2 Department of Computer Science, Pontificia Universidad Católica de Chile, Av. Vicuña Mackenna 4860(143), Santiago, Chile
  [email protected]
3 INRIA-Loria Laboratory, Campus Scientifique 54506, Vandoeuvre-lès-Nancy, France
Abstract. The study of tree-rings is a common task in dendrology. The rings usually deliver information about the age of the tree, historic climate conditions and forest densities. Many different techniques exist to perform tree-ring detection, but they are commonly semi-automatic. The main idea of this work is to propose an automatic process for tree-ring detection and to compare it with a manual detection made by an expert in dendrology. The proposed technique is based on a variant of the Generalized Hough Transform (GHT) built on a very simple growing model of the tree. The presented automatic algorithm shows tolerance to textured and very noisy images, giving a good tree-ring recognition in most of the cases. In particular, it correctly detects 80% of the tree-rings in our sample database. Keywords: dendrology, tree-ring, Hough transform.
1 Introduction
The tree-rings or annual growth rings are formed in response to seasonal changes. Generally, a tree-ring is composed of two growth zones. In the first part of the growing season, thin-walled cells of large radial diameter are produced (earlywood), while towards the end of the season thick-walled cells of smaller diameter appear (latewood), resulting in a sharp disjunction between growth rings (see Fig. 1). Analysis of tree-rings from cross-sections of the tree (called stem analysis) plays a main role in assessing the growth response of trees to environmental factors. Furthermore, stem analysis is used to develop tree growth models to make yield and stand tables, and to reconstruct the entire historical growth record. Hence it has applicability in dendrochronological analysis¹. Tree-ring analysis is usually made by recording the ring-width along four or eight directions on a wood disc; however, in some applications it is necessary to record the entire growth ring [1], achieving a better estimation of ring areas.
¹ Study of woody plants such as shrubs and lianas.
The automatization of the tree-ring recognition process is important because it could make the results, currently obtained manually by experts, more comparable and reproducible. Additionally, an automatic algorithm could reduce the time required to perform the analysis. The automatization of the tree-ring recognition process requires image analysis, but this is a tough task because a wood disc image contains a high level of noise. The noise of wood disc images comes mainly from the texture and imperfections of the wood, and from the acquisition process itself. Another problem is the difficulty of expressing the ring properties in a detection algorithm as constraints or desirable properties. Each tree-ring can be approximated by using a polygon (closed polyline). The most obvious property of a tree-ring is that the polygon that represents it must have empty intersections with the polygons that represent the other rings. In addition, each polygon must contain the center of the wood disc (the position of the pith, i.e., the oldest part of the tree, at its center). Some of the less obvious properties are that each ring is located at the dark-to-light transition in the latewood sector, taking as reference point the pith position of the tree (see Fig. 1) and following the radial growth direction [1], and also that the shape of one ring constrains the shape of the others. In Fig. 1, the shape similarity of close rings can be observed. The idea of this work is to propose a simple way to include those restrictions in the detection process in order to build an automatic algorithm for tree-ring detection. In Section 2, we give an overview of different existing approaches for tree-ring detection and some other techniques that could be applied to this problem. The proposed algorithm is detailed in Section 3 and the results are presented in Section 4. Finally, the conclusions of our work are presented in Section 5.
2 Overview

From all the techniques used and proposed for the tree-ring detection problem, a simple classification can be made: techniques based on local features, techniques based on global models and other techniques. In the following section a discussion on the effectiveness of each technique is presented.

2.1 Local Features Techniques
The work of Conner [2] proposes a modified version of the Canny edge detector [3] with a preferred edge orientation for each region of interest and a suppression of any edge that does not come from a transition from latewood to earlywood. The inherent problem of this scheme is that it assumes one edge orientation for a certain region of interest. The most interesting idea of the work of Conner [2] is the restriction imposed on the allowable edge orientations, but the main problem is that the restriction is fixed to one value for each region of interest. Laggoune et al. [4] propose a different edge model that can handle noisy edges such as the ones present in wood disc images. This approach is still strongly dependent on the kind of edge model assumed, so it does not always work. It is possible to find wood disc images where, in some parts, the Canny edge detector works better than noisy edge models such as the one described in [4], and vice versa. Again, owing to the local nature of this technique, there is no guarantee that the output will be a closed shape for each tree-ring.

Fig. 1. The image shows a cross-cut from a Radiata pine trunk. The lighter zone indicates the earlywood, and the darker one the latewood. The abrupt edge of the latewood indicates the end of the growth season. The dotted line represents one possible approximate ring representation.

2.2 Global Model Techniques
One of the simplest techniques for template matching is the Hough Transform; a good review of this topic is given in [5]. The main restriction of the Hough Transform is that a certain shape must usually be assumed, for example a circle or an ellipse, and this delivers acceptable solutions only in a very reduced number of cases. On the other hand, a detection based on non-analytical shapes, as could be the output of a Generalized Hough Transform (GHT) [5,6], cannot be used directly in the tree-ring problem. The GHT must first be adapted to the special characteristics of the tree-ring problem; as an example, there is no a priori ring shape to look for. Other techniques such as Level Sets [7] are not well suited to include restrictions specific to this problem, at least not in a simple way.

2.3 Other Techniques
The mentioned algorithms are mostly based on common techniques used in image processing, but a family of algorithms taking advantage of biological-morphological properties has also been developed. In order to understand the nature of some of these algorithms it is important to realize that the most common objective is not to detect each full ring, but other characteristics such as the number of rings, the area of the rings, and the mean ring width.
The work of Georg Von Arx et al. [8] presents an automatic technique to determine the mentioned characteristics and some additional ones. The authors use a high resolution image of a prepared sample where the wood cells can be visualized and appear in a different color. Using this image as input, the next step is to morphologically classify the cells according to the tree species in order to identify the ring zones. This work gives the idea that the input used to detect the rings can be greatly improved (coloring cells) even before any processing is applied, and it takes into account that the algorithm must make adjustments depending on the species. For the problem of recognizing the full shape of each ring, this technique does not deliver a good solution because it cannot guarantee closed shapes and the rings can intersect each other. In the work of Soille et al. [9] another approach is presented to compute the tree-ring area. The authors use different morphological operators and thresholding values to identify the ring zones and combine this information with the gradient of the image. The morphological filters are supposed to repair failures in the rings. The authors discuss the problem of rings that are too close and indicate that in some cases the approach does not deliver closed shapes.

Table 1. Comparison between the different existing algorithms

Technique   Authors                           Filter noise   Impose shape   Overlap rings
Local       Conner [2]                        No             No             Yes
            Laggoune et al. [4]               Yes            No             Yes
Global      Hough Transform [5]               Yes            Yes            No
            Generalized Hough Transform [6]   Yes            No             No
            Level Sets [7]                    Yes            -              -
Others      Georg Von Arx et al. [8]          No             No             No
            Soille et al. [9]                 No             No             No
For the tree-ring detection process, most of the previous techniques have already been tested on wood discs. Table 1 shows a comparison among them that takes into account desirable properties that a good recognition method should have. Since most of the reviewed techniques use only local or close-to-local information [2,4,8,9], those techniques do not allow a proper recognition of the rings. A proper recognition method should consider global restrictions such as the similarity of close rings and the influence of the shape of each ring on the neighboring ones, among others. As shown in Table 1, the GHT global model shows more attractive characteristics, but it must be adapted to the problem. We claim that our top-down GHT-based approach, described in the next section, is closer to what the expert implicitly does in the manual processing and, because of that, is a better technique than the known ones.
3 Proposed Algorithm
The proposed technique requires two parameters for each image to compute the full ring set. These two parameters are the location of the center of the wood disc image or pith (C) and a polygon (convex or not) that represents the perimeter of the trunk (P). Both parameters can be computed in different ways, or even manually to obtain a semi-automatic algorithm. The automatic procedures to compute C and P are explained in detail in Sections 3.2 and 3.3, respectively.

Fig. 2. Block diagram of the complete algorithm

3.1 Main Algorithm
Fig. 2 shows a general diagram of the algorithm and Fig. 3 shows the algorithm applied to a wood disc image. The algorithm consists of three steps: Filtering (Fig. 3(b) and (c)), Accumulation (Fig. 3(d)) and Selection of the rings (Fig. 3(e)).

Filtering. The image is first transformed to the HSV color space, taking just the saturation component because this is the most representative value for the rings in the examined data. After this transformation, the Canny edge detection algorithm [3] is applied (any other gradient-based technique could also be used). Then, for any point Q detected as belonging to an edge (edge point), the following angle is computed:

\alpha = \arccos \frac{\langle Q - C,\ \nabla I|_Q \rangle}{\|Q - C\|\;\|\nabla I|_Q\|}    (1)

where I is the image intensity at point Q. Using a threshold level for the angle α, it is possible to keep only dark-to-light edges. Note that the Canny algorithm usually delivers two edges for the latewood zone. In order to smooth this last process, we applied two different thresholding levels, using a standard double threshold linking. The output of this part of the algorithm is a binary image of not necessarily connected edges that lie mostly in the correct ring locations. Noise is not completely removed at this point.
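As an illustration of the filtering step, the sketch below (assuming OpenCV and NumPy; the 60° threshold and the single-threshold simplification of the double threshold linking are assumptions, not values from the paper) keeps only the Canny edge pixels whose gradient is roughly aligned with the radial direction Q − C, i.e. for which the angle α of Eq. (1) is small:

```python
import cv2
import numpy as np

def dark_to_light_edges(image_bgr, C, alpha_max_deg=60.0):
    """Keep only edge pixels whose gradient points roughly along the radial
    direction Q - C (small alpha in Eq. 1), i.e. dark-to-light transitions."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    sat = hsv[:, :, 1]                                # saturation channel
    edges = cv2.Canny(sat, 50, 150)                   # any gradient-based detector works

    # Gradient of the saturation channel
    gx = cv2.Sobel(sat.astype(np.float32), cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(sat.astype(np.float32), cv2.CV_32F, 0, 1, ksize=3)

    ys, xs = np.nonzero(edges)
    rx, ry = xs - C[0], ys - C[1]                     # radial direction Q - C
    dot = rx * gx[ys, xs] + ry * gy[ys, xs]
    norm = np.hypot(rx, ry) * np.hypot(gx[ys, xs], gy[ys, xs]) + 1e-9
    alpha = np.degrees(np.arccos(np.clip(dot / norm, -1.0, 1.0)))

    keep = alpha < alpha_max_deg
    out = np.zeros_like(edges)
    out[ys[keep], xs[keep]] = 255
    return out
```

Whether a small α corresponds to the dark-to-light direction depends on the polarity of the chosen channel, so the sign of the threshold may need to be adapted to the data.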
Fig. 3. Illustration of the proposed algorithm in each stage. (a) Sector of an input image. (b) Edges obtained after the Canny edge detector is applied to (a). (c) Edges obtained after the dark-to-light filtering was applied to (b). (d) All the possible selectable polygons computed from P and C. (e) The selected rings from (d).
Accumulation. The growth model used to generate all the possible selectable polygons requires the tree trunk perimeter P and the pith location C of the wood disc. The following restriction on the shape of the polygon is then imposed: "Any polygon R can be represented as a function of the tree trunk perimeter P around the center of symmetry of the tree (pith) C" (see Fig. 4(a)). We can represent this function by using the following expression:

R_i(P_i, C, k) = C + k\,(P_i - C)    (2)

where R_i represents the i-th vertex of the selectable polygon of scale parameter k, and P_i is the i-th vertex of the tree trunk perimeter. The accumulation space is 1D because it is only over the parameter k. The perimeter is not necessarily a circle and C is not necessarily the center of a circle, so this scheme implicitly takes into account the normal asymmetries and the constraints of the rings. After the filtering, each point detected as belonging to an edge is assigned to the closest selectable polygon, represented by a certain value of k. The output of this stage is an accumulator for k that roughly represents the probability of each selectable polygon of being a tree-ring. In the simple case of a square image circumscribed about a circle P with C the center of that circle, each selectable polygon will be a circle of center C, k will represent the normalized radius, and the total number of selectable polygons will be at most N/2, where N is the width of the image.

Selection of the Rings. The last stage of the algorithm takes the 1D accumulator and computes all the local maxima considering the maximum of three consecutive k values. This way of computing the local maxima gives fewer false negative rings than taking five or more consecutive k values. Usually all the local maxima correspond to a ring, but it is necessary to fix a certain noise level depending on the size of the image and on the size of the smallest tree-ring. For example, in a 512x512 image, an accumulator with a value less than or equal to 10 for a certain polygon of scale parameter k is probably noise (see Fig. 4(b)).
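A minimal sketch of the accumulation and ring-selection stages, under the assumptions that P is given as an (M,2) array of perimeter vertices, C as the pith, and that assigning an edge pixel to its closest candidate polygon is approximated by its closest candidate vertex; the 10-vote noise level follows the example in the text:

```python
import numpy as np

def accumulate_and_select(edge_points, P, C, n_scales=40, min_votes=10):
    """edge_points: (E,2) filtered edge pixels; P: (M,2) trunk perimeter; C: (2,) pith.
    Returns the scale parameters k whose accumulator value is a local maximum."""
    ks = np.linspace(1.0 / n_scales, 1.0, n_scales)

    # Candidate ring vertices R_i(k) = C + k (P_i - C), as in Eq. (2)
    verts = C + ks[:, None, None] * (P - C)           # shape (n_scales, M, 2)
    flat = verts.reshape(-1, 2)
    owner = np.repeat(np.arange(n_scales), P.shape[0])

    # Vote: each edge point goes to the scale of its nearest candidate vertex
    acc = np.zeros(n_scales, dtype=int)
    for q in edge_points:
        acc[owner[np.argmin(np.sum((flat - q) ** 2, axis=1))]] += 1

    # Local maxima over three consecutive k values, above the noise level
    rings = [ks[i] for i in range(n_scales)
             if acc[i] >= min_votes
             and acc[i] >= acc[max(i - 1, 0)]
             and acc[i] >= acc[min(i + 1, n_scales - 1)]]
    return rings, acc
```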
Fig. 4. (a) A ring computed taking 50% of P. (b) An accumulator for 40 possible scale changes (k). Note that larger rings, more similar to the tree trunk perimeter, have a higher frequency value; this is because a larger ring is composed of more edge points.
3.2 Center of the Wood Disc Image
To estimate the point (x0, y0) that represents the center of the wood disc image or pith, we can use a non-linear minimization process such as a gradient-based method like the one mentioned in [4]. By using this approach we find the point (x0, y0) that minimizes the following objective function:

J(x_0, y_0) = \sum_i \left[ (x_i - x_0)^2 + (y_i - y_0)^2 \right]    (3)
where x_i and y_i represent the coordinates of each point detected as belonging to an edge in the wood disc image. The minimization gives a point usually very close to the center, but not precise enough for the main algorithm. This is then corrected by using a common property observed in the sample database: the center is the darkest point within a certain distance of (x0, y0). After testing this strategy on many images it is possible to fix the size of the neighborhood to be checked.
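A sketch of the center estimation, assuming the squared-distance objective of Eq. (3) as reconstructed above (whose minimizer has the closed form of the centroid, so no iterative gradient method is shown), followed by the darkest-point correction; the 20-pixel neighbourhood radius is an assumed value:

```python
import numpy as np

def estimate_pith(edge_points, gray, radius=20):
    """edge_points: (E,2) array of (x, y) edge pixels; gray: grayscale image.
    Minimize Eq. (3) (closed form: the centroid of the edge points), then
    refine to the darkest pixel within a fixed neighbourhood."""
    x0, y0 = np.round(edge_points.mean(axis=0)).astype(int)
    y_lo, x_lo = max(y0 - radius, 0), max(x0 - radius, 0)
    window = gray[y_lo:y0 + radius + 1, x_lo:x0 + radius + 1]
    dy, dx = np.unravel_index(np.argmin(window), window.shape)
    return x_lo + dx, y_lo + dy
```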
3.3 Perimeter of the Wood Disc Image
Several approaches were tested to select the most appropriate tree trunk perimeter approximation. The simplest one was to compute the convex hull of the edge image. The main problem with this approach is that it does not work well if the perimeter has large concavities or if the shape of the tree bark is too different from the shape of the most external ring. The second approach was to use a Snake algorithm [5], but this also does not handle the concavities of the bark and it is considerably more complex. Finally, the selected technique was to compute the perimeter using the convex hull of the edge image, mostly because of the good results obtained in most of the wood disc images of the sample database and because of its simple implementation.
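A sketch of the perimeter estimation finally adopted (convex hull of the filtered edge pixels), here using scipy; for 2-D inputs ConvexHull returns the hull vertices in counter-clockwise order:

```python
import numpy as np
from scipy.spatial import ConvexHull

def trunk_perimeter(edge_points):
    """Approximate the trunk perimeter P as the convex hull of the filtered
    edge pixels (the approach selected in the text)."""
    hull = ConvexHull(edge_points)
    return edge_points[hull.vertices]       # (M, 2) perimeter vertices
```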
3.4 Implementation
In the design of the implementation, one key factor was the performance of the algorithm. The most time consuming step in the presented algorithm is the accumulation stage. Consider an image of N² pixels where, in the worst case, all of them belong to an edge; the accumulator will have at most N possible values, each one representing a possible ring, and if each possible ring is composed of M points (M segments), a brute-force implementation will take time O(N³M) because for each pixel the closest possible ring must be calculated. A faster implementation is obtained by pre-computing the Voronoi diagram [10] of the vertices that form all the possible selectable rings and then iterating cell by cell over the diagram. This implementation takes time O(N²) to perform the accumulation. Note that the Voronoi diagram can be computed in time O((NM)log(NM)) using the Quickhull algorithm. To give an idea of the final performance, the average time to process one wood disc (see Section 4) was 172 s: 46% of this time for the accumulation, 18% for the Voronoi computation, 12% for the center and filtering stages, and the rest was spent in reading and in transformation operations such as the RGB to HSV conversion.
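The Voronoi-cell iteration described above amounts to a nearest-neighbour assignment of edge pixels to candidate-ring vertices; a hedged sketch of the same idea using a k-d tree (scipy) instead of an explicit Voronoi diagram:

```python
import numpy as np
from scipy.spatial import cKDTree

def accumulate_fast(edge_points, flat_vertices, owner, n_scales):
    """flat_vertices: all candidate-ring vertices stacked as (n_scales*M, 2);
    owner[v] is the scale index k of vertex v. One k-d tree query per edge
    pixel replaces the brute-force nearest-polygon search."""
    tree = cKDTree(flat_vertices)                 # built once
    _, nearest = tree.query(edge_points)          # nearest candidate vertex per pixel
    return np.bincount(owner[nearest], minlength=n_scales)
```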
4 Results
In this section we present and compare the ring detection results using the proposed automatic algorithm, the semi-automatic variant and a manual ring detection that we have called "real". The automatic version of the algorithm was applied by using the same parameters for the Canny edge detector and the double threshold linking in each one of the images. The semi-automatic technique was performed by asking the user for the perimeter, but the center point was still automatically obtained.

Table 2. Table of results indicating the number of detected rings for each technique. TP: true positive rings, FN: false negative rings, Ŝn = TP/(TP+FN): sensitivity.

                   Difficulty        Real               Automatic           Semi-automatic
Images             (1-10)       TP   FN   Ŝn       TP   FN   Ŝn        TP   FN   Ŝn
Wood disc (base)   3            9    0    1        9    0    1.00      9    0    1.00
Wood disc 1        4            9    0    1        9    0    1.00      9    0    1.00
Wood disc 2        5            11   0    1        10   1    0.90      10   1    0.90
Wood disc 3        5            9    0    1        7    2    0.77      7    2    0.77
Wood disc 4        6            12   0    1        7    5    0.58      10   2    0.83
Wood disc 5        7            11   0    1        6    5    0.54      6    5    0.54
Wood disc 6        7            10   0    1        10   1    0.90      10   0    1.00
Wood disc 7        7            12   0    1        8    4    0.66      8    4    0.66
Wood disc 8        7            9    0    1        8    1    0.88      8    1    0.88
Wood disc 9        9            9    0    1        7    2    0.77      7    2    0.77
<>                 -            -    -    1        -    -    0.80      -    -    0.85
Table 3. Number of detected rings that mix two different manually detected rings

Images             Difficulty (1-10)   Real   Automatic   Semi-automatic
Wood disc (base)   3                   0      0           0
Wood disc 1        4                   0      0           0
Wood disc 2        5                   0      1           1
Wood disc 3        5                   0      2           2
Wood disc 4        6                   0      2           2
Wood disc 5        7                   0      4           2
Wood disc 6        7                   0      1           0
Wood disc 7        7                   0      3           0
Wood disc 8        7                   0      0           0
Wood disc 9        9                   0      3           3
All tests were performed using Matlab® Software [11], with 10 color jpeg files of approximately 700 by 700 pixels each. The wood disc images were taken directly in the field with a Nikon Coolpix 885 camera. The chosen images include both easy and difficult cases, even for manual detection, and were selected for being representative of the main difficulties found in the wood disc image database. The counting of the rings was performed automatically and the overlapping manually. The results are summarized in Table 2 and Table 3. It can be seen that the proposed algorithm gives very good results. The automatic algorithm usually recognizes the same number of rings (TP + overlaps) as the manual detection does, but sometimes some of them are not true rings. This occurs when the detection algorithm mixes two very close rings, as shown in Table 3. This kind of problem in the automatic detection is produced when the bark of the tree does not give a good approximation of the first ring, as occurs in wood discs 5 and 7. In this case, we recommend using the semi-automatic algorithm because it usually improves the results. If the bark is too deformed, it is better to ask the user to directly indicate the first ring instead of the bark. It can also be seen in Table 2 and Table 3 that for wood disc 9 the algorithm (automatic or semi-automatic) did not work well, but this was because the wood disc image contains a branch, a situation that deforms the normal radial growth of the tree. The assumption that all the rings can be obtained by scaling the shape of the bark works well when the bark is a good approximation of the first ring (usually this implies a thin bark) and the wood disc presents close to regular growth (without branches).
5 Conclusions
In this work we present a robust automatic algorithm for tree-ring detection that works well in recognizing the rings of trees with normal or close to normal growth, tolerating false partial rings, textured zones and even additional lines. The proposed technique is composed of a filtering stage followed by a voting over the re-scaling parameter of the perimeter. The new idea presented here is to take a ring-prototype, obtained from the bark of the wood disc image, as input for this Hough-like transform, without any previous assumption on the shape of the rings, and to deform this ring-prototype using a very simple growing model of the tree. Future work includes (a) the use of different edge detection models better adapted to recognize noisy edges, (b) the use of a more accurate growing model for the tree rings, and (c) the testing of the algorithm on different tree species.

Fig. 5. Two examples of input and result using the automatic algorithm. (a) Sample Tree 3. (b) Fully automatic recognition with the input of Fig. 5(a); the detailed sector shows two overlapping rings. (c) Sample Tree 8. (d) Fully automatic recognition with the input of Fig. 5(c).
Acknowledgments. Special thanks to Fernando Padilla from the Mathematical Modeling Center (CMM) of the Faculty of Physical Sciences and Mathematics (FCFM) of the University of Chile for the revision of the paper, for valuable comments on how to improve it, and for lending us the trunk image database, and to Bernard Girau for useful comments about the paper. The authors acknowledge financial support from FONDECYT Chile - Project No. 1061227.
References
1. Forest, L., Padilla, F., Martínez, S., Demongeot, J., Martín, J.S.: Modelling of auxin transport affected by gravity and differential radial growth. Journal of Theoretical Biology 241, 241–251 (2006)
2. Conner, W.S., Schowengerdt, R.A.: Design of a computer vision based tree ring dating system. In: IEEE Southwest Symposium on Image Analysis and Interpretation, pp. 256–261 (1998)
3. Canny, J.: A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 8, 679–698 (1986)
4. Laggoune, H., Sarifuddin, G.V.: Tree ring analysis. In: Canadian Conference on Electrical and Computer Engineering, pp. 1574–1577 (2005)
5. Nixon, M., Aguado, A.: Feature Extraction & Image Processing. Elsevier, Amsterdam (2005)
6. Ballard, D.H.: Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition 13(2), 111–122 (1981)
7. Sethian, J.A.: Level Set Methods and Fast Marching Methods. Cambridge University Press, Cambridge (1999)
8. Arx, G.V., Dietz, H.: Automated image analysis of annual rings in the roots of perennial forbs. International Journal of Plant Sciences 166, 723–732 (2005)
9. Soille, P., Misson, L.: Tree ring area measurements using morphological image analysis. Can. J. For. Res. 31, 1074–1083 (2001)
10. Aurenhammer, F.: Voronoi diagrams - a survey of a fundamental geometric data structure. ACM Comput. Surv. 23, 345–405 (1991)
11. Mathworks: Image Processing Toolbox for use with Matlab: User Guide. The Mathworks Inc., Natick, MA, USA (2007)
A New Approach for Fingerprint Verification Based on Wide Baseline Matching Using Local Interest Points and Descriptors

Javier Ruiz-del-Solar, Patricio Loncomilla, and Christ Devia

Department of Electrical Engineering, Universidad de Chile
{jruizd, ploncomi, cdevia}@ing.uchile.cl

Abstract. In this article a new approach to automatic fingerprint verification is proposed that is not based on the standard ridge-minutiae-based framework, but on a general-purpose wide baseline matching methodology. Instead of detecting and matching the standard structural features, in the proposed approach local interest points are detected in the fingerprint, then local descriptors are computed in the neighborhood of these points, and afterwards these descriptors are compared using local and global matching procedures. The final verification is carried out by a Bayes classifier. It is important to remark that the local interest points do not correspond to minutiae or singular points, but to local maxima in a scale-space representation of the fingerprint images. The proposed system has 4 variants that are validated using the FVC2004 test protocol. The best variant, which uses an enhanced fingerprint image, SDoG interest points and SIFT descriptors, achieves a FRR of 20.9% and a FAR of 5.7% on the FVC2004-DB1 test database, without using any minutia or singular point information.

Keywords: Fingerprint verification, Wide Baseline Matching, SIFT.
1 Introduction

Fingerprint verification is one of the most employed biometric technologies. A fingerprint is the pattern of ridges and furrows on the surface of a fingertip. It is formed by the accumulation of dead, cornified cells [5]. The fingerprint pattern is unique and determined by the local ridge characteristics and the presence of ridge discontinuities, called minutiae. The two most prominent minutiae are ridge termination and ridge bifurcation. Minutiae in fingerprints are generally stable and robust to fingerprint impression conditions. Singular points, called loop and delta, are a sort of control points around which the ridge-lines are "wrapped" [12]. Many approaches to automatic fingerprint verification have been proposed in the literature and the research on this topic is still very active. In most cases the automatic verification process is based on the same procedure employed by human experts: (i) detection of structural features (ridges, minutiae, and/or singular points), and in some cases derived features such as the orientation field, which allow characterizing the fingerprints, and (ii) comparison between the features in the input and reference fingerprints. This comparison is usually implemented using minutiae-based matching, ridge pattern
comparison and/or correlation between the fingerprints. The mentioned comparison methodologies can be described as follows [12]. Minutiae-based matching: it consists of finding the alignment between the input and the reference minutiae sets that results in the maximum number of minutiae pairings. Ridge feature-based matching: the approaches belonging to this family compare fingerprints in terms of features extracted from the ridge pattern (e.g. local orientation and frequency, ridge shape, texture information). Correlation-based matching: two fingerprint images are superimposed and the correlation (at the intensity level) between corresponding pixels is computed for different alignments (e.g., various displacements and rotations). In state-of-the-art fingerprint verification systems several structural features and comparison methodologies are jointly employed. For instance, in the 2004 Fingerprint Verification Competition (FVC2004) the 29 participants (out of 43) that provided algorithm information employed the following methodologies [2]:
• Features: minutiae (27), orientation field (19), singular points (12), ridges (10), local ridge frequency (8), ridge counts (6), raw or enhanced image parts (4), and texture measures (3).
• Comparison methodology: minutiae global (20), minutiae local (15), correlation (7), ridge pattern geometry (5), and ridge pattern texture (2).
In this general context, the main objective of this article is to propose a new approach to automatic fingerprint verification that is not based on the standard ridge-minutiae-based framework, but on a general-purpose wide baseline matching methodology. Instead of detecting and matching the standard structural features, in the proposed approach local interest points are detected in the fingerprint, then local descriptors are computed in the neighborhood of these points, and finally these descriptors are compared using local and global matching procedures. The local interest points do not correspond to minutiae or singular points, but to local maxima in a scale-space representation of the fingerprint image (see examples in Figure 1).
Fig. 1. Example of detected local interest points in the test and template fingerprints. Interest points are displayed as arrows, whose origin, orientation and size correspond to the position (x,y), orientation θ and scale σ of the corresponding interest points.
The main intention in proposing this new approach is to show an alternative procedure for solving the fingerprint verification problem. We believe that this new approach can complement and enrich the standard procedures, and that it can be used in addition to them. In this sense, we are not proposing a methodology to replace the standard one, but a complementary solution. This article is structured as follows. In Section 2 the employed methodology for solving the general wide baseline matching problem is described. Section 3 presents the adaptation of this methodology to fingerprint verification. Section 4 presents preliminary results of this new approach. Finally, in Section 5 some conclusions and projections of this work are given.
2 Wide Baseline Matching Using Local Interest Points and Descriptors

Wide baseline matching refers to a matching process where the images to be compared are allowed to be taken from widely separated viewpoints, so that a point in one image may have moved anywhere in the other image. Object recognition using a reference image (model) can be modeled as a wide baseline matching problem. In this context, wide baseline matching (object recognition) approaches based on local interest points (invariant features) have become increasingly popular and have experienced an impressive development in the last years [3][7][8][13][17]. Typically, local interest points are extracted independently from both a test and a reference image, then characterized by invariant descriptors, and finally the descriptors are matched until a given transformation between the two images is obtained. The most employed local detectors are the Harris detector [4] and Lowe's SDoG+Hessian detector [7]; Lowe's detector is multiscale while the Harris detector is single scale. The best performing affine invariant detectors are the Harris-Affine and the Hessian-Affine [15], but they are too slow to be applied in general-purpose applications. The most popular and best performing invariant descriptor [14] is the SIFT (Scale Invariant Feature Transform) [7]. When selecting the local detector and invariant descriptor to be used in a given application, the algorithm's accuracy, robustness and processing speed should be taken into account. Lowe's system [7], using the SDoG+Hessian detector, SIFT descriptors and a probabilistic hypothesis rejection stage, is a popular choice given its recognition capabilities and near real-time operation. However, the main drawback of Lowe's system is the large number of false positive detections. This is a serious problem when using it in real-world applications such as robot self-localization [19], robot head pose detection [9] or image alignment [20]. One of the main weaknesses of Lowe's algorithm is the use of just a simple probabilistic hypothesis rejection stage, which cannot successfully reduce the number of false positives. Loncomilla and Ruiz-del-Solar (L&R) propose a system that largely reduces the number of false positives by using several hypothesis rejection stages [8][9][10][11]. This includes a fast probabilistic hypothesis rejection stage, a linear correlation verification stage, a geometrical distortion verification stage, a pixel correlation verification stage, a transformation fusion procedure, and the use of the RANSAC algorithm and a semi-local constraints test. Although RANSAC and the semi-local constraints test have been used by many authors, Lowe's system does not use them. In [10] the Lowe and L&R systems are compared using 100 pairs of real-world highly textured images (variations in position, view angle, image covering, partial occlusion, in-plane and out-of-the-plane rotation). The results show that on this dataset the L&R system reduces the
false positive rate from 85.5% to 3.74%, while increasing the detection rate by 5%. For this reason we chose to use this system in this work. The L&R system considers four main stages: (i) generation of local interest points, (ii) computation of the SIFT descriptors, (iii) SIFT matching using nearest descriptors, and (iv) transformation computation and hypothesis rejection tests. The first three stages are the standard ones proposed by Lowe, while the fourth stage is employed for reducing the number of false matches, giving robustness to the whole system. This stage is implemented by the following procedure (detailed description in [8][9][10]):

1. Similarity transformations are determined using the Hough transform. After the Hough transform is computed, a set of bins, each one corresponding to a similarity transformation, is determined. Then:
   a. Invalid bins (those that have less than 4 votes) are eliminated.
   b. Q is defined as the set of all valid candidate bins, the ones not eliminated in 1.a.
   c. R is defined as the set of all accepted bins. This set is initialized as an empty set.
2. For each bin B in Q the following tests are applied (the procedure is optimized for high processing speed by applying less time consuming tests first):
   a. If the bin B has a direct neighbor in the Hough space with more votes, delete bin B from Q and go to 2.
   b. Calculate rREF and rTEST, the linear correlation coefficients of the interest points corresponding to the matches in B that belong to the reference and test image, respectively. If the absolute value of either of these two coefficients is high, the corresponding points lie, or nearly lie, on a straight line, and the affine transform to be obtained can be numerically unstable. If this condition is fulfilled, delete bin B from Q and go to 2.
   c. Calculate the probability PFAST associated with B. If PFAST is lower than a threshold PTH1, delete bin B from Q and go to 2. The main advantage of this probability test is that it can be computed before calculating the affine transformation, which speeds up the whole procedure.
   d. Calculate an initial affine transformation TB using the matches in B.
   e. Compute the affine distortion degree of TB using a geometrical distortion verification test. A correct affine transformation should not deform an object very much when mapping it. Therefore, if TB has a strong affine distortion, delete bin B from Q and go to 2.
   f. Top-down matching: matches from all the bins in Q that are compatible with the affine transformation TB are summarized and added to bin B. Duplication of matches inside B is avoided.
   g. Compute the Lowe probability PLOWE of bin B. If PLOWE is lower than a threshold PTH2, delete bin B from Q and go to 2.
   h. To find a more precise transformation, apply RANSAC inside bin B. In case RANSAC succeeds, a new transformation TB is calculated and B is labeled as a RANSAC-approved bin.
   i. Accept the candidates B and TB, which means deleting B from Q and including it in R (the TB transformation is accepted).
3. For all pairs (Bi, Bj) in R, check if they may be fused into a new bin Bk. If the bins may be fused and one of them is RANSAC-approved, do not fuse them, and delete the other in order to preserve accuracy. If the two bins are RANSAC-approved, delete the least probable one. Repeat this until all possible pairs (including the newly created bins) have been checked.
4. For any bin B in R, apply the semi-local constraints procedure to all matches in B. The matches from B that are incompatible with the constraints are deleted. If some matches are deleted from B, TB is recalculated.
5. For any bin B in R, calculate the pixel correlation rpixel using TB. Pixel correlation is a measure of how similar the image regions mapped by TB are. If rpixel is below a given threshold, delete B from R.
6. Assign a priority to all bins (transformations) in R. The initial priority value of a given bin corresponds to its associated PLOWE probability value. In case the bin is RANSAC-approved, the priority is increased by one. Thus, RANSAC-approved bins have a larger priority than non-RANSAC-approved ones.
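A minimal sketch of stages (i)-(iii) using OpenCV's SIFT implementation and Lowe's ratio test; the hypothesis-rejection procedure of stage (iv) summarized above is not reproduced here, and the 0.8 ratio threshold is an assumption rather than a value taken from the paper:

```python
import cv2

def sift_matches(img_ref, img_test, ratio=0.8):
    """Detect SIFT keypoints and descriptors in both images and keep the
    matches that pass the nearest/second-nearest ratio test (stages i-iii)."""
    sift = cv2.SIFT_create()
    kp_ref, des_ref = sift.detectAndCompute(img_ref, None)
    kp_test, des_test = sift.detectAndCompute(img_test, None)

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(des_ref, des_test, k=2)
    good = [m for m, n in knn if m.distance < ratio * n.distance]

    # Each surviving match links a reference keypoint (x, y, size, angle) to a
    # test keypoint; these pairs feed the Hough-based stage (iv) described above.
    return kp_ref, kp_test, good
```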
3 Proposed System for Fingerprint Verification

The proposed system for fingerprint verification is based on the L&R wide baseline matching system described in the former section. However, for applying this system under real-world harsh conditions (state-of-the-art fingerprint testing protocols) two main improvements are included: (i) a fingerprint enhancement pre-processing module, and (ii) a statistical classification post-processing module. In addition, an optional module that computes minutiae-based SIFT descriptors is also included in the system, in order to study how the minutiae information affects the performance of the system. Figure 2 (d) shows a block diagram of the proposed fingerprint verification system. In the next paragraphs we analyze the fingerprint verification process using the L&R system, and we describe the new processing modules.

3.1 Fingerprint Verification Analysis

We performed several fingerprint analysis experiments using different public fingerprint databases, and we verified that the L&R wide baseline system allows matching fingerprints. Figure 2 shows an exemplary experimental result. Figure 2 (a) shows a reference-test fingerprint image pair with the corresponding correct matches. As can be seen, the wide baseline system correctly matches the two fingerprints. Figures 2 (b)-(c) show some selected matched local interest points. It can be verified that the local interest points do not correspond to minutiae or singular points. As mentioned, they correspond to local maxima in the position-scale multi-resolution representation of the fingerprint images. One of the main problems in fingerprint verification is the nonlinear distortion in fingerprint images, which disturbs the matching process [16]. This problem is tackled by limiting the acceptable distortion in the acquisition process, by estimating the
distortion during the matching process [21][1], or by compensating the distortion using fingerprint warping [16]. The wide baseline matching approach proposed here is robust against nonlinear distortions in fingerprint images. It can be proved that any differentiable non-linear transformation that is locally near-orthogonal can be approximated by a bundle of local similarity approximations using the Hough transform methodology, if the density of matches between interest points is high enough (see Appendix).
Fig. 2. (a) Fingerprint reference-test image pairs with matches. (b)-(c) Matched local interest points. (d) Block diagram of the proposed L&R system. Dashed lines represent optional modules. Modules in pink are proposed in this work. Fingenhanc: Fingerprint Enhancement; SDoGKeyGen: SDoG Keypoints Generation; MinutiaeKeyGen: Minutiae Keypoints Generation; SIFTDescGen: SIFT Descriptors Generation; TrasfComp: Transformation Computation and hypothesis rejection tests; BayesK: Bayes Classifier.
3.2 Fingerprint Enhancement

Due to their low quality, fingerprint images can optionally be enhanced before applying the wide baseline recognition system. The fingerprint is divided into a two-dimensional array of squared regions, and then the local orientation and local frequency are calculated in each region to get a pair of orientation and frequency fields over the complete image. Finally, this pair of fields is used to generate a bank of real-valued Gabor filters, which is applied to the image to enhance it. When this pre-processing module is employed the keypoints are named SDoG-enhanced (SDoG-E) keypoints, otherwise SDoG-non-enhanced (SDoG-NE). To implement this stage the open FVS library is used [24].

3.3 Generation of Minutia Keypoints for Each Image

Minutia keypoints are searched over a (Gabor-) enhanced, binarized and thinned version of the input image. The local orientation for each minutia keypoint is obtained from the computed orientation field, while the local scale for the minutia keypoint is proportional to the inverse of the frequency field. Minutia keypoints are finally characterized by a 4-dimensional vector (x,y,σ,θ), which fixes the position, scale and orientation of the minutia keypoint. As both SDoG keypoints and minutia keypoints are described as 4-dimensional vectors, they can be used alone or in an integrated fashion. To obtain the minutiae set, the open FVS library is used [24].

3.4 Bayes Classification

After applying the L&R methodology to verification problems in which the fingerprint quality changes largely (changes in finger position, orientation, pressure, skin distortion, etc.), we noted that the number of false positives was very large. We solve this problem by applying a statistical classifier (Naïve Bayes) after the original wide baseline matching system. We defined the following 12 features for the classifier (see details in Section 2):
1. TNMatches: Total number of matches between the reference and test image.
2. PTime: Processing time, as a measure of the complexity of the matching process.
3. NAffinT: Number of detected affine transformations between reference and test image.
4. NMatches: Number of associated matches in the best transformation.
5. PBT: Probability of the best transformation (PLOWE).
6. LCorr: Linear correlation of the best transformation (rREF).
7. PCorr: Pixel correlation of the best transformation (rpixel).
8. MNDesc: Maximum number of test image descriptors that are matched to the same reference image descriptor, considering the best transformation.
9. ScaleAffinMax: Absolute value of the upper eigenvalue of the affine transformation matrix of the best transformation (i.e., upper scale of the best affine transformation).
10. ScaleAffinMin: Absolute value of the lower eigenvalue of the affine transformation matrix of the best transformation (i.e., lower scale of the best affine transformation).
11. NIncMatches: Number of matches of the best transformation that are incompatible with the semi-local constraints.
12. RansacPar: RANSAC-compatibility with a precise affine transform. A small subset of 3 matches from B is selected to construct a transformation TB, which is tested against all the remaining matches in B. The matches from B that are correctly mapped using TB with a very small error are called compatible matches. If more than 2/3 of the matches from B are compatible, the incompatible matches are deleted and a final fine transformation is calculated using the compatible matches. This procedure is tried 10 times using different subsets of 3 matches from B. If in none of the 10 iterations a fine transformation is obtained, RANSAC fails and the initial transformation is preserved.
We analyzed the relevance of these features using the Weka package [22]. With Weka's BestFirst attribute selection method, which searches the space of attribute subsets by greedy hill-climbing, we selected the final attributes that we use in each of our experiments (see Table 1).
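A sketch of the post-processing classifier, with scikit-learn's GaussianNB standing in for the Weka Naïve Bayes used by the authors; the feature ordering and the selected subset shown (the one listed for SDoG-E@30x30 in Table 1) are assumptions about how the feature vectors are laid out:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

FEATURES = ["TNMatches", "PTime", "NAffinT", "NMatches", "PBT", "LCorr",
            "PCorr", "MNDesc", "ScaleAffinMax", "ScaleAffinMin",
            "NIncMatches", "RansacPar"]
SELECTED = ["PTime", "TNMatches", "NMatches", "PCorr"]   # SDoG-E@30x30 row of Table 1

def train_verifier(X_train, y_train):
    """X_train: (n_attempts, 12) matrix of the features above, one row per
    matching attempt; y_train: 1 for genuine attempts, 0 for impostor attempts."""
    cols = [FEATURES.index(f) for f in SELECTED]
    clf = GaussianNB().fit(X_train[:, cols], y_train)
    return clf, cols

def verify(clf, cols, x):
    """Final accept/reject decision for a single verification attempt x (12 features)."""
    return bool(clf.predict(x[cols].reshape(1, -1))[0])
```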
4 Preliminary Results

We present some preliminary results of the operation of the proposed system for fingerprint verification. We test different flavors of the system. Although this system validation is preliminary, we chose to use a state-of-the-art fingerprint database, the DB1 database from FVC2004. According to the FVC2004 test results, DB1 has proven to be very difficult compared to DB2, DB3 and DB4 [2], mainly because of the presence of a large number of distorted fingerprints (skin distortion was encouraged during some acquisition sessions). The main characteristics of DB1 are: acquisition using an optical scanner; 120 fingers (30 persons, 4 fingers per person); 12 impressions per finger (3 collection sessions, with 4 impressions acquired per finger in each session); and a database divided into DB1-train (10 fingers) and DB1-test (100 fingers). The collection sessions encourage variations in the obtained impressions; session 1: changes in finger position and pressure, session 2: skin distortion and variations in rotation, session 3: fingers were dried and moistened. The FVC2004 test protocol was followed (see details in Cappelli et al. [2]), and Genuine recognition attempts (GNA) and Impostor recognition attempts (IRA) sets for the DB1-train and DB1-test databases were built. The Naïve Bayes classifier of the proposed fingerprint verification system was trained using the Weka package [22] and 10-fold cross-validation. The training of the classifier was performed using DB1-train. Several tests were executed over the DB1-test database using different flavors of the proposed system. These flavors were obtained using different keypoint generators (SDoG-E: SDoG-enhanced; SDoG-NE: SDoG-non-enhanced; or minutia), and SIFT descriptors of different sizes (small: 4x4 region size; medium: 5x5 region size; or large: 30x30 region size). To give a short name for a given descriptor-generator, the notation
G@XxY will be used; G represents the keypoint generator, and XxY represents the region size of the associated SIFT descriptor. If several descriptor-generators are used simultaneously, the combined generator notation is expressed as the sum of the individual generator notations. The features selected for training the classifiers depend on the kind of keypoints and descriptors being employed (see Table 1). The results obtained from the tests (see Table 2) show that SDoG-E@30x30 (large size descriptors, enhanced image, no minutiae) is the flavor that produces the best TP vs. FP pair. As several small, local regions of fingerprints can look similar, a local symmetry problem exists in any fingerprint image. The use of large regions in the SIFT descriptor calculation helps to break the local symmetry observed in the fingerprint, and thus helps to produce more distinctive descriptors. The image enhancement process before the SDoG keypoint calculation, which is a novelty in general-purpose wide-baseline methods, helps to remove acquisition noise, giving a very repeatable fingerprint image, which produces more repeatable keypoints and descriptors and improves the verification results. The Bayesian classifier helps to discard a great amount of false detections, which are produced by the fingerprint local symmetry problem mentioned above. This can be illustrated by showing that, when using the classifier, the verification results obtained on the DB1 database are TP=79.1% and FP=5.7%, while when not using it they are TP=98.5% and FP=73.29%. Thus, the main effect of the classifier is to largely reduce the number of false detections.

Table 1. Selected Features for classifier training

Method                          Selected Features
SDoG-E@30x30                    PTime, TNMatches, NMatches, PCorr
SDoG-NE@30x30                   PTime, TNMatches, NMatches, PCorr
SDoG-E@5x5                      PTime, TNMatches, NAffinT, NMatches, PCorr, MNDesc, ScaleAffinMax, NIncMatches
SDoG-NE@5x5                     PTime, TNMatches, NMatches, PCorr
Minutia@30x30 + SDoG-E@30x30    PTime, TNMatches, NAffinT, NMatches, PCorr
Minutia@4x4 + SDoG-E@5x5        PTime, TNMatches, NAffinT, NMatches, MNDesc, ScaleAffinMax, ScaleAffinMin, NIncMatches
Minutia@4x4                     All the 12 features
Table 2. Recognition statistics over the DB1-Test database

Method                          TP% (100-FRR)   FP% (FAR)
SDoG-E@30x30                    79.1            5.7
SDoG-NE@30x30                   61.0            16.7
SDoG-E@5x5                      77.5            18.4
SDoG-NE@5x5                     60.9            31.6
Minutia@30x30 + SDoG-E@30x30    69.6            17.3
Minutia@4x4 + SDoG-E@5x5        83.4            24.5
Minutia@4x4                     57.7            10.6
The comparison of the SDoG and Minutia + SDoG methods (see Table 2) shows that the direct addition of minutia information to the interest point information before the Hough transform does not effectively help the matching process. Thus, a future alternative to test is to mix SDoG keypoint information and minutia keypoint information in a smarter way, for example, to use only SDoG keypoints to detect fingerprints using the Hough transform methodology, and then to use minutia information in a posterior verification stage. When comparing the obtained results with the ones from state-of-the-art systems participating in the FVC2004 competition (systems developed by research institutions working for years on fingerprint verification), we observe that our system could have achieved the top-30 position. In the FVC2004 report the results from all the participants are ordered by EER (equal error rate). Our SDoG-E@30x30 test does not yield an EER value, but two error values: FRR=20.9% and FAR=5.7%, which correspond to our operational point. We compared our operational point with the ROC curves of the competitors, and we found that the top-30 participant has the ROC curve nearest to our operational point. We believe that this result is very promising because our approach is the first one that solves a fingerprint verification problem using a general-purpose wide-baseline method, and it can still be improved, extended and tuned to achieve state-of-the-art results. One of the very interesting results of the application of the proposed algorithm is that it could process the whole DB1 database without any software failure. FVC2004 test developers implement a special treatment of failures during tests [2], because the compared verification systems usually can fail during the enrollment or the verification process. This situation was observed in the FVC2000, FVC2002 and FVC2004 tests, but not in the testing of our system.
5 Conclusions

A new approach to automatic fingerprint verification based on a general-purpose wide baseline matching methodology was proposed. Instead of detecting and matching the standard structural features, in the proposed approach local interest points are detected in the fingerprint, then local descriptors are computed in the neighborhood of these points, and finally these descriptors are matched. Image enhancement, several verification stages, and a simple statistical classifier are employed for reducing the number of false positives. The nature of the interest points permits integrating them with minutia points, but a useful way to integrate both information sources is still being investigated. The proposed fingerprint verification system was validated using the FVC2004 test protocol. Without using any a priori knowledge of the finger minutiae and singular point information, the system achieves, on the FVC2004-DB1 database, a FAR of 5.7% and a FRR of 20.9%. We expect to improve these results with a better integration of the minutiae-derived descriptors.
Acknowledgements. This research was funded by the Millenium Nucleus Center for Web Research, Grant P04-067-F, Chile.
References 1. Bazen, A.M., Gerez, S.H.: Systematic methods for the computation of the directional fields and singular points of fingerprints. IEEE Trans. Pattern Anal. Machine Intell 24(7), 905– 919 (2002) 2. Cappelli, R., Maio, D., Maltoni, D., Wayman, J., Jain, A.K.: Performance evaluation of fingerprint verification systems. IEEE Trans. Pattern Anal. Machine Intell 28(1), 3–18 (2006) 3. Ferrari, V., Tuytelaars, T., Van Gool, L.: Simultaneous Object Recognition and Segmentation by Image Exploration. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 40–54. Springer, Heidelberg (2004) 4. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proc. 4th Alvey Vision Conf., Manchester, UK, pp. 147–151 (1998) 5. Lee, H.C., Lee, H.C.G.: Advances in Fingerpint Tecnology. Elsevier, NY (1991) 6. Lowe, D.: Local feature view clustering for 3D object recognition. In: Lowe, D. (ed.) IEEE Conference on Computer Vision and Pattern Recognition, Hawaii, pp. 682–688 (2001) 7. Lowe, D.: Distinctive Image Features from Scale-Invariant Keypoints. Int. Journal of Computer Vision 60(2), 91–110 (2004) 8. Loncomilla, P., Ruiz-del-Solar, J.: Gaze Direction Determination of Opponents and Teammates in Robot Soccer. In: Bredenfeld, A., Jacoff, A., Noda, I., Takahashi, Y. (eds.) RoboCup 2005. LNCS (LNAI), vol. 4020, pp. 230–242. Springer, Heidelberg (2006) 9. Loncomilla, P., Ruiz-del-Solar, J.: Improving SIFT-based Object Recognition for Robot Applications. In: Roli, F., Vitulano, S. (eds.) ICIAP 2005. LNCS, vol. 3617, pp. 1084– 1092. Springer, Heidelberg (2005) 10. Loncomilla, P., Ruiz-del-Solar, J.: A Fast Probabilistic Model for Hypothesis Rejection in SIFT-Based Object Recognition. In: Martínez-Trinidad, J.F., Carrasco Ochoa, J.A., Kittler, J. (eds.) CIARP 2006. LNCS, vol. 4225, pp. 696–705. Springer, Heidelberg (2006) 11. Ruiz-del-Solar, J., Loncomilla, P., Vallejos, P.: An Automated Refereeing and Analysis Tool for the Four-Legged League. LNCS (2006) 12. Maltoni, D., Maio, D., Jain, A.K., Prabhakar, S.: Handbook of Fingerprint Recognition. Springer, Heidelberg (2003) 13. Mikolajczyk, K., Schmid, C.: Scale & Affine Invariant Interest Point Detectors. Int. Journal of Computer Vision 60(1), 63–96 (2004) 14. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. Machine Intell 27(10), 1615–1630 (2005) 15. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Van Gool, L.: A Comparison of Affine Region Detectors. Int. Journal of Computer Vision 65(1-2), 43–72 (2005) 16. Ross, A., Dass, S., Jain, A.K.: Fingerprint warping using ridge curve correspondences. IEEE Trans. Pattern Anal. Machine Intell 28(1), 19–30 (2006) 17. Schaffalitzky, F., Zisserman, A.: Automated location matching in movies. Computer Vision and Image Understanding 92(2-3), 236–264 (2003) 18. Schmid, C., Mohr, R.: Local grayvalue invariants for image retrieval. IEEE Trans. Pattern Anal. Machine Intell 19(5), 530–534 (1997) 19. Se, S., Lowe, D., Little, J.: Mobile robot localization and mapping with uncertainty using scale-invariant visual landmarks. Int. Journal of Robotics Research 21(8), 735–758 (2002) 20. Vallejos, P.: Detection and tracking of people and objects in motion using mobile cameras. Master Thesis, Universidad de Chile (2007)
Appendix: About the Robustness of Similarity Hough Transform for the Detection of Nonlinear Transformations

Consider two rectangular planar surfaces, named SREF and STEST. In both, let us define a two-dimensional rectangular coordinate system, (xR, yR) for SREF and (xT, yT) for STEST. Additionally, we define two functions: a function IREF from SREF to ℝ, which is named the reference image, and another function ITEST from STEST to ℝ, named the test image. A similarity transformation that maps the reference surface SREF into the test surface STEST has the following expression:

\[
\begin{pmatrix} x_T \\ y_T \end{pmatrix} = e \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x_R \\ y_R \end{pmatrix} + \begin{pmatrix} t_X \\ t_Y \end{pmatrix} \qquad (1)
\]
As the similarity transformation has 4 parameters, 4 constraints are needed to fix a transformation, i.e., 4 scalar equations relating the parameters. There is a collection of matches between pairs of scale-invariant interest points. Each pair consists of a point located in the reference image and another point located in the test image. Each point has a position (x, y), an orientation θ and a scale σ. Any scale-invariant interest point can be described by an arrow with two relevant points, the origin and the head of the arrow. If the interest point is described by the information (x, y, σ, θ), the origin of the associated arrow is the point (x, y), while the head of the arrow is separated from the origin by a distance σ in the direction θ. Given that any interest point can be represented as an arrow, any match between two interest points can be thought of as a match between two arrows. The match of two arrows can be considered as two point matches: the match of the origins of the arrows, and the match of the heads of the arrows. These two point matches correspond to 2 vector equations which relate the parameters of the similarity transformation and determine the transformation in a unique way:

\[
\begin{pmatrix} x_{TEST} \\ y_{TEST} \end{pmatrix} = e \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x_{REF} \\ y_{REF} \end{pmatrix} + \begin{pmatrix} t_X \\ t_Y \end{pmatrix} \qquad (2)
\]

\[
\begin{pmatrix} x_{TEST} + \sigma_{TEST}\cos\theta_{TEST} \\ y_{TEST} - \sigma_{TEST}\sin\theta_{TEST} \end{pmatrix} = e \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x_{REF} + \sigma_{REF}\cos\theta_{REF} \\ y_{REF} - \sigma_{REF}\sin\theta_{REF} \end{pmatrix} + \begin{pmatrix} t_X \\ t_Y \end{pmatrix} \qquad (3)
\]
The solution to these equations is the vector of 4 parameters (e, θ, tX, tY) of the transformation, which depends on the information of the match between the two interest points (xREF, yREF, θREF, σREF) and (xTEST, yTEST, θTEST, σTEST). In other words, each match
between two scale-invariant interest points generates a set of parameters (e, θ, tX, tY) for a similarity transformation:

\[
\theta = \theta_{TEST} - \theta_{REF}, \qquad e = \sigma_{TEST} / \sigma_{REF},
\]
\[
t_X = x_{TEST} - e\,(x_{REF}\cos\theta + y_{REF}\sin\theta), \qquad
t_Y = y_{TEST} - e\,(-x_{REF}\sin\theta + y_{REF}\cos\theta) \qquad (4)
\]
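As an illustration, the parameter recovery in (4) can be written directly in code. The following Python sketch is not from the paper; the tuple layout of a keypoint (x, y, θ, σ) and the function name are assumptions made for illustration.

```python
import math

def similarity_from_match(ref_kp, test_kp):
    """Recover (e, theta, tX, tY) of a similarity transform from one match
    between scale-invariant keypoints, following Eq. (4).
    Each keypoint is assumed to be a tuple (x, y, theta, sigma)."""
    x_r, y_r, th_r, s_r = ref_kp
    x_t, y_t, th_t, s_t = test_kp

    theta = th_t - th_r            # relative rotation
    e = s_t / s_r                  # relative scale
    c, s = math.cos(theta), math.sin(theta)
    t_x = x_t - e * ( x_r * c + y_r * s)
    t_y = y_t - e * (-x_r * s + y_r * c)
    return e, theta, t_x, t_y
```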
A Hough transform is used to count the number of votes that each set of parameters (e, θ, tX, tY) receives from the matches. The parameter space is quantized into bins using 4 indexes (i, j, k, z). Each bin has a width of 1/4 of the reference image's projected size in the translation dimensions, 30° in the orientation dimension, and a factor of 2 in the scale dimension. If LX is the width of the reference image and LY is its height, then the following expressions show the parameters' quantization:

\[
\theta = 30^\circ k, \qquad e = 2^{z},
\]
\[
t_X = -\tfrac{1}{4}\,2^{z}\cos(30^\circ k)\,L_X\,i - \tfrac{1}{4}\,2^{z}\sin(30^\circ k)\,L_Y\,j, \qquad
t_Y = \tfrac{1}{4}\,2^{z}\sin(30^\circ k)\,L_X\,i - \tfrac{1}{4}\,2^{z}\cos(30^\circ k)\,L_Y\,j \qquad (5)
\]
Then, each bin has an associated central similarity transformation, which is calculated with the central parameters of the bin:

\[
T^{\,i,j,k,z}(x, y) = 2^{z}
\begin{pmatrix} \cos(30^\circ k) & \sin(30^\circ k) \\ -\sin(30^\circ k) & \cos(30^\circ k) \end{pmatrix}
\begin{pmatrix} x - \tfrac{1}{4} L_X\, i \\ y - \tfrac{1}{4} L_Y\, j \end{pmatrix} \qquad (6)
\]
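To make the voting scheme concrete, here is a minimal Python sketch of how the parameters recovered from a match could be quantized into bin indexes (i, j, k, z) and accumulated. The rounding convention and the accumulator data structure are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import defaultdict

def vote_bin(e, theta, t_x, t_y, L_X, L_Y):
    """Quantize (e, theta, tX, tY) into the nearest bin (i, j, k, z) using the
    bin widths of Eq. (5): 30 degrees in orientation, a factor of 2 in scale,
    and 1/4 of the projected image size in translation."""
    k = round(math.degrees(theta) / 30.0)
    z = round(math.log2(e))
    # Invert the translation expressions of Eq. (5) at the bin's scale/rotation.
    c, s = math.cos(math.radians(30 * k)), math.sin(math.radians(30 * k))
    scale = 2.0 ** z
    i = round((-c * t_x + s * t_y) / (0.25 * scale * L_X))
    j = round((-s * t_x - c * t_y) / (0.25 * scale * L_Y))
    return i, j, k, z

accumulator = defaultdict(int)

def cast_vote(match_params, L_X, L_Y):
    # match_params is the (e, theta, tX, tY) tuple recovered from one match.
    accumulator[vote_bin(*match_params, L_X, L_Y)] += 1
```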
When a match between the points (xREF, yREF, θREF, σREF) and (xTEST, yTEST, θTEST, σTEST) succeeds, a vote is accumulated in the 16 nearest integer values of (i, j, k, z). In particular, each match votes at least for the nearest (i, j, k, z) in the Hough transform. If the transform is differentiable, it can be expanded to first order in the vicinity of any point:

\[
(u, v)^{T} = T(x, y) = T(x_0, y_0) + J(x_0, y_0)\,(x - x_0,\; y - y_0)^{T} + O(\Delta^{2}) \qquad (7)
\]
Let us define a transformation as locally near-orthogonal if perpendicular angles in the image are almost unmodified by the application of the transformation, i.e., it approximately preserves the local perpendicularity of lines in the vicinity of any point. A transformation which is locally near-orthogonal behaves like a similarity transformation in the vicinity of any point, i.e., it produces a (possibly different) translation, rotation and scale change in a small vicinity of each point. Let us assume that the transformation T is locally near-orthogonal, i.e., it approximates a different similarity transformation at each point. Then, the following expression holds with small epsilons over the whole space:
\[
J(x_0, y_0) = e_{(x_0, y_0)}
\begin{pmatrix}
\cos\theta_{0(x_0, y_0)} + \varepsilon_{1(x_0, y_0)} & \sin\theta_{0(x_0, y_0)} + \varepsilon_{2(x_0, y_0)} \\
-\sin\theta_{0(x_0, y_0)} + \varepsilon_{3(x_0, y_0)} & \cos\theta_{0(x_0, y_0)} + \varepsilon_{4(x_0, y_0)}
\end{pmatrix} \qquad (8)
\]
In other words, if the Jacobian J of the transformation includes only a local rotation and a local scale change, the transformation T produces approximately a local translation T(x, y), a local rotation θ0(x, y) and a local scale change e(x, y) in the vicinity of any point (x, y). This local translation, rotation and scale change are properties of the transformation T. Then, if a locally near-orthogonal transformation is applied to the reference image, any interest point (xREF, yREF, θREF, σREF) in that image transforms approximately in the following way:
\[
(x_{REF}, y_{REF}, \theta_{REF}, \sigma_{REF}) \rightarrow (x_{TEST}, y_{TEST}, \theta_{TEST}, \sigma_{TEST})
\]
\[
(x_{TEST}, y_{TEST}) = T(x_{REF}, y_{REF}), \qquad
\theta_{TEST} = \theta_{REF} + \theta_{0(x_{REF}, y_{REF})}, \qquad
\sigma_{TEST} = \sigma_{REF}\, e_{(x_{REF}, y_{REF})} \qquad (9)
\]
It can be noted that, when the transformation is locally near-orthogonal, all points (xREF, yREF, θREF, σREF) transform locally as in the similarity case. Then, the indexes (i, j, k, z) of a vote on the Hough transform originated from a point (xREF, yREF) depend only on the properties of the transformation T(x, y), i.e., the local translation, rotation and scale change define the (i, j, k, z) for which each point (xREF, yREF) votes. Let us define the function vote(xREF, yREF) as the function that returns the nearest integers (i, j, k, z) to vote for. Let us also define the set E(xREF, yREF), which depends on T(x, y) and (xREF, yREF), as:
\[
E(x_{REF}, y_{REF}) = \{ (x, y) \in S_{REF}^{O} \mid vote(x, y) = vote(x_{REF}, y_{REF}) \} \qquad (10)
\]
\[
S_{REF}^{O} = \text{domain of } T(x, y) \text{ over } S_{REF} \text{ without its border}
\]
In other words, E(xREF, yREF) is the set of all points in S_REF^O that vote for the same bin (i, j, k, z) as (xREF, yREF). If the transformation is differentiable (i.e., its Jacobian exists and is continuous in all of S_REF^O), then E(xREF, yREF) must include a connected maximal-size vicinity V(xREF, yREF) that contains (xREF, yREF). As any (xREF, yREF) in S_REF^O must have its corresponding V(xREF, yREF), it can be concluded that a partition {Vk, k = 1, …, N} of S_REF^O can be created. Any of the Vk with 4 or more matches will produce a detection in Hough space, and then a local similarity approximation will be computed. It is therefore concluded that any differentiable nonlinear transformation that is locally near-orthogonal can be approximated by a bundle of local similarity approximations using the Hough transform methodology whenever several of the existing Vk receive 4 or more votes, i.e., when the density of matches between interest points is high enough. If the transformation is locally near-orthogonal and the nonlinearities are weak enough, the surface's partition will have only one vicinity V1 = S_REF^O. Then, only one similarity detection will occur and a simple transformation (for example, an affine transformation) is enough to approximate the true transformation over the whole space S_REF.
SVM with Stochastic Parameter Selection for Bovine Leather Defect Classification

Roberto Viana¹, Ricardo B. Rodrigues¹, Marco A. Alvarez², and Hemerson Pistori¹

¹ GPEC - Dom Bosco Catholic University, Av. Tamandare, 6000, Campo Grande, Brazil
http://www.gpec.ucdb.br
² Department of Computer Science, Utah State University, Logan, UT 84322-4205, USA
{roberto,ricardo}@acad.ucdb.br, [email protected], [email protected]
Abstract. The performance of Support Vector Machines, as with many other machine learning algorithms, is very sensitive to parameter tuning, mainly in real-world problems. In this paper, two well-known and widely used SVM implementations, Weka SMO and LIBSVM, were compared using Simulated Annealing as a parameter tuner. This approach significantly increased the classification accuracy over the standard Weka SMO and LIBSVM configurations. The paper also presents an empirical evaluation of SVM against AdaBoost and MLP for solving the leather defect classification problem. The results obtained are very promising in successfully discriminating leather defects, with the highest overall accuracy, of 99.59%, being achieved by LIBSVM tuned with Simulated Annealing. Keywords: Support Vector Machines, Pattern Recognition, Parameter Tuning.
1 Introduction
The bovine productive chain plays an important role in the Brazilian economy, and Brazil is considered to have the largest cattle herd in the world [1]. However, according to [2], only 8.5% of Brazilian leather achieves high quality. Recently, the Brazilian Agricultural Research Corporation (EMBRAPA) suggested the pursuit of automation for improving the reliability of the national grading system for bovine raw hide¹. In particular, the authors of this paper believe that designing computational systems for the automatic classification of leather defects represents a relevant contribution to the government and industrial needs.
Normative instruction number 12, December 18th, 2002, Brazilian Ministry of Agriculture, Livestock and Food Supply.
Defect classification in materials such as wood, metal, fabric and leather is reported to be performed visually in [3]. In general, this task involves analyzing the product surface in order to identify flaws. Since such a task requires laborious and precise work, errors during the analysis are very common. The visual inspection of leather surfaces for defect analysis can be modeled using computer vision techniques, as reported in [3,4,5,6,7,8,9]. Nonetheless, leather is considered a complex object for analysis since it can present a large range of differences in color, thickness, wrinkling, texture and brightness [6]. In order to address the automatic classification of leather defects, this paper proposes the use of computer vision techniques and machine learning algorithms. This work is part of the DTCOURO project², which proposes the development of a completely automated system, based on computer vision, for bovine raw hide and leather classification and grading. Among the existing supervised learning algorithms, Support Vector Machines (SVM) have been widely used for classification, showing great generalization power and capacity for handling high-dimensional data [10]. However, despite their success, SVMs are still very sensitive to the definition of initial parameters. Determining the right parameter set is often computationally expensive, and over-fitting may occur when the training set does not contain a sufficient number of training examples. Furthermore, parameter selection has a preponderant effect on the effectiveness of the model. With these considerations in mind, the approach proposed and evaluated in this paper consists of the following contributions: 1. The use of Interaction Maps [11], Co-occurrence Matrices [12], and the RGB and HSB color spaces for extracting texture and color features from a given set of raw hide leather images; the proposed methods are based on the feature extraction algorithms experimented with in [9]. 2. An empirical evaluation of a set of selected supervised learning algorithms for solving the leather defect classification problem. Two different implementations of SVMs (LIBSVM³ and SMO [13]), in conjunction with a stochastic approach, namely simulated annealing, for SVM parameter tuning, were compared against a Multilayer Perceptron (MLP) and an adaptive boosting of decision trees and K-NN using the well-known AdaBoost [14] algorithm. The results obtained are very promising in successfully discriminating leather defects. The highest overall accuracy achieved by SVM is 99.59%. The remainder of this paper is organized as follows: Section 2 introduces concepts and previous work related to leather inspection using computer vision and the automatic classification of leather defects. Section 3 gives an overview of the selected machine learning algorithms. The experimental settings and results are presented in Section 4. Finally, conclusions and research directions are given in Section 5.
² http://www.gpec.ucdb.br/dtcouro
³ http://www.csie.ntu.edu.tw/~cjlin/libsvm/
2 Related Work
The discussion of related work presented here is divided into two main parts. First, related work on leather defect detection and classification is discussed, and after that, an overview of the use of stochastic methods for SVM parameter optimization is given. Bear in mind that this section is not intended to present an exhaustive literature review; instead, the most relevant topics are covered. High-quality leather is very important in numerous industrial segments. The good appearance of products made of leather depends on the absence of defects on its surface. Bovine leather, in particular, is characterized by the emergence of defects from the time the animal is still alive up to the tanning process. Defects are mostly provoked by: 1) wounds during the productive phase (e.g., cuts, fighting with other males, brand marks made with hot irons, infections, among others); 2) exposure of cattle to ectoparasites and inadequate management [15]; and 3) problems arising during the transportation and conservation phases. Defects introduced during tanning and post-processing are much less common, as they are controlled by the tanneries, which have in leather quality their main business. For a more detailed description of potential causes of common leather defects in the Brazilian leather productive chain, the reader can refer to [16]. Roughly speaking, leather defects can be observed in raw hide, the untanned hide of cattle, or in wet blue leather, which is a hide that has been tanned using chromium sulphate. Wet blue leather is an intermediate stage between untanned and finished leather. The reader can examine Figure 1 to get a sense of the appearance of raw hide and wet blue leather. In general, the detection and classification of leather defects is conducted on wet blue leather, because, even without defects, bovine raw hide has a very complex surface.
Fig. 1. (a) Image of a 'brand mark' defect on bovine raw hide, taken after skinning and before tanning. (b) Image of an 'open scar' defect on bovine wet blue leather during the first stage of the tanning process.
Yeh and Perng in [3] propose and evaluate semi-automatic methods for wet blue leather defect extraction and classification. Their results are reliable and effective when compared with human specialists. The main contribution of the work is a fully quantified grading system, called the demerit count reference standard for leather raw hides, but the authors also point out that one of the drawbacks of
their proposal is the need for specialized human intervention for counting the total number of demerits on a wet blue leather hide. A leather inspection method based on Haar wavelets is presented by Sobral in [4]. The system is reported to perform in real time, at the same level as an experienced operator [4], and to outperform previous methods based on Gabor filters, like the one described in Kumar and Pang [5]. Although not clearly stated in Sobral's paper, the system seems to have been tested only on finished leather, a much simpler problem than raw hide or wet blue leather defect extraction. A dissimilarity measure based on the χ2 criterion has been used in [6] to compare gray-level histograms from sliding windows (65x65 pixels) of a wet blue leather image to an averaged histogram of non-defective samples. The results of the χ2 test and an experimentally chosen threshold are used to segment defective regions of the leather. The approach has not been used to identify the defect type. The segmentation of defective regions from wet blue leather images, using histogram and co-occurrence based features, has been investigated in [7]. On the other hand, the approach proposed in this paper considers the use of SVMs for defect classification on raw hide leather regions. Nevertheless, SVMs achieve good performance only with appropriate parameter estimation, specifically of the C parameter, which denotes the penalty on misclassified examples, and the γ parameter of the RBF kernel, the kernel chosen for the experiments. Proposals for estimating the free parameters range from manual setting by experts with a priori knowledge of the data to the computationally expensive grid search. Empirical estimation is undesirable because it does not provide any guarantee of selecting the best parameters; on the other hand, the precision of grid search depends on the range and granularity chosen. One would be interested in an automatic parameter selection method that does not traverse the entire parameter space. Stochastic heuristics guided by the simulated annealing [17] algorithm are very suitable for selecting the best parameters for a given training set. This approach has previously been presented in [18,19]. Motivated by the outstanding performance of SVMs in solving classification tasks, the goal of the experiments conducted in the present work is to validate their use in a real problem such as defect classification in raw hide leather. Moreover, simulated annealing is applied for parameter selection in order to conduct a global optimization of the SVM parameters C and γ.
3 Supervised Learning Approach for Defect Classification
In recent years, machine learning techniques have been successfully used to solve significant real-world applications [20]. In this paper, the authors propose the use of machine learning for solving the defect classification problem in raw hide leather images, where a classifier is trained to learn the mapping function between a set of features describing a particular region of a given image and the type of the leather defect. This is the central point of the approach proposed here.
The goal of this section is to present how the defect classification problem can be modeled as a supervised learning problem. For this purpose, basic definitions are presented first, followed by an overview of the selected learning algorithms.

3.1 Basic Definitions
Regarding the supervised learning terminology, the following definitions will be considered in the context of the defect classification problem. A labeled instance is a pair (x̄, y), where x̄ is a vector in the d-dimensional space X. The vector x̄ represents the feature vector with d = 145 attributes extracted from a region within a given raw hide leather image, and y is the class label associated with x̄ for a given instance; details on the attribute extraction phase are found in Section 4.1. Therefore, a classifier is a mapping function from X to Y. The classifier is induced through a training process from an input dataset which contains a number n of labeled examples (x̄i, yi) for 1 ≤ i ≤ n. For the experiments, a set of four types of defect has been chosen, containing the following elements: tick marks, brand marks made with hot irons, cuts and scabies. These defects have been chosen because they are very common in Brazilian leather. From each region extracted from the raw hide leather images, a set of features was extracted using color and texture attributes. Applying machine learning algorithms raises the question of how to select the right learning algorithm to use. As stated in previous sections, SVMs with stochastic parameter selection are experimented with and compared, in terms of their effectiveness, with MLPs and the boosting of Decision Trees and K-NN using the AdaBoost [14] algorithm.

3.2 Support Vector Machine
Support Vector Machines (SVM), created by Vapnik [21], have become one of the most popular classification algorithms. SVMs are classifiers based on the maximum margin between classes. By maximizing the separation of classes in the feature space, the generalization capability of the classifiers is expected to improve; in the basic approach, they are conceived as linear classifiers that split the input data into two classes using a separating hyperplane. The reader can refer to Figure 2 for an illustration of the basic SVM for linearly separable data. SVMs can also work with non-linearly separable datasets, either by mapping the input feature space into higher dimensions using kernel functions or by relaxing the separability constraints. In the former it is expected that the same dataset becomes linearly separable in the higher-dimensional space, whereas in the latter some margin failures are allowed but penalized using the cost parameter C. In fact, this parameter, in conjunction with the kernel parameters, is critical to the performance of the classifier. In this paper, these parameters are estimated using Simulated Annealing [17], which is a stochastic algorithm for the global optimization problem. The goal is to locate a good approximation to the global optimum of the
generalization performance in the SVM's free parameter space. At each step the algorithm replaces the current solution with a probabilistic guess among nearby solutions, controlled by a global adaptive parameter T, the temperature.

Fig. 2. Illustration of the hyperplane separating two classes. Note the maximum margin separation for the given input examples. SVMs are expected to improve generalization by maximizing the margin.

SVMs are naturally designed for binary classification; however, available implementations like LIBSVM and SMO [13] extend SVMs to multi-class problems. Several methods have been proposed for multi-class SVMs by combining binary classifiers. A comparison of different methods for multi-class SVMs is presented in [22].
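As a rough illustration of this tuning loop, the sketch below searches over (C, γ) with a simulated-annealing-style acceptance rule. It is not the authors' module: it uses scikit-learn's SVC and cross-validation instead of Weka SMO or the LIBSVM tools used in the paper, and the cooling schedule, neighborhood step and starting point are illustrative assumptions. In the paper's setting the candidate parameters were evaluated on a subset of the data to keep the tuning time manageable.

```python
import math, random
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def accuracy(X, y, log2C, log2g):
    """5-fold cross-validated accuracy of an RBF SVM with the given parameters."""
    clf = SVC(C=2.0 ** log2C, gamma=2.0 ** log2g, kernel="rbf")
    return cross_val_score(clf, X, y, cv=5).mean()

def anneal_svm_params(X, y, steps=200, T0=1.0, cooling=0.97, seed=0):
    rng = random.Random(seed)
    cur = (0.0, -3.0)                      # starting (log2 C, log2 gamma): an assumption
    cur_acc = accuracy(X, y, *cur)
    best, best_acc, T = cur, cur_acc, T0
    for _ in range(steps):
        cand = (cur[0] + rng.gauss(0, 1), cur[1] + rng.gauss(0, 1))
        cand_acc = accuracy(X, y, *cand)
        # Always accept better solutions; accept worse ones with a
        # temperature-dependent probability.
        if cand_acc > cur_acc or rng.random() < math.exp((cand_acc - cur_acc) / T):
            cur, cur_acc = cand, cand_acc
        if cur_acc > best_acc:
            best, best_acc = cur, cur_acc
        T *= cooling                        # geometric cooling schedule
    return 2.0 ** best[0], 2.0 ** best[1], best_acc
```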
3.3 AdaBoost and MLPs
Boosting is a general way to improve the accuracy of any given learning algorithm. The basic idea behind boosting refers to a general method of producing very accurate predictions by combining moderately inaccurate (weak) classifiers. AdaBoost is an algorithm that calls a given weak learning algorithm repeatedly, where at each step the weights of incorrectly classified examples are increased in order to force the weak learner to focus on the hard examples. The reader can refer to [14] for a detailed description of AdaBoost. The main motivation for using a meta-classifier such as AdaBoost is that many previous papers have reported stellar performance of AdaBoost on several datasets [23]. In fact, Bauer and Kohavi in [23] show a more realistic view of the performance improvement one can expect. After an empirical evaluation of selected weak learners, the authors opted for the J48 decision tree (DT) algorithm (the Java implementation of C4.5 integrated in Weka⁴) and IBK (a Java implementation of K-NN integrated in Weka). The C4.5 algorithm splits data by building a decision tree based on attributes from the training set. Basically, at each iteration the algorithm selects the best attribute based on information gain and splits the data into two subsets. In addition, decision trees have the following advantages: 1) DTs are easy to understand and to convert into production rules, allowing fast evaluation of test examples; and 2) there are no a priori assumptions about the nature of the data.
http://www.cs.waikato.ac.nz/ml/weka
The k-nearest neighbor (K-NN) algorithm is one of the simplest algorithms in machine learning. An example is classified by a majority vote of its neighbors: the most common class among its k nearest neighbors is assigned to it. The neighbors are selected from a set of correctly labeled samples that are represented as vectors in a multidimensional feature space. Euclidean distance and Manhattan distance can be used as distance measures. The Multilayer Perceptron (MLP) is basically a set of processing units organized in layers, where the number of layers and the number of units in each layer vary according to the problem. The first layer is called the input layer, the last layer is the output layer, and all layers between them are called hidden layers; the output layer has one unit for each class in the training set. The units in a layer are usually connected to all units in the layers above and below it and have weight values that determine their behavior and are adjusted during the training process. After the training phase, for any data presented at the input layer the network performs calculations until an output is computed at each of the output units. It is expected that the correct class has the highest output value in the output layer.
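To make the boosting setup concrete, the fragment below builds an AdaBoost ensemble over shallow decision trees and a plain K-NN classifier for comparison. It is only a sketch under assumed settings: scikit-learn stands in for the Weka J48/IBK implementations used in the paper, and the tree depth and number of rounds are illustrative.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def build_classifiers():
    # AdaBoost over decision trees, roughly analogous to AdaBoost + J48 (10 rounds, as in the paper).
    boosted_trees = AdaBoostClassifier(DecisionTreeClassifier(max_depth=5), n_estimators=10)
    # Plain 1-NN for comparison; scikit-learn's AdaBoost cannot wrap K-NN directly
    # because boosting requires sample-weight support (the paper boosted IBK in Weka).
    knn = KNeighborsClassifier(n_neighbors=1)
    return {"AdaBoost-DT": boosted_trees, "1-NN": knn}

def evaluate(classifiers, X, y):
    """5-fold cross-validated accuracy for each classifier."""
    return {name: cross_val_score(clf, X, y, cv=5).mean()
            for name, clf in classifiers.items()}
```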
4 Empirical Evaluation
With the goal of evaluating and comparing the results obtained by the selected classifiers, this section describes the details of the experiments conducted for this paper, together with an analysis of their results.

4.1 Dataset
In order to create the dataset for experimentation, fifteen bovine images of leather in the raw hide stage were selected from the DTCOURO repository. The images were taken using a five-megapixel digital camera during technical visits to slaughterhouses and tanneries located in the region of Mato Grosso do Sul, Brazil, in September 2005. For this project, the images were scaled down from high resolution to 600x450 pixels with the intention of saving time and space. Empirical evidence has shown that there is no loss of effectiveness when using the scaled images. Furthermore, the images have low variation in environmental conditions. A set of four types of defect has been chosen, namely, tick marks, brand marks from hot irons, cuts and scabies. As the goal of this work is to distinguish between defects, non-defect samples were not considered. One sample of each of these defects can be visualized in Figure 3. The defects were manually segmented using a software module from the DTCOURO project. A total of thirty segments have been extracted from the images, including examples of the previously cited leather defects. After the manual segmentation of the defects, an algorithm implemented in the DTCOURO project was used to extract windows of 20x20 pixels by scanning all the segments. Each window is an example that belongs to one of the defect classes. A total of 14722 20x20 windows were created in this way.
Fig. 3. Samples of (a) tick, (b) brand, (c) cut and (d) scabies defects on leather in the raw hide stage
The next step is feature extraction from each 20x20 window. A set of 139 attributes per sample was extracted using Interaction Maps [11] and Grey Level Co-occurrence Matrices [12] (GLCM) for texture, and 6 attributes were extracted using the mean values of the histograms of hue, saturation, brightness, red, green and blue for color. Interaction Maps are an attribute extraction technique that consists of a directional analysis of the texture. The co-occurrence matrices can be defined over an image as the distribution of gray-level pixel values occurring at given offsets [12]. Usually, the values of GLCMs are not used directly as texture features; instead, statistics calculated from them are used, such as entropy, contrast, angular second moment, inverse difference moment, energy and homogeneity. In this project, the feature extractors are configured based on previous experiments reported in [24]; the configuration can be seen in Table 1.

Table 1. Parameters used for the feature extraction techniques used in this project. 139 texture features were extracted from each 20x20 window.

                       Int. Maps   Co. Matrices
Initial Angle:            10            10
Final Angle:             180           180
Angle variation:          10            10
Distance (pixels):         2             1
Distance variation:        1             -
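As an illustration of this kind of texture/color extraction, the sketch below computes a few GLCM statistics and HSV/RGB channel means for one 20x20 window using scikit-image and NumPy. This is not the DTCOURO code: the interaction-map features and the full 145-attribute layout of the paper are omitted, and the function name and exact statistics are assumptions.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from skimage.color import rgb2gray, rgb2hsv

def window_features(window_rgb):
    """Texture (GLCM) and color features for one 20x20 RGB window (values in [0, 1])."""
    gray = (rgb2gray(window_rgb) * 255).astype(np.uint8)

    # GLCM at distance 1 for angles 10..180 degrees in 10-degree steps (as in Table 1).
    angles = np.deg2rad(np.arange(10, 181, 10))
    glcm = graycomatrix(gray, distances=[1], angles=angles,
                        levels=256, symmetric=True, normed=True)

    feats = []
    for prop in ("contrast", "homogeneity", "energy", "ASM"):
        feats.extend(graycoprops(glcm, prop).ravel())
    # Entropy is not provided by graycoprops, so compute it from the normalized matrices.
    p = glcm.reshape(256 * 256, -1)
    logs = np.log2(p, where=p > 0, out=np.zeros_like(p))
    feats.extend((-(p * logs)).sum(axis=0))

    # Color attributes: means of the H, S, V and R, G, B channels.
    hsv = rgb2hsv(window_rgb)
    feats.extend(hsv.reshape(-1, 3).mean(axis=0))
    feats.extend(window_rgb.reshape(-1, 3).mean(axis=0))
    return np.asarray(feats, dtype=float)
```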
For each of the 14722 examples, a feature vector x̄ was calculated and stored in the dataset. At the same time, all the training examples were labeled with one of the following classes: {Tick, Brand, Cut, Scabies}. The distribution of classes is as follows: 2819 Tick, 3716 Brand, 2804 Cut and 5383 Scabies examples, where the number of examples in each class is proportional to the area of each defective region in the original images.

4.2 Experimental Settings
The experiments were conducted using the latest developer version of Weka software [25] and the LIBSVM library written by Chang and Lin [26]. Two different implementations of SVMs, LIBSVM and SMO were tested in conjunction with
MLP and the AdaBoost ensembles of J48 and IBK (K-NN). For each of the algorithms, 5-fold cross-validation was performed over the dataset in order to obtain a more reliable estimation of the generalization error [18]. A new module for classifier parameter tuning was developed within the DTCOURO project for this work. This module applies parameter recombination, which results in a significant accuracy improvement. Simulated Annealing was the algorithm chosen for parameter generation due to its good probability of terminating near the globally optimal solution. LIBSVM is a library for Support Vector Machines developed by Chih-Chung Chang [26] that implements classification, regression and distribution estimation. The implemented classifier solves the SVM quadratic programming problem by decomposing the set of Lagrange multipliers. LIBSVM also implements two techniques to reduce the computational time of the evaluation function, shrinking and caching. The technique used to solve multi-class problems is one-against-one. Sequential Minimal Optimization (SMO) is an SVM training algorithm developed by John C. Platt [13], who claims that SMO is a simple and fast technique for solving the SVM quadratic problem. The main advantage of SMO compared to other SVM training algorithms is that it always chooses the smallest QP problem to solve at each iteration. Another advantage is that SMO does not require extra matrix storage, which greatly reduces memory usage.

4.3 Evaluation of Supervised Algorithms
The experiments are basically exploratory and were conducted with the intention of evaluating the effectiveness and efficiency of the algorithms for leather defect detection. The work in this subsection can be divided into two parts. The first experiment reports the time, the best parameters found and the overall accuracy of the SVM tuning. In the second part, the results of the best tuned SVMs are compared with other well-known algorithms and analyzed using traditional measures including precision, recall, overall accuracy and area under the ROC curve. SVM tuning. Initially, the goal of the experiments was the search for the best SVM C and γ parameters. The classifier parameter tuning module was applied over a smaller subset of the dataset for time-saving purposes, where at each iteration the C and γ values were evaluated using 5-fold cross-validation. The evaluation measure is the overall accuracy. The initial values for C and γ are the default values suggested by Weka. In Table 2 the reader can see the execution time, the best values for C and γ and their respective overall accuracy (number of correctly classified examples), using the default values (Def. Acc.) and the optimized values (Opt. Acc.). As the reader can observe, the results clearly show a higher time performance achieved by the LIBSVM implementation. The final overall accuracy was improved in both cases, although one can conclude that the use of LIBSVM is recommended due to its high performance. Despite the slow computation time of SA, it is still practicable in this situation since it needs to be executed only once. One of the possible reasons for the difference in time can be credited to the
Table 2. Running time, best C and γ, default accuracy and accuracy with Simulated Annealing optimization for SVM parameter estimation using the LIBSVM and SMO implementations

           Time      Best C    Best γ   Def. Acc.   Opt. Acc.
SMO        35655s    24.165    0.931    88.95%      93.10%
LIBSVM     12786s    49.494    1.008    76.16%      99.59%
LIBSVM shrinking and caching implementation. Table 2 also shows that the Simulated Annealing optimization increased the classification accuracy by 23% over the standard LIBSVM parameter configuration and by 5% over the SMO standard. Classifiers comparison. The confusion matrix is a |Y| × |Y| bi-dimensional array where the position (i, j) denotes the number of examples of class i predicted as examples of class j. Roughly speaking, each column represents the predicted examples and each row represents the actual examples. Such a matrix can be used to compare the classifiers by combining its elements into more sophisticated measures like precision, recall and area under the ROC curve. The traditional formula for precision is:

\[
P = \frac{tp}{tp + fp} \qquad (1)
\]

where tp is the number of true positives and fp is the number of false positives. Precision is thus the ratio between the correctly predicted examples of a given class and the total number of examples predicted as that class. Recall, on the other hand, is the ratio between the number of correctly predicted examples of a given class and the total number of actual examples of that class. Recall is often called sensitivity and is traditionally defined by:

\[
TPR = \frac{tp}{tp + fn} \qquad (2)
\]
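For completeness, the per-class computation of these measures from a confusion matrix can be written in a few lines; the sketch below follows the row-actual/column-predicted convention stated above and is an illustration rather than the authors' evaluation code.

```python
import numpy as np

def precision_recall(confusion):
    """Per-class precision and recall from a |Y| x |Y| confusion matrix
    whose rows are actual classes and columns are predicted classes."""
    cm = np.asarray(confusion, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp        # predicted as class c but actually another class
    fn = cm.sum(axis=1) - tp        # actually class c but predicted as another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)         # also called sensitivity / TPR
    return precision, recall
```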
where fn is the number of false negatives. In Table 3 it is possible to observe the behavior of the algorithms with respect to precision, recall, and the area under the ROC curve. Note that all the implementations obtained relevant results. The outstanding precision and recall values, as well as the near-perfect area under the ROC curve, demonstrate the suitability of supervised learning algorithms for the defect classification problem. In addition, it can be concluded that the set of features extracted from the original images boosts the effectiveness of the classifiers. Table 4 shows the execution time for the testing and training phases as well as the respective accuracy of the five classifiers. LIBSVM and MLP have shown excellent and similar performance with respect to the classification task; nevertheless, the efficiency of the algorithms during the testing phase is of interest as well. Note that the testing phases of AdaBoost-J48 and SMO are by far
Table 3. Execution results for precision, recall and area under the ROC curve. The SVM parameters are shown in Table 2; AdaBoost used 10 iterations and a weight threshold of 100, with confidence 0.25 for J48 and k=1 for IBK; the MLP used 74 hidden units, a learning rate of 0.3 and momentum 0.2.

            ROC      Recall   Precision
SMO         0.9979   0.9965   0.9879
BoostIBK    0.9916   0.9876   0.9870
BoostJ48    0.9999   0.9959   0.9946
MLP         1.0000   0.9978   0.9978
LIBSVM      0.9991   0.9983   0.9997
the best in terms of efficiency. This is justified by the fact that the time for evaluating test examples is proportional to the number of base classifiers (decision trees) multiplied by the height of each decision tree. In the case of SVMs, the time for evaluating test cases is proportional to the final number of support vectors. AdaBoost-IBK presents the best training time, followed closely by LIBSVM with the second best time. The accuracy values only confirm that all the classifiers can discriminate the defects very accurately.

Table 4. Testing and training time for the AdaBoost variants, the SVMs (the parameter selection time of SA is not included) and the MLP

            Testing time   Training time   Accuracy (%)
SMO         0.21s          2433.62s        93.10
BoostIBK    38.99s         110.41s         95.75
BoostJ48    0.14s          699.89s         98.74
MLP         1.93s          7322.86s        99.24
LIBSVM      36.70s         158.23s         99.59

5 Conclusion and Future Work
Previous work on solving classification problems with SVMs has shown that parameter tuning is a weak point. This paper addressed this weakness and presented results for a real problem using stochastic parameter selection for SVM training, with the goal of improving the generalization performance. As expected, the use of Simulated Annealing presented an effective solution for the problem of training SVMs. When comparing the two SVM implementations, LibSVM proved to be both the more effective and the more efficient for training purposes. Moreover, LibSVM is very suitable for the iterative process of parameter selection. Note that the difference in effectiveness between LibSVM and MLP can be neglected, since both results are outstanding. One interesting observation is that AdaBoost-J48 solutions tend to be by far faster than the other classifiers for testing purposes, while their loss in accuracy is very small.
In order to achieve faster evaluation times, a natural step in future work is the reduction of the number of features using feature selection or extraction algorithms. Clearly, efficiency is crucial for real industrial needs. Another research direction is the application of similar solutions at different leather stages, which are characterized by different features. Another set of experiments with a larger dataset is a must, as the low number of images does not represent the problem properly and may wrongly indicate that the problem is easy. The DTCOURO application, which already assists with image segmentation, sampling and feature extraction for learning model generation, is currently having its visual classification module finalized. Thus, one will be able to apply the learned model over an input image and analyse the classification results visually over the image. The tuning module is being generalized to support all Weka-compatible classifiers as well. Acknowledgments. This work has received financial support from Dom Bosco Catholic University, UCDB, the Agency for Studies and Projects Financing, FINEP, and the Foundation for the Support and Development of Education, Science and Technology of the State of Mato Grosso do Sul, FUNDECT. One of the co-authors holds a Productivity Scholarship in Technological Development and Innovation from CNPq, the Brazilian National Council of Technological and Scientific Development, and some of the other co-authors have received PIBIC/CNPq scholarships.
References
1. Matthey, H., Fabiosa, J.F., Fuller, F.H.: Brazil: The future of modern agriculture. MATRIC (2004)
2. da Costa, A.B.: Estudo da competitividade de cadeias integradas no Brasil: Impactos das zonas de livre comercio. Technical report, Instituto de Economia da Universidade Estadual de Campinas (2002)
3. Yeh, C., Perng, D.B.: Establishing a demerit count reference standard for the classification and grading of leather hides. International Journal of Advanced Manufacturing 18, 731–738 (2001)
4. Sobral, J.L.: Optimised filters for texture defect detection. In: Proc. of the IEEE International Conference on Image Processing, September 2005, vol. 3, pp. 565–573. IEEE Computer Society Press, Los Alamitos (2005)
5. Kumar, A., Pang, G.: Defect detection in textured materials using Gabor filters. IEEE Transactions on Industry Applications 38(2) (2002)
6. Georgieva, L., Krastev, K., Angelov, N.: Identification of surface leather defects. In: CompSysTech 2003: Proceedings of the 4th International Conference on Computer Systems and Technologies, pp. 303–307. ACM Press, New York (2003)
7. Krastev, K., Georgieva, L., Angelov, N.: Leather features selection for defects' recognition using fuzzy logic. In: CompSysTech 2004: Proceedings of the 5th International Conference on Computer Systems and Technologies, pp. 1–6. ACM Press, New York (2004)
8. Branca, A., Tafuri, M., Attolico, G., Distante, A.: Automated system for detection and classification of leather defects. NDT and E International 30(1), 321–321 (1997)
9. Pistori, H., Paraguassu, W.A., Martins, P.S., Conti, M.P., Pereira, M.A., Jacinto, M.A.: Defect detection in raw hide and wet blue leather. In: CompImage (2006)
10. Osuna, E., Freund, R., Girosi, F.: Training support vector machines: an application to face detection. In: CVPR 1997, Puerto Rico, pp. 130–136 (1997)
11. Chetverikov, D.: Texture analysis using feature-based pairwise interaction maps. Pattern Recognition 32(3), 487–502 (1999)
12. Hseu, H.W.R., Bhalerao, A., Wilson, R.G.: Image matching based on the co-occurrence matrix. Technical Report CS-RR-358, Coventry, UK (1999)
13. Platt, J.: Sequential minimal optimization: A fast algorithm for training support vector machines (1998)
14. Freund, Y., Schapire, R.E.: A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence 14(5), 771–780 (1999)
15. Jacinto, M.A.C., Pereira, M.A.: Industria do couro: programa de qualidade e estratificacao de mercado com base em caracteristicas do couro. Simposio de producao de gado de corte, 75–92 (2004)
16. Gomes, A.: Aspectos da cadeia produtiva do couro bovino no Brasil e em Mato Grosso do Sul. In: Palestras e proposicoes: Reunioes Tecnicas sobre Couros e Peles, 25 a 27 de setembro e 29 de outubro a 1 de novembro de 2001, pp. 61–72. Embrapa Gado de Corte (2002)
17. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing. Science 220(4598), 671–680 (1983)
18. Imbault, F., Lebart, K.: A stochastic optimization approach for parameter tuning of support vector machines. In: Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), pp. 597–600. IEEE Computer Society Press, Los Alamitos (2004)
19. Boardman, M., Trappenberg, T.: A heuristic for free parameter optimization with support vector machines. In: Proceedings of the 2006 IEEE International Joint Conference on Neural Networks, pp. 1337–1344. IEEE Computer Society Press, Los Alamitos (2006)
20. Mitchell, T.M.: The discipline of machine learning. Technical Report CMU-ML-06-108 (2006)
21. Vapnik, V.N.: An overview of statistical learning theory. IEEE Transactions on Neural Networks 10(5), 988–999 (1999)
22. Hsu, C.W., Lin, C.J.: A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks 13, 415–425 (2002)
23. Bauer, E., Kohavi, R.: An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning 36(1-2), 105–139 (1999)
24. Amorim, W.P., Viana, R.R.R.P.H.: Desenvolvimento de um software de processamento e geracao de imagens para classificacao de couro bovino. SIBGRAPI - Workshop of Undergraduate Works (2006)
25. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
26. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. Software (2001), available at: http://www.csie.ntu.edu.tw/~cjlin/libsvm
Incremental Perspective Motion Model for Rigid and Non-rigid Motion Separation

Tzung-Heng Lai, Te-Hsun Wang, and Jenn-Jier James Lien

Robotics Laboratory, CSIE, NCKU, Tainan, Taiwan, R.O.C.
{henry, dsw_1216, jjlien}@csie.ncku.edu.tw
Abstract. Motion extraction is an essential task in facial expression analysis because facial expressions usually involve rigid head rotation and non-rigid facial motion simultaneously. We developed a system to separate non-rigid motion from large rigid motion over an image sequence based on the incremental perspective motion model. Since the parameters of this motion model are able not only to represent the global rigid motion but also to localize the non-rigid motion, the model overcomes the limitations that existing methods, the affine model and the 8-parameter perspective projection model, suffer at large head rotation angles. In addition, since the gradient descent approach is susceptible to local minima during the motion parameter estimation process, a multi-resolution approach is applied to optimize the initial values of the parameters at the coarse level. Finally, the experimental results show that our model has promising performance in separating non-rigid motion from rigid motion. Keywords: Separating rigid and non-rigid motion, incremental perspective motion model, multi-resolution approach.
1 Introduction

Computer vision researchers have developed various techniques for automatic facial expression recognition. Some existing recognition systems [10], [11], [13], [14], [16] process the facial images without a rigid and non-rigid motion separation step, and the tolerated rotation angles are not mentioned. The work in [2] uses AdaBoost for feature selection and then classifies the selected outputs with a support vector machine (SVM). The combination of AdaBoost and SVM enhanced both the speed and the accuracy of the recognition system. The study in [17] integrated dynamic Bayesian networks (DBNs) with facial action units for modeling the dynamic and stochastic behaviors of expressions. Out-of-plane rotations of ±30° are allowed by their tracking technique, in which active IR illumination is used to provide reliable visual information under variable lighting and head motion. The work in [1] used an SVM to classify the facial motions for recognizing six basic emotions associated with unique expressions, i.e., happiness, sadness, disgust, surprise, fear and anger. The limits on head rotation angles are ±30° in pan, ±20° in tilt and ±10° in roll. The work in [7] also used an SVM to classify five facial motions, namely neutral expression, opening or closing the mouth, smiling and raising the eyebrows, with one deformable model
containing 19 points and a stereo system. The stereo tracking serves as the basis for reconstructing the subject's 3D face model with a deformable model and for constructing the corresponding 3D trajectories for facial motion classification. In this stereo system, the head rotation can be nearly 90°. However, a limitation of the above studies is the inability to separate rigid and non-rigid motion, which prevents expression recognition from achieving accurate results in real-life image sequences. The affine model is used in [9] to solve the separation problem, but it cannot handle large out-of-plane rotations because of the perspective effects caused by depth variations. The facial expression scheme in [4] applied the 8-parameter perspective projection model and an affine model with a curvature function to parameterize large head rotations and non-rigid facial motions, respectively. Facial expression recognition was achieved by assigning different threshold values to the different motion parameters of the models. However, this approach reduces the recognition sensitivity and accuracy for slightly different facial expressions. The work in [5] estimates the camera parameters and reconstructs the corresponding 3D facial geometry of the subject; the 3D pose is recovered using the Markov Chain Monte Carlo method and each image is warped to a frontal-view canonical face image. Some of the above studies [1], [7], [11], [14], [16] recognized only the six basic expressions, plus mouth opening and closing, which occur relatively infrequently in daily life. However, facial expressions often occur with changes in more than one feature. Consequently, some other studies [2], [5], [10], [13], [17] used the Facial Action Coding System (FACS) [6], in which the defined action units (AUs) represent the smallest visibly discriminable muscle actions and can be combined to create an overall expression. Thus, to separate the rigid and non-rigid motion for facial analysis, we developed a separation system based on the incremental perspective motion model with feature sub-regions. The selected feature sub-regions can also be exploited for recognizing FACS AUs in future work.
2 Incremental Perspective Motion Model

Computing the perspective transformation between two facial images with different views is the main phase of rigid and non-rigid motion separation. During a facial expression, a large percentage of the facial area experiences global motion (the rigid head rotation), while the local facial features, i.e., the eyebrows, eyes, nose and mouth, additionally experience obvious local motion (the non-rigid facial expression). In other words, the majority of the face experiences rigid motion, while a minority experiences rigid and non-rigid motion simultaneously. Therefore, the incremental perspective model presented in [12], [8] is used to estimate one global perspective transformation for registering two images, under the assumption that the region of interest is planar. The incremental motion model estimates the transformation by considering the major variation between images, which is caused by the rigid head rotation. The minor variation, caused by rigid head rotation and non-rigid facial expression simultaneously, also influences the transformation estimation, but to a far smaller extent. The contrast between two images may differ because of changes in lighting or camera position. The image registration process needs to consider
the intensity changes of pixels, assuming that pixels at corresponding positions have the same intensities in different images. Thus, adjusting the contrast of the corresponding images beforehand is necessary, using I1(x) = αI0(x), where

\[
\alpha = \left( \frac{\sum I_{1r}}{\sum I_{0r}} + \frac{\sum I_{1g}}{\sum I_{0g}} + \frac{\sum I_{1b}}{\sum I_{0b}} \right) \div 3 \;\;\text{for color images, or}\quad
\alpha = \frac{\sum I_{1gray}}{\sum I_{0gray}} \;\;\text{for gray-level images} \qquad (1)
\]

where I0 and I1 are two consecutive images and α is the contrast adjustment between the images. To register two corresponding facial images I0(x) and I1(x′) with different views, the warped image is computed as Ĩ1(x) = I1(f(x; M)); that is, x′ can be computed using a parametric motion model M with x, i.e., x′ = f(x; M). The trick is to find the deformed image I1(x′), bring it into closer registration with I0(x) using bilinear interpolation, and then update the parametric motion model M. The loop of image warping, registration and parameter updating can then be computed iteratively. To describe the relation between the two images by one perspective motion model, the two images are treated as a planar model, and the planar perspective transformation warps one image to the other as:

\[
\mathbf{x}' = \begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} \cong M\mathbf{x} =
\begin{pmatrix} m_0 & m_1 & m_2 \\ m_3 & m_4 & m_5 \\ m_6 & m_7 & m_8 \end{pmatrix}
\begin{pmatrix} x \\ y \\ 1 \end{pmatrix}
\;\Rightarrow\;
x' = \frac{m_0 x + m_1 y + m_2}{m_6 x + m_7 y + m_8}, \quad
y' = \frac{m_3 x + m_4 y + m_5}{m_6 x + m_7 y + m_8} \qquad (2)
\]
To recover the parameters, the transformation matrix M is iteratively updated using

\[
M \leftarrow (I + D)\,M, \qquad \text{where } D =
\begin{pmatrix} d_0 & d_1 & d_2 \\ d_3 & d_4 & d_5 \\ d_6 & d_7 & d_8 \end{pmatrix} \qquad (3)
\]
In equation (3), D represents the deformation (i.e., the incremental motion) and is used to update the incremental perspective motion model M (i.e., the warping function). That is, resampling (warping) image I1 with the transformation x′ ≅ (I + D)Mx is equivalent to warping the resampled image Ĩ1 by x″ ≅ (I + D)x, where

\[
x'' = \frac{(1 + d_0)\,x + d_1\,y + d_2}{d_6\,x + d_7\,y + (1 + d_8)}, \qquad
y'' = \frac{d_3\,x + (1 + d_4)\,y + d_5}{d_6\,x + d_7\,y + (1 + d_8)} \qquad (4)
\]
To compute the incremental motion parameter vector d = (d0, …, d8), the squared error metric to be minimized is formulated as:
\[
E(\mathbf{d}) = \sum_i \left[ \tilde{I}_1(x_i'') - I_0(x_i) \right]^2
\approx \sum_i \left[ \tilde{I}_1(x_i) + \nabla \tilde{I}_1(x_i)\,\frac{\partial x_i''}{\partial \mathbf{d}}\,\mathbf{d} - I_0(x_i) \right]^2
= \sum_i \left[ g_i^T J_i^T \mathbf{d} + e_i \right]^2 \qquad (5)
\]

where ei = Ĩ1(xi) − I0(xi) is the intensity (grayvalue) error, giT = ∇Ĩ1(xi) is the image gradient of Ĩ1 at xi, and Ji = Jd(xi) is the Jacobian of the resampled point coordinate xi″ with respect to d, that is:

\[
J_d(x) = \frac{\partial x''}{\partial \mathbf{d}} =
\begin{pmatrix}
x & y & 1 & 0 & 0 & 0 & -x^2 & -xy & -x \\
0 & 0 & 0 & x & y & 1 & -xy & -y^2 & -y
\end{pmatrix}^{T} \qquad (6)
\]
Then the least-squares problem can be solved through

\[
\frac{\partial E(\mathbf{d})}{\partial \mathbf{d}} = 0
\;\Rightarrow\; A\,\mathbf{d} = -\mathbf{b}, \qquad
\text{where } A = \sum_i J_i\, g_i\, g_i^T J_i^T \;\text{ and }\; \mathbf{b} = \sum_i e_i\, J_i\, g_i \qquad (7)
\]
where A is the Hessian matrix and b is the accumulated gradient or residual. Thus, the incremental motion parameter vector d can be calculated using the pseudo-inverse as:

\[
\mathbf{d} = -(A^T A)^{-1} A^T \mathbf{b} \qquad (8)
\]
To update the current motion estimate, the computational effort required for a single gradient descent step in parameter space includes a three-step computation: (1) warping I1(x″) into Ĩ1(x), (2) computing the local intensity errors ei and gradients gi, and (3) accumulating ei and gi into A and b. Because of the computation of the monomials in Ji and their accumulation into A and b, step (3) is computationally expensive. To lower the computational cost, the image is divided into patches Pj, and the approximation is made that J(xi) = Jj is constant within each patch. Thus, equation (7) is rewritten as:

\[
A \approx \sum_j J_j\, A_j\, J_j^T \;\text{ with }\; A_j = \sum_{i \in P_j} g_i\, g_i^T,
\qquad
\mathbf{b} \approx \sum_j J_j\, \mathbf{b}_j \;\text{ with }\; \mathbf{b}_j = \sum_{i \in P_j} e_i\, g_i \qquad (9)
\]
The computation of this patch-based algorithm only needs to evaluate Jj and accumulate A and b once per patch. A further drawback is that the gradient descent approach is susceptible to local minima during the parameter estimation process. To enlarge the convergence region, a multi-resolution approach [3] is used, and the estimate of the incremental motion parameter vector d at the coarser levels is used as the initial value at the finer levels. In this work, each image is decomposed into 4 levels, from level 0 (the original, finest resolution) to level 3 (the coarsest resolution). From level 0 to level 3, patch sizes of 8, 4, 2 and 2 are used at each level, respectively.
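The Gauss-Newton style update of equations (3)-(8) can be sketched compactly in code. The fragment below is a simplified single-level illustration using NumPy and OpenCV rather than the authors' implementation: it ignores the patch-based approximation of (9) and the multi-resolution pyramid, fixes d8 = 0 to remove the projective scale ambiguity, and the iteration count and convergence threshold are arbitrary assumptions.

```python
import cv2
import numpy as np

def incremental_perspective_register(I0, I1, M=None, iters=50, eps=1e-6):
    """Estimate a 3x3 perspective warp M mapping I0's coordinates into I1
    (grayscale images), iterating the incremental update M <- (I + D) M of Eq. (3)."""
    I0 = np.asarray(I0, dtype=np.float32)
    I1 = np.asarray(I1, dtype=np.float32)
    h, w = I0.shape
    if M is None:
        M = np.eye(3)
    ys, xs = np.mgrid[0:h, 0:w]
    x, y = xs.ravel().astype(float), ys.ravel().astype(float)

    for _ in range(iters):
        # Warp I1 back into I0's frame with the current estimate (bilinear interpolation).
        I1w = cv2.warpPerspective(I1, M, (w, h),
                                  flags=cv2.WARP_INVERSE_MAP + cv2.INTER_LINEAR)
        e = (I1w - I0).ravel()                         # intensity errors e_i
        gy, gx = np.gradient(I1w)                      # image gradients g_i
        gx, gy = gx.ravel(), gy.ravel()

        # Rows of J^T g for the 8 free parameters d0..d7 (d8 fixed to 0), per Eq. (6).
        JTg = np.stack([gx * x, gx * y, gx,
                        gy * x, gy * y, gy,
                        -gx * x * x - gy * x * y,
                        -gx * x * y - gy * y * y], axis=1)
        A = JTg.T @ JTg                                # Hessian, Eq. (7)
        b = JTg.T @ e                                  # accumulated residual
        d = np.linalg.lstsq(A, -b, rcond=None)[0]      # Eq. (8)

        D = np.array([[d[0], d[1], d[2]],
                      [d[3], d[4], d[5]],
                      [d[6], d[7], 0.0]])
        M = (np.eye(3) + D) @ M                        # Eq. (3)
        if np.abs(d).max() < eps:
            break
    return M
```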
3 Rigid and Non-rigid Motion Separation System

In this system, we use the incremental perspective motion model [8] described above to register the corresponding sub-regions of two consecutive facial images having different views. The transformation between the two images can be taken as a global perspective transformation, i.e., the incremental perspective motion model M, under the assumption that the region of interest is planar. In facial analysis, this assumption means that the depth variation of the face is very small compared to the distance between the face and the camera; that is, the face is almost flat. To better satisfy this assumption, we take a video image sequence as input and consider the transformation between corresponding sub-regions. The positions of the sub-regions are defined in the first frame of the sequence by using [15] to automatically locate the facial points. The parameters of the incremental perspective transformation M can be used to model the rigid head motion. That is, the parameters of the perspective transformation M between registered sub-regions can be estimated as the parameters of the warping function. In other words, the rigid and non-rigid motions can be separated by the image warping process with the warping function M. The system flowchart is shown in Fig. 1, and the separation process is described in detail in the following subsections.

Fig. 1. System flowchart of the local incremental perspective motion model. (a) Registering the corresponding sub-regions between Ii and Ii+1, i = 1~N−1; the warping function M^k_{i+1,i} warps sub-region k from Ii+1 to Ii, k = 1~3. (b) The side-view images of the sequence are warped from Ii+1 to Ii by M^k_{i+1,i} to remove the rigid motion. Each image Ii+1 is warped frame by frame, and Vi+1,j is the j-th warped image of Ii+1.
3.1 Feature Sub-region Selection
The transformation estimation takes the major variation between two images as its basis and localizes the minor one. However, to a small extent, the minor variation caused by rigid and non-rigid motions still influences the estimation process. In addition, the depth variation between features, such as between the eyes and the nose or the nose and the mouth, may violate the assumption that the face is almost flat. Consequently, sub-regions are used, each containing a specific local feature; in our work there are three sub-regions, one for the eyebrows and eyes, one for the nose, and one for the mouth, as shown in Fig. 1. The local facial feature occupies only a small part of its sub-region, so the majority of the sub-region is not influenced by the facial expression. Each sub-region is independent, so the transformation estimations do not influence one another, and each sub-region comes closer to satisfying the planarity assumption. As Fig. 2 shows, the sub-region selection commences by using the feature point location estimation presented in [15] to automatically locate the feature points in the first frame of the sequence. All input image sequences are restricted to subjects commencing with a frontal view and a neutral facial expression; that is, the first frame of the sequence is assumed to contain neither rigid nor non-rigid motion. The two horizontal lines, L1 and L2, and the four vertical lines, L3 to L6, are determined by points P1 to P6; the distances D1 (between P7 and P9) and D2 (between P8 and P9) are also defined. Then, the positions of the three sub-regions with respect to the image coordinates can be determined; these three fixed positions are also used to locate the sub-regions in the remaining frames of the sequence.
Fig. 2. The automatically located facial feature points and selected sub-regions
3.2 Incremental Registration and Separation Process for Image Sequence
To reduce the influence of depth variation between images, the image sequence is taken as the input. After determining the three positions of the sub-regions of the input sequence in the first frame, the N images can be divided into 3N sub-regions. Then, the warping functions M^j_{i+1,i} of the corresponding sub-regions (there are 3(N−1) such functions
for a sequence with N images) are calculated frame by frame using the incremental motion model, as shown in Fig. 1(a). Fig. 1(b) then shows the separation process, which is also carried out frame by frame. The three sub-regions of image I_{i+1} are warped to those of image I_i by the corresponding M^k_{i+1,i}, k = 1~3, and the warped image V_{i+1,1}, constructed from the three sub-regions, represents image I_{i+1} with the motion between I_{i+1} and I_i removed. If outliers occur, i.e. the computed corresponding locations fall outside the corresponding sub-regions when warping from I_{i+1} to I_i, we use the pixel location of I_{i+1} to take the intensity of the corresponding pixel location of I_i when synthesizing V_{i+1,1}. The separation process then continues: the warped image V_{i+1,1} is warped to I_{i−1} by M^k_{i,i−1}, synthesizing another warped image V_{i+1,2}. The process repeats until the warped image of I_{i+1} has been warped to I_1; the final warped image V_{i+1,i+1} shows the result of removing from image I_{i+1} the motions between I_{i+1} and I_1.
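A minimal sketch of this frame-by-frame separation chain is given below (Python; the helper apply_warp, which warps the three sub-regions of an image with a given set of warping functions and handles the outlier rule above, is assumed rather than taken from the paper):

```python
def separate_rigid_motion(frames, warps, apply_warp):
    """Frame-by-frame separation chain, as a sketch.

    frames     : list of images I_1, ..., I_N
    warps      : warps[i-1] holds the three warping functions M^k_{i+1,i}
                 mapping the sub-regions of I_{i+1} onto I_i (k = 1~3)
    apply_warp : hypothetical helper that warps the three sub-regions of an
                 image with a given set of warping functions
    Returns the images V_{i+1,i}, i.e. each I_{i+1} with the rigid motion
    relative to I_1 removed.
    """
    separated = []
    for i in range(1, len(frames)):          # frames[i] is I_{i+1}
        v = frames[i]
        for j in range(i, 0, -1):            # warp back through I_j, ..., I_1
            v = apply_warp(v, warps[j - 1])  # apply M^k_{j+1,j}
        separated.append(v)
    return separated
```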
4 Experimental Results
As shown in Table 1, the testing database contains 50 Asian subjects (35 males and 15 females), none of whom wore eyeglasses. Each subject performs 3 upper facial expressions, i.e. AU4, AU1+4 and AU1+2, and 6 lower facial expressions, i.e. AU12, AU12+25, AU20+25, AU9+17, AU17+23+24 and AU15+17, with pan rotation. They demonstrated the facial expressions without previous training, and the sequences were videotaped under constant illumination. Fig. 3 shows the input image sequence and the separation (warping) results of the affine model, the 8-parameter model, the global incremental perspective motion model and our local incremental perspective motion model. To compare the separation results under the same criterion, each image is warped to the template constructed from the three sub-regions selected in our approach. These separation results are blurred because bilinear interpolation is used during the warping process. The subtraction images with respect to input frame 0 represent the differences between the warping result of the current frame and the image at frame 0.

Table 1. Testing database. The image size and the average facial size are 320×240 pixels and 110×125 pixels, respectively.
Subjects: 50
Upper/lower facial expressions: 3/6
Pan: 0°~+30° and 0°~−30°

Table 2. The average difference rate between the warped image at the current frame and input frame 0.
Affine model: 63.6%
8-parameter model: 73.2%
Global incremental motion model: 39.4%
Our approach: 26.2%
Fig. 3. Separation results of the four models and the subtraction images with respect to the first input frame: (a) input image sequence (frames 0, 10, 20, 28); (b) affine model; (c) 8-parameter model; (d) global incremental motion model; (e) our approach. For (b)-(e), the separation/warping results and the corresponding subtraction images are shown for frames 10, 20 and 28.
From Fig. 3, we can see that the affine model cannot handle an image sequence with out-of-plane motion, because the affine model does not account for depth variation. For the 8-parameter motion model, the parameter estimation process is susceptible to local minima, so its warping results are distorted. In addition, the global incremental perspective motion model is estimated from the entire facial image, so large local motions, i.e. facial expressions, may affect the global transformation estimation, as shown in Fig. 3(d). Our results are better than those of the other three models because, with sub-regions, local motion within one sub-region does not affect the transformation estimations of the others.
Fig. 4. The nine expressions of the testing database; shown here are the upper facial expressions AU4, AU1+4 and AU1+2 and the lower facial expressions AU12+25 and AU12. Each expression has four images, from left to right: frontal view, side view, the separation result of the global incremental motion model, and our separation result. The "+" marks the position of a feature point at the last frame, and the trajectory shows the variation of the feature point from the first frame to the last.
Fig. 4. (continued) Lower facial expressions AU20+25, AU17+23+24, AU15+17 and AU9+17.
As shown in Fig. 3(d) and Fig. 3(e), the nose region in the global incremental motion model's result is not registered as well as in our approach's result. Table 2 shows the average difference rate of the four motion models. The difference rate is the percentage ratio between the number of pixels
whose subtraction grey values exceed the threshold and the total number of pixels in the warped image. Fig. 4 demonstrates the results of our work. The frontal-view image is the last image of a sequence in which the subject demonstrated the expression without head rotation. The side-view image is the last image of a sequence in which the expression was demonstrated with head rotation. The separation result of the global incremental motion model and our separation result are the separations of the side-view image. For comparison, each image also shows the facial feature points' trajectories, which describe the variations of the feature points from the first frame to the last. Compared with the trajectories in the global incremental motion model's results, our separation results are more similar to the frontal view.
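For reference, the difference rate of Table 2 can be computed as sketched below (Python/NumPy; the grey-value threshold is not reported in the paper, so the value used here is an assumption):

```python
import numpy as np

def difference_rate(warped, reference, threshold=30):
    """Percentage of pixels whose absolute grey-value difference between the
    warped image and the reference (frame 0) exceeds the threshold."""
    diff = np.abs(warped.astype(np.int32) - reference.astype(np.int32))
    return 100.0 * np.count_nonzero(diff > threshold) / diff.size
```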
5 Conclusions
The key feature of our approach is estimating the motion of sub-regions with an incremental motion model and a multi-resolution approach. The incremental motion model overcomes the limitations of the affine and 8-parameter perspective projection models at large head rotation angles. By taking sub-regions into account, the influence of depth variation between features is reduced, and the transformation estimation of each sub-region is not affected by the other sub-regions. The multi-resolution approach prevents the parameter estimation from getting trapped in local minima. The running times for warping-function calculation and image warping are 6.2 seconds and 14.5 seconds, respectively, for an input sequence of 29 frames (Intel Pentium 4 CPU, 3.2 GHz). Deblurring of the warped images and other head rotations, such as tilt and roll, will be considered in future work.
References
1. Anderson, K., McOwan, P.W.: A Real Time Automated System for the Recognition of Human Facial Expression. IEEE Tran. on SMC 36(1), 96–105 (2006)
2. Bartlett, M.S., Littlewort, G., Frank, M., Lainscsek, C., Fasel, I., Movellan, J.: Fully Automatic Facial Action Recognition in Spontaneous Behavior. In: International Conf. on FG, pp. 223–230 (2006)
3. Bergen, J.R., Anandan, P., Hanna, K.J., Hingorani, R.: Hierarchical model-based motion estimation. In: Proc. of ECCV 1992, pp. 237–252 (May 1992)
4. Black, M., Yacoob, Y.: Recognizing Facial Expressions in Image Sequences Using Local Parameterized Models of Image Motion. IJCV, 23–48 (1997)
5. Braathen, B., Bartlett, M.S., Littlewort, G., Smith, E., Movellan, J.R.: An Approach to Automatic Recognition of Spontaneous Facial Actions. In: International Conf. on FG, pp. 345–350 (2002)
6. Ekman, P., Friesen, W.V.: Facial Action Coding System. Consulting Psychologist Press Inc., San Francisco, CA (1978)
7. Gokturk, S.B., Bouguet, J.Y., Tomasi, C., Girod, B.: Model-Based Face Tracking for View-Independent Facial Expression Recognition. In: International Conf. on FG, pp. 272–278 (2002)
8. Hua, W.: Building Facial Expression Analysis System. CMU Tech. Report (1998)
9. Lien, J.J., Kanade, T., Cohn, J.F., Li, C.C.: Subtly Different Facial Expression Recognition and Expression Intensity Estimation. In: CVPR, pp. 853–859 (1998)
10. Lucey, S., Matthews, I., Hu, C., Ambadar, Z., de la Torre, F., Cohn, J.: AAM Derived Face Representations for Robust Facial Action Recognition. In: International Conf. on FG, pp. 155–160 (2006)
11. Rosenblum, M., Yacoob, Y., Davis, L.S.: Human Emotion Recognition from Motion Using a Radial Basis Function Network Architecture. Uni. of Maryland, CS-TR-3304 (1994)
12. Szeliski, R., Shum, H.: Creating Full View Panoramic Image Mosaics and Environment Maps. In: Proc. of Siggraph 1997 (August 1997)
13. Tian, Y., Kanade, T., Cohn, J.F.: Evaluation of Gabor-Wavelet-Based Facial Action Unit Recognition in Image Sequences of Increasing Complexity. In: International Conference on FG, pp. 218–223 (2002)
14. De la Torre, F., Yacoob, Y., Davis, L.S.: A Probabilistic Framework for Rigid and Non-rigid Appearance Based Tracking and Recognition. In: International Conf. on FG, pp. 491–498 (2000)
15. Twu, J.T., Lien, J.J.: Estimation of Facial Control-Point Locations. In: IPPR Conf. on Computer Vision, Graphics and Image Processing (2004)
16. Yacoob, Y., Davis, L.S.: Recognizing Human Facial Expressions from Long Image Sequence Using Optical Flow. IEEE Tran. on PAMI 18(6), 636–642 (1996)
17. Zhang, Y., Ji, Q.: Active and Dynamic Information Fusion for Facial Expression Understanding from Image Sequences. IEEE Tran. on PAMI 27, 699–714 (2005)
Vision-Based Guitarist Fingering Tracking Using a Bayesian Classifier and Particle Filters Chutisant Kerdvibulvech and Hideo Saito Keio University, 3-14-1 Hiyoshi, Kohoku-ku 223-8522, Japan {chutisant, saito}@ozawa.ics.keio.ac.jp
Abstract. This paper presents a vision-based method for tracking the guitar fingerings played by guitar players, using stereo cameras. We propose a novel framework for colored finger-marker tracking that integrates a Bayesian classifier into particle filters, with the advantages of performing automatic track initialization and recovering from tracking failures in a dynamic background. ARTag (Augmented Reality Tag) is utilized to calculate the projection matrix as an online process, which allows the guitar to be moved while playing. By using online adaptation of color probabilities, the method is also able to cope with illumination changes. Keywords: Guitarist Fingering Tracking, Augmented Reality Tag, Bayesian Classifier, Particle Filters.
1 Introduction
Due to the popularity of acoustic guitars, research about guitars is one of the most popular topics in the field of computer vision for musical applications. Maki-Patola et al. [1] proposed a system called VAG (Virtual Air Guitar) using computer vision. Their aim was to create a virtual air guitar which does not require a real guitar (e.g., by using only a pair of colored gloves), but can produce music as if the player were playing a real guitar. Liarokapis [2] proposed an augmented reality system for guitar learners. The aim of this work is to show the augmentation (e.g., the positions where the learner should place their fingers to play a chord) on an electric guitar to guide the player. Motokawa and Saito [3] built a system called Online Guitar Tracking that supports a guitarist using augmented reality, by showing a virtual model of the fingers on a stringed guitar as an aid to learning how to play the guitar. These systems do not aim to track the fingering that a player is using (a pair of gloves is tracked in [1], and graphics information is overlaid on captured video in [2][3]). Our goal differs from most of this research. In this paper, we propose a new method for tracking guitar fingerings using computer vision. Our research goal is to accurately determine and track the fingering positions of a guitarist relative to the guitar position in 3D space. A challenge in tracking the fingers of a guitar player is that the guitar neck often moves while the guitar is being played. It is then necessary to identify the guitar's position relative to the camera's position. Another important issue is recovery
when finger tracking fails. Our method for tracking the fingers of a guitar player handles both of these problems. At every frame, we first estimate the projection matrix of each camera by utilizing ARTag (Augmented Reality Tag) [4]. ARTag's marker is placed on the guitar neck; therefore the world coordinate system is defined on the guitar neck as the guitar coordinate system, so the system allows the player to move the guitar while playing. We utilize a particle filter [5] to track the finger markers in 3D space. We propagate sample particles in 3D space and project them onto the 2D image planes of both cameras to obtain the probability of each particle being on a finger marker, based on color, in both images. To determine the probability of a color being a finger-marker color, during preprocessing we apply a Bayesian classifier that is bootstrapped with a small set of training data and refined through an offline iterative training procedure [6][7]. Online adaptation of the marker-color probabilities is then used to refine the classifier using additional training images. Hence, the classifier is able to deal with illumination changes, even when there is a dynamic background. In this way, the 3D positions of the finger markers can be obtained, so that we can recognize whether the player's fingers are pressing the strings or not. As a result, our system can determine the complete positions of all fingers on the guitar fret. It can be used to develop instructive software to aid chord tracking or people learning the guitar. One possible application [8] is to identify whether the finger positions are correct and in accord with the finger positions required for the piece of music being played. Therefore, guitar players can automatically identify whether their fingers are in the correct position.
2 Related Works
In this section, related approaches to finger detection and tracking for guitarists are described. Cakmakci and Berard [9] detected the finger position by placing a small ARToolKit (Augmented Reality Toolkit) [10] marker on a fingertip of the player to track the forefinger position (only one fingertip). However, when we attempted to apply markers to all four fingertips, the markers were not all captured perpendicularly to the cameras' viewing direction at some angles (especially while the player was pressing the strings). Therefore, it is quite difficult to accurately track the positions of four fingers concurrently using ARToolKit finger markers. Burns and Wanderley [11] detected the positions of fingertips for retrieval of guitarist fingering without markers. They assumed that the fingertip shape can be approximated by a semicircle while the rest of the hand is roughly straight, and used the circular Hough transform to detect fingertips. However, using the Hough transform to detect the fingertips while playing the guitar is not accurate and robust enough, because a fingertip does not appear circular at some angles. Also, the lack of contrast between fingertips and background skin adds complication, which is often the case in real-life performance. In addition, these two methods [9][11] used only one camera and 2D image processing. The constraint of using one camera is that it is very difficult to classify whether the fingers are pressing the strings or not. Therefore, stereo cameras are needed
(3D image processing). At the same time, these methods are sometimes difficult to use with stereo cameras because all fingertips may not be captured perpendicularly by the two cameras simultaneously. We propose a method to overcome this problem by utilizing four colored markers placed on the four fingertips to determine their positions. However, a well-known problem of color detection is the control of lighting. Changing levels of light and limited contrast prevent correct registration, especially in the case of a cluttered background. A survey [12] provides an interesting overview of color detection. A major decision in deriving a color model is the selection of the color space to be employed. Once a suitable color space has been selected, one of the commonly used approaches for defining what constitutes the target color is to employ bounds on the coordinates of the selected space. However, with simple thresholds it is sometimes difficult to classify the color accurately under changing illumination. Therefore, we use a Bayesian classifier that learns color probabilities from a small training image set and then learns the color probabilities from online input images adaptively (proposed recently in [6][7]). The first attractive property of this method is that it avoids the burden involved in manually generating a large amount of training data. From a small amount of training data, it adapts the probability according to the current illumination and converges to a proper value. For this reason, the main attractive property of this method is its ability to cope with changing illumination, because it can adaptively describe the distribution of the marker colors.
3 System Configuration
The system configuration is shown in Figure 1. We use two USB cameras and a display connected to a PC for the guitar players. The two cameras capture the position of the left hand (assuming the guitarist is right-handed) and the guitar neck to obtain 3D information. We attach a 4.5 cm × 8 cm ARTag fiducial marker onto the top right corner of the guitar neck to compute the position of the guitar (i.e., the poses of the cameras relative to the guitar position). Colored markers (each with a different color) are attached to the fingers of the left hand.
Fig. 1. System configuration
4 Method
Figure 2 shows the schematic of the implementation. After capturing the images, we calculate the projection matrix in each frame by utilizing ARTag. We then utilize a Bayesian classifier to determine the color probabilities of the finger markers. Finally, we apply the particle filters to track the 3D positions of the finger markers.
Fig. 2. Method overview
4.1 Calculation of Projection Matrix
Detecting the positions of the fingers in the captured images is the main point of our research, and the positions in the images can give 3D positions based on the stereo configuration of this system. Thus, it is necessary to calculate the projection matrix (because it will then be used for projecting 3D particles onto the image planes of both cameras in the particle filtering step of Section 4.3). However, because the guitar neck is not fixed to the ground while the cameras are fixed, the projection matrix changes at every frame. Thus, we define the world coordinate system on the guitar neck as the guitar coordinate system. In the camera calibration process [13], the projection matrix is generally employed to describe the relation between 3D space and the images. The important camera properties, namely the intrinsic parameters that must be measured, include the center point of the camera image, the lens distortion and the camera focal length. We first estimate the intrinsic parameters during an offline step. During the online process, the extrinsic parameters are then estimated at every frame by utilizing ARTag functions. Therefore we can compute the projection matrix P by using
P = A[R, t] =
    [ α_u   −α_u cot θ    u_0 ]   [ R_11  R_21  R_31  t_x ]
    [ 0     α_v / sin θ   v_0 ] × [ R_12  R_22  R_32  t_y ]        (1)
    [ 0     0             1   ]   [ R_13  R_23  R_33  t_z ]
where A is the intrinsic matrix, [R, t] is the extrinsic matrix, u_0 and v_0 are the center point of the camera image, θ is the lens distortion, and α_u and α_v represent the
focal lengths.

4.2 Finger Markers Color Learning
This section explains the method used for calculating the probability of a color being a finger-marker color, which is then used in the particle filtering step (Section 4.3). The learning process is composed of two phases. In the first phase, the color probability is learned from a small number of training images during an offline preprocess. In the second phase, we gradually update the probability from additional training images automatically and adaptively. The adaptation process can be disabled as soon as the achieved training is deemed sufficient. Therefore, this method allows us to obtain accurate finger-marker color probabilities from only a small set of manually prepared training images, because the additional marker regions do not need to be segmented manually. Also, due to the adaptive learning, it can be used robustly under changing illumination during online operation.

4.2.1 Learning from Training Data Set
During an offline phase, a small set of training input images (20 images) is selected, on which a human operator manually segments marker-colored regions. The color representation used in this process is YUV 4:2:2 [14]. However, the Y component of this representation is not employed, for two reasons. First, the Y component corresponds to the illumination of an image pixel; by omitting this component, the developed classifier becomes less sensitive to illumination changes. Second, compared to a 3D color representation (YUV), a 2D color representation (UV) is lower in dimension and therefore less demanding in terms of memory storage and processing cost. Assuming that an image pixel with coordinates (x, y) has color value c = c(x, y), the training data are used to calculate:
(i) the prior probability P(m) of having marker color m in an image, i.e. the ratio of the marker-colored pixels in the training set to the total number of pixels in all training images;
(ii) the prior probability P(c) of the occurrence of each color in an image, computed as the ratio of the number of occurrences of each color c to the total number of image points in the training set;
(iii) the conditional probability P(c|m) of a marker having color c, defined as the ratio of the number of occurrences of color c within the marker-colored areas to the number of marker-colored image points in the training set.
By employing Bayes' rule, the probability P(m|c) of a color c being a marker color can be computed using
P(m|c) = P(c|m) P(m) / P(c)        (2)
This equation determines the probability of a certain image pixel being marker-colored using a lookup table indexed with the pixel's color. Two thresholds, T_max and T_min, are then applied to the resulting probability map: all pixels with probability P(m|c) > T_max are considered marker-colored and constitute seeds of potential marker-colored blobs, while image pixels with probability P(m|c) > T_min, where T_min < T_max, that are neighbors of marker-colored image pixels are recursively added to each color blob. The rationale behind this region-growing operation is that an image pixel with a relatively low probability of being marker-colored should still be accepted when it neighbors an image pixel with a high probability of being marker-colored. Indicative values for the thresholds T_max and T_min are 0.5 and 0.15, respectively. A standard connected-component labeling algorithm (i.e., depth-first search) is then responsible for assigning different labels to the image pixels of different blobs. Size filtering on the derived connected components is also performed to eliminate small isolated blobs that are attributed to noise and do not correspond to interesting marker-colored regions. Each of the remaining connected components corresponds to a marker-colored blob.

4.2.2 Adaptive Learning
The success of marker-color detection depends crucially on whether the illumination conditions during the online operation of the detector are similar to those during the acquisition of the training data set. Despite the fact that the UV color representation has certain illumination-independent characteristics, the marker-color detector may produce poor results if the illumination conditions during online operation are considerably different from those in the training set. Thus, a means of adapting the representation of marker-colored image pixels according to the recent history of detected colored pixels is required. To solve this problem, marker-color detection maintains two sets of prior probabilities. The first set consists of P(m), P(c) and P(c|m), which have been computed offline from the training set, while the second is made up of P_W(m), P_W(c) and P_W(c|m), corresponding to the evidence that the system gathers during the W most recent frames. In other words, P_W(m), P_W(c) and P_W(c|m) refer to P(m), P(c) and P(c|m) during the W most recent frames, respectively. Obviously, the second set better reflects the "recent" appearance of marker-colored objects and is therefore better adapted to the current illumination conditions. Marker-color detection is then performed based on the following weighted moving-average formula:
P_A(m|c) = γ P(m|c) + (1 − γ) P_W(m|c)        (3)
where γ is a sensitivity parameter that controls the influence of the training set in the detection process, P_A(m|c) represents the adapted probability of a color c being a marker color, and P(m|c) and P_W(m|c) are both given by Equation (2) but involve prior probabilities computed from the whole training set [for P(m|c)] and from the detection results in the last W frames [for P_W(m|c)]. In our
implementation, we set γ = 0.8 and W = 5.
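A compact sketch of this color model is given below (Python/NumPy). The class name, the histogram binning of the UV plane, and the absence of a strict W-frame window are our simplifications; only the blending of Equation (3) and the Bayes-rule ratio behind P_W(m|c) follow the text:

```python
import numpy as np

class AdaptiveMarkerColorModel:
    """Sketch of Eqs. (2)-(3): an offline lookup table P(m|c) blended with a
    recent-evidence estimate P_W(m|c). This is an illustration, not the
    authors' implementation."""

    def __init__(self, offline_pmc, gamma=0.8, bins=64):
        self.offline_pmc = offline_pmc              # P(m|c) table over (U, V) bins
        self.gamma = gamma                          # sensitivity parameter of Eq. (3)
        self.marker_hist = np.zeros((bins, bins))   # counts of color c inside markers
        self.total_hist = np.ones((bins, bins))     # counts of color c overall

    def update(self, uv_bins, marker_mask):
        """Accumulate recent evidence; uv_bins holds pre-quantized (U, V) indices."""
        u, v = uv_bins[..., 0], uv_bins[..., 1]
        np.add.at(self.total_hist, (u, v), 1)
        np.add.at(self.marker_hist, (u[marker_mask], v[marker_mask]), 1)

    def prob(self, u, v):
        """Adapted probability P_A(m|c) of Eq. (3)."""
        p_w = self.marker_hist[u, v] / self.total_hist[u, v]   # P_W(m|c) from counts
        return self.gamma * self.offline_pmc[u, v] + (1.0 - self.gamma) * p_w
```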
Thus, the finger-marker color probabilities can be determined adaptively. By using online adaptation of the finger-marker color probabilities, the classifier is able to cope with considerable illumination changes and also with a dynamic background (e.g., a moving guitar neck).

4.3 3D Finger Markers Tracking
Particle filtering [5] is a useful tool for tracking objects in clutter, with the advantages of performing automatic track initialization and recovering from tracking failures. In this paper, we apply particle filters to compute and track the 3D positions of the finger markers in the guitar coordinate system (the 3D information is used to help determine whether the fingers are pressing a guitar string or not). The finger markers can then be tracked with automatic initialization, and the tracking can recover from failures. We use the color probability of each pixel, obtained as described in Section 4.2, as the observation model. The particle filter initially distributes particles uniformly over the search volume in 3D space and then projects the particles from 3D space onto the 2D image planes of the two cameras to obtain the probability of each particle being on a finger marker. As new information arrives, these particles are continuously re-allocated to update the position estimate. Furthermore, when the overall probability of the particles being on finger markers falls below a threshold we set, new sample particles are again distributed uniformly over the 3D search volume; the particles then converge to the areas of the finger markers. For this reason, the system is able to recover the tracking. The calculation is based on the following analysis. Given that the process at each time step is an iteration of factored sampling, the output of an iteration is a weighted, time-stamped sample set, denoted by {s_t^(n), n = 1, ..., N} with weights π_t^(n), representing approximately the probability-density function p(X_t) at time t, where N is the size of the sample set, s_t^(n) is the position of the n-th particle at time t, X_t represents the 3D position of the finger marker at time t, and p(X_t) is the probability that a finger marker is at 3D position X = (x, y, z)^T at time t. The number of particles used is 900. The iterative process can be divided into three main stages: (i) the selection stage; (ii) the prediction stage; (iii) the measurement stage. In the first stage (the selection stage), a sample s'_t^(n) is chosen from the sample set
{s_{t−1}^(n), π_{t−1}^(n), c_{t−1}^(n)} with probabilities π_{t−1}^(j), where c_{t−1}^(n) is the cumulative weight. This is done by generating a uniformly distributed random number r ∈ [0, 1]. We find the smallest j for which c_{t−1}^(j) ≥ r using binary search, and then s'_t^(n) is set as s'_t^(n) = s_{t−1}^(j). Each element chosen from the new set is now subjected to the second stage (the prediction step). We propagate each sample from the set s'_{t−1} by a propagation function g(s'_t^(n)), using
s_t^(n) = g(s'_t^(n)) + noise        (4)
where the noise is drawn from a Gaussian distribution with mean (0, 0, 0)^T. The accuracy of the particle filter depends on this propagation function. We have tried different propagation functions (e.g., a constant-velocity motion model and an acceleration motion model), but our experimental results have revealed that using only the noise information gives the best result. A possible reason is that the motions of the finger markers are usually quite fast and constantly change direction while playing the guitar, so the velocities or accelerations calculated in the previous frame do not give an accurate prediction of the next frame. In this way, we use only the noise information by defining g(x) = x in Equation (4). In the last stage (the measurement stage), we project the sample particles from 3D space onto the two 2D image planes of the cameras using the projection matrices from Equation (1). We then determine the probability of each particle being on a finger marker. In this way, we generate weights from the probability-density function p(X_t) to obtain the sample-set representation {(s_t^(n), π_t^(n))} of the state density for time t using
π_t^(n) = p(X_t = s_t^(n)) = P_A(m|c)_Camera0 · P_A(m|c)_Camera1        (5)
where p(X_t = s_t^(n)) is the probability that a finger marker is at position s_t^(n). We assign the weights to be the product of the P_A(m|c) values of the two cameras, which can be obtained by Equation (3) from the finger-marker color learning step (the adapted probabilities P_A(m|c)_Camera0 and P_A(m|c)_Camera1 represent a color c being a marker color in camera 0 and camera 1, respectively). Following this, we normalize the total weights using the condition
Σ_n π_t^(n) = 1        (6)
Next, we update the cumulative probability, which can be calculated from normalized weights using
c_t^(0) = 0,   c_t^(n) = c_t^(n−1) + π_t^(n) / π_t^Total   (n = 1, ..., N)        (7)

where π_t^Total is the total weight.
Once the N samples have been constructed, we estimate the moments of the tracked position at time step t using

ε[f(X_t)] = Σ_{n=1}^{N} π_t^(n) s_t^(n)        (8)

where ε[f(X_t)] represents the centroid of each finger marker. The four finger markers can then be tracked in 3D space, enabling us to perform automatic track initialization and track recovery even in a dynamic background. The positions of the four finger markers in the guitar coordinate system can thus be obtained.
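The three stages can be summarized in the following sketch (Python/NumPy). The helpers project and prob_marker, the noise standard deviation, and the re-initialization threshold are assumptions for illustration; the selection, prediction and measurement steps follow Equations (4)-(8):

```python
import numpy as np

def particle_filter_step(particles, weights, project, prob_marker,
                         noise_sigma=5.0, reinit_bounds=None, reinit_thresh=1.0):
    """One selection/prediction/measurement iteration for one finger marker.

    particles   : (N, 3) particle positions in the guitar coordinate system
    weights     : (N,)   normalized weights from the previous time step
    project     : callable(points_3d, cam_index) -> (N, 2) pixel coordinates
    prob_marker : callable(pixels_2d, cam_index) -> (N,) adapted P_A(m|c) values
    """
    n = particles.shape[0]
    # Selection: resample via the cumulative weights (binary search over the CDF)
    cdf = np.cumsum(weights)
    idx = np.minimum(np.searchsorted(cdf, np.random.uniform(size=n)), n - 1)
    resampled = particles[idx]
    # Prediction (Eq. 4): g(x) = x plus zero-mean Gaussian noise
    predicted = resampled + np.random.normal(0.0, noise_sigma, (n, 3))
    # Measurement (Eq. 5): product of color probabilities in both camera images
    w = prob_marker(project(predicted, 0), 0) * prob_marker(project(predicted, 1), 1)
    total = w.sum()
    if reinit_bounds is not None and total < reinit_thresh:
        # Tracking failure: redistribute particles uniformly over the 3D volume
        lo, hi = reinit_bounds
        predicted = np.random.uniform(lo, hi, (n, 3))
        w = np.ones(n)
        total = float(n)
    w /= max(total, 1e-12)                              # Eq. 6
    estimate = (w[:, None] * predicted).sum(axis=0)     # Eq. 8: weighted centroid
    return predicted, w, estimate
```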
5 Results
In this section, representative results from our experiment are shown. Figure 3 provides a few representative snapshots of the experiment. The reported experiment is based on an acquired sequence. Two USB cameras with a resolution of 320×240 were used. The camera 0 and camera 1 windows depict the input images captured from the two cameras. These cameras capture the player's left-hand fingering and the guitar neck from two different views. For visualization purposes, the 2D tracked result of each finger marker is also shown in the camera 0 and camera 1 windows. The four colored numbers depict the four 2D tracking results of the finger markers (forefinger [number 0, light blue], middle finger [number 1, yellow], ring finger [number 2, violet] and little finger [number 3, pink]). The 3D reconstruction window, which is drawn using OpenGL, shows the tracked 3D positions of the four finger markers in the guitar coordinate system. In this 3D space, we show a virtual guitar board to make it clear that this is the guitar coordinate system. The four colored 3D small cubes show the 3D tracked result of each finger marker (these four 3D cubes correspond to the four 2D colored numbers in the camera 0 and camera 1 windows). In the initial stage (frame 10), when the experiment starts, there is no guitar and there are no fingers in the scene. The tracker attempts to find colors similar to the marker colors. For example, because the color of the player's shirt (light yellow) is similar to the middle-finger marker's color (yellow), the 2D tracking result of the middle-finger marker (number 1) in the camera 0 window wrongly detects the player's shirt as the middle-finger marker. Later, during the playing stage (frame 50), the left hand of the player and the guitar enter the cameras' fields of view. The player is playing the guitar, and the system can closely determine the accurate 3D fingering positions, which correspond to the 2D colored numbers in the camera 0 and camera 1 windows. This implies that the system performs automatic track initialization, thanks to the particle filtering. Next, the player changes to the next fingering positions in frame 80. The system continues to correctly track and recognize the 3D fingering positions, which correspond closely to the positions of the 2D colored numbers in the camera 0 and camera 1 windows. Following this, the player moves the guitar (from its old position in frame 80) to a new position in frame 110, while still holding the same fingering positions on the guitar fret. It can be observed that the detected 3D positions of the four finger markers for the different guitar positions (i.e., the same input fingering on the guitar fret) are almost the same. This is because the ARTag marker is used to track the guitar position. Later, in the occlusion stage (frame 150), the finger markers are totally occluded by a sheet of white paper. Therefore, the system goes back to searching for colors similar to each marker (returning to the initial stage).
However, in the recovering stage that follows (frame 180), the occluding white paper is removed, and the cameras capture the fingers and guitar neck again. It can be seen that the tracker returns to tracking the correct fingerings (returning to the playing stage). In other words, the system is able to recover from tracking failure thanks to the particle filtering.
Fig. 3. Representative snapshots from the online tracking experiment: initial stage (frame 10, no guitar and no fingers in the scene); playing stage (frames 50 and 80)
Fig. 3. (continued) Playing stage (frame 110); occlusion stage (frame 150, tracking fails and the system returns to the initial stage); recovering stage (frame 180, playing stage again)
The reader is also encouraged to observe the illumination difference between the camera 0 and camera 1 windows. Our experimental room has two main light sources, located on opposite sides. We turned on the first light source, located near the side used for capturing images with camera 0, while we turned off the second light source (opposite to the first), located near the side used for capturing images with camera 1.
Fig. 4. Speed used for recovering from tracking failures
Hence, the lighting in each camera view is different. However, it can be observed that the 2D tracked results of the finger markers are still determined correctly in both the camera 0 and camera 1 windows in each representative frame, without being affected by the different light sources. This is because a Bayesian classifier and online adaptation of the color probabilities are utilized to deal with this. We also evaluate the recovery speed whenever tracking of the finger markers fails. Figure 4 shows the speeds of recovery from lost tracks. In this graph, the recovery time is counted from the initial frame at which the certainty of tracking falls below a threshold. At that frame, the particles are distributed uniformly over the 3D space, as described in Section 4.3. Before normalizing the weights in the particle filtering step, we determine the certainty of tracking from the sum of the weight probabilities of the distributed particles being on a marker. Therefore, if the sum of the weight probabilities is lower than the threshold, we assume that the tracker is failing; on the other hand, if the sum is higher than the threshold, we conclude that tracking has been recovered, and the last counted frame is set to this frame (the particles have already converged to the areas of the finger markers). The mean recovery speed and the standard deviation, in frames, are also shown in the table in Figure 4 (the speed of fingering tracking is approximately 6 fps). We believe this recovery speed is fast enough for recovering tracking in real-life guitar performance.
Fig. 5. Accuracy of 3D finger detection results
Then, we evaluated the accuracy of our system using 100 sample data sets for testing. Figure 5 shows the accuracy of our experimental results when detecting fingering positions. All errors are measured in millimetres. With respect to the manually measured ground-truth positions, the mean distance error and the standard deviation of the error in each axis are shown in the table in Figure 5. Finally, we note a limitation of the proposed system. Although the background can be cluttered, it should not contain large objects of the same color as the finger markers. For instance, if the player wears clothes whose color is very similar to the markers' colors, the system sometimes cannot determine the output correctly.
6 Conclusions
In this paper, we have developed a system that accurately measures and tracks the positions of the fingertips of a guitar player in the guitar's coordinate system. A framework for colored finger-marker tracking has been proposed based on a Bayesian classifier and particle filters in 3D space. ARTag has also been utilized to calculate the projection matrix. Although we believe that we can successfully produce the system output, the current system has a limitation concerning the background color and the markers' colors. Because the four finger markers have four different colors, it is sometimes inconvenient for users to select their background. As future work, we intend to make technical improvements that remove the finger markers, which may result in even greater user friendliness.
Acknowledgments. This work is supported in part by a Grant-in-Aid for the Global Center of Excellence for High-Level Global Cooperation for Leading-Edge Platform on Access Spaces from the Ministry of Education, Culture, Sport, Science, and Technology in Japan.
References
1. Maki-Patola, T., Laitinen, J., Kanerva, A., Takala, T.: Experiments with Virtual Reality Instruments. In: Fifth International Conference on New Interfaces for Musical Expression, Vancouver, Canada, pp. 11–16 (2005)
2. Liarokapis, F.: Augmented Reality Scenarios for Guitar Learning. In: Third International Conference on Eurographics UK Theory and Practice of Computer Graphics, Canterbury, UK, pp. 163–170 (2005)
3. Motokawa, Y., Saito, H.: Support System for Guitar Playing using Augmented Reality Display. In: Fifth IEEE and ACM International Symposium on Mixed and Augmented Reality, ISMAR 2006, pp. 243–244. IEEE Computer Society Press, Los Alamitos (2006)
4. Fiala, M.: Artag, a Fiducial Marker System Using Digital Techniques. In: IEEE International Conference on Computer Vision and Pattern Recognition, CVPR 2005, vol. 1, pp. 590–596. IEEE Computer Society Press, Los Alamitos (2005)
5. Isard, M., Blake, A.: Condensation – conditional density propagation for visual tracking. International Journal on Computer Vision, IJCV 1998 29(1), 5–28 (1998)
6. Argyros, A.A., Lourakis, M.I.A.: Tracking Skin-colored Objects in Real-time. Invited Contribution to the Cutting Edge Robotics Book, ISBN 3-86611-038-3, Advanced Robotic Systems International (2005)
7. Argyros, A.A., Lourakis, M.I.A.: Tracking Multiple Colored Blobs with a Moving Camera. In: IEEE International Conference on Computer Vision and Pattern Recognition, CVPR 2005, San Diego, CA, vol. 2(2), p. 1178 (2005)
8. Kerdvibulvech, C., Saito, H.: Real-Time Guitar Chord Estimation by Stereo Cameras for Supporting Guitarists. In: Tenth International Workshop on Advanced Image Technology, IWAIT 2007, Bangkok, Thailand, pp. 256–261 (2007)
9. Cakmakci, O., Berard, F.: An Augmented Reality Based Learning Assistant for Electric Bass Guitar. In: Tenth International Conference on Human-Computer Interaction, HCI 2003, Rome, Italy (2003)
10. Kato, H., Billinghurst, M.: Marker Tracking and HMD Calibration for a Video-based Augmented Reality Conferencing System. In: Second IEEE and ACM International Workshop on Augmented Reality, pp. 85–94. IEEE Computer Society Press, Los Alamitos (1999)
11. Burns, A.M., Wanderley, M.M.: Visual Methods for the Retrieval of Guitarist Fingering. In: Sixth International Conference on New Interfaces for Musical Expression, Paris, France, pp. 196–199 (2006)
12. Yang, M.H., Kriegman, D.J., Ahuja, N.: Detecting Faces in Images: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 34–58 (2002)
13. Forsyth, D.A., Ponce, J.: Computer Vision: A Modern Approach. Prentice Hall, Upper Saddle River, NJ (2003)
14. Jack, K.: Video Demystified. Elsevier Science, UK (2004)
Accuracy Estimation of Detection of Casting Defects in X-Ray Images Using Some Statistical Techniques Romeu Ricardo da Silva and Domingo Mery Departamento de Ciencia de la Computación, Pontificia Universidad Católica de Chile, Vicuña Mackenna 4860 (143)
[email protected],
[email protected] www.romeu.eng.br http://dmery.puc.cl
Abstract. Casting is one of the most important processes in the manufacture of parts for various kinds of industries, among which the automotive industry stands out. As in every manufacturing process, there is the possibility of defects occurring in the materials from which the parts are made, as well as of faults appearing during their operation. One of the most important tools for verifying the integrity of cast parts is radioscopy. This paper presents pattern recognition methodologies applied to radioscopic images of cast automotive parts for the detection of defects. Image processing techniques were applied to extract features to be used as input to pattern classifiers developed with artificial neural networks. To estimate the accuracy of the classifiers, random selection techniques with sample replacement (the bootstrap technique) and without sample replacement were used. This work can be considered innovative in that field of research, and the results obtained motivate this paper. Keywords: Casting Defects, Radioscopy, Image Processing, Accuracy Estimation, Bootstrap.
1 Introduction
Shrinkage as molten metal cools during the manufacture of die castings can cause defect regions within the workpiece. These are manifested, for example, by bubble-shaped voids, cracks, slag formation, or inclusions. Light-alloy castings for the automotive industry, such as wheel rims, steering knuckles, and steering gear boxes, are considered important components for overall roadworthiness. To ensure the safety of construction, it is necessary to check every part thoroughly. Radioscopy rapidly became the accepted way of controlling the quality of die castings through computer-aided analysis of X-ray images [1]. The purpose of this nondestructive testing method is to identify casting defects, which may be located within the piece and thus are undetectable to the naked eye. Two classes of regions are possible in a digital X-ray image of an aluminium casting: regions belonging to regular structures (RS) of the specimen, and those relating to defects (D). In an X-ray image we can see that the defects, such as voids, cracks and bubbles (or inclusions and slag), show up as bright (or dark) features. The reason is that X-ray attenuation in these areas is lower (or higher). Since contrast in
the X-ray image between a flaw and a defect-free neighbourhood of the specimen is distinctive, the detection is usually performed by analysing this feature (see details in [2] and [3]). In order to detect the defects automatically, a pattern recognition methodology consisting of five steps was developed [1]:
a) Image formation, in which an X-ray image of the casting being tested is taken and stored in the computer.
b) Image pre-processing, where the quality of the X-ray image is improved in order to enhance its details.
c) Image segmentation, in which each potential flaw of the X-ray image is found and isolated from the rest of the scene.
d) Feature extraction, where the potential flaws are measured and some significant features are quantified.
e) Classification, where the extracted features of each potential flaw are analysed and assigned to one of the classes (regular structure or defect).
Although several approaches have been published in this field (see for example a review in [1]), the performance of the classification is usually measured without statistical validation. This paper attempts to estimate the true accuracy of a classifier using the bootstrap technique [4] and random selection without replacement, applied to the automated detection of casting defects. The true accuracy of a classifier is usually defined as the degree of correctness of the classification of data not used in its development. The great advantage of this technique is that the estimation is made by sampling the observed detection distribution, with or without replacement, to generate sets of observations that may be used to correct for bias. The technique provides nonparametric estimates of the bias and variance of a classifier, and as a method of error rate estimation it is better than many other techniques [5]. The rest of the paper is organised as follows: Section 2 outlines the methodology used in the investigation. Section 3 shows the results obtained recently on real data. Finally, Section 4 gives concluding remarks.
2 Methodologies
2.1 Processing of the Casting Images
The X-ray image, taken with an image intensifier and a CCD camera (or a flat panel detector), must be pre-processed to improve the quality of the image. In our approach, the pre-processing techniques are used to remove noise, enhance contrast, correct the shading effect, and restore blur deformation [1]. The segmentation of potential flaws identifies regions in radioscopic images that may correspond to real defects. Two general features of the defects are used to identify them: a) a flaw can be considered as a connected subset of the image, and b) the grey level difference between a flaw and its neighbourhood is significant. According to these features, a simple automated segmentation approach was suggested in [6] (see Fig. 1). First, a Laplacian of Gaussian (LoG) kernel and a zero crossing algorithm [7] are used to detect the edges of the X-ray images. The LoG operator involves a Gaussian lowpass filter, which is a good choice for pre-smoothing our noisy images that are obtained without frame averaging. The resulting binary edge image should produce closed and connected contours at real flaws, which demarcate regions. However, a flaw may not be perfectly enclosed if it is located at an edge of a regular structure as shown in Fig. 1c. In
order to complete the remaining edges of these flaws, a thickening of the edges of the regular structure is performed as follows: a) the gradient of the original image is calculated (see Fig. 1d); b) by thresholding the gradient image at a high grey level a new binary image is obtained; and c) the resulting image is added to the zero crossing image (see Fig. 1e). Afterwards, each closed region is segmented as a potential flaw. For details see a description of the method in [6]. All regions enclosed by edges in the binary image are considered 'hypothetical defects' (see example in Fig. (1e)). During the feature extraction process the properties of each of the segmented regions are measured. The idea is to use the measured features to decide whether the hypothetical defect corresponds to a flaw or a regular structure.
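A rough sketch of this segmentation step is shown below (Python with SciPy's ndimage). The threshold values and the border-region filtering are assumptions added for the sketch; only the LoG/zero-crossing edges, the high-gradient thickening, and the size filtering follow the description above:

```python
import numpy as np
from scipy import ndimage

def segment_potential_flaws(image, sigma=1.25, grad_thresh=40.0, min_area=10):
    """Sketch of the segmentation of [6]: LoG edges plus zero crossings, thickened
    with high-gradient pixels, then closed regions labeled as hypothetical defects."""
    img = image.astype(float)
    log = ndimage.gaussian_laplace(img, sigma=sigma)
    # Zero crossings: sign changes between horizontally/vertically adjacent pixels
    zc = np.zeros(log.shape, dtype=bool)
    zc[:-1, :] |= (log[:-1, :] * log[1:, :]) < 0
    zc[:, :-1] |= (log[:, :-1] * log[:, 1:]) < 0
    # High-gradient pixels close gaps where a flaw touches a regular-structure edge
    grad = ndimage.gaussian_gradient_magnitude(img, sigma=1.0)
    edges = zc | (grad > grad_thresh)
    # Regions enclosed by edges are the hypothetical defects
    regions, n = ndimage.label(~edges)
    labels = np.arange(1, n + 1)
    sizes = ndimage.sum(np.ones_like(regions), regions, index=labels)
    # Discard tiny blobs (noise) and components touching the image border (background)
    border = np.unique(np.concatenate([regions[0], regions[-1],
                                       regions[:, 0], regions[:, -1]]))
    keep = labels[(sizes >= min_area) & ~np.isin(labels, border)]
    return np.isin(regions, keep), regions
```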
Fig. 1. Detection of flaws: a) radioscopic image with a small flaw at an edge of a regular structure, b) Laplacian-filtered image with σ = 1.25 pixels (kernel size = 11 × 11), c) zero crossing image, d) gradient image, e) edge detection after adding high gradient pixels, and f) detected flaw using feature F1 extracted from a crossing line profile [2]
Fig. 2. Example of a region. (a) X-Ray image, (b) segmented region, (c) 3D representation of the intensity (grey value) of the region and its surroundings [8].
Table 1. Descriptions of the features extracted
f1 and f2: Height (f1) and width (f2), the height (h) and width (w) of the region [9].
f3: Area (A), the number of pixels that belong to the region [9].
f4: Mean grey value (G), the mean of the grey values that belong to the region [9].
f5: Mean second derivative (D), the mean of the second-derivative values of the pixels that belong to the boundary of the region [9].
f6: Crossing Line Profile (F1). Crossing line profiles are the grey level profiles along straight lines crossing each segmented potential flaw in the middle. The profile that contains the most similar grey levels in the extremes is defined as the best crossing line profile (BCLP). Feature F1 corresponds to the first harmonic of the fast Fourier transformation of the BCLP [2].
f7: Contrast Kσ, the standard deviation of the vertical and horizontal profiles without offset [9].
f8: High contrast pixels ratio (r), the ratio of the number of high-contrast pixels to the area [3].
The features extracted in this investigation are described in Table 1, and they provide information about the segmented regions and their surroundings. The total number of features extracted is 8, divided into 3 geometric features and 5 intensity features. In our work we present results obtained on 72 radioscopic images of aluminium die castings. The size of the images is 572 × 768 pixels. About 25% of the defects in the images were existing blow holes (with ∅ = 2.0 – 7.5 mm). They were initially detected by visual (human) inspection. The remaining 75% were produced by drilling small holes (with ∅ = 2.0 – 4.0 mm) in positions of the casting known to be difficult to detect. In these experiments, 424 potential defects were segmented, 214 of which correspond to real defects, while the others (210) are regular structures.

2.2 Development of the Nonlinear Classifiers
The non-linear classifiers were implemented using a two-layer neural network trained by error backpropagation. The first step taken in the development of a non-linear classifier was to optimize the number of neurons used in the intermediate layer in order to obtain the best possible accuracy for the test sets. Some tests were carried out on the training parameters of the network, and the best result (fastest convergence) was found when the momentum (β = 0.9) and a variable training rate α were used [10, 11]. The initialization of the synapses and biases used the Widrow [12] method. All these training variations resulted in convergence to the same range of error.
2.3 Accuracy Estimation
There are various techniques for estimating the true accuracy of a classifier, which is usually defined as the degree of correctness of the classification of data not used in its development. The three most commonly used are: simple random selection of data, cross-validation, which has several implementations [13], and the bootstrap technique [4, 14]. It is not really possible to confirm whether one method is better than another for any specific pattern classification system. The choice of one of these techniques will depend on the quantity of data available and the specific classification to be made. As described in [4], two properties are important when evaluating the efficiency of an estimator θ̂: its bias and its variance, which are defined by the equations below:
Bias = E[θ̂] − θ        (1)

Var(θ̂) = E[(θ̂ − E[θ̂])²]        (2)

where E[θ̂] is the expected value of the estimator θ̂ and Var(θ̂) is the variance of the estimator.
An estimator is said to be reliable if it has low values of bias and variance. However, in practice an appropriate trade-off between both is desirable when looking for a more realistic objective [4, 14]. When dealing with the accuracy of a classifier, the bias and variance of the estimated accuracy vary as a function of the amount of data and the accuracy estimation technique used. In this work, to calculate the classification accuracy for casting defects we first carried out the bootstrap technique as follows. A bootstrap data set (of size n), following Efron's definition [4], is made up of data x*_1, x*_2, ..., x*_n obtained randomly and with replacement from an original data set x_1, x_2, ..., x_n (also of size n). In this way it is possible for a given datum to appear 1, 2, 3 or n times, or not at all [4]. With this technique, the classifier implemented using the i-th training set is tested with the data that were not used in the makeup of this set, resulting in an accuracy estimate θ̂_i (for the test data). This is repeated b times. The frequently used bootstrap model θ̂_B for estimating the accuracy of pattern classifiers is defined by

θ̂_B = (1/b) Σ_{i=1}^{b} (ω̂ θ̂_i + (1 − ω̂) θ̂_c)        (3)
where θ̂_c is the apparent accuracy (calculated with the training set data only) and the weight ω̂ varies between 0.632 and 1, and is normally taken equal to 0.632 [4, 14]. As a second way of estimating the accuracy of the developed classifiers, random selection without data repositioning (i.e., without replacement) was used for the formation of the training and test sets, unlike the bootstrap technique [15]. In addition, ROC curves were drawn to verify the reliability of the results achieved with this technique [11].
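A minimal sketch of the bootstrap accuracy estimation of Eq. (3) with the usual 0.632 weighting is given below. The name make_classifier is a hypothetical factory returning an untrained classifier (for example, the network of Section 2.2); X and y hold the 424 segmented regions and their labels:

import numpy as np

def bootstrap_632_accuracy(make_classifier, X, y, b=10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    estimates = []
    for _ in range(b):
        idx = rng.integers(0, n, size=n)          # draw n samples with replacement
        oob = np.setdiff1d(np.arange(n), idx)     # out-of-bag samples form the test set
        clf = make_classifier().fit(X[idx], y[idx])
        theta_i = clf.score(X[oob], y[oob])       # test-set estimator (theta_i)
        theta_c = clf.score(X[idx], y[idx])       # apparent (training) accuracy (theta_c)
        estimates.append(0.632 * theta_i + 0.368 * theta_c)
    return float(np.mean(estimates))              # theta_B of Eq. (3)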
3 Results

3.1 Features Selection
An optimized way of representing the domains of the pattern classes of a multivariate system in a two-dimensional space is to obtain the two main discrimination components. The main linear discrimination direction is known as Fisher's discriminant [11]; it maximizes the inter-class covariance matrix and minimizes the intra-class covariance matrix [11, 16]. In this case, the first linear discrimination direction of classes RS and D can be obtained by training a supervised neural network of the backpropagation type with only one neuron [10]. A second main linear discrimination direction can then be obtained, also with a single-neuron neural network, by training the network on the residual information of the projection of the original data onto the first discrimination direction, i.e., the independent (orthogonal) components. A detailed description of this technique is found in [17].
Fig. 3. Graphs made with the two principal linear discrimination components
In this way the two main linear discrimination components of classes RS and D were obtained with a single-neuron neural network trained with the error backpropagation algorithm using batch training (3000 epochs), parameter β = 0.9 and variable α. Figure 3 shows the graph obtained with these two main linear discrimination directions. It is evident that the separation of classes RS and D is more efficient in this representation space: a visual analysis shows few false positive (RS samples in the domain of D) and false negative (D samples in the domain of RS) errors. The projection of the data on the x axis (p1) represents the best discrimination of these classes, and the projection on y (p2) the second best. From this graph it can be concluded that the separation between RS and D can reach good success rates with well-developed pattern classifiers.

3.2 Study of Neuron Number in the Intermediate Layer
The graph of Figure 3 illustrates the problem of classifying RS and D using only the two principal linear discrimination components. However, it is well known that linear pattern classifiers only solve very easy class-separation problems [11]. To optimize the separation between the pattern classes RS and D, non-linear pattern classifiers were developed using supervised two-layer neural networks with error backpropagation training [10]. Since non-linear classifiers can suffer from overtraining, whose probability increases with the number of neurons in the second layer, thereby losing the capacity to generalize [10], a study was made of the optimum number of neurons in the intermediate layer, i.e., the number that gives the best result on the test sets, in order to reduce the chance of overfitting the classifier parameters. For that purpose, from the initial data set with the eight features, a training set was chosen with 75% of the data, selected randomly and without repositioning, and a test set with the remaining 25%, keeping the proportion between the classes. The training set thus contained 158 samples of RS and 160 of D, and the test set 52 of RS and 54 of D. The number of neurons in the intermediate layer was varied one at a time up to 20 neurons, and the training and test success rates were recorded. Note that, since there are only two pattern classes, the last layer of the classifier needs only one neuron. The results of this study are shown in Table 2. The smallest difference between the training and test results, which in theory indicates a good generalization capacity of the classifier, occurs for two neurons in the intermediate layer. However, considering the expected increase in classifier performance with the number of neurons, the second smallest difference occurs for 11 neurons, reaching 94.34% success on the test set. For that reason, 11 neurons were used in the intermediate layer of the neural network for all the classifiers developed in this work for the estimation of the classification accuracy.
Table 2. Optimization of the number of neurons in the intermediate layer
Number of Neurons   Training Performance (%)   Test Performance (%)
1                   90.57                      86.80
2                   90.25                      89.63
3                   94.66                      89.63
4                   97.50                      91.51
5                   96.90                      91.51
6                   97.80                      89.63
7                   96.90                      91.51
8                   99.06                      89.63
9                   98.75                      93.40
10                  98.43                      92.46
11                  98.43                      94.34
12                  99.06                      92.46
13                  99.38                      92.46
14                  99.38                      93.40
15                  99.38                      92.46
16                  99.06                      92.46
17                  99.38                      90.57
18                  99.38                      89.63
19                  99.38                      92.46
20                  99.70                      94.34
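The sweep behind Table 2 can be reproduced along the following lines. This is a sketch under stated assumptions: X_train, X_test, y_train, y_test are assumed to come from the 75/25 split described above, and the exact percentages will differ from the published ones.

from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

results = []
for n_hidden in range(1, 21):                       # 1 to 20 intermediate neurons
    clf = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(n_hidden,), solver="sgd",
                      momentum=0.9, learning_rate="adaptive",
                      learning_rate_init=0.01, max_iter=3000, random_state=0))
    clf.fit(X_train, y_train)
    results.append((n_hidden,
                    100 * clf.score(X_train, y_train),
                    100 * clf.score(X_test, y_test)))

for n_hidden, tr, te in results:
    print(f"{n_hidden:2d}  train {tr:6.2f}%  test {te:6.2f}%")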
3.3 Accuracy Estimation by the Bootstrap Technique
To estimate the accuracy of the non-linear classifiers, the first technique used was random selection with data repositioning. As an illustration of this technique, one can imagine a "bag" with all the original data: data are drawn randomly from this bag to form the training sets, but every item drawn returns to the "bag" and can be chosen again several times. The data not chosen for training are used to form the test sets. In this way, 10 pairs of training and test sets were formed. Note that, with this selection technique, the training sets always have the same number of data as the original set (in this work, 424 data); the test sets therefore had between 150 (≈ 35%) and 164 (≈ 38%) data. To reduce the possibility of overtraining the classifier parameters (synapses and biases), a validation set was formed with 10% of the samples selected randomly from the bootstrap training sets. This technique is well known as cross validation [10], and training was stopped when the validation error increased or remained stable for 100 epochs, or after a maximum of 3000 epochs, the network parameters being taken at the point of least validation error. The results are presented in Table 3. The training success rates were quite high, with a mean of 98.46%, but the test success rates were significantly lower, with a mean of 55.61%. Calculating the accuracy estimator according to the
weighting factor of 0.632 for the test set estimator and 0.368 for the training estimator [4, 13], the estimated accuracy is 71.40%, which can be considered unsatisfactory for this problem of fault detection in automobile rims. The great difficulty in pattern classification, common to almost all work in this area, is the lack of data to estimate the true classification accuracy with precision, so that one can trust that the success rates will remain similar when the classifier is tested with a new data set. The main objective of using the bootstrap technique was to try to reproduce several sets for training and testing the classifiers, as well as for estimating the accuracy expected for classes RS and D. One justification for the low success rates obtained with this technique is that the test sets contain a large number of data relative to the number used for training. Normally, in pattern classification, the test or validation sets contain between 20 and 30% of the data; with the bootstrap technique, in this paper, some test sets contain almost 40% of the data, and this can indeed affect the correct training of the network parameters, even when a cross-validation technique is used to interrupt the training. This is all the more plausible given that the original data set did not contain a large number of samples. To expect a success rate of only about 55%, or even 71.40%, for this classification problem is too pessimistic, considering the efficiency of the image processing techniques used and the relevance of the extracted features.

Table 3. Result of classification with the bootstrap input sets (%)

Input Sets   Training (%)   Test (%)
1            418/98.60      75/50.00
2            422/98.60      88/53.66
3            410/96.70      92/56.10
4            421/99.30      94/57.31
5            405/95.52      94/57.32
6            421/99.30      88/53.66
7            416/98.12      86/52.45
8            424/100        100/61.00
9            420/99.05      86/52.45
10           422/99.53      102/62.20
Mean         98.47          55.61

Bootstrap accuracy estimation: θ̂_B = (1/b) Σ_{i=1}^{b} [ 0.632 θ̂_i + 0.368 θ̂_c ] = 71.40
3.4 Accuracy Estimation by Random Selection Without Repositioning
In the simple method of evaluation with random sampling, the original data set (with n data) is partitioned randomly into two sets: a training set containing p × n data,
and a test set containing (1 − p) × n data (the values of p are chosen case by case). This process is repeated a number of times, and the mean value is the accuracy estimator [13]. This technique was used for the first selection and formation of sets, with the purpose of choosing the number of neurons of the classifier's intermediate layer. Using this simple yet very efficient technique, 10 pairs of data sets for training and testing of the classifier were chosen, and the proportion adopted (based on experience from other work) was 75% for training (318) and 25% for testing (106). Table 4 contains the results achieved with these sets. The fourth and fifth columns of the table give the number of data of each class contained in the corresponding sets. The mean was approximately 53 data of each class in each set; in general there was no significant disproportion between the classes that would affect the training and testing of the classifiers. The training column contains not only the percentages of success but also the number of data classified correctly, which were as high as those obtained with the bootstrap sets. However, the test results were considerably higher than those achieved with the bootstrap technique, with a mean estimated accuracy of 90.30% for the 10 test sets, a very satisfactory value, close to the mean of 97.52% obtained for the training sets. That small difference of about 7% is perfectly acceptable and shows the generalization capacity of the classifiers (confirmed also by the low standard deviations). Note that cross validation was also used with these sets to interrupt the training, in a manner similar to that used for the bootstrap sets. Table 4 also contains the false negative (FN) rates (real defects classified as regular structures) and the false positive (FP) rates (regular structures classified as defects). The mean values of 7.69% and 11.64%, respectively, can be considered satisfactory, especially considering that the most critical situation is always that of a false negative.

Table 4. Results of classification with the input sets of the random selection without repositioning (%)

Input Sets               Training (%)   Test (%)    RS    D    FN (%)   FP (%)
1                        314/98.75      95/89.63    57    49   3.51     18.37
2                        311/97.80      98/92.46    52    54   11.54    3.70
3                        315/99.06      101/95.30   50    56   2.00     7.14
4                        312/98.12      95/89.63    55    51   18.18    1.96
5                        314/98.75      93/87.74    45    61   4.44     18.03
6                        307/96.55      93/87.74    60    46   6.67     19.57
7                        299/94.03      96/90.57    53    53   5.66     13.21
8                        314/98.75      94/88.68    46    60   6.52     15.00
9                        311/97.80      96/90.57    55    51   7.27     11.76
10                       304/95.60      96/90.57    54    52   11.11    7.69
Mean (%)                 97.52          90.30       ≈53   ≈53  7.69     11.64
Standard Deviation (%)   1.28           1.61                   13.03    12.53
Fig. 4. Resulting ROC curve of the randomly selected sets without data repositioning (sixth and seventh columns of Table 4)
Less than 8% of errors in the classification of real defects is a rate that cannot be considered high for a fault-detection task on these kinds of images. Figure 4 shows the ROC (Receiver Operating Characteristic) curve obtained by interpolating the true positive (TP = 1 − FN) and false positive points of Table 4. The area under the interpolated curve, calculated by simple integration, represents the efficiency of the system for the detection of the real defects in the acquired images (probability of detection, PoD). In this case the value found for the area was 96.1%, which can be considered an optimum index of the efficiency and reliability of the system, higher than the 90.30% estimated accuracy of Table 4.
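A sketch of this PoD computation from the FN/FP columns of Table 4 is given below. Adding the trivial end points (0, 0) and (1, 1) and using trapezoidal integration are assumptions about the interpolation, so the result only approximates the 96.1% reported:

import numpy as np

fn = np.array([3.51, 11.54, 2.00, 18.18, 4.44, 6.67, 5.66, 6.52, 7.27, 11.11]) / 100
fp = np.array([18.37, 3.70, 7.14, 1.96, 18.03, 19.57, 13.21, 15.00, 11.76, 7.69]) / 100
tp = 1.0 - fn                                     # true positive rate = 1 - FN

fpr = np.concatenate(([0.0], fp, [1.0]))          # assumed end points of the curve
tpr = np.concatenate(([0.0], tp, [1.0]))
order = np.argsort(fpr)
area = np.trapz(tpr[order], fpr[order])           # area under the interpolated curve
print(f"estimated PoD area: {area:.3f}")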
4 Conclusions
With the bootstrap technique, the accuracy results were well below acceptable values, which can be explained by the small amount of data available in the training sets. The estimation of the classification accuracy with the random selection technique without data repositioning, with a fixed 25% of the data for the test sets, yielded high success rates, showing the efficiency of the system developed for the detection of defects, which was also evident from the ROC curve drawn for the system. It must be pointed out that this work does not exhaust the research in this field: much can still be done to increase the reliability of the results, as well as to increase the number of extracted features and thus the success in the detection of faults. However, this paper can be considered pioneering in dealing with defects in automobile wheels, and there are no estimated-accuracy results in other papers that could be used for comparison with these results.
Acknowledgment. This work was supported in part by FONDECYT – Chile (International Cooperation), under grant no. 7060170. This work has been partially supported by a grant from the School of Engineering at Pontificia Universidad Católica de Chile. We acknowledge the permission granted for publication of this article by Insight, the Journal of the British Institute of Non-Destructive Testing.
References 1. Mery, D.: Automated Radioscopic Testing of Aluminium die Castings. Materials Evaluation 64, 135–143 (2006) 2. Mery, D.: Crossing line profile: a new approach to detecting defects in aluminium castings. In: Bigun, J., Gustavsson, T. (eds.) SCIA 2003. LNCS, vol. 2749, pp. 725–732. Springer, Heidelberg (2003) 3. Mery, D.: High contrast pixels: a new feature for defect detection in X-ray testing. Insight 46, 751–753 (2006) 4. Efron, B., Tibshirani, R.J.: An Introduction to the Bootstrap. Chapman & Hall/CRC, New York (1993) 5. Webb, A.: Statistical Pattern Recognition, 2nd edn. John Wiley & Sons Inc, Chichester (2002) 6. Mery, D., Filbert, D.: Automated flaw detection in aluminum castings based on the tracking of potential defects in a radioscopic image sequence. IEEE Trans. Robotics and Automation 18, 890–901 (2002) 7. Castleman, K.: Digital Image Processing. Prentice-Hall, Englewood Cliffs, New Jersey (1996) 8. Mery, D., Silva, R.R., Caloba, L.P., Rebello, J.M.A.: Pattern Recognition in the Automatic Inspection of Aluminium Castings. Insight 45, 431–439 (2003) 9. Mery, D., Filbert, D.: Classification of Potential Defects in Automated Inspection of Aluminium Castings Using Statistical Pattern Recognition. In: 8th European Conference on Non-Destructive Testing (ECNDT 2002), Barcelona (June 17–21, 2002) 10. Haykin, S.: Neural Networks - A Comprehensive Foundation. Macmillan College Publishing. Inc, USA (1994) 11. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. John Wiley& Sons, U.S.A (2001) 12. Beale, M.: Neural Network Toolbox for Use with Matlab User’s Guide Version 4. USA. The MathWorks (2001) 13. Diamantidis, N.A., Karlis, D., Giakoumakis, E.A.: Unsupervised Stratification of CrossValidation for Accuracy Estimation. Artificial Intelligence 2000 116, 1–16 (2002) 14. Efron, B., Tibshirani, R.J.: Cross-Validation and the Bootstrap: Estimating the Error Rate of the Prediction Rule. Technical Report 477, Stanford University (1995), http://utstat.toronto.edu/tibs/research.html 15. Silva, R.R., Siqueira, M.H.S., Souza, M.P.V., Rebello, J.M.A., Calôba, L.P.: Estimated accuracy of classification of defects detected in welded joints by radiographic tests. NDT & E International UK 38, 335–343 (2005) 16. Silva, R.R., Soares, S.D., Calôba, L.P., Siqueira, M.H.S., Rebello, J.M.A.: Detection of the propagation of defects in pressurized pipes by means of the acoustic emission technique using artificial neural networks. Insight 48, 45–51 (2006) 17. Silva, R.R., Calôba, L.P., Siqueira, M.H.S., Rebello, J.M.A.: Pattern recognition of weld defects detected by radiographic test. NDT&E International 37, 461–470 (2006)
A Radial Basis Function for Registration of Local Features in Images Asif Masood1, Adil Masood Siddiqui2, and Muhammad Saleem2 1 Department of Computer Science and Engineering, University of Engineering and Technology, Lahore, Pakistan 2 Department of Electrical Engineering, University of Engineering and Technology, Lahore, Pakistan
[email protected],
[email protected],
[email protected]
Abstract. Image registration based on landmarks and radial basis functions (e.g. thin plate splines) results in global changes and deformation spreads over the entire resampled image. This paper presents a radial basis function for registration of local changes. The proposed research was based on study/analysis of profile for different radial basis functions, supporting local changes. The proposed function was designed to overcome the weaknesses, observed in other radial basis functions. The results are analyzed/compared on the basis of different properties and parameters discussed in this paper. Experimental results show that the proposed function improves the registration accuracy. Keywords: Radial basis function, Image registration, Compact support, Landmarks.
1 Introduction
Registration based on radial basis functions plays an important role in medical applications, image warping and the simulation of facial expressions. In this paper, we consider a point-based non-rigid registration approach. Transformations based on radial basis functions have proven to be a powerful tool in image registration. With this approach, the transformation is composed of radially symmetric functions that serve as basis functions. The choice of the radial basis function is crucial for the overall characteristics, such as the smoothness or the locality of the transformation function. An often applied non-rigid image registration is based on thin plate splines, introduced by Bookstein [1] for the registration of medical images. Subsequently, Evans et al. [2] applied this scheme to 3D medical images and Goshtasby [3] applied it to 2D aerial image registration. This approach yields minimal bending energy measured over the whole image, but the deformation is not limited to the regions where the point landmarks are placed. This behavior is advantageous for yielding an overall smooth deformation, but it is problematic when local deformations are desired. To cope with local deformations, the landmarks have to be well distributed over the images to prevent deformations in regions where no changes are desired [4].
The radial basis functions with compact support were designed to register the local deformations. These functions limit the influence of landmarks around a circular area. Computational efficiency is another advantage of such functions. A radial basis function with compact support was first introduced by Wendland [5]. It has been used to model facial expressions [6] and elastic registration of medical images [7]. Arad and Reisfeld [8] used Gaussian function to incorporate locality constraints by properly tuning the locality parameter. Some other functions are given in [9]-[13]. Disadvantage of these functions is that they do not properly span over the complete region of support. This may lead to deterioration of results, which are studied in this paper. The proposed radial basis function is designed to minimize these problems. The criterion to evaluate the performance of radial basis functions is also proposed in this paper. On the basis of this evaluation criterion, results of proposed radial basis function are compared with similar functions i.e. Gaussian and Wendland. The proposed function proves better in all the results. Rest of the paper is organized as follow. Section 2 gives brief description of radial basis functions suitable for local deformation i.e. with compact support. The proposed radial basis function, its properties and analysis of results is discussed in section 3. Image registration results, with different radial basis functions, are presented/compared in section 4. Finally, section 5 concludes this presentation.
2 Radial Basis Functions for Local Deformation
Among many approaches, radial basis functions (RBF) are one means of achieving scattered data interpolation, i.e., fitting a smooth surface through a scattered or non-uniform distribution of data points. RBF interpolation is a linear combination of radially symmetric basis functions, each centered on a particular control point. The value of the RBF is only a function of the distance from the center point, so that φ(x, c) = φ(‖x − c‖), or φ(x) = φ(‖x‖) if the function is centered at the origin. A function φ that satisfies the property φ(x) = φ(‖x‖) can be categorized as a radial basis function, where the norm is the Euclidean distance. For a given set of N control points, the corresponding radial basis function transformation has the following general form

y(x) = Σ_{i=1}^{N} ω_i φ(‖x − c_i‖)    (1)
The radial basis function y(x) is the linear combination of N radial basis functions, each having a different center ci and a weighing coefficient ω i . The word ‘radial’ reflects an important property of the function. Its value at each point depends on distance of the point from respective control point (landmark) and not on its particular position. The radial basis functions can be broadly divided into two types, namely global and local. The global functions influence the image as a whole. This type of functions is useful when registration process needs repositioning and deformation of complete image. Some examples of global functions are thin plate splines ( ϕ TPS ) [1], [2],
multiquadrics (φ_MQ) [14], and inverse multiquadrics (φ_IMQ) [15]. These functions are given in Table 1. They are global in nature and cannot be applied to local changes like the simulation of facial expressions, or a change in local features of body organs after a surgical operation. With these functions, the location of each pixel is affected by every landmark, which is computationally time-consuming.

Table 1. Some radial basis functions without compact support
1. Thin plate splines (TPS) [1],[2]:   φ_TPS(x) = x² log(x)
2. Multiquadrics (MQ) [14]:            φ_MQ(x) = √(x² + σ²)
3. Inverse multiquadrics (IMQ) [15]:   φ_IMQ(x) = 1 / √(x² + σ²)
(The original table also showed a profile plot for each function.)
Radial basis functions with compact support can be very useful to deal with the local changes in an image. Influence of such functions is limited to a circular area around a landmark, which allows changes to a local area. A compactly supported radial basis function was established by Wendland [5]. It has been used to model facial expressions for video coding applications [6] and elastic registration of medical images [7]. Wendland’s compactly supported function forms a family of radial basis functions that have a piecewise polynomial profile function and compact support. Member of the family to choose depends on the dimension (d) from which the data is drawn and the desired amount of continuity (k) of the polynomials. Fornefett et al. [7] shows with some proves that Wendland’s radial basis function at d = 3 and k = 2 is most suitable for local deformation of images. He used it for elastic registration of medical images. This function is given as:
φ_F(x) = (1 − x)^4_+ (4x + 1)    (2)

where φ_F is Wendland's function [5], used by Fornefett et al. [7]. They compared their results with the Gaussian function used by Arad and Reisfeld [8] to modify facial expressions in an image. The Gaussian function is given as

φ_G(x) = e^(−x²/σ²)  (at σ = 0.5)    (3)
The authors [7],[8] have demonstrated with examples that the two functions are very suitable for local deformation as these functions have the properties of primary interest in radial basis functions. These properties include locality, solvability, stability, and positive definite. We propose a new radial basis function that has compact support and can produce even better results.
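To make the landmark-based registration of this section concrete, the following sketch fits a compactly supported RBF warp to corresponding landmark pairs. The function names, the single support radius R, and the use of Wendland's φ from Eq. (2) are illustrative assumptions of this sketch, not the exact formulation of [7] or [8]:

import numpy as np

def phi_wendland(r):
    r = np.clip(r, 0.0, 1.0)                      # compact support: zero beyond r = 1
    return (1.0 - r) ** 4 * (4.0 * r + 1.0)

def fit_rbf_warp(src, dst, R):
    """src, dst: (N, 2) arrays of source/target landmarks; returns a warp function."""
    d = np.linalg.norm(src[:, None, :] - src[None, :, :], axis=-1)
    A = phi_wendland(d / R)                       # N x N interpolation matrix
    w = np.linalg.solve(A, dst - src)             # one weight column per coordinate

    def warp(pts):
        d = np.linalg.norm(pts[:, None, :] - src[None, :, :], axis=-1)
        return pts + phi_wendland(d / R) @ w      # displacement added to every point
    return warp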
Fig. 1. Plot of the three radial basis functions (φ_cos, φ_G and φ_F; vertical axis φ, horizontal axis x)
3 Proposed Radial Basis Function
In local deformations, a smooth transformation of image pixels is needed, proportional to the transformation of the target landmark. A radial basis function maps the image pixels to their new locations. The proposed radial basis function is based on a study of its impact on the smooth transition of pixels during local deformation. It is defined using the cosine function and is given as

φ_cos(x) = (1 + cos(xπ)) / 2    (4)
A plot of the radial basis function φ_cos along with φ_F and φ_G is given in Fig. 1. The characteristics of the radial basis functions can be analyzed from the plot, and their impact on individual points/pixels is studied in the later part of this section. The desired properties of radial basis functions, and the advantages of the proposed function over the others, are discussed in Section 3.1.

3.1 Properties of Radial Basis Function
• Smooth transition from start to end: A radial basis function must be smooth at the start (x = 0) and at the end (x = 1). In other words, it must approach the horizontal at its end points and change its slope smoothly between them. This is important for a smooth interpolation of the affected image pixels. From Fig. 1 we can observe that all three functions are smooth at their end points. The function φ_F smooths the end (x = 1) more than the start (x = 0), which degrades the other properties discussed below; in the proposed radial basis function the smoothness level remains the same at both ends. Although φ_G is smooth along x, it never reaches φ = 0.
• Equal distribution of φ: An equal distribution of φ is important to maintain an equal impact on the interpolated pixel points on both sides of the target landmark. If this distribution is not controlled, some of the points map very close to each other while others map far from their neighbours. The proposed radial basis function produces an ideal distribution of φ along x; this is an inherent property of the cosine function, and φ = 0.5 at x = 0.5 is one piece of evidence of this equal distribution. A straight line (φ = 1 − x) would produce a perfect distribution, but it cannot be used because it would violate the first property above. The distribution of both φ_G and φ_F concentrates towards the first half of x, i.e., 0–0.5: about 63% of φ_G falls in the lower half of x, and similarly about 81% of φ_F falls in the lower half of x. This behaviour produces adverse effects on the results, as discussed in Sections 3.2 and 3.3. The proposed function φ_cos distributes exactly 50% on each half of x.
• Using the full range of x: The radial basis function should use the full range of x, i.e., 0–1, for an equal distribution of φ. The proposed function φ_cos utilizes the full range of x. The function φ_F is almost 0 at x = 0.8 and does not properly use the range from x = 0.8 to 1. Similarly, φ_G = 0.18 at x = 1; in other words, it never reaches φ_G = 0 over the complete range of x.
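These three properties can be checked numerically with a short sketch. Reading "distribution" as the share of the total drop of φ that occurs over 0 ≤ x ≤ 0.5 is an assumption of this sketch (it reproduces the quoted ~50% / ~63% / ~81% figures); σ = 0.5 for the Gaussian, as in Eq. (3):

import numpy as np

x = np.linspace(0.0, 1.0, 1001)
profiles = {
    "phi_cos": (1.0 + np.cos(np.pi * x)) / 2.0,
    "phi_G":   np.exp(-(x ** 2) / 0.5 ** 2),
    "phi_F":   (1.0 - x) ** 4 * (4.0 * x + 1.0),
}
for name, phi in profiles.items():
    drop_lower = (phi[0] - phi[500]) / (phi[0] - phi[-1])   # index 500 -> x = 0.5
    print(f"{name}: phi(0)={phi[0]:.3f}  phi(1)={phi[-1]:.3f}  "
          f"drop in first half: {100 * drop_lower:.0f}%")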
3.2 Local Deformation on Single Row of Points
This section demonstrates the application of a radial basis function to a single row of points; the impact of the different radial basis functions, in the light of the properties presented in Section 3.1, is also studied. The deformation of a single row of points after applying the different radial basis functions (φ_cos, φ_F, φ_G) is shown in Fig. 2. The radial basis functions are applied to a row of 30 points using eq. (1). The original location of the points is shown in the lowest row of Fig. 2, in which the source (P1) and target (P2) locations of the landmark are inscribed in a circle and a square, respectively. As the deformation is based on a single landmark, eq. (1) may be written as

y(x) = ω · φ(r)    (5)

where ω = P2 − P1 is the displacement from the source to the target location, and φ(r) represents a radial basis function such as φ_cos(r), φ_G(r) or φ_F(r). In eq. (5), r is the normalized distance from the origin/center (P1), which ranges from 0 to 1:

r = |x − P1| / R,  with r replaced by 1 whenever r > 1    (6)
From Fig. 2, x is the row of points, which ranges from 1 to 30, and R is the radius around the origin (P1) that is affected during the deformation. It is given as

R = |ω| · a    (7)
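A minimal sketch of Eqs. (5)–(7) applied to a row of 30 points follows. The landmark positions are illustrative, and a = 2.5 anticipates the default value of the locality parameter discussed below:

import numpy as np

def phi_cos(r):
    return (1.0 + np.cos(np.pi * np.clip(r, 0.0, 1.0))) / 2.0

x = np.arange(1, 31, dtype=float)         # row of 30 points
P1, P2, a = 13.0, 17.0, 2.5               # source landmark, target landmark, locality
omega = P2 - P1                           # displacement
R = abs(omega) * a                        # radius of influence, Eq. (7)
r = np.minimum(np.abs(x - P1) / R, 1.0)   # normalized distance, Eq. (6)
x_new = x + omega * phi_cos(r)            # Eq. (5) applied to every point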
Fig. 2. Applying radial basis functions on a row of 30 points
The radius R is proportional to the displacement ( ω ) and ‘a’ is the locality parameter. The parameter ‘a’ is used to control the proportional extent of radius, which remains constant for all landmarks. After extensive testing, default value was set to 2.5. However, user may adjust the parameter ‘a’ as per its suitability in particular applications. Generally, parameter ‘a’ limits the locality of deformation and increasing the value of ‘a’ would tend to globalize the deformation effects. Top three rows in Fig. 2 show local deformation after applying different radial basis functions. The point landmark (inscribed in circle i.e. 13th point) moves from source (P1) to target (P2) location and all points with radius R interpolate or adjust their location within available gaps. Proper utilization of available gaps and smooth change in distance between the points is desirable from a radial basis function. The properties of radial basis function (discussed in section 3.1) have direct impact on proper positioning of these points. We can compare the deformation results of different radial basis functions using parameters discussed below.
• Minimum distance (MinD): the minimum distance of any point from its neighbour. A radial basis function reduces the distance between points when it needs to fit them into smaller gaps (for example, the points at the right side of the origin in Fig. 2), and this parameter monitors the adjustment in such situations:
  MinD = min_{i=1..n} { d_i }    (8)
  where d_i is the distance of the ith point from its neighbour.
• Maximum distance (MaxD): the maximum distance of any point from its neighbour. This parameter monitors the adjustment of the radial basis function when a few points have to be settled into larger gaps (for example, the points at the left side of the origin in Fig. 2):
  MaxD = max_{i=1..n} { d_i }    (9)
• Maximum change in distance (MaxΔD): the maximum change of the distance between two points and their neighbours. This parameter monitors the smoothness of the transition from larger gaps to smaller ones and vice versa (for example, while moving from the left side of the origin to the right side):
  MaxΔD = max_{i=1..n} { Δd_i }    (10)
where Δd_i is the change in distance of the ith point. A comparison of the results for the different radial basis functions is shown in Table 2. These results were calculated after the deformation of the row of 30 points shown in Fig. 2. The best value in Table 2 is the one with the minimum difference from the original, i.e., before deformation; the best value for each parameter is written in bold. The first and third properties of radial basis functions (smooth transition and using the full range of x) have a direct impact on the parameter MaxΔD: Table 2 shows that MaxΔD for φ_F and φ_G is about 3.3 and 1.6 times higher, respectively, than for φ_cos. The second property (equal distribution) influences the parameters MinD and MaxD.

Table 2. Comparison of radial basis functions applied on the row of points of Fig. 2
Evaluation Parameters   Before Deformation   φ_cos   φ_F     φ_G
MinD                    1                    0.382   0.164   0.319
MaxD                    1                    1.618   1.836   1.682
MaxΔD                   0                    0.196   0.652   0.314
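The evaluation parameters of Eqs. (8)–(10) can be computed as in the following sketch, applied to a deformed row such as x_new from the earlier example. The exact numbers depend on the chosen landmarks and locality parameter, so they will not reproduce Table 2 exactly:

import numpy as np

def evaluation_parameters(points):
    d = np.diff(np.sort(points))                       # neighbour distances d_i
    return {"MinD": float(d.min()),                    # Eq. (8)
            "MaxD": float(d.max()),                    # Eq. (9)
            "MaxDeltaD": float(np.abs(np.diff(d)).max())}   # Eq. (10)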
Table 3. Comparison of radial basis functions applied on the 2D grid of Fig. 3

Evaluation Parameters   Before Deformation   φ_cos   φ_F     φ_G
MinD                    1                    0.557   0.405   0.517
MaxD                    1.414                2.299   2.951   2.380
MaxΔD                   0                    0.215   0.222   0.238
Fig. 3. Applying radial basis function on 50x50 grid. (a) Original grid marked with source (circle) and target (square) position of landmark, (b) After ϕ cos , (c) After ϕ G , and (d) After ϕ F .
From Fig. 1, we can observe that φ_cos is the best according to this property, followed by φ_G and φ_F. This can also be verified from Table 2, as MinD and MaxD are best for φ_cos and worst for φ_F.
3.3 Local Deformation on 2D Grid
The application of a radial basis function to a single row of points was discussed in Section 3.2. The same methodology can be extended to a 2D grid of points, in which each grid point x and the landmarks P1 and P2 have coordinates (x, y). The deformation of a 50×50 grid with the different radial basis functions (φ_cos, φ_G, φ_F) is shown in Fig. 3; the radius R around the center P1 is shown with a circle. One can observe that the grid after φ_cos looks better distributed and smoother than the other two (φ_G, φ_F). This is verified by the quantitative measurements shown in Table 3.
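A sketch of the same deformation extended to a 2D grid, as used for Fig. 3 and Table 3, is given below; the landmark coordinates are illustrative assumptions:

import numpy as np

def deform_grid(size, P1, P2, a=2.5, phi=lambda r: (1 + np.cos(np.pi * r)) / 2):
    gx, gy = np.meshgrid(np.arange(size, dtype=float), np.arange(size, dtype=float))
    pts = np.stack([gx, gy], axis=-1)                 # (size, size, 2) grid points
    omega = np.asarray(P2, float) - np.asarray(P1, float)
    R = np.linalg.norm(omega) * a
    r = np.minimum(np.linalg.norm(pts - P1, axis=-1) / R, 1.0)
    return pts + phi(r)[..., None] * omega            # each point shifted along omega

grid = deform_grid(50, P1=(20.0, 25.0), P2=(26.0, 25.0))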
Quantitative results are similar to those shown in Table 2. The radial basis function φ_cos again produces the best results, which are shown in bold.
4 Registration Results for Images
The proposed radial basis function is designed to register local changes in an image. Such changes may include the registration of post-operative medical images, the simulation/registration of tumors, kidney stones, and dislocation of bones. Some other
Fig. 4. Applying radial basis functions on MR images. (a) Source image marked with source landmarks (circles), (b) target image marked with target landmarks (squares), (c) registered image with φ_cos, (d) registered image with φ_G, (e) registered image with φ_F, (f) difference of (a) and (c), (g) difference of (a) and (d), (h) difference of (a) and (e).
applications may include image warping/morphing and simulation of facial expressions. As an example, registration of brain tumor is demonstrated in this section. Tomographic brain images, including pre-operative (Fig 4(a)) and postoperative (Fig. 4(b)) pictures, were taken from [7] for demonstration of results. The source (Fig. 4(a)) and target (Fig. 4(b)) images are corresponding slices of rigidly transformed 3D MR data sets. Aim of this registration is to correct the pre-operatively acquired image such that it agrees with the current anatomical situation. Registration results of proposed radial basis function ( ϕ cos ) are compared with ϕ G , and ϕ F . The source and target image with selected landmarks is shown in Fig. 4(a) and Fig. 4(b) respectively. The registered images with ϕ cos , ϕ G , and ϕ F are shown Fig. 4(c), 4(d), and 4(e) respectively. Similarly, difference of registered image from original (Fig. 4(a)) is shown in Fig 4(f)-4(h). It shows that the radial basis functions ( ϕ cos ,
φ_G, φ_F) obey the property of locality and affect only a limited area. The results of these algorithms look very similar under visual observation. Table 3 showed a quantitative comparison of the results for the 2D grid; the measurements of these parameters (MinD, MaxD, MaxΔD) remain similar in the case of images as well, since an image is also a 2D grid of pixels with different gray levels. A comparison of the results in a graph is shown in Fig. 5. The parameters in this graph show the difference from the original position (i.e., before deformation), which should be minimized. The graph shows that the results of the proposed algorithm are the best (i.e., the minimum value for each parameter) when compared with φ_G and φ_F.
Fig. 5. Graph showing difference of parameters ( ϕ cos , ϕ G , ϕ F ) from their original value (i.e. before deformation), for 50x50 grid
Fig. 6. Applying radial basis functions on MR images. (a) Source image, (b) target image, (c) source image marked with source landmarks, (d) target image marked with target landmarks, (e) registered image with φ_cos, (f) registered image with φ_G, (g) registered image with φ_F, (h) difference of a and e, (i) difference of a and f, (j) difference of a and g.
Another set of results, i.e., the registration of facial expressions, is shown in Fig. 6. Fig. 6(a) is the famous smiling Mona Lisa image, taken as the source image, and Fig. 6(b) shows the changed facial expression, taken as the target image. These images were taken from [8]. The marking of the source and target landmarks is shown in Fig. 6(c) and 6(d), the registered images obtained with the different radial basis functions are shown in Fig. 6(e)–6(g), and the difference images in Fig. 6(h)–6(j). Again the locality property of the radial basis functions is evident. The registration of facial expressions produces quantitative results similar to those shown in Table 3 and Fig. 5; the values of the quantitative measurements for φ_cos are optimum, followed by φ_G and φ_F.
5 Conclusion
A radial basis function to register images with local deformations was presented in this paper, together with a study of the desired properties of radial basis functions and of different parameters to evaluate and compare the results. Deformation results were tested for a row of points, a 2D grid, and images. A registered image is expected to improve the similarity with the target image, and the proposed function proved better in all the tests.
Acknowledgments. The authors acknowledge the Higher Education Commission (HEC) of Pakistan for providing funds for this research work and University of Engineering and Technology (UET), Lahore, Pakistan, for providing facilities to conduct this research.
References 1. Bookstein, F.L.: Principal warps: thin-plate splines and the decomposition of deformations. IEEE Trans. Pattern Anal. Mach. Intell 11(6), 567–585 (1989) 2. Evans, A.C., Dai, W., Collins, L., Neelin, P., Marrett, S.: Warping of a computerized 3-D atlas to match brain image volumes for quantitative neuroanatomical and functional analysis. In: Proc. SPIE, vol. 1445, pp. 236–246 (1991) 3. Goshtasby, A.: Ragistration of images with geometric distortions. IEEE Trans. Geosci. Remote Sens. 26(1), 60–64 (1988) 4. Goshtasby, A.: Image registration by local approximation methods. Image and Vision Computing 6, 255–261 (1988) 5. Wendland, H.: Piecewise polynomial, positive definite and compactly supported radial functions of minimal degree. Advances in Computational Mathematics 4, 389–396 (1995) 6. Soligon, O., Mehaute, A.L., Roux, C.: Facial expressions simulation with Radial Basis Functions. International Workshop on Synthetic-Natural Hybrid Coding and Three Dimensional Imaging, 233–236 (1997) 7. Fornefett, M., Rohr, K., Stiehl, H.S.: Radial basis function with compact support for elastic registration of medical images. Image and Vision Computing 19, 87–96 (2001) 8. Arad, N., Reisfled, D.: Image Warping using few anchor points and radial functions. Computer Graphics Forum 14(1), 35–46 (1995)
9. Fornberg, B., Larsson, E., Wright, G.: A new class of oscillatory radial basis functions. Comput. Math. Appl. 51, 1209–1222 (2006) 10. Eickhoff, R., Ruckert, U.: Enhancing Fault Tolerance of Radial Basis Functions. IEEE Transactions on systems, Man and Cybernetics 35, 928–947 (2005) 11. Golberg, M.A., Chen, C.S., Bowman, H.: Some recent results and proposals for the use of radial basis functions in the BEM. Engineering Analysis with Boundary Elements 23, 285–296 (1999) 12. Šarler, B.: A radial basis function collocation approach in computational fluid dynamics. Computer Modeling in Engineering & Sciences 7, 185–193 (2005) 13. Peng, W., Tong, R., Qian, G., Dong, J.: A Local Registration Approach of Medical Images with Niche Genetic Algorithm. In: 10th International Conference on Computer Supported Cooperative Work in Design, pp.1–6 (2006) 14. Little, J.A., Hill, D.L.G., Hawkes, D.J.: Deformations incorporating rigid structures. Computer Vision and Image Understanding 66(2), 223–232 (1997) 15. Ruprecht, D., Muller, H.: Free form deformation with scattered data interpolation method. Computing Supplementum 8, 261–281 (1993)
Hardware Implementation of Image Recognition System Based on Morphological Associative Memories and Discrete Wavelet Transform Enrique Guzmán1, Selene Alvarado1, Oleksiy Pogrebnyak2, Luis Pastor Sánchez Fernández2, and Cornelio Yañez2 1
Universidad Tecnológica de la Mixteca
[email protected]. 2 Centro de Investigación en Computación del Instituto Politécnico Nacional (olek,lsanchez,cyanez)@pollux.cic.ipn.mx
Abstract. The implementation of a specific image recognition technique for an artificial vision system is presented. The proposed algorithm involves two steps. First, smaller images are obtained using Discrete Wavelet Transform (DWT) after four stages of decomposition and taking only the approximations. This way the volume of information to process is reduced considerably and the system memory requirements are reduced as well. Another purpose of DWT is to filter noise that could be induced in the images. Second, the Morphological Associative Memories (MAM) are used to recognize landmarks. The proposed algorithm provides flexibility, possibility to parallelize algorithms and high overall performance of hardware implemented image retrieval system. The resulted hardware implementation has low memory requirements, needs in limited arithmetical precision and reduced number of simple operations. These benefits are guaranteed due to the simplicity of MAM learning/restoration process that uses simple morphological operations, dilation and erosion, in other words, MAM calculate maximums or minimums of sums. These features turn out the artificial vision system to be robust and optimal for the use in realtime autonomous systems. The proposed image recognition system has, among others, the following useful features: robustness to the noise induced in the patter to process, high processing speed, and it can be easily adapted to diverse operation circumstances. Keywords: Artificial Vision, Image Recognition, Morphological Associative Memories, Discrete Wavelet Transform, Hardware Implementation.
1 Introduction
Currently, the artificial vision of autonomous systems attracts growing interest in the scientific community as well as in potential industrial applications. The main problem of artificial vision is the recognition of the physical elements contained in images and the determination of their identity and position [1]. Diverse techniques have been used for image pattern recognition in artificial vision; the best known are listed below.
Lin-Cheng Wang et al. [1] described a modular neural network classifier for the automatic object recognition based on forward-looking infrared imagery. The classifier is formed by several independently trained neuronal networks; each neuronal network makes a decision based on local features extracted from a specific part of the image to recognize, then individual results of each network are combined to determine the final decision. Dong-Gyu Sim, Oh-Kyu Kwon, and Rae-Hong Park [2] proposed algorithms for object recognition based on modification of Hausdorff distance algorithm. These modifications uses M-estimation and least trimmed square measures, which demonstrated are more efficient than the conventional Hausdorff distance measures. A new approach for object recognitions based on coarse-and-fine matching and a multilayer Hopfield neural network was presented by Susan S. Young and Peter D. Scott in [3]. The network is formed by several cascade single layer Hopfield networks, each network codifies object features at different resolutions, with bidirectional interconnections linking adjacent layers. Susan S. Young et al. presented a method to detect and classify objects using a multi resolution neuronal network [4]. The objects identification is based on minimizing an energy function. This energy function is evaluated by means of a procedure of concurrent equalization implemented with a multilayer Hopfield neuronal network. An algorithm based on wavelets analysis for identification and separation of packages in the recycling process is proposed by J.M. Barcala et al. in [5]. Quaternions are constructed with obtained wavelet coefficients that allow to realize the identification of the packages on-line. The objective of this system is its use in recycling plants replacing manual process. Foltyniewicz, R.[6] presented a new method for automatic face recognition and verification. His approach is based on a two stage process. At the first step a wavelet decomposition technique or morphological nonlinear filtering is used to enhance intrinsic features of a face, reduce the influence of rotation in depth, changes in facial expression, glasses and lighting conditions. Preprocessed images contain all the essential information for the discrimination between different faces and are next a subject for learning by a modified high order neural network which has rapid learning convergence, very good generalization properties and a small number of adjustable weights. Considering an image as a pattern, we propose the hardware implementation of image recognition for an artificial vision system based on Morphological Associative Memories and Discrete Wavelet Transform. The designed system outperforms the traditionally artificial vision techniques in the robustness to noise, learning ability, high speed of both learning and recognition processes and overall efficiency of the image patterns recognition.
2 Morphological Associative Memories
The modern era of associative memories began in 1982, when Hopfield described the Hopfield associative memory [7]. In 1998, a novel contribution appeared in this area when
Ritter, Sussner and Díaz de León created the Morphological Associative Memories (MAM). MAM base their operation on the morphological operations dilation and erosion; in other words, they use maximums or minimums of sums [8]. This feature distinguishes them from the Hopfield memories, which use sums of products.

The input and output patterns of a generic associative memory M are represented by x = [x_i]_n and y = [y_i]_m, respectively. Let {(x¹, y¹), (x², y²), ..., (x^k, y^k)} be k vector pairs defined as the fundamental set of associations [10]. The fundamental set of associations is described as

{ (x^μ, y^μ) | μ = 1, 2, ..., k }    (1)
The generic associative memory M is represented by a matrix and is generated from the fundamental set of associations. Once the fundamental set is defined, one can use the operations needed for the learning and recovery processes of MAM: the maximum product and the minimum product, which use the maximum (∨) and minimum (∧) operators [9], [10], [11]. According to the operation mode, associative memories can be classified into two groups: Morphological Auto-associative Memories (MAAM) and Morphological Hetero-associative Memories (MHM). Morphological heteroassociative memories are of particular interest for the development of this work.
2.1 Morphological Heteroassociative Memories A MAM is hetero-associative if ∃μ ∈ {1, 2,..., k} such that x μ ≠ y μ . There are two
types of MHM: max, symbolized by M, and min, symbolized by W. The W memories are those that use the maximum product and the minimum operator in their learning phase and the maximum product in their recovery phase; the M memories are those that use the minimum product and the maximum operator in their learning phase and the minimum product in their recovery phase. Learning phase: 1. For each of the k element of the fundamental set of associations ( x μ , y μ ) , the
( )
( )
matrices y μ ∇ - x μ are calculated for a W memory, or the matrices y μ Δ - x μ are calculated for a M memory. The morphological heteroassociative memory is created. to the resulting 2. The W memory is obtained applying the minimum operator matrices of step 1. W is given by t
t
∧
k
[
(
W = ∧ y μ∇ - xμ μ =1
wij =
∧( y k
μ
μ =1
i
) ] = [w ] t
ij m× n
(2)
− x μj )
The memory M is obtained applying the maximum operator matrices of step 1. M is given by
∨ to the resulting
Hardware Implementation of Image Recognition System
k
[
(
M = ∨ yμΔ - xμ μ =1
mij =
∨( y k
μ
μ =1
i
) ] = [m ]
667
t
ij m× n
μ
− xj
(3)
)
Recovery phase: The recovery phase consists of presenting an input pattern to the memory generated in the learning phase. As the answer, the memory generates the output pattern associated with the presented pattern. When a W memory is used, the maximum product W ∇ x^ω is calculated:

y = W ∇ x^ω,   y_i = ∨_{j=1}^{n} ( w_ij + x_j^ω )    (4)

When an M memory is used, the minimum product M Δ x^ω is calculated:

y = M Δ x^ω,   y_i = ∧_{j=1}^{n} ( m_ij + x_j^ω )    (5)
where ω ∈ {1, 2, ..., k}, and a column vector y = [y_i]_m is obtained, which represents the output pattern associated with the input pattern x^ω. Theorem 1 and Theorem 2 of [12] state the conditions that must be satisfied by MHM max and MHM min, respectively, to obtain a perfect recall of the output patterns. On the other hand, Theorem 5 and Theorem 6 of [12] indicate the amount of noise that is permissible in the input patterns while still obtaining a perfect recall of the output patterns.
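As an illustration of Eqs. (2)–(5), a NumPy sketch of MHM learning and recall is given below; the pattern encoding (one association per row of X and Y) is an assumption of this sketch rather than a detail fixed by the text:

import numpy as np

def mam_learn(X, Y):
    """X: (k, n) input patterns, Y: (k, m) associated output patterns."""
    D = Y[:, :, None] - X[:, None, :]        # k matrices with entries y_i^mu - x_j^mu
    W = D.min(axis=0)                        # MHM min, Eq. (2)
    M = D.max(axis=0)                        # MHM max, Eq. (3)
    return W, M

def mam_recall(W, M, x):
    y_from_W = (W + x[None, :]).max(axis=1)  # maximum product, Eq. (4)
    y_from_M = (M + x[None, :]).min(axis=1)  # minimum product, Eq. (5)
    return y_from_W, y_from_M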
3 Hardware Implementation of Image Recognition System
When an autonomous system uses an artificial vision technique, this system is required to be robust to factors such as the noise induced in the pattern to be processed; in addition, it should achieve high processing speed and be easily adaptable to diverse operating circumstances, among other requirements. In order to satisfy these requirements, we propose the hardware description of an application-specific processor focused on pattern recognition in an artificial vision system. The proposed algorithm described in the processor uses MAM for pattern recognition. Besides the useful features mentioned before, which make MAM an attractive tool, MAM have demonstrated excellent performance in recognizing and recovering patterns, even in the presence of dilative, erosive or random noise [9], [10], [13]. The Discrete Wavelet Transform (DWT) complements the proposed algorithm. The use of the DWT has two objectives. First, when only the approximation sub-band of the four-scale DWT decomposition is used, the quantity of
information to process is reduced and therefore the memory system requirements are reduced too. Second, DWT filters the noise that could be induced in the image at previous stages. The proposed image recognition system combines the flexibility, the facility to describe parallel processes and a high performance that is granted by the implementation of an algorithm in hardware, and it inherits the MAM features such as low memory requirements, limited arithmetical precision, and small number of arithmetical operations. The use of Electronic Design Automation (EDA) tools optimizes the system design time, increases the product quality and reduces the production costs. For the development of the proposed system, we have chosen the set of EDA tools formed by the software tools ISE (Integrated Software Environment) Foundation version 8.2i from Xilinx company and the hardware descriptor language VHDL, and Xess XSV300 Board V1.0 is the hardware tool.
Fig. 1. Proposed system scheme
In order to describe the processor for pattern recognition, we chose the Top-Down design methodology. It begins by visualizing the design with a high level of abstraction; the design is then partitioned into further designs, each of which increases its level of detail as required [14]. Based on the principles of the Top-Down design methodology, our design is divided into the modules shown in Fig. 1. Each of these modules is independent of the others, but all of them interact constantly. The algorithm proposed for the design of the pattern recognition processor consists of two phases, learning and recognition.

3.1 Learning Phase
At this phase, both MHM min and MHM max are created. These memories codify the information necessary to identify each one of the input patterns that are represented by images. Fig. 2 illustrates the learning algorithm.
Fig. 2. Learning process
The learning algorithm can be summarized in the following stages.
i. Image acquisition. The implemented USB interface between the image recognition system and a personal computer allows emulating the image acquisition and visualizing the results. Let CI = { A^α | α = 1, 2, ..., h } be the set of h images to recognize; each image is represented by A^α = [a_ij]_{m×n}, where m is the height of the image and n its width.
ii. The four-scale DWT decomposition of the image A is computed [15]:

A_{i+1}(x, y) = Σ_{z1} Σ_{z2} h(z1) h(z2) A_i(2x − z1, 2y − z2)    (6)

The following stage uses only the approximation sub-band A_{n4} = [a_ij]_{u×v}, where u and v define the size of the new image; the detail sub-bands are discarded.
iii. The approximation sub-band of the four-scale DWT decomposition is converted into the binary image A_Bin:

A_Bin = 0 if A_{n4} ≤ φ,  1 if A_{n4} > φ    (7)

where φ is a binarization threshold.
iv. Conversion of A_Bin into a vector. This vector is an input pattern of the fundamental set of associations:

x^μ = A^μ_Bin,   [x_l]_{uv} = [a_ij]_{u×v},   μ = 1, 2, ..., h    (8)

v. A label is assigned to the vector x^μ = A^μ_Bin; this label represents an output pattern y^μ, μ = 1, 2, ..., h. The union of these patterns generates an element of the fundamental set of associations { (x^μ, y^μ) | μ = 1, 2, ..., h }.
Theorem 1 and Corollary 1.1 in [12] state the conditions that must be satisfied by the output patterns for an MHM max to obtain perfect recall; Theorem 2 and Corollary 2.1 in [12] state the corresponding conditions for an MHM min.
vi. The Morphological Associative Memories are obtained. The MHM max is computed using both the minimum product Δ and the maximum operator ∨:

M = ∨_{μ=1}^{k} [ y^μ Δ (−x^μ)^t ]

The MHM min is computed using both the maximum product ∇ and the minimum operator ∧:

W = ∧_{μ=1}^{k} [ y^μ ∇ (−x^μ)^t ]
vii. Stages i–vi are repeated $h$ times.
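For reference, the pre-processing of stages ii–iv can be sketched directly in software. The snippet below is only a minimal illustrative NumPy sketch, not the hardware implementation described in this paper; the Haar low-pass kernel and the threshold argument `phi` are assumptions, since the filter coefficients and threshold value are not specified here.

```python
import numpy as np

H = np.array([1.0, 1.0]) / np.sqrt(2.0)  # assumed Haar low-pass filter h(z)

def dwt_approximation(image, scales=4):
    """Approximation sub-band after `scales` levels of Eq. (6):
    A_{i+1}(x, y) = sum_{z1, z2} h(z1) h(z2) A_i(2x - z1, 2y - z2)."""
    a = image.astype(float)
    for _ in range(scales):
        out = np.zeros((a.shape[0] // 2, a.shape[1] // 2))
        for x in range(out.shape[0]):
            for y in range(out.shape[1]):
                acc = 0.0
                for z1 in range(len(H)):
                    for z2 in range(len(H)):
                        # negative indices wrap around; acceptable for a sketch
                        acc += H[z1] * H[z2] * a[2 * x - z1, 2 * y - z2]
                out[x, y] = acc
        a = out
    return a

def learning_input_pattern(image, phi):
    """Stages iii-iv: binarize the approximation sub-band (Eq. (7)) and
    flatten it into an input pattern x^mu of the fundamental set (Eq. (8))."""
    a4 = dwt_approximation(image, scales=4)
    a_bin = (a4 > phi).astype(np.int8)   # 0 if A_n4 <= phi, 1 otherwise
    return a_bin.reshape(-1)             # u*v-dimensional pattern vector
```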
3.2 Recognition Phase

In this phase, the system is able to identify the images that represent the input patterns with the aid of the memory obtained in the learning phase. When an image is received, the system recovers the label (output pattern) that was associated with this image. Fig. 3 shows the recognition phase.
Fig. 3. Recognition process
The recognition algorithm can be summarized in the following stages. i. The image to be identified is acquired. ii. The four-scale DWT decomposition of the image is computed. iii. The approximation sub-band of the four-scale DWT decomposition is converted into a binary image. iv. The binary image is converted into a vector. v. With the help of the MAM obtained in the learning phase, the system is able to identify the received image and generates the label corresponding to the identified image.
If MHM max is used, the minimum product $\Delta$ is applied: $y = M \,\Delta\, x^\omega$. If MHM min is used, the maximum product $\nabla$ is applied: $y = W \,\nabla\, x^\omega$. vi. Finally, the generated label is passed to the response system for further processing.
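The morphological operations used in both phases can likewise be written compactly in NumPy. The sketch below is an illustrative software rendering of the memories M and W and of the min-product/max-product recall; the one-hot coding of the output patterns y^mu is an assumption, as a particular label encoding is not fixed in the paper.

```python
import numpy as np

def learn_mam(X, Y):
    """Build the max memory M and the min memory W from the fundamental set
    {(x^mu, y^mu)}.  X holds the h input patterns as rows (shape (h, n)) and
    Y the corresponding output patterns (shape (h, m)).  For vector
    arguments the min and max products reduce to the outer difference
    y_i - x_j, over which the maximum (resp. minimum) is taken across mu."""
    D = Y[:, :, None].astype(float) - X[:, None, :].astype(float)
    return D.max(axis=0), D.min(axis=0)        # M (MHM max), W (MHM min)

def recall_max(M, x):
    """MHM max recall with the minimum product: y_i = min_j (M_ij + x_j)."""
    return (M + x[None, :]).min(axis=1)

def recall_min(W, x):
    """MHM min recall with the maximum product: y_i = max_j (W_ij + x_j)."""
    return (W + x[None, :]).max(axis=1)

# Hypothetical usage with binary input patterns and one-hot labels:
# X = np.stack([learning_input_pattern(img, phi) for img in train_images])
# Y = np.eye(len(train_images))
# M, W = learn_mam(X, Y)
# label = int(np.argmax(recall_max(M, test_pattern)))
```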
4 Results

In order to show the performance of the modeled image recognition system, we applied it to two sets of images. The first set consists of grayscale images of 720x480 pixels, shown in Fig. 4. Using the implemented USB interface between the image recognition system and a personal computer, we can visualize the results of each one of the phases of the algorithm, as shown in Fig. 5. Table 1 shows a perfect recall for the original images.
Fig. 4. Test images: a) go ahead, b) stop, c) right, d) return, e) left
Fig. 5. Results of algorithm phases: a) approximation sub-bands of the DWT decomposition, b) binarization

Table 1. Pattern recognition processor performance on original images (recognition percentage)

Image    MHM min    MHM max
a        100        100
b        100        100
c        100        100
d        100        100
e        100        100
In order to estimate how the image recognition system performs in real conditions, the test images were corrupted with typically associated Gaussian noise and uniformly
distributed noise. The images were corrupted with the 3 variants of each one of these noises: dilative, erosive and random. Fig. 6 and Tables 2, 3 and 4 compare the system performance using MHM max and MHM min on corrupted images with both Gaussian and uniform noise in 3 modalities: dilative, erosive and random, each with different noise percentages. Table 2. Comparison of system performance using MHM max and MHM min on corrupted images with both dilative Gaussian and dilative uniform noise
Entries are MHM min / MHM max recognition percentages.

         Dilative Gaussian noise         Dilative uniform noise
Image    10%       20%       30%         10%       20%       30%
a        100/100   0/100     0/100       100/100   0/100     0/100
b        100/100   0/100     0/100       0/100     0/100     0/100
c        100/100   100/100   0/100       0/100     0/100     0/100
d        100/100   100/100   0/100       100/100   100/100   100/100
e        100/100   0/100     0/100       100/100   0/100     0/100
Table 3. Comparison of system performance using MHM max and MHM min on corrupted images with both erosive Gaussian and erosive uniform noise
Entries are MHM min / MHM max recognition percentages.

         Erosive Gaussian noise          Erosive uniform noise
Image    10%       20%       30%         10%       20%       30%
a        100/100   100/100   100/0       100/100   100/0     100/0
b        100/100   100/100   100/100     100/100   100/100   100/100
c        100/100   0/100     0/100       100/100   100/0     100/0
d        100/100   100/100   100/0       100/100   100/0     100/0
e        100/100   100/0     100/0       100/0     100/0     100/0
Table 4. Comparison of system performance using MHM max and MHM min on corrupted images with both random Gaussian and random uniform noise
Entries are MHM min / MHM max recognition percentages.

         Random Gaussian noise           Random uniform noise
Image    10%       20%       30%         10%       20%       30%
a        100/100   100/100   100/100     100/100   0/100     0/100
b        100/100   100/100   0/100       100/100   0/100     0/100
c        100/100   100/100   100/100     100/100   100/100   0/100
d        100/100   100/100   100/100     100/100   100/100   0/100
e        100/100   100/100   100/100     100/100   100/100   0/100
Fig. 6. MAM performance on corrupted “return” image, a) MHM max, b) MHM min
Observing the results, one can conclude that the image recognition system based on MHM max has a perfect recall when the image is corrupted by random noise, which is the type of noise most commonly added to images at the stages preceding recognition in an artificial vision system. The second image set consists of grayscale images of 128x128 pixels, shown in Fig. 7.
Fig. 7. Second test on images. a) Lena, b) Baboon, c) Barbara, d) Elaine, e) Man.
With the aid of the implemented USB interface between the image recognition system and the personal computer, we can visualize the results of each one of the phases of the algorithm, as shown in Fig. 8. Fig. 9 shows the results of the binarization of the test images implemented in the image recognition system. Table 5 shows a perfect recall for the original images. In order to estimate the behavior of the image recognition system in real conditions, the test images were corrupted with both dilative and erosive Gaussian noise. Table 6 compares the system performance using MHM min on corrupted images with different noise percentages. The binarization process affects the performance of the image recognition system; its influence can be estimated by analyzing the results shown in Table 6.
Fig. 8. USB interface between image recognition system and PC
Fig. 9. Image binarization: (a) Lena, (b) Baboon, (c) Barbara, (d) Elaine, (e) Man

Table 5. Image recognition system performance on original images

Image      MHM min performance (recognition percentage)
Lena       100
Baboon     100
Barbara    100
Elaine     100
Man        100
Table 6. Image recognition system performance on corrupted images (MHM min recognition percentage)

           Dilative Gaussian noise       Erosive Gaussian noise
Image      10%    20%    30%             10%    20%    30%
Lena       100    100    100             0      0      0
Baboon     100    100    100             0      0      0
Barbara    100    100    100             0      0      0
Elaine     100    100    100             100    0      0
Man        100    100    100             100    0      0
If an image is corrupted with erosive noise, it is treated by the pattern recognition processor as an image corrupted with dilative noise; consequently, the processor achieves perfect recognition of images corrupted with up to 30% of dilative noise. We used the hardware description language VHDL to model the system. Since it is a standardized language, the design of the image recognition system is portable to any FPGA architecture. Table 7 shows the FPGA design summary obtained with the ISE Foundation tool.

Table 7. FPGA design summary of the image recognition system obtained by the ISE Foundation tool
Feature                        Summary
Device                         XCV300-PQ240
Number of Slices               1337 out of 3072 (43%)
Number of Slice Flip Flops     1025 out of 6144 (16%)
Number of 4-input LUTs         2270 out of 6144 (36%)
Number of bonded IOBs          64 out of 170 (37%)
Number of TBUFs                48 out of 3072 (1%)
Number of GCLKs                1 out of 4 (25%)
Maximum frequency              34.254 MHz
Tables 8, 9 and 10 show the processing speed of the DWT, the MAM learning process and the MAM recognition process. The system can operate over a wide frequency range; therefore, the processing speed is expressed in clock cycles. For each of the considered processing stages, the number of clock cycles is mostly the time consumed by accesses to the system memory.

Table 8. Processing speed of the DWT process

DWT decomposition    Number of clock cycles
One-scale            1,036,800
Two-scale            259,200
Three-scale          64,800
Four-scale           16,200
Total                1,377,000
Table 9. Processing speed of the MAM learning process

MAM learning process    Number of clock cycles
One image               47,250
Five images             236,250
Table 10. Processing speed of the MAM recognition process

MAM recognition process    Number of clock cycles
One image                  33,750
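For orientation, assuming the system is clocked at the 34.254 MHz maximum frequency reported in Table 7, these cycle counts translate into roughly 1,377,000 / 34.254x10^6 ≈ 40 ms for the four-scale DWT, about 6.9 ms for learning five images, and about 1 ms for recognizing one image; these figures are rough estimates derived from the tables, not measurements reported by the authors.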
5 Conclusions

The MAM have demonstrated to be an excellent tool for the recognition and recovery of patterns, due to their useful features. The obtained results confirm that MAM are robust to dilative, erosive or random noise and have a great capacity to save system memory. Additionally, MAM have demonstrated high speed in both the learning and recovery processes. The description of the pattern recognition processor based on MAM using a standardized hardware description language allowed us to design a system with features such as portability and easy adaptation to diverse operating circumstances. Moreover, the hardware implementation makes parallel processing possible, which implies high processing speed and high performance. The combination of all of these features resulted in a robust, fast and reliable artificial vision system that can be used in real-time autonomous systems.
References 1. Wang, L., Der, S.Z., Nasrabadi, N.M.: Automatic Target Recognition Using a FeatureDecomposition and Data-Decomposition Modular Neural Network. IEEE Transactions on Image Processing 7(8), 1113–1121 (1998) 2. Sim, D.-G, Kwon, O.-K., Park, R.-H.: Object Matching Algorithms Using Robust Hausdorff Distance Measures. IEEE Transactions on Image Processing 8(3), 425–429 (1999) 3. Young, S.S., Scott, P.D.: Object Recognition Using Multilayer Hopfield Neural Network. IEEE Transactions on Image Processing 6(3), 357–372 (1997) 4. Young, S.S., Scott, P.D., Bandera, C.: Foveal Automatic Target Recognition Using a Multiresolution Neural Network. IEEE Transations on Image Processing 7(8), 1122–1135 (1998) 5. Barcala, J.M., Alberdi, J., Navarrete, J.J., Oller, J.C.: Clasificación Automática de Envases Plásticos. XX Jornadas de Automática, Comite Español de Automática, España (1999) 6. Foltyniewicz, R.: Automatic face recognition via wavelets and mathematical morphology. In: Proceedings of the 13th International Conference on Pattern Recognition, vol. 2, pp. 13–17 (1996) 7. Hopfield, J.J.: Neural Networks and Physical Systems with Emergent Collective Computational Abilities. Proceedings of the National Academy of Sciences of the USA 79, 2554–2558 (1982) 8. Ritter, G.X., Sussner, P., Díaz de León, J.L.: Morphological Associative Memories. IEEE Transactions on Neural Networks 9(2), 281–293 (1998) 9. Yáñes, C., de León, J.L.D.: Memorias Morfológicas Heteroasociativas. Centro de Investigación en Computación, IPN, México, IT. Serie Verde 57 (2001) (ISBN 970-186697-5) 10. Yáñes, C., de León, J.L.D.: Memorias Morfológicas Autoasociativas. Centro de Investigación en Computación, IPN, México, IT 58, Serie Verde (2001) (ISBN 970-186698-3) 11. Guzman, E., Pogrebnyak, O., Yáñez, C., Moreno, J.A.: Image Compression Algorithm Based on Morphological Associative Memories. In: CIARP 2006. LNCS, vol. 4225, pp. 519–528. Springer, Heidelberg (2006)
12. Ritter, G.X., de León, J.L.D., Sussner, P.: Morphological Bidirectional Associative Memories. Neural Networks 12(6), 851–867 (1999) 13. Castellanos, C., Díaz de León, J.L., Sánchez, A.: Análisis Experimental con las Memorias Asociativas Morfológicas. XXI Congreso Internacional de Ingeniería Electrónica, Electro 1999, México, pp. 11-16 (1999) 14. Pardo, F., Boluda, J.A.: VHDL: lenguaje para síntesis y modelado de circuitos, 2nd edn. Editorial RA-MA, España (2004) (ISBN 84-7897-595-0) 15. Acharya, T., Tsai, P.-S.: JPEG2000 Standard for Image Compression. John Wiley & Sons, Chichester (2005)
Detection and Classification of Human Movements in Video Scenes A.G. Hochuli, L.E.S. Oliveira, A.S. Britto Jr., and A.L. Koerich Postgraduate Program in Computer Science (PPGIa) Pontifical Catholic University of Parana (PUCPR) R. Imaculada Conceição, 1155 Prado Velho 80215-901, Curitiba, PR, Brazil {hochuli,soares,alceu,alekoe}@ppgia.pucpr.br www.ppgia.pucpr.br
Abstract. A novel approach for the detection and classification of human movements in video scenes is presented in this paper. It consists in detecting, segmenting and tracking foreground objects in video scenes to further classify their movements as conventional or non-conventional. From each tracked object in the scene, features such as position, speed, changes in direction and temporal consistency of the bounding box dimensions are extracted. These features make up feature vectors that are stored together with labels that categorize the movement and which are assigned by human supervisors. At the classification step, an instance-based learning algorithm is used to classify the object movement as conventional or non-conventional. For this aim, feature vectors computed from objects in motion are matched against reference feature vectors previously labeled. Experimental results on video clips from two different databases (Parking Lot and CAVIAR) have shown that the proposed approach is able to detect non-conventional human movements in video scenes with accuracies between 77% and 82%. Keywords: Human Movement Classification, Computer Vision, Security.
1 Introduction
The classification of events in video scenes is a relatively new research area in computer science, and it has been growing steadily due to its broad applicability in real life. One of the main reasons is the growing interest in, and use of, video-based security systems, known as CCTV. However, the majority of the CCTV systems currently available in the market have limited functionality, which comprises the capture, storage and visualization of video gathered from one or more cameras. Some CCTV systems already include motion detection algorithms and are able to constrain the recording of video to the moments when variations in the scene foreground are detected. The main utility of such systems is the recording of conventional and non-conventional events for further consultation and analysis.
In other words, such systems do not have any embedded intelligence able to provide a classification of the events, nor do they have mechanisms to warn operators when non-conventional events are occurring. Such an attribute would be very helpful for preventing and detecting, in an active fashion, the occurrence of non-conventional events. Besides the need for a more efficient tool in the security area, the detection of non-conventional events in video scenes could be used in other contexts, such as detecting when an elderly person has an accident inside his/her house [1,2], non-conventional activities in an office, or traffic infractions [3]. A non-conventional event can therefore be viewed as an action that does not belong to the context. The research in this area has been focused on two main streams: state-space modeling and template matching [4]. In the former, most approaches employ Markov processes and state transition functions [1], hidden Markov models [3] and hierarchical hidden Markov models [2] to model categories of non-conventional events inside pre-defined environments. Essentially, an event is characterized by a sequence of actions modeled by a graph called a model; when an event presents a sequence which is not modeled, it is considered non-conventional. The main disadvantage of the model-based approaches is that their use in a novel environment requires remodeling. The latter stream uses an approach based on movement trajectory prototypes [5]. Such prototypes are in fact statistical information about the motion in the time and space domains, such as object centroid position, object edges, and variation in velocity and direction. Based on such information, the algorithm computes the trajectory and matches it against other previously known trajectories. In a similar manner, but with lower complexity, Niu et al. [6] use only the object position and velocity to design curves which describe the motion. The use of people's gait, represented through a histogram, was used to classify non-conventional situations inside a house [7], with the classification carried out through a regular histogram. Although this approach is based on object features and not on object motion, the authors point out that the variation of the distance between the objects and the cameras is a serious drawback that may produce errors in the histogram projections. Therefore, one of the challenges in the automatic analysis of video scenes is the adaptability to different environments as well as real-time classification of the events. In this paper we present a novel approach which can be adapted to different application environments with relative ease and which is able to detect non-conventional human movements in video scenes. The approach has a calibration period, after which it extracts features from the foreground objects moving through the scene. A non-parametric learning algorithm is used to classify the object motion as conventional or non-conventional. The proposed approach has four basic steps: detection and segmentation of foreground objects, tracking of foreground objects, feature extraction from their motion, and classification of the movement as a conventional or non-conventional event.
This paper is organized as follows: Section 2 presents an overview of the proposed approach as well as the main details of the detection, segmentation and tracking algorithms. Section 3 presents the feature extraction while the classification of human movements is discussed in Section 4. Experimental results achieved in video clips from two databases are presented in Section 5. Conclusions and perspective of future work are stated in the last section.
2 System Overview
Video is captured through a camera placed at a strategic point in the environment and its frames are processed. First, there is a step to detect and segment the objects in motion, or foreground objects, whose aim is to look in the video frames for the regions where the objects of interest may be present. These regions are tracked in the subsequent frames; only the regions where the objects of interest may appear are tracked. From such objects of interest, features are extracted not from the objects themselves, but from their motion. Features such as position, velocity, x, y coordinates, variation in direction, and temporal consistency of the bounding box dimensions are extracted to make up feature vectors. Such feature vectors are matched against other feature vectors which have been previously labeled and stored in a database. In this step a temporal window is employed, and the dissimilarities between the feature vectors represent mean values over the temporal window. Using a majority voting rule, the motion of the objects of interest is classified as conventional or non-conventional. Figure 1 presents an overview of the proposed approach and the relationship between the main modules. The main idea of the proposed approach is that such a strategy can be applied to detect some types of human movements without much effort to adapt it to the environment, since we do not take into account specific information from the objects or the scene, but only from the object motion. At present the solution is suited to environments where the flow of people in the camera view is moderate, since our research is focused on movement classification and therefore we do not address situations where overlapping or occlusion may happen.

2.1 Detection and Segmentation of the Foreground Objects
Several approaches to detect motion have been proposed in recent years [8]. However, the main limitation of such techniques is the presence of noise due to variations in the scene illumination, shadows, or spurious artifacts generated by video compression algorithms. In this case, the most elementary techniques, based on background subtraction, yield the detection of several false foreground regions. To minimize the impact of noise, the strategy proposed by Stauffer and Grimson [9] employs Gaussian functions to classify the pixels as belonging to the background or to the foreground. At each frame, the pixels are matched against a mixture of Gaussian distributions according to their variance, standard deviation and weight. All the pixels that can be absorbed by
Fig. 1. An overview of the proposed approach to detect and classify human movements in video scenes
a Gaussian distribution are considered as belonging to the background. If there is no Gaussian distribution that can absorb the pixel, then it is considered a foreground pixel. Gaussian distributions are able to absorb continuous motion, and this is one of the greatest merits of this approach. If there is an object in the scene executing a periodic motion, the blades of a fan for example, after a short time such motion is absorbed by a Gaussian distribution and considered as belonging to the background. However, for objects that present a slow motion, only the edges are highlighted. The central parts of the object are absorbed quickly, resulting in a set of
sparse points of the object. To reconstruct the object without changing its size, as a morphological operation would, a local background subtraction is carried out on these regions. A 3x3 window is applied at each pixel that is not absorbed by a Gaussian distribution, and inside such a window the pixels are subtracted from a fixed background. Thus, if the pixel belongs to an object, the neighboring pixels that were previously absorbed by the Gaussian distribution will be highlighted. In this step we can retrieve the pixels of the object that were absorbed by the Gaussian functions, since a simple background subtraction highlights these pixels again. To eliminate the remaining noise, a 3x3 median filter is applied. The partial result is a set of pixels from the object in motion, possibly with non-connected pixels. A contour detection algorithm based on polygonal approximation is used to ensure that these pixels make up a single object. In such a way, what was before a set of pixels is now a single object, called a blob, which has all its pixels connected. Figures 2 and 3 show in a simplified way the detection and segmentation of foreground objects. Once a blob is identified, it must be tracked while it is present in the camera field of view.
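A comparable foreground-segmentation pipeline can be prototyped in a few lines with OpenCV. The sketch below uses OpenCV's built-in MOG2 mixture-of-Gaussians model as a stand-in for the Stauffer–Grimson formulation and omits the local background subtraction step, so it only approximates the processing described above; the parameter values and the video file name are illustrative assumptions.

```python
import cv2

# Mixture-of-Gaussians background model; OpenCV's MOG2 is used here as a
# stand-in for the Stauffer-Grimson model referenced in the text.
back_sub = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                              detectShadows=False)

def extract_blobs(frame):
    """Return the bounding boxes of the connected foreground regions (blobs)."""
    fg = back_sub.apply(frame)             # per-pixel foreground mask
    fg = cv2.medianBlur(fg, 3)             # 3x3 median filter against noise
    contours, _ = cv2.findContours(fg, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)  # OpenCV 4 signature
    return [cv2.boundingRect(c) for c in contours]           # (x, y, w, h)

# Hypothetical usage on a video file:
# cap = cv2.VideoCapture("parking_lot.avi")
# ok, frame = cap.read()
# while ok:
#     for (x, y, w, h) in extract_blobs(frame):
#         cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
#     ok, frame = cap.read()
```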
Fig. 2. An example of motion detection and segmentation on a video clip from Parking Lot Database: (a) original video frame with objects in motion, (b) motion segmentation by Gaussian distributions, (c) resulting blobs after applying filters, background subtraction and contour detection
2.2 Tracking Objects
Tracking consists in evaluating the trajectory of the object in movement while it remains in the scene. To eliminate objects that are not interesting from the point of view of the tracking, a size filter is applied which discards blobs that are not consistent with the width and height of the objects of interest. The idea of using filters to eliminate undesirable regions was proposed by Lei and Xu [10], where the filtering takes into account the velocity and direction of the motion applied to a cost function. The tracking of the objects in the scene and the prediction of their positions is done by an approach proposed by Latecki and Miezianko [11] with some modifications. Suppose an object $O_i$ in the frame $F^n$, where $O_i$ denotes a tracked object. In the next frame $F^{n+1}$, given $j$
Fig. 3. Motion detection and segmentation in a video clip from CAVIAR Database: (a) original video frame with objects in motion, (b) motion segmentation by Gaussian distributions, (c) resulting blobs after filtering, background subtraction and contour detection
regions of motion, $R_j$, we have to know which $R_j$ represents the object $O_i$ from the preceding frame. The following cost function is used:

$Cost = (w_P \cdot d_P) + (w_S \cdot d_S) + (w_D \cdot d_D) + d_T$  (1)

where $w_P$, $w_S$, and $w_D$ are weights that sum to one, $d_P$ is the Euclidean distance in pixels between the object centers, $d_S$ is the size difference between the bounding boxes of the region of motion and of the object, $d_D$ is the difference between the direction from the object center towards the position estimated by the Lucas-Kanade algorithm [12] and the direction from the object center towards the center of the region of motion, and $d_T$ is the difference of the time to live (TTL) of the object. These parameters are described in more detail as follows.

$d_P = |R_c^j - O_c^i|$  (2)

where $R_c^j$ is the center of the region of motion and $O_c^i$ is the last known center of the object. The value of $d_P$ should not be higher than a proximity threshold measured in pixels. This proximity threshold varies according to the objects being tracked, mainly due to the speed of such objects in the scene.

$d_S = \dfrac{|R_r^j - O_r^i|}{R_r^j + O_r^i}$  (3)

where $R_r^j$ and $O_r^i$ denote the size of the box bounding the region of motion and of the box bounding the object, respectively.

$d_D = |\arctan(O_s^i - O_c^i) - \arctan(R_c^j - O_c^i)|$  (4)

where $O_s^i$ is the object position estimated by Lucas-Kanade, and $O_c^i$ and $R_c^j$ are the last known center of the object and the center of the region of motion, respectively. The value of the angle lies between zero and $2\pi$.

$d_T = TTL_{MAX} - O_{TTL}^i$  (5)
where $TTL_{MAX}$ is the maximum persistence in frames and $O_{TTL}^i$ is the object persistence. If the object is found in the current frame, the value of $O_{TTL}^i$ is set to $TTL_{MAX}$; otherwise it is decreased by one until $O_{TTL}^i$ becomes equal to zero, at which point the object is eliminated from the tracking. $TTL_{MAX}$ was set to 3 times the frames-per-second rate of the video. Each object from the preceding frame must be absorbed by the region of motion in the current frame that leads to the lowest cost. The values of the object and bounding box centers assume the values of the regions of motion. If there is a region of motion that was not assigned to any object, then a new object is created with the values of such a region. If there is an object that was not assigned to any region of motion, such an object may be occluded and the Lucas-Kanade algorithm will fail to predict the corresponding motion. In this case, the motion of such an object is predicted as:

$O_s^i = S \cdot O_s^i + (1 - S) \cdot (R_c^j - O_c^i)$  (6)

where $S$ is a fixed value of the speed. The region of motion $R_c^j$ should be the closest region to the object, respecting the proximity threshold. Then, the new position of the object and of its bounding box is computed as:
$O_c^i = O_c^i + O_s^i$  (7)

$O_r^i = O_r^i + O_s^i$  (8)
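For concreteness, the assignment cost of Eqs. (1)–(5) can be sketched as follows. This is an illustrative software rendering rather than the authors' implementation; the weights, TTL and object/region representation are assumptions, with the numeric values taken from the calibration quoted in Section 5.

```python
import math

W_P, W_S, W_D = 0.4, 0.1, 0.5   # weights quoted in Section 5
TTL_MAX = 45                    # maximum persistence in frames (Section 5)

def direction(dx, dy):
    """Direction of a displacement vector, folded into [0, 2*pi)."""
    return math.atan2(dy, dx) % (2.0 * math.pi)

def assignment_cost(obj, region):
    """Eq. (1): cost of assigning a region of motion R_j to an object O_i.
    `obj` carries its last center (cx, cy), bounding-box size, TTL and the
    Lucas-Kanade position estimate (sx, sy); `region` carries its center
    and bounding-box size."""
    # d_P, Eq. (2): Euclidean distance between the centers.
    d_p = math.hypot(region["cx"] - obj["cx"], region["cy"] - obj["cy"])
    # d_S, Eq. (3): normalized bounding-box size difference.
    d_s = abs(region["size"] - obj["size"]) / (region["size"] + obj["size"])
    # d_D, Eq. (4): angle between the predicted and the observed direction.
    d_d = abs(direction(obj["sx"] - obj["cx"], obj["sy"] - obj["cy"])
              - direction(region["cx"] - obj["cx"], region["cy"] - obj["cy"]))
    # d_T, Eq. (5): penalty for objects that have not been seen recently.
    d_t = TTL_MAX - obj["ttl"]
    return W_P * d_p + W_S * d_s + W_D * d_d + d_t
```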
3 Feature Extraction

Given an interval $t$ of the trajectory of an object of interest, features are extracted from its motion to make up a feature vector denoted by $V_i$. Such a vector is composed of five features:

$V_i = [v_{speed}, v_{posx,posy}, v_{disx,disy}, v_{sizx,sizy}, v_{dir}]$  (9)

where $v_{speed}$ denotes the speed of the object, $v_{posx,posy}$ denotes the $x, y$ coordinates of the object in the scene, $v_{disx,disy}$ denotes the displacement of the object in $x$ and $y$, $v_{sizx,sizy}$ denotes the temporal consistency of the bounding box based on the variation of its $x$ and $y$ dimensions, and $v_{dir}$ denotes the variation in the direction of the object. These features are computed as:

$v_{speed} = \sqrt{(O_c^{i,t-1} - O_c^{i,t})^2} / Q$  (10)

$v_{disx,disy} = O_c^{i,t-1} - O_c^{i,t}$  (11)

$v_{sizx,sizy} = |O_r^{i,t-1} - O_r^{i,t}|$  (12)

$v_{dir} = \arctan(O_c^{i,t-2} - O_c^{i,t-1}) - \arctan(O_c^{i,t-1} - O_c^{i,t})$  (13)
The feature extraction is carried out considering an interval of Q frames. Such a value was defined empirically and set to Q = 3. Figure 4 illustrates the feature extraction process from a video and the generation of feature vectors.
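A direct transcription of Eqs. (9)–(13) could look as follows; the sketch assumes that object centers and bounding-box sizes are available as 2-D NumPy arrays at frames t−2, t−1 and t, and the way the vector-valued quantities are packed into V_i is an illustrative choice, not fixed by the paper.

```python
import numpy as np

Q = 3  # feature-extraction interval in frames

def angle_of(v):
    """Angle of a 2-D displacement vector."""
    return np.arctan2(v[1], v[0])

def extract_features(c_prev2, c_prev, c_curr, r_prev, r_curr):
    """Build one feature vector V_i (Eq. (9)) from the object centers at
    t-2, t-1, t and the bounding-box sizes (width, height) at t-1 and t."""
    v_speed = np.linalg.norm(c_prev - c_curr) / Q                   # Eq. (10)
    v_pos = c_curr                                                  # x, y position
    v_dis = c_prev - c_curr                                         # Eq. (11)
    v_siz = np.abs(r_prev - r_curr)                                 # Eq. (12)
    v_dir = angle_of(c_prev2 - c_prev) - angle_of(c_prev - c_curr)  # Eq. (13)
    return np.concatenate([[v_speed], v_pos, v_dis, v_siz, [v_dir]])
```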
Fig. 4. An overview of the feature extraction process and generation of feature vectors from objects in motion along the scene
4 Motion Classification
The feature vectors generated from the objects in motion are stored in a database to be further used by a non-parametric classifier. In this paper we have used an instance-based learning algorithm due to the simplicity and low dimensionality of the feature vectors. First, a database with reference vectors is generated from the analysis of objects in motion in the video frames. Each reference feature vector has a label assigned to it to indicate whether it represents a conventional (C) or a non-conventional (NC) movement. This database is composed of $Z$ reference feature vectors from both conventional and non-conventional movements. At the classification step, a temporal window is used to classify segments of the motion almost in real time. The classification consists in, given an object in motion, extracting a set of feature vectors $\mathbf{V}$. The number of vectors in the set $\mathbf{V}$ varies according to the size of the temporal window. In our case we have defined a temporal window of twenty-seven frames; that is, the set $\mathbf{V}$ is composed of nine feature vectors (27/Q, where Q equals 3 and represents the feature extraction interval). The classification process is composed of two stages: first, each $V_i \in \mathbf{V}$ is classified using an instance-based approach, more specifically the k nearest neighbor algorithm (k-NN) [13]; next, the majority voting rule is applied to the feature vectors in $\mathbf{V}$ to come to a final decision. For the k-NN algorithm, the Euclidean distance between each feature vector in $\mathbf{V}$ and the $Z$ reference feature vectors stored in the database is computed. The Euclidean distance between a D-dimensional reference feature vector $V_z$ and a testing feature vector $V_i$ is defined as:

$d(V_z, V_i) = \sqrt{\sum_{d=1}^{D} (V_z^d - V_i^d)^2}$  (14)
The k-closest reference feature vectors will label each feature vector in V with their labels. After the classification of all feature vectors in V, a final decision on the motion of the object is given by the vote of each member of the set V,
and the classification ”conventional” or ”non-conventional” is assigned to the object according to the majority vote. For example, if there are seven feature vectors in V classified by the k-NN as non-conventional (NC) and two classified as conventional (C), the final decision is to assign the label ”non-conventional” to the object. Figure 5 illustrates the classification process.
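The two-stage decision (k-NN on each vector of the temporal window, then majority voting) is straightforward to express; the sketch below uses plain NumPy, encodes the labels as 0 = conventional and 1 = non-conventional, and defaults to k = 1 — both encoding and k value are assumptions, since the paper does not state them.

```python
import numpy as np

def knn_label(v, refs, ref_labels, k=1):
    """Label one feature vector by its k nearest references, Eq. (14).
    `refs` is a (Z, D) array and `ref_labels` an integer array of 0/1 labels."""
    dists = np.sqrt(((refs - v) ** 2).sum(axis=1))   # Euclidean distances
    nearest = np.argsort(dists)[:k]
    return int(np.bincount(ref_labels[nearest], minlength=2).argmax())

def classify_window(V, refs, ref_labels, k=1):
    """Classify a temporal window: k-NN on each vector in V, then a
    majority vote over the window (0 = conventional, 1 = non-conventional)."""
    votes = [knn_label(v, refs, ref_labels, k) for v in V]
    return int(np.bincount(votes, minlength=2).argmax())
```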
Fig. 5. The classification process: the Euclidean distance between the feature vector extracted from the object in motion and the reference feature vectors stored in the database
5 Experimental Results
The proposed approach was evaluated on two different databases. The first database consists of CCTV videos where people execute three types of motion: walking, walking in zig-zag and running. These videos were captured in a parking lot through a security camera installed on top of a neighboring building, without any control of illumination and background, with a resolution of 720 x 480 pixels at 30 frames per second, compressed using MPEG2. For each kind of motion, two video clips of 100 seconds were produced, summing up to 200 seconds per type of motion. For each type of movement, one video clip was used to generate the reference feature vectors (training) and the other was used for testing. The video clip lengths and the number of samples for each type of movement are shown in Table 1. The main goal of the experiments is to evaluate the accuracy in detecting non-conventional events. Furthermore, we are also interested in evaluating the discriminative power of the features. Since there is a low number of features, a brute-force strategy was employed to evaluate the feature set. The weights and thresholds described in Section 2.2 were empirically defined on the same video segments used for training; this is the calibration procedure. The $d_P$ proximity threshold was set to 40, $TTL_{MAX}$ to 45, $S$ to 0.9, and the weights $w_P$, $w_S$, $w_D$ to 0.4, 0.1 and 0.5, respectively.
Table 1. Number of samples generated from the Parking Lot and CAVIAR database videos

            Parking Lot            CAVIAR
Event       Training    Test       Training    Test
Walking     94          112        57          23
Running     62          31         –           –
Zig-Zag     77          50         –           –
Fighting    –           –          41          16
Total       233         193        98          39
In spite of having three types of motion in the videos, we have considered a two-class problem where “walking” is considered as a conventional event and walking in zig-zag and running were considered as non-conventional events. The accuracy is defined as the ratio between the number of events correctly classified and the total number of events. Among all possible combinations of the features, for the Parking Lot database, the combination of only two features (speed and variation in the direction) has provided the best discrimination between events (Fig.6). On the other hand the worst accuracy was achieved using only the size of the bounding box. Table 2 presents the confusion matrix for the best combination of features.
Fig. 6. The best and the worst feature combination for Parking Lot Database
The second experiment was carried out on some videos from the CAVIAR database (http://homepages.inf.ed.ac.uk/rbf/CAVIAR/). One of the goals of this experiment is to evaluate the adaptability of the proposed approach to different scenarios as well as to different types of non-conventional events. The video clips were filmed with a wide-angle camera
Table 2. Confusion matrix for the combination of speed ($v_{speed}$) and variation in the direction ($v_{dir}$) features

              Conventional    Non-Conventional
Movement      Walking         Running    Zig-Zag
Walking       90              10         12
Running       3               27         1
Zig-Zag       19              12         19
lens in an entrance lobby. The resolution is half-resolution PAL standard (384 x 288 pixels, 25 frames per second), compressed using MPEG2. For each kind of movement, some videos were used for training while the remaining ones were used for testing. The videos contain people executing two types of action: walking and fighting. The number of samples for each type of action is shown in Table 1. Again, the main goal of the experiments is to evaluate the accuracy in detecting non-conventional events; furthermore, we are also interested in evaluating the discriminative power of the features. Among all possible combinations of the features, for the videos from the CAVIAR database, the combination of three features (coordinates, displacement and dimension of the bounding box) provided the best discrimination between events, while the variation in the direction together with the bounding box provided the worst (Fig. 7). Table 3 presents the confusion matrix for the best combination of features.
Fig. 7. The best and the worst feature combination for CAVIAR Database
In the last experiment, we switched the best feature combinations between the databases to compare the results. We can observe in Fig. 8 that it is not possible to apply the same combination of features to the two databases, but with a simple
Table 3. Confusion matrix for the combination of coordinates ($v_{posx,posy}$), displacement ($v_{disx,disy}$) and variation in the bounding box ($v_{sizx,sizy}$) features

           Conventional    Non-Conventional
Event      Walking         Fighting
Walking    19              4
Fighting   3               13
Fig. 8. The best feature combination switched between the two databases
feature selection, the method is able to choose the best combination of features for each database.
6 Conclusion
In this paper we have presented a novel approach to non-conventional event detection which is able to classify the movement of objects with relative accuracy. Experimental results on video clips gathered from a CCTV camera (Parking Lot) and from the CAVIAR database have shown the adaptability of the proposed approach to different environments. The proposed approach minimizes the use of contextual information, that is, information from the scene and from the objects in movement, giving priority to the adaptability to different scenes with a minimal amount of effort. Although the preliminary results are very encouraging, since we have achieved correct classification rates varying from 77.20% to 82.05% on video clips
captured in different scenes, further improvements are required. Furthermore, broad tests in a variety of environments are also necessary. One of the main sources of errors is the problems related to occlusions; however, this problem was not addressed in this work and it will be the subject of our future work. The use of instance-based learning has led to satisfactory results, and the classification of the events was carried out almost in real time due to the low dimension of the optimized feature vector as well as a database with few reference feature vectors (233 vectors for the first experiment, on the Parking Lot database, and 98 vectors for the second experiment). However, one of the main drawbacks of the proposed approach is the necessity of positive and negative examples, that is, examples of conventional and non-conventional events. Our on-going work is now focusing on the use of one-class classifiers, which are able to model conventional events only, since the capture of non-conventional events in real life is a time-demanding task.
References 1. Hara, K., Omori, T., Ueno, R.: Detection of unusual human behavior in intelligent house. In: IEEE Workshop on Neural Networks for Signal Processing, Martigny, pp. 697–706. IEEE Computer Society Press, Los Alamitos (2002) 2. L¨ uhr, S., Bui, H.H., Venkatesh, S., West, G.A.W.: Recognition of human activity through hierarchical stochastic learning. In: IEEE Annual Conf on Pervasive Computing and Communications, Fort Worth, pp. 416–421. IEEE Computer Society Press, Los Alamitos (2003) 3. Brand, M., Kettnaker, V.: Discovery and segmentation of activities in video. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 844–851 (2000) 4. Aggarwal, J.K., Cai, Q.: Human motion analysis: A review. Computer Vision and Image Understanding 73(3), 428–440 (1999) 5. Mecocci, A., Pannozzo, M., Fumarola, A.: Automatic detection of anomalous behavioral events for advanced real-time video surveillance. In: IEEE Intl Symp on Computational Intelligence for Measurement Systems and Applications, pp. 187– 192. IEEE Computer Society Press, Los Alamitos (2003) 6. Niu, W., Long, J., Han, D., Wang, Y.F.: Human activity detection and recognition for video surveillance. In: IEEE Intl Conf Multimedia and Expo, pp. 719–722. IEEE Computer Society Press, Los Alamitos (2004) 7. Cucchiara, R., Grana, C., Prati, A., Vezzani, R.: Probabilistic posture classification for human-behavior analysis. IEEE Trans. on Systems, Man, and Cybernetics, Part A 35(1), 42–54 (2005) 8. Hu, W., Tan, T., Wang, L., Maybank, S.J.: A survey on visual surveillance of object motion and behaviors. IEEE Trans. Systems, Man, Cybernetics, Part C 34(3), 334– 352 (2004) 9. Stauffer, C., Grimson, W.E.L.: Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 747–757 (2000) 10. Lei, B., Xu, L.Q.: From pixels to objects and trajectories: A generic real-time outdoor video surveillance system. In: IEE Intl Symp Imaging for Crime Detection and Prevention, London, UK, pp. 117–122 (2005)
11. Latecki, L.J., Miezianko, R.: Object tracking with dynamic template update and occlusion detection. In: 18th Intl Conf on Pattern Recognition, Washington, USA, pp. 556–560 (2006) 12. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: 7th Intl Joint Conf Artificial Intelligence, Vancouver, Canada, pp. 674–679 (1981) 13. Aha, D.W., Kibler, D., Albert, M.K.: Instance-based learning algorithms. Machine Learning 6(1), 37–66 (1991)
Image Registration by Simulating Human Vision Shubin Zhao Jiangsu Automation Research Institute, 42 East Hailian Road, Lianyungang, Jiangsu, China 222006
[email protected]
Abstract. In this paper, an efficient and robust algorithm is proposed for image registration, where the images have been acquired at different times, by different sensors and some changes may take place in the scene during the time interval when the images were taken. By simulating human vision behaviors, image registration is carried out through a two-stage process. First, the rotation angles are computed by comparing the distributions of gradient orientations, which is implemented by a simple 1-D correlation. Then, a new approach is presented to detect significant corners in two images and the correspondences are established between corners in two images. At this time, the orientation difference has been removed between the images, so the corner correspondences can be established more efficiently. To account for the false corner correspondences, the voting method is used to determine the transformation parameters. Experimental results are also given. Keywords: image registration, human vision, corner detection.
1 Introduction

Image registration is the process of spatially registering two or more images of the same scene, taken at different times, from different viewpoints, and/or by different sensors. It is a critical step in many applications of image analysis and computer vision, such as image fusion, change detection, video geo-registration, pattern and target localization, and so on. Because of its importance in various application areas and its complicated nature, image registration has been the topic of much recent research. During the last decades, many kinds of approaches have been proposed to address image registration problems, and comprehensive and excellent surveys can be found in [1-2]. There are four important aspects in image registration: 1) transformation space; 2) feature space; 3) similarity measure; and 4) search strategy. The selection of the transformation space is application-dependent. For example, if we know there exists an affine transform between two images, the affine transformation space should be selected to align the images. According to which objects are used to align images, the approaches can be categorized into two classes: intensity-based methods and feature-based ones. Intensity-based methods directly use raw image data (intensity values) as features, and the images are usually compared with a cross-correlation function as similarity
measure. Because the imaging conditions may be quite different and some changes may take place in the scene during the time interval when the images were taken, there are almost inevitably many differences between the images. Consequently, for image registration to be robust, feature-based methods are preferred, whereas intensity-based methods are usually not applicable. The most commonly used features are significant regions [3-4], line segments [5-6], line intersections [7], and edge points [8]. How to select features depends on the given task. Apart from feature selection, the similarity measure plays an important role in image registration. Huttenlocher et al. compare images transformed by translation or translation plus rotation, where edge points are used as matching features and the Hausdorff distance is adopted to measure the similarity of the images [8]. The Hausdorff-distance-based algorithms outperform the cross-correlation-based methods, especially on images with perturbed pixel locations. Moreover, the algorithms should be able to compare partial images when changes have taken place in the scene; that is, differences of some extent will not lead to severe errors in image registration. The partial Hausdorff distance and voting logic can be used to address this issue. In most practical systems, computational cost must be considered, especially when real-time processing is needed. The computational complexity mainly comes from two aspects: the large volume of image data and the high dimensionality of the transformation space. Various search strategies such as multi-resolution methods [9-10] and decomposition of the transformation space can be used to reduce computations and hence speed up the image alignment process. In the last few years, local features have become increasingly popular in various applications, such as wide baseline matching, object recognition and image mosaicking, to name just a few domains. In spite of their recent success, local features have a long history. Image matching using a set of local interest points can be traced back to the work of Moravec [11]. The Moravec detector was improved by Harris and Stephens to make it more repeatable under small image variations and near edges [12]. Harris et al. showed its applications in motion tracking, and the Harris corner detector has since been widely used in many image matching tasks. While these feature detectors are usually called corner detectors, they do not select just corners, but rather any image locations that have large gradients in a few directions at a predefined scale. Zhang et al. showed that it was possible to match Harris corners over a large image range by using a correlation window around each corner to select likely matches [13]. Schmid et al. showed that invariant local feature matching could be extended to general image recognition problems in which a feature is matched against a large database of images [14]. They also used the Harris detector to select interest points, but rather than matching a correlation window, they used a rotationally invariant descriptor of the local image region. This allowed features to be matched under arbitrary orientation change between the two images. More recently, Lowe proposed a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene [15]. The method is based on local 3D extrema in the scale-space pyramid built with difference-of-Gaussian filters.
The features are invariant to image scale and rotation, and are shown to provide robust matching under considerable geometrical distortion and noise. The recognition approach can robustly
identify objects among clutter and occlusion while achieving near real-time performance. In this paper, we deal with an image registration problem in which the images were acquired at different times and even by different imaging devices, and changes of some extent may take place in the scene between the two imaging times. We suppose that the images can be aligned by a similarity transform. An efficient and robust algorithm is proposed to solve this problem, which simulates the behavior of human vision in image registration. By simulating human vision behaviors, we use a two-stage process to perform image registration. First, the rotation angle is computed by comparing the distributions of gradient orientations, which is implemented by a simple 1D correlation. Second, a novel approach is presented to detect significant corners in the two images, and the correspondences are established between corners in the two images. At this time, the orientation difference has been removed, so the corner correspondences can be established more efficiently. To account for false corner correspondences, the voting method is adopted to determine the transformation parameters; that is, the parameters supported by more corner correspondences are accepted as the true ones. The rest of this paper is organized as follows. In section 2, we describe the image registration problem and main ideas of the algorithm. Section 3 shows how to compute the rotation angle. Section 4 introduces a novel corner detector. Experimental results are shown in section 5, and some concluding remarks are given in the end.
2
Problem Formulation and Main Ideas
Given two images to be registered, one of which is called the reference image and the other is called the sensed image, the two images are denoted by I and M , respectively. Mathematically, image registration problem can be formulated as: search a geometric transformation g and an intensity transformation function f , so that
$I(x, y) = f(M(g(x, y)))$  (1)
In most cases, the goal of image registration is aligning two images geometrically (spatial registration), hence the intensity transformation is not always necessary. In some cases, searching for such an intensity transformation is even impossible, especially in the case of multi-sensor image registration. For this kind of problem, all the work of image registration amounts to determining the geometric transformation function g, that is, computing the function g so that the sensed image can be spatially registered with the reference image as well as possible. The transformation may be translation, rotation, scaling, similarity transformation, or other more complex transformations such as affine and perspective transformations. In this paper, we only consider the similarity transformation, which is defined as follows.
$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} t_x \\ t_y \end{pmatrix} + s \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}$  (2)
In this case, the problem amounts to computing the four parameters, which are denoted by $T = (t_x, t_y, s, \theta)$ in the following sections. As mentioned earlier, robust image registration relies on structure features such as edges, corners, line segments and boundaries, not on the intensity values of the original images. If edges are selected as features, we can perform edge detection on the two images, then compare the two images in the feature space (i.e., point sets) and search the transformation space to determine the parameters that best match the sensed image with the reference image, using the partial Hausdorff distance or other similarity measures. This approach is robust in the sense that it works well under noise, illumination variation and changes of some extent in the scene. But this kind of method is computationally demanding, especially for high-dimensional transformation spaces such as similarity, affine and perspective transformations. To reduce computations, distance transforms and other techniques are used; however, the computational costs are still high. In many applications, images are rich with line segments resulting from man-made structures, as shown in Fig. 1. In such cases, human vision first determines the orientation difference between two images and then registers these images by establishing correspondences of significant features, such as line segments and/or corners, and pays no attention to other features.
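For concreteness, applying the similarity transform of Eq. (2) to a set of points for a given parameter vector T = (t_x, t_y, s, θ) can be written as follows; this small helper is an illustrative sketch, not part of the paper.

```python
import numpy as np

def apply_similarity(points, t_x, t_y, s, theta):
    """Map an Nx2 array of points (x, y) through Eq. (2):
    p' = t + s * R(theta) * p, with R = [[cos, sin], [-sin, cos]]."""
    R = np.array([[np.cos(theta),  np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])
    return np.array([t_x, t_y]) + s * points @ R.T
```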
Fig. 1. Images rich with line segments resulting from man-made structures
By simulating human vision behaviors, we perform image registration through a two-stage process. In the first stage, we determine the rotation angle using only the information about the distributions of line orientations in the images, and the process is a simple 1D correlation. The underlying principles of this process are the properties of Hough transform under similarity transformation. To improve the efficiency, we use the distributions of gradient orientations to compute the rotation angle in practice. In this way, the straight line Hough transform, which is of high computational cost, is not necessary. In the second stage, a novel approach is used to detect salient corners in images, and then the transformation parameters can be computed by establishing the correspondences among the corners in two images. Note that the orientation difference has been removed between two images at this time, so the correspondences can be established more efficiently. To account for false corner correspondences, the voting method is adopted to determine the transformation parameters. For our problem, two pairs of corresponding corners uniquely determine the geometric transformation, i.e. one point in the transformation space. The point in transformation
space, i.e. the transformation parameters, voted for by more corner correspondences is accepted as the solution of the image registration problem.
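Since two corner correspondences determine the four parameters of Eq. (2) uniquely, each candidate pair of correspondences can cast one vote in parameter space. The estimate itself can be sketched as below; the quantization of the parameter space used for voting is not specified in the paper and is left open here.

```python
import numpy as np

def rotation_matrix(theta):
    """Rotation matrix in the convention of Eq. (2)."""
    return np.array([[np.cos(theta),  np.sin(theta)],
                     [-np.sin(theta), np.cos(theta)]])

def params_from_pair(p1, p2, q1, q2):
    """Estimate T = (t_x, t_y, s, theta) from two corner correspondences
    p1 -> q1 and p2 -> q2, where q = t + s * R(theta) * p (Eq. (2))."""
    dp, dq = p2 - p1, q2 - q1
    s = np.linalg.norm(dq) / np.linalg.norm(dp)
    # R(theta) in Eq. (2) rotates clockwise, so the angle decreases by theta.
    theta = np.arctan2(dp[1], dp[0]) - np.arctan2(dq[1], dq[0])
    t = q1 - s * rotation_matrix(theta) @ p1
    return t[0], t[1], s, theta

# Each pair of correspondences votes for one (quantized) parameter cell;
# the cell receiving the most votes is accepted as the registration result.
```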
3 Computing the Rotation Angle

The straight line Hough transform (SLHT) is a well-known method for the detection of lines in binary images, and it has also been used to locate and recognize special shapes and to register images [16-18]. Let a line L be represented by
$\rho = x \cos\theta + y \sin\theta$  (3)
Then this line corresponds to the point $(\rho, \theta)$ in the Hough space. In the discrete domain, the SLHT transform of an image is a 2D array, in which each row corresponds to one value of $\rho$, and each column to one value of $\theta$. By simple mathematics, it is obvious that this array has the following three properties:

• Rotation in the image plane corresponds to circular shifting of the array;
• If the image is translated, then the same offset value is added to all points in the same column;
• Scaling the image amounts to scaling every $\rho$ in the array, whereas the orientation remains unchanged.

From these properties, it follows that the change of line orientation under a similarity transformation is just a circular shifting of the array in Hough space. That is, the circular shifting of the array is related only to rotation, and is independent of translation and scaling. Utilizing this fact, we can compute the rotation angle in the Hough space. Because the Hough transform is computationally demanding, we would like to avoid computing it. We know that, if a straight line passes through an edge point, then the gradient orientation at this point is perpendicular to the line with high probability. So, the distributions of gradient orientations in images behave similarly to the Hough transform. For robustness under noise, the distributions of gradient orientations are computed with the gradient magnitudes as weights; in fact, the distribution is just a weighted orientation histogram. Now, the distribution of gradient orientations can be defined as:
$h_i = \sum_{(x, y)} mag(x, y)\, f_i(\phi(x, y))$  (4)

where the sum is over all points of the image, $mag(x, y)$ is the gradient magnitude, and $f_i(\phi(x, y))$ is a function of the gradient orientation $\phi(x, y)$: $f_i(\phi(x, y)) = 1$ if $\phi(x, y)$ lies in the specified orientation interval, and $f_i(\phi(x, y)) = 0$ otherwise.
In practice, $[0, \pi)$ is uniformly decomposed into a number of intervals according to the expected resolution, and the distribution is then computed by (4).
Fig. 2. An example of computing the rotation angle using gradient orientation distributions. (a) and (b): two images to be registered; (c) and (d): gradient orientation distributions of (a) and (b), respectively. Comparing the two distributions gives the rotation angle, 70° in this example.
If we have obtained the distributions of two images, then comparing the two distributions gives the rotation angle $\alpha$. Note that the distributions are periodic in $\pi$, so the true rotation angle may be $\alpha$ or $\pi + \alpha$. An example is shown in Fig. 2.
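The rotation estimate can be prototyped in a few lines: build the magnitude-weighted orientation histogram of Eq. (4) for each image and find the circular shift that maximizes their correlation. The sketch below is illustrative only; the bin count and the Sobel-based gradients are assumed choices, not taken from the paper.

```python
import numpy as np
from scipy import ndimage

N_BINS = 180  # one-degree resolution over [0, pi)

def orientation_histogram(image):
    """Magnitude-weighted histogram of gradient orientations, Eq. (4)."""
    img = image.astype(float)
    gx = ndimage.sobel(img, axis=1)
    gy = ndimage.sobel(img, axis=0)
    mag = np.hypot(gx, gy)
    phi = np.arctan2(gy, gx) % np.pi                     # fold into [0, pi)
    bins = (phi / np.pi * N_BINS).astype(int) % N_BINS
    return np.bincount(bins.ravel(), weights=mag.ravel(), minlength=N_BINS)

def rotation_angle(img_a, img_b):
    """Circular 1-D correlation of the two histograms; the best shift gives
    the rotation angle (up to the pi ambiguity noted above)."""
    ha, hb = orientation_histogram(img_a), orientation_histogram(img_b)
    scores = [np.dot(ha, np.roll(hb, k)) for k in range(N_BINS)]
    return int(np.argmax(scores)) * np.pi / N_BINS       # radians in [0, pi)
```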
4 Corner Detection for Image Registration The success of feature-based image registration algorithms depends considerably on effective feature detection. That is, we must select good features and detect the features with effective and efficient algorithms. Good features should have the properties such as repeatability, locality, distinctiveness, accuracy and efficiency. That is, given two images of the same scene, taken under different imaging conditions and some changes in the scene, a high percentage of the features must be detected in both images, whereas only a small number of features detected in one image are not detected in the other image. It is possible only when the features are local enough to limit the risk of a feature including an occluded part and/or parts corresponding to different objects or surfaces, and thus to allow some variations resulting from different imaging conditions and changes in the scene. Because the feature correspondences between two images need to be established, features must be distinctive so that they can be distinguished and matched.
Corner is increasingly popular in object recognition and image matching. With respect to its practical applications, the corner detection is studied intensively for approximately three decades, and a number of corner detectors have been proposed. For locality and other reasons, these detectors only consider small neighborhoods. But in a small neighborhood, all the measurements would not be precise and reliable enough mainly due to noise. In other words, no detector can make a correct decision whether or not a candidate is a corner without seeing the neighborhood that is big enough. In the image registration applications, images may be acquired under different imaging conditions, at different times, and/or by different devices. So in this case, it is very difficult to ensure the repeatability of corners among images. To satisfy the requirement of image registration, we propose a novel algorithm for robust corner detection. For a candidate point to be a true corner, there must exist enough points within its neighborhood to support it. The point that is relevant and can make contribution must satisfy two conditions: the gradient magnitude should be large enough, and the gradient orientation should be approximately perpendicular to the line passing through this point and the candidate point. This idea is shown in Fig. 3. By combining the gradient orientation and magnitude, the new approach can consider a large neighborhood and in the meanwhile, exclude most clutters and irrelevant features nearby. For each candidate point, the approach computes three values, which are defined as follows. • Average Orientation:
\mu_\phi = \frac{\sum_{P \in \Omega} f(\varphi(P), \phi(P))\,\mathrm{mag}(P)\,\phi(P)}{\sum_{P \in \Omega} f(\varphi(P), \phi(P))\,\mathrm{mag}(P)}    (5)
where Ω is the neighborhood of point O; ϕ(P) and φ(P) are the orientation of OP and the gradient orientation at point P, respectively; and f(ϕ(P), φ(P)) is a function of the two orientations. For example, we can let this function take the value 1 if one orientation is approximately perpendicular to the other, and 0 otherwise.

• Orientation Variation:
\sigma_\phi^2 = \frac{\sum_{P \in \Omega} f(\varphi(P), \phi(P))\,\mathrm{mag}(P)\,[\phi(P) - \mu_\phi]^2}{\sum_{P \in \Omega} f(\varphi(P), \phi(P))\,\mathrm{mag}(P)}    (6)
• Corner Strength:

CS = \sum_{P \in \Omega} f(\varphi(P), \phi(P))\,\mathrm{mag}(P)    (7)
In practice, the gradient magnitude mag and orientation φ are first computed, and all points whose gradient magnitude is larger than a predefined threshold are considered candidates for corners. The candidates are then examined by computing the values of μ_φ, σ_φ^2, and CS. A candidate whose CS is a local maximum and whose σ_φ^2 is greater than a chosen threshold is accepted as a corner.
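The three measures of Eqs. (5)-(7) can be sketched for a single candidate point as follows. This is an illustrative implementation only: the neighborhood radius and the perpendicularity tolerance are assumed parameters, not values from the paper.

import numpy as np

def corner_measures(mag, ori, cx, cy, radius=15, tol=np.deg2rad(10)):
    """Compute mu_phi, sigma_phi^2 and CS (Eqs. 5-7) for candidate (cx, cy).
    mag/ori are gradient magnitude and orientation maps; f(.,.) is taken as 1
    when the gradient at P is roughly perpendicular to the direction of OP."""
    h, w = mag.shape
    ys, xs = np.mgrid[max(0, cy - radius):min(h, cy + radius + 1),
                      max(0, cx - radius):min(w, cx + radius + 1)]
    dx, dy = xs - cx, ys - cy
    inside = (dx != 0) | (dy != 0)                        # exclude the candidate itself
    varphi = np.mod(np.arctan2(dy, dx), np.pi)            # orientation of OP
    phi = ori[ys, xs]                                     # gradient orientation at P
    diff = np.abs(varphi - phi)
    diff = np.minimum(diff, np.pi - diff)                 # angular distance in [0, pi/2]
    f = inside & (np.abs(diff - np.pi / 2) < tol)         # approximately perpendicular
    wgt = mag[ys, xs] * f
    cs = wgt.sum()                                        # Eq. (7)
    if cs < 1e-9:
        return 0.0, 0.0, 0.0
    mu = (wgt * phi).sum() / cs                           # Eq. (5)
    var = (wgt * (phi - mu) ** 2).sum() / cs              # Eq. (6)
    return mu, var, cs

A candidate would then be accepted when its CS is a local maximum among candidates and its orientation variation exceeds a chosen threshold, as described above.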
Fig. 3. Ideas for robust corner detection. Irrelevant points in the neighborhood of the candidate corner will be excluded using the gradient orientation information, though these points have high gradient magnitude. Left: original image; right: image for gradient magnitude.
5 Experimental Results

Many experiments have been conducted to demonstrate the robustness and efficiency of the presented algorithm using real-world images. Due to space limitations, only one experimental result is given in Fig. 4. Here, the reference images were taken by satellite-borne imaging sensors and have been geometrically rectified; the sensed images were taken by a TV camera mounted on an unmanned aerial vehicle. The algorithm works well in our experiments. Although there are significant differences between the two images in imaging conditions, sensor modality, and the scene being imaged, both the determination of the rotation angle and the corner detection are reliable and robust. For 640 × 480 images, an experiment can be carried out within a fraction of a second on a P4 3.0 GHz PC.
6 Concluding Remarks

This work has proposed an efficient and robust image registration algorithm that simulates human vision behaviors. The approach is a two-stage process. Based on the properties of the Hough transform under similarity transformations, the orientation difference between the two images is first removed by a simple 1D correlation of the gradient orientation distributions; then a novel corner detector is used to extract salient corners, and the transformation parameters can be computed by establishing corner correspondences between the two images. In future work, more complex geometric transformations will be dealt with by generalizing the ideas presented in this paper.
Fig. 4. Experimental results using the image registration algorithm presented in this paper: (a) the sensed image; (b) the reference image; (c) and (d) the gradient orientation distributions of (a) and (b); (e) and (f) the corner detection results on (a) and (b); (g) the image registration result
Face and Gesture-Based Interaction for Displaying Comic Books Hang-Bong Kang and Myung-Ho Ju Dept. of Computer Eng. Catholic University of Korea #43-1 Yokkok 2-dong Wonmi-Gu, Bucheon, Gyonggi-Do Korea
[email protected]
Abstract. In this paper, we present human-robot interaction techniques, based on face pose and hand gestures, for efficiently viewing comics through a robot. To control the viewing order of the panels, we propose a robust face pose recognition method using a pose appearance manifold. We represent each pose of a person's face as connected low-dimensional appearance manifolds, which are approximated by affine planes. Face pose recognition is then performed by computing the minimal distance from the given face image to the sub-pose manifolds. To handle partially occluded faces, we generate an occlusion mask and put lower weights on the occluded pixels of the given image when recognizing the pose of an occluded face. For illumination variations in the face, we perform coarse normalization on skin regions using histogram equalization. To recognize hand gestures, we compute the center of gravity of the hand using a skeleton algorithm and count the number of active fingers. We also detect the moving direction of the index finger. The contents in each panel are represented by a scene graph and can be updated according to the user's control. Based on the face pose and hand gesture recognition results, a user can manipulate the contents and view the comics in his or her own style. Keywords: Face pose recognition, Hand gesture recognition, Human robot interaction.
1 Introduction

Recently, various intelligent robots have been developed and are used in applications ranging from industrial manufacturing environments to human environments for service and entertainment. Since robots for entertainment are new media with the mobility to display various contents to audiences, human-robot interaction (HRI) plays an important role in displaying contents through robots. For example, children can read and hear fairy tales, comics, and songs from robots. However, the traditional method of displaying contents through robots is usually linear and limited. For efficient viewing of comics in particular, it is desirable for the user to control the viewing order of the panels and to manipulate objects in a specified panel. To interact effectively with intelligent robots, it is desirable for a robot to recognize the user's face pose and hand gestures.
There has been previous research on face pose and hand gesture recognition. Among face pose recognition approaches, Pentland et al. [1] proposed a view-based eigenspace approach to deal with various face appearances. Moghaddam et al. [2], [3] also suggested various probabilistic visual learning methods for face recognition. Lee et al. [4], [5] presented video-based face recognition using probabilistic appearance manifolds. Their method showed good results in face recognition, but has limitations in estimating robust face poses in natural environments, because face pose detection is very difficult under occlusion and illumination changes. Ju and Kang [6] proposed a robust face pose recognition method for human-robot interaction that works even with partial occlusion or illumination variations. Among hand gesture recognition approaches, Davis and Shah [7] used markers on the finger tips; by detecting the presence and color of the markers, the active fingers in a gesture are identified. Chang et al. [8] used a curvature space method for finding the boundary contours of the hand. This approach is robust but requires a large amount of computing time. Hasanuzzaman et al. [9] used a subspace method to recognize faces and hand gestures for human-robot interaction; the accuracy of their system depends on the accuracy of the pose detection results. Malima et al. [10] proposed a fast algorithm to recognize a limited set of hand gestures for human-robot interaction. Their method is invariant to translation, rotation, and scale of the hand, but has problems with precise hand segmentation. In this paper, we propose a new human-robot interaction system for viewing comics using video-based face pose and hand gesture recognition. Fig. 1 shows our scheme for HRI. The input image is taken from the camera on top of the robot and skin-like regions are extracted. We then use morphological filters to remove noise and holes. Face detection and hand segmentation are executed on the probable face and hand regions. After that, face pose recognition is performed to control the viewing order of the comics, and hand gesture recognition is performed to manipulate objects in the activated panel. The remainder of the paper is organized as follows. Section 2 discusses the face pose appearance manifold and the face pose recognition method under partial occlusion and illumination variations. Section 3 presents our hand gesture recognition method. Section 4 describes our human-robot interaction system for displaying comic books. Section 5 presents experimental results of the proposed method.
2 Video-Based Face Pose Recognition

In this section, we discuss a face pose estimation method based on the face appearance manifold. We also deal with two cases, partial occlusion and illumination variations, for robust control of robots.

2.1 Video-Based Face Pose Recognition

For a given face image, its dimensionality is equal to the number of pixels D in the image. If the face surface is smooth, its appearance can be constrained and confined to an embedded face manifold of dimension d << D, as in [3]. We represent the face pose appearance manifold by a set of simple linear sub-pose manifolds using Principal Component Analysis (PCA). Given the face pose manifolds P^n, the face pose recognition task is to find the sub-pose n^* by computing the minimal distance from the given face image I to the sub-pose manifolds:

n^* = \arg\min_n d^2(I, P^n)    (1)
Since the distance can be represented as a conditional probability [5], Eq. (1) is equivalent to maximizing p(P^n | I), as in

n^* = \arg\max_n p(P^n | I)    (2)

where

p(P^n | I) = \frac{1}{\Lambda} \exp\!\left(-\frac{1}{\sigma^2}\, d^2(I, P^n)\right),

and Λ is the normalization term.
The continuous pose estimation in the video-based face recognition framework is to estimate the current sub-pose manifold P_t^n given the current face image I_t and the previous sub-pose P_{t-1}^m:

P_t^{n^*} = \arg\max_n p(P_t^n | I_t, P_{t-1}^m)
          = \arg\max_n \frac{1}{\Lambda}\, p(I_t | P_t^n, P_{t-1}^m)\, p(P_t^n | P_{t-1}^m)
          = \arg\max_n \frac{1}{\Lambda}\, p(I_t | P_t^n)\, p(P_t^n | P_{t-1}^m)    (3)

where Λ is the normalization term, and the image I_t and P_{t-1}^m are assumed to be independent.
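The selection rule of Eqs. (1)-(3) can be sketched as follows, assuming each sub-pose manifold is summarized by a PCA mean and orthonormal basis and that a transition matrix between sub-poses is available. The conversion from distance to likelihood uses the simple exponential of Eq. (2); all parameter names are illustrative.

import numpy as np

def pca_distance(img_vec, mean, basis):
    """Squared reconstruction distance of img_vec to one sub-pose manifold;
    basis holds the M principal eigenvectors as columns (orthonormal)."""
    centered = img_vec - mean
    coeffs = basis.T @ centered
    return float(np.sum(centered ** 2) - np.sum(coeffs ** 2))

def estimate_pose(img_vec, sub_poses, prev_pose, transition, sigma=1.0):
    """Eq. (3): pick the sub-pose maximizing likelihood times transition prior.
    sub_poses is a list of (mean, basis) pairs; transition[m, n] = p(P^n | P^m)."""
    scores = []
    for n, (mean, basis) in enumerate(sub_poses):
        d2 = pca_distance(img_vec, mean, basis)
        likelihood = np.exp(-d2 / (sigma ** 2))    # Eq. (2), up to a constant
        scores.append(likelihood * transition[prev_pose, n])
    return int(np.argmax(scores))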
Fig. 1. Our proposed face pose and hand gesture recognition scheme (input image → skin-like region detection → noise and hole elimination → face and hand detection → face pose recognition and hand gesture recognition)
The likelihood probability p(I_t | P_t^n) can be estimated using eigenspace decomposition, as in [3]. Using PCA, the principal component feature vector \tilde{y} = \Phi_M^T \tilde{I} is obtained, where \Phi_M^T refers to the submatrix of \Phi containing the M principal eigenvectors and \tilde{I} = I - \bar{I} refers to the mean-normalized image vector. If we assume a Gaussian distribution, the likelihood probability can be represented by the product of two Gaussian densities [3]. In other words,

p(I_t | P_t^n) = \left[ \frac{\exp\!\left(-\frac{1}{2} \sum_{i=1}^{M} \frac{y_i^2}{\lambda_i}\right)}{(2\pi)^{M/2} \prod_{i=1}^{M} \lambda_i^{1/2}} \right] \left[ \frac{\exp\!\left(-\frac{\varepsilon^2(I_t)}{2\rho}\right)}{(2\pi\rho)^{(N-M)/2}} \right]    (4)

where N denotes the dimension of the image space, M denotes the dimension of the sub-pose space, \lambda_i denotes the i-th eigenvalue, \varepsilon^2(I_t) = \sum_{i=M+1}^{N} y_i^2 is the residual reconstruction error, and \rho = \frac{1}{N-M} \sum_{i=M+1}^{N} \lambda_i.
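A direct transcription of Eq. (4) might look like the sketch below, assuming the eigenvectors and the full eigenvalue spectrum are precomputed per sub-pose. The function names and the fallback value for ρ are assumptions, and in practice the log-likelihood would be used to avoid numerical underflow for large N.

import numpy as np

def pose_likelihood(img_vec, mean, eigvecs, eigvals, n_dims):
    """Eq. (4): product of an in-subspace Gaussian (M dims) and an isotropic
    Gaussian on the residual (N - M dims). eigvecs: N x M, eigvals: length N."""
    N, M = n_dims, eigvecs.shape[1]
    centered = img_vec - mean
    y = eigvecs.T @ centered                               # principal components
    eps2 = float(np.sum(centered ** 2) - np.sum(y ** 2))   # residual reconstruction error
    rho = float(np.mean(eigvals[M:])) if len(eigvals) > M else 1e-3
    in_space = np.exp(-0.5 * np.sum(y ** 2 / eigvals[:M])) / (
        (2 * np.pi) ** (M / 2) * np.sqrt(np.prod(eigvals[:M])))
    residual = np.exp(-eps2 / (2 * rho)) / ((2 * np.pi * rho) ** ((N - M) / 2))
    return in_space * residual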
The transition probability between sub-poses, p(P^i | P^j), in Eq. (3) represents the temporal dynamics of the face movement in the training sequence [5]. When two sub-poses of the face are not connected, the transition probability is 0.

2.2 Handling of Partially Occluded Faces

For robust face pose recognition, it is necessary to handle faces partially occluded by hands or other objects [6]. The intensity of an occluded pixel differs from that of the corresponding pixel in the training pose data. When recognizing the face pose of a given image, we generate an occlusion mask in which the value of each pixel represents the degree of occlusion of that pixel. According to the occlusion mask, we put lower weights on occluded pixels when computing the distance d^2(I, P^n) in Eq. (1). At the start, the initial face data is assumed to be the frontal pose, and an occlusion mask is constructed from the initial image and the frontal face pose of the training data. After that, the occlusion mask is used to recognize the pose of the next face data. This is shown in Fig. 2. To compute the degree of occlusion OD_i, we define the intensity difference ID_i at pixel i as
ID_i = \lVert I_i - E_i \rVert    (5)

To normalize the intensity difference at each pixel, Eq. (5) becomes

ID_i = \left( \frac{I_i - \mu_i}{\sigma_i} \right)^2    (6)
where I_i is the intensity value at pixel i, and \mu_i and \sigma_i are the mean and variance of pixel i in the training data, respectively. If a pixel's intensity difference is larger than the threshold value, it is determined to be an occluded pixel. If we assume that the distribution of ID_i is Gaussian, the degree of occlusion OD_i of the i-th pixel is computed as

OD_i = \begin{cases} 1 - \exp\!\left(-\dfrac{ID_i - th}{2\sigma_{th}^2}\right) & \text{if } ID_i \ge th \\ 0 & \text{otherwise} \end{cases}    (7)

where \sigma_{th}^2 is the variance of the pixel differences smaller than the threshold. To determine the threshold value th in Eq. (7), we compute the histogram of ID_i from the sample data of the sub-pose manifold, and the 95% point of the accumulated histogram is selected as the threshold value.

2.3 Pose Recognition of Face Data with Illumination Variations

Handling illumination changes is another important factor for robust face pose recognition. Under practical imaging conditions, the image differences due to changing illumination may be critical in the face pose estimation. In addition, if no preprocessing is performed on face data with illumination variations, the recognition results will be poor, because the model does not capture the non-linear variation in face pose appearance due to illumination changes. Similar to Arandjelovic et al. [11], we remove the background from the bounding box of a detected face by set-specific skin color segmentation and then normalize for global illumination changes by histogram equalization. The removal of the background is helpful in face pose recognition because only face appearance information is used in computing the probabilistic distance of Eq. (2).
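The occlusion weighting of Eqs. (5)-(7) can be sketched as follows, assuming per-pixel training means and variances are available. For simplicity the threshold is taken from the 95th percentile of the current difference map, whereas the paper derives it from the sub-pose training samples; that simplification is an assumption of this sketch.

import numpy as np

def occlusion_mask(face, mean, var, th=None):
    """Degree-of-occlusion map (Eq. 7) from normalized squared differences (Eq. 6).
    face, mean, var are same-sized arrays."""
    ID = ((face - mean) ** 2) / (var + 1e-9)               # Eq. (6)
    if th is None:
        th = np.percentile(ID, 95)                         # 95% point of the histogram
    below = ID[ID < th]
    sigma_th2 = below.var() if below.size else 1.0
    OD = np.where(ID >= th, 1.0 - np.exp(-(ID - th) / (2 * sigma_th2)), 0.0)
    return OD

# Occluded pixels then receive weight (1 - OD) when computing d^2(I, P^n) in Eq. (1).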
3 Hand Gesture Recognition

In this section, we present our hand gesture recognition method. Our aim is to recognize a small set of hand gesture commands in real time. We first segment skin-like regions based on skin color statistics and remove false hand candidates using a size constraint. Then, we detect precise hand regions and compute the center of gravity (COG) of each hand region. After that, we extract the farthest point and the nearest point from the COG. Based on these features, we count the number of active fingers in the hand.

3.1 Hand Segmentation

To detect hand candidates, we find the pixels that are likely to belong to skin-like regions. The YCbCr color representation is used for skin-like region segmentation because it provides an effective use of chrominance information for modeling human skin color. The RGB image taken from the video camera is converted to YCbCr color space, and skin areas are determined by the skin color range. Fig. 3 shows the YCbCr distribution of a sample skin region. The chrominance components Cb and Cr play important roles in detecting skin-like regions in the color image.
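A minimal sketch of such a Cb/Cr range test is given below. The conversion follows the standard ITU-R BT.601 formulas, and the bounds are common illustrative values rather than the calibrated range used by the system described here.

import numpy as np

def skin_mask(rgb, cb_range=(77, 127), cr_range=(133, 173)):
    """Rough skin-likelihood mask by thresholding Cb/Cr of an RGB image array."""
    r, g, b = [rgb[..., i].astype(float) for i in range(3)]
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return ((cb >= cb_range[0]) & (cb <= cb_range[1]) &
            (cr >= cr_range[0]) & (cr <= cr_range[1]))

The resulting binary mask would then be cleaned with morphological dilation and erosion before extracting the largest connected regions, as described in the following paragraphs.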
Fig. 2. Pose recognition from the partially occluded face
We extract the three largest connected skin-like regions and remove false positives using size constraints. Noise and holes are filtered by morphological dilation and erosion operations. From the three large regions, we remove the face region, which is detected from the gray image using Viola and Jones' method [13]. The remaining regions are the probable hand regions.

3.2 Hand Gesture Recognition

To recognize a hand gesture, it is necessary to segment the hand region precisely from the probable hand regions, because forearm features do not carry important information about the hand pose. Fig. 4 shows the probable COG of a hand region. We first compute the skeleton of the hand region and then find a circle that covers the hand region along the skeleton axis. The circle is small at the wrist point and large in the hand region. When we find the largest circle covering the hand region along the skeleton axis, as in Fig. 4, the center of that circle is assigned as the COG of the hand, and the hand region is segmented at the wrist point. From the COG, we compute the distances to the most extreme points, i.e., the farthest point and the nearest point of the hand. This is shown in Fig. 5. Usually the farthest point is the tip of the longest active finger in the particular gesture. To count the number of active fingers, we draw a circle centered on the COG, as in [10]. The radius of the circle is 0.7 times the farthest distance, i.e., the distance from the COG to the farthest point. This is shown in Fig. 6. After that, we can count the intersection
Fig. 3. YCbCr distribution for the sample skin region
Fig. 4. COG of the hand
areas to determine the number of active fingers in the hand. To detect a fist, we examine the relationship between the nearest distance and the farthest distance: if the relationship in Eq. (8) is satisfied, we classify the hand gesture as a fist.
Farthest_dist < 1.7 × Nearest_dist    (8)
We can thus classify hand gestures into a fist or by the number of active fingers. These classifications are used to control object manipulation in the activated panel of the comics.
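The fist test of Eq. (8) and the circle-intersection count can be sketched as follows, assuming the hand boundary has already been extracted as an ordered contour and its COG computed. The ratios 1.7 and 0.7 follow the text; the function and variable names are illustrative.

import numpy as np

def classify_gesture(contour, cog, circle_ratio=0.7, fist_ratio=1.7):
    """contour: (K, 2) ordered boundary points of the segmented hand; cog: (2,).
    Returns 'fist' or 'count N' using Eq. (8) and the circle-intersection rule."""
    d = np.linalg.norm(contour - cog, axis=1)
    farthest, nearest = d.max(), d.min()
    if farthest < fist_ratio * nearest:                    # Eq. (8)
        return "fist"
    radius = circle_ratio * farthest
    outside = d > radius
    # each extended finger gives one contiguous run of boundary points outside the circle
    runs = np.count_nonzero(outside & ~np.roll(outside, 1))
    return "count %d" % runs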
Fig. 5. The farthest point, the nearest point and COG
Fig. 6. Two active fingers: “Count 2”
4 Face Pose and Hand Gesture Based Human Robot Interaction

To control the viewing style of the comics from the wheeled robot, we use face pose and hand gestures. In this section, we present our human-robot interaction methods for viewing comics using face pose and hand gestures.

4.1 Face Pose and Hand Gesture-Based Control

A user views comics on a wheeled robot equipped with a web camera, such as the ER1 [12]. Fig. 7 shows our scheme. The input image is taken from the camera and skin-like regions are extracted as stated in Section 3. Then, we use morphological filters to remove noise and holes. Face detection and hand segmentation are executed on the probable face and hand regions. A face candidate is detected from the gray image using Viola and Jones' method [13]. Hand segmentation is performed by finding a large circle along the skeleton of the probable hand regions, as stated in Section 3. After that, face pose recognition is performed to control the viewing order of the comics, and hand gesture recognition is performed to manipulate objects in the activated panel. The face pose controls the activation of panels in the comics in four directions: the up, down, right, and left panels. Hand gestures control the activated panel by "zoom in" or "zoom out" using "fist" and "count 5". When the panel is zoomed in, the robot moves toward the user by about 20 cm, and the user can then move an object up, down, left, or right. Translation of the activated object is driven by the index finger's direction, as in Fig. 8. Scaling is performed by "count 3" and "count 4". Rotation is performed using two hands showing "fist" and "count 5". When the object is rotated, the robot also moves in a circle. According to the user's various manipulations, new object-based comic contents are created.

4.2 Scene Graph-Based Object Representation

The scene in each panel of the comics is represented by a scene graph. Fig. 9 shows an example of the scene graph, which is a simplified version of that used in VRML97 [14]. Each leaf node in the scene graph has information about the location, rotation, and scale of an object. According to the user's hand gesture, the activated object is transformed by translation, rotation, or scaling. The modified scene is maintained by updating the scene graph, as in Fig. 9(b). The attributes in each node have two values: one is the original value, and the other is the current value. By maintaining the scene graph, we can easily recover the original panel of the comics.
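A leaf node that keeps both original and current transform values, as described in Section 4.2, might be sketched as follows. The field and method names are assumptions; the actual system follows the VRML97-style node structure of Fig. 9.

from dataclasses import dataclass, field

@dataclass
class SceneNode:
    """Leaf node of the panel scene graph; keeps original and current values
    so the original panel can be recovered after user manipulation."""
    name: str
    location: tuple = (0.0, 0.0)
    rotation: float = 0.0
    scale: float = 1.0
    original: dict = field(default_factory=dict)

    def __post_init__(self):
        self.original = {"location": self.location,
                         "rotation": self.rotation,
                         "scale": self.scale}

    def translate(self, dx, dy):          # e.g. driven by the index-finger direction
        x, y = self.location
        self.location = (x + dx, y + dy)

    def reset(self):                      # recover the original panel
        self.location = self.original["location"]
        self.rotation = self.original["rotation"]
        self.scale = self.original["scale"]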
5 Experimental Results

We tested our face pose and hand gesture driven comics viewing method on the wheeled robot ER1. Fig. 10 shows an example of human-robot interaction using face pose and hand gestures for viewing comics. The wheeled robot's camera images are transmitted to the computer, and the commands for the viewing order of the comics and the robot's movement are computed from the face pose and hand gesture recognition results.
Fig. 7. Our proposed face pose and hand gesture recognition
Fig. 8. Four translation directions (up, down, left, right) using an index finger
Fig. 9. Scene Graph. (a) original, (b) updated scene graph according to translation.
To test face pose recognition, 12 image sequences were captured from 6 subjects using the web camera on the robot. The image resolution is converted to 320 × 240 pixels. The duration of each sequence is about 40 seconds and the frame rate is 15 frames per second. For pose appearance manifold learning, we picked 2 sequences from each person as training sequences and cropped the face images. The face images are normalized to 20 × 20 pixels. We then construct 5 sub-pose manifolds from the cropped images using PCA. For each manifold, the dimension of the sub-pose space M in Eq. (4) is set to 20. To control the viewing order with the face pose information, the user's face must first be detected and then recognized. After that, a face tracking process is executed for video-based face pose recognition. For face tracking, 80 sample windows with different sizes and orientations are created around the previous face position. Each sample window is converted to a 20 × 20 window and skin regions are extracted. For the skin regions, the histogram is equalized. Then, the sample windows are compared with the identified person's face sub-pose manifold; at this point, the face sub-pose manifold is the face pose of the previous frame. If the minimal distance between the sample window and the sub-pose appearance manifold is larger than the threshold value, we conclude that the tracked face is wrong and a new face detection process is executed. Otherwise, the face sample window is determined to be correct and the current face pose is estimated using Eq. (3). We evaluated face pose recognition on 10 sequences captured from the 6 persons. Table 1 shows our face pose recognition results. Our proposed method achieves a 93.8% face pose recognition rate on average.
Fig. 10. Face pose and hand gesture based human robot interaction

Table 1. Pose estimation results

Pose (Total Frame No.)   Correct Frame No.   Proposed method
0 (560)                  527                 94.11 %
1 (1,263)                1,202               95.17 %
2 (575)                  570                 99.13 %
3 (331)                  287                 83.99 %
4 (430)                  387                 90.00 %
Total (3,159)            2,964               93.83 %
Table 2. Face pose recognition results for partially occluded faces

Pose    Proposed method   Non-mask
0       92.98%            81.02%
1       93.32%            84.42%
2       93.98%            80.66%
3       69.33%            68.48%
4       93.92%            88.63%
Total   92.11%            82.85%
We also tested our face pose recognition on partially occluded faces. The occlusion mask, as in Fig. 2, is generated by detecting occluded pixels. The face pose recognition results are shown in Table 2; our method achieves a desirable recognition rate of 92.11%. For faces under various illumination variations, our face pose recognition results are shown in Table 3. Our proposed method improves the recognition rate in comparison with the non-histogram-equalization method.
Table 3. Face pose recognition results for illumination-changed faces

Pose    Proposed method   Non-Histogram Equalization
0       95.79%            85.31%
1       87.34%            91.67%
2       96.69%            90.42%
3       80.50%            65.93%
4       94.21%            51.75%
Total   91.77%            82.35%
Finally, hand gesture recognition was conducted on 102 samples from 6 members of our laboratory under normal conditions. We obtained 92.15% correct recognition accuracy.
6 Conclusion

In this paper, we proposed a new viewing control method for comics on a robot using face pose and hand gesture recognition. Our face pose recognition method is based on the face pose appearance manifold, which is approximated by connected low-dimensional affine planes. The face pose is estimated by computing the minimal distance from the given image to the face sub-pose manifolds. For robust face pose recognition, we used an occlusion mask for partially occluded faces and a coarse illumination normalization method with histogram equalization for skin-like regions. Empirical evaluation on image sequences from 6 subjects has shown that the proposed method is successful in face pose recognition across illumination changes and partial occlusion. Our hand gesture method also shows good performance in manipulating objects in the panels of the comics. Our proposed system works at near real-time and shows desirable results in several real situations. Using our proposed system, a user can view comics in his or her own style. It is worth noting that the proposed method provides various viewing methods for reading comics through robots. The main direction for future work is to handle large head-pose variations and severe illumination variations to facilitate more natural user interaction with consumer devices. Finally, it may prove beneficial to incorporate an on-line learning scheme when constructing the face pose appearance manifold.

Acknowledgments. This work was supported by the Culture Research Center Project, the Ministry of Culture & Tourism and the KOCCA R&D program in Korea.
References

1. Pentland, A., Moghaddam, B., Starner, B.: View-based and modular eigenspaces for face recognition. In: Proc. IEEE Conf. CVPR. IEEE Computer Society Press, Los Alamitos (1994)
2. Moghaddam, B., Pentland, A.: Probabilistic visual learning for object recognition. IEEE Trans. PAMI (1997)
3. Moghaddam, B.: Principal manifold and probabilistic subspaces for visual recognition. IEEE Trans. PAMI (2002)
4. Lee, K., Ho, J., Yang, M., Kriegman, D.: Video-based face recognition using probabilistic appearance manifolds. In: Proc. IEEE Conf. CVPR. IEEE Computer Society Press, Los Alamitos (2003)
5. Lee, K., Kriegman, D.: Online learning of probabilistic appearance manifold for video-based recognition and tracking. In: Proc. IEEE Conf. CVPR. IEEE Computer Society Press, Los Alamitos (2005)
6. Ju, M., Kang, H.-B.: A new partially occluded face pose recognition. In: ACIVS 2007. LNCS, vol. 4678. Springer, Heidelberg (2007)
7. Davis, J., Shah, M.: Visual Gesture Recognition. IEE Proc. Vis. Image Signal Processing 141(2) (1994)
8. Chang, C., Chen, I., Huang, Y.: Hand Pose Recognition Using Curvature Scale Space. IEEE ICPR (2002)
9. Hasanuzzaman, Md., Zhang, T., Amporanaramveth, V., Bhuiyan, M., Shirai, Y., Ueno, H.: Face and gesture recognition using subspace method for human-robot interaction. In: Aizawa, K., Nakamura, Y., Satoh, S. (eds.) PCM 2004. LNCS, vol. 3331. Springer, Heidelberg (2004)
10. Malima, A., Ozgur, E., Cetin, M.: A fast algorithm for vision-based hand gesture recognition for robot control. In: Proc. IEEE Conf. Signal Proc. and Comm. Applications (April 2006)
11. Arandjelovic, O., Shakhnarovich, G., Fisher, J., Cipolla, R., Darrell, T.: Face recognition with image sets using manifold density divergence. In: Proc. IEEE Conf. CVPR. IEEE Computer Society Press, Los Alamitos (2005)
12. Evolution Robotics, http://www.evolution.com
13. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proc. CVPR (2001)
14. ISO/IEC 14772-1, The Virtual Reality Modeling Language (VRML) (1997), http://www.vrml.org/specifications/VRML97
Better Foreground Segmentation for 3D Face Reconstruction Using Graph Cuts Anjin Park, Kwangjin Hong, and Keechul Jung* School of Digital Media, College of Information Science, Soongsil University 156-743, Seoul, S. Korea {anjin,hongmsz,kcjung}@ssu.ac.kr
Abstract. Research on image-based 3D reconstruction has recently shown many good results, but it assumes that precise target objects have already been segmented from each input image. Traditionally, background subtraction has been used to segment the target objects, but it can yield serious problems such as noise and holes. To segment the target objects precisely, graph cuts have recently been used. Graph cuts have shown good results in many engineering problems, as they can globally minimize energy functions composed of data terms and smooth terms, but it is difficult to obtain automatically the prior information necessary for the data terms. Depth information generated by stereo vision has been used as prior information and shows good results in such experiments, but it is difficult to calculate depth information for 3D face reconstruction, as most faces have homogeneous regions. In this paper, we propose a better foreground segmentation method for 3D face reconstruction using graph cuts. The foreground objects are approximately segmented from each background image using background subtraction to assist in estimating the data terms of the energy function, and noise and shadows are removed from the segmented objects to reduce errors in the prior information. Removing the noise and shadows may cause a loss of detail in the foreground silhouette, but the smooth terms, which assign high costs to label discontinuities between similar neighboring pixels, can fill out the lost silhouette. Consequently, the proposed method can segment more precise target objects by globally minimizing an energy function composed of smooth terms and approximately estimated data terms using graph cuts. Keywords: Foreground Segmentation, Graph Cuts, Shadow Elimination, 3D Face Reconstruction.
1 Introduction

Since the world we live in is a 3D space, we can feel reality and obtain more valuable information from 3D objects. For this reason, research on 3D reconstruction of regions of interest, such as various objects, human bodies, and faces, has recently attracted a lot of attention in the fields of computer vision and graphics. In particular, image-based 3D reconstruction, which reconstructs 3D objects from images captured by
Corresponding author.
multi-cameras, has shown good results with the rapid growth of computing power and stable reconstruction algorithms [1]. Image-based 3D reconstruction has shown good results, but it assumes that precise target objects, such as face regions for 3D face reconstruction, have already been segmented from each input image, and traditionally background subtraction has been used to segment the objects. Background subtraction first builds a model of the static background, either off-line or updated dynamically, and then compares a new input image with the background model on a per-pixel basis [2]. However, such methods tend to produce unstable and error-prone results, including false foreground blobs and holes in the segmented foregrounds caused by camera noise or low contrast with background areas [3].

For better results, researchers have used graph cuts for foreground segmentation [4-10]. Graph cuts can globally minimize energy functions composed of data terms and smooth terms, and they show high precision rates if prior information is given for the data terms of the energy function. Li et al. [4], Boykov and Funka-Lea [5], and Rother et al. [6] proposed interactive image segmentation tools using graph cuts. In the tools proposed by Li et al. [4] and Boykov and Funka-Lea [5], a user marks a few lines on the image by dragging the mouse cursor while holding a button, e.g., the left button indicates the foreground and the right button indicates the background. Rother et al. [6] proposed an interactive tool in which the user loosely drags a rectangle around an object and then merely touches the foreground in a few locations. The above-mentioned methods show high precision rates in segmenting the foreground using graph cuts, but they cannot segment the foreground automatically, which restricts their scope of applicability, as users must manually give prior information by lines [4,5] or rectangles [6]. Recently, depth information generated by stereo vision techniques has been used as the prior information for the data terms [7,8], based on the assumption that objects within a specific range of depths are the target objects. These methods showed good results in their experiments, but it is difficult to calculate accurate depth information in 3D face reconstruction, as most faces have homogeneous colors and textures, which means that accurate prior information cannot be obtained. As another approach to foreground segmentation using graph cuts, Howe and Deschamps [10] used the difference between the background image and the input image as prior information for the background, and used given parameters for the foreground. This approach is therefore very sensitive to camera noise and produces many holes within the segmented foreground owing to inaccurate prior information for the foreground. Sun et al. [11] used Gaussian mixture models (GMM) for the prior information of the background, which is robust to noise, but used constant values as the prior information of the foreground based on an assumption of uniform distribution in the appearance of foreground objects, which restricts the applicability of the method.

This paper uses Volume Intersection [1] and Object Carving [9], which are among the most popular methods, instead of stereo vision techniques. Volume Intersection tentatively reconstructs the 3D face using the silhouette of the face, and thus needs accurate silhouette information. Object Carving is used for the definitive 3D face reconstruction and needs accurate foreground objects without noise and holes. Accordingly, this paper proposes a better foreground segmentation method for 3D face reconstruction using graph cuts. To improve the precision of foreground segmentation, more
accurate prior information for the foreground objects is needed instead of given parameters. To this end, the foreground objects are approximately segmented from each background image using background subtraction, and noise and shadow regions are removed from the segmented objects to reduce the errors in the prior information for the foreground objects. Removing the noise and shadows may cause a loss of legitimate detail in the foreground silhouette, but the smooth terms, which assign high costs to label discontinuities between similar neighboring pixels, can fill out the lost silhouette. Consequently, the proposed method estimates more accurate prior information for the foreground objects and fills out the lost parts of the foreground based on the smooth terms. Moreover, graph cuts can segment more precise foreground objects by globally minimizing the energy function composed of the smooth terms and the data terms.

The remainder of this paper is organized as follows. The proposed foreground segmentation method using graph cuts is described in Section 2. Experimental results and 3D face reconstruction using Volume Intersection and Object Carving are presented in Section 3, and final conclusions are given in Section 4.
2 Foreground Segmentation

This paper uses graph cuts to improve the accuracy of foreground segmentation. Generally, when graph cuts are used for computer vision problems, two problems are encountered: how to express the energy function, and how to obtain the prior information for the data terms automatically. Section 2.1 describes how to express the energy function for foreground segmentation, and Section 2.2 describes how to obtain the prior information for the foreground and background.

2.1 Energy Function for Foreground Segmentation

This paper considers foreground segmentation as a labeling problem. The labeling problem is to assign a label from a set of labels {'fg', 'bg'}, where fg means the foreground and bg means the background, to each pixel in an input image P. The labeling is denoted by F = {f_1, f_2, ..., f_{D_n}}, where D_n is the number of pixels and each element of F can take one of the labels. To assign a specific label to each node, the foreground segmentation problem is first expressed as an energy function (Eq. 1).
E(F) = \sum_{p \in P} D_p(f_p) + \lambda \sum_{\{p,q\} \in N} V_{p,q}(f_p, f_q)    (1)
Here, D_p(f_p) is a data term that reflects how each pixel fits the prior data given for each label. In other words, D_p(f_p) has a low cost if pixel p is similar to label f_p, and a high cost if pixel p is not similar to label f_p. V_{p,q}(f_p, f_q) is a smooth term that reflects discontinuities between neighboring pixels p and q, and has a high cost if the two neighboring pixels are similar. λ specifies the relative importance of the two energy terms, and N is the set of all pairs of neighboring elements in P, called n-links (neighborhood links). To minimize the energy function via graph cuts, a graph G = ⟨υ, ε⟩ is first created with nodes corresponding to the pixels. Two distinguished nodes, the source (S) and
the sink (T), called terminals, are also needed to represent the two labels, and each node has two additional edges, called t-links (terminal links), {p, S} and {p, T}. Therefore, υ = P ∪ {S, T} and ε = N ∪ ⋃_{p ∈ P} {{p, S}, {p, T}}. The weights of the graph are set for both the t-links and the n-links: the t-links connecting each terminal and each node correspond to the data terms that indicate the label preferences of each pixel, and the n-links connecting neighboring nodes correspond to the smooth terms that indicate discontinuities between neighboring nodes. The graph G is then completely defined, and specific labels are assigned to the two disjoint sets connected to S and T by finding the minimum cost cut in the graph via graph cuts [13]. Graph cuts find the cut with the minimum cost among all cuts, and the minimum cost cut problem can be solved by finding the maximum flow from S to T, based on the theorem of Ford and Fulkerson [14]. The maximum flow (Fig. 1) saturates a set of edges in the graph, dividing the nodes into two disjoint parts corresponding to the minimum cut, and the value of the maximum flow is equal to the cost of the minimum cut [14]. Thus, since the maximum flow in a graph can assist with energy minimization for the labeling problem, the proposed method uses the maximum flow for global minimization of the energy function. Fig. 1 shows the pseudo code of the basic Ford-Fulkerson algorithm [14], where f(u, v) and ε(u, v) denote the flow and the weight between two vertices u and v, respectively.

FORD-FULKERSON(G, S, T)
  For each edge (u, v) ∈ ε
    f(u, v) = 0
    f(v, u) = 0
  End
  While an augmenting path (minimum path) p from S to T exists in the graph G
    c_f(p) = min{ε(u, v) : (u, v) is in p}
    For each edge (u, v) in p
      f(u, v) = f(u, v) + c_f(p)
      f(v, u) = -f(u, v)
    End
    Re-construct the graph G by updating the residual capacities from f and ε of the previous network
  End

Fig. 1. Pseudo code of the basic Ford-Fulkerson algorithm [13]
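A compact, runnable variant of the scheme in Fig. 1 is sketched below using breadth-first search to pick the augmenting path (the Edmonds-Karp strategy). It operates on a small adjacency-matrix capacity graph and is meant only to illustrate the idea; practical graph-cut systems use specialized max-flow solvers such as [12].

from collections import deque

def max_flow(capacity, s, t):
    """Repeatedly augment along a shortest S-T path.
    capacity is an n x n list of lists of residual capacities (modified in place)."""
    n, flow = len(capacity), 0
    while True:
        parent = [-1] * n
        parent[s] = s
        q = deque([s])
        while q and parent[t] == -1:
            u = q.popleft()
            for v in range(n):
                if parent[v] == -1 and capacity[u][v] > 0:
                    parent[v] = u
                    q.append(v)
        if parent[t] == -1:                      # no augmenting path left
            return flow
        bottleneck, v = float("inf"), t          # minimum residual capacity on the path
        while v != s:
            u = parent[v]
            bottleneck = min(bottleneck, capacity[u][v])
            v = u
        v = t
        while v != s:                            # update residual capacities
            u = parent[v]
            capacity[u][v] -= bottleneck
            capacity[v][u] += bottleneck
            v = u
        flow += bottleneck

After termination, the nodes still reachable from S in the residual graph form one side of the minimum cut, which gives the foreground/background labeling.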
2.2 Automatic Retrieval of Prior Data from the Image

Prior information for each label is required to assign suitable values to the data terms of the energy function; in previous approaches it was obtained manually using interactive tools, e.g., the foreground and background seeds in [4]. To obtain prior information for foreground segmentation automatically, methods based on modeled background images have been proposed [10,11]. Howe and Deschamps [10] used the difference δ_p between the background image and the input
image as prior information for the background, and the data terms are defined as follows:

D_p(f_p) = \begin{cases} \delta_p & \text{if } f_p = \text{'bg'} \\ 2\tau - \delta_p & \text{if } f_p = \text{'fg'} \end{cases}    (2)

This term is very sensitive to noise, as it uses just the difference as prior information for the background, and it produces many holes within the segmented foreground owing to inaccurate prior information for the foreground, which is assigned by a given parameter τ. Moreover, the smooth terms cannot express discontinuities between neighboring pixels, as they use the Potts model, which is assigned by just the given parameters τ and α (Eq. 3).

V_{p,q}(f_p, f_q) = \begin{cases} \tau\alpha & \text{if } f_p \ne f_q \\ 0 & \text{if } f_p = f_q \end{cases}    (3)
Sun et al. [11] used a GMM for the prior information of the background, which is robust to noise, but used constant values as the prior information of the foreground, based on an assumption of uniform distribution in the appearance of foreground objects, which restricts the applicability of the method. The data term of the energy function proposed in [11] is defined as follows:

D_p(f_p) = \begin{cases} \arg\min_k \frac{1}{2}\,(I_p - \mu_k)^T \Sigma_k^{-1} (I_p - \mu_k) & \text{if } f_p = \text{'bg'} \\ const & \text{if } f_p = \text{'fg'} \end{cases}    (4)

where k is the number of Gaussian models, \mu_k and \Sigma_k are the mean and covariance matrix of the k-th Gaussian model, and I_p is the information of pixel p. The smooth term uses the contrast term [14] (Eq. 5), which encourages spatial coherence by penalizing discontinuities between neighboring elements p and q when they are assigned different labels, capturing only the gradient information between the labels.
V_{p,q}(f_p, f_q) = \begin{cases} 1 - \dfrac{\tan^{-1}(\delta_{pq})}{\pi/2} & \text{if } f_p \ne f_q \\ 0 & \text{if } f_p = f_q \end{cases}    (5)
where δ_pq is the difference between neighboring pixels p and q. The main problem of the above-mentioned methods is that accurate prior information for the foreground cannot be obtained. Accordingly, this paper approximately segments the foreground objects to estimate the data terms for foreground segmentation efficiently. The foreground objects are segmented from each background image using background subtraction, and noise and shadows are removed from the segmented objects to reduce the errors in the prior information for the foreground objects. Noise can be removed by simple morphological operations. Removing the noise may cause the loss of legitimate detail in the foreground silhouette, but the smooth term, which assigns low costs only where neighboring pixels are not similar, i.e., at edges in the image, can
compensate for this problem by filling out the lost silhouette. Shadows in the segmented foregrounds are removed based on the assumption that they have a hue similar to, but a brightness lower than, the corresponding background pixels; the following equation expresses this assumption.
Shadow_{xy} = \begin{cases} 1 & \text{if } |Hue^{fg}_{xy} - Hue^{bg}_{xy}| < \tau_H \text{ and } Brt^{fg}_{xy} - Brt^{bg}_{xy} < 0 \\ 0 & \text{otherwise} \end{cases}    (6)
where Hue^{fg}_{xy} and Brt^{fg}_{xy} are the hue and brightness values at pixel (x, y) of the input image, and Hue^{bg}_{xy} and Brt^{bg}_{xy} are those of the background image. Fig. 2 shows an example of an approximate foreground result. Fig. 2(b) shows the noise-free result of background subtraction after morphological operations, Fig. 2(c) shows an image in which the gray color indicates shadow pixels, and Fig. 2(d) shows the resulting approximate foreground.
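The shadow test of Eq. (6) can be sketched as follows, assuming hue and brightness maps (e.g., the H and V channels of an HSV conversion, normalized to [0, 1)) for both the input and background images. The hue tolerance value is an assumption.

import numpy as np

def shadow_mask(hue_fg, val_fg, hue_bg, val_bg, tau_h=0.05):
    """Eq. (6): a foreground pixel is marked as shadow when its hue is close to
    the background hue but its brightness is lower."""
    hue_diff = np.abs(hue_fg - hue_bg)
    hue_diff = np.minimum(hue_diff, 1.0 - hue_diff)   # hue is circular in [0, 1)
    return (hue_diff < tau_h) & ((val_fg - val_bg) < 0)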
Fig. 2. Approximate foreground result: (a) input image and background image, (b) noise-free image, (c) shadow pixels indicated by gray color, and (d) result image
To obtain the prior information for the foreground object, we use a GMM, which is robust to noise, and the data terms of the proposed method are defined as follows:

D_p(f_p) = \begin{cases} \arg\min_k \left( \alpha^{bg}_k \cdot \dfrac{1}{\sqrt{2\pi |\Sigma^{bg}_k|}} \exp\!\left(-\tfrac{1}{2} (I_p - \mu^{bg}_k)^T (\Sigma^{bg}_k)^{-1} (I_p - \mu^{bg}_k)\right) \right) & \text{if } f_p = \text{'bg'} \\ \arg\min_k \left( \alpha^{fg}_k \cdot \dfrac{1}{\sqrt{2\pi |\Sigma^{fg}_k|}} \exp\!\left(-\tfrac{1}{2} (I_p - \mu^{fg}_k)^T (\Sigma^{fg}_k)^{-1} (I_p - \mu^{fg}_k)\right) \right) & \text{if } f_p = \text{'fg'} \end{cases}    (7)
where \mu^{fg}_k and \Sigma^{fg}_k are the mean and covariance matrix of the k-th Gaussian model for the foreground. To construct the graph completely, the costs of the data terms and smooth terms are set on the t-links and n-links of the graph, where D_p(f_p = 'fg') is set on {p, T} and D_p(f_p = 'bg') is set on {p, S}.
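A data-term computation in the spirit of Eq. (7) is sketched below with a diagonal-covariance GMM evaluated in plain numpy. Negative log-likelihoods of the best component are used as costs here, which is a common choice rather than the literal form of Eq. (7); the tuple layout of the GMM parameters is an assumption of this sketch.

import numpy as np

def gmm_min_cost(pixels, weights, means, variances):
    """Per-pixel cost: the minimum component negative log-likelihood,
    mirroring the arg-min over k. variances holds per-channel variances."""
    costs = []
    for w, mu, var in zip(weights, means, variances):
        diff = pixels - mu
        maha = np.sum(diff ** 2 / var, axis=-1)
        log_norm = 0.5 * np.sum(np.log(2 * np.pi * var))
        costs.append(-np.log(w + 1e-12) + 0.5 * maha + log_norm)
    return np.min(np.stack(costs), axis=0)

def data_terms(pixels, fg_gmm, bg_gmm):
    """fg_gmm and bg_gmm are (weights, means, variances) tuples fitted to the
    approximate foreground and the background model, respectively."""
    return gmm_min_cost(pixels, *fg_gmm), gmm_min_cost(pixels, *bg_gmm)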
3 Experimental Results

The computer used in the experiments consisted of an Intel Core2 Quad Q6600 CPU and an NVIDIA GeForce 8800 GTX graphics card, and we used Olympus E-500 cameras to obtain high-resolution images. The software for the proposed system was implemented using Microsoft Visual C++ 6.0 and the Olympus camera SDK, and the size of the input images is 800 × 600. The experimental results are divided into two parts: 1) experimental results for foreground segmentation, and 2) 3D face reconstruction results.

3.1 Experimental Results for Foreground Segmentation
Three methods, background subtraction, Howe's graph-cuts-based method [10], and the proposed method, were compared. The connected component error (CCE) criterion (Eq. 8) proposed in [10] was adopted for the evaluation of each method; it takes into account only the connected component of the segmented foreground that overlaps the ground truth. Table 1 shows the comparison of the methods using the CCE.

CCE = \frac{\#\text{ in false positives} + \#\text{ in false negatives}}{\#\text{ in ground truth}}    (8)
Table 1. Comparison of foreground segmentation results

Method                   CCE
Background Subtraction   0.287
Howe's                   0.161
Proposed Method          0.115
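The CCE of Eq. (8) is straightforward to compute once the overlapping connected component has been selected; a minimal sketch is shown below (the connected-component labeling itself is assumed to be done elsewhere).

import numpy as np

def connected_component_error(segmented, ground_truth):
    """Eq. (8): (false positives + false negatives) / ground-truth pixels.
    Both inputs are boolean masks; 'segmented' should already be restricted to
    the connected component overlapping the ground truth."""
    fp = np.count_nonzero(segmented & ~ground_truth)
    fn = np.count_nonzero(~segmented & ground_truth)
    return (fp + fn) / max(np.count_nonzero(ground_truth), 1)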
Fig. 3. Result images of the proposed method: (a), (c), (e) input images; (b), (d), (f) result images
Fig. 4. Misclassification result: (a) input images, (b) background subtraction result, (c) output
Fig. 3 shows the results of the proposed method: Figs. 3(a), (c), and (e) show the input images, while Figs. 3(b), (d), and (f) show the result images. Fig. 4 shows a misclassification result. This error occurred when a shadow on the foreground was removed during foreground segmentation; almost all misclassifications of the proposed method occurred on the side faces in the experiments.

3.2 3D Face Reconstruction
We reconstruct 3D objects based on segmented foreground objects, and 3D object reconstruction consists of two steps: Volume Intersection[1] and Object Carving[9](Fig. 5).
Fig. 5. Flow-chart of 3D object reconstruction
Volume Intersection is performed using the Silhouette Volume Intersection (SVI) algorithm [1], which is the most popular method for reconstructing a 3D object from multi-view images. This method first projects the silhouette images, transformed using a calibration matrix, onto each plane of the voxel space from the corresponding camera viewpoint, and then calculates the intersection of the transformed silhouette images.
Fig. 6. Sub-steps of SVI: (a) plane image generation step, (b) plane image intersection step
Fig. 7. Problem of SVI
As shown in Fig. 6, the SVI algorithm consists of two sub-steps: plane image generation and plane image intersection. In the plane image generation step, all the silhouette images captured by the multi-camera setup are projected onto a common plane using the calibration matrix of each camera (Fig. 6(a)). Then, we compute the 2D intersection of all projected silhouette images on each plane (Fig. 6(b)). Here, the common plane is perpendicular to the z-axis and is moved from the common base plane (the gray-colored checked plane in the right image of Fig. 6(b)) until no silhouette images are projected onto the common plane. The results on the planes are called the visual hull, and Fig. 6(b) shows the visual hull and the common plane of the foreground object. However, almost all image-based 3D reconstruction methods can produce only a convex visual hull (Fig. 7); e.g., the darker gray regions in Fig. 7 cannot be expressed in the reconstructed objects. Object Carving is used to solve this problem. Generally, the photo-consistency between neighboring silhouette images is used to reconstruct more accurate shapes. As shown in Fig. 8, we find a point p on silhouette image A and a point q on silhouette image B such that the two points correspond to the same 3D point P on the visual hull, and we compute the epipolar line l on silhouette image B that passes through the point q. If p and q are not photo-consistent, we delete the point P and compare p with q′, where q′
Fig. 8. Object Carving (the visual hull, the real boundary, and correct/incorrect matching points on silhouette images A and B)
Fig. 9. Result of 3D face reconstruction with texture mapping
Fig. 10. Results of 3D face reconstruction without texture mapping: (a) using background subtraction, (b) using the proposed method
is a neighbor of q, until the two points are photo-consistent. Fig. 9 shows an example of the 3D face reconstruction results, where colors are painted onto the 3D face using texture mapping. Fig. 10 shows results of 3D face reconstruction without texture mapping. Since almost all misclassifications of the proposed method occurred on the side faces in the experiments, we did not use the side faces to construct the 3D face, and thus the side faces have no texture. Fig. 10(a) shows a result using background subtraction, while Fig. 10(b) shows a result using the proposed method. As shown in Fig. 10, Volume Intersection and Object Carving require accurate foreground objects segmented from the background images to reconstruct 3D objects.
4 Conclusion

In this paper, we proposed a better foreground segmentation method for 3D face reconstruction using graph cuts. Stereo vision techniques are generally used for 3D reconstruction, with depth information used as prior information for the data terms of the energy functions. However, it is difficult to calculate depth information in 3D face reconstruction, as most faces have homogeneous regions. Therefore, we used Volume Intersection and Object Carving, which do not use depth information, for 3D face reconstruction; these methods require accurate foreground objects.
In this paper, the foreground objects in each input image were approximately segmented from the background image using background subtraction to estimate the data terms of the energy function, and noise and shadows were removed from the segmented objects to reduce the errors in the prior information for the foreground objects. Removing the noise and shadows caused a loss of detail in the foreground silhouette, but the smooth terms of the energy function filled out the lost silhouette. As a result, the proposed method can segment more precise foreground objects by globally minimizing the energy function composed of the smooth terms and the approximately estimated data terms using graph cuts. However, if shadows are located on the foreground objects, those regions are also removed from the final foreground objects. Therefore, we will investigate how to detect shadow regions more accurately.

Acknowledgement. This work was supported by the Soongsil University Research Fund.
References

1. Laurentini, A.: The Visual Hull Concept for Silhouette-based Image Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(2), 150–162 (1994)
2. Wayne Power, P., Schoonees, J.: Understanding Background Mixture Models for Foreground Segmentation. In: Proceedings of Image and Vision Computing, New Zealand, pp. 267–271 (2002)
3. Image Processing Toolbox, ch. 9: Morphological Operations, The Mathworks (2001)
4. Li, Y., Sun, J., Tang, C.-K., Shum, H.-Y.: Lazy Snapping. ACM Transactions on Graphics 23(3), 303–308 (2004)
5. Boykov, Y., Funka-Lea, G.: Graph Cuts and Efficient N-D Image Segmentation. International Journal of Computer Vision 70(2), 109–131 (2006)
6. Rother, C., Kolmogorov, V., Blake, A.: GrabCut - Interactive Foreground Extraction using Iterated Graph Cuts. ACM Transactions on Graphics 23(3), 309–314 (2004)
7. Ahn, J.-H., Kim, K., Byun, H.: Robust Object Segmentation using Graph Cut with Object and Background Seed Estimation. In: Proceedings of International Conference on Pattern Recognition, vol. 2, pp. 361–364 (2006)
8. Kolmogorov, V., Criminisi, A., Blake, A., Cross, G., Rother, C.: Probabilistic Fusion of Stereo with Color and Contrast for Bilayer Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(9), 1480–1492 (2006)
9. Kutulakos, K.N., Seitz, S.M.: A Theory of Shape by Space Carving. International Journal of Computer Vision 38(3), 199–218 (2000)
10. Howe, N.R., Deschamps, A.: Better Foreground Segmentation through Graph Cuts. Technical Report (2004), http://arxiv.org/abs/cs.CV/0401017
11. Sun, Y., Li, B., Yuan, B., Miao, Z., Wan, C.: Better Foreground Segmentation for Static Cameras via New Energy Form and Dynamic Graph-cut. In: Proceedings of International Conference on Pattern Recognition, vol. 2, pp. 49–52 (2006)
12. Boykov, Y., Veksler, O., Zabih, R.: Fast Approximate Energy Minimization via Graph Cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(11), 1222–1239 (2001)
13. Ford, L., Fulkerson, D.: Flows in Networks. Princeton University Press, Princeton (1962)
14. Kumar, M.P., Torr, P.H.S., Zisserman, A.: OBJ CUT. In: Proceedings of Computer Vision and Pattern Recognition, vol. 1, pp. 18–25 (2005)
Practical Error Analysis of Cross-Ratio-Based Planar Localization Jen-Hui Chuang, Jau-Hong Kao, Horng-Horng Lin, and Yu-Ting Chiu Department of Computer Science, National Chiao-Tung University, No. 1001, Ta-Hseuh Rd., Hsinchu, Taiwan {jchuang, hhlin}@cs.nctu.edu.tw,
[email protected],
[email protected]
Abstract. Recently, more and more computer vision researchers are paying attention to error analysis so as to fulfill the various accuracy requirements arising from different applications. As a geometric invariant under projective transformations, the cross-ratio is the basis of many recognition and reconstruction algorithms based on projective geometry. We propose an efficient way of analyzing the localization error for computer vision systems that use cross-ratios in planar localization. By studying the inaccuracy associated with cross-ratio-based computations, we inspect the possibility of using a linear transformation to approximate the localization error due to 2-D noise in the image extraction of reference points. Based on such a computationally efficient analysis, a practical way of choosing point features in an image, so as to establish the probabilistically most accurate cross-ratio-based planar localization system, is developed. Keywords: cross-ratio, error analysis, error ellipse, robot localization.
1 Introduction

One of the main purposes of computer vision is to develop a reliable system that can carry out its tasks, e.g., reconstruction of scene structures, with satisfactory efficiency and precision in a realistic environment. There are basically two classes of methods to reconstruct 3-D structures from 2-D images. The first class involves strategies relying on camera calibration to establish reconstruction matrices, while the second class consists of approaches based on projective geometry associated with reference points given as prior knowledge. As a geometric invariant under projective transformations, the cross-ratio is the basis of many recognition and reconstruction algorithms which are based on projective geometry [1][2]. For example, cross-ratios calculated from vertices of polygons are used in [3-7] to recognize planar features in a 3-D environment. In addition to recognition, given prior knowledge about a scene, object structure can also be reconstructed using the cross-ratio. For example, an approach that transforms the relative affine structure defined in [8] into an equivalent cross-ratio measurement is used to determine relative 3-D face structure from facial images in an identity recognition system [9]. Such a projective invariant can also be utilized to match trajectories across video streams and applied to image retrieval problems [10][11]. For autonomous
navigation of vehicles, the cross-ratio is often used to identify artificial landmarks or beacons placed in the environment [12-16]. As indicated in [17][18], the quality of scene reconstruction and structure inference strongly depends on the quality of the image data. In addition to other possible measurement uncertainties, the 2-D coordinates of feature points in an image plane will always have quantization errors due to limited image resolution. Hence, the values of projective coordinates, i.e., pairs of cross-ratios with respect to some given reference points, will also be noisy. Some studies of the cross-ratio have been proposed to assess its use in invariant-based recognition systems [17-21]. These studies mainly focus on robust estimation of the cross-ratio, but not on the final localization or reconstruction results for autonomous navigation applications. In this paper, we propose a novel way of analyzing the localization error for systems which use the cross-ratio for planar localization. Through a first-order approximation of the derived one-dimensional error function, we first inspect the linear nature of the localization error due to small inaccuracies in the image data. Similar properties of the localization error due to two-dimensional noise are then investigated. In particular, an approximation of the nominal boundary of the error ellipse can be determined efficiently for one of the image points being affected by radially symmetric errors of a fixed magnitude. Based on such a computationally efficient error analysis, one may obtain a picture of the resultant regions of localization error in advance, and select proper reference image points accordingly. The rest of the paper is organized as follows. In Section 2, we introduce a general invariant-based method for 3-D reconstruction of locations of planar features using the cross-ratio. Subsequently, the localization error due to the associated cross-ratio computation using noisy 2-D image data is formulated in Section 3. We determine how the error propagates, including its direction and magnitude range, through a linear approximation of such a cross-ratio-based formulation. In Section 4, synthesized noise is added to real data in the experiments for the verification of the related theoretical investigations. Finally, a summary is given in Section 5.
2 Projective Invariant-Based 3-D Reconstruction

Before describing a typical framework for the 3-D reconstruction of a scene point from four reference points using the projective invariant cross-ratio, we first review some of the mathematics involved in its computation. Let O, A, B, C and D be five coplanar points in a general configuration (with no three of them being collinear), as shown in Fig. 1.

Fig. 1. Cross-ratio of five coplanar points

One form of cross-ratio can be computed as (see footnote 1)

CR_O \equiv [A,B,C,D]_O = \frac{(\vec{A}\times\vec{C})(\vec{B}\times\vec{D})}{(\vec{B}\times\vec{C})(\vec{A}\times\vec{D})} = \frac{\sin\theta_3\,\sin\theta_4}{\sin\theta_1\,\sin\theta_2}    (1)

with O being the origin of the pencils \vec{OA}, \vec{OB}, \vec{OC}, \vec{OD}. Let A = (A_x, A_y) stand for the vector \vec{OA}, and so on. We can rewrite (1) as

[A,B,C,D]_O = \left(\begin{vmatrix} A_x & C_x \\ A_y & C_y \end{vmatrix}\begin{vmatrix} B_x & D_x \\ B_y & D_y \end{vmatrix}\right) \Big/ \left(\begin{vmatrix} B_x & C_x \\ B_y & C_y \end{vmatrix}\begin{vmatrix} A_x & D_x \\ A_y & D_y \end{vmatrix}\right) = (K_{AC} K_{BD})/(K_{BC} K_{AD}) = Q_1/Q_2 .    (2)
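As a quick illustration of (1)-(2), the minimal sketch below evaluates the cross-ratio from 2x2 determinants and checks numerically that it is preserved under a homography. The function names, the sample points and the homography are our own illustrative choices, not taken from the paper.

```python
import numpy as np

def cross_ratio(O, A, B, C, D):
    """Cross-ratio [A,B,C,D]_O of Eq. (2): a ratio of products of 2x2
    determinants of the vectors from O, with no sinusoidals involved."""
    O, A, B, C, D = (np.asarray(p, float) for p in (O, A, B, C, D))
    a, b, c, d = A - O, B - O, C - O, D - O

    def k(u, v):                       # K_uv = |u_x v_x; u_y v_y|
        return u[0] * v[1] - u[1] * v[0]

    return (k(a, c) * k(b, d)) / (k(b, c) * k(a, d))   # Q1 / Q2

def project(H, p):
    """Apply a 3x3 homography H to a 2-D point p (illustrative)."""
    x, y, w = H @ np.array([p[0], p[1], 1.0])
    return np.array([x / w, y / w])

# Invariance check: the cross-ratio of five coplanar points is preserved by
# any projective transformation, e.g. the scene-to-image mapping.
pts = [np.array(p, float) for p in [(0, 0), (4, 1), (1, 5), (6, 6), (2, 9)]]
H = np.array([[1.1, 0.2, 3.0], [-0.1, 0.9, 1.0], [0.001, 0.002, 1.0]])
img = [project(H, p) for p in pts]
assert np.isclose(cross_ratio(*pts), cross_ratio(*img))
```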
Thus, a cross-ratio can be obtained without computing the sinusoidals in (1). Given five scene points O, A, B, C, D located on a 3-D plane \pi_0, with no three of them being collinear, they are projected onto the image plane \pi_1 as o, a, b, c and d, respectively (see footnotes 2 and 3). The invariant property of the cross-ratio assures that, if the five feature points can be identified in the image plane accurately, the cross-ratio [a,b,c,d]_o will be identical to the cross-ratio [A,B,C,D]_O. An immediate application of the projective invariant cross-ratio is to determine ray directions. For example, if the origin and three of the remaining four points are known in Fig. 1, the vector passing through the fourth point from the origin can be determined easily if [A,B,C,D]_O is given. For example, let D = (X, Y), or \vec{OD}, be the point to be determined. From (1), we have

CR_O = \frac{K_{AC}\begin{vmatrix} B_x & X-O_x \\ B_y & Y-O_y \end{vmatrix}}{K_{BC}\begin{vmatrix} A_x & X-O_x \\ A_y & Y-O_y \end{vmatrix}} \triangleq \frac{Q_1}{Q_2},

which can be rewritten as

(Q_1 K_{BC} A_y - Q_2 K_{AC} B_y) X + (Q_2 K_{AC} B_x - Q_1 K_{BC} A_x) Y = Q_2 K_{AC}\begin{vmatrix} B_x & O_x \\ B_y & O_y \end{vmatrix} - Q_1 K_{BC}\begin{vmatrix} A_x & O_x \\ A_y & O_y \end{vmatrix} .    (3)
This is in fact the equation of l_4 (the line \vec{OD}). Furthermore, if O, A, B, C are known, so are [A,B,C,D]_O and [B,C,D,O]_A, and we can obtain the point D by intersecting \overline{OD} and \overline{AD}. Accordingly, a localization system can be developed based on the invariant cross-ratio, assuming perfect image acquisition and feature extraction. However, measurement uncertainty and system noise, such as quantization errors in the 2-D coordinates of feature points in an image plane due to limited image resolution, usually occur in practice. These uncertainties will propagate through the computation process, resulting in erroneous localizations or reconstructions. In the next section, we will investigate how measurement errors may propagate in a reconstruction process based on the cross-ratio.

Footnote 1: Note that a total of 24 different cross-ratios k_i, 1 ≤ i ≤ 24, can be defined for a scene point, and Eq. (1) corresponds to k_1 defined in [21].
Footnote 2: In the rest of the paper, we denote image points with lowercase letters and scene points with uppercase letters.
Footnote 3: In a robot navigation environment, \pi_0 can be the ground plane and the five points can be landmarks or beacons placed in the environment, or the robot itself.
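To make the localization procedure just described concrete, the sketch below builds the line of Eq. (3) for each of two pencil origins and intersects the resulting lines to recover the unknown scene point. All names are illustrative; for simplicity, both image cross-ratios are assumed to be measured with the unknown point in the last slot (i.e., [B,C,O,D]_A rather than the [B,C,D,O]_A written above), which is an equivalent constraint since any ordering of the cross-ratio is projectively invariant. The cross_ratio helper from the sketch after Eq. (2) can be used to obtain the measured values.

```python
import numpy as np

def det2(u, v):
    """2x2 determinant |u_x v_x; u_y v_y|."""
    return u[0] * v[1] - u[1] * v[0]

def line_to_unknown(origin, A, B, C, cr):
    """Line alpha*X + beta*Y = gamma of Eq. (3) through 'origin' and the
    unknown point, given cr = [A, B, C, unknown]_origin measured in the image."""
    origin, A, B, C = (np.asarray(p, float) for p in (origin, A, B, C))
    a, b, c = A - origin, B - origin, C - origin
    k_ac, k_bc = det2(a, c), det2(b, c)
    q1, q2 = cr, 1.0                      # only the ratio Q1/Q2 matters in (3)
    alpha = q1 * k_bc * a[1] - q2 * k_ac * b[1]
    beta = q2 * k_ac * b[0] - q1 * k_bc * a[0]
    gamma = q2 * k_ac * det2(b, origin) - q1 * k_bc * det2(a, origin)
    return alpha, beta, gamma

def localize(O, A, B, C, cr_O, cr_A):
    """Intersect the line O-D (from cr_O = [A,B,C,D]_O) with the line A-D
    (from cr_A = [B,C,O,D]_A) to recover the unknown coplanar point D."""
    l1 = line_to_unknown(O, A, B, C, cr_O)
    l2 = line_to_unknown(A, B, C, O, cr_A)
    M = np.array([l1[:2], l2[:2]])
    rhs = np.array([l1[2], l2[2]])
    return np.linalg.solve(M, rhs)        # intersection of the two lines
```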
3 Error Analysis of Cross-Ratio-Based Localization

A general configuration of coplanar points for cross-ratio-based localization is shown in Fig. 2. Assume P_1, P_2, P_3, P_4 are known planar points in 3-D space, with P_1 and P_4, as well as p_1 and p_4, being the origins for the two corresponding cross-ratios. The position of a scene point R (or the location of a robot), which corresponds to the image point r, can be determined with the procedure described in Section 2. For simplicity, let \vec{p_1 r} = (d_x, d_y), and assume the location of p_4 has noise \Delta_x along the x-direction and is extracted as \hat{p}_4. We then have

CR_{p_1} = [p_2, p_3, \hat{p}_4, r]_{p_1} = \frac{\begin{vmatrix} a_x & c_x+\Delta_x \\ a_y & c_y \end{vmatrix}\begin{vmatrix} b_x & d_x \\ b_y & d_y \end{vmatrix}}{\begin{vmatrix} b_x & c_x+\Delta_x \\ b_y & c_y \end{vmatrix}\begin{vmatrix} a_x & d_x \\ a_y & d_y \end{vmatrix}} = \frac{\hat{q}_1}{\hat{q}_2} = \frac{q_1 - k_{bd}\,a_y\Delta_x}{q_2 - k_{ad}\,b_y\Delta_x},    (4)

where k_{bd} = \begin{vmatrix} b_x & d_x \\ b_y & d_y \end{vmatrix} and k_{ad} = \begin{vmatrix} a_x & d_x \\ a_y & d_y \end{vmatrix}.

Fig. 2. A general configuration of coplanar points where p_1 is the origin of four pencils

Fig. 3. \hat{p}_4 is used as the origin to compute the cross-ratio CR_{\hat{p}_4}

Substituting Q_1 and Q_2 in (3) by \hat{q}_1 and \hat{q}_2, we have

(\hat{q}_1 K_{BC} A_y - \hat{q}_2 K_{AC} B_y) X + (\hat{q}_2 K_{AC} B_x - \hat{q}_1 K_{BC} A_x) Y = \hat{q}_2 K_{AC}\begin{vmatrix} B_x & P_{1x} \\ B_y & P_{1y} \end{vmatrix} - \hat{q}_1 K_{BC}\begin{vmatrix} A_x & P_{1x} \\ A_y & P_{1y} \end{vmatrix},

which yields the line equation of \overline{P_1 R}:

[(q_1 - k_{bd} a_y \Delta_x) K_{BC} A_y - (q_2 - k_{ad} b_y \Delta_x) K_{AC} B_y] X + [(q_2 - k_{ad} b_y \Delta_x) K_{AC} B_x - (q_1 - k_{bd} a_y \Delta_x) K_{BC} A_x] Y = (q_2 - k_{ad} b_y \Delta_x) K_{AC}\begin{vmatrix} B_x & P_{1x} \\ B_y & P_{1y} \end{vmatrix} - (q_1 - k_{bd} a_y \Delta_x) K_{BC}\begin{vmatrix} A_x & P_{1x} \\ A_y & P_{1y} \end{vmatrix} .    (5)
On the other hand, with \hat{p}_4 being the origin, as shown in Fig. 3, CR_{\hat{p}_4} can be computed as

CR_{\hat{p}_4} = \frac{\begin{vmatrix} a'_x-\Delta_x & c'_x-\Delta_x \\ a'_y & c'_y \end{vmatrix}\begin{vmatrix} b'_x-\Delta_x & d'_x-\Delta_x \\ b'_y & d'_y \end{vmatrix}}{\begin{vmatrix} b'_x-\Delta_x & c'_x-\Delta_x \\ b'_y & c'_y \end{vmatrix}\begin{vmatrix} a'_x-\Delta_x & d'_x-\Delta_x \\ a'_y & d'_y \end{vmatrix}} = \frac{\hat{q}'_1}{\hat{q}'_2} = \frac{q'_1 + u_1 k'_{ac}\Delta_x + u_2 k'_{bd}\Delta_x + u_1 u_2 \Delta_x^2}{q'_2 + u_3 k'_{bc}\Delta_x + u_4 k'_{ad}\Delta_x + u_3 u_4 \Delta_x^2},    (6)
where u_1 = b'_y - d'_y, u_2 = a'_y - c'_y, u_3 = a'_y - d'_y, u_4 = b'_y - c'_y. Similarly, from (3) and (6), we can obtain the line equation of \overline{\hat{P}_4 R} as

(\hat{q}'_1 K'_{BC} A'_y - \hat{q}'_2 K'_{AC} B'_y) X + (\hat{q}'_2 K'_{AC} B'_x - \hat{q}'_1 K'_{BC} A'_x) Y = \hat{q}'_2 K'_{AC}\begin{vmatrix} B'_x & P_{4x} \\ B'_y & P_{4y} \end{vmatrix} - \hat{q}'_1 K'_{BC}\begin{vmatrix} A'_x & P_{4x} \\ A'_y & P_{4y} \end{vmatrix},    (7)

where (P_{4x}, P_{4y}) is the coordinate of the scene point P_4. It is easy to see that (5) and (7) are of the form

\alpha'_1 X + \beta'_1 Y = \gamma'_1
\alpha'_2 X + \beta'_2 Y = \gamma'_2 .
Therefore, by solving the above equations, the robot position can be obtained as

R(R_x, R_y) = \left( \begin{vmatrix} \gamma'_1 & \beta'_1 \\ \gamma'_2 & \beta'_2 \end{vmatrix} \Big/ \begin{vmatrix} \alpha'_1 & \beta'_1 \\ \alpha'_2 & \beta'_2 \end{vmatrix}, \; \begin{vmatrix} \alpha'_1 & \gamma'_1 \\ \alpha'_2 & \gamma'_2 \end{vmatrix} \Big/ \begin{vmatrix} \alpha'_1 & \beta'_1 \\ \alpha'_2 & \beta'_2 \end{vmatrix} \right).    (8)
To simplify (8), the high-order terms of \Delta_x are skipped: each 2x2 determinant in (8) is approximated by its noise-free counterpart plus \Delta_x times a coefficient formed as a weighted combination, with some constants M_1, ..., M_4, of 2x2 determinants of the reference-point coordinates,

\begin{vmatrix} \gamma'_1 & \beta'_1 \\ \gamma'_2 & \beta'_2 \end{vmatrix} \approx \begin{vmatrix} \hat{\gamma}_1 & \hat{\beta}_1 \\ \hat{\gamma}_2 & \hat{\beta}_2 \end{vmatrix}, \quad \begin{vmatrix} \alpha'_1 & \beta'_1 \\ \alpha'_2 & \beta'_2 \end{vmatrix} \approx \begin{vmatrix} \hat{\alpha}_1 & \hat{\beta}_1 \\ \hat{\alpha}_2 & \hat{\beta}_2 \end{vmatrix}, \quad \begin{vmatrix} \alpha'_1 & \gamma'_1 \\ \alpha'_2 & \gamma'_2 \end{vmatrix} \approx \begin{vmatrix} \hat{\alpha}_1 & \hat{\gamma}_1 \\ \hat{\alpha}_2 & \hat{\gamma}_2 \end{vmatrix},

where \begin{vmatrix} \alpha_1 & \beta_1 \\ \alpha_2 & \beta_2 \end{vmatrix}, \begin{vmatrix} \gamma_1 & \beta_1 \\ \gamma_2 & \beta_2 \end{vmatrix} and \begin{vmatrix} \alpha_1 & \gamma_1 \\ \alpha_2 & \gamma_2 \end{vmatrix} are the corresponding noise-free terms. Thus, an approximation of the robot location R becomes

\hat{R}_x = \begin{vmatrix} \hat{\gamma}_1 & \hat{\beta}_1 \\ \hat{\gamma}_2 & \hat{\beta}_2 \end{vmatrix} \Big/ \begin{vmatrix} \hat{\alpha}_1 & \hat{\beta}_1 \\ \hat{\alpha}_2 & \hat{\beta}_2 \end{vmatrix}, \qquad \hat{R}_y = \begin{vmatrix} \hat{\alpha}_1 & \hat{\gamma}_1 \\ \hat{\alpha}_2 & \hat{\gamma}_2 \end{vmatrix} \Big/ \begin{vmatrix} \hat{\alpha}_1 & \hat{\beta}_1 \\ \hat{\alpha}_2 & \hat{\beta}_2 \end{vmatrix} .    (9)
Since (9) has the form

\hat{R}_x = \frac{E\,\Delta_x + F}{I\,\Delta_x + J}, \qquad \hat{R}_y = \frac{G\,\Delta_x + H}{I\,\Delta_x + J}    (10)

for constants E through J, we can obtain the following linear equation by eliminating \Delta_x (see footnote 4):

\hat{R}_y = \frac{IH - GJ}{IF - EJ}\,\hat{R}_x + \frac{GF - EH}{IF - EJ} .    (11)

Footnote 4: One can show that by applying a Taylor series expansion to (8) and skipping the higher-order terms, which gives a linear relationship between the 2-D image extraction error and the 3-D localization error, a linear equation identical to (11) can be obtained.
The above equation gives an approximate trajectory of the reconstructed robot locations due to different image extraction errors \Delta_x added to p_4 in Fig. 2 (and Fig. 3). In general, as will be demonstrated with simulation results in the next section, if the 2-D image error of a feature point is within a reasonably small range, it can be transformed approximately linearly into a planar region in the 3-D space of the reconstructed scene (see footnote 5). In particular, such a linear transformation of coordinates will transform a circular region of image error into an elliptic one in the above planar region. Therefore, with only transformations of the image error in two linearly independent directions, an approximate ellipse of reconstruction error can be obtained. Such error ellipses will be useful indicators for choosing among point features in an image to establish the probabilistically most accurate planar location system using cross-ratios.

Footnote 5: A formal derivation of such a property is omitted for brevity.
4 Simulation Results

We conduct a series of simulations for the error analysis of cross-ratio-based planar localization, with synthesized noise added to a real robot and to some reference points in an image. In these simulations, we consider the situation when extraction noise only affects a single image point. First, we investigate the characteristics of the localization error assuming 1-D noise along the x-direction, as discussed in the previous section, as well as along other directions. Next, the nominal boundary of an error ellipse due to two-dimensional noise is computed to approximate the one resulting from the circle of image inaccuracy. Finally, we give a cross-ratio-based localization scheme which adopts the proposed error analysis method to assist the selection of reference image points so as to optimize the reconstruction process.
Fig. 4. (Left) Extracted feature points in an input image. p1 … p4 are identified as images of reference points, and r is identified as robot image. Image extraction noises within a range of ±2 pixels along x-direction are added to p4. (Right) Trajectory of reconstructed robot locations: blue and magenta points are obtained by (8) while red and green points are obtained by (12). The former are hardly visible since they are almost entirely covered by the latter. R is the robot location in 3-D space resulted from noise-free extraction of image points.
Fig. 4 illustrates the trajectory of the reconstructed robot locations due to noise, within the range of ±2 pixels along the x-direction, being added to p_4 (see footnote 6). The locations obtained from the linear equation (11) are represented in red and green colors, corresponding to deviations of p_4 in the +x and -x directions, respectively. Points in blue and magenta colors represent similar results but computed with the original rational equation (8). One can see that the latter, which are drawn first, are hardly visible since (11) gives a nearly perfect approximation of the former. In fact, similar results (which are omitted for brevity) can also be obtained for 1-D noise in arbitrary directions. In general, if the 2-D image errors are within a reasonably small range, the errors can also be transformed approximately linearly into the 3-D space of the reconstructed scene.

Footnote 6: It is assumed in the rest of the paper that the two cross-ratios involved in the computation use p_1 and p_4 as origins, respectively.
Fig. 5. (Left) Similar to that in Fig. 4 but with circularly distributed image extraction noises of ±2 pixels added to p4. (Right) Trajectory of reconstructed robot locations: blue points are obtained by (8) while the approximate error ellipse is obtained using (12). The reconstructed locations of robot due to image errors (Δx, Δy) = (2,0) and (Δx, Δy) = (0,−2) on p4 are at R1 and R2, respectively.
Fig. 6. Simulation results similar to that given in Fig. 5 but using reference point p5 in place of p2
Fig. 5 illustrates the trajectory of the reconstructed robot locations due to circularly distributed image extraction noises of 2 pixels added to p_4, as well as the error ellipse obtained from the linear transformation

\begin{bmatrix} \Delta R_x \\ \Delta R_y \end{bmatrix} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}\begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix},    (12)

where \Delta R represents the corresponding location offsets derived from the image error (\Delta x, \Delta y) (see footnote 7). Specifically, we set (\Delta x, \Delta y) = (2, 0) and (\Delta x, \Delta y) = (0, -2) (corresponding to the reconstructed robot locations R_1 and R_2, respectively) and then derive the approximate error ellipse. One can easily see that the above approximate error ellipse can appropriately express the spatial characteristics of the localization error without computing a large number of reconstructed robot locations using the expensive higher-order equation (8). For the application of a general cross-ratio-based localization algorithm, because a scene may have many image features (points) extracted, multiple choices of reference points, as well as of the origins for the computation of cross-ratios, are possible. Fig. 6 shows simulation results similar to those given in Fig. 5 but using reference point p_5 in place of p_2. According to the results obtained with either (8) or (12), the localization results in Fig. 6 give a similar worst-case error, but with approximately twice the ellipse area, compared to that in Fig. 5. The above results suggest that when there are multiple choices of reference points or cross-ratio origins, one can perform the proposed analysis to predict the possible localization errors for each choice and select the optimal one accordingly. For each choice, one first needs to ensure that the noise is restricted to a reasonable range such that (12), obtained using two noisy samples of the reference point of interest, can appropriately describe the localization error. Subsequently, an optimal choice can be determined by comparing the direction of error, the worst-case error, the average error magnitude, or other metrics suggested by specific applications.

Footnote 7: Eq. (12) can be used to derive the approximate error ellipse only if there is a linear relationship between the image and reconstruction errors. Various ways of inspecting such a relationship exist, but they are not discussed here for brevity.
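Assuming access to a routine that re-runs the reconstruction for a perturbed image point, the sketch below estimates the 2x2 matrix of (12) from the same two perturbation samples used above and maps the circle of image noise to the nominal ellipse boundary. The reconstruct callable, the sample choices and the circle sampling are illustrative assumptions, not part of the paper.

```python
import numpy as np

def error_ellipse(reconstruct, radius=2.0, n=64):
    """Approximate error ellipse of Eq. (12).

    'reconstruct(delta)' is assumed to return the reconstructed 2-D location
    when the image point of interest is displaced by delta = (dx, dy) pixels;
    reconstruct((0, 0)) is the noise-free location R."""
    R0 = np.asarray(reconstruct((0.0, 0.0)))
    # Two perturbation samples, as in the paper: (dx, dy) = (r, 0) and (0, -r).
    col_x = (np.asarray(reconstruct((radius, 0.0))) - R0) / radius       # [a, c]
    col_y = (np.asarray(reconstruct((0.0, -radius))) - R0) / (-radius)   # [b, d]
    M = np.column_stack([col_x, col_y])   # the 2x2 matrix [[a, b], [c, d]]

    # Image of the noise circle of the given radius under the linear map.
    t = np.linspace(0.0, 2.0 * np.pi, n)
    circle = radius * np.vstack([np.cos(t), np.sin(t)])
    ellipse = R0[:, None] + M @ circle    # nominal ellipse boundary
    return M, ellipse
```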
5 Summary

As a geometric invariant under projective transformations, the cross-ratio is the basis of many recognition and reconstruction algorithms which are based on projective geometry. We propose an efficient way of approximately analyzing the localization error for systems which use the cross-ratio for planar localization, by establishing a linear relationship between the localization error and small inaccuracies in the measurements of image features due to 1-D and 2-D noise in the image space. Such an analysis will be useful for choosing among point features, as well as cross-ratio origins, in stereo images to establish the probabilistically most accurate planar location system. The proposed approach is applicable whenever multiple choices of image features are available, which happens frequently in various computer vision applications, e.g., in robot navigation systems. Acknowledgments. This work is partly supported by the Ministry of Economic Affairs, Taiwan, under grant No. 95-EC-17-A-02-S1-032.
References 1. Kanatani, K.: Computational Cross-Ratio for Computer Vision. CVGIP 60, 371–381 (1994) 2. Mundy, J.L., Zisserman, A.: Geometric Invariance in Computer Vision. MIT Press, Cambridge, MA (1992)
3. Lourakis, M.I.A., Halkidis, S.T., Orphanoudakis, S.C.: Matching Disparate Views of Planar Surfaces Using Projective Invariants. Image and Vision Computing 18, 673–683 (2000) 4. Carlsson, S.: Projectively Invariant Decomposition and Recognition of Planar Shapes. Int’l J. Computer Vision 17, 193–209 (1996) 5. Suk, T., Flusser, J.: Point-Based Projective Invariants. Pattern Recognition 33, 251–261 (2000) 6. Chuang, J.-H., Chiu, J.-M., Chen, Z.: Obtaining Base Edge Correspondence in Stereo Images via Quantitative Measures Along C-diagonals. Pattern Recognition Letters 18, 87– 95 (1997) 7. Chiu, J.-M., Chen, Z., Chuang, J.-H., Chia, T.-L.: Determination of Feature Correspondence in Stereo Images Using a Calibration Polygon. Pattern Recognition 30, 1387–1400 (1997) 8. Shashua, A., Navab, N.: Relative Affine Structure: Theory and Application to 3D Reconstruction From Perspective Views. IEEE Trans. Pattern Analysis and Machine Intelligence 18, 873–883 (1996) 9. Chuang, J.-H., Kao, J.-H., Chen, Y.-H.: Identity Verification by Relative 3-D Structure Using Multiple Facial Images. Pattern Recognition Letters 26, 1292–1303 (2005) 10. Nunziati, W., Sclaroff, S., Bimbo, A.D.: An Invariant Representation for Matching Trajectories Across Uncalibrated Video Streams. In: 4th Int’l Conf. Image and Video Retrieval, Singapore, pp. 318–327 (2005) 11. Rajashekahar, S.C., Namboodiri, V.P.: Image Retrieval Based on Projective Invariance. In: IEEE Int’l Conf. Image Processing, pp. 405–408. IEEE Computer Society Press, Los Alamitos (2004) 12. Tsonis, V.S., Chandrinos, K.V., Trahanias, P.E.: Landmark-Based Navigation Using Projective Invariants. In: IEEE/RSJ Int’l Conf. on Intelligent Robots and Systems, pp. 342–347 (1998) 13. Åström, K.: A Correspondence Problem in Laser Guided Navigation. In: Symposium on Image Analysis, Sweden, pp. 141–144 (1992) 14. Åström, K.: Automatic Mapmaking. In: 1st IFAC Int’l Workshop on Intelligent Autonomous Vehicles, Southampton, pp. 181–186 (1993) 15. Basri, R., Rivlin, E., Shimshoni, I.: Image-Based Robot Navigation Under the Perspective Model. In: Int’l Conf. Robotics and Automation, Michigan, pp. 2578–2583 (1999) 16. Guerrero, J.J., Sagüés, C.: Uncalibrated Vision Based on Lines for Robot Nnavigation. Mechatronics 11, 759–777 (2001) 17. Georgis, N., Petrou, M., Kittler, J.: Error Guided Design of a 3D Vision System. IEEE Trans. Pattern Analysis and Machine Intelligence 20, 366–379 (1998) 18. Huynh, D.Q.: The Cross-Ratio: A Revisit to its Probability Density Function. In: British Machine Vision Conf., pp. 262–271 (2000) 19. Åström, K., Morin, L.: Random Cross-Ratios. In: 9th Scandinavian Conf. on Image Analysis, pp. 1053–1060 (1995) 20. Maybank, S.J.: Stochastic Properties of the Cross-Ratio. Pattern Recognition Letters 17, 211–217 (1996) 21. Liu, J.-S., Chuang, J.-H.: A Geometry-Based Error Estimation of Cross-Ratios. Pattern Recognition 35, 155–167 (2002)
People Counting in Low Density Video Sequences

J.D. Valle Jr., L.E.S. Oliveira, A.L. Koerich, and A.S. Britto Jr.

Postgraduate Program in Computer Science (PPGIa), Pontifical Catholic University of Parana (PUCPR), R. Imaculada Conceição, 1155 Prado Velho, 80215-901, Curitiba, PR, Brazil
{jaimejr,soares,alekoe,alceu}@ppgia.pucpr.br
www.ppgia.pucpr.br
Abstract. This paper presents a novel approach for automatically counting people in videos captured by a conventional closed-circuit television (CCTV) camera using computer vision techniques. The proposed approach consists of detecting and tracking moving objects in video scenes and counting them when they enter a virtual counting zone defined in the scene. One of the main problems of using conventional CCTV cameras is that they are usually not placed in a convenient position for counting, and this may cause many occlusions between persons when they are walking very close to each other or in groups. To tackle this problem two strategies are investigated. The first one is based on two thresholds which are related to the average width and to the average area of a blob top zone, which represents a person's head. By matching the width and the head region area of a current blob against these thresholds it is possible to estimate whether the blob encloses one, two or three persons. The second strategy is based on a zoning scheme and extracts low-level features from the top region of the blob, which is also related to a person's head. Such feature vectors are used together with an instance-based classifier to estimate the number of persons enclosed by the blob. Experimental results on videos from two different databases have shown that the proposed approach is able to count the number of persons that pass through a counting zone with accuracy higher than 85%. Keywords: People Counting, Computer Vision, Tracking.
1 Introduction
The automatic counting of people in video scenes is a relatively new subject of research in computer vision. The growing interest in this subject is due to its wide applicability. Among several possibilities, by counting people in an environment it is possible to control the functionalities of a building, such as optimizing the setup of a heating, ventilation, and air conditioning (HVAC) system, measure the impact of a marketing campaign, estimate pedestrian flows or the number of visitors in tourist attractions, and make possible a number of surveillance applications. However, nowadays people counting is done in a manual fashion or by
means of electronic equipment based on infrared sensors or through very simple video-based analysis. These solutions usually fail when trying to count individuals who are very close to each other or in groups, due to the occurrence of partial or total occlusion or to the difficulty of distinguishing a clear frontier between one or more persons. In other words, such systems do not have any embedded intelligence able to handle close persons or groups. Different approaches have been proposed to deal with this problem. The most straightforward strategy has been avoiding occlusions. Here, the main idea is to gather videos from cameras positioned at the top of the counting zones, which are known as top-view cameras. Kim et al. [1] and Snidaro et al. [2] avoided the total occlusion problem by placing top-view cameras to count the passing people in a corridor. The camera position (top-view) allows the authors to estimate the number of people with a simple strategy which matches the area of the moving object against a predefined value of the maximum area occupied by a single object. However, the main drawback of such a strategy is that top-view cameras are usually not available in most current surveillance systems, which requires the installation of dedicated cameras for counting purposes. Some authors have based their approaches on general-purpose CCTV cameras, which are usually placed in an oblique position to cover a large area of an environment. To deal with such oblique cameras, some authors have employed classification techniques to take a decision about how many individuals are in the counting zone or enclosed by a blob, instead of carrying out a blind segmentation of groups into individuals. Nakajima et al. [3] use Support Vector Machines (SVM) to deal with this problem, while Gavrila [4] uses a tree-based classifier to represent possible shapes of pedestrians. A similar strategy was proposed by Giebel et al. [5], which uses dynamic point distribution models. Lin et al. [6] describe the region of interest based on Haar wavelet features and SVM classification. A segmentation-based strategy was proposed by Haritaoglu et al. [7]. The authors have attempted to count the number of people in groups by identifying their heads. To detect the heads, such an approach uses convex hull corner vertices on the silhouette boundary combined with the vertical projection histogram of the binary silhouette. In this paper we propose a novel people counting approach which is inspired by that of Haritaoglu et al. [7]. The proposed approach is also based on head area detection, but it does not attempt to segment groups into individuals. Instead, features are extracted from the head region of the blobs and matched against simple templates that model individuals and groups of persons. The match is carried out through an instance-based learning algorithm. These models can be easily adapted to different application environments thanks to the simplicity of the learning algorithm and the low dimensionality of the feature vectors. Besides this strategy to cope with groups of persons, this paper presents a complete approach for automatic people counting in videos captured through a CCTV camera using computer vision techniques. The proposed method consists of detecting and tracking foreground objects in video scenes to further make the counting. As we have discussed earlier, the main problem resides in having
a correct count when the individuals are close to each other or in groups. In these situations the individuals are usually occluding each other. For each tracked object that reaches a counting zone, two strategies were evaluated: the first one is based on two thresholds that represent the average width and average area of a blob that encloses a single individual; by comparing the width and area of a blob against these thresholds, one can decide whether it represents one, two or three persons. In the second one, a feature set based on a zoning scheme and shape descriptors is computed from the head region of the blob while it is inside the counting zone, and a classifier is used to decide whether the blob encloses one, two or three persons. This paper is organized as follows: Section 2 presents an overview of the proposed approach as well as the main details of the detection, segmentation and tracking algorithms. Section 3 presents the strategies for counting people in the video scenes. The experimental results on video clips from two different databases are presented in Section 4. Conclusions and perspectives of future work are stated in the last section.
2 System Overview
Figure 1 presents an overview of the proposed method. Through a video camera placed in an oblique position in the environment, video is captured and the frames are preprocessed to reduce noise caused mainly by lighting variations. After detecting and segmenting the moving objects, which represent regions of interest, their blobs are defined and tracked in the subsequent frames. When a blob enters a counting zone its shape is described by a set of features. Two strategies were evaluated for executing the counting process. The first one considers the use of threshold values computed on the blob width and the head region area to decide how many individuals are enclosed by the blob, while the second focuses on the top region of the blob, called the head region, and extracts a set of features that describe the shape of the objects enclosed by the head region of the blob. The resulting feature vector is the input to a k-Nearest-Neighbor (k-NN) classifier which decides whether the blob encloses one, two or three individuals.

2.1 Detection and Segmentation of Foreground Objects
In recent years, several approaches to detect motion in video have been proposed in the literature [8]. However, the main limitation of such techniques refers to the presence of noise, mainly due to variations in the scene illumination, shadows, or spurious artifacts generated by video compression algorithms. Background subtraction is one such segmentation approach that is very sensitive to illumination. However, it is a straightforward approach with low computational complexity, since each new frame is subtracted from a fixed background model image, followed by a thresholding algorithm to obtain a binary image which segments the foreground from the background. A median filter of size 3x3 is applied to eliminate the remaining noise. The resulting image is
a set of pixels from the object in motion, possibly with some non-connected pixels. Morphological operations such as dilation and erosion are employed to ensure that these pixels make up a single object. In such a way, what was previously a set of pixels is now a single object called a blob, which has all its pixels connected.

Fig. 1. An overview of the proposed approach to count people in video scenes
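A minimal sketch of the segmentation pipeline just described (background subtraction, thresholding, 3x3 median filtering and dilation/erosion), written with OpenCV. The threshold, kernel size and minimum blob area are illustrative values, not those used by the authors.

```python
import cv2

def segment_blobs(frame, background, thresh=30, min_area=400):
    """Foreground blobs via background subtraction and morphology."""
    diff = cv2.absdiff(frame, background)
    gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    mask = cv2.medianBlur(mask, 3)                      # 3x3 median filter
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    mask = cv2.dilate(mask, kernel)                     # connect nearby pixels
    mask = cv2.erode(mask, kernel)
    # Connected components give one bounding box per blob.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    boxes = [stats[i, :4] for i in range(1, n)
             if stats[i, cv2.CC_STAT_AREA] >= min_area]
    return mask, boxes   # boxes are (x, y, width, height)
```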
Fig. 2. Motion detection and segmentation in a video clip from CAVIAR Database: (a) original video frame with objects in motion, (b) motion segmentation by background subtraction, (c) resulting blobs after filtering, background subtraction and morphological operations
Figure 2 shows in a simplified manner the detection and segmentation of foreground objects in the scene. Once a blob is identified, it must be tracked while it is present in the camera field of view.

2.2 Tracking Objects
Tracking consists of evaluating the trajectory of the object in motion while it remains in the scene. To eliminate objects that are not interesting from the point of view of the tracking, a size filter is applied which discards blobs that are not consistent with the expected width and height of the objects of interest. The idea of using filters to eliminate undesirable regions was proposed by Lei and Xu [9]. Filtering takes into account the velocity and direction of the motion applied to a cost function. The tracking of the objects in the scene and the prediction of their positions is done with an approach proposed by Latecki and Miezianko [10], with some modifications. Suppose an object O^i in the frame F^n, where O^i denotes a tracked object. In the next frame F^{n+1}, given j regions of motion R^j, we have to know which R^j represents the object O^i from the preceding frame. The following cost function is used:
Cost = (wP ∗ dP ) + (wS ∗ dS ) + (wD ∗ dD ) + dT
(1)
where w_P, w_S and w_D are weights that sum to one, d_P is the Euclidean distance in pixels between the object centers, d_S is the cost of the size difference, d_D is the cost of the direction difference between the object position estimated by the Lucas-Kanade algorithm [11] in the current frame and the difference between the center of the region of motion and the center of the object, and d_T is the time to live (TTL) of the object. These parameters are described as follows.

d_P = |R_c^j - O_c^i|    (2)
where R_c^j is the center of the region of motion and O_c^i is the last known center of the object. The value of d_P should not be higher than a proximity threshold measured in pixels. This proximity threshold varies according to the objects being tracked, mainly due to the speed of such objects in the scene.

d_S = \frac{|R_r^j - O_r^i|}{R_r^j + O_r^i}    (3)
where R_r^j and O_r^i denote the size of the region of motion and the size of the object, respectively.

d_D = |\arctan(O_{lkc}^i) - \arctan(R_c^j - O_c^i)|    (4)

where the value of the angle lies between zero and 2\pi. O_{lkc}^i is the object position estimated by the Lucas-Kanade algorithm.
d_T = \frac{TTL_{MAX} - O_{TTL}^i}{TTL_{MAX}}    (5)
where TTL_{MAX} is the maximum persistence in frames and O_{TTL}^i is the object persistence. If the object is found in the current frame, the value of O_{TTL}^i is set to TTL_{MAX}; otherwise it is decreased by one until O_{TTL}^i becomes equal to zero, at which point the object is eliminated from the tracking. Each object from the preceding frame must be absorbed by the region of motion in the current frame that leads to the lowest cost. The values of the object and bounding box centers assume the values of the regions of motion. If there is a region of motion that was not assigned to any object, then a new object is created with the values of such a region. If there is an object that was not assigned to any region of motion, such an object may be occluded and the Lucas-Kanade algorithm will fail to predict the corresponding motion. In this case, the motion of such an object is predicted as:

O_s^i = S \cdot O_s^i + (1 - S) \cdot (R_c^j - O_c^i)
(6)
where S is a fixed value of the speed. The region of motion R_c^j should be the closest region to the object, respecting the proximity threshold. Then, the new position of the object and its bounding box is computed by Equations (7) and (8).
O_c^i = O_c^i + O_s^i    (7)

O_r^i = O_r^i + O_s^i    (8)
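The following sketch gathers Eqs. (1)-(5) into a single assignment-cost routine. The dictionary-based object/region representation is our own, the default weights and thresholds are the calibration values reported later in Section 4, and the direction term interprets the arctangents of Eq. (4) as angles of displacement vectors, which is one possible reading of the original formulation.

```python
import math

def assignment_cost(obj, region, w_p=0.4, w_s=0.1, w_d=0.5,
                    ttl_max=45, prox_thresh=40):
    """Cost of assigning motion region 'region' to tracked object 'obj'.

    obj:    dict with 'center' (x, y), 'size', 'lk_center' (Lucas-Kanade
            prediction) and 'ttl' (frames left to live)
    region: dict with 'center' (x, y) and 'size'
    """
    rx, ry = region['center']
    ox, oy = obj['center']

    d_p = math.hypot(rx - ox, ry - oy)                      # Eq. (2)
    if d_p > prox_thresh:
        return float('inf')                                 # too far to match

    d_s = abs(region['size'] - obj['size']) / (region['size'] + obj['size'])  # Eq. (3)

    lx, ly = obj['lk_center']
    d_d = abs(math.atan2(ly - oy, lx - ox) -                # Eq. (4): predicted vs.
              math.atan2(ry - oy, rx - ox)) % (2 * math.pi) # observed direction

    d_t = (ttl_max - obj['ttl']) / ttl_max                  # Eq. (5)

    return w_p * d_p + w_s * d_s + w_d * d_d + d_t          # Eq. (1)
```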
3 People Counting
The counting algorithm starts to analyze the segmented objects (blobs) when they enter a counting zone (Figure 3). Two strategies were proposed to tackle the problem of estimating the number of persons enclosed by the blobs. The first one employs two thresholds that are learned from the width and area of the blobs, that is, the average width of blobs enclosing one person as well as the average area of the head region of such blobs. A person is added to the count when the analyzed blob does not have a width greater than the width threshold. Otherwise, additional information based on the blob head region area is used to estimate the number of persons enclosed by the blob. To this aim, the average head region area of a blob enclosing a single person is estimated through the analysis of objects in motion in the video frames and is further used as a reference value. The head region is computed by splitting the blob height into four zones; the head region is then the first piece at the top of the blob, as shown in Figure 3. The area of the head region is obtained by counting the number of foreground pixels. The process consists of, given a blob in the counting zone, extracting its head region area and dividing it by the one-person head region reference area. The resulting value, denoted as v, is used to decide the number of persons enclosed by the blob, as shown in Equation (9):

count = \begin{cases} count + 2, & \text{if } v < 2 \\ count + 3, & \text{if } v \ge 2 \end{cases}    (9)

where count is the variable which stores the number of persons.

Fig. 3. Key regions to the counting system

In the second strategy, a zoning scheme divides the head region into ten vertical zones of equal size. From each vertical zone the number of foreground pixels divided by the subregion total area is computed. The features extracted from the zoning scheme, plus the width of the head region, make up an 11-dimensional feature vector. The feature vectors generated from the objects in motion are further matched against reference feature vectors stored in a database using a non-parametric classifier. The second strategy has two steps: training and classification. At the training step, a database of reference feature vectors is built from the analysis of blobs in motion in the frames of videos. Each reference feature vector receives a label to indicate whether it encloses one, two, or three persons. The labels are assigned by human operators at the calibration/training step. At the end, the database is composed of Z reference feature vectors representing all possible classes (one, two or three persons). For classification, given a blob in motion, a set V of feature vectors is extracted. The number of vectors in the set V varies according to the period of time the blob takes to pass through the counting zone, which characterizes a temporal window of variable size. The classification process is composed of two stages: first, each Vi ∈ V is classified using an instance-based approach, more specifically the k-nearest-neighbor (k-NN) algorithm [12]; next, the majority voting rule is applied to the feature vectors in V to come to a final decision. For the k-NN algorithm, the Euclidean distance between each feature vector in V and the Z reference feature vectors stored in the database is computed. The Euclidean distance between a D-dimensional reference feature vector Vz and a testing feature vector Vi is defined in Equation (10):

d(V_z, V_i) = \sqrt{\sum_{d=1}^{D} (V_{zd} - V_{id})^2}    (10)
The k closest reference feature vectors label each feature vector in V with their labels. After the classification of all feature vectors in V, a final decision is given by the vote of each member of the set V, and the classification "one", "two" or "three" is assigned to the object in motion according to the majority vote rule. For example, if there are seven feature vectors in V classified by the k-NN as "one person" and two classified as "two persons", the final decision is to assign the label "one person" to the blob in motion.
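A compact sketch of this second strategy: the 11-dimensional head-region feature vector (ten vertical-zone occupancy ratios plus the head-region width, with the head region taken as the top quarter of the blob mask), followed by k-NN classification and majority voting over the counting zone. The helper names and the use of the raw, unnormalized width are implementation assumptions; feature scaling is not addressed here.

```python
import numpy as np
from collections import Counter

def head_features(blob_mask):
    """11-D feature vector from the head region of a binary blob mask."""
    h, w = blob_mask.shape
    head = blob_mask[: max(h // 4, 1), :]                  # top quarter = head region
    zones = np.array_split(head, 10, axis=1)               # 10 vertical zones
    ratios = [float(np.count_nonzero(z)) / max(z.size, 1) for z in zones]
    return np.array(ratios + [float(w)])

def count_in_blob(frames_features, references, labels, k=5):
    """k-NN label per frame (Eq. (10)), then majority vote over the zone."""
    votes = []
    refs = np.asarray(references, float)
    for v in frames_features:
        dists = np.linalg.norm(refs - np.asarray(v, float), axis=1)
        knn = np.argsort(dists)[:k]
        votes.append(Counter(labels[i] for i in knn).most_common(1)[0][0])
    return Counter(votes).most_common(1)[0][0]             # "one", "two" or "three"
```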
4 Experimental Results
The proposed approach was evaluated on two different databases. The first database consists of video clips where people walk alone and in groups through the field of view of a CCTV camera positioned in an oblique view. These videos were gathered at a university entrance hall through a security camera installed on
the second floor of the building, without any control of illumination and background. The second database is made up of some videos available in the CAVIAR database (see footnote 1). One of the goals of this experiment is to evaluate the adaptability of the proposed approach to different scenarios. The CAVIAR video clips were captured with a wide-angle camera lens in front of a mall store. Both databases contain half-resolution PAL standard video (320 x 240 pixels, 25 frames per second) compressed using MPEG2. About one minute of video from each database was used for calibration and training purposes, that is, to generate the reference feature vectors and adjust some parameters. The remaining videos were used for testing, that is, for evaluating the counting algorithms. The number of samples for each class (one, two or three persons enclosed by a blob) is shown in Table 1. Table 2 shows the total number of persons who passed through the virtual counting zone in each environment. These numbers were obtained manually through human interaction with the videos.

Table 1. Number of samples in each class at CAVIAR Mall and University Hall databases

Class (Number of Persons)   CAVIAR Mall Database   University Hall Database
one                         49                     45
two                         14                     20
three                        4                     11

Table 2. Total number of persons that cross the counting zones in the video clips used in the tests

Database          Total Number of Persons
CAVIAR Mall       92
University Hall   128
The weights and thresholds described in Section 2.2 were empirically defined; this is called the calibration procedure. The d_P proximity threshold was set to 40, TTL_{MAX} to 45, S to 0.9, and the values of the weights w_P, w_S, w_D to 0.4, 0.1 and 0.5, respectively. About one minute of video was used for calibration. The same values were used for both databases. First, we evaluated the total count provided by the automatic counting, where the manual process is compared with the automatic results of each approach. Table 3 presents such a comparison. In order to demonstrate the necessity of an approach which handles the proximity of people, the outcome of the tracking algorithm alone is also presented, since it does not include any treatment of this problem.

Footnote 1: http://homepages.inf.ed.ac.uk/rbf/CAVIAR/
Table 3. Automatic counting outcome achieved by different strategies on both databases

                                              Head Region Analysis
Database     Tracking       Threshold Based   11-nn           5-nn            1-nn
CAVIAR       74 (80.43%)    81 (88.04%)       92 (100%)       93 (98.92%)     91 (98.91%)
University   94 (73.74%)    155 (82.58%)      142 (90.14%)    149 (85.91%)    157 (81.53%)
Next, we evaluated the behavior of each approach when dealing with the proximity of persons in the scene. The main goal of these experiments is to evaluate whether the approaches are able to estimate the correct number of persons in groups. Tables 4 and 5 show the confusion matrices for the threshold-based approach on the CAVIAR and University databases, respectively, while Tables 6 and 7 present the confusion matrices for the head region analysis approach on the CAVIAR and University databases, respectively.

Table 4. Confusion matrix for the threshold-based approach on the CAVIAR videos

Class    one   two   three   Correct Counting Rate
one      44    5     0       89.79%
two      8     6     0       42.85%
three    0     2     2       50%
                     Average 77.61%
Table 5. Confusion matrix for the threshold-based approach on the University Hall videos

Class    one   two   three   Correct Counting Rate
one      33    7     5       73.33%
two      5     9     6       45%
three    0     5     6       54.54%
                     Average 63.15%
Both approaches achieved encouraging results when just the final count is observed (Table 3). However, when the way these results are obtained is examined in detail, it can be seen that the head region analysis yields more reliable results than the threshold-based approach. An error analysis has shown the two main sources of errors in the proposed people counting method. The first is related to the background/foreground segmentation process, mainly due to clothing colors being confused with the background. The second is related to the tracking process, especially in the presence of occlusions. Figure 4 shows an example in which both strategies used for people counting have failed, since one of the two individuals in the scene is almost totally occluded by the other.

Table 6. Confusion matrix for the head region analysis approach on the CAVIAR videos

Class    one   two   three   Correct Counting Rate
one      49    0     0       100%
two      3     9     2       64.28%
three    0     2     2       50%
                     Average 89.55%

Table 7. Confusion matrix for the head region analysis approach on the University Hall videos

Class    one   two   three   Correct Counting Rate
one      43    2     0       95.55%
two      5     14    1       70%
three    0     7     4       36.36%
                     Average 80.26%

Fig. 4. Error caused by occlusion: both strategies have counted one person instead of two
5 Conclusion
In this paper we have presented an approach for counting people who pass through a virtual counting zone and are captured by a general-purpose CCTV camera. One of the proposed strategies is able to count with relative accuracy the number of persons even when they are very close to each other or in groups. Such a strategy is able to classify the number of persons enclosed by a blob. The use of this simple approach has led to satisfactory results, and the classification of the number of people in a group was carried out in real time due to the simplicity
of the classification technique and the low dimensionality of the feature vectors used to discriminate the number of persons. Although the preliminary results are very encouraging, since we have achieved correct counting rates of up to 85% on video clips captured in a non-controlled environment, further improvements are required, especially in the first steps of the proposed approach, which include more sophisticated and reliable segmentation and tracking algorithms. Compared with other works that deal with counting persons through a non-dedicated camera, the proposed approach is simpler but has achieved a similar correct counting rate. However, at this moment a direct comparison is not possible due to the different experimental setups and databases used in the tests. Furthermore, a large-scale test in a number of different environments is also necessary.
References 1. Kim, J.-W., Choi, K.-S., Choi, B.-D., Ko, S.-J.: Real-time Vision-based People Counting System for the Security Door. In: Proc. of 2002 International Technical Conference On Circuits Systems Computers and Communications, Phuket (July 2002) 2. Snidaro, L., Micheloni, C., Chiavedale, C.: Video security for ambient intelligence. IEEE Transactions on Systems, Man and Cybernetics PART A 35(1), 133–144 (2005) 3. Nakajima, C., Pontil, M., Heisele, B., Poggio, T.: People recognition in image sequences by supervised learning. In: MIT AI Memo (2000) 4. Gavrila, D.: Pedestrian detection from a moving vehicle. In: Proc. 6th European Conf. Computer Vision, Dublin, Ireland, vol. 2, pp. 37–49 (2000) 5. Giebel, J., Gavrila, D.M., Schnorr, C.: A bayesian framework for multi-cue 3d object tracking. In: Proc. 8th European Conf. Computer Vision, pp. 241–252. Prague, Czech Republic (2004) 6. Lin, S-F., Chen, J-Y., Chao, H-X.: Estimation of Number of People in Crowded Scenes Using Perspective Transformation. IEEE Trans. Systems, Man, and Cybernetics Part A 31(6), 645–654 (2001) 7. Haritaoglu, I., Harwood, D., Davis, L.S.: W4: Real-Time Surveillance of People and Their Activities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 809–830 (2000) 8. Hu, W., Tan, T., Wang, L., Maybank, S.J.: A Survey on Visual Surveillance of Object Motion and Behaviors. IEEE Trans. Systems, Man, Cybernetics, Part C, 334–352 (2004) 9. Lei, B., Xu, L.Q.: From Pixels to Objects and Trajectories: A generic real-time outdoor video surveillance system. In: IEE Intl Symp Imaging for Crime Detection and Prevention, pp. 117–122 (2005) 10. Latecki, L.J., Miezianko, R.: Object Tracking with Dynamic Template Update and Occlusion Detection. In: 18th Intl Conf on Pattern Recognition, pp. 556–560 (2006) 11. Lucas, B.D., Kanade, T.: An Iterative Image Registration Technique with an Application to Stereo Vision. In: 7th Intl Joint Conf Artificial Intelligence, pp. 674–679 (1981) 12. Aha, D.W., Kibler, D., Albert, M.K.: Instance-Based Learning Algorithms. Machine Learning 6, 37–66 (1991)
Simulation of Automated Visual Inspection Systems for Specular Surfaces Quality Control* Juan Manuel García-Chamizo, Andrés Fuster-Guilló, and Jorge Azorín-López U.S.I. Industrial Information Technology and Computer Networks Information Technology and Computing Dept. University of Alicante. P.O. Box 99. E-03080. Alicante. Spain {juanma,fuster,jazorin}@dtic.ua.es http://www.ua.es/i2rc
Abstract. This paper proposes the use of simulations as a design mechanism for visual inspection systems of specular surfaces. The system requirements and the characteristics of the objects pose a technological design problem for each of the solutions to be developed. A generic model is proposed. It may be adapted or particularised to solve specific inspection problems using simulations. The method results in a flexible, low-cost design, reducing the distance between the design model and the system implementation in a manufacturing procedure. The proposed simulator generates model-based architectures. The paper shows the results of its application to metallized automobile logos. Keywords: automated visual inspection, specular surfaces, simulation, quality control.
* This work was supported by the Spanish MCYT DPI2005-09215-C02-01.

1 Introduction

This paper presents the use of simulations in order to automatically generate vision systems, in particular for the visual inspection of specular surfaces. This type of material presents optical difficulties due to its high reflection coefficient, which causes undesirable reflections and glare that conceal chromatic and morphological information about the object. Most conventional methods of artificial vision ignore the specular reflection characteristics of materials. They focus on surfaces whose reflection is essentially diffuse, arguing that it contains the object information. However, by making use of traditional techniques, methods have been developed, with certain restrictions, that try to identify the specular reflection component in order to separate it from the diffuse component [1][2]. Alternatively, specific methods have been developed that make the most of specular reflection to extract the shapes of objects [3][4][5][6][7]. Inspection of products with specular reflection is an open problem. Although some partial solutions have been provided, very few systems have been developed for addressing the automatic visual inspection of these objects. The solutions proposed make use of characteristics of active vision. Lighting of the scene is a determining
factor, as it can facilitate the detection of defects in images [8]. Structured lighting techniques are considered reliable and adequate for inspection of the surface shape and are present in the few systems reported in the literature [9][10][11][12][13][14][15]. In addition, the use of the same sensor to determine reflectance is another argument for this type of technique (otherwise reflectance would have to be acquired independently). The systems differ according to the characteristics of the pattern: how the lighting pattern is acquired, through projection onto the surface or by focusing the system on the reflection (considering the object as part of the optical system onto which lighting patterns are projected); the instruments used in generating the pattern [16] (screens, projectors, etc.); and the method used for codification [17][18]. Efficiency and speed are the two strictest conditions that the inspection technique has to satisfy [15][19]. Efficiency is related to the vision system accuracy rates. Regarding speed, the system should operate at the rate of the rest of the production line. These requirements pose a technological design problem for each of the solutions to be developed. In addition, the prototype on which tests and corrections are carried out is physical, which causes an increase in both financial and temporal costs. The selection of working and lighting conditions is complex and critical for a satisfactory inspection [20]. In some works, simulations were carried out in order to select the characteristics of the lighting pattern and the methods of capture. Nevertheless, the choice of different patterns has not been considered, nor are other factors taken into account which would affect the effectiveness of the system, such as the angle of capture, zoom, focus, etc. [11][21][22]. Within this framework, we propose a model and methods for vision systems designed for the automatic inspection of products with specular reflection surfaces and undefined curvature. The solution consists of minimising the lack of sensitivity that arises in certain image acquisition conditions (those related to specular surfaces). The inspection model amplifies the relevant characteristics defining the objects by means of environmental conditioning and camera calibration [23]. The inspection architectures for a specific product may be conceived through particularisation of the general model. To achieve this, we propose to develop a simulator. It will enable validation of the model and guide the experiments using laboratory instruments. The simulator will enable evaluation of the architectures as an alternative to preliminary tests on physical prototypes. The basic contribution of this work focuses on the general nature of the inspection model. It will enable architectures to be developed through particularisation using simulations. This method will permit a reduction of the distance between the stages of hypothesis and experimentation, that is, a reduction of the technical or scientific development cycle by means of preliminary validation of the viability of the model, and a reduction of experimental costs by using virtual prototypes.
2 Model for Inspection of Specular Surfaces

The inspection model is based on a normalisation ϒ that determines the different defects in an image. The transformation ϒ uses two stages (see figure 1): ϒμ and ϒe,c. The transformation ϒμ is able to determine the function μ of the object (that is,
751
chromatic, morphological, topographical information, etc.) and will be the objective of the inspection system. Establishing ϒμ is a complex transformation when specular surfaces take part in the scene. Therefore, performing transformation ϒe,c will enable us to obtain an image in which the sensitivity is adequate to perceive the defects. The problem is solved by obtaining the precise expression for ϒ in each case. However, the aim of providing a general solution motivates the decision to resort to explicit forms in order to embody the knowledge contained in ϒ. The transformation ϒe,c stores the suitable environmental and calibration conditions in order to obtain an improved image and to be able to determine the defects: lighting angles, vision angles, focus distance, chromaticity, polarisation and other characteristics of light transmission. It is known that the camera provides a radiance measurement for the scene [24]. Assuming that the inspection object is unique in the scene, the scene radiance, LR, is related to the object reflectance (Bidirectional Reflectance Distribution Function, BRDF), fR [24], and irradiance, I, which affects it according to the equation (1).
LR = f R I
(1)
The environment modulates the contribution of the object in the camera. To be precise, as can be seen in equation (1), the irradiance, function of the environment, I modulates the object reflectance fR in LR (captured by the camera). Thus, without considering the other perception characteristics, irradiance is a key factor since it enables areas of the surface of the object to be isolated, the contrast in the camera to be increased or decreased, etc. Therefore, with the aim of reducing the cost of determining the conditions in which the capture of objects has to be performed, an irradiance function I for the object is originally established: ϒeФ. This function acts as an amplifier of the reflectance of the object, and consequently of the sensitivity of the sensor in order to increase the ranges of environmental and calibration conditions where the system is able to perceive the defects. In this way, the system can perceive the defects in more values of angles conditions, resolutions, etc. Transformation ϒeФ must allow different areas to be established on the object so that its projection on the image will enable the variations in the object functions to be established in those areas. For this purpose, we propose a lighting scheme according to which the adjacent areas of the object will be radiated by different spectral powers, defining, in this way, areas of different radiances. These radiances will establish independent areas on the image, which will allow them to be studied individually. Thus, areas with different responses dependent on the radiant energy and on the reflectance fR of the object will be established in the image domain. It is interesting to define the function determining the energy that later reaches the object in terms of the field radiance (Lf) and which can be adapted to the different situations. In order to do this, the function ϒeФ is defined according (2).
$\Upsilon_{e\Phi}(m, \mu_i) = L_f(s, \Lambda, \xi, t)$  (2)
The function has four parameters: s, Λ, ξ and t. The area of interest in the motive may be configured through the parameter s. This parameter establishes the
morphology of the regions that will form on the surface. The variable Λ contains the set of lighting characteristics that cover each of the established regions. The function ξ determines the spatial configuration of the energy that reaches the motive by distributing the lighting characteristics of Λ among the areas established by the function s, which will be projected on the image. Finally, the function depends on time t: if the structuring of the lighting is temporal, a sequence of patterns must be generated.
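As an illustration only, the four parameters of ϒeΦ could be grouped in code as in the minimal C++ sketch below; the type and member names (RegionShape, LightingPattern, etc.) are hypothetical and are not taken from the paper.

#include <cstddef>
#include <functional>
#include <vector>

struct RegionShape {            // s: morphology/size of the regions on the surface
    double width_mm;
    double height_mm;
};

struct SpectralBand {           // one element of Lambda: a monochromatic light
    double wavelength_nm;       // e.g. in the 380-780 nm range used later in the paper
    double power;               // relative radiant power
};

struct LightingPattern {
    RegionShape s;                               // region morphology
    std::vector<SpectralBand> lambda;            // Lambda: lighting characteristics
    // xi: assigns a band of lambda to each region index, possibly varying with time t
    std::function<std::size_t(std::size_t regionIndex, double t)> xi;
    bool temporal = false;                       // if true, a sequence of patterns is generated
};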
Fig. 1. Inspection model. Based on the input of a motive m, transformation ϒe,c determines the environmental and calibration conditions so that the camera can capture it. Prior to this, the motive reflectance is amplified by means of ϒeΦ. Transformation ϒμ determines the variation δ of the motive by comparing the image F(mδ) with the stored image.
Once the function ϒeΦ has been defined, it is necessary to establish the remaining conditions on the scene magnitudes ρi (3) under which the system perceives adequately (we have called this process tuning the system).
$\rho_i = \rho_i(m, e, c)$  (3)
The magnitudes ρi contribute to the formation of the image: scale, camera angle, lighting intensity, frequency, colour saturation, etc. Each magnitude is a function of the magnitudes μi of the motive m, the magnitudes εi of the environment e and the magnitudes γi of the camera. The contribution of the motive m to the image constitutes the really valuable information. Therefore, the camera and environment variables ϒe-Φ,c will need to be established for the different defects that may appear in the object. Finally, the comparison stage ϒμ is established between the image obtained through the transformation ϒe,c and the image stored in the knowledge base. The inspection stage determines the image processing techniques that will enable detection of defects in the inspection motive. For example, a series of descriptors may be computed for the inspection images and for those stored in the database in order to compare them. It is also possible to compute differences on the image itself, although this requires a transformation or rotation with respect to the motive, environment, etc. A graphic diagram of the transformation process is shown in figure 1.
The specification of each transformation ϒ establishes the computational and economic costs of the inspection system for the detection of defects in the surfaces of specific products. The design of the architectures must consider performance and cost criteria, measured in terms of robustness, accuracy, density of information, number of images and speed, among other items.
3 Simulation Aided Design
Simulation provides a mechanism for the preliminary validation of technological designs and assistance in conceiving models. It may be used to reinforce, reject or redirect the basic hypotheses of research. It has a direct effect on the costs associated with each stage of the scientific method and, mainly, with laboratory experimentation. Computer simulation enables controlled tests to be carried out on an initial hypothesis, facilitating its refinement until it is mature. A systematic, flexible, low-cost validation method is established without any need to perform preliminary experiments on physical prototypes. Computational modelling of systems and their simulation as a validation mechanism is not a new idea. Computer simulation has been introduced in almost all areas of scientific and technological knowledge. For example, in the technological field, simulations are carried out in order to contrast ideas about a potential system prior to addressing its design. However, this is not a systematic procedure. Computer simulation may become a universal and flexible tool that can be used systematically in scientific analyses or technological designs. It provides a rapid abstraction and specification mechanism based on computational models, and it would permit progress in all economic, social and industrial sectors, etc. For example, the advantages associated with assisted design through simulation of the inspection system primarily consist of: a substantial cost reduction in terms of system analysis time and development time in the production line; a reduction of the financial costs of the technological material required, as the process obviates the need for expensive physical prototypes; the maintenance of knowledge bases of inspection systems with respect to their technology, processing modules and, in general, the designed architecture, which could be reused for implementing other similar systems for other inspection product specifications; and, finally, a reduction of errors deriving from the analysis, design and implementation phases of real inspection systems, which may be detected quickly before considerable costs are incurred. In the observation stage of scientific or technological development, the introduction of computational models enables virtual worlds to be established based on scientific or technological models and theories. Some of the advantages of the use of virtual worlds are the reduction of costs associated with observation, lower laboratory material costs and the possibility of generating data repositories in a rapid and flexible manner. In the experimentation phase, the computational models developed are able to guide the experiments to be carried out. Laboratory experiments usually use a restricted universe model. In many cases, this is artificial, designed ad hoc in order to carry out the appropriate tests to validate basic hypotheses. It could be substituted by a computational model. In some cases, this may even be more appropriate for the
experimentation phase. For example, Hughes argues for the inclusion of algorithms that simulate complex physical systems in order to carry out experiments [25]. Norton and Suppe defend simulation as a means of experimentation, which serves as an instrument to test or detect phenomena in the real world [26]. The difficulty of experimentation on a conventional model may exceed the material possibilities available at any given time. For example, studies on human knowledge and intelligence have become more promising by applying computational models, which use the brains of certain animals as a model for the human brain.
4 Simulation of Model-Based Architectures
The computational modelling required to simulate hypotheses should cover several levels: from image formation, based on scene magnitudes, to the visual process of automatic inspection based on the proposed ϒ transformations. In the first stage, a virtual environment is developed based on existing models and scientific theories in order to form the image. Later, computational modelling is established for the hypothesis of the visual inspection system for specular surfaces.
4.1 Modelling of the Inspection System Hypothesis
The simulator stage related to the ϒ transformations recreates the characteristics necessary for inspecting the generated virtual world, based on the synthesis of the scene magnitudes ρi (scale, angle and lighting) and on the synthesis of the image. In addition, it takes into account the performance P and cost C offered by the architecture of the vision system. These additional characteristics facilitate the study of experimental data and the extraction of conclusions in the automatic generation of hypotheses for vision system architectures. The performance P is related to the efficiency E and speed (the two restrictive conditions which must be fulfilled by the inspection system) of the system when inspecting a magnitude μi of motive m. The speed will be studied through the inspection time T. Cost C is related to the technological and market aspects involved in obtaining the architecture. The computational model *ϒeΦ (4) of transformation ϒeΦ (2) stores, in knowledge bases for each of the magnitudes μi (chromaticity, morphology, topography, etc.) to be inspected: the field radiance function Lf, the performance PeΦ and the cost CeΦ associated with the architecture of the inspection system using this radiance. The contribution of performance PeΦ to the global P of the architecture reflects the time TeΦ taken by the system to establish the radiance Lf and the efficiency EeΦ provided by this structured lighting. The cost CeΦ is associated, in this case, with the costs of the elements that enable the desired structured lighting to be obtained: lighting sources, positioning elements, etc.
${}^{*}\Upsilon_{e\Phi}(m, \mu_i) = L_f(s, \Lambda, \xi, t),\; P_{e\Phi},\; C_{e\Phi}$  (4)
Transformation ϒe-Φ,c establishes the appropriate tuning of the system (scale, angle, lighting, etc.) in order to determine any defects that might appear in the various magnitudes μi of the motive m. It will determine the appropriate values of ρi (3) for
the system in terms of its components (environment and camera). Tuning the system requires acting on any of the variables intervening in the associated scene magnitude ρi (for the inspection problem we have considered the scale ρE, the angle between the camera and the motive ρθ, and the lighting intensity ρI). In the case of the scale ρE, the system positions the motive at a specific distance from the camera, positions the camera at that distance from the motive, or changes the camera zoom. The procedure is similar in the case of the angle ρθ: either the motive or the camera may be positioned and oriented. In the case of tuning the lighting ρI, all environmental parameters e and the camera c will be used. The variables which permit the desired values of the scene magnitudes ρi to be obtained establish the tuples GρE, Gρθ and GρI for the scale, the angle and the lighting respectively. The computational model *ϒe-Φ,c (5) makes use of knowledge bases which store a set of tuples GρE, Gρθ and GρI for the ρi values suitable for inspection of the complete motive. The performance Pe-Φ,c and associated cost Ce-Φ,c are also considered, in a similar way to the function *ϒeΦ.
${}^{*}\Upsilon_{e-\Phi,c}(m, \mu_i) = \{G_{\rho_E}, G_{\rho_\theta}, G_{\rho_I}\},\; P_{e-\Phi,c},\; C_{e-\Phi,c}$  (5)
The temporal cost Te-Φ,c of performance Pe-Φ,c depends on the variables chosen to obtain the value ρi. For example, the cost Te-Φ,c of establishing a scale value ρE based on a change in the zoom of the optics may be less than doing so by repositioning the motive or the camera. The cost Ce-Φ,c is given by the technological material that enables the scene magnitudes associated with the scale ρE, the angle ρθ and the lighting ρI to be obtained. The transformation *ϒμ determines the defects of the inspection motive, establishing the efficiency Eμ of the system. The time Tμ required for motive inspection and the cost Cμ of the implementation determine the final stages of the architecture.
$E(m, \mu) = E_{e\Phi} = E_{e-\Phi,c} = E_\mu$
$T(m, \mu) = \{T_{e\Phi}, T_{e-\Phi,c}, T_\mu\}$  (6)
The performance P of the architecture for the inspection of a magnitude μ of the motive m is therefore defined according to equation (6). The efficiency E of the architecture is stored in each transformation *ϒeΦ, *ϒe-Φ,c and *ϒμ. The inspection time T depends on the time for establishing the radiance, TeΦ, the time for tuning the scene magnitudes ρi, Te-Φ,c, and the inspection time Tμ. Finally, the cost C of the architecture (7) depends on the technological cost of all the elements involved.
$C(m, \mu) = C_{e\Phi} + C_{e-\Phi,c} + C_\mu$  (7)
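As a rough illustration of how an architecture candidate could be scored in a simulator, the following C++ sketch aggregates the efficiency, time and cost terms of equations (6) and (7); all type and field names are hypothetical and not taken from the paper.

#include <vector>

// Per-stage figures for *Y_ePhi, *Y_e-Phi,c and *Y_mu (hypothetical layout).
struct StageFigures {
    double efficiency;   // E contribution of the stage
    double time;         // T contribution (e.g. seconds per motive)
    double cost;         // C contribution (e.g. monetary units)
};

struct ArchitectureScore {
    double efficiency;           // E(m, mu): shared by all stages, eq. (6)
    std::vector<double> times;   // T(m, mu) = {T_ePhi, T_e-Phi,c, T_mu}
    double totalCost;            // C(m, mu) = C_ePhi + C_e-Phi,c + C_mu, eq. (7)
};

ArchitectureScore score(const StageFigures& ePhi,
                        const StageFigures& ePhiC,
                        const StageFigures& mu) {
    ArchitectureScore s;
    // Eq. (6): all stages are assumed to share the same efficiency value.
    s.efficiency = mu.efficiency;
    s.times = { ePhi.time, ePhiC.time, mu.time };
    // Eq. (7): the architecture cost is the sum of the stage costs.
    s.totalCost = ePhi.cost + ePhiC.cost + mu.cost;
    return s;
}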
4.2 Generation of Architecture Hypotheses for Inspection
Let A be the set of possible architectures for visual inspection of a set of motives MI (8). An architecture a is defined as an ordered set of tuples which contain the characteristics
for carrying out transformations ϒeΦ and ϒe-Φ,c. This ordered set contains the work plan for the inspection of MI.
$A = \left\{ a = \left\{ \Upsilon_{e\Phi}(m_i, \mu_j),\ \Upsilon_{e-\Phi,c}(m_i, \mu_j) \right\} : \bigcup (m_i) = M_I \right\}$  (8)
Therefore, the order of the motives m to be inspected, the radiance conditions (4), and the tuning of the scene magnitudes ρi define the characteristics of the system. They establish the performance P and cost C for the inspection of a set of motives. The choice of an architecture a from the set A of those possible for the inspection of MI will be determined by different criteria relating to that performance P and cost C: cost reduction, increased performance, increased performance/cost ratio, etc. The hypothesis generator and simulator of the inspection system should be able to provide the appropriate architecture for each problem. To carry out this procedure, the system calculates the performance P and cost C of the architecture in all conditions: for each defect of the motive m, for all the motives to be inspected, for each of the radiance conditions (according to transformation ϒeΦ) and for all the scene magnitudes (scale ρE, angle ρθ, and lighting ρI) defined for the problem under study. In this work, we focus specifically on the performance P of the system, in particular the efficiency E. Time T and cost C depend on aspects of the technology available at any given time. We propose to calculate the difference between the images of a defective motive mj and of a motive without defects mi (model motive), captured under the same conditions, in order to extract the defects mδ (9). The difference follows the specified transformation ϒμ and is designed to be a measurement independent of any other transformation that may be chosen ad hoc.
$F(\rho(m_\delta, e, c)) = F(\rho(m_i, e, c)) - F(\rho(m_j, e, c))$  (9)
In addition, in order to ensure that the difference between the images is independent of the capture, the defective motive mj (10) is provided with a series of defects mδ, in each μi, distributed uniformly over the surface of the motive. A defect synthesis function syndef is applied to the model motive mi.
$m_j = \mathrm{syndef}(m_i, m_\delta)$  (10)
The ratio between the number of points in the difference image (9) and the number of points T(m, e, c) which represent defects provides the efficiency E of transformation ϒμ. We use this calculation as a measurement of the sensitivity of the system, that is, of its capacity for perceiving defects, and it is stored in a knowledge base BDδ (11). We have called BDδ the sensitivities database.
$BD_\delta(m, \mu_i, \Upsilon_{e\Phi}, \rho_E, \rho_\theta, \rho_I) = \dfrac{\#\left(F(\rho(m_\delta, e, c))\right)}{\#\, T(m_\delta, e, c)}$  (11)
Queries on the sensitivities database BDδ enable us to construct the transformations *ϒeΦ and *ϒe-Φ,c according to their performance P. Consequently, the magnitudes ρi for which the system sensitivity is appropriate for the perception of defects can be determined.
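A minimal C++ sketch of how one entry of the sensitivities database could be computed from a rendered image pair is given below; the image representation, the threshold and the function names are assumptions of this sketch and only illustrate equations (9)-(11).

#include <cstddef>
#include <vector>

// A rendered grey-level image, row-major (hypothetical representation).
struct Image {
    std::size_t width = 0, height = 0;
    std::vector<double> pixels;          // size = width * height
};

// Eq. (9): pixel-wise difference between the defect-free and defective captures
// (assumed to have identical dimensions), followed by the point counts of
// eq. (11). 'threshold' separates significant differences from noise.
double sensitivityEntry(const Image& modelCapture,
                        const Image& defectiveCapture,
                        std::size_t defectPointCount,   // #T(m_delta, e, c)
                        double threshold) {
    std::size_t detected = 0;
    for (std::size_t k = 0; k < modelCapture.pixels.size(); ++k) {
        const double diff = modelCapture.pixels[k] - defectiveCapture.pixels[k];
        if (diff > threshold || diff < -threshold)
            ++detected;                   // #(F(rho(m_delta, e, c)))
    }
    // Eq. (11): ratio of detected difference points to synthesised defect points.
    return defectPointCount == 0 ? 0.0
                                 : static_cast<double>(detected) / defectPointCount;
}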
5 Inspection of Automobile Logos
This section presents the results of applying the model and the simulator to the development of specular automobile logo inspection systems based on the efficiency E of the system. Firstly, the virtual environment was developed based on existing scientific models and theories of image formation. In order to study local light reflection, the reflectance model presented by Cook-Torrance [27] was used as a basic element. Recent studies show its considerable ability to fit reflectance measured from real objects [28].
5.1 Creation of the Sensitivities Database
For the creation of the sensitivities database BDδ (11), planes were used in order to control the angle formed by the elements involved in the scene over all parts of the object. Three types of defects were distributed over these 12 x 12 mm2 planes: two changes in topography (a 0.6 mm-diameter crack or crater) and a change in colour on a surface measuring 0.6 x 0.6 mm2. The planes have a specularity coefficient (s in the Cook-Torrance model [27]) of 0.75 and a roughness (rms, or m in [27]) of 0.1. There are two types of planes with the above characteristics: dielectric and metallic. The dielectric planes have a refraction index of 1.6, while the metallic planes have a refraction index of 2.8 and an extinction index of 3.2, used in the Fresnel equations.
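For reference, a compact C++ sketch of the Cook-Torrance specular term described in [27] is given below. It uses Schlick's approximation for the Fresnel factor instead of the exact Fresnel equations used in the paper, omits the diffuse term and the specular weight s (0.75 in the paper), and all helper names are ours; it is an illustrative sketch, not the authors' implementation.

#include <algorithm>
#include <cmath>

struct Vec3 { double x, y, z; };

static double dot(const Vec3& a, const Vec3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

static Vec3 normalize(const Vec3& v) {
    const double len = std::sqrt(dot(v, v));
    return { v.x / len, v.y / len, v.z / len };
}

// Specular term of the Cook-Torrance model [27]: Rs = F D G / (pi (N.L)(N.V)).
// n, l, v must be unit vectors (surface normal, direction to the light, direction
// to the viewer); m is the rms roughness (0.1 in the paper) and f0 the
// normal-incidence reflectance.
double cookTorranceSpecular(const Vec3& n, const Vec3& l, const Vec3& v,
                            double m, double f0) {
    const double kPi = 3.14159265358979323846;
    const Vec3 h = normalize({ l.x + v.x, l.y + v.y, l.z + v.z });
    const double nl = std::max(dot(n, l), 1e-6);
    const double nv = std::max(dot(n, v), 1e-6);
    const double nh = std::max(dot(n, h), 1e-6);
    const double vh = std::max(dot(v, h), 1e-6);

    // Beckmann microfacet distribution D.
    const double cos2 = nh * nh;
    const double tan2 = (1.0 - cos2) / cos2;
    const double d = std::exp(-tan2 / (m * m)) / (m * m * cos2 * cos2);

    // Geometric attenuation factor G.
    const double g = std::min(1.0, std::min(2.0 * nh * nv / vh, 2.0 * nh * nl / vh));

    // Schlick approximation of the Fresnel term F (an assumption of this sketch).
    const double f = f0 + (1.0 - f0) * std::pow(1.0 - vh, 5.0);

    return (f * d * g) / (kPi * nl * nv);
}

// Normal-incidence reflectance from a refraction index nIdx and extinction
// coefficient kIdx, e.g. nIdx = 1.6, kIdx = 0 for the dielectric planes and
// nIdx = 2.8, kIdx = 3.2 for the metallic planes described above.
double f0FromIndices(double nIdx, double kIdx) {
    return ((nIdx - 1.0) * (nIdx - 1.0) + kIdx * kIdx) /
           ((nIdx + 1.0) * (nIdx + 1.0) + kIdx * kIdx);
}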
Fig. 2. Lighting environmental conditioning Interlaced XY with different areas of interest: a) 0.1x0.1 mm2, b) 0.2x0.2 mm2, c) 0.3x0.3 mm2, d) 0.6x0.6 mm2
Two environmental conditioning functions ϒeΦ were taken into account: Interlaced XY [23], which provides maximum sensitivity of the system for perceiving the defects, and a non-structured lighting function, Homogeneous (white lighting). The goal is to show the influence of the increase in perception capacity obtained using ϒeΦ and the development of inspection architectures according to the different sensitivity results for each type of material (dielectric or metallic). The function ξ of the Interlaced XY lighting distributes monochromatic lights Λ from different sets of wavelengths (380-780 nm) over different regions. The regions adjacent to a given region are assigned different wavelengths, maximizing the wavelength differences between them (see figure 2). Different areas of interest were considered for each of these functions (0.1x0.1 mm2, 0.2x0.2 mm2, 0.3x0.3 mm2 and 0.6x0.6 mm2). Time t is not considered in this case.
The system tunes the scene magnitudes ρi by modifying the camera angle with respect to the object normal (0º-90º) to obtain different values of ρθ, the scale ρE (1 p/mm, 5 p/mm, 10 p/mm and 15 p/mm), and the lighting angle (0º-90º) with respect to the object normal in order to consider different lighting values ρI.
Fig. 3. Partial views of the BDδ for dielectric and metallic motives, for the unstructured lighting configuration ϒeΦ and for the configuration used for amplification of the sensitivity (Interlaced XY). Each panel plots the sensitivity (%) against the scale ρE (pixels/mm) and the angle ρθ (degrees).
Figure 3 shows partial views of the database BDδ organised according to the scale ρE, the angle ρθ and the motive characteristics. The average value was used for the remaining characteristics, such as the lighting areas, the types of defects, the scene lighting ρI, etc.
5.2 Application of the Mercedes-Benz¹ Logo Inspection Results
We have used a Mercedes-Benz logo with a 100 mm diameter in order to show the application of the results to the inspection of specular surfaces. The geometric characteristics of the part define a total inspection surface of 4676.15 mm2.
¹ Mercedes-Benz and its logo are a trademark of DaimlerChrysler AG.
Given that the minimum sensitivity threshold for detection depends on the application, in this case a minimum inspection sensitivity of 80% has been assumed. The inspection of a dielectric motive with Homogeneous lighting can be carried out at 15 points of the scene magnitudes ρi (see figure 3). Environmental conditioning with the function of maximum sensitivity, Interlaced XY, raises to 19 the number of ρi points at which the defects can be perceived adequately. The inspection of a metallic motive with an unstructured lighting environment restricts the inspection conditions to only five values of the scene magnitudes ρi (a situation of lower sensitivity than those studied in the conditioning experiments [23]). These results contrast with those achieved by ϒeΦ Interlaced XY, which permits the scene magnitudes to be tuned at 20 points (figure 3). The choice of the conditions under which the capture for the inspection of the whole part is carried out is a complex problem, and the particularities of each solution must be taken into account. In this case, in order to apply the results, the scale ρE has been set to its greatest value (15 pixels per millimetre). The perception angle ρθ (angle between the camera normal and the motive normal) may vary up to 30º for dielectric surfaces and up to 20º for metallic surfaces when the Homogeneous lighting function is used. When the best conditioning function, Interlaced XY, is used, the angle may vary between 0º and 40º, irrespective of the electromagnetic characteristics. In order to determine the appropriate angles between camera and motive, an approximation to the optimum solution was proposed which increases the inspected area of the parts and reduces the number of images to be captured. For this purpose, a search tree was designed using a branch and bound algorithm, in which the solution space of each node is reduced to a maximum of five children and a maximum temporal processing level is established. The sensitivity associated with the values of the scene magnitudes requires the motive to be captured in 15 images for the inspection of this part with a dielectric surface and a Homogeneous lighting configuration. The solution extracted from the search tree that maximises the area to be inspected and reduces the number of captures is able to process a total surface of 4671.17 mm2, practically 100% of the surface. The resolution requirements of the camera vary from 740x740 to 1452x1452 pixels for the first capture (figure 4). Inspection of the dielectric part using the maximum sensitivity environmental conditioning, function ϒeΦ Interlaced XY, reduces the capture set from 15 to 9 images. The camera resolution varies between 826x826 and 1455x1455 pixels in order to cover a total surface area of 4675.05 mm2. The first capture covers 45% of the total inspection of the logo. This contrasts with the unstructured lighting environment which, from the same position, covers only 11.58% of the part. The metallic logo inspection requires 26 captures using Homogeneous lighting. In this case, only angles between the camera and the motive normal of less than 20º are permitted, setting a perception scale of 15 pixels per mm. Thus, this inspection plan is the worst. With a camera of 1452x1452 pixels, it
Fig. 4. Captures of the Mercedes-Benz dielectric logo using the Homogeneous function ϒeΦ
Fig. 5. Captures using the function ϒeΦ Interlaced XY for the metallic logo
is able to inspect a total surface area of 4619.96 mm2, which represents 98.79% of the part. Certain captures may be made at a lower resolution, down to 510x510 pixels. Finally, the inspection of the metallic surface using the structured lighting provided by function ϒeΦ Interlaced XY produces a considerable improvement, reducing the number of captures from 26 to 9 (see figure 5).
6 Conclusions
The research addresses the problem of the adverse vision conditions that arise when specular surfaces intervene. A general model is provided in which solutions to problems may
be obtained by particularisation within a formal framework that enables study in the domain of the motive. The use of simulators for the validation of hypotheses reduces the distance between abstract notions and the reality of the phenomena. It reduces the financial cost of the requisite technological material, avoiding the need for expensive physical prototypes, whilst also rationalising development times. Automatic generation and simulation of system hypotheses for inspection using computational models has a direct effect on the reliability and cost of such systems. The study enables the best conditions for carrying out the inspection of specular surfaces to be discerned through evaluation of the proposed architectures: with respect to resolution and capture angles, environmental lighting, etc. These may be reasoned about and deliberated before being put into practice. The conclusions of the simulation will establish, for example, the minimum characteristics of the system, avoiding particularly complicated technological designs that do not necessarily lead to increased performance. The simulation confirms the basic hypothesis of the work, and it would now be appropriate to advance towards physical experimentation by framing the technological development of visual inspection of specular surfaces for specific industries: the automobile industry, bathroom fixtures and fittings, jewellery, marble, mirror manufacturing, etc.
References
1. Swaminathan, R., Kang, S.B., Szeliski, R., Criminisi, A., Nayar, S.K.: On the Motion and Appearance of Specularities in Image Sequences. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, Springer, Heidelberg (2002)
2. Lin, S., Shum, H.: Separation of Diffuse and Specular Reflection in Color Images. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2001 (2001)
3. Bhat, D.N., Nayar, S.K.: Stereo and specular reflection. International Journal of Computer Vision 26(2), 91–106 (1998)
4. Schultz, H.: Retrieving Shape Information from Multiple Images of a Specular Surface. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(2) (1994)
5. Oren, M., Nayar, S.K.: A theory of specular surface geometry. International Journal of Computer Vision 24(2), 105–124 (1997)
6. Savarese, S., Perona, P.: Local Analysis for 3D Reconstruction of Specular Surfaces. In: Proc. of IEEE Computer Society Conference on CVPR 2001, IEEE Computer Society Press, Los Alamitos (2001)
7. Ragheb, H., Hancock, E.R.: A Probabilistic Framework for Specular Shape-from-Shading. In: Proc. 16th International Conference on Pattern Recognition, ICPR 2002 (2002)
8. Pernkopf, F., O'Leary, P.: Image acquisition techniques for automatic visual inspection of metallic surfaces. NDT&E International, Elsevier 36, 609–617 (2003)
9. Perard, D.: Automated visual inspection of specular surfaces with structured-lighting reflection techniques. Fortschritt-Berichte VDI, vol. 8 (869). VDI Verlag, Dusseldorf (2001)
10. Puente León, F., Kammel, S.: Inspection of specular and painted surfaces with centralized fusion techniques. Measurement (2006)
11. Seulin, R., Merienne, F., Gorria, P.: Simulation of specular surface imaging based on computer graphics: application on a vision inspection system. Journal of Applied Signal Processing – Special issue on Applied Visual Inspection, EURASIP 2002 7, 649–658 (2002)
12. Seulin, R., Merienne, F., Gorria, P.: Dynamic lighting system for specular surface inspection. In: Conference on Machine Vision Applications in Industrial Inspection VII, SPIE, vol. 4301, pp. 199–206 (2001)
13. Aluze, D., Merienne, A.F., Dumont, C., Gorria, P.: Vision system for defect imaging, detection and characterization on a specular surface of 3D object. Image and Vision Computing, Elsevier Science 20, 569–580 (2002)
14. Morel, O., Stolz, C., Gorria, P.: Polarization imaging for 3D inspection of highly reflective metallic objects. Optics and Spectroscopy 101(1), 15–21 (2006)
15. Newman, T.S., Jain, A.K.: A Survey of automated visual inspection. Computer Vision and Image Understanding 61(2), 231–262 (1995)
16. Zhang, X., North, W.P.T.: Analysis of 3-D surface waviness on standard artifacts by retroreflective metrology. Optical Engineering 39(1), 183–186 (2000)
17. Hung, Y.Y., Shang, H.M.: Nondestructive testing of specularly reflective objects using reflection three-dimensional computer vision technique. Optical Engineering 42(5), 1343–1347 (2003)
18. Rocchini, C., Cignoni, P., Montani, C., Pingi, P., Scopigno, R.: A Low Cost Optical 3D Scanner. In: Computer Graphics Forum. Eurographics 2001 Conference Proc., vol. 20(3), pp. 299–308 (2001)
19. Malamas, E.N., Petrakis, E.G.M., Zervakis, M., Petit, L., Legat, J.D.: A survey on industrial vision systems, applications and tools. Image and Vision Computing 21, 171–188 (2003)
20. García-Chamizo, J.M., Fuster-Guilló, A., Azorín-López, J.: Automatic Generation of Image Acquisition Conditions for the Quality Control of Specular Surfaces. In: IEEE/SPIE 8th International Conference on Quality Control by Artificial Vision (QCAV 2007), Le Creusot, France (2007)
21. Kammel, S.: Automated optimization of measurement setups for the inspection of specular surfaces. In: Machine Vision and Three-Dimensional Imaging Systems for Inspection and Metrology II, Proc. SPIE 4567, pp. 199–206 (2002)
22. Seulin, R., Merienne, F., Gorria, P.: Machine vision system for specular surface inspection: use of simulation process as a tool for design and optimization. In: International Conference on Quality Control by Artificial Vision, Le Creusot, France, vol. 1, pp. 147–152 (2001)
23. García-Chamizo, J.M., Fuster-Guilló, A., Azorín-López, J.: Visual Input Amplification for Inspecting Specular Surfaces. In: Proc. IEEE ICIP 2006, IEEE, Atlanta, United States (2006)
24. Nicodemus, F.E., Richmond, J.C., Hsia, J.J., Ginsberg, I.W., Limperis, T.: Geometrical considerations and nomenclature for reflectance. NBS Monograph 160, National Bureau of Standards, Washington, D.C., U.S. Department of Commerce (October 1977)
25. Hughes, R.: The Ising Model, Computer Simulation, and Universal Physics. In: Models as Mediators. Cambridge University Press, Cambridge (1999)
26. Norton, S., Suppe, F.: Why Atmospheric Modeling is Good Science. In: Changing the Atmosphere: Expert Knowledge and Environmental Governance. MIT Press, Cambridge, MA (2001)
27. Cook, R.L., Torrance, K.E.: A Reflectance Model for Computer Graphics. ACM Transactions on Graphics 1(1), 7–24 (1982)
28. Ngan, A., Durand, F., Matusik, W.: Experimental Analysis of BRDF Models. In: Proc. of Eurographics Symposium on Rendering (2005)
Low Cost Virtual Face Performance Capture Using Stereo Web Cameras
Alexander Woodward1, Patrice Delmas1, Georgy Gimel'farb1, and Jorge Marquez2
1 Department of Computer Science, The University of Auckland, Auckland, New Zealand
[email protected], {p.delmas,g.gimelfarb}@auckland.ac.nz
2 Image Analysis Visualization Laboratory, CCADET UNAM, Mexico
[email protected]
Abstract. A complete system for creating the performance of a virtual character is described. Stereo web-cameras perform marker based motion capture to obtain rigid head motion and non-rigid facial expression motion. Acquired 3D points are then mapped onto a 3D face model with a virtual muscle animation system to create face expressions. Muscle inverse kinematics updates muscle contraction parameters based on marker motion to create the character's performance. Advantages of the system are reduced character creation time by using virtual muscles and a dynamic skin model, a novel way of applying markers to a face animation system, and its low cost hardware requirements, capable of running on standard hardware and making it suitable for interactive media in end-user environments.
Keywords: Computer Vision, Web-camera, Markers, Motion capture, Facial animation, Virtual character.
1 Introduction
This paper presents a complete system for creating the performance of a virtual character. The system uses stereo web-cameras to perform marker based motion capture on a subject's face. 3D marker positions allow the derivation of rigid head motion and non-rigid facial expression motion. These 3D points are then mapped onto a 3D face model of arbitrary shape – the model does not have to relate to the subject from which marker tracking was taken. Mapping was done using radial basis functions (RBF). A facial animation system that uses virtual abstract muscle definitions is used to convey expressions in the virtual character. The mapped marker performance is used to drive the animation system. Muscle inverse kinematics updates muscle contraction parameters in order to match marker position changes to the corresponding vertices on the mesh. The system is designed to be low-cost, using only off-the-shelf hardware. Only a standard PC with Firewire webcams and coloured markers (self-adhesive labels) available from stationery suppliers is needed. This makes the system suitable for end-user environments.
Marker based motion capture was chosen for its robustness and relatively low computational costs. The animation system supports a physically modelled skin tissue continuum, meaning smooth motion can be represented with fewer markers. The advantages of the presented system lie in the reduced character creation time obtained by using virtual muscles and a dynamic skin model, along with its low cost hardware requirements, making it suitable for interactive media in end-user environments. Firstly, a brief overview of related work is given in Section 2. The system overview is described in Section 3, along with the development platform and hardware. Sections 4-6 describe the system in depth. Finally, visual results are given in Section 7, along with a commentary on the advantages and disadvantages of the presented solution.
2 Related Work
Virtual characters are now widespread in motion pictures and video-games, and a wide range of animation methods exist [13]. Chai et al. [10] present a markerless vision based system that can generate virtual expressions using a single camera. A face mesh of the test subject is scanned and a database of marker motion capture data is created. Face feature points are then estimated and correlated with data in the motion capture database to create facial expressions on an arbitrary 3D character. A finite element based physical simulation of facial tissue, created with the visible human dataset, is used to create a parameterised face model in [5]. Marker motion capture data is then used to determine muscle activations of the model. Despite realistic results, the complexity of the model makes it unsuitable for real-time application on end-user hardware. The Playable Universal Capture system, described in [8], presents a real-time technique for facial animation. The system uses a high-resolution scanned face model which is animated using marker based motion capture data. Texture data is acquired as a video-stream and combined with the model to provide high-quality animation. Paterson et al. describe a markerless head motion tracking approach using non-linear optimisation [14]. The system is able to track head motion through texture alone; however, it does not track facial expressions. The reconstruction of dense 3D data at frame rates capable of capturing facial expression is described by Zhang et al. in [11]. Data is acquired through the use of synchronised video cameras and structured light projection. Motion capture systems are popular in providing animation data to drive these characters. Quality motion data can help give more realistic and immediate results than manual crafting alone. Motion capture is also used in biomechanics and sport performance analysis [17]. Still, when fine movement is needed for the face, most motion data has to be refined or applied to crafted keyframe animation in order to look convincing and without errors in a model's surface. The common-place nature of motion capture means that many of the commercial
systems are compatible with many of the leading 3D graphics applications such as Maya and 3ds Max [1, 2]. For face motion capture it is common for the user to wear a specially designed helmet with a video camera attached. Optical retro-reflective markers are then tracked and analysed on the computer. There is a wide range of motion capture systems that vary greatly in hardware requirements. Famous3D sells a single video camera marker based motion capture solution named vTracker [7]. The Vicon MX system uses a remote video sync unit called an Ultranet. This supports up to eight cameras per unit, and these can be stacked to provide 245 cameras per PC. Vicon provides a range of cameras that can be used with the system; as an example, their F40 camera can capture 370 frames per second at a full frame resolution of 4 Megapixels [20]. Mova provides a markerless motion capture solution called Mova Contour that uses 44 cameras to carve out 3D space in order to create a face performance. Special phosphorescent makeup that matches the normal flesh-tone in visible light is used to help reconstruction [12]. For character creation, Famous3D provides software to easily animate characters using motion capture data [7]. Face Robot by SoftImage [16] is an advanced software solution for the creation and animation of virtual characters with the ability to incorporate motion capture data to drive the animation, purporting to be faster than traditional solutions by using a soft tissue solver. A common feature of the aforementioned systems is their high hardware requirement; most of the systems are optically based, with an uncomplicated software element. Although this provides the utmost quality in data, these systems are expensive and therefore not applicable to end-user environments: situations where robustness and ease of use are more important than accuracy. Also, the software environments often have a high learning curve, being targeted at professional animators. The solution presented in this paper aims at being low-cost and computer vision based, suitable for end-user environments. It has potential applications for light-weight interactive applications and possibly constrained devices, where accuracy of results is less important than conveying a visually appealing result. The usability of stereo cameras for tracking markers has been justified by Woodward et al. [23], providing better robustness and localisation than a mono camera solution, while still maintaining low overall system requirements.
3 System Overview
The system comprises three modules: marker tracking, face animation, and the virtual performance creator, which links the other two modules. Software was written in C++ and used Microsoft MFC for the GUI and OpenGL for 3D visualisation. The system was developed on a Pentium 4 3.0 GHz machine with 1 gigabyte of RAM running Windows XP.
Marker Tracking: to make the system practical, the marker tracking module captures the stereo view using two off-the-shelf Firewire web-cameras, synchronised
over the 1394 bus, and provides the calibration of this system. A marker template can be created to track coloured markers through a video stream. 2D marker projections are associated between stereo image frames and these can be triangulated to recover 3D marker positions. The tracking can then be recorded to file, each frame being time-stamped. The use of a stereo system makes 3D point localisation more robust compared to a single camera system.
Face animation: the facial animation module comprises a 3D face mesh that has a predefined virtual muscle animation system. The animation system provides two modes of animation: geometric mesh deformation, and deformation of a physically based skin tissue simulation.
Virtual performance creator: the virtual performance module assigns 3D markers to the model and creates a mapping between the two. Doing so allows any marker configuration and test subject to be used with any mesh that is fitted with the face animation module. Once this mapping is completed, face animation can be driven by the markers through the calculation of muscle inverse kinematics.
3.1 Camera Specification and Calibration
A stereo configuration of two Unibrain Fire-i Digital Cameras running over an IEEE-1394a (Firewire) interface was used. Firewire was chosen over USB 2.0 as it has better support for industrial cameras, and allows easy access and control of IIDC-1394 compliant cameras through the CMU 1394 camera driver and API. The camera operates at 640 by 480 pixel resolution and has a 4.3 mm focal length – the technical specification of the camera is given in [19]. Tsai's geometric camera calibration [18] was used to estimate the intrinsic and extrinsic parameters of the cameras, due to its proven effectiveness in past research. This provides the equations for mapping world coordinates to image coordinates. Once markers have been identified in the stereo image pair, knowledge of the system geometry is necessary for recovering 3D marker locations through triangulation. A calibration target was designed consisting of 30 circular markings distributed evenly over its surface. Experiments in reconstructing calibration markings with known true locations have shown that the errors are stable between 40 cm and 100 cm from the world origin, with a mean error of approximately 1 mm and a standard deviation of 0.48 mm. However, in practice the delineation and localisation of coloured markers is more prone to noise than that of the calibration target – noticeable in the jitter seen in recovered marker positions.
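The paper does not spell out the triangulation step itself; below is a minimal sketch of one standard option, linear (DLT) triangulation from two calibrated views, written with the Eigen library as an assumed dependency (Eigen is our choice, not something the paper mentions).

#include <Eigen/Dense>

// P1 and P2 are the 3x4 projection matrices obtained from the calibration;
// (u1, v1) and (u2, v2) are the marker centroids in the left and right images.
Eigen::Vector3d triangulate(const Eigen::Matrix<double, 3, 4>& P1,
                            const Eigen::Matrix<double, 3, 4>& P2,
                            double u1, double v1, double u2, double v2) {
    Eigen::Matrix4d A;
    A.row(0) = u1 * P1.row(2) - P1.row(0);
    A.row(1) = v1 * P1.row(2) - P1.row(1);
    A.row(2) = u2 * P2.row(2) - P2.row(0);
    A.row(3) = v2 * P2.row(2) - P2.row(1);

    // The homogeneous 3D point is the null vector of A, taken as the right
    // singular vector associated with the smallest singular value.
    Eigen::JacobiSVD<Eigen::Matrix4d> svd(A, Eigen::ComputeFullV);
    Eigen::Vector4d X = svd.matrixV().col(3);
    return X.hnormalized();   // divide by the homogeneous coordinate
}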
4 Marker Tracking
Marker tracking begins by localising candidate markers within each stereo image. For this experiment, 12 coloured self-adhesive blue markers were used. Their
locations are determined by using a colour predicate [3]. Once this is done, a marker template can be constructed and then tracked through the video stream. The tracking procedure is given in Algorithm 1. The marker template defines the position and number of markers to be tracked within the image and prevents any extra erroneous localisations from being considered. Once a template has been defined, only the areas around the current marker locations need to be processed and searched, thus reducing computation time. Markers are placed in facial locations that best represent face motion. The current marker test setup is displayed in Fig. 1. Note that this uses a minimal number of markers, chosen for experimental purposes and to reduce setup time. A subset of the template markers is designated as anchor points (shown as white points in Fig. 1). These anchors are located where a good estimate of the rigid head motion can be taken. They should not be affected by facial expressions as they are used to define a local head coordinate frame (described in Section 4).
Fig. 1. Marker placement on the face, for the real and virtual cases. White points on the virtual model indicate anchor markers: the forehead, to the outer side of the left and right eyes, and below the nose.
A marker is considered to have disappeared if a new position within a certain distance threshold cannot be found – in this case the previous marker position is carried through into the current frame. In practice this works adequately for temporary loss of markers, for example when an object such as the hand briefly occludes the face while the face is not undergoing substantial movement. Estimation using marker velocity was found unreliable due to the changing acceleration of marker trajectories as they follow muscular contractions. Marker correspondence between stereo frames can be found with respect to the scan order and the epipolar geometry of the stereo cameras. The marker template records the current position of the markers, their number, and their velocity. Marker correspondences can then be triangulated to recover the 3D locations. The presence of noise within the system causes jitter in the triangulated 3D marker positions. Noise sources include optical distortions, imaging noise and
inaccuracies in the calibration. To stabilise 3D marker positions we apply temporal mean filtering over three frames. This gives a noticeable improvement in the stability of markers, removing a level of jitter.
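As an illustration of the stabilisation step, a minimal C++ sketch of a three-frame temporal mean filter over the triangulated marker positions is given below; the container layout is our assumption, and a fixed marker template is assumed so that every frame holds the same number of markers.

#include <array>
#include <cstddef>
#include <deque>
#include <vector>

using Point3 = std::array<double, 3>;

// Keeps the last 'window' frames of triangulated marker positions and returns,
// for each marker, the mean over those frames (window = 3 in the paper).
class TemporalMeanFilter {
public:
    explicit TemporalMeanFilter(std::size_t window = 3) : window_(window) {}

    std::vector<Point3> filter(const std::vector<Point3>& frame) {
        history_.push_back(frame);
        if (history_.size() > window_) history_.pop_front();

        std::vector<Point3> smoothed(frame.size(), Point3{0.0, 0.0, 0.0});
        for (const auto& past : history_)
            for (std::size_t m = 0; m < frame.size(); ++m)
                for (std::size_t k = 0; k < 3; ++k)
                    smoothed[m][k] += past[m][k];
        for (auto& p : smoothed)
            for (auto& c : p) c /= static_cast<double>(history_.size());
        return smoothed;
    }

private:
    std::size_t window_;
    std::deque<std::vector<Point3>> history_;
};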
Algorithm 1. Marker tracking algorithm
1: while Stereo image pairs from a video stream do
2:   for Subregions of images if marker template initialised, else for entire image do
3:     Colour segmentation using a colour predicate, median filter, and connected components analysis.
4:   end for
5:   Merging nearby components of similar colour.
6:   Computing centroids of connected components as potential marker locations.
7:   if Marker template initialised then
8:     Assign new marker positions; if marker temporarily lost retain previous position.
9:     Triangulate marker positions based on marker template correspondences.
10:   end if
11:   Optionally:
12:   if Suitable candidate markers found then
13:     Initialise marker template based on their locations.
14:   end if
15: end while
Head Local Coordinate Frame and Reference Marker Positions
Rigid head motion must be accounted for in order to estimate the non-rigid motion underlying facial expressions. This can be done by constructing a local coordinate system based on the four anchor markers described above and shown in Fig. 1. The centre point of the four markers acts as the centre of gravity, or translational component. The four marker points a1, a2, a3, a4 are arranged so that the vector from a1 to a2 is orthogonal to the vector from a3 to a4. These can be used to create the local coordinate frame R = (r1 r2 r3) as follows:

$r_1 = \frac{a_2 - a_4}{|a_2 - a_4|}, \quad r_3 = r_1 \times \frac{a_3 - a_1}{|a_3 - a_1|}, \quad r_2 = r_1 \times r_3$
where × denotes the vector cross product. The calculation ensures that the coordinate frame is orthogonal. A neutral expression is recorded to provide default reference marker positions (rest state) within the local head frame. This reference frame can be taken as the first few seconds of marker motion capture where the test subject is asked to keep their face in a neutral expression without facial movement (head movement permissible). These reference marker positions allow for the calculation of divergences from the rest state, and are also used in the RBF mapping procedure (Section 6.1). These divergences are used to create the virtual performance (Section 6).
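A minimal sketch of the anchor-based head frame construction, following the formula above directly, is shown below; Eigen is again an assumed dependency and the struct names are ours.

#include <Eigen/Dense>

struct HeadFrame {
    Eigen::Vector3d origin;    // centre of gravity of the four anchor markers
    Eigen::Matrix3d rotation;  // columns are r1, r2, r3
};

// a1..a4 are the triangulated anchor markers: a1->a2 roughly orthogonal to a3->a4.
HeadFrame buildHeadFrame(const Eigen::Vector3d& a1, const Eigen::Vector3d& a2,
                         const Eigen::Vector3d& a3, const Eigen::Vector3d& a4) {
    HeadFrame f;
    f.origin = (a1 + a2 + a3 + a4) / 4.0;

    const Eigen::Vector3d r1 = (a2 - a4).normalized();
    // If the anchors are not exactly orthogonal, r3 (and r2) could be re-normalised.
    const Eigen::Vector3d r3 = r1.cross((a3 - a1).normalized());
    const Eigen::Vector3d r2 = r1.cross(r3);

    f.rotation.col(0) = r1;
    f.rotation.col(1) = r2;
    f.rotation.col(2) = r3;
    return f;
}

// Expressing a marker in head-local coordinates removes rigid head motion, so
// divergences from the neutral reference positions reflect expression only.
Eigen::Vector3d toHeadLocal(const HeadFrame& f, const Eigen::Vector3d& worldPoint) {
    return f.rotation.transpose() * (worldPoint - f.origin);
}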
5 Face Animation System
A face animation system that uses a virtual muscle approach was implemented. 17 virtual muscles are placed in anatomically based positions within a 3D human face model. The choice of face model is conveniently arbitrary since the muscles are defined separately from the face model. They provide an abstract description of a subject's facial expression performance, and their use also means that a face model does not need to be designed with knowledge of a particular marker configuration, making it easy to change marker positions. The face mesh can be deformed in two modes: firstly, a simple geometric deformation where vertices are moved independently by the virtual muscles; secondly, for more realistic facial animation, a physically based deformation approach with a layered tissue model was created. Since facial tissue is not just a single layer of skin, a mass-spring system was used for the layered tissue model, representing the epidermal, fascial and skull layers. This model is more faithful to the structure of human facial tissue – nodes and springs provide a natural behaviour for the skin tissue, stretching and pulling as forces are applied. This means that fewer markers are needed in order to produce smooth animation over the face surface.
Fig. 2. The face mesh before (left) and after (right) a linear muscle contraction
Two types of muscle were used that apply direct deformation to the mesh in the geometric mode, or apply forces to the skin tissue model in the physically based approach:
– Linear muscle: has an origin where the muscle is attached to the bone, and another point which describes the insertion into the skin tissue. It possesses an angular and a radial range. Points that lie within a certain angular zone around the muscle vector and within a certain radial distance from its origin are affected by this muscle.
– Ellipsoid muscle: acts like a string bag; points are squeezed toward the muscle origin. It possesses a radial range and has no angular range. Because of its function, this muscle is also known as a sphincter muscle and can be defined by an ellipsoid, having a major and two minor axes.
In addition, jaw movement was modelled by rotating vertices around a jaw axis. For the physically based system the jaw skull layer was rotated, allowing
the skin tissue to follow along. The complete animation system and its formulae are described in depth in [22].
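Since the exact deformation formulae are deferred to [22], the sketch below shows one common form of linear muscle deformation (a Waters-style cosine falloff) purely to illustrate the idea; it is not the authors' exact model, and the interface names are ours.

#include <algorithm>
#include <cmath>
#include <Eigen/Dense>

// Illustrative linear-muscle mesh deformation with cosine falloff.
struct LinearMuscle {
    Eigen::Vector3d origin;      // attachment to the bone
    Eigen::Vector3d insertion;   // insertion into the skin tissue
    double angularRange;         // half-angle of the influence cone (radians)
    double radialRange;          // maximum distance of influence from the origin
    double contraction;          // current contraction parameter in [0, 1]
};

// Displaces a vertex toward the muscle origin if it lies inside the muscle's
// angular and radial zone of influence; otherwise returns it unchanged.
Eigen::Vector3d applyLinearMuscle(const LinearMuscle& m, const Eigen::Vector3d& v) {
    const double kHalfPi = std::acos(0.0);
    const Eigen::Vector3d axis = (m.insertion - m.origin).normalized();
    const Eigen::Vector3d toVertex = v - m.origin;
    const double dist = toVertex.norm();
    if (dist < 1e-9 || dist > m.radialRange) return v;

    const double cosAngle = std::min(1.0, std::max(-1.0, axis.dot(toVertex) / dist));
    const double angle = std::acos(cosAngle);
    if (angle > m.angularRange) return v;

    // Cosine falloff in both the angular and radial directions.
    const double angular = std::cos(angle / m.angularRange * kHalfPi);
    const double radial  = std::cos(dist / m.radialRange * kHalfPi);
    const double k = m.contraction * angular * radial;

    return v + k * (m.origin - v);   // pull the vertex toward the muscle origin
}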
6 Virtual Performance Creation
Once 3D marker positions have been obtained they can be mapped and applied to the face model and animation system described in Section 5. The local changes in 3D marker positions drive the face muscles through inverse kinematics. This section describes the process of creating a complete virtual performance given a set of 3D markers and a face animation system.
6.1 Marker to Model Correspondence
A design goal was to allow any arbitrary 3D human face model to be used with a particular set of motion capture data. This allows for an automated solution and the ability to retrofit existing models, along with the system being able to drive a wide variety of face models that do not have to be human in nature. In order to do this, the mapping of markers into a new face space was performed using RBF, where the required correspondences between markers and vertices of the mesh are entered manually. This procedure need only be done once for a given marker configuration and face model, as the marker template, marker correspondences, and RBF mapping coefficients can be archived for further use.
RBF Mapping: For the case of a non-rigid mapping between datasets, where one dataset is of a face model and the other of 3D marker positions, two criteria must be met. Firstly, a set of correspondences is needed between the datasets so that there is knowledge of how the non-rigid mapping should proceed. Secondly, the vertex distribution and resolution of the face model must be preserved. This makes a data interpolation strategy appropriate. Radial basis functions are a form of interpolant with desirable smoothness properties [9, 15]. Considering the 1D case, the goal of interpolation is to analytically approximate a real valued function g(x) by s(x) given the set of values g = (g1, ..., gN) at the distinct points X = (x1, ..., xN). This is naturally extended, for this research, to the 3D spatial domain, where a function that maps spatial points is required. In practice, the two sets of data are the points chosen to correspond between the 3D reference markers X, as calculated in Section 4, and the virtual face mesh vertices g. In general a 3D RBF has the form $s(x) = p(x) + \sum_{i=1}^{N} \lambda_i\, \omega(|x - x_i|)$, $x \in \mathbb{R}^3$, where p is a polynomial of degree at most k, λi is a real valued weight, |x − xi| is the Euclidean distance between the points x and xi, and ω is a basic function [9]. The chosen sets X and g will generally be much smaller than the number of points contained in the source dataset. This mapping is done once, based on the neutral reference marker frame and the neutral face mesh. The marker performance can then be translated into
the virtual mesh local coordinate frame by giving the 3D marker positions to the RBF interpolant; the resulting output positions can then be measured as divergences of the marker vertices from the mesh rest state. To find the RBF interpolant for a given set of correspondences g and X, it is necessary to estimate the polynomial term p and the real valued weights λi. This under-determined system of linear equations was solved using standard least-squares estimation. The benefit of using a scattered data interpolation approach is the abstraction from the specifics of the input data: only point locations need to be dealt with.
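To make the RBF step concrete, here is a small sketch of the per-coordinate fit and evaluation. Eigen, the biharmonic basic function ω(r) = r and the affine polynomial degree are all assumptions of this sketch, since the paper does not state which choices were made.

#include <Eigen/Dense>
#include <vector>

// Fits one RBF per output coordinate: s(x) = p(x) + sum_i lambda_i * w(|x - x_i|),
// with an affine polynomial p and w(r) = r. markers[i] is the neutral reference
// position of marker i, targets[i] the corresponding mesh vertex.
struct RbfMapping {
    std::vector<Eigen::Vector3d> centres;     // the reference marker positions x_i
    Eigen::MatrixXd weights;                  // (N + 4) x 3: lambdas, then affine coefficients
};

RbfMapping fitRbf(const std::vector<Eigen::Vector3d>& markers,
                  const std::vector<Eigen::Vector3d>& targets) {
    const int n = static_cast<int>(markers.size());
    Eigen::MatrixXd M(n, n + 4);
    Eigen::MatrixXd G(n, 3);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j)
            M(i, j) = (markers[i] - markers[j]).norm();       // w(|x_i - x_j|) = r
        M(i, n + 0) = 1.0;                                    // affine part p(x)
        M.block<1, 3>(i, n + 1) = markers[i].transpose();
        G.row(i) = targets[i].transpose();
    }
    // n equations and n + 4 unknowns per coordinate; following the paper we take
    // a least-squares style solution, here via SVD.
    RbfMapping map;
    map.centres = markers;
    map.weights = M.bdcSvd(Eigen::ComputeThinU | Eigen::ComputeThinV).solve(G);
    return map;
}

Eigen::Vector3d evaluateRbf(const RbfMapping& map, const Eigen::Vector3d& x) {
    const int n = static_cast<int>(map.centres.size());
    Eigen::VectorXd phi(n + 4);
    for (int i = 0; i < n; ++i) phi(i) = (x - map.centres[i]).norm();
    phi(n) = 1.0;
    phi.segment<3>(n + 1) = x;
    return map.weights.transpose() * phi;
}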
Fig. 3. The face animation system showing muscle placements (a), texture only (b), and the underlying mesh (c)
6.2 Driving Face Animation Through Marker Motion
At this stage the system is able to measure marker divergences from a rest frame, mapped into a virtual face model's local space. Each marker is associated with a vertex of the mesh. Upon initialisation, the muscles that influence each marker must be found. This is easily done as each muscle has an area of influence. When a marker moves, we want to update the virtual muscles to deform the mesh so that the corresponding vertex on the mesh is as close as possible to the marker. In order to do this, inverse kinematics was used. It is widely used in robotics and as a tool for animating articulated figures in order to reduce the amount of data that needs to be specified each animation frame [21].
Muscle Inverse Kinematics Using Marker Positions: Inverse kinematics is the calculation of parameters for a kinematic chain in order to meet a desired goal position g, starting from an initial position e. Applied to this research, the desired end position is the 3D marker position as tracked by the vision system described in Section 4, and the initial (current) position is considered to be the vertex on the mesh corresponding to the marker. The kinematic chain consists of a set of joints: the muscles or jaw affecting a certain marker. Since we only deal with contraction values, each muscle, and the jaw, possesses 1-DOF (Degree
of Freedom)¹. Only the start and end locations of the marker need to be planned for the animation. This is a much simpler way to generate a performance based on 3D motion capture, as opposed to forward kinematics, where each muscle contraction value must be determined. Considering a single vertex and marker correspondence, the inverse kinematics problem is to compute the vector of muscle DOFs, $\Phi = \{\phi_1, \phi_2, ..., \phi_N\}$, from a target marker position: $\Phi = f^{-1}(e)$, where f is the forward kinematics map. Hence, for inverse kinematics an update value, $\Delta\Phi \approx J^{-1} \cdot \Delta e$, is sought. A fast and simple iterative gradient descent solution was chosen that uses the Jacobian transpose of marker position with respect to muscle contraction (or jaw rotation around its axis) [4]. This method is computationally inexpensive since there is no inversion of the Jacobian matrix. It also localises computations, since the degrees of freedom can be updated before the entire Jacobian is computed:

$\Delta\Phi = J^T \Delta e$  (1)
Intuitively, smaller step sizes are needed as the complexity of marker movement increases. The side effect of this is an increase in the number of iterations required to find a solution. Calculating the entries of the Jacobian matrix is straightforward. Entries of J specify DOFs and depend on the type of joint – in this case muscular contraction is a linear joint and jaw rotation an angular joint. For a muscle DOF, a vertex e is drawn toward the muscle's origin o. The possible values for the DOF are bounded, as muscles have a physical contraction range. A change in the DOF φ gives a movement along this direction:

$\dfrac{\partial e_i}{\partial \phi_j} = (o - e)$  (2)
Consider the jth joint of a kinematic chain to be a 1-DOF rotational joint which models jaw motion. Its entry in the Jacobian measures how the end effector, e, changes during a rotation about its axis:

$\dfrac{\partial e_i}{\partial \phi_j} = a_j \times (e_i - r_j)$  (3)
The rate of change of the vertex e position (its velocity) is in a direction tangential to the rotation axis aj, scaled by the distance from the axis point rj to e. The step Δe is given as:

$\Delta e = \beta (g - e)$  (4)
¹ Degrees of Freedom (DOF) refer to the set of animatable parameters within the virtual face, where the animatable parameters are the muscle contraction values and jaw rotation. In general, joint DOFs are a common example: three orientation parameters plus a position give 6-DOF. Collectively, all of these parameters describe the system state of the virtual muscles. Their change provides the virtual character performance.
The scale factor β is used to limit the step size, since we are dealing with non-smooth functions. Since smaller step sizes must be taken when the number of muscles increases, the value of β could be tuned for each marker.
Pseudo-Code for Muscle Inverse Kinematics. Pseudo-code is given in Algorithm 2; markerCount gives the number of markers being tracked and affectedMuscleCount(j) specifies the number of muscles affecting a particular marker j.
Algorithm 2. Muscle inverse kinematics
1: for all i such that 0 ≤ i < iterations do
2:   for all j such that 0 ≤ j < markerCount do
3:     for all k such that 0 ≤ k < affectedMuscleCount(j) do
4:       Estimate the Jacobian transpose entry J_k^T from Eq. (2) or (3), depending on the muscle type. The matrix is implicit, as entries can be processed in order.
5:       Pick the approximate step to take, as in Eq. (4).
6:       Compute the change in joint DOFs, as in Eq. (1).
7:       Apply the change to the DOFs: φk = φk + Δφk.
8:       Contract the virtual muscle with the new parameter value φk to obtain a new value for ej.
9:     end for
10:   end for
11: end for
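A compact sketch of the Jacobian-transpose update for a single marker/vertex pair is given below. The muscle interface, the handling of contraction limits and the crude vertex update are assumptions of this sketch (in the real system the muscle itself deforms the mesh), and Eigen is again an assumed dependency.

#include <algorithm>
#include <vector>
#include <Eigen/Dense>

// Minimal muscle interface assumed for this sketch.
struct Muscle {
    Eigen::Vector3d origin;     // attachment point o used in Eq. (2)
    double phi = 0.0;           // contraction DOF
    double phiMin = 0.0, phiMax = 1.0;
};

// One Jacobian-transpose pass for one marker/vertex pair:
// goal   = tracked 3D marker position (in the model's local frame),
// vertex = current position of the corresponding mesh vertex,
// beta   = step-size scale factor of Eq. (4).
void jacobianTransposeStep(std::vector<Muscle>& muscles,
                           const Eigen::Vector3d& goal,
                           Eigen::Vector3d& vertex,
                           double beta) {
    for (Muscle& m : muscles) {
        const Eigen::Vector3d de = beta * (goal - vertex);       // Eq. (4)
        const Eigen::Vector3d jCol = m.origin - vertex;          // Eq. (2), linear muscle
        const double dPhi = jCol.dot(de);                        // Eq. (1): dPhi = J^T * de
        m.phi = std::clamp(m.phi + dPhi, m.phiMin, m.phiMax);    // physical contraction range

        // Stand-in for "contract virtual muscle to obtain a new value for e_j":
        // move the vertex toward the muscle origin in proportion to dPhi.
        vertex += dPhi * jCol;
    }
}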
6.3
Expression Estimation
It has been found that sadness, anger, joy, fear, disgust and surprise are six universal categories of facial expression that all cultures can recognise [6]. Falling within these categories is a range of possible intensities and variations in expression detail. Face expressions can be hand crafted using the face animation system by adjusting the muscle contraction parameters.
7
Experimental Results and Conclusions
Our experiments show that the proposed system is capable of reproducing face expressions from marker motion; a selection of results is shown in Fig. 4. The attraction of using a marker-based system is its robustness, relatively low computational cost and simpler formulation in relation to markerless motion capture approaches. This system provides an automatic face performance from marker data and can therefore operate on a wider variety of face models and applications – avoiding the need for time-consuming model creation. Also, virtual muscles are easier to set up than keyframing and are anatomically based. Setup costs are low compared to other motion capture solutions; however, placing markers on a face is time consuming, especially if placement needs to be precise and repeatable. This points to a future direction of research involving markerless
Fig. 4. Tracked markers on a subject's face (left column). Synthetic performance results (middle column). An angry expression (right); marker divergences from the neutral frame are indicated by lines.
motion capture. In doing so, the trade-offs between an automatic process and accuracy need to be investigated. Noise in the detection of 2D marker positions affects the quality of the 3D triangulation. Illumination conditions affected the detection of the coloured markers within the image, and it was important to have uniform and controlled illumination. The robustness of marker detection would need to be improved if constraints on an end-user's environment were to be relaxed. To this end, improvements in the localisation of markers and their triangulation will be investigated. Also, it was found that noise and insufficiently rigid positions of the anchor points created some jitter in the local coordinate system. A better means of deriving this frame should be investigated. A possible solution is for the user to wear a head band
with markers on it. This would be more stable than applying representative anchor markers on the face. RBF provides an effective means of mapping markers to a face model. There is no restriction on the face model used and it does not have to represent the test subject. RBF is inexpensive, and the mapping coefficients for a particular system setup need only be estimated once, based on a neutral face expression. The quality of animation greatly depends on the quality of the animation system and how well it can mimic marker motion. The physically based skin system provides better animation when a low number of markers is used, as it treats the mesh as a skin continuum, making it of benefit for creating novel animations. However, this does not guarantee that the final animation is representative of the actual expressions performed by a human test subject. Placement of additional markers will be looked at to improve animation; this will possibly mean that more face muscles will be required around very dynamic areas of the face, such as the mouth region. Since the marker data is transferable between models and animation systems, it could be used to drive alternative animation systems, e.g. key-frame based, and a hybrid of the two could be combined. A better inverse kinematics scheme will be pursued to handle fast marker movement when a marker is affected by a large number of muscles, and an improved face model, with a complete head and ears, will also be tested. The system will be tested on higher quality video cameras, providing better triangulation of 3D markers and higher acquisition rates. At the time of writing, specially designed reflective motion capture markers have been acquired. They should allow better detection of markers, giving less noise in the tracked 3D marker positions. A database of expressions could be created in order to estimate the validity of synthesised expressions and their recognition. An investigation of the impact of the number of markers on the quality of animation will be undertaken; this will determine the expressive ability of the face animation system and help to find a balance with computational time.
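As an illustrative sketch of the RBF mapping mentioned above (not the authors' code), the coefficients can be fitted once from the neutral expression and then reused to map tracked marker data onto the face model; the Gaussian kernel and its width are assumptions made for this example.

import numpy as np

def fit_rbf(centres, values, sigma=1.0):
    # centres: (N, 3) neutral marker positions; values: (N, d) quantities to interpolate
    d2 = ((centres[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    Phi = np.exp(-d2 / (2 * sigma ** 2))   # Gaussian RBF matrix (assumed kernel choice)
    return np.linalg.solve(Phi, values)    # one weight row per centre

def eval_rbf(x, centres, weights, sigma=1.0):
    # Evaluate the fitted interpolant at a single 3D point x
    d2 = ((x[None, :] - centres) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2)) @ weights

The weights are computed once per setup; at run time only the cheap evaluation step is needed, which is what keeps the mapping inexpensive.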
References 1. Autodesk. 3ds max (2007), http://www.autodesk.com/3dsmax 2. Autodesk. Autodesk maya (2007), http://www.autodesk.com/maya 3. Barton, G., Delmas, P.: A semi-automated colour predicate for robust skin detection. In: Proc. Image and Vision Computing New Zealand, pp. 121–125 (2002) 4. Baxter, B.: Fast numerical methods for inverse kinematics (2007), http://billbaxter.com/courses/290/html/index.htm 5. Neverov, I., Sifakis, E., Fedkiw, R.: Automatic determination of facial muscle activations from sparse motion capture marker data. In: ACM Transactions on Graphics (SIGGRAPH Proceedings), pp. 417–425. ACM Press, New York (2005) 6. Ekman, P., Friesen, W.V.: Constants across cultures in the face and emotion. Journal of Social and Personality Psychology, 124–129 (1971) 7. famous3D. famous3D (2007), http://famous3d.com/ 8. Montgomery, J., Borshukov, G., Hable, J.: GPU Gems 3 - Playable Universal Capture, ch. 15, pp. 485–504. Addison-Wesley Professional, Reading (2007)
9. Cherrie, J., Mitchell, T., Fright, W., McCallum, B., Carr, J., Beatson, R., Evans, T.: Reconstruction and representation of 3D objects with radial basis functions. In: ACM SIGGRAPH, pp. 67–76. ACM Press, New York (2001) 10. Xiao, J., Chai, J., Hodgins, J.: Vision-based control of 3d facial animation. In: Proc. of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, ACM Press, New York (2003) 11. Curless, B., Zhang, L., Snavely, N., Seitz, S.: Spacetime faces: High-resolution capture for modeling and animation. In: ACM Annual Conference on Computer Graphics, pp. 548–558. ACM Press, New York (2004) 12. Mova. Mova (2007), http://www.mova.com/ 13. Noh, J., Neumann, U.: A survey of facial modeling and animation techniques. In: USC Technical Report, University of Southern California (1993) 14. Paterson, J., Fitzgibbon, A.: 3D head tracking using non-linear optimization. In: Proc. of the British Machine Vision Conference 2003, vol. 2, pp. 609–618 (2003) 15. HomePage of Applied Research Associates NZ Ltd (ARANZ): Interpolating scattered data with RBFs (2007), http://www.aranz.com/research/modelling/theory/rbffaq.html 16. Softimage. Face robot (2007), http://www.softimage.com/products/facerobot/ 17. Motek Motion Technology. Motek - motion technology (2007), http://www.e-motek.com/ 18. Tsai, R.Y.: A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal of Robotics and Automation, 323–344 (1987) 19. Unibrain. Unibrain fire-i digital camera (2007), http://www.unibrain.com/Products/VisionImg/Fire i DC.htm 20. Vicon. Vicon (2007), http://www.vicon.com/ 21. Welman, C.: Inverse kinematics and geometric constraints for articulated figure manipulation. Master thesis, Simon Fraser University (1993) 22. Woodward, A., Delmas, P.: Towards a low cost realistic human face modelling and animation framework. Proc. Image and Vision Computing New Zealand, 11–16 (2004) 23. Woodward, A., Delmas, P.: Computer vision for low cost 3-D golf ball and club tracking. Proc. Image and Vision Computing New Zealand (2005)
Hidden Markov Models Applied to Snakes Behavior Identification
Wesley Nunes Gonçalves, Jonathan de Andrade Silva, Bruno Brandoli Machado, Hemerson Pistori, and Albert Schiaveto de Souza
Dom Bosco Catholic University, Research Group in Engineering and Computing, Av. Tamandaré, 6000, Jardim Seminário, 79117-900, Campo Grande, MS, Brazil
{wnunes,jsilva,bmachado}@acad.ucdb.br, {pistori,albert}@ucdb.br
http://www.gpec.ucdb.br
Abstract. This paper presents an application of hidden Markov models (HMMs) to the recognition of snake behaviors, an important and hard problem that, as far as the authors know, has not been tackled before by the computer vision community. Experiments were conducted using different HMM configurations, including modifications of the number of internal states and of the initialization procedures. The best results have shown an 84% correct classification rate, using HMMs with 4 states and an initialization procedure based on the K-Means algorithm. Keywords: Hidden Markov Models, Animal Behavior Recognition, Snakes.
1
Introduction
Snakes have been suggested as one of the main groups of animals for the evaluation of ecological and evolutionary hypotheses [1]. Moreover, this group is an interesting and important producer of venom, which is intensively used in the development of drugs for hypertension control, analgesics and anticoagulants, among others. Thus, research on snake behavior, aimed at understanding its impact on venom production in natural and controlled habitats, has increased considerably in recent years. The habitat of reptiles, and of snakes in particular, is a largely unexplored and little understood topic. Raising snakes in captivity is a difficult task that can benefit from the analysis and identification of snake behaviors in natural and artificial environments. The behavior of snakes is influenced by many factors, like temperature, solar radiation, humidity and the cycle of seasons. The ability to identify and predict snake activity is essential in the process of venom production. The identification of animal behavior is usually carried out through a non-automatic procedure involving long, time-consuming sessions of visual observation and annotation. The reproducibility and precision of this procedure are
affected by the fatigue and distraction of the observers. The automation of this process, using computer vision techniques, could bring much more reliable results to the area and the possibility of extending the experiments to situations and environments that are not easily accessible with the current methods. The benefits are even higher for snakes, which present some activities that, to be observed, would require many hours of continuous observation. Automation could also provide information, related to precise physical measurements, that cannot be obtained by naked-eye observation [2]. This work evaluated the use of hidden Markov models to automatically identify a particular snake behavior that occurs during an attack. The following snake species were used during the experiments: Common boa (Boa constrictor), Neotropical Rattlesnake (Crotalus durissus terrificus) and Pitviper (Bothrops jararaca). The HMM model was trained using 1000 frames from 20 image shots of the snakes while attacking and not attacking. The model was tested on 10 image shots not present in the training data. A semi-automatic procedure, based on SVM (Support Vector Machines) supervised learning [3], was adopted in the segmentation phase to separate the snake region from the background. Feature extraction was based on image moments. The best correct classification rate and execution time were achieved through the variation of some of the parameters of the hidden Markov models, like the number of states and the number of iterations used by the Baum-Welch parameter estimation procedure. Moreover, two different initialization techniques were explored during the estimation of the HMM parameters. The attack behavior was inferred with an accuracy of 84% by an HMM with 4 internal states. The paper is organized as follows. Section 2 presents related work that applies hidden Markov models to computer vision problems. Section 3 briefly reviews the image moments feature extractor. In Section 4, the hidden Markov models used in this work are explained. The experiments performed and the results obtained are presented in detail in Sections 5 and 6, respectively. Finally, the conclusion and future work are discussed.
2
Previous Work
Hidden Markov models (HMMs) have been used in many areas, mainly in systems for speech recognition [4], behavior recognition [5] and handwriting recognition [6]. In [7], HMMs are applied to 2D object recognition in images. The HMMs, combined with contour-invariant features, were tested on four different objects. For each object, an HMM was estimated using a set of fifty training images. The classification was carried out using ten images for each object, resulting in a correct classification rate of 75%. Starner and Pentland [8] describe a system for the recognition of American Sign Language using HMMs. The correct classification rate was 99.2% for words; however, the feature set proved to be limited and the gestures were expected to occur in pre-specified spatial positions in the image, as hand positions were not normalized. A new technique for character recognition is presented in [9]. The features
are extracted from a gray level image and an HMM is modeled for each character. During recognition, the most probable combination of models is found for each word, using a technique based on dynamic programming. In [10], an HMM for face recognition is described. The image containing the face is divided into five blocks (hair, forehead, eyes, nose and mouth) and each block is represented by an HMM internal state. The feature vectors are obtained from each block by the Karhunen-Loeve transform approach. HMMs are frequently used to describe a sequence of patterns characterizing a behavior. In [5], human behaviors are identified. Those behaviors are related to legal and illegal activities, captured by a camera, carried out in an archaeological site. For the identification of those behaviors, the images are segmented using movement detection followed by shadow removal. After that, the human posture is identified using histograms and a similarity measure based on the Manhattan distance. Behavior recognition is carried out by HMMs, where the internal states represent different postures. Experiments were carried out on the identification of four behaviors and a mean correct classification rate of 86.87% was reported. In [11], an animal behavior classification system is presented. That system uses a combination of HMM and kNN for learning to recognize some movements. The system was evaluated on bee paths extracted from a 15-minute image sequence. A classification rate of 81.5% has been reported on this problem.
3
Image Moments
An image can be modelled as a 2D discrete function I, where the intensity of each pixel is indexed as I(x, y). Equation 1 represents the regular image moments of order p, q of an image.

M_pq = Σ_{x=1}^{width} Σ_{y=1}^{height} x^p y^q I(x, y)    (1)
Regular image moments can be used to represent some important properties of an object presented in an image, like the object area, M_00, and its center of mass, (M_10/M_00, M_01/M_00).

u_pq = Σ_{x=1}^{width} Σ_{y=1}^{height} (x − x̄)^p (y − ȳ)^q I(x, y)    (2)
Central image moments, as defined in Equation 2, can also be used to calculate some other interesting object properties, like its variance in the X and Y axes (Equation 3), direction (Equation 4) and eccentricity (Equation 5).

σ_x² = u_20 / m_00,    σ_y² = u_02 / m_00    (3)

θ = (u_02 − u_20 − 2u_11 + λ) / (u_02 − u_20 + 2u_11 − λ)    (4)

e² = (u_20 + u_02 + λ) / (u_20 + u_02 − λ)    (5)

λ = √((u_20 − u_02)² + 4u_11²)    (6)
Besides the above-mentioned properties, which are calculated for the whole object, in this work the object is further divided into 4 equal regions, following a methodology suggested by Freeman [12], and for each of these regions the same image moment properties are calculated. In this way, the system can combine global and local information during the classification phase. Figure 1 illustrates, using an image moments visualization tool, the properties extracted from a snake image that has been previously segmented and binarized.
Fig. 1. Images Moments Application
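As an illustrative sketch (not the authors' implementation), the features of Eqs. (1)-(6) can be computed directly from a binarized snake mask; the NumPy array layout and the 1-based pixel coordinates below are assumptions made for the example.

import numpy as np

def moment_features(I):
    # I: 2D NumPy array of 0/1 values (segmented, binarized image)
    ys, xs = np.mgrid[0:I.shape[0], 0:I.shape[1]]
    xs, ys = xs + 1.0, ys + 1.0                                # 1-based pixel coordinates
    m = lambda p, q: ((xs ** p) * (ys ** q) * I).sum()         # Eq. (1): regular moments
    m00 = m(0, 0)
    xb, yb = m(1, 0) / m00, m(0, 1) / m00                      # centre of mass
    u = lambda p, q: (((xs - xb) ** p) * ((ys - yb) ** q) * I).sum()   # Eq. (2): central moments
    u20, u02, u11 = u(2, 0), u(0, 2), u(1, 1)
    lam = np.sqrt((u20 - u02) ** 2 + 4 * u11 ** 2)             # Eq. (6)
    return {
        'area': m00,
        'var_x': u20 / m00, 'var_y': u02 / m00,                # Eq. (3)
        'theta': (u02 - u20 - 2 * u11 + lam) / (u02 - u20 + 2 * u11 - lam),   # Eq. (4)
        'ecc2': (u20 + u02 + lam) / (u20 + u02 - lam),         # Eq. (5)
    }

Applying this function to the whole mask and to each of its four sub-regions yields the combined global and local feature vector described above.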
4
Hidden Markov Models
Hidden Markov models (HMMs) are used to model a pair of complementary stochastic processes. The first process is represented by a set of unobserved states, also called hidden or internal states. In the special case of first-order HMMs, the current state of the system depends only on the previous state, and the probability distribution that models state transitions is usually represented by a transition matrix A = {a_ij}, with

a_ij = P(q_{t+1} = S_j | q_t = S_i),    1 ≤ i, j ≤ N    (7)
where N is the number of states, qt is the current state of the system and S = {S1 , S2 , S3 , ..., SN } is the set of hidden states.
The second stochastic process models the probability of observing or measuring some predetermined values (the observed values or symbols) given that the system is in a specific (hidden) state. A sequence of T observations is represented by a set O = O_1, O_2, O_3, ..., O_T, where each element O_t is a member of the symbol set V = v_1, v_2, ..., v_M. The emission or observation probability of any symbol given an internal state j is defined by a matrix B = {b_j(k)}, with

b_j(k) = P(O_t = v_k | q_t = S_j),    1 ≤ j ≤ N, 1 ≤ k ≤ M    (8)
The initial probability of each state is represented by a set π = {π_i}, with

π_i = P(q_1 = S_i),    1 ≤ i ≤ N,    with Σ_{i=1}^{N} π_i = 1    (9)
The problem of estimating the parameters (probability matrices) of an HMM is usually called the learning problem, whereas the problem of calculating the likelihood of an observation sequence given a particular HMM is called the evaluation problem. Given an observed sequence O and an HMM model λ = (A, B, π), the evaluation problem is to calculate P(O|λ). One of the procedures that solves this problem efficiently, based on dynamic programming, is known as the Forward-Backward algorithm. This procedure defines a variable α_t(j) (the forward variable) that represents the joint probability of the partial observation sequence (from time 1 to t) and the state S_j at time t, given the model λ. The variable is updated incrementally using the recursive procedure defined by Equations 10 and 11 until the full observation sequence is reached, and P(O|λ) can then be easily calculated using Equation 12.

α_1(j) = π_j b_j(O_1),    1 ≤ j ≤ N    (10)

α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_ij ] b_j(O_{t+1}),    1 ≤ t ≤ T − 1    (11)

P(O|λ) = Σ_{j=1}^{N} α_T(j)    (12)
In order to calculate the backward variable β_t(i), representing the probability of the partial observations from t + 1 until T given the state S_i at time t and the model λ, a similar procedure is followed, but in a reverse manner. The procedure is summarized in Equations 13, 14 and 15.

β_T(i) = 1,    1 ≤ i ≤ N    (13)

β_t(i) = Σ_{j=1}^{N} a_ij b_j(O_{t+1}) β_{t+1}(j),    t = T − 1, T − 2, ..., 1,    1 ≤ i ≤ N    (14)

P(O|λ) = Σ_{i=1}^{N} π_i b_i(O_1) β_1(i)    (15)
In spite of the somewhat misleading name, in order to solve the evaluation problem one must choose to use either the forward or the backward procedure, not both. The learning problem is to choose the parameters of the model λ = (A, B, π) that locally maximize P(O|λ). A well-known algorithm that solves this problem in polynomial time is the Baum-Welch algorithm, a specialization of the EM algorithm. Detailed information on the Baum-Welch algorithm and HMMs in general can be found in [4,13,8,9].
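As a minimal sketch of the forward procedure of Eqs. (10)-(12) (an illustration, not the implementation used in the experiments), the evaluation problem can be solved with a few matrix operations; in practice the forward variable is usually rescaled at each step to avoid numerical underflow.

import numpy as np

def forward_likelihood(A, B, pi, O):
    # A: (N, N) transition matrix; B: (N, M) emission matrix
    # pi: (N,) initial distribution; O: list of observation symbol indices
    alpha = pi * B[:, O[0]]              # Eq. (10)
    for o in O[1:]:
        alpha = (alpha @ A) * B[:, o]    # Eq. (11)
    return alpha.sum()                   # Eq. (12): P(O | lambda)

For the two-model classification used later in this paper, a test sequence would simply be assigned to the behavior (attack or non-attack) whose HMM yields the larger likelihood.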
5
Experiments and Results
The experiments with the HMMs were carried out with 30 image shots representing the presence and the absence of the attack behavior. An example of the attack behavior can be visualized in Figure 2. The images were captured using a TRENDNET TV-IP301W camera with a spatial resolution of 640 x 480 pixels. The snake was held in a place that simulates its natural environment and the pictures were taken from above.
Fig. 2. Four frames of an attack sequence
Snakes use camouflage to hide from enemies and to more easily capture other animals, which makes the segmentation problem, in this context, very difficult. A supervised learning strategy, based on Support Vector Machines (SVM), was used to separate the snake from the background. For each image, a rectangular region surrounding the snake was manually determined in order to make the
Fig. 3. Image Segmentation for Snakes
segmentation phase easier. Color-based attributes, extracted from snake and background regions, were used to feed the learning process. The result of this segmentation procedure is illustrated in Figure 3. After segmentation, image moment based attributes were used to extract information related to the shape of the snake in each frame. These attributes were further discretized using the LBG vector quantization algorithm [14] and used as observation symbols for two HMMs, one corresponding to the attack behavior and the other to non-attack. A total of 20 image sequences were used for training and 10 for testing the HMM-based classification module. All the experiments were carried out on a computer with a P4 2.8 GHz processor, 512 MB of RAM and Fedora Core 5. Experiments were conducted in order to find the HMM configuration that gives the highest correct classification rate in the problem of attack and non-attack behavior classification. Three parameters were evaluated: the number of hidden states, the number of iterative steps of the Baum-Welch algorithm and the initialization procedure used during the learning phase. The number of hidden states and iterative steps varied from 2 to 20, and from 100 to 1000, respectively. Two initialization procedures were tested: the procedure suggested in [4], which assumes a uniform probability distribution for all matrices, and a K-Means [15] based approach. For the K-Means approach, the A and π matrices are calculated as in [4]; however, the B matrix is initialized in a different way. First, the training set is clustered using K-Means, with K being the number of internal states. Then, using the mapping from states to training samples generated by the clustering procedure, the B matrix can be estimated by simple counting techniques. The results are presented in Figures 4, 5, 6 and 7. The correct classification rates shown in the graphics are averaged over the results of the HMM
Fig. 4. Number of States X Correct Classification Rate
Fig. 5. Number of States X Execution Time (in milliseconds)
configurations. For example, the correct classification rate reported for an HMM with 20 states is the average over the results of all HMM configurations with 20 states. The graphic in Figure 4 relates the number of states to the correct classification rate. The best rate, of 81.2%, was reached at 10 states. This performance was achieved at a 515.41 ms execution time cost (learning and evaluation time), as the graphic in Figure 5 indicates. The graphic in Figure 5 also illustrates the polynomial time complexity, in the number of states, of the learning and evaluation algorithms for HMMs. Nonetheless, for a real-time application, like, for instance, a system that should immediately react to a snake attack, even a
Fig. 6. Number of Iterations X Correct Classification Rate
Fig. 7. Number of States X Correct Classification for the Two Different Initialization Approaches
polynomial performance might not be sufficient, and a balance between accuracy and response time should be pursued. In all graphics, the values of the parameters not shown, including the initialization approach, were chosen as the ones that give the mean performance. The graphic of Figure 6, relating the number of iterations to the correct classification rate, indicates that this parameter does not have a great effect on the performance, which always stays near the 81% value.
Finally, in Figure 7, a graphical comparison between the two initialization approaches is presented. The graphic shows the relation between the number of states and the correct classification rate for each approach. The standard initialization approach (uniform distribution assumption) outperforms the K-Means based approach when the number of states is greater than 11. Below this value, the K-Means approach is always better. This is due to the fact that the clustering performed before the initialization reveals to the learning module that some hidden states are not associated (or are only weakly associated) with any observation. The best correct classification rate of all experiments, 84%, was achieved using the K-Means initialization approach, with only 4 states.
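A rough sketch of the K-Means based initialization of the B matrix described above is given below (an illustration with hypothetical variable names, not the authors' code); each training sample is assigned to a cluster that plays the role of a hidden state, and the emission probabilities are then estimated by counting, with a small additive term added here as an extra assumption to avoid zero probabilities.

import numpy as np
from sklearn.cluster import KMeans

def init_B_kmeans(features, symbols, n_states, n_symbols, eps=1e-3):
    # features: (T, d) training feature vectors; symbols: (T,) quantized observation indices
    states = KMeans(n_clusters=n_states, n_init=10).fit_predict(features)
    B = np.full((n_states, n_symbols), eps)
    for s, o in zip(states, symbols):
        B[s, o] += 1.0                           # count symbol o for cluster/state s
    return B / B.sum(axis=1, keepdims=True)      # normalize rows to probabilities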
6
Conclusion and Future Works
This paper presented an application of hidden Markov models to the recognition of snake behavior. Experiments on the number of internal states, the number of Baum-Welch iterations and the initialization approaches were conducted in order to find the best configuration for this particular problem. A maximum correct classification rate of 84% has been achieved. For future research, it would be interesting to include information related to the contour of the snake in the feature vectors, and to evaluate other HMM types, like the ones that use continuous probability distributions, to overcome the problem of having to discretize the observed values. It is also important to expand the tests using a larger amount of images, with different kinds of animals and environments. Acknowledgments. This work has received financial support from Dom Bosco Catholic University, UCDB, the Agency for Studies and Projects Financing, FINEP, and the Foundation for the Support and Development of Education, Science and Technology from the State of Mato Grosso do Sul, FUNDECT. One of the co-authors holds a Productivity Scholarship in Technological Development and Innovation from CNPq, the Brazilian National Council of Technological and Scientific Development, and some of the other co-authors have received PIBIC/CNPq scholarships.
References 1. Rivas, J.A., Burghardt, G.M.: Snake mating systems, behavior, and evolution: The revisionary implications of recent findings. Journal of Comparative Psychology 119(4), 447–454 (2005) 2. Spink, A.J., Tegelenbosch, R.A.J., Buma, M.O.S., Noldus, L.P.J.J.: The ethovision video tracking system-a tool for behavioral phenotyping of transgenic mice. Physiology and Behavior 73(5), 731–744 (2001) 3. Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2), 121–167 (1998)
Hidden Markov Models Applied to Snakes Behavior Identification
787
4. Rabiner, L.R.: A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE 77, 257–286 (1990) 5. Leo, M., D’Orazio, T., Spagnolo, P.: Human Activity Recognition for Automatic Visual Surveillance of Wide Areas, 1st edn. Academic Press, London (1999) 6. Hu, J., Brown, M.K., Turin, W.: Hmm based on-line handwriting recognition. IEEE Trans. Pattern Anal. Mach. Intell 18(10), 1039–1045 (1996) 7. Hornegger, J., Niemann, H., Paulus, D., Schlottke, G.: Object recognition using Hidden Markov Models. In: Gelsema, E.S., Kanal, L.N. (eds.) Pattern Recognition in Practice IV: Multiple Paradigms, Comparative Studies and Hybrid Systems, vol. 16, pp. 37–44. Elsevier, Amsterdam (1994) 8. Starner, T., Pentland, A.: Visual recognition of american sign language using hidden markov models. Technical Report Master’s Thesis, MIT, Program in Media Arts & Sciences, Massachusetts Institute of Technology, Cambridge, USA (February 1995) 9. Aas, K., Line, E., Tove, A.: Text recognition from grey level images using hidden Markov models. In: Hlaváč, V., Šára, R. (eds.) CAIP 1995. LNCS, vol. 970, pp. 503–508. Springer, Heidelberg (1995) 10. Nefian, A.V., Hayes, M.H.: Face detection and recognition using hidden markov models. ICIP (1), 141–145 (1998) 11. Feldman, A., Balch, T.: Automatic identification of bee movement. Technical report, Georgia Institute of Technology, Atlanta, Georgia 30332, USA (2003) 12. Freeman, W., Tanaka, K., Ohta, J.: Computer vision for computer games. In: Int’l Workshop on Automatic Face- and Gesture-Recognition, Killington, Vermont, USA (1996) 13. Montero, J.A., Sucar, L.E.: Feature selection for visual gesture recognition using hidden markov models. In: ENC, pp. 196–203 (2004) 14. Shen, F., Hasegawa, O.: An adaptive incremental lbg for vector quantization. Neural Netw. 19(5), 694–704 (2006) 15. Malyszko, D., Wierzchon, S.T.: Standard and genetic k-means clustering techniques in image segmentation. CISIM 0, 299–304 (2007)
SP Picture for Scalable Video Coding
Jie Jia, Hae-Kwang Kim, and Hae-Chul Choi
Department of Computer Science, Sejong University, 143-747, Seoul, Korea
Radio & Broadcasting Division, ETRI, 305-700, Daejeon, Korea
{jiejia, hkkim}@sejong.ac.kr,
[email protected]
Abstract. This paper investigates an extension of the SP picture from H.264/AVC to scalable video coding (SVC), which has been recently developed and standardized as the scalable extension of H.264/AVC. In comparison with the scalable profiles of previous video coding standards, SVC has achieved significant improvements in both coding efficiency and temporal, spatial and fidelity scalability, which efficiently provide the coded stream with wide adaptivity to dynamic network conditions as well as to diverse clients. In communication environments, this efficient adaptivity can be provided by bit stream switching between different scalable layers. The current SVC supports bit stream switching only at instantaneous decoding refresh (IDR) access units. However, in order to provide instantaneous switching capability, IDR pictures need to be frequently coded in the SVC stream, which dramatically decreases the coding efficiency. Therefore, an SP picture for SVC is proposed in this paper for efficient bit stream switching. Performance analysis shows that the SP picture for SVC provides an average 1.2 dB PSNR enhancement over the IDR picture while providing similar functionalities. Keywords: Scalable Video Coding, bitstream switching, SP picture.
1
Introduction
Video applications today range from video conferencing and mobile video to high-definition (HD) TV broadcast and HD DVD storage, where application environments cover mobile, wireless and wired networks with various conditions as well as diverse clients and system resources. For those applications, both coding efficiency and adaptivity are of primary importance. To meet the requirements of those applications, the scalable video coding (SVC) extension of H.264/AVC has been developed and recently standardized by the Joint Video Team (JVT) of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) [1]. SVC addresses coding schemes for efficient and adaptive video storage and communication over heterogeneous networks for diverse clients using the available system resources. The most significant improvement of SVC is its enhanced coding efficiency and scalability over the scalable profiles of previous video coding standards, such as MPEG-2, H.263+ and MPEG-4 part II.
The purpose of SVC is to provide a universally accessible bit stream which can still be decoded when part of the stream is removed, in order to adapt to dynamic and varied network conditions as well as user preferences. One extracted scalable bit stream constitutes another bit stream corresponding to a decreased frame rate, picture resolution or quality level. For communication applications, this adaptivity can be provided by dynamic bit stream switching between different scalable layers. SVC provides coded stream inter-layer switching capability at instantaneous decoding refresh (IDR) access units. However, the employment of IDR pictures decreases the coding efficiency, especially when they are frequently coded in the bit stream. In H.264/AVC, a similar problem is solved by a new picture type, the SP picture, which was proposed and adopted into AVC for efficient bit stream switching [2][3][4]. Basically, the SP picture employs inter-picture motion estimation and motion compensation in a similar way to the P picture. However, unlike the P picture, the SP picture allows an identical reconstruction with a pair of SP pictures. The two SP pictures, denoted as the primary SP picture (the SP picture for non-switching) and the secondary SP picture (the SP picture for switching), are predicted from reference pictures decoded from the current stream and from the stream being switched from, respectively [4][5][6]. SP pictures achieve this identical reconstruction by losslessly coding a secondary SP picture with a quantized primary SP picture as input. Note that only a primary SP picture is coded and transmitted in a bit stream. A secondary SP picture is only transmitted and decoded when switching occurs; in this case, no primary SP picture is decoded. Intensive simulation results show that the SP picture outperforms the IDR picture in coding efficiency regarding the functionality of bit stream switching [7]. Therefore, in this paper, we extend the SP picture from AVC to SVC for an efficient SVC bit stream switching capability. This paper is structured as follows. Sec. 2 presents an overview of the SVC scalable coding structure as well as a summary of the SP picture for AVC. Sec. 3 investigates the SP picture for SVC. Moreover, a proof of the drift-free switching via the SP picture for SVC is also given. Sec. 4 analyzes the performance of the SP picture for SVC, followed by a conclusion drawn in Sec. 5.
2
SVC Overview and SP Picture Basics
SVC was designed as a scalable extension of H.264/AVC. SVC exploits most of the AVC techniques and further develops inter-layer prediction, which greatly contributes to the enhanced coding efficiency. SVC provides temporal, spatial and fidelity scalability. A hierarchical prediction structure gives SVC flexible temporal scalability. A layered coding structure with the enhanced inter-layer prediction contributes to the spatial and fidelity scalability. In the remainder of this section, a brief overview of SVC is given, followed by an explanation of the SP picture concept. For a detailed description of SVC, the reader is referred to the draft standard [1] and an overview of SVC [8].
Fig. 1. Illustration of hierarchical prediction structure. (a) dyadic temporal scalability with hierarchical B picture, (b) non-dyadic scalability with hierarchical P picture.
2.1
Temporal Scalability
SVC provides both dyadic and non-dyadic temporal scalability through hierarchical B/P pictures [9]. Different temporal layers are identified by a temporal level identifier, which starts from 0, representing the temporal base layer, and is increased by 1 for every temporal enhancement layer. Fig. 1 illustrates the hierarchical prediction structure enabling temporal scalability. Fig. 1(a) presents a dyadic coding structure with hierarchical B pictures which provides another two independently decodable sub-sequences with 1/4 and 1/2 of the full temporal resolution. Fig. 1(b) describes a non-dyadic prediction structure based on hierarchical P pictures where another two sub-sequences can be independently decoded with 1/9 and 1/3 of the full temporal resolution. The principle of this hierarchical prediction is that a temporal base layer picture can only be predicted from previously decoded picture(s) with the same temporal level, and a temporal enhancement layer picture with temporal level Ti can be predicted from picture(s) with temporal level Tk, where k ≤ i. Note that the hierarchical prediction structure can also be combined with the multiple reference picture concept from H.264/AVC. Besides the flexible temporal scalability, the hierarchical prediction structure can also provide zero structural delay, which is supported by hierarchical P pictures, as illustrated in Fig. 1(b).
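As a small illustrative sketch of the dyadic case (an example written for these notes, not part of the standard text), the temporal level of each picture inside a GOP of size 2^K can be derived from its frame index, so that a picture at level Ti only depends on pictures at levels Tk with k ≤ i:

def temporal_level(n, gop_size):
    # Dyadic hierarchy; gop_size is assumed to be a power of two (e.g. 8).
    # Frames at positions 0, gop_size, 2*gop_size, ... form the temporal base layer (level 0).
    k = gop_size.bit_length() - 1          # K, with gop_size = 2**K
    pos = n % gop_size
    if pos == 0:
        return 0
    # level = K minus the number of trailing zero bits of the position
    return k - ((pos & -pos).bit_length() - 1)

# Example for gop_size = 8: frames 0 and 8 -> level 0, frame 4 -> 1, frames 2 and 6 -> 2, odd frames -> 3
levels = [temporal_level(n, 8) for n in range(9)]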
2.2
Spatial Scalability
In order to support spatial scalability, SVC employs a multi-layer coding structure, which is also used in MPEG-2, H.263 and MPEG-4 Visual. However, unlike previous standards, SVC encodes the pictures of a spatial enhancement layer with both intra-layer motion-compensated prediction and layer-specific
inter-layer prediction. To improve SVC coding efficiency, an enhanced inter-layer prediction has been developed which consists of inter-layer motion prediction, inter-layer residual prediction and inter-layer intra prediction [10]. In order to employ base layer information as much as possible for the spatial enhancement layer coding, SVC includes a new macroblock (MB) type, denoted as BlSkip. With this type, both the MB partition and the motion information of an enhancement layer MB, including reference frame and motion vector, are derived from the collocated base layer MB. This is the inter-layer motion prediction. Besides the inter-layer motion prediction, an MB-based residual signal prediction is also provided. When this inter-layer prediction is employed, the block-wise up-sampled base layer residual signal, obtained with a bi-linear filter, is used as the prediction signal for the enhancement layer residual signal coding. Regarding the BlSkip mode, when the corresponding block in the base layer is fully located within an intra-coded MB, the current enhancement layer MB is coded with inter-layer intra prediction. The prediction signal for the enhancement layer MB is obtained by up-sampling the reconstructed intra-coded block in the base layer. Note that constrained intra prediction is always applied to layers that are employed for inter-layer prediction, so that only a single motion compensation loop is needed for decoding.
2.3
Fidelity Scalability
SVC provides fidelity scalability in the form of coarse grain scalability (CGS) and medium grain scalability (MGS) [11]. CGS employs the same intra-layer prediction and a similar inter-layer prediction as that used for spatially scalable coding, but without up-sampling. The number of supported quality levels is identical to the number of layers. MGS employs a similar motion-compensated prediction structure to that used for FGS [12], but without bit-plane coding. Therefore, switching between different quality levels is virtually possible in any access unit for MGS.
2.4
SP Picture Basics
The SP picture was first proposed by Karczewicz and Kurceren in [7] for AVC bit stream switching, error resilience, etc. A pair of SP pictures provides an identical reconstruction even though their motion-compensated predictions are performed from different reference pictures. When a primary SP picture is coded, the constructed signal is requantized with a finer quantizer than that used for the residual signal quantization. Then the requantized signal is reconstructed and sent to the decoded picture buffer, where pictures are further used as references for the prediction of the following pictures. Therefore, in general, the primary SP picture is slightly less efficient in compression than the regular P picture, but significantly more efficient than the IDR picture thanks to the motion compensation. A constructed primary SP picture prior to the deblocking process is fed as the original signal to a secondary SP picture encoder. Similar to that for primary SP picture coding, the motion compensation for the secondary SP picture is also
performed in the transform domain. After that, the residual signal of the secondary SP picture is losslessly coded, so that a mismatch-free reconstruction of the primary SP picture can be obtained from the secondary SP picture. Usually, the secondary SP picture is much less efficient than the regular P picture. For a further description of the SP picture, readers are referred to [4][5][6].
3
SP Picture for SVC
In this section, we first discuss applications of the SP picture for SVC, then present a detailed description of the encoding process for inter-picture coded MBs in enhancement SP pictures. Finally, an analytical proof of the drift-free switching is given.
3.1
Applications
The SP picture for AVC provides efficient drift-free switching between different streams. This gives the SP picture a wide range of applications such as error recovery, fast forward/backward as well as bit stream switching. An extension of the SP picture from AVC to SVC not only further exploits the features of the SP picture but also, more importantly, provides a more efficient and flexible inter-layer switching for SVC. Consider real-time video communication, such as video conferencing, where compressed video streams are transmitted under various network conditions over wireless or wired heterogeneous networks. For those applications, the available bandwidth is always varying due to the dynamic network conditions. SVC provides multiple decodable bit streams which can be decoded into video sequences with different frame rates, spatial resolutions and/or quality levels, controlled by the scalable layer. This feature significantly enhances the adaptivity of scalable coded streams to different network conditions as well as diverse clients. However, when real-time operation is concerned, the dynamic network requires the coded stream to adapt in a timely manner to variations of the bandwidth available to the client. For scalable coded streams, this can be obtained by inter-layer switching, but the switching needs to be instantaneous. The IDR picture is one of the solutions. Unfortunately, as described before, a frequent usage of IDR pictures significantly decreases the coding efficiency. With the SP picture for SVC, thanks to its significantly enhanced coding efficiency in comparison with the IDR picture, a frequent employment of SP pictures is possible with only a slight performance decrease. Usually, the primary SP picture is transmitted and decoded. When switching occurs, a secondary SP picture is transmitted and decoded, which provides exactly the same reconstruction as that of the primary SP picture. This gives the scalable coded stream timely adaptivity to the dynamic network conditions. The corresponding performance is further discussed in Sec. 4.
3.2
Encoding Process for SP Picture for SVC
SP picture for SVC was first proposed and presented in [13]. The most significant difference between the SP picture for AVC and the SP picture for SVC lies in
Fig. 2. SVC primary SP picture coding structure
the inter-layer prediction for enhancement SP picture coding. Fig. 2 and Fig. 3 illustrates a schematic block diagram of the encoding process for primary SP picture and secondary SP picture, respectively. In both of the figures, two spatial layers are assumed to be scalably coded. Note, the SVC is AVC compatible at base layer. Therefore, the SP picture for SVC at base layer is same as the regular SP picture for AVC. In Fig. 2 and Fig. 3, T, T −1 , Q, and Q−1 represents the transform, inverse transform, quantization and de-quantization process, respectively. ME/MC indicates motion estimation and motion compensation. Loop filter (LF) refers to deblocking filter. Decoded picture buffer (DPB) holds encoder reconstructed pictures which are used as reference for coding P and B pictures later. As shown in Fig. 2, when primary SP picture is coded, basically, the downsampled original signal is firstly encoded in base layer and later the original signal is coded in enhancement layer. For each layer, two quantizers are employed. The coarser one is used for residual signal quantization, while the finer one is used for motion-compensated prediction and construction signal quantization, which is also used for the secondary SP picture coding, see Fig. 3. The finer quantizer is designed for reducing the secondary SP picture stream size with the cost of a decreased primary picture quality.
Fig. 3. SVC secondary SP picture coding structure
Regarding the enhancement SP picture coding process, ME/MC is performed first, then a transform is applied separately to the prediction signal and the original signal. After that, the residual signal is obtained in the transform domain. Note that inter-layer prediction is employed in the ME/MC process for the enhancement SP picture coding. Similar to the inter-layer prediction used for enhancement P picture coding, both inter-layer motion prediction and inter-layer intra prediction are utilized. However, unlike the enhancement P picture, the enhancement SP picture does not include inter-layer residual prediction. A description of inter-layer prediction was given in Sec. 2.2. The primary SP picture is designed to efficiently encode a picture that can have an identical reconstruction even when it is predicted from a different reference picture. The secondary SP picture is designed to encode a picture whose reconstruction is identical to that of the primary SP picture. For this purpose, the construction signal in the primary SP picture coding process prior to LF is fed as the input to the secondary SP picture coding process, as shown in Fig. 3. Note that, similar to the enhancement primary SP picture coding, only inter-layer motion prediction and inter-layer intra prediction are employed in the enhancement layer ME/MC process. A further explanation of the drift-free switching is provided in Sec. 3.3.
3.3
Drift-Free Switching by SP Picture for SVC
This section gives an analytical proof to the drift-free switching for SVC provided by the proposed SP picture. As the base layer of SVC is AVC compatible, in the following analysis, a proof of the identical reconstruction provided by enhancement SP pictures is given.
SP Picture for Scalable Video Coding
795
As discussed before, in order to obtain a mismatch free reconstruction between the primary SP picture and the secondary SP picture, the constructed signal x”rec E of primary SP picture prior to deblocking process, as shown in Fig. 2, is used as the input to the secondary SP picture coding process. Referring to the coding process of primary SP picture, x”rec E can be expressed as (1). xrec E = T −1 Q−1 (1) s {Qs [XT E + Xpred E ]} Where XT E and Xpred E is the de-quantized residual signal in transform domain and the transformed prediction signal, respectively. Moreover, T −1 and Q−1 s represent the inverse transform and de-quantizer. In the secondary SP picture coding process, the input signal x”rec E is transformed and quantized with quantizer Qs . Combining (1), this process can be expressed as (2). Qs ·T (xrec
E)
= Qs ·T (T −1 Q−1 s {Qs [XT
= =
E + Xpred E ]}) −1 Qs ·Qs {Qs [XT E + Xpred E ]} Qs [XT E + Xpred E ]
(2)
Following the transform and quantization, a transform-domain motion compensation is performed, which generates the corresponding residual signal that will be losslessly coded for the secondary SP picture. Thanks to the lossless coding process, a construction signal in the transform domain can be obtained which is exactly the same as the quantized input signal in the transform domain. Then, after the de-quantization and inverse transform, a construction signal prior to the deblocking process can be obtained. This process, combined with (2), is given by (3):

T^{-1} Q_s^{-1} { X_recQs,E } = T^{-1} Q_s^{-1} { Q_s [ X_T,E + X_pred,E ] − X_pred2,E + X_pred2,E }
                              = T^{-1} Q_s^{-1} { Q_s [ X_T,E + X_pred,E ] }    (3)
It can be seen that (3) and (1) represent exactly the same signal. Therefore, exactly the same reconstruction can be obtained by decoding either the primary SP picture or the secondary SP picture. This guarantees a drift-free switching between different scalable coded streams.
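The key step in Eq. (2) is that re-applying the quantizer Q_s to a signal that has already been quantized and de-quantized with Q_s reproduces the same quantized levels. A tiny numeric sketch of this idempotence, assuming a uniform scalar quantizer with rounding (a simplification of the actual integer transform and quantization used in AVC/SVC), is given below.

import numpy as np

step = 2.5                                 # assumed quantization step of Q_s
x = np.random.randn(16) * 10               # arbitrary transform-domain signal

q1 = np.round(x / step)                    # Q_s[.]
rec = q1 * step                            # Q_s^{-1}{.}: reconstruction fed to the secondary SP picture
q2 = np.round(rec / step)                  # re-applying Q_s to the reconstruction

assert np.array_equal(q1, q2)              # identical levels, hence identical final reconstruction

Since the secondary SP picture then codes the remaining difference losslessly, the decoder obtains the same reconstruction regardless of which stream it was decoding before the switch.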
4
Performance Analysis
To illustrate the coding efficiency performance of the SP picture for SVC, the proposed SP picture was implemented in software based on the SVC reference software JSVM 8 [14]. The implemented software was submitted to JVT with [15]. Simulations under the SVC coding efficiency test conditions [16] were performed. All eight standard test sequences were tested. In the simulations, CABAC was chosen as the entropy coding method, and rate-distortion optimization was employed. Three sets of tests were performed in this paper. They consist of the performance comparison between the SP picture and the Intra picture, the comparison between the SP
Fig. 4. Performance comparison for Bus sequence with GOP size = 8, key picture is encoded as Inter P, SP and Intra picture, respectively. QPsp = QPp − 2, Qs = QPp − 5 (a) base layer, (b) enhancement layer.
picture and EIDR picture, and the comparison between SP pictures which are coded with different QPsp and Qs values. Fig. 4 to Fig. 7, and Tab. 1 to Tab. 2 illustrate the simulation results. Basically, the simulation results report that the proposed SP picture coding scheme enhances the coding efficiency performance of SVC. Fig. 4 and Fig. 5 compare the coding efficiency performance between the inter P, SP and Intra pictures. For all three cases, pictures with the lowest temporal level are coded as Inter P, SP and Intra pictures, respectively, which are denoted with periodic P, periodic SP and periodic I. Similar notations are also applied to the following figures in this section. Regarding the ”Bus” sequence, an average 0.8 dB improvement can be observed from the simulation results by the SP picture over the Intra picture for the base layer. Furthermore, a 0.4 dB PSNR enhancement can be observed at the enhancement layer. Simulation results for the ”Mobile” sequence report an average 1.7 dB and 1.2 dB PSNR improvement of the SP picture over the Intra picture for the base layer and the enhancement layer respectively. Fig. 6 illustrates the performance comparison between the SP picture and the EIDR picture. In the simulation, both of the SP picture and the EIDR picture are periodically coded for the enhancement layer. The GOP size of 8 is employed. From the simulation results, it can be observed that the SP picture improves the coding efficiency performance by an average 1.6 dB PSNR enhancement over the EIDR picture. Note, as the SP picture and the EIDR picture are only encoded for the enhancement layer, the performance regarding the base layer is same for all three of them. Fig. 7 compares the SP picture performance when different quantization parameters are employed. Basically, two sets of QP values are tested in the simulation. The curve with a triangle mark presents the performance of SP picture which is
Fig. 5. Performance comparison for Mobile sequence with GOP size = 16, key picture is encoded as Inter P, SP and Intra picture, respectively. QPsp = QPp −2, Qs = QPp −5 (a) base layer, (b) enhancement layer.
Fig. 6. Performance comparison between SP picture and EIDR picture, Bus sequence, GOP size = 8, QPsp = QPp − 2, Qs = QPp − 5
coded with QPsp = QPp − 2 and Qs = QPp − 5. The curve with a cross mark presents the performance of SP picture which is coded with QPsp = QPp − 1 and Qs = QPp − 10. From the figure, it can be seen that lower Qs improves the performance of the primary SP picture. However, it increases the bits used for representing the secondary SP picture. Therefore, in the previous simulations, the quantization parameters for the SP picture are set as QPsp = QPp − 2 and Qs = QPp − 5. Tab. 1 and Tab. 2 illustrate comparisons on average coded picture size between the IDR picture and the secondary SP picture for the ”Bus” and ”Mobile”
Fig. 7. Performance comparison between SP pictures which are coded with different QPsp and Qs. Foreman sequence, GOP size = 8, (a) base layer, (b) enhancement layer.
Table 1. Comparison of average coded I and SP picture size for Bus sequence
Bus               target QP 30          target QP 33          target QP 35
IDR                      33162                 24536                 19485
SP    from QP 27:        34198   from QP 30:   26110   from QP 32:   21266
      from QP 33:        32792   from QP 36:   23604   from QP 38:   18590
      from QP 36:        34485   from QP 39:   24917   from QP 41:   19656
Table 2. Comparison of average coded I and SP picture size for Mobile sequence
Mobile            target QP 31          target QP 33          target QP 37
IDR                      50786                 41438                 28450
SP    from QP 28:        34280   from QP 30:   27338   from QP 34:   17133
      from QP 34:        33311   from QP 36:   26274   from QP 40:   16468
      from QP 37:        37988   from QP 39:   32051   from QP 43:   22541
sequences. In this simulation, the sequences are coded in QCIF format. Bit stream switching between different quality layers is employed. The QP values given in the first row of each table are the QP values used for coding the target bit stream. The QP values listed in each column are the QP values used for coding the stream from which the target bit stream is switched. From the tables, it can be seen that, for the "Bus" sequence, a similar or smaller coded picture size can generally be obtained with the secondary SP picture in comparison to the IDR picture. For the "Mobile" sequence, only around 2/3 of the coded IDR picture size is needed for representing the secondary SP picture. This is because the number of bits used for representing a picture depends on the picture content. A picture with complex content, such as in the "Mobile" sequence,
requires more bits to represent than a picture with simple content does. This difference is more pronounced between Intra picture coding and Inter picture coding, i.e. between the IDR picture and the SP picture.
5
Conclusion
This paper presented the application, design, proof of drift-free switching and performance analysis of the SP picture for SVC. An overview of SVC and of the SP picture was also given. The proposed SP picture for SVC further exploits the adaptivity of SVC to various dynamic network conditions as well as diverse clients. The SP picture for SVC is based on the SP picture for AVC. Unlike both the regular SP picture and the P picture, the SP picture for SVC employs not only intra-layer inter prediction and intra prediction, but also inter-layer intra prediction and inter-layer motion prediction. These contribute to the enhanced coding efficiency performance in comparison with the IDR picture. Simulation results show that an average 1.2 dB PSNR enhancement can be obtained by the SP picture for SVC over the IDR picture. Acknowledgments. This work was supported by the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (MOST) (No. R01-2007-000-11078-0) and ETRI (Electronics and Telecommunications Research Institute).
References 1. Wiegand, T., Sullivan, G., Reichel, J., Schwarz, H., Wien, M.: Joint draft 10 of SVC amendment. In: Joint Video Team Meeting, San Jose, CA, USA, Doc. JVTW201 (2007) 2. ISO/IEC JTC 1: Advanced video coding for generic audio-visual services, ITU-T Recommendation H.264 and ISO/IEC 14496-10 (MPEG4-AVC), Version 4 (2005) 3. Wiegand, T., Sullivan, G.J., Bjøntegaard, G., Luthra, A.: Overview of the H.264/AVC video coding standard. IEEE Trans. on Circuits and Systems for Video Technology 13(7), 560–576 (2003) 4. Karczewics, M., Kurceren, R.: The SP- and SI- frames design for H.264/AVC. IEEE Trans. on Circuits and Systems for Video Technology 13(7), 637–644 (2003) 5. Setton, E., Girod, B.: Rate-distortion analysis and streaming of SP and SI frames. IEEE Trans. on Circuits and Systems for Video Technology 16(6), 733–743 (2006) 6. Sun, X., Wu, F., Li, S., Shen, G., Gao, W.: Drift-free switching of compressed video bitstreams at predictive frames. IEEE Trans. on Circuits and Systems for Video Technology 16(5), 565–576 (2006) 7. Karczewics, M., Kurceren, R.: A proposal for SP–frames. In: Video Coding Expert Group Meeting, Eibsee, Germany, Doc. VCEG–L27 (2001) 8. Schwarz, H., Marpe, D., Wiegand, T.: Overview of the scalable video coding extension of the H.264/AVC standard. In: Joint Video Team Meeting, San Jose, CA, USA, Doc. JVT–W132 (2007)
9. Schwarz, H., Marpe, D., Wiegand, T.: Hierarchical B pictures. In: Joint Video Team Meeting, Poznan, Poland, Doc. JVT–P014 (2005) 10. Schwarz, H., Marpe, D.H., Wiegand, T.: SVC core experiment 2.1: Inter-layer prediction of motion and residual data. In: Moving Picture Experts Group Meeting, Redmond, WA, USA, ISO/IEC JTC 1/SC 29/WG 11, Doc. M11043 (2004) 11. Kirchhoffer, H., Schwarz, H., Wiegand, T.: SVC core experiment 1: Simplified FGS. In: Joint Video Team Meeting, San Jose, CA, USA, Doc. JVT–W090 (2007) 12. Winken, M., Schwarz, H., Marpe, D., Wiegand, T.: Adaptive refinement of motion information for fine-granular SNR scalable video coding. In: European Symposium on Mobile Media Delivery, EuMob 2006, Alghero, Italy, September 2006 (2006) 13. Jia, J., Kim, H.K., Choi, H.C., Kim, J.-G.: SP picture for SVC switching. In: Joint Video Team Meeting, Marrakech, Morocco, Doc. JVT–V045 (2007) 14. Vieron, J., Wien, M., Schwarz, H.: JSVM8 software. In: Moving Picture Experts Group Meeting, Hangzhou, China, ISO/IEC JTC 1/SC 29/WG 11, Doc. N8457 (2006) 15. Jia, J., Kim, H.K., Choi, H.C., Kim, J.-G.: SVC core experiment 2: Switching for SVC. In: Joint Video Team Meeting, Hangzhou, China, Doc. JVT–U302 (2006) 16. Wien, M., Schwarz, H.: Test conditions for SVC coding efficiency and JSVM performance evaluation. In: Joint Video Team Meeting, Nice, France, Doc. JVT–Q205 (2005)
Studying the GOP Size Impact on the Performance of a Feedback Channel-Based Wyner-Ziv Video Codec Fernando Pereira1, João Ascenso2, and Catarina Brites1 1
Instituto Superior Técnico – Instituto de Telecomunicações, Av. Rovisco Pais, 1049-001 Lisboa, Portugal 2 Instituto Superior de Engenharia de Lisboa ― Instituto de Telecomunicações, Rua Conselheiro Emídio Navarro, 1, 1950-062 Lisboa, Portugal {fernando.pereira, joao.ascenso, catarina.brites}@lx.it.pt
Abstract. Wyner-Ziv video coding has become one of the hottest research topics in the video coding community due to the conceptual, theoretical and functional novelties it brings. Among the many practical architectures already available, feedback channel-based solutions with channel coding, e.g. LDPC and turbo codes, are rather popular. These solutions rely on decoder motion estimation based on periodic Intra coded key frames, setting the so-called GOP size, very much like in conventional video coding. This paper targets the rate-distortion and complexity performance study of this type of Wyner-Ziv coding solution as a function of the GOP size, considering both LDPC and turbo codes. Keywords: Distributed video coding, Wyner-Ziv video coding, GOP size, LDPC coding, Turbo coding, RD performance, Complexity performance.
1 Introduction
Since the middle eighties, the video coding research community has been developing video codecs where it is the task of the encoder to exploit the data redundancy and irrelevancy to reach the compression factors necessary to deploy applications and services that require video coding. The most popular video codec architecture, the so-called motion-compensated hybrid scheme, adopted by all MPEG and ITU-T video coding standards, relies on a combination of efficient motion-compensated temporal prediction and block-based transform coding, where encoders may become rather complex in comparison with decoders. This complexity balance is particularly suitable for asymmetric application topologies, such as digital television, video on demand, digital storage, and video streaming, where the content is typically coded once (and many times offline) and decoded many times or by many decoders. The beginning of this decade saw the emergence of a new video coding paradigm, the so-called distributed video coding [1], challenging the 'traditional' coding model since it proposes to fully or partly exploit the video data redundancy at the decoder and not anymore at the encoder. This new coding approach is based on the Slepian-Wolf and Wyner-Ziv theorems, which basically state, for the lossless and lossy coding cases respectively, that, under certain conditions, the same compression can be achieved with both the joint-encoding and joint-decoding paradigm and the distributed/independent-encoding and joint-decoding paradigm [1].
Wyner-Ziv (WZ) video coding regards the lossy coding of two correlated sources where the source X is coded without access to the correlated source Y, which is assumed to be available at the decoder to perform joint decoding. This new paradigm implies that the data correlation is mainly exploited at the decoder, enabling low-complexity video encoding, where the core of the computation, notably motion processing, is shifted to the decoder. Following important developments in the channel coding domain, the first practical implementations of the distributed video coding paradigm, notably Wyner-Ziv video coding solutions, appeared around 2002 [1,2,3]. One of the most popular Wyner-Ziv video coding architectures relies on a feedback channel to perform rate control at the decoder and uses channel codes, e.g. turbo codes or Low-Density Parity-Check (LDPC) codes, to 'correct' the errors in the frame estimations (designated as side information) created at the decoder, after motion estimation, for the frames to be Wyner-Ziv encoded (the X source). In this architecture, the full video sequence is divided into the so-called key frames (Y source) and the so-called WZ frames (X source), with the key frames appearing periodically and setting the so-called GOP (Group of Pictures) size, as in conventional video coding with I frames versus P and B frames. Since the key frames serve for the decoder to create the estimations of the WZ frames based on motion estimation, the GOP size has a very significant impact on the overall performance of the WZ video codec (even more than for traditional video coding), since it strongly determines the quality of the motion estimation, and thus the quality of the side information, and the number of 'estimation errors' that have to be corrected by requesting (channel coding) parity bits from the WZ encoder through the feedback channel. While most WZ video coding performance results in the literature regard a GOP size of 2 (the simplest case in terms of decoder motion estimation), it is well known that higher GOP sizes are relevant and interesting since temporal redundancy should be exploited more than once every two frames. In fact, conventional video coding shows that the RD performance limits improve with the GOP size and that the overall WZ encoding complexity decreases with the GOP size; these are two simultaneously positive trends that have to be exploited, notably for encoder-complexity and battery constrained applications. However, longer GOP sizes are particularly challenging for WZ video coding since the motion estimation becomes more difficult, and thus the side information poorer, especially for more active video content. In this context, the main target of this paper is to evaluate in detail the current performance of an advanced WZ video codec as a function of the GOP size used for the key frames versus WZ frames splitting. This performance evaluation will not only consider the rate-distortion (RD) performance but also the complexity performance, since WZ coding is deeply related to additional complexity budget flexibility; in fact, theoretically, the compression efficiency can, at most, be the same as for conventional video coding. While Section 2 briefly describes the WZ video codec used for this study, Section 3 details the performance evaluation for several relevant metrics, always as a function of the GOP size. Finally, Section 4 summarizes and concludes this paper.
2 DISCOVER Wyner-Ziv Video Codec The WZ video codec evaluated in this paper has been improved by the DISCOVER project team based on a first codec developed by the authors of this paper at Instituto Superior Técnico [4]. The DISCOVER WZ video codec architecture, illustrated in Figure 1, is based on the basic WZ video coding architecture proposed in [3], and it is presented in detail in [5]. However, at this stage, the initial architecture has already evolved, e.g. adding a Cyclic Redundancy Check code (CRC) and some encoder rate control capabilities, and most of the tools in the various modules are different (and globally more efficient).
Fig. 1. DISCOVER video codec architecture
The DISCOVER WZ video codec works as follows:
At the encoder:
1. First, a video sequence is divided into Wyner-Ziv (WZ) frames, i.e., the frames that will be coded using a WZ approach, and key frames, as in the original WZ architecture adopted as the basis of the DISCOVER codec [1, 3]. The key frames are encoded as Intra frames, e.g. using the H.264/AVC Intra codec [6], and may be inserted periodically with a certain Group of Pictures (GOP) size. An adaptive GOP size selection process may also be used; in that case, the key frames are inserted depending on the amount of temporal correlation in the video sequence [7]. Most results available in the literature use a GOP size of 2, which means that odd and even frames are key frames and WZ frames, respectively. This paper targets precisely the study of the WZ codec performance impact when changing the GOP size.
2. Over each Wyner-Ziv frame XWZ, a 4×4 block-based Discrete Cosine Transform (DCT) is applied. The DCT coefficients of the entire frame XWZ are then grouped together, according to the position occupied by each DCT coefficient within the 4×4 blocks, forming the DCT coefficients bands.
3. After the transform coding operation, each DCT coefficients band bk is uniformly quantized with 2^Mk levels (where the number of levels 2^Mk depends on the DCT coefficients band bk). Over the resulting quantized symbol stream (associated to the DCT coefficients band bk), bitplane extraction is performed. For a given band, the quantized symbol bits of the same significance (e.g. the most significant bit) are grouped together, forming the corresponding bitplane array, which is then independently LDPC (or turbo) encoded [8].
4. The LDPC (or turbo) coding procedure for the DCT coefficients band bk starts with the most significant bitplane array, which corresponds to the most significant bits of the bk band quantized symbols. The parity information generated by the LDPC encoder for each bitplane is then stored in the buffer and sent in chunks/packets upon decoder request, through the feedback channel.
5. In order to limit the number of requests to be made by the decoder, and thus the decoding complexity (since each request corresponds to several LDPC decoder iterations), the encoder estimates for each bitplane the initial number of bits to be sent before any request is made [5]. This number should be an underestimation of the final number of bits, which means there should be no RD performance losses associated to this step (regarding the case where no initial estimation is made).
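As an illustration of steps 2-4, the following sketch groups 4×4 DCT coefficients into bands, quantizes each coded band uniformly and extracts its bitplanes. It is a simplified reading of the description above, not the DISCOVER implementation: the use of scipy's floating-point dctn, the per-band min/max quantizer range and the band ordering are assumptions made here for illustration.

```python
import numpy as np
from scipy.fft import dctn  # 2-D DCT; the codec's actual integer transform may differ

def wz_encode_bitplanes(frame, levels_per_band):
    """Toy version of steps 2-4: 4x4 block DCT, band grouping,
    uniform quantization and bitplane extraction.
    frame: 2-D uint8 luminance array with dimensions divisible by 4.
    levels_per_band: 16 entries, 2**Mk levels per DCT band (0 = band not coded)."""
    h, w = frame.shape
    blocks = frame.reshape(h // 4, 4, w // 4, 4).swapaxes(1, 2).astype(float)
    coeffs = dctn(blocks, axes=(2, 3), norm='ortho')       # 4x4 DCT per block
    bands = coeffs.reshape(-1, 16).T                        # band k = k-th coefficient of every block

    bitplanes = {}
    for k, n_levels in enumerate(levels_per_band):
        if n_levels == 0:                                   # no WZ bits for this band
            continue
        band = bands[k]
        lo, hi = band.min(), band.max()
        q = np.clip(((band - lo) / (hi - lo + 1e-9) * n_levels).astype(int), 0, n_levels - 1)
        n_bits = int(np.log2(n_levels))                     # Mk bitplanes, MSB first
        bitplanes[k] = [(q >> (n_bits - 1 - b)) & 1 for b in range(n_bits)]
    return bitplanes                                        # each array would be LDPC/turbo encoded
```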
At the decoder:
6. The decoder creates the so-called side information for each WZ coded frame by performing a frame interpolation process using the temporally closest previous and next key frames of XWZ to generate an estimate of frame XWZ, YWZ [9]. The side information for each WZ frame intends to be an estimate of the original WZ frame; the better this estimation is, the smaller the number of 'errors' the Wyner-Ziv LDPC (or turbo) decoder has to correct and the bitrate spent to correct them.
7. A block-based 4×4 DCT is then carried out over YWZ in order to obtain YWZ DCT, an estimate of XWZ DCT. The residual statistics between corresponding coefficients in XWZ DCT and YWZ DCT are assumed to be modeled by a Laplacian distribution. The Laplacian parameter is estimated online and at different granularity levels, notably at band and coefficient levels.
8. Once YWZ DCT and the residual statistics for a given DCT coefficients band bk are known, the decoded quantized symbol stream q'WZ associated to the DCT band bk can be obtained through the LDPC decoding procedure. The LDPC (or turbo) decoder receives from the encoder successive chunks of parity bits following the requests made through the feedback channel.
9. To decide whether or not more bits are needed for the successful decoding of a certain bitplane, the decoder uses a request stopping criterion based on the LDPC code parity check equations. If no more bits are needed to decode that bitplane, the decoding of the next band can start; otherwise, the bitplane LDPC decoding task has to proceed with another request and receive another chunk of parity bits.
10. After successfully LDPC (or turbo) decoding the most significant bitplane array of the bk band, the LDPC (or turbo) decoder proceeds in an analogous way to the remaining Mk-1 bitplanes associated to that band. Once all the bitplane arrays of the DCT coefficients band bk are successfully LDPC (or turbo) decoded, the LDPC (or turbo) decoder starts decoding the bk+1 band. This procedure is repeated until all the DCT coefficients bands for which WZ bits are transmitted are LDPC (or turbo) decoded.
11. Because the estimation of the bitplane error probability is not perfect, a CRC check sum is transmitted to help the decoder detect and correct the remaining errors in each bitplane, since they have a rather negative subjective impact. Since this CRC is combined with the developed request stopping criterion, it does not have to be very strong. As a consequence, an 8 bit CRC check sum per bitplane is found to be strong enough for this purpose, which only adds minimal extra rate.
12. After LDPC (or turbo) decoding the Mk bitplanes associated to the DCT band bk, the bitplanes are grouped together to form the decoded quantized symbol stream associated to the bk band. This procedure is performed over all the DCT coefficients bands to which WZ bits are transmitted.
13. Once all decoded quantized symbol streams are obtained, it is possible to reconstruct the matrix of DCT coefficients, X'WZ DCT. The DCT coefficients bands for which no WZ bits were transmitted are replaced by the corresponding DCT bands of the side information, YWZ DCT.
14. After all DCT coefficients bands are reconstructed, a block-based 4×4 inverse discrete cosine transform (IDCT) is performed and the reconstructed XWZ frame, X'WZ, is obtained.
15. Finally, to get the decoded video sequence, decoded key frames and WZ frames are appropriately interleaved.
It is important to stress that the DISCOVER WZ video codec does not include any of the limitations which are often present in WZ papers, notably those adopting this type of WZ architecture [1]. This means, for example, that no original frames are used at the decoder to create the side information, to measure the bitplane error probability or to estimate the noise correlation model parameters for LDPC (or turbo) decoding.
2.1 Transform and Quantization
Different RD performances can be achieved by changing the Mk value for the DCT band bk. In this paper, eight rate-distortion points are considered, corresponding to the various 4×4 quantization matrices depicted in Fig. 2. Within a 4×4 quantization matrix, the value at position k in Fig. 2 indicates the number of quantization levels associated to the DCT coefficients band bk; the value 0 means that no Wyner-Ziv bits are transmitted for the corresponding band. In the following, the various matrices will be referred to as Qi, with i = 1, …, 8 being the number of matrices, and thus RD points, tested; the higher i is, the higher the bitrate and the quality. These matrices have been defined to provide eight reasonably meaningful RD points, e.g. in bitrate range, but they may easily be changed.
2.2 Slepian-Wolf Coding
The DISCOVER WZ video codec has finally adopted LDPC codes for the Slepian-Wolf part of the WZ codec, after using turbo codes for a long time. The literature states that LDPC codes can better approach the capacity of a variety of communication channels than turbo codes [8].
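To tie the decoder walk-through (steps 8-14) to the LDPC-based Slepian-Wolf coding just mentioned, the following sketch condenses the per-band request loop and the final reconstruction. The helpers encoder_buffers, ldpc_decode and checks_ok are hypothetical placeholders, the Laplacian-based soft inputs and the CRC check are not modelled, and the mapping from decoded bitplanes back to coefficient values (the reconstruction rule of step 13) is glossed over.

```python
import numpy as np
from scipy.fft import idctn

def decode_wz_frame(side_info_bands, encoder_buffers, coded_band_indices,
                    ldpc_decode, checks_ok, height, width):
    """Condensed steps 8-14: per-band LDPC decoding driven by feedback-channel
    requests, then reconstruction with side-information fallback for the bands
    that carry no WZ bits. side_info_bands has shape (16, n_blocks)."""
    rec_bands = side_info_bands.astype(float).copy()     # default: side information
    for k in coded_band_indices:                          # bands with WZ bits
        soft_input = rec_bands[k]                         # stand-in for Laplacian-based soft values
        chunks = []
        while True:                                       # steps 8-9: request loop
            chunks.append(encoder_buffers[k].next_chunk())
            decoded = ldpc_decode(soft_input, chunks)
            if checks_ok(decoded, chunks):                # parity-check based stop rule
                break
        rec_bands[k] = decoded                            # steps 12-13: decoded band replaces SI band
    # step 14: 4x4 inverse DCT back to the pixel domain
    blocks = rec_bands.T.reshape(height // 4, width // 4, 4, 4)
    frame = idctn(blocks, axes=(2, 3), norm='ortho').swapaxes(1, 2).reshape(height, width)
    return np.clip(np.rint(frame), 0, 255).astype(np.uint8)
```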
Fig. 2. Eight quantization matrices associated to different RD performances (and qualities)
The DISCOVER codec uses an LDPC Accumulate (LDPCA) code, which consists of an LDPC syndrome former concatenated with an accumulator [8]. For each bitplane, syndrome bits are created using the LDPC code and accumulated modulo 2 to produce the accumulated syndrome. The Wyner-Ziv encoder buffers these accumulated syndromes and transmits them to the decoder in chunks, upon decoder request. Previously, the DISCOVER codec had used turbo codes. The turbo encoder comprised a parallel concatenation of two identical constituent recursive systematic convolutional (RSC) encoders of rate ½, and a pseudo-random L-bit interleaver was employed to decorrelate the L-bit input sequence between the two RSC encoders. The Slepian-Wolf decoder comprised an iterative turbo decoder constituted by two soft-input soft-output (SISO) decoders. Each SISO decoder was implemented using the Logarithmic Maximum A Posteriori (Log-MAP) algorithm. A confidence measure based on the a posteriori probabilities ratio is used as error detection criterion to determine the current bitplane error probability Pe of a given DCT band. If Pe is higher than 10^-3, the decoder requests more parity bits from the encoder via the feedback channel; otherwise, the bitplane turbo decoding task is considered successful.
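The LDPCA idea described above (an LDPC syndrome former followed by a modulo-2 accumulator whose output is buffered and sent in chunks on request) can be sketched as follows; the parity-check matrix H is a generic stand-in, not the actual DISCOVER code, and puncturing and transmission-order details are omitted.

```python
import numpy as np

def ldpca_accumulated_syndrome(bitplane, H):
    """Syndrome former + modulo-2 accumulator, as in an LDPCA encoder.
    bitplane: 1-D array of 0/1 values (one bitplane of one DCT band).
    H: binary parity-check matrix of shape (n_syndromes, len(bitplane))."""
    syndrome = H.dot(bitplane) % 2           # s = H x (mod 2)
    accumulated = np.cumsum(syndrome) % 2    # running XOR = modulo-2 accumulation
    return accumulated                       # buffered and sent in chunks upon decoder request

# Illustrative usage with a toy random parity-check matrix:
# H = (np.random.default_rng(0).random((32, 64)) < 0.1).astype(int)
# x = np.random.default_rng(1).integers(0, 2, 64)
# chunks = np.array_split(ldpca_accumulated_syndrome(x, H), 4)
```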
2.3 Frame Interpolation or Side Information Generation
The side information creation process is rather complex and central to the performance of the Wyner-Ziv codec, since it determines how many 'errors' have to be corrected through LDPC (or turbo) parity bits. This process is described in detail in [9]. In this section, the most important control parameters related to the side information creation process are presented. For the frame interpolation in the side information creation process, two block sizes are used: 16×16 and 8×8. The forward motion estimation works with a 16×16 block size and a ±32 pixel search range. In the second iteration, the motion vectors are refined using 8×8 block sizes and an adaptive search range. The motion search is performed with half-pixel precision, and the reference frames are first low-pass filtered with a 3×3 mean filter.
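A heavily reduced, integer-pel version of the frame interpolation just described is sketched below: full-search block matching between the two mean-filtered key frames followed by a crude bidirectional average. The fixed block size, search range and vector-splitting rule, and the absence of the 8×8 refinement, half-pel accuracy and spatial smoothing, are simplifications, not the algorithm of [9].

```python
import numpy as np
from scipy.ndimage import uniform_filter

def interpolate_side_info(prev, nxt, block=16, srange=8):
    """Integer-pel full-search motion estimation between two key frames on
    3x3 mean-filtered copies, then a crude bidirectional average to build the
    side information for the frame halfway between them."""
    pf, nf = uniform_filter(prev.astype(float), 3), uniform_filter(nxt.astype(float), 3)
    h, w = prev.shape
    side_info = np.zeros_like(prev, dtype=float)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            ref = pf[by:by+block, bx:bx+block]
            best, best_mv = np.inf, (0, 0)
            for dy in range(-srange, srange + 1):
                for dx in range(-srange, srange + 1):
                    y, x = by + dy, bx + dx
                    if 0 <= y and y + block <= h and 0 <= x and x + block <= w:
                        cost = np.abs(ref - nf[y:y+block, x:x+block]).sum()
                        if cost < best:
                            best, best_mv = cost, (dy, dx)
            dy, dx = best_mv
            # crude placement: prev block at its own position, next block shifted by half the vector
            fy, fx = by + dy // 2, bx + dx // 2
            side_info[by:by+block, bx:bx+block] = 0.5 * (
                prev[by:by+block, bx:bx+block] + nxt[fy:fy+block, fx:fx+block])
    return side_info.astype(np.uint8)
```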
3 GOP Size Dependent Performance Evaluation
As mentioned before, the main purpose of this paper is to evaluate the RD and complexity performance of a rather advanced WZ video codec as a function of the GOP size. For this purpose, it is essential to define first the performance evaluation conditions. As usual in the WZ literature, only the luminance component is coded and thus all metrics in this paper refer only to the luminance. Although performance results are available for many video sequences, results in this paper will be presented only for two rather different sequences: Foreman (with the Siemens logo) and Coast Guard. For both sequences, 299 frames are coded at QCIF resolution, 15 Hz. The key frames quantization steps have been found using an iterative process which stops when the average quality (PSNR) of the WZ frames is similar to the quality of the Intra frames (H.264/AVC Intra encoded).
3.1 RD Performance
Although many metrics are relevant to evaluate the RD performance, it is recognized that the most used quality metric is the average PSNR (with all the limitations it brings) over all the frames of a sequence coded for a certain quantization matrix. When this PSNR metric is represented as a function of the used bitrate – in this case, the overall bitrate, which includes all WZ and key frames bits for the luminance component – very important performance charts are obtained, since they allow an easy comparison of the overall rate-distortion (RD) performance with other coding solutions, including widely known and used standard coding solutions. In this paper, the RD performance of the DISCOVER codec will be compared with the corresponding performance of three standard coding solutions which share an important property in terms of encoder complexity: the complex and expensive motion estimation task is not performed by any of them. Sections 3.3 and 3.4 will present encoder and decoder complexity evaluations which complement the RD evaluation proposed in this section. The three standard video coding solutions used here for comparison purposes are:
• H.263+ Intra [10] – Coding with H.263+ without exploiting temporal redundancy; this is a rather old codec, although still much used in the literature for comparison with WZ codecs because it is 'easier to beat' than the H.264/AVC Intra codec.
• H.264/AVC Intra [6] – Coding with H.264/AVC in Main profile without exploiting temporal redundancy; this type of Intra coding is among the most efficient Intra (video) coding standard solutions available, even more than JPEG2000 in many cases. However, notice that the H.264/AVC Intra codec exploits the spatial correlation quite efficiently (at a higher complexity cost when compared to H.263+ Intra) with several 4×4 and 16×16 Intra modes, a feature that is also (still) missing in the DISCOVER WZ codec.
• H.264/AVC Inter No Motion [6] – Coding with H.264/AVC in Main profile exploiting temporal redundancy in an IB… structure but without performing any
motion estimation which is the most computationally expensive encoding task. The so-called “no motion” mode achieves better performance than Intra coding because it can partly exploit temporal redundancy but it requires far less complexity than full motion compensated Inter coding since no motion search is performed. This type of comparison (excluding encoder motion estimation as in WZ coding) is not typically provided in most WZ published papers because its RD performance is still rather difficult to ‘beat’ with WZ coding solutions.
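Each RD point discussed in this section pairs the average luminance PSNR over all frames with the overall (key plus WZ) luminance bitrate; a small helper of the kind that produces such a point is sketched below, with the frame rate and function naming chosen here purely for illustration.

```python
import numpy as np

def rd_point(original_frames, decoded_frames, total_bits, fps=15.0):
    """One RD point as used in Section 3.1: average luminance PSNR over all
    decoded frames versus the overall bitrate (key + WZ bits) in kbps."""
    psnrs = []
    for orig, dec in zip(original_frames, decoded_frames):
        mse = np.mean((orig.astype(float) - dec.astype(float)) ** 2)
        psnrs.append(10.0 * np.log10(255.0 ** 2 / mse) if mse > 0 else np.inf)
    seconds = len(original_frames) / fps
    return total_bits / seconds / 1000.0, float(np.mean(psnrs))   # (kbps, dB)
```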
Fig. 3. RD performance for GOP 2: Foreman, Coast Guard and Hall Monitor (PSNR [dB] versus rate [kbps] for DISCOVER, H.263+ Intra, H.264/AVC Intra and H.264/AVC No Motion)
Fig. 4. RD performance comparison between GOP 2, 4 and 8: Foreman, Coast Guard and Hall Monitor (PSNR [dB] versus rate [kbps] for the LDPC-based codec with GOP sizes 2, 4 and 8)
While Fig. 3 shows the RD performance for the various video codecs tested, Fig. 4 shows a RD performance comparison for various GOP sizes, notably 2, 4 and 8. The main conclusions that can be drawn from these charts are:
• For the sequences tested here (and many others), the WZ codec at GOP 2 can always beat the H.263+ Intra RD performance. The same happens for H.264/AVC Intra, with the DISCOVER WZ codec always beating or equaling the H.264/AVC Intra RD performance (although the H.264/AVC Intra encoding complexity is much higher, as will be shown later). For the Coast Guard sequence, the DISCOVER WZ codec can even beat the H.264/AVC No Motion RD performance, which may be explained by the fact that the WZ codec performs motion estimation at the decoder while the H.264/AVC No Motion codec does not do it at the encoder.
• For the selected test sequences, the highest RD performance always happens for GOP 2, showing the difficulty in getting good side information for longer GOP sizes due to the decreased quality of the frame interpolation (key frames are further away). However, Fig. 4 also shows RD performance results for the Hall Monitor sequence, which is a rather stable sequence, where GOP 4 is already more efficient than GOP 2 since motion estimation and frame interpolation are now more reliable.
Fig. 5. RD performance comparison for LDPC versus turbo codes for various GOP sizes: Foreman and Coast Guard (PSNR [dB] versus rate [kbps] for LDPC and turbo codes with GOP sizes 2, 4 and 8)
Fig. 6. Encoder complexity measured in terms of encoding time for GOP 2: Foreman sequence (encoding time in seconds versus quantization matrix Qi, split into DISCOVER WZ frames, DISCOVER key frames, H.264/AVC Intra and H.264/AVC No Motion)
3.3 Encoding Complexity Performance In this paper, the encoding (and decoding) complexity will be measured by means of the encoding (and decoding) time for the full sequence, in seconds, under
controlled conditions. For the present results, the hardware used was an x86 machine with a dual-core Pentium D processor at 3.4 GHz, with 2048 MB of RAM. Regarding the software conditions, the results were obtained on a Windows XP operating system, with the C++ code compiled with version 8.0 of the Visual Studio C++ compiler with optimization parameters on, such as the release mode and speed optimizations. Since the encoding (and decoding) time results are highly dependent on the hardware and software platforms used, they have a relative and comparative value, in this case allowing a comparison of the DISCOVER codec with alternative solutions, e.g. H.264/AVC based, running under the same hardware and software conditions.

Table 1. Encoding time (full sequence, in seconds) comparison for the Foreman sequence

QP | H.264/AVC Intra | H.264/AVC No Motion (GOP 2 / 4 / 8) | DISCOVER, LDPC (GOP 2 / 4 / 8) | DISCOVER, turbo codes (GOP 2 / 4 / 8) | Ratio H.264/AVC Intra vs DISCOVER LDPC (GOP 2 / 4 / 8)
40 | 32.67 | 33.14 / 33.64 / 32.97 | 19.44 / 11.58 / 8.02 | 19.14 / 11.26 / 7.58 | 1.68 / 2.82 / 4.08
39 | 33.30 | 33.81 / 34.41 / 33.72 | 19.75 / 11.80 / 8.09 | 19.47 / 11.42 / 7.69 | 1.69 / 2.82 / 4.11
38 | 33.78 | 34.34 / 34.88 / 34.15 | 20.05 / 11.95 / 8.22 | 19.73 / 11.57 / 7.79 | 1.69 / 2.83 / 4.11
34 | 37.11 | 37.60 / 38.21 / 37.34 | 21.67 / 12.94 / 8.79 | 21.39 / 12.48 / 8.33 | 1.71 / 2.87 / 4.22
34 | 37.13 | 37.61 / 38.29 / 37.34 | 21.68 / 13.00 / 8.80 | 21.39 / 12.53 / 8.34 | 1.71 / 2.86 / 4.22
32 | 38.94 | 39.55 / 40.18 / 39.23 | 22.78 / 13.66 / 9.37 | 22.38 / 13.14 / 8.79 | 1.71 / 2.85 / 4.15
29 | 42.33 | 43.11 / 43.73 / 42.72 | 24.49 / 14.58 / 9.94 | 24.13 / 14.07 / 9.34 | 1.73 / 2.90 / 4.26
25 | 48.56 | 48.89 / 49.54 / 48.17 | 27.72 / 16.61 / 11.23 | 27.36 / 15.98 / 10.55 | 1.75 / 2.92 / 4.32
Table 2. Encoding time (full sequence, in seconds) comparison for the Coast Guard sequence

QP | H.264/AVC Intra | H.264/AVC No Motion (GOP 2 / 4 / 8) | DISCOVER, LDPC (GOP 2 / 4 / 8) | DISCOVER, turbo codes (GOP 2 / 4 / 8) | Ratio H.264/AVC Intra vs DISCOVER LDPC (GOP 2 / 4 / 8)
37 | 39.30 | 39.27 / 39.55 / 38.58 | 23.05 / 13.71 / 9.19 | 22.85 / 13.21 / 8.62 | 1.70 / 2.87 / 4.28
36 | 39.70 | 40.00 / 40.25 / 39.42 | 23.31 / 13.86 / 9.41 | 23.15 / 13.37 / 8.81 | 1.70 / 2.87 / 4.22
36 | 39.73 | 40.08 / 40.26 / 39.44 | 23.34 / 13.86 / 9.42 | 23.17 / 13.38 / 8.82 | 1.70 / 2.87 / 4.22
33 | 42.22 | 42.66 / 43.06 / 42.17 | 24.65 / 14.72 / 9.95 | 24.54 / 14.22 / 9.39 | 1.71 / 2.87 / 4.24
33 | 42.29 | 42.77 / 43.16 / 42.20 | 24.73 / 14.72 / 9.97 | 24.57 / 14.23 / 9.41 | 1.71 / 2.87 / 4.24
31 | 44.30 | 44.83 / 44.97 / 44.02 | 25.89 / 15.52 / 10.50 | 25.82 / 14.90 / 9.83 | 1.71 / 2.85 / 4.22
29 | 46.22 | 46.91 / 47.41 / 46.38 | 26.96 / 16.12 / 10.94 | 26.71 / 15.50 / 10.28 | 1.71 / 2.87 / 4.23
24 | 52.58 | 53.50 / 54.09 / 53.09 | 30.32 / 18.17 / 12.50 | 29.97 / 17.49 / 11.59 | 1.73 / 2.89 / 4.21
Fig. 6, Table 1 and Table 2 show the encoder complexity results for GOP 2, 4 and 8, measured in terms of encoding time, distinguishing between key frame (blue) and WZ frame (red) encoding times. The results allow concluding that:
• For the DISCOVER video codec, the WZ encoding complexity is negligible when compared to the key frames coding complexity, even for GOP 2. For longer GOP sizes, the overall encoding complexity decreases with the increase of the share of WZ frames regarding the key frames; in this case, the number of key frames decreases, although their encoding complexity is still the dominating part.
• The DISCOVER encoding complexity is always much lower than the H.264/AVC encoding complexity, both for the H.264/AVC Intra and the H.264/AVC No Motion solutions. While the H.264/AVC Intra encoding complexity does not vary with the GOP size, and the H.264/AVC No Motion encoding complexity is also rather stable with a varying GOP size, the DISCOVER encoding complexity decreases with the GOP size. If encoding complexity is a critical requirement, the results in this section, together with the RD performance results previously shown, indicate that the DISCOVER monoview codec with GOP 2 is already a credible practical solution, since it has a rather low complexity and 'beats' H.264/AVC Intra in terms of RD performance in most cases. This is a rather important result.
• Another important result is that the WZ encoding complexity does not increase significantly when Qi increases (i.e. when the bitrate increases); on the other hand, for H.264/AVC Intra and H.264/AVC No Motion, the complexity increases when higher bitrates are targeted.
• The encoding complexity is rather similar for the LDPC and turbo coding alternatives for all GOP sizes.
3.4 Decoding Complexity Performance
Table 3 and Table 4 show the decoder complexity results for GOP 2, 4 and 8, measured in terms of decoding time. The results allow concluding that:
• For the DISCOVER video codec, the key frame decoding complexity is negligible regarding the WZ frames decoding complexity, even for GOP 2 (when there are more key frames). This confirms the well-known WZ coding trade-off where the encoding complexity benefits are paid in terms of decoding complexity. Contrary to the encoding complexity, the longer the GOP size, the higher the overall decoding complexity, since the number of WZ frames is higher.
• The DISCOVER decoding complexity is always much higher than the H.264/AVC decoding complexity, both for the H.264/AVC Intra and the H.264/AVC No Motion solutions. While the H.264/AVC Intra decoding complexity does not vary with the GOP size, and the H.264/AVC No Motion decoding complexity is also rather stable with a varying GOP size, the DISCOVER decoding complexity increases with the GOP size.
• The WZ decoding complexity increases significantly when Qi increases (i.e. when the bitrate increases), since the number of bitplanes to LDPC (or turbo) decode is higher and the LDPC (or turbo) decoder (and the number of times it runs) is the main contributor to the decoding complexity.
• Regarding the decoding complexity comparison between LDPC and turbo codes, the results suggest that while LDPC codes win for quieter sequences, e.g. Coast Guard (and Hall Monitor) for GOP 2, turbo codes win for sequences with more motion, e.g. Foreman (and Soccer).
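The encoding and decoding times in Tables 1-4 are wall-clock measurements whose value is relative to the platform described in Section 3.3; a trivial timing wrapper of the kind that could produce them is sketched below, where encode_sequence and decode_sequence are hypothetical placeholders.

```python
import time

def timed_run(fn, *args, **kwargs):
    """Wall-clock timing as used for Tables 1-4 (relative values only;
    absolute numbers depend on the hardware/software platform)."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - t0

# Hypothetical usage:
# _, enc_seconds = timed_run(encode_sequence, frames, gop_size=2)
# _, dec_seconds = timed_run(decode_sequence, bitstream)
```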
Table 3. Decoding time (full sequence, in seconds) comparison for the Foreman sequence

QP | H.264/AVC Intra | H.264/AVC No Motion (GOP 2 / 4 / 8) | DISCOVER, LDPC (GOP 2 / 4 / 8) | DISCOVER, turbo codes (GOP 2 / 4 / 8)
40 | 1.55 | 1.53 / 1.53 / 1.50 | 664.06 / 1150.11 / 1486.70 | 590.47 / 983.92 / 1237.19
39 | 1.55 | 1.55 / 1.53 / 1.52 | 729.45 / 1237.47 / 1605.34 | 680.80 / 1124.59 / 1402.70
38 | 1.58 | 1.55 / 1.58 / 1.53 | 848.45 / 1482.45 / 1904.23 | 768.25 / 1280.75 / 1606.08
34 | 1.64 | 1.66 / 1.66 / 1.64 | 1362.45 / 2536.06 / 3293.84 | 1219.88 / 2105.47 / 2663.94
34 | 1.66 | 1.67 / 1.69 / 1.64 | 1541.00 / 2824.58 / 3641.94 | 1346.77 / 2319.48 / 2930.91
32 | 1.72 | 1.73 / 1.77 / 1.70 | 2041.53 / 3640.48 / 4586.58 | 1765.06 / 3030.48 / 3808.64
29 | 1.83 | 1.83 / 1.84 / 1.81 | 2352.56 / 4273.28 / 5551.36 | 2148.23 / 3738.67 / 4728.45
25 | 1.92 | 1.97 / 1.98 / 1.94 | 3254.92 / 5901.63 / 7640.66 | 3207.05 / 5535.03 / 6974.56
Table 4. Decoding time (full sequence, in seconds) comparison for the Coast Guard sequence

QP | H.264/AVC Intra | H.264/AVC No Motion (GOP 2 / 4 / 8) | DISCOVER, LDPC (GOP 2 / 4 / 8) | DISCOVER, turbo codes (GOP 2 / 4 / 8)
38 | 1.55 | 1.56 / 1.56 / 1.53 | 430.27 / 709.75 / 986.66 | 440.89 / 691.36 / 890.94
37 | 1.58 | 1.61 / 1.56 / 1.56 | 485.89 / 776.86 / 1048.20 | 512.61 / 796.70 / 994.88
37 | 1.61 | 1.64 / 1.58 / 1.57 | 531.17 / 894.77 / 1226.23 | 579.48 / 908.95 / 1150.61
34 | 1.64 | 1.66 / 1.70 / 1.66 | 796.69 / 1525.11 / 2168.36 | 874.58 / 1463.97 / 1885.91
33 | 1.70 | 1.75 / 1.77 / 1.69 | 824.23 / 1628.22 / 2329.80 | 931.80 / 1564.95 / 2024.33
31 | 1.72 | 1.81 / 1.78 / 1.80 | 1144.45 / 2254.47 / 3193.66 | 1229.36 / 2093.36 / 2708.59
30 | 1.75 | 1.88 / 1.81 / 1.81 | 1497.25 / 2802.48 / 3856.28 | 1566.00 / 2668.41 / 3375.45
26 | 1.92 | 2.05 / 2.03 / 1.94 | 2461.73 / 4462.38 / 5872.11 | 2500.95 / 4309.33 / 5428.16
4 Final Remarks
This paper presents a detailed evaluation of the RD and complexity performance of an advanced feedback channel-based Wyner-Ziv video codec as a function of the GOP size. While longer GOP sizes increase the theoretical RD performance upper bounds, as shown by conventional video coding, due to the more intensive exploitation of the temporal redundancy (even if at the cost of higher encoder complexity), this is still not happening for the tested WZ video codec: the longer the GOP size, the more difficult and less reliable the motion estimation process at the decoder becomes, reducing the quality of the side information estimate. This fact calls for the development of more sophisticated frame interpolation methods at the decoder, eventually combined with having the encoder send to the decoder some auxiliary data for the more motion-critical parts of the frame, e.g. local hash codes [11], even at the cost of some encoder complexity. It is, however, important to stress that WZ video coding already proposes competitive solutions for application scenarios where encoding complexity is the main critical requirement, since it provides the best RD performance for the lowest encoder complexity in comparison with the standard alternative solutions.
Acknowledgments. The work presented here was developed within DISCOVER, a European Project (http://www.discoverdvc.org), funded under the European Commission IST FP6 programme.
References 1. Girod, B., Aaron, A., Rane, S., Monedero, D.R.: Distributed Video Coding. Proceedings of the IEEE 93(1), 71–83 (2005) 2. Puri, R., Ramchandran, K.: PRISM: A New Robust Video Coding Architecture Based on Distributed Compression Principles. In: 40th Allerton Conference on Communication, Control and Computing, Allerton, USA (2002) 3. Aaron, A., Zhang, R., Girod, B.: Wyner-Ziv Coding of Motion Video. In: Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA (2002) 4. Brites, C., Ascenso, J., Pereira, F.: Improving Transform Domain Wyner-Ziv Video Coding Performance. In: International Conference on Acoustics, Speech, and Signal Processing, Toulouse, France (2006) 5. Artigas, X., Ascenso, J., Dalai, M., Klomp, S., Kubasov, D., Ouaret, M.: The DISCOVER Codec: Architecture, Techniques and Evaluation. In: Picture Coding Symposium, Lisbon, Portugal (2007) 6. ISO/IEC, 14496-10:2003: Coding of Audio-Visual Objects - Part 10: Advanced Video Coding, 1st Edition (also ITU-T: 2003, H.264) (2003) 7. Ascenso, J., Brites, C., Pereira, F.: Content adaptive Wyner-Ziv Video Coding Driven by Motion Activity. In: Int. Conf. on Image Processing. Atlanta - USA (2006) 8. Varodayan, D., Aaron, A., Girod, B.: Rate-Adaptive Codes for Distributed Source Coding. EURASIP Signal Processing Journal. Special Issue on Distributed Source Coding 86(11), 3123–3130 (2006) 9. Ascenso, J., Brites, C., Pereira, F.: Improving Frame Interpolation with Spatial Motion Smoothing for Pixel Domain Distributed Video Coding. In: 5th EURASIP Conf. on Speech, Image Processing, Multimedia Communications and Services, Smolenice, Slovak Republic (2005) 10. ITU-T: 1998, H.263+ Video Coding for Low Bitrate Communication. ITU-T Recommendation H.263, Version 2 (1998) 11. Ascenso, J., Pereira, F.: Adaptive Hash-based Side Information Exploitation for Efficient Wyner-Ziv Video Coding. In: Int. Conf. on Image Processing, ICIP 2007, San Antonio, TX, USA (2007)
Wyner-Ziv Video Coding with Side Matching for Improved Side Information* Bonghyuck Ko, Hiuk Jae Shim, and Byeungwoo Jeon Department of Electronic and Electrical Engineering, Sungkyunkwan University 300 Chunchun-dong, Jangan-gu, Suwon, 440-746, Korea {[email protected], [email protected], [email protected]}
Abstract. To make an encoder extremely simple by eliminating motion prediction/compensation from the encoder, source coding with side information has been investigated, based on the Wyner-Ziv theorem as the basic coding principle. However, the frame interpolation at the decoder, which is essential for redundancy elimination, produces erroneous side information when the basic assumption of linear motion between frames is not satisfied. In this paper, we propose a new Wyner-Ziv video coding scheme featuring side matching in the frame interpolation to improve the side information. In the proposed scheme, the Wyner-Ziv decoder compensates wrong blocks of the side information using side matching and bi-directional searching. The noise reduction in the side information allows the proposed algorithm to achieve coding improvements not only in bitrate but also in PSNR. Results of our experiments show PSNR improvements of up to 0.4 dB. Keywords: Distributed video coding, Wyner-Ziv coding, Side matching.
1 Introduction
In conventional video coding such as MPEG-1/2/4 and H.26x, the complexity of the encoder is much higher than that of the decoder. This is because the temporal redundancy among frames is removed at the encoder using a motion estimation and compensation (ME/MC) process. The ME process is computationally very intensive and also power-consuming; therefore, in many applications where saving power and computation is a premium factor, such as sensor networks, an encoding algorithm with only low complexity is essential. It has been known that distributed source coding (DSC) [1,2] can provide a clue to solutions to this problem. The Slepian-Wolf [1] and the Wyner-Ziv theorems [2] state that it is possible to encode correlated sources independently while still achieving optimum compression performance, as long as decoding is performed jointly. In the context of video coding, this implies that all the processing at an encoder to exploit temporal and spatial correlation in the video frames can be shifted to the decoder side. It makes a video encoder utterly simple at the expense of increased complexity at the decoder. It is just the opposite design principle from traditional broadcasting. Lately, Girod et al. proposed a Wyner-Ziv coding scheme based on these theorems in which the encoder performs turbo encoding and the decoder generates side information reflecting the temporal correlation of frames [3]. The decoder also performs turbo decoding to reduce the noise between the original frame and its side information.
* This work was supported by the Korea Science and Engineering Foundation (KOSEF) through the National Research Lab. Program funded by the Ministry of Science and Technology (No. M10600000286-06J0000-28610).
Fig. 1. Proposed Wyner-Ziv Coding Scheme with Side Matching
Basically, this scheme is highly dependent on the accuracy of the side information, since the decoder extracts temporal redundancy based on the side information, which is generated by frame interpolation using key frames. Note that the frame interpolation assumes linear motion between frames. Therefore, if there is non-linear motion or occlusion, the overall coding performance is bound to decrease. In this paper, we propose a solution to this problem. The remainder of this paper is organized as follows. Section 2 presents the coding process of the previous Wyner-Ziv coding [3,4]. In Section 3, we describe the proposed method. Section 4 shows experimental results. Conclusions and future work are drawn in Section 5.
2 Wyner-Ziv Video Coding
Fig. 1 illustrates the proposed architecture of the PDWZ (Pixel Domain Wyner-Ziv) coding. It consists of an M-bit quantizer, a turbo encoder/decoder based on Slepian-Wolf coding, a frame interpolation module generating the side information, a reconstruction module, and a side matching module, which is newly proposed in this paper.
2.1 Encoding Process
The overall encoding process is as follows. First of all, all frames to be coded are organized into two categories - the so-called "key frames" and "Wyner-Ziv frames." As in the previously proposed pixel domain Wyner-Ziv coder [3,4], we take the same simple
approach of designating odd frames as key frames and even frames as Wyner-Ziv frames. The key frames are encoded as conventional intra frames, e.g. using H.26x, and then sent to the decoder. At the decoder, these key frames are assumed to be perfectly reconstructed, as in the previously proposed pixel domain Wyner-Ziv coding [3,4]. As for the Wyner-Ziv frames between key frames, the encoder performs turbo encoding. The turbo encoder consists of two recursive systematic convolutional (RSC) encoders of rate 1/2 and an interleaver. Quantized symbol streams from the M-bit quantizer are fed into these two RSC encoders. Input streams to the second RSC encoder are interleaved. After this processing, the parity data produced by the turbo encoder is stored in a buffer, and the systematic data from the turbo encoding are discarded. Once the parity data is generated, puncturing is additionally applied before the parity is transmitted, and the turbo encoder sends the data to the decoder according to the requests from the decoder. As already mentioned above, this type of encoder is extremely lightweight in terms of complexity, since the Wyner-Ziv encoder performs only intra coding, without referring to other frames as in inter coding.
2.2 Decoding Process
The decoding process is as follows. First, the decoder reconstructs the key frames as received from the encoder. By utilizing the reconstructed key frames, the decoder generates the side information. Typically, this is done through frame interpolation assuming linear motion between key frames. After this step, the side information is used by both the turbo decoder and the reconstruction module. Our turbo decoder is based on the MAP (Maximum A Posteriori) algorithm and, unlike the bit-level operation in the previously proposed schemes [3,4], it operates at symbol level (the so-called hyper-trellis turbo code [5]). In this processing, the side information is interpreted as a noisy version of the corresponding original Wyner-Ziv frame and used to decode the quantized symbol stream Qt by referring to the parity data from the encoder. A decision module in the turbo decoder calculates the symbol error rate and, based upon this value, if required, it repeats requesting parity bits until the calculated symbol error rate (Pe) falls below a threshold (for example, Pe < 10^-3). Once the threshold symbol error rate is achieved, the turbo decoder sends the decoded symbol stream to the reconstruction module. The reconstruction module reconstructs the Wyner-Ziv frame using this quantized symbol stream and the side information. The reconstruction rule is the same as in the previously proposed PDWZ [3,4].
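One constituent encoder of the turbo code described above can be sketched as a rate-1/2 recursive systematic convolutional (RSC) encoder; the default taps below correspond to the generator [1, (1+D+D^3+D^4)/(1+D^3+D^4)] quoted later in Section 4, while the interleaver, the second RSC encoder and the exact puncturing pattern are omitted.

```python
def rsc_encode(bits, feedback=(0, 0, 1, 1), feedforward=(1, 0, 1, 1)):
    """One rate-1/2 recursive systematic convolutional (RSC) encoder, memory 4.
    Default taps follow the generator [1, (1+D+D^3+D^4)/(1+D^3+D^4)] used in
    the experiments (Section 4); taps are listed for delays D^1..D^4.
    Returns (systematic_bits, parity_bits); in the WZ encoder the systematic
    part is discarded and the parity part is punctured and buffered."""
    state = [0, 0, 0, 0]                      # state[0] = most recent register
    systematic, parity = [], []
    for u in bits:
        # feedback polynomial 1 + D^3 + D^4
        fb = u ^ (state[0] & feedback[0]) ^ (state[1] & feedback[1]) \
               ^ (state[2] & feedback[2]) ^ (state[3] & feedback[3])
        # feedforward polynomial 1 + D + D^3 + D^4 (the leading 1 acts on fb)
        p = fb ^ (state[0] & feedforward[0]) ^ (state[1] & feedforward[1]) \
               ^ (state[2] & feedforward[2]) ^ (state[3] & feedforward[3])
        systematic.append(u)
        parity.append(p)
        state = [fb] + state[:3]              # shift the register
    return systematic, parity
```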
3 Proposed Wyner-Ziv Video Coding with Side Matching
Our proposed Wyner-Ziv coding differs in two aspects from the previously proposed PDWZ [3,4]. First, the side matching module is added to the side information generator. Second, we use hyper-trellis turbo coding [5] instead of bit-plane-based turbo coding. We will discuss what differences these two make in the coding performance, especially in subjective quality.
3.1 Importance of Good Side Information In general, the Wyner-Ziv decoder performs three key functions: side information generation, turbo decoding, and reconstruction. Among them, the side information generation process is the most important in improving coding performance.
Fig. 2. Original 76th frame (left) and side information generated (right) in decoder (assumption of linear motion is satisfied)
Fig. 3. Original 180th frame (left) and side information generated (right) in decoder (assumption of linear motion is not satisfied)
One can regard the turbo decoding as a process which corrects, using parity bits, the virtual channel noise originating from the side information generation process, and the reconstruction as a process which reduces the ambiguity in the reconstructed frame using the correlation between the Wyner-Ziv frame and the side information. Therefore, the closer the side information is to the original image, i.e., the smaller the virtual channel noise, the fewer parity bits are required and the better the quality of the reconstructed image. Usually, the Wyner-Ziv decoder creates side information by frame interpolation assuming linear motion between key frames. Frame interpolation works well when the motion between frames is well characterized by zero or linear motion. On the other hand, if there is non-linear motion or occlusion between frames, the decoder is bound to generate wrong blocks as side information, as shown in Figures 2 and 3.
Although this type of error is desired not to occur frequently, once it occurs, it is detrimental to the overall coding performance, and unfortunately even more so to the subjective quality. It usually looks like burst noise. Moreover, it increases the number of requests for parity bits during the turbo decoding process as well. Unlike the assumption of stationary processing in turbo decoding, the real virtual channel noise is non-stationary. Several studies have addressed this problem [6,7]. In this paper, however, we focus on reducing this type of non-stationary error rather than developing a new channel noise model as in [6,7]. Note that although there is a report of a scheme to improve side information [8], our scheme is distinct from it in several aspects. First, our scheme works well with low M (which equals the quantized symbol length) compared to the scheme in [8]. Usually the probability of having such burst-like noise in the MSB is very low; therefore it is difficult for the scheme in [8] to decide whether errors exist or not. Second, the previous scheme [8] operates at bit level; on the other hand, our proposed scheme can operate not only at bit level but also at symbol level. Thirdly, the coding performance of our scheme is less affected by the channel coding or reconstruction module, whereas the opposite holds for the scheme in [8], because our scheme is a kind of pre-processing while [8] is a kind of post-processing.
3.2 Side Matching
The degradation of the subjective quality of the video reconstructed by the decoder is caused by wrong blocks produced by the frame interpolation under non-linear motion or occlusion. In this case, there are large discrepancies between the sides of the block at corresponding positions in the key frames and in the side information. To quantify the differences, we calculate the MAD (Mean Absolute Difference) between the sides of the block in the side information, St, and the average, (St-1 + St+1)/2, of the corresponding sides in the key frames indicated by the motion vectors obtained in the frame interpolation process, as in eq. (1):
α = MAD{St, (St-1 + St+1)/2} .   (1)
Fig. 4. Proposed Side Matching for Wyner-Ziv Coding
If the calculated MAD α in eq. (1) exceeds a predefined threshold ε, the side matching module classifies the block as a wrong one. The threshold value ε is determined experimentally.
Fig. 5. Wrong blocks in side information are marked using the proposed scheme
If α > ε, then Bt = Bi with i = argmin{Mk | k = 1, 2, 3},   (2)
where M1 = MAD{St, S't-1} (forward direction), M2 = MAD{St, S't+1} (backward direction) and M3 = MAD{St, (S't-1 + S't+1)/2} (bi-direction); B denotes a block, S its sides and S' the searched sides.
For the blocks determined as wrong ones, the decoder generates side information again using a bi-directional motion search. For this, three candidate blocks are generated with forward searching (B1), backward searching (B2) and bi-directional searching (B3) [9]. Among these three, the proposed decoder selects one for the corresponding block of the side information, as in eq. (2). By choosing the best one among the forward, backward and bi-directional searches, the decoder can estimate non-linear motion between frames better than before; thus the total number of wrong blocks generated by non-linear motion and occlusion is expected to be smaller than with the conventional side information generator based on frame interpolation assuming linear motion between key frames. Figures 5 and 6 show results of our simulation, in which the decoder corrects such errors in many cases.
Fig. 6. Original side information of 180th frame (left) and improved one by the proposed scheme (right)
3.3 Hyper-trellis Turbo Coding
The turbo decoder in the Wyner-Ziv coding corrects the differences between the original Wyner-Ziv frame and the side information inflicted by the virtual channel noise. It decodes the original quantized symbol stream using parity data from the encoder with its own powerful error correction capability. In turbo coding, the virtual channel noise can be taken care of in two different ways: one is a symbol-level approach, and the other a bit-level approach. Although some degree of correlation still exists between the bits constituting a symbol, the correlation inherently belongs to the symbol level. In this respect, the independence assumption of bit planes taken by the previously proposed bit-plane-based turbo coding has a problem in calculating the channel likelihood [5]. On the contrary, the hyper-trellis turbo coder treats several combined bit transitions corresponding to one symbol length as one symbol transition; thus the metric data for a state transition in the trellis is computed for the symbol length corresponding to the combined transition, not for a bit-level transition. Of course, the channel likelihood is calculated at symbol level. Therefore, the hyper-trellis-based turbo coder does not have the approximation problem of its bit-plane-based counterpart.
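Referring back to the side matching of Section 3.2, the detection rule of eq. (1) and the candidate selection of eq. (2) can be condensed as follows; the extraction of block 'sides', the default threshold of 8 and the dictionary of search candidates are illustrative simplifications, and the bi-directional motion search itself is assumed to be available elsewhere.

```python
import numpy as np

def border_pixels(frame, by, bx, size):
    """One-pixel-wide border (the 'sides') of the block at (by, bx)."""
    blk = frame[by:by+size, bx:bx+size].astype(float)
    return np.concatenate([blk[0, :], blk[-1, :], blk[1:-1, 0], blk[1:-1, -1]])

def mad(a, b):
    return float(np.mean(np.abs(a - b)))

def side_match_block(si, key_prev, key_next, by, bx, size, candidates, eps=8.0):
    """Eq. (1): flag the interpolated block as wrong if the MAD between its
    sides and the averaged key-frame sides exceeds eps.
    Eq. (2): among forward (B1), backward (B2) and bi-directional (B3)
    candidates, keep the block whose searched sides match St best.
    'candidates' is a dict {'fwd': (block, sides), 'bwd': ..., 'bi': ...}
    produced by the bi-directional search, which is not reproduced here."""
    s_t = border_pixels(si, by, bx, size)
    s_key = 0.5 * (border_pixels(key_prev, by, bx, size) +
                   border_pixels(key_next, by, bx, size))
    if mad(s_t, s_key) <= eps:                       # block accepted as is
        return si[by:by+size, bx:bx+size]
    costs = {name: mad(s_t, sides) for name, (blk, sides) in candidates.items()}
    best = min(costs, key=costs.get)
    return candidates[best][0]                       # replacement block
```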
4 Experimental Results
In order to evaluate the coding performance of the proposed scheme, the following two configurations are considered:
i) Conventional: Hyper-trellis-based PDWZ with frame interpolation [5]
ii) Proposed: Hyper-trellis-based PDWZ with frame interpolation using the proposed side matching
The test conditions are as follows. For the test sequences, we use the Foreman (the one without the Siemens logo), Mobile, and Stefan sequences in QCIF format (176 x 144 pixels). The Wyner-Ziv frame rate is 15 Hz as in IST-PDWZ [4] - that is, all the even frames are coded as Wyner-Ziv frames. In the experiment, no coding is carried out for the key frames and they are assumed to be perfectly reconstructed by the decoder without any coding error. This is assumed because the coding of the key frames is a separate issue from the Wyner-Ziv frames. In the evaluation of PSNR and bit rate, only those of the Wyner-Ziv frames are considered, since we are interested in evaluating the coding performance of the Wyner-Ziv frames only.
In the experiment, different quantization levels are applied, 2^M ∈ {2, 4, 8, 16}, to obtain four rate-distortion points. For the turbo coding, we use the generator [1, (1 + D + D^3 + D^4)/(1 + D^3 + D^4)]. The puncturing period is 32, and a pseudo-random interleaver is used. For the frame interpolation, the block size is 8×8, the search range is ±16, and the refinement range is ±3. For the side matching, the decision threshold is ε = 8, and the search modes are forward, backward, and bi-directional, as described in Section 3.2.
Fig. 7. RD performance of the proposed scheme with Foreman (frames 200~300) (PSNR [dB] versus rate [kbps], proposed versus conventional)
Fig. 8. RD performance of the proposed scheme with Mobile (frames 200~300) (PSNR [dB] versus rate [kbps], proposed versus conventional)
Fig. 9. RD performance of the proposed scheme with Stefan (frames 200~300) (PSNR [dB] versus rate [kbps], proposed versus conventional)
In the case of the Foreman sequence, the proposed scheme increases PSNR by between 0.09 dB and 0.42 dB. In the minimum case, the bit-rate is reduced by about 12 kbps. Since the LSB parts of a pixel in a Wyner-Ziv frame are not transmitted by the encoder, the decoder inherits the LSB parts of the generated side information. Therefore, in the LSB parts, the reduction of burst noise helps the reconstruction module recover the quantization error, and consequently a higher PSNR is obtained. On the other hand, in the MSB parts, the reduced burst noise makes turbo decoding easier with a smaller number of parity bit requests, and consequently a lower bit-rate is obtained. Better side information in the decoding process is thus shown to increase PSNR and to decrease the bit-rate at the same time in our experiments. Similar results are also observed with the Mobile and Stefan sequences. Table 1 shows the average performance improvements, which lead us to conclude that our proposed scheme is better than the conventional scheme [5]; as mentioned above, performance improvements occur in both PSNR and bit-rate.

Table 1. Average performance improvements of the proposed scheme over the conventional one [5]

Sequence | BDPSNR (dB) | BDBR (%)
Foreman | 0.29 | -12.66
Mobile | 0.18 | -7.12
Stefan | 0.33 | -6.39
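The BDPSNR and BDBR figures in Table 1 are Bjøntegaard delta metrics; since the paper does not spell out their computation, the sketch below shows the commonly used procedure (third-order fit of PSNR against log-rate, integrated over the overlapping interval), which may differ in detail from the exact tool used by the authors.

```python
import numpy as np

def bd_psnr(rate_ref, psnr_ref, rate_test, psnr_test):
    """Generic Bjoentegaard delta PSNR: cubic fit of PSNR vs log10(rate),
    averaged over the overlapping rate interval (positive = test codec gains)."""
    lr_ref, lr_test = np.log10(rate_ref), np.log10(rate_test)
    p_ref, p_test = np.polyfit(lr_ref, psnr_ref, 3), np.polyfit(lr_test, psnr_test, 3)
    lo, hi = max(lr_ref.min(), lr_test.min()), min(lr_ref.max(), lr_test.max())
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    return (int_test - int_ref) / (hi - lo)

def bd_rate(rate_ref, psnr_ref, rate_test, psnr_test):
    """Generic Bjoentegaard delta rate, in percent (negative = bitrate saving)."""
    p_ref = np.polyfit(psnr_ref, np.log10(rate_ref), 3)
    p_test = np.polyfit(psnr_test, np.log10(rate_test), 3)
    lo, hi = max(min(psnr_ref), min(psnr_test)), min(max(psnr_ref), max(psnr_test))
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    return (10 ** ((int_test - int_ref) / (hi - lo)) - 1) * 100.0
```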
5 Conclusion and Future Work In this paper, we proposed using side matching in Wyner-Ziv Coding in order to increase the quality of side information. The proposed method enables us to reduce
burst-type virtual channel noise and to produce more accurate side information. Therefore, the turbo decoder requests fewer parity bits than in the conventional scheme. This means that the total transmission rate is eventually lowered in the decoding process and the reconstructed image obtains more reliable information about the original data from the side information. The proposed scheme works well in many cases; however, we also observed, although they are rare, some cases of wrong blocks still occurring in groups after side matching in the case of extremely fast non-linear motion. As future work, we will develop a more robust scheme which can also operate in such a very noisy circumstance with very fast non-linear motion.
References 1. Slepian, D., Wolf, J.: Noiseless coding of correlated information sources. IEEE Trans. Inform. Theory 19, 471–480 (1973) 2. Wyner, A., Ziv, J.: The rate-distortion function for source coding with side information at the decoder. IEEE Trans. Inform. Theory 22, 1–10 (1976) 3. Aaron, A., Zhang, R., Girod, B.: Wyner-Ziv Coding for Motion Video. In: 36th Asilomar Conference on Signals, Systems and Computers, Pacific Grove Monterey CA, pp. 240–244 (2002) 4. Ascenso, J., Brites, C., Pereira, F.: Improving Frame Interpolation With Spatial Motion Smoothing For Pixel Domain Distributed Video Coding. In: 5th Eurasip Conference on Speech and Image Processing, Multimedia Communications and Services, Smolenice. Slovak Republic (2005) 5. Avudainayagam, A., Shea, J.M., Wu, D.: Hyper-trellis decoding of pixel-domain WynerZiv video coding. In: IEEE GLOBECOM 2005 proceedings (2005) 6. Dalai, M., Leonardi, R., Pereira, F.: Improving turbo coding integration in pixel-domain distributed video coding. In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Toulouse (2006) 7. Westerlakena, R.P., Klein Gunnewiekb, R., Lagendijka, R., Inald, L.: Turbo-Code Based Wyner-Ziv Video Compression. In: Twenty-sixth Symposium on information Theory, Benelux (2005) 8. Ascenso, J., Brites, C., Pereira, F.: Motion compensated refinement for low complexity pixel based distributed video coding. In: IEEE International Conference on Advanced Video and Signal-Based Surveillance, Como, Italia (2005) 9. Wu, S.W., Gersho, A.: Joint Estimation of Forward and Backward Motion Vectors for Interpolative Prediction of Video. IEEE Transactions on Image Processing 3, 684–687 (1994)
On Digital Image Representation by the Delaunay Triangulation Josef Kohout Department of Computer Science and Engineering, University of West Bohemia, Univerzitní 22, 306 14 Plzeň, Czech Republic
[email protected]
Abstract. This paper deals with the transformation of raster grey-scale images into a geometric representation by the Delaunay triangulation. It discusses the influence of image filtering techniques and of methods for evaluating the significance of pixels on the conversion process. Furthermore, it proposes several novel approaches for the compression of the Delaunay triangulation and compares them with existing solutions. Keywords: image coding, Delaunay triangulation, compression.
1 Introduction

A digital image is nowadays usually represented by a rectangular grid of pixels, where each pixel contains a colour value or a grey-scale luminance. In order to minimize storage costs, the grid is very often transferred and stored in a compact form such as the well-known GIF, PNG and JPEG formats. This representation suffers from several disadvantages. First, it cannot be easily processed in its compact form. Next, scaling and rotation operations typically introduce some distortion into the image. Therefore, many researchers have recently focused on alternative geometric representations, such as triangular meshes. Geometric representations are applicable since the pixels of an image can be considered as 3D points in a space in which the x- and y-coordinates are the rows and columns of the image, and the z-coordinate is either the grey level or the colour value. As it would not be very useful to represent an image with N pixels by a triangulation with the same number of vertices, it is usually necessary to find a triangulation that has fewer vertices but still approximates the original image sufficiently well. From this triangulation, the corresponding image can be easily reconstructed by interpolation among the vertices of the mesh. Let us note that, if bilinear interpolation is used, this can be done in real time (especially if graphics adapters are exploited). Existing methods for the conversion of digital images from the traditional raster representation into a geometric representation can be subdivided into three main categories according to the goal they want to achieve. First, there are methods that produce geometric representations that enhance the quality of further image processing. These representations are not compact, as they usually contain as many vertices as the raster. The majority of these methods create the data-dependent
triangulation (DDT), where triangle edges match the edges in the image; these methods differ only in the cost functions used to detect an edge and in the optimisations employed [1], [16], [17]. In the second category, there are methods that produce compact (i.e., only a subset of vertices is kept) but highly imprecise representations. They find their use in applications of non-photorealistic rendering, where details are unwanted because they make the understanding of the information presented by the image more difficult. A typical application of such representations is described in [10]. From the existing methods that belong to this category, let us describe two interesting ones. Prasad et al. [13] proposed a technique that starts with the detection of edges in the input image using the Canny operator. The detected edges are used as constraints for the constrained Delaunay triangulation that is afterwards computed. Every constructed triangle is assigned one colour, computed as the average of the colours of the pixels covered by the triangle. Adjacent triangles with similar colours are merged together, forming a polygon for which a common colour is chosen. The process results in a polygonal representation of the input image. Kreylos et al. [11] describe an interesting approach that starts with the Delaunay triangulation of a randomly chosen subset of vertices, which is successively improved by choosing different vertices governed by a simulated annealing algorithm. A drawback of their approach is that the final triangulation contains a lot of long and narrow triangles that may be difficult to encode efficiently. The approach was later exploited by Cooper et al. [3] for surface reconstruction from a set of images. Instead of picking a random subset for the initial approximation, however, they choose detected important points (typically, corners and edges). The last category consists of methods that attempt to balance the compactness and the quality of the produced geometric representations which, if efficiently encoded, are suitable for storing digital photos. These representations are very often adaptive triangulations that differ in the way they are obtained. In general, we can identify two basic strategies for creating an adaptive triangulation. The first one generates an adaptive triangular mesh by starting with two triangles covering the whole image area and then successively splitting them in order to reduce the approximation error. Alternatively, the algorithm can start with a fine mesh and successively make it coarser until the approximation error rises above the desired tolerance. The question is which triangle should be split or which vertex should be removed in the next step; that this is not a simple task is demonstrated by two straightforward approaches described in [9] and [2], which either do not preserve sharp edges in images well or produce meshes with many vertices. Let us describe some more sophisticated approaches. Starting with two initial triangles and their corresponding approximated image, Rila et al. [14] successively construct the Delaunay triangulation as follows. The vertex at which the approximation is the poorest is inserted into the triangulation, which results in the construction of new triangles. These triangles are interpolated, i.e., a new approximation is obtained, and the next point to be inserted is found. The process stops when the required quality of the approximation is reached. The authors also describe a technique for storing the created mesh.
As the Delaunay triangulation of a set of points is unique, it is necessary to store just the vertex positions and their grey levels. An array of N bits that contains 1 at positions corresponding to the vertices of the constructed triangulation and 0 elsewhere is constructed and
compressed using an RLE (Run Length Encoding) approach. The grey levels are encoded using a fixed-length uniform quantizer of 5 bits. García et al. [8] choose a predefined number of pixels from the image by applying a non-iterative adaptive sampling technique, which detects pixels on edges present in the image, and triangulate the corresponding points of these pixels using the Delaunay triangulation. Afterwards, triangles are further subdivided as long as the error of the approximation does not drop below some threshold. Although the authors were able to achieve better results (in the compression ratio as well as in the quality of the representation) than the authors of the straightforward approaches, the results are, in our opinion, still far from perfect. In the approach described by Galic et al. [7], the vertex with the poorest approximation is found in every step of the algorithm, using the same criteria as Rila et al. [14], and the triangle containing this vertex is split into two new triangles by the height on its hypotenuse. The centre of the hypotenuse becomes an additional vertex of the representation. The advantage of this hierarchical splitting process is that it forms a binary tree structure that can be stored efficiently using just one bit per node. For the encoding of grey levels, the authors use Huffman compression. In their paper, they also discussed various interpolation techniques and finally decided to use edge-enhancing diffusion interpolation for their experiments instead of the commonly used piecewise linear interpolation. More recently, Demaret et al. [4] proposed an algorithm that computes the Delaunay triangulation of all vertices and then successively decimates this triangulation by removing the least significant vertex in every step. A vertex is considered to be the least significant if its removal leads to the approximation of the original image with the smallest mean square error (MSE). The authors were able to achieve a compression ratio comparable with JPEG, with the same or, especially for higher compression ratios, even better quality of the image representation. On the other hand, the proposed algorithm is very time consuming. In our research, we followed the approach by Demaret et al. and investigated various problems related to the transformation of the traditional pixel representation into the representation by the Delaunay triangulation and to the encoding of the produced triangulation. The following text is structured into five sections. The next section gives an overview of the transformation process. In Section 3, several image filtering techniques and their impact on the quality of the final representation are discussed. Section 4 deals with the construction of the Delaunay triangulation itself and compares various cost functions. The encoding of the created triangulation is discussed in Section 5 and the paper is concluded in Section 6.
2 Overview of the Transformation Process

Fig. 1 gives a schematic overview of the complete transformation from the raster representation into a geometric representation stored in a compact form in a bitstream, and of the corresponding reverse transformation from this form back into a raster of pixels. Let us note that, similarly to many other authors, we deal with grey-scale images only, because every component of a colour image can be considered as a grey-scale image and processed individually.
Fig. 1. A schematic view of the transformation processes: the forward transformation (filtering, triangulation, encoding of the raster image into a stream) and the reverse transformation (decoding, interpolation, reverse filtering)
The first step of the forward transformation is a lossless filtering of the input image. The goal of this step is to modify the image in such a way that it may be represented in the requested quality by a triangulation with a lower number of vertices, i.e., by a more compact triangulation. The details about filtering and its impact are given in the next section. Each pixel of the filtered image stands for one vertex with coordinates x and y, which denote the position of this pixel in the image, and a coordinate z, which is its grey level. The Delaunay triangulation of these vertices is constructed. For readers not familiar with the Delaunay triangulation, let us define it as the triangulation such that the circum-circle of any triangle does not contain any of the given vertices in its interior (also called the empty circum-circle criterion). Let us note that the Delaunay triangulation can be computed in O(N) expected time, where N is the number of vertices. We decided to use the Delaunay triangulation mainly because of two of its important properties. First, the Delaunay triangulation maximizes the minimal angle and, therefore, it contains the most equiangular triangles of all triangulations (i.e., it limits the number of overly narrow triangles that may cause problems during the interpolation). Next, if no four points lie on a common circle, then the Delaunay triangulation is unique, i.e., it is sufficient to store the vertices only and the same triangulation can later be recalculated. It is clear that this condition is violated for vertices obtained directly from pixels. This problem can, however, be easily solved by a random subpixel perturbation of the vertices. An initial significance of the vertices is evaluated and all vertices are put into a priority queue ordered by this significance. We use a heap as an efficient implementation of the priority queue because it ensures O(log N) time for every operation. After that, the iterative process starts. In every step, the least significant vertex is taken from the queue and removed from the triangulation, which may lead to a re-evaluation of the significance of some of the remaining vertices. The process stops when the requested quality of the representation is achieved. The details concerning the evaluation of the significance of vertices and the removal of vertices are described in Section 4. After the final Delaunay triangulation is obtained, it may be compressed and stored in a compact form on a disk, from which it may later, in the reverse transformation process, be decompressed. The techniques suitable for the compression of the Delaunay triangulation are discussed in Section 5. In the reverse transformation, each triangle of the Delaunay triangulation is rasterized, i.e., converted into a set of pixels, and shaded (we use the bilinear interpolation of the grey values stored in the vertices of this triangle to get values for the created pixels). The result of this interpolation process is a raster image that is further transformed by the application of the reverse filter into the final reconstructed image.
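The decimation loop just described can be sketched as follows. This is a minimal illustration only; the triangulation object, its remove_vertex() method and the significance() and quality_ok() callbacks are hypothetical placeholders for the operations described in Section 4, not part of the actual implementation.

```python
import heapq
from itertools import count

def decimate(tri, vertices, significance, quality_ok):
    """Repeatedly remove the least significant vertex until the requested
    quality is reached (the iterative process of Section 2)."""
    entry_id = count()            # lazy-deletion ids for outdated heap entries
    latest = {}                   # vertex -> id of its newest heap entry
    heap = []
    for v in vertices:
        latest[v] = next(entry_id)
        heapq.heappush(heap, (significance(v, tri), latest[v], v))

    while heap and not quality_ok(tri):
        sig, eid, v = heapq.heappop(heap)
        if latest.get(v) != eid:
            continue              # an outdated entry; the vertex was re-queued
        del latest[v]
        # Remove v and re-triangulate the star-shaped hole (see Section 4);
        # the call is assumed to return the vertices bounding the hole.
        for q in tri.remove_vertex(v):
            latest[q] = next(entry_id)
            heapq.heappush(heap, (significance(q, tri), latest[q], q))
    return tri
```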
3 Image Filtering

Filtering transforms the image with the goal of improving the compression ratio whilst keeping the requested quality intact. While filtering techniques are thoroughly exploited in various compact raster formats (particularly in PNG), as far as we know there is no paper considering their use for the transformation of the raster representation into a geometric one. In this section, we therefore discuss the importance of filtering techniques in the transformation process and their impact on the quality of the final representation. For an easier understanding of the problem, let us resort to the one-dimensional case. Fig. 2 shows a function that should be approximated by a piecewise linear function with the allowed approximation error ε. In its original form, an approximation that connects the end points of the given function, pa and pb, is not possible because its error is outside the specified tolerance. It is, therefore, necessary to introduce a third vertex, pc, into the approximation to fulfil the criterion. If the input function is, however, filtered using a simple SUB filter (described later), the approximation by pa and pb alone is possible. This means that by the filtering we reduced the number of vertices from three to two, i.e., we improved the compression ratio.
Fig. 2. An edge approximation (dashed line) of the original function (a) and the filtered function (b) (solid lines)
We tried three image filters used in PNG. The first one is the already mentioned SUB filter, which computes differences between neighbouring pixels using the formula:

SUB(x) = IMG(x) − IMG(x − 1),     (1)

where x ranges from zero to the number of pixels in the image minus one and IMG() refers to the grey value of the pixel in the image at the specified position. Unsigned arithmetic modulo 256 is used, so the outputs fit into bytes (e.g., 1 − 2 = 255). For all x < 0, we assume IMG(x) = 0. In order to reverse the effect of the SUB filter after the interpolation, the output is computed as:

IMG(x) = SUB(x) + IMG(x − 1).     (2)
The AVG filter transmits the difference between the value of a pixel and the average of the two neighbouring pixels (left and above) used as a prediction of this value. The formulas for the forward and reverse filter can be written as:

AVG(x) = IMG(x) − (IMG(x − 1) + IMG(x − cx)) / 2,     (3)

IMG(x) = AVG(x) + (IMG(x − 1) + IMG(x − cx)) / 2,
where cx denotes the horizontal size of the image. Like the previous filter, the PAETH filter also transmits the difference between the real value and the predicted value of a pixel; however, the prediction is calculated from the three neighbouring pixels (left, above, upper left) by the algorithm developed by Alan W. Paeth (see http://www.w3.org/TR/PNG/). Despite our expectations, the experiments proved that these filtering techniques are not useful; we obtained even worse results with them than without filtering. Fig. 3 shows images of fruits that were reconstructed from triangulations with 93.4% of the original number of vertices (i.e., only an insignificant number of vertices was removed) when the filtering techniques were applied. Artefacts are clearly visible.
Fig. 3. The artefacts caused by various filtering techniques: a) SUB, b) AVG, c) PAETH
We identified several reasons for such behaviour. The most important is that by filtering we introduce a dependency between pixels and, therefore, if one pixel is reconstructed with an error, this error is distributed over the rest of the pixels, which may cause unexpected artefacts. Let us consider the following example. The SUB filter transforms the pixels [0, 0, 10, 10, 10] into the filtered values [0, 0, 10, 0, 0]. If the 2nd value is not stored and has to be reconstructed, we get the values [0, 5, 10, 0, 0]. The reverse SUB filter propagates the error and gives the pixels [0, 5, 15, 15, 15]. Another problem is that although the filtering flattens the image, it does not create sufficiently large areas of constant value; on the contrary, it introduces a lot of edges into the image. This makes the approximation process difficult, as it leads to a rapid degradation of quality. Having had a bad experience with the filtering, we ceased to use it in our subsequent experiments. Nevertheless, we believe that the idea of filtering the image in the preprocessing step is not bad in general, but one has to come up with a filtering technique in which the filtered values are more independent and thus less liable to errors.
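The error propagation described in this example can be reproduced with a few lines of code. This is an illustrative sketch of the SUB filter of Eqs. (1) and (2) only; the helper names are ours, not part of the codec.

```python
def sub_filter(pixels):
    # Forward SUB filter (Eq. 1): difference to the previous pixel, modulo 256.
    out, prev = [], 0                 # IMG(x) = 0 for x < 0
    for p in pixels:
        out.append((p - prev) % 256)
        prev = p
    return out

def sub_unfilter(filtered):
    # Reverse SUB filter (Eq. 2): running sum modulo 256.
    out, prev = [], 0
    for d in filtered:
        prev = (prev + d) % 256
        out.append(prev)
    return out

pixels = [0, 0, 10, 10, 10]
print(sub_filter(pixels))             # [0, 0, 10, 0, 0]
damaged = [0, 5, 10, 0, 0]            # the 2nd value was lost and interpolated as 5
print(sub_unfilter(damaged))          # [0, 5, 15, 15, 15] -- the error propagates
```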
4 Delaunay Triangulation

After the pixels are transformed into vertices, the Delaunay triangulation of these vertices is computed and this triangulation is successively decimated by the deletion of the least significant vertex from the triangulation in every step. The deletion consists of two steps. First, the fan of triangles sharing the vertex to be deleted is removed from the triangulation, which results in a star-shaped hole. Next, the hole is filled by new triangles fulfilling the Delaunay criterion – see Fig. 4. For the triangulation of the hole, we use the approach by Devillers [5] because of its simplicity and efficiency (it requires O(m·log m) time, where m is the number of vertices forming the hole). It works as follows. For each triple of topologically consecutive vertices qi, qi+1, qi+2 along the boundary of the hole, i.e., for each candidate triangle, a weight computed as a function of the coordinates of qi, qi+1, qi+2 and p is assigned. All candidates are put into a priority queue ordered by their weights. After that, an iterative filling process starts. In every step of this process, the candidate at the head of the queue is taken and the corresponding triangle is constructed. The candidates qi-1, qi, qi+1 and qi+1, qi+2, qi+3 that overlap the newly constructed triangle are changed to qi-1, qi, qi+2 and qi, qi+2, qi+3, respectively, and their weights are recalculated. The process stops when the hole is filled.
Fig. 4. The deletion of the point p from the Delaunay triangulation: a) triangles to be removed, b) the hole, c) Delaunay retriangulation
As for the evaluation of the significance of vertices, we proposed and tested several methods. The simplest one, denoted RND, assigns a random significance to the vertices at the beginning of the decimation process and does not modify it during the process. This means that vertices are removed from the triangulation in a random fashion. A more sophisticated method, called MARR, computes the significance of vertices as the result of the Marr-Hildreth edge detection operator [12], i.e., vertices that form edges in the image are more significant. As in the RND method, the significance is not recalculated during the decimation.
The other three methods start with the computation of the significance of a vertex p as the negative absolute difference between the grey value held by this vertex and the value computed as the distance-weighted average of the grey values of its neighbouring vertices q, i.e., the vertices connected with p by an edge:

−| grey(p) − ( Σq grey(q) · ‖p − q‖ ) / ( Σq ‖p − q‖ ) |.     (4)
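A direct transcription of Eq. (4) might look as follows; the function name and data layout (vertices as (x, y) tuples, a dictionary of grey values, a neighbour list) are our own illustrative assumptions.

```python
import math

def initial_significance(p, neighbours, grey):
    """Initial significance of vertex p (Eq. 4): the negative absolute difference
    between its grey value and the distance-weighted average of the grey values
    of its neighbours (the vertices sharing an edge with p)."""
    weights = [math.dist(p, q) for q in neighbours]
    weighted_avg = sum(grey[q] * w for q, w in zip(neighbours, weights)) / sum(weights)
    return -abs(grey[p] - weighted_avg)
```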
When a vertex is deleted from the triangulation, the significance of all vertices that formed the hole, i.e., those originally connected by an edge with the deleted vertex, is recalculated. While the first method, denoted DISTW, performs the recalculation using the same formula, the second method, called ERRDIST, updates the significance in a way similar to the Floyd-Steinberg error diffusion technique [6]. It modifies the significance of a vertex by adding a fraction of the significance of the deleted vertex determined by the distance between these two vertices. The third method, called TRIMSE, updates the significance as follows. For each newly constructed triangle, it first computes its mean square error (MSE), i.e., the MSE between the grey values of the pixels covered by this triangle in the original image and the corresponding values obtained by the bilinear interpolation of the grey values of the triangle vertices. The significance of a vertex is then recalculated as the sum of the MSE of the triangles that share this vertex. Let us note that this method is, indeed, slower than the previous two. The last method, denoted BRUTE, is based on the brute-force idea of calculating the significance of a vertex as the sum of the MSE of the triangles that would be constructed if this vertex were deleted. To speed up the processing, the initial significance of a vertex is computed simply as the negative squared difference between the grey value held by this vertex and the average of the grey values held by two of its neighbouring vertices (left and right). When a vertex is deleted, the significance of all vertices that formed the hole is recalculated using the described brute-force idea. This method is without any doubt the most time consuming, but we may expect the best results from it. Fig. 5 shows images of Lena reconstructed from Delaunay triangulations with 10 000 vertices (i.e., 96% of the vertices were removed) computed using the different methods for the evaluation of vertex significance. It is not surprising that the methods that take the MSE of triangles into account give better results. On the other hand, the MARR method unexpectedly provides results of low quality. The reason is that there is not a sufficient number of vertices left to represent areas with a smooth change of intensity (e.g., in the face) in good quality, because too many vertices were spent on representing areas with sharp edges in outstanding quality (see the feather). For each of the proposed methods of vertex significance evaluation, we investigated the degradation of the quality of the geometric representation as a function of the number of removed vertices for a set of grey-scale images. The results for three popular images from this set are presented in Fig. 6. Let us note that similar results were also obtained for the remaining tested images. As can be seen from the graphs, the quality of the representation degrades quite quickly until the algorithm removes approximately 25% of the vertices; after that the quality decreases almost linearly at a slow pace until another threshold of about 90% removed vertices is reached. From that moment, the quality rapidly drops. An interesting observation is that in this last period, all methods (including the RND method based on a random selection of vertices to be removed) produce similar results. It means that an application that calls for a triangulation with only a few vertices (e.g., in non-photorealistic rendering) does not need to pay much attention to which method for the evaluation of vertex significance is used.
Fig. 5. The comparison of Lena images reconstructed from triangulations with 10 000 vertices computed using different methods: a) RND, PSNR = 24.26; b) MARR, PSNR = 17.91; c) DISTW, PSNR = 26.84; d) ERRDIST, PSNR = 22.33; e) TRIMSE, PSNR = 27.14; f) BRUTE, PSNR = 30.78
As can further be seen, the quality curve achieved by the slowest BRUTE method is the upper bound for all other curves, while the curve achieved by the simplest (and also fastest) RND method forms the lower bound. All methods except these two produce very similar results despite using different evaluation heuristics. This raises the question of whether it makes sense to try to develop a sophisticated evaluation method, because it is likely that such a method would be time consuming but would not bring a significant improvement.
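The per-triangle error used by the TRIMSE and BRUTE methods above can be sketched as follows. Linear interpolation over the triangle via barycentric coordinates stands in here for the bilinear interpolation mentioned in the text, and the data layout (vertices as (x, y) tuples, a grey-value dictionary, the image as a 2-D array) is an illustrative assumption.

```python
def triangle_mse(tri, grey, image):
    """MSE between the original grey values of the pixels covered by the
    triangle and the values interpolated from its three vertices."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    den = (y1 - y2) * (x0 - x2) + (x2 - x1) * (y0 - y2)
    if den == 0:
        return 0.0                                    # degenerate triangle
    xmin, xmax = int(min(x0, x1, x2)), int(max(x0, x1, x2))
    ymin, ymax = int(min(y0, y1, y2)), int(max(y0, y1, y2))
    err, n = 0.0, 0
    for y in range(ymin, ymax + 1):
        for x in range(xmin, xmax + 1):
            # barycentric coordinates of the pixel
            l0 = ((y1 - y2) * (x - x2) + (x2 - x1) * (y - y2)) / den
            l1 = ((y2 - y0) * (x - x2) + (x0 - x2) * (y - y2)) / den
            l2 = 1.0 - l0 - l1
            if min(l0, l1, l2) < 0:
                continue                              # pixel outside the triangle
            approx = l0 * grey[tri[0]] + l1 * grey[tri[1]] + l2 * grey[tri[2]]
            err += (image[y][x] - approx) ** 2
            n += 1
    return err / n if n else 0.0
```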
5 Encoding

If we cease to retain the connectivity of triangles and store the geometry only (the Delaunay triangulation can be recomputed from the geometry), 5 bytes are required per vertex (2 + 2 for the x- and y-coordinates, 1 for the grey value). Even if it contains only a few vertices, the triangulation in this raw form consumes a lot of bytes and, therefore, it is not suitable for storing. A more compact form is necessary. We experimented with several compression methods and, in this section, we present the results of our experiments.
Fig. 6. The dependency of quality (measured in PSNR) on the number of removed vertices N for three popular 512x512 grey-scale images (curves for the RND, MARR, DISTW, ERRDIST, TRIMSE and BRUTE methods)
Edgebreaker, proposed by Rossignac [15], is probably the most often used algorithm for the compression of an arbitrary triangulation. Edgebreaker visits triangles in a spiralling order and generates a string of symbols from the set {C, L, E, R, S}. This string describes the connectivity, i.e., it indicates how the mesh can be rebuilt. Using a Huffman compressor, it can be encoded so efficiently that two bits per triangle are guaranteed. The geometry, i.e., the coordinates and grey levels of the vertices, is encoded as follows. When a triangle, say pa, pb, pc, is visited and the far vertex p of its adjacent triangle has not been processed yet, a prediction q is computed using the parallelogram predictor and the algorithm stores the differences between this prediction and the vertex p – see Fig. 7. The number of bits used to store these differences depends on how accurate the predictions of the vertices are. Theoretically, if the predictor were always able to give an accurate prediction (i.e., q and p are the same), it would be possible to avoid storing the geometry. In practice, however, this case does not occur.
Fig. 7. The prediction q of the vertex p
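The parallelogram predictor mentioned here follows the standard rule q = pa + pb − pc, applied per coordinate (x, y and grey level); the small helper below is only an illustration of that rule, not the authors' implementation.

```python
def parallelogram_predict(pa, pb, pc):
    """Predict the vertex opposite pc across the edge (pa, pb): q = pa + pb - pc.
    Vertices are tuples such as (x, y, grey)."""
    return tuple(a + b - c for a, b, c in zip(pa, pb, pc))

def residual(p, q):
    """The (usually small) differences that are actually stored."""
    return tuple(pi - qi for pi, qi in zip(p, q))
```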
An advantage of this algorithm is that it can compress any triangulation, not only the Delaunay triangulation, and, if the parallelogram predictor is used, it runs incredibly fast. On the other hand, its implementation for triangulations with boundaries is not as simple as for closed meshes. Another drawback is that the connectivity has to be stored even for the Delaunay triangulation, because it is needed for the decoding of the coordinates. Therefore, we propose several simple methods that encode the geometry only. The first one, VXPATH, successively visits all vertices, storing the differences between the currently inspected vertex and the previously visited vertex. Let us note that the differences in the x- and y-coordinates are stored separately from the differences in grey levels (i.e., in z-coordinates). Vertices are visited in such an order that the differences in the x- and y-coordinates are minimized. Being at a vertex p, the algorithm thus continues with the vertex q that has not yet been visited and for which the value

|px − qx| + |py − qy|     (5)

is minimal. An advantage of this approach is that it does not require the connectivity for the decoding process and it offers compactness, as the differences should be very small. On the other hand, this brute-force algorithm runs in O(N²), which means that it takes a lot of time (especially if the triangulation contains a large number of vertices). The second method, called TRPATH, improves on the previous method in two respects. First, the vertices are processed in a different order (but encoded in the same way), as follows. The algorithm traverses the triangulation in depth-first order (similarly to Edgebreaker) and whenever a new triangle is visited, its not-yet-processed vertices are processed. As the vertices are handled in linear time, the time consumption is significantly reduced. The second improvement is that TRPATH does not use a fixed number of bits for all differences, which makes the algorithm slightly more complex but promises lower storage costs.
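The greedy VXPATH ordering can be sketched as follows (a quadratic-time illustration, as noted above; vertices are assumed to be (x, y) tuples).

```python
def vxpath_order(vertices):
    """Visit vertices so that each step moves to the unvisited vertex with the
    smallest L1 distance from the current one (the criterion of Eq. 5)."""
    remaining = list(vertices)
    path = [remaining.pop(0)]
    while remaining:
        p = path[-1]
        nxt = min(remaining, key=lambda q: abs(p[0] - q[0]) + abs(p[1] - q[1]))
        remaining.remove(nxt)
        path.append(nxt)
    return path

# The encoder then stores the coordinate differences along 'path',
# separately from the differences in grey levels.
```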
The last method, called KORILA, is based on the idea presented by Rila et al. [14] of storing the geometry as a bitmap that contains 1 at positions corresponding to the vertices of the triangulation and 0 elsewhere. This bitmap is compressed independently by Run Length Encoding (RLE) and by Huffman encoding, and the smaller result is stored in the output file. Let us note that Huffman proved to outperform RLE in all our experiments. The grey values and the values obtained by the application of the SUB filter to the grey levels (see Section 3) are also compressed independently using the Huffman approach, and the smaller outcome is stored. In our experiments, the filtering proved to be useful especially for larger meshes, where intensities do not change as rapidly as in small triangulations. Table 1 presents the results obtained for triangulations with 50 000 vertices computed using the BRUTE method for the three popular images. While the VXPATH and TRPATH methods achieved a similar compression ratio for all three images, the other two methods show different behaviour. The reason in the case of Edgebreaker is that the triangulation may be very irregular (particularly for images with a large amount of sharp edges), as can be seen in Fig. 8, and, therefore, the predictor used in Edgebreaker often produces highly inaccurate predictions, which leads to a lower compression ratio. The KORILA method is sensitive to the input data, as the Huffman encoding achieves worse results for images that contain a large range of grey levels.

Table 1. The comparison of sizes of output files produced by various compression schemes for triangulations with 50 000 vertices of three popular 512x512 grey-scale images

Image    PSNR     Edgebreaker    VXPATH     TRPATH     KORILA
Lena     36.29    152 624        175 012    138 450    50 804
Fruits   39.95    171 464        162 512    156 106    70 769
Boat     35.30    273 648        175 012    144 288    69 307
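The bitmap used by KORILA, and a simple run-length encoder of the kind it compares against Huffman coding, can be sketched as follows (numpy and the helper names are our own choices for illustration).

```python
import numpy as np

def korila_bitmap(vertices, width, height):
    """Vertex-position bitmap: 1 where a triangulation vertex lies, 0 elsewhere."""
    bitmap = np.zeros((height, width), dtype=np.uint8)
    for x, y in vertices:
        bitmap[y, x] = 1
    return bitmap

def run_length_encode(bits):
    """Lengths of the alternating runs of equal bits (plus the starting bit).
    The scheme keeps whichever of the RLE and Huffman outputs is smaller."""
    flat = list(bits)
    runs, count = [], 1
    for prev, cur in zip(flat, flat[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    return flat[0], runs

# Example: first_bit, runs = run_length_encode(korila_bitmap(vs, 512, 512).ravel())
```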
Fig. 8. The Delaunay triangulation with 15 000 vertices and the corresponding reconstructed image (PSNR = 33.90)
We also tried to zip the output files in order to achieve a better compression ratio. The results are shown in Table 2. While Edgebreaker and VXPATH produce files that are highly compressible, the KORILA method gives outputs that are not compressible at all.
Table 2. The comparison of sizes of zipped output files produced by various compression schemes for triangulations with 50 000 vertices of three popular 512x512 grey-scale images

Image    PSNR     Edgebreaker    VXPATH     TRPATH      KORILA
Lena     36.29    66 566         52 470     90 602      50 260
Fruits   39.95    79 620         57 135     100 998     70 751
Boat     35.30    110 419        53 677     115 788     69 377
The dependency of the file sizes of the outputs produced by the KORILA method on the number of vertices in the triangulations is presented in Table 3. Let us note that the filtering of grey values prior to encoding proved to be useless for the three smallest triangulations. We also compared the achieved results with JPEG and JPEG2000. As can be seen from Table 4, both compression techniques outperform the KORILA method; JPEG2000 even produces files of half the size.

Table 3. The sizes of output files produced by the KORILA method for triangulations with 50 000 down to 1 000 vertices

Vertices          50 000    30 000    20 000    10 000    5 000     2 000     1 000
Lena    Size      50 804    33 057    23 877    14 331    9 379     6 262     5 211
        PSNR      36.29     34.78     33.41     31.00     28.37     24.77     21.37
Fruits  Size      70 769    45 521    32 332    18 692    11 648    7 324     5 862
        PSNR      39.95     37.18     35.27     32.00     28.92     24.52     22.37
Boat    Size      69 307    44 930    32 066    18 620    11 622    7 293     5 838
        PSNR      35.30     33.10     31.00     27.83     25.11     22.46     20.55
Table 4. The sizes of output files produced by JPEG and JPEG2000

                  JPEG                                               JPEG2000
Image    Size      PSNR     Size      PSNR     Size     PSNR     Size      PSNR     Size      PSNR
Lena     38 467    36.22    16 159    33.46    3 147    24.73    25 538    36.34    11 948    33.37
Fruits   45 584    39.31    21 367    35.28    2 677    24.58    33 081    39.97    15 755    35.31
Boat     39 357    35.36    16 295    31.00    3 024    23.03    26 378    35.26    12 325    30.95
From the results presented in this section, it is quite clear that the compression of the computed triangulation is a very important issue. Even a small change to an existing method can dramatically change its typical compression ratio. Future research, therefore, should be more focused on the development of encoding techniques.
6 Conclusion

In this paper, we investigated the representation of a digital image by the Delaunay triangulation computed from the original raster of pixels by the successive removal of insignificant pixels (vertices). While filtering of the input image (by the filters used in PNG), which was supposed to help obtain triangulations with fewer vertices, proved to be useless, it may play a key role in the lossless compression of the computed triangulation. The results of the performed experiments also lead us to the conclusion that attempts to develop a sophisticated method for the evaluation of the significance of vertices are a waste of time, because it is likely that such a method would be time consuming but would not bring a significant improvement. Future research should rather be oriented towards the development of efficient triangle interpolation techniques because, in our opinion, the commonly used bilinear interpolation does not provide sufficient quality. For public use of geometric representations, it is also vital to develop a compression scheme that would outperform JPEG and JPEG2000.

Acknowledgments. This work was supported by the Grant Agency of the Academy of Sciences of the Czech Republic (GA AV ČR) – project No. KJB101470701. The author would also like to thank I. Kolingerová and V. Skala from the University of West Bohemia for providing the conditions in which this work has been possible.
References

1. Battiato, S., Gallo, G., Messina, G.: SVG rendering of real images using data dependent triangulation. In: Proceedings of SCCG 2004, pp. 185–192 (2004)
2. Ciampalini, A., Cignoni, P., Montani, C., Scopigno, R.: Multiresolution decimation based on global error. The Visual Computer 13, 228–246 (1997)
3. Cooper, O., Campbell, N., Gibson, D.: Automatic augmentation and meshing of sparse 3D scene structure. In: Proceedings of 7th IEEE Workshop on Applications of Computer Vision, Breckenridge, pp. 287–293. IEEE Computer Society Press, Los Alamitos (2005)
4. Demaret, L., Dyn, N., Floater, M.S., Iske, A.: Adaptive thinning for terrain modelling and image compression. Advances in Multiresolution for Geometric Modelling, 321–340 (2004)
5. Devillers, O.: On deletion in Delaunay triangulations. In: Proceedings of SCG 1999, pp. 181–188. ACM Press, New York (1999)
6. Floyd, R.W., Steinberg, L.: An adaptive algorithm for spatial grey scale. In: Proceedings of the Society of Information Display, pp. 75–77 (1976)
7. Galic, I., Weickert, J., Welk, M.: Towards PDE-based image compression. In: Paragios, N., Faugeras, O., Chan, T., Schnörr, C. (eds.) VLSM 2005. LNCS, vol. 3752, pp. 37–48. Springer, Heidelberg (2005)
8. García, M.A., Vintimilla, B.X., Sappa, A.D.: Efficient approximation of grey-scale images through bounded error triangular meshes. In: Proceedings of IEEE International Conference on Image Processing, Kobe, pp. 168–170. IEEE Computer Society, Los Alamitos (1999)
9. Gevers, T., Smeulders, A.W.: Combining region splitting and edge detection through guided Delaunay image subdivision. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1021–1026. IEEE Computer Society Press, Los Alamitos (1997)
10. Grundland, M., Gibbs, Ch., Dodgson, N.A.: Stylized rendering for multiresolution image representation. Proceedings of SPIE 5666, 280–292 (2005)
11. Kreylos, O., Hamann, B.: On simulated annealing and the construction of linear spline approximations for scattered data. IEEE Transactions on Visualization and Computer Graphics 7, 17–31 (2001)
12. Marr, D., Hildreth, E.C.: Theory of edge detection. In: Proceedings of the Royal Society, London B, vol. 207, pp. 187–217 (1980)
13. Prasad, L., Skourikhine, A.N.: Vectorized image segmentation via trixel agglomeration. Pattern Recognition 39, 501–514 (2006)
14. Rila, L.: Image coding using irregular subsampling and Delaunay triangulation. In: Proceedings of SIBGRAPI, pp. 167–173 (1998)
15. Rossignac, J.: Edgebreaker: Connectivity compression for triangle meshes. IEEE Transactions on Visualization and Computer Graphics 5, 47–61 (1999)
16. Su, D., Willis, P.: Image interpolation by pixel level data-dependent triangulation. Computer Graphics Forum 23, 189–201 (2004)
17. Yu, X., Morse, B.S., Sederberg, T.W.: Image reconstruction using data-dependent triangulation. IEEE Computer Graphics and Applications 21, 62–68 (2001)
Low-Complexity TTCM Based Distributed Video Coding Architecture J.L. Martínez1, W.A.C. Fernando2, W.A.R.J. Weerakkody2, J. Oliver3, O. López4, M. Martinez4, M. Pérez4, P. Cuenca1, and F. Quiles1 1
Albacete Research Institute of Informatics Universidad de Castilla-La Mancha 02071 Albacete, Spain {joseluismm,pcuenca,paco}@dsi.uclm.es 2 Center for Communications Research University of Surrey Guildford GU2 7XH, United Kingdom
[email protected],
[email protected] 3 Dep. of Computer Engineering. Technical University of Valencia. Spain
[email protected] 4 Dept. of Physics and Computer Engineering. Miguel Hernández University. Spain {mmrach,otoniel,mels}@umh.es
Abstract. Distributed Video Coding (DVC) is a promising coding solution for some emerging applications where the encoder complexity, power consumption or memory requirements constrain the system resources. Current approaches to DVC focus on improving the performance of the Wyner-Ziv coding by improving the quality of the reconstructed side information or by improving the quality of the channel codes. To date, no attention has been paid to the problem of key frame coding, where a low-encoding-complexity scenario is also needed. This work focuses on key frame coding and its effect on the decoding of the Wyner-Ziv frames, aiming to implement a very low-complexity Turbo Trellis Coded Modulation (TTCM) based DVC architecture. In this paper, we propose a new key frame coding scheme which has very low complexity and memory requirements for the TTCM based distributed video codec. Results show that the proposed intra-frame codec for key frame coding outperforms the JPEG2000 and Intra H.264 AVC codecs in terms of encoding time and memory requirements, with better RD performance. Keywords: Distributed Video Coding, Low Complexity, TTCM codes.
1 Introduction

Nowadays, with emerging applications such as multimedia wireless sensor networks, wireless video surveillance, disposable video cameras, medical applications and mobile camera phones, the traditional video coding architecture is being challenged. For all the applications mentioned above there is a need for a low-complexity encoder, probably at the expense of a high-complexity decoder. For these emerging applications, Distributed Video Coding (DVC) seems to be able to offer efficient and low-complexity video compression. DVC is a new video coding paradigm which allows, among other things, shifting complexity from the encoder to the decoder. The theoretical framework and the guidelines for DVC were established by Slepian and Wolf [1], and the current work in this field is based on the work of Wyner and Ziv [2]. Based on this theoretical framework, several turbo coded DVC codecs have been proposed recently [3,4,5]. In [3,4] the authors proposed a turbo coded Wyner-Ziv codec for motion video using a simple frame interpolation. In [5] the authors proposed more sophisticated motion interpolation and extrapolation techniques to predict the side information. The majority of these well-known research works on DVC have been carried out using a turbo Wyner-Ziv codec. However, recent experimental results [6] show that Turbo Trellis Coded Modulation (TTCM) based DVC codecs can improve the PSNR by up to 6 dB at the same bit rate with less memory compared to turbo coded DVC codecs. Current practical schemes developed for DVC are based in general on the following principles: the video frames are organized into two types, key frames and Wyner-Ziv frames; while the key frames are encoded with a conventional intra-frame codec, the frames between them are Wyner-Ziv encoded. At the decoder, the side information is obtained using previously decoded key frames and Wyner-Ziv frames. In this context, most of the contributions given in the literature focus on improving the performance of the Wyner-Ziv coding by improving the quality of the reconstructed side information [5] or by improving the quality of the channel codes [7]. To date, no attention has been paid to the problem of key frame coding, where a low-encoding-complexity scenario is also needed. Most current approaches to DVC rely on key frames available at the decoder perfectly reconstructed (lossless compression) or encoded with conventional intra-frame codecs (lossy compression). Recently, in the DISCOVER European project, JPEG2000 and Intra AVC have been proposed as technologies for key frame coding [8]. However, these conventional intra-frame encoders are too complex to be implemented in a DVC low-complexity scenario. For this reason, this paper presents a DVC architecture based on TTCM codes for the Wyner-Ziv frames, as proposed in [6], and on LTW for the key frames, as proposed in [9]. This paper is an integration and evaluation of these two architectures. In particular, the main objective of the paper is to propose a DVC codec with very low complexity and memory requirements for the non-DVC portion of an integrated TTCM based DVC architecture with very low complexity. The rest of the paper is organized as follows. Section 2 introduces the TTCM based distributed video coding architecture with very low-complexity key frame coding. In Section 3, we carry out a performance evaluation of the proposed architecture in terms of memory requirements, computational complexity and rate-distortion. We compare the performance of our proposal to the JPEG2000 and Intra AVC proposals. Finally, in Section 4, conclusions are drawn.
Fig. 1. DVC Architecture using Turbo Trellis Coded Modulation (TTCM)
2 Low-Complexity TTCM Based DVC Architecture

2.1 Wyner-Ziv Frames Coding

The considered Distributed Video Coding architecture is shown in Figure 1. The odd frames {X1, X3, ...} are the Wyner-Ziv frames, which go through the inter-frame encoder to generate the parity sequence to be transferred to the decoder. The Wyner-Ziv frames are first passed through the 2^M-level quantizer, where the level M is an independently varied parameter based on the expected quality of the output and the available channel bandwidth. Next, the Slepian-Wolf based encoder incorporates the bit plane extractor and then the turbo trellis encoder. Each rate-1/2 component encoder of our implementation has a constraint length K = M+1 = 4 and a generator polynomial of (11 02) in octal form. A pseudo-random interleaver is used in front of the 2nd constituent encoder. Only the parity bit sequence thus generated is retained in the parity buffers; the systematic bits are discarded. The decoder generates the side information using the key frames, employing a pixel interpolation algorithm as below:
Ym(i, j) = ( Xm−1(i, j) + Xm+1(i, j) ) / 2     (1)
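A minimal sketch of this interpolation step (Eq. 1), assuming the key frames are available as 2-D luminance arrays; the function name and the use of numpy are our own choices.

```python
import numpy as np

def interpolate_side_information(key_prev, key_next):
    """Side information for a Wyner-Ziv frame: the pixel-wise average of the
    previous and next decoded key frames (Eq. 1)."""
    avg = (key_prev.astype(np.uint16) + key_next.astype(np.uint16)) // 2
    return avg.astype(np.uint8)
```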
This side information, together with the parity bits passed from the encoder upon request, forms the PSK symbols to be processed in the TTCM (Turbo Trellis Coded Modulation) decoder. A multi-level set partitioning is done with the constellation mapping of the TCM symbols in order to maintain the maximum Euclidean distance between the information bits. Wherever parity bits are not available due to puncturing, the symbol automatically reduces to a lower modulation level. In the implementation under discussion, a combination of 4-PSK and binary PSK is used, based on the availability of the parity bits for the constellation mapping. As commonly understood, Trellis Coded Modulation is conceptually a channel coding technique used to optimize the bandwidth requirements of a channel while protecting the information bits by increasing the size of the symbol constellation. Our effort is to exploit the high coding gain and the noise immunity inherent in this technique. A block diagram of the Turbo-TCM decoder implementation is shown in Figure 2. A symbol-based MAP algorithm is used in the turbo trellis decoder, which is run for 6 iterations as a complexity-performance trade-off. A modification was made to the branch metric calculation to take care of the independent distributions of the side information and parity bits. The parity bits are supplied to the decoder through an "on-demand" approach, using a reverse channel for passing the request to the parity buffer maintained in the encoder. The de-puncturer function in the decoder basically watches the parity availability and manipulates the symbols fed to the SISO based MAP decoder accordingly. A reconstruction function is used to smooth some adverse effects in the output sequence, including some contribution from the quantization noise. On the other hand, the side information generated by the temporal interpolation of two key frames is assumed to be a form of the original Wyner-Ziv frame subjected to noise. The probability distribution of this noise was a part of a detailed study. It was noticed that both the Gaussian and the Laplacian noise distributions resembled the interpolation noise, with distinct variance parameters. Most interestingly, however, it was noted that our implementation of the codec was not susceptible to error caused by sub-optimal approximations of the distribution; for the purpose of obtaining the results, an Additive White Gaussian Noise (AWGN) with variance 0.125 was assumed. To obtain more details about this, see [6].
Fig. 2. Block Diagram of TTCM Decoder
2.2 Key Frames Coding

Little attention has been paid in the literature to the problem of key frame coding, and most of the current approaches to DVC rely on key frames available at the decoder perfectly reconstructed (lossless compression) or on key frame coding using conventional intra-frame codecs, such as JPEG2000 or AVC Intra. In this work, we propose the use of the LTW (Lower-Tree Wavelet) compression algorithm [9] for key frame encoding, in order to be integrated in a TTCM based DVC architecture with very low-complexity key frame coding. LTW is based on the efficient construction of wavelet coefficient lower trees. The main contribution of the LTW encoder is the utilization of coefficient trees, not only as an efficient method of grouping coefficients, but also as a fast way of coding them. Thus, it presents state-of-the-art compression performance, whereas its complexity is lower than that of the conventional intra-frame codecs. Fast execution is achieved by means of a simple two-pass coding and one-pass decoding algorithm. Moreover, its computation does not require additional lists or complex data structures, so there is no memory overhead. With LTW, the quantization process is performed by two strategies: one coarser and another finer. The finer one consists in applying a scalar uniform quantization, Q, to the wavelet coefficients. The coarser one is based on removing the least significant bit planes, rplanes, from the wavelet coefficients. The use of a coefficient tree structure called a lower tree reduces the total number of symbols needed to encode the image, decreasing the overall execution time. This structure is a coefficient tree in which all coefficients are lower than 2^rplanes. The LTW algorithm consists of two stages. In the first one, the significance map is built after quantizing the wavelet coefficients (by means of both the Q and rplanes parameters). In Figure 3(b) we show the significance map built from the wavelet decomposition shown in Figure 3(a). The symbol set employed in our proposal is the following: a LOWER (L) symbol represents a coefficient that is the root of a lower tree; the remaining coefficients in the lower tree are labeled as LOWER_COMPONENT (*), but they are never encoded because they are already represented by the root coefficient. If a coefficient is insignificant but does not belong to a lower tree because it has at least one significant descendant, it is labeled with an ISOLATED_LOWER (I) symbol. For a significant coefficient, we simply use a symbol indicating the number of bits needed to represent it. With respect to the coding algorithm, in the first stage (symbol computation), all wavelet subbands are scanned in 2×2 blocks of coefficients, from the first decomposition level to the Nth (to be able to build the lower trees from leaves to root). In the first-level subband, if the four coefficients in a 2×2 block are insignificant (i.e., lower than 2^rplanes), they are considered to be part of the same lower tree, labeled as LOWER_COMPONENT. Then, when scanning upper-level subbands, if a 2×2 block has four insignificant coefficients and all their direct descendants are LOWER_COMPONENT, the coefficients in that block are labeled as LOWER_COMPONENT, increasing the lower-tree size. However, when at least one coefficient in the block is significant, the lower tree cannot continue growing. In that case, a symbol for each coefficient is computed one by one. Each insignificant coefficient in the block is assigned a LOWER symbol if all its descendants are LOWER_COMPONENT; otherwise it is assigned an ISOLATED_LOWER symbol. On the other hand, for each significant coefficient, a symbol indicating the number of bits needed to represent that coefficient is employed. Finally, in the second stage, subbands are encoded from the LLN subband to the first-level wavelet subbands, as shown in Figure 4. Observe that this is the order in which the decoder needs to know the symbols, so that lower-tree roots are decoded before their leaves. In addition, this order provides resolution scalability, because LLN is a low-resolution scaled version of the original image, and as more subbands are received, the low-resolution image can be doubled in size. In each subband, for each 2×2 block, the symbols computed in the first stage are entropy coded by means of an arithmetic encoder. Recall that no LOWER_COMPONENT is encoded.
In addition, significant bits and the sign are needed for each significant coefficient and are therefore binary encoded.
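A rough sketch of the first-stage classification of one 2×2 block is given below; it only restates the rules above, and the function interface (quantized coefficient magnitudes plus a per-coefficient flag saying whether all descendants are LOWER_COMPONENT) is an assumption of ours, not the actual LTW code.

```python
def classify_block(block, descendants_all_lower, rplanes):
    """Classify a 2x2 block of quantized wavelet coefficients (LTW first stage).
    'block' holds the four coefficient magnitudes; 'descendants_all_lower[i]'
    is True if every direct descendant of block[i] is LOWER_COMPONENT."""
    threshold = 1 << rplanes
    insignificant = [abs(c) < threshold for c in block]
    if all(insignificant) and all(descendants_all_lower):
        return "LOWER_COMPONENT"            # the whole block joins a lower tree
    symbols = []
    for c, insig, desc_lower in zip(block, insignificant, descendants_all_lower):
        if insig:
            symbols.append("L" if desc_lower else "I")     # LOWER / ISOLATED_LOWER
        else:
            symbols.append(("bits", abs(c).bit_length()))  # significant coefficient
    return symbols
```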
Fig. 3. (a) 2-level wavelet transform of an 8x8 example image. (b) Map Symbols.
Fig. 4. Coefficient-trees in LTW
3 Results

In this section, we carry out a performance evaluation of the Low-Complexity TTCM Based Distributed Video Coding Architecture proposed in Section 2, in terms of memory requirements, computational complexity and rate-distortion. For the purpose of this comparative performance evaluation, even frames were intra coded with LTW, JPEG2000 or H.264 (Baseline Profile, the fastest version of AVC Intra) and decoded, while odd frames are coded as Wyner-Ziv frames, as shown in Figure 1. The bit rate and PSNR are calculated for the luminance of the Wyner-Ziv frames (odd frames) or the key frames (even frames) of the Foreman sequence (300 frames), for a frame size of 176x144 (QCIF) with a Wyner-Ziv frame rate of 15 fps. For a better comparison of the rate-distortion performance, we also show the average PSNR difference (∆PSNR) and the average bit-rate difference (∆Bitrate). The PSNR and bit-rate differences are calculated as numerical averages between the RD curves derived from the LTW, JPEG2000 and H.264 encoders, respectively. The detailed procedures for calculating these differences can be found in a JVT document authored by Bjontegaard [10]. Note that the PSNR and bit-rate differences should be regarded as equivalent, i.e., there is either the decrease in PSNR or the increase in bit-rate, but not both at the same time. For the purpose of our performance evaluation, we first evaluate the key frame coding part of our low-complexity DVC architecture, and then the Wyner-Ziv frame coding part. Finally, global results (taking into account all frames) are provided.
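The Bjontegaard deltas reported below can be computed, for instance, as follows. This is a generic sketch of the procedure in [10] (a cubic fit of PSNR over the logarithm of the bit-rate, integrated over the common rate range), not the exact script used for the paper's numbers.

```python
import numpy as np

def bd_psnr(rate_ref, psnr_ref, rate_test, psnr_test):
    """Average PSNR difference (test minus reference) between two RD curves,
    following the Bjontegaard procedure [10]."""
    lr_ref, lr_test = np.log10(rate_ref), np.log10(rate_test)
    p_ref = np.polyfit(lr_ref, psnr_ref, 3)     # cubic fit of PSNR vs log-rate
    p_test = np.polyfit(lr_test, psnr_test, 3)
    lo = max(lr_ref.min(), lr_test.min())       # overlapping log-rate interval
    hi = min(lr_ref.max(), lr_test.max())
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    return (int_test - int_ref) / (hi - lo)
```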
For the key frame coding part, all the evaluated encoders have been tested on an Intel Pentium M Dual Core 3.0 GHz with 1 Gbyte of RAM. We have selected H.264 [11] (Baseline, JM10.2), JPEG2000 [12] (Jasper 1.701.0) and LTW, since their source code is available for testing. The corresponding binaries were obtained by means of the Visual C++ (version 2005) compiler with the same project options on the above-mentioned machine. A further evaluation can be found in [13]. Figure 5 shows the average memory requirements1 per key frame for all key frame codecs under study and for the QCIF and CIF size formats. In both cases LTW needs practically half the memory of JPEG2000, and H.264 needs six times the memory of LTW for QCIF size and eight times for CIF size.
Fig. 5. Memory Requirements (Key Frames): memory requirements per frame (Kbytes) of H.264, JPEG 2000 and LTW for QCIF and CIF sizes
Figure 6 shows the average encoding time per key frame for all key frame codecs under study for the QCIF Foreman video sequence. As shown in Figure 6, the LTW codec has the lowest complexity of all the evaluated codecs and is about 10 times faster than JPEG2000 and 100 times faster than H.264 (Baseline Profile, the fastest version of AVC Intra). The LTW codec reduces the complexity substantially with respect to the other conventional codecs under study, showing the effectiveness of the LTW codec in the proposed Low-Complexity TTCM Based Distributed Video Coding Architecture. Figure 7 shows the RD results for key frames for all key frame codecs under study. For a fair comparison, the H.264 simulations were first carried out by varying the QP factor from 20 to 50. For every simulation the real bit-rate was obtained and then introduced to the JPEG 2000 and LTW codecs as the target bit-rate. As shown, the rate-distortion obtained with LTW outperforms the other codecs by 1.2 dB and 1.13 dB on average with respect to JPEG2000 and H.264 respectively, using less bit-rate, up to 17% and up to 10% with respect to JPEG2000 and H.264 respectively (see Table 1). For the Wyner-Ziv frame coding part, Figure 8 shows the effect on the decoding of the Wyner-Ziv (WZ) frames when the key frames are coded with all the key frame codecs under study with QP=20. The best results for the Wyner-Ziv frames are obtained when the key frames are coded with the LTW codec. As seen in Table 2, the rate-distortion results
1 Results obtained from the Windows XP task manager, peak memory usage column.
Fig. 6. Encoding Time per Key Frames: average encoding time per frame (seconds) vs. bit rate (kbits/frame) for H.264, JPEG 2000 and LTW, Foreman (176x144 QCIF, 15 Hz, Key frames)
Fig. 7. Rate-Distortion Results for Key Frames: PSNR (dB) vs. bitrate (kbits/frame) for H.264, LTW and JPEG 2000, Foreman (176x144, QCIF, 15 Hz, Key frames)

Table 1. Comparison for Key Frames Coding

              JPEG 2000 vs. LTW    H.264 vs. LTW
ΔPSNR (dB)    -1.271               -1.133
ΔBitrate (%)  17.11                10.70
obtained on average for Wyner-Ziv frames when the key frames are coded with LTW outperform the other codecs by 0.5 dB and 0.6 dB on average with respect to JPEG2000 and H.264, respectively, using significantly less bit-rate: up to 47% and up to 46% less than JPEG2000 and H.264, respectively. Figure 9 shows the effect on Wyner-Ziv frame decoding when the key frames are coded with each of the key frame codecs under study at QP=50. Again, the best results for Wyner-Ziv frames are obtained when the key frames are coded with LTW. As shown in Table 3, the rate-distortion results obtained for Wyner-Ziv frames when the key frames are coded with LTW outperform the other codecs by 4.5 dB and
0.4 dB on average with respect to JPEG2000 and H.264, respectively, using significantly less bit-rate: up to 2768% and up to 9.6% less than JPEG2000 and H.264, respectively. These results may seem erroneous, but there is an explanation: the side information is generated from the key frames, and this side information has an important impact on the overall performance of DVC. A lack of PSNR for the key frames, as shown in Figure 7, translates into a lack of RD performance for the DVC scheme, as shown in Figure 9.
Fig. 8. Rate-Distortion Results for Wyner-Ziv Frames when Key Frames are coded with QP=20 (Foreman, 176×144 QCIF, 15 Hz, WZ frames). Axes: PSNR (dB) versus bitrate (kbits/frame), for H.264, LTW, and JPEG 2000.

Table 2. Comparison for WZ Frames Coding when Key Frames are coded with QP=20

                     ΔPSNR (dB)   ΔBitrate (%)
JPEG 2000 vs. LTW      -0.51         42.07
H.264 vs. LTW          -0.662        46.52
Fig. 9. Rate-Distortion Results for Wyner-Ziv Frames when Key Frames are coded with QP=50 (Foreman, 176×144 QCIF, 15 Hz, WZ frames). Axes: PSNR (dB) versus bitrate (kbits/frame), for H.264, LTW, and JPEG 2000.
Finally, we present global results taking into account all frames (key frames + Wyner-Ziv frames). Figure 10 shows the decoding results over all frames for our Low-Complexity TTCM Based Distributed Video Coding Architecture when the key frames are coded with each of the key frame codecs under study at QP=20. The best results are obtained when the key frames are coded with the LTW codec. As seen in Table 4, the rate-distortion results obtained using LTW outperform the other codecs by approximately 1 dB on average with respect to both JPEG2000 and H.264, using significantly less bit-rate (around 20%).
Table 3. Comparison for WZ Frames Coding when Key Frames are coded with QP=50

                     ΔPSNR (dB)   ΔBitrate (%)
JPEG 2000 vs. LTW      -4.545       2768.17
H.264 vs. LTW          -0.435          9.6
Fig. 10. Rate-Distortion Results for all Frames when Key Frames are coded with QP=20 (Foreman, 176×144 QCIF, total frames). Axes: PSNR (dB) versus bitrate (kbits/frame), for H.264, LTW, and JPEG 2000.
Figure 11 shows the decoding results over all frames for our Low-Complexity TTCM Based Distributed Video Coding Architecture when the key frames are coded with each of the key frame codecs under study at QP=50. Again, the best results are obtained when the key frames are coded with the LTW codec. As seen in Table 5, our proposal outperforms the other codecs by 3.9 dB and 0.7 dB on average with respect to JPEG2000 and H.264, respectively, using significantly less bit-rate: up to 105% and up to 10% less than JPEG2000 and H.264, respectively.
Table 4. Comparison for ALL Frames when Key Frames are coded with QP=20

                     ΔPSNR (dB)   ΔBitrate (%)
JPEG 2000 vs. LTW      -1.011        24.76
H.264 vs. LTW          -1.188        20.41
Although the results presented in this paper are only shown for the QCIF format and the Foreman sequence, similar behavior was obtained for the CIF format and for other video sequences.
Fig. 11. Rate-Distortion Results for all Frames when Key Frames are coded with QP=50 (Foreman, 176×144 QCIF, total frames). Axes: PSNR (dB) versus bitrate (kbits/frame), for H.264, LTW, and JPEG 2000.

Table 5. Comparison for ALL Frames when Key Frames are coded with QP=50

                     ΔPSNR (dB)   ΔBitrate (%)
JPEG 2000 vs. LTW      -3.913       105.67
H.264 vs. LTW          -0.744        10.65
4 Conclusions

In this paper, we have proposed a very low-complexity Turbo Trellis Coded Modulation based DVC architecture. In particular, we have proposed the use of a fast intra frame codec with very low complexity and memory requirements for the non-DVC portion of a TTCM based DVC codec. The results clearly indicate that using the LTW intra frame codec in a TTCM based DVC architecture outperforms the same architecture with JPEG2000 or Intra AVC in terms of encoding time and memory requirements, while showing very similar RD performance.

Acknowledgments. This work has been jointly supported by the Spanish MEC and European Commission FEDER funds under grants "Consolider Ingenio-2010 CSD2006-00046", "TIN2006-15516-C04-02", and TIC2003-00339, and by JCCM funds under grant "PAI06-0106".
References

1. Slepian, D., Wolf, J.K.: Noiseless Coding of Correlated Information Sources. IEEE Transactions on Information Theory 19, 471–480 (1973)
2. Wyner, A.D., Ziv, J.: The Rate-Distortion Function for Source Coding with Side Information at the Decoder. IEEE Transactions on Information Theory 22, 1–10 (1976)
3. Aaron, A., Zhang, R., Girod, B.: Wyner-Ziv Coding of Motion Video. In: Proceedings of the Asilomar Conference on Signals and Systems, Pacific Grove, USA (2002)
4. Girod, B., Aaron, A., Rane, S., Monedero, D.R.: Distributed Video Coding. Proceedings of the IEEE, Special Issue on Advances in Video Coding and Delivery 93, 1–12 (2005)
5. Ascenso, J., Brites, C., Pereira, F.: Improving Frame Interpolation with Spatial Motion Smoothing for Pixel Domain Distributed Video Coding. In: 5th EURASIP Conference on Speech and Image Processing, Multimedia Communications and Services (2005)
6. Weerakkody, W.A.R.J., Fernando, W.A.C., Adikari, A.B.B., Rajatheva, R.M.A.P.: Distributed Video Coding of Wyner-Ziv Frames using Turbo Trellis Coded Modulation. In: Proceedings of the International Conference on Image Processing, ICIP 2006, Atlanta, USA (2006)
7. Wang, H., Zhao, Y., Wang, A.: Performance Comparisons of Different Channel Codes in Distributed Video Coding. In: First International Conference on Innovative Computing, Information and Control, ICICIC 2006, Beijing, China, vol. 2, pp. 225–228 (2006)
8. Pereira, F., Guillemot, C., Leonardi, R., Ostermann, J., Ebrahimi, T., Torres, L.: Distributed Coding for Video Services. DISCOVER Project Deliverable 7 (2006)
9. Oliver, J., Malumbres, M.P.: Low-Complexity Multiresolution Image Compression Using Wavelet Lower Trees. IEEE Transactions on Circuits and Systems for Video Technology, 1437–1444 (2006)
10. Bjontegaard, G.: Calculation of Average PSNR Differences between RD-Curves. Document VCEG-M33, 13th VCEG Meeting, Austin, USA (2001)
11. ISO/IEC International Standard 14496-10:2003: Information Technology - Coding of Audio-Visual Objects - Part 10: Advanced Video Coding
12. JPEG2000 Image Coding System. ISO/IEC 15444-1 (2000)
13. López, O., Martínez-Rach, M., Piñol, P., Oliver, J., Malumbres, M.P.: M-LTW: A Fast and Efficient Non-Embedded Intra Video Codec. In: Pacific-Rim Conference on Multimedia, PCM 2007 (2007) (accepted, publication pending)
Adaptive Key Frame Selection for Efficient Video Coding

Jaebum Jun, Sunyoung Lee, Zanming He, Myungjung Lee, and Euee S. Jang

Digital Media Lab., Hanyang University, 17 Haengdang-dong, Seongdong-gu, Seoul, 133-791, Korea
[email protected],
[email protected]
Abstract. Recently, much research on frame skipping has been conducted to reduce temporal redundancy in video frames. As a simple method, fixed frame skipping (FFS) adjusts the frame rate by skipping frames at regular intervals. To overcome the poor performance of FFS, variable frame skipping (VFS) has been introduced to exploit the temporal dependency between frames. In this paper, a scene-adaptive key frame selection method with low complexity is proposed. The proposed method reduced complexity by about 20 percent while delivering better visual quality than conventional video encoding. As a preprocessing method, the proposed technology can be used with any conventional video codec.

Keywords: variable frame skipping, frame interpolation, frame rate control.
1 Introduction

Many conventional video codecs try to reduce the temporal redundancy in video frames by means of motion estimation and motion compensation (MEMC), performed as part of the video encoding process. As an alternative approach, frame skipping (FS) has been investigated. Its purpose is to further exploit the temporal redundancy of video frames by skipping some frames during encoding. The skipped frames may be regenerated at the decoder by repeating the previous frame or by interpolating between neighboring frames. By employing FS in the coding process, more bits can be allocated to each encoded frame; by lowering the frame rate, better picture quality can be obtained for the encoded frames. FS also reduces the decoding time and complexity by reducing the number of coded frames. These are the clear advantages of FS over MEMC.

The performance of FS is highly dependent upon the selection of coded (or uncoded) frames from the given video sequence. The most critical issue in FS is not to lose semantically important frames. Therefore, a key element in FS is to distinguish repetitive or easily interpolatable frames in the video sequence.

In FS, there are two representative approaches: fixed frame skipping (FFS) and variable frame skipping (VFS). By skipping frames at regular intervals, FFS is a useful technology for very low bit rate environments [1]. FFS is simple to implement, but suffers severe quality degradation due to the jerky effect in sequences with high motion.
In the case of VFS, the frame skipping interval can be changed depending on the similarity between video frames. Therefore, VFS can reduce the jerky effect when the bitstream is decoded and interpolated. Several VFS methods have been proposed [2]-[4]. Based on the similarity of the last two frames in the previous group of pictures (GOP), Song [2] defined a variable frame rate for the current GOP. However, the method showed limited performance in the presence of high motion in a scene, since the frame rate of the current GOP is predicted only from the previous GOP and the range of the variable frame rate was not flexible enough. Pejhan [3] proposed a dynamic frame rate control mechanism based on separate files containing the motion vectors at the lower frame rates. However, this method has to analyze the entire video sequence to be encoded in order to adjust the frame rate; because all frames must be analyzed, it would be difficult to use in real-time applications. Kuo [4] proposed a VFS method that introduced interpolation into the encoding process. The results reported in [4] produced better PSNR quality than conventional coding with no FS, but the method is closely coupled with the encoding process and is targeted at low-resolution, low bit rate applications.

In this paper, we propose a straightforward skipping algorithm that adaptively selects key frames depending on the similarity within the video sequence. In designing the proposed scheme, we focused on two requirements: low computational complexity and efficient selection of key frames. The main objective of our method is to build a framework that provides good enough visual quality to the user. We tried to reach this level of quality not necessarily by minimizing the difference (e.g., PSNR) between the original and interpolated video frames, but rather by semantically interpolating between key frames. In this case, the level of visual quality can only be estimated correctly by a subjective test with independent viewers.

The remainder of this paper is organized as follows. In Section 2, the proposed method is described, including the frame skipping and frame interpolation processes. Experimental results are provided in Section 3. We summarize the paper and list future work in Section 4.
2 Adaptive Key Frame Selection

Our proposed system can be described as shown in Fig. 1. The proposed frame skipping (FS) process can be implemented either inside or outside of the encoding process, since the FS is not dependent upon the encoding process. After FS, only the
Fig. 1. Block diagram of the encoding and decoding process of the proposed system
selected frames will be encoded by the encoder. Once these encoded frames are decoded from the transmitted bitstream, frame interpolation is performed at the decoder side.

2.1 Frame Skipping

The proposed frame skipping algorithm can be described with the following three steps:

Step 1: Define the maximum cluster size.
Step 2: Analyze the similarity between neighboring frames and form clusters.
Step 3: Select key frames.

The maximum cluster size (MCS) is defined in Step 1. A cluster is defined as a set of consecutive video frames with high similarity. Defining the MCS may be necessary to bound the cluster size for application needs; for example, the MCS can be set to 30 to ensure that frame skipping does not last more than a second. The MCS does not affect the encoding/decoding process itself; it is only used to identify and form clusters in Step 2.

In Step 2, the similarity among neighboring video frames is checked to identify clusters in the given video sequence. In this paper, PSNR is used to measure the distortion between two frames, as depicted in Fig. 2. For simplicity, measures such as the sum of absolute differences (SAD) may be used instead. A cluster is identified when all the PSNR values between consecutive frames in the cluster are at or above a certain threshold:
PSNR(i, i+1) ≥ T    (1)
where PSNR(i, i+1) is the PSNR value between the i-th and (i+1)-th frames and T is the threshold. If PSNR(i, i+1) is smaller than the threshold, it is assumed that the i-th frame has little similarity to the (i+1)-th frame and that a scene change has happened between the i-th and (i+1)-th frames, as shown in Fig. 3. We assigned an appropriate value to the threshold for each test sequence empirically. This clustering method admits the case in which the PSNR between the first and the last frame of a cluster is lower than the threshold. From the experiments, we found that such cases exist and that clustering is still effective: they occur when only a small part of the frame is moving, which creates a gradual change in PSNR and thus a large difference between the first and the last frames. Such content is nevertheless interpolated nicely when motion vector information is used at the decoder side.
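Steps 1-3 amount to a single pass over the sequence: measure PSNR(i, i+1), keep extending the current cluster while the similarity stays at or above T and the cluster has not reached the MCS, and retain the first and last frame of every cluster as key frames. The sketch below (Python/NumPy) illustrates this under our reading of the overlap rule described in Step 3; frames are assumed to be 8-bit luminance arrays and the helper names are ours.

```python
import numpy as np

def psnr(f1, f2, peak=255.0):
    """PSNR between two consecutive frames (luminance only)."""
    mse = np.mean((f1.astype(np.float64) - f2.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def select_key_frames(frames, T, mcs):
    """Form clusters of consecutive similar frames (PSNR >= T), capped at the
    maximum cluster size, and keep the first and last frame of each cluster."""
    key = set()
    start = 0
    for i in range(len(frames) - 1):
        if psnr(frames[i], frames[i + 1]) < T:
            # scene change: close cluster [start, i]; the next one starts at i+1
            key.update({start, i})
            start = i + 1
        elif (i - start + 1) >= mcs:
            # cluster reached the MCS: close it; overlapped clusters share frame i
            key.update({start, i})
            start = i
    key.update({start, len(frames) - 1})
    return sorted(key)
```

For a highly active sequence nearly every frame becomes its own cluster, so almost no frames are skipped, which is consistent with the behavior reported later for Stefan.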
Fig. 2. PSNR computation between consecutive image frames
Fig. 3. Example of forming clusters
In Step 3, the first and the last frame of each cluster are candidates for key frames, as shown in Fig. 4(a). When forming clusters, it is also possible for two neighboring clusters to overlap by one key frame, as depicted in Fig. 4(b). This usually happens when the MCS is set too small and the actual cluster size is larger than the MCS. When two clusters overlap, the last frame of the first cluster is also the first frame of the second cluster; if three consecutive clusters overlap, the number of selected key frames becomes four. The number of skipped frames can be identified by reading the time stamp of each selected video frame: if there is a jump in the time stamps of two consecutive coded frames, the decoder can detect how many frames were skipped by computing the difference of the two time stamps.

2.2 Frame Interpolation

When frame skipping is used at the encoder side, the skipped frames are interpolated from the key frames at the decoder side. There are several methods to interpolate skipped frames, such as repetition, bilinear interpolation, and motion compensated interpolation (MCI) [4], [5]. In this paper, we used an MCI-based method at the decoder side. The proposed MCI-based interpolation reproduces the N skipped frames between the first and the last key frame of a cluster. For the interpolation of the i-th skipped frame (i = 1, …, N), the following procedure is applied (a sketch in code is given after the equations below):

Step 1: [Bilinear interpolation] Perform bilinear interpolation (cf. Eqn. 3) using the first and the last key frames to fill the i-th skipped frame.
Step 2: Select an MB in the last key frame with the following conditions: 1) a nonzero MV and 2) the smallest MV value among the unprocessed MBs.
Step 3: [First-frame background filling] Fill the collocated MB in the i-th skipped frame with the collocated MB in the first key frame according to Eqn. 4.
Step 4: [Last-frame background filling] Using the MB to which the MV of the selected MB points in the last key frame, fill the collocated area of the i-th skipped frame according to Eqn. 5.
Step 5: [Motion compensated bilinear interpolation] Find the area in the skipped frame indicated by the scaled MV of the selected MB of the last key frame. Update this area by bilinear interpolation between the area to which the MV of the selected MB points and the selected MB of the last key frame, according to Eqn. 6.
Step 6: Go to Step 2 until all MBs in the last key frame are processed.
Fig. 4. Selecting key frames: (a) no overlapped clusters; (b) overlapped clusters (legend: non-key frames to be skipped; key frames to be coded)
The basic principle of the interpolation is to use two passes. The first pass is the bilinear interpolation between the first and the last key frames. In the second pass, we exploit the MV values of the last key frame to identify objects moving between the first and the last key frames. When the i-th skipped frame is interpolated, the interval ratio R is computed from the time stamps of the first key frame, the last key frame, and the i-th skipped frame:
R = (Ti − Tfirst) / (Tlast − Tfirst)    (2)
where Ti, Tfirst, and Tlast are the time stamps of the i-th skipped, the first, and the last video frames, respectively. For bilinear interpolation the following equation can be formulated:
Pi(x, y) = Plast(x, y) × R + Pfirst(x, y) × (1 − R)    (3)
where Pi(x, y), Plast(x, y), and Pfirst(x, y) denote the pixel values at the x-th row and the y-th column of the i-th skipped, the last, and the first video frame, respectively. Bilinear interpolation is a very simple method with low computational complexity. However, its visual quality is poor in the presence of fast moving objects between the first and the last key frames. To overcome this shortcoming, the second pass uses the MVs of the last key frame. In the second pass, we process the MBs from small MV values to large MV values. The reason for starting with the MBs with small MVs is that objects with high motion are more influential on the visual quality of the interpolated frame; therefore, the MB with the largest MV is processed last.
Once there is an MB selected with nonzero MV, three operations follow as shown in Fig. 5: first-frame background (FB) filling, last-frame background (LB) filling, and motion compensated bilinear interpolation (MCBI). FB filling is to fill an MB in the skipped frame with the collocated MB in the first frame:
Pi(x, y) = Pfirst(x, y)    (4)
In LB filling, the area in the first key frame to which the MV of the selected MB of the last key frame points is identified first. The area of the skipped frame collocated with that pointed area is then filled with the collocated area of the last key frame:
Pi(x + MVx, y + MVy) = Plast(x + MVx, y + MVy)    (5)

In MCBI, the filled area in the skipped frame is found using the interval ratio (cf. Eqn. 2) and the MV of the last key frame, as shown in Fig. 5. The area is filled by bilinear interpolation as follows:
Pi((x + MVx) × R, (y + MVy) × R) = Pfirst(x + MVx, y + MVy) × (1 − R) + Plast(x, y) × R    (6)

where MVx and MVy are the motion vectors of the macroblock in the last key frame, which references the first key frame.
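Putting Eqns. (2)-(6) together, a single skipped frame is reconstructed by one bilinear pass followed by the FB/LB/MCBI operations for each macroblock with a nonzero MV, processed from the smallest to the largest motion. The sketch below (Python/NumPy) illustrates this procedure; boundary clipping, sub-pel motion, and the exact target position of Eqn. (6) (we read it as the MB position displaced by the MV scaled by R) are our simplifications, and the function names are ours.

```python
import numpy as np

MB = 16  # macroblock size

def interpolate_skipped(first, last, mvs, R):
    """Two-pass MCI interpolation of one skipped frame (Eqns. 3-6).
    first, last: decoded key frames (2-D luminance arrays);
    mvs: {(mb_row, mb_col) -> (mv_x, mv_y)} integer MVs of the last key frame
    referencing the first one; R: interval ratio from Eqn. (2)."""
    # Pass 1: bilinear interpolation, Eqn. (3)
    out = (1.0 - R) * first.astype(np.float64) + R * last.astype(np.float64)

    # Pass 2: MBs with nonzero MV, smallest motion first
    order = sorted((k for k, v in mvs.items() if v != (0, 0)),
                   key=lambda k: abs(mvs[k][0]) + abs(mvs[k][1]))
    for (br, bc) in order:
        mvx, mvy = mvs[(br, bc)]
        x, y = br * MB, bc * MB                      # top-left corner of the MB
        # FB filling, Eqn. (4): collocated MB from the first key frame
        out[x:x+MB, y:y+MB] = first[x:x+MB, y:y+MB]
        # LB filling, Eqn. (5): displaced area copied from the last key frame
        out[x+mvx:x+mvx+MB, y+mvy:y+mvy+MB] = last[x+mvx:x+mvx+MB, y+mvy:y+mvy+MB]
        # MCBI, Eqn. (6): blend along the motion trajectory scaled by R
        sx, sy = x + int(round(mvx * R)), y + int(round(mvy * R))
        out[sx:sx+MB, sy:sy+MB] = ((1.0 - R) * first[x+mvx:x+mvx+MB, y+mvy:y+mvy+MB]
                                   + R * last[x:x+MB, y:y+MB])
    return np.clip(out, 0, 255).astype(np.uint8)
```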
3 Experimental Results

For the evaluation of the proposed method, we used the MPEG-4 Simple Profile (SP) as a test bed on a PC equipped with an Intel Core2Duo 2.8 GHz running Windows Vista. The test sequences used in the experiments are shown in Table 1. For the MPEG-4 SP codec, we used the MPEG-4 reference software [6], in which no optimization is performed. For simplicity of the experiment, we used only one MV per MB.

Table 2 shows the results of the proposed key frame selection. With three different MCS values for each test sequence, it is observed that more frames are skipped with a larger MCS value. This is more apparent in Akiyo and Container, whereas the variation in the number of selected key frames is marginal when the video sequence is highly active, as in Stefan.
Fig. 5. MCI-based Interpolation
Table 1. Test sequences

Sequence    Resolution       Number of frames   FPS
Akiyo       CIF (352x288)    300                30
Container   CIF (352x288)    300                30
Stefan      SIF (352x240)    300                30
Table 2. Key frame selection results

Test sequence   MCS (frames)   T (dB)   Number of clusters   Selected key frames / average fps
Akiyo           3              38       85                   215 (21.5)
Akiyo           7              38       36                   141 (14.1)
Akiyo           12             38       24                   124 (12.4)
Container       3              38.2     75                   225 (22.5)
Container       7              38.2     40                   150 (15.0)
Container       12             38.2     30                   124 (12.4)
Stefan          3              23       11                   289 (28.9)
Stefan          7              23       6                    282 (28.2)
Stefan          12             23       5                    280 (28.0)
We chose the thresholds (T) for clustering empirically for each sequence, keeping in mind that a high threshold is assigned when there is little motion, as shown in Table 2. A large cluster is divided into multiple clusters when the MCS decreases. For example, 24 clusters are formed in Akiyo when the MCS is set to 12, whereas 61 more clusters are formed when the MCS is set to three with the same threshold. Overall, it is shown that the average frames per second (fps) can be controlled with the MCS and T; in the table, the average fps varies from 12.4 to 28.9. Fig. 6 shows a bar graph indicating the selected key frames in the Akiyo sequence: a bar indicates a selected key frame, and if the distance between two neighboring bars is greater than one, there is a cluster. From the figure, it is clearly noticeable that the number of skipped frames is highly affected by the MCS. To estimate the computational complexity, we measured the encoding and decoding times reported in Table 3. Compared with the encoding time of MPEG-4 SP, the skipping plus encoding time of the proposed method is lower for all test sequences: our skipping method requires only the PSNR computation between neighboring frames and reduces the coding time according to the number of skipped frames. Furthermore, the decoding time of the proposed method, including the interpolation time, is comparable with the decoding time of MPEG-4 SP. Table 4 shows the number of coded frames for MPEG-4 SP and the proposed method at a similar bit rate. The reduced number of frames in the proposed method results in high visual quality, since more bits per frame can be assigned in encoding by using a lower quantization parameter (Qp). Obviously, better visual quality in the key frames leads to better visual quality in the interpolation.
Fig. 6. Selected key frames in Akiyo: (a) MCS = 3; (b) MCS = 7; (c) MCS = 12
We evaluated the PSNR values between the original and reconstructed sequences for the two methods (MPEG-4 SP and the proposed interpolation method) with an MCS value of 7. For all test sequences except Akiyo, the average PSNR values of the two methods are quite close to each other. In the case of the Akiyo sequence, the average PSNR of the reconstructed sequence using the proposed method is 36.38 dB, whereas that using MPEG-4 SP is 34.44 dB, as shown in Fig. 7. The results obtained for Akiyo are rather surprising, in that we did not expect the objective quality of the proposed method to exceed that of MPEG-4 SP. Fig. 8 shows a comparison of reconstructed key frames from MPEG-4 SP and the proposed method. If many frames are skipped, as in the case of Akiyo and Container,

Table 3. Evaluation of encoding time and decoding time (in seconds)
                              Encoding Time                 Decoding Time
Method              Akiyo    Container   Stefan      Akiyo    Container   Stefan
MPEG-4 SP           14.64    19.56       33.08       3.09     3.21        4.04
Proposed (MCS=3)    11.90    18.87       31.81       2.98     2.99        4.19
Proposed (MCS=7)     8.42    14.78       31.63       2.48     3.07        4.15
Proposed (MCS=12)    7.78    13.04       32.22       3.10     3.04        4.45
Table 4. Comparison of Qp between MPEG-4 SP and the proposed method

Test sequence   Method              Qp   Coded frames   Bitstream size (bytes)
Akiyo           MPEG-4 SP           16   300            138,323
                Proposed (MCS=3)    12   215            134,330
                Proposed (MCS=7)     9   141            127,858
                Proposed (MCS=12)    8   124            125,174
Container       MPEG-4 SP           16   300            228,041
                Proposed (MCS=3)    14   225            210,438
                Proposed (MCS=7)    10   150            216,991
                Proposed (MCS=12)    9   124            218,381
Stefan          MPEG-4 SP           16   300            1,094,888
                Proposed (MCS=3)    16   289            1,074,832
                Proposed (MCS=7)    16   282            1,060,963
                Proposed (MCS=12)   16   280            1,049,128
Fig. 7. Comparison of PSNR in Akiyo
the improved visual quality of the encoded key frames is apparent. In the case of Stefan, not many frames are skipped because of the high motion; in such a case there is no apparent gain in Qp. In Fig. 9, screenshots of the first key frame, the interpolated frame, and the last key frame obtained with the proposed method are compared with the corresponding frames using MPEG-4 SP. In the Akiyo and Container sequences, the interpolated frames are better than the frames coded with MPEG-4 SP, in that there are fewer visual artifacts thanks to the better key frames. This does not necessarily mean that the interpolated frame is close to the original skipped frame; as explained earlier, the main objective of this work is to reproduce the interpolated frame with semantically acceptable visual quality. In the case of Stefan, the interpolated frame is worse than the frame coded with MPEG-4 SP, which means that frame skipping and interpolation are of little value for video sequences with high motion.
Fig. 8. Screenshots of the selected key frames: (a) Akiyo with MPEG-4 SP (Qp = 16); (b) Akiyo with the proposed method (MCS = 7, Qp = 9); (c) Container with MPEG-4 SP (Qp = 16); (d) Container with the proposed method (MCS = 7, Qp = 10); (e) Stefan with MPEG-4 SP (Qp = 16); (f) Stefan with the proposed method (MCS = 7, Qp = 16)
Fig. 9. Screenshots of the interpolated frames: (a) Akiyo; (b) Container; (c) Stefan. Each comparison shows three MPEG-4 SP decoded frames versus the proposed method's first key frame (decoded), interpolated frame, and last key frame (decoded).
We also conducted a subjective test (mean opinion score, MOS) with 40 viewers. The viewers were chosen from first-year undergraduate students taking C programming classes, so they cannot be regarded as image processing experts. Each viewer was given three video sequences: 1) the original test sequence and 2) the two test sequences coded with MPEG-4 SP and with the proposed method (in an arbitrary order). After viewing, the viewer scored the test sequences between one (very corrupted) and five (indistinguishable from the original). In this subjective test, we set the Qp value to 16 for MPEG-4 SP; for the proposed method, we used the cases in which MCS is set to 3 for all test sequences, with the Qp values shown in Table 4. Fig. 10 shows the MOS results of the blind test. In Akiyo and Container, the viewers gave higher scores to the proposed method, and the best and the worst scores for the proposed method are the same as or better than those for MPEG-4 SP. In the case of Stefan, a lower score was anticipated due to the difficulty of forming clusters. Considering that the interpolated frames using the proposed method are not
Fig. 10. MOS of the blind test
necessarily mathematically closer to the original frames than those using MPEG-4 SP, the subjective test results prove that good enough quality was reached using the proposed method.
4 Conclusion

In this paper, we proposed an adaptive key frame selection method that selects key frames for better interpolation at the decoder. The experimental results showed that the proposed method is simple and efficient. Because of its low complexity at both the encoder and the decoder side, the proposed method is well suited to real-time environments. In this paper, more focus has been given to the frame skipping part; yet good frame interpolation is as important as good frame skipping. Research on frame interpolation mechanisms that yield good enough visual quality with low complexity should be continued in the future.
References

1. ITU-T Study Group 16: Video codec test model, near-term, version 8 (TMN8). ITU Telecommunications Standardization Sector, Q15-A-59 (1997)
2. Song, H., Kuo, C.-C.J.: Rate control for low-bit-rate video via variable-encoding frame rates. IEEE Trans. Circuits Syst. Video Technol. 11(4), 512–521 (2001)
3. Pejhan, S., Chiang, T.-H., Zhang, Y.-Q.: Dynamic frame rate control for video streams. ACM Multimedia 1, 141–144 (1999)
4. Kuo, T.-Y.: Variable Frame Skipping Scheme Based on Estimated Quality of Non-coded Frames at Decoder for Real-Time Video Coding. IEICE Trans. Inf. and Syst. E88-D(12), 2849–2856 (2005)
5. Kuo, T.-Y., Kim, J., Kuo, C.-C.: Motion-compensated frame interpolation scheme for H.263 codec. In: Circuits and Systems, ISCAS 1999, vol. 4, pp. 491–494 (1999)
6. MPEG-4 Reference Software, http://mpeg.nist.gov/cvsweb/MPEG-4/MPEG4RefSoft/Video/natural/microsoft-2.5040207-NTU/
Application of Bayesian Network for Fuzzy Rule-Based Video Deinterlacing

Gwanggil Jeon¹, Rafael Falcon², Rafael Bello², Donghyung Kim³, and Jechang Jeong¹

¹ Department of Electronics and Computer Engineering, Hanyang University, 17 Haengdang-dong, Seongdong-gu, Seoul, Korea
{windcap315, jjeong}@ece.hanyang.ac.kr
² Computer Science Department, Universidad Central de Las Villas, Carretera Camajuani km 5 ½, Santa Clara, Cuba
{rfalcon, rbellop}@uclv.edu.cu
³ Radio and Broadcasting Research Division, Broadcasting Media Research Group, ETRI, 138 Gajeongno, Yuseong-gu, Daejeon, 305-700, Korea
[email protected]
Abstract. This paper proposes a fuzzy reasoning interpolation method for video deinterlacing. We propose edge detection parameters to measure the amount of entropy in the spatial and temporal domains. The shape of the membership functions is designed adaptively, according to those parameters and can be utilized to determine edge direction. Our proposed fuzzy edge direction detector operates by identifying small pixel variations in nine orientations in each domain and uses rules to infer the edge direction. We employ a Bayesian network, which provides accurate weightings between the proposed deinterlacing method and common existing deinterlacing methods. It successively builds approximations of the deinterlaced sequence by weighting interpolation methods. The results of computer simulations show that the proposed method outperforms a number of methods in the literature. Keywords: Deinterlacing, fuzzy reasoning, directional interpolation.
1 Introduction

Interlaced scanning is used in conventional broadcasting to prevent large-area flicker while maintaining a good vertical resolution. Nowadays, an increasing amount of video processing equipment needs to be transitioned from analog to digital. Moreover, flat panel displays (FPDs) such as TFT (thin film transistor), LCD (liquid crystal display), and PDP (plasma display panel) have become more common display devices than the CRT (cathode-ray tube) in the large-display market. Although FPDs have higher panel resolution, an interlaced signal cannot be displayed directly on an FPD [1]. Therefore, the demand for progressive material will increase, causing a directly proportional increase in the demand for video processing products with high-quality deinterlacing [2]. Deinterlacing converts each field into a frame, so that the number of pictures per second remains constant, while the number of lines per picture is doubled [3-8].
However, a common interlaced TV signal does not fulfill the demands of the Nyquist sampling theorem in the vertical direction, so linear sampling rate conversion theory cannot be utilized for effective interpolation. This causes several visual artifacts, which decrease the picture quality of the interlaced video sequence. For example, twitter artifacts occur with fine vertical detail, where pixels appear to twitter up and down; flicker artifacts occur in regions of high vertical frequency detail, causing annoying flicker; and an unwanted staircase effect occurs when diagonal edges move slowly in the vertical direction. To alleviate the above issues, we adopt fuzzy and rough set theory into the deinterlacing algorithm.

Recently, many different approaches adopting fuzzy reasoning have been proposed in the engineering domain. Fuzzy reasoning methods can be effective in image processing (e.g., filtering, interpolation, edge detection, and morphology) and have numerous practical applications. Michaud et al. [4] proposed a line interpolation method using an intra-field edge-direction detector to obtain correct edge information. However, correlations exist not only among pixels in the spatial domain but also among highly related fields in the temporal domain. Our algorithm is based on the above method; we further improve fuzzy edge detection by enlarging the set of detection angles and by using previous and next field information. We use a three-field spatio-temporal domain filter with linear interpolation and utilize the SMDW and TMDW parameters to design the membership functions of a fuzzy rule-based interpolator. After the membership functions are determined, the fuzzy approach to image edge detection is represented by the family of fuzzy rule-based interpolators. Finally, we employ a Bayesian network (BN) to provide accurate weightings between several deinterlacing methods.

In Section 2, we present the SMDW and TMDW parameters, which indicate high entropy in the spatial domain and high motion in the temporal domain, respectively, and we determine the membership functions. In Section 3, we present the structure of the fuzzy rule-based interpolator. In Section 4, we build the actual interpolation method using a BN. In Section 5, experimental results show the feasibility of the proposed approach; these results are compared with other well-known deinterlacing methods.
2 Proposed Membership Function

Edge-dependent interpolation is an important method for improving the resolution of intra- and inter-field interpolation. The causal neighbors in the training window can be classified into two classes: edge neighbors and non-edge neighbors. The effectiveness of any adaptive prediction scheme depends on its ability to adapt from smooth regions to edge regions. We define the spatial and temporal domain maximum difference over the window (SMDW and TMDW) parameters to measure the amount of entropy within a window of the current field and between fields. They are operators for extracting sketch features that contain edges, subject to local intensities. The two proposed edge-detection parameters, SMDW and TMDW, are given in (1) and (2).
SMDW = { max_{(i,j,k)∈W_S} x(i,j,k) − min_{(i,j,k)∈W_S} x(i,j,k) } × N_{W_S} / Σ_{(i,j,k)∈W_S} x(i,j,k)    (1)

TMDW = { max_{(i,j,k)∈W_T} x(i,j,k) − min_{(i,j,k)∈W_T} x(i,j,k) } × N_{W_T} / Σ_{(i,j,k)∈W_T} x(i,j,k)    (2)
Fig. 1. SMALL membership functions, plotted as μ_±θ versus Δ_±θ. The breakpoints are defined relative to b_0°: A: b_0°; B: b_±30° = (4/5)·b_0°; C: b_±45° = (3/4)·b_0°; D: b_±55° = (2/3)·b_0°; E: b_±60° = (1/2)·b_0°; F: a_0°,±30° = (1/4)·b_0°; G: a_±45°,±55°,±60° = (1/5)·b_0°.
We use 3-by-2 localized spatial and temporal windows to measure the entropy of the region. SMDW (or TMDW) is defined as the ratio of the difference between the maximum and minimum pixel values to the average pixel value in the 3-by-2 spatial (or temporal) window. According to Weber's law, the ratio of the increment threshold to the background intensity is a constant; here we regard the difference between the maximum and minimum pixel values as the increment threshold, and the average pixel value in the window as the background intensity. When the magnitude of SMDW (or TMDW) increases, we assume that the window contains edges in the spatial (or temporal) domain. W_S is a 3-by-2 spatial domain window including the three upper and three lower pixels, and W_T is a 3-by-2 temporal domain window including three pixels of the previous field and three pixels of the next field. Both N_{W_S} and N_{W_T} equal six, and x(i,j,k) denotes the intensity of a pixel.
In order to compute the value that expresses the size of the fuzzy derivative in a certain direction, we use the fuzzy set SMALL shown in Fig. 1. Each membership function is defined by two parameters: a^S_±θ (or a^T_±θ) is the threshold for unit membership, and b^S_±θ (or b^T_±θ) defines the upper bound of the function. The values a^S_±θ (or a^T_±θ) and b^S_±θ (or b^T_±θ) can be determined empirically, and they can all be induced from b^S_0 (or b^T_0). Thus, all that we need to do is to determine b^S_0 and b^T_0 as in (3):

b^S_0 = β × SMDW,   b^T_0 = γ × TMDW    (3)
μ^S_±θ = 1,                                           if 0 ≤ Δ^S_±θ < a^S_±θ
       = 1 − (Δ^S_±θ − a^S_±θ) / (b^S_±θ − a^S_±θ),   if a^S_±θ ≤ Δ^S_±θ < b^S_±θ
       = 0,                                           if b^S_±θ ≤ Δ^S_±θ

μ^T_±θ = 1,                                           if 0 ≤ Δ^T_±θ < a^T_±θ
       = 1 − (Δ^T_±θ − a^T_±θ) / (b^T_±θ − a^T_±θ),   if a^T_±θ ≤ Δ^T_±θ < b^T_±θ
       = 0,                                           if b^T_±θ ≤ Δ^T_±θ         (4)
where β and γ are amplification factors that affect the size of the membership functions and keep the b^S_0 and b^T_0 values between 0 and 255. The factors can be determined adaptively, depending on the properties of the sequence. Our proposed fuzzy reasoning method can be easily implemented using the above SMALL fuzzy set, whose membership function is defined by (4). The membership value μ^S_±θ (or μ^T_±θ) increases as Δ^S_{d,θ} (or Δ^T_{d,θ}) decreases below b^S_±θ (or b^T_±θ), and if it is lower than a^S_±θ (or a^T_±θ), the truth level μ^S_±θ (or μ^T_±θ) is equal to one.
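As a concrete illustration, the sketch below (Python/NumPy) computes SMDW and TMDW for a missing pixel from its 3-by-2 spatial and temporal windows and evaluates the SMALL membership of Eqn. (4). The exact placement of the windows around the missing line and the border handling follow our reading of the text, and the helper names are ours.

```python
import numpy as np

def mdw(window):
    """Maximum difference over a window, normalized by its mean (Eqns. 1-2):
    (max - min) * N_W / sum of the pixels."""
    w = window.astype(np.float64).ravel()
    return (w.max() - w.min()) * w.size / max(w.sum(), 1e-6)

def smdw_tmdw(prev_f, cur_f, next_f, i, j):
    """SMDW from the 3x2 spatial window (three pixels above and three below the
    missing pixel in the current field) and TMDW from the 3x2 temporal window
    (three pixels each in the previous and next fields). Borders are ignored."""
    ws = np.stack([cur_f[j - 1, i - 1:i + 2], cur_f[j + 1, i - 1:i + 2]])
    wt = np.stack([prev_f[j, i - 1:i + 2], next_f[j, i - 1:i + 2]])
    return mdw(ws), mdw(wt)

def small(delta, a, b):
    """SMALL membership function of Eqn. (4)."""
    if delta < a:
        return 1.0
    if delta < b:
        return 1.0 - (delta - a) / (b - a)
    return 0.0
```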
3 Edge Direction Detector

For the most part, fuzzy controllers follow three processing steps: fuzzification, rule inference, and defuzzification. We try to provide a robust estimate by applying fuzzy rules. A fuzzification process is used, since the inputs from the controlled system are crisp values; this operation covers the ignorance in the input measures by transforming a crisp input into a corresponding membership function. Here i refers to the column number, j to the line number, and k to the field number. Consider the pixel x(i,j,k) to be interpolated. The differences between the two pixels across x(i,j,k) along a given direction are

Δ^S_{d,θ}(i,j,k) = | x(i + d + σ^S_θ, j − 1, k) − x(i + d − σ^S_θ, j + 1, k) |    (5)

Δ^T_{d,θ}(i,j,k) = | x(i + d + σ^T_θ, j, k − 1) − x(i + d − σ^T_θ, j, k + 1) |    (6)
where Δ^S_{d,θ}(i,j,k) (or Δ^T_{d,θ}(i,j,k)) is the pixel variation in direction θ at the pixel x(i+d,j,k), d is the horizontal offset, and σ^S_θ (or σ^T_θ) is the orientation offset in the spatial (or temporal) domain. These inputs are converted into fuzzy variables represented by the name of the associated fuzzy set and a membership value. For the fuzzy detector, nine spatial domain and nine temporal domain fuzzy sets are used to characterize small pixel variations. The same membership function is used for directions θ and −θ for a symmetrical evaluation. Our SMALL_±θ membership functions were proposed in Section 2. After fuzzification, the fuzzified inputs A_m(Δ^S_±θ) and B_m(Δ^T_±θ) are simultaneously broadcast to all control rules to be compared with their antecedent membership
function parts C_m(Δ^S_±θ) and D_m(Δ^T_±θ). Thus, we can find the matching degree C_m(Δ^S_±θ) of A_m(Δ^S_±θ) AND SMALL_m(Δ^S_±θ), and D_m(Δ^T_±θ) of B_m(Δ^T_±θ) AND SMALL_m(Δ^T_±θ). The fuzzy rule base characterizes the control policy needed to infer fuzzy control decisions, which for our fuzzy detector are directions. The fuzzy reasoning adopted here is the max-min composition: the rules are implemented using the minimum to represent the AND operator and the maximum to represent the OR operator. A rule is a conditional statement in which the antecedents are the conditions and the consequence is a control decision. The conjunction of the antecedent membership values gives the truth level of the consequence of the rule; the minimum and the algebraic product are the fuzzy conjunction operators used with the fuzzy detector. All rules that have any truth in their premises fire and contribute to the output. Afterwards, the truth levels of identical consequences are unified using the fuzzy disjunction maximum. The fuzzy rules used in this paper are shown in Table 1.
Table 1. Rules for fuzzy edge direction in the spatial and temporal domain

Spatial domain edge direction

Fuzzy set      ANTECEDENT input for x(i,j,k)                                                   Cons. dirS(i,j,k)
SMALL_0°       Δ^S_{0,0°}                                                                      0°
SMALL_−30°     Δ^S_{−0.5,−30°} AND Δ^S_{+0.5,−30°}                                             −30°
SMALL_+30°     Δ^S_{−0.5,+30°} AND Δ^S_{+0.5,+30°}                                             +30°
SMALL_−45°     Δ^S_{−1,−45°}, Δ^S_{0,−45°} AND Δ^S_{+1,−45°}                                   −45°
SMALL_+45°     Δ^S_{−1,+45°}, Δ^S_{0,+45°} AND Δ^S_{+1,+45°}                                   +45°
SMALL_−55°     Δ^S_{−1.5,−55°}, Δ^S_{−0.5,−55°}, Δ^S_{+0.5,−55°} AND Δ^S_{+1.5,−55°}           −55°
SMALL_+55°     Δ^S_{−1.5,+55°}, Δ^S_{−0.5,+55°}, Δ^S_{+0.5,+55°} AND Δ^S_{+1.5,+55°}           +55°
SMALL_−60°     Δ^S_{−2,−60°}, Δ^S_{−1,−60°}, Δ^S_{0,−60°}, Δ^S_{+1,−60°} AND Δ^S_{+2,−60°}     −60°
SMALL_+60°     Δ^S_{−2,+60°}, Δ^S_{−1,+60°}, Δ^S_{0,+60°}, Δ^S_{+1,+60°} AND Δ^S_{+2,+60°}     +60°

Temporal domain edge direction

Fuzzy set      ANTECEDENT input for x(i,j,k)                                                   Cons. dirT(i,j,k)
SMALL_0°       Δ^T_{0,0°}                                                                      0°
SMALL_−30°     Δ^T_{−0.5,−30°} AND Δ^T_{+0.5,−30°}                                             −30°
SMALL_+30°     Δ^T_{−0.5,+30°} AND Δ^T_{+0.5,+30°}                                             +30°
SMALL_−45°     Δ^T_{−1,−45°}, Δ^T_{0,−45°} AND Δ^T_{+1,−45°}                                   −45°
SMALL_+45°     Δ^T_{−1,+45°}, Δ^T_{0,+45°} AND Δ^T_{+1,+45°}                                   +45°
SMALL_−55°     Δ^T_{−1.5,−55°}, Δ^T_{−0.5,−55°}, Δ^T_{+0.5,−55°} AND Δ^T_{+1.5,−55°}           −55°
SMALL_+55°     Δ^T_{−1.5,+55°}, Δ^T_{−0.5,+55°}, Δ^T_{+0.5,+55°} AND Δ^T_{+1.5,+55°}           +55°
SMALL_−60°     Δ^T_{−2,−60°}, Δ^T_{−1,−60°}, Δ^T_{0,−60°}, Δ^T_{+1,−60°} AND Δ^T_{+2,−60°}     −60°
SMALL_+60°     Δ^T_{−2,+60°}, Δ^T_{−1,+60°}, Δ^T_{0,+60°}, Δ^T_{+1,+60°} AND Δ^T_{+2,+60°}     +60°
The final step in the computation of the fuzzy filter is defuzzification, which converts the fuzzy output into crisp variables. To make the final decision about the edge direction at the pixel x(i,j,k), our fuzzy detector takes the direction with the maximum membership value, as described by (7) and (8). When several maximum membership values are equal, a priority order is used to make the final decision: 0°, -30°, 30°, -45°, 45°, -55°, 55°, -60°, and 60°. Finally, the interpolation is performed as in (9) and (10).
Direction^S(i,j,k) = arg max_θ ( μ^S_{dirS(i,j,k)=θ} )    (7)

Direction^T(i,j,k) = arg max_θ ( μ^T_{dirT(i,j,k)=θ} )    (8)

x_S(i,j,k) = 0.5 × { x(i + σ^S_θ, j − 1, k) + x(i − σ^S_θ, j + 1, k) }    (9)

x_T(i,j,k) = 0.5 × { x(i + σ^T_θ, j, k − 1) + x(i − σ^T_θ, j, k + 1) }    (10)
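To make the decision process of Table 1 concrete, the sketch below (Python/NumPy) evaluates the spatial rules: for each candidate direction it computes the differences of Eqn. (5) at the offsets listed in Table 1, combines their SMALL memberships with the minimum (fuzzy AND), and keeps the direction with the largest membership, using the priority order above for ties; the winning direction is then interpolated with Eqn. (9). The σ values in SPATIAL_RULES and the half-pel handling are our assumptions (the paper does not state them), and small() is the membership function from the sketch in Section 2.

```python
import numpy as np

# Candidate directions: assumed orientation offset sigma and the horizontal
# offsets d whose differences form the rule antecedent (Table 1).  The sigma
# values are our inference from the offsets in the table, not stated in the text.
SPATIAL_RULES = {
      0: (0.0, [0.0]),
    -30: (-0.5, [-0.5, 0.5]),                30: (0.5, [-0.5, 0.5]),
    -45: (-1.0, [-1, 0, 1]),                 45: (1.0, [-1, 0, 1]),
    -55: (-1.5, [-1.5, -0.5, 0.5, 1.5]),     55: (1.5, [-1.5, -0.5, 0.5, 1.5]),
    -60: (-2.0, [-2, -1, 0, 1, 2]),          60: (2.0, [-2, -1, 0, 1, 2]),
}
PRIORITY = [0, -30, 30, -45, 45, -55, 55, -60, 60]   # tie-break order from the text

def px(field, row, col):
    """Pixel fetch with linear interpolation for half-pel column positions."""
    c0 = int(np.floor(col))
    frac = col - c0
    v0 = float(field[row, c0])
    return v0 if frac == 0 else (1 - frac) * v0 + frac * float(field[row, c0 + 1])

def spatial_direction(field, i, j, a, b):
    """Fuzzy spatial edge-direction decision (Eqns. 5 and 7).  a and b map each
    direction to its a_theta / b_theta bounds derived from Fig. 1."""
    def diff(d, sigma):                                   # Eqn. (5)
        return abs(px(field, j - 1, i + d + sigma) - px(field, j + 1, i + d - sigma))
    best, best_mu = 0, -1.0
    for theta in PRIORITY:
        sigma, offsets = SPATIAL_RULES[theta]
        mu = min(small(diff(d, sigma), a[theta], b[theta]) for d in offsets)
        if mu > best_mu:                                  # earlier PRIORITY wins ties
            best, best_mu = theta, mu
    return best

def interpolate_spatial(field, i, j, theta):
    """Directional line average of Eqn. (9)."""
    sigma = SPATIAL_RULES[theta][0]
    return 0.5 * (px(field, j - 1, i + sigma) + px(field, j + 1, i - sigma))
```

The temporal detector of Eqns. (6), (8), and (10) follows the same pattern, with line j of the previous and next fields taking the place of lines j−1 and j+1 of the current field.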
4 BN Process

BNs are one of the best known ways to reason under uncertainty in artificial intelligence. With their graphical structure, BNs provide a comprehensive method of representing relationships and influences among nodes, and they represent joint probability distributions and expert knowledge in a compact way. Each entry of the joint probability table can be obtained as the product of the appropriate elements of the prior probabilities and conditional probability tables assigned to the nodes of the BN, by the chain rule in (11) [9].
Fig. 2. Proposed BN for the deinterlacing method. Root nodes: "SMDW is High?" (Z1 = z1, parameter η) and "TMDW is High?" (Z2 = z2, parameter κ); these feed the Inference step, the Pre-action node (Z3: Bob, Weave, FBNI), the Deinterlacing node (Z4), and the Utility.

P(z1, …, zn) = ∏_{i=1}^{n} P(zi | parent(Zi))    (11)
The proposed BN interpolator is shown in detail in Fig. 2. This BN is defined by a graph with four nodes in the domain. The root nodes (nodes without parents: Z1 and Z2) are associated with a prior probability distribution, and the non-root nodes (child nodes with parent nodes: Z3 and Z4) have local conditional probability distributions that quantify the parent-child probabilistic relationships.
Fig. 3. Histograms of (a) SMDW and (b) TMDW for the Akiyo, Flower, Foreman, Mobile, News, Stefan, and Table Tennis sequences (number of occurrences versus parameter value).
When no other evidence is available, the probability of an event occurring is its prior probability. Here, the prior probability of "SMDW is High?" may be expressed as P(SMDW is High?) = 0.5, meaning that, in the absence of any other evidence, SMDW has a 50% chance of being bigger than the average SMDW value in a field. Each row of a conditional probability table must sum to 1. The random variables of the domain take two values, High (H) and Low (L). The BN structure is built solely from domain expert knowledge. Fig. 3 shows the histograms of β·SMDW and γ·TMDW gathered from various video sequences; we observed that most β·SMDW and γ·TMDW values are smaller than 24. The percentages of pixels according to β·SMDW and γ·TMDW are given in Table 2. Consequently, the low γ·TMDW pixels are classified within the static area, and the remaining pixels within the motion area. The low
β·SMDW pixels are classified within the plain area; the others are classified within the edge area. Based on this classification, a different deinterlacing algorithm is activated for each region in order to obtain the best performance.
Table 2. Distribution of SMDW and TMDW (β = 100, γ = 100)

β·SMDW  γ·TMDW   Akiyo    Flower   Foreman  Mobile   News     Stefan   T. Tennis
H       H        9.79 %   26.79 %  6.82 %   37.79 %  16.96 %  5.74 %   23.88 %
H       L        6.11 %   11.88 %  10.21 %  15.66 %  12.78 %  17.01 %  21.69 %
L       H        2.20 %   5.48 %   2.24 %   7.09 %   3.31 %   7.12 %   5.44 %
L       L        81.90 %  55.85 %  80.73 %  39.46 %  66.95 %  70.13 %  48.99 %
Table 3. An Example of Conditional Probability Distribution for Z3

β·SMDW (Z1)  γ·TMDW (Z2)  PH (P(Z3|Z1,Z2) > 0.5)     PL (P(Z3|Z1,Z2) ≤ 0.5)     Interpolated pixel x(i,j,k)
Z1 = H       Z2 = H       (η+κ)>>1                   1−{(η+κ)>>1}               PL·xS(i,j,k) + PH·xT(i,j,k)
Z1 = H       Z2 = L       {η+(1−κ)}>>1               1−[{η+(1−κ)}>>1]           PL·xBob(i,j,k) + PH·xWeave(i,j,k)
Z1 = L       Z2 = H       {(1−η)+κ}>>1               1−[{(1−η)+κ}>>1]           PL·xBob(i,j,k) + PH·xWeave(i,j,k)
Z1 = L       Z2 = L       1−[{(1−η)+(1−κ)}>>1]       {(1−η)+(1−κ)}>>1           {xWeave(i,j,k) + xBob(i,j,k)}>>1

η = {log2(β·SMDW)} / 8,   κ = {log2(γ·TMDW)} / 8    (12)
The probability table for Z2 is built from κ, which represents the probability that "TMDW is High"; in the same manner, the probability table for Z1 is built from η, which represents the probability that "SMDW is High." The variables η and κ are obtained by (12), and both take values between 0 and 1. Each frame is passed through a region classifier, which classifies each missing pixel into one of four categories. Table 3 illustrates the conditional probabilities for Z3. The best way to interpolate the missing pixel is to select accurate weightings according to Z1 and Z2. Inference gives Utility the information regarding three different deinterlacing methods: Bob, Weave, and FBNI; the weighting of each method, from 0 to 1, is decided by the Utility. The BN topology in Fig. 2 expresses each entry of the joint probability table as (13):

P(z1, z2, z3, z4) = P(Z1 = z1) × P(Z2 = z2) × P(Z3 = z3 | Z1 = z1, Z2 = z2) × P(Z4 = z4 | Z3 = z3)    (13)
It is well known that Bob exhibits no motion artifacts and has minimal computational requirements. However, the input vertical resolution is halved before the image is interpolated, reducing the detail in the progressive image. On the other hand, the Weave technique results in no degradation of static images, but edges exhibit significant serrations, which is an unacceptable artifact in a broadcast or
professional television environment. Both require little complexity to interpolate a missing pixel. Our proposed FBNI interpolation algorithm is performed according to the rules of Table 3. Since it requires more computation time than Bob or Weave, we utilize the proposed interpolation algorithm only in areas with motion and complex edges. For example, if β·SMDW is 32 and γ·TMDW is 64, then η and κ become 5/8 and 6/8; the missing pixel is interpolated with a 31.25% weight on xS(i,j,k) and a 68.75% weight on xT(i,j,k). However, if β·SMDW is 8 and γ·TMDW is 4, the average of xBob(i,j,k) and xWeave(i,j,k) is used to reduce the computational burden. If β·SMDW is 32 and γ·TMDW is 8, η and κ become 5/8 and 3/8, giving a 62.5% weight to xWeave(i,j,k) and a 37.5% weight to xBob(i,j,k). Finally, if β·SMDW is 6 and γ·TMDW is 20, η and κ become (log2 6)/8 and (log2 20)/8, giving a 60.85% weight to xWeave(i,j,k) and a 39.15% weight to xBob(i,j,k).
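A compact way to read Table 3 is as a per-pixel switch on the two "High?" decisions, with η and κ from Eqn. (12) supplying the mixing weights. The sketch below (Python) follows that reading; how the High/Low decisions are made (the text suggests comparing against the field average, with prior 0.5) is left to the caller, and the function name is ours.

```python
import math

def bn_weighting(beta_smdw, gamma_tmdw, z1_high, z2_high):
    """Per-pixel weighting of Table 3.  eta and kappa follow Eqn. (12);
    z1_high / z2_high are the 'SMDW is High?' / 'TMDW is High?' decisions."""
    eta = math.log2(max(beta_smdw, 1.0)) / 8.0
    kappa = math.log2(max(gamma_tmdw, 1.0)) / 8.0
    if z1_high and z2_high:            # motion + complex edges: FBNI
        ph = (eta + kappa) / 2.0
        return ('FBNI', 1.0 - ph, ph)            # weights of (x_S, x_T)
    if not z1_high and not z2_high:    # plain, static area
        return ('average', 0.5, 0.5)             # plain Bob/Weave average
    ph = ((eta + (1.0 - kappa)) / 2.0 if z1_high
          else ((1.0 - eta) + kappa) / 2.0)
    return ('bob_weave', 1.0 - ph, ph)           # weights of (x_Bob, x_Weave)
```

For instance, bn_weighting(32, 64, True, True) returns FBNI weights of 0.3125 for xS and 0.6875 for xT, matching the first worked example above.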
5 Experimental Results

In this section, we conduct experiments on seven "real-world" sequences, with a field size of 352×288 for the objective evaluation and 176×144 for the subjective evaluation. The video sequences were sub-sampled by a factor of two in the vertical direction without anti-aliasing filtering, and the original progressive sequences were used as the reference against which we compare our algorithm. The amplification factors β and γ were both set to 100 in our experiments. After the deinterlacing process, the PSNR was chosen to provide an objective measure of each scheme's performance. Table 4 summarizes the average PSNR (dB) and computational CPU
Table 4. PSNR and average CPU time (seconds/frame) results of the different interpolation methods for the seven test sequences

Sequence    ELA                    Bob                    Weave                  Michaud's Method       Proposed Method
Akiyo       37.6815 dB / 0.0278 s  39.6985 dB / 0.0124 s  40.6748 dB / 0.0160 s  39.9255 dB / 0.1250 s  43.4401 dB / 0.1192 s
Flower      21.9325 dB / 0.0316 s  22.4077 dB / 0.0094 s  20.3169 dB / 0.0154 s  22.2700 dB / 0.1220 s  22.7945 dB / 0.1060 s
Foreman     31.3965 dB / 0.0342 s  30.6320 dB / 0.0096 s  30.0970 dB / 0.0124 s  31.1383 dB / 0.1282 s  32.3261 dB / 0.1030 s
Mobile      23.3409 dB / 0.0282 s  25.4945 dB / 0.0154 s  23.3613 dB / 0.0064 s  25.1571 dB / 0.1342 s  27.3653 dB / 0.1000 s
News        31.5308 dB / 0.0310 s  33.6650 dB / 0.0158 s  36.2949 dB / 0.0188 s  33.5498 dB / 0.1282 s  39.1069 dB / 0.1122 s
Stefan      25.9657 dB / 0.0346 s  27.5011 dB / 0.0156 s  31.0550 dB / 0.0154 s  27.2761 dB / 0.1342 s  31.6382 dB / 0.1094 s
T. Tennis   31.2361 dB / 0.0406 s  32.0153 dB / 0.0190 s  24.7550 dB / 0.0092 s  31.8828 dB / 0.1498 s  32.9398 dB / 0.1094 s
time (s) of each algorithm over the corresponding test sequences. The results show that the proposed algorithm outperforms the other methods on all of the chosen sequences in terms of PSNR. The proposed algorithm requires on average only 82% of the computational CPU time of Michaud's method, with a 2.14 dB average PSNR gain. We also found that our proposed method yields better visual quality with smoother edges; it reduces staircase artifacts, giving relatively satisfactory image quality.

For the subjective performance evaluation, the 51st frame of the QCIF Table Tennis sequence was chosen. Fig. 4 compares the visual performance of the proposed method with five major conventional methods: Bob, ELA, Michaud, Weave, and STELA. The main weak points of the conventional methods, in contrast to the proposed method, can be described as follows. Bob is a spatial domain method which does not use temporal information and shows no motion artifacts in moving regions (Fig. 4a). ELA exhibits no motion artifacts either, with relatively small computational requirements (Fig. 4b). However, neither Bob nor ELA works properly on complex structures, and edges are severely degraded and blurred; because the edge detector may find an incorrect edge direction, artifacts appear and visual quality deteriorates. For example, blurred artifacts are visible on the boundaries of the table in Figs. 4(a-b). Since Michaud's method uses spatial information only, no motion artifacts appear, as shown in Fig. 4(c). These three spatial domain methods share the problem that the input vertical resolution is halved before the image is interpolated, reducing the detail in the progressive image.

Weave is a temporal domain method which causes no degradation in stationary images, as shown in Fig. 4(d). Its processing requirements are slightly lower than those of Bob, and it provides better PSNR in stationary regions; the boundaries of the table, which lie within a stationary region, show the best performance among the evaluated methods. However, the boundaries of moving objects exhibit significant serrations, which is an unacceptable artifact in a broadcast or professional television environment.

STELA is a spatio-temporal domain method which can estimate the motion vector to be zero in regions without motion, allowing perfect reconstruction of the missing pixels with no degradation. However, the vertical detail of STELA is gradually reduced as the temporal frequency increases, because the vertical detail from the previous field is combined with the temporally shifted current field, producing some motion blur. STELA provides relatively good performance, eliminates the blurring effect of bilinear interpolation, and gives both sharp and straight edges, but due to misleading edge directions, interpolation errors often grow in areas with high-frequency components. In Fig. 4(e), flickering artifacts occur only where there is motion or an edge, and the feathering effect appears on the boundaries of the hand.

The processing requirement of the proposed method is higher than that of the conventional methods, but with the advantage of higher output image quality. Fig. 4(f) shows the result of the proposed method, which offers the best subjective quality of all methods and enhances edge preservation and edge sharpness
Fig. 4. Subjective quality comparison on the 51st frame of the QCIF Table Tennis sequence: (a) Bob; (b) ELA; (c) Michaud; (d) Weave; (e) STELA; (f) Proposed method
after deinterlacing. From these experimental results, the proposed method demonstrated good objective and subjective quality for the different sequences, while requiring low computational CPU time, compatible with real-time processing. Moreover, our proposed method performed well on all seven sequences, indicating that incorporating motion information into deinterlacing can help boost video quality.
6 Conclusion

A new fuzzy reasoning interpolation method for video deinterlacing was proposed in this paper. Through the parameters SMDW and TMDW, the membership functions are designed adaptively. Our interpolator employs fuzzy reasoning to alleviate resolution degradation; it operates by identifying small pixel variations along nine orientations in each domain and uses rules to infer the edge direction. A BN provides accurate weightings between several interpolation methods. Detection and interpolation results were presented, and the results of computer simulations show that the proposed method is able to outperform a number of methods in the literature.
Acknowledgment This research was supported by Seoul Future Contents Convergence (SFCC) Cluster established by Seoul R&BD Program.
References 1. Jack, K.: Video Demystified A Handbook for the Digital Engineer. Elsevier, Oxford, UK (2005) 2. Bellers, E.B., Haan, G., De De-interlacing, A.: De-interlacing: A Key Technology for Scan Rate Conversion. Elsevier, Amsterdam, The Netherlands (2000) 3. De Haan, G., Bellers, E.B.: Deinterlacing - An overview. Proceedings of the IEEE 9, 1839– 1857 (1998) 4. Michaud, F., Dinh, C.T., Le, L.G.: Fuzzy detection of edge-direction for video line doubling. IEEE Trans. on Circuits and Systems for Video technology 3, 539–542 (1997) 5. Doyle, T.: Interlaced to sequential conversion for EDTV applications. In: Proc. 2nd Int. Workshop Signal Processing of HDTV, pp. 412–430 (1990) 6. Swan, P.L.: Method and apparatus for providing interlaced video on a progressive display, U.S. Patent 5 864 369 (January 26, 1999) 7. Bellers, E.B., de Haan, G.: Advanced de-interlacing techniques. In: Proc. ProRisc/IEEE Workshop on Circuits, Systems and Signal Processing, Mierlo, The Netherlands, pp. 7–17 (1996) 8. Oh, H.-S., Kim, Y., Jung, Y.-Y., Morales, A.W., Ko, S.-J.: Spatio-temporal edge-based median filtering for deinterlacing. In: IEEE International Conference on Consumer Electronics, ICCE 2000, pp. 52–53 (2000) 9. Russell, S., Norvig, P.: Artifical intelligence a modern approach, Upper Saddle River. Prentice Hall, NJ (1995)
Markov Random Fields and Spatial Information to Improve Automatic Image Annotation Carlos Hern´andez-Gracidas and L. Enrique Sucar National Institute of Astrophysics, Optics and Electronics, Luis Enrique Erro #1, Sta. Mar´ıa Tonantzintla, Puebla, M´exico
[email protected],
[email protected] http://ccc.inaoep.mx
Abstract. Content-based image retrieval (CBIR) is currently limited because of the lack of representational power of the low-level image features, which fail to properly represent the actual contents of an image, and consequently poor results are achieved with the use of this sole information. Spatial relations represent a class of high-level image features which can improve image annotation. We apply spatial relations to automatic image annotation, a task which is usually a first step towards CBIR. We follow a probabilistic approach to represent different types of spatial relations to improve the automatic annotations which are obtained based on low-level features. Different configurations and subsets of the computed spatial relations were used to perform experiments on a database of landscape images. Results show a noticeable improvement of almost 9% compared to the base results obtained using the k-Nearest Neighbor classifier. Keywords: Spatial relations, Markov random fields, automatic image annotation, content-based image retrieval.
1
Introduction
Considerable amounts of digitally stored visual information are available for their use in a number of different applications. Regarding this information, it is a frequent necessity to retrieve image subsets which fulfill certain criteria, in most of the cases concerning the visual contents of the image itself. Objects with specific physical characteristics, performing a given action or in a given position, are some examples of possible queries for image retrieval. Also, a desirable feature is the ability to retrieve images where the objects interact in a particular way, which is an even more complicated form of query. Unfortunately, most state of the art image retrieval systems are based on low-level features like color, texture, shape, or on the other hand, based on image captions assigned by humans. In the first case, retrieval is ineffective due to the lack of semantic information coming from the image; in the second case, often better results are obtained, but with D. Mery and L. Rueda (Eds.): PSIVT 2007, LNCS 4872, pp. 879–892, 2007. c Springer-Verlag Berlin Heidelberg 2007
880
C. Hern´ andez-Gracidas and L.E. Sucar
the need of manual annotations; and for huge databases, manual annotation is a time consuming task which cannot always be performed correctly. Content-based image retrieval (CBIR) is the use of computer vision to analyze the actual contents of images (by using their color, texture, shape or any other information derived from the images themselves), applied to the retrieval of images by their semantics. Spatial relations are useful to know the relative position of an object in a scene by using other objects in the same scene as reference. It seems almost obvious that by applying spatial information CBIR will automatically improve results, but the interesting questions are: How to do it? Which of all the possible relations can be useful? It is important to notice that we do not suggest that spatial information will suffice to obtain an efficient image retrieval, on the contrary, this research is intended to encourage the use of it as a complementary source of key information. The number of fields where spatial relations could be applied is by itself an important motivation. A few examples are: medical imagery analysis, geographic information systems (GIS) and image retrieval. In this paper we follow an approach based on Markov random fields (MRFs) to represent the information about the spatial relations among the regions in an image, so the probability of occurrence of a certain spatial relation between each pair of labels could be used to obtain the most probable label for each region, i.e., the most probable configuration of labels for the whole image. Spatial information extracted from training images is fused with “expert” knowledge to represent the information coming from the neighbors. Spatial relations are divided in this study in three groups: topological relations, horizontal relations an vertical relations. Experiments with each of these groups incorporated individually and with the three groups used at the same time were performed in order to determine their relevance in the final results. Different configurations were also used in the experiments. Results were obtained on a database of landscape images and they show a noticeable improvement of almost 9% compared to the base results obtained using the k-Nearest Neighbor (kNN) classifier. Since this work is proposed as an improvement to a basic classification algorithm, it is expected that if the annotation algorithm used provides better results, they can be improved as well and an even higher accuracy can be reached. The structure of this paper is as follows. Section 2 reviews basic concepts on image segmentation, spatial relations and MRFs. Section 3 summarizes related work in this field. Section 4 presents the methodology followed. Section 5 describes how the experiments were performed and the results obtained. Finally, in section 6 we present our conclusions and the future research to be done.
2
Fundamentals
In this section we present definitions and basics of automatic image segmentation, automatic image annotation, spatial relations and Markov random fields.
Markov Random Fields and Spatial Information
881
Fig. 1. Example of the results of automatic image segmentation using Normalized cuts. Left: an oversegmented image. Right: important objects are incorrectly segmented in the image.
2.1
Automatic Image Segmentation
Segmenting an image is partitioning that image into several regions, which are determined by their local features like color and texture. In general, automatic segmentation algorithms like Normalized cuts [1] usually tend to produce erroneously delimited regions with results like the ones shown in Figure 1. In the first image, the elephant and the grass are oversegmented, providing more segments than the necessary; in the second case, important objects in the image are incorrectly segmented, making of this an almost useless segmentation. These errors affect directly the performance of automatic annotation algorithms. However, it is important to mention that the emphasis of our work is not on segmentation improvements. 2.2
Automatic Image Annotation
Automatic image annotation (AIA) is a process that has been commonly used to support image retrieval, though results are not quite accurate at this time. Automatic image annotation is the task of automatically assigning annotations or labels to images or segments of images, based on their local features. Given the size of most image databases, image annotation is frequently performed by automatic systems, and this task, though necessary, is currently poorly performed given the difficulty and complexity of the extraction of adequate features which allow to generalize and distinguish an object of interest from others with similar visual properties. Erroneous labeling of regions is a common consequence of the lack of a good characterization of the classes by low-level features. 2.3
Representing Spatial Relations
Spatial relations provide relevant high-level information about the way elements interact in a scene. They are useful to know the relative position of an object in a scene with respect to other reference objects. Given two objects of the classes A and B, it is feasible to think that, depending on their kind, object of the class A can relate to object of the class B
882
C. Hern´ andez-Gracidas and L.E. Sucar
in a limited number of ways, and that some spatial relations will not be valid between these two objects. If we assume that most of the automatically assigned annotations are correct, then it is feasible to correct the mistaken ones. This binary method of evaluating region annotations by their spatial relations, classifying them as valid or not valid, can be extended by means of probabilities. In this way, the more frequent a relation between objects of the class A and objects of the class B, the higher the associated probability will be; conversely, the less frequent a relation between A and B, the closer its probability value will be to 0. This probability measure allows to obtain a global optimal configuration, i.e., the set of annotations (one for each region) which according to the spatial relations among regions, provides the highest probability of being globally correct in the image. It is important to notice that the spatial relations are independent from the annotations, and consequently they are not affected by the quality of such annotations. The more correct annotations we have, the more reliable our corrections will be. Same as with the number of annotations, the more objects, and the more relations among these objects, the more information about coherence of these annotations can be inferred. Spatial relations are divided into: 1. Topological relations: They are preserved under rotation, scaling and translation. Examples of them are: overlapped and contained by. 2. Order relations: They are preserved under scaling and translation but change under rotation. They are based on the definition of order. Some examples are: above-below and left of-right of. 3. Metric relations: They change under scaling but are unaffected by rotation and translation. They measure distance and direction. Some examples are: 2 miles away and 30 meters around. 4. Fuzzy relations: They are measured in vague terms, and consequently, are difficult to quantify. Examples of them are: near and far. Spatial relations provide important information in domains such as GIS, Robotics and CBIR; where the location of an object implies knowledge about the geographic position of a certain place, the possible path to follow by a robot or the contents of an image to be retrieved. 2.4
Markov Random Fields
Markov Random Fields [2,3] are probabilistic models which combine a priori knowledge given by some observations, and knowledge given by the interaction with neighbors. Let F = {F1 , F2 , . . . , Fn } be random variables on a set S, where each Fi can take a value fi in a set of labels L. This F is called a random field, and the instantiation of each of these Fi ∈ F as an fi , is what is called a configuration of F , so, the probability that a random variable Fi takes the value fi is denoted by P (fi ), and the joint probability is denoted as P (F1 = f1 , F2 = f2 . . . , Fn = fn ).
Markov Random Fields and Spatial Information
883
A random field is said to be an MRF if it has the properties of positivity and markovianity. The joint probability can be expressed as P (f ) =
e−Up (f ) Z
(1)
where Z is called the partition function or normalizing constant, and Up (f ) is called the energy function. The optimal configuration is found by minimizing the energy function Up (f ) obtained by Up (f ) = Vc (f ) + λ Vo (f ) (2) c
o
Vc (f ) and Vo (f ) are called the potential functions, where Vc (f ) stands for the information coming from the neighboring nodes, Vo (f ) represents the information coming from the observations, and λ is a constant used to weight the observations and the information from the neighbors, giving each a relative importance with respect to the other. The optimal configuration is obtained when the value of Up (f ) with minimal energy is found for every random variable in F . An MRF can also be seen as an undirected graph G = (V, E), where each vertex v ∈ V represents a random variable, and each edge u, v ∈ E determines that nodes u and v are neighbors.
3
Related Work
In this section we give a general perspective on how spatial relations are used in previous works and how they are applied on image retrieval and other tasks. One of the first attempts to consider spatial information (combined with temporal information) is introduced by Allen [4]. Several topological models have been developed, from which the most used are the 4-Intersection Model [5] and its extensions, such as the 9-Intersection Model [5] and the Voronoi-Based 9Intersection Model [6]. A deductive system for deriving relations in images is introduced by [7] where a set of rules is proposed to deduce new relations from a basic set of relations. This approach can be used to extend queries when images are searched in a retrieval system. This is is intended to complement text-based search systems, but assumes spatial relations are somehow existent in the image since it provides no form of computing them from images. The system of rules is shown to be complete for the 3D case but incomplete for 2D. Studies like [8,9] focus on the problem of using spatial relations in CBIR related tasks. Basic image retrieval systems using some kind of spatial information are shown in [10,11,12]. In [10] another method to extend textual information is proposed. They suggest a way to complement image annotations by semi-automatically adding spatial information about annotated objects in the image. A human needs to be involved since the objects are assumed to be well segmented and well annotated; then spatial relations are computed and annotations are complemented with this information. They provide a study about the relative relevance of spatial relations
884
C. Hern´ andez-Gracidas and L.E. Sucar
based on the way people tend to use them. In [11] a system for image retrieval using spatial information as a complementary element is shown. They implement a web crawler which uses textual information from the web pages retrieved and from the image names, and complements it with low-level features like color and high-level features like spatial information. However, as in other similar works, human interaction is necessary to recognize objects and adequately label them so the images can be stored in the database, which significantly limits the usefulness of this search tool. Queries are performed on the database by using hand sketches or sample images and human interaction is also required in this process to determine objects and their relevance in the image before the query is processed. In [12] a retrieval system based on a set of basic spatial relations is proposed using a matching system to measure similarity between a pair of images and using automatic image segmentation and annotation. They propose the use of six spatial relations and show experiments using a limited set of labels and images where objects like grass, clouds, trees and sky are contained. The already existent methods are insufficient to apply them directly in the solution of the problem of CBIR, nor are they suitable for our purpose of improving AIA. The reason is that these methods focus mostly on topological relations and other important spatial relations which provide interesting information are usually discarded. In the few cases where non-topological relations are considered, they are used in a very basic way, like simple image comparison.
4
Improved Image Annotation
In this work we make use of an automatic segmentation system to divide images into regions, and an automatic annotation system to assign potential labels to each region. These regions and labels are validated by means of the spatial relations among the regions themselves, and if that is the case, modified to harmonize within the image. We claim that we can improve AIA by iteratively validating and correcting these intermediate processing steps. The methodology, depicted in Figure 2, is the following: 1. The image is automatically segmented (using Normalized cuts). 2. The obtained segments are assigned with a list of labels and their probabilities (computed with the kNN algorithm). Concurrently, the spatial relations among the same regions are computed. 3. The labels are checked for consistency by using spatial information based on MRFs. The labels with Maximum A-Posteriori Probability (MAP) are obtained for each region. 4. Adjacent regions with the same label are joined. As mentioned before, this is an iterative process, and steps 2 to 4 may be repeated until the system stabilizes. The spatial relations considered in this work are shown in Table 1. Considering the image as a graph, where each node of the graph represents a region, and the spatial relations are represented by edges joining these nodes, then the relations can also be divided into: directed and undirected. Table 1 also shows this
Markov Random Fields and Spatial Information
885
Fig. 2. Block diagram of the proposed methodology. In 1 the image is segmented with the Normalized cuts algorithm; in 2a each region is assigned with a list of labels and an associated probability for each label, at the same time, in 2b spatial relations among regions are computed; in 3 the labels are improved with the use of MRFs and spatial information; finally, in 4, and using these improved labels and the adjacency relations, if two or more adjacent regions have the same label they are joined.
separation. Observe the distinction made between the horizontal relation beside (which is a generalization of left and right in order to convert these two originally directed relations into a single undirected relation), and the use of above and below as separate relations. The reason for this decision is that for the kind of images to be analyzed (landscapes, as in the Corel data set), knowing if an object is left or right of another object is irrelevant, but knowing if it is above or below is considered to give important information about the coherence of the image annotations. This is not the case for every application domain, since, for example, in the case of medical imagery, knowing if a certain organ is left, right, above or below another, certainly gives crucial information for the interpretation of the image. These relations can be grouped in 3 sets: topological relations, horizontal relations and vertical relations. In the case of horizontal relations and vertical relations we separate order relations into these two groups, given the fact that an object can be related to another in both ways (for example, an object A can be at the same time above and beside another object B).
886
C. Hern´ andez-Gracidas and L.E. Sucar
Table 1. Spatial relations used in this work. Relations are divided as: topological, horizontal and vertical; and also as directed and undirected. Directed Topological relations
1 2 Horizontal relations 3 4 Order relations 5 Vertical relations 6 7
Undirected Adjacent Disjoint Beside (either left or right) Horizontally aligned
Above Below Vertically aligned
An important feature of these three sets is that for each there will be one and only one relation between every pair of regions, since in each group the relations are exclusive among the rest (an object A cannot be above and below an object B at the same time). In our notation we use the term XOR (⊕) to represent this characteristic, meaning that only one of the values in the corresponding group is taken into account, but no more than one at the same time. It must be remembered that in this particular case, XOR does not mean a logic XOR, but a notation for value exclusion. Given that for each group there will be one and only one relation, we can infer that each group defines by itself a complete graph, i.e., a graph where all regions are connected to all others by exactly one edge representing a relation. There are some obstacles for the direct use of MRFs in the solution of this problem. The first one is that in traditional MRFs, every edge determining vicinity must be an undirected edge, and some important relations are directed by nature, and though they can be generalized in an undirected fashion, important information may be lost in the process (like in the above and below cases), so it would be desirable to be able to keep such information; the second obstacle is the presence of more than one relation between each pair of regions, since MRFs are defined by at most one edge connecting any pair of nodes. These points force to extend the use of MRFs in order to adequate them to be used in this kind of application. If we structure the spatial relations as follows, we can provide a new energy function Up (f ) using these different relations at the same time. Each Rij represents a spatial relation in one of the groups: Rij (T ) ∈ {1, 2} − − − Adjacent,
Disjoint
Rij (H) ∈ {3, 4} − − − Beside, Horizontally Aligned Rij (V ) ∈ {5, 6, 7} − − − Below, Above, Vertically Aligned Using these three groups of relations, the energy function is: Up (f ) = α1 VT (f ) + α2 VH (f ) + α3 VV (f ) + λ
o
Vo (f )
(3)
Markov Random Fields and Spatial Information
887
where VT , VH and VV are potential functions computed from the topological, horizontal and vertical relations, respectively. They can be obtained as inversely proportional to the sum of the probabilities in each group of relations. In the energy formula in Equation 2, only lambda (λ) is used for the purpose of weighting the relative value of Vc and Vo . The use of three alphas (α1 , α2 and α3 ) in Equation 3, allows to give a different weight to each group of relations, with the premise that they do not have the same relevance. Given the fact that the best configuration will be the one giving the lowest energy value, the more relevant a relation, the higher its associated alpha value. We define now the energy functions. In these functions, each Pkc represents the probability of relation k. The XOR operators depend on the value taken by the Rij related. 1 c P1c (f ) ⊕ P2c (f ) 1 VH (f ) = c P3c (f ) ⊕ P4c (f ) 1 VV (f ) = P (f ) ⊕ P 6c (f ) ⊕ P7c (f ) c 5c VT (f ) =
In order to compute each Pkc , a combination of information extracted from training images and “expert” knowledge is used. First, training images which are already segmented and manually labeled, are examined to determine the presence of spatial relations and their frequency. For each relation, k, a matrix Relk (i, j) of D × D components is created, where D is the number of possible labels for a region. An equivalent matrix Ek (i, j) is created for each relation incorporating a priori knowledge of each relation based on subjective estimates. These two estimates are linearly combined as follows [13]:
Pkc (f ) =
Relk (i, j) + δEk (i, j) N R(i, j) + δ100
(4)
where δ is a constant used to determine the relevance of expert knowledge with respect to the training data, and N R(i, j) is the number of times labels i and j appeared together in an image. The use of Ek (i, j) also serves as a smoothing technique to minimize the number of 0’s in the original Relk (i, j) matrices. To obtain the “best” configuration, the MRFs are solved by computing the MAP using simulated annealing [14] with temperature (T ) decremented as follows:
T =
T log(100) log(100 + j)
where j is the number of the iteration in the MRFs.
(5)
888
C. Hern´ andez-Gracidas and L.E. Sucar Table 2. The set of labels for CorelA airplane bird boat church cow elephant
5
grass mountains sky ground pilot snow horse road trees house rock water lion sand log sheep
Experiments and Results
To evaluate our method we used the Corel database, and particularly, the subset CorelA1 developed by [15] and consisting of 205 images. This data set was divided into 137 training images and 68 test images. This database portraits landscapes, containing mainly elements in a set of 22 different possible annotations, which are shown in Table 2. The advantage of using this database is that besides being already segmented, it counts with labeled regions, so the time-consuming task of hand-labeling the regions for the experiments was already performed, allowing for experimentation only on image annotation. The experiments were performed using as annotation system a kNN classifier as implemented in [16], which is a simple but efficient instance-based learning algorithm. We must mention the special case when a region is unknown (label 0). These values affect our spatial approach since no spatial information can be obtained from them. If the region being examined is unknown, values of P1c to P4c are set to 12 , and values of P5c to P7c are set to 13 , to reflect the lack of a priori knowledge coming from neighbors in such cases. Experiments were performed dividing them in four groups: tests using only one group of spatial relations individually, and tests using the three groups of spatial relations simultaneously. For each of these groups three variations were also tested: the use of no smoothing technique, the use of a simple Laplacian smoothing, and the use of expert knowledge as initialization and smoothing technique. Several parameters had to be set: λ, δ, α1 , α2 , α3 , T and the number of iterations (n) for the MRFs. Figure 3 shows the effect of changing λ and δ values using our approach; considering the three groups of spatial relations and fixing the remaining parameters to their optimal values. We can see how important the prior observations are, since setting λ=0 makes annotation rates fall to approximately 25.3%. However, a value of λ set too high makes rates oscilate around 43.5%. The highest annotation rates were reached with λ set to values close to 0.25, and this is the value used for the experiments. Experimental tests showed the necessity of an initialization either by an expert or by a Laplacian smoothing; although it is not clear what the ideal value for δ would be (as it is shown, also in Figure 3). When δ was set to a value of 0.25, the 1
Available at http://www.cs.ubc.ca/˜pcarbo/
Markov Random Fields and Spatial Information
889
Fig. 3. Left: the variation of annotation accuracy with values of λ in the interval (0,4) and incremented by 0.02. Right: the variation of annotation accuracy with respect to δ values in the same interval and with the same increments.
best annotation rates were reached, so, this is the value used for the experiments. The idea behind using α1 , α2 and α3 was to be able to give a different relative weight to each relation group with the premise that they will have different relevance in the annotation result (which is partially confirmed by our experiments). However, setting these values requires more investigation, and for the experiments they were equally fixed to a value of 1, leaving for a future research their estimation using a more sophisticated heuristic. Values for T and n were also set by experimentation, fixing them to 116 and 400, respectively. The approximate execution time with this value of n is of 35 seconds for the 68 test images. Experimental results are shown in Table 3. To calculate the accuracy of the method, we ran 10 times each test and proceeded to eliminate “peaks”, i.e., we Table 3. Results obtained with the use of MRFs and spatial information with the different groups of relations and smoothing types for the test images in the CorelA database. The best results were obtained when the three groups of spatial relations are used together (in bold). Algorithm Relation group Smoothing Accuracy Improvement Rel. improvement kNN None None 36.81% None 42.72% 5.91% 16.04% Topological Laplacian 43.51% 6.70% 18.19% Expert info. 43.25% 6.44% 17.49% None 41.72% 4.91% 13.34% MRFs Horizontal Laplacian 43.08% 6.27% 17.02% Expert info. 43.58% 6.76% 18.38% None 43.73% 6.92% 18.80% Vertical Laplacian 44.93% 8.12% 22.06% Expert info. 44.88% 8.07% 21.92% None 43.29% 6.47% 17.58% All Laplacian 45.41% 8.60% 23.37% Expert info. 45.64% 8.82% 23.97%
890
C. Hern´ andez-Gracidas and L.E. Sucar
Table 4. Comparison of our results with other methods. The last two rows show our base result and our best result using MRFs and all the spatial relations with expert knowledge (MRFs AREK). Algorithm Accuracy gML1[15] 35.69% gML1o[15] 36.21% gMAP1[15] 35.71% gMAP1MRF[15] 35.71% kNN 36.81% MRFs AREK 45.64%
discarded the highest result and the lowest result and obtained the average of the 8 remaining results. These experiments show no significant difference between an expert initialization and a Laplacian smoothing, probably because these values were not well estimated by the expert or because the training images contained already sufficient information. Individual experiments using only one group of relations at a time, show that the more significant results are obtained when vertical relations are used, which proves the usefulness of these undirected relations on this particular domain, with an improvement of 8.12% and a relative improvement of 22.06% Experiments show that the individual use of each group of spatial relation certainly improves results, but the highest accuracy rates are reached when they are used simultaneously, showing in the best case an improvement of 8.82% and a relative improvement of 23.97%.
Fig. 4. Some examples of the result of joining regions after their annotations are improved. Left: the originally segmented images. Right: the resulting segmentations after joining adjacent regions with the same labels.
Markov Random Fields and Spatial Information
891
As a way of comparing our results with other methods, we show in Table 4 a comparison with three state of the art automatic annotation methods. These results are obtained with the code developed by [15]. It can be observed that our base result is similar in performance to those we are comparing to, but at the same time, our proposed method is at least 9% more accurate than any of them. Although improving segmentation was not our main goal, we performed a simple experiment with interesting results. After improving the initial annotations with our method, adjacent regions with the same label were joined. A couple of examples of this are shown in Figure 4. We found that several other image segmentations were improved the same way.
6
Conclusions and Future Work
We proposed a novel methodology to improve AIA based on spatial relations and MRFs. It combines several types of spatial relations under an MRF framework, where the potentials are obtained using subjective estimates and real data. We concluded that it is feasible to apply spatial relations to improve automatic image annotation systems. Our experiments show that an important number of the labels are corrected by using MRFs and the spatial relations among regions. In experiments with the CorelA database, a significant improvement of almost 9% and a relative improvement of almost 24% were obtained with respect to the original annotations. Further experiments must be performed to clarify the relevance of each relation group and also to evaluate the advantage of using expert estimations. Also, a more sophisticated way of determining optimal values for parameters, like the use of evolutionary methods [17], is a possible future line of research. The iteration of steps 2 to 4 of the method as we suggested, should also provide better results. Using a different annotation algorithm is an alternative that might provide better results. An interestingly different approach to the use of MRFs for more than one relation would be the use of interacting MRFs [18], and finding the way this interaction should be performed represents a motivating challenge. The application in other image domains may show the generality of the method here proposed, and should also confirm our hypothesis that the importance of some relations varies depending on the domain. Medical imagery, GIS, automatic robot navigation, are some of the potential future applications of our method.
References 1. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 888–905 (2000) 2. Preston, C.: Gibbs States on Countable Sets. Cambridge University Press, Cambridge (1974) 3. Spitzer, F.: Random Fields and Interacting Particle Systems. Mathematical Association of America (1971)
892
C. Hern´ andez-Gracidas and L.E. Sucar
4. Allen, J.F.: Maintaining knowledge about temporal intervals. Communications of the ACM 26(11), 832–843 (1983) 5. Egenhofer, M., Sharma, J., Mark, D.: A critical comparison of the 4-intersection and 9-intersection models for spatial relations: Formal analysis. In: R.B., M., M., A. (eds.) Eleventh International Symposium on Computer-Assisted Cartography, Auto-Carto 11, Minneapolis, Minnesota, USA, pp. 1–11 (1993) 6. Chen, J., Li, Z., Li, C., Gold, C.: Describing topological relations with voronoibased 9-intersection model. International Archives of Photogrammetry and Remote Sensing 32(4), 99–104 (1998) 7. Sistla, A., Yu, C., Haddack, R.: Reasoning about spatial relationships in picture retrieval systems. In: Bocca, J., Jarke, M., Zaniolo, C. (eds.) VLDB 1994. Twentieth International Conference on Very Large Data Bases, Santiago, Chile, pp. 570–581 (1994) 8. Zhang, Q., Yau, S.: On intractability of spatial relationships in content-based image database systems. Communications in Information and Systems 4(2), 181–190 (2004) 9. Zhang, Q., Chang, S., Yau, S.: On consistency checking of spatial relationships in content-based image database systems. Communications in Information and Systems 5(3), 341–366 (2005) 10. Hollink, L., Nguyen, G., Schreiber, G., Wielemaker, J., Wielinga, B., Worring, M.: Adding spatial semantics to image annotations. In: SemAnnot 2004. Fourth International Workshop on Knowledge Markup and Semantic Annotation, Hiroshima, Japan. LNCS, Springer, Heidelberg (2004) 11. Rathi, V., Majumdar, A.: Content based image search over the world wide web. In: Chaudhuri, S., Zisserman, A., Jain, A., Majumder, K. (eds.) ICVGIP 2002. Third Indian Conference on Computer Vision, Graphics and Image Processing, Ahmadabad, India (2002) 12. Ren, W., Singh, M., Singh, S.: Image retrieval using spatial context. In: Wellstead, P. (ed.) IWSSIP 2002. Ninth International Workshop on Systems, Signals and Image Processing, Manchester, UK, vol. 18, pp. 44–49 (2002) 13. Neapolitan, R.: Probabilistic Reasoning in Expert Systems. Wiley, New York (1990) 14. Kirkpatrick, S., Gelatt, C., Vecchi, M.: Optimization by simulated annealing. Science 220(4598), 671–680 (1983) 15. Carbonetto, P.: Unsupervised statistical models for general object recognition. Msc thesis, The Faculty of Graduate Studies, Department of Computer Science, The University of British Columbia, West Mall Vancouver, BC Canada (2003) 16. Escalante, H., Montes, M., Sucar, L.: Word co-occurrence and markov random fields for improving automatic image annotation. In: Rajpoot, N.M., B.A. (eds.) BMVC 2007. Eighteenth British Machine Vision Conference, Warwick, UK, vol. 2, pp. 600–609 (2007) 17. B¨ ack, T.: Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms. Oxford Univ. Press, Oxford (1996) 18. Wellington, C., Courville, A., Stentz, A.: Interacting markov random fields for simultaneous terrain modeling and obstacle detection. In: Thrun, S., Sukhatme, G., Schaal, S. (eds.) RSS 2005. Robotics: Science and Systems, Boston, USA, pp. 1–8 (2005)
Shape-Based Image Retrieval Using k-Means Clustering and Neural Networks Xiaoliu Chen and Imran Shafiq Ahmad School of Computer Science University of Windsor Windsor, ON N9B 3P4 - Canada
[email protected],
[email protected]
Abstract. Shape is a fundamental image feature and belongs to one of the most important image features used in Content-Based Image Retrieval. This feature alone provides capability to recognize objects and retrieve similar images on the basis of their contents. In this paper, we propose a neural network-based shape retrieval system in which moment invariants and Zernike moments are used to form a feature vector. kmeans clustering is used to group correlated and similar images in an image collection into k disjoint clusters whereas neural network is used as a retrieval engine to measure the overall similarity between the query and the candidate images. The neural network in our scheme serves as a classifier such that the moments are input to it and its output is one of the k clusters that has the largest similarity to the query image. Keywords: image retrieval, shape-based image retrieval, k-means clustering, moment-invariants, Zernike moments.
1
Introduction
Recent advances in image acquisition, storage, processing, and display capabilities have resulted in more affordable and widespread use of digital images. As a result, there is an increased demand for effective management of image data. Given the huge amount of image data that exist now and will be collected in near future, the traditional approach of manual annotations is not only inadequate but also fails to serve the purpose. To utilize the image information efficiently, there is a constant demand for effective techniques to store, search, index and retrieve images based on their contents [1]. This has led to a great emphasis and demand on the use of automatically extractable and mathematically quantifiable visual image features such as color, texture, shape, and spatial relationships. Such retrievals are generally termed as Content-based Image Retrieval (CBIR). Shape is a very powerful and one of the fundamental image features that facilitates representation of image contents. Retrieval of images based on the shapes of objects, generally termed as Shape-based Retrieval, is an important
Corresponding author. Authors would like to acknowledge partial support provided by the Natural Sciences and Engineering Research Council (NSERC) of Canada.
D. Mery and L. Rueda (Eds.): PSIVT 2007, LNCS 4872, pp. 893–904, 2007. c Springer-Verlag Berlin Heidelberg 2007
894
X. Chen and I.S. Ahmad
CBIR technique and has applications in many different fields. Examples of such type of retrievals can be found but are not limited to recognition and retrieval of trademarks, logos, medical structures, fingerprints, face profiles, hand written signatures, etc. In many situations, people can recognize an object only with its shape. Most often, shape of an object can be obtained by traversing its boundary. Primary issues associated with shape-based image retrieval are: shape representation, similarity measure and retrieval strategy. Shape representations are formalistic identification models of original shapes so that the important characteristics of shapes are preserved [2]. The goal of shape representation is to derive a numeric shape descriptor or a feature vector that can uniquely characterize the given shape. Two-dimensional shapes can be represented in a number of different ways. Generally, such methods can be divided into two main categories: (i) contour-based methods and (ii) region-based methods. In contour-based methods, emphasis is on the outer closed curve that surrounds the shape whereas in region-based methods, entire shape region occupied by the shape within a closed boundary on the image plane is considered. moment invariants [3] and Zernike moments [4] are examples of such methods. This paper provides an efficient mechanism for indexing and shape-based retrieval of images from an image database. We analyze and compare performance of two region-based methods, viz., moment invariants and Zernike moments to derive feature descriptors for shape representation and image retrieval while using artificial neural network as an intelligent search engine instead of traditional techniques of multidimensional indexing trees. k-means clustering is employed to provide learning samples to the neural network to facilitate back-propagation training. A user can query the system using query-by-example approach. The moments of the objects in a query are computed on-the-fly and compared against those of the already indexed database images to retrieve similar images. The important contributions of our shape matching and retrieval approach can be summarized as: – the proposed method is independent of geometric transformations, i.e., scale, translation and rotation of the shape of the query or the database images. – images are classified using their moments and placed in k number of distinct clusters such that images in the same cluster exhibit higher levels of correlation whereas it is low between images in different clusters. Therefore, by controlling k, to a great extent, one could possibly control the degree of similarity among various images in a cluster. – although use of moment invariants is not new [3,5], our scheme provides a mechanism for efficient retrieval of stored images through an artificial neural network which in our scheme serves as an intelligent search engine. Remainder of this paper is organized as follows: in Section 2, we provide a summary of shape representation techniques. Section 3 describes the proposed methodology to represent, classify and retrieve shape images. Results of our experiments are presented in Section 4 whereas Section 5 provides our conclusions.
Shape-Based Image Retrieval
2
895
Shape Representation
Shape representations are formalistic identification models of original shapes so that their important characteristics are preserved [2]. The goal of shape representation is to derive a set of numeric shape descriptors or a feature vector that can uniquely characterize a given shape. 2D shapes can be described in a number of different ways with several different classifications [2]. The most popular and widely used among them is the one proposed by Pavlidis [6]. It is based on the use of shape boundary points as opposed to the interior features of a shape. 2.1
Moment Invariants
Moment invariants are extensively used in the area of pattern recognition, shape representation and for shape similarity measures. Moment invariants are derived from moments of shapes and are unchanged under translation, scale and rotation [3]. xp y q f (x, y)dydx
mpq = x
(1)
y
The theory of moments provides an interesting and useful alternative to a series of expansions for representing a real bounded function [9]. Suppose f (x, y) = 0 is such a 2D function on a region R, then the geometric moments of order p + q, for p, q = 0, 1, 2, . . . are given by Equation (1). The uniqueness theorem [5] states that if f (x, y) is piecewise continuous and has nonzero values only in a finite part of the xy-plane, moments of all orders exist and the moment sequence mpq is uniquely determined by f (x, y). Many shape features such as total mass (area), centroid, angle of the principal axis, bounding box, best-fit ellipse and eccentricity can be conveniently represented in terms of geometric moments [7]. Moment invariants are usually specified in terms of centralized moments μpq , i.e., the moments measured with respect to the center of mass (¯ x, y¯) and are defined as: p q μpq = (x − x ¯) (y − y¯) f (x, y)dydx (2) x m10 m00 ,
where x¯ = defined as [8]:
y¯ =
m01 m00
y
is the center of mass and the normalized moments are ηpq =
μpq μγ00
and p+q +1 2 Once computed, the feature vector consists of seven components and can be used to index shapes in the image database. The values of the computed geometric moments are usually small but the values of higher order moment invariants in some cases are close to zero. Therefore, all of the moment invariants need to be further normalized into [0,1] by the limit values of each dimension [9]. γ=
896
2.2
X. Chen and I.S. Ahmad
Zernike Moments
Based on the idea of replacing the conventional kernel of moments with a general transform, orthogonal moments have been proposed to recover the image from moments [10]. Zernike moments are orthogonal moments and allow independent moments to be constructed to an arbitrarily high order. The complex Zernike moments are derived from the Zernike polynomials: Vnm (x, y) = Vnm (ρ cos θ, ρ sin θ) = Rnm (ρ) exp(jmθ) and
(n−|m|)/2
Rnm (ρ) =
s=0
(−1)s
(n − s)! s!( n+|m| 2
− s)!( n−|m| − s)! 2
ρn−2s .
(3)
(4)
where n is a non-negative integer. m is an integer, subject to the conditions n − |m| = even, and |m| ≤ n. ρ = x2 + y 2 is the radius from (x, y) to the image centroid. θ = tan−1 ( xy ) is the angle between ρ and the x-axis. Zernike polynomials are a complete set of complex-valued functions orthogonal over the unit circle, i.e., x2 + y 2 = 1 and the complex Zernike moments of order n with repetition m are defined as: n+1 ∗ Anm = f (x, y)Vnm (ρ, θ)dxdy. (5) π x2 +y 2 ≤1 where ∗ denotes the complex conjugate. The precision of shape representation depends on the number of moments truncated from the expansion. Since Zernike basis functions take the unit disk as their domain, this disk must be specified before moments can be calculated. The unit disk is then centered on the centroid of the shape, thus, making the obtained moments scale and translation invariant [4]. Rotation invariance is achieved by using only the magnitudes of the moments. The magnitudes are then normalized into [0,1] by dividing them by the mass of the shape. Zernike moments do not need to know the boundary information, thus, making them suitable for more complex shape representation. Zernike moments can be obtained for some arbitrary order, thus, eliminating the drawback of moment invariants in which higher order moments are difficult to construct [9]. However, Zernike moments lose the important perceptual meaning as reflected in the moment invariants [9]. In our implementation, all the shapes are normalized into a unit disk of fixed radius of 32 (25 ) pixels, i.e., R = 32. The disk radius can also be 8, 16, or 64 but the larger the disk, more are the computations required but more details the feature will reflect. The unit disk is then centered on the shape centroid by the transformations we have discussed earlier. For digital images, we substitute the
Shape-Based Image Retrieval
897
integrals with summations. As a result, the complex Zernike moments of order n with repetition m are defined as: Anm =
n+1 ∗ f (x, y)Vnm (ρ, θ), x2i + yi2 ≤ R2 . π i
(6)
For binary images, f (x, y) can be replaced by u(x, y). As mentioned earlier, Zernike transformation makes the obtained moments scale and translation invariant whereas rotation invariance is achieved by only using the magnitudes (absolute values) of the moments.
3
Proposed Approach
In the proposed scheme, we consider both the moment invariants and the Zernike moments to represent shape in an image. Even though both of these representations are region-based shape representations, we still need to determine boundary sequence of the shape object. In many computer vision applications, for simplicity and speed, use of binary shape representations is a common practice. Therefore, we assume that all the images are binary images. This assumption will not effect the system’s performance since we are concerned only with the shape features of the images. We also assume that all the pixels in the object have a value of ‘1’ and all the pixels in the background have a value of ‘0’. Therefore, a boundary sequence [11] is a list of connected pixels on the edge of the object, separating the shape region and the background. 1, if (x, y) ∈ R u(x, y) = (7) 0, otherwise From the definition of geometric moments in Equation (1), if f (x, y) ≥ 0 is a 2D real bounded function, the (p + q)th order moment of the shape enclosed by f (x, y) is given in Equation (7). The binary function given in Equation (7) is a simple representation of a binary region R. By replacing f (x, y) by u(x, y) in Equation (1), we get the moments of the region R and the Equation (1) becomes: mpq = xp y q dydx (8) R
For geometric moments of a digital binary image, the double integrals can be approximated by double summations and Equation (8) can be written as: xp y q (9) mpq = x
y
A similar procedure can be followed for Zernike moments. The above equations involve a large number of multiplications and additions. In real-time applications,
898
X. Chen and I.S. Ahmad
the speed of computation is crucial to the system performance. For computational speedup in binary shapes, the Delta method [12] uses contributions from each row rather than the individual pixels, requiring only the coordinates of the first pixel and the length of the chained pixels of each row for the shape R. The Integral method [7], on the other hand, uses contribution of each extreme point of the shape in each row instead of the contribution of the total row. Computations in this case are directly based on the integral operation. The geometric moments and both moment invariants and Zernike moments in this method are derived from the chain codes. Based on the idea of computing from the integral and extreme pixels, moments are calculated by the following method. Let a given shape is considered a matrix of pixels then if xL,i , xR,i are the abscissas of the first pixel (extreme left) and the last pixel (extreme right) of the shape in row i, δi is the number of connected pixels in row i, i.e., δi = xL,i − xR,i + 1, yi is the ordinate of row i, the geometric moments mpq can be written as mpq = i mpq,i . The contribution of row i in terms of xL,i , xR,i and yi for a horizontal convex shape is considered as a region consisting of small uniform squares of size 1x1. Therefore, we adjust the coordinates by ±1/2 and the contribution of row i is derived using the Newton-Leibniz formula such that the (p + q)th moment of the whole binary shape is the sum of the contributions of every row mpq = i mpq,i . For a digital binary shape, the central moments are given as: μpq = (x − x ¯)p (y − y¯)q (10) x
y
After mpq have been calculated by the extreme pixels, central moments μpq can be obtained [8]. 3.1
Indexing and Retrieval Approach
Traditional indexing in CBIR involves use of multidimensional indexing trees to manage extracted visual features. In our approach, clustering and neural network is used to organize images and to build an intelligent retrieval engine. Our strategy consists of two stages: training and testing, i.e., retrieval. The overall system architecture is presented in Fig.1. In the training stage, we use all of the images in our image database as training samples. We first group the training images into clusters using their feature vectors and then train the neural network with the results of clustering. In the testing stage, for a query image q, we use same technique to extract its features to build a feature vector which then becomes an input to the trained neural network for retrieval. The network assigns it to one or more similar clusters. We compare all of the images in the selected cluster(s) against the query image q. The distance between the query image and the database images is determined using a distance function, to be explained in Section 3.3. Finally, similar images are ranked by their similarities and returned as the retrieval results.
Shape-Based Image Retrieval
899
Fig. 1. Architecture of the proposed system
3.2
Clustering
k-means clustering [26] is one of the best known non-hierarchical clustering algorithm in pattern recognition. For a specified number of clusters k, the scheme assigns each input sample to one of the k clusters so as to minimize the dispersion within the clusters. The membership for every sample is determined through an unsupervised learning procedure. The algorithm of k-means clustering first randomly initializes k cluster means and then assigns each pattern vector to the cluster with nearest mean. It re-computes the cluster means and then re-assigns pattern vectors until no further changes occur. The nearest mean μi is found by using an arbitrary distance function. Initial values of μi affect the convergence since different initialization values may lead to different membership results. Therefore, we either guess initial value based on the knowledge of the problem or choose k random samples from the data set {x1 , x2 , . . . , xn }. Clustering techniques generally aim to partition the data set into a number of clusters. In our approach, our aim is to use a clustering algorithm that could provide us a minimum within-cluster distance and maximum inter-cluster distance. Based on this idea, there are several cluster validity indices to evaluate partitions obtained by the clustering algorithm, including Davies-Bouldin (DB) Index. It is a function of the ratio of the sum of within-cluster scatter to the inter-cluster separation. The scatter within the ith cluster Si is computed as: 1 Si = |x − μi | x∈Ci Ni and the distance between two clusters Ci and Cj , denoted by Dij , is defined as Dij = |μi − μj | such that μi represents the ith cluster center. The DB index is then defined as: k 1 DB = Ri,ql k i=1
900
X. Chen and I.S. Ahmad
where Ri,ql = max j,j=i
Si,q + Sj,q Dij,l
and the objective to minimize the DB index for achieving proper clustering. 3.3
Similarity Measurement
Distance functions are used to measure similarity or dissimilarity of two feature vectors. In a d-dimensional space, for any two elements x and y, D(x, y) is a real number and represents the distance between them and is called a distance function. There are many distance functions such as Manhattan, Euclidean, Minkowski and the Mahalanobis. In our scheme we have compared performance of the Euclidean distance and the Mahalanobis distance functions. If x and y are the two feature vectors of same dimension d, then these two distance functions are represented by Equation (11) and Equation (12), respectively as:
D(x, y) =
d−1
12 (xi − yi )
2
=
(x − y)t (x − y),
(11)
i=0
D(x, y) = [(x − y)t Σ −1 (x − y)] 2 1
(12)
where Σ is the covariance matrix of x and is given as: N −1 N −1 1 t (xi − μ)(xi − μ) and μ = xi Σ= N i=0 i=0
Σ is a positive, semi-definite and symmetric matrix. When Σ = I, the distance is reduced to the Euclidean distance and when Σ = I, it is called the Mahalanobis distance. Both the Euclidean and the Mahalanobis distance functions are commonly used in clustering algorithms. Generally, center of a cluster is determined by its mean vector μ and its shape is determined by its covariance matrix Σ. 3.4
Neural Network
After clustering, we use neural network as part of the retrieval engine. The neural network in our scheme consists of 3 layers such that there are 7 nodes in the input layer and the number of nodes in the output layer are same as the number of clusters k. Our choice of a neural network design is based on a study of decision regions provided in [13]. This study demonstrates that a 3 layer network can form arbitrarily complex decision regions and can also provide some insight into the problem of selecting the number of nodes in the 3-layer networks. The number of nodes in the second layer must be greater than 1 when decision regions are disconnected or meshed and cannot be formed from one convex area. In worst case, the number of nodes required in second layer are equal to the number of
Shape-Based Image Retrieval
901
disconnected regions in the input distributions. The number of nodes in the first layer must typically be sufficient to provide three or more edges for each convex area generated by every second-layer node. There should typically be more than three times as many nodes in the second layer as in the first layer [13]. In our approach, moment invariants or Zernike moments of a query image form an input feature vector to the neural network whereas its output determines the best representation of the query image features among the k clusters.
4 Experimental Results
We have performed experiments with more than 10,000 binary images, obtained from "A Large Binary Image Database" [14] and the "Amsterdam Library of Object Images (ALOI)" [15]. The images in our data set have many variations of the same shape and object. The data set consists of some groups of similar shapes that are scaled, rotated and slightly distorted, as well as some images that are unique. We use the Davies-Bouldin (DB) index to evaluate the results of clustering using different values of k, as shown in Fig. 2 for the two distance functions. From this figure, one can observe that for smaller values of k, e.g., 3 or 4, the clustering results with both of these functions are similar. However, when k ≥ 5, the Mahalanobis distance performs better than the Euclidean distance. This is primarily due to the fact that the Mahalanobis distance captures the precise shape of the cluster, whereas the Euclidean distance forms a circle irrespective of the shape of the data set.
Fig. 2. DB index vs. k in k-means clustering for the two distance functions
For the Mahalanobis distance in k-means clustering, after assigning samples to the cluster of the nearest mean, we have to re-compute not only the new cluster mean but also the new cluster covariance in each iteration. With the Euclidean distance, stable cluster means are obtained after only a few iterations, but with the Mahalanobis distance it always takes much longer, despite the fact that we use a threshold as a termination condition (the new cluster means move very little, say less than 0.1%). Fig. 3 and Fig. 4 show the precision-recall graphs for our retrieval results. In one set of experiments, the query image chosen was part of the database,
Fig. 3. Precision-recall results when the query image is part of the image database
whereas in some other cases, it was not part of the training sample. The reason for such experiments is to evaluate the system performance both in instances when the system has already encountered a similar shape during its training and when it has not. As one can observe from these figures, when the query image is not part of the training set, the shape retrieval precision using moment invariants is only 82%, whereas when the query image is part of the training sample, it is about 100%. The precision using Zernike moments in the two cases is only about 72% and a little more than 90%, respectively. However, in both cases, the system was able to retrieve more than 90% of the relevant images. The reduced recall in both cases can be attributed to a possible clustering problem, when the system places similar images in different clusters. Further, it can also be observed that the moment invariants perform better than the Zernike moments for both precision and recall. This is partly due to the fact that the Zernike moments have very large coefficients that need to be normalized at the expense of reduced precision. All of the images in the experiments reported here contain only a single object. However, we have performed similar experiments on a small data set containing images with multiple objects and obtained comparable results. We have also performed experiments on the same images with different sizes and observed that the size of an image doesn't affect either the precision or the recall. All of the images in the experiments reported here are black-and-white, which may seem an artificial restriction. However, it is important to note that
Fig. 4. Precision-recall results when the query image is not part of the image database
Fig. 5. Comparison of the precision-recall results with the technique proposed in [16]
we are not trying to recognize shapes but only matching on the basis of shape. In practice, many image processing and computer vision techniques simply do not take the color information into account. We have compared the precision-recall results of the proposed technique with the one proposed in [16]; the results are shown in Fig. 5. Although this technique also uses moment invariants to form a feature vector, the feature vectors are managed with the help of a k-means clustering tree (KCT), in which the value of k determines the number of branches in each node of the tree, whereas pointers to the actual images are stored in the leaf nodes. We have chosen this scheme for comparison due to its similarity with our approach, as both involve k-means clustering to manage similar images and also use a neural network as part of the search engine. We realize that the cost of clustering images and training the neural network can be very high, since it involves extensive training and mathematical computations, but it is important to realize that images are stored in the database only once, and this can be an off-line process to provide retrieval efficiency.
5 Conclusions and Future Directions
In this paper, we have used moment invariants and Zernike moments as shape descriptors which can uniquely represent a shape. We have proposed the use of k-means clustering to organize images and that of a neural network as a retrieval engine. Although the training of a neural network is a time-consuming process, the training and retrieval are not symmetric; once training is done, it can achieve higher retrieval efficiency and lower computational cost. To refine retrieval results further, there is a need to incorporate relevance feedback, which is an important aspect of our future work.
References
1. Jain, R.: NSF workshop on visual information management systems: Workshop report. In: Storage and Retrieval for Image and Video Databases, SPIE, vol. 1908, pp. 198–218 (1993)
2. Loncaric, S.: A survey of shape analysis techniques. Pattern Recognition 31(8), 983–1001 (1998)
3. Hu, M.K.: Visual pattern recognition by moment invariants. IEEE Transactions on Information Theory 8(2), 179–187 (1962)
4. Teh, C.-H., Chin, R.: Image analysis by the methods of moments. IEEE Transactions on Pattern Analysis and Machine Intelligence 10(4), 496–513 (1988)
5. Maitra, S.: Moment invariants. Proceedings of the IEEE 67, 697–699 (1979)
6. Pavlidis, T.: Survey: A review of algorithms for shape analysis. Computer Graphics and Image Processing 7(2), 243–258 (1978)
7. Dai, M., Baylou, P., Najim, M.: An efficient algorithm for computation of shape moments from run-length codes or chain codes. Pattern Recognition 25(10), 1112–1128 (1992)
8. Chen, X., Ahmad, I.: Neural network-based shape retrieval using moment invariants and Zernike moments. Technical Report 06-002, School of Computer Science, University of Windsor (January 2006)
9. Zhang, D.S., Lu, G.: A comparative study of three region shape descriptors. In: Proceedings of the Sixth Digital Image Computing and Applications (DICTA 2002), pp. 86–91 (January 2002)
10. Teague, M.: Image analysis via the general theory of moments. Journal of the Optical Society of America 70(8), 920–930 (1980)
11. Pitas, I.: Digital Image Processing Algorithms. Prentice-Hall, Englewood Cliffs (1993)
12. Zakaria, M., Vroomen, L., Zsombor-Murray, J., van Kessel, H.: Fast algorithm for the computation of moment invariants. Pattern Recognition 20(6), 639–643 (1987)
13. Lippmann, R.: An introduction to computing with neural nets. IEEE Acoustics, Speech and Signal Processing Magazine 4(2), 4–22 (1987)
14. Laboratory for Engineering Man/Machine Systems (LEMS): A large binary image database (2006), http://www.lems.brown.edu/~dmc
15. Geusebroek, J.M., Burghouts, G., Smeulders, A.: The Amsterdam Library of Object Images. International Journal of Computer Vision 61(1), 103–112 (2005)
16. Ahmad, I.: Image indexing and retrieval using moment invariants. In: Proceedings of the 4th iiWAS, Indonesia, pp. 93–104 (September 2002)
Very Fast Concentric Circle Partition-Based Replica Detection Method

Ik-Hwan Cho1, A-Young Cho1, Jun-Woo Lee1, Ju-Kyung Jin1, Won-Keun Yang1, Weon-Geun Oh2, and Dong-Seok Jeong1

1 Dept. of Electronic Engineering, Inha University, 253 Yonghyun-Dong, Nam-Gu, Incheon, Republic of Korea
2 Electronics and Telecommunication Research Institute, 138 Gajeongno, Yuseong-Gu, Daejeon, Republic of Korea
{teddydino,ayoung,jjunw6487,jukyung77,aida}@inhaian.net, [email protected], [email protected]
Abstract. Image replica detection has recently become a very active research field, as electronic devices that generate digital images, such as digital cameras, spread rapidly. Since the huge amount of digital images leads to severe problems such as copyright protection, the necessity of replica detection systems gets more and more attention. In this paper, we propose a new fast image replica detector based on a concentric circle partition method. The proposed algorithm partitions the image into concentric circles with a fixed angle, from the image center position outwards. From these partitioned regions, a total of four features is extracted: the average intensity distribution and its difference, the symmetrical difference distribution and the circular difference distribution, all in bit-string form. To evaluate the performance of the proposed method, a pair-wise independence test and an accuracy test are applied. We compare the duplicate detection performance of the proposed algorithm with that of the MPEG-7 visual descriptors. From the experimental results, the proposed method shows very high matching speed and high accuracy in the detection of replicas which have gone through many modifications of the original. Because we use a hash code as the image signature, the matching process needs very short computation time. The proposed method shows 97.6% accuracy on average under a 1 part per million false positive rate. Keywords: Image Replica Detection, Image Retrieval, Concentric Circle Partition.
1 Introduction

Image retrieval (IR) systems have been studied in various research fields and utilized in various real applications, as digital images can be distributed easily through the internet. An image retrieval system can be used in image indexing, searching and archiving. This enormous scientific interest has led to several international standardization activities and the growth of related research areas. The ISO/IEC MPEG standard group published the MPEG-7 standard [1] and JPEG published JPSearch [2]. As a subset of image retrieval applications,
the interest in replica detection is gradually increasing. In general, all instances of a reference image are called 'replicas', and 'non-replicas' are images which are not related to the modified versions of the reference image [3]. The reason why replica detection is important and getting much attention in real applications is that it can be a better alternative to conventional content protection algorithms such as encryption and watermarking. Cryptography can encrypt digital content with an invisible user key, so that only legal users who have the key can decrypt the content appropriately. However, it is very dependent on the user key and the encryption algorithm, and both distributor and user need to be equipped with particular tools for encryption and decryption, which can limit the convenient use of the content. On the other hand, while the watermarking method places no limitation on the usage of the content, the watermark embedded in the content decreases the quality of the original. Above all, digital watermarking technology is very weak when the watermarked image is modified or attacked by some distortions. On the contrary, image replica detection places no limitation on the user's usage of the content and introduces no distortion of the original content. Therefore it can resolve the disadvantages of cryptography and watermarking technologies. There are some differences between image retrieval and replica detection. The objective of an image retrieval system is to find similar images according to particular features. For example, if we use a photo of Miami Beach with blue ocean and white beach as a query image, the image retrieval system returns some beach images with blue-colored ocean and white-colored sand; the results may include images of the Pacific coast or the Caribbean Sea. In contrast to image retrieval, an image replica detection system should find only the modified versions of the input image exactly. While an image retrieval system shows several results if some modified images are used as the query, a replica detection system presents only their original images.

1.1 The Related Works

In order to detect the original image for various replicas, key point-based methods can be used [4]. A key point-based method detects feature points in the image and describes a local feature for each point. It then measures similarity by matching the local descriptors of all feature point blocks. These methods show very good performance for replica detection, and precise feature point detection methods such as SIFT [5] and Harris-Laplacian [6] can be proper solutions for this purpose. But they need high computational complexity for extracting and matching signatures. In real applications, we need to process a large number of images, so extraction and matching speed are very important. Of the two, matching speed is more important, since signature extraction is generally carried out off-line, whereas the matching process between query and reference images should be processed in real time. Therefore we should consider the speed problem in the design of a replica detection system. In addition, there have been many methods developed especially for replica detection [7, 8, 9]. Alejandro et al. use an interest point detection method to align query and reference images [7]. After global alignment between two images, the block-based intensity difference is measured to obtain a similarity score. Although it shows good performance under its own experimental conditions, the relatively small size of the database
can be a limitation in today's environment. Yan et al. proposed a near-duplicate detection algorithm [8] that also uses a local interest point detection algorithm, the Difference of Gaussian (DoG) detector [5]. For the interest point representation, PCA-SIFT is used as a compact form relative to the original one [10], and Locality Sensitive Hashing (LSH) is used to index the local descriptors [11]. It shows excellent results for near-duplicate detection, but it still has the limitation of a relatively low matching speed: although it improves the matching speed dramatically compared to conventional methods, it cannot outperform a simple Hamming distance measure. Recently, Wang et al. proposed a duplicate detection method using hash codes [9]. Because of using a very short binary hash code, they can show very fast matching speed even without an additional optimization method. However, the method proposed in [9] uses too short a hash code, so that it can identify only a small set of image pairs; as the size of the database used in the experiment grows, its performance decreases gradually.

1.2 The Proposed Replica Detection Model

As we commented in the above section, replica detection is a little different from the image retrieval process, and it also differs from conventional replica or duplicate detection methods. In general, a replica detection algorithm has its own simulation model, and these models are usually separated into two groups (Fig. 1). Let us assume there are original images and their copied and transformed images. In most conventional models, the original images are used as queries and the performance is measured by precision and recall. But in real applications this model can be a very exhaustive process, because the reference (transformed) image pool is usually extraordinarily bigger than the query (original) image pool, so that the matching process needs a long time for a full search. If we want to know how many replicas of a specific original image are distributed, this model is reasonable. Therefore we can assume another replica detection model, and such a different replica detection model is proposed in this paper. The new replica detection model uses transformed images as queries and original images as references. By using the proposed model we have a fixed and relatively small reference pool, so that it takes a short time to match query and reference. Even if there are many images that could be used as queries, we can choose only the image which we need to use as a query. From the point of view of copyright protection, which is one of the major applications, we can collect the reference images which should be protected from copyright infringement and choose a specific suspicious image as the query. Then the replica detection system searches for its original in the reference pool by matching the query against all references. In this paper we employ the proposed replica detection model, so it is not proper to use conventional performance metrics such as precision and recall. We can then define the modified requirements of a replica detection system: robustness, independence and fast matching. For robustness, the image signature must not change even if the image is transformed or degraded. For independence, if two images are perceptually different, their signatures should be quite different. Finally, for fast matching, the matching process between two images should be carried out based only on the signatures, without the images themselves, and the matching algorithm must be considerably simple.
In this paper, we propose a new fast replica detection algorithm which is based on concentric circle partition. We develop the concentric circle partition method to make
a rotation invariant signature. Based on the concentric circle partition, we extract 4 kinds of features, which are all rotation and scaling invariant. In addition, the proposed method uses a simple mathematical function to map a scalar distribution into a binary hash code. A binary hash code is very useful in the matching process since it requires only XOR operations, which need the smallest computation power. We define new performance measures to evaluate the independence and robustness of the proposed method in the experiment step. From these new performance metrics, we can verify that the proposed method is very useful in practical applications. This paper is organized as follows. In section 2, we describe the extraction process of the proposed signature in detail. The matching process is depicted shortly in section 3. Section 4 shows the experimental results of the proposed method and we conclude in sections 5 and 6.
Fig. 1. Image replica detection models. (a) is the conventional replica detection model, which uses original images as queries, and (b) is the proposed replica detection model, which uses original images as references. The proposed model requires a shorter matching time relative to the conventional model since it uses a relatively small and fixed reference pool.
2 Extraction of the Proposed Signature for Replica Detection

The proposed replica detector is based on a concentric circle partition of the input image. Fig. 2 shows an example of concentric circles. Concentric circles are a set of circles which have a common center and different radii. Prior to extracting features from the input image, the concentric circle partition is applied to the image and features are extracted from the segmented regions. The concentric circles are quantized into several sub-regions by
radius and angle level. In Fig. 2, the left image shows the basic concentric circles quantized by radius and the right one shows the sub-regions quantized by different angles. The proposed method utilizes these sub-regions to extract features.
Fig. 2. Concentric circles and their partition according to radius and angle
Basically, the proposed replica detector needs a fixed number of bits, and this hash-type signature has advantages in signature size and matching speed. In addition, just one bit is allocated to each circle of the signature. The overall replica detector has four kinds of signatures, and they are all based on the same concentric circle regions. They are signatures of bit-string type and each signature uses the same simple hash generation method. Fig. 3 represents the overall block diagram for extracting the features.
[Fig. 3 block diagram: Input Image → Size Normalization → Polar Coordinate Conversion → Concentric Circle Partition → Calculation of Average Intensity Level Distribution on Each Circle → Calculation of Difference of Average Intensity Level between Neighborhood Circles → Calculation of Symmetrical Difference Distribution → Calculation of Circular Difference Distribution → Concatenation of calculated hash features]
Fig. 3. The overall block diagram for signature extraction process
To make a robust replica detector, we use four kinds of features based on the concentric circle regions. The four features are the average intensity level, the difference of average intensity level, the symmetrical difference and the circular difference distribution. They are represented by hash-type bit-strings of the same length. In the following sections, the detailed extraction method of the four features is explained.

2.1 Concentric Circle Partition

Input images are resized to a fixed size while the width-height ratio is preserved. The fixed size is determined according to the diameter of the largest circle, and the minimum of the resized width and height is equal to the diameter. The resize operation is carried out using bi-linear interpolation. We used 256 as the diameter in the experiments.
To extract features, the concentric circle region concept has to be applied to the real image. The concentric circles are implemented by a coordinate conversion from Cartesian to polar coordinates. A Cartesian coordinate (x, y) is converted into polar coordinates using Eq. (1):

x = r\cos\theta, \qquad y = r\sin\theta, \qquad r = \sqrt{x^2 + y^2}

\theta = \begin{cases} \arctan(y/x) & \text{if } x > 0 \text{ and } y \ge 0 \\ \arctan(y/x) + 2\pi & \text{if } x > 0 \text{ and } y < 0 \\ \arctan(y/x) + \pi & \text{if } x < 0 \\ \pi/2 & \text{if } x = 0 \text{ and } y > 0 \\ 3\pi/2 & \text{if } x = 0 \text{ and } y < 0 \end{cases}   (1)
After conversion into polar coordinates, each pixel at position (x, y) has an angle and a distance from the center, and a polar coordinate map is obtained. The calculated polar coordinate map is quantized by angle and radius. In this experiment, we use 32 radius levels and 36 angle levels.

2.2 Average Intensity Level Distribution

The first feature of the proposed signature is the average intensity level in each circle. For all circle regions, the average intensity level of each circle is calculated and the distribution from the inner circle to the outer circle is obtained (Eq. (2)):

C_i = \frac{1}{angleLevel} \sum_{j=0}^{angleLevel-1} P_{i,j}   (2)

where P_{i,j} denotes the pixel values in the ith radius level and jth angle level.
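A minimal NumPy sketch (not the authors' code) of the polar quantization of Eq. (1) and the per-sub-region averages used by Eq. (2); the grayscale image here is synthetic and the radius/angle levels follow the values quoted above.

```python
import numpy as np

def subregion_means(img, radius_levels=32, angle_levels=36):
    """Average intensity C[i, j] of each (radius level i, angle level j) sub-region."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    y, x = np.mgrid[0:h, 0:w]
    r = np.hypot(x - cx, y - cy)
    theta = np.mod(np.arctan2(y - cy, x - cx), 2 * np.pi)   # Eq. (1), folded to [0, 2*pi)
    r_max = min(h, w) / 2.0
    r_idx = np.minimum((r / r_max * radius_levels).astype(int), radius_levels - 1)
    a_idx = np.minimum((theta / (2 * np.pi) * angle_levels).astype(int), angle_levels - 1)
    inside = r < r_max                                       # keep only pixels inside the circle
    C = np.zeros((radius_levels, angle_levels))
    for i in range(radius_levels):
        for j in range(angle_levels):
            region = img[inside & (r_idx == i) & (a_idx == j)]
            C[i, j] = region.mean() if region.size else 0.0
    return C

img = np.random.randint(0, 256, (256, 256)).astype(float)   # stands in for the normalized image
C = subregion_means(img)
avg_per_circle = C.mean(axis=1)                              # Eq. (2): average level of each circle
print(C.shape, avg_per_circle[:5])
```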
2.3 Difference Distribution of Average Intensity Level

The second feature is the difference distribution of the average intensity level distribution calculated in the previous step. Its mathematical representation is depicted in Eq. (3):

V_i = |C_{i+1} - C_i|   (3)

where C_i is the average intensity level in the ith radius level.
2.4 Symmetrical Difference Distribution

The third feature is the symmetrical difference distribution. The symmetrical difference is calculated by summing the absolute differences between the average level of some angle
region and the average value of the region on the opposite side in each circle. In Fig. 4(a), the two gray regions are symmetrical regions of one circle, so the symmetrical difference for each pair of regions is obtained by calculating the absolute difference between the average intensity values of the two symmetrical regions. Its mathematical representation is shown in Eq. (4):
S_i = \frac{1}{angleLevel/2} \sum_{j=0}^{angleLevel/2 - 1} |C_{i,j} - C_{i,j+angleLevel/2}|   (4)
where C_{i,j} is the average intensity level in the ith radius level and jth angle level.

2.5 Circular Difference Distribution
The fourth and final feature is the circular difference distribution. The circular difference in one circle is calculated by summing the absolute differences between the average intensity level in some angle region and the average intensity value in its counter-clockwise neighboring angle region. In Fig. 4(b), the gray regions are neighboring regions in one circle and their absolute difference is the circular difference for the two regions. The mathematical representation of the circular difference is shown in Eq. (5):

R_i = \frac{1}{angleLevel} \sum_{j=0}^{angleLevel-1} |C_{i,j} - C_{i,(j+1) \bmod angleLevel}|   (5)

where C_{i,j} is the average intensity level in the ith radius level and jth angle level.
Fig. 4. (a) Symmetrical regions in one circle; (b) circular difference calculation in the counter-clockwise direction
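Building on the sub-region averages C[i, j] computed in the earlier sketch, the three difference distributions of Eqs. (3)-(5) could be obtained as follows (again only an illustration of the formulas, not the reference implementation).

```python
import numpy as np

def difference_features(C):
    """C has shape (radius_levels, angle_levels) of sub-region averages."""
    n_r, n_a = C.shape
    circle_avg = C.mean(axis=1)                            # Eq. (2): per-circle average
    V = np.abs(np.diff(circle_avg))                        # Eq. (3): neighbor-circle difference
    half = n_a // 2
    S = np.abs(C[:, :half] - C[:, half:]).mean(axis=1)     # Eq. (4): symmetrical difference
    R = np.abs(C - np.roll(C, -1, axis=1)).mean(axis=1)    # Eq. (5): circular difference
    return circle_avg, V, S, R

C = np.random.rand(32, 36) * 255                           # stand-in sub-region averages
avg, V, S, R = difference_features(C)
print(avg.shape, V.shape, S.shape, R.shape)                # (32,), (31,), (32,), (32,)
```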
2.6 Merging of Hash Type Features

In the previous signature extraction process, we obtained four kinds of distributions, each spreading from the inside to the outside of the concentric circles. In this step, a simple
hash table is applied to each distribution to change the scalar distribution into a binary bit-string. Eq. (6) is the simple hash mapping function used to represent the graph pattern with a binary string:

B_i = \begin{cases} 1, & M_{i+1} > M_i \\ 0, & M_{i+1} \le M_i \end{cases}   (6)

where M_i is the distribution value at the ith index. This mapping function is applied to the pre-calculated distribution values of each signature independently, and the extracted bit-string is the final proposed replica detector. In addition, the whole process can be repeated with different parameters and the results mixed into the final replica detector.
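Under the same caveat, the hash mapping of Eq. (6) and the concatenation of the four resulting bit-strings into one signature might look like the sketch below; the bit counts here are illustrative and do not reproduce the paper's exact 192-bit layout.

```python
import numpy as np

def to_bits(M):
    """Eq. (6): B_i = 1 if M_{i+1} > M_i, else 0."""
    M = np.asarray(M)
    return (M[1:] > M[:-1]).astype(np.uint8)

def build_signature(distributions):
    """Concatenate the hash bits of the four feature distributions."""
    return np.concatenate([to_bits(d) for d in distributions])

# using the avg, V, S, R distributions computed earlier (shapes are illustrative)
avg, V, S, R = np.random.rand(32), np.random.rand(31), np.random.rand(32), np.random.rand(32)
signature = build_signature([avg, V, S, R])
print(signature.size, signature[:16])      # bit-string signature of the image
```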
3 Matching Process

Matching of the proposed signature is carried out simply by measuring the standard normalized Hamming distance. The matching process of the proposed signature is very simple, since the signature is just a bit-string, not real values that would require Euclidean distance calculation or additional processing. Eq. (7) shows the distance measure used in the proposed method:

D = \frac{1}{N} \sum_{j=0}^{N-1} R_j \otimes Q_j   (7)

where N is the number of bits of the descriptor and R_j and Q_j are the jth bits of the reference and query signatures, respectively.
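A direct transcription of Eq. (7) as a normalized Hamming distance over the bit-strings (illustrative only; the threshold value used below is a made-up example, whereas the paper derives its actual threshold from the 1 ppm independence test):

```python
import numpy as np

def hamming_distance(ref_bits, query_bits):
    """Eq. (7): fraction of differing bits between reference and query signatures."""
    return np.mean(ref_bits != query_bits)

ref = np.random.randint(0, 2, 192, dtype=np.uint8)   # stored reference signature
qry = np.random.randint(0, 2, 192, dtype=np.uint8)   # signature of a suspicious image

d = hamming_distance(ref, qry)
THRESHOLD = 0.141                                     # hypothetical threshold for this sketch
print(d, "replica" if d <= THRESHOLD else "non-replica")
```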
4 Experimental Results

For the performance evaluation of the proposed replica detector, two kinds of experiments were carried out. The first one is a simulation to obtain the distance threshold value under a 1 ppm (parts per million) false positive error constraint. The second one is an experiment to measure the accuracy of the proposed replica detector. Accuracy is evaluated using only the Correct Retrieval (CR) ratio. Since we cannot use traditional precision and recall, we define our own measurement metric. To define the CR ratio, let us assume that there are M query images for one 'transformed' version (e.g. blur). To compute the CR ratio, the number of true pairs that are classified as containing copies (i.e. the original image and its modified version) is counted (K). The CR ratio is defined as:

Correct Retrieval Ratio = \frac{K}{M}
The query shall be performed for all original images and the experiment repeated for all types of modifications. For this experiment, 60,555 images are used for the
independence test. This larger database contains various natural photo images of several genres, including animals, people, food, architecture, houses and abstract images [12]. In addition, 3,943 original images and 23 kinds of modifications are used. The modification list is given in Table 1. The feature extraction process is repeated twice with different radius levels and the two extracted bit-strings are mixed. In this experiment, we set up the basic parameters of the proposed replica detector as follows: circle diameter = 256, circle radius levels = 32 and 16, circle angle level = 36. For comparison of performance, we implemented the duplicate detection algorithm based on PCA-HASH in [9] and two MPEG-7 visual descriptors, the color layout and edge histogram descriptors, as references. We use the same database for these three reference algorithms and carry out the same independence and accuracy tests. All experiments were processed on a 3.4 GHz Intel Pentium 4 processor with Microsoft Windows XP.

4.1 Independence Test

When the cross similarity measurement with 60,555 images is processed, the threshold distance under a 1 ppm false positive rate is the 1,833rd distance. Fig. 5 shows the cross distance histograms of all methods, including the proposed one, for the very large database.
Fig. 5. Cross distance histograms of the reference methods and the proposed method. The threshold distance of each algorithm is as follows: (a) 0.0315, (b) 0.007, (c) 0.036, (d) 0.141.
4.2 Accuracy

Accuracy is measured by the average CR ratio, and the results under a 1 ppm false positive rate are presented in Table 1. In the experimental results, the proposed method shows a very high accuracy of 97.6% on average.

Table 1. CR ratio results for modification levels
No. | Modification | PCA-HASH [9] | Color Layout (MPEG-7) | Edge Histogram (MPEG-7) | The Proposed Method
1 | Add Gaussian Noise (σ=2.5) | 0.530 | 0.678 | 0.481 | 0.993
2 | Add Gaussian Noise (σ=4.5) | 0.525 | 0.625 | 0.238 | 0.983
3 | Add Gaussian Noise (σ=8.0) | 0.517 | 0.532 | 0.075 | 0.971
4 | Auto level | 0.382 | 0.055 | 0.940 | 0.992
5 | Blur (3x3) | 0.525 | 0.738 | 0.091 | 0.998
6 | Blur (5x5) | 0.524 | 0.734 | 0.052 | 0.995
7 | Bright change (5%) | 0.511 | 0.032 | 0.631 | 0.994
8 | Bright change (10%) | 0.478 | 0.002 | 0.356 | 0.989
9 | Bright change (20%) | 0.396 | 0.000 | 0.124 | 0.967
10 | JPG compress (qf=95%) | 0.537 | 1.000 | 0.998 | 1.000
11 | JPG compress (qf=80%) | 0.534 | 0.865 | 0.390 | 0.998
12 | JPG compress (qf=65%) | 0.526 | 0.722 | 0.203 | 0.991
13 | Color reduction (8bit) | 0.493 | 0.224 | 0.256 | 0.983
14 | Color reduction (16bit) | 0.466 | 0.223 | 0.445 | 0.980
15 | Histogram equalization | 0.296 | 0.063 | 0.213 | 0.919
16 | Monochrome | 0.530 | 0.144 | 0.273 | 0.997
17 | Rotation (10°) | 0.038 | 0.010 | 0.013 | 0.992
18 | Rotation (25°) | 0.004 | 0.002 | 0.013 | 0.847
19 | Rotation (45°) | 0.001 | 0.001 | 0.010 | 0.876
20 | Flip | 0.003 | 0.015 | 0.002 | 1.000
21 | Rotation (90°) | 0.000 | 0.001 | 0.000 | 0.997
22 | Rotation (180°) | 0.001 | 0.001 | 0.001 | 0.997
23 | Rotation (270°) | 0.000 | 0.001 | 0.000 | 0.999
 | Average | 0.340 | 0.290 | 0.252 | 0.976
The above test shows that the proposed method is very robust to various transforms, including rotation by arbitrary angles and flipping. The other methods are less robust to image distortions and show very low performance on rotation and flip. Fig. 6 shows the relation of their performance graphically.
Fig. 6. The performance of all methods for various distortions
4.3 Signature Size

The proposed replica detector is a composition of four signatures, and its length is related to the radius level. In this experiment, we use 32 and 16 as radius levels. As a result, the total signature size is only 192 bits. From the point of view of signature size, the method proposed in [9] has the smallest signature, of only 32 bits. Table 2 shows the signature sizes of all methods.

Table 2. Signature size (bits)
Method | Number of bits
PCA-HASH [9] | 32
Color Layout (MPEG-7) | 42
Edge Histogram (MPEG-7) | 240
The Proposed Method | 192
4.4 Computational Complexity

In this paper, we limit the notion of computational complexity to the complexity of the matching process. Because the extraction process can in general be carried out off-line, it requires no real-time processing. In the matching step, the computational complexity is closely related to the matching method of the signature. The PCA-HASH method [9] and the proposed method use binary strings as signatures, so they need only XOR operations. This XOR
operation is a basic operation which needs minimal computing power in general computer architectures. Because of using binary hash codes, they can obtain relatively high matching speed, while the other methods use more complex matching schemes. Table 3 shows the computational complexity of all methods as the number of matching pairs per second.

Table 3. Computational matching complexity
Method | Number of matching pairs per second
PCA-HASH [9] | 90,000,000
Color Layout (MPEG-7) | 4,230,543
Edge Histogram (MPEG-7) | 293,206
The Proposed Method | 15,000,000
5 Discussion

In this paper, we proposed a new fast concentric circle partition-based signature method for image replica detection. The proposed method partitions the image into concentric circle regions according to radius and angle. Basically, the final signature has 1 bit per circle after the partition. For each circle, four kinds of features are extracted and their values are converted into 1 bit by using a simple hash function based on the relation to the neighboring circle. Each of the four features has relatively different characteristics. The first feature, the average intensity distribution, represents the overall pixel level distribution from the center to the outer region. This feature can show the general characteristics of the image. However, the diversity of the intensity distribution in an image is limited, which means that it is possible for two different images to have a similar distribution when the signature is presented as a bit-string using only the magnitude difference between neighboring circles. So we add the difference of the intensity distribution as the second feature. This feature compensates for the weakness of the first feature by considering the degree of the magnitude difference. Therefore the second feature plays a valuable role in increasing the discrimination between different images. The third and fourth features represent local characteristics, while the first feature describes the image globally. The third feature describes the variation in symmetry through the circles, and the fourth feature describes the degree of variation in the circular direction within each circle. By utilizing these four kinds of features together, the proposed method shows a very high CR (Correct Retrieval) ratio of 97.6% under a 1 ppm false positive rate. In addition, the proposed signature has a very small fixed size of 192 bits per image and shows a very fast matching process of 15 million pairs per second. In the experimental part, we compared the proposed method with three other algorithms. The first one is the method proposed in [9] for a large-scale replica-pair detection system. This algorithm calculates block-based pixel means as a vector, and this vector is optimized and compressed into only 32 dimensions by Principal Component Analysis (PCA). Finally, the 32-dimensional vector is converted into a 32-bit hash code. However, in our experiments it does not show good performance because its signature is too short: only 32 bits is not enough to carry unique and independent information about a specific image. The MPEG-7 visual descriptors of color layout and edge histogram are selected as targets for
experimental comparison since they show relatively better performance among the MPEG-7 visual descriptors. As the experimental results show, however, the MPEG-7 visual descriptors, which were developed for image retrieval applications, cannot satisfy the requirements of replica detection. In the case of the edge histogram descriptor, the image is partitioned into a fixed number of blocks, so the signatures of query and reference cannot be synchronized if the original image is rotated or flipped geometrically. The color layout descriptor has critical limitations even though it shows good performance for some image distortions. It uses color information over the whole range of the image, so it cannot preserve the original information when the color information is lost by conversion to a gray-scale image. Therefore it is difficult to expect high performance of the color descriptor for various image modifications. In terms of computational complexity, the proposed method outperforms these two descriptors since they use the Euclidean distance measure, which requires more computational power than the Hamming distance measure using the basic XOR operation. For the development of a better image replica detector, the most important factors are high accuracy and low computational complexity. A digital image is usually copied and modified without much effort using simple software. Therefore the replica detector should cover various modifications of the source image. In this paper, we apply 23 kinds of transformations, which can be considered the most frequent ones applied by normal users in real applications. For these various modifications, the proposed method shows consistently very high performance. In general, the fastest way to match feature-based signatures is to describe the feature with a bit-string and use the simple Hamming distance in the matching procedure. In computational terms, an XOR operation needs just one clock cycle, while other operations need more. However, a bit-string type signature has low accuracy relative to other types, since a bit-string can carry only binary information: ON or OFF. These problems of bit-string type signatures can be compensated by a proper design of the hash table. In this paper, we first describe the features with floating-point values and obtain the final signature by mapping their distributions into bit-strings. The proposed method shows high performance in both accuracy and speed by mixing the various features appropriately.
6 Conclusion

We proposed a very fast replica detection method based on a concentric circle partition of the image. The proposed method utilizes the partitioned regions and extracts four kinds of features, which have different characteristics, to describe the image efficiently. The proposed method shows a very high CR (Correct Retrieval) ratio of 97.6% under a 1 ppm false positive rate. The mixed signature has the form of a bit-string obtained by applying a simple hash table, which leads to a fast matching speed. The main advantages of the proposed replica detector are its high accuracy and matching speed. Therefore the proposed method will be very useful in real applications such as the detection of illegal image replicas.
Acknowledgments. The presented research is supported by INHA UNIVERSITY Research Grant.
References
1. Martinez, J.M., Koenen, R., Pereira, F.: MPEG-7: The Generic Multimedia Content Description Standard. IEEE Multimedia 9(2), 78–87 (2002)
2. ISO/IEC JTC1/SC29/WG1 WG1 N3684: JPSearch Part 1 TR - System Framework and Components (2005)
3. Maret, Y., Dufaux, F., Ebrahimi, T.: Image Replica Detection based on Support Vector Classifier. In: Proc. SPIE Applications of Digital Image Processing XXVIII, Santa Barbara, USA (2005)
4. Ke, Y., Sukthankar, R., Huston, L.: An Efficient Parts-Based Near-Duplicate and Sub-Image Retrieval System. In: ACM International Conference on Multimedia, pp. 869–876 (2004)
5. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004)
6. Mikolajczyk, K., Schmid, C.: Scale & Affine Invariant Interest Point Detectors. International Journal of Computer Vision 60, 63–86 (2004)
7. Jaimes, A., Shih-Fu, C., Loui, A.C.: Detection of non-identical duplicate consumer photographs. In: Proceedings of the 2003 Joint Conference of the Fourth International Conference on Information, Communications and Signal Processing and the Fourth Pacific Rim Conference on Multimedia, vol. 1, pp. 16–20 (2003)
8. Ke, Y., Sukthankar, R., Huston, L.: Efficient Near-duplicate Detection and Sub-image Retrieval. In: Proc. ACM Intl. Conf. on Multimedia, New York, pp. 869–876 (2004)
9. Wang, B., Li, Z., Li, M., Ma, W.Y.: Large-Scale Duplicate Detection for Web Image Search. In: IEEE International Conference on Multimedia and Expo, pp. 353–356 (2006)
10. Ke, Y., Sukthankar, R.: PCA-SIFT: A More Distinctive Representation for Local Image Descriptors. In: Proceedings of IEEE Computer Vision and Pattern Recognition (2004)
11. Indyk, P., Motwani, R.: Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In: Proceedings of the Symposium on Theory of Computing (1998)
12. Mammoth – 800,000 Clipart DVD by FastTrak, http://www.amazon.co.uk
Design of a Medical Image Database with Content-Based Retrieval Capabilities

Juan C. Caicedo, Fabio A. González, Edwin Triana, and Eduardo Romero

Bioingenium Research Group, Universidad Nacional de Colombia
{jccaicedoru,fagonzalezo,emtrianag,edromero}@unal.edu.co
http://www.bioingenium.unal.edu.co
Abstract. This paper presents the architecture of an image administration system that supports the medical practice in tasks such as teaching, diagnosis and telemedicine. The proposed system has a multi-tier, web-based architecture and supports content-based retrieval. The paper discusses the design aspects of the system as well as the proposed content-based retrieval approach. The system was tested with real pathology images to evaluate its performance, reaching a precision rate of 67%. The detailed results are presented and discussed. Keywords: content-based image retrieval, medical imaging, image databases.
1 Introduction
Nowadays medical environments generate a large number of digital images to support clinical decisions; the Geneva University Hospital reported a production rate of 12,000 daily images during 2002 [1]. The problem of archiving those medical image collections has been addressed with different solutions such as PACS1 [2,3] or modular and specialized systems for image databases [4,5]. The effectiveness of those systems may be critical in clinical practice [6], since they are responsible for storing medical images in a dependable way. Besides, these systems must allow users to efficiently access this information. Traditional medical image database systems store images as complementary data of the textual information, providing the most basic and common operations on images: transfer and visualization. Usually, these systems are restricted to querying a database only through keywords, but this kind of query limits information access, since it does not exploit the intrinsic nature of medical images. On the other hand, a medical image database system must have a flexible architecture along with a wide variety of functionalities supporting clinical, academic and research tasks [7]. Medical users must count on a set of automated and efficient tools, which permits efficient access to relevant information. A recent approach to medical image database management is the retrieval of information by content, named Content-Based Image Retrieval (CBIR) [1]. This
Picture Archiving and Communication Systems.
kind of systems allows the evaluation of new clinical cases so that, when similar cases are required, the system is able to retrieve comparable information for supporting diagnoses in the decision-making process. Several systems have been developed following this approach, such as ASSERT [8], IRMA [9] and FIRE [10], among others [11,12,13]. The main idea of this kind of systems is to provide adequate support to physicians in reaching their medical goals [7]. This paper presents the design, implementation and evaluation of an image database system for the Telemedicine Centre at the National University of Colombia. Physicians associated with the Telemedicine Centre have attended cases in dermatology, pathology and radiology, related to general and tropical diseases in remote places, since 2002. The number of received cases has been steadily increasing and each case has many images associated with different types of metadata. Presently, this amounts to an average of 100,000 digital images per year to be managed. The architecture of the proposed image database system is multi-tier and allows easy maintenance and extensibility. The system is a web-oriented service platform to explore any image collection from anywhere, following state-of-the-art security requirements. The system is capable of processing and storing any type of medical image format, including DICOM2, TIFF or any conventional digital photography. In addition, the proposed system includes a module for content-based retrieval, configurable with different features and metrics. The CBIR module performance was assessed with an annotated histopathology image collection composed of 1,502 basal-cell carcinoma images. The remainder of this paper is organized as follows. In section 2, we define the requirements for the image database system. In section 3, the proposed methods for content-based image retrieval are presented. The system architecture is then analyzed in section 4 and section 5 presents the results of the experimental evaluation. Finally, some concluding remarks are presented in section 6.
2 Defining Requirements
Many technologies are currently available to capture medical images and support clinical decisions, but there has been little work on developing clever management systems for that amount of visual information. The required system must support the medical workflow, which includes clinical, academic and research tasks. Figure 1 illustrates a general overview of the required system, which must provide support for image capturing and storing, as well as methods for allowing similarity queries.
2.1 Functional Requirements
There exist different devices and methods for image capture in a health center. A medical case may require either only one picture or many slides. Whatever the number of images to store is, the system must provide a standard interface that permits uploading of any number of images. Each image can also have different
Digital Imaging and Communication in Medicine.
Fig. 1. Overview of the image database system architecture
annotations or textual descriptions that need to be stored together with the visual raw data. The system must provide the possibility of associating textual metadata and of querying using available keywords, e.g. DICOM attributes. Once images are stored, physicians might explore a particular image collection for retrieving specific cases or simply for taking an overview of the whole collection. This exploration could be structured by metadata, filtering image results with some textual conditions, for instance dates or image modality. As a high number of results may be delivered in the exploration process, images should be presented in pages with a specific number of items per page. Image details must be shown at the user's request. The system must allow the downloading of images with their associated metadata. Since medical images are characterized by their particular visual patterns, the system must support queries to the image collection using a given pattern image, which then triggers the process of delivering a set of similar images. Querying by similarity is thus a very desirable and useful property of the system for physicians, to find images with similar patterns. CBIR systems have many benefits in a medical environment [1], e.g. as a computer-aided diagnosis tool. The system must provide tools for automatic image processing, with specialized image analysis if needed, to support queries by similarity.
2.2 Technical and Architectural Requirements
The design of such an image database system must consider extensibility and scalability: extensibility through a modular architecture that incorporates new functionalities and capabilities, and scalability by allowing new resources to be naturally plugged into the system in order to extend the service, covering more users or more data. The system must be capable of managing any type of digital image format, after the requirements for any medical speciality have been established; for instance,
in dermatology, where resolution is less crucial, the common acquisition method is a digital picture, commonly stored in JPEG format. In pathology, a digital camera coupled to the microscope allows users to capture images in TIFF, JPEG or BMP format, while the standard format for radiology is DICOM [5]. Importantly, the image database system should not be designed for personal use as desktop software. The main core system functionalities must be executed on a high-performance computing platform, allowing concurrent users through network services. In addition, images and medical data cannot be exposed to unauthorized access, so the system must provide an authentication security module. The system has to be devised for allowing access through the web [14], but keeping in mind that the main uses of a medical image database are to support diagnosis and the associated academic and research activities. In addition, the search for relevant images should be an easy process, that is to say, the system must provide content-based retrieval mechanisms such as query by example and relevance feedback [1].
3 Content-Based Retrieval Methods
The content-based image retrieval module allows users to obtain a set of similar images from an example image, i.e., making a query-by-example request [15]. The example image is analyzed and compared with the other database images, and the most similar images are identified in a ranking process. In this process, image features and similarity measures are used.
3.1 Image Retrieval Model
The first goal here is to define how images have to be compared. A medical image is a two-, three- or four-dimensional discrete signal with information about colors, luminance, volume or time in one scene. Therefore, images are difficult to compare, because each one is an object with complex properties and a different structure. Other technical aspects make images difficult to compare, such as different widths or heights, color representations and formats. A very convenient manner to face these difficulties consists of using a statistical frame: images are modeled as random variables because they are the product of a stochastic process. Then, many statistical measurements can be obtained from images and used as characteristic features. On the other hand, image analysis is required for structuring the visual data information. Common features computed for such tasks comprise a broad range of possibilities [16], but the very basic ones are color distribution, edges and textures. Formally, the feature extraction process is defined as:

Definition 1. Let I be the image collection in the database. Let F_k be the space defined for a feature k. The extraction process for a feature k is defined as a function between I and F_k:

E_k : I \longrightarrow F_k   (1)
There exists a feature space onto which images are mapped when a specific feature is extracted, so that all images are then represented by their corresponding features in that space. In addition, many feature spaces have to be supported by the image database system and different measurement strategies must be defined for each. If we assume that those feature spaces are metric spaces, distance functions can be devised for determining the similarity degree of images in each of such metric spaces. A metric space is a tuple (F_k, M_k), where F_k is a set and M_k a metric on F_k, as follows:

Definition 2. Let F_k × F_k be the Cartesian product between features of the same space. Let M_k be a metric that calculates the similarity between a pair of given features, then

M_k : F_k \times F_k \longrightarrow \mathbb{R}   (2)

such that:
1. M_k(x, y) ≥ 0 (non-negativity)
2. M_k(x, y) = 0 if and only if x = y (identity)
3. M_k(x, y) = M_k(y, x) (symmetry)
4. M_k(x, z) ≤ M_k(x, y) + M_k(y, z) (triangle inequality)

Definition 2 permits introducing an order relationship between images using a feature k and a metric M_k. The previous definitions allow performing image comparisons using one feature and one metric. However, better comparisons may be achieved using many features and a linear combination of different metrics, as follows:

Definition 3. Let x, y ∈ I be images. Let E_k be the feature extraction function of a feature k and M_k be a metric in the feature space F_k. A similarity function for different features is defined as the linear combination of the metrics M_k with importance factors w_k:

s(x, y) = \sum_k w_k M_k(E_k(x), E_k(y))   (3)
3.2 Feature Extraction
Although it is desirable to match image features and semantic concepts, most image processing techniques actually compute information at a very basic level. This is the well-known semantic gap, and there are different approaches to bridge it, many of them including prior information about the application domain [17]. In this work we use features corresponding to a perceptual level, without any conceptual structure, named low-level features. These features were selected both to evaluate the extensibility of the proposed architecture and to assess performance regarding medical image retrieval. We select histogram features such as:

– gray scale and color histogram
– local binary partition
– Sobel histogram
– Tamura textures
– invariant feature histogram

These features have been previously used for content-based retrieval in different scenarios, and their details are described in [18,19].
3.3 Similarity Measures
The features described previously define a set of feature spaces, and each image in the database is mapped onto each of those spaces. We assume that each feature space is characterized by a special topology, requiring a specific metric. We used the following metrics as candidates to make measurements in each feature space:
– Euclidean distance
– histogram intersection
– Jensen-Shannon divergence
– relative bin deviation
– Chi-square distance
In general, these metrics have been defined to calculate the similarity between two probability distributions [18], since the features defined in this work are global histograms. The most appropriate metric for each feature space is found by experimental evaluation, as described in section 5.
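As an illustration of the kind of metrics listed above (a sketch, not the system's actual code), some of them can be written for normalized histograms as follows:

```python
import numpy as np

def euclidean(h1, h2):
    return np.sqrt(np.sum((h1 - h2) ** 2))

def histogram_intersection(h1, h2):
    # similarity in [0, 1]; 1 - intersection can be used as a dissimilarity
    return np.sum(np.minimum(h1, h2))

def jensen_shannon(h1, h2, eps=1e-12):
    m = 0.5 * (h1 + h2)
    kl = lambda p, q: np.sum(p * np.log((p + eps) / (q + eps)))
    return 0.5 * kl(h1, m) + 0.5 * kl(h2, m)

def chi_square(h1, h2, eps=1e-12):
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

# two normalized gray-level histograms as an example
h1 = np.random.rand(64); h1 /= h1.sum()
h2 = np.random.rand(64); h2 /= h2.sum()
print(euclidean(h1, h2), histogram_intersection(h1, h2),
      jensen_shannon(h1, h2), chi_square(h1, h2))
```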
3.4 Retrieval Algorithm
In order to find the set of images most similar to a query image, a retrieval algorithm that ranks the images is required. This algorithm uses the metric information to sort images in decreasing order, starting from the most similar one. The retrieval algorithm receives as parameters the query image and the set of metrics, features and importance factors to be applied. Then, the algorithm evaluates the similarity between each database image and the query image. This evaluation calculates each metric with its corresponding feature and importance factor and integrates over all results to provide a unique similarity score per item. When all images have been evaluated, the results are sorted by similarity score. Although all images are ranked, only the n top images of the ranking are presented. The parameter n can be configured, and the user can request additional results if needed. This algorithm can compute a single metric or a linear combination of them, since it receives as parameters the set of metrics to be applied, the feature extraction functions and the importance factors associated with each metric.
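A compact sketch of the ranking procedure just described, under the assumption that each database image has pre-computed features per feature space and that the metrics return dissimilarities (smaller means more similar); the names and structures here are illustrative, not the system's API.

```python
import numpy as np

def retrieve(query_features, database, metrics, weights, n=10):
    """
    query_features: dict feature_name -> feature vector of the query image
    database:       list of (image_id, dict feature_name -> feature vector)
    metrics:        dict feature_name -> distance function
    weights:        dict feature_name -> importance factor w_k (Eq. (3) combination)
    """
    scored = []
    for image_id, feats in database:
        score = sum(weights[k] * metrics[k](query_features[k], feats[k])
                    for k in metrics)
        scored.append((score, image_id))
    scored.sort()                      # ascending: most similar first
    return scored[:n]

# toy usage with a single 'gray_histogram' feature and Euclidean distance
euclid = lambda a, b: float(np.linalg.norm(a - b))
db = [(f"img_{i}", {"gray_histogram": np.random.rand(64)}) for i in range(100)]
query = {"gray_histogram": np.random.rand(64)}
print(retrieve(query, db, {"gray_histogram": euclid}, {"gray_histogram": 1.0}, n=5))
```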
4 Proposed Architecture
The proposed image database architecture is based on the Java 2 Enterprise Edition (J2EE) [20], which provides a multi-tier, distributed application model. In this model, each tier (also called layer) has a particular function in
order to guarantee easy maintenance, extensibility and low coupling. Each tier offers services to other tiers; e.g., the Persistent tier provides support to retrieve and store results of the Business tier, just as the Business tier processes information fed to the Web tier. J2EE also supports transactions and web services, and provides a well-defined security scheme to control access from local and remote clients through a role-based security model for Enterprise Java Beans (EJB) and web components.
4.1 Architecture
The system architecture is composed of five main layers, a strategy which allows dividing processing responsibilities, data management and verification. The global model of the architecture can be viewed in Figure 2. As said before, each tier has a particular function and the loose interaction between the tiers results in a system with the whole functionality. Each layer is described hereafter:
1. Client Tier: It contains the Graphical User Interface (GUI), which allows interaction with the system and visualization of images and data. This tier has two client types: the web client, who uses the internet browser and applets to access the system through the Web tier; and the standalone client, which can be a remote application using RMI3 through JNDI4 to access the J2EE server.
2. Web Tier: It has Java Server Pages (JSP) and servlets. This tier processes requests and builds responses based on the Business tier results. This layer uses a local interface to invoke logic methods.
3. Business Tier: This is the system core, composed of Enterprise Java Beans (EJB) and plain Java classes. There are two types of EJBs: Session Beans and Entity Beans. Session Beans are objects representing clients on the server side, which can access logic methods such as image archiving, image group configuration and image search by attributes, among others. Entity Beans constitute a database abstraction to execute SQL queries and to access relational databases with Java classes. This tier also has the Metadata Manager module, to record images and extract textual information such as name, size and DICOM attributes, and the CBIR module, responsible for the feature extraction and similarity operations.
4. Persistent Tier: It provides tools to access the file system, where images and their thumbnails are stored, as well as the database that contains the metadata information and image features.
5. Security Tier: It provides access control to the application based on the role provided by the JBossSX security extension, configured in a standard XML descriptor. The security scheme verifies the user role and allows or refuses access to some methods or domains.
3 Remote Method Invocation. 4 Java Naming and Directory Interface.
Fig. 2. System global architecture
4.2
Content-Based Retrieval Module
One of the main concerns of the proposed architecture is the CBIR module, which is located in the Business tier. The design of this module is based on design patterns that support extension by adding new feature extraction methods and new similarity functions. The CBIR module uses a standardized representation of digital images in order to apply algorithms independently of the format details. Image features are coded and stored in the Persistent layer to build a content-based index. The CBIR module has four main submodules: feature representation, feature extraction algorithms, similarity functions and retrieval algorithms. The feature representation submodule provides a class hierarchy with a common interface to access feature data, and some specializations to handle specific features such as histograms, vectors and trees. The feature extraction submodule uses the Template pattern to encode each feature extraction algorithm in a class method and associate it with a class in the feature hierarchy. The similarity functions submodule uses a hierarchy of metrics with the Command pattern, allowing an abstract configuration of the parameters to be compared to return the similarity score. The retrieval algorithms submodule provides a framework to configure different retrieval algorithms with multiple feature-metric pairs and their associated importance factors, designed with the Observer and Iterator patterns. Every submodule also includes a Factory pattern to dynamically create objects of its hierarchy. With this structure, it is easy to develop new algorithms for feature extraction and similarity evaluation, making the module reusable and extensible. The currently implemented features include the histogram features described in Section 3, together with the corresponding similarity functions.
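To make the pattern-based structure concrete, the fragment below is a minimal sketch (written in Python rather than the system's Java) of how a feature hierarchy, a metric hierarchy and a factory could fit together; every class and method name here is illustrative, not the module's real API.

```python
from abc import ABC, abstractmethod

class Feature(ABC):
    """Common interface of the feature hierarchy (histograms, vectors, trees, ...)."""
    @abstractmethod
    def data(self):
        ...

class FeatureExtractor(ABC):
    """Template-style base class: each subclass encodes one extraction algorithm."""
    @abstractmethod
    def extract(self, image) -> Feature:
        ...

class Metric(ABC):
    """Command-style similarity function between two features of the same type."""
    @abstractmethod
    def compare(self, a: Feature, b: Feature) -> float:
        ...

class MetricFactory:
    """Factory that creates metric objects by name, so new metrics can be plugged in."""
    _registry: dict = {}

    @classmethod
    def register(cls, name: str, metric_cls):
        cls._registry[name] = metric_cls

    @classmethod
    def create(cls, name: str) -> Metric:
        return cls._registry[name]()
```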
5
Retrieval Performance Evaluation
Like other information retrieval systems, a CBIR system resolves queries in an approximate way, because users are not aware of the exact results that should be
delivered [21]. That means that CBIR systems must be evaluated to determine the degree of precision of the retrieval process, revealing how good the system is at locating relevant images associated with the user query. Since one of the most important contributions of the proposed architecture is the content-based image retrieval module, it is important to assess its performance. In this work, a specific evaluation has been made using an experimental dataset of histopathology images, which are used to diagnose a particular type of skin cancer. 5.1
Evaluation Framework
When evaluating information retrieval systems, it is important to define what a perfect system response would look like, named the ground truth [22]. Many approaches to defining a ground truth have been used in the literature, including user assessment of relevancy, the automatic definition of classifications from available image annotations, and the manual definition of user information needs. In this work, a ground truth was defined through the analysis, annotation and coding of a set of images, performed manually by pathologists. The image collection selected to evaluate the system is a database of 6,000 histopathology images, from which a subset of 1,502 images was selected as ground truth. The ground truth, created by pathologists, is composed of 18 general query topics, corresponding to possible information needs in pathology. On average, each query topic has 75 relevant images, and many images are shared by different query topics, i.e., query topics are not disjoint sets, because differential diagnosis patterns can belong to one or many categories. For the experimental test, each image in the result set is evaluated against the ground truth to see whether it is relevant or not. Müller et al. [23] present a framework to evaluate CBIR systems in order to report comparable results from different research centers in a standardized way. The most representative of these performance measures are precision and recall:

precision = (number of relevant retrieved images) / (number of all retrieved images)   (4)

recall = (number of relevant retrieved images) / (number of all relevant images)   (5)
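A minimal sketch of how Eqs. (4) and (5) can be computed for a ranked answer list is shown below; the `relevant` set is assumed to come from the ground truth described above, and the names are illustrative.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query (Eqs. 4 and 5).

    retrieved: ranked list of image ids returned by the system
    relevant:  set of image ids marked as relevant in the ground truth
    """
    hits = sum(1 for image_id in retrieved if image_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Evaluating the measures at several cut-off points yields the data for a PR graph:
# pr_points = [precision_recall(results[:k], relevant) for k in (1, 5, 10, 20, 50)]
```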
These performance measures are easy to understand and can be computed for any number of retrieved images. Another widely used performance evaluation mechanism in CBIR systems is the precision vs. recall graph (PR graph), which describes the behavior of the system at many operating points. 5.2
Experimental Design
The experimental framework uses the subset of 1,502 annotated images which allows determining whether results are relevant or not. Each experiment is composed of 30 queries (images) randomly selected from the annotated collection.
When results are obtained, the evaluation procedure verifies whether each result belongs to the same query topic as the example image, marking it as relevant or irrelevant. The goal of this experimentation is to identify the features and metrics that output the best results to the user, based on the information of the ground truth. There are different situations to evaluate. Since many metrics are available, it is important to identify which is the best choice for each feature space. Then, knowing which metric to use for each feature, the test evaluates which feature-metric pair presents the best performance in the general ranking process. In addition, the test verifies whether the combination of different features performs better than the individual feature-metric pairs. 5.3
Results
The identification of the best metric in each feature space is determined by the values of precision and recall obtained in the experimentation. Each feature was tested with each metric, selecting the best precision rate per feature. Results are shown in Table 1, where each feature is associated with the metric that yields the best performance in that space. Reported values correspond to the average precision of the 30 random queries at the first result; e.g., the Sobel-JSD pair returns a relevant image in the first position in 61% of the cases. Since those results are the best per feature, i.e., each feature was tested with all metrics and the presented results are the best feature-metric pair, Table 1 also shows the features ordered by their precision rate. This test shows that edges, encoded in the Sobel histogram, perform better than any other histogram feature, which suggests that edges are an important feature for differential diagnosis in pathology.

Table 1. Average precision values for the best feature-metric pairs

Feature                     | Metric                     | P(1)
Sobel                       | Jensen-Shannon Divergence  | 61%
RGB Histogram               | Relative Bin Deviation     | 53%
Local Binary Partition      | Relative Bin Deviation     | 53%
Gray Histogram              | Relative Bin Deviation     | 50%
Tamura Texture              | Euclidean Distance         | 39%
Invariant Feature Histogram | Relative Bin Deviation     | 36%
Testing the combination of different features, an average precision rate of 67% was achieved for the first image retrieved. According to Definition 3, a linear combination of features requires the use of different importance factors. In this test, those factors were identified by exhaustive search, finding a combination of 50% for the Local Binary Partition, 30% for the Sobel Histogram and 20% for the RGB Histogram. In the PR graph, this configuration outperforms the individual feature-metric pairs previously tested. This tendency is better shown in Figure 3, in which the combination approach is compared with the best three individual metrics.
Fig. 3. Precision vs. recall graph comparing the performance of the linear combination approach (Linear Comb.) and the individual metrics Sobel-JSD, RGB Hist-RBD and LBP-RBD
The linear combination of features shows better performance than the individual features, i.e., the precision-recall pairs are the best for the combination of features in almost all cases.
6
Conclusions
This paper has presented the design, development and evaluation of a medical image database system, now in use in the Telemedicine Centre at the National University of Colombia5. The proposed system exhibits some particular features that distinguish it from traditional image management systems: its architecture is multi-tier, it provides web access to image collections, and it allows content-based retrieval. The content-based retrieval module provides a search-by-example capability, i.e., the user can retrieve images that are similar to a given reference image. Similar images are retrieved through a two-phase process: feature extraction and similarity evaluation. Different low-level features were implemented, including color, texture and edges. Different similarity measures were also tested, since a given feature, such as a histogram, requires an appropriate metric. The content-based retrieval module was evaluated using a collection of annotated histopathology images. The images were annotated by a specialist, establishing a baseline for the system performance. This evaluation demonstrates that some low-level features can approximate, to a certain level, the differential diagnosis criteria used by pathologists, which is deemed adequate for teaching purposes by the pathologists who annotated these images. The results may be
5 http://www.telemedicina.unal.edu.co
outperformed by using high-level features that take into account the semantics of images. The modelling and implementation of these high-level features is part of our future work.
References 1. M¨ uller, H., Michoux, N., Bandon, D., Geissbuhler, A.: A review of content based image retrieval systems in medical applications clinical bene ts and future directions. International Journal of Medical Informatics 73, 1–23 (2004) 2. Costa, C.M., Silva, A., Oliveira, J.L., Ribeiro, V.G., Ribeiro, J.: A demanding web-based pacs supported by web services technology. SPIE Medical Imaging 6145 (2006) 3. Gutierrez, M., Santos, C., Moreno, R., Kobayashi, L., Furuie, S., Floriano, D., Oliveira, C., Jo˜ ao, M., Gismondi, R.: Implementation of a fault-tolerant pacs over a grid architecture. SPIE Medical Imaging - Poster Session 6145 (2006) 4. Chadrashekar, N., Gautham, S.M., Srinivas, K.S., Vijayananda, J.: Design considerations for a reusable medical database. In: IEEE International Symposium on Computer-Based Medical Systems, pp. 69–74 (2006) 5. Marcos, E., Acu˜ na, C., Vela, B., Cavero, J., Hern´ andez, J.: A database for medical image management. Computer Methods and Programs in Biomedicine 86, 255–269 (2007) 6. Caramella, D.: Is pacs research and development still necessary? International Congress Series 1281, 11–14 (2005) 7. Doi, K.: Computer-aided diagnosis in medical imaging: Historical review, current status and future potential. Computerized Medical Imaging and Graphics 31, 198– 211 (2007) 8. Shyu, C.-R., Brodley, C.E., Kak, A.C., Kosaka, A., Aisen, A.M., Broderick, L.S.: Assert: A physician-in-the-loop content-based retrieval system for hrct image databases. Computer Vision and Image Understanding 75, 111–132 (1999) 9. Lehmann, T.M., G¨ uld, M.O., Thies, C., Plodowski, B., Keysers, D., Ott, B., Schubert, H.: The irma project: A state of the art report on content-based image retrieval in medical applications. Korea-Germany Workshop on Advanced Medical Image, 161–171 (2003) 10. Deselaers, T., Weyand, T., Keysers, D., Macherey, W., Ney, H.: Fire in imageclef 2005: Combining content-based image retrieval with textual information retrieval. Image Cross Language Evaluation Forum (2005) 11. Traina, A.J., Castanon Jr., C.A., C.T.: Multiwavemed: A system for medical image retrieval through wavelets transformations. In: 16th IEEE Symposium on Computer-Based Medical Systems (2003) 12. Tan, Y., Zhang, J., Hua, Y., Zhang, G., Huang, H.: Content-based image retrieval in picture archiving and communication systems. SPIE Medical Imaging - Posters 6145 (2006) 13. M¨ uller, H., Hoang, P.A.D., Depeursinge, A., Hoffmeyer, P., Stern, R., Lovis, C., Geissbuhler, A.: Content-based image retrieval from a database of fracture images. SPIE Medical Imaging 6516 (2007) 14. Lozano, C.C., Kusmanto, D., Chutatape, O.: Web-based design for medical image. In: IEEE International Conference on Control, Automation, Robotics and Vision 3, pp. 1700 – 1705 (2002)
15. Petrakis, E.G.M., Faloutsos, C.: Similarity searching in medical image databases. IEEE Transactions on Knowledge and Data Engineering 9, 435–447 (1997) 16. Nikson, M.S., Aguado, A.S.: Feature Extraction and Image Processing. Elsevier, Amsterdam (2002) 17. Liu, Y., Zhang, D., Lu, G., Ma, W.-Y.: A survey of content-based image retrieval with high-level semantics. Pattern Recognition 40, 262–282 (2007) 18. Deselaers, T.: Features for Image Retrieval. PhD thesis, RWTH Aachen University. Aachen, Germany (2003) 19. Siggelkow, S.: Feature Histograms for Content-Based Image Retrieval. PhD thesis, Albert-Ludwigs-Universit¨ at Freiburg im Breisgau (2002) 20. Ashmore, D.C.: The J2EE architect’s handbook. DVT Press (2004) 21. Yates, R.B., del Solar, J.R., Verschae, R., Castillo, C., Hurtado, C.: Contentbased image retrieval and characterization on specific web collections. In: Enser, P.G.B., Kompatsiaris, Y., O’Connor, N.E., Smeaton, A.F., Smeulders, A.W.M. (eds.) CIVR 2004. LNCS, vol. 3115, pp. 189–198. Springer, Heidelberg (2004) 22. M¨ uller, H., Rosset, A., Vallee, J.P., Geissbuhler, A.: Comparing features sets for content-based image retrieval in a medical-case database. Medical Imaging 5371, 99–109 (2004) 23. M¨ uller, H., M¨ uller, W., Marchand-Maillet, S., Squire, D.M., Pun, T.: A Framework for Benchmarking in Visual Information Retrieval. International Journal on Multimedia Tools and Applications 22, 55–73 (2003) (Special Issue on Multimedia Information Retrieval)
A Real-Time Object Recognition System on Cell Broadband Engine Hiroki Sugano1 and Ryusuke Miyamoto2 1
2
Dept. of Communications and Computer Engineering, Kyoto University, Yoshida-hon-machi, Sakyo, Kyoto, 606-8501, Japan
[email protected] Dept. of Information Systems, Nara Institute of Science and Technology, 8916-5, Takayama-cho, Ikoma, Nara, 630-0192, Japan
[email protected]
Abstract. Accurate object recognition based on image processing is required in embedded applications, where real-time processing is expected to accompany accurate recognition. To achieve accurate real-time object recognition, an accurate recognition algorithm that can be accelerated by parallel implementation and a processing system that can execute such algorithms in real time are necessary. In this paper, we implemented in parallel an accurate recognition scheme that consists of boosting-based detection and histogram-based tracking on a Cell Broadband Engine (Cell), one of the latest high-performance embedded processors. We show that the Cell can achieve real-time object recognition on QVGA video at 22 fps with three targets and 18 fps with eight targets. Furthermore, we constructed a real-time object recognition system using the SONY Playstation 3, one of the most widely used Cell platforms, and demonstrated face recognition with it. Keywords: Object recognition, Cell Broadband Engine, Real-time processing, Parallel implementation.
1
Introduction
There is currently a strong need to realize object recognition systems based on image processing for embedded applications such as automotive applications, surveillance, and robotics. In these applications, highly accurate recognition must be achieved with real-time processing under limited system resources. For such an achievement, both a highly accurate recognition algorithm suitable for parallel processing and a real-time processing system suitable for image recognition must be developed. Generally, object recognition based on image processing is achieved by combining object detection and tracking [1]. For example, a neural network [2], a support vector machine [3,4], and boosting [5] are adopted in the detection phase for pedestrian recognition, one application of object recognition. In some cases, candidate extraction based on segmentation is also adopted to enhance the detection performance [6]. In the tracking phase, particle filter-based schemes have recently become widely used [7,8], although Kalman filter-based schemes used to be popular.
On the other hand, some works toward real-time processing of object recognition on embedded systems exist. Some aim for rapid object detection by a specialized processor [9], and others propose real-time stereo that sometimes aids object detection [10]. In such works, Field Programmable Gate Arrays (FPGA), which are programmable hardware, Application Specific Integrated Circuits (ASIC), which are hardware designed for a specific application, and high-performance Digital Signal Processors (DSP) are adopted. However, a highly accurate real-time object recognition system has not been developed yet. In this paper, we propose a real-time object recognition system that achieves highly accurate recognition. In the proposed system, an object recognition algorithm based on the scheme proposed in [11] is adopted. In this recognition scheme, boosting-based detection and color histogram-based tracking with a particle filter are used for the detection and tracking phases, respectively. Because both have massive parallelism, parallel implementation is expected to improve processing speed. As the processing device, we adopt the Cell Broadband Engine (CBE), one of the latest high-performance embedded processors for general-purpose use, which has a novel memory management system to achieve efficient computation with parallel execution units. By utilizing the computational power of the CBE, which is well suited to image recognition, we realize a highly accurate real-time object recognition system. The rest of this paper is organized as follows. Section 2 describes the boosting-based detection and particle filter-based tracking adopted in the proposed system. In Section 3, the CBE architecture is summarized and parallel programming on the CBE is introduced. Section 4 explains the parallel implementation of detection and tracking. In Section 5, a real-time object recognition system on the SONY Playstation 3, an embedded CBE platform, is described. Section 6 demonstrates face recognition with the system, and Section 7 concludes this paper.
2
Preliminaries
In the proposed system, boosting-based detection and histogram-based tracking with a particle filter are adopted for the detection and tracking phases, respectively. In this section, an overview of boosting and histogram-based tracking is described. 2.1
Boosting
Boosting is one ensemble learning method with which an accurate classifier is constructed by combining weak hypotheses learned by a weak learning algorithm. The obtained classifier consists of weak hypotheses and a combiner, and output is computed by the weighted votes of weak hypotheses. In the proposed scheme, AdaBoost [12], one of the most popular methods based on boosting, is adopted for construction of an accurate classifier. AdaBoost’s learning flow is shown as follows.
Algorithm 2.1: AdaBoost(h, H, (x1, y1), ..., (xn, yn), m, l, T)

for i ← 1 to n do
    if yi == 1 then w1,i = 1/(2l) else w1,i = 1/(2m)
for t ← 1 to T do
    for i ← 1 to n do
        wt,i ← wt,i / Σj=1..n wt,j          (normalize the weights)
    for j ← 1 to H do
        εj = Σi wt,i |hj(xi) − yi|          (weighted error of weak classifier hj)
    choose the classifier ht with the lowest error εt
    for i ← 1 to n do
        wt+1,i = wt,i βt^(1−ei), with βt = εt / (1 − εt),
        where ei = 0 if example xi is classified correctly and ei = 1 otherwise
The final strong classifier is:
    h(x) = 1 if Σt=1..T αt ht(x) ≥ (1/2) Σt=1..T αt, and 0 otherwise, with αt = log(1/βt)
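A minimal NumPy sketch of this training loop is given below for illustration; weak classifiers are assumed to be precomputed 0/1 prediction arrays, which is a simplification of the Haar-feature classifiers actually used, and the error εt is assumed to stay strictly between 0 and 0.5.

```python
import numpy as np

def adaboost(H_pred, y, T):
    """H_pred: (H, n) array, H_pred[j, i] = output of weak classifier j on sample i (0 or 1).
    y: (n,) array of labels (0 = negative, 1 = positive)."""
    m, l = np.sum(y == 0), np.sum(y == 1)              # numbers of negatives and positives
    w = np.where(y == 1, 1.0 / (2 * l), 1.0 / (2 * m))  # initial weights
    chosen, alphas = [], []
    for _ in range(T):
        w = w / w.sum()                                 # normalize weights
        errors = np.abs(H_pred - y).dot(w)              # weighted error of every weak classifier
        j = int(np.argmin(errors))                      # pick the best one
        beta = errors[j] / (1.0 - errors[j])
        correct = (H_pred[j] == y)                      # e_i = 0 when classified correctly
        w = w * np.where(correct, beta, 1.0)            # w_{t+1,i} = w_{t,i} * beta^(1 - e_i)
        chosen.append(j)
        alphas.append(np.log(1.0 / beta))
    return chosen, np.array(alphas)

def strong_classify(x_pred, chosen, alphas):
    """x_pred[j] = output of weak classifier j on a new sample."""
    score = sum(a * x_pred[j] for j, a in zip(chosen, alphas))
    return 1 if score >= 0.5 * alphas.sum() else 0
```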
where x is an input sample and y is its label: the input is a negative sample if y = 0 and a positive sample if y = 1. T is the number of weak classifiers that the strong classifier consists of, m and l are the numbers of negative and positive examples, respectively, h is the set of candidate weak classifiers, and H is the number of weak classifiers in that set. 2.2
Histogram-Based Tracking
Histogram-based tracking is a particle filter-based tracking scheme in which the state space, the state transition, and the way the likelihood is computed must be defined. In the rest of this subsection, the state space, the state transition, and the computation method of the likelihood used in histogram-based tracking are described.

State Space. In the histogram-based tracking scheme, each particle of the distribution represents a rectangle and is given as:

s_t = \{x_t, y_t, x_{t-1}, y_{t-1}, w_0, h_0, a_t, a_{t-1}\}   (1)

where x_t and y_t specify the current location of the rectangle, x_{t-1} and y_{t-1} the previous location, w_0 and h_0 the initial width and height of the rectangle, and a_t and a_{t-1} the scale changes corresponding to the initial width and height.

State Transition. In histogram-based tracking, the probability distribution of a tracking target at the next time step is represented by:

q_B^*(x_t \mid x_{0:t-1}, Y_{1:t}) = \alpha\, q_{ada}(x_t \mid x_{t-1}, Y_t) + (1 - \alpha)\, p(x_t \mid x_{t-1})   (2)

where p(x_t \mid x_{t-1}) is the distribution from the previous time step and q_{ada} is the probability distribution derived from the detection results.
Fig. 1. State Transition (curves shown: p(x_t | x_{t-1}), q_{ada} and q_B^*)
Figure 1 shows the state transition given by the above equation. In this scheme, the detection results are used for the state transition to enhance tracking accuracy, as shown in the figure.

Likelihood Computation. In this scheme, the likelihood is computed using an HSV histogram [13] as follows. First, \xi, the Bhattacharyya distance between K^*, the HSV histogram of the area detected by the learning machine constructed by boosting, and K(s_t^{(i)}), the HSV histogram of the predicted sample s_t^{(i)}, is calculated by:

\xi[K^*, K(s_t^{(i)})] = \left(1 - \sum_{n=1}^{M} \sqrt{k^*(n)\, k(n; s_t^{(i)})}\right)^{1/2}   (3)

where k^*(n) and k(n; s_t^{(i)}) are the elements of K^* and K(s_t^{(i)}), respectively, and M is the size of the histogram. Next, the likelihood \pi_t^{(i)} of sample s_t^{(i)} is computed by:

\pi_t^{(i)} = \exp\left(-\lambda\, \xi^2[K^*, K(s_t^{(i)})]\right)   (4)

where \lambda is a constant defined experimentally based on the application.
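A brief NumPy sketch of Eqs. (3) and (4), assuming the HSV histograms have already been computed and normalized to sum to one; the histogram binning itself and the value of lambda are application details that are not fixed here.

```python
import numpy as np

def bhattacharyya_distance(k_ref, k_sample):
    """Eq. (3): xi = sqrt(1 - sum_n sqrt(k*(n) * k(n; s)))."""
    bc = np.sum(np.sqrt(k_ref * k_sample))      # Bhattacharyya coefficient
    return np.sqrt(max(1.0 - bc, 0.0))          # clamp for numerical safety

def particle_likelihood(k_ref, k_sample, lam=20.0):
    """Eq. (4): pi = exp(-lambda * xi^2); lambda is chosen experimentally."""
    xi = bhattacharyya_distance(k_ref, k_sample)
    return float(np.exp(-lam * xi ** 2))
```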
3
Overview of Cell Broadband Engine
In this section, Cell Broadband Engine architecture is summarized and parallel programming on CBE is introduced. 3.1
Architecture
Cell Broadband Engine (Cell) is a multi-core processor jointly developed by SONY, Toshiba, and IBM. Fig. 2 shows its architecture. A Cell is composed
of one “Power Processor Element” (PPE) and eight “Synergistic Processor Elements” (SPE). PPE is the Power Architecture-based core that handles most of the computational workload, and SPE is a RISC processor with 128-bit SIMD organization for stream processing. PPE and SPEs are linked by an internal high speed bus called “Element Interconnect Bus” (EIB).
Fig. 2. Cell Broadband Engine architecture (PPE with PPU and L1/L2 cache, eight SPEs each containing an SPU, Local Store and MFC, connected by the Element Interconnect Bus to main storage and FlexIO/bridge-chip I/O)
The PPE works with conventional operating systems due to its similarity to other 64-bit PowerPC processors. It also acts as the controller for the multiple SPEs. Each SPE can operate independently once the PPE boots it up. In the current Cell generation, each SPE contains a 256 KB instruction and data local memory area called the "Local Store," which does not operate as a conventional CPU cache. Instead, the programmer explicitly writes DMA operation code to transfer data between the main memory and the local store. Each SPE contains a register file of 128 128-bit registers. This feature enables the SPE compiler to optimize memory accesses and exploit instruction-level parallelism. 3.2
Parallel Implementation
We optimize our object recognition system for the Cell to realize real-time processing. This section describes Cell-specific programming methods that suit the CBE architecture. – Multiple SPEs: First, separate the processing into several groups so that multiple SPEs operate independently on each processing group. An example in image processing is filter processing with 4 SPEs: divide the image into 4 blocks and allocate one block to each SPE. Note that the instruction and data local memory area in an SPE must be less than 256 KB.
– Single Instruction Multiple Data (SIMD): An SPE contains 128-bit SIMD units and can operate on sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers, or four single-precision floating-point numbers in a single clock cycle. – Loop unrolling: An SPE contains a 128 × 128-bit register file. Unrolling loops increases register usage in a single iteration, but decreases the number of memory accesses and loop overhead instructions (branch or increment instructions).
4
Parallel Implementation
In this section, parallel implementation of boosting-based detection and histogram-based tracking on a Cell Broadband Engine are described. 4.1
Boosting-Based Detection
An object detection scheme based on boosting with Haar-like features is executed as follows:
1) generate the integral image of an input image,
2) search objects by scanning the whole input image with the constructed detector,
3) enlarge the scale of the features used in the detector,
4) terminate detection if the size of the features becomes greater than the size of the input image; otherwise go to 2.
The detection scheme can also be performed by scaling the input image instead of scaling the features, as follows:
1) generate an integral image of the input image,
2) search objects by scanning the entire input image with the constructed detector,
3) scale down the input image,
4) terminate detection if the input image becomes smaller than the features; otherwise go to 1.
The latter scheme requires computational cost for generating the shrunk images, but it is suitable for parallel implementation by specialized hardware or a SIMD processor because the feature size is fixed. Furthermore, the authors showed that the latter scheme can achieve accuracy identical to the former scheme. Therefore, we adopt the latter scheme, which is expected to be suitable for the SIMD operation of the SPE. The integral images used for the detection phase are generated by:

I(x, y) = \sum_{n=0}^{y} \sum_{m=0}^{x} f(m, n)   (5)

where f(m, n) and I(x, y) are the luminance of image f at (m, n) and the integral image, respectively. Using the integral image, we obtain S_{ABCD}, which is the
sum of the luminance of the area enclosed by points A, B, C, and D shown in Fig. 3, by:

S_{ABCD} = I(D) - I(C) - I(B) + I(A)   (6)

This operation includes only four load operations and three arithmetic operations.
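The following NumPy sketch illustrates Eqs. (5) and (6): the integral image is built once and any rectangular sum is then evaluated with four lookups. It is only an illustration of the data structure, not the SPE implementation.

```python
import numpy as np

def integral_image(f):
    """I(x, y) = sum of f over the rectangle [0, x] x [0, y] (Eq. 5)."""
    return f.cumsum(axis=0).cumsum(axis=1)

def box_sum(I, x0, y0, x1, y1):
    """Sum of luminance inside the rectangle (x0, y0)-(x1, y1), inclusive (Eq. 6).

    Four loads and three additions/subtractions, independent of the box size."""
    total = I[y1, x1]
    if x0 > 0:
        total -= I[y1, x0 - 1]
    if y0 > 0:
        total -= I[y0 - 1, x1]
    if x0 > 0 and y0 > 0:
        total += I[y0 - 1, x0 - 1]
    return total

# Example: sum over a 24 x 24 detection window anchored at (8, 8)
# img = np.random.randint(0, 256, size=(240, 320))
# I = integral_image(img)
# s = box_sum(I, 8, 8, 31, 31)
```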
Fig. 3. Computation using integral images
In this implementation, the generation and the scaling of integral images are performed by the PPE, and the detection using features is executed on the SPEs. Each detection, corresponding to a different scale, is individually mapped to an SPE. By this partitioning of the detection phase, applying the features, the generation of integral images, and the scaling of integral images are executed in parallel, which reduces the total processing time. In each SPE, detection by features is computed in parallel by applying the SIMD operation. Detection is performed by moving the detection window to the adjacent coordinates.

Fig. 4. Parallel execution of detection by multiple SPEs
Fig. 5. Parallel computation of sum of luminance by SIMD operation
Fig. 6. SIMD vector (a 128-bit vector holds four 32-bit float or int values)
In this phase, four detection operations can be executed in parallel because the SIMD vector of the SPE can simultaneously operate on four int variables, as shown in Fig. 6. By this parallel operation, the four sums corresponding to A, B, C, and D are obtained, as shown in Fig. 5. 4.2
Histogram-Based Tracking
In an object tracking scheme based on particle filters, the probability distribution of the tracking target is represented by the density of particles. A particle filter consists of the following three steps: state transition, likelihood estimation, and resampling. Generally, likelihood estimation requires the most computational cost in these operations, and state transition and likelihood estimation can be operated in parallel because there is no dependence between each particle. Resampling cannot be executed in parallel; however it requires less computational power, so we use PPE for resampling in this implementation. Applying SIMD operations to histogram calculation, which requires the most computational power in the computation of likelihood, is difficult because it consists of memory accesses to Lookup and histogram tables. Therefore, we apply the SIMD operation to the computation of Bhattacharyya distance, which requires the second most computational power. Applying the SIMD operation to the computation of Bhattacharyya distance is easy because it consists of an operation to N array elements. Since this computation requires normalization of the histogram, this process is also implemented with the SIMD operation.
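For illustration, the three steps of such a particle filter can be sketched as follows; the split of work between the PPE and SPEs is only indicated in comments, and `transition` / `likelihood` are hypothetical callbacks rather than the paper's actual functions.

```python
import numpy as np

def particle_filter_step(particles, weights, transition, likelihood, rng):
    """One iteration: state transition, likelihood estimation, resampling.

    particles: (N, d) array of particle states; weights: (N,) array.
    transition(p) and likelihood(p) act on one particle each, so the first two
    steps are independent per particle (mapped to SPEs in the described system),
    while resampling is sequential (kept on the PPE)."""
    # 1) state transition (parallel across particles)
    particles = np.array([transition(p) for p in particles])
    # 2) likelihood estimation (parallel across particles, the costly step)
    weights = np.array([likelihood(p) for p in particles])
    weights /= weights.sum()
    # 3) systematic resampling (sequential)
    n = len(particles)
    positions = (rng.random() + np.arange(n)) / n
    indices = np.searchsorted(np.cumsum(weights), positions)
    return particles[indices], np.full(n, 1.0 / n)
```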
Here, it is necessary for the computation of likelihood to access HSV images. However, storing whole HSV images in the local store, which only SPE can directly access, is difficult because its size is limited to 256 KBytes.
5
Real-Time Object Recognition by Combining Detection and Tracking
In the previous section, the parallel implementation of detection and tracking on the Cell was described. To realize a real-time object recognition system by combining these processes, it is important to allocate SPEs to them according to the computational power they require. In this section, we first discuss load balancing for object recognition and then introduce a real-time object recognition system based on the SONY Playstation 3 [14], one of the most widely used Cell platforms. 5.1
Load Balance on Cell for Object Recognition
The relation between processing time and the number of SPEs for detection and tracking was measured to find the optimal load balance on the Cell. In this experiment, the size of the detection and tracking images is 320 × 240, the size of the features is 24 × 24, and the number of particles is 128 per tracking target. The input image size starts at 320 × 240 and ends at 32 × 24, the image being scaled down to 83 percent of its size at each iteration. These parameters were decided experimentally to achieve both real-time processing and high recognition performance. The results are shown in Table 1.

Table 1. Processing time of detection and tracking

Number of SPEs  | 1     | 2     | 3     | 4     | 5
detect objects  | 73.47 | 51.40 | 45.06 | 46.04 | 48.46
track objects   | 27.44 | 13.55 | 12.11 | 7.32  | 12.69
The processing time of detection decreases as the number of SPEs increases; however, the time increases if the number of SPEs becomes greater than three. The processing time of tracking also decreases as the number of SPEs increases; however, the time increases if the number of SPEs becomes greater than five, because the time required to manage the SPEs sometimes becomes larger. In this system, the number of available SPEs is six, because the Playstation 3 is adopted as the Cell platform. Considering processing time, we should allocate two or three SPEs to the detection process. If two SPEs are allocated to detection, about eight targets can be tracked while the detection process for the next frame is performed. In this implementation, we use the OpenCV on the Cell [15] package, which the authors helped develop in cooperation with members of the OpenCV on the Cell project,
for detection, and we adopt a software cache implemented in the Cell/BE Software Development Kit [16] to handle the entire input image on each SPE's local store for tracking. In this case, the object recognition performance reaches 18 fps. If three SPEs are allocated to detection, about three targets can be tracked while the detection process for the next frame is performed. In this case, the object recognition performance reaches 22 fps. 5.2
Real-Time Implementation on Playstation 3
Based on the above results, we constructed a real-time object recognition system using a Playstation 3 and a Qcam Orbit MP QVR-13R USB camera. The
Fig. 7. Real-time object recognition system
following operations are required in addition to detection and tracking when a USB camera is used for real-time processing: 1) acquire images from the USB camera (640 × 480 pixels, RGB image); 2) shrink the input images to 320 × 240 and convert them to grayscale and HSV images. Table 2 shows the required processing time for these operations.

Table 2. Processing time of miscellaneous functions

Number of SPEs           | 1      | 2      | 3      | 4      | 5
retrieve frame           | 130.38 | 129.47 | 129.61 | 130.03 | 131.60
convert color and resize | 18.97  | 17.93  | 17.79  | 18.73  | 17.87
With these additional operations, the system achieves real-time object detection at about 7 fps. In this implementation, image acquisition from the USB camera becomes the dominant cost.
6
Demonstration
Figure 8 shows the face recognition results with the proposed system. In these figures, white and green rectangles correspond to detected and tracked objects, respectively. In the 60th frame, both white and green rectangles are shown around the target face because both detection and tracking succeed. In the 103rd and the 248th frame, detection fails but the position of the face is indicated by the tracking result. In the 263rd frame, the face is both successfully detected and tracked.
Fig. 8. Face recognition results (frames 60, 103, 152, 192, 248 and 263)
7
Conclusion
In this paper, we showed the parallel implementation of boosting-based detection and histogram-based tracking on the Cell, discussed load balancing on the Cell for object recognition, and presented a sample implementation of a real-time object recognition system based on the Playstation 3. We showed that the Cell can ideally achieve real-time object recognition on QVGA video at 22 fps for three targets and 18 fps for eight targets. Furthermore, real-time face recognition was demonstrated with a real-time object recognition system implemented on the SONY Playstation 3, one of the most widely used Cell platforms. In the future, we will improve the image acquisition performance from the USB camera to fully exploit the Cell performance on the widely used Playstation 3.
References 1. Hu, W., Tan, T., Wang, L., Maybank, S.: A survey on visual surveillance of object motion and behaviors. IEEE Trans. on SMC 34, 334–352 (2004) 2. Zhao, L., Thorpe, C.E.: Stereo- and neural network-based pedestrian detection. IEEE Trans. on ITS 01, 148–154 (2000) 3. Oren, M., Papageorgiou, C., Sinha, P., Osuna, E., Proggio, T.: Pedestrian detection using wavelet templates. In: Proc. of CVPR, pp. 193–199 (1997) 4. Papageorgiou, C., Poggio, T.: Trainable pedestrian detection. In: Proc. of ICIP, vol. 4, pp. 35–39 (1999) 5. Viola, P., Jones, M.J., Snow, D.: Detecting pedestrians using patterns of motion and appearance. International Journal of Computer Vision 63, 153–161 (2005) 6. Soga, M., Kato, T., Ohta, M., Ninomiya, Y.: Pedestrian detection using stereo vision and tracking. In: Proc. of The 11th World Congress on Intelligent Transport Systems (2004) 7. Ashida, J., Miyamoto, R., Tsutsui, H., Onoye, T., Nakamura, Y.: Probabilistic pedestrian tracking based on a skeleton model. In: Proc. of ICIP, pp. 2825–2828 (2006) 8. Miyamoto, R., Sugano, H., Saito, H., Tsutsui, H., Ochi, H., Hatanaka, K., Nakamura, Y.: Pedestrian recognition in far-infrared images by combining boostingbased detection and skeleton-based stochastic tracking. In: Proc. of PSIVT, pp. 483–494 (2006) 9. Masayuki, H., Nakahara, K., Sugano, H., Nakamura, Y., Miyamoto, R.: A specialized processor suitable for adaboost-based detection with haar-like features. In: Proc. of CVPR (2007) 10. Brown, M.Z., Burschka, D., Hager, G.: Advances in computational stereo. IEEE Trans. on PAMI 25, 993–1008 (2003) 11. Okuma, K., Taleghani, A., de Freitas, N., Little, J.J., Lowe, D.G.: A boosted particle filter: Multitarget detection and tracking. In: Proc. of ECCV, pp. 28–39 (2004) 12. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), 119–139 (1997) 13. Perez, P., Hue, C., Vermaak, J., Gangnet, M.: Color-based probabilistic tracking. In: Proc. of ECCV (2002) 14. Playstation3 (2007), http://www.us.playstation.com/PS3 15. OpenCV on The Cell (2007), http://cell.fixstars.com/opencv/index.php/OpenCV on the Cell 16. Cell/BE software development kit (SDK) version 2.1 (2007), http://www.bsc.es/plantillaH.php?cat id=301
A Study of Zernike Invariants for Content-Based Image Retrieval Pablo Toharia1 , Oscar D. Robles1 , Ángel Rodríguez2 , and Luis Pastor1 1
Dpto. de Arquitectura y Tecnología de Computadores, Ciencias de la Computación e Inteligencia Artificial, U. Rey Juan Carlos, C/ Tulipán, s/n. 28933 Móstoles. Madrid. Spain {pablo.toharia,oscardavid.robles,luis.pastor}@urjc.es 2 Dpto. de Tecnología Fotónica, U. Politécnica de Madrid, Campus de Montegancedo s/n, 28660 Boadilla del Monte, Madrid, Spain
[email protected]
Abstract. This paper presents a study of the application of Zernike invariants to content-based image retrieval for 2D color images. Zernike invariants have been chosen because of their good performance in object recognition. Taking into account the good results achieved in previous CBIR experiments with color-based primitives using a multiresolution representation of the visual contents, this paper applies a wavelet transform to the images in order to obtain a multiresolution representation of the shape-based features studied. Experiments have been performed using two databases: the first one is a small self-made 2D color database formed by 298 RGB images and a test set with 1655 query images, which has been used for preliminary tests; the second one is the Amsterdam Library of Object Images (ALOI), a freely accessible database. Experimental results show the feasibility of this new approach. Keywords: CBIR primitives, Zernike invariants.
1
Introduction
Content-Based Image Retrieval (CBIR) systems are becoming a very useful tool when dealing with large volumes of visual information, due to the maturity level reached by the techniques developed, now and in the past, by the research community [1,2,3,4]. Most of these techniques are inherited from computer vision and from database systems to represent and manage the available data. CBIR systems can be classified following different criteria, such as the nature of the primitives used for characterizing the image's contents (color, texture, shape, scheme or attributed graphs, etc.), the abstraction levels covered by these primitives (low, medium, high), the automation level achieved in the primitive extraction process (automatic, semi-automatic, manual), the classifier used for the retrieval stage (standard metrics, neural networks, SVM, etc.) or the way data are stored, processed and managed in the whole CBIR system (centralized, distributed). All
these topics are different research areas in which continuous advances are made in order to improve the performance of CBIR systems. In this paper, several new shape primitives to be used in a CBIR system are implemented. The computation of Zernike invariants has been chosen because of their good performance in object recognition applications. Adapting primitives used in recognition systems to CBIR has been a traditional path to follow, since some of the techniques used in the latter have been inherited from the former. However, it must be noticed that the working domain of recognition systems, i.e., the set of objects that the system is able to recognize, is usually very restricted. To our knowledge, the only works in the CBIR field that have used Zernike invariants are those of Novotni and Klein, who apply 3D Zernike descriptors to 3D object retrieval [5,6], Kim et al., who work with a restricted-subject database of binarized images [7], and Lin and Chou [8], who work with a reduced database of color images but only report results on computational cost, without including any measure of recall or precision. On the other hand, the main drawbacks of Zernike moments are their computational cost and the approximation errors. Lately, there have been a few works trying to solve or minimize these disadvantages [9,10,11,12]. Based on the good results achieved in previous CBIR experiments with color-based primitives using a multiresolution representation of the visual contents [13], this paper also studies the application of a wavelet transform to obtain a multiresolution representation of the shape-based features of interest. The development of wavelet transform theory has spurred new interest in multiresolution methods and has provided a more rigorous mathematical framework. Wavelets make it possible to compute compact representations of functions or data. Additionally, they allow variable degrees of detail or resolution to be achieved, and they are attractive from the computational point of view [14,15]. The analysis and diagonal detail coefficients of the image's wavelet transform have been used. The resulting primitive is a feature vector composed of the values computed at each resolution level of the transformed image, obtaining a more robust representation of the image contents. Section 2 describes some background on the Haar transform, the wavelet transform used in the work herein presented, and the different feature extraction techniques implemented. Section 3 analyzes some implementation details of the implemented primitives. Section 4 presents the experiments performed, reporting the success rate achieved for each primitive, and discusses these results. Finally, conclusions are presented in Section 5.
2 2.1
Description of the Implemented Primitives The Haar Transform
Wavelet transforms can be seen as a reformalization of the multiresolution methods of the 80’s [16,17]. The information they provide is very similar to that obtained from Fourier-based techniques, with the advantage of working with local
Fig. 1. Scheme of a wavelet transform (analysis of the input Xin by high-pass/low-pass filtering and downsampling into detail coefficients d1, d2 and analysis coefficients a1, a2, followed by the inverse synthesis back to Xin)
Fig. 2. Non-standard Haar transform of a 2D color image: (a) original image at resolution level n (512 × 512); (b) wavelet transform at resolution level n − 1 (256 × 256); (c) at resolution level n − 2 (128 × 128); (d) at resolution level n − 4 (32 × 32)
information using base functions with compact support. Wavelet transform coefficients show variations in object features at different resolution or scale levels [14,18,19,20]. Roughly speaking, the detail coefficients of the wavelet transform can be considered a high-frequency extraction process over the objects appearing in the images, while the analysis coefficients behave in a complementary way: the lower the resolution level, the more homogeneous the regions they produce. This is equivalent to a successive application of low-pass filters together with a signal subsampling operation (see Fig. 1(a)). The inverse of this process allows the reconstruction of the original signal by the so-called synthesis process (Fig. 1(b)). The result of this transform is a series of images where four regions of coefficients are obtained at each step: analysis coefficients (top-left, Fig. 2(b)), vertical detail coefficients (top-right, Fig. 2(b)), horizontal detail coefficients (bottom-left, Fig. 2(b)) and diagonal detail coefficients (bottom-right, Fig. 2(b)). The Haar transform will be used as a tool to extract features of the transformed image that allow a discrimination process to be performed between the queries and the images stored in the information system. The low order complexity of this transform, O(n), allows an efficient implementation of the whole process.
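As a concrete illustration, one level of the non-standard 2D Haar decomposition can be written as below (unnormalized averaging and differencing over 2 × 2 blocks, with image dimensions assumed even; actual implementations may differ in the normalization factor):

```python
import numpy as np

def haar_level(img):
    """One 2D Haar step on a single channel: analysis band plus three detail bands."""
    a = img[0::2, 0::2].astype(float)   # top-left pixel of each 2x2 block
    b = img[0::2, 1::2].astype(float)   # top-right
    c = img[1::2, 0::2].astype(float)   # bottom-left
    d = img[1::2, 1::2].astype(float)   # bottom-right
    analysis = (a + b + c + d) / 4.0    # low-pass in both directions
    det_1 = (a - b + c - d) / 4.0       # difference between left and right columns
    det_2 = (a + b - c - d) / 4.0       # difference between top and bottom rows
    det_d = (a - b - c + d) / 4.0       # diagonal detail
    return analysis, det_1, det_2, det_d

def haar_pyramid(img, levels):
    """Apply haar_level repeatedly to the analysis band to build a multiresolution pyramid."""
    pyramid, current = [], img
    for _ in range(levels):
        analysis, d1, d2, dd = haar_level(current)
        pyramid.append((analysis, d1, d2, dd))
        current = analysis
    return pyramid
```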
2.2
Zernike Invariants
Zernike invariants have been selected because of their demonstrated good performance in object recognition problems [21,22]. In 1934, Zernike [23] presented a set of complex polynomials V_{nm}(x, y) defined inside a circle of unit radius (x^2 + y^2 \le 1) in the following way:

V_{nm}(x, y) = V_{nm}(\rho, \theta) = R_{nm}(\rho)\, e^{jm\theta}   (1)

where V_{nm} is a complete set of complex polynomials, n is a positive integer (n \ge 0) that represents the polynomial degree, m is the angular dependency, \rho and \theta are the polar coordinates of the Cartesian coordinates (x, y), and R_{nm} is a set of radial polynomials that are orthogonal inside the unit circle. The values of n and m satisfy the relation:

(n - |m|) \bmod 2 = 0 \quad \text{and} \quad |m| \le n   (2)

and the radial polynomials have the following expression:

R_{nm}(\rho) = \sum_{s=0}^{(n-|m|)/2} (-1)^s\, \frac{(n-s)!}{s!\left(\frac{n+|m|}{2}-s\right)!\left(\frac{n-|m|}{2}-s\right)!}\, \rho^{\,n-2s}   (3)

Starting from the Zernike polynomials and projecting the function onto the orthogonal basis composed of these polynomials, the moments can be generated in the following way:

A_{nm} = \frac{n+1}{\pi} \iint_{x^2+y^2 \le 1} f(x, y)\, V^{*}_{nm}(x, y)\, dx\, dy   (4)

The discretization needed to work with digital data can be done straightforwardly:

A_{nm} = \frac{n+1}{\pi} \sum_{x} \sum_{y} f(x, y)\, V^{*}_{nm}(x, y), \quad \text{with } x^2 + y^2 \le 1   (5)
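The sketch below computes the radial polynomials of Eq. (3) and the discrete moments of Eq. (5) over the unit disk inscribed in the image (as used in Section 3); it is an unoptimized illustration, and the per-channel handling of RGB images is omitted.

```python
import numpy as np
from math import factorial

def radial_poly(rho, n, m):
    """Zernike radial polynomial R_nm(rho), defined for (n - |m|) even and |m| <= n."""
    m = abs(m)
    R = np.zeros_like(rho)
    for s in range((n - m) // 2 + 1):
        c = ((-1) ** s * factorial(n - s)
             / (factorial(s) * factorial((n + m) // 2 - s) * factorial((n - m) // 2 - s)))
        R += c * rho ** (n - 2 * s)
    return R

def zernike_moment(img, n, m):
    """Discrete Zernike moment A_nm of a gray-level image over the inscribed unit disk."""
    h, w = img.shape
    y, x = np.mgrid[0:h, 0:w]
    xn = (2 * x - (w - 1)) / (w - 1)          # map pixel coordinates to [-1, 1]
    yn = (2 * y - (h - 1)) / (h - 1)
    rho = np.sqrt(xn ** 2 + yn ** 2)
    theta = np.arctan2(yn, xn)
    mask = rho <= 1.0                         # corners outside the disk are discarded
    V_conj = radial_poly(rho, n, m) * np.exp(-1j * m * theta)
    return (n + 1) / np.pi * np.sum(img[mask] * V_conj[mask])

# The rotation-invariant value used in the signature is the modulus, e.g. abs(zernike_moment(img, 4, 2)).
```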
Figure 3 shows the reconstruction of two example images using Zernike moments of several orders. From these functions, we compute the modulus to obtain the p different invariant values for each considered case. The invariant values are used to create a vector of p elements ZIi that collect the shape information of an image i. For example, in the case of polynomials of 10th degree, p would be 36. 2.3
Signature Based on Zernike Invariants
The visual contents of the images are transformed into a vector of some features, named signature, that aims to collect some discriminant information of the original data.
Fig. 3. Original images and reconstructions obtained working with Zernike moments of orders 3, 5, 10, 15 and 20
Firstly, a primitive based on the Zernike Invariants extracted from the original image has been implemented, in order to compare the multiresolution primitives studied in this paper with the results achieved without applying the Haar transform, as it has been traditionally done in object recognition environments. Also, the invariant values computed from polynomials of several orders have been tested. The signature is generated concatenating the invariants extracted until the maximum polynomial degree considered. A simple vector of scalar values is obtained. 2.4
Signature Based on Zernike Invariants over Analysis Coefficients
The analysis coefficients of the Haar transform have been selected to compute the Zernike invariants over different scales of the original image. Analysis coefficients retain a coarse description of the objects’ shape as can be observed in Fig. 2. Once the sequence of lower resolution images is obtained, we compute the Zernike Invariants over the analysis coefficient region at each resolution level. The final step is to compose a vector as explained in the previous section, so as to collect the original image’s multiresolution shape information.
Fig. 4. Inscribing Zernike’s polynomial definition domain into rectangular images
2.5
Signature Based on Zernike Invariants over Analysis and Diagonal Detail Coefficients
The choice of the diagonal coefficients is due to the fact that the diagonal region, right bottom squares in Figure 2 at each resolution level, gathers high-pass features on both vertical and horizontal directions. This feature should be an additional discriminant element when dealing with images of similar appearance.
3
Implementation Analysis
As can be deduced from Eqs. 3 and 5, the computation of the invariants is a highly demanding task from a computational point of view, so polynomials of different orders have been tested in order to verify whether their responses are significantly different. The only Zernike invariants computed online are those of the query image, so this approach is completely feasible in a real-world CBIR system. Another issue faced in the implementation is mapping the images' rectangular domain to the circular space where the radial polynomials are defined (Eq. 1). The unit radius circle has been inscribed into the image, so its corners have been discarded under the assumption that they do not usually contain relevant information about the scene (Fig. 4). When working with gray-level images, the original image's wavelet transform gives all the information needed to compose the signature representing the image's contents. Joining together the Zernike invariant values for all the considered regions and all the resolution levels, a vector of features collecting the intensity of the input data is obtained. When dealing with color images, as is the case for the RGB color space, the process described for monochrome images must be applied to each one of the color channels.
4
Experimental Results
The main objective of the tests is to measure and analyze the recall and precision values achieved by the implemented features. The
classical definitions of recall and precision have been used:

Recall = True positives / (True positives + False negatives)   (6)

Precision = True positives / (True positives + False positives)   (7)
For the experiments presented in this paper two different databases have been considered: our self-made color database and a public one. 4.1
Self-made Color Database
Experiments setup. For computational purposes, instead of using a larger database such as ALOI, a smaller one previously available has been used. This database has been used for preliminary tests in order to extract some conclusions about the influence of the moments' order considered in the Zernike invariants extracted for obtaining the signature. The same color database described in [24] has been used. It is formed by 298 RGB two-dimensional 128 × 128 images collected from different sources, such as [25] or the Internet. The test set was generated by introducing images which share the same concept but look quite different from those of the considered database, and by making affine transformations or selecting regions of interest from the original database images. Following these guidelines, a test set formed by 1655 two-dimensional color images was obtained. The experiments consisted in querying with all the images from the test set in order to retrieve the associated image stored in the corresponding database, which was selected as the representative image of its class in the query. For each input image, the result of the search is a list sorted according to the level of similarity of its signature with the signatures stored in the database. The minimum distance classifier used is based on the Euclidean distance. The aforementioned list contains the best n matches of the query. Figure 5 presents an example of the system response to a query on the TRECVID database [26]. The most similar object appears in the upper-left corner of the mosaic. In the example presented, the topic searched was "shots of one or more soccer goal posts". Several tests have been performed considering all the parameters involved in the Zernike invariants combined with the multiresolution representation described above. The influence of the Zernike invariant order on the retrieval results has been studied due to the great computational load that their computation demands. Orders 3, 5, 10, 15 and 20 have been considered. It must be noticed that they imply, respectively, the first 6, 12, 36, 72 and 121 invariants. The behaviour of the wavelet coefficients in the proposed shape primitive has been considered, taking into account the following configurations: analysis coefficients, and analysis plus diagonal detail coefficients. Results analysis. Table 1 shows the recall values computed for all the primitives herein described using the Euclidean distance as minimum distance classifier. The suffix notation used in the method column of Table 1 is the following:
Fig. 5. Example of the output of the CBIR implemented system
Table 1. Recall values achieved over the whole set of Zernike based primitives

Primitive     | Recall
ZER_3         | 0.480
ZER_5         | 0.471
ZER_10        | 0.474
ZER_15        | 0.460
ZER_20        | 0.460
ZMR_CA_3      | 0.517
ZMR_CA_5      | 0.514
ZMR_CA_10     | 0.510
ZMR_CA_15     | 0.515
ZMR_CA_20     | 0.490
ZMR_CA_DD_3   | 0.517
ZMR_CA_DD_5   | 0.514
ZMR_CA_DD_10  | 0.511
ZMR_CA_DD_15  | 0.515
ZMR_CA_DD_20  | 0.503

– ZER: Zernike invariants.
– ZMR: Zernike invariants with multiresolution information.
– CA: Analysis coefficients of the Haar transform.
– CA_DD: Analysis and diagonal detail coefficients of the Haar transform.
– number: order of the Zernike moments used.
As it can be seen, several order values for the Zernike Invariants computation have been tested. An analysis of Table 1 shows how the obtained results are not influenced by the order of Zernike moments since the recall value is quite similar in all cases. It can be seen that the best results are achieved using invariants computed with
Table 2. Basic characteristics of the ALOI's subsets ILL, COL and VIEW

Subset | Variations                            | Images
ILL    | 24 illumination-direction variations  | 24000
COL    | 12 illumination-color variations      | 12000
VIEW   | 72 viewpoint variations               | 72000
low order moments. The reason for this might be the more significant information carried by low order moments (up to order 15) during the reconstruction process, as it can be observed in Figure 3. In that case, those low order moments would be basically determining the main difference between two images. On the other hand, the use of multiresolution approach improves the information provided by the Zernike invariants computed over the original image, although this improvement is not as remarkable as the one obtained when the multiresolution approach has been applied to color primitives [13]. The explanation for this behaviour comes from the fact that low resolution levels are not sharp enough to significantly contribute to the discriminant power of the primitive. 4.2
ALOI Database
Experiments setup. Taking into account the results obtained using our self-made database, a new experimental setup with the Amsterdam Library of Object Images (ALOI) [27] database was established. It allows a deeper study of the behaviour of Zernike-based primitives, as well as obtaining more exhaustive results. The ALOI database is a color image collection of 1000 small objects. This database is divided into different subsets depending on the parameter varied at capture time (color, illumination, view-angle, etc.). From the available subsets, the three main ones have been chosen for our purposes: illumination direction (ILL), illumination color (COL), and object viewpoint (VIEW). Table 2 shows some basic information about these subsets. More detailed information and examples of the captures can be found in [27]. The ALOI database does not provide a separate query image set. For this reason, the experiments consisted in using each image to query a target database formed by the rest of the images. In order to evaluate recall and precision in this scenario, the number of retrieved images has to be fixed. Different values for this number have been evaluated in order to obtain the behaviour tendency of each primitive. Another aspect to be taken into account is that the three subsets (COL, ILL and VIEW) have different numbers of relevant images, so results from different subsets cannot be directly compared. For example, retrieving 100 images from VIEW could yield 72 relevant images, while retrieving the same number of images from COL could at most yield 12 relevant ones. Under this scenario, recall and precision values would be biased. Since the number of relevant images per object in each subset is well known, a percentage of this number can be used instead of the number of retrieved images so as to make that comparison possible.
Considering the previous experiments over the self-made database, we have selected order-15 Zernike moments to build the Zernike-invariant primitives used in the experiments over the ALOI database. From our point of view, they offer a promising trade-off between discriminant power and computational cost, which should allow the multiresolution approach to deliver better performance.
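As a reference for the reader, the following minimal sketch shows how such order-15 Zernike invariants, and a ZMR-like multiresolution variant computed over the Haar analysis coefficients, might be obtained. It assumes the third-party packages mahotas (which provides the Zernike moment magnitudes) and PyWavelets (for the Haar decomposition); the function names, the radius choice and the number of decomposition levels are our own assumptions rather than the authors' implementation.

import numpy as np
import mahotas   # third-party: provides mahotas.features.zernike_moments
import pywt      # third-party: PyWavelets, Haar decomposition

def zernike_invariants(gray, order=15):
    # Magnitudes of the Zernike moments up to the given order; the magnitudes
    # are rotation invariant, which is what a ZER-style primitive relies on.
    radius = min(gray.shape) // 2
    return mahotas.features.zernike_moments(gray, radius, degree=order)

def zmr_primitive(gray, order=15, levels=3):
    # ZMR-like primitive: invariants of the original image concatenated with the
    # invariants of the Haar analysis (approximation) coefficients at each level.
    vectors = [zernike_invariants(gray, order)]
    approx = gray.astype(float)
    for _ in range(levels):
        approx, _details = pywt.dwt2(approx, 'haar')  # keep only the CA band
        vectors.append(zernike_invariants(approx, order))
    return np.concatenate(vectors)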
Fig. 6. Precision (right column) and recall (left column) against number of retrieved images (abscissa axis) over the ALOI database: (a) COL subset, (b) ILL subset, (c) VIEW subset. The curves correspond to the ZER, ZMR, EN2, HIN, HRN, HAD and HRD primitives.
Zernike-based primitives have been compared to some other multiresolution primitives developed and tested in previous works:
– Multiresolution energy-based primitive (EN2) [24].
– Multiresolution histogram primitives [13]:
  • Global multiresolution histogram over analysis coefficients (HIN).
  • Local multiresolution histogram over analysis coefficients (HRN).
  • Global multiresolution histogram over analysis and detail coefficients (HAD).
  • Local multiresolution histogram over analysis and detail coefficients (HRD).
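As an illustrative aid only, a global multiresolution histogram over Haar coefficients in the spirit of HIN (analysis coefficients) and HAD (analysis and detail coefficients) could be sketched as follows; the function name, the number of bins, the normalization and the use of PyWavelets are our own assumptions, and the local variants (HRN, HRD) would additionally split each band into regions before histogramming.

import numpy as np
import pywt   # third-party: PyWavelets, Haar wavelet transform

def multiresolution_histograms(gray, levels=3, bins=32, include_details=False):
    # Concatenated per-level histograms of Haar coefficients.
    # include_details=False: analysis (approximation) bands only, HIN-like.
    # include_details=True : analysis plus detail bands, HAD-like.
    approx = gray.astype(float)
    hists = []
    for _ in range(levels):
        approx, (cH, cV, cD) = pywt.dwt2(approx, 'haar')
        bands = [approx] + ([cH, cV, cD] if include_details else [])
        for band in bands:
            h, _ = np.histogram(band, bins=bins)
            hists.append(h / max(h.sum(), 1))  # normalize each histogram
        # the next iteration decomposes the approximation band again
    return np.concatenate(hists)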
Fig. 7. Precision vs. recall over the ALOI database: (a) COL subset, (b) ILL subset, (c) VIEW subset. The curves correspond to the ZER, ZMR, EN2, HIN, HRN, HAD and HRD primitives.
Results analysis. Figure 6 presents the precision and recall values achieved on the three ALOI subsets (COL, ILL, VIEW). The abscissa axis shows the number of images taken into account to compute the precision and recall values. More precisely, each abscissa value stands for a percentage of retrieved images with respect to the a priori known number of relevant images, i.e., 100 means 12 images for the COL subset, 24 for ILL and 72 for VIEW. This is done in order to make the results comparable across subsets. Figure 7 shows the same data as Figure 6, but with recall and precision plotted one against the other, which allows the behavior of both values to be observed jointly.

The results show that the shape-based primitives, ZER and ZMR, outperform the rest of the primitives on this database. In particular, these two primitives achieve excellent results on the COL subset: their precision and recall curves in Figure 6(a) are almost ideal, reaching recall and precision values of 1 when retrieving 12 images (that is, 100% of the relevant images per object). Their performance on the other two subsets (ILL and VIEW) is also good (ZMR and ZER give the best results for ILL and VIEW, respectively), although not as good as on COL. This can be explained by the fact that varying the color properties does not change the perceived shape of the captured object, leading to a very good performance, whereas in the ILL and VIEW subsets the apparent shape changes because of the shadows produced by the varying illumination direction, which makes the shapes appear different; a viewpoint variation has the same effect whenever the object is not “symmetric”. Comparing the multiresolution shape approach (ZMR) against the non-multiresolution one (ZER), the results are quite similar, except on the ILL subset, where ZMR performs slightly better.
5 Conclusions and Ongoing Work
The application of a shape primitive to a content-based image retrieval system using a multiresolution approach has been studied over two different databases. The shape primitive based on Zernike invariants has achieved interesting results using low-order polynomials, which alleviates the main drawback of this technique, namely its demanding computational cost. In the particular case of color changes, the Zernike-based primitive has shown excellent performance. The use of multiresolution information in the Zernike-based primitive has improved the results in some cases, but it has never produced a large increase with the databases tested. It has also been shown that increasing the order of the moments used in the experiments does not guarantee an image with enough detail for discriminating purposes, as can be deduced from the examples presented in Fig. 3.
Using Zernike moments up to order 20 on a database composed of images covering very different topics and themes has not improved the results compared with low-order moments. It must be noted that the examples found in the available bibliography either apply Zernike moments to sets of images from a restricted domain or report results only from a computational cost point of view, without recall figures. Further research on multiresolution primitives will allow, on the one hand, the use of additional wavelet bases in order to achieve a more compact representation of the multilevel information without losing discriminant capacity; on the other hand, it should contribute to improving the behavior of the Zernike-based primitives. Finally, alternatives for combining both kinds of primitives will also be studied in the near future.
Acknowledgments. This work has been partially funded by the Spanish Ministry of Education and Science (grant TIN2007-67188) and the Government of the Community of Madrid (grant S-0505/DPI/0235; GATARVISA).
References

1. del Bimbo, A.: Visual Information Retrieval. Morgan Kaufmann Publishers, San Francisco, California (1999)
2. Venters, C.C., Cooper, M.: A review of content-based image retrieval systems. Technical report, Manchester Visualization Center, Manchester Computing, University of Manchester (2000), http://www.jtap.ac.uk/
3. Marques, O., Furht, B.: Content-based Image and Video Retrieval. Multimedia Systems and Application Series. Kluwer Academic Publishers, Dordrecht (2002)
4. Wu, J.-K., Kankanhalli, M.S., Wu, K.W.J.K., Lim, L.J.-H., Hong, H.D.: Perspectives on Content-Based Multimedia Systems. Springer, Heidelberg (2000)
5. Novotni, M., Klein, R.: 3D Zernike descriptors for content based shape retrieval. In: The 8th ACM Symposium on Solid Modeling and Applications (2003)
6. Novotni, M., Klein, R.: Shape retrieval using 3D Zernike descriptors. Computer-Aided Design 36(11), 1047–1062 (2004)
7. Kim, H.K., Kim, J.D., Sim, D.G., Oh, D.I.: A modified Zernike moment shape descriptor invariant to translation, rotation and scale for similarity-based image retrieval. In: International Conference on Multimedia and Expo, ICME, vol. 1, pp. 307–310 (2000)
8. Lin, T.W., Chou, Y.F.: A comparative study of Zernike moments. In: Proceedings of the IEEE/WIC International Conference on Web Intelligence (WI 2003), Halifax, Canada (2003)
9. Hwang, S.K., Kim, W.Y.: A novel approach to the fast computation of Zernike moments. Pattern Recognition 39(11), 2065–2076 (2006)
10. Papakostas, G., Boutalis, Y., Karras, D., Mertzios, B.: A new class of Zernike moments for computer vision applications. Information Sciences 177(13), 2802–2819 (2007)
11. Wee, C.-Y., Paramesran, R.: On the computational aspects of Zernike moments. Image and Vision Computing 25(6), 967–980 (2007)
12. Xin, Y., Pawlak, M., Liao, S.: Accurate computation of Zernike moments in polar coordinates. IEEE Transactions on Image Processing 16(2), 581–587 (2007)
13. Robles, O.D., Rodríguez, A., Córdoba, M.L.: A study about multiresolution primitives for content-based image retrieval using wavelets. In: Hamza, M.H. (ed.) IASTED International Conference on Visualization, Imaging, and Image Processing (VIIP 2001), Marbella, Spain, pp. 506–511. IASTED, ACTA Press (2001) ISBN 0-88986-309-1
14. Strang, G., Nguyen, T.: Wavelets and Filter Banks. Wellesley-Cambridge Press (1997)
15. Starck, J.L., Murtagh, F., Bijaoui, A.: Image Processing and Data Analysis: The Multiscale Approach. Cambridge University Press, Cambridge (1998)
16. Rosenfeld, A.: Multiresolution Image Processing and Analysis. Springer Series in Information Sciences, vol. 12. Springer, Heidelberg (1984)
17. Marr, D., Hildreth, E.: Theory of edge detection. In: Proceedings of the Royal Society of London, Ser. B, vol. 207, pp. 187–217 (1980)
18. Daubechies, I.: Ten Lectures on Wavelets. CBMS-NSF Regional Conference Series in Applied Mathematics, vol. 61. Society for Industrial and Applied Mathematics, Philadelphia, PA (1992)
19. Mallat, S.G.: A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(7), 674–693 (1989)
20. Pastor, L., Rodríguez, A., Ríos, D.: Wavelets for object representation and recognition in computer vision. In: Vidaković, B., Müller, P. (eds.) Bayesian Inference in Wavelet Based Models. Lecture Notes in Statistics, vol. 141, pp. 267–290. Springer, New York (1999)
21. Khotanzad, A., Hong, Y.H.: Invariant image recognition by Zernike moments. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(5), 489–497 (1990)
22. Kamila, N.K., Mahapatra, S., Nanda, S.: Invariance image analysis using modified Zernike moments. Pattern Recognition Letters 26(6), 747–753 (2005)
23. Zernike, F.: Beugungstheorie des Schneidenverfahrens und seiner verbesserten Form, der Phasenkontrastmethode (Diffraction theory of the knife-edge test and its improved form, the phase contrast method). Physica 1, 689–704 (1934)
24. Rodríguez, A., Robles, O.D., Pastor, L.: New features for content-based image retrieval using wavelets. In: Muge, F., Pinto, R.C., Piedade, M. (eds.) V Ibero-American Symposium on Pattern Recognition, SIARP 2000, Lisbon, Portugal, pp. 517–528 (2000) ISBN 972-97711-1-1
25. MIT Media Lab: VisTex. Web color image database (1998), http://vismod.media.mit.edu/vismod/imagery/VisionTexture/vistex.html
26. Over, P., Ianeva, T., Kraaij, W., Smeaton, A.F.: TRECVID 2006: Search task overview. In: Proceedings of the TRECVID Workshop, NIST Special Publication (2006), http://www-nlpir.nist.gov/projects/tvpubs/tv6.papers/tv6.search.slides-final.pdf
27. Geusebroek, J.-M., Burghouts, G.J., Smeulders, A.W.M.: The Amsterdam Library of Object Images. Int. J. Comput. Vision 61(1), 103–112 (2005)
Author Index
Agostini, Luciano Volcan 5, 24, 36 Ahmad, Imran Shafiq 893 Alvarado, Selene 664 Alvarez, Marco A. 600 Andreadis, Ioannis 510 Araújo, Sidnei Alves de 100 Ascenso, João 801 Azorín-López, Jorge 749
Falcon, Rafael 867 Fang, Chih-Wei 48 Fernando, W.A.C. 841 Ferreira, Valter 24 Francke, Hardy 533 Fu, Xue 236 Fujisawa, Hiromichi 2 Fuster-Guilló, Andrés 749
Bampi, Sergio 5, 24, 36 Beheshti Shirazi, Ali Asghar 334 Bello, Rafael 867 Benyó, Zoltán 548 Brites, Catarina 801 Britto Jr., A.S. 678, 737
García-Chamizo, Juan Manuel 749 Gim, Gi-Yeong 346 Gimel’farb, Georgy 763 Gomes, Herman M. 166 Gomes, Rafael B. 87 Gonçalves, Wesley Nunes 777 González, Fabio A. 919 Guarini, Marcelo 522 Guzmán, Enrique 664
Caicedo, Juan C. 919 Carrasco, Miguel 114, 371 Carvalho, Bruno M. 87 Carvalho, João M. de 166 Cavalvanti, Claudio S.V.C. 166 Cerda, Mauricio 575 Cha, Eui-Young 311 Chang, Chien-Ping 298 Chen, Xiaoliu 893 Chiu, Yu-Ting 727 Cho, A-Young 905 Cho, Ik-Hwan 905 Choi, Hae-Chul 788 Choi, Jong-Soo 562 Chou, Chun-Hsien 357 Chuang, Jen-Hui 727 Coutinho, Felipe L. 166 Cuenca, P. 841 Cui, Chunhui 497 Delmas, Patrice 763 Demonceaux, Cédric 484 Devia, Christ 586 Diniz, Cláudio Machado 5 Di Stefano, Luigi 427 Ebrahimi Atani, Reza 16 Ejima, Toshiaki 413 Elgammal, Ahmed 205 Enokida, Shuichi 413
Hajnal, Joseph 522 Hamiati Vaghef, Vahid 16 Hatori, Yoshinori 385 He, Zanming 853 Hernandez, Sergio 474 Hernández-Gracidas, Carlos 879 Hitschfeld-Kahler, Nancy 575 Hochuli, A.G. 678 Hong, Kwangjin 715 Huang, Ping S. 298 Hung, Yi-Ping 1 Irarrazaval, Pablo 3, 522
Jang, Euee S. 853 Jeon, Byeungwoo 816 Jeon, Gwanggil 867 Jeong, Dong-Seok 905 Jeong, Jechang 867 Jia, Jie 788 Jin, Ju-Kyung 905 Ju, Myung-Ho 702 Jun, Jaebum 853 Jung, Keechul 321, 449, 715 Kameshima, Hideto 439 Kaneko, Kunihiko 128
Kang, Hang-Bong 702 Kao, Jau-Hong 727 Kerdvibulvech, Chutisant 625 Kim, Donghyung 867 Kim, Hae-Kwang 788 Kim, Hae Yong 100 Kim, Hyunchul 346 Kim, Whoi-Yul 346 Klette, Reinhard 236, 311 Ko, Bonghyuck 816 Kodama, Kazuya 385 Koerich, A.L. 678, 737 Kohout, Josef 826 Kovesi, Peter 4 Kubota, Akira 385 Kyoung, Dongwuk 321 Lai, Tzung-Heng 613 Lee, Chan-Su 205 Lee, Jen-Chun 298 Lee, Jin-Aeon 346 Lee, Jun-Woo 905 Lee, Myungjung 853 Lee, Seok-Han 562 Lee, Subin 180 Lee, Sunyoung 853 Lee, Wei-Yu 357 Lee, Yunli 321, 449 Li, Fajie 236 Lien, Jenn-Jier James 48, 141, 462, 613 Lin, Horng-Horng 727 Liu, Kuo-Cheng 357 Loncomilla, Patricio 586 López, O. 841 Lu, Yingliang 128 Machado, Bruno Brandoli 777 Makinouchi, Akifumi 128 Marquez, Jorge 763 Martínez, J.L. 841 Martinez, M. 841 Marzal, Andrés 152 Masood, Asif 651 Mattoccia, Stefano 427 Mery, Domingo 114, 371, 575, 639 Metaxas, Dimitris 205 Mirzakuchaki, Sattar 16 Miyamoto, Ryusuke 932 Moura, Eduardo S. 166
Nayak, Amiya 221, 274 Ngan, King Ngi 497 Nozick, Vincent 399 Oh, Weon-Geun 905 Oliveira, L.E.S. Oliver, J. 841
Palazón, Vicente 152 Park, Anjin 715 Pastor, Luis 944 Perea, Álvaro 75 Pereira, Fernando 801 Pérez, M. 841 Pistori, Hemerson 600, 777 Pizarro, Luis 114 Pogrebnyak, Oleksiy 664 Porto, Marcelo 36 Prieto, Claudia 522 Quiles, F. 841
Robles, Oscar D. 944 Rodríguez-Pérez, Daniel 75 Rodrigues, Ricardo B. 600 Rodríguez, Ángel 944 Romero, Eduardo 75, 919 Rosa, Leandro 36 Rueda, Andrea 75 Rueda, Luis 248 Ruiz-del-Solar, Javier 533, 586 Saito, Hideo 399, 439, 625 Saleem, Muhammad 651 Sánchez Fernández, Luis Pastor 664 Sanroma, Gerard 260 Sato, Yukio 439 Seo, Yongduek 180 Serratosa, Francesc 260 Shim, Hiuk Jae 816 Shin, Bok-Suk 311 Shojaee Bakhtiari, Ali 16, 334 Siddiqui, Adil Masood 651 Silva, Jonathan de Andrade 777 Silva, Romeu Ricardo da 639 Silva, Thaísa Leal da 5 Simler, Christophe 484 Souza, Albert Schiaveto de 777 Souza, Tiago S. 87 Stavrakakis, John 62
Stojmenovic, Milos 221, 274 Suandi, Shahrel A. 413 Sucar, L. Enrique 879 Sugano, Hiroki 932 Susin, Altamiro Amadeu 5, 24, 36 Szilágyi, László 548 Szilágyi, Sándor M. 548 Tai, Tie Sing 413 Takastuka, Masahiro 62 Teal, Paul 474 Toharia, Pablo 944 Tombari, Federico 427 Triana, Edwin 919 Tsai, Cheng-Fa 289 Tu, Ching-Ting 141 Tu, Te-Ming 298 Vasseur, Pascal 484 Velastin, Sergio A. 191 Valle Jr., J.D. 737 Veloso, Luciana R. 166 Verschae, Rodrigo 533
Viana, Roberto 600 Vilar, Juan Miguel 152 Vonikakis, Vassilios 510 Vortmann, João Alberto 5 Wagner, Flávio R. 24 Wang, Te-Hsun 613 Weerakkody, W.A.R.J. 841 Woo, Young Woon 311 Woodward, Alexander 763 Wu, Jin-Yi 462 Xu, Chengping 191
Yamauchi, Koichiro 439 Yañez, Cornelio 664 Yang, Wen-Yi 289 Yang, Wenxian 497 Yang, Won-Keun 905 Zamanlooy, Babak 16, 334 Zatt, Bruno 24 Zhao, Shubin 692