Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen University of Dortmund, Germany Madhu Sudan Massachusetts Institute of Technology, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Moshe Y. Vardi Rice University, Houston, TX, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
4844
Yasushi Yagi Sing Bing Kang In So Kweon Hongbin Zha (Eds.)
Computer Vision – ACCV 2007 8th Asian Conference on Computer Vision Tokyo, Japan, November 18-22, 2007 Proceedings, Part II
Volume Editors

Yasushi Yagi
Osaka University, The Institute of Scientific and Industrial Research
8-1 Mihogaoka, Ibaraki, Osaka, 567-0047, Japan
E-mail: [email protected]

Sing Bing Kang
Microsoft Corporation
1 Microsoft Way, Redmond WA 98052, USA
E-mail: [email protected]

In So Kweon
KAIST, School of Electrical Engineering and Computer Science
335 Gwahag-Ro Yusung-Gu, Daejeon, Korea
E-mail: [email protected]

Hongbin Zha
Peking University, Department of Machine Intelligence
Beijing, 100871, China
E-mail: [email protected]
Library of Congress Control Number: 2007938408
CR Subject Classification (1998): I.4, I.5, I.2.10, I.2.6, I.3.5, F.2.2
LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
ISSN 0302-9743
ISBN-10 3-540-76389-9 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-76389-5 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2007 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12183685 06/3180 543210
Preface
It is our great pleasure to welcome you to the Proceedings of the Eighth Asian Conference on Computer Vision (ACCV07), which was held during November 18–22, 2007 in Tokyo, Japan. ACCV07 was sponsored by the Asian Federation of Computer Vision. We received 640 abstracts by the abstract submission deadline, 551 of which became full submissions. This is the largest number of submissions in the history of ACCV. Out of these 551 full submissions, 46 were selected for oral presentation and 130 as posters, yielding an acceptance rate of 31.9%. Following the tradition of previous ACCVs, the reviewing process was double blind. Each of the 31 Area Chairs (ACs) handled about 17 papers and nominated five reviewers for each submission (from 204 Program Committee members). The final selection of three reviewers per submission was done in such a way as to avoid conflicts of interest and to evenly balance the load among the reviewers. Once the reviews were done, each AC wrote summary reports based on the reviews and their own assessments of the submissions. For conflicting scores, ACs consulted with reviewers, and at times had us contact authors for clarification. The AC meeting was held in Osaka on July 27 and 28. We divided the 31 ACs into 8 groups, with each group having 3 or 4 ACs. The ACs could confer within their respective groups, and were permitted to discuss with pre-approved “consulting” ACs outside their groups if needed. The ACs were encouraged to rely on their own perception of each paper vis-à-vis the reviewer comments, and not strictly on numerical scores alone. This year, we introduced the category “conditional accept”; this category is targeted at papers with good technical content but whose writing requires significant improvement. Please keep in mind that no reviewing process is perfect. As with any major conference, reviewer quality and timeliness of reviews varied. To minimize the impact of these variations, we chose highly qualified and dependable people as ACs to shepherd the review process. We all did the best we could given the large number of submissions and the limited time we had. Interestingly, we did not have to instruct the ACs to revise their decisions at the end of the AC meeting—all the ACs did a great job in ensuring the high quality of accepted papers. That being said, it is possible there were good papers that fell through the cracks, and we hope such papers will quickly end up being published at other good venues. It has been a pleasure for us to serve as ACCV07 Program Chairs, and we can honestly say that this has been a memorable and rewarding experience. We would like to thank the ACCV07 ACs and members of the Technical Program Committee for their time and effort spent reviewing the submissions. The ACCV Osaka team (Ryusuke Sagawa, Yasushi Makihara, Tomohiro Mashita, Kazuaki Kondo, and Hidetoshi Mannami), as well as our conference secretaries (Noriko
Yasui, Masako Kamura, and Sachiko Kondo), did a terrific job organizing the conference. We hope that all of the attendees found the conference informative and thought-provoking. November 2007
Yasushi Yagi Sing Bing Kang In So Kweon Hongbin Zha
Organization
General Chair: Katsushi Ikeuchi (University of Tokyo, Japan)
General Co-chairs: Naokazu Yokoya (NAIST, Japan), Rin-ichiro Taniguchi (Kyushu University, Japan)
Program Chair: Yasushi Yagi (Osaka University, Japan)
Program Co-chairs: In So Kweon (KAIST, Korea), Sing Bing Kang (Microsoft Research, USA), Hongbin Zha (Peking University, China)
Workshop/Tutorial Chair: Kazuhiko Sumi (Mitsubishi Electric, Japan)
Finance Chair: Keiji Yamada (NEC, Japan)
Local Arrangements Chair: Yoshinari Kameda (University of Tsukuba, Japan)
Publication Chairs: Hideo Saito (Keio University, Japan), Daisaku Arita (ISIT, Japan)
Technical Support Staff: Atsuhiko Banno (University of Tokyo, Japan), Daisuke Miyazaki (University of Tokyo, Japan), Ryusuke Sagawa (Osaka University, Japan), Yasushi Makihara (Osaka University, Japan)

Area Chairs

Tat-Jen Cham (Nanyang Tech. University, Singapore) Koichiro Deguchi (Tohoku University, Japan) Frank Dellaert (Georgia Inst. of Tech., USA) Martial Hebert (CMU, USA) Ki Sang Hong (Pohang University of Sci. and Tech., Korea) Yi-ping Hung (National Taiwan University, Taiwan) Reinhard Klette (University of Auckland, New Zealand) Chil-Woo Lee (Chonnam National University, Korea) Kyoung Mu Lee (Seoul National University, Korea) Sang Wook Lee (Sogang University, Korea) Stan Z. Li (CASIA, China) Yuncai Liu (Shanghai Jiaotong University, China) Yasuyuki Matsushita (Microsoft Research Asia, China) Yoshito Mekada (Chukyo University, Japan) Yasuhiro Mukaigawa (Osaka University, Japan)
P.J. Narayanan (IIIT, India) Masatoshi Okutomi (Tokyo Inst. of Tech., Japan) Tomas Pajdla (Czech Technical University, Czech) Shmuel Peleg (The Hebrew University of Jerusalem, Israel) Jean Ponce (Ecole Normale Superieure, France) Long Quan (Hong Kong University of Sci. and Tech., China) Ramesh Raskar (MERL, USA) Jim Rehg (Georgia Inst. of Tech., USA) Jun Sato (Nagoya Inst. of Tech., Japan) Shinichi Sato (NII, Japan) Yoichi Sato (University of Tokyo, Japan) Cordelia Schmid (INRIA, France) Christoph Schnoerr (University of Mannheim, Germany) David Suter (Monash University, Australia) Xiaoou Tang (Microsoft Research Asia, China) Guangyou Xu (Tsinghua University, China)
Program Committee Adrian Barbu Akash Kushal Akihiko Torii Akihiro Sugimoto Alexander Shekhovtsov Amit Agrawal Anders Heyden Andreas Koschan Andres Bruhn Andrew Hicks Anton van den Hengel Atsuto Maki Baozong Yuan Bernt Schiele Bodo Rosenhahn Branislav Micusik C.V. Jawahar Chieh-Chih Wang Chin Seng Chua Chiou-Shann Fuh Chu-song Chen
Cornelia Fermuller Cristian Sminchisescu Dahua Lin Daisuke Miyazaki Daniel Cremers David Forsyth Duy-Dinh Le Fanhuai Shi Fay Huang Florent Segonne Frank Dellaert Frederic Jurie Gang Zeng Gerald Sommer Guoyan Zheng Hajime Nagahara Hanzi Wang Hassan Foroosh Hideaki Goto Hidekata Hontani Hideo Saito
Hiroshi Ishikawa Hiroshi Kawasaki Hong Zhang Hongya Tuo Hynek Bakstein Hyun Ki Hong Ikuko Shimizu Il Dong Yun Itaru Kitahara Ivan Laptev Jacky Baltes Jakob Verbeek James Crowley Jan-Michael Frahm Jan-Olof Eklundh Javier Civera Jean Martinet Jean-Sebastien Franco Jeffrey Ho Jian Sun Jiang yu Zheng
Jianxin Wu Jianzhuang Liu Jiebo Luo Jingdong Wang Jinshi Cui Jiri Matas John Barron John Rugis Jong Soo Choi Joo-Hwee Lim Joon Hee Han Joost Weijer Jun Sato Jun Takamatsu Junqiu Wang Juwei Lu Kap Luk Chan Karteek Alahari Kazuhiro Hotta Kazuhiro Otsuka Keiji Yanai Kenichi Kanatani Kenton McHenry Ki Sang Hong Kim Steenstrup Pedersen Ko Nishino Koichi Hashomoto Larry Davis Lisheng Wang Manabu Hashimoto Marcel Worring Marshall Tappen Masanobu Yamamoto Mathias Kolsch Michael Brown Michael Cree Michael Isard Ming Tang Ming-Hsuan Yang Mingyan Jiang Mohan Kankanhalli Moshe Ben-Ezra Naoya Ohta Navneet Dalal Nick Barnes
Nicu Sebe Noboru Babaguchi Nobutaka Shimada Ondrej Drbohlav Osamu Hasegawa Pascal Vasseur Patrice Delmas Pei Chen Peter Sturm Philippos Mordohai Pierre Jannin Ping Tan Prabir Kumar Biswas Prem Kalra Qiang Wang Qiao Yu Qingshan Liu QiuQi Ruan Radim Sara Rae-Hong Park Ralf Reulke Ralph Gross Reinhard Koch Rene Vidal Robert Pless Rogerio Feris Ron Kimmel Ruigang Yang Ryad Benosman Ryusuke Sagawa S.H. Srinivasan S. Kevin Zhou Seungjin Choi Sharat Chandran Sheng-Wen Shih Shihong Lao Shingo Kagami Shin’ichi Satoh Shinsaku Hiura ShiSguang Shan Shmuel Peleg Shoji Tominaga Shuicheng Yan Stan Birchfield Stefan Gehrig
Stephen Lin Stephen Maybank Subhashis Banerjee Subrata Rakshit Sumantra Dutta Roy Svetlana Lazebnik Takayuki Okatani Takekazu Kato Tat-Jen Cham Terence Sim Tetsuji Haga Theo Gevers Thomas Brox Thomas Leung Tian Fang Til Aach Tomas Svoboda Tomokazu Sato Toshio Sato Toshio Ueshiba Tyng-Luh Liu Vincent Lepetit Vivek Kwatra Vladimir Pavlovic Wee-Kheng Leow Wei Liu Weiming Hu Wen-Nung Lie Xianghua Ying Xianling Li Xiaogang Wang Xiaojuan Wu Yacoob Yaser Yaron Caspi Yasushi Sumi Yasutaka Furukawa Yasuyuki Sugaya Yeong-Ho Ha Yi-ping Hung Yong-Sheng Chen Yoshinori Kuno Yoshio Iwai Yoshitsugu Manabe Young Shik Moon Yunde Jia
Zen Chen Zhifeng Li Zhigang Zhu
Zhouchen Lin Zhuowen Tu Zuzana Kukelova
Additional Reviewers Afshin Sepehri Alvina Goh Anthony Dick Avinash Ravichandran Baidya Saha Brian Clipp Cédric Demonceaux Christian Beder Christian Schmaltz Christian Wojek Chunhua Shen Chun-Wei Chen Claude Pégard D.H. Ye D.J. Kwon Daniel Hein David Fofi David Gallup De-Zheng Liu Dhruv K. Mahajan Dipti Mukherjee Edgar Seemann Edgardo Molina El Mustapha Mouaddib Emmanuel Prados Frank R. Schmidt Frederik Meysel Gao Yan Guy Rosman Gyuri Dorko H.J. Shim Hang Yu Hao Du Hao Tang Hao Zhang Hirishi Ohno Hiroshi Ohno Huang Wei Hynek Bakstein
Ilya Levner Imran Junejo Jan Woetzel Jian Chen Jianzhao Qin Jimmy Jiang Liu Jing Wu John Bastian Juergen Gall K.J. Lee Kalin Kolev Karel Zimmermann Ketut Fundana Koichi Kise Kongwah Wan Konrad Schindler Kooksang Moon Levi Valgaerts Li Guan Li Shen Liang Wang Lin Liang Lingyu Duan Maojun Yuan Mario Fritz Martin Bujnak Martin Matousek Martin Sunkel Martin Welk Micha Andriluka Michael Stark Minh-Son Dao Naoko Nitta Neeraj Kanhere Niels Overgaard Nikhil Rane Nikodem Majer Nilanjan Ray Nils Hasler
Nipun kwatra Olivier Morel Omar El Ganaoui Pankaj Kumar Parag Chaudhuri Paul Schnitzspan Pavel Kuksa Petr Doubek Philippos Mordohai Reiner Schnabel Rhys Hill Rizwan Chaudhry Rui Huang S.M. Shahed Nejhum S.H. Lee Sascha Bauer Shao-Wen Yang Shengshu Wang Shiro Kumano Shiv Vitaladevuni Shrinivas Pundlik Sio-Hoi Ieng Somnath Sengupta Sudipta Mukhopadhyay Takahiko Horiuchi Tao Wang Tat-Jun Chin Thomas Corpetti Thomas Schoenemann Thorsten Thormaehlen Weihong Li Weiwei Zhang Xiaoyi Yu Xinguo Yu Xinyu Huang Xuan Song Yi Feng Yichen Wei Yiqun Li
Yong MA Yoshihiko Kawai
Zhichao Chen Zhijie Wang
Sponsors
Sponsor: Asian Federation of Computer Vision
Technical Co-sponsors: IPSJ SIG-CVIM, IEICE TG-PRMU
Table of Contents – Part II
Poster Session 4: Face/Gesture/Action Detection and Recognition

Palmprint Recognition Under Unconstrained Scenes . . . . . . . . . . . . . . . . . . Yufei Han, Zhenan Sun, Fei Wang, and Tieniu Tan
1
Comparative Studies on Multispectral Palm Image Fusion for Biometrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ying Hao, Zhenan Sun, and Tieniu Tan
12
Learning Gabor Magnitude Features for Palmprint Recognition . . . . . . . . Rufeng Chu, Zhen Lei, Yufei Han, Ran He, and Stan Z. Li
22
Sign Recognition Using Constrained Optimization . . . . . . . . . . . . . . . . . . . . Kikuo Fujimura and Lijie Xu
32
Poster Session 4: Image and Video Processing

Depth from Stationary Blur with Adaptive Filtering . . . . . . . . . . . . . . . . . . Jiang Yu Zheng and Min Shi
42
Three-Stage Motion Deblurring from a Video . . . . . . . . . . . . . . . . . . . . . . . . Chunjian Ren, Wenbin Chen, and I-fan Shen
53
Near-Optimal Mosaic Selection for Rotating and Zooming Video Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nazim Ashraf, Imran N. Junejo, and Hassan Foroosh
63
Video Mosaicing Based on Structure from Motion for Distortion-Free Document Digitization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Akihiko Iketani, Tomokazu Sato, Sei Ikeda, Masayuki Kanbara, Noboru Nakajima, and Naokazu Yokoya
Super Resolution of Images of 3D Scenes . . . . . . . . . . . . . . . . . . . . . . . . . . . Uma Mudenagudi, Ankit Gupta, Lakshya Goel, Avanish Kushal, Prem Kalra, and Subhashis Banerjee
Learning-Based Super-Resolution System Using Single Facial Image and Multi-resolution Wavelet Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shu-Fan Lui, Jin-Yi Wu, Hsi-Shu Mao, and Jenn-Jier James Lien
73
85
96
Poster Session 4: Segmentation and Classification

Statistical Framework for Shot Segmentation and Classification in Sports Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ying Yang, Shouxun Lin, Yongdong Zhang, and Sheng Tang
106
Sports Classification Using Cross-Ratio Histograms . . . . . . . . . . . . . . . . . . . Balamanohar Paluri, S. Nalin Pradeep, Hitesh Shah, and C. Prakash
116
A Bayesian Network for Foreground Segmentation in Region Level . . . . . Shih-Shinh Huang, Li-Chen Fu, and Pei-Yung Hsiao
124
Efficient Graph Cuts for Multiclass Interactive Image Segmentation . . . . Fangfang Lu, Zhouyu Fu, and Antonio Robles-Kelly
134
Feature Subset Selection for Multi-class SVM Based Image Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lei Wang
145
Evaluating Multi-class Multiple-Instance Learning for Image Categorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xinyu Xu and Baoxin Li
155
Poster Session 4: Shape

TransforMesh: A Topology-Adaptive Mesh-Based Approach to Surface Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Andrei Zaharescu, Edmond Boyer, and Radu Horaud
166
Microscopic Surface Shape Estimation of a Transparent Plate Using a Complex Image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Masao Shimizu and Masatoshi Okutomi
176
Shape Recovery from Turntable Image Sequence . . . . . . . . . . . . . . . . . . . . . H. Zhong, W.S. Lau, W.F. Sze, and Y.S. Hung
186
Shape from Contour for the Digitization of Curved Documents . . . . . . . . Frédéric Courteille, Jean-Denis Durou, and Pierre Gurdjos
196
Improved Space Carving Method for Merging and Interpolating Multiple Range Images Using Information of Light Sources of Active Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ryo Furukawa, Tomoya Itano, Akihiko Morisaka, and Hiroshi Kawasaki
206
Shape Representation and Classification Using Boundary Radius Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hamidreza Zaboli and Mohammad Rahmati
217
Optimization

A Convex Programming Approach to the Trace Quotient Problem . . . . . Chunhua Shen, Hongdong Li, and Michael J. Brooks
227
Learning a Fast Emulator of a Binary Decision Process . . . . . . . . . . . . . . . Jan Šochman and Jiří Matas
236
Radiometry

Multiplexed Illumination for Measuring BRDF Using an Ellipsoidal Mirror and a Projector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yasuhiro Mukaigawa, Kohei Sumino, and Yasushi Yagi
246
Analyzing the Influences of Camera Warm-Up Effects on Image Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Holger Handel
258
Geometry

Simultaneous Plane Extraction and 2D Homography Estimation Using Local Feature Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ouk Choi, Hyeongwoo Kim, and In So Kweon
269
A Fast Optimal Algorithm for L2 Triangulation . . . . . . . . . . . . . . . . . . . . . . Fangfang Lu and Richard Hartley
279
Adaptively Determining Degrees of Implicit Polynomial Curves and Surfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bo Zheng, Jun Takamatsu, and Katsushi Ikeuchi
289
Determining Relative Geometry of Cameras from Normal Flows . . . . . . . Ding Yuan and Ronald Chung
301
Poster Session 5: Geometry

Highest Accuracy Fundamental Matrix Computation . . . . . . . . . . . . . . . . . Yasuyuki Sugaya and Kenichi Kanatani
311
Sequential L∞ Norm Minimization for Triangulation . . . . . . . . . . . . . . . . . Yongduek Seo and Richard Hartley
322
Initial Pose Estimation for 3D Model Tracking Using Learned Objective Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Matthias Wimmer and Bernd Radig
332
Multiple View Geometry for Non-rigid Motions Viewed from Translational Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Cheng Wan, Kazuki Kozuka, and Jun Sato Visual Odometry for Non-overlapping Views Using Second-Order Cone Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jae-Hak Kim, Richard Hartley, Jan-Michael Frahm, and Marc Pollefeys
342
353
Pose Estimation from Circle or Parallel Lines in a Single Image . . . . . . . . Guanghui Wang, Q.M. Jonathan Wu, and Zhengqiao Ji
363
An Occupancy – Depth Generative Model of Multi-view Images . . . . . . . Pau Gargallo, Peter Sturm, and Sergi Pujades
373
Poster Session 5: Matching and Registration

Image Correspondence from Motion Subspace Constraint and Epipolar Constraint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shigeki Sugimoto, Hidekazu Takahashi, and Masatoshi Okutomi
Efficient Registration of Aerial Image Sequences Without Camera Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shobhit Niranjan, Gaurav Gupta, Amitabha Mukerjee, and Sumana Gupta
384
394
Simultaneous Appearance Modeling and Segmentation for Matching People Under Occlusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhe Lin, Larry S. Davis, David Doermann, and Daniel DeMenthon
404
Content-Based Matching of Videos Using Local Spatio-temporal Fingerprints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gajinder Singh, Manika Puri, Jeffrey Lubin, and Harpreet Sawhney
414
Automatic Range Image Registration Using Mixed Integer Linear Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shizu Sakakubara, Yuusuke Kounoike, Yuji Shinano, and Ikuko Shimizu Accelerating Pattern Matching or How Much Can You Slide? . . . . . . . . . . Ofir Pele and Michael Werman
424
435
Poster Session 5: Recognition

Detecting, Tracking and Recognizing License Plates . . . . . . . . . . . . . . . . . . Michael Donoser, Clemens Arth, and Horst Bischof
447
Action Recognition for Surveillance Applications Using Optic Flow and SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Somayeh Danafar and Niloofar Gheissari
457
The Kernel Orthogonal Mutual Subspace Method and Its Application to 3D Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kazuhiro Fukui and Osamu Yamaguchi
467
Viewpoint Insensitive Action Recognition Using Envelop Shape . . . . . . . . Feiyue Huang and Guangyou Xu
477
Unsupervised Identification of Multiple Objects of Interest from Multiple Images: dISCOVER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Devi Parikh and Tsuhan Chen
487
Poster Session 5: Stereo, Range and 3D

Fast 3-D Interpretation from Monocular Image Sequences on Large Motion Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jong-Sung Kim and Ki-Sang Hong
497
Color-Stripe Structured Light Robust to Surface Color and Discontinuity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kwang Hee Lee, Changsoo Je, and Sang Wook Lee
507
Stereo Vision Enabling Precise Border Localization Within a Scanline Optimization Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stefano Mattoccia, Federico Tombari, and Luigi Di Stefano
517
Three Dimensional Position Measurement for Maxillofacial Surgery by Stereo X-Ray Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Naoya Ohta, Kenji Mogi, and Yoshiki Nakasone
528
Stereo

Total Absolute Gaussian Curvature for Stereo Prior . . . . . . . . . . . . . . . . . . Hiroshi Ishikawa
537
Fast Optimal Three View Triangulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Martin Byröd, Klas Josephson, and Kalle Åström
549
Stereo Matching Using Population-Based MCMC . . . . . . . . . . . . . . . . . . . . Joonyoung Park, Wonsik Kim, and Kyoung Mu Lee
560
Dense 3D Reconstruction of Specular and Transparent Objects Using Stereo Cameras and Phase-Shift Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . Masaki Yamazaki, Sho Iwata, and Gang Xu
570
Image and Video Processing

Identifying Foreground from Multiple Images . . . . . . . . . . . . . . . . . . . . . . . Wonwoo Lee, Woontack Woo, and Edmond Boyer
580
Image and Video Matting with Membership Propagation . . . . . . . . . . . . . . Weiwei Du and Kiichi Urahama
590
Temporal Priors for Novel Video Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . Ali Shahrokni, Oliver Woodford, and Ian Reid
601
Content-Based Image Retrieval by Indexing Random Subwindows with Randomized Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Raphaël Marée, Pierre Geurts, and Louis Wehenkel
611
Poster Session 6: Face/Gesture/Action Detection and Recognition

Analyzing Facial Expression by Fusing Manifolds . . . . . . . . . . . . . . . . . . . . Wen-Yan Chang, Chu-Song Chen, and Yi-Ping Hung
621
A Novel Multi-stage Classifier for Face Recognition . . . . . . . . . . . . . . . . . . . Chen-Hui Kuo, Jiann-Der Lee, and Tung-Jung Chan
631
Discriminant Clustering Embedding for Face Recognition with Image Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Youdong Zhao, Shuang Xu, and Yunde Jia
641
Privacy Preserving: Hiding a Face in a Face . . . . . . . . . . . . . . . . . . . . . . . . . Xiaoyi Yu and Noboru Babaguchi
651
Face Mosaicing for Pose Robust Video-Based Recognition . . . . . . . . . . . . Xiaoming Liu and Tsuhan Chen
662
Face Recognition by Using Elongated Local Binary Patterns with Average Maximum Distance Gradient Magnitude . . . . . . . . . . . . . . . . . . . . Shu Liao and Albert C.S. Chung
672
An Adaptive Nonparametric Discriminant Analysis Method and Its Application to Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Liang Huang, Yong Ma, Yoshihisa Ijiri, Shihong Lao, Masato Kawade, and Yuming Zhao
680
Discriminating 3D Faces by Statistics of Depth Differences . . . . . . . . . . . . Yonggang Huang, Yunhong Wang, and Tieniu Tan
690
Kernel Discriminant Analysis Based on Canonical Differences for Face Recognition in Image Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wen-Sheng Vincent Chu, Ju-Chin Chen, and Jenn-Jier James Lien
700
Person-Similarity Weighted Feature for Expression Recognition . . . . . . . . Huachun Tan and Yu-Jin Zhang
712
Converting Thermal Infrared Face Images into Normal Gray-Level Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mingsong Dou, Chao Zhang, Pengwei Hao, and Jun Li Recognition of Digital Images of the Human Face at Ultra Low Resolution Via Illumination Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jen-Mei Chang, Michael Kirby, Holger Kley, Chris Peterson, Bruce Draper, and J. Ross Beveridge
722
733
Poster Session 6: Math for Vision

Crystal Vision-Applications of Point Groups in Computer Vision . . . . . . . Reiner Lenz
744
On the Critical Point of Gradient Vector Flow Snake . . . . . . . . . . . . . . . . . Yuanquan Wang, Jia Liang, and Yunde Jia
754
A Fast and Noise-Tolerant Method for Positioning Centers of Spiraling and Circulating Vector Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ka Yan Wong and Chi Lap Yip
764
Interpolation Between Eigenspaces Using Rotation in Multiple Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tomokazu Takahashi, Lina, Ichiro Ide, Yoshito Mekada, and Hiroshi Murase Conic Fitting Using the Geometric Distance . . . . . . . . . . . . . . . . . . . . . . . . . Peter Sturm and Pau Gargallo
774
784
Poster Session 6: Segmentation and Classification

Efficiently Solving the Fractional Trust Region Problem . . . . . . . . . . . . . . . Anders P. Eriksson, Carl Olsson, and Fredrik Kahl
796
Image Segmentation Using Iterated Graph Cuts Based on Multi-scale Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tomoyuki Nagahashi, Hironobu Fujiyoshi, and Takeo Kanade
806
Backward Segmentation and Region Fitting for Geometrical Visibility Range Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Erwan Bigorgne and Jean-Philippe Tarel
817
Image Segmentation Using Co-EM Strategy . . . . . . . . . . . . . . . . . . . . . . . . . Zhenglong Li, Jian Cheng, Qingshan Liu, and Hanqing Lu
827
Co-segmentation of Image Pairs with Quadratic Global Constraint in MRFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yadong Mu and Bingfeng Zhou
837
Shape from X

Shape Reconstruction from Cast Shadows Using Coplanarities and Metric Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hiroshi Kawasaki and Ryo Furukawa
847
Evolving Measurement Regions for Depth from Defocus . . . . . . . . . . . . . . . Scott McCloskey, Michael Langer, and Kaleem Siddiqi
858
A New Framework for Grayscale and Colour Non-Lambertian Shape-from-shading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . William A.P. Smith and Edwin R. Hancock
869
Face

A Regularized Approach to Feature Selection for Face Detection . . . . . . . Augusto Destrero, Christine De Mol, Francesca Odone, and Alessandro Verri
881
Iris Tracking and Regeneration for Improving Nonverbal Interface . . . . . . Takuma Funahashi, Takayuki Fujiwara, and Hiroyasu Koshimizu
891
Face Mis-alignment Analysis by Multiple-Instance Subspace . . . . . . . . . . . Zhiguo Li, Qingshan Liu, and Dimitris Metaxas
901
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
911
Palmprint Recognition Under Unconstrained Scenes

Yufei Han, Zhenan Sun, Fei Wang, and Tieniu Tan

Center for Biometrics and Security Research, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, P.O. Box 2728, Beijing, P.R. China, 100080
{yfhan,znsun,fwang,tnt}@nlpr.ia.ac.cn
Abstract. This paper presents a novel real-time palmprint recognition system for cooperative user applications. This system is the first one achieving noncontact capturing and recognizing palmprint images under unconstrained scenes. Its novelties can be described in two aspects. The first is a novel design of image capturing device. The hardware can reduce influences of background objects and segment out hand regions efficiently. The second is a process of automatic hand detection and fast palmprint alignment, which aims to obtain normalized palmprint images for subsequent feature extraction. The palmprint recognition algorithm used in the system is based on accurate ordinal palmprint representation. By integrating power of the novel imaging device, the palmprint preprocessing approach and the palmprint recognition engine, the proposed system provides a friendly user interface and achieves a good performance under unconstrained scenes simultaneously.
1 Introduction

Biometrics technology identifies different people by their physiological and behavioral differences. Compared with traditional security authentication approaches, such as keys or passwords, biometrics is more accurate, dependable and difficult to steal or fake. In the family of biometrics, palmprint is a novel but promising member. The large region of the palm supplies plenty of line patterns, which can be easily captured in a low-resolution palmprint image. Based on those line patterns, palmprint recognition can achieve a high accuracy of identity authentication. In previous work, several successful recognition systems were proposed for practical use of palmprint-based identity check [1][2][3], the best known of which was developed by Zhang et al. [1]. During image capturing, users are required to place their hands on a plate with pegs controlling the displacement of the hands. High-quality palmprint images are then captured by a CCD camera fixed in a semi-closed environment with a uniform lighting condition. To align the captured palmprint images, a preprocessing algorithm [2] is adopted to correct the rotation of those images and crop square ROIs (regions of interest) of the same size. Details about this system can be found in [2]. Besides, Connie et al. proposed a peg-free palmprint recognition system [3], which captures palmprint images with an optical scanner. Subjects are allowed to place their hands more freely on the platform of the scanner without pegs. As a result,
palmprint images with different sizes, translations and rotation angles are obtained. Similar to [2], an alignment process is involved to obtain normalized ROI images. However, efficient as these systems are, there are still some limitations. Firstly, some users may feel uncomfortable with pegs restricting their hands during image capturing. Secondly, even without pegs, subjects’ hands are required to contact the plate of the device or the platform of the scanner, which is not hygienic enough. Thirdly, semi-closed image capturing devices usually increase the volume of recognition systems, which makes them inconvenient for portable use. Thus, it is necessary to improve the design of the HCI (human-computer interface) in order to make the whole system easy to use. Recently, active near infrared (NIR) imagery technology has received more and more attention in face detection and recognition, as seen in [4]. Given a near infrared light source shining on objects in front of a camera, the intensity of the reflected NIR light attenuates sharply as the distance between the objects and the light source increases. This property provides a promising solution to eliminate the influence of backgrounds when palmprint images are captured under unconstrained scenes. Based on this technology, in this paper we propose a novel real-time palmprint recognition system. It is designed to localize and obtain normalized palmprint images under cluttered scenes conveniently. The main contributions are as follows: First, we present a novel design of a portable image capturing device, which mainly consists of two web cameras placed in parallel. One is used for active near infrared imagery to localize hand regions. The other captures corresponding palmprint images in visible light, preparing for further feature extraction. Second, we present a novel palmprint preprocessing algorithm, utilizing color and shape information of hands for fast and effective hand region detection, rotation correction and localization of the central palm region. So far as we know, there is no similar work reported in the previous literature. The rest of the paper is organized as follows: Section 2 presents a description of the whole architecture of the recognition system. In Section 3, the design of the human-computer interface of the system is described in detail. Section 4 introduces ordinal palmprint representation briefly. Section 5 evaluates the performance of the system. Finally, in Section 6, we conclude the whole paper.

2 System Overview

We adopt a common PC with an Intel Pentium 4 3.0 GHz CPU and 1 GB RAM as the computation platform. Based on it, the recognition system is implemented using Microsoft Visual C++ 6.0. It consists of five main modules, as shown in Fig.1. After starting the system, users are required to open their hands in a natural manner and place the palm regions toward the imaging device at a distance between 35 cm and 50 cm from the cameras. The surfaces of the palms are approximately orthogonal to the optical axis of the cameras. In-plane rotation of hands is restricted to between -15 degrees and 15 degrees deviation from the vertical orientation. The imaging device then captures two images for each hand by two cameras placed in parallel. One is a NIR hand image captured under active NIR lighting; the other is a color hand image with background objects, obtained under a normal environment lighting condition. Both of them contain the complete hand region, as shown in Fig.2. After that, an efficient palmprint preprocessing
algorithm is performed on the two captured images to quickly obtain one normalized palmprint image, making use of both shape and skin color information of hands. Finally, robust palmprint feature templates are extracted from the normalized image using the ordinal code based approach [5]. Fast Hamming distance calculation is applied to measure the dissimilarity between two feature templates. An example of the whole recognition process can be seen in the supplementary video of this paper.
3 Smart Human-Computer Interface

The HCI of the system mainly consists of two parts, the image capturing hardware and the palmprint preprocessing procedure, as shown in Fig.1. In a hand image captured under an unconstrained scene, unlike those captured by the devices in [1][2][3], there is not only a hand region containing palmprint patterns, but also background objects of different shapes, colors and positions, as denoted in Fig.2. Even within the hand, there still exists rotation, scale variation and translation of palmprint patterns due to different hand displacements. Thus, before further palmprint feature encoding, the HCI should localize the candidate hand region and extract a normalized ROI (region of interest), which contains palmprint features without much geometric deformation.

3.1 Image Capturing Device

Before palmprint alignment, it is necessary to segment hand regions from unconstrained scenes. This problem could be solved by background modeling and subtraction, or by labeling skin color regions. However, both methods suffer from unconstrained backgrounds or varying light conditions. Our design of the imaging device aims to solve the problem at the sensor level, in order to localize foreground hand regions more robustly by simple image binarization. The appearance of the image capturing device is shown in Fig.2(a). This device has two common CMOS web cameras placed in parallel. We mount near infrared (NIR) light-emitting diodes on the device, evenly distributed around one camera, similar to [4], so as to provide straight and uniform NIR lighting. The near infrared light emitted by those LEDs has a wavelength of 850 nm. In a further step, we make use of a band-pass optical filter fixed on the camera lens to cut off light at all other wavelengths except 850 nm. Most environment light is cut off because its wavelength is less than 700 nm. Thus, the light received by the camera only consists of the reflected NIR LED light and the NIR components of environment light, such as lamp light and sunlight, which are much weaker than the NIR LED light. Notably, the intensity of the reflected NIR LED light is in inverse proportion to high-order terms of the distance between the object and the camera. Therefore, assuming the hand is the nearest among all objects in front of the camera during image capturing, the intensities of the hand region in the corresponding NIR image should be much larger than those of the background. As a result, we can segment out the hand region and eliminate the background by fast image binarization, as denoted in Fig.2(b). The other
camera in the device captures color scene images, obtaining clear palmprint patterns and preserving the color information of hands. An optical filter is fixed on the lens of this camera to filter out the infrared components in the reflected light, as is widely done in digital cameras to avoid red-eye. The two cameras work simultaneously. In our device, the resolution of both cameras is 640*480. Fig.2(b) shows a pair of example images captured by the two cameras at the same time. The upper one is the color image; the bottom one is the NIR image. The segmentation result is shown in the upper row of Fig.2(c). In order to focus on hand regions with a proper scale in further processing, we adopt a scale selection on the binary segmentation results to choose candidate foreground regions. The selection criterion is based on the fact that the area of a hand region in a NIR image is larger when the hand is nearer to the camera. We label all connected binary foreground components after segmentation, calculate the area of each connected component, and then choose those labeled regions whose areas fall within a predefined narrow range as the candidate foreground regions, like the white region shown in the image at the bottom of Fig.2(c).
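As a rough illustration of this segmentation step, the sketch below binarizes an NIR frame and keeps connected components whose area falls within a predefined range. The intensity threshold and the area bounds are assumptions used only for illustration, since the paper does not state concrete values.

```python
import numpy as np
from scipy import ndimage

def candidate_hand_regions(nir_image, intensity_thresh=80, area_range=(20000, 60000)):
    """Binarize an NIR frame and keep connected components whose area lies
    in a predefined range (hypothetical values, for illustration only)."""
    mask = nir_image > intensity_thresh              # hand reflects far more NIR LED light
    labels, num = ndimage.label(mask)                # label connected foreground components
    areas = ndimage.sum(mask, labels, index=np.arange(1, num + 1))
    keep = [i + 1 for i, a in enumerate(areas) if area_range[0] <= a <= area_range[1]]
    return np.isin(labels, keep), labels, keep       # candidate foreground mask
```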
Fig. 1. Flowcharts of the system
Fig. 2. (a) Image capturing device (b) Pair-wise color and NIR image (c) Segmented fore ground and candidate foreground region selection
3.2 Automated Hand Detection

Hand detection is posed as a two-class problem of classifying the input shape pattern into hand-like and non-hand classes. In our system, a cascade classifier is trained to detect hand regions in the binary foreground, based on the work reported in [6]. In [6], Eng-Jon Ong et al. make use of such a classifier to classify different hand gestures. In our application, the cascade classifier should be competent for two tasks. Firstly, it should differentiate the shape of an open hand from all other kinds of shapes. Secondly, it should reject open hands whose in-plane rotation angle deviates outside the restricted range. To achieve these goals, we first construct a positive dataset containing binary open left hands, as illustrated in Fig.3(a). In order to make the classifier tolerate a certain amount of in-plane rotation, the dataset consists of left hands at seven discrete rotation angles, sampled every 5 degrees from -15 degrees to 15 degrees deviation from the vertical orientation; a part of those binary hands are collected from [11]. For each angle, there are about 800 hand images with slight variations in finger posture, also shown in Fig.3(a). Before training, all positive data are normalized into 50*35 images. The negative dataset contains two parts. One consists of binary images containing non-hand objects, such as human heads, turtles and cars, partly from [10]. The other contains left hands with rotation angles outside the restricted range and right hands with a variety of displacements. In total there are more than 60,000 negative images. Fig.3(b) shows example negative images. Based on those training data, we use the Float AdaBoost algorithm to select the most efficient Haar features to construct the cascade classifier, the same as in [6]. Fig.3(c) shows the six most efficient Haar features obtained after training. We see that they represent discriminative shape features of an open left hand. During detection, rather than the exhaustive search across all positions and scales in [6], we apply the classifier directly around the candidate binary foreground regions
Fig. 3. (a) Positive training data (b) Negative training data (c) Learned efficient Haar features (d) Detected hand region
to search for open left hands at a certain scale. Therefore, we can detect different hands at a relatively stable scale, which reduces the influence of scale variations on palmprint patterns. Considering the mirror symmetry between left and right hands, to detect right hands we simply flip the images and apply the classifier in the same way to the flipped images. Fig.3(d) shows detection results. Once a hand is detected, all other non-hand connected regions are removed from the binary hand image. The whole detection can be finished within 20 ms.

3.3 Palmprint Image Alignment

The palmprint alignment procedure eliminates rotation and translation of palmprint patterns, in order to obtain a normalized ROI. Most alignment algorithms calculate the rotation angle of the hand by localizing key contour points in the gaps between fingers [2][3]. However, in our application, different finger displacements may change local contours and make it difficult to detect the gap regions, as denoted in Fig.4. To solve this problem, we adopt a fast rotation angle estimation based on moments of the hand shape. Let R be the detected hand region in a binary foreground image. Its orientation θ can be estimated from its central moments [7]:

θ = (1/2) · arctan( 2μ1,1 / (μ2,0 − μ0,2) )    (1)

where μp,q (p, q = 0, 1, ...) is the (p,q)-order central moment of R, defined as:

μp,q = ∑x∑y ( x − (1/N)∑x∑y x )^p ( y − (1/N)∑x∑y y )^q,  (x, y) ∈ R    (2)

where N is the number of pixels in R.
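A minimal sketch of Eqs. (1)–(2) on a binary hand mask is given below; it is an illustration of the formulas, not the authors' implementation (arctan2 is used so the quadrant is handled automatically).

```python
import numpy as np

def hand_orientation(mask):
    """Estimate the orientation of the binary hand region R via Eqs. (1)-(2).
    `mask` is a 2-D boolean array; True marks pixels belonging to R."""
    ys, xs = np.nonzero(mask)                  # coordinates (x, y) of pixels in R
    x_mean, y_mean = xs.mean(), ys.mean()      # (1/N)*sum of x and (1/N)*sum of y
    mu11 = np.mean((xs - x_mean) * (ys - y_mean))
    mu20 = np.mean((xs - x_mean) ** 2)
    mu02 = np.mean((ys - y_mean) ** 2)
    # The common 1/N factor cancels in the ratio, so means can replace sums.
    return 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)   # theta, in radians
```

Both the binary hand image and the corresponding color image can then be rotated by −θ (e.g. with scipy.ndimage.rotate) before the palm region is localized.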
Compared with key point detection, the moments are calculated over the whole hand region rather than only contour points, and are therefore more robust to local changes in the contour. To reduce the computation cost, the original binary image is down-sampled to 160*120 and the moments are calculated on the down-sampled version. After obtaining the rotation angle θ, the hand region is rotated by -θ degrees to get a vertically oriented hand, as shown in Fig.4. Simultaneously, the corresponding color image is also rotated by -θ, in order to ensure consistency of hand orientation in both images. In a further step, we locate the central palm region in a vertically oriented open hand by analyzing the difference in connectivity between the palm region and the finger region. Although the shapes and sizes of hands vary a lot, the palm region of each hand should be roughly rectangular. In contrast, stretched fingers do not form as connective a region as the palm. Based on this property, we employ an erosion operation on the binary hand image to remove the finger regions. The basic idea behind this operation is run-length coding of the binary image. We perform a raster scan on each row to calculate the maximum length W of connective sequences in the row. Any row with W less than a threshold K1 is eroded. After all rows are scanned, the same operation is performed on each column. As a result, columns whose maximum length W is less than K2 are removed. Finally, a rectangular palm region is cropped from the hand. The coordinates (xp,yp) of its central point are derived as the localization result. In order to
cope with the varying sizes of different hands, we choose the values of K1 and K2 adaptively. Before row erosion, the distance between each point in the hand region and the nearest edge point is calculated by a fast distance transform. The central point of the hand is defined as the one with the largest distance value. Assuming A is the maximum length of connective sequences in the row passing through the central point, K1 is defined as follows:

K1 = A * p%    (3)

where p is a pre-defined threshold. K2 is defined in the same way:

K2 = B * q%    (4)

where B is the maximum length of connective sequences in the column passing through the central point after row erosion, and q is another pre-defined threshold. Compared with fixed values, adaptive K1 and K2 lead to more accurate localization of the central palm region, as denoted in Fig.5(b). Fig.5(a) denotes the whole erosion procedure. Due to the visual disparity between the two cameras in the imaging device, we cannot use (xp,yp) to localize the ROI in the corresponding color image directly. Although the visual disparity could be estimated by 3D scene reconstruction, this approach would place a heavy computation burden on the system. Instead, we apply a fast correspondence estimation based on template matching. Assuming C is the color hand image after rotation correction, we convert C into a binary image M by setting all pixels with skin color to 1, based on the probability distribution model of skin color in RGB space [8]. Given the binary version of the corresponding NIR image, with the hand region S located at (xn,yn), template matching is conducted as in Eq.5, as denoted in Fig.6:
f(m,n) = ∑x∑y [ M(x+m, y+n) ⊕ S(x,y) ],  (x,y) ∈ S    (5)
⊕ is the bitwise AND operator, and f(m,n) is the matching energy function, where (m,n) is a candidate position of the template. The optimal displacement (xo,yo) of the hand shape S in M is defined as the candidate position where the matching energy achieves its maximum. The central point (xc,yc) of the palm region in C can then be estimated by the following equations:
xc = xp + xo − xn,  yc = yp + yo − yn    (6)

Fig. 4. Rotation correction
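For binary masks, the AND-overlap count of Eq. (5) at every displacement is exactly a cross-correlation, so the search can be carried out with an FFT. The sketch below assumes the hand template S has been cropped to its bounding box and that both masks come from the rotation-corrected images; it is illustrative rather than the authors' code.

```python
import numpy as np
from scipy.signal import fftconvolve

def palm_centre_in_colour(M, S, palm_centre_nir, hand_pos_nir):
    """M: binary skin-colour mask of the rotated colour image.
    S: binary hand template from the rotated NIR image (bounding-box crop).
    palm_centre_nir = (xp, yp) and hand_pos_nir = (xn, yn) as in Eq. (6)."""
    # Cross-correlation of the two binary masks counts overlapping 1-pixels,
    # i.e. the matching energy f(m, n) of Eq. (5), for every placement of S in M.
    energy = fftconvolve(M.astype(float), S[::-1, ::-1].astype(float), mode="valid")
    yo, xo = np.unravel_index(np.argmax(energy), energy.shape)   # optimal displacement
    xp, yp = palm_centre_nir
    xn, yn = hand_pos_nir
    return xp + xo - xn, yp + yo - yn                            # Eq. (6): (xc, yc)
```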
Fig. 5. (a) Erosion procedure (b) Erosion with fixed and adaptive thresholds
With (xc,yc) as its center, a 128*128 sub-image is cropped from C as the ROI, which is then converted to a grayscale image for feature extraction.
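Putting the erosion step of this section together, a minimal sketch of the palm localization with the adaptive thresholds of Eqs. (3)–(4) might look as follows; the percentage values p and q are assumptions, since the paper does not report them.

```python
import numpy as np
from scipy import ndimage

def longest_run(bits):
    """Length of the longest run of 1s in a 1-D binary array (run-length idea)."""
    best = cur = 0
    for b in bits:
        cur = cur + 1 if b else 0
        best = max(best, cur)
    return best

def locate_palm(mask, p=70, q=70):
    """Erode finger rows/columns of a vertically oriented boolean hand `mask`
    using the adaptive thresholds of Eqs. (3)-(4); p and q are assumptions."""
    dist = ndimage.distance_transform_edt(mask)
    cy, cx = np.unravel_index(np.argmax(dist), dist.shape)    # central point of the hand
    K1 = longest_run(mask[cy]) * p / 100.0                    # Eq. (3)
    rows_kept = np.array([longest_run(r) >= K1 for r in mask])
    eroded = mask & rows_kept[:, None]
    K2 = longest_run(eroded[:, cx]) * q / 100.0               # Eq. (4)
    cols_kept = np.array([longest_run(c) >= K2 for c in eroded.T])
    eroded = eroded & cols_kept[None, :]
    ys, xs = np.nonzero(eroded)
    return int(xs.mean()), int(ys.mean())                     # palm centre (xp, yp)
```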
Fig. 6. Translation estimation
4 Ordinal Palmprint Representation

In previous work, the orthogonal line ordinal feature (OLOF) [5] provides a compact and accurate representation of negative line features in palmprints. The orthogonal line ordinal filter [5] F(x,y,θ) is designed as follows:

F(x,y,θ) = G(x,y,θ) − G(x,y,θ + π/2)    (7)
G(x,y,θ) = exp[ −( (x cos θ + y sin θ) / δx )^2 − ( (−x sin θ + y cos θ) / δy )^2 ]    (8)
G(x,y,θ) is a 2D anisotropic Gaussian filter, and θ is the orientation of the Gaussian filter. The ratio between δx and δy is set to be larger than 3, in order to obtain a weighted average of a line-like region. In each local region in a palmprint image,
three such ordinal filters, with orientations of 0, π/6 and π/3, are used to convolve the region. Each filtering result is then encoded as 1 or 0 according to whether its sign is positive or negative. Thousands of ordinal codes are concatenated into a feature template. The dissimilarity between two feature templates is measured by a normalized Hamming distance, which ranges between 0 and 1. Further details can be found in [5].
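A compact sketch of this ordinal encoding and of the normalized Hamming distance is given below. The filter size and the δx, δy values are assumptions (the text only requires δx/δy > 3), and a practical implementation would subsample the filter responses to obtain a few thousand bits rather than keep every pixel.

```python
import numpy as np
from scipy.signal import fftconvolve

def gaussian_line(theta, size=31, dx=6.0, dy=1.8):
    """2-D anisotropic Gaussian G(x, y, theta) of Eq. (8); dx/dy > 3."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    u = x * np.cos(theta) + y * np.sin(theta)
    v = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(u / dx) ** 2 - (v / dy) ** 2)

def ordinal_template(roi):
    """Concatenate ordinal codes from the three filter orientations."""
    bits = []
    for theta in (0.0, np.pi / 6, np.pi / 3):
        F = gaussian_line(theta) - gaussian_line(theta + np.pi / 2)   # Eq. (7)
        response = fftconvolve(roi.astype(float), F, mode="same")
        bits.append(response > 0)                 # encode the sign as 1/0
    return np.concatenate([b.ravel() for b in bits])

def hamming_distance(code_a, code_b):
    """Normalized Hamming distance in [0, 1]; smaller means more similar."""
    return np.count_nonzero(code_a != code_b) / code_a.size
```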
5 System Evaluation

The performance of the system is evaluated in terms of verification rate [9], which is obtained through one-to-one image matching. We collected 1200 normalized palmprint ROI images from 60 subjects using the system, with 10 images for each hand. Fig.7 illustrates six examples of ROI images. During the test, there are in total 5,400 intra-class comparisons and 714,000 inter-class comparisons. Although the recognition accuracy of the system depends on the effectiveness of both the alignment procedure of the HCI and the palmprint recognition engine, the latter is not the focus of this paper. Thus we do not include performance comparisons between the ordinal code and other state-of-the-art approaches. Fig.8 shows the genuine and impostor distributions, and Fig.9 shows the corresponding ROC curve. The equal error rate [9] of the verification test is 0.54%. From the experimental results, we can see that the ROI regions obtained by the system are suitable for palmprint feature extraction and recognition. Besides, we also recorded the time cost for obtaining one normalized palmprint image using the system, which includes the time for image capturing, hand detection and palmprint alignment. The average time cost is 1.2 seconds. Thus, our system is competent for point-of-sale identity check.
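For reference, the verification statistics used here (FAR/FRR and the equal error rate) can be computed from the intra-class and inter-class distance sets as sketched below; the threshold grid is an arbitrary choice.

```python
import numpy as np

def far_frr_eer(genuine, impostor, n_steps=1001):
    """genuine / impostor: arrays of normalized Hamming distances from
    intra-class and inter-class comparisons; smaller means more similar."""
    thresholds = np.linspace(0.0, 1.0, n_steps)
    far = np.array([(impostor <= t).mean() for t in thresholds])  # false accept rate
    frr = np.array([(genuine > t).mean() for t in thresholds])    # false reject rate
    i = int(np.argmin(np.abs(far - frr)))
    return far, frr, (far[i] + frr[i]) / 2.0                      # approximate EER
```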
Fig. 7. Six examples of ROI images
Fig. 8. Distributions of genuine and imposter
Fig. 9. ROC curve of the verification test
6 Conclusion

In this paper, we have proposed a novel palmprint recognition system for cooperative user applications, which achieves real-time non-contact palmprint image capturing and recognition directly under unconstrained scenes. Through the design of the system, we aim to provide a more convenient human-computer interface and reduce the restrictions on users during palmprint-based identity check. The core of the HCI in the system consists of a binocular imaging device and a novel palmprint preprocessing algorithm. The former delivers fast hand region segmentation based on NIR imaging technology. The latter extracts a normalized ROI from the hand region efficiently based on the shape and color information of human hands. Benefiting further from the powerful recognition engine, the proposed system achieves accurate recognition and convenient use at the same time. As far as we know, this is the first attempt to solve the problem of obtaining normalized palmprint images directly from cluttered backgrounds. However, accurate palmprint alignment has not been fully addressed in the proposed system. In our future work, it is an important issue to improve the performance of the system by further reducing the alignment error. In addition,
we should improve the imaging device to deal with the influence of the NIR component in environment light, which varies greatly in practical use.

Acknowledgments. This work is funded by research grants from the National Basic Research Program (Grant No.2004CB318110), the Natural Science Foundation of China (Grant No.60335010, 60121302, 60275003, 60332010, 69825105, 60605008) and the Chinese Academy of Sciences.
References

1. Zhang, D., Kong, W.K., You, J., Wong, M.: Online Palmprint Identification. IEEE Trans. on PAMI 25(9), 1041–1050 (2003)
2. Kong, W.K.: Using Texture Analysis in Biometric Technology for Personal Identification. MPhil Thesis, http://pami.uwaterloo.ca/ cswkkong/Sub_Page/Publications.htm
3. Connie, T., Jin, A.T.B., Ong, M.G.K., Ling, D.N.C.: Automated palmprint recognition system. Image and Vision Computing 23, 501–515 (2005)
4. Li, S.Z., Chu, R.F., Liao, S.C., Zhang, L.: Illumination Invariant Face Recognition Using Near-Infrared Images. IEEE Trans. on PAMI 29(4), 627–639 (2007)
5. Sun, Z.N., Tan, T.N., Wang, Y.H., Li, S.Z.: Ordinal Palmprint Representation for Personal Identification. Proc. of IEEE CVPR 2005 1, 279–284 (2005)
6. Ong, E., Bowden, R.: A Boosted Classifier Tree for Hand Shape Detection. In: Proc. of International Conference on Automatic Face and Gesture Recognition, pp. 889–894 (2004)
7. Jain, A.K.: Fundamentals of Digital Image Processing. Prentice Hall, Upper Saddle River, NJ 07458, p. 392
8. Jones, M.J., Rehg, J.M.: Statistical Color Models with Application to Skin Color Detection. International Journal of Computer Vision 46(1), 81–96 (2002)
9. Daugman, J., Williams, G.: A Proposed Standard for Biometric Decidability. In: Proc. of CardTech/SecureTech Conference, Atlanta, GA, pp. 223–234 (1996)
10. http://www.cis.temple.edu/ latecki/TestData mpeg7shapeB.tar.gz
11. UST Hand Image database, http://visgraph.cs.ust.hk/Biometrics/Visgraph_web/ index.html
Comparative Studies on Multispectral Palm Image Fusion for Biometrics Ying Hao, Zhenan Sun, and Tieniu Tan National Laboratory of Pattern Recognition, Institute of Automation, CAS
Abstract. Hand biometrics, including fingerprint, palmprint, hand geometry and hand vein pattern, have obtained extensive attention in recent years. Physiologically, skin is a complex multi-layered tissue consisting of various types of components. Optical research suggests that different components appear when the skin is illuminated with light sources of different wavelengths. This motivates us to extend the capability of camera by integrating information from multispectral palm images to a composite representation that conveys richer and denser pattern for recognition. Besides, usability and security of the whole system might be boosted at the same time. In this paper, comparative study of several pixel level multispectral palm image fusion approaches is conducted and several well-established criteria are utilized as objective fusion quality evaluation measure. Among others, Curvelet transform is found to perform best in preserving discriminative patterns from multispectral palm images.
1 Introduction
The hand, as a tool for humans to perceive and reconstruct the surrounding environment, is the most used body part in our daily life. Due to its high acceptance by human beings, its prevalence in the field of biometrics is not surprising. Fingerprint [1], hand geometry [2], palmprint [7][8], palm-dorsa vein pattern [3], finger vein [4] and palm vein [5] are all good examples of hand biometric patterns. These modalities have been explored by earlier researchers and can be divided into three categories:

- Skin surface based modality. Examples are fingerprint and palmprint. Both traits explore information from the surface of the skin and have received extensive attention. Both of them are recognized as having the potential to be used in high-security scenarios;
- Internal structure based modality, which extracts information from the vein structure deep under the surface for recognition. Although new in the biometric family, the high constancy and uniqueness of vein structure make this category more and more active nowadays [3][9];
- Global structure based modality. The only example of this category is hand geometry. Hand geometry is a good choice for small-scale applications thanks to its high performance-price ratio.
No matter which category of modality one chooses to work on, a closer look at the skin appearance is beneficial. Physiologically, human skin consists of many components, such as cells, fibers, veins and nerves, which give skin a multi-layered structure. At the outermost layer, numerous fine furrows, hairs and pores are scattered over the surface of the skin, while veins, capillaries and nerves form a vast network inside [6]. Optical studies have demonstrated that light with a longer wavelength tends to penetrate the skin more deeply; for example, near infrared light from 600nm to 1000nm typically penetrates the skin to about 1-3 mm. Therefore different visual contents, with different optical properties, are detected with incident light of different wavelengths [11]. The uniqueness of human skin, including its micro, meso and macro structures, is a product of random factors during embryonic development. Enlightened by the success and fast development of the above-mentioned hand based biometrics, each of which reflects only one aspect of the hand, we believe that the best potential of biometric features in the hand region is yet to be discovered. The purpose of this work is to exploit the correlative and complementary nature of multispectral hand images for image enhancement, filtering and fusion. Taking palmprint and vein as examples, the common characteristic of the two modalities is that they both utilize moderate resolution hand imagery and they share similar discriminative information: line-like patterns. On the other hand, their intrinsic physiological nature gives the two traits distinctive advantages and disadvantages. More precisely, palmprint is related to the outermost skin pattern; therefore, its appearance is sensitive to illumination conditions, aging, skin disease, abrasion, etc. In contrast, the hand vein pattern, as an interior structure, is robust to the above-mentioned external factors. However, vein image quality varies dramatically across the population and in the case of blood vessel constriction resulting from extremely cold weather. Several advantages can be obtained by fusing the two spectral hand images. First of all, a more user-friendly system can be developed by alternatively combining the two traits or choosing the appropriate one for recognition according to the corresponding imaging quality; secondly, forgery is much more difficult for such an intelligent system and hence the system is more secure; and finally, the recognition performance might be boosted. In this work, we designed a device to automatically and periodically capture visible and near infrared spectral images. Two sets of lights are turned on in turn so that palmprint and vein images are captured. With the images at hand, we validated the idea of image fusion. Several pixel-level image fusion approaches are performed to combine the original images into a composite one, which is expected to convey more information than its inputs. The rest of this paper is organized as follows. The hardware design of the image capture device is presented in Section 2, followed by a brief introduction of the four fusion methods in Section 3. The proposed fusion scheme is introduced in Section 4 and Section 5 includes experimental results as well as performance evaluation. Finally, conclusions and discussion are presented in Section 6.
2 Hardware Design
Fig. 1 illustrates the principal design of the device we developed to capture images in both the visible (400-700nm) and near infrared (800-1000nm) spectra. The device works in a sheltered environment, and the light sources are carefully arranged so that the palm region is evenly illuminated. An infrared-sensitive CCD camera is fixed at the bottom of the inner enclosure and connected to a computer via a USB interface. An integrated circuit plate is mounted near the camera for illumination, and different combinations of light wavelengths can be obtained by replacing the circuit plate. By default, the two sets of lights are turned on in turn so that only the expected layer of the hand appears to the camera. When illuminated with visible light, an image of the hand skin surface, namely the palmprint, is stored, while when the NIR light is on, the deeper structure as well as part of the dominant surface features, for example the principal lines, is captured. Manual control of the two lights is also possible by sending computer instructions to the device. A pair of images captured using the device is shown in Fig. 2(a)(b), where (a) is the palmprint image and (b) is the vein image. It is obvious that the two images emphasize quite different components of the hand.
Fig. 1. Multispectral palm image capture device, where LEDs of two wavelengths are controlled by a computer
3 Image Fusion and Multiscale Decomposition
The concept of image fusion refers to integrating information from different images for better visual or computational perception. Image fusion sometimes refers narrowly to pixel-level fusion, while a broader definition also includes feature-level and matching-score-level fusion. In this work, we focus on pixel-level fusion because it incurs minimum information loss. The key issue in image fusion is to faithfully preserve important information while suppressing noise. The discriminative information in palmprint and vein images, or more
specifically the principal lines, wrinkle lines, ridges and blood vessels, all takes the form of line-like patterns. Therefore, the essential goal is to maximally preserve these patterns. In the field of pixel-level fusion, multiscale decomposition (MSD), such as pyramid decomposition and wavelet decomposition, is often applied because it typically provides better spatial and spectral localization of image information, and the decorrelation between pyramid subbands allows for more reliable feature selection[19]. The methods used in this paper also follow this direction of research, while the evaluation measures are applied to a feature-level representation rather than the intensity level, to match the context of biometrics. We selected four multiscale decomposition methods for comparison. The gradient pyramid can be obtained by applying four directional gradient operators to each level of a Gaussian pyramid. The four operators correspond to the horizontal, vertical and two diagonal directions, so image features are indexed according to their orientations and scales. The morphological pyramid is constructed by a successive procedure of morphological filtering and sub-sampling. Morphological filters, such as open and close, are designed to preserve the edges and shapes of objects, which makes this approach suitable for the task at hand. The shift-invariant digital wavelet transform was proposed to overcome the wavy artifacts normally observed in fusion based on the traditional wavelet transform; it uses an over-complete wavelet basis, and the down-sampling step is replaced by dilated analysis filters. In our implementation, the Haar wavelet is chosen and the decomposition level for the above three methods is three. The Curvelet transform is a somewhat more complex multiscale transform[12][13][15][14] designed to efficiently represent edges and other singularities along curves. Unlike the wavelet transform, it has directional parameters, and its coefficients have a high degree of directional specificity; therefore, large coefficients in the transform space suggest strong lines in the original image. These methods are not new in the field of image fusion[16][17][18][19][21]. However, earlier researchers either focused on remote sensing applications, which involve a trade-off between spectral and spatial resolution, or pursued general-purpose image fusion schemes. This work is one of the first to adopt and compare them in the context of hand-based biometrics.
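To make the decomposition step concrete, the following is a minimal sketch (not the authors' implementation) of one of the four methods, the morphological pyramid: each level is obtained by open-close filtering followed by sub-sampling, and the detail image at each level is the difference between the level and its filtered version. The kernel shape and size and the number of levels are illustrative assumptions.

```python
# Morphological pyramid sketch: open-close filtering, detail extraction,
# then sub-sampling to build the next level.
import cv2
import numpy as np

def morphological_pyramid(img, levels=3, ksize=5):
    """Return (approximations, details) lists for a morphological pyramid."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (ksize, ksize))
    approximations, details = [img.astype(np.float32)], []
    current = approximations[0]
    for _ in range(levels):
        # Open-close filtering preserves edges and object shapes.
        filtered = cv2.morphologyEx(current, cv2.MORPH_OPEN, kernel)
        filtered = cv2.morphologyEx(filtered, cv2.MORPH_CLOSE, kernel)
        # Detail layer: structures removed by the filtering.
        details.append(current - filtered)
        # Sub-sample the filtered image to form the next pyramid level.
        current = filtered[::2, ::2]
        approximations.append(current)
    return approximations, details
```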
4 Proposed Fusion Method
Our fusion method is composed of two steps, namely a preprocessing step that adjusts dynamic ranges and removes noise from the vein images, and a fusion step that combines information from the visible and infrared images.

4.1 Preprocessing
When illuminated with visible light, images of the fine structures of the skin are captured. In contrast to their behavior at visible wavelengths, cameras usually have a much
lower sensitivity to infrared light. Therefore, the camera tends to work in a low-luminance regime, and its AGC (Auto Gain Control) feature takes effect to maintain the output level. This procedure amplifies signal and noise at the same time, producing noisy IR images. The first stage of preprocessing is to distinguish between the two spectra. The relatively large difference between the camera responses at the two wavelengths makes NIR images consistently darker than visible images, so the separation is accomplished simply via an average intensity comparison. This is followed by a normalization step that modifies the dynamic range of the vein image so that its mean and standard deviation equal those of the palmprint image. The underlying reason is that an equal dynamic range across source images helps to produce comparable coefficients in the transform domain. Finally, bilateral filtering is applied to remove noise from the infrared images. Bilateral filtering is a non-iterative scheme for edge-preserving smoothing [10]. The response at a pixel x is defined as a weighted average of similar and nearby pixels, where the weight combines a domain component, which measures the spatial closeness of a neighbor to x, with a range component, which measures its photometric similarity to x. Therefore, the desired behavior is achieved both in smooth regions and at boundaries.
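The following is a minimal sketch (not the authors' code) of the preprocessing just described: the darker image of a captured pair is taken to be the NIR (vein) image, its dynamic range is matched to the palmprint image, and bilateral filtering suppresses the amplified sensor noise. The filter parameters are illustrative assumptions.

```python
# Preprocessing sketch: spectral separation, dynamic-range normalization and
# edge-preserving bilateral smoothing of the NIR (vein) image.
import cv2
import numpy as np

def preprocess_pair(img_a, img_b):
    # NIR images are consistently darker, so separate the pair by mean intensity.
    if img_a.mean() < img_b.mean():
        vein, palm = img_a, img_b
    else:
        vein, palm = img_b, img_a
    # Match mean and standard deviation of the vein image to the palmprint image.
    vein = vein.astype(np.float32)
    vein = (vein - vein.mean()) / (vein.std() + 1e-6) * palm.std() + palm.mean()
    vein = np.clip(vein, 0, 255).astype(np.uint8)
    # Edge-preserving smoothing of the noisy NIR image (Tomasi-Manduchi [10]).
    vein = cv2.bilateralFilter(vein, d=9, sigmaColor=25, sigmaSpace=9)
    return palm, vein
```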
4.2 Fusion Scheme
According to the generic framework proposed by Zhang et al.[19], image fusion schemes are composed of (a) multiscale decomposition, which maps source intensity images to more efficient representations; (b) an activity measure, which determines the quality of each input; (c) a coefficient grouping method, which determines whether or not cross-scale correlation is considered; (d) a coefficient combining method, in which a weighted sum of the source representations is calculated; and finally (e) consistency verification, which ensures that neighboring coefficients are calculated in a similar manner. As a domain-specific fusion scheme, the methods applied in this work can be regarded as instances of this framework. For each of the multiscale decomposition methods mentioned in Section 3, the following scheme is applied. Activity measure - a coefficient-based activity measure is used, which means that the absolute value of each coefficient is regarded as the activity measure at the corresponding scale, position and, where applicable, orientation. Coefficient combining method - no matter what kind of linear combination of coefficients is adopted, the basic calculation is a weighted sum; we apply the popular scheme proposed by Burt[20] to the high-frequency coefficients and simple averaging to the base-band approximation. Consistency verification - consistency verification is conducted in a blockwise fashion, and a majority filter is applied in a local 3 × 3 window when the choose-max operation is used in coefficient combination.
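As an illustration of this scheme, the sketch below performs a simplified version of the fusion using a standard (decimated) Haar wavelet decomposition from PyWavelets rather than the shift-invariant transform, with the absolute coefficient value as the activity measure, a plain choose-max rule (instead of Burt's weighted scheme) for the high-frequency subbands, and averaging of the base band; the blockwise consistency verification is omitted for brevity.

```python
# Simplified wavelet-domain fusion sketch: average the approximation band,
# choose-max on the detail subbands, then reconstruct.
import numpy as np
import pywt

def fuse_pair(palm, vein, wavelet="haar", levels=3):
    ca = pywt.wavedec2(palm.astype(np.float32), wavelet, level=levels)
    cb = pywt.wavedec2(vein.astype(np.float32), wavelet, level=levels)
    fused = [(ca[0] + cb[0]) / 2.0]                 # average the base band
    for a_bands, b_bands in zip(ca[1:], cb[1:]):
        fused_bands = []
        for a, b in zip(a_bands, b_bands):
            # Activity measure: absolute coefficient value; choose-max rule.
            fused_bands.append(np.where(np.abs(a) >= np.abs(b), a, b))
        fused.append(tuple(fused_bands))
    return pywt.waverec2(fused, wavelet)
```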
5 Experimental Results
To evaluate the proposed fusion scheme, we collected a database from 7 subjects. Three pairs of images were captured for each hand, producing a total of 84 images.

5.1 Subjective Fusion Quality Evaluation
The proposed scheme is applied to each pair of visible and NIR images, and the images fused by the four decomposition methods are examined subjectively. Fig. 2 demonstrates such an example. The morphological pyramid, although it produces the most obvious vein pattern in the fused images, sometimes introduces artifacts. The other three methods appear to perform similarly to the human eye, so an objective fusion quality evaluation is necessary for a more detailed comparison.
Fig. 2. Palmprint and vein pattern images captured using the self-designed device as well as the fused images: (a) visible image, (b) infrared image, (c) fused image with gradient pyramid, (d) fused image with morphological pyramid, (e) fused image with shift-invariant DWT, (f) fused image with Curvelet transform
5.2 Objective Fusion Quality Evaluation
Many fusion quality evaluation measures have been proposed[22][23] and we choose four of them for our application. The Root Mean Square Error (RMSE) between an input image A and the fused image F is defined in Eq. (1):

RMSE_{AF} = \sqrt{ \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} [A(i,j) - F(i,j)]^2 }   (1)

Mutual information (MI) statistically measures how much information the fused image F conveys about the input image A. Let p_A(x), p_F(y) and p_{AF}(x,y) denote the marginal distributions of A and F and their joint distribution, respectively. The mutual information between A and F is defined in Eq. (2):

MI_{AF} = \sum_x \sum_y p_{AF}(x,y) \log \frac{p_{AF}(x,y)}{p_A(x)\, p_F(y)}   (2)

The universal image quality index (UIQI) was proposed to evaluate the similarity between two images and is defined in Eq. (3). Its three components measure, respectively, the correlation, the closeness of mean luminance and the contrast similarity of the two images or image blocks A and F:

UIQI_{AF} = \frac{\sigma_{AF}}{\sigma_A \sigma_F} \cdot \frac{2\mu_A \mu_F}{\mu_A^2 + \mu_F^2} \cdot \frac{2\sigma_A \sigma_F}{\sigma_A^2 + \sigma_F^2} = \frac{4\sigma_{AF}\, \mu_A \mu_F}{(\mu_A^2 + \mu_F^2)\,(\sigma_A^2 + \sigma_F^2)}   (3)
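For reference, a minimal sketch (not the authors' code) of the three measures in Eqs. (1)-(3) is given below; the MI estimate uses a 256-bin joint histogram, and the UIQI is computed globally here, although the index is usually applied blockwise.

```python
# Fusion-quality measures: RMSE, mutual information and (global) UIQI.
import numpy as np

def rmse(a, f):
    return np.sqrt(np.mean((a.astype(np.float64) - f.astype(np.float64)) ** 2))

def mutual_information(a, f, bins=256):
    joint, _, _ = np.histogram2d(a.ravel(), f.ravel(), bins=bins)
    p_af = joint / joint.sum()
    p_a = p_af.sum(axis=1, keepdims=True)          # marginal of A
    p_f = p_af.sum(axis=0, keepdims=True)          # marginal of F
    nz = p_af > 0
    return np.sum(p_af[nz] * np.log(p_af[nz] / (p_a @ p_f)[nz]))

def uiqi(a, f):
    a, f = a.astype(np.float64), f.astype(np.float64)
    mu_a, mu_f = a.mean(), f.mean()
    var_a, var_f = a.var(), f.var()
    cov_af = ((a - mu_a) * (f - mu_f)).mean()
    return 4 * cov_af * mu_a * mu_f / ((mu_a**2 + mu_f**2) * (var_a + var_f))
```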
The general-purpose criteria mentioned above are usually applied to intensity images. However, in order to predict the performance of the proposed method in the context of biometrics, we apply these measures to a feature-level representation. In the field of palmprint recognition, the best algorithms reported in the literature are those based on binary textural features[7][8]. These methods seek to represent line-like patterns and have been proved capable of establishing a stable and powerful representation of the palmprint. We applied a multiscale version of the Orthogonal Line Ordinal Feature (OLOF) to the fused image as well as to the palmprint image as the feature-level representation, and the average results on the collected database are shown in Table 1. Textural features are not suitable for the vein image because its true features are sparse and false features are widespread. From Table 1, we find that the Curvelet-transform-based method clearly outperforms the other methods in that it retains most of the information available in the palmprint. The disadvantage of the Curvelet transform is that it takes much longer to compute the coefficients. We also adopted the average local entropy to estimate the information gain from the palmprint to the fused image; the result is shown in Fig. 3. The Curvelet-transform-based approach is the only one that conveys more information than the original palmprint representation. Thus we can safely draw the conclusion that the Curvelet-transform-based method results in a richer representation and is more faithful to the source representations.
Table 1. Objective fusion quality evaluation

Method                  RMSE_{F,Palm}   MI_{F,Palm}   UIQI_{F,Palm}   Time Consumption (s)
Gradient Pyramid        0.4194          0.3300        0.5880          0.5742
Morphological Pyramid   0.4539          0.2672        0.4800          1.2895
Shift-Invariant DWT     0.4313          0.3083        0.5583          1.9371
Curvelet Transform      0.3773          0.4102        0.7351          18.6979
Fig. 3. The average local entropy of the fused image with regard to the local window size (x-axis: local window size, 4-16; y-axis: average entropy), plotted for the original palmprint image and for the images fused with the Curvelet transform, morphological pyramid, gradient pyramid and shift-invariant DWT
The superior performance of the Curvelet transform mainly results from its built-in mechanism for representing line singularities. The gradient pyramid performs next best to the Curvelet transform, which suggests good edge-preservation capability but lower orientation resolution compared with the Curvelet transform. The morphological pyramid method introduces too many artifacts, which contributes most to its performance degradation.
6 Conclusion and Discussion
In this paper, we proposed the idea of multispectral palm image fusion for biometrics. This concept extends the visual capability of the camera and should improve the user-friendliness, security and, hopefully, the recognition performance of the original palmprint-based biometric system. Several image-fusion-based approaches are evaluated in the context of discriminative features. Experimental results suggest
that the Curvelet transform outperforms several other carefully selected methods in terms of well-established criteria. Further work along the proposed direction will include the following:
– Image collection from more spectra. The results presented in Section 5 have proved the superior performance of the Curvelet transform in combining palmprint and vein images. To explore the full potential of hand biometrics, we will improve the device to capture images from more spectra. Although line-like patterns are dominant in palmprint and vein images, they are not necessarily suitable for other skin components. Thus more fusion schemes need to be studied based on an examination of the meaningful physiological characteristics of each skin component.
– Recognition based on the fused image. Currently, the database is not large enough to produce convincing recognition performance. A well-defined database will be collected in the near future, and the proposed method will be tested and also compared with fusion at other levels.

Acknowledgments. This work is funded by research grants from the National Basic Research Program (Grant No. 2004CB318110), the Natural Science Foundation of China (Grant No. 60335010, 60121302, 60275003, 60332010, 69825105, 60605008) and the Chinese Academy of Sciences.
References

1. Maltoni, D., Maio, D., Jain, A.K., Prabhakar, S.: Handbook of Fingerprint Recognition. Springer, Heidelberg (2003)
2. Bolle, R., Pankanti, S., Jain, A.K.: Biometrics: Personal Identification in Networked Society. Springer, Heidelberg (1999)
3. Lin, C.-L., Fan, K.-C.: Biometric Verification Using Thermal Images of Palm-Dorsa Vein Patterns. IEEE Trans. on Circuits and Systems for Video Technology 14(2), 199–213 (2004)
4. Finger Vein Authentication Technology, http://www.hitachi.co.jp/Prod/comp/finger-vein/global/
5. Fujitsu Palm Vein Technology, http://www.fujitsu.com/global/about/rd/200506palm-vein.html
6. Igarashi, T., Nishino, K., Nayar, S.K.: The Appearance of Human Skin. Technical Report CUCS-024-05, Columbia University (2005)
7. Kong, A.W.-K., Zhang, D.: Competitive Coding Scheme for Palmprint Verification. In: Intl. Conf. on Pattern Recognition, vol. 1, pp. 520–523 (2004)
8. Sun, Z., Tan, T., Wang, Y., Li, S.Z.: Ordinal Palmprint Recognition for Personal Identification. In: Proc. of Computer Vision and Pattern Recognition (2005)
9. Wang, L., Leedham, G.: Near- and Far-Infrared Imaging for Vein Pattern Biometrics. In: Proc. of the IEEE Intl. Conf. on Video and Signal Based Surveillance (2006)
10. Tomasi, C., Manduchi, R.: Bilateral Filtering for Gray and Color Images. In: Proc. of Sixth Intl. Conf. on Computer Vision, pp. 839–846 (1998)
11. Anderson, R.R., Parrish, J.A.: Optical Properties of Human Skin. In: The Science of Photomedicine, ch. 6, Plenum Press, New York (1982)
12. Donoho, D.L., Duncan, M.R.: Digital Curvelet Transform: Strategy, Implementation and Experiments, available at http://www-stat.stanford.edu/~donoho/Reports/1999/DCvT.pdf
13. Candès, E.J., Donoho, D.L.: Curvelets – A Surprisingly Effective Nonadaptive Representation for Objects with Edges. In: Schumaker, L.L., et al. (eds.) Curves and Surfaces, Vanderbilt University Press, Nashville, TN (1999)
14. Curvelet website, http://www.curvelet.org/
15. Starck, J.L., Candès, E.J., Donoho, D.L.: The Curvelet Transform for Image Denoising. IEEE Transactions on Image Processing 11(6), 670–684 (2002)
16. Choi, M., Kim, R.Y., Nam, M.-R., Kim, H.O.: Fusion of Multispectral and Panchromatic Satellite Images Using the Curvelet Transform. IEEE Geoscience and Remote Sensing Letters 2(2) (2005)
17. Nencini, F., Garzelli, A., Baronti, S., Alparone, L.: Remote Sensing Image Fusion Using the Curvelet Transform. Information Fusion 8(2), 143–156 (2007)
18. Zhang, Q., Guo, B.: Fusion of Multisensor Images Based on Curvelet Transform. Journal of Optoelectronics Laser 17(9) (2006)
19. Zhang, Z., Blum, R.S.: A Categorization of Multiscale-Decomposition-Based Image Fusion Schemes with a Performance Study for a Digital Camera Application. Proc. of the IEEE 87(8), 1315–1326 (1999)
20. Burt, P.J., Kolczynski, R.J.: Enhanced Image Capture Through Fusion. In: IEEE Intl. Conf. on Computer Vision, pp. 173–182. IEEE Computer Society Press, Los Alamitos (1993)
21. Sadjadi, F.: Comparative Image Fusion Analysis. In: IEEE Computer Vision and Pattern Recognition, vol. 3 (2005)
22. Petrović, V., Cootes, T.: Information Representation for Image Fusion Evaluation. In: Intl. Conf. on Information Fusion, pp. 1–7 (2006)
23. Petrović, V., Xydeas, C.: Objective Image Fusion Performance Characterisation. In: Intl. Conf. on Computer Vision, pp. 1868–1871 (2005)
24. Wang, Z., Bovik, A.C.: A Universal Image Quality Index. IEEE Signal Processing Letters 9(3), 81–84 (2002)
Learning Gabor Magnitude Features for Palmprint Recognition

Rufeng Chu, Zhen Lei, Yufei Han, Ran He, and Stan Z. Li

Center for Biometrics and Security Research & National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
{rfchu,zlei,yfhan,rhe,szli}@nlpr.ia.ac.cn
http://www.cbsr.ia.ac.cn
Abstract. Palmprint recognition, as a new branch of biometric technology, has attracted much attention in recent years. Various palmprint representations have been proposed for recognition. The Gabor feature has been recognized as one of the most effective representations for palmprint recognition, and Gabor phase and orientation feature representations have been studied extensively. In this paper, we explore a novel Gabor magnitude feature-based method for palmprint recognition. The novelties are as follows: First, we propose an illumination normalization method for palmprint images to decrease the influence of illumination variations caused by different sensors and lighting conditions. Second, we propose to use Gabor magnitude features for palmprint representation. Third, we utilize AdaBoost learning to extract the most effective features and apply Linear Discriminant Analysis (LDA) to further reduce the dimension for palmprint recognition. Experimental results on three large palmprint databases demonstrate the effectiveness of the proposed method. Compared with state-of-the-art Gabor-based methods, our method achieves higher accuracy.
1 Introduction

Biometrics is an emerging technology that uses unique and measurable physical characteristics to identify a person. These physical attributes include the face, fingerprint, iris, palmprint, hand geometry, gait and voice. Biometric systems have been successfully used in many different application contexts, such as airports, passports and access control. Compared with other biometric technologies, palmprint recognition has a relatively short history and has received increasing interest in recent years. Various techniques have been proposed for palmprint recognition in the literature [1,2,3,4,5,6,7,8,9,10]. They can be classified into three main categories according to the palmprint feature representation. The first category is based on structural features, such as line features [1] and feature points [2]. The second is based on holistic appearance features, such as PCA [3], LDA [4] and KLDA [5]. The third is based on local appearance features, such as PalmCode [7], FusionCode [8], Competitive Code [9] and Ordinal Code [10]. Among these representations, the Gabor feature is one of the most efficient for palmprint recognition. Zhang et al. [7] proposed a texture-based method for online palmprint recognition, where a 2D Gabor filter was used to extract the
phase information (called PalmCode) from low-resolution palmprint images. Kong and Zhang [8] improved the efficiency of the PalmCode method by fusing the codes computed in four different orientations (called FusionCode); multiple Gabor filters are employed to extract phase information from a palmprint image. To further improve the performance, Kong and Zhang [9] proposed another Gabor-based method, namely the competitive code. The competitive coding scheme uses multiple 2D Gabor filters to extract orientation information from palm lines based on the winner-take-all competitive rule [9]. Combined with angular matching, promising performance has been achieved. Gabor phase and orientation features have been studied extensively in existing works [7,8,9]. In this paper, we attempt to explore a Gabor magnitude feature representation for palmprint recognition. First, to increase the generalization capacity and decrease the influence of illumination variations due to different sensors and lighting environments, we propose an illumination normalization method for palmprint images. Second, multi-scale, multi-orientation Gabor filters are used to extract Gabor magnitude features for palmprint representation. The original feature set is of high dimensionality, so we utilize AdaBoost learning to select the most effective features from the large candidate feature set, followed by Linear Discriminant Analysis (LDA) for further dimensionality reduction. Experimental results demonstrate the good performance of the proposed method. Compared with state-of-the-art Gabor-based methods, our method achieves higher accuracy. Moreover, the processing speed of the method is very fast: in the testing phase, the execution times for illumination normalization, feature extraction, projection from feature space to the LDA subspace, and matching are 30ms, 20ms, 1.5ms and 0.01ms per image, respectively. The rest of this paper is organized as follows. In Section 2, we introduce the illumination normalization method. In Section 3, we describe the Gabor magnitude features for palmprint representation. Section 4 gives the details of the statistical learning of the feature selection and the classifier. Experimental results and conclusions are presented in Section 5 and Section 6, respectively.
2 Illumination Normalization

Due to different sensors and lighting environments, palmprint images vary significantly, as shown in the top row of Fig. 1. A robust illumination preprocessing method helps to diminish the influence of illumination variations and increases the robustness of the recognition method. In general, an image I(x, y) is regarded as the product I(x, y) = R(x, y)L(x, y), where R(x, y) is the reflectance and L(x, y) is the illuminance at each point (x, y). The reflectance R depends on the albedo and surface normal and is the intrinsic representation of an object, while the luminance L is an extrinsic factor. Therefore, the illumination normalization problem reduces to obtaining R given an input image I. However, estimating the reflectance and the illuminance is an ill-posed problem. To solve it, a common assumption is that the illumination L varies slowly while the reflectance R can change abruptly. In our work, we introduce an anisotropic approach to compute the estimate of the illumination field L(x, y), which has been used
Fig. 1. Examples of the palmprint images from different sensors before and after illumination normalization. Top: Original palmprint images. Bottom: Corresponding processed palmprint images. The images are taken from the PolyU Palmprint Database [12] (first two columns), UST Hand Database [13] (middle two columns) and CASIA Palmprint Database [14] (last two columns).
for face recognition [11]. We then estimate the reflectance R(x, y) of the palmprint image as the ratio of the image I(x, y) and L(x, y). The luminance function is estimated as an anisotropically smoothed version of the original image, which can be obtained by minimizing the cost function

J(L) = \int_y \int_x \rho(x, y)\,(L - I)^2\, dx\, dy + \lambda \int_y \int_x (L_x^2 + L_y^2)\, dx\, dy   (1)

where the first term is the data term, while the second is a regularization term that imposes a smoothness constraint. The parameter λ controls the relative importance of the two terms, and ρ is Weber's local contrast between a pixel a and its neighbor b in either the x or y direction [11]. The space-varying permeability weight ρ(x, y) controls the anisotropic nature of the smoothing constraint. By the Euler-Lagrange equation, minimizing Eq. (1) reduces to solving the following partial differential equation (PDE):

L + \frac{\lambda}{\rho}\,(L_{xx} + L_{yy}) = I   (2)

The PDE approach is easy to implement. Through this regularized approach, the influence of the illumination variations is diminished, while the edge information of the palmprint image is preserved. Fig. 1 shows some examples from several different palmprint databases before and after processing with the method. In Section 5, we will further evaluate the effectiveness of the illumination normalization method on a large palmprint database.
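The sketch below is a rough illustration, not the authors' implementation, of how Eq. (1) can be minimized: it uses a per-pixel Weber-like contrast in place of the per-edge weights, plain Jacobi iterations on the discretized stationarity condition, and reflective borders, and it recovers the reflectance as R = I/L. The parameter values are arbitrary assumptions.

```python
# Simplified anisotropic illumination normalization: estimate the luminance L
# by iterative smoothing weighted by local contrast, then divide it out.
import numpy as np

def normalize_illumination(img, lam=5.0, iters=200, eps=1e-3):
    I = img.astype(np.float64) / 255.0 + eps
    # Weber-like local contrast: large near edges, small in smooth regions,
    # so the data term dominates there and smoothing is relaxed across lines.
    gy, gx = np.gradient(I)
    rho = np.sqrt(gx**2 + gy**2) / I + eps
    L = I.copy()
    for _ in range(iters):
        padded = np.pad(L, 1, mode="edge")
        neighbors = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                     padded[1:-1, :-2] + padded[1:-1, 2:])
        # Jacobi update of the stationarity condition of the discretized cost.
        L = (rho * I + lam * neighbors) / (rho + 4.0 * lam)
    R = I / (L + eps)            # reflectance estimate
    return R / R.max()           # rescale to [0, 1] for display
```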
3 Gabor Magnitude Features for Palmprint Representation

Gabor features exhibit desirable characteristics of spatial locality and orientation selectivity, and are optimally localized in the space and frequency domains. The Gabor kernels can be defined as follows [15]:

\psi_{\mu,v}(z) = \frac{\|k_{\mu,v}\|^2}{\sigma^2} \exp\Big(-\frac{\|k_{\mu,v}\|^2 \|z\|^2}{2\sigma^2}\Big) \Big[\exp(i\, k_{\mu,v} \cdot z) - \exp\Big(-\frac{\sigma^2}{2}\Big)\Big]   (3)
where μ and v define the orientation and scale of the Gabor kernels respectively, z = (x, y), and the wave vector k_{μ,v} is defined as

k_{\mu,v} = k_v e^{i\phi_\mu}   (4)

where k_v = k_{max}/f^v, k_{max} = π/2, f = √2, and φ_μ = πμ/8. The Gabor kernels in Eq. (3) are all self-similar, since they can be generated from one filter, the mother wavelet, by scaling and rotation via the wave vector k_{μ,v}. Each kernel is a product of a Gaussian envelope and a complex plane wave; the first term in the square brackets in Eq. (3) determines the oscillatory part of the kernel, and the second term compensates for the DC value. Hence, a bank of Gabor filters is generated from a set of scales and rotations. In our experiments, we use Gabor kernels at five scales v ∈ {0, 1, 2, 3, 4} and eight orientations μ ∈ {0, 1, 2, 3, 4, 5, 6, 7} with the parameter σ = 2π to derive the Gabor representation by convolving the palmprint image with the corresponding Gabor kernels. Let I(x, y) be the gray-level distribution of a palmprint image; the convolution of the image I and a Gabor kernel ψ_{μ,v} is defined as

F_{\mu,v}(z) = I(z) * \psi_{\mu,v}(z)   (5)

where z = (x, y) and * denotes the convolution operator. The Gabor magnitude feature is defined as

M_{\mu,v}(z) = \sqrt{\mathrm{Im}(F_{\mu,v}(z))^2 + \mathrm{Re}(F_{\mu,v}(z))^2}   (6)

where Im(·) and Re(·) denote the imaginary and real part, respectively. For each pixel position (x, y) in the palmprint image, 40 Gabor magnitudes are calculated to form the feature representation.
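A minimal sketch (not the authors' code) of Eqs. (3)-(6) follows: a bank of 5 scales × 8 orientations is built and each kernel is convolved with the palmprint image, keeping the magnitude of the complex response. The kernel support (33 × 33) is an illustrative assumption.

```python
# Gabor magnitude feature bank: 40 filters (5 scales x 8 orientations).
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(mu, v, sigma=2 * np.pi, size=33):
    k = (np.pi / 2) / (np.sqrt(2) ** v)             # k_v = k_max / f^v
    phi = np.pi * mu / 8.0                           # phi_mu
    kx, ky = k * np.cos(phi), k * np.sin(phi)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    ksq, zsq = k * k, x * x + y * y
    envelope = (ksq / sigma**2) * np.exp(-ksq * zsq / (2 * sigma**2))
    carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma**2 / 2)  # DC-free
    return envelope * carrier

def gabor_magnitude_features(img):
    img = img.astype(np.float64)
    feats = []
    for v in range(5):                               # five scales
        for mu in range(8):                          # eight orientations
            resp = fftconvolve(img, gabor_kernel(mu, v), mode="same")
            feats.append(np.abs(resp))               # Eq. (6): Gabor magnitude
    return np.stack(feats, axis=0)                   # 40 x H x W feature maps
```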
4 Statistical Learning of Best Features and Classifiers

The whole set of Gabor magnitude features is of high dimension. For a palmprint image of size 128 × 128, there are about 655,360 features in total. Not all of them are useful or equally useful, and some of them may even harm performance. A straightforward implementation is therefore both computationally expensive and inefficient. In this work, we first utilize AdaBoost learning to select the most informative features and then apply linear discriminant analysis (LDA) on the selected Gabor magnitude features for further dimension reduction.

4.1 Feature Selection by AdaBoost Learning

Boosting can be viewed as a stage-wise approximation to an additive logistic regression model using the Bernoulli log-likelihood as a criterion [16]. AdaBoost is a typical instance of boosting learning; it has been successfully used for face detection [17] as an effective feature selection method. There are several different versions of the AdaBoost algorithm [16], such as Discrete AdaBoost, Real AdaBoost, LogitBoost and Gentle AdaBoost. In this work, we apply Gentle AdaBoost learning to select the most discriminative Gabor magnitude features and remove useless and redundant features. Gentle AdaBoost is a modified version of the Real AdaBoost algorithm and is defined in Fig. 2.
Input: a sequence of N weighted examples {(x_1, y_1, w_1), (x_2, y_2, w_2), ..., (x_N, y_N, w_N)} and an integer T specifying the number of iterations.
1. Initialize: w_i = 1/N, i = 1, 2, ..., N; F(x) = 0.
2. For t = 1, ..., T:
   (a) Fit the regression function f_t(x) by weighted least squares of y_i to x_i with weights w_i.
   (b) Update F(x) ← F(x) + f_t(x).
   (c) Update w_i ← w_i e^{−y_i f_t(x_i)} and renormalize.
3. Output the classifier sign[F(x)] = sign[Σ_{t=1}^{T} f_t(x)].

Fig. 2. Algorithm of Gentle AdaBoost
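A compact sketch of the procedure in Fig. 2 is given below, using single-feature regression stumps f(x) = a·[x_j > θ] + b fitted by weighted least squares, so that each boosting round effectively selects one feature; the threshold grid (32 quantiles per feature) is an illustrative assumption and the code is not the authors' implementation.

```python
# Gentle AdaBoost with single-feature regression stumps.
# X: (n_samples, n_features) array, y: labels in {-1, +1}.
import numpy as np

def gentle_adaboost(X, y, T=100, n_thresh=32):
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    stumps = []                                     # (feature, theta, a, b)
    for _ in range(T):
        best = None
        for j in range(d):
            xj = X[:, j]
            for theta in np.quantile(xj, np.linspace(0.02, 0.98, n_thresh)):
                m = xj > theta
                wm, wn = w[m].sum(), w[~m].sum()
                if wm == 0 or wn == 0:
                    continue
                # Weighted least-squares fit of y on the indicator [x_j > theta].
                b = np.dot(w[~m], y[~m]) / wn       # prediction on the low side
                a = np.dot(w[m], y[m]) / wm - b     # offset on the high side
                f = np.where(m, a + b, b)
                err = np.dot(w, (y - f) ** 2)
                if best is None or err < best[0]:
                    best = (err, j, theta, a, b, f)
        _, j, theta, a, b, f = best
        stumps.append((j, theta, a, b))
        w *= np.exp(-y * f)                         # re-weight and renormalize
        w /= w.sum()
    return stumps

def predict(stumps, X):
    F = np.zeros(X.shape[0])
    for j, theta, a, b in stumps:
        F += np.where(X[:, j] > theta, a + b, b)
    return np.sign(F)
```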
Empirical evidence suggests that Gentle AdaBoost is a more conservative algorithm that has similar performance to both the Real AdaBoost and LogitBoost algorithms, and often outperforms them both, especially when stability is a crucial issue [16]. While the above AdaBoost procedure essentially learns a two-class classifier, we convert the multi-class problem into a two-class one using the idea of intra- and extra-class differences [18]. Here, however, the difference data are derived from each pair of Gabor magnitude features at corresponding locations rather than from the images: the positive examples are derived from pairs of intra-personal differences and the negative ones from pairs of extra-personal differences. In this work, each weak classifier in AdaBoost learning is constructed from a single Gabor magnitude feature; the AdaBoost learning algorithm can therefore be regarded as a feature selection algorithm [17,19]. With the selected feature set, a range of statistical methods can be used to construct an effective classifier. In the following, we introduce LDA for further dimension reduction and use the cosine distance for palmprint recognition, expecting it to achieve better performance.

4.2 LDA with Selected Features

LDA is a well-known method for feature extraction and dimension reduction that maximizes the extra-class distance while minimizing the intra-class distance. Let the sample set be X = {x_1, x_2, ..., x_n}, where x_i is the feature vector of the i-th sample. The within-class scatter matrix S_w and the between-class scatter matrix S_b are defined as follows:

S_w = \sum_{i=1}^{L} \sum_{x_j \in C_i} (x_j - m_i)^T (x_j - m_i)   (7)

S_b = \sum_{i=1}^{L} n_i (m_i - m)^T (m_i - m)   (8)

where m_i = \frac{1}{n_i} \sum_{x_j \in C_i} x_j is the mean vector of class C_i, and m = \frac{1}{n} \sum_{i=1}^{L} \sum_{x_j \in C_i} x_j is the global mean vector.
LDA aims to find the projection matrix W that maximizes the following objective function:

J = \frac{tr(W^T S_b W)}{tr(W^T S_w W)}   (9)

The optimal projection matrix W_opt can be obtained by solving the following generalized eigenvalue problem:

S_w^{-1} S_b W = W \Lambda   (10)

where Λ is a diagonal matrix whose diagonal elements are the eigenvalues of S_w^{-1} S_b. Given two input vectors x_1 and x_2, their subspace projections are calculated as v_1 = W^T x_1 and v_2 = W^T x_2, and the following cosine distance is used for matching:

H(v_1, v_2) = \frac{v_1 \cdot v_2}{\|v_1\|\, \|v_2\|}   (11)

where ||·|| denotes the norm operator. In the test phase, the projections v_1 and v_2 are computed from two input vectors x_1 and x_2, one for the input palmprint image and the other for an enrolled palmprint image. By comparing the score H(v_1, v_2) with a threshold, a decision can be made as to whether x_1 and x_2 belong to the same person.
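The following sketch illustrates Eqs. (7)-(11) on AdaBoost-selected feature vectors: the scatter matrices, the generalized eigen-problem solved with SciPy, and the cosine similarity used for matching. The regularization added to S_w is an assumption introduced only for numerical stability.

```python
# LDA on selected features plus cosine-distance matching.
import numpy as np
from scipy.linalg import eigh

def fit_lda(X, labels, n_components, reg=1e-4):
    classes = np.unique(labels)
    m = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)               # Eq. (7)
        Sb += len(Xc) * np.outer(mc - m, mc - m)    # Eq. (8)
    # Generalized eigen-problem S_b w = lambda S_w w, cf. Eq. (10).
    evals, evecs = eigh(Sb, Sw + reg * np.eye(d))
    order = np.argsort(evals)[::-1][:n_components]
    return evecs[:, order]                           # projection matrix W

def cosine_score(W, x1, x2):
    v1, v2 = W.T @ x1, W.T @ x2
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))  # Eq. (11)
```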
5 Experiments

To evaluate the performance of the proposed palmprint recognition method, three large palmprint databases are adopted: the PolyU Palmprint Database [12], the UST Hand Image Database [13] and the CASIA Palmprint Database [14]. These databases are among the largest publicly available. We train the classifiers and evaluate the effectiveness of the illumination normalization method on the PolyU Palmprint Database. To explore the generalization of the classifier, we further evaluate the performance of the proposed palmprint recognition method on the other two databases and compare it with state-of-the-art Gabor-based recognition methods [7,8,9].

5.1 Evaluation on the PolyU Palmprint Database

The PolyU Palmprint Database [12] contains 7752 images corresponding to 386 different palms. Around twenty samples from each of these palms were collected in two sessions, with some illumination variations between the sessions. We select 4000 images from 200 different palms collected in two sessions as the testing set, with 20 images per palm. The remaining 3752 images from 186 different palms are used for training. All input palmprint images are normalized to 128 × 128 using the method proposed in [7]. In the training phase, the set of positive samples was derived from intra-class pairs of Gabor features and the negative set from extra-class pairs. Two Gabor magnitude feature-based classifiers are trained: one is an AdaBoost-based classifier and the other is an LDA-based classifier using AdaBoost-selected features. These two methods are named "GMBoost" and "GMBoostLDA", respectively. Moreover, to
evaluate the effectiveness of the illumination normalization method, we also train two classifiers and test their performance on palmprint images without illumination normalization. The first two classifiers are trained using palmprint images without illumination normalization; 882 most effective features are selected by the AdaBoost procedure from the original 655,360 Gabor magnitude features, with a training error rate of zero on the training set. For LDA, the retained feature dimension is 181, which is optimal on the test set. The other two classifiers are trained using palmprint images with illumination normalization; 615 most effective features are selected, again with a training error rate of zero on the training set. The optimal feature dimension for LDA, found on the test set, is 175. The first 5 most effective features learned by Gentle AdaBoost are shown in Fig. 3, in which the position, scale and orientation of the corresponding Gabor kernels are indicated on an illumination-normalized palmprint image.
Fig. 3. The first 5 features and associated Gabor kernels selected by AdaBoost learning
In the testing phase, we match palmprints from different sessions: each image from the first session is matched with all the images in the second session. This generates 20,000 intra-class (positive) and 380,000 extra-class (negative) pairs. Fig. 4 shows the ROC curves derived from the scores of the intra- and extra-class pairs. From the results, we can see that all these Gabor magnitude feature-based methods achieve good verification performance. The "GMBoostLDA" methods perform better than the "GMBoost" methods, which indicates that applying LDA to AdaBoost-selected features is a good scheme for palmprint recognition. Among these classifiers, "GMBoostLDA with illumination normalization" performs the best, which demonstrates the effectiveness of the proposed illumination normalization method. The processing speed of the proposed method is very fast. In the testing phase, only the features selected by AdaBoost learning need to be extracted with the Gabor filters, which greatly reduces the computational cost. On a P4 3.0GHz PC, the execution times for illumination normalization, feature extraction, projection from feature space to the LDA subspace, and matching are 30ms, 20ms, 1.5ms and 0.01ms per image, respectively. In the next subsection, we further evaluate the performance of our best classifier on the other two databases to explore its generalization capacity and compare it with state-of-the-art Gabor-based recognition methods.

5.2 Evaluation on the UST Hand Image Database and the CASIA Palmprint Database

The UST Hand Image Database [13] contains 5,660 hand images corresponding to 566 different palms, 10 images per palm.
Fig. 4. Verification performance comparison on PolyU Palmprint Database
All images are captured using a digital camera with a resolution of 1280 × 960 pixels and 24-bit color. In total, 25,470 intra-class (genuine) and 15,989,500 extra-class (impostor) samples are generated from the UST database. The CASIA Palmprint Database [14] contains 4,796 images corresponding to 564 different palms, with 8 to 10 samples per palm. All images are captured using a CMOS camera with a resolution of 640 × 480 pixels and 24-bit color. In total, 18,206 intra-class (genuine) and 11,480,204 extra-class (impostor) samples are generated from the test set. Fig. 5 shows the ROC curves derived from the scores of the intra- and extra-class samples. According to the ROC curves, the performance of the proposed method is better than that of the state-of-the-art Gabor-based recognition methods on both databases. Note that our classifier is trained on the PolyU database and tested on the UST and CASIA palmprint databases. Two accuracy measures are computed for further comparison in Table 1. One is the equal error rate (EER) and the other is d′ (d-prime) [20], a statistical measure of how well a biometric system can discriminate between different individuals, defined as
Fig. 5. Comparative results with state-of-the-art Gabor-based recognition methods. Left: ROC curves on UST Hand Image Database. Right: ROC curves on CASIA Palmprint Database.
d' = \frac{|m_1 - m_2|}{\sqrt{(\delta_1^2 + \delta_2^2)/2}}   (12)

where m_1 and δ_1 denote the mean and standard deviation of the intra-class (genuine) distribution, while m_2 and δ_2 denote those of the extra-class (impostor) distribution. The larger the d′ value, the better a biometric system performs [20].

Table 1. Comparison of accuracy measures for different classifiers on the UST and CASIA databases

Algorithm                  UST: EER (%)   UST: d'   CASIA: EER (%)   CASIA: d'
Palm Code (θ = 45°) [7]    1.77           3.39      0.95             3.58
Fusion Code [8]            0.75           3.40      0.57             3.80
Competitive Code [9]       0.38           3.51      0.19             3.81
Proposed method            0.35           5.36      0.17             5.57
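A minimal sketch (not the authors' code) of the two accuracy measures used in Table 1, computed from arrays of genuine (intra-class) and impostor (extra-class) matching scores, with higher scores indicating better matches:

```python
# d-prime (Eq. 12) and equal error rate from score distributions.
import numpy as np

def d_prime(genuine, impostor):
    return abs(genuine.mean() - impostor.mean()) / np.sqrt(
        (genuine.var() + impostor.var()) / 2.0)

def equal_error_rate(genuine, impostor, n_thresh=1000):
    thresholds = np.linspace(min(impostor.min(), genuine.min()),
                             max(impostor.max(), genuine.max()), n_thresh)
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(genuine < t).mean() for t in thresholds])    # false rejects
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0
```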
From the experimental results, we can see that both the EER and the discriminability index d′ of the proposed method are the best among the compared classifiers. This also suggests the good generalization capacity of the proposed method, which works well on different types of palmprint images.
6 Conclusions

In this paper, we have proposed a Gabor magnitude feature-based learning method for palmprint recognition. To decrease the influence of illumination variations, we introduced an illumination normalization method for palmprint images. Then, multi-scale, multi-orientation Gabor filters are used to extract Gabor magnitude features. Based on the Gabor magnitude features and statistical learning, a powerful classifier is constructed. The experimental results show that Gabor magnitude features combined with statistical learning can be powerful enough for palmprint recognition. Compared with state-of-the-art Gabor-based methods, our method achieves better performance on two large palmprint databases.
Acknowledgements

This work was supported by the following funding resources: National Natural Science Foundation Project #60518002, National Science and Technology Supporting Platform Project #2006BAK08B06, National 863 Program Projects #2006AA01Z192 and #2006AA01Z193, Chinese Academy of Sciences 100 people project, and the AuthenMetric Collaboration Foundation.
References

1. Zhang, D., Shu, W.: Two novel characteristics in palmprint verification: Datum point invariance and line feature matching. Pattern Recognition 32, 691–702 (1999)
2. Duta, N., Jain, A., Mardia, K.: Matching of palmprint. Pattern Recognition Letters 23, 477–485 (2001)
3. Lu, G., Zhang, D., Wang, K.: Palmprint recognition using eigenpalms features. Pattern Recognition Letters 24, 1463–1467 (2003)
4. Wu, X., Zhang, D., Wang, K.: Fisherpalms based palmprint recognition. Pattern Recognition Letters 24, 2829–2838 (2003)
5. Wang, Y., Ruan, Q.: Kernel Fisher discriminant analysis for palmprint recognition. In: Proceedings of the International Conference on Pattern Recognition, vol. 4, pp. 457–461 (2006)
6. Kumar, A., Shen, H.: Palmprint identification using palmcodes. In: ICIG 2004. Proceedings of the Third International Conference on Image and Graphics, Hong Kong, China, pp. 258–261 (2004)
7. Zhang, D., Kong, W., You, J., Wong, M.: On-line palmprint identification. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 1041–1050 (2003)
8. Kong, W., Zhang, D.: Feature-Level Fusion for Effective Palmprint Authentication. In: Zhang, D., Jain, A.K. (eds.) ICBA 2004. LNCS, vol. 3072, pp. 761–767. Springer, Heidelberg (2004)
9. Kong, W., Zhang, D.: Competitive coding scheme for palmprint verification. In: Proceedings of the International Conference on Pattern Recognition, vol. 1, pp. 520–523 (2004)
10. Sun, Z., Tan, T., Wang, Y., Li, S.: Ordinal palmprint representation for personal identification. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 279–284. IEEE Computer Society Press, Los Alamitos (2005)
11. Gross, R., Brajovic, V.: An image preprocessing algorithm for illumination invariant face recognition. In: Proc. 4th International Conference on Audio- and Video-Based Biometric Person Authentication, Guildford, UK, pp. 10–18 (2003)
12. PolyU Palmprint Database, http://www.comp.polyu.edu.hk/biometrics/
13. UST Hand Image Database, http://visgraph.cs.ust.hk/biometrics/Visgraph web/index.html
14. CASIA Palmprint Database, http://www.cbsr.ia.ac.cn/
15. Daugman, J.G.: Complete discrete 2D Gabor transforms by neural networks for image analysis and compression. IEEE Trans. ASSP 36, 1169–1179 (1988)
16. Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting. Technical report, Department of Statistics, Sequoia Hall, Stanford University (1998)
17. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii. IEEE Computer Society Press, Los Alamitos (2001)
18. Moghaddam, B., Nastar, C., Pentland, A.: A Bayesian similarity measure for direct image matching. Media Lab Tech Report No. 393, MIT (1996)
19. Shan, S., Yang, P., Chen, X., Gao, W.: AdaBoost Gabor Fisher classifier for face recognition. In: Proceedings of the IEEE International Workshop on Analysis and Modeling of Faces and Gestures, Beijing, China, pp. 279–292. IEEE Computer Society Press, Los Alamitos (2005)
20. Daugman, J., Williams, G.: A Proposed Standard for Biometric Decidability. In: Proc. CardTech/SecureTech Conference, pp. 223–234 (1996)
Sign Recognition Using Constrained Optimization

Kikuo Fujimura¹ and Lijie Xu²

¹ Honda Research Institute USA
² Ohio State University
Abstract. Sign recognition has been one of the challenging problems in computer vision for years. For many sign languages, signs formed by two overlapping hands are a part of the vocabulary. In this work, an algorithm for recognizing such signs with overlapping hands is presented. Two formulations are proposed for the problem. For both approaches, the input blob is converted to a graph representing the finger and palm structure which is essential for sign understanding. The first approach uses a graph subdivision as the basic framework, while the second one casts the problem to a label assignment problem and integer programming is applied for finding an optimal solution. Experimental results are shown to illustrate the feasibility of our approaches.
1 Introduction
There have been many approaches to sign recognition [1]. Among the many elements important for sign recognition, two of basic importance are hand tracking and hand shape analysis. For hand detection, many approaches use color or motion information [6,7]. It turns out, however, that hand tracking using color is a non-trivial task except in well-controlled environments, as various lighting changes pose challenging conditions [3,10,15,17]. Making use of special equipment such as data gloves is one solution to this difficulty. When the hand is given in a magnified view, hand shape analysis becomes a feasible problem [16], although body and arm posture information might be lost. Successful results have been reported using multiple cameras to extract 3D information [9,13]. Even though redundancy makes the problem more approachable, handling the bulk of data in real time poses another challenge for efficient computation; model fitting to 3D data is known to be computationally demanding. Stereo vision is a popular choice in many tasks, including man-machine interaction and robotics [9]. However, it still fails to provide sufficient detail in depth maps for some tasks, such as counting fingers in a given hand posture, although stereo images have been useful in applications such as large-scale gesture recognition, for example pointing motions. In contrast, coded light or recent techniques such as space-time stereo generally provide a depth resolution much better than traditional stereo. Such a device is expected to provide a high-quality image
sequence even for posture analysis. Time-of-flight sensors also provide a resolution that is sufficient for hand shape analysis [5,12,13,18]. For our work, we opt to use this type of device as a suitable alternative to stereo. Much of the work in hand shape and motion analysis deals primarily with a single hand shape or two non-overlapping hands [4,8,11,13,16]. Whereas the analysis of independent hand shapes is a basic task, the analysis of overlapping hands presents another level of challenge in gesture recognition. Natural gestures (including sign languages, as in Fig. 1) often use signs formed by two overlapping hands. Motivated in this manner, we present approaches for analyzing various hand patterns formed by two overlapping hands. In particular, we focus on how to separate the two hands from the single blob representing them. In addition, we also present a real-time, non-intrusive sign recognition system that can recognize signs made by one hand, two non-overlapping hands, and two overlapping hands. The rest of the paper is organized as follows. In Section 2, we outline our solution approaches. Sections 3 and 4 present two formulations of the problem, and Section 5 contains a description of the entire system. Experimental results are presented in Section 6, and Section 7 contains concluding remarks.
Fig. 1. Examples of signs formed by overlapping hands
2 Flow of the Algorithm
We present two algorithms to address the problem of overlapping hands. The two algorithms share a common basic part, namely, steps 1-3 of the following procedure.
1. Extracting the arm blob(s) from the image
2. Overlapping hand detection and palm detection
3. Graph formation
4. Separation of overlapped hands (two methods are proposed)
5. Sign recognition
Hand segmentation is an important step for most sign recognition methods. In this work, we use depth streams for hand blob segmentation. Even though
it is conceptually simple to segment hands in depth images, especially when the hands are in front of the body, it still requires work to identify the palm and finger parts from the arm blob. This part is described in Section 5. (When the blob is determined to contain only one hand, steps 3 and 4 are skipped.) For the second step, we make use of the observation that for overlapping hands, the blob's area is larger than that of a blob containing a single hand, and the convex hull of the blob is significantly larger than the blob itself. Next, a graph structure is extracted from the hand blob. For this operation, the medial axis transform (roughly corresponding to the skeleton of the blob) is generated from the blob by a thinning operation (Fig. 2). The skeleton represents the general shape of the hand structure, but it has some short fragmental branches that do not represent fingers. Also, for two fingers that are connected as in Fig. 1(c), the connecting point may not become a node (branching point) of the skeleton. Thus, we create an augmented graph (which we call G hereafter) from the skeleton. This is accomplished by removing short fragments (e.g., the 'dangling branch' in Fig. 2) and dividing long edges into shorter pieces (Fig. 2), in particular at high-curvature points. Once the connectivity graph G is formed, the remaining task is to determine the parts of G that come from the right and the left hands, respectively.
Fig. 2. Example of the skeleton structure (left). Example of the augmented graph structure (right).
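The sketch below is a rough illustration, not the authors' implementation, of the graph-formation step: the hand blob is thinned to a skeleton, and endpoints and branch points are found by counting skeleton neighbors, giving candidate nodes of the connectivity graph G. Pruning of short fragments and subdivision of long edges are omitted for brevity.

```python
# Skeletonize the segmented blob and locate endpoints / branch points.
import numpy as np
from skimage.morphology import skeletonize
from scipy.ndimage import convolve

def skeleton_graph_nodes(blob_mask):
    """blob_mask: boolean array, True inside the segmented arm blob."""
    skel = skeletonize(blob_mask)
    # Count 8-connected skeleton neighbors of each skeleton pixel.
    kernel = np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]])
    neighbors = convolve(skel.astype(np.uint8), kernel, mode="constant")
    endpoints = np.argwhere(skel & (neighbors == 1))   # finger tips, wrist
    branchings = np.argwhere(skel & (neighbors >= 3))  # graph branch nodes
    return skel, endpoints, branchings
```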
Two methods are presented. The first algorithm is based on a tree search paradigm, while the second one is formulated by using constrained optimization.
3 Tree Search Framework
The first algorithm for hand disambiguation uses tree search. Given an augmented graph G, we form two subgraphs H1 and H2 such that G = H1 ∪ H2. Moreover, each Hi is required to satisfy a few conditions so that it represents a proper hand shape. A connected subgraph H1 (H2) is formed such that it contains the palm corresponding to the right (left) hand, respectively. Our strategy is to generate H1 and H2 systematically and pick the pair that best matches our definition of 'hands'. The outline of the algorithm is summarized as follows.
1. Create a DAG (directed acyclic graph) from G.
2. Separate the tree G into two parts:
   (a) Do a DFS (depth-first search) from the source node (top node of the tree).
   (b) Let H1 be a connected subgraph formed by the scan. This forms a possible hand structure.
   (c) For each H1, reduce the graph G to obtain the remaining part, which, in turn, forms the second hand structure H2.
3. Evaluate the given hand structure pair H1 and H2.
4. The pair with the best evaluation is selected as the final answer.
Fig. 3. Example of tree search
3.1 Evaluation Function
After each scan of the tree, we are left with a pair H1 and H2. The evaluation of this pair rests on a few criteria.
1. Each of H1 and H2 must be connected. This comes from the natural requirement that each hand is a connected entity.
2. The distance of any part of H from the palm must be within a certain limit. This discourages fingers that are longer than a certain limit.
3. For two segments within a subgraph, the angle formed by the segments cannot be small. This condition discourages fingers that bend at an extremely sharp angle. Likewise, fingers bending outward are discouraged.
4. A branching node in H must stay within a certain distance of the palm. This condition discourages forming a finger that branches at its tip.
The above criteria are encoded in the decision process, and the pairs with the best evaluations are considered.
4 Optimization Framework
The second framework reduces the problem to a labeling problem, which is formulated as the following optimization problem. We continue to use the graph G. A segment (or edge) si in G is to be assigned a label by a function f (that is, f(si) is either Right or Left). Each si has an estimate of its likelihood of having
the labeling f(si). This comes from a heuristic in sign estimation; for example, if s is near the left-hand palm, its likelihood of being a part of the right hand is relatively low. For this purpose, a non-negative cost function c(s, f(s)) is introduced to represent this likelihood. Further, we consider two neighboring segments si and sj to be related, in the sense that we would like si and sj to have the same label. Each edge e in graph G has a non-negative weight indicating the strength of the relation. Moreover, certain pairs of labels are more similar than others, so we impose a distance d(·) on the label set; larger distance values indicate less similarity. The total cost of a labeling f is given by

Q(f) = \sum_{s \in S} c(s, f(s)) + \sum_{e = (s_i, s_j)} w_e\, d(f(s_i), f(s_j))
In our problem, the assignment table of Fig. 4 is to be completed, where the binary variable A_ij indicates whether segment s_i belongs to hand p_j. For c(i, j), the Euclidean distance from segment s_i to the palm of hand j is used. Since each (thin) segment s_i belongs to only one hand, Σ_j A_ij = 1 holds. In addition to this constraint, a number of related constraints are considered:
1. Neighboring segments should have a similar label.
2. Thick parts represent more than one finger.
3. Thin parts represent one finger.
Fig. 4. Assignment table
It turns out that this is an instance of the Uniform Labeling Problem, which can be expressed as the following integer program. We introduce an auxiliary variable Z_e for each edge e to express the distance between the labels at its endpoints, and Z_ej to express the absolute value |A_pj − A_qj|. Following Kleinberg and Tardos [2], we can rewrite our optimization problem as follows:

\min \; \sum_{i=1}^{N} \sum_{j=1}^{M} c(i,j)\, A_{ij} + \sum_{e \in E} w_e Z_e

subject to

\sum_j A_{ij} = 1,   i = 1, 2, \dots, N, if the i-th segment is thin;
\sum_j A_{ij} = 2,   i = 1, 2, \dots, N, if the i-th segment is thick;
Z_e = \frac{1}{2} \sum_j Z_{ej},   e \in E;
Z_{ej} \ge A_{pj} - A_{qj},   e = (p, q),   j = 1, \dots, M;
Z_{ej} \ge A_{qj} - A_{pj},   e = (p, q),   j = 1, \dots, M;
A_{ij} \in \{0, 1\},   i = 1, 2, \dots, N,   j = 1, 2, \dots, M;
length(s_1) + length(s_2) + \dots < MAXLEN, for each hand.

Here, c(i, j) represents the cost (penalty) for segment i to belong to hand j, where j can be either Right or Left. If segment i is far from the Left palm, then c(i, Left) is large. For time-varying image sequences, the previous value of c(i, Left) may be used. This is a very powerful factor, assuming palm locations are correctly detected. The terms involving Z_e and Z_ej come from constraint (1). The weight w_e represents the strength of the relation between graph nodes a and b, where e is the edge connecting the segments a and b. For example, if a and b make a sharp turn, w_e is a high number, since a and b are likely to belong to different hands. The weight is given by w_e = e^{-α d_e}, where d_e is the depth difference between two adjacent segments and α is selected based on experiments. As an additional constraint, A_ij + A_kj < 2 holds for all j if segments s_i and s_k make a sharp turn. If a segment is thick, an additional constraint may be added. Finally, the total length of fingers must not exceed a certain limit. In general, solving an integer program optimally is NP-hard. However, we can relax the above problem to a linear program with A_ij ≥ 0, which can be solved efficiently using a publicly available library. Kleinberg and Tardos [2] describe a method for rounding the fractional solution so that the expected objective function Q(f) is within a factor of 2 of the optimum. In our experiments, we find that the relaxed linear program always returns an integer solution.
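The sketch below sets up and solves the relaxed uniform-labeling LP with SciPy; it is an illustration under stated assumptions, not the authors' code. Here cost is an (N, 2) array of c(i, j) values, edges is a list of (p, q, w_e) tuples over the graph, and seg_target[i] is 1 for thin segments and 2 for thick ones; the MAXLEN and sharp-turn constraints are omitted for brevity.

```python
# LP relaxation of the uniform labeling problem for hand separation.
import numpy as np
from scipy.optimize import linprog

def label_segments(cost, edges, seg_target):
    N, M = cost.shape                       # M = 2 labels: Right, Left
    E = len(edges)
    n_vars = N * M + E * M                  # variables: A_ij, then Z_ej
    c = np.zeros(n_vars)
    c[:N * M] = cost.ravel()
    for e, (_, _, w) in enumerate(edges):   # w_e * (1/2) * sum_j Z_ej
        c[N * M + e * M: N * M + (e + 1) * M] = 0.5 * w
    # Equality constraints: sum_j A_ij = 1 (thin) or 2 (thick).
    A_eq = np.zeros((N, n_vars))
    for i in range(N):
        A_eq[i, i * M:(i + 1) * M] = 1.0
    b_eq = np.asarray(seg_target, dtype=float)
    # Inequalities: +/-(A_pj - A_qj) - Z_ej <= 0.
    A_ub, b_ub = [], []
    for e, (p, q, _) in enumerate(edges):
        for j in range(M):
            for sign in (+1.0, -1.0):
                row = np.zeros(n_vars)
                row[p * M + j], row[q * M + j] = sign, -sign
                row[N * M + e * M + j] = -1.0
                A_ub.append(row)
                b_ub.append(0.0)
    bounds = [(0.0, 1.0)] * (N * M) + [(0.0, None)] * (E * M)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return np.rint(res.x[:N * M].reshape(N, M))   # rounded assignment A_ij
```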
5 Sign Recognition System and Components

5.1 Palm Detection
To build an entire system for sign recognition, the hand shape analysis module has to be integrated with many other modules such as a motion tracker and pattern classifier. Here, a description is given for modules that are highly related to shape analysis, namely, finger and palm detection. For other related modules, see [18]. To locate fingers within the arm blob, each branch of the skeleton is traced at a regular interval, while measuring width at each position. When the width is smaller than a certain limit, we consider these pixels to belong to a finger. For palm detection, the following steps work well for our experiment.
1. Trace all branches of the skeleton. At a certain interval, shoot rays emanating from a point on the skeleton. Take the chord with the shortest length and call it the width at that point (Fig. 7 (right)).
2. From all widths, pick the top few widest points. Choose the one that is closest to the finger positions as the palm candidate.
3. For the chord at the selected point, pick the center point of the chord and define this as the palm center to be used for the rest of sign recognition.
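A simplified sketch of these steps follows; it is an approximation of, not a copy of, the described method: instead of shooting rays and measuring chords explicitly, the width at each skeleton point is approximated by twice the distance-transform value (the distance to the nearest blob boundary), and the widest skeleton point is returned as the palm-center candidate.

```python
# Approximate palm-center detection from the blob and its skeleton.
import numpy as np
from scipy.ndimage import distance_transform_edt

def palm_center(blob_mask, skeleton):
    width = 2.0 * distance_transform_edt(blob_mask)   # approximate chord width
    candidates = np.argwhere(skeleton)                # skeleton pixel coordinates
    widths = width[skeleton]                          # width at each skeleton pixel
    return tuple(candidates[np.argmax(widths)])       # (row, col) of palm center
```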
5.2 Experimental Results
The algorithm has been implemented in C and tested using alphabet letters from JSL (Japanese Sign Language). Our experiments show that the second framework gives more successful results. The primary reason is that defining an evaluation function that works uniformly for all words is difficult. For example, for 'X' we want sharp turns to be minimal, while for 'B' we want to keep some of the sharp turns in the pattern. This requires case-based analysis or substantial post-processing. Currently, it takes approximately 0.5 seconds to resolve overlapping hand cases as in Fig. 5. The second framework also fails at times, for example, when unnecessary fragments remain after pruning the tree structure. Fig. 6 shows an example of a JSL sentence consisting of three words, 'convenience store', 'place', and 'what', illustrating a non-overlapping 2-hand sign (convenience store) and 1-hand signs (place and what). Currently, our system has approximately 50 recognizable words. For 1-hand signs and non-overlapping 2-hand signs, recognition takes less than 0.2 seconds per word (after the whole word has been seen) on a laptop computer.
Fig. 5. Examples of separating two overlapping hands. From the top left, the signs represent 'A', 'K', 'G', 'B', 'X', 'Well', 'Meet', and 'Letter' in JSL. Palm centers are marked by circles, and the hand separation is shown by hands with different shades.
Fig. 6. Example of sign recognition. The gray images represent depth profiles obtained by the camera, while the black-and-white images show processing results. Palm centers are marked as circles. Top row: 'Convenience store (open 24 hours)'. The right and left hands represent '2' and '4', respectively, while the hands draw a circular trajectory in front of the body to represent 'open'. Middle row: 'place'. The open right hand is moved downward. Bottom row: 'what'. Only the pointing finger is up, and it is moved left and right. The total meaning of the sequence is 'Where is a convenience store?'. Currently, the system performs word-by-word recognition.
The module has also been incorporated into a word recognition module that takes hand shape, movement, and location into consideration and runs at near-real-time speed.
Fig. 7. Snapshot of the recognition system (left). Illustration for palm detection (right). For a point on the skeleton, a chord is generated to measure the width at that point. The one that gives rise to the maximum chord length is a candidate for locating the palm center.
6 Concluding Remarks
We have presented an algorithm to separate two overlapping hands when fingers of the right and left hands may touch or cross. This is a step toward recognizing various signs formed by two-hand formations. Two algorithms have been proposed and examined, and their performance has been experimentally illustrated on overlapping hand patterns taken from JSL words. Some existing work attempts to identify individual fingers in the hand (in addition to counting the number of fingers) from hand appearance. Our work analyzes finger formation and classifies the pattern into several types such as 'L' shape and 'V' shape; no attempt has been made at this point to identify individual fingers. We have also integrated our algorithm into a word-based sign recognition system. Compared with existing approaches, the salient features of our system are as follows. (i) Signs are recognized without using any special background, marks, or gloves. (ii) Each computation module is computationally inexpensive, thereby achieving fast sign recognition. (iii) Due to depth analysis, the method is less susceptible to illumination changes. The focus of the present work has been to divide a given arm blob into two separate parts representing the left and right hands. The problem is formulated in a graph division and labeling paradigm, and experimental results show that the method has reasonable performance. Our algorithm requires that most of the fingers are clearly visible. Challenges remain, since sign languages often use overlapping patterns that are more complex than those presented in this paper (e.g., ones involving occlusion). We leave this as a subject of future research.
References

1. Ong, S., Ranganath, S.: Automatic sign language analysis: A survey and the future beyond lexical meaning. IEEE Pattern Analysis and Machine Intelligence 27, 873–891 (June 2005)
2. Kleinberg, J., Tardos, E.: Approximation algorithms for classification problems with pairwise relationships: Metric partitioning and Markov random fields. Journal of the ACM 49(5) (2002)
3. Hamada, Y., Shimada, N., Shirai, Y.: Hand shape estimation under complex backgrounds for sign language recognition. In: Proc. of Symposium on Face and Gesture Recognition, Seoul, Korea, pp. 589–594 (2004)
4. Starner, T., Weaver, J., Pentland, A.: Real-time American sign language recognition using desk and wearable computer based video. IEEE Pattern Analysis and Machine Intelligence 20(12), 1371–1375 (1998)
5. Mo, Z., Neumann, U.: Real-time hand pose recognition using low-resolution depth images. In: Int. Conf. on Computer Vision and Pattern Recognition, New York City (2006)
6. Imagawa, K., Lu, S., Igi, S.: Color-based hand tracking system for sign language recognition. In: Proc. Automatic Face and Gesture Recognition, pp. 462–467 (1998)
7. Polat, E., Yeasin, M., Sharma, R.: Robust tracking of human body parts for collaborative human computer interaction. Computer Vision and Image Understanding 89(1), 44–69 (2003)
8. Wilson, A., Bobick, A.: Parametric hidden Markov models for gesture recognition. IEEE Trans. on Pattern Anal. Mach. Intel. 21(9), 884–900 (1999)
9. Jojic, N., Brumitt, B., Meyers, B., Harris, S., Huang, T.: Detection and estimation of pointing gestures in dense disparity maps. In: Proc. of the 4th Intl. Conf. on Automatic Face and Gesture Recognition, Grenoble, France (2000)
10. Bretzner, L., Laptev, I., Lindeberg, T.: Hand gesture recognition using multi-scale colour features, hierarchical models and particle filtering. In: Proc. of the 5th Intl. Conf. on Automatic Face and Gesture Recognition, Washington D.C., May 2002, pp. 423–428 (2002)
11. Pavlovic, V., et al.: Visual interpretation of hand gestures for human-computer interaction: A review. IEEE Trans. on Pattern Anal. Mach. Intel. 19(7), 677–695 (1997)
12. Iddan, G.J., Yahav, G.: 3D imaging in the studio. In: SPIE, vol. 4298, p. 48 (2000)
13. Malassiotis, S., Aifanti, N., Strintzis, M.G.: A gesture recognition system using 3D data. In: 1st Intl. Symp. on 3D Data Processing, Visualization, and Transmission, Padova, Italy (June 2002)
14. Vogler, C., Metaxas, D.: ASL recognition based on a coupling between HMMs and 3D motion analysis. In: Proc. Int. Conf. Computer Vision, Bombay (1998)
15. Zhu, Y., Xu, G., Kriegman, D.J.: A real-time approach to the spotting, representation, and recognition of hand gestures for human-computer interaction. Computer Vision and Image Understanding 85(3), 189–208 (2002)
16. Athitsos, V., Sclaroff, S.: An appearance-based framework for 3D handshape classification and camera viewpoint estimation. In: Proc. of the 5th Intl. Conf. on Automatic Face and Gesture Recognition, Washington D.C., May 2002, pp. 45–50 (2002)
17. Zhu, X., Yang, J., Waibel, A.: Segmenting hands of arbitrary color. In: Proc. of the 4th Intl. Conf. on Automatic Face and Gesture Recognition, Grenoble, March 2000, pp. 446–453 (2000)
18. Fujimura, K., Liu, X.: Sign recognition using depth image streams. In: Proc. of the 7th Symposium on Automatic Face and Gesture Recognition, Southampton, UK (May 2006)
Depth from Stationary Blur with Adaptive Filtering Jiang Yu Zheng and Min Shi Department of Computer Science Indiana University Purdue University Indianapolis (IUPUI), USA
Abstract. This work achieves an efficient acquisition of scenes and their depths along long streets. A camera is mounted on a vehicle moving along a path and a sampling line properly set in the camera frame scans the 1D scene continuously to form a 2D route panorama. This paper extends a method to estimate depth from the camera path by analyzing the stationary blur in the route panorama. The temporal stationary blur is a perspective effect in parallel projection yielded from the sampling slit with a physical width. The degree of blur is related to the scene depth from the camera path. This paper analyzes the behavior of the stationary blur with respect to camera parameters and uses adaptive filtering to improve the depth estimation. It avoids feature matching or tracking for complex street scenes and facilitates real time sensing. The method also stores much less data than a structure from motion approach does so that it can extend the sensing area significantly. Keywords: Depth from stationary blur, route panorama, 3D sensing.
1 Introduction

For pervasive archiving and visualization of large-scale urban environments, mosaicing views from a translating camera and obtaining depth information has become an interesting topic in recent years. Along a camera path, however, overlapping consecutive 2D images perfectly is impossible due to the inconsistent motion parallax from drastically changing depths in urban environments. Approaches to tackle this problem so far include (1) the 1D-2D-3D approach [1], which collects slit views continuously from a translating camera under stable motion on a vehicle. The generated route panorama (RP) [2][3][4] avoids image matching and stitching. Multiple route panoramas from different slits are also matched to locate 3D features in the 3D space [1][4][5][6]. (2) The 2D-3D-2D approach, which mosaics 2D images through matching, 3D estimation [7], and re-projection to a 2D image. If scenes are close to a single depth plane, photomontage can select scenes seamlessly [8], resulting in a multiperspective projection image. Alternatively, images at intermediate positions can also be interpolated to form a parallel-perspective image [9]. (3) The 3D-1D-2D approach, which obtains a long image close to perspective images at each local position. The 1D sampling slit is shifted dynamically [10] according to a dominant depth measured by laser [11][12] or the image velocity in a video volume [13][14][15][16]. This work aims to scan long route panoramas and their depth with a 1D slit, as it is the simplest approach without image matching. We analyze the stationary blur [2][24], a perspective effect in the parallel projection due to using a non-zero-width
slit. The degree of blurring is related to the scene depth from the camera path as well as to the camera parameters [17]. By using differential filters to evaluate the contrast in the RP against the original contrast in the image, we can obtain a depth measure at strong spatial-temporal edges. This paper further adjusts camera parameters such as the vehicle speed, camera focal length, and resolution to increase the blur effect, which increases the sensitivity of the method and improves the depth estimation. Adaptive filtering for various depths is implemented to reduce the depth errors. In the next section, we first extend the path to a general curve to obtain a geometric projection of the route panoramas. We then analyze the physical model of the slit scanning and introduce the stationary blur in Section 3. Section 4 is devoted to a depth calculation method, and Section 5 develops a filtering approach adaptive to various depths. Section 6 introduces the experiments, followed by a conclusion.
2 Acquisition of Route Panoramas Along Streets in Urban Areas

We define the slit-scanning model along a smooth camera path on a horizontal plane. A video camera is mounted on a four-wheeled vehicle moving at a speed V. Denoting the camera path by S(t) in a global coordinate system, where t is the scanning time in frame number, such a path is an envelope of circular segments with changing curvature κ(t), where κ(t) = 0 for a straight segment. The vehicle keeps V = |V| as constant as possible, and the variation can be obtained from GPS. In order to produce good shapes in a route panorama, a vertical Plane of Scanning (PoS) is set in the 3D space through the camera focal point as the camera moves along the path. This ensures that vertical lines in the 3D space appear vertically in the route panorama even if the camera is on a curved path.
Fig. 1. A section of 2D RP from slit scanning. Different horizontal contrasts appear at different depths.
To create the route panorama, we collect temporal data continuously from the slit of one-pixel width, which generates an RP image with the time coordinate t horizontal and the slit coordinate y vertical (Fig. 1). A fixed sampling rate m (frames/sec), normally selected as the maximum reachable rate of the camera, is used for the scanning. At each instant or position on the path, a virtual camera system O-XYZ(t) can be set such that the image frame is vertical and the X axis is aligned with the moving direction V. Within each PoS, we can linearly transform data on the slit l to a vertical slit l' in O-XYZ(t). This converts a general RP to a basic one that has a vertical and smooth image surface along the camera path. A 3D point P(X, Y, Z) in O-XYZ(t) has the image projection

I(x, y, t):  x = Xf/Z,  y = Yf/Z                                        (1)

where f is the camera focal length. The projection of P in the basic RP is then I(t, y) = I(x, y, t)|_{x∈l'}, calculated by

I(t, y):  t = S/r,  y = Yf/Z,  r = V/m                                  (2)
where V = |V|, S = |S|, and r (meters/frame) is the camera sampling interval on the path. We define a path-oriented description of the camera rather than using a viewpoint-oriented representation. As depicted in Fig. 2, the camera moves on a circular path with a radius of curvature R = 1/κ, where κ<0, κ=0, and κ>0 for convex, linear, and concave paths, respectively. The camera translation and rotation velocities are V(V, 0, 0) and Ω(0, β, 0), where β is a piecewise constant related to the vehicle steering status and is estimated from GPS output. Because V is along the tangent of the path, a four-wheeled vehicle has the motion constraint

V = ∂S(t)/∂t = Rβ                                                      (3)
where R and β have the same sign.
Fig. 2. Relation of circular path segments and the camera coordinate systems. (a) Convex path where R<0 and β<0, and (b) Concave path where R>0 and β>0.
On the other hand, the relative velocity of a scene point P(X, Y, Z) with respect to the camera is

∂P(t)/∂t = −V + Ω × P(t)                                               (4)

∂(X(t) Y(t) Z(t))/∂t = −(V 0 0) + (0 β 0) × (X(t) Y(t) Z(t))           (5)

When the point is viewed through the slit at time t, i.e., the point is on the PoS, we have

∂X(t)/∂t = −V + βZ(t),   ∂Y(t)/∂t = 0,   ∂Z(t)/∂t = −βX(t) = −βZ(t)/tanα     (6)

using tanα = X/Z. Taking the temporal derivative of (1), the image velocity v is

v = ∂x/∂t = f (∂X/∂t)/Z(t) − fX (∂Z(t)/∂t)/Z(t)² = f (∂X/∂t)/Z(t) − x (∂Z(t)/∂t)/Z(t)     (7)

Filling in the results from (5) and (6) into (7), we obtain the image velocity on slit l as

v = −f(V − βZ(t))/Z(t) + xβ/tanα = −fV/Z(t) + (V/R)(f + x/tanα)        (8)
From (8), the depth Z(t) and the 3D point can be obtained as

Z(t) = fV / [ (V/R)(f + x/tanα) − v ] = f²V / [ (V/R)(f² + x²) − fv ],   X(t) = Z(t)/tanα,   Y(t) = Z(t) y / f     (9)

where the image velocity v < 0 and the slit position x = f·tanα. For a linear motion where β = 0 (R = ∞), (9) yields

X(t) = −fV/(v tanα),   Y(t) = −Vy/v,   Z(t) = −fV/v                    (10)

which is the traditional formula for depth calculation. These equations relate the depth to the image velocity at the sampling slit.
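For reference, the closed forms (9) and (10) can be evaluated directly once v is measured; the helper below is a small illustrative sketch (not code from the paper), and its name and argument conventions are assumptions.

```python
# Evaluate the depth formulas (9)/(10) as reconstructed above, assuming v < 0.
def depth_from_velocity(v, f, V, x, R=None):
    """f: focal length (pixels); V: vehicle speed; x: slit position (pixels);
    R: path radius of curvature (None or inf means a straight path)."""
    if R is None or R == float("inf"):
        return -f * V / v                                   # Eq. (10): linear motion, beta = 0
    return f * f * V / ((V / R) * (f * f + x * x) - f * v)  # Eq. (9): circular path
```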
3 Stationary Blur and Different Sampling Ranges

In the route panorama, we observe an image blur along the t direction on distant scenes, on scenes on the concave side of a curved path, and on scenes captured when the vehicle slows down. This blur, which does not appear in a perspective projection image, can be seen in Fig. 1 on a distant house and backyard trees. The ideal projection model discussed in Sec. 2 cannot explain this blur because the projection assumes a zero-width slit; the PoS through the slit is extremely thin and the sampling positions are infinitely dense along a camera path. In a real situation, as depicted in Fig. 3(a), the slit has a physical width and an RP is generated from a series of narrow perspective projections.
Fig. 3. Real projection of a route panorama (top view) and the stationary blur. (a) Different ranges of sampling with consecutive cones on a camera path. (b) Simulation of projecting an ideal step edge from all depths (0-256m) to short RPs (15 pixels wide) for Zj = 8, 16, 32, 64m. The edge intensity distributions are piled with respect to depth.
Table 1. Depth-related image properties

Depth   | Sampling range      | Image velocity | Trace slope in EPI | Scene sampling                         | Blurs                 | Feature width in RP
Z > Zj  | Overlapped sampling | |v| < 1        | > π/4              | Overlapped sampling                    | Stationary blur in RP | Wider than in image
Z = Zj  | Just-sampling       | |v| = 1        | = π/4              | Neither missing nor overlapping scenes | No blur               | Same as in image
Z < Zj  | Under-sampling      | |v| > 1        | < π/4              | Missing some details                   | Motion blur in images | Shorter than in image
Different depths, classified as the just-sampling depth, the under-sampling range, and the overlapped-sampling range, have different sampling characteristics [17]. For scenes at the just-sampling depth Zj, their slit views are connected in the RP without scene overlap, just as in a normal perspective projection. In the under-sampling range (Z < Zj), some scene details are missed and features appear motion-blurred (Table 1).
4 Depth Estimation from Blurs in Route Panoramas

The stationary blur visible in the RP is sufficient for humans to separate different depth layers. Assuming the same vehicle speed and camera sampling rate, the simulation in Fig. 3(b) also shows the stationary blur related to the depth. One can notice that an edge becomes more blurred as the depth increases, even with different just-sampling depths. A question then arises as to whether the depth, or at least depth layers, can be computed inversely from the degree of the blur. Theoretically, the intensity formation of the RP undergoes two phases: the first is the convolution of a function G(0, W) of the cone with the scenes, and the second is the sampling at discrete locations. The width W of the cone is proportional to the scene depth Z (Fig. 3(a)). The degree of blurring is thus related to the depth in terms of W. We first examine depth estimation from the stationary blur. Although the stationary blur is related to the depth, the contrast distribution in the RP is insufficient to determine the depth independently if the original spatial contrast is unknown. To obtain the spatial contrast for points in the RP, we calculate the differential value ∂I(x)/∂x = Ix around the slit in the images as the route panorama is scanned. Ix(t, y) is obtained at slit l if we widen the sampling slit to include several neighboring pixels. We also obtain the temporal contrast within the route panorama by calculating ∂I(t)/∂t = It, which reflects the contrast change after the stationary blurring. The ratio of the spatial and temporal differentials can be related to the image velocity v because

v = ∂x/∂t = −(∂I/∂t)/(∂I/∂x) = −It(t, y)/Ix(t, y)                      (11)
Thus the depth at point (t, y) can be computed locally from the two blurs (or, equivalently, contrasts) and the vehicle translational and rotational parameters according to (10). The measure of v from spatial and temporal differentials is similar to depth from optical flow, but it is performed along the RP rather than in the images. Because Ix and It are unreliable at low-contrast points, two additional feature selection steps are added. (a) To avoid disturbance from features at a different height due to the vehicle shaking and waving, we avoid near-horizontal features in the depth estimation, i.e., |∂I(t, y)/∂y| < δy, where δy is a threshold. (b) The original contrast level affects Ix and It and hence their ratio. We select reliable edges with high contrast either in the temporal or in the spatial domain for depth estimation. A spatial-temporal gradient g(t, y), influenced neither by motion blur nor by stationary blur, is calculated as

g(t, y) = sqrt( Ix(t, y)² + It(t, y)² )                                (12)
Feature points satisfying g(t, y) > δ are selected for depth estimation. Using local data to yield depth instantly avoids many complex issues such as occlusion, dynamic objects, and a lack of features when sensing outdoor scenes. On the other hand, the local image velocities may still produce unreliable depth because of the measurement in a narrow region around the slit and the digitization errors in intensities. One can also implement a global method that evaluates the blur of a region for depth if the RP can be segmented successfully [17], because a global approach uses more evidence and the result will be more robust.
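A rough sketch of this local estimation is given below; it is not the authors' implementation. It assumes the RP is stored as a (y, t) array, that the spatial differential Ix captured around the slit is supplied separately, and that the path is straight so Eq. (10) applies; delta and delta_y correspond to the thresholds δ and δy above.

```python
# Local depth cues from the ratio of temporal to spatial differentials, Eqs. (10)-(12).
import numpy as np

def depth_cues(rp, ix, f, V, delta=10.0, delta_y=5.0):
    """rp: route panorama (H, T); ix: spatial differential dI/dx at the slit (H, T);
    f: focal length; V: vehicle speed (straight path assumed)."""
    it = np.gradient(rp.astype(float), axis=1)      # temporal differential dI/dt
    iy = np.gradient(rp.astype(float), axis=0)      # vertical gradient, to reject horizontal edges
    g = np.sqrt(ix ** 2 + it ** 2)                  # Eq. (12): unaffected by either blur
    valid = (g > delta) & (np.abs(iy) < delta_y) & (np.abs(ix) > 1e-3)
    v = -it / (ix + 1e-9)                           # Eq. (11)
    z = np.full(rp.shape, np.nan)
    z[valid] = -f * V / v[valid]                    # Eq. (10) for a straight path
    return z
```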
5 Increasing Sensitivity of Depth Measure and Adaptive Filtering

How can we improve the estimation of blur with more stable filters? It is noticeable in Fig. 3(b) that the degree of blurring or contrast may not be sensitive to the depth change because of the nature of the convolution with the sampling cone. The sensitivity of depth from stationary blur also depends on the just-sampling depth (Fig. 3(b)). A large Zj makes the projection close to an ideal parallel-perspective projection, while a small Zj increases the perspective effect in the RP and the stationary blur. The just-sampling depth is further determined by the camera focal length f and the image resolution according to Fig. 3(a). A large f leads to a large Zj and yields insignificant contrast changes with respect to the depth variation. All of this can be quantified as

W = rZ/Zj,   Zj = rf,   thus  W = Z/f  or  Z = fW                      (13)
Although a large f is preferred for an ideal parallel-perspective projection image without the stationary blur (Fig. 3(b)), a parallel projection contains less distance information. Hence, depth measured under a small f can improve the depth accuracy. We should enhance the perspective effect of the physical slit by reducing Zj as follows: (1) selecting a wide-angle lens optically, which also fits the goal of route scanning to include high buildings in the RP; (2) summing several pixel lines into one line, which
reduces the slit resolution. In general, the vehicle speed, camera sampling rate, and path curvature cannot be changed freely onsite due to the path (street) and device limitations.
Fig. 4. Relation of the camera focal length and the image velocity when viewing the same scene. The figure illustrates two camera focal lengths resulting in different FOVs as well as cones in (a) and (b) respectively. Top, RPs, and EPI images are depicted in order. Traces of two objects in the EPI have orientations corresponding to the image velocities.
As depicted in the epipolar plane image (EPI) of Fig. 4, the trace of a feature at Zj has angle π/4 in penetrating the RP, according to the definition of the just-sampling depth. If we increase the focal length f under a fixed image resolution, the image velocity also increases. If Zj is close to infinity, the cone approaches a line of sight on the PoS. All feature traces become parallel and horizontal in the EPI (|v| → ∞ according to (10)); the perspective effect disappears. Inversely, if f is reduced, the image velocity is lowered under the same vehicle speed V (see (10)). In Fig. 4(b), a shorter f results in a larger cone and a wider image FOV. This yields more vertical traces in the EPI than in Fig. 4(a), because the FOV of Fig. 4(a) is reduced in Fig. 4(b). Overall, the shorter the f, the more the stationary blur appears in the RP. The widths of features at distant depths are also extended. To obtain a stable temporal contrast It(t, y), we also examine the data coherence in the RP. Under the thin perspective projection model of slits, a distant edge is wider in the RP than a close edge. If the scene depth is large, a small filter applied horizontally on the RP cannot detect subtle changes in It(t, y); it may result in zero or only a few levels due to digitization error. This affects the depth accuracy at distant ranges. We thus utilize wide filters in the temporal domain, with the outputs scaled by the filter sizes. Such a filter catches detailed changes of It(t, y) over a longer duration for finer levels of depth. At a close range, however, a large filter may include some distant background when it filters a close occluding edge. Hence, a small filter is better for locating the depth of close features in the RP.
We prepare a set of differential filters, ∂G(0, Wi)/∂t, to compute It(t, y) adaptively, where G is a Gaussian distribution and Wi = 5, 9, 13 pixels. Starting with the largest filter, we detect It(t, y) at distant edges; only strong edges at both distant and close ranges respond. Then we apply a smaller filter and then the smallest one on the RP consecutively. These filters no longer respond to distant features but do respond to close features that are narrow and sharp in the RP. If the new output at a point becomes lower, the point must be far away and the previous value is kept. If the output is higher than the previous one, the feature is a close or detailed one and It(t, y) is updated accordingly.
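The coarse-to-fine selection can be sketched as below; this is not the authors' code, and the mapping from the stated filter widths Wi to a Gaussian sigma as well as the exact output scaling are assumptions.

```python
# Adaptive temporal differential: apply Gaussian-derivative filters of decreasing
# width along t and keep, per pixel, the response of larger magnitude.
import numpy as np
from scipy.ndimage import gaussian_filter1d

def adaptive_it(rp, widths=(13, 9, 5)):
    """rp: route panorama as a (y, t) float array. Returns It(t, y) chosen per pixel."""
    it = None
    for w in widths:                                  # largest filter first, as in the text
        sigma = w / 4.0                               # assumed width-to-sigma conversion
        resp = gaussian_filter1d(rp, sigma, axis=1, order=1) * w   # output scaled by filter size
        it = resp if it is None else np.where(np.abs(resp) > np.abs(it), resp, it)
    return it
```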
6 Systems and Experiments

Our strategy for the depth estimation is based on the local stationary blur in the RP against the contrast captured in the images, along with the vehicle motion V and R (or β) from GPS, which records the path information independently. Using separate sensors for the motion parameters is crucial for a sustainable sensing system; this prevents failures in image-based motion estimation due to a lack of features or the presence of complex features (e.g., scenes full of occlusions from trees, poles, and buildings on the street). We have driven a vehicle through a number of streets to record route panoramas. We keep the calculation of the spatial differential within a stripe around the slit that is much smaller than the image patches used in feature matching and stitching by other approaches. Motion blur may affect feature matching in structure from motion; with the route panorama, however, no additional processing of the motion blur is required. Our filtering method uses g(t, y), which is invariant to motion blur, in feature selection. To increase the FOV and the stationary blur, and to reduce under-sampling at close scenes as well, we use wide-angle lenses (a 30mm conversion lens attached to a 44mm camcorder lens) during the scanning. We have also tried a fish-eye lens to create the maximum stationary blur optically; however, the image quality of the RP is then not suitable for visualization (scenes become very small). If the sampling speed of the slit can be maintained, we can average pixels in a wider image stripe (3, 5, 9, ... pixels wide) to obtain the slit data for depth estimation in real time, keeping the fine slit of one pixel wide for generating the RP for visualization. The slit sampling rate is 60 fps for a video camera, and the vehicle speed is about 15~36 km/h. Along with the short focal length of the camera, a slow vehicle speed yields a short just-sampling depth Zj. The just-sampling depth is roughly set at the front face of buildings and houses, but it is hard to control onsite dynamically. The vehicle can move faster, at 45 km/h, if side scenes are predicted to have large depths, because the image velocities of the scenes are then still low. The maximum height of the RP is 640 pixels and can be increased if a high-definition camera is used. An example of the estimated depth is shown in Fig. 5, where three depth layers, such as front trees and parked cars, front faces of houses, and backyard trees and houses, are included. The contributions of the depth from adaptive filters of three sizes for It are also displayed partially in colors. We can see in Fig. 6 that the depths of distant scenes are mostly computed with the wide filter, while close and detailed scenes come from the small filters (see larger images in [25]).
Fig. 5. Depth estimation by an adaptive filter. Bright points are closer than dark ones. Houses in the continuous RP show their depth differences from backyard trees and front trees.
Fig. 6. Depths calculated from three filters of different sizes are displayed in the R, G, and B channels. Red and yellow points (on front trees and brick details) are calculated from the small and medium-size filters. Blue and cyan points (back trees) indicate that their depths come from the large and medium filters. See the enlarged color figure on the web [25].
To verify the obtained depth data at strong temporal-spatial edges in Fig. 5, we fill empty areas as much as possible to form depth regions (Fig. 7) such that houses and distant trees are identifiable. Note that this filling is different from building a complete model and may be less precise than the measurements, because no segmentation has been implemented. In the result, many glass windows are measured with a more distant depth (darker than the house fronts), which is correct because scenes reflected from the mirror-like windows have more than double the distance from the path. Some variation of the depth is given in Fig. 8. We can notice that the house fronts with solid edges have relatively small errors, while distant trees may have a large variation in their depths.
Fig. 7. Depth map with empty points filled up for verification. According to the measured depths at points with strong values of spatial-temporal gradient, horizontal and then vertical interpolation are carried out in order.
Fig. 8. Estimating depths of edges and their variances in vertical planes. (a) Regions outlined from Fig. 7 to examine depth values, (b) Means and variances of depth values in the specified regions, (c) Sorted means (the horizontal axis is region index).
7 Conclusion

This work developed a depth estimation approach for urban streets based on the stationary blur in route panoramas. Through an elaborate analysis of blur, motion, and image sampling, we proposed an algorithm that generates depth layers efficiently according to the sharpness in the route panorama compared to the original image sharpness. The adaptive filtering of the route panorama avoids feature matching and tracking, and it is thus less influenced by occlusion, motion blur, and other complex situations confronted in urban environments. From the data-storage perspective, the stored spatial differential image and route panorama are much smaller in size than the image sequence for stitching and the EPIs for tracking; this compactness benefits system development.
References

[1] Zheng, J.Y., Tsuji, S.: Panoramic representation for route recognition by a mobile robot. IJCV 9(1), 55–76 (1992)
[2] Zheng, J.Y.: Digital route panorama. IEEE Multimedia 10(3), 57–68 (2003)
[3] Seitz, S., Kim, J.: Multiperspective imaging. IEEE CGA 23(6), 16–19 (2003)
[4] Gupta, R., Hartley, R.: Linear push-broom cameras. IEEE PAMI 19(9), 963–975 (1997)
[5] Seitz, S., Kim, J.: The space of all stereo images. IJCV 48(1), 21–38 (2002)
[6] Zhu, Z., Hanson, A.R.: 3D LAMP: a new layered panoramic representation. In: ICCV 2001, vol. 2, pp. 723–730 (2001)
[7] Wang, A., Adelson, E.H.: Representing moving images with layers. IEEE Trans. Image Processing 3(5), 625–638 (1994)
[8] Agarwala, A., et al.: Photographing long scenes with multi-viewpoint panoramas. ACM Trans. Graphics 25(3), 853–861 (2006)
[9] Zhu, Z., Hanson, A.R., Riseman, E.M.: Generalized parallel-perspective stereo mosaics from airborne video. IEEE Trans. PAMI 26(2), 226–237 (2004)
[10] Roman, A., Garg, G., Levoy, M.: Interactive design of multi-perspective images for visualizing urban landscapes. In: IEEE Conf. Visualization 2004, pp. 537–544 (2004)
[11] Zhao, H., Shibasaki, R.: A vehicle-borne urban 3D acquisition system using single-row laser range scanners. IEEE Trans. on SMC B33(4) (2003)
[12] Frueh, C., Zakhor, A.: Constructing 3D city models by merging ground-based and airborne views. In: IEEE CVPR 2003, pp. 562–569 (2003)
[13] Baker, H., Bolles, R.: Generalizing epipolar-plane image analysis on the spatial-temporal surface. In: Proc. CVPR 1988, pp. 2–9 (1988)
[14] Li, Y., Shum, H.Y., Tang, C.-K., Szeliski, R.: Stereo reconstruction from multiperspective panoramas. IEEE PAMI 26(1), 45–62 (2004)
[15] Zhu, Z., Xu, G., Lin, X.: Efficient Fourier-based approach for detecting orientations and occlusions in epipolar plane images for 3D scene modeling. IJCV 61(3), 233–258 (2004)
[16] Zomet, D., Feldman, D., Peleg, S., Weinshall, D.: Mosaicing new views: the crossed-slits projection. IEEE Trans. PAMI, 741–754 (2003)
[17] Shi, M., Zheng, J.Y.: A slit scanning depth of route panorama from stationary blur. IEEE CVPR (2005)
[18] Potmesil, M., Chakravarty, I.: Modeling motion blur in computer-generated images. In: SIGGRAPH 1983, pp. 389–400 (1983)
[19] Fox, J.S.: Range from translational motion blurring. In: IEEE CVPR 1988, pp. 360–365 (1988)
[20] Ben-Ezra, M., Nayar, S.K.: Motion deblurring using hybrid imaging. In: IEEE CVPR 2003, pp. 657–665 (2003)
[21] Aliaga, D.G., Carlbom, I.: Plenoptic stitching: A scalable method for reconstructing 3D interactive walkthroughs. In: SIGGRAPH 2001 (2001)
[22] Zheng, J.Y.: Stabilizing route panorama. In: 17th ICPR, vol. 4, pp. 348–351 (2004)
[23] Ikeuchi, K., Sakauchi, M., Kawasaki, H., Sato, I.: Constructing virtual cities by using panoramic images. IJCV 58(3), 237–247 (2004)
[24] Zheng, J.Y., Shi, M.: Removing temporal stationary blur in route panorama. In: 18th ICPR, vol. 3, pp. 709–713 (2006)
[25] http://www.cs.iupui.edu/ jzheng/RP/IJCV
Three-Stage Motion Deblurring from a Video

Chunjian Ren 1, Wenbin Chen 2, and I-fan Shen 3

1 School of Computer Science and Technology, Donghua University, P.R. China
2 School of Mathematical Sciences, Fudan University, P.R. China
3 Department of Computer Science and Engineering, Fudan University, P.R. China
Abstract. In this paper, a novel approach is proposed to remove motion blur from a video that is degraded and distorted by fast camera motion. Our approach is based on image statistics rather than traditional motion estimation. Image statistics have been successfully applied to blind motion deblurring of a single image by Fergus et al. [3] and Levin [10]. Here a three-stage method is used to deal with the video. First, the 'unblurred' frames in the video are found based on image statistics. Then the blur functions are obtained by comparing the blurred frames with the unblurred ones. Finally, a standard deconvolution algorithm is used to reconstruct the video. Our experiments show that our algorithms are efficient.
1 Introduction

Motion blur is caused by the relative motion between the camera and the scene during the exposure time of each frame, which degrades and distorts the video frames. A video is expected to be of high quality so that people feel comfortable when they watch it, but motion blur spoils this. Unblurred video content is needed in many applications, such as tracking and surveillance systems. Video cameras, especially consumer digital cameras designed for hand-held use, usually lack stabilization systems. One approach that reduces the degree of blur is to increase the temporal resolution, i.e., to raise the frame rate. But this approach can cause other problems such as expensive hardware and sensor noise. An alternative effective solution is to adopt image deconvolution with the point spread function (PSF, also known as the blur function). However, the PSF is usually not known, and the process then becomes a blind deconvolution. Under this circumstance, estimation of the unknown blur function is required. Direct estimation of the PSF from a single frame is very difficult and inefficient. Traditionally, in order to enhance a video, motion estimation is performed and then used to reconstruct the video frames. Such methods are complicated and computationally intensive when every frame is taken into account. Moreover, if the blur is large and objects in the video are distorted, motion estimation in a blurred video is sometimes incorrect and even infeasible. The resulting frames can be degraded significantly due to incorrect PSFs in the deconvolution stage.
In this paper we propose a new approach to remove the motion blur from a video. We first divide the video into shots and extract unblurred frames in terms of a specific statistical property. Then we search for a one-dimensional PSF and further adopt a genetic algorithm (GA) to refine the PSF. The key novelties of our method are as follows: (1) The distributions of blurred frames are compared with those of unblurred ones, so there is no need for motion estimation. (2) Blurred frames are contrasted with adjacent unblurred ones, so the PSF estimation is effective. (3) A genetic algorithm is adopted to refine the PSFs, so the results are better.

1.1 Related Works
The motion deblurring problem involves two main parts: PSF estimation and image deconvolution. PSF estimation is an active area of research in computer vision and image analysis. Early deblurring methods usually assume that the blur function has a simple parametric form such as a Gaussian. Recently, some breakthrough progress has been made in this area. Ben-Ezra and Nayar [1] exploit the fundamental trade-off between spatial resolution and temporal resolution to construct a hybrid camera; this special prototype system can measure its own motion during image integration. Fergus et al. [3] compute a complex blur kernel with maximum marginal probability, using the derivative distribution of an unblurred image. Our approach also relies on image statistics, but the difference is that our approach uses the distributions of unblurred frames as the known distributions and avoids estimating the latent unblurred image at every step. Deconvolution algorithms have been developed extensively [7, 11, 13, 17]. In [16], Rav-Acha and Peleg estimate two one-dimensional PSFs simultaneously by using two motion-blurred images that have different blur directions. Other approaches for video enhancement [18, 20] need to do complicated motion estimation; our approach avoids the need to estimate the complex motion in these blurred frames. Another similar line of research on this subject is motion deblurring of moving objects from a single image [10, 15]. Genetic algorithms are now widely applied in science and engineering as adaptive algorithms for optimizing practical problems [5, 9]; in [4, 12], the literature is expounded in detail. The GA is used to refine the PSF to reduce visible artifacts in our method.
2 Model of Image Statistics
Natural image statistics typically obey a specific distribution of gradients, and the shape of the gradient distribution can be influenced significantly by motion blur. This property has been used for deblurring a single image [3, 10], which sheds light on our approach. For comparing distributions, the approach needs known distributions for reference. These known distributions must be those of an unblurred image, since they are critical for identifying the blur function.
Fig. 1. An oblique direction of motion blur can be distinguished by distributions of gradients computed with Roberts operators. Left: A frame with oblique motion blur. Middle: Distributions of the horizontal and vertical gradients cannot distinguish the blur clearly. Right: Distributions computed with Roberts operators clearly distinguish the blur.
Observing the histograms of adjacent frames, it is evident that the distributions of two frames are very close when they are captured within the same scene and both are unblurred. Moving objects with low velocity do not noticeably affect this closeness, so the distributions of unblurred frames can be taken as the known distributions. By comparing the histogram of the vertical gradient with that of the horizontal gradient in a single image, horizontal or vertical blur can be identified easily and effectively, as in [10]. But how does the shape of the histogram change if the direction of motion is diagonal? As shown in Fig. 1 (middle), the shapes of the histograms are then so similar that it is hard to tell the direction of the blur. In order to solve this problem, two other directional derivatives are added by using Roberts operators. As demonstrated in Fig. 1 (right), the Roberts operators are chosen because the histograms of gradients computed by them can distinguish the oblique blur. Note that all the histograms illustrated here are plotted on a logarithmic scale. By employing these four histograms, the unblurred frames in a video can be found in Section 3 and the blur functions can be estimated in Section 4.
3 Extracting Unblurred Frames
A video usually contains many shots [14], and the frames in a shot show consecutive content. Comparing two frames from different shots may lead to failure because their content is not consecutive. In our experiments, the video is divided into shots by an existing simple method [14]. Given a frame sequence within a shot, we first get four distributions for each frame (see Section 2). Let It denote a frame in the video and di the corresponding gradient operator: d1 = [1, −1], d2 = [1, −1]^T, d3 = [−1, 0; 0, 1], d4 = [0, −1; 1, 0]. The image gradients are obtained through the convolution di ∗ It. The histogram of gradients is normalized to obtain the discrete gradient distributions:

P(di ∗ It) ∝ hist(di ∗ It)                                             (1)

where P(·) denotes the computed gradient distribution. We assume there are n quantities on the x-axis of the distribution and Pk is the value of the kth quantity. The unblurred frame Is maximizes the total value of the logarithmic distributions:

Is = arg max_{It} Σ_{i=1}^{4} Σ_{k=1}^{n} log Pk(di ∗ It)              (2)
Here we use the logarithmic operation to magnify the difference between the two distributions being compared. It is possible that many frames in one sequence are unblurred due to the irregular camera motion; therefore, a threshold can be used to extract more than one frame in each sequence. Under some circumstances there may be no unblurred frames in a sequence, but the approach above can still be applied, and the frames extracted are relatively 'unblurred'.
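A minimal sketch of Eqs. (1)-(2) follows; it is not the authors' implementation. It assumes grayscale frames scaled to [0, 1] and a fixed histogram range, and it skips empty bins when taking logarithms.

```python
# Score each frame by the sum of log gradient-histogram values and keep the best one.
import numpy as np
from scipy.signal import convolve2d

# the four gradient operators d1..d4 listed in the text
OPERATORS = [np.array([[1, -1]]), np.array([[1], [-1]]),
             np.array([[-1, 0], [0, 1]]), np.array([[0, -1], [1, 0]])]

def sharpness_score(frame, bins=64):
    """frame: grayscale image as a float array in [0, 1]. Returns the score of Eq. (2)."""
    score = 0.0
    for d in OPERATORS:
        grad = convolve2d(frame, d, mode="valid")
        hist, _ = np.histogram(grad, bins=bins, range=(-1.0, 1.0))
        p = hist / max(hist.sum(), 1)                 # normalized distribution, Eq. (1)
        score += np.log(p[p > 0]).sum()               # sum of log P_k, skipping empty bins
    return score

def select_unblurred(shot_frames):
    """Return the frame of a shot that maximizes Eq. (2)."""
    return max(shot_frames, key=sharpness_score)
```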
4 Estimating the Motion Blur Function
As analyzed in Section 2, the blur of the frames can be identified using image statistics. By comparing the distributions of an unblurred frame with the distributions of an adjacent blurred frame, the PSF of the blurred one can be estimated. The exposure time of a video camera for one frame is mostly very short, so the PSF is usually close to a one-dimensional form. Balancing performance and efficiency, we first estimate a one-dimensional PSF and then refine it into a more complex and suitable one.

4.1 General Estimation
We first treat the PSF as one-dimensional, similar to the video camera's motion path during one frame's exposure time. This kind of PSF has two parameters: direction (α) and length (ℓ). Each entry of the PSF equals 1/ℓ. As addressed in [1, 3], the PSF can be represented by a convolution kernel (also known as a blur kernel) that satisfies an energy conservation constraint (the total value of the kernel equals 1). Our method brings the distributions of the unblurred frame close to the distributions of the blurred frame; the length and direction of the PSF can be estimated simultaneously. Let P(·) and Q(·) describe the distributions of the unblurred and blurred frames, respectively. We minimize a K-L divergence, which represents the distance between the distributions. The distance can be formulated as

E_KL(P || Q) = Σ_{i=1}^{4} Σ_{k=1}^{n} P(i, k) log( P(i, k) / Q(i, k) )      (3)

where

P(i, k) = Pk(di ∗ K(α, ℓ) ∗ Is)                                        (4)

and

Q(i, k) = Qk(di ∗ Ib)                                                  (5)
Here Ib is a blurred frame, Is is an adjacent unblurred frame, K denotes the blur kernel with its two parameters α and ℓ, and di denotes the gradient operator. The cost function (3) is minimized by the following formula:

(α, ℓ) = arg min_{α, ℓ} E_KL(P || Q)                                    (6)

The indexes α and ℓ can be determined using an exhaustive search over the angles and lengths. For each possible value of α and ℓ, we compose a blur kernel and compute the cost function. The kernel of the blurred frame is found when the cost function is minimized. In order to be more efficient, we can initialize all the possible blur kernels before we compute the cost function. Moreover, the kernel with angle α is equivalent to the kernel with angle π + α, so only α ∈ [0, π) is taken into account. When the length is small, not every angle needs to be computed.
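The exhaustive search of Eq. (6) can be sketched as follows; this is not the authors' code. The kernel construction, histogram range, and the handling of empty bins are assumptions, and the four gradient operators from Section 3 are passed in as `operators`.

```python
# Build a 1-D motion kernel for each (angle, length), blur the sharp frame,
# and keep the pair whose gradient histograms best match the blurred frame's.
import numpy as np
from scipy.signal import convolve2d

def motion_kernel(angle, length):
    """1-D blur kernel: `length` taps summing to 1 along direction `angle` (radians)."""
    size = int(length) | 1                               # odd support (assumed)
    k, c = np.zeros((size, size)), size // 2
    for s in np.linspace(-(length - 1) / 2, (length - 1) / 2, int(length)):
        k[int(round(c + s * np.sin(angle))), int(round(c + s * np.cos(angle)))] += 1.0
    return k / k.sum()                                   # energy conservation constraint

def kl_cost(sharp, blurred, kernel, operators, bins=64):
    cost = 0.0
    for d in operators:
        p, _ = np.histogram(convolve2d(convolve2d(sharp, kernel, mode="same"), d, mode="valid"),
                            bins=bins, range=(-1, 1))
        q, _ = np.histogram(convolve2d(blurred, d, mode="valid"), bins=bins, range=(-1, 1))
        p, q = p / max(p.sum(), 1), q / max(q.sum(), 1)
        m = (p > 0) & (q > 0)                            # skip empty bins (implementation choice)
        cost += np.sum(p[m] * np.log(p[m] / q[m]))       # Eq. (3)
    return cost

def estimate_psf(sharp, blurred, operators, max_len=15, n_angles=18):
    grid = [(a, l) for l in range(2, max_len + 1)
                   for a in np.linspace(0, np.pi, n_angles, endpoint=False)]
    return min(grid, key=lambda al: kl_cost(sharp, blurred, motion_kernel(*al), operators))
```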
4.2 Refining the Blur Kernel
The blurred frame can be deconvolved with the estimated blur kernel to get an unblurred frame, but sometimes the result contains visible artifacts. As pointed out in [1, 3], the kernel may have a more complex form. A genetic algorithm is adopted to refine the blur kernel in order to produce a better form.
Fig. 2. Left: The pretreatment for encoding. Right: An example of blur kernel encoding. Zeros in the chromosome are replaced by a nonzero numerical value that is smaller than 1/ℓ.
Encoding. A nonzero numerical value is used to replace each zero in the chromosome. The numerical value is much smaller than 1/ℓ, as in Fig. 2. Each chromosome in the first generation is initialized as

e_i = rand_i · e_i,   for i = 1, ..., m                                (7)

e_i = e_i / Σ_{i=1}^{m} e_i                                            (8)

So the chromosomes are e_r = (e_r1, e_r2, ..., e_rm), r = 1, ..., 50; the population size used here is 50. Equation (7) makes the chromosomes varied, and Equation (8) performs the normalization to satisfy the energy conservation constraint.
Genetic Operators. In the selection stage, the 10% of chromosomes with the best fitness are directly copied to the new population. This ensures that the best chromosomes are kept for the next generation at each iteration. The new population is also expected to show variation, so the crossover rate is set to 90%. The chromosomes to be crossed are randomly selected. Two crossover operations are used in our approach: single-point crossover when ℓ ≤ 5 (ℓ: the length of the PSF), and two-point crossover when ℓ > 5. Mutation is important in some applications since it can help avoid local minima, but it is not always essential, as in [9]. Mutation has little effect in our approach, so we do not use it.
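A compact sketch of this GA refinement follows; it is not the authors' implementation. It assumes the zeros of the initial kernel have already been replaced by small positive values as in the encoding step, uses single-point crossover only, and leaves the fitness function (e.g., the K-L cost of Eq. (3)) to the caller.

```python
# GA refinement: population of 50 kernels, 10% elitism, crossover, no mutation.
import numpy as np

def refine_kernel(chromosome, fitness, generations=30, pop_size=50, elite=0.1):
    """chromosome: initial kernel entries e_1..e_m (1-D array, positive, sums to 1);
    fitness: callable mapping a normalized chromosome to a cost (lower is better)."""
    m = chromosome.size
    pop = chromosome * np.random.rand(pop_size, m)        # Eq. (7): random scaling
    pop /= pop.sum(axis=1, keepdims=True)                 # Eq. (8): normalization
    n_elite = max(1, int(elite * pop_size))
    for _ in range(generations):
        order = np.argsort([fitness(c) for c in pop])
        new_pop = [pop[i] for i in order[:n_elite]]       # copy the best 10% directly
        while len(new_pop) < pop_size:
            a, b = pop[np.random.randint(pop_size, size=2)]
            cut = np.random.randint(1, m)                 # single-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            new_pop.append(child / child.sum())           # keep energy conservation
        pop = np.array(new_pop)
    return pop[np.argmin([fitness(c) for c in pop])]
```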
5 Experiments
Fig. 3 shows the flowchart of our algorithm.
Fig. 3. Flow chart showing the stages of Motion Deblurring
Given the blur kernels, a non-blind deconvolution algorithm is used to reconstruct the blurred frames. Following [3], Matlab's implementation of the Richardson-Lucy deconvolution algorithm (deconvlucy) is used here. In practice, we can deconvolve each frame after its blur kernel has been estimated, or after all the blur kernels have been estimated. Our algorithm has been applied to image sequences of real scenes, and these sequences were reconstructed successfully, as shown in Fig. 5 and Fig. 7. The final kernel produced by our method can have a complex form. The algorithm in [3] can deal with very complicated and large blurs, but it is less efficient.
Fig. 4. Left: Blurred frame. Middle: Deblurring using Matlab’s blind deconvolution algorithm deconvblind. The algorithm is initialized with a Gaussian blur kernel. Right: Output of our algorithm.
Fig. 5. Left column: Blurred frames. Right column: Output of our algorithm and the inferred blur kernels.
As reported in [3], with the minimum practical patch (size 128 × 128), their Matlab implementation takes 10 minutes. Our algorithm takes less than 1 minute for every blurred frame (size 320 × 240); therefore, our method is more suitable for processing video. We also compared our algorithm against Matlab's blind deconvolution function (deconvblind). This function implements the methods of Biggs and Andrews [2] and Jansson [7]; these methods also estimate the blur kernel and adopt the Richardson-Lucy deconvolution algorithm. We used Matlab's deconvblind with a Gaussian blur kernel as input, similar in size to the blur artifacts, as shown in Fig. 4.
Fig. 6. A synthetic example is shown. The 1st is a ground truth image. The 2nd is a blurred image created by convolution of the 1st image with a known blur kernel. The 3rd image is the result of deconvolution using the known kernel. The 4th image is the output of our algorithm using both the 1st and the 2nd images.
Fig. 7. Left column: Blurred frames. Right column: Output of our algorithm and the inferred blur kernels.
We also ran our algorithm on a synthetic example for validation. We used a known kernel to blur a natural image and then reconstructed the blurred image by using both the blurred and unblurred images as our algorithm's input. As shown in Fig. 6, the final blur kernel estimated by our method is similar to the original known kernel. We also observe in experiments that artifacts may still be introduced in the restored image, even when the deconvolution uses the known kernel.
6 Discussion
This paper introduces a method for motion deblurring of a video without performing motion estimation. For higher efficiency, we first estimate a one-dimensional PSF, but we do not limit the kernel to this simple form; we further refine it into a more complex form in order to reduce artifacts. Our work also includes unblurred-frame extraction, which is simple and effective. This method is also useful for other applications, such as blur detection. We assume a uniform blur over one frame and do not deal with object motion blur. The problem is more complicated when camera motion deblurring is coupled with object motion deblurring within a frame. But it is also worthy of research, for instance, for tracking fast-moving objects when both the objects and the background are blurry. We use an exhaustive search to find the one-dimensional PSF. This method is faster than traditional motion estimation algorithms, but it may still not satisfy highly demanding real-time applications. In future work, we should adopt a fast algorithm that can also find the optimal solution. It will also be interesting to perform shape detection or object detection in a blurred image. This may require a method different from existing ones if we do not do the deblurring first.
Acknowledgements We thank anonymous reviewers for their suggestions. We are most grateful to Xiaoli Cai, Qian Cui for their helpful discussions and support. This work was supported by NSFC under contract 60473104 and the National Basic Research Program under the Grant 2005CB321701.
References

1. Ben-Ezra, M., Nayar, S.K.: Motion-based motion deblurring. IEEE Trans. on PAMI 26, 689–698 (2004)
2. Biggs, D., Andrews, M.: Acceleration of iterative image restoration algorithms. Applied Optics 36, 1766–1775 (1997)
3. Fergus, R., et al.: Removing camera shake from a single photograph. In: SIGGRAPH (2006)
4. Goldberg, D.: Genetic algorithms in search, optimization, and machine learning. Addison-Wesley, Reading, MA (1989)
5. Herrera, F., Lozano, M.: Gradual distributed real-coded genetic algorithms. IEEE Trans. on Evolutionary Computation 4, 43–63 (2000)
6. Irani, M., Peleg, S.: Improving resolution by image registration. Graphical Models and Image Processing 53, 231–239 (1991)
7. Jansson, P.A.: Deconvolution of images and spectra. Academic Press, London (1997)
8. Jia, J., Tang, C.: Image registration with global and local luminance alignment. In: ICCV (2003)
9. Koza, J.: Genetic programming: on the programming of computers by means of natural selection. MIT Press, MA (1992)
10. Levin, A.: Blind motion deblurring using image statistics. In: NIPS (2006)
11. Lucy, L.: Bayesian-based iterative method of image restoration. Journal of Astronomy 79, 745–754 (1974)
12. Mitchell, M.: An introduction to genetic algorithms. MIT Press, Cambridge, MA (1996)
13. Raj, A., Zabih, R.: A graph cut algorithm for generalized image deconvolution. In: ICCV (2005)
14. Rasheed, Z., Shah, M.: Scene detection in Hollywood movies and TV shows. In: CVPR (2003)
15. Raskar, R., Agrawal, A., Tumblin, J.: Coded exposure photography: motion deblurring using fluttered shutter. In: SIGGRAPH (2006)
16. Rav-Acha, A., Peleg, S.: Two motion-blurred images are better than one. Pattern Recognition Letters, 311–317 (2005)
17. Richardson, W.: Bayesian-based iterative method of image restoration. Journal of the Optical Society of America 62, 55–59 (1972)
18. Shah, N.R., Zakhor, A.: Resolution enhancement of color video sequences. IEEE Trans. on IP 8, 879–885 (1999)
19. Simoncelli, E.P.: Statistical modeling of photographic images. In: Handbook of Image and Video Processing (2005)
20. Tom, B.C., Katsaggelos, A.K.: Resolution enhancement of video sequence using motion compensation. In: ICIP (1996)
Near-Optimal Mosaic Selection for Rotating and Zooming Video Cameras Nazim Ashraf, Imran N. Junejo, and Hassan Foroosh School of Electrical Engineering and Computer Science, University of Central Florida
Abstract. Applying graph-theoretic concepts to computer vision problems not only makes it trivial to analyze the complexity of the problem at hand, but also allows existing algorithms from the graph-theory literature to be used to find a solution. We consider the challenging tasks of frame selection for use in mosaicing and feature selection, from computer vision and machine learning respectively, and demonstrate that we can map these problems into the existing graph-theory problem of finding the maximum independent set. For frame selection, we represent the temporal and spatial connectivity of the images in a video sequence by a graph, and demonstrate that the optimal subset of images to be used in mosaicing can be determined by finding the maximum independent set of the graph. This process of determining the maximum independent set not only reduces the overhead of using all the images, which may not contribute significantly to building the mosaic, but also implicitly solves the "camera loop-back" problem. For feature selection, we conclude that we can apply a similar mapping to the maximum independent set problem to obtain a solution. Finally, to demonstrate the efficacy of our frame selection method, we build a system for mosaicing that uses our method of frame selection.
1 Introduction

Recently, there has been growing interest in applying graph-based approaches to solve problems in pattern recognition, computer vision, and robotics. This is not surprising, owing to the fact that graph-based approaches have a very attractive feature: once the problem at hand has been mapped to an existing graph-theory problem, it not only becomes trivial to analyze the problem, but a number of existing algorithms from the graph-theory literature can be used to solve it. Therefore, concepts from graph theory have been used in a number of computer vision problems. There are two strands to this paper. First, we focus on the famous problem of finding the maximum independent set of a graph and discuss its possible uses in computer vision. Namely, two different yet similar problems from computer vision and machine learning are discussed: the (i) frame selection problem and (ii) feature selection problem. We provide a complete complexity analysis of the first problem and arrive at its solution by mapping it into the maximum independent set problem; for the second problem, we give an intuitive understanding of how it can be similarly mapped to the maximum independent set problem and solved. Second, in order to prove the efficacy of our frame selection method, we build a system for mosaicing that uses our method of frame selection.
Frame Selection Problem for Mosaicing: Registering a set of images to form a larger image mosaic has been an active subject of research because of its many different applications [1,7]. A video sequence normally contains a large number of frames with little displacement between consecutive frames. Using all the frames for mosaicing is a time-consuming process due to redundant information processing. Hence, we want to use as few frames as possible while maximizing the information content. Although mosaicing has received tremendous attention from researchers, little research has focused on the problem of selecting the frames to be used in mosaicing; current systems either use all the frames in a video sequence or require the frames to be selected manually. We approach this problem from a graph-theoretic standpoint. We demonstrate that we can select a near-optimal subset of images to be used in mosaicing by mapping the problem into the existing graph-theory problem of finding the maximum independent set of a graph. This scheme not only reduces the overhead of using all the frames but also solves the "camera loop-back" problem.
Feature Selection Problem: Feature selection is fundamental in a number of different tasks such as classification [9], tracking [2], image processing [14], conceptual learning, and many others [15,11]. In recent times, the growing importance of knowledge discovery and data-mining approaches in practical applications has made the feature selection problem quite an attractive topic, especially when mining knowledge from real-world databases or warehouses containing not only a huge number of records, but also a significant number of features that are not always relevant for the task at hand.
Graph Theory and Mosaicing: Graph-theoretic concepts have been applied to mosaicing by many researchers. Of particular interest is the research done by Sawhney et al. [13] and Kang et al. [10]. Both employ graph-theoretic methods to build coherent mosaics. Sawhney et al. [13] use topology determination and local and global alignment parameters to build consistent mosaics. Their technique exploits constraints introduced by non-consecutive yet spatially neighboring frames. To address the same problem of global consistency in the mosaic, Kang et al. [10] build a frame graph representing the temporal and spatial connectivity of the frames. Then, for each node of the frame graph, they locate a number of grid points of the mosaic. Hence, each node has a list of grid points, and each grid point has a list of correspondences to other grid points located in other nodes. It is then demonstrated that the problem of global consistency translates to finding the optimal path in the resulting graph.
The rest of the paper is organized as follows: the maximum independent set problem is described in Section 2.1. The complexity analysis of the frame selection problem, its mapping into the maximum independent set problem, and a greedy solution are formulated in Section 2.2; an intuitive method of solving the feature selection problem is presented in Section 2.3. Image mosaicing is described in Section 3. Finally, we discuss the results in Section 4 and conclude in Section 5.
2 Graph-Theoretic Approach In this section, we first analyze the complexity of the frame selection problem, and map it to the maximum independent set problem to find possible solutions. Secondly, we
also provide an intuitive understanding of how the feature selection problem can be solved in a similar manner.

2.1 Maximum Independent Set Problem

Before proceeding, let us first define the maximum independent set problem. In graph theory, an independent set in a graph G = (V, E) is defined as a set of vertices V' ⊆ V such that for every two vertices in V' there is no edge connecting the two. Therefore, each edge in the graph is incident to at most one vertex in the set. The size of an independent set is the number of vertices it contains. A maximum independent set is a largest independent set for a given graph, and the problem of finding such a set is called the maximum independent set problem. This problem is known to be NP-hard [5]. Many efficient approximation algorithms for the maximum independent set can be found in the literature. Hence, if some problem can be mapped to the maximum independent set problem, these existing algorithms can be used to solve it.

2.2 Frame Selection Problem

In this section, we first define the problem in mathematical terms and analyze its complexity. Then, we derive a near-optimal solution by mapping our problem to the maximum independent set problem.

Mathematical Model. In order to analyze the problem of determining which frames in the sequence should be used for mosaicing, we first need to model the problem in mathematical terms. We define a set F of frames, and for each pair of frames an inter-frame overlap γ(fi, fj) ∈ [0, 1], ∀ fi, fj ∈ F, i ≠ j. The inter-frame overlap measures the percentage of overlap between two frames. In addition, we need a merit which measures the efficiency of a given frame subset F' ⊆ F. We define the merit as

M_{F'} = \frac{k}{\sqrt{k + (k-1)\bar{\gamma}}},    (1)

where k is the number of frames in F' and \bar{\gamma} is the average inter-frame overlap for frames in F'. The above formulation implies that (i) a higher number of frames k is more desirable, but (ii) the average inter-frame overlap should be as low as possible. This formulation fulfils our definition of a good frame subset because it maximizes the information content while minimizing the number of redundant frames, thus reducing the processing time. In other words, we need to find the subset of frames which gives the highest merit possible.

Complexity Analysis. Since our problem is an optimization problem, we have to analyze the complexity of its corresponding decision problem first. The corresponding decision problem can be formally stated as:
Frame Selection Problem (FSP):
INSTANCE: A finite frame set F, a set of inter-frame overlaps γ(fi, fj) ∈ [0, 1], ∀ fi, fj ∈ F, i ≠ j, an integer k, and a merit constraint B ∈ Z^+.
QUESTION: Does there exist an F' ⊆ F with |F'| = k such that

M_{F'} = \frac{k}{\sqrt{k + (k-1)\bar{\gamma}}} \geq B,

where |F'| is the number of frames in F' and \bar{\gamma} is the average inter-frame overlap for F'?

Theorem 1. FSP is NP-Complete in the strong sense.

Proof. We prove this by (i) showing that the problem is in NP, and (ii) reducing the Independent Set problem to FSP in polynomial time, which implies that FSP is NP-Complete.
Clearly, FSP is in NP. If an oracle claims that an instance F' is a yes instance, we can verify the claim in polynomial time by simply calculating M_{F'} = k / \sqrt{k + (k-1)\bar{\gamma}} and checking whether M_{F'} \geq B.
To prove that FSP is NP-Complete, we show that we can set up a polynomial reduction from the Independent Set problem, which is already known to be NP-Complete [5], to FSP. The Independent Set (IS) problem is formally stated as follows:
INSTANCE: A graph G = (V, E) and an integer m.
QUESTION: Does G have an independent set G' with |G'| ≥ m?
We reduce IS to FSP by the following procedure:
1. For an instance I of IS, create an instance I' of FSP such that for every vi ∈ I we create a corresponding fi, with B = \sqrt{k} and k = m.
2. Set γ(fi, fj) = 1 in I' if vi and vj in I are adjacent, and γ(fi, fj) = 0 otherwise.
Since we need n steps to create the corresponding frames in FSP and n² − n steps to build the inter-frame overlap matrix, the complexity of this reduction is O(n²), where n is the number of vertices in the graph G of the Independent Set instance. To complete the proof, we show that yes instances of IS map to yes instances of FSP and vice versa.
Yes instances of IS map to yes instances of FSP: Assume we have a yes instance of IS, i.e., an independent set V' with |V'| ≥ m exists for a given graph G. By our reduction, the corresponding FSP instance is such that B = \sqrt{k}, k = m, and γ(fi, fj) = 1 if and only if vi and vj were adjacent in G, otherwise γ(fi, fj) = 0. Now note that the merit constraint B can only be satisfied if we select the frames corresponding to V', because only then will we have merit M_{F'} = k / \sqrt{k + (k-1) \cdot 0} = \sqrt{k} \geq B. Hence yes instances of IS map to yes instances of FSP.
Yes instances of FSP map to yes instances of IS:
Fig. 1. Example reduction of IS to FSP where the frame set is F = {f1, f2, f3, f4, f5}: the 0-1 inter-frame overlap matrix encodes the adjacency of the vertices v1, ..., v5 (1 if adjacent, 0 otherwise), with constraints k = m and B = \sqrt{k}.
Assume there exists a yes instance of FSP, i.e., a k-frame subset F' of F exists which satisfies M_{F'} = k / \sqrt{k + (k-1)\bar{\gamma}} \geq B = \sqrt{k}. This can only be true if \bar{\gamma} = 0, which implies that γ(fi, fj) = 0 for all fi, fj ∈ F'. Therefore there exists a corresponding k-vertex independent set V' in IS. Since m = k, an independent set of size m exists, implying that this is a yes instance of IS. Therefore, yes instances of FSP map to yes instances of IS.
Hence, our proof that there is a polynomial reduction from IS to FSP, with yes instances mapping to yes instances in both directions, is complete, implying that FSP is NP-Complete. Figure 1 gives an example reduction of IS to FSP.

Theorem 2. The original frame selection optimization problem is NP-hard.

Proof. A problem is NP-hard if its corresponding decision problem is NP-Complete. Given that FSP is NP-Complete, we conclude that the original optimization problem of frame selection is NP-hard.

Greedy Solution. Having proved that the problem is NP-hard, an optimal polynomial-time algorithm cannot be guaranteed; hence, we need an efficient approximate algorithm. Better still, we can map our problem onto an existing, well-studied problem for which efficient algorithms are available. In the previous section, we saw that the decision versions of our problem and of the Maximum Independent Set problem are reducible to each other. Observe that in both cases we are trying to maximize the number of nodes subject to some constraint (imposed by the merit in our problem, and by independence in the Maximum Independent Set problem). But while the Maximum Independent Set problem has a graph composed of 0-1 edge values, our problem may have any edge value between 0 and 1, signifying the percentage of overlap between frames. If we set a threshold on the overlap values so that values larger than or equal to the threshold are mapped to one, and mapped to zero otherwise, then our problem reduces to the Maximum Independent Set problem. Therefore, the Maximum Independent Set problem is a special case of our problem. Setting a threshold is intuitively correct because it signifies that, of two frames having an overlap value larger than the threshold, only one should be used for mosaicing.
Hence, the size of the subset of frames to be used for mosaicing depends on the threshold: a large threshold value means that few frames will be used for mosaicing, while a small threshold value results in a larger number of frames being used. In our experiments, we normally set this value close to sixty percent.
Since the Maximum Independent Set problem has been studied in depth, we have a range of algorithms to choose from. To approximate the independent set, we use the simple yet very efficient method known as the Minimum-Degree Greedy algorithm [6]. Minimum-Degree Greedy operates as a sequence of iterations: in each iteration, a vertex is selected and added to the solution set (which is initially empty), and that vertex as well as its neighbors are removed from the graph. The iterations continue on the remaining graph and stop when the whole graph has been exhausted. The pseudo-code of Minimum-Degree Greedy, where d(v) denotes the degree of vertex v and N(v) its set of neighbors, is:

Minimum-Degree Greedy (G)
1: I ← ∅
2: while G ≠ ∅ do
3:   choose v such that d(v) = min_{w ∈ G} d(w)
4:   I ← I ∪ {v}
5:   G ← G − ({v} ∪ N(v))
6: end while

The algorithm can be implemented in time linear in the number of edges and vertices. Furthermore, Halldórsson and Radhakrishnan [6] prove that this method achieves a performance ratio of (Δ + 2)/3 for approximating independent sets in graphs with degree bounded by Δ, the maximum degree of the graph, and a performance ratio of (2d̄ + 3)/5 on graphs where d̄ is the average degree. Here the performance ratio is defined as p_A = max_G α(G)/A(G), where A is the algorithm in question, A(G) is the size of the solution obtained by algorithm A on graph G, and α(G) is the actual size of the maximum independent set.
Hence, in short, our algorithm proceeds by first computing the overlap matrix for the frames, applying a threshold to the values of the matrix, and then using the Minimum-Degree Greedy algorithm to estimate the subset of frames to use in mosaicing; a sketch of this procedure is given below. To demonstrate the usefulness of our method, we have built a mosaicing system which makes use of our method for frame selection. Details can be found in Section 3.
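As a concrete illustration (ours, not the authors' implementation), the following Python sketch ties together the merit of Eq. (1), the thresholding of the overlap matrix, and the Minimum-Degree Greedy approximation; the 0.6 threshold mirrors the roughly sixty percent value used in our experiments, and the toy overlap matrix is an assumed example.

import numpy as np

def minimum_degree_greedy(adj):
    """Approximate a maximum independent set of an undirected graph.
    adj: symmetric boolean adjacency matrix (n x n), no self-loops."""
    remaining = set(range(adj.shape[0]))
    independent = []
    while remaining:
        # pick the remaining vertex of minimum degree within the remaining subgraph
        v = min(remaining, key=lambda u: sum(bool(adj[u, w]) for w in remaining))
        independent.append(v)
        # remove v and all of its neighbors from the graph
        remaining -= {w for w in remaining if adj[v, w]} | {v}
    return independent

def merit(overlap, subset):
    """Merit of Eq. (1): k / sqrt(k + (k - 1) * average pairwise overlap of the subset)."""
    k = len(subset)
    if k < 2:
        return float(k)
    gamma_bar = np.mean([overlap[i, j] for i in subset for j in subset if i < j])
    return k / np.sqrt(k + (k - 1) * gamma_bar)

def select_frames(overlap, threshold=0.6):
    """Threshold the overlap matrix into a 0-1 frame graph and pick frames."""
    adj = overlap >= threshold
    np.fill_diagonal(adj, False)
    return minimum_degree_greedy(adj)

# toy 5-frame overlap matrix (assumed values)
overlap = np.array([[1.0, 0.8, 0.3, 0.1, 0.0],
                    [0.8, 1.0, 0.7, 0.2, 0.1],
                    [0.3, 0.7, 1.0, 0.65, 0.2],
                    [0.1, 0.2, 0.65, 1.0, 0.7],
                    [0.0, 0.1, 0.2, 0.7, 1.0]])
chosen = select_frames(overlap)
print(chosen, merit(overlap, chosen))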
2.3 Feature Selection Problem

Feature selection is a process commonly used in machine learning, wherein a subset of the features available from the data is selected for classification. The objective of feature subset selection in machine learning is to reduce the number of features used to characterize a data set so as to improve a learning algorithm's performance on a given task. A good feature subset is one which contains features highly correlated with the class, but uncorrelated with each other. Irrelevant features should be ignored because they will have low correlation with the class; redundant features should be screened out as they will be highly correlated with one or more of the remaining features. The acceptance of a feature will depend on the extent to which it predicts classes in areas of the instance space not already predicted by other features. Thus, the process of feature subset selection involves identifying and removing as much irrelevant and redundant information as possible. This has the effect of reducing the dimensionality of the data, thus allowing learning algorithms to operate faster and more effectively, and improving the accuracy of classification.
We therefore want to select a subset of the available features which minimizes the inter-feature correlations but maximizes the feature-class correlation. Intuitively, if we represent the features as nodes of a graph, and mark an edge between two nodes if they have a significantly high inter-feature correlation, then we have reduced this problem to the maximum independent set problem. This is because, by finding the maximum independent set of the graph, we have found a maximum-size subset of features which are uncorrelated with each other. Hence, intuitively, the maximum independent set problem can also be used to solve the problem of feature selection.
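As a rough illustration (again ours, not the authors'), the same greedy routine from the previous sketch can be reused on a feature-correlation graph; the 0.8 correlation threshold is an assumed value, and the feature-class correlation criterion discussed above is omitted for brevity.

import numpy as np

def select_features(X, corr_threshold=0.8):
    """X: (n_samples, n_features) data matrix. Returns indices of a mutually
    uncorrelated feature subset, using minimum_degree_greedy from the sketch above."""
    corr = np.abs(np.corrcoef(X, rowvar=False))  # |correlation| between feature pairs
    adj = np.nan_to_num(corr) >= corr_threshold  # edge = highly correlated pair
    np.fill_diagonal(adj, False)
    return minimum_degree_greedy(adj)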
3 Image Mosaicing

In this section, we build a mosaicing system which makes use of the frame selection method derived in Section 2.2 in order to demonstrate its efficacy. Our algorithm essentially consists of two steps: a subset selection step, and a reconstruction step based on a maximum likelihood estimator that minimizes accumulating errors. Results are evaluated using real video sequences.

3.1 Method Description

Building the Inter-frame Overlap Matrix: Given a frame set F, we first build the inter-frame overlap matrix. This is done by finding the homography based on matched image features. We use SIFT features [12] for this purpose. Since these matches might contain many outliers, giving an inaccurate homography, we use the RANSAC [4] robust estimation algorithm to simultaneously estimate a homography and select a set of matches consistent with it. Lens distortion is removed by applying the method of Devernay [3]. The estimate is further refined using a non-linear optimization technique [8]. Given a set of image correspondences xi ↔ x'i, we need to estimate a corrected set of correspondences x̂i ↔ x̂'i which play the role of true measurements. Hence, the maximum likelihood estimate of the homography and of the set of correspondences xi ↔ x'i is the homography Ĥ and the corrected set of correspondences x̂i ↔ x̂'i that minimize the cost

C = \sum_i \left[ d^2(x_i, \hat{x}_i) + d^2(x'_i, \hat{x}'_i) \right].    (2)

The cost function is minimized using the Levenberg-Marquardt algorithm.
Frame Graph: We then build the frame graph from this matrix. A node in the frame graph represents a frame, while an edge between two nodes signifies that the two nodes
have significant overlap area. In other words, an edge is added between two nodes if the overlap is above a set threshold value.
Finding the Maximum Independent Set: Once the frame graph is built, we find the maximum independent set of the graph by using the Minimum-Degree Greedy algorithm, and use this subset for mosaicing.
Alignment of Frames: Alignment of the frames is done as follows. A frame is chosen as the reference frame, and all the other frames are aligned to this frame by concatenating the intervening homographies. Hence, for instance, frame 0 and frame 4 would be aligned by the homography H4,0 = H4,3 H3,2 H2,1 H1,0.
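A minimal sketch of this alignment step, assuming OpenCV's SIFT implementation: pairwise homographies are estimated with ratio-test matching and RANSAC, then chained to the reference frame. The 0.7 ratio and 3-pixel RANSAC threshold are illustrative choices, and the lens-distortion correction and the Levenberg-Marquardt refinement of Eq. (2) are omitted.

import cv2
import numpy as np

def pairwise_homography(img_a, img_b):
    """Homography H such that points of img_a map into img_b (x_b ~ H x_a)."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    matches = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
    good = [m for m, n in matches if m.distance < 0.7 * n.distance]  # ratio test
    src = np.float32([kp_a[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H

def chain_to_reference(frames, ref=0):
    """H[i] maps frame i into the reference frame by concatenating pairwise homographies."""
    H = {ref: np.eye(3)}
    for i in range(ref + 1, len(frames)):
        H[i] = H[i - 1] @ pairwise_homography(frames[i], frames[i - 1])
    return H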
Fig. 2. An example frame graph with its corresponding independent set marked by red vertices
Re-projection Surface: Once the images have been aligned, the frames can be re-projected to any surface. Simple averaging or a temporal median filter on each mosaic pixel can be used for the overlapping area. The latter technique has the advantage of eliminating independently moving objects, which would appear blurry if simple averaging were used. For this reason, we use median filtering (a sketch of this compositing step is given below).
Consistency Issues: Our scheme avoids the camera loop-back problem, where the camera loops back on itself. Since all the frames are used for mosaicing in current systems, if the camera loops back, the new frames are still used even though they may not contribute significantly to the mosaic. Since error accumulates with each concatenation of the homographies, the old frames and the new frames would be poorly registered. In our case, this simply cannot happen, because the frames resulting from loop-back are discarded and are not used for mosaicing.
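A minimal sketch of the median compositing described above (ours, not the authors' code); aligned frames are assumed to have been warped onto the mosaic grid, with uncovered pixels marked as NaN.

import numpy as np

def composite_median(warped_frames):
    """warped_frames: (num_frames, H, W) float array, NaN where a frame does not
    cover a mosaic pixel. The per-pixel temporal median suppresses independently
    moving objects that simple averaging would blur."""
    return np.nanmedian(np.asarray(warped_frames, dtype=np.float64), axis=0)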
4 Results

Several experiments were performed on real data. As described in Section 3, we use the method of Devernay [3] to correct lens distortion, and apply median filtering on the overlapping regions. Applying median filtering, however, introduces intensity differences in the resulting mosaic, which is otherwise well registered. We used a threshold value close to sixty
percent in our experiments. In order to obtain our data, we use a SONY SNC-RZ30N PTZ camera with an image resolution of 320 × 240, for which the ground truth rotation angles are known. Image features and correspondences were obtained using SIFT [12]. Visual inspection and the reduction of the average frame overlap are the only means of assessing the quality of the applied method.
Fig. 3. (a) This video sequence contained 27 frames; only 7 frames were chosen for mosaicing. (b) Only eight frames were used for mosaicing; the original video sequence contained eighteen frames.
Fig. 4. This video sequence contained 25 frames; only 8 frames were selected and used for mosaicing. The intensity differences in this mosaic are because of using median filtering on the overlapping regions.
Figure 3(a) shows an outdoor sequence. The sequence contained a total of 27 frames with a mean overlap area of 52.07%. Only 7 frames were selected, using the method described in Section 2, for building the mosaic. The reduced mean overlap area was only 28.49%. A total of 18 frames were captured indoors for another sequence, shown in Figure 3(b). Only 8 frames were selected for mosaic building. The mean overlap area was reduced from 65.82% to 30.44%. Finally, our last sequence, consisting of 25 frames, was taken inside a lab. The frame selection method reduced the number of frames to be used for mosaicing to only 8. The mean frame overlap area was also reduced from 57.04% to 35.41%. The resultant mosaic is shown in Figure 4. As is clear from these results, our method efficiently selects a good subset for use in mosaicing, and the mosaicing step builds globally consistent image mosaics. Based
on the results, we are able to reduce the number of frames used for mosaicing down to ∼40% of the original video frames, and the average inter-frame overlap area drops from ∼57% to ∼30%.
5 Conclusion We have demonstrated that we can map the frame selection problem and feature selection problem into the maximum independent set problem. Through this mapping, we can not only analyze the complexity of the problem at hand, but also use the existing algorithms from the graph-theory literature to solve the problem. In order to demonstrate the efficacy of our solution, we built a system for mosaicing, which uses our method of frame selection. We use real video sequences to test our method. We have found that our method significantly reduces the number of frames for use in mosaicing, hence decreasing the processing time while still building globally consistent mosaics.
References
1. Agapito, L.D., Hayman, E., Reid, I.: Self-calibration of rotating and zooming cameras. Int. J. Comput. Vision 45(2), 107–127 (2001) 2. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell. 24(5), 603–619 (2002) 3. Devernay, F., Faugeras, O.D.: Straight lines have to be straight. Machine Vision and Applications 13(1), 14–24 (2001) 4. Fischler, M., Bolles, R.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24(6), 381–395 (1981) 5. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co, New York (1979) 6. Halldórsson, M., Radhakrishnan, J.: Greed is good: approximating independent sets in sparse and bounded-degree graphs. In: STOC 1994: Proceedings of the twenty-sixth annual ACM symposium on Theory of computing, pp. 439–448. ACM Press, New York (1994) 7. Hartley, R.I.: Self-calibration of stationary cameras. Int. J. Comput. Vision 22(1), 5–23 (1997) 8. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004) 9. Junejo, I., Javed, O., Shah, M.: Multi feature path modeling for video surveillance. In: Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004) (2004) 10. Kang, E.-Y., Cohen, I., Medioni, G.G.: A graph-based global registration for 2d mosaics. In: ICPR, pp. 1257–1260 (2000) 11. Lowe, D.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60(2), 91–110 (2004) 12. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004) 13. Sawhney, H.S., Hsu, S., Kumar, R.: Robust video mosaicing through topology inference and local to global alignment. In: Burkhardt, H., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1407, pp. 103–119. Springer, Heidelberg (1998) 14. Shi, J., Malik, J.: Motion segmentation and tracking using normalized cuts. In: Proc. IEEE ICCV, IEEE Computer Society Press, Los Alamitos (1998) 15. Shi, J., Tomasi, C.: Good features to track. In: CVPR 1994. IEEE Conference on Computer Vision and Pattern Recognition, Seattle (June 1994)
Video Mosaicing Based on Structure from Motion for Distortion-Free Document Digitization Akihiko Iketani1,2 , Tomokazu Sato1,2 , Sei Ikeda2 , Masayuki Kanbara1,2 , Noboru Nakajima1 , and Naokazu Yokoya1,2 1 2
NEC Corporation, 8916-47 Takayama, Ikoma, Nara 630-0101, Japan Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0192, Japan
Abstract. This paper presents a novel video mosaicing method capable of generating a geometric distortion-free mosaic image using a hand-held camera. For a document composed of curved pages, mosaic images of virtually flattened pages are generated. The process of our method is composed of two stages: a real-time stage and an off-line stage. In the real-time stage, image features are automatically tracked on the input images, and the viewpoint of each image as well as the 3-D position of each image feature are estimated by a structure-from-motion technique. In the off-line stage, the estimated viewpoint and 3-D position of each feature are refined and utilized to generate a geometric distortion-free mosaic image. We demonstrate our prototype system on curved documents to show the feasibility of our approach.
1 Introduction
Recently, digital cameras and cellular phones with a built-in camera have become so popular that there is a strong demand for document digitization using these portable imaging devices, which allow us to scan and send documents anytime, anywhere. Compared with flat-bed scanners, however, these portable imaging devices have several problems to be solved. The most critical problem is the low resolution of the image acquired with these devices. Various techniques have been proposed to solve this low resolution problem. Among them, video mosaicing is one of the most promising solutions. In video mosaicing, partial images of the document are captured as a video sequence, and multiple frame images are stitched seamlessly into one large, high resolution image, called a mosaic image. Szeliski [1] has developed a method using an 8-DOF projective image transformation, parameterized by a matrix called a homography. In this method, for every pair of consecutive frames, the homography which minimizes the sum of squared differences between the two frames is estimated. A mosaic image is constructed by warping all the images to a reference frame (in general, the first frame). After his work, various extensions to this method have been proposed [2,3,4,5,6]. One of the major extensions is the use of image
Fig. 1. Mosaic image with geometric distortion: (a) target document; (b) mosaic image with distortion due to curvature.
features instead of all the pixels in the images in order to reduce the computational cost [3,4,6]. These methods, however, cannot avoid geometric distortion induced by the curvature of the target document. The homography-based methods are only applicable when the target is a plane, or when the optical center of the camera is approximately fixed throughout the video capturing. If the target is a curved surface, this assumption no longer holds, and thus the images in the resultant mosaic image will be misaligned. A mosaic image for the left page of the book in Figure 1(a) generated by a homography-based method is shown in Figure 1(b). Distortion due to the curvature is evident in the curved lines of text and the contour lines of the page. In the domain of document analysis, various methods to remove such geometric distortion have been proposed. Cao et al. [7] assume the target document is composed of horizontal text lines, and generate a distortion-free image by warping the captured image so that all the baselines in the image are parallel to one another. Brown and Tsoi [8] assume the contour of the page is captured in the image, and warp the image so that the contour is transformed into a rectangle. Although these methods are capable of removing geometric distortion, they can only be applied to targets which fulfill the underlying assumptions. We present a novel video mosaicing method capable of generating a geometric distortion-free mosaic image. Our work is based on a structure-from-motion technique, which recovers camera parameters, pose estimates, and sparse 3-D scene geometry from an image sequence in real time. Using the 3-D geometry recovered by the algorithm, unwrapped mosaic images of virtually flattened pages are generated for a document composed of curved pages.
2 Video Mosaicing Based on Structure from Motion
This section describes a method for generating a high-resolution and geometric distortion-free mosaic image from a video sequence. The flow of the proposed method is given in Figure 2.
Fig. 2. The flow of the proposed method:
(1) Real-time stage (iterated from the first frame to the last frame): (a) camera parameter estimation by tracking features; (b) generation of a preview image and instructions for the user.
(2) Off-line stage (iterated until convergence): (c) detection of reappearing features; (d) refinement of estimated camera parameters; (e) surface fitting to the 3-D point cloud (if the system is in curved mode).
(f) Mosaic image generation.
In the real-time stage, the system carries out the 3-D reconstruction process frame by frame by tracking image features (Figure 2(a)). A coarse preview of the generated mosaic image is rendered in real time (Figure 2(b)). After the real-time stage, the system proceeds to the off-line stage. In this stage, first, reappearing image features are detected in the stored video sequence (Figure 2(c)), and the camera parameters and 3-D positions of features estimated in the real-time stage are refined by global optimization (Figure 2(d)). Surface parameters are also estimated by fitting a parameterized 3-D surface to the estimated 3-D point cloud (Figure 2(e)). After some iterations, a high-resolution and geometric distortion-free mosaic image is generated (Figure 2(f)).
The assumptions made in the proposed method are that the intrinsic camera parameters are fixed and calibrated in advance to correct lens distortion. It is also assumed that the curve of the target lies along one direction and that its curvature changes smoothly along this direction. In the following sections, first, the extrinsic camera parameters and an error function used in the proposed method are defined. The stages (1) and (2) in Figure 2 are then described in detail.

2.1 Definition of Extrinsic Camera Parameter and Error Function
In the proposed method, the coordinate system is defined such that an arbitrary point S_p = (x_p, y_p, z_p) in the world coordinate system is projected to the coordinate x_fp = (u_fp, v_fp) in the f-th image plane. Defining the 6-DOF extrinsic camera parameters of the f-th image as a 3 × 4 matrix M_f, the relationship between the 3-D coordinate S_p and the 2-D coordinate x_fp is expressed as follows:

(a u_fp, a v_fp, a)^T = M_f (x_p, y_p, z_p, 1)^T,    (1)

where a is a parameter. In the above expression, x_fp is regarded as a coordinate on the ideal camera with a focal length of 1 and without radial distortion induced by the lens. In practice, however, S_p is actually projected to the position x̂_fp = (û_fp, v̂_fp) in the real image, which is given by transferring x_fp using the known intrinsic camera parameters including focus, aspect, optical center and
distortion parameters. In the rest of this paper, this transformation from x_fp to x̂_fp is omitted for simplicity.
Next, the error function used for 3-D reconstruction is described. In general, the projected position x_fp of S_p in the f-th image frame does not coincide with the actually detected position x'_fp = (u'_fp, v'_fp), due to errors in feature detection, extrinsic camera parameter estimation, and 3-D feature position estimation. In this paper, the squared error E_fp is defined as an error function for the feature p in the f-th frame as follows:

E_fp = |x_fp − x'_fp|².    (2)

2.2 Real-Time Stage for Image Acquisition
As shown earlier in Figure 2, the real-time stage consists of two iterative processes for each frame. First, the extrinsic camera parameters are estimated by tracking features (step (a)). A coarse preview of the mosaic image is generated and updated every frame (step (b)). The following describes each process of the real-time stage.
Step (a): Camera parameter estimation by tracking features. The extrinsic camera parameters and the 3-D position S_p of each feature point are estimated by an iterative process. This process is basically an extension of the structure-from-motion method proposed by Sato et al. [9]. In the first frame, assuming that the image plane of the first frame is approximately parallel to the target, the rotation and translation components of M_f are set to the identity matrix and 0, respectively. For each feature point detected in the first frame, its 3-D position S_p is set to (u_1p, v_1p, 1), based on the same assumption. Note that these are only initial values, which will be corrected in the refinement process (Figure 2(d)). In the succeeding frames (f > 1), M_f is estimated by iterating the following steps toward the last frame.
Feature point tracking: All the image features are tracked from the previous frame to the current frame by using standard template matching with the Harris corner detector [10]. The RANSAC approach [11] is also employed to eliminate outliers.
Extrinsic camera parameter estimation: The extrinsic camera parameters M_f are estimated using the tracked positions (u_fp, v_fp) and the corresponding 3-D positions S_p = (x_p, y_p, z_p). Here, the extrinsic camera parameters are obtained by minimizing \sum_p E_fp, the sum of the error function defined in Eq. (2), using the Levenberg-Marquardt method (a sketch of this minimization is given after step (b) below). For the 3-D position S_p of the feature point p, the result estimated in the previous iteration is used.
Estimation of 3-D feature position: For every feature point p in the current frame, its 3-D position S_p = (x_p, y_p, z_p) is refined by minimizing the error function \sum_{i=1}^{f} E_ip.
Addition and deletion of feature points: In order to obtain accurate estimates of the camera parameters, good features should be selected. The set of features to be tracked is updated by evaluating the reliability of the features [9].
By iterating the above steps, the extrinsic camera parameters M_f and the 3-D feature positions S_p are estimated.
Step (b): Generation of preview image and instruction for user. In parallel with the camera parameter estimation, a coarse preview of the mosaic image is rendered. This preview is updated in real time, using the captured images and the estimated camera parameters. With this preview, the user can easily recognize which part of the document is still left to be captured, and figure out where to move the camera in the subsequent frames.
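A small sketch of the extrinsic parameter refinement in step (a), under simplifying assumptions (ideal camera with focal length 1, rotation parameterized as a Rodrigues vector); this is our illustration with SciPy's Levenberg-Marquardt solver, not the authors' implementation.

import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def reprojection_residuals(pose, points_3d, observed_2d):
    """pose = (rx, ry, rz, tx, ty, tz); points_3d: (N, 3); observed_2d: (N, 2)."""
    R = Rotation.from_rotvec(pose[:3]).as_matrix()
    cam = points_3d @ R.T + pose[3:]      # points in the camera frame
    proj = cam[:, :2] / cam[:, 2:3]       # ideal perspective projection
    return (proj - observed_2d).ravel()   # stacked residuals; their sum of squares is sum_p E_fp

def refine_extrinsics(pose0, points_3d, observed_2d):
    result = least_squares(reprojection_residuals, pose0,
                           args=(points_3d, observed_2d), method='lm')
    return result.x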
2.3 Off-Line Stage for Parameter Refinement and Target Shape Estimation
This section describes the process to globally optimize the estimated extrinsic camera parameters and 3-D feature positions, and to approximate the shape of the target by fitting a parameterized surface.
Step (c): Detection of Reappearing Features. Due to the camera motion, most image features come into the image, move across it toward the edge, and disappear. Some features, however, reappear in the image, as shown in Figure 3. In this step, these reappearing features are detected, and distinct tracks belonging to the same reappearing feature are linked to form a single long track. This gives tighter constraints among camera parameters in temporally distinct frames, and thus makes it possible to suppress cumulative errors in the global optimization step described later.
Reappearing features are detected by examining the similarity of the patterns among features belonging to distinct tracks. First, the templates of all the features are projected to the fitted surface (described later). Next, feature pairs whose distance in 3-D space is less than a given threshold are selected and tested with the normalized cross correlation function. If the correlation is higher than a threshold, the feature pair is regarded as reappearing features (see Figure 3).
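As an illustration of the reappearing feature test (not the authors' code), two tracks are linked when their 3-D distance is small and the normalized cross correlation of their distortion-compensated templates exceeds a threshold; both threshold values below are assumptions, since the paper does not state them.

import numpy as np

def normalized_cross_correlation(a, b):
    a = a.astype(np.float64).ravel()
    b = b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def is_reappearing(template_i, template_j, dist_3d, dist_thresh=0.05, ncc_thresh=0.8):
    """template_i/j: perspective-corrected image patches; dist_3d: their 3-D distance."""
    return dist_3d < dist_thresh and \
        normalized_cross_correlation(template_i, template_j) > ncc_thresh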
Fig. 3. Detection of re-appearing features. (a) camera path, posture and 3-D feature positions, (b) temporally distinct frames in the input video, (c) templates of the same feature in different frame images, (d) templates without perspective distortion.
Fig. 4. Target shape estimation by polynomial surface fitting. The figure illustrates V_min and V_max, the directions of minimum and maximum principal curvature of the target; the point S_p and its local neighborhood R_p; the projection plane P with normal N; the axes V_1 and V_2 of the projected points; the fitted curve y = f(x); and the mosaic coordinates (m, n).
Step (d): Global Optimization of 3-D Reconstruction. Since the 3-D reconstruction process described in Section 2.2 is iterative, its result is subject to cumulative errors. In this method, by introducing the bundle-adjustment framework [12], the extrinsic camera parameters and 3-D feature positions are globally optimized so as to minimize the sum of re-projection errors E_all = \sum_f \sum_p E_fp.
Step (e): Target Shape Estimation by Surface Fitting. In this step, assuming the curve of the target lies along one direction, the target shape is estimated using the 3-D point cloud optimized in the previous step (d). First, as shown in Figure 4, the principal direction of curvature is computed from the 3-D point cloud. Next, the 3-D position of each feature point is projected to a plane perpendicular to the direction of minimum principal curvature. Finally, a polynomial equation of variable order is fitted to the projected 2-D coordinates, and the target shape is estimated.
Let us consider, for each 3-D point S_p, a point cloud R_p which consists of the feature points lying within a certain distance from S_p. First, the directions of maximum and minimum curvature are computed for each R_p using a local quadratic surface fit. Then, a voting method is applied to determine the dominant direction V_min = (v_mx, v_my, v_mz) of minimum principal curvature for the whole target. Next, the 3-D position S_p of each feature point is projected to a plane whose normal vector N coincides with V_min, i.e. P(x, y, z) = v_mx x + v_my y + v_mz z = 0. The projected 2-D coordinate (x̄_p, ȳ_p) of S_p is given as follows:

\begin{pmatrix} \bar{x}_p \\ \bar{y}_p \end{pmatrix} = \begin{pmatrix} V_1 \\ V_2 \end{pmatrix} S_p,    (3)

where V_1 is a unit vector parallel to the principal axis of inertia of the projected 2-D coordinates (x̄, ȳ), and V_2 is a unit vector which is perpendicular to V_1
and V_min, i.e. V_2 = V_1 × N. Finally, the shape parameter (a_0, a_1, ..., a_q) is obtained by fitting the following variable-order polynomial equation to (x̄, ȳ):

\bar{y} = f(\bar{x}) = \sum_{i=0}^{q} a_i \bar{x}^i.    (4)
Using the geometric AIC [13], the optimal order q in the above equation is determined as the q which minimizes the following criterion:

\text{G-AIC} = J + 2(N(m - r) + q + 1)\epsilon^2,    (5)

where J is the residual, N is the number of points, m is the dimension of the observed data, and r is the number of constraint equations in fitting Eq. (4) to the projected 2-D coordinates. \epsilon, called the noise level, is the average error of the estimated feature positions along the ȳ axis. Here, the order q is independent of m, r, and N, and thus the actual criterion to be minimized is given as follows:

G = J + 2q\epsilon^2.    (6)

In our method, the noise level is approximated as \epsilon = Cl, where l is the average of the depths of the feature points in the camera coordinates of every frame, and C is a constant, which is empirically set to 0.007. In the case of a target with multiple curved surfaces, e.g. the thick bound book shown in Figure 1(a), the target is first divided along a line where the normal vector of the locally fitted quadratic surface varies discontinuously, and the shape parameter is computed for each part of the target.
Step (f): Mosaic Image Generation. Finally, a mosaic image is generated by using the extrinsic camera parameters and the surface shape parameters. Let us consider a 2-D coordinate (m, n) on the unwrapped mosaic image, as shown in Figure 4. Here, the relation between (m, n) and its corresponding 3-D coordinate (x̄, f(x̄), z̄) on the fitted surface is given as follows:

(m, n) = \left( \int_{0}^{\bar{x}} \sqrt{1 + \left\{ \frac{d}{dx} f(x) \right\}^2 } \, dx,\ \bar{z} \right).    (7)

The relationship between the coordinate (x̄, f(x̄), z̄) and its corresponding 2-D coordinate (u_f, v_f) on the f-th image plane is given by the following equation:

\begin{pmatrix} a u_f \\ a v_f \\ a \end{pmatrix} = M_f \begin{pmatrix} V_1 \\ V_2 \\ N \end{pmatrix} \begin{pmatrix} \bar{x} \\ f(\bar{x}) \\ \bar{z} \end{pmatrix}.    (8)

The pixel value at (m, n) on the unwrapped mosaic image is given by computing the average of the pixel values at all the corresponding coordinates (u_f, v_f) in the input images, given by Eqs. (7) and (8).
After an unwrapped mosaic image is generated by the above process, the shade induced by the curved shape of the target is removed. In this process, the following assumptions are made: the background of the target page is white, and the target is illuminated by a parallel light source. In the proposed method, the vertical direction of the mosaic image coordinates (m, n) is defined to coincide with the direction of minimum principal curvature of the target. Thus, under
a parallel light source, the effect of shade is uniform for pixels having the same m coordinate on the mosaic image. If we can assume that, in any column of the mosaic image, there exists at least one pixel belonging to the background, a new pixel value I_new(m, n) after shade removal can be computed as follows:

I_{new}(m, n) = \frac{I_{max}\, I(m, n)}{\max(I(m + u, n + v);\ \forall (u, v) \in W)},    (9)

where W is a rectangular window whose height is larger than its width, e.g. a window of size 5 × 500 pixels, and I_max is the maximum possible intensity value of an image (typically 255).
3 Experiments
We have developed a prototype video mosaicing system which consists of a laptop PC (Pentium-M 2.1 GHz, 2 GB memory) and an IEEE 1394 web-cam (Aplux C104T, VGA, 15 fps). Experiments are performed on two curved targets: a thick bound book with two curved pages, and a curved poster on a cylindrical column. In both experiments, the intrinsic camera parameters are calibrated in advance using Tsai's method [14], and are fixed throughout image capturing.
The thick bound book captured in the first experiment is shown in Figure 5. It is composed of two curved pages: one page with text and the other with pictures and figures. Plus marks (+) were printed at 40 mm grid points on both pages for quantitative evaluation (described later). The target is captured as a video of 200 frames at 7.5 fps. Sampled frames of the captured images are shown in Figure 6. Tracked feature points are depicted with cross marks. The 3-D reconstruction result is shown in Figure 7. The curved line shows the camera path, the pyramids show the camera postures at every 10th frame, and the point cloud shows the 3-D positions of the feature points. The shape of the target estimated after 3 iterations of the reappearing feature detection and surface fitting process is shown in Figure 8(a). The optimal orders of the polynomial surface, which are automatically determined by the geometric AIC [13], are 5 and 4 for the left and right pages, respectively. The unwrapped mosaic image is shown in Figure 8(b). The resolution of the mosaic image is 3200 × 2192.
We evaluate the distortion in the generated mosaic image quantitatively, using the plus marks (+) printed on the target paper at every 40 mm grid position. First, the positions of the plus marks are acquired manually in the generated

Table 1. Distances of adjacent grid points on the mosaic image [pixels (percentage from average)]

page    average        maximum        minimum       std. dev.
left    338.5 (100.0)  348.0 (102.8)  331.0 (97.8)  3.77 (1.11)
right   337.6 (100.0)  345.0 (102.2)  331.1 (98.1)  2.75 (0.81)
Fig. 5. Thick bound book with curved surface
Fig. 6. Sampled frames of input images and tracked features (Book): 1st, 100th, 200th, and 300th frames.
Fig. 7. Estimated extrinsic camera parameters and 3-D positions of features (Book): (a) frontal view; (b) side view.
Fig. 8. (a) Estimated target shape and (b) unwrapped mosaic image (Book).
Fig. 9. Poster on a cylindrical column
Fig. 10. Sampled frames of input images and tracked features (Poster): 1st, 35th, 87th, and 142nd frames.
Fig. 11. Estimated extrinsic camera parameters and 3-D positions of features (Poster): (a) frontal view; (b) side view.
Fig. 12. (a) Estimated target shape and (b) unwrapped mosaic image (Poster).
mosaic image. The distances between adjacent plus marks are then computed in units of pixels. The average, maximum, minimum, and standard deviation of the distances are shown in Table 1. The percentage of each value relative to the average distance is also shown in parentheses. Here, the standard deviation can be considered as the average distortion in the generated image. In this experiment, the average distortions for the left and right pages are only 1.1% and 0.8%, respectively. We can confirm that the geometric distortion has been correctly removed.
The other experiment is performed on a poster on a cylindrical column, shown in Figure 9. Sampled frames of the captured images with tracked feature points and the 3-D reconstruction result are shown in Figures 10 and 11, respectively. The shape of the target estimated after 3 iterations is shown in Figure 12(a). The optimal order of the polynomial surface fitted to the target is 4. The unwrapped mosaic image is shown in Figure 12(b). The resolution of the mosaic image is 3200 × 5161. Although the target has few text features, the distortion has been successfully removed in the resultant image.
The performance of our system measured in the first experiment is as follows: 22 seconds for the initial 3-D reconstruction, 188 seconds for camera parameter refinement and surface fitting, and 410 seconds for generating the final mosaic image.
4 Conclusion
A novel video mosaicing method for generating a high-resolution, geometric distortion-free mosaic image of curved documents has been proposed. With this method, based on 3-D reconstruction, the 6-DOF camera motion and the shape of the target document are estimated. Assuming the curve of the target lies along one direction, the shape model is fitted to the feature point cloud and an unwrapped image is automatically generated. In experiments, a prototype system based on the proposed method has been developed and successfully demonstrated. Our future work is to reduce the computational cost and to implement the algorithm on a mobile PC.
References 1. Szeliski, R.: Image Mosaicing for Tele-Reality Applications. In: Proc. IEEE Workshop on Applications of Computer Vision, pp. 230–236. IEEE Computer Society Press, Los Alamitos (1994) 2. Capel, D., Zisserman, A.: Automated Mosaicing with Super-resolution Zoom. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 885–891. IEEE Computer Society Press, Los Alamitos (1998) 3. Chiba, N., Kano, H., Higashihara, M., Yasuda, M., Osumi, M.: Feature-based Image Mosaicing. In: Proc. IAPR Workshop on Machine Vision Applications, pp. 5–10 (1998) 4. Hsu, C.T., Cheng, T.H., Beuker, R.A., Hong, J.K.: Feature-based Video Mosaicing. In: Proc. IEEE Int. Conf. on Image Processing, vol. 2, pp. 887–890. IEEE Computer Society Press, Los Alamitos (2000)
5. Lhuillier, M., Quan, L., Shum, H., Tsui, H.T.: Relief Mosaicing by Joint View Triangulation. In: IEEE Int. Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 785–790. IEEE Computer Society Press, Los Alamitos (2001) 6. Takeuchi, S., Shibuichi, D., Terashima, N., Tominaga, H.: Adaptive Resolution Image Acquisition Using Image Mosaicing Technique from Video Sequence. In: Proc. IEEE Int. Conf. on Image Processing, vol. 1, pp. 220–223. IEEE Computer Society Press, Los Alamitos (2000) 7. Cao, H., Ding, X., Liu, C.: A Cylindrical Surface Model to Rectify the Bound Document Image. In: Proc. Int. Conf. on Computer Vision, vol. 1, pp. 228–233 (2003) 8. Brown, M.S., Tsoi, Y.C.: Undistorting Imaged Print Materials using Boundary Information. In: Proc. Asian Conf. on Computer Vision, vol. 1, pp. 551–556 (2004) 9. Sato, T., Kanbara, M., Yokoya, N., Takemura, H.: Dense 3-D Reconstruction of an Outdoor Scene by Hundreds-baseline Stereo Using a Hand-held Video Camera. Int. J. of Computer Vision 47(1-3), 119–129 (2002) 10. Harris, C., Stephens, M.: A Combined Corner and Edge Detector. In: Proc. Alvey Vision Conf., pp. 147–151 (1988) 11. Fischler, M.A., Bolles, R.C.: Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. In: Communications of the ACM, vol. 24(6), pp. 381–395. ACM Press, New York (1981) 12. Triggs, B., McLauchlan, P., Hartley, R., Fitzgibbon, A.: Bundle Adjustment - A Modern Synthesis. In: Proc. Int. Workshop on Vision Algorithms, pp. 298–372 (1999) 13. Kanatani, K.: Geometric Information Criterion for Model Selection. Int. J. of Computer Vision 26(3), 171–189 (1998) 14. Tsai, R.Y.: An Efficient and Accurate Camera Calibration Technique for 3D Machine Vision. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 364–374. IEEE Computer Society Press, Los Alamitos (1986)
Super Resolution of Images of 3D Scenes
Uma Mudenagudi, Ankit Gupta, Lakshya Goel, Avanish Kushal, Prem Kalra, and Subhashis Banerjee
Department of Computer Science and Engineering, IIT Delhi, New Delhi, India
[email protected],{pkalra,suban}@cse.iitd.ernet.in
[email protected]
Abstract. We address the problem of super resolved generation of novel views of a 3D scene with the reference images obtained from cameras in general positions; a problem which has not been tackled before in the context of super resolution and is also of importance to the field of image based rendering. We formulate the problem as one of estimation of the color at each pixel in the high resolution novel view without explicit and accurate depth recovery. We employ a reconstruction based approach using MRF-MAP formalism and solve using graph cut optimization. We also give an effective method to handle occlusion. We present compelling results on real images.
1 Introduction
In this paper we address the problem of estimating a single high resolution (HR) image of a 3D scene given a set of low resolution (LR) images obtained from different but known viewpoints, with no restrictions on either the scene geometry or the camera positions. The super resolved image can correspond to one of the known viewpoints in the input set, or even to a novel view. There have been several different approaches to image super resolution, with estimation of high resolution images from multiple low resolution observations related by small 2D motions being by far the most common one [1,2,3]. Most of these approaches assume that the low resolution images can be accurately registered by a 2D affine transformation or a homography, and attempt to super resolve by reversing the degradation caused by blur or defocus and sampling. There has been very little work on super resolution when the scene is 3D and the input cameras are generally positioned. The super resolution problem becomes considerably harder when, because of the depth variation in the scene and the arbitrary placement of the input cameras, there is no simple registration transformation between the input images. To super resolve, one then needs dense depth estimation at each pixel using multiple-view geometry, a problem which has not been entirely solved. Though qualitative estimation of such depth information is possible, the high precision depth recovery required for accurate and high resolution rendering of a 3D scene is still a challenging problem. Super resolution rendering of a novel view of a 3D scene is an even harder problem because one then needs to effectively handle occlusion. For super resolution of one of the input views, the key problem is to determine which pixels/colors
from the other views need to be used for generating each high resolution pixel in the target view. However, for a novel view, the visibility information itself may change significantly, and this needs to be handled.
The main contributions of this paper are as follows:
1. We give an MRF-MAP formulation of super resolution reconstruction of 3D scenes from a set of calibrated LR images obtained from arbitrary 3D positions, and solve it using graph cut optimization.
2. We show that the super resolved novel view synthesis problem can be posed as one of first generating a set of pixels in the source LR views, depending on their color, which are most likely to contribute to a target pixel in the novel view, and then using standard techniques of super resolution to recover the high frequency details. In this sense, our approach is closely related to the novel view synthesis methods of [4,5], which however do not address the issues of super resolution.
3. We give an effective method for handling occlusion using photo-consistency [6]. We demonstrate this with results on real images.
Our approach is motivated by [4,5], where the novel view generation problem has also been formulated as one of estimating the color at each generated pixel, without explicit depth recovery. In [4] the virtual cameras are placed within the scene itself, and the visibility information does not change significantly between the input cameras. The color selection is done by a geometric analysis of the "line of sight". [5] relies on a probabilistic analysis of the possible colors at a target pixel and uses a database of color patches, computed from the input images themselves, as regularization priors. Occlusion is not explicitly handled. We draw insights from these works to formulate a scheme for super resolution rendering of novel views which can explicitly handle occlusion.
In Section 2, we describe the image formation process and the formulation of novel view super resolution using the MRF-MAP formalism. We present an effective method to resolve occlusion using photo-consistency in Section 3. We give the energy minimization using graph cuts in Section 4. In Section 5, we present results to demonstrate the effectiveness of our method. We conclude in Section 6.
2 Novel View Super Resolution

2.1 MRF-MAP Formulation
We are given a set of 2D LR images g1, ..., gn with their 3 × 4 projective camera matrices P1, ..., Pn. Let Ci be the camera center for camera Pi. Let f be the high resolution image seen from the novel view with projection matrix Pcam. The projection matrix Pk of the k-th observed image projects homogeneous 3D points X to homogeneous 2D points xk = λ(x, y, 1)^T [7], i.e., xk = Pk X, where the equality is up to scale. We compute Pcam by first multiplying the matrix of camera internals by the desired magnification, and then multiplying by the external parameters of the novel view in the standard way. The main task is to assign a color to
a high resolution pixel p which is photo-consistent in the input images, i.e., the corresponding pixels in the input LR reference images have similar color, unless occluded [8,9,6]. We need to compute the expected color for a HR pixel p. Ideally, this would require the computation of a dense depth map from the input images to obtain complete registration, and a reliable solution to this problem is still elusive in the computer vision literature. We compute the expected label (color) in a given image using the photo-consistency constraint. We model f as a Markov Random Field (MRF) [10] and use a maximum a posteriori (MAP) estimate as the final solution. The problem of estimating f can be posed as a labeling problem where each pixel is assigned a color label. We formulate the posterior energy function as

E(f, z|g) = \sum_{p \in S} \sum_{k=1}^{n} \alpha_k(p, p'(z)) \left( h(p) * f(p, z) - g_k(p'(z)) \right)^2 + \lambda_s \sum_{p,q \in N} V_{p,q}(f(p), f(q)),    (1)
where S is the set of sites (pixels) in the novel view, z is a depth along the back-projected ray from site p in the novel view camera, f(p, z) is the label at site p and depth z, h(p) is the space-invariant PSF of the camera for the k-th image, p'(z) = Pk(X(z)) is the projection of the back-projected 3D point at depth z onto the k-th LR image, gk(p'(z)) is the expected color label when the point at depth z along the back-projected ray from p is projected to the k-th LR image, αk(p, p'(z)) is a weight associated with the k-th LR image, and Vp,q = min(θ, |f(p) − f(q)|) is a smoothness prior. We compute the weights αk(p, p'(z)) in the following way.
Fig. 1. (a) Limiting the zone of influence of a LR pixel in space. Only those HR pixels whose centers lie inside the circle are influenced by the LR pixel. (b) Zone of influence of LR pixel.
of influence for every low resolution pixel such that the LR pixel contributes to the reconstruction of only those high resolution pixels whose centers lie within the zone of influence (see Figure 1(a) and 1(b)). Specifically, we define the following function to determine whether a LR pixel p′(z) = (x′, y′) contributes to the reconstruction of a high resolution pixel p = (x, y). Let p″(z) = (x″, y″) be the projection, in real coordinates, into the k-th LR image of the 3D point at depth z along the ray back-projected from the pixel p. We define
\alpha_k(p, p'(z)) =
\begin{cases}
1 & \text{if } d\bigl((x', y'), (x'', y'')\bigr) < \theta_1 \\
0 & \text{otherwise}
\end{cases}    (2)
α_k(p, p′(z)) = 1 signifies that the high resolution pixel p is within the zone of influence of the LR pixel p′ in the k-th LR image. Here d(·) is the Euclidean distance function and θ_1 is a threshold. We set the threshold to the σ value of the low resolution blur function; in most of our experiments the spatial blur σ evaluates to 0.45 of a LR pixel. Minimization of the energy function given in Equation 1 over all possible depths and colors is computationally intractable. In what follows we explain a pre-processing step for pruning the potential color labels g_k(p′(z)) for a site p by considering only those depths which are photo-consistent at p [6]. We first outline a strategy for the simpler problem of super resolution of one of the input reference images, and then outline a procedure for novel view super resolution.
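As a concrete illustration of the weight of Equation 2, the following minimal sketch (not the authors' implementation; the function name, the coordinate conventions and the use of the 0.45-pixel σ as the threshold are assumptions based on the text above) tests whether a projected HR pixel falls inside the zone of influence of a LR pixel.

import numpy as np

def alpha_k(lr_pixel, projection, theta1=0.45):
    """Indicator weight of Equation 2 (illustrative sketch).
    lr_pixel:   (x', y') integer LR pixel center in the k-th image
    projection: (x'', y'') real-valued projection of the point X(z) on the
                ray back-projected from the HR pixel p into the same image
    theta1:     radius of the zone of influence, set to the sigma of the LR
                blur (0.45 LR pixels in most of the reported experiments)."""
    d = np.hypot(lr_pixel[0] - projection[0], lr_pixel[1] - projection[1])
    return 1.0 if d < theta1 else 0.0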
2.2 Pruning of Color Labels Using Photo-Consistency
Consider, first, the problem of super resolution of one of the source images. We finely sample the back-projected ray from a site in the target super resolved view over a range Z_min to Z_max. For each z in this range, we project the 3D point X(z) to each of the LR input images and compute

c(p, z) = \{ g_k(p'(z)) \mid 1 \le k \le n \}    (3)
as the set of possible colors corresponding to depth z. For each z we perform a nearest neighbor clustering of c(p, z) in Y CbCr space, with a suitable distance threshold on color differences, and remove all but the top few dominant clusters. If there are no dominant clusters, we remove z from the list of candidate depths at p. For all candidate z values at site p we write

C(p, z) = \{ I(p, z, i) \}_{i = 1, \ldots, m_{pz}}    (4)
Fig. 2. (a) Clusters at a depth z and patches in each of the input images projected from z along the back projected ray from the pixel p in the super resolved novel view. (b) Cameras C1 to C3 see the color of the surface B1 and rest of the cameras see the color of B2 .
where m_pz is the number of dominant clusters, and I(p, z, i) is the set of image indices that belong to the i-th dominant cluster. For each depth z, we define a correlation function

f_{corr}(z) = \frac{\sum_{i=1}^{m_{pz}} corr\bigl(patch(I(p, z, i))\bigr)}{\sum_{i=1}^{m_{pz}} \binom{|I(p, z, i)|}{2}}    (5)
where patch(I(p, z, i)) is the set of 3 × 3 patches around the LR pixels g_k(p′) for k ∈ I(p, z, i), and corr(patch(I(p, z, i))) is the sum of pair-wise correlations between the patches of I(p, z, i). We form a set of candidate depths d_p at site p by choosing the local minima of f_corr(z). The intra-cluster correlation does not guarantee that the candidate depths are correct; one can only infer that these depths satisfy the photo-consistency constraint [6,9]. In this process we obtain the possible color clusters for each pixel, and there are two possibilities:
1. Only one cluster: the pixel is photo-consistent in all the views at this particular depth, and hence the dominant color cluster can be chosen unambiguously, based solely on photo-consistency.
2. More than one dominant cluster: this suggests either that the 3D point which projects to the pixel is occluded in some of the source images (see Figure 2(b)), or that the high resolution pixel is on an edge, so that it projects on to more than one color in the LR images.
In view of the above, we write the energy to be minimized as
\min E(f|g) = \sum_{p \in S} \sum_{z \in d_p} \sum_{i=1}^{m_{pz}} \beta(p, z, i) \sum_{k \in I(p, z, i)} \alpha_k(p, p') \bigl\| h(p) * f(p, z) - g_k(p'(z)) \bigr\|^2 + \lambda_s \sum_{(p,q) \in N} V_{p,q}(f(p), f(q))    (6)
where α_k(p, p′) is given by Equation 2 and β(p, z, i) is the weight given to cluster i at candidate depth z for pixel p. We set the cluster weight β(p, z, i) as follows. If there is only one dominant cluster, we set β(p, z, i) = 1. If there is more than one dominant cluster, we determine whether p is an edge point. This check is easy for super resolution of one of the source views, because then the LR reference pixel corresponding to p is available. If p is an edge point, we set the weights β(p, z, i) for all the clusters corresponding to neighboring colors of p as

\beta(p, z, i) = \frac{\exp(b N_i)}{\sum_{k=1}^{m_{pz}} \exp(b N_k)}    (7)

where N_i is the number of images in the i-th cluster divided by the total number of images, and b is a tuning parameter. If p is not an edge point, we look up the color of p in the LR image and set β(p, z, i) = 1 for the corresponding cluster and all others to zero. Note
that the above energy function given in Equation 6 is not in a form that is amenable to minimization by graph cut. Moreover, in the case of super resolution reconstruction of a novel view, the problem of choosing the photo-consistent set of colors becomes considerably harder, because one then needs an effective strategy for resolving occlusion. In the next section we explain how we handle occlusion using the photo-consistency constraint [6,9].
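The cluster weighting of Equation 7 amounts to a soft-max over relative cluster sizes. The sketch below is an assumed implementation (names and the default value of the tuning parameter b are not from the paper) of the three cases described above.

import numpy as np

def cluster_weights(cluster_sizes, total_images, b=1.0, is_edge=True,
                    lr_color_cluster=None):
    """Cluster weights beta(p, z, i): single-cluster, edge-pixel (Equation 7)
    and non-edge cases. cluster_sizes[i] is the number of LR images in the
    i-th dominant cluster; lr_color_cluster is the index of the cluster whose
    color matches the LR reference pixel (needed only when is_edge is False)."""
    m = len(cluster_sizes)
    if m == 1:
        return np.array([1.0])                      # unambiguous cluster
    n_i = np.asarray(cluster_sizes, float) / float(total_images)
    if is_edge:
        w = np.exp(b * n_i)                         # Equation 7: soft-max on N_i
        return w / w.sum()
    w = np.zeros(m)                                 # non-edge: all weight on the
    w[lr_color_cluster] = 1.0                       # cluster matching the LR color
    return w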
3 Occlusion Handling Using Photo-Consistency
Clearly, if the above procedure returns multiple dominant clusters for a high resolution pixel p in the target novel view due to occlusion, there is no way to resolve the correct cluster from an analysis based on a single pixel. Seitz and Dyer [8,9] and Kutulakos and Seitz [6] develop a theory for occlusion analysis using photo-consistency, which gives the ordinal visibility constraint [8,9]: if X(z_1) and X(z_2) are any two points on two back-projected rays, then X(z_1) can occlude X(z_2) only if z_1 < z_2. This allows us to resolve occlusion in a way similar to the voxel coloring algorithm of [8,9]. We scan the back-projected rays from each pixel in the target novel view image in the order of depth (z) values. For a given z value along a ray through p, we project onto the source images and mark source image pixels if they satisfy photo-consistency. For subsequent z values we consider only the unmarked pixels in the source images. We outline the procedure in Algorithm 1.
Algorithm 1. Algorithm to resolve occlusion
For each z from Z_min to Z_max:
  for each pixel p in the novel view image that has z as a candidate depth (z ∈ d_p from the analysis in Section 2.2):
    set C = Φ.
    for all input LR images g_k (indexed by k):
      a. project X(z) along the ray back-projected through p onto g_k; let p′(z) be the projected pixel in g_k (after round off).
      b. push the color of p′(z) into C if p′(z) is not marked.
    end.
    if C is photo-consistent then
      a. retain the clusters with color similar to the photo-consistent set and prune all others from C(z, i); set the weights β(p, z, i) for the remaining clusters in C(z, i) according to their membership as given in Equation 7.
      b. retain this z value in d_p and remove all others.
      c. mark all pixels p′(z) in g_k which are unmarked.
    endif.
  end.
end.
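The following is a compact Python sketch of Algorithm 1, written only to make the control flow concrete; the callables 'project', 'get_color' and 'is_photo_consistent' are hypothetical stand-ins for the camera projection, LR image access and photo-consistency test, which the paper does not spell out as code.

def resolve_occlusion(depth_planes, candidate_depths, project, get_color,
                      is_photo_consistent):
    """Depth-ordered scan with marking (ordinal visibility constraint).
    depth_planes: increasing z values from Zmin to Zmax.
    candidate_depths: dict mapping each novel-view pixel p to its set d_p.
    project(p, z): iterable of (k, lr_pixel) pairs, one per input LR image."""
    marked = {}                       # (k, lr_pixel) -> True once consumed
    chosen_depth = {}                 # p -> unique resolved depth
    for z in depth_planes:            # scan near-to-far
        for p, d_p in candidate_depths.items():
            if p in chosen_depth or z not in d_p:
                continue
            colors = [get_color(k, q) for k, q in project(p, z)
                      if not marked.get((k, q), False)]
            if colors and is_photo_consistent(colors):
                chosen_depth[p] = z                       # keep this z, prune others
                for k, q in project(p, z):
                    marked[(k, q)] = True                 # occluded for larger z
    return chosen_depth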
Fig. 3. (a) Row 1: one of the input images (out of 40) and the photo-consistent depth map. Row 2: occlusion map (pixels for which more than one cluster is formed are marked white) and the generated novel view. (b) Super resolved novel view using 20 images. Note that although the generated depth map is not accurate, it provides the correct photo-consistent color in the novel view.
The space carving theory of [6] guarantees that the algorithm outlined above converges to a photo-consistent shape. Consequently, the procedure guarantees a unique z′ value for every p which, though it may not be "correct", nevertheless provides a photo-consistent color assignment, which is all that we require for super resolved novel view generation. We show one of the input images, the depth map obtained from the unique z′ values, the image in which the pixels with multiple clusters are marked white (the occlusion map, with slight abuse of notation), and the generated novel view in Figure 3(a), and the super resolved novel view in Figure 3(b). The final energy is then given as

E(f|g) = \sum_{p \in S} \sum_{i=1}^{m_{pz'}} \beta(p, z', i) \sum_{k \in I(p, z', i)} \alpha_k(p, p'(z')) \bigl\| h(p) * f(p, z') - g_k(p'(z')) \bigr\|^2 + \lambda_s \sum_{(p,q) \in N} V_{p,q}(f(p), f(q))    (8)

where z′ is the unique depth value obtained after resolving the occlusion. The energy in Equation 8 is clearly amenable to minimization by graph cut.
4 Energy Minimization Using Graph-Cut
Graph cut [11] can minimize only graph-representable energy functions. An energy function is graph-representable if and only if each term V_{p,q} satisfies the regularity constraint [12]. The energy function of Equation 8 is not in graph-representable form: the data term of site p also depends on the neighbors of p because of the blurring operator. We approximate the data term of Equation 8 as follows. Consider the blur kernel h(p) with value w_pp at the center and w_pq at the neighbor q of p. Writing f(p, z′) as f_p, α_k(p, p′(z′)) as α_k, β(p, z′, i) as β, I(p, z′, i) as I_i, g_k(p′(z′)) as g_k(p′), and denoting by N_sp the set of spatial neighbors of p, the data term is given by
\sum_{p \in S} \sum_{i=1}^{m_{pz'}} \beta \sum_{k \in I_i} \alpha_k \bigl( h(p) * f_p - g_k(p') \bigr)^2 = \sum_{p \in S} \sum_{i=1}^{m_{pz'}} \beta \sum_{k \in I_i} \alpha_k \Bigl( w_{pp} f_p - g_k(p') + \sum_{q \in N_{sp}} w_{pq} f_q \Bigr)^2
= \sum_{p \in S} \sum_{i=1}^{m_{pz'}} \beta \sum_{k \in I_i} \alpha_k \Bigl[ \bigl( w_{pp} f_p - g_k(p') \bigr)^2 + \Bigl( \sum_{q \in N_{sp}} w_{pq} f_q \Bigr)^2 + 2 \bigl( w_{pp} f_p - g_k(p') \bigr) \sum_{q \in N_{sp}} w_{pq} f_q \Bigr].    (9)
We write the overall energy as
E(f|g) = \sum_{p \in S} \Bigl[ D^*_p(f_p) + \sum_{q \in N_{sp}} \phi_{pq}(f_p, f_q) + \lambda_s \sum_{q \in N_{sp}} V_{p,q}(f_p, f_q) \Bigr]    (10)
where, collecting all terms that depend on a single pixel into the data term, we have

D^*_p(f_p) = \sum_{i=1}^{m_{pz'}} \beta \sum_{k \in I_i} \alpha_k \Bigl[ \bigl( w_{pp} f_p - g_k(p') \bigr)^2 + \Bigl( \sum_{q \in N_{sp}} w_{pq} f_q \Bigr)^2 \Bigr]    (11)
and

\phi_{pq}(f_p, f_q) = \sum_{i=1}^{m_{pz'}} \beta \sum_{k \in I_i} \alpha_k \, 2 \bigl( w_{pp} f_p - g_k(p') \bigr) \bigl( w_{pq} f_q \bigr)    (12)
The above expression is still not graph representable because of the term f_p f_q in φ_pq. We therefore further approximate by setting Δ_kp = w_pp f_p − g_k(p′) in φ_pq and holding Δ_kp constant during a particular α-expansion move of the graph cut minimization. It is easy to verify that with this approximation φ_pq satisfies the regularity condition of [12]. Unfortunately, it cannot be guaranteed that every step of energy minimization of the approximate energy using α-expansion also minimizes the original energy; however, in our experiments we find that this is almost always the case. [13] gives a different approximation for spatial deconvolution using graph cut, for which such a guarantee can be given.
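The algebraic split of Equation 9, which makes the decomposition into Equations 11 and 12 possible, can be checked numerically. The snippet below is only a verification of that identity on random values (values and the 3×3 kernel size are illustrative assumptions, not data from the paper).

import numpy as np

# Numerical check of the expansion in Equation 9 for one site p.
rng = np.random.default_rng(0)
w = rng.random((3, 3)); w /= w.sum()           # space-invariant PSF h(p)
f = rng.random((3, 3))                         # labels of p and its neighbours
g = 0.4                                        # observed LR colour g_k(p')

w_pp, f_p = w[1, 1], f[1, 1]
tail = (w * f).sum() - w_pp * f_p              # sum over q of w_pq * f_q

lhs = ((w * f).sum() - g) ** 2                 # (h(p) * f - g_k(p'))^2
rhs = (w_pp * f_p - g) ** 2 + tail ** 2 + 2 * (w_pp * f_p - g) * tail
assert np.isclose(lhs, rhs)                    # the split behind Equations 11-12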
5 Results
In this section we present some results of novel view super resolution. In all our experiments we obtained the reference images from image sequences captured using a hand-held video camera (Sony), and the sequences were calibrated using an automatic tracking and calibration software, Boujou [14]. We provide the input and output images in the supplemental material through the index.html link.
5.1 Novel View Generation
In our first example we select 11 frames from the image sequence and use 10 of them as reference images to generate the missing 11th image, so that the reconstructed image can be compared with the ground truth. In Figure 4(a) we
Fig. 4. (a) Row 1: two input images out of 10. Row 2: generated novel view and difference with ground truth. (b) Super resolved novel view with magnification of 2× in each direction.
show two of the reference images, the generated novel view and the difference image with the ground truth. In Figure 4(b) we show the super resolved output corresponding to the novel view. In Figure 5 we show a close-up of the interpolated 11th image, the corresponding super resolved novel view and the difference image. The difference image clearly shows the missing high frequency components in the interpolated image.
Fig. 5. Close-ups of the interpolated image, the super resolved image and the difference image of Figure 4
5.2 Occlusion Handling
In this example we show that we can effectively handle occlusion. Four of the 10 input images are shown in Figure 6(a). Note that different colors are visible between the petals of the flower because of occlusion. The occlusion has been properly handled in the reconstructed super resolved view of one of the source images, shown in Figure 6(b). In Figure 6(c) and Figure 6(d) we show the super resolved novel view using 10 and 20 images, respectively. For super resolution of one of the source views the cluster is selected correctly, and for super resolution of a novel view the cluster selection improves as the number of images increases, as seen by comparing the super resolved view using 20 images with that using 10 images.
Fig. 6. Resolving occlusion: (a) four of the 10 input images, where different colors are visible between the petals of the flower, (b) super resolved image of one of the source views, (c) super resolved novel view using 10 images and (d) super resolved novel view using 20 images
Fig. 7. Novel views and the corresponding difference images with ground truth (a) using 10 and 40 images, (b) using FOVs of 11.5° and 7.5°
5.3 Effect of Number of Views and Field of View
In this example we show novel view generation using different numbers of input images with the same field of view (FOV), and using different FOVs with the same number of input images. In the first experiment we consider an image sequence of 61 images and generate the novel view using 10, 20, 30, 40 and 50 images. The corresponding rms errors between the ground truth and the novel view are 20.2027, 17.7744, 16.8216, 16.3895 and 16.3895, respectively. Two of the generated novel views, using 10 and 40 images, and the corresponding difference images with the ground truth are shown in Figure 7(a). In the second experiment we used 10 images from three input sets with FOVs of 11.5°, 9.0° and 7.5°. The rms errors corresponding to the three sets are 30.3203, 20.2027 and 15.6003, respectively. The generated novel views using FOVs of 11.5° and 7.5°, and the corresponding difference images with the ground truth, are shown in Figure 7(b).
6 Conclusions
In this paper we have addressed the problem of super resolved generation of novel views of a 3D scene with the reference images obtained from cameras in general
positions. We have posed the novel view super resolution problem in the MRF-MAP framework and proposed a solution using graph cut. We have formulated the problem as one of estimating the color at each pixel in the high resolution novel view without explicit and accurate depth recovery. We have also presented an effective method to resolve occlusion, and have presented compelling results on real images.
References 1. Irani, M., Peleg, S.: Improving resolution by image registration. CVGIP: Graphical Models and Image Processing 53, 231–239 (1991) 2. Chaudhuri, S., Joshi, M.V.: Motion-Free Super-Resolution. Springer, Heidelberg (2004) 3. Borman, S., Stevenson, R.L.: Linear Models for Multi-Frame Super-Resolution Restoration under Non-Affine Registration and Spatially Varying PSF. In: Computational Imaging II. Proceedings of the SPIE, vol. 5299, pp. 234–245 (2004) 4. Irani, M., Hassner, T., Anandan, P.: What Does the Scene Look Like from a Scene Point? In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, pp. 883–897. Springer, Heidelberg (2002) 5. Fitzgibbon, A., Wexler, Y., Zisserman, A.: Image-Based Rendering Using ImageBased Priors. International Journal of Computer Vision 63, 141–151 (2005) 6. Kutulakos, K.N., Seitz, S.M.: A Theory of Shape by Space Carving. International Journal of Computer Vision 38, 199–218 (2000) 7. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000) 8. Seitz, S.M., Dyer, C.R.: Photorealistic Scene Reconstruction by Voxel Colouring. In: Proceedings of the IEEE CVPR, pp. 1067–1073. IEEE Computer Society Press, Los Alamitos (1997) 9. Seitz, S.M., Dyer, C.R.: Photorealistic Scene Reconstruction by Voxel Colouring. International Journal of Computer Vision 35, 151–173 (1999) 10. Geman, S., Geman, D.: Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721–741 (1984) 11. Boykov, Y., Veksler, O., Zabih, R.: Fast Approximate Energy Minimization via Graph Cuts. IEEE Transactions on PAMI 23, 1222–1239 (2001) 12. Kolmogorov, V., Zabih, R.: What Energy Functions Can Be Minimized via Graph Cuts? IEEE Transactions on PAMI 26, 147–159 (2004) 13. Raj, A., Singh, G., Zabih, R.: MRFs for MRIs: Bayesian Reconstruction of MR Images via Graph Cuts. In: Proceedings of the IEEE CVPR, pp. 1061–1068. IEEE Computer Society Press, Los Alamitos (2006) 14. 2d3 Ltd (2002), http://www.2d3.com
Learning-Based Super-Resolution System Using Single Facial Image and Multi-resolution Wavelet Synthesis Shu-Fan Lui, Jin-Yi Wu, Hsi-Shu Mao, and Jenn-Jier James Lien Robotics Laboratory, Dept. of Computer Science and Information Engineering National Cheng Kung University, No. 1, Ta-Hsueh Road, Tainan, Taiwan {kilo, Curtis, marson, jjlien}@csie.ncku.edu.tw http://robotics.csie.ncku.edu.tw
Abstract. A learning-based super-resolution system consisting of training and synthesis processes is presented. In the proposed system, a multi-resolution wavelet approach is applied to carry out the robust synthesis of both the global geometric structure and the local high-frequency detailed features of a facial image. In the training process, the input image is transformed into a series of images of increasingly lower resolution using the Haar discrete wavelet transform (DWT). The images at each resolution level are divided into patches, which are then projected onto an eigenspace to derive the corresponding projection weight vectors. In the synthesis process, a low-resolution input image is divided into patches, which are then projected onto the same eigenspace as that used in the training process. Modeling the resulting projection weight vectors as a Markov network, the maximum a posteriori (MAP) estimation approach is then applied to identify the best-matching patches with which to reconstruct the image at a higher level of resolution. The experimental results demonstrate that the proposed reconstruction system yields better results than the bi-cubic spline interpolation method.
Keywords: Super-resolution, learning-based, reconstruction, Markov network, multi-resolution wavelets, maximum a posteriori.
1 Introduction
Super-resolution (SR) is an established technique for expanding the resolution of an image, since directly up-sampling a low-resolution image to produce a high-resolution equivalent invariably results in blurring and a loss of the finer (i.e. higher-frequency) details in the image. SR has attracted increasing attention in recent years for a variety of applications, ranging from remote sensing to military surveillance, face zooming in video sequences, and so forth. Some approaches [8], [9] and [15] use interpolation methods, such as bilinear and bi-cubic interpolation, to obtain the high-resolution image. However, it is difficult to interpolate fine details well, such as texture and corner-like regions. Considering that a single low-resolution image may not provide enough information to recover the high-frequency details, multiframe image reconstruction algorithms [5], [13] and [14] were developed to facilitate the reconstruction of single high-resolution images from a sequence of low-resolution
images. Some of these methods successfully enlarge the images, but in this paper we focus on reconstructing a high-resolution image from a single low-resolution image, since this is more useful in everyday applications. Recently, the application of SR to the resolution enlargement of single images has attracted particular attention. For example, Freeman et al. [6], [7] learned the relationship between corresponding high- and low-resolution pairs from the training data and modeled the images as a Markov network. Although the algorithms presented in these studies were applicable to the resolution scaling of generic images, they were unsuited to the manipulation of facial images characterized by pronounced geometric structures, as described in [1]. Baker and Kanade [1], [2] developed a learning-based facial hallucination method in which a process of Bayesian inference was applied to derive the high-frequency components of a facial image from a parent structure. Meanwhile, Liu et al. [10] presented a two-step procedure for hallucinating faces based on the integration of global and local models: the structure of the face is generated by the global model and the local texture is obtained from the local model. However, the methods proposed in [1], [2] and [10] are sometimes difficult to apply in practice since they rely on complicated probabilistic models. In general, the objective of SR reconstruction is to preserve the high-frequency details of an image while simultaneously expanding its scale to the desired resolution. Various researchers have demonstrated the successful interpolation and reconstruction of high-resolution images using wavelet algorithms [3], [4], [11]. Such algorithms have a number of fundamental advantages when applied to image reconstruction, including the ability not only to keep the geometric structure of the original image, but also to preserve a greater proportion of the high-frequency components than other methods, such as interpolation. Accordingly, the current study develops a learning-based SR method based upon a multi-resolution wavelet synthesis approach. The SR system consists of two basic processes, namely a training process and a synthesis (reconstruction) process. In the training process, the input image is decomposed into a series of increasingly lower-resolution images using a discrete wavelet transformation (DWT) technique. The image at each resolution level is divided into patches, which are then projected onto an eigenspace to compute the corresponding projection weight vectors. The projection weight vectors generated from all of the patches within the image at each level are grouped into a projection weight database. In the subsequent synthesis procedure, the input image is first interpolated to double its size (Fig. 2, I). Applying the three corresponding Haar filters to it creates the three initial subbands. The three subbands, together with the input image, are then divided into patches, which are then mapped onto the eigenspace associated with the lowest-resolution input image in the training process. The resulting projection weight vector for each patch is then compared with those within the corresponding projection weight database. The best-fitting vector for each patch is identified by modeling the spatial relationships between patches as a Markov network and using the maximum a posteriori (MAP) criterion. The corresponding patch is then mapped back to the input image.
Having replaced all of the patches in the three subbands with the best-fitting patches from the training database, the resulting image is synthesized using an inverse wavelet transformation process and used as the input image for an analogous synthesis procedure performed at the next (higher) resolution level. This up-sampling procedure is repeated iteratively until the required image resolution has been obtained.
2 Training Process
Figure 1 presents a schematic illustration of the training process within the current SR system. As shown, the input image is decomposed into two lower-resolution images, each divided into four sub-bands. At each resolution level, patches from each subband are concatenated to form a patch vector, which is then projected onto the corresponding eigenspace to obtain the corresponding projection weight vector. The vectors associated with each image resolution level are grouped to construct a projection weight database.
Fig. 1. Flowchart of the SR system training process. The original input training image (Level 0) is successively transformed using the Haar discrete wavelet transformation (DWT) to generate lower-resolution images (Levels 1 and 2, respectively), each comprising four sub-bands. At each level, 3×3-pixel patches from each sub-band are concatenated to form a 3×3×4-dimensional vector with one patch-based eigenspace per level. These vectors are projected onto the eigenspace to generate a corresponding projection weight vector. The projection weight vectors produced at each level are grouped to form a projection weight database.
2.1 Feature Extraction Using Multi-resolution Wavelet Analysis
Discrete wavelet transformation (DWT) is a well-established technique used in a variety of signal processing and image compression / expansion applications. In the
current SR system, DWT is applied to decompose the input image into four subbands, namely LL, LH, HL and HH, respectively, where L denotes a low-pass filter and H a high-pass filter. Each of these sub-bands can be thought of as a smaller-scale version of the original image, representing different properties of the image, such as its vertical and horizontal information. The LL sub-band is a coarse approximation of the original image, containing its overall geometric structure, while the other three sub-bands contain the high-frequency components of the image, i.e. the finer image details. In the synthesis stage of the proposed SR system, the LL sub-band is used as the basis for estimating the best-fitting patches in the remaining three sub-bands. Having identified these patches, the four sub-bands are synthesized using an inverse DWT process to construct an input image for the subsequent higher-level image reconstruction process. Note that in the current system, the Haar DWT function is employed to carry out the wavelet transformation in the training and synthesis procedures, since for this function the direct transform is equal to its inverse. The multi-resolution wavelet approach described above can be repeated as many times as required to achieve the desired scale of enlargement. Assuming that the input image has a 48×64-pixel resolution, the output image should therefore have a resolution of 192×256 pixels. Accordingly, the SR procedure involves the use of two consecutive reconstruction processes, i.e. an initial process to enlarge the input image from 48×64 pixels to 96×128 pixels, and then a second process to produce the final enlarged image with the desired resolution of 192×256 pixels.
2.2 Patch-Based Eigenspace Construction
As described in the previous section, four sub-band images are required for the wavelet reconstruction process. However, initially only a single low-resolution image is available. Therefore, the problem arises as to how the coefficients of the other three sub-bands may be estimated. In the current study, this problem is resolved by using a learning-based approach [1], [2] to estimate the missing high-frequency image details on the basis of the image information contained within the training datasets. As shown in Fig. 1, the current SR system incorporates two training datasets, namely projection weight databases 1 and 2, respectively. Database 1 is constructed by applying DWT to each of the original (i.e. Level 0) w×h-pixel training images, resulting in the creation of four sub-band images, i.e. LL, LH, HL and HH, respectively. These sub-band images are then further divided into an array of 3×3-pixel patches. Since the image has a total of four sub-bands, a 36-dimensional patch-based vector can be constructed by concatenating the 9 pixels in the corresponding patches of each sub-band. Assuming that a single sub-band image can be divided into M patches and that the training dataset contains a total of N original images, an M×N by 36 patch-based matrix can be constructed. There will be over 250,000 patch-based vectors if 100 images of 192×256-pixel resolution are used to train the database, so PCA is used to reduce the dimensionality in order to speed up the computation of the system. Having constructed projection weight database 1 for the Level 1 resolution images, a similar procedure is performed to compute the projection weight vectors for projection weight database 2, corresponding to the lowest-resolution level images.
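The training step described above can be sketched as follows. This is an assumed implementation (the paper does not name a wavelet library or a specific PCA routine): one Haar DWT level via PyWavelets, concatenation of co-located 3×3 blocks from the four sub-bands into 36-D vectors, and PCA by SVD to obtain the eigenspace and the projection weight database.

import numpy as np
import pywt

def training_vectors(image, patch=3):
    """One Haar DWT level gives four sub-bands; co-located patch x patch blocks
    of the four sub-bands are concatenated into 4*patch*patch-dim vectors."""
    LL, (LH, HL, HH) = pywt.dwt2(image.astype(float), 'haar')
    h, w = LL.shape
    vecs = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            block = [b[y:y + patch, x:x + patch].ravel() for b in (LL, LH, HL, HH)]
            vecs.append(np.concatenate(block))
    return np.array(vecs)                     # (num_patches, 36) for 3x3 patches

def build_eigenspace(vectors, n_components=20):
    """PCA by SVD: returns the mean, the leading eigenvectors and the projection
    weights that form the projection weight database for this level."""
    mean = vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(vectors - mean, full_matrices=False)
    basis = vt[:n_components]                 # eigenvectors (rows)
    weights = (vectors - mean) @ basis.T      # projection weight database
    return mean, basis, weights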
3 Synthesis Process
3.1 Wavelet Synthesis and Patch-Based Weight Vector Creation
As described in Section 2.1, each input image is decomposed into four sub-bands. Sub-band LL contains the global image structure, but lacks the finer details of the image. Therefore, the LL sub-band is used as the input image in the reconstruction process. The patches within the other three sub-bands are then estimated and used to reconstruct the higher-resolution image, as described below.
Fig. 2. Flowchart of the SR system synthesis process. The input image, with a quarter of the resolution of the desired output, is interpolated and a Haar filter applied to produce four sub-bands. These sub-bands are then divided into patches and projected onto patch-based eigenspace 2 to generate the corresponding projection weights. Using the maximum a posteriori criterion, the projection weights are then compared with those within projection weight database 2. The best-fitting matches are used to improve the patches in the original input image, which is then synthesized using an inverse wavelet transformation process to construct a new image with twice the resolution of the original input image. Using this new image as an input, the reconstruction process is repeated using patch-based eigenspace 1 and projection weight database 1 to synthesize the required highest-resolution image.
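The look-up step of this synthesis flow can be sketched as below. This is a simplified, assumed implementation: it performs a plain nearest-neighbour match in projection-weight space, whereas the paper's actual selection additionally enforces the Markov-network compatibilities of Section 3.2; 'db_patches' is a hypothetical array holding the training detail patches aligned with the weight database.

import numpy as np

def best_match_patches(search_vectors, mean, basis, weight_db, db_patches):
    """Project each 36-D search vector onto the stored eigenspace and return,
    for each one, the training detail patches of its nearest projection weight."""
    w = (search_vectors - mean) @ basis.T                    # project to weights
    d2 = ((w[:, None, :] - weight_db[None, :, :]) ** 2).sum(axis=2)
    idx = d2.argmin(axis=1)                                  # nearest weight vector
    return db_patches[idx]                                   # matched LH/HL/HH patches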
Fig. 3. Each search vector is constructed by concatenating four 3×3-pixel patches from the four sub-band images. These search vectors are then compared with the weight projection vectors within the training database(s) to identify the best-fitting match. The training image patches corresponding to the identified weight projection vector are then mapped back to the input image. Note that the search vector and the training databases are modeled as a Markov network.
As shown in Fig. 2, the input image at the lowest level of resolution, i.e. w/4×h/4 pixels, is interpolated and partitioned into four subbands using a Haar filter. Adopting the same approach as that applied in the training process, the four subbands are divided into patches, and the pixels within corresponding patches in the four different sub-bands are then concatenated to form a search vector (see Fig. 3). All of the search vectors relating to the image are then projected onto eigenspace 2 to create a projection weight set W2i, where i = 1, …, N, in which N is the total number of patches within one sub-band. The search vectors are then compared with the projection weight vectors contained within database 2 to identify the best-matching vectors; the search patches are processed in raster-scan order. Having found the best-matching vector in the database, the corresponding patches are used to replace the patches in the LH, HL and HH sub-bands of the original input image. Once all of the original patches have been replaced, an inverse wavelet transformation process is performed to synthesize a higher-resolution image with a w/2×h/2-pixel resolution. This image is then used as the input to a second reconstruction process, performed using eigenspace 1 and projection weight database 1, to reconstruct the desired w×h-pixel resolution image (see Fig. 2).
3.2 Markov Network: Identifying Best-Matching Vectors Using the Maximum a Posteriori Approach
In the current SR system, the patches within the four sub-bands are modeled as a Markov network. In this network, the observed nodes, y, represent the transformed patches and the aim is to estimate the hidden nodes, x, from y. In the current network, the hidden nodes represent the fine details of the original images, i.e. the patches in subbands LH, HL and HH. Since the spatial relationship between x and y is modeled as a Markov network (see Fig. 3), the joint probability over the transformed patches y [7] can be expressed as follows:
P(x_1, x_2, \ldots, x_N, y_1, y_2, \ldots, y_N) = \prod_{(i,j)} \Psi(x_i, x_j) \prod_{k} \Phi(x_k, y_k)    (1)
where Ψ and Φ are the learned pair-wise compatibility functions, i and j are neighboring nodes, and N is the total number of patches in each subband. To estimate x̂_j, the best match for each patch, we adopt the MAP estimation described in [7], taking the maximum over the other variables in the posterior probability:
\hat{x}_j^{MAP} = \arg\max_{x_j} \max_{\{x_i,\, i \neq j\}} P(y_1, y_2, \ldots, y_N, x_1, x_2, \ldots, x_N)    (2)
According to the conditional probability and Bayes' theorem, the posterior probability can be written as

p(x \mid y) = \frac{p(x, y)}{p(y)} = \frac{1}{Z} \prod_{(i,j)} \Psi(x_i, x_j) \prod_{k} \Phi(y_k, x_k)    (3)
where Z is a normalization constant; therefore the probability of any given transformed patch for each node is proportional to the product of the pair-wise compatibility functions. In this Markov network, the relationship between the observation node and the hidden node is modeled by Φ, and the relationship between neighboring hidden nodes is modeled by Ψ. In our work, Φ is used to find the best matching candidate for each patch from the training data, and Ψ is used to ensure that neighboring patches are consistent with each other. The compatibility function Φ(x, y) measures the similarity between two patch vectors, and is defined as
\Phi(x_j, y_j) = e^{-|y_j - x_j|^2 / 2\sigma_1^2}    (4)
where y_j and x_j represent patch vectors. Specifically, y_j is the observed node, consisting of patches from the transformed images, and x_j is the hidden node, which is to be found in projection weight database 1. σ_1 is a standard deviation, set in our system to the eigenvalue of the corresponding eigenvector. The compatibility function between the hidden nodes is defined as
\Psi(x_i, x_j) = e^{-|d_{ji} - d_{ij}|^2 / 2\sigma_2^2}    (5)
where d_ij denotes the pixels in the overlapping region between nodes i and j, and σ_2 is the standard deviation of the overlapping region. The pixels in the overlapping region should agree in order to guarantee that neighboring nodes are compatible. The best match for each search vector is obtained by solving these compatibility functions.
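A minimal sketch of the two compatibility functions is given below, assuming the patch vectors and overlap regions are supplied as NumPy arrays (function names and the calling convention are illustrative, not the authors' code).

import numpy as np

def phi(x_j, y_j, sigma1):
    """Observation compatibility of Equation 4 between hidden patch x_j and
    observed patch y_j."""
    diff2 = np.sum((np.asarray(y_j, float) - np.asarray(x_j, float)) ** 2)
    return np.exp(-diff2 / (2.0 * sigma1 ** 2))

def psi(d_ij, d_ji, sigma2):
    """Neighbour compatibility of Equation 5, computed on the pixels of the
    overlapping region between the candidate patches at nodes i and j."""
    diff2 = np.sum((np.asarray(d_ji, float) - np.asarray(d_ij, float)) ** 2)
    return np.exp(-diff2 / (2.0 * sigma2 ** 2))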
4 Experimental Results
To evaluate the performance of the proposed SR system, a training database was compiled comprising 100 images. Although the figures presented in this study feature facial images, the current SR system is not limited to the processing of a specific class
of images. Hence, a general training database was constructed containing images from the CMU PIE database, a self-compiled database of facial images, and the Internet. Note that the images were not pre-processed in any way.
[Figure 4 panels, (a) and (b): (1) input (48×64); (2) original high resolution (192×256); (3) bi-cubic spline interpolation (192×256); (4) current method (192×256); (5) difference (2)−(3), average difference 9.7 gray values/pixel in (a) and 9.5 in (b); (6) difference (2)−(4), average difference 3.3 gray values/pixel in (a) and 4.0 in (b)]
Fig. 4. Comparison between enlargement results of current SR method and those obtained using a bi-cubic spline interpolation method: (a.1) low-resolution input with 48×64 resolution; (a.2) original high-resolution image with 192×256 resolution; (a.3) bi-cubic spline interpolation results; (a.4) enlargement results obtained using current method; and (a.5) and (a.6) difference images of (a.2) minus (a.3) and (a.2) minus (a.4), respectively. Note that the average difference is calculated as the sum of the gray values in the difference image divided by the total number of pixels in the image. The results show that the proposed SR method achieves a better enlargement performance than the bi-cubic interpolation. Figure 4.(b) presents a corresponding set of images for a different input image.
Increasing the patch size to 5×5 pixels was found to produce no significant improvement in the enlargement results. Typical experimental results obtained when processing facial images are shown in Fig. 4. In general, super-resolution can be
Fig. 5. Further results: (a) input (48×64); (b) bi-cubic spline interpolation (192×256); (c) current method (192×256); (d) high resolution (192×256)
performed using a variety of interpolation techniques, including the nearest-neighbor, bi-linear and bi-cubic spline schemes [10]. In the current study, the enlargement results obtained using the proposed SR system are compared with those obtained from the bi-cubic spline interpolation method. As shown in Figs. 4 and 5, the reconstruction results generated by the SR method are both quantitatively and qualitatively superior to those obtained using the interpolation technique.
5 Conclusions
This study has presented a learning-based super-resolution system for the reconstruction of high-resolution images from a single low-resolution image. In the proposed method, a multi-resolution wavelet approach is used to synthesize both the global geometrical structure of the input image and the local high-frequency details. The SR system incorporates a training process and a synthesis process. In the training process, the high-resolution input image is transformed into a series of lower-resolution images using the Haar discrete wavelet transform (DWT). The images at each resolution level are divided into four subbands (LL, LH, HL and HH) and the corresponding patches from each subband are combined to form a search vector. The search vector is then projected onto an eigenspace to derive the corresponding projection weight vector. All of the projection weight vectors associated with the image at its current level of resolution are then grouped into a projection weight vector database. In the reconstruction process, the LL sub-band is utilized as the input image to ensure that the global, low-frequency components of the original image are retained during the enlargement process. A high-resolution image of the desired scale is obtained by performing the reconstruction process iteratively, taking the output image at one resolution level as the input to the synthesis process at the next (higher) resolution level. During each reconstruction process, the high-frequency components
of the image are replaced by the best-fitting patches identified when modeling the patches in the subbands as a Markov network and performing a maximum a posteriori search of the projection weight database such that the finer details of the image are preserved as the image resolution is progressively enlarged.
References 1. Baker, S., Kanade, T.: Hallucinating Faces. In: Proc. of Inter. Conf. on Automatic Face and Gesture Recognition, pp. 83–88 (2000) 2. Baker, S., Kanade, T.: Limits on Super-Resolution and How to Break them. IEEE Trans. on Pattern Analysis and Machine Intelligence 24(9), 1167–1183 (2002) 3. Chan, R., Chan, T., Shen, L., Shen, Z.: Wavelet Algorithms for High-Resolution Image Reconstruction. SIAM Journal on Scientific Computing 24(4), 1408–1432 (2003) 4. Chan, R., Riemenschneider, S., Shen, L., Shen, Z.: Tight Frame: An Efficient Way for High-Resolution Image Reconstruction. Applied and Computational Harmonic Analysis 17, 91–115 (2004) 5. Farsiu, S., Robinson, D., Elad, M., Milanfar, P.: Fast and Robust Multi-Frame SuperResolution. IEEE Trans. Image Processing 13(10), 1327–1344 (2004) 6. Freeman, W.T., Jones, T.R., Pasztor, E.C.: Example-Based Super-Resolution. IEEE Computer Graphics and Applications 22(2), 56–65 (2002) 7. Freeman, W.T., Pasztor, E.C., Carmichael, O.T.: Learning Low-Level Vision. IJCV 40(1), 25–47 (2000) 8. Hou, H.S., Andrew, H.C.: Least Squares Image Restoration Using Spline Basis Function. IEEE Transactions on Computers C-26(9), 856–873 (1977) 9. Hou, H.S., Andrews, H.C.: Cubic Splines for Image Interpolation and Digital Filtring. IEEE Transactions Acoust, Speech Signal Processing 26(6), 508–517 (1978) 10. Liu, C., Shum, H., Zhang, C.: A Two-Step Approach to Hallucinating Faces: Global Parametric Model and Local Nonparametric Model. CVPR 1, 192–198 (2001) 11. Nguyen, N., Milanfar, P.: A Wavelet-Based Interpolation-Restoration Method for Superresolution. Circuits, System, Signal Process 19, 321–338 (2000) 12. Pratt William, K.: Digital Image Processing (1991) 13. Schultz, R.R., Stevenson, R.L.: Extraction of High-Resolution Frames from Video Sequences. IEEE Transactions on Image Processing 5(6), 996–1011 (1996) 14. Shekarforoush, H., Chellappa, R.: Data-Driven Multi-Channel Super-Resolution with Application to Video Sequences. JOSA. A 16, 481–492 (1999) 15. Ur, H., Gross, D.: Improved Resolution from Subpixel Shifted Pictures. CVGIP: Graph. Models Image Process 54, 181–186 (1992)
Statistical Framework for Shot Segmentation and Classification in Sports Video
Ying Yang1,2, Shouxun Lin1, Yongdong Zhang1, and Sheng Tang1
1 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China
2 Graduate University of Chinese Academy of Sciences, Beijing 100085, China
{yyang, sxlin, zhyd, ts}@ict.ac.cn
Abstract. In this paper, a novel statistical framework is proposed for shot segmentation and classification. The proposed framework segments and classifies shots simultaneously, using the same difference features, based on statistical inference. The task of shot segmentation and classification is taken as finding the most probable shot sequence given the feature sequence, and it can be formulated by a conditional probability which can be divided into a shot sequence probability and a feature sequence probability. The shot sequence probability is derived from the relations between adjacent shots modeled by a Bi-gram, and the feature sequence probability depends on the inherent characteristics of a shot modeled by an HMM. Thus, the proposed framework segments shots while considering the intra-shot characteristics used for classification, and classifies shots while considering the inter-shot characteristics used for segmentation, which yields more accurate results. Experimental results on soccer and badminton videos are promising and demonstrate the effectiveness of the proposed framework.
Keywords: shot, segmentation, classification, statistical framework.
1 Introduction
In recent years, there has been increasing research interest in sports video analysis due to its tremendous commercial potential, for applications such as sports video indexing, retrieval and abstraction. Sports video can be decomposed into several types of video shots, which are sequences of frames taken contiguously by a single camera. In sports video analysis, shot segmentation and classification play an important role, since shots are often the basic processing unit and give potential semantic hints. Much work has been done on shot segmentation and classification, most of which takes shot segmentation and classification as a two-stage successive process and performs the two stages independently, using different features, without considering the relationship between them. Many shot segmentation algorithms have
This research was supported by National Basic Research Program of China (973 Program, 2007CB311100), Beijing Science and Technology Planning Program of China (D0106008040291), and National High Technology and Research Development Program of China (863 Program, 2007AA01Z416).
been developed by measuring the similarity of adjacent shots [1]. However, these algorithms perform poorly on sports video due to the frequent panning and zooming caused by fast camera motion. After shot segmentation, shot classification is performed on each segment for higher-level video content analysis. Some work classified sports video shots based on domain rules of a certain sports game [2,3], such as classifying soccer video shots into long shots, medium shots and close-up shots using dominant color. Others tried a unified method to solve this problem using SVM or HMM [4,5,6,7,8]. For these approaches, since classification is done independently after segmentation, incorrect segmentation has a negative effect on shot classification. In this paper, a novel statistical framework is proposed for shot segmentation and classification. Compared with previous work, the proposed framework classifies and segments shots simultaneously, using the same difference features, based on statistical inference. The task of shot segmentation and classification is taken as finding the most probable shot sequence given the feature sequence, and it can be formulated by a conditional probability which can be divided into a shot sequence probability and a feature sequence probability. The shot sequence probability is derived from the relations between adjacent shots, modeled by a Bi-gram, and the feature sequence probability depends on the inherent characteristics of a shot, modeled by an HMM (Hidden Markov Model). Therefore, the proposed framework segments shots while considering intra-shot characteristics and classifies shots while considering inter-shot characteristics; this amounts to a global search over all possible shot sequences for the best shot sequence matching the feature sequence. The rest of the paper is organized as follows. In Section 2, the main idea of the statistical framework is presented. Section 3 gives the details of shot segmentation and classification based on the proposed statistical framework. To evaluate the performance of this framework, two applications in soccer and badminton videos and the result analysis are described in Section 4. Finally, conclusions are drawn and future work is discussed in Section 5.
2 Main Idea of the Statistical Framework
Suppose that a video stream is composed of a sequence of shots, denoted by H = h_1 h_2 … h_t. After feature extraction, the video stream can be viewed and manipulated as a sequence of feature vectors, denoted by O = o_1 o_2 … o_T. Hence, the task of shot segmentation and classification can be seen as mapping the sequence of feature vectors to a sequence of shots, and the best mapping is the expected result of shot segmentation and classification. Hence, in the proposed framework, the task is interpreted as finding a shot sequence that maximizes the conditional probability of H given O, namely, finding

\hat{H} = \arg\max_{H} \{ P(H \mid O) \} = \arg\max_{H} \{ P(H) \cdot P(O \mid H) / P(O) \}    (1)

The equation is transformed by applying Bayes' theorem. Since P(O) is constant, because O is a known sequence, the problem can be simplified as follows:

\hat{H} = \arg\max_{H} \{ P(H) \cdot P(O \mid H) \}    (2)
The calculation of the above probability involves two types of probability distribution, i.e. P(H) and P(O|H). The former indicates the probability of the shot sequence without the effect of features, and the latter indicates the probability of the feature sequence under a given shot sequence. We therefore call them the Shot Sequence Probability (SSP) and the Feature Sequence Probability (FSP), respectively.
2.1 Shot Sequence Probability
The shot sequence H is composed of successive shots of different categories, so the shot sequence probability depends on the transition probabilities between adjacent shots. Since the shot sequence is a temporal sequence, the appearance of the present shot is only related to the appearances of the shots prior to it. Therefore, the shot sequence can be taken as a Markov process. We suppose the shot sequence is a first-order Markov process, namely, the appearance of the present shot is only related to the last shot, which can be formulated by the following equation:

P(h_m \mid h_1 h_2 \ldots h_{m-1}) = P(h_m \mid h_{m-1})    (3)
This assumption is reasonable since sports video shots regularly alternate to exhibit certain semantic content according to the play status. Hence, the shot sequence probability P(H) can be calculated by

P(H) = P(h_1 h_2 \ldots h_{t-1} h_t) = P(h_1) \cdot P(h_2 \mid h_1) \cdots P(h_t \mid h_{t-1})    (4)
where P(h_i) denotes the initial distribution probability of shot h_i, and P(h_j|h_i) denotes the transition probability between shots h_i and h_j, which we call the Bi-gram in our framework. The SSP can thus be deduced from the Bi-gram.
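Equation 4 is simply a chain of Bi-gram transitions. The following minimal sketch (assumed helper, not from the paper) evaluates it in the log domain, which is also how probabilities are combined later in Section 3.4.

import numpy as np

def log_shot_sequence_probability(shots, log_initial, log_bigram):
    """Log of Equation 4: log P(h1) + sum_t log P(h_t | h_{t-1}).
    shots: list of shot-class indices (e.g. 0 = LS, 1 = MS, 2 = CS);
    log_initial[c] and log_bigram[prev][cur] hold the initial and transition
    log-probabilities."""
    logp = log_initial[shots[0]]
    for prev, cur in zip(shots[:-1], shots[1:]):
        logp += log_bigram[prev][cur]
    return logp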
2.2 Feature Sequence Probability
The feature sequence is also a temporal sequence which indicates a certain shot sequence. Since HMMs have been successfully used in speech recognition to model the temporal evolution of features in a word [9], we use an HMM to model a sports video shot, since words and shots have a similar temporal structure. Hence, the feature sequence of a shot is an observed sequence of a given shot HMM, and each emitting state of the HMM produces a feature vector in the feature sequence [9]. So the feature sequence probability P(O|H) can be transformed into

P(O \mid H) = \sum_{S} P(O, S \mid H)    (5)
where S = s_1 s_2 … s_T is the state sequence which emits the feature sequence O = o_1 o_2 … o_T through the linkage of all the shot HMMs, which we call a super HMM. A super HMM is obtained by concatenating the corresponding shot HMMs using a pronunciation lexicon [10]. So P(O, S|H) can be derived from

P(O, S \mid H) = \prod_{t=1}^{T} P(o_t, s_t \mid s_{t-1}, H) = \prod_{t=1}^{T} \bigl[ P(s_t \mid s_{t-1}, H) \cdot P(o_t \mid s_t) \bigr]    (6)
P(o_t|s) is the emission probability distribution of state s, and P(s_t|s_{t−1}, H) is the transition probability between two states. Intra-HMM state transitions are determined from the HMM parameters, and inter-HMM state transitions are determined by the Bi-gram. Therefore, these two components can be derived from the parameters of the shot HMMs and the Bi-gram.
3 Shot Segmentation and Classification in Sports Video
In this section, we present the semantic shot segmentation and classification based on the statistical framework presented in Section 2, and shots are classified into three categories including Long Shot (LS), Medium Shot (MS) and Close-up Shot (CS). As described in Section 2, the task of shot segmentation and classification is taken as solving the problem of a maximum conditional probability P(H|O), which is dependent on SSP and FSP derived from Bi-gram and the parameters of shot HMM. Hence, Bi-gram and shot HMM are the two keys to shot segmentation and classification. The whole framework is shown in Fig.1.
Fig. 1. Framework for shot segmentation and classification
Since the parameters of an HMM are estimated by the EM algorithm using the feature vector sequence [11], appropriate features are vital to the HMM construction and can better explain the temporal evolution within a shot. Feature extraction is therefore introduced in Section 3.1. Sections 3.2 and 3.3 present the shot HMM and Bi-gram constructions, respectively. In Section 3.4, the procedure of simultaneous shot segmentation and classification is discussed.
3.1 Feature Extraction
As mentioned in Section 2, the feature sequence of a shot is the observed vector sequence of the shot HMM. Each shot is therefore partitioned into segments, features are extracted from each segment, and the features of all the segments form the feature vector sequence. A shot segment can be one or more consecutive frames, and is called a Shot Segment Unit (SSU). Given the length of the SSU, a Segmenting Rate (SR) is required at the frame level to determine the spacing of two successive SSUs. To retain more information from the shot, the SR may be smaller than the SSU, as shown in Fig. 2.
After the sizes of the SSU and SR are set, a feature vector is extracted from each SSU. Two classes of color- and motion-related features are used in our work, since they are generic and can be easily computed, and can be extended to most categories of sports games. Features are first extracted from each frame, and the feature values of an SSU are then set to the mean of the corresponding feature values of all the frames in it.
Fig. 2. SSU segmentation
Color Related Features. Since the three categories of shots have different playfield and player sizes (for example, LS has the largest playfield view, while MS has a smaller playfield and a whole player), frames in each type of shot have a distinct color distribution which differentiates them from the other types of shots. Hence, color features are derived from the L, U and V components, since the CIE LUV color space is approximately perceptually uniform, and are computed by the following equations:

L_f = \sum_{\text{all pixels } (x,y) \text{ in frame } f} L(x, y) \,/\, \text{number of pixels in frame } f
U_f = \sum_{\text{all pixels } (x,y) \text{ in frame } f} U(x, y) \,/\, \text{number of pixels in frame } f    (7)
V_f = \sum_{\text{all pixels } (x,y) \text{ in frame } f} V(x, y) \,/\, \text{number of pixels in frame } f

where L_f, U_f and V_f are the 3 basic color features of a frame f, and L(x, y), U(x, y), V(x, y) are the L, U, V components of pixel (x, y) in f, respectively.
Motion Related Features. Motion is another important factor for distinguishing shots, since different shots reflect various camera motions; for example, LS usually has much smaller motion than CS. 3 basic motion features are extracted for a frame: the frame difference D_f, the compensated frame difference C_f and the motion magnitude M_f:
D_f = \sum_{i=0}^{255} \bigl( H(f, i) - H(f-1, i) \bigr)^2 \,/\, \text{number of pixels in frame } f    (8)

where H(f, i) is the number of pixels of color i in frame f. The other two basic motion features are built on block-based motion analysis. For each block B_f(s, t) at location (s, t) in the present frame f, a block B*_{f−1}(u*, v*) from the previous frame f − 1 is found to best match it. C_f and M_f are then computed as follows, where G_f is the set of all blocks in frame f:

C_f = \sum_{B_f(s,t) \in G_f} \sum_{i=0}^{255} \bigl( H_f(B_f(s, t), i) - H_{f-1}(B^*_{f-1}(u^*, v^*), i) \bigr)^2
M_f = \sum_{B_f(s,t) \in G_f} \bigl( (s - u^*)^2 + (t - v^*)^2 \bigr)^{1/2}    (9)
Statistical Framework for Shot Segmentation and Classification
111
of color and motion of different types of shots. Given 6 basic features, the 1st and 2nd differences of basic features are computed as follows ∇1 F eaf = F eaf − F eaf −1
∇2 F eaf = ∇1 F eaf − ∇1 F eaf −1
(10)
where F eaf denotes one of the 6 basic features, ∇k F eaf (k = 1, 2) denotes the k th difference of basic feature F eaf of frame f . As a result, there is a 18-D feature vector for each frame and SSU. 3.2
Shot HMM Construction
Since there are 3 categories of shots to be classified, a general solution is to build 3 HMMs modeling 3 types of shots, respectively. However, this method only considers the temporal evolution of SSUs in a single shot, and doesn’t take into account the temporal evolution of SSUs at the transitions between adjacent shots. In fact, in sports videos, the alternation of various types of shots exhibits certain rules, such as MS is mainly followed by LS, while LS is often followed by CS. Therefore, to better simulate the temporal evolutions of SSUs of intra-shot and inter-shot, a context-dependent shot model is introduced, which is defined as tri-shot HMM. Tri-shot HMM, just as its name indicates, models the temporal evolution of SSUs in three shot including prior shot, present shot and next shot. Compared with an ordinary shot model, tri-shot model is trained use feature vector sequence of the 3 involved shots. Since we have 3 categories of shots, there are 33 = 27 tri-shots in total.
Fig. 3. Topology of HMM structure
For simplicity, we use a left-to-right HMM to represent each tri-shot HMM. The middle states are emitting states with output probability distributions, as shown in Fig. 3. Each HMM contains 5 states, and Gaussian Mixture Models are used to express the continuous observation densities of the emitting states.
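For illustration, the sketch below builds one such left-to-right HMM with Gaussian-mixture emissions using the hmmlearn library. This is an assumed stand-in only: the paper's models are built and trained with HTK 3.3, hmmlearn has no non-emitting entry/exit states (so all five states emit here), and the number of mixture components is arbitrary.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def left_to_right_tri_shot_hmm(n_states=5, n_mix=2):
    # do not re-initialise startprob/transmat in fit(); EM keeps their zero entries at zero
    model = GMMHMM(n_components=n_states, n_mix=n_mix, covariance_type="diag",
                   init_params="mcw", params="stmcw")
    model.startprob_ = np.eye(n_states)[0]            # always start in the first state
    trans = np.zeros((n_states, n_states))
    for i in range(n_states):                         # self-loop + forward transition only
        trans[i, i] = 0.5
        trans[i, min(i + 1, n_states - 1)] += 0.5
    model.transmat_ = trans
    return model

# model.fit(np.vstack(seqs), lengths=[len(s) for s in seqs]) would train one tri-shot
# model on the concatenated 18-D SSU feature sequences of that tri-shot class.
```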
3.3 Bi-gram Construction
As described in Section 2, the Bi-gram denotes the transition probability between two adjacent shots, which indicates the likelihood of a transition from one shot to another. In sports video, the various types of shots reflect different states of play and do not appear randomly. For example, after an LS is shown to exhibit the global game status, an MS or CS is often shown to track the player or ball. Therefore, the Bi-gram
can be calculated from the statistics of the occurrences of each pair of adjacent shots in the training sports videos. The following formula embodies the derivation of the Bi-gram.
P(h_j \mid h_i) = \begin{cases} \alpha\, N(h_i, h_j)/N(h_i), & \text{if } N(h_i) \neq 0 \\ 1/l, & \text{otherwise} \end{cases}   (11)

where N(h_i, h_j) is the number of times shot h_j follows shot h_i, N(h_i) is the number of times shot h_i appears, l is the total number of distinct shot models, and \alpha is chosen to ensure that \sum_{j=1}^{l} P(h_j \mid h_i) = 1.
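A hedged sketch of this estimate is given below; shot_sequences is an assumed list of labelled training sequences, and the renormalisation plays the role of α. It is shown for the three basic shot types, but the same estimate applies to the 27 tri-shot models.

```python
from collections import Counter

def bigram(shot_sequences, labels=('LS', 'MS', 'CS')):
    pair_counts, counts = Counter(), Counter()
    for seq in shot_sequences:
        pair_counts.update(zip(seq[:-1], seq[1:]))
        counts.update(seq[:-1])
    probs = {}
    for hi in labels:
        if counts[hi] == 0:
            probs[hi] = {hj: 1.0 / len(labels) for hj in labels}  # back-off to uniform
        else:
            row = {hj: pair_counts[(hi, hj)] / counts[hi] for hj in labels}
            z = sum(row.values()) or 1.0
            probs[hi] = {hj: p / z for hj, p in row.items()}      # alpha-renormalisation
    return probs
```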
3.4 Procedure of Shot Segmentation and Classification
With the Bi-gram and the tri-shot HMMs constructed, a sports video can be segmented and classified into the 3 types of shots. Since the logarithm transforms products into sums, in practice the product of probabilities in the equations presented in Section 2 can be replaced by the sum of the corresponding log probabilities. As mentioned in Section 2, a super HMM is obtained by concatenating the corresponding shot HMMs using a pronunciation lexicon, so each shot HMM in the super HMM is considered as a node. Hence, the task of shot segmentation and classification is to find the path of maximum log probability from the start node to the end node in the super HMM, as shown in Fig. 1. For an unknown shot sequence of T SSUs whose feature vector sequence is O = o_1 o_2 ... o_T, each path from the start node to the end node in the super HMM which passes through exactly T emitting HMM states is a potential recognition hypothesis. The log probability of each path is computed by summing the log probability of each individual transition in the path and the log probability of each emitting state generating the corresponding SSU. Intra-HMM transitions are determined by the HMM parameters, and inter-HMM transitions are determined by the Bi-gram [12]. The path of maximum log probability gives the best result of shot segmentation and classification: the nodes (shot HMMs) on the best path are the optimal shot classification, and since each state in a shot HMM matches an SSU, the boundaries of each shot are given by its first and last SSUs, which realizes shot segmentation simultaneously.
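The following is a deliberately coarse illustration of the maximum-log-probability path search. The actual system decodes at the HMM-state level inside the super HMM with HTK [12]; here each shot model is scored as a whole per SSU, which only conveys the Viterbi idea.

```python
import numpy as np

def best_shot_path(loglik, log_bigram):
    """loglik: (T, K) log-likelihood of each SSU under each of K shot models;
    log_bigram: (K, K) log transition probabilities between shot models."""
    T, K = loglik.shape
    score, back = loglik[0].copy(), np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_bigram + loglik[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]   # a shot label per SSU; label changes mark the shot boundaries
```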
4
Experimental Results and Analysis
We first evaluated the performance of our proposed statistical framework on soccer and badminton videos and compared it with a general two-stage method of shot segmentation and classification using the same features and test videos; we then analyzed the experimental results and parameters. The test videos are in CIF format (352 × 288, 25 fps), taken from the 2006 FIFA World Cup and the 2005 World Badminton Championship, and include 4 full half-time soccer videos and 4 full-game badminton videos. The implementation of the proposed
framework is based on HTK 3.3 [12]. The ground-truth boundaries and type of each shot are labeled manually. Table 1 shows the test corpus; each shot lasts at least 1 s, i.e., 25 frames.

Table 1. Test videos

Name      Match (2006)      Length (min)    Name         Match (2005-8-21)           Length (min)
Soccer1   GER-ITA (7-4)     46:42           Badminton1   INA-THA (Mixed Doubles)     24:11
Soccer2   ENG-POR (7-1)     46:38           Badminton2   MAS-INA (Men's Singles)     26:12
Soccer3   GER-ARG (6-30)    46:05           Badminton3   CHN-INA (Mixed Doubles)     26:24
Soccer4   SUI-UKR (6-26)    47:04           Badminton4   NZL-CHN (Mixed Doubles)     17:09
Results are assessed by the recall and precision rates, which are computed as

Recall = Correct / Ground truth,  Precision = Correct / Detected   (12)
where Detected is the number of shots obtained by shot segmentation, and Correct is the number of correctly classified shots. A correct classification requires not only that the shot type is correctly recognized, but also that the overlap between the classified shot and the actual shot exceeds 90 percent of the length of the actual shot. The proposed framework is tested on soccer and badminton videos since they are completely different types of sport. Experimental results are shown in Table 2.

Table 2. Experimental results on soccer and badminton videos

            Ground-truth        Detected            Correct
Shot Type   soccer  badminton   soccer  badminton   soccer  badminton
LS           570     156         658     165         559     152
MS           392     103         474     121         326      80
CS           402     215         395     266         313     212
Total       Soccer: Recall = 87.8%, Precision = 78.5%;  Badminton: Recall = 93.7%, Precision = 80.4%
On average, the proposed framework achieves an 87.8% recall and 78.5% precision rate on soccer videos, and a 93.7% recall and 80.4% precision rate on badminton videos. The results are promising and demonstrate the effectiveness of our general framework for shot segmentation and classification. We found that false alarms are mainly caused by the misclassification of CS and MS and by over-segmentation of CS and MS; the performance could be improved by applying more sophisticated features instead of simple color and motion features.

Experiments Using SVM. We applied a general method proposed in [5] for shot segmentation and classification using the same features for comparison. To simplify the procedure, we use an SVM to classify manually segmented shots in the above soccer videos. We chose C = 2^1 and \gamma = 2^{-10} by cross-validation, and
the experimental results are shown in Table 3. The total precision rate is 70.3%, which is far lower than the 78.5% precision rate achieved by our method. Note that shot classification using the SVM is performed on manually segmented shots, which avoids wrong classifications caused by inaccurate segmentation. Thus, the proposed framework performs much better.

Table 3. Experimental results of SVM classification

Shot Type   Ground-truth   Correct   Precision (%)
LS          570            530       92.9
MS          392            181       46.2
CS          402            248       61.7
Total       Precision = 70.3%
Experiments Using Various Orders of Difference Features. Two further types of feature vectors are tested to demonstrate that differences improve the performance: a 6-D feature vector (without difference features) and a 12-D feature vector (with 1st-difference features). As can be seen from Table 4, performance is significantly improved with higher-order difference features.

Table 4. Experimental results using various orders of difference features (Recall : Precision, %)

Name      6-D feature vector   12-D feature vector   18-D feature vector
Soccer1   83.5 : 72.5          85.8 : 74.7           86.1 : 76.2
Soccer2   84.1 : 73.2          87.3 : 74.1           87.9 : 78.9
Soccer3   83.2 : 68.8          87.9 : 77.1           90.1 : 79.0
Soccer4   80.6 : 75.6          85.2 : 78.5           86.7 : 80.0
Experiments Using Various Combinations of SSU and SR. As described in Section 3.1, the sizes of the SSU and SR determine the fineness and the number of feature vectors in a shot, respectively. 8 different combinations of SSU and SR are tested on the soccer videos using the 18-D feature vector to find the best combination.
Fig. 4. Experimental results using various combinations of SSU and SR
Results are shown in Fig. 4, where the combination of SSU and SR is denoted by SSU SR (frames). As we can see, performances are much better when SR is more than 10 frames and the ratio of SSU to SR is in [1.5, 2]. So in our work, the sizes of SSU and SR are set as 20 and 10 frames, respectively.
5
Conclusions
A statistical framework for shot segmentation and classification in sports video is presented in this paper. The main idea of the proposed framework is to cast the task of shot segmentation and classification as a conditional probability that incorporates both intra-shot and inter-shot information. Thus, the proposed method realizes shot segmentation and classification simultaneously, and achieves much better performance than a general two-stage method on soccer videos. Experimental results on badminton videos are also promising, which demonstrates that the framework can be extended to other sports videos. Furthermore, difference features introduced from the speech recognition area are applied in our framework and have proved their superiority in improving performance. In future work, we will employ higher-level semantic features to enhance the performance, and apply the framework to event detection in sports video.
References

1. Hanjalic, A.: Shot-boundary Detection: Unraveled and Resolved? IEEE Trans. Circuits and Systems for Video Technology 12, 90–105 (2002)
2. Lexing, X., Peng, X., Chang, S.-F., Divakaran, A., Sun, H.: Structure Analysis of Soccer Video with Domain Knowledge and Hidden Markov Models. Pattern Recognition Letters 24(15), 767–775 (2003)
3. Ekin, A., Tekalp, A.M., et al.: Automatic Soccer Video Analysis and Summarization. IEEE Trans. Image Processing 12, 796–807 (2003)
4. Dahyot, R., Rea, N., Kokaram, A.: Sports Video Shot Segmentation and Classification. In: SPIE Int. Conf. Visual Communication and Image Processing, pp. 404–413 (2003)
5. Wang, L., Liu, X., Lin, S., Xu, G., Shum, H.-Y., et al.: Generic Slow-motion Replay Detection in Sports Video. In: IEEE ICIP, pp. 1585–1588 (2004)
6. Duan, L.-Y., Xu, M., Tian, Q.: Semantic Shot Classification in Sports Video. In: SPIE Proc. Storage and Retrieval for Media Databases, pp. 300–313 (2003)
7. Duan, L.-Y., Xu, M., Tian, Q., et al.: A Unified Framework for Semantic Shot Classification in Sports Video. IEEE Trans. on Multimedia 7, 1066–1083 (2005)
8. Xu, M., Duan, L., Xu, C., Tian, Q.: A Fusion Scheme of Visual and Auditory Modalities for Event Detection in Sports Video. In: IEEE ICASSP, vol. 3, pp. 189–192 (2003)
9. Rabiner, L.R.: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 77, 257–286 (1989)
10. Ney, H., Ortmanns, S.: Progress in Dynamic Programming Search for LVCSR. Proceedings of the IEEE 88, 1224–1240 (2000)
11. Bilmes, J.: A Gentle Tutorial on the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Technical Report ICSI-TR-97-021, University of Berkeley (1998)
12. Young, S., Evermann, G., et al.: The HTK Book (for HTK version 3.3). Cambridge University Tech Services Ltd. (2005), http://htk.eng.cam.ac.uk/
Sports Classification Using Cross-Ratio Histograms Balamanohar Paluri, S. Nalin Pradeep, Hitesh Shah, and C. Prakash Sarnoff Innovative Technologies Pvt Ltd
Abstract. The paper proposes a novel approach for the classification of sports images based on the geometric information encoded in the image of a sport's field. The proposed approach uses the invariant nature of the cross-ratio under projective transformation to develop a robust classifier. For a given image, cross-ratios are computed for the points obtained from the intersection of lines detected using the Hough transform. These cross-ratios are represented by a histogram, which forms a feature vector for the image. An SVM classifier trained on a priori model histograms of cross-ratios for sports fields is used to decide the most likely sport's field in the image. Experimental validation shows robust classification using the proposed approach for images of Tennis, Football, Badminton and Basketball taken from dissimilar viewpoints.
1
Introduction
The exponential growth of photographic content during the last decade has fueled the requirement for intelligent content management systems. One of the essential parts of such systems is automated classification of image content. In this paper, we address the identification of sports based on the sports field in an image using a robust classification mechanism. Conventional approaches discussed in the literature for sport identification are primarily related to video; for example, Wang et al. [1] distinguish sports video shots using color and motion features, Takagi et al. [2] proposed an HMM-based video classification system using camera motion parameters, and Kobla et al. [3] applied replay detection, text and motion features with Bayesian classifiers to identify sports video. Messer et al. [4] employed neural networks and texture codes on semantic cues for the same purpose. More recently, Wang et al. [5] classified sports videos with a pseudo-2D-HMM using visual and audio features. In these approaches, sports classification depends on cues like color, texture and spatial arrangement, which are not preferable to use due to the variability of these features between images of the same sport for two different fields and views. Also, none of the currently existing approaches for classifying sports videos can be directly used for classifying images of sports fields, as they also employ temporal features. As opposed to existing approaches, we have opted to use the geometric information of a field to identify the sport. The motivation for this approach is the fact that sports fields have dominant geometric structures on a planar surface.
Fig. 1. Invariant nature of cross ratio for four collinear points subjected to a projective transformation
These structures are well defined and uniform across different fields for the same sport. Moreover, these structures consist of lines that are in stark contrast to the sports field, like white lines over green ground, making it easy to identify such lines using conventional image processing techniques. These observations motivate the use of geometric information to develop a robust classifier. Geometric information such as projective invariance has been used for object recognition [6,7,8], but to the best of our knowledge, no prior work has addressed the classification of sport's field images using projective invariance. In the proposed approach, we exploit the invariance of the cross ratio under projective transformation. Thus, in any view of a sport's field, four corresponding collinear points have the same cross ratio. Since this is true only for the four corresponding points, we use a histogram of cross-ratios for a given sport's field image to describe a model for that sport. To ensure selection of the same points in each frame, we limit our consideration to points of intersection of dominant lines only. The paper is organized as follows. Section 2 explains the proposed approach. The results are presented in Section 3. Finally, Section 4 concludes the paper and suggests future work.
2
Proposed Approach
The planar geometric structures in images of a sport's field, i.e., the lines that demarcate the field, taken from different viewpoints, are related to each other by a projective transformation. Such a transformation does not preserve distances or angles; however, incidence and cross-ratios are invariant under it (see [9]).
Fig. 2. Interim steps for calculation of the feature vector: (a) input image; (b) Canny edge detection; (c) Hough-transform-based line detection and the points of intersection of the lines; (d) histogram of cross ratios calculated using the intersection points
In the following paragraphs, we build upon the invariance of both these attributes to develop a robust classifier for sports images.
2.1 Cross Ratio Histogram
For a line, the incidence relations are 'lies on' between points and lines (as in 'point P lies on line L'), and 'intersects' (as in 'line L1 intersects line L2'). The cross ratio is a ratio of ratios of distances. Given four collinear points p_1, p_2, p_3, and p_4 in P^2 and denoting the Euclidean distance between two points p_i and p_j as \Delta_{ij}, the cross ratio can be defined as shown in Figure 1,

\tau_{p_1 p_2 p_3 p_4} = \frac{\Delta_{13}\, \Delta_{24}}{\Delta_{14}\, \Delta_{23}}.   (1)
Although cross ratio is invariant once the order of the points has been chosen, its value is different depending on this order. Four points can be chosen in
4! = 24 different orders, but in fact only six distinct values are produced, which are related by the set

\{\tau,\ 1/\tau,\ 1 - \tau,\ 1/(1 - \tau),\ (\tau - 1)/\tau,\ \tau/(\tau - 1)\}.   (2)
Thus, for four given points, we use the minimum of the above cross-ratio values as their representative. To classify an unknown sport's field image, Canny edge detection [10] is first applied to the image. The dominant lines in the image are then identified using a Hough transform [11]. Since the lines drawn on a sports field are dominant compared to other edges produced by noise or players, these dominant lines are easily identified using the above approach. On each line obtained, we find the points of intersection with the rest of the lines. These intersection points are then used for the cross-ratio calculation, as they will be consistently detected in any view of a sport's field. Also, for all coplanar lines, such points of intersection correspond to the same physical points on the field irrespective of the viewing angle of the camera, due to the invariance of incidence relations. From the entire set of points obtained on a line, we form subsets of four points each. For each subset we calculate the representative cross-ratio as explained using the set in equation (2). Thus, for a given image, we obtain a large number of cross-ratio values which encode the geometric structures present in the field. Ideally, if we were able to consistently reproduce the corresponding points in each input image for a given sport, we could match the cross-ratio values individually and obtain a good classification. Normally this is not the case; two issues make the direct comparison of cross-ratio values infeasible. Firstly, it is challenging to consistently establish correspondence of points across images of a sport's field taken from different viewpoints, because some points might not be detected due to noise in the image, or because the orientations of the sports fields are significantly different. Secondly, there may be small variations in the cross-ratio values due to noise in the measurement of the lines. To overcome these issues, a histogram-based representation of the cross-ratios is used. For each new image to be classified, the above steps are carried out to obtain a histogram of cross-ratios. This histogram is then used as a feature vector describing the dominant geometric structures in the image. A trained SVM classifier is used to decide the class to which each feature vector, and hence its corresponding image, belongs. The entire process with intermediate results is pictorially depicted in Figure 2.
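A sketch of the whole pipeline is given below. It is only an assumed implementation: it relies on OpenCV's Canny and Hough routines, the thresholds, the number of retained lines and the histogram range are arbitrary choices rather than the paper's tuned values, and since the paper does not spell out how negative members of the set in equation (2) are handled, the sketch simply keeps the minimum value and lets the histogram range discard values outside it.

```python
import cv2
import numpy as np
from itertools import combinations

def representative_cross_ratio(p1, p2, p3, p4):
    d = lambda a, b: float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))
    denom = d(p1, p4) * d(p2, p3)
    if denom < 1e-9:
        return float("nan")                    # degenerate quadruple (coincident points)
    tau = (d(p1, p3) * d(p2, p4)) / denom
    if tau in (0.0, 1.0):
        return float("nan")                    # the six related values are not all defined
    variants = (tau, 1 / tau, 1 - tau, 1 / (1 - tau), (tau - 1) / tau, tau / (tau - 1))
    return min(variants)                       # representative value of the set in Eq. (2)

def cross_ratio_histogram(gray, n_bins=30, max_tau=3.0, max_lines=20):
    edges = cv2.Canny(gray, 50, 150)
    hough = cv2.HoughLines(edges, 1, np.pi / 180, threshold=120)
    if hough is None:
        return np.zeros(n_bins)
    lines = [tuple(l[0]) for l in hough[:max_lines]]   # most dominant (rho, theta) lines
    taus = []
    for i, (r1, t1) in enumerate(lines):
        pts = []
        for j, (r2, t2) in enumerate(lines):           # intersections of line i with the rest
            if i == j:
                continue
            A = np.array([[np.cos(t1), np.sin(t1)], [np.cos(t2), np.sin(t2)]])
            if abs(np.linalg.det(A)) > 1e-6:
                pts.append(np.linalg.solve(A, np.array([r1, r2])))
        for quad in combinations(pts, 4):              # four collinear points on line i
            tau = representative_cross_ratio(*quad)
            if np.isfinite(tau):
                taus.append(tau)
    hist, _ = np.histogram(taus, bins=n_bins, range=(0.0, max_tau))
    return hist / max(hist.sum(), 1)                   # normalised cross-ratio histogram
```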
2.2 Support Vector Machine Classifier
Since multiple sports are being considered, there is a need for a multi-class SVM classifier. The SVM is originally designed for binary classification only. However, multiple strategies for extending the binary classifier to a multi-class classifier have been discussed in the literature, such as one-against-all [12], one-against-one [13], and half-against-half [14]. In the case of one-against-all (OVA), the
Table 1. Cross-ratio values for a horizontal and a vertical line on the outer boundary of the sports field for each sport. These values are calculated directly from the standard field measurements.

             Horizontal line   Vertical line
Tennis       1.0208            1.0989
Badminton    1.08              1.412
Football     1.333             1.05
Basketball   0.91              0.96
N-class problem is decomposed into a series of N binary classification problems. In one-against-one (OVO), for an N-class problem, N(N-1)/2 binary classifiers are trained and the appropriate class is found by a voting scheme. For a half-against-half (HAH) extension, a classifier is built by recursively dividing the training dataset of N classes into two subsets. We have used a modified OVA approach. OVA is one of the earliest approaches for multi-class SVM and gives good results for most problems. For an N-class problem, it constructs N binary SVMs; each binary SVM is trained with all the samples belonging to one class as positive samples and all the samples from the remaining classes as negative. Given a sample to classify, all N SVMs are evaluated and the label of the class that has the largest value of the decision function is chosen. One drawback of this method is that, when the results from multiple classifiers are combined, each classifier gets the same importance irrespective of its competence. Hence, instead of directly comparing the decisions, we use reliability measures to make the final decision, which makes the multi-class framework more robust. The static reliability measure proposed in [15] has been used for our multi-class framework.
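The sketch below illustrates the classification stage using scikit-learn, which is an assumed library choice. It implements only the plain OVA baseline described above with a second-order polynomial kernel; the static reliability weighting of [15] is omitted, so decision values are compared directly.

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def train_sport_classifier(histograms, labels):
    """histograms: (n_samples, n_bins) cross-ratio histograms; labels: sport names."""
    clf = OneVsRestClassifier(SVC(kernel="poly", degree=2, C=1.0))
    return clf.fit(histograms, labels)

# predictions = train_sport_classifier(train_X, train_y).predict(test_X)
```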
3
Experimental Results
Images of Tennis, Basketball, Badminton, and Football fields have been used for experimental evaluation since they possess good geometric structure in their respective fields. The first column of Table 1 shows the cross-ratio values for the points of intersection of the vertical lines with the outer horizontal boundary line of the field. Similarly, the second column shows the cross-ratio values for the points of intersection of the horizontal lines with the outer vertical boundary line of the field. It can be observed that the difference in geometric structure between sports fields is well reflected in the cross-ratio values, and hence these can be used as a basis for classification of the sport's field. The cross-ratio histograms for images of two different fields for each of the four sports considered are shown in Figure 3. It can be noted that, for the same sport, images taken on different fields from different viewpoints generate similar cross-ratio histograms. We have collected 200 images of each sport's field from videos, the Internet, and photographs of various sports fields, ensuring that the images for each class were taken under different viewpoints and illumination conditions.
Fig. 3. The first and second columns show two separate views of the sports field with the detected lines and points of intersection overlaid on the gray image. The third column shows the cross-ratio histograms for both views (blue represents the histogram of the first-column image and red that of the second-column image in each row).
Out of these images, 50 images from each class were used for generating the cross-ratio histogram feature vectors to train the SVM classifier. The remaining 150 images from each class were mixed to form a collective database of 600 images, which has been used as the test set to evaluate the performance of the trained SVM classifier. To test the robustness of the proposed system when presented with non-sport images, we introduced 100 random images that did not belong to any of the four sports fields. For the SVM classifier, we experimented with polynomial and radial basis kernel functions with a variety of parameter values. Of the two, we observed that the second-order polynomial kernel performed with higher accuracy.
Table 2. Classification accuracies in percentage for each class. The overall classification accuracy is 91.857%.

Sport        Tennis   Badminton   Basketball   Football   Non-sport
Tennis       96.66    2           0.66         0          0.66
Badminton    2.66     94          0.66         1.33       1.33
Basketball   0.66     2           91.33        3.33       2.66
Football     0.66     0.66        3.33         90.66      4.66
Non-sport    2        1           5            8          84
The performance metric is defined as the percentage of correct classifications per sport. Classification accuracies for the 4 types of sports are given in Table 2. The overall classification error on the test dataset was observed to be 8.14%. The results clearly indicate that, for real-world datasets under different conditions and viewpoints, the classification accuracies are very good. We also noted that most of the non-sport images were segregated well, and the few that were misclassified as one of the four sports consisted mostly of images of buildings.
4
Conclusion
The paper proposed an approach for sports image classification based on the geometric structure of the sports field. Our experimental validation shows encouraging results and high classification accuracy even for widely separated views of a sport's field. One observation from the experimental results is that the proposed approach performs very well when used for classifying images of the sports fields it was trained for, but the performance drops (the classification error increases from 6.83% to 8.14%) when random images with prominent geometric structures are introduced. We are therefore currently working to improve the system's ability to discern between sports and non-sports images using other invariant cues from the geometry of the sports fields, so as to improve the robustness and accuracy of the proposed approach. To address this, we are exploring the relative spatial distribution of intersection points and invariant measures for lines under projective transformation to further augment the feature vector for the classifier. We are also gathering more data for sports such as Hockey, Baseball, and Lacrosse to extend the classifier.
References

1. Wang, D.H., Tian, Q., Gao, S., Sung, W.K.: News sports video shot classification with sports play field and motion features. In: ICIP, pp. 2247–2250 (2004)
2. Takagi, S., Hattori, S., Yokoyama, K., Kodate, A., Tominaga, H.: Sports video categorizing method using camera motion parameters. In: ICME 2003. Proceedings of the 2003 International Conference on Multimedia and Expo, pp. 461–464. IEEE Computer Society, Washington, DC, USA (2003)
3. Kobla, V., DeMenthon, D., Doermann, D.: Identifying sports videos using replay, text, and camera motion features (2000)
4. Messer, K., Christmas, W., Kittler, J.: Automatic sports classification. In: ICPR (2002)
5. Wang, J., Xu, C., Chng, E.: Automatic sports video genre classification using pseudo-2D-HMM. In: ICPR 2006. Proceedings of the 18th International Conference on Pattern Recognition, pp. 778–781. IEEE Computer Society, Washington, DC, USA (2006)
6. Lei, G.: Recognition of planar objects in 3-D space from single perspective views using cross ratio. IEEE Transactions on Robotics and Automation 6(4), 432–437 (1990)
7. Weiss, I., Ray, M.: Model-based recognition of 3D objects from single images. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(2), 116–128 (2001)
8. Rajashekhar, Chaudhuri, S., Namboodiri, V.: Image retrieval based on projective invariance, pp. I: 405–408 (2004)
9. Mundy, J.L., Zisserman, A.: Geometric Invariance in Computer Vision. MIT Press, Cambridge, MA, USA (1992)
10. Canny, J.: A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 8(6), 679–698 (1986)
11. Hough, P.: Method and means for recognizing complex patterns. US Patent 3,069,654 (1962)
12. Vapnik, V.: Statistical Learning Theory. Wiley, Chichester
13. Kreßel, U.H.G.: Pairwise classification and support vector machines, pp. 255–268 (1999)
14. Lei, H., Govindaraju, V.: Half-against-half multi-class support vector machines. In: Multiple Classifier Systems, pp. 156–164 (2005)
15. Liu, Y., Zheng, Y.F.: One-against-all multi-class SVM classification using reliability measures. In: International Joint Conference on Neural Networks, vol. 2, pp. 849–854 (2005)
A Bayesian Network for Foreground Segmentation in Region Level

Shih-Shinh Huang1, Li-Chen Fu2, and Pei-Yung Hsiao3

1 Dept. of Computer and Communication Engineering, National Kaohsiung First University of Science and Technology, Kaohsiung, Taiwan, R.O.C.
2 Dept. of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, R.O.C.
3 Dept. of Electrical Engineering, National University of Kaohsiung, Kaohsiung, Taiwan, R.O.C.

Abstract. This paper presents a probabilistic approach for automatically segmenting foreground objects from a video sequence. In order to save computation time and be robust to noise, a region detection algorithm incorporating edge information is first proposed to identify the regions of interest. Next, we consider the motion of the foreground objects and hence exploit the temporal coherence of the detected regions. The foreground segmentation problem is then formulated as follows: given two consecutive image frames and the previously obtained segmentation result, we simultaneously estimate the motion vector field and the foreground segmentation mask in a mutually supporting manner. To represent the conditional joint probability density function in a compact form, a Bayesian network is adopted to model the interdependency of these two elements. Experimental results for several video sequences are provided to demonstrate the effectiveness of our proposed approach.
1
Introduction
Foreground segmentation is a fundamental element in developing a system that can deal with high-level tasks of computer vision. Accordingly, the objective of this paper is to propose a new approach to automatically segment the objects of interest from the video.
1.1 State of the Art
Generally speaking, techniques for foreground segmentation can be grouped into two categories, namely background subtraction and motion-based foreground segmentation. For background subtraction, a simple and intuitive way is to model the background by an independent representation of the gray level or color intensity of every pixel in terms of a uni-modal or uniform distribution [1,2]. If the intensity of each pixel is due to the light reflected from a particular surface under a single source of lighting, a uni-modal distribution will be sufficient to represent the pixel value. However, in the real world, the model
of a pixel value in most video sequences is characterized by a multi-modal distribution, and hence a mixture of Gaussian distributions is commonly used [3,4,5]. For example, Friedman and Russell [4] modeled the pixel intensity as a mixture of three Gaussian distributions corresponding to three hypotheses: road, vehicle, and shadow. In [5], this idea is extended to model the scene by using multiple Gaussian distributions and to develop a fast approximate method for incrementally updating the parameters. However, not all distributions are Gaussian [6]. In [7,6], a non-parametric background model based on kernel functions is employed to represent the color distribution of each background pixel. This kind of approach is a generalization of the mixture of Gaussians, but unfortunately its computational complexity is high. Moreover, the performance of these methods deteriorates when the background exhibits noise or dynamic properties, because they carry out foreground segmentation at the pixel level without taking spatial relations among pixels into consideration. The approaches of motion-based foreground segmentation arise from the fact that the images of foreground objects are generally accompanied by motion. Accordingly, motion-based foreground segmentation refers to dividing an image into a set of motion-coherent objects, which are associated with different labels. To achieve this, the first step is to iteratively compute the dominant motion and the significant areas consistent with the current dominant motion [8]. A drawback of these methods is that their performance deteriorates in the absence of a well-defined dominant motion. To overcome this problem, color information is introduced to obtain more accurate segmentation results. Color segmentation algorithms, such as watershed or mean-shift, are first applied to obtain an initial segmentation, and then regions with similar motion and intensity are merged. Finally, a spatiotemporal constraint is imposed to classify each segmented region as foreground or background. These approaches are generally referred to as region-merging approaches [9,10,11]. Nevertheless, motion-based foreground segmentation approaches are seriously limited by the accuracy of the motion estimation algorithm. Thus, the segmentation results may become unsatisfactory when the image sequence exhibits severe noise or when the foreground object undergoes large displacements.
1.2 Approach Overview
To reduce the problem dimensionality and to provide more accurate segmentation results, an algorithm that incorporates edge information with intensity is used to identify the regions of interest, and the neighborhood relations among them are expressed by a region adjacency graph (RAG). Let G = {V, E} be a RAG, where V = {s_i : i ∈ [1, ..., R]} is the set of graph nodes and each node s_i corresponds to a region; E is the set of edges, and (s_i, s_j) ∈ E if the regions s_i and s_j are adjacent. Here, R is the number of extracted regions of interest. The set of neighboring regions of s_i is denoted as N(s_i) = {s_j : (s_i, s_j) ∈ E and s_i ≠ s_j}. The main drawback of the methods [12,9] that directly estimate the motion followed by foreground segmentation is that their performance is limited by
the motion estimation. Instead, we model the relationship between the motion vector field D_t and the foreground segmentation mask F_t through a Bayesian network. To impose temporal coherence, the main idea is to reinforce the interdependency between D_t and F_t so that motion estimation and foreground segmentation mutually support each other. The remainder of this paper is organized as follows. In Section 2, we present an algorithm to extract regions of interest. Based on the obtained regions, Section 3 describes how to model the foregoing interdependency among variables through a Bayesian network and how to infer the motion vector field and the foreground segmentation mask simultaneously. Section 4 gives the probability definition of each aforementioned term. In Section 5, we demonstrate the effectiveness of the developed approach by providing some satisfactory experimental results. Finally, we conclude the paper in Section 6 with some insightful discussions.
2
Interested Region Detection
Traditionally, color segmentation followed by a region merging algorithm [9,10] is a widely used method for region generation. Approaches based on such a region-merging strategy usually result in over-segmentation. Here, we propose an algorithm that incorporates both intensity and edge information to identify the regions of interest, so that noise effects can be successfully removed and the number of regions greatly decreased. This algorithm mainly consists of two steps, namely CDM (change detection mask) detection and region generation.
2.1 CDM Detection
First of all, an initial change detection mask (CDMi_t) between two consecutive image frames I_{t-1} and I_t is computed by a thresholding operation. Fig. 1(a) shows an example of CDMi between frames 39 and 40 of the Hall Monitoring video sequence. However, taking only intensity into account for change detection suffers from the problem that non-changed pixels are easily mis-detected as changed ones. Here we incorporate moving edges into the CDM detection step for identifying the regions with motion. The moving edge (ME) map is defined as

ME_t = \{\, e \in E_t \mid \min_{x \in DE_t} \|e - x\| \le T_{change} \,\},   (1)

where E_t is the edge map of the current frame I_t, DE_t is the difference edge map obtained by applying the Canny edge detector to the difference image D_t, and T_{change} is a distance threshold. In our implementation, T_{change} is set to 2. Fig. 1(b) shows the ME of frame 40 of the Hall Monitoring video sequence. Clearly, the moving edges arise only from the boundaries of the foreground objects and are invariant to noise effects. Taking the pixels in ME as initial seed points, the region growing algorithm [13] is then used to expand them pixel by pixel within CDMi. The output of the region growing algorithm is denoted
as CDMu. Note that, as shown in Fig. 1(c), the noise is almost completely removed after incorporating the extracted moving edges. In order to maintain temporal stability, which means that pixels belonging to the previous object mask should also belong to the current change mask, we set all pixels in the previous object mask OM_{t-1} as changed in the resulting change detection mask (CDM_t). Fig. 1(d) shows the final change detection mask.
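A minimal sketch of the moving-edge computation in Eq. (1) is given below. It assumes 8-bit grayscale frames and uses OpenCV; the Canny thresholds are illustrative, the absolute difference is used as a stand-in for the difference image D_t, and T_change = 2 follows the text. The region-growing and temporal-stability steps are omitted.

```python
import cv2
import numpy as np

def moving_edge_map(curr, prev, t_change=2):
    e_t = cv2.Canny(curr, 50, 150) > 0                        # edge map E_t of frame I_t
    de_t = cv2.Canny(cv2.absdiff(curr, prev), 50, 150) > 0    # difference edge map DE_t
    # distance of every pixel to its nearest difference-edge pixel
    dist = cv2.distanceTransform((~de_t).astype(np.uint8), cv2.DIST_L2, 3)
    return e_t & (dist <= t_change)                           # ME_t of Eq. (1)
```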
Fig. 1. An example of the algorithm for interested region detection
2.2
Region Generation
Based on the assumption that foreground boundaries always coincide with color-segmentation boundaries, the colors of the pixels in the CDM are first quantized by a k-means clustering algorithm in RGB color space, followed by a connected-component finding algorithm, so as to extract a set of homogeneous color regions. Currently, the number of quantized colors used is 12. We show all regions with different colors in Fig. 1(e). Finally, a data structure called the region adjacency graph (RAG) is used to represent the spatial relations among these regions of interest.
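The sketch below illustrates the region-generation step under stated assumptions (OpenCV's k-means and connected-component routines; 12 quantized colors as in the text). The construction of the RAG from the resulting label image is omitted for brevity.

```python
import cv2
import numpy as np

def generate_regions(frame, cdm, n_colors=12):
    mask = cdm > 0
    samples = frame[mask].astype(np.float32)                  # CDM pixels, shape (N, 3)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, color_ids, _ = cv2.kmeans(samples, n_colors, None, criteria, 3,
                                 cv2.KMEANS_PP_CENTERS)
    quantised = np.zeros(frame.shape[:2], np.int32)
    quantised[mask] = color_ids.ravel() + 1                   # 0 reserved for non-CDM pixels
    regions, next_id = np.zeros_like(quantised), 0
    for c in range(1, n_colors + 1):                          # connected components per color
        n, comp = cv2.connectedComponents((quantised == c).astype(np.uint8))
        regions[comp > 0] = comp[comp > 0] + next_id
        next_id += n - 1
    return regions                                            # region ids 1..R inside the CDM
```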
3
Bayesian Foreground Segmentation
In our formulation, the problem of foreground segmentation is described as follows. Given the previous foreground segmentation mask F_{t-1} and two consecutive image frames I_{t-1} and I_t, we define the joint conditional probability density function of two variables: the motion vector field D_t and the foreground segmentation mask F_t, that is, Pr(D_t, F_t | I_{t-1}, I_t, F_{t-1}). To be clearer, D_t is a random variable that represents the motion of each region; for simplicity, we use a translational motion model to describe the region motion in this research. As for F_t, it is a random variable that assigns a label, either background (b) or foreground (f), to each region. From Bayes' rule, the a posteriori probability density function of D_t and F_t given I_{t-1}, I_t, and F_{t-1} can be expressed as

Pr(D_t, F_t \mid F_{t-1}, I_{t-1}, I_t) = \frac{Pr(D_t, F_{t-1}, F_t, I_{t-1}, I_t)}{Pr(F_{t-1}, I_{t-1}, I_t)}.   (2)

Because Pr(F_{t-1}, I_{t-1}, I_t) is constant with respect to the unknowns, the estimation of D_t and F_t is by computation of the maximum a posteriori (MAP) solution, that is,

(\tilde{D}_t, \tilde{F}_t) = \arg\max_{(D_t, F_t)} Pr(D_t, F_{t-1}, F_t, I_{t-1}, I_t).   (3)
In order to represent the joint probability density function in a compact form, we must identify the dependency among these variables. A Bayesian network
in Fig. 2(a) is adopted to model the interrelationships among these variables in three aspects. For explanation, we decompose it into three sub-graphs, as shown in Fig. 2(b)(c)(d). Under such modeling, Pr(D_t, F_{t-1}, F_t, I_{t-1}, I_t) can be decomposed into five terms and further expressed as

Pr(D_t, F_{t-1}, F_t, I_{t-1}, I_t) = Pr(I_t \mid F_t)\, Pr(I_{t-1} \mid D_t, I_t)\, Pr(F_{t-1} \mid D_t, F_t)\, Pr(D_t)\, Pr(F_t).   (4)

The first term Pr(I_t | F_t) models the observation likelihood and is commonly used in traditional background subtraction approaches; the graphical model expressing this term is shown in Fig. 2(b). The second term Pr(I_{t-1} | D_t, I_t) stands for the displaced frame difference (DFD), which was widely used in [14] for estimating the motion vector field; Fig. 2(c) shows its graphical model. The third term Pr(F_{t-1} | D_t, F_t) represents the temporal coherence [10], which states that the foreground segmentation mask F_t should be almost consistent with F_{t-1} through motion compensation using D_t (see Fig. 2(d)). Finally, the last two terms, Pr(D_t) and Pr(F_t), model the prior probabilities of D_t and F_t, respectively.
Fig. 2. Graphical Models
4
Probability Modeling
In this section, we concentrate on defining the five terms described in (4). The term Pr(I_t | F_t) evaluates how well the currently observed image I_t fits the labels given in F_t. Here, the label assigned to a region site s in F_t is denoted as F_t(s) and is either b (background) or f (foreground). Accordingly, the two conditional probability density functions Pr(I_t(s) | F_t(s) = b) and Pr(I_t(s) | F_t(s) = f) must be defined for modeling the background likelihood and the foreground likelihood, respectively. The way we define these two functions is the same as in [15].
4.1 Temporal Constraint
Temporal information is a fundamental element in maintaining the consistency of the results over time. The use of motion information across an image sequence is a common way to impose temporal coherence.
Displaced Frame Difference. The conditional probability Pr(I_{t-1} | I_t, D_t) quantifies how well the estimated motion vectors fit the consecutively observed images. As in [14], we model this term by a Gibbs distribution with the following energy function:

U_{DFD}(I_{t-1} \mid I_t, D_t) = \sum_{s_i \in V} \sum_{(x,y) \in s_i} \lambda_{DFD}\, \epsilon(x, y)^2,   (5)

where

\epsilon(x, y) = \| I_t(x, y) - I_{t-1}((x, y) - D_t(s_i)) \|   (6)
is the displaced frame difference (DFD) and \lambda_{DFD} is a parameter controlling the contribution of this term.

Tracked Foreground Difference. The term Pr(F_{t-1} | D_t, F_t) models the temporal coherence between the segmented foreground masks F_{t-1} and F_t at two consecutive time instants. It states that a region site in the current frame will probably keep the same label as its corresponding region, obtained through motion compensation using D_t, in the previous frame. Therefore, we define the energy function of Pr(F_{t-1} | D_t, F_t) as

U_{TFD}(F_{t-1} \mid D_t, F_t) = -\sum_{s_i \in V} \sum_{(x,y) \in s_i} \lambda_{TFD}\, \delta(F_{t-1}(x', y') - F_t(x, y))   (7)

where \delta(\cdot) is the Kronecker delta function and (x', y') is the pixel corresponding to (x, y) \in s_i in the previous frame, that is, (x', y') = (x, y) - D_t(s_i).
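For one region and one candidate translation, the two temporal energies can be evaluated as in the sketch below, assuming single-channel frames and a boolean region mask; names and parameters are illustrative only.

```python
import numpy as np

def temporal_energies(curr, prev, prev_labels, curr_label, region_mask, d,
                      lam_dfd=1.0, lam_tfd=1.0):
    ys, xs = np.nonzero(region_mask)
    xs_p, ys_p = xs - int(d[0]), ys - int(d[1])            # (x', y') = (x, y) - D_t(s_i)
    ok = (xs_p >= 0) & (xs_p < curr.shape[1]) & (ys_p >= 0) & (ys_p < curr.shape[0])
    eps = curr[ys[ok], xs[ok]].astype(float) - prev[ys_p[ok], xs_p[ok]]
    u_dfd = lam_dfd * float(np.sum(eps ** 2))              # contribution of s_i to Eq. (5)
    same = prev_labels[ys_p[ok], xs_p[ok]] == curr_label   # Kronecker delta of Eq. (7)
    u_tfd = -lam_tfd * float(np.count_nonzero(same))
    return u_dfd, u_tfd
```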
4.2 Spatial Constraint
The spatial constraint expresses that adjacent region sites tend to have the same label. In other words, it encourages smooth motion estimates in D_t and contiguous components in F_t.

Motion Smoothness. Pr(D_t) represents the prior probability of the motion vector field, and the motion smoothness constraint is imposed to define this term. Suppose that s_i and s_j are two neighboring region sites and b_{ij} denotes the length of the common border between s_i and s_j. We define Pr(D_t) by a Gibbs distribution with the energy function

U_{MS}(D_t) = \lambda_{MS} \sum_{s_i \in V} \sum_{s_j \in N(s_i)} b_{ij}\, \| D_t(s_i) - D_t(s_j) \|   (8)
where \|\cdot\| is the Euclidean norm.

Foreground Smoothness. As for Pr(F_t), we relate it to foreground smoothness, which means that two neighboring region sites are likely to be
assigned the same label. Similar to Pr(D_t), Pr(F_t) is defined by the energy function

U_{FS}(F_t) = \lambda_{FS} \sum_{s_i \in V} \sum_{s_j \in N(s_i)} V_{FS}(s_i, s_j)   (9)

where

V_{FS}(s_i, s_j) = \begin{cases} -b_{ij} & \text{if } F_t(s_i) = F_t(s_j) \\ +b_{ij} & \text{otherwise} \end{cases}   (10)
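An illustrative sketch of the two spatial energies follows; rag_edges is assumed to map each neighbouring region pair (s_i, s_j) to its common border length b_ij, and each pair is counted once here (the double sum in Eqs. (8)-(9) counts each pair twice, which only rescales the energies).

```python
import numpy as np

def smoothness_energies(rag_edges, D, F, lam_ms=1.0, lam_fs=1.0):
    u_ms = u_fs = 0.0
    for (si, sj), b_ij in rag_edges.items():
        diff = np.asarray(D[si], float) - np.asarray(D[sj], float)
        u_ms += lam_ms * b_ij * float(np.linalg.norm(diff))       # Eq. (8)
        u_fs += lam_fs * (-b_ij if F[si] == F[sj] else b_ij)      # V_FS of Eq. (10)
    return u_ms, u_fs
```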
Once the energy functions modeling the interdependency among the aforementioned variables are defined, the next step is to find a solution by maximizing (4). However, due to the nature of the problem, this probability distribution is non-convex, and it is difficult to estimate D_t and F_t simultaneously. Therefore, we employ an iterative optimization strategy [16] to find the solution, which yields the final foreground segmentation.
5
Experiment
This section presents some experimental results of the proposed approach. Both a subjective evaluation and an objective evaluation in comparison with Lee's approach [17] are provided to demonstrate the effectiveness of our approach.
5.1 Subjective Evaluation
The first image sequence used here is the Hall Monitoring MPEG-4 test sequence in CIF format at 10 fps. In Fig. 3, the first row shows the original frames, and the numbers in the top-left corners are the frame numbers. In particular, frame 10 is used to demonstrate that our approach can automatically detect newly introduced objects. In contrast, in Lee's approach, the noisy pixels caused by light fluctuation (red circles in Fig. 3(b)) and the shadow areas (blue circles in Fig. 3(b)) are both mis-classified as foreground regions. It can therefore be seen that, by performing foreground segmentation at a more semantic region level and using an appropriate scaling factor, both the noise and the shadow effects are successfully removed by our proposed approach, as shown in Fig. 3(c).
5.2 Objective Evaluation
Besides the subjective evaluation, we take two videos from the contest held by IPPR (Image Processing and Pattern Recognition), Taiwan, 2006, with manually generated ground-truth images, to validate the effectiveness of our approach from an objective viewpoint. Here, an error rate ε similar to that of [1] is adopted for objective evaluation. It is defined as ε = N_e / N_I, where N_I is the frame size and N_e is the number of mis-classified pixels. The two videos, with frame size 320 × 240, are from the IPPR contest (http://www.ippr.org.tw/) and are unfortunately low-quality image sequences with severe noise.
Fig. 3. Hall Monitoring Sequence. (a) are frames 10, 20, 40, 60, and 80. (b) and (c) are the segmentation results of Lee’s approach and our proposed approach, respectively.
Fig. 4. IPPR Contest Sequence One. (a) are original images exhibiting similar appearance and large movement; (b) and (c) are the segmentation results of Lee’s approach and our approach, respectively.
The challenges of the first video are that a person has an appearance similar to the background and that the foreground objects exhibit large movements in the scene. However, by modeling the interdependency between D_t and F_t, we can
Fig. 5. Error Rate
obtain accurate motion estimates that facilitate the foreground segmentation, as demonstrated in the bottom-right picture of Fig. 4(c). The error rate curves over 150 frames for Lee's approach and our approach are shown in Fig. 5(a); the average error rates are 1.1% and 0.5%, respectively. The second IPPR contest sequence shows two persons walking through a gallery. The strong light shed through the windows on the left is blocked by the walking persons, and severe shadow areas are formed on the ground and wall. Due to space limitations, images of this video sequence are omitted from this paper. For this video, the average error rates (see Fig. 5(b)) of Lee's approach and our approach are 2.3% and 1.0%, respectively.
6
Conclusion
In this paper, we propose a region detection algorithm that incorporates moving edges with intensity differences to extract the regions of interest for further foreground segmentation at the region level. This greatly reduces the influence of noise and at the same time saves computation. Under the proposed Bayesian network, which represents the crucial interdependency of the involved variables, the foreground segmentation task is naturally formulated as an inference problem with imposed spatiotemporal constraints corresponding to the above interdependency relationships. The solution is obtained by an iterative optimization strategy that maximizes the objective function over D_t and F_t under the MAP criterion. The main advantage is that motion estimation and foreground segmentation are achieved simultaneously in a mutually supporting manner, so that more accurate results can be obtained.
Acknowledgement This work was supported by the National Science Council under the project NSC 94-2752-E-002-007-PAE and NSC 96-2516-S-390-001-MY3.
References

1. Chien, S.Y., Ma, S.Y., Chen, L.G.: Efficient Moving Object Segmentation Algorithm Using Background Registration Technique. IEEE Trans. on Circuits and Systems for Video Technology 12(7), 577–586 (2002)
2. Haritaoglu, I., Harwood, D., Davis, L.S.: W4: Real-Time Surveillance of People and Their Activities. IEEE Trans. on Pattern Analysis and Machine Intelligence 22(8), 809–830 (2000)
3. Benedek, C., Sziranyi, T.: Markovian Framework for Foreground-Background-Shadow Separation of Real World Video Scenes. In: Narayanan, P.J., Nayar, S.K., Shum, H.-Y. (eds.) ACCV 2006. LNCS, vol. 3851, pp. 898–907. Springer, Heidelberg (2006)
4. Friedman, N., Russell, S.: Image Segmentation in Video Sequences: A Probabilistic Approach. In: Intl. Conf. on Uncertainty in Artificial Intelligence (1997)
5. Stauffer, C., Grimson, W.: Adaptive Background Mixture Models for Real-Time Tracking. In: IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 246–252 (1999)
6. Elgammal, A., Harwood, D., Davis, L.S.: Non-parametric Model for Background Subtraction. In: European Conf. on Computer Vision, vol. 2, pp. 571–767 (2000)
7. Sheikh, Y., Shah, M.: Bayesian Modeling of Dynamic Scenes for Object Detection. IEEE Trans. on Pattern Analysis and Machine Intelligence 27(1), 1778–1792 (2005)
8. Wang, J.Y.A., Adelson, E.H.: Representing Moving Images with Layers. IEEE Trans. on Image Processing 3(5), 625–638 (1994)
9. Tsaig, Y., Averbuch, A.: Automatic Segmentation of Moving Objects in Video Sequences: A Region Labeling Approach. IEEE Trans. on Circuits and Systems for Video Technology 12(7), 597–612 (2002)
10. Patras, I., Hendriks, E.A., Lagendijk, R.L.: Video Segmentation by MAP Labeling of Watershed Segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence 23(3), 326–332 (2001)
11. Altunbasak, Y., Eren, P.E., Tekalp, A.M.: Region-Based Parametric Motion Segmentation Using Color Information. Graphical Models and Image Processing: GMIP 60(1), 13–23 (1998)
12. Chen, P.C., Su, J.J., Tsai, Y.P., Hung, Y.P.: Coarse-To-Fine Video Object Segmentation by MAP Labeling of Watershed Regions. Bulletin of the College of Engineering, N.T.U. 90, 25–34 (2004)
13. Adams, R., Bischof, L.: Seeded Region Growing. IEEE Trans. on Pattern Analysis and Machine Intelligence 16(6), 641–647 (1994)
14. Wang, Y., Loe, K.F., Tan, T., Wu, J.K.: Spatiotemporal Video Segmentation Based on Graphical Model. IEEE Trans. on Image Processing 14(7), 937–947 (2005)
15. Huang, S.S., Fu, L.C., Hsiao, P.Y.: A Bayesian Network for Foreground Segmentation. In: IEEE Intl. Conf. on Systems, Man, and Cybernetics. IEEE Computer Society Press, Los Alamitos (2006)
16. Chang, M.M., Tekalp, A.M., Sezan, M.I.: Simultaneous Motion Estimation and Segmentation. IEEE Trans. on Image Processing 6(9), 1326–1333 (1997)
17. Lee, D.S.: Effective Gaussian Mixture Learning for Video Background Subtraction. IEEE Trans. on Pattern Analysis and Machine Intelligence 27(5), 827–832 (2005)
Efficient Graph Cuts for Multiclass Interactive Image Segmentation

Fangfang Lu1, Zhouyu Fu1, and Antonio Robles-Kelly1,2

1 Research School of Information Sciences and Engineering, Australian National University
2 National ICT Australia, Canberra Research Laboratory, Canberra, Australia
Abstract. Interactive Image Segmentation has attracted much attention in the vision and graphics community recently. A typical application for interactive image segmentation is foreground/background segmentation based on user specified brush labellings. The problem can be formulated within the binary Markov Random Field (MRF) framework which can be solved efficiently via graph cut [1]. However, no attempt has yet been made to handle segmentation of multiple regions using graph cuts. In this paper, we propose a multiclass interactive image segmentation algorithm based on the Potts MRF model. Following [2], this can be converted to a multiway cut problem first proposed in [2] and solved by expansion-move algorithms for approximate inference [2]. A faster algorithm is proposed in this paper for efficient solution of the multiway cut problem based on partial optimal labeling. To achieve this, we combine the one-vs-all classifier fusion framework with the expansion-move algorithm for label inference over large images. We justify our approach with both theoretical analysis and experimental validation.
1
Introduction
Image segmentation is a classical problem in computer vision. The purpose is to split the pixels into disjoint subsets so that the pixels in each subset form a region with similar and coherent features. This is usually posed as a problem of pixel-level grouping based on certain rules or affinity measures. The simplest way to do this is region growing starting from some initial seeds [3]. More sophisticated clustering methods were introduced later to solve the grouping problem, exemplified by the work of Belongie et al. on the EM algorithm for segmentation [4] and of Shi and Malik on normalized cuts [5]. While early work on segmentation mostly focused on unsupervised learning and clustering, a new paradigm of interactive segmentation has recently come into fashion following the pioneering work of Boykov and Jolly [1]. In [1], pixels are labeled by the user with a brush to represent the foreground and background regions. The purpose is to assign the remaining unlabeled pixels to the foreground/background classes. This can be treated as a supervised version of segmentation, providing more accuracy and robustness through user interaction.
National ICT Australia is funded by the Australian Government's Backing Australia's Ability initiative, in part through the Australian Research Council.
A number of methods have been proposed to extend [1] in different ways, including GrabCut [6], which offers a more convenient form of user interaction than brush labellings, and Lazy Snapping [7], which applies an over-segmentation as a preprocessing step. Most of the previous methods for interactive segmentation consider only binary segmentation, i.e., foreground and background. An exception is the random walk segmentation algorithm proposed by Grady in [8], which can cope with multiclass segmentation. However, it is based on solving linear systems of equations and is not as fast as graph cuts. To our knowledge, the problem of interactive segmentation with multiple classes or regions under the graph cuts framework has not yet been studied. In this paper, we generalize binary interactive image segmentation to multiclass segmentation based on the Markov random field (MRF) model, which can then be treated by the multiway cut framework in [2]. Moreover, we develop a strategy for the efficient computation of the multiway cut cost which combines the one-vs-all classifier fusion framework with the expansion-move algorithm proposed in [9]. Some theoretical analysis is also presented on the optimality of the proposed approach. The remainder of this paper is organized as follows. In Section 2, we briefly present the background material on the MRF model for binary segmentation and the graph cuts algorithm. In Section 3, we propose the model for multiclass interactive segmentation and the details of the labeling algorithm. Experimental results are provided in Section 5. Conclusions are given in Section 6.
2 Binary Image Segmentation Using Graph Cuts

2.1 An MRF Formulation
Markov Random Fields (MRFs) are undirected graphical models. They provide a probabilistic framework for solving labeling problems over graph nodes. Let G = ⟨V, E⟩ denote a graph with node-set V = {V_1, ..., V_N} and edge-set E = {E_{u,v} | u, v ∈ V}. Each node u ∈ V is associated with a label variable X_u which takes discrete values in {1, ..., K}, where K is the number of label classes. Each E_{u,v} ∈ E represents a pairwise relationship between node u and node v. The optimal set of labels is obtained by maximizing the following joint distribution over the label set

\max_X P(X) = \frac{1}{Z} \prod_{E_{u,v} \in E} \psi_{u,v}(X_u, X_v) \prod_{u \in V} \phi_u(X_u)   (1)

where \phi_u(X_u) and \psi_{u,v}(X_u, X_v) are unary and binary potential functions which determine the compatibility of the assignment of nodes in the graph to the label classes. u ∼ v implies that u and v are neighbors, i.e., E_{u,v} ∈ E. Z = \sum_X \prod_{u,v \in E} \psi_{u,v}(X_u, X_v) \prod_{u \in V} \phi_u(X_u) is the unknown normalization factor, which is independent of the assignment of X and thus can be omitted for the purpose of inference.
136
F. Lu, Z. Fu, and A. Robles-Kelly
Taking the negative logarithm of Equation 1, we can convert the MRF formulation into an equivalent problem of energy minimization as follows minimize log P (X) = cu (Xu ) + wu,v (Xu , Xv ) (2) X
u
u∼v
where cu (Xu ) = − log φu (Xu ) and wu,v (Xu , Xv ) = − log ψu,v (Xu , Xv ). Here the binary weight term wu,v (a, b) is determined by the interactions between neighboring nodes. A special form of binary interaction term is the Potts prior for smoothness constraints, which will be introduced shortly.In this paper, we will focus on the Potts MRF model for multiclass image segmentation. 2.2
Solving Binary MRF with Graph Cuts
We focus on general Potts prior MRF in this section with the following form of the binary interaction term. wu,v (a, b) = λu,v δ(a, b) 1 a = b δ(a, b) = 0 a=b
where a, b ∈ {0, 1}
(3)
This will simply penalize disparate labels of adjacent nodes and enforces smoothness of the labels across local neighborhood. The conventional Potts prior is a special case of the above equation with constant λu,v ’s across all neighboring sites u and v. The MRF cost function in Equation 2 can then be re-written for the Potts model as minimize log P (X) = X
N u=1
cu (Xu ) +
λu,v δ(Xu , Xv )
(4)
u∼v
The above cost function can be naturally linked with graph cuts. To see this, we first augment the original undirected graph G = V, E defined over the image pixels, where each pixel is treated as a node in the node-set V, by two additional nodes s and t and edge-links between each pixel and the two terminal nodes. Then the graph nodes are split into two disjoint subsets V1∗ = {V1 , s} and V2∗ = {V2 , t } such that V1 ∪ V2 = V. The cut metric C(G) of the graph G is then defined as the weighted sum of edges linking nodes in subset V1∗ and nodes in V2∗ . C(G) = wu,v = wu,t + wu,s + wu,v (5) u∈V1∗ ,v∈V2∗
u∈V1
u∈V2
u∈V1 ,v∈V2
where wu,t and wu,s define the edge weights between node u and terminal nodes, and wu,v defines the edge weight between node u and node v in different subsets. Suppose s and t represent binary class 0 and 1 respectively, define wu,s = cu (1), wu,t = cu (0) and wu,v = λu,v . Then minimizing the cost function in Equation 4 is equivalent to minimizing the cut metric C(G) in Equation 5. As a result, pixel
Efficient Graph Cuts for Multiclass Interactive Image Segmentation
137
u is assigned to class 1 if it is both linked to the terminal node t representing class 1 and broken from the terminal node s representing class 0. Graph min-cut is a well studied problem and equivalent to the max-flow problem, which can be solve efficiently in polynomial time [10]. For binary classes, we can obtain globally optimal solution for the min-cut. An example of binary cut for a toy problem of 9 nodes is illustrated in Figure 1(a). For the sake of visualization, terminal links that have been severed are not shown here.
(a) Binary Cut
(b) Multiway Cut
Fig. 1. Illustration of Binary and Multiway Cuts
In interactive segmentation, we have to keep fixed the membership of the user specified pixels,i.e. brush labeled foreground pixels must remain in the foreground and so do labeled background pixels. These hard constraints over the assignment of labeled pixels can be naturally enforced with the graph cut framework by manipulating the edge weights of links between nodes of labeled pixels and the terminal nodes. For each labeled foreground (class 1) pixel uf , we can set wuf ,1 to an infinitely large value and wuf ,0 to 0 so that uf always remains linked to the foreground t node otherwise it will incur an infinite cost by assigning uf to the background s node. We can also set the weights of terminal links for background pixels in a similar way. To summarize, we list the rules for setting edge weights for binary interactive segmentation in Table 1(a).
3 3.1
Multiclass Interactive Image Segmentation A Multiway Cut Formulation for Multiclass Segmentation
The graph cut framework proposed for binary segmentation in the last section can be extended to multiclass segmentation in a straightforward manner via multiway cuts. Multiway cuts are the generalization of binary cuts and were first proposed by Boykov et.al. in [2]. For m-way multiway cuts, an extra nodeset of m terminal nodes T = {t1 , . . . , tm } are inserted in the original graph G = V, E and links are established from each t-node tj (j = 1, . . . , m) to each node in the node-set V of graph G. The purpose is to split the nodes of this
138
F. Lu, Z. Fu, and A. Robles-Kelly
Table 1. (a) Edge weights for binary segmentation. (b) Edge weights for multiclass segmentation. (b) weight
(a) edge weight
for
edge
{u, v}
λu,v u, v ∈ V and u ∼ v ∞ u ∈ C0 {u, t} 0 u ∈ C1 cu (1) otherwise
{u, s}
0 ∞ cu (0)
for
{u, v}
λu,v u, v ∈ V and u ∼ v ∞ u ∈ C1 {u, t1 } 0 u ∈ Cj (j = 1) su − cu (1) otherwise ... ∞ u ∈ Cm {u, tm } 0 u ∈ Cj (j = m) su − cu (m) otherwise
u ∈ C0 u ∈ C1 otherwise
∗ augmented graph into m disjoint subsets V1∗ = {t1 , V1 }, . . . , Vm = {tm , Vm } such that each subset contains one and only one t-node from T and V1 ∪. . . ∪Vm = V. An example of 3-way cut is illustrated in Figure 1(b). According to [2], we can establish the link between the Potts MRF cost function in Equation 4 and the multiway cut metric as follows. For arbitrary node u in V, after multiway cut, it only remains linked to a single terminal node and the edges linking to all other terminal nodes are severed. Define
du =
m
cu (j)
(6)
j=1
the total sum of costs for assigning pixel u to all different classes. We then define the edge weights of t-links wu,tj = du − cu (j) and the edge weights between neighboring nodes in V remain the same as binary cut, i.e. wu,v = λu,v . Then the multiway cut metric becomes C(G) =
wu,v =
∀i,j∈{1,...,m} u∈Vi∗ ,v∈Vj∗
=
(du − cu (j)) +
u∈V
wu,tj +
i=1 u∈Vi j=1 j=i
u∈V j=Xu
= (m − 2)
m m
du +
u∈V
wu,v
(7)
i,j u∈Vi ,v∈Vj
λu,v δ(Xu , Xv )
u∼v
cu (Xu ) +
λu,v δ(Xu , Xv )
u∼v
Since du is independent of label vectors Xu ’s, the multiway cut is equivalent to the multiclass Potts MRF cost defined in Equation 4 up to a constant additive factor. In the case of binary cut where m = 2, the two costs are exactly the same. Thus, inference of a multiclass Potts MRF model can be recast as a multiway cut problem. We can also encode hard constraints for labeled pixels into the edge weights of t-links in exactly the same way as we did for binary segmentation.
Efficient Graph Cuts for Multiclass Interactive Image Segmentation
139
The resulting rules for choosing edge weights for multiclass segmentation tasks are summarized in Table 1(b).
4
Efficient Implementation of Multiway Cuts
Although multiclass interactive image segmentation can be transformed the multiway cut problem, computing the minimum value of multiway cuts is a NP hard problem. An expansion-move algorithm was proposed by Boykov et.al. in [9] to iteratively update the current labels for lower cost. However, a number of expansion and move steps have to be computed alternately to recover a suboptimal labeling for the purpose of multiclass segmentation. Each step involves solving a graph cut over all pixels in the image. This is quite inefficient for images with moderately large sizes. To make efficient inference of multiway cut without compromising on the accuracy, we present an alternative inference algorithm in this section. Strategy. Our inference algorithm makes use of the one-against-all classifier fusion framework to transform the multiway cut problem into a set of binary cut subproblems. Each binary cut subproblem has two classes, where the positive class is chosen from one of the original label classes, and the negative class is constructed by the rest of the classes. The global optimal solution of the binary subproblem can be obtained by using the max flow algorithm in [10]. We can then obtain a partial solution of the multiway cut problem by voting on the binary cut results. Pixels with consistent labeling across all binary cuts are then assigned with the consistent labels. To resolve the ambiguities in the labels of remaining pixels, we build a graph over pixels with ambiguous labels apply the expansion-move algorithm [9] to solve the multiway cut problem for the labels of the ambiguous pixels with the labels of all other pixels fixed, which is more efficient than applying the expansion-move operations to all unlabeled pixels from scratch and more accurate than resolving the ties arbitrarily. The algorithm is summarized in the Table 2. Analysis. Several points deserve further explanation here. To obtain the cost cu (j) of assigning pixel u to each different region j, we first compute the posterior probability bu,j of pixel u in class j, which can obtained via the following Bayes rule P (u|Cj ) P (u|Cj )P (Cj ) = (9) bu,j = j P (u|Cj )P (Cj ) j P (u|Cj ) 1 is the prior probability of class j, which is assumed to be equal m for every class. P (u|Cj ) the conditional probability of pixel u in the jth region. The class conditional probabilities are obtained by indexing the normalised color histogram built from labeled pixels in the region. The edge weight term λu,v is given by ||I(v) − I(u)||2 ||pv − pu ||2 − ) (10) λu,v = βexp(− σc2 σs2 where p(Cj ) =
140
F. Lu, Z. Fu, and A. Robles-Kelly Table 2. Efficient Graph Cut Algorithm for Multiclass Segmentation
Input: image I with pixels labeled for m different regions. 1. Compute the cost cu (j) for assigning each unlabeled pixel u to the jth region. 2. Repeat the following operations for each region index by j – Establish a binary graph cut problem with the jth region as class 1 and all other 1 regions as class 0. Set c∗u (1) = cu (j) and c∗u (0) = cu (i) as new unary m − 1 i=j 1 λu,v as binary costs for neighboring costs for pixel u, and λ∗u,v = 1 − 2m − 2 pixels u and v. The meaning for the choice of weights will become evident later. – Solve the binary cut problem with new unary and binary costs and recover the (j) label Xu for each pixel u.
(1)
(m)
3. Vote on Xu , . . . , Xu Xu =
j
from the results of m binary cuts
undefined
(j)
(i)
Xu = 1 and Xu = 0 for i = j otherwise
(8)
4. Add the unlabeled pixels with resolved labels into the set of labeled pixels. Run an additional graph cuts optimization over the subset of unresolved pixels to obtain their labels using the neighboring pixels with known labels as hard constraints . Output: labels Xu for each pixel u in image I
where I(u) and pu denote the pixel value and coordinate of pixel v and σc and σs are two bandwidth parameters, β gauges the relative importance of data consistency and smoothness, which is set to 10 in the following experiments. The above choice of binary weight accounts for the image contrast and favors segmentation along the edges as justified in [1]. After obtaining the posterior probability for pixel u, we can compute the cost of assigning u to class j by cu (j) = i=j P (u|Ci ). That is, the larger the posterior probability of class j, the smaller cost for assigning it to the same class. Nom tice that the costs for different classes are normalized such that i=1 cu (i) = 1. In Step 2 of the above algorithm, we have to solve m binary cut problems. Notice that any two binary problems share the same graph topology and only differ in the weights of edges linked to the terminal nodes, hence starting from the second binary cut subproblem, we can capitalize on the solution of the previous subproblem for solving the current subproblem. This is achieved by the dynamic graph cuts framework proposed by Kohli and Torr in [11] to update the labels iteratively from previous results instead of starting from scratch. After running binary cuts for all m subproblems, we get m votes for the label of each pixel u. If the votes are consistent, i.e. Equation 8 is satisfied for certain
Efficient Graph Cuts for Multiclass Interactive Image Segmentation
141
region j which gets all m votes from the binary cut results, then we can assign pixel u to the jth region with confidence. We can also prove the following Proposition: The cost function of multiway cut is the aggregation of the costs of the binary cut subproblems up to a scaling and translation factor. Proof: The cost of the multiway cut problem is given by f (X) =
cu (Xu ) +
u
λu,v δ(Xu , Xv ) =
m
cu (j) +
j=1 u|Xu =j
u,v|u∼v
λu,v
u,v|u∼v Xu =Xv
where a|b is a notation indicating sum over variable(s) a . . . given condition(s) b . . .. The cost of the jth binary cut is given by
f (j) (X) = =
1 cu (i) + m−1
u|Xu =j
u|Xu =j
i|i=j
The sum over all binary costs is then given by
f m
(X) = m
(j)
j=1
j=1
=
cu (j) +
u|Xu =j
u|Xu =j
m−2 m−2 N + 1+ m−1 m−1
λ∗u,v
u,v|u∼v
cu (j) +
c∗u (0) +
u|Xu =j
u|Xu =j
c∗u (1) +
1 m−1
λ∗u,v
u,v|u∼v Xu =j,Xv =j
c (i) + m
u
i|i=j
m
j=1
cu (j) + 1 +
j=1 u|Xu =j
λ∗u,v
u,v|u∼v Xu =j,Xv =j
m−2 m−1
λu,v
u,v|u∼v Xu =Xv
where N is the total number of pixels under study. The second row of the above equation holds due to the identity below, which is true for normalized values of cu (j) over j. m
j=1 u|Xu =j i|i=j
cu (i) = (m − 2)N + (m − 2)
m
cu (j)
j=1 u|Xu =j
This concludes the proof. The above proposition states that minimizing f (X) is equivalent with minimizing j f (j) (X) due to their linear relation. Hence we can treat the consistent labels over all binary cuts as the approximate partial optimal labeling and solve the labels for the few remaining inconsistent pixels. This can be done using the expansion-move algorithm for multiway cut in [9]. As a result, we only need to build a much smaller graph for resolving the ambiguities in the labeling of remaining pixels together with their labeled neighbors, as compared to the total number of unlabeled pixels from the beginning.
142
5
F. Lu, Z. Fu, and A. Robles-Kelly
Experimental Results
We present experimental results for multiclass interactive segmentation of 7 realworld color images. Pixels were selected by the user with scribbles to define different regions. The original images overlapped with scribble labellings are shown in the left column of Figure 2, where different colored brushes indicate different regions in the image. First, we compared the efficiency of the proposed algorithm with the conventional expansion-move algorithm for the purpose of multiclass image segmentation. Specifically, we adopted the α expansion algorithm [9] over the alternative α-β move algorithm in our implementation. We runned our experiment on an Intel P4 workstation with 1.7GHz CPU and 1GB RAM. We have used the maxflow code of Boykov and Kolmogorov [10] for solving graph cuts for both binary inference and α expansion which also includes implementation of the dynamic graph cuts algorithm in [11]. The wrapper code is written in matlab. The average execution time over 5 trials for the segmentation of each image is reported in Table 3. Table 3. Comparison of speed for the conventional expansion-move based multiway graph cut algorithm [9] and the proposed efficient graph cut algorithm Image Name Image Size Classes face∗ bush∗ grave∗ garden∗ tower lake beach
132 × 130 520 × 450 450 × 600 450 × 600 800 × 600 800 × 600 1600 × 1200
3 3 3 3 3 4 4
Execution Time (s) α-Expansion Efficient Graph Cuts 0.0976 0.0836 2.2765 0.9999 3.5721 1.1672 2.6871 1.1407 5.5650 2.9180 5.8270 3.4522 26.3098 13.2942
From the table, we can find that the proposed efficient graph cut algorithm is consistently faster than the expansion-move algorithm with a speedup factor of roughly 2 on average. The two methods achieve comparable results for binary labeling problems. And not surprisingly, for the images we tested, the two algorithms under study achieve similar segmentation results. Hence, in the following, we only show segmentation results of our algorithm compared with those produced by maximum likelihood estimation (MLE) based pixel-level segmentation, which is equivalent to the MRF model with zero second-order pairwise terms. The segmentation results of both our method and the pixel-level segmentation are shown in the middle and right columns of Figure 2 respectively. We visualize the segmentation results with an alpha matte αI + (1 − α)L = 0, with α = 0.3. I is the color of image pixel and L is the color of brush label for the region it belongs to. From those results, we can see that the proposed algorithm can capture the structure of the regions labeled using the brush while pixel-level approach is prone to produce small scattered regions.
Efficient Graph Cuts for Multiclass Interactive Image Segmentation
143
Fig. 2. Segmentation results on example images. Left Column: Original color images with brush labellings. Middle Column: Results of pixel based approach. Right Column: Results of multiclass graph cuts.
144
6
F. Lu, Z. Fu, and A. Robles-Kelly
Conclusions
An energy minimization approach is proposed for multiclass image segmentation in this paper based on the Markov random field model. The cost function of the MRF model can be optimized by solving an equivalent multiway graph cut problem. An efficient implementation of multiway cut is developed and experimental results demonstrate encouraging performance of the proposed approach.
References [1] Boykov, Y., Jolly, M.P.: Interactive graph cuts for optimal boundary & region segmentation of objects in n-d images. In: Intl. Conf. on Computer Vision, pp. 105–112 (2001) [2] Boykov, Y., Veksler, O., Zabih, R.: Markov random fields with efficient approximations. In: Intl. Conf. on Computer Vision and Pattern Recognition (1998) [3] Gonzalez, R.C., Wintz, P.: Digital Image Processing. Addison-Wesley Publishing Company Limited, London, UK (1986) [4] Belongie, S., Carson, C., Greenspan, H., Malik, J.: Color- and texture-based image segmentation using em and its application to content-based image retrieval. In: Proc. of Intl. Conf. on Computer Vision (1998) [5] Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000) [6] Rother, C., Kolmogorov, V., Blake, A.: Grabcut: Interactive foreground extraction using iterated graph cuts. ACM Trans. on Graphics 23(3), 309–314 (2004) [7] Li, Y., Sun, J., Tang, C.K., Shum, H.Y.: Lazy snapping. ACM Trans. on Graphics 23(3) (2004) [8] Grady, L.: Random walks for image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence 28(11), 1768–1783 (2006) [9] Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Trans. on Pattern Analysis and Machine Intelligence 23(11), 1222–1239 (2001) [10] Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans.on Pattern Analysis and Machine Intelligence 26(9), 1124–1137 (2004) [11] Kohli, P., Torr, P.: Efficiently solving dynamic markov random fields using graph cuts. In: Intl. Conf. on Computer Vision (2005)
Feature Subset Selection for Multi-class SVM Based Image Classification Lei Wang Research School of Information Sciences and Engineering The Australia National University Australia, ACT, 0200 Abstract. Multi-class image classification can benefit much from feature subset selection. This paper extends an error bound of binary SVMs to a feature subset selection criterion for the multi-class SVMs. By minimizing this criterion, the scale factors assigned to each feature in a kernel function are optimized to identify the important features. This minimization problem can be efficiently solved by gradient-based search techniques, even if hundreds of features are involved. Also, considering that image classification is often a small sample problem, the regularization issue is investigated for this criterion, showing its robustness in this situation. Experimental study on multiple benchmark image data sets demonstrates the effectiveness of the proposed approach.
1
Introduction
Multi-class image classification often involves (i) high-dimensional feature vectors. An image is often reshaped as a long vector, leading to hundreds of dimensions; (ii) redundant information. Only part of an image, for instance, the foreground, is really useful for classification. In the past few years, multi-class Support Vector Machines (SVMs) have been successfully applied to image classification [1,2]. However, the feature subset selection issue, that is, identifying p important features from the original d ones has not been paid enough attention.1 Classical feature subset selection techniques may be used to find the important dimensions. For instance, in [3], a subset of 15 features is selected by using the wrapper approach from 29 features. However, this approach often has heavy computational load and cannot deal with image classification generally having hundreds of features. Also, the selected features may not fit the SVM classifier well when the selection criteria have no connection with the SVMs. Recent advance in the binary SVMs provides a ray of hope. In [4], a leave-oneout error bound of the binary SVMs (the radius-margin bound) is minimized to optimize the kernel parameters. The minimization is solved by gradient-based search techniques which can handle a large number of free kernel parameters. 1
Please note that this is different from the problem of finding the p optimal combination of the original d features, for instance, in the way of PCA or LDA. In feature subset selection, the features are kept individual and the feature dependence is left to the classifier, for instance, the SVMs, which handles it automatically.
Y. Yagi et al. (Eds.): ACCV 2007, Part II, LNCS 4844, pp. 145–154, 2007. c Springer-Verlag Berlin Heidelberg 2007
146
L. Wang
Also, it is found that the optimized kernel parameters can reflect the importance of features for classification. These properties are just what we are seeking for. Unfortunately, this error bound cannot be straightforwardly applied to the multiclass scenario because it is rooted in the theoretical result of binary SVMs, for instance, the VC-Dimension theory upon which it is developed. A possible way may be to apply the radius-margin bound to all of the binary SVMs obtained in a decomposition of the multi-class SVMs. However, by doing so, the selected features will be good at discriminating a certain pair of classes only. Also, how to integrate these selection results is another issue. Hence, this paper seeks a one-shot feature selection from the perspective of multi-class classification. To realize this, this paper extends the radius-margin bound to a feature subset selection criterion for the multi-class SVMs. Firstly, the relationship between the binary radius-margin bound and the classical class separability concept is discussed. Enlightened by it, this work redefines the radius and margin to accommodate multiple classes, forming a new criterion. Its derivative with respect to the kernel parameters can also be analytically calculated, making the gradient-based search techniques still applicable. Minimization of this criterion can efficiently optimize hundreds of kernel parameters simultaneously. These optimized parameters are then used to identify the important features. This approach is quite meaningful for practical image classification because it facilitates the reduction of system complexity and the feature discovery from the perspective of multiclass classification. Experimental result on benchmark data sets demonstrates the effectiveness of this approach.
2
Background
Let D denote a set of m training samples and D = (x, y) ∈ (Rd × Y)m , where Rd denotes a d-dimensional input space, Y denotes the label set of x, and the size of Y is the number of classes, c. A kernel is defined as kθ (xi , xj ) = φ(xi ), φ(xj ), where φ(·) is a possibly nonlinear mapping from Rd to a feature space, F , and θ is the kernel parameter set. Due to the limit of space, the introduction of multi-class SVMs is omitted and this information can be found from [1,2]. The radius-margin bound. It is an upper bound of the test error estimated via a leave-one-out cross-validation procedure. Let Lm = L((x1 , y1 ), · · · , (xm , ym )) be the number of errors in this procedure. It is shown in [4] that Lm ≤
4R2 = 4R2 w2 γ2
(1)
where R is the radius of the smallest sphere enclosing all the m training samples in F , γ the margin, w the normal vector of the optimal separating hyperplane, and γ −1 = w. The R2 is obtained by solving m m R2 = maxβ∈Rm i=1 βi kθ (xi , xi ) − i,j=1 βi βj kθ (xi , xj ) (2) m subject to : i=1 βi = 1; βi ≥ 0 (i = 1, 2, · · · , m)
Feature Subset Selection for Multi-class SVM Based Image Classification
where βi is the i-th Lagrange multiplier. The w2 is obtained by solving m m 1 1 2 m i=1 αi − 2 i,j=1 αi αj yi yj kθ (xi , xj ) 2 w = maxα∈R m subject to : i=1 αi yi = 0; αi ≥ 0 (i = 1, 2, · · · , m)
147
(3)
where αi is the i-th Lagrange multiplier. The derivative of R2 w.r.t. the t-th kernel parameter, θt , is shown in [4] as m m ∂kθ (xi , xi ) ∂kθ (xi , xj ) ∂R2 = βi0 − βi0 βj0 ∂θt ∂θt ∂θt i=1 i,j=1
(4)
where βi0 is the solution of Eq. (2). Similarly, the derivative of w2 w.r.t. θt is m ∂kθ (xi , xj ) ∂w2 = (−1) · α0i α0j yi yj ∂θt ∂θt i,j=1
(5)
where yi ∈ {+1, −1} is the label of xi . Thus, the radius-margin bound can be efficiently minimized by using the gradient search methods. As seen, this bound is rooted in binary SVMs and cannot be directly applied to the multi-class case. An extension of this bound to a feature selection criterion for multi-class SVMs is proposed below.
3
Proposed Approach
Class separability is a concept widely used in pattern recognition. The scatter matrix based measure is often favored thanks to its simplicity. The Between(ST ) are defined as SB = class scatter matrix (SB ) and Total scatter matrix c c ni n (m − m)(m − m) and S = (x − m)(x − m) , i i i T ij ij j=1 i=1 i=1 where ni is the size of class i, xij the j-th sample of class i, mi the mean of class i, and m the mean of all c classes. The following derives the class separability in the feature space, F . Let Di be the training samples from class i and D be the set of all training samples. The mφi and mφ are the mean vectors of Di and D in F , respectively. KA,B is the kernel matrix where {KA,B }ij = kθ (xi , xj ) with the constraints of xi ∈ A and xj ∈ B. The traces of SφB and SφT can be expressed as tr(SφB ) =
c 1 KD i=1
where n =
c
i=1
i ,Di
ni
1
−
1 KD,D 1 1 KD,D 1 ; tr(SφT ) = tr(KD,D ) − n n
(6)
ni and 1 is a column vector whose components are all 1. Based
on these, the class separability in F can be written as C = two classes, it can be shown that n1 n2 φ tr(SB ) = mφ1 − mφ2 2 ; n1 + n2
tr(SφT ) =
ni 2 i=1 j=1
tr(Sφ B) . tr(Sφ T)
In the case of
φ(xij ) − mφ 2
(7)
148
L. Wang
and it can be proven (the proof is omitted) that γ2 ≤
4−
1 n1 +n2 n1 n2
tr(SφB )
R2 ≥
;
1 tr(SφT ) . (n1 + n2 )
(8)
That is, tr(SφB ) measures the squared distance between the class means in F . tr(SφT ) measures the squared scattering radius of the training samples in F if divided by (n1 + n2 ). Conceptually, the mφ1 − mφ2 2 and γ 2 reflect the similar property of data separability and tr(SφT ) positively correlates with R2 . Theoretically, γ 2 is upper bounded by a functional of tr(SφB ) and R2 is lower bounded by tr(SφT )/(n1 + n2 ). When minimizing the radius-margin bound, γ 2 is maximized while R2 is minimized. This requires tr(SφB ) to be maximized and tr(SφT ) to be minimized (although this does not necessarily have γ 2 maximized or R2 minimized). Hence, minimizing the radius-margin bound can be viewed as approximately maximizing the class separability in F . Part of the analysis can also be seen in [5]. Enlightened by this analogy, this work extends the radius-margin bound to accommodate multiple classes. In the multi-class case, tr(SφT )/n measures the squared scattering radius of all training samples in F . Considering the analogy between tr(SφT ) and R2 , the R2 is redefined as the squared radius of the sphere enclosing all the training samples in the c classes, Rc2 =
min
ˆ 2 ,∀ xi ∈D φ(xi )−ˆ c2 ≤R
ˆ2) . (R
(9)
For tr(SφB ) in the multi-class case, it can be proven that tr(SφB )
=
1≤i<j≤c
ni nj mφi − mφj 2 n
.
(10)
Noting the analogy between mφ1 − mφ2 2 and γ 2 , the margin is redefined as 2 Pi Pj γij = Pi Pj wij −2 (11) γc2 = 1≤i<j≤c
1≤i<j≤c
where γij is the margin between class i and j, and Pi = nni , which is the priori probability of the i-th class estimated from the training data. By doing so, a feature selection criterion for multi-class SVMs is obtained. The optimal kernel parameters is sought by minimizing this criterion, 2 Rc . (12) θ = arg min θ∈Θ γc2 The derivation of this criterion w.r.t. the t-th kernel parameter, θt , is 2 2 2 Rc ∂ 1 2 ∂Rc 2 ∂γc − Rc = 4 γc . ∂θt γc2 γc ∂θt ∂θt
(13)
Feature Subset Selection for Multi-class SVM Based Image Classification ∂R2
149
∂w 2
The calculation of ∂θtc and ∂θijt follows the Eq. (4) and (5). This criterion can still be minimized by using the gradient-based search techniques. Finally, please note that the obtained criterion is not necessarily an upper bound of the leave-one-out error of the multi-class SVMs. It is only a criterion reflecting the idea of the radius-margin bound and working for the multi-class case. An elegant extension of the radius-margin bound to multi-class SVMs is in [6]. Regularization issue. When optimizing a criterion with multiple parameters, regularization is often needed to avoid over-fitting the training samples, especially when the number of training samples is small. To check whether this is also needed for this new criterion, a regularized version is given as J (θ) = (1 − λ)
Rc2 + λθ − θ0 2 γc2
(14)
where λ (0 ≤ λ < 1) is the regularization parameter which penalizes the deviation of θ from a preset θ0 . Such a regularization imposes a Gaussian prior over the parameter θ to be estimated. When using the ellipsoid Gaussian RBF kernel defined in Section 4, θ0 can be chosen by applying the constraint of θ1 = θ2 = · · · = θd and solving 2
Rc
. (15) θ0 = arg min
θ∈Θ γc2 θ1 =θ2 =···=θd Since the number of free parameters reduces to one, over-fitting is less likely to occur. It is worth noting that the obtained θ0 has been optimal for the minimization of Rc2 /γc2 with a spherical Gaussian RBF kernel. It is a good starting point for the subsequent multi-parameter optimization.
4
Experimental Result
d −yi )2 , is used. The elliptical Gaussian RBF kernel, k(x, y) = exp − i=1 (xi2σ 2 i For binary SVMs, the optimized σi can reflect the importance of the feature i for classification [4]. The larger the σi , the less important the feature i. Now, the multi-class case is studied with the proposed criterion. The BFGS Quasi-Newton optimization method is used. For the convenience of optimization, gi (gi = 1/2σi2 ) is optimized instead. A multi-class SVM classifier using the selected features is then trained and evaluated by using [7]. Following [4], the feature selection criteria of “Fisher score”, “Pearson correlation coefficient” and “Kolmogorov-Smirnov test” are compared with the proposed criterion. USPS for Optical digit recognition. This data set contains 7291 training images and 2007 test images from ten classes of digits of “0” to “9”. Each 16 × 16 thumbnail image is reshaped to a 256-dimensional feature vector consisting of its gray values. The proposed feature selection criterion is minimized on the training set to optimize the gi for each dimension. The 256 features are then sorted in a descending order of the g values, and the top k features are used for
150
L. Wang
(a) Optical character recognition
(b) Facial image classification
(c) Textured image classification
Fig. 1. Examples of image classification tasks USPS data set
0
0.055
0.8 0.7 0.6
Test error
0.05
Pearson correlation coefficients Kolmogorov−Smirnov test Fisher criterion score Proposed criterion
2 0.045 4 0.04
0.5
6
0.4
8
0.035 0.03 0.025
0.3
10
0.02
0.2
0.015
12
0.1
0.01 14
0 0
0.005
10
20 30 40 The number of selected feature
50
(a) Feature selection result
60 16
0
2
4
6
8
10
12
14
16
(b) Optimized value of g = 1/2σ 2
Fig. 2. Results for the USPS data set
classification. As shown in Figure 2(a), the test error for the proposed criterion decreases quickly with k. With only 50 features selected, it reaches 6.7% (The reported lowest error is 4.3% when all 256 features are used [8]). Compared with the other three selection criteria, this criterion gives the best feature selection performance. Now, the optimized g1 , · · · , g256 are reshaped back into a 16 × 16 matrix and shown in Figure 2(b). In this map, each block corresponds to one of the 256 g’s and its value is reflected by the gray level. As seen, the blocks with the larger g values scatter over the central part of this map, whereas those with the lower values are mostly at the borders and corners. This suggests that the pixels at the borders and corners are less important for discrimination. This observation well matches the images in Figure 1(a) in that the digits are displayed at the central parts and surrounded by a uniform black background. ORL for Facial image classification. This database contains 40 subjects and each of them has 10 gray-level facial images, as shown in Figure 1(b). Each image is resized to 16 × 16 and a 256-dimensional feature vector is obtained. The 400 images are randomly split into 10 pairs of training/test subsets with the equal size of 200, forming ten 40-class classification problems. For each problem, the g values are optimized on the training subset and the features are sorted accordingly. The feature selection result averaged over the 10 classification problems is reported in Figure 3(a). By using only the top 50 selected features, the
Feature Subset Selection for Multi-class SVM Based Image Classification
151
ORL Face data set 0.7
16 Pearson correlation coefficient Kolmogorov−Smirnov Test Proposed criterion Fisher score
0.6
Test error
0.5
Forehead 0.9
14 12
0.8
Right eye
Left eye
0.7 10
0.4
0.6
8
0.3
0.5
Nose
0.4
6 Mouth
0.2
0.3
4 0.1
0.2 Chin
2 0 0
50
100 150 200 The number of selected features
250
(a) Feature selection result
300
2
4
6
8
10
12
0.1 14
16
(b) Optimized value of g = 1/2σ 2
Fig. 3. Results for the ORL face data set
test error can be as low as 11.10 ± 2.87%, and the test error with all 256 features is 7.95 ± 2.36%. Again, the proposed criterion achieves the best selection performance, especially at the top 50. The optimized g values are also reshaped back and plotted in Figure 3(a). A shading effect is used. As seen, the blocks with larger g values (whiter) roughly present a human face, where the eyes, nose, mouth, forehead and chin are marked. This result is consistent with our daily experience that we recognize a person often by looking at the eyes while the cheek, shown as the darker areas, is less used. The subjects in this database often have different hairstyles at the forehead part, and this may explain why the forehead part is also shown as a whiter region. Brodatz for Textured image classification. This database includes 112 different textured images, as shown in Figure 1(c). Top 10 images are used in this experiment. For each of them, sixteen 128×128 sub-images are cropped, without overlapping, from the original 512 × 512 image to form a texture class. By doing so, a ten-class textured image database is created, including 160 samples. Again, they are randomly split into 10 pairs of training (40 samples)/test (120 samples) subsets. By utilizing a bank of Gabor filters (6 orientations and 4 scales), the mean, variance, skewness and kurtosis of the values of the Gabor filtered images are extracted, leading to a 96 (6 × 4 × 4) dimensional feature vector [9]. Using the proposed criterion, the 96 g values are optimized and the features are sorted. The test error with different number of selected features is plotted in Figure 4(a). This criterion and the “Kolmogorov-Smirnov test” show comparable performance, with the former slightly better at selecting the top 10 features. Using this criterion, the lowest test error of 1.42 ± 1.36% is achieved at the top 50 selected features, whereas the error with all 96 features is 2.08 ± 1.06%. This indicates that half features are actually useless for discrimination. Again, the optimized g values are shown in Figure 4(b). It is found that the features from the 25-th to 48-th (the features of “variance”) and those from the 80-th to 96-th (the features of “kurtosis”) dominate the features with larger g values, suggesting that these two types of features are more useful for discriminating the textured images.
152
L. Wang Brodatz texture data set 0.9
3.5
Pearson correlation coefficient Kolmogorov−Smirnov Test Proposed criterion Fisher score
0.8 0.7
3 2.5
The g value
Test error
0.6 0.5 0.4
2 1.5
0.3 1
0.2 0.5
0.1 0 0
20
40 60 80 The number of selected features
100
(a) Feature selection result
0 0
20
40 60 Feature number
80
100
(b) Optimized value of g = 1/2σ 2
Fig. 4. Results for the Brodatz data set Table 1. Feature selection performance with regularization (ORL data set) Average test error for different number of selected features (In %) λ value
1
2
5
10
20
50
68.85±2.73
51.80±8.16
33.75±3.28
24.65±3.94
16.80±3.28
11.10±2.87
0.1
70.10±4.01
62.85±10.15
40.70±4.74
30.15±2.24
19.60±3.85
11.75±2.20
0.25
70.10±4.01
62.05±11.10
40.30±5.06
28.80±1.46
19.15±3.42
12.20±2.38
0.5
73.50±5.25
57.10±12.12
36.05±5.23
29.15±4.78
20.20±3.12
12.55±2.75
0
0.75
77.40±6.82
53.85±9.92
35.05±4.73
28.80±4.19
20.17±2.72
12.95±3.07
0.9
74.45±6.09
55.40±11.29
35.60±4.40
28.10±3.27
21.25±2.44
12.60±3.18
0.99
79.60±7.40
64.15±8.67
37.15±6.03
28.25±3.43
22.10±2.73
12.85±3.35
Finally, the help of regularization is investigated for the proposed criterion on the ORL and Brodatz data sets. For the ORL, there are only 5 training samples in each class, whereas the number of features is 256. For the Brodatz, the case is 4 versus 96. Both have the small-sized training sets with respect to the number of features. In this experiment, θ0 is obtained by solving Eq. (15). By gradually increasing the regularization parameter, λ, from 0 to 0.99, the g values are optimized and the features are selected. The average test error of a multi-class SVM classifier with the selected features is reported in Table 1. It is found that the lowest error (in bold) is consistently obtained with λ = 0, that is, no regularization is imposed. Simply applying the regularization leads to inferior selection performance. This result suggests that the proposed criterion is quite robust to the small sample problem in this classification task. Similar case is observed on the Brodatz data set from Table 2. These results are, more or less, surprising and further analysis leads to the following experiment. By looking into the ORL facial image classification problem, it is found that the total number of training samples is still comparable to the number of features, that is, 5 × 40 = 200 vs. 256, because there are 40 classes. As defined in the proposed criterion, the radius 2 R2 is estimated by using all of the 200 training samples, although the margin γij is estimated from the 10 training samples from each pair of classes. Hence, such a training set is not a typical small sample for estimating R2 . Believing that regularization is necessary in the presence of a small sample problem, this paper
Feature Subset Selection for Multi-class SVM Based Image Classification
153
Table 2. Feature selection performance with regularization (Brodatz data set) Average test error for different number of selected features (In %) λ value 0
1
2
5
10
15
20
44.42±6.11
21.58±5.01
5.92±4.91
3.33±3.24
2.00±2.19
2.17±1.12
0.1
50.33±6.70
25.33±4.32
7.75±5.31
4.25±2.95
3.17±2.32
2.67±2.07
0.25
50.42±6.67
25.00±4.84
9.42±3.73
3.58±1.80
3.25±2.10
2.67±2.22
0.5
50.08±7.27
23.92±3.49
9.67±3.38
4.42±2.42
2.83±2.01
2.75±1.84
0.75
48.42±7.76
25.08±3.98
9.25±4.09
5.17±2.99
2.92±1.72
3.25±2.10
0.9
47.58±8.04
23.58±3.89
9.42±4.94
5.58±2.99
3.08±2.15
2.50±1.80
0.99
47.00±7.64
24.08±4.87
8.92±3.33
5.92±2.73
3.08±2.19
2.92±1.85
Result for 30 classes
Result for 20 classes
0.7 λ=0 λ=0.1 λ=0.25 λ=0.5 λ=0.75 λ=0.9 λ=0.99
Test error
0.5 0.4
λ=0 λ=0.1 λ=0.25 λ=0.5 λ=0.75 λ=0.9 λ=0.99
0.5
0.4
Test error
0.6
0.3
0.3
0.2
0.2 0.1
0.1 0 0
10
20 30 40 The number of selected features
0 0
50
(a) Class number = 30
10
20 30 The number of selected features
40
50
(b) Class number = 20
Result for 10 classes
Result for 8 classes 0.45 λ=0 λ=0.1 λ=0.25 λ=0.5 λ=0.75 λ=0.9 λ=0.99
Test error
0.4
λ=0 λ=0.1 λ=0.25 λ=0.5 λ=0.75 λ=0.9 λ=0.99
0.35 0.3
Test error
0.5
0.4
0.3
0.2
0.25 0.2 0.15 0.1
0.1 0.05 0 0
10
20 30 40 The number of selected features
(a) Class number = 10
50
0 0
10
20 30 40 The number of selected features
50
(b) Class number = 8
Fig. 5. Results for the ORL data set with less number of classes
further conducts the following experiment: maintain 5 training samples in each class and gradually reduce the number of classes to 30, 20, 10 and 8. After that, following the previous procedures, perform feature selection with the proposed criterion for these four new classification problems, respectively. The result is plotted in Figure 5 and it shows what is expected. The selection performance of the criterion without regularization (labelled by λ = 0 and shown as a thick curve) degrades with the decreasing number of classes. When there are only 8 classes (or, equally, 5 × 8 = 40 training samples), the necessity and benefit of regularization can be clearly seen. Summarily, two points can be drawn from Table 1,2 and Figure 5: (1) thanks to the way of defining R2 , regularization is not a concern for the proposed criterion when the number of training samples in
154
L. Wang
each class is small but the number of classes is large. This case is quite reasonable for multi-class classification; (2) when the total number of training samples is really small, employing the regularized version of the proposed criterion can achieve better feature selection performance.
5
Conclusion
A novel feature subset selection approach has been proposed for multi-class SVM based image classification. It handles the selection with hundreds of features, very suitable for image classification where high-dimensional feature vectors are often used. This approach produces overall better performance than the compared selection criteria and shows robustness to the small sample case in multi-class image classification. These results preliminarily verify its effectiveness. More theoretical and experimental study will be conducted in the future work.
References 1. Kim, K.I., Jung, K., Park, S.H., Kim, H.J.: Support vector machines for texture classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(11), 1542–1550 (2002) 2. Chapelle, O., Haffner, P., Vapnik, V.: Support vector machines for histogram-based image classification. IEEE Transactions on Neural Networks 10(5), 1055–1064 (1999) 3. Luo, T., Kramer, K., Goldgof, D., Hall, L., Samson, S., Remsen, A., Hopkins, T.: Recognizing plankton images from the shadow image particle profiling evaluation recorder. IEEE Transactions on Systems, Man and Cybernetics, Part B 34(4), 1753– 1762 (2004) 4. Chapelle, O., Vapnik, V., Bousquet, O., Mukherjee, S.: Choosing multiple parameters for Support Vector Machines. Machine Learning 46(1-3), 131–159 (2002) 5. Wang, L., Chan, K.L.: Learning Kernel Parameters by using Class Separability Measure. In: Presented in the sixth kernel machines workshop, In conjuction with Neural Information Processing Systems (2002) 6. Darcy, Y., Guermeur, Y.: Radius-margin Bound on the Leave-one-out Error of Multi-class SVMs. Technical Report, No. 5780, INRIA (2005) 7. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001), Software available at http://www.csie.ntu.edu.tw/∼ cjlin/libsvm 8. Sch¨ olkopf, B., Smola, A.: Learning with Kernels Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA (2002) 9. Manjunath, B.S., Ma, W.Y.: Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(8) (1996)
Evaluating Multi-Class Multiple-Instance Learning for Image Categorization Xinyu Xu and Baoxin Li Department of Computer Science and Engineering, Arizona State University
[email protected],
[email protected]
Abstract. Automatic image categorization is a challenging computer vision problem, to which Multiple-instance Learning (MIL) has emerged as a promising approach. Typical current MIL schemes rely on binary one-versus-all classification, even for inherently multi-class problems. There are a few drawbacks with binary MIL when applied to a multi-class classification problem. This paper describes Multi-class Multiple-Instance Learning (McMIL) to image categorization that bypasses the necessity of constructing a series of binary classifiers. We analyze McMIL in depth to show why it is advantageous over binary MIL when strong target concept overlaps exist among the classes. We systematically valuate McMIL using two challenging image databases, and compare it with state-of-the-art binary MIL approaches. The McMIL achieves competitive classification accuracy, robustness to labeling noise, and effectiveness in capturing the target concepts using smaller amount of training data. We show that the learned target concepts from McMIL conform to human interpretation of the images. Keywords: Image Categorization, Multi-Class Multiple-Instance Learning.
1 Introduction Multiple-Instance Learning (MIL) is a generalization of supervised learning in which training class labels are associated with sets of patterns, or bags, instead of individual patterns. While every pattern may possess an associated true label, it is assumed that pattern labels are only indirectly assessable through bag labels [1]. MIL has been successfully applied to many applications such as drug activity prediction [8], image indexing for content-based image retrieval [1, 13, 21, 5, 2, 14, 6], object recognition [6], and text classification [1]. In many real-world multi-class classification problems, a common treatment adopted by traditional MIL is “one-versus-all” (OVA) approach, which constructs c binary classifiers, fj(x), j=1,…,c, each trained to distinguish one class from the remaining, and labels a novel input with the class giving the largest fj(x), j=1,…,c. Although OVA delivers good empirical performances, it has potential drawbacks. First, it is somewhat heuristic [15]. The binary classifiers are obtained by training in separate binary classification problems, and thus it is unclear whether their realvalued outputs are on comparable scales [15]. This can be a problem, since situations often arise where several binary classifiers acquire identical decision values, making the class prediction based on them very obscure. Second, binary OVA has been Y. Yagi et al. (Eds.): ACCV 2007, Part II, LNCS 4844, pp. 155–165, 2007. © Springer-Verlag Berlin Heidelberg 2007
156
X. Xu and B. Li
criticized for its inadequacy in dealing with rather asymmetric problems [15]. When the number of classes becomes large, each binary classifier becomes highly unbalanced with many more negative examples than positive examples. In the case of non-separable classes, the class with smaller fraction of instances tends to be ignored, leading to degraded performances [18]. Third, with respect to Fisher consistency, it is unclear if OVA targets the Bayes rule in absence of a dominating class [10]. Recently, a Multi-class Multiple-Instance Learning (McMIL) approach has been proposed which considers multiple classes jointly by simultaneously minimizing a Support Vector Machine (SVM) objective function. In this paper we focus on the systematic evaluation of the McMIL using image categorization as a case study, and we show that why it is advantageous over binary OVA approaches.
2 Related Work There has been abundant work in the literature on image categorization. Due to space limit we only review the MIL-based approaches. Andrew et al. [1] proposed MI-SVM and applied it to small-scale image annotation. [12] proposed Diverse Density MIL framework and [13] used it to classify natural scene images. [20] proposed EM-DD, which views the knowledge of which instance corresponds to the label of the bag as a missing attribute. [2] performed an instance-based feature mapping through a chosen metric distance function, and a sparse SVM was adopted to select the target concepts. [6] proposed MILES for image categorization and object recognition. MILES does not impose the assumption relating instance labels to bag labels, and it does not rely on searching local maximizers of the DD functions, which makes MILES quite efficient. [14], within the CBIR setting, presented MISSL that transforms any MI problem into an input for a graph-based single-instance semi-supervised learning method. Unlike most prior MIL algorithms, MISSL makes use of the unlabeled data. Most of these works belong to the binary MIL approaches, and thus potentially suffer from the drawbacks mentioned in the previous section. One work by [5] is mostly related with ours. They proposed DD-SVM. Their major contribution is that the image label is no longer determined by a single region; rather it is determined by some number of Instance Prototypes (IPs) learned from the DD function. A nonlinear mapping is defined using the IPs to map every training image to a point in a bag feature space. Finally, for each category, an SVM is trained to separate that category from all the others. Note that the multi-class classification here is essentially achieved by OVA. The method described in this paper differs from DDSVM in that a multi-class feature space is built that enables us to train the SVM only once to yield the multi-class label, leading to increased classification accuracy.
3 Methodology of McMIL For completeness of the paper, we describe the McMIL approach in this section. Let us first define the problem and the notations. Let D={(x1, y1),…, (x|D|, y|D|)} be the labeled dataset, where xi, i=1,...,|D| is called a bag in the MIL setting, and yi ∈ {1,…,c} denotes the multi-class label. In the MIL model xi={xi1,…,xi|xi|}, where xij ∈ \ d represents the jth instance in the ith bag. Our goal is to induce a classifier from patterns to multi-class labels, i.e. a multi-class separating function f: \ d Æ y.
Evaluating Multi-Class Multiple-Instance Learning for Image Categorization
157
3.1 Learning Instance Prototypes The first step in the training phase of the McMIL is to learn the IPs for each category according to the DD function [12]. Because the DD function is defined based on the binary framework, McMIL learns a set of IPs specifically for every class: the ith DD function uses all of the examples in the ith class with positive labels and some examples from other classes with negative labels. McMIL applies EM-DD to locate the multiple local peaks by using every instance in every positive bag with uniform weighs as the starting point of the Quasi-Newton search. Then the IPs with larger DD values and are distinct from each other are selected to represent the target concepts [5]. 3.2 Multi-class Feature Mapping Let IPm =
{( I
1 m
, I m2 ,..., I mnm
)} , m ∈ {1, 2,..., c} be the set of instance prototypes learned by
EM-DD for the mth category. Each I mk = ⎡ xmk , smk ⎤ , k=1,2,...,nm denotes the kth IP in the ⎣ ⎦ mth class, and nm is the number of IPs in the mth class. The learned IP consists of two parts: the ideal attribute value x and the weight factor s. To facilitate the multi-class classification, a similarity mapping kernel ψ is defined which maps a bag of instances into a multi-class feature space ] based on a similarity metric function h: \ d ×\ d → \ . The mapping ψ is defined as ψ(xi)= ⎡ h( xi , I11 ),..., h( xi , I1n1 ),..., h( xi , I c1 ),..., h( xi , I cnc ) ⎤ ⎣
T
⎦
(1)
The similarity function h is defined as ⎧ ⎩
h( xi , I mk ) = max j =1,..., x ⎨exp ⎡⎢ − xij − xmk i
⎣
2 smk
⎤ ⎫ , k = 1,..., n , m ∈ {1,..., c} m ⎥⎦ ⎬⎭
(2)
Each feature corresponds to the similarity between a bag and one IP, which is the similarity between I mk and the instance in xi that is closest to I mk . Note that while this is similar to the bag feature space mapping that was proposed by [5], an important difference between these two mappings is that, the mapping in our method simultaneously incorporate the similarity between a bag and each IP of every category into one feature matrix, while in binary MIL [5, 2, 6], separate feature matrices are needed, each of which only considers the similarity between a bag and each IP of one category that is under consideration. For a given training set of bags with multi-class labels D={(x1, y1),…, (x|D|, y|D|)} , applying the above mapping yields the following kernel similarity matrix S= n n 1 1 ⎡h1 ⎤ ⎡h(x1, I1 ),...., h(x1, I1 1 ),...., h(x1, Ic ),...., h(x1, Ic c ) ⎤ ⎢ ⎥ ⎢h ⎥ n1 nc 1 1 ⎢ 2 ⎥ = ⎢h(x2 , I1 ),...., h(x2 , I1 ),...., h(x2 , Ic ),...., h(x2 , Ic ) ⎥ ⎥ ⎢M ⎥ ⎢LLLLLLLLLLLLLLLLLL ⎥ ⎢ ⎥ ⎢ n 1 1 n ⎢⎣h|D| ⎥⎦ ⎢⎣h(x|D| , I1 ),..., h(x|D| , I1 1 ),..., h(x|D| , Ic ),..., h(x|D| , Ic c )⎥⎦
(3)
158
X. Xu and B. Li
3.3 Multi-class SVM Training and Classification Given the kernel similarity matrix S, the multi-class classification is achieved by multi-class SVM [16, 19] which simultaneously allows the computation of a multiclass classifier by considering all the classes at once. The formulation is |D| 1 c min ∑ wmT wm + C ∑ ∑ ξim w,b ,ξ 2 m =1 i =1 m ≠ y (4) T s.t. wy ψ ( xi ) + by ≥ wmTψ ( xi ) + bm + 2 − ξim , i
i
i
ξim ≥ 0, i = 1,...,| D |, m ∈ {1,.., c} y . i
where w is a vector orthogonal to the separating hyper-plane, |b|/||w|| is the perpendicular distance from the hyper-plane to the origin, C is a parameter to control the penalty due to misclassification. Nonnegative slack variables ξi are introduced in each constraint to account for the cost of the errors. In the testing phase, we first project the test image into the feature space ] using the learned IPs. The decision function is then given by (5) arg max m=1,...,c ( wmTψ ( x) + bm ) which is the same with the OVA decision making. Note that in multi-class SVM, there are c decision functions but all are obtained by solving one problem Eq. (4). 3.4 The Learned Target Concept Each instance prototype represents one kind of target concept that distinguishes one class from another. Thus it is important to visualize the target concept. In order to do that, in each class, for each feature that is defined by one positive instance prototype,
People
Portrait
Scene
Structure
Fig. 1. The target concepts of four categories in the SIMPLIcity-II data set. The target concepts are represented by conceptual regions. For “People”, the various shapes of human are returned as target concepts. For “Portrait”, the target concepts are mainly skin and hair with varying colors. For “Scene”, the major target concepts are mountain, sky, sea water and plants. For “Structure”, the target concepts are various kinds of building structures like walls, windows and roofs.
we find the training image that maximizes the corresponding feature, then we go further to locate the region that is closest to the corresponding prototype. The target concepts for a category are defined by the regions that maximize the similarity
Evaluating Multi-Class Multiple-Instance Learning for Image Categorization
159
between images and the instance prototypes. Fig. 1 shows the target concepts of four classes in the SIMPLIcity-II dataset (Sec. 4).
4 The Advantage of Direct Multi-Class MIL The McMIL approach has advantage over binary MIL when strong target concept overlaps exist among the classes. To show the advantage of McMIL, let us suppose we have a dataset that has three categories: People, Portrait and Scene. Typical People IPs include “hair”, “face”, “clothes”, “building” and “water”, and typical Portrait IPs include “hair”, “face” and “clothes”. Fig. 2(a) shows the characteristic of the feature vectors for a People image and a Portrait image. We can see that the People IPs are inclusive of Portrait IPs. Now let us simplify the problem by assuming that the target concept of People are summarized by “face” and “non-face”, and the target concept of Portrait is summarized by the “face” IP.
(a)
(b)
(c)
Fig. 2. (a) The characteristic of the feature vectors for a People image and a Portrait image. Top: the features for a People image defined by Portrait IPs. Bottom: the features for a Portrait image defined by People IPs. (b) and (c) illustrate the advantage of McMIL over OVA MIL. (b) The feature space of the People and the Portrait classifier in the OVA MIL approach. (c) The multi-class feature space in the proposed McMIL approach.
Let us first consider the OVA MIL approach. OVA MIL constructs 3 binary classifiers. Fig. 2(b) shows the feature space of the People and the Portrait classifier in the OVA MIL approach. In the People classifier (top part of Fig. 2(b)), the features are defined in terms of the People IPs: face and non-face. The three People training images are likely to fall into circle A, because the similarity between a People bag and the People IPs (face or non-face) is large. We can likewise expect that the three Portrait training images will fall into circle B, since the similarity between a Portrait bag and the face IP is large but the similarity between a Portrait bag and the non-face IP is small. In the same sense, the three Scene images will fall into circle C. Because the bags are clearly separable in the People feature space, we expect the People binary classifier to do a good job in classifying a new test point. Now let us turn to the Portrait binary classifier, whose feature is defined only by the face IP. As shown in the bottom part of Fig. 2(b), the People and Portrait bags become linearly non-separable in the Portrait feature space, because the feature values defined by the Portrait face IP for both the People bags and the Portrait bags are likely to be
large, and so the People bags and the Portrait bags will overlap with each other. As a result, the Portrait classifier is more likely to make prediction errors when classifying a new test input. Now let us predict the label for a new test image whose true label is People. We feed it into each of the three binary classifiers and obtain the following decision values: People classifier 0.3218, Portrait classifier 0.3825, and Scene classifier −0.7726. Note that to classify this new image into the right category, the Portrait classifier should give us a negative decision value. But because of the Portrait classifier's poor classification performance, the new image will be assigned the wrong label Portrait, since 0.3825 > 0.3218 > −0.7726. Now let us turn to multi-class SVM classification, whose feature space is illustrated in Fig. 2(c). Note that the multi-class feature space is defined by the IPs of all the classes, so McMIL has three features, defined by the People face (face 1), the Portrait face (face 2) and the non-face IP. In this multi-class feature space, the probability that the multi-class SVM makes a prediction error is greatly reduced, because the dimension of the feature space is enlarged and points that are linearly non-separable are likely to become separable in a higher-dimensional feature space. This is essentially similar to applying a kernel to make a linearly non-separable problem separable. In fact, the multi-class feature matrix S is a kernel similarity matrix whose kernel mapping is defined by Eq. (1).
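The following sketch illustrates how such a multi-class feature (kernel similarity) matrix and the Eq. (5) decision rule might look in code. Since Eq. (1) is not reproduced in this excerpt, a MILES-style Gaussian similarity is assumed, and the weight matrix W and biases b are taken as given from a trained multi-class SVM; none of this is the authors' implementation.

```python
import numpy as np

def bag_features(bags, prototypes, sigma=1.0):
    """Embed each bag (image) into the multi-class feature space defined by
    the instance prototypes of ALL classes: one feature per prototype, equal
    to the maximum instance-to-prototype similarity within the bag.
    A Gaussian similarity is assumed here in place of the paper's Eq. (1)."""
    F = np.zeros((len(bags), len(prototypes)))
    for i, bag in enumerate(bags):                     # bag: (n_regions, dim)
        for k, p in enumerate(prototypes):
            d2 = np.sum((bag - p) ** 2, axis=1)
            F[i, k] = np.exp(-d2.min() / sigma ** 2)   # closest region wins
    return F

def mcsvm_predict(F, W, b):
    """Decision rule of Eq. (5): argmax_m w_m^T psi(x) + b_m. The rows of W
    and entries of b are assumed to come from a multi-class SVM trained on F."""
    scores = F @ W.T + b                               # shape (n_bags, c)
    return scores.argmax(axis=1)
```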
5 Evaluation of McMIL

5.1 Experiment Design

Using image categorization as a case study, we evaluate the performance of McMIL on two challenging image data sets: the COREL database [2, 5, 6] and SIMPLIcity-II. The images in SIMPLIcity-II are selected from the SIMPLIcity image library [11, 17]. The COREL image database consists of 10 categories, with 100 images per category. The SIMPLIcity-II data set consists of 5 categories, with 100 images per category. Images in both data sets are in JPEG format of size 384 × 256 or 256 × 384. Using the COREL dataset, we compared the classification accuracy of McMIL with that of several state-of-the-art methods, including MILES [6], 1-norm SVM [2], DD-SVM [5], k-means-SVM [7], Hist-SVM [4] and MI-SVM [1]. Using the COREL dataset, we also comprehensively evaluated McMIL against DD-SVM. Using SIMPLIcity-II, we compared the robustness of McMIL with that of DD-SVM in classifying images where strong target concept overlaps exist among the classes. To facilitate fair comparison with other methods, we use the same feature matrix for all the methods. In all of the experiments, images within each category were randomly partitioned in half to form a training set and a test set. We repeated each experiment for 5 random splits and report the average of the results obtained over the 5 random test sets, as done in the compared approaches in the literature.

5.2 Results

Classification accuracy on the COREL data set. Table 1 shows the classification accuracy for each class obtained by McMIL, DD-SVM and MILES. For McMIL, we present the results of its two variants: one is McMIL90, which, instead
of using all the examples in all the other classes with negative labels, uses 10 images per negative class to learn the IPs for each class. The other variant is McMIL450, which uses all the examples in all the other classes with negative labels to learn the IPs for each class. We were surprised to find that McMIL90 performs much better than DD-SVM (which uses 450 negative examples) and MILES. This further implies that McMIL is able to achieve higher classification accuracy using a much smaller number of negative training examples. The left table in Table 2 shows the average classification accuracies of McMIL in comparison with the other approaches. McMIL90 performs much better than Hist-SVM, k-means-SVM, MI-SVM and 1-norm SVM, and slightly better than DD-SVM and MILES. Moreover, McMIL450 outperforms Hist-SVM, k-means-SVM, MI-SVM and 1-norm SVM by a large margin, and it performs comparably well with MILES. McMIL450 outperforms DD-SVM, though the difference is not statistically significant.

Table 1. Classification accuracies (in %) for each class obtained by McMIL90, McMIL450, DD-SVM and MILES on the COREL dataset

Class   McMIL90        McMIL450        DD-SVM   MILES
0       71.6 ± 4.7     70.4 ± 3.8      67.7     68.8
1       68.0 ± 6.8     69.6 ± 4.19     68.4     66.0
2       76.8 ± 2.7     74.8 ± 2.66     74.3     75.7
3       90.8 ± 2.7     91.6 ± 3.37     90.3     90.3
4       100.0 ± 0.0    100.0 ± 0.0     99.7     99.7
5       81.6 ± 6.1     79.6 ± 5.32     76.0     77.7
6       94.4 ± 1.9     95.6 ± 2.6      88.3     96.4
7       94.4 ± 3.4     96.4 ± 2.6      93.4     95.0
8       67.2 ± 4.7     62.8 ± 8.37     70.3     71.0
9       88.8 ± 1.6     88.8 ± 4.04     87.0     85.4
Table 2. The classification accuracy (in %) obtained by McMIL and other methods (with the corresponding 95% confidence intervals) using the COREL (left table) and SIMPLIcity-II (right table) dataset
Sensitivity to labeling noise. In multi-class classification, labeling noise is defined as the probability that an image is assigned an incorrect label. It is important to study the sensitivity of McMIL to labeling noise because in many practical applications it is usually impossible to get a "clean" data set, and the labeling process is often subjective [6]. In this experiment, we used 500 images from Class 0 to Class 4, with 100 images per class. The training and test sets have equal size. In the multi-class case, a training set with a given level of labeling noise is generated as follows: for the i-th class, i = 0, ..., 4, we randomly pick d% of the training images and change their labels to j, j ≠ i, where j takes any value other than i with equal probability; for the i-th class, i = 0, ..., 4, we also randomly pick d% of the images from among all the other classes (200 images in this case) and change their labels to i. We compared McMIL with DD-SVM for d = 0, 2, 4, ..., 20 and report the average classification accuracy over 5 random test sets, as shown in Fig. 3(a). We use 50 positive images and 200 negative images per class to learn the IPs for both McMIL and DD-SVM. We observe that for d = 0, 2, 4, 6, 8, the classification accuracies of McMIL are about 2% higher than those of DD-SVM. As we increase the percentage of training images with label noise from 10% to 20%, the accuracy of McMIL remains roughly the same, but the accuracy of DD-SVM drops very quickly. This indicates that our method is less sensitive to labeling noise than DD-SVM.
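A sketch of the noise-injection protocol just described; array names, the random generator and the seeding are assumptions, not the authors' code.

```python
import numpy as np

def add_label_noise(labels, d_percent, n_classes=5, seed=0):
    """Corrupt training labels as described above: for each class i, relabel
    d% of its images to a uniformly random class j != i, and relabel d% of
    the images of all other classes as i. Returns a noisy copy of `labels`."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    for i in range(n_classes):
        own = np.where(labels == i)[0]
        k = int(round(len(own) * d_percent / 100.0))
        for j in rng.choice(own, size=k, replace=False):
            noisy[j] = rng.choice([m for m in range(n_classes) if m != i])
        others = np.where(labels != i)[0]
        k = int(round(len(others) * d_percent / 100.0))
        noisy[rng.choice(others, size=k, replace=False)] = i
    return noisy
```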
Fig. 3. Comparison of McMIL and DD-SVM in terms of sensitivity to labeling noise (a), sensitivity to number of negative images (b), sensitivity to number of categories (c), and sensitivity to number of sample images (d)
Sensitivity to number of training negative images. We used 1000 images from Class 0 to Class 9 (training and test sets are of equal size) to analyze the performance of McMIL as the number of negative examples used for learning IPs for each class
varies from 90 to 450. That is, in the i-th (i = 1, ..., 5) experiment, we randomly choose 10 × i negative examples from the 50 training images of each of the other classes, and run the experiment for 5 random splits. In all these experiments, the number of positive examples used to learn the IPs for the j-th class, j = 1, ..., 10, remains 50. As indicated by Fig. 3(b), even when the number of negative images drops to 90, McMIL still performs very well. This is an advantage over binary MIL, since the performance of most binary MIL methods degrades significantly as the amount of negative training data decreases. Why does the performance of direct multi-class MIL not degrade as the amount of negative training data decreases? We attempt to answer this question as follows. In the OVA MIL approaches, a series of one-versus-all binary classifiers are obtained, each trained to separate one class from all the rest. This requires the IPs learned for each class not only to represent the target concepts of the underlying class but also to distinguish the underlying class from the rest. This is because, in the definition of DD [12], the target concepts are the intersection points of all the positive bags that do not intersect the negative bags, so the intersection points of all the positive bags that also intersect negative bags must be excluded by learning from a large number of negative examples. In direct multi-class MIL, however, the requirement for the IPs to distinguish one class from another is lessened, because the "distinguishing" power of the classifier can be supplied by learning in the multi-class feature space, thanks to the property of the multi-class feature matrix analyzed in Sec. 4. Therefore, in order to accurately capture the target concepts of the underlying class (i.e., the instance prototypes), the number of negative training examples used by McMIL can be much smaller than that used in DD-SVM.

Sensitivity to number of categories. In this experiment, a total of 10 data sets are used; the number of categories in a data set varies from 10 to 19. A data set with i categories contains 100 × i images from Class 0 to Class i−1. In learning the IPs for each class, we use the images in the current class with positive labels and all the images in the other classes with negative labels. As shown in Fig. 3(c), McMIL outperforms DD-SVM consistently and by a large margin for each data set.

Sensitivity to number of training images. We compared McMIL with DD-SVM as the size of the training set varies from 100 to 500. 1,000 images from Class 0 to Class 9 are used, with training-set sizes of 100, 200, 300, 400, and 500 (the number of images from each category is the training-set size divided by 10). As indicated in Fig. 3(d), McMIL performs consistently better than DD-SVM as the number of training images increases from 100 to 500.

Classification accuracy on SIMPLIcity-II. We built SIMPLIcity-II specifically to test the robustness of our method in classifying images where strong target concept overlaps exist among the classes. Although small, the SIMPLIcity-II dataset is in fact a very challenging one, for several reasons. First, the images in a semantic category are not "pure", so target concept overlaps occur more frequently than in the COREL dataset. Second, the data contain noise in that a negative bag may contain positive instances as well. Third, the images are very diverse: they have various backgrounds, colors and combinations of semantic concepts, and range from wide-angle to close-up shots.
The right table of Table 2 shows the average classification accuracy of McMIL and DD-SVM on the SIMPLIcity-II dataset over 5 random test sets (training and test sets are of equal size). Despite the aforementioned difficulties, McMIL achieves reasonably good results on this dataset, demonstrating that it is indeed a promising approach. We also found that DD-SVM made considerably more classification errors than McMIL in separating People and Portrait images: DD-SVM misclassifies 9.2% of People images as Portrait and 8.6% of Portrait images as People, whereas McMIL misclassifies only 6.4% and 6.8% for the two classes, respectively.
6 Concluding Remarks

In this paper, we described a multi-class multiple-instance learning approach and carried out a systematic evaluation to demonstrate the benefits of direct McMIL. Our empirical studies show that, in addition to overcoming the drawbacks of OVA MIL approaches, McMIL is able to achieve higher classification accuracy than most of the existing OVA MIL approaches even when using only a small amount of training data, and that it is less sensitive to labeling noise. These two benefits are significant in practical applications. McMIL is also found to be superior to DD-SVM in classifying images with strong target concept overlap. One drawback of McMIL is that, as the number of classes becomes large, it becomes computationally inefficient, since the gradient search for the IPs is time-consuming. One way to increase the efficiency of McMIL is to use only a small number of negative examples; another is to bypass the search for the IPs by combining MILES [6] with McMIL. Our future work will explore these possibilities.
References 1. Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for multiple-instance learning. NIPS 15, 561–568 (1998) 2. Bi, J., Chen, Y., Wang, J.Z.: A sparse support vector machine approach to region-based image categorization. CVPR (1), 1121–1128 (2005) 3. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines (2001). Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm 4. Chapelle, O., Haffner, P., Vapnik, V.N.: Support vector machines for histogram-based image classification. IEEE Trans. Neural Networks 10(5), 1055–1064 (1999) 5. Chen, Y., Wang, J.Z.: Image categorization by learning and reasoning with regions. Journal of Machine Learning Research 5, 913–939 (2004) 6. Chen, Y., Bi, J., Wang, J.Z.: MILES: Multiple-Instance Learning via Embedded Instance Selection. IEEE Trans. on PAMI 28(12), 1931–1947 (2006) 7. Csurka, G., Bray, C., Dance, C., Fan, L.: Visual Categorization with Bags of Keypoints. In: ECCV 2004 Workshop on Statistical Learning in Computer Vision, pp. 59–74 (2004) 8. Dietterich, T.G., Lathrop, R.H., Lozano-Pérez, T.: Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence 89, 31–71 (1997) 9. Hsu, C.-W., Lin, C.-J.: A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks 13(2), 415–425 (2002)
10. Lee, Y., Lin, Y., Wahba, G.: Multicategory support vector machines: theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association 99, 67–81 (2004) 11. Li, J., Wang, J.Z.: Automatic linguistic indexing of pictures by a statistical modeling approach. IEEE Trans. on PAMI 25(9), 1075–1088 (2003) 12. Maron, O., Lozano-Pérez, T.: A framework for multiple-instance learning. NIPS 10, 570–576 (1998) 13. Maron, O., Ratan, A.L.: Multiple-instance learning for natural scene classification. ICML, 341–349 (1998) 14. Rahmani, R., Goldman, S.A.: MISSL: multiple-instance semi-supervised learning. ICML, 705–712 (2006) 15. Schölkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond, pp. 211–214. MIT Press, Cambridge, MA (2002) 16. Vapnik, V.N.: Statistical Learning Theory. Wiley, New York (1998) 17. Wang, J.Z., Li, J., Wiederhold, G.: SIMPLIcity: Semantics-sensitive Integrated Matching for Picture Libraries. IEEE Trans. on PAMI 23(9), 947–963 (2001) 18. Wang, L., Shen, X., Zheng, Y.F.: On L-1 norm multi-class Support Vector Machines. In: Proc. 5th Intl. Conf. on Machine Learning and Applications (2006) 19. Weston, J., Watkins, C.: Multi-class support vector machines. In: Verleysen, M. (ed.) Proc. ESANN 1999, D. Facto Press, Brussels (1999) 20. Zhang, Q., Goldman, S.A.: EM-DD: An improved multiple-instance learning technique. NIPS 14, 1073–1080 (2002) 21. Zhang, Q., Goldman, S.A., Yu, W., Fritts, J.E.: Content-based image retrieval using multiple-instance learning. In: Proc. ICML, pp. 682–689 (2002)
TransforMesh : A Topology-Adaptive Mesh-Based Approach to Surface Evolution Andrei Zaharescu, Edmond Boyer, and Radu Horaud INRIA Rhone-Alpes, 655 ave de l’Europe, Montbonnot, 38330, France
[email protected]
Abstract. Most algorithms dealing with image-based 3-D reconstruction involve the evolution of a surface based on a minimization criterion. The mesh parametrization, while allowing for an accurate surface representation, suffers from the inherent problem of not being able to reliably deal with self-intersections and topology changes. As a consequence, a substantial number of methods choose implicit representations of surfaces, e.g. level set methods, that naturally handle topology changes and intersections. Nevertheless, these methods rely on space discretizations, which introduce an unwanted precision-complexity trade-off. In this paper we explore a new mesh-based solution that robustly handles topology changes and removes self-intersections, therefore overcoming the traditional limitations of this type of approach. To demonstrate its efficiency, we present results on 3-D surface reconstruction from multiple images and compare them with state-of-the-art results.
1 Introduction

A vast number of problems in the area of image-based 3-D modeling are cast as energy minimization problems where 3-D shapes are optimized such that they best explain image information. More specifically, when performing 3-D reconstruction from multiple images, one typically attempts to recover 3-D shapes by evolving a surface with respect to various criteria, such as photometric or geometric consistency. Meshes, either triangular, polygonal or even tetrahedral, are one of the most commonly used forms of shape representation. Traditionally, an initial mesh is obtained using some well-known method, e.g. bounding boxes or visual hulls, and then deformed over time such that it minimizes an energy based typically on some form of image similarity measure. There exist two main schools of thought on how to cast the above-mentioned problem. Lagrangian methods propose an intuitive approach where the surface representation, i.e. the mesh, is directly deformed over time. Meshes present numerous advantages, among which adaptive resolution and compact representation, but raise two major issues of concern when evolved over time: self-intersections and topology changes. It is critical to deal with both issues when evolving a mesh over time. To address this, McInerney and Terzopoulos [1] proposed topology-adaptive deformable curves and meshes, called T-snakes and T-surfaces. However, in solving the intersection problem, the authors use a spatial grid, thus imposing a fixed spatial resolution. In addition, only offsetting motions, i.e. inflating or deflating, are allowed. Lachaud et al. [2] proposed a heuristic method where merges and splits are performed in near-boundary cases:
2 TransforMesh - A Topology-Adaptive Self-intersection Removal Mesh Solution As stated earlier, the main limitations that prevent many applications from using meshes are self-intersections and topology changes, which frequently occur when evolving meshes. In this paper, we show that such limitations can be overcome using a very intuitive geometrically-driven solution. In essence, the approach preserves the mesh consistency by detecting self-intersections and considering the subset of the original mesh surface that is outside with respect to the mesh orientation. A toy example is il-
Our method is inspired by the work of [5], proposed in the context of mesh offsetting or mesh expansions. We extend it to the general situation of identifying a valid mesh, i.e. a manifold, from a self-intersecting one with possible topological issues. The proposed algorithm has the great advantage of dealing with topological changes naturally, much in the same fashion as level-set-based solutions, making it a viable solution for surface evolution with meshes. The only requirement is that the input mesh is the result of the deformation of an oriented and valid mesh. The following sections detail the sequential steps of the algorithm.
Fig. 1. A toy example of TransforMesh at work: (a),(b) input; (c) intersections; (d),(e) output
2.1 Pre-treatment

Most mesh-related algorithms suffer from numerical problems due to degenerate triangles, mainly triangles whose area is close to zero. In order to remove those triangles, two operations are performed: edge collapse and edge flip.

2.2 Self-intersections

The first step of the algorithm consists of identifying self-intersections, i.e. edges along which triangles of the mesh intersect. Naively, one would have to perform n²/2 checks to see whether two triangles intersect, which can become quite expensive when the number of facets is large. In order to decrease the computational time, we use a bounding box test to determine which bounding boxes (of triangles) intersect, and only for those perform a triangle intersection test. We use the fast box intersection method implemented in [10] and described in [11]. The complexity of the method is O(n log^d(n) + k) for the running time and O(n) for the space occupied, where n is the number of triangles, d the dimension (3 in the 3-D case), and k the output complexity, i.e., the number of pairwise intersections of the triangles.

2.3 Valid Region Growing

The second step of the algorithm consists of identifying the valid triangles in the mesh. To this purpose, a valid region growing approach is used to propagate validity labels on the triangles that compose the outside of the mesh.

Initialization. Initially, all the triangles are marked as non-visited. This corresponds to Figure 2(a). Three queues are maintained: one named V, of valid triangles, i.e. triangles outside and without intersections; one named P, of partially valid triangles, i.e. only part
Fig. 2. Valid Region Growing (2-D simplified view): (a) init; (b) seed triangle; (c) valid queue; (d) partial queue; (e) end. The selected elements are in bold.
of the triangle is outside; and finally one named G, where all the valid triangles are stored until stitched together into a new mesh (Section 2.4). All that follows is performed within an outer loop, while there still exists a valid seed triangle.

Seed-triangle Finding. A seed triangle is defined as a non-visited triangle without intersections whose normal does not intersect any other triangle of the same connected component. This corresponds to Figure 2(b). In other words, a seed triangle is a triangle that is guaranteed to be on the exterior. This triangle is crucial, since we start the valid region growing from such a triangle. If found, the triangle is added to V and marked as valid; otherwise, the algorithm jumps to the next stage (Section 2.4). The next two steps are performed within an inner loop until both V and P are empty.

Valid Queue Processing. While V is not empty, pop a triangle t from the queue, add it to G and, for each neighbouring triangle N(t), perform the following: if N(t) is non-visited and has no intersections, add it to V; if N(t) is non-visited and has intersections, add it to P together with the entrance segment and direction, corresponding in this case to the oriented half-edge (see Figure 2(c)).

Partially-Valid Queue Processing. While P is not empty, pop a triangle t from the queue, together with the entrance half-edge f_t. All the intersection segments between this triangle and the other triangles have been computed previously. Let S_t = {s_i^t} represent all the intersection segments between triangle t and the other triangles, and let H_t = {h_j^t | j = 1..3} represent the triangle half-edges. A constrained 2-D triangulation is performed in the triangle plane, using [12], to ensure that all segments in both S_t and H_t appear in the new mesh structure and that propagation can be achieved in a consistent way. A fill-like traversal is performed from the entrance half-edge to adjacent triangles, stopping on constraint edges, as depicted in Figure 3. A crucial aspect for a natural handling of topological changes is choosing the correct side of continuation of the "fill"-like region growing when crossing a partially valid triangle. The correct orientation is chosen such that, if the original
Fig. 3. Partial Triangle Traversal: (a) triangle intersections; (b) partial triangle traversal
normals are maintained, the two newly formed sub-triangles preserve the watertightness constraint of a manifold. This condition can also be cast as follows: the normals of the two sub-triangles should oppose each other when the two sub-triangles are "folded" onto the common edge. A visual representation of the two cases is shown in Figure 4.
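The queue-based growing loop described above can be sketched as follows. The geometric subroutines (seed search, intersection segments, constrained triangulation and partial traversal) are abstracted as callbacks, so this is only an outline of the control flow under stated interface assumptions, not the authors' implementation.

```python
from collections import deque

def grow_valid_region(find_seed, neighbors, intersecting, traverse_partial):
    """Sketch of the valid region growing of Sec. 2.3. Assumed interfaces:
    find_seed(visited) -> a non-visited, intersection-free seed triangle whose
    normal hits no other triangle of its component, or None; neighbors(t) ->
    (adjacent triangle, oriented entrance half-edge) pairs; intersecting(t) ->
    True if t carries intersection segments; traverse_partial(t, he) -> the
    outside sub-triangles of a partially valid triangle (constrained 2-D
    triangulation plus fill-like traversal, omitted in this sketch)."""
    visited, G = set(), []            # G collects the valid triangle soup
    while True:                       # outer loop: one seed per component
        seed = find_seed(visited)
        if seed is None:
            break
        V, P = deque([seed]), deque() # valid / partially valid queues
        visited.add(seed)
        while V or P:
            while V:                  # valid queue processing
                t = V.popleft()
                G.append(t)
                for n, he in neighbors(t):
                    if n in visited:
                        continue
                    visited.add(n)
                    if intersecting(n):
                        P.append((n, he))   # entered later via this half-edge
                    else:
                        V.append(n)
            while P:                  # partially valid queue processing
                t, he = P.popleft()
                G.extend(traverse_partial(t, he))
    return G
```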
Fig. 4. Partial Triangle Crossing Cases (side view)
2.4 Triangle Stitching

The region growing algorithm described previously iterates until all the triangles have been selected. In the chosen demo example, this corresponds to Figure 2(e). At this stage, what remains is to stitch together the 3-D triangle soup (queue G) in order to obtain a valid mesh that is a manifold. We adopt a method similar in spirit to [13,14]. In most cases this is a straightforward operation, reduced to identifying the common vertices and edges, followed by stitching. However, there are three special cases in which performing a simple stitching would violate the mesh constraints and produce locally non-manifold structures. The special cases, shown in Figure 5, arise from performing stitching in places where the original structure should have been maintained. We adopt the naming convention from [13], calling them the singular vertex case, the
singular edge case and the singular face case. All cases are easy to identify using only local operations, and they are identified after all the stitching has been performed. In the singular vertex case, in order to detect whether a vertex v is singular, we adopt the following algorithm: mark all the facets incident to the vertex v as non-visited. Then start from a facet of v, mark it visited, and do the same with its non-visited neighbours that are also incident to v (neighbours are chosen based on half-edges). The process is repeated until all the neighbouring facets are processed. If by doing so we have exhausted all the neighbouring facets, vertex v is non-singular; otherwise it is singular, so a copy of it is created and added to all the remaining non-visited facets. In order to detect a singular edge e, all we have to do is count the number of triangles that share that edge. If it is greater than 2, we have a singular edge case, and two additional vertices and a new edge are added to account for it.
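A sketch of the singular-vertex and singular-edge tests just described; the mesh-query functions are assumed interfaces rather than the authors' half-edge data structure.

```python
def is_singular_vertex(v, incident_faces, face_adjacent):
    """Flood-fill the faces incident to v through half-edge adjacency
    restricted to v; if the flood does not exhaust all incident faces,
    v is singular (its fan of faces splits into several components).
    incident_faces(v): faces touching v; face_adjacent(f, v): faces sharing
    a half-edge with f and also incident to v (assumed interfaces)."""
    faces = set(incident_faces(v))
    if not faces:
        return False
    stack, seen = [next(iter(faces))], set()
    while stack:
        f = stack.pop()
        if f in seen:
            continue
        seen.add(f)
        stack.extend(g for g in face_adjacent(f, v) if g in faces and g not in seen)
    return seen != faces          # unreachable incident faces -> singular

def is_singular_edge(edge, faces_sharing):
    """An edge is singular when more than two triangles share it."""
    return len(faces_sharing(edge)) > 2
```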
Fig. 5. Special cases encountered while stitching a triangle soup: (a) singular vertex; (b) singular edge; (c) singular face
Fig. 6. An example of how a singular vertex occurs in a typical self-intersection removal situation, due to an inverted triangle (marked in red): (a) original mesh; (b) resulting mesh
Fig. 7. Examples of different topological changes (2-D simplified view): (a) join case; (b) split case; (c) inside carving case. The outline of the final surface obtained after self-intersection removal is drawn in bold blue.
In practice, only the singular vertex case appears in real triangular mesh test cases. An example has been created to illustrate this scenario; it is shown in Figure 6.

2.5 Topological Changes

The partial-triangle crossing technique described earlier ensures a natural handling of the topological changes (splits and joins) that have plagued mesh approaches until now. Representative cases are illustrated in Figure 7.
3 Using TransforMesh to Perform Mesh Evolutions

Our original motivation in developing a mesh self-intersection removal algorithm was to perform mesh evolutions, in particular when recovering 3-D shapes from multiple calibrated images. As stated earlier, few efforts have been put into mesh-based solutions for this 3-D surface reconstruction problem, mostly because of the topological issues raised by mesh evolutions. However, meshes allow one to focus on the region of interest in space, namely the shape's surface, and, as a result, lower the complexity and lead to better precision than volumetric approaches. In this section we present the application of TransforMesh to the surface reconstruction problem. Such a problem is often cast as an energy minimization over a surface. We start from exact visual hull reconstructions, obtained using [15], and further improve the mesh using photometric constraints by means of the energy functional described in [7]. The photometric constraints are cast as an energy minimization problem using a similarity measure between pairs of cameras that are close to each other.

We denote by S ⊂ R^3 the 3-D surface. Let I_i : Ω_i ⊂ R^2 → R^d be the image captured by camera i (d = 1 for grayscale and d = 3 for color images). The perspective projection of camera i is represented as Π_i : R^3 → R^2. Since the method uses visibility, consider S_i as the part of surface S visible in image i. In addition, the back-projection of image i onto the surface is represented as Π_i^{-1} : Π_i(S) → S_i. Armed with the above notation, one can compute a similarity measure M_ij of the surface S as the similarity measure between image I_i and the reprojection of image I_j into camera i via the surface S. Summing across all the candidate stereo pairs, one can write:

M(S) = \sum_i \sum_{j \neq i} M_{ij}(S)    (1)

M_{ij}(S) = M|_{\Omega_i \cap \Pi_i(S_j)} \left( I_i,\; I_j \circ \Pi_j \circ \Pi_i^{-1} \right)    (2)

Finally, the surface evolution equation at a point x is given by:

\frac{\partial S}{\partial t} = -\left( \lambda_1 E_{smooth} + E_{img} \right) N    (3)

where E_smooth depends on the curvature H (see [6]), N represents the surface normal, and E_img is a photoconsistency term, summed across pairs of cameras, which depends on derivatives of the similarity measure M, of the images I, of the projection matrices Π, and on the distance x_z (see [7] for more details).
In the original paper [7], the surface evolution equation was implemented within the level-set framework. We extend it to meshes using the TransforMesh algorithm described in the previous section. The original solution performs surface evolution using a coarse-to-fine approach in order to escape local minima. Traditionally, in level-set approaches, the implicit function that embeds the surface S is discretized evenly on a 3-D grid. As a side effect, all the facets of the recovered surface are of approximately equal size. In contrast, mesh-based approaches do not impose such a constraint and allow facets of all sizes on the evolving surface. This is particularly useful when starting from visual hulls, for which the initial mesh contains triangles of all dimensions. In addition, the size of the visual-hull facets appears to be relevant information, since regions where the visual reconstruction is less accurate, i.e. concave regions on the observed surface, are described by bigger facets on the visual hull. Thus, we adopt a coarse-to-fine approach in which the bigger triangles are moved until they stabilize, followed by dimension reduction via edge splits. The algorithm iterates at a smaller scale until the desired smallest edge size is obtained. The algorithm therefore uses a multi-scale approach, going from scale s_max down to s_min = 1 in λ_2 = √2 increments, using Δt = 0.001 as the time step. A vertex can move by at most 10% of the average length of its incoming half-edges. The original surface, obtained from a visual hull, is evolved using equation (3), where cross-correlation is used as the similarity measure and λ_1 = 0.3. Every 5 iterations TransforMesh is performed, in order to remove the self-intersections and allow for any topological changes.
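A sketch of the resulting evolution loop with the constants quoted above (Δt = 0.001, λ1 = 0.3, λ2 = √2, TransforMesh every 5 iterations). All mesh operations are placeholders rather than the authors' CGAL-based code, and the number of iterations per scale is an assumption.

```python
import numpy as np

DT, LAMBDA1, LAMBDA2 = 1e-3, 0.3, np.sqrt(2.0)

def evolve(mesh, e_smooth, e_img, normals, avg_halfedge,
           split_edges, transformesh, s_max, iters_per_scale=50):
    """Coarse-to-fine evolution sketch around Eq. (3). Assumed interfaces:
    e_smooth/e_img return per-vertex scalars, normals returns unit vertex
    normals, avg_halfedge returns the mean incident half-edge length per
    vertex, split_edges refines the mesh to the given scale, and
    transformesh removes self-intersections and handles topology changes."""
    scale = float(s_max)
    while scale >= 1.0:
        for it in range(iters_per_scale):
            # Eq. (3): dS/dt = -(lambda_1 * E_smooth + E_img) * N
            speed = -(LAMBDA1 * e_smooth(mesh) + e_img(mesh))   # (n_vertices,)
            step = DT * speed[:, None] * normals(mesh)          # (n_vertices, 3)
            # A vertex moves at most 10% of its mean incident half-edge length.
            limit = 0.1 * avg_halfedge(mesh)
            norm = np.linalg.norm(step, axis=1)
            step *= np.minimum(1.0, limit / np.maximum(norm, 1e-12))[:, None]
            mesh.vertices = mesh.vertices + step
            if (it + 1) % 5 == 0:
                mesh = transformesh(mesh)   # self-intersection removal
        scale /= LAMBDA2                    # move to the next, finer scale
        mesh = split_edges(mesh, target_scale=scale)
    return mesh
```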
4 Results

We have tested the mesh evolution algorithm on the datasets provided by the Multi-View Stereo Evaluation site [9] (http://vision.middlebury.edu/mview/), and our results are comparable with the state of the art: we rank in the top 1-3 (depending on the data set and ranking criteria chosen) and the results are within sub-millimeter accuracy. Detailed results are extracted from the website and presented in Table 1 (consult the website for detailed information and running times). We have also included results by Furukawa et al. [6], Pons et al. [7] and Hernandez et al. [8], considered to be the state of the art. The differences between all methods are very small, ranging between 0.01mm and 0.3mm. Some of our reconstruction results are shown in Figure 8. The algorithm reaches a good solution without the presence of a silhouette term in the evolution equation. In a typical evolution scenario, there are more self-intersections at the beginning, but, as the algorithm converges, intersections rarely occur. Additionally, in the temple case, we performed a test where we started from one big sphere as

Table 1. 3-D Rec. Results. Accuracy: the distance d in mm that brings 90% of the result R within the ground-truth surface G. Completeness: the percentage of G that lies within 1.25mm of R.
Paper                  Temple Ring       Temple Sparse Ring   Dino Ring         Dino Sparse Ring
                       Acc.     Compl.   Acc.     Compl.      Acc.     Compl.   Acc.     Compl.
Pons et al. [7]        0.60mm   99.5%    0.90mm   95.4%       0.55mm   99.0%    0.71mm   97.7%
Furukawa et al. [6]    0.55mm   99.1%    0.62mm   99.2%       0.33mm   99.6%    0.42mm   99.2%
Hernandez et al. [8]   0.52mm   99.5%    0.75mm   95.3%       0.45mm   97.9%    0.60mm   98.52%
Our results            0.55mm   99.2%    0.78mm   95.8%       0.42mm   98.6%    0.45mm   99.2%
Fig. 8. Reconstruction results obtained in the temple and the dino case: (a) Dino input; (b) Dino start; (c) Dino results; (d) Dino close-up; (e) Temple input; (f) Temple start; (g) Temple results; (h) Temple close-up
the startup condition, in order to check whether the topological split operation performs properly. Proper convergence was obtained. We acknowledge that TransforMesh was not put to a thorough test using the current data sets, which might leave the reader suspicious about special cases in which the method could fail. We have implemented a mesh morphing algorithm in order to test the robustness of the method, and we have successfully morphed meshes with a topology significantly different from that of the surface of departure. Results will be detailed in another publication.

Implementation Notes. In our implementation we have made extensive use of the CGAL (Computational Geometry Algorithms Library) [16] C++ library, which provides excellent implementations of various algorithms, among which n-dimensional fast box intersections, 2-D constrained Delaunay triangulation, triangular meshes and support for exact arithmetic kernels. The running time of TransforMesh depends greatly on the number of self-intersections, since more than 80% of the running time is spent computing them. Typically, the running time for the self-intersection test is under 1 second for a mesh with 50,000 facets, where exact arithmetic is used for the triangle intersections and the number of self-intersections is in the range of 100.
5 Conclusion

We have presented a fully geometric, efficient Lagrangian solution for triangular mesh evolution that is able to handle topology changes gracefully. We have tested our method in the
context of multi-view stereo 3-D reconstruction and we have obtained top ranking results, comparable with state-of-the-art methods in the literature. Our contribution with respect to the existing methods is to provide a purely geometric mesh-based solution that does not constrain meshes and that allows for facets of all sizes as well as for topology changes. Acknowledgements. This research was supported by the VISIONTRAIN RTN-CT2004-005439 Marie Curie Action within the European Community’s Sixth Framework Programme. This paper reflects only the author’s views and the Community is not liable for any use that may be made of the information contained therein.
References 1. McInerney, T., Terzopoulos, D.: T-snakes: Topology adaptive snakes. Medical Image Analysis 4(2), 73–91 (2000) 2. Lachaud, J.O., Taton, B.: Deformable model with adaptive mesh and automated topology changes. In: Proceedings of the Fourth International Conference on 3-D Digital Imaging and Modeling (2003) 3. Osher, S., Fedkiw, R.: Level Set Methods and Dynamic Implicit Surfaces. Springer, Heidelberg (2003) 4. Osher, S., Sethian, J.: Fronts propagating with curvature-dependent speed: algorithms based on the Hamilton-Jacobi formulation. Journal of Computational Physics 79(1), 12–49 (1988) 5. Jung, W., Shin, H., Choi, B.K.: Self-intersection removal in triangular mesh offsetting. Computer-Aided Design and Applications 1(1-4), 477–484 (2004) 6. Furukawa, Y., Ponce, J.: Accurate, dense and robust multi-view stereopsis. CVPR (2007) 7. Pons, J.P., Keriven, R., Faugeras, O.: Multi-view stereo reconstruction and scene flow estimation with a global image-based matching score. International Journal of Computer Vision 72(2), 179–193 (2007) 8. Hernandez, C.E., Schmitt, F.: Silhouette and stereo fusion for 3-D object modeling. Computer Vision and Image Understanding 96(3), 367–392 (2004) 9. Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 519–526. IEEE Computer Society Press, Los Alamitos (2006) 10. Kettner, L., Meyer, A., Zomorodian, A.: Intersecting sequences of dD iso-oriented boxes. In: Board, C.E. (ed.) CGAL-3.2 User and Reference Manual (2006) 11. Zomorodian, A., Edelsbrunner, H.: Fast software for box intersection. International Journal of Computational Geometry and Applications (12), 143–172 (2002) 12. Hert, S., Seel, M.: dD convex hulls and Delaunay triangulations. In: Board, C.E. (ed.) CGAL-3.2 User and Reference Manual (2006) 13. Gueziec, A., Taubin, G., Lazarus, F., Horn, B.: Cutting and stitching: Converting sets of polygons to manifold surfaces. IEEE Transactions on Visualization and Computer Graphics 7(2), 136–151 (2001) 14. Shin, H., Park, J.C., Choi, B.K., Chung, Y.C., Rhee, S.: Efficient topology construction from triangle soup. In: Proceedings of the Geometric Modeling and Processing (2004) 15. Franco, J.S., Boyer, E.: Exact polyhedral visual hulls. In: British Machine Vision Conference, vol. 1, pp. 329–338 (2003) 16. Board, C.E.: CGAL-3.2 User and Reference Manual (2006)
Microscopic Surface Shape Estimation of a Transparent Plate Using a Complex Image Masao Shimizu and Masatoshi Okutomi Graduate School of Science and Engineering, Tokyo Institute of Technology
Abstract. This paper proposes a method to estimate the surface shape of a transparent plate using a reflection image on the plate. The reflection image on a transparent plate is a complex image that consists of a reflection on the surface and on the rear surface of the plate. A displacement between the two reflection images holds the range information to the object, which can be extracted from a single complex image. The displacement in the complex image depends not only on the object range but also on the normal vectors of the plate surfaces, plate thickness, relative refraction index, and the plate position. These parameters can be estimated using multiple planar targets with random texture at known distances. Experimental results show that the proposed method can detect microscopic surface shape differences between two different commercially available transparent acrylic plates.
1 Introduction
Shape estimation of a transparent object is a difficult problem in computer vision that has attracted many researchers. The background changes its apparent position through a moving transparent object according to the refraction of light [1]. Based on that observation, several techniques have been proposed to estimate the shape and the refraction index of a transparent object: using optical flow [5], detecting the positional change of structured lights [4], and measuring the effects on the shape of a structured light [3] caused by a moving object. Other methods that have been investigated include using polarized light [6]. On the other hand, a monocular range estimation method, named reflection stereo, using a complex image observed as a reflection on a transparent parallel planar plate, has also been proposed [7],[8]. The complex image consists of the reflection at the plate surface (the surface reflection image I_s) and the reflection at the rear surface of the plate, which is then transmitted again through the surface (the rear-surface reflection image I_r). The object range is obtainable from the displacement in the complex image, that is, the displacement between I_s and I_r, which obeys a unique constraint that is the equivalent of the epipolar constraint in stereo vision. The displacement in the complex image depends not only on the object range, but also on the normal vectors of the plate surfaces, the plate thickness, the relative refraction index, and the plate position. This paper proposes a method to estimate these parameters of the transparent plate through calibration of the reflection stereo system, using multiple planar
targets with random texture at known distances. These estimated parameters are available for range estimation; simultaneously, they express the shape and orientation of the plate. The proposed method estimates a microscopic shape of a plate using a differential measurement of Is and Ir , whereas conventional methods have used only light transmitted through a transparent object. The method might be used for an industrial inspection of a planar glass plate with a common camera. This paper is organized as follows. Section 2 describes range estimation by reflection stereo with a perfectly parallel planar transparent plate. Section 3 extends to the case with a non-parallel planar plate, and shows the parameters to be estimated. Section 4 presents a parameter estimation method through the calibration of the reflection stereo using multiple targets. Experimental results are described in Section 5. This paper concludes with remarks in Section 6.
2 Range Estimation from a Single Complex Image
This section describes reflection stereo with a perfectly parallel planar plate. Figure 1 shows the two light paths from an object to the camera center. A transparent plate reflects and transmits the incident light on its surface. The transmitted light is then reflected on the rear surface and is transmitted again to the air through the surface. These two light paths have an angle disparity θ_s, which depends on the relative refractive index n of the plate, the plate thickness d, the incident angle θ_i, and the object distance D_o. The fundamental relation between the angle disparity θ_s and the distance D_o is explainable as the reflection and refraction of light in a plane including the object, the optical center, and the normal vector of the plate. A two-dimensional (2D) ξ-υ coordinate system is set with its origin at the reflecting point on the surface. The following equation can be derived by projecting the object position (−D_o \sin θ_i, D_o \cos θ_i) and the optical center position (D_c \sin θ_i, D_c \cos θ_i) onto the ξ-axis:

D_o + D_c = d \, \frac{\sin\left(2(\theta_i - \theta_s)\right)}{\sin\theta_s \sqrt{n^2 - \sin^2(\theta_i - \theta_s)}}    (1)

The angle disparity θ_s is obtainable by finding the displacement in the complex image; the object distance D_o is then derived from Eq. (1). The displacement obeys a constraint which describes the corresponding position in the rear-surface image
Fig. 1. Fundamental light paths in the reflection stereo
moving along a constraint line according to the image position [7]. The constraint reduces the search to 1D, just as the epipolar constraint does in stereo vision. The angle disparity takes its minimum, θ_s = 0, when D_o = ∞, if the plate is manufactured as a perfectly parallel planar plate.
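For concreteness, the reconstructed Eq. (1) can be solved for D_o numerically as below; the angle-disparity value used in the example call is illustrative only and not taken from the paper's experiments.

```python
import numpy as np

def object_distance(theta_i, theta_s, d, n, D_c):
    """Solve Eq. (1) for the object distance D_o, given the incident angle
    theta_i and angle disparity theta_s (radians), plate thickness d,
    relative refraction index n, and camera-to-plate distance D_c
    (d, D_c and the result share the same length unit)."""
    num = np.sin(2.0 * (theta_i - theta_s))
    den = np.sin(theta_s) * np.sqrt(n**2 - np.sin(theta_i - theta_s)**2)
    return d * num / den - D_c

# Illustrative values: incident angle pi/4, a small angle disparity of 0.05 deg.
print(object_distance(np.pi / 4, np.deg2rad(0.05), d=10.0, n=1.49, D_c=66.2))
```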
3 Reflection Stereo with Non-parallel Planar Plate
The transparent acrylic plate has limited parallelism and has a specific wedge angle, even if it has sufficient surface flatness. In this section, we investigate the range estimation of the reflection stereo using a non-parallel planar plate.

3.1 Range Estimation Using Ray Tracing
Figure 2 depicts the camera coordinate system with its origin at the optical center; f/δ denotes the lens focal length in CCD pixel units. The transparent plate is set at an angle to the Z-axis. We use ray tracing [9] to describe the light paths and estimate the object distance. Figure 3(a) shows that an object seen in a surface reflection image at an image position A(a) = [u_a, v_a, f/δ] lies on the following line L_S with parameter t_S:

L_S(t_S) = q_S + v_S t_S,    (2)
q_S = q(o, a, m_S, \hat{n}_S), \quad v_S = r(a, \hat{n}_S),

where M_S(m_S) and \hat{n}_S respectively denote a position and the unit normal vector of the plate surface. Point O(o) represents the origin.
Fig. 2. Transparent plate and camera
Fig. 3. Surface and rear-surface reflections: (a) surface reflection image; (b) rear-surface reflection image
Similarly, a detected displacement d = d_p^*(a) + e^*(a)Δ of the rear-surface reflection image from the image position a shows that the object in the rear-surface image lies on the following line L_R with parameter t_R:

L_R(t_R) = q_R + v_R t_R,    (3)
q_R = q(q_1, v_1, m_S, \hat{n}_S), \quad v_R = t(v_1, -\hat{n}_S, 1/n),
q_1 = q(q_0, v_0, m_R, \hat{n}_R), \quad v_1 = r(v_0, \hat{n}_R),
q_0 = q(o, a + d, m_S, \hat{n}_S), \quad v_0 = t(d, \hat{n}_S, n),

where M_R(m_R) and \hat{n}_R respectively denote a position and the unit normal vector of the rear surface of the plate, as shown in Fig. 3(b). In addition, d_p^*(a) and e^*(a) denote the displacement of the rear-surface reflection image for an object at infinite distance and a normal vector of the constraint line at the image position a.

The intersection of lines L_S and L_R represents the 3-D position P_O(\hat{p}_O) of the object. In real situations, the object position P_O(\hat{p}_O) is determined using the following equation, which minimizes the distance between the two lines:

\hat{p}_O = \frac{1}{2}\left[ (q_S + v_S t_S) + (q_R + v_R t_R) \right],    (4)

\begin{bmatrix} t_S \\ t_R \end{bmatrix} = \begin{bmatrix} q_S \cdot v_S & -q_S \cdot v_R \\ q_R \cdot v_S & -q_R \cdot v_R \end{bmatrix}^{-1} \begin{bmatrix} q_S \cdot q_R - \|q_S\|^2 \\ \|q_R\|^2 - q_S \cdot q_R \end{bmatrix}.

For the above, we use the following relations, illustrated in Fig. 4:

q(p, v, s, \hat{n}) = p + v \, \frac{(s - p) \cdot \hat{n}}{v \cdot \hat{n}},    (5)

r(v, \hat{n}) = v - 2\hat{n}\,(v \cdot \hat{n}),    (6)

t(v, \hat{n}, n) = \frac{\frac{v}{v \cdot \hat{n}} - \hat{n}}{\sqrt{\,n^2 \frac{\|v\|^2}{(v \cdot \hat{n})^2} - \left\| \frac{v}{v \cdot \hat{n}} - \hat{n} \right\|^2}} + \hat{n}.    (7)

3.2 Parameters for Reflection Stereo
Reflection stereo with a non-parallel planar plate requires not only the lens focal length f/δ as a parameter, but also the following parameters, which are equivalent to extrinsic parameters in stereo vision:

– surface and rear-surface normal vectors n_S and n_R of the plate,
– a position m_S on the surface,
– plate thickness d, and
– relative refraction index n.

Fig. 4. A ray and a normal vector of the plate
With these extrinsic parameters, the object range is obtainable using Eq. (4) with a displacement in a complex image.
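A NumPy sketch of the geometric primitives of Eqs. (5)–(7) and of the line triangulation of Eq. (4), as reconstructed above. Since parts of these formulas were recovered from a damaged source, treat the sketch as illustrative rather than a faithful re-implementation; the rays L_S and L_R of Eqs. (2)–(3) are assembled by chaining q, r and t exactly as written there.

```python
import numpy as np

def q(p, v, s, n_hat):
    """Eq. (5): intersection of the ray p + t v with the plane through s
    having unit normal n_hat."""
    return p + v * (np.dot(s - p, n_hat) / np.dot(v, n_hat))

def r(v, n_hat):
    """Eq. (6): mirror reflection of direction v about n_hat."""
    return v - 2.0 * n_hat * np.dot(v, n_hat)

def t(v, n_hat, n):
    """Eq. (7): refracted direction for relative index n, as reconstructed."""
    w = v / np.dot(v, n_hat) - n_hat
    denom = np.sqrt(n**2 * np.dot(v, v) / np.dot(v, n_hat)**2 - np.dot(w, w))
    return w / denom + n_hat

def triangulate(qS, vS, qR, vR):
    """Eq. (4): midpoint of the closest points of the lines L_S and L_R,
    using the matrix form as printed in the text."""
    A = np.array([[np.dot(qS, vS), -np.dot(qS, vR)],
                  [np.dot(qR, vS), -np.dot(qR, vR)]])
    b = np.array([np.dot(qS, qR) - np.dot(qS, qS),
                  np.dot(qR, qR) - np.dot(qS, qR)])
    tS, tR = np.linalg.solve(A, b)
    return 0.5 * ((qS + vS * tS) + (qR + vR * tR))
```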
4 Surface Shape Estimation by Reflection Stereo

4.1 Piecewise Non-parallel and Planar Model of the Plate
Real surfaces of acrylic or glass plates are not perfectly flat at some scale. Consequently, the extrinsic parameters are position-variant in the complex image. The relative refraction index can differ among production lots and manufacturers; it might also be position-variant. For reflection stereo with a real transparent plate, the proposed method therefore assumes that the plate is piecewise non-parallel and planar, as illustrated in Fig. 5. In other words, the method assumes that the extrinsic parameters are constant within the small region that forms the complex image around a position of interest. The object range \hat{p}_{Oy}(a, d | p_e(a)) along the Y-axis is obtainable using Eq. (4) for the small region that includes the image position a, where p_e(a) = {n_S(a), n_R(a), m_S(a), d(a), n(a)} denotes the extrinsic parameter vector at position a.

4.2 Parameter Estimation Method
The extrinsic parameters can be estimated from M sets of the object distance D_y and its corresponding displacement d at position a in a complex image, analogously to calibration in stereo vision [10]:

\hat{p}_e(a) = \arg\min_{p_e(a)} E(a),    (8)

E(a) = \sum_{i=1}^{M} \left( D_{yi} - \hat{p}_{Oy}(a, d_i \mid p_e(a)) \right)^2 + \alpha \, \| \tilde{p}_e - p_e(a) \|^2
Fig. 5. Extrinsic parameters for a piecewise non-parallel and planar model
The second term in the objective function E(a) is a stabilization term that prevents the estimated parameters from deviating markedly from the design (catalogue) values \tilde{p}_e. The weight α is set empirically to the minimum value. We use the conjugate gradient method to minimize the nonlinear objective function, with the design values as initial parameters. The displacement d in a complex image can be detected as the second peak location of the autocorrelation function of the complex image, without any knowledge of the constraint, if the image contains a rich and dense texture. The seven components of the extrinsic parameters can be estimated from M ≥ 4 sets of observations.¹
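A sketch of the Eq. (8) minimization with SciPy's conjugate-gradient method, mirroring the text. The function predict_object_y (the ray-traced p̂_Oy for one displacement under a candidate parameter vector) and the 7-component parameter packing are assumed interfaces, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def estimate_extrinsics(D_y, disps, p_design, predict_object_y, alpha=1e-3):
    """Minimize Eq. (8) at one image position: data term over the M observed
    (distance, displacement) pairs plus a weak pull toward the design values.
    predict_object_y(d, p_e) must return p_hat_Oy for displacement d under the
    extrinsic parameter vector p_e (7 components)."""
    def E(p_e):
        data = sum((Dy - predict_object_y(d, p_e)) ** 2
                   for Dy, d in zip(D_y, disps))
        return data + alpha * np.sum((np.asarray(p_design) - p_e) ** 2)
    res = minimize(E, x0=np.asarray(p_design, dtype=float), method="CG")
    return res.x
```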
Parametric Expression for Range Estimation
The estimated parameters are considered as continuous and their changes are small with respect to the image position, as is clear from observations of real acrylic plates. The following parametric expression with a two-dimensional cubic function with respect to the image position u = (u, v) is used for range estimation: p ¯e (u) = Φ1 +Φ2 u+Φ3 v+Φ4 uv+Φ5 u2 +Φ6 v 2 +Φ7 u2 v+Φ8 uv 2 +Φ9 u3 +Φ10 v 3 , (9) where Φj (j = 1, 2, .., 10) denotes a coefficient vector corresponding to the seven components of the extrinsic parameters. These vectors are obtainable using leastsquares estimation, with estimated extrinsic parameters p ˆ e (al ) at image positions al , as follows. ¯ pe (al ) − p ˆe (al )2 (10) Φj (j = 1, 2, .., 10) = arg min Φ
l
The constraint line direction and the infinite object position with respect to the image position are also obtained in advance, and parameterized similarly as the extrinsic parameters for range estimation.
5
Experimental Results
5.1
System Setup
The experimental system was constructed as shown in Fig. 6. The plate is set with an angle π/4 for the optical axis. The system includes a black and white camera (60 FPS, IEEE1394, Flea; Point Gray Research Inc.). The design values of the plate are 10.0 [mm] thickness and the relative refraction index of 1.49. The measured distance from the lens optical center to the plate along the optical axis is Dco = 66.2 [mm]. We used a well-known camera calibration tool [2] for 1
The normal vector has two degrees-of-freedom (2-DOF), the position on the surface is 1-DOF, as determined by a distance along Z-axis if the surface normal is given. In total, the extrinsic parameters are 2 + 2 + 1 + 1 + 1 = 7 DOF.
182
M. Shimizu and M. Okutomi
Fig. 6. Configuration of the experimental system Textured plane
Camera
Transparent plate
Dc
D yi
Fig. 7. Setup of the calibration target
intrinsic parameter estimation. The calibrated parameters are the focal length f /δ = 1255.7 and the image center (cu , cv ) = (−31.845, −11.272); the image size is 640 × 480 [pixel]. Image distortion was alleviated using the estimated lens distortion parameters. 5.2
Calibration Targets
A random textured planar target at 13 different positions is used for extrinsic parameter estimation. As depicted in Fig. 7, the experimental system is mounted on a motorized long travel translation stage. Figure 8 shows a complex image for the plane distance of 500 [mm]. As described in 4.2, the displacement (shift value) along the constraint line in the complex image is Fig. 8. Complex image example (Do = obtainable as the second peak location 500 [mm]) of the autocorrelation function. The target positions are at Dyi = 350 + 50i + De [mm] (i = 1, 2, .., 13) from the optical center in Y -axis, where
Fig. 9. Estimated plate thickness and refraction index w.r.t. the image positions: (a) estimated thickness; (b) estimated refraction index
De is an unknown value. To estimate De, the following sum of residuals is first computed for three predetermined values (De = 0, ±25); the three sums are then used to estimate the De giving the minimum by fitting a parabola over the three sums:

Σ_l Σ_{i=1..M} ‖(350 + 50i + De) − p̂Oy(al, di | pe(al))‖²   (11)
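A minimal sketch of the parabola fitting used to refine De might look as follows; the residual values passed in are assumed to be the three sums of Eq. (11) computed beforehand, and the function name is illustrative.

```python
import numpy as np

def refine_offset_by_parabola(offsets, residuals):
    """Sketch: given the residual sums of Eq. (11) evaluated at the three
    predetermined offsets D_e = -25, 0, +25 [mm], fit a parabola and return
    the offset minimizing it."""
    a, b, c = np.polyfit(offsets, residuals, deg=2)   # r(D) = a D^2 + b D + c
    return -b / (2.0 * a)                             # vertex of the parabola

# Usage with hypothetical residual sums r_minus, r_zero, r_plus:
# d_e = refine_offset_by_parabola([-25.0, 0.0, 25.0], [r_minus, r_zero, r_plus])
```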
Figure 9 displays the estimated plate thickness d and relative refraction index n of transparent plate “A” (described in the next subsection) at 24 × 18 = 432 image positions. The plate area corresponding to these image positions is about 30 × 30 [mm]. The mean and standard deviation of the plate thickness are, respectively, 10.005712 and 5.351757 × 10⁻³ [mm], and those of the refraction index are 1.489652 and 4.427796 × 10⁻⁴, whereas the catalogue values are 10.0 [mm] and 1.49.
5.3 Microscopic Surface Shape Estimation of a Transparent Plate
Two types of commercially produced transparent acrylic plate are used to estimate the extrinsic parameters. Both plates have the same thickness of 10 [mm]; plate “A” was cut from a cast acrylic plate to 100 × 100 [mm], whereas plate “B” is a cast acrylic plate marketed at that size, without additional cutting. Figure 10 displays the estimated surface and rear-surface positions of plate “A”, corresponding to 24 × 18 = 432 image positions. At this figure resolution, the microscopic surface-shape difference between plates “A” and “B” is not visible. Figures 11 and 12 respectively display the estimated surface (a) and rear-surface (b) shapes of plates “A” and “B”. The surface shapes are displayed with respect to least-squares approximation surfaces for each plate. The two plates were measured under the same conditions. The vertical axes are greatly magnified. The standard deviation of the plate “A” surface is 0.013077, whereas that of plate “B” is 0.010649. Beyond the numerical difference, the shape difference between the two plates is readily apparent from the figures.
Fig. 10. Estimated surface and rear-surface positions of plate “A”
Fig. 11. Estimated surface and rear-surface shapes of plate “A”: (a) estimated surface shape; (b) estimated rear-surface shape
Fig. 12. Estimated surface and rear-surface shapes of plate “B”: (a) estimated surface shape; (b) estimated rear-surface shape
6 Conclusions
In this paper, we have proposed a method to estimate the surface shape of a transparent plate using a complex image. The displacement in the complex image depends not only on the object range but also on the normal vectors of the plate surfaces, the plate thickness, the relative refraction index, and the plate position. These parameters can be estimated using multiple planar targets with a random texture at known distances. Experimental results show that the proposed method can detect microscopic surface-shape differences between two commercially available transparent acrylic plates. Future work will include an investigation of the shape estimation accuracy and resolution using an optical standard, and a study of the precision limits of the proposed method.
References 1. Ben-Ezra, M., Nayar, S.K.: What Does Motion Reveal about Transparency? Proc. on ICCV 2, 1025–1032 (2003) 2. Bouguet, J.-Y.: Camera Calibration Toolbox for Matlab (2007), http://www.vision.caltech.edu/bouguetj/calib doc/index.html 3. Hata, S., Saitoh, Y., Kumamura, S., Kaida, K.: Shape Extraction of Transparent Object using Genetic Algorithm. Proc. on ICPR 4, 684–688 (1996) 4. Manabe, Y., Tsujita, M., Chihara, K.: Measurement of Shape and Refractive Index of Transparent Object. Proc. on ICPR 2, 23–26 (2004) 5. Murase, H.: Surface Shape Reconstruction of a Nonrigid Transport Object using Refraction and Motion. IEEE Trans. on PAMI 14(10), 1045–1052 (1992) 6. Saito, M., Sato, Y., Ikeuchi, K., Kashiwagi, H.: Measurement of Surface Orientations of Transparent Objects using Polarization in Highlight. Proc. on ICPR 1, 381–386 (1999) 7. Shimizu, M., Okutomi, M.: Reflection Stereo – Novel Monocular Stereo using a Transparent Plate. In: Proc. on Third Canadian Conference on Computer and Robot Vision, pp. 14–14 (2006) (CD-ROM) 8. Shimizu, M., Okutomi, M.: Monocular Range Estimation through a Double-Sided Half-Mirror Plate. In: Proc. on Fourth Canadian Conference on Computer and Robot Vision, pp. 347–354 (2007) 9. Whitted, T.: An Improved Illumination Model for Shaded Display. Communications of the ACM 23(6), 343–349 (1980) 10. Zhang, Z.: Flexible Camera Calibration by Viewing a Plane from Unknown Orientations. Proc. on ICCV 1, 666–673 (1999)
Shape Recovery from Turntable Image Sequence H. Zhong, W.S. Lau, W.F. Sze, and Y.S. Hung Department of Electrical and Electronic Engineering, the University of Hong Kong Pokfulam Road, Hong Kong, China {hzhong, sunny, wfsze, yshung}@eee.hku.hk
Abstract. This paper makes use of both feature points and silhouettes to deliver fast 3D shape recovery from a turntable image sequence. The algorithm exploits object silhouettes in two views to establish a 3D rim curve, which is defined with respect to the two frontier points arising from two views. The images of this 3D rim curve in the two views are matched using cross correlation technique with silhouette constraint incorporated. A 3D planar rim curve is then reconstructed using point-based reconstruction method. A set of rims enclosing the object can be obtained from an image sequence captured under circular motion. The proposed method solves the problem of reconstruction of concave object surface, which is usually left unresolved in general silhouette-based reconstruction methods. In addition, the property of the organized reconstructed rim curves allows fast surface extraction. Experimental results with real data are presented. Keywords: silhouette; rim reconstruction; surface extraction; circular motion.
1 Introduction
The 3D modelling of real-world objects is an important problem in computer vision and has many practical applications, such as virtual reality, video games, and motion tracking systems. The point-based approach is the oldest technique for 3D reconstruction. Once feature points across different views are matched, a cloud of feature points lying on the surface of the object can be recovered by triangulation methods. However, these points are not ordered, which means that the topological relationship among them is unknown. As a result, they are not readily usable for reconstructing the surface of an object. Besides feature points, silhouettes are prominent features of an image that can be extracted reliably. They provide useful information about both the shape and motion of an object, and indeed are the only features that can be extracted from a textureless object. However, they do not provide point correspondences between images, due to the viewpoint dependency of silhouettes. There is therefore a need for methods combining both point-based and silhouette-based approaches for modelling a 3D object. The contribution of this paper is that it takes advantage of both point-based and silhouette-based approaches to provide fast shape recovery. The method reconstructs a set of 3D rim curves on the object surface from a calibrated circular-motion image sequence, with concavities on the object surface retained. These 3D rim curves are reconstructed in an organized manner consistent with the circular-motion image order, leading to low computation complexity in the subsequent surface triangulation.
This paper is organized as follows. Section 2 briefly reviews the literature on model reconstruction. Section 3 introduces the theoretical principles which are used in this paper. Section 4 describes how the rim curves are computed. The extraction of surface from the reconstructed rim curves is given in section 5. Experimental results are given in section 6 and a summary follows in section 7.
2 Previous Works There are two main streams of methods for model reconstruction, namely point-based and silhouette-based methods. In a point-based method, a limited number of 3D points are initially reconstructed from matched feature points. A 3D model is then obtained by computing a dense depth map of the 3D object from the calibrated images using stereo matching techniques in order to recover the depths of all the object points, (see, for example, [1, 2]). It is further necessary to fuse all the dense depth map into a common 3D model [3] and extract coherently connected surface by interpolation [4]. Although a good and realistic 3D model may be obtained by carrying out these processes, both the estimation of dense depth maps and the fusion of them are of high computation complexity and reported to be formidably time consuming [5]. Another model reconstruction approach utilizes object silhouettes instead of feature points, see seminal works [6, 7]. There are two groups of methods in this approach, namely surface and volumetric methods. The surface method aims to determine points on the surface of the visual hull by either updating depth intervals along viewing lines [8] or using a dual-space approach via epipolar parameterization [9, 10]. On the other hand, volume segment representation, introduced in [11], was for constructing the bounding volume which approximated the actual three-dimensional structure of the rigid object generating the contours in multiple views. In this technique, octree representation is adopted to provide a volumetric description of an object in terms of regular grids or voxels [12-14]. Given the volumetric data in form of an octree, the marching cubes algorithm [15] can be used to extract a triangulated surface mesh for visualization. Since only silhouettes are used, no feature points are needed. However, the aforementioned methods produce only the visual hull of the object with respect to the set of viewpoints from which the image sequence is captured. The main drawback is hence that concave surfaces of the object cannot be reconstructed. More recently, many methods are proposed to combine the silhouette and the photo-consistency constraint to obtain higher quality results based on different formulations, cf. [16-20]. Broadly speaking, these methods all employ a two-step approach to surface reconstruction. In the first step, a surface mesh is initialized by computing a visual hull satisfying the silhouette consistency constraint. In the second step, the surface mesh is refined using the photo-consistency constraint. In this paper, we also consider combining the properties of silhouettes and feature points in shape recovery from a calibrated circular motion image sequence. However, unlike the methods mentioned above which aimed to reconstruct the whole photoconsistent surface based on a visual hull, the proposed method computes a set of discrete photo-consistent surface curves which are circumnavigating the object to represent the shape. The advantages of using discrete surface curves to represent
object shape and to form surface are two fold. First, no visual hull and time-consuming surface optimization are needed, and secondly, surface concavities can be recovered. This is particularly suitable for fast modelling with more accurate shape description than visual hull. Experimental results show that through combining the photo-consistency and silhouette constraints, the surface curves can be computed accurately.
3 Theoretical Principles 3.1 Geometry of the Surface Consider an object with a smooth surface viewed by a pin-hole camera. The contour generator on the object surface is the set of points at which the rays cast from the camera center is orthogonal to the surface normal [21]. The projection of a contour generator on the image plane forms an apparent contour. A silhouette is a subset of the apparent contour where the viewing rays of the contour generator touch the object [22]. Due to the viewpoint dependency of the contour generators, silhouettes from two distinct viewpoints will be the projections of two different contour generators. As a result, there will be no point correspondence in the two silhouettes except for the frontier point(s) [21]. A frontier point is the intersection of the two contour generators in space and is visible in both silhouettes. For two distinct views, there are in general two frontier points at which the two outer epipolar planes are tangent to the object surface. Hence, the projections of them in one view are two outer epipolar tangent points on the object silhouette. The outer epipolar tangent points in the two associated views are corresponding points. X1
A 3D rim curve
C1
C2
A straight line
X2
A 2D rim curve
Fig. 1. A 3D rim curve associated with frontier points for two views. The two cameras centers C1 and C2 define two epipolar planes tangent to the objects at two frontier points X1 and X2 whose images are outer epipolar tangent points. The plane containing C1, X1 and X2 cuts the object at a 3D planar rim curve. The curve segment visible to cameras C1 and C2 projects onto them as a straight line and a 2D rim curve respectively.
3.2 Definition of a 3D Rim Curve
A 3D rim curve on an object surface is defined with respect to two distinct views as follows. The two frontier points associated with the two views, together with one of the two camera centers, define a plane. This plane intersects the object surface in a 3D planar curve. The two frontier points are on this curve and cut it into segments. The segment closer to the chosen camera center is defined as the 3D rim curve. The projection of this rim curve in the view associated with the chosen camera is a straight line, and its image in the other view is generally a 2D curve. Fig. 1 illustrates the rim curve so defined.
3.3 Cross-Correlation
Cross-correlation is a general technique for finding correspondences between points in two images. The matching technique is based on the similarity between two image patches, and cross-correlation gives a measure of similarity. Let Ik(uk, vk) be the intensity of image k at pixel xk = (uk, vk), and let the correlation mask size be (2n+1) × (2m+1); when n = m, n is called the half mask size. The normalized correlation score between a pair of points x1 and x2 is given by:

r(x1, x2) = Σ_{i=−n..n} Σ_{j=−m..m} [I1(u1+i, v1+j) − Ī1(u1, v1)] · [I2(u2+i, v2+j) − Ī2(u2, v2)] / ( [(2n+1) × (2m+1)] · δ(I1) · δ(I2) )   (1)

where

Īk(uk, vk) = Σ_{i=−n..n} Σ_{j=−m..m} Ik(uk+i, vk+j) / [(2n+1) × (2m+1)]   (2)

is the average intensity about point xk, and

δ(Ik) = sqrt( Σ_{i=−n..n} Σ_{j=−m..m} [Ik(uk+i, vk+j) − Īk(uk, vk)]² / [(2n+1) × (2m+1)] )   (3)

is the standard deviation of intensity over the mask. The feature point x2 that maximizes the normalized correlation score is deemed to be the best match to x1, and it is usual to define a threshold T to remove dissimilar matches (i.e., the correlation score between corresponding points has to be larger than T; otherwise the pair is not considered a match).
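A direct, unoptimized sketch of the score of Eqs. (1)-(3) is given below; it assumes the correlation mask lies entirely inside both images and is not the authors' implementation.

```python
import numpy as np

def ncc(image1, image2, x1, x2, n, m):
    """Normalized cross-correlation score of Eq. (1) between pixel x1=(u1,v1) in
    image1 and x2=(u2,v2) in image2, with a (2n+1)x(2m+1) mask."""
    (u1, v1), (u2, v2) = x1, x2
    w1 = image1[v1 - m:v1 + m + 1, u1 - n:u1 + n + 1].astype(np.float64)
    w2 = image2[v2 - m:v2 + m + 1, u2 - n:u2 + n + 1].astype(np.float64)
    d1, d2 = w1 - w1.mean(), w2 - w2.mean()      # centered windows, Eq. (2)
    area = (2 * n + 1) * (2 * m + 1)
    sigma1 = np.sqrt(np.sum(d1 ** 2) / area)     # Eq. (3)
    sigma2 = np.sqrt(np.sum(d2 ** 2) / area)
    if sigma1 == 0 or sigma2 == 0:
        return 0.0                               # textureless window, no evidence
    return np.sum(d1 * d2) / (area * sigma1 * sigma2)   # Eq. (1)
```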
4 Fast Modeling by Rim Reconstruction Given a calibrated circular motion image sequence, silhouettes are first extracted reliably from the images. For two adjacent views, the images of the two frontier
points are estimated using the silhouettes. A 2D rim (the image of a 3D rim curve in one view) is then defined as a line joining the two projected frontier points in an image. The correspondences of this 2D rim in the second image are found using cross-correlation methods. A 3D rim curve is then reconstructed by the linear triangulation method [23]. Repeating this process for the image sequence on a sequential two-view basis, a set of rim curves can be reconstructed. Since the object undergoes circular motion, the rims are constructed in a known order.
4.1 Silhouette Extraction
In order to extract silhouettes, cubic B-spline snakes [24] are adopted to achieve sub-pixel accuracy. They provide a concise representation of silhouettes. In addition, using a B-spline representation allows explicit computation of the tangent at each point on the silhouette, which simplifies the subsequent identification of outer epipolar tangents (described in Section 4.2). We assume that each silhouette consists of one closed curve.
4.2 Estimation of Outer Epipolar Tangents
An outer epipolar tangent is an epipolar line tangent to the silhouette passing through the epipole. Given the B-spline and the known epipolar geometry between two views, the outer epipolar tangent points can be calculated as follows. Let Ai (i = 1, 2) be the set of sample points on the B-spline and ei the epipole in image i from camera center j (j = 1, 2, j ≠ i); then the two outer epipolar tangent points in image i are determined by

max_{pi ∈ Ai} ( (pi(y) − ei(y)) / (pi(x) − ei(x)) )  and  min_{pi ∈ Ai} ( (pi(y) − ei(y)) / (pi(x) − ei(x)) ).   (4)

Fig. 2 illustrates the two outer epipolar tangents determined in view one based on the epipolar geometry between view one and view two.
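Equation (4) amounts to taking the silhouette samples whose rays from the epipole have extremal slope; a minimal sketch, assuming the epipole lies well outside the silhouette so that no sample shares its x-coordinate, is shown below. Names are illustrative.

```python
import numpy as np

def outer_epipolar_tangent_points(samples, epipole):
    """Sketch of Eq. (4): `samples` is an (N, 2) array of points sampled on the
    B-spline silhouette of image i, `epipole` the epipole e_i.  Returns the two
    silhouette points whose rays from the epipole have extremal slope."""
    dx = samples[:, 0] - epipole[0]
    dy = samples[:, 1] - epipole[1]
    slope = dy / dx                      # assumes dx != 0 for all samples
    return samples[np.argmax(slope)], samples[np.argmin(slope)]
```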
Fig. 2. Estimation of outer epipolar tangent
4.3 Matching of 2D Rims
Given two views, the outer epipolar tangents in two views can be computed based on the estimated epipoles and the silhouettes as described in the last subsection. Let us name the two views as the source view and the target view respectively. In the source view, the rim is a straight line connecting the two epipolar tangent points. We need to
find the counterpart of this rim in the target view. First, points on the rim in the source view are sampled. As the silhouettes of all images are available, for each sample point sp in the rim in source view, we can determine the associated visual hull surface points corresponding to the ray back-projected from sp [8]. Suppose the pair of visual hull surface points on the ray associated with sp is V1 and V2. We can limit the search region for sp in the target view by making use of V1 and V2, i.e. the epipolar line of sp in the target view to be searched for matched point is bounded by the projections of V1 and V2. Furthermore, we can use a parameter λ to reduce the searchable depth range from V1 to (1-λ)V1+λV2 based on prior knowledge about the shape of the object. For example, taking a λ=0.5 assumes that the maximum concavity of the object at sp is not larger than half of the depth from V1 to V2. Thus the search region in the target view can be accordingly reduced. After the search region along the epipolar line is defined, each corresponding point of the rim in the target view is determined as the pixel within the delimited epipolar line which has the highest cross correlation score with the sampled point in the source view. An example of matched rims in two consecutive views is shown in Fig. 3, from which we see that the reduction of depth search range greatly increases the matching accuracy.
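The delimited epipolar search can be sketched as follows; it reuses the ncc helper from the sketch at the end of Section 3.3, assumes the caller has already projected V1 and the reduced bound (1−λ)V1+λV2 into the target view with the known projection matrix, and uses illustrative sample counts and thresholds rather than values from the paper.

```python
import numpy as np

def match_rim_point(src_img, tgt_img, sp, seg_start, seg_end,
                    n_samples=64, half_mask=10, threshold=0.6):
    """Sketch: find the match of the source rim point `sp` inside the delimited
    epipolar segment of the target view.  `seg_start` is the projection of the
    visual-hull point V1 and `seg_end` the projection of (1-lambda)V1 + lambda V2."""
    ts = np.linspace(0.0, 1.0, n_samples)[:, None]
    candidates = (1.0 - ts) * np.asarray(seg_start, float) \
                 + ts * np.asarray(seg_end, float)          # delimited segment
    sp_i = tuple(np.round(np.asarray(sp)).astype(int))
    scores = [ncc(src_img, tgt_img, sp_i, tuple(np.round(c).astype(int)),
                  half_mask, half_mask) for c in candidates]
    best = int(np.argmax(scores))
    # Reject dissimilar matches with a correlation threshold, as in Section 3.3.
    return candidates[best] if scores[best] >= threshold else None
```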
Fig. 3. Matching of a rim in two views. The rim in the source view (a) is a straight line (green). The search region in the target view (b) is bounded by the projections of visual hull surface points determined based on silhouette information. The rim in the target view (c) is found by searching matched points along the delimited scan lines (blue) joining the two bounding points.
4.4 Reconstruction and Insertion of 3D Rims
Once the 2D rim points are matched, the 3D rim structure can be easily computed using the optimal triangulation method. Repeating this process for every two successive views in the image sequence, a set of 3D rims on the object surface can be obtained. After this reconstruction process, the rims of the model are generated, as shown in Fig. 4(a). However, there may be big gaps between pairs of rims where the surface is concave, see Fig. 4(a). This is because frontier points never occur on the concave part of a surface, where the outer epipolar tangents cannot reach. To fill in a gap between two rims, new rims can be inserted. We first compute a 2D interpolated line between two rims in an image for which a new rim is required. Then, the correspondences in the target image are determined by the feature point matching process as described in section 4.2. As a result, a 3D interpolated rim can be generated on the concave surface. This insertion of new rims is carried out until no more rims need to be inserted. Fig. 4(b) shows the newly added rims (red) and the original rim sets (blue).
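A minimal sketch of the linear (DLT) triangulation used to lift matched 2D rim points to 3D [23] is given below, assuming 3×4 projection matrices and matched pixel lists; it is a standard formulation, not the authors' code.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """DLT triangulation: P1, P2 are 3x4 projection matrices, x1, x2 the matched
    pixel coordinates (u, v) in the two views."""
    A = np.vstack([x1[0] * P1[2] - P1[0],
                   x1[1] * P1[2] - P1[1],
                   x2[0] * P2[2] - P2[0],
                   x2[1] * P2[2] - P2[1]])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                       # null vector of A (homogeneous 3D point)
    return X[:3] / X[3]

def reconstruct_rim(P1, P2, rim_pts_1, rim_pts_2):
    # One 3D rim curve from the matched 2D rims of two successive views.
    return np.array([triangulate_point(P1, P2, a, b)
                     for a, b in zip(rim_pts_1, rim_pts_2)])
```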
Fig. 4. Insertion of new 3D rims on concave surface. In (a) there may be a gap between two rims; in (b) more rims (red color) are inserted to fill in gaps.
5 Surface Formation
5.1 Slicing the 3D Rims
The proposed surface extraction method is built on a slice based re-sampling method, and it produces fairly evenly distributed mesh grids. The rims are first re-sampled to give cross-sectional contours by parallel slicing planes. Each slicing plane contains the intersections of the rims with the plane. For a better visual effect, the normal of the slicing planes is chosen to be the direction parallel to the rotation axis of circular motion for imaging the object. Since the 3D rims are generated in the order in accordance with the circular motion image sequence, in general the sampled points on each slice can be re-grouped by linking these points according to the spatial order of the corresponding rims. However, if the frontier points (the top most and lowest points) in the rims are very close, the sample points on the slicing plane near those frontier points may not exactly be in the desired order, for example, points on the slicing plane near the top of the object. So we reorder the sample points on the slicing planes before forming the polygonal cross section. Nonetheless, due to the sequential reconstruction of the rims, only the points on the top two slicing planes are affected. In the matching process, through enforcing the silhouette constraint, the search region on the epipolar line for matched point is substantially reduced, giving a very high accuracy of matching and thus a good reconstruction of surface points. However, there may be still few points reconstructed to be inside the object surface because of wrong matching. This type of error cannot be detected by the silhouette constraint, but can be reduced by smoothing the polygon on the slicing plane and by spatial smoothing on the extracted surface. The result of slice-based model after smoothing is shown in Fig. 6(a). 5.2 Surface Triangulation
In order to allow the reconstructed 3D model to be displayed using conventional rendering algorithms (e.g. in VRML standard adopted here), a triangulated mesh is extracted from the polygons on each slicing plane using the surface triangulation method, and triangle patches are produced that best approximate the surface. Since we have the same number of sample points on each slicing plane and the sample points are on the organized rims, triangle patches can be formed by a simple
strategy: starting from the first sample point, say p_1^i on layer i, the closest point p_k^{i+1} on layer i+1 to p_1^i is located. The third vertex can then be either p_2^i or p_{k+1}^{i+1}, and is determined by considering two factors: the angles formed at p_1^i and at p_k^{i+1}, which must not exceed a threshold, and the sum of the distances to p_1^i and p_k^{i+1}. If both candidate points meet the angle constraint, or conversely both violate it, the one of p_2^i and p_{k+1}^{i+1} giving the shortest distance is chosen; otherwise, the one satisfying the angle constraint is selected.
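The layer-stitching strategy can be sketched as follows; the angle threshold, the exact angle definition and the termination rule are simplifying assumptions, not values taken from the paper.

```python
import numpy as np

def angle_at(apex, a, b):
    # Angle (degrees) at `apex` in the triangle (apex, a, b).
    u, v = a - apex, b - apex
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def stitch_layers(layer_i, layer_j, max_angle=120.0):
    """Sketch: walk around two adjacent slice polygons (arrays of 3D points in
    rim order) and emit triangles, choosing the third vertex by the angle and
    distance criteria described in the text."""
    tris, a = [], 0
    b = int(np.argmin(np.linalg.norm(layer_j - layer_i[0], axis=1)))
    n_i, n_j = len(layer_i), len(layer_j)
    for _ in range(n_i + n_j):
        cand_i, cand_j = layer_i[(a + 1) % n_i], layer_j[(b + 1) % n_j]
        ok_i = angle_at(layer_i[a], layer_j[b], cand_i) <= max_angle
        ok_j = angle_at(layer_j[b], layer_i[a], cand_j) <= max_angle
        d_i = np.linalg.norm(cand_i - layer_i[a]) + np.linalg.norm(cand_i - layer_j[b])
        d_j = np.linalg.norm(cand_j - layer_i[a]) + np.linalg.norm(cand_j - layer_j[b])
        take_i = ok_i if ok_i != ok_j else d_i <= d_j   # prefer valid, else closer
        if take_i:
            tris.append((layer_i[a], layer_j[b], cand_i)); a = (a + 1) % n_i
        else:
            tris.append((layer_i[a], layer_j[b], cand_j)); b = (b + 1) % n_j
    return tris
```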
6 Experimental Results Our approach has been tested on real data. Due to space limit, an experimental result of one circular motion image sequence is shown in this section. The experiment is performed over a turntable sequence of 36 images of a “Clay man” toy with moderately complex surface structure (see Fig. 3). All the cameras have been calibrated, which means that the projection matrices for cameras are already known. Since the performance of the proposed method depends on the accuracy of matching, we evaluate the proposed method by comparing our matching results with that generated by the graph-cut based stereo vision algorithm [25].
Fig. 5. Comparison of matching results between the graph-cut based method (GCM) and the proposed method. (a) 2D rim points in the source view (white dots); (b) Matched 2D rim points in the target view computed by GCM (green triangles) and the proposed method (white dots). (c) and (d) show a zoom-in region of (a) and (b) respectively.
We calculate the 2D rim curves on a successive view pair basis on gray scale images. A half window size of 10 pixels is used in calculating the normalized cross correlation and the parameter λ is chosen to be 0.5 in our method. For the graph-cut based method (GCM), the default parameter values are used except for the disparity range. The disparity search range is from -1 pixel to +40 pixels. As the images are large, the images are first reduced to a region containing the object with small margins before submitting to GCM. Despite so, the running time of GCM for matching one pair of size-reduced images was more than 40 minutes. In contrast, our
method took less than half a minute to match the 2D rim curves in the original images. Fig. 5 shows the matching results of two views by our method and GCM. It can be seen from Fig. 5 (b) and (d) that our method finds all correct matches but GCM does not. In the experiment, the percentage of correctly determined correspondences is about 99% by our method through visual judgment whereas that of GCM is less than 25%. Similar results are obtained for other view pairs, which demonstrate the effectiveness and accuracy of the proposed method.
Fig. 6. (a) The model after slicing the reconstructed rims using parallel planes, (b) the triangulated mesh model, (c) the textured model in VRML from a new viewpoint
After reconstructing all 3D rim curves, the surface of the object can be formed as discussed in section 5.2. The final result of our method is shown in Fig. 6 where Fig. 6 (b) presents the model in terms of triangle patches computed and Fig. 6 (c) shows the final result after texture mapping from the original image in VRML format.
7 Discussion This paper presents a new model reconstruction method from a circular motion image sequence. It combines the merits of both silhouette-based and point-based methods. By making use of the outer epipolar tangent points, a set of structured 3D rims can be rapidly reconstructed. The proposed rim structure greatly reduces the computation complexity of surface extraction and provides a flexible reconstruction. The flexibility lies in different levels of quality of the reconstruction, depending on the number of rims and slicing planes used to approximate the surface.
References 1. Sun, J., Zheng, N.N., Shum, H.Y.: Stereo Matching Using Belief Propagation. IEEE Transation on Pattern Analysis and Machine Intelligence 25(7), 787–800 (2003) 2. Sun, C.: Fast Stereo Matching Using Rectangular Subregioning and 3D Maximum-Surface Techniques. International Journal of Computer Vision 47(1/2/3), 99–117 (2002) 3. Koch, R., Pollefeys, M., Van Gool, L.: Multi viewpoint stereo from uncalibrated video sequences. In: Burkhardt, H.-J., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1406, pp. 55–71. Springer, Heidelberg (1998)
4. Pollefeys, M., Van Gool, L.: From Images to 3D Models, Ascona, pp. 403–410 (2001) 5. Tang, W.K.: A Factorization-Based Approach to 3D Reconstruction from Multiple Uncalibrated Images. In: Department of Electrical and Electronic Engineering, p. 245. The University of Hong Kong, Hong Kong (2004) 6. Baumgart, B.G.: Geometric Modelling for Computer Vision. Standford University (1974) 7. Laurentini, A.: The visual hull concept for silhouette-based image understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(2), 150–162 (1994) 8. Boyer, E., Franco, J.S.: A hybrid approach for computing visual hulls of complex objects. Computer Vision and Pattern Recognition, 695–701 (2003) 9. Brand, M., Kang, K., Cooper, B.: An algebraic solution to visual hull. Computer Vision and Pattern Recognition, 30–35 (2004) 10. Liang, C., Wong, K.Y.K.: Complex 3D Shape Recovery Using a Dual-Space Approach. In: IEEE International Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, pp. 878–884. IEEE Computer Society Press, Los Alamitos (2005) 11. Martin, W.N., Aggarwal, J.K.: Volumetric descriptions of objects from multiple views. IEEE Transactions on Pattern Analysis and Machine Intelligence 5(2), 150–158 (1983) 12. Chien, C.H., Aggarwal, J.K.: Volume/surface octrees for the representation of threedimensional objects. Computer Vision, Graphics and Image Processing 36(1), 100–113 (1986) 13. Hong, T.H., Shneier, M.O.: Describing a robot’s workspace using a sequence of views from a moving camera. IEEE Transactions on Pattern Analysis and Machine Intelligence 7(6), 721–726 (1985) 14. Szeliski, R.: Rapid octree construction from image sequences. Computer Vision, Graphics and Image Processing 58(1), 23–32 (1993) 15. Lorensen, W.E., Cline, H.E.: Marching cubes: a high resolution 3D surface construction algorithm. ACM Computer Graphics 21(4), 163–169 (1987) 16. Esteban, C.H., Schmitt, F.: Silhouette and Stereo fusion for 3D object modeling. Computer Vision and Image Understanding (96), 367–392 (2004) 17. Sinha, S., Pollefeys, M.: Multi-view reconstruction using Photo-consistency and Exact silhouette constraints: A Maximum-Flow Formulation. In: IEEE International Conference on Computer Vision, pp. 349–356. IEEE, Los Alamitos (2005) 18. Furukawa, Y., Ponce, J.: Carved Visual Hulls for Image-Based Modeling. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, Springer, Heidelberg (2006) 19. Kutulakos, K.N., Seitz, S.M.: A theory of shape by space carving. In: International Conference on Computer Vision, Kerkyra, Greece, pp. 307–314 (1999) 20. Isidoro, J., Sclaroff, S.: Stochastic Refinement of the Visual Hull to Satisfy Photometric and Silhouette Consistency Constraints. In: IEEE International Conference on Computer Vision, pp. 1335–1342. IEEE Computer Society Press, Los Alamitos (2003) 21. Cipolla, R., Giblin, P.J.: Visual Motion of Curves and Surfaces. Cambridge Univ. Press, Cambridge, U.K. (1999) 22. Wong, K.Y.K.: Structure and Motion from Silhouettes. In: Department of Engineering, p. 196. University of Cambridge, Cambridge (2001) 23. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000) 24. Cipolla, R., Blake, A.: The dynamic analysis of apparent contours. In: International Conference on Computer Vision, Osaka, Japan, pp. 616–623 (1990) 25. Kolmogorov, V., Zabih, R.: Multi-camera Scene Reconstruction via Graph Cuts. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. 
LNCS, vol. 2350, pp. 82–96. Springer, Heidelberg (2002) 26. Franco, J.S., Boyer, E.: Exact Polyhedral Visual Hulls. In: British Conference on Computer Vision, pp. 329–338 (2003)
Shape from Contour for the Digitization of Curved Documents Fr´ed´eric Courteille, Jean-Denis Durou, and Pierre Gurdjos IRIT, Toulouse, France {courteil,durou,gurdjos}@irit.fr Abstract. We are aiming at extending the basic digital camera functionalities to the ability to simulate the flattening of a document, by virtually acting like a flatbed scanner. Typically, the document is the warped page of an opened book. The problem is stated as a computer vision problem, whose resolution involves, in particular, a 3D reconstruction technique, namely shape from contour. Assuming that a photograph is taken by a camera in arbitrary position or orientation, and that the model of the document surface is a generalized cylinder, we show how the corrections of its geometric distortions, including perspective distortion, can be achieved from a single view of the document. The performances of the proposed technique are assessed and illustrated through experiments on real images.
1
Introduction
The digitization of documents currently knows an increasing popularity, because of the expansion of Internet browsing. The traditional process, which uses a flatbed scanner, is satisfactory for flat documents, but is unsuitable for curved documents like for example a thick book, since some defects will appear in the digitized image. Several specific systems have been designed, but such systems are sometimes intrusive with regard to the documents and, before all, they cannot be referred to as consumer equipments. An alternative consists in simulating the flattening of curved documents i.e., in correcting the defects of images provided by a flatbed scanner or a digital camera. In this paper, we describe a new method of simulation of document flattening which uses one image taken from an arbitrary angle of view, and not only in frontal view (as this is often the case in the literature). The obtained results are very encouraging. In Section 2, different techniques of simulation of document flattening are reviewed. In Section 3, a new 3D-reconstruction method based on the so-called shape-from-contour technique is discussed. In Section 4, this method is applied to the flattening simulation of curved documents. Finally, Section 5 concludes our study and states several perspectives.
2
Techniques of Simulation of Document Flattening
One can address a purely 2D-deformation of the image, in order to correct its defects according to an a priori modelling of the flattened document [1,2]. In [3,4,5],
the characters orientation is estimated, so as to straighten them out. In all these papers, the results are of poor quality because, if the lines of text are rather well uncurved, the narrowing of the characters near the binding is not well corrected. In [6], a judicious 2D-deformation is introduced, which considers that the contour of each page becomes rectangular after flattening: the results are nice, but a “paper checkerboard pattern” must be placed behind the document to force both them having the same 3D-shape, and this makes the process rather complicated. In order to successfully simulate the document flattening, it is necessary to compute its surface shape. Stereoscopy aims at reconstructing the shape of a document from several photographs taken from different angles of view. In [7], the CPU time is very high when dealing with two images of size 2048 × 1360. On the other hand, this technique works well only if the stereo ring has been intrinsically and extrinsically calibrated. Stereophotometry requires several photographs taken from the same angle of view, but under different lightings. A modelling linking the image greylevel to the orientation of the surface is then used. It has been implemented by Cho et al. [8]. The results are of mean quality since, for photographs taken at close range, perspective should be taken into account. Structuredlighting systems make also use of two photographs taken under two different lightings, knowing that, for one of the photographs, a pattern is projected onto the document [9,10]. The deformation of the pattern in this image gives some information on the surface shape. A second photograph is required, in order to avoid possible artefacts of the pattern in the flattening simulation. The best results using this technique are presented in [11], but they make use of a dedicated imaging system. Shape-from-texture has also been used. An a priori knowledge on the document assumes that the text is printed along parallel lines, which is the case for most documents. Hence, the shape of the document surface may be deduced from the deformation of the lines of text in the image. This technique, that has been implemented by Cao et al. [12] on photographs of cylindrical books taken in frontal view, works well and quickly (some seconds on an image of size 1200 × 1600). Its crucial step consists in extracting the lines of text. In [13], it is generalized to any angles of view. Nevertheless, the latter work assumes that the lines of text are also equally spaced. The oldest contribution to the simulation of document flattening uses the shape-from-shading technique. Wada et al. [14] take advantage of the greylevel gradation in the non-inked areas of a scanned image, in order to estimate the slope of the document surface. This idea has been resumed and improved by Tan et al. [15], whose results are of good quality, and also by Courteille et al. [16]. The latter paper provides two noticeable improvements: a digital camera replaces the flatbed scanner, so as to accelerate the digitization process; a new modelling of shape-from-shading is stated, that takes perspective into account. Finally, the shape-from-contour technique may be used i.e., the deformation of the contours in the image provides information on the surface shape. This technique has been implemented in [17,6,18] on photographs of cylindrical books taken in frontal view. In [19], it is generalized to any applicable surfaces: the results are of mean quality, but this
last contribution is worth of mention, since it reformulates the problem elegantly, as a set of differential equations. The method of simulation of document flattening that we discuss in this paper uses the shape-from-contour technique to compute the shape of the document from one photograph taken from an arbitrary angle of view.
3
3D-Reconstruction Using Shape-from-Contour
In the most general situation, shape-from-contour (SFC) is an ill-posed problem: the same contours may characterize a lot of different scenes. To make the problem well-posed, it is necessary to make some assumptions. In [20], the scene is supposed to be cylindrically-symmetrical. In the present work, we suppose that its surface is a generalized cylinder having straight meridians.
3.1 Notations and Choice of the Coordinate Systems
The photographic bench is represented in Fig. 1: f refers to the focal length and C to the optical center; the axis Cz coincides with the optical axis, so that the equation of the image plane Πi is z = f. The digital camera is supposed to lie in an arbitrary position with regard to the book, apart from the fact that the optical axis must be non-parallel to the binding. Hence, the vanishing point F of the binding direction can be located at infinity, but it is separate from the principal point O. Thus, we can define the axis Ox by the straight line FO, in such a way that FO > 0. Furthermore, we complete Ox by an axis Oy such that Oxy is an orthonormal coordinate system of the image plane Πi (cf. Fig. 1). It is convenient to define two 3D orthonormal coordinate systems: Ro = Cxyz and Rp = Cuvw, where Cu coincides with the straight line FC, which is parallel to the binding, and where Cv coincides with Cy. Since Cu intersects Πi at F ≠ O, it follows that Cw intersects Πi at a point Ω which also lies on the axis Ox. We introduce the orthonormal coordinate system Ri = Ωxy of Πi. The angle between Cz and Cw is denoted α (cf. Fig. 1). The case α = 0 corresponds to the frontal orientation of the camera. Denoting c = cos α, s = sin α and t = tan α, it can easily be stated that OΩ = t f, FO = f/t and FΩ = f/(c s). Finally, we denote Πr the plane orthogonal to Cw and containing the binding, whose equation is w = δ.
3.2 Relations Between Object and Image
Let P be an object point, whose coordinates are (X, Y, Z) w.r.t. Ro and (U, V, W) w.r.t. Rp. The transformation rules between these two sets of coordinates are:

X = c U + s W,   (1a)
Y = V,   (1b)
Z = −s U + c W.   (1c)
Fig. 1. Representation of the photographic bench
Let Q be the image point conjugated to P, whose coordinates are (u, v) w.r.t. Ri. Using the perspective projection rules, we obtain:

u = (f/Z) X − t f,   (2a)
v = (f/Z) Y.   (2b)

Denoting f′ = f/c, the equations (1a), (1b), (1c), (2a) and (2b) give:

u = f′ U / (−s U + c W),   (3a)
v = f′ V / (−s U + c W).   (3b)

Let us define the “pseudo-image” Q′ of P as the image of the orthogonal projection P′ of P on Πr. As the coordinates of P′ are (U, V, δ) w.r.t. Rp, the coordinates (u′, v′) of Q′ w.r.t. Ri are:

u′ = f′ U / (−s U + c δ),   (4a)
v′ = f′ V / (−s U + c δ).   (4b)

Dividing (4a) by (3a) and (4b) by (3b), we find:

u′/u = v′/v.   (5)
This equality means that the image points Q, Q′ and Ω are aligned, i.e., Ω is the vanishing point of the direction orthogonal to Πr. In a general way, the knowledge of an image point Q does not allow us to compute its conjugated object point P. But, if we also know the pseudo-image Q′ associated to P, then we can compute the coordinates of P. Actually, it can be deduced from (3a), (3b) and (4a):

U = δ c u′ / (f′ + s u′),   (6a)
V = δ c v u′ / (u (f′ + s u′)),   (6b)
W = δ u′ (f′ + s u) / (u (f′ + s u′)).   (6c)

Note that (6c) gives W = δ when u′ = u, i.e., for image points such that Q′ = Q. For a given image point Q, if the location of the associated pseudo-image Q′ is known, the coordinates (U, V, W) of the conjugated object point P can be computed using (6a), (6b) and (6c). Nevertheless, in a general way, the location of the pseudo-image Q′ on the straight line ΩQ is unknown.
3.3 Additional Assumptions
Within the framework of our application, the scene is a book. We make two additional assumptions:
• A1 - The flattened pages are rectangular.
• A2 - The pages of the book are curved in such a way that they form a generalized cylinder.
As the surface of the book is a generalized cylinder, the lower and upper contours are located in two planes which are orthogonal to the binding. Thanks to this property, the SFC problem becomes well-posed. As the binding belongs to the plane Πr, its image and its pseudo-image coincide. Under the assumptions A1 and A2, it is easy to predict that the pseudo-images of the upper and lower contours of the book (whose images are called Cu and Cl) are the straight lines Lu and Ll, parallel to Ωy and passing through the ends Bu and Bl of the image B of the binding (cf. Fig. 2). Let Q be an image point of coordinates (u, v) w.r.t. Ri = Ωxy. We call θ the polar angle in the coordinate system Fxy, and Qu and Ql the two image points located on Cu and Cl which have the same polar angle θ as Q (cf. Fig. 2). Considering the assumptions A1 and A2, and knowing that F is the vanishing point of the binding direction, the object point P conjugated to Q has the same coordinate W as the two object points conjugated to Qu and Ql. We denote uu(θ) and ul(θ) the abscissas of Qu and Ql w.r.t. Ri. Finally, we denote θB the polar angle of B. According to these notations, uu(θB) and ul(θB) are the abscissas of
Fig. 2. Geometric construction of the pseudo-image Q′ associated to an image point Q
the pseudo-images Q′u and Q′l associated to Qu and Ql, w.r.t. Ri. Hence, if we denote f″ = f′/s, the equation (6c) gives, when applied to Qu and to Ql:

W = δ (uu(θB) / uu(θ)) · (f″ + uu(θ)) / (f″ + uu(θB)),   (7a)
W = δ (ul(θB) / ul(θ)) · (f″ + ul(θ)) / (f″ + ul(θB)).   (7b)
From either of these expressions of W, we can deduce the other coordinates U and V of P by solving the system of the two equations (3a) and (3b). Considering the expressions (7a) and (7b) of W and the equations (3a) and (3b), it appears that the computation of the shape of the document requires the knowledge of some parameters: the focal length f, the viewing angle α and the location of the principal point O. On the other hand, the parameter δ can be chosen arbitrarily, because the shape of the document can be computed only up to a scale factor.
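Putting Eqs. (7a), (3a) and (3b) together, the per-point reconstruction can be sketched as below; the function and argument names are illustrative, and the sketch uses only the upper contour (Eq. 7a).

```python
import numpy as np

def reconstruct_point(u, v, u_u_theta, u_u_thetaB, f, alpha, delta=1.0):
    """Sketch: given an image point (u, v) in R_i, the abscissas u_u(theta) and
    u_u(theta_B) of the upper contour, the focal length f and the viewing angle
    alpha, return the coordinates (U, V, W) w.r.t. R_p (delta is the free scale)."""
    c, s = np.cos(alpha), np.sin(alpha)
    f1, f2 = f / c, f / (c * s)          # f' = f/c and f'' = f'/s
    # Eq. (7a): depth W shared by all points with the same polar angle theta.
    W = delta * (u_u_thetaB / u_u_theta) * (f2 + u_u_theta) / (f2 + u_u_thetaB)
    # Invert Eqs. (3a)-(3b) for U and V once W is known.
    U = c * W * u / (f1 + s * u)
    V = v * (-s * U + c * W) / f1
    return U, V, W
```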
4
Application to the Simulation of Document Flattening
Due to lack of space, we do not show any result on synthetic images, but only on real images. The left column of Fig. 3 shows three photographs of the same
Fig. 3. Three photographs of the same book taken from three different angles of view, and the flattening simulations obtained from each of them: (a-b) α = 1.5◦ ; (c-d) α = 20.38◦ ; (e-f) α = 40.54◦
book taken from three different angles of view. For each of them, the shape of the document is computed using the method described in Section 3. The flattening simulations are shown on the right column of Fig. 3, knowing that a generalized cylinder is particularly easy to flatten. The same six images are zoomed on the area near the binding which contains a picture representing a samurai (cf. Fig. 4). It appears that when the angle of view increases, the quality of the flattening
simulation decreases but, even for a strong angle of view (cf. Fig. 3-e), the result remains acceptable (cf. Fig. 3-f). A second result proves that our method works well even with the most general case of camera pose i.e., when the optical axis is also tilted in the direction perpendicular to the axis of the cylinder (cf. Fig. 5).
5
Conclusion and Perspectives
In this paper, we generalize the 3D-shape reconstruction of a document from its contours, as it had previously been stated in frontal view in [17,6,18], to the case of an arbitrary view. We validate this result by simulating the flattening of curved documents taken from different angles of view. Even when the angle of view noticeably increases, the quality of the result remains rather good, in
Fig. 5. Most general case of camera pose: (a) original and (b) flattened images
comparison with other results in the literature that are obtained under similar conditions. In the present state of our knowledge, the focal length and the location of the principal point have to be known. As a first perspective, we aim at generalizing the 3D-shape reconstruction to the case of an uncalibrated camera. In addition, when the angle of view is too large, then focusing blur occurs, which inevitably restricts the quality of the flattening simulation. Rather than enduring this defect, it could be interesting to correct it, knowing that the 3D-shape of the document could allow us to predict the focusing blur magnitude.
References 1. Tang, Y.Y., Suen, C.Y.: Image Transformation Approach to Nonlinear Shape Restoration. IEEE Trans. Syst. Man and Cybern. 23(1), 155–172 (1993) 2. Weng, Y., Zhu, Q.: Nonlinear Shape Restoration for Document Images. In: Proc. IEEE Conf. Comp. Vis. and Patt. Recog., San Francisco, California, USA, pp. 568–573. IEEE Computer Society Press, Los Alamitos (1996) 3. Zhang, Z., Tan, C.L.: Recovery of Distorted Document Images from Bound Volumes. In: Proc. 6th Int. Conf. Doc. Anal. and Recog., Seattle, Washington, USA, pp. 429–433 (2001) 4. Lavialle, O., Molines, X., Angella, F., Baylou, P.: Active Contours Network to Straighten Distorted Text Lines. In: Proc. IEEE Int. Conf. Im. Proc., Thessaloniki, Greece, vol. III, pp. 748–751. IEEE Computer Society Press, Los Alamitos (2001) 5. Wu, C.H., Agam, G.: Document Image De-Warping for Text/Graphics Recognition. In: Caelli, T.M., Amin, A., Duin, R.P.W., Kamel, M.S., de Ridder, D. (eds.) SPR 2002 and SSPR 2002. LNCS, vol. 2396, pp. 348–357. Springer, Heidelberg (2002) 6. Tsoi, Y.C., Brown, M.S.: Geometric and Shading Correction for Images of Printed Materials: A Unified Approach Using Boundary. In: Proc IEEE Conf. Comp. Vis. and Patt. Recog., vol. I, pp. 240–246. IEEE Computer Society Press, Washington, D.C., USA (2004)
7. Yamashita, A., Kawarago, A., Kaneko, T., Miura, K.T.: Shape Reconstr. and Image Restoration for Non-Flat Surfaces of Documents with a Stereo Vision System. In: Proc. 17th Int. Conf. Patt. Recog., Cambridge, UK, vol. I, pp. 482–485 (2004) 8. Cho, S.I., Saito, H., Ozawa, S.: A Divide-and-conquer Strategy in Shape from Shading Problem. In: Proc.IEEE Conf. Comp. Vis. and Patt. Recog., San Juan, Puerto Rico, pp. 413–419. IEEE Computer Society Press, Los Alamitos (1997) 9. Doncescu, A., Bouju, A., Quillet, V.: Former books digital processing: image warping. In: Proc. IEEE Worksh. Doc. Im. Anal., San Juan, Puerto Rico, pp. 5–9. IEEE Computer Society Press, Los Alamitos (1997) 10. Brown, M.S., Seales, W.B.: Document Restoration Using 3D Shape: A General Deskewing Algorithm for Arbitrarily Warped Documents. In: Proc. 8th IEEE Int. Conf. Comp. Vis., Vancouver, Canada, vol. I, pp. 367–375. IEEE Computer Society Press, Los Alamitos (2001) 11. Sun, M., Yang, R., Yun, L., Landon, G., Seales, W.B., Brown, M.S.: Geometric and Photometric Restoration of Distorted Documents. In: Proc. 10th IEEE Int. Conf. Comp. Vis., Beijing, China, vol. II, pp. 1117–1123. IEEE Computer Society Press, Los Alamitos (2005) 12. Cao, H., Ding, X., Liu, C.: Rectifying the Bound Document Image Captured by the Camera: A model Based Approach. In: Proc. 7th Int. Conf. Doc. Anal. and Recog., Edinburgh, UK, pp. 71–75 (2003) 13. Liang, J., DeMenthon, D., Doermann, D.: Flattening Curved Documents in Images. In: Proc. IEEE Conf. Comp. Vis. and Patt. Recog., San Diego, California, USA, vol. II, pp. 338–345. IEEE Computer Society Press, Los Alamitos (2005) 14. Wada, T., Ukida, H., Matsuyama, T.: Shape from Shading with Interreflections Under a Proximal Light Source: Distortion-Free Copying of an Unfolded Book. Int. J. Comp. Vis. 24(2), 125–135 (1997) 15. Tan, C.L., Zhang, L., Zhang, Z., Xia, T.: Restoring Warped Document Images through 3D Shape Modeling. IEEE Trans. PAMI 28(2), 195–208 (2006) 16. Courteille, F., Crouzil, A., Durou, J.D., Gurdjos, P.: Shape from Shading for the Digitization of Curved Documents. Mach. Vis. and Appl. (to appear) 17. Kashimura, M., Nakajima, T., Onda, N., Saito, H., Ozawa, S.: Practical Introduction of Image Processing Technology to Digital Archiving of Rare Books. In: Proc. Int. Conf. Sign. Proc. Appl. and Techn., Toronto, Canada, pp. 1025–1029 (1998) 18. Courteille, F., Durou, J.D., Gurdjos, P.: Transform your Digital Camera into a Flatbed Scanner. In: Proc. 9th Eur. Conf. Comp. Vis., 2nd Works. Appl. Comp. Vis., Graz, Austria, pp. 40–48 (2006) 19. Gumerov, N., Zandifar, A., Duraiswami, R., Davis, L.S.: Structure of Applicable Surfaces from Single Views. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3022, pp. 482–496. Springer, Heidelberg (2004) 20. Colombo, C., Del Bimbo, A., Pernici, F.: Metric 3D Reconstruction and Texture Acquisition of Surfaces of Revolution from a Single Uncalibrated View. IEEE Trans. PAMI 27(1), 99–114 (2005)
Improved Space Carving Method for Merging and Interpolating Multiple Range Images Using Information of Light Sources of Active Stereo Ryo Furukawa1, Tomoya Itano1 , Akihiko Morisaka1 , and Hiroshi Kawasaki2 1
Faculty of Information Sciences, Hiroshima City University, 3-4-1, Ozuka-higashi, Asaminami-ku, Hiroshima, Japan
[email protected], {t itano,a morisaka}@toc.cs.hiroshima-cu.ac.jp 2 Faculty of Engineering, Saitama University, 255, Shimo-okubo, Sakura-ku, Saitama, Japan
[email protected]
Abstract. To merge multiple range data obtained by range scanners while filling holes caused by unmeasured regions, the space carving method is simple and effective. However, this method often fails if the number of input range images is small, because unseen voxels that are not carved out remain in the volume. In this paper, we propose an improved space carving algorithm that produces stable results. In the proposed method, a discriminant function defined on the volume space is used to estimate whether each voxel is inside or outside the objects. Also, in the particular case that the range images are obtained by an active stereo method, the information about the positions of the light sources can be used to improve the accuracy of the results.
1 Introduction
Shape data obtained by 3D measurement systems are often represented as range images. In order to obtain an entire shape model from multiple scanning results, it is necessary to merge these range images. For this task, several types of approaches have been proposed. Among them, methods using signed distance fields have been widely researched because they are capable of processing a large volume of input data efficiently. Signed distance fields have also been used as a shape representation in order to interpolate holes appearing in unmeasured parts of object surfaces. Curless and Levoy filled holes of a measured shape by classifying each voxel in volume space as either Unseen (unseen regions), NearSurface (near the surfaces), or Empty (outside the object), and generating a mesh between Unseen and Empty regions [1]. This process is known as the space carving (SC) method. The SC method is capable of effectively interpolating missing parts when there is plenty of observed data. However, when only a few range images are captured, the SC method often fails. This problem occurs because the target object and the ”remains of carving” in volume space become connected. One approach to solving this problem is to classify the Unseen voxels as either inside or outside of the object. In this paper, an object merging and interpolation method is proposed based on this approach. Since Unseen voxels include both
unobserved voxels inside an object (due to occlusion or low reflection) and voxels outside the object, it is necessary to discriminate between these cases. To classify voxels, we take the following two approaches: (1) defining a discriminant function for classifying Unseen voxels, and (2) using the positions of the light sources if the range images are obtained using active stereo methods. With the proposed method, the large “remains of carving” that often occur in the SC method are not generated. Also, since all voxels are classified as inside, outside or near the surface, closed shapes are always obtained. A further property of the proposed algorithm is that it can be implemented on the GPU: many methods have recently been proposed for utilizing the computational performance of GPUs for general calculations besides graphics, and our algorithm can also be executed efficiently on the GPU.
2 Related Work To merge multiple range data into a single shape, several types of approaches have been proposed, such as, for example, generating a mesh from unorganized points [2], using deformable model represented as a level-set [3], stitching meshes at the overlapped surfaces (zippering)[4], and methods using signed distance fields [1,5,6]. In particular, methods using signed distance fields have been widely researched since they are capable of processing a large volume of input data. In order to express the distance from a voxel to the object’s surface, the Euclidean distance (the shortest distance between the center of the voxel and the object’s surface) is theoretically preferable[5,6]. However, since computational cost of calculating the Euclidean distance is large, simplified methods such as line-of-sight distance (the distance between the voxel center and the object’s surface measured along the line of sight from the viewpoint of the measurement) are sometimes used [1] instead. Regarding filling holes of the surface of the shape defined as isosurfaces of signed distance fields, a number of methods have already been proposed. Davis et al. [7] presented a method in which signed distance field volume data is diffused using a smoothing filter. As shown in the experiments later, this method sometimes propagates the isosurface in incorrect directions and yields undesirable results (Figure 4(b)). According to the SC method proposed by Curless et al. [1], all voxels are first initialized as Unseen. Then, all voxels between the viewpoints and the object surfaces are set to Empty (this method carves the voxels in front of the observed surfaces, in contrast to the SC method used for 3D reconstruction from 2D silhouettes in which the voxels in the surrounding unobserved regions are carved). This method only carves the volume space in front of the observed surfaces, so, in practice, the unobserved voxels outside of the object remain as Unseen, and excess meshes are generated on the borders between the Unseen and the Empty regions. When range images from a sufficient number of observation points are merged, this kind of excess meshes are not connected to the target object mesh, and can be simply pruned away. However, when the number of observation points is small, or when the object is not observed from certain directions, the excess meshes often become connected to the object (Figure 4(a)). Sagawa et al. succeeded in interpolating scenes with complex objects by taking the consensus between each voxel in a signed distance field with its surrounding voxels
Fig. 1. (Left) Shape interpolation using space carving, and (right) classes of voxels used in the proposed method
[8]. Masuda proposed a method for filling unmeasured regions by fitting quadrics to the gaps in the signed distance field [9]. They use a Euclidean distance for their calculation to achieve high quality interpolation at the cost of high computational complexity.
3 Shape Merging and Interpolation Using Class Estimation of Unseen Voxels 3.1 Outline A signed distance field is a scalar field defined for each voxel in 3D space such that the absolute value of the scalar is the distance between the voxel and the object surface, and the sign of the scalar indicates whether the voxel is inside or outside the object (in this paper, voxels inside the object are negative, and those outside are positive). Describing the signed distance field as D(x), the object's surface is the isosurface satisfying D(x) = 0. To express the distance from a voxel to the object's surface, we adopt the line-of-sight distance (the distance from the voxel center to the object's surface measured along the line of sight from the viewpoint of the measurement) instead of the Euclidean distance, since its computational cost is relatively small, although this choice introduces some accuracy problems for the hole-filling process. The signed distance D(v) for a voxel v is calculated by accumulating the signed distances from each of the range images, d_1(v), d_2(v), ..., d_n(v), each multiplied by a weight w_i(v):

D(v) = Σ_i d_i(v) w_i(v).   (1)
The weights represent the accuracy of each distance value, and are often decided according to the angles between the directions of the line-of-sight from the camera and the directions of the normal vectors of the surface. In the constructed signed distance field D(x), the isosurface satisfying D(x) = 0 is the merged shape. In order to generate the mesh model of the merged result, the marching cubes method [10] is used.
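The following is a minimal NumPy sketch of the accumulation in Equation (1); the array names are ours, and voxels with no valid measurement in a range image are assumed to carry zero weight.

```python
import numpy as np

def merge_signed_distances(per_image_d, per_image_w):
    """Accumulate per-range-image signed distances into D(v) = sum_i d_i(v) w_i(v).

    per_image_d, per_image_w: lists of 3D arrays (one per range image) holding the
    line-of-sight signed distance d_i(v) and its weight w_i(v) for every voxel v.
    """
    D = np.zeros_like(per_image_d[0], dtype=np.float64)
    for d_i, w_i in zip(per_image_d, per_image_w):
        D += d_i * w_i          # Equation (1): weighted accumulation
    return D
```

The zero isosurface D(x) = 0 could then be extracted with a marching-cubes implementation such as skimage.measure.marching_cubes(D, level=0.0).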
The SC method of Curless et al. [1] divides all the voxels in volume space into one of three types: Unseen (not observed), Empty (outside the object) and NearSurface (near the surface of the object). Shape interpolation is then performed by generating a mesh over the border between Unseen and Empty voxels (Figure 1(left)). A major problem with this method is that voxels in all three of the following cases are classified as Unseen: (1) voxels outside the object that do not lie on any line of sight to an observed region in the range images, (2) voxels outside the object that lie BEHIND a surface of a range image (occluded voxels outside the object), and (3) voxels inside the object. In the proposed method, each Unseen voxel is classified as either outside or inside the object using a discriminant function to solve this problem. We now describe the classification of voxels in the proposed method. For a given voxel, the information obtained from a range image takes one of the following four types (Figure 1(right)).
– NearSurface (near the surface): The absolute value of the signed distance is below the assigned threshold. The voxel in question is classified as “near the surface”, and the signed distance is retained (case of v1 in Figure 1(right)).
– Outside (outside the object): The absolute value of the signed distance is larger than the threshold and the sign is positive. The voxel lies between the object and the camera, so it can be classified as “outside the object” (v2).
– Occluded (occluded region): The absolute value of the signed distance is larger than the threshold and the sign is negative. It cannot be asserted unconditionally whether the voxel is inside or outside the object, so its classification is temporarily set to Occluded; whether it is inside or outside is estimated afterwards (v3).
– NoData (deficient data): The signed distance value cannot be obtained due to missing data in the range image, so it cannot be judged whether the voxel is inside or outside. Its classification is temporarily set to NoData, and whether it is inside or outside is estimated afterwards (v5).
The case of v4 in Figure 1(right), where the voxel lies outside the angle of view of the range image, may be handled as either Outside or NoData depending on the application. For many applications, voxels outside of the view can be treated as Outside, but in cases such as zooming in on or measuring a large object, they should be treated as NoData. When merging multiple range images, the information obtained from each range image for a voxel may differ. In such cases, priority is given to NearSurface over the other classes, and second priority is given to Outside. When the information from all range images is only Occluded or NoData, whether the voxel is inside or outside the object is estimated according to the discriminant function defined in Section 3.2, and the voxel is tagged as Inside or Outside accordingly. By this process, all voxels are classified into three types: Inside (inside the object), NearSurface (near the surface), and Outside (outside the object). The usual signed distances are assigned to the NearSurface voxels, and fixed negative and positive values are assigned to Inside and Outside voxels, respectively. By generating the isosurface of the constructed signed distance field, a merged polygon mesh can be obtained.
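As an illustration of the per-image classification and the priority merge just described, the following hedged Python sketch encodes the four cases; the names and the treatment of out-of-view voxels as a parameter are our assumptions.

```python
from enum import IntEnum

class VoxelClass(IntEnum):
    # Smaller value = higher priority when merging information from several images.
    NEAR_SURFACE = 0
    OUTSIDE = 1
    OCCLUDED = 2
    NO_DATA = 3

def classify_from_one_image(signed_distance, threshold, in_view=True,
                            outside_view_as_nodata=False):
    """Classify one voxel from one range image following the four cases above.

    signed_distance is the line-of-sight signed distance d_i(v), or None when the
    range image has no data along the corresponding line of sight (NoData).
    """
    if not in_view:
        return VoxelClass.NO_DATA if outside_view_as_nodata else VoxelClass.OUTSIDE  # case v4
    if signed_distance is None:
        return VoxelClass.NO_DATA                                                   # case v5
    if abs(signed_distance) <= threshold:
        return VoxelClass.NEAR_SURFACE        # case v1: keep the signed distance
    return VoxelClass.OUTSIDE if signed_distance > 0 else VoxelClass.OCCLUDED       # v2 / v3

def merge_classes(classes):
    """Priority merge over all range images: NearSurface, then Outside.

    If the result is OCCLUDED or NO_DATA, the discriminant function of Section 3.2
    decides Inside versus Outside.
    """
    return min(classes)
```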
3.2 Voxel Class Estimation If the information about a voxel obtained from all the range images is only Occluded or NoData, whether the voxel is inside or outside the object must be estimated. For this purpose, discriminant functions based on Bayes estimation are defined. We consider scalar values that are positive outside the object and negative inside the object, and we estimate the probability distributions of these values by Bayes estimation. The subjective probability distribution when there is no data is a uniform distribution. Based on the posterior probabilities calculated from each range image, the subjective probability distributions are updated according to Bayes' theory. Using normal distributions N(μ′, σ′) for the posterior distributions, the heuristic for assigning the voxel to Inside or Outside is represented by the mean value μ′, and the degree of confidence in the heuristic is represented by the standard deviation σ′ (the higher σ′, the lower the confidence). When the voxel in question is Occluded in a given range image, the voxel is behind the surface as seen from the viewpoint. When the absolute value of the distance from the surface to the voxel (denoted Dist) is relatively small, the confidence that the voxel is inside the object is high; when it is far from the surface (Dist is large), that confidence is low. In this case, the degree of confidence that the voxel is inside is 1/Dist, so the corresponding posterior distribution is N(−1, Dist). When the voxel in question is NoData in a given range image, it may be either an outside voxel or an unobserved voxel inside the object. In practice, NoData pixels in range images often indicate outside regions. From this heuristic, a constant value is assigned to the degree of confidence that the voxel is outside, so the posterior distribution corresponding to NoData is represented as N(1, Const), where Const is a user-defined parameter. According to experiments, reasonable results are obtained by setting Const to a value near the smallest thickness of the object. In Bayes estimation with normal distributions, the prior distribution N(μ, σ) is updated using the posterior distribution N(μ′, σ′) as follows:

μ ← (σ′μ + σμ′) / (σ + σ′),   1/σ ← 1/σ + 1/σ′.   (2)

After performing this update for all range images, a voxel is classified as Outside if the sign of the mean value μ of the final probability distribution is positive, and Inside if it is negative. By defining the discriminant function C(v) for voxel v as

c_i(v) = −1/|d_i(v)|   if Occ(i, v),
c_i(v) = 1/Const       if Nod(i, v),   (3)

C(v) = Σ_i c_i(v),   (4)

the above judgment is equivalent to determining the inside or outside of the object according to the sign of C(v), where Occ(i, v) and Nod(i, v) mean that the information for voxel v obtained from range image i is Occluded or NoData, respectively, and d_i(v) is the signed line-of-sight distance between voxel v and the surface of range image i.
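A minimal sketch of Equations (3)-(4) follows; the representation of the per-image information as (status, distance) pairs is an assumption made for illustration.

```python
def discriminant(voxel_infos, const):
    """Discriminant C(v) = sum_i c_i(v) of Equations (3)-(4).

    voxel_infos: list of (status, d) pairs, one per range image, where status is
    'occluded' or 'nodata' and d is the signed line-of-sight distance d_i(v)
    (only used for occluded entries).  C(v) > 0 classifies v as Outside,
    C(v) < 0 as Inside.
    """
    C = 0.0
    for status, d in voxel_infos:
        if status == 'occluded':
            C += -1.0 / abs(d)      # confidence ~ 1/Dist that v is inside: N(-1, Dist)
        elif status == 'nodata':
            C += 1.0 / const        # constant confidence that v is outside: N(1, Const)
    return C
```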
3.3 Utilizing the Position of Light Sources in Active Stereo Methods Active stereo methods, in which light sources are used as well as the camera, are common techniques for obtaining dense 3D shapes. In these methods, the occlusion of light sources causes missing data, but when data is measured at a point, it proves that the voxels between that point and the light source are outside the object. In this research, we utilize the position of the light sources in an active stereo system by using additional range images from virtual cameras located at the light sources. For each measurement, we consider using two range images, one from the camera and one from the light source position (referred to below as the camera range image and the light source range image). The light source range image can be generated by projecting the 3D positions of the pixels of the camera range image onto the virtual camera at the light source position. In the case shown in Figure 2(a), data is missing in the camera range image since the light from the light source is occluded; therefore, at the voxel shown in Figure 2(b), the range data from the camera range image is missing. However, by referring to the light source range image, the line-of-sight distance from the light source position to the voxel can be obtained. There are two advantages in using this information. First, by using light source range images, the number of voxels that can be classified as NearSurface or Outside increases. In addition, even if a voxel is not assigned to these classes, we can define an improved discriminant function that utilizes more information than the one described in Section 3.2. The voxel types are largely the same as when using the camera range images alone, but the order of priority for the voxel classifications is set to NearSurface in the camera range image, NearSurface in the light source range image, Outside in the camera range image, followed by Outside in the light source range image. The discriminant function C(v) for voxel classification that utilizes the positions of the light sources is as follows:

c_i(v) = −1/|d_i^c(v)| − 1/|d_i^l(v)|   if Occ^c(i, v) ∧ Occ^l(i, v),
c_i(v) = −1/|d_i^c(v)|                  if Occ^c(i, v) ∧ Nod^l(i, v),
c_i(v) = −1/|d_i^l(v)|                  if Nod^c(i, v) ∧ Occ^l(i, v),
c_i(v) = 1/Const                        if Nod^c(i, v) ∧ Nod^l(i, v),   (5)

C(v) = Σ_i c_i(v).   (6)

The superscripts c and l of Occ, Nod and d_i(v) refer to the camera range images and the light source range images, respectively.
4 Implementation Using a GPU The merging and interpolation algorithms in this paper can be executed efficiently by utilizing a GPU. Since GPUs can only output 2D images, the volume of the signed distance field is calculated slice by slice. The signed distance value D(v) and discriminant function value C(v) are calculated by rendering each slice in a frame buffer. Rendering is performed by multi-pass rendering using programmable shaders. Each pass performs the rendering for one range image, and the results are added using blending functions. When performing the rendering for the ith camera range image
Fig. 2. Using light sources of active stereo methods: (a) missing data caused by occlusion, and (b) using light source range images
(range image i), the camera range image and the corresponding light source range image are treated as floating point textures. Then, by issuing an instruction to draw one rectangle covering the whole frame buffer, the pixel-shader process for each pixel is executed. The pixel shaders calculate d_i(v)w_i(v) and c_i(v) using the range image and the voxel positions as input values, and add them to the frame buffer. This process is repeated once per measured shape, while changing the textures of the camera range images and the light source range images. Finally, the frame buffer holding the weighted sum of signed distances D(v) and the values of C(v) is read back to the CPU. The flags for the voxel classes are checked, the signed distance and the discriminant function are combined, and a slice of the signed distance field is calculated. Since only small parts of the entire volume space are related to the isosurface, the processing cost can be reduced by first computing the signed distance field at a coarse resolution, and performing the same computation at high resolution only for the voxels determined to be near the isosurface at the coarse resolution. In the experiments described in Section 5, the implementation of the proposed method uses this coarse-to-fine method to reduce the computational cost and time.
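The coarse-to-fine strategy can be illustrated with the following CPU-side Python sketch; the callable eval_block is a hypothetical stand-in for the GPU rendering pass described above, and the assumption that the volume spans the unit cube is ours.

```python
import numpy as np

def coarse_to_fine_field(eval_block, coarse_shape, fine_factor, near_threshold):
    """Coarse-to-fine evaluation of the signed distance field.

    eval_block(lo, hi, resolution) is assumed to return the field values for the
    axis-aligned box [lo, hi) of the unit-cube volume sampled at the given
    resolution.  Only coarse cells whose value lies within near_threshold of the
    isosurface are re-evaluated at the fine resolution.
    """
    coarse = eval_block(np.zeros(3), np.ones(3), coarse_shape)
    fine_blocks = {}
    for idx in np.argwhere(np.abs(coarse) < near_threshold):
        lo = idx / np.array(coarse_shape)           # box of this coarse cell
        hi = (idx + 1) / np.array(coarse_shape)
        fine_blocks[tuple(idx)] = eval_block(lo, hi, (fine_factor,) * 3)
    return coarse, fine_blocks
```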
5 Experiments In order to demonstrate the effectiveness of the proposed method, experiments were conducted using two types of data: synthetic data formed by seven range images generated from an existing shape model (mesh data), and an actual object (an ornament in the shape of a boar) measured from 8 viewpoints with an active stereo method. For synthetic data, the points on the surface where the light is self-occluded by the object were treated as missing data even if they were visible from the camera, as occurs in measurements by an active stereo method. Each data set was merged and a mesh model was generated using the SC method, the volumetric diffusion method [7] (VD method), the method proposed by Sagawa et al. [8] (Consensus method), and the proposed method (the information regarding the light source position for active stereo was used). In the SC method, the VD method, and the proposed method, the signed distance field is calculated using line-of-sight distance.
Fig. 3. Results of merging (without interpolation): (a) synthetic data, (b) real data
Fig. 4. Results of merging (synthetic data): (a) SC, (b) VD, (c) Consensus, (d) Proposed (smoothing)
In the Consensus method, however, the signed distance field is calculated using the Euclidean distance. In the VD method, the diffusion of the volume was repeated 30 times. The size of the volume was 512 × 512 × 512. For efficiency, the proposed method was first executed at a resolution of 62 × 62 × 62, and rendering at the high resolution was only performed for the regions where the surface exists. We ran the proposed method both with a smoothing filter of size 3 × 3 × 3 applied to the volume and without such a filter. The filter prevents aliasing on the interpolated surface (no smoothing was performed in the SC method or the VD method). The SC method, the VD method and the proposed method were executed on a PC with an Intel Xeon (2.8 GHz) CPU and an NVIDIA GeForce 8800GTX GPU. The Consensus method was implemented on a PC with two Opteron 275 (2.2 GHz) CPUs (four cores in total). The results of merging with no interpolation are shown in Figure 3, and the results of the interpolation process with each method (for the proposed method, the case with smoothing applied) are shown in Figures 4 and 5. Figures 6(a)-(f) show the signed distance fields for each experiment on the synthetic data, sliced at a fixed z-coordinate.
Fig. 5. Results of merging (real data): (a) SC, (b) VD, (c) Consensus, (d) Proposed (smoothing)
Fig. 6. Slice images of the signed distance field: (a) no interpolation, (b) SC, (c) VD, (d) Proposed, (e) magnified image of (a), (f) magnified image of (c), (g) the discriminant function. In (a)-(f), green represents Unseen, black represents Empty, and gray levels represent the signed distance in NearSurface regions; for the proposed method, green represents Inside and black represents Outside. In (g), gray intensities indicate positive values and red indicates regions of negative value.
Figure 4(a) and Figure 5(a) show that excess meshes were produced surrounding the target object with the SC method, since Unseen (not observed) regions remained uncarved. Figure 4(b) shows that, with the VD method, the mesh surrounding the holes beneath the ears and on the chest spread in undesirable directions. These phenomena often occurred where the borders between the NearSurface and Unseen regions (the red line in Figure 6(e)) were not parallel to the normal directions of the object surface, as shown in Figure 6(e). In such regions, the filter causes the isosurface to expand in wrong directions (in Figure 6(f) the isosurface expands downwards and to the right). A similar phenomenon also occurred in the real data (Figure 5(b)). The Consensus method produced interpolated meshes with the best quality; however, its processing time was long since the Euclidean distance was required. The proposed method produced results whose quality was second only to the Consensus method. Since our method does not use the signed distance field for hole filling, but only a discriminant function for the Unseen voxels, smoothness and continuity of the interpolated shape are not considered, so there remains room to improve the quality.
Table 1. Execution time in seconds. The volume size of the Consensus method for synthesized data was 128 × 128 × 128.

Methods          SC    VD    Consensus                                         Proposed (no smoothing)   Proposed (smoothing)
Synthetic data   55    168   15 min. for merging, 18 sec. for interpolation    25                        36
Real data        38    120   7 hours for merging, 280 sec. for interpolation   21                        28
Using both a signed distance field and a discriminant function may be a promising way to improve our algorithm in terms of result quality. Figure 6(g) shows the values of the discriminant function C(v). From the figure, we can see that the regions where C(v) is positive coarsely approximate the shapes of the target objects. Table 1 lists the execution times of each method. It shows that the proposed method was executed much faster than the other methods.
6 Conclusion In this paper, the space carving method was improved, and an interpolation algorithm yielding stable results even when only a few range images are available was proposed. The method is realized by defining a discriminant function based on Bayes estimation to determine the inside and outside of an object in a signed distance field, even for unseen voxels. In addition, a method was proposed for improving the accuracy of this discriminant function by using range images obtained with an active stereo method. Furthermore, a technique for implementing the proposed method on a GPU was described, which reduced the computational time by a wide margin. Finally, experiments comparing the proposed method with existing methods confirmed its effectiveness.
References
1. Curless, B., Levoy, M.: A volumetric method for building complex models from range images. Computer Graphics (Annual Conference Series) 30, 303–312 (1996)
2. Hoppe, H., DeRose, T., Duchamp, T., McDonald, J., Stuetzle, W.: Surface reconstruction from unorganized points. In: ACM SIGGRAPH, pp. 71–78. ACM Press, New York (1992)
3. Whitaker, R.T.: A level-set approach to 3d reconstruction from range data. IJCV 29(3), 203–231 (1998)
4. Turk, G., Levoy, M.: Zippered polygon meshes from range images. In: SIGGRAPH 1994, pp. 311–318. ACM Press, New York (1994)
5. Masuda, T.: Registration and integration of multiple range images by matching signed distance fields for object shape modeling. CVIU 87(1-3), 51–65 (2002)
6. Sagawa, R., Nishino, K., Ikeuchi, K.: Adaptively merging large-scale range data with reflectance properties. IEEE Trans. on PAMI 27(3), 392–405 (2005)
7. Davis, J., Marschner, S.R., Garr, M., Levoy, M.: Filling holes in complex surfaces using volumetric diffusion. In: 3DPVT 2002. Proc. of International Symposium on the 3D Data Processing, Visualization, and Transmission, pp. 428–438 (2002)
8. Sagawa, R., Ikeuchi, K.: Taking consensus of signed distance field for complementing unobservable surface. In: Proc. 3DIM 2003, pp. 410–417 (2003)
9. Masuda, T.: Filling the signed distance field by fitting local quadrics. In: 3DPVT 2004. Proc. of International Symposium on the 3D Data Processing, Visualization, and Transmission, pp. 1003–1010. IEEE Computer Society Press, Washington, DC, USA (2004)
10. Lorensen, W.E., Cline, H.E.: Marching cubes: A high resolution 3d surface construction algorithm. In: SIGGRAPH 1987, pp. 163–169. ACM Press, New York, NY, USA (1987)
Shape Representation and Classification Using Boundary Radius Function Hamidreza Zaboli and Mohammad Rahmati Department of Computer Engineering, Amirkabir University of Technology, Tehran, Iran
[email protected],
[email protected],
[email protected]
Abstract. In this paper, a new method for shape representation and classification is proposed. We define a radius function on the contour of the shape which captures, for each boundary point, attributes of the related internal part of the shape. We call these attributes the “depth” of the point. The depths of the boundary points generate a descriptor sequence which represents the shape. Matching of sequences is performed using a dynamic programming method, yielding a distance measure. Finally, different classes of shapes are classified using a hierarchical clustering method and this distance measure. The proposed method analyzes the features of each part of the shape locally, which leads to the ability to perform part analysis and to insensitivity to local deformations such as articulation, occlusion and missing parts. We show the high efficiency of the proposed method by evaluating it for shape matching and classification on standard shape datasets. Keywords: Computer vision, shape matching, shape classification, boundary radius function.
1 Introduction In recent years, shape recognition and retrieval has been an important research area in the computer vision field, and various approaches have been proposed to deal with this problem. In spite of the state-of-the-art methods in this field, serious drawbacks remain in handling the deformations which appear in real applications. These deformations may take the form of noise, occlusions, missing parts and articulations. Different methods, such as contour-based methods, region-based methods, skeleton-based methods and statistical methods, have tried to overcome them in different ways [1]. Contour-based methods generally use features of the contour (boundary) of the shape without considering its internal parts. This reduces the amount of computation to a process on the contour of the shape. Region-based methods, on the other hand, consider the information represented by all points of the shape. Skeleton-based methods extract a rich descriptor of the shape called the “skeleton”. Skeletons were first introduced by Blum in [2]. While skeletons are rich descriptors of shapes, they are sensitive to deformations of the shape,
especially for natural objects such as humans and animals [1]. Moreover, the skeleton itself, and the structures and features extracted from it, are complex, which makes the matching process hard and sometimes inefficient. Siddiqi et al. and Sebastian et al. in [3,4] propose a shock graph method which extracts a directed acyclic graph (DAG) from the skeleton of the shape and uses a sub-graph isomorphism method at the matching stage. While region-based methods, especially skeleton-based methods, provide a complete description of the shape, extracting a simple and efficient structure from the skeleton remains a challenge. In this work, we look for a simple and efficient structure which completely represents the interior of the shape. In skeleton-based methods there is a function called the "radius function" [1,3], which captures the convexities and concavities of different parts of the shape along the skeletal lines and curves. In this paper, we define a new radius function on the boundary of the shape instead of the skeleton. By traversing the boundary of the shape, the boundary radius function (BRF) models the behavior of the shape over its different parts and segments. In this way, we obtain a simple sequential structure which describes the convexities and concavities of the shape as well as the skeletal radius function does. Variations of the BRF correspond to variations of the different parts of the shape, which create its convexities and concavities. We use the variations of the BRF to represent and classify the shapes of different objects. Our method performs part analysis of the shape and is not sensitive to articulations. Furthermore, due to the part analysis and local description of the different parts, the method is robust under high amounts of occlusion and missing parts. Experiments confirm this robustness. The rest of the paper is organized as follows: Section 2 presents the related work. Section 3 presents the new definition of the boundary radius function (BRF) and how to extract a shape descriptor using the BRF. Section 4 gives the method of matching the descriptors of shapes. In Section 5, experimental results are presented, and Section 6 concludes the paper.
2 Related Work Employing the variations of the shape contour has been widely reported in previous work, with the variations measured according to different criteria. Bernier et al. in [5] used a polar transformation of the distances between contour points and the geometric center of the shape. Hoan et al. used a similar approach in [6]: variations of the shape contour are measured by the Euclidean distances between the contour points and a fixed reference point. The drawback is sensitivity to the reference point, whose location can easily vary in the presence of small deformations, articulations, occlusion and missing parts. In fact, the dependency of the descriptor on a single, unstable reference point makes these methods inefficient. In another work, Belongie et al. in [7] proposed a shape descriptor called "shape context" which captures, for each point on the boundary of the shape, the Euclidean distances and orientations of the other contour points relative to the current point. While the descriptor is rich, the Euclidean distance and orientation between the contour
points may vary under deformations and clutter, especially articulations. Thayananthan et al. in [8] used Chamfer matching to improve the robustness of the method. Another shortcoming is that the descriptor of a point depends on the locations of all other points on the boundary of the shape. This makes the descriptor sensitive to occlusion and missing parts occurring in segments far from the current point, so these methods cannot perform part analysis. In contrast, Siddiqi et al. in [3] proposed a skeleton-based method with a structural descriptor which can perform part analysis of the shape. It represents the shape by a hierarchical structure called the "shock graph". The method suffers from a difficult matching process and from sensitivity to noise, occlusions and other deformations appearing on different parts of the shape. Sebastian et al. in [4] improved the method by defining a graph edit distance algorithm. In this paper, we introduce a contour descriptor based on the boundary radius function (BRF) which efficiently captures the internal parts of the shape. We define an edit distance algorithm for matching the descriptors, and we perform shape recognition and classification using the resulting distance.
3 Boundary Radius Function In skeleton-based methods, the radius function on the skeleton of the shape is defined as follows: assume S denotes the skeleton of a shape and consider a point p ∈ S. The radius function R(p) is the radius of the maximal inscribed disc centered at p that touches the boundary of the shape. By analogy with this definition, we define the boundary radius function (BRF) on the boundary of the shape and show how to calculate its values. We define the BRF as follows: Definition 1: Assume B denotes the boundary of a shape and consider a point p ∈ B. The boundary radius function BRF(p) for point p is defined as the radius of the minimal disk centered at p which touches the nearest point q on the boundary of the shape such that the line segment connecting p to q lies entirely within the shape and is perpendicular to an ε-neighborhood of q on the boundary. Fig. 1 illustrates this definition.
Fig. 1. (a): A horse and BRF(p) for different points on the boundary. (b): the line segment pq is perpendicular to the boundary segment at point q. (c): BRF for a non-smooth shape. (d): the new BRF (Definition 2) for a point p on the rear part of the horse shape.
The BRF for three sample points is shown in Fig. 1(a). In Fig. 1(b), the line segment pq, which is the radius of the circle, is perpendicular to the boundary at point q. This perpendicularity can be proved for any smooth closed curve. Although the boundary of a shape is a closed curve, it can be uneven with sharp corners. Fig. 1(c) shows an example of an uneven boundary with sharp corners. As seen in this figure, the boundary has a corner at point q and is not smooth, so it is not differentiable there and there is no tangent line on the boundary at q. As a result, in such cases Definition 1 may not hold, because there may be no point q on the boundary whose tangent is perpendicular to the radius line pq. In fact, Definition 1 holds for smooth closed curves but not for arbitrary closed shape boundaries. Thus we give a new definition of the BRF which is less formal but more applicable. Definition 2: Assume B denotes the boundary of a shape and consider a point p ∈ B. Moreover, assume a minimal circle C centered at p which touches the boundary B at two different points b1, b2. Increase the radius of the circle C until it touches a third boundary point q such that the line segment pq lies entirely within the shape; the boundary radius function BRF(p) for point p is then the radius of the circle C. Note that in this definition, the radius of circle C is equal to the length of the line segment pq. Fig. 1(c) illustrates an example of Definition 2: by increasing the radius of the circle, it touches the boundary at a third point q and the line segment pq lies entirely within the shape. Fig. 1(d) shows another example of the BRF under Definition 2. In this figure, BRF(p) is not given by the line segment pa, because pa does not lie entirely within the shape, while pq is the correct line segment for BRF(p). Definition 2 is a useful and applicable definition of the BRF for any closed shape; therefore, we use it for calculating the BRF on the boundary of shapes. The BRF can analyze and capture the internal part of the shape related to each boundary point. This provides enough information for local analysis of parts, which is useful for part analysis and for handling articulations, occlusion and missing parts. Given a sample shape, it is straightforward to compute the values of the BRF for its boundary points. Fig. 2(a) shows two sample shapes and the values of the BRF for their boundary points. These values are shown as a set of transformed points for each sample shape, such that each point s with location s(x, y) on the boundary of the sample shape is transformed into a point s′ with location s′(x, y + BRF(s)).
Fig. 2. (a),(b): Two sample shapes, their boundaries and their transformed point sets shown to their right. Note that the X and Y values increase rightwards and downwards, respectively.
As seen in Fig. 2, parts with high “depth” in front of them have high values of the related BRFs. This causes these parts to lie in the lower sections (i.e., high values of y in their loci (x, y)) of the transformed point set. Up to this point, we have for any given shape a transformed point set which reflects, for each boundary point, an attribute of the related internal part. This attribute is like the depth of the shape at the point for which the BRF is computed. Although in some cases the BRF may not give the real, accurate depth for a boundary point, it can be considered a feature similar to depth. Moreover, the values of the BRF at different boundary points accurately represent shapes, and a true depth measure is not necessary for describing the shape. The depth-like measure, i.e., the BRF, represents for each segment of the shape a view of the corresponding boundary segment in front of it, which includes the convex and concave parts of the shape. Note that different shapes are created precisely by these convexities and concavities in their different parts.
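A simplified Python sketch of a BRF computation on a rasterised shape is given below. It is an approximation of Definition 2: it omits the two initial touching points and simply takes the nearest non-neighbouring boundary point whose connecting segment stays inside the shape; the parameters skip and samples are our assumptions.

```python
import numpy as np

def boundary_radius_function(boundary, inside_mask, skip=5, samples=20):
    """Approximate BRF (Definition 2) on a rasterised shape.

    boundary: (n, 2) array of ordered contour points (row, col).
    inside_mask: boolean image, True inside the shape.
    For each point p, the radius is the distance to the nearest boundary point q
    (excluding the `skip` immediate contour neighbours on either side) such that
    the segment pq, tested at `samples` points, stays inside the shape.
    """
    n = len(boundary)
    brf = np.zeros(n)
    for i, p in enumerate(boundary):
        candidates = np.array([boundary[j] for j in range(n)
                               if min(abs(i - j), n - abs(i - j)) > skip])
        order = np.argsort(np.linalg.norm(candidates - p, axis=1))
        for q in candidates[order]:
            ts = np.linspace(0.0, 1.0, samples)[1:-1]
            pts = np.rint(p[None, :] * (1 - ts[:, None]) + q[None, :] * ts[:, None]).astype(int)
            if inside_mask[pts[:, 0], pts[:, 1]].all():   # segment pq lies inside the shape
                brf[i] = np.linalg.norm(q - p)
                break
    return brf
```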
4 Extracting Structure from the BRFs In this section we extract a suitable structure from the boundary points that contains the BRF measures. Since the boundary points of a shape form its contour, and the contour has a sequential structure, we use a vector containing the BRF values to represent a shape. The order of the boundary points is kept, and the corresponding BRF value is assigned to each one. In this way, we have a directed attributed vector representing any shape. We call this vector the boundary radius vector (BRV). Matching of shapes can therefore be performed by matching their BRVs. One issue concerns the length of the vectors: bigger shapes have longer boundaries, so their BRVs would be longer than those of small shapes. This can cause problems, especially in the presence of transformations such as scaling. To overcome this, for matching any two given shapes we consider n points on each boundary, chosen so that they divide the boundary of each shape into n equal segments. Thus we have a BRV of length n for each shape, with n fixed within a matching process. After computing the BRV of a shape, the variations of the BRF values in the BRV should be considered. As seen in Fig. 2(a), what creates the two convex parts at the two ends of the shape is the variation of the BRF values. In this figure, from point a to b, constant BRF values create a flat part between these points; from point b to c and then c to d, increasing and then decreasing BRF values create a bowed curve. Thus it is necessary to analyze the variations of the BRF values of a shape, rather than only the values themselves. Although the BRF values are important, their variations are more important. To this end, we define a first-order differential on the BRF values of the BRV. The definition is given below: Definition 3: Assuming a BRV of length n for a shape, the variation of the BRF for the ith value of the BRV (the ith point on the contour) is denoted VBRF(i) and computed as follows:
VBRF(i) = [ Σ_{k=i−α+1}^{i+α} (BRF(k) − BRF(k−1)) ] / (2α),   α+1 ≤ i ≤ n−α−1,   (1)
where α is a resolution parameter. A larger α gives less accuracy in analyzing the details of parts, while very small values of α do not necessarily lead to a better description of the shape or to better efficiency and recognition rate of the final recognition system. Using Definition 3, we can compute the variations of the BRF (VBRF) for any given shape. We therefore have a sequence of VBRF values for each shape, and this sequence in turn specifies a vector of VBRF values calculated for the shape. We call this final vector the “descriptor vector”. Note that for each value of the descriptor vector (VBRF(i)), we take its related BRF value (BRF(i)) into account as a weight factor. In this way, large parts of the shape which are deeper than other parts are considered more important. This raises the recognition rate of our method, especially in the presence of occlusion and missing parts.
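A short sketch of Equation (1) follows; note that the inner sum of consecutive differences telescopes, and the handling of indices near the ends of the sequence is our assumption.

```python
import numpy as np

def vbrf(brf, alpha):
    """Variations of the BRF, Equation (1).

    brf: 1-D array of BRF values along the contour (length n).
    Returns VBRF(i) for alpha+1 <= i <= n-alpha-1 (zero elsewhere); the sum of
    consecutive differences telescopes to BRF(i+alpha) - BRF(i-alpha).
    """
    n = len(brf)
    out = np.zeros(n)
    for i in range(alpha + 1, n - alpha - 1):
        out[i] = (brf[i + alpha] - brf[i - alpha]) / (2.0 * alpha)
    return out
```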
5 Matching Descriptor Vectors Up to this point, we have introduced a descriptor vector for any given shape. For matching, classifying and recognizing shapes, it is sufficient to match the descriptor vectors and compute the difference between them. The resulting difference is treated as a distance between the shapes and can be used as a distance measure. Due to the sequential structure of the descriptor vectors, we use a dynamic programming method for matching them. This method is widely used for matching attributed sequential data structures, e.g., strings. Moreover, it deals efficiently with deleted and inserted parts of the sequential structures and can find an optimal matching. This ability is very useful for handling deformations of shapes such as occlusion and missing parts. Dynamic programming compares and matches the vectors and generates a cost for the matching. This cost is the cost of transforming one of the vectors into the other and is known as the “edit distance”. The resulting edit distance is taken as the distance between the two shapes. To use the dynamic programming method, we define the following cost function:

Cost(i, j) = min( Cost(i, j−1), Cost(i−1, j), Cost(i−1, j−1) + c(i, j) ),   (2)

where
c(i, j) = BRF(i) − BRF(j)                              if VBRF(i) = VBRF(j),
c(i, j) = (BRF(i) × VBRF(i)) − (BRF(j) × VBRF(j))      if VBRF(i) ≠ VBRF(j).
In relation (2), Cost(i, j) is the distance between values 1..i of the first vector and values 1..j of the second vector. For a complete matching, i must equal the number of values in the first vector and j the number of values in the second vector. As seen in this relation, the variations of the boundary radius function (VBRF) are taken as the basis for matching any two values in the descriptor vectors; the values of the related BRFs are then taken into account at the next level.
Using the cost function in relation (2), we can compute the edit distance between any two descriptor vectors, and hence the distance between any two given shapes. Therefore, given any two shapes S1, S2, the matching process can be performed as follows:
1. Traverse the boundaries of S1, S2, compute the BRF for their boundary points, and generate BRV1, BRV2, respectively.
2. Compute the variations of the BRFs and transform BRV1, BRV2 into descriptor vectors.
3. Match the descriptor vectors using dynamic programming.
Assuming the BRF is computed for n boundary points, the first stage has time complexity O(nk²), where k² is the complexity of computing the BRF at one point and k is the related radius of a boundary point. In the worst case, k is not greater than n/2; this worst case occurs when the shape is like an ellipse. Thus the first stage takes O(n·(n/2)²) = O(n³/4) = O(n³). The time complexity of the second stage is O(n), and in the third stage dynamic programming takes O(nm), where n and m are the lengths of the descriptor vectors. Therefore, assuming n ≥ m, the overall worst-case time complexity of the method is O(n³ + n + nm), which is still O(n³).
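The third stage can be sketched as follows. This is a DTW-style reading of the cost function (2), with the local cost accumulated along the cheapest monotone alignment path; the boundary conditions are not specified in the text and are chosen here for simplicity.

```python
import numpy as np

def match_descriptors(brf1, vbrf1, brf2, vbrf2):
    """Edit-distance matching of two descriptor vectors by dynamic programming.

    brf*/vbrf* are 1-D arrays (the sampled boundary points).  The local cost
    compares BRF values when the VBRF values agree and the BRF-weighted VBRF
    values otherwise, as in the cost function above.
    """
    n, m = len(vbrf1), len(vbrf2)

    def local(i, j):
        if vbrf1[i] == vbrf2[j]:
            return abs(brf1[i] - brf2[j])
        return abs(brf1[i] * vbrf1[i] - brf2[j] * vbrf2[j])

    cost = np.full((n, m), np.inf)
    cost[0, 0] = local(0, 0)
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = min(cost[i, j - 1] if j > 0 else np.inf,
                       cost[i - 1, j] if i > 0 else np.inf,
                       cost[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            cost[i, j] = best + local(i, j)
    return cost[n - 1, m - 1]
```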
6 Experiments The distance measure computed in Section 5 can be used for shape matching and retrieval. We evaluated our method using two standard shape datasets and many customized sample shapes. The datasets are the Kimia [11] and MPEG-7 [12] binary image databases. Moreover, we designed a new set of customized and deformed shapes and performed experiments on this dataset as well. In our experiments the number of sampled boundary points (n), discussed in Section 4, is 100, and the resolution parameter α is at most 3. The first experiment is the comparison and matching of different shapes from different classes. A set of randomly selected shapes from the datasets is considered and the distance between each pair of them is computed using the proposed method. The evaluated shapes include various deformed shapes, such as occluded shapes and shapes with articulation and missing parts. Table 1 shows this experiment for 13 sample shapes out of about 100 evaluated shapes. Each value in the table indicates the distance between the shapes at the corresponding row and column. As seen in the table, distances between shapes belonging to the same class are smaller than the other distances. For example, the horse at column 10 has distance 0.581 to the other horse at row 9 and distance 0.789 to the dog at row 3. Also, each shape has distance 0 to itself (e.g., row 10, column 10). In the next experiment, we classify shapes of different classes from the datasets based on the distances between the shapes. To achieve this, we use a hierarchical clustering method (single linkage).
Table 1. 13 of the about 100 evaluated sample shapes and the distance between each pair of them
[The 13 × 13 matrix of pairwise distances is omitted here; the shapes serving as row and column headers were shown as images in the original table. Each diagonal entry is 0, and intra-class distances (e.g., 0.581 between the two horses) are smaller than inter-class distances (e.g., 0.789 between a horse and a dog).]
This method groups similar shapes into a class based on the fact that shapes of the same class have small distances between themselves and larger distances to shapes of other classes. Using this method, we classified 340 shapes of different classes and obtained very good results. Fig. 3 shows the results of this classification. Our method measured the distances between each pair of shapes and generated a distance table, and the clustering method then categorized the shapes based on this table. As seen in Fig. 3, our method classified the shapes correctly and linked similar classes first. In this classification, the horse and deer classes are linked first. The rabbit and mouse classes are also linked, and after that the pliers and scissors classes and the dummy and plane classes are linked, respectively. The dog class is linked to the horse and deer classes, which is a correct classification. The classification continues, linking similar classes until the last class, which is the class of leaves. This class is linked to the other classes at the last level due to its large difference from them. In the last experiment, we evaluate our method against the methods proposed in [5,7]. The first method, proposed in [5], is close to ours: both methods capture the shape by traversing the contour and are based on its interior.
Fig. 3. Classification of different classes of shapes
The difference between them is that the method in [5] considers all parts of the shape relative to a reference point, while our method in most cases considers each part based on a local analysis. The next method, called “shape context” and proposed in [7], is a contour-based method which stores a profile for each boundary point; each profile contains the distances and orientations of the other boundary points relative to the current point. As mentioned earlier, this method suffers from the high dependency of its descriptor on deformations and variations of other parts of the boundary. In this experiment, we evaluate our method for retrieving the most similar, closest matches. We perform this retrieval test on 530 randomly selected shapes from the Kimia and MPEG-7 datasets. A query shape is selected from the dataset and its 10 closest matches are retrieved. The results of these retrievals are shown in Table 2 and compared with the results of [5] and [7]. As shown in the table, our method performs better and achieves very good results. Table 2. Retrieval results for 530 shapes by our method and by [5,7]: recognition rate (%) of the three methods for the top 10 matches
Method                           1st   2nd   3rd   4th   5th   6th   7th   8th   9th   10th
Boundary Radius Function (BRF)   100   100   99    99    98    97    97    97    94    84
Polar signature [5]              100   100   97    96    91    82    78    72    57    39
Shape Context [7]                97    91    88    85    84    77    75    66    56    37
While polar signature [5] and shape context [7] largely lose retrieval accuracy after the 5th closest match, our method maintains a recognition rate higher than 90% for the top 9 closest matches.
7 Conclusions We proposed a new descriptor which captures the interior of a shape by walking along its contour. The result of this stage is a vector which represents the shape based on its interior. Matching of the vectors is performed using a dynamic programming method. Since the proposed method and its descriptor analyze each part of the shape locally, the method is robust under complex and problematic deformations, especially articulation, occlusion, missing parts and clutter. Moreover, the matching process of the method, i.e., dynamic programming, is efficient and simple. The matching could, however, also be done with other methods, such as computing the area lying between the VBRFs.
References
1. Zhang, D., Lu, G.: Review of shape representation and description techniques. Pattern Recognition 37, 1–19 (2004)
2. Blum, H.: A transformation for extracting new descriptors of shape. In: Wathen-Dunn, W. (ed.) Models for the Perception of Speech and Visual Form, pp. 362–380. MIT Press, Cambridge (1967)
3. Siddiqi, K., Shokoufandeh, A., Dickinson, S.J., Zucker, S.W.: Shock graphs and shape matching. International Journal of Computer Vision 35(1), 13–32 (1999)
4. Sebastian, T.B., Klein, P.N., Kimia, B.B.: Recognition of shapes by editing shock graphs. In: ICCV 2001, pp. 755–762 (2001)
5. Bernier, T., Landry, J.-A.: A new method for representing and matching shapes of natural objects. Pattern Recognition 36, 1711–1723 (2003)
6. Kang, S.K., Ahmad, M.B., Chun, J.H., Kim, P.K., Park, J.A.: Modified radius-vector function for shape contour description. In: Laganà, A., Gavrilova, M., Kumar, V., Mun, Y., Tan, C.J.K., Gervasi, O. (eds.) ICCSA 2004. LNCS, vol. 3046, pp. 940–947. Springer, Heidelberg (2004)
7. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(4), 509–522 (2002)
8. Thayananthan, A., Stenger, B., Torr, P.H.S., Cipolla, R.: Shape context and Chamfer matching in cluttered scenes. IEEE CVPR 1, 127–133 (2003)
9. Gorelick, L., Galun, M., Sharon, E., Basri, R., Brandt, A.: Shape representation and classification using the Poisson equation. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(12), 1991–2005 (2006)
10. Navarro, G.: A guided tour to approximate string matching. ACM Computing Surveys (CSUR) 33(1), 31–88 (2000)
11. Kimia Image Database (May 2007), available at http://www.lems.brown.edu/dmc/main.html
12. MPEG7 CE Shape Database (May 2007), available at http://www.imageprocessingplace.com/DIP/dip_image_dabases/image_databases.htm
A Convex Programming Approach to the Trace Quotient Problem
Chunhua Shen 1,2,3, Hongdong Li 1,2, and Michael J. Brooks 3
1 NICTA, 2 Australian National University, 3 University of Adelaide
Abstract. The trace quotient problem arises in many applications in pattern classification and computer vision, e.g., manifold learning and low-dimensional embedding. The task is to solve an optimization problem that maximizes the ratio of two traces, i.e., max_W Tr(f(W))/Tr(h(W)). This optimization problem is non-convex in general, hence hard to solve directly. Conventionally, the trace quotient objective function is replaced by the much simpler quotient trace formula, i.e., max_W Tr(h(W)^{-1} f(W)), which admits a much simpler solution. However, the result is no longer optimal for the original problem setting, and some desirable properties of the original problem are lost. In this paper we propose a new formulation for solving the trace quotient problem directly. We reformulate the original non-convex problem such that it can be solved by efficiently solving a sequence of semidefinite feasibility problems. The solution is therefore globally optimal. Besides global optimality, our algorithm naturally generates an orthonormal projection matrix. Moreover, it relaxes the restriction of linear discriminant analysis that the rank of the projection matrix can be at most c − 1, where c is the number of classes; our approach is more flexible. Experiments show the advantages of the proposed algorithm.
1 Introduction The problem of dimensionality reduction, i.e., extracting low dimensional structure from high dimensional data, is extensively studied in pattern recognition, computer vision and machine learning. Many dimensionality reduction methods, such as linear discriminant analysis (LDA) and its kernel version, end up solving a trace quotient problem

W* = argmax_{W^T W = I_d} Tr(W^T Sα W) / Tr(W^T Sβ W),   (1)

where Sα, Sβ are positive semidefinite (p.s.d.) matrices (Sα ⪰ 0, Sβ ⪰ 0), I_d is the d × d identity matrix (the dimension of I is omitted when it is clear from the context) and Tr(·) denotes the matrix trace. W ∈ R^{D×d} is the target projection matrix for dimensionality reduction (typically d ≪ D). In the supervised learning framework, Sα usually represents the distance between different classes while Sβ represents the distance between data points of the same class. For example, for LDA, Sα is the “between-class scatter matrix” and Sβ is the “within-class scatter matrix”. By formulating
the problem of dimensionality reduction in a general setting and constructing Sα and Sβ in different ways, we can analyze many different types of data within the above underlying mathematical framework. Despite the importance of the trace quotient problem, it lacks a direct and globally optimal solution. Usually, as an approximation, the quotient trace

Tr( (W^T Sβ W)^{-1} (W^T Sα W) )

(instead of the trace quotient) is used. Such an approximation readily leads to a generalized eigen-decomposition (GEVD) solution, for which a closed-form solution is readily available. It is easy to check that when rank(W) = 1, i.e., W is a vector, Equation (1) is actually a Rayleigh quotient problem, which can be solved by the GEVD: the eigenvector corresponding to the eigenvalue of largest magnitude gives the optimal W. Unfortunately, when rank(W) > 1, the problem becomes much more complicated. Heuristically, the dominant eigenvectors corresponding to the largest eigenvalues are used to form W, in the belief that the largest eigenvalues contain more useful information. However, such a GEVD approach cannot produce the optimal solution to the original optimization problem (1) [1]. Furthermore, the GEVD does not yield an orthogonal projection matrix. Orthogonal LDA (OLDA) has been proposed to compute a set of orthogonal discriminant vectors via the simultaneous diagonalisation of the scatter matrices [2]. In this paper, we proffer a novel semidefinite programming (SDP) based method to solve the trace quotient problem directly, which has the following properties:
– It optimises the original problem (Equation (1)) directly;
– The target low dimensionality is selected by the user, and the algorithm guarantees a globally optimal solution since the optimisation is convex; in other words, it is local-optima-free;
– The projection matrix is naturally orthonormal;
– Unlike the GEVD approach to LDA, when our algorithm is used to solve LDA the data need not be projected to at most c − 1 dimensions, where c is the number of classes.
To our knowledge, this is the first attempt which directly solves the trace quotient problem and, at the same time, deterministically guarantees a global optimum.
2 SDP Approach to the Trace Quotient Problem In this section, we show how the trace quotient is reformulated into an SDP problem. 2.1 SDP Formulation By introducing an auxiliary variable δ, the problem (1) is equivalent to

maximize    δ                                            (2a)
subject to  Tr(W^T Sα W) ≥ δ · Tr(W^T Sβ W)              (2b)
            W^T W = I_d                                   (2c)
            W ∈ R^{D×d}                                   (2d)
The variables we want to optimise here are δ and W, but we are only interested in the W for which the value of δ is maximised. This problem is clearly not convex because the constraint (2b) is not convex and (2d) is actually a non-convex rank constraint. Let us define a new variable Z ∈ R^{D×D}, Z = WW^T; constraint (2b) then becomes Tr((Sα − δSβ)Z) ≥ 0, using the fact that Tr(W^T S W) = Tr(S WW^T) = Tr(SZ). Because Z is the product of the matrix W and its transpose, it must be p.s.d. Overton and Womersley [3] have shown that the set Ω1 = {WW^T : W^T W = I_d} is the set of extreme points of Ω2 = {Z : Z = Z^T, Tr Z = d, 0 ⪯ Z ⪯ I} (our notation is used here). That means that, as constraints, Ω1 is more strict than Ω2. Therefore constraints (2c) and (2d) can be relaxed into Tr Z = d and 0 ⪯ Z ⪯ I, which are both convex. When the cost function is linear and is optimised subject to Ω2, the solution lies at one of the extreme points [4]. Consequently, for linear cost functions, the optimization problems subject to Ω1 and Ω2 are exactly equivalent. Moreover, the same property holds even when the objective function is a quotient (i.e., fractional programming), which is precisely the case we are dealing with here. With respect to Z and δ, (2b) is still non-convex: the problem may have locally optimal points. Nevertheless, the global optimum can be computed efficiently via a sequence of convex feasibility problems. By observing that the constraint is linear when δ is known, we can convert the optimization problem into a set of convex feasibility problems, and a bisection search strategy is adopted to find the optimal δ. This technique is widely used in fractional programming. Let δ⋆ denote the unknown optimal value of the cost function. Given δ∗ ∈ R, consider the following convex feasibility problem (a feasibility problem has no cost function; the objective is to check whether the intersection of the convex constraints is empty):

find        Z                                            (3a)
subject to  Tr((Sα − δ∗ Sβ) Z) ≥ 0                        (3b)
            Tr Z = d                                      (3c)
            0 ⪯ Z ⪯ I                                     (3d)
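As an illustration only, the following sketch checks this feasibility problem for a candidate δ∗ using CVXPY with an SDP-capable solver (e.g., SCS); the paper itself uses CSDP/SeDuMi, and the function and variable names here are ours.

```python
import cvxpy as cp
import numpy as np

def trace_quotient_feasible(Sa, Sb, delta, d):
    """Check feasibility of problem (3a)-(3d) for a candidate delta.

    Returns True if some Z with Tr((Sa - delta*Sb) Z) >= 0, Tr Z = d and
    0 <= Z <= I exists.
    """
    D = Sa.shape[0]
    Z = cp.Variable((D, D), symmetric=True)
    constraints = [
        cp.trace((Sa - delta * Sb) @ Z) >= 0,   # (3b)
        cp.trace(Z) == d,                       # (3c)
        Z >> 0,                                 # (3d): Z is p.s.d.
        np.eye(D) - Z >> 0,                     #        and Z <= I
    ]
    prob = cp.Problem(cp.Minimize(0), constraints)
    prob.solve()
    return prob.status in (cp.OPTIMAL, cp.OPTIMAL_INACCURATE)
```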
If this problem is feasible, then we have δ⋆ ≥ δ∗. Otherwise, if it is infeasible, we can conclude δ⋆ < δ∗. In this way we can check whether the optimal value δ⋆ is smaller or larger than a given value δ∗. This observation motivates a simple algorithm for solving the fractional optimisation problem using bisection search, which solves a convex feasibility problem at each step. Algorithm 1 shows how it works.

Algorithm 1. Bisection search
Require: δL: lower bound of δ; δU: upper bound of δ; tolerance σ > 0.
while δU − δL > σ do
    δ = (δL + δU)/2.
    Solve the convex feasibility problem described in (3a)–(3d).
    if feasible then δL = δ else δU = δ end if
end while

At this point, a question remains to be answered: are constraints (3c) and (3d) equivalent to constraints (2c) and (2d) for the feasibility problem? Essentially, the feasibility problem is equivalent to

maximize    Tr((Sα − δ∗ Sβ) Z)                            (4a)
subject to  Tr Z = d                                      (4b)
            0 ⪯ Z ⪯ I                                     (4c)
If the maximum value of the cost function is non-negative, then the feasibility problem is feasible; conversely, it is infeasible. Because this cost function is linear, we know that Ω1 can be replaced by Ω2, i.e., constraints (3c) and (3d) are equivalent to (2c) and (2d). Note that constraint (3d) is not in the standard form of SDP. It can be rewritten into the standard form as

[ Z  0 ; 0  Q ] ⪰ 0,                                      (5a)
Z + Q = I,                                                (5b)
where the matrix Q acts as a slack variable. Now the problem can be solved using standard SDP packages such as CSDP [5] and SeDuMi [6]. We use CSDP in all of our experiments.

2.2 Recovering W from Z

From the covariance matrix Z learned by SDP, we can recover the output W by eigen-decomposition. Let Vi denote the i-th eigenvector, with eigenvalue λi, and let λ1 ≥ λ2 ≥ ··· ≥ λD be the sorted eigenvalues. It is straightforward to see that W = V diag(√λ1, √λ2, ···, √λD), where V = [V1, ..., VD] and diag(·) is a square matrix with the input as its diagonal elements. To obtain a D × d projection matrix, the smallest D − d eigenvalues are simply truncated. This is the general treatment for recovering a low-dimensional projection from a covariance matrix, e.g., as in principal component analysis (PCA). In our case, this procedure is precise, i.e., there is no information loss. This is obvious: the eigenvalues λi of Z = W Wᵀ are the same as the eigenvalues of Wᵀ W = I_d. That means λ1 = λ2 = ··· = λd = 1 and the remaining D − d eigenvalues are all zeros. Hence, in our case we can simply stack the first d leading eigenvectors to obtain W.
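A minimal numpy sketch of this recovery step (an illustration of the text above, not the authors' code):

```python
import numpy as np

def recover_W(Z, d):
    """Recover the D x d projection matrix from the learned covariance Z."""
    eigvals, eigvecs = np.linalg.eigh(Z)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]             # re-sort in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # For an exact solution of the relaxed problem the d leading eigenvalues are
    # (close to) one and the rest (close to) zero, so truncation is lossless.
    W = eigvecs[:, :d] * np.sqrt(np.clip(eigvals[:d], 0.0, None))
    return W
```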
2.3 Estimating Bounds of δ

The bisection search procedure requires a lower bound and an upper bound of δ. The following theorem from [3] is useful for estimating the bounds.

Theorem 1. Let S ∈ R^{D×D} be a symmetric matrix, and let μ_1^S ≥ μ_2^S ≥ ··· ≥ μ_D^S be the eigenvalues of S sorted from largest to smallest. Then max_{WᵀW = I_d} Tr(Wᵀ S W) = Σ_{i=1}^{d} μ_i^S.

Refer to [3] for the proof. This theorem can be extended to obtain the following corollary (following the proof of Theorem 1):

Corollary 1. Let S ∈ R^{D×D} be a symmetric matrix, and let ν_1^S ≤ ν_2^S ≤ ··· ≤ ν_D^S be its eigenvalues sorted from smallest to largest. Then min_{WᵀW = I_d} Tr(Wᵀ S W) = Σ_{i=1}^{d} ν_i^S.
Therefore, we estimate the upper bound of δ as

    δU = (Σ_{i=1}^{d} μ_i^{Sα}) / (Σ_{i=1}^{d} ν_i^{Sβ}).      (6)
In the trace quotient problem, both Sα and Sβ are p.s.d., which is equivalent to saying that all of their eigenvalues are non-negative. Note that the denominator of (6) could be zero, giving δU = +∞. This occurs when the d smallest eigenvalues of Sβ are all zeros, in which case rank(Sβ) ≤ D − d. For LDA, rank(Sβ) = min(D, N), where N is the number of training data. When N ≤ D − d, which is termed the small sample problem, δU is invalid. A PCA pre-processing of the data can always be performed to remove the null space of the covariance matrix of the data, such that δU becomes valid. A lower bound of δ is then

    δL = (Σ_{i=1}^{d} ν_i^{Sα}) / (Σ_{i=1}^{d} μ_i^{Sβ}).      (7)

Clearly δL ≥ 0.
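The bounds (6) and (7) reduce to two eigenvalue sums; a minimal sketch (assuming numpy, purely illustrative):

```python
import numpy as np

def delta_bounds(S_alpha, S_beta, d):
    """Bounds (6) and (7) on delta from sorted eigenvalues."""
    ev_a = np.sort(np.linalg.eigvalsh(S_alpha))   # ascending: nu_1 <= ... <= nu_D
    ev_b = np.sort(np.linalg.eigvalsh(S_beta))
    # delta_U becomes +inf if the d smallest eigenvalues of S_beta are all zero
    # (the small sample problem discussed in the text).
    delta_U = ev_a[::-1][:d].sum() / ev_b[:d].sum()
    delta_L = ev_a[:d].sum() / ev_b[::-1][:d].sum()
    return delta_L, delta_U
```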
3 Related Work

The closest work to ours is [1], in the sense that it also proposes a method to solve the trace quotient directly. [1] finds the projection matrix W on the Grassmann manifold. Compared with optimisation in the Euclidean space, the main advantage of optimisation on the Grassmann manifold is that it has fewer variables, so the scale of the problem is smaller. There are major differences between [1] and our method: (i) [1] optimises Tr(Wᵀ Sα W) − δ · Tr(Wᵀ Sβ W) and has no principled way to determine the optimal value of δ. In contrast, we optimise the trace quotient function itself, and a deterministic bisection search guarantees the optimal δ; (ii) the optimisation in [1] is non-convex (a difference of two quadratic functions), so it might become trapped in a local maximum, while our method is globally optimal.
Xing et al. [7] propose a convex programming approach to maximise the distances between classes while simultaneously clipping (but not minimising) the distances within classes. Unlike our method, the rank constraint is not considered in [7]. Hence [7] performs metric learning but is not necessarily a dimensionality reduction method. Furthermore, although the formulation of Xing et al. is convex, it is not an SDP: it is more computationally expensive, and general-purpose SDP solvers are not applicable. SDP (or other convex programming) is also used in [8,9] for learning a distance metric.
4 Experiments

In this work, we consider optimising the LDA criterion using the proposed SDP approach: Sα is the "between-classes scatter matrix" and Sβ is the "within-classes scatter matrix". Note, however, that there are many different ways of constructing Sα and Sβ, e.g., the general methods considered in [7].

UCI data. Firstly, we test whether the optimal value of the cost function, δ, obtained by our SDP bisection search is indeed larger than the one obtained by GEVD (conventional LDA). In all the experiments, the tolerance is σ = 0.1. In this experiment, two datasets ("iris" and "wine") from the UCI machine learning repository [10] are used. We randomly sample 70% of the data each time and run 100 tests for the two datasets. The target low dimension d is set to 2. Figure 1 plots the difference between the δ values obtained by the two approaches. Indeed, SDP consistently yields a larger δ than LDA does. To see the difference between the two approaches, we project the "wine" data into 2D space and plot the projected data in Figure 2. As can be seen, the SDP algorithm has successfully brought together the points in the same class while keeping dissimilar ones apart (for the original data and the PCA-projected data, different classes are entangled). While both discriminant projection algorithms separate the data successfully, SDP intentionally finds a projection of the data onto a straight line that maintains the separation of the clusters. Xing et al. [7] have reported very similar observations with their convex metric learning algorithm.

To test the influence of the SDP algorithm on classification, we again randomly select 70% of the data for training and the remaining 30% for testing. Both datasets are projected to 2D. Results with a kNN classifier (k = 1) are collected, and we run 100 tests for each dataset. For the "iris" data, SDP is slightly better (test error 3.71% (±2.74%)) than LDA (test error 4.16% (±2.56%)). For the "wine" data, LDA is better than SDP (1.53% (±1.62%) against 8.45% (±4.30%)). This means that a larger LDA cost δ does not necessarily produce better classification accuracy, because: (i) the LDA criterion is not directly connected with the classifier's performance; it is somewhat a heuristic criterion; (ii) as an example, Figure 2 indicates that SDP might over-fit the training data: when LDA already separates the data well, SDP aligns the data along a line to obtain a larger δ; (iii) with noisy training data, LDA can denoise by truncating the eigenvectors corresponding to the smaller eigenvalues, similar to what PCA does, whereas SDP takes the noise into consideration during learning, which appears to be harmful. We believe that some regularisation of the LDA criterion would be beneficial. Also, other criteria might behave differently in terms of over-fitting. These topics remain future research directions.
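For concreteness, one standard way of building the LDA scatter matrices used above is sketched below (assuming numpy; the paper notes that other constructions are possible, so this is only one illustrative choice):

```python
import numpy as np

def lda_scatter_matrices(X, y):
    """Between-class (S_alpha) and within-class (S_beta) scatter matrices.
    X: (N, D) data matrix, y: (N,) class labels."""
    mean_all = X.mean(axis=0)
    D = X.shape[1]
    S_alpha = np.zeros((D, D))
    S_beta = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        diff = (mean_c - mean_all)[:, None]
        S_alpha += Xc.shape[0] * diff @ diff.T   # between-class contribution
        centred = Xc - mean_c
        S_beta += centred.T @ centred            # within-class contribution
    return S_alpha, S_beta
```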
Fig. 1. The optimal value of δ obtained by our SDP approach (δSDP ) minus the value obtained by the conventional LDA (δLDA ). For all the runs, δSDP is larger than δLDA .
USPS handwritten digits data. Experiments are also conducted on the full USPS data set. The US Postal Service (USPS) handwritten digit data set is derived from a project on recognising handwritten digits on envelopes. The digits were down-sampled to 16 × 16 pixels. The training set has 7291 samples and the test set has 2007 samples. The test set is rather difficult: the error rate achieved by humans is 2.5% [11]. In the first experiment, we only use the 7291 training digits: 70% are randomly selected for training and the other 30% for testing. The data are linearly mapped from 256D to 55D using PCA such that 90.14% of the energy is preserved. For LDA, we then map them to 9D (because there are ten classes in total); SDP's target low dimension is 50D. We run the experiments 20 times. The results are somewhat surprising. The 1NN classification (i.e., nearest neighbour) test error for LDA is 6.99% ± 0.49%. SDP achieves much better performance: a 1NN test error of 2.79% ± 0.27%. Note that if we set the target low dimension to 9D for SDP, SDP performs worse than LDA does. In the second experiment, we use the 7291 training data for training and the 2007 USPS test data for testing. Again they are first mapped to 55D using PCA.
Fig. 2. (1) Original data (first two dimensions are plotted); (2) Projected to 2D with PCA; (3) Projected to 2D with LDA; (4) Projected to 2D with SDP
LDA reduces the dimensionality to 9D and SDP to 54D. LDA has a 1NN test error of 10.36%, while our SDP achieves a 5.13% test error. Note that in these experiments we have not tuned all the parameters carefully.
5 Conclusion

We have proposed a new formulation for directly solving the trace quotient problem. It is based on SDP, combined with a bisection search for solving the fractional programme, which allows us to derive a guaranteed globally optimal algorithm. Compared with LDA, the algorithm also relaxes the restriction of linear discriminant analysis that the rank of the projection matrix can be at most c − 1. The USPS classification experiment shows that this restriction might significantly affect LDA's performance. Our experiments have validated the advantages of the proposed algorithm.
Acknowledgements National ICT Australia (NICTA) is funded through the Australian Government’s Backing Australia’s Ability initiative, in part through the Australian Research Council.
References

1. Yan, S., Tang, X.: Trace quotient problems revisited. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 232–244. Springer, Heidelberg (2006)
2. Ye, J., Xiong, T.: Null space versus orthogonal linear discriminant analysis. In: Proc. Int. Conf. Mach. Learn., Pittsburgh, Pennsylvania, pp. 1073–1080 (2006)
3. Overton, M.L., Womersley, R.S.: On the sum of the largest eigenvalues of a symmetric matrix. SIAM J. Matrix Anal. Appl. 13(1), 41–45 (1992)
4. Overton, M.L., Womersley, R.S.: Optimality conditions and duality theory for minimizing sums of the largest eigenvalues of symmetric matrices. Math. Program. 62, 321–357 (1993)
5. Borchers, B.: CSDP, a C library for semidefinite programming. Optim. Methods and Software 11, 613–623 (1999)
6. Sturm, J.F.: Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones (updated for version 1.05). Optim. Methods and Software 11-12, 625–653 (1999)
7. Xing, E., Ng, A., Jordan, M., Russell, S.: Distance metric learning, with application to clustering with side-information. In: Proc. Adv. Neural Inf. Process. Syst., MIT Press, Cambridge (2002)
8. Weinberger, K.Q., Blitzer, J., Saul, L.K.: Distance metric learning for large margin nearest neighbor classification. In: Proc. Adv. Neural Inf. Process. Syst. (2005)
9. Globerson, A., Roweis, S.: Metric learning by collapsing classes. In: Proc. Adv. Neural Inf. Process. Syst. (2005)
10. Newman, D., Hettich, S., Blake, C., Merz, C.: UCI repository of machine learning databases (1998)
11. Simard, P., LeCun, Y., Denker, J.S.: Efficient pattern recognition using a new transformation distance. In: Proc. Adv. Neural Inf. Process. Syst., pp. 50–58. MIT Press, Cambridge (1993)
Learning a Fast Emulator of a Binary Decision Process

Jan Šochman and Jiří Matas

Center for Machine Perception, Dept. of Cybernetics, Faculty of Elec. Eng., Czech Technical University in Prague, Karlovo nám. 13, 121 35 Prague, Czech Rep.
{sochmj1,matas}@cmp.felk.cvut.cz

Abstract. Computation time is an important performance characteristic of computer vision algorithms. This paper shows how existing (slow) binary-valued decision algorithms can be approximated by a trained WaldBoost classifier, which minimises the decision time while guaranteeing a predefined approximation precision. The core idea is to take an existing algorithm as a black box performing some useful binary decision task and to train the WaldBoost classifier as its emulator. Two interest point detectors, the Hessian-Laplace and the Kadir-Brady saliency detector, are emulated to demonstrate the approach. The experiments show similar repeatability and matching score of the original and emulated algorithms while achieving a 70-fold speed-up for the Kadir-Brady detector.
1 Introduction
Computation time is an important performance characteristic of computer vision algorithms. We show how existing (slow) binary-valued classifiers (detectors) can be approximated by a trained WaldBoost detector [1], which minimises the decision time while guaranteeing a predefined approximation precision. The main idea is to look at an existing algorithm as a black box performing some useful binary decision task and to train a sequential classifier to emulate its behaviour. We show how two interest point detectors, Hessian-Laplace [2] and the Kadir-Brady [3] saliency detector, can be emulated by a sequential WaldBoost classifier [1]. However, the approach is very general and is applicable in other areas as well (e.g. texture analysis, edge detection). The main advantage of the approach is that instead of spending man-months on optimising and finding a fast and still precise enough approximation to the original algorithm (which can sometimes be very difficult for humans), the main effort is put into finding a suitable set of features, which are then automatically combined into a WaldBoost ensemble. Another motivation could be an automatic speed-up of a slow implementation of one's own detector. A classical approach to optimisation of time-to-decision is to speed up an already working approach. This includes heuristic code optimisations (e.g. FastSIFT [4] or SURF [5]) but also very profound changes of architecture (e.g. the classifier cascade [6]). A less common way is to formalise the problem and try to solve the error/time trade-off in a single optimisation task.
Fig. 1. The proposed learning scheme
Our contribution is a proposal of a general framework for speeding up existing algorithms by a sequential classifier learned by the WaldBoost algorithm. Two interest point detectors were selected to demonstrate the approach. The experiments show a significant speed-up of the emulated algorithms while achieving comparable detection characteristics. There has been much work on the interest point detection problem [7] but, to our knowledge, learning techniques have been applied only to subproblems and not to the interest point detection as a whole. Lepetit and Fua [8] treated matching of detected points of interest as a classification problem, learning the descriptor. Rosten and Drummond [9] used learning techniques to find the parameters of a hand-designed tree-based Harris corner classifier. Their motivation was to speed up the detection process, but the approach is limited to Harris corner detection. Martin et al. [10] learned a classifier for edge detection, but without considering the decision time and with significant manual tuning. Nevertheless, they tested a number of classifier types and concluded that a boosted classifier was comparable in performance to the other classifiers and was preferable for its low model complexity and low computational cost. The rest of the paper is structured as follows. The approximation of a black-box binary-valued decision algorithm by a WaldBoost classifier is discussed in §2. Application of the approach to interest point detectors is described in §3. Experiments are given in §4 and the paper is concluded in §5.
2 Emulating a Binary-Valued Black Box Algorithm with WaldBoost
The structure of the approach is shown in Figure 1. The black box algorithm is any binary-valued decision algorithm, whose positive and negative outputs form a labelled training set. The WaldBoost learning algorithm builds a classifier sequentially, and when new training samples are needed, it bootstraps the training set by running the black box algorithm on new images. Only the samples not yet decided by the so-far trained classifier are used for training. The result of the process is a WaldBoost sequential classifier which emulates the original black box algorithm.
The bootstrapping loop exploits the fact that the black box algorithm can provide a practically unlimited amount of training data. This is in contrast to the commonly used human-labelled data, which are difficult to obtain. Next, a brief overview of the WaldBoost learning algorithm is presented.

2.1 WaldBoost
WaldBoost [1] is a greedy learning algorithm which finds a quasi-optimal sequential strategy for a given binary-valued decision problem. WaldBoost finds a sequential strategy S* such that

    S* = arg min_S T̄_S    subject to    β_S ≤ β,  α_S ≤ α    (1)
for specified α and β, where T̄_S is the average time-to-decision, α_S is the false negative rate and β_S the false positive rate of the sequential strategy S. A sequential strategy is any algorithm (in our case a classifier) which evaluates one measurement at a time. Based on the set of measurements obtained up to that time, it either decides for one of the classes or postpones the decision. In the latter case, the decision process continues by taking another measurement. To find the optimal sequential strategy S*, the WaldBoost algorithm combines the AdaBoost algorithm [11] for feature (measurement) selection and Wald's sequential probability ratio test (SPRT) [12] for finding the thresholds which are used for making the decisions. The input of the algorithm is a labelled training set of positive and negative samples, a set of features F – the building blocks of the classifier – and the bounds on the final false negative rate α and false positive rate β. The output is an ordered set of weak classifiers h^(t), t ∈ {1, ..., T}, each corresponding to one feature, and a set of thresholds θ_A^(t), θ_B^(t) on the response of the strong classifier for all lengths t. During the evaluation of the classifier on a new observation x, one weak classifier is evaluated at time t and its response is added to the response function

    f_t(x) = Σ_{q=1}^{t} h^(q)(x).    (2)

The response function f_t is then compared to the corresponding thresholds and the sample is either classified as positive or negative, or the next weak classifier is evaluated and the process continues:

    H_t(x) = +1,        if f_t(x) ≥ θ_B^(t)
             −1,        if f_t(x) ≤ θ_A^(t)                   (3)
             continue,  if θ_A^(t) < f_t(x) < θ_B^(t).

If a sample x is not classified even after evaluation of the last weak classifier, a threshold γ is imposed on the real-valued response f_T(x).
Early decisions made during classifier evaluation in training also affect the training set. Whenever a part of the training set is removed according to eq. (3), new training samples are collected (bootstrapped) from yet unseen images. In the experiments we use the same asymmetric version of WaldBoost as used in [1]. When the β parameter is set to zero, the strategy becomes

    H_t(x) = −1,        if f_t(x) ≤ θ_A^(t)                   (4)
             continue,  if θ_A^(t) < f_t(x),

and only decisions for the negative class are made during the sequential evaluation of the classifier. A (rare) positive decision can only be reached after evaluating all T classifiers in the ensemble. In the context of fast black box algorithm emulation, what distinguishes training for different algorithms is the feature set F. A suitable set has to be found for every algorithm. Hence, instead of optimising the algorithm itself, the main burden of development lies in finding a proper set F. The set F can be very large if one is not sure which features are best; the WaldBoost algorithm selects a suitable subset together with optimising the time-to-decision.
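The sequential evaluation rules (2)–(4) can be sketched as follows (a minimal illustration assuming Python; the weak classifiers and thresholds are taken as given, e.g. produced by WaldBoost training, so this is not the authors' implementation):

```python
def waldboost_evaluate(x, weak_classifiers, theta_A, theta_B, gamma):
    """Sequentially evaluate a WaldBoost classifier on a sample x.

    weak_classifiers: list of functions h_t(x) -> float
    theta_A, theta_B: per-step decision thresholds (theta_B[t] may be +inf
                      in the asymmetric version (4))
    gamma: final threshold on f_T(x) if no early decision is made
    """
    f = 0.0
    for t, h in enumerate(weak_classifiers):
        f += h(x)                      # response function (2)
        if f >= theta_B[t]:            # early positive decision, rule (3)
            return +1
        if f <= theta_A[t]:            # early negative decision, rules (3)/(4)
            return -1
    return +1 if f >= gamma else -1    # final decision on f_T(x)
```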
3 Emulated Scale Invariant Interest Point Detectors
In order to demonstrate the approach, two similarity invariant interest point detectors have been chosen to be emulated: (i) the Hessian-Laplace [2] detector, which is a state-of-the-art similarity invariant detector, and (ii) the Kadir-Brady [3] saliency detector, which has been found valuable for categorisation but is about 100× slower. Binaries of both detectors are publicly available (http://www.robots.ox.ac.uk/~vgg/research/affine/). We follow the standard test protocols for evaluation as described in [7]. Both detectors are similarity invariant (not affine), which is easily implemented via a scanning window over positions and scales plus a sequential test. For both detectors, the set F contains the Haar-like features proposed by Viola and Jones [6], plus a centre-surround feature from [13], which has been shown to be useful for blob-like structure detectors [4]. Haar-like features were chosen for their high evaluation speed (due to the integral image representation) and since they have the potential to emulate the Hessian-Laplace detections [4]. For the Kadir-Brady saliency detector emulation, however, the Haar-like features turned out not to be able to emulate the entropy-based detections. To overcome this, and still keep the efficiency high, "energy" features based on integral images of squared intensities were introduced; they represent the intensity variance in a given rectangle. To collect positive and negative samples for training, the corresponding detector is run on a set of images of various sizes and content. The considered detectors assign a scale to each detected point, and square patches of twice the scale are used as positive samples. The negative samples, representing the "background" class, are collected from the same images at positions and scales not covered by positive samples.
Fig. 2. The non-maximum suppression algorithm scheme for two detections
Setting α. There is no error-free classification: the positive and negative classes overlap heavily in feature space. As a consequence, the WaldBoost classifier responds at many positions and scales – false positives. One way of removing less reliable detections is to threshold the final response function f_T at some higher value γ. This would lead to fewer false positives, more false negatives and a very slow classifier (the whole classifier would be evaluated for most samples). A better option is to set α to a higher value and let the training prune the negative class sequentially. Again, this results in fewer false positives and a controllable amount of false negatives. Additionally, the classifier becomes much faster due to early decisions.

An essential part of a detector is the non-maximum suppression algorithm. Here the output differs from that obtained from the original detectors. Instead of having a real-valued map over the whole image, sparse responses are returned by the WaldBoost detector due to early decisions – the value of f_t, t < T, available for early decisions is not comparable to f_T of positive detections. Thus a typical cubic interpolation and local maximum search cannot be applied. Instead, the following algorithm is used. Any two detections are grouped together if their overlap is higher than a given threshold (a parameter of the application), and only the detection with maximal f_T in each group is preserved. The overlap computation is schematically shown in Figure 2. Each detection is represented by a circle inscribed in the box (scanning window) reported as a detection (Figure 2, left). For two such circles, let us denote the radius of the smaller circle by r and the radius of the bigger one by R. The distance of the circle centres is denoted by dc. The following approximation to the actual circle overlap is used to avoid computationally demanding goniometric functions. The measure has an easy interpretation in two cases. First, when the circle centres coincide, the overlap is approximated as r/R; it equals one for two circles of the same radius and decreases as the radii become different. Second, when two circles have just one point in common (dc = r + R), the overlap is zero. These two situations are marked in Figure 2, right, by blue dots. Linear interpolation (blue solid line in Figure 2, right) is used to approximate the overlap between these two states. Given two radii r and R, where r ≤ R, and circle centre distance dc, the overlap o is computed as
    o = (r / R) · (1 − dc / (r + R)).
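A minimal sketch of a greedy variant of this grouping-based non-maximum suppression (illustrative Python only; the detection representation is an assumption, not the authors' code):

```python
def overlap(c1, c2):
    """Approximate overlap of two detections given as (x, y, radius)."""
    r, R = sorted((c1[2], c2[2]))
    dc = ((c1[0] - c2[0]) ** 2 + (c1[1] - c2[1]) ** 2) ** 0.5
    return max(0.0, (r / R) * (1.0 - dc / (r + R)))

def non_max_suppress(detections, threshold):
    """Keep the strongest detection among overlapping ones.
    detections: list of ((x, y, radius), f_T) pairs."""
    detections = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    for circle, score in detections:
        # keep a detection only if it does not overlap a stronger kept one
        if all(overlap(circle, k[0]) <= threshold for k in kept):
            kept.append((circle, score))
    return kept
```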
4 Experiments
This section describes experiments with the two WaldBoost-emulated detectors: the Hessian-Laplace [2] and the Kadir-Brady [3] saliency detector. The Hessian-Laplace detector is expected to be easy to emulate due to its blob-like detections, which keeps the first experiment more transparent. The Kadir-Brady detector is more complex due to its entropy-based detections. Kadir-Brady shows rather poor results in classical repeatability tests [7] but has been successfully used in several recognition tasks [14]. However, its main weakness for practical applications is its very long computation time (on the order of minutes per image!).

4.1 Hessian-Laplace Emulation
The training set for the WaldBoost emulation of Hessian-Laplace is created from 36 images of various sizes and content (nature, urban environment, hand drawn, etc.) as described in §3. The Hessian-Laplace detector is used with threshold 1000 to generate the training set, and the same threshold is used throughout all the experiments for both learning and evaluation. Training has been run for T = 20 (training steps) with α = 0.2 and β = 0. The higher α allows fast pruning of less trustworthy detections during sequential evaluation of the detector. The detector has been assessed in the standard tests proposed by Mikolajczyk et al. [7]. First, the repeatability of the trained WaldBoost detector has been compared with the original Hessian-Laplace detector on several image sequences with variations in scale and rotation. The results on two selected sequences, boat and east south from [15], are shown in Figure 3 (top row). The WaldBoost detector achieves similar repeatability to the original Hessian-Laplace detector. In order to test the trained detectors for their applicability, a matching application scenario is used. To that effect, a slightly different definition of matching score is used than that of Mikolajczyk [7]. The matching score as defined in [7] is computed as the number of correct matches divided by the smaller number of correspondences in the common part of the two images. However, the matches are computed only pairwise for correspondences determined by the geometry ground truth. Here, the same definition of the matching score is used, but the definition of a correct match differs: first, tentative matches using SIFT are computed and mutually nearest matches are found; these matches are then verified by the geometry ground truth, and only the verified matches are called correct. A comparison of the trainer and trainee outputs on two sequences is given in Figure 3 (bottom row). The WaldBoost detector achieves a similar matching score on both sequences while producing consistently more detections and matches.
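The matching score used above can be sketched as follows (an illustrative Python outline; the descriptor matrices and the ground-truth verification predicate is_correct are hypothetical helpers introduced here for illustration, not part of the paper):

```python
import numpy as np

def mutual_nearest_matches(desc_a, desc_b):
    """Index pairs (i, j) where descriptors are mutually nearest neighbours."""
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    nn_ab = dists.argmin(axis=1)          # nearest in B for each descriptor in A
    nn_ba = dists.argmin(axis=0)          # nearest in A for each descriptor in B
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]

def matching_score(desc_a, desc_b, is_correct, n_corr_a, n_corr_b):
    """#correct matches / smaller number of correspondences in the common part.
    is_correct(i, j): hypothetical geometry ground-truth verification."""
    matches = mutual_nearest_matches(desc_a, desc_b)
    n_correct = sum(1 for i, j in matches if is_correct(i, j))
    return n_correct / min(n_corr_a, n_corr_b)
```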
Fig. 3. Comparison of Hessian-Laplace detector and its WaldBoost emulation. Top row: Repeatability on boat (a) and east south (c) sequences and corresponding number of detections (b), (d). Bottom row: Matching score (e), (g) and corresponding number of correct matches (f), (h) on the same sequences.
Fig. 4. First centre-surround and energy feature found in WaldBoost Hessian-Laplace (left) and Kadir-Brady (right) emulated detector. The underlying image is generated as E(|xi − 127.5|) and E(xi ) respectively, where E() is the average operator and xi is the i-th positive training example.
The WaldBoost classifier evaluates on average 2.5 features per examined position and scale. This is much less than any reported speed for face detection [1]. The evaluation times are compared in Table 1; the WaldBoost emulation speed is comparable to the manually tuned Hessian-Laplace detector. The Hessian-Laplace detector finds blob-like structures, and the structure of the trained WaldBoost emulation should reflect this property. As shown in Figure 4, the first selected feature is of the centre-surround type, which gives high responses to blob-like structures. The outputs of the trained WaldBoost emulation of Hessian-Laplace and the original algorithm are compared in Figure 5. To find the original Hessian-Laplace detections correctly found by the WaldBoost emulator, correspondences based on Mikolajczyk's overlap criterion [7] have been found between the original and WaldBoost detections. The white circles show repeated correspondences; the black circles show the detections not found by the WaldBoost emulation. Note that most of the missed detections have a correct detection nearby, so the corresponding image structure is actually found.
Fig. 5. Comparison of the outputs of the original and WaldBoost-emulated (a) Hessian-Laplace and (b) Kadir-Brady saliency detectors. The white circles show repeated detections. The black circles highlight the original detections not found by the WaldBoost detector. Note that for most of the missed detections there is a nearby detection on the same image structure. The accuracy of the emulation is 85% for Hessian-Laplace and 96% for the Kadir-Brady saliency detector. Note that the publicly available Kadir-Brady algorithm does not detect points close to image edges.
The percentage of repeated detections of the original algorithm is 85%. To conclude, the WaldBoost emulator of the Hessian-Laplace detector is able to detect points with similar repeatability and matching score, while its speed is comparable to that of the original algorithm. This indicates that the proposed approach is able to minimise the decision time down to the speed of a manually tuned algorithm.

4.2 Fast Saliency Detector
The emulation of the Kadir-Brady saliency detector [3] was trained on the same set of images as the WaldBoost Hessian-Laplace emulator. The saliency threshold of the original detector was set to 2 to limit the positive examples to those with higher saliency. Note that, as opposed to the Hessian-Laplace emulation where a rather low threshold was chosen, it is meaningful to use only the most salient features from the Kadir-Brady detector. This is not true for the Hessian-Laplace detector, since its response does not correspond to the importance of the feature. The Haar-like feature set was extended by the "energy" feature described in §3. The training was run for T = 20 (training steps) with α = 0.2 and β = 0. The same experiments as for the Hessian-Laplace detector have been performed. The repeatability and the matching score of the Kadir-Brady detector and its WaldBoost emulation on the boat and east south sequences are shown in Figure 6. The trained detector performs slightly better than the original one.
Fig. 6. Comparison of Kadir-Brady detector and its WaldBoost emulation. Top row: Repeatability on boat (a) and east south (c) sequences and corresponding number of detections (b), (d). Bottom row: Matching score (e), (g) and corresponding number of correct matches (f), (h) on the same sequences.

Table 1. Speed comparison on the first image (850×680) from the boat sequence

                   original   WaldBoost
  Hessian-Laplace  1.3s       1.3s
  Kadir-Brady      1m 44s     1.4s
The main advantage of the emulated saliency detector is its speed. The classifier evaluates on average 3.7 features per examined position and scale. Table 1 shows that the emulated detector is 70× faster than the original detector. Our early experiments showed that the Haar-like features are not suitable for emulating the entropy-based saliency detector. With the energy features, the training was able to converge to a reasonable classifier; in fact, an energy feature is chosen as the first weak classifier in the WaldBoost ensemble (see Figure 4). The outputs of the WaldBoost saliency detector and the original algorithm are compared in Figure 5. The coverage of the original detections is 96%. To conclude, the Kadir-Brady emulation gives slightly better repeatability and matching score, and, most importantly, the decision times of the emulated detector are about 70× lower than those of the original algorithm. This opens new possibilities for using the Kadir-Brady detector in time-sensitive applications.
5 Conclusions and Future Work
In this paper, a general learning framework for speeding up existing binary-valued decision algorithms by a sequential classifier learned by the WaldBoost algorithm has been proposed. Two interest point detectors, the Hessian-Laplace and the Kadir-Brady saliency detector, have been used as black box algorithms and emulated by the WaldBoost algorithm.
The experiments show similar repeatability and matching scores of the original and emulated algorithms. The speed of the Hessian-Laplace emulator is comparable to the original manually tuned algorithm, while the Kadir-Brady detector was sped up seventy times. The proposed approach is general and can be applied to other algorithms as well. For future research, an interesting extension of the proposed approach would be to train an emulator that not only matches the outputs of an existing algorithm but also has some additional quality, such as higher repeatability or specialisation to a given task.
Acknowledgement. The authors were supported by Czech Science Foundation Project 102/07/1317 (JM) and by EC project FP6-IST-027113 eTRIMS (JŠ).
References

1. Šochman, J., Matas, J.: WaldBoost - learning for time constrained sequential detection. In: CVPR, Los Alamitos, USA, vol. 2, pp. 150–157 (2005)
2. Mikolajczyk, K., Schmid, C.: Scale and affine invariant interest point detectors. IJCV 60(1), 63–86 (2004)
3. Kadir, T., Brady, M.: Saliency, scale and image description. IJCV 45(2) (2001)
4. Grabner, M., Grabner, H., Bischof, H.: Fast approximated SIFT. In: ACCV, pp. I:918–927 (2006)
5. Bay, H., Tuytelaars, T., Van Gool, L.: SURF: Speeded up robust features. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, Springer, Heidelberg (2006)
6. Viola, P., Jones, M.: Robust real time object detection. In: SCTV, Vancouver, Canada (2001)
7. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Van Gool, L.: A comparison of affine region detectors. In: IJCV (2005)
8. Lepetit, V., Lagger, P., Fua, P.: Randomized trees for real-time keypoint recognition. In: CVPR, vol. II, pp. 775–781 (2005)
9. Rosten, E., Drummond, T.: Machine learning for high-speed corner detection. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 430–443. Springer, Heidelberg (2006)
10. Martin, D., Fowlkes, C., Malik, J.: Learning to detect natural image boundaries using local brightness, color, and texture cues. PAMI 26(5), 530–549 (2004)
11. Schapire, R.E., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3), 297–336 (1999)
12. Wald, A.: Sequential analysis. Dover, New York (1947)
13. Lienhart, R., Maydt, J.: An extended set of Haar-like features for rapid object detection. In: ICIP (2002)
14. Fergus, R., Perona, P., Zisserman, A.: A sparse object category model for efficient learning and exhaustive recognition. In: CVPR, vol. 1, pp. 380–387 (2005)
15. Mikolajczyk, K.: Detection of local features invariant to affine transformations. PhD thesis, INPG, Grenoble (2002)
Multiplexed Illumination for Measuring BRDF Using an Ellipsoidal Mirror and a Projector

Yasuhiro Mukaigawa, Kohei Sumino, and Yasushi Yagi
The Institute of Scientific and Industrial Research, Osaka University
Abstract. Measuring a bidirectional reflectance distribution function (BRDF) requires a long time, because a target object must be illuminated from all incident angles and the reflected light must be measured from all reflected angles. A high-speed method is presented that measures BRDFs using an ellipsoidal mirror and a projector. The method can change incident angles without a mechanical drive. Moreover, it is shown that the dynamic range of the measured BRDF can be significantly increased by multiplexed illumination based on the Hadamard matrix.
1 Introduction
In recent years, the measurement of geometric information (3D shape) has become easier through the use of commercial range-finders. However, the measurement of photometric information (reflectance properties) is still difficult. Reflection properties depend on the microscopic shape of the surface, and they can be used for a variety of applications such as computer graphics and the inspection of painted surfaces. However, there is no standard way of measuring reflection properties. The main reason for this is that the dense measurement of BRDFs requires huge amounts of time, because a target object must be illuminated from every incident angle and the reflected light must be measured from every reflected angle. Most existing methods use mechanical drives to rotate a light source, and as a result the measuring time becomes very long. In this paper, we present a new method to measure BRDFs rapidly. Our system substitutes an ellipsoidal mirror for a mechanical drive, and a projector for a light source. Since our system completely excludes mechanical drives, high-speed measurement can be realised. Moreover, we present an algorithm that improves the dynamic range of measured BRDFs. The combination of a projector and an ellipsoidal mirror can produce any illumination; hence, the dynamic range is significantly increased by multiplexing the illumination based on the Hadamard matrix, while the capturing time remains the same as for normal illumination.
2 Related Work
If the reflectance is uniform over the surface, the measurement becomes easier by merging the BRDFs at every point. Matusik et al. [1] measured isotropic BRDFs by capturing a sphere.
Table 1. Comparison of major BRDF measuring devices

  Device                 Camera               Light source            Density of BRDF
  Li [5]                 mechanical rotation  mechanical rotation     dense
  Dana [10]              fixed                mechanical translation  dense
  Müller [6], Han [12]   fixed                fixed                   sparse
  Our system             fixed                fixed                   dense
Anisotropic BRDFs can also be measured by capturing a plane [2] or a cylinder [3]. Marschner et al. [4] measured the BRDFs of general shapes using a 3D range sensor. However, these methods cannot measure spatially varying BRDFs. The most straightforward way to measure BRDFs is to use a gonioreflectometer, which allows a light source and a sensor to rotate around the target material. Li et al. [5] have proposed a three-axis instrument. However, because the angles need to be altered mechanically, the measurement of dense BRDFs takes a long time. To speed up the measurement, rotational mechanisms should be excluded. By placing many cameras and light sources around the target object, BRDFs can be measured for some of the angle combinations of the incident and reflective directions. Müller et al. [6] have constructed a system including 151 cameras with a flash. However, dense measurement is physically difficult. In the optics field, some systems that utilise catadioptric devices have been proposed. Davis and Rawling [7] have patented a system using an ellipsoidal mirror to collect the reflected light. Mattison et al. [8] have developed a handheld instrument based on the patent. The patent focuses only on gathering reflected light, and does not mention control of the incident direction. Although Ward [9] used a hemispherical half-mirror and a fish-eye lens, his system requires a rotational mechanism for the light source. Although Dana [10] used a paraboloidal mirror, a translational mechanism for the light source remains necessary. To avoid using mechanical drives, some systems include catadioptric devices. Kuthirummal et al. [11] used a cylindrical mirror, and Han et al. [12] combined a projector and several plane mirrors similar to those used in a kaleidoscope. However, these systems can measure only sparse BRDFs because the measurable incident and reflective angles are quite discrete. We, on the other hand, propose a new system that combines an ellipsoidal mirror and a projector. Since our system completely excludes mechanical devices, high-speed measurement is realised. The system can measure dense BRDFs because both the lighting direction and the viewing direction are densely changed.
3 BRDF

3.1 Isotropic and Anisotropic BRDFs
To represent reflection properties, a BRDF is used. The BRDF represents the ratio of outgoing radiance in the viewing direction (θr , φr ) to incident irradiance from a lighting direction (θi , φi ), as shown in Fig.2(a).
When a camera and a light source are fixed, the rotation of an object around the surface normal changes the appearance of some materials. Such reflection is called anisotropic reflection; typical materials of this type are brushed metals and cloth fabrics such as velvet and satin. To describe anisotropic reflection perfectly, the BRDF should be defined by four angle parameters. On the other hand, for many materials the appearance does not change under rotation around the surface normal. Such reflection is called isotropic reflection. If isotropic reflection can be assumed, the BRDF can be described using only three parameters, θi, θr, and φ (φ = φi − φr). If the number of parameters can be reduced from four to three, the measuring time and data size can be significantly reduced.

3.2 Problems with a 4-Parameter Description
There are two major problems associated with a 4-parameter description: data size and measuring time. First, let us consider the data size. If the angles θr, φr, θi, and φi are rotated at one-degree intervals, and the reflected light is recorded as R, G, and B colors for each angle, then the required data size becomes 360 × 90 × 360 × 90 × 3 = 3,149,280,000 bytes. A size of 3 GB is not impractical for recent PCs. Moreover, BRDFs can be effectively compressed because they include much redundancy. Therefore, data size is not a serious problem. On the other hand, the problem of measuring time remains serious. Since the number of combinations of lighting and sensing angles becomes extremely large, a long measuring time is required. If the sampling interval is one degree, the total number of combinations becomes 360 × 90 × 360 × 90 = 1,049,760,000. This means that it would require 33 years if it takes one second to measure one reflection color. Of course, the time can be shortened by using a high-speed camera, but the total time required would still remain impractical. While the problem of data size is not serious, the problem of measuring time warrants consideration. In this paper, we straightforwardly tackle the problem of measuring time for 4-parameter BRDFs by devising a catadioptric system.
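The back-of-the-envelope numbers above can be reproduced directly (a trivial Python check, included only to make the arithmetic explicit):

```python
angles = 360 * 90 * 360 * 90          # (theta_r, phi_r, theta_i, phi_i) at 1-degree steps
data_bytes = angles * 3               # one byte each for R, G, B
print(data_bytes)                     # 3,149,280,000 bytes, i.e. about 3 GB
seconds = angles * 1                  # one second per lighting/sensing combination
print(seconds / (3600 * 24 * 365))    # roughly 33 years
```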
4 BRDF Measuring System

4.1 Principle of Measurement
An ellipsoid has two focal points, and all rays from one focal point reflect off the ellipsoidal mirror and reach the other focal point. This property is used for measuring BRDFs. The target object is placed at one focal point, and a camera is placed at the other. Since rays reflected in all directions from the target object converge at a single point, all the rays can be captured at once. The most significant characteristic of the system is that an ellipsoidal mirror is combined with a projector. The projector serves as a substitute for the light source: the projection of a pattern in which only one pixel is white corresponds to illumination by a point light source, and changing the location of the white pixel corresponds to rotating the incident angle.
Fig. 1. The design of the BRDF measuring device
Since changing the projection pattern is faster than mechanical rotation, rapid and dense measurement can be achieved.

4.2 Design of the Measuring System
Based on the principle described in the previous section, we developed two BRDF measuring devices that have differently shaped ellipsoidal mirrors. One is a vertical setup in which a target material is placed perpendicular to the long axis, as shown in Fig. 1(a); the shape of the mirror is an ellipsoid that is cut perpendicularly to the long axis, as shown in Fig. 1(c). The other is a horizontal setup in which a target material is placed parallel to the long axis, as shown in Fig. 1(b); the shape of the mirror is an ellipsoid that is cut parallel and perpendicularly to the long axis, as shown in Fig. 1(d). The major optical devices are a projector, a digital camera, an ellipsoidal mirror, and a beam splitter. The illumination from the projector is reflected by the beam splitter and the ellipsoidal mirror, and finally illuminates a single point on the target object. The reflected light from the target object is again reflected by the ellipsoidal mirror and is recorded as an image. The vertical setup has the merit that the density of the BRDF is uniform along φ, because the long axis of the ellipsoid and the optical axes of the camera and projector coincide. Moreover, this kind of mirror is available commercially because it is often used as part of standard illumination devices. However, target materials must be cut into small facets to be placed at the focal point. On the other hand, the horizontal setup has the merit that the target materials do not have to be cut.
Fig. 2. Angular parameters of BRDF. (a) Four angle parameters. (b)(c) Relationship between the angles (θ, φ) and the image location.
Hence, the BRDFs of cultural heritage objects can be measured. However, the mirror must be specially made by a cutting operation.

4.3 Conversion Between Angle and Image Location
The lighting and viewing directions are specified as angles, while they are expressed as 2-D locations in the projection pattern or the captured image. The conversion between the angle and the image location is easy if geometric calibration is done for the camera and the projector. Figures 2 (b) and (c) illustrate the relationship between the angle and the image location for the vertical and horizontal setup, respectively.
5 Multiplexed Illumination
In this section, the problem of low dynamic range inherent in the projector-based system is clarified, and this problem is shown to be solved by multiplexed illumination based on the Hadamard matrix.

5.1 Dynamic Range Problem
There are two main reasons for the low dynamic range. One of these is the difference in intensities of specular and diffuse reflections. If a short shutter speed is used to capture the specular reflection without saturation, the diffuse reflection tends to be extremely dark, as shown in Fig. 3(a). Conversely, a long shutter speed that captures bright diffuse reflection creates saturation of the specular reflection, as shown in Fig. 3(b). This problem is not peculiar to our system, but is common to general image measurement systems. The other reason is peculiar to our system, which uses a projector for illumination. Generally, the intensity of a black pixel in the projection pattern is not perfectly zero: a projector emits a faint light even where the projection pattern is black. Even if the intensity of each pixel is small, the sum of the intensities converging on one point cannot be ignored. For example, let us assume that the contrast ratio of the projector is 1000:1 and the size of the projection pattern is 1024 × 768.
Fig. 3. The problem of low dynamic range
If 10 pixels in a projection pattern are white and the others are black, the intensity ratio of the white pixels to the black pixels is

    10 × 1000 : (1024 × 768 − 10) × 1 ≈ 1 : 79.    (1)
Thus the total intensity of the black pixels is larger than that of the white pixels. This means that the measured data include a large amount of unnecessary information, which should be ignored, as shown in Fig. 3(c). By subtracting the image that is captured when a uniform black pattern is projected, this unnecessary information can be eliminated. However, as only a few bits then remain to express the necessary information, a more radical solution is required.

5.2 Multiplexed Illumination
Optical multiplexing techniques have been investigated in the spectrometry field since the 1970s [13]. If the spectrum of a light beam is measured separately for each wavelength, each spectral signal becomes noisy; using optical multiplexing, multiple spectral components are measured simultaneously to improve the quality. In the computer vision field, Schechner et al. [14] applied the multiplexing technique to capture images under varying illumination. In this method, instead of illuminating each light source independently, multiplexed sets of light sources are illuminated simultaneously. From the captured images, the image illuminated by a single light source is calculated. Wenger et al. [15] evaluated the effects of noise reduction using multiplexed illumination. We briefly describe the principle of multiplexed illumination. Let us assume that there are n light sources, and let s denote the intensities of a point in the images when each light source is turned on in turn. The captured images are multiplexed by the weighting matrix W. The intensities m of the point under the multiplexed illumination are expressed by

    m = W s.    (2)

The intensities of the point under single illumination can then be estimated by

    s = W⁻¹ m.    (3)
In our BRDF measuring system, a projector is used instead of an array of light sources. Hence the weighting matrix W can be arbitrarily controlled.
It is known that if the components of the matrix W are −1 or 1, the Hadamard matrix is the best multiplexing weight [13]; in this case, the S/N ratio is increased by a factor of √n. The n × n Hadamard matrix satisfies

    Hₙᵀ Hₙ = n Iₙ,    (4)
where Iₙ denotes an n × n identity matrix. In fact, since negative illumination cannot be produced by a projector, the Hadamard matrix cannot be used directly. It is also known that if the components of the matrix W are 0 or 1, the S-matrix is the best multiplexing weight [13]; in this case, the S/N ratio is increased by a factor of √n/2. The S-matrix can be easily generated from the Hadamard matrix. Hence, the projection pattern is multiplexed using the S-matrix. Since the illumination can be controlled for each pixel using a projector, n becomes a large number and a dramatic improvement can be achieved.
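A minimal sketch of this scheme (illustrative Python with numpy/scipy; deriving the S-matrix from a normalized Hadamard matrix as below is one common convention and an assumption here, not the authors' implementation):

```python
import numpy as np
from scipy.linalg import hadamard

def s_matrix(n):
    """n x n S-matrix from a Sylvester Hadamard matrix of order n + 1.
    Works when n + 1 is a power of two; the paper's 191 x 191 S-matrix would
    require a different Hadamard construction (assumption noted in the text)."""
    H = hadamard(n + 1)            # entries +1/-1, first row and column all +1
    return (1 - H[1:, 1:]) // 2    # drop first row/column, map +1 -> 0, -1 -> 1

def demultiplex(M, S):
    """Recover single-source intensities from multiplexed measurements.
    M: (n, ...) stack of captured images, one per multiplexed pattern (m = W s).
    Returns the estimate s = W^{-1} m, applied pixel-wise."""
    S_inv = np.linalg.inv(S.astype(float))
    return np.tensordot(S_inv, M, axes=1)
```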
6 Experimental Results

6.1 BRDF Measuring Systems
We constructed BRDF measuring systems named RCGs (Rapid Catadioptric Gonioreflectometers) as shown in Figs.4 (c) and (d). The RCG-1 includes a PointGrey Flea camera, an EPSON EMP-760 projector, and a Melles Griot ellipsoidal mirror as shown in Fig.4 (a). The RCG-2 includes a Lucam Lu-160C camera and a TOSHIBA TDP-FF1A projector. The ellipsoidal mirror for the RCG-2 is designed so that BRDFs can be measured for all angles of θ within 0 ≤ φ ≤ 240 as shown in Fig.4 (b).
Fig. 4. The BRDF measuring systems
Fig. 5. Target materials: (a) velvet, (b) satin, (c) polyurethane, (d) copper
Fig. 6. BRDF of velvet and satin. (a)(b) examples of captured images, (c)(d) rendering result from measured BRDFs.
6.2 Measurement of Velvet and Satin
In this section, the BRDFs measured using the RCG-1 are shown. The target objects are velvet and satin, both of which have anisotropic reflections, as shown in Figs. 5(a) and (b). First, the measuring time is evaluated. The sampling interval was set to one degree. The pattern corresponding to the lighting direction θi = 30, φi = 250 was projected, and the reflected images were captured, as shown in Figs. 6(a) and (b) for velvet and satin, respectively. Note that some BRDFs could not be measured because the ellipsoidal mirror of the RCG-1 has a hole at the edge of the long axis. 360 × 90 = 32,400 images were captured for each material; the measuring time was about 50 minutes. Figures 6(c) and (d) are generated images of a corrugated plane that has the measured BRDFs of velvet and satin. The rendering process for this corrugated shape fortunately does not require the missing data. It can be seen that the characteristics of anisotropic reflection are reproduced.

6.3 Measurement of a Polyurethane Sphere
To evaluate the effectiveness of the multiplexed illumination, the isotropic BRDF of a polyurethane sphere was measured, as shown in Fig. 5(c). In this case, the lighting direction is varied by a 1-DOF rotation because of the isotropic reflection; that is, the azimuth angle φi is fixed and the elevation angle is varied over 0 ≤ θi ≤ 180. Figure 7(a) shows an example of multiplexed illumination by a 191 × 191 S-matrix, and (b) shows the captured image after subtracting an image captured while projecting a black pattern. The captured images for the lighting direction θi = 10, φi = 270 were compared under several conditions. Figures 8(a) and (b) show the distribution of the reflected light without multiplexing: (a) is a single captured image, while (b) is the average of ten captured images.
Fig. 7. An example of the multiplexed illumination. (a) Projected pattern multiplexed by a 191 × 191 S-Matrix, and (b) the captured image.
Fig. 8. The reflected light of the lighting direction (θl = 10, φl = 270): (a) single illumination without averaging, (b) single illumination with averaging, (c) multiplexed illumination without averaging, (d) multiplexed illumination with averaging
The captured images are very noisy even with the averaging process. Figures 8(c) and (d) show the results with multiplexing. A sequence of multiplexed illumination patterns was projected, and the distribution of the reflected light corresponding to the same lighting direction was estimated. As before, (c) is the result without averaging, while (d) is the result of averaging ten images. Obviously, the noise is dramatically decreased under the multiplexed illumination. To examine the spatial distribution of the reflected light, the changes in intensity along y = 60, x = 30–200 are plotted in Fig. 9: (a) shows the intensities without averaging, while (b) shows those with averaging over ten images. In the graphs, blue and red lines represent the distributions with and without multiplexing, respectively. It is interesting that the result of multiplexing without averaging is more accurate than the result of single illumination with averaging. While the time taken for capturing images is ten times greater for the averaging process, the multiplexed illumination improves accuracy without increasing the capturing time. Figure 10 shows rendered images of a sphere and a corrugated surface using the BRDF measured by multiplexed illumination with averaging. Compared with the real sphere, the distribution of the specular reflection is slightly wide.
Fig. 9. The distribution of the intensities (brightness vs. pixel position) along the sampled line: (a) without averaging, (b) with averaging ten images.
Fig. 10. Rendering results of the pink polyurethane sphere
perfectly on the target material, because the alignment of the optical devices is not perfect. Since the target object is a sphere, the normal direction varies with the measuring point; as a result, the overly wide specular reflections may be generated incorrectly. Unnatural reflections were also observed in the upper area of Figures 8(c) and (d). This problem may be caused by errors in cutting the ellipsoidal mirror. Therefore, the BRDF at θ = 65 is substituted for the missing data at θ > 65. One of our future aims is to improve the accuracy of the optical devices.
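To make the demultiplexing step concrete, the following is a minimal sketch (not the authors' implementation) of how single-source images can be recovered from S-matrix multiplexed captures, assuming the captured images are already black-pattern subtracted and stacked in the same order as the S-matrix rows; it uses the closed-form S-matrix inverse known from Hadamard transform optics [13].

```python
import numpy as np

def s_matrix_inverse(S):
    """Closed-form inverse of an n x n S-matrix (0/1 entries):
    S^{-1} = 2/(n+1) * (2*S.T - J), with J the all-ones matrix."""
    n = S.shape[0]
    return (2.0 / (n + 1)) * (2.0 * S.T - np.ones((n, n)))

def demultiplex(captured, S):
    """captured: stack of n multiplexed images, shape (n, H, W),
    already black-pattern subtracted.  Returns the n estimated
    single-source images (one per lighting direction)."""
    n, H, W = captured.shape
    Y = captured.reshape(n, -1)      # each row: one multiplexed image
    X = s_matrix_inverse(S) @ Y      # recover single-source responses
    return X.reshape(n, H, W)
```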
6.4 Measurement of a Copper Plate
The isotropic BRDFs of a copper plate were measured as shown in Fig. 5(d). Metal is the most difficult material for which to measure BRDFs accurately, because the intensity levels of specular and diffuse reflections are vastly different. Figure 11 shows the rendering results of a corrugated surface using the measured BRDFs: (a) and (b) show the results of single illumination, while (c) and (d) show those of multiplexed illumination. Since a fast shutter speed is used when measuring BRDFs to avoid saturation, the captured images are very dark; hence, the rendering results are brightened in this figure. In the rendered images of (a) and (b), incorrect colors such as red or blue are observed. These incorrect colors seem to be the result of magnifying noise during the brightening process. On the other hand, noise is drastically decreased in the rendered images of (c) and (d), which use multiplexed illumination.
Fig. 11. Comparison of the rendered results of the copper plate. (a) and (b) Single illumination. (c) and (d) Multiplexed illumination.
Although the dynamic range of the measured BRDFs is suitably widened, some noise is still observed in the rendered images. Ratner et al. [16] pointed out the fundamental limitation of Hadamard-based multiplexing. The dynamic range problem can also be reduced by combining several images captured with varying shutter speeds [17], although this increases the capturing time.
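For reference, the exposure-combination idea of [17] can be sketched as follows; the linear-response assumption, the hat-shaped weighting, and the function name are illustrative choices of ours, not details taken from [17].

```python
import numpy as np

def merge_exposures(images, exposure_times):
    """images: list of float arrays in [0, 1], one per shutter speed,
    assumed radiometrically linear.  Returns a radiance map: a weighted
    average of (pixel value / exposure time), down-weighting pixels that
    are near-saturated or near-black."""
    acc = np.zeros_like(images[0])
    wsum = np.zeros_like(images[0])
    for img, t in zip(images, exposure_times):
        w = 1.0 - np.abs(2.0 * img - 1.0)   # 0 at the extremes, 1 at mid-gray
        acc += w * img / t
        wsum += w
    return acc / np.maximum(wsum, 1e-8)
```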
7 Conclusion
In this paper, we proposed a new high-speed BRDF measurement method that combines an ellipsoidal mirror with a projector, and solved the low-dynamic-range problem by applying multiplexed illumination to the pattern projection. Two BRDF measuring systems were developed, which include differently shaped ellipsoidal mirrors. The proposed systems can measure complex reflection properties including anisotropic reflection. Moreover, the measuring time of BRDFs is significantly shortened by the exclusion of any mechanical device. This paper focused only on the BRDF measuring speed of the developed systems; the accuracy of the measured BRDFs still needs to be evaluated. For this evaluation, we are attempting to compare the measured BRDFs with ground truth obtained from reflectance standards whose reflection properties are known.

This research was partially supported by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Young Scientists (A).
References
1. Matusik, W., Pfister, H., Brand, M., McMillan, L.: A Data-Driven Reflectance Model. In: Proc. SIGGRAPH 2003, pp. 759–769 (2003)
2. Karner, K.F., Mayer, H., Gervautz, M.: An image based measurement system for anisotropic reflection. Computer Graphics Forum (Eurographics 1996 Proceedings) 15(3), 119–128 (1996)
3. Lu, R., Koenderink, J.J., Kappers, A.M.L.: Optical Properties (Bidirectional Reflection Distribution Functions) of Velvet. Applied Optics 37(25), 5974–5984 (1998)
4. Marschner, S.R., Westin, S.H., Lafortune, E.P.F., Torrance, K.E., Greenberg, D.P.: Image-Based BRDF Measurement Including Human Skin. In: Proc. 10th Eurographics Workshop on Rendering, pp. 139–152 (1999)
5. Li, H., Foo, S.C., Torrance, K.E., Westin, S.H.: Automated three-axis gonioreflectometer for computer graphics applications. In: Proc. SPIE, vol. 5878, pp. 221–231 (2005)
6. Müller, G., Bendels, G.H., Klein, R.: Rapid Synchronous Acquisition of Geometry and Appearance of Cultural Heritage Artefacts. In: VAST 2005, pp. 13–20 (2005)
7. Davis, K.J., Rawlings, D.C.: Directional reflectometer for measuring optical bidirectional reflectance. United States Patent 5637873 (June 1997)
8. Mattison, P.R., Dombrowski, M.S., Lorenz, J.M., Davis, K.J., Mann, H.C., Johnson, P., Foos, B.: Handheld directional reflectometer: an angular imaging device to measure BRDF and HDR in real time. In: Proc. SPIE, vol. 3426, pp. 240–251 (1998)
9. Ward, G.J.: Measuring and Modeling anisotropic reflection. In: Proc. SIGGRAPH 1992, pp. 255–272 (1992)
10. Dana, K.J., Wang, J.: Device for convenient measurement of spatially varying bidirectional reflectance. J. Opt. Soc. Am. A 21(1), 1–12 (2004)
11. Kuthirummal, S., Nayar, S.K.: Multiview Radial Catadioptric Imaging for Scene Capture. In: Proc. SIGGRAPH 2006, pp. 916–923 (2006)
12. Han, J.Y., Perlin, K.: Measuring Bidirectional Texture Reflectance with a Kaleidoscope. ACM Transactions on Graphics 22(3), 741–748 (2003)
13. Harwit, M., Sloane, N.J.A.: Hadamard Transform Optics. Academic Press, London (1973)
14. Schechner, Y.Y., Nayar, S.K., Belhumeur, P.N.: A Theory of Multiplexed Illumination. In: Proc. ICCV 2003, pp. 808–815 (2003)
15. Wenger, A., Gardner, A., Tchou, C., Unger, J., Hawkins, T., Debevec, P.: Performance Relighting and Reflectance Transformation with Time-Multiplexed Illumination. In: Proc. SIGGRAPH 2005, pp. 756–764 (2005)
16. Ratner, N., Schechner, Y.Y.: Illumination Multiplexing within Fundamental Limits. In: Proc. CVPR 2007 (2007)
17. Debevec, P., Malik, J.: Recovering High Dynamic Range Radiance Maps from Photographs. In: Proc. SIGGRAPH 1997, pp. 369–378 (1997)
Analyzing the Influences of Camera Warm-Up Effects on Image Acquisition

Holger Handel

Institute for Computational Medicine (ICM), Univ. of Mannheim, B6, 27-29, 69131 Mannheim, Germany
[email protected]
Abstract. This article presents an investigation of the impact of camera warm-up on the image acquisition process and therefore on the accuracy of segmented image features. Based on an experimental study we show that the camera image is shifted by a few tenths of a pixel after camera start-up. The drift correlates with the temperature of the sensor board and stops when the camera reaches its thermal equilibrium. A further study of the observed image flow shows that it originates from a slight displacement of the image sensor due to thermal expansion of the mechanical components of the camera. This sensor displacement can be modeled using standard methods of projective geometry together with bi-exponential decay terms to model the temporal dependence. The parameters of the proposed model can be calibrated and then used to compensate for warm-up effects. Further experimental studies show that our method is applicable to different types of cameras and that the warm-up behaviour is characteristic for a specific camera.
1 Introduction

In the last couple of years much work has been done on camera modeling and calibration (see [1], [2], [3], [4] to mention a few). The predominant way to model the mapping from 3D world space to 2D image space is the well-known pinhole camera model. The ideal pinhole camera model has been extended with additional parameters to account for radial and decentering distortion ([5], [6], [7]) and even sensor unflatness [8]. These extensions have led to a more realistic and thus more accurate camera model (see Weng et al. [5] for an accuracy evaluation). Besides these purely geometrical aspects of the imaging process, additional work has also been done on the electrical properties of the camera sensor and their influence on the image acquisition process. Some relevant variables are dark current, fixed pattern noise and line jitter ([9], [10], [11]). An aspect that has rarely been studied is the effect of camera warm-up on the imaging process. Beyer [12] reports a drift of measured image coordinates of some tenths of a pixel during the first hour after camera start-up. Wong et al. [13] and Robson et al. [14] also report such an effect. All of them only report drift distortions due to camera warm-up, but give neither an explanation of the origins of the observed image drift nor any way to model and compensate for these distortions. Today machine vision techniques are widely used in sensitive areas like industrial production and medical intervention, where errors of some tenths of a pixel in image feature
segmentation caused by sensor warm-up can result in significant reconstruction errors. In [15] measurement drifts of an optical tracking system of up to 1 mm during the first 30 minutes after start-up are reported. In many computer-assisted surgery applications such reconstruction errors are intolerable. Thus, a better understanding of the impact of camera warm-up on the image acquisition process is crucial. In this paper we investigate the influence of camera warm-up on the imaging process. We will show that the coordinates of segmented feature points are corrupted by a drift movement the image undergoes during camera warm-up. In our opinion this drift is caused by thermal expansion of a camera's sensor board, which results in a slight displacement of the sensor chip. We develop a model for the image plane movement which can be used to compensate for distortions in image segmentation during a camera's warm-up period. Finally, we provide further experimental results confirming the applicability of our method. The paper is organized as follows. Section 2 describes the experimental setup and the image segmentation methods from which we have observed the warm-up drift. Section 3 presents our model of warm-up drift and a way to calibrate the relevant parameters. Furthermore, a procedure is described to compensate for the image drift which fits easily into the widely used distortion correction models. Section 4 provides further experiments with different types of cameras.
2 Observing Warm-Up Drift

To analyze the impact of temperature change after camera start-up, a planar test field consisting of 48 white circular targets printed on a black metal plate is mounted in front of a camera (equipped with a 640 × 480 CMOS sensor). The test pattern is arranged to cover the entire field of view of the camera. The complete setup is rigidly fixed. The center points of the targets are initially segmented using a threshold technique. The coordinates of the target centers are refined using a method described in [16]. For each target the gray values along several rays beginning at the initial center are sampled until a gray value edge is detected. The position of the found edge is further refined to sub-pixel precision using moment preservation [17]. Next, a circle is fitted to the found sub-pixel edge points for each target using least-squares optimization. The centers of the fitted circles are stored together with the current time elapsed since camera start-up. The segmentation process is continuously repeated and stopped after approximately 45 minutes. At the same time the temperature on the sensor board is measured. The resulting data set has a temporal resolution of approximately three seconds. Since the relative position between the test pattern and the camera is fixed, the coordinates of the segmented target centers are not expected to vary systematically over time, except for noise. The results of the experiment are shown in Fig. 1 and Fig. 2.
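The least-squares circle fit used for target-center refinement can be sketched as below; the algebraic (Kasa-style) formulation is our own illustrative choice, since the text only specifies a least-squares fit to the sub-pixel edge points.

```python
import numpy as np

def fit_circle(points):
    """Algebraic least-squares circle fit to sub-pixel edge points.
    points: (N, 2) array of edge coordinates.  Returns (cx, cy, r).
    Solves x^2 + y^2 = A*x + B*y + C, so center = (A/2, B/2)."""
    x, y = points[:, 0], points[:, 1]
    A = np.column_stack([x, y, np.ones_like(x)])
    b = x**2 + y**2
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    cx, cy = sol[0] / 2.0, sol[1] / 2.0
    r = np.sqrt(sol[2] + cx**2 + cy**2)
    return cx, cy, r
```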
3 Modeling Warm-Up Drift

As one can see from Figs. 1 and 2, the temperature increase of the camera sensor board induces an optical flow. Our hypothesis is that this flow field results from a movement of the image plane due to thermal expansion of the sensor board. The CTE (Coefficient
Fig. 1. Warm-up drift. (a) Measured temperature on the sensor board over time [min]. (b) Total displacement from camera start-up until thermal equilibrium; the lengths of the arrows are scaled by a factor of 150. (c) Gray value change for a sampled line (rising and falling edge). The red curve shows the sampled gray values immediately after start-up and the blue curve after thermal stabilization.
Fig. 2. Coordinate displacement: coordinate changes of the top left target (a) and the bottom right one (b).
of Thermal Expansion) of FR-4, the standard material used for printed circuit boards, can take values up to 150–200 ppm/K. Assuming a temperature increase of 10 to 20 K in the immediate vicinity of the sensor chip, one would expect a thermal expansion and thus a displacement of the camera sensor of some microns. Taking the widely used pinhole camera model to describe the imaging process, we can in principle distinguish two cases:

– The thermal expansion of the sensor board affects only the image plane. The center of projection remains fixed.
– Both the image plane and the center of projection are displaced due to thermal expansion.

Both cases can be found in real cameras. In the first case, the objective is fixed to the camera housing and is thus not affected by the local temperature increase of the sensor board, since the distance to the board is relatively large. This configuration is typical for cameras equipped with C-mount objectives. In the second case, the lens holder of the objective is mounted directly on the circuit board; an expansion of the board therefore displaces the lens and hence the center of projection. This configuration is found in miniature camera devices used, e.g., in mobile phones. Mathematically, the two cases have to be treated separately. In the remaining sections we use the following notation for the mapping from 3D world space to 2D image space:

x = K [R|t] X    (1)

where x = (x, y, 1)^T denotes the homogeneous image coordinates of the world point X, also described by homogeneous coordinates. The camera is described by its internal parameters

K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}

The exterior orientation of the camera is given by the rotation R and the translation t (see [4] for details).

3.1 Fixed Center of Projection

If the center of projection remains fixed, the observed optical flow results from a movement of the image plane alone. In this case, the coordinate displacement can be described by a homography [4]. Let x(t_0) and x(t_1) denote the coordinates of the same target feature at time t_0, i.e., immediately after camera start-up, and at an arbitrary time t_1. Then

x(t_0) = K(t_0)[I|0] X
x(t_1) = K(t_1)[R(t_1)|0] X = K(t_1) R(t_1) K^{-1}(t_0) (K(t_0)[I|0] X) = K(t_1) R(t_1) K^{-1}(t_0) x(t_0)
so that x(t_1) = H(t_1) x(t_0) with the time-dependent homography H(t_1):

H(t_1) = K(t_1) R(t_1) K^{-1}(t_0)    (2)

Setting \tilde{x} = K^{-1}(t_0) x(t_0) we get \tilde{H}(t_1) = K(t_1) R(t_1). Since \tilde{H}(t) is invertible we can write

\tilde{H}^{-1}(t) = (K(t) R(t))^{-1} = R^{-1}(t) K^{-1}(t) = R^T(t) K^{-1}(t)    (3)

Since R^T is orthogonal and K^{-1} is an upper triangular matrix, we can use QR decomposition to obtain R^T and K^{-1} once \tilde{H}^{-1} is given [18]. For a rotation by a small angle ΔΩ around an axis l we can further use the following approximation [19]:

R(t) = I + W(t) ΔΩ(t) + O(ΔΩ^2),    (4)

where the matrix W(t) is given by

W(t) = \begin{pmatrix} 0 & -l_3(t) & l_2(t) \\ l_3(t) & 0 & -l_1(t) \\ -l_2(t) & l_1(t) & 0 \end{pmatrix}    (5)

The vector l is a unit vector and thus has two degrees of freedom. We can identify the rotation with the three-component vector \tilde{l} = ΔΩ l. From \tilde{l} we get ΔΩ = \|\tilde{l}\| and l = \tilde{l}/\|\tilde{l}\|. The homography \tilde{H}(t) becomes

\tilde{H}(t) = (K(t_0) + ΔK(t)) R(t)    (6)

where ΔK(t) denotes the time-dependent offset to the original camera parameters and is given by

ΔK(t) = \begin{pmatrix} Δf_x(t) & 0 & Δc_x(t) \\ 0 & Δf_y(t) & Δc_y(t) \\ 0 & 0 & 1 \end{pmatrix}    (7)

Thus, \tilde{H}(t) is determined by seven time-dependent parameters: Δf_x(t), Δf_y(t), Δc_x(t), Δc_y(t), the changes of the internal camera parameters, and \tilde{l}_1(t), \tilde{l}_2(t), \tilde{l}_3(t), the external orientation. Motivated by the results of our empirical studies (see Section 4 for further details), we choose bi-exponential functions for the time-dependent parameters:

f(t) = a_0 + a_1 e^{-k_1 t} - (a_0 + a_1) e^{-k_2 t}    (8)

The parameterization of f(t) is chosen in such a way that f(0) = 0 and thus H(0) = I. Since f(t) is determined by the four parameters a_0, a_1, k_1, k_2 and we have seven time-dependent parameters for \tilde{H}(t), the complete warm-up model comprises 28 parameters. In Section 4 it is shown that the total number of parameters can be reduced in practice.
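A numerical sketch of the factorization in (3), assuming the warm-up homography H(t) of (2) and the start-up calibration K(t0) are known; the helper name and the sign normalization are our own additions, not part of the paper.

```python
import numpy as np

def factor_homography(H_t, K0):
    """Factor H(t) = K(t) R(t) K0^{-1} (eq. (2)) into K(t) and R(t)
    via a QR decomposition of Htilde^{-1} = R^T K^{-1} (eq. (3))."""
    H_tilde = H_t @ K0                            # Htilde(t) = K(t) R(t)
    Q, U = np.linalg.qr(np.linalg.inv(H_tilde))   # Q orthogonal, U upper triangular
    # resolve the sign ambiguity so that K^{-1} has a positive diagonal
    D = np.diag(np.sign(np.diag(U)))
    Q, U = Q @ D, D @ U
    K_t = np.linalg.inv(U)
    K_t /= K_t[2, 2]                              # remove the projective scale
    R_t = Q.T
    return K_t, R_t
```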
3.2 Moving Center of Projection

In this case we use the simplifying assumption that the center of projection and the image plane are translated equally, i.e., the internal parameters of the imaging device remain constant during the warm-up period. This assumption will later be justified empirically. Then we get the following relations:

x(t_0) = K [I|0] X
x(t_1) = K [R(t_1)|t(t_1)] X

Since the observed targets lie on a plane, the image coordinate changes can again be described by a homography (see [3] for a strict treatment):

x(t_1) = K [r_1(t_1)  r_2(t_1)  t(t_1)] x(t_0)    (9)

where r_i(t) denotes the i-th column of R(t). Thus, we get H(t) = K [r_1(t)  r_2(t)  t(t)]. Given the homography H(t), the external parameters can be computed as follows [3]:

r_1 = λ K^{-1} h_1
r_2 = λ K^{-1} h_2
r_3 = r_1 × r_2
t = λ K^{-1} h_3

with λ = 1 / \|K^{-1} h_1\|. Using the axis-angle notation for the rotation R(t), we get six time-dependent parameters, namely the three rotation parameters \tilde{l}_1(t), \tilde{l}_2(t), \tilde{l}_3(t) as well as the three translational parameters t_1(t), t_2(t), t_3(t). Again, we use bi-exponential terms to describe the temporal behaviour of the parameter values. Thus, we have 24 parameters.

3.3 Warm-Up Model Calibration

In the previous sections we have shown how to model the coordinate displacement of segmented image features during camera warm-up. We now outline an algorithm to calibrate the parameters of the models:

1. Determine the internal camera parameters using a method described in [3] or [2], based on a few images taken immediately after camera start-up. The obtained values are used for K(0) or K, respectively.
2. Collect image coordinates by continuously segmenting target center points.
3. For each segmented image, determine the homography H(t) (see [4], [19]).
4. Use a factorization method described in Section 3.1 or 3.2, depending on the type of camera, to obtain values for the internal/external parameters.
5. Fit a bi-exponential function to the values of each camera parameter.
6. Perform a non-linear least-squares optimization over all 28 (24) parameters, minimizing the expression
\sum_{j=1}^{M} \sum_{i=1}^{N} \| x_j(t_i) - H(t_i; β) x_j(0) \|^2    (10)
where M denotes the number of feature points and β the current parameter vector.
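Step 5 of the calibration algorithm, fitting the bi-exponential model (8) to one parameter trajectory, might look as follows; the starting guess is an arbitrary assumption, not a value from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def biexp(t, a0, a1, k1, k2):
    """Bi-exponential drift term of eq. (8); f(0) = 0 by construction."""
    return a0 + a1 * np.exp(-k1 * t) - (a0 + a1) * np.exp(-k2 * t)

def fit_parameter_trajectory(t, values, p0=(0.1, 0.1, 0.5, 0.05)):
    """Fit eq. (8) to the time series of one camera parameter.
    t: elapsed times, values: estimated parameter values at those times."""
    popt, _ = curve_fit(biexp, t, values, p0=p0, maxfev=10000)
    return popt  # a0, a1, k1, k2
```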
Fig. 3. Estimated internal and external parameters of the SonyFCB-EX780BP over time. (a) Focal length shift and principal point shift [pixel]; (b) rotation about the x-, y- and z-axes.
Fig. 4. Application of warm-up calibration to the SonyFCB-EX780BP. (a) Results of the drift calibration: the red curves depict the segmented image coordinates and the blue ones the ideal trajectories according to the calibrated drift model. (b) Results of the drift compensation.
3.4 Warm-Up Drift Compensation

With a calibrated warm-up model we can account for the influences of the sensor warm-up on the imaging process. For cameras whose center of projection remains fixed, an
Fig. 5. Estimated external camera parameters of the VRmagic-C3 over time: translation [mm] and rotation about the x-, y- and z-axes.
Fig. 6. (a) and (b) show the predicted trajectories of the point coordinates (blue) compared to the observed ones (red).
image coordinate correction is straightforward. Given observed image coordinates x_o at time t after camera start-up, the undistorted image coordinates x_u can be computed by multiplying with the inverse of H(t):

x_u = H^{-1}(t) x_o    (11)
This correction is independent of the structure of the scene, i.e., of the distance of the observed world point from the camera. Fig. 4(b) shows the results of this drift correction. In the second case, where the center of projection is not fixed, a direct correction of the image coordinates is not possible, since the image displacement of an observed feature point depends on its position in the scene. In this case, the drift model can only be applied in reconstruction algorithms, where the position of the camera is corrected accordingly.
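As a small illustration, the correction of (11) is a single homography application per observed point; the wrapper below is our own, and H(t) is assumed to come from the calibrated warm-up model evaluated at the elapsed time t.

```python
import numpy as np

def undistort_point(x_obs, H_t):
    """Warm-up drift compensation for a fixed center of projection,
    eq. (11): x_u = H(t)^{-1} x_o.  x_obs is a pixel coordinate (x, y)."""
    xh = np.array([x_obs[0], x_obs[1], 1.0])
    xu = np.linalg.inv(H_t) @ xh
    return xu[:2] / xu[2]
```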
Table 1. Camera motion parameters and duration until thermal equilibrium for a single camera (VRmagic-C3, CMOS)
(Translation tx, ty, tz; rotation axis lx, ly, lz; rotation angle ΔΩ; time T99 [min]; residuals σ²)

#   tx        ty         tz         lx        ly        lz        ΔΩ        T99    σ²
1   0.024286  -0.004156  -0.037418  0.412771  0.458647  0.783463  0.011553  19.62  0.000108
2   0.022227  -0.004706  -0.039590  0.445256  0.487992  0.748210  0.012630  18.08  0.000465
3   0.023420  -0.004448  -0.034522  0.377107  0.426833  0.810767  0.010667  19.37  0.000114
4   0.018973  -0.004655  -0.033378  0.328780  0.352122  0.786722  0.009569  17.16  0.000112
5   0.022384  -0.004380  -0.040338  0.460083  0.482951  0.745038  0.013023  18.72  0.000100
Fig. 7. Reconstruction of the position of the image plane for the VRmagic-C3 after 0s, 100s, 600s and 1000s since camera start-up. The axis units are mm. The depicted range in x- and y-direction corresponds roughly to the dimensions of the active sensor chip area.
4 Experimental Results

This section presents experimental studies which justify the applicability of the proposed warm-up model. The experiment described in Section 2 was conducted for two different types of cameras. The first is a VRmagic-C3, a miniature-sized camera whose lens is directly mounted on the circuit board; this camera is equipped with a CMOS-based active pixel sensor. The other camera is a SonyFCB-EX780BP, a CCD-based camera whose objective is not directly connected to the sensor circuit board. The initially estimated motion parameters are shown in Figures 3(a)-3(b) and 5, respectively. As the figures show, our choice of a bi-exponential function to describe the temporal dependence of the camera parameters seems reasonable. Furthermore, one can see that for some camera parameters a simple exponential term is sufficient, reducing the total number of parameters. Figures 4(a) and 6(a) show the applicability of the chosen models
to explain the observed image displacement. Figures 4(b) and 6(b) show the results of the drift correction described in Section 3.4. In a second experiment we examine the repeatability of the calibration. The drift model is repeatedly calibrated for one camera; the data were collected over several weeks. Table 1 shows the resulting parameters. The table contains the values of the motion parameters when the camera reaches thermal equilibrium. The column T99 denotes the time in minutes until the drift displacement reaches 99% of its final amount. The results show that the warm-up behaviour is characteristic for a specific device. Finally, Fig. 7 shows a reconstruction of the image plane during the camera warm-up period.
5 Conclusion

We have presented a study of the impact of camera warm-up on the coordinates of segmented image features. Based on experimental observations we have developed a model for the image drift and a way to compensate for it. Once the warm-up model is calibrated for a specific camera, we can use the parameters for drift compensation. The formulation of our displacement correction fits well into the projective framework widely used in the computer vision community; thus, the standard camera models used in computer vision can easily be extended to account for warm-up effects. Further experimental evaluations have shown that our warm-up model is in principle applicable to all kinds of digital cameras and, additionally, that the warm-up behaviour is characteristic for a specific camera. In the future we plan to use cameras with an on-board temperature sensor to get direct access to the camera's temperature. The formulation of our model presented here is based on the time elapsed since camera start-up, assuming that the temperature always develops similarly. A direct measurement of the temperature instead of time will probably increase accuracy further.
References
1. Brown, D.C.: Close-range camera calibration. Photogrammetric Engineering 37(8), 855–866 (1971)
2. Tsai, R.Y.: A versatile camera calibration technique for 3d machine vision. IEEE Journal for Robotics & Automation RA-3(4), 323–344 (1987)
3. Zhang, Z.: A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(11), 1330–1334 (2000)
4. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000)
5. Weng, J., Cohen, P., Herniou, M.: Camera calibration with distortion models and accuracy evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence 14(10), 965–980 (1992)
6. El-Melegy, M., Farag, A.: Nonmetric lens distortion calibration: Closed-form solutions, robust estimation and model selection. In: Proceedings ICCV, pp. 554–559 (2003)
7. Devernay, F., Faugeras, O.: Straight lines have to be straight. MVA 13(1), 14–24 (2001)
8. Fraser, C.S., Shortis, M.R., Ganci, G.: Multi-sensor system self-calibration. In: SPIE Proceedings, vol. 2598, pp. 2–15 (1995)
9. Healey, G., Kondepudy, R.: Radiometric ccd camera calibration and noise estimation. PAMI 16(3), 267–276 (1994)
10. Clarke, T.A.: A frame grabber related error in subpixel target location. The Photogrammetric Record 15(86), 315–322 (1995)
11. Ortiz, A., Oliver, G.: Radiometric calibration of ccd sensors: dark current and fixed pattern noise estimation. IEEE International Conference on Robotics and Automation 5, 4730–4735 (2004)
12. Beyer, H.A.: Geometric and radiometric analysis of a ccd-camera based photogrammetric close-range system. Mitteilungen Nr. 51 (1992)
13. Wong, K.W., Lew, M., Ke, Y.: Experience with two vision systems. Close Range Photogrammetry Meets Machine Vision 1395, 3–7 (1990)
14. Robson, S., Clarke, T.A., Chen, J.: Suitability of the pulnix tm6cn ccd camera for photogrammetric measurement. SPIE Proceedings, Videometrics II 2067, 66–77 (1993)
15. Seto, E., Sela, G., McIlroy, W.E., Black, S.E., Staines, W.R., Bronskill, M.J., McIntosh, A.R., Graham, S.J.: Quantifying head motion associated with motor tasks used in fmri. NeuroImage 14, 284–297 (2001)
16. Förstner, W., Gülch, E.: A fast operator for detection and precise location of distinct points, corners and centers of circular features. ISPRS Intercommission Workshop on Fast Processing of Photogrammetric Data (1987)
17. Tabatabai, A.J., Mitchell, O.R.: Edge location to subpixel values in digital imagery. IEEE Transactions on Pattern Analysis and Machine Intelligence 6(2), 188–201 (1984)
18. Golub, G.H., Van Loan, C.F.: Matrix Computations. Johns Hopkins University Press, Baltimore, MD (1997)
19. Kanatani, K.: Geometric Computation for Machine Vision. Oxford University Press, Oxford, UK (1993)
Simultaneous Plane Extraction and 2D Homography Estimation Using Local Feature Transformations

Ouk Choi, Hyeongwoo Kim, and In So Kweon

Korea Advanced Institute of Science and Technology
Abstract. In this paper, we use local feature transformations estimated in the matching process as initial seeds for 2D homography estimation. The number of testing hypotheses is equal to the number of matches, naturally enabling a full search over the hypothesis space. Using this property, we develop an iterative algorithm that clusters the matches under the common 2D homography into one group, i.e., features on a common plane. Our clustering algorithm is less affected by the proportion of inliers and as few as two features on the common plane can be clustered together; thus, the algorithm robustly detects multiple dominant scene planes. The knowledge of the dominant planes is used for robust fundamental matrix computation in the presence of quasi-degenerate data.
1 Introduction
Recent advances in local feature detection have achieved affine/scale covariance of the detected region under varying viewpoint [11][12][13][8]. In the description phase of the detected region, geometric invariance to affinity or similarity is achieved by explicitly or implicitly transforming the detected region to a standard normalized region (see Fig. 1). Statistics robust to varying illumination or small positional perturbation are extracted from the normalized region and are used in the matching phase. After the tentative matching of the normalized regions, not only the matching feature coordinates but also the feature transformations from image to image are available. In this paper, we are interested in further exploiting the local feature transformations for simultaneously extracting scene planes and estimating the induced 2D homographies. Local feature matches have been treated as point-to-point correspondences, not as region matches with feature transformations, in the literature on 2D homography or fundamental matrix computation [1][2][6][5][3][4]. The main reason is that non-covariant local features (e.g., single-scale Harris corners) are less elaborately described (e.g., by the template itself), so that the affine/similarity transformation is not uniquely determined. Even the approaches that use affine/scale covariant local features [12][7] do not utilize the feature transformations thoroughly. Some approaches propagate local feature matches into the neighborhood regions for simultaneous object recognition and segmentation [9][10]. These approaches use the feature transformation elaborately and show that a few matches
Fig. 1. Affine covariant regions are detected in each image independently and each region is transformed to the normalized region. The normalized regions are matched using statistics robust to varying illumination or small positional perturbation. Cyan-colored rectangles represent the features that are tentatively matched. Each tentative match provides not only the matching feature coordinates but also the feature transformations from image to image.
[9], or even only one [10], can be grown over a large portion of the object. Inspired by these approaches, we propagate the feature transformation of one match to the other matches and update the feature transformation iteratively, so that the 2D homography of a plane can be estimated and the local features on the common plane can be grouped together. The main difference between our approach and those mentioned is that we are more concerned with sparse scene geometry than with the verification of a single or a few matches by gathering more evidence in a dense neighborhood. Our approach, however, can also be interpreted as a match verification process, because we group the local feature matches using 2D homography, as Lowe does in [12]. The difference between our algorithm and Lowe's is that we use the feature transformation directly rather than clustering with the Hough transform. The main advantage is that our algorithm does not require the labor of determining the resolution of the Hough bins. In Section 2, we develop a simple algorithm that simultaneously groups the coplanar features and estimates the 2D homography. In Section 3, the knowledge of the detected dominant planes is used to develop an importance sampling procedure for robust fundamental matrix computation in the presence of quasi-degenerate data [4]. We show some experimental results for both plane extraction and fundamental matrix computation in Section 4. Finally, some advantages and limitations of our approach and future work are discussed in Section 5.
2 Simultaneous Plane Extraction and 2D Homography Estimation
We characterize a feature match as a set of the center coordinates x_{1i}, x_{2i} of the matching regions, the feature transformations H_i, H_i^{-1}, and a membership variable h_i (see Fig. 1 for details):
m_i = {x_{1i}, x_{2i}, H_i, H_i^{-1}, h_i},    (1)

where

H_i = T_{1i} T_{2i}^{-1},   H_i^{-1} = T_{2i} T_{1i}^{-1},    (2)
T_{1i} and T_{2i} are estimated as described by Mikolajczyk et al. in [8]. In this section, we develop an algorithm that determines h_i and updates H_i, H_i^{-1} for each match m_i. Feature transformations differ in quality: under some transformations, many features are transformed with small residual errors, while under other transformations few features are. Our algorithm, described in Fig. 2, selects good transformations with high dominance scores defined in (3); therefore, the algorithm works in the presence of a number of erroneous initial feature transformations, on the assumption that there exists at least one good initial seed for each dominant plane. It is important to note that a match already contains a transformation, in other words a homography, so that the algorithm does not require three or four true matches for constructing a hypothesis. This fact enables as few as two matches to be clustered together once one match is dominated by the other. We define ε-dominance to assess the quality of transformations in a systematic way.

Definition 1. m_i dominates m_j if \|x_{2j} − H_i(x_{1j})\| < ε and \|x_{1j} − H_i^{-1}(x_{2j})\| < ε, where H(x) is the transformed Euclidean coordinates of x by H.

The predicate in Def. 1 is identical to the inlier test in RANSAC (RANdom SAmple Consensus) approaches [1][5] with maximum allowable residual error ε. Our algorithm can be considered a deterministic version of the RANSAC algorithm, where random sampling is replaced by a full search over the hypothesis space. Nothing prevents us from modifying our algorithm to be random; however, the small number of hypotheses, equal to the number of matches, leads us to choose the full search. The dominance score n_i is defined as

n_i = \sum_{j=1}^{N} d_{ij},    (3)
where d_{ij} = 1 when m_i dominates m_j, and 0 otherwise. The dominance score n_i is the number of features that are transformed by H_i with a residual error less than ε. Once the n_i and d_{ij} values are calculated, the membership h_i is determined in such a way that the m_k with the largest dominance score collects its inliers first, and the collected inliers are omitted from the set of matches for the detection of the next dominant plane. We do not cover the model selection issue [6] thoroughly in this paper. The update of H_k, H_k^{-1} in Fig. 2 is simply decided by the number of inliers and the residual error. When the number of inliers is less than three, we do not update the feature transformation and the match is discarded from gathering
Input: M0 = {mi | i = 1...N}.
Output: updated hi, Hi, Hi^{-1}, and ni for each mi.
Variables:
- it: an iteration number.
- ti: an auxiliary variable that temporarily stores hi for each iteration.
- M: a set of matches whose membership is yet to be determined.
- The other variables: explained in the text.

1. Initialize it, hi.
   - it ← 0.
   - For i = 1...N, {hi ← i.}
2. Initialize M, ti, ni, dij.
   - M ← M0.
   - For i = 1...N, {ti ← 0.}
   - For i = 1...N, {for j = 1...N, {calculate dij.}}
   - For i = 1...N, {calculate ni.}
3. k = argmax_{i: mi ∈ M} ni.
4. If nk = 0, {go to 10.}
5. For i: mi ∈ M, {if dki = 1 and ti = 0, {ti ← hi ← k.}}
6. Update Hk, Hk^{-1}.
7. M ← M − {mi | ti ≠ 0}.
8. Update ni, dij.
   - For i: mi ∈ M, {for j: mj ∈ M, {calculate dij.}}
   - For i: mi ∈ M, {calculate ni.}
9. nk ← 0. Go to 3.
10. For i = 1...N, {if hi has changed, {it ← it + 1, go to 2.}}
11. Finalize.
   - For i = 1...N, {Hi = H_{hi}, Hi^{-1} = H_{hi}^{-1}, ni = Σ_{j: hj = i} 1.}
Fig. 2. Algorithm sketch of the proposed simultaneous plane detection and 2D homography estimation technique. The symbol ‘←’ was used to denote replacement.
more inliers. An affine transformation is calculated when the number of inliers is more than two, and a projective transformation is also calculated when the number of inliers is more than three. Three transformations compete with one another for each m_k: the original transformation, and the newly calculated affine and projective transformations. The transformation with the minimum cost is finally selected at each iteration. The cost is defined as

f(H_k) = \sum_{i: h_i = k} \left( \|x_{2i} − H_k(x_{1i})\|^2 + \|x_{1i} − H_k^{-1}(x_{2i})\|^2 \right)    (4)
The cost is minimized using the Levenberg–Marquardt algorithm [14][15], with the original H_k as the initial solution. Refer to [16] for more statistically sound homography estimation.
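For illustration, the ε-dominance test of Def. 1 and the dominance scores of (3) can be computed as in the sketch below; the array layout and function names are our assumptions, not the authors' code.

```python
import numpy as np

def apply_h(H, pts):
    """Apply a 3x3 homography to an (N, 2) array of Euclidean points."""
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

def dominance(x1, x2, Hs, Hs_inv, eps):
    """x1, x2: (N, 2) matched coordinates; Hs, Hs_inv: (N, 3, 3) per-match
    transformations.  Returns d with d[i, j] = 1 iff match i dominates
    match j (Def. 1), and the dominance scores n_i = sum_j d[i, j]."""
    N = len(x1)
    d = np.zeros((N, N), dtype=int)
    for i in range(N):
        e_fwd = np.linalg.norm(apply_h(Hs[i], x1) - x2, axis=1)
        e_bwd = np.linalg.norm(apply_h(Hs_inv[i], x2) - x1, axis=1)
        d[i] = (e_fwd < eps) & (e_bwd < eps)
    return d, d.sum(axis=1)
```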
Fig. 3. Plane detection results: (a) ε = 3.0, it = 0; (b) ε = 3.0, it = 2. (a) shows the plane detection result at it = 0. Features of the same color were detected to be on a common plane; for visibility, planes with fewer than three features are not displayed. The dominance score decreases in the order red, green, blue, yellow, purple. The algorithm converges very quickly, at it = 2 (see (b)). The main reason is that the scene is fully affine, so that our initial seeds can collect many inliers in the 0th iteration.
Figure 3 shows the features clustered by our algorithm. One obvious plane, on top of the ice cream box, was not detected for lack of matching features because of severe distortion (see also Fig. 1). The front part of the detergent box (Calgon) could not be clustered because of the slight slant of the top. The total number of dominance calculations is O(N^2 × K × it) and the H_k, H_k^{-1} update is O(K × it), where N is the total number of matches, K is the average number of planes at each iteration, and it is the total number of iterations until convergence. The former is the main source of computational complexity, which is equivalent to testing N × it hypotheses for each plane extraction in conventional RANSAC procedures. More experiments using various scenes are described in Section 4.
3 Importance-Driven RANSAC for Robust Fundamental Matrix Computation
The fundamental matrix can be calculated using the normalized eight-point algorithm. This algorithm requires eight perfect matches. This number is larger than the minimum number required for the fundamental matrix computation. However, we use the eight-point algorithm because it produces a unique solution for general configurations of the eight matches. The main degeneracy occurs when there is a dominant plane and more than five points are sampled from it [5]. The knowledge of the detected planes can be used to develop an importance sampling procedure that avoids the obvious degenerate conditions, i.e., more than five points on the detected common plane. Any feature match that is not grouped with other matches can be considered either a mismatch or a valuable true match off the planes, and a plane that includes many feature matches is highly likely to be a true scene plane, for which the included matches are also likely to be true. To balance between avoiding
mismatches and degeneracy, we propose an importance sampling method that first selects a plane according to its importance

I_k = min(n_k, n),    (5)

where I_k is the sampling importance of the k-th plane, n_k is its dominance score, and n is a user-defined threshold value. Once the plane is determined, a match on the plane is sampled with uniform importance. This sampling procedure becomes equivalent to the standard RANSAC algorithm [1][5] when n = ∞, i.e., I_k = n_k. The sampling importance decreases for dominant planes that have more than n inliers; n is typically set to five in the experiments in Section 4. The final sample set of eight matches is discarded if six or more matches are sampled from a common plane. The following cost is minimized for each eight-match sample set, and the F with the largest number of inliers is chosen as the most probable relation:

g(F) = \sum_{m_i \in S_8} (\hat{x}_{2i}^T F \hat{x}_{1i})^2.    (6)

The inlier test is based on the following predicate:

m_i ∈ S_{in} if d_⊥(x_{2i}, \hat{x}^T F \hat{x}_{1i} = 0) < ε and d_⊥(x_{1i}, \hat{x}^T F^T \hat{x}_{2i} = 0) < ε,    (7)

where S_{in} is the set of inliers, \hat{x} denotes homogeneous coordinates, and d_⊥ is the distance from a point to a line along the line normal direction. Refer to [17] for more statistically sound fundamental matrix estimation.
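A sketch of the importance-driven sampling of (5), including the rejection of sample sets with six or more matches from one plane; the plane labels are assumed to come from the clustering of Section 2, and the eight-point solver itself is omitted.

```python
import numpy as np

def sample_eight(plane_ids, nbar=5, rng=np.random.default_rng()):
    """Draw one eight-match sample set.  plane_ids[i] is the plane label
    of match i (ungrouped matches keep their own singleton label).
    Planes are chosen with importance I_k = min(n_k, nbar), eq. (5);
    matches within a plane are chosen uniformly."""
    labels, counts = np.unique(plane_ids, return_counts=True)
    importance = np.minimum(counts, nbar).astype(float)
    importance /= importance.sum()
    while True:
        picked = []
        for _ in range(8):
            k = rng.choice(labels, p=importance)       # pick a plane
            members = np.flatnonzero(plane_ids == k)
            picked.append(rng.choice(members))         # pick a match on it
        picked = np.array(picked)
        _, c = np.unique(plane_ids[picked], return_counts=True)
        if c.max() < 6 and len(set(picked)) == 8:      # reject degenerate sets
            return picked
```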
4 Experiments
In this section, we show experimental results for various images, including repeating patterns and dominant planes. The images in Figures 1, 3 and 4 were adapted from [3]. The castle images that contain repeating patterns (Fig. 5) were downloaded from http://www.cs.unc.edu/˜marc/. We extracted two kinds of features in the feature extraction stage: maximally stable extremal regions (MSER) [3] and generalized robust invariant features (GRIF) [13]. MSERs are described using SIFT neighborhoods [12], and GRIFs are described using both SIFT and a hue histogram. In all the experiments in this paper, feature pairs were classified as tentative matches if the distance between the description vectors was smaller than 0.4 and the normalized cross-correlation between the normalized regions was larger than 0.6. ε is the only free parameter of our plane extraction algorithm; it is the maximum allowable error in model fitting. A large ε tends to produce a smaller number of planes with a large number of inliers; dominant planes invade nearby or even distant planes when ε is large. The homography is rarely updated with a small ε, because the initial feature transformation fits the inliers very tightly. Thus ε trades off the accuracy of the homography against robustness to perturbed feature positions. Figure 4 shows this trade-off. Good results were obtained in the
Fig. 4. Plane detection results with varying ε: (a) ε = 1, (b) ε = 5, (c) ε = 10, (d) ε = 20. See the text for details.
range of 2 < ε < 10; ε = 5 produced the best results in Fig. 4. We used ε = 5 for all the experiments in this paper, unless otherwise mentioned. Figure 5 shows the plane extraction and epipolar geometry estimation results on image pairs with varying viewpoint. It is hard to find the true correspondences in these image pairs because many features are detected on the repeating pattern, e.g., the windows on the wall. The number of planes that humans can manually detect in this scene is six, and our algorithm finds only one or two planes that can be regarded as correct. However, it is important to note that the most dominant plane was always correctly detected, because true matches on the distinctive pattern have grown their evidence over the matches on the repeating pattern. Moreover, the number of tentative matches is not enough to detect other planes, i.e., our algorithm does not miss a plane once a true feature correspondence lies on it. Figure 6 shows the results for the images with a dominant plane. Among the total of 756 tentative matches, 717 matches were grouped as belonging to the red-colored dominant plane. For the standard RANSAC approach, the probability that the eight points contain more than two points off the plane is

P = \sum_{m=3}^{8} \frac{\binom{39}{m}\binom{717}{8-m}}{\binom{756}{8}} = 0.0059 ≈ 0.6%.    (8)
We ran the standard random sampling procedure [1][5] 1000 times and counted the number of occasions on which more than five matches were sampled from the detected dominant plane. The number of occasions was, unsurprisingly, 995, i.e.,
Fig. 5. Experiments on varying viewpoint and repeating patterns: (a) images with small viewpoint change, (b) images with moderate viewpoint change, (c) images with severe viewpoint change. The number of tentative matches decreases with increasing viewpoint change. Many tentative matches are mismatches because of the repeating patterns (windows). The two most dominant planes were correctly detected in (a) and (b) (red–green and red–yellow, respectively). Only one plane was correctly detected in (c) for lack of feature matches (red). The epipolar geometry was correctly estimated in (a) and (b), with d⊥ of 0.97 and 0.99 pixels respectively. The epipolar geometry could not be correctly estimated in (c) for lack of off-plane correct matches.
99.5% of the cases were degenerate and only 0.5% were not, on the assumption that our dominant plane is correct; this is very close to the theoretical value. Our algorithm does not suffer from the degeneracy problem, on the assumption that the detected planes are correct. Figure 6 shows the two representative solutions that were most frequently obtained with the proposed sampling procedure. It is clear that no more than five matches are sampled from the dominant plane.
Fig. 6. Epipolar geometry estimation in the presence of quasi-degenerate data. (a) Detected dominant planes. (b) 751/756 inliers, d⊥ = 0.52. (c) 751/756 inliers, d⊥ = 0.58. The two solutions (b) and (c) that are most frequently achieved show the effectiveness of our algorithm.
5 Conclusion
We developed a simultaneous plane extraction and homography estimation algorithm using the local feature transformations that are already estimated in the matching process of affine/scale covariant features. Our algorithm is deterministic and parallel in the sense that all the feature transformations compete with one another at each iteration. This property naturally enables the detection of multiple planes without an independent RANSAC procedure for each plane or the labor of determining the resolution of the Hough bins. Our algorithm always produces consistent results, and even two or three matches can be grouped together if one match dominates the others. The knowledge of the detected dominant planes is used to safely discard degenerate samples so that the fundamental matrix can be robustly computed in the presence of quasi-degenerate data. Our plane detection algorithm does not depend on the number of inliers, but on the number of supports that enable the original transformation to be updated. In further work, we hope to adapt more features (e.g., single-scale or affine/scale covariant Harris corners [8]) into our framework so that both the number of supports and the number of competing hypotheses can be increased. It is clear that our algorithm can be applied to sparse or dense motion segmentation [6][7]. Dense motion segmentation is another area of future research.
Acknowledgement. This work was supported by the IT R&D program of MIC/IITA [2006-S-02801, Development of Cooperative Network-based Humanoids Technology] and the Korean MOST NRL program [Grant number M1-0302-00-0064].
References
1. Fischler, M.A., Bolles, R.C.: Random Sample Consensus: A paradigm for model fitting with application to image analysis and automated cartography. Communications of the ACM 24(6), 381–395 (1981)
2. Torr, P.H., Zisserman, A.: Robust Computation and Parameterization of Multiple View Relations. In: ICCV (1998)
3. Chum, O., Werner, T., Matas, J.: Two-view Geometry Estimation Unaffected by a Dominant Plane. In: CVPR (2005)
4. Frahm, J., Pollefeys, M.: RANSAC for (Quasi-)Degenerate data (QDEGSAC). In: CVPR (2006)
5. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000)
6. Torr, P.H.S.: Geometric Motion Segmentation and Model Selection. Philosophical Transactions of the Royal Society, pp. 1321–1340 (1998)
7. Bhat, P., Zheng, K.C., Snavely, N., Agarwala, A., Agrawala, M., Cohen, M.F., Curless, B.: Piecewise Image Registration in the Presence of Multiple Large Motions. In: CVPR (2006)
8. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Van Gool, L.: A Comparison of Affine Region Detectors. IJCV 65(1-2), 43–72 (2005)
9. Ferrari, V., Tuytelaars, T., Van Gool, L.: Simultaneous Object Recognition and Segmentation from Single or Multiple Model Views. IJCV 67(2), 159–188 (2006)
10. Vedaldi, A., Soatto, S.: Local Features, All Grown Up. In: CVPR (2006)
11. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust Wide Baseline Stereo from Maximally Stable Extremal Regions. In: BMVC, pp. 384–393 (2002)
12. Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. IJCV 60(2), 91–110 (2004)
13. Kim, S.H., Yoon, K.J., Kweon, I.S.: Object Recognition using Generalized Robust Invariant Feature and Gestalt Law of Proximity and Similarity. In: CVPR 2006, 5th IEEE Workshop on Perceptual Organization in Computer Vision (2006)
14. Levenberg, K.: A Method for the Solution of Certain Problems in Least Squares. Quart. Appl. Math. 2, 164–168 (1944)
15. Marquardt, D.: An Algorithm for Least-Squares Estimation of Nonlinear Parameters. SIAM J. Appl. Math. 11, 431–441 (1963)
16. Kanatani, K., Ohta, N., Kanazawa, Y.: Optimal Homography Computation with a Reliability Measure. IEICE Trans. Information Systems E83-D(7), 1369–1374 (2000)
17. Kanatani, K., Sugaya, Y.: High Accuracy Fundamental Matrix Computation and its Performance Evaluation. In: BMVC (2006)
A Fast Optimal Algorithm for L2 Triangulation

Fangfang Lu and Richard Hartley

Australian National University
Abstract. This paper presents a practical method for obtaining the global minimum to the least-squares (L2 ) triangulation problem. Although optimal algorithms for the triangulation problem under L∞ -norm have been given, finding an optimal solution to the L2 triangulation problem is difficult. This is because the cost function under L2 -norm is not convex. Since there are no ideal techniques for initialization, traditional iterative methods that are sensitive to initialization may be trapped in local minima. A branch-and-bound algorithm was introduced in [1] for finding the optimal solution and it theoretically guarantees the global optimality within a chosen tolerance. However, this algorithm is complicated and too slow for large-scale use. In this paper, we propose a simpler branch-and-bound algorithm to approach the global estimate. Linear programming algorithms plus iterative techniques are all we need in implementing our method. Experiments on a large data set of 277,887 points show that it only takes on average 0.02s for each triangulation problem.
1 Introduction
The triangulation problem is one of the simplest geometric reconstruction problems. The task is to infer the 3D point, given a set of camera matrices and the corresponding image points. In the presence of noise, the correct procedure is to find the solution that reproduces the image points as closely as possible. In other words, we want to minimize the residual errors between the reprojected and measured image points. Notice that residual errors measured under different norms lead to different optimization problems. It is shown in [2] that a quasi-convex cost function arises, and thus a single local minimum exists, if we choose to calculate the residual errors under the L∞-norm. However, the problem with the residual errors measured under the L2-norm still remains of primary interest. Suppose we want to recover a 3D point X = (x, y, z)^T. Let P_i (i = 1, 2, ..., n) denote a set of n camera matrices, u_i the corresponding image coordinates, and X̂ = (X; 1) the homogeneous coordinates of X. Under the L2-norm, we are led to solve the following optimization problem:
The second author is also affiliated with NICTA, a research institute funded by the Australian Government’s Backing Australia’s Ability initiative, in part through the Australian Research Council.
minimize  C_2(X) = \sum_{i=1}^{n} d(u_i, P_i X̂)^2   subject to  λ_i(X) > 0    (1)
where d(·, ·) is the Euclidean metric and λ_i(X) is the depth of the point relative to image i. Finding the optimal solution to (1) is not easy, since the cost function C_2(X) is non-convex; in the worst case, multiple local minima may occur. Iterative methods, so-called bundle adjustment methods [3], usually work well, though they are dependent on initialization. Both the L∞ solution [4] and the linear solution [5] are useful for initialization, but neither initialization can theoretically guarantee global optimality. A branch-and-bound algorithm was given in [1], and it provably finds the global optimum within any given tolerance. However, its way of bounding the cost function is complicated, and its computational cost is high, so it may not be suitable for large-scale data sets. A simpler way of obtaining the global estimate by a branch-and-bound process is presented in this paper. The main feature is the way we obtain a lower bound for the cost function on a bounding box: we use a much simpler convex lower bound for the cost than is used in [1]. Instead of using Second Order Cone Programming as in [1], we need only Linear Programming and simple iterative convex optimization (such as a Newton or Gauss-Newton method). This makes the implementation very easy. Experimental results show that the method theoretically guarantees the global optimum within a small tolerance, and also that it is fast, taking on average 0.02 s per triangulation problem. The only other known methods for finding a guaranteed optimal solution to the L2 triangulation problem are those given for the two-view ([6]) and the three-view problem ([7]), which involve finding the roots of a polynomial. For the two-view triangulation problem this involves the solution of a degree-6 polynomial, whereas for the three-view problem the polynomial is of degree 43. This approach does not generalize in any useful manner to larger numbers of views.
2 Other Triangulation Methods
Various other methods for triangulation have been proposed, based on simple algebraic or geometric considerations. Two of the most successful are briefly discussed here.

2.1 Linear Triangulation Methods
For every image i the measurement u_i = P_i X̂ is given, where X̂ = (X; 1). Let p_i^1, p_i^2, p_i^3 be the rows of P_i and x_i, y_i the coordinates of u_i; then two linear equations are given as follows:

(x_i p_i^3 − p_i^1) X̂ = 0,    (y_i p_i^3 − p_i^2) X̂ = 0.

In all, a set of 2n linear equations is composed for computing X in the n-view triangulation problem. The linear least-squares solution to this set of equations
provides the linear solution to the triangulation problem. The minimum is denoted by X_l here. To be a little more precise, there are two slightly different methods that may be considered here. One may either set the last coordinate of X̂ to 1, as suggested above, and solve using a linear least-squares method; alternatively, the last coordinate of X̂ may be treated as a variable, and the solution found using Singular Value Decomposition (see [5], DLT method).
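For concreteness, a minimal NumPy sketch of the homogeneous (SVD/DLT) variant might look as follows; the function name and data layout are our own choices, not the paper's.

```python
import numpy as np

def linear_triangulation(Ps, us):
    """Homogeneous (DLT) linear triangulation.

    Ps : list of 3x4 camera matrices P_i
    us : list of 2D image points (x_i, y_i)
    Returns the 3D point X_l minimizing the algebraic error.
    """
    A = []
    for P, (x, y) in zip(Ps, us):
        # Two equations per view: (x*p3 - p1).Xhat = 0, (y*p3 - p2).Xhat = 0
        A.append(x * P[2] - P[0])
        A.append(y * P[2] - P[1])
    A = np.asarray(A)
    # The solution is the right singular vector of the smallest singular value
    _, _, Vt = np.linalg.svd(A)
    Xh = Vt[-1]
    return Xh[:3] / Xh[3]
```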
2.2 L∞ Framework
The L∞ formulation leads to the following problem:

minimize  C_∞(X) = max_i d(u_i, P_i X̂)   subject to  λ_i(X) > 0.        (2)
This is a quasi-convex optimization problem. A global solution is obtained using a bisection algorithm in which a convex feasibility problem is solved in each step; please refer to [4] for details. We denote the minimum by X∞ in this paper. Other methods are mentioned in [6]. The most commonly used technique is to use some non-optimal solution, such as the algebraic or (as more recently suggested) the optimal L∞ solution, to initialize an iterative optimization routine such as Levenberg-Marquardt to find the L2 minimum. This method cannot be guaranteed to find the optimal solution, however. The purpose of this paper is to give a method guaranteed to find the optimal solution.
3 Strategy

3.1 Branch and Bound Theory
Branch and bound algorithms are classic methods for finding the global solution to non-convex optimization problems. By providing a provable lower (or upper) bound on the objective cost function, usually via a convex function, together with a subdivision scheme, such algorithms achieve the global estimate within an arbitrarily small tolerance.

3.2 Strategy
In this paper, our strategy is to find the L2 optimal point Xopt by a process of branch-and-bound on a bounding box B0 . We start with the bounding box and a best point Xmin found so far (X∞ plus local refinement will do in this paper). At each step of the branch and bound process, we consider a bounding box Bi obtained as part of a subdivision of B0 . A lower bound for the minimum of the cost function is estimated on Bi and compared with the best cost found so far, namely C2 (Xmin ). If the minimum cost on Bi exceeds C2 (Xmin ), then we abandon Bi and go on to the next box to be considered. If, on the other hand, the minimum cost on Bi is smaller than C2 (Xmin ), then we do two things: we
evaluate the cost function at some point inside Bi to see if there is a better value for Xmin , and we subdivide the box Bi and repeat the above process with the subdivided boxes.
4 Process
In this paper, the branch and bound process is presented in three parts. First, an initial bounding box is computed. Second, the bounding method is presented, that is, how a provable lower bound of the objective cost function is calculated on a box. Third, the branching part, that is, a subdivision scheme, is given. Since branch and bound algorithms can be slow, the branching strategy should be devised so as to save computational cost.

4.1 Obtaining the Initial Bounding Box B0
We start with an initial estimate X_init for the optimum point. If X_opt is the true L2 minimum, it follows that C_2(X_opt) ≤ C_2(X_init). This may be written as:

C_2(X_opt) = Σ_{i=1}^{n} d(u_i, P_i X̂_opt)^2 ≤ C_2(X_init)        (3)
with X̂_opt = (X_opt; 1). We can see that the sum of the values d(u_i, P_i X̂_opt)^2 is bounded by C_2(X_init), which means each d(u_i, P_i X̂_opt)^2 is less than this value. In particular, defining δ = √(C_2(X_init)) we have d(u_i, P_i X̂_opt) ≤ δ for all i. This can also be written as
d(u_i, P_i X̂_opt) = √[ (x_i − (p_i^1 X̂_opt)/(p_i^3 X̂_opt))^2 + (y_i − (p_i^2 X̂_opt)/(p_i^3 X̂_opt))^2 ] ≤ δ,

which is satisfied for each i. This means the following two constraints are satisfied for each i:

| x_i − (p_i^1 X̂_opt)/(p_i^3 X̂_opt) | ≤ δ,    | y_i − (p_i^2 X̂_opt)/(p_i^3 X̂_opt) | ≤ δ.

Notice that for n-view triangulation we have a total of 4n linear constraints on the position X̂_opt, formulated by multiplying both sides of the above constraint equations by the depth term p_i^3 X̂_opt.

We wish to obtain bounds for X_opt = (x_opt, y_opt, z_opt)^T, that is, to find x_min, x_max, y_min, y_max, z_min, z_max such that x_min ≤ x_opt ≤ x_max, y_min ≤ y_opt ≤ y_max, z_min ≤ z_opt ≤ z_max. For each of them, we can formulate a linear programming (LP) problem from the linearized constraints above. For instance, x_min is the smallest value of the x-coordinate of X_opt subject to the following linear constraints:

(x_i p_i^3 − p_i^1 − δ p_i^3) X̂_opt ≤ 0,    (−x_i p_i^3 + p_i^1 − δ p_i^3) X̂_opt ≤ 0,
(y_i p_i^3 − p_i^2 − δ p_i^3) X̂_opt ≤ 0,    (−y_i p_i^3 + p_i^2 − δ p_i^3) X̂_opt ≤ 0.
The other bound values are found in a similar way. This process then provides an initial bounding box B0 for the optimal point X_opt.

Note: In this paper we obtained the initial optimal point X_init by local refinement from X∞. It should be mentioned that any point X may be used instead of this X_init for initialization. However, we would like to obtain a relatively tight initial bounding box, since this reduces the computational complexity. Using the local minimum refined from X∞ is a good choice because it is likely to produce a relatively small value of the cost function C_2, and hence a reasonably tight bound.
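To illustrate how these constraints feed six small linear programs, here is a sketch using scipy.optimize.linprog. The function name, the (Ps, us, delta) calling convention and the loop over axes are our own; the paper does not prescribe a particular LP solver.

```python
import numpy as np
from scipy.optimize import linprog

def initial_bounding_box(Ps, us, delta):
    """Compute B0 = [xmin,xmax] x [ymin,ymax] x [zmin,zmax] from the 4n
    linear constraints derived from delta = sqrt(C2(X_init))."""
    rows = []
    for P, (x, y) in zip(Ps, us):
        p1, p2, p3 = P[0], P[1], P[2]
        rows.append(x * p3 - p1 - delta * p3)
        rows.append(-x * p3 + p1 - delta * p3)
        rows.append(y * p3 - p2 - delta * p3)
        rows.append(-y * p3 + p2 - delta * p3)
    A = np.asarray(rows)               # constraints A . (X; 1) <= 0
    A_ub, b_ub = A[:, :3], -A[:, 3]    # equivalently  A_ub . X <= b_ub
    box = []
    for axis in range(3):
        lo = hi = 0.0
        for sign in (1.0, -1.0):       # +1: minimize coordinate, -1: maximize
            c = np.zeros(3)
            c[axis] = sign
            res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                          bounds=[(None, None)] * 3)
            val = sign * res.fun
            if sign > 0:
                lo = val
            else:
                hi = val
        box.append((lo, hi))
    return box  # [(xmin, xmax), (ymin, ymax), (zmin, zmax)]
```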
4.2 Bounding
Now we consider the problem of finding a minimum value for C_2(X) on a box B. We rewrite the L2 cost function as follows:

C_2(X) = Σ_i d(u_i, P_i X̂)^2
       = Σ_i [ (x_i − (p_i^1 X̂)/(p_i^3 X̂))^2 + (y_i − (p_i^2 X̂)/(p_i^3 X̂))^2 ]
       = Σ_i (f_i^2(X) + g_i^2(X)) / λ_i^2(X).

Here f_i, g_i, λ_i are linear functions in the coordinates of X, and the depths λ_i(X) can be assumed to be positive, by cheirality (see [5]). Note that for this we must choose the right sign for each P_i, namely that it is of the form P_i = [M_i | m_i] where det M_i > 0. It is observed in [4] that each of the functions

(f_i^2(X) + g_i^2(X)) / λ_i(X),   λ_i(X) > 0,

is convex. Now we define w_i = max_{X∈B} λ_i(X), where B is the current bounding box. The value of each w_i can be easily found using Linear Programming (LP). Then we may reason as follows: for any point X ∈ B,

w_i ≥ λ_i(X)   and hence   1/w_i ≤ 1/λ_i(X),

so

Σ_i (f_i^2(X) + g_i^2(X)) / (w_i λ_i(X)) ≤ Σ_i (f_i^2(X) + g_i^2(X)) / λ_i^2(X) = C_2(X).        (4)
However, the left-hand side of this expression is a sum of convex functions, and hence is convex. It is simple to find the minimum of this function on the box
B, hence we obtain a lower bound

L(X) = Σ_i (f_i^2(X) + g_i^2(X)) / (w_i λ_i(X))

for the cost function C_2(X).
BFGS algorithm: In this paper, we adopt the BFGS algorithm ([8]) to find the minimum of the convex function L(X) within the bounding box B. The BFGS algorithm is one of the main quasi-Newton methods for convex optimization problems ([9]). It inherits good properties of the Newton method, such as a fast convergence rate, while avoiding the complexity of computing the Hessian. This significantly improves the computation speed.
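A sketch of the bound computation on a single box is given below. It exploits the fact that each λ_i is affine in X, so its maximum over a box is attained at a box vertex, and it uses SciPy's bound-constrained L-BFGS-B in place of plain BFGS so that the iterate stays inside B; all names and interfaces are our own.

```python
import itertools
import numpy as np
from scipy.optimize import minimize

def lower_bound_min(Ps, us, box):
    """Minimize the convex lower bound L(X) on a box B.

    box : [(xmin, xmax), (ymin, ymax), (zmin, zmax)]
    Returns (X_L, L(X_L)).
    """
    corners = np.array(list(itertools.product(*box)))   # the 8 box vertices
    def depth(P, X):
        return P[2] @ np.append(X, 1.0)                  # lambda_i(X)
    # lambda_i is affine in X, so its maximum over the box is at a vertex
    w = [max(depth(P, c) for c in corners) for P in Ps]

    def L(X):
        Xh = np.append(X, 1.0)
        val = 0.0
        for P, (x, y), wi in zip(Ps, us, w):
            f = x * P[2] @ Xh - P[0] @ Xh                # f_i(X)
            g = y * P[2] @ Xh - P[1] @ Xh                # g_i(X)
            lam = P[2] @ Xh                              # lambda_i(X)
            val += (f * f + g * g) / (wi * lam)
        return val

    x0 = np.array([0.5 * (lo + hi) for lo, hi in box])   # box centroid
    res = minimize(L, x0, method="L-BFGS-B", bounds=box)
    return res.x, res.fun
```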
4.3 Branching
Given a box Bi, we first evaluate the lower bound of the cost. If the lower bound exceeds the best cost C_2(X_min) found so far, we abandon the box and go on to the next box. If, on the other hand, the lower bound is less than C_2(X_min), we evaluate the cost at some point in the current box. If the value found is less than C_2(X_min), we update C_2(X_min) to this value. We then subdivide the box into two along its largest dimension. We repeat these steps until the dimension of the box approaches zero within a given tolerance.

Note: How closely the lower bound approximates the cost depends essentially on how closely max_{X∈B} λ_i(X) approximates the value of λ_i(X) for arbitrary points X in B. It is best if λ_i(X) does not vary much in B. Note that λ_i(X) is the depth of the point X with respect to the i-th camera. Thus it seems advantageous to choose the boxes to be shallow with respect to their depth from the cameras. This suggests that a more sophisticated scheme for subdividing boxes may be preferable to the simple scheme we use of subdividing along the largest dimension.
5 Proof of Optimality
The complete algorithm is given in Fig. 1. We have claimed that the method will find the optimal solution; that will be proved in this section. It will be assumed that the bounding box B0 is finite, as in the description of the algorithm. Because we use a FIFO structure to hold the boxes, as the algorithm progresses the size of the boxes B decreases towards zero. Note that at any time, the best result found so far gives an upper bound for the cost of the optimal solution. We will also define a lower bound as follows. At time j, just before removing the j-th box from the queue, define

l_j = min_{B∈Q} min_{X∈B} L(X).
It is clear that lj gives a lower bound for the optimal solution C2 (Xopt ), since L(X) < C2 (X) for all X. We will show two things.
Algorithm. Branch and Bound
Given an initial bounding box B0, an initial optimal point X0 with value f0 = C2(X0), and a tolerance ε > 0:
1. Initialize Q, a FIFO queue of bounding boxes, with Q = {B0}.
2. If Q is empty, terminate with globally optimal point X0 and optimal value f0.
3. Take the next box B from Q.
4. Compute the largest dimension lmax of box B.
5. If lmax < ε:
   - Set fC = C2(XC) where XC is the centroid of the box B.
   - If fC < f0, set f0 = fC and X0 = XC.
   - Goto 2.
6. Find the minimum of L(X) in B, denoted by XL. Set fL = L(XL).
7. If fL > f0, goto 2.
8. Find the local minimum of C2(X), denoted by XC, with XL as the initial point. Set fC = C2(XC). If fC < f0, set f0 = fC and X0 = XC.
9. Subdivide B into two boxes B1 and B2 along the largest dimension, and insert B1 and B2 into Q.
10. Goto 2.
Fig. 1. Branch and bound algorithm for optimal L2 triangulation. Alternative stopping conditions are possible, as discussed in the text.
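To clarify the control flow of Fig. 1, a compact rendition of the loop might look as follows. It reuses the lower_bound_min sketch introduced above (our own helper, not the authors' code), represents a box as a list of (lo, hi) intervals per coordinate, and is intended only as an illustration under those assumptions.

```python
from collections import deque
import numpy as np
from scipy.optimize import minimize

def C2(Ps, us, X):
    """Exact L2 reprojection cost at a 3D point X."""
    Xh = np.append(X, 1.0)
    cost = 0.0
    for P, (x, y) in zip(Ps, us):
        proj = P @ Xh
        cost += (x - proj[0] / proj[2]) ** 2 + (y - proj[1] / proj[2]) ** 2
    return cost

def branch_and_bound(Ps, us, box0, X0, eps=1e-6):
    """Branch-and-bound loop mirroring the algorithm of Fig. 1."""
    f0 = C2(Ps, us, X0)
    Q = deque([box0])                      # FIFO queue of boxes
    while Q:
        box = Q.popleft()
        lengths = [hi - lo for lo, hi in box]
        if max(lengths) < eps:             # box small enough: test its centroid
            Xc = np.array([0.5 * (lo + hi) for lo, hi in box])
            fc = C2(Ps, us, Xc)
            if fc < f0:
                f0, X0 = fc, Xc
            continue
        XL, fL = lower_bound_min(Ps, us, box)
        if fL > f0:                        # lower bound exceeds best cost: prune
            continue
        # local refinement of the true cost, started from X_L
        res = minimize(lambda X: C2(Ps, us, X), XL, method="BFGS")
        if res.fun < f0:
            f0, X0 = res.fun, res.x
        # split along the largest dimension and enqueue both halves
        k = int(np.argmax(lengths))
        lo, hi = box[k]
        mid = 0.5 * (lo + hi)
        left, right = list(box), list(box)
        left[k], right[k] = (lo, mid), (mid, hi)
        Q.append(left)
        Q.append(right)
    return X0, f0
```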
1. The sequence of values for f0 set at line 8 of Algorithm 1 is a decreasing sequence converging to C2(X_opt).
2. The sequence of values l_j is an increasing sequence converging to C2(X_opt).

Thus, the optimal value C2(X_opt) is sandwiched between two bounds, which can always be tested as the algorithm proceeds. The algorithm can be terminated when the difference between the bounds is small enough.

We now proceed with the proof. Let δ > 0 be chosen. We will show that some value of f0 will be smaller than C2(X_opt) + δ, so the values taken by f0 converge to C2(X_opt). First, note that f0 will never be less than C2(X_opt) because of the way it is assigned at step 5 or 8 of the algorithm. The cost function C2(X) is continuous on the box B0, so the set {X ∈ B0 | C2(X) < C2(X_opt) + δ} is an open set. Thus, there exists a ball S containing X_opt on which C2 takes values within δ of the optimal. Consider the sequence of boxes which would be generated in this algorithm if no boxes were eliminated at step 7. Since these boxes are decreasing in size, one of them, B′, must lie inside the ball S. Thus C2(X) < C2(X_opt) + δ on B′. Note that no box that contains the point X_opt can be eliminated during the course of the branch-and-bound algorithm, so the box B′ must be one of the boxes Bj that will eventually be evaluated. Since box B′ cannot be eliminated at step 7 of the algorithm, step 8 will be executed. The value fC found by optimizing starting in box Bj will be less than C2(X_opt) + δ, since all points in Bj satisfy this condition. Thus, f0 will be assigned a value less than C2(X_opt) + δ, if it does not already satisfy this condition. This completes the proof that f0 converges to C2(X_opt).
Next, we prove that l_j converges to C_2(X_opt). As before, define w̄_i = max_{X∈B} λ_i(X). Also define w_i = min_{X∈B} λ_i(X). As the size of the boxes diminishes towards zero, the ratio w̄_i / w_i decreases towards 1. We denote by B_j the j-th box taken from the queue during the course of Algorithm 1. Then, for any ε > 0 there exists an integer value N such that w̄_i < (1 + ε) w_i for all i and all j > N. Now, using the same reasoning as in (4) with the directions of the inequalities reversed, we see that

Σ_i (f_i^2(X) + g_i^2(X)) / (w̄_i λ_i(X)) ≤ C_2(X) ≤ Σ_i (f_i^2(X) + g_i^2(X)) / (w_i λ_i(X)).

So, if j > N, then for any point X ∈ B_j,

L(X) ≤ C_2(X) ≤ (1 + ε) L(X).        (5)

We deduce from this, and the definition of l_j, that l_j < C_2(X_opt) ≤ (1 + ε) l_j if j > N. Thus, l_j converges to C_2(X_opt).
6 Experiments
We tested our method with a set of trials involving all the points in the "Notre Dame" data set ([10]). This data set contains 277,887 points and 595 cameras, and gives rise to triangulation problems involving from 2 up to over 100 images. We compared our optimal triangulation method with both the L∞ and Linear methods of triangulation, as well as with iterative methods using the BFGS algorithm and Levenberg-Marquardt, initialized by the Linear and L∞ algorithms. The results of these experiments are shown in Figures 2 and 3. Although the experiments were run on all 277,887 points, only 30,000 points (randomly chosen) are shown in the following graphs, because of difficulties plotting 270,000 points.

Synthetic Data. Suspecting that previous methods would have difficulty with points near infinity, we devised an experiment to test this. The results are a little preliminary, but they appear to show that the "homogeneous" Linear method (see Section 2.1) still works well, whereas the inhomogeneous SVD method fails a few percent of the time, and iteration does not recover from the failure.

6.1 Timing
Our algorithm is coded in C++ and takes about 0.02 seconds per point to run on a standard desktop computer. We were unable to evaluate the algorithm of [1] directly, and its speed is a little uncertain. The authors have claimed an average timing of 6 seconds (unpublished work), but we cannot verify this. Their algorithm is in Matlab, but the main numerical processing is done in SeDuMi (which is coded in C). Their algorithm is substantially more complex, and it is unlikely that the difference between Matlab and C++ would account for the 300-fold speed-up of our algorithm compared with theirs.
Fig. 2. Plot of Linear (left) and L∞ (right) triangulation versus the optimal algorithm. The graph shows the results for 30,000 points randomly chosen from the full Notre Dame set. The top row shows plots of the difference in residual (in pixels) between the two methods, plotted against the point number. The points are sorted according to the difference in residual so as to give a smooth curve, from which one sees the quantitative difference between the optimal and Linear or L∞ algorithms more easily. The plot for the Linear data is truncated on the Y-axis; the maximum difference in residual in reality reaches 12 pixels. The second row presents the same results in a different way: it shows a scatter plot of the optimal versus the Linear or L∞ residuals. Note that the optimal algorithm always gives better results than the Linear or L∞ algorithms. (Specifically, the top row shows that the difference of residuals is always positive.) This is expected and only serves to verify the optimality of the algorithm. In addition, L∞ is seen to outperform the Linear method.
7 Conclusion
The key feature of the proposed method is that it guarantees global optimality at a reasonable computational cost, and it can be applied to large-scale triangulation problems. Although the given experiments show that traditional local methods also work very well on most occasions, their dependence on initialization will always be a disadvantage. Our method may still be a little slow for some large-scale applications, but it provides an essential benchmark against which other algorithms may be tested, to verify whether they are obtaining optimal results. On real examples where the triangulated points were at a great distance from the cameras, the Linear algorithm gave such poor results that iteration was unable to find the optimal solution in many cases. On the other hand, conventional
Fig. 3. Plot of Levenberg-Marquardt (LM) refined results versus the optimal algorithm. The stopping criterion for the LM algorithm was our default (relatively quick) stopping criterion, resulting in about three iterations. The optimal algorithm still does best, but only on a few examples. Note that for visibility, we only show the first 2000 (out of 30,000) sorted points. When the LM algorithm and BFGS algorithms were run for more iterations (from either Linear or L∞ initialization), the results were largely indistinguishable on this data set from the optimal results. Also shown is a snapshot of the dataset that we used.
iterative methods worked well on the Notre Dame data set, because most of the points were relatively close or triangulated from a wide baseline.
References
1. Agarwal, S., Chandraker, M., Kahl, F., Belongie, S., Kriegman, D.: Practical global optimization for multiview geometry. In: Proc. European Conference on Computer Vision (2005)
2. Hartley, R., Schaffalitzky, F.: L∞ minimization in geometric reconstruction problems. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition, Washington DC, pp. I-504–509 (2004)
3. Triggs, W., McLauchlan, P.F., Hartley, R.I., Fitzgibbon, A.: Bundle adjustment for structure from motion. In: Triggs, B., Zisserman, A., Szeliski, R. (eds.) Vision Algorithms: Theory and Practice. LNCS, vol. 1883, pp. 298–372. Springer, Heidelberg (2000)
4. Kahl, F.: Multiple view geometry and the L∞-norm. In: Proc. International Conference on Computer Vision, pp. 1002–1009 (2005)
5. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2003)
6. Hartley, R.I., Sturm, P.: Triangulation. Computer Vision and Image Understanding 68(2), 146–157 (1997)
7. Stewenius, H., Schaffalitzky, F., Nister, D.: How hard is 3-view triangulation really? In: Proc. International Conference on Computer Vision, pp. 686–693 (2005)
8. Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Oxford University Press, Oxford (2006)
9. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
10. Snavely, N., Seitz, S., Szeliski, R.: Photo tourism: Exploring photo collections in 3D. ACM Trans. on Graphics 25(3), 835–846 (2006)
Adaptively Determining Degrees of Implicit Polynomial Curves and Surfaces Bo Zheng, Jun Takamatsu, and Katsushi Ikeuchi Institute of Industrial Science, The University of Tokyo, Komaba 4-6-1, Meguro-ku, Tokyo, 153-8505 Japan
[email protected]
Abstract. Fitting an implicit polynomial (IP) to a data set usually suffers from the difficulty of determining a moderate polynomial degree: an over-low degree leads to less accuracy than one expects, whereas an over-high degree leads to global instability. We propose a method that automatically determines a moderate degree in an incremental fitting process based on QR decomposition. This incremental process is computationally efficient, since by reusing the calculation results from the previous step, the burden of calculation is dramatically reduced at the next step. At the same time, fitting instabilities can easily be detected by inspecting the diagonal of the upper triangular matrix from the QR decomposition, since its diagonal elements are equal to its eigenvalues. Based on this beneficial property, and combining it with Tasdizen's ridge regression method, a new technique is also proposed for improving fitting stability.
1 Introduction
Recently, representing 2D and 3D data sets with implicit polynomials (IPs) has been attractive for vision applications such as fast shape registration, pose estimation [1,2,3,4], recognition [5], smoothing and denoising, image compression [6], etc. In contrast to other function-based representations such as B-spline, Non-Uniform Rational B-Splines (NURBS), and radial basis function (RBF) [7], IPs are superior in the areas of fast fitting, few parameters, algebraic/geometric invariants, robustness against noise and occlusion, etc. A 3D IP function of degree n is defined as:

f_n(x) = Σ_{0≤i,j,k; i+j+k≤n} a_{ijk} x^i y^j z^k = (1 ... z^n)(a_{000} a_{100} ... a_{00n})^T = m(x)^T a,        (1)

where x = (x y z) is a data point. f_n(x)'s zero set {x | f_n(x) = 0} is used to represent the given data set. The estimation of the IP coefficients belongs to the conventional fitting problem, and various methods have been proposed [1,2,3,4,8]. These fitting methods cannot adaptively control the IP degrees for different object shapes; it is well known that simple shapes correspond to low-degree IPs whereas complicated shapes correspond to higher ones. Fig. 1 shows that
Fig. 1. IP fitting results: (a) Original data set; (b) 4-degree IP surface; (c) 8-degree IP surface; (c’) Stability-improved 8-degree IP surface; (d) 10-degree IP surface; (d’) Stability-improved 10-degree IP surface
when fitting an object like Fig. 1(a), an over-low degree leads to inaccuracy (Fig. 1(b)), whereas an over-high degree leads to an unstable result: too many unexpected zero sets appear (see Fig. 1(d)). This paper provides a solution for automatically finding the moderate-degree IP (such as Fig. 1(c)). Another issue of IP fitting is that there may be collinearity in the covariance matrix derived from the least-squares method, making the fit prone to instability [4], e.g., the fitting results shown in Fig. 1(c) and (d). In order to address this issue, we propose a method for automatically detecting this collinearity and improving it, and we also combine the Ridge Regression (RR) technique recently introduced by [4,9]. Fig. 1(c') and (d') show the improved results of Fig. 1(c) and (d) respectively, where the redundant zero sets are removed. Note that although Fig. 1(d) is globally improved by our method to Fig. 1(d'), since there are too many redundant zero sets that need to be eliminated, the local accuracy is also affected considerably. Therefore, we first aim at adaptively finding a moderate degree, and then apply our stability-improving method to obtain a moderate result (accurate both locally and globally). This paper is organized as follows: Section 2 gives a review of IP fitting methods; Sections 3 and 4 provide an incremental method for fitting IPs of moderate degree; Section 5 discusses how to improve global stability based on our algorithm; Section 6 presents experimental results, followed by discussion and conclusions in Sections 7 and 8.
2 Implicit Polynomial Fitting
In order to estimate the coefficient vector a in (1), a simple method is to solve a linear system

M a = b,        (2)

where M is the matrix of monomials, the ith row of M being m(x_i) (see (1)), and b is a zero vector. But generally M is not a square matrix, and the linear system is usually solved by the least-squares method. Solutions to this least-squares problem are classified into nonlinear methods [1,2,10] and linear methods [3,4,8,9]. Because the linear methods are
simpler and much faster than the nonlinear ones, they are preferable and can be formulated as:

M^T M a = M^T b.        (3)
Note that this formula is just transformed from the least-squares result a = M† b, where M† = (M^T M)^{-1} M^T is called the pseudo-inverse matrix. Direct solution of the linear system is numerically unstable, since M^T M is nearly singular and b = 0; thus a is determined from the kernel basis. Fortunately, methods for improving the classical linear fitting (avoiding the ill-conditioned matrix M^T M in (3)) have already been proposed by adding some constraints, such as the 3L method [3], the Gradient-One method [4] and the Min-Max method [8]. The singularity of M is improved and b is also derived as a nonzero vector. In the prior methods, the symmetric linear system (3) was solved by classical solvers such as the Cholesky decomposition method, the conjugate gradient method, singular value decomposition (SVD), and their variations. But none of them allow changing the degree during the fitting procedure. This is the main reason why these prior methods require a fixed degree.
3 Incremental Fitting
This section shows the computational efficiency of the proposed incremental fitting method. Although the IP degree is gradually increased until a moderate fitting result is obtained, the computational cost is kept low because each step can completely reuse the calculation results of the previous step. In this section, we first describe the method for fitting an IP with QR decomposition. Next, we show the incrementability of Gram-Schmidt QR decomposition. After that, we clarify the amount of calculation needed to increase the IP degree.

3.1 Fitting Using QR Decomposition
Without solving the linear system (3) directly, our method first carries out a QR decomposition of the matrix M as M = Q_{N×m} R_{m×m}, where Q satisfies Q^T Q = I (I is the identity matrix) and R is an invertible upper triangular matrix. Then, substituting M = QR into (2), we obtain:

Q R a = b  →  Q^T Q R a = Q^T b  →  R a = Q^T b  →  R a = b̃,        (4)

where b̃ = Q^T b.
Since R is an upper triangular matrix, a can be quickly solved by back-substitution (in O(m^2)).

3.2 Gram-Schmidt QR Decomposition
Let us assume that matrix M consisting of columns {c1 , c2 , · · · , cm } is known. We show the method of Gram-Schmidt orthogonalization, that is, how to orthogonalize the columns of M into the orthonormal vectors {q1 , q 2 , · · · , q m } which
Fig. 2. The triangular linear system grows from the ith step to the (i + 1)th, and only the calculation shown in light-gray is required
are columns of matrix Q, and simultaneously calculate the corresponding upper triangular matrix R consisting of elements r_{i,j}. The algorithm is as follows. Initially let q_1 = c_1/||c_1|| and r_{1,1} = ||c_1||. If {q_1, q_2, ..., q_i} have been computed, then the (i+1)th step is:

r_{j,i+1} = q_j^T c_{i+1},  for j ≤ i,
q_{i+1} = c_{i+1} − Σ_{j=1}^{i} r_{j,i+1} q_j,
r_{i+1,i+1} = ||q_{i+1}||,   q_{i+1} = q_{i+1} / ||q_{i+1}||.        (5)
From this algorithm, we can see that Gram-Schmidt orthogonalization can be carried out in an incremental manner, which orthogonalizes the columns of M one by one.

3.3 Additional Calculation for Increasing an IP Degree
The incrementability of the QR decomposition with Gram-Schmidt orthogonalization makes our incremental method computationally efficient. Fig. 2 illustrates this efficiency by the calculation from the ith step to the (i+1)th step in our incremental process: it is only necessary to calculate the parts shown in light-gray. For constructing these two upper triangular linear systems from the ith step to the (i+1)th step, we only need two types of calculation: 1) for growing the upper triangular matrix from R_i to R_{i+1}, calculate the rightmost column and add it to R_i to construct R_{i+1}; and 2) for growing the right-hand vector from b̃_i to b̃_{i+1}, calculate the bottom element and append it to b̃_i to construct b̃_{i+1}. The first calculation is obtained simply from Gram-Schmidt orthogonalization in (5). For the second calculation, denoting by b̃_{i+1} the bottom element of the grown vector, it can follow the (i+1)th step of Gram-Schmidt orthogonalization in (5) and be calculated as b̃_{i+1} = q_{i+1}^T b. In order to clarify the computational efficiency, let us make a comparison between our method and an incremental method that iteratively calls a linear
method such as the 3L method at each step. It is obvious that, for solving the coefficients a at the ith step, our method needs i inner-product operations for constructing the upper triangular linear system, and O(i^2) for solving this linear system; whereas the latter method needs i × i inner-product operations for constructing linear system (3), and O(i^3) for solving (3). Let us define a function G to denote the above calculation. Then, if we repeatedly call function G, we obtain the incremental (dimension-growing) upper triangular linear systems, and the corresponding coefficient vectors with different degrees can be solved from them.
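A minimal sketch of the function G and the subsequent triangular solve, under our own naming and the convention of starting from empty (N×0) arrays, is:

```python
import numpy as np
from scipy.linalg import solve_triangular

def add_column(Q, R, btilde, c, b):
    """One call of the function G: grow Q, R and btilde by one column.

    Q : N x i matrix of orthonormal columns (start with np.zeros((N, 0)))
    R : i x i upper triangular matrix, btilde : length-i vector Q^T b
    c : new monomial column c_{i+1},  b : right-hand side of M a = b
    Only O(N*i) work is needed instead of refactoring M from scratch.
    """
    i = R.shape[0]
    r_new = Q.T @ c                        # r_{j,i+1} = q_j^T c_{i+1}, Eq. (5)
    q = c - Q @ r_new                      # remove components along q_1..q_i
    r_diag = np.linalg.norm(q)
    # A near-zero r_diag signals a nearly collinear column (cf. Sect. 5.2)
    q = q / r_diag
    R_new = np.zeros((i + 1, i + 1))
    R_new[:i, :i], R_new[:i, i], R_new[i, i] = R, r_new, r_diag
    Q_new = np.column_stack([Q, q])
    btilde_new = np.append(btilde, q @ b)  # bottom element q_{i+1}^T b
    return Q_new, R_new, btilde_new

def solve_upper(R, btilde):
    """Back-substitution for R a = btilde, O(m^2)."""
    return solve_triangular(R, btilde, lower=False)
```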
4 Finding the Moderate Degree
To construct an algorithm for finding the moderate degree, we face two problems: 1) In what order should the columns c_α be chosen? 2) When should the incremental procedure be stopped? Note: as a matter of convenience, hereafter we use the notation α to denote the column index of M instead of i.

4.1 Incremental Order of Monomials
Feeding the columns c_α of M into the function G in a different order may lead to a different result, so it is important to decide a suitable order. A reasonable way is to choose c_α in the degree-growing order described by Tab. 1. The reason is as follows: when we fit a two-degree IP to data on a unit circle, a unique solution −1 + x^2 + y^2 = 0 can be obtained, while if we choose a three-degree IP to fit, solutions such as x(−1 + x^2 + y^2) = 0 are obtained, which contain redundant zero-set groups such as x = 0.
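For illustration, the 3D degree-growing order and the α index of Tab. 1 can be generated as follows; the helper names are our own, and alpha_3d simply evaluates the formula quoted in the caption of Tab. 1 (so that monomial_exponents_3d(n)[alpha_3d(i, j, k) - 1] == (i, j, k)).

```python
def monomial_exponents_3d(n):
    """Exponents (i, j, k) of the 3D monomials, listed in the degree-growing
    order implied by the alpha formula of Tab. 1."""
    exps = []
    for d in range(n + 1):                 # total degree i + j + k = d
        for i in range(d, -1, -1):
            for j in range(d - i, -1, -1):
                exps.append((i, j, d - i - j))
    return exps

def alpha_3d(i, j, k):
    """Column index alpha for exponent (i, j, k); alpha starts at 1."""
    s, t = j + k, i + j + k
    return k + (s + 1) * s // 2 + (t + 2) * (t + 1) * t // 6 + 1
```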
4.2 Stopping Criterion
For the second problem, we have to define a stopping criterion based on a similarity measure between the IP and the data set. Once this stopping criterion is satisfied, we consider that the desired accuracy is reached and the procedure is terminated. First, let us introduce a set of similarity functions measuring the similarity between the IP and the data set, as follows:

D_dist = (1/N) Σ_{i=1}^{N} e_i,    D_smooth = (1/N) Σ_{i=1}^{N} (N_i · n_i),        (6)
where N is the number of points, e_i = |f(x_i)| / ||∇f(x_i)||, N_i is the normal vector at a point obtained from the relations of the neighbor normals (here we refer to Sahin's method [9]), and n_i = ∇f(x_i) / ||∇f(x_i)|| is the normalized gradient vector of f at x_i. e_i has proved useful for approximating the Euclidean distance from x_i to the IP zero set [2].
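A direct transcription of the two measures might look like the sketch below, where f and grad_f stand for the current IP and its gradient, and the data normals N_i are assumed to have been estimated from neighboring points as in [9]; the function names are our own.

```python
import numpy as np

def similarity_measures(f, grad_f, points, normals):
    """D_dist and D_smooth of Eq. (6).

    f, grad_f : callables giving the IP value and gradient at a point
    points    : (N, d) array of data points x_i
    normals   : (N, d) array of unit data normals N_i
    """
    e, dots = [], []
    for x, N_i in zip(points, normals):
        g = np.asarray(grad_f(x))
        gnorm = np.linalg.norm(g)
        e.append(abs(f(x)) / gnorm)        # e_i, first-order distance estimate
        dots.append(N_i @ (g / gnorm))     # N_i . n_i
    return float(np.mean(e)), float(np.mean(dots))

def should_stop(D_dist, D_smooth, T1, T2):
    """Stopping criterion (7): (D_dist < T1) and (D_smooth > T2)."""
    return (D_dist < T1) and (D_smooth > T2)
```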
Table 1. Index list: i, j and k are the powers of x, y and z respectively, and α is the column index of M. The relation between α and (i, j, k) is α = j + (i+j+1)(i+j)/2 + 1 (for 2D) and α = k + (j+k+1)(j+k)/2 + (i+j+k+2)(i+j+k+1)(i+j+k)/6 + 1 (for 3D).

(a) Index list for 2D:
α      : 1      2      3      4      5      6      7      8      9      10     ...  m
[i j]  : [0 0]  [1 0]  [0 1]  [2 0]  [1 1]  [0 2]  [3 0]  [2 1]  [1 2]  [0 3]  ...  [0 n]
Form   : L0     L1 (i+j=1)    L2 (i+j=2)           L3 (i+j=3)                  ...  Ln (i+j=n)

(b) Index list for 3D:
α        : 1        2        3        4        5        6        7        8        9        10       ...  m
[i j k]  : [0 0 0]  [1 0 0]  [0 1 0]  [0 0 1]  [2 0 0]  [1 1 0]  [1 0 1]  [0 2 0]  [0 1 1]  [0 0 2]  ...  [0 0 n]
Form     : L0       L1 (i+j+k=1)               L2 (i+j+k=2)                                          ...  Ln (i+j+k=n)
D_dist and D_smooth can be considered as two measurements of the distance and smoothness, respectively, between the data set and the IP zero set. We define our stopping criterion as:

(D_dist < T_1) ∧ (D_smooth > T_2).        (7)

4.3 Algorithm for Finding the Moderate IPs
Having the above conditions, our algorithm is simply described as follows:
1) Call the function G to construct the upper triangular linear system;
2) Solve this linear system to obtain the coefficient vector a;
3) Measure the similarity for the obtained IP;
4) Stop the algorithm if the stopping criterion (7) is satisfied; otherwise go to 1) to grow the dimension.
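Under the same assumptions as the earlier sketches (whose helper names are reused here and are our own, not the authors' code), this loop can be written as:

```python
import numpy as np

def fit_adaptive_degree(columns, b, points, normals, make_f, T1=0.01, T2=0.95):
    """Driver loop for steps 1)-4): grow the system column by column and
    stop when criterion (7) is met.

    columns : monomial columns c_alpha of M, in the order of Tab. 1
    make_f  : assumed helper turning a coefficient vector into (f, grad_f)
    """
    N = len(b)
    Q, R, btilde = np.zeros((N, 0)), np.zeros((0, 0)), np.zeros(0)
    a = None
    for c in columns:
        Q, R, btilde = add_column(Q, R, btilde, c, b)   # the function G
        a = solve_upper(R, btilde)                      # current coefficients
        f, grad_f = make_f(a)
        D_dist, D_smooth = similarity_measures(f, grad_f, points, normals)
        if should_stop(D_dist, D_smooth, T1, T2):       # criterion (7)
            break
    return a
```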
5 Improving Global Stability
Linear fitting methods in general suffer from not achieving global stability, which is well discussed in [4,9]. Since our fitting method belongs to these linear methods, we face the same problem. We propose a method for addressing this by controlling the condition number of the matrix M.

5.1 Stability and Condition Number of M
An important reason for global instability is the collinearity of M , which causes the matrix M T M to be nearly singular with some eigenvalues negligible
compared to the others [4]. Such degenerate eigenvalues contribute very little to the overall shape of the fit. Fortunately, since M = QR, we have M^T M = R^T R, and thus the condition number of M^T M can be improved by controlling the eigenvalues of R. Note that here we simply consider the condition number as λ_max/λ_min, where λ_max and λ_min are the maximum and minimum eigenvalues respectively. From the property that an upper triangular matrix R's eigenvalues lie on its main diagonal, we can easily evaluate the singularity of R by observing only the diagonal values. To improve the condition number of M^T M, this paper gives a solution from two aspects: eliminating collinear columns of M, and using the Ridge Regression method.

5.2 Eliminating Collinear Columns of M
The first simple idea is to check, at each step, whether the eigenvalue r_{i,i} is too small (nearly zero). If r_{i,i} is too small, the current column c_i of M is nearly collinear to the space spanned by {c_1, c_2, ..., c_{i-1}}. Thus, to keep R well conditioned, this column should be abandoned and the subsequent columns tried instead.

5.3 Ridge Regression Method
Ridge regression (RR) regularization is an effective method that improves the condition number of M^T M by adding a small moderate value to each diagonal element, e.g., adding a term κD to M^T M [4,9]. Accordingly, equation (3) can be modified as (M^T M + κD) a = M^T b, and equation (4) can be modified as

(R + κ R^{-T} D) a = b̃,        (8)
where κ is a small positive value called the RR parameter and D is a diagonal matrix. D is chosen to maintain Euclidean invariance; the simplest choice is to let D be the identity matrix. A cleverer choice has been proposed by T. Tasdizen et al. [4] for 2D and T. Sahin et al. [9] for 3D. In fact, their strategies add constraints that keep the leading forms of the polynomial strictly positive, which ensures that the zero sets of polynomials of even degree are always bounded (see the proof in [4]). We give details of this derivation in the appendix.
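In the incremental setting, the modified system (8) can be solved directly; a minimal sketch of its solution, assuming R, b̃ and D are available as NumPy arrays, is:

```python
import numpy as np

def solve_with_ridge(R, btilde, D, kappa):
    """Solve the RR-stabilized system (R + kappa * R^{-T} D) a = btilde, Eq. (8).

    D may be the identity or the Tasdizen/Sahin choice from the appendix;
    kappa is the RR parameter.
    """
    RinvT_D = np.linalg.solve(R.T, D)            # R^{-T} D, column by column
    return np.linalg.solve(R + kappa * RinvT_D, btilde)
```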
6 Experimental Results
The settings for our experiments involve some preconditions: 1) As a matter of convenience, we employ the constraints of the 3L method [3]; 2) All data sets are regularized by centering the data-set center of mass at the origin of the coordinate system and scaling by dividing each point by the average distance from the points to the origin; 3) We choose T_1 in (7) as about 20 percent of the layer distance c of the 3L method, as done in [3].
Fig. 3. IP fitting results: (a) Original image. (b) α = 28 (six-degree). (c) α = 54 (nine-degree). (d) α = 92 (thirteen-degree). (e) Convergence of D_dist and D_smooth. Note that "o" symbols represent the boundary points extracted from the image and solid lines represent the IP zero set in (b)-(d).
6.1 A Numerical Example
In this experiment, we fit an IP to the boundary of a cell shown in Fig. 3(a). The stopping criterion is set as T_1 = 0.01, T_2 = 0.95, and the layer distance of the 3L method is c = 0.05. The moderate IP is found automatically (see Fig. 3(d)). For comparison, we also show some fits obtained before the desired accuracy is reached (see Fig. 3(b) and (c)); these results are improved by the method described in Section 5. We also track the convergence of D_dist and D_smooth, shown in Fig. 3(e). Although there are some small fluctuations in the graph, D_dist and D_smooth converge to 0 and 1 respectively, which also shows that the stopping criterion in (7) can effectively measure the similarity between the IP and the data set.

6.2 2D and 3D Examples
Some 2D and 3D experiments are shown in Fig. 4, where the fitting results are obtained with the same parameters as in the first example. We conclude that objects with different shapes may obtain fitting results of different degrees, since the resulting degrees always reflect the complexity of the shapes.

6.3 Degree-Fixed Fitting Compared with Adaptive Fitting
Fig.5 shows some comparisons between degree-fixed fitting methods and our adaptive fitting method. Compared to degree-fixed methods such as [3,4,8], the results of our method show that there is neither over-fitting nor the insufficient
Fig. 4. IP fitting results. First row: original objects; second row: IP fits (the IP degrees found for the six objects are 6, 8, 11, 12, 12 and 12).
fitting, and we also attain global stability. It shows that our method is more meaningful than the degree-fixed methods, since it fulfills the requirement that the degrees should be subject to the complexities of object shapes.
7 Discussion

7.1 QR Decomposition Methods
Other well-known algorithms for QR decomposition are Householder and Givens [11]. In numerical computation, Householder and Givens have proved more stable than the conventional Gram-Schmidt method. In this paper, however, since our discussion assumes a well-conditioned, regularized data set, we ignore the small effect of rounding errors, which causes the instability. Here we simply take advantage of the property of QR decomposition that it orthogonalizes vectors one by one, to demonstrate the possibility of constructing the moderate-degree fitting algorithm described above.

7.2 IP vs Other Functions
In contrast to other function-based representations such as B-splines, NURBS, and radial basis functions, the IP representation cannot give a comparably accurate model. But this representation is more attractive for applications that require fast registration and fast recognition (see the works [4,5,12,13,14]), because of its algebraic/geometric invariants [15]. Sahin also showed some experiments with missing data in [9]. More accurate representation of a complex object may require segmenting the object shape and representing each segmented patch with an IP. We will consider this possibility in our future work.
Fig. 5. Comparison between degree-fixed fitting and adaptive fitting. First row: Original objects. Second and third row: IP fits resulting from degree-fixed fitting with 2-degree and 4-degree fitting respectively. Fourth row: Adaptive fitting. Mark *: moderate fitting. †: insufficient fitting (losing accuracy). ‡: over-fitting.
7.3 Setting the Parameters T1 and T2
Since our stopping criterion can be approximately seen as a kind of Euclidean metric, it is intuitive to control moderate fitting accuracy by setting the appropriate values to T1 and T2 . Basically, these parameters should be decided based on object scale or statistics about data noise. Further discussion is beyond the scope of this paper. Fortunately it is intuitive to decide if the parameters are appropriate for your demand, since the 2D/3D Euclidean distance can be easily observed. In this paper, we practically let T1 and T2 be close to zero and one respectively for a smooth model and more tolerant values for a coarse one.
8 Conclusions
This paper provided an incremental method for fitting shape-representing IPs. With our stopping criterion, an IP of moderate degree can be found adaptively in one fitting process, and global fitting stability is successfully improved. Our results support the argument that IP degrees adaptively determined by shape are better than fixed degrees, because this not only saves much time for users but also suits future applications involving automatic recognition systems.
Acknowledgements. Our work was supported by the Ministry of Education, Culture, Sports, Science and Technology under the Leading Project: Development of High Fidelity Digitization Software for Large-Scale and Intangible Cultural Assets.
References
1. Keren, D., Cooper, D.: Describing Complicated Objects by Implicit Polynomials. IEEE Trans. on Patt. Anal. Mach. Intell. (PAMI) 16(1), 38–53 (1994)
2. Taubin, G.: Estimation of Planar Curves, Surfaces and Nonplanar Space Curves Defined by Implicit Equations with Applications to Edge and Range Image Segmentation. IEEE Trans. on Patt. Anal. Mach. Intell. (PAMI) 13(11), 1115–1138 (1991)
3. Blane, M., Lei, Z.B., Cooper, D.B.: The 3L Algorithm for Fitting Implicit Polynomial Curves and Surfaces to Data. IEEE Trans. on Patt. Anal. Mach. Intell. (PAMI) 22(3), 298–313 (2000)
4. Tasdizen, T., Tarel, J.P., Cooper, D.B.: Improving the Stability of Algebraic Curves for Applications. IEEE Trans. on Imag. Proc. 9(3), 405–416 (2000)
5. Tarel, J.-P., Cooper, D.B.: The Complex Representation of Algebraic Curves and Its Simple Exploitation for Pose Estimation and Invariant Recognition. PAMI 22(7), 663–674 (2000)
6. Helzer, A., Bar-Zohar, M., Malah, D.: Using Implicit Polynomials for Image Compression. In: Proc. 21st IEEE Convention of the Electrical and Electronic Eng., pp. 384–388. IEEE Computer Society Press, Los Alamitos (2000)
7. Turk, G., O'Brien, J.F.: Variational Implicit Surfaces. Technical Report GIT-GVU-99-15, Graphics, Visualization, and Usability Center (1999)
8. Helzer, A., Barzohar, M., Malah, D.: Stable Fitting of 2D Curves and 3D Surfaces by Implicit Polynomials. IEEE Trans. on Patt. Anal. Mach. Intell. (PAMI) 26(10), 1283–1294 (2004)
9. Sahin, T., Unel, M.: Fitting Globally Stabilized Algebraic Surfaces to Range Data. In: Proc. 10th IEEE Int. Conf. on Computer Vision (ICCV), vol. 2, pp. 1083–1088 (2005)
10. Kanatani, K.: Renormalization for Computer Vision. The Institute of Elec., Info. and Comm. Eng. (IEICE) Transaction 35(2), 201–209 (1994)
11. Horn, R.A., Johnson, C.R.: Matrix Analysis: Section 2.8. Cambridge University Press, Cambridge (1985)
12. Tarel, J.P., Civi, H., Cooper, D.B.: Pose estimation of free-form 3D objects without point matching using algebraic surface models. In: Proceedings of IEEE Workshop Model Based 3D Image Analysis, pp. 13–21. IEEE Computer Society Press, Los Alamitos (1998)
13. Khan, N.: Silhouette-Based 2D-3D Pose Estimation Using Implicit Algebraic Surfaces. Master Thesis in Computer Science, Saarland University (2007)
14. Unsalan, C.: A Model Based Approach for Pose Estimation and Rotation Invariant Object Matching. Pattern Recogn. Lett. 28(1), 49–57 (2007)
15. Taubin, G., Cooper, D.: Symbolic and Numerical Computation for Artificial Intelligence. In: Computational Mathematics and Applications, ch. 6. Academic Press, London (1992)
Appendix A. Choosing Diagonal Matrix D for the RR Method

A choice of the diagonal matrix D for the RR method was derived by T. Tasdizen et al. [4] and T. Sahin et al. [9] as d_{αα} = γ t̂, where d_{αα} is the αth diagonal element of D and is computed with respect to i, j, k. The relationship between the index α and (i, j, k) is shown in Tab. 1. γ is a free parameter for the (i+j)th form (2D) or the (i+j+k)th form (3D), decided from the data set as follows:

γ_{i+j} = Σ_{t=1}^{N} (x_t^2 + y_t^2)^{i+j}   (2D),    γ_{i+j+k} = Σ_{t=1}^{N} (x_t^2 + y_t^2 + z_t^2)^{i+j+k}   (3D),        (9)

where (x_t, y_t) and (x_t, y_t, z_t) are the data points. t̂ is a variable with respect to i, j, k, and to maintain Euclidean invariance it can be derived as:

t̂ = i!j! / (i+j)!   (2D)    and    t̂ = i!j!k! / (i+j+k)!   (3D).        (10)
Determining Relative Geometry of Cameras from Normal Flows Ding Yuan and Ronald Chung Department of Mechanical & Automation Engineering The Chinese University of Hong Kong Shatin, Hong Kong, China {dyuan,rchung}@mae.cuhk.edu.hk
Abstract. Determining the relative geometry of cameras is important in an active binocular head or a multi-camera system. Most of the existing works rely upon the establishment of either motion correspondences or binocular correspondences. This paper presents a first solution method that requires neither the recovery of full optical flow in either camera, nor overlap in the cameras' visual fields, and in turn no binocular correspondences. The method is based upon observations that are directly available in the respective image streams, namely the monocular normal flow. Experimental results on synthetic data and real image data are shown to illustrate the potential of the method.

Keywords: Camera calibration, Extrinsic camera parameters, Active Vision.
1 Introduction

Active vision systems allow the position and orientation of each camera to be arbitrarily controlled according to need and have many advantages. They however require the relative geometry of the cameras to be determined from time to time for fusion of the visual information. Determination of the relative geometry of cameras is a well-studied problem. What makes the problem unique and particularly challenging in active vision is that there is no guarantee on how much overlap there exists between the visual fields of the cameras. In the extreme case, there could be no overlap. There have been many methods in the literature proposed for binocular geometry determination. Some methods require certain specific objects appearing in the scene, such as planar surfaces [7] and cubic objects [9]. Such methods constitute simpler solution mechanisms, but they are restricted to certain applications or scenes. Other methods [3] [11] require the accessibility of either cross-camera feature correspondences [1] [11] or motion correspondences [2]. While cross-camera correspondences require the cameras to have much in common in what they picture, establishing motion correspondences is an ill-posed problem and the result is not always reliable when the scene contains much occlusion. The objective of this work is to develop a solution method that does not assume the presence of calibration objects or specific shapes or features in the imaged scene, nor impose restrictions on the viewing directions of the cameras, thus allowing the visual fields of the cameras to have little or zero overlap.
In an inspiring work, Fermüller and Aloimonos [4][5] described a method (hereafter referred to as the FA-camera motion determination method) of determining the ego-motion of a camera directly from normal flow. Normal flow is the apparent motion of image position in the image stream picked up by the camera, which must be in the direction of the intensity gradient, and can be detected directly by applying some simple derivative filters to the image data. Fermüller and Aloimonos first define, for any arbitrarily chosen 3D axis p that contains the optical center, a vector field over the entire image space. Patterns with “+” and “-” candidates are generated according to the vector field and the normal flows. The camera motion parameters can be determined from those patterns. The FA-camera motion determination method forms the basis of a method we proposed earlier [10] for determining the relative geometry of two cameras. However, a key issue has not been addressed in [10]. Any p-axis allows not all but only a small subset of the data points – those whose normal flow is parallel or anti-parallel to the field vector there with respect to p-axis – to be usable. Different choices of the p-axis would allow different subsets of the data points to be usable, each with a different density of the data points in the space. The choices of the p-axis are thus crucial; they determine the density of the data points and the accuracy in determining the camera motions and in turn the binocular geometry. However, in both the FA-camera motion determination method [4][5] and in our earlier work on binocular geometry determination [10], no particular mechanism was devised in choosing the p-axes. This paper presents how the p-axes can be chosen optimally for best use of the data points. We assume that the intrinsic parameters of the cameras have been or are to be estimated by camera self-calibration methods like [8] [11]. The focus of this work is the estimation of the camera-to-camera geometry.
2 Copoint Vector Field with Respect to Arbitrary 3D Direction

Fermüller and Aloimonos [4][5] proposed a vector field which was then applied in their camera motion determination method. Our binocular geometry determination method also makes use of this vector field, so here we give a brief review of it. Suppose the image space is represented by an image plane positioned perpendicular to the optical axis and at 1 unit away from the optical center. Pick any arbitrary axis p = [A B C] in space that contains the optical center; the axis hits the image plane at the point P = [A/C B/C]. The family of projection planes that contains axis p defines the family of lines that contain point P on the image plane. The copoint vector field for the image space with respect to the p-axis is defined as the field with vectors perpendicular to the above family of lines about point P, as shown in Fig. 1(a). In the figure each arrow represents the vector assigned to each image point, and it is

[M_x  M_y] = [−y + B/C,  x − A/C] / √[ (x − A/C)^2 + (−y + B/C)^2 ].        (2.1)
Suppose the camera undergoes a pure rotation, which is represented in the rotation-axis form by a vector ω = [α β γ]. For any particular choice of the p-axis, we have a
p-copoint field direction (Equation 2.1) at each image position. At any image position, the dot product between the p-axis induced field vector and the optical flow there allows the image position to be labeled as either "+" if the dot product is positive, or "-" if the dot product is negative. By the distribution of the "+" and "-" labels, the image plane is divided into two regions, positive and negative, with the boundary being a 2nd order curve called the zero-boundary, as illustrated by Fig. 1(b). The zero-boundary is determined only by the ratios α/γ and β/γ. Fig. 1(c) illustrates the positive-negative pattern generated the same way as described above when the camera takes a pure translation. Different from the pattern of Fig. 1(b), the zero-boundary is a straight line (Fig. 1(c)), which is a function of the focus of expansion (FOE) of the optical flow and precisely describes the translational direction of the camera. If the motion of the camera includes both rotation and translation, the positive-negative pattern will include the positive region, the negative region, and the "Don't know" region (which depends on the scene structure), as shown in Fig. 1(d).
Fig. 1. p-copoint vector field and its positive-negative patterns in the image space: (a) p-copoint vector field; positive-negative labeled patterns when (b) the camera undergoes pure rotation; (c) the camera undergoes pure translation; (d) the camera takes arbitrary motion (with both rotation and translation components).
The above labeling mechanism allows constraint for the camera motion to be constructed from optical flow. However, due to the aperture problem the full flow is generally not directly observable from the image data; only the normal flow, i.e., the component of the full flow projected to the direction of the local intensity gradient, is. The above labeling mechanism therefore has to be adjusted, and the positive-negative pattern still can be generated [4] [5]. The main difference is, while with the full flows all the data points can be labeled, with the normal flows only a handful of the data points can be labeled, and the localization of the zero-boundary from the sparsely labeled regions is much more challenging.
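One simple reading of this adjusted labeling, restricted to points whose normal flow is (anti-)parallel to the copoint field vector, is sketched below; the tolerance parameter and function name are our own assumptions, not prescribed by the paper.

```python
import numpy as np

def label_points(points, normal_flows, P, angle_tol=np.deg2rad(3.0)):
    """Label usable image points '+' or '-' for the p-axis meeting the image at P.

    points       : (N, 2) image positions (x, y)
    normal_flows : (N, 2) normal-flow vectors at those positions
    P            : (A/C, B/C), image position of the chosen p-axis
    angle_tol    : assumed tolerance on (anti-)parallelism
    """
    labels, used = [], []
    for idx, ((x, y), nf) in enumerate(zip(points, normal_flows)):
        field = np.array([-y + P[1], x - P[0]])       # direction of Eq. (2.1)
        fn, nn = np.linalg.norm(field), np.linalg.norm(nf)
        if fn == 0 or nn == 0:
            continue
        cosang = (field @ nf) / (fn * nn)
        if abs(cosang) < np.cos(angle_tol):           # flow not (anti-)parallel: unusable
            continue
        used.append(idx)
        labels.append(1 if cosang > 0 else -1)        # '+' or '-' label
    return labels, used
```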
3 Binocular Geometry Determination

Suppose the binocular geometry of two cameras at a particular configuration of the camera system is to be determined. Our procedure consists of the following. The binocular configuration is first frozen, and then moved rigidly in space while image streams are collected from the respective cameras to serve as data for the
determination task. If the binocular geometry is expressed by a 4×4 matrix X, and the two camera motions by A and B, then because of the rigidity of the motion of the camera pair we have the relationship AX = XB, which can be decomposed into the following two equations:

R_A R_x = R_x R_B        (3.1)

(or ω_A = R_x ω_B in vector form), and

(R_A − I) t_x = R_x t_B − t_A        (3.2)
where R_x, R_A, R_B are the 3×3 orthonormal matrices representing the rotational components of X, A, B respectively, t_x, t_A, t_B are the vectors representing the translational components, and ω_A, ω_B are the rotations R_A, R_B expressed in axis-angle form. The plan is that, if the camera motions A and B can be determined from normal flows by the use of, say, the FA-camera motion determination method, Equations (3.1) and (3.2) provide constraints for the determination of the binocular geometry parameters X.

However, there are two complications. First, if general motion of any camera (i.e., motion that involves both translation and rotation) is involved, the image space associated with the camera under the FA-camera motion determination method contains two "Don't know" regions as depicted by Fig. 1(d). The presence of the "Don't know" regions adds much challenge to the localization of the zero-boundaries. Second, with only the normal flows in the respective image streams of the two cameras, only a small percentage of the data points can be made usable in the image space under a random choice of the p-axis. This complication is the most troubling, as the localization of the zero-boundary from very sparsely labeled data points would be an extremely difficult task.

On the first complication, we adopt specific rigid motions of the camera pair to avoid as much as possible the presence of general motion of any camera. On the second complication, we propose a scheme that allows the p-axes to be chosen not randomly as in previous work [4][5][10], but according to how many data points they can make useful in the copoint vector field based method. The scheme allows each data point (an image position with detectable normal flow) to propose a family of p-axes that could make that data point useful, and vote for them in the space of all possible p-axes. The p-axes in the p-axis space that have the highest number of votes from all the data points are then the p-axes we use in the process.

3.1 Determination of Rx

We first determine the rotation component R_x of the binocular geometry. We let the camera pair undergo a specific motion – pure translation – so as to reduce the complexity in locating the zero-boundary of the positive-negative labeled patterns in the image space. When the camera pair exercises a rigid-body pure translation, the motion of each camera is also a pure translation. From Equation (3.2) we have:

t̃_A = R_x t̃_B        (3.3)
where t̃_A and t̃_B are both unit vectors corresponding to the foci of expansion (FOEs) of the two cameras. We previously proved that at least two translational motions in different directions are required to achieve a unique solution of R_x; the solution procedure is presented in our previous work [10]. To determine t̃_A and t̃_B in each rigid-body translation of the camera pair, we adopt the p-copoint vector field model to generate patterns from the normal flows in the respective image streams. Both cameras exhibit patterns like that shown in Fig. 1(c), containing only the positive and the negative regions separated by a straight line (the zero boundary), without the "Don't know" regions.

3.2 Determination of tx Up to Arbitrary Scale

To determine the baseline t_x of the binocular geometry, we let the camera pair undergo rigid-body pure rotations while the cameras capture the image stream data. However, t_x can only be determined up to an arbitrary scale unless some metric measurement of the 3D world is available. Suppose the camera pair has a pure rotation about an axis passing through the optical center of one camera, say camera A. Then camera A undergoes only a pure rotation, while camera B's motion consists of a rotation about an axis containing the optical center of camera A and a translation orthogonal to the baseline. In this case Equation (3.2) can be rewritten as:

(R_A − I) t_x = R_x t_B    (3.4)

We then rewrite Equation (3.4) as a homogeneous system:

Ă t̆_x = 0    (3.5)

where t̆_x is the normalized vector representing the direction of the baseline, and Ă is a 2×3 matrix calculated from R_x, R_A, t_B with Rank(Ă) = 1. At least two rotations are needed to determine t_x uniquely [10]. Camera A, which undergoes only pure rotations in the process, has positive-negative labeled patterns in the image space like the one shown in Fig. 1(b), in which a 2nd-order curve (the zero-boundary) separates the '+' and '−' labeled regions. As for camera B, the positive-negative labeled patterns in the image space take the form of Fig. 1(d) and are more challenging because of the two "Don't know" regions. There are two zero boundaries to be determined: one a 2nd-order curve, and the other a straight line. The strategy for calculating t_x (up to arbitrary scale) by analyzing the positive-negative labeled patterns is presented in our earlier work [10].
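As an illustration of how the constraint of Equation (3.3) can be exploited, the following is a minimal sketch (not the actual procedure of [10]) that recovers R_x from two or more pairs of FOE directions via the standard SVD-based rotation alignment (Kabsch). It assumes the ± sign of each FOE direction has already been fixed consistently; the function name and array layout are ours.

```python
import numpy as np

def rotation_from_foes(foes_A, foes_B):
    """Recover Rx such that foes_A[i] ~ Rx @ foes_B[i] (Eq. (3.3)).
    foes_A, foes_B: N x 3 arrays of unit FOE direction vectors, N >= 2.
    Note: each FOE direction carries a sign ambiguity that must be resolved first."""
    H = foes_B.T @ foes_A                      # sum of outer products b_i a_i^T
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])
    return Vt.T @ D @ U.T                      # closest rotation with det = +1
```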
4 Optimal Selection of p-Axes

Under different choices of the p-axis, different subsets of data points are made usable for generating the positive-negative labeled patterns in the image space. Obviously a higher density of labeled patterns is desired, as it makes the localization of the zero-boundary easier. In this section we propose a scheme for choosing such p-axes. In the following discussion, for simplicity we describe the scheme only for the case that the camera motion is a pure translation; the scheme for the pure rotation case is similar.
4.1 From a Data Point of Normal Flow to a Locus of p-Axes

For any given p-axis, only the data points whose normal flows are exactly parallel or anti-parallel to the p-copoint field vectors there are usable for participating in the positive-negative patterns in the image space. Viewing the process from the opposite direction, a data point (x_i, y_i) with normal flow (u_i^n, v_i^n) is usable only under a p-axis whose equivalent image position P = [p_x, p_y] is located on the line l_i passing through the data point (x_i, y_i) and orthogonal to the normal flow (u_i^n, v_i^n), as illustrated by Fig. 2. We call the line l_i the P-line of the data point (x_i, y_i), and it can be expressed as:
u_i^n p_x + v_i^n p_y − (u_i^n x_i + v_i^n y_i) = 0    (4.1)
Thus, to find the p-axis that makes the maximum number of data points useful, a simple scheme is to let each data point vote for the members of its P-line in the space of all possible p-axes (which is only a two-dimensional space, as each p-axis has only two degrees of freedom). The p-axes that collect a large number of votes are then the good choices of p-axes to use in the copoint vector field based method.
Fig. 2. The P-lines of data points (x_i, y_i) (with normal flow (u_i^n, v_i^n)) and (x_j, y_j) (with normal flow (u_j^n, v_j^n))
4.2 Optimal Determination of p-Axes
Obviously we could obtain a linear system of equations for the optimal p-axis (the point P) from, say, n data points using Equation (4.1), and solve for the optimal p-axis directly. However, the orientations of the normal flows are not extracted without error, so each data point should vote not for a P-line, but for a narrow cone centered at the data point and swung about the P-line. The size of the cone is a threshold that depends on the estimated error in the extraction of the normal flows' orientations. We thus adopt a voting scheme similar to the Hough transform. We use an accumulator array to represent the entire space of p-axes and to collect votes from each data point. The accumulator is a two-dimensional array whose axes correspond to the quantized values of p_x and p_y. For each data point (an image point with detectable normal flow), we determine its P-line, look for the bins in the accumulator array that the line falls into, and put one vote in each of those bins. After doing this for all the data points, we identify the bins with the highest vote counts in the accumulator array. An example of an accumulator array is shown in Fig. 3(a).
Fig. 3. (a) Two-dimensional accumulation array that corresponds to various values of px and py. The P-line associated with each data point is determined, the array bins corresponding to the line are identified, and each of such bins has the vote count increased by one. The bin with the highest vote count is identified (and marked as a red circle in this figure), which corresponds to the optimal p-axis. (b) (c) (d): The development of the voting process under the coarse-to-fine strategy.
To increase computational efficiency we use a coarse-to-fine strategy in the voting process, as illustrated by Fig. 3(b-d). Since the copoint vector field based method demands the use of not one but several p-axes, we use not only the optimal p-axis but the few p-axes with the highest vote counts. While in synthetic data experiments the scene texture (and thus the orientation of the normal flow) is often made random, making all p-axes have similar densities of usable data points, in real image data the scene texture is often oriented in only a few directions (and so is the normal flow), and the densities of usable data points can differ drastically under different choices of the p-axes. Our experience shows that, especially for real image data, the adoption of the optimal p-axes yields a drastic improvement in solution quality over random selection of the p-axes. More specifically, our experiments on real image data show that the pattern generated by the best p-axes often has 60% more data points than those under average p-axes.
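A simplified sketch of the P-line voting of this section is given below, assuming image coordinates are already centered and the p-axis search range, bin count, and tolerance band are chosen by the user; all names and defaults are illustrative rather than the authors' implementation.

```python
import numpy as np

def vote_p_axes(pts, nflows, rng=200.0, bins=101, tol=1.0):
    """Hough-style accumulator over candidate p-axis positions (px, py).
    pts:    N x 2 image positions of data points with detectable normal flow.
    nflows: N x 2 normal flow vectors (u^n, v^n) at those points."""
    px = np.linspace(-rng, rng, bins)
    py = np.linspace(-rng, rng, bins)
    PX, PY = np.meshgrid(px, py)
    acc = np.zeros((bins, bins), dtype=int)
    for (x, y), (un, vn) in zip(pts, nflows):
        norm = np.hypot(un, vn)
        if norm < 1e-9:
            continue
        # distance of each accumulator cell from the P-line of Eq. (4.1)
        dist = np.abs(un * PX + vn * PY - (un * x + vn * y)) / norm
        acc += (dist <= tol).astype(int)   # vote for a narrow band around the P-line
    return acc, px, py

# the best p-axes correspond to the accumulator cells with the highest vote counts
```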
5 Experimental Results

The whole method consists of two steps. First, the binocular cameras undergo two rigid-body pure translations, each in a different direction; the rotational component R_x is computed in this step. Second, we rotate the camera pair twice around two different axes passing through the optical center of one of the cameras; in this step t_x is determined up to scale.

5.1 Experimental Results on Synthetic Data
The experiments on synthetic data aim at investigating the accuracy and precision of the method. Normal flows are the only input, the same as in the real-image experiments. We used an image resolution of 101×101 for the synthetic data.
Estimation of R_x. The normal flows were generated by assigning to each image point a random intensity gradient direction. The dot product between the gradient direction and the optical flow induced by the assumed camera motion determined the normal flow precisely. We selected the optimal set of p-axes first. With the first optimal p-axis we obtained the first positive-negative labeled pattern in the image space. After determining the pseudo FOEs at an accuracy of 0.25×0.25 pixel, a number of lines, determined from different pseudo FOEs, could well divide the pattern into two regions. We then applied a second optimal p-axis to examine whether those pseudo FOEs that had performed well on the first pattern still performed well on the new pattern, and kept those that did for the next round under a new p-axis. We repeated this process until all possible FOEs were located within a small enough area. The center of these possible FOEs was then taken as the input for computing R_x. We estimated the FOEs by locating the zero-boundaries for both camera A and camera B first, and the rotational component ω_x of the binocular geometry was then estimated. The result is shown in Table 1. The error was 0.7964° in direction and 1.2621% in length.

Estimation of t_x up to Arbitrary Scale. We assumed that the camera pair rotated about an axis passing through the optical center of camera A at two different given velocities. As above, normal flows were generated as the inputs. We located the zero boundaries on the positive-negative labeled patterns to estimate the rotations ω_A of camera A, using the algorithm named "detranslation" [4][5]. The FOE t_B of camera B was obtained readily from the patterns. Finally we obtained t_x up to arbitrary scale using Equation (3.5). The result, shown in Table 1, is a unit vector describing the direction of the baseline. The angle between the ground truth and the result is 2.0907°.

Table 1. Estimation of ω_x and t_x up to scale

        Ground Truth               Experiment
ω_x     [0.100 0.100 -0.200]^T     [0.097 0.204 -0.203]^T
t_x     [-700 20 80]^T             [-0.988 0.043 0.147]^T
In this experiment, the synthetic normal flows, computed from full optical flows by allocating to each pixel a random gradient direction, are oriented more evenly in all directions than in real image data, because in real image data the scene texture is often oriented as discussed above. However, the accuracy of our method can be better explored in the synthetic data experiments.

5.2 Experimental Results on Real Image Data

Here we only show results on the recovery of R_x (ω_x) due to space limitations. We moved the camera pair on a translational platform. The image sequences were captured by Dragonfly CCD cameras at a resolution of 640×480. The first experiment investigates the accuracy of the algorithm. We used the algorithm described in [6] to estimate the intrinsic parameters of the two cameras.
Input images were first smoothed with a Gaussian filter (n = 5, σ = 1.4) to suppress noise. We examined pseudo FOEs pixel by pixel in the image frames; 377 p-axes were enough to pinpoint the locations of the possible FOEs. The zero-boundaries determined by the estimated FOEs are shown in Fig. 4.
Fig. 4. The zero-boundaries (blue lines) determined by the estimated FOEs. Green dots represent negative candidates; red dots represent positive candidates. (a) Camera A, Motion 1; (b) Camera B, Motion 1; (c) Camera A, Motion 2; (d) Camera B, Motion 2.
We then calibrated the binocular cameras using the traditional stereo calibration method [6], in which the inputs are manually picked corner pairs of an imaged chessboard pattern in the stereo images. Table 2 compares the results from our method and from the traditional stereo calibration method.

Table 2. Estimation of ω_x. Experiment 1: using our method; Experiment 2: using the traditional stereo calibration method [6]

        Experiment 1                    Experiment 2
ω_x     [0.0129 -0.7896 0.5222]^T       [0.0270 -0.4109 -0.0100]^T
Although there is still some error compared with the result of the traditional calibration method, our result is acceptable considering that we neither require any chessboard pattern to appear in the scene, nor need any manual intervention to select point-to-point correspondences across the image pairs. The second experiment addresses the case where there is almost no overlap between the two cameras' fields of view, as shown in Fig. 5. Estimating the binocular geometry of cameras viewing such a scene would be a difficult task for correspondence-based methods. However, our method is still effective.
Fig. 5. The zero-boundaries (blue lines) determined by the estimated FOEs. Green dots represent negative candidates; red dots represent positive candidates. (a) Camera A, Motion 1; (b) Camera B, Motion 1; (c) Camera A, Motion 2; (d) Camera B, Motion 2.
The result for ω_x in this experiment is shown in Table 3.

Table 3. Estimation of the rotational component ω_x of the binocular geometry

ω_x     [0.1634 -0.0801 -2.1097]^T
6 Conclusion and Future Work

We have addressed in this work how the determination of inter-camera geometry from normal flows can be much improved by the use of better-chosen p-axes, and how these better p-axes can be chosen. Our future work is to relax the requirement of the specific rigid-body motions used in the method.

Acknowledgments. The work described in this paper was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CUHK4195/04E), and is affiliated with the Microsoft-CUHK Joint Laboratory for Human-centric Computing and Interface Technologies.
References

1. Bjorkman, M., Eklundh, J.O.: Real-time epipolar geometry estimation of binocular stereo heads. IEEE Trans. on Pattern Analysis and Machine Intelligence 24(3) (March 2002)
2. Dornaika, F., Chung, R.: Stereo geometry from 3D ego-motion streams. IEEE Trans. on Systems, Man, and Cybernetics: Part B, Cybernetics 33(2) (April 2003)
3. Faugeras, O., Luong, T., Maybank, S.: Camera self-calibration: theory and experiments. In: Proc. 3rd European Conf. Computer Vision, Stockholm, Sweden, pp. 471–478 (1994)
4. Fermüller, C., Aloimonos, Y.: Direct perception of 3D motion from patterns of visual motion. Science 270, 1973–1976 (1995)
5. Fermüller, C., Aloimonos, Y.: Qualitative egomotion. Int'l Journal of Computer Vision 15, 7–29 (1995)
6. Heikkilä, J.: Geometric camera calibration using circular control points. IEEE Trans. Pattern Analysis and Machine Intelligence 22(10), 1066–1077 (2000)
7. Knight, J., Reid, I.: Self-calibration of a stereo rig in a planar scene by data combination. In: Proc. of the International Conference on Pattern Recognition, pp. 1411–1414 (September 2000)
8. Maybank, S.J., Faugeras, O.: A theory of self-calibration of a moving camera. Int'l Journal of Computer Vision 8(2), 123–152 (1992)
9. Takahashi, H., Tomita, F.: Self-calibration of stereo cameras. In: Proc. 2nd Int'l Conference on Computer Vision, pp. 123–128 (1988)
10. Yuan, D., Chung, R.: Direct estimation of the stereo geometry from monocular normal flows. In: International Symposium on Visual Computing (1), pp. 303–312 (2006)
11. Zhang, Z., Luong, Q.-T., Faugeras, O.: Motion of an uncalibrated stereo rig: Self-calibration and metric reconstruction. IEEE Trans. on Robotics and Automation 12(1), 103–113 (1996)
Highest Accuracy Fundamental Matrix Computation

Yasuyuki Sugaya¹ and Kenichi Kanatani²

¹ Department of Information and Computer Sciences, Toyohashi University of Technology, Toyohashi, Aichi 441-8580, Japan
[email protected]
² Department of Computer Science, Okayama University, Okayama 700-8530, Japan
[email protected]
Abstract. We compare algorithms for fundamental matrix computation, which we classify into "a posteriori correction", "internal access", and "external access". Through experimental comparison, we show that the 7-parameter Levenberg-Marquardt (LM) search and the extended FNS (EFNS) exhibit the best performance, and that additional bundle adjustment does not increase the accuracy to any noticeable degree.
1 Introduction
Computing the fundamental matrix from point correspondences is the first step of many vision applications including camera calibration, image rectification, structure from motion, and new view generation [6]. To compute the fundamental matrix accurately from noisy data, we need to solve optimization subject to the constraint that it has rank 2, for which typical approaches are: A posteriori correction. We first compute the fundamental matrix without considering the rank constraint and then modify the solution so that it is satisfied (Fig. 1(a)). Internal access. We minimally parameterize the fundamental matrix so that the rank constraint is always satisfied and do optimization in the reduced (“internal”) parameter space (Fig. 1(b)). External access. We do iterations in the redundant (“external”) parameter space in such a way that an optimal solution that satisfies the rank constraint automatically results (Fig. 1(c)). The aim of this paper is to find the best method by thorough performance comparison.
2 Mathematical Fundamentals
Fundamental matrix. Given two images of the same scene, a point (x, y) in the first image and the corresponding point (x′, y′) in the second satisfy the epipolar equation [6]
( (x/f_0, y/f_0, 1)^T , [ F_11 F_12 F_13 ; F_21 F_22 F_23 ; F_31 F_32 F_33 ] (x′/f_0, y′/f_0, 1)^T ) = 0,    (1)
where f_0 is a scaling constant for stabilizing numerical computation [5] (in our experiments, we set f_0 = 600 pixels). Throughout this paper, we denote the inner product of vectors a and b by (a, b). The matrix F = (F_ij) in Eq. (1) is of rank 2 and is called the fundamental matrix. If we define

u = (F_11, F_12, F_13, F_21, F_22, F_23, F_31, F_32, F_33)^T,    (2)
ξ = (xx′, xy′, xf_0, yx′, yy′, yf_0, f_0x′, f_0y′, f_0²)^T,    (3)
Equation (1) can be rewritten as

(u, ξ) = 0.    (4)
The magnitude of u is indeterminate, so we normalize it to ‖u‖ = 1, which is equivalent to scaling F so that ‖F‖ = 1. With a slight abuse of symbolism, we hereafter denote by det u the determinant of the matrix F defined by u.

Covariance matrices. Given N observed noisy correspondence pairs, we represent them as 9-D vectors {ξ_α} in the form of Eq. (3) and write ξ_α = ξ̄_α + Δξ_α, where ξ̄_α is the true value and Δξ_α the noise term. The covariance matrix of ξ_α is defined by

V[ξ_α] = E[Δξ_α Δξ_α^T],    (5)

where E[·] denotes expectation over the noise distribution. If the noise in the x- and y-coordinates is independent and of mean 0 and standard deviation σ, the covariance matrix of ξ_α has the form V[ξ_α] = σ²V_0[ξ_α] up to O(σ⁴), where
V_0[ξ_α] =
[ x̄²+x̄′²   x̄′ȳ′     f_0x̄′   x̄ȳ       0         0        f_0x̄   0       0
  x̄′ȳ′     x̄²+ȳ′²   f_0ȳ′   0         x̄ȳ       0        0       f_0x̄   0
  f_0x̄′    f_0ȳ′     f_0²    0         0         0        0       0       0
  x̄ȳ       0         0       ȳ²+x̄′²   x̄′ȳ′     f_0x̄′   f_0ȳ   0       0
  0         x̄ȳ       0       x̄′ȳ′     ȳ²+ȳ′²   f_0ȳ′   0       f_0ȳ   0
  0         0         0       f_0x̄′    f_0ȳ′    f_0²     0       0       0
  f_0x̄     0         0       f_0ȳ     0         0        f_0²    0       0
  0         f_0x̄     0       0         f_0ȳ     0        0       f_0²    0
  0         0         0       0         0         0        0       0       0 ],    (6)

where the bars denote the true values and the subscript α is omitted inside the matrix for readability.
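For concreteness, a small sketch of how ξ_α and V_0[ξ_α] of Eqs. (3) and (6) can be assembled in code: the covariance is built as J Jᵀ from the Jacobian of ξ with respect to the four noisy coordinates (x, y, x′, y′), which reproduces Eq. (6). Function names are ours.

```python
import numpy as np

f0 = 600.0  # scaling constant, as in the text

def xi(x, y, xp, yp):
    """The 9-vector of Eq. (3) for a correspondence (x, y) <-> (x', y')."""
    return np.array([x*xp, x*yp, x*f0, y*xp, y*yp, y*f0, f0*xp, f0*yp, f0*f0])

def V0(x, y, xp, yp):
    """Normalized covariance of Eq. (6), built as J J^T from the Jacobian of xi."""
    J = np.array([
        [xp, yp, f0, 0, 0, 0, 0, 0, 0],    # d xi / d x
        [0, 0, 0, xp, yp, f0, 0, 0, 0],    # d xi / d y
        [x, 0, 0, y, 0, 0, f0, 0, 0],      # d xi / d x'
        [0, x, 0, 0, y, 0, 0, f0, 0],      # d xi / d y'
    ])
    return J.T @ J
```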
In actual computations, the true positions (x̄_α, ȳ_α) and (x̄′_α, ȳ′_α) are replaced by their data (x_α, y_α) and (x′_α, y′_α), respectively.

We define the covariance matrix V[û] of the resulting estimate û by

V[û] = E[(P_U û)(P_U û)^T],    (7)

where P_U is the linear operator projecting R⁹ onto the domain U of u defined by the constraints ‖u‖ = 1 and det u = 0; we evaluate the error of û by projecting it onto the tangent space T_u(U) to U at u.
Fig. 1. (a) A posteriori correction. (b) Internal access. (c) External access.
Geometry of the constraint. The unit normal to the hypersurface defined by det u = 0 is

u† = N[∇_u det u],    (8)

where N[·] denotes normalization into unit norm. It is easily shown that the constraint det u = 0 is equivalently written as

(u†, u) = 0.    (9)
Since the domain U is included in the unit sphere S⁸ ⊂ R⁹, the vector u is everywhere orthogonal to U. Hence, {u, u†} is an orthonormal basis of the orthogonal complement of the tangent space T_u(U). It follows that the projection operator P_U in Eq. (7) has the following matrix representation:

P_U = I − uu^T − u†u†^T.    (10)
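A short sketch of Eqs. (8) and (10): the gradient ∇_u det u is the vector of cofactors of F, and the projector follows directly. Names are illustrative.

```python
import numpy as np

def u_dagger(u):
    """Unit normal of Eq. (8): normalized gradient of det F w.r.t. the elements of F."""
    F = u.reshape(3, 3)
    cof = np.zeros((3, 3))
    for i in range(3):
        for j in range(3):
            minor = np.delete(np.delete(F, i, axis=0), j, axis=1)
            cof[i, j] = (-1) ** (i + j) * np.linalg.det(minor)   # d det(F) / d F_ij
    g = cof.reshape(9)
    return g / np.linalg.norm(g)

def P_U(u):
    """Projection operator of Eq. (10): I - u u^T - u_dagger u_dagger^T."""
    ud = u_dagger(u)
    return np.eye(9) - np.outer(u, u) - np.outer(ud, ud)
```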
KCR lower bound. If the noise in {ξ_α} is independent and Gaussian with mean 0 and covariance matrix σ²V_0[ξ_α], the following inequality holds for an arbitrary unbiased estimator û of u [7]:

V[û] ≽ σ² ( Σ_{α=1}^{N} (P_U ξ̄_α)(P_U ξ̄_α)^T / (u, V_0[ξ_α]u) )^-_8.    (11)

Here, ≽ means that the left-hand side minus the right-hand side is positive semidefinite, and (·)^-_r denotes the pseudoinverse of rank r. Chernov and Lesort [2] called the right-hand side of Eq. (11) the KCR (Kanatani-Cramer-Rao) lower bound and showed that Eq. (11) holds up to O(σ⁴) even if û is not unbiased; it is sufficient that û → u as σ → 0.

Maximum likelihood. If the noise in {ξ_α} is independent and Gaussian with mean 0 and covariance matrix σ²V_0[ξ_α], maximum likelihood (ML) estimation of u is to minimize the sum of squared Mahalanobis distances

J = Σ_{α=1}^{N} (ξ_α − ξ̄_α, V_0[ξ_α]^-_2 (ξ_α − ξ̄_α)),    (12)
subject to (u, ξ̄_α) = 0, α = 1, ..., N. Eliminating the constraint by using Lagrange multipliers, we obtain [7]

J = Σ_{α=1}^{N} (u, ξ_α)² / (u, V_0[ξ_α]u).    (13)

The ML estimator û minimizes this subject to ‖u‖ = 1 and (u†, u) = 0.
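A one-function sketch of evaluating Eq. (13), assuming the ξ_α vectors are stacked into an N×9 array and the V_0[ξ_α] matrices into an N×9×9 array; names are ours.

```python
import numpy as np

def cost_J(u, xis, V0s):
    """Eq. (13): sum of (u, xi_a)^2 / (u, V0[xi_a] u) over all correspondences.
    xis: N x 9 array of xi vectors; V0s: N x 9 x 9 array of V0 matrices."""
    num = (xis @ u) ** 2                       # (u, xi_a)^2
    den = np.einsum('i,aij,j->a', u, V0s, u)   # (u, V0[xi_a] u)
    return float(np.sum(num / den))
```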
3 A Posteriori Correction
The a posteriori correction approach first minimizes Eq. (13) without considering the rank constraint and then modifies the resulting solution ũ so as to satisfy it (Fig. 1(a)). A popular method is to compute the singular value decomposition (SVD) of the computed fundamental matrix and replace the smallest singular value by 0, resulting in a matrix of rank 2 "closest" to the original one in norm [5]. We call this SVD correction. A more sophisticated method is the optimal correction [7,11]. According to statistical optimization theory [7], the covariance matrix V[ũ] of the rank-unconstrained solution ũ can be evaluated, so ũ is moved in the direction of the most likely fluctuation implied by V[ũ] until it satisfies the rank constraint (Fig. 1(a)). The procedure goes as follows [7]:

1. Compute the 9×9 matrices

   M̃ = Σ_{α=1}^{N} ξ_α ξ_α^T / (ũ, V_0[ξ_α]ũ),    (14)

   and V_0[ũ] = M̃^-_8.
2. Update the solution ũ as follows (ũ† is defined by Eq. (8) for ũ):

   ũ ← N[ũ − (1/3) (ũ, ũ†) V_0[ũ]ũ† / (ũ†, V_0[ũ]ũ†)].    (15)

3. If (ũ, ũ†) ≈ 0, return ũ and stop. Else, update the matrix V_0[ũ] in the form

   P_ũ = I − ũũ^T,   V_0[ũ] ← P_ũ V_0[ũ] P_ũ,    (16)

   and go back to Step 2.

Before doing this, we need to solve the unconstrained minimization of Eq. (13), for which many methods exist: the FNS (Fundamental Numerical Scheme) of Chojnacki et al. [3], the HEIV (Heteroscedastic Errors-in-Variables) of Leedan and Meer [10], and the projective Gauss-Newton iterations of Kanatani and Sugaya [8]. Their convergence properties were studied in [8].
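A rough sketch of the optimal correction loop (Eqs. (14)-(16)), reusing the u_dagger helper sketched above; it assumes the ξ_α and V_0[ξ_α] arrays are precomputed and is not tuned for numerical robustness.

```python
import numpy as np

def optimal_correction(u_tilde, xis, V0s, tol=1e-8, max_iter=100):
    """Move the unconstrained solution u_tilde onto det u = 0 (Eqs. (14)-(16))."""
    def pinv_rank(M, r):
        U, s, Vt = np.linalg.svd(M)
        s_inv = np.array([1.0 / si if i < r else 0.0 for i, si in enumerate(s)])
        return (Vt.T * s_inv) @ U.T
    u = u_tilde / np.linalg.norm(u_tilde)
    den = np.einsum('i,aij,j->a', u, V0s, u)
    M = np.einsum('ai,aj,a->ij', xis, xis, 1.0 / den)            # Eq. (14)
    V0u = pinv_rank(M, 8)
    for _ in range(max_iter):
        ud = u_dagger(u)
        if abs(u @ ud) < tol:
            return u
        step = (u @ ud) / (3.0 * (ud @ V0u @ ud)) * (V0u @ ud)    # Eq. (15)
        u = u - step
        u /= np.linalg.norm(u)
        Pu = np.eye(9) - np.outer(u, u)
        V0u = Pu @ V0u @ Pu                                       # Eq. (16)
    return u
```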
4 Internal Access
The fundamental matrix F has nine elements, on which the normalization ‖F‖ = 1 and the rank constraint det F = 0 are imposed. Hence, it has seven degrees of freedom. The internal access approach minimizes Eq. (13) by searching the reduced 7-D parameter space (Fig. 1(b)). Many types of 7-parameter parameterizations have been proposed in the past [12,14], but the resulting expressions are often complicated, and the geometric meaning of the individual unknowns is not clear. This was overcome by Bartoli and Sturm [1], who regarded the SVD of F as its parameterization. Their expression is compact, and each parameter has a geometric meaning. They did tentative 3-D reconstruction using the assumed F and adjusted the reconstructed shape, the camera positions, and their intrinsic parameters so that the reprojection error is minimized; such an approach is known as bundle adjustment. Sugaya and Kanatani [13] simplified this: adopting the parameterization of Bartoli and Sturm [1], they directly minimized Eq. (13) by the Levenberg-Marquardt (LM) method. Their 7-parameter LM search goes as follows:

1. Initialize F in such a way that det F = 0 and ‖F‖ = 1, and express it as F = U diag(cos θ, sin θ, 0) V^T.
2. Compute J in Eq. (13), and let c = 0.0001.
3. Compute the matrices F_U and F_V and the vector u_θ as follows:

   F_U = [  0     F_31  −F_21
            0     F_32  −F_22
            0     F_33  −F_23
           −F_31  0      F_11
           −F_32  0      F_12
           −F_33  0      F_13
            F_21 −F_11   0
            F_22 −F_12   0
            F_23 −F_13   0  ],

   F_V = [  0     F_13  −F_12
           −F_13  0      F_11
            F_12 −F_11   0
            0     F_23  −F_22
           −F_23  0      F_21
            F_22 −F_21   0
            0     F_33  −F_32
           −F_33  0      F_31
            F_32 −F_31   0  ],    (17)

   u_θ = [ U_12 V_12 cos θ − U_11 V_11 sin θ
           U_12 V_22 cos θ − U_11 V_21 sin θ
           U_12 V_32 cos θ − U_11 V_31 sin θ
           U_22 V_12 cos θ − U_21 V_11 sin θ
           U_22 V_22 cos θ − U_21 V_21 sin θ
           U_22 V_32 cos θ − U_21 V_31 sin θ
           U_32 V_12 cos θ − U_31 V_11 sin θ
           U_32 V_22 cos θ − U_31 V_21 sin θ
           U_32 V_32 cos θ − U_31 V_31 sin θ ].    (18)
4. Compute the following matrix X:

   X = Σ_{α=1}^{N} ξ_α ξ_α^T / (u, V_0[ξ_α]u) − Σ_{α=1}^{N} (u, ξ_α)² V_0[ξ_α] / (u, V_0[ξ_α]u)².    (19)
5. Compute the first and (Gauss-Newton approximated) second derivatives of J as follows:

   ∇_ω J = F_U^T X u,   ∇_ω′ J = F_V^T X u,   ∂J/∂θ = (u_θ, X u),    (20)

   ∇²_ω J = F_U^T M F_U,   ∇²_ω′ J = F_V^T M F_V,   ∇_ωω′ J = F_U^T M F_V,
   ∂∇_ω J/∂θ = F_U^T M u_θ,   ∂∇_ω′ J/∂θ = F_V^T M u_θ,   ∂²J/∂θ² = (u_θ, M u_θ).    (21)

6. Compute the following matrix H:

   H = [ ∇²_ω J           ∇_ωω′ J          ∂∇_ω J/∂θ
         (∇_ωω′ J)^T      ∇²_ω′ J          ∂∇_ω′ J/∂θ
         (∂∇_ω J/∂θ)^T    (∂∇_ω′ J/∂θ)^T   ∂²J/∂θ²    ].    (22)

7. Solve the simultaneous linear equations

   (H + c D[H]) (ω; ω′; Δθ) = −(∇_ω J; ∇_ω′ J; ∂J/∂θ)    (23)

   for ω, ω′, and Δθ, where D[·] denotes the diagonal matrix obtained by taking out only the diagonal elements.
8. Update U, V, and θ in the form U′ = R(ω)U, V′ = R(ω′)V, and θ′ = θ + Δθ, where R(ω) denotes rotation around N[ω] by angle ‖ω‖.
9. Update F to F′ = U′ diag(cos θ′, sin θ′, 0)V′^T.
10. Let J′ be the value of Eq. (13) for F′.
11. Unless J′ < J or J′ ≈ J, let c ← 10c and go back to Step 7.
12. If F′ ≈ F, return F′ and stop. Else, let F ← F′, U ← U′, V ← V′, θ ← θ′, J ← J′, and c ← c/10, and go back to Step 3.
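The derivative matrices of Eqs. (17)-(18) can be generated programmatically. The following sketch assumes the row-major vectorization u = (F_11, ..., F_33)^T of Eq. (2) and small left-multiplied rotations of U and V; signs may differ under a different convention, and all names are ours.

```python
import numpy as np

def lm_derivative_matrices(U, V, theta):
    """F_U, F_V (9x3) and u_theta (9,) for F = U diag(cos t, sin t, 0) V^T."""
    F = U @ np.diag([np.cos(theta), np.sin(theta), 0.0]) @ V.T
    FU = np.zeros((9, 3))
    FV = np.zeros((9, 3))
    for i in range(3):
        for j in range(3):
            r = 3 * i + j
            # dF = [omega]_x F : derivative of F_ij w.r.t. a small rotation of U
            rows_U = {0: [0.0, F[2, j], -F[1, j]],
                      1: [-F[2, j], 0.0, F[0, j]],
                      2: [F[1, j], -F[0, j], 0.0]}
            FU[r] = rows_U[i]
            # dF = -F [omega']_x : derivative of F_ij w.r.t. a small rotation of V
            rows_V = {0: [0.0, F[i, 2], -F[i, 1]],
                      1: [-F[i, 2], 0.0, F[i, 0]],
                      2: [F[i, 1], -F[i, 0], 0.0]}
            FV[r] = rows_V[j]
    u_theta = np.array([U[i, 1] * V[j, 1] * np.cos(theta)
                        - U[i, 0] * V[j, 0] * np.sin(theta)
                        for i in range(3) for j in range(3)])
    return FU, FV, u_theta
```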
5 External Access
The external access approach does iterations in the 9-D u-space in such a way that an optimal solution satisfying the rank constraint automatically results (Fig. 1(c)). The concept dates back to such heuristics as introducing penalties to the violation of the constraints or projecting the solution onto the surface of the constraints in the course of iterations, but it is Chojnacki et al. [4] that first presented a systematic scheme, which they called CFNS (Constrained FNS ). Kanatani and Sugaya [9] pointed out, however, that CFNS does not necessarily converge to a correct solution and presented in a more general framework a new scheme, called EFNS (Extended FNS ), which is shown to converge to an optimal value. For fundamental matrix computation, it reduces to the following form: 1. Initialize u. 2. Compute the matrix X in Eq. (19).
3. Compute the projection matrix P_u† = I − u†u†^T (u† is defined by Eq. (8)).
4. Compute Y = P_u† X P_u†.
5. Solve the eigenvalue problem Y v = λv, and compute the two unit eigenvectors v_1 and v_2 for the smallest eigenvalues in absolute value.
6. Compute û = (u, v_1)v_1 + (u, v_2)v_2.
7. Compute u′ = N[P_u† û].
8. If u′ ≈ u, return u′ and stop. Else, let u ← N[u + u′] and go back to Step 2.
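A compact sketch of the EFNS loop above, reusing the u_dagger helper sketched earlier and the matrix X of Eq. (19); the stopping rules and safeguards of [9] are only loosely reproduced, and all names are ours.

```python
import numpy as np

def efns(xis, V0s, u0, tol=1e-9, max_iter=200):
    """EFNS iteration for fundamental matrix computation (Section 5 sketch)."""
    u = u0 / np.linalg.norm(u0)
    for _ in range(max_iter):
        den = np.einsum('i,aij,j->a', u, V0s, u)
        num = xis @ u
        M = np.einsum('ai,aj,a->ij', xis, xis, 1.0 / den)
        L = np.einsum('a,aij->ij', (num ** 2) / den ** 2, V0s)
        X = M - L                                        # Eq. (19)
        ud = u_dagger(u)
        P = np.eye(9) - np.outer(ud, ud)                 # projection along u_dagger
        Y = P @ X @ P
        w, V = np.linalg.eigh(Y)
        idx = np.argsort(np.abs(w))[:2]                  # two smallest |eigenvalues|
        v1, v2 = V[:, idx[0]], V[:, idx[1]]
        u_hat = (u @ v1) * v1 + (u @ v2) * v2
        u_new = P @ u_hat
        u_new /= np.linalg.norm(u_new)
        if min(np.linalg.norm(u_new - u), np.linalg.norm(u_new + u)) < tol:
            return u_new
        u = u + u_new
        u /= np.linalg.norm(u)
    return u
```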
6 Bundle Adjustment
The transition from Eq. (12) to Eq. (13) is exact; no approximation is involved. Strictly speaking, however, the minimization of the (squared) Mahalanobis distance in the ξ-space (Eq. (13)) can be ML only when the noise in the ξ-space is Gaussian, because then and only then is the likelihood proportional to e^{−J/const}. If the noise in the image plane is Gaussian, on the other hand, the transformed noise in the ξ-space is no longer Gaussian, so minimizing Eq. (13) is not strictly ML in the image plane. In order to test how much difference is incurred, we also implemented bundle adjustment, minimizing the reprojection error (we omit the details).
7 Experiments
Figure 2 shows simulated images of two planar grid surfaces viewed from different angles. The image size is 600×600 pixels with a 1200-pixel focal length. We added random Gaussian noise of mean 0 and standard deviation σ to the x- and y-coordinates of each grid point independently, and from them computed the fundamental matrix by 1) SVD-corrected LS, 2) SVD-corrected ML, 3) CFNS, 4) optimally corrected ML, 5) 7-parameter LM, and 6) EFNS. "LS" means least squares (also called the "8-point algorithm" [5]) minimizing Σ_{α=1}^{N} (u, ξ_α)², which reduces to a simple eigenvalue computation [8]. For brevity, we use the shorthand "ML" for unconstrained minimization of Eq. (13), for which we used the FNS of Chojnacki et al. [3].
Fig. 2. Simulated images of planar grid surfaces and the RMS error vs. noise level. 1) SVD-corrected LS. 2) SVD-corrected ML. 3) CFNS. 4) Optimally corrected ML. 5) 7-parameter LM. 6) EFNS. The dotted line indicates the KCR lower bound.
Fig. 3. (a) The RMS error relative to the KCR lower bound. (b) Average residual minus (N − 7)σ². 1) Optimally corrected ML. 2) 7-parameter LM started from LS. 3) 7-parameter LM started from optimally corrected ML. 4) EFNS. 5) Bundle adjustment.
The 7-parameter LM and CFNS are initialized by LS. All iterations are stopped when the update of F is less than 10⁻⁶ in norm. On the right of Fig. 2 is plotted, for σ on the horizontal axis, the following root-mean-square (RMS) error D corresponding to Eq. (7) over 10000 independent trials:

D = ( (1/10000) Σ_{a=1}^{10000} ‖P_U û^(a)‖² )^{1/2}.    (24)

Here, û^(a) is the a-th value, and P_U is the projection matrix in Eq. (10). The dotted line is the bound implied by the KCR lower bound (the trace of the right-hand side of Eq. (11)).

Preliminary observations. We can see that SVD-corrected LS (Hartley's 8-point algorithm) performs very poorly. We can also see that SVD-corrected ML is inferior to optimally corrected ML, whose accuracy is close to the KCR lower bound. The accuracy of the 7-parameter LM is nearly the same as optimally corrected ML when the noise is small but gradually outperforms it as the noise increases. Best performing is EFNS, exhibiting nearly the same accuracy as the KCR lower bound. In contrast, CFNS performs as poorly as SVD-corrected ML. The reason for this is fully investigated by Kanatani and Sugaya [9]. Doing many experiments (not all shown here), we have observed that i) EFNS stably achieves the highest accuracy over a wide range of noise levels, ii) optimally corrected ML is fairly accurate and very robust to noise but gradually deteriorates as noise grows, and iii) 7-parameter LM achieves very high accuracy when started from a good initial value but is likely to fall into local minima if poorly initialized. The robustness of EFNS and optimally corrected ML is due to the fact that the computation is done in the redundant ("external") u-space, where J has the simple form of Eq. (13). In fact, we have never experienced local minima in our experiments.
Fig. 4. Left: Real images and 100 corresponding points. Right: Residuals and execution times (sec) for 1) SVD-corrected LS, 2) SVD-corrected ML, 3) CFNS, 4) optimally corrected ML, 5) direct search from LS, 6) direct search from optimally corrected ML, 7) EFNS, 8) bundle adjustment.
     residual   time (sec)
1    45.550     0.000524
2    45.556     0.00652
3    45.556     0.01300
4    45.378     0.00764
5    45.378     0.01136
6    45.378     0.01748
7    45.379     0.01916
8    45.379     0.02580
The deterioration of optimally corrected ML in the presence of large noise occurs because a linear approximation is involved in Eq. (15). The fragility of the 7-parameter LM is attributed to the complexity of the function J when expressed in seven parameters, resulting in many local minima in the reduced ("internal") parameter space, as pointed out in [12]. Thus, the optimal correction of ML and the 7-parameter LM have complementary characteristics, which suggests that the 7-parameter LM initialized by optimally corrected ML may exhibit accuracy comparable to EFNS. We now confirm this.

Detailed observations. Figure 3(a) compares 1) optimally corrected ML, 2) 7-parameter LM started from LS, 3) 7-parameter LM started from optimally corrected ML, 4) EFNS, and 5) bundle adjustment. For visual ease, we plot the ratio D/D_KCR of D in Eq. (24) to the corresponding KCR lower bound. Figure 3(b) plots the corresponding average residual J (the minimum of Eq. (13)). Since direct plots of J nearly overlap, we plot its difference from (N − 7)σ², where N is the number of corresponding pairs. This is motivated by the fact that to a first approximation Ĵ/σ² is subject to a χ² distribution with N − 7 degrees of freedom [7], so the expectation of Ĵ is approximately (N − 7)σ². We observe from Fig. 3 that i) the RMS error of optimally corrected ML increases as noise increases, yet the corresponding residual remains low, ii) the 7-parameter LM started from LS appears to have high accuracy for noise levels for which the corresponding residual is high, iii) the accuracy of the 7-parameter LM improves if started from optimally corrected ML, resulting in accuracy comparable to EFNS, and iv) additional bundle adjustment does not increase the accuracy to any noticeable degree. The seeming contradiction that solutions closer to the true value (measured in RMS) have higher residuals Ĵ implies that the 7-parameter LM failed to reach the true minimum of the function J, indicating the existence of local minima located close to the true value. When initialized by optimally corrected ML, the 7-parameter LM successfully reaches the true minimum of J, resulting in the smaller Ĵ but larger RMS errors.

Real image example. We manually selected 100 pairs of corresponding points in the two images in Fig. 4 and computed the fundamental matrix from them.
The final residual J and the execution time (sec) are listed there. We used a Core2Duo E6700 2.66GHz CPU with 4GB main memory and Linux as the OS. We can see that for this example optimally corrected ML, 7-parameter LM started from either LS or optimally corrected ML, EFNS, and bundle adjustment all converged to the same solution, indicating that all are optimal. On the other hand, SVD-corrected LS (Hartley's 8-point method) and SVD-corrected ML have higher residuals than the optimal solution, and CFNS has as high a residual as SVD-corrected ML.
8 Conclusions
We compared algorithms for fundamental matrix computation (the source code is available from the authors' Web page¹), which we classified into "a posteriori correction", "internal access", and "external access". We observed that the popular SVD-corrected LS (Hartley's 8-point algorithm) has poor performance and that the CFNS of Chojnacki et al. [4], a pioneering external access method, does not necessarily converge to a correct solution, while EFNS always yields an optimal value. After many experiments (not all shown here), we concluded that EFNS and the 7-parameter LM started from optimally corrected ML exhibit the best performance. We also observed that additional bundle adjustment does not increase the accuracy to any noticeable degree.

Acknowledgments. This work was done in part in collaboration with Mitsubishi Precision Co., Ltd., Japan. The authors thank Mike Brooks, Wojciech Chojnacki, and Anton van den Hengel of the University of Adelaide, Australia, for providing software and helpful discussions. They also thank Nikolai Chernov of the University of Alabama at Birmingham, U.S.A. for helpful discussions.
References

1. Bartoli, A., Sturm, P.: Nonlinear estimation of fundamental matrix with minimal parameters. IEEE Trans. Patt. Anal. Mach. Intell. 26(3), 426–432 (2004)
2. Chernov, N., Lesort, C.: Statistical efficiency of curve fitting algorithms. Comput. Stat. Data Anal. 47(4), 713–728 (2004)
3. Chojnacki, W., Brooks, M.J., van den Hengel, A., Gawley, D.: On the fitting of surfaces to data with covariances. IEEE Trans. Patt. Anal. Mach. Intell. 22(11), 1294–1303 (2000)
4. Chojnacki, W., Brooks, M.J., van den Hengel, A., Gawley, D.: A new constrained parameter estimator for computer vision applications. Image Vis. Comput. 22(2), 85–91 (2004)
5. Hartley, R.I.: In defense of the eight-point algorithm. IEEE Trans. Patt. Anal. Mach. Intell. 19(6), 580–593 (1997)

¹ http://www.iim.ics.tut.ac.jp/~sugaya/public-e.html
6. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge, UK (2000) 7. Kanatani, K.: Statistical Optimization for Geometric Computation: Theory and Practice. Elsevier Science, Amsterdam, The Netherlands 1996, Dover, New York (2005) 8. Kanatani, K., Sugaya, Y.: High accuracy fundamental matrix computation and its performance evaluation. In: Proc. 17th British Machine Vision Conf., Edinburgh, UK, September 2006, vol. 1, pp. 217–226 (2006) 9. Kanatani, K., Sugaya, Y.: Extended FNS for constrained parameter estimation. In: Proc. 10th Meeting Image Recog. Understand, Hiroshima, Japan, July 2007, pp. 219–226 (2007) 10. Leedan, Y., Meer, P.: Heteroscedastic regression in computer vision: Problems with bilinear constraint. Int. J. Comput. Vision 37(2), 127–150 (2000) 11. Matei, J., Meer, P.: Estimation of nonlinear errors-in-variables models for computer vision applications. IEEE Trans. Patt. Anal. Mach. Intell. 28(10), 1537–1552 (2006) 12. Migita, T., Shakunaga, T.: One-dimensional search for reliable epipole estimation. In: Proc. IEEE Pacific Rim Symp. Image and Video Technology, Hsinchu, Taiwan, December 2006, pp. 1215–1224 (2006) 13. Sugaya, Y., Kanatani, K.: High accuracy computation of rank-constrained fundamental matrix. In: Proc. 18th British Machine Vision Conf., Coventry, UK (September 2007) 14. Zhang, Z., Loop, C.: Estimating the fundamental matrix by transforming image points in projective space. Comput. Vis. Image Understand 82(2), 174–180 (2001)
Sequential L∞ Norm Minimization for Triangulation

Yongduek Seo¹ and Richard Hartley²

¹ Department of Media Technology, Sogang University, Korea
² Australian National University and NICTA, Canberra, Australia
Abstract. It has been shown that various geometric vision problems such as triangulation and pose estimation can be solved optimally by minimizing the L∞ error norm. This paper proposes a novel algorithm for sequential estimation. When a measurement arrives at each time instance, applying the original batch bi-section algorithm is very inefficient, because the number of second order constraints increases as time goes on and hence the computational cost increases accordingly. This paper shows that the upper and lower bounds, which are the two input parameters of the bi-section method, can be updated through the time sequence so that the gap between the two bounds is kept as small as possible. Furthermore, we may use only a subset of all the given measurements for the L∞ estimation, which reduces the number of constraints drastically. Finally, we do not have to re-estimate the parameter when the reprojection error of the new measurement is smaller than the current estimation error. These three techniques provide a very fast L∞ estimation through the sequence; our method is suitable for real-time or on-line sequential processing under L∞ optimality. This paper particularly focuses on the triangulation problem, but the algorithm is general enough to be applied to any L∞ problem.
1 Introduction

Recently, convex programming techniques have been introduced and widely studied in the area of geometric computer vision. By switching from an L2 sum-of-squared error function to an L∞ one, we are now able to find the global optimum of the error function, since the image re-projection error is of quasi-convex type and can be efficiently minimized by the bi-section method [1]. This L∞ norm minimization is advantageous because we do not need to build a linearized formulation to find an initial solution for iterative optimization like Levenberg-Marquardt, and it also provides the global optimum of the error function, which is geometrically meaningful, with a well-developed minimization algorithm. Applying an idea of L∞ optimization was presented by Hartley and Schaffalitzky in [2], where it was observed that many geometric vision problems have a single global
This work was supported by the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korean government (MOST) (No. R01-2006-000-11374-0). This research is accomplished as the result of the research project for culture contents technology development supported by KOCCA. NICTA is a research centre funded by the Australian Government's Department of Communications, Information Technology and the Arts and the Australian Research Council, through Backing Australia's Ability and the ICT Research Centre of Excellence programs.
minimum under the L∞ error norm. Kahl, and Ke & Kanade, respectively, showed that the error functions are quasi-convex and can be solved by Second Order Cone Programming (SOCP) [1,3]. Vision problems that can be solved under the L∞ formulation include triangulation [2], homography estimation and camera resectioning [1,3], multiview reconstruction knowing rotations or homographies induced by a plane [1,3], camera motion recovery [4], and outlier removal [5]. Vision problems like building a 3D model or computing the camera motion of a video clip require a batch process. A fast algorithm for such batch computations is presented in [6]. However, some applications need a mechanism of sequential update, such as navigation or augmented reality, e.g., [7,8,9]. This paper is about sequentially minimizing the L∞ error norm, which has not yet been considered in the vision literature. Our research is aimed at on-line or real-time vision applications. The most important constraint in this case is that the optimization should be done within a given computation time. Therefore, we need to develop a bi-section algorithm for this purpose. This paper first introduces the triangulation problem in Section 2, analyzes the bi-section algorithm, and suggests three methods to reduce the computation time without sacrificing any accuracy. Section 3 presents our novel bi-section algorithm suitable for time-sequence applications. Experimental results are given in Section 4 and concluding remarks in Section 5. We focus on the triangulation problem in this paper. Triangulation alone may look very restrictive, but note that motion estimation knowing rotation is equivalent to triangulation, as can be found in [4]. In addition, if a branch-and-bound algorithm is adopted for rotation estimation, then fast triangulation also becomes very important for global optimization in pose estimation or multi-view motion computation.
2 Triangulation with L∞ Norm

Triangulation is to find a 3D space point X when we are given two or more pairs of a camera matrix P_t of dimension 3×4 and its image point u_t = [u_t^1, u_t^2] at time t. These quantities are related by the projection equation:

u_t^i = (p_t^i X) / (p_t^3 X)   for i = 1, 2,    (1)
where p_t^i denotes the i-th 4-D row vector of P_t and X is a 4-D vector represented by homogeneous coordinates (that is, the 4th coordinate is one). The re-projection discrepancy d_t of X for the measurement u_t contaminated by noise is given by

d_t = [ u_t^1 − (p_t^1 X)/(p_t^3 X),  u_t^2 − (p_t^2 X)/(p_t^3 X) ],    (2)

and the quality of the error is given by the error function e_t(X) = ‖d_t(X)‖. When we use the L2 norm,

e_t(X) = ‖d_t‖_2 = ( (d_t^1(X))² + (d_t^2(X))² )^{1/2}.    (3)

Any function f is called quasi-convex if its domain is convex and all its sub-level sets {x ∈ domain f | f(x) ≤ α} for α ∈ R are convex [10].
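For reference, the residual of Eqs. (2)-(3) in code (a sketch; names are ours):

```python
import numpy as np

def reproj_error(P, u, X):
    """e_t(X) of Eq. (3): reprojection error of 3D point X (3-vector, world coords)
    for camera matrix P (3x4) and measured image point u (2-vector)."""
    Xh = np.append(X, 1.0)          # homogeneous coordinates
    proj = P @ Xh
    d = u - proj[:2] / proj[2]      # the discrepancy d_t of Eq. (2)
    return np.linalg.norm(d)
```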
Algorithm 1. Bisection method: L∞ norm minimization
Input: initial upper (U) / lower (L) bounds, tolerance ε > 0.
1: repeat
2:   γ := (L + U)/2
3:   Solve the feasibility problem (7)
4:   if feasible then U := γ else L := γ
5: until U − L ≤ ε
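Algorithm 1 translated into a short Python sketch; the feasibility test is passed in as a callback (in practice an SOCP solve, e.g. the sketch given later in this section). Names are ours.

```python
def linf_bisection(feasible, U, L, tol=1e-6):
    """Shrink the interval [L, U] until its width is below tol.
    `feasible(gamma)` must return (bool, X): whether problem (7) is feasible
    for radius gamma, and a feasible point X when it is."""
    X_best = None
    while U - L > tol:
        gamma = 0.5 * (L + U)
        ok, X = feasible(gamma)
        if ok:
            U, X_best = gamma, X
        else:
            L = gamma
    return U, X_best
```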
The error function in Equation (3) is convex-over-concave and can be shown to be a quasi-convex function in the convex domain D = {X | p_t^3 X ≥ 0}, which means that the scene is in front of the camera [11,1]. Given a bound γ, the inequality e_t ≤ γ defines a set C_t of X:

C_t = {X | e_t(X) ≤ γ}.    (4)
Note that the feasible set C_t is due to the t-th measurement u_t; C_t is called a second order cone due to Equation (3). The bound γ is called the radius of the cone in this paper. Note that Equation (4) corresponds to the circular disk ‖d_t‖_2 ≤ γ in the image plane: the set C_t defines the cone whose apex is at the camera center and whose cross-section with the image plane is that circular disk. The vector e = [e_1, e_2, ..., e_T] represents the error vector of T measurements. A feasible set F_γ for a given constant γ is defined as the intersection of all the cones:

F_γ = ∩_{t=1}^{T} {X | e_t ≤ γ}    (5)
    = {X | e_1 ≤ γ} ∩ ... ∩ {X | e_T ≤ γ}.    (6)
We now have the feasibility problem for a given constant γ:

find X   subject to   e_t(X) ≤ γ,   t = 1, ..., T.    (7)
The feasible set F_γ is convex because it is the intersection of (ice-cream-shaped) convex cones. Indeed, the feasibility problem (7) has already been investigated and is well known in the area of convex optimization; the solution X can be obtained by an SOCP solver. The L∞ norm of e is defined to be the maximum of the e_t's, and the L∞ triangulation problem is to find the smallest γ that yields a non-empty feasible set F_γ, together with its corresponding X. This can also be written as a min-max optimization problem

min_X max {e_1, e_2, ..., e_T},    (8)

and the global optimum can be found by the bi-section method presented in Algorithm 1. It consists of repeatedly solving the feasibility problem while adjusting the bound γ.

Lemma 1. Since we minimize the maximum error, we don't have to use all the measurements. There exist subsets of measurements that result in the same estimation.

A reduced number of measurements decreases the computation time, which is necessary for sequential applications. What we have to do is to choose some among those sequential measurements. Our approach is provided in the next section together with our sequential bi-section algorithm.
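The feasibility problem (7) is an SOCP; the following sketch poses it with CVXPY (an assumed dependency; any SOCP solver would do). The reformulation multiplies the cone constraint through by the positive depth p_t^3 X, and the helper names are ours.

```python
import numpy as np
import cvxpy as cp

def feasible_socp(P_list, u_list, gamma):
    """Return (is_feasible, X) for problem (7) with radius gamma.
    P_list: 3x4 camera matrices; u_list: corresponding 2-vector measurements."""
    X = cp.Variable(3)
    Xh = cp.hstack([X, np.ones(1)])        # homogeneous coordinates, 4th entry = 1
    cons = []
    for P, u in zip(P_list, u_list):
        depth = P[2, :] @ Xh               # p_t^3 X (positive: point in front of camera)
        resid = cp.hstack([u[0] * depth - P[0, :] @ Xh,
                           u[1] * depth - P[1, :] @ Xh])
        # ||d_t||_2 <= gamma  <=>  ||resid|| <= gamma * depth
        cons.append(cp.norm(resid) <= gamma * depth)
    prob = cp.Problem(cp.Minimize(0), cons)
    prob.solve()
    ok = prob.status in (cp.OPTIMAL, cp.OPTIMAL_INACCURATE)
    return ok, (X.value if ok else None)
```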
3 Bisectioning for Sequential Update

Problem 1. (Original Batch Problem) Given a set of image matches {u_i, i = 1...T}, find their 3D point X that is optimal under the L∞ error norm.

As we mentioned in Section 1, the solution of this problem can be obtained by the bi-section method shown in Algorithm 1. From now on, the optimal solution with T measurements is represented by X_T^∞ and the corresponding minimum error by e_T^∞. Now let us cast our sequential problem.

Problem 2. (Sequential Problem) The L∞ estimate X_T^∞ has been computed given image matches u_i, i = 1, ..., T. Now a new measurement u_{T+1} arrives. Find the optimal estimate X_{T+1}^∞.

Obviously, we might apply Algorithm 1 using all the T + 1 measurements again from scratch. However, we want to do it more efficiently in this paper; our first goal is to reduce the number of SOCP repetitions during the bisection algorithm.

Lemma 2. If the re-projection error e_{T+1}(X_T^∞) for u_{T+1} is smaller than e_T^∞, that is, e_{T+1} ≤ e_T^∞, then no further minimization is necessary, and we can set e_{T+1}^∞ = e_T^∞ and X_{T+1}^∞ = X_T^∞.

If e_{T+1} ≤ e_T^∞, the feasible cone C_{T+1} = {X | e_{T+1}(X) ≤ e_T^∞} for u_{T+1} is already a subset of F_{e_T^∞} (i.e., γ = e_T^∞). Therefore, we don't have to run bisectioning to update the estimate; the only computation necessary is to evaluate the re-projection error e_{T+1}. This is because the bisection method is independent of the order of the input measurements {u_1, ..., u_{T+1}}. The estimate X_T^∞ is already optimal, and running the bisection algorithm with T + 1 measurements will result in the same output: X_{T+1}^∞ = X_T^∞. Note that due to Lemma 2 the computational cost of evaluating e_{T+1} is much less than the cost of running the bisection method.
Lemma 3. Otherwise (i.e., e_T^∞ < e_{T+1}), we run the bisection algorithm but with different initial upper and lower bounds: U := e_{T+1}, L := e_T^∞.

In this case, e_T^∞ < e_{T+1}, we have

F_{e_T^∞} ⊂ C_{T+1} = {X | e_{T+1} ≤ γ}, where γ = e_{T+1}(X_T^∞),    (9)

due to the fact that e_T^∞ < e_{T+1}; therefore, the upper bound for the feasibility of X given T + 1 measurements can be set to U := e_{T+1}. In other words, the intersection of the T + 1 cones is non-empty when the cones are of radius e_{T+1}. It is natural to set the initial lower bound to zero, L_0 := 0, to run the bisection algorithm. However, a lower bound greater than zero may reduce the number of iterations during the bi-sectioning. The feasible set F_γ^{T+1} with bound γ up to time T + 1 can be written as

F_γ^{T+1} = ∩_{t=1}^{T+1} {X | e_t ≤ γ}    (10)
          = [C_1 ∩ ... ∩ C_T] ∩ C_{T+1}    (11)
          = F_γ^T ∩ C_{T+1}.    (12)
Algorithm 2. Sequential bi-section method with measurement selection.
Input: Measurement set M, selected measurements S ⊂ M.
 1: e_{T+1} := ReprojectionError(u_{T+1}, X_T^∞)
 2: if e_{T+1} > e_T^∞ then
 3:   bool flag = FALSE
 4:   M := M ∪ {u_{T+1}}, U := e_{T+1}, L := e_T^∞
 5:   repeat
 6:     (e_{T+1}^∞, X_{T+1}^∞) := BisectionAlgorithm(S, U, L)
 7:     (e_max, t_max) := ReprojectionErrors(M \ S, X_{T+1}^∞)
 8:     if e_max < e_{T+1}^∞ then
 9:       flag = TRUE
10:     else
11:       S := S ∪ {u_{t_max}}
12:       L := e_{T+1}^∞
13:     end if
14:   until flag = TRUE
15: end if
If the bound γ is smaller than e_T^∞, the intersection of the first T cones results in an empty set, because γ = e_T^∞ is the smallest bound for F_γ^T found by the bi-section algorithm using T measurements. Consequently, it makes the total feasible set F_γ^{T+1} empty, i.e., non-feasible. Therefore, we see that L_0 = e_T^∞ is the greatest lower bound up to time T + 1 for executing the bi-section algorithm. Due to Lemma 3, we now have a much reduced gap between the initial upper and lower bounds; this decreases the number of iterations during the execution of the bi-section method.
Fig. 1. A synthetic data sequence. Initial image location was at (0, 0), and each point was generated by Brownian camera motion. In total, a hundred data points were generated as an input sequence.
3.1 Measurement Selection

From Lemma 1, we know that there is a possibility to reduce the number of measurements, i.e., constraints in the SOCP, without any accuracy loss. Here we explain our algorithm of selecting measurements, presented in Algorithm 2.
Fig. 2. Evolution of L∞ estimation error through time. The initial error was from the first two measurements. The red line shows changes of L∞ error e∞ ; the green line (impulse style) at each time t corresponds to the re-projection error et . When e∞ < et (when the green line goes above the red line), our bisectioning re-computed the estimate Xt∞ . Otherwise, no more computation was necessary. The blue line denotes the evolution of RMS error for Xt∞ .
If we need to solve the feasibility problem due to the condition in Lemma 3 (e_T^∞ < e_{T+1}), then we include u_{T+1} in the set S of selected measurements and run the bi-section algorithm to get the estimation results (e_{T+1}^∞, X_{T+1}^∞) (line 6). Using this estimate, we evaluate the reprojection errors of the un-selected measurements (line 7). If the estimation error is greater than the maximum error from the un-selected measurements, e_max < e_{T+1}^∞, then we are done (lines 8 and 9). Otherwise, we include the measurement u_{t_max} in the selected set S (line 11) and repeat the operation. The lower bound is then set to the new value (line 12).
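Putting Lemmas 2 and 3 and the selection loop together, a rough sketch of Algorithm 2 follows. Here `bisection` and `reproj_error` stand for the routines sketched earlier, and the bookkeeping object `st` with its fields is purely illustrative.

```python
def sequential_update(st, u_new):
    """st holds: cams (list of 3x4 P), meas (list of 2-vectors), sel (set of
    selected indices), X (current estimate), e (current L-inf error)."""
    t = len(st.meas)
    st.meas.append(u_new)
    e_new = reproj_error(st.cams[t], u_new, st.X)
    if e_new <= st.e:                      # Lemma 2: estimate already optimal
        return st
    st.sel.add(t)                          # Lemma 3 bounds: U = e_new, L = old e
    U, L = e_new, st.e
    while True:
        st.e, st.X = bisection([st.cams[i] for i in st.sel],
                               [st.meas[i] for i in st.sel], U, L)
        rest = [i for i in range(len(st.meas)) if i not in st.sel]
        errs = [(reproj_error(st.cams[i], st.meas[i], st.X), i) for i in rest]
        if not errs or max(errs)[0] < st.e:
            return st                      # consistent with all measurements
        _, i_max = max(errs)
        st.sel.add(i_max)                  # include the worst un-selected point
        L = st.e
```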
4 Experiments

We implemented the algorithm in C/C++ and tested it on synthetic and real data. First, experiments with synthetic data sets were done. A data set S was generated as follows. The center of the first camera was located at C_1 = [0, 0, −1]^T, and moved randomly with standard deviation σ_C of a zero-mean Gaussian; that is,

C_t = C_{t−1} + σ_C [N, N, N]^T,    (13)
where N represents a random value from the standard Gaussian distribution. Then the space point X0 = [0, 0, 0]T was projected to the image plane ut = [u1t , u2t ]T with focal length f = 1000. Gaussian noise of level σ in the image space was then injected: ut = ut + σ[N , N ]. Figure 1 shows the trajectory of a random set S when σ = 1.5. All the camera parameters were assumed to be known in this paper. Figure 2 shows a sequential evolution of L∞ estimation error e∞ t through time t = 2, ..., 100, using the image data plotted in Figure 1. The initial error was from the first
Fig. 3. Evolution of the number of selected measurements. Initially two are necessary for triangulation. Only 16 among 100 measurements were selected. The increments were exactly at the time instance when we had e∞ t < et .
two measurements. The red line shows e_t^∞; the green line (impulse style) at each time t corresponds to the re-projection error e_t(X_{t−1}^∞) defined in Equation (3). When e_t^∞ < e_t (that is, when the green line goes above the red line in this graph), our bisectioning was applied to compute the estimate X_t^∞ together with the scheme of measurement selection. Otherwise, no update was necessary. The blue line shows the evolution of the RMS error for comparison:

RMS_t = ( (1/t) Σ_{i=1}^{t} ‖d_i(X_t^∞)‖_2² )^{1/2}.    (14)
[Plot omitted: accumulated computation time (clocks) on the vertical axis vs. sequence index on the horizontal axis; see Fig. 4.]
Fig. 4. Accumulated computation time. Computation was necessary when a new measurement was included into the measurement set M, i.e., when e_t^∞ < e_t. The total computation took 3,414 clocks (time units) when we simply adopted the batch algorithm using every measurement at every time step without any speed-improving method; with adjustment of the upper and lower bounds it took 100 clocks, which was further reduced to 63 clocks with measurement selection.
Fig. 5. The ratio of the accumulated time to the batch computation time for the case of 100 data sets. The average ratio was almost 1.0, which meant that the speed of our sequential update algorithm was almost the same as that of one batch computation on the average.
Figure 3 shows the evolution of the number of selected measurements. In this experiment, only 16 measurements among 100 were selected during the sequential update. Note that the number of selected measurements increases when e_t^∞ < e_t. Figure 4 shows the accumulated computation time. The main computation was done only when the feasibility computation was necessary, as can be seen in the graph. The total computation took 3,414 clocks (time units) when we simply adopted the batch algorithm using every measurement at every time step without any speed-improving method; with adjustment of the upper and lower bounds it took 100 clocks, which was further reduced to 63 clocks with measurement selection. The batch computation took 157 clocks.
Fig. 6. Results of a real experiment. Evolution of the L∞ estimation error through time. The initial error was from the first two measurements. The red line shows changes of the L∞ error e^∞; the green line (impulse style) at each time t corresponds to the re-projection error e_t. When e^∞ < e_t (when the green line goes above the red line), our bisectioning re-computed the estimate X_t^∞. Otherwise, no more computation was necessary. The blue line denotes the evolution of the RMS (L2) error for X_t^∞.
Fig. 7. Evolution of L∞ estimation error through time for the 25th 3D point from Corridor data set
Fig. 8. L∞ estimation error plot for Corridor sequence
Figure 5 shows that such a speed-up was attained on average. We did the same experiments using different data sets. A hundred repetitions showed that the average ratio of computation time was 1.0; this experimentally implies that the speed of our sequential algorithm is almost the same as running the batch algorithm once with all measurements. We then generated 1000 data sets {S_k, k = 1, ..., 1000} and repeated the same experiment for each. The difference of re-projection errors diff_err_k = |e^∞(X_bat) − e^∞(X_seq)| was computed to make sure that our algorithm results in the same estimation, where e^∞ is the L∞ norm (maximum) of all the errors of the 100 data points in the set S_k. The average of the differences was approximately 7 × 10⁻⁹, which means that the two estimates were numerically the same. Figure 6 shows the same illustration as Figure 2 for a real experiment of sequence length 162. It took 305 time units (the batch algorithm took 719 time units). Notice that the error converges early in the sequence and there is almost no update afterwards. This also shows that our algorithm is very suitable when we need to update a large number of triangulation problems at the same time.
Finally, experiments with real data were done with the Corridor sequence¹. Among the tracks of corners, those which had more than three matches were chosen. Figure 7 shows an exemplary L∞ evolution graph, as Figure 2 did. Figure 8 shows the plot of L∞ errors for all the data sequences whose length was longer than three.
5 Conclusion

This paper considered how to apply the bisection method for L∞ norm minimization to a sequential situation. The computation with the bisection method did not have to be executed during the sequence when the re-projection of X_{t−1}^∞ to the t-th camera yielded a smaller error than e^∞; otherwise, bisectioning was necessary, but with lower and upper bounds whose gap was narrower, resulting in faster computation. A measurement selection scheme was also provided to reduce the computational cost by decreasing the number of measurement cones (constraints). Our mathematical reasoning was provided, and the performance of the sequential algorithm was shown via synthetic and real experiments; our method is suitable for real-time or on-line applications.
¹ http://www.robots.ox.ac.uk/~vgg/data.html
Initial Pose Estimation for 3D Model Tracking Using Learned Objective Functions

Matthias Wimmer¹ and Bernd Radig²

¹ Faculty of Science and Engineering, Waseda University, Tokyo, Japan
² Institut für Informatik, Technische Universität München, Germany
Abstract. Tracking 3D models in image sequences essentially requires determining their initial position and orientation. Our previous work [14] identifies the objective function as a crucial component for fitting 2D models to images. We state preferable properties of these functions and we propose to learn such a function from annotated example images. This paper extends this approach by making it appropriate for also fitting 3D models to images. The correctly fitted model represents the initial pose for model tracking. However, this extension induces nontrivial challenges such as out-of-plane rotations and self-occlusion, which cause large variation in the part of the model's surface that is visible in the image. We solve this issue by connecting the input features of the objective function directly to the model. Furthermore, sequentially executing objective functions specifically learned for different displacements from the correct positions yields highly accurate objective values.
1 Introduction
Model-based image interpretation is appropriate to extract high-level information from single images and from image sequences. Models induce a priori knowledge about the object of interest and thereby reduce the large amount of image data to a small number of model parameters. However, the great challenge is to determine the model parameters that best match a given image. For interpreting image sequences, model tracking algorithms fit the model to the individual images of the sequence. Each fitting step benefits from the pose estimate derived from the previous image of the sequence. However, determining the pose estimate for the first image of the sequence has not been sufficiently solved yet. The challenge of this so-called initial pose estimation is identical to the challenge of fitting models to single images. Our previous work identifies the objective function as an essential component for fitting models to single images [14]. This function evaluates how well a particular model fits to an image. Without losing generality, we consider lower values to represent a better model fitness. Therefore, algorithms search for the model parameters that minimize the objective function. Since the described methods
This research is partly funded by a JSPS Postdoctoral Fellowship for North American and European Researchers (FY2007).
are independent of the fitting algorithm used, we do not elaborate on them, but refer to Hanek et al. [5] for a recent overview and categorization of fitting algorithms. As our approach has only been specified for 2D models so far, this paper extends it to be capable of handling 3D models, while considering a rigid model of a human face. In contrast to artificial objects, such as cars, faces vary highly in shape and texture, and therefore the described face model fitting task represents a particular difficulty. Many researchers engage in fitting 3D models. Lepetit et al. [7] treat this issue as a classification problem and use decision trees to solve it. As an example implementation, the ICP algorithm [2,12] minimizes the square error of the distance between the visible object and the model projected to the image.

Problem Statement. Although the accuracy of model fitting heavily depends on the objective function, it is often designed by hand using the designer's intuition about a reasonable measure of fitness. Afterwards, its appropriateness is subjectively determined by inspecting the objective function on example images and example model parameters. If the result is not satisfactory, the function is tuned or redesigned from scratch [11,4], see Figure 1 (left). Therefore, building the objective function is very time consuming, and the function is not guaranteed to yield accurate results.
Fig. 1. The procedures for designing (left) and learning (right) objective functions
Solution Outline. Our novel approach focuses on the root problem of model fitting: we improve the objective function rather than the fitting algorithm. As a solution to this challenge, we propose to conduct a five-step methodology that learns robust local objective functions from annotated example images. We have so far investigated this approach for 2D models [14]. This paper extends our methodology in order to generate objective functions that are capable of handling 3D models as well. The obtained functions consider specific aspects of 3D models, such as out-of-plane rotations and self-occlusion. We compute the features not in the 2D image plane but in the space of the 3D model. This requires connecting the individual features directly to the model.

Contributions. The resulting objective functions work very accurately in real-world scenarios, and they are able to solve the challenge of initial pose estimation that is required by model tracking. This easy-to-use approach is applicable to various image interpretation scenarios and requires the designer only to annotate example images with the correct model parameters. Since no further
computer vision expertise is necessary, this approach has great potential for commercialization. The paper proceeds as follows. In Section 2, we sketch the challenge of model-based image interpretation. In Section 3, we propose our methodology to learn accurate local objective functions from annotated training images, with particular focus on 3D models. Section 4 conducts experimental evaluations that verify this approach. Section 5 summarizes our approach and shows future work.
2 Model-Based Image Interpretation
Rigid 3D models represent the geometric properties of real-world objects. A six-dimensional parameter vector p = (t_x, t_y, t_z, α, β, γ)^T describes the position and orientation. The model consists of N three-dimensional model points c_n(p), 1 ≤ n ≤ N. Figure 2 depicts our face model with N = 214 model points. Fitting 3D models to images requires two essential components: the fitting algorithm searches for the model parameters that best match the content of the image. For this task, it searches for the minimum of the objective function f(I, p), which determines how well a model p matches an image I. As in Equation 1, this function is often subdivided into N local components f_n(I, x), one for each model point [7,1,9]. These local functions determine how well the n-th model point fits to the image. The advantage of this partitioning is that designing the local functions is more straightforward than designing the global function, because only the image content in the vicinity of one projected model point needs to be taken into consideration. The disadvantage is that dependencies and interactions between local errors cannot be combined.

f(I, p) = \sum_{n=1}^{N} f_n(I, c_n(p))    (1)
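As a minimal illustration of Equation 1, a Python sketch evaluating the global objective as the sum of its local components (the local functions and the model-point function c_n(p) are assumed to be given):

```python
def global_objective(image, p, local_fns, model_point):
    """Evaluate f(I, p) = sum over n of f_n(I, c_n(p)), as in Equation 1.

    image      : the input image I
    p          : model parameters (t_x, t_y, t_z, alpha, beta, gamma)
    local_fns  : list of N local objective functions f_n(image, x)
    model_point: callable (n, p) -> 3D model point c_n(p)
    """
    return sum(f_n(image, model_point(n, p))
               for n, f_n in enumerate(local_fns))
```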
2.1 Characteristic Search Directions of Local Objective Functions
Fitting algorithms for 2D contour models usually search for the minimum of the objective function along the perpendicular to the contour [3]. The objective function
Fig. 2. Our 3D model of a human face correctly fitted to images
Fig. 3. Due to affine transformations of the face model the characteristic search directions will not be parallel to the image plane, in general. These three images show how the transformations affect one of these directions.
computes its value from image features in the vicinity of the perpendicular. We stick to this procedure, and therefore we create local objective functions that are specific to a search direction. These so-called characteristic directions represent three-dimensional lines, and we connect them tightly to the model's geometric structure, i.e. to the individual model points c_n(p). Transforming the model's pose transforms these directions equivalently, see Figure 3. The objective function computes its value from three-dimensional features; however, their values are calculated by projecting them to the image plane. An image is most descriptive for a characteristic direction if they are parallel. Unfortunately, transforming the model will usually yield characteristic directions that are not parallel to the image plane. Therefore, we consider not only one but L characteristic directions per model point (1 ≤ l ≤ L), which are differently oriented. These directions may be arbitrary, but we prefer them to be pairwise orthogonal. This yields L objective functions f_{n,l}(I, x) for each model point. In order not to increase computation time, we consider only the characteristic direction that is most parallel to the image plane.

f_n(I, x) = f_{n, g_n(p)}(I, x)    (2)
The model point’s local objective function fn is computed as in Equation 2. The indicator gn (p) computes the index of the characteristic direction that is most significant for the current pose p of the model, i.e. that is most parallel to the image plane.
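One possible implementation of the indicator g_n(p) is sketched below (assuming the characteristic directions have already been transformed by the current pose and the camera's optical axis is known); the direction most parallel to the image plane is the one most orthogonal to the viewing axis:

```python
import numpy as np

def most_parallel_direction(directions, view_axis):
    """Return the index l = g_n(p) of the characteristic direction that is
    most parallel to the image plane, i.e. most orthogonal to the viewing axis.

    directions: (L, 3) array of characteristic directions in camera coordinates
    view_axis : (3,) unit vector along the camera's optical axis
    """
    directions = np.asarray(directions, dtype=float)
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # |d . view_axis| is smallest for the direction closest to the image plane.
    return int(np.argmin(np.abs(directions @ view_axis)))
```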
3
Learning Objective Functions from Image Annotations
Ideally, local objective functions have two specific properties. First, they should have a global minimum that corresponds to the best model fit. Otherwise, we cannot be certain that determining the true minimum of the local objective function indicates the intended result. Second, they should have no other local minima. This implies that any minimum found corresponds to the global minimum, which facilitates search. A concrete example of an ideal local objective
function that has both properties is shown in Equation 3, where p_I denotes the model parameters with the best model fit for a certain image I.

f_n(I, x) = |x − c_n(p_I)|    (3)

Unfortunately, f_n cannot be applied to unseen images, for which the best model parameters p_I are not known. Nevertheless, we apply this ideal objective function to annotated training images and obtain ideal training data for learning a local objective function f_{n,l} for the model point n and its characteristic direction l. The key idea behind our approach is that since the training data is generated by an ideal objective function, the learned function will also be approximately ideal. This has already been shown in [14]. Figure 1 (right) illustrates the proposed five-step procedure.

Step 1: Annotating Images with Ideal Model Parameters. As in Figure 2, a database of K images I_k (1 ≤ k ≤ K) is manually annotated with the ideal model parameters p_{I_k}, which are necessary to compute the ideal objective functions f_n. This is the only laborious step of the entire procedure.
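Following Equation 3, the ideal objective value of a training sample is simply its distance to the annotated model point; a minimal sketch (the helper model_point(n, p) for c_n(p) is a hypothetical name):

```python
import numpy as np

def ideal_objective(x, n, p_ideal, model_point):
    """f_n(I, x) = |x - c_n(p_I)|: distance of a 3D sample x to the n-th
    model point under the manually annotated (ideal) parameters p_I."""
    return float(np.linalg.norm(np.asarray(x, dtype=float)
                                - np.asarray(model_point(n, p_ideal), dtype=float)))
```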
Fig. 4. Further annotations are generated by moving along the line that is longest when projected in the image. That line is colored white here. The directions illustrated in black are not used. Annotations on one of the directions that are not used are also shown to demonstrate that this direction is too short to be used.
Fig. 5. This comprehensive set of image features is provided for learning local objective functions. In our experiments, we use a total number of A=6·3·5·5=450 features.
Step 2: Generating Further Image Annotations. The ideal objective function returns the minimum f_n(I, x) = 0 for all manual annotations x = c_n(p_{I_k}). These annotations are not sufficient to learn the characteristics of f_n. Therefore, we also generate annotations x ≠ c_n(p_{I_k}), for which f_n(I, x) ≠ 0. In general, any 3D position x may represent one of these annotations; however, we sample positions −D ≤ d ≤ D along the characteristic direction with a maximum displacement Δ (learning radius), see Figure 4. This procedure learns the calculation rules of the objective function more accurately. In this paper, we use L = 3 characteristic directions, because the model points vary within the 3D space. Note that g_n(p_{I_k}) selects the most significant direction for the n-th model point.

Step 3: Specifying Image Features. Our approach learns the calculation rules of a mapping f_{n,l} from an image I_k and a location x_{k,n,d,l} to the value of f_n(I_k, x_{k,n,d,l}). Since f_{n,l} has no knowledge of p_I, it must compute its result from the image content. Instead of learning a direct mapping from the pixel values in the vicinity of x to f_n, we compute image features first. Note that x does not denote a position in I but in 3D space; however, the corresponding pixel position is obtained via perspective projection. Our idea is to provide a multitude of features h_a (1 ≤ a ≤ A) and let the training algorithm choose which of them are relevant to the calculation rules of f_{n,l}. Each feature h_a(I, x) is computed from an image I and a position x and delivers a scalar value. Our approach currently relies on Haar-like features [13,8] of different styles and sizes. Furthermore, the features are not only computed at the location of the model point itself, but also at positions on a grid within its vicinity, see Figure 5.
Fig. 6. The grid of image features moves along with the displacement
This variety of styles, sizes, and locations yields a set of A = 450 image features, as used in our experiments in Section 4. This multitude of features enables the learned objective function to exploit the texture of the image at the model point and in its surrounding area. When moving the model point, the image features move along with it, leading their values to change, see Figure 6.

Step 4: Generating Training Data. The result of the manual annotation step (Step 1) and the automated annotation step (Step 2) is a list of correspondences between positions x and the corresponding value of f_n. Since K images, N model points, and 2D+1 displacements are annotated, these correspondences amount to K·N·(2D+1). Equation 4 illustrates the list of these correspondences.

[ I_k, x_{k,n,d,l}, f_n(I_k, x_{k,n,d,l}) ]    (4)

[ h_1(I_k, x_{k,n,d,l}), . . . , h_A(I_k, x_{k,n,d,l}), f_n(I_k, x_{k,n,d,l}) ]    (5)

with 1 ≤ k ≤ K, 1 ≤ n ≤ N, −D ≤ d ≤ D, l = g_n(p_{I_k})
Applying the list of image features to the list of correspondences yields the training data in Equation 5. This step simplifies matters greatly. Since each feature returns a single value, we hereby reduce the problem of mapping the vast amount of image data and the related pixel locations to the corresponding target value, to mapping a list of feature values to the target value.

Step 5: Learning the Calculation Rules. Given the training data from Equation 5, the goal is now to learn the function f_{n,l}(I, x) that approximates f_n(I, x). The challenge is that f_{n,l} is not provided knowledge of p_I. Therefore, it can be applied to previously unseen images. We obtain this function by training a model tree [10,15] with the comprehensive training data from Equation 5. Note that the N·L objective functions f_{n,l} have to be learned individually. However, learning f_{n,l} only requires the records of the training data (Equation 5) where n and l match. Model trees are a generalization of regression trees and, in turn, of decision trees. Whereas decision trees have nominal values at their leaf nodes, model trees have line segments, allowing them to also map features to a continuous value, such as the value returned by the ideal objective function. One of the reasons for deciding for model trees is that they tend to select only features that are relevant to predict the target value. Therefore, they pick a small number M_n of Haar-like features from the provided set of A features, with A much larger than M_n. After executing these five steps, we obtain a local objective function f_{n,l} for each model point n and each direction l. It can now be called with an arbitrary location x of an arbitrary image I. The learned model tree calculates the values of the necessary features from the image content and yields the result value.
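The training procedure of Steps 2–5 can be sketched in Python as below, using scikit-learn's DecisionTreeRegressor as a stand-in for the model tree of [10,15] (a model tree fits linear models at its leaves rather than constants); the feature extractor, the model-point function and the annotations are assumed to exist and their names are hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # stand-in for an M5 model tree

def build_training_data(images, annotations, n, direction, model_point,
                        features, D=7, delta=0.05):
    """Assemble the records of Equation 5 for model point n and one direction.

    images      : list of K training images I_k
    annotations : list of K ideal parameter vectors p_Ik
    direction   : unit 3D vector of the chosen characteristic direction
    model_point : callable (n, p) -> 3D coordinates c_n(p)
    features    : callable (image, x) -> length-A feature vector [h_1..h_A]
    D, delta    : displacements per side and learning radius
    """
    direction = np.asarray(direction, dtype=float)
    X, y = [], []
    for image, p_ideal in zip(images, annotations):
        center = np.asarray(model_point(n, p_ideal), dtype=float)
        for d in range(-D, D + 1):
            x = center + (d / D) * delta * direction   # displaced 3D sample
            X.append(features(image, x))               # h_1(I,x), ..., h_A(I,x)
            y.append(np.linalg.norm(x - center))       # ideal target f_n(I, x)
    return np.asarray(X), np.asarray(y)

def learn_local_objective(X, y):
    """Learn f_n,l from feature vectors X and ideal target values y."""
    tree = DecisionTreeRegressor(min_samples_leaf=5)
    return tree.fit(X, y)
```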
4 Experimental Evaluation
In this section, two experiments show the capability of fitting algorithms equipped with a learned objective function to fit a face model to previously unseen images.
Fig. 7. If the initial distance to the ideal model points is smaller than the learning radius this distance is further reduced with every iteration. Otherwise, the result of the fitting step is unpredictable and the model points are spread further.
The experiments are performed on 240 training images and 80 test images with a non-overlapping set of individuals. Furthermore, the images differ in face pose, illumination, and background. Our evaluations randomly displace the face models from the manually specified pose. The fitting process determines the position of every model point by exhaustively searching along the most significant characteristic direction for the global minimum of the local objective function. Afterwards, the model parameters p are approximated. The subsequent figures illustrate the average point-to-point error of the model points between the obtained model p and the manually specified model pI . Our first evaluation investigates the impact of executing the fitting process with a different number of iterations. Figure 7 illustrates that each iteration improves the model parameters. However, there is a lower bound to the quality of the obtained model fit. Obviously, more than 10 iterations do not improve the fraction of well-fitted models significantly. Note that the objective function’s value is arbitrary for high distances from the correct position. These models are further distributed with every iteration. Our second experiment conducts model fitting by subsequently applying the fitting process with two different objective functions f A and f B learned with decreasing learning radii Δ. f A with a large Δ is able to handle large initial displacements in translation and rotation. However, the obtained fitting result gets less accurate, see Figure 8. The opposite holds true for f B . The idea is to apply a local objective function learned with large Δ first and then gradually apply objective functions learned with smaller values for Δ. As opposed to the
Fig. 8. By combining fitting algorithms using objective functions with different learning radii, we obtain results that show the strengths of both objective functions. The sequential approach shows the tolerance to errors of f^A and the accuracy of f^B.
previous experiment, where we iteratively executed the same objective function, this iteration scheme executes different objective functions, which compensates for the weakness of one function with the strength of another. The advantage of concatenating algorithms with objective functions of decreasing learning radii, compared to iterating one algorithm several times, is illustrated by Figure 8. Sequentially applying f^A and f^B is significantly better than both of the other algorithms. Note that we execute each experiment with ten iterations, because we do not expect any improvement in quality with a higher number of iterations, see our first experiment. Therefore, the accuracy obtained from the sequential execution is not due to the fact that some additional iterations are applied.
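The sequential scheme of this experiment can be summarized by the following sketch (fit_with is a hypothetical routine that runs the exhaustive-search fitting for a fixed number of iterations with a given learned objective function):

```python
def coarse_to_fine_fit(image, p_init, objective_fns, fit_with, iterations=10):
    """Apply fitting with objective functions learned for decreasing learning
    radii: a tolerant function f_A first, then a more accurate f_B, and so on.

    objective_fns: learned objective functions ordered by decreasing learning
                   radius (e.g. [f_A, f_B])
    fit_with     : callable (image, p, objective_fn, iterations) -> refined p
    """
    p = p_init
    for objective_fn in objective_fns:
        p = fit_with(image, p, objective_fn, iterations)
    return p
```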
5 Summary and Conclusion
In this paper, we extended our five-step methodology for learning local objective functions, which we had so far introduced for 2D models [14]. 3D models can now be handled as well. This approach automates many critical decisions, and the remaining manual steps require little domain-dependent knowledge. Furthermore, its process does not contain any time-consuming loops. These features enable non-expert users to customize the fitting application to their specific domain. The resulting objective function is able to process not only objects that look similar, such as in [7,6], but also objects that differ significantly in shape and texture, such as human faces. Being trained with a limited number of annotated images as
described in Section 3, the resulting objective function is able to fit faces that are not part of the training data as well. However, the database of annotated faces must be representative enough. If there were no bearded men in the training data, the algorithm would have problems fitting the model to an image of such a man. The disadvantage of our approach is the laborious annotation step. Gathering and annotating hundreds of images requires several weeks.
References
1. Allezard, N., Dhome, M., Jurie, F.: Recognition of 3D textured objects by mixing view-based and model-based representations. ICPR, 960–963 (September 2000)
2. Besl, P.J., McKay, N.D.: A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 14(2), 239–256 (1992)
3. Cootes, T.F., Taylor, C.J., Lanitis, A., Cooper, D.H., Graham, J.: Building and using flexible models incorporating grey-level information. In: ICCV, pp. 242–246 (1993)
4. Cristinacce, D., Cootes, T.F.: Facial feature detection and tracking with automatic template selection. In: 7th IEEE International Conference on Automatic Face and Gesture Recognition, April 2006, pp. 429–434. IEEE Computer Society Press, Los Alamitos (2006)
5. Hanek, R.: Fitting Parametric Curve Models to Images Using Local Self-adapting Separation Criteria. PhD thesis, Technische Universität München (2004)
6. Lepetit, V., Lagger, P., Fua, P.: Randomized trees for real-time keypoint recognition. In: CVPR 2005, Switzerland, pp. 775–781 (2005)
7. Lepetit, V., Pilet, J., Fua, P.: Point matching as a classification problem for fast and robust object pose estimation. In: CVPR 2004, June 2004, vol. 2, pp. 244–250 (2004)
8. Lienhart, R., Maydt, J.: An extended set of Haar-like features for rapid object detection. In: IEEE ICIP, pp. 900–903. IEEE Computer Society Press, Los Alamitos (2002)
9. Marchand, E., Bouthemy, P., Chaumette, F., Moreau, V.: Robust real-time visual tracking using a 2D-3D model-based approach. In: ICCV, pp. 262–268 (September 1999)
10. Quinlan, R.: Learning with continuous classes. In: Proceedings of the 5th Australian Joint Conference on Artificial Intelligence, pp. 343–348 (1992)
11. Romdhani, S.: Face Image Analysis using a Multiple Feature Fitting Strategy. PhD thesis, University of Basel, Computer Science Department, Basel, CH (January 2005)
12. Simon, D., Hebert, M., Kanade, T.: Real-time 3-D pose estimation using a high-speed range sensor. In: ICRA 1994. Proceedings of IEEE International Conference on Robotics and Automation, vol. 3, pp. 2235–2241 (May 1994)
13. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. Computer Vision and Pattern Recognition (CVPR) (2001)
14. Wimmer, M., Pietzsch, S., Stulp, F., Radig, B.: Learning robust objective functions with application to face model fitting. In: Proceedings of the 29th DAGM Symposium, Heidelberg, Germany, September 2007 (to appear)
15. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
Multiple View Geometry for Non-rigid Motions Viewed from Translational Cameras

Cheng Wan, Kazuki Kozuka, and Jun Sato

Department of Computer Science and Engineering, Nagoya Institute of Technology, Nagoya 466–8555, Japan
Abstract. This paper introduces multiple view geometry under projective projections from four-dimensional space to two-dimensional space which can represent multiple view geometry under the projection of space with time. We show the multifocal tensors defined under space-time projective projections can be derived from non-rigid object motions viewed from multiple cameras with arbitrary translational motions, and they are practical for generating images of non-rigid object motions viewed from cameras with arbitrary translational motions. The method is tested in real image sequences.
1 Introduction
The multiple view geometry is very important for describing the relationship between images taken from multiple cameras and for recovering 3D geometry from images [1,2,3,4,6,7]. In the traditional multiple view geometry, the projection from the 3D space to 2D images has been assumed [3]. However, the traditional multiple view geometry is limited to describing the case where enough corresponding points are visible from a static configuration of multiple cameras. Recently, some efforts for extending the multiple view geometry to more general point-camera configurations have been made [5,8,10,11,12]. Wolf et al. [8] studied the multiple view geometry on the projections from N-dimensional space to 2D images and showed that it can be used for describing the relationship of multiple views obtained from moving cameras and points which move on straight lines with constant speed. Thus the motions of objects are limited. Hayakawa et al. [9] proposed the multiple view geometry in space-time, which makes it possible to describe the relationship of multiple views derived from translational cameras and non-rigid arbitrary motions. However, their multiple view geometry assumes affine projection, which is an ideal model and cannot be applied if we have strong perspective distortions in images. In this paper we introduce the multiple view geometry under the projective projection from 4D space to 2D space and show that such a universal model can represent multiple view geometry in the case where non-rigid arbitrary motions are viewed from multiple translational projective cameras. We first analyze multiple view geometry under the projection from 4D to 2D, and show that we have multilinear relationships for up to 5 views unlike the traditional multilinear
relationships. The three view, four view and five view geometries are studied extensively and new trilinear, quadrilinear and quintilinear relationships under the projective projection from 4D space to 2D space are presented. We next show that the newly defined multiple view geometry can be used for describing the relationship between images taken from non-rigid motions viewed from multiple translational cameras. We also show that it is very useful for generating images of non-rigid object motions viewed from arbitrary translational cameras.
2 Projective Projections from 4D to 2D
We first consider projective projections from 4D space to 2D space. This projection is used to describe the relationship between the real space-time and 2D images, and for analyzing the multiple view geometry under space-time projections. Let X = [X^1, X^2, X^3, X^4, X^5]^⊤ be the homogeneous coordinates of a 4D space point projected to a point in the 2D space, whose homogeneous coordinates are represented by x = [x^1, x^2, x^3]^⊤. Then, the extended projective projection from X to x can be described as follows:

x ∼ PX    (1)
where (∼) denotes equality up to a scale, and P denotes the following 3 × 5 matrix:

P = \begin{bmatrix} m_{11} & m_{12} & m_{13} & m_{14} & m_{15} \\ m_{21} & m_{22} & m_{23} & m_{24} & m_{25} \\ m_{31} & m_{32} & m_{33} & m_{34} & m_{35} \end{bmatrix}    (2)

From (1), we find that the extended projective camera, P, has 14 DOF. In the next section, we consider the multiple view geometry of the extended projective cameras.
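For concreteness, the extended projection of Equation 1 can be evaluated numerically as below (a numpy sketch with an arbitrary example matrix; homogeneous 4D space-time points are 5-vectors):

```python
import numpy as np

def project_4d_point(P, X):
    """Project a homogeneous 4D space-time point X (5-vector) to a 2D image
    point using a 3x5 extended projective camera matrix P (Equation 1)."""
    x = P @ X                 # homogeneous 3-vector, defined up to scale
    return x[:2] / x[2]       # inhomogeneous image coordinates

# Example with an arbitrary 3x5 camera and the point (X, Y, Z, T) = (1, 2, 5, 3).
P = np.array([[800., 0., 320., -40., 0.],
              [0., 800., 240., 10., 0.],
              [0., 0., 1., 0., 4.]])
X = np.array([1., 2., 5., 3., 1.])
print(project_4d_point(P, X))
```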
3 Projective Multiple View Geometry from 4D to 2D
From (1), we have the following equation for N extended projective cameras:

\begin{bmatrix} P & x & 0 & 0 & \cdots & 0 \\ P' & 0 & x' & 0 & \cdots & 0 \\ P'' & 0 & 0 & x'' & \cdots & 0 \\ \vdots & & & & \ddots & \end{bmatrix} \begin{bmatrix} X \\ \lambda \\ \lambda' \\ \lambda'' \\ \vdots \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \end{bmatrix}    (3)

where the leftmost matrix, M, in (3) is 3N × (5 + N), and the (5 + N) × (5 + N) minors Q of M constitute multilinear relationships under the extended projective projection as det Q = 0. We can choose any 5 + N rows from M to constitute Q, but we have to take at least 2 rows from each camera for deriving meaningful N view relationships (note, each camera has 3 rows in M).
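The 3N × (5+N) matrix M of Equation 3 can be assembled directly from the camera matrices and image points; a numpy sketch (the relative signs of the scale factors λ are left to the null-space computation):

```python
import numpy as np

def stack_multiview_matrix(cameras, points):
    """Build the 3N x (5+N) matrix M of Equation 3 from N extended 3x5
    cameras and the corresponding homogeneous image points (3-vectors).
    A null vector of M contains the space-time point X and the scales lambda_i."""
    N = len(cameras)
    M = np.zeros((3 * N, 5 + N))
    for i, (P, x) in enumerate(zip(cameras, points)):
        M[3 * i:3 * i + 3, :5] = P          # camera block
        M[3 * i:3 * i + 3, 5 + i] = x       # image point in its own column
    return M
```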
Table 1. The number of corresponding points required for computing multifocal tensors in three, four and five views with the nonlinear method and the linear method

views   nonlinear method   linear method
three   9                  13
four    8                  10
five    8                  9
Table 2. Trilinear relations between point and line coordinates in three views. The final column denotes the number of linearly independent equations.

correspondence         relation                                            # of equations
three points           x^i x'^j x''^k ε_{krv} T_{ij}^r = 0_v               2
two points, one line   x^i x'^j l''_r T_{ij}^r = 0                         1
one point, two lines   x^i l'_q l''_r ε^{qjt} T_{ij}^r = 0_t               2
three lines            l_p l'_q l''_r ε^{pis} ε^{qjt} T_{ij}^r = 0_{st}    4
Thus, 5 + N ≥ 2N must hold for defining multilinear relationships for N view geometry in the 4D space. Thus, we find that, unlike the traditional multiple view geometry, the multilinear relationship for 5 views is the maximal linear relationship in the 4D space. We next consider the minimum number of points required for computing the multifocal tensors. The geometric DOF of N extended projective cameras is 14N − 24, since each extended projective camera has 14 DOF and these N cameras are in a single 4D projective space whose DOF is 24. Meanwhile, suppose we are given M points in the 4D space, and let them be projected to N projective cameras defined in (1). Then, we derive 2MN measurements from images, while we have to compute 14N − 24 + 4M components for fixing all the geometry in the 4D space. Thus, the following condition must hold for computing the multifocal tensors from images: 2MN ≥ 14N − 24 + 4M. We find that a minimum of 9, 8 and 8 points is required to compute the multifocal tensors in three, four and five views, respectively (see Table 1).
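This counting argument can be checked with a few lines of Python, taking for each N the smallest integer M with 2MN ≥ 14N − 24 + 4M:

```python
import math

def min_points(num_views):
    """Smallest M satisfying 2*M*N >= 14*N - 24 + 4*M (N >= 3 views)."""
    N = num_views
    return math.ceil((14 * N - 24) / (2 * N - 4))

print([min_points(N) for N in (3, 4, 5)])   # -> [9, 8, 8], matching Table 1
```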
3.1 Three View Geometry
We next introduce the multiple view geometry of three extended projective cameras. For three views, the square sub-matrix Q is 8 × 8. From det Q = 0, we have the following trilinear relationship under extended projective camera projections:

x^i x'^j x''^k ε_{krv} T_{ij}^r = 0_v    (4)

where ε_{ijk} denotes a tensor which represents a sign based on the permutation from {i,j,k} to {1,2,3}. T_{ij}^r is the trifocal tensor for the extended cameras and has the following form:
Table 3. Quadrilinear relations between point and line coordinates in four views

correspondence           relation                                                              # of equations
four points              x^i x'^j x''^k x'''^s ε_{jlu} ε_{kmv} ε_{snw} Q_i^{lmn} = 0_{uvw}      8
three points, one line   x^i x'^j x''^k l'''_n ε_{jlu} ε_{kmv} Q_i^{lmn} = 0_{uv}               4
two points, two lines    x^i x'^j l''_m l'''_n ε_{jlu} Q_i^{lmn} = 0_u                          2
one point, three lines   x^i l'_l l''_m l'''_n Q_i^{lmn} = 0                                    1
four lines               l_k l'_l l''_m l'''_n ε^{kiw} Q_i^{lmn} = 0_w                          2

Table 4. Quintilinear relations between point and line coordinates in five views

correspondence            relation                                                                                    # of eq.
five points               x^i x'^j x''^k x'''^s x''''^t ε_{ila} ε_{jmb} ε_{knc} ε_{sfd} ε_{tge} R^{lmnfg} = 0_{abcde}  32
four points, one line     x^i x'^j x''^k x'''^s l''''_g ε_{ila} ε_{jmb} ε_{knc} ε_{sfd} R^{lmnfg} = 0_{abcd}           16
three points, two lines   x^i x'^j x''^k l'''_f l''''_g ε_{ila} ε_{jmb} ε_{knc} R^{lmnfg} = 0_{abc}                    8
two points, three lines   x^i x'^j l''_n l'''_f l''''_g ε_{ila} ε_{jmb} R^{lmnfg} = 0_{ab}                             4
one point, four lines     x^i l'_m l''_n l'''_f l''''_g ε_{ila} R^{lmnfg} = 0_a                                        2
five lines                l_l l'_m l''_n l'''_f l''''_g R^{lmnfg} = 0                                                  1

T_{ij}^r = ε_{ilm} ε_{jqu} det \begin{bmatrix} a^l \\ a^m \\ b^q \\ b^u \\ c^r \end{bmatrix}    (5)
where a^i denotes the i-th row of P, b^i the i-th row of P', and c^i the i-th row of P'', respectively. The trifocal tensor T_{ij}^r is 3 × 3 × 3 and has 27 entries. If the extended cameras are projective as shown in (1), we have only 26 free parameters in T_{ij}^r, excluding a scale ambiguity. On the other hand, (4) provides 3 linear equations on T_{ij}^r, but only 2 of them are linearly independent. Thus, at least 13 corresponding points are required to compute T_{ij}^r from images linearly. A complete set of the trilinear equations involving the trifocal tensor is given in Table 2. All of these equations are linear in the entries of the trifocal tensor T_{ij}^r.
3.2 Four View and Five View Geometry
Similarly, the four view and the five view geometry can also be derived for the extended projective cameras. The quadrilinear relationship under extended projective projection is

x^i x'^j x''^k x'''^s ε_{jlu} ε_{kmv} ε_{snw} Q_i^{lmn} = 0_{uvw}    (6)
Q_i^{lmn} is the quadrifocal tensor, whose form is described as

Q_i^{lmn} = ε_{ipq} det \begin{bmatrix} a^p \\ a^q \\ b^l \\ c^m \\ d^n \end{bmatrix}    (7)
where d^i denotes the i-th row of P'''. The quadrifocal tensor Q_i^{lmn} has 81 entries. Excluding a scale ambiguity, it has 80 free parameters, and 27 linear equations are given from (6), but only 8 of them are linearly independent. Therefore, a minimum of 10 corresponding points is required to compute Q_i^{lmn} from images linearly. The quadrilinear relationships involving the quadrifocal tensor are summarized in Table 3. We next introduce the multiple view geometry of five extended projective cameras. The quintilinear constraint is expressed as follows:

x^i x'^j x''^k x'''^s x''''^t ε_{ila} ε_{jmb} ε_{knc} ε_{sfd} ε_{tge} R^{lmnfg} = 0_{abcde}
(8)
where R^{lmnfg} is the quintifocal tensor (five view tensor), whose form is represented as

R^{lmnfg} = det \begin{bmatrix} a^l \\ b^m \\ c^n \\ d^f \\ e^g \end{bmatrix}    (9)

where e^i denotes the i-th row of P''''. The quintifocal tensor R^{lmnfg} has 243 entries. If the extended cameras are projective as shown in (1), we have only 242 free parameters in R^{lmnfg}, excluding a scale. On the other hand, (8) provides 243 linear equations on R^{lmnfg}, but only 32 of them are linearly independent. If we have N corresponding points, 32N − {}_{N}C_{2} independent constraints can be derived. Thus, at least 9 corresponding points are required to compute R^{lmnfg} from images linearly. The number of corresponding points required for computing multifocal tensors is summarized in Table 1. The quintilinear relationships are given in Table 4.
4 Multiple View Geometry for Multiple Moving Cameras
Let us consider a single moving point in the 3D space. If the multiple cameras are stationary, we can compute the traditional multifocal tensors [3] from the image motion of this point, and they can be used for constraining image points in arbitrary views and for reconstructing 3D points from images. However, if these cameras are moving independently, the traditional multifocal tensors cannot be computed from the image motion of a single point. Nonetheless, we in this section
Fig. 1. A moving point in 3D space and its projections in three translational projective cameras. The multifocal tensor defined under space-time projections can describe the relationship between these image projections.
show that if the camera motions are translational as shown in Fig. 1, the multiple view geometry under extended projective projections can be computed from the image motion of a single point, and it can be used, for example, for generating image motions viewed from arbitrary translational cameras. We first show that the extended projective cameras shown in (1) can be used for describing non-rigid object motions viewed from stationary multiple cameras. We next show that this camera model can also be used for describing non-rigid object motions viewed from multiple cameras with translational motions of constant speed. The motions of a point, W̃ = [X, Y, Z]^⊤, in the real space can be considered as a set of points, X̃ = [X, Y, Z, T]^⊤, in a 4D space-time, where T denotes time and (~) denotes inhomogeneous coordinates. The motions in the real space are projected to images, and can be observed as a set of points, x̃ = [x, y]^⊤. Thus, if we assume projective projections in the space axes, the space-time projections can be described by the extended projective cameras shown in (1). We next show that the multiple view geometry described in Section 3 can also be applied to multiple moving cameras. Let us consider a usual projective camera which projects points in 3D to 2D images. If the translational motions of the projective camera are constant, non-rigid motions are projected to images as:

λ \begin{bmatrix} x(T) \\ y(T) \\ 1 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \end{bmatrix} \begin{bmatrix} X(T) - T\Delta X \\ Y(T) - T\Delta Y \\ Z(T) - T\Delta Z \\ 1 \end{bmatrix}    (10)

= \begin{bmatrix} a_{11} & a_{12} & a_{13} & -a_{11}\Delta X - a_{12}\Delta Y - a_{13}\Delta Z & a_{14} \\ a_{21} & a_{22} & a_{23} & -a_{21}\Delta X - a_{22}\Delta Y - a_{23}\Delta Z & a_{24} \\ a_{31} & a_{32} & a_{33} & -a_{31}\Delta X - a_{32}\Delta Y - a_{33}\Delta Z & a_{34} \end{bmatrix} \begin{bmatrix} X(T) \\ Y(T) \\ Z(T) \\ T \\ 1 \end{bmatrix}    (11)
Fig. 2. Single point motion experiment. (a), (b) and (c) show image motions of a single point viewed from camera 1, 2 and 3. The 13 green points in each image are corresponding points used for computing the trifocal tensor. Note that these 3 cameras are translating with different speed and direction.
Fig. 3. The white curve in (a) shows image motions recovered from the extended trifocal tensor, and the black curve shows real image motions observed in camera 3. (b) shows those recovered from the traditional trifocal tensor. The 13 black points in (a) and 7 black points in (b) show points used for computing the trifocal tensors.
where x(T) and y(T) denote image coordinates at time T; X(T), Y(T) and Z(T) denote coordinates of a 3D point at time T; and ΔX, ΔY and ΔZ denote camera motions in the X, Y and Z axes. Since the translational motion is constant in each camera, ΔX, ΔY and ΔZ are fixed for each camera. Then, we find from (11) that the projections of non-rigid motions to multiple cameras with translational motions can also be described by the extended projective cameras shown in (1). Thus the multiple view geometry described in Section 3 can also be applied to multiple projective cameras with constant translational motions. Note that if we have enough moving points in the scene, we can also compute the traditional multiple view geometry on the multiple moving cameras at each instant.
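Equation 11 shows that a 3×4 projective camera translating with a constant per-frame velocity (ΔX, ΔY, ΔZ) acts on space-time points (X, Y, Z, T, 1)^⊤ as a 3×5 extended camera. A numpy sketch of this construction (assuming the 3×4 matrix A = [a_ij] and the translation increments are known):

```python
import numpy as np

def extended_camera(A, delta):
    """Build the 3x5 extended camera of Equation 11 from a 3x4 projective
    camera A and a constant per-frame translation delta = (dX, dY, dZ)."""
    A = np.asarray(A, dtype=float)
    delta = np.asarray(delta, dtype=float)
    P = np.zeros((3, 5))
    P[:, :3] = A[:, :3]                 # columns acting on X(T), Y(T), Z(T)
    P[:, 3] = -A[:, :3] @ delta         # column acting on time T
    P[:, 4] = A[:, 3]                   # homogeneous column
    return P
```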
5 Experiments
We next show the results of experiments. We first show, using real images, that the trifocal tensor for extended projective cameras can be computed from image motions viewed from arbitrary translational cameras, and can be used for generating the third view from the first and the second views of moving cameras. We next evaluate the stability of the extracted trifocal tensors for extended projective cameras.
Fig. 4. Other single point motion experiments. (ai), (bi) and (ci) show three views of the ith motion. The 13 green points in each image and black points in (di) are corresponding points. The white curve in (di) shows image motions recovered from the extended trifocal tensor, and the black curve shows real image motions observed in camera 3.
5.1 Real Image Experiment
In this section, we show the results from single point motion and multiple point motion experiments. In the first experiment, we used 3 cameras which are translating with different constant speeds and different directions, and computed trifocal tensors between these 3 cameras by using a single moving point in the 3D space. Since the multiple cameras are dynamic, we cannot compute the traditional trifocal tensor of these cameras from a moving point. Nonetheless, we can compute the extended trifocal tensor and can generate image motions in one of the 3 views from the others. Fig. 2 (a), (b) and (c) show image motions of a single moving point in translational cameras 1, 2 and 3, respectively. The trifocal tensor is computed from 13 points on the image motions in three views. They are shown by green points in (a), (b) and (c). The extracted trifocal tensor is used for generating image motions in camera 3 from image motions in cameras 1 and 2. The white curve in Fig. 3 (a) shows image motions in camera 3 generated from the extended trifocal tensor, and the black curve shows the real image motions viewed from camera 3. As shown in Fig. 3 (a), the generated image motions almost recovered the original image motions, even though these 3 cameras have unknown translational motions. To show the advantage of the extended trifocal tensor, we also show image motions generated from the traditional trifocal tensor, that is, the trifocal tensor defined for projections from 3D space to 2D space. Seven points taken from the former 13 points are used as corresponding points in three views for computing the traditional projective trifocal tensor. The image motion in camera 3 generated from the image motions in cameras 1 and 2 by using the extracted traditional trifocal tensor is shown by the white curve in Fig. 3 (b). As shown in Fig. 3 (b), the generated image motion is very different from the real image motion shown by the black curve, as we
Fig. 5. Multiple point motion experiments. (ai), (bi) and (ci) show three views of the ith motion. The green curve and the red curve represent two different image motion. The 7 green points on the green curve and the 6 red points on the red curve in each image are corresponding points used for computing the trifocal tensor. Note that these 3 cameras are translating with different speed and direction. The white curve in (di) shows image motions recovered from the extended trifocal tensor, and the black curve shows real image motions observed in camera 3.
expected, and thus we find that the traditional multiple view geometry cannot describe such general situations, while the proposed multiple view geometry can as shown in Fig. 3 (a). The results from other single point motions are also given. In Fig. 4, (ai), (bi) and (ci) show three views of the ith motion. The 13 green points in each image are corresponding points used for computing the trifocal tensor. Note that these 3 cameras are translating with different speed and different direction. The white curve in (di) shows image motions recovered from the extended trifocal tensor in camera 3, and the black curve shows real image motions observed in camera 3. The 13 black points in (di) show points used for computing the trifocal tensor. As we can see, the trifocal tensor defined under space-time projective projections can be derived from arbitrary single point motions viewed from the 3 cameras with arbitrary translational motions, and they are practical for generating images of single point motions viewed from translational camera. Next we show the results from multiple point motions. In Fig. 5, (ai), (bi) and (ci) show three views of the ith motion. The green curve and the red curve represent two different image motion. The 7 green points on the green curve and the 6 red points on the red curve in each image are corresponding points used for computing the trifocal tensor. Note that these 3 cameras are
Fig. 6. Stability evaluation. (a) shows 3 translating cameras and a moving point in the 3D space. The black points show the viewpoints of the cameras before translational motions, and the white points show those after the motions. (b) shows the relationship between the number of corresponding points and the reprojection errors.
translating with different speed and direction. The white curve in (di) shows image motions recovered from the extended trifocal tensor in camera 3, and the black curve shows real image motions observed in camera 3. The 13 black points in (di) show points used for computing the trifocal tensor. According to these experiments, we found that the extended multifocal tensors can be derived from non-rigid object motions viewed from multiple cameras with arbitrary translational motions, and they are useful for generating images of non-rigid object motions viewed from cameras with arbitrary translational motions.
5.2 Stability Evaluation
We next show the stability of extracted trifocal tensors under space-time projections. Fig. 6 (a) shows a 3D configuration of 3 moving cameras and a moving point. The black points show the viewpoints of three cameras, C1, C2 and C3, before translational motions, and the white points show their viewpoints after the translational motions. The translational motions of these three cameras are different and unknown. The black curve shows a locus of a freely moving point. For evaluating the extracted trifocal tensors, we computed reprojection errors derived from the trifocal tensors. The reprojection error is defined as (1/N) \sum_{i=1}^{N} d(m_i, m̂_i)^2, where d(m_i, m̂_i) denotes the distance between a true point m_i and the point m̂_i recovered from the trifocal tensor. We increased the number of corresponding points used for computing trifocal tensors in three views from 13 to 25, and evaluated the reprojection errors. Gaussian noise with a standard deviation of 1 pixel was added to every point on the locus. Fig. 6 (b) shows the relationship between the number of corresponding points and the reprojection errors. As we can see, the stability is obviously improved by using a few more points than the minimum number of corresponding points.
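The reprojection error defined above is straightforward to compute; a small numpy sketch (image points given as N×2 arrays):

```python
import numpy as np

def reprojection_error(true_points, recovered_points):
    """Mean squared distance (1/N) * sum_i d(m_i, m_hat_i)^2 between the true
    image points m_i and the points m_hat_i recovered from the trifocal tensor."""
    m = np.asarray(true_points, dtype=float)
    m_hat = np.asarray(recovered_points, dtype=float)
    return float(np.mean(np.sum((m - m_hat) ** 2, axis=1)))
```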
6 Conclusion
In this paper, we analyzed multiple view geometry under projective projections from 4D space to 2D space, and showed that it can represent multiple view geometry under space-time projections. In particular, we showed that multifocal tensors defined under space-time projective projections can be computed from non-rigid object motions viewed from multiple cameras with arbitrary translational motions. We also showed that they are very useful for generating images of non-rigid motions viewed from projective cameras with arbitrary translational motions. The method was implemented and tested by using real image sequences. The stability of extracted trifocal tensors was also evaluated.
References
1. Faugeras, O.D., Luong, Q.T.: The Geometry of Multiple Images. MIT Press, Cambridge (2001)
2. Faugeras, O.D., Mourrain, B.: On the geometry and algebra of the point and line correspondences between N images. In: Proc. 5th International Conference on Computer Vision, pp. 951–956 (1995)
3. Hartley, R.I., Zisserman, A.: Multiple View Geometry. Cambridge University Press, Cambridge (2000)
4. Hartley, R.I.: Multilinear relationship between coordinates of corresponding image points and lines. In: Proc. International Workshop on Computer Vision and Applied Geometry (1995)
5. Heyden, A.: A common framework for multiple view tensors. In: Burkhardt, H.-J., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1406, pp. 3–19. Springer, Heidelberg (1998)
6. Heyden, A.: Tensorial properties of multiple view constraints. Mathematical Methods in the Applied Sciences 23, 169–202 (2000)
7. Shashua, A., Wolf, L.: Homography tensors: On algebraic entities that represent three views of static or moving planar points. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1842, Springer, Heidelberg (2000)
8. Wolf, L., Shashua, A.: On projection matrices P^k → P^2, k = 3, ..., 6, and their applications in computer vision. In: Proc. 8th International Conference on Computer Vision, vol. 1, pp. 412–419 (2001)
9. Hayakawa, K., Sato, J.: Multiple View Geometry in the Space-Time. In: Proc. Asian Conference on Computer Vision, pp. 437–446 (2006)
10. Wexler, Y., Shashua, A.: On the synthesis of dynamic scenes from reference views. In: Proc. Conference on Computer Vision and Pattern Recognition, pp. 576–581 (2000)
11. Hartley, R.I., Schaffalitzky, F.: Reconstruction from Projections using Grassman Tensors. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3021, pp. 363–375. Springer, Heidelberg (2004)
12. Sturm, P.: Multi-View Geometry for General Camera Models. In: Proc. Conference on Computer Vision and Pattern Recognition, pp. 206–212 (2005)
Visual Odometry for Non-overlapping Views Using Second-Order Cone Programming

Jae-Hak Kim¹, Richard Hartley¹, Jan-Michael Frahm², and Marc Pollefeys²

¹ Research School of Information Sciences and Engineering, The Australian National University; National ICT Australia, NICTA
² Department of Computer Science, University of North Carolina at Chapel Hill
Abstract. We present a solution for motion estimation for a set of cameras which are firmly mounted on a head unit and do not have overlapping views in each image. This problem relates to ego-motion estimation of multiple cameras, or visual odometry. We reduce motion estimation to solving a triangulation problem, which finds a point in space from multiple views. The optimal solution of the triangulation problem in the L-infinity norm is found using SOCP (Second-Order Cone Programming). Consequently, with the help of the optimal solution for the triangulation, we can solve visual odometry by using SOCP as well.
1 Introduction
Motion estimation of cameras, or pose estimation, mostly in the case of having overlapping points or tracks between views, has been studied in computer vision research for many years [1]. However, non-overlapping or slightly overlapping camera systems have not been studied so much, particularly the motion estimation problem. Non-overlapping views mean that the images captured with the cameras have no, or at most only a few, common points. There are potential applications for this camera system. For instance, we construct a cluster of multiple cameras which are firmly installed on a base unit such as a vehicle, and the cameras are positioned to look in different view directions. A panoramic or omnidirectional image can be obtained from images captured with a set of cameras with small overlap. Another example is a vehicle with cameras mounted on it to provide driving assistance such as side/rear view cameras. An important problem is visual odometry – how can we estimate the tracks of a vehicle and use this data to determine where the vehicle is placed. There has been prior research considering a set of many cameras moving together as one camera. In [2] an algebraic solution to the multiple camera motion problem is presented. Similar research on planetary rover operations has been conducted to estimate the motion of a rover on Mars and to keep track of the rover [3]. Other
NICTA is funded by the Australian Government’s Backing Australia’s Ability initiative, in part through the Australian Research Council.
research on visual odometry has been performed to estimate the motion of a stereo rig or a single camera [4]. Prior work on non-overlapping cameras includes most notably the paper [5]. This differs from our work in aligning independently computed tracks of the different cameras, whereas we compute a motion estimate using all the cameras at once. Finally, an earlier solution to the problem was proposed in unpublished work of [6], which may appear elsewhere. In this paper, we propose a solution to estimate the six degrees of freedom (DOFs) of the motion, three rotation parameters and three translation parameters (including scale), for a set of multiple cameras with non-overlapping views, based on L∞ triangulation. A main contribution of this paper is that we provide a well-founded geometric solution to motion estimation for non-overlapping multiple cameras.
2 Problem Formulation
Consider a set of n calibrated cameras with non-overlapping fields of view. Since the cameras are calibrated, we may assume that they are all oriented in the same way just to simplify the mathematics. This is easily done by multiplying an inverse of the rotation matrix to the original image coordinates. This being the case, we can also assume that they all have camera matrices originally equal to P_i = [I | −c_i]. We assume that all c_i are known. The cameras then undergo a common motion, described by a Euclidean matrix

M = \begin{bmatrix} R & -Rt \\ 0 & 1 \end{bmatrix}    (1)

where R is a rotation, and t is a translation of the set of cameras. Then, the i-th camera matrix changes to

P'_i = P_i M^{-1} = [I | -c_i] \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} = [R | t - c_i]

which is located at R(c_i − t). Suppose that we compute all the essential matrices of the cameras independently, then decompose them into rotation and translation. We observe that the rotations computed from all the essential matrices are the same. This is true only because all the cameras have the same orientation. We can average them to get an overall estimate of rotation. Then, we would like to compute the translation. As we will demonstrate, this is a triangulation problem.

Geometric concept. First, let us look at a geometric idea derived from this problem. An illustration of a motion of a set of cameras is shown in Figure 1. A bundle of cameras is moved by a rotation R and translation t. All cameras at c_i are moved to c'_i. The first camera at position c'_1 is a sum of the vectors c_i, c'_i − c_i and c'_1 − c'_i, where i = 1...3. Observing that the vector v_i in Figure 1 is the same as the vector c'_i − c_i and the vector c'_1 − c'_i is obtained by rotating
Fig. 1. A set of cameras is moved by a Euclidean motion of rotation R and translation t. The centre of the first camera c_1 is moved to c'_1 by the motion. The centre c'_1 is a common point where all translation direction vectors meet. The translation direction vectors are indicated as red, green and blue solid arrows, which are v_1, v_2 and v_3, respectively. Consequently, this is a triangulation problem.
the vector c_1 − c_i, the first camera at position c'_1 can be rewritten as a sum of three vectors c_i, R(c_1 − c_i) and v_i. Therefore, the three vectors v_i, the colored solid arrows in Figure 1, meet in one common point c'_1, the position of the centre of the first camera after the motion. It means that finding the motion of the set of cameras is the same as solving a triangulation problem for translation direction vectors derived from each view. Secondly, let us derive detailed equations on this problem from the geometric concept we have described above. Let E_i be the essential matrix for the i-th camera. From E_1, we can compute the translation vector of the first camera, P_1, in the usual way. This is a vector passing through the original position of the first camera. The final position of this camera must lie along this vector. Next, we use E_i, for i > 1, to estimate a vector along which the final position of the first camera can be found. Thus, for instance, we use E_2 to find the final position of P_1. This works as follows. The i-th essential matrix E_i decomposes into R_i = R and a translation vector v_i. In other words, E_i = R[v_i]_×. This means that the i-th camera moves to a point c_i + λ_i v_i, the value of λ_i being unknown. This point is the final position of each camera c'_i in Figure 1. We transfer this motion to determine the motion of the first camera. We consider the motion as taking place in two stages, first rotation, then translation. First the camera centre c_1 is rotated by R about point c_i to the point c_i + R(c_1 − c_i). Then it is translated in the direction v_i to the point c'_1 = c_i + R(c_1 − c_i) + λ_i v_i. Thus, we see that c'_1 lies on the line with direction vector v_i, based at the point c_i + R(c_1 − c_i).
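The camera-matrix update P'_i = P_i M^{-1} used above is easy to verify numerically; the sketch below builds M from Equation 1 and applies its inverse explicitly rather than relying on the closed form quoted in the text:

```python
import numpy as np

def move_cameras(camera_centres, R, t):
    """Apply the common motion M (Equation 1) to cameras P_i = [I | -c_i]
    and return the new 3x4 matrices P'_i = P_i M^{-1}."""
    M = np.eye(4)
    M[:3, :3] = R
    M[:3, 3] = -R @ np.asarray(t, dtype=float)
    M_inv = np.linalg.inv(M)
    new_cameras = []
    for c in camera_centres:
        P = np.hstack([np.eye(3), -np.asarray(c, dtype=float).reshape(3, 1)])
        new_cameras.append(P @ M_inv)
    return new_cameras
```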
In short, each essential matrix E_i constrains the final position of the first camera to lie along a line. These lines are not all the same; in fact, unless R = I, they are all different. The problem now comes down to finding the values of \lambda_i and c'_1 such that

c'_1 = c_i + R(c_1 - c_i) + \lambda_i v_i \quad \text{for } i = 1, \ldots, n.    (2)
Having found c'_1, we can get t from the equation c'_1 = R(c_1 - t).
Averaging Rotations. From the several cameras and their essential matrices E_i, we have several estimates R_i = R for the rotation of the camera rig. We determine the best estimate of R by averaging these rotations. This is done by representing each rotation R_i as a unit quaternion, computing the average of the quaternions and renormalizing to unit norm. Since a quaternion and its negative both represent the same rotation, it is important to choose consistently signed quaternions to represent the separate rotations R_i.
Algebraic derivations. Alternatively, it is possible to show an algebraic derivation of the equations as follows. Given P_i = [I | -c_i] and P'_i = [R | t - c_i], the essential matrix is written as

E_i = R\,[c_i + R^\top(t - c_i)]_\times = [R\,c_i + (t - c_i)]_\times R.    (3)
Considering that the decomposition of the essential matrix E_i is E_i = R_i[v_i]_\times = [R_i v_i]_\times R_i, we may get the rotation and translation from (3), namely R_i = R and \lambda_i R v_i = R c_i + (t - c_i). As a result, t = \lambda_i R v_i + c_i - R c_i, which is the same equation derived from the geometric concept.
A Triangulation Problem. Equation (2) gives us independent measurements of the position of the point c'_1. Denoting c_i + R(c_1 - c_i) by C_i, the point c'_1 must lie at the intersection of the lines C_i + \lambda_i v_i. In the presence of noise, these lines will not meet, so we need to find a good approximation to c'_1. Note that the points C_i and vectors v_i are known, having been computed from the known calibration of the camera geometry and the computed essential matrices E_i. The problem of estimating the best c'_1 is identical with the triangulation problem studied (among many places) in [7,8]. We adopt the approach of [7] of solving this under the L∞ norm. The derived solution is the point c'_1 that minimizes the maximum angle between c'_1 - C_i and the direction vector v_i, over all i. In the presence of noise, the point c'_1 will lie in the intersection of cones based at the vertices C_i, with axes defined by the direction vectors v_i. To formulate the triangulation problem, instead of c'_1 we write X for the final position of the first camera, where all translation directions derived from the essential matrices meet together. As we have explained in the previous section, in the presence of noise we have n cones, each one aligned with one of the translation directions. The desired point X lies in the overlap of all these cones, and finding this overlap region gives the solution we need in order to get the motion
of the cameras. Then, our original motion estimation problem is formulated as the following minimization problem:

\min_X \max_i \; \frac{\| (X - C_i) \times v_i \|}{(X - C_i)^\top v_i}    (4)

Note that the quotient is equal to \tan(\theta_i), where \theta_i is the angle between v_i and (X - C_i). This problem can be solved as a Second-Order Cone Program (SOCP) using a bisection algorithm [9].
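To make the triangulation step concrete, the following is a minimal numerical sketch of the simpler least-squares (L2) variant of this problem — the point closest to all the lines C_i + λ_i v_i — rather than the L∞ SOCP formulation of (4); the function name and the toy data are illustrative only.

```python
import numpy as np

def triangulate_l2(C, v):
    """Least-squares intersection of the lines X = C[i] + lambda_i * v[i].

    Minimizes sum_i || (I - v_i v_i^T)(X - C_i) ||^2, i.e. the sum of squared
    orthogonal distances from X to every line.
    """
    C = np.asarray(C, dtype=float)
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)   # unit direction vectors
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for Ci, vi in zip(C, v):
        P = np.eye(3) - np.outer(vi, vi)   # projector orthogonal to the line
        A += P
        b += P @ Ci
    return np.linalg.solve(A, b)

# toy example: three lines meeting (approximately) at [1, 2, 3]
X_true = np.array([1.0, 2.0, 3.0])
C = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 1.0], [0.0, 4.0, -2.0]])
v = X_true - C + 0.01 * np.random.randn(3, 3)          # noisy directions
print(triangulate_l2(C, v))
```

The L∞ solution of (4) replaces this closed-form linear solve with a bisection over SOCP feasibility problems, but the inputs (the base points C_i and directions v_i) are the same.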
3 Algorithm
The algorithm to estimate the motion of cameras having non-overlapping views is as follows.
Given
1. A set of cameras described in initial position by their known calibrated camera matrices P_i = R_i[I | -c_i]. The cameras then move to a second (unknown) position, described by camera matrices P'_i.
2. For each camera pair (P_i, P'_i), several point correspondences x_{ij} \leftrightarrow x'_{ij} (expressed in calibrated coordinates as homogeneous 3-vectors).
Objective: Find the motion matrix M of the form (1) that determines the common motion of the cameras, such that P'_i = P_i M^{-1}.
Algorithm
1. Normalize the image coordinates to calibrated image coordinates by setting \hat{x}_{ij} = R_i^{-1} x_{ij} and \hat{x}'_{ij} = R_i^{-1} x'_{ij}, then adjust to unit length by setting \hat{x}_{ij} \leftarrow \hat{x}_{ij}/\|\hat{x}_{ij}\| and \hat{x}'_{ij} \leftarrow \hat{x}'_{ij}/\|\hat{x}'_{ij}\|.
2. Compute each essential matrix E_i in terms of the correspondences \hat{x}_{ij} \leftrightarrow \hat{x}'_{ij} for the i-th camera.
3. Decompose each E_i as E_i = R_i[v_i]_\times and find the rotation R as the average of the rotations R_i (a sketch of this averaging step is given below).
4. Set C_i = c_i + R(c_1 - c_i). Solve the triangulation problem by finding the point X = c'_1 that (approximately, because of noise) satisfies the condition X = C_i + \lambda_i v_i for all i.
5. Compute t from t = c_1 - R^\top c'_1.
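The following is a minimal sketch of the rotation-averaging in step 3, assuming SciPy's Rotation class for the quaternion conversions; it is not the authors' implementation.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def average_rotations(R_list):
    """Average a list of 3x3 rotation matrices via unit quaternions.

    Quaternions q and -q encode the same rotation, so every quaternion is
    first flipped to the hemisphere of the first one before averaging,
    as required by the consistent-sign remark in the text.
    """
    quats = np.array([Rotation.from_matrix(R).as_quat() for R in R_list])
    ref = quats[0]
    for k in range(len(quats)):
        if np.dot(quats[k], ref) < 0.0:
            quats[k] = -quats[k]
    q_mean = quats.mean(axis=0)
    q_mean /= np.linalg.norm(q_mean)        # renormalize to unit norm
    return Rotation.from_quat(q_mean).as_matrix()
```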
In our current implementation, we have used the L∞ norm to solve the triangulation problem. Other methods of solving the triangulation problem may be used, for instance the optimal L2 triangulation method given in [8].
Critical Motion. The algorithm has a critical condition when the rotation is zero. If this is so, then all the base points C_i involved in the triangulation problem are the same. Thus, we encounter a triangulation problem with a zero baseline. In this case, the magnitude of the translation cannot be accurately determined.
4 Experiments
We have used SeDuMi and Yalmip toolbox for optimization of SOCP problems [10,11]. We used a five point solver to estimate the essential matrices [12,13]. We select the best five points from images using RANSAC to obtain an essential matrix, and then we improve the essential matrix by non-linear optimization. An alternative method for computing the essential matrix based on [14] was tried. This method gives the optimal essential matrix in L∞ norm. A comparison of the results for these two methods for computing Ei is given in Fig 6. Real data. We used Point Grey’s LadybugTM camera to generate some test data for our problem . This camera unit consists of six 1024×768 CCD color sensors with small overlap of their field of view. The six cameras, 6 sensors with 2.5 mm lenses, are closely packed on a head unit. Five CCDs are positioned in a horizontal ring around the head unit to capture side-view images, and one is located on the top of the head unit to take top-view images. Calibration information provided by Point Grey [15] is used to get intrinsic and relative extrinsic parameters of all six cameras. A piece of paper is positioned on the ground, and the camera is placed on the paper. Some books and objects are randomly located around the camera. The camera is moved manually while the positions of the camera at some points are marked on the paper as edges of the camera head unit. These marked edges on the paper are used to get the ground truth of relative motion of the camera for this experiment. The experimental setup is shown in Figure 2. A panoramic image stitched in our experimental setup is shown in Figure 3.
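As an illustration of the essential-matrix estimation step described above, here is a hedged sketch of a RANSAC loop around a minimal five-point solver. The `solver` argument and its interface are hypothetical placeholders for a real five-point implementation such as [12,13]; only the epipolar residual and the sampling loop are spelled out.

```python
import numpy as np

def sampson_residual(E, x1, x2):
    """Squared Sampson distance of calibrated correspondences (3xN) to E."""
    Ex1 = E @ x1
    Etx2 = E.T @ x2
    num = np.sum(x2 * Ex1, axis=0) ** 2
    den = Ex1[0]**2 + Ex1[1]**2 + Etx2[0]**2 + Etx2[1]**2
    return num / den

def ransac_essential(x1, x2, solver, iters=500, thresh=1e-4):
    """RANSAC over a minimal solver.  `solver` is assumed to return a list of
    candidate 3x3 essential matrices for 5 correspondences (hypothetical)."""
    best_E, best_inliers = None, 0
    n = x1.shape[1]
    for _ in range(iters):
        idx = np.random.choice(n, 5, replace=False)
        for E in solver(x1[:, idx], x2[:, idx]):
            inliers = int(np.sum(sampson_residual(E, x1, x2) < thresh))
            if inliers > best_inliers:
                best_E, best_inliers = E, inliers
    return best_E
```

In the paper's pipeline, the best hypothesis found this way is then refined by non-linear optimization before the decomposition of step 3.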
Fig. 2. An experimental setup of the LadybugTM camera on an A3 size paper surrounded by books. The camera is moved on the paper by hand, and the position of the camera at certain key frames is marked on the paper to provide the ground truth for the experiments.
Table 1. Experimental results of rotations at key frames 0, 30, 57 and 80, which correspond to the position numbers 0–3, respectively. For instance, the rotation pair (R0, R1) corresponds to the rotations at key frames 0 and 30. Angles of each rotation are represented in the axis-angle rotation representation.

Rotation pair   True axis   True angle   Estimated axis                     Estimated angle
(R0, R1)        [0 0 -1]    85.5°        [-0.008647 -0.015547  0.999842]    85.15°
(R0, R2)        [0 0 -1]    157.0°       [-0.022212 -0.008558  0.999717]    156.18°
(R0, R3)        [0 0 -1]    134.0°       [ 0.024939 -0.005637 -0.999673]    134.95°
Fig. 3. A panoramic image is obtained by stitching together all six images from the LadybugTM camera. This image is created by LadybugPro, the software provided by Point Grey Research Inc.
In the experiment, 139 frames of image are captured by each camera. Feature tracking is performed on the image sequence by the KLT (Kanade-Lucas-Tomasi) tracker [16]. Since there is lens distortion in the captured image, we correct the image coordinates of the feature tracks using lens distortion parameters provided by the Ladybug SDK library. The corrected image coordinates are used in all the equations we have derived. After that, we remove outliers from the feature tracks by the RANSAC (Random Sample Consensus) algorithm with a model of epipolar geometry in two view and trifocal tensors in three view [17]. There are key frames where we marked the positions of the camera. They are frames 0, 30, 57, 80, 110 and 138 in this experiment. The estimated path of the cameras over the frames is shown in Figure 4. After frame 80, the essential matrix result was badly estimated and subsequent estimation results were erroneous. A summary of the experimental results is shown in Tables 1 and 2. As can be seen, we have acquired a good estimation of rotations from frame 0 up to frame 80, within about one degree of accuracy. Adequate estimation of translations is reached up to frame 57 within less than 0.5 degrees. We have successfully tracked the motion of the camera through 57 frames. Somewhere between frame 57 and frame 80 an error occurred that invalidated the computation of the position of frame 80. Analysis indicates that this was due to a critical motion (near-zero rotation of the camera fixture) that made the translation estimation
Fig. 4. Estimated path of the LadybugTM camera viewed from (a) top and (b) front. The cameras numbered 0, 1, 2, 3, 4 and 5 are indicated as red, green, blue, cyan, magenta and black paths respectively.
Table 2. Experimental results for the translations between two key frames, given as the scale ratio of the two translation vectors and as the angle between them. The translation direction vector t0i is a vector from the centre of the camera at the starting position, frame 0, to the centre of the camera at position number i. For example, t01 is a vector from the centre of the camera at frame 0 to the centre of the camera at frame 30.

Translation pair   Scale ratio (true)   Scale ratio (estimated)   Angle (true)   Angle (estimated)
(t01, t02)         0.6757               0.7424                    28.5°          28.04°
(t01, t03)         0.4386               1.3406                    42.5°          84.01°
inaccurate. Therefore, we have plotted the frame-to-frame rotation angles in Figure 5-(a). As can be seen, there are frames for which the camera rotation was less than 5 degrees. This occurred for frames 57 to 62, 67 to 72 and 72 to 77. In Figure 5-(b), we have shown the difference between the ground truth and the estimated position of the cameras in this experiment. As can be seen, the position of the cameras is accurately estimated up to frame 57. However, the track went off at frame 80. A beneficial feature of our method is that we can detect such bad conditions for the estimation by looking at the angles between frames and the residual errors of the SOCP, and then try to use other frames for the estimation.
Fig. 5. The angles between pairs of frames used to estimate the motion are shown in (a). Note that a zero or near-zero rotation means a critical condition for estimating the motion of the cameras from the given frames. (b) Ground truth of positions (indicated as red lines) of the cameras with orientations at key frames 0, 30, 57 and 80, and estimated positions (indicated as black lines) of the cameras with their orientations at the same key frames. Orientations of the cameras are marked as blue arrows. Green lines are the estimated path through all 80 frames.
Fig. 6. Comparison of the angular error (in degrees) in the translation direction for two different methods of computing the essential matrix: (a) the L∞-optimal method and (b) the 5-point method.
5 Discussion
We have presented a solution to find the motion of cameras that are rigidly mounted and have minimally overlapping fields of view. This method works equally well for any number of cameras, not just two, and will therefore provide more accurate estimates than methods involving only pairs of cameras. The method requires a non-zero frame-to-frame rotation; probably because of this, the estimation of motion through a long image sequence eventually went off track. The method showed good estimation results in experiments with real-world data. However, the accumulated errors in processing long sequences
of images made the system produce bad estimations over long tracks. In the real experiments, we have found that a robust and accurate essential matrix estimation is a critical requirement to obtain correct motion estimation in this problem.
References
1. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004)
2. Pless, R.: Using many cameras as one. In: CVPR 2003, vol. II, pp. 587–593 (2003)
3. Cheng, Y., Maimone, M., Matthies, L.: Visual odometry on the Mars Exploration Rovers. In: Systems, Man and Cybernetics, 2005 IEEE International Conference (2005)
4. Nistér, D., Naroditsky, O., Bergen, J.: Visual odometry. In: Conf. Computer Vision and Pattern Recognition, pp. 652–659 (2004)
5. Caspi, Y., Irani, M.: Aligning non-overlapping sequences. International Journal of Computer Vision 48(1), 39–51 (2002)
6. Clipp, B., Kim, J.H., Frahm, J.M., Pollefeys, M., Hartley, R.: Robust 6dof motion estimation for non-overlapping, multi-camera systems. Technical Report TR07-006, Department of Computer Science, The University of North Carolina at Chapel Hill
7. Hartley, R., Schaffalitzky, F.: L∞ minimization in geometric reconstruction problems. In: Conf. Computer Vision and Pattern Recognition, Washington, DC, USA, vol. I, pp. 504–509 (2004)
8. Lu, F., Hartley, R.: A fast optimal algorithm for L2 triangulation. In: Asian Conf. Computer Vision (2007)
9. Kahl, F.: Multiple view geometry and the L∞-norm. In: Int. Conf. Computer Vision, Beijing, China, pp. 1002–1009 (2005)
10. Sturm, J.: Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, Special issue on Interior Point Methods (CD supplement with software) 11–12, 625–653 (1999)
11. Löfberg, J.: YALMIP: A toolbox for modeling and optimization in MATLAB. In: Proceedings of the CACSD Conference, Taipei, Taiwan (2004)
12. Stewénius, H., Engels, C., Nistér, D.: Recent developments on direct relative orientation. ISPRS Journal of Photogrammetry and Remote Sensing 60, 284–294 (2006)
13. Li, H., Hartley, R.: Five-point motion estimation made easy. In: ICPR (1), pp. 630–633. IEEE Computer Society Press, Los Alamitos (2006)
14. Hartley, R., Kahl, F.: Global optimization through searching rotation space and optimal estimation of the essential matrix. In: Int. Conf. Computer Vision (2007)
15. Point Grey Research Incorporated: Ladybug 2 camera (2006), http://www.ptgrey.com
16. Lucas, B., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: IJCAI 1981, pp. 674–679 (1981)
17. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981)
Pose Estimation from Circle or Parallel Lines in a Single Image Guanghui Wang1,2 , Q.M. Jonathan Wu1 , and Zhengqiao Ji1 1
Department of Electrical and Computer Engineering, The University of Windsor, 401 Sunset, Windsor, Ontario, Canada N9B 3P4 2 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, 100080, P.R. China
[email protected],
[email protected]
Abstract. The paper is focused on the problem of pose estimation from a single view in minimum conditions that can be obtained from images. Under the assumption of known intrinsic parameters, we propose and prove that the pose of the camera can be recovered uniquely in three situations: (a) the image of one circle with discriminable center; (b) the image of one circle with preassigned world frame; (c) the image of any two pairs of parallel lines. Compared with previous techniques, the proposed method does not need any 3D measurement of the circle or lines, thus the required conditions are easily satisfied in many scenarios. Extensive experiments are carried out to validate the proposed method.
1 Introduction
Determining the position and orientation of a camera from a single image with respect to a reference frame is a basic and important problem in robot vision field. There are many potential applications such as visual navigation, robot localization, object recognition, photogrammetry, visual surveillance and so on. During the past two decades, the problem was widely studied and many approaches have been proposed. One well known pose estimation problem is the perspective-n-point (PnP) problem, which was first proposed by Fishler and Bolles [5]. The problem is to find the pose of an object from the image of n points at known location on it. Following this idea, the problem was further studied by many researchers [6,8,9,15,14]. One of the major concerns of the PnP problem is the multi-solution phenomenon, all PnP problems for n ≤ 5 have multiple solutions. Thus we need further information to determine the correct solution [6]. Another kind of localization algorithm is based on line correspondences. Dhome et al. [4] proposed to compute the attitude of object from three line correspondences. Liu et al. [12] discussed some methods to recover the camera pose linearly or nonlinearly by using different combination of line and point features. Ansar and Daniilidis [1] presented a general framework which allows for a novel set of linear solutions to the pose estimation problem for both n points and n lines. Chen [2] proposed a polynomial approach to find close form solution for Y. Yagi et al. (Eds.): ACCV 2007, Part II, LNCS 4844, pp. 363–372, 2007. c Springer-Verlag Berlin Heidelberg 2007
pose determination from line-to-plane correspondences. The line based methods also suffer from the problem of multiple solutions. The above methods assume that the camera is calibrated and the positions of the points and lines are known. In practice, it may be hard to obtain the accurate measurements of these features in space. However, some geometrical constraints, such as coplanarity, parallelity and orthogonality, are abundant in many indoor and outdoor structured scenarios. Some researchers proposed to recover the camera pose from the image of a rectangle, two orthogonal parallel lines and some other scene constraints [7,18]. Circle is another very common pattern in man-made objects and scenes, many studies on camera calibration were based on the image of circles [10,11,13]. In this paper, we try to compute the camera’s pose from a single image based on geometrical configurations in the scene. Different from previous methods, we propose to use the image of only one circle, or the image of any two pairs of parallel lines that may not be coplanar or orthogonal. The proposed method is widely applicable since the conditions are easily satisfied in many scenarios.
2 Perspective Geometry and Pose Estimation
2.1 Camera Projection and Pose Estimation
Under perspective projection, a 3D point x ∈ R^3 in space is projected to an image point m ∈ R^2 via a rank-3 projection matrix P ∈ R^{3×4} as

s\tilde{m} = P\tilde{x} = K[R, t]\,\tilde{x} = K[r_1, r_2, r_3, t]\,\tilde{x}    (1)
where \tilde{x} = [x^\top, w]^\top and \tilde{m} = [m^\top, w]^\top are the homogeneous forms of the points x and m respectively, R and t are the rotation matrix and translation vector from the world system to the camera system, s is a non-zero scalar, and K is the camera calibration matrix. In this paper, we assume the camera is calibrated, thus we may set K = I_3 = diag(1, 1, 1), which is equivalent to normalizing the image coordinates by applying the transformation K^{-1}. In this case, the projection matrix is simplified to P = [R, t] = [r_1, r_2, r_3, t]. When all space points are coplanar, the mapping between the space points and their images can be modeled by a plane homography H, which is a nonsingular 3 × 3 homogeneous matrix. Without loss of generality, we may assume the coordinates of the space plane to be [0, 0, 1, 0]^\top for a specified world frame; then we have H = [r_1, r_2, t]. Obviously, the rotation matrix R and translation vector t can be factorized directly from the homography.
Proposition 1. When the camera is calibrated, the pose of the camera can be recovered from two orthogonal vanishing points in a single view.
Proof. Without loss of generality, let us set the X and Y axes of the world system in line with the two orthogonal directions. In the normalized world coordinate system, the directions of the X and Y axes are \tilde{x}_w = [1, 0, 0, 0]^\top and \tilde{y}_w = [0, 1, 0, 0]^\top
respectively, and the homogeneous vector of the world origin is \tilde{o}_w = [0, 0, 0, 1]^\top. Under perspective projection, we have:

s_x \tilde{v}_x = P\tilde{x}_w = [r_1, r_2, r_3, t][1, 0, 0, 0]^\top = r_1    (2)
s_y \tilde{v}_y = P\tilde{y}_w = [r_1, r_2, r_3, t][0, 1, 0, 0]^\top = r_2    (3)
s_o \tilde{v}_o = P\tilde{o}_w = [r_1, r_2, r_3, t][0, 0, 0, 1]^\top = t    (4)
Thus the rotation matrix can be computed from

r_1 = \pm \frac{\tilde{v}_x}{\|\tilde{v}_x\|}, \quad r_2 = \pm \frac{\tilde{v}_y}{\|\tilde{v}_y\|}, \quad r_3 = r_1 \times r_2    (5)
where the rotation matrix R = [r_1, r_2, r_3] may have four solutions if a right-handed coordinate system is adopted. Only two of them ensure that the reconstructed objects lie in front of the camera and may thus be seen by it. In practice, if the world coordinate frame is preassigned, the rotation matrix may be uniquely determined [19]. Since we have no metric information of the given scene, the translation vector can only be defined up to scale as t ≈ v_o. That is to say, we can only recover the direction of the translation vector. In practice, the orthonormal constraint should be enforced during the computation, since r_1 and r_2 in (5) may not be orthogonal due to image noise. Suppose the SVD decomposition of R_{12} = [r_1, r_2] is U\Sigma V^\top, where \Sigma is a 3 × 2 matrix made of the two singular values of R_{12}. Then we may obtain the best approximation to the rotation matrix in the least-squares sense from R_{12} = U \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix} V^\top, since a rotation matrix should have unit singular values.
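A minimal sketch of Proposition 1 in code, assuming calibrated (normalized) homogeneous image coordinates for the two vanishing points and the imaged world origin; the sign ambiguity discussed above is not resolved here.

```python
import numpy as np

def pose_from_vanishing_points(vx, vy, vo):
    """Rotation and translation direction from the vanishing points of the
    X and Y axes and the image of the world origin, cf. (2)-(5).

    The orthonormal constraint is enforced with an SVD as described above
    (singular values of [r1 r2] forced to one).
    """
    r1 = vx / np.linalg.norm(vx)
    r2 = vy / np.linalg.norm(vy)
    U, _, Vt = np.linalg.svd(np.stack([r1, r2], axis=1), full_matrices=False)
    R12 = U @ Vt                    # closest matrix with orthonormal columns
    r1, r2 = R12[:, 0], R12[:, 1]
    R = np.column_stack([r1, r2, np.cross(r1, r2)])
    t_dir = vo / np.linalg.norm(vo)  # translation known only up to scale
    return R, t_dir
```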
2.2 The Circular Points and Pose Estimation
The absolute conic (AC) is a conic on the ideal plane, which can be expressed in matrix form as \Omega_\infty = diag(1, 1, 1). Obviously, \Omega_\infty is composed of purely imaginary points on the infinite plane. Under perspective projection, we can obtain the image of the absolute conic (IAC) as \omega_a = (KK^\top)^{-1}, which depends only on the camera calibration matrix K. The IAC is an invisible imaginary point conic in an image. It is easy to verify that the absolute conic intersects the ideal line at two ideal complex conjugate points, which are called the circular points. The circular points can be expressed in canonical form as I = [1, i, 0, 0]^\top, J = [1, -i, 0, 0]^\top. Under perspective projection, their images can be expressed as:

s_i \tilde{m}_i = P\,I = [r_1, r_2, r_3, t][1, i, 0, 0]^\top = r_1 + i\,r_2    (6)
s_j \tilde{m}_j = P\,J = [r_1, r_2, r_3, t][1, -i, 0, 0]^\top = r_1 - i\,r_2    (7)
Thus the imaged circular points (ICPs) are a pair of complex conjugate points, whose real and imaginary parts are defined by the first two columns of the rotation matrix. However, the rotation matrix can not be determined uniquely from the ICPs since (6) and (7) are defined up to scales.
Proposition 2. Suppose m_i and m_j are the ICPs of a space plane, and the world system is set on the plane. Then the pose of the camera can be uniquely determined from m_i and m_j if one direction of the world frame is preassigned.
Proof. It is easy to verify that the line passing through the two imaged circular points is real; it is the vanishing line of the plane and can be computed from l_\infty = m_i \times m_j. Suppose o_x is the image of one axis of the preassigned world frame; its vanishing point v_x can be computed from the intersection of the line o_x with l_\infty. If the vanishing point v_y of the Y direction is recovered, the camera pose can be determined accordingly from Proposition 1. Since the vanishing points of two orthogonal directions are conjugate with respect to the IAC, v_y can be easily computed from the system v_x^\top \omega v_y = 0, l_\infty^\top v_y = 0. On the other hand, since two orthogonal vanishing points are harmonic with respect to the ICPs, their cross ratio satisfies Cross(v_x v_y; m_i m_j) = -1. Thus v_y can also be computed from the cross ratio.
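A small sketch of the first way of computing v_y: since v_y must satisfy the two linear constraints v_x^T ω v_y = 0 and l_∞^T v_y = 0, its homogeneous solution is simply the cross product of the two constraint vectors. The function below assumes ω and the inputs are given as numpy arrays (homogeneous 3-vectors, symmetric 3×3 IAC).

```python
import numpy as np

def second_vanishing_point(omega, vx, l_inf):
    """Vanishing point of the direction orthogonal to vx on the plane.

    vy is orthogonal (as a 3-vector) to both omega @ vx and l_inf, hence
    proportional to their cross product.
    """
    vy = np.cross(omega @ vx, l_inf)
    return vy / vy[-1] if abs(vy[-1]) > 1e-12 else vy
```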
3 Methods for Pose Estimation
3.1 Pose Estimation from the Image of a Circle
Lemma 1. Any circle Ωc in a space plane π intersects the absolute conic Ω∞ at exactly two points, which are the circular points of the plane. Without loss of generality, let us set the XOY world frame on the supporting plane. Then any circle on the plane can be modelled in homogeneous form as (x − wx0 )2 + (y − wy0 )2 − w2 r2 = 0. The plane π intersects the ideal plane π∞ at the vanishing line L∞ . In the extended plane of the complex domain, L∞ has at most two intersections with Ωc . It is easy to verify that the circular points are the intersections. Lemma 2. The image of the circle Ωc intersects the IAC at four complex points, which can be divided into two pairs of complex conjugate points. Under perspective projection, any circle Ωc on space plane is imaged as a conic ωc = H−T Ωc H−1 , which is an ellipse in nondegenerate case. The absolute conic is projected to the IAC. Both the IAC and ωc are conics of second order that can be written in homogeneous form as xT ωc x = 0. According to B´ezout’s theorem, the two conics have four imaginary intersection points since the absolute conic and the circle have no real intersections in space. Suppose the complex point [a + bi] is one intersection, it is easy to verify that the conjugate point [a − bi] is also a solution. Thus the four intersections can be divided into two complex conjugate pairs. It is obvious that one pair of them is the ICPs, but the ambiguity can not be solved in the image with one circle. If there are two or more circles on the same or parallel space plane, the ICPs can be uniquely determined since the imaged circular points are the common intersections of each circle with the IAC in the image. However, we may have only one circle in many situations, then how to determine the ICPs in this case?
Proposition 3. The imaged circular points can be uniquely determined from the image of one circle if the center of the circle can be detected in the image.
Proof. As shown in Fig.1, the image of the circle \omega_c intersects the IAC at two pairs of complex conjugate points m_i, m_j and m'_i, m'_j. Let us define two lines as

l_\infty = m_i \times m_j, \quad l'_\infty = m'_i \times m'_j    (8)
then one of the lines must be the vanishing line and the two supporting points must be the ICPs. Suppose o_c is the image of the circle center and l_\infty is the vanishing line; then there is a pole-polar relationship between the center image o_c and the vanishing line with respect to the conic:

\lambda l_\infty = \omega_c\, o_c    (9)
where \lambda is a scalar. Thus the true vanishing line and the imaged circular points can be determined from (9). Under perspective projection, a circle is transformed into a conic. However, the center of the circle in space usually does not project to the center of the corresponding conic in the image, since the perspective projection (1) is not a linear mapping from the space to the image. Thus the imaged center of the circle cannot be determined from the contour of the imaged conic alone. There are several possible ways to recover the projected center of the circle by virtue of more geometrical information, such as by two or more lines passing through the center [13] or by two concentric circles [10,11].
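The pole-polar relation (9) suggests a simple way of selecting the true vanishing line among the two candidate lines of (8) once the imaged circle centre is available; the following sketch (with assumed numpy inputs: the conic ω_c as a symmetric 3×3 matrix and homogeneous 3-vectors) picks the candidate closest in direction to the polar of o_c.

```python
import numpy as np

def true_vanishing_line(omega_c, o_c, l_a, l_b):
    """Select the vanishing line among two candidates via the pole-polar
    relation (9): the polar of the imaged circle centre o_c with respect to
    the conic omega_c is the vanishing line, up to scale."""
    polar = omega_c @ o_c
    def angle(l):
        c = abs(np.dot(l, polar)) / (np.linalg.norm(l) * np.linalg.norm(polar))
        return np.arccos(np.clip(c, -1.0, 1.0))
    return l_a if angle(l_a) < angle(l_b) else l_b
```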
Fig. 1. Determining the ICPs from the image of one circle. (a) a circle and preassigned world frame in space; (b) the imaged conic of the circle.
Proposition 4. The imaged circular points can be recovered from the image of one circle with a preassigned world coordinate system.
Proof. As shown in Fig.1, suppose the lines x and y are the images of the two axes of the preassigned world frame; the two lines intersect l_\infty and l'_\infty at four points. Since the two ICPs and the two orthogonal vanishing points form a harmonic relation, the true ICPs can be determined by verifying the cross ratio of the
two pairs of quadruple collinear points {m_i, m_j, v_x, v_y} and {m'_i, m'_j, v'_x, v'_y}. Then the camera pose can be computed according to Proposition 2.
3.2 Pose Estimation from Two Pairs of Parallel Lines
Proposition 5. The pose of the camera can be recovered from the image of any two general pairs of parallel lines in the space.
Proof. As shown in Fig.2, suppose L11, L12 and L21, L22 are two pairs of parallel lines in the space; they may not be coplanar or orthogonal. Their images l11, l12 and l21, l22 intersect at v_1 and v_2 respectively; then v_1 and v_2 must be the vanishing points of the two directions, and the line connecting the two points must be the vanishing line l_\infty. Thus m_i and m_j can be computed from the intersections of l_\infty with the IAC. Suppose v_1 is one direction of the world frame and o_2 is the image of the world origin. Then the vanishing point v_1^\perp of the direction that is orthogonal to v_1 can be easily computed either from the system v_1^\top \omega v_1^\perp = 0, l_\infty^\top v_1^\perp = 0, or from the cross ratio Cross(v_1 v_1^\perp; m_i m_j) = -1, and the pose of the camera can be recovered from Proposition 1. Specifically, the angle \alpha between the two pairs of parallel lines in the space can be recovered from

\cos\alpha = \frac{v_1^\top \omega_a v_2}{\sqrt{v_1^\top \omega_a v_1}\,\sqrt{v_2^\top \omega_a v_2}}.

If the two pairs of lines are orthogonal with each other, then we have v_1^\perp = v_2.
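A small sketch of the angle recovery used in this proof, assuming homogeneous image lines and the IAC ω_a = (KK^T)^{-1} as numpy arrays; the vanishing points are obtained as line intersections, exactly as described above.

```python
import numpy as np

def lines_to_vanishing_point(l1, l2):
    """Intersection of two image lines given as homogeneous 3-vectors."""
    return np.cross(l1, l2)

def angle_between_directions(v1, v2, omega_a):
    """Angle (degrees) between two space directions from their vanishing
    points and the IAC, using the cosine formula above."""
    num = v1 @ omega_a @ v2
    den = np.sqrt(v1 @ omega_a @ v1) * np.sqrt(v2 @ omega_a @ v2)
    return np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0)))
```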
Fig. 2. Pose estimation from two pairs of parallel lines. Left: two pairs of parallel lines in the space; Right: the image of the parallel lines.
3.3 Projection Matrix and 3D Reconstruction
After retrieving the pose of the camera, the projection matrix with respect to the world frame can be computed from (1). With the projection matrix, any geometry primitive in the image can be back-projected into the space. For example, a point in the image is back-projected to a line, a line is back-projected to a plane and a conic is back-projected to a cone. Based on the scene constraints, many geometrical entities, such as the length ratios, angles, 3D information of some planar surfaces, can be recovered via the technique of single view metrology [3,17,18]. Therefore the 3D structure of some simple objects and scenarios can be reconstructed only from a single image.
4 Experiments with Simulated Data
During simulations, we generated a circle and two orthogonal pairs of parallel lines in the space, whose size and position in the world system are shown in Fig.3. Each line is composed of 50 evenly distributed points, and the circle is composed of 100 evenly distributed points. The camera parameters were set as follows: focal length f_u = f_v = 1800, skew s = 0, principal point u_0 = v_0 = 0, rotation axis r = [0.717, -0.359, -0.598], rotation angle α = 0.84, translation vector t = [2, -2, 100]. The image resolution was set to 600 × 600 and Gaussian image noise was added to each imaged point. The generated image with 1-pixel Gaussian noise is shown in Fig.3.
Fig. 3. The synthetic scenario and image for simulation
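For reference, a sketch that reproduces the kind of synthetic projections described above, using the parameter values quoted in the text; the exact axis-angle convention of the authors is an assumption, as is the use of SciPy for the rotation.

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Camera parameters quoted in the text (assumed axis-angle convention).
fu = fv = 1800.0
K = np.diag([fu, fv, 1.0])                       # zero skew, principal point (0, 0)
axis = np.array([0.717, -0.359, -0.598])
R = Rotation.from_rotvec(0.84 * axis / np.linalg.norm(axis)).as_matrix()
t = np.array([2.0, -2.0, 100.0])

def project(X, noise=1.0):
    """Project Nx3 world points with x = K[R, t]X and add Gaussian noise."""
    Xc = X @ R.T + t                 # points in the camera frame
    x = Xc @ K.T
    x = x[:, :2] / x[:, 2:3]         # perspective division
    return x + noise * np.random.randn(*x.shape)
```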
In the experiments, the image lines and the imaged conic were fitted via least squares. We set L11 and L21 as the X and Y axes of the world frame, and recover the ICPs and camera pose according to the proposed methods. Here we only give the result of the recovered rotation matrix. For the convenience of comparison, we decomposed the rotation matrix into the rotation axis and rotation angle; we define the error of the axis as the angle between the recovered axis and the ground truth, and define the error of the rotation angle as the absolute error of the recovered angle with respect to the ground truth. We varied the noise level from 0 to 3 pixels with a step of 0.5 during the test, and took 200 independent tests at each noise level so as to obtain more statistically meaningful results. The mean and
Fig. 4. The mean and standard deviation of the errors of the rotation axis and rotation angle with respect to the noise levels
standard deviation for the two methods are shown in Fig.4. It is clear that the accuracy of the two methods is comparable at small noise levels (< 1.5 pixels), while the vanishing-point based method (Alg.2) is superior to the circle based one (Alg.1) at large noise levels.
5 Tests with Real Images
All images in the tests were captured by a Canon PowerShot G3 with a resolution of 1024 × 768. The camera was pre-calibrated via Zhang's method [20]. Test on the tea box image: For this test, the selected world frame, two pairs of parallel lines and the two conics detected by the Hough transform are shown in Fig.5. The line segments were detected and fitted via the orthogonal regression algorithm [16]. We recovered the rotation axis, rotation angle (unit: rad) and translation vector by the two methods as shown in Table 1, where the translation vector is normalized so that \|t\| = 1. The results are reasonable given the imaging conditions, though we do not have the ground truth.
Fig. 5. Test results of the tea box image. Upper: the image and the detected conics and parallel lines and world frame for pose estimation; Lower: the reconstructed tea box model at different viewpoints with texture mapping.
In order to give a further evaluation of the recovered parameters, we reconstructed the 3D structure of the scene from the recovered projection matrix via the method in [17]. The result is shown from different viewpoints in Fig.5. We manually took the measurements of the tea box and the grid in the background and registered the reconstruction to the ground truth. Then we computed the relative error E1 of the side length of the grid, and the relative errors E2, E3 of the diameter and height of the circle. As listed in Table 1, we can see that the reconstruction error is very small. The results verify the accuracy of the recovered parameters in return. Test on the book image: The image with the detected conic, the preassigned world frame and two pairs of parallel lines is shown in Fig.6. We recovered the
Table 1. Test results and performance evaluations for real images

Images  Method  Raxis                        Rangle   t                   E1(%)   E2(%)   E3(%)
Box     Alg.1   [-0.9746, 0.1867, -0.1238]   2.4385   [-0.08,0.13,0.98]   0.219   0.327   0.286
Box     Alg.2   [-0.9748, 0.1864, -0.1228]   2.4354   [-0.08,0.13,0.98]   0.284   0.315   0.329
Book    Alg.1   [-0.9173, 0.3452, -0.1984]   2.2811   [-0.02,0.09,0.99]   0.372   0.365   0.633
Book    Alg.2   [-0.9188, 0.3460, -0.1899]   2.3163   [-0.02,0.09,0.99]   0.306   0.449   0.547
pose of the camera by the proposed methods, then computed the relative errors E1, E2 and E3 of the three side lengths of the book with respect to the ground truth taken manually. The results are shown in Table 1. The reconstructed 3D structure of the book is shown in Fig.6. The results are realistic, with good accuracy.
Fig. 6. Pose estimation and 3D reconstruction of the book image
6 Conclusion
In this paper, we proposed and proved the possibility of recovering the pose of the camera from a single image of one circle or of two general pairs of parallel lines. Compared with previous techniques, fewer conditions are required by the proposed method. Thus the results in the paper may find wide applications. Since the method utilizes minimal information in the computation, it is important to adopt robust techniques to fit the conics and lines.
Acknowledgment The work is supported in part by the Canada Research Chair program and the National Natural Science Foundation of China under grant no. 60575015.
References
1. Ansar, A., Daniilidis, K.: Linear pose estimation from points or lines. IEEE Trans. Pattern Anal. Mach. Intell. 25(5), 578–589 (2003)
2. Chen, H.H.: Pose determination from line-to-plane correspondences: Existence condition and closed-form solutions. IEEE Trans. Pattern Anal. Mach. Intell. 13(6), 530–541 (1991)
3. Criminisi, A., Reid, I., Zisserman, A.: Single view metrology. International Journal of Computer Vision 40(2), 123–148 (2000)
4. Dhome, M., Richetin, M., Lapreste, J.T.: Determination of the attitude of 3D objects from a single perspective view. IEEE Trans. Pattern Anal. Mach. Intell. 11(12), 1265–1278 (1989)
5. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24(6), 381–395 (1981)
6. Gao, X.S., Tang, J.: On the probability of the number of solutions for the P4P problem. J. Math. Imaging Vis. 25(1), 79–86 (2006)
7. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004)
8. Horaud, R., Conio, B., Leboulleux, O., Lacolle, B.: An analytic solution for the perspective 4-point problem. CVGIP 47(1), 33–44 (1989)
9. Hu, Z.Y., Wu, F.C.: A note on the number of solutions of the noncoplanar P4P problem. IEEE Trans. Pattern Anal. Mach. Intell. 24(4), 550–555 (2002)
10. Jiang, G., Quan, L.: Detection of concentric circles for camera calibration. In: Proc. of ICCV, pp. 333–340 (2005)
11. Kim, J.S., Gurdjos, P., Kweon, I.S.: Geometric and algebraic constraints of projected concentric circles and their applications to camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(4), 637–642 (2005)
12. Liu, Y., Huang, T.S., Faugeras, O.D.: Determination of camera location from 2-D to 3-D line and point correspondences. IEEE Trans. Pattern Anal. Mach. Intell. 12(1), 28–37 (1990)
13. Meng, X., Li, H., Hu, Z.: A new easy camera calibration technique based on circular points. In: Proc. of BMVC (2000)
14. Nistér, D., Stewénius, H.: A minimal solution to the generalised 3-point pose problem. J. Math. Imaging Vis. 27(1), 67–79 (2007)
15. Quan, L., Lan, Z.: Linear n-point camera pose determination. IEEE Trans. Pattern Anal. Mach. Intell. 21(8), 774–780 (1999)
16. Schmid, C., Zisserman, A.: Automatic line matching across views. In: Proc. of CVPR, pp. 666–671 (1997)
17. Wang, G.H., Hu, Z.Y., Wu, F.C., Tsui, H.T.: Single view metrology from scene constraints. Image Vision Comput. 23(9), 831–840 (2005)
18. Wang, G.H., Tsui, H.T., Hu, Z.Y., Wu, F.C.: Camera calibration and 3D reconstruction from a single view based on scene constraints. Image Vision Comput. 23(3), 311–323 (2005)
19. Wang, G.H., Wang, S., Gao, X., Li, Y.: Three dimensional reconstruction of structured scenes based on vanishing points. In: Proc. of PCM, pp. 935–942 (2006)
20. Zhang, Z.: A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 22(11), 1330–1334 (2000)
An Occupancy–Depth Generative Model of Multi-view Images
Pau Gargallo, Peter Sturm, and Sergi Pujades
INRIA Rhône-Alpes and Laboratoire Jean Kuntzmann, France
[email protected]
Abstract. This paper presents an occupancy based generative model of stereo and multi-view stereo images. In this model, the space is divided into empty and occupied regions. The depth of a pixel is naturally determined from the occupancy as the depth of the first occupied point in its viewing ray. The color of a pixel corresponds to the color of this 3D point. This model has two theoretical advantages. First, unlike other occupancy based models, it explicitly models the deterministic relationship between occupancy and depth and, thus, it correctly handles occlusions. Second, unlike depth based approaches, determining depth from the occupancy automatically ensures the coherence of the resulting depth maps. Experimental results computing the MAP of the model using message passing techniques are presented to show the applicability of the model.
1 Introduction
Extracting 3D information from multiple images is one of the central problems of computer vision. It has applications to photorealistic 3D reconstruction, image based rendering, tracking and robotics, among others. Many successful methods exist for each application, but a satisfactory general formalism is still to be defined. In this paper we present a simple probabilistic generative model of multi-view images that accurately defines the natural relationship between the shape and color of the 3D world and the observed images. The model is constructed with the philosophy that if one is able to describe the image formation process with a model, then Bayesian inference can be used to invert the process and recover information about the model from the images. The approach yields a generic, widely applicable formalism. The price to pay for such generality is that the results for each specific application will be poor compared to those of more specialized techniques. There are mainly two approaches to the stereo and multi-view stereo problems. In small-baseline situations, the 3D world is represented by a depth map on a reference image and computing depth is regarded as a matching problem [1]. In wide-baseline situations, it is often more convenient to represent the shape of the objects by a surface or an occupancy function [2]. Depth and occupancy are
obviously highly related, but most of the algorithms concentrate on finding one of the two. The main problem with either approach is occlusions. The fact that a 3D point is not always visible from all the cameras makes the extraction of 3D information from images hard. The two main approaches to solve this issue are to treat occluded points as outliers [3] or to explicitly model the geometrical reason for the occlusion. Making an accurate generative model for multi-view images, as we wish to do, necessarily involves modeling occlusions geometrically, because geometric occlusions really exist in the true image formation process. In fact, geometric occlusions are so important that there are reconstruction techniques, like those based on the visual hull, that use only the occlusion information [4,5]. Geometric occlusions can be modeled effectively in depth based approaches by computing a depth map for every input image [6,7,8]. This requires adding constraints so that the multiple depth maps are coherent and form a single surface. These constraints are not necessary in shape based approaches, which implicitly incorporate them because they compute a single model for all the images. Shape based approaches usually deal with geometric occlusions in an alternating way [9,10,11]. They first compute the visibility given the current estimate of the shape, and then modify the shape according to some criteria. This procedure disregards the fact that the visibility will change while modifying the shape. A voxel carving technique carving inside the object, or a shrinking surface evolution, are consequences of this oversight. Our recent work [12] avoids these problems by explicitly taking into account the visibility changes during the surface evolution. This is nothing but the continuous version of the model presented in this paper. The model presented in this paper explicitly characterizes the relationship between depth and shape and profits from the benefits of both worlds. The shape occupancy automatically gives coherence to the depth maps. Properly deriving the depth maps from the occupancy implicitly encodes the geometric occlusions. Our model is closely related, both in objective and approach, to the recent work of Hernández et al. [13]. In that work, depth cues are probabilistically integrated for inferring occupancy. The principal difference between their model and the one presented here is that they make some independence assumptions that we do not. In particular, they assume the depths of different pixels to be independent. This greatly simplifies the inference and good results are achieved. In our model, as in the real world, depth is determined by the occupancy and therefore the depths of different pixels are not independent. This creates a huge, very loopy factor graph representation of the joint probability of occupancy and depth. Inference in such a graph is hard, as we will see in the experiments.
2 The Model
This section presents the occupancy-depth model. We first introduce the random variables involved in the generative process. Then we decompose their joint probability distribution into simpler terms and give a form to each of them.
2.1 Occupancy, Depth and Color Variables
Consider a discretization of the 3D space in a finite set of sites S ⊂ R3 . A given site x ∈ S, can be in the free space or inside an object. This defines the occupancy of the site that will be represented by a binary random variable ux (1 meaning occupied and 0 meaning free). The occupancy of the whole space will be represented by the random process u : S → {0, 1}, which defines the shape of the objects in the space. The shape of the objects is not enough to generate images. Their appearance is also needed. In the simplest case, under the constant brightness assumption, the appearance can be represented by a color at each point on the surface of the objects. As we do not know the shape of the objects right now, we will need to define the color of all sites in S, even if only the color of the sites lying on the surface is relevant. The color will be represented by a random process C : S → R3 . The depth of a pixel is defined as the distance (measured along the camera’s z-axis) between the optical center of the camera and the 3D point observed at the pixel. Given the occupancy of the space and the position and calibration of a camera i, the depth Dpi of a pixel p is determined as the depth of the first occupied point in its viewing ray. The observed color at that pixel will be denoted by Ipi . This color should ideally correspond to the color of the site observed at that pixel. i.e. the point of the viewing ray of p which is at depth Dpi . 2.2
Decomposition
Having defined all the variables – depths, colors, occupancies and observed colors – we will now define their joint probability distribution. To do so, we first decompose the distribution into terms representing the natural dependence between the variables. One can think of this step as defining the way the data (the images) were generated. The proposed decomposition is

p(u, C, D, I) = p(u)\,p(C|u)\,\prod_{i,p} p(D_p^i | u)\,\prod_{i,p} p(I_p^i | D_p^i, C).    (1)
Fig. 1. Bayesian network representation of the joint probability decomposition
It is represented graphically in Figure 1. Each term of the decomposition corresponds to a variable and, thus, to a node of the network. The arrows represent the statistical dependencies between the variables – in other words, the order that one has to follow to generate random samples of the model from scratch. Therefore, the data generation process is as follows. First one builds the objects of the world by generating an occupancy function. Then one paints them by choosing the space colors. Finally, one takes pictures of the generated world: one first determines which points are visible from the camera by computing the depth of the pixels, and then sets the color of the pixels to be the color of the observed 3D points. In the following sections, we define each of the terms of the decomposition (1).
2.3 World Priors
Not all the possible occupancy functions are equally likely a priori. One expects the occupied points to be gathered together forming objects, rather than randomly scattered over the 3D space. To represent such a belief, we choose the occupancy u to follow a Markov Random Field distribution. This gives the following prior,

p(u) \propto \exp\Big( -\sum_{x,y} \psi(u_x, u_y) \Big)    (2)
where the sum extends over all the neighboring points (x, y) in a grid discretization S of the 3D space. The energy potentials are of the form \psi(u_x, u_y) = \alpha |u_x - u_y|, so that they penalize neighboring sites of different occupancies by a cost \alpha. This prior is isotropic in the sense that two neighboring points are equally likely to have the same occupancy regardless of their position and color. From experience, we know that the discontinuities or edges in images often correspond to discontinuities in the occupancy (the occluding contours). Therefore, one could be tempted to use the input images to derive a smoothing prior for the occupancy that is weaker at the points projecting to image discontinuities. While effective, this would not be correct from a Bayesian point of view, as one would be using the data to derive a prior for the model. We will now see how to obtain this anisotropic smoothing effect in a more theoretically well founded way. In the proposed decomposition, the prior on the color of the space depends on the occupancy. This makes it possible to express the following idea. Two neighboring points that are both either occupied or free are likely to have similar colors. The colors of two points with different occupancies are not necessarily related. This can be expressed by the MRF distribution

p(C|u) \propto \exp\Big( -\sum_{x,y} \phi(C_x, C_y, u_x, u_y) \Big)    (3)

with

\phi(C_x, C_y, u_x, u_y) = \begin{cases} \rho(C_x - C_y) & \text{if } u_x = u_y \\ 0 & \text{otherwise} \end{cases}    (4)
where \rho is some robust penalty function that penalizes the difference of the colors of neighboring points with the same occupancy. Now, combining the prior on the occupancy with the prior on the color, we have

p(u, C) \propto \exp\Big( -\sum_{x,y} \bar{\psi}(C_x, C_y, u_x, u_y) \Big)    (5)

with

\bar{\psi}(C_x, C_y, u_x, u_y) = \begin{cases} \rho(C_x - C_y) & \text{if } u_x = u_y \\ \alpha & \text{otherwise.} \end{cases}    (6)
If we are given the color of the space, then p(u|C) ∝ p(u, C) is a color driven smoothing prior on the occupancy. Neighboring points with the same color are more likely to have the same occupancy than neighboring points with different colors. As the color will be estimated from the images, the color discontinuities will coincide with the edges in the images. Thus, this term will represent our experience based knowledge that object borders coincide with image edges. 2.4
Pixel Likelihood
The color I_p^i observed at a pixel should be equal to the color of the 3D point visible at that pixel, up to the sensor noise and other unmodeled effects, e.g. specularities. If we denote the color of the observed 3D point as C(D_p^i), we have

p(I_p^i | D_p^i, C) \propto \exp\big( -\rho( I_p^i - C(D_p^i) ) \big)    (7)
where ρ is some robust penalty function.
Fig. 2. Bayesian network of the model associated to a single viewing ray. The occupancy variables u_i along the viewing ray determine the depth d. The color of the image I corresponds to the color C at depth d.
Note that unlike traditional stereo algorithms, here there are no occlusions to be taken into account by the function ρ. We are matching the color of a pixel with the color of the observed scene point, not with the color of pixels in the other images. The observed scene point is, by definition, non-occluded, so no occlusion problem appears here.
Depth Marginalization. The likelihood (7) of a pixel depends only on its depth and not on the whole occupancy function. However, the relationship between occupancy and depth is simple and deterministic. Therefore, it is easy to marginalize out the depth and express the likelihood directly in terms of the occupancy of the points on the viewing ray of the pixel. To simplify the notation we will do the computations for a single pixel. Figure 2 shows the Bayesian network associated to a single pixel. The points of its viewing ray will be denoted by the natural numbers {0, 1, 2, ...}, ordered by increasing distance to the camera. Their occupancy is a vector u such that u_i is the occupancy of the i-th point in the viewing ray. The depth will be denoted by d. With this language, the probability of a depth given the occupancy is

p(d|u) = \prod_{i<d} (1 - u_i)\; u_d    (8)
This equation encodes the fact that if d is the depth of a pixel, then the occupancy of the points with lower depths must be zero and the occupancy of the point at depth d must be one. It is easy to check that p(d|u) is 1 only when d is the actual depth determined by u, and 0 otherwise. Now, if we denote the likelihood of the pixel having a depth d by L(d) = p(I_p^i | D_p^i = d, C) and the likelihood of a pixel given the occupancy u by L(u), then

L(u) = p(I|u, C) = \sum_d p(I|d, C)\, p(d|u) = \sum_d L(d) \prod_{i<d} (1 - u_i)\; u_d.    (9)
Note that the summand is null for all depths d except for the one that corresponds to the occupancy u.
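A small sketch of equations (8) and (9) for a single viewing ray; the sample spacing, the handling of fully empty rays and the variable names are illustrative assumptions.

```python
import numpy as np

def ray_likelihood(u, L):
    """Evaluate L(u) of (9) for one viewing ray.

    u : binary occupancy of the samples along the ray (near to far).
    L : L[d] = likelihood of the pixel if the observed point is at depth d.
    Only the first occupied sample contributes, which is exactly the depth
    determined by the occupancy via (8).
    """
    u = np.asarray(u, dtype=float)
    free_before = np.concatenate(([1.0], np.cumprod(1.0 - u)[:-1]))  # prod_{i<d}(1-u_i)
    p_d = free_before * u                 # p(d|u) from (8)
    return float(np.dot(p_d, L))          # sum_d L(d) p(d|u)

# example: the ray is free until sample 3, which is the first occupied one
u = [0, 0, 0, 1, 1, 0]
L = [0.1, 0.1, 0.2, 0.9, 0.3, 0.1]
print(ray_likelihood(u, L))               # -> 0.9
```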
3 Inference
The last section presented the generative occupancy–depth model by defining the joint probability of occupancy, color, depth and image pixels. In this section we present an algorithm for inverting the process and recovering occupancy from multiple images. Given a set of observed images {I^i}, the goal is to find the posterior probability of occupancy and color, p(u, C|I). In a tracking application, for example, one may be interested in computing the occupancy marginals at each point in the 3D space. This can be used as input for a volumetric 3D tracker. Alternatively, in a novel view synthesis application one may be more interested in finding the most probable world in order to render it from other points of view. As we will see, these are difficult tasks, challenging the most recent inference algorithms. The main problem is the interdependence of the occupancies of the points in a viewing ray, which creates high-order cliques, in addition to the extreme loopiness of the network. We present here a first attempt at solving the inference problem by using EM and message passing.
The optimization is done by alternating between the optimization of the occupancy and the color. In the E-step, the probabilities of the occupancies are computed using message passing. Depending on the goal, the sum-product or the max-product algorithm is used. In the M-step, the color estimation is improved by maximizing its expected log-posterior. It is interesting to note that simpler optimization techniques like iterative conditional modes will lead to algorithms strongly similar to voxel carving or surface evolution. In this case, one will start with an initial guess of the occupancy and will try to change the occupancy of the voxels at the border of the objects in order to improve the posterior. The message passing algorithm presented below is smarter in the sense that it will make decisions about the occupancy of a whole viewing ray at a time. 3.1
Factor Graph Representation
In the E-step, for a fixed coloring of the space C, the posterior probability p(u|I, C) must be computed (or maximized). This distribution can be represented as a factor graph. Again, for notational simplicity, we detail here the factor graph corresponding to a single pixel,

p(u|I, C) \propto L(u) \prod_{ij} \exp(-\bar{\psi}(u_i, u_j)).    (10)
Figure 3 shows the factor graph for a single pixel and sketches the one corresponding to a pair of pixels in different images.
Fig. 3. Factor graph representation of the distribution for a single pixel (left) and for two pixels with intersecting viewing rays (right)
The graph contains two types of factors. The likelihoods of the pixels L are huge factors connecting all the points in the viewing ray of each pixel. The smoothing potentials \exp(-\bar{\psi}) are standard pairwise factors. Notice the extreme loopiness of the complete graph, where the viewing rays of different pixels intersect each other.

3.2 Message Passing
Inference will be done by message passing in the factor graph. Messages will be sent from the likelihood and smoothing factors to the occupancy variables and
vice-versa. One can visualize this process as a dialog between the images and the occupancy. Each image tells the occupancy what to be. The occupancy gathers all the messages and replies to each image with a summary of the messages sent by the other images. The process continues until a global agreement is found. We derive here the reweighted message passing equations [14,15] for the occupancy factor graph. The equations are presented for the sum-product algorithm, but the max-product equivalents can be obtained easily by replacing all the sums in the equations by maximums. Using the notation of Minka [14], the posterior probability of the occupancy p(u|I, C) will be approximated by a fully factorized variational distribution q(u) = \prod_i q_i(u_i). The belief q_i of an occupancy variable u_i is the product of the messages that it receives. That is
q_i(u_i) = \prod_{a \in N(i)} m_{a \to i}(u_i),   (11)
where N(i) are the indices of the factors connected to u_i. The message that a factor f_a sends to a variable is given by

m_{a \to i}(u_i)^{\alpha_a} = \sum_{u_a \setminus u_i} f_a(u_a)^{\alpha_a} \prod_{j \in N(a) \setminus i} n_{j \to a}(u_j),   (12)
where u_a is the set of variables connected to the factor f_a and N(a) their indices. \alpha_a is the weight of the factor and it is a parameter of the message passing algorithm. Finally, the replying message from a variable to a factor is

n_{i \to a}(u_i) = q_i(u_i)\, m_{a \to i}(u_i)^{-\alpha_a}.   (13)
The pair-wise smoothing potentials are simple and a direct implementation of these formulas is straightforward. The likelihood factors, however, need more attention. These factors link a lot of variables. A direct implementation of the equations above would involve computing sums over all the possible configurations of occupancy along each viewing ray. The number of configurations grows exponentially with the number of sites and quickly becomes intractable. Luckily, the likelihood factor L is simple enough that the sum can be simplified analytically. The message that a pixel likelihood factor sends to the occupancy of the grid point i is

m_{L \to u_i}(u_i)^{\alpha_L} = \sum_{u \setminus u_i} L(u)^{\alpha_L} \prod_{j \neq i} n_{j \to L}(u_j).   (14)
Substituting L(u) from equation (9) we get

\sum_d \sum_{u \setminus u_i} L(d)^{\alpha_L}\, u_d \prod_{j<d} (1 - u_j) \prod_{j \neq i} n_{j \to L}(u_j).   (15)
And now we can split and simplify the sum over d into three sums, with d smaller than, equal to and greater than i respectively,

\sum_{d<i} L(d)^{\alpha_L} \prod_{j<d} n_{j \to L}(0)\; n_{d \to L}(1) \prod_{j>d,\, j \neq i} \sum_{u_j} n_{j \to L}(u_j)
+ u_i\, L(i)^{\alpha_L} \prod_{j<i} n_{j \to L}(0) \prod_{j>i} \sum_{u_j} n_{j \to L}(u_j)
+ (1 - u_i) \sum_{d>i} L(d)^{\alpha_L} \prod_{j<d,\, j \neq i} n_{j \to L}(0)\; n_{d \to L}(1) \prod_{j>d} \sum_{u_j} n_{j \to L}(u_j).   (16)
Finally, defining

\tau(d) = L(d)^{\alpha_L} \prod_{j<d} n_{u_j \to L}(0)\; n_{u_d \to L}(1) \prod_{j>d} \sum_{u_j} n_{u_j \to L}(u_j),   (17)
we have

m_{L \to u_i}(u_i)^{\alpha_L} = \frac{1}{\sum_{u_i} n_{u_i \to L}(u_i)} \sum_{d<i} \tau(d) + \frac{u_i}{n_{u_i \to L}(1)} \tau(i) + \frac{1 - u_i}{n_{u_i \to L}(0)} \sum_{d>i} \tau(d),   (18)

which can be computed in linear time.
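A minimal sketch of how this linear-time computation can be organized for one viewing ray, assuming binary occupancy variables and arrays of incoming messages n_{j→L}(0), n_{j→L}(1); function and variable names are illustrative, not the authors' implementation.

```python
import numpy as np

def likelihood_messages(L_d, n0, n1, alpha=1.0):
    """Messages from a pixel-likelihood factor to the ray's occupancy variables.

    L_d    : per-depth likelihoods L(d) for the D points on the ray.
    n0, n1 : incoming messages n_{u_j->L}(0), n_{u_j->L}(1) for each point j.
    Returns m_{L->u_i}(0)^alpha and m_{L->u_i}(1)^alpha, following (17)-(18).
    """
    s = n0 + n1                                                     # sum over u_j of n_j(u_j)
    pref0 = np.concatenate(([1.0], np.cumprod(n0[:-1])))            # prod_{j<d} n_j(0)
    suff_s = np.concatenate((np.cumprod(s[::-1])[::-1][1:], [1.0])) # prod_{j>d} s_j
    tau = (L_d ** alpha) * pref0 * n1 * suff_s                      # eq. (17)

    csum = np.concatenate(([0.0], np.cumsum(tau)))
    below = csum[:-1]                # sum_{d<i} tau(d)
    above = csum[-1] - csum[1:]      # sum_{d>i} tau(d)

    m0 = below / s + above / n0      # terms of eq. (18) that survive for u_i = 0
    m1 = below / s + tau / n1        # terms of eq. (18) that survive for u_i = 1
    return m0, m1
```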
3.3 Color Estimation
In the M-step the color of the space is computed by maximizing the expected log-posterior,

\langle \ln p(u, C|I) \rangle_q = \langle \ln p(I|u, C) \rangle_q + \langle \ln p(u, C) \rangle_q + \text{const.},   (19)

where \langle \cdot \rangle_q denotes the expectation with respect to u assuming that it follows the variational distribution q. Again, the expectation is a sum over all possible occupancy configurations. Simplifications similar to the ones done above yield

\langle \ln p(I|u, C) \rangle_q = - \sum_d \prod_{i<d} q_i(0)\; q_d(1)\; \rho\big(I - C(d)\big)   (20)
and

\langle \ln p(u, C) \rangle_q = - \sum_{ij} \Big[ \big(q_i(0)\, q_j(0) + q_i(1)\, q_j(1)\big)\, \psi(C_i - C_j) + \big(q_i(0)\, q_j(1) + q_i(1)\, q_j(0)\big)\, \alpha \Big].   (21)
The optimization is done by a gradient descent procedure. The initialization of the whole EM procedure is as follows. The messages are all initialized to uniform distributions. The color is initialized at each site by computing the mean of the color of its projections to the input images.
Fig. 4. Results for the single pixel (top-left), the multi-view (top-right) and stereo pair (bottom) experiments
4 Experimental Validation
In order to validate the generative model and to test the possibility of inverting the process using the inference method described above, we performed three types of experiments. The first experiment was done to test the performance of the reweighted message passing algorithm. A single pixel camera model was used and different likelihood functions L(d) were manually chosen. The ground truth posterior of the occupancy was computed by sampling. The first image of Figure 4 shows an example result. In general, both the sum-product and the max-product performed poorly, only roughly approximating the actual marginals and MAPs. Despite this poor performance, we tested the algorithm with real images with the hope that the interconnections between different viewing rays would constrain the problem and possibly give better results. The last row of Figure 4 shows the results obtained for the four Middlebury stereo pairs [1]. The displayed depth maps were computed from the inferred occupancy of frontoparallel layers placed in front of the cameras. The results are generally correct, but large errors are present, especially in the textureless regions, where holes appear. A possible explanation for these holes is that, as we are smoothing the occupancy of the layers (which are mostly empty), the likelihood term in the textureless regions is too weak to prevent the smoothing from erasing any occupancy on the layer. The last test was done with wide angle multi-view stereo data, to test the exact same algorithm in a very different setup. Figure 4 shows preliminary results of the inference for the dataset dinoSparseRing [2]. The shape of the dinosaur is globally recovered, but here also there are numerous holes in the object. We did not perform a numerical evaluation of the results because the scores would be dominated by the errors described above and, therefore, uninteresting.
5 Conclusion
We presented a generative model of multi-view images, developed an inference algorithm to compute occupancy from images, and tested it experimentally. The results of the experiments are not comparable to the ones obtained by state-of-the-art techniques, but served well to validate the model. Future work is needed to find better inference methods and to determine to what degree the errors are due to the inference method or to the generative model itself.
References
1. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vision 47(1-3), 7–42 (2002)
2. Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: Computer Vision and Pattern Recognition 2006, Washington, DC, USA, pp. 519–528 (2006)
3. Strecha, C., Fransens, R., Gool, L.V.: Combined depth and outlier estimation in multi-view stereo. In: Computer Vision and Pattern Recognition 2006, Washington, pp. 2394–2401. IEEE Computer Society Press, Los Alamitos (2006)
4. Laurentini, A.: The visual hull concept for silhouette-based image understanding. IEEE Trans. Pattern Anal. Mach. Intell. 16(2), 150–162 (1994)
5. Solem, J.E., Kahl, F., Heyden, A.: Visibility constrained surface evolution. In: Computer Vision and Pattern Recognition 2005, San Diego, USA, pp. 892–899 (2005)
6. Kang, S.B., Szeliski, R., Chai, J.: Handling occlusions in dense multi-view stereo. In: Computer Vision and Pattern Recognition 2001, Kauai, Hawaii, pp. 103–110 (2001)
7. Kolmogorov, V., Zabih, R.: Multi-camera scene reconstruction via graph cuts. In: European Conference on Computer Vision 2002, London, UK, pp. 82–96 (2002)
8. Gargallo, P., Sturm, P.: Bayesian 3D modeling from images using multiple depth maps. In: Computer Vision and Pattern Recognition 2005, San Diego, vol. 2, pp. 885–891 (2005)
9. Faugeras, O.D., Keriven, R.: Complete dense stereovision using level set methods. In: Burkhardt, H., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1407, pp. 379–393. Springer, Heidelberg (1998)
10. Kutulakos, K.N., Seitz, S.M.: A theory of shape by space carving. Int. J. Comput. Vision 38(3), 199–218 (2000)
11. Paris, S., Sillion, F.X., Quan, L.: A surface reconstruction method using global graph cut optimization. Int. J. Comput. Vision 66(2), 141–161 (2006)
12. Gargallo, P., Prados, E., Sturm, P.: Minimizing the reprojection error in surface reconstruction from images. In: Proceedings of the International Conference on Computer Vision, Rio de Janeiro, Brazil. IEEE Computer Society Press, Los Alamitos (2007)
13. Hernández, C., Vogiatzis, G., Cipolla, R.: Probabilistic visibility for multi-view stereo. In: Computer Vision and Pattern Recognition 2007, Minneapolis (2007)
14. Minka, T.: Divergence measures and message passing. Technical report, Microsoft Research (2005)
15. Wainwright, M., Jaakkola, T., Willsky, A.: MAP estimation via agreement on trees: message-passing and linear programming. IEEE Transactions on Information Theory 51, 3697–3717 (2005)
Image Correspondence from Motion Subspace Constraint and Epipolar Constraint

Shigeki Sugimoto^1, Hidekazu Takahashi^2, and Masatoshi Okutomi^1

^1 Department of Mechanical and Control Engineering, Graduate School of Science and Engineering, Tokyo Institute of Technology, 2-12-1 O-okayama, Meguro-ku, Tokyo, 152-8550 Japan
{shige,hidekazu,mxo}@ok.ctrl.titech.ac.jp
^2 Hidekazu Takahashi is currently with Diesel Inj. Eng. Dept. 1, DENSO Corporation, 1-1 Showa-cho, Kariya-shi, Aichi, 448-8661 Japan
hidekazu [email protected]
Abstract. In this paper, we propose a novel method for inferring image correspondences on a pair of synchronized image sequences. In the proposed method, after tracking the feature points in each image sequence over several frames, we solve the image correspondence problem from two types of geometrical constraints: (1) the motion subspace obtained from the tracked feature points of a target sequence, and (2) the epipolar constraints between the two cameras. Unlike conventional correspondence estimation based on image matching using pixel values, the proposed approach enables us to obtain correspondences even for feature points that can be seen from one camera view but cannot be seen (occluded or outside the view) from the other camera. The validity of our method is demonstrated through experiments using synthetic and real images.
1 Introduction
Image correspondence estimation is fundamental to various vision applications, such as object/scene recognition, 3D structure/motion from sequential images, and stereo vision. In the computer vision literature, correspondences are in general obtained from pixel values: each correspondence is estimated by evaluating similarity measures computed from image pixel values [1,2,3,4]. The pixel-based approach is usually effective and efficient for tracking feature points in an image sequence, since neighboring image frames in the sequence are similar to each other, and there is always a high similarity for every pair of corresponding points. On the other hand, in the case of wide-baseline views, the pixel-based approach becomes inefficient or fails, even when the epipolar geometry between the two images is known, because of (i) a small overlapping view (or no overlapping view), (ii) occlusions, or (iii) dissimilarities between the pairs of corresponding points. Especially in the former two cases, in principle, the pixel-based approach cannot obtain the correspondences of scene points visible in one view but invisible in the other.
Fig. 1. Correspondence between two wide-baseline images. We consider the image correspondences as a track correspondence, while we don’t care whether the track in the target sequence is visible or not.
Our goal in this paper is to obtain image correspondences on each pair of wide-baseline synchronized image sequences, as illustrated by Fig. 1. Given the effectiveness of the pixel-based approach for tracking feature points in an image sequence, we consider the image correspondence problem as a track correspondence problem: we infer an unknown feature track (image points in all frames) in a target image sequence, which corresponds to one of the tracks of visible feature points in the other sequence, while we do not care whether the unknown track in the target sequence is visible or not. Our basic idea for solving this problem is as follows. Under the assumption that distinct feature points in both sequences are tracked separately over several frames, and that all feature points belong to one rigid object, the unknown track in the target sequence should be constrained to the motion subspace [5] derived from the collection of visible tracks in the target sequence. Also, under the assumption that the two wide-baseline camera views are weakly calibrated, and that the fundamental matrix is consistent while the images are captured, every image point of the unknown track should be constrained by the fundamental equation with a visible point in the other sequence. In this paper, we show that the correspondence problem can be algebraically solved from the equations formed by the two types of constraints mentioned above. In addition, we present an optional way of refining the correspondences via 3D reconstruction, provided that the two views are strongly calibrated (both intrinsic and extrinsic parameters are known). The validity of the proposed method is demonstrated through experiments using synthetic and real images.
2 Related Work
The motion subspace theorem is a fundamental aspect of factorization methods [6,7] for structure from motion (SFM), and is widely used in motion segmentation techniques [5,8,9]. Our proposed method presents another use of the theorem. There have been a few papers [10,11,12] which present solutions for similar correspondence problems. If the two views are strongly calibrated, feature tracks between two neighboring images give the structure and the motion of the scene [13] up to an unknown scale factor. In [10], the scale factor is recovered from the known extrinsic parameters of the two camera views. In this case, the image correspondences can be estimated from the 3D reconstruction results. Instead of recovering the scale factor, Dornaika et al. [11] show that each corresponding point in the target sequence is given as the intersection of two epipolar lines, derived from the motion between two neighboring images and the known extrinsic parameters of the two views. However, the estimation of the motion between two neighboring frames is very sensitive to tracking errors [14], and is more roundabout than our method for obtaining the correspondences. On the other hand, Ho et al. [12,15] show that the point collection of all tracks of the feature points in both sequences belongs to a four dimensional subspace. The correspondences between the two views are inferred from the basis of the subspace. However, for computing the basis, it is required to obtain more than four track correspondences between the two views. Our method has the advantage that the unknown correspondences are obtained even in an extreme case where no correspondence is given beforehand, such as when one view sees one side of the object while the other sees another side.
3 Image Correspondence from Two Geometric Constraints
We track moving feature points over K images of two synchronized image sequences separately. Let M and N be the number of tracked feature points in the target and the other sequence, respectively. Let u_{km} = (u_{km}, v_{km})^T be the image coordinates of the m-th feature point in the k-th image of the target sequence, and u'_{kn} = (u'_{kn}, v'_{kn})^T be the image coordinates of the n-th feature point in the k-th image of the other. Our goal is to infer the unknown image coordinates in the target sequence, u_{kn} = (u_{kn}, v_{kn})^T, (k = 1, \cdots, K), (n = 1, \cdots, N), each of which corresponds to u'_{kn}.

3.1 Motion Subspace Constraint
We collect the image coordinates of the m-th feature point in the target sequence into a single vector:

p_m = [u_{1m}, v_{1m}, u_{2m}, v_{2m}, \ldots, u_{Km}, v_{Km}]^T.   (1)
Then the 2K dimensional vectors of all feature points are collected into a single matrix:

W = [p_1, p_2, \ldots, p_M]^T.   (2)

Subtracting a mean vector p_G from each row vector gives:

\bar{W} = [\bar{p}_1, \bar{p}_2, \ldots, \bar{p}_M]^T,   (3)

where \bar{p}_m = p_m - p_G, \quad p_G = \frac{1}{M} \sum_{m=1}^{M} p_m.   (4)
Let \lambda_1 \leq \cdots \leq \lambda_{2K} be the eigenvalues of M = \bar{W}\bar{W}^T, and \{e_1, \cdots, e_{2K}\} be the orthonormal basis of the corresponding eigenvectors. If affine camera projection is assumed, p_m should belong to a three dimensional affine subspace, the so-called 'motion subspace' [8]:

p_m = p_G + \alpha_{1m} e_1 + \alpha_{2m} e_2 + \alpha_{3m} e_3,   (5)

where \alpha_{1m}, \alpha_{2m}, and \alpha_{3m} denote the coefficients of the basis vectors. In general, camera projection cannot be expressed by affine projection. For more general camera projection, we exploit a higher dimensional motion subspace:

p_m = p_G + \alpha_{1m} e_1 + \alpha_{2m} e_2 + \cdots + \alpha_{\nu m} e_\nu, \quad \text{where } 3 \leq \nu \leq 2M.   (6)
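A minimal sketch of how the subspace basis {p_G, e_1, ..., e_ν} can be estimated in practice; it uses an SVD of the centered track matrix and takes the ν dominant directions (numerically equivalent to the eigen-decomposition above). The function name and interface are illustrative, not the authors' implementation.

```python
import numpy as np

def motion_subspace(W, nu=3):
    """Estimate the motion subspace basis from tracked points.

    W  : M x 2K matrix whose m-th row is the track vector p_m of eq. (1).
    nu : dimension of the affine subspace (3 for an affine camera, eq. (5)).
    Returns the mean track p_G and the nu dominant basis vectors e_1..e_nu.
    """
    p_G = W.mean(axis=0)                    # centroid of the tracks, eq. (4)
    W_bar = W - p_G                         # centered track matrix, eq. (3)
    # The dominant right singular vectors of W_bar span the motion subspace
    # in track space (equivalently, dominant eigenvectors of W_bar^T W_bar).
    _, _, Vt = np.linalg.svd(W_bar, full_matrices=False)
    E = Vt[:nu].T                           # 2K x nu, columns are e_1..e_nu
    return p_G, E
```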
If all feature points in both sequences belong to a single moving object, the 2K dimensional vector of an unknown track,

p_n = [u_{1n}, v_{1n}, u_{2n}, v_{2n}, \ldots, u_{Kn}, v_{Kn}]^T,   (7)

which corresponds to one of the feature tracks in the other sequence, should belong to the motion subspace spanned by \{p_G, e_1, e_2, \cdots, e_\nu\}. We refer to these constraints as the motion subspace constraints.

3.2 Inferring Wide-Baseline Correspondence
We can write the motion subspace constraints in the form:

p_n = p_G + \beta_{1n} e_1 + \beta_{2n} e_2 + \cdots + \beta_{\nu n} e_\nu,   (8)

where the basis of the subspace, \{p_G, e_1, e_2, \ldots, e_\nu\}, is given by the tracks of visible feature points in the target sequence. Consequently, the unknowns are \beta_{1n}, \beta_{2n}, \cdots, \beta_{\nu n}. On the other hand, corresponding image points in the wide-baseline views must satisfy the epipolar constraint between the two views. Therefore we can write:

(u'_{kn}, v'_{kn}, 1)\, F \begin{pmatrix} u_{kn} \\ v_{kn} \\ 1 \end{pmatrix} = 0, \quad \text{where } F = \begin{pmatrix} f_{11} & f_{12} & f_{13} \\ f_{21} & f_{22} & f_{23} \\ f_{31} & f_{32} & f_{33} \end{pmatrix}.   (9)
Therein F denotes the fundamental matrix of the two views. We extract the k-th frame from (8).
\begin{pmatrix} u_{kn} \\ v_{kn} \end{pmatrix} = p_G^{(k)} + \beta_{1n} e_1^{(k)} + \beta_{2n} e_2^{(k)} + \cdots + \beta_{\nu n} e_\nu^{(k)},   (10)

where p_G^{(k)}, e_1^{(k)}, \cdots, e_\nu^{(k)} denote the 2×1 vectors composed of the (2k − 1)-th and 2k-th row elements of p_G, e_1, \cdots, e_\nu, respectively. Substituting (10) into (9) gives the equation of the epipolar constraint with respect to the n-th feature point in the k-th image frame, as follows:

\tilde{u}'^T_{kn} \left\{ f_1 \left[ p_G^{(k)}\; e_1^{(k)} \cdots e_\nu^{(k)} \right] \begin{pmatrix} 1 \\ \beta_{1n} \\ \vdots \\ \beta_{\nu n} \end{pmatrix} + f_2 \right\} = 0,   (11)

where

\tilde{u}'_{kn} = (u'_{kn}, v'_{kn}, 1)^T, \quad f_1 = \begin{pmatrix} f_{11} & f_{12} \\ f_{21} & f_{22} \\ f_{31} & f_{32} \end{pmatrix}, \quad f_2 = \begin{pmatrix} f_{13} \\ f_{23} \\ f_{33} \end{pmatrix}.
Then we collect Eq. (11) for all k, (k = 1, \cdots, K), as follows:

M_n^T \left\{ F_1 \left[ p_G\; e_1 \cdots e_\nu \right] \begin{pmatrix} 1 \\ \beta_{1n} \\ \vdots \\ \beta_{\nu n} \end{pmatrix} + F_2 \right\} = 0,   (12)

where

M_n^T = \begin{pmatrix} \tilde{u}'^T_{1n} & & & 0 \\ & \tilde{u}'^T_{2n} & & \\ & & \ddots & \\ 0 & & & \tilde{u}'^T_{Kn} \end{pmatrix}, \quad F_1 = \begin{pmatrix} f_1 & & & 0 \\ & f_1 & & \\ & & \ddots & \\ 0 & & & f_1 \end{pmatrix}, \quad F_2 = \begin{pmatrix} f_2 \\ f_2 \\ \vdots \\ f_2 \end{pmatrix}.
Finally we obtain:

A_n x_n = b_n,   (13)

where

x_n = (\beta_{1n}, \cdots, \beta_{\nu n})^T, \quad A_n = M_n^T F_1 \left[ e_1 \cdots e_\nu \right], \quad b_n = -M_n^T F_1 p_G - M_n^T F_2.

Note that Eq. (13) has K equations in \nu unknowns, \beta_{1n}, \cdots, \beta_{\nu n}. Therefore we can compute the unknowns if we choose a value of \nu between 3 and K. Consequently, we can obtain all track correspondences between the two wide-baseline views by solving Eq. (13) for all n.
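The following sketch illustrates how (11)–(13) can be assembled and solved in least squares for one unknown track; it assumes the convention ũ'ᵀ F ũ = 0 of eq. (9), and the names are illustrative rather than the authors' implementation.

```python
import numpy as np

def infer_track(F, p_G, E, u_other):
    """Infer one unknown target track from eqs. (8)-(13).

    F       : 3x3 fundamental matrix in the convention u'^T F u = 0.
    p_G, E  : mean track (2K,) and basis matrix (2K x nu) of the motion subspace.
    u_other : K x 2 array of the feature's image points u'_kn in the other sequence.
    Returns the K x 2 array of inferred image points u_kn in the target sequence.
    """
    K, nu = u_other.shape[0], E.shape[1]
    f1, f2 = F[:, :2], F[:, 2]                          # blocks of F as in eq. (11)

    A = np.zeros((K, nu))
    b = np.zeros(K)
    for k in range(K):
        u_tilde = np.array([u_other[k, 0], u_other[k, 1], 1.0])   # ~u'_kn
        g = u_tilde @ f1                                 # row vector u'^T f_1
        # Pk = [p_G^(k)  e_1^(k) ... e_nu^(k)], a 2 x (nu+1) matrix
        Pk = np.hstack((p_G[2 * k:2 * k + 2, None], E[2 * k:2 * k + 2, :]))
        row = g @ Pk                                     # coefficients of [1, beta_1..beta_nu]
        A[k] = row[1:]
        b[k] = -(row[0] + u_tilde @ f2)

    beta, *_ = np.linalg.lstsq(A, b, rcond=None)         # solve eq. (13)
    track = p_G + E @ beta                               # eq. (8)
    return track.reshape(K, 2)
```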
4 Correspondence Refinement Via 3D Reconstruction
In this section, we present an optional way of refining the correspondences via 3D reconstruction, provided that the two views are strongly calibrated (both intrinsic and extrinsic parameters are known). Once the target tracks that correspond to the visible tracks in the other sequence are obtained, the intrinsic and extrinsic parameters of the two views provide the 3D position of every point at every frame. Let X_{kn} = (x_{kn}, y_{kn}, z_{kn})^T be the world coordinates of the n-th feature point in the k-th frame. We can write [8]:

X_{kn} = (x_{kn}, y_{kn}, z_{kn})^T = \tau_k + a_n i_k + b_n j_k + c_n k_k,   (14)

where \tau_k and \{i_k, j_k, k_k\} denote, respectively, the object origin and orthonormal basis in the k-th image frame, and \{a_n, b_n, c_n\} denotes the object coordinates of the n-th point. Collecting X_{kn} for all k, (k = 1, \cdots, K), gives:

X_n = (X_{1n}^T, X_{2n}^T, \cdots, X_{Kn}^T)^T = \tau + a_n i + b_n j + c_n k,   (15)

where

\tau = (\tau_1^T, \cdots, \tau_K^T)^T, \quad i = (i_1^T, \cdots, i_K^T)^T, \quad j = (j_1^T, \cdots, j_K^T)^T, \quad k = (k_1^T, \cdots, k_K^T)^T.   (16)

Therefore the collection of (15) for (n = 1, \cdots, N),

X = (X_1\; X_2 \cdots X_N),   (17)

should belong to a three dimensional affine subspace. Subtracting the mean vector X_G gives:

\bar{X} = (\bar{X}_1\; \bar{X}_2 \cdots \bar{X}_N),   (18)

where \bar{X}_n = X_n - X_G, \quad X_G = \frac{1}{N} \sum_{n=1}^{N} X_n.   (19)
This means that the 3D positions estimated from the track correspondences can be refined by fitting the collection \bar{X} to a three dimensional subspace. Consequently, we can refine the correspondences by re-projecting the refined 3D positions into the images. Additionally, it is possible to determine the number of basis vectors, \nu, by evaluating the residuals (the sums of the squared distances of the data points to the fitted subspaces) resulting from the 3D position fittings with different values of \nu. In our experiments, we study the effect of varying \nu on the fitting residual.
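A minimal sketch of this refinement step (fitting the reconstructed tracks to a three-dimensional affine subspace by SVD; the reprojection into the images is omitted, and the interface is illustrative):

```python
import numpy as np

def refine_tracks(X):
    """Refine reconstructed 3D tracks by fitting a 3D affine subspace.

    X : 3K x N matrix; column n stacks the 3D positions X_1n..X_Kn of point n.
    Returns the refined positions, projected onto the fitted three-dimensional
    affine subspace of eqs. (15)-(19).
    """
    X_G = X.mean(axis=1, keepdims=True)     # mean track X_G, eq. (19)
    X_bar = X - X_G                          # centered collection, eq. (18)
    U, _, _ = np.linalg.svd(X_bar, full_matrices=False)
    B = U[:, :3]                             # spans {i, j, k} up to a linear change
    return X_G + B @ (B.T @ X_bar)           # orthogonal projection onto the subspace
```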
Fig. 2. Camera and object configuration in simulation

Fig. 3. Sphere motion (in millimeters; the 1st, 50th and 100th frames and the trajectory are shown)
5 Experimental Result
In this section, we show the validity of the proposed method using synthetic and real images.

5.1 Simulation
Our setup of two cameras and an object is shown in Fig. 2, where the view angle of each camera was 7.8 degrees and the focal length was 35 millimeters. The object was a sphere with a 1.5 meter radius. Fig. 3 shows the object motion in our synthetic image sequences. We arranged 200 feature points on the sphere, and recorded the tracks of 71 and 77 feature points over 100 frames in the left and right sequences, respectively. We added zero-mean Gaussian noise of standard deviation σ (in pixels) to the image positions of the feature tracks. Fig. 4 shows the results of the proposed method described in Sec. 3, when ν = 3, σ = 1, and the left and right sequences were, respectively, the target and the other sequence. The red points '+' denote the visible points in the 50th frame of the left image, and the green points '×' denote the corresponding points inferred from the visible points in the right frame. We can easily see that the object shape is a sphere, and that some inferred points correspond to points visible in the left sequence. Fig. 5 shows the correspondence precision (in pixels) with respect to σ when the left image sequence was the target. If σ = 0, the proposed method yields no correspondence error. Otherwise, our method reduces the tracking errors on visible points, since the RMSE of the correspondences is smaller than σ. We also studied the relationship between the choice of ν and the affine camera effect, which denotes the extent to which the camera model approximates an affine model. We tried values of ν from 3 to K, and selected the one that gives the minimal fitting residual on the 3D-reconstruction subspace (see Sec. 4). The affine camera effect was evaluated by the value computed by dividing the mean depth of the
Fig. 4. Inferred correspondences (left image and right image)
Fig. 5. Image correspondence precision (RMS error of the estimated feature positions, in pixels) with respect to the standard deviation of the additive tracking error, for the left sequence (ν = 3)

Fig. 6. Mean of the determined basis number versus (depth of object center)/(object thickness in depth direction), for several tracking noise levels σ: for each setup, we performed the ν-selection 20 times for each σ and plotted its mean
scene points by the thickness of the object. We varied this value by changing the camera-object setup while keeping the motions of the feature points in the image sequences as similar as possible. For each setup, we tried selecting ν 20 times for each tracking noise level σ, and we plotted the mean of the ν values that gave the minimal residuals. This experimental result is shown in Fig. 6. If the affine camera effect is smaller (i.e. the camera model is closer to the perspective projection model), or the additive tracking noise is smaller, a larger ν is selected. At a high tracking noise level, the position errors of the feature points are represented by a higher dimensional part of the motion subspace. This makes the fitting residual on the 3D-reconstruction subspace larger if ν is a large number. Considering that tracking errors in real image sequences are relatively large, we set ν = 3 for general cases.
Fig. 7. Target object

Fig. 8. 3D reconstruction results in wire frame view (with the 1st camera as reference and with the 2nd camera as reference)
Fig. 9. 3D reconstruction result: The side, front, and top views are shown in the top row. Actual shapes in the similar views are shown in the bottom row.
5.2 Real Images
In our experiments using real images, we used a toy car with 83 markers, as shown in Fig. 7. Two wide-baseline views observed the object from the left and the right. We tracked 42 markers in the left view and 43 markers in the right view, over 100 frames in each sequence, where only two markers were tracked in both views. Fig. 8 shows the 3D reconstruction results obtained by switching the roles of the two sequences, with ν = 3. The top row of Fig. 9 shows three views (side, front, top) of the reconstruction results, while the bottom row shows the corresponding actual views of the object.
6 Conclusions
In this paper, we have proposed a novel method for inferring image correspondences on a pair of synchronized image sequences. We have solved a wide-baseline
image correspondence problem using motion subspace constraints and epipolar constraints. The validity of the proposed method has been demonstrated through experiments using synthetic and real images. The proposed method may benefit from subspace fitting techniques, such as RANSAC [16] and GPCA [17], for fitting the tracks in a single sequence to multiple subspaces. We will incorporate these robust methods in future work.
References
1. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Image Understanding Workshop, pp. 121–130 (1981)
2. Freeman, W., Adelson, E.: The design and use of steerable filters. PAMI 13(9), 891–906 (1991)
3. Lazebnik, S., Schmid, C., Ponce, J.: Sparse texture representation using affine-invariant neighborhoods. In: CVPR, pp. 319–324 (2003)
4. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004)
5. Kanatani, K.: Motion segmentation by subspace separation and model selection. In: ICCV, vol. 2, pp. 301–306 (2001)
6. Tomasi, C., Kanade, T.: Shape and motion from image streams under orthography: A factorization method. IJCV 9(2), 137–154 (1992)
7. Poelman, C.J., Kanade, T.: A paraperspective factorization method for shape and motion recovery. PAMI 19(3), 206–218 (1997)
8. Kanatani, K.: Motion segmentation by subspace separation: Model selection and reliability evaluation. IJIG 2(2), 179–197 (2002)
9. Costeira, J.P., Kanade, T.: A multibody factorization method for independently moving objects. IJCV 29(3), 159–179 (1998)
10. Weng, J., Huang, T.S.: Complete structure and motion from two monocular sequences without stereo correspondence. In: ICPR, pp. 651–654 (1992)
11. Dornaika, F., Chung, R.: Stereo correspondence from motion correspondence. CVPR, 70–75 (1999)
12. Ho, P.K., Chung, R.: Stereo-motion with stereo and motion in complement. PAMI 22(2), 215–220 (2000)
13. Faugeras, O.D., Lustman, F., Toscani, G.: Motion and structure from point and line matches. In: ICCV (1987)
14. Zhang, Z.: Determining the epipolar geometry and its uncertainty: A review. IJCV 27(2), 161–198 (1998)
15. Ho, P.-K., Chung, R.: Use of affine camera model and all stereo pairs in stereo-motion. In: IEEE International Conference on Intelligent Vehicles, pp. 323–328. IEEE Computer Society Press, Los Alamitos (1998)
16. Fischler, M.A.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Comm. ACM 26(6), 381–395 (1981)
17. Vidal, R., Ma, Y., Sastry, S.: Generalized Principal Component Analysis (GPCA). PAMI 27(12) (2005)
Efficient Registration of Aerial Image Sequences Without Camera Priors

Shobhit Niranjan^{1,2}, Gaurav Gupta^2, Amitabha Mukerjee^1, and Sumana Gupta^1

^1 Indian Institute of Technology Kanpur, India
{nshobhit,sumana,amit}@iitk.ac.in
^2 Aurora Integrated Systems, Kanpur, India
[email protected]
Abstract. We present an efficient approach for finding homographies between sequences of aerial images. We propose a staged approach: a) initially solving for image-plane rotation and scale parameters without using correspondence (under an affine assumption), b) using these parameters to constrain the full homography search, and c) extending the results to full perspective projection. No flight meta-data, camera priors, or any other user-defined information is used for the task. Based on the estimated perspective parameters, the aerial images are stitched with the best matching image based on a probabilistic model, to compose a high resolution aerial image mosaic. While retaining the improved asymptotic worst-case complexity of [6], we demonstrate significant performance improvements in practice.
1 Introduction

Registration of aerial images with other aerial (or satellite) images is an important problem for GIS applications, surveillance and monitoring systems [2] [3]. Frame-to-reference and frame-to-frame registration of video data is complex due to the lack of stable features (a typical frame does not cover a large enough area) and is computationally expensive. In this work, we propose a computationally efficient algorithm that estimates part of the registration parameters (rotation and scaling) without correspondence, and uses these to estimate the remaining parameters (translation). Current automated registration techniques can be classified into a) area-based ([4] [5]) and b) feature-based [2] techniques. In area-based algorithms, a small window of points in the sensed image is compared statistically with windows of the same size in the reference image, or in the frequency domain using the Fast Fourier Transform (FFT). Window correspondence is based on the similarity between two given windows, usually the normalized cross correlation. A majority of the area-based methods have the limitation of registering only images with small misalignment, and therefore the images must be roughly aligned with each other initially. The correlation measures become unreliable when the images have multiple modalities and the gray-level characteristics vary. In contrast, feature-based methods are more robust and more suitable in the case of images with small misalignment [2]. There are two critical procedures generally involved
This work was done as part of his B.Tech thesis at the Dept. of EE, IIT Kanpur.
in the feature-based techniques: feature extraction and feature correspondence. Previous attempts either use an initial estimate of camera motion parameters (metadata) available as priors [2], [3], or else use computationally intensive methods for establishing accurate correspondences and deterministic homography computation [14]. Nielsen [6] proposed a probabilistic approach to computing the homography from corresponding pairs, which depends on an estimate of the range within which the motion parameters lie. He proved that bounds on computational complexity can be determined based on these estimates and algorithm parameters. Xiong [1] presented an approach that does not require correspondences to be established between features in two images. However, it estimates only rotation and scale parameters, under an affine geometric transform assumption. Here we present a novel adaptation of Nielsen's approach by reducing the search for rotation using Xiong's approach. This results in a computationally efficient framework for registering images. While Xiong's method was limited to affine transformations, we assume it provides a good estimate of scale and rotation for perspective transformations as well.
2 Background

Surveillance with a non-stationary camera involves both rotation and translation of the camera axis. Under a rigid motion assumption the motion of a camera can be described by a translation T and a rotation R. Let P = (X, Y, Z) be a point in the 3D world and p = (x, y) its perspective projection in the image plane; its position at time t, (x_t, y_t), and at t + 1, (x_{t+1}, y_{t+1}), can be related as:

\begin{pmatrix} x_{t+1} \\ y_{t+1} \end{pmatrix} = \begin{pmatrix} t_x \\ t_y \end{pmatrix} + A \begin{pmatrix} x_t \\ y_t \end{pmatrix}, \quad A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix},   (1)

where (x_{t+1}, y_{t+1}) is the new transformed coordinate of (x_t, y_t). The matrix A can be a combination of rotation, scale, and shear. Here we assume that the camera intrinsic parameters are already known, so we know the relationship between the 3D world point P = (X, Y, Z) and its projection in the image plane p = (x, y). If f is the focal length of the camera, then p is given by

p = \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} f X / Z \\ f Y / Z \end{pmatrix}.

We present a registration technique which efficiently computes the homography matrix without any priors, i.e., camera motion parameters. We have studied various features which are used for registration, including the Harris corner detector, the KLT corner detector, and an extension of the visual saliency (Itti's) model for feature extraction. Initially, rotation and scale parameters are estimated using the registration technique devised by Xiong et al. [1]. The algorithm in [1] differs from conventional feature based image registration algorithms, as it does not establish correspondences between images. We extend the approach in [1] to computing a perspective homography, making use of the geometrical filtering of features in feature space [6], and generate the mosaic using the computed homography.
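For illustration, the 2D motion model of eq. (1) can be applied as follows; this is a sketch with illustrative values, not part of the original system.

```python
import numpy as np

def transform_points(pts, A, t):
    """Apply the 2D motion model of eq. (1): x_{t+1} = A x_t + t.

    pts : N x 2 array of image points (x_t, y_t).
    A   : 2x2 matrix combining rotation, scale and shear.
    t   : translation vector (t_x, t_y).
    """
    return pts @ A.T + t

# Example: rotation by 30 degrees with scale 1.2 and a small translation
theta, s = np.deg2rad(30.0), 1.2
A = s * np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
pts = np.array([[100.0, 50.0], [200.0, 80.0]])
print(transform_points(pts, A, t=np.array([10.0, -5.0])))
```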
Fig. 1. (a) Corners obtained using the Harris corner detector, (b) features detected using Itti's model, (c) a typical angle histogram: each column along the X-axis represents a bin of angle difference, and the Y-axis shows the number of occurrences of that bin
3 Feature Extraction

In general, the feature detection process involves computing the response R of one or multiple detectors (filters/operators) to the input image(s), followed by the analysis of R to isolate points or pixels that satisfy devised constraints. There are several kinds of features used for matching (e.g., corners). We compare the Harris corner detector [8], the KLT feature detector [9], visually salient points [11] and SIFT features [15]. The Harris corner detector [8] computes a matrix which averages the first derivatives of the signal on a window:

\exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \otimes \begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix},   (2)

where I_x and I_y are the gradients (derivatives) in the x and y directions. The eigenvalues of this matrix are the principal curvatures of the auto-correlation function. If these two curvatures are high, an interest point is present. We have also integrated in our approach the recently formulated visual saliency model for identifying reliable features in images. Visual attention is a biological mechanism used by primates to compensate for the inability of their brains to process the huge amount of visual information gathered by the two eyes [10]. The Caltech hypothesis [10], elaborated by Itti and Koch [11], describes how the visual attention model works in our brain and processes visual information. According to the hypothesis, seven elementary features, which are computed from an RGB color image and which belong to three main cues, namely intensity, color, and orientation, are extracted. In a second step, each feature map is transformed into a conspicuity map which highlights the parts of the scene that strongly differ from their surroundings. Salient points are subsequently determined either by a Winner Take All (WTA) algorithm, or by dividing the saliency map into a sufficient number of grid cells and thresholding local maxima in the intensity image of the conspicuity map. Fig. 1(a) and Fig. 1(b) show the feature points detected on a pair of images.
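One common way to realize the Harris response of eq. (2) is sketched below; the Gaussian window width sigma and the constant k are illustrative choices, and the paper's implementation may differ.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def harris_response(img, sigma=1.5, k=0.04):
    """Harris corner response from the Gaussian-weighted structure tensor of eq. (2).

    img : 2D grayscale image as a float array.
    Returns the response map R; corners are local maxima of R above a threshold.
    """
    Ix = sobel(img, axis=1)                    # gradient in x
    Iy = sobel(img, axis=0)                    # gradient in y
    # Gaussian-weighted second moment matrix entries (the matrix of eq. (2))
    Sxx = gaussian_filter(Ix * Ix, sigma)
    Syy = gaussian_filter(Iy * Iy, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)
    det = Sxx * Syy - Sxy ** 2                 # product of the two eigenvalues
    trace = Sxx + Syy                          # sum of the two eigenvalues
    return det - k * trace ** 2
```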
4 Registration Algorithm Overview

4.1 Estimation of Registration Parameters Using the Angle Histogram Method

This approach is based on the work of Xiong et al. [1], which demonstrates efficient registration of two aerial images without feature correspondence. Features are detected on a pair of images (observed and reference) using any of the above feature extraction methods. Circular patches are then created around the features. For a given patch p(i, j) (i = 1, 2, ..., m), the covariance matrix is defined as

COV_p = E\big[(X - m_x)(X - m_x)^T\big],   (3)

where X = (i, j)^T is the position of a pixel and m_x = (m_{x_i}, m_{x_j})^T is the centroid of the image patch p(i, j), its first order moment. It is first verified that a patch is oriented at all, by thresholding the eigenvalues of the covariance matrix. The direction of the eigenvector V_1 is defined as the orientation of the image patch p(i, j). By applying this approach, we can compute orientations for all image patches on the reference and observed images. Taking any two patches, one from each image (n_t patches in I_1 and n_f patches in the second image I_2), we compute the orientation differences \Delta\varphi = \{\Delta\varphi_l, l = 1, 2, ..., n_t n_f\} and find n_t corresponding patches on the reference image. For these n_t pairs of corresponding patches, the value of the orientation difference will be the rotation angle:

\Delta\varphi_l = |\varphi_t^j - \varphi_f^i|, \quad i = 1, 2, ..., n_f.   (4)

If we create a histogram of the orientation differences, the count for the orientation difference between the two images will show a significant peak. We estimate the value of the scaling through a series of voting processes. By varying the size of the image patches and computing the angle histograms, we obtain a series of angle histograms and choose the one which has the highest maximum peak. Let A_t and A_f denote the patch sizes on the observed and reference images corresponding to the histogram H_h which has the highest maximum peak. The value of the scaling between the observed and reference images can then be computed by

s = \frac{A_t}{A_f}.   (5)

The orientation difference \Delta\varphi corresponding to the histogram H_h is the rotation angle \varphi between the observed and reference images. We have found this algorithm to work very well for images with weak perspective; however, with strong perspective, the angle we estimate is not the exact angle of rotation. Considering two images which differ by a rotation of the camera about more than one axis, the circular patches around features in the first image are transformed to a sheared shape in the second image. Therefore, when we compute the orientation angles of features in the second image, the histogram formed still has a unique peak, with a rise in the other bins (a flattening of the distribution). What we found is that this algorithm still reliably estimates the angle of rotation about the camera axis, thus providing a reliably accurate restriction to be input into the geometrical filtering proposed by Nielsen [6].
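A sketch of the two core steps, patch orientation from the covariance eigenvector (eq. (3)) and the angle-difference histogram (eq. (4)); it assumes intensity-weighted moments for the patch covariance, and the binning is an illustrative choice.

```python
import numpy as np

def patch_orientation(patch):
    """Orientation of a patch from the dominant eigenvector of its covariance, eq. (3)."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    weights = patch.astype(float).ravel()
    weights = weights / (weights.sum() + 1e-12)
    X = np.stack((xs.ravel(), ys.ravel())).astype(float)    # 2 x (h*w) pixel positions
    m = (X * weights).sum(axis=1, keepdims=True)            # weighted centroid m_x
    cov = (weights * (X - m)) @ (X - m).T                   # COV_p of eq. (3)
    evals, evecs = np.linalg.eigh(cov)
    v1 = evecs[:, np.argmax(evals)]                         # dominant eigenvector V_1
    return np.arctan2(v1[1], v1[0])

def rotation_from_histogram(angles_obs, angles_ref, bins=360):
    """Peak of the angle-difference histogram gives the rotation estimate, eq. (4)."""
    diffs = np.mod((angles_obs[:, None] - angles_ref[None, :]).ravel(), 2 * np.pi)
    hist, edges = np.histogram(diffs, bins=bins, range=(0, 2 * np.pi))
    peak = np.argmax(hist)
    return 0.5 * (edges[peak] + edges[peak + 1]), hist[peak]
```

The scale of eq. (5) would then be obtained by repeating the histogram computation for several patch sizes and keeping the pair (A_t, A_f) whose histogram has the highest peak.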
Fig. 2. Geometrical filtering: (a) general constraints applied by Nielsen, with comparatively looser bounds because of the absence of prior information; (b) the constraints provided by the first part of the proposed method give much tighter bounds and a significant reduction in search space
4.2 Homography Estimation Using Geometrical Filtering

The above algorithm estimates rotation and scale; to complete the registration we need to find the homography, extending the angle histogram approach with a refined search for translation, assuming a perspective transformation and a known initial estimate of rotation and scale. The algorithm proposed in [6] is a fully automatic method for mosaicing which handles large scale factors (≈ 4) and arbitrary rotation values while keeping the running time attractive for responsive applications. The method is based on Monte Carlo / Las Vegas algorithms that combine both randomization and geometric feature selection. The basic idea of geometric filtering is to constrain the properties of feature matchings. Nielsen proposes that we can input geometric constraints like the scale value (e.g., in the range [1/4, 4]) and rotations (e.g., between [−60 deg, 45 deg]), and a precision tolerance ε (each feature can be moved into a ball of radius ε centered at it). For the calculation of the score, different methods can be used as described in [6], such as pixel cross correlation, Hausdorff matching, normalized cross correlation and bottleneck matching. We use normalized cross correlation and also a probabilistic model. We use the information on rotation and scale to reduce the search range in feature space and search only from α − δ to α + δ and from s − ε to s + ε, where α and s are known from the previous method. For any planar transformation T we associate a characteristic vector v(T) = (scale(T), angle(T)), where scale(T) is the scale of T and angle(T) is the rotation angle about the x axis. Given two feature sets S_1 = {L_1 . . . L_{n_1}} (n_1 = |S_1|) and S_2 = {R_1 . . . R_{n_2}} (n_2 = |S_2|) from two images I_1 and I_2, the aim is to estimate an approximate collineation H which matches as many features as possible. A point p_1 in S_1 is said to match point p_2 in S_2 for a collineation H if d_2(p_2, Hp_1) ≤ ε for some prescribed ε ≥ 0, where d_2(., .) is the Euclidean distance. α is the percentage of points required to match up to an absolute error ε (that is, at least max{α|S_1|, α|S_2|} points). We are looking for an (α, ε) match, i.e., a transformation H that matches at least αn points up to some error tolerance ε. This error tolerance can be set by trial and error depending on the image data set being registered. The algorithm is thus: a) extract stable features in the images (corners reduced based on eigenvalues), b) generate a set Γ of homographies satisfying the geometric constraints and
matching at least a given fraction of the point sets, c) score each homography using the probabilistic model and choose the one with the highest quality (e.g., score).

Geometric Filtering. Geometric filtering is a novel concept introduced by Nielsen to cut down the search space by considering the geometric constraints imposed by the selection of features in an image to be matched to its corresponding points in the reference image. It extends the combinatorial Monte Carlo approach using properties of the Euclidean/projective space of point features. Transformations are estimated from point to point alignments. Once we have chosen a pair (L_1, L_2) ∈ S_1 × S_1 of features in I_1, geometric filtering tests only some of the pairs (R_1, R_2) ∈ S_2 × S_2 of I_2, using the following constraint. Each detected feature point p lies in a ball B(p, μ). In order to take into account the fuzziness of the features, μ is set to ε/4. Let p* denote the "visual" feature point that the feature detection algorithm approximates by p (p* ∈ B(p, μ)). Between any two feature points we have |d_2(p*_1, p*_2) − d_2(p_1, p_2)| < 2μ. Instead of comparing (L_1, L_2) to all pairs (R_1, R_2), first a point R_1 ∈ I_2 is chosen, and the second point is chosen inside the ring centered at R_1 whose inner circle is B(R_1, d_2(L_1, L_2) − 2μ) and whose width is 4μ. If the transformation also involves uniform scale, then the length of the segment varies. To accommodate this variation in length, the algorithm requires as input from the previous module a range of values in which the scale lies, which has already been estimated. Given that the scale lies in the range [scale_min, scale_max], we consider all pairs of points in S_2 whose distance is between (d_12 − ε)·scale_min and (d_12 + ε)·scale_max. Since we have obtained a good estimate of the scale, we use a 5–10% tolerance about this scale value for scale_min and scale_max. The value of the tolerance was set by trial and error, and this range worked fine on the various image subsets we tested.

Computing Homographies for Perspective Transforms. Every time a k-tuple is sampled from feature set S_1 and matched with a possible k-tuple in S_2, we need to compute the homography. This computation depends on the number of features matched. We represent the features in set S_1 as L_i and the features in set S_2 as R_i, both in Cartesian coordinates. The simplest case is when the transformation is a similitude of the plane, i.e., a combination of translation, rotation and uniform scaling. In this case there are four unknowns and we can compute the homography by matching 2 points. In the case of a perspective transformation, however, there are 8 unknowns, for which we search for 4 pairs of points, denoted L_1 to L_4 in S_1 and R_1 to R_4 in S_2. Let P = (L_1^T L_2^T L_3^T L_4^T) and Q = (R_1^T R_2^T R_3^T R_4^T) be 3×4 matrices. The perspective homography can then be computed using the pseudo-inverse method [6]:

H = Q P^T (P P^T)^{-1}.

To find the best matching homography, we evaluate its score by the probabilistic approach proposed in [17]. After computing the homography, we apply the transform to the reference image and find its inlier and outlier features w.r.t. the other image. Let the total number of features be n_f in I_1 and n_i be the total number of inliers. The event that this homography is a matching homography is represented by a binary variable m ∈ {0, 1}. Whether a particular feature is an inlier or an outlier, f^i ∈ {0, 1}, is assumed to be an independent Bernoulli variable, so that the total number of inliers is binomial:

p(f^{(1:n_f)} | m = 1) = B(n_i; n_f, p_1),
Fig. 3. Prior work on affine image registration: (a) clear artifacts visible because of inaccurate computation of the homography with an affine assumption, (b) the same images stitched with a homography assuming a perspective transform
p(f^{(1:n_f)} | m = 0) = B(n_i; n_f, p_0),

where p_1 is the probability that a feature is an inlier given a correct homography, and p_0 is the probability that a feature is an inlier given a false image match. We chose p_1 = 0.8 and p_0 = 0.02 after some trial and error. Using Bayes' rule, it has been proved in [17] (choosing p_min = 0.95 for our implementation) that the computed homography is acceptable if

\frac{B(n_i; n_f, p_1)}{B(n_i; n_f, p_0)} > \frac{1}{\frac{1}{p_{min}} - 1}.
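A sketch of the two steps just described, the pseudo-inverse homography H = Q Pᵀ (P Pᵀ)⁻¹ and the Bayesian acceptance test (with p_1, p_0 and p_min as above); function names are illustrative, not from the original implementation.

```python
import numpy as np
from scipy.stats import binom

def homography_pseudoinverse(L, R):
    """Perspective homography from 4 point pairs via H = Q P^T (P P^T)^{-1}.

    L, R : 4x2 arrays of corresponding points in I1 and I2 (Cartesian coords).
    The result maps a homogeneous point (x, y, 1)^T of I1 towards I2; the image
    point is recovered by dehomogenizing, as for any homography.
    """
    P = np.vstack((L.T, np.ones(4)))   # 3x4 matrix of homogeneous points L_1..L_4
    Q = np.vstack((R.T, np.ones(4)))   # 3x4 matrix of homogeneous points R_1..R_4
    return Q @ P.T @ np.linalg.inv(P @ P.T)

def accept_homography(n_inliers, n_features, p1=0.8, p0=0.02, p_min=0.95):
    """Bayesian verification: accept if B(n_i;n_f,p1)/B(n_i;n_f,p0) > 1/(1/p_min - 1)."""
    log_ratio = (binom.logpmf(n_inliers, n_features, p1)
                 - binom.logpmf(n_inliers, n_features, p0))
    return log_ratio > np.log(1.0 / (1.0 / p_min - 1.0))

# Example: a candidate with many inliers is accepted, one with few is rejected
print(accept_homography(60, 100), accept_homography(5, 100))
```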
Fig. 4. (a) Aerial view of the academic area, (b) orthographic view of the academic area (courtesy Google); the registration parameters are given in Table 2
5 Image Mosaicing

Images aligned after undergoing geometric corrections require further processing to eliminate distortions and discontinuities. The alignment of images may be imperfect due to registration errors resulting from incompatible model assumptions, dynamic scenes, etc. Image composition is the technique which modifies the image gray levels in the vicinity of an alignment boundary to obtain a smooth transition between images, removing these seams and creating a blended image by determining how pixels in an overlapping area should be presented. The term image spline refers to digital techniques for making these adjustments [19]. The images to be splined are first decomposed into
Fig. 5. (a) Mosaic with affine assumption, showing misalignments and distortion in regions of strong perspective, (b) mosaic of the same set as (a) constructed with the perspective assumption, (c) mosaic with the perspective assumption of another strongly perspective image set
a set of band-pass filtered component images (image components in a spatial frequency band). Next, the component images in each spatial frequency band are assembled into a corresponding band-pass mosaic. In this step, component images are merged using a weighted average within a transition zone which is proportional in size to the wavelengths represented in the band. Finally, by summation of these band-passed mosaic images the desired image mosaic is generated. The spline is thus matched to the scale of features within the images. Figure 3(a) is the mosaic constructed without using the blending algorithm, and has clear visual artifacts along seam lines. Figure 3(b) has these artifacts eliminated using blending.
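A minimal two-band sketch of this band-pass splining idea, assuming aligned grayscale float images and a soft seam mask; the full method of [19] uses a multi-band (Laplacian pyramid) decomposition, and the parameter values here are illustrative.

```python
import cv2
import numpy as np

def two_band_blend(img1, img2, mask, sigma=5.0):
    """Blend two aligned grayscale float images with band-dependent transitions.

    mask : float mask in [0, 1]; 1 where img1 should dominate.
    Low frequencies are blended with a wide transition zone, high frequencies
    with a narrow one, approximating the band-dependent transition widths.
    """
    low1 = cv2.GaussianBlur(img1, (0, 0), sigma)
    low2 = cv2.GaussianBlur(img2, (0, 0), sigma)
    high1, high2 = img1 - low1, img2 - low2
    wide = cv2.GaussianBlur(mask, (0, 0), sigma)      # smooth weights for the low band
    low = wide * low1 + (1.0 - wide) * low2
    high = mask * high1 + (1.0 - mask) * high2        # sharp weights for the high band
    return low + high
```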
6 Results

The system has been implemented in Visual C++ 6.0 using the OpenCV library [20] from Intel. We have performed timing and performance analysis of the proposed methodology with different feature detectors and analyzed their significance for accurate registration. The reported values are subjective, since the results depend on the image data set. Since the Nielsen method is based on pattern matching, it is nontrivial to establish the logical relevance of the extracted feature points to the information contained in the image [6]. For the purpose of validating the presented algorithm, it was applied to pairs of test images for which ground truthing was performed manually. KLT and Harris features take
Table 1. Performance Evaluation

Dimensions  Features  Nielsen's algorithm (sec)  Proposed algorithm (sec)
728x388     100       15                         5
728x388     150       27                         8
728x388     175       34                         10
800x600     100       19                         5
800x600     150       30                         8.7

Table 2. Ground truth result of the histogram method
        Ground Truthing   Histogram Method
Angle   34.062°           33.6752°
Scale   1.4512            1.445
equivalent computation times, but feature extraction based on Itti's model consumes significantly more time. KLT features were observed to be more stable for aerial images than Harris features. Salient points (Itti's model) have also been observed to be stable, and are well distributed over the entire images. A comparison of the Nielsen algorithm with default parameters and the proposed algorithm is given in Table 1. There is an overall decrease in computation time, and we are looking at the possibility of running this framework in real time. We also tested our registration algorithm with aerial images and orthographic images from Google Earth [21]. Manual registration was performed using standard Matlab functions; the results obtained from the algorithm are listed in Table 2. Fig. 5(b) and 5(c) are the resulting mosaics for two sets of aerial images. We can clearly see distortions in the image shown in Fig. 5(a), constructed from the same set of images as Fig. 5(b), but with the affine assumption [22]. The algorithm with the affine assumption fails to compute the correct homography in regions of strong perspective transform, with clear misalignments and distortions in the constructed mosaic. The proposed algorithm works even for Fig. 5(c), which uses an image set with significantly stronger perspective geometry, resulting in a mosaic with invisible seam lines.
7 Conclusion

Unmanned Aerial Vehicles (UAVs) equipped with lightweight, inexpensive cameras have grown in popularity by enabling new applications of UAV technology. Beginning with an investigation of previous registration and mosaicing works, this paper discusses the challenges of registering UAV-based video sequences. A novel approach to estimating the registration parameters is then presented. Future work on improving the proposed algorithm involves a frequency domain approach which could be used to determine a rough estimate of the parameters, helping the registration algorithm to prune its search for registration parameters. Also, with the availability of coarse/fine metadata, significant improvements in performance can be achieved.
References
1. Xiong, Y., Quek, F.: Automatic Aerial Image Registration Without Correspondence. In: ICVS 2006. The 4th IEEE International Conference on Computer Vision Systems, St. John's University, Manhattan, New York City, New York, USA (January 5-7, 2006)
2. Sheikh, Y., Khan, S., Shah, M., Cannata, R.W.: Geodetic Alignment of Aerial Video Frames. VideoRegister03, ch. 7 (2003)
3. Wildes, R., Hirvonen, D., Hsu, S., Kumar, R., Lehman, W., Matei, B., Zhao, W.: Video Registration: Algorithm and quantitative evaluation. In: Proc. International Conference on Computer Vision, vol. 2, pp. 343–350 (2001)
4. Li, H., Manjunath, B.S., Mitra, S.K.: A contour based approach to multisensor image registration. IEEE Trans. Image Processing, 320–334 (March 1995)
5. Cideciyan, A.V.: Registration of high resolution images of the retina. In: Proc. SPIE, Medical Imaging VI: Image Processing, vol. 1652, pp. 310–322 (February 1992)
6. Nielsen, F.: Adaptive Randomized Algorithms for Mosaicing Systems. IEICE Transactions on Information and Systems E83-D(7), 1386–1394 (2000)
7. Canny, J.: A computational approach to edge detection. IEEE PAMI, 679–698 (1986)
8. Harris, J.C., Stephens, M.: A combined corner and edge detector. In: Proc. 4th Alvey Vision Conf., pp. 189–192 (1988)
9. Tomasi, C., Kanade, T.: Detection and Tracking of Point Features. CMU Technical Report CMU-CS-91-132 (April 1991)
10. Itti, L.: Models of Bottom-Up and Top-Down Visual Attention. PhD thesis, Pasadena, California (2000)
11. Itti, L., Koch, C.: Nature Reviews Neuroscience 2(3), 194–203 (2001)
12. Ouerhani, N.: Visual Attention: From Bio-Inspired Modelling to Real-Time Implementation. PhD thesis (2003)
13. Backer, G., Mertsching, B.: Two selection stages provide efficient object-based attentional control for dynamic vision. In: International Workshop on Attention and Performance in Computer Vision (2004)
14. Zoghliami, I., Faugeras, O., Deriche, R.: Using geometric corners to build a 2D mosaic from a set of images. In: CVPR Proceedings (1997)
15. Brown, M., Lowe, D.G.: Invariant features from interest point groups. In: BMVC 2002. British Machine Vision Conference, Cardiff, Wales, September 2002, pp. 656–665 (2002)
16. Maji, S., Mukerjee, A.: Motion Conspicuity Detection: A Visual Attention Model for Dynamic Scenes. Report on CS497, IIT Kanpur, available at www.cse.iitk.ac.in/report-repository/2005/Y2383 497-report.pdf
17. Brown, M., Lowe, D.G.: Recognising panoramas. In: ICCV 2003. International Conference on Computer Vision, Nice, France, October 2003, pp. 1218–1225 (2003)
18. Kyung, Lacroix, S.: A Robust Interest Point Matching Algorithm. IEEE (2001)
19. Hsu, C., Wu, J.: Multiresolution mosaic. In: Proceedings of the International Conference on Image Processing, September 16-19, 1996, vol. 3, pp. 743–746 (1996)
20. Intel Open Source Computer Vision Library, http://www.intel.com/technology/computing/opencv/index.htm
21. http://www.earth.google.com
22. Gupta, G.: Feature Based Aerial Image Registration and Mosaicing. Bachelors Dissertation, IIT Kanpur (December 2006)
Simultaneous Appearance Modeling and Segmentation for Matching People Under Occlusion Zhe Lin, Larry S. Davis, David Doermann, and Daniel DeMenthon Institute for Advanced Computer Studies University of Maryland, College Park, MD 20742, USA {zhelin,lsd,doermann,daniel}@umiacs.umd.edu
Abstract. We describe an approach to segmenting foreground regions corresponding to a group of people into individual humans. Given background subtraction and ground plane homography, hierarchical part-template matching is employed to determine a reliable set of human detection hypotheses, and progressive greedy optimization is performed to estimate the best configuration of humans under a Bayesian MAP framework. Then, appearance models and segmentations are simultaneously estimated in an iterative sampling-expectation paradigm. Each human appearance is represented by a nonparametric kernel density estimator in a joint spatial-color space and a recursive probability update scheme is employed for soft segmentation at each iteration. Additionally, an automatic occlusion reasoning method is used to determine the layered occlusion status between humans. The approach is evaluated on a number of images and videos, and also applied to human appearance matching using a symmetric distance measure derived from the Kullback-Leibler divergence.
1 Introduction
In video surveillance, people often appear in small groups, which yields occlusion of appearances due to the projection of the 3D world to 2D image space. In order to track people or to recognize them based on their appearances, it would be useful to be able to segment the groups into individuals and build their appearance models. The problem is to segment foreground regions from background subtraction into individual humans. Previous work on segmentation of groups can be classified into two categories: detection-based approaches and appearance-based approaches. Detection-based approaches model humans with 2D or 3D parametric shape models (e.g. rectangles, ellipses) and segment foreground regions into humans by fitting these models. For example, Zhao and Nevatia [1] introduce an MCMC-based optimization approach to human segmentation from foreground blobs. Following this work, Smith et al. [2] propose a similar trans-dimensional MCMC model to track multiple humans using particle filters. Later, an EM-based approach is proposed by Rittscher et al. [3] for foreground blob segmentation. On the other
hand, appearance-based approaches segment foreground regions by representing human appearances with probabilistic densities and classifying foreground pixels into individuals based on these densities. For example, Elgammal and Davis [4] introduce a probabilistic framework for human segmentation assuming a single video camera. In this approach, appearance models must first be acquired and used later in segmenting occluded humans. Mittal and Davis [5] deal with the occlusion problem by a multi-view approach using region-based stereo analysis and Bayesian pixel classification. But this approach needs strong calibration of the cameras for its stereo reconstruction. Other multi-view-based approaches [6][7][8] combine evidence from different views by exploiting ground plane homography information to handle more severe occlusions. Our goal is to develop an approach to segment and build appearance models from a single view even if people are occluded in every frame. In this context, appearance modeling and segmentation are closely related modules. Better appearance modeling can yield better pixel-wise segmentation while better segmentation can be used to generate better appearance models. This can be seen as a chicken-and-egg problem, so we solve it by the EM algorithm. Traditional EM-based segmentation approaches are sensitive to initialization and require appropriate selection of the number of mixture components. It is well known that finding a good initialization and choosing a generally reasonable number of mixtures for the traditional EM algorithm remain difficult problems. In [15], a sample consensus-based method is proposed for segmenting and tracking small groups of people using both color and spatial information. In [13], the KDE-EM approach is introduced by applying the nonparametric kernel density estimation method in EM-based color clustering. Later in [14], KDE-EM is applied to single human appearance modeling and segmentation from a video sequence. We modify KDE-EM and apply it to our problem of foreground human segmentation. First, we represent kernel densities of humans in a joint spatial-color space instead of density estimation in a pure color space. This can yield more discriminative appearance models by enforcing spatial constraints on color models. Second, we update assignment probabilities recursively instead of using a direct update scheme in KDE-EM; this modification of feature space and update equations results in faster convergence and better segmentation accuracy. Finally, we propose a general framework for building appearance models from occluded humans and matching them using full or partial observations.
2 Human Detection
In this section, we briefly introduce our human detection approach (details can be found in [16]). The detection problem is formulated as a Bayesian MAP optimization [1]: c* = arg max_c P(c|I), where I denotes the original image and c = {h_1, h_2, ..., h_n} denotes a human configuration (a set of human hypotheses). {h_i = (x_i, θ_i)} denotes an individual hypothesis which consists of foot position x_i and corresponding model parameters θ_i (which are defined as the indices of part-templates). Using Bayes Rule, the posterior probability can be decomposed into a joint likelihood and a prior as P(c|I) = P(I|c)P(c)/P(I) ∝ P(I|c)P(c). We assume a uniform prior, hence the MAP problem reduces to maximizing the joint likelihood. The joint likelihood P(I|c) is modeled as a multi-hypothesis, multi-blob observation likelihood. The multi-blob observation likelihood has been previously explored in [9][10].
Hierarchical part-template matching is used to determine an initial set of human hypotheses. Given the (off-line estimated) foot-to-head plane homography [3], we search for human foot candidate pixels by matching a part-template tree to edges and binary foreground regions hierarchically and generate the object-level likelihood map. Local maxima are chosen adaptively from the likelihood map to determine the initial set of human hypotheses. For efficient implementation, we perform matching only for pixels in foot candidate regions R_foot. R_foot is defined as R_foot = {x : γ_x ≥ ξ}, where γ_x denotes the proportion of foreground pixels in an adaptive rectangular window W(x, (w_0, h_0)) determined by the human vertical axis v_x (estimated from the homography mapping). The window coverage is efficiently calculated using integral images.
Fig. 1. An example of the human detection process. (a) Adaptive rectangular window, (b) Foot candidate regions R_foot (lighter regions), (c) Object-level likelihood map by hierarchical part-template matching, (d) The initial set of human hypotheses overlaid on the Canny edge map, (e) Human detection result, (f) Shape segmentation result.
Then, a fast and efficient greedy algorithm is employed for optimization. The algorithm works in a progressive way as follows: starting with an empty configuration, we iteratively add a new, locally best hypothesis from the remaining set of possible hypotheses until the termination condition is satisfied. The iteration is terminated when the joint likelihood stops increasing or no more hypotheses can be added. Fig. 1 shows an example of the human detection process.
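The progressive greedy optimization can be sketched as follows. This is a minimal illustration rather than the authors' implementation; the joint likelihood function and the list of candidate hypotheses are assumed to be supplied by the detection stage.

```python
def greedy_map_optimization(hypotheses, joint_likelihood):
    """Progressively build a configuration by adding the locally best hypothesis.

    hypotheses       : list of candidate human hypotheses (e.g., (foot_position, template_index))
    joint_likelihood : function mapping a configuration (list of hypotheses) to P(I|c)
    """
    config = []                                  # start with an empty configuration
    remaining = list(hypotheses)
    best_score = joint_likelihood(config)
    while remaining:
        # evaluate the joint likelihood of adding each remaining hypothesis
        scored = [(joint_likelihood(config + [h]), h) for h in remaining]
        score, best_h = max(scored, key=lambda s: s[0])
        if score <= best_score:                  # joint likelihood stops increasing
            break
        config.append(best_h)                    # accept the locally best hypothesis
        remaining.remove(best_h)
        best_score = score
    return config, best_score
```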
3 Human Segmentation
3.1 Modified KDE-EM Approach
KDE-EM [13] was originally developed for figure-ground segmentation. It uses nonparametric kernel density estimation [11] for representing feature distributions of foreground and background. Given a set of sample pixels {x_i, i = 1, 2, ..., N} (with a distribution P), each represented by a d-dimensional feature vector x_i = (x_i1, x_i2, ..., x_id)^t, we can estimate the probability of a new pixel y with feature vector y = (y_1, y_2, ..., y_d)^t belonging to the same distribution P as:

\hat{p}(y \in P) = \frac{1}{N\,\sigma_1 \cdots \sigma_d} \sum_{i=1}^{N} \prod_{j=1}^{d} k\!\left(\frac{y_j - x_{ij}}{\sigma_j}\right),   (1)
where the same kernel function k(·) is used in each dimension (or channel) with a different bandwidth σ_j. It is well known that a kernel density estimator can converge to any complex-shaped density with sufficient samples. Also, due to its nonparametric property, it is a natural choice for representing the complex color distributions that arise in real images.
We extend the color feature space in KDE-EM to incorporate spatial information. This joint spatial-color feature space has been previously explored for feature space clustering approaches such as [12], [15]. The joint space imposes spatial constraints on pixel colors; hence the resulting density representation is more discriminative and can tolerate small local deformations. Each pixel is represented by a feature vector x = (X^t, C^t)^t in a 5D space, R^5, with 2D spatial coordinates X = (x_1, x_2)^t and 3D normalized rgs color¹ coordinates C = (r, g, s)^t. In Equation 1, we assume independence between channels and use a Gaussian kernel for each channel. The kernel bandwidths are estimated as in [11].
In KDE-EM, the foreground and background assignment probabilities f̂^t(y) and ĝ^t(y) are updated directly by weighted kernel densities. We modify this by updating the assignment probabilities recursively on the previous assignment probabilities with weighted kernel densities (see Equation 2). This modification results in faster convergence and better segmentation accuracy, which is quantitatively verified in [17] in terms of pixel-wise segmentation accuracy and the number of iterations needed for foreground/background segmentation.
3.2 Foreground Segmentation Approach
Given a foreground region R_f from background subtraction and a set of initial human detection hypotheses (h_k, k = 1, 2, ..., K), the problem of segmentation is equivalent to the K-class pixel labeling problem. The label set is denoted as F_1, F_2, ..., F_K. Given a pixel y, we represent the probability of pixel y belonging to human-k as f̂_k^t(y), where t = 0, 1, 2, ... is the iteration index. The assignment probabilities f̂_k^t(y) are constrained to satisfy the condition Σ_{k=1}^{K} f̂_k^t(y) = 1.
¹ r = R/(R + G + B), g = G/(R + G + B), s = (R + G + B)/3.
Algorithm 1. Initialization by Layered Occlusion Model
initialize R_0^0(y) = 1 for all y ∈ R_f
for k = 1, 2, ..., K − 1
    for all y ∈ R_f
        f̂_k^0(y) = R_{k−1}^0(y) exp(−(1/2)(Y − Y_{0,k})^t V^{−1} (Y − Y_{0,k})) and R_k^0(y) = 1 − Σ_{i=1}^{k} f̂_i^0(y)
    endfor
endfor
set f̂_K^0(y) = R_{K−1}^0(y) for all y ∈ R_f and return f̂_1^0, f̂_2^0, ..., f̂_K^0
where Y denotes the spatial coordinates of y, Y_{0,k} denotes the center coordinates of object k, and V denotes the covariance matrix of the 2D spatial Gaussian distribution.
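A minimal sketch of this initialization, assuming each detection hypothesis supplies a 2D center Y_{0,k} and a spatial covariance V_k derived from its scale (names and array shapes here are illustrative, not the authors' code):

```python
import numpy as np

def layered_initialization(coords, centers, covariances):
    """coords: (n, 2) pixel coordinates inside the foreground region R_f.
    centers: list of K detection centers Y_{0,k}; covariances: list of K 2x2 matrices V_k.
    Returns an (n, K) array of initial assignment probabilities f_k^0(y)."""
    n, K = coords.shape[0], len(centers)
    f0 = np.zeros((n, K))
    residual = np.ones(n)                       # R_0^0(y) = 1
    for k in range(K - 1):                      # front-to-back occlusion order
        d = coords - centers[k]
        mahal = np.einsum('ni,ij,nj->n', d, np.linalg.inv(covariances[k]), d)
        f0[:, k] = residual * np.exp(-0.5 * mahal)
        residual = 1.0 - f0[:, :k + 1].sum(axis=1)   # residual map R_k^0(y)
    f0[:, K - 1] = residual                     # back layer takes what is left
    return f0
```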
Layered Occlusion Model. We introduce a layered occlusion model into the initialization step. Given a hypothesis of an occlusion ordering of detections, we build a layered occlusion representation iteratively by calculating the foreground probability map f̂_k^0 for the current layer and its residual probability map R_k^0 for each pixel y. Suppose the occlusion order (from front to back) is given by F_1, F_2, ..., F_K; then the initial probability map is calculated recursively from the front layer to the back layer by assigning 2D anisotropic Gaussian distributions based on the location and scales of each detection hypothesis.
Occlusion Reasoning. The initial occlusion ordering is determined by sorting the detection hypotheses by their vertical coordinates, and the layered occlusion model is used to estimate initial assignment probabilities. The occlusion status is updated at each iteration (after the E-step) by comparing the evidence of occupancy in the overlap area between different human hypotheses. For two human hypotheses h_i and h_j with overlap area O_{h_i,h_j}, we re-estimate the occlusion ordering between the two as: h_i occludes h_j if Σ_{x∈O_{h_i,h_j}} f̂_i^t(x) > Σ_{x∈O_{h_i,h_j}} f̂_j^t(x) (i.e., h_i better accounts for the pixels in the overlap area than h_j), and h_j occludes h_i otherwise, where f̂_i^t and f̂_j^t are the foreground assignment probabilities of h_i and h_j. At each iteration, every pair of hypotheses is compared in this way if they have a non-empty overlap area. The whole occlusion ordering is updated by exchanges if and only if an estimated pairwise occlusion ordering differs from the previous ordering.
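The pairwise update can be sketched as below; the overlap masks and per-pixel probability maps are assumed to come from the current E-step (illustrative code, not the published implementation):

```python
def update_occlusion_order(order, prob_maps, overlap_masks):
    """order: current front-to-back list of hypothesis indices.
    prob_maps[i]: per-pixel assignment probabilities f_i^t over the foreground region.
    overlap_masks: dict keyed by the unordered pair (min(i, j), max(i, j)) giving a
    boolean mask of the overlap area O_{h_i,h_j}, or absent if the overlap is empty."""
    new_order = list(order)
    for a in range(len(new_order)):
        for b in range(a + 1, len(new_order)):
            i, j = new_order[a], new_order[b]          # i is currently in front of j
            mask = overlap_masks.get((min(i, j), max(i, j)))
            if mask is None:
                continue
            # compare the evidence of occupancy inside the overlap area
            if prob_maps[j][mask].sum() > prob_maps[i][mask].sum():
                new_order[a], new_order[b] = j, i      # exchange the pair
    return new_order
```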
4 Partial Human Appearance Matching
Appearance models represented by kernel probability densities can be compared by information-theoretic measures such as the Bhattacharyya distance or the Kullback-Leibler distance for tracking and matching objects in video. Recently, Yu et al. [18] introduced an approach to construct appearance models from a video sequence by a key-frame method and showed robust matching results using a path-length feature and the Kullback-Leibler distance measure. But this approach only handles unoccluded cases.
Algorithm 2. Simultaneous Appearance Modeling and Segmentation for Occlusion Handling
Given a set of sample pixels {x_i, i = 1, 2, ..., N} from the foreground region R_f, we iteratively estimate the assignment probabilities f̂_k^t(y) of a foreground pixel y ∈ R_f belonging to F_k as follows:
Initialization: Initial probabilities are assigned by the layered occlusion model.
M-Step (Random Pixel Sampling): We randomly sample a set of pixels (we use η = 5% of the pixels) from the foreground region R_f for estimating each foreground appearance represented by weighted kernel densities.
E-Step (Soft Probability Update): For each k ∈ {1, 2, ..., K}, the assignment probabilities f̂_k^t(y) are recursively updated as follows:

\hat{f}_k^t(y) = c\,\hat{f}_k^{t-1}(y) \sum_{i=1}^{N} \hat{f}_k^{t-1}(x_i) \prod_{j=1}^{d} k\!\left(\frac{y_j - x_{ij}}{\sigma_j}\right),   (2)

where N is the number of samples and c is a normalizing constant such that \sum_{k=1}^{K} \hat{f}_k^t(y) = 1.
Segmentation: The iteration is terminated when the average segmentation difference of two consecutive iterations is below a threshold:

\frac{\sum_{k}\sum_{y} |\hat{f}_k^t(y) - \hat{f}_k^{t-1}(y)|}{nK} < \epsilon,   (3)

where n is the number of pixels in the foreground region. Let f̂_k(y) denote the final converged assignment probabilities. Then the final segmentation is determined as: pixel y belongs to human-k, i.e., y ∈ F_k, k = 1, ..., K, if k = arg max_{k∈{1,...,K}} f̂_k(y).
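A compact sketch of the recursive soft update in Equation (2), using Gaussian kernels in the joint spatial-color feature space (array shapes and bandwidth handling are illustrative assumptions):

```python
import numpy as np

def soft_update(f_prev, feats, sample_idx, sigmas):
    """One E-step of the modified KDE-EM update.
    f_prev    : (n, K) assignment probabilities from the previous iteration.
    feats     : (n, d) feature vectors (2D coordinates + normalized rgs color).
    sample_idx: indices of the randomly sampled pixels used as kernel centers.
    sigmas    : (d,) kernel bandwidths."""
    x = feats[sample_idx]                                  # (N, d) sample features
    w = f_prev[sample_idx]                                 # (N, K) sample weights f_k^{t-1}(x_i)
    diff = (feats[:, None, :] - x[None, :, :]) / sigmas    # (n, N, d)
    kern = np.exp(-0.5 * np.sum(diff ** 2, axis=2))        # product of Gaussian kernels
    f_new = f_prev * (kern @ w)                            # recursive update before normalization
    f_new /= f_new.sum(axis=1, keepdims=True)              # enforce sum_k f_k^t(y) = 1
    return f_new
```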
Suppose two appearance models a and b are represented as kernel densities in a joint spatial-color space. Assuming a as the reference model and b as the test model, the similarity of the two appearances can be measured by the Kullback-Leibler distance as follows [12][18]:

D_{KL}(\hat{p}^b \,\|\, \hat{p}^a) = \int \hat{p}^b(y) \log\frac{\hat{p}^b(y)}{\hat{p}^a(y)}\, dy,   (4)

where y denotes a feature vector, and p̂^a and p̂^b denote kernel pdf functions. For simplification, the distance is calculated from samples instead of the whole feature set. We need to compare the two kernel pdfs using the same set of samples in the feature space. Given N_a samples x_i, i = 1, 2, ..., N_a from the appearance model a and N_b samples y_k, k = 1, 2, ..., N_b from the appearance model b, the above equation can be approximated by the following form [18] given sufficient samples from the two appearances:

D_{KL}(\hat{p}^b \,\|\, \hat{p}^a) = \frac{1}{N_b} \sum_{k=1}^{N_b} \log\frac{\hat{p}^b(y_k)}{\hat{p}^a(y_k)},   (5)

\hat{p}^a(y_k) = \frac{1}{N_a} \sum_{i=1}^{N_a} \prod_{j=1}^{d} k\!\left(\frac{y_{kj} - x_{ij}}{\sigma_j}\right), \qquad \hat{p}^b(y_k) = \frac{1}{N_b} \sum_{i=1}^{N_b} \prod_{j=1}^{d} k\!\left(\frac{y_{kj} - y_{ij}}{\sigma_j}\right).   (6)
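The sample-based estimate of Equations (5) and (6), together with the min-based symmetric distance defined in the following paragraph, can be sketched as below (a simplified illustration using Gaussian kernels; variable names are ours):

```python
import numpy as np

def kde_density(queries, samples, sigmas):
    """Kernel density of 'queries' under the model defined by 'samples' (Equation 6)."""
    diff = (queries[:, None, :] - samples[None, :, :]) / sigmas
    return np.exp(-0.5 * np.sum(diff ** 2, axis=2)).mean(axis=1)

def kl_distance(samples_b, samples_a, sigmas):
    """D_KL(p^b || p^a) estimated from the samples of model b (Equation 5)."""
    p_b = kde_density(samples_b, samples_b, sigmas)
    p_a = kde_density(samples_b, samples_a, sigmas)
    return np.mean(np.log(p_b / (p_a + 1e-12)))

def appearance_distance(samples_a, samples_b, sigmas):
    """Symmetric distance: the minimum of the two directed KL distances."""
    return min(kl_distance(samples_b, samples_a, sigmas),
               kl_distance(samples_a, samples_b, sigmas))
```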
Since we sample test pixels only from the appearance model b, p̂^b is evaluated by its own samples, and p̂^b is guaranteed to be equal to or larger than p̂^a for any sample y_k. This ensures that the distance D_KL(p̂^b||p̂^a) ≥ 0, where equality holds if and only if the two density models are identical. The Kullback-Leibler distance is a non-symmetric measure in that D_KL(p̂^b||p̂^a) ≠ D_KL(p̂^a||p̂^b). To obtain a symmetric similarity measure between the two appearance models, we define the distance of the two appearances as Dist(p̂^b, p̂^a) = min(D_KL(p̂^b||p̂^a), D_KL(p̂^a||p̂^b)). It is reasonable to choose the minimum as the distance measure since it preserves the balance among (full-full), (full-partial), and (partial-partial) appearance matching, while the symmetrized distance D_KL(p̂^b||p̂^a) + D_KL(p̂^a||p̂^b) would only be effective for (full-full) appearance matching and does not compensate for occlusion.

Fig. 2. Examples of the detection and segmentation process with corresponding convergence graphs. The vertical axis of each convergence graph shows the absolute segmentation difference between two consecutive iterations given by Equation 3.
5 Experimental Results and Analysis
Fig. 2 shows examples of the human segmentation process for small human groups. The results show that our approach can generate accurate pixel-wise segmentation of foreground regions when people are in standing or walking poses. Also, the convergence graphs show that our segmentation algorithm converges to a stable solution in fewer than 10 iterations and gives accurate segmentation of foreground regions for images with discriminating color structures of different humans. The cases of falling into a local minimum with inaccurate segmentation are mostly due to color ambiguity between different foreground objects or misclassification of shadows as foreground. Some inaccurate segmentation results can be found around human heads and feet in Fig. 2 and Fig. 3, and can be reduced by incorporating human pose models as in [17].

Fig. 3. Experiments on different degrees of occlusion between two people

We also evaluated the segmentation performance with respect to the degree of occlusion. Fig. 3 shows the segmentation results given images with varying degrees of occlusion when two people walk across each other in an indoor environment. Note that the degree of occlusion does not significantly affect the segmentation accuracy as long as reasonably accurate detections can be achieved.

Finally, we quantitatively evaluate our segmentation and appearance modeling approach on appearance matching under occlusion. We choose three frames from a test video sequence (containing two people in the scene) and perform segmentation for each of them. Then, the generated segmentations are used to estimate partial or full human appearance models as shown in Fig. 4. We evaluate the two-way Kullback-Leibler distances and the symmetric distance for each pair of appearances and represent them as affinity matrices, shown in Fig. 4. The elements of the affinity matrices quantitatively reflect the accuracy of matching.

Fig. 4. Experiments on appearance matching. Top: appearance models used for matching experiments, Middle: two-way Kullback-Leibler distances, Bottom: symmetric distances.

We also conducted matching experiments using different spatial-color space combinations: the 3D (r, g, s) color space, the 4D (x, r, g, s) space, the 4D (y, r, g, s) space, and the 5D (x, y, r, g, s) space. The affinity matrices show that the 3D (r, g, s) color space and the 4D (y, r, g, s) space produce much better matching results than the other two. This is because color variation is more sensitive in the horizontal direction than in the vertical direction. The color-only feature space obtains good matching performance for this example because the color distributions are significantly different between appearances 1 and 2. But, in reality, there are often cases in which two different appearances have similar color distributions with completely different spatial layouts. On the other hand, the 4D (y, r, g, s) joint spatial-color feature space (color distribution as a function of the normalized human height) enforces spatial constraints on color distributions, hence it has much more discriminative power.
6 Conclusion
We proposed a two-stage foreground segmentation approach by combining human detection and iterative foreground segmentation. The KDE-EM framework is modified and applied to segmentation of groups into individuals. The advantage of the proposed approach lies in simultaneously segmenting people and building appearance models. This is useful for matching and recognizing people when only occluded frames can be used for training. Our future work includes the application of the proposed approach to human tracking and recognition across cameras.
Acknowledgement
This research was funded in part by the U.S. Government VACE program.
References
1. Zhao, T., Nevatia, R.: Tracking Multiple Humans in Crowded Environment. In: CVPR (2004)
2. Smith, K., Perez, D.G., Odobez, J.M.: Using Particles to Track Varying Numbers of Interacting People. In: CVPR (2005)
3. Rittscher, J., Tu, P.H., Krahnstoever, N.: Simultaneous Estimation of Segmentation and Shape. In: CVPR (2005)
4. Elgammal, A.M., Davis, L.S.: Probabilistic Framework for Segmenting People Under Occlusion. In: ICCV (2001)
5. Mittal, A., Davis, L.S.: M2Tracker: A Multi-View Approach to Segmenting and Tracking People in a Cluttered Scene. International Journal of Computer Vision (IJCV) 51(3), 189–203 (2003)
6. Fleuret, F., Lengagne, R., Fua, P.: Fixed Point Probability Field for Complex Occlusion Handling. In: ICCV (2005)
7. Khan, S., Shah, M.: A Multiview Approach to Tracking People in Crowded Scenes using a Planar Homography Constraint. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, Springer, Heidelberg (2006)
8. Kim, K., Davis, L.S.: Multi-Camera Tracking and Segmentation of Occluded People on Ground Plane using Search-Guided Particle Filtering. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, Springer, Heidelberg (2006)
9. Tao, H., Sawhney, H., Kumar, R.: A Sampling Algorithm for Detecting and Tracking Multiple Objects. In: ICCV Workshop on Vision Algorithms (1999)
10. Isard, M., MacCormick, J.: BraMBLe: A Bayesian Multiple-Blob Tracker. In: ICCV (2001)
11. Scott, D.W.: Multivariate Density Estimation. Wiley Interscience, Chichester (1992)
12. Elgammal, A.M., Davis, L.S.: Probabilistic Tracking in Joint Feature-Spatial Spaces. In: CVPR (2003)
13. Zhao, L., Davis, L.S.: Iterative Figure-Ground Discrimination. In: ICPR (2004)
14. Zhao, L., Davis, L.S.: Segmentation and Appearance Model Building from An Image Sequence. In: ICIP (2005)
15. Wang, H., Suter, D.: Tracking and Segmenting People with Occlusions by A Simple Consensus based Method. In: ICIP (2005)
16. Lin, Z., Davis, L.S., Doermann, D., DeMenthon, D.: Hierarchical Part-Template Matching for Human Detection and Segmentation. In: ICCV (2007)
17. Lin, Z., Davis, L.S., Doermann, D., DeMenthon, D.: An Interactive Approach to Pose-Assisted and Appearance-based Segmentation of Humans. In: ICCV Workshop on Interactive Computer Vision (2007)
18. Yu, Y., Harwood, D., Yoon, K., Davis, L.S.: Human Appearance Modeling for Matching across Video Sequences. Special Issue on Machine Vision Applications (2007)
Content-Based Matching of Videos Using Local Spatio-temporal Fingerprints
Gajinder Singh, Manika Puri, Jeffrey Lubin, and Harpreet Sawhney
Sarnoff Corporation
Abstract. Fingerprinting is the process of mapping content, or fragments of it, into unique, discriminative hashes called fingerprints. In this paper, we propose an automated video identification algorithm that employs fingerprinting for storing videos inside its database. When queried using a degraded short video segment, the objective of the system is to retrieve, both accurately and in real time, the original video to which it corresponds. We present an algorithm that first extracts key frames for temporal alignment of the query and its actual database video, and then computes spatio-temporal fingerprints locally within such frames to indicate a content match. All stages of the algorithm have been shown to be highly stable and reproducible even when strong distortions are applied to the query.
1 Introduction
With the growing popularity of free video publishing on the web, there exist innumerable copies of copyright videos floating over the internet unrestricted. A robust content-identification system that detects perceptually identical video content thus benefits popular video-sharing websites and peer-to-peer (P2P) networks of today, by detecting and removing all copyright-infringing material from their databases and preventing any such future uploads made by their users. Fingerprinting offers a solution to query and identify short video segments from a large multimedia repository using a set of discriminative features called fingerprints. Challenges in designing such a system are: 1. Accuracy, to identify content even when it is altered either naturally, because of changes in video formats, resolution, illumination settings, or color schemes, or maliciously, by introducing frame letterbox, camcorder recording, or cropping. This is addressed by employing robust fingerprints that are invariant to such common distortions. 2. Speed, to allow the fingerprinting system to determine a content match with small turn-around times, which is crucial for real-time applications. A common denominator of all fingerprinting techniques is their ability to capture and represent high-dimensional, perceptually relevant multimedia content in the form of short robust hashes [1], essential for fast retrieval. In content-based video identification, the literature reports approaches that compute features such as mean luminance [1], centroid of gradient [2], rank-ordered image intensity
distribution [3], and centroid of gradient orientations [4] over fixed-sized partitions of video frames. The limitation of such features is that they encode complete frame information and therefore fail to identify videos when presented with queries having partially cropped or scaled data. This motivates the use of a local fingerprinting approach. Sivic and Zisserman [5], in their text-retrieval approach for object recognition, make use of maximally stable extremal regions (MSERs), proposed by Matas [6], as representations of each video frame. Since their method clusters semantically similar content together in its visual vocabulary, it is expected to offer poor discrimination, for example, between different seasons of the same TV programme having similar scene settings, camera capture positions, and actors. However, a video fingerprinting system is expected to provide good discrimination between such videos. Similar to [5], Nistér and Stewénius propose an object recognition algorithm that extracts and stores MSERs on a group of images of an object, captured under different viewpoint, orientation, scale, and lighting conditions [7]. During retrieval, a database image is scored depending on the number of MSER correspondences it shares with the given query image. Only the top-scoring hits are then scanned further. Hence, fewer MSER pairs decrease the possibility that the correct database hit figures among the top-ranked images. Since a fingerprinting system needs to identify videos even when queried with short distorted clips, both [5] and [7] become unsuitable. This is because strong degradations such as blurring, cropping, and frame letterboxing result in a smaller number of MSERs in a distorted image as compared to its original. Such degradations thus have a direct impact on algorithm performance because of the change in representation of the frame. Massoudi et al. propose an algorithm that first slices a query video in terms of shots, extracts key-frames, and then performs local fingerprinting [8]. A major drawback of this approach is the fact that even the most common forms of video processing, such as blurring and scaling, disturb the key-frame and introduce misalignment between the query and database frames. In the proposed approach, key-frames correspond to local peaks of maximum intensity change across the video sequence. Such frames are hence reproducible under most distortions. For feature extraction, the information of each key-frame is localized as a set of affine covariant 3D regions, each of which is characterized using a spatio-temporal binary fingerprint. Use of such distortion-invariant fingerprints as an index into the database look-up fetches appreciable advantages in terms of an easy and efficient database retrieval strategy. During database query, we employ a voting strategy to collate the set of local database hits in order to make a global decision about the best-matching video in the database.
2 Proposed Video Fingerprinting Approach
Fig. 1. Framework of proposed localized content-based video fingerprinting approach
Figure 1 shows various stages of the proposed video fingerprinting framework. The input video is first subjected to a series of video preprocessing steps [4]. These include changing the source frame rate to a predefined resampling rate (say, 10 fps), followed by converting each video frame to grayscale and, finally, resizing all frames to a fixed width (w) and height (h) (in our case, w = 160 and h = 120 pixels). Such preprocessing achieves two benefits, one being robustness against changes in color formats or resizing, the other speed of retrieval for large-size videos (by choosing w and h accordingly). To reduce complexity, the frame sequence is examined to extract key-points that correspond to local peaks of maximum change in intensity. Since maximum change in intensity reduces the stability of the regions detected in key-frames, we store a few neighboring frames on either side of the key-frame to maintain minimal database redundancy.
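As a rough illustration of this preprocessing (using OpenCV; the resampling step here is a simple frame-skipping approximation, not necessarily how the authors implement it):

```python
import cv2

def preprocess_video(path, target_fps=10, size=(160, 120)):
    """Yield resampled, grayscale, resized frames from a video file."""
    cap = cv2.VideoCapture(path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(1, int(round(src_fps / target_fps)))   # keep roughly every step-th frame
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            yield cv2.resize(gray, size)
        idx += 1
    cap.release()
```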
2.1 Region Detector
Stable regions are detected on key-frames for localizing their information. Our region detection process is inspired by the concept of Maximally Stable Volumes (MSVs), proposed by Donoser and Bischof [9] for 3D segmentation. Extending MSERs to the third dimension, i.e., time, an MSV is detected by an extremal property of the intensity function in the volume and on its outer boundary. In the present case, however, to extract regions stable under the set of distortions, we extend MSERs to the third dimension, that is, resolution. The process of extracting MSVs, along with the related terminology adopted in the rest of the paper, is given as follows:
Image Sequence. For a video frame F, consider a set of multi-resolution images F_1, F_2, ..., F_i, ..., F_s, where F_i is obtained when video frame F is subsampled by a factor 2^{i−1}. The size of each F_i is the same as F. Thus, a 3D point (x, y, z) in this space corresponds to pixel (x, y) of frame F at resolution z or, equivalently, F_z(x, y).
Volume. We define V_j^i as the j-th volume such that all 3D points belonging to it have intensities less than (or greater than) i: ∀(x, y, z) ∈ V_j^i iff F_z(x, y) ≤ i (or F_z(x, y) ≥ i).
Connectivity. Volume V_j^i is said to be contiguous if, for all points p, q ∈ V_j^i, there exists a sequence p, a_1, a_2, ..., a_n, q with pAa_1, a_1Aa_2, ..., a_iAa_{i+1}, ..., a_nAq. Here, A is an adjacency relation defined such that two pixels p, q ∈ V_j^i are adjacent (pAq) iff Σ_{i=1}^{3} |p_i − q_i| ≤ 1.
Partial Relationship. Any two volumes V_k^i and V_l^j are nested, i.e., V_k^i ⊆ V_l^j, iff i ≤ j (or i ≥ j).
Maximally Stable Volumes. Let V_1, V_2, ..., V_{i−1}, V_i, ... be a sequence from a partially ordered set of volumes such that V_i ⊆ V_{i+1}. Extremal volume V_i is said to be maximally stable iff v(i) = |V_{i+Δ} \ V_{i−Δ}| / |V_i| has a local minimum at i, i.e., for changes in intensity of magnitude less than Δ, the corresponding change in region volume is zero.
Thus, we represent each video frame as a set of distinguished regions that are maximally stable to intensity perturbation over different scales. The reason for the stability of MSVs over MSERs in most cases of image degradation is that the additional volumetric information enables selection of regions with near-identical characteristics across different image resolutions. The more volatile regions (the ones which split or merge) are hence eliminated from consideration.
2.2 Localized Fingerprint Extraction
Consider a frame sequence {F^1, F^2, ..., F^p, ...}, where F^p denotes the p-th frame of the video. For invariance to affine transformations, each MSV is represented in the form of an ellipse. In terms of the present notation, the pixels of the i-th maximally stable volume V_i^p in frame F^p are made to fit an ellipse denoted by e_i^p. Affine invariance of local image regions makes our fingerprint robust to geometric distortions. Each ellipse e_i^p is represented by (x_i^p, y_i^p, s_i^p, lx_i^p, ly_i^p, α_i^p), where (x_i^p, y_i^p) are its center coordinates, lx_i^p is the major axis, ly_i^p is the minor axis, and α_i^p is the orientation of the ellipse w.r.t. the frame axis. A scale factor s_i^p, which depends upon the ratio of the ellipse area to the total area of the frame, is used to encode bigger regions around MSVs which are very small. The proposed algorithm expresses each local measurement region of a frame in the form of a spatio-temporal fingerprint that offers a compact representation of videos inside the database. For each ellipse e_i^p extracted from frame F^p, the process of fingerprint computation (similar to [1]) is elaborated below:
– Since small regions are more prone to perturbations, each region is blown up by a factor s_i. A rectangular region r_i^p that encloses ellipse e_i^p and has an area of (lx_i^p × s_i^p, ly_i^p × s_i^p) is detected.
– For a scale-invariant fingerprint, we divide r_i^p into a fixed number of R × C blocks.
– Let the mean luminance of block (r, c) ∈ r_i^p be denoted by L_i^p(r, c), where r = [1, ..., R] and c = [1, ..., C]. We choose a spatial filter [-1 1] and a temporal filter [−α 1] for storing the spatio-temporal dimension of a video. In order to reduce susceptibility to noise, we compute a fingerprint between r_i^p and r_i^{p+step}, which is the same as region r_i^p but shifted by step frames. The R × (C − 1) bit fingerprint B_i^p is equal to bit '1' if Q_i^p(r, c) is positive and bit '0' otherwise, where

Q_i^p(r, c) = \left(L_i^{p+step}(r, c+1) - L_i^{p+step}(r, c)\right) - \alpha\left(L_i^p(r, c+1) - L_i^p(r, c)\right).   (1)

Encoding mean luminance makes our fingerprint invariant to photometric distortions. In the current implementation, α = 0.95 and step = 10.
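A simplified sketch of this fingerprint computation (block mean luminances are computed directly here; region extraction and the scale factor are assumed to be handled upstream, and the R = 4, C = 9 grid is only an example choice yielding a 32-bit signature):

```python
import numpy as np

def block_means(region, R=4, C=9):
    """Mean luminance of each block in an R x C grid over a rectangular region."""
    h, w = region.shape
    ys = np.linspace(0, h, R + 1, dtype=int)
    xs = np.linspace(0, w, C + 1, dtype=int)
    return np.array([[region[ys[r]:ys[r+1], xs[c]:xs[c+1]].mean()
                      for c in range(C)] for r in range(R)])

def spatio_temporal_fingerprint(region_p, region_p_step, alpha=0.95, R=4, C=9):
    """R x (C-1) binary fingerprint from Equation (1): spatial filter [-1 1], temporal filter [-alpha 1]."""
    L_p = block_means(region_p, R, C)
    L_ps = block_means(region_p_step, R, C)      # same region, step frames later
    Q = (L_ps[:, 1:] - L_ps[:, :-1]) - alpha * (L_p[:, 1:] - L_p[:, :-1])
    return (Q > 0).astype(np.uint8).ravel()      # R*(C-1) bits
```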
2.3 Database Organization and Lookup
Localized content of each video frame is stored inside the database look-up table (LUT) [1] using 32-bit signatures, as computed by Equation 1. Each LUT entry in turn stores pointers to all video clips with regions having the same fingerprint value. In order to have an affine-invariant representation of the video frame, independent of different query distortions, we store the geometric and shape information of stable region e_i along with its fingerprint inside the database. This is done by transforming the coordinates of the original frame center, denoted by (cx, cy), onto a new reference axis, denoted by (X̂, Ŷ). The new axis has the property of projecting ellipse e_i^p onto a circle, with the ellipse center being the origin of this axis and the ellipse major and minor axes aligned with X̂ and Ŷ, respectively. The coordinates of the original frame center w.r.t. this new reference axis are denoted by (ĉx_i^p, ĉy_i^p). The transformation between (cx, cy) and (ĉx_i^p, ĉy_i^p) is given by:

\hat{cx}_i^p = ((cx - x_i^p)\cos(-\alpha_i^p) - (cy - y_i^p)\sin(-\alpha_i^p)) / (lx_i^p \times s_i^p)   (2)

\hat{cy}_i^p = ((cx - x_i^p)\sin(-\alpha_i^p) + (cy - y_i^p)\cos(-\alpha_i^p)) / (ly_i^p \times s_i^p)   (3)
Apart from the frame center, the transformation of three points c1, c2, c3, located at corners of a prefixed square SQ (in our case, of size 100 × 100) centered at (cx, cy), is also stored inside the database. Let their coordinates w.r.t. the reference axis (X̂, Ŷ) be denoted as ĉ1_i^p, ĉ2_i^p, ĉ3_i^p. These three points are stored for their role in the verification stage. We define sift^p as the 128-dimensional SIFT descriptor [10] of SQ.
The fingerprint database ∪_p (∪_i (B_i^p, ĉx_i^p, ĉy_i^p, e_i^p), sift^p) is therefore expressed as a union of fingerprint entries for each i-th MSV belonging to the p-th video frame, where each MSV entry inside the database consists of fields such as its binary fingerprint B_i^p, ellipse parameters e_i, the frame center's coordinates w.r.t. the reference axis (X̂, Ŷ), and the SIFT descriptor of SQ given by sift^p.
During database retrieval, for a query video frame E^q, we first generate its ellipses and their corresponding fingerprints using Equation 1. Thus, the query frame can be expressed as ∪_j {B_j^q, e_j^q}. Expressing in terms of notation, we use q to denote query and p to denote database parameters. Each of the fingerprints of MSVs belonging to the query frame is used to probe the database for potential candidates. That is, ∀j we query the database to get the candidate set given by ∪_p (∪_i (B_i^p, ĉx_i^p, ĉy_i^p, e_i^p), sift^p).
Every entry in the candidate set could be the correct database match that we are looking for during database retrieval. Hence, we propose the hypothesis that the query frame E^q is the same as an original frame F^p stored inside the database. This will happen when ellipses e_j^q and e_i^p point to identical regions in their respective frames. For every candidate hit produced from the database, we compute the coordinates of the query frame's center by using:

\check{cx}_{i,j}^{p,q} = (\hat{cx}_i^p \times s_j^q \times lx_j^q)\cos(\alpha_j^q) - (\hat{cy}_i^p \times s_j^q \times ly_j^q)\sin(\alpha_j^q) + x_j^q   (4)

\check{cy}_{i,j}^{p,q} = (\hat{cx}_i^p \times s_j^q \times lx_j^q)\sin(\alpha_j^q) + (\hat{cy}_i^p \times s_j^q \times ly_j^q)\cos(\alpha_j^q) + y_j^q   (5)
To speed up query retrieval, we evaluate only a subset of the entire set of candidates by filtering out spurious database hits. For every candidate clip entry, we associate a score sc_{i,j,p,q}, which is defined as:

sc_{i,j,p,q} = fac \times (lx_i^p \times ly_i^p \times s_i^p \times s_i^p \div (w \times h)) + (1 - fac) \times \log(N \div N_j^q),   (6)

where N is the total number of entries present in the database and N_j^q is the number of database hits generated for query fingerprint B_j^q. The first term in Equation 6, (lx_i^p × ly_i^p × s_i^p × s_i^p ÷ (w × h)), signifies that the greater the area represented by the fingerprint of the database image, the higher its score. The second term, log(N ÷ N_j^q), assigns higher scores to unique fingerprints B_j^q that produce fewer database hits. A factor fac ∈ [0, 1] is used for assigning an appropriate weight to each of the two terms in Equation 6. We have chosen fac = 0.5 in the current implementation.
2.4 Scoring and Polling Strategy
As stated earlier, an essential requirement of a video fingerprinting system is speed. For this purpose, we have an additional stage for scoring each database result, followed by a poll to collate all local information and arrive at the final decision. Consider a video's frame sequence as a 3D space, with its third dimension given by the frame number. We divide this space into bins (in our case, of size 2 × 2 × 10), where each bin is described by a three-tuple b ≡ (b1, b2, b3). Thus we merge database hits that (1) have their hypothetical image centers close to each other, and (2) belong to neighboring frames of the same video, considering the fact that the movement of a region across them is appreciably small. The scoring process and the preparation of a candidate clip for the verification stage are as follows (a sketch of the voting step is given after this list):
– ∀ ellipses e_j^q and e_i^p, add the score sc_{i,j,p,q} to the bin in which (čx_{i,j}^{p,q}, čy_{i,j}^{p,q}, q) falls. For each such entry, also calculate č1_{i,j}^{p,q}, č2_{i,j}^{p,q}, č3_{i,j}^{p,q} by using ĉ1_i^p, ĉ2^p, ĉ3^p and the ellipse parameters of e_j^q in equations similar to 4 and 5.
– Pick the entries that fall within the top n scoring bins for the next stage of verification.
– Every database hit which polled into a particular bin and added to its score gives information about the affine transformation by which the database frame can be aligned to the query frame. We compute the average transformation of bin b, denoted by H_b, by taking an average of all č1_{i,j}^{p,q}, č2_{i,j}^{p,q}, č3_{i,j}^{p,q} that polled to bin b.
– The inverse of H_b is applied to siftnum query frames, viz. {E^q, E^{q+1}, ..., E^{q+siftnum}}, to produce a series of frames {E_b^q, E_b^{q+1}, ..., E_b^{q+siftnum}}. Now, these frames are hypothetically aligned to the ones stored within the database which polled to bin b. Carrying out a check on siftnum (in our case, siftnum = 5) frames imposes a tougher constraint on the decision about the correct database video.
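A rough sketch of the voting step described above (the bin layout and the candidate-hit data structure are illustrative assumptions; the score follows Equation 6 with fac = 0.5):

```python
from collections import defaultdict
import math

def vote_into_bins(candidate_hits, n_top=10, bin_size=(2, 2, 10), fac=0.5,
                   frame_w=160, frame_h=120, db_size=1):
    """candidate_hits: iterable of dicts with the hypothesized query-frame center
    (cx_check, cy_check), query frame index q, the database ellipse geometry
    (lx, ly, s), and n_hits, the number of database hits for the query fingerprint."""
    bins = defaultdict(float)
    for hit in candidate_hits:
        score = (fac * (hit['lx'] * hit['ly'] * hit['s'] * hit['s'] / (frame_w * frame_h))
                 + (1 - fac) * math.log(db_size / hit['n_hits']))
        b = (int(hit['cx_check'] // bin_size[0]),
             int(hit['cy_check'] // bin_size[1]),
             int(hit['q'] // bin_size[2]))
        bins[b] += score
    # keep only the top-n scoring bins for the verification stage
    return sorted(bins.items(), key=lambda kv: kv[1], reverse=True)[:n_top]
```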
2.5 Verification
From the top n candidates obtained from the previous stage, the correct database hit is found using SIFT-based verification. Let esift_b^q be the SIFT descriptor of the square ŜQ centered at (cx, cy) in the query's frame E_b^q, with its sides aligned to the frame's axis. The two-step verification procedure is:
– ∀ p database frames that voted to bin b, calculate the Bhattacharyya distance between the descriptors of the aligned query {esift_b^q, esift_b^{q+1}, ..., esift_b^{q+siftnum}} and its database hit {sift^p, sift^{p+1}, ..., sift^{p+siftnum}}.
– If the distance is less than an empirically chosen threshold T, declare a match to database frame p.
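For reference, such a descriptor check might look as follows. The descriptors are treated as normalized histograms, and the exact distance variant used by the authors is not specified here, so this sketch uses the standard negative log of the Bhattacharyya coefficient:

```python
import numpy as np

def bhattacharyya_distance(d1, d2):
    """Bhattacharyya distance between two non-negative descriptors treated as histograms."""
    p = d1 / (d1.sum() + 1e-12)
    q = d2 / (d2.sum() + 1e-12)
    bc = np.sum(np.sqrt(p * q))                 # Bhattacharyya coefficient
    return -np.log(max(bc, 1e-12))

def verify(query_descs, db_descs, T=0.2):
    """Average descriptor distance over the aligned frames; match if below threshold T."""
    dists = [bhattacharyya_distance(a, b) for a, b in zip(query_descs, db_descs)]
    return float(np.mean(dists)) < T
```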
3 Experimental Validation
Results presented in this section have been calculated for a database populated with 1200 videos of 4 minutes each (in total, 80 hours of data). The video material stored in the database contains clips taken from different seasons of TV programmes, music videos, animated cartoons, and movies. Analysis of our algorithm (method 1) and its comparison with the approaches described in [8] (method 2) and [1] (method 3) has been carried out using 2400 clips, each of duration 10 seconds and picked from different locations of their respective original videos. Robustness of the proposed algorithm has been evaluated under video degradations such as spatial blur (Gaussian blurring of radius 4), contrast modification (query clips with 25% reduced contrast), scaling (the resolution of each video frame is changed by a factor of 1.5), frame letterbox (black bands, of width 100 pixels, appended around a video frame), DivX compression (alters the bitrate by a factor of 4), temporal cropping (query clips having only 10% of the frames present in the original video), spatial cropping (30% of the spatial information is cropped in each video frame), and camcorder recording (query clips captured using a camcorder). The parameters of our method are: the top n candidate bins that are chosen based on scoring and polling, and the threshold T used in the verification stage. By setting empirical parameter values n = 10 and T = 0.2 (justified later in the section), the results shown in Figure 2 have been computed.

Table 1. Query retrieval timings for different distortions

Distortion              Feature Time   Query Time   Verification Time
Spatial Blur            0.4984         1.0103       1.4660
Contrast Modification   0.3173         0.3229       0.9977
Scaling                 0.4308         0.9442       1.4236
Frame Letterbox         0.3886         0.9117       1.1160
DivX Compression        0.4396         1.0622       1.5445
Temporal Cropping       0.3239         0.6657       0.9350
Spatial Cropping        0.4874         1.2064       1.1237
Camcorder Capture       0.3729         0.9975       0.7923

Fig. 2. Comparison of different fingerprinting algorithms (a) Repeatability of key-frames and (b) Percentage of queries that are correctly retrieved from the database

Figure 2(a) demonstrates the repeatability of key-frames detected by our algorithm, as compared to method 2. As reflected by the graph, even under strong distortions such as spatial crop, the percentage of key-frames re-detected in the query does not fall below 80%. The percentage of queries, out of all the query clips performed on the database, which were retrieved correctly is shown in Figure 2(b). The use of local, affine- and scale-invariant fingerprints makes our approach better than method 3, which fails for distortions like spatial cropping. At the same time, the ability to encode reproducible key-frames makes our system more robust than method 2. The variation in performance of the proposed system as a result of changing the parameters n and T is demonstrated in Figure 3. Every row of Figure 3 shows three different graphs for each type of distortion. Due to paucity of space, we demonstrate graphs only for the first five types of degradations, following the order mentioned above.

The first column of Figure 3 shows the efficacy of our proposed scoring scheme. On the y-axis, it plots the percentage of query fingerprints that produce database hits of the correct video or the incorrect video, or result in no hits at all (misses). On the x-axis, such analysis is repeated for different values of the top n scoring bins as they are varied from 10 to 100. It can be noticed that in all cases of distortion, the y-axis distribution remains unchanged even as more bins are sent to the verification stage. In other words, correct database hits always fall within the top 10 scoring bins, even when the query is altered significantly. Considering a scenario in which the size of the database is increased, the number of false database hits generated by a query fingerprint is expected to go up. However, due to the invariance of the proposed scoring strategy to the number of database hits, it is established that while correct hits are always scored at the top, false hits fall within lower scoring bins. Such a property demonstrates the scalability of our proposed fingerprinting scheme.

The second column of Figure 3 plots on the y-axis the probability of the query and the database hit being the same (P_same) or different (P_diff). The x-axis shows the corresponding Bhattacharyya distance between their SIFT descriptors. As reflected by the graphs, the point where P_same and P_diff intersect (i.e., where the probability of the query and the database candidate being identical becomes less than that of the two being different) is the threshold T. The third column of Figure 3 plots the distribution of queries versus the number of key-frames it takes for them to produce a correct hit.
Fig. 3. An evaluation of changes in retrieval results as parameters of the system are varied for different query distortions. Row 1: Spatial Blurring, Row 2: Contrast Modification, Row 3: Scaling, Row 4: Frame Letterboxing and Row 5: DivX Compression. The first column {(a),(d),(g),(j),(m)} shows the percentage of query fingerprints that are correct, incorrect or a miss, for different numbers n of top scoring bins scanned by the verification stage. The second column {(b),(e),(h),(k),(n)} shows the probability of two videos being identical (P_same) or different (P_diff) and the Bhattacharyya distance between their SIFT descriptors. The third column {(c),(f),(i),(l),(o)} plots the probability of the number of iterations it takes to complete the retrieval process.
The curve for queries with higher distortions spans a larger number of iterations. In our case, even though the distortions are stronger than the ones used in the literature, a large percentage of queries lead to hits within the first 10 iterations. Table 1 tabulates the retrieval timings for different types of distorted queries. Given a query, the total time it takes for its retrieval can be split into: the time taken to detect stable regions in a key-frame and compute their fingerprints (Feature Time), the time taken to obtain database hits for each fingerprint and score them accordingly (Query Time), and finally the time taken to process the top 10 candidates for SIFT-based verification (Verification Time). All queries have been executed on a 1.6 GHz Pentium IV machine equipped with 1 GB RAM and take a maximum of 3 seconds for retrieval under all distortions.
4 Conclusions
In this paper, a new video fingerprinting scheme has been proposed. The primary novelties of this approach are: extraction of repeatable key-frames, introduction of affine-invariant MSVs that are stable over different frame resolutions, and a scoring and polling strategy to merge local database hits and bring the ones belonging to the correct video within the top rank order. With all the desirable properties of a video fingerprinting paradigm, our system can be used as a robust perceptual hashing solution to identify pirated copies of copyright videos on the internet. We are currently working on new ways of making the algorithm foolproof against distortions such as spatial cropping.
References
1. Oostveen, J., Kalker, T., Haitsma, J.: Feature extraction and a database strategy for video fingerprinting. In: International conference on recent advances in visual information systems, pp. 117–128 (2002)
2. Hampapur, A., Bolle, R.: Videogrep: video copy detection using inverted file indices. Technical report, IBM research (2001)
3. Hua, X.-S., Chen, X., Zhang, H.-J.: Robust video signature based on ordinal measure. ICIP 1, 685–688 (2004)
4. Lee, S., Yoo, C.-D.: Video fingerprinting based on centroids of gradient orientations. In: ICASSP, pp. 401–404 (2006)
5. Sivic, J., Zisserman, A.: Video Google: A text retrieval approach to object matching in videos. ICCV 2, 1–8 (2003)
6. Matas, J., Chum, O., Martin, U., Pajdla, T.: Robust wide baseline stereo from maximally stable extremal regions. BMVC 1, 384–393 (2002)
7. Nistér, D., Stewénius, H.: Scalable recognition with a vocabulary tree. CVPR 2, 2161–2168 (2006)
8. Massoudi, A., Lefebvre, F., Demarty, C.-H., Oisel, L., Chupeau, B.: A video fingerprint based on visual digest and local fingerprints. ICIP, 2297–2300 (2006)
9. Donoser, M., Bischof, H.: 3D segmentation by maximally stable volumes (MSVs). ICPR, 63–66 (2006)
10. Lowe, D.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004)
Automatic Range Image Registration Using Mixed Integer Linear Programming
Shizu Sakakubara¹, Yuusuke Kounoike², Yuji Shinano¹, and Ikuko Shimizu¹
¹ Tokyo Univ. of Agri. & Tech.  ² Canon Inc.
Abstract. A coarse registration method using Mixed Integer Linear Programming (MILP) is described that finds globally optimal registration parameter values that are independent of the values of invariant features. We formulate the range image registration problem using MILP. Our algorithm, based on this MILP formulation, finds the best balanced optimal registration, robustly aligning two range images with the best balanced accuracy. It adjusts the error tolerance automatically in accordance with the accuracy of the given range image data. Experimental results show that this method of coarse registration is highly effective.
1 Introduction
Automatic 3D model generation of real world objects is an important technique in various fields. Range sensors are useful for generating 3D models because they directly measure the 3D shape of objects. The range image measured by a range sensor reflects the partial 3D shape of the object expressed in the coordinate system corresponding to the pose and position of the sensor. Therefore, to obtain the whole shape of the object, many range images measured from different viewpoints must be expressed in a common coordinate system. Estimation of the relative poses and positions of sensors from range images is called range image registration. In general, the registration process is divided into two phases: coarse registration and fine registration. For fine registration, the ICP (Iterative Closest Point) algorithm by Besl and McKay [1] and its extensions [11] are widely used. However, they need sufficiently good initial estimates to achieve fine registration because their error functions have a number of local minima. Therefore, many coarse registration methods have been developed for obtaining good initial estimates automatically. Many of these coarse registration methods [2] match invariant features under rigid transformation. For example, spin images [7] are matched based on their cross-correlation. A splash [12], a point signature [4], and a spherical attribute image [6] are used as indices representing surface structures in a hash table for matching. For the modeling of buildings, planar regions [5] and circular features [3] are used for matching. These methods are effective if the features can be sufficiently discriminated and their values can be accurately calculated. However, they cannot guarantee global optimality. In addition, registration has been formalized as a discrete optimization task of finding the maximum strict sub-kernel in a graph [10]. While the globally optimal solution can be obtained with this method, the solution depends on the quality of the features.
However, invariant features such as curvatures are difficult to calculate stably because they are greatly affected by occlusion and discretization of the object surface. In this paper, we propose a novel coarse registration method using Mixed Integer Linear Programming (MILP) that guarantees global optimality. It aligns two range images independently of the values of invariant features. Since MILP problems are NP-hard problems, they cannot be solved to optimality in a reasonable amount of time when they exceed a certain size. However, progress in algorithms, software, and hardware has dramatically enlarged this size. Twenty years ago, a mainframe computer was needed to solve problems with a hundred integer variables. Now it is possible to solve problems with thousands of integer variables on a personal computer [8]. The advances related to MILP have made our proposed algorithm possible.
2 Definition of Best Balanced Optimal Registration
Consider the registration of two sets of points in R^3. Let V^1 = {v_1^1, v_2^1, ..., v_{n_1}^1} be the source point set and V^2 = {v_1^2, v_2^2, ..., v_{n_2}^2} the target one. Assume that these points are corrupted by additive noises and that they are independent random variables with zero mean and unknown variance. The task of registration is to estimate the parameter values of a rigid transformation T such that T aligns the two point sets. We define the rigid transformation as T(R, t; v) = Rv + t, where R is a 3 × 3 rotation matrix and t is a translation vector.
To deal with occlusion and discretization of range images, we introduce a more accurate point set

MAPS(T, φ, ε) = { v_i^1 : ||v_{φ(i)}^2 − T(R, t; v_i^1)||_∞ ≤ ε },

where the function φ(i) denotes the index of the corresponding target point for each source point and the constant ε is a threshold value to remove outliers. To obtain a robust estimation of the rigid transformation, the number of elements of MAPS(T, φ, ε), |MAPS(T, φ, ε)|, needs to be maximized. Because no a priori correspondence between the source and target points is given, the function φ must be determined. And, we make no assumption about the variance of the noise. Hence, the value of ε cannot be estimated in advance. However, if a number N is given, we can find the minimal value of ε such that |MAPS(T, φ, ε)| ≥ N. Therefore, we define the N-accuracy optimal registration as the (T̂_N, φ̂_N) that minimizes ε,

(T̂_N, φ̂_N) = argmin_{T,φ} { ε : |MAPS(T, φ, ε)| ≥ N },
and define the optimal registration error ε̂_N of the N-accuracy optimal registration as ε̂_N = min{ ε : |MAPS(T, φ, ε)| ≥ N }. If ε_1 ≥ ε_2, then max |MAPS(T, φ, ε_1)| ≥ max |MAPS(T, φ, ε_2)|. On the other hand, if N_1 ≤ N_2, then min{ ε : |MAPS(T, φ, ε)| ≥ N_1 } ≤ min{ ε : |MAPS(T, φ, ε)| ≥ N_2 }. A well-balanced registration with a large number of elements in MAPS(T, φ, ε) and a small ε value is desired. Therefore, we define the best balanced optimal registration as the (T̂, φ̂) that attains the N̂ such that

\hat{N} = \operatorname{argmax}_{3 \le N \le \kappa} \{ |MAPS(\hat{T}_N, \hat{\phi}_N, \epsilon_{max})| : (\hat{T}_N, \hat{\phi}_N) = \operatorname{argmin}_{T,\phi} \{ \epsilon : |MAPS(T, \phi, \epsilon)| \ge N \} \},   (1)
where κ is the maximum value for N and ε_max = max_N { ε̂_N }. The lower bound for N is three because a rigid transformation can be estimated using no fewer than three point pairs. To avoid ε_max becoming too large, we give it an upper bound ε̄. In best balanced optimal registration, the N-accuracy optimal registration (T̂_N, φ̂_N) for each N from three to κ is evaluated using |MAPS(T̂_N, φ̂_N, ε_max)|. We call N̂ and ε̂_N̂ the best balanced values. They depend on the accuracy of the original point sets. In general, this accuracy is unknown in advance. Our proposed method finds the best balanced optimal registration automatically.
We assume that a similarity s_ij between any pair of points v_i^1 and v_j^2 is given. Because the feature values of corresponding points are similar, we assume that point pairs that have explicitly low similarity are not corresponding and remove them from the candidates for corresponding point pairs. Note that precise values of the features are not needed, because the feature values are used only to remove obviously non-corresponding pairs.
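The definition in Equation (1) can be read as the following selection procedure. This is only an illustrative outline, assuming a routine solve_N_accuracy that returns (T̂_N, φ̂_N, ε̂_N) for a given N (the MILP formulation of that routine follows in the next section) and a routine count_matches that evaluates |MAPS(T, φ, ε)|:

```python
def best_balanced_registration(solve_N_accuracy, count_matches, kappa, eps_bar):
    """solve_N_accuracy(N) -> (T_N, phi_N, eps_N); count_matches(T, phi, eps) -> |MAPS(T, phi, eps)|."""
    solutions = {}
    for N in range(3, kappa + 1):                       # N-accuracy optimal registrations
        T_N, phi_N, eps_N = solve_N_accuracy(N)
        if eps_N <= eps_bar:                            # respect the upper bound on the error
            solutions[N] = (T_N, phi_N, eps_N)
    eps_max = max(eps for (_, _, eps) in solutions.values())
    # pick the N whose registration explains the most points at tolerance eps_max
    best_N = max(solutions,
                 key=lambda N: count_matches(*solutions[N][:2], eps_max))
    return best_N, solutions[best_N]
```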
3 Mixed Integer Linear Programming Based Registration
In this section, we first introduce the general form of Mixed Integer Linear Programming (MILP) problems. Next, we give an MILP formulation of N-accuracy optimal registration. We also give an ILP formulation to evaluate each N-accuracy registration by counting |MAPS(T̂_N, φ̂_N, ε_max)|. Note that this step does not need to be formalized as an ILP, because it can be done in many other ways. Then, we describe our algorithm for obtaining the best balanced optimal registration. Finally, we address the problem size issue and our approach to overcoming it.
3.1 Mixed Integer Linear Programming Problems
An MILP problem is the minimization or maximization of a linear objective function subject to linear constraints and to some of the variables being integers. Here, without loss of generality, we consider a maximization problem. Explicitly, the problem has the form

(MILP)   max c^T x   sub. to Ax ≤ b,  l ≤ x ≤ u,  x_i : integer, i = 1, ..., p,
where A is an m × n matrix, c, x, l, and u are n-vectors, and b is an m-vector. The elements of c, l, u, b, and A are constants, and x is a variable vector. In this MILP problem, we assume 1 ≤ p ≤ n. If p = n, the problem is called an Integer Linear Programming (ILP) problem. If none of the variables are required to be integers, the problem is called a Linear Programming (LP) problem. The set S = {x : Ax ≤ b, l ≤ x ≤ u, x_i : integer, i = 1, ..., p} is called the feasible region, and an x ∈ S is called a feasible solution. A feasible solution x* that satisfies c^T x* ≥ c^T x for all x ∈ S is called an optimal solution, and the value c^T x* is called the optimal value.
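As a concrete illustration of this general form, the following sketch poses and solves a small MILP with the open-source PuLP modeling package; the choice of PuLP and the toy coefficients are our own assumptions (the experiments in this paper use ILOG CPLEX), so this is only a schematic of how such problems are stated.

import pulp

# A toy instance of max c^T x s.t. Ax <= b, l <= x <= u, with some x_i integer.
prob = pulp.LpProblem("toy_milp", pulp.LpMaximize)
x1 = pulp.LpVariable("x1", lowBound=0, upBound=10, cat="Integer")
x2 = pulp.LpVariable("x2", lowBound=0, upBound=10, cat="Integer")
x3 = pulp.LpVariable("x3", lowBound=0, upBound=5)      # continuous variable

prob += 3 * x1 + 2 * x2 + x3                           # objective c^T x
prob += 2 * x1 + x2 + x3 <= 12                         # rows of Ax <= b
prob += x1 + 3 * x2 <= 9

prob.solve()
print(pulp.LpStatus[prob.status], pulp.value(prob.objective))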
MILP problems are NP-hard in general, and heuristic methods for solving them are therefore widely used. In this paper, however, we apply exact methods. Owing to recent progress in MILP techniques, relatively large-scale MILP problems can be solved to optimality with commercial solvers. Throughout this paper, the solutions of MILP problems are optimal ones.

3.2 MILP Formulation of the N-Accuracy Optimal Registration

An MILP formulation of N-accuracy optimal registration requires that the constraints on the rigid transformation be written in linear form. Since this cannot be done straightforwardly, we introduce a pseudo rotation matrix R' and a pseudo translation vector t', and give linear constraints so that R' and t' are close to R and t. Symbols denoted with a prime (') are obtained using R' and t'.

Initially, to simplify the notation, we define an index set for a point set V as I(V) := {i : v_i ∈ V}, and define ABS(x) := (|x_1|, |x_2|, |x_3|)^T for a vector x = (x_1, x_2, x_3)^T.

We introduce a corresponding point pair vector p = (p_11, ..., p_{1 n_2}, p_21, ..., p_{n_1 1}, ..., p_{n_1 n_2})^T as one of the decision variables. Each element p_ij is a 0-1 integer variable that is designed to be 1 if source point v_i^1 corresponds to target point v_j^2, that is, j = φ(i), and to be zero otherwise. The other decision variables are the elements of R' and t'; all of them are continuous. We denote the ij-th element of R' as r'_ij (i, j = 1, 2, 3) and the i-th element of t' as t'_i (i = 1, 2, 3). Upper and lower bounds are given: r̲_ij ≤ r'_ij ≤ r̄_ij and t̲_i ≤ t'_i ≤ t̄_i. There are trivial bounds −1 ≤ r'_ij ≤ 1 (i, j = 1, 2, 3).

As mentioned above, the constraints on the elements of the rotation matrix cannot be written in linear form straightforwardly. Therefore, we formulate this condition indirectly. Let v_i^1, v_k^1 ∈ V^1 and v_j^2, v_l^2 ∈ V^2, and let d(v_i^1, v_k^1) be the distance between v_i^1 and v_k^1 and d(v_j^2, v_l^2) the distance between v_j^2 and v_l^2. If |d(v_i^1, v_k^1) − d(v_j^2, v_l^2)| > ε_l, then either j ≠ φ(i) or l ≠ φ(k) must hold. This condition can be formulated as

p_ij + p_kl ≤ 1,  (i, k ∈ I(V^1), j, l ∈ I(V^2), |d(v_i^1, v_k^1) − d(v_j^2, v_l^2)| > ε_l),

and can be further rewritten with fewer constraints as

p_ij + Σ_{l ∈ I(V^2), |d(v_i^1, v_k^1) − d(v_j^2, v_l^2)| > ε_l} p_kl ≤ 1,  (i, k ∈ I(V^1), j ∈ I(V^2)).

Here, ε_l depends on ε. Since a rigid transformation error within ±ε is allowed, distance differences of up to 2√3 ε should be tolerated. Therefore, ε_l ≥ 2√3 ε needs to be satisfied.

We assume a one-to-one correspondence between subsets of V^1 and V^2. That is, two conditions need to be satisfied:

Σ_{j ∈ I(V^2)} p_ij ≤ 1, (i ∈ I(V^1))   and   Σ_{i ∈ I(V^1)} p_ij ≤ 1, (j ∈ I(V^2)).

The minimal number of elements of MAPS(T, φ, ε) is given as a constant N. Hence,

Σ_{i ∈ I(V^1)} Σ_{j ∈ I(V^2)} p_ij ≥ N.

For the points in MAPS(T, φ, ε), only when p_ij = 1 must ABS(v_j^2 − R' v_i^1 − t') ≤ e hold (i ∈ I(V^1), j ∈ I(V^2)), where e = (ε, ε, ε)^T. When p_ij is set to zero, these conditions need to be eliminated. This can be achieved by introducing a continuous vector M = (m_1, m_2, m_3)^T that is chosen so as to always satisfy ABS(v_j^2 − R' v_i^1 − t') ≤ M, (i ∈ I(V^1), j ∈ I(V^2)). Using this vector, we can formulate the condition for the points in MAPS(T, φ, ε) as

ABS(v_j^2 − R' v_i^1 − t') ≤ M(1 − p_ij) + e,  (i ∈ I(V^1), j ∈ I(V^2)),

and further rewrite it as

ABS( Σ_{j ∈ I(V^2)} p_ij v_j^2 − R' v_i^1 − t' ) ≤ M(1 − Σ_{j ∈ I(V^2)} p_ij) + e,  (i ∈ I(V^1)).

If several variables can be fixed in advance, the MILP problem can be solved more quickly. To fix p_ij to zero in advance, we use the similarity: p_ij = 0, (i ∈ I(V^1), j ∈ I(V^2), s_ij < ε_s), where ε_s is a given parameter. If the similarity s_ij is small enough, the pair v_i^1 and v_j^2 is removed from the putative corresponding point pairs. The objective of N-accuracy optimal registration is to minimize ε. Therefore, we can give an MILP formulation as follows.
(P_1)  min  ε
 sub. to
   Σ_{j ∈ I(V^2)} p_ij ≤ 1,  (i ∈ I(V^1)),
   Σ_{i ∈ I(V^1)} p_ij ≤ 1,  (j ∈ I(V^2)),
   Σ_{i ∈ I(V^1)} Σ_{j ∈ I(V^2)} p_ij ≥ N,
   Σ_{j ∈ I(V^2)} p_ij v_j^2 − R' v_i^1 − t' ≥ −M(1 − Σ_{j ∈ I(V^2)} p_ij) − e,  (i ∈ I(V^1)),
   Σ_{j ∈ I(V^2)} p_ij v_j^2 − R' v_i^1 − t' ≤ M(1 − Σ_{j ∈ I(V^2)} p_ij) + e,  (i ∈ I(V^1)),
   p_ij + Σ_{l ∈ I(V^2), |d(v_i^1, v_k^1) − d(v_j^2, v_l^2)| > ε_l} p_kl ≤ 1,  (i, k ∈ I(V^1), j ∈ I(V^2)),
   r̲_ij ≤ r'_ij ≤ r̄_ij,  (i = 1, 2, 3, j = 1, 2, 3),
   t̲_i ≤ t'_i ≤ t̄_i,  (i = 1, 2, 3),
   p_ij = 0,  (i ∈ I(V^1), j ∈ I(V^2), s_ij < ε_s),
   p_ij ∈ {0, 1},  (i ∈ I(V^1), j ∈ I(V^2)).
Note that ε_l is a given parameter and affects the optimal value of (P_1). Here, we define S_i and ε̂_i as the feasible region and the optimal value of (P_1) with a given parameter ε_{l_i}. If ε_{l_1} ≥ ε_{l_2}, then S_1 ⊃ S_2, because a smaller ε_l value forces more elements of p to be zero, that is, it restricts the feasible region. Therefore, if ε_{l_1} ≥ ε_{l_2}, then ε̂_1 ≤ ε̂_2.

3.3 ILP Formulation to Count |MAPS(T̂_N, φ̂_N, ε_max)|

In this section, we describe the ILP problem for evaluating (T̂_N, φ̂_N) by counting |MAPS(T̂_N, φ̂_N, ε_max)|. Its formulation uses almost the same components as described in the previous section, except for the following ones.

– The rigid transformation is given. In the formulation in this subsection, a rotation matrix R and a translation vector t with constant elements are used. They are calculated from the corresponding point pair vector obtained with R' and t' in our algorithm.
– ε_max is not a variable but a constant.
– N is not given: the number of correspondences is maximized in this formulation.
We can give an ILP formulation for counting |MAPS(T̂_N, φ̂_N, ε_max)| as follows, using e_max = (ε_max, ε_max, ε_max)^T.

(P_2)  max  Σ_{i ∈ I(V^1)} Σ_{j ∈ I(V^2)} p_ij
 sub. to
   Σ_{j ∈ I(V^2)} p_ij ≤ 1,  (i ∈ I(V^1)),
   Σ_{i ∈ I(V^1)} p_ij ≤ 1,  (j ∈ I(V^2)),
   Σ_{j ∈ I(V^2)} p_ij v_j^2 − R v_i^1 − t ≥ −M(1 − Σ_{j ∈ I(V^2)} p_ij) − e_max,  (i ∈ I(V^1)),
   Σ_{j ∈ I(V^2)} p_ij v_j^2 − R v_i^1 − t ≤ M(1 − Σ_{j ∈ I(V^2)} p_ij) + e_max,  (i ∈ I(V^1)),
   p_ij = 0,  (i ∈ I(V^1), j ∈ I(V^2), s_ij < ε_s),
   p_ij ∈ {0, 1},  (i ∈ I(V^1), j ∈ I(V^2)).
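The following sketch shows how a formulation of the shape of (P_2) can be assembled and solved with PuLP for a fixed rigid transformation; the solver choice, the dense loops, and the helper names are our own illustrative assumptions, not the authors' implementation (which uses ILOG CPLEX).

import numpy as np
import pulp

def count_maps_ilp(V1, V2, R, t, eps_max, sim, eps_s, M=1000.0):
    """Maximize the number of one-to-one correspondences whose residual under
    the fixed transformation (R, t) is within eps_max in every coordinate.
    V1, V2: (n, 3) numpy arrays; sim[i][j]: similarity of the pair (i, j);
    M must bound the residual magnitudes (illustrative default)."""
    n1, n2 = len(V1), len(V2)
    prob = pulp.LpProblem("P2_count_maps", pulp.LpMaximize)
    p = [[pulp.LpVariable(f"p_{i}_{j}", cat="Binary") for j in range(n2)]
         for i in range(n1)]

    # Objective: total number of selected correspondences.
    prob += pulp.lpSum(p[i][j] for i in range(n1) for j in range(n2))

    # One-to-one correspondence between subsets of the two point sets.
    for i in range(n1):
        prob += pulp.lpSum(p[i][j] for j in range(n2)) <= 1
    for j in range(n2):
        prob += pulp.lpSum(p[i][j] for i in range(n1)) <= 1

    # Big-M residual bounds per coordinate, active only when a pair is selected.
    for i in range(n1):
        pred = R @ V1[i] + t
        row = pulp.lpSum(p[i][j] for j in range(n2))
        for c in range(3):
            expr = pulp.lpSum(p[i][j] * float(V2[j][c]) for j in range(n2)) - float(pred[c])
            prob += expr <= M * (1 - row) + eps_max
            prob += expr >= -M * (1 - row) - eps_max

    # Prune pairs with clearly low similarity in advance.
    for i in range(n1):
        for j in range(n2):
            if sim[i][j] < eps_s:
                prob += p[i][j] == 0

    prob.solve()
    return int(sum(pulp.value(p[i][j]) for i in range(n1) for j in range(n2)))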
3.4 Algorithm for Best Balanced Optimal Registration

We now describe our algorithm for obtaining the parameter values of the best balanced optimal registration. Equation (1) is solved in two phases.

Phase 1. For N from 5 to κ, find ε̂_N and obtain (T̂_N, φ̂_N) as p̂_N. Each p̂_N that attains ε̂_N ≤ ε̄ is stored in a list L. The minimum N value is set to five to avoid trivial solutions.

Phase 2. For each (p̂_N, ε̂_N) ∈ L, construct (T̂_N, φ̂_N) from p̂_N and count |MAPS(T̂_N, φ̂_N, ε_max)|, where ε_max = max{ε̂_N : (p̂_N, ε̂_N) ∈ L}. Then select the N̂ that attains argmax_N {|MAPS(T̂_N, φ̂_N, ε_max)| : N such that (p̂_N, ε̂_N) ∈ L}.
In Phase 1, to solve problem (P_1), an ε_l needs to be given. However, the value of ε_l affects the optimal value of the problem. The minimum ε̂_N value is found by solving (P_1) repeatedly while narrowing its range. We find the corresponding point pair vector that attains the best balanced value N̂ with the following two-phase algorithm.

[Algorithm] MILP-based Registration(ε̄)
  L = ∅;
  /* Phase 1 */
  for (N = 5; N ≤ κ; N++) {
    ε̄_N = ε̄;   /* ε̄_N: upper bound for ε̂_N */
    ε̲_N = 0.0;  /* ε̲_N: lower bound for ε̂_N */
    ε_N = ε̄;
    while (ε̄_N > ε̲_N) {  /* narrowing process to find ε̂_N */
      ε_l = 2√3 ε_N;
      Solve problem (P_1) with given parameters N and ε_l;
      /* let p̂_N be the optimal solution and ε̂_N the optimal value of problem (P_1) */
      if (no feasible solution is found) { break; }
      if (ε̂_N < ε_N) {
        if (ε̂_N > ε̲_N) { ε̲_N = ε̂_N; }
        ε̄_N = ε_N;
      } else {
        if (ε̂_N > ε̄) { break; }
        if (ε̂_N < ε̄_N) { ε̄_N = ε̂_N; }
        ε̲_N = ε_N;
      }
      ε_N = (ε̄_N − ε̲_N)/2 + ε̲_N;
    }
    L = L ∪ {(p̂_N, ε̂_N)};
  }
  if (L = ∅) { return “No solution”; /* the ε̄ value is too small */ }
  /* Phase 2 */
  N̂ = 0;  ε_max = max{ε̂_N : (p̂_N, ε̂_N) ∈ L};
  while (L ≠ ∅) {
    Extract (p̂_N, ε̂_N) from L;  /* L = L \ {(p̂_N, ε̂_N)} */
    Construct (T̂_N, φ̂_N) from p̂_N and count |MAPS(T̂_N, φ̂_N, ε_max)| by solving problem (P_2) with the given parameter ε_max;
    /* let N_tp = |MAPS(T̂_N, φ̂_N, ε_max)| */
    if (N̂ < N_tp) { N̂ = N_tp;  p̂_N̂ = p̂_N; }
  }
  return (N̂, p̂_N̂);
This algorithm has three particular advantages.

– Although ε̄ is initially given, ε_l is automatically adjusted to an appropriate value in accordance with the accuracy of the point sets.
– If the value of ε̄ is too small to find a solution, the algorithm outputs “No solution.” If ε̄ is too big, it outputs a feasible solution at the cost of time.
– Because the optimal values of (P_1) are used, the narrowing process completes faster than a binary search.

3.5 Problem Size Issue and How to Overcome It

Unfortunately, MILP solvers and computers still do not have enough power to solve the MILP problem (P_1) for all point sets obtained from normal-size range images. Therefore, we preselect feature points from V^1 and V^2. Since our algorithm evaluates only the distances between corresponding point pairs, it is robust against noise and the preselection method. After the algorithm is applied to the preselected point sets V̇^1 ⊂ V^1 and V̇^2 ⊂ V^2, the robustness of the registration can be improved by solving the ILP problem (P_2) with V̇^1 and V^2. Current MILP solvers and computers have enough power to solve this size of problem (P_2) within a reasonable time.
4 Experiments

We tested the robustness of our method by applying it to three synthetic datasets and one real dataset. We used the ILOG CPLEX (ver. 10.1) MILP solver installed on a PC
with a Pentium D 950 CPU (3.4 GHz), 2 GB of RAM, and Linux 2.6. We used GNU GLPK to generate the files for our formalization. We used the algorithm of Umeyama [13] to estimate the rigid transformation from the point correspondences.

We used the 3D models "Stanford Bunny" [14], "Horse" [16], and "Armadillo" [14] to generate synthetic range images. We generated 18 synthetic range images of size 200 × 200 pixels for each model by rotating it in 20-degree steps around the Y axis. Then, to generate five noisy datasets for each model, the Z coordinate of each point was perturbed by adding Gaussian noise with zero mean and a standard deviation of σ = 0.02, 0.04, 0.06, 0.08, or 0.10. We also applied our method to the real range images of "Pooh" [15].

The selected feature points had curvedness [9] values that were maximal subject to a constraint on the distance between feature points. The curvedness c of a point is calculated by c = √((κ_1^2 + κ_2^2)/2), where κ_1 and κ_2 are the two principal curvatures of the point. The number of feature points per range image was 50 for "Stanford Bunny" and "Horse", and 70 for "Armadillo" and "Pooh". The similarity s_ij between feature points v_i^1 and v_j^2 was defined using the curvature values of the local surfaces around the points. It is set to −100 if the shapes of the surfaces (such as convex or concave) are not identical; otherwise it is calculated from the curvedness values c_i and c_j as s_ij = 1/|c_i − c_j|.

We applied our method to all adjacent pairs in the range image sequences. For all images, κ = 10, ε_s = 10, r̲_ij = −1, and r̄_ij = 1 (i, j = 1, 2, 3). For the synthetic datasets, ε̄ = 0.15, M = (100, 100, 100)^T, t̲_i = −10, and t̄_i = 10 (i = 1, 2, 3). We also used ε̄ = 0.25 for the pairs whose solutions could not be obtained with ε̄ = 0.15. For the real images, ε̄ = 0.25, M = (1000, 1000, 1000)^T, t̲_i = −100, and t̄_i = 100 (i = 1, 2, 3). We also applied ε̄ = 0.50 for the pairs whose solutions could not be obtained with ε̄ = 0.25. In order to improve the accuracy of the calculation, each point v_i = (x_i, y_i, z_i)^T of the "Pooh" dataset is translated toward the origin of the coordinate system by subtracting ((min{x_j : j ∈ I(V^{0°})} + max{x_j : j ∈ I(V^{0°})})/2, (min{y_j : j ∈ I(V^{0°})} + max{y_j : j ∈ I(V^{0°})})/2, (min{z_j : j ∈ I(V^{0°})} + max{z_j : j ∈ I(V^{0°})})/2)^T, where V^{0°} is the point set of the 0° image.

The experimental results for the synthetic range image datasets are shown in Table 1 ("b", "h", and "a" in the column "Dataset" indicate "Stanford Bunny", "Horse", and "Armadillo", respectively, followed by a number indicating the standard deviation of the added noise). The parameter ε̄ = 0.25 was used to calculate the results for the pairs 180°–200° of h00, 180°–200° and 280°–300° of h04, 180°–200° of h06, 280°–300° of h08, 20°–40° and 280°–300° of h10, and 60°–80° of all "Armadillo" datasets; ε̄ = 0.15 was used for the other pairs. Table 1 shows that the rigid transformation errors did not always increase with the standard deviation σ of the added Gaussian noise. There are two main reasons why our method is robust against measurement noise. First, our method does not need accurate values for the invariant features; we use them only to select the feature points and to reduce the putative corresponding point pairs. Second, the rigid transformation is estimated using more than five corresponding pairs of feature
Table 1. Error evaluation of MILP-based registration. "Error of angle" is the absolute angle between the true rotation angle and the estimated one in degrees. "Error of axis" is the deviation in the estimated rotation axis in degrees. "Error of translation" is the norm of the error of the translation vector.

Dataset  σ     Time [sec.]  Error of angle [degree]  Error of axis [degree]  Error of translation
b00      0.00    4578       0.343                    1.89                    0.328
b02      0.02    4587       0.321                    1.74                    0.311
b04      0.04    4906       0.466                    2.00                    0.390
b06      0.06    5406       0.247                    1.30                    0.298
b08      0.08    4700       0.366                    1.67                    0.364
b10      0.10    4877       0.421                    1.47                    0.291
h00      0.00    5334       0.281                    2.14                    0.587
h02      0.02    5213       0.272                    1.94                    0.552
h04      0.04    4273       0.354                    1.97                    0.507
h06      0.06    4264       0.364                    2.19                    0.485
h08      0.08    3202       0.417                    2.55                    0.625
h10      0.10    3525       0.344                    3.04                    0.592
a00      0.00  119514       0.215                    0.83                    0.244
a02      0.02  118699       0.231                    0.88                    0.227
a04      0.04  124356       0.205                    0.95                    0.224
a06      0.06  127393       0.320                    0.94                    0.282
a08      0.08  118821       0.244                    0.84                    0.219
a10      0.10  109337       0.393                    0.96                    0.262
Table 2. Error evaluation of the dataset "Pooh". "Error of angle" is the absolute angle between the true rotation angle and the estimated one in degrees. "Error of axis" is the angle in degrees between the rotation axis estimated from the given pair and the rotation axis estimated from all pairs. "Error of translation" is the norm of the error between the translation vector estimated from the given pair and the translation vector estimated from all pairs.

Angle    Time [sec.]  Error of angle [degree]  Error of axis [degree]  Error of translation
average  19496        1.950                    0.072                   0.719
points. If by chance we find five pairs of points that are only slightly corrupted by noise, we can accurately estimate the rigid transformation. The computational time for "Armadillo" was higher than that of the others because the numbers of MILP variables and constraints are proportional to the number of feature points.

Table 2 shows the results for the real range image dataset "Pooh". The parameter ε̄ = 0.50 was used to calculate the results for the pairs 20°–40°, 60°–80°, 100°–120°, 220°–240°, and 320°–340°, and ε̄ = 0.25 was used for the other pairs. The registration results for "Pooh" are shown in Figure 1. While the errors for some pairs were relatively large, the results are good enough for coarse registration.
Fig. 1. Results for “Pooh”
5 Concluding Remarks

Our proposed coarse registration method using Mixed Integer Linear Programming (MILP) can find a globally optimal registration without using the values of the invariant features. In addition, it automatically adjusts the error tolerance depending on the accuracy of the given range image data. Our method finds the best consistent pairs from all possible point pairs using an MILP solver. While such solvers are powerful tools, all constraints must be written in linear form, which means that constraints on the rotation matrix cannot be applied directly. Therefore, we select a relevant number of pairs that are consistent in the sense of the distances between point pairs, which constrains the rotation matrix indirectly. The number of corresponding point pairs and the distances between them are automatically balanced by our algorithm using two different MILP formulations. Future work will focus on reducing the computational time and improving the selection of the feature points.
References

1. Besl, P.J., McKay, N.D.: A Method for Registration of 3-D Shapes. IEEE Trans. on PAMI 14(2), 239–256 (1992)
2. Campbell, R.J., Flynn, P.J.: A Survey of Free-Form Object Representation and Recognition Techniques. CVIU 81, 166–210 (2001)
3. Chen, C.C., Stamos, I.: Range Image Registration Based on Circular Features. In: Proc. 3DPVT, pp. 447–454 (2006)
4. Chua, C.S., Jarvis, R.: 3D Free-Form Surface Registration and Object Recognition. IJCV 17(1), 77–99 (1996)
5. He, W., Ma, W., Zha, H.: Automatic Registration of Range Images Based on Correspondence of Complete Plane Patches. In: Proc. 3DIM, pp. 470–475 (2005)
6. Higuchi, K., Hebert, M., Ikeuchi, K.: Building 3-D Models from Unregistered Range Images. GMIP 57(4), 315–333 (1995)
7. Johnson, A.E., Hebert, M.: Using Spin Images for Efficient Object Recognition in Cluttered 3D Scenes. IEEE Trans. on PAMI 21(5), 433–449 (1999)
8. Johnson, E.L., Nemhauser, G.L., Savelsbergh, M.W.P.: Progress in Linear Programming-Based Algorithms for Integer Programming: An Exposition. INFORMS Journal on Computing 12(1), 2–23 (2000)
9. Koenderink, J.J.: Solid Shape. MIT Press, Cambridge (1990)
10. Sára, R., Okatani, I.S., Sugimoto, A.: Globally Convergent Range Image Registration by Graph Kernel Algorithm. In: Proc. 3DIM, pp. 377–384 (2005)
11. Rusinkiewicz, S., Levoy, M.: Efficient Variants of the ICP Algorithm. In: Proc. 3DIM, pp. 145–152 (2001)
12. Stein, F., Medioni, G.: Structural indexing: Efficient 3-D object recognition. IEEE Trans. on PAMI 14(2), 125–145 (1992)
13. Umeyama, S.: Least-Square Estimation of Transformation Parameters Between Two Point Patterns. IEEE Trans. on PAMI 13(4), 376–380 (1991)
14. Stanford 3D Scanning Repository, http://www-graphics.stanford.edu/data/3Dscanrep/
15. The Ohio State University Range Image Repository, http://sampl.ece.ohio-state.edu/data/3DDB/RID/minolta/
16. Georgia Institute of Technology Large Geometric Models Archive, http://www-static.cc.gatech.edu/projects/large_models/
Accelerating Pattern Matching or How Much Can You Slide?

Ofir Pele and Michael Werman
School of Computer Science and Engineering
The Hebrew University of Jerusalem
{ofirpele,werman}@cs.huji.ac.il
Abstract. This paper describes a method that accelerates pattern matching. The distance between a pattern and a window is usually close to the distance of the pattern to the adjacent windows, due to image smoothness. We show how to exploit this fact to reduce the running time of pattern matching by adaptively sliding the window, often by more than one pixel. The decision of how much we can slide is based on a novel rank we define for each feature in the pattern. Implemented on a Pentium 4 3GHz processor, detection of a pattern with 7569 pixels in a 640 × 480 pixel image requires only 3.4ms.
1 Introduction
Many applications in image processing and computer vision require finding a particular pattern in an image, pattern matching. To be useful in practice, pattern matching methods must be automatic, generic, fast and robust.
Fig. 1. (a) A non-rectangular pattern of 7569 pixels (631 edge pixel pairs). Pixels not belonging to the mask are in black. (b) A 640 × 480 pixel image in which the pattern was sought. (c) The result image. All similar masked windows are marked in white. (d) The two found occurrences of the pattern in the image. Pixels not belonging to the mask are in black. The method suggested in this paper reduced the running time of the Pele and Werman pattern matching method [1] from 21ms to only 3.4ms.
Fig. 2. (a) A non-rectangular pattern of 2197 pixels. Pixels not belonging to the mask are in black. (b) Three 640 × 480 pixel frames out of fourteen in which the pattern was sought. (c) The result. Most similar masked windows are marked in white. (d) Zoom-in on the occurrences of the pattern in the frames. Pixels not belonging to the mask are in black. The method suggested in this paper reduced the running time of the Pele and Werman pattern matching method [1] from 22ms to only 7.2ms. The average number of samples per window was reduced from 19.7 to only 10.6.
Pattern matching is typically performed by scanning the entire image and evaluating a distance measure between the pattern and a local rectangular window. The method proposed in this paper is applicable to any pattern shape, even a non-contiguous one. We use the notion of "window" to cover all possible shapes. There are two main approaches to reducing the computational complexity of pattern matching. The first approach reduces the time spent on each window. The second approach reduces the number of windows visited. In this work we concentrate on the second approach. We suggest sliding more than one pixel at a time. The question that arises is: how much can you slide? The answer depends on the pattern and on the image. For example, if the pattern and the image are black-and-white pixel checkerboards, the distance of the pattern to the current window and to the next window will be totally different. However, if the pattern is piecewise smooth, the
Fig. 3. (a) A rectangular pattern of 1089 pixels. (b) A noisy version of the original 640 × 480 pixel image. The pattern, which was taken from the original image, was sought in this image. The noise is Gaussian with a mean of zero and a standard deviation of 25.5. (c) The result image. The single similar masked window is marked in white. (d) The occurrence of the pattern in the zoomed-in image. The method suggested in this paper reduced the running time of the Pele and Werman pattern matching method [1] from 19ms to only 6ms. The average number of samples per window was reduced from 12.07 to only 2. The image is copyright by Ben Schumin and was downloaded from: http://en.wikipedia.org/wiki/Image:July_4_crowd_at_Vienna_Metro_station.jpg
distances will be similar. We describe a method which examines the pattern and decides how much we can slide in each step. The decision is based on a novel rank we define for each feature in the pattern. We use a two-stage method on each window. First, we test all the features with a high rank. Most of the windows will not pass, and we will be able to slide more than one pixel. For the windows that passed the test, we perform the simple test on all the features. A typical pattern matching task is shown in Fig. 1. A non-rectangular pattern of 7569 pixels (631 edge pixel pairs) was sought in a 640 × 480 pixel image. Using the Pele and Werman method [1] the running time was 21ms. Using our method the running time was reduced to only 3.4ms. All runs were done on a Pentium 4 3GHz processor. Decreasing the number of visited windows is usually achieved using an image pyramid [3]. By matching a coarser pattern to a coarser level of the pyramid, fewer windows are visited. Once the strength of each coarser resolution match is calculated, only those that exceed some threshold need to be compared at the next finer resolution. This process proceeds until the finest resolution is reached. There are several problems with the pyramid approach. First, important details of the objects can disappear. Thus, the pattern can be missed. For example, in Fig. 1, if we reduce the resolution by a factor of 0.8, the right occurrence of the pattern is found, but the left one is missed. Using the smaller images the running time decreases from 21ms to 18ms (without taking into account the time spent on decreasing the resolution). Using our approach, both occurrences of the pattern are found in only 3.4ms. Note that smoothness can change
Fig. 4. (a) A non-rectangular pattern of 3732 pixels (3303 edge pixel pairs). Pixels not belonging to the mask are in black. (b) A 2048 × 1536 pixel image in which the pattern was sought. The area where the pattern was found is marked in white. (c) The occurrence of the pattern in the image, zoomed in. (d) The occurrence of the pattern in the image, zoomed in, with the exact found outline of the pattern painted in white. The method suggested in this paper reduced the running time of the Pele and Werman pattern matching method [1] from 437ms to only 51ms. Note the large size of the image. The average number of samples per window was reduced from 27 to only 3.7.
between different local parts of the pattern. The pyramid approach is global, while our approach is local and thus more distinctive. The second problem of the pyramid approach is the memory overhead. This paper is organized as follows. Section 2 presents the LUp rank for pixels and for pairs of pixels. Section 3 describes a method that uses the LUp rank for accelerating pattern matching. Section 4 presents extensive experimental results. Finally, conclusions are drawn in Section 5.
2 The LUp Rank
In this section we define a novel smoothness rank for features, the LUp rank. The rank is later used as a measure that tells us how much we can slide for each
Fig. 5. (a) A 115 × 160 pattern (2896 edge pixel pairs). (b) A 1000 × 700 pixel image in which the pattern was sought. The most similar window is marked in white. The method suggested in this paper reduced the running time of the Pele and Werman pattern matching method [1] from 51ms to only 9.2ms. The average number of samples per window was reduced from 23 to only 3. The images are from the Mikolajczyk and Schmid paper [2].
pattern. This is first defined for pixels and then for pairs of pixels. Finally, we suggest ways of calculating the LUp rank.

2.1 The LUp Rank for Pixels
In this sub-section we use the Thresholded Absolute Difference Hamming distance that was suggested by Pele and Werman [1]. This distance is the number of different corresponding pixels between a window and a pattern, where corresponding pixels are defined as different if and only if their absolute intensity difference is greater than a predefined pixel similarity threshold q; i.e., the distance between the pattern and the current window, evaluated over the set of pixels A, is defined as (δ returns 1 for true and 0 for false):

TAD_A(pattern, window) = Σ_{(x,y)∈A} δ(|pattern(x, y) − window(x, y)| > q)    (1)
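For illustration, the following sketch evaluates this distance for a masked pattern; the array indexing convention and names are our own assumptions.

def tad_distance(pattern, window, mask_coords, q):
    """Thresholded Absolute Difference Hamming distance over the pixel set A
    (given as a list of (x, y) coordinates): counts corresponding pixels whose
    absolute intensity difference exceeds the similarity threshold q."""
    return sum(
        1
        for (x, y) in mask_coords
        if abs(int(pattern[y, x]) - int(window[y, x])) > q
    )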
We first define the LU rank for a pattern pixel as:

LU(pattern, (x, y)) = max R  s.t.  ∀ 0 ≤ r_x, r_y ≤ R : pattern(x, y) = pattern(x − r_x, y − r_y)    (2)
Now, if we assess the similarity between a pixel in the pattern with an LU rank of R and a pixel in the window, we get information about all the windows which are up to R pixels to the right of and below the current window. Using this information we can slide in steps of R + 1 pixels, without losing accuracy.
The requirement of equality in Eq. 2 is relaxed in the definition of the LUp rank. In this rank the only requirement is that the absolute difference is not too high:

LUp(pattern, (x, y)) = max R  s.t.  ∀ 0 ≤ r_x, r_y ≤ R : |pattern(x, y) − pattern(x − r_x, y − r_y)| ≤ p    (3)
Note that the LU and LU0 ranks for pixels are equivalent.

2.2 The LUp Rank for Pairs of Pixels
In this sub-section we use the Monotonic Relations Hamming distance that was suggested by Pele and Werman [1]. This distance is the number of pairs of pixels in the current window that do not have the same relationship as in the pattern; i.e., the basic features of this distance are pairs of pixels rather than pixels. Pixel relations have been successfully applied in many fields such as pattern matching [1], visual correspondence [4] and keypoint recognition [5]. Each pattern is defined by a set of pairs of pixels which are close, while the intensity difference is high. We assume without loss of generality that in the pattern the first pixel in each pair has a higher intensity value than the second pixel. The distance between the pattern and the current window, evaluated over the set of pairs A, is defined as (δ returns 1 for true and 0 for false):

MR_A(pattern, window) = Σ_{[(x_1,y_1),(x_2,y_2)]∈A} δ(window(x_1, y_1) ≤ window(x_2, y_2))    (4)
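A corresponding sketch of the pair-based distance, under the same assumed conventions:

def mr_distance(window, pair_coords):
    """Monotonic Relations Hamming distance: counts pattern pairs whose
    'first pixel brighter than second pixel' relation is violated in the
    window. pair_coords holds [(x1, y1), (x2, y2)] pairs ordered so that
    pattern(x1, y1) > pattern(x2, y2)."""
    return sum(
        1
        for (x1, y1), (x2, y2) in pair_coords
        if window[y1, x1] <= window[y2, x2]
    )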
Given a pattern and a pair of pixels [(x_1, y_1), (x_2, y_2)] such that the first pixel has a higher intensity value than the second pixel, i.e. pattern(x_1, y_1) > pattern(x_2, y_2), we define the pair's LUp rank as:

LUp(pattern, [(x_1, y_1), (x_2, y_2)]) = max R  s.t.  ∀ 0 ≤ r_x, r_y ≤ R : pattern(x_1 − r_x, y_1 − r_y) > pattern(x_2 − r_x, y_2 − r_y) + p    (5)
The requirement that the relation must be larger by at least p is added for stability. Now, if we assess the similarity between a pair of pixels in the pattern with an LUp rank of R and a pair of pixels in the window, we get information about all the windows which are up to R pixels to the right of and below the current window. Figure 6 illustrates this. Using this information we can slide in steps of R + 1 pixels, without losing accuracy.

2.3 How to Calculate the LUp Rank
We suggest two methods of calculating the LUp rank of all features (pixels or pairs of pixels) in the pattern. The first is to calculate the rank for each feature. If we denote by R̄ the average LUp rank and by |A| the feature set size, then
Fig. 6. The pair of pixels in the pattern (marked with two circles), [(3, 4), (1, 1)], has an LU10 rank of 1 (Pattern(3, 4) > Pattern(1, 1) + 10, Pattern(3, 3) > Pattern(1, 0) + 10, etc.). Thus, when we test whether Image(3, 4) > Image(1, 1), we get an answer to these 4 questions (all coordinates are relative to the window's coordinates): 1. In the window of (a), is Window(3, 4) > Window(1, 1) as in the pattern? 2. In the window of (b), is Window(2, 4) > Window(0, 1) as in the pattern? 3. In the window of (c), is Window(3, 3) > Window(1, 0) as in the pattern? 4. In the window of (d), is Window(2, 2) > Window(0, 0) as in the pattern?
the average time complexity is O(|A|R̄^2). The second method is to test which features have each LUp rank. This can be done quickly by finding the 2D min and max for each value of R. The Gil and Werman [6] method does this with a time complexity of O(1) per pixel. If we denote by R_max the maximum R value, then the time complexity is O(|A|R_max). A combined approach can also be used. Note that the computation of the LUp rank is done offline for each given pattern. Moreover, the size of the pattern is usually much smaller than the size of the image; thus the running time of this stage is negligible. In this paper we simply calculate the LUp rank for each feature.
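The first calculation method can be sketched directly from Eq. 3; the following brute-force per-pixel routine is our own illustrative implementation, not the authors' code.

import numpy as np

def lup_rank_pixel(pattern, x, y, p, r_max=None):
    """LUp rank of pattern pixel (x, y) (Eq. 3): the largest R such that every
    pixel up to R steps up and to the left differs from pattern(x, y) by at
    most p. Brute-force version of the first calculation method."""
    limit = min(x, y) if r_max is None else min(x, y, r_max)
    value = int(pattern[y, x])
    for r in range(1, limit + 1):
        block = pattern[y - r:y + 1, x - r:x + 1].astype(int)   # (R+1)x(R+1) neighborhood
        if np.abs(block - value).max() > p:
            return r - 1          # the previous R was the last valid one
    return limit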
3 The Pattern Matching Method
The problem of pattern matching can be formulated as follows: given a pattern and an image, find all the occurrences of the pattern in the image. We define a window as a match if the Hamming distance (i.e. Eq. 1 or Eq. 4) is smaller than or equal to the image similarity threshold. In order to reduce the running time spent on each window we use the Pele and Werman [1] sequential sampling algorithm. The sequential algorithm randomly samples corresponding features sequentially and without replacement from the
window and pattern and tests them for similarity. After each sample, the algorithm tests whether the accumulated number of non-similar features is equal to a threshold, which increases with the number of samples. We call this vector of thresholds the rejection line. If the algorithm touches the rejection line, it stops and returns non-similar. If the algorithm finishes sampling all the features, it has computed the exact distance between the pattern and the window. Pele and Werman [1] presented a Bayesian framework for sequential hypothesis testing on finite populations. Given an allowable bound on the probability of a false negative, the framework computes the optimal rejection line, i.e. a rejection line such that the sequential algorithm parameterized with it has the minimum expected running time. Pele and Werman [1] also presented a fast near-optimal framework for computing the rejection line. In this paper, we use the near-optimal framework.

The full system we use for pattern matching is composed of an offline and an online part. The offline part gets a pattern and returns the characteristic LUp rank, two sets of features and the two corresponding rejection lines. One set contains all the pattern features. The second set contains all the pattern features from the first set that have an LUp rank greater than or equal to the characteristic LUp rank. The online part slides through the image in steps of the characteristic LUp rank plus one. On each window it uses the sequential algorithm to test for similarity on the second set of features. If the sequential algorithm returns non-similar, the algorithm slides the characteristic LUp rank plus one pixels to the right, or the characteristic LUp rank plus one rows (at the end of each row). If the sequential algorithm returns similar (which we assume is a rare event), the window and all the windows that would otherwise be skipped are tested for similarity. The test is made again using the sequential algorithm, this time on the set that contains all the pattern features.
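The online part can be sketched as follows, where sequential_test stands for the sequential sampling algorithm parameterized with its rejection line; all names and the interface are illustrative assumptions rather than the authors' implementation.

def scan_image(image_h, image_w, pat_h, pat_w, R, sequential_test,
               high_rank_features, all_features):
    """Online scan: slide in steps of R+1 using the high-rank feature set; only
    windows that pass the quick test (and the windows they would have skipped)
    are re-tested with the full feature set. R is the characteristic LUp rank."""
    matches = []
    step = R + 1
    for y in range(0, image_h - pat_h + 1, step):
        for x in range(0, image_w - pat_w + 1, step):
            if not sequential_test(x, y, high_rank_features):
                continue                      # most windows are rejected here
            # rare case: fall back to a dense test over the skipped positions
            for dy in range(step):
                for dx in range(step):
                    wy, wx = y + dy, x + dx
                    if wy <= image_h - pat_h and wx <= image_w - pat_w:
                        if sequential_test(wx, wy, all_features):
                            matches.append((wx, wy))
    return matches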
4 Results
The proposed method was tested on real images and patterns. The results show that the method accelerates pattern matching, with a very small decrease in robustness to rotations. For all other transformations tested - small scale change, image blur, JPEG compression and illumination - there was no decrease in robustness. First we describe results that were obtained using the Thresholded Absolute Difference Hamming distance (see Eq. 1). Second, we describe results that were obtained using the Monotonic Relations Hamming distance (see Eq. 4).

4.1 Results Using the Thresholded Absolute Difference Hamming Distance
We searched for windows with a Thresholded Absolute Difference Hamming distance lower than 0.4×|A|. The sequential algorithm was parameterized using the near-optimal method of Pele and Werman[1] with input of a uniform prior and
a false negative error bound of 0.1%. In all of the experiments, the p threshold for the LUp rank was set to 5. The characteristic LU5 rank for each pattern was set to the maximum LU5 rank found for at least 30 pattern pixels. Note that the online part first tests similarity on the set of pixels with an LU5 rank greater than or equal to the characteristic LU5 rank. The same relative similarity threshold is used; i.e. if the size of this small set is |As| we test whether the Thresholded Absolute Difference Hamming distance is lower than 0.4 × |As|. Results that show the substantial reduction in running time are shown in Figs. 2 and 3.

4.2 Results Using the Monotonic Relations Hamming Distance
The pairs that were used in the set of each pattern were pairs of pixels belonging to edges, i.e. pixels that had a neighboring pixel whose absolute intensity difference was greater than 80. Two pixels (x_2, y_2), (x_1, y_1) are considered neighbors if their l∞ distance, max(|x_1 − x_2|, |y_1 − y_2|), is smaller than or equal to 2. We searched for windows with a Monotonic Relations Hamming distance lower than 0.25 × |A|. The sequential algorithm was parameterized using the near-optimal method of Pele and Werman [1] with input of a uniform prior and a false negative error bound of 0.1%. In all of the experiments, the p threshold for the LUp rank was set to 20. The characteristic LU20 rank for each pattern was set to the maximum LU20 rank found for at least 30 pairs of pixels from the set of all edge pixel pairs. Note that the online part first tests similarity on the set of pairs of edge pixels with an LU20 rank greater than or equal to the characteristic LU20 rank. The same relative similarity threshold is used; i.e. if the size of this small set is |As| we test whether the Monotonic Relations Hamming distance is lower than 0.25 × |As|. Results that show the substantial reduction in running time are shown in Figs. 1, 4 and 5.

To illustrate the performance of our method, we ran the tests that were also conducted in the Pele and Werman paper [1]. All the data for the experiments were downloaded from http://www.cs.huji.ac.il/~ofirpele/hs/all_images.zip. Five image transformations were evaluated: small rotation; small scale change; image blur; JPEG compression; and illumination. The names of the datasets used are rotation, scale, blur, jpeg, and light, respectively. The blur, jpeg and light datasets were from the Mikolajczyk and Schmid paper [2]. The scale dataset contains 22 images with an artificial scale change from 0.9 to 1.1 in jumps of 0.01, and the rotation dataset contains 22 images with an artificial in-plane rotation from -10° to 10° in jumps of 1°. For each collection, there were ten rectangular patterns that were chosen from the image with no transformation. In each image we considered only the window with the minimum distance as similar, because we knew that the pattern occurred only once in the image. We repeated each search of a pattern in an image 1000 times.

There are two notions of error: the miss detection error rate and the false detection error rate. As we know the true homographies between the images, we know where the pattern pixels are in the transformed image. We denote a correct match as one that covers at least 80% of the transformed pattern pixels. A false match is one that covers less than 80% of the transformed pattern pixels. Note
Fig. 7. The miss detection error rates of our method on the rotation and scale tests. There is a slight decrease in performance with our method on the rotation test. On the scale test the performance is the same for both methods. Note that our approach runs much faster and that on all other tests (light, jpeg and blur) the miss detection error rates were exactly the same.
that there is also an event of no detection at all if our method does not find any window with a Monotonic Relations Hamming distance lower than 0.25 × |A|. The miss detection error rate is the percentage of searches of a pattern in an image that does not yield a correct match. The false detection error rate is the percentage of searches of a pattern in an image that yields a false match. The detection and miss detection error rates were the same as in the Pele and Werman method[1], except in the rotation test where there was a slight decrease in performance (see Fig. 7). In the light and jpeg tests, the performance was perfect; i.e. 0% miss detection rate and 0% false detection rate. In the blur test, only one pattern was not found correctly in the most blurred image. The miss detection rate and false detection rate for this specific case was 99.6%. In all other patterns and images in the blur test, the miss detection rate and false detection rate was 0%. In the scale test, there was only one pattern with false detection in two images with scale 0.9 and 0.91. In the rotation test, there was only one pattern with false detection in images with rotation smaller than -2◦ or larger than +2◦ . Miss detection rates in the scale and rotation tests (see Fig. 7) were dependent on the pattern. If the scale change or rotation was not too big, the pattern was found correctly.
Fig. 8. Average number of samples per window for each of the ten patterns with our method (dotted lines) and with the Pele and Werman method [1] (solid lines). In all of the tests our approach ran much faster. Note that as our approach skips windows, the average number of samples per window can be even smaller than one. It is also noteworthy that the running time in our approach depends on the characteristic LU20 rank. For example, in the light test (a), finding pattern number 6 (marked as a pink star) using the original method took the most time, while using our method it is one of the patterns that was found the fastest. This can be explained by the fact that this pattern has a characteristic LU20 rank of two, which is high compared to the other patterns.
We also measured the average number of samples taken for each window using our method and the Pele and Werman method (see Fig. 8). In all of the tests our approach ran much faster. Note that as our approach skips windows, the average number of samples per window can be even smaller than one. Further, the running time in our approach depends on the characteristic LU20 rank.
5 Conclusions
This paper described a method to accelerate pattern matching by adaptively sliding the window, often by more than one pixel. We assigned a novel rank to pixels and edge pixel pairs that tells us how many pixels we can slide through the image. We suggested a pattern matching method that uses this rank. Extensive testing showed that pattern matching was accelerated with almost no loss of accuracy. Faster than real-time results were presented, where patterns under large illumination changes, blur, occlusion, Gaussian noise, etc., were detected in several milliseconds. To the best of our knowledge, the running time results presented in this paper are the fastest published. An interesting extension of this work would be to use this novel rank to accelerate additional methods of pattern matching.
References

1. Pele, O., Werman, M.: Robust real time pattern matching using Bayesian sequential hypothesis testing. Technical Report 973, Department of Computer Science, The Hebrew University of Jerusalem (2007), http://www.cs.huji.ac.il/~ofirpele/hs/TR.html
2. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Trans. Pattern Analysis and Machine Intelligence 27(10), 1615–1630 (2005)
3. Gonzalez, R.C., Woods, R.E.: Digital Image Processing. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA (2001)
4. Zabih, R., Woodfill, J.: Non-parametric local transforms for computing visual correspondence. In: Eklundh, J.-O. (ed.) ECCV 1994. LNCS, vol. 800, pp. 151–158. Springer, Heidelberg (1994)
5. Lepetit, V., Fua, P.: Keypoint recognition using randomized trees. IEEE Trans. Pattern Analysis and Machine Intelligence 28(9), 1465–1479 (2006)
6. Gil, J., Werman, M.: Computing 2-D min, median, and max filters. IEEE Trans. Pattern Analysis and Machine Intelligence 15(5), 504–507 (1993)
Detecting, Tracking and Recognizing License Plates

Michael Donoser, Clemens Arth, and Horst Bischof
Institute for Computer Graphics and Vision
Graz University of Technology
{donoser,arth,bischof}@icg.tu-graz.ac.at

Abstract. This paper introduces a novel real-time framework which enables detection, tracking and recognition of license plates from video sequences. An efficient algorithm based on analysis of Maximally Stable Extremal Region (MSER) detection results allows localization of international license plates in single images without the need of any learning scheme. After a one-time detection of a plate, it is robustly tracked through the sequence by applying a modified version of the MSER tracking framework, which provides accurate localization results and additionally segmentations of the individual characters. Therefore, tracking and character segmentation are handled simultaneously. Finally, support vector machines are used to recognize the characters on the plate. An experimental evaluation shows the high accuracy and efficiency of the detection and tracking algorithm. Furthermore, promising results on a challenging data set are presented and the significant improvement of the recognition rate due to the robust tracking scheme is proved.
1 Introduction
There is a need for intelligent traffic management systems in order to cope with the constantly increasing traffic on today's roads. Video based traffic surveillance is one of the key parts of such installations. Besides detection and tracking of vehicles, identification by license plate recognition is important for a variety of applications like access control, security or traffic monitoring.

In general, license plate recognition systems consist of two separate parts. First, plates are detected within a single frame of a traffic video sequence and second, character recognition is applied to identify the characters on the plate.

Different methods have been proposed for the detection of license plates. Shapiro et al. [1] use a mixture of edge detection and vertical pixel projection for their detection module. In the work of Jia et al. [2] color images were segmented by the Mean Shift algorithm into candidate regions and subsequently classified as plate or not. The AdaBoost algorithm was used by Dlagnekov and Belongie [3] for license plate detection on rather low resolution video frames. Matas and Zimmermann [4] proposed a different approach for the localization of license plates. Instead of using properties of the plate directly, the algorithm tries to find all character-like regions in the image by analyzing an interest region detection result.
Several approaches for recognizing the characters on the plate after successful detection have also been proposed. Shapiro et al. [1] use adaptive iterative thresholding and analysis of connected components for segmentation. The classification task is then performed with two sets of templates. Rahman et al. [5] used horizontal and vertical intensity projection for segmentation and template matching for classification. Dlagnekov and Belongie [3] use the normalized cross correlation for classification by analyzing the whole plate, hence skipping segmentation.

Although much scientific work has focused on recognizing license plates from traffic video sequences, surprisingly little work has been done on integrating a tracking scheme to gather additional representations of the plate for improving the recognition rate. Furthermore, the few systems that apply a tracking scheme only use simple and unstable approaches; e.g. Dlagnekov and Belongie [3] perform tracking by simply repeating detection to build correspondences.

The main contribution of this paper is a novel framework which unifies detection, tracking and recognition of license plates in a robust and efficient way. The underlying idea is to base detection, tracking and character segmentation on the same principles, which allows providing segmentations of the individual characters for recognition in addition to accurate and robust license plate localization results in subsequent frames. The framework is presented in detail in Section 2 and an experimental evaluation is shown in Section 3.
2 License Plate Recognition Framework
This section describes the entire framework for detection, tracking and recognition of license plates from traffic video sequences. The introduced system detects newly appearing license plates in the sequence with a novel algorithm which is based on the analysis of Maximally Stable Extremal Region (MSER) [6] detection results. The concept, introduced in Section 2.1, does not require any learning scheme and is capable of detecting different types of international plates. After detection of a plate, it is robustly tracked through the sequence by a modified version of the MSER tracking framework [7], as shown in Section 2.2. Therefore, for each appearing car in the video sequence a set of license plate representations is collected, which is used to improve the subsequent character recognition. Finally, Section 2.3 describes how support vector machines are used to recognize the characters on the different representations of each plate and how the results are merged by a voting scheme to achieve the final recognition result.

2.1 License Plate Detection
The first step of every license plate recognition system is the detection of the license plates within a single frame of the traffic video sequence. We propose a novel detection scheme which is motivated by the work of Matas et al. [6] who presented a method to learn category-specific extremal regions, i. e. the characters on a license plate, to perform detection. In contrast to their approach
our scheme does not require any learning scheme and is able to detect different types of international license plates without adaptation. Our detection algorithm is based on analyzing the results of a Maximally Stable Extremal Region (MSER) detection [8]. MSERs denote a set of distinguished regions and have proven to be one of the best interest point detectors in computer vision [9]. All of these regions are defined by an extremal property of the intensity function in the region and on its outer boundary. Several special properties account for their superior performance as a stable local detector: the set of MSERs is closed under continuous geometric transformations, is invariant to affine intensity changes, and MSERs are detected at every scale. We predominantly exploit these properties for segmentation purposes. In general, two variants of MSER detection can be distinguished, denoted as MSER+ and MSER-. While MSER+ detects bright regions with a darker boundary, MSER- finds dark regions with a brighter boundary. Figure 1(a) shows an image from a traffic video sequence, and Figures 1(b) and 1(c) illustrate the corresponding MSER detection results as binary images. As can be seen, the license plate itself is identified as MSER+, whereas the characters on the plate are detected as MSER-.
Fig. 1. MSER detection results can be used for detecting license plates in video sequences. MSER+ finds the license plate, whereas MSER- identifies the individual characters on it.
The underlying idea of our novel license plate detection scheme is to analyze the MSER+ and MSER- detection results. We are looking for a larger MSER+ region (the license plate) that contains a set of smaller MSER- regions (the characters). Such a combination is considered a license plate detection result. Furthermore, we verify the detection by checking whether the MSER- regions are approximately equal sized, whether their center points approximately lie on a line, and whether the height of the MSER+ is in the range of the average MSER- height. After verification, the MSER+ is returned as the license plate localization result, and additionally segmentations of the characters are provided by the corresponding MSER- detections. Although the detection process is simple, it is effective and allows stable and accurate detection of license plates in complex scenes. An exemplary result is shown in Figure 2(a). In this example 226 MSER- and 688 MSER+ regions are found, but only one set fulfills the proposed criteria.
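A schematic of this verification step is sketched below; the concrete tolerances and the bounding-box representation of the regions are our own assumptions, not the values used by the authors.

import numpy as np

def verify_plate(plate_box, char_boxes, size_tol=0.5, line_tol=0.3, height_tol=0.6):
    """Check whether a candidate MSER+ region (plate_box) with contained MSER-
    regions (char_boxes, each (x, y, w, h)) looks like a license plate:
    roughly equal-sized characters, centers roughly collinear, and a plate
    height comparable to the average character height. Thresholds are
    illustrative assumptions."""
    if len(char_boxes) < 3:
        return False
    heights = np.array([h for (_, _, _, h) in char_boxes], dtype=float)
    centers_y = np.array([y + h / 2.0 for (_, y, _, h) in char_boxes])
    mean_h = heights.mean()

    equal_sized = np.all(np.abs(heights - mean_h) <= size_tol * mean_h)
    on_a_line = centers_y.std() <= line_tol * mean_h            # crude collinearity test
    px, py, pw, ph = plate_box
    plate_height_ok = abs(ph - mean_h) <= height_tol * mean_h   # plate height ~ char height

    return bool(equal_sized and on_a_line and plate_height_ok)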
Fig. 2. License plate detection results. Figure (a) shows the single detection of the proposed algorithm indicated by the white border. 914 MSERs are detected, but only one set of regions fulfills the described criterion. Figure (b) shows the wrong and multiple detections of an AdaBoost algorithm.
For comparison, Figure 2(b) shows the result of an AdaBoost detector based on Haar-like features [10]. As can be seen, the boosting framework returns wrong and multiple detections which need significant post-processing, e.g. non-maximum suppression to remove multiple detections. Our result is also more accurate than the bounding box provided by the boosting variant. Furthermore, training an AdaBoost based classifier is a rather complex procedure, and in this case the classifier was specifically trained on Austrian license plates. Our approach does not require any learning scheme and is able to identify different types of international plates, because the simple criterion is fulfilled for almost all of them. Figure 3 shows detection results for different international types.
Fig. 3. Detection results on international license plates, where (b) shows the segmented characters detected as MSER- and (c) the plate detected as MSER+. As can be seen, the detection algorithm can be applied to any international type, because the criterion used in the detection algorithm is valid for all cases.
2.2 Tracking of License Plates
After detection of a newly appearing car in the traffic video sequence by its license plate, a robust tracker is applied to increase the number of character representations for the subsequent recognition step. In general, any tracker can be used, but we propose to apply a modified version of the MSER tracking framework introduced by Donoser and Bischof [7], which has some advantages compared to other tracking schemes. First, it is efficient and stable and can be adapted to our specific requirements. Second, it provides an accurate segmentation of the license plate, and third, in addition to the tracked plate it also returns segmentations of the individual characters on the plate (MSERs); thus tracking and segmentation are handled simultaneously.

The MSER tracking framework was designed to improve the stability and speed of MSER detection results in video sequences. The tracker is initialized by passing the region to be tracked R_t, detected in image I_t of the sequence, to the framework. The first step of the algorithm is to propagate the center point of the region R_t to the next image I_{t+1} and to crop a region-of-interest (ROI) around it from the image. Then a data structure named component tree [11] is built for this ROI. Every node of the component tree contains one candidate region C^i_{t+1} for the tracking, and the algorithm looks for the node which is most similar to the region R_t. The best fit is identified by comparing feature vectors that are built for each node of the component tree C^i_{t+1} and for the input region R_t. The candidate C^i_{t+1} with the lowest Euclidean distance between its feature vector and the one of the region R_t is taken as the tracked representation. The features calculated are mean gray value, region size, center of mass, width and height of the bounding box, and stability. All of these features are computed incrementally [12] during creation of the component tree; thus, no additional computation time is required. After detection of the new representation, the described steps can be repeated for tracking the region through the entire sequence.

The original MSER tracking framework was designed for tracking arbitrarily shaped regions, but we adapt the method to our specific requirements of tracking license plates. In our framework, the MSER tracking algorithm is initialized by the result of the license plate detection algorithm presented in Section 2.1. Because we focus on license plates, we reformulate the feature comparison approach by replacing the Euclidean distance computation with a simpler, but more effective, equation based on the comparison of two distinct features, the size and the rectangularity of the region. Thus, in our framework the tracked representation is found by calculating a distance value Δ(R_t, C^i_{t+1}) for every candidate region C^i_{t+1} by

Δ(R_t, C^i_{t+1}) = abs(|R_t| − |C^i_{t+1}|) / |R_t| + (1 − ϑ(C^i_{t+1})),    (1)

where |...| denotes the size of the region and ϑ(...) is the rectangularity. Then, the candidate C^i_{t+1} with the lowest Δ(R_t, C^i_{t+1}) value is taken as the final representation. Tracking is considered valid as long as the minimum distance value Δ is below a fixed threshold. By repeating the presented steps the license plate
Fig. 4. Illustration of license plate tracking ((a) Frame 1, (b) Frame 15). The images show a traffic scene and the segmented characters on the license plate are highlighted in white.
is tracked through the sequence and the detected MSER- regions within the tracked plate are provided as segmentations of the individual characters. Furthermore, the tracking scheme is also used for discarding false positives of the detection step by removing non-moving or unstable plate tracks. Figure 4 shows two frames of a traffic video sequence, where accurate license plate detections are provided in addition to the segmentation of the eight characters on the plate.
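A minimal sketch of this candidate selection is given below. It assumes candidate regions are supplied as binary masks, takes the rectangularity ϑ as region area over bounding-box area (the text does not define ϑ precisely), and uses a hypothetical validity threshold; it is an illustration of Eq. (1), not the authors' implementation.

```python
import numpy as np

def rectangularity(mask):
    # Region area divided by the area of its axis-aligned bounding box.
    ys, xs = np.nonzero(mask)
    box_area = (ys.max() - ys.min() + 1) * (xs.max() - xs.min() + 1)
    return mask.sum() / float(box_area)

def tracking_distance(prev_mask, cand_mask):
    # Eq. (1): relative size change plus (1 - rectangularity of the candidate).
    size_prev, size_cand = prev_mask.sum(), cand_mask.sum()
    return abs(size_prev - size_cand) / float(size_prev) + (1.0 - rectangularity(cand_mask))

def select_tracked_region(prev_mask, candidates, max_delta=0.5):
    # Pick the component-tree candidate with the smallest distance; tracking is
    # reported as lost when even the best distance exceeds the (assumed) threshold.
    deltas = [tracking_distance(prev_mask, c) for c in candidates]
    best = int(np.argmin(deltas))
    return (candidates[best], deltas[best]) if deltas[best] < max_delta else (None, deltas[best])
```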
2.3 Character Recognition
The final step of our framework is to recognize the individual characters on the detected plates based on support vector machines (SVMs). SVMs were first described by Vapnik [13] and have proven to be an efficient tool for classification tasks such as optical character recognition (OCR) of handwritten digits or license plate contents [14]. For an introduction to SVMs and other kernel methods see for example [15]. In general, SVMs are designed for binary classification problems. Because character recognition is a multi-class problem, we apply a method based on a combination of several binary SVMs. The strategy is called one-vs-one, where for each pair of output classes an individual SVM is trained, resulting in a total number of n · (n − 1)/2 classifiers. Then all classifiers are evaluated, the votes are summed up and the class with the maximum number of votes is chosen. First the provided character segmentations are aligned, and then the one-vs-one approach is used to classify each character independently of all the others. The presented tracking approach provides several license plate representations for every car and therefore we also have several classification results for every character on the plate. A majority voting scheme is then used to determine the final character recognition result for every car. Figure 5 shows a sequence of tracked license plates, the segmented characters and the corresponding classification results. As can be seen, the single image based recognition provides wrong assignments, but the subsequent majority voting scheme ensures that the final character classification is correct. Therefore, the power of our framework clearly comes from the stable tracking approach
Fig. 5. Recognition is improved by using several license plate representations provided by the tracking scheme ((a) tracking sequence, (b) classification results). The final result based on a majority voting for this plate sequence is G~19VAB which matches with the real plate number.
which allows us to combine a sequence of single image based recognition results into the final classification.
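A sketch of how the per-frame SVM decisions can be fused by majority voting follows; scikit-learn's SVC realizes the one-vs-one strategy internally, and the kernel choice and data layout below are assumptions made for illustration, not taken from the paper.

```python
from collections import Counter
from sklearn.svm import SVC

# One-vs-one multi-class SVM: SVC trains n*(n-1)/2 binary classifiers internally.
char_svm = SVC(kernel="rbf")                 # hypothetical kernel choice
# char_svm.fit(train_vectors, train_labels)  # e.g. 9x6 character patches flattened to 54-d

def recognize_plate(char_sequences):
    """char_sequences[p] holds the flattened segmentations of plate position p,
    one per tracked frame; the final string is the per-position majority vote."""
    plate = []
    for observations in char_sequences:
        votes = Counter(char_svm.predict(observations))
        plate.append(str(votes.most_common(1)[0][0]))
    return "".join(plate)
```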
2.4 Framework
Our framework is able to analyze traffic video sequences in real-time. It detects newly appearing cars by localization of their license plate. After a one-time detection, the plate is robustly tracked and several representations of the license plate are collected. Until tracking fails, the available repeated segmentations of each character on the plate are used to improve the recognition rate along the complete tracking sequence and to determine a final result by the majority voting scheme. The running times of the individual steps of the concept for analysis of a video sequence of size 352 × 288 are shown in Table 1.

Table 1. Running time per image of the individual steps of the framework for analyzing a video stream of size 352 × 288
               Detection   Tracking   Recognition
Running time    70 ms       5 ms       6 ms

3 Experimental Evaluation
We evaluated our framework on a challenging traffic video sequence of the type shown in Figure 5(a), which was filmed from a footbridge. The provided resolution was 352 × 288 and therefore the characters on the plate only have an average size of 9 × 6 pixels in the sequence. Section 3.1 describes how the necessary training data for character recognition was acquired. Finally, Section 3.2 demonstrates that promising results are
achieved on the challenging data set and shows how the recognition rate is significantly improved by using several plate representations provided by the tracking scheme.
3.1 Character Image Database
The support vector machines are trained on approximately 2700 manually labeled images of characters which were automatically extracted from high resolution images of parked cars using the license plate detection algorithm. According to the resolution of the test video sequence we resized all character images to 9 × 6 pixels.
3.2 Recognition Results
To evaluate the quality of the proposed framework we used it for recognizing the license plates of 109 cars passing through the video sequence area. Due to the low resolution and changing lighting conditions, the average recognition rate for independent classification of every character in every detected license plate is only 80.74%. As a consequence, in a single image based license plate recognition approach on average only 23.23% of the cars are classified totally correctly. But this rate is significantly improved by analyzing additional plate representations provided by the tracking scheme and combining the corresponding recognition results by the presented majority voting scheme. Figure 6 analyzes the increase of the recognition rate by using more representations. As the number of tracked plate representations gets close to 70, the percentage of totally correct classifications reaches more than 90%. By using all available representations for recognizing the plate of every car (4880 plates), our framework classifies 94.65% of all cars totally correctly, which is a
Fig. 6. Illustration of improvement in recognition rate achieved by analyzing several license plate representations provided by the tracking scheme
Table 2. Recognition rates of totally correct classified plates (plate level) and correctly classified characters (character level) for a single image based approach in comparison to the tracking based classification

                         Character level   Plate level
Single image approach        80.74%          23.23%
Tracking approach            97.16%          94.65%
promising result for such a challenging data set. Please note that no post-processing, such as checking the validity of license plates according to syntax restrictions, is used. Table 2 analyzes the recognition rate on plate level, i.e., the number of totally correct classified plates, and on character level, i.e., the number of correctly classified characters, for the single image based approach in comparison to our tracking variant. As can be seen, analyzing several representations instead of a single one significantly improves the recognition rate.
4 Conclusion
This paper introduced a novel framework which allows detection, tracking and recognition of license plates. Detection is handled by analysis of Maximally Stable Extremal Region (MSER) detection results and does not require any learning scheme. We introduced a robust tracking scheme, which provides accurate license plate localizations and segmented characters simultaneously. The experimental evaluation showed that promising results are achieved on a challenging data set and that the robust tracking approach significantly improves the recognition rate. Furthermore, due to the high efficiency of the individual components, the framework can be used for real-time traffic video sequence analysis. To make the proposed framework applicable in industrial scenarios, we also ported it to a DSP-based embedded platform. Although the experiments were all performed on a desktop computer, the results also hold for a fully embedded implementation.
References 1. Shapiro, V., Gluhchev, G., Dimov, D.: Towards a multinational car license plate recognition system 17(3), 173–183 (2006) 2. Jia, W., Zhang, H., He, X., Piccardi, M.: Mean shift for accurate license plate localization. In: ITSC. Proceedings of the IEEE Conference on Intelligent Transportation Systems, pp. 566–571. IEEE Computer Society Press, Los Alamitos (2005) 3. Dlagnekov, L., Belongie, S.: Recognizing cars. Technical Report CS2005-0833, UCSD University of California, San Diego (2005) 4. Matas, J., Zimmermann, K.: Unconstrained licence plate and text localization and recognition. In: ITSC. Proceedings of the IEEE Conference on Intelligent Transportation Systems, Vienna, Austria, pp. 572–577. IEEE Computer Society Press, Los Alamitos (2005)
5. Rahman, C., Badawy, W., Radmanesh, A.: A real time vehicle’s license plate recognition system. In: AVSS. Proceedings of the IEEE Conference on Advanced Video and Signal Based Surveillance, pp. 163–166. IEEE Computer Society Press, Los Alamitos (2003) 6. Matas, J., Zimmermann, K.: Unconstrained licence plate detection. In: ITSC. Proceedings of International Conference on Intelligent Transportation Systems, pp. 572–577 (2005) 7. Donoser, M., Bischof, H.: Efficient maximally stable extremal region (MSER) tracking. In: CVPR. Proceedings of Conference on Computer Vision and Pattern Recognition, pp. 553–560 (2006) 8. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from maximally stable extremal regions. In: BMVC. Proceedings of British Machine Vision Conference, pp. 384–393 (2002) 9. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 27(10), 1615– 1630 (2005) 10. Zhang, H., Jia, W., He, X., Wu, Q.: Learning-based license plate detection using global and local features. In: ICPR. Proceedings of International Conference on Pattern Recognition, pp. 1102–1105 (2006) 11. Najman, L., Couprie, M.: Quasi-linear algorithm for the component tree. In: SPIE Vision Geometry XII, vol. 5300, pp. 98–107 (2004) 12. Matas, J., Zimmermann, K.: A new class of learnable detectors for categorisation. In: SCIA. Proceedings of Scandinavian Conference of Image Analysis, pp. 541–550 (2005) 13. Vapnik, V.N.: The nature of statistical learning theory. Springer-Verlag New York, Inc., New York, NY, USA (1995) 14. Zheng, L., He, X.: Number plate recognition based on support vector machines. In: AVSS 2006. Proceedings of the IEEE International Conference on Video and Signal Based Surveillance, Washington, DC, USA, p. 13. IEEE Computer Society Press, Los Alamitos (2006) 15. Sch¨ olkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA (2001)
Action Recognition for Surveillance Applications Using Optic Flow and SVM
Somayeh Danafar* and Niloofar Gheissari
Institute for Studies in Theoretical Physics and Mathematics (IPM), School of Mathematics, Tehran, Iran
[email protected],
[email protected]
Abstract. Low quality images taken by surveillance cameras pose a great challenge to human action recognition algorithms. This is because they are usually noisy, of low resolution and of low frame rate. In this paper we propose an action recognition algorithm to overcome the above challenges. We use optic flow to construct motion descriptors and apply a SVM to classify them. Having powerful discriminative features, we significantly reduce the size of the feature set required. This algorithm can be applied to videos with low frame rate without sacrificing efficiency or accuracy, and is robust to scale and view point changes. To evaluate our method, we used a database consisting of walking, running, jogging, hand clapping, hand waving and boxing actions. This grayscale database has images of low resolution and poor quality. This image database resembles images taken by surveillance cameras. The proposed method outperforms competing algorithms evaluated on the same database.
Keywords: Action recognition, motion recognition, human motion, motion analysis, surveillance.
1 Introduction
In this paper, we address the problem of human action recognition based on optic flow. Human action recognition is a challenging and essential subject in computer vision. In fact, many computer vision applications such as surveillance, human-computer interaction, video retrieval and scene understanding involve recognizing human action. Some major challenges in action recognition for surveillance applications [5] include noisy images [6,11,12] of low resolution and low frame rate. In addition, changes in illumination and view point pose a great challenge to most human action recognition algorithms. Moreover, they are problematic for most segmentation algorithms. Hence, those action recognition algorithms that rely on segmenting body parts fail to recognize the correct actions. There are two general approaches to human action recognition. The first approach is based upon the shape and the movement of different body parts. Wang et al. [16] proposed computing a mean contour to represent static contour information. They
Corresponding author.
Y. Yagi et al. (Eds.): ACCV 2007, Part II, LNCS 4844, pp. 457–466, 2007. © Springer-Verlag Berlin Heidelberg 2007
also built a model composed of 14 rigid body parts, each of which is represented by a truncated cone. An advantage of using body parts is that we can detect local variations in shape and movement which might in turn lead to different actions. However, an accurate segmentation and labeling of human body parts is necessary, which is not a trivial task, especially for surveillance cameras. In contrast to the above methods, the second approach is holistic: it looks at the human body as a whole and tries to recognize actions based on its dominant movement. A pioneering algorithm in this category was proposed by Efros et al. [3]. They build their motion descriptors using blurred optic flow and rely on spatio-temporal correlations to match the descriptors to the database. They reported their results on a database of images taken at a distance. A recently introduced algorithm by Yilmaz and Shah [17] tracks the 2D contour of a human over time to construct spatio-temporal volumes. They analyze different geometrical properties of the volumes to describe motion. This algorithm is invariant to view point and appears to perform well. However, they applied their algorithm to close-up images and therefore we cannot objectively compare their results to ours. An advantage of the above holistic approaches is that they do not need labeling of different body parts. However, the local variations in shape or movement are overlooked. Since local variations might be due to different actions, it is important to take them into account. To overcome the limitations of both approaches and to take advantage of their strengths, we propose a hybrid method in this paper. We first find a bounding box around a human subject as a coarse segmentation and then refine the boundaries using k-means clustering. We split up the human body into head, torso and legs (based on some heuristic ratios). Then we use histograms of horizontal and vertical components of the optic flow field as our action descriptors. Having powerful discriminative features, we significantly reduce the size of the feature set required. We observed that by customizing the number of bins we can capture enough spatial patterns of optical flows to recognize actions. In addition, since we use a coarse-to-fine robust optic flow computation method (Black and Anandan [2]), we can reliably estimate optic flow for large displacements. Histograms of optical flows are robust to noise and changes in illumination. We evaluated our algorithm on a grayscale database of low resolution images and achieve above 85 percent accuracy. This image database resembles low quality images taken by surveillance cameras. This paper is organized as follows. In section 2 we present our human action recognition algorithm. This method has several ingredients including human body segmentation, optic flow computation, local motion representation, and action classification using SVM. Each ingredient is described in the following sub-sections. We used the database introduced by [13] to evaluate our algorithm. The proposed algorithm outperforms the competing methods on the same database (section 3).
2 The Proposed Action Recognition Algorithm Here we first discuss our database and then in the following subsections, we describe the details of the proposed action recognition algorithm.
We used the database used by [13] as it is the largest database available for this application. This database contains 2391 sequences of six types of human actions, that is:
1. Walking
2. Running
3. Jogging
4. Hand clapping
5. Hand waving
6. Boxing
The above actions were performed by 25 people in four different scenarios:
1. Outdoor (s1)
2. Outdoors with scale variations (s2)
3. Outdoors with different clothes (s3)
4. Indoors (s4)
Some samples of this database are shown in Figure 1.

Fig. 1. Action database: different actions (walking, jogging, running, boxing, hand waving, hand clapping) and scenarios (S1–S4)
In this paper only 150 frames of each sequence are used. This is because the actions are periodic and consequently we can extract sufficient information using only a portion of each sequence. In addition, our experiments show that the algorithm needs only limited training samples.
2.1 Human Body Segmentation
In some frames we only have background and there is no human in the scene. We expected that such frames have a nominal optic flow field. However, we observed that due to noise and significant illumination change, the optic flow of the background was significant compared to the human motion.
To detect these frames we first employed a heuristic method to find a bounding box around the human body. We applied the Harris corner detector [6] to extract interest points in each image. Background images (grasses and empty rooms) had almost no interest points compared to the large number of interest points in images containing humans. Using the coordinates of the extracted interest points, we created a bounding box around human subjects (Figure 2). This is only a coarse segmentation and will be refined later. Alternatively, a robust and fast foreground-from-background separation algorithm can be used. Then, in order to accurately subtract the human body from the background in the mentioned bounding boxes, we use a k-means clustering algorithm to cluster the pixels inside the bounding box into two classes. This results in more accurate boundaries of the human body. Applying k-means only to the pixels inside the bounding box also improves the speed of the clustering task significantly.
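A rough sketch of this two-stage segmentation is given below; the Harris parameters, the corner-count threshold and the rule that the darker intensity cluster is the person are assumptions made for illustration, not values reported in the paper.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def coarse_bounding_box(gray, harris_thresh=1e-4, min_corners=20):
    # Harris responses; frames with (almost) no interest points are treated as background only.
    response = cv2.cornerHarris(np.float32(gray) / 255.0, blockSize=2, ksize=3, k=0.04)
    ys, xs = np.nonzero(response > harris_thresh)
    if len(xs) < min_corners:
        return None                      # background-only frame
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def refine_silhouette(gray, box):
    # 2-class k-means on the intensities inside the box; here the darker cluster
    # is assumed to be the person, which need not hold for every sequence.
    x0, y0, x1, y1 = box
    roi = gray[y0:y1 + 1, x0:x1 + 1].astype(np.float32).reshape(-1, 1)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(roi)
    means = [roi[labels == c].mean() for c in (0, 1)]
    fg = (labels == int(np.argmin(means)))
    return fg.reshape(y1 - y0 + 1, x1 - x0 + 1)
```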
Fig. 2. (left) Detected points by Harris corner detector (right) Constructed bounded area from detected points. Database is borrowed from [13].
2.2 Optic Flow Computation The optic flow field is defined as the set of apparent velocities of the brightness patterns in the image plane [8]. To compute optic flow usually two assumptions are made, namely the brightness is assumed to be constant in a small neighborhood and the motion is considered to be small. Other assumptions also could be that the motion is locally smooth or image gradients are constant. The first two assumptions lead us to the first order optic flow equation
u · I_x + v · I_y = I_t    (1)
where u and v are optic flow components in the horizontal and vertical directions. In addition, Ix, Iy and It are derivatives of the brightness function with respect to x, y (image coordinates) and t (time). In this research, we used the method proposed by Black and Anandan [2]. They incorporate a robust estimation method to overcome violations of the brightness constancy and spatial smoothness assumptions due to multiple motions. This method estimates parametric motion models within a region and computes piecewise-smooth flow fields. One reason for choosing this algorithm is that it uses a coarse-to-fine approach which makes the computation robust to large displacements.
We acknowledge the source code provided by them.
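For illustration only, the constraint of Eq. (1) can be solved for (u, v) in the least-squares sense over a small window (a Lucas–Kanade-style estimate). This is not the robust, coarse-to-fine Black–Anandan estimator that is actually used, and the window size is an arbitrary choice.

```python
import numpy as np

def flow_in_window(I0, I1, y, x, half=7):
    # Spatial derivatives of the first frame; the temporal difference is taken as
    # I(t) - I(t+1) so that the form of Eq. (1), u*Ix + v*Iy = It, holds directly.
    Iy, Ix = np.gradient(I0.astype(np.float64))
    It = I0.astype(np.float64) - I1.astype(np.float64)
    win = (slice(y - half, y + half + 1), slice(x - half, x + half + 1))
    A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)   # one constraint per pixel
    b = It[win].ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v
```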
2.3 Local Motion Representation
In this section, we describe the proposed motion descriptor based on histograms of optic flows. We normalized the optic flow values of the segmented human body region using the global minimum and maximum values of optic flow over all images in the training set. Using normalized histograms makes the motion descriptors robust to noise and outliers. Zhu et al. [18] also used histograms of optic flow as motion descriptors for human action recognition. However, they only applied their method to recognize left swings and right swings in tennis games. To categorize forward and backward movements in the same class of actions, we only consider the absolute values of the horizontal and vertical components of optic flow. For this purpose, let Fx and Fy be the horizontal and vertical components of the optic flow field F inside a bounding box. Motivated by Efros et al. [3], we split up Fx and Fy into four channels, Fx+ and Fy+ for optical flows with positive values and Fx−, Fy− for optical flows with negative values. Then we construct two optic flow channels: a positive channel and a negative one. As described in section 1, we use a hybrid approach to take advantage of both holistic and body part based approaches. To consider local information, we partition the bounding box of the human body into 3 parts from top to bottom. The first 1/5 of the window from the top approximately contains the head and neck. The next 2/5 of the window has the body and hands. Finally the bottom 2/5 of the window contains the legs (Figure 3). This heuristic approach can capture local information of optic flow without requiring exact segmentation, which is a challenging task. We consider the combination of the 15 bin histogram of the head-neck part, the 15 bin histogram of the body-hands part, and the 21 bin histogram of the legs part (Fhx, Fhy, Fbx, Fby, Ffx, Ffy) as local motion descriptors. Figure 4 and Figure 5 respectively show the motion descriptors for a walking and a hand waving sequence.
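A simplified sketch of such a descriptor follows. The vertical 1/5, 2/5, 2/5 split and the 15/15/21 bin counts follow the text; the histogram value range and the use of |Fx|, |Fy| directly instead of the four explicit sign channels are simplifying assumptions.

```python
import numpy as np

def motion_descriptor(Fx, Fy, bins=(15, 15, 21), flow_range=(0.0, 8.0)):
    # Fx, Fy: optic flow components cropped to the person's bounding box.
    h = Fx.shape[0]
    cuts = [0, h // 5, h // 5 + (2 * h) // 5, h]          # head-neck, body-hands, legs
    descriptor = []
    for (top, bottom), nbins in zip(zip(cuts[:-1], cuts[1:]), bins):
        for channel in (np.abs(Fx[top:bottom]), np.abs(Fy[top:bottom])):
            hist, _ = np.histogram(channel, bins=nbins, range=flow_range)
            descriptor.append(hist / max(hist.sum(), 1))  # normalised histogram
    return np.concatenate(descriptor)                     # 2*(15+15+21) = 102 values
```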
Fig. 3. Bounding box and its partitions based on ratios
2.4 Motion Classification
Efros et al. [3] used a nearest neighbor classifier for human action recognition. This approach often has the drawback of poor generalization capabilities in real-world conditions.
Fig. 4. Local motion features of head, body and legs parts in horizontal (top three figures) and in vertical direction (bottom three figures) for walking
Fig. 5. Local motion features of head, body and legs parts in horizontal (top three figures) and in vertical direction (bottom three figures) for hand waving
KNN classifiers suffer from the so-called "jig-jag" problem along the decision boundary if the training set is small or the features are high dimensional. This problem is particularly crucial here, since in surveillance applications we consequently have a small training set (because of the low frame rate). In addition, in KNN classifiers the number of neighbors (K) has to be set. To avoid these problems we used SVM [15] in our research. SVM lends itself well to generalization and can avoid the over-fitting phenomenon. SVM has been successfully used in many different recognition tasks. It is widely known that SVM is relatively slow because it uses quadratic programming for optimization. However, we use a small training set and as a result we mitigate the computational burden. Since SVM is a kernel based method, selecting the appropriate type of kernel is crucial. We tested
We modified and used the source code provided in [8].
different kernels including polynomial, intersection, radial basis, linear and χ². We concluded that the intersection kernel outperforms the others in our application. Therefore, we used the intersection kernel as [14]:

K(X, Y) = Σ_{i=1}^{m} min(x_i, y_i) ,    (2)
where xi and yi are feature vectors extracted from the data. This kernel has no free parameters and so can be adapted to real-time requirements. In the next section, we report the results of classifying the proposed motion descriptors using SVM.
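The intersection kernel of Eq. (2) is not among scikit-learn's built-in kernels, so a precomputed Gram matrix can be passed to the SVM; the code below is a sketch and the variable names for the training data are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

def intersection_kernel(X, Y):
    # K(x, y) = sum_i min(x_i, y_i) for every pair of descriptor rows (Eq. 2).
    return np.array([[np.minimum(x, y).sum() for y in Y] for x in X])

clf = SVC(kernel="precomputed")
# clf.fit(intersection_kernel(train_desc, train_desc), train_labels)
# predicted = clf.predict(intersection_kernel(test_desc, train_desc))
```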
3 Experiments and Evaluation
Following Schuldt et al. [13], we divided all sequences into a training set (8 persons), a validation set (8 persons), and a test set (9 persons). The classifier was trained on the training set while the validation set was used to optimize the parameters for each method. Results of our experiments are shown in Figure 6. We achieved an 85 percent total success rate. The algorithm seems to be stable in recognizing different actions and its performance varies between 75 and 92 percent. In Figure 7 and Figure 8, we show the results of four competing algorithms (introduced respectively by Schuldt et al. [13], Yeo et al. [1], Dollar et al. [3] and Niebles et al. [10]) on the same database. In [13], [10] and [3] the authors used local space-time features extracted from video. Schuldt et al. and Dollar et al. combined space-time features with a local kernel in SVM. In contrast, Niebles et al. used a probabilistic Latent Semantic Analysis (PLSA) in an unsupervised learning framework. Yeo et al. used a completely different approach based on a novel motion correlation measure. The results reported by Yeo et al. exclude the indoor scenario.
Fig. 6. Confusion matrix achieved by our algorithm. Results are reported in percents.
Comparing Figure 6 with Figures 7 and 8 shows the success of the proposed method. It is due to the discriminative motion descriptors, which capture sufficient conceptual information to recognize actions. In addition, we use a hybrid method in contrast with the local methods presented in [13] and [10]. This means we benefit from the advantages of both global and local methods.
Fig. 7. (left) Confusion matrix achieved by Yeo et al. [1]. (right) Confusion matrix achieved by Schuldt et al. [13]. Results are reported in percents.
Fig. 8. (left) Confusion matrix achieved by Dollar et al. [3]. (right) Confusion matrix achieved by Niebles et al. [10]. Results are reported in percents.
4 Conclusion In this paper, we proposed an action recognition algorithm suitable for surveillance applications. We use histograms of vertical and horizontal optic flow as motion descriptors that are robust to noise and outliers. We use a coarse partitioning of human body to capture local motion information which in turn makes the algorithm capable of detecting local variations. We apply a SVM to classify the descriptors. SVM has strong generalization capability in real world applications and performs well
with limited training samples. Therefore, this algorithm can be applied to videos with a low frame rate. To evaluate our method, we applied it to a database of grey-level, low-resolution images. This database consists of walking, running, jogging, hand clapping, hand waving and boxing image sequences. This database includes noisy images taken at different view points, scales and illumination conditions. We achieved a total success rate of about 85 percent. The success rates for different actions vary from 75 percent (for hand clapping) to 92 percent (for running and jogging). This means that, unlike the competing methods, we achieve a reasonable and relatively stable performance for all types of action.
References 1. Ahammad, P., Yeo, C., Sastry, S.S., Ramchandran, K.: Compressed Domain Real-time Action Recognition. In: MMSP. Proceedings of 8th IEEE Workshop on Multimedia Signal Processing, pp. 33–36. IEEE Computer Society Press, Los Alamitos (2006) 2. Black, M.J., Anandan, P.: The Robust Estimation of Multiple Motions: Parametric and Piecewise-Smooth Flow Fields. Computer Vision and Image Understanding (CVIU) 63, 75–104 (1996) 3. Dollar, P., Rabaud, V., Cottrel, G., Belongie, S.: Behavior Recognition Via SpatioTemporal Features. In: Proceedings 2nd Joint IEEE International Workshop on VS-PETS, Beijing, pp. 65–72. IEEE Computer Society Press, Los Alamitos (2005) 4. Efros, A.A., Berg, A.C., Mori, G., Malik, J.: Recognizing Action at a Distance. In: ICCV 2003. IEEE International Conference on Computer Vision, Nice, France, pp. 726–733. IEEE Computer Society Press, Los Alamitos (2003) 5. Gavrila, D.M.: The Analysis of Human Motion and its Application for Visual Surveillance. In: Proceedings of the Second IEEE Workshop on Visual Surveillance, p. 3. IEEE Computer Society Press, Los Alamitos (1999) 6. Gavrila, D.M.: The Visual Analysis of Human Movement: A Survey. Computer Vision and Image Understanding (CVIU) 23, 82–98 (1999) 7. Harris, C., Stephens, M.: A combined Corner and Edge Detector. In: Proceedings of the 4th Alvey Vision Conference, Manchester, UK, pp. 147–151 (1988) 8. Horn, B.K., Schunck, B.G.: Determining Optic Flow. Artificial Intelligence 17, 185–203 (1981) 9. Chih-Jen, L., Chin-chung, C.: LIBSVM–A Library for Support Vector Machines (2007) 10. Niebles, J.C., Wang, H., Fei-fei, L.: Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words. In: BMVC (2006) 11. Niu, W., Long, J., Han, D., Wang, Y.-F.: Human Activity Detection and Recognition for Video Surveillance. In: Proceedings of the IEEE Multimedia and Expo Conference, pp. 719–722. IEEE Computer Society Press, Los Alamitos (2004) 12. Robertson, N., Reid, I., Brady, M.: Behavior Recognition and Explanation for Video Survillance. In: ICDP 2006. IEE Conf. Imaging for Crime Detection and Prevention (2006) 13. Schuldt, C., Laptev, I., Caputo, B.: Recognizing Human Actions: A Local SVM Approach. In: ICPR 2004, pp. 32–36 (2004) 14. Swain, M.J., Ballard, D.H.: Color Indexing. International Journal of Computer Vision 7, 11–32 (1991)
15. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995) 16. Wang, L., Ning, H., Hu., W.: Fusion of Static and Dynamic Body Biometrics for Gait Recognition. In: International Conference on Computer Vision, Nice, France, pp. 1449– 1454 (2003) 17. Yilmaz, A., Shah, M.: Actions As Objects: A Novel Action Representation. In: IEEE CVPR, San Diego, IEEE Computer Society Press, Los Alamitos (2005) 18. Zhu, G., Changsheng, X., Gao, W., Huang, Q.: Action Recognition in Broadcast Tennis Video Using Optical Flow and SVM. In: Proceedings of HCI/ECCV (2006)
The Kernel Orthogonal Mutual Subspace Method and Its Application to 3D Object Recognition
Kazuhiro Fukui1 and Osamu Yamaguchi2
1 Department of Computer Science, University of Tsukuba, Japan
2 Corporate Research and Development Center, Toshiba Corporation, Japan
Abstract. This paper proposes the kernel orthogonal mutual subspace method (KOMSM) for 3D object recognition. KOMSM is a kernel-based method for classifying sets of patterns such as video frames or multi-view images. It classifies objects based on the canonical angles between the nonlinear subspaces, which are generated from the image patterns of each object class by kernel PCA. This methodology has been introduced in the kernel mutual subspace method (KMSM). However, KOMSM is different from KMSM in that nonlinear class subspaces are orthogonalized based on the framework proposed by Fukunaga and Koontz before calculating the canonical angles. This orthogonalization provides a powerful feature extraction method for improving the performance of KMSM. The validity of KOMSM is demonstrated through experiments using face images and images from a public database.
1 Introduction
This paper introduces the kernel orthogonal mutual subspace method (KOMSM) for 3D object recognition. KOMSM is an appearance-based method for classifying a set of patterns such as video frames or multi-view images. As the set of such patterns generally has highly nonlinear structure, we have to tackle a nonlinear classification problem of multiple sets of patterns. The kernel mutual subspace method (KMSM) is one suitable method for this task. KMSM is a nonlinear extension of the mutual subspace method (MSM)[3] by using the kernel trick. MSM classifies sets of patterns based on the canonical angles θ between linear class subspaces, which represent the distribution of the training set of each class respectively as shown in Fig.1. In this method a w×h image pattern is represented as a vector in w×h-dimensional space (called input space I). Although MSM has the ability to handle the variability of patterns to achieve higher performance compared to other methods[1], its performance drops significantly when the pattern distributions have highly nonlinear structure. In such cases class distributions cannot be represented by a linear subspace without overlapping each other. The kernel mutual subspace method (KMSM)[4,5] has been proposed in order to solve this problem. In this method an input pattern x is mapped into a high (in some cases infinite) dimensional feature space F via a nonlinear map φ. Y. Yagi et al. (Eds.): ACCV 2007, Part II, LNCS 4844, pp. 467–476, 2007. c Springer-Verlag Berlin Heidelberg 2007
Fig. 1. Similarity between two distributions of view patterns in the input space I (each set of view patterns is compressed by the KL expansion into a subspace P or Q, and the canonical angles θ between the two subspaces are measured)
Consequently, KMSM carries out the MSM on the linear subspaces 1 generated from the mapped patterns {φ(x)} using the Karhunen-Lo`eve (KL) expansion, also known as principal component analysis (PCA). KMSM works well since each subspace can be generated without overlapping with another subspace in the feature space F . However its classification performance is still insufficient for many applications in practice, because the nonlinear class subspaces are generated independently of each other [1]. There is no reason to assume a priori that it is the optimal nonlinear class subspace in terms of classification performance, while each nonlinear class subspace represents the distribution of the training patterns well in terms of a least-mean-square approximation. This suggests that there is room for improving the performance of KMSM. In order to improve the performance of KMSM the kernel constrained mutual subspace method (KCMSM)[11] has been proposed. In this method, each nonlinear class subspace is projected onto a discrimination space called the constraint subspace. This projection extracts a common subspace of all the nonlinear class subspaces from each nonlinear class subspace, so that the canonical angles between nonlinear class subspaces are enlarged to approach orthogonal relation. KCMSM could significantly improve the performance of KMSM in many applications, such as face recognition and 3D object recognition[11]. This implies that the concept of orthogonalization is essential for improving the performance of KMSM. Therefore, in this paper we apply the framework proposed by Fukunaga and Koontz[2], which achieves perfect orthogonalization, while the orthogonalization achieved in KCMSM is only approximate. We orthogonalize the nonlinear class subspaces using this framework and then apply KMSM to the orthogonalized nonlinear class subspaces. Fukunaga and Koontz’s method has been applied to related linear subspace methods including MSM, improving their performance [1,7,8,9]. In Fukunaga and Koontz’s method orthogonalization is achieved by applying a whitening transformation matrix to the training patterns or orthonormal basis vectors of each class subspace. Thus, the main task we need to solve is to construct the whitening transformation matrix for orthogonalizing the linear subspaces in the feature space F . 1
Note that a linear subspace in the feature space F is a nonlinear subspace in the input space I.
Fig. 2. Concept of orthogonalization of subspaces by the whitening transformation (class subspaces P1, P2, P3 before and after the whitening transformation)
This paper is organized as follows. Section 2 describes the calculation of the canonical angles. In Section 3, we compute the whitening transformation. In Section 4, we define the kernel whitening transformation and orthogonalize nonlinear class subspaces. Then we construct the KOMSM by applying the KMSM to the orthogonalized nonlinear class subspaces. Section 5 demonstrates the effectiveness of KOMSM through experiments. We conclude in section 6.
2 Canonical Angles Between Subspaces
Given a d_p-dimensional linear subspace P and a d_q-dimensional linear reference subspace Q in the n-dimensional input space, the canonical angles {0 ≤ θ_1, . . . , θ_{d_p} ≤ π/2} between P and Q (for convenience d_p ≤ d_q) are uniquely defined as [12]:

cos(θ_i) = max_{u_i ∈ P} max_{v_i ∈ Q} u_i^T v_i ,    (1)

subject to: u_i^T u_i = v_i^T v_i = 1, u_i^T u_j = 0, v_i^T v_j = 0, i ≠ j, i = 1∼d_p, j = 1∼d_p.

Let Φ_i and Ψ_i denote the i-th n-dimensional orthonormal basis vectors of the subspaces P and Q, respectively. These orthonormal basis vectors can be obtained as the eigenvectors of the correlation matrix Σ_{i=1}^{l} x_i x_i^T calculated from the l learning patterns {x} of each class. A practical method of finding the canonical angles is by computing the matrix X = A^T B, where A = [Φ_1, . . . , Φ_{d_p}] and B = [Ψ_1, . . . , Ψ_{d_q}]. Let {κ_1, . . . , κ_{d_p}} (κ_1 ≥ . . . ≥ κ_{d_p}) be the singular values of the matrix X. The canonical angles {θ_1, . . . , θ_{d_p}} can be obtained as {cos^{-1}(κ_1), . . . , cos^{-1}(κ_{d_p})}. In practice, we consider the value S[t] = (1/t) Σ_{i=1}^{t} cos² θ_i as the similarity between two subspaces. The value S reflects the structural similarity between two subspaces.
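In code, this similarity reduces to a thin SVD. The sketch below assumes patterns are supplied as row vectors and uses t = 3 canonical angles as an illustrative choice (the value of t is not fixed by the text).

```python
import numpy as np

def subspace_basis(patterns, dim):
    # Orthonormal basis spanning the patterns: leading eigenvectors of the
    # correlation matrix, obtained here through an SVD (no mean subtraction).
    U, _, _ = np.linalg.svd(np.asarray(patterns, dtype=float).T, full_matrices=False)
    return U[:, :dim]                       # columns are the basis vectors

def msm_similarity(A, B, t=3):
    # Singular values of X = A^T B are the cosines of the canonical angles;
    # S[t] is the mean of the t largest squared cosines.
    cosines = np.linalg.svd(A.T @ B, compute_uv=False)
    return float(np.mean(np.clip(cosines[:t], 0.0, 1.0) ** 2))
```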
3 Orthogonalization by the Whitening Transformation
In this section, we will describe how to calculate the whitening matrix O for orthogonalizing r d-dimensional linear class subspaces with the orthonormal basis vectors ei (i = 1∼d) in the n-dimensional input space I.
At first, we define the projection matrix corresponding to the projection onto the class i subspace P_i by P_i = Σ_{j=1}^{d} e_j e_j^T, where e_j is the j-th orthonormal basis vector of P_i. Then we define the total projection matrix G = Σ_{i=1}^{r} P_i. Using the eigenvectors and the eigenvalues of the total projection matrix G, the v×n whitening matrix O is defined by the following equation:

O = Λ^{-1/2} H^T ,    (2)
where v = r×d (v = n if v > n), Λ is the v×v diagonal matrix with the i-th highest eigenvalue of the matrix G as its i-th diagonal component, and H is the n×v matrix whose i-th column vector is the eigenvector of the matrix G corresponding to the i-th highest eigenvalue. We can confirm that the matrix O whitens the matrix G so that the r class subspaces are orthogonalized, as proved in [8]. Fig. 2 shows the concept of orthogonalization by the whitening transformation. We can see that the orthogonalization is achieved by whitening, i.e., homogenizing the variances in all directions: all transformed basis vectors are orthogonal to each other when the product of the number r of classes and the dimension m of each class is smaller than the dimension n of the input space [8]. On the other hand, since the transformed basis vectors of an input subspace are no longer orthogonal, all transformed basis vectors need to be orthogonalized again by Gram-Schmidt orthogonalization.
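A sketch of this construction is given below, assuming each class subspace is supplied as an n×d matrix of orthonormal column basis vectors; the eigenvalue cut-off is an assumption.

```python
import numpy as np

def whitening_matrix(class_bases, eps=1e-12):
    # Total projection matrix G and the whitening matrix O = Lambda^{-1/2} H^T of Eq. (2).
    G = sum(B @ B.T for B in class_bases)
    lam, H = np.linalg.eigh(G)                       # ascending eigenvalues
    order = np.argsort(lam)[::-1]
    lam, H = lam[order], H[:, order]
    keep = lam > eps
    return np.diag(1.0 / np.sqrt(lam[keep])) @ H[:, keep].T

def orthonormalize(O, basis):
    # Whiten the basis of an input subspace and re-orthonormalize it
    # (Gram-Schmidt, realized here through a QR decomposition).
    Q, _ = np.linalg.qr(O @ basis)
    return Q
```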
4 The Proposed Method
In this section, we will first describe the concept of nonlinear subspaces, and then construct the kernel whitening matrix in the feature space F.
4.1 Nonlinear Subspace by Kernel PCA
The nonlinear function φ maps the pattern x = (x_1, . . . , x_n) of an n-dimensional input space I onto an f-dimensional feature space F: φ : R^n → R^f, x → φ(x). To perform PCA on the mapped patterns, we need to calculate the inner product (φ(x) · φ(y)) between the function values. However, this calculation is difficult, because the dimension f of the feature space F can be very high, possibly infinite. However, if the nonlinear map φ is defined through a kernel function k(x, y) which satisfies Mercer's conditions, the inner products (φ(x) · φ(y)) can be calculated from the inner products (x · y). This technique is known as the "kernel trick". A common choice is to use an exponential function: k(x, y) = exp(−||x − y||² / σ²). The function φ with this kernel function maps an input pattern onto an infinite feature space F. The PCA of the mapped patterns is called kernel PCA [10].
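A compact sketch of kernel PCA with this Gaussian kernel is given below; it produces the expansion coefficients of the nonlinear subspace basis vectors used in the next subsection (feature-space centering is omitted for brevity, and the d leading eigenvalues are assumed to be positive).

```python
import numpy as np

def gaussian_kernel(X, Y, sigma2):
    # k(x, y) = exp(-||x - y||^2 / sigma^2) for all pairs of rows of X and Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / sigma2)

def kernel_pca_coefficients(X, dim, sigma2):
    # Coefficients a_i of e_i = sum_j a_ij phi(x_j), normalised so that
    # lambda_i * (a_i . a_i) = 1, for the d leading eigenvalues of K.
    K = gaussian_kernel(X, X, sigma2)
    lam, vecs = np.linalg.eigh(K)                    # ascending order
    lam, vecs = lam[::-1][:dim], vecs[:, ::-1][:, :dim]
    return vecs / np.sqrt(lam), K                    # columns are a_1 ... a_d
```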
4.2 Kernel Whitening Transformation
In the following, we will explain how to generate the kernel whitening matrix Oφ from all the basis vectors of r d-dimensional nonlinear class subspaces Vk (k =
1 ∼ r), that is, r×d basis vectors. This calculation corresponds to the kernel PCA for the basis vectors of all the classes. Assume that a class k nonlinear subspace V_k is generated from l learning patterns x_i^k (i = 1∼l). The d basis vectors e_i^k (i = 1∼d), which expand the subspace V_k, are defined by the following equation:

e_i^k = Σ_{j=1}^{l} a_{ij}^k φ(x_j^k) ,    (3)
where the coefficient a_{ij}^k is the j-th component of the eigenvector a_i corresponding to the i-th largest eigenvalue λ_i of the l×l kernel matrix K defined by k_{ij} = (φ(x_i^k), φ(x_j^k)). a_i is normalized to satisfy λ_i (a_i · a_i) = 1. Next, assume that E is the matrix where all basis vectors are arranged as the column components:

E = [e_1^1, . . . , e_d^1, . . . , e_1^r, . . . , e_d^r] .    (4)
Then, we solve the eigenvalue problem of the matrix Q defined by the following equation:

Q b = β b ,    (5)
Q_{ij} = (E_i · E_j),   (i, j = 1∼r×d) ,
where E_i means the i-th column component of the matrix E. In the above equation, the inner product between the i-th basis vector e_i^k of class k and the j-th basis vector e_j^{k*} of class k* can actually be calculated as a linear combination of kernel function values k(x^k, x^{k*}) of x^k and x^{k*}:

(e_i^k · e_j^{k*}) = ( Σ_{s=1}^{l} a_{is}^k φ(x_s^k) · Σ_{t=1}^{l} a_{jt}^{k*} φ(x_t^{k*}) )    (6)
                   = Σ_{s=1}^{l} Σ_{t=1}^{l} a_{is}^k a_{jt}^{k*} (φ(x_s^k) · φ(x_t^{k*}))    (7)
                   = Σ_{s=1}^{l} Σ_{t=1}^{l} a_{is}^k a_{jt}^{k*} k(x_s^k, x_t^{k*}) .    (8)
The i-th row vector O_{φi} of the kernel whitening matrix O_φ can be represented as the linear combination of the vectors E_j (j = 1∼r×d) using the eigenvector b_i corresponding to the eigenvalue β_i as the combination coefficients:

O_{φi} = Σ_{j=1}^{r×d} (b_{ij} / √β_i) E_j ,    (9)
where the vector bi is normalized to satisfy that βi (bi · bi ) is equal to 1. The row vectors of Oφ with eigenvalues β lower than a threshold value are discarded,
since their reliability is low. Moreover, assume that E[j] is the η(j)-th basis vector of the class ζ(j). Then the above equation can be changed as follows:

O_{φi} = Σ_{j=1}^{r×d} (b_{ij} / √β_i) Σ_{s=1}^{l} a_{η(j)s}^{ζ(j)} φ(x_s^{ζ(j)})    (10)
       = Σ_{j=1}^{r×d} Σ_{s=1}^{l} (b_{ij} / √β_i) a_{η(j)s}^{ζ(j)} φ(x_s^{ζ(j)}) .    (11)
Although we cannot calculate this vector O_{φi}, the inner product with the mapped vector φ(x) can be calculated.
4.3 Whitening Transformation of the Mapped Patterns
The mapped vector φ(x) is transformed by the kernel whitening matrix. This can be calculated from an input vector x and all r×l learning vectors x_s^k (s = 1∼l, k = 1∼r) using the following equation:

(O_{φi} · φ(x)) = Σ_{j=1}^{r×d} Σ_{s=1}^{l} (b_{ij} / √β_i) a_{η(j)s}^{ζ(j)} (φ(x_s^{ζ(j)}) · φ(x))    (12)
                = Σ_{j=1}^{r×d} Σ_{s=1}^{l} (b_{ij} / √β_i) a_{η(j)s}^{ζ(j)} k(x_s^{ζ(j)}, x) .    (13)
Finally, the whitening transformed vector χ(φ(x)) (= O_φ φ(x)) of the mapped vector φ(x) is represented as (z_1, z_2, . . . , z_{no})^T, z_i = (φ(x) · O_{φi}), 1 ≤ i ≤ no ≤ r×d, where no is the row number of O_φ mentioned above.
4.4 The KOMSM Algorithm
We construct the KOMSM by applying the linear MSM to the linear subspaces generated from the whitening transformed vectors {χ(φ(x))} in the feature space F as follows. In the learning stage:
1. The nonlinear mapped patterns φ(x_i^k) of all the patterns x_i^k (i = 1∼l) belonging to class k (= 1∼c) are transformed by the kernel whitening matrix O_φ.
2. The basis vectors of the d-dimensional linear orthogonal reference subspace P_k^{O_φ} of class k (= 1∼c) are obtained as the eigenvectors, corresponding to the d highest eigenvalues, of the correlation matrix generated from the whitening transformed pattern set {χ(φ(x_1^k)), . . . , χ(φ(x_l^k))}.
In the recognition stage:
1. The linear input orthogonal subspace P_in^{O_φ} is also generated from the whitening transformed pattern set {χ(φ(x_1^in)), . . . , χ(φ(x_l^in))}.
2. The canonical angles between the linear orthogonal input subspace P_in^{O_φ} and all the linear orthogonal reference subspaces P_k^{O_φ} (k = 1∼c) are calculated as the similarity.
3. Finally the object class is determined as the linear orthogonal reference subspace with the highest similarity S, given that S is above a threshold value.
In the above process, it is possible to replace the generation process of the nonlinear orthogonal subspaces with the following processes. Firstly the input subspace and the reference subspaces are generated from the set of the nonlinear mapped patterns. Next the basis vectors of the generated subspaces are transformed by the kernel whitening matrix, and then the whitening transformed basis vectors are orthogonalized by the Gram-Schmidt method.
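A schematic sketch of the recognition stage is shown below. It assumes a helper project(x) that returns the whitening transformed vector χ(φ(x)) of Eqs. (12)–(13) and precomputed orthogonal reference bases; the subspace dimension, the number of canonical angles and the rejection threshold are hypothetical values.

```python
import numpy as np

def komsm_classify(input_patterns, reference_bases, project, dim=7, t=3, threshold=0.0):
    # Input orthogonal subspace built from the whitening transformed patterns.
    Z = np.column_stack([project(x) for x in input_patterns])
    A, _, _ = np.linalg.svd(Z, full_matrices=False)
    A = A[:, :dim]
    # Similarity S[t] against every orthogonal reference subspace.
    similarities = []
    for B in reference_bases:
        cosines = np.linalg.svd(A.T @ B, compute_uv=False)   # canonical angle cosines
        similarities.append(float(np.mean(cosines[:t] ** 2)))
    best = int(np.argmax(similarities))
    if similarities[best] <= threshold:
        return None, similarities[best]                      # rejected: below the threshold
    return best, similarities[best]
```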
5 Evaluation Experiments
We compared the performances of KOMSM with previous methods (MSM [3], CMSM [6], OMSM [8], KMSM [4], KCMSM [11]) using the public database of multi-view images (Cropped-close128 of ETH-80) [13] and the data set of front face images collected by ourselves.
5.1 3D Object Recognition (Experiment-I)
Thirty similar models were selected as the evaluation data from the ETH-80 database as shown in Fig. 3. The images of each model were captured from 41 views as shown in Fig. 3. All images are cropped, so that they contain only the object without any border area. The odd numbered images (21 frames) and the even numbered images (20 frames) were used for training and evaluation, respectively. We prepared 10 datasets for each model by changing the start frame number i from 1 to 10, where the 10 frames from the i-th to the (i+9)-th frame form one set. The total number of evaluation trials is 9000 (= 10×30×30). The evaluation was performed in terms of the recognition rate and the equal error rate (EER), which represents the error rate at the point where the false accept rate (FAR) is equal to the false reject rate (FRR). The experimental conditions are as follows. We converted the 180×180 pixel color images to 15×15 pixel monochrome images and used them as the evaluation data. Thus, the dimension n of a pattern is 225 (= 15×15). The dimensions of the input subspace and the reference subspaces were set to 7 for all methods. The whitening matrix O and the kernel whitening matrix O_φ were generated from thirty 20-dimensional linear class subspaces and thirty 20-dimensional nonlinear class subspaces, respectively. The row numbers of O and O_φ were set to 100 and 550, respectively, on the basis of the eigenvalues β in Eq. (6). The dimensions of the constraint subspaces of the CMSM and KCMSM were set to 200 and 400, respectively, as used in [11]. The length of the input vectors was not normalized for any of the methods. A Gaussian kernel with σ² = 1e+6 was used for all the kernel methods.
Fig. 3. Left: Evaluation data; top: cow, middle: dog, bottom: horse. This figure shows five of ten models. Right: All view-patterns of dog1; the rows indicated by arrows are used as training data.

Table 1. Recognition rate and EER of each method (%) (Experiment-I)

Method   MSM    CMSM-200   OMSM-100   KMSM    KCMSM-400   KOMSM-550
Rate     78.6   92.33      89.6       96.33   99.67       99.67
EER      16.6   4.7        7.7        5.2     1.0         1.0
Table 1 shows the recognition rate and EER of each method. The recognition of multiple view images is typically a nonlinear problem. This is clearly shown by the experimental results: the performance of the nonlinear methods (KMSM, KCMSM and KOMSM) is superior to that of the linear methods (MSM, CMSM and OMSM). The performance of MSM was improved by the nonlinear extension of MSM to KMSM, where the recognition rate increased from 78.6% to 96.3% and the EER decreased from 16.6% to 5.2%. KOMSM further improved the performance of KMSM, where the recognition rate increased from 96.3% to 99.6% and the EER decreased from 5.2% to 1.0%. This confirms the effectiveness of the orthogonalization of the nonlinear subspaces, which serves as a feature extraction step in the feature space F.
5.2 Recognition of Face Image (Experiment-II)
We conducted the evaluation experiment of all the methods using the face images of 50 persons captured under 10 kinds of lighting. We cropped the 15×15 pixel
Fig. 4. Face images: From left, Lighting1∼Lighting10
Table 2. Recognition rate and EER of each method (%) (Experiment-II)

Method             MSM     CMSM-200   OMSM    KMSM    KCMSM-1050   KOMSM
Recognition rate   91.74   91.30      97.09   91.15   97.40        97.42
EER                12.0    7.5        6.3     11.0    4.3          3.5
face images from the 320 × 240 pixel input images based on the positions of pupils and nostrils. The normalized face patterns of subjects 1-25 in lighting conditions L1-L10 were used for generating the difference subspace D, the kernel difference subspace Dφ , the whitening matrix O and the kernel whitening matrix Oφ . The face patterns extracted from the images of the other subjects, 26-50, in lighting conditions L1-L10 were used for evaluation. The number of the data of each person is 150∼180 frames for each lighting condition. The data was divided into 15∼18 sub datasets by every 10 frames. The input subspaces were generated from these sub datasets. The dimension of the input subspace and reference subspaces were set to 7 for all the methods. The difference subspace D and the whitening matrix O were generated from 25 60-dimensional linear subspaces of 1∼25 persons. The kernel difference subspace Dφ and the kernel whitening matrix Oφ were generated from 25 60-dimensional nonlinear class subspaces. The row numbers of O and Oφ were set to the full dimensions, 225 and 1500, respectively. The dimensions of the generalized difference subspace and the kernel generalized difference subspace were set to 200 and 1050, respectively. We used a Gaussian kernel with σ 2 = 1.0 for all nonlinear methods. Table 2 shows the recognition rate and the EER of each method. The difference between the recognition rates of OMSM and KOMSM was small, while the EER decreased from 6.3% to 3.5%. This implies that the data sets used in this task do not exhibit highly nonlinear structure. The good performance of KOMSM and KCMSM are also demonstrated in this experiment. In particular the EER of KOMSM is very low. Although the performance of KOMSM and KCMSM are at the same level regardless of their different principals of orthogonalization, KOMSM has an advantage in selecting parameters compared to KCMSM. KOMSM does not have any parameters to be tuned. In contrast, the dimension of the constraint subspace used in KCMSM needs to be carefully selected in prior experiments, since the performance of KCMSM strongly depends on this value.
6 Conclusion
In this paper we have introduced the kernel orthogonal mutual subspace method (KOMSM) and applied it to 3D object recognition. The essence of the KOMSM is to orthogonalize nonlinear subspaces based on Fukunaga and Koontz’s framework before applying the KMSM. We have confirmed that this orthogonalization provides a strong feature extraction method for the KMSM and that the performance of the KMSM is improved significantly. This was shown by evaluation
on multi-view image sets of 3D objects as well as frontal face images. In future work, we will attempt to find an efficient computation of the eigenvalue problems of the matrices K and Q, whose sizes are proportional to the numbers of classes and training patterns.
References 1. Oja, E.: Subspace methods of pattern recognition. Research Studies Press (1983) 2. Fukunaga, K., Koontz, W.C.G.: Application of the Karhunen-Loeve expansion to feature selection and ordering. IEEE Trans. Computers C-19, 311–318 (1970) 3. Maeda, K., Watanabe, S.: A pattern matching method with local structure: Trans. IEICE J68-D(3), 345–352 (1985) (in Japanese) 4. Sakano, H., Mukawa, N., Nakamura, T.: Kernel Mutual Subspace Method and its Application for Object Recognition. Electronics and Communications in Japan 88, 45–53 (2005) 5. Wolf, L., Shashua, A.: Kernel principal angles for classification machines with applications to image sequence interpretation. In: CVPR 2003, pp. 635–642 (2003) 6. Fukui, K., Yamaguchi, O.: Face recognition using multi-viewpoint patterns for robot vision. In: ISRR. 11th International Symposium of Robotics Research, pp. 192–201 (2003) 7. Nagao, K., Sohma, M.: Weak orthogonalization of face and perturbation for recognition. In: CVPR 1998, pp. 845–852 (1998) 8. Kawahara, T., Nishiyama, M., Yamaguchi, O.: Face recognition by orthogonal mutual subspace method. CVIM-151 151, 17–24 (2005) (in Japanese) 9. Kim, T.-K., Kittler, J., Cipolla, R.: Incremental Learning of Locally Orthogonal Subspaces for Set-based Object Recognition. In: BMVC 2006, pp. 559–568 (2006) 10. Sch¨ olkopf, B., Smola, A., M¨ uller, K.-R.: Nonlinear principal component analysis as a kernel eigenvalue problem. Neural Computation 10, 1299–1319 (1998) 11. Fukui, K., Stenger, B., Yamaguchi, O.: A framework for 3D object recognition using the kernel constrained mutual subspace method. In: Narayanan, P.J., Nayar, S.K., Shum, H.-Y. (eds.) ACCV 2006. LNCS, vol. 3851, pp. 315–324. Springer, Heidelberg (2006) 12. Chatelin, F.: Eigenvalues of matrices. John Wiley & Sons, Chichester (1993) 13. Leibe, B., Schiele, B.: Analyzing appearance and contour based methods for object categorization. In: CVPR 2003, pp. 409–415 (2003)
Viewpoint Insensitive Action Recognition Using Envelop Shape
Feiyue Huang and Guangyou Xu
Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China
[email protected],
[email protected]
Abstract. Action recognition is a popular and important research topic in computer vision. However, it is challenging when facing viewpoint variance. So far, most researches in action recognition remain rooted in view-dependent representations. Some view invariance approaches have been proposed, but most of them suffer from some weaknesses, such as lack of abundant information for recognition, dependency on robust meaningful feature detection or point correspondence. To perform viewpoint and subject independent action recognition, we propose a representation named “Envelop Shape” which is viewpoint insensitive. “Envelop Shape” is easy to acquire from silhouettes using two orthogonal cameras. It makes full use of two cameras’ silhouettes to dispel influence caused by human body’s vertical rotation, which is often the primary viewpoint variance. With the help of “Envelop Shape”, we obtained inspiring results on action recognition independent of subject and viewpoint. Results indicate that “Envelop Shape” representation contains enough discriminating features for action recognition. Keywords: Viewpoint Insensitive, Envelop Shape, Action Recognition, HMM.
1 Introduction Human action recognition is an active area of research in computer vision. There have been several surveys which tried to summarize and classify previous existing approaches on this area [1], [2], [3], [4]. In this paper, we develop a general approach to recognize actions independent of viewpoint and subject, focusing on discovering viewpoint invariance for action recognition. In our research work, we define human action recognition system as a system made up of three modules: Preprocessing, Pose Estimation and Recognition. Preprocessing module includes human detection and tracking, it extracts low level representation for pose estimation. Pose Estimation module is the process of identifying and representing how a human body and/or individual limbs are configured in a single frame. Recognition module uses results of pose estimation of frames to classify actions. Here we define posture as a kind of representation of human body in a single frame, for example, horizontal and vertical histograms of silhouette [5], vector of distances from boundary pixels to the centroid [6]. In our opinion, posture representation is one of the most basic and key issues in action recognition system. Y. Yagi et al. (Eds.): ACCV 2007, Part II, LNCS 4844, pp. 477–486, 2007. © Springer-Verlag Berlin Heidelberg 2007
It is well known that a good representation for classification should have the measurement property that its values are similar for objects in the same category while very different for objects in different categories. So this leads to the idea of seeking distinguishing features that are invariant to irrelevant transformations of the input [7]. In the case of recognition of human action, we argue that a good feature representation should be able to tolerate variations in viewpoint, human subject, background, illumination and so on. Among them, the most important invariance is viewpoint invariance. We can perform training and recognition according to given environments and specialized persons. But in order to perform natural human action recognition, we cannot limit the human body's movement and rotation at any time, which inevitably leads to variable viewpoint. It is a challenge to find a viewpoint invariant posture representation for action recognition. There have been some proposed approaches to viewpoint invariant action recognition. Campbell et al. proposed a complex 3D gesture recognition system based on stereo data [8]. Seitz and Dyer described an approach to detect cyclic motion that is affine invariant [9]. Cen Rao did a lot of research work on view invariant analysis of human activities [10], [11]. He used the trajectory of the hand centroid to describe an action performed by one hand. He discovered affine invariance of the trajectory and his system can work automatically. Vasu Parameswaran also focused on approaches for view invariant human action recognition [12], [13]. He chose six joints of the body and calculated their 3D invariants for each posture. So each posture can be represented by a parametric surface in 3D invariance space. Daniel et al. introduced Motion History Volumes as a free viewpoint representation for action recognition, which needs multiple calibrated cameras [14]. Though there has been some research work on viewpoint invariant action recognition, there are still many problems to be solved. Most approaches depend on robust meaningful feature detection or point correspondence, which, as we know, are often hard to implement. And there is a tradeoff: to be insensitive to viewpoint, some useful information for discriminating different actions is often eliminated. How to make a representation insensitive to viewpoint while still keeping appropriate discriminating information for recognition appears to be the key issue. In this regard, we propose a posture representation named "Envelop Shape". Under the assumption of an affine camera projection model, we prove both in theory and in experiments that such a representation is viewpoint insensitive for action recognition. "Envelop Shape" is easy to acquire from low level features, which can be obtained from silhouettes of subjects by using two orthogonal cameras. It conveys more information compared to previous view invariant representations for action recognition. And it does not rely on any meaningful feature detection or point correspondence, which as we know is often difficult and sensitive to errors. With the help of our proposed representation, we develop our action recognition system. Experiment results show that our system has impressive discriminating ability for actions independent of subject and viewpoint. The remainder of this paper is organized as follows. In section 2, we present our viewpoint insensitive representation.
We describe the implementation of our action recognition system and present experimental results in Section 3, and conclude in Section 4.
2 View Invariant Posture Representation
In human action recognition, the representation is a basic and key issue. A good system should be viewpoint invariant in order to recognize natural actions, so a direct idea is to seek a viewpoint invariant representation, that is, one whose measurements stay almost the same under different viewpoints.
2.1 Viewpoint in Action Recognition
A viewpoint transformation can be separated into two parts, translation and rotation. In action recognition almost all representations are translation invariant, so we only consider rotation invariance. Figure 1 shows the coordinate system used in our work; the Y-axis is vertical. Three rotation terms describe rotations in this coordinate system: roll, pitch and yaw, which denote rotation about the Z-axis (α), X-axis (β) and Y-axis (γ), respectively.
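For reference, the three rotation terms correspond to the standard rotation matrices below (a small sketch of our own, not part of the paper):

```python
import numpy as np

def roll(alpha):   # rotation about the Z-axis
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def pitch(beta):   # rotation about the X-axis
    c, s = np.cos(beta), np.sin(beta)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def yaw(gamma):    # rotation about the vertical Y-axis
    c, s = np.cos(gamma), np.sin(gamma)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
```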
Fig. 1. Coordinate in our system
Fig. 2. Two cameras’ configuration
It is quite common for an actor to perform an action while the body yaws, for example when roaming in front of a fixed camera, as an actor on a stage or a teacher in front of a blackboard does. In this case the yaw motion of the body changes the viewpoint, but the actions still make sense. We therefore classify human postures into the same category if they differ only by a yaw rotation, and into different categories if they differ by either of the other two rotation terms, roll or pitch. For example, a standing human compared with one lying on the ground differs by a roll or pitch rotation, and we regard these as different postures; if a human merely turns the body to face another direction, the posture is considered the same. From the above discussion we conclude that, for most viewpoint invariant action recognition, we need only consider invariance to yaw rotation.
2.2 Envelop Shape Representation
In practical human action recognition, the depth range of the human body is usually small compared with the distance between the human and the camera, so the affine camera model can be used. To acquire a viewpoint invariant representation for action recognition, we propose the two-camera configuration shown in Figure 2: the image planes of both cameras are parallel to the vertical axis Y, and their optical axes are orthogonal. Consider a horizontal section plane of the human body: the projections of all points on this section onto image plane 1 lie on a line l, and their projections onto image plane 2 lie on a line l'; l is the epipolar line of a point p' and l' is the epipolar line of the corresponding point p. To analyze invariance to yaw rotation, we need only study how the 2D horizontal section shape projects onto the X-axis and Y-axis under different rotations.
Fig. 3. 2D Shape projection on X-axis and Y-axis in different rotations
In Figure 3, suppose a 2D shape S whose projection segments in the original coordinate system XY are AB and BC, so that the shape lies inside the rectangle ABCD. In another coordinate system X'Y', rotated by an angle θ, its projection segments lie within the segments EF and FG. Let the original projection segments have lengths x and y, and the new projection segments have lengths x' and y'. We obtain the following relationships:

x' ≤ x cos θ + y sin θ,    y' ≤ y cos θ + x sin θ      (1)

Defining the value r as in Equation (2), we obtain Equation (3):

r = √(x² + y²)      (2)

r' = √(x'² + y'²) ≤ √(x² + y² + 2xy sin 2θ) ≤ √(x² + y² + 2xy) ≤ √2 · r      (3)
Let r0 be the minimal value of r over all rotations. Then at any rotation the r value satisfies

r0 ≤ r ≤ √2 · r0      (4)
This is a small range compared with the unbounded range of the ratios x'/x or y'/y, which indicates that we have found a view insensitive representation of the human body. At each horizontal section plane we compute an r value using Equation (2); from a single frame of a human posture we thus obtain a vector of r values. Since this vector envelops the human body silhouette, we call this representation the "Envelop Shape". Figure 4 shows "Envelop Shape" images of synthesized human body model data at different viewpoints: two synthetic postures rotated about the vertical axis Y at eight different angles. The first two rows show the silhouettes seen by the two cameras, and the third row shows the "Envelop Shape" images. The Envelop Shape changes very little under viewpoint variation.
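As a quick sanity check of the bound in Equation (4), the small NumPy script below (our own illustration, not from the paper; the point set and angles are arbitrary choices) rotates a random planar cross-section about the vertical axis, measures the widths of its two orthogonal projections, and verifies that r never exceeds √2 times its minimum:

```python
import numpy as np

rng = np.random.default_rng(0)
# A random 2D horizontal cross-section of a "body": 200 points in the plane.
pts = rng.uniform(-1.0, 1.0, size=(200, 2)) * np.array([0.4, 0.15])

def r_value(points, theta):
    """r of Eq. (2) after a yaw rotation by theta."""
    c, s = np.cos(theta), np.sin(theta)
    rot = points @ np.array([[c, -s], [s, c]]).T
    x_width = rot[:, 0].max() - rot[:, 0].min()   # width seen by camera 1
    y_width = rot[:, 1].max() - rot[:, 1].min()   # width seen by camera 2
    return np.hypot(x_width, y_width)

rs = np.array([r_value(pts, t) for t in np.linspace(0.0, np.pi, 360)])
print(rs.max() / rs.min())                         # always <= sqrt(2)
assert rs.max() <= np.sqrt(2) * rs.min() + 1e-9    # the bound of Eq. (4)
```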
Fig. 4. Two kinds of postures at different viewpoints
Although we propose the camera configuration of Figure 2, in which the two image planes are parallel to the vertical axis Y and the optical axes are orthogonal, accurate calibration is not needed. Accurate calibration is often complex, and it is enough for the cameras to be placed so that they approximately satisfy this requirement; no time needs to be spent on configuring a precise placement. As discussed above, the representation is view insensitive rather than strictly invariant, so approximate values also work. In the experiments of Section 3, the videos were collected with only a rough placement of the two cameras, and the results are still good. Our algorithm for generating the Envelop Shape representation is briefly as follows (a sketch in code is given after the list of advantages below):
1. Extract silhouettes of the human body from the two cameras' video data.
2. Perform a scale normalization of the silhouettes.
3. Use expression (2) to compute the r value at each height of the silhouettes, where x and y are the widths of the two normalized silhouettes at that height.
The Envelop Shape representation has the following advantages:
1. It keeps information on two degrees of freedom, the vertical axis and the horizontal plane. It therefore carries more information than simple view invariant representations such as trajectory projections, which are essentially one dimensional, and so has better discriminating ability while remaining view insensitive.
2. It is easy to obtain. Only silhouettes are required as input, and silhouettes are easier to extract than meaningful features, tracks or point correspondences.
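The following minimal sketch (our own illustration, not the authors' code) computes the Envelop Shape vector from two height-normalized binary silhouette masks, assuming both masks share the same number of rows:

```python
import numpy as np

def envelop_shape(sil1: np.ndarray, sil2: np.ndarray) -> np.ndarray:
    """Envelop Shape vector from two binary silhouettes (H x W arrays of 0/1),
    one from each of the two roughly orthogonal cameras."""
    assert sil1.shape[0] == sil2.shape[0], "silhouettes must share the same height"

    def row_widths(sil):
        widths = np.zeros(sil.shape[0])
        for i, row in enumerate(sil):
            cols = np.flatnonzero(row)
            if cols.size:
                widths[i] = cols[-1] - cols[0] + 1   # silhouette width at this height
        return widths

    x = row_widths(sil1)          # width in camera 1 at each height
    y = row_widths(sil2)          # width in camera 2 at each height
    return np.hypot(x, y)         # r value of Eq. (2), one entry per height
```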
3 Action Recognition Experiment Results
With the Envelop Shape representation, we deploy an arbitrary-viewpoint action recognition system in a smart classroom. Figure 5 shows the system flow diagram. We first use the "Pfinder" algorithm to extract human silhouettes [15]. With the two cameras' silhouette video sequences as input, we generate an "Envelop Shape" vector for each frame and then use principal component analysis (PCA) for dimensionality reduction.
Fig. 5. Action Recognition System Flow Diagram
For each video, after the preprocessing and posture representation steps are completed, a sequence of feature vectors is obtained. Many algorithms exist for classifying such time-sequential feature vectors, such as Hidden Markov Models [16], Coupled Hidden Markov Models [17] and stochastic parsing [18]. Here we use continuous Hidden Markov Models for action training and recognition. For the recognition experiments we collected our own database of action video sequences. It contains seven different actors, each performing nine natural actions: "Point To", "Raise Hand", "Wave Hand", "Touch Head", "Communication", "Bow", "Pick Up", "Kick" and "Walk". Each action is performed by every actor three times at each of three arbitrary viewpoints. Figures 6 and 7 show examples of the experimental data. Each figure contains two groups of sampled action sequences; each group contains five rows, the first two being the images from the two cameras, the next two the silhouettes extracted with "Pfinder", and the last the normalized Envelop Shape vector images. (Each action sequence contains about 30 frames; Figures 6 and 7 show only a subset of the frames of each sequence.) The action sequences are collected at arbitrary viewpoints, so the experiments are viewpoint independent. The experiment parameters are as follows: the dimension of the input vector (the Envelop Shape vector after PCA dimensionality reduction) is 8, the number of states of each HMM is 5, and the number of Gaussian mixture components per observation is 10. With this video database we carry out subject dependent and subject independent experiments separately.
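As an illustration of this classification stage (a sketch under our own assumptions, using scikit-learn and the hmmlearn library rather than the authors' implementation; the data dictionary is hypothetical), one continuous Gaussian-mixture HMM can be trained per action on the PCA-reduced Envelop Shape sequences and a test sequence assigned to the action whose model gives the highest likelihood:

```python
import numpy as np
from sklearn.decomposition import PCA
from hmmlearn.hmm import GMMHMM

# train_seqs: {action_name: list of (T_i x D) Envelop Shape sequences}  -- hypothetical data
def train_action_models(train_seqs, n_pca=8, n_states=5, n_mix=10):
    all_frames = np.vstack([s for seqs in train_seqs.values() for s in seqs])
    pca = PCA(n_components=n_pca).fit(all_frames)          # 8-D input vectors, as in the paper
    models = {}
    for action, seqs in train_seqs.items():
        reduced = [pca.transform(s) for s in seqs]
        X = np.vstack(reduced)
        lengths = [len(s) for s in reduced]
        m = GMMHMM(n_components=n_states, n_mix=n_mix,      # 5 states, 10 mixture components
                   covariance_type="diag", n_iter=50)
        m.fit(X, lengths)                                   # one HMM per action
        models[action] = m
    return pca, models

def classify(seq, pca, models):
    x = pca.transform(seq)
    return max(models, key=lambda a: models[a].score(x))    # highest log-likelihood wins
```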
Fig. 6. Two groups of “point to” action sequences
Fig. 7. Two groups of “walk” action sequences
In the case of subject independent action recognition, for each type of action we train a single HMM over all actors in the training set and use it for recognition. For each action, the first five actors' video sequences are used as the training set and the last two actors' sequences as the test set; that is, each action has 45 training sequences and 18 test sequences. Table 2 shows the correct recognition rates.
Table 1. Subject dependent recognition result
          Point.    Raise.     Wave      Touch     Comm       Bow       Pick      Kick      Walk
Actor1    6  3      6  3       6  3      6  2      6  2       6  3      6  3      6  3      6  2
Actor2    6  3      6  3       6  3      6  3      5  3       6  3      6  3      6  3      6  3
Actor3    6  3      6  3       6  3      6  3      6  3       6  3      6  3      6  3      6  3
Actor4    6  3      6  3       5  3      6  3      6  3       6  3      6  3      6  3      6  3
Actor5    6  3      6  3       6  3      6  3      5  2       6  3      6  3      6  3      6  3
Actor6    6  3      6  3       6  3      6  3      6  3       6  3      6  3      6  3      6  3
Actor7    6  3      6  3       6  3      6  3      6  3       6  3      6  3      6  3      6  3
Aver. %   100 100   100 95.2   97.6 100  100 95.2  95.2 90.5  100 100   100 100   100 100   100 95.2
Table 2. Subject independent recognition result
               Point.  Raise.  Wave   Touch  Comm   Bow    Pick   Kick   Walk
Train set (%)  97.8    100     95.6   95.6   88.9   100    100    100    100
Test set (%)   100     100     88.9   94.4   83.3   100    94.4   100    94.4
Table 3. Comparison with view variant action recognition methods
          Envelop Shape (%)          Horizontal projection of silhouettes (%)   Motion Feature [19] (%)
          any view   fixed view      any view   fixed view                      any view   fixed view
Point.    100        94.4            44.4       88.9                            38.8       100
Raise.    100        100             61.1       94.4                            33.3       94.4
Wave      88.9       94.4            33.3       61.1                            55.5       94.4
Touch     94.4       88.9            38.8       83.3                            22.2       100
Comm      83.3       83.3            27.7       61.1                            27.7       88.9
Bow       100        100             61.1       94.4                            38.8       94.4
Pick      94.4       100             38.8       72.2                            33.3       83.3
Kick      100        94.4            55.5       83.3                            38.8       88.9
Walk      94.4       94.4            33.3       72.2                            44.4       100
To make our view invariant approach more convincing, we also ran comparison experiments. Table 3 compares our results with two view dependent methods. The first method is our proposed "Envelop Shape". The second uses the vector of horizontal projections of the silhouettes (the x in expression (2)) as input, which is view dependent. The third is the method of [19], which uses motion features and is also view dependent. Each method has two sub-columns: the first gives the recognition rate for subject independent recognition at arbitrary viewpoints, and the second gives the average recognition rate at a specified viewpoint. In the view independent scenario only the "Envelop Shape" method performs well; the other two methods perform poorly because they are view dependent.
4 Conclusion
Using the "Envelop Shape" representation, we have built an action recognition system that is robust to viewpoint variation. Experiments show that the system achieves a high recognition rate for free-viewpoint actions, and the results indicate that the "Envelop Shape" representation contains enough discriminating information even for subject independent action recognition. The representation is view insensitive and, compared with previous approaches, easier to acquire while carrying richer information; it requires no meaningful feature detection or point correspondence, which are often difficult to obtain and sensitive to errors. However, as a view insensitive representation it discards some view dependent information that can be important for certain actions; for example, it cannot distinguish left-hand from right-hand movements. Combining this representation with additional view dependent information to resolve such ambiguities is left for future work.
Acknowledgements. This work is supported by the National Science Foundation of China under grants No. 60673189 and No. 60433030.
References 1. Cedras, C., Shah, M.: Motion-based recognition: a survey. Image and Vision Computing 13(2), 129–155 (1995) 2. Aggarwal, J.K., Cai, Q.: Human motion analysis: a review. Computer Vision and Image Understanding 73(3), 428–440 (1999) 3. Moeslund, T.B., Granum, E.: A survey of computer vision-based human motion capture. Computer Vision and Image Understanding 81(3), 231–268 (2001) 4. Wang, L., Hu, W., Tan, T.: Recent Developments in Human Motion Analysis. Pattern Recognition 36(3), 585–601 (2003)
5. Leo, M., D'Orazio, T., Spagnolo, P.: International Multimedia Conference. In: Proceedings of the ACM 2nd International Workshop on Video Surveillance & Sensor Networks, pp. 124–130. ACM Press, New York (2004) 6. Wang, L., Tan, T., Ning, H., Hu, W.: Silhouette analysis-based gait recognition for human identification. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(12), 1505–1518 (2003) 7. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, p. 11 8. Campbell, L.W., Becker, D.A., Azarbayejani, A., Bobick, A.F., Pentland, A.: Invariant Features for 3D Gesture Recognition. In: Proceedings of the International Conference on Automatic Face and Gesture Recognition, pp. 157–162 (1996) 9. Seitz, S.M., Dyer, C.R.: View-Invariant Analysis of Cyclic Motion. International Journal of Computer Vision (1997) 10. Rao, C., Yilmaz, A., Shah, M.: View-Invariant Representation and Recognition of Actions. International Journal of Computer Vision 50(2) (2002) 11. Rao, C., Shah, M., Mahmood, T.S.: Action Recognition based on View Invariant Spatio-temporal Analysis. In: ACM Multimedia 2003, November 2-8, Berkeley, CA, USA (2003) 12. Parameswaran, V., Chellappa, R.: Using 2D Projective Invariance for Human Action Recognition. International Journal of Computer Vision (2005) 13. Parameswaran, V., Chellappa, R.: Human Action Recognition Using Mutual Invariants. Computer Vision and Image Understanding (2005) 14. Weinland, D., Ronfard, R., Boyer, E.: Free Viewpoint Action Recognition using Motion History Volumes. Computer Vision and Image Understanding (2006) 15. Wren, C., Azarbayejani, A., Darrell, T., Pentland, A.: Pfinder: real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7), 780–785 (1997) 16. Yamato, J., Ohya, J., Ishii, K.: Recognizing human action in time-sequential images using hidden Markov model. In: Proc. 1992 IEEE Conf. on Computer Vision and Pattern Recognition, pp. 379–385. IEEE Press, Los Alamitos (1992) 17. Brand, M., Oliver, N., Pentland, A.: Coupled hidden Markov models for complex action recognition. In: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Puerto Rico, IEEE Computer Society Press, Los Alamitos (1997) 18. Ivanov, Y.A., Bobick, A.F.: Recognition of visual activities and interactions by stochastic parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 852–872 (2000) 19. Masoud, O., Papanikolopoulos, N.: A method for human action recognition. Image and Vision Computing 21, 729–743 (2003)
Unsupervised Identification of Multiple Objects of Interest from Multiple Images: dISCOVER Devi Parikh and Tsuhan Chen Carnegie Mellon University {dparikh,tsuhan}@cmu.edu Abstract. Given a collection of images of offices, what would we say we see in the images? The objects of interest are likely to be monitors, keyboards, phones, etc. Such identification of the foreground in a scene is important to avoid distractions caused by background clutter and facilitates better understanding of the scene. It is crucial for such an identification to be unsupervised to avoid extensive human labeling as well as biases induced by human intervention. Most interesting scenes contain multiple objects of interest. Hence, it would be useful to separate the foreground into the multiple objects it contains. We propose dISCOVER, an unsupervised approach to identifying the multiple objects of interest in a scene from a collection of images. In order to achieve this, it exploits the consistency in foreground objects - in terms of occurrence and geometry - across the multiple images of the scene.
1 Introduction
Given a collection of images of a scene, it would be helpful for understanding the scene to be able to identify the foreground separately from the background clutter. We interpret the foreground as the objects of interest, the objects that are found frequently across the images of the scene. In a collection of images of offices, for instance, we may find a candy box in some office image; we would nevertheless perceive it as background clutter because most office scenes do not contain candy boxes. Most interesting scenes contain multiple objects of interest: office scenes contain monitors, keyboards, chairs, desks, phones, etc. It would be useful if, given a collection of images of offices, we could identify the foreground region apart from the background clutter/objects and, furthermore, separate the identified foreground into the different objects. This can then be used to study the interactions among the multiple objects of interest in the scene, to learn models of these objects for object detection, to track multiple objects in a video, and so on. It is crucial to approach this problem in an unsupervised manner. First, it is extremely time consuming to annotate images containing multiple objects. Second, human annotation could introduce subjective biases as to which objects are the foreground objects. Unsupervised approaches, on the other hand, require no hand annotation, truly capture the properties of the data, and let the objects of interest emerge from the collection of images. In our approach we focus on rigid objects and exploit two intuitive notions: first, parts of the images that occur frequently across images are likely to belong to the foreground; and second, only those parts of the foreground that are found at geometrically consistent relative locations are likely to belong to the same rigid object.
Several approaches in the literature address the problem of foreground identification. First, we differentiate our work from image segmentation. Segmentation approaches are based on low-level cues and aim to separate a given image into regions with pixel-level accuracy. Our goal is higher level: we wish to separate the local parts of the images that belong to the objects of interest from those that lie on background clutter, using cues from multiple images. To re-iterate, image segmentation seeks regions that are consistent within a single image in color, texture, etc., whereas we seek objects in the scene that are consistent across multiple images in occurrence and geometry. Several approaches have been proposed for discovering the topic of interest, such as discovering main characters [1] or objects and scenes [2] in movies, or celebrities in collections of news clippings [3]. Recently, statistical text analysis tools such as probabilistic Latent Semantic Analysis (pLSA) [4] and Latent Dirichlet Allocation (LDA) [5] have been applied to images for discovering object and scene categories [6,7,8]. These use an unordered bag-of-words [9] representation of documents to automatically (unsupervised) discover topics in a large corpus of documents/images. However, these approaches, which we loosely refer to as popularity based approaches, do not incorporate any spatial information. Hence, while they can identify the foreground separately from the background, they cannot further separate the foreground into multiple objects, and they have therefore been applied to images that contain only one foreground object. We illustrate this point further in our results. Popularity based approaches can separate the multiple objects of interest only if they are provided with images that contain different numbers of these objects. In the office setting, in order to discover the monitor and keyboard separately, pLSA would require several images with just the monitor and just the keyboard (as well as a specified number of topics of interest), which is not a natural setting for images of office scenes. Leordeanu et al. [10] propose an approach to unsupervised learning of an object model from its low resolution video; it is also based on co-occurrence and hence cannot separate multiple objects in the foreground. Several approaches incorporate spatial information into the popularity based methods [11,12,13,14], but only with the purpose of robustly identifying a single foreground object in the image, not of separating the foreground into multiple objects. Russell et al. [15], by breaking an image into multiple segments and treating each segment individually, can deal with multiple objects as a by-product; however, although they use multiple segmentations, they rely on consistent segmentations of the foreground objects. Further, on the object detection/recognition front rather than object discovery, object localization approaches could be considered, with a stretch of argument, to provide a rough foreground/background separation. Part-based approaches towards object localization, such as [16,17], use spatial statistics of parts to obtain object masks, but these are supervised approaches for single objects.
Unsupervised part-based approaches for learning object models for recognition have also been proposed, such as [18,19]; however, they too deal with single objects.
Fig. 1. Flow of dISCOVER for unsupervised identification of multiple objects of interest: images of a particular scene category → feature extraction → correspondences → foreground identification → interaction between pairs of features → recursive clustering → multiple objects (foreground)
Fig. 2. An illustration of the geometric consistency metric used to retain good correspondences
The rest of the paper is organized as follows. Section 2 describes our algorithm dISCOVER, followed by experimental results in Section 3 and conclusion in Section 4.
2 dISCOVER
Our approach, dISCOVER, is summarized in Fig. 1. The input to dISCOVER is a collection of images taken from a particular scene, and the desired output is the identified foreground separated into the multiple objects it contains.
2.1 Feature Extraction
Given the collection of images taken from a particular scene, local features describing interest points/parts are extracted in all the images. These features may be appearance based features such as SIFT [20], shape based features such as shape context [21], geometric blur [22], or any such discriminative local descriptors as may be suitable for the objects under consideration. In our current implementation, we use the Derivative of Gaussian interest point detector and SIFT features as our local descriptors.
2.2 Correspondences
Having extracted features from all images, correspondences between these local parts are to be identified across images. For a given pair of images, potential correspondences are identified by finding the k nearest neighbors of each feature point from one image in the other image. We use the Euclidean distance between the SIFT descriptors to determine the nearest neighbors. The geometric consistency between every pair of correspondences is computed to build a geometric consistency adjacency matrix.
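For concreteness, the following sketch (our own illustration using OpenCV, which differs from the authors' Derivative-of-Gaussian detector plus SIFT descriptors; the file names are hypothetical) extracts SIFT keypoints and collects k nearest-neighbor candidate correspondences between a pair of images:

```python
import cv2

def candidate_correspondences(img1, img2, k=3):
    """Detect SIFT features and return k-NN candidate matches between two images."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)            # Euclidean distance on descriptors
    knn = matcher.knnMatch(des1, des2, k=k)         # k candidates per feature in img1
    pairs = []
    for matches in knn:
        for m in matches:
            pairs.append((kp1[m.queryIdx], kp2[m.trainIdx]))
    return pairs   # each keypoint carries position, scale and orientation

# usage (hypothetical file names):
# img1 = cv2.imread("office_01.png", cv2.IMREAD_GRAYSCALE)
# img2 = cv2.imread("office_02.png", cv2.IMREAD_GRAYSCALE)
# pairs = candidate_correspondences(img1, img2)
```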
Suppose we wish to compute the geometric consistency between the pair of correspondences shown in Fig. 2, involving interest regions a and b in image 1 and A and B in image 2. All interest regions have a scale and orientation associated with them. Let φa be the similarity transform that maps a to A, and let ba be the relative location of b with respect to a in image 1. Then β = φa(ba) is the estimated location of B in image 2 based on φa. If a and A, as well as b and B, are geometrically consistent, the distance d(B, β) between β and B is small. A score that decreases exponentially with increasing d(B, β) is used to quantify the geometric consistency of the pair of correspondences. To make the score symmetric, a is similarly mapped to α using the transform φb that maps b to B, and the score is based on max(d(B, β), d(A, α)). This metric provides invariance only to scale and rotation; the underlying assumption is that the distortion due to affine transformations in realistic scenarios is minimal among local features that are located close together on the same object. Having computed the geometric consistency score between all possible pairs of correspondences, a spectral technique is applied to the geometric consistency adjacency matrix to retain only the geometrically consistent correspondences [23]. This eliminates most of the background clutter, and also allows us to deal with incorrect low-level correspondences among SIFT features that cannot be reliably matched, for instance at the various corners and edges found in an office setting. To deal with multiple objects in the scene, an iterative form of [23] is used. It should be noted that, due to noise, affine and perspective transformations of objects, and so on, correspondences of parts even on a single object do not always form one strong cluster; they are therefore not all obtained in a single iteration but over several iterations.
2.3 Foreground Identification
Only the feature points that find geometrically consistent correspondences in most other images are retained. This matches our notion that the objects of interest are those that occur frequently across the image collection. This post-processing step also helps to eliminate the remaining background features that may have found geometrically consistent correspondences in another image by chance; using multiple images lets us eliminate such random errors, which are not consistent across images. However, we do not require features to be present in all images in order to be retained, which allows us to handle occlusions, severe viewpoint changes, and so on. Since these affect different parts of the objects in different images, it is unlikely that a significant portion of an object will fail to be matched in many images and hence be eliminated by this step. This also enables us to deal with different numbers of objects in the scene across images, again under the assumption that objects present in most images are the objects of interest (foreground), while those present in only a few images are part of the background clutter. This proportion can be varied to suit the scenario at hand. We now have a reliable set of foreground feature points and a set of correspondences among all images. An illustration can be seen in Fig. 3, where only a subset of the detected features and their correspondences is retained.
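Before moving on, here is a minimal sketch of the pairwise geometric consistency score of Section 2.2 (our own reading of the description above; the decay constant sigma is an assumed parameter):

```python
import numpy as np

def similarity_transform(src, dst):
    """Similarity transform (scale, rotation, translation) mapping keypoint src to dst.
    Each keypoint is (x, y, scale, angle_in_radians)."""
    s = dst[2] / src[2]
    dtheta = dst[3] - src[3]
    c, r = np.cos(dtheta), np.sin(dtheta)
    R = s * np.array([[c, -r], [r, c]])
    t = np.array(dst[:2]) - R @ np.array(src[:2])
    return lambda p: R @ np.array(p) + t

def geometric_consistency(a, A, b, B, sigma=10.0):
    """Consistency of correspondences (a<->A) and (b<->B); higher is more consistent."""
    beta = similarity_transform(a, A)(b[:2])     # predicted location of B from (a -> A)
    alpha = similarity_transform(b, B)(a[:2])    # predicted location of A from (b -> B)
    d = max(np.linalg.norm(beta - np.array(B[:2])),
            np.linalg.norm(alpha - np.array(A[:2])))
    return np.exp(-d / sigma)                    # decays exponentially with the error
```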
It should be noted that, the approach being unsupervised, there is no notion of an object yet: we only have a cloud of features in each image, all identified as foreground, together with correspondences among them. The goal is now to separate these features into groups, where each group corresponds to a foreground object in the scene.
Fig. 3. An illustration of the correspondences and features retained during feature selection. The images contain two foreground objects and some background. An illustration of the geometric consistency adjacency matrix of the graph that would be built for this set-up is also shown.
2.4 Interaction Between Pairs of Features
In order to separate the cloud of retained feature points into clusters, a graph is built over the feature points, where the weight on the edge between two nodes represents the interaction between that pair of features across the images. The metric used to capture this interaction is the same geometric consistency as computed in Section 2.2, now averaged across all pairs of images that contain both features. While the geometric consistency for a particular pair of images may be erroneous, due to errors in correspondences and the like, averaging across all pairs suppresses the contribution of these erroneous matchings and amplifies the true interaction between the features. If the geometric consistency between two feature points is high, they are likely to belong to the same rigid object; features that belong to different objects will be geometrically inconsistent, because different objects are likely to be found in different configurations across images. An illustration of the geometric consistency adjacency matrix is shown in Fig. 3. Again, there is no concept of an object yet; the features in Fig. 3 are arranged in an order corresponding to the objects, and each object is shown with only two features, purely for illustration purposes.
2.5 Recursive Clustering
Having built the graph capturing the interaction between all pairs of features across images, recursive clustering is performed on this graph. At each step the graph is clustered into two clusters; the properties of each cluster are analyzed, and one or both of the clusters are further split into two clusters, and so on.
If the variance of the adjacency matrix entries corresponding to a certain cluster (subgraph) is very low while their mean is high, the cluster is assumed to contain parts from a single object and is not divided further. Since the statistics of each cluster are analyzed to decide whether it should be split again, the number of foreground objects need not be known a priori. This is an advantage over pLSA or over parametric methods such as fitting a mixture of Gaussians to the spatial distribution of the foreground features; dISCOVER is non-parametric. We use normalized cuts [24] to perform the clustering, using the code provided at [25].
Fig. 4. (a) A subset of the synthetic images used as input to dISCOVER (b) Background suppressed for visualization purposes
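A compact sketch of this recursive splitting (our own stand-in: it uses a plain spectral bipartition in place of the normalized-cuts code of [25], and the mean/variance thresholds are assumed values) is:

```python
import numpy as np

def spectral_bipartition(W):
    """Two-way split of a weighted graph via the second eigenvector of the
    normalized Laplacian (a simple stand-in for normalized cuts)."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(W)) - (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]                       # second smallest eigenvector
    return fiedler >= np.median(fiedler)       # boolean split into two groups

def recursive_cluster(W, idx=None, mean_th=0.6, var_th=0.01):
    """Recursively split the feature-interaction graph; stop when a cluster's
    adjacency entries have high mean and low variance (single rigid object)."""
    if idx is None:
        idx = np.arange(len(W))
    sub = W[np.ix_(idx, idx)]
    off_diag = sub[~np.eye(len(sub), dtype=bool)]
    if len(idx) <= 2 or (off_diag.mean() > mean_th and off_diag.var() < var_th):
        return [idx]                           # treat as one foreground object
    mask = spectral_bipartition(sub)
    if mask.all() or not mask.any():           # degenerate split: stop here
        return [idx]
    return (recursive_cluster(W, idx[mask], mean_th, var_th) +
            recursive_cluster(W, idx[~mask], mean_th, var_th))
```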
3 Results
3.1 Synthetic Images
dISCOVER uses two aspects, popularity and geometric consistency, which can be loosely thought of as first order and second order statistics. In the first set of experiments, we use synthetic images to demonstrate the inadequacy of either of these alone. To illustrate our point, we consider 50×50 synthetic images as shown in Fig. 4(a). The images contain 2500 distinct intensity values, of which 128, randomly selected from the 2500, always lie on the foreground objects; the rest is background. We consider each pixel in the image to be an interest point, and the descriptor of each pixel is its intensity value. To make visualization clearer, we display only the foreground pixels of these images in Fig. 4(b); it is evident from these that there are two foreground objects of interest. We assume that the objects undergo pure translation only. We now demonstrate the use of pLSA, as an example of an unsupervised popularity based foreground identification algorithm, on 50 such images. Since pLSA requires negative images without the foreground objects, we also input 50 random negative images to pLSA, which dISCOVER does not need. If we ask pLSA to discover 2 topics, the result obtained is shown in Fig. 5: it can identify the foreground from the background, but it is unable to further separate the foreground into multiple objects. One may argue that we could post-process these results and fit, for instance, a mixture of Gaussians to further separate the foreground into multiple objects; however, this would require the number of foreground objects to be known a priori, and the distribution of features on the objects need not be Gaussian, as in these images. If we instead ask pLSA to discover 3 topics, in the hope that it might separate the foreground into two objects, we find that it randomly splits the background into two topics while still maintaining a single foreground topic, as seen in Fig. 5. This is because pLSA incorporates only occurrence (popularity) and no spatial information.
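The synthetic setup described above can be reproduced with a few lines of NumPy (a sketch under our own assumptions about object size and placement, which the paper does not spell out):

```python
import numpy as np

rng = np.random.default_rng(1)
H = W = 50
ids = rng.permutation(H * W)                 # 2500 distinct intensity values
fg_ids, bg_ids = ids[:128], ids[128:]        # 128 values always lie on the two foreground objects
obj1, obj2 = fg_ids[:64].reshape(8, 8), fg_ids[64:].reshape(8, 8)   # two 8x8 rigid objects

def make_image():
    img = -np.ones((H, W), dtype=int)
    # paste the two objects at random, non-overlapping, purely translated locations
    while True:
        (r1, c1), (r2, c2) = rng.integers(0, H - 8, size=(2, 2))
        if abs(r1 - r2) >= 8 or abs(c1 - c2) >= 8:
            break
    img[r1:r1 + 8, c1:c1 + 8] = obj1
    img[r2:r2 + 8, c2:c2 + 8] = obj2
    img[img < 0] = rng.permutation(bg_ids)    # random background intensities
    return img

images = [make_image() for _ in range(50)]    # 50 input images, as in the paper
```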
Fig. 5. Comparison of results obtained using pLSA (with 2 topics and with 3 topics) with those obtained using dISCOVER
Hence, pLSA is inherently missing the information required to perceive the features on one foreground object as any different from those on the other, and so cannot separate them. dISCOVER, on the other hand, does incorporate this spatial/geometric information and can therefore separate the foreground objects. Since the input images are assumed to allow only translation of the foreground objects and the descriptor is simply the intensity value, we use a different notion of geometric consistency from that described in Section 2.2: to compute the geometric consistency between a pair of correspondences, we compute the distance between the pair of features in each image, and the geometric consistency decreases exponentially as the discrepancy between the two distances increases. The result obtained by dISCOVER is shown in Fig. 5. We successfully identify the foreground from the background and further separate the foreground into multiple objects. Moreover, dISCOVER does not require any parameters to be specified, such as the number of topics or of foreground objects. The inability of a popularity based approach to obtain the desired results illustrates the need for geometric consistency in addition to popularity. To illustrate the need for popularity and not just geometric consistency, consider the following analysis. If we consider all pairs of images such as those shown in Fig. 4 and keep all features that find correspondences that are geometrically consistent with at least one other feature in at least one other image, we would retain approximately 2300 of the background features, because even for the background it is possible to find some geometrically consistent correspondences. However, the background being random, such matches are not consistent across several images. Hence, if we instead retain only those features that have geometrically consistent correspondences in at least two other images, only about 50 of the background features are retained; as we use more images, we can eliminate the background features entirely. dISCOVER being an unsupervised approach, the use of multiple images to prune out background clutter is crucial. This demonstrates the need for popularity in addition to geometric consistency.
3.2 Real Images
In the following experiments with real images, while we present results on specific objects, it is important to note that recent advances in object recognition that deal with object categories complement the proposed work. Since no particular features are an integral part of dISCOVER, it can be applied to object categories by using appropriate features. However, the focus of our work is to identify the multiple objects of interest in the scene, not object categorization.
Fig. 6. A subset of images provided as input to dISCOVER
(The plot in Fig. 7(b) shows the accuracy of dISCOVER against the number of input images used, from 5 to 30.)
Fig. 7. (a) Visual results obtained by dISCOVER. The cloud of features retained as foreground and further clustered into groups. Each group corresponds to an object in the foreground. (b) Quantitative results obtained using dISCOVER.
Hence, to illustrate our algorithm, we show results on specific objects (though with considerable variations) using SIFT. We first illustrate dISCOVER on a collection of 30 real images as shown in Fig. 6. Note the variation in orientation, scale and viewpoint of the objects, as well as the varying lighting conditions and the highly cluttered backgrounds. We use the descriptors and geometric consistency notions described in Section 2. The results obtained are shown in Fig. 7(a): all background features have been successfully eliminated and the foreground features have been accurately clustered into multiple objects. To quantify the results, we hand labeled the images with the foreground objects. This being a staged scenario in which the objects were intentionally placed, the ground truth foreground objects of interest were known, so such an analysis is possible. The proportion of features assigned to their appropriate foreground cluster was computed as the accuracy of dISCOVER. The accuracy is shown in Fig. 7(b) for varying numbers of input images. While multiple images are needed for accurate unsupervised multiple-object foreground identification, the accuracy reaches its optimum with a fairly small number of images. Let us now consider a real scene where the objects are not staged. Consider a collection of 30 images such as those shown in Fig. 8. These are images of an office taken at different times; note the changes in viewpoint, object scale and lighting conditions. We run dISCOVER on these images, and the result is shown in Fig. 9: the monitor, keyboard and CPU are identified as the foreground objects.
Fig. 8. A subset of images provided as input to dISCOVER
Fig. 9. Results obtained by dISCOVER. The cloud of features retained as foreground and further clustered into groups. Each group corresponds to an object in the foreground.
This seems reasonable. The mouse is not identified to be the foreground object because very few features were detected on the mouse, which were not stable across images mainly due to the lighting variation and pose changes. The photo frame and the CPU are clustered together. This is because these objects are stationary in all the input images and hence are found at identical locations with respect to each other (whenever present) across images, and are hence perceived to be one object. This is an artifact of dISCOVER being an unsupervised algorithm. Also, the bag next to the CPU is not retained. This is because the bag is occluded in most images, and hence is considered to be background. Overall, the foreground is successfully separated from the background, and is further clustered into the different objects of interest it contains.
4 Conclusion
We propose dISCOVER, which, given a collection of images of a scene, identifies the foreground and further separates it into the multiple objects of interest it contains, all in an unsupervised manner. It relies on occurrence based popularity cues as well as geometry based consistency cues to achieve this. Future work includes loosening the geometric consistency notion to deal with non-rigid objects, learning models for the identified objects of interest for detection, and studying interactions among the multiple objects in the scene to provide context for robust object detection.
Acknowledgments We thank Andrew Stein and Dhruv Batra for code to compute geometrically compatible correspondences among images.
References 1. Fitzgibbon, A., Zisserman, A.: On affine invariant clustering and automatic cast listing in movies. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, Springer, Heidelberg (2002) 2. Sivic, J., Zisserman, A.: Video data mining using configurations of viewpoint invariant regions. CVPR (2004) 3. Berg, T., Berg, A., Edwards, J., White, R., Teh, Y., Learned-Miller, E., Forsyth, D.: Names and faces in the news. CVPR (2004) 4. Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. Machine Learning (2001) 5. Blei, D., Ng, A., Jordan, M.: Latent dirichlet allocation. Journal of Machine Learning Research (2003) 6. Fei-Fei, L., Perona, P.: A bayesian hierarchical model for learning natural scene categories. CVPR (2005) 7. Quelhas, P., Monay, F., Odobez, J., Gatica, D., Tuytelaars, T., Van Gool, L.: Modeling scenes with local descriptors and latent aspects. ICCV (2005) 8. Sivic, J., Russell, B., Efros, A., Zisserman, A., Freeman, W.: Discovering objects and their location in images. ICCV (2005) 9. Csurka, G., Bray, C., Dance, C., Fan, L.: Visual categorization with bags of keypoints. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, Springer, Heidelberg (2004) 10. Leordeanu, M., Collins, M.: Unsupervised learning of object models from video sequences. CVPR (2005) 11. Liu, D., Chen, T.: Semantic-shift for unsupervised object detection. In: CVPR. Workshop on Beyond Patches (2006) 12. Li, Y., Wang, W., Gao, W.: A robust approach for object recognition. In: PCM (2006) 13. Fergus, R., FeiFei, L., Perona, P., Zisserman, A.: Learning object categories from Googles image search. In: ICCV (2005) 14. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. CVPR (2006) 15. Russell, B., Efros, A., Sivic, J., Freeman, W., Zisserman, A.: Using multiple segmentations to discover objects and their extent in image collections. CVPR (2006) 16. Marszałek, M., Schmid, C.: Spatial weighting for bag-of-features. CVPR (2006) 17. Leibe, B., Leonardis, A., Schiele, B.: Combined object categorization and segmentation with an implicit shape model. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, Springer, Heidelberg (2004) 18. Fergus, R., Perona, P., Zisserman, A.: Object class recognition by unsupervised scaleinvariant learning. CVPR (2003) 19. Weber, M., Welling, M., Perona, P.: Unsupervised learning of models for recognition. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1842, Springer, Heidelberg (2000) 20. Lowe, D.: Distinctive image features from scale-invariant keypoints. IJCV (2004) 21. Belongie, S., Malik, J., Puzicha, J.: Shape context: a new descriptor for shape matching and object recognition. In: NIPS (2000) 22. Berg, A., Malik, J.: Geometric blur for template matching. CVPR (2001) 23. Leordeanu, M., Hebert, M.: A spectral technique for correspondence problems using pairwise constraints. In: ICCV (2005) 24. Shi, J., Malik, J.: Normalized cuts and image segmentation. In: PAMI (2000) 25. Shi, J.: http://www.cis.upenn.edu/jshi/software/
Fast 3-D Interpretation from Monocular Image Sequences on Large Motion Fields Jong-Sung Kim and Ki-Sang Hong Division of Electrical and Computer Engineering, POSTECH, Pohang, Korea {kimjs,hongks}@postech.ac.kr http://iip.postech.ac.kr
Abstract. This paper proposes a fast method for dense 3-D interpretation to directly estimate a dense map of relative depth and motion from a monocular sequence of images on large motion fields. The Nagel-Enkelmann technique is employed in the variational formulation of the problem. Diffusion-reaction equations are derived from the formulation so as to approximate the dense map on large motion fields and realize an anisotropic diffusion to preserve the discontinuities of the map. By combining the ideas of implicit schemes and multigrid methods, we present a new implicit multigrid block Gauss-Seidel relaxation scheme, which dramatically reduces the computation time for solving the large-scale linear system of diffusion-reaction equations. Using our method, we perform fast 3-D interpretation of image sequences with large motion fields. The efficiency and effectiveness of our method are experimentally verified with synthetic and real image sequences.
1 Introduction
The 3-D interpretation is one of the fundamental problems in computer vision. Reconstructed 3-D information such as scene depth and motion can be used in vision applications. Most methods for 3-D interpretation first compute sparse or dense correspondences such as image features or optical flow, and so depend on the accuracy of a generally ill-posed matching or estimation process [3,7]. As an alternative, direct methods [2,11,12] have been introduced to estimate the structure and motion of a static scene without prior matching and estimation. However, the requirement of static scenes limits the applicability of the direct methods in a variety of environments, e.g., where viewed objects move independently of the viewing system. This limitation was first studied as a dense 3-D interpretation problem [8,10], where a dense map of relative depth and motion is estimated from the spatio-temporal change of intensity images due to small-range image motion. It has been formulated within the variational framework, and partial differential equation (PDE) systems were solved to obtain the functional minimization. In this framework, each pixel has six variables representing 3-D motion; thus there are more than two million variables to evaluate for a 640 × 480 video image. For this reason, the computation cost for dense 3-D
interpretation has been too expensive for time-critical applications, even though it has only been applied to small motion fields. In this paper we present a fast method for dense 3-D interpretation that works on large motion fields. It was developed in the following way. First, dense 3-D interpretation was formulated with a variational PDE model based on the instantaneous motion model [6] and the constant brightness assumption. The Nagel-Enkelmann technique [9] was employed in the variational model so that our method works on large motion fields. This model is used to derive diffusion-reaction equations, which approximate the dense map between two images and realize an anisotropic diffusion to preserve discontinuities in the map. Then, by combining implicit schemes [1,13] and multigrid methods [4,5], a new implicit multigrid block Gauss-Seidel relaxation scheme is devised to quickly solve the diffusion-reaction equations. The efficiency and effectiveness of our method are verified with synthetic and real image sequences. This paper is organized as follows. Section 2 introduces the variational PDE model for dense 3-D interpretation. Section 3 describes the discretization for numerically solving the variational PDE model. Section 4 explains our numerical scheme based on implicit schemes and multigrid methods. Sections 5 and 6 present experimental results and conclusions.
2 Variational PDE Model
A 3-D point X = (X, Y, Z) on a moving object with linear velocity v = (v1, v2, v3) and angular velocity w = (w1, w2, w3) satisfies the differential equation Ẋ = v + [w]× X, and its projection xc = (x, y, f) through a calibrated camera with focal length f has the form xc = f (X/Z, Y/Z, 1). The 2-D coordinate vector (x, y) is denoted by x to simplify the notation. Under the instantaneous motion model [6], the optical flow u = (u, v) can be written as

u = P t + Q w      (1)
where t = (t1, t2, t3) = Z⁻¹ v, and the matrices P and Q are defined by

P = ( f   0   −x
      0   f   −y ),

Q = (1/f) ( −xy          f² + x²   −fy
            −(f² + y²)   xy         fx )      (2)

respectively.
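A small sketch (our own illustration) of evaluating this flow model at a pixel, with P and Q as in Eq. (2):

```python
import numpy as np

def instantaneous_flow(x, y, f, t, w):
    """Optical flow u = P t + Q w at pixel (x, y) for focal length f,
    scaled translation t = v / Z and rotation w (Eqs. (1)-(2))."""
    P = np.array([[f, 0.0, -x],
                  [0.0, f, -y]])
    Q = np.array([[-x * y,          f**2 + x**2, -f * y],
                  [-(f**2 + y**2),  x * y,        f * x]]) / f
    return P @ np.asarray(t) + Q @ np.asarray(w)

# e.g. a purely translating scene point:
# instantaneous_flow(10.0, -5.0, 500.0, t=[0.01, 0.0, 0.002], w=[0.0, 0.0, 0.0])
```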
As we see in Eq. (1), only the direction of translational motion and the depth of the scene up to a scaling factor [6,8,10] can be reconstructed, via v̂ = ||t||⁻¹ t and Ẑ⁻¹ = ||t||, where ||·|| denotes the l2-norm of a vector. We perform dense 3-D interpretation between two images I1(x) ≡ I1(x, y) and I2(x) ≡ I2(x, y). Just as in optical flow, the Nagel-Enkelmann model is applied to dense 3-D interpretation between two frames as follows:

E(t, w) = ∫_Ω [ (I1(x) − I2(x + P t + Q w))² + α Σ_{i=1..3} ∇ti' D(∇I1) ∇ti + β Σ_{i=1..3} ∇wi' D(∇I1) ∇wi ] dx      (3)
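The anisotropic behaviour of this energy comes from the Nagel-Enkelmann matrix D(∇I1) defined just below in Eq. (4); a minimal per-pixel sketch of it (ours, with ν left as a free parameter) is:

```python
import numpy as np

def nagel_enkelmann_D(Ix, Iy, nu):
    """Regularized projection matrix D(grad I1) at one pixel (Eq. (4)):
    smoothing is inhibited across image edges, i.e. along the gradient direction."""
    n = np.array([-Iy, Ix])                          # direction perpendicular to the gradient
    grad_sq = Ix**2 + Iy**2
    return (np.outer(n, n) + nu**2 * np.eye(2)) / (grad_sq + 2.0 * nu**2)
```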
i
Fast 3-D Interpretation from Monocular Image Sequences
where D (∇I1 ) is the regularized projection matrix, defined by
D(∇I1) = [ (−∂I1/∂y, ∂I1/∂x)' (−∂I1/∂y, ∂I1/∂x) + ν² I ] / ( ||∇I1||² + 2ν² )
(4)
where I is the identity matrix, ν is a contrast parameter set to inhibit smoothing of the flow field across the edges of I1 at locations where ||∇I1|| ≫ ν, and α and β are smoothing parameters. A piecewise smooth flow field is estimated by using this projection matrix in the regularization. We describe the details for estimating the translational motion t, but the model applies without modification to the simultaneous estimation of the translational motion t and the rotational motion w. We seek the t minimizing the energy functional of Eq. (3) by solving the diffusion-reaction equations derived through the calculus of variations:

α div(D(∇I1) ∇ti) + (I1(x) − I2(x + P t + Q w)) ∂I2/∂ti = 0,   i = 1, 2, 3      (5)

with boundary conditions ∂t1/∂n = ∂t2/∂n = ∂t3/∂n = 0, where n is the unit vector normal to the boundary ∂Ω of the image domain Ω. In Eq. (5), the first terms realize a discontinuity-preserving anisotropic diffusion process and the second terms enforce the constant brightness constraint in the estimation of t; the former are called diffusion terms and the latter reaction terms. The diffusion terms are linear in t, but the reaction terms are nonlinear. To apply efficient linear methods, we approximate the nonlinear reaction terms as follows. Given initial estimates t0, the current estimate is t = t0 + Δt, where Δt = t − t0 is the error between t and t0. Then our 3-D brightness constraint equations can be written as

I1(x) − I2(x + P t0 + Q w) + J t0 − J t = 0      (6)

where J is the Jacobian of I2 with respect to t, J ≡ ∂I2/∂t. If we assume t0 = 0 and w = 0 in Eq. (6), we also obtain the 3-D brightness constraint equations for small motion fields,

I1(x) − I2(x) − J t = 0
(7)
which have been used in all previous 3-D interpretation methods [8,10]. Finally, the successively approximated diffusion-reaction equations are

α div(D(∇I1) ∇ti) + ( I1(x) − I2(x + P t0 + Q w) + J t0 − J t ) Ji = 0,   i = 1, 2, 3      (8)

which are our diffusion-reaction equations for dense 3-D interpretation. To solve these equations, the final estimate t is iteratively updated from the initial estimate t0 at every pixel; therefore, the large-scale linear system of Eq. (8) must be solved in each iteration.
3 Discretization
The unknown functions, t1 (x, y), t2 (x, y), and t3 (x, y), are defined on a pixel grid of cell size hx × hy . We denote by h the index for the cell size. We denote by t1i , t2i , and t3i the approximation to t1 , t2 , and t3 , respectively, at some pixel i with i = 1, . . . , N . All spatial gradients are approximated by central differences, and the divergences are estimated using the eight neighboring pixels, denoted by j, with j = 1, . . . , 8. We denote by Ni the set of neighbors of pixel i and dmni the element (m, n) of the projection matrix D (∇I1i ) for some pixel i. Then, the diffusivity coefficients, cji , with j = 1, . . . , 8 in some pixel i, can be computed with c1i = c5i =
d111 +d11i , 2 d125 +d12i , 4
11i 22i c2i = d112 +d , c3i = d223 +d , c4i = 2 2 d126 +d12i d127 +d12i c6i = − , c = − , c8i = 7i 4 4
d224 +d22i 2 d128 +d12i . 4
(9)
We denote by Jmi the the element (m) of the Jacobian in some pixel i, defined by (J1i , J2i , J3i ) = (f ∂I1i /∂x, f ∂I1i /∂y, −f x − f y). We describe the finite difference approximation to the diffusion-reaction equations in Eq. (8) by using this discretization. By assuming that t0 = 0 due to small motion and w = 0 in Eq. (8), we obtain the discrete diffusion-reaction equations for small motion, given by the elliptic PDE system ⎧ t1j −t1i 2 ⎪ ⎨ α j∈Ni cji h2 + J1i esi − J1i t1i − J2i J1i t2i − J3i J1i t3i = 0, t2j −t2i 2 t2i − J3i J2i t3i = 0, α j∈Ni cji h2 + J2i esi − J1i J2i t1i − J2i (10) ⎪ t −t ⎩α 3j 3i 2 + J3i esi − J1i J3i t1i − J2i J3i t2i − J3i t3i = 0, j∈Ni cji h2 for i = 1, . . . , N , where esi ≡ I1 (xi ) − I2 (xi ). This constitutes a large-scale linear system of equations for the 3N unknowns t1i , t2i , and t3i . Large motions are handled by considering the unknowns as the time evolution functions t (x, y, n) with a time variable n. We solve an evolution solution by calculating the asymptotic state (n → ∞) of Eq. (8). The associated discrete diffusion-reaction equations are given by the parabolic PDE system ⎧ t −t t −t0 2 ⎪ t1i − J2i J1i t2i − J3i J1i t3i , ⎪ 1i τ 1i = α j∈Ni cji 1jh2 1i + J1i edi t0i − J1i ⎨ 0 0 t2i −t2i t2j −t2i 2 = α c + J e t − J 2i di 1i J2i t1i − J2i t2i − J3i J2i t3i , i j∈Ni ji τ h2 ⎪ 0 ⎪ t3j −t3i ⎩ t3i −t03i = α 2 + J3i edi ti − J1i J3i t1i − J2i J3i t2i − J3i t3i , j∈Ni cji τ h2 (11) for i = 1, . . . , N , where edi t0i ≡ I1 (xi ) − I2 xi + Pi t0i + Qi wi − Ji t0i , and τ the size of the time-step. The above system can be solved by iteratively applying the gradient descent method but a small time-step is required at each iteration of the gradient descent method, which drastically increases the computation time. Instead, we present a new efficient numerical scheme for quickly finding an evolution solution to the parabolic PDE system.
Fast 3-D Interpretation from Monocular Image Sequences
4
501
Efficient Numerical Scheme
The equations in Eq. (10) constitute a linear elliptic PDE system for the 3N unknowns t1i , t2i , and t3i , with a sparse matrix structure. Given a linear system Az = b and a decomposition A = D − L − U with diagonal matrix D, lower triangular matrix L, and upper triangular matrix U , then one iteration of the Gauss-Seidel relaxation [4] is z n+1 = (D − L)−1 (U z n + b), where n is the iteration index of relaxation. The three unknowns of each pixel are coupled with each other. Such a coupling yields to a block Gauss-Seidel relaxation (BGSR). In each iteration, we simultaneously update the three unknowns, t1i , t2i , and t3i , by solving the following 3 × 3 linear system at pixel i: = Q−1 tn+1 i si y si
n+1/2
n+1/2
where Qsi and y si ⎛ α Qsi = ⎝
h2
n+1/2
y si
(12)
are defined by
2 cji + J1i J1i J2i J1i J3i
j∈Ni
⎛
and
,
α h2
⎜ ⎜ = ⎜ hα2 ⎝ α h2
α h2
J2i J1i 2 j∈Ni cji + J2i J2i J3i
α h2
⎞ J3i J1i ⎠ J J 3i 2i 2 j∈Ni cji + J3i
⎞ n+1 + j∈N + cji tn1j + J1i esi j∈Ni− cji t1j i ⎟ ⎟ n+1 n − cji t + cji t + e + J ⎟, 2i si 2j 2j j∈Ni j∈Ni ⎠ n+1 n + j∈N + cji t3j + J3i esi j∈N − cji t3j i
(13)
(14)
i
where Ni− = {j ∈ Ni | i(j) < i} and Ni+ = {j ∈ Ni | i(j) > i}, and the function i(j) returns the pixel index corresponding to j. The explicit scheme for computing a solution to the resulting parabolic PDE system in Eq. (11) can be written as z^{n+1} = z^n + τ(A z^n − b^n), where b^n denotes that this source term has to be updated at each iteration n, unlike the case for small motion. This explicit scheme is only stable for small time-steps. The implicit scheme is stable no matter how large the time-step is [1,13]. In the implicit scheme, we obtain a solution at each iteration by solving the linear system (I − τA) z^{n+1} = z^n − τ b^n. Since the matrix (I − τA) has a sparse matrix structure equal to the structure of the matrix A, we can employ BGSR as a basic iterative solver for the implicit scheme. Thus, one iteration reads z^{n+1} = ((1/τ) I − D + L)^{−1} ((1/τ) z^n − U z^n − b^n). In each iteration step of the implicit block Gauss-Seidel relaxation (IBGSR), the three coupled unknowns, t1i, t2i, and t3i, at pixel i and time n are simultaneously updated by solving the 3 × 3 linear system

t_i^{n+1} = Q_di^{−1} y_di^{n+1/2},      (15)

where Q_di and y_di^{n+1/2} are defined by

Q_di = [ 1/τ + (α/h²) Σ_{j∈Ni} c_ji + J1i²    J1i J2i                                  J1i J3i
         J2i J1i                               1/τ + (α/h²) Σ_{j∈Ni} c_ji + J2i²       J2i J3i
         J3i J1i                               J3i J2i                                 1/τ + (α/h²) Σ_{j∈Ni} c_ji + J3i² ]      (16)

and

y_di^{n+1/2} = [ (1/τ) t1i^n + (α/h²) ( Σ_{j∈Ni−} c_ji t1j^{n+1} + Σ_{j∈Ni+} c_ji t1j^n ) + J1i e_di(t_i^n)
                (1/τ) t2i^n + (α/h²) ( Σ_{j∈Ni−} c_ji t2j^{n+1} + Σ_{j∈Ni+} c_ji t2j^n ) + J2i e_di(t_i^n)
                (1/τ) t3i^n + (α/h²) ( Σ_{j∈Ni−} c_ji t3j^{n+1} + Σ_{j∈Ni+} c_ji t3j^n ) + J3i e_di(t_i^n) ].      (17)
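A minimal sketch of the per-pixel 3 × 3 block update of Eqs. (15)–(17), assuming a 4-neighbour grid, unit diffusivities c_ji, and precomputed Jacobian and residual images; dropping the 1/τ terms and replacing e_di by e_si gives the update of Eq. (12) for the small-motion case. Array names and the sweep order are illustrative only, not the authors' implementation.

```python
import numpy as np

def ibgsr_sweep(t, t_prev, J, e_d, alpha, h, tau):
    """One implicit block Gauss-Seidel sweep over all pixels.

    t      : (3, H, W) current unknowns t1, t2, t3 (updated in place).
    t_prev : (3, H, W) unknowns at the previous time step t^n.
    J      : (3, H, W) Jacobian images J1, J2, J3.
    e_d    : (H, W) residual e_di(t_i^n).
    """
    H, W = e_d.shape
    for y in range(H):
        for x in range(W):
            nbrs = [(y + dy, x + dx) for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
                    if 0 <= y + dy < H and 0 <= x + dx < W]
            c_sum = len(nbrs)                        # c_ji = 1 assumed
            Ji = J[:, y, x].astype(float)            # (J1i, J2i, J3i)
            # 3x3 system matrix Q_di (Eq. 16): reaction + diffusion + 1/tau terms.
            Q = np.outer(Ji, Ji)
            Q += np.eye(3) * (1.0 / tau + alpha / h**2 * c_sum)
            # Right-hand side y_di (Eq. 17): already-updated neighbours are mixed
            # with old ones automatically because t is overwritten in place.
            nb_sum = sum(t[:, yy, xx] for yy, xx in nbrs)
            rhs = t_prev[:, y, x] / tau + alpha / h**2 * nb_sum + Ji * e_d[y, x]
            t[:, y, x] = np.linalg.solve(Q, rhs)
    return t
```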
We can obtain two different solutions to our model by iteratively solving the 3 × 3 linear system of Eq. (12) or Eq. (15) at each pixel. The convergence of the relaxation is improved by employing multigrid methods [5]. Let us denote the linear systems described above by A_h z_h = b_h. We denote by z̃ the current approximate solution and by z̄ the smoothed solution obtained by BGSR or IBGSR. Let us denote by H the new cell size Hx × Hy on the coarse grid. The prolongation for the coarse-to-fine grid transfer is represented by P_{H→h}, and the restriction for the fine-to-coarse grid transfer by R_{h→H}. The two-grid iteration is composed of three steps: pre-smoothing, coarse-grid correction, and post-smoothing. The details are as follows.

Two-Grid Iteration Algorithm
1. (Pre-smoothing) Compute a smoothed approximate solution z̄_h by applying μ1 iterations of a relaxation method to the current approximate solution z̃_h.
2. (Coarse-grid correction) Correct z̄_h with the error e_h approximated on the coarse grid as follows: (a) Compute the residual on the fine grid, r_h = b_h − A_h z̄_h. (b) Restrict the residual r_h to the coarse grid as r_H = R_{h→H} r_h. (c) Compute the error on the coarse grid, e_H, by solving the linear system A_H e_H = r_H. (d) Prolong the error e_H to the fine grid as e_h = P_{H→h} e_H. (e) Correct the approximate solution z̄_h by z̄_h = z̄_h + e_h.
3. (Post-smoothing) Compute the new approximation z̃_h by applying μ2 iterations of the relaxation method to z̄_h.

The two-grid iteration is repeated recursively down to some coarsest grid. This recursive procedure is the multigrid method. One iteration of a multigrid method is called a cycle, and the exact structure of a cycle depends on the value of γ, the number of two-grid iterations at each intermediate step. The case γ = 1 is called a multigrid V-cycle (MV), while γ = 2 is called a multigrid W-cycle (MW). Let us denote by h the index of the cell size on the current grid and by H the index of the cell size on the coarsest grid. Then, the multigrid iteration can be defined recursively by the following two steps:

Multigrid Iteration Algorithm
1. If h = H, then return the exact solution z̄_H obtained by solving z̄_H = A_H^{−1} b_H.
2. Otherwise, return z̄_h obtained by recursively applying γ iterations of the multigrid iteration, from h to H.
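The following recursive sketch illustrates the two-grid/multigrid iteration described above; the smoother, restriction, and prolongation operators are passed in as functions and assumed to be provided elsewhere (e.g. the IBGSR sweep above, full-weighting restriction, and bilinear prolongation), so all names are illustrative rather than the authors' implementation. On the coarsest grid the sketch simply iterates the smoother instead of an exact solve.

```python
import numpy as np

def multigrid_cycle(level, A, b, z, smooth, restrict, prolong,
                    mu1=1, mu2=1, gamma=1):
    """One multigrid cycle (gamma=1: V-cycle, gamma=2: W-cycle).

    A[level] : callable applying the system matrix on grid `level`.
    b, z     : right-hand side and current approximation on the current grid.
    smooth   : relaxation sweep, e.g. (I)BGSR, called as smooth(level, b, z).
    restrict, prolong : inter-grid transfer operators R_{h->H} and P_{H->h}.
    """
    coarsest = len(A) - 1
    if level == coarsest:
        # On the coarsest grid, iterate the smoother until (approximate) convergence.
        for _ in range(50):
            z = smooth(level, b, z)
        return z

    # Pre-smoothing: mu1 relaxation sweeps.
    for _ in range(mu1):
        z = smooth(level, b, z)

    # Coarse-grid correction.
    r = b - A[level](z)                      # residual on the fine grid
    r_c = restrict(r)                        # restrict residual to the coarse grid
    e_c = np.zeros_like(r_c)
    for _ in range(gamma):                   # gamma two-grid iterations
        e_c = multigrid_cycle(level + 1, A, r_c, e_c,
                              smooth, restrict, prolong, mu1, mu2, gamma)
    z = z + prolong(e_c)                     # correct the fine-grid solution

    # Post-smoothing: mu2 relaxation sweeps.
    for _ in range(mu2):
        z = smooth(level, b, z)
    return z
```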
Table 1. Comparison of the block Gauss-Seidel relaxation and the implicit block Gauss-Seidel relaxation
Method                                   α    n     ζ^n    Time [s]  Speedup
Block Gauss-Seidel                       10   1356  0.499  59.8      1
Block Gauss-Seidel                       1    107   0.499  4.6       13
Block Gauss-Seidel (V(1,1))              10   37    0.499  8.7       7
Block Gauss-Seidel (V(1,1))              1    3     0.485  0.7       85
Block Gauss-Seidel (W(1,1))              10   37    0.499  8.6       7
Block Gauss-Seidel (W(1,1))              1    3     0.485  0.7       85
Implicit block Gauss-Seidel              10   302   0.499  19.0      3
Implicit block Gauss-Seidel              1    36    0.499  2.3       26
Implicit block Gauss-Seidel (V(1,1))     10   5     0.496  1.3       46
Implicit block Gauss-Seidel (V(1,1))     1    2     0.418  0.5       120
Implicit block Gauss-Seidel (W(1,1))     10   2     0.464  0.6       100
Implicit block Gauss-Seidel (W(1,1))     1    1     0.489  0.3       200
Table 2. Comparison of different multigrid implementations of the implicit block Gauss-Seidel relaxation
                 Yosemite               Flower Garden          Rubik Cube
         n    ζ^n    Time [s]    n    ζ^n    Time [s]    n    ζ^n    Time [s]
V(1,1)   19   0.267  4.8         16   0.282  4.3         17   0.211  6.4
V(2,2)   17   0.263  5.7         11   0.285  4.2         10   0.229  5.7
W(1,1)   17   0.266  5.3         11   0.281  3.7         4    0.287  1.9
W(2,2)   16   0.262  6.7         9    0.283  4.0         4    0.299  2.5

5 Experimental Results
All numerical experiments were carried out on a desktop PC with a 3.2-GHz Intel P4 CPU executing C/C++ code. The qualitative analysis of dense 3-D interpretation was conducted by displaying a gray value rendering of a dense map of relative depth as in [8,10]. To quantitatively evaluate the accuracy of 3-D interpretation, we computed the error reduction factor defined by ζ^n = e^n/e^0, where e^n denotes the sum of squared intensity errors at iteration n. We used four grids for the multigrid V-cycle or W-cycle iterations. We denote by V(1,1) or W(1,1) the V-cycle or W-cycle with one pre-smoothing and one post-smoothing iteration at each grid. The contrast parameter of our model, ν, was set to 2. In all experiments, we set the field of view of the camera to 45°. In the first experiment, we compared the performance of BGSR and IBGSR. The previous methods in [8,10] used BGSR for performing dense 3-D interpretation. In each test run, the algorithm was stopped when the error reduction factor fell below 0.5. We also evaluated the performance of each relaxation with V(1,1) or W(1,1). The first two frames of the Yosemite sequence were used for this experiment. Table 1 shows the required number of iterations, n, to reach the
Fig. 1. Top: Result of the Yosemite sequence (ν=2.0, α=1, 4.8 s with 19 V(1,1) cycles). Middle: Result of the Flower Garden sequence (ν=2.0, α=1, 3.7 s with 10 W(1,1) cycles). Bottom: Result of the Rubik Cube sequence (ν=2.0, α=10, 1.9 s with 4 W(1,1) cycles).
desired error reduction of ζ^n < 0.5, the computation time in seconds, and the speedup factor with respect to BGSR. Although IBGSR computes the source term b^n in each iteration, it was about 3 times faster than BGSR. While multigrid methods provided a significant acceleration to both relaxation methods, IBGSR reached the desired error reduction faster than BGSR with V(1,1) or W(1,1). In our second experiment, different multigrid implementations of IBGSR were tested. We performed tests with the first two frames of the Yosemite sequence, the first two frames of the Flower Garden sequence of 352×240 pixels, and the first and the fifth frames of the Rubik Cube sequence of 256×240 pixels. The Rubik Cube sequence contains a rotating object, and therefore we estimated the rotational motion w as well as the translational motion t. The algorithm was stopped when ζ^n − ζ^{n−1} ≤ 10^{−3}. IBGSR was applied to each test set with a 4-grid V-cycle or W-cycle. Table 2 shows the required number of V or W cycles n, the error reduction factor ζ^n, and the computation time for each multigrid implementation. To demonstrate the quality of 3-D interpretation, we display
Table 3. Comparison of the previous methods (MDL [8], Diffusion [10]) and our method (IBGSR, V(1,1)) for the Yosemite sequence
             n      ζ^n    κ [%]   φ [°]   Time [s]
MDL          10000  0.459  35.06   13.91   6061.4
Diffusion    10000  0.478  36.64   17.17   989.5
Our method   19     0.267  23.65   4.11    4.8
Fig. 2. Top: Result of the Teddy sequence (ν=2.0, α=5, 3.8 s with 5 W(2,2) cycles). Bottom: Different views of the reconstructed 3-D structure by our method.
the first frame, the gray value rendering of the estimated depth map, and the optical flow computed by Eq. (1) for the three test sets in Figure 1. Note that our method output the full 3-D information of each test set in a few seconds. In Table 3, we show the performance of our method compared to the previous methods, MDL [8] and Diffusion [10]. For the quantitative analysis, we calculated the average angular error φ and the relative depth error κ, defined by κ = ||a d̂ + b − d|| / ||d||, where d denotes the ground truth depth, d̂ the estimated depth, and a and b the scale and the bias factors of the estimated depth map, respectively. The maximum number of iterations was set to 10^4. As we see, the previous methods could not reach the same error reduction as our method. In the last experiment, we experimentally verified that our method allows the movement of both the viewing system and the viewed objects. This was demonstrated using the Teddy sequence, where an object moves laterally and a textured background moves in the opposite direction. We used the first two frames after resizing to 400 × 340 pixels and set ν=2 and α=5 for our model parameters. W(2,2) was selected for the multigrid implementation. Figure 2 shows frame 1 of the Teddy sequence, the estimated depth map between the first two frames, and the
computed optical flow field. The computation time was 3.8 seconds. That figure also displays different views of the reconstructed 3-D structure by our method.
6 Conclusions
We have presented a fast method for dense 3-D interpretation that does not require prior estimation of optical flow. The proposed method can conduct fast 3-D interpretation on large motion fields. The Nagel-Enkelmann technique was employed in the variational model of our problem, and we have derived diffusion-reaction equations. The ideas of implicit schemes and multigrid methods have been combined to develop a new implicit multigrid block Gauss-Seidel relaxation scheme for quickly solving our dense 3-D interpretation problem on large motion fields. We have verified the efficiency and effectiveness of our method through experimental results with synthetic and real image sequences.
References 1. Alvarez, L.: Images and PDE’s. ICAOS 1996 Images, Wavelets and PDE’s 219, 3–14 (1996) 2. Cohen, M., Irani, M., Anandan, P.: Direct recovery of planar-parallax from multiple frames. IEEE Trans. on Pattern Anal. and Mach. Intell. 24(11), 1528–1534 (2002) 3. Chellappa, R., Qian, G., Srinivasan, S.: Invited paper: Structure from motion: sparse versus dense correspondence methods. In: Proc. IEEE Conf. on Image Processing, pp. 492–499. IEEE Computer Society Press, Los Alamitos (1999) 4. Hackbusch, W.: Applied mathematical sciences, vol. 95. Springer, New York (1993) 5. Henson, V.E., Briggs, W.L., McCormick, S.F.: A multigrid tutorial. In: SIAM (2000) 6. Longuet-Higgins, H.C.: A computer algorithm for reconstructing a scene from two projections. Nature 293, 133–135 (1987) 7. Mitiche, A.: Computational analysis of visual motion. Plenum Press, New York (1994) 8. Mitiche, A., Hadjres, S.: MDL estimation of a dense map of relative depth and 3d motion from a temporal sequence of images. Pattern Analysis and Applications 6, 78–87 (2003) 9. Nagel, H.H., Enkelmann, W.: An investigation of smoothness constraints for the estimation of displacement vector fields from image sequences. IEEE Trans. Pattern Anal. and Mach. Intell. 8, 565–593 (1986) 10. Sekkati, H., Mitiche, A.: Dense 3D interpretation of image sequences: A variational approach using anisotropic diffusion. In: Proc. Int. Conf. on Image Analysis and Processing (2003) 11. Stein, S., Shashua, M.: Model-based brightness constraints: On direct estimation of structure and motion. IEEE Trans. on Pattern Anal. and Mach. Intell. 22(9), 993–1005 (2000) 12. Vedula, S., Baker, S., Rander, P., Collins, R., Kanade, T.: Three-dimensional scene flow. IEEE Trans. on Pattern Anal. and Mach. Intell. 27(3), 475–480 (2005) 13. Weickert, J., Romeny, B.M., Viergever, M.A.: Efficient and reliable schemes for nonlinear diffusion filtering. IEEE Trans. Image Process. 7, 398–410 (1998)
Color-Stripe Structured Light Robust to Surface Color and Discontinuity Kwang Hee Lee, Changsoo Je, and Sang Wook Lee Dept. of Media Technology Sogang University Shinsu-dong 1, Mapo-gu, Seoul 121-742, Korea {Khecr, Vision, Slee}@sogang.ac.kr
Abstract. Multiple color stripes have been employed for structured light-based rapid range imaging to increase the number of uniquely identifiable stripes. The use of multiple color stripes poses two problems: (1) object surface color may disturb the stripe color and (2) the number of adjacent stripes required for identifying a stripe may not be maintained near surface discontinuities such as occluding boundaries. In this paper, we present methods to alleviate those problems. Log-gradient filters are employed to reduce the influence of object colors, and color stripes in two and three directions are used to increase the chance of identifying correct stripes near surface discontinuities. Experimental results demonstrate the effectiveness of our methods. Keywords: Structured light, color stripes, range imaging, active vision, surface color, surface discontinuity, projector-camera system, 3D modeling.
1 Introduction Structured light-based ranging is an accurate yet simple technique for acquiring depth images, and thus has been widely investigated [3-5, 7-15, 18]. A variety of light patterns has been developed for rapid range imaging, and the range resolution achievable in a single video frame has recently been increased sufficiently so that real-time shape capture has become a practical reality [12, 14, 18]. Structured-light methods suitable for real-time ranging use specific codifications based on color assignment [3, 5, 8, 11, 12, 14, 15, 18] and spatially windowed uniqueness (SWU) [3, 8, 12, 14, 18] to increase the number of unique subpatterns in a single projection pattern. This SWU and the use of color have made it possible to design color stripe patterns for high-resolution range imaging in a single video frame. On the other hand, the main disadvantages of a single-frame color stripe pattern compared to sequential BW (black-and-white) stripes are that color stripe identification is affected by object surface color and it is often impossible near surface discontinuities such as occluding boundaries. In general, in a multiple-stripe color pattern, the uniqueness of a spatially windowed subpattern becomes higher as the number of stripes in the subpattern increases. To guarantee sufficient SWU for rapid imaging, a subpattern consists of several adjacent stripe colors. However, some of those adjacent stripes are often unavailable for identifying a stripe color of interest near
surface discontinuities. To the best of our knowledge, little research has been carried out explicitly to alleviate these problems.
Fig. 1. (a) A multiple-color stripe pattern and a unique subpattern with 7-stripe wide spatial window, (b) a subpattern window in projected stripes, and (c) projected stripes on a colored object
In this paper, we present structured-light methods with range estimation robust to object surface colors and discontinuities. We show that the logarithmic gradient (log-gradient) [1, 2] can be used for decreasing the influence of surface color reflectance in structured-light range imaging. Our data processing algorithm incorporates the Fourier transform-based filtering that has been used for deriving intrinsic images [6, 16, 17]. By applying the algorithm to a captured image, the object reflectance in the image is discounted to some degree, and the stripe colors of structured light are detected with higher accuracy. In addition, we develop a method of applying stripe patterns in two or three directions simultaneously, decoupling the patterns and estimating depth images independently. This substantially improves the chance of estimating correct depths near surface discontinuities. The rest of this paper is organized as follows. Section 2 presents the problem statement, and Section 3 presents a filtering method for decreasing the influence of surface colors in color stripe identification. A method of applying more than one pattern simultaneously for improving depth estimation near surface discontinuities is described in Section 4. Experimental results are shown in Section 5, and Section 6 concludes this paper.
2 Problem Statement Sequential projections of BW patterns are in general inappropriate for rapid structured-light depth imaging, and the use of colors has been investigated to reduce the number of pattern projections required for a given range resolution. If a stripe is identified by its color alone in a color stripe pattern, global uniqueness is hard to achieve due to the repeated appearance of the color among M stripes. The use of more colors than the three RGB ones can compromise their distinctiveness due to the reduced color distances between the stripes. In the SWU scheme, colors are sequentially encoded to stripes according to permutations or a de Bruijn sequence. A stripe color centered in a spatially windowed subpattern can be identified since the combination of the colors in the window is unique globally or at least semi-globally [3, 8, 12, 14, 18]. Figure 1 (a) illustrates this uniqueness of a seven-stripe subpattern. The larger the number of stripes in the subpattern, the wider the window and the more unique the subpattern.
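As a purely illustrative sketch of SWU-style coding (not the specific pattern of [12]), the following generates a de Bruijn sequence over a small color alphabet so that every window of n consecutive stripes is unique; the alphabet and window length are arbitrary choices.

```python
def de_bruijn(k, n):
    """Return a de Bruijn sequence B(k, n): every length-n window over an
    alphabet of k symbols appears exactly once (cyclically)."""
    a = [0] * (k * n)
    seq = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return seq

# Hypothetical 4-color alphabet; each window of 3 consecutive stripe colors
# is then globally unique along the resulting pattern.
colors = ["red", "green", "blue", "yellow"]
stripe_colors = [colors[s] for s in de_bruijn(len(colors), 3)]
print(len(stripe_colors), stripe_colors[:10])
```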
A wide subpattern has an obvious disadvantage near a surface discontinuity. When the directions of an occluding boundary and stripes are similar, some stripes in the subpattern around a color stripe may be occluded and thus the color stripe is unidentifiable as depicted in Figure 1 (b). This results in a failure in depth estimation near surface discontinuities. The failure also occurs when strong surface colors alter stripe colors significantly as shown in Figure 1 (c). The goal of the research presented in this paper is to develop algorithms for reducing the influence of object colors and discontinuities using log-gradient filters and combined projection patterns in multiple directions.
3 Log-Gradient Processing for Surface Color Reduction As mentioned earlier, surface color hinders detection of stripes. Angelopoulou et al. proposed a spectral log-gradient method [1], and Berwick and Lee used spectral and spatial log-gradients to discount illumination colors [2]. In the research for deriving intrinsic image, several filtering schemes have been proposed [6, 16, 17]. Based on those spectral and spatial operators and filtering algorithms, we develop a method for discounting object colors across the color stripe directions. Let pd be the result signal of an original signal p processed by a filter fd:
p_d ≡ f_d ∗ p.      (1)
Then, p can be estimated as follows [6]:
p̂ = F^{−1} ( F( Σ_{d=x,y} f_d^r ∗ p̂_d ) / Σ_{d=x,y} F( f_d^r ∗ f_d ) ),      (2)
where f dr is the reversed filter of f d : f d ( x, y ) = f dr (− x, − y ) , and F and F-1 denote the Fourier transform and its inverse transform, respectively. The objective of our processing algorithm is to regard the color variation across the stripes as illumination change (i.e., projection color change) and remove the surface reflectance as much as possible. We use log-gradient operators for this purpose. The reflected light intensity is given by the equation: I (λ ) = S (λ ) L ( λ ) ,
(3)

where S, L, and λ are the surface reflectance, the illumination, and the wavelength, respectively. Applying the log-gradient with respect to the direction of stripe transition to Eq. (3),

∂_x(ln I) = ∂_x(ln S) + ∂_x(ln L) ≅ ∂_x(ln L),      (4)
where it is assumed the spatial change of surface reflectance is much slower than that of illumination (through the stripe transition) and can be ignored. In equation 4, we can see spatially static surface reflectance can be easily removed. Since the stripe direction on the surface spatially varies in a scene image due to the surface geometry and triangulation, log-gradient has to be applied w.r.t. the corresponding direction to
each infinitesimal area for removing the surface reflectance component as much as possible. We do this process approximately by differentiating the images rotated from the original image with 17 angles sampled at the step of 10 degrees. In order to operate this process based on a filtering scheme in intrinsic imaging we apply a gradient filter w.r.t. x, Dx to the logarithm of each rotated image:
Fig. 2. The process of 3D modeling based on the proposed method with a combination of two stripe-patterns. Horizontal and vertical color stripe-patterns encoded based on SWU, are combined to a single combination pattern, and it is projected onto the leather surface of a colortextured bag. The scene is captured by a camera, and the scene image is filter-processed with rotations to estimate the two illumination-restored images. Stripes are identified independently in each restoration image, and ranges are obtained by geometric calibration of the projector and camera. Finally the ranges are merged into one, and it is meshed and rendered.
i_x^φ ≡ D_x ∗ ln( R_φ(I) ),      (5)
where R_φ denotes the rotation operator by an angle φ. For obtaining depth data it is appropriate to restore the illumination component of single-directional transitions. Therefore, in our application the restoration does not include the gradient with respect to y, i_y^φ, in contrast to Eq. (2):
i^φ ≡ F^{−1} ( F( D_x^r ∗ i_x^φ ) / Σ_{d=x,y} F( D_d^r ∗ D_d ) ).      (6)
Taking the exponential of the restoration and rotating each image back by −φ, the restoration for each rotation is obtained:

I_φ ≡ R_{−φ}( exp(i^φ) ).      (7)

Finally, the illumination-restored image can be constructed as I_{Φ(x,y)}, where Φ(x, y) is the rotation angle which maximizes the gradient magnitude:

Φ ≡ arg max_φ [ (∂_x I_φ)² + (∂_y I_φ)² ].      (8)
With the illumination-restored image, stripe segmentation is expected to be much less sensitive to object surface colors and to yield more accurate depth results.
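A minimal sketch of this restoration pipeline, assuming a single grayscale channel, a simple derivative filter, and FFT-based deconvolution with a small regularization constant; the filter choice, angle sampling, and epsilon are illustrative assumptions rather than the authors' exact settings, and the zero-padded FFT introduces a small phase shift that a more careful implementation would compensate.

```python
import numpy as np
from scipy import ndimage

def filter_to_otf(kernel, shape):
    """Zero-pad a small filter to the image size and take its 2-D FFT
    (a simple stand-in for psf2otf; introduces a small phase shift)."""
    pad = np.zeros(shape)
    pad[:kernel.shape[0], :kernel.shape[1]] = kernel
    return np.fft.fft2(pad)

def restore_illumination(I, angles=np.arange(0, 170, 10), eps=1e-3):
    """Approximate illumination restoration of a stripe-lit image (Eqs. 5-8)."""
    Dx = np.array([[0.0, 0.0, 0.0], [-1.0, 0.0, 1.0], [0.0, 0.0, 0.0]])  # x-derivative
    Dy = Dx.T
    otf_x, otf_y = filter_to_otf(Dx, I.shape), filter_to_otf(Dy, I.shape)
    # |F(Dx)|^2 + |F(Dy)|^2, i.e. sum_d F(D_d^r * D_d) up to the padding phase.
    denom = np.abs(otf_x) ** 2 + np.abs(otf_y) ** 2

    candidates, grad_mags = [], []
    for phi in angles:
        logR = ndimage.rotate(np.log(I + 1e-6), phi, reshape=False, order=1)
        ix = ndimage.convolve(logR, Dx, mode='nearest')      # Eq. (5)
        num = np.conj(otf_x) * np.fft.fft2(ix)               # F(D_x^r * i_x^phi)
        i_phi = np.real(np.fft.ifft2(num / (denom + eps)))   # Eq. (6)
        I_phi = ndimage.rotate(np.exp(i_phi), -phi, reshape=False, order=1)  # Eq. (7)
        gy, gx = np.gradient(I_phi)
        candidates.append(I_phi)
        grad_mags.append(gx ** 2 + gy ** 2)

    best = np.argmax(np.stack(grad_mags), axis=0)            # Eq. (8), per pixel
    return np.take_along_axis(np.stack(candidates), best[None], axis=0)[0]
```

For the combined two- or three-direction patterns of the next section, the same per-pixel arg max would simply be restricted to an interval around each globally dominant stripe direction.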
4 Patterns Combination for Adapting to Surface Discontinuities As noted in Section 2, SWU-based stripe identification can lose the correct coordinate of the illumination near a surface discontinuity. In a scene where the dominant discontinuities run horizontally, projecting vertical stripes easily avoids this problem of SWU-based identification. By combining two sets of stripes with different directions, a single projection pattern containing multidirectional stripes can be generated and projected onto the scene. As a result, for any single-directional discontinuity there is a sufficient number of connected stripes whose directions are distant from that of the discontinuity, and thus the stripes can be correctly identified for most single-directional discontinuities. In this case, we can find the two colors of the illuminated stripes, I_Φ1 and I_Φ2, by estimating Φ1 and Φ2 for each pixel. We estimate Φ1 and Φ2 in a way similar to Eq. (8), by finding the maximum in the π/2 period centered at the globally dominant direction (m_Φ1 and m_Φ2) of each stripe-pattern. Where appropriate, stripe patterns with three different directions can be combined into a single pattern to be projected onto the scene, and Φ1, Φ2, and Φ3 can be found in correspondingly smaller ranges (π/3 intervals) centered at m_Φ1, m_Φ2, and m_Φ3. Figure 2 illustrates the process of 3D modeling based on the proposed method with a combination of two stripe-patterns. The process is largely the same for a single stripe-pattern and for a combination of three patterns.
5 Experimental Results We have conducted many experiments to validate the proposed techniques. The experimental setup consists of a Sony XC-003 3CCD VGA camera, an Epson color LCD XGA projector, and a cubic calibration object. A permutation-based color-stripe
pattern in [12] is used to generate the three kinds of patterns: single-direction, double-direction, and triple-direction stripe-patterns (see Figure 3). We used the Sobel operator as the gradient filter, and the stripes are segmented by hue thresholding of restored illumination colors normalized by neighbor colors. Each segmented stripe is identified according to the sequence of neighboring consecutive stripes. From the image coordinates and stripe identities, 3D world coordinates are estimated by the geometric calibration [13] which gives the extrinsic and intrinsic parameters of the projector and camera. When the double-direction or triple-direction pattern is used, more than one range image is obtained, and these are merged into one by removing erroneous points and by averaging multiple reliable points. Some results are meshed and rendered.
Fig. 3. The three kinds of stripe-patterns: (Left) single-direction, (Middle) double-direction, and (Right) triple-direction
Fig. 4. The results of a crumpled paper which consists of squares of 4 colors (cyan, blue, green and red, clockwise): (a) input image, (b) illumination-restored image by log-gradient processing without rotations (c) illumination-restored image by proposed processing, (d) range result by direct estimation (e) range estimation from (b), and (f) range estimation from (c)
Fig. 5. The stripe segmentation results by (a) direct estimation and by (b) proposed processing
Fig. 6. The results of the leather surface of a color-textured bag: (a) input image, (b) illumination-restored image by proposed processing, (c) scene under white illumination, (d) rendered result by direct estimation, (e) result by proposed method, and (f) a filtered image with a single rotation
Figure 4 shows the experimental results of a crumpled paper which consists of squares of 4 colors (cyan, blue, green and red, clockwise): (a) input image, (b) illumination-restored image by log-gradient processing without rotations (c) illumination-restored image by proposed processing, (d) range result by direct estimation (e) range estimation from (b), and (f) range estimation from (c). Figure 5 compares the stripe segmentation results by direct estimation and by proposed processing. Figure 6 shows the results of the leather surface of a color-textured bag: (a) input image, (b) illumination-restored image by proposed processing, (c) scene under white illumination, (d) rendered result by direct estimation, (e) result by proposed method, and (f) a filtered image with a single rotation. Figure 7 depicts the results of the bag: (a) input image, (b and c) the two illumination-restored images, (e and f) rendered results from (b and c), and (d) merged result from (b and c).
Figure 8 depicts the results of hands: (a) input image, (b and c) the two illumination-restored images, (e and f) rendered results from (b and c), and (d) merged result from (b and c). Figure 9 depicts the results of a cow: (a) input image of a scene with the double-directional pattern, (b and c) the two illumination-restored images, (e and f) rendered results from (b and c), and (d) merged result from (b and c).
Fig. 7. The results of the bag: (a) input image, (b and c) the two illumination-restored images, (e and f) rendered results from (b and c), and (d) merged result from (b and c)
Fig. 8. The results of hands: (a) input image, (b and c) the two illumination-restored images, (e and f) rendered results from (b and c), and (d) merged result from (b and c)
Fig. 9. The results of a cow: (a) input image of a scene with double-directional pattern, (b and c) the two illumination-restored images, (e and f) rendered results from (b and c), and (d) merged result from (b and c)
6 Conclusion In this paper, we demonstrated a filtering method for improving the estimation of stripe colors, and proposed a strategy of combining two or three stripe-patterns for increasing the probability of correct stripe identification. Through the experiments, we have shown that the method makes the range acquisition more insensitive to the scene characteristics. Acknowledgments. This work was supported by the Korea Science and Engineering Foundation (KOSEF) Grant (No. R01-2006-000-11374-0) and a Seoul R&BD Program.
References 1. Angelopoulou, E., Lee, S.W., Bajcsy, R.: Spectral gradient: a material descriptor invariant to geometry and incident illumination. In: Proc. ICCV 1999, pp. 861–867 (1999) 2. Berwick, D., Lee, S.W.: Spectral gradients for color-based object recognition and indexing. Computer Vision and Image Understanding 94(1-3), 28–43 (2004) 3. Boyer, K.L., Kak, A.C.: Color-encoded structured light for rapid active ranging. IEEE Trans. PAMI 9(1), 14–28 (1987)
4. Carrihill, B., Hummel, R.: Experiments with the intensity ratio depth sensor. Computer Vision, Graphics, and Image Processing 32, 337–358 (1985) 5. Caspi, D., Kiryati, N., Shamir, J.: Range imaging with adaptive color structured light. IEEE Trans. on PAMI 20(5) (1998) 6. Chung, Y.-C., Wang, J.M., Bailey, R.R., Chen, S.-W., Chang, S.-L., Cherng, S.: PhysicsBased Extraction of Intrinsic Images from a Single Image. ICPR (4), 693–696 (2004) 7. Curless, B., Levoy, M.: Better optical triangulation through spacetime analysis. In: Proc. ICCV, pp. 987–994 (1995) 8. Davies, C.J., Nixon, M.S.: A hough transformation for detecting the location and orientation of three-dimensional surfaces via color encoded spots. IEEE Trans. on Systems, Man, and Cybernetics 28(1B) (1998) 9. Hall-Holt, O., Rusinkiewicz, S.: Stripe Boundary Codes for Real-time Structured-light Range Scanning of Moving Objects. In: Proc. ICCV (2001) 10. Horn, E., Kiryati, N.: Toward optimal structured light patterns. Image and Vision Computing 17 (1999) 11. Huang, P.S., Hu, Q., Jin, F., Chiang, F.: Color-encoded digital fringe projection technique for high-speed three-dimensional surface contouring. Optical Engineering 38(6), 1065– 1071 (1999) 12. Je, C., Lee, S.W., Park, R.-H.: High-Contrast Color-Stripe Pattern for Rapid StructuredLight Range Imaging. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 95–107. Springer, Heidelberg (2004) 13. McIvor, A.M., Valkenburg, R.J.: Calibrating a Structured Light System, Image and Vision Computing, New Zealand, Lincoln, pp. 167–172 (August 1995) 14. Pagès, J., Salvi, J., Collewet, C., Forest, J.: Optimised De Bruijn patterns for one-shot shape acquisition. Image Vision Computing 23(8), 707–720 (2005) 15. Tajima, J., Iwakawa, M.: 3-D Data Acquisition by Rainbow Range Finder. In: Proc. ICPR, pp. 309–313 (1990) 16. Tappen, M.F., Freeman, W.T., Adelson, E.H.: Recovering Intrinsic Images from a Single Image. In: NIPS 2002, pp. 1343–1350 (2002) 17. Weiss, Y.: Deriving Intrinsic Images from Image Sequences. In: ICCV, pp. 68–75 (2001) 18. Zhang, L., Curless, B., Seitz, S.M.: Rapid Shape Acquisition Using Color Structured Light and Multi-pass Dynamic Programming. In: 3DPVT 2002, pp. 24–37 (2002)
Stereo Vision Enabling Precise Border Localization Within a Scanline Optimization Framework
Stefano Mattoccia, Federico Tombari, and Luigi Di Stefano
Department of Electronics, Computer Science and Systems (DEIS), University of Bologna, Viale Risorgimento 2, 40136 Bologna, Italy
Advanced Research Center on Electronic Systems 'Ercole De Castro' (ARCES), University of Bologna, Via Toffano 2/2, 40135 Bologna, Italy
{smattoccia, ftombari, ldistefano}@deis.unibo.it
Abstract. A novel algorithm for obtaining accurate dense disparity measurements and precise border localization from stereo pairs is proposed. The algorithm embodies a very effective variable support approach based on segmentation within a Scanline Optimization framework. The use of a variable support allows for precisely retrieving depth discontinuities while smooth surfaces are well recovered thanks to the minimization of a global function along multiple scanlines. Border localization is further enhanced by symmetrically enforcing the geometry of the scene along depth discontinuities. Experimental results show a significant accuracy improvement with respect to comparable stereo matching approaches.
1 Introduction and Previous Work
In the last decades stereo vision has been one of the most studied tasks of computer vision and many proposals have been made in the literature on this topic (see [1] for a review). The problem of stereo correspondence can be formulated as follows: given a pair of rectified stereo images, with one being the reference image Ir and the other being the target image It, we need to find for each point pr ∈ Ir its correspondence pt ∈ It which, due to the epipolar constraint, lies on the same scanline as pr and within the disparity range D = [dmin; dmax]. The taxonomy proposed by Scharstein and Szeliski [1] for dense stereo techniques subdivides stereo approaches into two categories: local and global. Local approaches determine the stereo correspondence for a point pr by selecting the candidate pt,d, d ∈ D, which minimizes a matching cost function CM(pr, pt,d). In order to decrease the ambiguity of the scores, the matching cost is not pointwise but is typically computed over a support which includes pr on Ir and pt,d on It. While the support can be, in the simplest cases, a static square window, notable results have been yielded by using a variable support which dynamically adapts itself depending on the surroundings of pr and pt,d [2], [3], [4], [5], [6], [7], [8]. Conversely, most global methods attempt to minimize an energy function computed on the whole image area by employing a Markov Random Field model.
Since this task turns out to be an NP-hard problem, approximate but efficient strategies such as Graph Cuts (GC) [9] and Belief Propagation (BP) [10], [11] have been proposed. In particular, a very effective approach turned out to be the employment of segmentation information and a plane fitting model within a BP-based framework [12], [13], [14]. A third category of methods, which lies in between local and global approaches, refers to those techniques based on the minimization of an energy function computed over a subset of the whole image area, i.e. typically along epipolar lines or scanlines. The adopted minimization strategy is usually based on Dynamic Programming (DP) or Scanline Optimization (SO) [15], [16], [17], [18] techniques, and some algorithms also exploit DP on a tree [19], [20]. The global energy function to be minimized includes a pointwise matching cost CM (see [1] for details) and a smoothness term which enforces constant disparity, e.g. on untextured regions, by means of a discontinuity penalty π:

E(d(A)) = Σ_{i∈A} CM( p_r^i, p_{t,d(A)}^i ) + N(d(A)) · π      (1)
with A being the image subset (e.g. a scanline) and N being the number of times the smoothness constraint is violated within the region where the cost function has to be minimized. These approaches have achieved excellent results in terms of accuracy of the disparity maps [15] and in terms of very fast, near real-time, computational performance [17]. In order to increase robustness against outliers, a fixed support (typically a 3×3 window) can be employed instead of the pointwise matching score. Nevertheless, this approach embodies all the negative aspects of a local window-based method, which are especially evident near depth discontinuities: object borders tend to be inaccurately detected. Hence, a first contribution of this paper is to deploy an SO-based algorithm which embodies, as matching cost CM, a function based on a variable support. The SO framework allows low-textured surfaces to be handled effectively, while the variable support approach helps preserve accuracy along depth borders. In order to determine the variable support, we adopt a very effective technique based on colour proximity and segmentation [21] recently proposed for local approaches. The accuracy of the SO-based process is also improved by the use of a symmetrical smoothness penalty which depends on the pixel intensities of both stereo images. It will be shown that this approach yields notable accuracy in the retrieved disparities. Moreover, we propose a refinement step which further increases the accuracy of the proposed method. This step relies on a technique that, by symmetrically exploiting the relationship between occlusions and depth discontinuities in the disparity maps obtained by alternately assuming the left and the right image as reference, allows borders to be located accurately. This is shown to be particularly useful for assigning the correct disparity values to those points violating the cross-checking constraint. Finally, experimental results show that the proposed approach is able to determine accurate dense stereo maps and is state-of-the-art among approaches that do not rely on a global framework.
2 The Support Aggregation Stage
The first step of the proposed technique computes matching costs based on a variable support strategy proposed in [21] for local algorithms. In particular, given the task of finding the correspondence score between points pr ∈ Ir and pt,d ∈ It, during the support aggregation step each point of Ir is assigned a weight which depends on its color proximity to pr as well as on information derived from a segmentation process applied to the colour images. In particular, the weight wr(pi, pr) for point pi belonging to Ir is defined as:

wr(pi, pr) = { 1.0                                   if pi ∈ Sr
             { exp( − dc(Ir(pi), Ir(pr)) / γc )      otherwise      (2)

with Sr being the segment on which pr lies, dc the Euclidean distance between two RGB triplets, and the constant γc a parameter of the algorithm. A null weight is assigned to those points of Ir which lie too far from pr, i.e. whose distance along the x or y direction exceeds a certain radius. A similar approach is adopted to assign a weight wt(qi, pt,d) to each point qi ∈ It. It is interesting to note that this strategy ideally extracts two distinct supports at every new correspondence evaluation, one for the reference image and the other for the target image. Once the weights are computed, the matching cost for correspondence (pr, pt,d) is determined by summing over the image area the product of such weights with a pointwise matching score (the Truncated Absolute Difference (TAD) of RGB triplets), normalised by the weight sum:

CM,v(pr, pt,d) = [ Σ_{pi∈Ir, qi∈It} wr(pi, pr) · wt(qi, pt,d) · TAD(pi, qi) ] / [ Σ_{pi∈Ir, qi∈It} wr(pi, pr) · wt(qi, pt,d) ]      (3)
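The sketch below illustrates how the cost of Eq. (3) could be evaluated for a single candidate correspondence, assuming precomputed segmentation label maps, square support windows fully inside the images, and a left-to-right disparity convention; the window radius, γc, and the TAD truncation value are placeholders rather than the authors' tuned parameters.

```python
import numpy as np

def segment_weights(img, seg, center, radius, gamma_c):
    """Weights of Eq. (2) inside a square window around `center` = (y, x).
    Boundary handling is omitted for brevity."""
    y, x = center
    patch = img[y - radius:y + radius + 1, x - radius:x + radius + 1].astype(float)
    labels = seg[y - radius:y + radius + 1, x - radius:x + radius + 1]
    dist = np.linalg.norm(patch - img[y, x].astype(float), axis=2)
    w = np.exp(-dist / gamma_c)
    w[labels == seg[y, x]] = 1.0          # same segment as the central pixel
    return w

def aggregated_cost(Ir, It, seg_r, seg_t, pr, d, radius=25, gamma_c=22.0, trunc=80.0):
    """Variable-support matching cost C_{M,v}(p_r, p_{t,d}) of Eq. (3)."""
    y, x = pr
    pt = (y, x - d)                       # candidate on the same scanline (assumed convention)
    wr = segment_weights(Ir, seg_r, pr, radius, gamma_c)
    wt = segment_weights(It, seg_t, pt, radius, gamma_c)
    patch_r = Ir[y - radius:y + radius + 1, x - radius:x + radius + 1].astype(float)
    patch_t = It[pt[0] - radius:pt[0] + radius + 1,
                 pt[1] - radius:pt[1] + radius + 1].astype(float)
    tad = np.minimum(np.abs(patch_r - patch_t).sum(axis=2), trunc)  # truncated abs. diff.
    w = wr * wt
    return (w * tad).sum() / w.sum()
```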
3 A Symmetric Scanline Optimization Framework
The matching cost CM,v(pr, pt,d) described in the previous section is embodied in a simplified SO-based framework similar to that proposed in [15]. Hence, in the first stage of the algorithm the matching cost matrix CM,v(pr, pt,d) is computed for each possible correspondence (pr, pt,d). Then, in the second stage, 4 SO processes are used: 2 along horizontal scanlines in opposite directions and 2 similarly along vertical scanlines. The j-th SO computes the current global cost between pr and pt,d as:

C_G^j(pr, pt,d) = CM,v(pr, pt,d) + min( C_G^j(ppr, ppt,d), C_G^j(ppr, ppt,d−1) + π1, C_G^j(ppr, ppt,d+1) + π1, cmin + π2 ) − cmin      (4)

with ppr and ppt,d being respectively the points in the previous position of pr and pt,d along the considered scanline, π1 and π2 being the two smoothness penalty terms (with π1 ≤ π2), and cmin defined as:

cmin = min_i ( C_G^j(ppr, ppt,i) )      (5)

As for the two smoothness penalty terms, π1 and π2, they depend on the local image intensities, similarly to what is proposed in [22] within a global stereo framework. This is due to the assumption that a depth discontinuity often coincides with an intensity edge; hence the smoothness penalty must be relaxed along edges and enforced within low-textured areas. In particular, we apply a symmetrical strategy so that the two terms depend on the intensities of both Ir and It. If we define the intensity differences between the current point and the previous one along the considered scanline on the two images as:

Δ(pr) = |Ir(pr) − Ir(ppr)|,   Δ(pt,d) = |It(pt,d) − It(ppt,d)|      (6)

then π1 is defined as:

π1(pr, pt,d) = { Π1      if Δ(pr) < Pth and Δ(pt,d) < Pth
               { Π1/2    if Δ(pr) ≥ Pth and Δ(pt,d) < Pth
               { Π1/2    if Δ(pr) < Pth and Δ(pt,d) ≥ Pth
               { Π1/4    if Δ(pr) ≥ Pth and Δ(pt,d) ≥ Pth      (7)
where Π1 is a constant parameter of the algorithm, and π2 is defined in the same manner based on Π2. Finally, Pth is a threshold which determines the presence of an intensity edge. Thanks to this approach, horizontal/vertical edges are taken into account along corresponding scanline directions (i.e. horizontal/vertical) during the SO process, so that edges orthogonal to the scanline direction cannot influence the smoothness penalty terms. Once the 4 global costs CG are obtained, they are summed up together and a Winner-Take-All approach on the final cost sum assigns the disparity:

d_{pr,best} = arg min_{d∈D} { Σ_{j=1}^{4} C_G^j(pr, pt,d) }      (8)
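As an illustration of Eqs. (4)–(8), the sketch below runs one left-to-right scanline pass and the final WTA over the summed directional costs; it assumes the aggregated cost volume has already been computed (e.g. with the function above) and uses fixed penalties instead of the intensity-adaptive π1, π2 of Eq. (7) to keep it short.

```python
import numpy as np

def scanline_pass_lr(cost, pi1, pi2):
    """One left-to-right scanline optimization pass (Eq. 4).

    cost : (H, W, D) aggregated matching costs C_{M,v}.
    Returns the per-pixel global costs C_G for this direction.
    """
    cost = np.asarray(cost, dtype=float)
    H, W, D = cost.shape
    Cg = np.empty_like(cost)
    Cg[:, 0, :] = cost[:, 0, :]
    for x in range(1, W):
        prev = Cg[:, x - 1, :]                     # (H, D) costs at the previous pixel
        c_min = prev.min(axis=1, keepdims=True)
        minus = np.pad(prev[:, :-1], ((0, 0), (1, 0)), constant_values=np.inf) + pi1
        plus = np.pad(prev[:, 1:], ((0, 0), (0, 1)), constant_values=np.inf) + pi1
        best = np.minimum(np.minimum(prev, minus), np.minimum(plus, c_min + pi2))
        Cg[:, x, :] = cost[:, x, :] + best - c_min  # subtract c_min to bound the costs
    return Cg

def wta_disparity(cost, pi1=6.0, pi2=27.0):
    """Sum the 4 directional passes and take the per-pixel minimum (Eq. 8).
    The other three directions are obtained by flipping/transposing the volume."""
    total = scanline_pass_lr(cost, pi1, pi2)
    total += scanline_pass_lr(cost[:, ::-1], pi1, pi2)[:, ::-1]                       # right-to-left
    total += scanline_pass_lr(cost.transpose(1, 0, 2), pi1, pi2).transpose(1, 0, 2)   # top-down
    total += scanline_pass_lr(cost.transpose(1, 0, 2)[:, ::-1],
                              pi1, pi2)[:, ::-1].transpose(1, 0, 2)                   # bottom-up
    return np.argmin(total, axis=2)
```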
4 A First Experimental Evaluation of the Proposed Approach
We now briefly show some results dealing with the use of the approach outlined so far. In particular, in order to demonstrate the benefits of the joint use of the SO-based framework with the variable support-based matching cost CM,v , we compare the results yielded by our method to those attainable by the same SO framework using the pointwise TAD matching cost on RGB triplets, as well as by CM,v in a local WTA approach. The dataset used for experiments is available at the Middlebury website1 . Parameter set is constant for all runs. Truncation parameter for TAD in both 1
vision.middlebury.edu/stereo
Table 1. Error rates (N.O. - DISC) using CM,v within the proposed SO-based framework (first row), a pointwise matching cost (CM,p) within the same SO-based framework (second row), and CM,v in a local WTA approach (last row)

              Tsukuba        Venus          Teddy           Cones
CM,v, SO      1.63 - 6.80    0.97 - 9.03    9.64 - 19.35    4.60 - 11.52
CM,p, SO      3.70 - 13.38   4.19 - 19.27   12.28 - 20.40   5.99 - 13.96
CM,v, local   2.05 - 7.14    1.47 - 10.5    10.8 - 21.7     5.08 - 12.5
approaches is set to 80. As for the variable support, segmentation is obtained by running the Mean Shift algorithm [23] with a constant set of parameters (spatial radius σS = 3, range radius σR = 3, minimum region size minR = 35), while the maximum radius of the support is set to 51 and the parameter γc is set to 22. Finally, as for the SO framework, our approach is run with Π1 = 6, Π2 = 27, Pth = 10, while the pointwise cost-based approach is run with Π1 = 106, Π2 = 312, Pth = 10 (optimal parameters for both approaches). Table 1 shows the error rates computed on the whole image area except for occlusions (N.O.) and in proximity of discontinuities (DISC). Occlusions are not evaluated here since at this stage no specific occlusion handling approach is adopted by any of the algorithms. As can be inferred, the use of a variable support in the matching cost yields significantly higher accuracy in all cases compared to the pointwise cost-based approach, with the highest benefits on the Tsukuba and Venus datasets. Moreover, the benefits are also significant when considering only depth discontinuities, which demonstrates the higher accuracy in correctly retrieving depth borders provided by the use of a variable support within the SO-based framework. Finally, the benefits of the proposed SO-based framework are always notable if we compare the results of our approach with those yielded by using the same cost function within a local WTA strategy.
5 Symmetrical Detection of Occluded Areas and Depth Borders
By assuming as reference Ir respectively the left and the right image of the stereo pair, it is possible to obtain two different disparity maps, referred to as DLR and DRL. Our idea is to derive a general method for detecting depth borders and occluded regions by enforcing on both maps the symmetrical relationship between occlusions and depth borders that results from the stereo setup and the scene structure. In particular, due to the stereo setup, if we scan any epipolar line of DLR from the left side to the right side, each sudden depth decrement corresponds to an occlusion in DRL. Similarly, scanning any epipolar line of DRL from the right side to the left side, each sudden depth increment corresponds to an occlusion in DLR. Moreover, the occlusion width is directly proportional to the amount of
Fig. 1. Points violating (9) on DLR and DRL (colored points, left and center) are discriminated between occlusions (yellow) and false matches (green) on Tsukuba and Cones datasets. Consequently depth borders are detected (red points, right) [This Figure is best viewed with colors].
each depth decrement and increment along the corresponding epipolar line, and the two points composing a depth border on one disparity map respectively correspond to the starting point and ending point of the occluded area in the other map. Hence, in order to detect occlusions and depth borders, we deploy a symmetrical cross-checking strategy, which detects the disparities in DLR that violate a weak disparity consistency constraint by tagging as invalid all points pd ∈ DLR which do not satisfy:

|DLR(pd) − DRL(pd − DLR(pd))| ≤ 1      (9)

and analogously detects invalid disparities on DRL. Points referring to disparity differences equal to 1 are not tagged as invalid at this stage, as we assume that occlusions are not present where disparity varies smoothly along the epipolar lines, as well as to handle slight discrepancies due to the different viewpoints. The results of this symmetrical cross-checking are shown, referred to Tsukuba and Cones, in the left and center images of Fig. 1, where colored points in both maps represent the disparities violating (9). It is easy to infer that only a subset of the colored regions of the maps is represented by occlusions, while all other violating disparities denote mismatches due to outliers. Hence, after cross-checking the two disparity maps DLR and DRL, it is possible to discriminate on both maps occluded areas from incorrect correspondences (respectively yellow and green points in the left and center images of Fig. 1) by applying the constraints described previously. Then, putting in correspondence occlusions on one map with homologous depth discontinuities in the
other map, it is possible to reliably localize depth borders generated by occlusions on both disparity maps (details of this method are not provided here due to the lack of space). Right images on Fig. 1 show the superimposition of the detected borders referred to DLR (in red color) on the corresponding grayscale stereo image. As it can be seen, borders along epipolar lines are detected with notable precision and very few outliers (detected borders which do not correspond to real borders) are present.
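A minimal sketch of the weak symmetric cross-check of Eq. (9), assuming integer disparity maps where DLR stores left-to-right disparities (the matching point lies at x − d in the right image); the subsequent occlusion/mismatch discrimination and border localization described above are not reproduced here.

```python
import numpy as np

def weak_cross_check(D_lr, D_rl, tol=1):
    """Return a boolean mask of points in D_lr violating Eq. (9).

    D_lr : disparity map with the left image as reference.
    D_rl : disparity map with the right image as reference.
    A left pixel (y, x) with disparity d is assumed to match right pixel (y, x - d).
    """
    H, W = D_lr.shape
    xs = np.arange(W)[None, :].repeat(H, axis=0)
    x_right = np.clip(xs - D_lr.astype(int), 0, W - 1)   # p_d - D_LR(p_d)
    diff = np.abs(D_lr - D_rl[np.arange(H)[:, None], x_right])
    return diff > tol                                     # True where (9) is violated
```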
Fig. 2. The reliability of assigning disparities to points violating the strong crosschecking (10) along depth borders (green points, left) is increased by exploiting information on depth borders location (red points, center) compared to a situation where this information is not available (right)
6 Refinement by Means of Detected Depth Borders and Segmentation
Depth border detection is employed in order to determine the correct disparity values to be assigned to points violating cross-checking. In particular, a two-step refinement process is now proposed, which exploits successively segmentation and depth border information in order to fill-in, respectively, low textured areas and regions along depth discontinuities. First of all, the following strong cross-checking consistency constraint is applied on all points of DLR :
DLR(pd) = { DLR(pd)    if DLR(pd) = DRL(pd − DLR(pd))
          { invalid    otherwise      (10)

The first step of the proposed refinement approach employs segmentation information in order to fill in regions of DLR denoted as invalid after application of (10). In particular, for each segment extracted by the application of the Mean Shift algorithm, a disparity histogram is filled with all valid disparities included within the segment area. Then, if a unique disparity value can be reliably associated with that segment, i.e. if there is a minimum number of valid disparities in the histogram and its variance is low, the mean disparity value of the histogram is assigned to all invalid points falling within the segment area. This makes it possible to
correctly fill in uniform areas, which can easily be characterized by mismatches during the correspondence search. As this first step is designed to fill in only invalid points within uniform areas, a second step then fills in the remaining points by exploiting the previously extracted information on border locations, especially along depth border regions, which usually are not characterized by uniform areas. In particular, the disparity value assigned to all invalid points near depth discontinuities is chosen as the minimum value among neighbours which do not lie beyond a depth border. This increases the reliability of the assigned values compared to the case of no information on border locations, where e.g. the minimum value among neighbouring disparities is selected, as shown in Fig. 2.
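The sketch below illustrates the segment-based fill-in step; the validity test (minimum count and low variance thresholds) and the per-segment statistic are assumptions for illustration, and the second, border-aware step is only hinted at in the final comment.

```python
import numpy as np

def fill_invalid_by_segment(disp, valid, segments, min_count=20, max_var=1.0):
    """Assign a disparity to invalid pixels of segments with a reliable histogram.

    disp     : disparity map (float); entries where `valid` is False are undefined.
    valid    : boolean map of pixels passing the strong cross-check (Eq. 10).
    segments : integer label map from Mean Shift segmentation.
    """
    out = disp.copy()
    for label in np.unique(segments):
        mask = segments == label
        vals = disp[mask & valid]
        # A segment is filled only if enough valid samples agree on one disparity.
        if vals.size >= min_count and vals.var() <= max_var:
            out[mask & ~valid] = vals.mean()
    # Remaining invalid points (typically near depth borders) would then be
    # filled from neighbours lying on the same side of the detected depth border.
    return out
```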
7 Experimental Results
This section shows an experimental evaluation obtained by submitting the results yielded by the proposed algorithm to the Middlebury site. The parameter set of the algorithm is constant for all runs and is the same as for the experiments in Sec. 4. As can be seen from Table 2, our algorithm (SO+border), which ranked 4th (as of May 2007), produces overall better results than [16], which employs a higher number of scanlines during the SO process, and also better than the other SO and DP-based approaches and most global methods, as higher accuracy is only yielded by three BP-based global algorithms. The obtained disparity maps, together with the corresponding reference images and ground truth, are shown in Fig. 3 and are available at the Middlebury website. The running time on the examined dataset is of the order of those of other methods based on a variable support [21], [2] (i.e. some minutes), since the majority of the time is required by the local cost computation, while the SO stage and the border refinement stage account for only a few seconds and are negligible compared to the overall time.

Table 2. Disparity error rates (N.O.-ALL-DISC) and rankings obtained on the Middlebury website

Rank  Method              Tsukuba           Venus             Teddy             Cones
1     AdaptingBP [12]     1.11-1.37-5.79    0.10-0.21-1.44    4.22-7.06-11.8    2.48-7.92-7.32
2     DoubleBP [10]       0.88-1.29-4.76    0.14-0.60-2.00    3.55-8.71-9.70    2.90-9.24-7.80
3     SymBP+occ           0.97-1.75-5.09    0.16-0.33-2.19    6.47-10.7-17.0    4.79-10.7-10.9
4     SO+border           1.29-1.71-6.83    0.25-0.53-2.26    7.02-12.2-16.3    3.90-9.85-10.2
5     Segm+visib [13]     1.30-1.57-6.92    0.79-1.06-6.76    5.00-6.54-12.3    3.72-8.62-10.2
6     C-SemiGlob [16]     2.61-3.29-9.89    0.25-0.57-3.24    5.14-11.8-13.0    2.77-8.35-8.20
10    RegionTreeDP [19]   1.39-1.64-6.85    0.22-0.57-1.93    7.42-11.9-16.8    6.31-11.9-11.8
Fig. 3. Disparity maps obtained after the application of all steps of the proposed approach
8 Conclusions
A novel algorithm for solving the stereo correspondence problem has been described. The algorithm employs an effective variable-support based approach in the aggregation stage together with an SO-based framework in the disparity optimization stage. This joint strategy improves the accuracy of both SO-based and local variable-support based methods. Further improvements are obtained by endowing the disparity refinement stage with border information and segmentation, which allows our proposal to outperform all DP and SO-based approaches as well as most global approaches on the Middlebury dataset.
References 1. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. Jour. Computer Vision 47(1/2/3), 7–42 (2002) 2. Yoon, K., Kweon, I.: Adaptive support-weight approach for correspondence search. IEEE Trans. PAMI 28(4), 650–656 (2006) 3. Boykov, Y., Veksler, O., Zabih, R.: A variable window approach to early vision. IEEE Trans. PAMI 20(12), 1283–1294 (1998) 4. Gong, M., Yang, R.: Image-gradient-guided real-time stereo on graphics hardware. In: Proc. 3D Dig. Imaging and Modeling (3DIM), Ottawa, Canada, pp. 548–555 (2005) 5. Hirschmuller, H., Innocent, P., Garibaldi, J.: Real-time correlation-based stereo vision with reduced border errors. Int. Jour. Computer Vision 47(1-3) (2002) 6. Xu, Y., Wang, D., Feng, T., Shum, H.: Stereo computation using radial adaptive windows. In: ICPR 2002. Proc. Int. Conf. on Pattern Recognition, vol. 3, pp. 595– 598 (2002) 7. Gerrits, M., Bekaert, P.: Local stereo matching with segmentation-based outlier rejection. In: CRV 2006. Proc. Canadian Conf. on Computer and Robot Vision, pp. 66–66 (2006) 8. Wang, L., Gong, M., Gong, M., Yang, R.: How far can we go with local optimization in real-time stereo matching. In: 3DPVT 2006. Proc. Third Int. Symp. on 3D Data Processing, Visualization, and Transmission, pp. 129–136 (2006) 9. Kolmogorov, V., Zabih, R.: Computing visual correspondence with occlusions via graph cuts. In: ICCV 2001. Proc. Int. Conf. Computer Vision, vol. 2, pp. 508–515 (2001) 10. Yang, Q.e.a.: Stereo matching with color-weighted correlation, hierachical belief propagation and occlusion handling. In: CVPR 2006. Proc. Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 2347–2354 (2006) 11. Sun, J., Shum, H., Zheng, N.: Stereo matching using belief propagation. IEEE Trans. PAMI 25(7), 787–800 (2003) 12. Klaus, A., Sormann, M., Karner, K.: Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure. In: ICPR 2006. Proc. Int. Conf. on Pattern Recognition, vol. 3, pp. 15–18 (2006) 13. Bleyer, M., Gelautz, M.: A layered stereo matching algorithm using image segmentation and global visibility constraints. Jour. Photogrammetry and Remote Sensing 59, 128–150 (2005) 14. Tao, H., Sawheny, H., Kumar, R.: A global matching framework for stereo computation. In: ICCV 2001. Proc. Int. Conf. Computer Vision, vol. 1, pp. 532–539 (2001) 15. Hirschmuller, H.: Accurate and efficient stereo processing by semi-global matching and mutual information. In: CVPR 2005. Proc. Conf. on Computer Vision and Pattern recognition, vol. 2, pp. 807–814 (2005) 16. Hirschmuller, H.: Stereo vision in structured environments by consistent semiglobal matching. In: CVPR 2006. Proc. Conf. on Computer Vision and Pattern recognition, vol. 2, pp. 2386–2393 (2006) 17. Gong, M., Yang, Y.: Near real-time reliable stereo matching using programmable graphics hardware. In: CVPR 2005. Proc. Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 924–931 (2005) 18. Kim, J., Lee, K., Choi, B., Lee, S.: A dense stereo matching using two-pass dynamic programming with generalized ground control points. In: CVPR 2005. Proc. Conf. on Computer Vision and Pattern Recognition, pp. 1075–1082 (2005)
19. Lei, C., Selzer, J., Yang, Y.: Region-tree based stereo using dynamic programming optimization. In: CVPR 2006. Proc. Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 2378–2385 (2006) 20. Deng, Y., Lin, X.: A fast line segment based dense stereo algorithm using tree dynamic programming. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 201–212. Springer, Heidelberg (2006) 21. Tombari, F., Mattoccia, S., Di Stefano, L.: Segmentation-based adaptive support for accurate stereo correspondence. Technical Report 2007-08-01, Computer Vision Lab, DEIS, University of Bologna, Italy (2007) 22. Yoon, K., Kweon, I.: Stereo matching with symmetric cost functions. In: CVPR 2006. Proc. Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 2371– 2377 (2006) 23. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. IEEE Trans. PAMI 24, 603–619 (2002)
Three Dimensional Position Measurement for Maxillofacial Surgery by Stereo X-Ray Images
Naoya Ohta 1, Kenji Mogi 2, and Yoshiki Nakasone 2
1 Gunma University Graduate School of Engineering, Tenjin-cho 1-5-1, Kiryu, 376-8515 Japan
2 Gunma University Graduate School of Medicine, Showa-machi 3-39-22, Maebashi, 371-8511 Japan
Abstract. This paper describes a method whereby a three dimensional position inside a human body can be measured using a simple X-ray stereo image pair. Because the geometry of X-ray imaging is similar to that of ordinary photography, a standard stereo vision technique can be used. However, one problem is that the X-ray source position is unknown and should be computed from the X-ray image. In addition, a reference coordinate on which the measurement is based needs to be determined. The proposed method solves these two problems using a cubic wire frame called the reference object. Although three dimensional positioning for a human body is possible by Computer Tomography (CT), it requires expensive equipment. In contrast, the proposed method only requires ordinary X-ray photography equipment, which is inexpensive and widely available even in developing countries.
1 Introduction
For some medical treatments, 3D measurements of internal parts of the human body are highly desirable. An example arises in maxillofacial surgery when there is a need to perform an injection into the mental or infraorbital foramina. The foramina are holes in the skull through which nerves run (fig.1). A doctor sometimes needs to inject liquid into them. However, because they cannot be seen from outside the skin, the doctor must inject based on an inaccurate guess of the position, which can result in unnecessary pain to the patient. If the doctor knew the positions before the injection, this problem could be solved. As a means for measuring the inside of a human body, Computer Tomography (CT) is a well-known technique [1]. It could be used for this purpose, but its cost and complexity are hard to justify merely for positioning an injection. In contrast, radiography is an easy and inexpensive method for investigating the inside of the body, but it provides only a 2D projection of the body, so the 3D position cannot be recovered from a single image. In this report, we present a method whereby 3D positions inside a human body are computed from two X-ray images, using a stereo vision technique.
Stereo vision is a well-known technique for computing 3D positions from two images taken from different viewpoints [2]. Since the imaging geometry of radiography is almost the same as that of ordinary photography, this technique is applicable to X-ray images. On the basis of this idea, Radeva et al. [3] reconstructed the shape of blood vessels. Kita et al. [4,5] applied a stereo vision technique to mammograms to detect the 3D location of lesions. Wang et al. [6] developed a stereoscopic imaging system for surgery. Evans et al. [7,8] also developed a stereoscopic inspection machine, although for security screening rather than a medical application. When a stereo vision technique is applied to our case, however, two problems arise. One is that the viewpoints, namely the positions of the X-ray source, are unknown. In the previous reports mentioned above, the X-ray sources were set on a rigid frame and the imaging geometry was calibrated beforehand. However, we assume the use of an ordinary X-ray source whose position is set by the doctor at imaging time. The second problem is that we need reference points outside of the skin, based on which all 3D positions are measured. In order to solve these problems, we developed a reference object, which is a box-shaped frame made of metal wire (fig.2). It is used to compute the X-ray source positions from its image on the films, and at the same time it provides a reference coordinate system for the measured 3D points.
Fig. 1. Infraorbital foramina (upper) and mental foramina (lower)
For measurement, the reference object is placed on the facial skin. Then, two radiographs are taken with different positions of the X-ray source. By identifying the positions of the target (a mental or infraorbital foramen) and the reference object on the films, the 3D position of the target, relative to the reference object, is computed. In the following, we first explain how to determine the X-ray source positions from the images, and then the computation of the target position. Finally, we show experimental results and conclude the report.
Fig. 2. The reference object
2 Computation of X-Ray Source Position
As shown in fig.2, the reference object is a rectangular parallelepiped with two square frames. The length of a side of the square frames is α (= 2 cm), and the distance between them is β (= 1 cm). The reference object is set so that one frame touches the skin, and a film is placed on the other frame. Under this setting, radiographs are taken. After exposure, the film is removed from the reference object and developed, while the reference object remains on the skin, which provides the coordinate system for the target position after measurement. We call the skin-side frame of the reference object the skin frame, and the film-side frame the film frame. Let the four corners of the skin frame be a, b, c and d, and their images on the film be a', b', c' and d' (fig.3). The three dimensional coordinate system is introduced so that the origin coincides with one corner of the film frame, and the X and Y axes are parallel to the two sides of the film frame meeting at the origin. The Z axis is set perpendicularly from the origin and passes through the corner a of the skin frame. When specifying a point on the film, we use the XY coordinates embedded in this 3D coordinate system. Because the skin frame and the film are parallel, the image of the skin frame is a square of a different size (fig.4). The ratio s of their side lengths enables computation of the distance between the X-ray source and the film. On the other hand, because the image of a is a', the X-ray source lies on the line a'a. From these observations, the position P_X of the X-ray source is given by the following equation:

P_X = \frac{s}{s-1}(P_a - P_{a'}) + P_{a'} = \frac{s P_a - P_{a'}}{s-1},    (1)

where P_a and P_{a'} are the positions of a and a', respectively. If the 2D coordinates of a' on the film are (X_{a'}, Y_{a'}), they are given by

P_a = (0, 0, \beta)^T,  P_{a'} = (X_{a'}, Y_{a'}, 0)^T.    (2)
Fig. 3. Geometry between the reference object, a film, and an X-ray source
Theoretically, equation (1) holds for any X-ray source position except infinity. However, for stability in the real computation, we introduce the homogeneous coordinate expression \tilde{P}_X for P_X as follows:

\tilde{P}_X = \begin{pmatrix} s P_a - P_{a'} \\ s - 1 \end{pmatrix} = \begin{pmatrix} -X_{a'} \\ -Y_{a'} \\ s\beta \\ s - 1 \end{pmatrix}.    (3)
If we measure the position (X_{a'}, Y_{a'}) of a' and the ratio s (= a'b'/ab, for instance) on the film, the X-ray source position can be computed by eq.(3). However, we use the fact that the image of the skin frame is a square in order to increase measurement accuracy. Let X_1 and X_2 be the signed distances between the Y axis and the lines a'd' and b'c', respectively, as shown in fig.4. In the same way, let Y_1 and Y_2 be the signed distances between the X axis and the lines a'b' and d'c', respectively. The condition that a'b'c'd' is a square is represented by
Fig. 4. The image of the reference object on a film
X_2 - X_1 = Y_2 - Y_1.    (4)
We minimize the following quantity D^2 under the condition of eq.(4):

D^2 = (X_1 - X_1^*)^2 + (Y_1 - Y_1^*)^2 + (X_2 - X_2^*)^2 + (Y_2 - Y_2^*)^2,    (5)
where the asterisks indicate values measured from the film. Using Lagrange's multiplier method, the solution is given as follows:

X_1 = X_1^* - \delta,  X_2 = X_2^* + \delta,  Y_1 = Y_1^* + \delta,  Y_2 = Y_2^* - \delta,    (6)
where

\delta = \frac{1}{4}\left((Y_2^* - Y_1^*) - (X_2^* - X_1^*)\right).    (7)

The ratio s and the position (X_{a'}, Y_{a'}) of a' are computed from the values given by eq.(6) as follows:

s = \frac{X_2 - X_1}{\alpha},  X_{a'} = X_1,  Y_{a'} = Y_1.    (8)
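To make the procedure concrete, the following Python sketch (our illustration, not the authors' code; the function name and the example measurements are made up) carries out the correction of eqs. (6)-(7), the ratio and corner estimates of eq. (8), and the homogeneous source position of eq. (3):

```python
import numpy as np

def xray_source_position(X1m, Y1m, X2m, Y2m, alpha, beta):
    """Estimate the X-ray source position from film measurements.

    X1m, Y1m, X2m, Y2m: measured distances X1*, Y1*, X2*, Y2* of eq. (5),
    alpha: side length of the square frames, beta: distance between the frames.
    """
    # Lagrange correction enforcing the square condition, eqs. (6)-(7)
    delta = 0.25 * ((Y2m - Y1m) - (X2m - X1m))
    X1, X2 = X1m - delta, X2m + delta
    Y1, Y2 = Y1m + delta, Y2m - delta
    # Magnification ratio of the skin-frame image and position of a', eq. (8)
    s = (X2 - X1) / alpha
    Xa, Ya = X1, Y1
    # Homogeneous source position, eq. (3)
    P_X = np.array([-Xa, -Ya, s * beta, s - 1.0])
    return P_X, s, (Xa, Ya)

# Example with made-up film measurements in mm (alpha = 20 mm, beta = 10 mm)
P_X, s, a_img = xray_source_position(5.20, 7.10, 25.55, 27.50, 20.0, 10.0)
print(P_X[:3] / P_X[3])   # inhomogeneous source coordinates
```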
3 Computation of Target Position
Since the target lies on the line connecting its image on the film and the X-ray source, the target position is determined if two such lines are given. However, because of noise the two lines do not necessarily intersect in 3D space. Therefore, we compute the pair of closest points on the two lines and regard their midpoint as the target position.
Fig. 5. Relation between X-ray source, target, and its image
Let P_T and P_{T'} be the 3D positions of the target and its image on the film, respectively (fig.5). If the 2D coordinates of the target image are (X_T, Y_T), P_{T'} is represented by

P_{T'} = (X_T, Y_T, 0)^T.    (9)
Because P_T is a point on the line between P_X and P_{T'}, it can be expressed with a parameter t as follows:

P_T = t(P_X - P_{T'}) + P_{T'} = t\left(\frac{s P_a - P_{a'}}{s-1} - P_{T'}\right) + P_{T'}.    (10)

In the equation above, we used the relation given by eq.(1). To avoid computational instability when s is close to 1, we modify eq.(10) in the same way as we did for eq.(1):

P_T = q u + P_{T'},    (11)

where

u = s P_a - P_{a'} - (s-1) P_{T'}.    (12)
The parameter q in eq.(11) is related to t as q = t/(s-1) when s ≠ 1. From two X-ray images taken with different X-ray source positions, we obtain two relations of the form of eq.(11). Distinguishing the quantities for the two images by a number in parentheses, these relations are expressed as follows:

P_T^{(1)} = q^{(1)} u^{(1)} + P_{T'}^{(1)},    (13)
P_T^{(2)} = q^{(2)} u^{(2)} + P_{T'}^{(2)}.    (14)

The squared distance D_T^2 between P_T^{(1)} and P_T^{(2)} is given by

D_T^2 = \| P_T^{(1)} - P_T^{(2)} \|^2 = (P_T^{(1)} - P_T^{(2)})^T (P_T^{(1)} - P_T^{(2)}),    (15)

where ^T expresses the transpose of a vector. Differentiating D_T^2 with respect to q^{(1)} and q^{(2)} in turn, and setting both results to zero, we have

\begin{pmatrix} q^{(1)} \\ q^{(2)} \end{pmatrix} = \begin{pmatrix} u^{(1)T} u^{(1)} & -u^{(1)T} u^{(2)} \\ -u^{(2)T} u^{(1)} & u^{(2)T} u^{(2)} \end{pmatrix}^{-1} \begin{pmatrix} u^{(1)T} (P_{T'}^{(2)} - P_{T'}^{(1)}) \\ u^{(2)T} (P_{T'}^{(1)} - P_{T'}^{(2)}) \end{pmatrix}.    (16)

Substituting q^{(1)} and q^{(2)} from eq.(16) into eqs.(13) and (14), P_T^{(1)} and P_T^{(2)} are computed. The target position P_T is given by the midpoint between P_T^{(1)} and P_T^{(2)}:

P_T = \frac{P_T^{(1)} + P_T^{(2)}}{2}.    (17)
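As a concrete illustration of eqs. (12)-(17), the following Python sketch (ours; the data structure and variable names are assumptions, not the paper's notation) triangulates the target from two exposures on the same film:

```python
import numpy as np

def triangulate_target(beta, exposures):
    """Compute the target position P_T from two exposures, following eqs. (12)-(17).

    beta: distance between the skin frame and the film frame.
    exposures: two dicts with 's' (frame ratio), 'a_img' (X_a', Y_a') and
               't_img' (X_T, Y_T), all measured on the film.
    """
    P_a = np.array([0.0, 0.0, beta])            # corner a of the skin frame, eq. (2)
    u, t_img = [], []
    for e in exposures:
        P_a_img = np.array([e['a_img'][0], e['a_img'][1], 0.0])
        P_t_img = np.array([e['t_img'][0], e['t_img'][1], 0.0])
        s = e['s']
        u.append(s * P_a - P_a_img - (s - 1.0) * P_t_img)   # eq. (12)
        t_img.append(P_t_img)
    u1, u2 = u
    # Normal equations of eq. (16) for the closest points on the two rays
    A = np.array([[u1 @ u1, -(u1 @ u2)],
                  [-(u2 @ u1), u2 @ u2]])
    b = np.array([u1 @ (t_img[1] - t_img[0]),
                  u2 @ (t_img[0] - t_img[1])])
    q1, q2 = np.linalg.solve(A, b)
    P1 = q1 * u1 + t_img[0]                     # eq. (13)
    P2 = q2 * u2 + t_img[1]                     # eq. (14)
    return 0.5 * (P1 + P2)                      # eq. (17)

# Hypothetical measurements (mm) for the two exposures
P_T = triangulate_target(10.0, [
    {'s': 1.02, 'a_img': (5.2, 7.1), 't_img': (12.5, 9.3)},
    {'s': 1.03, 'a_img': (-4.8, 6.9), 't_img': (4.1, 9.0)},
])
print(P_T)
```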
4 Experiment
We conducted an experiment to evaluate the accuracy of measurement by the proposed method. We used the model shown in fig.6(a) to simulate the measurement conditions. The model consists of two layers made of plywood and soft plastic. Small circular metal plates of 2 mm diameter are placed between these two layers and are used as the targets to be measured. The surface of the model is divided into ten sections, under each of which one metal plate is placed, so that ten different measurements are possible.
Fig. 6. Model for experiment (a) and its image on the film (b)
An X-ray film is placed on the reference object, and two exposures are made from the back of the model with different X-ray source positions. The X-ray source positions were set by hand so that the angle between the two exposure directions is around 35 degrees. An example of the developed films is shown in fig.6(b). The accuracy evaluation was carried out as follows. The 3D position of the metal plate was computed by the proposed method for each section of the model, where the image positions of the metal plate and the frames on the radiograph were measured manually with a vernier micrometer. Guided by the computed X and Y values, we inserted a needle from the surface of the model and marked the plywood layer with the needle. After separating the two layers, the distance between the mark and the center of the metal plate was measured as the measurement error. For the Z values, the difference between the computed value and 10 mm was used as the error, because the thickness of the plastic layer was 10 mm. The result is shown in table 1; the columns labeled ΔX, ΔY, and ΔZ are the errors in the X, Y, and Z coordinates, respectively. Most of the errors are under 1 mm. The RMSE (root mean square error) values are 0.55 mm, 0.80 mm, and 0.78 mm for the X, Y, and Z values, respectively. Errors at this level are acceptable for an injection into a mental or infraorbital foramen.

We used double exposure on one film instead of two separate films to obtain the two images. Since the model used for the experiment has a simple structure, the positions of the target were easily identified. However, a skull has a complex bone structure, and it might be hard to identify the target positions on the film. Therefore, we conducted an experiment using a real skull. Figure 7(a) shows the imaging setup, and (b) the image on the film. The target (mental foramen) positions were easily identified, which suggests that double exposure poses no problem for actual use. Incidentally, the computed coordinates of the target were (12.0, 4.8, 13.9), which agreed with values measured manually with a scale.
Table 1. Errors in computed positions

Section   ΔX (mm)   ΔY (mm)   ΔZ (mm)
   1       −0.95     −0.75      1.30
   2        0.30      0.45      0.62
   3        0.45     −0.35      0.44
   4        0.10      0.05      0.30
   5        0.50      0.05     −0.18
   6       −0.20     −0.55      0.20
   7       −0.60     −0.50     −0.22
   8        0.90     −0.75      1.55
   9        0.25     −1.30      1.08
  10        0.50     −1.65      0.28
Fig. 7. Experimental setting using a real skull (a), and developed film (b)
5 Conclusion
We presented a useful, inexpensive method for measuring the 3D position of a mental or infraorbital foramen, or more generally of a specific part inside the body, by applying a stereo vision technique to radiographs. According to the experiment, the measurement accuracy is of the order of 1 mm. Since the proposed method only requires ordinary X-ray equipment, the approach could find widespread use, including in developing countries. Software implementing the method can be downloaded from the webpage at http://www.ail.cs.gunma-u.ac.jp/Soft/Xray-stereo.html.
Acknowledgments This work was in part supported by the Ministry of Education, Science, Sports and Culture, Japan under a Grant in Aid for Scientific Research C(2) (No. 16500099).
References 1. Duncan, J.S., Ayache, N.: Medical image analysis: progress over two decades and the challenges ahead. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(1) (2000) 2. Hartley, R., Zisserman, A.: Multiple view geometry in computer vision. Cambridge University Press, Cambridge (2000) 3. Radeva, P., Toledo, R., Von Land, C., Villanueva, J.: 3D vessel reconstruction from biplane angiograms using snakes. Computers in Cardiology 25, 773–776 (1998) 4. Kita, Y., Highnam, R., Brady, M.: Correspondence between different view breast X rays using curved epipolar lines. Computer Vision and Image Understanding 83(1), 38–55 (2001) 5. Kita, Y., Tohno, E., Highnam, R., Brady, M.: A CAD system for the 3D location of lesions in mammograms. Medical Image Analysis 6(3), 267–273 (2002) 6. Wang, S., Chen, J.X., Dong, Z., Ledley, R.S.: SMIS – A real-time stereoscopic medical imaging system. In: Proceedings of 17th IEEE Symposium on Computer-Based Medical Systems, pp. 197–202. IEEE Computer Society Press, Los Alamitos (2004) 7. Evans, J.P.O, Robinson, M., Godber, S.X.: A new stereoscopic imaging technique using a single X-ray source: theoretical analysis. Journal of Non-Destructive Testing and Evaluation International (2), 27–35 (1996) 8. Evans, J.P.O., Robinson, M., Godber, S.X.: Pseudo-tomographic X-ray imaging for use in aviation security. IEEE Aerospace and Electronics Systems Magazine 13(7), 25–30 (1998)
Total Absolute Gaussian Curvature for Stereo Prior

Hiroshi Ishikawa
Department of Information and Biological Sciences, Nagoya City University, Nagoya 467-8501, Japan
[email protected]
Abstract. In spite of the great progress in stereo matching algorithms, the prior models they use, i.e., the assumptions about the probability to see each possible surface, have not changed much in three decades. Here, we introduce a novel prior model motivated by psychophysical experiments. It is based on minimizing the total sum of the absolute value of the Gaussian curvature over the disparity surface. Intuitively, it is similar to rolling and bending a flexible sheet of paper to fit the stereo surface, whereas the conventional prior is more akin to spanning a soap film. Through controlled experiments, we show that the new prior outperforms the conventional models when compared in an equal setting.
1 Introduction

In dense stereo matching, the ambiguity arising from such factors as noise, periodicity, and large regions of constant intensity makes it impossible in general to identify all locations in the two images with certainty. Thus, any stereo algorithm must have a way to resolve ambiguities and interpolate missing data. In the Bayesian formalism of stereo vision, this is given by the prior model. The prior model of a stereo matching algorithm is the algorithm's expectation on the surfaces in the world, where it makes assumptions about the probability to see each surface that can be represented in its representation system. It is often just called the "smoothing term," because its most popular form gives a higher probability to a smooth surface. The prior model is an ingredient of stereo matching that can be conceptually separated from other aspects of the process: whether a stereo system uses dynamic programming or graph cut or belief propagation, it explicitly or implicitly assumes a prior; and it is also usually independent of the image formation model, which affects the selection of features in the images to match and the cost function to compare them. In some algorithms, it is less obvious than in others to discern the prior models they utilize, especially when the smoothing assumption is implicit as in most local, window-based algorithms. In some cases, it is intricately entwined with the image formation model, as in the case where a discontinuity in disparity is encouraged at locations where there are intensity edges. As far as we could determine, however, the prior models that are used in stereo matching algorithms have not changed much in three decades. Computational models [3,7,9,14,17,23] have generally used as the criterion some form of smoothness in terms of dense information such as the depth and its derivatives, sometimes allowing for discontinuities [26] and sometimes smoothing the higher-order derivative and/or
Fig. 1. Stereogram from [12,13]. (a) shows the stereogram. (b) The human brain tends to perceive one of two similar shapes as shown; however, no previously-proposed algorithm has this behavior. (c) Algorithms that seek to minimize gradient give a "soap film" surface like this.
allowing slanted surfaces [2,4,19]. Among them, the most common is the minimization of the square difference of disparities between neighboring pixels, which encourages fronto-parallel surfaces. Perhaps the fact that most of the citations above are at least ten years old is indicative of the neglect the problem of prior model selection has suffered. The recent algorithms using graph cut [5,11,16,20,21,27] and belief propagation [8,18,24,25] did not touch on the prior models, concentrating on the optimization. The survey by Scharstein and Szeliski [22] does not classify the algorithms in their taxonomy by prior models, rightly, because there is not much difference in this respect among them. Of course, by itself this might mean that the selection was correct the very first time. However, it appears that there is a glaring exception to the widespread use of the smoothing/fronto-parallel criterion as the prior model: the human vision system. In [12,13], we reported new observations made in examining human disparity interpolation using highly ambiguous stereograms such as the one shown in Fig. 1(a). We noted that the surfaces perceived by most humans, such as the one shown in Fig. 1(b), are very different from the "soap film" surfaces that previously-proposed algorithms predict, such as the one shown in Fig. 1(c). We also suggested the minimization of the Gaussian curvature as a possible prior model compatible with the psychophysical experiments. In this paper, we examine how the prior model suggested by those findings fares in machine vision experiments. The prior is based on minimizing the total sum of the absolute value of the Gaussian curvature over the disparity surface. We show that it outperforms the conventional models through controlled experiments. In the next section, we briefly review the psychophysical motivation from [12,13] and discuss the theoretical implication of using the Gaussian curvature as a prior model for stereo vision. In
Section 3, we present the results of our experiments comparing the performance of the new prior to those of conventional models and discuss the result and its implications. In Section 4 we conclude the paper.

Fig. 2. Two planes that go through the normal vector to the surface at a point
2 Gaussian Curvature and Stereo Prior

2.1 Gaussian Curvature

The Gaussian curvature is defined as follows. For a point on a surface, consider a plane that goes through the normal vector to the surface at the point. The intersection of the surface and the plane locally defines a curve on the surface (Fig. 2). Consider all such curves by rotating the plane around the normal vector. They are called the normal sections. The maximum and the minimum of the signed curvature of the normal sections are called the principal curvatures at the point. The Gaussian curvature K of the surface at the point is the product of the two principal curvatures, whereas the mean curvature H is their average. Thus, when K > 0, the maximum and the minimum have the same sign; it means all the normal sections are curving in the same direction with respect to the tangent plane. When K < 0, the maximum and the minimum of the curvature have opposite signs; therefore, some normal sections are curving in the opposite direction from others. When K = 0, either the maximum or the minimum of the curvature among the normal sections is zero; all the normal sections are on the same side of the tangent plane, with one lying on it. Surfaces with K = 0 everywhere are said to be developable, meaning they can be unrolled into a flat sheet of paper without stretching. The surface represented by (x, y, d(x, y)) in the 3D space has the curvatures:

K = \frac{d_{xx} d_{yy} - d_{xy}^2}{(1 + d_x^2 + d_y^2)^2},    (1)

H = \frac{(1 + d_y^2) d_{xx} - 2 d_x d_y d_{xy} + (1 + d_x^2) d_{yy}}{2 (1 + d_x^2 + d_y^2)^{3/2}}.    (2)
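A small numerical check of these formulas may help; the sketch below (ours, not from the paper) evaluates K and H by finite differences and confirms that a cylindrical patch, which is developable, has K close to zero even though H is not:

```python
import numpy as np

def curvatures(d, h=1.0):
    """Gaussian (K) and mean (H) curvature of the graph surface (x, y, d(x, y)),
    following eqs. (1)-(2), with derivatives taken by central finite differences."""
    dy, dx = np.gradient(d, h)        # rows are y, columns are x
    dyy, dyx = np.gradient(dy, h)
    dxy, dxx = np.gradient(dx, h)
    w = 1.0 + dx**2 + dy**2
    K = (dxx * dyy - dxy**2) / w**2
    H = ((1 + dy**2) * dxx - 2 * dx * dy * dxy + (1 + dx**2) * dyy) / (2 * w**1.5)
    return K, H

# A cylindrical patch is developable: K vanishes although H does not.
x, y = np.meshgrid(np.linspace(-0.5, 0.5, 101), np.linspace(-0.5, 0.5, 101))
d = np.sqrt(4.0 - x**2)               # cylinder of radius 2 with axis along y
K, H = curvatures(d, h=0.01)
print(np.abs(K[5:-5, 5:-5]).max(), np.abs(H[5:-5, 5:-5]).max())
```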
2.2 Developable vs. Minimal Surfaces

We noted in [12,13] that the surfaces perceived by humans, as shown in Fig. 1(b), are developable, whereas previously-proposed algorithms predict the "soap film" surfaces
such as the one shown in Fig. 1(c). Accordingly, we suggested that minimizing the total sum of the absolute value or the square of the Gaussian curvature, for example, may predict surfaces similar to those that are perceived by humans.

Developable surfaces minimize the energy that is the total sum of the Gaussian curvature modulus. Thus, intuitively, it can be thought of as rolling and bending a piece of very thin paper (so thin that you cannot feel any stiffness, but it does not stretch) to fit the stereo surface. In contrast, conventional priors tend to minimize the mean curvature H, rather than the Gaussian curvature. For instance, the continuous analog of the minimization of the square difference of disparities between neighboring pixels is minimizing the Dirichlet integral \int |\nabla d(x, y)|^2 dx dy. It is known (Dirichlet's Principle) that the function that minimizes this integral is harmonic. By equation (2), the surface represented by (x, y, d(x, y)) with a harmonic function d(x, y) has small mean curvature H when d_x, d_y are small compared to 1. Thus the surface approximates the minimal surface, which is defined as the surface with H = 0, and is physically illustrated by a soap film spanning a wire frame.

Thus the difference between the kinds of surfaces favored by the present and the previously-used priors is the difference between minimizing the Gaussian and the mean curvature, and intuitively the difference between a flexible thin sheet of paper and a soap film. Since a surface whose Gaussian and mean curvature are both zero is a plane, one might say the two represent two opposite directions of curving the surface. In this sense, the minimization of the Gaussian curvature modulus is the opposite of the extant priors. Note that this notion of oppositeness is about the smooth part of the surface; thus it does not change even when discontinuities are allowed by the prior in disparity or its derivatives.

Another difference between the currently popular priors and what the human vision system seems to use is, as we pointed out in [12,13], that the popular priors are convex. The surfaces that humans perceive, including the one shown in Fig. 1(b), cannot be predicted by the minimization of the fronto-parallel prior, nor by that of any convex prior. Although some priors that allow discontinuities are non-convex, they are usually convex at the continuous part of the surface. As far as we could determine, the Gaussian curvature is the only inherently non-convex prior that has been proposed as a stereo prior. For more discussion on the convexity of the prior, see [12].

The total absolute Gaussian curvature has been proposed as a criterion of tightness for re-triangulation of a polyhedral surface mesh [1]. The re-triangulation process to minimize the total absolute Gaussian curvature has subsequently been proved to be NP-hard, at least in the case of terrains [6].

2.3 Is It Good for Stereo?

Let us consider the case of developable surfaces, the limiting case where the Gaussian curvature vanishes everywhere. Note that such a surface can have a sharp bend and still have zero Gaussian curvature, as shown in Fig. 3, allowing a sharp border in the depth surface. It also encourages the straightness of the border: a higher-order effect not seen in first-order priors. Compare this to the limiting case of the fronto-parallel prior, which
Fig. 3. Surfaces with zero Gaussian curvature. A surface can have a sharp bend and still have zero Gaussian curvature.
would be a plane. Thus, the Gaussian prior is more flexible than most conventional prior models. The question is rather whether it is too flexible. It is not immediately clear if this scheme is useful as a prior constraint for a stereo optimization model, especially since the functional would be hard to optimize. This is the reason for the experiments we describe in the next section. But before going to the experiments, we mention one reason that this type of prior might work. It is that there is a close relationship between the Gaussian curvature and the convex hull:

Theorem. Let A be a set in the three-dimensional Euclidean space, B its convex hull, and p a point in ∂B \ A, where ∂B denotes the boundary of B. Assume that a neighborhood of p in ∂B is a smooth surface. Then the Gaussian curvature of ∂B at p is zero.

A proof is in [12]. This means that the Gaussian curvature of the surface of the convex hull of a set, at a point that does not belong to the original set, is zero wherever it is defined. How does this relate to the stereo prior? Imagine for a moment that evidence from stereo matching is sparse but strong. That gives a number of scattered points in the matching space that should be on the matching surface. Then the role of the prior model is to interpolate these points. A soap film solution finds a minimal surface that goes through these points. We think taking the convex hull of these points locally might give a good solution, since the convex hull in a sense has the "simplest" 3D shape that is compatible with the data, much in the way the Kanizsa triangle [15] is the simplest 2D shape that explains incomplete contour information; and in the real world, most surfaces are in fact the faces of some 3D body. By the theorem, taking the convex hull of the points and using one of its faces gives a surface with zero Gaussian curvature. To be sure, there are always two sides of the hull, making it ambiguous, and there is the problem of choosing the group of local points to take the convex hull of; it would not do to use all the points on the whole surface. Also, when the evidence is dense, and not so strong, the data is not like points in space; it is a probability distribution. Then it is not clear what "taking the convex hull" means. So we decided to try minimizing the Gaussian curvature and hope that it has a similar effect.
3 Experiments

We experimentally compared the Gaussian curvature minimization prior model with the conventional prior models, keeping other conditions identical. We used the data set and
the code implementation of stereo algorithms, as well as the evaluation module, by D. Scharstein and R. Szeliski, available from http://vision.middlebury.edu/stereo/, which was used in [22]. We made modifications to the code to allow the use of different priors.

3.1 MAP Formulation

Each data set consists of a rectified stereo pair I_L and I_R. The stereo surface is represented as a disparity function d(x, y) on the discretized left image domain. We seek the d(x, y) that minimizes the energy

E(I_L, I_R, d) = E_1(I_L, I_R, d) + E_2(d).    (3)

The first term is the so-called data term, which encodes mainly the image formation model. The second term is the prior term, in which we are interested. For the image formation energy term E_1(I_L, I_R, d), we used the unaggregated absolute differences as the matching cost:

E_1(I_L, I_R, d) = \sum_{(x,y)} |I_L(x, y) - I_R(x + d(x, y), y)|.    (4)

This stays the same throughout the experiments. The purpose of the experiments is to evaluate the relative performance of the different prior energy terms E_2(d), as detailed next.

3.2 Priors

We compared five prior models, one of which is the total absolute Gaussian curvature minimization prior. Each prior energy E_2^i(d) (i = 1, ..., 5) is defined by summing a local function over all the pixels:

E_2^i(d) = \lambda \sum_{(x,y)} f_d^i(x, y).    (5)
The details of each prior model follow.

i) Gaussian curvature minimization. We experimented with several varieties of the prior model by minimization of the Gaussian curvature modulus. First, to minimize the modulus of the Gaussian curvature, we tried both the sum of squares and the sum of absolute values. Second, since the curvature is not meaningful when the local discretized disparity surface is very rough, we used various smoothing schemes where the disparity change is larger than a threshold value; for smoothing we used i) the square difference, ii) the absolute difference, and iii) a constant penalty. By combination, there were six varieties of the local prior energy, of which we found that the combination of the absolute Gaussian curvature and the square difference smoothing worked best:

f_d^1(x, y) = |K(x, y, d)|      (if d_x^2 + d_y^2 < c)
            = d_x^2 + d_y^2     (if d_x^2 + d_y^2 ≥ c),    (6)

where

d_x = d(x+1, y) - d(x, y),  d_y = d(x, y+1) - d(x, y),
d_{xx} = d(x+1, y) + d(x-1, y) - 2 d(x, y),
d_{xy} = d(x+1, y+1) + d(x, y) - d(x, y+1) - d(x+1, y),
d_{yy} = d(x, y+1) + d(x, y-1) - 2 d(x, y),

and

K(x, y, d) = \frac{d_{xx} d_{yy} - d_{xy}^2}{(1 + d_x^2 + d_y^2)^2}.
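The following Python fragment (our sketch, not the author's implementation) spells out this local term for a disparity map stored as a 2D array indexed d[y, x]; the default threshold c = 4.0 is the value reported as best later in the paper:

```python
import numpy as np

def f_d1(d, x, y, c=4.0):
    """Local prior energy of eq. (6) at pixel (x, y) for a disparity map d[y, x].
    Assumes (x, y) is not on the image border so all neighbors exist."""
    dx = d[y, x + 1] - d[y, x]
    dy = d[y + 1, x] - d[y, x]
    grad2 = dx**2 + dy**2
    if grad2 >= c:
        return grad2                      # fall back to square-difference smoothing
    dxx = d[y, x + 1] + d[y, x - 1] - 2 * d[y, x]
    dxy = d[y + 1, x + 1] + d[y, x] - d[y + 1, x] - d[y, x + 1]
    dyy = d[y + 1, x] + d[y - 1, x] - 2 * d[y, x]
    K = (dxx * dyy - dxy**2) / (1.0 + grad2)**2
    return abs(K)                         # absolute Gaussian curvature term

# The prior energy E_2^1 of eq. (5) is then lambda times the sum of f_d1 over all pixels.
```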
ii) Smoothness part only. In order to evaluate the effect of Gaussian curvature minimization, we compared it to the prior that uses only the smoothing part, i.e., the case c = 0 in f_d^1(x, y). This is the most popular prior that minimizes the square disparity change:

f_d^2(x, y) = d_x^2 + d_y^2.    (7)

iii) Potts model. This model is widely used in the literature, especially in relation with the graph cut algorithms, e.g., [5]. It is defined by

f_d^3(x, y) = T(d_x) + T(d_y),    (8)
where T(X) gives 0 if X = 0 and 1 otherwise.

iv) Smoothing with a cut-off value. This is used to model a piecewise smooth surface. Where the surface is smooth, it penalizes a disparity change according to the absolute disparity change (f_d^4) or the square disparity change (f_d^5). When there is a discontinuity, it costs only a constant value, no matter how large the discontinuity is:

f_d^4(x, y) = |d_x| + |d_y|     (if |d_x| + |d_y| < c)
            = c                 (if |d_x| + |d_y| ≥ c),    (9)

f_d^5(x, y) = d_x^2 + d_y^2     (if d_x^2 + d_y^2 < c)
            = c                 (if d_x^2 + d_y^2 ≥ c).    (10)

3.3 Optimization

We used simulated annealing to optimize the energy functional. The number of iterations was 500 with the full Gibbs update rule, where all possible disparities at a given pixel are evaluated. The annealing schedule was linear. Note that we need to use the same optimization technique for all the priors. Thus we cannot use the graph cut or belief propagation algorithms, since they cannot be used (at least easily) with the Gaussian curvature minimization because it is of second order.

3.4 Evaluation Measure

The evaluation module collects various statistics from the experiments. Here, we give the definitions of the statistics we mention later in discussing the results.

Fig. 4. Statistics from the tsukuba data set: R_{\bar O}, B_{\bar O}, B_{\bar T}, and B_D. The total absolute Gaussian curvature minimization prior f_d^1 used c = 4.0. The prior f_d^4 (absolute difference with a cut-off) performed best when c = 5.0 and the prior f_d^5 (square difference with a cut-off) when c = 10.0.

R_{\bar O} is the RMS
(root-mean-squared) disparity error. B_{\bar O} is the percentage of pixels in non-occluded areas with a disparity error greater than 3. B_{\bar T} is the percentage of pixels in textureless areas with a disparity error greater than 3. B_D is the percentage of pixels with a disparity error greater than 3 near discontinuities of more than 5.

3.5 Results

Shown in Fig. 4 are the statistics from one of the data sets (tsukuba) with varying λ. The total absolute Gaussian curvature prior f_d^1 with c = 4.0 performed the best among the variations of minimizing the Gaussian curvature modulus we tried. The prior f_d^4, which uses the absolute difference with a cut-off value, performed best when the cut-off threshold c is 5.0, and f_d^5, the square difference with a cut-off, when c = 10.0. Though the threshold value c for the best results varies from prior to prior, this is nothing to be alarmed about, as the quantity being thresholded differs from prior to prior. In Fig. 5, we show the disparity maps of the experiments for qualitative evaluation. The first two rows show the original image and the ground truth. Each of the remaining rows shows the best results in terms of R_{\bar O} within the category, except for (d),
Fig. 5. Experiments on the map, sawtooth, tsukuba, and venus data sets (one per column): (a) The left image, (b) the ground truth, and the results by (c) total absolute Gaussian curvature minimization with smoothing, (d) smoothing only, (e) Potts model, and (f) smoothing with a cut-off. Each result is for the best value of λ within the category except for (d), which used the same λ as the one used in (c) for comparison.
which uses the same parameters as the result shown just above in row (c), to compare the effect of Gaussian curvature minimization, i.e., to see if the Gaussian part of (6) actually has any effect. The computation time was about 200 to 700 seconds for the Gaussian curvature minimization on a 3.4GHz processor.
3.6 Discussion

First of all, these results are not the best overall results by the standard of the state of the art, which is only achieved through the combination of various techniques in all aspects of stereo. This is not surprising, as we use the simplest data term and do not use any of the various additional techniques, such as the classification of pixels into occluded and non-occluded, or image segmentation before and after the matching; nor do we iterate the process based on such pre- and post-processing. Also, it would have been much faster and the solution much better if we were using the best optimization method that can be used for each prior. The simple square difference without thresholding f_d^2 can be exactly solved by graph cut [10]. The Potts energy f_d^3, as is now well known, can be efficiently optimized with a known error bound by α-expansion [5]; and recently an extension [27] of the algorithm made it possible to do the same for truncated convex priors like f_d^4 and f_d^5.

However, this is expected and it is not the point of these experiments; rather, their purpose is to examine (i) whether or not the minimization of the Gaussian curvature modulus works at all, and (ii) if it does, how it compares to other priors, rather than to whole stereo algorithms. We conclude that (i) has been answered in the affirmative: it does seem to work in principle. We suspected that the excessive flexibility of developable surfaces could be a problem, but it seems that, at least in combination with some smoothing, it can work. As for (ii), quantitatively we say it is comparable to the best of the other priors, with the caveat that the result might be skewed by some interaction between the prior and the chosen optimization, as when some priors are easier to optimize by simulated annealing than others. Thus, it might still be the case that the Gaussian curvature minimization is not good after all. It is a problem common to optimization algorithms that are neither exact nor have known error bounds: we cannot know how close the results are to the optimum.

From Fig. 4, it can be seen that the new prior model is comparable to or better than any of the conventional energy functions tested. It also is not as sensitive to the value of λ as the other priors. Qualitatively, the result by the Gaussian curvature minimization seems different, especially compared to the Potts model. It seems to preserve sharp depth boundaries better, as expected.

Another consideration is that the relative success of this model may be simply due to the fact that the man-made objects in the test scenes exhibit mostly zero Gaussian curvature. While that remains to be determined experimentally, we point out that it might actually be an advantage if that is in fact the case. After all, developable surfaces like planes and cylinders are ubiquitous in an artificial environment, while minimal surfaces like soap films are not seen so much even in natural scenes.
4 Conclusion

In this paper, we have examined a novel prior model for stereo vision based on minimizing the total absolute Gaussian curvature of the disparity surface. It is motivated
by psychophysical experiments on human stereo vision. Intuitively, it can be thought of as rolling and bending a piece of very thin paper to fit the stereo surface, whereas the conventional priors are more akin to spanning a soap film over a wire frame. The experiments show that the new prior model is comparable to or better than any of the conventional priors tested, when compared in an equal setting.

The main drawback of the absolute Gaussian curvature minimization is that we do not yet have an optimization method as good as graph cuts or belief propagation that can optimize it efficiently. Obviously, the real measure of a prior crucially depends on how well it can actually be optimized. However, the experiments have been a necessary step before attempting to devise a new optimization technique. It may be difficult: as we mentioned in 2.2, re-triangulation of polyhedral surfaces to minimize the total absolute Gaussian curvature is NP-hard, and the prior is of second order. Still, the result in this paper at least gives us a motivation to pursue better optimization algorithms that can be used with this prior. Also, it might be fruitful to try exploiting the relationship between the convex hull and the Gaussian curvature mentioned in 2.3 in order to achieve the same goal. Using that method, we at least know there are definite solutions.

The authors of [18] conclude their paper thus: "As can be seen, the global minimum of the energy function does not solve many of the problems in the BP or graph cuts solutions. This suggests that the problem is not in the optimization algorithm but rather in the energy function." At the least, our new prior is something significantly different from other models that have been used for the past thirty years, as we discussed in some detail in 2.2 and also in [12]. It is different, but it still works at least as well as the others, in a qualitatively different way. It may be a good starting point for rethinking the energy functions in stereo.

Acknowledgement. This work was partially supported by the Suzuki Foundation, the Research Foundation for the Electrotechnology of Chubu, the Inamori Foundation, the Hori Information Science Promotion Foundation, and the Grant-in-Aid for Exploratory Research 19650065 from the Ministry of Education, Culture, Sports, Science and Technology, Japan.
References 1. Alboul, L., van Damme, R.: Polyhedral metrics in surface reconstruction: Tight triangulations. In: The Mathematics of Surfaces VII, pp. 309–336. Clarendon Press, Oxford (1997) 2. Belhumeur, P.N.: A Bayesian Approach to Binocular Stereopsis. Int. J. Comput. Vision 19, 237–262 (1996) 3. Blake, A., Zisserman, A.: Visual reconstruction. MIT Press, Cambridge, MA (1987) 4. Birchfield, S., Tomasi, C.: Multiway cut for stereo and motion with slanted surfaces. In: ICCV 1999, vol. I, pp. 489–495 (1999) 5. Boykov, Y., Veksler, O., Zabih, R.: Fast Approximate Energy Minimization via Graph Cuts. IEEE Trans. on Patt. Anal. Machine Intell. 23, 1222–1239 (2001) 6. Buchin, M., Giesen, J.: Minimizing the Total Absolute Gaussian Curvature in a Terrain is Hard. In: the 17th Canadian Conference on Computational Geometry, pp. 192–195 (2005) 7. Faugeras, O.: Three-Dimensional Computer Vision. MIT Press, Cambridge, MA (1993) 8. Felzenszwalb, P., Huttenlocher, D.: Efficient Belief Propagation for Early Vision. In: CVPR 2004, pp. 261–268 (2004)
9. Grimson, W.E.: From Images to Surfaces. MIT Press, Cambridge, MA (1981) 10. Ishikawa, H.: Exact Optimization for Markov Random Fields with Convex Priors. IEEE Trans. on Patt. Anal. Machine Intell. 25(10), 1333–1336 (2003) 11. Ishikawa, H., Geiger, D.: Occlusions, Discontinuities, and Epipolar Lines in Stereo. In: Burkhardt, H.-J., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1406, pp. 232–248. Springer, Heidelberg (1998) 12. Ishikawa, H., Geiger, D.: Rethinking the Prior Model for Stereo. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3953, pp. 526–537. Springer, Heidelberg (2006) 13. Ishikawa, H., Geiger, D.: Illusory Volumes in Human Stereo Perception. Vision Research 46(1-2), 171–178 (2006) 14. Jones, J., Malik, J.: Computational Framework for Determining Stereo Correspondence from a set of linear spatial filters. Image Vision Comput. 10, 699–708 (1992) 15. Kanizsa, G.: Organization in Vision. Praeger, New York (1979) 16. Kolmogorov, V., Zabih, R.: Computing Visual Correspondence with Occlusions via Graph Cuts. In: ICCV 2001, pp. 508–515 (2001) 17. Marr, D., Poggio, T.: Cooperative Computation of Stereo Disparity. Science 194, 283–287 (1976) 18. Meltzer, T., Yanover, C., Weiss, Y.: Globally Optimal Solutions for Energy Minimization in Stereo Vision Using Reweighted Belief Propagation. In: ICCV 2005, pp. 428–435 (2005) 19. Ogale, A.S., Aloimonos, Y.: Stereo correspondence with slanted surfaces: critical implications of horizontal slant. In: CVPR 2004, vol. I, pp. 568–573 (2004) 20. Roy, S.: Stereo Without Epipolar Lines: A Maximum-flow Formulation. Int. J. Comput. Vision 34, 147–162 (1999) 21. Roy, S., Cox, I.: A Maximum-flow Formulation of the N-camera Stereo Correspondence Problem. In: ICCV 1998, pp. 492–499 (1998) 22. Scharstein, D., Szeliski, R.: A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. Int. J. Computer Vision 47, 7–42 (2002) 23. Szeliski, R.: A Bayesian Modelling of Uncertainty in Low-level Vision. Kluwer Academic Publishers, Boston, MA (1989) 24. Sun, J., Zhen, N.N, Shum, H.Y.: Stereo Matching Using Belief Propagation. IEEE Trans. on Patt. Anal. Machine Intell. 25, 787–800 (2003) 25. Tappen, M.F., Freeman, W.T.: Comparison of Graph Cuts with Belief Propagation for Stereo, using Identical MRF Parameters. In: ICCV 2003, vol. II, pp. 900–907 (2003) 26. Terzopoulos, D.: Regularization of Inverse Visual Problems Involving Discontinuities. IEEE Trans. on Patt. Anal. Machine Intell. 8, 413–424 (1986) 27. Veksler, O.: Graph Cut Based Optimization for MRFs with Truncated Convex Priors. In: CVPR 2007 (2007)
Fast Optimal Three View Triangulation

Martin Byröd, Klas Josephson, and Kalle Åström
Center for Mathematical Sciences, Lund University, Lund, Sweden
{byrod, klasj, kalle}@maths.lth.se
Abstract. We consider the problem of L2-optimal triangulation from three separate views. Triangulation is an important part of numerous computer vision systems. Under Gaussian noise, minimizing the L2 norm of the reprojection error gives a statistically optimal estimate. This has been solved for two views. However, for three or more views, it is not clear how this should be done. A previously proposed, but computationally impractical, method draws on Gröbner basis techniques to solve for the complete set of stationary points of the cost function. We show how this method can be modified to become significantly more stable and hence be given a fast implementation in standard IEEE double precision. We evaluate the precision and speed of the new method on both synthetic and real data. The algorithm has been implemented in a freely available software package which can be downloaded from the Internet.
1 Introduction
Triangulation, referring to the act of reconstructing the 3D location of a point given its images in two or more known views, is a fundamental part of numerous computer vision systems. Albeit conceptually simple, this problem is not completely solved in the general case of n views and noisy measurements. There exist fast and relatively robust methods based on linear least squares [1]. These methods are, however, sub-optimal. Moreover, the linear least squares formulation does not have a clear geometrical meaning, which means that in unfortunate situations this approach can yield very poor accuracy. The most desirable, but non-linear, approach is instead to minimize the L2 norm of the reprojection error, i.e. the sum of squares of the reprojection errors. The reason for this is that the L2 optimum yields the maximum likelihood estimate for the 3D point under the assumption of independent Gaussian noise on the image measurements [2]. This problem has been given a closed form solution¹ by Hartley and Sturm in the case of two views [2]. However, the approach of Hartley and Sturm is not straightforward to generalize to more than two views.
¹ The solution is actually not entirely in closed form, since it involves the solution of a sixth degree polynomial, which cannot in general be solved in closed form. Therefore one has to go by e.g. the eigenvalues of the companion matrix, which implies an iterative process.
In the case of n views, the standard method when high accuracy is needed is to use a two-phase strategy where an iterative scheme for non-linear least squares such as Levenberg-Marquardt (Bundle Adjustment) is initialised with a linear method [3]. This procedure is reasonably fast and in general yields excellent results. One potential drawback, however, is that the method is inherently local, i.e. it finds local minima with no guarantee of being close to the global optimum. An interesting alternative is to replace the L2 norm with the L∞ norm, cf. [4]. This way it is possible to obtain a provably optimal solution with a geometrically sound cost function in a relatively efficient way. The drawback is that the L∞ norm is suboptimal under Gaussian noise and it is less robust to noise and outliers than the L2 norm. The most practical existing method for L2 optimization with an optimality guarantee is to use a branch and bound approach as introduced in [5], which, however, is a computationally expensive strategy.

In this paper, we propose to solve the problem of L2 optimal triangulation from three views using a method introduced by Stewenius et al. in [6], where the optimum was found by explicit computation of the complete set of stationary points of the likelihood function. This approach is similar to that of Hartley and Sturm [2]. However, whereas the stationary points in the two view case can be found by solving a sixth degree polynomial in one variable, the easiest known formulation of the three view case involves solving a system of three sixth degree equations in three unknowns with 47 solutions. Thus, we have to resort to more sophisticated techniques to tackle this problem. Stewenius et al. used algebraic geometry and Gröbner basis techniques to analyse and solve the equation system. However, Gröbner basis calculations are known to be numerically challenging and they were forced to use emulated 128 bit precision arithmetic to get a stable implementation, which rendered their solution too slow to be of any practical value.

In this paper we develop the Gröbner basis approach further to improve the numerical stability. We show how computing the zeros of a relaxed ideal, i.e. a smaller ideal (implying a possibly larger solution set/variety), can be used to solve the original problem to a greater accuracy. Using this technique, we are able to give the Gröbner basis method a fast implementation using standard IEEE double precision. By this we also show that global optimization by calculation of stationary points is indeed a feasible approach and that Gröbner bases provide a powerful tool in this pursuit. Our main contributions are:

– A modified version of the Gröbner basis method for solving polynomial equation systems, here referred to as the relaxed ideal method, which trades some speed for a significant increase in numerical stability.
– An efficient C++ language implementation of this method applied to the problem of three view triangulation. The source code for the methods described in this paper is freely available for download from the Internet [7].
2 Three View Triangulation
The main motivation for triangulation from more than two views is to use the additional information to improve accuracy. In this section we briefly outline the approach we take and derive the equations to be used in the following sections. This part is essentially identical to that used in [6]. We assume a linear pin-hole camera model, i.e. projection in homogeneous coordinates is done according to λ_i x_i = P_i X, where P_i is the 3 × 4 camera matrix for view i, x_i is the image coordinates, λ_i is the depth and X is the 3D coordinates of the world point to be determined. In standard coordinates, this can be written as

x_i = \frac{1}{P_i^3 X} \begin{pmatrix} P_i^1 X \\ P_i^2 X \end{pmatrix},    (1)

where e.g. P_i^3 refers to row 3 of camera i. As mentioned previously, we aim at minimizing the L2 norm of the reprojection errors. Since we are free to choose the coordinate system in the images, we place the three image points at the origin in their respective image coordinate systems. With this choice of coordinates, we obtain the following cost function to minimize over X:

\varphi(X) = \frac{(P_1^1 X)^2 + (P_1^2 X)^2}{(P_1^3 X)^2} + \frac{(P_2^1 X)^2 + (P_2^2 X)^2}{(P_2^3 X)^2} + \frac{(P_3^1 X)^2 + (P_3^2 X)^2}{(P_3^3 X)^2}.    (2)
The approach we take is based on calculating the complete set of stationary points of \varphi(X), i.e. solving \nabla\varphi(X) = 0. By inspection of (2) we see that \nabla\varphi(X) will be a sum of rational functions. The explicit derivatives can easily be calculated, but we refrain from writing them out here. Differentiating and multiplying through with the denominators produces three sixth degree polynomial equations in the three unknowns of X = [X_1, X_2, X_3]. To simplify the equations we also make a change of world coordinates, setting the last rows of the respective cameras to

P_1^3 = [1 0 0 0],  P_2^3 = [0 1 0 0],  P_3^3 = [0 0 1 0].    (3)
Since we multiply with the denominator we introduce new stationary points in our equations corresponding to one of the denominators in (2) being equal to zero. This happens precisely when X coincides with the plane through one of the focal points parallel to the corresponding image plane. Such points have infinite/undefined value of ϕ(X) and can therefore safely be removed. To summarise, we now have three sixth degree equations in three unknowns. The remainder of the theoretical part of the paper will be devoted to the problem of solving these.
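As a rough illustration (ours, not the authors' code) of the cost in eq. (2), the following Python sketch evaluates \varphi(X) for three given camera matrices, with the image points assumed to have already been translated to the origin of each image:

```python
import numpy as np

def reprojection_cost(X, cameras):
    """L2 cost of eq. (2) for a 3D point X and a list of 3x4 camera matrices,
    assuming each measured image point sits at the origin of its image frame."""
    Xh = np.append(np.asarray(X, dtype=float), 1.0)   # homogeneous coordinates
    cost = 0.0
    for P in cameras:
        u, v, w = P @ Xh                              # P_i^1 X, P_i^2 X, P_i^3 X
        cost += (u**2 + v**2) / w**2
    return cost

# The stationary points solved for in this paper are the zeros of the gradient of
# this function; a simple numerical check can be done with finite differences.
```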
3 Using Gröbner Bases to Solve Polynomial Equations
In this section we give an outline of how Gröbner basis techniques can be used for solving systems of multivariate polynomial equations. Gröbner bases are a
concept within algebraic geometry, which is the general theory of multivariate polynomials over any field. Naturally, we are only interested in real solutions, but since algebraic closedness is important to the approach we take, we seek solutions in C and then ignore any complex solutions we obtain. See e.g. [8] for a good introduction to algebraic geometry. Our goal is to find the set of solutions to a system f_1(x) = 0, ..., f_m(x) = 0 of m polynomial equations in n variables. The polynomials f_1, ..., f_m generate an ideal I in C[x], the ring of multivariate polynomials in x = (x_1, ..., x_n) over the field of complex numbers. To find the roots of this system we study the quotient ring C[x]/I of polynomials modulo I. If the system of equations has r roots, then C[x]/I is a linear vector space of dimension r. In this ring, multiplication with x_k is a linear mapping. The matrix m_{x_k} representing this mapping (in some basis) is referred to as the action matrix and is a generalization of the companion matrix for one-variable polynomials. From algebraic geometry it is known that the zeros of the equation system can be obtained from the eigenvectors/eigenvalues of the action matrix just as the eigenvectors/eigenvalues of the companion matrix yield the zeros of a one-variable polynomial [9]. The solutions can be extracted from the eigenvalue decomposition in a few different ways, but easiest is perhaps to use the fact that the vector of monomials spanning C[x]/I evaluated at a zero of I is an eigenvector of m_{x_k}^T. An alternative is to use the eigenvalues of m_{x_k}^T, which correspond to the values of x_k at the zeros of I. C[x]/I is a set of equivalence classes, and to perform calculations in this space we need to pick representatives for the equivalence classes. A Gröbner basis G for I is a special set of generators for I with the property that it lets us compute a well defined, unique representative for each equivalence class. Our main focus is therefore on how to compute this Gröbner basis in an efficient and reliable way.
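The one-variable case makes the action-matrix idea concrete. In the toy Python example below (ours), the companion matrix of a cubic plays the role of the action matrix of multiplication by x in C[x]/⟨p⟩, and its eigenvalues are the roots of p:

```python
import numpy as np

# p(x) = x^3 - 2x^2 - 5x + 6 = (x - 1)(x - 3)(x + 2)
p = np.array([1.0, -2.0, -5.0, 6.0])        # coefficients, highest degree first
n = len(p) - 1
M = np.zeros((n, n))                        # action matrix in the basis {1, x, x^2}
M[1:, :-1] = np.eye(n - 1)                  # x * x^j = x^(j+1) for j < n-1
M[:, -1] = -p[:0:-1] / p[0]                 # x * x^(n-1) reduced modulo p
roots = np.linalg.eigvals(M)
print(np.sort(roots.real))                  # approximately [-2., 1., 3.]
```

In the multivariate setting the same construction applies once a linear basis of monomials for C[x]/I is known, which is exactly what the Gröbner basis computation described next provides.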
4 Numerical Gröbner Basis Computation
There is a general method for constructing a Gröbner basis known as Buchberger's algorithm [9]. It is a generalization of the Euclidean algorithm for computing the greatest common divisor and of Gaussian elimination. The general idea is to arrange all monomials according to some ordering and then successively eliminate leading monomials from the equations in a fashion similar to how Gaussian elimination works. This is done by selecting polynomials pair-wise and multiplying them by suitable monomials to be able to eliminate the least common multiple of their respective leading monomials. The algorithm stops when any new element from I reduces to zero upon multivariate polynomial division with the elements of G. Buchberger's algorithm works perfectly under exact arithmetic. However, in floating point arithmetic it becomes extremely difficult to use due to accumulating round-off errors. In Buchberger's algorithm, adding equations and eliminating is completely interleaved. We aim for a process where we first add all equations
we will need and then do the full elimination in one go, in the spirit of the f4 algorithm [10]. This allows us to use methods from numerical linear algebra such as pivoting strategies and QR factorization to circumvent (some of) the numerical difficulties. This approach is made possible by first studying a particular problem using exact arithmetic² to determine the number of solutions and what total degree we need to go to. Using this information, we hand craft a set of monomials which we multiply our original equations with to generate new equations. We stack the coefficients of our expanded set of equations in a matrix C and write our equations as

C\varphi = 0,    (4)

where \varphi is a vector of monomials. Putting C on reduced row echelon form then gives us the reduced minimal Gröbner basis we need. In the next section we go into the details of constructing a Gröbner basis for the three view triangulation problem.
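The elimination step can be pictured with the small Python sketch below (ours, and deliberately simplified): the coefficients of the expanded equations are stacked into C and reduced by Gaussian elimination with partial pivoting; a production implementation would add the column pivoting and QR-based steps mentioned above:

```python
import numpy as np

def row_reduce(C, tol=1e-12):
    """Reduced row echelon form of a stacked coefficient matrix C (equations x monomials),
    using partial pivoting. The nonzero rows of the result give the coefficients of the
    Groebner basis candidates with respect to the chosen monomial order."""
    C = C.astype(float).copy()
    m, n = C.shape
    r = 0
    for col in range(n):
        pivot = r + np.argmax(np.abs(C[r:, col]))
        if np.abs(C[pivot, col]) < tol:
            continue                      # no usable pivot in this column
        C[[r, pivot]] = C[[pivot, r]]     # bring the pivot row up
        C[r] /= C[r, col]
        others = np.arange(m) != r
        C[others] -= np.outer(C[others, col], C[r])
        r += 1
        if r == m:
            break
    return C
```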
5 Constructing a Gröbner Basis for the Three View Triangulation Problem
As detailed in Section 2, we optimize the L2 cost function by calculation of the stationary points. This yields three sixth degree polynomial equations in X = [X_1, X_2, X_3]. In addition to this, we add a fourth equation by taking the sum of our three original equations. This cancels out the leading terms, producing a fifth degree equation which will be useful in the subsequent calculations [6]. These equations generate an ideal I in C[X]. The purpose of this section is to give the details of how a Gröbner basis for I can be constructed. First, however, we need to deal with the problem where one or more of the X_i = 0. When this happens, we get a parametric solution to our equations. As mentioned in Section 2, this corresponds to the extra stationary points introduced by multiplying up denominators, and these points have infinite value of the cost function \varphi(X). Hence, we would like to exclude solutions with any X_i = 0 or, equivalently, with X_1 X_2 X_3 = 0. The algebraic geometry way of doing this is to calculate the saturation sat(I, X_1 X_2 X_3) of I w.r.t. X_1 X_2 X_3, consisting of all polynomials f(X) s.t. (X_1 X_2 X_3)^k · f ∈ I for some k. Computationally it is easier to calculate sat(I, X_i) for one variable at a time and then join the results. This removes the same problematic parametric family of solutions, but with the side effect of producing some extra (finite) solutions with X_i = 0. These do not present any serious difficulties though, since they can easily be detected and filtered out. Consider one of the variables, say X_1. The ideal sat(I, X_1) is calculated in three steps. We order the monomials according to X_1, but take the monomial with the highest power of X_1 to be the smallest, e.g. X_1 X_2^2 X_3 ≥ X_1^2 X_2^2 X_3. With the monomials ordered this way, we perform a few steps of the Gröbner basis
Usually with the aid of some algebraic geometry software as Macaulay 2 [11].
calculation, yielding a set of generators where the last elements can be divided by powers of X1. We add these new equations, "stripped" of their powers of X1, to I. More concretely, we multiply the equations by all monomials, creating equations up to degree seven. After the elimination step, two equations are divisible by X1 and one is divisible by X1^2. The saturation process is performed analogously for X2 and X3, producing the saturated ideal Isat, from which we extract our solutions. The final step is to calculate a Gröbner basis for Isat, at this point generated by a set of nine fifth and sixth degree equations. To be able to do this we multiply with monomials, creating 225 equations in 209 different monomials of total degree up to nine (refer to [6] for more details on the saturation and expansion process outlined above). The last step thus consists of putting the 225 × 209 matrix C on reduced row echelon form. This last part turns out to be a delicate task, though, due to generally very poor conditioning. In fact, the conditioning is often so poor that round-off errors on the order of magnitude of machine epsilon (approximately 10^-16 for doubles) yield errors as large as 10^2 or more in the final result. This is the reason one had to resort to emulated 128 bit numerics in [6]. In the next section, we propose a strategy for dealing with this problem which drastically improves numerical precision, allowing us to use standard IEEE double precision.
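The elimination itself is plain numerical linear algebra; a minimal numpy sketch of reduced row echelon form with partial pivoting (the actual implementation relies on LAPACK routines, so this is only illustrative) could look like:

```python
import numpy as np

def rref(C, tol=1e-12):
    """Reduced row echelon form of C with partial pivoting."""
    C = C.astype(float).copy()
    nrows, ncols = C.shape
    r = 0
    for c in range(ncols):
        if r == nrows:
            break
        p = r + np.argmax(np.abs(C[r:, c]))      # partial pivoting
        if abs(C[p, c]) < tol:
            continue                             # no pivot in this column
        C[[r, p]] = C[[p, r]]
        C[r] /= C[r, c]
        others = np.arange(nrows) != r
        C[others] -= np.outer(C[others, c], C[r])
        r += 1
    return C
```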
6
The Relaxed Ideal Method
After the saturation step, we have a set of equations which "tightly" describe the set of solutions and nothing more. It turns out that by relaxing the constraints somewhat, possibly allowing some extra spurious solutions to enter the equations, we get a significantly better conditioned problem. We thus aim at selecting a subset of the 225 equations. This choice is not unique, but a natural subset to use is the 55 equations with all possible 9th degree monomials as leading terms, since this is the smallest set of equations which directly gives us a Gröbner basis. We do this by QR factorization of the submatrix of C consisting of the first 55 columns, followed by multiplying the remaining columns with Q^T. After these steps we pick out the first 55 rows of the resulting matrix. These rows correspond to 55 equations forming the relaxed ideal Irel ⊂ I, a subset of the original ideal I. Forming the variety/solution set V of an ideal is an inclusion-reversing operation and hence we have V(I) ⊂ V(Irel), which means that we are guaranteed not to lose any solutions. Moreover, since all monomials of degree nine are represented in exactly one of our generators for Irel, by construction we have a Gröbner basis for Irel. The sets of eigenvalues computed from the action matrices for C[X]/I and C[X]/Irel respectively are shown in Fig. 1. The claim that the number of solutions is equal to the dimension of C[X]/I only holds if I is a radical ideal. Otherwise, the dimension is only an upper bound on the number of solutions [8]. Furthermore, as mentioned in Section 3, a necessary condition for a specific point to be a solution is that the vector of basis
[Plot: eigenvalues in the complex plane (horizontal axis Re, vertical axis Im); legend: Original Eigenvalues, Relaxed Set of Eigenvalues.]
Fig. 1. Eigenvalues of the action matrix using the standard method and the relaxed ideal method respectively, plotted in the complex number plane. The latter are a strict superset of the former.
monomials evaluated at that point is an eigenvector of the transposed action matrix. This condition is however not sufficient, and there can be eigenvectors that do not correspond to zeros of the ideal. This will be the case if I is not a radical ideal. This can lead to false solutions, but does not present any serious problems since false solutions can easily be detected, e.g. by evaluation of the original equations. Since we have 55 leading monomials in the Gröbner basis, the 154 remaining monomials (of the 209 monomials in total) form a basis for C[X]/Irel. Since Irel was constructed from our original equations by multiplication with monomials and invertible row operations (by Q^T), we expect there to be no new actual solutions. This has been confirmed empirically. One can therefore say that, starting out with a radical ideal I, we relax the radicality property and compute a Gröbner basis for a non-radical ideal but with the same set of solutions. This way we improve the conditioning of the elimination step involved in the Gröbner basis computation considerably. The price we have to pay for this is performing an eigenvalue decomposition on a larger action matrix.
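As an illustration only (Python/numpy; the column ordering and sizes follow the description above, and the random matrix stands in for the real coefficient matrix), the relaxation step amounts to a few lines:

```python
import numpy as np

def relaxed_rows(C, k=55):
    """Keep the k equations whose leading terms are the highest-degree
    monomials, assuming those monomials occupy the first k columns of C.
    Equivalent to forming Q^T C with the full QR of C[:, :k] and keeping
    the first k rows."""
    Q, R = np.linalg.qr(C[:, :k])            # reduced QR: Q has k columns
    return np.hstack([R, Q.T @ C[:, k:]])    # k rows; leading k-by-k block is triangular

C = np.random.randn(225, 209)                # stand-in for the real 225 x 209 matrix
B = relaxed_rows(C)
print(B.shape)                               # (55, 209)
```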
7
Experimental Validation
The algorithm described in this paper has been implemented in C++ making use of optimized LAPACK and BLAS implementations [12], and the code is available for download from [7]. The purpose of this section is to evaluate the algorithm in terms of speed and numerical precision. We have run the algorithm on both real and synthetically generated data using a 2.0 GHz AMD Athlon X2 64 bit machine. With this setup, triangulation of one point takes approximately 60 milliseconds. This is to be contrasted with the previous implementation by
Stewénius et al. [6], which needs 30 seconds per triangulation with their setup. The branch and bound method of [5] is faster than [6], but exact running times for triangulation are not given in [5]. However, based on the performance of this algorithm on similar problems, the running time for three view triangulation is probably at least a couple of seconds using their method.
7.1
Synthetic Data
To evaluate the intrinsic numerical stability of our solver, the algorithm has been run on 50,000 randomly generated test cases. World points were drawn uniformly from the cube [−500, 500]^3 and cameras were placed randomly at a distance of around 1000 from the origin, with focal length of around 1000 and pointing inwards. We compare our approach to that of [6] implemented in double precision, here referred to as the standard method since it is based on a straightforward Gröbner basis calculation. A histogram over the resulting errors in estimated 3D location is shown in Fig. 2. As can be seen, computing solutions of the smaller ideal yields an end result with vastly improved numerical precision. The error is typically around a factor 10^5 smaller with the new method. Since we consider triangulation by minimization of the L2 norm of the error, ideally the behaviour under noise should not be affected by the algorithm used. In the second experiment we assert that our algorithm behaves as expected under noise. We generate data as in the first experiment and apply Gaussian noise to the image measurements in 0.1 pixel intervals from 0 to 5 pixels. We triangulate 1000 points for each noise level. The median error in 3D location is plotted versus noise in Fig. 3. There is a linear relation between noise and error, which confirms that the algorithm is stable also in the presence of noise.
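A minimal sketch of such a synthetic setup (Python/numpy; only the stated ranges come from the text, the exact camera construction is our own assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_camera(f=1000.0, dist=1000.0):
    """Camera at distance ~dist from the origin, principal axis pointing inwards."""
    c = rng.normal(size=3)
    c *= dist / np.linalg.norm(c)
    z = -c / np.linalg.norm(c)                         # viewing direction (towards origin)
    x = np.cross([0.0, 0.0, 1.0], z); x /= np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.vstack([x, y, z])                           # world -> camera rotation
    K = np.diag([f, f, 1.0])
    return K @ np.hstack([R, -(R @ c)[:, None]])       # 3x4 projection matrix

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

X = rng.uniform(-500, 500, size=3)                     # world point in [-500, 500]^3
cams = [random_camera() for _ in range(3)]
sigma = 0.5                                            # pixels of Gaussian noise
obs = [project(P, X) + rng.normal(0.0, sigma, 2) for P in cams]
```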
[Plot: histogram of the log10 of the error in 3D placement (horizontal axis) against frequency (vertical axis); legend: Relaxed Ideal, Complete Ideal.]
Fig. 2. Histogram over the error in 3D location of the estimated point X. As is evident from the graph, extracting solutions from the smaller ideal yields a final result with considerably smaller errors.
[Plot: median of the error in 3D location (vertical axis) against noise standard deviation (horizontal axis).]
Fig. 3. Error in 3D location of the triangulated point X as a function of image-point noise. The behaviour under noise is as expected given the problem formulation.
Fig. 4. The Oxford dinosaur reconstructed from 2683 point triplets using the method described in this paper. The reconstruction was completed in approximately 2.5 minutes.
7.2
A Real Example
Finally, we evaluate the algorithm under real world conditions. The Oxford dinosaur [13] is a familiar image sequence of a toy dinosaur shot on a turn table. The sequence consists of 36 images and 4983 point tracks. For each point visible in three or more views we select the first, middle and last view and triangulate using these. This yields a total of 2683 point triplets to triangulate from. The image sequence contains some erroneous tracks, which we deal with by removing any points reprojected with an error greater than two pixels in any frame. The whole sequence was processed in approximately 2.5 minutes and the resulting point cloud is shown in Fig. 4. We have also run the same sequence using the previous method implemented in double precision, but the errors were too large to yield usable results. Note that [6] contains a successful triangulation of the dinosaur sequence, but this was done
using extremely slow emulated 128 bit arithmetic yielding an estimated running time of 20h for the whole sequence.
8
Conclusions
In this paper we have shown how a typical problem from computer vision, triangulation, can be solved for the globally optimal L2 estimate using Gröbner basis techniques. With the introduced method of the relaxed ideal, we have taken this approach to a state where it can now have practical value in actual applications. In all fairness though, linear initialisation combined with bundle adjustment will probably remain the method of choice for most applications, since this is still significantly faster and gives excellent accuracy. However, if a guarantee of finding the provably optimal solution is desired, we provide a competitive method. More importantly perhaps, by this example we show that global optimisation by calculation of the stationary points using Gröbner basis techniques is indeed a possible way forward. This is particularly interesting since a large number of computer vision problems ultimately depend on some form of optimisation. Currently the limiting factor in many applications of Gröbner bases is numerical difficulties. Using the technique presented in this paper of computing the Gröbner basis of a smaller/relaxed ideal, we are able to improve the numerical precision by approximately a factor 10^5. We thus show that there is room for improvement on this point and there is certainly more to explore here. For instance, our choice of relaxation is somewhat arbitrary. Would it be possible to select more/other equations and get better results? If more equations can be kept with retained accuracy, this is certainly a gain, since it allows an eigenvalue decomposition of a smaller action matrix, and this operation in most cases has O(n^3) time complexity.
Acknowledgment. This work has been funded by the Swedish Research Council through grant no. 2005-3230 'Geometry of multi-camera systems', grant no. 2004-4579 'Image-Based Localisation and Recognition of Scenes', SSF project VISCOS II and the European Commission's Sixth Framework Programme under grant no. 011838 as part of the Integrated Project SMErobot.
References
1. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000)
2. Hartley, R., Sturm, P.: Triangulation. Computer Vision and Image Understanding 68, 146–157 (1997)
3. Triggs, W., McLauchlan, P., Hartley, R., Fitzgibbon, A.: Bundle adjustment: A modern synthesis. In: Triggs, B., Zisserman, A., Szeliski, R. (eds.) Vision Algorithms: Theory and Practice. LNCS, vol. 1883. Springer, Heidelberg (2000)
4. Kahl, F.: Multiple view geometry and the L∞-norm. In: International Conference on Computer Vision, Beijing, China, pp. 1002–1009 (2005)
5. Agarwal, S., Chandraker, M.K., Kahl, F., Kriegman, D.J., Belongie, S.: Practical global optimization for multiview geometry. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 592–605. Springer, Heidelberg (2006)
6. Stewénius, H., Schaffalitzky, F., Nistér, D.: How hard is three-view triangulation really? In: Proc. Int. Conf. on Computer Vision, Beijing, China, pp. 686–693 (2005)
7. Three view triangulation, http://www.maths.lth.se/~byrod/downloads.html
8. Cox, D., Little, J., O'Shea, D.: Ideals, Varieties, and Algorithms. Springer, Heidelberg (2007)
9. Cox, D., Little, J., O'Shea, D.: Using Algebraic Geometry. Springer, Heidelberg (1998)
10. Faugère, J.C.: A new efficient algorithm for computing Gröbner bases (F4). Journal of Pure and Applied Algebra 139(1-3), 61–88 (1999)
11. Grayson, D., Stillman, M.: Macaulay 2, http://www.math.uiuc.edu/Macaulay2
12. LAPACK – Linear Algebra Package, http://www.netlib.org/lapack
13. Visual Geometry Group, University of Oxford, http://www.robots.ox.ac.uk/~vgg
14. Byröd, M., Josephson, K., Åström, K.: Improving numerical accuracy of Gröbner basis polynomial equation solvers. In: Proc. 11th Int. Conf. on Computer Vision, Rio de Janeiro, Brazil (2007)
Stereo Matching Using Population-Based MCMC

Joonyoung Park1, Wonsik Kim2, and Kyoung Mu Lee2

1 DM Research Lab., LG Electronics Inc., 16 Woomyeon-Dong, Seocho-Gu, 137-724, Seoul, Korea
[email protected]
2 School of EECS, ASRI, Seoul National University, 151-742, Seoul, Korea
{ultra16, kyoungmu}@snu.ac.kr

Abstract. In this paper, we propose a new stereo matching method using population-based Markov Chain Monte Carlo (Pop-MCMC). Pop-MCMC belongs to the class of sampling-based methods. Since previous MCMC methods produce only one sample at a time, only local moves are available. Pop-MCMC, however, uses multiple chains and produces multiple samples at a time; this enables global moves by exchanging information between samples, which in turn leads to a faster mixing rate. From the viewpoint of optimization, it means that we can reach a state with lower energy. The experimental results on real stereo images demonstrate that the performance of the proposed algorithm is superior to those of previous algorithms.
1
Introduction
Stereo matching is one of the classical problems in computer vision [1]. The goal of stereo matching is to determine disparities, which are the distances between two corresponding pixels. If we get an accurate disparity map, we can recover 3-D scene information. However, it remains a challenging problem because of occluded regions, noise of the camera sensor, textureless regions, etc. Stereo matching algorithms can be classified into two approaches. One is the local approach, and the other is the global approach. In the local approach, disparities are determined by comparing the intensity values in local windows, using measures such as SAD (Sum of Absolute Differences), SSD (Sum of Squared Differences), and the Birchfield-Tomasi measure [2]. Although local approaches are fast, they have difficulties in obtaining an accurate disparity map. In the global approaches, one assumes that the disparity map is smooth in most regions. Usually, an energy function that is composed of local and global constraints is defined and solved by various energy minimization techniques. Typical global approaches include graph cuts [3,4,5], belief propagation [6], and dynamic programming [7,8]. The Monte Carlo method is one of the global approaches. It uses statistical sampling to obtain the solutions of some mathematical problems. Although this method was originally developed to generate samples from a given target distribution or to integrate functions in high dimensional space, it has also been applied to other types of problems such as optimization and learning problems.
However, there are some difficulties in applying Monte Carlo methods to vision problems as an optimizer. In general, we need to solve vision problems in a very high-dimensional solution space. Even if an image is assumed to be only 100 pixels in width and height, the dimension of the solution space becomes as high as 10^4. Simple Monte Carlo methods would take an infinitely long time, since the acceptance rate would be almost zero in such a high-dimensional case. Moreover, we need to design a proper proposal distribution close to the target distribution. To resolve these problems, Markov Chain Monte Carlo (MCMC) methods have been tried. In MCMC, a new sample is drawn from the previous sample and a local transition probability, based on the Markov chain. Contrary to simple Monte Carlo methods, the acceptance rates of MCMC methods are high enough, and the proposal distributions are designable even in high-dimensional problems. Therefore, MCMC methods are more suitable for vision problems than plain Monte Carlo methods. However, difficulties still remain. Since most MCMC methods allow only local moves in a large solution space, it still takes a very long time to reach the global optimum. To overcome the limitations of MCMC methods as an optimizer, Swendsen-Wang Cuts (SWC) was proposed [9,10]. In SWC, it is shown that bigger local moves are possible than in previous methods while maintaining detailed balance. SWC uses Simulated Annealing (SA) [11] to find optima. Although SWC allows bigger local moves, a very slow annealing process is needed to approach the global optimum with probability 1. This is a drawback for real applications. Therefore, a faster annealing process is usually applied for practical use in vision problems. However, fast annealing does not guarantee the global optimum, and the sample is often trapped at local optima. In this paper, we propose a new MCMC method called Population-Based MCMC (Pop-MCMC) [12,13] for the stereo matching problem, trying to resolve the above problems of SWC. Our goal is to find more accurate global optima than SWC. In Pop-MCMC, two or more samples are drawn at the same time, and information exchange occurs between the samples. That makes it possible to perform global moves of samples, which means that the mixing rate of the drawn samples becomes faster. In the view of optimization, it means that it takes a shorter time for the samples to approach the global optimum than with previous methods. This paper describes how Pop-MCMC is designed for stereo matching, and how its performance compares with other methods such as SWC and Belief Propagation. In section 2, we present how Pop-MCMC is applied to stereo matching. In section 3, we show experimental results on real problems. In the final section, we conclude this paper with discussions.
2
Proposed Algorithm
Segment-Based Stereo Energy Model. In order to improve the accuracy of the disparity map, various energy models have been proposed for the stereo problem. Among them, we choose the segment-based energy model, which is
one of the popular models [15,16,17,18]. In a segment-based energy model, the reference image is over-segmented; the mean-shift algorithm is often used for the segmentation [14]. We then assume that each segment can be approximated by a part of a plane in the real world. Each segment is defined as a node v ∈ V, and neighboring nodes s and t are connected with an edge ⟨s, t⟩ ∈ E. We thus construct a graph G = (V, E), and the energy function is defined by

E(X) = \sum_{v \in V} C_{SEG}(f_v) + \sum_{\langle s,t \rangle \in N} \beta_{s,t} \, 1(f_s \neq f_t),   (1)
where X represents the current state of every segment, f_v is the estimated plane for each segment, C_{SEG}(f_v) is a matching cost, and \beta_{s,t} is a penalty for neighboring nodes s and t with different labels, which are defined by

C_{SEG}(f_v) = \sum_{(x,y) \in v} C(x, y, f_v(x, y)),   (2)

\beta_{s,t} = \gamma \cdot \text{(mean color similarity)} \cdot \text{(border length)},   (3)
where the function C(x, y, f_v(x, y)) is the Birchfield-Tomasi cost. By varying γ, we can control the relative effect of the matching cost and the smoothness cost. We first need to make a list of the planes that can be assigned to the segments. For each segment, we calculate a new plane and add it to the list. The process to find a new plane is as follows. We represent a plane with the following equation:

d = c_1 x + c_2 y + c_3,   (4)

where x and y are the location of a pixel and d is the value of its disparity. From every pixel in a segment and the initially assigned disparity values, we can construct the following algebraic equation:

A \, [c_1, c_2, c_3]^T = B,   (5)

where the i-th row of the matrix A is [x_i, y_i, 1] and the i-th row of the matrix B is d_i. The values of c_1, c_2, c_3 can be obtained by the least squares method. Once we find the values of the parameters, we can identify outlier pixels based on them. Then the least squares fit is repeated, excluding the outliers, to improve the accuracy of c_1, c_2, c_3. After obtaining the list of planes, we group the segments and calculate the planes again in order to improve their accuracy. To this end, each segment is first assigned to the plane in the list that has the lowest C_{SEG} value. Then we group the segments which are assigned to the same plane, and for each group the above plane fitting is repeated. At last, we have the final list of planes to use.
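A minimal sketch of this fitting step (Python/numpy; the outlier threshold and the number of re-fits are our assumptions, not values from the paper):

```python
import numpy as np

def fit_plane(xy, d, outlier_thresh=1.0, refits=2):
    """Fit d = c1*x + c2*y + c3 by least squares, re-fitting after
    discarding pixels whose residual exceeds outlier_thresh.
    xy: (N, 2) pixel coordinates, d: (N,) initial disparities."""
    A = np.hstack([xy, np.ones((len(xy), 1))])
    keep = np.ones(len(d), dtype=bool)
    for _ in range(refits):
        c, *_ = np.linalg.lstsq(A[keep], d[keep], rcond=None)
        keep = np.abs(A @ c - d) < outlier_thresh   # drop outlier pixels
    return c                                        # (c1, c2, c3)
```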
[Flow chart: Initialization; draw U ~ [0,1]; branch to Mutation or Crossover depending on U; then Exchange.]
Fig. 1. The overall flow chart of Pop-MCMC applied to stereo matching
Design of Pop-MCMC. Given the probability distribution π(X) ∝ exp{−E(X)}, our aim is to find the state X where the probability is maximized. In Pop-MCMC, we draw multiple samples from multiple chains at the same time, with respect to the following distributions:

\pi_i(X) = \pi(X)^{1/T_i} \propto \exp\left( -\frac{E(X)}{T_i} \right),   (6)

where T_i is the temperature of the i-th chain. Each sample from each chain is called a chromosome. Chromosomes interact with each other, and this helps performing global moves. The overall flow of Pop-MCMC is illustrated in Fig. 1. The three moves in Pop-MCMC, namely the mutation move, the crossover move, and the exchange move, are repeatedly performed and samples are generated at each iteration. In this process, we first choose a random number U between 0 and 1 and compare U with the value of Q_m. Depending on the value of U, we choose one move among mutation and crossover. By varying the parameter value Q_m, we can easily control the ratio between the global move (crossover) and the local move (mutation). A proper value of Q_m can be adjusted according to the given problem, the model, or the number of chains. For example, if a large number of chains is used, more global moves will be needed than local moves, since the global moves enable the samples to exchange their information with each other. Next, we describe the detailed design of each move for the stereo problem.
1. Mutation move. In the case of the mutation move, we apply the original MCMC move: we borrow the MCMC kernel from SWC [9,10] and make a few modifications to it. Basically, the MCMC kernel is applied to a randomly chosen chain. In SWC, the graph nodes are probabilistically clustered as the first step. For each edge e = ⟨s, t⟩ ∈ E, if the labels of node s and node t are different, we delete that edge. Otherwise, we determine whether to delete it or not according to the edge probability q_e. Then we randomly choose a cluster and propose a new label according to a predefined proposal distribution. A label here means an estimated plane. Instead of accepting every proposal, we determine acceptance according to the acceptance ratio, which is given by the Metropolis-Hastings rule [9]. The edge probability and the proposal distribution of new labels are designed by the following equations, respectively:

q_e = 1 - \exp\left( - \frac{\text{(mean color similarity)}}{ \frac{C_{SEG}(f_{v_1})}{N(v_1)} + \frac{C_{SEG}(f_{v_2})}{N(v_2)} + 2 } \right),   (7)

q(l \mid V_0, A) = \exp\left( - \left( \frac{\sum_{v \in V_0} C_{SEG}(f_v = l)}{\sum_{v \in V_0} N(v)} + 1 - \sum_{\langle v_1, v_2 \rangle \in N,\; v_1 \in V_0,\; v_2 \notin V_0} 1(l = f_{v_2}) \right) \right),   (8)

where v_1 and v_2 represent neighboring nodes, N(v) is the number of pixels in the node v, l is the newly proposed label, and A is the current sample. In equation (7), the more similar the intensities of the connected nodes and the lower the matching costs are, the higher the probability that the edge remains. The size of the segments is not considered, because of the normalizing terms. In equation (8), when the nodes in the cluster V_0 have low matching costs and there exist neighboring nodes which have the same label, the value of q(l | V_0, A) becomes high.

2. Exchange move. The exchange move originates from parallel tempering [19,20]. In this move, we choose two chains and determine whether to exchange the chromosomes of the two chains by the Metropolis-Hastings rule. Note that for the exchange move there is no need for a special design for stereo matching. So, when the i-th and j-th chains are selected, we simply calculate the Metropolis-Hastings rate by

\gamma_e = \exp\left( \left( E(X_i) - E(X_j) \right) \left( \frac{1}{T_i} - \frac{1}{T_j} \right) \right),   (9)

where X_i and T_i are the current state and temperature of the i-th chain.
3. Crossover move. In the crossover move, two chains are first selected as in the exchange move. Then, instead of exchanging the whole chromosomes of the two chains, we exchange only selected parts of the chromosomes. After that, it is decided with the acceptance ratio whether the new sample is accepted or not. Typical types of crossover moves are the 1-point and 2-point crossover moves. However, these methods are based on the fact that the chromosomes are 1-dimensional vectors, so it is improper to apply them to the stereo matching problem, in which the chromosomes are 2-dimensional images. On the other hand, 1-point and 2-point crossover moves have the advantage of low computational complexity, because the proposal distributions cancel each other out. Therefore, we designed the crossover move to maintain this advantage and to be suitable for chromosomes whose dimension is two. The detailed algorithm is as follows. We first choose two chains randomly, and we construct the cluster V_0 in a similar way as in SWC. However, there are two differences in constructing V_0 compared with SWC. First, q_e is constant, not adaptively determined from the matching costs or the intensities of the input image, since there is no need for the nodes of the cluster V_0 to have the same label here. It is computationally efficient to use a constant q_e because the proposal distribution part in the Metropolis-Hastings rate is canceled. Second, when we calculate the probability q_e, we do not consider whether the labels of the nodes are the same or not, so the resulting cluster can contain nodes with different labels. Note that in the mutation move we have to remove the edges connecting nodes that have different labels in order to compute the Metropolis-Hastings rate efficiently. In the case of the crossover move, however, the proposal distribution part is completely eliminated from the Metropolis-Hastings rate, which eventually enables high efficiency in computation. This free construction of V_0 also helps faster convergence. The process after constructing the cluster V_0 is similar to the 1-point crossover move. From the chromosomes of the two selected chains, new chromosomes are proposed by exchanging the labels of the nodes which belong to the cluster V_0. Then the acceptance ratio α = min(1, γ_c) of the newly proposed chromosomes is calculated, and the next sample is determined. By substituting equation (6) into the Metropolis-Hastings rule, we can obtain γ_c as follows:

\gamma_c = \exp\left( \frac{E(X_i) - E(Y_i)}{T_i} + \frac{E(X_j) - E(Y_j)}{T_j} \right).   (10)
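A minimal sketch of the outer loop built from these three moves (Python/numpy; `mutate` and `crossover` are problem-specific stubs and the value of Q_m here is arbitrary, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng()

def pop_mcmc_step(states, temps, energy, mutate, crossover, Qm=0.5):
    """One iteration over a population of chains (cf. eqs. (6), (9), (10))."""
    if rng.uniform() < Qm:                                   # local move: mutation
        i = rng.integers(len(states))
        states[i] = mutate(states[i], temps[i])
    else:                                                    # global move: crossover
        i, j = rng.choice(len(states), size=2, replace=False)
        Yi, Yj = crossover(states[i], states[j])
        gamma_c = np.exp((energy(states[i]) - energy(Yi)) / temps[i]
                         + (energy(states[j]) - energy(Yj)) / temps[j])
        if rng.uniform() < min(1.0, gamma_c):
            states[i], states[j] = Yi, Yj
    i, j = rng.choice(len(states), size=2, replace=False)    # exchange move, eq. (9)
    gamma_e = np.exp((energy(states[i]) - energy(states[j]))
                     * (1.0 / temps[i] - 1.0 / temps[j]))
    if rng.uniform() < min(1.0, gamma_e):
        states[i], states[j] = states[j], states[i]
    return states
```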
3
Experimental Results
In this section, we evaluate the performance of the proposed algorithm by comparing it with other methods such as SWC and BP (Belief Propagation) [6]. For the experiments, the test images from the Middlebury website are used [21]. We employed the segment-based energy model of (1) for this test. The resulting disparity maps are shown in Fig. 2, and Table 1 shows the bad pixel rates of the test
Fig. 2. Results of the proposed algorithm: the disparity maps of (a) Tsukuba (b) Venus (c) Teddy (d) Cones
images. Note that there are some limitations in the segment-based energy model. When the real world objects are piecewise planar, the results can be quite good. However, for the cases of Teddy and Cones, which include objects with curved surfaces, the performance is worse. Also, for a fronto-parallel plane, a non-segment-based energy model can be superior to the segment-based energy model because of the smaller number of labels. In addition, occlusion and visibility are not considered here, which may result in bad pixels in the disparity map.

Table 1. The bad pixel rate for each test image

Test images     Tsukuba   Venus   Teddy   Cones
Bad Pixels (%)    1.38     1.21    14.7    13.1

We compared the performance of Pop-MCMC with those of SWC and BP from the viewpoint of energy minimization. The same energy model was applied to each method. In the case of SWC, we followed Barbu's work [9,10]. In Fig. 3, the
[Two plots of energy (vertical axis) against time in seconds (horizontal axis); curves: Population-Based MCMC, Swendsen-Wang Cuts, Belief Propagation.]
Fig. 3. Convergence comparison of energy minimization methods for (a) Tsukuba, (b) Cones
convergence graphs of each method are presented. In the early part of each graph, SWC is faster than Pop-MCMC. That is caused by the characteristics of the energy function. In the energy function used in stereo matching, the local minima are concentrated near the global minimum. On a large scale, it can be regarded as an energy function which has only one minimum (more exactly, one bunch of minima). So SWC gets close to the vicinity of the global minimum faster than Pop-MCMC. However, once the samples approach the global minimum, there exist many local minima which are no longer concentrated in one part. At this stage, SWC is easily trapped in a local minimum, while Pop-MCMC is likely to approach nearer to the global minimum. As illustrated in Fig. 3, we can reach a state of lower energy with the Pop-MCMC method than with the other methods.
4
Conclusion
In this paper, we presented a stereo matching algorithm using Pop-MCMC. Pop-MCMC uses multiple chains and establishes a faster mixing rate by exchanging information between chromosomes. As a consequence, it is shown that the proposed Pop-MCMC method reaches the global optimum faster than other energy minimization methods, including SWC and BP. We plan to apply the proposed method to more sophisticated stereo energy models, including occlusion handling and visibility terms, as well as to the segmentation problem, and to analyze its performance. Acknowledgments. This work was supported in part by the ITRC program by the Ministry of Information and Communication and in part by the Defense Acquisition Program Administration and the Agency for Defense Development, through the Image Information Research Center, Korea.
References
1. Scharstein, D., Szeliski, R.: A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. Microsoft Research Technical Report MSR-TR-2001-81 (2001)
2. Birchfield, S., Tomasi, C.: A Pixel Dissimilarity Measure That is Insensitive to Image Sampling. IEEE Trans. Pattern Analysis and Machine Intelligence (1998)
3. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Analysis and Machine Intelligence 23(11), 1222–1239 (2001)
4. Kolmogorov, V., Zabih, R.: Computing visual correspondence with occlusions using graph cuts. In: Proc. Int'l. Conf. Computer Vision, pp. 508–515 (2001)
5. Kolmogorov, V., Zabih, R.: Multi-camera scene reconstruction via graph cuts. In: Proc. European Conf. Computer Vision, pp. 82–96 (2002)
6. Sun, J., Zheng, N.N., Shum, H.Y.: Stereo matching using belief propagation. IEEE Trans. Pattern Analysis and Machine Intelligence 25(7), 787–800 (2003)
7. Ohta, Y., Kanade, T.: Stereo by intra- and inter-scanline search. IEEE Trans. Pattern Analysis and Machine Intelligence 2, 449–470 (1985)
8. Veksler, O.: Stereo Correspondence by Dynamic Programming on a Tree. In: Computer Vision and Pattern Recognition (2005)
9. Barbu, A., Zhu, S.C.: Graph Partition by Swendsen-Wang Cuts. In: International Conf. Computer Vision, pp. 320–327 (2003)
10. Barbu, A., Zhu, S.C.: Multigrid and multi-level Swendsen-Wang cuts for hierarchic graph partition. In: Computer Vision and Pattern Recognition, pp. 731–738 (2004)
11. Kirkpatrick, S., Gelatt, C., Vecchi, M.: Optimization by simulated annealing. Science 220(4598), 671–680 (1983)
12. Liang, F., Wong, W.H.: Evolutionary Monte Carlo: Applications to Model Sampling and Change Point Problem. Statistica Sinica 10, 317–342 (2000)
13. Jasra, A., Stephens, D.A.: On Population-Based Simulation for Static Inference (2005)
14. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Machine Intell. 24, 603–619 (2002)
15. Tao, H., Sawhney, H.S., Kumar, R.: A Global Matching Framework for Stereo Computation. In: Proc. Int'l. Conf. Computer Vision, vol. 1, pp. 532–539 (2001)
16. Bleyer, M., Gelautz, M.: Graph-based surface reconstruction from stereo pairs using image segmentation. In: Proc. SPIE, Videometrics VIII, vol. 5665, pp. 288–299 (2005)
17. Hong, L., Chen, G.: Segment-Based Stereo Matching Using Graph Cuts. In: Computer Vision and Pattern Recognition, vol. I, pp. 74–81 (2004)
18. Klaus, A., Sormann, M., Karner, K.: Segment-Based Stereo Matching Using Belief Propagation and a Self-Adapting Dissimilarity Measure. In: International Conf. Pattern Recognition, pp. 15–18 (2006)
19. Geyer, C.J.: Markov chain Monte Carlo maximum likelihood. In: Computing Science and Statistics, pp. 156–163 (1991)
20. Hukushima, K., Nemoto, K.: Exchange Monte Carlo method and application to spin glass simulations. J. Phys. Soc. Jpn. 65, 1604–1608 (1996)
21. http://cat.middlebury.edu/stereo
Dense 3D Reconstruction of Specular and Transparent Objects Using Stereo Cameras and Phase-Shift Method

Masaki Yamazaki, Sho Iwata, and Gang Xu

Faculty of Information Science and Engineering, Ritsumeikan University, Shiga, Japan
Abstract. In this paper, we first describe our approach to measuring the surface shape of specular objects and then we extend the method to measuring the surface shape of transparent objects by using stereo cameras and a display. We show that two viewpoints can uniquely determine the surface shape and surface normal by investigating the light path for each surface point. We can determine the light origin for each surface point by showing two-dimensional phase shifts on the display. We obtained dense and accurate results for both planar surfaces and curved surfaces.
1
Introduction
Three-dimensional acquisition systems can be categorized into two types: passive and active. Passive systems can recover the 3D shape of an object by extracting feature points and finding corresponding points from multiple images. However, this approach does not work for textureless surfaces such as industrial machine parts. On the other hand, an active system utilizes a light or laser projector to project designed patterns onto object surfaces; it can thus retrieve high-precision correspondences and measure the shape of an object with higher accuracy [1,2,3,4]. Most of these active systems are designed to obtain the shape of opaque surfaces and are based on the analysis of the diffuse reflection components. However, light reflected from specular surfaces such as glass and metal is redirected in other directions and does not reach the camera. In addition, light not only reflects at the surface of a transparent object but also transmits into the object, which causes multiple reflections and transmissions inside the object. In order to recover the 3D shape, powder is usually applied to turn the surfaces into diffuse surfaces. This is troublesome, and the thickness of the powder influences the accuracy of the 3D measurement. In computer vision research, several methods for measuring specular and transparent objects have been proposed. For measuring specular objects, there are stereo-based methods [5,6], light section-based methods [7], highlight-based methods [8], normal-based methods [9] and space carving-based methods [10]. However, these methods need special devices and restrict the objects to be measured. For measuring transparent objects, there are polarization-based methods [11],
silhouette-based methods [12], motion-based methods [13,14] and light section-based methods [15]. However, these methods can only measure the front surface shape or a parameterized surface shape of an object. In recent work on this subject, the 3D shape of specular and transparent objects is recovered by analyzing the distortion of a known pattern placed near the object surface. This approach is also called shape-from-distortion [16,17,18,19,20]. In these methods, accurate correspondences between image pixels and points on the pattern are required. Tarini et al. [16] estimated the accurate 3D shape of a specular object by using the phase-shift method and a single camera. However, since it is generally impossible to reconstruct the 3D shape of an unknown specular scene from just one image, they used iterative computation to estimate the object shape. Kutulakos et al. [19] showed that the 3D shape of specular and transparent objects can be recovered from three viewpoints if the incoming light undergoes two reflections or refractions. They used an environment matting procedure [21] to find correspondences between the image and the display. The environment matting procedure needs many viewing patterns, and thus it is difficult to obtain accurate correspondences; they only showed results for planar surfaces. In this paper, we describe a novel method for estimating the surface shape of specular and transparent objects by using stereo cameras and two-dimensional phase shifts. We find corresponding points between the images and the display by showing two-dimensional phase shifts on the display. In specular reflection, the incident light, the reflected light and the surface normal are always coplanar, and the normal bisects the incident light and the reflected light. In transparent refraction, the incident light, the transmitted light and the surface normal are always coplanar, and a ray from a camera and a ray from the display intersect at the front and back surface points of the object. Using these constraints, we can recover the shape of both specular and transparent objects. Our work has two key contributions. First, we show that the 3D shape of specular objects, and of transparent objects through which light refracts twice with a known refractive index, can be recovered from two viewpoints. Second, using the phase-shift method we can estimate accurate and dense correspondences between the cameras and the display with a small number of projected patterns.
2
Specular Object Measurement Geometry
In specular reflection, the incident light, the reflected light and the surface normal are always coplanar. In addition, the angle between the incident light and the surface normal is always equal to that between the reflected light and the surface normal. With stereo cameras, the lights reflected from a common surface point go off in different directions; however, they share the same surface normal (see Fig. 1(a)). So if we know the light path, the shape of specular objects can be estimated. In order to locate the light path, we use a PC display and show two-dimensional phase shifts. Each point of the display can be identified by a phase in the horizontal direction and a phase in the vertical direction. The geometric relation between the display and the cameras is calibrated in advance, so that by checking
Fig. 1. Specular object geometry: (a) with stereo cameras, (b) with a single camera
the horizontal and vertical phases of each image point, we know the 3D position of the light source on the display for that image point. A ray from the display and a ray from the camera intersect at a surface point, and they determine a plane. If we know the depth of the point, the surface normal of the point is uniquely determined. That is, the surface normal is a function of the depth (see Fig. 1(b)). This is the reason why it is impossible to reconstruct the 3D shape of a specular object from just one camera. In this paper, we use two cameras. A common surface point has a common surface normal, so a surface point is reconstructed by searching for the common surface normal and depth from the stereo images. We set the coordinate system of the left camera as the world coordinate system. Let the position of the origin of a reflected light be Y. For an image point with normalized coordinates (x, y) in the left camera, the 3D position X of the specular point can be expressed by

X = s \tilde{x},   (1)

where s is the unknown depth and \tilde{x} = (x, y, 1)^T is the homogeneous coordinate vector of the normalized coordinates (x, y). The normalized normal vector \hat{N} of this point is

\hat{N} = \frac{N}{\|N\|}, \qquad N = \frac{Y - s\tilde{x}}{\|Y - s\tilde{x}\|} - \frac{\tilde{x}}{\|\tilde{x}\|}.   (2)

A point with coordinates X_c in the right camera coordinate system has world coordinates

X = R_c X_c + t_c,   (3)

where R_c and t_c are the rotation matrix and translation vector between the right and left cameras, respectively. Let the position of the origin of a reflected light
be Y' for a point with normalized coordinates (x', y') in the right image. The 3D position of this surface point in the world coordinate system is

X' = s' R_c \tilde{x}' + t_c,   (4)

where s' is the unknown scale and \tilde{x}' = (x', y', 1)^T is the homogeneous coordinate vector of the normalized coordinates (x', y'). The normalized normal vector \hat{N}' of this point is

\hat{N}' = \frac{N'}{\|N'\|}, \qquad N' = \frac{Y' - (s' R_c \tilde{x}' + t_c)}{\|Y' - (s' R_c \tilde{x}' + t_c)\|} - \frac{\tilde{x}'}{\|\tilde{x}'\|}.   (5)

At a common point, we have

X = X',   (6)

\hat{N} = \hat{N}'.   (7)

The shape of a specular object is estimated by searching for corresponding points along the epipolar lines that satisfy Eq. (6) and Eq. (7).
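As an illustration only (Python/numpy; the way the position and normal constraints are combined into a single score, and the rotation of the right-camera ray into the world frame, are our own simplifications of the search described above):

```python
import numpy as np

def normal_from_depth(x_tilde, Y, s):
    """Surface normal implied by depth s along the viewing ray x_tilde,
    for a display point Y, cf. eq. (2)."""
    X = s * x_tilde
    n = (Y - X) / np.linalg.norm(Y - X) - x_tilde / np.linalg.norm(x_tilde)
    return n / np.linalg.norm(n)

def consistency(s, s_p, x, x_p, Y, Y_p, Rc, tc):
    """How well depths (s, s_p) satisfy the position (6) and normal (7) constraints."""
    X_left = s * x
    X_right = s_p * (Rc @ x_p) + tc
    N_left = normal_from_depth(x, Y, s)
    ray_right = Rc @ x_p                 # right viewing ray expressed in world coordinates
    n = (Y_p - X_right) / np.linalg.norm(Y_p - X_right) - ray_right / np.linalg.norm(ray_right)
    N_right = n / np.linalg.norm(n)
    return np.linalg.norm(X_left - X_right) + np.linalg.norm(N_left - N_right)
```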
3
Transparent Object Measurement Geometry
In transparent refraction, the incident light, the transmitted light and the surface normal are always coplanar. For a transparent object in which the incoming light undergoes two refractions, a ray from the display is refracted at the back surface of the object and then at the front surface of the object, and finally reaches the cameras. In order to determine the direction of a light path from the display, we move the display forwards or backwards. The purpose of transparent object measurement is to reconstruct the front and back surfaces of the object. We set the coordinate system of the left camera as the world coordinate system. For each image point, let the positions of the origin of a reflected light be Y_1 and Y_2 on the display before and after moving, respectively. For a point in the left image with normalized coordinates (x, y), the 3D position of the front surface can be expressed by

X_1 = s_1 \tilde{x},   (8)

where s_1 is the unknown depth and \tilde{x} = (x, y, 1)^T is the homogeneous coordinate vector of the normalized coordinates (x, y). The incoming light path \hat{y} (a normalized vector) is obtained as

\hat{y} = \frac{Y_1 - Y_2}{\|Y_1 - Y_2\|}.   (9)

The 3D position of the back surface on this light path can be expressed by

X_2 = s_2 \hat{y} + Y_2,   (10)
Fig. 2. Transparent object geometry: (a) with three viewpoints, (b) with two viewpoints
where s_2 is the unknown depth. For a given pair X_1(s_1), X_2(s_2), the refraction vector \hat{v} (a normalized vector) from the front surface point to the back surface point is defined. The normalized normal vector \hat{N}_1(s_1, s_2) of the front point X_1(s_1) is calculated by

\hat{N}_1(s_1, s_2) = -R(\theta_i, \hat{x} \times \hat{v}) \, \hat{x}, \quad
\theta_i = \tan^{-1}\!\left( \frac{r \sin\theta_\delta}{r \cos\theta_\delta - 1} \right), \quad
\theta_\delta = \cos^{-1}(\hat{x}^T \hat{v}), \quad
\hat{x} = \frac{\tilde{x}}{\|\tilde{x}\|},   (11)

where R(θ, ψ) is the rotation matrix of an angle θ about an axis ψ and r is the refractive index. The normalized normal vector \hat{N}_2(s_1, s_2) of the back point X_2(s_2) is obtained by the same procedure. Kutulakos et al. [19] used the relationship that the rays from a camera and a display must intersect at common points on the object's back surface. Given front/back points X_1(s_1), X_2(s_2) in the first viewpoint, the front surface normal \hat{N}_1(s_1, s_2) is calculated. If this pair (s_1, s_2) is wrong, the refraction ray through the front point X_1(s_1) from a second viewpoint and the ray from the display do not intersect at the back surface point in that viewpoint. The unknown parameters are two variables (s_1, s_2) (assuming that the refractive index is known). Therefore, given a pair (X_1(s_1), X_2(s_2)) in one viewpoint, we can check whether the refraction rays through the front point X_1(s_1) and the rays from the display intersect in the other two viewpoints. The problem is solved by searching for a pair (s_1, s_2) for which we do have these intersections in the other two viewpoints (see Fig. 2(a)). In our approach, we have only two viewpoints, so we use both the front and back surface points to check intersections. We set a pair X_1(s_1), X_2(s_2) and calculate the front surface normal \hat{N}_1(s_1, s_2) and the back surface normal \hat{N}_2(s_1, s_2)
for the left camera. We check whether or not the refraction ray through the front point X_1(s_1) from the right camera and the ray from the display intersect each other. Moreover, we check whether or not the refraction ray through the back point X_2(s_2) from the display and the ray from the right camera intersect each other. So this problem is solved by searching for a true pair (s_1, s_2) for which we do have these intersections in the other viewpoint (see Fig. 2(b)). In order to obtain the ray through the front point X_1(s_1) from the right camera, we project the front point X_1(s_1) into the right image coordinate system. The homogeneous coordinate vector \tilde{x}' = (x', y', 1)^T of the normalized coordinates corresponding to the point X_1(s_1) is then obtained in the right camera frame. If this point has no phase information, we cannot do anything. Let the positions of the origin of a reflected light be Y'_1, Y'_2 on the display before and after moving, respectively. The direction of the light path from the display (a normalized vector) is \hat{y}' = (Y'_1 - Y'_2)/\|Y'_1 - Y'_2\|. Therefore, the refraction vector \hat{v}_1 corresponding to the front surface normal \hat{N}_1(s_1, s_2) is calculated by Snell's law:

\hat{v}_1 = \frac{\hat{x}' + (c - g)\hat{N}_1}{r}, \quad
g = \sqrt{r^2 + c^2 - 1}, \quad
c = -\hat{x}'^T \hat{N}_1, \quad
\hat{x}' = \frac{\tilde{x}'}{\|\tilde{x}'\|},   (12)
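A minimal numeric sketch of this refraction step (Python/numpy; d is the incident unit direction, n the unit normal oriented against it, and r the relative refractive index, matching eq. (12)):

```python
import numpy as np

def refract(d, n, r):
    """Refracted unit direction for incident unit direction d at a surface
    with unit normal n (oriented against d), relative refractive index r."""
    c = -d @ n
    g = np.sqrt(r**2 + c**2 - 1.0)
    return (d + (c - g) * n) / r

# example: a ray hitting a flat interface of index 1.5 at 30 degrees incidence
d = np.array([np.sin(np.radians(30.0)), 0.0, -np.cos(np.radians(30.0))])
n = np.array([0.0, 0.0, 1.0])
print(refract(d, n, 1.5))       # bends towards the normal; result stays unit length
```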
In the right camera view, in order to test the ray through the back point X_2(s_2) from the display, we compute the distance between the point X_2(s_2) and the rays from the display. If no distance is small enough, there is no ray that passes through the point X_2(s_2), and we cannot determine this point. The refraction vector \hat{v}_2 for the back normal \hat{N}_2(s_1, s_2) is calculated by the same procedure (Eq. (12)). In order to check whether or not two rays intersect each other, we calculate the distance between the two lines. For the front surface, we set a line P_1 as the ray from the right camera, a line Q_1 as the refraction ray through the back point X_2(s_2) from the display, and a distance d_1 as the distance between these two lines. The intersection on the front surface is calculated by

d_1 = \|P_1 - Q_1\|^2, \quad
P_1 = t_c + p_1 \hat{x}', \quad
Q_1 = X_2 + q_1 \hat{v}_2, \quad
\begin{pmatrix} p_1 \\ q_1 \end{pmatrix} =
\begin{pmatrix} 1 & -\hat{x}'^T \hat{v}_2 \\ -\hat{x}'^T \hat{v}_2 & 1 \end{pmatrix}^{-1}
\begin{pmatrix} (X_2 - t_c)^T \hat{x}' \\ (t_c - X_2)^T \hat{v}_2 \end{pmatrix},   (13)

where t_c is the position of the right camera in the world coordinate system. If the distance d_1 is below a threshold, we say that these two rays intersect each other. However, if the two rays are parallel, the distance cannot be determined. Whether or not the two lines are parallel can be checked by

\det \begin{pmatrix} 1 & -\hat{x}'^T \hat{v}_2 \\ -\hat{x}'^T \hat{v}_2 & 1 \end{pmatrix}.   (14)
If the determinant is near 0, the rays are parallel. For the back surface, we set a line P_2 (P_2 = Y'_2 + p_2 \hat{y}') as the ray from the display, a line Q_2 (Q_2 = X_1 + q_2 \hat{v}_1) as the refraction ray that passes through the front point X_1(s_1) from the right camera, and a distance d_2 as the distance between these two lines. This intersection is calculated by the same procedure as Eq. (13) and Eq. (14). The solution (s_1, s_2) is decided by finding the pair that minimizes the distances (d_1, d_2). Since it is time-consuming to search all possible combinations, we use steepest descent to speed up the search, and since the surface should be smooth, we use the result of a neighboring point as the initial value. For points where the rays are parallel, we cannot determine the shape. However, even for these points, results may be obtained from the other camera if the rays are not parallel there.
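A minimal sketch of the distance test (13)-(14) (Python/numpy; the parallel-ray tolerance is our assumption):

```python
import numpy as np

def line_distance(a, da, b, db, eps=1e-8):
    """Closest-point distance between lines a + p*da and b + q*db
    (da, db unit vectors); returns None when the lines are near parallel,
    the degenerate case detected by eq. (14)."""
    M = np.array([[1.0, -da @ db],
                  [-da @ db, 1.0]])
    if abs(np.linalg.det(M)) < eps:
        return None
    p, q = np.linalg.solve(M, np.array([(b - a) @ da, (a - b) @ db]))
    return np.linalg.norm((a + p * da) - (b + q * db))
```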
4
Experimental Results
We used two Nikon CCD cameras (3008 × 2000 pixels, 28 mm lenses) and a DELL PC display (17 inch, 1280 × 1024 pixels) for showing the phase patterns. Background light was reduced to a minimum during the experiments. In order to calibrate between the cameras and the display, we set up the display so that it was directly visible from the cameras. This calibration method is very simple; however, it restricts the experimental arrangement. This restriction can be removed by using a mirror of known shape so that the display is visible via the mirror [16]. Each pixel of the display is assigned a horizontal phase and a vertical phase by projecting 4 horizontal and 4 vertical phase-shift images. The phase is unwrapped by projecting 16 horizontal and 16 vertical Gray code images. We can thus obtain corresponding points between the image and the display by checking the horizontal and vertical phases of each image pixel. Once the correspondence is obtained, we can calculate the 3D position of each display point with the stereo cameras. For the measurement of transparent objects, we moved the display forwards or backwards, giving each image point not only the light origin but also the light direction from the display.
4.1
Experimental Results of Specular Objects
In order to determine the accuracy of our system, we used a planar mirror (15 cm × 10 cm) and a hemispherical mirror (100 mm diameter). Fig. 3(a) shows our experimental system. Fig. 3(b) shows a vertical stripe image and Fig. 3(c) shows a side view of the recovered 3D shape of the planar mirror. Fig. 3(d) shows a horizontal stripe image of the hemispherical mirror and Fig. 3(e) shows its recovered 3D shape. Fig. 3(f) shows a planar slice through the hemispherical mirror. To evaluate these results, we calculated the root mean squared errors (RMS errors) by fitting the point clouds to a plane and a sphere, respectively. The RMS error for the planar mirror was 0.22 mm and the RMS error for the hemispherical mirror was 0.31 mm.
Fig. 3. Result of specular objects: (a) Experimental setup, (b) vertical stripe image of a planar mirror, (c) side view of the 3D shape of the planar mirror, (d) horizontal stripe image of a hemispherical mirror, (e) 3D shape of the hemispherical mirror, (f) a planar slice through a hemispherical mirror
Fig. 4. Result of transparent object: (a) Vertical stripe image, (b) horizontal stripe image, (c) 3D shape of front surface, (d) 3D shape of back surface, (e) side view of the 3D shape (the left is the front surface and the right is the back surface), (f) planar slices (the left is the front surface and the right is the back surface).
From these results, we can see that the mirrors were reconstructed with quite high accuracy. In the result of Fig. 3(e), a part of the 3D shape was not reconstructed. This is because a part of the stripe image was crushed (see Fig. 3(d)), so the phases could not be determined for these points. Cameras of even higher resolution should be used to avoid this problem.
4.2
Experimental Results of Transparent Objects
We used an acrylic sphere with a refractive index of 1.5 and a diameter of 70 mm. Fig. 4(a) shows a vertical stripe image and Fig. 4(b) shows a horizontal stripe image. Fig. 4(c) shows the 3D shape of the front surface, Fig. 4(d) shows the 3D shape of the back surface, and Fig. 4(e) shows a side view of the recovered 3D shape. Fig. 4(f) shows planar slices through the front surface and the back surface. To evaluate these results, we calculated the RMS error by fitting a sphere to the points. The RMS error was 1.47 mm. From these results, we can see that the shape was correctly reconstructed. In the results of Fig. 4(c) and Fig. 4(d), the center part of the 3D shape was not reconstructed. This is because the rays at these points were parallel. A remedy for this problem is to rotate the object using a turntable and then to integrate the resulting 3D shapes.
5
Conclusion
In this paper, we have shown that the 3D shape of specular objects and transparent objects with a known refractive index can be recovered by stereo cameras and two-dimensional phase shifts. To our knowledge, this contribution is original. We have obtained very good results for both planar surfaces and curved surfaces, because dense and accurate correspondences between the cameras and the display were established by the phase-shift method. Our future work is to simultaneously measure a refractive index, which is vital to high accuracy 3D measurement. We are also looking for ways to deal with the inter-reflections and the internal reflections.
References
1. Batlle, J., Mouaddib, E., Salvi, J.: Recent progress in coded structured light as a technique to solve the correspondence problem: A survey. Pattern Recognition 31, 963–982 (1998)
2. Brenner, C., Bohm, J., Guhring, J.: Photogrammetric calibration and accuracy evaluation of a cross-pattern stripe projector. Photonics West Videometrics VI 3641 (1999)
3. Guhring, J., Brenner, C., Bohm, J., Fritsch, D.: Data Processing and Calibration of a Cross-pattern Stripe Projector. ISPRS Congress 2000, IAPRS 33 (2000)
4. Sato, K., Inokuchi, S.: Three-dimensional surface measurement by space encoding range imaging. Journal of Robotic Systems 2, 27–39 (1985)
5. Ikeuchi, K.: Determining surface orientations of specular surfaces by using the photometric stereo method. IEEE Trans. PAMI 3, 661–669 (1981)
6. Oren, M., Nayar, S.K.: A Theory of Specular Surface Geometry. Int. Journal of Computer Vision 24, 105–124 (1997)
7. Baba, M., Ohtani, K., Imai, M.: New laser rangefinder for three-dimensional shape measurement of specular objects. Optical Engineering 40, 53–60 (2001)
8. Zheng, J.Y., Murata, A.: Acquiring a Complete 3D Model from Specular Motion under the Illumination of Circular-Shaped Light Sources. IEEE Trans. PAMI 22, 913–920 (2000)
9. Halstead, M., Barsky, B., Klein, S., Mandell, R.: Reconstructing curved surfaces from specular reflection patterns using spline surface fitting of normals. In: SIGGRAPH 1996, pp. 335–342 (1996)
10. Bonfort, T., Sturm, P.: Voxel carving for specular surfaces. In: IEEE Int. Conf. Computer Vision, pp. 591–596. IEEE Computer Society Press, Los Alamitos (2003)
11. Miyazaki, D., Kagesawa, M., Ikeuchi, K.: Transparent surface modeling from a pair of polarization images. IEEE Trans. PAMI 26, 73–82 (2004)
12. Matusik, W., Pfister, H., Ziegler, R., Ngan, A., McMillan, L.: Acquisition and rendering of transparent and refractive objects. In: Eurographics Workshop on Rendering, pp. 267–278 (2002)
13. Ben-Ezra, M., Nayar, S.: What does motion reveal about transparency? In: IEEE Int. Conf. Computer Vision, pp. 1025–1032. IEEE Computer Society Press, Los Alamitos (2003)
14. Murase, H.: Surface Shape Reconstruction of a Nonrigid Transparent Object Using Refraction and Motion. IEEE Trans. PAMI 14, 1045–1052 (1992)
15. Narita, D., Baba, M.: Measurement of 3-D Shape and Refractive Index of a Transparent Object using Laser Rangefinder. In: IEEE Instrumentation and Measurement Technology Conference, pp. 2247–2252. IEEE Computer Society Press, Los Alamitos (2003)
16. Tarini, M., Lensch, H.P.A., Goesele, M., Seidel, H.-P.: 3D Acquisition of Mirroring Objects using Striped Patterns. Journal of Graphical Models 67, 233–259 (2005)
17. Bonfort, T., Sturm, P., Gargallo, P.: General Specular Surface Triangulation. In: Proc. Asian Conf. Computer Vision, pp. 872–881 (2006)
18. Hata, S., Saitoh, Y., Kumamura, S., Kaida, K.: Shape extraction of transparent object using genetic algorithm. In: Proc. Int. Conf. Pattern Recognition, pp. 684–688 (1996)
19. Kutulakos, K.N., Steger, E.: A Theory of Refractive and Specular 3D Shape by Light-Path Triangulation. In: IEEE Int. Conf. Computer Vision, pp. 1448–1455 (2005)
20. Morris, N., Kutulakos, K.N.: Dynamic refraction stereo. In: IEEE Int. Conf. Computer Vision, pp. 1573–1580 (2005)
21. Zongker, D., Werner, D., Curless, B., Salesin, D.: Environment matting and compositing. In: SIGGRAPH 1999, pp. 205–214 (1999)
Identifying Foreground from Multiple Images

Wonwoo Lee1, Woontack Woo1, and Edmond Boyer2

1 GIST U-VR Lab., 500-712, S. Korea
{wlee, wwoo}@gist.ac.kr
2 LJK - INRIA Rhône-Alpes, Montbonnot, France
[email protected]
Abstract. In this paper, we present a novel foreground extraction method that automatically identifies image regions corresponding to a common space region seen from multiple cameras. We assume that background regions present some color coherence in each image, and we exploit the spatial consistency constraint that several image projections of the same space region must satisfy. Integrating both color and spatial consistency constraints allows us to segment foreground and background regions in multiple images fully automatically. In contrast to standard background subtraction approaches, the proposed approach requires neither a priori knowledge of the background nor user interaction. We demonstrate the effectiveness of the method for multiple camera setups with experimental results on standard real data sets.
1 Introduction
Identifying foreground regions in single or multiple images is a necessary preliminary step in several computer vision applications, for instance object tracking, motion capture or 3D modeling. In particular, several 3D modeling applications optimize an initial model obtained using silhouettes extracted as foreground image regions. Traditionally, foreground regions are segmented under the assumption that the background is static and known beforehand in each image. This operation is usually performed on an individual basis, even when multiple images of the same scene are considered. In this paper, we take a different strategy and propose a method that simultaneously extracts foreground regions in multiple images without any a priori knowledge of the background. The interest arises in many applications where multiple images are considered and where background information is not available, for instance when only a single image is available per viewpoint. The approach we propose relies on a few assumptions that are often satisfied. First, the region of interest should appear entirely in several images. Second, in each image, the background colors should be consistent, i.e., the background is homogeneous to some extent, and differ from the foreground colors. Under these
This project is funded in part by ETRI OCR and in part by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) (KRF-2006-612D00081).
assumptions, we iteratively segment each image into two regions such that one, the background, satisfies color consistency constraints, and the other, the foreground, satisfies geometric consistency constraints with respect to the other images. To initiate the iterative process, we use the first assumption above to identify regions in the images that are necessarily background. Such regions are simply the image regions outside the projections of the observation volume common to all considered viewpoints. These initial regions are then grown iteratively by adding pixels whose background/foreground assignments are inconsistent across images. We adopt an EM scheme for that, where background and foreground models are updated in one step, and images are segmented in another step using the new model parameters. Some important features of the approach are as follows. The method is fully automatic and requires neither a priori knowledge of any type nor user interaction. In addition, a single camera at different locations or several cameras can be considered. In the latter case, cameras do not need to be color calibrated, since geometric and not color consistency is enforced between viewpoints. The remainder of the paper is as follows. In Section 2, we review existing segmentation methods. Sections 3 and 4 detail the implementation of the proposed method. Experimental results and conclusions are given in Sections 5 and 6, respectively.
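The alternating scheme described above can be sketched in a few lines. This is only an illustrative outline, not the authors' implementation: the helper callables (outside_visual_hull, fit_background_model, graph_cut_segment) are hypothetical placeholders for the steps detailed in Sections 3 and 4.

```python
import numpy as np

def extract_foreground(images, outside_visual_hull, fit_background_model,
                       graph_cut_segment, max_iters=10):
    # Hypothetical outline of the alternating (EM-like) scheme described above.
    n = len(images)
    # Initialisation: pixels outside the common observation volume are certain background.
    seg = [np.where(outside_visual_hull(i), 0, 1) for i in range(n)]
    for _ in range(max_iters):
        prev = [s.copy() for s in seg]
        # "M-step": update per-view background colour models from the current maps.
        models = [fit_background_model(images[i], seg[i]) for i in range(n)]
        # "E-step": re-segment each view using its colour model and the spatial
        # consistency inferred from the other views' current silhouettes.
        seg = [graph_cut_segment(images[i], models[i],
                                 [seg[j] for j in range(n) if j != i])
               for i in range(n)]
        if all(np.array_equal(a, b) for a, b in zip(seg, prev)):
            break  # segmentation maps stopped changing
    return seg
```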
2 Related Work
Background subtraction methods usually assume that background pixel values are constant over time while foreground pixel values vary at some time. Based on this fact, several approaches have been proposed which take into account photometric information (greyscale, color, texture or image gradient, among others) in a monocular context. For non-uniform backgrounds, statistical models are computed for pixels. Several statistical models have been proposed for this purpose, for instance normal distributions used in conjunction with the Mahalanobis distance [1], or mixtures of Gaussians to account for multi-valued pixels located on image edges or belonging to shadow regions [2,3]. In addition to these models, and to enforce smoothness constraints over image regions, graph cut methods have been widely used. After the seminal work of Boykov and Jolly [4], many derivatives have been proposed. GrabCut reduces the user interaction required for a good result by iterative optimization [5]. Li et al. proposed a coarse-to-fine approach in Lazy Snapping, which provides a user interface for boundary editing [6]. Freedman et al. [7] exploit shape prior information to reduce segmentation errors in areas where the foreground and background have similar intensities. Current graph cut based methods show good results with both static images and videos, but user interaction is often required to achieve good results. All the aforementioned approaches assume a monocular context and do not consider multi-camera cues when available. An early attempt in that direction was to add stereo information, i.e., depth information obtained using two
Fig. 1. Overall procedure of the proposed foreground extraction method
cameras, to the photometric information used for classification into background and foreground [8]. Incorporating depth information makes the process more robust; however, it does not account for consistency across more than two cameras. Zeng and Quan [9] proposed a method which estimates the silhouette of an object against an unknown background. They exploit the relationship between image regions and the visual hull. The approach requires, however, a good color segmentation, since foreground regions are identified based on these regions. Sormann et al. [10] applied the graph cut method to the multiple-view segmentation problem. They combine color and a shape prior for robust segmentation against a complex background, but user interaction is still required. Our contribution with respect to the aforementioned approaches is to provide a fully automatic method that requires neither prior knowledge of a static background nor user interaction. The different steps of the method are depicted in Fig. 1 and explained in the following sections.
3 Probabilistic Modeling

3.1 Definitions
We represent the input color images by I and the segmentation maps by S. τ is the prior knowledge of the scene. The knowledge of the background and the foreground is denoted by B and F, respectively. F, B, and S are unknown variables, and I is the only known variable. For a pixel, S has a value of either 0 for the background or 1 for the foreground. We use a superscript i to denote a specific view, and a subscript x to denote a pixel located at x = (u, v); I_x^i is thus the color of pixel x in the i-th image.
3.2 Joint Probability Decomposition
With the variables defined above, we represent the problem as a Bayesian network. Before we can infer the probability of the segmentation map, we need the joint probability of the variables. From the dependencies among the variables, shown in Fig. 2, we decompose the joint probability as in Eq. (1).
Fig. 2. Bayesian network representing the dependencies among the variables
\Pr(S, F, B, I, \tau) = \Pr(\tau)\,\Pr(B|\tau)\,\Pr(F|\tau)\,\Pr(S|F, \tau)\,\Pr(I|B, S, \tau)   (1)
Pr(τ), Pr(B|τ), and Pr(F|τ) are prior probabilities. Since we place no constraints on them, we assume that they have uniform distributions; they need not be considered when inferring the probability of the segmentation map and are therefore ignored from now on. Pr(S|F, τ) is the spatial consistency term: it represents the probability of the segmentation map when the foreground information is available. The term Pr(I|B, S, τ) is the image likelihood term: it tells us how well the image matches the background information we know.

3.3 Spatial Consistency Term
Although each camera sees its own background, different from the others', the foreground should be consistent among all the views under our assumption. For a pixel in the i-th image I^i, the spatial consistency represents how much the other views agree that the pixel belongs to the foreground. It is inferred from the segmentation maps S^k with k ≠ i. To compute the spatial consistency, we exploit the silhouette calibration ratio proposed in [11], which gives the probability of a pixel being foreground from the silhouettes of the other views. We use a modified silhouette calibration ratio R_x with a Gaussian shape to give more penalty to low silhouette calibration ratio values:

R_x = e^{-(1-C_x)^2/\sigma^2}   (2)
where C_x is the silhouette calibration ratio corresponding to x, and σ is the standard deviation through which we control the slope of the probability curve. σ determines how much penalty is given to a low silhouette calibration ratio value. We give a tolerance to the silhouette calibration ratio through σ, since the knowledge about the foreground inferred from the other views can be wrong. σ
is computed from the number of cameras m we allow to miss, the corresponding silhouette calibration ratio C_m, and the expected probability q:

\sigma = \sqrt{-\frac{C_m^2}{\ln(q)}}   (3)
Since S_x^i takes the value 0 or 1, the spatial consistency term can be decomposed into two cases, S_x^i = 0 and S_x^i = 1. When S_x^i = 0, the foreground information inferred from the other views does not give any clue about the background; in this case, we assume the spatial consistency has a uniform value P_b. When S_x^i = 1, the spatial consistency term follows the inferred foreground information, so we use the modified silhouette calibration ratio as the spatial consistency term. Consequently, Pr(S_x^i | F^i, τ) is computed as in Eq. (4):

\Pr(S_x^i | F^i, \tau) = \begin{cases} P_b & \text{if } S_x^i = 0 \\ e^{-(1-C_x)^2/\sigma^2} & \text{if } S_x^i = 1 \end{cases}   (4)
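As a concrete illustration, the spatial consistency term of Eqs. (2)-(4) could be evaluated per pixel as in the following sketch. This is not the authors' code: the input C is assumed to already hold the silhouette calibration ratios computed from the other views, and the expression for σ follows the reconstruction of Eq. (3) given above.

```python
import numpy as np

def spatial_consistency(C, C_m, q, P_b):
    # C: array of silhouette calibration ratios C_x for the pixels of view i.
    # C_m, q: tolerance parameters of Eq. (3); P_b: uniform value used when S_x = 0.
    sigma = np.sqrt(-C_m**2 / np.log(q))      # Eq. (3) as reconstructed above
    R = np.exp(-(1.0 - C)**2 / sigma**2)      # Eq. (2): Pr(S_x = 1 | F, tau)
    return np.full_like(C, P_b), R            # (Pr(S=0 | F, tau), Pr(S=1 | F, tau))
```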
3.4 Image Likelihood Term

The image likelihood term Pr(I_x^i | B^i, S_x^i, τ) measures the similarity between a pixel's color I_x^i and the background information we know. If a pixel is assumed to belong to the background (S_x^i = 0), the likelihood of the pixel color is computed from the statistical model of the background colors. Several representations of this color model, such as a Gaussian mixture model or a histogram, can be used. When the pixel is considered as foreground (S_x^i = 1), the knowledge of the background color distribution does not provide any information. As we make no assumptions on the colors of the foreground, we set the image likelihood term to a uniform value P_f in this case. Consequently, the image likelihood term is defined as:

\Pr(I_x^i | B^i, S_x^i, \tau) = \begin{cases} H_B(I_x^i) & \text{if } S_x^i = 0 \\ P_f & \text{if } S_x^i = 1 \end{cases}   (5)

where H_B represents the statistical model of the background colors.

3.5 Segmentation Map Inference
Once the joint probability distribution is defined, the segmentation map can be inferred from the given conditions by exploiting Bayes' rule. What we want to know is the probability distribution of the segmentation map S given the variables F, B, I, and τ. For a pixel I_x^i, the probability of the segmentation map is inferred as in Eq. (6):
\Pr(S_x^i | F^i, B^i, I_x^i, \tau) = \frac{\Pr(S_x^i, F^i, B^i, I_x^i, \tau)}{\sum_{S_x^i=0,1} \Pr(S_x^i, F^i, B^i, I_x^i, \tau)} = \frac{\Pr(S_x^i | F^i, \tau)\,\Pr(I_x^i | B^i, S_x^i, \tau)}{\sum_{S_x^i=0,1} \Pr(S_x^i | F^i, \tau)\,\Pr(I_x^i | B^i, S_x^i, \tau)}   (6)
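Since both terms of Eq. (6) are available per pixel, the posterior reduces to a simple normalisation. The following sketch (ours, with illustrative names) combines the spatial consistency of Eq. (4) with the image likelihood of Eq. (5).

```python
def segmentation_posterior(p_spatial_bg, p_spatial_fg, H_B_of_I, P_f):
    # Combine the spatial consistency term (Eq. 4) and the image likelihood
    # term (Eq. 5) into the posterior of Eq. (6) for each pixel.
    joint_bg = p_spatial_bg * H_B_of_I   # S_x = 0: P_b * H_B(I_x)
    joint_fg = p_spatial_fg * P_f        # S_x = 1: R_x * P_f
    norm = joint_bg + joint_fg
    return joint_bg / norm, joint_fg / norm
```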
4 Iterative Optimization with Graph-Cut
To compute the optimal segmentation maps of all the views, we exploit the graph-cut method. In the same manner as the method proposed in [5], we iteratively update the unknown variables F^i and B^i, and estimate S^i for each view. We build a graph G^i for every image I^i. The pixels of I^i become the nodes of G^i, and each pixel has edges connected to its eight neighbors. There are two special nodes in G^i, the source S and the sink T, which are connected to every node in the graph. Computing the min-cut minimizes the segmentation energy defined in Eq. (7):

E_{total} = \sum_{x \in I^i} \lambda_1 E_p(x) + \sum_{(x,y) \in N,\, S_x \neq S_y} \lambda_2 E_n(x, y)   (7)
where E_p is the prior energy term, E_n is the neighborhood energy term, and N is the set of neighboring pixel pairs. The prior energy represents how close a pixel is to the foreground or the background. The neighborhood energy enforces smoothness of the segmentation: if two pixels have a large color difference, there is a high probability that a segmentation boundary exists between them. Thus, Pr(S_x^i | F^i, B^i, I_x^i, τ) is used as the prior energy, and we define the neighborhood energy as in Eq. (8):

E_n(x, y) = \frac{1}{1 + D(x, y)}   (8)

where D(x, y) is a color difference measure between two neighboring pixels x and y. According to our experiments, the neighborhood energy is not limited to Eq. (8); it may take a different form, as long as it is small for a large color difference and large when the two pixels have similar colors. We assign a capacity w(x, y) to every edge between nodes x and y in G^i. The prior energy assigns the capacities of the edges connected to S and T. For the edges connected to S, we set the capacities as shown in Eq. (9): if a pixel is already known to be background, we assign infinity to the edge; otherwise, the capacity follows the inferred probability with S_x^i = 0,

w(x, S) = \begin{cases} \infty & \text{if } x \text{ is known as the background} \\ \Pr(S_x^i = 0 | F^i, B^i, I_x^i, \tau) & \text{otherwise} \end{cases}   (9)
If T is one of the vertices of an edge, the capacity is set to the inferred probability with S_x^i = 1:

w(x, T) = \Pr(S_x^i = 1 | F^i, B^i, I_x^i, \tau)   (10)

For the edges between two neighboring pixels, we assign the scaled neighborhood energy:

w(x, y) = \lambda_n E_n(x, y)   (11)

where λ_n is a scale factor. After the convergence of the iterative optimization process, there can still be misclassified pixels. To remove the remaining errors, we perform a graph-cut based segmentation again as a post-processing step. In the post-processing, the image likelihood term is modified as in Eq. (12): we use a foreground color model instead of the uniform distribution P_f, in the same manner as conventional graph cut methods.

\Pr(I_x^i | B^i, S_x^i, \tau) = \begin{cases} H_B(I_x^i) & \text{if } S_x^i = 0 \\ H_F(I_x^i) & \text{if } S_x^i = 1 \end{cases}   (12)

where H_B and H_F represent the color models of the background and the foreground, respectively.
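The capacity assignment of Eqs. (8)-(11) can be sketched as follows. This is an illustrative outline under our own naming; the s-t min-cut solver itself is not shown, and the output is simply the set of capacities that such a solver would consume.

```python
import numpy as np

def neighborhood_energy(c_x, c_y):
    # Eq. (8): small for a large colour difference, large for similar colours.
    D = np.linalg.norm(np.asarray(c_x, dtype=float) - np.asarray(c_y, dtype=float))
    return 1.0 / (1.0 + D)

def edge_capacities(p_bg, p_fg, known_bg, colors, neighbor_pairs, lambda_n, INF=1e9):
    # p_bg, p_fg: posteriors from Eq. (6); known_bg: boolean mask of pixels
    # outside the common observation volume; colors: per-pixel colours.
    cap_source = np.where(known_bg, INF, p_bg)        # edges to S, Eq. (9)
    cap_sink = np.asarray(p_fg, dtype=float)          # edges to T, Eq. (10)
    cap_pair = {(x, y): lambda_n * neighborhood_energy(colors[x], colors[y])
                for (x, y) in neighbor_pairs}         # n-links, Eq. (11)
    return cap_source, cap_sink, cap_pair
```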
5 Experimental Results
To demonstrate the effectiveness of our method, we performed experiments with the 'Dancer' and 'Temple' data sets, each of which contains one foreground object. The 'Dancer' data set consists of 8 images of size 780x582. The 'Temple' data set consists of 10 selected images of size 640x480.
Fig. 3. Interim segmentation results with and without the spatial consistency. One of the input images (left); segmentation result with the color difference only (middle); segmentation result with the color difference and the spatial consistency (right).
Fig. 3 shows intermediate segmentation results with and without the spatial consistency. In the input image, the foreground contains self-shadows under the arms, and the color of the hair is similar to the background colors. Thus, part of the foreground is lost when only the color difference criterion is used. In contrast, segmentation with the spatial consistency preserves this part of the foreground well.
Fig. 4. Foreground extraction results with the ‘Dancer’ data set
Fig. 4 shows the results with the 'Dancer' data set. The first row shows selected images among the 8 input images. The initial segmentations, computed from the intersection of the viewing volumes, are depicted in the second row. The extracted foreground after each iteration is shown in rows 3 to 6. The initial segmentations in row 2 converge to the results in row 6 through the iterative optimization presented in the previous section. Thanks to the spatial consistency, our method removes the background even though there are hard edges between the green mat and the gray floor. After the post-processing, we obtain the final segmentation maps depicted in the last row. Fig. 5 shows the experimental results with the 'Temple' data set. Since the Temple data set has an almost black background, segmentation of the foreground looks easier. However, the color similarity between the foreground and the background causes errors. As shown in Fig. 5, our method extracts the foreground successfully. Note that only selected results among the 10 images are presented because of lack of space.
Fig. 5. Foreground extraction results with the ‘Temple’ data set
Table 1 shows the performance of the proposed method. To measure the performance, we computed the hit rate of the segmentation results against the ground truth. Errors remain in the segmentation, but the method still performs well, with hit rates over 90%.

Table 1. Segmentation performance (unit: %)

View ID   1      2      3      4      5      6      7      8      9      10
Temple    98.38  98.31  98.20  98.51  98.73  98.25  98.51  98.20  98.05  98.09
Dancer    93.93  97.74  96.44  93.14  92.41  94.94  97.20  94.93  -      -
Fig. 6 shows the visual hulls reconstructed from the manually obtained ground-truth silhouettes and from the silhouettes obtained by the proposed method, respectively. They are not identical because of segmentation errors, but the visual hull computed from our result still preserves the 3D shape of the foreground well.
Fig. 6. Visual hulls reconstructed from the silhouettes of the ground truth (left) and the silhouettes obtained by our method (right)
6 Conclusions
In this paper, we proposed a novel method for foreground extraction from multiple images. Our method integrates spatial consistency with color information
for robust estimation of the foreground against an unknown background. As shown in the experimental results, the spatial consistency provides an important clue for separating the foreground from the background. Since the proposed method requires neither prior knowledge of the scene nor user interaction, it is closer to a fully automatic method than existing approaches. As future work, an interesting issue is to extend the current method to multi-view video sequences by exploiting both spatial and temporal constraints.
References
1. Wren, C., Azarbayejani, A., Darrell, T., Pentland, A.: Pfinder: Real-Time Tracking of the Human Body. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7), 780–785 (1997)
2. Rowe, S., Blake, A.: Statistical mosaics for tracking. IVC 14, 549–564 (1996)
3. Friedman, N., Russell, S.: Image Segmentation in Video Sequences: A Probabilistic Approach. In: Proc. Thirteenth Conf. on Uncertainty in Artificial Intelligence (1997)
4. Boykov, Y.Y., Jolly, M.P.: Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In: International Conference on Computer Vision, vol. 1, pp. 105–112 (2001)
5. Rother, C., Kolmogorov, V., Blake, A.: GrabCut: interactive foreground extraction using iterated graph cuts. In: ACM SIGGRAPH, vol. 24, pp. 309–314 (2004)
6. Li, Y., Sun, J., Tang, C.K., Shum, H.Y.: Lazy snapping. In: ACM SIGGRAPH, vol. 23, pp. 303–308 (2004)
7. Freedman, D., Zhang, T.: Interactive graph cut based segmentation with shape priors. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 755–762. IEEE Computer Society Press, Los Alamitos (2005)
8. Gordon, G., Darrell, T., Harville, M., Woodfill, J.: Background Estimation and Removal Based on Range and Color, 459–464 (1999)
9. Zeng, G., Quan, L.: Silhouette extraction from multiple images of an unknown background. In: Hong, K.S., Zhang, Z. (eds.) Asian Conference on Computer Vision, Asian Federation of Computer Vision Societies, vol. 2, pp. 628–633 (2004)
10. Sormann, M., Zach, C., Karner, K.: Graph cut based multiple view segmentation for 3D reconstruction. In: The 3rd International Symposium on 3D Data Processing, Visualization and Transmission (2006)
11. Boyer, E.: On using silhouettes for camera calibration. In: The 7th Asian Conference on Computer Vision, pp. 1–10. Springer, Heidelberg (2006)
Image and Video Matting with Membership Propagation

Weiwei Du and Kiichi Urahama

Dept. of Visual Communication Design, Kyushu University, 4-9-1 Shiobaru, Fukuoka-shi, 815-8540 Japan
{duweiwei@gsd., urahama@}design.kyushu-u.ac.jp
Abstract. Two techniques are devised for a natural image matting method based on semi-supervised object extraction. One is a guiding scheme for the placement of user strokes specifying object or background regions, and the other is a scheme for adjusting object colors to conform to the colors of the new composited background. We draw strokes at inhomogeneous color regions disclosed by an unsupervised cluster extraction method, from which the semi-supervised algorithm is derived. Objects are composited with a new background after their colors are adjusted using a color transfer method with eigencolor mapping. This image matting method is then extended to videos: strokes are drawn only in the first frame, from which memberships are propagated to successive frames to extract objects in every frame. The performance of the proposed method is examined on images and videos used in experiments with existing matting methods.
1 Introduction
Natural image matting techniques have been developed [1,2] and extended to videos [3], where objects are extracted from photographs or videos with natural backgrounds and composited with another image or video. Though early methods [1,2] require users to draw supplementary images called trimaps, such as shown in Fig. 1(a), recent ones [3,4,5,6,7] have come to accept only rough strokes, as illustrated in Fig. 1(b). We present, in this paper, a similar matting method with semi-supervised extraction of objects. In contrast to other methods where strokes are required to be drawn in both objects and backgrounds, it is
(a) trimap   (b) strokes
Fig. 1. Examples of supplementary drawings [3]
sufficient in our method for them to be drawn in either the objects or the backgrounds. This is because our semi-supervised method is derived from an unsupervised graph-spectral algorithm of the cluster extraction type, e.g., [8,9], in contrast to the cluster partition type, e.g., [10,11], adopted in most matting methods. In matting methods based on rough strokes such as in Fig. 1(b), the placement of the strokes is crucial for the performance of object extraction. Levin et al. [4] have recently presented a guiding scheme for the placement of strokes in their matting method, where strokes are required to be drawn in both foregrounds and backgrounds. Their theoretical analysis is excellent; however, users must observe two eigen-images, and the guidance is not so easy. In this paper, we present a similar guiding scheme for our matting method. Our scheme exploits an iterative solution process for the unsupervised object extraction method from which our semi-supervised algorithm is derived. Our guiding scheme utilizes only the first eigen-image and is simple and fast. Most matting methods have focused on still images, and their extension to videos has been reported in only a few papers [3]. We, in this paper, extend our matting method to videos, where strokes are drawn only in the first frame, from which memberships are propagated to successive frames. Lastly, we adjust the colors of images by using the eigencolor mapping method [12], with some extensions, before their composition in order to reduce their mismatch with the new background colors.
2 Unsupervised Extraction of Homogeneous Color Regions
Our semi-supervised matting method is derived from a graph-spectral method for unsupervised cluster extraction [9], which is of the cluster extraction type in contrast to the cluster partitioning type algorithms [10,11] used in most matting methods. Clusters in an image are regions with homogeneous colors. Let the similarity between pixels i and j be s_{ij} = e^{-\alpha\|a_i - a_j\|^2 - \beta\|c_i - c_j\|^2}, where a_i is the spatial position of pixel i and c_i = [R_i, G_i, B_i]^T is its color. We set s_{ii} = 0. Connected components of pixels linked by large s_{ij} are clusters. An image usually includes multiple clusters. The fraction x_i of pixel i belonging to such clusters can be evaluated with

\max \sum_i \sum_{j \in W_i} s_{ij} x_i x_j \quad \text{subj. to} \quad \sum_i d_i x_i^2 = 1   (1)

where d_i = \max\{\sum_j s_{ij}, \epsilon\} and W_i is a (2p + 1) \times (2p + 1) square window around pixel i. The solution of eq. (1) is a stationary point of its Lagrange function:

\max_x \min_\lambda \sum_i \sum_{j \in W_i} s_{ij} x_i x_j - \lambda \left( \sum_i d_i x_i^2 - 1 \right)   (2)
from which we get x = [x1 , ..., xm ]T as a generalized eigenvector of Sx = λDx where S = [sij ] and D = diag(d1 , ..., dm ). A simple scheme for computing x is the power method:
x_i^{(\xi+1)} = \frac{\sum_{j \in W_i} \tilde{s}_{ij} x_j^{(\xi)}}{\sqrt{\sum_k d_k \left( \sum_j \tilde{s}_{kj} x_j^{(\xi)} \right)^2}}   (3)
where \tilde{s}_{ij} = s_{ij}/d_i and ξ is an iteration counter. Since D − S is an M-matrix, the iterate x^{(ξ)} converges monotonically to the solution of eq. (1). After its convergence, we normalize it as x̃_i = x_i / max_k{x_k}, which is the membership of pixel i in clusters. For instance, Fig. 2(b) illustrates memberships for the image in Fig. 2(a). We set the parameters as α = 0.01, β = 0.01, ε = 5, p = 5. Homogeneous color regions exist in background areas in addition to the object (the white peacock). Region boundaries are vague because the membership is fuzzy. As is seen in this result, this unsupervised method cannot delineate a specific object. Exceptional cases are images satisfying both of the following two conditions: (1) objects are composed of only inhomogeneous color regions; (2) background colors are homogeneous. We can extract objects from such exceptional images by using the above unsupervised method. The image of spider webs in Fig. 3(a) is such an example of easily extractable objects. Its memberships are shown in Fig. 3(b), whose negative is the membership of the objects, since the plotted x̃_i is the membership in homogeneous color regions, which are the backgrounds in this case. As will be shown in Section 4, this detectability of inhomogeneous color regions is useful for guiding the placement of strokes in the semi-supervised method derived from eq. (2) in the next section.
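A minimal sketch of the power iteration of Eq. (3), under the assumption that the windowed similarity matrix and the vector d have already been built, could look as follows (this is our illustration, not the authors' code).

```python
import numpy as np

def cluster_memberships(S, d, n_iters=300):
    # S: m x m similarity matrix restricted to the windows W_i (zero diagonal);
    # d: vector with d_i = max(sum_j s_ij, eps).  Implements Eq. (3).
    S_tilde = S / d[:, None]                  # s~_ij = s_ij / d_i
    x = np.ones(S.shape[0])
    for _ in range(n_iters):
        y = S_tilde @ x
        x = y / np.sqrt(np.sum(d * y**2))     # enforce sum_i d_i x_i^2 = 1
    return x / x.max()                        # x~_i = x_i / max_k x_k
```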
(a) image of white peacock
(b) memberships
Fig. 2. Homogeneous color regions
(a) image of spider web
(b) memberships
Fig. 3. Example of image of unsupervisedly extractable objects
3 Semi-supervised Extraction of Objects
If a user wants to extract a target object, he/she must draw some strokes specifying either the objects or the backgrounds. Whereas strokes are drawn in both objects and backgrounds in Fig. 1(b), it is sufficient in our method for them to be drawn in only one of them. Let T be the area (subset of pixels) of the strokes. We fix x_i to 1 at i ∈ T. Then the Lagrange multiplier λ in eq. (2) can be fixed to an arbitrary value. If we fix it to 1, then eq. (2) becomes

\max_x \sum_{i \notin T} \left( \sum_{j \in W_i} s_{ij} x_i x_j - d_i x_i^2 \right)   (4)
We also solve this equation by iteration:

x_i^{(\xi+1)} = \sum_{j \in W_i} \tilde{s}_{ij} x_j^{(\xi)} \quad (i \notin T)   (5)
As this iteration proceeds, the memberships propagate from the stroke area T to its surroundings. In order to facilitate membership propagation, we project colors to a one-dimensional subspace with linear discriminant analysis (LDA) as follows:
(1) Dilate strokes to surrounding areas with similar colors.
(2) Set the dilated region as one class and the remaining area as the other class, and execute LDA to compute a projection vector q. Project the color of every pixel to f_i = q^T c_i.
(3) Construct the pixel similarity s_{ij} = e^{-\alpha\|a_i - a_j\|^2 - \beta(f_i - f_j)^2} and compute x_i by using eq. (5).
In order to accelerate the convergence of eq. (5), we set the initial values as x_i^{(0)} = 1/(1 + e^{-\gamma(f_i - \delta)}) and stop the iteration at pixels where x_i becomes greater than 0.99 or drops below 0.01.
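The semi-supervised propagation with the sigmoid initialisation could be sketched as below; this is our own illustration under the assumption that the row-normalised similarity matrix built from the LDA projections is given.

```python
import numpy as np

def semi_supervised_memberships(S_tilde, strokes, f, gamma, delta, n_iters=100):
    # S_tilde: row-normalised similarities built from the LDA projections f;
    # strokes: boolean mask of the stroke area T; gamma, delta: sigmoid parameters.
    x = 1.0 / (1.0 + np.exp(-gamma * (f - delta)))   # initial values x_i^(0)
    x[strokes] = 1.0                                 # x_i fixed to 1 on strokes
    frozen = strokes.copy()
    for _ in range(n_iters):
        x_new = S_tilde @ x                          # Eq. (5) for i not in T
        x = np.where(frozen, x, x_new)
        frozen = frozen | (x > 0.99) | (x < 0.01)    # stop updating decided pixels
    return x
```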
Fig. 4. Inhomogeneous regions (black areas)
4 Placement of Strokes
Memberships propagate easily within homogeneous color regions, whereas they hardly infiltrate inhomogeneous areas. Therefore, strokes must be
(a) strokes
(b) memberships
Fig. 5. Strokes (left) and obtained memberships (right)
(a) spider lily
(b) 10-th iterant
(c) snapshot of girl
(d) 10-th iterant
(e) flame
(f) 10-th iterant
Fig. 6. Strokes in other example images
drawn on each inhomogeneous region, in addition to the homogeneous ones, within the area (either object or background) to be extracted. Since inhomogeneous segments are usually narrow, it is easy to draw strokes bridging over or touching into them. In any region, homogeneous or inhomogeneous, a small amount of strokes is sufficient: memberships propagate from them to the whole area of each region. Especially in eq. (5), rough and sparse strokes are sufficient owing to the high propagation capability due to the broad windows W_i. Notice that strokes should not be drawn on boundaries between objects and backgrounds. This guideline for the placement of strokes requires the detection of inhomogeneous regions in an image. As was shown in the previous section, the above unsupervised algorithm can be utilized for this purpose. However, its full convergence needs many iteration steps, and the obtained eigen-images are too fuzzy to visually detect inhomogeneous regions. For instance, convergence to Fig. 2(b) requires 300 steps of eq. (5). In order to save computational time, we set ε excessively
Fig. 7. Memberships
Fig. 8. Composite images
large so that inhomogeneous regions are extracted sufficiently, and stop the iteration of eq. (5) early, before convergence. Fig. 4 illustrates the 10-th iterate x_i^{(10)} for Fig. 2(a), binarized as x_i = 1 if x_i^{(10)} > 0.001 and x_i = 0 otherwise. Black areas in Fig. 4 are inhomogeneous regions. An example of strokes placed in the background following the above guideline is shown in Fig. 5(a), and the memberships obtained with them are illustrated in Fig. 5(b). Other examples are shown in Fig. 6, where strokes are illustrated on the left and the binarized x_i^{(10)} are shown on the right. Fig. 6(c) and (e) were used in experiments with other existing methods [3,4]. Memberships obtained with the strokes shown in Fig. 6 are illustrated in Fig. 7, where the left images are initial values and the right ones are converged values. We can obtain mattes using our method with simpler strokes than other methods, where strokes must be drawn in both objects and backgrounds.
5 Composition with Another Image
The colors of objects must be estimated at each pixel for them to be composited with another image as a new background. The color c_i in the input image is calculated
(a) yi1
(b) ui2
(c) trimap
Fig. 9. Initial values of memberships in the second frame
Fig. 10. Four frames in example video
by blending object colors c_{fi} and background colors c_{bi} with proportion x_i: c_i = x_i c_{fi} + (1 − x_i) c_{bi}. In existing matting methods, both c_{fi} and c_{bi} are estimated using this relation; however, this is laborious and wasteful because only c_{fi} is used for composition. Here we estimate only c_{fi} by

\min_{c_{fi}} \sum_{j \in W_i} s_{ij} x_j \| c_{fi} - c_j \|^2   (6)

which is solved explicitly as

c_{fi} = \sum_{j \in W_i} s_{ij} x_j c_j \Big/ \sum_{j \in W_i} s_{ij} x_j   (7)

with which the composite color is given by

\tilde{c}_i = x_i c_{fi} + (1 - x_i) b_i   (8)

where b_i is the color of pixel i in the new background image. Examples of composite images are shown in Fig. 8.
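For a single pixel, the closed-form estimate of Eq. (7) and the composition of Eq. (8) amount to a weighted average followed by alpha blending, as in the following sketch (ours, with illustrative parameter names).

```python
import numpy as np

def composite_pixel(window_colors, window_similarities, window_memberships, x_i, b_i):
    # window_colors: colours c_j in W_i (n x 3); window_similarities: s_ij;
    # window_memberships: x_j; x_i: membership of pixel i; b_i: new background colour.
    w = window_similarities * window_memberships              # s_ij * x_j
    c_f = (w[:, None] * window_colors).sum(axis=0) / w.sum()  # Eq. (7)
    return x_i * c_f + (1.0 - x_i) * np.asarray(b_i, dtype=float)  # Eq. (8)
```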
6 Extension to Videos
This method for still images can be easily extended to videos. We draw strokes only in the first frame, from which memberships are propagated to successive frames. We first compute the projection vector q by LDA at the first frame and project the colors of every pixel in all frames to f_{ik} = q^T c_{ik}, where k denotes the frame number. We then compute the memberships x_{i1} at the first frame and propagate them to the second frame memberships x_{i2}, which are propagated to the third
Fig. 11. Memberships
Fig. 12. Composite video
frame, and so on. In order to speed up the propagation, we construct and use trimaps for the second and subsequent frames as follows:
(1) Discretize x_{i1} into three levels: y_{i1} = 1 if x_{i1} > 0.99, y_{i1} = 0 if x_{i1} < 0.01, and y_{i1} = 0.5 otherwise.
(2) Construct a binary map: u_{i2} = 1 if |f_{i1} − f_{i2}| > 2/3, else u_{i2} = 0.
(3) Set the initial values as x_{i2}^{(0)} = y_{i1} if u_{i2} = 0, otherwise x_{i2}^{(0)} = 0.5.
This x_{i2}^{(0)} gives a trimap for the second frame and is used as the initial value of x_{i2}. Fig. 9(a) illustrates y_{i1} for the video in Fig. 10, which was used in the experiments of [3]. Fig. 9(b) shows u_{i2}, where white regions depict pixels whose colors vary due to motion. Fig. 9(c) illustrates the trimap constructed from Fig. 9(a) and Fig. 9(b). Memberships are recomputed only in the gray areas of Fig. 9(c). This propagation scheme is simple and requires no explicit estimation of object or camera motion. The memberships obtained for each frame are shown in Fig. 11, and their composition with another video is shown in Fig. 12.
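Steps (1)-(3) translate directly into array operations, as in the sketch below. This is our illustration; it assumes the LDA projections f are scaled as in the paper so that the 2/3 threshold is meaningful.

```python
import numpy as np

def next_frame_initialisation(x_prev, f_prev, f_curr, thresh=2.0/3.0):
    # Build the trimap/initial memberships for frame t+1 from the converged
    # memberships x_prev of frame t and the LDA projections of both frames.
    y = np.full_like(x_prev, 0.5)
    y[x_prev > 0.99] = 1.0                      # (1) definite foreground
    y[x_prev < 0.01] = 0.0                      #     definite background
    u = np.abs(f_prev - f_curr) > thresh        # (2) pixels whose colour changed
    return np.where(u, 0.5, y)                  # (3) recompute only where changed
```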
7 Color Adjustment of Objects
In most matting methods, extracted objects are directly pasted onto another image as in eq. (8). Such direct matting, however, often gives unnatural images because the color or direction of illumination is different in the original image and in the new image. Since their precise estimation is hard in general, we resort here to a conventional technique for adjusting object colors before composition, using the eigencolor mapping method [12] with an extension for the present task.

7.1 Eigencolor Mapping
Let us consider the case where the color of a reference image is transferred to another, target image. Let the colors of the target image and of the reference image be c_{i1} and
(a) ε = 0   (b) ε = 0.2   (c) ε = 0.4
Fig. 13. Composition of object with adjusted colors
Fig. 14. Composite video with color adjustment
c_{i2}, respectively. We compute the covariance matrices S_1 = \sum_i (c_{1i} - \bar{c}_1)(c_{1i} - \bar{c}_1)^T / \sum_i 1 and S_2 = \sum_i (c_{2i} - \bar{c}_2)(c_{2i} - \bar{c}_2)^T / \sum_i 1, where \bar{c}_1 = \sum_i c_{1i} / \sum_i 1 and \bar{c}_2 = \sum_i c_{2i} / \sum_i 1, and eigen-decompose them as S_1 = U_1 D_1 U_1^T, S_2 = U_2 D_2 U_2^T. Using these basis matrices, the color c_{1i} of the target image is changed to

c'_{1i} = \bar{c}_2 + U_2 D_2^{1/2} D_1^{-1/2} U_1^T (c_{1i} - \bar{c}_1)   (9)

7.2 Object Composition with Its Color Adjustment
If we use this recoloring method directly for object composition, the color of the object is changed completely to that of the new background, and the desired composition cannot be obtained. Natural composition needs partial blending of the new background color into the object color, to reproduce the ambient illumination cast on the object by the background. Let the colors of the object and of the new background be c_{fi} and c_{bi} as in Section 5. We first compute a weighted covariance matrix of the object colors, S_f = \sum_i x_i (c_{fi} - \bar{c}_f)(c_{fi} - \bar{c}_f)^T / \sum_i x_i, where \bar{c}_f = \sum_i x_i c_{fi} / \sum_i x_i is the weighted mean of the object colors. We then eigen-decompose S_f into U_f D_f U_f^T. We next compute the average \bar{c}_b and covariance matrix S_b of the new background colors and shift them to \bar{c}_{bm} = (1 - \epsilon)\bar{c}_f + \epsilon\bar{c}_b and S_{bm} = (1 - \epsilon)S_f + \epsilon S_b, where \epsilon \in [0, 1] denotes the strength of the ambient illumination. We eigen-decompose S_{bm} into U_{bm} D_{bm} U_{bm}^T and change the color c_{fi} of the object to

c'_{fi} = \bar{c}_{bm} + U_{bm} D_{bm}^{1/2} D_f^{-1/2} U_f^T (c_{fi} - \bar{c}_f)   (10)
which is finally composited with the new background as x_i c'_{fi} + (1 - x_i) c_{bi}, the color of the image containing the object with adjusted colors. Examples of composite images are shown in Fig. 13.
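The color adjustment of Eq. (10) is a straightforward piece of linear algebra; a minimal sketch, assuming non-degenerate covariance matrices and using unweighted statistics for the new background, is given below (our illustration, not the authors' code).

```python
import numpy as np

def adjust_object_colors(c_f, x, c_b, eps):
    # c_f: object colours (N x 3); x: memberships (N,); c_b: new background
    # colours (M x 3); eps: ambient illumination strength in [0, 1].
    w = x / x.sum()
    mean_f = (w[:, None] * c_f).sum(axis=0)
    S_f = ((c_f - mean_f).T * w) @ (c_f - mean_f)      # weighted covariance
    mean_b = c_b.mean(axis=0)
    S_b = np.cov(c_b, rowvar=False)
    mean_bm = (1.0 - eps) * mean_f + eps * mean_b      # shifted mean
    S_bm = (1.0 - eps) * S_f + eps * S_b               # shifted covariance
    D_f, U_f = np.linalg.eigh(S_f)                     # assumes non-degenerate covariances
    D_bm, U_bm = np.linalg.eigh(S_bm)
    A = U_bm @ np.diag(np.sqrt(D_bm)) @ np.diag(1.0 / np.sqrt(D_f)) @ U_f.T
    return mean_bm + (c_f - mean_f) @ A.T              # Eq. (10) applied per pixel
```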
Adjustment of object colors is especially important for video matting, where the brightness or illumination color often varies with time. In such cases, the adjustment of the object color must vary in accordance with the temporal variation of the background color. An example is shown in Fig. 14, where the new background is reddish and becomes darker with time.
8 Conclusion
We have presented a guiding scheme for the placement of strokes in a semi-supervised matting method for natural images and videos. We have also developed a composition method that adjusts object colors to incorporate ambient illumination from the new background. Some features of our method are summarized as follows:
(1) Membership propagation over holes or gaps owing to broad windows.
(2) Strokes are sufficient to be drawn in either object areas or backgrounds.
(3) Facilitation of object extraction by projection of colors with LDA.
(4) Effective initial values for membership propagation.
(5) Simple guidance for placement of strokes.
(6) Fast composition of objects with a new background.
(7) Effective membership propagation from frame to frame in video matting.
(8) Adaptive adjustment of object colors for natural composition.
References
1. Ruzon, M., Tomasi, C.: Alpha estimation in natural images. In: Proc. CVPR, pp. 18–25 (2000)
2. Chuang, Y.Y., Curless, D., Salesin, D., Szeliski, R.: A Bayesian approach to digital matting. In: CVPR, pp. 264–271 (2001)
3. Wang, J., Cohen, M.C.: An iterative optimization approach for unified image segmentation and matting. In: ICCV, pp. 936–943 (2005)
4. Levin, A., Lischinski, D., Weiss, Y.: A closed form solution to natural image matting. In: CVPR, pp. 61–68 (2006)
5. Rother, C., Kolmogorov, V., Blake, A.: GrabCut: interactive foreground extraction using iterated graph cuts. In: SIGGRAPH, pp. 309–314 (2004)
6. Grady, L., Schiwietz, T., Aharon, S.: Random walks for interactive alpha-matting. In: VIIP, pp. 423–429 (2005)
7. Vezhnevets, V., Konouchine, V.: GrowCut: Interactive multi-label N-D image segmentation by cellular automata. Graphicon (2005)
8. Perona, P., Freeman, W.T.: A factorization approach to grouping. In: Burkhardt, H.-J., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1406, pp. 655–670. Springer, Heidelberg (1998)
9. Inoue, K., Urahama, K.: Sequential fuzzy cluster extraction by a graph spectral method. Pattern Recognition Letters 20, 699–705 (1999)
10. Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: analysis and an algorithm. In: NIPS, pp. 849–856 (2001)
11. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. PAMI 22, 888–905 (2000)
12. Jing, L., Urahama, K.: Image recoloring by eigencolor mapping. In: IWAIT, pp. 375–380 (2006)
13. Reinhard, E., Ashikhmin, M., Gooch, B., Shirley, P.: Color transfer between images. IEEE Computer Graphics and Applications 21, 34–41 (2001)
Temporal Priors for Novel Video Synthesis

Ali Shahrokni, Oliver Woodford, and Ian Reid

Robotics Research Laboratory, University of Oxford, Oxford, UK
http://www.robots.ox.ac.uk/
Abstract. In this paper we propose a method to construct a virtual sequence for a camera moving through a static environment, given an input sequence from a different camera trajectory. Existing image-based rendering techniques can generate photorealistic images given a set of input views, though the output images almost unavoidably contain small regions where the colour has been chosen incorrectly. In a single image these artifacts are often hard to spot, but they become more obvious when viewing a real image with its virtual stereo pair, and even more so when a sequence of novel views is generated, since the artifacts are rarely temporally consistent. To address this problem of consistency, we propose a new spatio-temporal approach to novel video synthesis. The pixels in the output video sequence are modelled as nodes of a 3-D graph. We define an MRF on the graph which encodes photoconsistency of pixels as well as texture priors in both space and time. Unlike methods based on scene geometry, which yield highly connected graphs, our approach results in a graph whose degree is independent of scene structure. The MRF energy is therefore tractable, and we solve it for the whole sequence using a state-of-the-art message passing optimisation algorithm. We demonstrate the effectiveness of our approach in reducing temporal artifacts.
1 Introduction
This paper addresses the problem of reconstructing a video sequence from an arbitrary sequence of viewpoints, given an input video sequence. In particular, we focus on the reconstruction of a stereoscopic pair for a given input sequence captured by a camera moving through a static environment. This has application to the generation of 3-D content from commonly available monocular movies and videos, for use with advanced 3-D displays. Existing image-based rendering techniques can generate photorealistic images given a set of input views. Though the best results apparently have remarkable fidelity, closer inspection almost invariably reveals pixels or regions where incorrect colours have been rendered, as illustrated in Fig. 1. These are often, but not always, associated with occlusion boundaries, and while they are often hard to see in a single image, they become very obvious when a sequence of novel views is generated, since the artifacts are rarely spatio-temporally consistent. We propose to solve the problem via a Markov Random Field energy minimisation over
a video sequence with the aim of preserving spatio-temporal consistency and coherence throughout the rendered frames. Two broad approaches to the novel-view synthesis problem are apparent in the literature: (i) multi-view scene reconstruction followed by rendering from the resulting geometric model, and (ii) image-based rendering techniques which seek simply to find the correct colour for a pixel. In both cases a data likelihood term f (C, z) is defined over colour C and depth z which is designed to achieve a maximum at the correct depth and colour. In the multi-view stereo reconstruction problem the aim is generally to find the correct depth, and [1] was the first to suggest that this could be done elegantly for multiple input views by looking for the depth that maximises colour agreement between the input images. Recent approaches such as [2,3] involve quasi-geometric models for 3–D reconstruction where occlusion is modelled as an outlier process. Approximate inference techniques are then used to reconstruct the scene taking account of occlusion. Realistic generative models using quasi-geometric models are capable of rendering high quality images but lead to intractable minimisation problems [3]. More explicit reasoning about depth and occlusions is possible when an explicit volumetric model is reconstructed as in voxel carving such as [4,5]. The direct application of voxel carving or stereo with occlusion methods [6,7,8] to our problem of novel video synthesis would, however, involve simultaneous optimisation of the MRF energy with respect to depth and colour in the space-time domain. The graph corresponding to the output video then becomes highly connected as shown in Fig. 2-a for a row of each frame. Unfortunately however, available optimisation techniques for highly connected graphs with non-submodular potentials are not guaranteed to reach a global solution [9]. In contrast, [10] marginalise the data likelihood over depth and thus have no explicit geometric reasoning about the depth of pixels. This and similar methods rely on photoconsistency regularised by photometric priors [10,7] to generate photorealistic images. The priors are designed to favour output cliques which resemble samples in a texture library built from the set of input images. It has recently been shown [11] that using small 2-pixel patch priors from a local texture library can be as effective as the larger patches used in [10]. [11] converts the problem of optimising over all possible colours, to a discrete labelling problem over modes of the photoconsistency function, referred to as colour modes, which can be enumerated a priori. Since the texture library comprises only pairs of pixels, the maximum clique size is two, and tree-reweighted message passing [12] can be used to solve for a strong minimum in spite of the non-submodular potentials introduced by enumerating the colour modes. We closely follow this latter, image-based rendering approach, but extend it to sequences of images rather than single frames. We propose to define suitable potential functions between spatially and temporally adjacent pixels. This, and our demonstration of the subsequent benefits, form the main contribution of this paper. We define an MRF in space-time for the output video sequence, and optimise an energy function defined over the entire video sequence to obtain a solution for the output sequence which is a strong local minimum of the energy
Individual MRF optimisation for each output frame
Our method: Using temporal priors for video synthesis
Fig. 1. A pair of consecutive frames from a synthesised sequence. Top row: individual MRF optimisation for each output frame fails to ensure temporal consistency yielding artifacts that are particularly evident when the sequence is viewed continuously. Bottom row: Using temporal priors, as proposed in this paper, to optimise an MRF energy over the entire video sequence reduces those effects. An example is circled.
function. Crucially, in contrast to methods based on depth information and 3-D occlusion, our proposed framework has a graph with a depth-independent vertex degree, as shown in Fig. 2-b. This results in a tractable optimisation over the MRF, and hence we have an affordable model for the temporal flow of colours in the scene as the camera moves. The remainder of this paper is organised as follows. In Section 2, we introduce the graph and its corresponding energy function that we wish to minimise, in particular the different potential terms. Section 3 gives implementation details, experimental results and a comparison of our method with (i) per-frame optimisation, and (ii) a naïve, constant-colour prior.
2 Novel Video Synthesis Framework
We formulate the MRF energy using binary cliques, with pairwise texture-based priors for the temporal and spatial edges in the output video graph. Spatial edges in the graph connect each pixel to its 8-connected neighbours within a frame. Temporal edges link pixels with the same coordinates in adjacent frames, as shown in Fig. 2-b. Therefore, the energy of the MRF can be expressed in terms
Fig. 2. Temporal edges in an MRF graph for video sequence synthesis. a) Using a 3-D occlusion model, all pixels on epipolar lines of pixels in adjacent frames must be connected by temporal edges (here only four temporal edges per pixel are shown to avoid clutter). b) Using our proposed temporal texture-based priors we can reduce the degree of the graph to a constant.

Fig. 3. a) The local texture library is built using epipolar lines in sorted input views I for each pixel in the output video sequence. b) Local pairwise temporal texture dictionary for two output pixels p and q connected by a temporal graph edge.
of the unary and binary potential functions for the set of labels (colours) F as follows:

E(F) = \sum_p \phi_p(f_p) + \lambda_1 \sum_p \sum_{q \in N_s(p)} \psi_{pq}(f_p, f_q) + \lambda_2 \sum_p \sum_{q \in N_t(p)} \psi_{pq}(f_p, f_q)   (1)

where f_p and f_q are labels in the label set F, φ is the unary potential measuring photoconsistency, and ψ encodes the pairwise priors for the spatial and temporal neighbours of pixel p, denoted by N_s(p) and N_t(p) respectively. λ_1 and λ_2 are weight coefficients for the different priors. The output sequence is then given by the optimal labelling F* through minimisation of E:

F^* = \arg\min_F \{E(F)\}   (2)
Next, we discuss the texture library for the spatial and temporal terms, introduce some notation, and then define the unary and binary potentials.
Fig. 4. The temporal transition of colours between pixels in two output frames. A constant colour model between temporally adjacent output pixels p and q is clearly invalid because of motion parallax. On the other hand, there is a good chance that the local texture vocabulary comprising colour pairs obtained from the epipolar lines Tp and Tq (respectively the epipolar lines in the corresponding input view of the stereo pair) captures the correct colour combination, as shown in this case.
2.1 Texture Library and Notation
To calculate the local texture library, we first find and sort subsets of the input frames with respect to their distance to the output frames. We denote these subsets by I. The input frame in I which is closest to the output frame containing pixel p is denoted by I(p). Then, for each pairwise clique of pixels p and q, the local texture library is generated by bilinear interpolation of pixels on the clique's epipolar lines in I, as illustrated in Fig. 3. For a pixel p, the colour in input frame k corresponding to depth disparity z is denoted by C_k(z, p). The vocabulary of the library is composed of the colours of the pixels corresponding to the same depth on each epipolar line, and is defined as

T = \{(C_i(z, p), C_j(z, q)) \mid z = z_{min}, \ldots, z_{max},\; i = I(p),\; j = I(q)\}   (3)

We also define T_p as the epipolar line of pixel p in I(p),

T_p = \{C_k(z, p) \mid z = z_{min}, \ldots, z_{max},\; k = I(p)\}.   (4)

2.2 Unary Potentials
Unary potential terms express the measure of agreement in the input views for a hypothesised pixel colour. Since optimisation over the full colour space can only be effectively achieved via slow, non-deterministic algorithms, we use instead a technique proposed in [11] that finds a set of photoconsistent colour modes. The optimisation is then over the choice of which mode, i.e. a discrete
labelling problem. These colour modes are denoted by f_p for pixel p and, using their estimated depth z, the unary potential is given by the photoconsistency of f_p in a set of close input views V:

\phi_p(f_p) = \sum_{i \in V} \rho(\| f_p - C_i(z, p) \|)   (5)

where ρ(·) is a truncated quadratic robust kernel.

2.3 Binary Potentials
Binary (pairwise) potentials in graph-based formulations of computer vision problems often use the Potts model (piecewise constant) to enforce smoothness of the output (e.g., colour in segmentation algorithms, or depth in stereo reconstruction). While the Potts model is useful as a regularisation term, its application to temporal cliques is strictly incorrect. This is due to the relative motion parallax between the frames, as illustrated in Fig. 4. In general, the temporal links (marked by dotted lines) between two pixels p and q do not correspond to the same 3-D point, and therefore the colour coherency assumption underlying the Potts model is invalid. Instead, we propose to use texture-based priors to define pairwise potentials on temporal edges. As shown in Fig. 4, a local texture library given by Eq. 3 for the clique of pixels p and q is generated using the epipolar lines T_p and T_q, defined in Eq. 4, in two successive input frames close to the output frames containing p and q. This library contains the correct colour combination for the clique containing p and q, corresponding to two distinct 3-D points (marked by the dotted rectangle in Fig. 4). This idea is valid for all temporal cliques in general scenes, provided that there exists a pair of successive input frames somewhere in the sequence which can see the correct 3-D points for p and q. Each pairwise potential term measures how consistent the pair of labels for pixels p and q is with the (spatio-temporal) texture library. The potential is taken to be the minimum over all pairs in the library, viz:

\psi_{pq}(f_p, f_q) = \min_z \{ \rho(\| f_p - T_p(z) \|) + \rho(\| f_q - T_q(z) \|) \}.   (6)
Note that the use of the robust kernel ρ(·) ensures that cases where a valid colour combination does not exist are not overly penalised; rather, if a good match cannot be found, a constant penalty is used. As explained above, exploiting texture-based priors enables us to establish a valid model for the temporal edges in the graph which is independent of depth, and therefore avoids highly connected temporal nodes. This is an important feature of our approach, and it implies that the degree of the graph is independent of the 3-D structure of the scene.
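The two potential terms of Eqs. (5) and (6) are inexpensive to evaluate once the texture library is built. The following sketch is our own illustration; the truncation value of the robust kernel is an assumption, not a parameter given in the text.

```python
import numpy as np

def rho(r, trunc=50.0):
    # Truncated quadratic robust kernel; the truncation value is an assumption.
    return np.minimum(r**2, trunc)

def unary_potential(f_p, view_colors):
    # Eq. (5): view_colors holds C_i(z, p) for the close input views V, sampled
    # at the estimated depth z of the colour mode f_p.
    return sum(rho(np.linalg.norm(np.asarray(f_p) - np.asarray(c))) for c in view_colors)

def binary_potential(f_p, f_q, T_p, T_q):
    # Eq. (6): T_p, T_q are the epipolar-line colour arrays indexed by depth z.
    costs = rho(np.linalg.norm(T_p - f_p, axis=1)) + rho(np.linalg.norm(T_q - f_q, axis=1))
    return costs.min()
```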
3 Implementation and Results
We verified the effectiveness of temporal priors for consistent novel video synthesis in several experiments. We compare the generated views with and without
(a) frame #2
(b) frame #5
(c) magnified part of #2
(d) magnified part of #5
Fig. 5. Top row: results obtained using our proposed texture-based temporal priors. Middle row: using the Potts temporal priors. Bottom row: individual rendering of frames. Columns (c) and (d) show the details of rendering. It can be noted that the Potts model and individual optimisation fail on the sharp edges of the leaves.
temporal priors. In all cases, the spatial terms for all 8-connected neighbours in each frame of the MRF energy were computed from texture-based priors in the same way, so the focus of our experiments is on the texture-based temporal priors. We also show results from using the simpler constant-colour prior (the Potts model). The energy function of Eq. 1 is minimised using a recently introduced enhanced version of the tree-reweighted max-product message passing algorithm, known as TRW-S [12], which can handle non-submodular graph edge costs and has guaranteed convergence properties. For an output video sequence with n frames of size W × H, the spatio-temporal graph has n × W × H vertices and (n − 1) × W × H temporal edges when using texture-based temporal priors or the Potts model. This is the minimum number of temporal edges for a spatio-temporal MRF; any prior based on depth with z disparities would require at least z × (n − 1) × W × H temporal edges, where z is of the order of 10 to 100. The typical run time to process a space-time volume of 15 × 100 × 100 pixels is 600 seconds on a P4 D 3.00GHz machine. The same volume, when treated as individual frames, takes 30 × 15 = 450 seconds to process.
frame #1   frame #3   frame #5   frame #9
Our proposed method: texture-based temporal priors
The Potts temporal priors
Individual optimisation per frame
Fig. 6. Synthesised Edmontosaurus sequence. First row, results obtained using our proposed texture-based temporal priors. Second row, using the Potts temporal priors creates some artifacts (frame #3). Third row, individual rendering of frames introduces artifacts in the holes (the nose and the jaw). Also note that the quality of frame #5 has greatly improved thanks to the texture-based temporal priors.
The input video sequence is first calibrated using commercial camera tracking software^1. The stereoscopic output virtual camera projection matrices are then generated from the input camera matrices by adding a horizontal offset to the input camera centres. The colour modes, as well as the unary photoconsistency terms given by Eq. 5, for each pixel in the output video are calculated using the 8 closest views in the input sequence. We also compute 8 subsets I for the texture library computation, as explained in Section 2.3, with the lowest distance to the ensemble of the n output camera positions. Finally, in Eq. 1 we set λ_1 to 1 and λ_2 to 10 in our experiments. Fig. 5 shows two synthesised frames of a video sequence of a tree and the details of rendering around the leaves for the different methods. Here, in the case of temporal priors (texture-based and Potts), 5 frames of 300 × 300 pixels are
^1 Boujou, 2d3 Ltd.
Fig. 7. Stereoscopic frames generated using texture-based temporal priors over 15 frames. In each frame, the left image is the input view (corresponding to the left eye) and the right image is the reconstructed right eye view.
rendered by a single energy optimisation. In the detailed view, it can be noted that the quality of the generated views using texture-based temporal priors has improved, especially around the edges of the leaves. As another example, Fig. 6 shows some frames of the novel video synthesis on the Edmontosaurus sequence using different techniques. Here the temporal priors are used to render 11 frames of 200 × 200 pixels by a single energy optimisation. The first row shows the results obtained using our proposed texture-based temporal prior MRF. Using the Potts model for temporal edges generates more artifacts, as shown in the second row of Fig. 6. Finally, the third row shows the results obtained without any temporal priors, by individual optimisation of each frame. It can be noted that the background is consistently seen through the holes in the skull, while flickering artifacts occur in the case of the Potts prior and individual optimisation. Here the output camera matrices are generated by interpolation between the first and the last input camera positions. Finally, Fig. 7 shows the entire set of stereoscopic frames constructed using temporal priors over 15 frames.
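As an illustration of the stereoscopic set-up described above (generating the right-eye virtual camera by adding a horizontal offset to the input camera centre), here is a minimal sketch; the decomposition into K, R, t, the offset direction along the camera's own x-axis, and the baseline value are assumptions made for this illustration, not details given in the paper.

```python
import numpy as np

def right_eye_projection(K, R, t, baseline=0.065):
    """Build a right-eye virtual projection matrix from a calibrated input camera
    P = K [R | t] by adding a horizontal offset to the camera centre.
    The baseline value and the x-axis offset direction are assumptions."""
    C = -R.T @ t                                                  # input camera centre in world coordinates
    C_right = C + baseline * (R.T @ np.array([1.0, 0.0, 0.0]))    # shift along the camera x-axis
    t_right = -R @ C_right                                        # translation of the offset camera
    return K @ np.hstack([R, t_right.reshape(3, 1)])

# Example with an identity camera: the right-eye view is offset by `baseline` along x.
P_right = right_eye_projection(np.eye(3), np.eye(3), np.zeros(3))
```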
4 Conclusion
We have introduced a new method for novel video rendering with optimisation in space-time domain. We define a Markov Random Field energy minimisation for rendering a video sequence which preserves temporal consistency and coherence throughout the rendered frames. Our method uses a finite set of colours for each pixel with their associated likelihood cost to find a global minimum energy solution which satisfies prior temporal consistency constraints in the output sequence. In contrast to methods based on depth information and 3–D occlusion we exploit texture-based priors on pairwise cliques to establish a valid model for temporal edges in the graph. This approach is independent of the depth and therefore results in a graph whose degree is independent of scene structure. As a result and as supported by our experiments, our approach provides a method to
reduce temporal artifacts in novel video synthesis without resorting to approximate generative models and inference techniques to handle multiple depth maps. Moreover, our algorithm can be extended to larger clique texture-based priors while keeping the degree of the graph independent of the depth of the scene. This requires sophisticated optimisation techniques which can handle larger cliques such as [13,14] and will be investigated in our future work. Quantitative analysis of the algorithm using synthetic/real stereo sequences is also envisaged to further study the efficiency of temporal priors for video synthesis. Acknowledgements. This work was supported by EPSRC grant EP/C007220/1 and by a CASE studentship sponsored by Sharp Laboratories Europe. The authors also wish to thank Andrew W. Fitzgibbon for his valuable input.
References

1. Okutomi, M., Kanade, T.: A multiple-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(4), 353–363 (1993)
2. Strecha, C., Fransens, R., Gool, L.V.: Combined depth and outlier estimation in multi-view stereo. In: Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, pp. 2394–2401. IEEE Computer Society, Los Alamitos (2006)
3. Gargallo, P., Sturm, P.: Bayesian 3D modeling from images using multiple depth maps. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, California, vol. 2, pp. 885–891 (2005)
4. Goesele, M., Seitz, S.M., Curless, B.: Multi-View Stereo Revisited. In: Conference on Computer Vision and Pattern Recognition, New York, USA (2006)
5. Kutulakos, K., Seitz, S.: A Theory of Shape by Space Carving. International Journal of Computer Vision 38(3), 197–216 (2000)
6. Kolmogorov, V., Zabih, R.: Multi-Camera Scene Reconstruction via Graph Cuts. In: European Conference on Computer Vision, Copenhagen, Denmark (2002)
7. Sun, J., Zheng, N., Shum, H.: Stereo matching using belief propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 1–14 (2003)
8. Tappen, M., Freeman, W.: Comparison of graph cuts with belief propagation for stereo, using identical MRF parameters. In: International Conference on Computer Vision (2003)
9. Kolmogorov, V., Rother, C.: Comparison of energy minimization algorithms for highly connected graphs. In: European Conference on Computer Vision, Graz, Austria (2006)
10. Fitzgibbon, A., Wexler, Y., Zisserman, A.: Image-based rendering using image-based priors. In: Proceedings of the International Conference on Computer Vision, vol. 2, pp. 1176–1183 (2003)
11. Woodford, O.J., Reid, I.D., Fitzgibbon, A.W.: Efficient new view synthesis using pairwise dictionary priors. In: Conference on Computer Vision and Pattern Recognition (2007)
12. Kolmogorov, V.: Convergent tree-reweighted message passing for energy minimization. IEEE Trans. Pattern Anal. Mach. Intell. 28(10), 1568–1583 (2006)
13. Kohli, P., Kumar, M.P., Torr, P.H.: P3 & Beyond: Solving Energies with Higher Order Cliques. In: Conference on Computer Vision and Pattern Recognition (2007)
14. Potetz, B.: Efficient Belief Propagation for Vision Using Linear Constraint Nodes. In: Conference on Computer Vision and Pattern Recognition (2007)
Content-Based Image Retrieval by Indexing Random Subwindows with Randomized Trees
Raphaël Marée1, Pierre Geurts2, and Louis Wehenkel2
1 GIGA Bioinformatics Platform, University of Liège, Belgium
2 Systems and Modeling Unit, Montefiore Institute, University of Liège, Belgium
Abstract. We propose a new method for content-based image retrieval which exploits the similarity measure and indexing structure of totally randomized tree ensembles induced from a set of subwindows randomly extracted from a sample of images. We also present the possibility of updating the model as new images come in, and the capability of comparing new images using a model previously constructed from a different set of images. The approach is quantitatively evaluated on various types of images with state-of-the-art results despite its conceptual simplicity and computational efficiency.
1 Introduction
With the improvements in image acquisition technologies, large image collections are available in many domains. In numerous applications, users want to search such large databases efficiently, but semantic labeling of all these images is rarely available, because it is not obvious how to describe images exhaustively with words, and because there is no widely used taxonomy standard for images. Thus, one well-known paradigm in computer vision is "content-based image retrieval" (CBIR), i.e. when users want to retrieve images that share some similar visual elements with a query image, without any further text description, either for the images in the reference database or for the query image. To be practically valuable, a CBIR method should combine computer vision techniques that derive rich image descriptions, and efficient indexing structures [2]. Following these requirements, our starting point is the method of [8], where the goal was to build models able to accurately predict the class of new images, given a set of training images where each image is labeled with one single class among a finite number of classes. Their method was based on random subwindow extraction and ensembles of extremely randomized trees [6]. In addition to good accuracy results obtained on various types of images, this method has attractive computing times. These properties motivated us to extend their method to CBIR, where one has to deal with very large databases of unlabeled images. The paper is organized as follows. The method is presented in Section 2. To assess its performance and usefulness as a foundation for image retrieval, we evaluate it on several datasets representing various types of images in Section 3, where the influence of its major parameters will also be evaluated. Method
parameters and performances are discussed in Section 4. Finally, we conclude with some perspectives.
2 Method Rationale and Description
We now describe the different steps of our algorithm: extraction of random subwindows from images (2.1), construction of a tree-based indexing structure for these subwindows (2.2), derivation of a similarity measure between images from an ensemble of trees (2.3), and its practical use for image retrieval (2.4).

2.1 Extraction of Random Subwindows
Occlusions, cluttered backgrounds, and viewpoint or orientation changes that occur in real-world images motivated the development of object recognition or image retrieval methods that model image appearance locally by using so-called "local features" [15]. Indeed, global aspects of images are considered insufficient to model the variability of objects or scenes, and many local feature detection techniques have been developed over the years. These consider that the neighborhoods of corners, lines/edges, contours, or homogeneous regions capture interesting aspects of images to classify or compare them. However, a single detector might not capture enough information to distinguish all images, and recent studies [18] suggest that most detectors are complementary (some being more adapted to structured scenes while others to textures) and that all of them should ideally be used in parallel. Going one step further, several recent works evaluated dense sampling schemes of local features, e.g. on a uniform grid [4] or even randomly [8,11]. In this work, we use the same random subwindow sampling scheme as [8]: square patches of random sizes are extracted at random locations in images, resized by bilinear interpolation to a fixed size (16 × 16), and described by HSV values (resulting in a 768-dimensional feature vector). This provides a rich representation of images corresponding to various overlapping regions, both local and global, whatever the task and content of the images. Using raw pixel values as descriptors avoids discarding potentially useful information while being generic and fast.
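A minimal sketch of this sampling scheme is given below. The use of OpenCV and the lower bound on the patch size are assumptions of this illustration, while the 16 × 16 bilinear resizing and the 768-value HSV description follow the text.

```python
import numpy as np
import cv2  # OpenCV is an assumption; any library with bilinear resizing and HSV conversion would do

def random_hsv_subwindows(image_bgr, n_subwindows, rng=None):
    """Extract square patches of random size and position, resize them to 16 x 16
    with bilinear interpolation, and describe each by its raw HSV pixel values,
    giving one 768-dimensional vector per subwindow."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image_bgr.shape[:2]
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    descriptors = []
    for _ in range(n_subwindows):
        size = rng.integers(8, min(h, w) + 1)          # random square size (the lower bound is an assumption)
        y = rng.integers(0, h - size + 1)
        x = rng.integers(0, w - size + 1)
        patch = hsv[y:y + size, x:x + size]
        patch = cv2.resize(patch, (16, 16), interpolation=cv2.INTER_LINEAR)
        descriptors.append(patch.reshape(-1).astype(np.float32))  # 16 * 16 * 3 = 768 values
    return np.stack(descriptors)
```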
2.2 Indexing Subwindows with Totally Randomized Trees
In parallel to these computer vision developments, and due to the slowness of nearest neighbor searches that prevent real-time response times with hundreds of thousands of local feature points described by high-dimensional descriptors, several tree-based data structures and/or approximate nearest neighbors techniques have been proposed [1,5,9,13,14,16] for efficient indexing and retrieval. In this paper, we propose to use ensembles of totally randomized trees [6] for indexing (random) local patches. The method recursively partitions the training sample of subwindows by randomly generated tests. Each test is chosen by selecting a random pixel component (among the 768 subwindows descriptors) and a random cut-point in the range of variation of the pixel component in
the subset of subwindows associated with the node to split. The development of a node is stopped as soon as either all descriptors are constant in the leaf or the number of subwindows in the leaf is smaller than a predefined threshold nmin. A number T of such trees are grown from the training sample. The method thus depends on two parameters: nmin and T. We will discuss below their impact on the similarity measure defined by the tree ensemble. There exist a number of indexing techniques based on recursive partitioning. The two main differences between the present work and these algorithms are the use of an ensemble of trees instead of a single one and the random selection of tests in place of more elaborate splitting strategies (e.g., based on a distance metric computed over the whole descriptors in [9,16] or taken at the median of the pixel component whose distribution exhibits the greatest spread in [5]). Because of the randomization, the computational complexity of our algorithm is essentially independent of the dimensionality of the feature space and, like other tree methods, is O(N log(N)) in the number of subwindows. This makes the creation of the indexing structures extremely fast in practice. Note that totally randomized trees are a special case of the Extra-Trees method exploited in [8] for image classification. In this latter method, K random tests are generated at each tree node and the test that maximizes some information criterion related to the output classification is selected. Totally randomized trees are thus obtained by setting the parameter K of this method to 1, which deactivates test filtering based on the output classification and makes it possible to grow trees in an unsupervised way. Note however that the image retrieval procedure described below is independent of the way the trees are built. When a semantic classification of the images is available, it could thus be a good idea to exploit it when growing the trees (as it would try to put subwindows from the same class in the same leaves).
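The following sketch illustrates the recursive construction of one totally randomized tree as described above (random pixel component, random cut-point, stop when descriptors are constant or fewer than nmin subwindows remain). The data structure and the default nmin value are illustrative assumptions.

```python
import numpy as np

class Node:
    """Internal node (feature, threshold, left, right) or leaf (indices) of one tree."""
    def __init__(self):
        self.feature = None
        self.threshold = None
        self.left = None
        self.right = None
        self.indices = None  # set only at leaves

def build_totally_randomized_tree(X, indices=None, nmin=2, rng=None):
    """Recursively split the subwindows in X (n x 768 descriptor matrix) with a
    randomly chosen pixel component and a random cut-point in its range of variation,
    stopping when all descriptors are constant or fewer than `nmin` subwindows remain."""
    rng = np.random.default_rng() if rng is None else rng
    indices = np.arange(len(X)) if indices is None else indices
    node = Node()
    sub = X[indices]
    lo, hi = sub.min(axis=0), sub.max(axis=0)
    varying = np.flatnonzero(hi > lo)
    if len(indices) < nmin or varying.size == 0:    # leaf: constant descriptors or too few subwindows
        node.indices = indices
        return node
    node.feature = int(rng.choice(varying))                           # random pixel component
    node.threshold = rng.uniform(lo[node.feature], hi[node.feature])  # random cut-point
    go_left = sub[:, node.feature] <= node.threshold
    node.left = build_totally_randomized_tree(X, indices[go_left], nmin, rng)
    node.right = build_totally_randomized_tree(X, indices[~go_left], nmin, rng)
    return node
```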
2.3 Inducing Image Similarities from Tree Ensembles
A tree T defines the following similarity between two subwindows s and s' [6]:

k_T(s, s') = 1/N_L  if s and s' reach the same leaf L containing N_L subwindows, and 0 otherwise.

This expression amounts to considering that two subwindows are very similar if they fall in the same leaf that has a very small subset of training subwindows (intuitively, as it is less likely a priori that two subwindows will fall together in a small leaf, it is natural to consider them very similar when they actually do). The similarity induced by an ensemble of T trees is defined by:

k_ens(s, s') = \frac{1}{T} \sum_{t=1}^{T} k_{T_t}(s, s')   (1)
This expression amounts to considering that two subwindows are similar if they are considered similar by a large proportion of the trees. The spread of the similarity measure is controlled by the parameter nmin: when nmin increases, subwindows
tend to fall more often in the same leaf, which yields a higher similarity according to (1). On the other hand, the number of trees controls the smoothness of the similarity. With only one tree, the similarity (1) is very discrete as it can take only two values when one of the subwindows is fixed. The combination of several trees provides a finer-grained similarity measure and we expect that this will improve the results as much as in the context of image classification. We will study the influence of these two parameters in our experiments. Given this similarity measure between subwindows, we derive a similarity between two images I and I' by:

k(I, I') = \frac{1}{|S(I)||S(I')|} \sum_{s \in S(I), s' \in S(I')} k_ens(s, s'),   (2)
where S(I) and S(I') are the sets of all subwindows that can be extracted from I and I' respectively. The similarity between two images is thus the average similarity between all pairs of their subwindows. Although finite, the number of different subwindows of variable size and location that can be extracted from a given image is in practice very large. Thus we propose to estimate (2) by extracting at random from each image an a priori fixed number of subwindows. Notice also that, although (2) suggests that the complexity of this evaluation is quadratic in this number of subwindows, we show below that it can actually be computed in linear time by exploiting the tree structures. Since (1) actually defines a positive kernel [6] among subwindows, equation (2) defines a positive (convolution) kernel among images [17]. This means that this similarity measure has several nice mathematical properties. For example, it can be used to define a distance metric and it can be directly exploited in the context of kernel methods [17].
2.4 Image Retrieval Algorithms
In image retrieval, we are given a set of, say NR , reference images and we want to find images from this set that are most similar to a query image. We propose the following procedure to achieve this goal. Creation of the indexing structure. To build the indexing structure over the reference set, we randomly extract Nls subwindows of variable size and location from each reference image, resize them to 16×16, and grow an ensemble of totally randomized trees from them. At each leaf of each tree, we record for each image of the reference set that appears in the leaf the number of its subwindows that have reached this leaf. Recall of reference images most similar to a query image. We compute the similarities between a query image IQ and all NR reference images, by propagating into each tree Nts subwindows from the query image, and by incrementing, for each subwindow s of IQ , each tree T , and each reference image IR , the similarity k(IQ , IR ) by the proportion of subwindows of IR in the leaf reached by s in T , and by dividing the resulting score by T Nls Nts . This procedure estimates k(IQ , IR ) as given by (2), using Nls and Nts random subwindows
from IR and IQ respectively. From these NR similarities, one can identify the N most similar reference images in O(N NR) operations, and the complexity of the whole computation is on average O(T Nts (log(Nls) + NR)). Notice that the fact that information about the most similar reference images is gathered progressively as the number of subwindows of the query image increases could be exploited to yield an anytime recall procedure. Note also that once the indexing structure has been built, the database of training subwindows and the original images are not required anymore to compute the similarity. Computation of the similarity between query images. The above procedure can be extended to compute the similarity of a query image to another image not belonging to the reference set, an extension we name model recycling. To this end, one propagates the subwindows from each image through each tree and maintains counts of the number of these subwindows reaching each leaf. The similarity (2) is then obtained by summing over tree leaves the product of the subwindow counts for the two images divided by the number of training subwindows in the leaf and by normalizing the resulting sum. Incremental mode. One can incorporate the subwindows of a new image into an existing indexing structure by propagating and recording their leaf counts. When, subsequent to this operation, a leaf happens to contain more than nmin subwindows, the random splitting procedure would merely be used to develop it. Because of the random nature of the tree growing procedure, this incremental procedure is likely to produce similar trees as those that would be obtained by rebuilding them from scratch. In real-world applications such as World Wide Web image search engines, medical imaging in research or clinical routine, or software to organize user photos, this incremental characteristic will be of great interest as new images are crawled by search engines or generated very frequently.
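A sketch of the recall step is given below. It assumes that each leaf carries a hypothetical `image_counts` dictionary (reference image id → number of its training subwindows in that leaf), recorded when the indexing structure was built, and that the trees are of the form sketched in Section 2.2; the numbers it produces estimate k(IQ, IR) of Eq. (2).

```python
from collections import defaultdict

def leaf_of(tree, descriptor):
    """Descend a totally randomized tree (as sketched in Section 2.2) to a leaf."""
    node = tree
    while node.indices is None:
        node = node.left if descriptor[node.feature] <= node.threshold else node.right
    return node

def image_similarities(trees, query_subwindows, n_ls):
    """Estimate k(I_Q, I_R) for every reference image at once.
    `n_ls` is the number of training subwindows extracted per reference image;
    the query contributes len(query_subwindows) = N_ts subwindows."""
    scores = defaultdict(float)
    n_ts, n_trees = len(query_subwindows), len(trees)
    for tree in trees:
        for s in query_subwindows:
            counts = leaf_of(tree, s).image_counts    # assumed bookkeeping: {image id -> subwindow count}
            leaf_size = sum(counts.values())          # N_L, total training subwindows in the leaf
            for image_id, c in counts.items():
                scores[image_id] += c / leaf_size     # proportion of I_R's subwindows in the leaf
    return {img: v / (n_trees * n_ls * n_ts) for img, v in scores.items()}
```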
3 Experiments
In this section, we perform a quantitative evaluation of our method in terms of its retrieval accuracy on datasets with ground-truth labels. We study the influence of the number of subwindows extracted from training images for building the tree structure (Nls), the number of trees built (T), the stop-splitting criterion (nmin), and the number of subwindows extracted from query images (Nts). Like other authors, we will consider that an image is relevant to a query if it is of the same class as the query image, and irrelevant otherwise. Then, different quantitative measures [3] can be computed. In order to compare our results with the state of the art, for each of the following datasets, we will use the same protocol and performance measures as other authors. Note that, while using class labels to assess accuracy, this information is not used during the indexing phase.
3.1 Image Retrieval on UK-Bench
The University of Kentucky recognition benchmark is a dataset introduced in [9] and recently updated; it now contains 640 × 480 color images of 2550 classes of
4 images each (10200 images in total), approximately 1.7GB of JPEG files. These images depict plants, people, CDs, books, magazines, outdoor/indoor scenes, animals, household objects, etc., as illustrated by Figure 1. The full set is used as the reference database to build the model. Then, the measure of performance is an average score that counts for each of the 10200 images how many of the 4 images of this object (including the identical image) are ranked in the top-4 similar images. The score thus varies from 0 (when getting nothing right) up to 4 (when getting everything right). Average scores of variants of the method presented in [9] range from 3.07 to 3.29 (i.e. recognition rates from 76.75% to 82.36%, computed as (number of correct images in the first 4 retrieved images / 40800) × 100%; see their updated website, http://www.vis.uky.edu/~stewe/ukbench/), using one of the best detector and descriptor combinations (the Maximally Stable Extremal Region (MSER) detector and the Scale Invariant Feature Transform (SIFT) descriptor), a tree structure built by hierarchical k-means clustering, and different scoring schemes. Very recently, [14] improved results up to a score of 3.45 using the same set of features but with an approximate k-means clustering exploiting randomized k-d trees.
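The scoring protocol just described can be written down as a small sketch; the data layout below (one ranked list per query and an object label per image) is an assumption of this illustration, not the benchmark's implementation.

```python
def ukbench_score(ranked_lists, object_id):
    """Average UK-Bench score in [0, 4]: for every query, count how many of the
    4 images of the same object (the query itself included) appear in the top-4
    retrieved images, then average over all queries.

    ranked_lists[q] is the list of database image indices sorted by decreasing
    similarity to query q; object_id[i] is the object label of image i."""
    total = 0
    for q, ranking in enumerate(ranked_lists):
        top4 = ranking[:4]
        total += sum(1 for r in top4 if object_id[r] == object_id[q])
    return total / len(ranked_lists)
```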
Fig. 1. Several images of the UK-Bench. One image for various objects (top), the four images of the same object (bottom).
Figure 2 shows the influence of the parameters of our method on the recognition performance. We obtain scores slightly above 3 (i.e. around 75% recognition rate) with 1000 subwindows extracted per image, 10 trees, and a minimum number of subwindows per node nmin between 4 and 10. Note that the recognition rate still increases when using more subwindows. For example, although not reported in these figures, a score of 3.10 is obtained when 5000 subwindows are extracted per image with only 5 trees (nmin = 10).

3.2 Image Retrieval on ZuBuD
The Z¨ urich Buildings Database4 is a database of color images of 201 buildings. Each building in the training set is represented by 5 images acquired at 5 arbitrary viewpoints. The training set thus includes 1005 images and it is used to 2 3 4
(Number of correct images in first 4 retrieved images /40800) ∗ 100% http://www.vis.uky.edu/~stewe/ukbench/ http://www.vision.ee.ethz.ch/showroom/zubud/index.en.html
[Fig. 2 plots — recognition rate (%) on UK-Bench: (top left) influence of the stop-splitting parameter nmin (T = 10), for 1000 and 100 subwindows per image; (top right) influence of the number of trees T, for 1000 subwindows per image (nmin = 4 and 15) and 100 subwindows per image (nmin = 15); (bottom left) influence of the number of training subwindows Nls (T = 10, Nts = 100, nmin = 15), with 100 subwindows per query image; (bottom right) influence of the number of test subwindows Nts (T = 10, Nls = 1500), with 1500 subwindows per training image (nmin = 15 and 1000).]
Fig. 2. Influence of the parameters on UK-Bench. Influence of stop splitting parameter, number of trees, number of training subwindows, number of test subwindows.
build the model, while the test set (acquired by another camera under different conditions), which contains 115 images of a subset of the 201 buildings, is used to evaluate the generalization performance of the model. The performance measured in [13,3,16] is the classification recognition rate of the first retrieved image, with 93%, 89.6%, and 59.13% error rates respectively. In [12], a 100% recognition rate was obtained, but with recall times of over 27 seconds per image (with an exhaustive scan of the database of local affine frames). We obtain 95.65% with 1000 subwindows per image, T = 10, and several values of nmin below 10. On this problem, we observed that it is not necessary to use so many trees and subwindows to obtain this state-of-the-art recognition rate. In particular, only one tree, or fewer than 500 subwindows, is sufficient.
3.3 Model Recycling on META and UK-Bench
In our last experiment, we evaluate the model recycling idea, i.e. we want to assess whether, given a large set of unlabeled images, we can build a model on these images and then use this model to compare new images from another set. To do so, we build up a new dataset called META that is basically the collection of images from the following publicly available datasets: LabelMe Set1-16, Caltech-256, Aardvark to Zorro, CEA CLIC, Pascal Visual Object Challenge 2007, Natural Scenes A. Oliva, Flowers, WANG, Xerox6, Butterflies, Birds. This
sums up to 205763 color images (about 20 GB of JPEG image files) that we use as training data, from which we extract random subwindows and build the ensemble of trees. Then, we exploit that model to compare the UK-Bench images between themselves. Using the same performance measure as in Section 3.1, we obtain an average score of 2.64, i.e. a recognition rate of 66.1%, with 50 subwindows per training image of META (roughly a total of 10 million subwindows), T = 10, nmin = 4, and 1000 subwindows per test image of UK-Bench. For comparison, we obtained a score of 3.01, i.e. a 75.25% recognition rate, using the full UK-Bench set as training data and the same parameter values. Unsurprisingly, the recognition rate is better when the model is built using the UK-Bench set as training data, but we still obtain an interesting recognition rate with the META model. Nistér and Stewénius carried out a similar experiment in [9], using different training sets (images from moving vehicles and/or CD covers) to build a model to compare UK-Bench images. They obtained scores ranging from 2.16 to 3.16 (using between 21 and 53 million local features), which are also inferior to what they obtained exploiting the UK-Bench set for building the model.
4 Discussion
Some general comments about the influence of parameters can be drawn from our experiments. First, we observed that the more trees and subwindows, the better the results. We note that on ZuBuD, a small number of trees and a not so large number of subwindows already gives state-of-the-art results. We also found out that the value of nmin should be neither too small nor too large. It influences the recognition rate, and increasing its value also reduces the memory needed to store the trees (as they are smaller when nmin is larger) and the required time for the indexing phase. It also reduces the prediction time, but with large values of nmin (such as 1000) image indexes at terminal nodes of the trees tend to become dense, which then slows down the retrieval phase of our algorithm, which exploits the sparsity of these vectors to speed up the updating procedure. One clear advantage of the method is that the user can largely control the performance of the method, and its parameters can be chosen so as to trade off recognition performance, computational requirements, problem difficulty, and available resources. For example, with our current proof-of-concept implementation in Java, one single tree that has 94.78% accuracy on ZuBuD is built in less than 1m30s on a single 2.4GHz processor, using a total of 1005000 training subwindows described by 768 values, and nmin = 4. When testing query images, the mean number of subwindow tests in the tree is 42.10. In our experiment of Section 3.3, to find similar images in UK-Bench based on the model built on META, there are on average 43.63 tests per subwindow in one single tree. On average, all 1000 subwindows of one UK-Bench image are propagated in all the 10 trees in about 0.16 seconds. Moreover, random subwindow extraction and raw pixel description are straightforward. In Section 3.3 we introduced the META database and model. While this database obviously does not represent the infinite “image space”, it is however
possible to extract a very large set of subwindows from it, hence we expect that the META model could produce scores distinctive enough to compare a wide variety of images. The results we obtained in our last experiment on the 2550-object UK-Bench dataset are promising in that sense. Increasing the number of subwindows extracted from the META database and enriching it using other image sources such as the Wikipedia image database dump or frames from the Open Video project might increase the generality and power of the META model. Our image retrieval approach does not require any prior information about the similarity of training images. Note however that in some applications, such information is available and it could be a good idea to exploit it to design better similarity measures for image retrieval. When this information is available in the form of a semantic labeling of the images, it is easy to incorporate it into our approach, simply by replacing totally randomized trees with extremely randomized trees for the indexing of subwindows. Note however that our result on ZuBuD equals the result obtained by [8] using extremely randomized trees that exploit the image labels during the training stage. This result suggests that for some problems, good image retrieval performance could be obtained with a fast and rather simple method and without prior information about the images. Besides a classification, information could also be provided in the form of a set of similar or dissimilar image pairs. Nowak and Jurie [10] propose a method based on randomized trees for exploiting such pairwise constraints to design a similarity measure between images. When more quantitative information is available about the similarity between training images, one could combine our approach with ideas from [7], where a (kernel-based) similarity is generalized to never-seen objects using ensembles of randomized trees.
5 Conclusions
In this paper, we used totally randomized trees to index randomly extracted subwindows for content-based image retrieval. Due to its conceptual simplicity (randomization is used both in image description and indexing), the method is fast. Good recognition results are obtained on two datasets with illumination, viewpoint, and scale changes. Moreover, an incremental mode and model recycling were presented. In future work, other image descriptors and other stop-splitting and scoring schemes might be evaluated. In terms of other applications, the usefulness of the method for the problem of near-duplicate image detection might be investigated. Finally, totally randomized trees might also be helpful to index high-dimensional databases of other types of content.
Acknowledgements

Raphaël Marée is supported by the GIGA (University of Liège) with the help of the Walloon Region and the European Regional Development Fund. Pierre Geurts is a research associate of the FNRS, Belgium.
References

1. Böhm, C., Berchtold, S., Keim, D.A.: Searching in high-dimensional spaces - index structures for improving the performance of multimedia databases. ACM Computing Surveys 33(3), 322–373 (2001)
2. Datta, R., Joshi, D., Li, J., Wang, J.Z.: Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys 39(65) (2007)
3. Deselaers, T., Keysers, D., Ney, H.: Classification error rate for quantitative evaluation of content-based image retrieval systems. In: ICPR 2004. Proc. 17th International Conference on Pattern Recognition, pp. 505–508 (2004)
4. Deselaers, T., Keysers, D., Ney, H.: Discriminative training for object recognition using image patches. In: CVPR 2005. Proc. International Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 157–162 (2005)
5. Friedman, J.H., Bentley, J.L., Finkel, R.A.: An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software 3(3), 209–226 (1977)
6. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Machine Learning 36(1), 3–42 (2006)
7. Geurts, P., Wehenkel, L., d'Alché-Buc, F.: Kernelizing the output of tree-based methods. In: ICML 2006. Proc. of the 23rd International Conference on Machine Learning, pp. 345–352. ACM, New York (2006)
8. Marée, R., Geurts, P., Piater, J., Wehenkel, L.: Random subwindows for robust image classification. In: Proc. IEEE CVPR, vol. 1, pp. 34–40. IEEE, Los Alamitos (2005)
9. Nistér, D., Stewénius, H.: Scalable recognition with a vocabulary tree. In: Proc. IEEE CVPR, vol. 2, pp. 2161–2168 (2006)
10. Nowak, E., Jurie, F.: Learning visual similarity measures for comparing never seen objects. In: Proc. IEEE CVPR, IEEE Computer Society Press, Los Alamitos (2007)
11. Nowak, E., Jurie, F., Triggs, B.: Sampling strategies for bag-of-features image classification. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 490–503. Springer, Heidelberg (2006)
12. Obdržálek, Š., Matas, J.: Image retrieval using local compact DCT-based representation. In: Michaelis, B., Krell, G. (eds.) Pattern Recognition. LNCS, vol. 2781, pp. 490–497. Springer, Heidelberg (2003)
13. Obdržálek, Š., Matas, J.: Sub-linear indexing for large scale object recognition. In: BMVC 2005. Proc. British Machine Vision Conference, pp. 1–10 (2005)
14. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: Proc. IEEE CVPR, IEEE Computer Society Press, Los Alamitos (2007)
15. Schmid, C., Mohr, R.: Local greyvalue invariants for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(5), 530–534 (1997)
16. Shao, H., Svoboda, T., Ferrari, V., Tuytelaars, T., Van Gool, L.: Fast indexing for image retrieval based on local appearance with re-ranking. In: ICIP 2003. Proc. IEEE International Conference on Image Processing, pp. 737–749. IEEE Computer Society Press, Los Alamitos (2003)
17. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004)
18. Zhang, J., Marszałek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: a comprehensive study. International Journal of Computer Vision 73, 213–238 (2007)
Analyzing Facial Expression by Fusing Manifolds
Wen-Yan Chang1,2, Chu-Song Chen1,3, and Yi-Ping Hung1,2,3
1 Institute of Information Science, Academia Sinica, Taiwan
2 Dept. of Computer Science and Information Engineering, National Taiwan University
3 Graduate Institute of Networking and Multimedia, National Taiwan University
{wychang,song}@iis.sinica.edu.tw, [email protected]
Abstract. Feature representation and classification are two major issues in facial expression analysis. In the past, most methods used either holistic or local representation for analysis. In essence, local information mainly focuses on the subtle variations of expressions and holistic representation stresses on global diversities. To take the advantages of both, a hybrid representation is suggested in this paper and manifold learning is applied to characterize global and local information discriminatively. Unlike some methods using unsupervised manifold learning approaches, embedded manifolds of the hybrid representation are learned by adopting a supervised manifold learning technique. To integrate these manifolds effectively, a fusion classifier is introduced, which can help to employ suitable combination weights of facial components to identify an expression. Comprehensive comparisons on facial expression recognition are included to demonstrate the effectiveness of our algorithm.
1 Introduction

Understanding human emotions plays an important role in human communication. To study human behavior scientifically and systematically, emotion analysis is an intriguing research issue in many fields. Much attention has been drawn to this topic in computer vision applications such as human-computer interaction, robot cognition, and behavior analysis. Usually, a facial expression analysis system contains three stages: face acquisition, feature extraction, and classification. For feature extraction, a lot of methods have been proposed. In general, most methods represent features in either holistic or local ways. Holistic representation uses the whole face for representation and focuses on the facial variations of global appearance. In contrast, local representation adopts local facial regions or features and gives attention to the subtle diversities on a face. Though most recent studies have been directed towards local representation [17,18], good research results are still obtained by using holistic approaches [1,2]. Hence, it is interesting to exploit both of their benefits to develop a hybrid representation. In addition to feature representation, we also introduce a method for classification. Whether using Bayesian classifiers [4,18], support vector machines (SVM) [1], or neural networks, finding a strong classifier is at the core of existing facial expression analysis studies. In the approaches that adopt local facial information, weighting these local regions in a single classifier is a common strategy [18]. However, not all local regions
have the same significance in discriminating an expression. Recognition depending only on a fixed set of weights for all expressions cannot make explicit the significance of each local region to a particular expression. To address this issue, we characterize the discrimination ability per expression for each component in a hybrid representation; a fusion algorithm based on binary classification is presented. In this way, the characteristics of components can be addressed individually for expression recognition. In recent years, manifold learning [15,16] has received much attention in machine learning and computer vision research. The main consideration of manifold learning is not only to preserve global properties in data, but also to maintain localities in the embedded space. In addition to addressing the data representation problem, supervised manifold learning (SML) techniques [3,20] were proposed to further consider data class during learning and provide a good discriminating capability. These techniques have been successfully applied to face recognition under different types of variations. Basically, SML can deliver superior performance not only to traditional subspace analysis techniques, such as PCA and LDA, but also to unsupervised manifold learning methods. By taking advantage of SML, we introduce a facial expression analysis method, where a set of embedded manifolds is constructed for each component. To integrate these embedded manifolds, a fusion algorithm is suggested and good recognition results can be obtained.
2 Background

2.1 Facial Expression Recognition

To describe facial activity caused by the movement of facial muscles, the facial action coding system (FACS) was developed, and 44 action units are used for modeling facial expressions. Instead of analyzing these complicated facial actions, Ekman et al. [6] also investigated several basic categories for emotion analysis. They claimed that there are six basic universal expressions: surprise, fear, sadness, disgust, anger, and happiness. In this paper, we follow the six-class expression taxonomy and classify each query image into one of the six classes. As mentioned above, feature extraction and classification are two major modules in facial expression analysis. Essa et al. [7] applied optical flow to represent motions of expressions. To lessen the effects of lighting, Wen and Huang [18] used both geometric shape and ratio-image-based features for expression recognition with a MAP formulation. Lyons et al. [11] and Zhang et al. [21] adopted Gabor wavelet features for this task. Recently, Bartlett et al. [1] suggested using Adaboost for Gabor feature selection, and satisfactory expression recognition performance was achieved. Furthermore, appearance is also a popular representation for facial expression analysis, and several subspace analysis techniques were used to improve recognition performance [11]. In [4], Cohen et al. proposed the Tree-Augmented Naïve Bayes classifier for video-based expression analysis. Furthermore, neural networks, hidden Markov models, and SVMs [1] have also been widely used. Besides image-based expression recognition, Wang et al. [17] recently used 3D range models for expression recognition and proposed a method to extract features from a 3D model. To analyze expressions under different orientations, head pose recovery is also addressed in some papers. In general, model registration or tracking approaches
are used to estimate the pose, and the image is warped into a frontal view [5,18]. Dornaika et al. [5] estimated head pose by using an iterative gradient descent method. Then, they applied particle filtering to track facial actions and recognize expressions simultaneously. Wen and Huang [18] also adopted a registration technique to obtain the geometric deformation parameters and warped images according to these parameters for expression recognition. Zhu and Ji [22] refined the SVD decomposition method by normalizing matrices to estimate the parameters of face pose and recover facial expression simultaneously. In a recent study, Pantic and Patras [13] further paid attention to expression analysis based on the face profile. More detailed surveys about facial expression analysis can be found in [8,12].

2.2 Manifold Learning

In the past decades, subspace learning techniques have been widely used for linear dimensionality reduction. Different from the traditional subspace analysis techniques, LLE [15] and Isomap [16] were proposed by considering the local geometry of data in recent manifold learning studies. They assume that a data set approximately lies on a lower dimensional manifold embedded in the original higher dimensional feature space. Hence, they focus on finding a good embedding approach for training data representation in a lower dimensional space without considering the class label of data. However, one limitation of nonlinear manifold learning techniques is that manifolds are defined only on the training data and it is difficult to map new test data to the lower dimensional space. Instead of using nonlinear manifold learning techniques, He et al. [9] proposed a linear approach, namely locality preserving projections (LPP), for vision-based applications. To achieve a better discriminating capability, it has recently been suggested to consider the class labels of data during learning, and supervised manifold learning techniques were developed. Chen et al. [3] proposed the local discriminant embedding (LDE) method to learn the embedding of the sub-manifold for each class by utilizing the neighbor and class relations. At the same time, Yan et al. [20] also presented a graph embedding method, called marginal Fisher analysis (MFA), which shares a similar concept with LDE. By using Isomap, Chang and Turk [2] introduced a probabilistic method for video-based facial expression analysis.
3 Expression Analysis Using Fusion Manifolds

3.1 Facial Components

Humans usually recognize emotions according to both global facial appearance and variations of facial components, such as eye shape, mouth contour, wrinkle expression, and the like. In our method, we attempt to consider local facial regions and the holistic face simultaneously. Based on facial features, we divide a face into seven components including left eye (LE), right eye (RE), middle of eyebrows (ME), nose (NS), mouth and chin (MC), left cheek (LC), and right cheek (RC). A mask of these components is illustrated in Fig. 1(a). In addition, two components, upper face (UF) and holistic face (HF), are also considered. The appearances of all components are shown in Fig. 1(b).
Fig. 1. Facial components used in our method. (a) shows the facial component mask and the locations of these local components. (b) examples of these components.
3.2 Fusion Algorithm for Embedded Manifolds

After representing a face by nine components, we then perform expression analysis based on them. To deal with this multi-component information, a fusion classification scheme is introduced. Given a face image I, a mapping M : R^{d×c} → R^t is constructed by

M(I) = [m_1(I_1), m_2(I_2), \ldots, m_c(I_c)],   (1)
where c is the number of components, m_i(·) is an embedding function and I_i is a d-dimensional sub-image of the i-th component. Then, the multi-component information is mapped to a t-dimensional feature vector M(I). To construct the embedding function for each component, supervised manifold learning techniques are considered in our method. In this paper, the LDE [3] method is adopted for facial expression analysis. Considering a data set {x_i | i = 1, ..., n} with class labels {y_i} in association with a facial component, where y_i ∈ {Surprise, Fear, Sadness, Disgust, Anger, Happiness}, LDE attempts to simultaneously minimize the distances of neighboring data points in the same class and maximize the distances between neighboring points belonging to different classes in a lower dimensional space. The formulation of LDE is

max_V \sum_{i,j} ||V^T x_i − V^T x_j||^2 w'_{ij}   such that   \sum_{i,j} ||V^T x_i − V^T x_j||^2 w_{ij} = 1,   (2)
where w_{ij} = exp[−||x_i − x_j||^2 / r] is the weight between x_i and x_j if x_i and x_j are neighbors with the same class label. By contrast, w'_{ij} is the corresponding weight between two neighbors, x_i and x_j, which belong to different classes. In LDE, only the K nearest neighbors are considered during learning. After computing the projection matrix V, an embedding of a data point x' can be found by projecting it onto a lower dimensional space with z' = V^T x'. For classification, the nearest neighbor is used in the embedded low-dimensional space.
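One common way to solve this kind of constrained objective is as a generalized eigenproblem on the graph Laplacians of the two neighborhood graphs; the sketch below illustrates that route. It is not necessarily the authors' exact implementation, and the regularization term is an added assumption for numerical stability.

```python
import numpy as np
from scipy.linalg import eigh

def lde_projection(X, W_same, W_diff, dim, reg=1e-6):
    """Maximise sum_ij ||V^T x_i - V^T x_j||^2 w'_ij subject to the analogous
    same-class sum being constant, via a generalized symmetric eigenproblem.
    X is d x n; W_same / W_diff are n x n affinity matrices for same-class and
    different-class neighbors; returns the d x dim projection matrix V."""
    def laplacian(W):
        return np.diag(W.sum(axis=1)) - W

    A = X @ laplacian(W_diff) @ X.T            # between-class spread (to maximise)
    B = X @ laplacian(W_same) @ X.T            # within-class spread (constraint)
    B += reg * np.eye(B.shape[0])              # regularisation (assumption, for stability)
    evals, evecs = eigh(A, B)                  # solves A v = lambda B v
    return evecs[:, np.argsort(evals)[::-1][:dim]]  # top-`dim` generalized eigenvectors
```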
Since not all components are discriminative for an expression (e.g., chin features are particularly helpful for surprise and happiness), to take the discrimination ability of each component into account, a probabilistic representation is used to construct M(I) in our approach instead of a hard decision by nearest neighboring. By calculating the shortest distance from x' to a data point in each class, a probabilistic representation can be obtained by

D(x') = \frac{1}{\sum_{i=1,...,e} D_i} \{D_1, D_2, \ldots, D_e\},   (3)

where D_i = min_k ||V^T x_{ik} − z'||, x_{ik} is a training sample belonging to class i, z' = V^T x', and e = 6 is the number of facial expression classes. For each component j (j = 1, ..., c), the embedding function m_j(·) can be written as m_j(I_j) = D(I_j). Then, the dimension of M(I) is t = 6 × 9 = 54. The relationship among components and expressions can be encoded in M(I) by using this representation. Components that are complementary to each other for identifying an individual expression are thus considered in the fusion stage to boost the recognition performance. To learn the significance of components from the embedded manifolds, a fusion classifier F : R^t → {Surprise, Fear, Sadness, Disgust, Anger, Happiness} is used. With the vectors M(I), we apply a classifier to {(x_i, y_i) | i = 1, ..., n}, where x = M(I). The fusion classifier is helpful to decide the importance of each component to different expressions instead of selecting a fixed set of weights for all expressions. Due to its good generalization ability, SVM is adopted as the fusion classifier F in our method. Given a test datum x', the decision function of SVM is formulated as

f(x') = u^T φ(x') + b,   (4)
where φ is a kernel function, and u and b are parameters of the decision hyperplane. For a multi-class classification problem, pairwise coupling is a popular strategy that combines all pairwise comparisons into a multi-class decision. The class with the most winning two-class decisions is then selected as the prediction. Besides predicting an expression label, we also allow our fusion classifier to provide the probability/degree of each expression. In general, the absolute value of the decision function corresponds to the distance from x' to the hyperplane and also reflects the confidence of the predicted label for a two-class classification problem. To estimate the probability of each class in a multi-class problem, the pairwise probabilities are addressed. Considering a binary classifier of classes i and j, the pairwise class probability t_i ≡ P(y = i | x') can be estimated from (4) based on x' and the training data by Platt's posterior probabilities [14], with t_i + t_j = 1. That is,

t_i = \frac{1}{1 + \exp(A f(x') + B)},   (5)
where the parameters A and B are estimated by minimizing the negative log-likelihood function

\min_{A,B} − \sum_k \left[ \frac{y_k + 1}{2} \log(q_k) + \left(1 − \frac{y_k + 1}{2}\right) \log(1 − q_k) \right],   (6)

in which

q_k = \frac{1}{1 + \exp(A f(x_k) + B)},   (7)
and {x_k, y_k | y_k ∈ {1, −1}} is the set of training data. Then, the class probabilities p = {p_1, p_2, \ldots, p_e} can be estimated by minimizing the Kullback-Leibler distance between t_i and p_i/(p_i + p_j), i.e.,

\min_p \sum_{i \neq j} v_{ij} \, t_i \log\left( \frac{t_i (p_i + p_j)}{p_i} \right),   (8)

where \sum_{k=1,...,e} p_k = 1, and v_{ij} is the number of training data in classes i and j. Recently, a generalized approach is proposed [19] to tackle this problem. For robust estimation, the relation t_i/t_j ≈ p_i/p_j is used and the optimization is re-formulated as

\min_p \frac{1}{2} \sum_{i=1}^{e} \sum_{j : j \neq i} (t_j p_i − t_i p_j)^2,   (9)

instead of using the relation t_i ≈ p_i/(p_i + p_j). Then, class probabilities can be stably measured by solving (9).
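To make the fusion stage concrete, the sketch below concatenates the nine per-component vectors of Eq. (3) into the 54-dimensional M(I) and trains an SVM whose pairwise Platt probabilities are coupled into class probabilities, much as Eqs. (5)-(9) describe. The use of scikit-learn, the RBF kernel, and the data layout are assumptions of this illustration, not choices stated in the paper.

```python
import numpy as np
from sklearn.svm import SVC  # SVC(probability=True) applies Platt scaling and pairwise coupling

def fit_fusion_classifier(component_distances, labels):
    """component_distances[k] is assumed to be the list of nine per-component
    6-vectors D(x') for training face k; they are concatenated into M(I) (54-dim)
    and an SVM fusion classifier F is trained on these vectors."""
    M = np.array([np.concatenate(vectors) for vectors in component_distances])
    clf = SVC(kernel="rbf", probability=True)   # kernel choice is an assumption
    clf.fit(M, labels)
    return clf

# Prediction yields both the expression label and per-class probabilities:
# clf.predict(M_test), clf.predict_proba(M_test)
```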
4 Experiment Results

4.1 Dataset and Preprocessing

In our experiments, the publicly available CMU Cohn-Kanade expression database [10] is used to evaluate the performance of the proposed method. It consists of 97 subjects with different expressions. However, not all of these subjects have six coded expressions, and some of them only consist of less than three expressions. To avoid the unbalance problem in classification, we select 43 subjects who have at least 5 expressions from the database. The selection contains various ethnicities and includes different lighting conditions. Person-independent evaluation [18] is taken in our experiments so that the data of one person will not appear in the training set when this person is used as a testing subject. Evaluating performance in this way is more challenging since the variations between subjects are much larger than those within the same subject, and it also examines the generalization ability of the proposed method. To locate the facial components, the eye locations available in the database are used. Then, the facial image is registered according to the locations and orientations of the eyes. The component mask shown in Fig. 1 is applied to the registered facial image to extract facial components. The resolution of the sub-image for each component is 32 × 32 in our implementation.

4.2 Algorithms for Comparison

In this section, we give comparisons for different representations and algorithms. In holistic representation, we recognize expressions only by using the whole face,
i.e., the ninth image in Fig. 1(b), while the first seven components shown in Fig. 1(b) are used for local representation. To demonstrate the performance of the proposed method, several alternatives are also implemented for comparison. In the comparisons, appearance is used as the main feature by representing the intensities of pixels in a 1D vector. To evaluate the performance, five-fold cross validation is adopted. According to the identity of subjects, we divide the selected database into five parts, where four parts of them are treated as training data and the remaining part is treated as validation data in turn. To perform the person-independent evaluation, the training and validation sets do not contain images of the same person. We introduce the algorithms that are used for comparison as follows.

Supervised Manifold Learning (SML). In this method, only holistic representation is used for recognition. Here, LDE is adopted and the expression label is predicted by using nearest-neighbor classification. We set the number of neighbors K to 19 and the dimension of the reduced space to 150 in LDE. These parameters are also used in all of the other experiments.

SML with Majority Voting. This approach is used for multi-component integration. SML is applied to each component at first. Then, the amount of each class label is accumulated and the final decision is made by selecting the class with maximum quantity.

SVM Classification. This is an approach using SVM on the raw data (either holistic or local) directly without dimension reduction by SML. In our implementation, a linear kernel is used by considering the computational cost. For multi-component integration, we simply concatenate the features of all of the components in order in this experiment.

SVM with Manifold Reduction. This approach is similar to the preceding SVM approach. The main difference is that the dimension of the data is reduced by manifold learning at first. Then, the projected data are used for SVM classification.

Our Approach (SML with SVM Fusion). Here, the proposed method described in Section 3.2 is used for evaluation.

4.3 Comparisons and Discussions

We summarize the recognition results of the aforementioned methods in Table 1. One can see that local representation provides better performance than the holistic one in most methods. This agrees with the conclusions of many recent studies. By taking the advantages of both holistic and local representations, the hybrid approach can generally provide a superior result when an appropriate method is adopted. As shown in Table 1, the best result is obtained by using the proposed method with the hybrid representation. The recognition rates of individual expressions, obtained by using the aforementioned methods with the hybrid representation, are illustrated in Fig. 2. We illustrate the importance/influence of each component on an expression by a 3D visualization as shown in Fig. 3. The accuracy of each component is evaluated by applying SML. From this figure, the discrimination ability of each component for a particular expression can be seen. The overall accuracy evaluated by considering all expressions is summarized in Table 2. Though the accuracies of some components are not good enough, a higher recognition rate of 94.7% can still be achieved by using the proposed fusion algorithm to combine these components. This demonstrates the advantage of our fusion method.
Table 1. Accuracies for different methods using holistic, local, and hybrid representation

                       Methods                        Accuracy
Holistic Approaches    SML                            87.7 %
                       SVM Classification             86.1 %
                       SVM with Manifold Reduction    87.7 %
Local Approaches       SML with Majority Voting       78.6 %
                       SVM Classification             87.2 %
                       SVM with Manifold Reduction    92.5 %
                       SML with SVM Fusion            92.0 %
Hybrid Approaches      SML with Majority Voting       87.2 %
                       SVM Classification             87.7 %
                       SVM with Manifold Reduction    92.0 %
                       SML with SVM Fusion            94.7 %
Fig. 2. Comparison of accuracies for individual expression by using different methods with hybrid representation
Fig. 3. The importance/influence of each component on an expression
Table 2. Overall accuracies of expression recognition by using different facial components

Component Name             Accuracy
Left Eye (LE)              79.5 %
Right Eye (RE)             73.1 %
Middle of Eyebrows (ME)    54.7 %
Nose (NS)                  66.3 %
Mouth & Chin (MC)          65.8 %
Left Cheek (LC)            50.5 %
Right Cheek (RC)           47.7 %
Upper Face (UF)            85.8 %
Holistic Face (HF)         85.3 %
Fig. 4. Facial expression recognition results: horizontal bars indicate probabilities of expressions. The last column is an example where a surprise expression was wrongly predicted as a fear one.
Finally, some probabilistic facial expression recognition results are shown in Fig. 4, in which a horizontal bar indicates the probability of each expression. One mis-classified example is shown in the last column of this figure. Its ground-truth is surprise, but it was wrongly predicted as fear.
5 Conclusion

In this paper, we propose a fusion framework for facial expression analysis. Instead of using only a holistic or a local representation, a hybrid representation is used in our framework. Hence, we can take both subtle and global appearance variations into account at the same time. In addition, unlike methods using unsupervised manifold learning for facial expression analysis, we introduce supervised manifold learning (SML) techniques to represent each component. To combine the embedded manifolds in an effective manner, a fusion algorithm is proposed in this paper, which takes into account the support of each component for each individual expression. Both the expression label and the expression probabilities can be estimated. Compared with several methods using different representations and classification strategies, the experimental results show that our method is superior to the others, and promising recognition results for facial expression analysis are obtained.
Acknowledgments. This work was supported in part under Grants NSC 96-2752-E002-007-PAE. We would like to thank Prof. Jeffrey Cohn for providing the facial expression database.
References 1. Bartlett, M.S., Littlewort, G., Frank, M., Lainscsek, C.: Recognizing Facial Expression: Machine Learning and Application to Spontaneous Behavior. CVPR 2, 568–573 (2005) 2. Chang, Y., Hu, C., Turk, M.: Probabilistic Expression Analysis on Manifolds. CVPR 2, 520– 527 (2004) 3. Chen, H.T., Chang, H.W., Liu, T.L.: Local Discriminant Embedding and Its Variants. CVPR 2, 846–853 (2005) 4. Cohen, I., Sebe, N., Garg, A., Chen, L.S., Huang, T.: Facial Expression Recognition from Video Sequences: Temporal and Static Modeling. CVIU 91, 160–187 (2003) 5. Dornaika, F., Davoine, F.: Simultaneous Facial Action Tracking and Expression Recognition Using a Particle Filter. ICCV 2, 1733–1738 (2005) 6. Ekman, P., Friesen, W.V.: Unmasking the Face. Prentice Hall, Englewood Cliffs (1975) 7. Essa, I.A., Pentland, A.P.: Coding, Analysis, Interpretation, and Recognition of Facial Expressions. IEEE Trans. on PAMI 19(7), 757–763 (1997) 8. Fasel, B., Luettin, J.: Automatic Facial Expression Analysis: A Survey. Pattern Recognition 36, 259–275 (2003) 9. He, X., Yan, S., Hu, Y., Niyogi, P., Zhang, H.J.: Face Recognition Using Laplacianfaces. IEEE Trans. on PAMI 27(3), 328–340 (2005) 10. Kanade, T., Cohn, J.F., Tian, Y.: Comprehensive Database for Facial Expression Analysis. AFG, 46–53 (2000) 11. Lyons, M., Budynek, J., Akamatsu, S.: Automatic Classification of Single Facial Images. IEEE Trans. on PAMI 21(12), 1357–1362 (1999) 12. Pantic, M., Rothkrantz, L.J.M.: Automatic Analysis of Facial Expressions: The State of the Art. IEEE Trans. on PAMI 22(12), 1424–1445 (2000) 13. Pantic, M., Patras, I.: Dynamics of Facial Expression: Recognition of Facial Actions and Their Temproal Segments From Face Profile Image Sequences. IEEE Trans. on SMCB 32(2), 433–449 (2006) 14. Platt, J.: Probabilistic Outputs for Support Vector Machines and Comparison to Regularized Likelihood Methods. Advances in Large Margin Classifiers. MIT Press, Cambridge (2000) 15. Roweis, S.T., Saul, L.K.: Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 290, 2323–2326 (2000) 16. Tenenbaum, J.B., De Silva, V., Langford, J.C.: A Global Geometric Framework for Nonlinear Diminsionality Reduction. Science 290, 2319–2323 (2000) 17. Wang, J., Yin, L., Wei, X., Sun, Y.: 3D Facial Expression Recognition Based on Primitive Surface Feature Distribution. CVPR 2, 1399–1406 (2006) 18. Wen, Z., Huang, T.: Capturing Subtle Facial Motions in 3D Face Tracking. ICCV 2, 1343– 1350 (2003) 19. Wu, T.F., Lin, C.J., Weng, R.C.: Probability Estimates for Multi-class Classification by Pairwise Coupling. Journal of Machine Learning Research 5, 975–1005 (2004) 20. Yan, S., Xu, D., Zhang, B., Zhang, H.J.: Graph Embedding: A General Framework for Dimensionality Reduction. CVPR 2, 830–837 (2005) 21. Zhang, Z., Lyons, M., Schuster, M., Akamatsu, S.: Comparison between Geometry-Based and Gabor Wavelets-Based Facial Expression Recongition Using Multi-Layer Perceptron. AFG, 454–459 (1998) 22. Zhu, Z., Ji, Q.: Robust Real-Time Face Pose and Facial Expression Recovery. CVPR 1, 681– 688 (2006)
A Novel Multi-stage Classifier for Face Recognition

Chen-Hui Kuo1,2, Jiann-Der Lee1, and Tung-Jung Chan2

1 Department of Electrical Engineering, Chang Gung University, Tao-Yuan 33302, Taiwan, R.O.C.
2 Department of Electrical Engineering, Chung Chou Institute of Technology, Chang-Hua 51003, Taiwan, R.O.C.
Abstract. A novel face recognition scheme based on a multi-stage classifier, which combines support vector machines (SVM), Eigenface, and random sample consensus (RANSAC), is proposed in this paper. The whole decision process is conducted in cascaded coarse-to-fine stages. The first stage adopts the one-against-one SVM (OAO-SVM) method to choose the two classes most similar to the testing image. In the second stage, the Eigenface method is employed to select, in each of the two chosen classes, the one prototype image with the minimum distance to the testing image. Finally, the true class is determined by comparing the geometric similarity, as measured by the RANSAC method, between these prototype images and the testing image. This multi-stage face recognition system has been tested on the Olivetti Research Laboratory (ORL) face database, and the experimental results give evidence that the proposed approach outperforms other approaches based on either a single classifier or a multi-parallel classifier; it can even obtain nearly 100 percent recognition accuracy. Keywords: Face recognition; SVM; Eigenface; RANSAC.
1 Introduction

In general, research on face recognition systems falls into two categories: single-classifier systems and multi-classifier systems. The single-classifier systems, including neural networks (NN) [1], Eigenface [2], Fisher linear discriminant (FLD) [3], support vector machines (SVM) [4], hidden Markov models (HMM) [5], and AdaBoost [6], have been well developed in theory and experiments. On the other hand, multi-classifier systems, such as local and global face information fusion [7],[8],[9], neural network committees (NNC) [10], or multi-classifier systems (MCS) [11], have been proposed for parallel processing of different features or classifiers. The SVM was originally designed for binary classification and is based on the structural risk minimization (SRM) principle. Although several methods to effectively extend the SVM to multi-class classification have been reported in the technical literature [12],[13], it is still a widely researched issue. The SVM methods for multi-class classification are one-against-all (OAA) [12],[14], one-against-one (OAO) [12], the directed acyclic graph support vector machine (DAGSVM) [15], and binary tree SVM [4]. If one employs the same feature vector for SVM, NN, and AdaBoost, one
will find that the performance of the SVM is better than that of NN and AdaBoost, because the SVM always yields the maximum separating margin to the hyperplane between the two classes. If the feature vector includes noisy data, and the noisy data possesses at least one of the following properties: (a) overlapping class probability distributions, (b) outliers, or (c) mislabeled patterns [16], the hard-margin SVM hyperplane tends to overfit. To cope with this, the SVM allows noise or imperfect separation, provided that a non-negative slack variable is added to the objective function as a penalizing term. To integrate image features carrying frequency, intensity, and spatial information, we propose a novel face recognition approach, which combines the SVM, Eigenface, and random sample consensus (RANSAC) [17] methods into a multi-stage classifier system. The whole decision process is developed in stages, i.e., "OAO-SVM" first, "Eigenface" next, and finally "RANSAC". In the first stage, "OAO-SVM", we use the DCT features extracted from the entire face image to decide the two classes most similar to the testing image. In the second stage, face images of the two classes obtained from the first stage are projected onto a feature space (face space). The face space is defined by the "Eigenfaces", which are the eigenvectors of the set of faces and are based on the intensity information of the face images. "RANSAC" is applied in the last stage, in which the epipolar geometry carrying the spatial information of the testing image is matched against the two training images obtained from the second stage, and the prototype image with the greatest number of matched correspondence feature points is considered as the real face. The face database used for performance evaluation is retrieved from the Olivetti Research Laboratory (ORL) [18]. Three evaluation methods are adopted here to compare the performance of OAO-SVM, Eigenface, and our proposed multi-stage classifier. The remainder of this paper is organized as follows. In Section 2, the proposed OAO-SVM and multi-stage classifier methods are presented in detail. In Section 3, experimental results and the comparison with other approaches on the ORL face database are given. In Section 4, conclusions and directions for further research are summarized and discussed.
2 The Proposed Method

On the basis of a coarse-to-fine strategy, we design a multi-stage recognition system which integrates the OAO-SVM, Eigenface, and RANSAC methods to enhance recognition accuracy. The details of this system are presented as follows.

2.1 One-Against-One (OAO) SVM for Multi-class Recognition

In the OAO strategy, several binary SVMs are constructed, each one trained on data from only two different classes. Thus, this method is sometimes called a "pair-wise" approach. For a data set with N different classes, this method constructs $C_2^N = N(N-1)/2$ two-class SVM models. Given m training data $(x_1, y_1), \ldots, (x_m, y_m)$, where $x_k \in \mathbb{R}^d$, $k = 1, \ldots, m$, and $y_k \in \{1, \ldots, N\}$ is the class of $x_k$, we solve the following binary classification problem for training data from the i-th and j-th classes:
\min_{w^{ij},\, b^{ij},\, \xi^{ij}} \ \tfrac{1}{2} (w^{ij})^T w^{ij} + C \sum_k \xi_k^{ij}
\text{subject to} \quad (w^{ij})^T \phi(x_k) + b^{ij} \ge 1 - \xi_k^{ij}, \ \text{if } y_k = i,
\qquad\qquad\quad (w^{ij})^T \phi(x_k) + b^{ij} \le -1 + \xi_k^{ij}, \ \text{if } y_k = j,
\qquad\qquad\quad \xi_k^{ij} \ge 0,    (1)

where $w$ is the weight vector, $b$ is the threshold, $\phi(x_k)$ is a kernel-induced mapping of the training data $x_k$ to a higher-dimensional space, and $C$ is the penalty parameter. When noisy data would force a hard margin, the penalty term $C \sum_k \xi_k^{ij}$ relaxes the margin and allows a possibility of mistrusting the data, which reduces the number of training errors. The simplest decision function is a majority vote or max-win scheme. The decision function counts the votes for each class based on the outputs of the $N(N-1)/2$ SVMs. The class with the most votes is selected as the system output. In the majority vote process, it is necessary to compute each discriminant function $f_{ij}(x)$ of the input data $x$ for the $N(N-1)/2$ SVM models. The score function $R_i(x)$ sums the votes for class $i$. The final decision is made on the basis of the "winner takes all" rule, which corresponds to the following maximization. The expression for the final decision is given as Eq. (2):

f_{ij}(x) = (x \cdot w_n) + b_n, \quad n = 1, \ldots, N(N-1)/2,
R_i(x) = \sum_{j=1,\, j \ne i}^{N} \operatorname{sgn}\{ f_{ij}(x) \},
m(x, R_1, R_2, \ldots, R_N) = \arg\max_{i=1,\ldots,N} \{ R_i(x) \},    (2)
where $f_{ij}(x)$ is the output of the (i,j)-th SVM, $x$ is the input data, and $m$ is the final decision, i.e., the class that receives the largest number of votes from the decision functions $f_{ij}$.
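As an illustration of the winner-takes-all voting in Eq. (2), the following sketch (our own illustrative Python/NumPy code, not part of the paper) tallies precomputed pairwise SVM decision values into per-class scores $R_i(x)$. The sign convention, that $f_{ij}(x) > 0$ is read as a vote for class $i$, is an assumption of this sketch.

```python
import numpy as np
from itertools import combinations

def oao_vote(decision_values, n_classes):
    """Winner-takes-all voting of Eq. (2).

    decision_values: dict mapping a class pair (i, j), i < j, to the signed
                     output f_ij(x) of the corresponding binary SVM.
    Returns (votes, winner).
    """
    votes = np.zeros(n_classes, dtype=int)
    for (i, j) in combinations(range(n_classes), 2):
        f_ij = decision_values[(i, j)]
        votes[i if f_ij > 0 else j] += 1      # accumulate R_i(x)
    return votes, int(np.argmax(votes))       # "winner takes all"
```

In the proposed system, the two top-voted classes are passed to the second stage whenever their vote difference is at most a preset value e, as described in Section 2.2.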
2.2 Our Multi-stage Classifier System for Face Recognition

Based on the same coarse-to-fine strategy, the proposed novel scheme for face recognition is performed by a consecutive multi-stage recognition system, in which each stage is devoted to removing a number of false classes. The flowcharts of this proposed system, including the training phase and the recognition phase, are shown in Fig. 1(a) and (b), respectively. In the first stage (OAO-SVM), for a testing image, we obtain its DCT features by the feature extraction process and employ the "winner-takes-all" voting strategy to select the top two classes, i.e., ci and cj, with the maximum votes for later use. In the second stage, the Euclidean distance of each image in ci and cj to the testing image is calculated, and for each class only one prototype image with the minimum distance to the testing image is kept for the last stage. The "RANSAC" method is applied in the last stage, in which the epipolar geometry carrying the spatial information of the testing image is matched against the two prototype images obtained from the second stage, and then the prototype image with the greatest number of matched correspondence feature
points is selected as the correct one; in other words, the prototype image with the most geometric similarity to the testing image is thus presented.
Fig. 1. Flowchart of the multi-stage classifier for face recognition. (a) Training phase. (b) Testing phase.
More specifically, there are $C_2^N = N(N-1)/2$ "pair-wise" SVM models used in the first stage (the OAO model), as shown in Fig. 1(b). According to Eq. (2), the voting values Ri(x) and Rj(x) of the two classes ci and cj with the first and second largest votes are selected. If the difference between Ri(x) and Rj(x) is less than or equal to e, i.e., Ri(x) − Rj(x) ≤ e, where e is a preset value, the two classes are delivered to the second stage for binary classification. On the other hand, if Ri(x) − Rj(x) > e, the class ci is decided as the only correct answer and the recognition process is finished. In other words, a voting difference of at most e between classes ci and cj indicates that the two classes are hardly distinguishable at this stage and that the next stage is needed to refine the decision. PCA is a well-known technique commonly exploited in multivariate linear data analysis. The main underlying concept is to reduce the dimensionality of a data set while retaining as much of its variation as possible. A testing face image $i_x$ is transformed into its eigenface components (projected into "face space") by a simple operation, $w_k = u_k^T (i_x - \bar{i})$ for $k = 1, \ldots, M$, where the $u_k$ are eigenvectors obtained from the covariance matrix of the training face images, $\bar{i}$ is the average face image, and M is the number of face images. The weights form a vector $\psi^T = [w_1\ w_2\ \ldots\ w_M]$ that describes the contribution of each eigenface in representing the input face image,
treating the eigenfaces as a basis set for face images. The vector is used to find which of the pre-defined face classes best describes the face. The simplest method for determining which face class provides the best description of an input face image is to find the face class k that minimizes the Euclidean distance $\varepsilon_k = \|\psi - \psi_k\|$, where $\psi_k$ is a vector describing the k-th face class. In the second stage, as shown in Fig. 2(a), the input image is projected into "face space" and the weight vector $\psi^T = [w_1\ w_2\ \ldots\ w_9]$ is then created. For each class, the image with the minimum Euclidean distance to the input image is selected as the prototype image. For example, Fig. 2(b) shows the Euclidean distances between the input image and ten training images belonging to two classes. Class 1 includes the first five images (images 1, 2, 3, 4, and 5), and class 2 includes the other five images (images 6, 7, 8, 9, and 10). Subsequently, the image with the minimum Euclidean distance in each class is selected for the last stage. For example, image 3 in class 1 and image 7 in class 2 are chosen in the second stage. Here, we denote these two images as c13 and c27.
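The second-stage selection can be illustrated with a short NumPy sketch (our own illustrative code; the variable names and the assumed precomputed eigenfaces are not taken from the paper): each image is projected onto the eigenfaces, and within each candidate class the training image whose weight vector is closest to that of the test image is kept as the prototype.

```python
import numpy as np

def eigenface_weights(images, eigenfaces, mean_face):
    """Project images (n, D) onto M eigenfaces (M, D): w_k = u_k^T (i_x - i_bar)."""
    return (images - mean_face) @ eigenfaces.T          # shape (n, M)

def select_prototypes(test_img, train_imgs, train_classes, eigenfaces, mean_face):
    """Return, for each class, the index of the training image closest to the test image."""
    w_test = eigenface_weights(test_img[None, :], eigenfaces, mean_face)[0]
    w_train = eigenface_weights(train_imgs, eigenfaces, mean_face)
    dists = np.linalg.norm(w_train - w_test, axis=1)     # epsilon_k in the text
    prototypes = {}
    for c in np.unique(train_classes):
        idx = np.where(train_classes == c)[0]
        prototypes[c] = idx[np.argmin(dists[idx])]
    return prototypes
```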
Fig. 2. (a) The weight vector ψT = [w1 w2 … w9] of input face. (b) The Euclidian distance between the input image and ten training images, respectively. The images 1, 2, 3, 4, 5 are of the same class and the images 6, 7, 8, 9, 10 are of another class.
In the last stage, the "RANSAC" method is used to match the testing image with the two training images (c13 and c27) obtained in the second stage, in order to find which prototype image best matches the testing image; the one with the maximum number of correspondence points fits best. The procedure of "RANSAC" is described as follows.
• Find Harris Corners [19]. In the testing and training images, shifting a window in any direction should give a large change in intensity, as shown in Fig. 3(a). The change $E_{x,y}$ produced by a shift $(x, y)$ is given by

E_{x,y} = \sum_{u,v} \delta_{u,v} \, [ I_{x+u,\,y+v} - I_{u,v} ]^2,    (3)

where $\delta_{u,v}$ specifies the image window, for example a rectangular function (unity within a specified rectangular region and zero elsewhere) or a Gaussian function giving a smooth circular window, $\delta_{u,v} = \exp\{-(u^2+v^2)/2\sigma^2\}$, and $I_{u,v}$ is the image intensity.
[Fig. 3, panels (a)-(d): testing image versus training images 1 and 2. Panel (d) match counts — training image 1: Match 13, Unmatch 4; training image 2: Match 4, Unmatch 6.]
Fig. 3. The procedures of using RANSAC method to find the matched and unmatched correspondence points. (a) Find Harris corners feature points. (b) Find putative matches. (c) Use RANSAC method to find the correspondence points. (d) Count numbers of matched and unmatched of correspondence points.
• Find Putative Matches. Among the previously detected corner feature points in a given image pair, putative matches are generated by looking for match points that are maximally correlated with each other within given windows. Only points that robustly correlate with each other in both directions are returned. Even though the correlation matching produces many wrong matches, roughly 10 to 50 percent, it is strong enough to compute the fundamental matrix F, as shown in Fig. 3(b).
• Estimate Fundamental Matrix F. The RANSAC method is used to locate the correspondence points between the testing and training images (a code sketch of this estimation loop is given after this list). As shown in Fig. 4, the map x → l' between the two images defined by the fundamental matrix F is considered, and the most basic property of F is that $x'^T F x = 0$ [20] for any pair of corresponding points x ↔ x' in the given image pair. The following steps are used by the RANSAC method to consolidate the estimation of the fundamental matrix F: repeat (a) select a random sample of 8 correspondence points; (b) compute F; (c) measure the support (the number of inliers within a threshold distance of the epipolar line).
• Choose Fundamental Matrix. Choose the F with the largest number of inliers and obtain the correspondences xi ↔ x'i (as shown in Fig. 3(c)).
• Count Numbers of Matched and Unmatched Feature Points. A threshold distance between two correspondence points xi ↔ x'i is set. A correspondence counts as a match if its distance is smaller than the threshold; otherwise it does not. For a given testing image, the best match is the training image with the largest number of matched correspondences, as shown in Fig. 3(d).
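The RANSAC loop described in the two bullets "Estimate Fundamental Matrix F" and "Choose Fundamental Matrix" can be sketched as follows. This is our own illustrative implementation using an unnormalized linear 8-point solver for brevity (in practice, point normalization as recommended in [20] improves stability); the threshold and iteration count are assumed values, not the paper's.

```python
import numpy as np

def eight_point_F(x1, x2):
    """Linear 8-point estimate of F from (8, 2) corresponding points x1 <-> x2."""
    A = np.stack([
        x2[:, 0] * x1[:, 0], x2[:, 0] * x1[:, 1], x2[:, 0],
        x2[:, 1] * x1[:, 0], x2[:, 1] * x1[:, 1], x2[:, 1],
        x1[:, 0], x1[:, 1], np.ones(len(x1))
    ], axis=1)
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)        # enforce the rank-2 constraint
    S[2] = 0.0
    return U @ np.diag(S) @ Vt

def epipolar_distance(F, x1, x2):
    """Symmetric point-to-epipolar-line distance for all correspondences."""
    ones = np.ones((len(x1), 1))
    p1, p2 = np.hstack([x1, ones]), np.hstack([x2, ones])
    l2 = p1 @ F.T                      # epipolar lines in image 2
    l1 = p2 @ F                        # epipolar lines in image 1
    d2 = np.abs(np.sum(p2 * l2, axis=1)) / np.linalg.norm(l2[:, :2], axis=1)
    d1 = np.abs(np.sum(p1 * l1, axis=1)) / np.linalg.norm(l1[:, :2], axis=1)
    return d1 + d2

def ransac_fundamental(x1, x2, n_iter=500, thresh=2.0, seed=None):
    """Return the inlier mask of the F with the largest support."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(x1), dtype=bool)
    for _ in range(n_iter):
        idx = rng.choice(len(x1), 8, replace=False)
        F = eight_point_F(x1[idx], x2[idx])
        inliers = epipolar_distance(F, x1, x2) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers
```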
Fig. 4. The correspondence points of the two images are x and x'. The two cameras are indicated by their centers C and C'. The camera centers, the three-space point X, and its images x and x' lie in a common plane π. An image point x back-projects to a ray in three-space defined by the first camera center C and x. This ray is imaged as a line l' in the second view. The three-space point X which projects to x must lie on this ray, so the image of X in the second view must lie on l'.
3 Experimental Results

The proposed scheme for face recognition is evaluated on the ORL face database, which contains abundant variability in expressions, poses, and facial details. We conducted experiments to compare our cascaded multi-stage classifier strategy with other well-known single classifiers, e.g., OAO-SVM and Eigenface. The experimental platform is an Intel Celeron 2.93 GHz processor with 1 GB DDRAM, Windows XP, and Matlab 7.01.

3.1 Face Recognition on the ORL Database
The experiment is performed on the ORL database, which contains 400 images of 40 distinct subjects. Each subject has ten different images taken under different conditions, i.e., pose, expression, etc. Each image is digitized as a 112 × 92 pixel array whose gray levels range between 0 and 255. There are variations in facial expression such as open/closed eyes, smiling/non-smiling, and with/without glasses. In our experiments, five images of each subject are randomly selected as training samples, and the other five images serve as testing images. Therefore, for the 40 subjects in the database, a total of 200 images are used for training and another 200 for testing,
and there are no overlaps between the training and testing sets. Here, we evaluate our system based on the average error rate. This procedure is repeated four times, i.e., four runs, resulting in four groups of data. For each group, we calculate the average error rate versus the number of feature dimensions (from 15 to 100). Fig. 5 shows the average results of the four runs and the output of each stage of the multi-stage classifier which integrates OAO-SVM, Eigenface, and RANSAC. As shown in Fig. 5, the error rate of the final-stage output is lower than that of the other two single classifiers; that is, our proposed method obtains the lowest error rate. Additionally, the average minimum error rate of our method is 1.37% with 30 features, while that of OAO-SVM is 2.87% and that of Eigenface is 8.50%. If we choose the best results among the four groups of randomly selected data, the lowest error rate of the final stage can even reach 0%.
Fig. 5. Comparison of recognition accuracy using OAO-SVM, Eigenface, and the proposed system on the ORL face database
3.2 Comparison with Previous Reported Results on ORL
Several approaches have been conducted for face recognition with the ORL database. The methods of using single classifier systems for face recognition are Eigenface [2],[21],[23],[24], DCT-RBFNN [1], binary tree SVM [4], 2D-HMM [5], LDA [25], and NFS [26]. The methods of using multi-classifiers for ORL face recognition are fuzzy fisherface [7], [22], and CF2C [9]. Here, we present a comparison under similar conditions between our proposed method and the other methods described on the ORL database. Approaches are evaluated on error rate, and feature vector dimension. Comparative results of different approaches are shown in Table 1. It is hard to compare the speed of different methods performed on different computing platforms, so we ignore the training and recognition time in each different approach. From Table 1, it is clear that the proposed approach achieves the best recognition rate compared with the other six approaches.
Table 1. Recognition performance comparison of different approaches (ORL)
Methods                  Error rate (%)          Feature vector dimension
                         Best        Mean
Eigenface [23]           2           4            140
2D-PCA [21]              4           5            112×3
Binary tree SVM [4]      N/A         3            48
DCT-RBFNN [1]            0           2.45         30
CF2C [9]                 3           4            30
Fuzzy Fisherface [22]    2.5         4.5          60
Our proposed approach    0           1.375        30
4 Conclusions

This paper presents a multi-stage classifier method for face recognition based on the techniques of SVM, Eigenface, and RANSAC. The proposed multi-stage method is conducted on the basis of a coarse-to-fine strategy, which can reduce the computational cost. The facial features are first extracted by the DCT for the first stage, OAO-SVM. Although the last stage (RANSAC) leads to higher accuracy than the other two stages, its computational cost is higher because of the estimation of the geometric fundamental matrix F. In order to shorten the computation time, we reduce the candidates to only two training images, which are then matched with the testing image in the last stage. The key of this method is to consolidate OAO-SVM for the output of the top two classes with maximum votes so that the decision on the correct class can be made later by RANSAC in the last stage. The feasibility of the proposed approach has been successfully tested on the ORL face database, which is acquired under varying poses and expressions with a moderate number of samples. Comparative experiments on the face database also show that the proposed approach is superior to single-classifier and multi-parallel-classifier systems. Our ongoing research is to study the classification performance when the output of OAO contains more than two classes, and to examine the relationship between recognition rate and computation time, trying to find an optimal classification system with superior recognition capability at reasonable computation time.
References 1. Er, M.J., Chen, W., Wu, S.: High-Speed Face Recognition Based on Discrete Cosine Transform and RBF Neural Networks. IEEE Trans. Neural Networks 16(3), 679–691 (2005) 2. Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience 3, 71–86 (1991) 3. Xiang, C., Fan, X.A., Lee, T.H: Face Recognition Using Recursive Fisher Linear Discriminant. IEEE Trans. on Image Processing 15(8), 2097–2105 (2006) 4. Guo, G., Li, S.Z., Chan, K.L.: Support vector machines for face recognition. Image and Vision Computing 19, 631–638 (2001)
5. Othman, H., Aboulnasr, T.: A Separable Low Complexity 2D HMM with Application to Face Recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence 25(10), 1229–1238 (2003) 6. Lu, J.K., Plataniotis, N.A., Venetsanopoulos, N., Li, S.Z.: Ensemble-Based Discriminant Learning With Boosting for Face Recognition. IEEE Trans. on Neural Networks 17(1), 166–178 (2006) 7. Kwak, K.C., Pedrycz, W.: Face recognition: A study in information fusion using fuzzy integral. Pattern Recognition Letters 26, 719–733 (2005) 8. Rajagopalan, A.N., Rao, K.S., Kumar, Y.A.: Face recognition using multiple facial features. Pattern Recognition Letters 28, 335–341 (2007) 9. Zhou, D., Yang, X., Peng, N., Wang, Y.: Improved-LDA based face recognition using both facial global and local information. Pattern Recognition Letters 27, 536–543 (2006) 10. Zhao, Z.Q., Huang, D.S., Sun, B.Y.: Human face recognition based on multi-features using neural networks committee. Pattern Recognition Letters 25, 1351–1358 (2004) 11. Lemieux, A., Parizeau, M.: Flexible multi-classifier architecture for face recognition systems. In: 16th Int. Conf. on Vision Interface (2003) 12. Hsu, C.W., Lin, C.J.: A comparison of methods for multi-class support vector machines. IEEE Trans. Neural Network 13(2), 415–425 (2002) 13. Vapnik, V.: Statistical Learning Theory. John Wiley & Sons, Inc., Chichester (1998) 14. Bottou, L., Cortes, C., Denker, J., Drucker, H., Guyon, I., Jackel, L., LeCun, Y., Muller, U., Sackinger, E., Simard, P., Vapnik, V.: Comparison of classifier methods: a case study in handwriting digit recognition. In: International Conference on Pattern Recognition, pp. 77–87. IEEE Computer Society Press, Los Alamitos (1994) 15. Platt, J.C., Cristianini, N., Shawe-Taylor, J.: Large margin DAGs for multiclass classification. In: Advances in Neural Information Processing Systems, vol. 12, pp. 547– 553. MIT Press, Cambridge (2000) 16. Ratsch, G., Onoda, T., Muller, K.R.: Soft Margins for AdaBoost. Machine Learning 42, 287–320 (2001) 17. Fischler, M.A., Bolles, R.C.: Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM 24(6), 381–395 (1981) 18. ORL face database, http://www.uk.research.att.com/facedatabase.html 19. Harris, C., Stephens, M.: A Combined Corner and Edge Detector. In: 4th Alvey Vision Conference, pp. 147–151 (1988) 20. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2003) 21. Yang, J., Zhang, D., Frangi, A.F., Yang, J.Y.: Two-dimensional PCA: a new approach to appearance-based face representation and recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence 26, 131–137 (2004) 22. Kwak, K.C., Pedrycz, W.: Face recognition using a fuzzy fisferface classifier. Pattern Recognition 38, 1717–1732 (2005) 23. Li, B., Liu, Y.: When eigenfaces are combined with wavelets. Knowledge-Based Systems 15, 343–347 (2002) 24. Phiasai, T., Arunrungrusmi, S., Chamnongthai, K.: Face recognition system with PCA and moment invariant method. In: Proc. of the IEEE International Symposium on Circuits and Systems, vol. 2, pp. 165–168 (2001) 25. Lu, J., Plataniotis, K.N., Venetsanopoulos, A.N.: Face recognition using LDA-based algorithms. IEEE Trans. on Neural Networks 14, 195–200 (2003) 26. Chien, J.T., Wu, C.C.: Discriminant waveletfaces and nearest feature classifiers for face recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence 24, 1644–1649 (2002)
Discriminant Clustering Embedding for Face Recognition with Image Sets

Youdong Zhao, Shuang Xu, and Yunde Jia

School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, P.R. China
{zyd458,xushuang,jiayunde}@bit.edu.cn
Abstract. In this paper, a novel local discriminant embedding method, Discriminant Clustering Embedding (DCE), is proposed for face recognition with image sets. DCE combines the effectiveness of submanifolds, which are extracted by clustering each subject's image set and characterize the inherent structure of the face appearance manifold, with the discriminant property of discriminant embedding. The low-dimensional embedding is learned by preserving the neighbor information within each submanifold while separating neighboring submanifolds belonging to different subjects from each other. Compared with previous work, the proposed method can not only discover the most powerful discriminative information embedded in the local structure of face appearance manifolds more fully but also preserve it more efficiently. Extensive experiments on real-world data demonstrate that DCE is efficient and robust for face recognition with image sets. Keywords: Face recognition, image sets, submanifolds (local linear models), discriminant embedding.
1 Introduction

In the past several years, automatic face recognition using image sets has attracted more and more attention due to its wide range of underlying applications [1, 2, 3, 4]. Images in the sets are assumed to be sampled independently from complex high-dimensional nonlinear manifolds; e.g., they may be derived from sparse and unordered observations acquired by multiple still shots of an individual or by long-term monitoring of a scene by surveillance systems. This relaxes the assumption of temporal coherence between consecutive images from video. In this paper, we focus on revealing and extracting the most powerful discriminative information from the face appearance manifolds for face recognition over image sets. In the real world, due to various variations, e.g., large pose or illumination changes, the face appearance manifold of an individual in image space is a complex nonlinear distribution which consists of a set of submanifolds (or local linear models) (see Fig. 1). Those submanifolds can sufficiently characterize the inherent structure of the face appearance manifold. How to extract the submanifolds of each individual and how to utilize them for efficient classification are key issues for face recognition over image sets. Intuitively, when the submanifolds are known, a reasonable solution
to extract the most efficient discriminative information is to find a discriminant function that compresses the points in each submanifold together and, at the same time, separates the neighboring submanifolds belonging to different individuals from each other. Yan et al. [7] propose a Marginal Fisher Analysis (MFA) method to extract the local discriminative information. In MFA, the intra-class compactness is characterized by preserving the relationships of the k-nearest neighbors of each point in the same class, and the inter-class separability is characterized by maximizing the class margins. However, owing to the asymmetric relationship of k-nearest neighbors, MFA tends to compress the points of an individual together even when they are really far away in image space. This works against uncovering the significant local structure of appearance manifolds and extracting efficient discriminative information. Motivated by the effectiveness of submanifolds [3, 4, 5, 6], we propose a novel local discriminant embedding method, Discriminant Clustering Embedding (DCE), based on the submanifolds for face recognition over image sets. The proposed method combines the effectiveness of submanifolds in characterizing the inherent structure of face appearance manifolds with the discriminant property of discriminant embedding. This is the main contribution of this paper. Specifically, in our framework, the submanifolds, corresponding to the local linear subspaces on the entire nonlinear manifold, are first extracted by clustering each subject's image set. Two graphs are then constructed based on each submanifold and its neighbors to locally characterize the intra-class compactness and inter-class separability respectively. Finally, the low-dimensional embedding is learned by preserving the neighborhood information within each submanifold, and separating the neighboring submanifolds belonging to different subjects from each other. The reason we prefer submanifolds to the k-nearest neighbors of each point used in MFA lies in their appealing property of explicitly characterizing the local structure of nonlinear face manifolds. Extensive experiments on real-world data demonstrate the effectiveness of our method for face recognition with image sets and show that our method significantly outperforms the state-of-the-art methods in this area in terms of accuracy.

1.1 Previous Work

Most previous work on face recognition over image sets has focused on estimating the densities of mixture models or extracting the submanifolds by clustering and then utilizing them for the recognition task. Those algorithms can be broadly divided into two categories: model-based parametric approaches and clustering-based nonparametric approaches. In the model-based methods, Frey and Huang [1] use factor analyzers to estimate the submanifolds, and the mixture of factor analyzers is estimated by the EM algorithm. The recognition decision is made by Bayes' rule. The manifold density divergence method [2] estimates a Gaussian mixture model for each individual, and the similarity between the estimated models is measured by the Kullback-Leibler divergence. However, when there is not enough training data, or the training data and the new test data do not have strong statistical relationships, it is difficult to estimate the parametric densities properly or to measure the similarity between the estimated densities accurately. In the clustering-based methods, Hadid et al.
[6] apply k-means clustering to extract the submanifolds in the low-dimensional space learned by locally linear embedding
(LLE). The traditional classification methods are performed on the cluster centers, which are used to represent the local models; obviously, this makes only rough use of the local models. Lee et al. [5] also use the k-means algorithm to approximate the data set. However, their work mainly utilizes the temporal coherence between consecutive images of a video sequence for recognition. Fan and Yeung [3] extract the submanifolds via hierarchical clustering. The classification task is performed by a dual-subspace scheme based on the neighboring submanifolds belonging to different subjects. Kim et al. [4] propose a discriminant learning method which learns a linear discriminant function that maximizes the canonical correlations of within-class sets and minimizes the canonical correlations of between-class sets. Our work is similar to the methods of [3, 5, 6] in the extraction of submanifolds, but different in the utilization of these submanifolds. By powerful discriminant embedding based on the submanifolds, our method makes full use of the local information embedded in the submanifolds to extract the discriminative features. In addition, our algorithm relaxes the constraints on the data distribution assumption and the total number of available features, which is also important for recognition using image sets. The rest of this paper is organized as follows: Section 2 discusses the local structure of face appearance manifolds in image sets. The discriminant clustering embedding method is presented in Section 3. We show the experimental results on complex image sets in Section 4. Finally, we give our conclusions and future work in Section 5.
Fig. 1. Local structure of face manifolds of subject A (blue asterisks) and B (red plus signs): (a) first three PCs of subject A embedded by Isomap [14]; (b) first three PCs of subjects A and B embedded by Isomap; (c) first three PCs of subjects A and B embedded by DCE
2 Local Structure of Face Manifold

As usual, we represent a face image as a D-dimensional vector, where D is the number of pixels in each image. Due to the smoothness and regular texture of the surfaces of faces, face images usually lie in or close to a low-dimensional manifold, which is a continuous and smooth distribution embedded in image space. However, due to non-continuous or sparse sampling, it is usually discrete and consists of a number of separate submanifolds (or clusters). Fig. 1 illustrates the distributions of the low-dimensional manifolds corresponding to two subjects' image sets which are obtained from two video sequences. Fig. 1a shows the first three principal components of image set A, obtained by Isomap [14]. Different submanifolds of image set A correspond to different variations. The images
under similar conditions lie in neighboring locations in image space. So it is possible that different individuals' submanifolds under similar conditions are closer than submanifolds coming from the same individual but under completely different conditions. The leading three principal components of two image sets that are sampled under similar conditions from two individuals A (blue asterisks) and B (red plus signs), respectively, are shown in Fig. 1b. Although the two image sets are obtained from two different individuals, there is significant overlap between the two manifolds. The goal of this paper is to utilize these meaningful submanifolds to learn the most powerful discriminant features for the recognition task. In Fig. 1c, the first three principal components of A and B embedded by our DCE are shown; the local discriminant structure of the manifolds is more obvious. We use two classical clustering methods, k-means and hierarchical clustering [13], to extract the submanifolds of each individual. For k-means, the initial k seeds are selected by a greedy search procedure [5]. For hierarchical clustering, we use the agglomerative procedure, i.e., hierarchical agglomerative clustering (HAC), with the following average-linkage distance measure:

d_{avg}(D_i, D_j) = \frac{1}{n_i n_j} \sum_{x \in D_i} \sum_{x' \in D_j} \| x - x' \|,    (1)

where $n_i$ and $n_j$ are the numbers of samples in the clusters $D_i$ and $D_j$, respectively.
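As a sketch of how the submanifolds could be extracted with the average-linkage HAC of Eq. (1), the following uses SciPy's hierarchical clustering (our own illustrative code; the number of clusters per subject is a free parameter discussed in Section 4.1, not fixed here).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def extract_submanifolds(X_subject, n_clusters):
    """Cluster one subject's images (n_c, D) into submanifold labels 1..n_clusters
    using average-linkage HAC, i.e., the distance measure of Eq. (1)."""
    Z = linkage(X_subject, method='average', metric='euclidean')
    return fcluster(Z, t=n_clusters, criterion='maxclust')
```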
To evaluate the effectiveness of clustering, we also test a random selection scheme, which randomly assigns the samples of each individual to k clusters instead of using classical clustering. The performance will be discussed in Section 4.1. The principal angle is used to measure the similarity between two submanifolds that are extracted from different individuals. Principal angles are the angles between two d-dimensional subspaces; recently, they have become a popular metric for the similarity of different subspaces [3, 4]. Let $L_1$ and $L_2$ be two d-dimensional subspaces. The cosines of the principal angles $0 \le \theta_1 \le \cdots \le \theta_d \le \pi/2$ between them are uniquely defined as

\cos\theta_i = \max_{x_i \in L_1} \max_{y_i \in L_2} x_i^T y_i,    (2)

subject to $\|x_i\| = \|y_i\| = 1$, $x_i^T x_j = y_i^T y_j = 0$, $i \ne j$. Refer to [3, 4, 12] for the details of the solution to this problem.
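In practice, the cosines in Eq. (2) can be computed as the singular values of the product of orthonormal bases of the two subspaces, following the numerical method of [12]. A minimal NumPy sketch (our own illustration, not the authors' code):

```python
import numpy as np

def principal_angle_cosines(A, B):
    """Cosines of the principal angles between the column spaces of A and B.

    A, B: (D, d) matrices whose columns span two d-dimensional subspaces.
    """
    Qa, _ = np.linalg.qr(A)                     # orthonormal basis of span(A)
    Qb, _ = np.linalg.qr(B)                     # orthonormal basis of span(B)
    # singular values of Qa^T Qb are cos(theta_1) >= ... >= cos(theta_d)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.clip(cosines, -1.0, 1.0)
```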
3 Discriminant Clustering Embedding

3.1 Problem Formulation

Given a gallery set

X = \{x_1, x_2, \ldots, x_N\} \in \mathbb{R}^{D \times N},    (3)

where $x_i$ represents a D-dimensional image vector and N is the number of images in the gallery. The label of each image is denoted as $y_i \in \{1, 2, \ldots, C\}$. For each class c, containing $n_c$ samples represented as $X_c$, we extract a set of submanifolds

S_c = \{S_{c,i}\}_{i=1}^{sn_c},    (4)

where $sn_c$ is the number of submanifolds of class c and typically $n_c \gg sn_c$.

A projection matrix is defined as

V = \{v_1, v_2, \ldots, v_d\} \in \mathbb{R}^{D \times d},    (5)

where $D \gg d$, $\|v_i\| = 1$, and d is the number of features extracted. Our goal is to learn such a projection matrix V by which, after mapping, the high-dimensional data can be properly embedded. The intra-class scatter matrix $S_w$ and the inter-class scatter matrix $S_b$ are defined as

S_w = \sum_i \sum_j \| v^T x_i - v^T x_j \|^2 \, W^w_{i,j} = 2\, v^T X (D^w - W^w) X^T v,    (6)

S_b = \sum_i \sum_j \| v^T x_i - v^T x_j \|^2 \, W^b_{i,j} = 2\, v^T X (D^b - W^b) X^T v,    (7)

where

D^w_{ii} = \sum_{j \ne i} W^w_{ij} \quad \forall i,    (8)

D^b_{ii} = \sum_{j \ne i} W^b_{ij} \quad \forall i,    (9)

and $W^w, W^b \in \mathbb{R}^{N \times N}$ are two affinity matrices (graphs) which denote the neighbor relationships between pairs of intra-class and inter-class points, respectively. How to define them is the critical issue for computing the projection matrix V. The projection matrix V can be redefined as

V = \arg\max_V \frac{S_b}{S_w} = \arg\max_V \frac{| V^T X (D^b - W^b) X^T V |}{| V^T X (D^w - W^w) X^T V |},    (10)

which is equivalent to solving the following generalized eigenvalue problem:

X (D^b - W^b) X^T V = \lambda X (D^w - W^w) X^T V.    (11)

$V = \{v_1, v_2, \ldots, v_d\}$ are the generalized eigenvectors associated with the generalized eigenvalues $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_d$ of Eq. (11).

Given a probe image set $T = \{t_i\}_{i=1}^{p}$ containing p images of an individual whose identity is one of the C subjects in the gallery, the test image set is first mapped onto a low-dimensional space via the projection matrix V, i.e., $T' = V^T T$. Then each test image $t_i$ is classified in the low-dimensional discriminant space by measuring the similarity between $t'_i = V^T t_i$ and each training submanifold $S'_{c,j} = V^T S_{c,j}$. This process is formulated as

c^* = \arg\max_c \, d(t'_i, S'_{c,j}),    (12)
where $d(\cdot,\cdot)$ denotes the distance between an image and a linear subspace [11]. Finally, to determine the class of the test image set, we combine the decisions of all test images by a majority scheme.

3.2 Discriminant Clustering Embedding

As discussed in Section 2, images in the sets usually change largely due to many realistic factors, e.g., large pose or illumination changes. In this case, traditional Fisher Discriminant Analysis (FDA) [9] performs poorly, since it cannot capture the nonlinear variation in face appearance due to illumination and pose changes. Marginal Fisher Analysis (MFA) [7], which is devised to extract the local discriminative information by utilizing the marginal information, relaxes the limitations of FDA in the data distribution assumption and the number of available discriminative features. However, it still tends to compress points belonging to the same class together even when they are really far apart in image space. This works against uncovering the significant local structure of appearance manifolds and extracting efficient discriminative information. To handle this problem, we propose a novel method, Discriminant Clustering Embedding, which combines the effectiveness of submanifolds in characterizing the inherent structure of face appearance manifolds with the discriminant property of discriminant embedding. Specifically, our algorithm can be summarized as the following four steps (an illustrative sketch of the graph construction in Step 3 is given after this list):

1. Extract a set of submanifolds $S_c$ of each subject as in Eq. (4) by clustering, e.g., k-means or hierarchical agglomerative clustering.
2. Measure the neighbor relationships of each submanifold with all other subjects' submanifolds by some metric, e.g., the principal angle, and select its m nearest neighbors.
3. Construct two affinity matrices (graphs) $W^w$ and $W^b$ based on each submanifold and its m nearest neighbors obtained in Step 2, simply written as

   W^w_{ij} = \begin{cases} 1 & \text{if } x_i, x_j \in S_{c,k} \\ 0 & \text{otherwise} \end{cases}    (13)

   W^b_{ij} = \begin{cases} 1 & \text{if } x_i \in S_{c1,k1},\ x_j \in S_{c2,k2},\ c1 \ne c2,\ \text{and } S_{c1,k1} \overset{m}{\sim} S_{c2,k2} \\ 0 & \text{otherwise} \end{cases}    (14)

   where $S_{c1,k1} \overset{m}{\sim} S_{c2,k2}$ means that $S_{c1,k1}$ is among the m nearest neighbors of $S_{c2,k2}$ or $S_{c2,k2}$ is among the m nearest neighbors of $S_{c1,k1}$.
4. Learn the low-dimensional embedding $V_{DCE}$ defined as in Eq. (10) by preserving the neighbor information within each submanifold, and separating the neighboring submanifolds belonging to different subjects from each other.
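The graph construction of Step 3 can be sketched as follows (our own illustrative Python/NumPy code; the representation of cluster assignments and neighbor pairs is an assumption of this sketch, not the paper's data structures).

```python
import numpy as np

def build_affinity_graphs(labels, clusters, neighbor_pairs, n_samples):
    """Build the intra-class graph W^w and inter-class graph W^b of Eqs. (13)-(14).

    labels:         (N,) subject label of each sample
    clusters:       (N,) submanifold (cluster) index of each sample, unique across subjects
    neighbor_pairs: set of frozensets {a, b} of cluster indices that are
                    m-nearest-neighbor submanifolds of different subjects
    """
    Ww = np.zeros((n_samples, n_samples))
    Wb = np.zeros((n_samples, n_samples))
    for i in range(n_samples):
        for j in range(i + 1, n_samples):
            if clusters[i] == clusters[j]:                          # Eq. (13)
                Ww[i, j] = Ww[j, i] = 1.0
            elif (labels[i] != labels[j]
                  and frozenset((clusters[i], clusters[j])) in neighbor_pairs):
                Wb[i, j] = Wb[j, i] = 1.0                           # Eq. (14)
    return Ww, Wb
```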
With $W^w$ and $W^b$ defined as in Eqs. (13) and (14), $S_w$ and $S_b$ can locally characterize the intra-class compactness and inter-class separability, respectively. The different
affinity matrices, which determine how efficiently the local discriminative information is extracted, are the essential difference among FDA, MFA, and DCE. The mapping function $V_{DCE}$, obtained by solving Eq. (11), compresses the points in each submanifold together and, at the same time, separates each submanifold from its m nearest neighbors that belong to other individuals. It is worth noting that $V_{DCE}$ does not force submanifolds of the same class to be close when they are really far apart in the original image space. After embedding, DCE can preserve the significant inherent local structure of the face appearance manifolds and extract the most powerful discriminant features.
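Step 4 amounts to the generalized eigenvalue problem of Eq. (11). A minimal sketch of this solve is given below (our own code using SciPy; the small ridge regularization is an assumption we add because X(D^w − W^w)X^T can be singular in the small-sample case, and is not discussed in the paper).

```python
import numpy as np
from scipy.linalg import eigh

def dce_projection(X, Ww, Wb, d=50, reg=1e-6):
    """Solve Eq. (11) for the DCE projection (columns = top-d generalized eigenvectors).

    X: (D, N) data matrix; Ww, Wb: (N, N) affinity graphs from Eqs. (13)-(14).
    """
    Dw = np.diag(Ww.sum(axis=1))          # Eq. (8)
    Db = np.diag(Wb.sum(axis=1))          # Eq. (9)
    Sb = X @ (Db - Wb) @ X.T
    Sw = X @ (Dw - Ww) @ X.T + reg * np.eye(X.shape[0])
    # generalized symmetric eigenproblem: Sb v = lambda Sw v
    eigvals, eigvecs = eigh(Sb, Sw)
    order = np.argsort(eigvals)[::-1]     # largest eigenvalues first
    return eigvecs[:, order[:d]]
```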
4 Experimental Results

We use the Honda/UCSD Video Database [5] to evaluate our method. The database contains a set of 52 video sequences of 20 different persons. Each video sequence is recorded indoors at 15 frames per second over a duration of at least 20 seconds. Each individual appears in at least two video sequences with the head moving with different combinations of 2-D (in-plane) and 3-D (out-of-plane) rotation, expression changes, and illumination changes (see Fig. 2). A cascaded face detector [10] and a coarse skin-color-based tracker are used to detect faces automatically. About 90% of the faces can be detected. The images are resized to a uniform scale of 16×16 pixels. Histogram equalization is the only preprocessing step. For training, we select 20 video sequences, one for each individual, and the remaining 32 sequences are used for testing. For each testing video sequence of each individual, we randomly select 10 sub-image-sets, each including 30% of the detected faces, to build up a set of test image sets. Finally, to determine the class of a test image set, we combine the decisions of all test images in the test image set by a majority scheme.

4.1 Parameter Selection

Comparison of Clustering Methods and Number of Clusters. As shown in Fig. 3, the random clustering scheme exhibits instability in the relationship between accuracy and the number of clusters, whereas the other two classical clustering schemes both provide nearly constant accuracies beyond certain points. This is expected because the clusters obtained by the random clustering scheme cannot characterize the inherent local structure of face appearance manifolds properly. It is also noticeable that the proposed method is not very sensitive to the choice between k-means and HAC in the experiment. This may be because both clustering algorithms can find well-separated clusters which properly approximate the distribution of the image set. For the two classical clustering procedures, just as expected, the accuracy rises rapidly with the number of clusters, then increases slowly, and finally tends to a constant. To characterize the local structure of an appearance manifold properly, it is necessary to obtain a suitable number of clusters. However, as the number increases further, the accuracy should not drop, because more clusters can characterize the local structure more subtly. For convenience, we adopt the HAC procedure and fix the number of clusters at 40 to evaluate our algorithm in the following experiments.
Fig. 2. Examples of image sets used in the experiments
Fig. 3. Comparison of different clustering methods and the effect of the number of clusters on accuracy
Fig. 4. Effect of different parameters on accuracy: (a) the number of selected features and (b) the number of neighborhoods of each submanifold
Number of Features and Number of Neighbors. Fig. 4a shows the accuracy of DCE with respect to the number of features. When the number of features reaches 20, there is a rapid rise, and then the recognition rate increases slowly and tends to a constant. The top accuracy is not achieved at 19 features, as it would be with traditional FDA [9]. In the following comparison experiments, we set the number of features to 50. In Fig. 4b, the recognition rate of DCE with respect to the number of neighbors of each submanifold is shown. Note that the accuracy of DCE does not keep increasing but asymptotically converges as the number of neighbors grows. This is an interesting finding: a larger number of neighbors does not mean higher accuracy. In the extreme case, each submanifold is neighboring to all other submanifolds that belong to other subjects, which might seem to exploit all the inter-class separability, but does not. This shows the effectiveness of the neighboring submanifolds in characterizing the inter-class separability. The best number of neighbors for DCE is found to be around 5, and it is fixed at 6 for the following comparison experiments.
4.2 Comparison with Previous Methods

We first compare our proposed method with traditional methods, i.e., nearest neighbor (NN) [13], Fisher discriminant analysis (FDA), and marginal Fisher analysis (MFA). All experiments for the comparison methods are performed in the original image space, and the final decisions are obtained by a majority voting scheme. As shown in Table 1, the local embedding methods (MFA and DCE) achieve better performance than the other two methods. It is obvious that the methods based on local embedding can reveal the significant local structure more efficiently and can extract more powerful discriminative information from the nonlinear face manifolds. Note that FDA achieves the lowest accuracy. This is because the features obtained by FDA cannot characterize the various variations of human faces properly. As expected, our method outperforms MFA. The reason is that our DCE compresses together only the points within each submanifold, which are genuinely close to each other, whereas MFA may force faraway points of the same class to be close. This is detrimental to discovering the proper local structure of face appearance manifolds efficiently and utilizing it to characterize the inter-class separability.

Table 1. Average accuracy of our DCE and previous methods
Methods                  NN       FDA      MFA     DCE      Dual-space   DCC
Recognition rate (%)     95.26    94.37    98.1    99.97    96.34        98.93
We also compare our method with two recent methods, the dual-subspace method [3] and discriminant analysis of canonical correlations (DCC) [4], which are devised for recognition over image sets. As shown in Table 1, our discriminant embedding method outperforms the other two methods. While the three methods all take advantage of submanifolds, the ways they apply them to the classification task are completely different. By discriminant embedding, DCE makes full use of the local information embedded in the submanifolds and extracts the most powerful discriminant features for recognition over sets.
5 Conclusions and Future Work We have presented a novel discriminant embedding framework for face recognition with image sets based on the submanifolds extracted via clustering. The proposed method has been evaluated on a complex face image sets obtained from UCSD/Honda video database. Extensive experiments show that our method could both uncover local structure of appearance manifolds sufficiently and extract local discriminative information by discriminant embedding efficiently. It is noticeable that our method is significantly improved by the clustering procedure but insensitive to the selection of different clustering methods. The experiments also demonstrate that it is not the larger number of neighbors of each submanifold the higher accuracy. Experimental results show that DCE outperforms the state-of-the-art methods in terms of accuracy.
650
In future work, we will further compare our method with the relevant previous methods, e.g., the dual-subspace method [3] and discriminant analysis of canonical correlations (DCC) [4]. The application of DCE to object recognition is also of interest to us. In addition, a nonlinear extension of DCE by the so-called kernel trick is straightforward. Acknowledgments. This work was supported by the 973 Program of China (No. 2006CB303105). The authors thank the Honda Research Institute for providing us with the Honda/UCSD Video Database.
References 1. Frey, B.J., Huang, T.S.: Mixtures of Local Linear Subspace for Face Recognition. In: IEEE Conf. on Computer Vision and Pattern Recognition, pp. 32–37. IEEE Computer Society Press, Los Alamitos (1998) 2. Arandjelovic, O., Shakhnarovich, G., Fisher, J., Cipolla, R., Darrell, T.: Face recognition with image sets using manifold density divergence. In: IEEE Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 581–588 (2005) 3. Fan, W., Yeung, D.Y.: Locally Linear Models on Face Appearance Manifolds with Application to Dual-Subspace Based Classification. In: IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 1384–1390 (2006) 4. Kim, T.K., Kittler, J., Cipolla, R.: Discriminative Learning and Recognition of Image Set Classes Using Canonical Correlations. IEEE Trans. PAMI 29(6), 1005–1018 (2007) 5. Lee, K., Ho, J., Yang, M., Kriegman, D.: Visual tracting and recognition using probabilistic appearance manifolds. CVIU 99, 303–331 (2005) 6. Hadid, A., Pietikainen, M.: From still image to videobased face recognition: an experimental analysis. In: 6th IEEE Conf. on Automatic Face and Gesture Recognition, pp. 17–19. IEEE Computer Society Press, Los Alamitos (2004) 7. Yan, S.C., Xu, D., Zhang, B., Zhang, H.J., Yang, Q., Lin, S.: Graph Embedding and Extensions: A General Framework for Dimensionality Reduction. IEEE Trans. PAMI 29(1), 40–51 (2007) 8. Chen, H.T., Chang, H.W., Liu, T.L.: Local Discriminant Embedding and Its Variants. In: IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 846–853 (2005) 9. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Trans. PAMI 19(7), 711–720 (1997) 10. Viola, P., Jones, M.: Robust real-time face detection. IJCV 57(2), 137–154 (2004) 11. Moghaddam, B., Pentland, A.: Probabilistic Visual Learning for Object Representation. IEEE Trans. PAMI 19(7), 696–710 (1997) 12. Bjorck, A., Golub, G.H.: Numerical methods for computing angles between linear subspaces. Mathematics of Computation 27(123), 579–594 (1973) 13. Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley, Chichester (2000) 14. Tenenbaum, J.B., Silva, V., Langford, J.C.: A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290(22), 2319–2323 (2000)
Privacy Preserving: Hiding a Face in a Face∗

Xiaoyi Yu and Noboru Babaguchi

Graduate School of Engineering, Osaka University, Japan
Abstract. This paper proposes a detailed framework of privacy preserving techniques for real-time video surveillance systems. In the proposed system, the protected video data can be released in such a way that the identity of any individual contained in the video cannot be recognized while the surveillance data remains practically useful, and if the original privacy information is demanded, it can be recovered with a secret key. The proposed system attempts to hide a face (the real face, i.e., the privacy information) in a face (a newly generated face for anonymity). To deal with the huge payload problem of privacy information hiding, an Active Appearance Model (AAM) based privacy information extraction and recovery scheme is proposed in our system. A quantization index modulation based data hiding scheme is used to hide the privacy information. Experimental results have shown that the proposed system can embed the privacy information into video without affecting its visual quality and while keeping its practical usefulness, and at the same time allows the privacy information to be revealed in a secure and reliable way. Keywords: Privacy Preserving, Data Hiding, Active Appearance Model.
1 Introduction
Identity privacy is one of the most important civil rights. It is defined as the ability to prevent other parties from learning one's current identity by recognizing his/her personal characteristics. In recent years, advanced digital camera and computer vision technologies, for example, personal digital cameras for photography, digital recording of surgical operations, medical image recording for scientific research, and video surveillance, have been widely deployed. While these technologies provide many conveniences, they also expose people's privacy. Although such a concern may not be significant in public spaces such as surveillance systems in metro stations, airports or supermarkets, patients whose medical images are taken in hospitals may feel their privacy is violated if their face or other personal information is exposed to the public. In such situations, it is desirable to have a system that balances the disclosure of image/video and privacy, such that inferences about the identities of people and about the privacy information contained in the released image/video cannot reliably be made, while the released data remains usable.
This work was supported in part by SCOPE from Ministry of Internal Affairs and Communications, Japan and by a Grant-in-Aid for scientific research from the Japan Society for the Promotion of Science.
In the literature, many techniques for privacy preservation in image/video systems have been proposed [1-9]. These techniques can be divided into the following classes:
1) Pixel-operation based methods. Operations such as blacking out, pixelization, or blurring have been used to fade out the sensitive area. In [7], Kitahara et al. proposed an anonymous video capturing system, called "Stealth Vision", which protects the privacy of objects by fading out their appearance. In [9], Wickramasuriya et al. proposed a similar privacy-protecting video surveillance system. Although these systems fulfill the privacy-protection goal to some degree, they have a potential security flaw because they cannot keep a record of the privacy information. In video surveillance, when authorized personnel are themselves involved in malicious or even criminal behavior, the surveillance system should be able to provide the original surveillance footage when necessary. A second flaw is that, although these systems keep identities anonymous, the sensitive area is distorted.
2) Cryptography-based methods. Dufaux et al. [1, 2] proposed a solution based on transform-domain scrambling of regions of interest in a video sequence. In [3], the authors present a cryptographically invertible obscuration method for privacy preservation. Martínez-Ponte et al. [4] propose a method using a Motion JPEG 2000 encoding module to encode sensitive data. These methods share the drawback that the sensitive area is distorted.
3) Data-hiding based methods. Zhang et al. [5] proposed a method of storing privacy information in video using data hiding techniques. The privacy information is not only removed from the surveillance video but also embedded into the video itself. This method solves the problem of unrecoverable privacy information. However, it still has the drawback that the sensitive area is distorted (it disappears), and it is hard to handle the large privacy-information payload required for data hiding.
4) Others. Newton et al. [6, 8] addressed the threat associated with face-recognition techniques by "de-identifying" faces in a way that preserves many facial characteristics; however, the original privacy information is lost with this method.
A desirable privacy-protecting system should meet the following requirements: 1. The original privacy information is recoverable. 2. The privacy-preserving image remains practically usable; e.g., one can still see the emotion of the processed face images in our proposed system. 3. The identity is anonymous. None of the methods in the literature fulfills all three requirements. The motivation of our research is to propose a solution that fulfills all three requirements for privacy protection. The main contributions of this paper are as follows.
- We propose a framework of privacy preserving techniques for real-time video surveillance systems. In our proposed system, the privacy information is not only protected but also embedded into the video itself, and it can only be retrieved with a secret key.
- In our system, the identity of any individual contained in the video cannot be recognized while the surveillance data remains practically useful. The facial area is distorted in systems such as [1, 2, 5, 7, 9], while it remains visible in our system.
- Our proposed method can efficiently solve the huge payload problem of privacy information hiding.
In the next section, we discuss the architecture of the privacy-protecting system. In Section 3, the statistical Active Appearance Model (AAM) is introduced and AAM-based synthesis for anonymity is proposed. In Section 4, privacy information extraction and hiding are presented. Experimental results for the proposed method are presented in Section 5. Conclusions are drawn in Section 6.
2 System Description The solution we present is based on face masking and hiding. The architecture is illustrated in Fig. 1 and consists of 3 modules: Training module (shown in the dashed circle in Fig. 1), Encoding module (top in Fig. 1) and Decoding module (bottom of Fig. 1).
Fig. 1. Schematic Diagram of Privacy Preserving System
Before privacy information processing, a statistical AAM must be trained. Given a set of face images, the model is built using the training module. In the encoding procedure, the AAM model parameters (the privacy information) are first obtained by analyzing an unseen input frame containing a face using the trained AAM; a copy of the model parameters is kept for the later hiding step. Based on the estimated AAM parameters, a mask face is generated that differs from the original face. An anonymous frame is obtained by imposing the mask face on the original image. Last, with a secret key, the privacy information is embedded into the anonymous frame using the QIM embedding method. The resulting privacy-preserving video can then be released for practical use. In the decoding procedure, the AAM parameters are first extracted using the extraction procedure of the QIM data-hiding method. With the extracted parameters and the AAM, the original face is synthesized and imposed on the privacy-preserving frame to obtain the recovered frame.
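As an illustration of the data flow just described, the following Python sketch wires the modules together with toy stand-in functions; every helper here (fit_aam, synthesize_face, impose, qim_embed, qim_extract) is a hypothetical placeholder showing only the order of operations, not the authors' implementation.

```python
import numpy as np

# Toy stand-ins for the modules in Fig. 1; each function is a hypothetical
# placeholder used only to show the encode/decode data flow.

def fit_aam(frame):
    """Pretend AAM analysis: return appearance parameters c and pose parameters."""
    return np.zeros(30), np.array([0.0, 0.0, 1.0])        # c, (tx, ty, scale)

def synthesize_face(c):
    """Pretend AAM synthesis of a 140x140 face patch from parameters c."""
    return np.full((140, 140), c.mean())

def impose(frame, face, pose):
    """Pretend compositing of the synthesized face onto the frame at the given pose."""
    out = frame.copy()
    out[:face.shape[0], :face.shape[1]] = face
    return out

def qim_embed(frame, payload, key):                        # see Section 4.2
    return frame                                           # hiding omitted in this sketch

def qim_extract(frame, key, n):
    return np.zeros(n)

def encode(frame, key):
    c, pose = fit_aam(frame)                               # privacy information
    anonymous = impose(frame, synthesize_face(-c), pose)   # mask face for anonymity
    return qim_embed(anonymous, np.concatenate([c, pose]), key)

def decode(protected, key):
    payload = qim_extract(protected, key, 33)
    c, pose = payload[:30], payload[30:]
    return impose(protected, synthesize_face(c), pose)     # recovered frame

frame = np.zeros((240, 320))
recovered = decode(encode(frame, key=42), key=42)
```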
3 Real-Time Face Masking for Anonymity
3.1 Active Appearance Model and Real-Time Implementation
The Active Appearance Model [10, 13] is a successful statistical method for matching a combined model of shape and texture to new, unseen faces. First, shapes are acquired through hand placement of fiducial points on a set of training faces; then textures are acquired through piecewise affine image warping to the reference shape and grey-level sampling. Both shape and texture are modeled using Principal Component Analysis (PCA). The statistical shape model is given by

s = \bar{s} + \Phi_s b_s,   (1)

where s is the synthesized shape, \Phi_s is a truncated matrix and b_s is a vector that controls the synthesized shape. Similarly, after computing the mean shape-free texture \bar{g} and normalizing all textures from the training set relative to \bar{g} by scaling and offsetting the luminance values, the statistical texture model is given by

g = \bar{g} + \Phi_t b_t,   (2)

where g is the synthesized shape-free texture, \Phi_t is a truncated matrix and b_t is a vector controlling the synthesized shape-free texture. The shape model is then combined with the texture model by a further PCA:

s = \bar{s} + Q_s c,  g = \bar{g} + Q_t c,   (3)

where Q_s and Q_t are truncated matrices describing the principal modes of combined appearance variation in the training set, and c is a vector of appearance parameters simultaneously controlling the shape and texture. Given a suitably annotated set of example face images, we can construct the statistical models (\bar{s}, \bar{g}, Q_s and Q_t) of their shape and their patterns of intensity (texture) [10]. Such models can reconstruct synthetic face images using a small number of parameters c. We implement a real-time AAM fitting algorithm with the OpenCV library. Fig. 2 shows the matching procedure for an AAM with only 11 parameters: (a) the original face, (b) the initialization of the AAM fitting, (c) the 2nd iteration, and (d) the convergence result after only 8 iterations.
Fig. 2. (a) Original (b) Initialized (c) 2nd iteration (d) Synthesized Faces
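To make Eqs. (1)-(3) concrete, the following NumPy sketch reconstructs a shape and a shape-free texture from an appearance vector c; the model matrices and dimensions here are random stand-ins chosen for illustration, not a trained AAM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model dimensions: 58 landmarks (116 shape values), 5000 texture samples,
# and 11 combined appearance parameters, assumed stand-ins for a trained AAM.
n_shape, n_tex, n_params = 116, 5000, 11

s_mean = rng.normal(size=n_shape)            # \bar{s}: mean shape
g_mean = rng.normal(size=n_tex)              # \bar{g}: mean shape-free texture
Q_s = rng.normal(size=(n_shape, n_params))   # combined shape modes
Q_t = rng.normal(size=(n_tex, n_params))     # combined texture modes

def reconstruct(c):
    """Eq. (3): synthesize shape s and shape-free texture g from parameters c."""
    s = s_mean + Q_s @ c
    g = g_mean + Q_t @ c
    return s, g

c = rng.normal(size=n_params)                # appearance parameters for one face
s, g = reconstruct(c)
print(s.shape, g.shape)                      # (116,) (5000,)
```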
3.2 AAM-Based Face Mask
As mentioned in Section 3.1, for a given unseen image X, a shape in the image frame can be generated by applying a suitable global transformation (such as a similarity transformation) to the points in the model frame. The texture in the image frame is generated by applying a scaling and offset to the intensities generated in the model frame. A full reconstruction is obtained by generating the texture in a mean-shaped patch and then warping it so that the model points lie on the image points. The parameters c in Equation (3) control the appearance of the reconstructed face. Once c is obtained, facial features such as the eyes, nose and mouth can easily be located, and the texture of the face patches can also be obtained via Eq. (3). This leads to a method of face masking for identity anonymity based on these AAM parameters: the appearance parameter c, the shape parameter s and the texture parameter g. We have several ways to generate a new face mask for anonymity:
1. Perturbation of AAM parameters. Let c_i be the elements of c; we perturb each c_i as follows:

\tilde{c}_i = c_i (1 + v_i),   (4)

where v_i is a manually selected variable. To constrain \tilde{c}_i to plausible values, we apply limits to each element \tilde{c}_i:

|\tilde{c}_i| \le 3 \sqrt{\lambda_i},   (5)

where \lambda_i is the eigenvalue corresponding to \tilde{c}_i. We then reconstruct \tilde{s} and \tilde{g} using

\tilde{s} = \bar{s} + Q_s \tilde{c},  \tilde{g} = \bar{g} + Q_t \tilde{c},   (6)

where \bar{s}, \bar{g}, Q_s and Q_t are the same matrices as in Eq. (3), and \tilde{c} is the perturbed vector with elements \tilde{c}_i. A similar operation can be performed on the shape parameter s and the texture parameter g. With \tilde{s} and \tilde{g}, we can generate a new face that is quite different from the original one.
2. Replacement of AAM parameters. First, the AAM is applied to a face to obtain the parameters s and g. Then, in the reconstruction process, the texture parameter g is directly replaced by the texture parameters of a different face, or simply by s.
3. Face-feature based mask face generation. Once the parameter vector c is determined by AAM matching, the shape parameter s, which consists of the matched points on the face surface, is determined. These points usually lie on facial features such as the eyes, nose or mouth corners. Since facial features can be accurately located, we can modify or replace these regions with other patches, or add virtual objects onto the face. We can use all kinds of masked faces for identity anonymity, such as adding virtual objects, wearing a mask, adding a Beijing Opera actor's face-painting, or even adding another person's face. Of course, we can also
use distortion-based methods for identity anonymity, such as a contorted face, blurring, mosaicing, or darkening.
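As a minimal sketch of perturbation-based anonymization (Eqs. (4)-(6)), the following NumPy code perturbs the appearance parameters and clips them to plus or minus 3 sqrt(lambda_i); the model matrices and eigenvalues are random stand-ins rather than a trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
n_shape, n_tex, n_params = 116, 5000, 11

# Stand-ins for a trained combined AAM (assumptions for illustration only).
s_mean, g_mean = rng.normal(size=n_shape), rng.normal(size=n_tex)
Q_s = rng.normal(size=(n_shape, n_params))
Q_t = rng.normal(size=(n_tex, n_params))
eigvals = np.sort(rng.uniform(0.5, 5.0, size=n_params))[::-1]   # lambda_i

def perturb(c, v):
    """Eq. (4): c~_i = c_i (1 + v_i); Eq. (5): clip |c~_i| <= 3 sqrt(lambda_i)."""
    c_tilde = c * (1.0 + v)
    limit = 3.0 * np.sqrt(eigvals)
    return np.clip(c_tilde, -limit, limit)

c = rng.normal(scale=np.sqrt(eigvals))    # plausible appearance parameters
v = np.full(n_params, -2.0)               # v_i = -2, as used in the experiments
c_tilde = perturb(c, v)

# Eq. (6): reconstruct the anonymous shape and texture from the perturbed vector.
s_tilde = s_mean + Q_s @ c_tilde
g_tilde = g_mean + Q_t @ c_tilde
```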
4 Privacy Extraction and Hiding
4.1 Privacy Information Extraction
The privacy information that we try to protect and hide in our proposed system is the detected facial area. It is natural to think of cutting out the facial area and using data hiding techniques to hide it in the image; this is exactly the method proposed in [5]. However, this method is impractical when the facial area is large. For example, in our experiment, the facial area size is about 140x140 pixels, while the frame size is 320x240. The privacy information in the system then exceeds 50,000 bits even after compression, which is much larger than the privacy information size (3,000 bits) in [5]. Most current image data hiding techniques cannot embed such a large amount of information in an image, so imperceptibly hiding a large amount of privacy information in the image is a great challenge. In Section 3, we mentioned that the parameters c in Equation (3) control the appearance of the reconstructed image. In the scenario where the two communicating sides share the same AAM, the parameters c can be regarded as the privacy information. With the parameters c alone, we can only generate a shape-free face; to recover the original frame, we also need the face pose parameters, which include position translation, rotation and scaling. Therefore, the parameters c and the pose parameters constitute the privacy information in our proposed system, and we hide them for privacy protection. The privacy information size decreases greatly compared with the method in [5].
4.2 Privacy Information Hiding and Recovering
Data hiding [11] has been widely used in copyright protection, authentication and secure communication applications, in which data (for example, a watermark or secret information) are embedded in a host image and later retrieved for ownership protection, authentication or secure communication. Most data hiding methods in the literature are based on the human visual system (HVS) or a perceptual model to guarantee minimal perceptual distortion. Due to the popularity of Discrete Cosine Transform (DCT) based perceptual models, we adopt the DCT perceptual model described in [11]; the same model is used in [5]. With this perceptual model, we can compute a perceptual mask value for each DCT coefficient in the image and sort these values. Then, with a secret key determining the embedding locations and in combination with these perceptual mask values, we use a special case of quantization index modulation (QIM), the odd-even method [11], for data hiding in our implementation. Finally, the modified DCT coefficients are assembled into a privacy-preserving image. For ordinary users, only this image can be viewed. For an authorized user, who may see the privacy information, a decoding procedure is necessary to view the privacy-recovered video. The decoding procedure is shown in Fig. 1. With the secret key, the decoder can determine where the privacy information has been embedded. After extracting the privacy information (AAM
parameters), together with the AAM model, the original frame can be recovered along with its privacy information.
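The odd-even QIM step can be illustrated in a few lines of NumPy; the single fixed quantization step and the key-driven position selection below are simplifying assumptions (the actual system derives per-coefficient steps from the perceptual mask described above).

```python
import numpy as np

def qim_embed(coeffs, bits, key, step=8.0):
    """Odd-even QIM: quantize each selected coefficient to an even multiple of
    `step` to hide a 0, or to an odd multiple to hide a 1."""
    out = coeffs.copy()
    rng = np.random.default_rng(key)
    idx = rng.choice(coeffs.size, size=bits.size, replace=False)  # key-selected positions
    q = np.round(out[idx] / step)
    # If the parity does not match the bit, move to the nearer neighboring multiple.
    q += (q.astype(int) % 2 != bits) * np.where(out[idx] >= q * step, 1, -1)
    out[idx] = q * step
    return out

def qim_extract(coeffs, key, n_bits, step=8.0):
    """Recover the hidden bits from the parity of the quantized coefficients."""
    rng = np.random.default_rng(key)
    idx = rng.choice(coeffs.size, size=n_bits, replace=False)
    return np.round(coeffs[idx] / step).astype(int) % 2

rng = np.random.default_rng(0)
dct_coeffs = rng.normal(scale=30.0, size=1024)   # stand-in for mid-band DCT coefficients
bits = rng.integers(0, 2, size=64)               # 64 payload bits
marked = qim_embed(dct_coeffs, bits, key=1234)
assert np.array_equal(qim_extract(marked, key=1234, n_bits=64), bits)
```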
5 Experimental Results
To evaluate the proposed system, three experiments were performed. The first experiment was carried out to test the performance of selectively perturbing AAM parameters for identity anonymity. The second experiment tests the performance of face-feature based mask face generation for anonymity. The last one was carried out to evaluate the proposed real-time privacy-preserving video surveillance system: hiding a face in a face. For selective perturbation of AAM parameters for identity anonymity, we use a publicly available data set [12] in our test. First, an appearance model is trained on a set of 35 labeled faces. This set [12] contains 40 people; we leave 5 images for the test. Each image was hand-annotated with 58 landmark points on the key features. From this data a combined appearance model (Model-I) is built.

Fig. 3. Original Face and Anonymous Face

The vector c (24 parameters) in Equation (3) is obtained by analyzing a target input face (shown in the top left of Fig. 3) using Model-I. The vector c is perturbed using Equation (4) to obtain the vector \tilde{c}. Combining \tilde{c} and Model-I, we generate an anonymous new face. Fig. 3 shows the experimental results for two faces: the top left and bottom left show the original faces, and the top right and bottom right are the generated anonymous faces. In our experiment, we set v_i = -2 in Equation (4). Readers can set other values to generate an anonymous face by an "adjust and look" procedure. For the second experiment, we used a newly generated face set for model training. This set contains 45 face images of a single person with different poses, illuminations, and expressions. Each image was manually annotated with 54 landmark points on the key features. From this data set, a combined appearance model (Model-II) is built. We use Model-II to evaluate the face-feature based mask method. We generate mask faces using method 3 described in Section 3.2. Fig. 4 shows the original face and the experimental results of all kinds of masked faces for identity anonymity: (a) the original face, (b) virtual beard addition, (c) virtual beard, (d) a contorted face, (e) blur, (f) mosaic, (g) adding another person's face, (h) adding a Beijing Opera actor's face-painting, (i) darkening. Fig. 4(b) and (c) show that a deformable beard is added into the video sequences. From the results, we can observe that the beard deforms along with the expression changes. The added virtual objects are tightly overlaid on the subject. The masked Beijing Opera actor's face-painting and the added faces of other people are also deformable.
Fig. 4. Original Face and Anonymous Faces
From the above two experiments, we can see that the proposed method can generate different kinds of anonymous faces, which fulfill the privacy-protection requirement of releasing useful information in such a way that the identity of any individual or entity contained in the image/video cannot be recognized while the image/video remains practically useful.
Fig. 5. Experimental Results of the Real-Time Privacy Preserving System
The third experiment was carried out to evaluate the proposed real-time privacy-preserving video surveillance system. We train a model (Model-III) on a set of 25*6 labeled faces. This set contains 30 people. Each person has 6 face images with
Fig. 6. Replaced face
Fig. 7. PSNR vs. Frame No
different poses and expressions. We leave 10*6 images for the test. Each image was manually annotated with 50 landmark points. For each input frame of the video surveillance system, the vector c (30 parameters) in Equation (3) is obtained by analyzing the input frame (shown in the top row of Fig. 5) using Model-III. Combining \tilde{c}, s, g and Model-III, we generate mask faces using method 2 described in Section 3.2. Fig. 5 shows the experimental result of replacing the face texture of one person by the texture of another person. The top row of Fig. 5 shows the input frames of a person, and the second row shows the generated results obtained by replacing the texture using Model-III and the shape parameters. The frontal face of the replacement person is shown in Fig. 6. For lack of space, other results, such as replacing the texture parameter by s, are not shown here. We then come to the embedding procedure. Since MPEG-2 is the most widespread video format, it is used to compress all the video sequences in our experiments. The privacy information is embedded into the generated I-frames of the MPEG-2 stream using the method described in Section 4.2, yielding the privacy-preserving I-frames. For ordinary users, the system provides these frames. For an authorized user, the original face should be recovered; we use the data extraction procedure discussed in Section 4.2 to recover the original faces (shown in the last row of Fig. 5). Fig. 7 plots the peak signal-to-noise ratio (PSNR) of the recovered frames compared with the original frames. The error would decrease if a smaller convergence threshold were set; however, a small convergence threshold would slow convergence, so there is a tradeoff between accuracy and efficiency. The above experiment demonstrates the effectiveness, efficiency and reliability of our proposed system. The main advantages of our system are that it keeps a record of the privacy information and allows versatile handling of the sensitive area. We now compare our system with systems in the literature in Table 1. In Table 1, the column "Anonymity" shows whether the compared systems can protect the identity's privacy, the column "Distortion" shows whether the privacy area is distorted, and the column "Recoverability" shows whether the original privacy
Table 1. Comparison between our proposed system and methods in the literature
Method                                   Anonymity   Distortion   Recoverability
Pixel operations based methods [7, 9]    Yes         Yes          No
Coding based methods [1, 2, 3, 4]        Yes         Yes          No
Zhang et al.'s method [5]                Yes         Yes          Yes
Newton et al.'s method [6, 8]            Yes         No           No
Our proposed system                      Yes         No           Yes
information information is recoverable. From the table, we can see that our proposed method outperforms all the methods listed. Among the methods in the literature, only Zhang et al.'s method can recover the original privacy information; however, that method has the drawback that the encoded image/video is probably not practically useful. Another drawback of [5] is that the data capacity required for hiding is large. For example, in our experiments, the privacy information size is about [30 (the length of c) + 3 (the pose parameter length)] * 1 (byte per parameter) * 8 (bits per byte) = 720 bits, and it can be further compressed to a smaller size. It would take about 50,000 bits to hide the privacy information if we used the method of [5]; the privacy information size of method [5] is therefore about 70 times as large as that of our method.
6 Conclusions
In this paper, we have presented a privacy-preserving system, hiding a face in a face, where the identity in the released data is anonymous, the released data is practically useful, and the privacy information can be hidden in the surveillance data and retrieved later. Effective identity anonymization, privacy information extraction and hiding methods have been proposed to hide all the privacy information in the host image with minimal perceptual distortion. The proposed approach does not take advantage of 3D information for AAM matching. In the future, the AAM convergence accuracy will be improved by training a 3D AAM with aligned 3D or 2D shapes. Another limitation of the current system is that the generated facial expression cannot vary as the subject's expression changes; future research will focus on identity anonymity with facial expression. Our proposed solution focuses on privacy protection in video surveillance; furthermore, the system can be used in many other applications, such as privacy protection of news images or newsreels in journalism. Although the proposed method is demonstrated on facial privacy protection, it is a framework that can also be applied to virtually any type of privacy protection, such as treating the human body as privacy information.
References 1. Dufaux, F., Ouaret, M., Abdeljaoued, Y., Navarro, A., Vergnenegre, F., Ebrahimi, T.: Privacy Enabling Technology for Video Surveillance. In: Proc. SPIE, vol. 6250 (2006) 2. Dufaux, F., Ebrahimi, T.: Scrambling for Video Surveillance with Privacy. In: Proc. IEEE Workshop on Privacy Research In Vision, IEEE Computer Society Press, Los Alamitos (2006)
3. Boult, T.E.: PICO: Privacy through Invertible Cryptographic Obscuration. In: IEEE/NSF Workshop on Computer Vision for Interactive and Intelligent Environments (2005) 4. Martínez-Ponte, I., Desurmont, X., Meessen, J., Delaigle, J.-F.: Robust Human Face Hiding Ensuring Privacy. In: Proc. Int'l. Workshop on Image Analysis for Multimedia Interactive Services (2005) 5. Zhang, W., Cheung, S.S., Chen, M.: Hiding Privacy Information in Video Surveillance System. In: Proceedings of ICIP 2005, Genova, Italy (September 11-14, 2005) 6. Newton, E., Sweeney, L., Malin, B.: Preserving Privacy by De-identifying Facial Images. IEEE Transactions on Knowledge and Data Engineering 17(2), 232–243 (2005) 7. Kitahara, I., Kogure, K., Hagita, N.: Stealth Vision for Protecting Privacy. In: Proc. of 17th International Conference on Pattern Recognition, vol. 4, pp. 404–407 (2004) 8. Newton, E., Sweeney, L., Malin, B.: Preserving Privacy by De-identifying Facial Images. Technical Report CMU-CS-03-119 (2003) 9. Wickramasuriya, J., Datt, M., Mehrotra, S., Venkatasubramanian, N.: Privacy Protecting Data Collection in Media Spaces. In: ACM International Conference on Multimedia, New York (2004) 10. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(6), 681–685 (2001) 11. Cox, I.J., Miller, M.L., Bloom, J.A.: Digital Watermarking. Morgan Kaufmann Publishers, San Francisco (2002) 12. http://www.imm.dtu.dk/~aam/datasets/face_data.zip 13. Stegmann, M.B., Ersboll, B.K., Larsen, R.: FAME - A Flexible Appearance Modelling Environment. IEEE Transactions on Medical Imaging 22(10), 1319–1331 (2003)
Face Mosaicing for Pose Robust Video-Based Recognition
Xiaoming Liu1 and Tsuhan Chen2
1 Visualization and Computer Vision Lab, General Electric Global Research, Schenectady, NY, 12309
2 Advanced Multimedia Processing Lab, Carnegie Mellon University, Pittsburgh, PA, 15213
Abstract. This paper proposes a novel face mosaicing approach to modeling human facial appearance and geometry in a unified framework. The human head geometry is approximated with a 3D ellipsoid model. Multi-view face images are back projected onto the surface of the ellipsoid, and the surface texture map is decomposed into an array of local patches, which are allowed to move locally in order to achieve better correspondences among multiple views. Finally, the corresponding patches are used to train a model of the facial appearance, and a deviation model obtained from the patch movements is used to model the face geometry. Our approach is applied to pose robust face recognition. Using the CMU PIE database, we show experimentally that the proposed algorithm provides better performance than the baseline algorithms. We also extend our approach to video-based face recognition and test it on the Face In Action database.
1 Introduction
Face recognition is an active topic in the vision community. Although many approaches have been proposed for face recognition [1], it is still considered a hard and unsolved research problem. The key to a face recognition system is handling all kinds of variations through modeling. There are different kinds of variations, such as pose, illumination, and expression, among which pose variation is the hardest and contributes more recognition errors than the others [2]. In the past decade, researchers have mainly modeled each variation separately. For example, by assuming constant illumination and a frontal pose, expression-invariant face recognition approaches have been proposed [1]. However, although most of these approaches perform well for a specific variation, the performance degrades quickly when multiple variations are present, which is the case in real-world applications [3]. Thus, a good recognition approach should be able to model different kinds of variations in an efficient way. For human faces, most prior modeling work targets the facial appearance using various pattern recognition tools, such as Principal Component Analysis (PCA) [4], Linear Discriminant Analysis [5], and Support Vector Machines [5].
The work presented in this paper is performed in Advanced Multimedia Processing Lab, Carnegie Mellon University.
Fig. 1. Geometric mapping
Fig. 2. Up to 25 labeled facial features
On the other hand, except for 3D face recognition, the human face geometry/shape is mostly overlooked in face recognition. We believe that, similar to the facial appearance, the face geometry is also a unique characteristic of a human being. Face recognition can benefit if we can properly model the face geometry, especially when pose variation is present. This paper proposes a face mosaicing approach to modeling both the facial appearance and geometry, and applies it to face recognition. This paper extends the idea introduced in [6,7] by approximating the human head with a 3D ellipsoid. As shown in Fig. 1, an arbitrary-view face image can be back projected onto the surface of the 3D ellipsoid, resulting in a texture map. In multi-view facial image based modeling, multiple texture maps are combined, where the same facial feature, such as a mouth corner, from multiple maps might not correspond to the same coordinate on the texture map. Hence a blurring effect, which is normally not a good property for modeling, is observed. To reduce such blurring, the texture map is decomposed into a set of local patches. Patches from multi-view images are allowed to move locally to achieve better correspondences. Since the amount of movement indicates how much the actual head geometry deviates from the ellipsoid, a deviation model trained from the patch movements models the face geometry. Also, the corresponding patches are trained to model the facial appearance. Our mosaic model is composed of both models, together with a probabilistic model P_d that learns the statistical distribution of the distance measure between a test patch and the patch model [8]. Our face mosaicing approach makes a number of contributions. First, as the hardest variation, pose variation is handled naturally by mapping images from different view-angles to form the mosaic model, whose mean image can be treated as a compact representation of faces under various view-angles. Second, all other variations that cannot be modeled by the mean image, for example, illumination and expression, are taken care of by a number of eigenvectors. Therefore, instead of modeling only one type of variation, as done in conventional methods, our method models all possible appearance variations under one framework. Third, the simple geometric assumption is problematic since the head geometry is not truly an ellipsoid. This is taken care of by training a geometric deviation model, which results in better correspondences across multiple views. There is much prior work on face modeling [9,10]. Among these, Blanz and Vetter's approach [9] is one of the most sophisticated that has also been applied to face recognition, where two subspace models are trained for facial texture and
shape, respectively. Given a test image, they fit the new image with the two models by tuning the models' coefficients, which are eventually used for recognition. Intuitively, better modeling leads to better recognition performance. However, more sophisticated modeling also makes model fitting more difficult. For example, both training and test images are manually labeled with 6 to 8 feature points [9]. On the other hand, we believe that, unlike rendering applications in computer graphics, we might not need a very sophisticated geometric model for recognition applications. The benefit of a simpler face model is that model fitting tends to be easier and automatic, which is the goal of our approach.
2 Modeling the Geometric Deviation
To reduce the blurring issue in combining multiple texture maps, we obtain a better facial feature alignment by relying on landmark points. For model training, it is reasonable to manually label such landmark points. Given K multi-view training facial images {f_k}, we first label the positions of the facial feature points. As shown in Fig. 2, 25 facial feature points are labeled. For each training image, only a subset of the 25 points is labeled, according to their visibility. We call these points key points. Second, we generate the texture map s_k from each training image, and compute the key points' corresponding coordinates b^i_k (1 <= i <= 25) in the texture map s_k, as shown in Fig. 3. Furthermore, we would like to find the coordinate on the mosaic model toward which all corresponding key points deviate. Ideally, if the human head were a perfect 3D ellipsoid, the same key point b^i_k (1 <= k <= K) from multiple training texture maps would correspond exactly to the same coordinate. However, because the human head is not a perfect ellipsoid, these key points deviate from each other. The amount of deviation indicates the geometric difference between the actual head geometry and the ellipsoid. Third, we compute the averaged position \bar{b}^i of all visible key points b^i_k (1 <= k <= K) that correspond to the same facial feature. We treat this average, shown in the 3rd row of Fig. 3, as the target position in the final mosaic model toward which all corresponding key points should move. Since our resulting mosaic model is composed of an array of local patches, each of the 25 averaged key points falls into one particular patch, namely a key patch. Fourth, for each texture map, we take the difference between the position of the key point b^i_k and that of the averaged key point \bar{b}^i as the key patch's deviation flow (DF), which describes which patch from each texture map should move toward that key patch in the mosaic model. However, there are also non-key patches in the mosaic model. As shown in Fig. 4, we represent the mosaic model as a set of triangles whose vertices are the key patches. Since each non-key patch falls into at least one triangle, its DF is interpolated from the key patches' DFs. For each training texture map, its geometric deviation is a 2D vector map v_k, whose dimension equals the number of patches in the vertical and horizontal directions, and each element is one patch's DF. Note that for any training texture map, some elements in v_k are considered missing.
Fig. 3. Averaging key points: the positions of key points in the training texture maps (2nd row) that correspond to the same facial feature are averaged, resulting in the position in the final model (3rd row)
Fig. 4. Computation of a patch's DF: each non-key patch falls into at least one triangle; the deviation of a non-key patch is interpolated from the key-patch deviations of that triangle
Finally, the deviation model θ = {g, u} is learned from the geometric deviations {v_k} of all training texture maps using robust PCA [11], where g and u are the mean and the eigenvectors, respectively. Essentially, this linear model describes all possible geometric deviations at any view angle for this particular subject's face.
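The following NumPy sketch illustrates the two operations of this section under toy assumptions: interpolating a non-key patch's deviation flow from the key patches at the vertices of its enclosing triangle (here with barycentric weights, one plausible choice), and learning a linear deviation model with plain PCA instead of the robust PCA of [11].

```python
import numpy as np

def barycentric_weights(p, tri):
    """Weights of 2D point p with respect to a triangle given by its 3 vertices."""
    a, b, c = tri
    m = np.column_stack([b - a, c - a])
    w1, w2 = np.linalg.solve(m, p - a)
    return np.array([1.0 - w1 - w2, w1, w2])

def interpolate_df(p, tri, tri_dfs):
    """DF of a non-key patch at position p, interpolated from the key-patch DFs
    at the triangle vertices."""
    w = barycentric_weights(p, tri)
    return w @ tri_dfs                       # (3,) weights times (3, 2) DFs -> (2,)

# Toy deviation maps v_k for K texture maps, each flattened to a vector, then a
# linear deviation model {mean, eigenvectors} via PCA (robust PCA in the paper).
rng = np.random.default_rng(0)
K, n_patches = 8, 45 * 22                    # assumed patch-grid size
V = rng.normal(size=(K, n_patches * 2))      # each row: all patch DFs (x, y) of one map
g = V.mean(axis=0)                           # mean deviation
U, S, Vt = np.linalg.svd(V - g, full_matrices=False)
u = Vt[:3]                                   # first few eigen-deviations

tri = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
tri_dfs = np.array([[1.0, 0.5], [-0.5, 0.0], [0.0, 1.0]])
print(interpolate_df(np.array([3.0, 3.0]), tri, tri_dfs))
```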
3 Modeling the Appearance
After modeling the geometric deviation, we need to build an appearance model, which describes the facial appearance across all poses. On the left-hand side of Fig. 5, there are two pairs of training texture maps s_k and their corresponding geometric deviations v_k. The resulting appearance model Π = {m, V}, with one mean and two eigenvectors, is shown on the right-hand side. This appearance model is composed of an array of eigenspaces, each devoted to modeling the appearance of the local patch indexed by (i, j). In order to train one eigenspace for one particular patch, the key issue is to collect one corresponding patch from each training texture map s_k, where the correspondence is specified by v^k_{i,j}. For example, the sum of the patch center (40, 83) and the geometric deviation v^1_{i,j} determines the center of the corresponding patch, s^1_{i,j}, in the texture map s_1. Using the same procedure, we find the corresponding patches s^k_{i,j} (2 <= k <= K) from all other texture maps. Note that some of the s^k_{i,j} might be considered missing patches. Finally, the set of corresponding patches is used to train a statistical model Π_{i,j} via PCA. We call this array of PCA models the patch-PCA mosaic. Modeling via PCA is popular when the number of training samples is large. However, when the number of training samples is small, such as when training an individual mosaic model with only a few samples, it might not be suitable to train one PCA model for each patch. Instead, we train a universal PCA model based on all corresponding patches of all training texture maps, and keep the coefficients of these patches in the universal PCA model as well. This is
Fig. 5. Appearance modeling: the deviation indicates the corresponding patch in each of the training texture maps; all corresponding patches are treated as samples for PCA
Fig. 6. The mean images of two mosaic models without geometric deviation (top) and with geometric deviation (bottom)
called the global-PCA mosaic. Note that the patch-PCA mosaic and the global-PCA mosaic differ only in how the corresponding patches across training texture maps are utilized to form a model, depending on the availability of training data in different application scenarios. Eventually, the statistical mosaic model includes the appearance model Π, the geometric deviation model θ and the probabilistic model P_d. We consider that the geometric deviation model plays a key role in training the mosaic model. For example, Fig. 6 shows the mean images of two mosaic models trained with the same set of images from 10 subjects. It is obvious that the mean image on the bottom is much less blurred and captures more useful information about the facial appearance. Note that this mean image covers a much larger facial area compared to the upper-right illustration of Fig. 5, since extrapolation is performed while computing the geometric deviations of the non-key patches.
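A compact NumPy sketch of training the patch-PCA mosaic is given below; the texture maps, deviation flows, and patch grid are random stand-ins, and out-of-range shifted patches are simply clamped here, whereas the paper treats patches falling outside a map as missing.

```python
import numpy as np

rng = np.random.default_rng(0)
K, H, W, P = 8, 90, 180, 4                   # K maps, 90x180 texture map, 4x4 patches
maps = rng.normal(size=(K, H, W))            # toy training texture maps s_k
dev = rng.integers(-2, 3, size=(K, H // P, W // P, 2))  # toy per-patch DFs v_k (pixels)

def corresponding_patch(tex, i, j, df):
    """Patch at grid position (i, j), shifted by its deviation flow df.
    The shifted window is clamped to the map for simplicity; in the paper,
    patches falling outside the map are treated as missing."""
    y = int(np.clip(i * P + df[0], 0, tex.shape[0] - P))
    x = int(np.clip(j * P + df[1], 0, tex.shape[1] - P))
    return tex[y:y + P, x:x + P].ravel()

def train_patch_pca(i, j, n_modes=2):
    """Eigenspace Pi_{i,j} for one patch: mean and leading eigenvectors."""
    X = np.array([corresponding_patch(maps[k], i, j, dev[k, i, j]) for k in range(K)])
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_modes]

mosaic = {(i, j): train_patch_pca(i, j)
          for i in range(H // P) for j in range(W // P)}
print(len(mosaic))                           # one eigenspace per patch
```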
4 Face Recognition Using the Statistical Mosaic Model
Given L subjects with K training images per subject, an individual statistical mosaic model is trained for each subject. For simplicity, let us assume we have enough training samples and obtain the patch-PCA mosaic for each subject; we discuss the case of the global-PCA mosaic at the end of this section. We now introduce how to utilize this model for pose robust face recognition. As shown in Fig. 7, given one test image, we generate its texture map by using the universal mosaic model, which is trained from multi-view images of many subjects. Then we measure the distance between the test texture map and each of the trained individual mosaic models, namely the map-to-model distance. Note that the appearance model is composed of an array of patch models, each of which is called a reference patch. Hence, the map-to-model distance equals the
summation of the map-to-patch distances. That is, for each reference patch, we find its corresponding patch in the test texture map and compute its distance to the reference patch. Since we deviate corresponding patches during the training stage, we should do the same while looking for the corresponding patch in the test stage. One simple approach is to search for the best corresponding patch for each reference patch within a search window. However, this does not impose any constraint on the deviations of neighboring reference patches. To solve this issue, we make use of the deviation model trained before. As shown in Fig. 7, if we sample one coefficient vector of the deviation model, the corresponding linear combination describes the geometric deviation of all reference patches. Hence, the key is to find the coefficients that provide the optimal matching between the test texture map and the model. In this paper, we adopt a simple sequential searching scheme to achieve this. That is, in a K-dimensional deviation model, we uniformly sample multiple coefficients along the 1st dimension while the coefficients of the other dimensions are zero, and determine the one that results in the maximal similarity between the test texture map and the model. The range of sampling is bounded by the coefficients of the training geometric deviations. Then we perform the same search along the 2nd dimension while fixing the optimal value for the 1st dimension and zero for all remaining dimensions. The search continues until the K-th dimension. Basically, our approach enforces the geometric deviation of neighboring patches to follow a certain constraint, which is described by the bases of the deviation model. For each sampled coefficient, the reconstructed 2D geometric deviation (in the bottom-left of Fig. 7) indicates where to find the corresponding patches in the test texture map. Then the residue between the corresponding patch and the reference patch model is computed, which is further fed into the probabilistic model [8]. Finally, the probabilistic measurement tells how likely it is that this corresponding patch belongs to the same subject as the reference patch. By doing the same for all other reference patches and averaging all patch-based probabilistic measurements, we obtain the similarity between this test texture map and the model based on the current sampled coefficient. Finally, the test image is recognized as the subject whose model provides the largest similarity. Depending on the type of mosaic model (the patch-PCA mosaic or the global-PCA mosaic), there are different ways of calculating the distance between the corresponding patch and the reference patch model. For the patch-PCA mosaic, the reconstruction residue with respect to the reference patch model is used as the distance measure. For the global-PCA mosaic, since one reference patch model is represented by a number of coefficients, the distance measure is defined as the nearest-neighbor distance of the corresponding patch to these coefficients.
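The sequential, one-dimension-at-a-time search over the deviation coefficients can be sketched as follows; the similarity function is a stand-in (a negative squared distance to a toy target) rather than the patch-wise probabilistic measure used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K_dim = 3                                     # dimensionality of the deviation model
bounds = np.array([3.0, 2.0, 1.0])            # per-dimension bound from training coefficients
target = np.array([1.2, -0.7, 0.4])           # hidden "best" coefficients (toy ground truth)

def similarity(coeff):
    """Stand-in for the map-to-model similarity of a texture map under `coeff`."""
    return -np.sum((coeff - target) ** 2)

def sequential_search(n_samples=21):
    coeff = np.zeros(K_dim)
    for d in range(K_dim):                    # optimize one dimension at a time
        candidates = np.linspace(-bounds[d], bounds[d], n_samples)
        scores = []
        for c in candidates:
            trial = coeff.copy()
            trial[d] = c                      # later dimensions remain zero
            scores.append(similarity(trial))
        coeff[d] = candidates[int(np.argmax(scores))]
    return coeff

print(sequential_search())                    # approaches `target` dimension by dimension
```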
5 Video-Based Face Recognition
There are two schemes for recognizing faces from video sequences: image-based recognition and video-based recognition.
Fig. 7. The map-to-patch distance: the geometric deviation indicates the patch correspondence between the model and the texture map; the distances of corresponding patches are fed into the Bayesian framework to generate a probabilistic measurement
In image-based recognition, the face area is usually cropped before being fed to a recognition system. Thus, image-based face recognition involves two separate tasks: face tracking and face recognition. In our face mosaicing algorithm, given one video frame, the most important task is to generate a texture map and compare it with the mosaic model. Since the mapping parameter x, a 6-dimensional vector describing the 3D head location and orientation [7], contains all the information needed to generate the texture map, face tracking is equivalent to estimating the x that results in the maximal similarity between the texture map and the mosaic model. We use the condensation method [12] to estimate the mapping parameter x. In image-based recognition, for a face database with L subjects, we build an individualized model for each subject based on one or multiple training images. Given a test sequence and one specific model, a distance measurement is calculated for each frame by face tracking. Averaging the distances over all frames provides the distance between the test sequence and that model. After the distances between the sequence and all models are calculated, comparing these distances provides the recognition result for this sequence. In video-based face recognition, the two tasks, face tracking and recognition, are usually performed simultaneously. Zhou et al. [13] propose a framework to combine face tracking and recognition using the condensation method. They propagate a set of samples governed by two parameters: the mapping parameter and the subject ID. We adopt this framework in our experiments.
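A minimal condensation (particle filter) sketch for estimating the mapping parameter x is shown below; the likelihood is a toy stand-in for the texture-map-to-model similarity, and the 6-D state layout and noise scales are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_x = np.array([10.0, -5.0, 40.0, 0.1, -0.2, 0.05])   # toy "true" pose (tx, ty, tz, yaw, pitch, roll)

def likelihood(x):
    """Stand-in for the similarity between the texture map generated by x and the model."""
    return np.exp(-0.5 * np.sum(((x - true_x) / [2, 2, 5, 0.1, 0.1, 0.1]) ** 2))

def condensation(n_particles=500, n_frames=10):
    # Initialize particles around a rough guess of the mapping parameter.
    particles = true_x + rng.normal(scale=[5, 5, 10, 0.3, 0.3, 0.3], size=(n_particles, 6))
    for _ in range(n_frames):
        # 1. Predict: diffuse particles with a random-walk dynamic model.
        particles += rng.normal(scale=[1, 1, 2, 0.05, 0.05, 0.05], size=particles.shape)
        # 2. Measure: weight each particle by the observation likelihood.
        w = np.array([likelihood(p) for p in particles])
        w /= w.sum()
        # 3. Resample particles in proportion to their weights.
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles = particles[idx]
    return particles.mean(axis=0)             # posterior-mean estimate of x

print(condensation())                          # close to true_x
```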
6 Experimental Results
We evaluate our algorithm on pose robust face recognition using the CMU PIE database [14]. We use half of the subjects (34 subjects) in PIE to train the probabilistic model. The 9 pose images per subject from the remaining 34 subjects are used for the recognition experiments.
Fig. 8. (a) Sample Images of one subject from the PIE database. (b) Mean images of three individual mosaic models. (c) Recognition performances of four algorithms on the CMU PIE database based on three training images.
Sample images and the pose labels for one subject in PIE are shown in Fig. 8(a). Three poses (c27, c14, c02) are used for training, and the remaining 6 poses (c34, c11, c29, c05, c37, c22) are used for testing with four algorithms. The first is the traditional eigenface approach [4]. We perform manual cropping and normalization for both training and test images. We test different numbers of eigenvectors and plot the one with the best recognition performance. The second is the eigen light-field algorithm [15] (one frontal training image per subject). The third algorithm is our face mosaic method without the modeling of geometric deviation, which essentially sets the mean and all eigenvectors of θ = {g, u} to zero. The fourth algorithm is the face mosaic method with the modeling of geometric deviation. Since the number of training images is small, we train the global-PCA mosaic for each subject. Three eigenvectors are used in building the global-PCA subspace; thus, each reference patch from the training stage is represented as a 3-dimensional vector. For the face mosaic method, the patch size is 4 x 4 pixels and the size of the texture map is 90 x 180 pixels. For illustration, we show the mean images of three subjects in Fig. 8(b). Fig. 8(c) shows the recognition rate of the four algorithms for each specific pose. Comparing these four algorithms, both of our algorithms work better than the baseline algorithms. Obviously, the mosaic approach provides a better way of registering multi-view images for enhanced modeling, unlike the naive training procedure of the traditional eigenface approach. For our algorithms, the one with deviation modeling performs better than the one without. There are at least two benefits for the former. One is that a geometric model can be used in the test stage. The other is that, as a result of deviation modeling, the patch-based appearance model also better captures the personal characteristics of the multi-view facial appearance in a non-blurred manner. We perform video-based face recognition experiments on the Face In Action (FIA) database [16], which mimics the "passport checking" scenario. Multiple cameras capture the whole process of a subject walking toward the desk, standing in front of the desk, making simple conversation and head motion, and finally walking away from the desk. Six video sequences are captured from six calibrated cameras simultaneously for 20 seconds at 30 frames per second.
Fig. 9. (a) 9 training images from one subject in the FIA database. (b) The mean images of the individual models in two methods (left: Individual PCA, right: mosaicing).

Table 1. Recognition error rate of different algorithms

                        PCA       Mosaic
image-based method      17.24%    6.90%
video-based method       8.97%    4.14%
We use a subset of the FIA database containing 29 subjects, with 10 sequences per subject as the test sequences. Each sequence has 50 frames, and the first frame is labeled with ground-truth data. We use the individual PCA algorithm [17] with image-based recognition and the individual PCA with video-based recognition as the baseline algorithms. For both algorithms, 9 images per subject are used for training and the best performance is reported after trying different numbers of eigenvectors. Fig. 9(a) shows the 9 training images for one subject in the FIA database. The face locations in the training images are labeled manually, while those in the test images are based on the tracking results using our mosaic model. Face images are cropped to 64 x 64 pixels from the video frames. We test two options for our algorithms based on the same training set (9 images per subject). The first is to use the individual patch-PCA mosaic with image-based recognition, which uses the averaged distance from the frames to the mosaic model as the final distance measure. The second is to use the individual patch-PCA mosaic with video-based recognition, which uses the 2D condensation method to perform tracking and recognition. Fig. 9(b) illustrates the mean images of the two methods. We can observe a significant blurring effect in the mean image of the individual PCA model. On the other hand, the mean image of our individual patch-PCA mosaic model covers a larger pose variation while keeping enough individual facial characteristics. The comparison of recognition performance is shown in Table 1. Two observations can be made. First, given the same model, such as the PCA model or the mosaic model, video-based face recognition is better than image-based recognition. Second, the mosaic model works much better than the PCA model for pose-robust recognition.
7 Conclusions
This paper presents an approach to building a statistical mosaic model by combining multi-view face images, and applies it to face recognition. Multi-view face
images are back projected onto the surface of an ellipsoid, and the surface texture map is decomposed into an array of local patches, which are allowed to move locally in order to achieve better correspondences among multiple views. We show the improved performance for pose robust face recognition by using this new method and extend our approach to video-based face recognition.
References 1. Zhao, W.Y., Chellappa, R., Phillips, P.J., Rosenfeld, A.: Face recognition: A literature survey. ACM Computing Survey 35(4), 399–458 (2003) 2. Phillips, P., Grother, P., Micheals, R., Blackburn, D., Tabassi, E., Bone, J.: Face recognition vendor test (FRVT) 2002: Evaluation report (2003) 3. Sim, T., Kanade, T.: Combining models and exemplars for face recognition: An illuminating example. In: Proc. of the CVPR 2001 Workshop on Models versus Exemplars in Computer Vision (2001) 4. Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience 3(1), 71–86 (1991) 5. Duda, R., Hart, P., Stork, D.: Pattern Classification, 2nd edn. John Wiley & Sons. Inc., New York (2001) 6. Liu, X., Chen, T.: Geometry-assisted statistical modeling for face mosaicing. In: ICIP 2003. Proc. 2003 International Conference on Image Processing, Barcelona, Catalonia, Spain, vol. 2, pp. 883–886 (2003) 7. Liu, X., Chen, T.: Pose-robust face recognition using geometry assisted probabilistic modeling. In: Proc. IEEE Computer Vision and Pattern Recognition, San Diego, California, vol. 1, pp. 502–509 (2005) 8. Kanade, T., Yamada, A.: Multi-subregion based probabilistic approach toward pose-invariant face recognition. In: IEEE Int. Symp. on Computational Intelligence in Robotics Automation, Kobe, Japan, vol. 2, pp. 954–959 (2003) 9. Blanz, V., Vetter, T.: Face recognition based on fitting a 3D morphable model. IEEE Trans. on Pattern Analysis and Machine Intelligence 25(9), 1063–1074 (2003) 10. Dimitrijevic, M., Ilic, S., Fua, P.: Accurate face models from uncalibrated and ill-lit video sequences. Proc. IEEE Computer Vision and Pattern Recognition 2, 1034–1041 (2004) 11. De la Torre, F., Black, M.J.: Robust principal component analysis for computer vision. In: Proc. 8th Int. Conf. on Computer Vision, Vancouver, BC, vol. 1, pp. 362–369 (2001) 12. Isard, M., Blake, A.: Active Contours. Springer, Heidelberg (1998) 13. Zhou, S., Krueger, V., Chellappa, R.: Probabilistic recognition of human faces from video. Computer Vision and Image Understanding 91, 214–245 (2003) 14. Sim, T., Baker, S., Bsat, M.: The CMU pose, illumination, and expression database. IEEE Trans. on Pattern Analysis and Machine Intelligence 25(12), 1615–1618 (2003) 15. Gross, R., Matthews, I., Baker, S.: Appearance-based face recognition and lightfields. IEEE Trans. on Pattern Analysis and Machine Intelligence 26(4), 449–465 (2004) 16. Goh, R., Liu, L., Liu, X., Chen, T.: The CMU Face In Action (FIA) database. In: Proc. of IEEE ICCV 2005 Workshop on Analysis and Modeling of Faces and Gestures, Beijing, China, IEEE Computer Society Press, Los Alamitos (2005) 17. Liu, X., Chen, T., Kumar, B.V.K.V.: Face authentication for multiple subjects using eigenflow. Pattern Recognition 36(2), 313–328 (2003)
Face Recognition by Using Elongated Local Binary Patterns with Average Maximum Distance Gradient Magnitude Shu Liao and Albert C.S. Chung Lo Kwee-Seong Medical Image Analysis Laboratory, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong
Abstract. In this paper, we propose a new face recognition approach based on local binary patterns (LBP). The proposed approach makes the following novel contributions. (i) Compared with the conventional LBP, anisotropic structures of facial images can be captured effectively by the proposed approach using an elongated neighborhood distribution, which is called the elongated LBP (ELBP). (ii) A new feature, called the Average Maximum Distance Gradient Magnitude (AMDGM), is proposed. AMDGM embeds the gray-level difference information between the reference pixel and the neighboring pixels in each ELBP pattern. (iii) It is found that the ELBP and AMDGM features complement each other well. The proposed method is evaluated by performing face recognition experiments on two databases: ORL and FERET. The proposed method is compared with two widely used face recognition approaches. Furthermore, to test the robustness of the proposed method when the resolution of the input images is low, we also conduct additional face recognition experiments on the two databases with input facial images of reduced resolution. The experimental results show that the proposed method gives the highest recognition accuracy under both normal conditions and low image resolution conditions.
1 Introduction
Automatic facial recognition (AFR) has been the topic of extensive research in the past several years. It plays an important role in many computer vision applications, including surveillance and biometric image processing. There are still many challenges and difficulties, for example, factors such as pose [8], illumination [9] and facial expression [10]. In this paper, we propose a new approach to face recognition from static images by using new features to effectively represent facial images. Developing a face recognition framework involves two crucial aspects. 1. Facial image representation: this is also known as the feature extraction process, in which feature vectors are extracted from the facial images. 2. Classifier design: the feature vectors extracted in the first stage
are fed into a specific classifier to obtain the final classification results. In this paper, we focus on the first stage: feature extraction. Many feature extraction methods for facial image representation have been proposed. Turk and Pentland [4] used Principal Component Analysis (PCA) to construct eigenfaces and represent face images as projection coefficients along these basis directions. Belhumeur et al. proposed the Linear Discriminant Analysis (LDA) method [5]. Wiskott et al. used Gabor wavelet features for face recognition [3]. In recent years, a new feature extraction method has been proposed, known as Local Binary Patterns (LBP) [1]. LBP was first applied to texture classification [6]. It is a computationally efficient descriptor that captures the micro-structural properties of facial images. However, the conventional LBP has a major limitation: it uses a circularly symmetric neighborhood definition. The circularly symmetric neighborhood aims to achieve rotation invariance in texture classification, at the cost of eliminating anisotropic structural information. For face recognition, however, such a problem does not exist. Anisotropic structural information is an important feature for face recognition, as many anisotropic structures exist in the face (e.g., eyes, mouth). To this end, we extend the neighborhood distribution in an elongated manner to capture anisotropic properties of facial images; this is called the elongated LBP (ELBP), and the conventional LBP is a special case of ELBP. Moreover, the conventional LBP does not take gradient information into consideration. In this paper, we propose a new feature named the average maximum distance gradient magnitude (AMDGM) to capture general gradient information for each ELBP pattern. It is experimentally shown that the ELBP and AMDGM features complement each other and achieve the highest recognition accuracy among all the compared methods under both normal conditions and low input image resolution conditions. The paper is organized as follows. In Section 2, the concepts of ELBP and AMDGM are introduced. Section 3 describes the experimental results for various approaches under normal and low-resolution conditions. Section 4 concludes the paper.
2
Face Recognition with ELBP and AMDGM
In this section, the ELBP and AMDGM features are introduced. We will first briefly review the conventional LBP approach and then describe these two features. 2.1
Elongated Local Binary Patterns
In this section, the Elongated Local Binary Patterns (ELBP) are introduced. In the definition of the conventional LBP [6], the neighborhood pixels of the reference pixel are defined in a circularly symmetric manner. There are
two parameters, m and R, representing the number of neighboring pixels and the radius (i.e., the distance from the reference pixel to each neighboring pixel), respectively. By varying the values of m and R, multiresolution analysis can be achieved. Figure 1 provides examples of different values of m and R. Then,
Fig. 1. Circularly symmetric neighbor sets for different values of m and R
the neighboring pixels are thresholded to 0 if their intensity values are lower than that of the center reference pixel, and to 1 otherwise. If the number of transitions between "0" and "1" is less than or equal to two, the pattern is a uniform pattern. For example, "00110000" is a uniform pattern, but "01011000" is not. It is obvious that there are m + 1 possible types of uniform patterns. The final feature vector extracted by the conventional LBP is the occurrence of each type of uniform pattern in an input image, as the authors of [6] pointed out that the uniform patterns represent basic image structures such as lighting spots and edges. As we can see, for the conventional LBP, the neighborhood pixels are all defined on a circle with radius R around the reference center pixel. The main reason for defining neighboring pixels in this isotropic manner is to solve the rotation invariance problem in texture classification, which was the first application of the conventional LBP. Later, the conventional LBP was applied to face recognition [1]. However, in this application, the rotation invariance problem does not exist. Instead, anisotropic information is an important feature for face recognition. To the best of our knowledge, this problem has not been mentioned by other researchers. Therefore, we are motivated to propose the ELBP approach. In ELBP, the distribution of the neighborhood pixels forms an ellipse (see Figure 2). There are three parameters related to the ELBP approach: 1. the long axis of the ellipse, denoted by A; 2. the short axis of the ellipse, denoted by B; 3. the number of neighboring pixels, denoted by m. Figure 2 shows examples of ELBP patterns with different values of A, B and m. The X and Y coordinates, $g_{ix}$ and $g_{iy}$, of each neighbor pixel $g_i$ (i = 1, 2, ..., m) with respect to the center pixel are defined by Equations 1 and 2 respectively,

$R_i = \sqrt{ \dfrac{A^2 B^2}{A^2 \sin^2\theta_i + B^2 \cos^2\theta_i} }$ ,   (1)
Fig. 2. Examples of ELBP with different values of A, B and m
$g_{ix} = R_i \cdot \cos\theta_i , \quad g_{iy} = R_i \cdot \sin\theta_i$ ,   (2)

where $\theta_i = \left( \frac{360}{m} \cdot (i - 1) \right)^{\circ}$. If the coordinates of the neighboring pixels do not fall exactly on the image grid, then the bilinear interpolation technique is applied. The final feature vector of ELBP is also the occurrence histogram of each type of uniform pattern. In this paper, three sets of ELBP are used with different values of A, B and m: A1 = 1, B1 = 1, m = 8; A2 = 3, B2 = 1, m = 16; A3 = 3, B3 = 2, m = 16. Similar to [1], before processing the input image for face recognition, the input image is divided into six regions in advance: eyebrows, eyes, nose, mouth, left cheek, and right cheek, which are denoted as R1, R2, ..., R6. Each region is assigned a weighting factor according to its importance; the larger the weighting factor, the more important the region. In this paper, the weighting factors for these six regions are set to w1 = 2, w2 = 4, w3 = 2, w4 = 4, w5 = 1, w6 = 1. The ELBP histograms are estimated from each region. Then, the feature vector is normalized to the range [-1, 1]. Finally, the normalized feature vector is multiplied by its corresponding weighting factor to obtain the region feature vector. As such, the region feature vector encodes the textural information in each local region. By concatenating all the region feature vectors, global information about the entire face image can be obtained. The ELBP pattern can also be rotated about the center pixel by a specific angle β to achieve multi-orientation analysis and to characterize elongated structures along different orientations in the facial images. In this paper, four orientations β1 = 0°, β2 = 45°, β3 = 90°, β4 = 135° are used for each ELBP pattern with its own parameters A, B and m. The final ELBP feature vector is an (m + 1)-dimensional vector F, where each element Fi (i = 1, 2, ..., m+1) denotes the occurrence of a specific type of uniform pattern along all four orientations β1, β2, β3 and β4 in an input image. As we can see, the ELBP features are more general than the conventional LBP. More precisely, the conventional LBP can be viewed as a special case of ELBP obtained by setting the values of A and B equal to each other. The ELBP is able to capture anisotropic information from the facial images, which are important
features, since many important parts of the face, such as the eyes and mouth, are elongated structures. Therefore, it is expected that ELBP has more discriminative power than the conventional LBP, which will be further verified in the experimental results section.
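To make the elliptical sampling of Equations 1 and 2 concrete, the following Python sketch computes the neighbor coordinates, bilinearly interpolates off-grid values, and checks the uniform-pattern condition for one reference pixel. It is a minimal illustration under stated assumptions, not the authors' implementation; the function name, border handling and parameter defaults are ours.

```python
import numpy as np

def elbp_code(image, yc, xc, A=3, B=1, m=16, beta=0.0):
    """Binary ELBP pattern of the reference pixel (yc, xc).

    A, B : long and short axes of the elliptical neighborhood
    m    : number of neighboring pixels
    beta : orientation of the ellipse in degrees
    Border handling is omitted for brevity (assumes the ellipse fits inside the image).
    """
    center = float(image[yc, xc])
    bits = []
    for i in range(m):
        theta = np.deg2rad(360.0 / m * i)      # theta_i = (360/m)*(i-1) degrees (0-based here)
        r = np.sqrt(A**2 * B**2 /
                    (A**2 * np.sin(theta)**2 + B**2 * np.cos(theta)**2))   # Eq. (1)
        phi = theta + np.deg2rad(beta)         # rotate the whole pattern by beta
        x, y = xc + r * np.cos(phi), yc + r * np.sin(phi)                  # Eq. (2)
        # bilinear interpolation for off-grid coordinates
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        dx, dy = x - x0, y - y0
        patch = image[y0:y0 + 2, x0:x0 + 2].astype(float)
        value = (patch[0, 0] * (1 - dx) * (1 - dy) + patch[0, 1] * dx * (1 - dy)
                 + patch[1, 0] * (1 - dx) * dy + patch[1, 1] * dx * dy)
        bits.append(1 if value >= center else 0)   # threshold against the center pixel
    transitions = sum(bits[k] != bits[(k + 1) % m] for k in range(m))
    return bits, transitions                   # uniform pattern if transitions <= 2
```

Collecting occurrence histograms of the uniform patterns over the four orientations β then yields the (m + 1)-dimensional ELBP vector for each region, as described above.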
2.2 Average Maximum Distance Gradient Magnitude
As mentioned in Section 2.1, ELBP is more general than the conventional LBP and can effectively capture anisotropic information. However, both the conventional LBP and the proposed ELBP still do not take the gradient information of each pattern into consideration. Since both the conventional LBP and ELBP are constructed by thresholding the neighboring pixels to 0 and 1 with respect to the reference center pixel, gradient magnitude information is not included. In this paper, a new measure, called the average maximum distance gradient magnitude (AMDGM), is proposed to effectively capture such information. To define AMDGM, we first introduce the concept of distance gradient magnitude (DGM). For each ELBP pattern, there are three parameters, A, B and m, denoting the long axis, the short axis and the number of neighboring pixels. Then, the distance gradient magnitude for each neighboring pixel $g_i$, given the center pixel $g_c$, is defined by Equation 3,

$| \nabla_d I(g_i, g_c) | = \dfrac{| I_{g_i} - I_{g_c} |}{\| v_i - v_c \|_2}$ ,   (3)

where v = (x, y) denotes the pixel position, and $I_{g_i}$ and $I_{g_c}$ are the intensities of the neighbor pixel and the reference pixel, respectively. Based on the definition of DGM, the maximum distance gradient magnitude G(v) is defined by Equation 4,

$G(v) = \max_{g_i} | \nabla_d I(g_i, g_c) | , \quad i = 1, 2, \ldots, m$ .   (4)

Suppose that, in an input image, for each type of uniform ELBP pattern $P_i$ (i = 1, 2, ..., m), its occurrence is $N_i$. Then, the average maximum distance gradient magnitude (AMDGM) $A(P_i)$ for each type of uniform pattern is defined by Equation 5,

$A(P_i) = \dfrac{\sum_{k=1}^{N_i} G(v_k)}{N_i}$ ,   (5)

where $v_k \in P_i$. The AMDGM feature has an advantage over the conventional gradient magnitude because it takes the spatial information (i.e., the distance from the neighbor pixel to the reference center pixel) into consideration. This is essential because the neighborhood distribution is no longer isotropic: the distance from each neighborhood pixel to the reference pixel can differ, unlike in the conventional LBP. The AMDGM feature complements ELBP well because ELBP provides the pattern type distribution, while AMDGM conveys the general gradient information, with spatial information, for each type of uniform pattern.
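The AMDGM statistic of Equations 3-5 can be accumulated per uniform-pattern type roughly as follows. This is a sketch of the definitions only; the data layout (tuples of pattern id, center intensity and position, neighbor intensities and positions) is an assumption for illustration, and the distance normalization follows Equation 3 as reconstructed above.

```python
import numpy as np
from collections import defaultdict

def accumulate_amdgm(patterns):
    """Average maximum distance gradient magnitude per uniform-pattern type.

    `patterns` is an iterable of tuples
        (pattern_id, center_intensity, center_pos, neighbor_intensities, neighbor_pos),
    one tuple per reference pixel, e.g. produced by an ELBP pass over an image.
    Returns a dict: pattern_id -> AMDGM value (Eq. 5).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pid, ic, vc, ivals, vpos in patterns:
        # distance gradient magnitude for every neighbor of this pixel (Eq. 3)
        dgm = [abs(iv - ic) / np.linalg.norm(np.asarray(v) - np.asarray(vc))
               for iv, v in zip(ivals, vpos)]
        sums[pid] += max(dgm)          # maximum DGM at this pixel (Eq. 4)
        counts[pid] += 1               # N_i: occurrences of pattern type P_i
    return {pid: sums[pid] / counts[pid] for pid in sums}
```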
3
Experimental Results
The proposed approach has been evaluated by performing face recognition experiments on two databases: ORL and FERET [7]. The ORL database contains 40 classes with 10 samples per class; each sample has a resolution of 92 × 112 pixels. For the FERET database, we have selected 70 subjects, with six upright, frontal-view images of each subject. For each facial image, the centers of the two eyes have already been manually detected. Then, each facial image has been translated, rotated and scaled properly to fit a grid size of 64 × 64 pixels. The proposed method has been compared with three widely used methods, similar to [1]: 1. Principal Component Analysis (PCA), also known as the eigenface approach [4]; 2. Linear Discriminant Analysis (LDA) [5]; 3. the conventional Local Binary Patterns (LBP) [1]. A support vector machine (SVM) [2] with the Gaussian Radial Basis Function (RBF) kernel was used as the classifier in this work. To test the robustness of the proposed method under low input image resolution, which is a practical problem in real-world applications, we have also performed face recognition experiments on the ORL and FERET databases by downsampling the input images.
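For the classifier stage described above, a minimal scikit-learn sketch of an RBF-kernel SVM on histogram-style feature vectors might look as follows. The feature matrices, dimensions and hyperparameter values here are placeholders, not the authors' settings.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Placeholder data: rows stand in for concatenated, weighted region histograms
# (ELBP occurrences plus AMDGM values); labels stand in for subject identities.
X_train, y_train = rng.random((200, 300)), rng.integers(0, 40, 200)
X_test, y_test = rng.random((80, 300)), rng.integers(0, 40, 80)

clf = SVC(kernel="rbf", gamma="scale", C=10.0)   # Gaussian RBF kernel; values illustrative
clf.fit(X_train, y_train)
print("recognition rate:", (clf.predict(X_test) == y_test).mean())
```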
3.1 Experiment on ORL and FERET Databases with Original Resolution
We have tested all the approaches under normal conditions on both the ORL and FERET databases (i.e., all the input images were at their original resolution). Figure 3 and Figure 4 show some sample images from the ORL and FERET databases. The purpose of this experiment is to test the basic discriminative power of the different approaches. In this experiment, half of the facial images of each class were randomly selected as the training set, and the remaining images were used as the test set. The experiment was repeated for all possible combinations of training and test sets. The average recognition rates of the different approaches on both the ORL and FERET databases are listed in Table 1.
Fig. 3. Some sample images from the ORL Database
Fig. 4. Some sample images from the FERET Database

Table 1. Performance of different approaches under the normal condition on the ORL and FERET databases. Results of the proposed methods are listed in rows 4-5.

Recognition Rate (%)
Method              ORL     FERET
1. PCA [4]          85.00   72.52
2. LDA [5]          87.50   76.83
3. LBP [1]          94.00   82.62
4. ELBP             97.00   86.73
5. ELBP + AMDGM     98.50   93.16
From Table 1, the proposed method has the highest recognition rate among the compared methods on both databases. Furthermore, the ELBP and AMDGM features complement each other well. The discriminative power of the proposed method is clearly demonstrated. 3.2
Experiment on ORL and FERET Databases with Low Resolution Images
In real-world applications, the quality of the input facial images is not always good due to various factors such as the imaging equipment and outdoor environments (i.e., low resolution facial images). Therefore, the robustness of a face recognition approach under this condition is essential. In this experiment, the input images of both the ORL and FERET databases were downsampled to 32 × 32 pixels before processing; the rest of the settings were the same as in Section 3.1. The recognition rates of the different approaches under this condition are listed in Table 2. Table 2 reflects the robustness of the different methods against low input image resolution. It shows that the proposed method maintains the highest recognition rate among all the compared methods. Its robustness under this condition is clearly illustrated.
Table 2. Performance of different approaches under the low resolution condition on the ORL and FERET databases. Results of the proposed methods are listed in rows 4-5.

Recognition Rate (%)
Method              ORL     FERET
1. PCA [4]          66.00   58.41
2. LDA [5]          69.50   64.85
3. LBP [1]          78.50   73.63
4. ELBP             83.00   80.22
5. ELBP + AMDGM     88.50   85.47

4
Conclusion
This paper proposed a novel approach to automatic face recognition. Motivated by the importance of capturing the anisotropic features of facial images, we propose the ELBP feature, which is more general and powerful than the conventional LBP. Also, to embed gradient information based on the definition of ELBP, a new feature AMDGM is proposed. The AMDGM feature encodes the spatial information of the neighboring pixels with respect to the reference center pixel, which is essential for the ELBP. Experimental results based on the ORL and FERET databases demonstrate the effectiveness and robustness of our proposed method when compared with three widely used methods.
References 1. Ahonen, T., Hadid, A., Pietikainen, M.: Face recognition with local binary patterns. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 469–481. Springer, Heidelberg (2004) 2. Vapnik, V.N.: Statistical Learning Theory. John Wiley and Sons, New York (1998) 3. Wiskott, L., Fellous, J.-M., Kuiger, N., von der Malsburg, C.: Face recognition by elastic bunch graph matching. IEEE PAMI 19(7), 775–779 (1997) 4. Turk, M., Pentland, A.: Eigenfaces for recognition. J. Cogn. Neurosci. 3, 71–86 (1991) 5. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE PAMI 19(7), 711–720 (1997) 6. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. IEEE PAMI 24(7), 971–987 (2002) 7. Phillips, P.J., Wechsler, H., Huang, J., Rauss, P.: The feret database and evaluation procedure for face recognition algorithms. IVC 16(5), 295–306 (1998) 8. Blanz, V., Vetter, T.: Face recognition based on fitting a 3D morphable model. IEEE PAMI 25(9), 1063–1074 (2003) 9. Shashua, A., Riklin-Raviv, T.: The quotient image: Class based re-rendering and recognition with varying illuminations. IEEE PAMI 23(2), 129–139 (2001) 10. Tian, T.-I., Kanade, T., Cohn, J.F.: Recognizing action units for facial expression analysis. IEEE PAMI 23(2), 97–115 (2001)
An Adaptive Nonparametric Discriminant Analysis Method and Its Application to Face Recognition Liang Huang1,∗, Yong Ma2, Yoshihisa Ijiri2, Shihong Lao2, Masato Kawade2, and Yuming Zhao1 1 Institute of Image Processing and Pattern Recognition Shanghai Jiao Tong University, Shanghai, China, 200240 {asonder,arola_zym}@sjtu.edu.cn 2 Sensing & Control Technology Lab., Omron Corporation, Kyoto, Japan, 619-0283 {ma,joyport,lao,kawade}@ari.ncl.omron.co.jp
Abstract. Linear Discriminant Analysis (LDA) is frequently used for dimension reduction and has been successfully utilized in many applications, especially face recognition. In classical LDA, however, the definition of the between-class scatter matrix can cause large overlaps between neighboring classes, because LDA assumes that all classes obey a Gaussian distribution with the same covariance. We therefore propose an adaptive nonparametric discriminant analysis (ANDA) algorithm that maximizes the distance between neighboring samples belonging to different classes, thus improving the discriminating power of the samples near the classification borders. To evaluate its performance thoroughly, we have compared our ANDA algorithm with traditional PCA+LDA, Orthogonal LDA (OLDA) and nonparametric discriminant analysis (NDA) on the FERET and ORL face databases. Experimental results show that the proposed algorithm outperforms the others. Keywords: Linear Discriminant Analysis, Nearest Neighbors, Nonparametric Discriminant Analysis, Face Recognition.
1 Introduction LDA is a well-known dimension reduction method that reduces the dimensionality while keeping the greatest between-class variation, relative to the within-class variation, in the data. An attractive feature of LDA is the quick and easy method of determining this optimal linear transformation, which requires only simple matrix arithmetic. It has been used successfully in reducing the dimension of the face feature space. LDA, however, has some limitations in certain situations. For example, in face recognition, when the number of class samples is smaller than the feature dimension (the well-known SSS (Small Sample Size) problem), LDA suffers from the singularity problem. Several extensions to classical LDA have been proposed to
∗ This work was done when the first author was an intern student at Omron Corporation.
overcome this problem. These include Direct LDA [2], null space LDA (NLDA) [8][9], orthogonal LDA (OLDA) [10], uncorrelated LDA (ULDA) [10][11], regularized LDA [12], pseudo-inverse LDA [13][14] and so on. The SSS problem can also be overcome by applying the PCA algorithm before LDA. Another limitation arises from the assumption made by the LDA algorithm that all classes of a training set have a Gaussian distribution with a single shared covariance. LDA cannot achieve the minimal error rate if the classes do not satisfy this assumption, which thus restricts the application of LDA. When using LDA as a classifier for face recognition, this assumption is rarely true. To overcome this limitation, some extensions have been proposed that drop this assumption, such as nonparametric discriminant analysis (NDA) [1][17], stepwise nearest neighbor discriminant analysis (SNNDA) [6], and heteroscedastic LDA (HLDA) [15]. Both NDA and HLDA, however, require a free parameter to be specified by the user. In NDA, this is the number K of nearest neighbors; in HLDA, it is the number of dimensions of the reduced space. SNNDA further modifies NDA by redefining the within-class scatter matrix, and therefore has the same free parameter as NDA. Thus each of these algorithms needs to be tuned to a specific situation, so as not to cause large overlaps between neighboring classes. In this paper, we propose an adaptive nonparametric discriminant analysis (ANDA) approach for face recognition, based on previously proposed methods, especially NDA. First, we apply the PCA method before the discriminant analysis to avoid the SSS problem. Then we redefine the between-class scatter matrix: we calculate the distances between each sample and its nearest neighbors, and these distances are used to define the between-class scatter matrix. Our goal is to find an adaptive method which can deal with different types of data distribution without parameter tuning. Finally, we compare our approach with PCA+LDA, OLDA, and NDA on the FERET and ORL face datasets. The results show that our algorithm outperforms the other approaches. The rest of this paper is organized as follows. Section 2 reviews classical LDA and some extensions. Our ANDA algorithm is presented in Section 3. In Section 4, the experiments and results are described. Finally, we conclude in Section 5.
2 Overview of Discriminant Analysis 2.1 Classical LDA LDA is a statistical approach for classifying samples of unknown classes based on training samples from known classes. This algorithm aims to maximize the between-class variance and minimize the within-class variance. Classical LDA defines the between-class and within-class scatter matrices as follows:

$S_b = \sum_{i=1}^{c} p_i (m_i - m)(m_i - m)^T$ ,   (1)
$S_w = \sum_{i=1}^{c} p_i S_i$ ,   (2)
where m_i, S_i and p_i are the mean, covariance matrix and prior probability of each class (i.e., each individual), respectively, and m and c are the global mean and the number of classes. The trace of S_w measures the within-class cohesion and the trace of S_b measures the between-class separation of the training set. Classical LDA seeks a linear transformation matrix G that reduces the feature dimension while maximizing trace S_b and minimizing trace S_w. So classical LDA computes the optimal G by solving the following optimization:

$G = \arg\max_{G^T S_w G = I_l} \operatorname{trace}\left( (G^T S_w G)^{-1} G^T S_b G \right)$ .   (3)
The solution to this optimization problem is given by the eigenvectors of $S_w^{-1} S_b$ corresponding to nonzero eigenvalues [3]. Thus, if S_w is singular, $S_w^{-1}$ does not exist and the optimization problem fails. This is known as the SSS problem. Another drawback is that the definition of S_b in classical LDA may cause large overlaps between neighboring classes, because this definition only considers the distance between the global mean and the mean of each class, without considering the distances between neighboring samples belonging to different classes. In this paper, we call this the dissatisfactory assumption problem. Fig. 2 illustrates this problem and we will explain it in Section 3.
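A small sketch of the classical LDA quantities in Equations (1)-(3), assuming S_w has already been made non-singular (e.g., by a preceding PCA step). It is an illustration of the definitions, not the code used in this paper; names and the use of SciPy's generalized eigensolver are our choices.

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, y, n_components):
    """Classical LDA projection from Eqs. (1)-(3); X is (n_samples, n_features)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)
    m = X.mean(axis=0)
    d = X.shape[1]
    Sb, Sw = np.zeros((d, d)), np.zeros((d, d))
    for c, p in zip(classes, priors):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sb += p * np.outer(mc - m, mc - m)              # Eq. (1)
        Sw += p * np.cov(Xc, rowvar=False, bias=True)   # Eq. (2): prior-weighted class covariance
    evals, evecs = eigh(Sb, Sw)        # generalized eigenproblem Sb v = lambda Sw v
    order = np.argsort(evals)[::-1]    # top eigenvectors of Sw^{-1} Sb form G
    return evecs[:, order[:n_components]]
```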
2.2 Extensions to LDA 2.2.1 Extensions for the SSS Problem PCA+LDA is commonly used to overcome the SSS problem. The PCA method first projects the high-dimensional features into a low-dimensional subspace. Thereafter, LDA is applied to the low-dimensional feature space. OLDA computes a set of orthogonal discriminant vectors via the simultaneous diagonalization of the scatter matrices. It can be implemented by solving the optimization problem below:
$G = \arg\max_{G^T G = I_l} \operatorname{trace}\left( (G^T S_t G)^{+} G^T S_b G \right)$ ,   (4)
where St is the total scatter matrix. NLDA aims to maximize the between-class distance in the null space of the within-class scatter matrix. Ye et al. [4] present a computational and theoretical analysis of NLDA and OLDA. They conclude that both NLDA and OLDA result in orthogonal transformations. Although these extensions can solve the SSS problem in LDA, they cannot overcome the dissatisfactory assumption problem.
2.2.2 Extensions for the Dissatisfactory Assumption Problem NDA was first proposed by Fukunaga et al. [3]. Bressan et al. [5] then explored the nexus between NDA and the nearest neighbour (NN) classifier, and also introduced a modification to NDA, which extends the two-class NDA to multi-class. NDA provides a new definition for the between-class scatter matrix Sb, by first defining the extra-class nearest neighbor and intra-class nearest neighbor of a sample x ∈ ω_i as:

$x^E = \{ x' \notin \omega_i \mid \|x' - x\| \le \|z - x\|, \ \forall z \notin \omega_i \}$ ,   (5)

$x^I = \{ x' \in \omega_i \mid \|x' - x\| \le \|z - x\|, \ \forall z \in \omega_i \}$ .   (6)
Next, the extra-class distance and within-class distance are defined as:

$\Delta^E = x - x^E$ ,   (7)

$\Delta^I = x - x^I$ .   (8)
The between-class scatter matrix is then defined as:

$\tilde{S}_b = \sum_{n=1}^{N} \omega_n (\Delta^E)(\Delta^E)^T$ ,   (9)
where ω_n is a weight to emphasize and de-emphasize different samples. Qiu et al. [6] continued to modify NDA by redefining S_w as follows:

$\tilde{S}_w = \sum_{n=1}^{N} \omega_n (\Delta^I)(\Delta^I)^T$ .   (10)
They also proposed a stepwise dimensionality reduction method to overcome the SSS problem. However, the definition of Sb in SNNDA is the same as in NDA. The definition of Sb in (9) is concerned with the nearest neighbor distance of each sample. The previous definition can be extended to the K-NN case by defining $x^E$ as the mean of the K nearest extra-class neighbors [5]. This is an improvement over classical LDA, but both NDA and SNNDA require a free parameter to be specified by the user, namely the number K of nearest neighbors. Once this has been specified, it remains the same for every sample. As this is overly restrictive, we need an adaptive and comprehensive method for selecting neighboring samples. In the next section we propose an approach that overcomes these limitations.
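The nonparametric scatter matrices of Equations (5)-(10) can be sketched as follows, using the single nearest extra-class and intra-class neighbors and uniform weights ω_n = 1 for simplicity; NDA's weighting function and K-NN extension are omitted, and the function name is ours.

```python
import numpy as np

def nda_scatter(X, y):
    """Nonparametric scatter matrices from nearest extra/intra-class neighbors
    (Eqs. 5-10 with omega_n = 1). Assumes every class has at least two samples."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n, d = X.shape
    Sb, Sw = np.zeros((d, d)), np.zeros((d, d))
    for i in range(n):
        dist = np.linalg.norm(X - X[i], axis=1)
        extra = np.where(y != y[i])[0]                        # candidates outside omega_i
        intra = np.where((y == y[i]) & (np.arange(n) != i))[0]
        xE = X[extra[np.argmin(dist[extra])]]                 # extra-class nearest neighbor, Eq. (5)
        xI = X[intra[np.argmin(dist[intra])]]                 # intra-class nearest neighbor, Eq. (6)
        dE, dI = X[i] - xE, X[i] - xI                         # Eqs. (7) and (8)
        Sb += np.outer(dE, dE)                                # Eq. (9)
        Sw += np.outer(dI, dI)                                # Eq. (10)
    return Sb, Sw
```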
3 Adaptive Nonparametric Discriminant Analysis 3.1 Introduction for ANDA In this section we introduce our ANDA algorithm. This approach has been proposed to overcome the drawbacks mentioned in Section 2. First PCA is applied to reduce the
feature dimension. This step makes S_w non-singular and helps overcome the SSS problem. Thereafter, our ANDA approach is applied to the low-dimensional subspace. As mentioned before, the definition of S_b in classical LDA is designed for ideal instances, where the training data obey a Gaussian distribution. To drop this classical LDA assumption, we redefine the between-class scatter matrix S_b. Our goal is to find an optimal discriminant function that simultaneously makes the samples of the same class as close as possible and the distances between samples belonging to different classes as great as possible. In a face recognition application, we assume there are c classes (individuals) $\omega_l \ (l = 1, \ldots, c)$ in the training set. Then we define the within-class distance $d_i^w$ of the sample $x_i \in \omega_l$ as:

$d_i^w = \| x_n - x_i \| , \quad x_n \in \omega_l$ .   (11)

So the maximal within-class distance of $x_i$ can be represented as $\max d_i^w$.
Similarly, the between-class distance $d_i^b$ is defined as:

$d_i^b = \| x_m - x_i \| , \quad x_m \notin \omega_l$ .   (12)
Now we can define the near neighbors of sample $x_i$ as follows:

$X_i = \{ x_m \mid d_i^b < \partial \cdot \max d_i^w \}$ ,   (13)
where ∂ is an adjustable parameter to increase or decrease the number of near neighbors. As mentioned before, NDA also has a parameter to adjust the number of near neighbors, but it specifies the same number of nearest neighbors for every sample. Because this parameter can greatly influence the performance of NDA, it needs to be tuned for different applications. This is too restrictive. The ideal method should specify a different number of nearest neighbors for each sample depending on the conditions. ANDA does exactly this using the parameter ∂. Next we show that our ANDA algorithm can do this adaptively. We evaluate the performance of ANDA with different values of ∂ and different feature dimensions on the FERET database. As can be seen in Fig. 1, the performance is stable when ∂ is greater than 1. This means that the parameter does not need to be tuned for different situations. As stated before, ∂ is the only parameter in our algorithm. This shows that our ANDA algorithm is stable with respect to this parameter. In this paper, ∂ is set to 1.14.
Fig. 1. Rank-1 recognition rates with different choices of ∂ on FERET datasets. Feature dimension is 100 in (a) and 200 in (b).
Now the redefined between-class scatter matrix is acquired from:

$\hat{S}_b = \frac{1}{n} \frac{1}{s} \sum_{i=1}^{n} \sum_{j=1}^{s} ( x_i - x_i^j )( x_i - x_i^j )^T , \quad x_i^j \in X_i$ ,   (14)

where $x_i^j$ is a near neighbor of $x_i$, and n and s are the total number of samples in the training set and the number of near neighbors of each sample, respectively.
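A minimal sketch of the adaptive neighbor selection of Equation (13) and the between-class scatter of Equation (14). Because s varies per sample in the adaptive setting, this sketch simply averages over all selected pairs; that normalization choice, like the function and parameter names, is ours and only illustrative.

```python
import numpy as np

def anda_between_scatter(X, y, ratio=1.14):
    """Adaptive between-class scatter: near neighbors per Eq. (13), scatter per Eq. (14).
    `ratio` plays the role of the adjustable parameter written as a partial sign above."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n, d = X.shape
    Sb_hat = np.zeros((d, d))
    pairs = 0
    for i in range(n):
        dist = np.linalg.norm(X - X[i], axis=1)
        same = (y == y[i]) & (np.arange(n) != i)
        max_dw = dist[same].max()                                   # max within-class distance, Eq. (11)
        near = np.where((y != y[i]) & (dist < ratio * max_dw))[0]   # adaptive near-neighbor set X_i, Eq. (13)
        for j in near:
            diff = X[i] - X[j]
            Sb_hat += np.outer(diff, diff)                          # summand of Eq. (14)
            pairs += 1
    # Eq. (14) divides by n and s; since s varies per sample here, we simply
    # average over all selected pairs (an illustrative normalization choice).
    return Sb_hat / max(pairs, 1)
```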
3.2 Discussions
We compared the performance of our ANDA with NDA and traditional LDA on artificial datasets, as shown in Fig. 2. Fig. 2 (a) shows that when the data distribution is Gaussian, all three algorithms are effective. However, if this precondition is not satisfied, as shown in Fig. 2 (b), traditional LDA fails, because it calculates the between-class matrix using the mean of each class and the global mean, as given in Eq. (1). In Fig. 2 (b), however, the means of each class and the global mean are all close to zero, so S_b cannot represent the separation of the different classes, whereas $\tilde{S}_b$ and $\hat{S}_b$ can overcome this problem.
From Eqs. (9) and (14), we see that $\tilde{S}_b$ and $\hat{S}_b$ each have one parameter: the number of nearest neighbors in NDA and the ratio ∂ in ANDA. $\tilde{S}_b$ selects the nearest neighbors with a specified number for every sample, whereas $\hat{S}_b$ determines the number of neighbors by using the ratio of intra-class distance to extra-class distance. As we know, if one sample is closer to an extra-class neighbor than to its intra-class neighbors, it will be hard to discriminate correctly using classical LDA. So we must pay more attention to the samples near the border than to those in the center of a cluster. Thus $\hat{S}_b$ is more flexible than $\tilde{S}_b$ in different situations. To demonstrate the validity of this statement, we apply the same value to the two parameters in Fig. 2 (b), (c) and (d).
Fig. 2. First direction of classical LDA (dot-dash line), NDA (dash line) and ANDA (solid line) projections for different artificial datasets
The results show that NDA cannot classify the different datasets correctly, whereas ANDA can. 3.3 Extensions to ANDA
This adaptive method for selecting neighbors makes our ANDA algorithm calculate the between-class matrix from only those samples near the borders of different classes. To balance this with the global class separation used by classical LDA, we redefine the between-class scatter matrix as follows:
$\bar{S}_b = \lambda \hat{S}_b + (1 - \lambda) S_b , \quad \lambda \in (0, 1)$ .   (15)
We use λ = 0.5 in the experiments in this paper. This definition can be seen as a weighted combination of classical LDA and our ANDA, and we call it weighted ANDA.
4 Experiments In this section, we compare the performance of ANDA, weighted ANDA (wANDA), OLDA, NDA and traditional PCA+LDA on the same databases.
Table 1. Rank-1 recognition rates with different feature dimensions on four different FERET datasets

Recognition Accuracy on FERET Database (%)
                    Fa/Fb               Fa/Fc               Dup1                Dup2
Feature dimension   100   150   200     100   150   200     100   150   200     100   150   200
PCA+LDA             98.5  99    99.5    50    66    70      62    68    69      40    50    48
OLDA                97    98    98.5    49    60    63      55    60    63      31    36    41
NDA                 98.5  99    99.5    45    61    65      61    65    67      39    46    47
ANDA                98.5  99    99.5    51    70    76      66    69    70      45    51    51
wANDA               98.5  99    99.5    50    70    75      66    68    70      45    50    52
experiments, we extract a wavelet feature of the face images in the database. Next we apply PCA to reduce the feature dimension, thus making S w nonsingular. OLDA can overcome the SSS problem, so there is no need to apply PCA initially. We then use the different discriminant analysis methods to learn the projection matrix using the training set. Thereafter, we select different numbers of the first d vectors in the projection matrix to build discriminant functions, thus obtaining the relationship between the recognition rate and feature dimension. Finally we evaluate the different methods with gallery and probe sets to ascertain the performance of recognition rates. The nearest neighbour classifier is applied in this experiment. 4.1 Experimental Data
In our experiments, we used the ORL and FERET 1996 face databases to evaluate the different algorithms. The descriptions of databases are now given. The September 1996 FERET evaluation protocol was designed to measure performance on different galleries and probe sets for identification and verification tasks [16]. In our experiments we use the FERET training set to learn discriminant functions for the different algorithms, which are then evaluated with the FERET probe sets and gallery set. The FERET training set consists of 1002 images, while the gallery consists of 1196 distinct individuals with one image per individual. Probe sets are divided into four categories. The Fa/Fb set consists of 1195 frontal images. The Fa/Fc set consists of 194 images that were taken with a different camera and different lighting on the same day as the corresponding gallery image. Duplicate 1 contains 722 images that were taken on different days within one year from the acquisition of the probe image and corresponding gallery image. Duplicate 2 contains 234 images that were taken on different days at least one year apart. The ORL database contains 400 images of 40 distinct individuals, with ten images per individual. The images were taken at different times, with varying lighting conditions, facial expressions (open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses). All individuals are in the upright, frontal position, with tolerance for some side movement. We randomly selected 1 image per individual for the gallery set and the other images became the probe set. Then we trained the
Table 2. Rank-1 recognition rates with different feature dimensions on the ORL dataset

Recognition Accuracy on ORL Database (%)
Feature dimension   100    150    200
PCA+LDA             80     82     82.5
OLDA                72.5   79.5   80.5
NDA                 78.5   81     82.5
ANDA                81.5   84.5   85
wANDA               81     82     85.5
different methods using the FERET training set and evaluated these with the ORL probe set and ORL gallery set. This evaluation procedure was repeated 10 times independently and the average performance is shown in Table 2. 4.2 Experimental Results
Table 1 shows the rank-1 recognition rates with different feature dimensions on the FERET database. ANDA does not give the best performance on the Fa/Fb probe set. The Fa/Fb set is the simplest, and the recognition rates of the distinct approaches, with the exception of OLDA, are very similar; the difference between these rates is less than 0.5%, making it hard to distinguish between these approaches. On the other three probe sets, ANDA performs best. As we can see from Table 1, ANDA clearly outperforms NDA and PCA+LDA. These results clearly show the advantage of the ANDA algorithm, especially the adaptive method for selecting the near neighbors. Table 2 gives the performance results on the ORL database. We apply the FERET training set to learn the discriminant functions and evaluate these on the ORL database. This is a difficult task because the images in these two databases are quite different. The results show that ANDA is still the best algorithm. The performance of the wANDA algorithm is similar to that of ANDA, but it does not always perform better than ANDA. It seems that this extension to ANDA does not always achieve the desired effect. The experimental results, therefore, demonstrate that the ANDA approach is an effective algorithm for face recognition, especially for the more difficult tasks.
5 Conclusion In this paper, we present an adaptive nonparametric discriminant analysis approach for face recognition, which can overcome the drawbacks of classical LDA. In the ANDA approach we use a novel definition for the between-class scatter matrix, instead of the one used in classical LDA, NDA and SNNDA which can cause overlaps between neighboring classes. This new definition can select nearest neighbors for each sample adaptively. It aims to improve the discriminant performance by improving the discriminating power of samples near the classification borders. Experimental results show that the adaptive method for selecting near neighbors is most effective. The ANDA approach outperforms other methods especially for the difficult tasks.
References 1. Fukunaga, K., Mantock, J.M.: Nonparametric Discriminant Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 5, 671–678 (1983) 2. Yu, H., Yang, J.: A direct LDA algorithm for highdimensional data with application to face recognition. Pattern Recognition 34, 2067–2070 (2001) 3. Fukunaga, K.: Introduction to statistical pattern recognition, 2nd edn. Academic Press, Boston (1990) 4. Ye, J., Xiong, T.: Computational and theoretical analysis of null space and orthogonal linear discriminant analysis. Journal of Machine Learning Research 7, 1183–1204 (2006) 5. Bressan, M., Vitria, J.: Nonparametric discriminant analysis and nearest neighbor classification. Pattern Recognition Letters 24, 2743–2749 (2003) 6. Qiu, X., Wu, L.: Face Recognition By Stepwise Nonparametric Margin Maximum Criterion. In: IEEE International Conference on Computer Vision, vol. 2, pp. 1567–1572 (2005) 7. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. Wiley, Chichester (2001) 8. Chen, L.F., Liao, H.Y.M., Ko, M.T., Lin, J.C., Yu, G.J.: A new LDA-based face recognition system which can solve the small sample size problem. Pattern Recognition 33, 1713–1726 (2000) 9. Huang, R., Liu, Q., Lu, H., Ma, S.: Solving the small sample size problem of LDA. In: Proc. International Conference on Pattern Recognition, pp. 29–32 (2002) 10. Ye, J.: Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. Journal of Machine Learning Research 6, 483–502 (2005) 11. Ye, J., Janardan, R., Li, Q., Park, H.: Feature extraction via generalized uncorrelated linear discriminant analysis. In: Proc. International Conference on Machine Learning, pp. 895– 902 (2004) 12. Friedman, J.H.: Regularized discriminant analysis. Journal of the American Statistical Association 84(405), 165–175 (1989) 13. Skurichina, M., Duin, R.P.W.: Stabilizing classifiers for very small sample size. In: Proc. International Conference on Pattern Recognition, pp. 891–896 (1996) 14. Raudys, S., Duin, R.P.W.: On expected classification error of the Fisher linear classifier with pseudo-inverse covariance matrix. Pattern Recognition Letters 19(5-6), 385–392 (1998) 15. Loog, M., Duin, R.P.W.: Linear Dimensionality Reduction via a Heteroscedastic Extension of LDA: The Chernoff Criterion. IEEE Trans. Pattern Analysis and Machine Intelligence 26(6), 732–739 (2004) 16. Phillips, P.J., Moon, H., Rizvi, S.A., Rauss, P.J.: The FERET Evaluation Methodology for Face-Recognition Algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(10), 671–678 (2000) 17. Li, Z., Liu, W., Lin, D., Tang, X.: Nonparametric subspace analysis for face recognition. Computer Vision and Pattern Recognition 2, 961–966 (2005)
Discriminating 3D Faces by Statistics of Depth Differences Yonggang Huang1 , Yunhong Wang2 , and Tieniu Tan1 National Laboratory of Pattern Recognition Institute of Automation, Chinese Academy of Sciences, Beijing, China {yghuang,tnt}@nlpr.ia.ac.cn 2 School of Computer Science and Engineering Beihang University, Beijing, China
[email protected] 1
Abstract. In this paper, we propose an efficient 3D face recognition method based on statistics of range image differences. Each pixel value of a range image represents the normalized depth of the corresponding point on the facial surface, so depth differences between pixels at the same position in two range images straightforwardly describe the differences between the two faces' structures. Here, we propose to use the histogram proportion of depth differences to discriminate intra-personal from inter-personal differences for 3D face recognition. Depth differences are computed over a local neighborhood instead of by direct subtraction, to avoid the impact of imprecise registration. Furthermore, three schemes are proposed to combine a local rigid region (the nose) and the holistic face to overcome expression variation for robust recognition. Promising experimental results are achieved on the 3D dataset of FRGC 2.0, which is the most challenging 3D database so far.
1
Introduction
Face recognition has attracted much attention from researchers in recent decades, owing to its broad applications and non-intrusive nature. A large amount of work has been done on this topic based on 2D color or intensity images. A certain level of success has been achieved by many algorithms under restricted conditions, and many techniques have been applied in practical systems. However, severe problems caused by large variations of illumination, pose, and expression remain unsolved. To deal with these problems, many researchers are now turning their attention from 2D faces to other facial modalities, such as range data, infrared data, etc. The human face is actually a 3D deformable object with texture. Most of its 3D shape information is lost in 2D intensity images. 3D shape information should not be ignored, as it can provide another kind of distinct feature to discriminate different faces from a 3D point of view. With recent advances in 3D acquisition technology, 3D facial data capture is becoming easier and faster, and 3D face recognition is attracting more and more attention [1]. In earlier days, many researchers worked on curvatures from range data [2]. Chua et al. [3] divided the
face into rigid and non-rigid regions, and used point signatures to represent the rigid parts. Since recognition was conducted only on those rigid parts, their method achieved a certain level of robustness to expression variation. Medioni et al. [4] used the Iterative Closest Point (ICP) algorithm to align and match face surfaces. ICP-based methods were also adopted by many other researchers, such as Lu [5]. Xu et al. [6] used Gaussian-Hermite moments to describe shape variations around important areas of the face, then combined them with the depth feature for recognition. Bronstein et al. [7] achieved expression-invariant 3D face recognition by using a new representation of the facial surface, which is invariant to isometric deformations. Subspace methods have also been used for 3D face recognition based on range images [8][14]. Before recognition, the original 3D point set should be preprocessed first; subsequent feature extraction and classification steps are then conducted on the preprocessed results. From previous work, we can see that one large category of subsequent processing is based on parametric models, such as [4][5]. These methods first build a mesh model to fit the facial surface on the original point set, and then extract features from the parameters of the model [6] or conduct further projection based on this model [7]. Another category of previous work is range image based [8][9]. A range image is an image with depth information in every pixel. The depth represents the distance from the sensor to the point on the facial surface, and is usually normalized to [0, 255] as a pixel value. Structural information of the facial surface is straightforwardly reflected in a facial range image. Range images bear several advantages: they are simple, and immune to illumination variation and common facial make-up. Moreover, they are much easier to handle than mesh models. Since range images are represented as 2D images, many 3D face recognition methods based on range images can be borrowed from 2D face recognition. However, range images and intensity images are intrinsically different in the physical meaning of the pixel value, so it is necessary to develop recognition algorithms suited to range images. This paper focuses on 3D face recognition using range images. In this paper, we introduce the idea of discriminating intra-personal from inter-personal depth differences for 3D face recognition. The concept of intra- and inter-personal differences has been widely used in 2D face recognition. In [12], it is successfully integrated with a Bayesian method for 2D face recognition. However, for range images, intra- and inter-personal differences hold more meaning than in 2D face recognition. In 2D face recognition, the differences are computed from intensity subtraction; they may be caused by illumination variation rather than inter-personal difference. For 3D faces, however, the depth difference between two range images straightforwardly represents the structural difference of the two faces, and it is immune to illumination. We propose to use histogram statistics of the depth differences between two facial range images as the matching criterion. To address the difficulty of obtaining depth differences at exactly corresponding positions, we propose to compute the depth difference for each pixel within a local neighborhood instead of by direct range image subtraction. Furthermore, expression variation may cause the intra-class variation to become larger than the inter-class variation, which would
cause false matching. To achieve robust recognition, we propose three schemes to combine a local rigid region (the nose) and the holistic face. All of our experiments in this paper are conducted on the 3D face database of FRGC 2.0, which contains 4007 range scans of 466 different persons. The rest of this paper is organized as follows. Section 2 describes the details of our proposed method. Section 3 presents the experimental results and discussion. Finally, we summarize this paper in Section 4.
2 Methods
2.1 Histogram Proportion of Depth Differences
Here, intra-personal difference denotes the difference between face images of the same individual, while inter-personal difference is the difference between face images of different individuals. For 2D intensity faces, intra-personal variation can be caused by expression, make-up and illumination variations. The variation is often so drastic that the intra-personal difference is larger than the inter-personal difference. However, for facial range images, only expression can affect the intra-personal difference. If the intra-personal variation is milder than the variation between different identities, the depth difference can provide a feasible way to discriminate different individuals. A simple experiment is shown in Fig. 1 and Fig. 2.
Fig. 1. Range images A1 and A2 of person A, and B of person B
Fig. 1 shows three range images of persons A and B. We can see that A1 varies much in expression compared to A2. Fig. 2 shows three histograms of absolute depth differences between the three range images: |A1 − A2|, |A1 − B|, |A2 − B|. The first one is an intra-personal difference (|A1 − A2|), while the latter two are inter-personal differences. For convenience of visual comparison, all three histograms are shown in a uniform coordinate system of the same scale, and the values of the 0 bin are not shown because they are too large. Here we list the three values of the 0 bin: 4265, 1043, 375. The total number of points in the ROI (Region of Interest) is 6154. From the shapes of the histograms in Fig. 2, we can see that most of the distribution of the histogram of |A1 − A2| is close to 0, while the opposite happens in the latter two. If we set a threshold α = 20, the proportions of points whose depth differences are smaller than α are 97.4%, 36.2% and 26.9% respectively (the 0 bin
Fig. 2. An example of histogram statistics of inter- and intra-personal differences (|A1 − A2|, |A1 − B|, |A2 − B|) with threshold α = 20 shown as the red line
is included). We can see that the first proportion is much bigger than the latter two. This promising result tells us that using histogram proportions can be a feasible way to discriminate intra- and inter-personal variation. We will further demonstrate this on a large 3D database composed of 4007 images in Section 3. The main idea of our method can be summarized as follows: compute the depth differences (absolute values) between different range images, then use the holistic histogram proportion (HP) of the depth differences (DD) as the similarity measure:

$DD(i, j) = | A(i, j) - B(i, j) |$ ,   (1)

$HP = \dfrac{\text{Number of points with } DD \le \alpha}{\text{Total number of points in the ROI}}$ ,   (2)

where A and B denote two facial range images and ROI denotes the region of interest. However, to make this idea robust for recognition, two problems, namely correspondence and expression, remain to be considered. We make two modifications to optimize this basic idea in the following two subsections.
2.2
Correspondence
For 3D faces, the comparison between two pixel values belonging to two range images makes sense only when the two pixels correspond to the same position of the two faces in a uniform coordinate system. But no two range images correspond perfectly at every position, because errors inevitably exist in the registration process, whether registration is conducted manually or automatically. In the case of coarse registration, the situation becomes even worse. For example, the point corresponding to A(i, j) in image B may be a neighbor of B(i, j) in the image coordinates. Here we do not compute the depth differences between two range images by direct subtraction as (A − B), but use a local window as shown in Fig. 3. The depth difference (DD) between images A and B at point (i, j) is computed using the following equation:

$DD(i, j) = \min_{u, v} | A(i, j) - B(u, v) | , \quad u \in [i-1, i+1], \ v \in [j-1, j+1]$ .   (3)
694
Y. Huang, Y. Wang, and T. Tan
Fig. 3. local window for computing depth difference
In this way, we can solve the correspondence problem to some extent. However, it introduces another problem: the DD obtained in this way is the minimum depth difference in the local window, and probably not the precise corresponding DD. However, this happens equally for every point in each comparison between images, so we believe this modification will not weaken its ability as a similarity measure much.
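The local-window depth difference of Equation (3) and the histogram proportion of Equation (2) can be sketched in NumPy as follows; this is an illustration rather than the authors' code, border handling uses edge padding, and the threshold default is only illustrative.

```python
import numpy as np

def depth_difference(A, B):
    """Local-window depth difference (Eq. 3): for each pixel of A, the minimum
    absolute depth difference over the 3x3 neighborhood of the same pixel in B."""
    Bp = np.pad(B.astype(float), 1, mode="edge")
    h, w = B.shape
    shifts = [Bp[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
              for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    return np.min([np.abs(A.astype(float) - s) for s in shifts], axis=0)

def histogram_proportion(A, B, roi, alpha=16):
    """Similarity measure HP (Eq. 2): fraction of ROI pixels whose depth
    difference does not exceed the threshold alpha. `roi` is a boolean mask."""
    dd = depth_difference(A, B)
    return np.count_nonzero((dd <= alpha) & roi) / np.count_nonzero(roi)
```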
2.3 Expression
Expression is a difficult problem for both 2D and 3D face recognition. In particular, in our method, expression variation can cause the intra-personal variation to exceed the inter-personal variation, which will lead to false matching. One common way to deal with this problem in 3D face recognition is to utilize the rigid regions of the face, such as the nose region [10], which is robust to expression changes. The nose is a very distinct region of the face, and its shape is affected very little by facial muscle movement. However, using only the nose region for matching may not be sufficient, since it is so small compared to the whole face that it cannot provide enough information for recognition. Here we propose three schemes to combine the nose and the holistic face for matching, so that we can achieve robust recognition while also utilizing the discriminating ability of other facial regions. See Fig. 4. In Scheme 1, we first conduct the similarity measurement on the nose region using the histogram proportion of depth differences; then a weight W is assigned to the first M (e.g., 1000) most similar images. The next step is holistic matching, in which the weight W is multiplied into the matching scores of the corresponding images selected by M to strengthen their similarity, giving the final matching score. Scheme 2 has the same structure as Scheme 1, but the nose matching step and the holistic matching step are exchanged. Both nose matching and holistic matching use the histogram proportion of depth differences as the similarity measure. Scheme 1 and Scheme 2 can be considered hierarchical combination schemes. Scheme 3 uses a weighted Sum rule to fuse the matching scores of the nose and the holistic face from two channels. The three schemes will be compared in the experiments in the next section.
Fig. 4. Three schemes for robust recognition
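A compact sketch of Scheme 1 as described above: nose matching is used as a prescreener, and the weight W boosts the holistic matching scores of the M gallery images most similar to the probe's nose region. The similarity function is assumed to be a histogram-proportion routine such as the earlier sketch, and the default W and M follow the trained values reported later in Section 3.2; function names are ours.

```python
import numpy as np

def scheme1_scores(probe, gallery, nose_mask, face_mask, hp, W=2.5, M=400):
    """Scheme 1: nose matching as a prescreener, then weighted holistic matching.

    `hp(A, B, roi)` is a histogram-proportion similarity (e.g. the earlier sketch),
    `gallery` is a sequence of registered range images. Returns one score per image.
    """
    nose_scores = np.array([hp(probe, g, nose_mask) for g in gallery])
    boosted = np.argsort(nose_scores)[::-1][:M]     # M gallery images most similar by nose
    weights = np.ones(len(gallery))
    weights[boosted] = W                            # strengthen their holistic similarity
    holistic = np.array([hp(probe, g, face_mask) for g in gallery])
    return weights * holistic                       # final matching scores
```

Scheme 2 simply swaps the roles of the nose and holistic passes, and Scheme 3 replaces the prescreening by a weighted sum of the two score vectors.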
3
Experimental Results and Discussions
We design three experiments on the full set of 3D data in the FRGC 2.0 database to verify our proposed method. First, the performances of the direct depth difference, the local window depth difference, and the final combination are compared; second, the three combination schemes are compared; finally, our proposed method is compared with some other 3D face recognition methods. The experimental results are reported as follows. 3.1
Database Description and Preprocessing
FRGC 2.0 [13] is a benchmark database released in recent years. To the best of our knowledge, the 3D face dataset of FRGC 2.0 is the largest 3D face database to date. It consists of six experiments. Our experiments belong to Experiment 3, which measures the performance of 3D face recognition. This dataset contains 4007 shape data of 466 different persons. All the 3D data were acquired by a Minolta Vivid 900/910 series sensor over three semesters in two years. The 3D shape data are given as raw point clouds, which contain spikes and holes, and manually labeled coordinates of the eyes, nose, and chin are provided. Registration, filling holes and removing spikes, and normalization were carried out on the raw data in succession as preprocessing. After that, the data points in the three-dimensional coordinate system were projected and normalized as range images (100 × 80 in size); the face regions were cropped and the noses were set at the same position; a mask was also used to remove marginal noise. Fig. 5 shows some samples. From Fig. 5, we can see that FRGC 2.0 is a very challenging database, and some "bad" images like the ones in the second row still exist in the database after
Fig. 5. Samples from FRGC 2.0 (after preprocessing). The first row: images from the same person. The second row: some samples of "bad" images suffering from data loss, large holes, and occlusions caused by hair.
the preprocessing step, due to the original data and to our preprocessing method, which is not powerful enough to handle those problems. However, our main focus in this work is not data preprocessing but the recognition method. Although those bad images will greatly challenge the final recognition performance, we still carried out all the experiments on the full dataset. Three ROC (Receiver Operating Characteristic) curves, ROC I, ROC II and ROC III, are required for performance analysis in Experiment 3 of FRGC 2.0. They respectively measure algorithm performance in three cases: target and query images acquired within the same semester, within the same year but in different semesters, and in different years. The difficulty of the three cases increases by degrees. 3.2
Parameters: α, M and W
In Equation 2, the threshold α is the key to the similarity measure. Here, we obtain the optimal α from the training set: αh = 16 and αl = 19 for holistic HP matching and local HP matching, respectively. Furthermore, we also obtain the optimal parameters for the combination schemes by training: W1 = 2.5 and M1 = 400 for Scheme 1, W2 = 1.1 and M2 = 400 for Scheme 2, and W3 = 0.25 for Scheme 3. All the following experiments are carried out with these optimal parameters. 3.3
The Improvements of Two Modifications
In Section 2, we proposed the local window DD to replace the direct DD to solve the correspondence problem, and then proposed three schemes to address the expression problem. To verify the improvement brought by the two modifications, here we compare the performance of the direct DD, the local window DD, and Scheme 1 in Fig. 6. We can see that the local window DD performs slightly better than the direct DD, and that Scheme 1 performs best among the three, achieving a significant improvement.
Fig. 6. Performance of direct DD, local window DD, and Scheme 1
The results demonstrate that the two modifications we propose in Section 2 do work and achieve a clear improvement in recognition performance. 3.4
The Comparison of Three Combination Schemes
The three schemes proposed in Section 2 aim to solve the problem caused by expression variation. They are different combination methods for holistic face matching and rigid nose matching. Their EER performances on the three ROC curves are shown in Table 1.

Table 1. EERs of the ROC curves of the three combination schemes

          Scheme 1   Scheme 2   Scheme 3
ROC I     9.8%       10.7%      11.1%
ROC II    10.9%      11.7%      12.1%
ROC III   12.4%      12.8%      13.1%
From Table 1, Scheme 1 and Scheme 2 perform better than Scheme 3 (weighted Sum), which is a widely used fusion scheme. So we can conclude that a hierarchical scheme is superior to the weighted Sum rule for combining the holistic face and a local rigid region for recognition. What is more, Scheme 1 achieves the lowest EERs on all three ROC curves, which is the best result among the three proposed schemes. It is a rather good result for such a challenging database and our coarse preprocessing method. This demonstrates that using the rigid nose region as a prescreener is a feasible idea in a 3D face recognition system to alleviate the impact of expression.
3.5
The Comparison with Other Methods
To demonstrate the effectiveness of our proposed method, here we compare Scheme 1 with some other 3D face recognition methods on the same database. Their performance comparison is shown in Table 2.

Table 2. Comparison with other methods (EER)

          Scheme 1   3DLBP   LBP     LDA      PCA
ROC I     9.8%       11.2%   15.4%   16.81%   19.8%
ROC II    10.9%      11.7%   15.8%   16.82%   21.0%
ROC III   12.4%      12.4%   16.4%   16.82%   22.2%
PCA and LDA are common methods that have been used in 3D face recognition based on range images [8][14], and the performance of PCA is the baseline for comparison in FRGC 2.0. 3DLBP is a newly proposed and effective 3D face recognition algorithm [11]. The result of using LBP (Local Binary Patterns) for 3D face recognition is also reported in [11]. From Table 2, we can see that our proposed method, Scheme 1, performs best among these 3D face recognition methods. What is more, compared to 3DLBP, our proposed method is much simpler in theory and needs much less computational cost while achieving better performance. The results in Table 2 demonstrate that our method does work even though it is rather simple and straightforward.
4
Conclusion
In this paper, we have proposed a 3D face recognition method based on the analysis of intra- and inter-personal differences of range images. Our contributions include: 1) We proposed to use the histogram proportion to evaluate the similarity of facial range images, based on their depth differences. 2) We proposed the local window depth difference to replace direct range image subtraction to avoid the imprecise correspondence problem. 3) To overcome the intra-personal expression variation which can cause false matching, we proposed to combine the rigid nose region and the holistic face for matching. Three combination schemes were proposed and compared and, through performance analysis, we obtained an effective scheme as our ultimate method, which uses the nose region as a prescreener in a hierarchical combination scheme. Extensive experiments have been conducted on the full set of 3D face data in FRGC 2.0. Encouraging results and comparisons with other 3D face recognition methods have demonstrated the effectiveness of our proposed method. Even better performance can be achieved after refining our preprocessing method in the future.
Acknowledgement. This work is supported by the Program of New Century Excellent Talents in University, the National Natural Science Foundation of China (Grant Nos. 60575003, 60332010, 60335010, 60121302, 60275003, 69825105, 60605008), the Joint Project supported by the National Science Foundation of China and the Royal Society of the UK (Grant No. 60710059), the Hi-Tech Research and Development Program of China (Grant No. 2006AA01Z133), the National Basic Research Program (Grant No. 2004CB318110), and the Chinese Academy of Sciences.
References 1. Bowyer, K., Chang, K., Flynn, P.: A survey of approaches and challenges in 3D and multi-modal 3D+2D face recognition. Computer Vision and Image Understanding 101(1), 1–15 (2006) 2. Gordon, G.: Face Recognition Based on Depth Maps and Surface Curvature. Proc. of SPIE, Geometric Methods in Computer Vision 1570, 108–110 (1991) 3. Chua, C.S., Han, F., Ho, Y.K.: 3d human face recognition using point signature. In: Proc. International Conf. on Automatic Face and Gesture Recognition, pp. 233–238 (2000) 4. Medioni, G., Waupotitsch, R.: Face recognition and modeling in 3D. In: IEEE International Workshop on Analysis and Modeling of Faces and Gestures, pp. 232– 233 (2003) 5. Lu, X., Colbry, D., Jain, A.K.: Three-Dimensional Model Based Face Recognition. In: 17th ICPR, vol. 1, pp. 362–366 (2004) 6. Xu, C., Wang, Y., Tan, T., Quan, L.: Automatic 3d face recognition combining global geometric features with local shape variation information. In: Proc. of 6th ICAFGR, pp. 308–313 (2004) 7. Bronstein, A., Bronstein, M., Kimmel, R.: Expressioninvariant 3D face recognition. In: Kittler, J., Nixon, M.S. (eds.) AVBPA 2003. LNCS, vol. 2688, pp. 62–69. Springer, Heidelberg (2003) 8. Achermann, B., Jiang, X., Bunke, H.: Face recognition using range images. In: Proc. of ICVSM, pp. 129–136 (1997) 9. Chang, K.I., Bowyer, K.W., Flynn, P.J.: An Evaluation of Multimodal 2D+3D Face Biometrics. IEEE Transaction on PAMI 27(4), 619–624 (2005) 10. Ojala, T., Pietikainen, M., Maenpaa, T.: Multi resolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transaction on PAMI 24(7), 971–987 (2002) 11. Huang, Y., Wang, Y., Tan, T.: Combining Statistics of Geometrical and Correlative Features for 3D Face Recognition. In: Proc. of 17th British Machine Vision Conference, vol. 3, pp. 879–888 (2006) 12. Jebara, T., Pentland, A.: Bayesian Face Recognition. Pattern Recognition 33(11), 1771–1782 (2000) 13. Phillips, P.J., Flynn, P.J., Scruggs, T., Bowyer, K.W., Chang, J., Hoffman, K., Marques, J., Min, J., Worek, W.: Overview of the face recognition grand challenge. In: Proc. of CVPR 2005, vol. 1, pp. 947–954 (2005) 14. Heseltine, T., Pears, N., Austin, J.: Three-Dimensional Face Recognition Using Surface Space Combinations. In: Proc. of BMVC 2004 (2004) 15. Chang, K., Bowyer, K.W., Flynn, P.J.: ARMS: Adaptive rigid multi-region selection for handling expression variation in 3D face recognition. In: IEEE Workshop on FRGC Expreriments, IEEE Computer Society Press, Los Alamitos (2005)
Kernel Discriminant Analysis Based on Canonical Differences for Face Recognition in Image Sets

Wen-Sheng Vincent Chu, Ju-Chin Chen, and Jenn-Jier James Lien

Robotics Laboratory, Dept. of Computer Science and Information Engineering, National Cheng Kung University, No. 1, Ta-Hsueh Road, Tainan, Taiwan
{l2ior,joan,jjlien}@csie.ncku.edu.tw
http://robotics.csie.ncku.edu.tw
Abstract. A novel kernel discriminant transformation (KDT) algorithm based on the concept of canonical differences is presented for automatic face recognition applications. For each individual, the face recognition system compiles a multi-view facial image set comprising images with different facial expressions, poses and illumination conditions. Since the multi-view facial images are non-linearly distributed, each image set is mapped into a highdimensional feature space using a nonlinear mapping function. The corresponding linear subspace, i.e. the kernel subspace, is then constructed via a process of kernel principal component analysis (KPCA). The similarity of two kernel subspaces is assessed by evaluating the canonical difference between them based on the angle between their respective canonical vectors. Utilizing the kernel Fisher discriminant (KFD), a KDT algorithm is derived to establish the correlation between kernel subspaces based on the ratio of the canonical differences of the between-classes to those of the within-classes. The experimental results demonstrate that the proposed classification system outperforms existing subspace comparison schemes and has a promising potential for use in automatic face recognition applications. Keywords: Face recognition, canonical angles, kernel method, kernel Fisher discriminant (KFD), kernel discriminant transformation (KDT), kernel PCA.
1 Introduction

Although the capabilities of computer vision and automatic pattern recognition systems have improved immeasurably in recent decades, face recognition remains a challenging problem. Conventional face recognition schemes are invariably trained by comparing single-to-single or single-to-many images (or vectors). However, a single training or testing image provides insufficient information to optimize the face recognition performance because the faces viewed in daily life exhibit significant variations in terms of their size and shape, facial expressions, pose, illumination conditions, and so forth. Accordingly, various face recognition methods based on facial image appearance [3], [8], [14] have been proposed. However, these schemes were implemented on facial images captured under carefully controlled environments. To obtain a more practical and stable recognition performance, Yamaguchi et al. [17]
proposed a mutual subspace method (MSM) and showed that the use of image sets consisting of multi-view images significantly improved the performance of automatic face recognition systems. Reviewing the literatures, it is found that the methods proposed for comparing two image sets can be broadly categorized as either sample-based or model-based. In sample-based methods, such as that presented in [10], a comparison is made between each pair of samples in the two sets, and thus the computational procedure is time consuming and expensive. Moreover, model-based methods, such as those described in [1] and [12] require the existence of a strong statistical correlation between the training and testing image sets to ensure a satisfactory classification performance [7]. Recently, the use of canonical correlations as a means of comparing image sets has attracted considerable attention. In the schemes presented in [5], [6], [7], [9], [16] and [17], each image set was represented in terms of a number of subspaces generated using such methods as PCA or KPCA, for example, and image matching was performed by evaluating the canonical angles between two subspaces. However, in the mutual subspace method (MSM) presented in [17], the linear subspaces corresponding to the two different image sets are compared without considering any inter-correlations. To improve the classification performance of MSM, the constrained mutual subspace method (CMSM) proposed in [6] constructed a constraint subspace, generated on the basis of the differences between the subspaces, and then compared the subspaces following their projection onto this constraint subspace. The results demonstrated that the differences between two subspaces provided effective components for carrying out their comparison. However, Kim et al. [7] reported that the classification performance of CMSM is largely dependent on an appropriate choice of the constraint subspace dimensionality. Accordingly, the authors presented an alternative scheme based upon a discriminative learning algorithm in which it was unnecessary to specify the dimensionality explicitly. It was shown that the canonical vectors generated in the proposed approach rendered the subspaces more robust to variations in the facial image pose and illumination conditions than the eigenvectors generated using a conventional PCA approach. For non-linearly distributed or highly-overlapped data such as those associated with multi-view facial images, the classification performances of the schemes presented in [6] and [17] are somewhat limited due to their assumption of an essentially linear data structure. To resolve this problem, Schölkopf et al. [11] applied kernel PCA (KPCA) to construct linear subspaces, i.e. kernel subspaces, in a highdimensional feature space. Yang [18] showed that kernel subspaces provides an efficient representation of non-linear distributed data for object recognition purposes. Accordingly, Fukui et al. [5] developed a kernel version of CMSM, designated as KCMSM, designed to carry out 3D object recognition by matching kernel subspaces. However, although the authors reported that KCMSM provided an efficient means of classifying non-linearly distributed data, the problem of the reliance of the classification performance upon the choice of an appropriate constraint subspace dimensionality was not resolved. 
In an attempt to address the problems outlined above, this study proposes a novel scheme for comparing kernel subspaces using a kernel discriminant transformation (KDT) algorithm. The feasibility of the proposed approach is explored in the context of an automatic face recognition system. To increase the volume of information
available for the recognition process, a multi-view facial image set is created for each individual showing the face with a range of facial expressions, poses and illumination conditions. To make the non-linearly distributed facial images more easily separable, each image is mapped into a high-dimensional feature space using a non-linear mapping function. The KPCA process is then applied to each mapped image set to generate the corresponding kernel subspace. To render the kernel subspaces more robust to variations in facial images, canonical vectors [7] are derived for each pair of kernel subspaces. The subspace spanned by these canonical vectors is defined as the canonical subspace. The difference between the vectors of different canonical subspaces (defined as the canonical difference) is used as a similarity measure to evaluate the relative closeness (i.e. similarity) of different kernel subspaces. Finally, exploiting the proven classification ability of the kernel Fisher discriminant (KFD) [2], [11], a kernel discriminant transformation (KDT) algorithm is developed for establishing the correlation between kernel subspaces. In the training process, the KDT algorithm learns a kernel transformation matrix by maximizing the ratio of the canonical differences of the between-classes to those of the within-classes. In the testing process, the kernel transformation matrix is then applied to establish the inter-correlation between kernel subspaces.
Fig. 1. (a) and (b) show the first five eigenvectors (or principal components) and corresponding canonical vectors, respectively, of a facial image set. Note that the first and second rows correspond to the same individual. Comparing the images in (a) and (b), it is observed that each pair of canonical vectors, i.e. each column in (b), contains more common factors such as poses and illumination conditions than each pair of eigenvectors, i.e. each column in (a). The images in (c) and (d) show the differences between the eigenvectors and the canonical vectors, respectively, of each image pair. It is apparent that the differences in the canonical vectors, i.e. the canonical differences, are less influenced by illumination effects than the differences in the eigenvectors. In (e), u and v are canonical vectors and the canonical difference d is proportional to the angle Θ between them.
2 Canonical Difference Creation

This section commences by reviewing the concept of canonical vectors, which, as described above, span a subspace known as the canonical subspace. Subsequently,
the use of the canonical difference in representing the similarity between pairs of subspaces is discussed.

2.1 Canonical Subspace Creation

Describing the behavior of an image set in the input space as a linear subspace, we denote $P_1$ and $P_2$ as two $n \times d$ orthonormal basis matrices representing two of these linear subspaces. Applying singular value decomposition (SVD) to the product of the two subspaces yields

$$P_1^T P_2 = \Phi_{12} \Lambda \Phi_{21}^T \quad \text{s.t.} \quad \Lambda = \mathrm{diag}(\sigma_1, \ldots, \sigma_n), \qquad (1)$$

where $\Phi_{12}$ and $\Phi_{21}$ are eigenvectors and $\{\sigma_1, \ldots, \sigma_n\}$ are the cosine values of the canonical angles [4] and represent the level of similarity of the two subspaces. The SVD form of the two subspaces can be expressed as

$$\Lambda = \Phi_{12}^T P_1^T P_2 \Phi_{21}. \qquad (2)$$
Thus, the similarity between any two subspaces can be evaluated simply by computing the trace of $\Lambda$. Let the canonical subspaces and the canonical vectors be defined by the matrices $C_1 = P_1\Phi_{12} = [u_1, \ldots, u_d]$ and $C_2 = P_2\Phi_{21} = [v_1, \ldots, v_d]$ and the vectors $\{u_i\}_{i=1}^{d}$ and $\{v_i\}_{i=1}^{d}$, respectively [7]. Here, $\Phi_{12}$ and $\Phi_{21}$ denote two projection matrices which regularize $P_1$ and $P_2$, respectively, to establish the correlation between them. Comparing Figs. 1(a) and 1(b), it can be seen that the eigenvectors of the orthonormal basis matrices $P_1$ and $P_2$ (shown in Fig. 1(a)) are significantly affected by pose and illumination in the individual facial images. By contrast, the projection matrices ensure that the canonical vectors are capable of more faithfully reproducing variations in the pose and illumination of the different facial images.

2.2 Difference Between Canonical Subspaces
As shown in Fig. 1.(e), the difference between the two canonical vectors of different canonical subspaces is proportional to the angle between them. Accordingly, the canonical difference (or canonical distance) can be defined as follows:
$$\mathrm{CanonicalDiff}(i, j) = \sum_{r=1}^{d} \|u_r - v_r\|^2 = \mathrm{trace}\big((C_i - C_j)^T (C_i - C_j)\big). \qquad (3)$$
Clearly, the closer the two subspaces are to one another, the smaller the value given by CanonicalDiff (i, j ) in its summation of the diagonal terms. As shown in Figs. 1.(c) and (d), the canonical differences between two facial images contain more discriminative information than the eigenvector differences and are less affected by variances in the facial images.
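The following short NumPy sketch (not from the paper) makes Eqs. (1)-(3) concrete: the canonical vectors are obtained from the SVD of $P_1^T P_2$ and the canonical difference is the trace of the squared difference of the two canonical subspaces.

import numpy as np

def canonical_difference(P1, P2):
    # Eq. (1): SVD of the product of the two orthonormal bases.
    Phi12, sigma, Phi21T = np.linalg.svd(P1.T @ P2)   # sigma = cosines of the canonical angles
    Phi21 = Phi21T.T
    # Canonical subspaces C1 = P1 * Phi12 and C2 = P2 * Phi21.
    C1, C2 = P1 @ Phi12, P2 @ Phi21
    D = C1 - C2
    # Eq. (3): sum of squared canonical-vector differences.
    return np.trace(D.T @ D)

# Toy usage with two random d-dimensional subspaces of R^n.
n, d = 50, 5
P1, _ = np.linalg.qr(np.random.randn(n, d))
P2, _ = np.linalg.qr(np.random.randn(n, d))
print(canonical_difference(P1, P2))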
3 Kernel Discriminant Transformation (KDT) Using Canonical Differences

This section commences by discussing the use of KPCA to generate kernel subspaces and then applies the canonical difference concept proposed in Section 2.2 to develop a kernel discriminant transformation (KDT) algorithm designed to determine the correlation between two subspaces. Finally, a kernel Fisher discriminant (KFD) scheme is used to provide a solution to the proposed KDT algorithm. Note that the KDT algorithm is generated during the training process and then applied during the testing process.

3.1 Kernel Subspace Generation
To generate kernel subspaces, each image set in the input space is mapped into a high-dimensional feature space F using the following nonlinear mapping function:
$$\phi : \{X_1, \ldots, X_m\} \to \{\phi(X_1), \ldots, \phi(X_m)\}. \qquad (4)$$

In practice, the dimensionality of $F$, which is defined as $h$, can be huge, or even infinite. Thus, performing calculations in $F$ is highly complex and computationally expensive. This problem can be resolved by applying a “kernel trick”, in which the dot products $\phi(x) \cdot \phi(y)$ are replaced by a kernel function $k(x, y)$ which allows the dot products to be computed without actually mapping the image sets. Generally, $k(x, y)$ is specified as the Gaussian kernel function, i.e.

$$k(x, y) = \exp\!\left(-\frac{\|x - y\|^2}{\sigma^2}\right). \qquad (5)$$
Let $m$ image sets be denoted as $\{X_1, \ldots, X_m\}$, where the $i$-th set, i.e. $X_i = [x_1, \ldots, x_{n_i}]$, contains $n_i$ images in its columns. Note that the images of each facial image set belong to the same class. The “kernel trick” can then be applied to compute the kernel matrix $K_{ij}$ of image sets $i$ and $j$ ($i, j = 1, \ldots, m$). Matrix $K_{ij}$ is an $n_i \times n_j$ matrix, in which each element has the form

$$(K_{ij})_{sr} = \phi^T(x_s^i)\,\phi(x_r^j) = k(x_s^i, x_r^j), \quad s = 1, \ldots, n_i,\ r = 1, \ldots, n_j. \qquad (6)$$
The particular case of $K_{ii}$, i.e. $j = i$, is referred to as the $i$-th kernel matrix. To generate the kernel subspaces of each facial image set in $F$, KPCA is performed on each of the mapped image sets. In accordance with the theory of reproducing kernels, the $p$-th eigenvector of the $i$-th kernel subspace, $e_p^i$, can be expressed as a linear combination of all the mapped images, i.e.

$$e_p^i = \sum_{s=1}^{n_i} a_{sp}^i\,\phi(x_s^i), \qquad (7)$$
where the coefficient $a_{sp}^i$ is the $s$-th component of the eigenvector corresponding to the $p$-th largest eigenvalue of the $i$-th kernel matrix $K_{ii}$. Denoting the dimensionality of the kernel subspace $P_i$ of the $i$-th image set as $d$, $P_i$ can then be represented as the span of the eigenvectors $\{e_p^i\}_{p=1}^{d}$.
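A minimal NumPy sketch of this subsection: the Gaussian kernel matrix of Eq. (6) and the KPCA coefficients $a^i$ of Eq. (7) as the leading eigenvectors of $K_{ii}$. Centering of the kernel matrix and eigenvector scaling are omitted for brevity, and the kernel variance of 0.05 follows the value reported in Section 5.

import numpy as np

def gaussian_kernel(Xi, Xj, sigma2=0.05):
    # Eq. (6): columns of Xi (D x n_i) and Xj (D x n_j) are vectorised images.
    d2 = ((Xi[:, :, None] - Xj[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-d2 / sigma2)                       # n_i x n_j kernel matrix

def kernel_subspace_coeffs(Xi, d, sigma2=0.05):
    # Eq. (7): the coefficients a^i are the top-d eigenvectors of K_ii.
    Kii = gaussian_kernel(Xi, Xi, sigma2)
    eigvals, eigvecs = np.linalg.eigh(Kii)            # ascending eigenvalues
    return eigvecs[:, ::-1][:, :d]                    # n_i x d coefficient matrix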
3.2 Kernel Discriminant Transformation Formulation

The kernel subspace generation process described in Section 3.1 yields $P_i \in \mathbb{R}^{h \times d}$ as a $d$-dimensional kernel subspace corresponding to the $i$-th image set s.t. $\phi(X_i)\phi^T(X_i) \cong P_i \Lambda P_i^T$. In this section, an $h \times w$ kernel discriminant transformation (KDT) matrix $T$ is defined to transform the mapped image sets in order to obtain an improved identification ability, where $w$ is defined as the dimensionality of the KDT matrix $T$. Multiplying both sides of the $d$-dimensional kernel subspace of the $i$-th image set by $T$ gives
$$\big(T^T\phi(X_i)\big)\big(T^T\phi(X_i)\big)^T \cong \big(T^T P_i\big)\Lambda\big(T^T P_i\big)^T. \qquad (8)$$
It can be seen that the kernel subspace of the transformed mapped image set is equivalent to that obtained by applying $T$ to the original kernel subspace. Since it is necessary to normalize the subspaces to guarantee that the canonical differences are computable, we have to first ensure that $T^T P_i$ is orthonormal. A process of QR-decomposition is then applied to each transformed kernel subspace $T^T P_i$ such that $T^T P_i = Q_i R_i$, where $Q_i$ is a $w \times d$ orthonormal matrix and $R_i$ is a $d \times d$ invertible upper triangular matrix. Since $R_i$ is invertible, the normalized kernel subspace following transformation by matrix $T$ can be written as
$$Q_i = T^T P_i R_i^{-1}. \qquad (9)$$
To obtain the canonical difference between two subspaces, it is first necessary to calculate the $d \times d$ projection matrices $\Phi_{ij}$ and $\Phi_{ji}$, i.e.

$$Q_i^T Q_j = \Phi_{ij} \Lambda \Phi_{ji}^T. \qquad (10)$$
The canonical difference between two transformed kernel subspaces i and j can then be computed from
$$\mathrm{CanonicalDiff}(i, j) = \mathrm{trace}\big((Q_i\Phi_{ij} - Q_j\Phi_{ji})^T(Q_i\Phi_{ij} - Q_j\Phi_{ji})\big) = \mathrm{trace}\big(T^T(P_i\Phi'_{ij} - P_j\Phi'_{ji})(P_i\Phi'_{ij} - P_j\Phi'_{ji})^T T\big), \qquad (11)$$
where $\Phi'_{ij} = R_i^{-1}\Phi_{ij}$ and $\Phi'_{ji} = R_j^{-1}\Phi_{ji}$. The transformation matrix $T$ is derived by maximizing the ratio of the canonical differences of the between-classes to those of the within-classes. This problem can be formulated by optimizing the Fisher discriminant, i.e.
$$T = \arg\max_T \frac{\sum_{i=1}^{m}\sum_{l \in B_i}\mathrm{CanonicalDiff}(i, l)}{\sum_{i=1}^{m}\sum_{k \in W_i}\mathrm{CanonicalDiff}(i, k)} = \arg\max_T \frac{\mathrm{trace}(T^T S_B T)}{\mathrm{trace}(T^T S_W T)}, \qquad (12)$$

where $S_B = \sum_{i=1}^{m}\sum_{l \in B_i}(P_i\Phi'_{il} - P_l\Phi'_{li})(P_i\Phi'_{il} - P_l\Phi'_{li})^T$ denotes the between-scatter, given that $B_i$ is the set of class labels of the between classes, and $S_W = \sum_{i=1}^{m}\sum_{k \in W_i}(P_i\Phi'_{ik} - P_k\Phi'_{ki})(P_i\Phi'_{ik} - P_k\Phi'_{ki})^T$ is the within-scatter, given that $W_i$ is the set of class labels of the within classes.

3.3 Kernel Discriminant Transformation Optimization
In this section, we describe an optimization process for solving the Fisher discriminant given by Eq. (12). First, the number of all the training images is assumed to be $M$, i.e. $M = \sum_{i=1}^{m} n_i$. Using the theory of reproducing kernels as in Eq. (7), the vectors $\{t_q\}_{q=1}^{w} \in T$ can be represented as the span of all mapped training images in the form

$$t_q = \sum_{u=1}^{M} \alpha_{uq}\,\phi(x_u), \qquad (13)$$
where $\alpha_{uq}$ are the elements of an $M \times w$ coefficient matrix $\alpha$. Applying the definition of $\Phi'_{ij}$ and Eq. (7), respectively, the projected kernel subspaces $\tilde{P}_{ij} = P_i\Phi'_{ij} = [\tilde{e}_1^{ij}, \ldots, \tilde{e}_d^{ij}]$ are given by

$$\tilde{e}_p^{ij} = \sum_{r=1}^{d}\sum_{s=1}^{n_i} a_{sr}^i\,\Phi'_{ij,rp}\,\phi(x_s^i). \qquad (14)$$
Applying Eqs. (13) and (14), it can be shown that $T^T\tilde{P}_{ij} = \alpha^T Z_{ij}$, where each element of $Z_{ij}$ has the form

$$(Z_{ij})_{up} = \sum_{r=1}^{d}\sum_{s=1}^{n_i} a_{sr}^i\,\Phi'_{ij,rp}\,k(x_u, x_s^i), \quad u = 1, \ldots, M,\ p = 1, \ldots, d. \qquad (15)$$
From the definition of $S_W$ and Eq. (15), the denominator of Eq. (12) can be rewritten as

$$T^T S_W T = \alpha^T U \alpha, \quad \text{where } U = \sum_{i=1}^{m}\sum_{k \in W_i}(Z_{ki} - Z_{ik})(Z_{ki} - Z_{ik})^T \qquad (16)$$

is an $M \times M$ within-scatter matrix.
Following a similar derivation to Eq. (16), the numerator of Eq. (12) can be rewritten as

$$T^T S_B T = \alpha^T V \alpha, \quad \text{where } V = \sum_{i=1}^{m}\sum_{l \in B_i}(Z_{li} - Z_{il})(Z_{li} - Z_{il})^T \qquad (17)$$

is an $M \times M$ between-scatter matrix.
Combining Eqs. (16) and (17), the original formulation given in Eq. (12) can be transformed into the problem of maximizing the following Jacobian function:
$$J(\alpha) = \frac{\alpha^T V \alpha}{\alpha^T U \alpha}. \qquad (18)$$
$\alpha$ can be found as the leading eigenvectors of $U^{-1}V$. Fig. 2 summarizes the solution procedure involved in computing the KDT matrix $T$. However, to handle the case in which $U$ is not invertible, we replace $U$ by $U_\mu$, i.e. we simply add a constant value to the diagonal terms of $U$:

$$U_\mu = U + \mu I. \qquad (19)$$

Thus $U_\mu$ is ensured to be positive-definite and the inverse $U_\mu^{-1}$ exists. That is, the leading eigenvectors of $U_\mu^{-1}V$ are computed as the solutions for $\alpha$.
Algorithm: Kernel Discriminant Transformation (KDT)
Input: All training image sets $\{X_1, \ldots, X_m\}$
Output: $T = [t_1, \ldots, t_w]$, where $t_q = \sum_{u=1}^{M}\alpha_{uq}\,\phi(x_u)$, $q = 1, \ldots, w$

1. $\alpha \leftarrow \mathrm{rand}(M, w)$
2. For all $i$, do SVD: $K_{ii} = a^i \Gamma a^{i\,T}$
3. Do the following iteration:
4.   For all $i$, compute $(T^T P_i)_{qp} = \sum_{u=1}^{M}\sum_{s=1}^{n_i} \alpha_{uq}\,a_{sp}^i\,k(x_u, x_s^i)$
5.   For all $i$, do QR-decomposition: $T^T P_i = Q_i R_i$
6.   For each pair of $i$ and $j$, do SVD: $Q_i^T Q_j = \Phi_{ij}\Lambda\Phi_{ji}^T$
7.   For each pair of $i$ and $j$, compute $\Phi'_{ij} = R_i^{-1}\Phi_{ij}$
8.   For each pair of $i$ and $j$, compute $(Z_{ij})_{up} = \sum_{r=1}^{d}\sum_{s=1}^{n_i} a_{sr}^i\,\Phi'_{ij,rp}\,k(x_u, x_s^i)$
9.   Compute $U = \sum_{i=1}^{m}\sum_{k \in W_i}(Z_{ki} - Z_{ik})(Z_{ki} - Z_{ik})^T$ and $V = \sum_{i=1}^{m}\sum_{l \in B_i}(Z_{li} - Z_{il})(Z_{li} - Z_{il})^T$
10.  Compute the eigenvectors $\{\alpha_p\}_{p=1}^{w}$ of $U^{-1}V$, and set $\alpha \leftarrow [\alpha_1, \ldots, \alpha_w]$
11. End

Fig. 2. Solution procedure for kernel discriminant transformation matrix T
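For readers who prefer code to pseudocode, the following NumPy sketch of the Fig. 2 loop is offered as an unofficial rendering. It reuses the gaussian_kernel and kernel_subspace_coeffs helpers sketched in Section 3.1, collapses steps 4-10 into matrix products, and the parameter values (d, w, mu, number of iterations) are placeholders only.

import numpy as np

def kdt_train(image_sets, labels, d=30, w=200, sigma2=0.05, mu=1e-3, n_iter=10):
    # image_sets[i] is a D x n_i matrix of vectorised images; labels[i] is its class.
    m = len(image_sets)
    X_all = np.hstack(image_sets)                     # all M training images
    M = X_all.shape[1]
    w = min(w, M)                                     # alpha cannot have more columns than M
    A, K = [], []
    for Xi in image_sets:                             # step 2: per-set KPCA coefficients a^i
        A.append(kernel_subspace_coeffs(Xi, d, sigma2))
        K.append(gaussian_kernel(X_all, Xi, sigma2))  # M x n_i cross-kernel k(x_u, x_s^i)
    alpha = np.random.rand(M, w)                      # step 1
    for _ in range(n_iter):                           # step 3
        Q, R = [], []
        for i in range(m):                            # steps 4-5
            Qi, Ri = np.linalg.qr(alpha.T @ (K[i] @ A[i]))
            Q.append(Qi); R.append(Ri)
        U = np.zeros((M, M)); V = np.zeros((M, M))
        for i in range(m):
            for j in range(m):
                if i == j:
                    continue
                Phi_ij, _, Phi_jiT = np.linalg.svd(Q[i].T @ Q[j])          # step 6
                Zij = K[i] @ A[i] @ np.linalg.solve(R[i], Phi_ij)          # steps 7-8
                Zji = K[j] @ A[j] @ np.linalg.solve(R[j], Phi_jiT.T)
                S = (Zji - Zij) @ (Zji - Zij).T
                if labels[i] == labels[j]:
                    U += S                            # within-class scatter, Eq. (16)
                else:
                    V += S                            # between-class scatter, Eq. (17)
        evals, evecs = np.linalg.eig(np.linalg.inv(U + mu * np.eye(M)) @ V)  # Eq. (19)
        alpha = evecs[:, np.argsort(-evals.real)[:w]].real                   # step 10
    return alpha, A, K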
Fig. 3. Workflow of proposed face recognition system
4 Face Recognition System

Fig. 3 illustrates the application of the KDT matrix in the context of a face recognition system. In the training process, kernel subspaces $P_i$ are generated by performing KPCA on each set of mapped training images. The KDT matrix $T$ is then obtained via an iterative learning process based on maximizing the ratio of the canonical differences of the between-classes to those of the within-classes. Reference subspaces $\mathrm{Ref}_i$ are then calculated by applying $T$ to $P_i$, i.e. $T^T P_i$. In the testing process, a similar procedure to that conducted in the training process is applied to the testing image set $X_{test}$ to generate the corresponding kernel subspace $P_{test}$ and the transformed kernel subspace $T^T P_{test}$, i.e. $\mathrm{Ref}_{test}$. By comparing each pair of reference subspaces, i.e. $\mathrm{Ref}_i$ and $\mathrm{Ref}_{test}$, an identification result with index $id$ can be obtained by finding the minimal canonical difference, i.e.
$$id = \arg\min_i \mathrm{CanonicalDiff}(i, test). \qquad (20)$$
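A small sketch of this identification step, assuming the reference subspaces and the test subspace have already been transformed by T and QR-normalised (all names here are illustrative, not from the paper):

import numpy as np

def identify(ref_test, refs):
    # Eq. (20): return the index of the enrolled reference subspace with the
    # smallest canonical difference to the (normalised) test subspace.
    diffs = []
    for Ri in refs:                                   # each Ri is a w x d orthonormal matrix
        Phi_it, _, Phi_tiT = np.linalg.svd(Ri.T @ ref_test)
        D = Ri @ Phi_it - ref_test @ Phi_tiT.T        # difference of the canonical subspaces
        diffs.append(np.trace(D.T @ D))
    return int(np.argmin(diffs))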
5 Experimental Results

A facial image database was compiled by recording image sequences under five controlled illumination conditions. The sequences were recorded at a rate of 10 fps
using a 320 × 240 pixel resolution. This database was then combined with the Yale B database [19] to give a total of 19,200 facial images of 32 individuals of varying gender and race. Each facial image was cropped to a 20×20-pixel scale using the Viola-Jones face detection algorithm [15] and was then preprocessed using a bandpass filtering scheme to compensate for illumination differences between the different images. The training process was performed using an arbitrarily-chosen image sequence, randomly partitioned into three image sets. The remaining sequences relating to the same individual were then used for testing purposes. A total of eight randomly-chosen sequence combinations were used to verify the performance of the proposed KDT classifier. Furthermore, the performance of the KDT scheme was also verified by performing a series of comparative trials using existing subspace comparison schemes, i.e. KMSM, KCMSM and KDT. In performing the evaluation trials, the dimensionality of the kernel subspace was specified as 30 for the KMSM, KCMSM and KDT schemes, while in DCC, the dimensionality was assigned a value of 20 to preserve 98% of the energy in the images. In addition, the variance of the Gaussian kernel was specified as 0.05. Finally, to ensure that the matrix $U$ was computable, $\mu$ in Eq. (19) was assigned a value of $10^{-3}$. Fig. 4(a) illustrates the convergence of the KDT solution procedure for different experimental initializations. It can be seen that as the number of iterations increases, the Jacobian value given in Eq. (18) converges to the same point irrespective of the initialization conditions. However, it is observed that the KMSM scheme achieves a better performance than the proposed method for random initializations. Fig. 4(b) and (c) demonstrate the improvement obtained in the similarity matrix following 10 iterations. Fig. 5(a) illustrates the relationship between the identification rate and the dimensionality $w$, and demonstrates that the identification rate is degraded if $w$ is not assigned a sufficiently large value. From inspection, it is determined that $w = 2{,}200$ represents an appropriate value. Adopting this value of $w$, Fig. 5(b) compares the identification rate of the KDT scheme with that of the MSM, CMSM, KMSM and KCMSM methods, respectively. Note that the data represent the results obtained using eight different training/testing combinations. Overall, the results show that KDT consistently outperforms the other classification methods.
Fig. 4. (a) Convergence of Jacobian value J (α) under different initialization conditions. (b) and (c) similarity matrices following 1st and 10th iterations, respectively.
Fig. 5. (a) Relationship between dimensionality w and identification rate. (b) Comparison of identification rate of KDT and various subspace methods for eight training/testing combinations.
6 Conclusions

A novel kernel subspace comparison method, kernel discriminant transformation (KDT), has been presented based on canonical differences and formulated as a form of Fisher discriminant with parameter $T$. Since the original form with parameter $T$ may not be directly computable, the kernel Fisher discriminant is used to rewrite the formulation and convert the original problem into a solvable one with parameter $\alpha$. An optimized solution for KDT is then obtained by iteratively learning $\alpha$. KDT has been evaluated in the proposed face recognition system using various multi-view facial image sets, and has been shown to converge stably under different initializations. The experimental results show promising face recognition performance. In future studies, the performance of the KDT algorithm will be further evaluated using additional facial image databases, including the Cambridge-Toshiba Face Video Database [7]. The computational complexity of the KDT scheme increases as the number of images used in the training process increases. Consequently, future studies will investigate the feasibility of using an ensemble learning technique to reduce the number of training images required while preserving the quality of the classification results.
References 1. Arandjelović, O., Shakhnarovich, G., Fisher, J., Cipolla, R., Darrell, T.: Face Recognition with Image Sets Using Manifold Density Divergence. In: IEEE Conf. on Computer Vision and Pattern Recognition(CVPR), vol. 1, pp. 581–588 (2005) 2. Baudat, G., Anouar, F.: Generalized Discriminant Analysis Using a Kernel Approach. Neural Computation 12(10), 2385–2404 (2000) 3. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Trans. on Pattern Analysis and Machine Intelligence(PAMI) 19(7), 711–720 (1997) 4. Chatelin, F.: Eigenvalues of matrices. John Wiley & Sons, Chichester (1993) 5. Fukui, K., Stenger, B., Yamaguchi, O.: A Framework for 3D Object Recognition Using the Kernel Constrained Mutual Subspace Method. In: Asian Conf. on Computer Vision, pp. 315–324 (2006)
6. Fukui, K., Yamaguchi, O.: Face Recognition Using Multi-Viewpoint Patterns for Robot Vision. In: International Symposium of Robotics Research, pp. 192–201 (2003) 7. Kim, T.K., Kittler, J., Cipolla, R.: Discriminative Learning and Recognition of Image Set Classes Using Canonical Correlations. IEEE Trans. on PAMI 29(6), 1005–1018 (2007) 8. Penev, P.S., Atick, J.J.: Local Feature Analysis: A General Statistical Theory for Object Representation. Network: Computation in Neural systems 7(3), 477–500 (1996) 9. Sakano, H., Mukawa, N.: Kernel Mutual Subspace Method for Robust Facial Image Recognition. In: International Conf. on Knowledge-Based Intelligent Engineering System and Allied Technologies, pp. 245–248 (2000) 10. Satoh, S.: Comparative Evaluation of Face Sequence Matching for Content-Based Video Access. In: IEEE Conference on Automatic Face and Gesture Recognition (FG), pp. 163– 168. IEEE Computer Society Press, Los Alamitos (2000) 11. Schölkopf, B., Smola, A., Müller, K.-R.: Nonlinear Component Analysis as A Kernel Eigenvalue Problem. Neural Computation 10(5), 1299–1319 (1998) 12. Shakhnarovich, G., Fisher, J.W., Darrel, T.: Face Recognition from Long-Term Observations. In: European Conference on Computer Vision, pp. 851–868 (2000) 13. Shakhnarovich, G., Moghaddam, B.: Face Recognition in Subspaces. Handbook of Face Recognition (2004) 14. Turk, M., Pentland, A.: Face Recognition Using Eigenfaces. CVPR, 453–458 (1993) 15. Viola, P., Jones, M.: Robust Real-Time Face Detection. International Journal of Computer Vision 57(2), 137–154 (2004) 16. Wolf, L., Shashua, A.: Kernel Principal Angles for Classification Machines with Applications to Image Sequence Interpretation. CVPR, 635–642 (2003) 17. Yamaguchi, O., Fukui, K., Maeda, K.: Face Recognition Using Temporal Image Sequence. FG (10), 318–323 (1998) 18. Yang, M.-H.: Kernel Eigenfaces vs. Kernel Fisherfaces: Face Recognition Using Kernel Methods. FG, 215–220 (2002) 19. http://cvc.yale.edu/projects/yalefacesB/yalefacesB.html
Person-Similarity Weighted Feature for Expression Recognition

Huachun Tan1 and Yu-Jin Zhang2

1 Department of Transportation Engineering, Beijing Institute of Technology, Beijing 100081, China
[email protected]
2 Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
[email protected]
Abstract. In this paper, a new method to extract person-independent expression feature based on HOSVD (Higher-Order Singular Value Decomposition) is proposed for facial expression recognition. With the assumption that similar persons have similar facial expression appearance and shape, person-similarity weighted expression feature is used to estimate the expression feature of the test person. As a result, the estimated expression feature can reduce the influence of individual caused by insufficient training data and becomes less person-dependent, and can be more robust to new persons. The proposed method has been tested on Cohn-Kanade facial expression database and Japanese Female Facial Expression (JAFFE) database. Person-independent experimental results show the efficiency of the proposed method.
1 Introduction
Facial expression analysis is an active area in human-computer interaction [1,2]. Many techniques of facial expression analysis have been proposed that try to make the interaction tighter and more efficient. During the past decade, the development of image analysis, object tracking, pattern recognition, computer vision, and computer hardware has brought facial expressions into human-computer interaction as a new modality. Many systems for automatic facial expression analysis have been developed since the pioneering work of Mase and Pentland [3], and several surveys of automatic facial expression analysis [1, 2] have also appeared. Many algorithms have been proposed to improve robustness towards environmental changes, such as different illuminations or different head poses. Traditionally, they use geometric features that describe the shape and locations of facial components, including the mouth, eyes, brows, nose, etc., for facial expression recognition. Using geometric features, the methods are more robust to variation in face position, scale, size and head orientation, and are less person-dependent [4, 5]. In order to represent detailed appearance changes such as wrinkles and creases as well, some methods combine geometric facial features with appearance facial features [6, 7]. There has been little research dealing with the effect of individuals [7-11]. It is still a challenging task for computer vision to recognize facial expression across
different persons [8]. Matsugu et al. [9] proposed a rule-based algorithm for subject-independent facial expression recognition, reported as the first facial expression recognition system combining subject independence with robustness to variability in facial appearance. However, for a rule-based method, setting effective rules for many expressions is difficult. Wen et al. [7] proposed a ratio-image based appearance feature for facial expression recognition, which is independent of a person's face albedo. Abbound et al. [10] proposed bilinear appearance-factorization based representations for facial expression recognition. However, these two methods did not consider the differences in how different persons represent facial expressions. Wang et al. [8] applied a modified Higher-Order Singular Value Decomposition (HOSVD) to expression-independent face recognition and person-independent facial expression recognition simultaneously. In that method, the representation of the test person should be the same as the representation of one of the persons in the training set; otherwise, his/her expression would be wrongly classified. The problem might be solved by training the model on a great number of face images. However, the feature space of expression is so huge that it is difficult to collect enough face image data and train a model that works robustly for new face images. Tan et al. [11] proposed a person-similarity weighted distance based on HOSVD to improve the performance of person-independent expression recognition, but the recognition rate is not satisfactory. In this paper, a new method based on HOSVD for extracting a person-independent facial expression feature is proposed to improve the performance of expression recognition for new persons. The new method is based on two assumptions:
– Similar persons have similar facial expression appearance and shape; this assumption is often used for facial expression synthesis [8].
– For simplicity, the facial expression is assumed to be affected by only one factor, the individual. Other factors, such as pose and illumination, are not considered.
Based on these two assumptions, the expression feature of a new person is estimated to reduce the influence of the individual caused by insufficient training data. The estimate, called the person-similarity weighted feature, is used for expression recognition. Different from the work in [11], which improves the performance in the distance measure, this paper improves the performance of person-independent expression recognition by estimating an expression feature that represents the expression of new persons more effectively. The proposed method has been tested on the Cohn-Kanade facial expression database [12] and the Japanese Female Facial Expression (JAFFE) database [13]. Person-independent experimental results show that the proposed method is more robust to new persons than the previous HOSVD-based method [8], using geometric and appearance features directly, and the work in [11]. The remainder of this paper is organized as follows. Background on HOSVD is reviewed in Section 2. The person-similarity weighted expression feature extraction is described in Section 3. Person-independent experimental results are presented in Section 4. Finally, the conclusion of the paper is given in Section 5.
2 Background of HOSVD

2.1 HOSVD
In order to reduce the influence caused by the individual, a factorization model that disentangles the person and expression factors is explored. Higher-Order Singular Value Decomposition (HOSVD) offers a potent mathematical framework for analyzing the multifactor structure of image ensembles and for addressing the difficult problem of disentangling the constituent factors or modes. Vasilescu et al. [14, 15] proposed using N-mode SVD to analyze face images in a way that can account for each factor inherent to image formation. The resulting “TensorFaces” are used for face recognition and obtained better results than PCA (Principal Component Analysis). Terzopoulos [16] extended TensorFaces for analyzing, synthesizing, and recognizing facial images. Wang et al. [8] used a modified HOSVD to decompose facial expression images from different persons into separate subspaces: an expression subspace and a person subspace. Thus, it can lead to expression-independent face recognition and person-independent facial expression recognition simultaneously. In the conventional multi-factor analysis method for facial expression recognition proposed by Wang et al. [8], a third-order tensor $\mathcal{A} \in \mathbb{R}^{I \times J \times K}$ is used to represent the facial expression configuration, where $I$ is the number of persons, $J$ is the number of facial expressions for each person, and $K$ is the dimension of the facial expression feature vector $V_e$ combining geometric features and appearance features. The tensor $\mathcal{A}$ can then be decomposed as

$$\mathcal{A} = S \times_1 U^p \times_2 U^e \times_3 U^f, \qquad (1)$$
where $S$ is the core tensor representing the interactions of the person, expression and feature subspaces, and $U^p$, $U^e$, $U^f$ represent the person, expression and facial expression feature subspaces, respectively. Each row vector in each subspace matrix represents a specific vector in this mode. Each column vector in each subspace matrix represents the contributions of the other modes. For details of HOSVD, we refer readers to the works in [8, 14, 15, 17]. Using a simple transformation, two tensors related to the person and expression modes are defined, which we call the expression tensor $T^e$ and the person tensor $T^p$, respectively, given by
$$T^e = S \times_2 U^e \times_3 U^f, \qquad (2)$$

$$T^p = S \times_1 U^p \times_3 U^f. \qquad (3)$$
Each column in $T^p$ or $T^e$ is a basis vector that comprises $I$ or $J$ eigenvectors. $T^p$ or $T^e$ is a tensor of size $I \times J \times K$. The input test tensor $T_{test}$ is a $1 \times 1 \times K$ tensor using $V_e^{test}$ in the third mode. The expression vector $u_i^{e\,test}$ of the $i$-th person is then represented as

$$u_i^{e\,test} = \mathrm{uf}(T_{test}, 2)^T \cdot \mathrm{uf}(T^p(i), 2)^{-1}. \qquad (4)$$
Similarly, the person vector of the $j$-th facial expression is represented as

$$u_j^{p\,test} = \mathrm{uf}(T_{test}, 1)^T \cdot \mathrm{uf}(T^e(j), 1)^{-1}, \qquad (5)$$
where $\mathrm{uf}(T, n)$ means unfolding the tensor $T$ in the $n$-th mode. That is, the coefficient vector obtained by projecting the test vector onto the $i$-th row of eigenvectors in the basis tensor $T^e$ is the expression vector $u_i^{e\,test}$ of the $i$-th person, and the coefficient vector obtained by projecting the test vector onto the $j$-th column of eigenvectors in the basis tensor $T^p$ is the person vector $u_j^{p\,test}$ of the $j$-th expression.
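As a reference point, here is a plain HOSVD in NumPy (mode-n unfolding plus one SVD per mode). It is a generic sketch only: the modified HOSVD of [8] and the exact meaning of the inverse of an unfolded tensor in Eqs. (4)-(5) (in practice a pseudo-inverse) are not reproduced here.

import numpy as np

def unfold(T, mode):
    # Mode-n unfolding uf(T, n); mode is 1-based as in Eqs. (4)-(5).
    return np.moveaxis(T, mode - 1, 0).reshape(T.shape[mode - 1], -1)

def hosvd(A):
    # A is the I x J x K tensor of Eq. (1); returns the core S and [Up, Ue, Uf].
    U = [np.linalg.svd(unfold(A, n), full_matrices=False)[0] for n in (1, 2, 3)]
    S = A.copy()
    for axis, Un in enumerate(U):            # S = A x1 Up^T x2 Ue^T x3 Uf^T
        S = np.moveaxis(np.tensordot(Un.T, np.moveaxis(S, axis, 0), axes=1), 0, axis)
    return S, U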
2.2 Conventional Matching Processing

Given the test original expression vector $V_e^{test}$, the goal of expression recognition is to find $j^*$ that satisfies

$$j^* = \arg\max_{j=1,\ldots,J} P(u_j^e \mid V_e^{test}). \qquad (6)$$
In the previous HOSVD-based method [8], all test expression vectors associated with all persons in the training set, $u_i^{e\,test}$, $i = 1, 2, \ldots, I$, are compared to the expression-specific coefficient vectors $u_j^e$, $j = 1, 2, \ldots, J$, where $u_j^e$ is the $j$-th row of the expression subspace $U^e$. The one that yields the maximum similarity $\mathrm{sim}(u_i^{e\,test}, u_j^e)$, $i = 1, 2, \ldots, I$, $j = 1, 2, \ldots, J$, among all persons and expressions identifies the unknown vector $V_e^{test}$ as expression index $j$. That is, the matching processing of conventional methods is to find $(i^*, j^*)$ that satisfy

$$(i^*, j^*) = \arg\max_{i=1,\ldots,I,\ j=1,\ldots,J} P(u_j^e \mid u_i^{e\,test}). \qquad (7)$$

3 Person-Similarity Weighted Expression Features

3.1 Problems of Conventional Matching Processing
In the matching processing, one assumption is made: that the test person and one of the persons in the training set represent their expression in the same way. That is, the “true” expression feature of the test person, $u_e^{test}$, is assumed to equal $u_{i^*}^{e\,test}$ with probability 1, i.e.

$$P(u_e^{test} = u_i^{e\,test} \mid V_e^{test}) = \begin{cases} 1 & i = i^* \\ 0 & \text{otherwise.} \end{cases} \qquad (8)$$

Then $u_e^{test}$ is estimated as $u_{i^*}^{e\,test}$ and used for classification according to Eq. (7). Ideally, the $i^*$-th person used for calculating the expression feature is the most similar person according to face recognition. However, in many cases the assumption that the test person and one of the persons in the training set represent their expression in the same way does not hold, because of individual differences and insufficient training data. In such cases the test expression cannot be estimated using Eq. (8), and expressions are apt to be wrongly recognized. Some results in our experiments, reported in Section 4, illustrate this problem. How to estimate the expression feature of the test person is still a challenge for expression recognition when the training data are insufficient.
3.2 Person-Similarity Weighted Expression Features
In order to improve the performance of facial expression recognition across different persons, we propose to estimate the “true” expression feature of the test person by taking the information of all persons in the training set into account. To set up the probability model, we use the assumption that similar persons have similar facial expression appearance and shape, which has been widely used in facial expression synthesis [8]. The assumption can be formulated as follows: the expression feature of one person is equal to that of another person with probability $P$, and the probability $P$ is proportional to the similarity between the two persons. That is,

$$P(u_i^{e\,test} \mid V_e^{test}) \propto s_i, \quad i = 1, 2, \ldots, I, \qquad (9)$$
where $u_i^{e\,test}$ is the expression feature associated with the $i$-th person, $u_e^{test}$ is the “true” expression feature of the test person, and $s_i$ denotes the similarity between the test person and the $i$-th person in the training set. Under this assumption, $u_e^{test}$ is estimated. For simplicity, it is also assumed that the prior probabilities of all classes of persons are equal. Then, based on Eq. (9), the estimate of the expression feature of the test person is

$$\hat{u}_e^{test} = E(u_e^{test} \mid V_e^{test}) = \sum_i p_i \cdot u_i^{e\,test} = \frac{1}{Z}\sum_i s_i \cdot u_i^{e\,test}, \qquad (10)$$
where $Z$ is the normalization constant, equal to the sum of the similarities between the test person and all persons in the training set. The expression feature weighted by the similarities of the test person to all persons in the training set can then be used as the “true” expression feature of the test person for expression recognition. Through this weighting process, the person-similarity weighted expression feature reduces the influence of new persons caused by insufficient training data and is less person-dependent. In order to estimate the expression feature, the similarities between the test person and all persons in the training set need to be determined first. They can be calculated using the person subspace obtained from the HOSVD decomposition proposed by Wang et al. [8]. In the process of calculating person similarities, the cosine of the angle between two vectors $a$ and $b$ is adopted as the similarity function:

$$\mathrm{sim}(a, b) = \frac{\langle a, b \rangle}{\|a\| \cdot \|b\|} = \frac{\mathrm{tr}\langle a^T, b \rangle}{\|a\| \cdot \|b\|}.$$
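A compact NumPy sketch of Eqs. (9)-(10) and the cosine similarity above, assuming the per-person expression vectors and person vectors have already been extracted; clipping negative cosine similarities to zero is our own assumption, not something stated in the paper.

import numpy as np

def cosine_sim(a, b):
    # Cosine of the angle between two (vectorised) descriptors.
    a, b = np.ravel(a), np.ravel(b)
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def weighted_expression_feature(expr_vecs, person_vec_test, person_vecs_train):
    # expr_vecs[i] is u_i^{e,test}; weights are proportional to the person similarity s_i.
    s = np.array([cosine_sim(person_vec_test, p) for p in person_vecs_train])
    s = np.clip(s, 0.0, None)          # assumption: ignore negative similarities
    w = s / s.sum()                    # p_i = s_i / Z, Eq. (10)
    return sum(wi * ui for wi, ui in zip(w, expr_vecs))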
(11)
Experimental Results
The original expression features including geometric feature and appearance feature is firstly extracted. The process of geometric feature extraction is similar to that of [11]. But the geometric features about cheek are not used in our experiments since the definition of cheek feature points is difficult and tracking the cheek feature points is not robust from our experiments. Then the geometric
Person-Similarity Weighted Feature for Expression Recognition
717
features are tracked using the method proposed in [18]. The appearance features based on the expression ratio image and Gabor wavelets which are used in [11] are also extracted. Finally, person-similarity weighted expression feature is estimated for expression recognition. 4.1
Experimental Setup
The proposed method is applied to CMU Cohn-Kanade expression database [12] and JAFFE database [13] to evaluate the performance of person-independent expression recognition. Cohn-Kanade database In Cohn-Kanade expression database, the facial expressions are classified into action units (AUs) of Face Action Coding System (FACS), instead of a few prototypic expressions. In our experiments, 110 image sequences of 6 AUs for upper face were selected from 46 subjects of European, African, and Asian ancestry. The data distribution for training and testing is shown in Table 1. No subject appears in both training and testing. Table 1. Data distribution of training and test data sets AU1 AU6 AU1+2 AU1+2+5 AU4+7 AU4+6+7 train test
9 5
19 14
6 6
12 11
14 6
5 3
In the experiments on Cohn-Kanade database, since there are not enough training data to fill in the whole tensor for training, the mean feature of the specific expression in training set is used to substitute the blanks that have no training data. JAFFE Database. For JAFFE database, 50 images of 10 persons with 5 basic expressions displayed by each person are selected for training and testing. The 5 basic expressions are happiness, sadness, surprise, angry and dismal. Leave one out cross validation is used. For each test, we use the data of 9 persons for training and that of the rest one as test data. Since only static images are provided in JAFFE database, the image of neutral expression is considered as the first frame. The initial expression features are extracted by two images: one is the image of neutral expression and the other is the image of one of 5 basic expressions mentioned above. 4.2
Person-Independent Experimental Results
The proposed method is based on conventional HOSVD method proposed by Wang [8], and the initial expression vectors are similar to those in [19]. In our experiments, the classification performances of the proposed method are compared with the following methods:
718
H. Tan and Y.-J. Zhang
– Classifying geometric and appearance features directly by a three-layer standard back propagation neural network with one hidden layer which is similar to the method used in [19]. – Conventional HOSVD based method proposed by Wang [8]. – Classifying expressions by person-similarity distance proposed by Tan [11]. The results on Cohn-Kanade database are shown in Table 2. The average accuracy of expression recognition using proposed method has been improved to 73.3% from 58.6%, 55.6% and 64.4% respectively comparing with other methods. From Table 2, it can be observed that the recognition rate of AU6 of proposed method is the lowest. And the recognition rate of AU1+2+5 using proposed method is slightly lower than that of using geometric and appearance features directly. When the training data is adequate, the test person is more familiar to the persons in training set, the first two methods can achieve satisfied results. When the training data is inadequate, the proposed method outperforms other methods. Because the estimated expression feature is less person-dependent by the weighting process, the performance of the proposed method is morerobust. Table 2. Comparison of Average recognition rate on Cohn-Kanade database Tian[19] Wang[8] Tan[11] Proposed AU1 AU6 AU1+2 AU1+2+5 AU4+7 AU4+6+7 Average Rate
40% 85.7% 16.7% 54.6% 66.7% 0% 55.6%
80% 92.9% 33% 27.3% 33.3% 66.7% 58.6%
80% 78.6% 66.7% 27.3% 66.7% 100% 64.4%
100% 78.6% 83.3% 45.5% 66.7% 100% 73.3%
The performances of the methods on the JAFFE database are reported in Table 3. We can see that the proposed method is more robust to new persons than the other methods. Though the proposed method does not outperform the others for every expression, its average recognition rate is much higher.

Table 3. Comparison of average recognition rate on the JAFFE database

               Tian[19]   Wang[8]   Tan[11]   Proposed
Happiness        70%        30%       50%       50%
Sadness          60%        90%       60%       70%
Surprise         40%        70%       70%       80%
Angry            50%        30%       80%       90%
Dismal           60%        60%       50%       40%
Average Rate     56%        56%       62%       66%
The average recognition rate of the proposed method has been improved to 66% from 56%, 56% and 62% obtained by using the initial expression features directly, the traditional method and our previous work, respectively.

4.3 Discussions
From the person-independent experiments, it can be observed that the proposed method is more robust to new persons in expression recognition. The reason is that the estimated expression feature reduces the influence of the individual and becomes less person-dependent. However, the proposed method did not outperform the others for all expressions. The reason is that the person similarities obtained by the face recognition algorithm are rough, so the “true” expression feature cannot be estimated accurately. Can the assumption that “similar persons have similar expression appearance and shape” be used for facial expression recognition? This is a debatable question. Though the assumption is often used in facial expression synthesis and is intuitively reasonable, there is no psychological evidence to support the claim, and it cannot be generalized to all persons. According to Darwin's theory, the representation of expression is influenced by a person's habits [20], not by individual appearance and shape. However, the experimental results show that the average recognition rate of the proposed method is higher than that of the methods that do not use the assumption. This suggests that the assumption can be generalized to a certain extent.
5 Conclusion
We have proposed a method for extracting a person-independent facial expression feature for expression recognition. After obtaining the person subspace and the expression subspace using HOSVD, the expression features associated with all persons in the training set are linearly combined, weighted by the similarity of the person. The work is based on the assumption that similar persons have similar facial expression representations, which is often used for facial expression synthesis. Through the weighting process, the person-similarity weighted expression feature is less person-dependent and more robust to new persons. The person-independent experimental results show that the proposed method achieves more accurate expression recognition. Compared with the traditional method based on HOSVD [8], using geometric and appearance features directly [19], and the work using the person-similarity weighted distance [11], the proposed method achieves higher average accuracy: on the Cohn-Kanade database, it has been improved to 73.3% from 58.6%, 55.6% and 64.4%, respectively, and on the JAFFE database, to 66% from 56%, 56% and 62%, respectively. In this paper, we simply use the mean feature of the expression to fill in the tensor on the Cohn-Kanade database, without considering the person factor. Using an iterative method to synthesize the missing features in the tensor may be more effective.
In our method, only the person factor is considered. However, the method is a general framework that can easily be extended to multifactor analysis. For example, if the illumination of the test expression is more similar to one class of illumination, a larger weight can be assigned when estimating an illumination-independent expression feature. Because the person similarities are only roughly valued, the “true” expression feature of the test person cannot be estimated accurately; this problem may be alleviated by improving the performance of face recognition. However, recognizing an individual across different expressions is itself a challenging task for computer vision. These are directions for our future work.
Acknowledgment This work has been supported by Grant RFDP-20020003011.
References 1. Pantic, M., Rothkrantz, L.: Automatic Analysis of Facial Expressions: The State of the Art. IEEE Transaction on Pattern Analysis and Machine Intelligence 22, 1424–1445 (2000) 2. Fasel, B., Luettin, J.: Automatic Facial Expression Analysis: A Survey. Pattern Recognition 36, 259–275 (2003) 3. Mase, K., Pentland, A.: Recognition of Facial Expression from Optical Flow. IEICE Transactions E74(10), 3474–3483 (1991) 4. Bartlett, M., Hager, J., Ekman, P., Sejnowski, T.: Measuring facial expressions by computer image analysis. Psychophysiology, 253–263 (1999) 5. Essa, I., Pentland, A.: Coding Analysis, Interpretation, and Recognition of Facial Expressions. IEEE Transaction on Pattern Analysis and Machine Intelligence, 757– 763 (1997) 6. Tian, Y., Kanade, T., Cohn, J.: Evaluation of gabor wavelet-based facial action unit recognition in image sequences of increasing complexity. In: Proc. of Int’l Conf. on Automated Face and Gesture Recognition, pp. 239–234 (2002) 7. Wen, Z., Huang, T.S.: Capturing subtle facial motions in 3D face tracking. In: ICCV, pp. 1343–1350 (2003) 8. Wang, H., Ahuja, N.: Facial Expression Decomposition. In: ICCV, pp. 958–965 (2003) 9. Matsugu, M., Mori, K., Mitari, Y., Kaneda, Y.: Subject independent facial expression recognition with robust face detection using a convolutional neural network. Neural Networks 16, 555–559 (2003) 10. Abbound, B., Davoine, F.: Appearance Factorization based Facial Expression Recognition and Synthesis. In: Proceedings of the International Conference on Pattern Recognition, vol. 4, pp. 163–166 (2004) 11. Tan, H., Zhang, Y.: Person-Independent Expression Recognition Based on Person Similarity Weighted Distance. Jounal of Electronics and Information Technology 29, 455–459 (2007) 12. Kanade, T., Cohn, J., Tian, Y.: Comprehensive Database for Facial Expression Analysis. In: Proc. of Int’l Conf. Automated Face and Gesture Recognition, pp. 46–53 (2000)
13. Michael, J.L., Shigeru, A., Miyuki, K., Jiro, G.: Coding Facial Expressions with Gabor Wavelets. In: Proc. of Int’l Conf. Automated Face and Gesture Recognition, pp. 200–205 (1998) 14. Vasilescu, M.A.O., Terzopoulos, D.: Multilinear Analysis of Image Ensembles: TensorFaces. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 447–460. Springer, Heidelberg (2002) 15. Vasilescu, M.A.O., Terzopoulos, D.: Multilinear subspace analysis of image ensembles: Image Analysis for facial recognition. In: Proceedings of the International Conference on Pattern Recognition, pp. 511–514 (2002) 16. Terzopoulos, D., Lee, Y., Vasilescu, M.A.O.: Model-based and image-based methods for facial image synthesis, analysis and recognition. In: Proc. of Int’l Conf. on Automated Face and Gesture Recognition, pp. 3–8 (2004) 17. Lathauwer, L.D., Moor, B.D., Vandewalle, J.: A Multilinear Singular value Decomposition. SIAM Journal of Matrix Analysis and Applications 21, 1253–1278 (2000) 18. Tan, H., Zhang, Y.: Detecting Eye Blink States by Tracking Iris and Eyelids. Pattern Recognition Letters 27, 667–675 (2006) 19. Tian, Y., Kanade, T., Cohn, J.: Recognizing Action Units for Facial Expression Analysis. IEEE Trans. On PAMI 23, 97–115 (2001) 20. Darwin, C.: The Expression of Emotions in Man and Animals. Reprinted by the University of Chicago Press (1965)
Converting Thermal Infrared Face Images into Normal Gray-Level Images
Mingsong Dou¹, Chao Zhang¹, Pengwei Hao¹,², and Jun Li²
¹ State Key Laboratory of Machine Perception, Peking University, Beijing, 100871, China
² Department of Computer Science, Queen Mary, University of London, E1 4NS, UK
[email protected]
Abstract. In this paper, we address the problem of producing visible-spectrum facial images, as we normally see them, from thermal infrared images. We apply Canonical Correlation Analysis (CCA) to extract features, converting the many-to-many mapping between infrared and visible images into an approximately one-to-one mapping. We then learn the relationship between the two feature spaces, in which the visible features are inferred from the corresponding infrared features using Locally-Linear Regression (LLR) or the proposed Sophisticated LLE, and a Locally Linear Embedding (LLE) method is used to recover a visible image from the inferred features, recovering some information lost in the infrared image. Experiments demonstrate that our method maintains the global facial structure and infers many local facial details from the thermal infrared images.
1 Introduction
Human facial images have been widely used in biometrics, law enforcement, surveillance, and so on [1], but in most cases only visible-spectrum images of human faces have been used. Recently, a literature has begun to emerge on face recognition (FR) based on infrared images or on the fusion of infrared and visible-spectrum images [2-4], and some sound results have been published. Rather than FR based on infrared images, this paper focuses on the transformation from thermal IR images to visible-spectrum images (see Fig. 2 for examples of both modalities), i.e., we try to render a visible-spectrum image from a given thermal infrared image. Thermal infrared imaging sensors measure the temperature of the imaged objects and are invariant to illuminance. There are many surveillance applications in which the lighting conditions are so poor that only thermal infrared images can be acquired. As we know, we see objects because of the reflectance of light, i.e., the formation of visible-spectrum images requires light sources. For thermal infrared images the situation is more favorable: all objects with temperature above absolute zero emit electromagnetic waves, and the human body temperature lies in the range that emits in the infrared band. So even when it is completely dark, we can still obtain thermal infrared images with thermal infrared cameras. Though visible-spectrum and infrared facial images are formed by different mechanisms, the images do share some commonalities if they come from the same face, e.g.,
we can recognize some facial features in both modalities. There indeed exists a correlation between them, which can be learned from training sets. The problem of rendering thermal infrared images as normal visible images is actually very challenging. First of all, the correlations between a visible image and the corresponding infrared one are not strong. As mentioned above, the imaging models rely on different mechanisms. Infrared images are invariant under changes of the lighting conditions, so many visible-spectrum images taken under different lighting conditions correspond to one infrared image. Therefore, the solution to our problem is not unique. Conversely, thermal infrared images are not constant either: they depend on the surface temperature of the imaged objects. For example, the infrared images of a person who has just come in from the cold outside and of the same person who has just exercised heavily are quite different. The analysis above shows that the correspondence between visible and thermal infrared facial images of the same person is a many-to-many mapping, which is the biggest barrier to solving the problem. Another problem is that the resolution of visible-spectrum facial images is generally much higher than that of thermal infrared images. Thus visible images carry more information, and some information in visible-spectrum images definitely cannot be recovered from thermal infrared images through the correlation relationship. In this paper, we develop a method to address these problems. We use Canonical Correlation Analysis (CCA) to extract features, converting the many-to-many mapping between infrared and visible images into an approximately one-to-one mapping. Then we learn the relationship between the feature spaces, in which the visible features are inferred from the corresponding infrared features using Locally-Linear Regression (LLR) or the proposed Sophisticated LLE, and a Locally Linear Embedding (LLE) method is applied to recover a visible image from the inferred features, recovering some information lost in the infrared image.
2 Related Works
As presented above, this paper addresses the problem of conversion between images of different modalities, which has much in common with the super-resolution problem [5-7] of rendering one high-resolution (HR) image from one or several low-resolution (LR) images. For example, in both problems the data we try to recover contain information that is lost in the given observations. Baker et al. [5] developed a super-resolution method called face hallucination to recover the lost information. They first matched the input LR image to those in the training set, found the most similar LR image, and then took the first-derivative information of the corresponding HR image in the training set as the information of the desired HR image. We adopt this idea of finding the information for the recovered data in the training set. Chang et al. [6] introduced LLE [10] to super-resolution. Their method is based on the assumption that the patches in the low- and high-resolution images form manifolds with the same local geometry in two distinct spaces, i.e., we can reconstruct an HR patch from the neighboring HR patches with the same coefficients as those used for reconstructing the corresponding LR patch from the neighboring LR patches. Actually this method is a special case of Locally-Weighted Regression (LWR) [8] when the
weights for all neighbors are equal and the regression function is linear, as we show in Section 4. We develop a Sophisticated LLE method which is an extension of LLE. Freeman et al. [7] modeled images as a Markov Random Field (MRF) with the nodes corresponding to image patches, i.e., the information from the surrounding patches is used to constrain the solution, whereas LLE does not use it. MRF improves the results when the images are not well aligned, but in this paper we assume all the images are well registered and do not use the time-consuming MRF method. Our work is also related to research on statistical learning. Melzer et al. [11] used Canonical Correlation Analysis (CCA) to infer the pose of an object from gray-level images, and Reiter et al.'s method [12] learns depth information from RGB images, also using CCA. CCA aims to find two sets of projection directions for two training sets such that the correlation between the projected data is maximized. We use CCA for feature extraction. Shakhnarovich et al. [9] used Locally-Weighted Regression for pose estimation. To accelerate the search for nearest neighbors (NN), they adopted and extended the Locality-Sensitive Hashing (LSH) technique. The problem we address here is quite different from theirs: a visible image is not an underlying scene that generates an infrared image, whereas in their problem the pose is the underlying parameter of the corresponding image. We therefore use CCA to extract the most correlated features; at the same time the dimensionality of the data is reduced dramatically, making nearest-neighbor search easier. In our experiments we use exhaustive search for NN instead of LSH.
3 Feature Extraction Using CCA
As mentioned above, the correspondence between the visible and the infrared images is a many-to-many mapping, so learning a simple linear relationship between the two image spaces is not possible. Instead, extracting features and learning the relationship between the feature spaces can be a solution. We wish to extract features from the original images with the following properties: (1) the relationship between the two feature spaces is stable, i.e., there exists a one-to-one mapping between them, it is easy to learn from the training set, and it generalizes well to the test set; (2) the features in the two distinct feature spaces should contain enough information to approximately recover the images. Unfortunately, for our problem the two properties conflict with each other. Principal Component Analysis (PCA), known as the EigenFace method [13] in face recognition, is a popular method to extract features. For our problem it satisfies the second condition well, but two sets of principal components, extracted from a visible image and the corresponding infrared image, have weak correlations. Canonical Correlation Analysis (CCA) finds pairs of directions that yield the maximum correlations between two data sets or two random vectors, i.e., the correlations between the projections (features) of the original data onto these directions are maximized. CCA has the desired trait given in property (1) above. But unlike PCA, a few CCA projections are not sufficient to recover the original data, for the found directions may not cover the principal variance of the data set. However, we find that regularized CCA is a satisfying trade-off between the two desired properties.
3.1 Definition of CCA
Given two zero-mean random vectors x (p×1) and y (q×1), CCA finds the first pair of directions $w_1$ and $v_1$ that maximize the correlation between the projections $w_1^T x$ and $v_1^T y$:

$$\max\ \rho(w_1^T x,\, v_1^T y), \quad \text{s.t. } \operatorname{Var}(w_1^T x) = 1 \ \text{and}\ \operatorname{Var}(v_1^T y) = 1, \qquad (1)$$

where ρ is the correlation coefficient; $w_1^T x$ and $v_1^T y$ are called the first canonical variates, and $w_1$ and $v_1$ are the first pair of correlation direction vectors. CCA finds the k-th pair of directions $w_k$ and $v_k$ satisfying: (1) $w_k^T x$ and $v_k^T y$ are uncorrelated with the former k−1 canonical variates; (2) the correlation between $w_k^T x$ and $v_k^T y$ is maximized subject to the constraints $\operatorname{Var}(w_k^T x) = 1$ and $\operatorname{Var}(v_k^T y) = 1$. Then $w_k^T x$ and $v_k^T y$ are called the k-th canonical variates, and $w_k$ and $v_k$ are the k-th correlation direction vectors, $k \le \min(p, q)$. The correlation directions and correlation coefficients are obtained by solving the generalized eigenvalue problems below,

$$\left(\Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{xy}^{T} - \rho^2 \Sigma_{xx}\right) w = 0, \qquad (2)$$

$$\left(\Sigma_{xy}^{T}\Sigma_{xx}^{-1}\Sigma_{xy} - \rho^2 \Sigma_{yy}\right) v = 0, \qquad (3)$$
where $\Sigma_{xx}$ and $\Sigma_{yy}$ are the auto-covariance matrices and $\Sigma_{xy}$ and $\Sigma_{yx}$ are the cross-covariance matrices. There are robust methods to solve this problem; interested readers may refer to [15], where an SVD-based method is introduced. Unlike PCA, which aims to minimize the reconstruction error, CCA puts the correlation of the two data sets first. There is no assurance that the directions found by CCA cover the main variance of the data set, so generally speaking a few projections (canonical variates) are not sufficient to recover the original data well. Besides the recovery problem, we also have to deal with overfitting. CCA is sensitive to noise: even if there is a small amount of noise in the data, CCA might still maximize the correlations between the extracted features, but the features are then more likely to represent the noise rather than the data. As mentioned in [11], a sound remedy is to add a multiple of the identity matrix, λI, to the covariance matrices $\Sigma_{xx}$ and $\Sigma_{yy}$; this method is called regularized CCA. We find that it also affects the reconstruction accuracy, as depicted in Fig. 1. Regularized CCA is thus a trade-off between the two desired properties mentioned above.
3.2 Feature Extraction and Image Recovery from Features
We extract local features rather than holistic features, since holistic features seem to fail to capture local facial traits. A training set consisting of pairs of visible and infrared images is at our disposal. We partition all the images into overlapping patches; then at every patch position we have a set of patch pairs for CCA learning, and CCA finds pairs of directions $W^{(i)} = [w_1, w_2, \ldots, w_k]$ and $V^{(i)} = [v_1, v_2, \ldots, v_k]$ for visible and infrared patches respectively, where the superscript (i) denotes the patch index
(or the patch position in the image). Every column of W or V is a unit direction vector, but different columns are not mutually orthogonal. Taking a visible patch p (represented as a column vector by raster scan) at position i as an example, we extract the CCA feature of the patch as
$$f = W^{(i)T} p, \qquad (4)$$
where f is the feature vector of the patch.
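For concreteness, the following sketch illustrates how the regularized, per-patch CCA directions and the feature extraction of Eq. (4) could be computed. It is our own minimal illustration under stated assumptions (zero-mean patch vectors, a whitening/SVD formulation in the spirit of [15]), not the authors' implementation; all function and variable names are ours.

```python
import numpy as np

def regularized_cca(X, Y, k, lam):
    """Regularized CCA for zero-mean data.
    X: p x N visible patches (one raster-scanned patch per column),
    Y: q x N infrared patches, lam: the lambda*I regularizer of Section 3.1.
    Returns W (p x k), V (q x k) and the first k canonical correlations."""
    N = X.shape[1]
    Sxx = X @ X.T / N + lam * np.eye(X.shape[0])
    Syy = Y @ Y.T / N + lam * np.eye(Y.shape[0])
    Sxy = X @ Y.T / N
    # Whiten each block, then take the SVD of the whitened cross-covariance.
    Kx = np.linalg.inv(np.linalg.cholesky(Sxx))   # Kx Sxx Kx^T = I
    Ky = np.linalg.inv(np.linalg.cholesky(Syy))
    U, rho, Vt = np.linalg.svd(Kx @ Sxy @ Ky.T)
    W = Kx.T @ U[:, :k]
    V = Ky.T @ Vt.T[:, :k]
    return W, V, rho[:k]

def extract_features(W, patches):
    """Eq. (4): project raster-scanned patches (columns) onto the CCA directions."""
    return W.T @ patches
```

At every patch position i, a routine of this kind would be run on the training patch pairs of that position to obtain W(i) and V(i).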
Fig. 1. The first row shows the first CCA directions for different λ (we rearrange each direction vector as an image; outlined faces are visible in the first several images), and the second row shows the corresponding reconstruction results. CCA is patch-based as introduced in Section 3.2; we reconstruct the image with 20 CCA variates using Eq. (6). If the largest singular value of the covariance matrix is c, we set (a) λ = c/20; (b) λ = c/100; (c) λ = c/200; (d) λ = c/500; (e) λ = c/5000. It is obvious that when λ is small, the CCA directions tend to be noisy, and the reconstructed face tends to the mean face.
It is somewhat tricky to reconstruct the original patch p from the feature vector f. Since W is not orthogonal, we cannot reconstruct the patch by p = Wf as we do in PCA. However, we can solve the least-squares problem below to obtain the original patch,
$$p = \arg\min_{p} \| W^{T} p - f \|_2^2, \qquad (5)$$
or, adding an energy constraint,
$$p = \arg\min_{p} \| W^{T} p - f \|_2^2 + \| p \|_2^2. \qquad (6)$$
The least-squares problems above can be solved efficiently with the scaled conjugate gradient method. This reconstruction method is feasible only when the feature vector f contains enough information about the original patch. When fewer canonical variates (features) are extracted, we can recover the original patch using the LLE method [10]. As in [6], we assume that the manifold of the feature space and that of the patch space have the same local geometry; then the original patch and its features have the same reconstruction coefficients. If p1, p2,…, pk are the patches whose features f1, f2,…, fk are f's k nearest neighbors, and f can be reconstructed from its neighbors as f = Fw, where F = [f1, f2,…, fk] and w = [w1, w2,…, wk]^T, we can reconstruct the original patch by
$$p = P w, \qquad (7)$$
where P = [p1, p2,…, pk]. The reconstruction results using Eqs. (6) and (7) are shown in Fig. 3(a). With only a few canonical variates at hand, the method of Eq. (7) performs better than Eq. (6); when more canonical variates are available, the two methods give almost equally satisfying results.
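A minimal sketch of the LLE-style recovery of Eq. (7) is given below; the neighbor search in feature space, the unconstrained least-squares solution for the weights, and the function names are our own simplifying assumptions rather than the authors' code.

```python
import numpy as np

def lle_recover_patch(f, train_feats, train_patches, k=5):
    """Recover a visible patch from a feature vector f via Eq. (7).
    train_feats: d x N features of training patches at this patch position,
    train_patches: p x N corresponding raster-scanned patches."""
    # k nearest neighbors of f in feature space (plain Euclidean distance here).
    dists = np.linalg.norm(train_feats - f[:, None], axis=0)
    idx = np.argsort(dists)[:k]
    F = train_feats[:, idx]      # d x k neighboring features
    P = train_patches[:, idx]    # p x k corresponding patches
    # Reconstruction weights w with f ~ F w (least squares).
    w, *_ = np.linalg.lstsq(F, f, rcond=None)
    return P @ w                 # Eq. (7): apply the same weights to the patches
```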
4 Facial Image Conversion Using CCA Features
From the training database we obtain the CCA projection directions at every patch position, and for all patches from all training images we extract features by projecting onto the appropriate directions; thus at each patch position i we get a visible training set $O_v^i = \{f_{v,j}^i\}$ and an infrared one $O_{ir}^i = \{f_{ir,j}^i\}$. Given a new infrared image, we partition it into small patches and obtain the feature vector $f_{ir}$ of every patch. If we can infer the corresponding visible feature vector $f_v$, the visible patch can be obtained using Eq. (7), and the patches can then be combined into a visible facial image. In this section we focus on the prediction of the visible feature vector from the infrared one. Note that the inferences for patches at different positions are based on different training feature sets.
4.1 Reconstruction Through Locally-Linear Regression
Locally-Weighted Regression [8][9] is a method to fit a function of the independent variables locally based on a training set, and it suits our problem well. To simplify the method, we set the weights of the nearest neighbors (NN) equal and use a linear model to fit the function; LWR then degenerates to Locally-Linear Regression (LLR). For an input infrared feature vector $f_{ir}$, we find its K nearest neighbors in the training set $O_{ir}$, which compose a matrix $F_{ir} = [f_{ir,1}, f_{ir,2}, \ldots, f_{ir,K}]$, and their corresponding visible feature vectors compose a matrix $F_v = [f_{v,1}, f_{v,2}, \ldots, f_{v,K}]$. Note that we omit the patch index for convenience. A linear regression then gives the relation matrix,
$$M = \arg\min_{M} \sum_k \| f_{v,k} - M f_{ir,k} \|_2^2 = F_v F_{ir}^{+}, \qquad (8)$$
where $F_{ir}^{+}$ is the pseudo-inverse of $F_{ir}$. The corresponding visible feature vector $f_v$ is inferred from the input infrared feature $f_{ir}$ by
$$f_v = M f_{ir} = F_v F_{ir}^{+} f_{ir}. \qquad (9)$$
To find the nearest neighbors, the distance between two infrared feature vectors $f^T$ and $f^I$ needs to be defined. In this paper, we define the distance as
$$D = \sum_k \rho_k \left( f_k^{T} - f_k^{I} \right), \qquad (10)$$
where $\rho_k$ is the k-th correlation coefficient, and $f_k^{T}$ and $f_k^{I}$ denote the k-th elements of the feature vectors $f^T$ and $f^I$, respectively.
Actually the LLE method used in [6] is equivalent to LLR. The reconstruction coefficients w of the infrared feature $f_{ir}$ from its K nearest neighbors $F_{ir}$ can be obtained by solving the least-squares problem
$$w = \arg\min_{w} \| F_{ir} w - f_{ir} \| = F_{ir}^{+} f_{ir}. \qquad (11)$$
The reconstructed corresponding visible feature vector is then $f_v = F_v F_{ir}^{+} f_{ir}$, which has the same form as Eq. (9). The difference between the two methods is the choice of the number K of nearest neighbors. In LLR, to make the regression sensible, we select a large K to ensure that $F_{ir} \in \mathbb{R}^{m \times n}$ has more columns than rows ($m < n$) and $F_{ir}^{+} = (F_{ir}^{T} F_{ir})^{-1} F_{ir}^{T}$. The reconstruction results are shown in Fig. 2(b)(e). The LLR method gives better results, but consumes more resources since it needs to find a larger number of nearest neighbors. In the next section we extend LLE to a Sophisticated LLE which achieves results competitive with LLR while using approximately the same resources as LLE.
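The following sketch shows one way the LLR prediction of Eqs. (8)–(9) could be realized; it is an illustration only. In particular, the neighbor search uses the magnitudes of the ρ-weighted differences as a practical stand-in for the distance of Eq. (10), and the choice of K and all names are our assumptions.

```python
import numpy as np

def llr_predict(f_ir, train_ir, train_vis, rho, K=50):
    """Predict a visible feature vector from an infrared one (Eqs. (8)-(9)).
    train_ir, train_vis: d x N paired feature sets at one patch position,
    rho: canonical correlation coefficients used to weight the distance."""
    # rho-weighted distance between f_ir and every training infrared feature.
    diffs = train_ir - f_ir[:, None]
    dists = np.abs(rho[:, None] * diffs).sum(axis=0)
    idx = np.argsort(dists)[:K]
    F_ir = train_ir[:, idx]      # d x K neighboring infrared features
    F_v = train_vis[:, idx]      # d x K corresponding visible features
    # f_v = F_v F_ir^+ f_ir, Eq. (9); pinv covers both the LLR and the LLE regime.
    return F_v @ np.linalg.pinv(F_ir) @ f_ir
```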
4.2 Reconstruction Via Sophisticated LLE
The reason for the poor performance of the LLE method may be that the local geometries of the two manifolds of visible and infrared features are not the same. We use an experiment to demonstrate this. For every infrared feature vector $f_{ir}^i$ in the training set we find its four nearest neighbors $\{f_{ir}^1, f_{ir}^2, f_{ir}^3, f_{ir}^4\}$ (the neighbors are ordered by decreasing distance to $f_{ir}^i$, the same below) whose convex hull (a tetrahedron) contains the infrared feature, but their visible counterparts, the visible feature vector $f_v^i$ and its neighbors $\{f_v^1, f_v^2, f_v^3, f_v^4\}$, do not preserve the same geometric relations. Moreover, more than 90 percent of the $f_v^i$ lie outside the convex hulls of the corresponding neighbors. It is a natural idea to learn the change between the two local geometries of the two manifolds. Since the local geometry is represented by the reconstruction coefficients, we only need to learn the mapping H(·) between the infrared and the visible reconstruction coefficients, denoted x and y respectively, with y = H(x). Since we have a training database at hand, we collect the pairs of reconstruction coefficient vectors (x1, y1),…,(xN, yN), which reconstruct the feature vectors of infrared and visible patches respectively. We could obtain the function H(·) between them using the least-squares method. Alternatively, a simpler algorithm can be used that does not require the form of H(·) to be known. For an input feature vector $f_{ir}^i$, we compute its reconstruction coefficients $x^i$ using its k nearest neighbors in the infrared feature space. What we want is the coefficient vector $y^i$ that reconstructs the visible feature $f_v^i$ corresponding to $f_{ir}^i$. We find the coefficient vector $x^{i\prime}$ in the infrared coefficient dataset that is most similar to $x^i$, and we regard the corresponding visible coefficient vector $y^{i\prime}$ as an estimate of $y^i$. We call this method Sophisticated LLE.
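The simpler, non-parametric variant of Sophisticated LLE described above might be sketched as follows; the coefficient lookup and the final reconstruction step reflect our reading of the description, and the data layout is assumed, not taken from the paper.

```python
import numpy as np

def sophisticated_lle_predict(f_ir, train_ir, train_vis, coeff_ir, coeff_vis, k=4):
    """Sketch of Sophisticated LLE (Section 4.2).
    coeff_ir / coeff_vis: precomputed reconstruction-coefficient vectors
    (one column per training sample) for the infrared and visible features."""
    # Reconstruction coefficients x of f_ir from its k infrared neighbors.
    d = np.linalg.norm(train_ir - f_ir[:, None], axis=0)
    idx = np.argsort(d)[:k]
    F = train_ir[:, idx]
    x, *_ = np.linalg.lstsq(F, f_ir, rcond=None)
    # Most similar infrared coefficient vector in the training set; borrow its
    # visible counterpart y as the estimate of the visible coefficients.
    j = np.argmin(np.linalg.norm(coeff_ir - x[:, None], axis=0))
    y = coeff_vis[:, j]
    # Apply y to the visible counterparts of the infrared neighbors.
    return train_vis[:, idx] @ y
```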
5 Experimental Results
We use the publicly available database collected by Equinox Corporation [14] for our experiments. We select 70 subjects from the database, and each subject has 10 pairs of
visible and infrared images with different expressions. The long-wave thermal infrared images are used because the points of the image pairs are well matched, even though they have lower resolution than the middle-wave images. All the images have been manually registered to guarantee that the eye centers and the mouth centers are aligned. Some image pairs from the data set are shown in Fig. 2(a)(g). We test our algorithm on the training set using the leave-one-out scheme, i.e., one pair is taken out of the database as the test images (the infrared image as the input and the visible image as the ground truth); all the pairs of the same subject are removed from the database as well, and the remaining pairs are taken as the training data.
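A schematic of this leave-one-out protocol is shown below; the driver, the convert and evaluate callbacks, and their signatures are hypothetical names introduced only for illustration.

```python
import numpy as np

def leave_one_out(pairs, subject_ids, convert, evaluate):
    """pairs: list of (ir_image, vis_image); subject_ids: subject label per pair.
    convert(ir, training_pairs) -> predicted visible image (hypothetical),
    evaluate(prediction, ground_truth) -> scalar error (hypothetical)."""
    errors = []
    for i, (ir, vis_gt) in enumerate(pairs):
        # Remove every pair of the test subject from the training data.
        train = [p for p, s in zip(pairs, subject_ids) if s != subject_ids[i]]
        errors.append(evaluate(convert(ir, train), vis_gt))
    return np.mean(errors)
```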
Fig. 2. The results of face image conversion from thermal infrared images. (a) the input infrared image; (b) the result of our method with LLR prediction, using 5 canonical variates for each patch; (c) the result of our method with Sophisticated LLE prediction; (d) the face reconstructed using 5 canonical features for each patch extracted from the ground truth; (e) the result of directly using the LLE method; (f) the result of the holistic method; (g) the ground truth.
There are several parameters to choose in our algorithm, such as the size of the patches, the number of canonical variates k (the dimensionality of the feature vector) we take for every patch, and the number of neighbors we use to train the canonical directions. Generally speaking, the correlation between pairs of infrared and visible patches of a smaller size is weaker, so the inference is less reasonable; a larger size makes the correlation stronger, but more canonical variates are needed to represent the patch, which makes the training samples much sparser in the feature space. The size of the images in our database is 110×86, and we choose a patch size of 9×9 with a 3-pixel overlap. Since the projections (features) onto the first pairs of directions have stronger correlations, choosing fewer features makes the inference more robust, while choosing more features gives a more accurate representation of the original patch. Similarly, when we choose a larger number of neighbors K, there are more samples, which makes the algorithm more robust but time-consuming. We choose 2–8 features and 30–100 neighbors for LLR, and the results differ only slightly. We have compared our method with other existing algorithms such as LLE and the holistic method. The results in Fig. 2 show that our method is capable of preserving the global facial structure and capturing some detailed facial features such as wrinkles, mustache, and the boundary of the nose. Our algorithm is also robust to facial expressions,
Fig. 3. (a) The comparison of reconstruction results using Eq(6) and Eq(7). The first row is the ground truth; the second row is the face reconstructed using Eq(7), and the third row using Eq(6). 5 canonical variates taken from the ground truth are used for each patch. It is clear that the reconstructions of Eq(7) contain more information than those of Eq(6). (b) The face image conversion results with different expressions of the same subject. The first column is the input infrared image; the second column is our conversion result; the third column is the reconstruction result using the canonical variates extracted from the ground truth; the last column is the ground truth.
as shown in Fig. 3(b). The prediction methods proposed in Sections 4.1 and 4.2 give slightly different results, as shown in Fig. 2(b)(c). Although our method is effective, there are still differences between our results and the ground truth. Two key points account for this. First, the correspondence between the visible and the infrared images is a many-to-many mapping, and infrared images contain less information than visible images. Second, our method obtains the optimal result only in a statistical sense.
6 Conclusion and Future Work
In this paper we have developed an algorithm to render visible facial images from thermal infrared images using canonical variates. Given an input thermal infrared image, we partition it into small patches, and for every patch we extract the CCA features. The features of the corresponding visible patch are then inferred by LLR or by Sophisticated LLE, and the visible patch is reconstructed by LLE using Eq. (7) from the inferred features. We use CCA to extract features, which makes the correlation in the feature space much stronger than that in the patch space, and using LLE to reconstruct the original patch from the inferred features recovers some information lost in the infrared patch and in the feature-extraction process. The experiments show that our algorithm is effective. Though it cannot recover visible images identical to the ground truth, because infrared images carry less information, it does preserve some characteristics of the ground truth such as the expression. Future work includes: (1) applying the method to infrared face recognition to improve the recognition rate, since it recovers some information lost in infrared images; (2) making the method more robust to poorly registered images.
Acknowledgments The authors would like to thank the anonymous reviewers for their constructive comments, which have contributed to a vast improvement of the paper. This work is supported by research funds of NSFC No.60572043 and the NKBRPC No.2004CB318005.
References
1. Zhao, W., Chellappa, R., Rosenfeld, A., Phillips, P.J.: Face Recognition: A Literature Survey. ACM Computing Surveys 35, 399–459 (2003)
2. Kong, S.G., Heo, J., Abidi, B.R., Paik, J., Abidi, M.A.: Recent Advances in Visual and Infrared Face Recognition—A Review. Computer Vision and Image Understanding 97, 103–135 (2005)
3. Bebis, G., Gyaourova, A., Singh, S., Pavlidis, I.: Face Recognition by Fusing Thermal Infrared and Visible Imagery. Image and Vision Computing 24, 727–742 (2006)
4. Heo, J., Kong, S.G., Abidi, B.R., Abidi, M.A.: Fusion of Visual and Thermal Signatures with Eyeglass Removal for Robust Face Recognition. In: Proc. of CVPRW 2004, vol. 8, pp. 122–127 (2004)
5. Baker, S., Kanade, T.: Limits on Super-Resolution and How to Break Them. IEEE Trans. on Pattern Analysis and Machine Intelligence 24, 1167–1183 (2002)
6. Chang, H., Yeung, D.Y., Xiong, Y.: Super-Resolution Through Neighbor Embedding. In: Proc. of CVPR 2004, vol. 1, pp. 275–282 (2004)
7. Freeman, W.T., Pasztor, E.C., Carmichael, O.T.: Learning Low-Level Vision. International Journal of Computer Vision 40, 25–47 (2000)
8. Cleveland, W.S., Devlin, S.J.: Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting. Journal of the American Statistical Association 83(403), 596–610 (1988)
9. Shakhnarovich, G., Viola, P., Darrell, T.: Fast Pose Estimation with Parameter-Sensitive Hashing. In: Proc. of ICCV 2003, vol. 2, pp. 750–757 (2003)
10. Roweis, S.T., Saul, L.K.: Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 290, 2323–2326 (2000)
11. Melzer, T., Reiter, M., Bischof, H.: Appearance Models Based on Kernel Canonical Correlation Analysis. Pattern Recognition 36, 1961–1971 (2003)
12. Reiter, M., Donner, R., Langs, G., Bischof, H.: 3D and Infrared Face Reconstruction from RGB Data Using Canonical Correlation Analysis. In: Proc. of ICPR 2006, vol. 1, pp. 425–428 (2006)
13. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Trans. on Pattern Analysis and Machine Intelligence 19, 711–720 (1997)
14. Socolinsky, D.A., Selinger, A.: A Comparative Analysis of Face Recognition Performance with Visible and Thermal Infrared Imagery. In: Proc. of ICPR 2002, vol. 4, pp. 217–222 (2002)
15. Weenink, D.: Canonical Correlation Analysis. In: IFA Proceedings, vol. 25, pp. 81–99 (2003)
Recognition of Digital Images of the Human Face at Ultra Low Resolution Via Illumination Spaces
Jen-Mei Chang¹, Michael Kirby¹, Holger Kley¹, Chris Peterson¹, Bruce Draper², and J. Ross Beveridge²
¹ Department of Mathematics, Colorado State University, Fort Collins, CO 80523-1874, U.S.A. {chang,kirby,kley,peterson}@math.colostate.edu
² Department of Computer Science, Colorado State University, Fort Collins, CO 80523-1873, U.S.A. {draper,ross}@math.colostate.edu
Abstract. Recent work has established that digital images of a human face, collected under various illumination conditions, contain discriminatory information that can be used in classification. In this paper we demonstrate that sufficient discriminatory information persists at ultra-low resolution to enable a computer to recognize specific human faces in settings beyond human capabilities. For instance, we utilized the Haar wavelet to modify a collection of images to emulate pictures from a 25-pixel camera. From these modified images, a low-resolution illumination space was constructed for each individual in the CMU-PIE database. Each illumination space was then interpreted as a point on a Grassmann manifold. Classification that exploited the geometry on this manifold yielded error-free classification rates for this data set. This suggests the general utility of a low-resolution illumination camera for set-based image recognition problems.
1 Introduction
The face recognition problem has attracted substantial interest in recent years. As an academic discipline, face recognition has progressed by generating large galleries of images collected with various experimental protocols and by assessing the efficacy of new algorithms in this context. A number of proposed face recognition algorithms have been shown to be effective under controlled conditions. However, in the field, where data acquisition is essentially uncontrolled,
This study was partially supported by the National Science Foundation under award DMS-0434351 and the DOD-USAF-Office of Scientific Research under contract FA9550-04-1-0094. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or the DOD-USAF-Office of Scientific Research.
the performance of these algorithms typically degrades. In particular, variations in the illumination of subjects can significantly reduce the accuracy of even the best face recognition algorithms. A traditional approach in the face recognition literature has been to normalize illumination variations out of the problem using techniques such as nonlinear histogram equalization or image quotient methods [1]. While such approaches do indeed improve recognition, as demonstrated on the FERET, Yale and PIE databases, they do not exploit the fact that the response of a given subject to variation in illumination is idiosyncratic [2] and hence can be used for discrimination. This work builds on the observation that in the vector space generated by all possible digital images collected by a digital camera at a fixed resolution, the images of a fixed, Lambertian object under varying illuminations lie in a convex cone [3] which is well approximated by a relatively low dimensional linear subspace [4,5]. In our framework, we associate to a set of images of an individual their linear span, which is in turn represented, or encoded, by a point on a Grassmann manifold. This approach appears to be useful for the general problem of comparing sets of images [6]. In the context of face recognition our objective is to compare a set of images associated with subject 1 to a set of images associated with subject 2 or to a different set of images of subject 1. The comparison of any two sets of images is accomplished by constructing a subspace in the linear span of each that optimizes the ability to discriminate between the sets. As described in [2], a sequence of nested subspaces may be constructed for this purpose using principal vectors computed to reveal the geometric relationship between the linear spans of each subject. This approach provides an immediate pseudo-metric for set-to-set comparison of images. In an application to the images in the CMU-PIE Database [7] and Yale Face Database B [4], we have previously established that the data are Grassmann separable [2], i.e., when distances are computed between sets of images using their encoding as points on an appropriately determined Grassmann manifold the subjects are all correctly identified. The CMU-PIE database consists of images of 67 individuals. While the Grassmann separability of a database of this size is a significant, positive result, it is important to understand the general robustness of this approach. For example, the application of the methodology to a larger data set is of critical interest. In the absence of such data, however, we propose to explore a related question: as we reduce the effective resolution of the images of the 67 individuals which make up the CMU-PIE database, does Grassmann separability persist? The use of multiresolution analysis to artificially reduce resolution introduces another form of nested approximation into the problem that is distinct from that described above. We observe that facial imagery at ultra low resolutions is typically not recognizable or classifiable by human operators. Thus, if Grassmann separability persists at ultra low resolution, we can envision large private databases of facial imagery, stored at a resolution that is sufficiently low to prevent recognition by a human operator yet sufficiently high to enable machine recognition and classification via the Grassmann methods described in Section 2.
Accordingly, the purpose of this paper is to explore the idiosyncratic nature of digital images of a face under variable illumination conditions at extremely low resolutions. In Section 2 we discuss the notion of classification on a Grassmann variety and a natural pseudo-metric that arises in the context of Schubert varieties. In Section 3 we extend these ideas to the context of a sequence of nested subspaces generated by a multiresolution analysis. Results of this approach applied to the CMU-PIE database are presented in Section 4. We contrast our approach with other methods in Section 5 and discuss future research directions in Section 6.
2 Classification on Grassmannians
The general approach to the pattern classification problem is to compare labeled instances of data to new, unlabeled exemplars. Implementation in practice depends on the nature of the data and the method by which features are extracted from the data and used to create a representation optimized for classification. We consider the case that an observation of a pattern produces a set of digital images at some resolution. This consideration is a practical one, since the accuracy of a recognition scheme that uses a single input image is significantly reduced when images are subject to variations, such as occlusion and illumination [8]. Now, the linear span of the images is a vector subspace of the space of all possible images at the given resolution, and thus corresponds to a point on a Grassmann manifold. More precisely, let k (generally independent) images of a given subject be grouped together to form a data matrix X with each image stored as a column of X. If the column space of X, R(X), has rank k and if n denotes the image resolution, then R(X) is a k-dimensional vector subspace of $\mathbb{R}^n$, which is a point on the Grassmann manifold G(k, n). See Fig. 1 for a graphical illustration of this correspondence. Specifically, the real Grassmannian (Grassmann manifold) G(k, n) parameterizes k-dimensional vector subspaces of the n-dimensional vector space $\mathbb{R}^n$. Naturally, this parameter space is suitable for subspace-based algorithms. For example, the Grassmann manifold is used in [9] when searching for a geometrically invariant subspace of a matrix under full rank updates. An optimization over the Grassmann manifold is proposed in [10] to solve a general object recognition problem. In the case of face recognition, by realizing sets of images as points on the Grassmann manifold, we can exploit the geometries imposed by individual metrics (drawn from a large class of metrics) in computing distances between these sets of images. With respect to the natural structure of a Riemannian manifold that the Grassmannian inherits as a quotient space of the orthogonal group, the geodesic distance between two points $A, B \in G(k, n)$ (i.e., two k-dimensional subspaces of $\mathbb{R}^n$) is given by $d_k(A, B) = \|(\theta_1, \ldots, \theta_k)\|_2$, where $\theta_1 \le \theta_2 \le \cdots \le \theta_k$ are the principal angles between the subspaces A and B. The principal angles are readily computed using an SVD-based algorithm [11].
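As an illustration of the SVD-based computation mentioned above, the following sketch (ours, not the authors') computes the principal angles between two image sets from orthonormal bases of their column spans, and the geodesic distance d_k as the 2-norm of those angles.

```python
import numpy as np

def principal_angles(X, Y):
    """Principal angles (radians, increasing) between span(X) and span(Y);
    the columns of X and Y are vectorized images."""
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    s = np.clip(np.linalg.svd(Qx.T @ Qy, compute_uv=False), -1.0, 1.0)
    return np.sort(np.arccos(s))

def geodesic_distance(X, Y):
    """d_k(A, B) = ||(theta_1, ..., theta_k)||_2 for A = span(X), B = span(Y)."""
    return np.linalg.norm(principal_angles(X, Y))
```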
Fig. 1. Illustration of the Grassmann method, where each set of images may be viewed as a point on the Grassmann manifold by computing an orthonormal basis associated with the set
Principal angles between subspaces are defined regardless of the dimensions of the subspaces. Thus, inspired by the Riemannian geometry of the Grassmannian, we may define, for any vector subspaces A, B of $\mathbb{R}^n$, $d_\ell(A, B) = \|(\theta_1, \ldots, \theta_\ell)\|_2$, for any $\ell \le \min\{\dim A, \dim B\}$. While $d_\ell$ is not, strictly speaking, a metric (for example, if $\dim A \cap B \ge \ell$, then $d_\ell(A, B) = 0$), it nevertheless provides an efficient and useful tool for analyzing configurations in $\bigcup_{k \ge \ell} G(k, n)$. For points on a fixed Grassmannian G(k, n), the geometry driving these distance measures is captured by a type of Schubert variety $\bar{\Omega}_\ell(W) \subseteq G(k, n)$. More specifically, let W be a subspace of $\mathbb{R}^n$; then we define $\bar{\Omega}_\ell(W) = \{E \in G(k, n) \mid \dim(E \cap W) \ge \ell\}$. With this notation, $d_\ell(A, B)$ simply measures the geodesic distance between A and $\bar{\Omega}_\ell(B)$, i.e., $d(A, \bar{\Omega}_\ell(B)) = \min\{d_k(A, C) \mid C \in \bar{\Omega}_\ell(B)\}$ (it is worth noting that under this interpretation, $d_\ell(A, B) = d_\ell(B, A)$).
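The truncated pseudo-metric d_ell then only requires the smallest ell angles; a small sketch, reusing principal_angles from the previous code block (an assumption of ours), is:

```python
import numpy as np

def d_ell(X, Y, ell):
    """Pseudo-metric d_ell: 2-norm of the ell smallest principal angles
    between span(X) and span(Y)."""
    theta = principal_angles(X, Y)   # defined in the sketch above
    return np.linalg.norm(theta[:ell])
```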
3 Resolution Reduction
3.1 Multiresolution Analysis and the Nested Grassmannians
Multiresolution analysis (MRA) works by projecting data in a space V onto a sequence of nested subspaces · · · ⊂ Vj+1 ⊂ Vj ⊂ Vj−1 ⊂ · · · ⊂ V0 = V. The subspaces Vj represent the data at decreasing resolutions and are called scaling subspaces or approximation subspaces. The orthogonal complements Wj to
Vj in Vj−1 are the wavelet subspaces and encapsulate the error of approximation at each level of decreased resolution. For each j, we have an isomorphism
$$\phi^j : V_{j-1} \xrightarrow{\;\sim\;} V_j \oplus W_j.$$
Let $\pi^j : V_j \oplus W_j \to V_j$ denote projection onto the first factor and let $\psi^j = \pi^j \circ \phi^j$ (thus $\psi^j : V_{j-1} \to V_j$). This single level of subspace decomposition is represented by the commutative diagram in Fig. 2(a). Let G(k, V) denote the Grassmannian of k-dimensional subspaces of a vector space V. Suppose that V, V′ are vector spaces and that f : V → V′ is a linear map. Let ker(f) denote the kernel of f, let dim(A) denote the dimension of the vector space A, and let G(k, V)° = {A ⊂ V | dim(A) = k and A ∩ ker(f) = 0}. If k + dim ker(f) ≤ dim V, then G(k, V)° is a dense open subset of G(k, V) and almost all points in G(k, V) are in G(k, V)°. Now if A ∩ ker(f) = 0, then dim f(A) = dim A, so f induces a map $f_k^\circ : G(k, V)^\circ \to G(k, V')$. Furthermore, if f is surjective, then so is $f_k^\circ$. The linear maps of the MRA shown in (a) of Fig. 2 thus induce the maps between Grassmannians shown in (b) of the same figure. Finally, we observe that if A, B are vector subspaces of V, then dim(A ∩ B) = dim(f(A) ∩ f(B)) if and only if (A + B) ∩ ker(f) = 0. In particular, when (A + B) ∩ ker(f) = 0 and $\ell \le \min\{\dim A, \dim B\}$, then $d_\ell(A, B) = 0$ if and only if $d_\ell(f(A), f(B)) = 0$. From this vantage point, we consider the space spanned by a linearly independent set of k images in their original space on the one hand, and the space spanned by their reduced-resolution projections on the other hand, as points on corresponding Grassmann manifolds. Distances between pairs of sets of k linearly independent images or their low-resolution emulations can then be computed using the pseudo-metrics $d_\ell$ on these Grassmann manifolds. The preceding observation suggests the possibility that, for resolution-reducing projections, spaces which were separable by $d_\ell$ remain separable after resolution reduction. Of course, taken to an extreme, this statement can no longer hold true. It is therefore of interest to understand the point at which separability fails.
3.2 Image Resolution Reduction
In a 2-dimensional Discrete Wavelet Transform (DWT), the columns and rows of an image I each undergo a 1-dimensional wavelet transform. After a single level of a 2-dimensional DWT on an image I of size m-by-n, one obtains four sub-images of dimension m/2-by-n/2. If we consider each row and column of I as a 1-dimensional signal, then the approximation component of I is obtained by a low-pass filter on the columns, then a low-pass filter on the rows, sampled on a dyadic grid.
(Fig. 2 diagrams: (a) the maps $\phi^j : V_{j-1} \to V_j \oplus W_j$, $\pi^j : V_j \oplus W_j \to V_j$, and $\psi^j : V_{j-1} \to V_j$; (b) the induced maps $(\phi^j)^\circ_k$, $(\pi^j)^\circ_k$, $(\psi^j)^\circ_k$ between $G(k, V_{j-1})^\circ$, $G(k, V_j \oplus W_j)^\circ$, and $G(k, V_j)$.)
Fig. 2. (a) Projection maps between scaling and wavelet subspaces for a single level of wavelet decomposition. (b) Projection maps between nested Grassmannians for a single level of decomposition.
The other three sub-images are obtained in a similar fashion and, collectively, they are called the detail component of I. The approximation component of an image after a single level of wavelet decomposition with the Haar wavelet is equivalent to averaging the columns, then the rows. See Fig. 3 for an illustration of the sub-images obtained from a single level of Haar wavelet analysis. To use wavelets to compress a signal, we sample the approximation and detail components on a dyadic grid, that is, we keep only one out of every two wavelet coefficients at each step of the analysis. The approximation component of the signal, A_j, after j iterations of decomposition and down-sampling, serves as the image at level j with resolution m/2^j-by-n/2^j. In the subsequent discussion, we present results obtained using the approximation subspaces; however, similar results are observed using the wavelet subspaces.
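A minimal way to emulate the reduced-resolution camera, exploiting the equivalence of the Haar approximation with 2×2 averaging noted above, is sketched below; it assumes image dimensions divisible by 2 at every level (odd sizes, such as the 80×69 level, would need cropping or padding), and it is not the authors' code.

```python
import numpy as np

def haar_approx(img, levels):
    """Approximation component after `levels` steps of Haar analysis,
    implemented as repeated (normalized) 2x2 block averaging."""
    a = np.asarray(img, dtype=float)
    for _ in range(levels):
        a = 0.25 * (a[0::2, 0::2] + a[1::2, 0::2] + a[0::2, 1::2] + a[1::2, 1::2])
    return a
```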
(a) Original  (b) LL  (c) HL  (d) LH  (e) HH
Fig. 3. An illustration of the sub-images from a single level of Haar wavelet analysis on an image in CMU-PIE. From left to right: original image, approximation, horizontal, vertical, and diagonal detail.
4 Results: A 25-Pixel Camera
The experiment presented here follows the protocols set out in [2], where it was established that CMU-PIE is Grassmann separable. This means that, using one of the distances $d_\ell$ on the Grassmannian, the distance between an estimated illumination space of a subject and another estimated illumination space of the same subject is always less than the distance to an estimated illumination space of
any different subject. In this new experiment we address the question of whether this idiosyncratic nature of the illumination spaces persists at significantly reduced resolutions. As described below, we empirically test this hypothesis by calculating distances between pairs of scaling subspaces. The PIE database consists of digital imagery of 67 people under different poses, illumination conditions, and expressions. The work presented here concerns only illumination variations, thus only frontal images are used. For each of the 67 subjects in the PIE database, 21 facial images were taken under lighting from distinct point light sources, both with ambient lights on and off. The results of the experiments performed on the ambient lights off data is summarized in Fig. 4. The results obtained by running the same experiment on illumination data collected under the presence of ambient lighting were not significantly different. For each of the 67 subjects, we randomly select two disjoint sets of 10 images to produce two 10-dimensional estimates of the illumination space for the subject. Two estimated spaces for the same subject are called matching subspaces, while estimated subspaces for two distinct subjects are called non-matching subspaces. The process of random selection is repeated 10 times to generate a total of 670 matching subspaces and 44,220 non-matching subspaces. We mathematically reduce the resolution of the images using the Haar wavelet, effectively emulating a camera with a reduced number of pixels at each step. As seen in Fig. 5, variations in illumination appear to be retained at each level of resolution, suggesting that the idiosyncratic nature of the illumination subspaces might be preserved. At the fifth level of the MRA the data corresponds to that which would have been captured by a camera with 5 × 5 pixels. We observe that at this resolution the human eye can no longer match an image with its subject. The separability of CMU-PIE at ultra low resolution is verified by comparing the distances between the matching to the non-matching subspaces as points on a Grassmann manifold. When the largest distance between any two matching subspaces is less than the smallest distance between any two non-matching subspaces, the data is called Grassmann separable. This phenomenon can be observed in Fig. 4. The three lines of the box in the box whisker plot shown in Fig. 4 represent the lowest quartile, median, and upper quartile values. The whiskers are lines extending from each end of the box to show the extent of the rest of the data and outliers are data with values beyond the ends of the whiskers. Using d1 , i.e., a distance based on only one principal angle, we observe a significant separation gap between the largest and smallest distance of the matching and non-matching subspaces throughout all levels of MRA. Specifically, the separation gap between matching and non-matching subspaces is approximately 16◦ , 18◦ , 17◦ , 14◦ , 8◦ , and 0.17◦ when subspaces are realized as points in G(10, 22080), G(10, 5520), G(10, 1400), G(10, 360), G(10, 90), and G(10, 25), respectively. Note that the non-decreasing trend of the separation gap is due to the random selection of the illumination subspaces.
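The separability test described above can be summarized by the following sketch, which uses only the minimal principal angle (d_1); the data layout and function names are our own choices, not the authors' code.

```python
import numpy as np

def min_principal_angle(X, Y):
    """Smallest principal angle, in degrees, between span(X) and span(Y)."""
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    s = np.clip(np.linalg.svd(Qx.T @ Qy, compute_uv=False), -1.0, 1.0)
    return np.degrees(np.arccos(s.max()))

def separation_gap(subspaces, labels):
    """subspaces: list of n x 10 image matrices; labels: subject id per entry.
    Returns (largest matching distance, smallest non-matching distance)."""
    match, nonmatch = [], []
    for i in range(len(subspaces)):
        for j in range(i + 1, len(subspaces)):
            d = min_principal_angle(subspaces[i], subspaces[j])
            (match if labels[i] == labels[j] else nonmatch).append(d)
    return max(match), min(nonmatch)
```

Grassmann separability, as used here, corresponds to the largest matching distance being smaller than the smallest non-matching distance.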
(Fig. 4 panels, left to right: Original, LL1, LL2, LL3, LL4, LL5; vertical axes in degrees; box pairs: match and nonmatch.)
Fig. 4. Box whisker plot of the minimal principal angles of the matching and nonmatching subspaces. Left to right: original (resolution 160×138), level 1 Haar wavelet approximation (80×69), level 2 (40×35), level 3 (20×18), level 4 (10×9), level 5 (5×5). Perfect separation of the matching and non-matching subspaces is observed throughout all levels of MRA.
As expected, the separation gap given by the minimal principal angle between the matching and non-matching subspaces decreases as we reduce resolution, but never to the level where points on the Grassmann manifold are misclassified. In other words, individuals can be recognized at ultra-low resolutions provided they are represented by multiple image sets taken under a variety of illumination conditions. It is curious to see if similar outcomes can be observed when using unstructured projections, e.g., random projections, to embed subject illumination subspaces into spaces of significantly reduced dimensions. To test this, we repeated
Fig. 5. Top to bottom: 4 distinct illumination images of subjects 04006 (a) and 04007 (b) in CMU-PIE; level 1 to level 5 approximation obtained from applying 2D discrete Haar wavelet transform to the top row
the experiments described above in this new setting. Subject illumination subspaces in their original level of resolution were projected onto low dimensional spaces via randomly determined linear transformations. Error statistics were collected by repeating the experiment 100 times. Perfect separation between matching and non-matching subspaces occurred when subject illumination subspaces were projected onto random 35-dimensional subspaces. This validates the use of digital images at ultra low resolution and emphasizes the importance of illumination variations in the problem of face recognition. Furthermore, while unstructured projections perform surprisingly well in the retention of idiosyncratic information, structured projections that exploit similarities of neighboring pixels allow perfect recognition results at even lower resolutions. We remark that the idiosyncratic nature of the illumination subspaces can be found not only in the scaling subspaces, but also in the wavelet subspaces. Indeed, we observed perfect separation using the minimal principal angle in almost all scales of the wavelet subspaces.
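The unstructured-projection experiment could be emulated as follows; the Gaussian projection matrix, the reuse of separation_gap from the previous sketch, and the trial count are our assumptions.

```python
import numpy as np

def random_projection_test(subspaces, labels, dim=35, trials=100, seed=0):
    """Project each n x 10 illumination subspace basis through a random linear
    map R^n -> R^dim and check whether the match/non-match gap survives."""
    rng = np.random.default_rng(seed)
    n = subspaces[0].shape[0]
    successes = 0
    for _ in range(trials):
        P = rng.standard_normal((dim, n))
        projected = [P @ X for X in subspaces]
        max_match, min_nonmatch = separation_gap(projected, labels)  # see sketch above
        successes += (max_match < min_nonmatch)
    return successes / trials
```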
5 Related Work
A variety of studies consider the roles of data resolution and face recognition, including [12,13,14,15,16]. A common feature of these studies is the practice of using single to single image comparison in the recognition stage (with the exception of [16]). Among the techniques used to train the algorithms are PCA, LDA, ICA, Neural Network, and Radial Basis Functions. Some of the classifiers used are correlation, similarity score, nearest neighbor, neural network, tangent distance, and multiresolution tangent distance. If variation in illumination is present in the data set, it is removed by either histogram equalization [17] or morphological nonlinear filtering [18]. Except in [16], the variation of illumination was treated as noise and eliminated in the preprocessing stage before the classification takes place. In a more related study, Vasconcelos and Lippman proposed the use of transformation invariant tangent distance embedded in the multiresolution framework [16]. Their method, based on the (2-sided) tangent distance between manifolds, is referred to as the multiresolution tangent distance (MRTD) and is similar to our approach in that it requires a set-to-set image comparison. It is also postulated that the use of a multiresolution framework preserves the global minima that are needed in the minimization problems associated with computing tangent distances. The results of [16], however, are that when the only variation in the data is illumination, the performance of MRTD is inferior to that of the normal tangent distance and Euclidean distance. Hence, it appears that the framework of [16] does not sufficiently detect the idiosyncratic nature of illumination at low resolutions. In summary, we have presented an algorithm for classification of image sets that requires no training and retains its high performance rates even at extremely low resolution. To our knowledge, no other algorithm has claimed to have achieved perfect separability of the CMU-PIE database at ultra low resolution.
6 Discussion
We have shown that a mathematically emulated ultra low-resolution illumination space is sufficient to classify the CMU-PIE database when a data point is a set of images under varying illuminations, represented by a point on a Grassmann manifold. We assert that this is only possible because the idiosyncratic nature of the response of a face to varying illumination, as captured in digital images, persists at ultra low resolutions. This is perhaps not so surprising given that the configuration space of a 25-pixel camera consists of $256^{25}$ different images and we are comparing only 67 subjects using some 20 total instances of illumination. The representation space is very large compared to the amount of data being stored. Furthermore, the reduction of resolution that was utilized takes advantage of similarities of neighboring pixels. The algorithm introduced here is computationally fast and can be implemented efficiently. In fact, on a 2.8 GHz AMD Opteron processor, it takes approximately 0.000218 seconds to compute the distance between a pair of 25-pixel 10-dimensional illumination subspaces. The work presented here provides a blueprint for a low-resolution illumination camera to capture images and a framework in which to match them with low-resolution sets in a database. Future work will focus on evaluating this approach on a much larger data set that contains more subjects and more variations. The Grassmann method has shown promising results in a variety of face recognition problems [6,19,2]; we intend to examine the effect of resolution reduction on the accuracy of the algorithm with a range of variations, such as viewpoint and expressions.
References
1. Riklin-Raviv, T., Shashua, A.: The quotient image: Class based re-rendering and recognition with varying illuminations. PAMI 23(2), 129–139 (2001)
2. Chang, J.M., Beveridge, J., Draper, B., Kirby, M., Kley, H., Peterson, C.: Illumination face spaces are idiosyncratic. In: International Conference on Image Processing & Computer Vision, vol. 2, pp. 390–396 (June 2006)
3. Belhumeur, P., Kriegman, D.: What is the set of images of an object under all possible illumination conditions. IJCV 28(3), 245–260 (1998)
4. Georghiades, A., Belhumeur, P., Kriegman, D.: From few to many: Illumination cone models for face recognition under variable lighting and pose. PAMI 23(6), 643–660 (2001)
5. Basri, R., Jacobs, D.: Lambertian reflectance and linear subspaces. PAMI 25(2), 218–233 (2003)
6. Chang, J.M., Kirby, M., Kley, H., Beveridge, J., Peterson, C., Draper, B.: Examples of set-to-set image classification. In: Seventh International Conference on Mathematics in Signal Processing Conference Digest, The Royal Agricultural College, Cirencester, Institute for Mathematics and its Applications, pp. 102–105 (December 2006)
7. Sim, T., Baker, S., Bsat, M.: The CMU pose, illumination, and expression database. PAMI 25(12), 1615–1618 (2003)
8. Yamaguchi, O., Fukui, K., Maeda, K.: Face recognition using temporal image sequence. In: AFGR, pp. 318–323 (1998)
9. Smith, S.: Subspace tracking with full rank updates. In: The 31st Asilomar Conference on Signals, Systems & Computers, vol. 1, pp. 793–797 (November 1997)
10. Lui, X., Srivastava, A., Gallivan, K.: Optimal linear representations of images for object recognition. PAMI 26, 662–666 (2004)
11. Golub, G.H., Loan, C.F.V.: Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore (1996)
12. Kouzani, A.Z., He, F., Sammut, K.: Wavelet packet face representation and recognition. In: IEEE Int'l Conf. on Systems, Man and Cybernetics, Orlando, vol. 2, pp. 1614–1619. IEEE Computer Society Press, Los Alamitos (1997)
13. Feng, G.C., Yuen, P.C., Dai, D.Q.: Human face recognition using PCA on wavelet subband. SPIE J. Electronic Imaging 9(2), 226–233 (2000)
14. Nastar, C., Moghaddam, B., Pentland, A.: Flexible images: Matching and recognition using learned deformations. Computer Vision and Image Understanding 65(2), 179–191 (1997)
15. Nastar, C.: The image shape spectrum for image retrieval. Technical Report RR-3206, INRIA (1997)
16. Vasconcelos, N., Lippman, A.: A multiresolution manifold distance for invariant image similarity. IEEE Trans. Multimedia 7(1), 127–142 (2005)
17. Ekenel, H.K., Sankur, B.: Multiresolution face recognition. Image Vision Computing 23(5), 469–477 (2005)
18. Foltyniewicz, R.: Automatic face recognition via wavelets and mathematical morphology. In: Proc. of the 13th Int'l Conf. on Pattern Recognition, vol. 2, pp. 13–17 (1996)
19. Chang, J.M., Kirby, M., Peterson, C.: Set-to-set face recognition under variations in pose and illumination. In: 2007 Biometrics Symposium at the Biometric Consortium Conference, Baltimore, MD, U.S.A. (September 2007)
Crystal Vision-Applications of Point Groups in Computer Vision
Reiner Lenz
Department of Science and Technology, Linköping University, SE-60174 Norrköping, Sweden
[email protected]
Abstract. Methods from the representation theory of finite groups are used to construct efficient processing methods for the special geometries related to the finite subgroups of the rotation group. We motivate the use of these subgroups in computer vision, summarize the necessary facts from the representation theory and develop the basics of Fourier theory for these geometries. We illustrate its usage for data compression in applications where the processes are (on average) symmetrical with respect to these groups. We use the icosahedral group as an example since it is the largest finite subgroup of the 3D rotation group. Other subgroups with fewer group elements can be studied in exactly the same way.
1
Introduction
Measuring properties related to the 3D geometry of objects is a fundamental problem in many image processing applications. Some very different examples are: light stages, omnidirectional cameras and measurement of scattering properties. In a light stage (see [1] for an early description) an object is illuminated by a series of light sources and simultaneously images of the object are taken with one or several cameras. These images are then used to estimate the optical properties of the object and this information is then in turn used by computer graphics systems to insert the object into computer generated, or real world, environments. A typical application is the generation of special effects in the movie industry. An omnidirectional camera captures images of a scene from different directions simultaneously. Typical arrangements to obtain these images are combinations of a camera and a mirror ball or systems consisting of a number of cameras. The third area where similar techniques are used is the investigation of the optical properties of materials like skin or paper [2, 3]. These materials are characterized by complicated interactions between the light and the material due to sub-surface scattering, and closed form descriptions are not available. Applications range from wound monitoring over cosmetics to the paper manufacturing and graphic arts industry. All of these problems have two common characteristics: their main properties are defined in terms of directions (the directions of the incoming and reflected light) and the space of direction vectors is represented by a few representative
samples (for example the directions where the sensors or light sources are located). If we describe directions with vectors on the unit sphere then we see that the basic component of these models is a finite set of vectors on the unit sphere. Similar models are used in physics to investigate the properties of crystals whose atoms form similar geometric configurations. A standard tool used in these investigations is the theory of point groups which are finite subgroups of the group SO(3) of three-dimensional rotations. In this paper we will use methods from the representation theory of these point groups to construct efficient processing methods for computer vision problems involving quantized direction spaces. We will describe the main idea, summarize the necessary facts from the representation theory and illustrate it by examples such as light stage processing and modeling of the optical properties of materials. The group we use in this paper is the icosahedral group. We select it because it is the largest finite subgroup of the 3D rotation group. Other subgroups with fewer group elements (and thus a coarser quantization of the direction space) can be used in exactly the same way.
2
Geometry
Consider the problem of constructing a device to be used to measure the optical properties of materials or objects. The first decision in the construction of such a device concerns the placements of the light sources and cameras. Since we want to design a general instrument we will use the following scanning mechanism: start with one light source in a fixed position in space and then move it to other positions with the help of a sequence of 3D rotations. From a mathematical point of view it is natural to require that the rotations used form a group: applying two given rotations in a sequence moves the light source to another possible position and all movements can be reversed. Since we want to have a physically realizable system we also require that only a finite number of positions are possible to visit. We therefore conclude that the positions of the light sources are characterized by a finite subgroup of the group SO(3) of 3D rotations. If the rotations are not all located in a plane then it is known that there are only a finite number of finite subgroups of the rotation group. The largest of these subgroups is the icosahedral group I and we will in the following only consider this group since it provides the densest sampling of the unit sphere constructed in this way. The other groups (related to cubic and tetrahedral sampling schemes) can be treated in a similar way. Here we only use the most important facts from the theory of these groups and the interested reader is referred to the many books in mathematics, physics and chemistry ([4, 5, 6, 7]) for detailed descriptions. We will now collect the most important properties of I. Among the vast amount of knowledge about these groups we select those facts that are (1) relevant for the application we have in mind and (2) those that can be used in software systems like Maple and Matlab to do the necessary calculations. The group I consists of 60 elements and these elements can be characterized by three elements Rk, k = 1, 2, 3, the generators of the group I. All group
elements can be obtained by concatenations of these three rotations, and all R ∈ I have the form R = Rν(1) Rν(2) ... Rν(K), where Rν(k) is one of the three generators Rk, k = 1, 2, 3 and Rν(k) Rν(k+1) is the concatenation of two elements. The generators satisfy the following equations:

R1^3 = E;  R2^2 = E;  R3^2 = E;  (R1 R2)^3 = E;  (R2 R3)^3 = E;  (R1 R3)^2 = E    (1)
and these equations specify the group I. These defining relations can be used in symbolic programs to generate all the elements of the group. The icosahedral group I maps the icosahedron into itself and if we cut off the vertices of the icosahedron we get the truncated icosahedron, also known as the buckyball or a fullerene. The buckyball has 60 vertices and its faces are pentagons and hexagons (see Figure 3(A)). Starting from one vertex and applying all the rotations in I will visit all the vertices of the buckyball (more information on the buckyball can be found in [6]). Now assume that at every vertex of the buckyball you have a controllable light source. We have sixty vertices and so we can describe the light distribution generated by these sources by enumerating them as Lk , k = 1, . . . 60. We can also describe them as functions of their positions using the unit vectors Uk , k = 1, . . . 60 : L(Uk ). The interpretation we will use in the following uses the rotation Rk needed to reach the k-th position from an arbitrary but fixed starting point. We have L(Rk ), k = 1, . . . 60 and we can think of L as a function defined on the group I. This space of all functions on I will be denoted by L2 (I). This space is a 60-dimensional vector space and in the following we will describe how to partition it into subspaces with simple transformation properties.
3
Representation Theory
The following construction is closely related to Fourier analysis, where functions on a circle are described as superpositions of complex exponentials. For a fixed value of n the complex exponential is characterized by the transformation property e^{in(x+Δ)} = e^{inΔ} e^{inx}. The one-dimensional space spanned by all functions of the form c·e^{inx}, c ∈ C, is thus invariant under the shift operation x → x + Δ of the unit circle. We will describe similar systems in the following. We construct a 60D space by assigning the k-th basis vector in this space to group element Rk ∈ I. Next select a fixed R ∈ I and form all products RRk, k = 1, ..., 60. The mapping R : Rk → Rl = RRk defines the linear mapping that moves the k-th basis vector to the l-th basis vector. Doing this for all elements Rk we see that R defines a 60D permutation matrix Dr(R). The map R → Dr(R) has the property that Dr(RQ) = Dr(R)Dr(Q) for all R, Q ∈ I. A mapping with this transformation property is called a representation and the special representation Dr is known as the regular representation. All elements in I are concatenations of the three elements Rk, k = 1, 2, 3 and every representation D is therefore completely characterized by the three matrices D(Rk), k = 1, 2, 3.
A given representation describes linear transformations D(R) in the 60D space. Changing the basis in this space with the help of a non-singular matrix T describes the same transformation in the new coordinate system by the matrix T D(R) T^{-1}. Also T D(R) T^{-1} T D(Q) T^{-1} = T D(RQ) T^{-1} and we see that this gives a new representation T D(R) T^{-1} which is equivalent to the original D. Assume that for the representation D there is a matrix T such that for all R ∈ I we have:

D_T(R) = T D(R) T^{-1} = [ D^(1)(R)  0 ;  0  D^(2)(R) ]    (2)

Here D^(l)(R) are square matrices of size n_l and 0 are sub-matrices with all-zero entries. We have n_1 + n_2 = 60 and

T D(R1 R2) T^{-1} = T D(R1) T^{-1} · T D(R2) T^{-1}
                 = [ D^(1)(R1)  0 ;  0  D^(2)(R1) ] [ D^(1)(R2)  0 ;  0  D^(2)(R2) ]
                 = [ D^(1)(R1) D^(1)(R2)  0 ;  0  D^(2)(R1) D^(2)(R2) ]

This shows that D^(1), D^(2) are representations of lower dimensions n_1, n_2 < 60. Each of the two new representations describes a subspace of the original space that is closed under all group operations. If we can split a representation D into two lower-dimensional representations we say that D is reducible, otherwise it is irreducible. Continued splitting of reducible representations finally leads to a decomposition of the original 60D space into smallest, irreducible components. One of the main results from the representation theory of finite groups is that the irreducible representations of a group are unique (up to equivalence). For the group I we denote the irreducible representations by M^(1), M^(2), M^(3), M^(4), M^(5). Their dimensions are 1, 3, 3, 4, 5. For the group I we find

1 + 3^2 + 3^2 + 4^2 + 5^2 = 60    (3)
which is an example of the general formula n_1^2 + ... + n_K^2 = n, where n_k is the dimension of the k-th irreducible representation and n is the number of elements in the group. Eq. 3 shows that the 60D space L^2(I) can be subdivided into subspaces of dimensions 1, 9, 9, 16 and 25. These subspaces consist of n_k copies of the n_k-dimensional space defined by the k-th irreducible representation of I. Next we define the character to describe how to compute this subdivision: assume that D is a representation given by the matrices D(R) for the group elements R. Its trace defines the character χ_D:

χ_D : I → C;  R → χ_D(R) = tr(D(R))    (4)
For the representations M^(l) we denote their characters by χ_l = χ_{M^(l)}. From the properties of the trace it follows that χ_l(R) = χ_l(QRQ^{-1}) for all R, Q ∈ I. If we
define R1, R2 ∈ I as equivalent if there is a Q ∈ I such that R2 = Q R1 Q^{-1}, then we see that this defines an equivalence relation and that the characters are constant on equivalence classes. Characters are thus defined by their values on equivalence classes. We now define the matrix P_l as:

P_l = Σ_{k=1}^{60} χ_l(R_k) D_r(R_k)    (5)
where D_r is the representation consisting of the permutation matrices defined at the beginning of this section. It can be shown that the matrix P_l defines a projection from the 60D space L^2(I) into the n_l^2-dimensional space given by the n_l copies of the n_l-dimensional irreducible representation M^(l). The matrix P_l defines a projection matrix and if we compute its Singular Value Decomposition (SVD) P_l = U_l D_l V_l then we see that the n_l^2 columns of V_l span the range of P_l. We summarize the computations as follows:
– For the generators R_k, k = 1, 2, 3 of I construct the permutation matrices D_r(R_k)
– Apply the group operations to generate all matrices D_r(R), R ∈ I
– For the generators R_k, k = 1, 2, 3 of I construct the matrices M^(l)(R_k) of the l-th irreducible representation
– Apply the group operations to generate all matrices M^(l)(R), R ∈ I
– Compute the values of the characters of the irreducible representations M^(l) on the equivalence classes
– Extend them to all elements R ∈ I to obtain the characters χ_l
– Use Eq. 5 to construct the projection matrices P_l
– Use the SVD of the matrices P_l to construct the new basis
The lengthy theoretical derivation thus results in a very simple method to decompose functions on the buckyball. The new basis can be constructed automatically and from its construction it can also be seen that the elements of the projection matrices P_l are given by the values of the characters on the equivalence classes. Here we only concentrate on these algorithmically derived bases; using the SVD to construct the basis in the subspaces is only one option, and others, more optimized for special applications, can also be used.
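As a concrete illustration of these steps, the following Python/NumPy sketch builds the regular representation and the projection matrices of Eq. (5). It is not code from the paper: we assume that the list of the 60 rotation matrices of I has already been generated from the three generators (e.g., with a symbolic system such as Maple), and that the character value of the l-th irreducible representation is supplied for every group element.

```python
import numpy as np

def regular_representation(group, tol=1e-8):
    """Permutation matrices D_r(R) of the regular representation of the group."""
    n = len(group)

    def index_of(M):
        # Find the group element equal to M (up to numerical tolerance).
        for i, G in enumerate(group):
            if np.allclose(G, M, atol=tol):
                return i
        raise ValueError("product not found; check that `group` is closed")

    D_r = []
    for R in group:
        D = np.zeros((n, n))
        for k, R_k in enumerate(group):
            D[index_of(R @ R_k), k] = 1.0   # R moves basis vector k to l with R_l = R R_k
        D_r.append(D)
    return D_r

def symmetry_basis(group, D_r, chi_l, tol=1e-8):
    """Projection matrix of Eq. (5) for one irreducible representation and a
    basis of its range, obtained from the SVD."""
    P_l = sum(chi_l[k] * D_r[k] for k in range(len(group)))
    U, s, Vt = np.linalg.svd(P_l)
    rank = int(np.sum(s > tol * s[0]))      # should equal n_l^2
    return P_l, U[:, :rank]                  # columns span the invariant subspace
```

Stacking the bases of the five invariant subspaces column by column gives the 60 x 60 change-of-basis matrix used in the experiments below.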
4
Description and Compression of Scatter Measurements
Radiation transfer models provide a standard toolbox to describe the interaction between light and material and they are therefore useful in such different applications as remote sensing and subsurface scattering models of materials like skin and paper (see [2] for an example). The key component of the theory is the function that describes how incoming radiation is mapped to outgoing radiation. In the following we let U, V denote two unit vectors that describe
directions in 3D space. We denote by p(U, V) the probability that an incoming photon from direction U interacts with the material and is scattered into direction V. Now divide the unit sphere into 60 sections, each described by a vertex of the buckyball. The argument above shows that in that case we can write it as a function p(R_i, R_j) with R_i, R_j ∈ I. Assume further that f : I → IR describes the incoming light distribution from the directions given by the elements in I. The expectation of the outgoing radiation g(R_j) in direction R_j is then given by

g(R_j) = Σ_I p(R_i, R_j) f(R_i)    (6)

and we write this in operator notation as g = S_p f where S_p is an operator defined by the kernel p. A common assumption in applications of radiation transfer is that the function p(U, V) only depends on the angle between the vectors U, V. We therefore consider especially probability functions that are invariant under elements of I, i.e. we assume that:

p(R R_i, R R_j) = p(R_i, R_j)    (7)

for all elements R, R_i, R_j ∈ I. We find that the operator commutes with the operator T_Q describing the application of an arbitrary but fixed rotation Q: T_Q f(R) = f(Q^{-1} R):

S_p(f(Q R_i))(R_j) = Σ_I p(R_i, R_j) f(Q R_i) = Σ_I p(Q^{-1} R_i, R_j) f(R_i) = Σ_I p(R_i, Q R_j) f(R_i) = (S_p f)(Q R_j)
This shows that S_p T_Q = T_Q S_p for all Q ∈ I. We now use the new coordinate system in L^2(I) constructed above. The operator then maps the invariant subspaces defined by the irreducible representations onto themselves. Schur's Lemma [4] states that on these spaces S_p is either the zero operator or a multiple of the identity. In other words, on the k-th such subspace there is a constant λ_k such that S_p f = λ_k f, and we find that the elements in this subspace are eigenvectors of the operator. We illustrate this by an example from radiation transfer theory describing the reflection of light on materials. We consider illumination distributions measured on the 60 vertices of the buckyball and described by 60D vectors f. We measure the reflected light at the same 60 positions, resulting in a new set of 60D vectors g. In what follows we will not consider single distributions f, g but stochastic processes generating a number of light distributions f_ω, where ω is the stochastic variable. This scenario is typical for a number of different applications such as the following:
– The operator can describe the properties of a mirror ball, the incoming vectors f the light flow in the environment and g the corresponding measurement vector. This is of interest in computer graphics
– The operator represents the optical properties of a material like paper or skin. The illumination/measurement configuration can be used for estimation of the reflectance properties of the material
– The model describes a large number of independent interactions between the light flow and particles. A typical example is the propagation of light through the atmosphere
In the following simulations we generated 500 vectors with uniformly distributed random numbers representing 500 different incoming illumination distributions from the 60 directions of the buckyball. The scattering properties of the material are characterized by the Henyey-Greenstein function [8] defined as

p(cos θ) = (1 − ξ^2) / (1 + ξ^2 − 2ξ cos θ)^{3/2}    (8)
where θ is the angle between the incoming and outgoing direction and ξ is a parameter characterizing the scattering properties. We choose this function simply as an illustration of how these distributions of the scattered light are described in the basis that was constructed in the previous sections. Note that this coordinate system is constructed based only on the buckyball geometry, independent of the scattering properties. We illustrate the results with two examples: ξ = 0.2 (diffuse scattering) and ξ = 0.8 (specular reflection). We show images of the covariance matrix of the original scatter vectors, the covariance matrix of the scatter vectors in the new coordinate system and the correlation matrix of the scattered vectors in the new coordinate system, where we set the matrix element in the upper left pixel (corresponding to the squared magnitude of the first coefficient) to zero. For all matrices we also plot the values along the diagonal, where most of the contributions are concentrated. In Figure 1 we show the results for ξ = 0.2 and in Figure 2 for ξ = 0.8. In both cases we get similar results, and for ξ = 0.8 we therefore omit the correlation matrix. The results show that in the new basis the contributions are more concentrated. We also see clearly the structure of the different invariant representations, accounting for the block structure of the subspaces of dimensions (1, 9, 9, 16, 25), and the most important components with numbers (1, 2, 11, 20, 36) corresponding to the first dimension in these subspaces. We also see that the concentration in these components is more pronounced in the first example with the diffuse reflection than it is for the more specular reflection. This is to be expected since in that case the energy of the reflected light is more concentrated in narrower regions. The shape of these basis functions is illustrated in Figure 3(B), showing basis vector number 36, which gives (after the constant basis vector number one) the highest absolute contribution in the previous plots. In this figure we mark the vertices with positive contributions by spheres and the vertices with negative values by tetrahedra.
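The simulation just described can be reproduced with a few lines of code. The sketch below is our own illustration (not the paper's implementation); it assumes that `vertices` holds the 60 buckyball directions as unit vectors and that `B` is the orthogonal symmetry basis constructed in the previous section, and it compares the covariance of the scattered vectors in the original and in the new coordinate system.

```python
import numpy as np

def henyey_greenstein(cos_theta, xi):
    # Eq. (8): p(cos θ) = (1 - ξ^2) / (1 + ξ^2 - 2 ξ cos θ)^(3/2)
    return (1.0 - xi**2) / (1.0 + xi**2 - 2.0 * xi * cos_theta) ** 1.5

def scatter_operator(vertices, xi):
    # The kernel depends only on the angle between directions, so it satisfies
    # the invariance assumption of Eq. (7); columns are normalized so that the
    # entries can be read as scattering probabilities.
    P = henyey_greenstein(np.clip(vertices @ vertices.T, -1.0, 1.0), xi)
    return P / P.sum(axis=0, keepdims=True)

def covariances(vertices, B, xi, n_samples=500, seed=0):
    rng = np.random.default_rng(seed)
    F = rng.uniform(size=(n_samples, 60))        # random incoming illuminations
    G = F @ scatter_operator(vertices, xi).T     # scattered vectors, Eq. (6)
    C_orig = np.cov(G, rowvar=False)             # covariance, original basis
    C_sym = B.T @ C_orig @ B                     # covariance, symmetry basis
    return C_orig, C_sym
```

By Schur's Lemma the scatter operator acts as a multiple of the identity on each invariant subspace, which is why the covariance in the symmetry basis concentrates on the block structure (1, 9, 9, 16, 25) seen in the figures.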
Fig. 1. Results for ξ = 0.2. (A) Covariance Original (B) Thresholded Correlation Matrix in Symmetry Basis (C) Covariance Matrix in Symmetry Basis.
Fig. 2. Results for ξ = 0.8. (A) Covariance Original (B) Covariance Matrix in Symmetry Basis.
Fig. 3. (A) The Buckyball (B) Basis Vector 36
5
Summary and Discussion
In this paper we used tools from the representation theory of the icosahedral group to construct transforms that are adapted to the transformation properties of the group. We showed how to construct the transform algorithmically from the properties of the group. We showed that under certain conditions these transforms provide approximations to principal component analysis of datasets defined on the vertices of the buckyball. We used a common model to describe the reflection properties of materials (the Henyey-Greenstein equation) and illustrated the compression properties of the transform with the help of simulated illumination distributions scattered from the surface of objects. The assumption of perfect symmetry under the icosahedral group, on which the results in this paper were derived, is seldom fulfilled in reality: the invariance property in Eq. 7 is clearly seldom fulfilled for real objects. This is also the case in physics, where perfect crystals are rather an exception than the rule. In this case we can still use the basis constructed in this paper as a starting point and tune it to the special situation afterwards in a perturbation framework. But even in this simple form it should be useful in computer vision applications. As a typical example we mention the fact that omnidirectional cameras typically produce large amounts of data, and the examples shown above illustrate that the new basis should provide better compression results than the original point-based system. Without going into details we remark also that the new basis has a natural connection to invariants. From the construction we see that the new basis defines a partition of the original space into 1-, 9-, 9-, 16- and 25-dimensional subspaces that are invariant under the action of the icosahedral group. The projection onto the first subspace thus defines an invariant. The vectors obtained by projections onto the other subspaces are not invariants but their lengths are and
we thus obtain four new invariants. From the construction it furthermore follows that the transformation rules of the projected vectors in these subspaces follow the transformation rules of the representations, and they can thus be used to obtain information about the underlying transformation causing the given transformation of the projected vectors. The application to the design of illumination patterns is based on the observation that the projection matrices P defined in Eq. (5) only contain integers and two non-integer constants. We can now use this simple structure to construct illumination patterns by switching on all the light sources located on the vertices with identical values in the projection vector. Such a system should have favorable properties similar to those obtained by the technique described in [9], [10], [11].
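For completeness, a minimal sketch of the invariants just described (our notation, not the paper's): given the five projection matrices from Eq. (5) and a measurement vector of length 60, the lengths of the five projections do not change when the icosahedral group acts on the measurement.

```python
import numpy as np

def icosahedral_invariants(projections, g):
    """Lengths of the projections of g onto the five invariant subspaces."""
    return np.array([np.linalg.norm(P @ g) for P in projections])
```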
References
1. Debevec, P., Hawkins, T., Tchou, C., Duiker, H.P., Sarokin, W., Sagar, M.: Acquiring the reflectance field of a human face. In: Proc. SIGGRAPH 2000, pp. 145–156. ACM Press/Addison-Wesley Publishing Co., New York, NY, USA (2000)
2. Edström, P.: A fast and stable solution method for the radiative transfer problem. SIAM Review 47(3), 447–468 (2005)
3. Weyrich, T., Matusik, W., Pfister, H., Bickel, B., Donner, C., Tu, C., McAndless, J., Lee, J., Ngan, A., Jensen, H.W., Gross, M.: Analysis of human faces using a measurement-based skin reflectance model. ACM Transactions on Graphics 25(3), 1013–1024 (2006)
4. Serre, J.P.: Linear representations of finite groups. Springer, Heidelberg (1977)
5. Stiefel, E., Fässler, A.: Gruppentheoretische Methoden und ihre Anwendungen. Teubner, Stuttgart (1979)
6. Sternberg, S.: Group Theory and Physics. First paperback edn. Cambridge University Press, Cambridge, England (1995)
7. Kim, S.K.: Group theoretical methods and applications to molecules and crystals. Cambridge University Press, Cambridge (1999)
8. Henyey, L., Greenstein, J.: Diffuse radiation in the galaxy. Astrophys. Journal 93, 70–83 (1941)
9. Schechner, Y., Nayar, S., Belhumeur, P.: A theory of multiplexed illumination. In: Proc. Ninth IEEE Int. Conf. on Computer Vision, vol. 2, pp. 808–815. IEEE Computer Society Press, Los Alamitos (2003)
10. Ratner, N., Schechner, Y.Y.: Illumination multiplexing within fundamental limits. In: Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE, Los Alamitos (2007)
11. Schechner, Y.Y., Nayar, S.K., Belhumeur, P.N.: Multiplexing for optimal lighting. IEEE Trans. Pattern Analysis and Machine Intelligence 29(8), 1339–1354 (2007)
On the Critical Point of Gradient Vector Flow Snake
Yuanquan Wang1, Jia Liang2, and Yunde Jia2
1 School of Computer Science, Tianjin University of Technology, Tianjin 300191, PRC
2 School of Computer Science, Beijing Institute of Technology, Beijing 100081, PRC
{yqwang,liangjia,jiayunde}@bit.edu.cn
Abstract. In this paper, the so-called critical point problem of the gradient vector flow (GVF) snake is studied in two respects: influencing factors and detection of the critical points. One influencing factor to which particular attention should be paid is the iteration number in the diffusion process: too much diffusion would flood the object boundaries, while too little would preserve excessive noise. Here, the optimal iteration number is chosen by minimizing the correlation between the signal and noise in the filtered vector field. On the other hand, we single out all the critical points by quantizing the GVF vector field. After the critical points are singled out, the initial contour can be located properly to avoid the nuisance arising from critical points. Several experiments are also presented to demonstrate the effectiveness of the proposed strategies.
Keywords: snake model, gradient vector flow, critical point, optimal stopping time, image segmentation.
1 Introduction
Object shape segmentation and extraction in visual data is an important goal in computer vision. Parametric active contour models [1] and geometric active contour models [2] have dominated this field over the last two decades. Since their debut in 1988 [1], parametric active contour models, i.e., snake models, have become extremely popular in the field of computer vision; they integrate an initial estimate, geometrical properties of the contour, image data and knowledge-based constraints into a single process, and provide a good solution for shape recovery of objects of interest in visual data. Despite their marvelous ability to represent the shapes of objects in visual data, the original algorithm is harassed by several limitations, such as initialization sensitivity, convergence to boundary concavities and topology adaptation. These limitations have been extensively studied and many interesting results have been presented. Among all the results, a new external force called gradient vector flow (GVF), which was proposed by Xu and Prince [3, 4], outperforms the other gradient-based methods in enlarging the capture range and in converging to boundary concavities, and has become the focus of much research. Examples include [5-16], among others. The graceful works on GVF by Ray et al. are worth noting. They first presented a shape and size constrained active contour with the GVF, but modified by introducing additional boundary conditions of Dirichlet type using the initial contour
location to the PDEs, as the external force [10]; then, they discussed this new formulation of GVF from another point of view [11]; later, they presented motion gradient vector flow by integrating the object moving direction into the GVF vector [12]; more recently, they utilized the GVF snake characterized by the Dirichlet boundary condition to segment spatiotemporal images [13]. Although the capture range of the GVF snake is very large, the initial contour unfortunately still suffers from some difficulties: the initial contour should contain some points and exclude some other points, otherwise the final results would be far from expected. We demonstrate this phenomenon in Fig. 1. In the top row, there are some particular points, denoted by white crosses, in the GVF field within the heart ventricle; if the initial contour contains none or only part of these points, the contour would fail. The bottom row illustrates the particular points, denoted by white squares, between the rectangle and the circle; if the initial contour contains any of these points, the snake contour would stabilize on the opposite object boundaries. This is the so-called critical point problem in this study, and the critical points that should be included within the initial contour are referred to as inner critical points, while those which should be excluded are referred to as outer ones. The existence of inner critical points was first pointed out in [11]; recently, a dynamic system method was employed to detect the inner critical points in [15], and later He et al. also utilized the dynamic system method for this purpose [16]. But this approach is computationally expensive and cannot detect the outer ones; in fact, Ford has proposed a more efficient and effective method based on dynamical systems in the context of fluid flow [17]. In this work, we investigate this critical point problem in two respects: analysis of the influencing factors and detection of critical points. Understanding the influencing factors is helpful to select the parameters for computing GVF. One particular influencing factor is the iteration number during the diffusion process; since too large an amount of diffusion would flood the object boundaries, here an optimal stopping
Fig. 1. Top row: demonstration of the inner critical points. The inner critical points are denoted by white crosses; the white circles are initial contours. Bottom row: demonstration of outer critical points. The black dashed circles are the initial contours, and the black solid curves are the converged results and the outer critical points are denoted by white squares.
time, i.e., iteration number, is chosen by minimizing the correlation between the signal and noise in the filtered vector field. By quantizing the GVF vector field, a simple but effective method is presented to single out the critical points, and the initial contour can be located around the inner critical points within the object; in this way, the GVF snake can overcome this critical point problem. A preliminary version of this work first appeared in [18]. The remainder of this paper is organized as follows: the GVF snake is briefly reviewed in Section 2 and Section 3 is devoted to the influencing factors of the critical points. In Section 4, we detail the detection of the critical points and the initialization of the GVF snake. Section 5 presents some experimental results and Section 6 concludes this paper.
2 Brief Review of the GVF Snake
A snake contour is an elastic curve that moves and changes its shape to minimize the following energy
E_snake = ∫_0^1 [ (1/2)( α|c_s|^2 + β|c_ss|^2 ) + E_ext(c(s)) ] ds .    (1)
where c(s) = [x(s) y(s)], s ∈ [0,1], is the snake contour parameterized by arc length, c_s(s) and c_ss(s) are the first and second derivatives of c(s) with respect to s, positively weighted by α and β respectively. E_ext(c(s)) is the image potential which may result from various events, e.g., lines and edges. By calculus of variations, the Euler equation to minimize E_snake is
αc_ss(s) − βc_ssss(s) − ∇E_ext = 0 .    (2)
This can be considered as a force balance equation
F_int + F_ext = 0 ,    (3)

where F_int = αc_ss(s) − βc_ssss(s) and F_ext = −∇E_ext. The internal force F_int
keeps the snake contour smooth while the external force F_ext attracts the snake to the desired image features. In a departure from this perspective, the gradient vector flow external force is introduced to replace −∇E_ext with a new vector field v(x,y) = [u(x,y) v(x,y)], which is derived by minimizing the following functional
ε = ∫∫ μ|∇v|^2 + |∇f|^2 |v − ∇f|^2 dx dy .    (4)
where f is the edge map of the image I, usually f = |∇(G_σ ∗ I)|, and μ is a positive weight. Using the calculus of variations, the Euler equations seeking the minimum of ε read
v_t = μΔv − |∇f|^2 (v − ∇f) .    (5)
where Δ is the Laplacian operator. The snake model with v as external force is called GVF snake.
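For reference, Eq. (5) is usually solved by treating it as an explicit diffusion iteration. The sketch below is an illustrative implementation, not the authors' code; the edge map follows the convention f = |∇(G_σ ∗ I)| stated above, and the parameter values mirror those used later in the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, laplace

def gvf(image, mu=0.15, dt=0.5, n_iter=200, sigma=2.5):
    """Gradient vector flow field (u, v) obtained by iterating Eq. (5)."""
    f = np.hypot(*np.gradient(gaussian_filter(image.astype(float), sigma)))
    fy, fx = np.gradient(f)                   # np.gradient returns (d/dy, d/dx)
    mag2 = fx**2 + fy**2                      # |grad f|^2
    u, v = fx.copy(), fy.copy()               # initialize with the edge-map gradient
    for _ in range(n_iter):
        u += dt * (mu * laplace(u) - mag2 * (u - fx))
        v += dt * (mu * laplace(v) - mag2 * (v - fy))
    return u, v
```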
3 Critical Point Analysis: Influencing Factors and Optimal Stopping Time
3.1 Summary of the Influencing Factors
Owing to the critical points, taking GVF as the external force introduces a new nuisance for contour initialization, and one should pay particular attention to these points. Therefore, it is desirable to have some guidance for choosing the parameters for the GVF calculation such that the GVF is as regular as possible, i.e., better edge-preserving and with fewer critical points. By analyzing Eq. 5 and carrying out practical exercises, we summarize the influencing factors of critical points, which may serve as a qualitative guidance, as follows:
1) Shape of the object: The object shape is characterized by the edge map. Generally speaking, the inner critical points lie on the medial axis of the object; thus, the initial contour should include the medial axis in order to capture the desired object. In order to obtain a good edge map for contaminated images, Gaussian blur with deviation σ is employed first; therefore, a slightly large σ is favored.
2) Regularization parameter μ: This coefficient controls the tradeoff between fidelity to the original data and smoothing. A large μ means smoother results and fewer critical points, but also a large deviation from the original data. It would be desirable for μ to be slightly small in the vicinity of boundaries and large in homogeneous areas, but this is a dilemma for contaminated images.
3) Iteration number in the diffusion process: It was said in [3] that "the steady-state solution of these linear parabolic equations is the desired solution of the Euler equations…". This statement gives rise to the following question: is "the desired solution of the Euler equations" the desired external force for the snake model, i.e., the desired GVF? We answer this question and demonstrate the influence of the iteration number on the critical points by using the example in Fig. 2, with μ = 0.15 and time step 0.5. The heart image in Fig. 1 is smoothed using a Gaussian kernel of σ = 2.5 and the GVF fields at 100, 200 and 2000 iterations of diffusion are given in Fig. 2(a), (b), and (c) respectively. Visibly, there are fewer critical points in Fig. 2(b) than in Fig. 2(a) (see the white dot), and the result in Fig. 2(c) is far from usable in that the GVF flows into the ventricle from the right and out from the bottom left. Surely, the result in Fig. 2(c) approximates the steady-state solution, but it cannot serve as the external force for a snake model. The reason behind this situation is that Eq. 5 is a biased version of v_t = μΔv, biased by the term |∇f|^2(∇f − v), where v_t = μΔv is an isotropic diffusion. As t increases, the isotropic smoothing effect will dominate the diffusion of (5) and converge to the average of the initial value, ∇f. A small enough μ could suppress this over-smoothing effect but, at the same time, preserves excessive noise; alternatively, an optimal iteration number, say 200 for this example, would be an effective solution for this issue. This is the topic of the next subsection.
Fig. 2. Gradient vector flow fields at different numbers of iterations: (a) GVF after 100 iterations; (b) GVF after 200 iterations; (c) GVF after 2000 iterations
3.2 Optimal Stopping Time
Following the work on image restoration by Mrázek and Navara [19], the decorrelation criterion, which minimizes the correlation between the signal and noise in the filtered signal, is adopted for the selection of the optimal diffusion stopping time. Starting with the noisy image as its initial condition, I(0) = I_0, I evolves along some trajectory I(t), t > 0. The time T is optimal for stopping the diffusion in the sense that the noise in I(T) is removed significantly and the object structure is preserved to the extent possible. Obviously, this is an ambiguous statement; T can only be estimated by some criteria. Based on the assumption that the 'signal' I(t) and the 'noise' I(0) − I(t) are uncorrelated, the decorrelation criterion is proposed, which selects
T = arg min_t corr(I(0) − I(t), I(t)) ,    (6)

where

corr(I(0) − I(t), I(t)) = cov(I(0) − I(t), I(t)) / sqrt( var(I(0) − I(t)) · var(I(t)) ) .    (7)
Although the underlying assumption is not necessarily the case and corr(I(0) − I(t), I(t)) is also not necessarily unimodal, this situation is not severe in practice. Mrázek and Navara showed the effectiveness of this criterion [19]. Regarding the vector-valued GVF, by taking its two components into account, we slightly modify this criterion and obtain
T_GVF = arg min_t [ corr(u(0) − u(t), u(t)) + corr(v(0) − v(t), v(t)) ] .    (8)
Since the initial values u(0) = f_x, v(0) = f_y are the derivatives of an edge map and may have different intensities, the diffusion stops at T_GVF so that the component with the smaller initial value is not over-smoothed and the other one is not under-smoothed too much.
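A minimal sketch of this stopping rule (our illustration, with an assumed `step` function performing one update of Eq. (5)) is given below; in practice one would also keep the fields at the selected iteration.

```python
import numpy as np

def corr(a, b):
    """Sample correlation coefficient between two fields, as in Eq. (7)."""
    a = a.ravel() - a.mean()
    b = b.ravel() - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def optimal_stopping_iteration(u0, v0, step, max_iter=2000):
    """Iteration minimizing the summed correlation of Eq. (8)."""
    u, v = u0.copy(), v0.copy()
    best_t, best_score = 0, np.inf
    for t in range(1, max_iter + 1):
        u, v = step(u, v)
        score = corr(u0 - u, u) + corr(v0 - v, v)
        if score < best_score:
            best_t, best_score = t, score
    return best_t
```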
4 Initialization Via Critical Point Detection
Generally speaking, the inner critical points are located within closed homogeneous image regions, e.g., the object area. When they lie within the object area, the initial contour should contain these points; otherwise, they should be excluded. The outer ones generally lie between objects or between parts of one object; when they lie between objects, the initial contour should exclude these points, or the snake contour would be driven to the opposite object. See the examples in Fig. 1. In practice, noise would disturb the location of critical points. Due to page limitations, we do not elaborate on the distribution of, and a graceful solution to, the critical points; here we only present a practical method to ease contour initialization by using the inner critical points within the object region. The proposed strategy achieves this end by singling out all the critical points and locating a proper initial contour around those within object regions.
4.1 Identifying the Critical Points
Our proposed method follows the basic idea of [14] by quantizing the GVF in the following way. Given a point p in the image domain, let v_p denote the associated GVF vector and Ω_p the 8-neighborhood of p. For any point q in Ω_p, pq denotes the unit vector from p to q, and w_p is derived from v_p such that

v_p · w_p = max_{q ∈ Ω_p} v_p · pq .    (9)

In fact, w_p is the one nearest to v_p in direction among the eight unit vectors pq. In Fig. 3, we
demonstrate this transformation of the GVF on the synthetic image used in Fig. 1. Our proposed method identifies the critical points based on this quantized GVF. As aforementioned, there are two types of critical point; here we address the identification algorithm in detail.
Given a point p in the image domain, for any point q in Ω_p with quantized vector w_q, qp is the unit vector from q to p; p would be an inner critical point if, for all q ∈ Ω_p,

w_q · qp ≤ 0 .    (10)

If p is not an inner critical point and, for all q ∈ Ω_p,

w_q · qp < 1 ,    (11)
we call p an endpoint. An endpoint p is an outer critical point if p is not isolated or not on an isolated clique of endpoints. It is clear from Eq. 10 that, for an inner critical point, the quantized GVF vector of none of the points in its 8-neighborhood points to it; therefore, if this critical point is outside the initial snake contour, the GVF external force cannot put the snake
contour across it; thus, if this critical point is within the object region, the initial snake contour should enclose it. For an endpoint, the quantized GVF vectors of some points in its 8-neighborhood may point towards it, but none points to it exactly. If the endpoint is not isolated, it is an outer critical point. When this type of critical point lies between objects and the initial contour encircles them, the snake contour would be driven away and stabilize on the opposite object. Here we carry out an experiment on the synthetic image and the corresponding GVF field shown in Fig. 3 to demonstrate the identification of the critical points. See Fig. 4, where the inner critical points are denoted by crosses and the outer ones by dots.
Fig. 3. Quantization of the GVF: (a) GVF field; (b) quantized GVF field
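A compact sketch of the quantization of Eq. (9) and the inner-critical-point test of Eq. (10) is given below. It is an illustration rather than the authors' implementation; border pixels are simply skipped, and the (row, column) ordering of the vector components is our own convention.

```python
import numpy as np

OFFSETS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)]
UNIT = {o: np.array(o, float) / np.hypot(*o) for o in OFFSETS}

def quantize_gvf(u, v):
    """Eq. (9): replace each GVF vector by the closest of the 8 neighbour directions."""
    h, w = u.shape
    w_dir = np.zeros((h, w, 2))
    for y in range(h):
        for x in range(w):
            vec = np.array([v[y, x], u[y, x]])            # (dy, dx) ordering
            w_dir[y, x] = UNIT[max(OFFSETS, key=lambda o: vec @ UNIT[o])]
    return w_dir

def inner_critical_points(w_dir):
    """Eq. (10): pixels towards which no quantized neighbour vector points."""
    h, w, _ = w_dir.shape
    return [(y, x)
            for y in range(1, h - 1) for x in range(1, w - 1)
            if all(w_dir[y + dy, x + dx] @ (-UNIT[(dy, dx)]) <= 0
                   for dy, dx in OFFSETS)]
```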
4.2 Locating the Initial Contour
After identifying the critical points, we can locate an initial contour based on the inner critical points. The proposed method makes use of the prior position of the region of interest and includes the following steps:
1) Compute and quantize the GVF vector field;
2) Adopt the prior knowledge about the object position as done in [11];
3) Single out all inner critical points within the object according to Eq. 10;
4) Locate one initial contour around these critical points; therefore, the snake contour would evolve to the object boundary under the GVF force.
In this way, we can get the expected results. This strategy is somewhat rough and restricted, but it works well in practice.
Fig. 4. Identification of critical points. White dots denote the outer critical points between objects and black dots denote those separating parts of objects; while the inner critical points are denoted by black crosses.
5 Experimental Results
In this section, we first assess the applicability of the decorrelation criterion for selecting the optimal stopping time for GVF by performing an experiment on the heart image in Fig. 1. The parameters used to compute the GVF are the same as in Fig. 2. Fig. 5(a) shows the evolution of the correlation with the iteration number. The correlation first decreases as the iteration number increases, then reaches a minimum, and increases after this milestone. The iteration number where the correlation is minimal is optimal for stopping the diffusion; it is 188 for this example. The associated GVF at 188 iterations is shown in Fig. 5(b).
Fig. 5. Optimal iteration number and the associated GVF field. (a) Correlation evolving with iteration number, the iteration where the correlation is minimal is optimal for stopping the diffusion. (b) GVF at the optimal iteration number.
In order to demonstrate the automatic identification of critical points and the location of the initial snake contour based on inner critical points, we utilize the proposed method to
Fig. 6. Segmentation examples. All the snake contours are initialized based on inner critical points.
segment several images, including the heart image and the synthetic image in Fig. 1. The results are shown in Fig. 6 with the initial contour, a snapshot and the final contour overlaid. The GVF fields are calculated with μ = 0.15 for all images and all real images are smoothed by a Gaussian kernel with standard deviation σ = 2.5. The iteration number is chosen automatically based on the decorrelation criterion, and it is 188, 64, 269, 182 and 208 for Fig. 6(a), (b), (c), (d) and (e) respectively. The inner critical points are indicated by crosses. When the critical points are regularly distributed, the initial contour is located automatically, see Fig. 6(a) and (b); otherwise, it is located by hand, see Fig. 6(c), (d) and (e). Because all the initial contours contain the corresponding inner critical points, they can cope with the critical point issue illustrated in Fig. 1 and converge to all desired boundaries. These experiments validate the feasibility of the proposed solution to the critical point problem.
6 Conclusion
In this paper, a theoretical study has been carried out on the GVF snake model. The critical point problem lurking in the GVF snake has been pointed out, and the critical points are identified as inner and outer ones. The influencing factors of critical points include the object shape, regularization and the iteration number during diffusion. By minimizing the correlation between the signal and noise in the filtered vector field, we have introduced the decorrelation criterion to choose the optimal iteration number. We have also presented an approach to find all the critical points by quantizing the GVF field. The snake contour, initialized so as to contain the inner critical points within the object region, can avoid the nuisance stemming from critical points and converge to the desired boundaries. In a forthcoming work, we will elaborate on the detection of the critical points and present a graceful solution to the critical point problem during initialization.
Acknowledgments. This work was supported by the National Natural Science Foundation of China under grants 60543007 and 60602050.
References
1. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. Int'l J. Computer Vision 1(4), 321–331 (1988)
2. Han, X., Xu, C., Prince, J.: A topology preserving level set method for geometric deformable models. IEEE TPAMI 25(6), 755–768 (2003)
3. Xu, C., Prince, J.: Snakes, shapes and gradient vector flow. IEEE TIP 7(3), 359–369 (1998)
4. Xu, C., Prince, J.: Generalized gradient vector flow external forces for active contours. Signal Processing 71(2), 131–139 (1998)
5. Tang, J., Acton, S.T.: Vessel boundary tracking for intravital microscopy via multiscale gradient vector flow snakes. IEEE TBME 51(2), 316–324 (2004)
6. Paragios, N., Mellina-Gottardo, O., Ramesh, V.: Gradient vector flow fast geometric active contours. IEEE TPAMI 26(3), 402–407 (2004)
7. Chuang, C., Lie, W.: A downstream algorithm based on extended gradient vector flow for object segmentation. IEEE TIP 13(10), 1379–1392 (2004)
8. Cheng, J., Foo, S.W.: Dynamic directional gradient vector flow for snakes. IEEE TIP 15(6), 1653–1671 (2006)
9. Yu, H., Chua, C.-S.: GVF-based anisotropic diffusion models. IEEE TIP 15(6), 1517–1524 (2006)
10. Ray, N., Acton, S.T., Ley, K.: Tracking leukocytes in vivo with shape and size constrained active contours. IEEE TMI 21(10), 1222–1235 (2002)
11. Ray, N., Acton, S.T., Altes, T., et al.: Merging parametric active contours within homogeneous image regions for MRI-based lung segmentation. IEEE TMI 22(2), 189–199 (2003)
12. Ray, N., Acton, S.T.: Motion gradient vector flow: an external force for tracking rolling leukocytes with shape and size constrained active contours. IEEE TMI 23(12), 1466–1478 (2004)
13. Ray, N., Acton, S.T.: Data acceptance for automated leukocyte tracking through segmentation of spatiotemporal images. IEEE TBME 52(10), 1702–1712 (2005)
14. Li, C., Liu, J., Fox, M.D.: Segmentation of external force field for automatic initialization and splitting of snakes. Pattern Recognition 38, 1947–1960 (2005)
15. Chen, D., Farag, A.A.: Detecting critical points of skeletons using triangle decomposition of the gradient vector flow field. In: GVIP 2005 Conference, CICC, Cairo, Egypt, December 19-21 (2005)
16. He, Y., Luo, Y., Hu, D.: Semi-automatic initialization of gradient vector flow snakes. Journal of Electronic Imaging 15(4), 43006–43008 (2006)
17. Ford, R.M.: Critical point detection in fluid flow images using dynamical system properties. Pattern Recognition 30(12), 1991–2000 (1997)
18. Wang, Y.: Investigation on deformable models with application to cardiac MR images analysis. PhD dissertation, Nanjing Univ. Sci. Tech., Nanjing, PRC (June 2004)
19. Mrázek, P., Navara, M.: Selection of optimal stopping time for nonlinear diffusion filtering. Int. J. Comput. Vis. 52(2/3), 189–203 (2003)
A Fast and Noise-Tolerant Method for Positioning Centers of Spiraling and Circulating Vector Fields Ka Yan Wong and Chi Lap Yip Dept. of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong {kywong,clyip}@cs.hku.hk
Abstract. Identification of the centers of circulating and spiraling vector fields is important in many applications. Tropical cyclone tracking, rotating object identification, analysis of motion video and movement of fluids are but some examples. In this paper, we introduce a fast and noise-tolerant method for finding the centers of circulating and spiraling vector field patterns. The method can be implemented using integer operations only. It is 1.4 to 4.5 times faster than traditional methods, and the speedup can be further boosted up to 96.6 by the incorporation of search algorithms. We show the soundness of the algorithm using experiments on synthetic vector fields and demonstrate its practicality using application examples in the fields of multimedia and weather forecasting.
1 Introduction
Spiral, circular, or elliptical 2D vector fields, as well as sources and sinks, are encountered in many applications. Of particular interest to researchers is the detection of the centers of these vector field patterns, which provides useful information about the field structure. For example, in [1], circulating or elliptical vector fields are formed by motion compensated prediction of rotating objects and swirl scene changes in video sequences. Locating the centers helps object segmentation and tracking. As another example, in meteorology, vector fields constructed from remote sensing images show circulating or spiraling structures of tropical cyclones and pressure systems, which help in positioning them [2]. Orientation fields which show circulating and spiraling patterns have also drawn the attention of computer vision researchers [3][4]. To locate the centers of a circulating or spiraling vector field F, one can use circulation analysis to locate the regions with high magnitude of vorticity ||∇ × F||. To locate sources or sinks, the divergence can be calculated. However, such simplistic methods are ineffective on incomplete or noisy vector fields. Previous work that addresses the issue mainly solves the problem using three approaches: (1) vector field pattern matching, (2) examination of dynamical system properties of vector fields using algebraic analysis, and (3) structural analysis. The idea of the vector field pattern matching approach is to take the center as the location of the input vector field that best fits a template under some similarity measure, such as the sine metric [5], correlation [6] or Clifford convolution [7]. Methods employing such an approach are flexible, as different templates can be defined for finding different
The authors are thankful to Hong Kong Observatory for data provision and expert advice.
flow patterns. However, the size and pattern of the templates have to be similar to patterns in the vector field, and the computation time increases with the template size. The idea of algebraic analysis is to analyze a vector field decomposed into a sum of solenoidal and irrotational components. This can be done using the discrete Helmholtz-Hodge decomposition [8] or the 2D Fourier transform [9]. The corresponding potential functions are then derived from these components, and the critical points at the local extrema of these functions are found. These points are then characterized by the analysis of the corresponding linear phase portrait matrices. Besides centers, vector field singularities such as swirls, vortices, sources and sinks can also be identified. Other methods for phase portrait estimation include the isotangent-based algorithm [3] and the use of a least square error estimator [4]. These methods are mathematically sound, but are relatively computationally expensive and are sensitive to the noise in vector fields which inevitably arises in practical situations. Besides pattern matching and algebraic analysis, a structural motion field analysis approach is proposed in [10]. The method models vector field patterns as logarithmic spirals, with the polar equation r = a·e^{θ cot α}. The method works by transforming each vector on the field into a sector whose angular limits are bounded by the two lines formed by rotating the vector counterclockwise by ψ_m and ψ_M. The values of ψ_m and ψ_M are calculated from the aspect ratio ρ and the angle α of the vector field pattern. The rotation center is the point covered by the largest number of sectors from neighbours no farther than d pixels away. The method can handle circulating and spiraling vector fields with different inception angles and aspect ratios. It was reported to be up to 1.81 times faster than circulation analysis when the method is used to detect the centers of rotating objects in video sequences [1]. Yet, the method requires the determination of the parameters ρ and α, which can sometimes be difficult. All the methods mentioned above require complex operations such as vector matching and parameter estimation. Floating point operations and a well-structured motion field that fits the assumed mathematical models or templates are required. To handle practical situations, we need a robust and fast method. In this paper, we introduce a fast and noise-tolerant method for finding the centers of circulating and spiraling vector fields.
2 Identifying the Centers
Our method is a structural analysis method which does not require the user to define a template nor to carry out complex mathematical operations. In [10], the optimal rotation angle ω and sector span σ for center identification are determined for perfect vector fields, but the performance of the algorithm in handling noisy vector fields is not addressed. We found that this method can be modified to handle noisy vector fields by increasing the sector span σ from the optimal value (see Fig. 1(a) to (c)). However, the amount of increase σ⁺ depends on the vector field itself, and varies from case to case. Hence, instead of using extra computational resources to determine the suitable expansion value, let us consider the extreme case, where the sector span σ + σ⁺ equals π. In this case, as long as the vector differs by less than π/2 from the actual direction, the sector of the rotated vector, now a half-plane, is large enough to cover the center (see Fig. 1(d)).
Fig. 1. Algorithm illustration: (a) optimal σ, perfect field; (b) optimal σ, noisy field; (c) enlarged σ, noisy field; (d) σ + σ⁺ = π, noisy field. The shaded sector regions and the point covered by the largest number of sectors are marked.
Fig. 2. Results on synthetic fields with different values of ρ and α: (a)–(e) ρ = 1 and (f)–(j) ρ > 1, each with α = 0⁺, 3π/10, π/2, 7π/10 and π⁻ respectively.
Based on this idea, our proposed center identification algorithm is as follows. For each point on the vector field, any neighbouring vector within a distance of d pixels from it is checked to see whether the point is on its left or right, and the left- and right-counts are recorded. For a clockwise circulating or spiraling vector field, the point having the maximum right-count is the center location, whereas the left-count is considered when the field is a counterclockwise one. The value d defines a distance of influence so that vectors far away, which may not be relevant to the flow of interest, are not considered. It also allows the algorithm to handle vector fields with multiple centers. Since the algorithm only involves left-right checking and counting, the method can be implemented with only integer operations. To further boost the speed of the algorithm, search algorithms that make use of the principle of locality are also incorporated in the design to locate the point with the maximum count.
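A direct, unoptimized sketch of this counting scheme for a counterclockwise field is given below (our illustration, not the authors' integer-only implementation). Which side counts as "left" depends on the orientation of the image coordinate system, so the sign of the cross-product test is an assumption that may need to be flipped.

```python
import numpy as np

def find_center(u, v, d=5):
    """Position with the largest left-count for a counterclockwise field."""
    h, w = u.shape
    counts = np.zeros((h, w), dtype=int)
    for y in range(h):
        for x in range(w):
            if u[y, x] == 0 and v[y, x] == 0:
                continue                                  # no vector here
            for dy in range(-d, d + 1):
                for dx in range(-d, d + 1):
                    py, px = y + dy, x + dx
                    if (dy or dx) and 0 <= py < h and 0 <= px < w \
                            and dy * dy + dx * dx <= d * d:
                        # Sign of the cross product between the field vector at
                        # (y, x) and the offset to the candidate point: positive
                        # means the candidate lies to the vector's left.
                        if u[y, x] * dy - v[y, x] * dx > 0:
                            counts[py, px] += 1
    return np.unravel_index(np.argmax(counts), counts.shape)
```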
3 Evaluation
To validate and evaluate the proposed method, a Java-based system prototype was built. We show the soundness of the proposed algorithm and its robustness against noisy vector fields using synthetic vector fields in Sections 3.1 and 3.2, followed by a discussion of efficiency issues and the effect of search algorithms in Sections 3.3 and 3.4 respectively. The practicality of the method is then demonstrated in Section 4. The performance is evaluated by comparing with three center finding approaches: (1) vector field pattern matching with a template "center" and the absolute sine metric [5], (2) critical points classified as "center" by algebraic analysis, based on [9], and (3) structural analysis, as in [10]. The efficiency of the algorithm is evaluated by profiling tests on an iMac G5 computer running Mac OS 10.4.9, with a 2GHz PowerPC G5 processor and 1.5GB of RAM. The
effectiveness of the algorithm is evaluated by finding the average Euclidean distance between the proposed center and the actual center of the vector field.

3.1 Validating the Method

To empirically validate our algorithm, it is applied to synthetic vector fields of size 161 × 161 pixels generated using the polar equation $r = ae^{\theta \cot \alpha}$, which covers a family of 2D vector field patterns: spiral flow, circular flow, source and sink [10]. The field is viewed at different angles to produce fields with different aspect ratios ρ in Cartesian coordinates: (x, y) = (r cos θ, rρ sin θ). The score is presented as a grayscale image on which the original vector field is overlaid; this representation is used in subsequent results. The higher the left-count (counterclockwise field) at a location (x, y), the brighter the point. With d = 5, a single brightest spot is clearly shown at the field center for all tested values of ρ and α, giving a basic validation of our algorithm (Fig. 2).

3.2 Robustness Against Noise

Noise sensitivity studies were carried out using synthetic circulating fields of size 161 × 161 pixels with different types of noise that model three different situations: (1) Gaussian noise on each vector dimension, modeling sensor inaccuracies (e.g., errors in wind direction measurements); (2) random missing values, modeling sensor or object tracking failures; and (3) a partially occluded field, modeling video occlusions (e.g., a train passing in front of a Ferris wheel). The latter two cases are generated by replacing some of the vectors of an ideal vector field by zero vectors or by vectors pointing in a particular direction. In the experiment, the distance of influence d of the proposed method and of structural analysis is set to five pixels. The template used for the pattern matching algorithm is of size 41 × 41 pixels, generated from an ideal rotating field. Pattern matching takes the lowest score (darkest pixel) as the answer; all other methods take the highest score (brightest pixel) as the answer.

Fig. 3(a) shows the results. The ideal vector fields without noise serve as references for this noise sensitivity study. Pattern matching gives the worst result when Gaussian noise is present. This is because the absolute sine function changes rapidly around the zero angle difference point, and the function tends to exaggerate the damage of occasional vectors that are not in the right direction. Yet, the pattern matching approach can handle data discontinuity cases, such as occlusion and random missing values. In contrast, the algebraic analysis method works well on fields with Gaussian noise. The Fourier transform of a Gaussian function in the spatial domain is a Gaussian function in the frequency domain; thus, the addition of Gaussian noise does not affect the global maximum score unless a frequency component of the noise is greater than the strongest frequency component of the original signal. However, phase portrait analysis cannot handle the occlusion case well. The structural analysis and our proposed method work well in all cases, and their level of noise tolerance can be controlled by adjusting the distance of influence d. In particular, the consideration of only the left or right count of a vector in our method allows
Fig. 3. Performance on noisy vector fields: (a) comparison between the major approaches (proposed method, pattern matching, algebraic analysis, structural analysis) and (b) comparison between the search algorithms (2LHS, TSS, 2DLogS, OrthS), each evaluated on ideal, Gaussian-noise, random-missing and occluded fields.
slightly distorted vectors to cover the true rotation center, giving a more distinct peak (the brightest pixel) than structural analysis when Gaussian noise is present.

3.3 Efficiency

The efficiencies of the algorithms are compared by profiling tests; Table 1(a) shows the results. Algebraic analysis requires vector field decomposition, estimation of phase portrait matrices and classification, and thus takes the longest time. The speeds of pattern matching and structural analysis are comparable; their run times are quadratic in the linear dimension of the template and in the distance of influence d, respectively. Our proposed method is the fastest and can be implemented using integer operations only. Yet, as with structural analysis, its efficiency is affected by the distance of influence d.

3.4 Boosting Efficiency: Use of Search Algorithms

To further speed up our algorithm, search algorithms that exploit the principle of locality are incorporated in the design. Four popular algorithms [11], namely two-level hierarchical search (2LHS), three-step search (TSS), two-dimensional logarithmic search (2DLogS), and orthogonal search (OrthS), are implemented for comparison with exhaustive search (ExhS), in which the score of every point is evaluated. Their properties are summarized in Table 1(b). The performance of the algorithms on the noise tolerance and efficiency tests is shown in Fig. 3(b) and Table 1(b), respectively. In general, the fewer locations a search algorithm examines in each iteration, the faster the center identification process is, but the less distinct the result. The use of search algorithms boosts efficiency by at least 9.33 times while hardly affecting noise tolerance. This is an advantage, as a faster search algorithm can be chosen to speed up the process while preserving quality. A sketch of one such search is given below.
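For concreteness, the following sketch shows a three-step-search-style maximization of a score function; the step sizes, the starting point and the callable interface (here, `score(y, x)` would compute the left/right count at a single candidate point) are our own illustrative assumptions rather than details taken from the paper.

```python
def three_step_search(score, shape, start=None, step=None):
    """TSS-style hill climbing: evaluate the score only at the centre and its
    8 neighbours at the current step size, move to the best one, halve the
    step, and repeat until the step reaches one pixel."""
    h, w = shape
    y, x = start if start is not None else (h // 2, w // 2)
    s = step if step is not None else max(h, w) // 4
    while s >= 1:
        best = (score(y, x), y, x)
        for dy in (-s, 0, s):
            for dx in (-s, 0, s):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    v = score(ny, nx)
                    if v > best[0]:
                        best = (v, ny, nx)
        _, y, x = best
        s //= 2
    return x, y
```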
Table 1. Results of the profiling tests

(a) Comparison between the major approaches

Method                Parameter   Time (ms)
Algebraic analysis                669769
Pattern matching                  208759
Structural analysis   d = 10      200444
Proposed method       d = 10      148416

(b) Comparison between the search algorithms (d = 10)

Algorithm   Time (ms)   Properties
ExhS        148416      Examines every possible location
2LHS        15905       A hierarchical algorithm: searches sparsely, then narrows down
TSS         5756        Reduces the search distance in each iteration
2DLogS      7973        Reduces the search distance when the center is of the highest rank
OrthS       6932        Explores the search space axis by axis

Fig. 4. Ferris wheel: results and cumulative percentage of frames against error. Panels: (a) frame 01; (b) frame 15; (c) key; (d) comparison between the proposed method and the major approaches; (e) effect of search algorithms on the proposed method. Each plot shows the cumulative frequency of frames against the Euclidean distance from the actual center in pixels.
4 Applications

In this section, we demonstrate the use of our method in the fields of multimedia and meteorology. In these practical applications, vectors far away may not be related to the rotating object; hence, we only consider vectors within d = 100 pixels when calculating the left and right counts on videos of 320 × 240 or 480 × 480 pixels in size. Depending on the level of noise immunity desired, different values of d can be used in practice. To speed up the process in handling real-life applications, the proposed method, pattern matching, algebraic analysis and structural analysis are applied to sampled locations of the vector field to position the circulation center. The effect of search algorithms in handling real-life applications is also studied.

4.1 Rotation Center Identification in Video Sequences

The video sequences used for the experiments include a video of a Ferris wheel taken from the front at a slightly elevated view, and a sequence of presentation slides with
a swirl transition effect. Motion fields generated from MPEG-4 videos of 320 × 240 pixels in size are used for the detection of the rotation center. Each frame is segmented into overlapping blocks of 16 × 16 pixels, spaced 10 pixels apart both horizontally and vertically. Two-level hierarchical search is then applied to every block to find the motion vector from the best matching block in the previous frame, using the mean absolute error as the distortion function (a block-matching sketch is given below). The motion field is then smoothed by a 5 × 5 median filter, and the tested methods are applied to the resulting motion field to locate the rotation center.

Since the density of the motion field affects the accuracy of the rotation center location, and the vector field found may not be perfect, for each algorithm we take the top three highest-scored centers to determine the output. Among these top three, the one that is closest to the actual center is taken as the final answer. Here, the actual center is the centroid of the Ferris wheel. The performance of the algorithms is compared by plotting the fraction of frames with error smaller than a given error distance, as in Fig. 4(d). A point (x, y) means that a fraction y of the frames gives an error that is no more than x pixels from the actual center. A perfect algorithm would produce a graph that goes straight up from (0, 0) to (0, 1) and then straight to the right; the nearer the line for an algorithm is to this perfect graph, the more accurate it is.

From the graph, the percentages of frames giving an average error within one step size with (1) our proposed method, (2) pattern matching, (3) algebraic analysis, and (4) structural analysis are 91%, 1%, 46% and 90%, respectively. The low accuracy of the pattern matching method is caused by the sine function exaggerating the damage of occasional vectors that are not in the right direction in the imperfect field, as discussed in Section 3.2. For algebraic analysis, the imperfections in the motion field, especially the noisy vectors at the edges of the frame (Fig. 4(a)), cause the error; this explains why the fraction stays below 80% until the error distance reaches about 170 pixels. The results of our proposed algorithm and structural analysis are comparable. Both methods consider only vectors within a distance d, so noise far away from the actual center has no effect, offsetting the imperfections in motion fields and increasing the practicality of the method.

Results and the error graph for the swirl transition sequence from presentation slides are shown in Fig. 5. In this presentation sequence, the first slide rotates while zooming out from the center, and the next slide rotates while zooming in. The areas with motion vectors change continuously in the video sequence. Moreover, as the slides are shown as rectangular boxes, undesired motion vectors are generated at their edges and corners, and this edge effect affects the algorithm performance. The percentages of frames giving an average error within two step sizes with (1) our proposed method, (2) pattern matching, (3) algebraic analysis, and (4) structural analysis are 83%, 52%, 20% and 66%, respectively. Algebraic analysis classified the vector field center as a node or a saddle instead of a center in some cases, leading to a low 20% coverage within two step sizes; yet, if the points classified as saddles or nodes are also considered as centers, over 80% coverage within two step sizes is obtained.
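The following sketch illustrates block-matching motion estimation with the mean absolute error criterion; for simplicity it uses an exhaustive search over a small window rather than the two-level hierarchical search used above, and the block size, spacing and search range are illustrative values.

```python
import numpy as np

def block_motion_field(prev, curr, block=16, spacing=10, search=7):
    """Estimate one motion vector per block of `curr` by exhaustively searching
    a (2*search+1)^2 window in `prev` and minimizing the mean absolute error."""
    h, w = curr.shape
    vectors = []
    for y in range(0, h - block + 1, spacing):
        for x in range(0, w - block + 1, spacing):
            target = curr[y:y + block, x:x + block].astype(np.float32)
            best = (np.inf, 0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy and yy + block <= h and 0 <= xx and xx + block <= w:
                        cand = prev[yy:yy + block, xx:xx + block].astype(np.float32)
                        mae = np.mean(np.abs(target - cand))
                        if mae < best[0]:
                            best = (mae, dx, dy)
            vectors.append((x, y, best[1], best[2]))  # (block x, block y, dx, dy)
    return vectors
```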
The use of the limit d in our algorithm and in structural analysis, or the template size in the pattern matching approach, limits the area of analysis. Unless vectors at the edges or corners of the slides are within a distance d from the actual center (or within the template for pattern matching), the performance of the algorithms remains unaffected by the edge effect. The lower
Fig. 5. Presentation: results and cumulative percentage of frames against error. Panels: (a) frame 299; (b) frame 386; (c) key; (d) comparison between the proposed method and the major approaches; (e) effect of search algorithms on the proposed method. Each plot shows the cumulative frequency of frames against the Euclidean distance from the actual center in pixels.
percentage for the pattern matching approach is mainly due to the mismatch between the template and the imperfect vector field.

Studying the effect of search algorithms in both cases (Fig. 4(e) and 5(e)), the best results are, as expected, given by ExhS on all vectors, followed by 2LHS on sampled search positions. The use of TSS, 2DLogS and OrthS resulted in relatively larger errors. This is because TSS, 2DLogS and OrthS start the search at an initial search position that depends on the answer for the previous frame; if the initial search position is incorrect, errors may accumulate and affect subsequent results. When an initial estimate of the center position is available, these three algorithms are good choices since they are faster.

4.2 Tropical Cyclone Eye Fix

Fixing the center of a tropical cyclone (TC) is important in weather forecasting. A typical TC has spiral rainbands with an inflow angle of about 10°, swirling counterclockwise in the Northern Hemisphere, so a spiraling vector field is generated from a sequence of remote sensing data. Our proposed method is applied to a sequence of radar images (5 hours, 50 frames) compressed as MPEG-4 videos of 480 × 480 pixels in size. The output positions of the system are smoothed using a Kalman filter. To give an objective evaluation, interpolated best tracks¹ issued by the Joint Typhoon Warning Center (JTWC) [12] and the Hong Kong Observatory (HKO) [13] are used for
¹ Best tracks are the hourly TC locations determined after the event by a TC warning center using all available data.
Fig. 6. Comparison of TC tracks and eye fix results. Panels: (a) early stage of the TC; (b) later stage of the TC; (c) key; (d) comparison between the proposed method and the major approaches; (e) effect of search algorithms.
comparison. Fig. 6 shows the eye fix results and a comparison of the tracks proposed by the different methods. We observe that pattern matching gives the worst result, with most frames having an answer far away from the best tracks. The results of algebraic analysis were affected by the vectors formed by radar echoes at the outermost rainbands of the TC (Fig. 6(b)); yet, when a well-structured vector field is found, the algebraic analysis method gives proposed centers close to the best tracks (Fig. 6(a)). Our proposed track and the one given by structural analysis are close to the best tracks given by HKO and JTWC. Using the HKO best track data as a reference, our proposed method gives an average error of 0.16 degrees on the Mercator-projected map (Table 2(a)), well within the relative error of 0.3 degrees given by different weather centers [14]. The search algorithms 2LHS, 2DLogS and OrthS are comparable, with average errors ranging from 0.17 to 0.19 degrees, while TSS gives an average error of 0.35 degrees. This is because potentially better results far away from the initial location cannot be examined, as the search distance of TSS halves every iteration. The application of our proposed method in weather forecasting shows its practicality and its ability to find the center of a spiraling flow.
5 Summary

We have proposed a fast and noise-tolerant method for identifying the centers of a family of 2D vector field patterns: spiral flow, circular flow, source and sink. For each point on the
Table 2. Average error from the interpolated HKO best track

(a) Proposed method and major approaches

Algorithm             Error (degrees)
Proposed method       0.16
Pattern matching      1.40
Algebraic analysis    0.39
Structural analysis   0.25

(b) Effect of search algorithms

Search algorithm   Error (degrees)
2LHS               0.17
TSS                0.35
2DLogS             0.19
OrthS              0.18
vector field, every neighbouring vector within a distance of d pixels is checked to see whether the point is on the left or right of the vector. The location with the maximum left or right count is the center location of a counterclockwise or clockwise circulating flow, respectively. The method can be implemented using integer operations only. The proposed method is found to be 1.35 to 4.51 times faster than traditional methods, and can be boosted to up to 96.62 times faster when a search algorithm is incorporated, with little tradeoff in effectiveness. The algorithm is tolerant to different types of noise, such as Gaussian noise, missing vectors, and partially occluded fields. The practicality of the method is demonstrated with examples of detecting the centers of rotating objects in video sequences and identifying the eye positions of tropical cyclones in weather forecasting.
References
1. Wong, K.Y., Yip, C.L.: Fast rotation center identification methods for video sequences. In: Proc. ICME, Amsterdam, The Netherlands, pp. 289–292 (July 2005)
2. Li, P.W., Lai, S.T.: Short range quantitative precipitation forecasting in Hong Kong. J. Hydrol. 288, 189–209 (2004)
3. Shu, C.F., Jain, R., Quek, F.: A linear algorithm for computing the phase portraits of oriented textures. In: Proc. CVPR, Maui, Hawaii, USA, pp. 352–357 (June 1991)
4. Shu, C.F., Jain, R.C.: Vector Field Analysis for Oriented Patterns. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-16(9), 946–950 (1994)
5. Rodrigues, P.S.S., de Araújo, A.A., Pinotti, M.: Describing patterns in flow-like images. In: Proc. 10th ICIAP, Venice, Italy, pp. 424–429 (September 1999)
6. Heiberg, E., Ebbers, T., Wigström, L., Karlsson, M.: Three-Dimensional Flow Characterization Using Vector Pattern Matching. IEEE Trans. Vis. Comput. Graphics 9(3), 313–319 (2003)
7. Ebling, J., Scheuermann, G.: Clifford convolution and pattern matching on vector fields. In: Proc. 14th Vis., Seattle, Washington, USA, pp. 26–33 (October 2003)
8. Polthier, K., Preuss, E.: Identifying Vector Field Singularities Using a Discrete Hodge Decomposition. In: Visualization and Mathematics III (2003)
9. Corpetti, T., Mémin, E., Pérez, P.: Extraction of Singular Points from Dense Motion Fields: An Analytic Approach. J. Math. Imag. and Vis. 19, 175–198 (2003)
10. Yip, C.L., Wong, K.Y.: Identifying centers of circulating and spiraling flow patterns. In: Proc. 18th ICPR, Hong Kong, vol. 1, pp. 769–772 (August 2006)
11. Furht, B., Greenberg, J., Westwater, R.: Motion Estimation Algorithms for Video Compression. Kluwer Academic Publishers, Boston (1997)
12. Joint Typhoon Warning Center: Web page (2007), http://www.npmoc.navy.mil/jtwc.html
13. Hong Kong Observatory: Web page (2007), http://www.hko.gov.hk/
14. Lam, C.Y.: Operational Tropical Cyclone forecasting from the perspective of a small weather service. In: Proc. ICSU/WMO Sym. Tropical Cyclone Disasters, pp. 530–541 (October 1992)
Interpolation Between Eigenspaces Using Rotation in Multiple Dimensions

Tomokazu Takahashi1,2, Lina1, Ichiro Ide1, Yoshito Mekada3, and Hiroshi Murase1

1 Graduate School of Information Science, Nagoya University, Japan
[email protected]
2 Japan Society for the Promotion of Science
3 Department of Life System Science and Technology, Chukyo University, Japan
Abstract. We propose a method for interpolation between eigenspaces. Techniques that represent observed patterns as multivariate normal distributions have actively been developed to make recognition robust to observation noise. In the recognition of images that vary with continuous parameters such as camera angle, one cause of degraded performance is that training images are observed only at discrete parameter values while the parameters vary continuously. The proposed method interpolates between eigenspaces by analogy with the rotation of a hyper-ellipsoid in a high-dimensional space. Experiments using face images captured under various illumination conditions demonstrate the validity and effectiveness of the proposed interpolation method.
1 Introduction
Appearance-based pattern recognition techniques that represent observed patterns as multivariate normal distributions have actively been developed to make them robust to observation noise. The subspace method [1] and related techniques [2,3] enable accurate recognition under conditions where observation noise such as pose and illumination variations exists. Performance, however, degrades when the variations are far larger than expected. On the other hand, the parametric eigenspace method [4] deals with variations using manifolds, that is, parametric curved lines or surfaces. The manifolds are parameterized by parameters corresponding to pose and illumination conditions controlled in the training phase. This enables object recognition and, at the same time, parameter estimation, i.e., estimating the pose and illumination parameters when an input image is given. However, this method is not very tolerant of uncontrolled noise that is not parameterized, e.g., translation, rotation, or motion blurring of input images. Accordingly, Lina et al. developed a method that embeds multivariate normal density information in each point on the manifolds [5]. This method generates density information as a mean vector and a covariance matrix from training images that are degraded by artificial noise such as translation, rotation, or motion blurring. Each noise is controlled by a noise model and its
parameter. To obtain density information between consecutive poses and generate smooth manifolds, the method interpolates training images degraded by the identical noise model and parameter between consecutive poses. When various other kinds of observation noise are considered, however, controlling noise by a model and a parameter is difficult; therefore, establishing correspondences between training images is not realistic. The increase of computational cost with a growing number of training images is also a problem. In light of this background, we propose a method to smoothly interpolate between eigenspaces by analogy with the rotation of a hyper-ellipsoid in a high-dimensional space. Section 2 introduces the mathematical foundation, the interpolation of a rotation matrix using diagonalization and its geometrical significance. Section 3 describes the proposed interpolation method. Section 4 demonstrates the validity and effectiveness of interpolation by the proposed method through experimental results using face images captured under various illumination conditions. Section 5 summarizes the paper.
2 Interpolation of Rotation Matrices in an n-Dimensional Space

2.1 Diagonalization of a Rotation Matrix
An $n \times n$ real matrix ${}_n R$ is a rotation matrix when it satisfies the following conditions:

$$ {}_n R \, {}_n R^T = {}_n R^T \, {}_n R = {}_n I, \qquad \det({}_n R) = 1, \tag{1} $$

where $A^T$ denotes the transpose of $A$ and ${}_n I$ denotes the $n \times n$ identity matrix. ${}_n R$ can be diagonalized with an $n \times n$ unitary matrix ${}_n U$ and a diagonal matrix ${}_n D$ with complex elements as

$$ {}_n R = {}_n U \, {}_n D \, {}_n U^{\dagger}. \tag{2} $$

Here, $A^{\dagger}$ denotes the complex conjugate transpose of $A$. The following equation is obtained for a real number $x$:

$$ {}_n R^{x} = {}_n U \, {}_n D^{x} \, {}_n U^{\dagger}. \tag{3} $$
${}_n R^{x}$ represents an interpolated rotation when $0 \le x \le 1$ and an extrapolated rotation in other cases. This means that once ${}_n U$ is calculated, the interpolation and extrapolation of ${}_n R$ can easily be obtained.
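As a minimal illustration of Equation (3), the following numpy sketch raises a rotation matrix to a fractional power through its complex eigendecomposition; the function name and the use of a generic eigensolver are our own choices, and the result is real only up to numerical round-off.

```python
import numpy as np

def rotation_power(R, x):
    # Eigenvalues of a rotation matrix are e^{+-i*theta} (and possibly 1), so
    # raising them to the power x rotates by x*theta on each invariant plane.
    w, U = np.linalg.eig(R)
    Rx = U @ np.diag(w ** x) @ np.linalg.inv(U)
    return Rx.real  # imaginary parts are round-off noise for a real rotation

# Example: the square root of a 90-degree planar rotation is a 45-degree one.
R90 = np.array([[0.0, -1.0], [1.0, 0.0]])
assert np.allclose(rotation_power(R90, 0.5),
                   [[np.sqrt(0.5), -np.sqrt(0.5)],
                    [np.sqrt(0.5),  np.sqrt(0.5)]])
```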
2.2 Geometrical Significance of Diagonalization
A two-dimensional rotation matrix ${}_2 R(\theta)$, where $\theta\ (-\pi < \theta \le \pi)$ is its rotation angle, can be diagonalized as

$$ {}_2 R(\theta) = {}_2 U \, {}_2 D(\theta) \, {}_2 U^{\dagger}, \tag{4} $$

where

$$ {}_2 R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}, \tag{5} $$

$$ {}_2 D(\theta) = \begin{pmatrix} e^{i\theta} & 0 \\ 0 & e^{-i\theta} \end{pmatrix}. \tag{6} $$

Here, since $e^{i\theta} = \cos\theta + i\sin\theta$ (Euler's formula, $|e^{i\theta}| = |e^{-i\theta}| = 1$), we have ${}_2 R(\theta)^x = {}_2 R(x\theta)$ as well as ${}_2 D(\theta)^x = {}_2 D(x\theta)$ for a real number $x$. The eigen-equation of ${}_n R$ has $m$ pairs of complex conjugate roots, whose absolute values are 1, when $n = 2m$. Meanwhile, when $n = 2m + 1$, ${}_n R$ has the same $m$ pairs of complex conjugate roots and 1 as roots. Therefore, ${}_n D$ in Equation (2) can be described as

$$
{}_n D(\boldsymbol{\theta}) =
\begin{cases}
\begin{bmatrix} {}_2 D(\theta_1) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & {}_2 D(\theta_m) \end{bmatrix} & (n = 2m) \\[2ex]
\begin{bmatrix} 1 & & \cdots & 0 \\ & {}_2 D(\theta_1) & & \\ \vdots & & \ddots & \vdots \\ 0 & & \cdots & {}_2 D(\theta_m) \end{bmatrix} & (n = 2m + 1)
\end{cases}
\tag{7}
$$

by an $m$-dimensional vector $\boldsymbol{\theta} = (\theta_j \mid -\pi < \theta_j \le \pi,\ j = 1, 2, \cdots, m)$ composed of $m$ rotation angles. Thus Equation (2) can be described as

$$ {}_n R(\boldsymbol{\theta}) = {}_n U \, {}_n D(\boldsymbol{\theta}) \, {}_n U^{\dagger}. \tag{8} $$
This means that ${}_n R^{x}$ in Equation (3) is obtained as ${}_n R(x\boldsymbol{\theta})$ by simply linearly interpolating the vector. Additionally,

$$ {}_n R(\boldsymbol{\theta}) = {}_n U \, {}_n U'^{\dagger} \, {}_n R'(\boldsymbol{\theta}) \, {}_n U' \, {}_n U^{\dagger}. \tag{9} $$

Here, when $n = 2m + 1$,

$$
{}_n R'(\boldsymbol{\theta}) = \begin{bmatrix} 1 & & \cdots & 0 \\ & {}_2 R(\theta_1) & & \\ \vdots & & \ddots & \vdots \\ 0 & & \cdots & {}_2 R(\theta_m) \end{bmatrix}, \tag{10}
$$

$$
{}_n U' = \begin{bmatrix} 1 & & \cdots & 0 \\ & {}_2 U & & \\ \vdots & & \ddots & \vdots \\ 0 & & \cdots & {}_2 U \end{bmatrix}. \tag{11}
$$

Meanwhile, when $n = 2m$, ${}_n R'(\boldsymbol{\theta})$ and ${}_n U'$ are obtained by removing the first row and the first column from these matrices, in the same way as in Equation (7). Because
Fig. 1. Pose interpolation for a four-dimensional cube
the set of all $n$-dimensional rotation matrices forms a group under multiplication, called $SO(n)$ (the special orthogonal group), ${}_n U \, {}_n U'^{\dagger}$ can be transformed into a rotation matrix ${}_n R''$; therefore, using only real rotation matrices, ${}_n R(\boldsymbol{\theta})$ can be described as

$$ {}_n R(\boldsymbol{\theta}) = {}_n R'' \, {}_n R'(\boldsymbol{\theta}) \, {}_n R''^{T}. \tag{12} $$

Using this expression, computational cost and memory are expected to be reduced, since real matrices are used instead of complex ones. Note that the interpolated results are identical to those obtained from the simple diagonalization shown in Equation (2). ${}_n R'(\boldsymbol{\theta})$ represents rotations on $m$ independent rotational planes, where no rotation affects the other rotational planes. This means that Equation (12) expresses ${}_n R(\boldsymbol{\theta})$ as a sequence of rotation matrices: a rotation ${}_n R''$ unique to ${}_n R(\boldsymbol{\theta})$, a rotation on independent planes ${}_n R'(\boldsymbol{\theta})$, and the inverse of the unique rotation.

2.3 Rotation of a Four-Dimensional Cube
We interpolated poses of a four-dimensional cube using the rotation matrix interpolation method described above. First, two rotation matrices $R_0$ and $R_1$ were randomly chosen as key poses of the cube, and then poses between the key poses were interpolated by Equation (13). For every interpolated pose, we visualized the cube by a wireframe model using perspective projection:

$$ R_x = R_{0 \to 1}(x\boldsymbol{\theta})\,R_0. \tag{13} $$

Here, $R_{a \to b}(\boldsymbol{\theta}) = R_b R_a^{T}$. This equation corresponds to linear interpolation of rotation matrices. Figure 1 shows the interpolated results. To make it easier to see how the four-dimensional cube rotates, the vertex trajectory is plotted with dots.
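A hypothetical usage sketch of Equation (13): two random 4D rotations serve as key poses, and intermediate poses are generated with the rotation_power() helper from the earlier sketch; the QR-based random rotation generator and the number of interpolation steps are our own illustrative choices.

```python
import numpy as np

def random_rotation(n, rng):
    # QR decomposition of a random Gaussian matrix gives an orthogonal Q;
    # flip one column if necessary so that det(Q) = +1.
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]
    return Q

rng = np.random.default_rng(0)
R0, R1 = random_rotation(4, rng), random_rotation(4, rng)
R_0to1 = R1 @ R0.T                       # R_{0->1}(theta) = R_1 R_0^T
poses = [rotation_power(R_0to1, x) @ R0  # Eq. (13): R_x = R_{0->1}(x*theta) R_0
         for x in np.linspace(0.0, 1.0, 11)]
```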
3 Interpolation of Eigenspaces Using Rotation of a Hyper-Ellipsoid

3.1 Approach
The proposed method interpolates eigenspaces considering an eigenspace as a multivariate normal density. The iso-density points of a multivariate normal density are known to form a hyper-ellipsoid surface. Eigenvectors and eigenvalues
Fig. 2. Interpolation of hyper-ellipsoids: the densities N0(μ0, Σ0) and N1(μ1, Σ1), with axes e0j and e1j, are interpolated into Nx(μx, Σx) with axes exj in the feature space.
can be considered as the directions of the hyper-ellipsoid's axes and their lengths, respectively. We consider that the eigenspaces between two eigenspaces can be interpolated by rotating a hyper-ellipsoid while expanding and contracting the length of each of its axes (Figure 2).

The interpolation of ellipsoids has the following two problems. First, the correspondence of one ellipsoid's axes to another ellipsoid's axes cannot be determined uniquely. Second, the rotation angle cannot be determined uniquely because ellipsoids are symmetric. Because of these problems, ellipsoids cannot, in general, be interpolated uniquely from two ellipsoids. The following two conditions are imposed in the proposed method to obtain a unique interpolation.

[Condition 1] Minimize the variation of the interpolated ellipsoid's volume.
[Condition 2] Minimize the variation of the interpolated ellipsoid's rotation angle.

3.2 Algorithm
When two multivariate normal densities $N_0(\mu_0, \Sigma_0)$ and $N_1(\mu_1, \Sigma_1)$ are given, an interpolated or extrapolated density $N_x(\mu_x, \Sigma_x)$ for a real number $x$ is calculated by the following procedure. Here, $\mu$ and $\Sigma$ represent an $n$-dimensional mean vector and an $n \times n$ covariance matrix, respectively.

Interpolation of mean vectors: $\mu_x$ is obtained by simple linear interpolation with the following equation, which corresponds to interpolation of the ellipsoids' centers:

$$ \mu_x = (1 - x)\mu_0 + x\mu_1. \tag{14} $$

Interpolation of covariance matrices: The eigenvectors and eigenvalues of each covariance matrix carry the information about the pose of the ellipsoid and the lengths of its axes, respectively. First, $n \times n$ matrices $E_0$ and $E_1$ are formed by aligning the eigenvectors $e_{0j}$ and $e_{1j}$ $(j = 1, 2, \cdots, n)$ of $\Sigma_0$ and $\Sigma_1$. At the same time, $n$-dimensional vectors $\lambda_0$, $\lambda_1$ are formed by aligning the eigenvalues $\lambda_{0j}$, $\lambda_{1j}$ $(j = 1, 2, \cdots, n)$.
[Step 1] To obtain the correspondences of axes between the ellipsoids based on [Condition 1], $E'_0$ and $E'_1$ are formed by sorting the eigenvectors in $E_0$ and $E_1$ in the order of their eigenvalues. $\lambda'_0$ and $\lambda'_1$ are formed from $\lambda_0$, $\lambda_1$ in the same way.

[Step 2] Based on [Condition 2], $e'_{1j}$ $(j = 1, 2, \cdots, n)$ is inverted if $e'^{T}_{0j} e'_{1j} < 0$, so that the angle between corresponding axes is less than or equal to $\pi/2$.

[Step 3] $e'_{0n}$ is inverted if $\det(E'_0) = -1$, and likewise $e'_{1n}$ is inverted if $\det(E'_1) = -1$, so that $E'_0$ and $E'_1$ satisfy Equation (1).

The eigenvalues $\lambda_{xj}$ of $\Sigma_x$ are calculated by

$$ \lambda_{xj} = \left( (1 - x)\sqrt{\lambda'_{0j}} + x\sqrt{\lambda'_{1j}} \right)^2, \tag{15} $$

and its eigenvectors $E_x$ are calculated by

$$ E_x = R_{0 \to 1}(x\boldsymbol{\theta})\,E'_0. \tag{16} $$

Here, $R_{0 \to 1}(\boldsymbol{\theta}) = E'_1 E'^{T}_0$. Therefore, $\Sigma_x$ is calculated by

$$ \Sigma_x = E_x \Lambda_x E_x^{T}. \tag{17} $$

Here, $\Lambda_x$ represents a diagonal matrix that has $\lambda_{xj}$ $(j = 1, 2, \cdots, n)$ as its diagonal elements.
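A minimal sketch of this interpolation procedure follows. The reading of Equation (15) as an interpolation of the square roots of the eigenvalues, the descending sort order, and all function and variable names are our assumptions; the sketch reuses rotation_power() from the earlier example.

```python
import numpy as np

def interpolate_gaussians(mu0, Sigma0, mu1, Sigma1, x):
    # Eq. (14): linear interpolation of the means (ellipsoid centers).
    mu_x = (1.0 - x) * mu0 + x * mu1

    # Eigendecomposition: columns are axis directions, lam are squared lengths.
    lam0, E0 = np.linalg.eigh(Sigma0)
    lam1, E1 = np.linalg.eigh(Sigma1)

    # Step 1: sort axes by eigenvalue so that corresponding axes match.
    i0, i1 = np.argsort(lam0)[::-1], np.argsort(lam1)[::-1]
    lam0, E0 = lam0[i0], E0[:, i0]
    lam1, E1 = lam1[i1], E1[:, i1]

    # Step 2: flip axes of E1 so each corresponding pair spans at most pi/2.
    flip = np.sign(np.sum(E0 * E1, axis=0))
    flip[flip == 0] = 1.0
    E1 = E1 * flip

    # Step 3: make both bases proper rotations (determinant +1).
    if np.linalg.det(E0) < 0:
        E0[:, -1] = -E0[:, -1]
    if np.linalg.det(E1) < 0:
        E1[:, -1] = -E1[:, -1]

    # Eq. (15): interpolate the axis lengths (square roots of eigenvalues).
    lam_x = ((1.0 - x) * np.sqrt(lam0) + x * np.sqrt(lam1)) ** 2

    # Eq. (16): rotate the axes by the interpolated rotation R_{0->1}(x*theta).
    E_x = rotation_power(E1 @ E0.T, x) @ E0

    # Eq. (17): rebuild the interpolated covariance.
    Sigma_x = E_x @ np.diag(lam_x) @ E_x.T
    return mu_x, Sigma_x
```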
4 Experiments Using Actual Images
To demonstrate the effectiveness and validity of the proposed interpolation method, we conducted face recognition experiments based on a subspace method. Training images were captured from two different angles under various illumination conditions, whereas input images were captured only from an intermediate angle. In the training phase, a subspace for each camera angle was constructed from images captured under different illumination conditions. We compared the recognition performance using the two original subspaces with that using the interpolated subspaces.

4.1 Conditions
In the experiments, we used the face images of ten persons captured from three different angles (two for training and one for testing) under 51 different illumination conditions. Figures 3 and 4 show examples of the persons' images and of images captured under various illumination conditions. In Figure 5, images from camera angles $c_0$ and $c_1$ were used for training and $c_{0.5}$ for testing. The images were chosen from the face image dataset “Yale Face Database B” [6]. We represented each image as a low-dimensional vector in a 30-dimensional feature space using a dimension reduction technique based on PCA. In the training phase, for each person $p$, autocorrelation matrices $\Sigma_0^{(p)}$ and $\Sigma_1^{(p)}$ were calculated from the images obtained from angles $c_0$ and $c_1$, and then matrices $E_0^{(p)}$ and $E_1^{(p)}$ consisting of the eigenvectors of the autocorrelation matrices were obtained.
Fig. 3. Sample images of ten persons’ faces used in experiment
Fig. 4. Sample images captured in various illumination conditions used in experiment
In the recognition phase, the similarity between the subspaces and a test image captured from $c_{0.5}$ was measured, and the recognition result $\hat{p}$ giving the maximum similarity was obtained. The similarity between an input vector $z$ and the $K(\le 30)$-dimensional subspace of $E_x^{(p)}$ is calculated by

$$ S_x^{(p)}(z) = \sum_{k=1}^{K} \langle e_{x,k}^{(p)}, z \rangle^2, \tag{18} $$

where $E_x^{(p)}$ $(0 \le x \le 1)$ is the interpolated eigenspace between $E_0^{(p)}$ and $E_1^{(p)}$, and $\langle \cdot,\cdot \rangle$ represents the inner product of two vectors. $E_x^{(p)}$ is calculated by Equation (16). The proposed method, which uses the subspaces of the interpolated eigenspaces, obtains $\hat{p}$ by

$$ \hat{p} = \arg\max_{p}\left( \max_{0 \le x \le 1} S_x^{(p)}(z) \right). \tag{19} $$

On the other hand, as a comparison method, the recognition method with the subspaces of $E_0^{(p)}$ and $E_1^{(p)}$ obtains $\hat{p}$ by

$$ \hat{p} = \arg\max_{p}\left( \max\left( S_0^{(p)}(z),\ S_1^{(p)}(z) \right) \right). \tag{20} $$
We defined K = 5 in Equation 18 empirically through preliminary experiments.
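A minimal sketch of this classification rule (Eqs. 18 and 19), assuming an `eigenspaces` mapping from person IDs to a function returning the interpolated basis for a given x; the grid discretization of x and all names are illustrative.

```python
import numpy as np

def similarity(E_x, z, K=5):
    # Eq. (18): sum of squared inner products with the first K basis vectors
    # (columns of E_x, assumed sorted by decreasing eigenvalue).
    return float(np.sum((E_x[:, :K].T @ z) ** 2))

def classify(z, eigenspaces, K=5, num_steps=11):
    # Eq. (19): maximize the similarity over interpolation positions x in [0, 1]
    # (sampled on a grid here) and over persons p.
    xs = np.linspace(0.0, 1.0, num_steps)
    best_p, best_s = None, -np.inf
    for p, interp in eigenspaces.items():
        s = max(similarity(interp(x), z, K) for x in xs)
        if s > best_s:
            best_p, best_s = p, s
    return best_p
```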
Fig. 5. Sample images captured from the three camera angles (c0, c0.5 and c1) used in the experiment
Table 1. Comparison of recognition rates

Recognition Method                                    Recognition Rate [%]
Interpolated subspaces by proposed method (Eq. 19)    71.8
Two subspaces (Eq. 20)                                62.6

Fig. 6. Bhattacharyya distances between the actual normal density and the interpolated densities (vertical axis: Bhattacharyya distance; horizontal axis: x from 0.0 to 1.0)
4.2 Results and Discussion
Table 1 compares the recognition rates of the two methods described in Section 4.1. From this result, we confirmed the effectiveness of the proposed method for face recognition. To verify the validity of the interpolation by the proposed method, Figure 6 shows the Bhattacharyya distances between the normal density obtained from $c_{0.5}$ and the interpolated normal densities from $x = 0$ to $x = 1$ for one person. Since the distance becomes smaller around $x = 0.5$, the validity of the interpolation by the proposed method can be observed. In addition, Figure 7 visualizes the interpolated eigenvectors from $x = 0$ to $x = 1$ for the same person. We can see that the direction of each eigenvector changes smoothly under the high-dimensional rotation.
5 Summary
In this paper, we proposed a method for interpolation between eigenspaces. Experiments on face recognition based on the subspace method demonstrated the effectiveness and validity of the proposed method. Future work includes extending the method to higher-order interpolation, such as cubic splines, and recognition experiments using larger datasets.
Fig. 7. Interpolated eigenvectors (vertical axis: dimension; horizontal axis: x from 0.0 to 1.0)
References 1. Watanabe, S., Pakvasa, N.: Subspace Method of Pattern Recognition. In: Proc. 1st Int. J. Conf. on Pattern Recognition, pp. 25–32 (1971) 2. Turk, M., Pentland, A.: Face Recognition Using Eigenfaces. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 568–591. IEEE Computer Society Press, Los Alamitos (1991) 3. Moghaddam, B.: Principal Manifolds and Bayesian Subspaces for Visual Recognition. In: Proc. Int. Conf. on Computer Vision, pp. 1131–1136 (1999) 4. Murase, H., Nayar, S.K.: Illumination Planning for Object Recognition Using Parametric Eigenspaces. IEEE Trans. Pattern Analysis and Machine Intelligence 16(12), 1218–1227 (1994) 5. Takahashi, L.T., Ide, I., Murase, H.: Appearance Manifold with Embedded Covariance Matrix for Robust 3-D Object Recognition. In: Proc. 10th IAPR Conf. on Machine Vision Applications, pp. 504–507 (2007) 6. Georghiades, A.S., Belhumeur, P.N., Kriegman, C.D.J.: From Few to Many: Illumination Cone Models for Face Recognition under Variable Lighting and Pose. IEEE Trans. Pattern Analysis and Machine Intelligence 23(6), 643–660 (2001)
Conic Fitting Using the Geometric Distance

Peter Sturm and Pau Gargallo

INRIA Rhône-Alpes and Laboratoire Jean Kuntzmann, France
Abstract. We consider the problem of fitting a conic to a set of 2D points. It is commonly agreed that minimizing geometrical error, i.e. the sum of squared distances between the points and the conic, is better than using an algebraic error measure. However, most existing methods rely on algebraic error measures. This is usually motivated by the fact that point-to-conic distances are difficult to compute and by the belief that non-linear optimization of conics is computationally very expensive. In this paper, we describe a parameterization for the conic fitting problem that allows us to circumvent the difficulty of computing point-to-conic distances, and we show how to perform the non-linear optimization process efficiently.
1 Introduction
Fitting of ellipses, or conics in general, to edge or other data is a basic task in computer vision and image processing. Most existing works concentrate on solving the problem using linear least squares formulations [3,4,16]. The bias introduced by the linear problem formulation is often corrected by solving iteratively reweighted linear least squares problems [8,9,10,12,16], which is equivalent to non-linear optimization. In this paper, we propose a non-linear optimization approach for fitting a conic to 2D points, based on minimizing the sum of squared geometric distances between the points and the conic.

The arguments why most of the algorithms proposed in the literature do not use the sum of squared geometrical distances as an explicit cost function are:

– non-linear optimization is required, thus the algorithms will be much slower;
– the computation of a point's distance to a conic requires the solution of a 4th order polynomial [13,18], which is time-consuming and does not allow analytical derivation (for optimization methods requiring derivatives), thus leading to the use of numerical differentiation, which is again time-consuming.

The main goal of this paper is to partly contradict these arguments. This is mainly achieved by parameterizing the problem in a way that allows us to replace point-to-conic distance computations by point-to-point distance computations, thus avoiding the solution of 4th order polynomials. The problem formulation remains non-linear though. However, we show how to solve our non-linear optimization problem efficiently, in a manner routinely used in bundle adjustment.
2 Problem Formulation

2.1 Cost Function
Let $q_p = (x_p, y_p, 1)^T$, $p = 1 \ldots n$, be the homogeneous coordinates of $n$ measured 2D points. The aim is to fit a conic to these points. Many methods have been proposed for this task, often based on minimizing the sum of algebraic distances [3,4,12] (here, $C$ is the usual symmetric $3 \times 3$ matrix representing the conic):

$$ \sum_{p=1}^{n} \left( C_{11}x_p^2 + C_{22}y_p^2 + 2C_{12}x_p y_p + 2C_{13}x_p + 2C_{23}y_p + C_{33} \right)^2 $$
This is a linear least squares problem, requiring some constraint on the unknowns in order to avoid the trivial solution. For example, Bookstein proposes the constraint $C_{11}^2 + 2C_{12}^2 + C_{22}^2 = 1$, which makes the solution invariant to Euclidean transformations of the data [3]. Fitzgibbon, Pilu and Fisher impose $4(C_{11}C_{22} - C_{12}^2) = 1$ in order to guarantee that the fitted conic will be an ellipse [4]. In both cases, the constrained linear least squares problem can be solved via a $3 \times 3$ symmetric generalized eigenvalue problem. The cost function we want to minimize (cf. Section 5) is

$$ \sum_{p=1}^{n} \operatorname{dist}(q_p, C)^2 \tag{1} $$
where $\operatorname{dist}(q, C)$ is the geometric distance between a point $q$ and a conic $C$, i.e. the distance between $q$ and the point on $C$ that is closest to $q$. Determining $\operatorname{dist}(q, C)$ in general requires computing the roots of a 4th order polynomial.
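For reference, here is a minimal sketch of the linear algebraic fit discussed above, using a unit-norm constraint on the parameter vector (a slight variation of the constraints cited; the function name and array layout are our own).

```python
import numpy as np

def fit_conic_algebraic(points):
    # points: (n, 2) array of (x, y) coordinates.
    x, y = points[:, 0], points[:, 1]
    # Rows of the design matrix multiply the parameter vector
    # (C11, C22, C12, C13, C23, C33) to give the algebraic distance.
    A = np.column_stack([x**2, y**2, 2*x*y, 2*x, 2*y, np.ones_like(x)])
    # Minimizing ||A c|| subject to ||c|| = 1: the right singular vector
    # associated with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    C11, C22, C12, C13, C23, C33 = Vt[-1]
    return np.array([[C11, C12, C13],
                     [C12, C22, C23],
                     [C13, C23, C33]])
```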
2.2 Transformations and Types of Conics
Let $P$ be a projective transformation acting on 2D points (i.e. $P$ is a $3 \times 3$ matrix). A conic $C$ is transformed by $P$ according to ($\sim$ means equality up to scale):

$$ C' \sim P^{-T} C P^{-1} \tag{2} $$

In this work we are only interested in real conics, i.e. conics that do not contain only imaginary points. These can be characterized using the eigendecomposition of the conic's $3 \times 3$ matrix: imaginary conics are exactly those whose eigenvalues all have the same sign [2]. We are thus only interested in conics with eigenvalues of different signs. This constraint will be explicitly imposed, as shown in the following section. In addition, we are only interested in proper conics, i.e. non-degenerate ones, with only non-zero eigenvalues. Concerning the different types of real conics, we distinguish the projective and affine classes:

– all proper real conics are projectively equivalent, i.e. for any two conics, there exists at least one projective transformation relating them according to (2);
– affine classes: ellipses, hyperbolae, parabolae.
In the following, we formulate the optimization problem for general conics, i.e. the corresponding algorithm may find the correct solution even if the initial guess is of the "wrong" type. Specialization of the method to the 3 affine cases of interest is relatively straightforward; details are given in [15].
3 Minimizing the Geometrical Distance
In this section, we describe our method for minimizing the geometrical-distance-based cost function. The key to the method is the parameterization of the problem. In the next paragraph, we first describe the parameterization, before showing that it indeed allows us to minimize the geometrical distance. After this, we explain how to initialize the parameters and describe how to solve the non-linear optimization problem in a computationally efficient way.

3.1 Parameterization
The parameterization explained in the following is illustrated in Figure 1. For each of the $n$ measured points $q_p$, we parameterize a point $\hat{q}_p$, such that all $\hat{q}_p$ lie on a conic. The simplest way to do so is to choose the unit circle as support, in which case we may parameterize each $\hat{q}_p$ by an angle $\alpha_p$:

$$ \hat{q}_p = \begin{pmatrix} \cos\alpha_p \\ \sin\alpha_p \\ 1 \end{pmatrix} $$

Furthermore, we include in our parameterization a 2D projective transformation, or homography, $P$. We then want to solve the following optimization problem:

$$ \min_{P,\ \alpha_1 \cdots \alpha_n}\ \sum_{p=1}^{n} \operatorname{dist}(q_p, P\hat{q}_p)^2 \tag{3} $$
In Section 3.2, we show that this parameterization indeed allows us to minimize the desired cost function based on point-to-conic distances. At first sight, this parameterization has the drawback of a much larger number of parameters than necessary: $n + 8$ (the $n$ angles $\alpha_p$ and 8 parameters for $P$) instead of the 5 that would suffice to parameterize a conic. We will show however, in Section 3.4, that the optimization can nevertheless be carried out in a computationally efficient manner, due to the sparsity of the normal equations associated with our least squares cost function.

Up to now, we have considered $P$ as a general 2D homography, which is clearly an overparameterization. We actually parameterize $P$ minimally:

$$ P \sim R\,\Sigma = R\,\operatorname{diag}(a, b, c) $$

where $R$ is an orthonormal matrix and $a$, $b$ and $c$ are scalars. We show in the following section that this parameterization is sufficient, i.e. it allows us to express all proper real conics.
Fig. 1. Illustration of our parameterization
We may thus parameterize $P$ using 6 parameters (3 for $R$ and the 3 scalars). Since the scalars are only relevant up to a global scale factor, we may fix one and thus reduce the number of parameters for $P$ to the minimum of 5. More details on the parameterization of the orthonormal matrix $R$ are given in Section 3.4.
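To make the parameterization concrete, the following sketch evaluates the residuals behind cost (3) and feeds them to a generic dense solver; the Euler-angle construction of R, fixing c = 1, and the use of scipy.optimize.least_squares are our own simplifications and do not reproduce the paper's incremental update scheme or its sparse solver.

```python
import numpy as np
from scipy.optimize import least_squares

def euler_to_rot(a, b, g):
    # One of several equivalent Euler-angle parameterizations of R.
    ca, sa, cb, sb, cg, sg = np.cos(a), np.sin(a), np.cos(b), np.sin(b), np.cos(g), np.sin(g)
    Rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    return Rz @ Ry @ Rx

def residuals(params, pts):
    # params = [alpha, beta, gamma, a, b, ang_1 .. ang_n]; the scalar c is fixed to 1.
    R = euler_to_rot(*params[:3])
    P = R @ np.diag([params[3], params[4], 1.0])
    ang = params[5:]
    q_hat = np.column_stack([np.cos(ang), np.sin(ang), np.ones_like(ang)])
    proj = (P @ q_hat.T).T
    proj = proj[:, :2] / proj[:, 2:3]
    return (pts - proj).ravel()

# Usage (pts is an (n, 2) array; x0 comes from the initialization of Section 3.3):
# result = least_squares(residuals, x0, args=(pts,), method="lm")
```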
3.2 Completeness of the Parameterization
We first show that the above parameterization allows us to "reach" all proper real conics and then that minimizing the associated cost function (3) is equivalent to minimizing the desired cost function (1).

For any choice of $R$, $a$, $b$ and $c$ ($a, b, c \neq 0$), the associated homography will map the points $\hat{q}_p$ to a set of points that lie on a conic $C$. This is obvious since point-conic incidence is invariant to projective transformations and since the $\hat{q}_p$ lie on a conic at the outset (the unit circle). The resulting conic $C$ is given by:

$$ C \sim P^{-T} \underbrace{\begin{pmatrix} 1 & & \\ & 1 & \\ & & -1 \end{pmatrix}}_{\text{unit circle}} P^{-1} \sim R \begin{pmatrix} 1/a^2 & & \\ & 1/b^2 & \\ & & -1/c^2 \end{pmatrix} R^{T} $$
We now show that any proper real conic $C'$ can be "reached" by our parameterization, i.e. that there exist an orthonormal matrix $R$ and scalars $a$, $b$ and $c$ such that $C \sim C'$. To do so, we consider the eigendecomposition of $C'$:

$$ C' = R'^{T} \begin{pmatrix} a' & & \\ & b' & \\ & & c' \end{pmatrix} R' $$

where $R'$ is an orthonormal matrix whose rows are eigenvectors of $C'$, and $a'$, $b'$ and $c'$ are its eigenvalues (any symmetric matrix may be decomposed in this way). The condition for $C'$ being a proper conic is that its three eigenvalues
are non-zero, and the condition that it is a proper and real conic is that one eigenvalue's sign is opposed to that of the two others.

If, for example, $c'$ is this eigenvalue, then with $R = R'$, $a = 1/\sqrt{|a'|}$, $b = 1/\sqrt{|b'|}$ and $c = 1/\sqrt{|c'|}$, we obviously have $C \sim C'$. If the "individual" eigenvalue is $a'$ instead, the following solution holds:

$$ a = \frac{1}{\sqrt{|c'|}}, \quad b = \frac{1}{\sqrt{|b'|}}, \quad c = \frac{1}{\sqrt{|a'|}}, \quad R = \begin{pmatrix} & & 1 \\ & -1 & \\ 1 & & \end{pmatrix} R' $$

and similarly for $b'$ being the "individual" eigenvalue. Hence, our parameterization of a homography via an orthonormal matrix and three scalars is complete.

We now show that the associated cost function (3) is equivalent to the desired cost function (1), i.e. that the global minima of both cost functions correspond to the same conic (if a unique global minimum exists, of course). Let $C'$ be the global minimum of the cost function (1). For any measured point $q_p$, let $\hat{v}_p$ be the closest point on $C'$. If more than one point on $C'$ is equidistant from $q_p$, pick any one of them.

We have shown above that there exist $R$, $a$, $b$ and $c$ such that $P$ maps the unit circle to $C'$. Let $\hat{w}_p = P^{-1}\hat{v}_p$. Since $\hat{v}_p$ lies on $C'$, it follows that $\hat{w}_p$ lies on the unit circle. Hence, there exists an angle $\alpha_p$ such that $\hat{w}_p \sim (\cos\alpha_p, \sin\alpha_p, 1)^T$. Consequently, there exists a set of parameters $R, a, b, c, \alpha_1, \ldots, \alpha_n$ for which the value of the cost function (3) is the same as that of the global minimum of (1). Hence, our parameterization and cost function are equivalent to minimizing the desired cost function based on the geometrical distance between points and conics.

3.3 Initialization
Initialization
Minimizing the cost function (3) requires a non-linear, thus iterative, optimization method. Initial values for the parameters may be taken from the result of any other (linear) method. Let C be the initial guess for the conic. The initial values for R, a, b and c (thus, for P) are obtained in the way outlined in the previous section, based on the eigendecomposition of C . As for the angles αp , we determine the closest points on C to the measured qp , by solving the 4th order polynomial mentioned in the introduction or an equivalent problem (see [15] for details). We then map these to the unit circle using P and extract the angles αp , as described in the previous section. 3.4
Optimization
We now describe how we optimize the cost function (3). Any non-linear optimization method may be used, but since we deal with a non-linear least squares problem, we use the Levenberg-Marquardt method [7]. In the following, we describe how we deal with the rotational components of our parameterization (the orthonormal matrix R and the angles αp ), we then explicitly give the Jacobian of the cost function, and show how the normal equations’ sparsity may be used to solve them efficiently.
Update of Rotational Parameters. To avoid singularities in the parameterization of the orthonormal matrix R, we estimate, as is typical practice e.g. in photogrammetry [1], a first order approximation of an orthonormal "update" matrix at each iteration, as follows:

1. Let $R_0$ be the estimation of $R$ after the previous iteration.
2. Let $R_1 = R_0\Delta$ be the estimation to be obtained after the current iteration. Here, we only allow the update matrix $\Delta$ to vary, i.e. $R_0$ is kept fixed. Using the Euler angles $\alpha, \beta, \gamma$, we may parameterize $\Delta$ as follows:

$$
\Delta = \begin{pmatrix}
\cos\beta\cos\gamma & \sin\alpha\sin\beta\cos\gamma - \cos\alpha\sin\gamma & \cos\alpha\sin\beta\cos\gamma + \sin\alpha\sin\gamma \\
\cos\beta\sin\gamma & \sin\alpha\sin\beta\sin\gamma + \cos\alpha\cos\gamma & \cos\alpha\sin\beta\sin\gamma - \sin\alpha\cos\gamma \\
-\sin\beta & \sin\alpha\cos\beta & \cos\alpha\cos\beta
\end{pmatrix}
\tag{4}
$$

3. The update angles $\alpha$, $\beta$ and $\gamma$ will usually be very small, i.e. we have $\cos\alpha \approx 1$ and $\sin\alpha \approx \alpha$. Instead of optimizing directly over the angles, we thus use the first order approximation of $\Delta$:

$$ \Delta' = \begin{pmatrix} 1 & -\gamma & \beta \\ \gamma & 1 & -\alpha \\ -\beta & \alpha & 1 \end{pmatrix} $$

4. In the cost function (3), we thus replace $R$ by $R_0\Delta'$, and estimate $\alpha$, $\beta$ and $\gamma$. At the end of the iteration, we update the estimation of $R$. In order to keep $R$ orthonormal, we of course do not update it using the first order approximation, i.e. as $R_1 = R_0\Delta'$. Instead, we compute an exact orthonormal update matrix $\Delta$ using equation (4) and the estimated angles, and update the rotation via $R_1 = R_0\Delta$.
5. It is important to note that at the next iteration, $R_1$ will be kept fixed on its turn, and new (small) update angles will be estimated. Thus, the initial values of the update angles at each iteration are always zero, which greatly simplifies the analytical computation of the cost function's Jacobian.

The points $\hat{q}_p$ on the unit circle are updated in a similar manner, using a 1D rotation matrix each for the update:

$$ \Psi_p = \begin{pmatrix} \cos\rho_p & -\sin\rho_p & 0 \\ \sin\rho_p & \cos\rho_p & 0 \\ 0 & 0 & 1 \end{pmatrix} $$

and its first order approximation:

$$ \Psi_p' = \begin{pmatrix} 1 & -\rho_p & 0 \\ \rho_p & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} $$

Points are thus updated as follows: $\hat{q}_p \to \Psi_p\hat{q}_p$ (where, as for $R$, the update angles are estimated using the first order approximations $\Psi_p'$).
Cost Function and Jacobian. Let the measured points be given by $q_p = (x_p, y_p, 1)^T$, and the current estimate of the $\hat{q}_p$ by $\hat{q}_p = (\hat{x}_p, \hat{y}_p, 1)^T$. At each iteration, we have to solve the problem:

$$ \min_{a,b,c,\alpha,\beta,\gamma,\rho_1\cdots\rho_n}\ \sum_{p=1}^{n} d^2\!\left(q_p,\ R\Delta'\Sigma\Psi_p'\hat{q}_p\right) \tag{5} $$
This has a least squares form, i.e. we may formulate the cost function using $2n$ residual functions:

$$ \sum_{j=1}^{2n} r_j^2 \quad\text{with}\quad r_{2i-1} = x_i - \frac{(R\Delta'\Sigma\Psi_i'\hat{q}_i)_1}{(R\Delta'\Sigma\Psi_i'\hat{q}_i)_3} \quad\text{and}\quad r_{2i} = y_i - \frac{(R\Delta'\Sigma\Psi_i'\hat{q}_i)_2}{(R\Delta'\Sigma\Psi_i'\hat{q}_i)_3} $$

As for the Jacobian of the cost function, it is defined as:

$$
J = \begin{pmatrix}
\frac{\partial r_1}{\partial a} & \frac{\partial r_1}{\partial b} & \frac{\partial r_1}{\partial c} & \frac{\partial r_1}{\partial \alpha} & \frac{\partial r_1}{\partial \beta} & \frac{\partial r_1}{\partial \gamma} & \frac{\partial r_1}{\partial \rho_1} & \frac{\partial r_1}{\partial \rho_2} & \cdots & \frac{\partial r_1}{\partial \rho_n} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
\frac{\partial r_{2n}}{\partial a} & \frac{\partial r_{2n}}{\partial b} & \frac{\partial r_{2n}}{\partial c} & \frac{\partial r_{2n}}{\partial \alpha} & \frac{\partial r_{2n}}{\partial \beta} & \frac{\partial r_{2n}}{\partial \gamma} & \frac{\partial r_{2n}}{\partial \rho_1} & \frac{\partial r_{2n}}{\partial \rho_2} & \cdots & \frac{\partial r_{2n}}{\partial \rho_n}
\end{pmatrix}
$$
It can be computed analytically, as follows. Due to the fact that before each iteration, the update angles α, β, γ, ρ1 , . . . , ρn are all zero, the entries of the Jacobian, evaluated at each iteration, have the following very simple form:
$$
\begin{pmatrix}
\hat{x}_1 u_{11} & \hat{y}_1 u_{12} & u_{13} & (b\hat{y}_1 u_{13} - cu_{12}) & (cu_{11} - a\hat{x}_1 u_{13}) & (a\hat{x}_1 u_{12} - b\hat{y}_1 u_{11}) & u_{14} & \cdots & 0 \\
\hat{x}_1 v_{11} & \hat{y}_1 v_{12} & v_{13} & (b\hat{y}_1 v_{13} - cv_{12}) & (cv_{11} - a\hat{x}_1 v_{13}) & (a\hat{x}_1 v_{12} - b\hat{y}_1 v_{11}) & v_{14} & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
\hat{x}_n u_{n1} & \hat{y}_n u_{n2} & u_{n3} & (b\hat{y}_n u_{n3} - cu_{n2}) & (cu_{n1} - a\hat{x}_n u_{n3}) & (a\hat{x}_n u_{n2} - b\hat{y}_n u_{n1}) & 0 & \cdots & u_{n4} \\
\hat{x}_n v_{n1} & \hat{y}_n v_{n2} & v_{n3} & (b\hat{y}_n v_{n3} - cv_{n2}) & (cv_{n1} - a\hat{x}_n v_{n3}) & (a\hat{x}_n v_{n2} - b\hat{y}_n v_{n1}) & 0 & \cdots & v_{n4}
\end{pmatrix}
$$

with

$$
u_i = s_i^2 \begin{pmatrix} bR_{23}\hat{y}_i - cR_{22} \\ cR_{21} - aR_{23}\hat{x}_i \\ aR_{22}\hat{x}_i - bR_{21}\hat{y}_i \\ c(bR_{21}\hat{x}_i + aR_{22}\hat{y}_i) - abR_{23} \end{pmatrix}, \qquad
v_i = s_i^2 \begin{pmatrix} cR_{12} - bR_{13}\hat{y}_i \\ aR_{13}\hat{x}_i - cR_{11} \\ bR_{11}\hat{y}_i - aR_{12}\hat{x}_i \\ abR_{13} - c(bR_{11}\hat{x}_i + aR_{12}\hat{y}_i) \end{pmatrix}
$$

and $s_i = (aR_{31}\hat{x}_i + bR_{32}\hat{y}_i + cR_{33})^{-1}$. As for the residual functions themselves, with $\alpha, \beta, \gamma, \rho_1, \ldots, \rho_n$ being zero before each iteration, they evaluate to:

$$ r_{2i-1} = x_i - s_i\,(aR_{11}\hat{x}_i + bR_{12}\hat{y}_i + cR_{13}), \qquad r_{2i} = y_i - s_i\,(aR_{21}\hat{x}_i + bR_{22}\hat{y}_i + cR_{23}) $$

With the explicit expressions for the Jacobian and the residual functions, we have given all the ingredients required to optimize the cost function using e.g. the Levenberg-Marquardt or Gauss-Newton methods. In the following paragraph, we show how to benefit from the sparsity of the Jacobian (nearly all derivatives with respect to the $\rho_p$ are zero).
Hessian. The basic approximation to the Hessian matrix used in least squares optimizers such as Gauss-Newton is $H = J^T J$. Each iteration of such a non-linear method comes down to solving a linear equation system of the following form:

$$ \begin{pmatrix} A_{6\times6} & B_{6\times n} \\ B^T & D_{n\times n} \end{pmatrix} \begin{pmatrix} x_6 \\ y_n \end{pmatrix} = \begin{pmatrix} a_6 \\ b_n \end{pmatrix} $$

where $D$ is, due to the sparsity of the Jacobian (see the previous paragraph), a diagonal matrix. The right-hand side is usually the negative gradient of the cost function, which for least squares problems can also be computed as $-J^T r$, $r$ being the vector of the $2n$ residuals defined above. As suggested in [14], we may reduce this $(6+n)\times(6+n)$ problem to a $6\times6$ problem, as follows:

1. The lower set of equations gives:
$$ y = D^{-1}\left(b - B^T x\right) \tag{6} $$
2. Replacing this in the upper set of equations, we get:
$$ \left(A - BD^{-1}B^T\right) x = a - BD^{-1}b \tag{7} $$
3. Since $D$ is diagonal, its inversion is trivial, and thus the coefficients of the equation system (7) may be computed efficiently (in time and memory).
4. Once $x$ is computed by solving the $6\times6$ system (7), $y$ is obtained using (6).

Hence, the most complex individual operation at each iteration is the same as in iterative methods minimizing algebraic distance: inverting a $6\times6$ symmetric matrix or, equivalently, solving a linear equation system of the same size. In practice, we reduce the original problem to $(5+n)\times(5+n)$, respectively $5\times5$, by fixing one of the scalars $a$, $b$, $c$ (the one with the largest absolute value). However, most of the computation time is actually spent on computing the partial derivatives required to compute the coefficients of the above equation systems. Overall, the computational complexity is linear in the number of data points. A detailed complexity analysis is given in [15]. With a non-optimized implementation, measured computation times for one iteration were about 10 times those required for the standard linear method (Linear in the next section). This may seem like a lot, but note that, e.g., with 200 data points, one iteration requires less than 2 milliseconds on a 2.8 GHz Pentium 4.
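A minimal numpy sketch of this block elimination, assuming the blocks A, B, the diagonal of D and the right-hand sides a, b have already been assembled (the function name and argument layout are ours):

```python
import numpy as np

def solve_sparse_normal_equations(A, B, D_diag, a, b):
    """Solve the block system [[A, B], [B^T, D]] [x; y] = [a; b] with D
    diagonal, via the reduced system of Eqs. (6)-(7)."""
    Dinv = 1.0 / D_diag
    S = A - (B * Dinv) @ B.T          # reduced 6x6 (or 5x5) coefficient matrix
    rhs = a - B @ (Dinv * b)          # reduced right-hand side
    x = np.linalg.solve(S, rhs)       # Eq. (7)
    y = Dinv * (b - B.T @ x)          # Eq. (6), back-substitution
    return x, y
```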
4 Experimental Results
Points were simulated on a unit circle, equally distributed over an arc of varying length, to simulate occluded data. Each point was subjected to Gaussian noise (in x and y). Six methods were used to fit conics:

– Linear: least squares solution based on the algebraic distance, using the constraint of unit norm on the conic's coefficients.
– Bookstein: the method of [3].
– Fitzgibbon: the method of [4].
– Non-linear optimization using our method, with the results of the above methods as initialization: Linear-opt, Book-opt and Fitz-opt.
Fig. 2. Two examples: 50 points were distributed over an arc of 160◦ , and were subjected to a Gaussian noise of a standard deviation of 5 percent the radius of the circle
Fig. 3. Relative error on estimated minor axis length, as a function of noise (the unit of the y-axis is 100%). The graphs for the three non-linear optimization methods are superimposed.
We performed experiments for many combinations of noise level, amount of occlusion and number of points on the conic; see [15] for a detailed account. Figure 2 shows two typical examples. With all three initializations, the optimization method converged to the same conic, in a few iterations each (2 or 3 typically). It is not obvious how to quantitatively compare the methods. Displaying residual geometrical point-to-conic distances, for example, would be unfair, since our method is designed to minimize exactly this. Instead, we compute an error measure on the estimated conic. Figure 3 shows the relative error on the length of the estimated conic's minor axis (one indicator of how well the conic's shape has
Fig. 4. Sample results on real data: fitting conics to catadioptric line images. Colors are as in figure 2; reference conics are shown in white and data points in black, in the common portion of the estimated conics.
been recovered), relative to the amount of noise. Each point in the graph represents the median value of the results of 50 simulation runs. All methods degrade rather gracefully, the non-linear optimization results being by far the best (the three graphs are superimposed). We also tested our approach with the results of a hyperbola-specific version of [4] as initialization. In most cases, the optimization method is capable of switching from a hyperbola to an ellipse and of reaching the same solution as when initialized with an ellipse.

Figure 4 shows sample results on real data, fitting conics to edge points of catadioptric line images (same color code as in Figure 2). Reference conics are shown in white; they were fitted using calibration information on the catadioptric camera (restricting the problem to 2 dof) and serve here as "ground truth". The data points are shown by the black portion common to all estimated conics. They cover very small portions of the conics, making the fitting rather ill-posed. The ill-posedness shows, e.g., in the fact that in most cases, conics with widely varying shapes have similar residuals. Nevertheless, our approach gives results that are clearly more consistent than those of any of the other methods; also note that in the shown examples, the three non-linear optimizations converged to the same conic each time. More results are given in [15].
5 Discussion on Choice of Cost Function
Let us briefly discuss the cost function used. A usual choice, and the one we adopted here, is the sum of squared geometrical distances of the points to the conic. Minimizing this cost function gives the optimal conic in the maximum likelihood sense, under the assumption that data points are generated from points on the true conic, by displacing them along the normal direction by a random distance that follows a zero mean Gaussian distribution, the same for all points. Another choice [17] is based on the assumption that a data point could be generated from any point on the true conic, by displacing it possibly in other directions than the normal to the conic. There may be other possibilities, taking into account the different densities of data points along the conic in areas with different
curvatures. Which cost function to choose depends on the underlying application but of course also on the complexity of implementation and computation. In this work we use the cost function based on the geometrical distance between data points and the conic; it is analytically and computationally more tractable than e.g. [17]. Further, if data points are obtained by edge detection, i.e. if they form a contour, then it is reasonable to assume that the order of the data points along the contour is the same as that of the points on the true conic that were generating them. Hence, it may not be necessary here to evaluate the probability of all points on the conic generating all data points and it seems reasonable to stick with the geometric distance between data points and the conic, i.e. the distance between data points and the closest points on the conic. A more detailed discussion is beyond the scope of this paper though. A final comment is that it is straightforward to embed our approach in any M-estimator, in order to make it robust to outliers.
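As an illustration of the last remark, the sketch below shows one standard way of wrapping a geometric-distance fit in an M-estimator: squared residuals are Huber-weighted and the weights are recomputed between optimization passes (iteratively reweighted least squares). It is our own sketch, not the authors' code; the function `fit_weighted` stands for a (weighted) geometric-distance conic optimization and is purely hypothetical.

```python
import numpy as np

def huber_weights(residuals, k=1.345):
    """IRLS weights for the Huber M-estimator: w(r) = 1 for |r| <= k, k/|r| otherwise."""
    r = np.abs(residuals)
    w = np.ones_like(r)
    mask = r > k
    w[mask] = k / r[mask]
    return w

def robust_conic_fit(points, fit_weighted, n_outer=5, k=1.345):
    """Wrap a weighted geometric-distance conic fit in Huber IRLS.

    `fit_weighted(points, weights)` is assumed to return (conic_params, residuals),
    where the residuals are the signed geometric point-to-conic distances.
    """
    weights = np.ones(len(points))
    conic, residuals = fit_weighted(points, weights)
    for _ in range(n_outer):
        scale = 1.4826 * np.median(np.abs(residuals)) + 1e-12  # robust scale (MAD)
        weights = huber_weights(residuals / scale, k)
        conic, residuals = fit_weighted(points, weights)
    return conic
```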
6 Conclusions and Perspectives
We have proposed a method for fitting conics to points, minimizing geometrical distance. The method avoids the solution of 4th order polynomials, often considered to be one of the main reasons for using algebraic distances. We have described in as much detail as possible how to perform the non-linear optimization computationally efficiently. A few simulation results are presented that suggest that the optimization of geometrical distance may correct bias present in results of linear methods, as expected. However, the main motivation for this paper was not to measure absolute performance, but to show that conic fitting by minimization of geometrical distance, is feasible. Recently, we became aware of the work [5], that describes an ellipse-specific method very similar in spirit and formulation to ours. Our method, as presented, is not specific to any affine conic type. This is an advantage if the type of conic is not known beforehand (e.g. line-based camera calibration of omnidirectional cameras is based on fitting conics of possibly different types [6]), and switching between different types is indeed completely natural for the method. However, we have also implemented ellipse-, hyperbola- and parabola-specific versions of the method [15]. The proposed approach for conic fitting can be adapted to other problems. This is rather direct for e.g. the reconstruction of a conic’s shape from multiple calibrated images or the optimization of the pose of a conic with known shape, from a single or multiple images. Equally straightforward is the extension to the fitting of quadrics to 3D point sets. Generally, the approach may be used for fitting various types of surfaces or curves to sets of points or other primitives. Another application is plumb-line calibration, where points would have to be parameterized on lines instead of the unit circle. Besides this, we are currently investigating an extension of our approach to the estimation of the shape and/or pose of quadrics, from silhouettes in multiple images. The added difficulty is that points on quadrics have to be parameterized such as to lie on occluding contours.
This may be useful for estimating articulated motions of objects modelled by quadric-shaped parts, similar to [11] which considered cone-shaped parts. Other current work is to make a Matlab implementation of the proposed approach publicly available, on the first author’s website and to study cases when the Gauss-Newton approximation of the Hessian may become singular. Acknowledgements. We thank Pascal Vasseur for the catadioptric image and the associated calibration data and the reviewers for very useful comments.
References 1. Atkinson, K.B. (ed.): Close Range Photogrammetry and Machine Vision. Whittles Publishing (1996) 2. Boehm, W., Prautzsch, H.: Geometric Concepts for Geometric Design. A.K. Peters (1994) 3. Bookstein, F.L.: Fitting Conic Sections to Scattered Data. Computer Graphics and Image Processing 9, 56–71 (1979) 4. Fitzgibbon, A., Pilu, M., Fisher, R.B.: Direct Least Square Fitting of Ellipses. IEEE–PAMI 21(5), 476–480 (1999) 5. Gander, W., Golub, G.H., Strebel, R.: Fitting of Circles and Ellipses. BIT 34, 556–577 (1994) 6. Geyer, C., Daniilidis, K.: Catadioptric Camera Calibration. In: ICCV, pp. 398–404 (1999) 7. Gill, P.E., Murray, W., Wright, M.H.: Practical Optimization. Academic Press, San Diego (1981) 8. Hal´ır, R.: Robust Bias-Corrected Least Squares Fitting of Ellipses. In: Conf. in Central Europe on Computer Graphics, Visualization and Interactive Digital Media (2000) 9. Kanatani, K.: Statistical Bias of Conic Fitting and Renormalization. IEEE– PAMI 16 (3), 320–326 (1994) 10. Kanazawa, Y., Kanatani, K.: Optimal Conic Fitting and Reliability Evaluation. IEICE Transactions on Information and Systems E79-D (9), 1323–1328 (1996) 11. Knossow, D., Ronfard, R., Horaud, R., Devernay, F.: Tracking with the Kinematics of Extremal Contours. In: Narayanan, P.J., Nayar, S.K., Shum, H.-Y. (eds.) ACCV 2006. LNCS, vol. 3851, pp. 664–673. Springer, Heidelberg (2006) 12. Rosin, P.L.: Analysing Error of Fit Functions for Ellipses. Pattern Recognition Letters 17, 1461–1470 (1996) 13. Rosin, P.L.: Ellipse Fitting Using Orthogonal Hyperbolae and Stirling’s Oval. Graphical Models and Image Processing 60(3), 209–213 (1998) 14. Slama, C.C. (ed.): Manual of Photogrammetry, 4th edn. American Society of Photogrammetry and Remote Sensing (1980) 15. Sturm, P.: Conic Fitting Using the Geometric Distance. Rapport de Recherche, INRIA (2007) 16. Taubin, G.: Estimation of Planar Curves, Surfaces, and Nonplanar Space Curves Defined by Implicit Equations with Applications to Edge and Range Image Segmentation. IEEE–PAMI 13(11), 1115–1138 (1991) 17. Werman, M., Keren, D.: A Bayesian Method for Fitting Parametric and Nonparametric Models to Noisy Data. IEEE–PAMI 23(5), 528–534 (2001) 18. Zhang, Z.: Parameter Estimation Techniques: A Tutorial with Application to Conic Fitting. Rapport de Recherche No. 2676, INRIA (1995)
Efficiently Solving the Fractional Trust Region Problem Anders P. Eriksson, Carl Olsson, and Fredrik Kahl Centre for Mathematical Sciences Lund University, Sweden
Abstract. Normalized Cuts has successfully been applied to a wide range of tasks in computer vision, it is indisputably one of the most popular segmentation algorithms in use today. A number of extensions to this approach have also been proposed, ones that can deal with multiple classes or that can incorporate a priori information in the form of grouping constraints. It was recently shown how a general linearly constrained Normalized Cut problem can be solved. This was done by proving that strong duality holds for the Lagrangian relaxation of such problems. This provides a principled way to perform multi-class partitioning while enforcing any linear constraints exactly. The Lagrangian relaxation requires the maximization of the algebraically smallest eigenvalue over a one-dimensional matrix sub-space. This is an unconstrained, piece-wise differentiable and concave problem. In this paper we show how to solve this optimization efficiently even for very large-scale problems. The method has been tested on real data with convincing results.1
1 Introduction
Image segmentation can be defined as the task of partitioning an image into disjoint sets. This visual grouping process is typically based on low-level cues such as intensity, homogeneity or image contours. Existing approaches include thresholding techniques, edge based methods and region-based methods. Extensions to this process include the incorporation of grouping constraints into the segmentation process. For instance the class labels for certain pixels might be supplied beforehand, through user interaction or some completely automated process [1,2]. Perhaps the most successful and popular approaches for segmenting images are based on graph cuts. Here the images are converted into undirected graphs with edge weights between the pixels corresponding to some measure of similarity. The ambition is that partitioning such a graph will preserve some of the spatial structure of the image itself. These graph based methods were made popular first through the Normalized Cut formulation of [3] and more recently by the energy minimization method of [4]. This algorithm for optimizing objective functions that are submodular has the property of solving many discrete problems exactly. However, not all segmentation problems can
¹ This work has been supported by the European Commission's Sixth Framework Programme under grant no. 011838 as part of the Integrated Project SMErobot™, and by the Swedish Foundation for Strategic Research (SSF) through the programmes Vision in Cognitive Systems II (VISCOS II) and Spatial Statistics and Image Analysis for Environment and Medicine.
be formulated with submodular objective functions, nor is it possible to incorporate all types of linear constraints. In [5] it was shown how linear grouping constraints can be included in the former approach, Normalized Cuts. It was demonstrated how Lagrangian relaxation can handle such linear constraints in a unified way, and also in what way they influence the resulting segmentation. It did not, however, address the practical issues of finding such solutions. In this paper we develop efficient algorithms for solving the Lagrangian relaxation.
2 Background

2.1 Normalized Cuts
Consider an undirected graph G, with nodes V and edges E, where the non-negative weight of each edge is represented by an affinity matrix W, with only non-negative entries and of full rank. A min-cut is the non-trivial subset A of V such that the sum of the edge weights between nodes in A and in V \ A is minimized, that is, the minimizer of

cut(A, V) = \sum_{i \in A,\, j \in V \setminus A} w_{ij}.   (1)
This is perhaps the most commonly used method for splitting graphs and is a well known problem for which very efficient solvers exist. It has however been observed that this criterion has a tendency to produce unbalanced cuts; smaller partitions are preferred to larger ones. In an attempt to remedy this shortcoming, Normalized Cuts was introduced by [3]. It is basically an altered criterion for partitioning graphs, applied to the problem of perceptual grouping in computer vision. By introducing a normalizing term into the cut metric the bias towards undersized cuts is avoided. The Normalized Cut of a graph is defined as

Ncut = \frac{cut(A, V)}{assoc(A, V)} + \frac{cut(B, V)}{assoc(B, V)},   (2)

where A ∪ B = V, A ∩ B = ∅ and the normalizing term is defined as assoc(A, V) = \sum_{i \in A, j \in V} w_{ij}. It is then shown in [3] that by relaxing (2) a continuous underestimator of the Normalized Cut can be efficiently computed. To be able to include general linear constraints we reformulated the problem in the following way (see [5] for details). With d = W1 and D = diag(d), the Normalized Cut cost can be written as

\inf_z \frac{z^T (D - W) z}{-z^T d d^T z + (1^T d)^2},  s.t. z \in \{-1, 1\}^n, Cz = b.   (3)

The above problem is a non-convex, NP-hard optimization problem. In [5] the z \in \{-1, 1\}^n constraint was replaced with the norm constraint z^T z = n. This gives the relaxed problem

\inf_z \frac{z^T (D - W) z}{-z^T d d^T z + (1^T d)^2},  s.t. z^T z = n, Cz = b.   (4)
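As a small numerical illustration (our own, not part of the original paper), the relaxed cost of (3)/(4) can be evaluated directly from an affinity matrix with NumPy; all names below are ours.

```python
import numpy as np

def relaxed_ncut_cost(W, z):
    """Evaluate z^T (D - W) z / ( -z^T d d^T z + (1^T d)^2 ), with d = W 1, D = diag(d)."""
    d = W.sum(axis=1)
    D = np.diag(d)
    numerator = z @ (D - W) @ z
    denominator = -(d @ z) ** 2 + d.sum() ** 2
    return numerator / denominator

# Tiny example: a random symmetric affinity matrix and a z with z^T z = n.
rng = np.random.default_rng(0)
A = rng.random((6, 6))
W = (A + A.T) / 2
z = rng.choice([-1.0, 1.0], size=6)
print(relaxed_ncut_cost(W, z))
```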
Even though this is a non-convex problem, it was shown in [5] that it is possible to solve it exactly.

2.2 The Fractional Trust Region Subproblem
Next we briefly review the theory for solving (4). Let ẑ be the extended vector [z^T z_{n+1}]^T. Throughout the paper we will write ẑ when we consider the extended variables and just z when we consider the original ones. With Ĉ = [C  −b], the linear constraints Cz = b become Ĉẑ = 0; they now form a linear subspace and can be eliminated in the following way. Let N_Ĉ be a matrix whose columns form a basis of the nullspace of Ĉ. Any ẑ fulfilling Ĉẑ = 0 can be written ẑ = N_Ĉ ŷ, where ŷ ∈ R^{k+1}. Assuming that the linear constraints are feasible, we may always choose the basis so that ŷ_{k+1} = ẑ_{n+1}. Let

L_Ĉ = N_Ĉ^T [ (D − W)  0 ; 0  0 ] N_Ĉ   and   M_Ĉ = N_Ĉ^T [ ((1^T d)D − d d^T)  0 ; 0  0 ] N_Ĉ,

both positive semidefinite (see [5]). In the new space we get the following formulation

\inf_{ŷ} \frac{ŷ^T L_Ĉ ŷ}{ŷ^T M_Ĉ ŷ},  s.t. ŷ_{k+1} = 1, ||ŷ||^2_{N_Ĉ} = n + 1,   (5)
where ||ŷ||^2_{N_Ĉ} = ŷ^T N_Ĉ^T N_Ĉ ŷ. We call this problem the fractional trust region subproblem since, if the denominator is removed, it is similar to the standard trust region problem [6]. A common approach to solving problems of this type is to simply drop one of the two constraints. This may however result in very poor solutions. For example, in [7] segmentation with prior data was studied. The objective function considered there contained a linear part (the data part) and a quadratic smoothing term. It was observed that when y_{k+1} ≠ ±1 the balance between the smoothing term and the data term was disrupted, resulting in very poor segmentations. In [5] it was shown that in fact this problem can be solved exactly, without excluding any constraints, by considering the dual problem.

Theorem 1. If a minimum of (5) exists, its dual problem

\sup_t  \inf_{||ŷ||^2_{N_Ĉ} = n+1}  \frac{ŷ^T (L_Ĉ + t E_Ĉ) ŷ}{ŷ^T M_Ĉ ŷ},   (6)

where E_Ĉ = [ 0  0 ; 0  1 ] − \frac{N_Ĉ^T N_Ĉ}{n+1} = N_Ĉ^T [ −\frac{1}{n+1} I  0 ; 0  1 ] N_Ĉ, has no duality gap.

Since we assume that the problem is feasible, and as the objective function of the primal problem is the quotient of two positive semidefinite quadratic forms, a minimum obviously exists. Thus we can apply this theorem directly and solve (5) through its dual formulation. We will use F(t, ŷ) to denote the objective function of (6), the Lagrangian of problem (5). By the dual function θ(t) we mean θ(t) = \inf_{||ŷ||^2_{N_Ĉ} = n+1} F(t, ŷ).
The inner minimization in (6) is the well known generalized Rayleigh quotient, for which the minimum is given by the algebraically smallest generalized eigenvalue² of (L_Ĉ + tE_Ĉ) and M_Ĉ. Letting λ_min(·, ·) denote the smallest generalized eigenvalue of the two entering matrices, we can also write problem (6) as

\sup_t  λ_min(L_Ĉ + tE_Ĉ, M_Ĉ).   (7)

² By the generalized eigenvalue of two matrices A and B we mean finding a λ = λ_G(A, B) and v, ||v|| = 1, such that Av = λBv has a solution.
These two dual formulations will from here on be used interchangeably; it should be clear from the context which one is being referred to. In this paper we will develop methods for solving the outer maximization efficiently.
3 Efficient Optimization

3.1 Subgradient Optimization
First we present a method, similar to that used in [8] for minimizing binary problems with quadratic objective functions, based on subgradients, for solving the dual formulation of our relaxed problem. We start off by noting that, as θ(t) is a pointwise infimum of functions linear in t, it is a concave function. Hence the outer optimization of (6) is a concave maximization problem, as is expected from dual problems. Thus a solution to the dual problem can be found by maximizing a concave function in one variable t. Note that the choice of norm does not affect the value of θ; it only affects the minimizer ŷ*. It is widely known that the eigenvalues are analytic (and thereby differentiable) functions as long as they are distinct. Thus, to be able to use a steepest ascent method we need to consider subgradients. Recall the definition of a subgradient [9,8].

Definition 1. If a function g : R^{k+1} → R is concave, then v ∈ R^{k+1} is a subgradient to g at σ_0 if

g(σ) ≤ g(σ_0) + v^T (σ − σ_0),  ∀σ ∈ R^{k+1}.   (8)

One can show that if a function is differentiable then the derivative is the only vector satisfying (8). We will denote the set of all subgradients of g at a point t_0 by ∂g(t_0). It is easy to see that this set is convex and that if 0 ∈ ∂g(t_0) then t_0 is a global maximum. Next we show how to calculate the subgradients of our problem.

Lemma 1. If ŷ_0 fulfills F(ŷ_0, t_0) = θ(t_0) and ||ŷ_0||^2_{N_Ĉ} = n + 1, then

v = \frac{ŷ_0^T E_Ĉ ŷ_0}{ŷ_0^T M_Ĉ ŷ_0}   (9)

is a subgradient of θ at t_0. If θ is differentiable at t_0, then v is the derivative of θ at t_0.

Proof.

θ(t) = \min_{||ŷ||^2_{N_Ĉ} = n+1} \frac{ŷ^T (L_Ĉ + tE_Ĉ) ŷ}{ŷ^T M_Ĉ ŷ} ≤ \frac{ŷ_0^T (L_Ĉ + tE_Ĉ) ŷ_0}{ŷ_0^T M_Ĉ ŷ_0} = \frac{ŷ_0^T (L_Ĉ + t_0 E_Ĉ) ŷ_0}{ŷ_0^T M_Ĉ ŷ_0} + \frac{ŷ_0^T E_Ĉ ŷ_0}{ŷ_0^T M_Ĉ ŷ_0}(t − t_0) = θ(t_0) + v^T (t − t_0)   (10)
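In practice, evaluating θ(t) and the subgradient (9) amounts to one generalized eigenvalue computation. The sketch below is our own illustration (not the authors' code); it uses SciPy's symmetric generalized eigensolver and assumes M_Ĉ is numerically positive definite (a small ridge can be added if it is only semidefinite). The rescaling of ŷ to ||ŷ||²_{N_Ĉ} = n + 1 does not change the value of θ or v.

```python
import numpy as np
from scipy.linalg import eigh

def theta_and_subgradient(t, L, E, M, N, n):
    """Return theta(t) = lambda_min(L + t E, M) of (7) and the subgradient v of (9).

    L, E, M are the (k+1)x(k+1) matrices of the dual (6); N is N_C_hat.
    """
    w, V = eigh(L + t * E, M)          # generalized symmetric eigenproblem, ascending
    theta = w[0]                        # algebraically smallest eigenvalue
    y = V[:, 0]
    y = y * np.sqrt((n + 1) / (y @ (N.T @ N) @ y))   # enforce ||y||^2_N = n + 1
    v = (y @ E @ y) / (y @ M @ y)                    # subgradient (9)
    return theta, v, y
```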
A Subgradient Algorithm. Next we present an algorithm based on the theory of subgradients. The idea is to find a simple approximation of the objective function. Since the function θ is concave, the first-order Taylor expansion θ_i(t) around a point t_i always over-estimates it, i.e. θ(t) ≤ θ_i(t). If ŷ_i solves \inf_{||ŷ||^2_{N_Ĉ} = n+1} F(ŷ, t_i) and this solution is unique, then the Taylor expansion of θ at t_i is

θ_i(t) = F(ŷ_i, t_i) + v_i^T (t − t_i).   (11)

Note that if ŷ_i is not unique, θ_i is still an over-estimating function since v_i is a subgradient. One can assume that the function θ_i approximates θ well in a neighborhood around t = t_i if the smallest eigenvalue is distinct. If it is not, we can expect that there is some t_j such that min(θ_i(t), θ_j(t)) is a good approximation. Thus we will construct a function θ̄ of the type

θ̄(t) = \inf_{i∈I} F(ŷ_i, t_i) + v_i^T (t − t_i)   (12)

that approximates θ well. That is, we approximate θ with the point-wise infimum of several first-order Taylor expansions, computed at a number of different values of t; an illustration can be seen in Fig. 1. We then take the solution of the problem \sup_t θ̄(t), given by

\sup_{t,α} α,  s.t.  α ≤ F(ŷ_i, t_i) + v_i^T (t − t_i), ∀i ∈ I,  t_min ≤ t ≤ t_max,   (13)

as an approximate solution to the original dual problem. Here, the fixed parameters t_min, t_max are used to express the interval for which the approximation is believed to be valid. Let t_{i+1} denote the optimizer of (13). It is reasonable to assume that θ̄ approximates θ better the more Taylor approximations we use in the linear program. Thus, we can improve θ̄ by computing the first-order Taylor expansion around t_{i+1}, adding it to (13) and solving the linear program again. This is repeated until |t_{N+1} − t_N| < ε for some predefined ε > 0, and t_{N+1} will be a solution to \sup_t θ(t).
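A compact version of this cutting-plane scheme, written with scipy.optimize.linprog and a θ/subgradient routine such as the one sketched above, could look as follows; it is our own illustration of the algorithm, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog

def subgradient_algorithm(theta_eval, t0, t_min, t_max, eps=1e-6, max_iter=100):
    """Maximize the concave dual by accumulating first-order cuts (11)-(13).

    theta_eval(t) must return (theta(t), subgradient v at t).
    """
    ts, thetas, grads = [t0], [], []
    f, v = theta_eval(t0)
    thetas.append(f); grads.append(v)
    for _ in range(max_iter):
        # LP (13): maximize alpha over x = [t, alpha] s.t. alpha <= theta(t_i) + v_i (t - t_i).
        A_ub = np.array([[-v_i, 1.0] for v_i in grads])
        b_ub = np.array([f_i - v_i * t_i for f_i, v_i, t_i in zip(thetas, grads, ts)])
        res = linprog(c=[0.0, -1.0], A_ub=A_ub, b_ub=b_ub,
                      bounds=[(t_min, t_max), (None, None)], method="highs")
        t_new = res.x[0]
        if abs(t_new - ts[-1]) < eps:
            return t_new
        f, v = theta_eval(t_new)
        ts.append(t_new); thetas.append(f); grads.append(v)
    return ts[-1]
```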
3.2 A Second Order Method
The algorithm presented in the previous section uses first-order derivatives only. We would however like to employ higher-order methods to increase efficiency. This requires calculating second-order derivatives of (6). Most formulas for calculating the second derivatives of eigenvalues involve all of the eigenvectors and eigenvalues. However, determining the entire eigensystem is not feasible for large-scale systems. We will show that it is possible to determine the second derivative of an eigenvalue function by solving a certain linear system involving only the corresponding eigenvalue and eigenvector. The generalized eigenvalues and eigenvectors fulfill the following equations:

((L_Ĉ + tE_Ĉ) − λ(t)M_Ĉ) ŷ(t) = 0,   (14)
||ŷ(t)||^2_{N_Ĉ} = n + 1.   (15)
Fig. 1. Approximations of two randomly generated objective functions. Top: Approximation after 1 step of the algorithm. Bottom: Approximation after 2 steps of the algorithm.
To emphasize the dependence on t we write λ(t) for the eigenvalue and ŷ(t) for the eigenvector. By differentiating (14) we obtain

(E_Ĉ − λ'(t)M_Ĉ) ŷ(t) + ((L_Ĉ + tE_Ĉ) − λ(t)M_Ĉ) ŷ'(t) = 0.   (16)

This (k + 1) × (k + 1) linear system in ŷ'(t) will have rank k, assuming λ(t) is a distinct eigenvalue. To determine ŷ'(t) uniquely we differentiate (15), obtaining

ŷ^T(t) N_Ĉ^T N_Ĉ ŷ'(t) = 0.   (17)

Thus, the derivative of the eigenvector, ŷ'(t), is determined by the solution of the linear system

[ (L_Ĉ + tE_Ĉ) − λ(t)M_Ĉ ; ŷ^T(t) N_Ĉ^T N_Ĉ ] ŷ'(t) = [ (−E_Ĉ + λ'(t)M_Ĉ) ŷ(t) ; 0 ].   (18)

If we assume differentiability at t, the second derivative of θ(t) can now be found by computing d/dt θ'(t), where θ'(t) is equal to the subgradient v given by (9):

θ''(t) = \frac{d}{dt} θ'(t) = \frac{d}{dt} \frac{ŷ(t)^T E_Ĉ ŷ(t)}{ŷ(t)^T M_Ĉ ŷ(t)} = \frac{2}{ŷ(t)^T M_Ĉ ŷ(t)} ŷ^T(t) (E_Ĉ − θ'(t) M_Ĉ) ŷ'(t).   (19)
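Concretely, (18)-(19) translate into one small least-squares solve per iteration; the sketch below (ours, under the assumptions of the routine above) stacks the rank-k system (16) with the normalization condition (17).

```python
import numpy as np

def theta_second_derivative(t, lam, y, theta_prime, L, E, M, N):
    """Compute theta''(t) via (18)-(19), given the smallest generalized eigenvalue
    lam = lambda(t), its eigenvector y = y_hat(t), and theta'(t) = lambda'(t)."""
    A_top = (L + t * E) - lam * M               # (k+1) x (k+1), rank k
    A_bottom = (y @ (N.T @ N)).reshape(1, -1)   # normalization row (17)
    A = np.vstack([A_top, A_bottom])
    b = np.concatenate([(-E + theta_prime * M) @ y, [0.0]])
    y_prime, *_ = np.linalg.lstsq(A, b, rcond=None)          # stacked system (18)
    return 2.0 / (y @ M @ y) * (y @ (E - theta_prime * M) @ y_prime)   # (19)
```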
A Modified Newton Algorithm. Next we modify the algorithm presented in the previous section to incorporate the second derivatives. Note that the second-order Taylor expansion is not necessarily an over-estimator of θ. Therefore we cannot use the second derivatives as we did in the previous section. Instead, as we know θ to be infinitely differentiable when the smallest eigenvalue λ(t) is distinct, and strictly concave around its optimum t*, Newton's method for unconstrained optimization can be applied. It follows from these properties of θ(t) that Newton's method [9] should be well behaved on this function and that we can expect quadratic convergence in a neighborhood of t*; all of this under the assumption that θ is differentiable in this neighborhood. Since Newton's method does not guarantee convergence, we have modified the method slightly, adding some safeguarding measures. At a given iteration of the Newton method we have evaluated θ(t) at a number of points t_i. As θ is concave, we can easily find upper and lower bounds on t* (t_min, t_max) by looking at the derivative of the objective function at these values of t = t_i:

t_max = \min_{i; θ'(t_i) ≤ 0} t_i,   and   t_min = \max_{i; θ'(t_i) ≥ 0} t_i.   (20)

At each step in the Newton method a new iterate is found by approximating the objective function by its second-order Taylor approximation

θ(t) ≈ θ(t_i) + θ'(t_i)(t − t_i) + \frac{θ''(t_i)}{2} (t − t_i)²   (21)

and finding its maximum. By differentiating (21) it is easily shown that its optimum, and thereby the next point in the Newton sequence, is given by

t_{i+1} = −\frac{θ'(t_i)}{θ''(t_i)} + t_i.   (22)

If t_{i+1} is not in the interval [t_min, t_max], then the second-order expansion cannot be a good approximation of θ; this is where the safeguarding comes in. In these cases we simply fall back to the first-order method of the previous section. If we successively store the values of θ(t_i), as well as the computed subgradients at these points, this can be carried out with little extra computational effort. Then the upper and lower bounds t_min and t_max are updated, i is incremented by 1, and the whole procedure is repeated until convergence. If the smallest eigenvalue λ(t_i) at an iteration is not distinct, then θ'(t) is not defined and a new Newton step cannot be computed. In these cases we also use the subgradient method to determine the subsequent iterate. However, empirical studies indicate that non-distinct smallest eigenvalues are extremely unlikely to occur.
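One safeguarded iteration of this scheme can be condensed as follows; this is our own sketch, with `subgradient_fallback` standing for one cutting-plane step of the previous subsection.

```python
def modified_newton_step(t_hist, grad_hist, curv, subgradient_fallback):
    """One safeguarded Newton update (20)-(22) on the concave dual function.

    t_hist, grad_hist store all evaluated points and their derivatives; curv is
    theta''(t_i) at the latest point (None if the eigenvalue was not distinct).
    """
    # Bounds (20) from the signs of the stored derivatives.
    t_max = min((t for t, g in zip(t_hist, grad_hist) if g <= 0), default=float("inf"))
    t_min = max((t for t, g in zip(t_hist, grad_hist) if g >= 0), default=float("-inf"))
    t_i, g_i = t_hist[-1], grad_hist[-1]
    if curv is not None and curv < 0:
        t_new = t_i - g_i / curv                 # Newton step (22)
        if t_min <= t_new <= t_max:
            return t_new
    # Safeguard: step left [t_min, t_max] or the eigenvalue was not distinct.
    return subgradient_fallback()
```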
4 Experiments
A number of experiments were conducted in an attempt to evaluate the suggested approaches. As we are mainly interested in maximizing a concave, piece-wise differentiable function, the underlying problem is actually somewhat irrelevant. However, in order to emphasize the intended practical application of the proposed methods, we ran the subgradient and modified Newton algorithms on both smaller, synthetic problems as well as on larger, real-world data. For comparison purposes we also include the results of a golden section method [9], used in [5], as a baseline algorithm. First, we evaluated the performance of the proposed methods on a large number of synthetic problems. These were created by randomly choosing symmetric, positive definite, 100×100 matrices. As the computational burden lies in determining the generalized eigenvalues of the matrices L_Ĉ + tE_Ĉ and M_Ĉ, we wish to reduce the number of such calculations.

Fig. 2. Histogram of the number of function evaluations required for 1000 synthetically generated experiments using a golden section method (blue) and the subgradient algorithm (red)

Figure 2 shows a histogram of the number of eigenvalue evaluations for the subgradient and modified Newton methods, as well as for the baseline golden section search. The two gradient methods clearly outperform the golden section search. The difference between the subgradient and modified Newton methods is not as discernible. The somewhat surprisingly good performance of the subgradient method can be explained by the fact that, far away from t*, the function θ(t) is practically linear, and an optimization method using second derivatives would not have much advantage over one that uses only first-order information.
Fig. 3. Top: Resulting segmentation (left) and constraints applied (right). Here an X means that this pixel belongs to the foreground and an O to the background. Bottom: Convergence of the modified Newton (solid), subgradient (dashed) and the golden section (dash-dotted) algorithms. The algorithms converged after 9, 14 and 23 iterations, respectively.
Finally, we applied our methods to two real-world examples. The underlying motivation for investigating an optimization problem of this form was to segment images with linear constraints using Normalized Cuts. The first image can be seen in Fig. 3; the linear constraints included were hard constraints, that is, the requirement that certain pixels should belong to the foreground or background. One can imagine that such constraints are supplied either by user interaction in a semi-supervised fashion or by some automatic preprocessing of the image. The image was gray-scale, approximately 100 × 100 pixels in size, and the associated graph was constructed based on edge information as described in [10]. The second image was of a traffic intersection where one wishes to segment out the small car in the top corner. We have a probability map of the image, giving the likelihood of a certain pixel belonging to the foreground. Here the graph representation is based on this map instead of the gray-level values in the image. The approximate size and location of the vehicle is known and included as a linear constraint in the segmentation process. The resulting partition can be seen in Fig. 4.

Fig. 4. Top: Resulting segmentation (left) and constraints applied, in addition to the area requirement used (area = 50 pixels) (right). Here the X in the top right part of the corner means that this pixel belongs to the foreground. Bottom: Convergence of the modified Newton (solid), subgradient (dashed) and the golden section (dash-dotted) algorithms. The algorithms converged after 9, 15 and 23 iterations, respectively.

In both these real-world cases, the resulting segmentation will always be the same, regardless of approach. What is different is the computational complexity of the different methods. Once again, the two gradient-based approaches are much more efficient than a golden section search, and their respective performances are comparable. As the methods differ in what is required to compute, a direct comparison of them is not a straightforward procedure. Comparing the run time would be pointless, as the degree to which the implementations of the individual methods have been optimized for speed differs greatly. However, as it is the eigenvalue computations that are the most demanding, we believe that comparing the number of such eigenvalue calculations is a good indicator of the computational requirements of the different approaches. It can be seen in Figs. 3 and 4 how the subgradient method converges quickly in the initial iterations, only to slow down as it approaches the optimum. This is in support of the above discussion regarding the linear appearance of the function θ(t) far away from the optimum. We therefore expect the modified Newton method to be superior when higher accuracy is required. In conclusion, we have proposed two methods for efficiently optimizing a piece-wise differentiable function, using both first- and second-order information, applied to the task of partitioning images. Even though it is difficult to provide a completely accurate comparison between the suggested approaches, it is obvious that the Newton-based method is superior.
References 1. Rother, C., Kolmogorov, V., Blake, A.: ”GrabCut”: interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 309–314 (2004) 2. Boykov, Y., Jolly, M.P.: Interactive graph cuts for optimal boundary and region segmentation of objects in n-d images. In: International Conference on Computer Vision, Vancouver, Canada, pp. 05–112 (2001) 3. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000) 4. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Analysis and Machine Intelligence 23(11), 1222–1239 (2001) 5. Eriksson, A., Olsson, C., Kahl, F.: Normalized cuts revisited: A reformulation for segmentation with linear grouping constraints. In: International Conference on Computer Vision, Rio de Janeiro, Brazil (2007) 6. Sorensen, D.: Newton’s method with a model trust region modification. SIAM Journal on Nummerical Analysis 19(2), 409–426 (1982) 7. Eriksson, A., Olsson, C., Kahl, F.: Image segmentation with context. In: Proc. Conf. Scandinavian Conference on Image Analysis, Ahlborg, Denmark (2007) 8. Olsson, C., Eriksson, A., Kahl, F.: Solving large scale binary quadratic problems: Spectral methods vs. semidefinite programming. In: Proc. Conf. Computer Vision and Pattern Recognition, Mineapolis, USA (2007) 9. Bazaraa, Sherali, Shetty: Nonlinear Programming, Theory and Algorithms. Wiley, Chichester (2006) 10. Malik, J., Belongie, S., Leung, T.K., Shi, J.: Contour and texture analysis for image segmentation. International Journal of Computer Vision 43(1), 7–27 (2001)
Image Segmentation Using Iterated Graph Cuts Based on Multi-scale Smoothing Tomoyuki Nagahashi1, Hironobu Fujiyoshi1 , and Takeo Kanade2 Dept. of Computer Science, Chubu University. Matsumoto 1200, Kasugai, Aichi, 487-8501 Japan [email protected], [email protected] http://www.vision.cs.chubu.ac.jp 2 The Robotics Institute, Carnegie Mellon University. Pittsburgh, Pennsylvania, 15213-3890 USA [email protected] 1
Abstract. We present a novel approach to image segmentation using iterated Graph Cuts based on multi-scale smoothing. We compute the prior probability obtained by the likelihood from a color histogram and a distance transform using the segmentation results from graph cuts in the previous process, and set the probability as the t-link of the graph for the next process. The proposed method can segment the regions of an object with a stepwise process from global to local segmentation by iterating the graph-cuts process with Gaussian smoothing using different values for the standard deviation. We demonstrate that we can obtain 4.7% better segmentation than that with the conventional approach.
1 Introduction
Image segmentation is a technique of removing objects in an image from their background. The segmentation result is typically composed on a different background to create a new scene. Since the breakthrough of Geman and Geman [1], probabilistic inference has been a powerful tool for image processing. The graph-cuts technique proposed by Boykov [2][3] has been used in recent years for interactive segmentation in 2D and 3D. Rother et al. proposed GrabCut[4], which is an iterative approach to image segmentation based on graph cuts. The inclusion of color information in the graph-cut algorithm and an iterative-learning approach increases its robustness. However, it is difficult to segment images that have a complex edge. This is because it is difficult to achieve segmentation by overlapping local edges that influence the cost of the n-link, which is calculated from neighboring pixels. Therefore, we introduced a coarse-to-fine approach to detecting boundaries using graph cuts. We present a novel method of image segmentation using iterated Graph Cuts based on multi-scale smoothing in this paper. We computed the prior probability obtained by the likelihood from a color histogram and a distance transform, and set the probability as the t-link of the graph for the next process using the segmentation results from the graph cuts in the previous process. The proposed
method could segment regions of an object with a stepwise process from global to local segmentation by iterating the graph-cuts process with Gaussian smoothing using different values for the standard deviation.
2 Graph Cuts
This section describes the graph-cuts-based segmentation proposed by Boykov and Jolly [2].

2.1 Graph Cuts for Image Segmentation
An image segmentation problem can be posed as a binary labeling problem. Assume that the image is a graph, G = (V, E), where V is the set of all nodes and E is the set of all arcs connecting adjacent nodes. The nodes are usually pixels, p, on the image, P, and the arcs have adjacency relationships with four or eight connections between neighboring pixels, q ∈ N. The labeling problem is to assign a unique label, L_i, to each node, i ∈ V, i.e. L_i ∈ {"obj", "bkg"}. The solution, L = {L_1, L_2, ..., L_p, ..., L_|P|}, can be obtained by minimizing the Gibbs energy, E(L):

E(L) = λ · R(L) + B(L)   (1)

where

R(L) = \sum_{p∈P} R_p(L_p)   (2)

B(L) = \sum_{\{p,q\}∈N} B_{p,q} · δ(L_p, L_q)   (3)

and

δ(L_p, L_q) = 1 if L_p ≠ L_q, 0 otherwise.   (4)
The coefficient, λ ≥ 0, in (1) specifies the relative importance of the regionproperties term, R(L), versus the boundary-properties term, B(L). The regional term, R(L), assumes that the individual penalties for assigning pixel p to “obj” and “bkg”, corresponding to Rp (“obj”) and Rp (“bkg”), are given. For example, Rp (·) may reflect on how the intensity of pixel p fits into a known intensity model (e.g., histogram) of the object and background. The term, B(L), comprises the “boundary” properties of segmentation L. Coefficient B{p,q} ≥ 0 should be interpreted as a penalty for discontinuity between p and q. B{p,q} is normally large when pixels p and q are similar (e.g., in intensity) and B{p,q} is close to zero when these two differ greatly. The penalty, B{p,q} , can also decrease as a function of distance between p and q. Costs B{p,q} may be based on the local intensity gradient, Laplacian zero-crossing, gradient direction, and other criteria.
Fig. 1. Example of graph from image

Table 1. Edge cost
  Edge            Cost              For
  n-link {p, q}   B_{p,q}           {p, q} ∈ N
  t-link {p, S}   λ · R_p("bkg")    p ∈ P, p ∉ O ∪ B
                  K                 p ∈ O
                  0                 p ∈ B
  t-link {p, T}   λ · R_p("obj")    p ∈ P, p ∉ O ∪ B
                  0                 p ∈ O
                  K                 p ∈ B

Figure 1 shows an example of a graph from an input image. Table 1 lists the weights of the edges of the graph. The region term and boundary term in Table 1 are calculated by

R_p("obj") = − ln Pr(I_p | O)   (5)
R_p("bkg") = − ln Pr(I_p | B)   (6)
B_{p,q} ∝ exp( − (I_p − I_q)² / (2σ²) ) · 1 / dist(p, q)   (7)
K = 1 + \max_{p∈P} \sum_{q:\{p,q\}∈N} B_{p,q}.   (8)
Let O and B define the "object" and "background" seeds. Seeds are given by the user. The boundary between the object and the background is segmented by finding the minimum cost cut [5] on the graph, G.
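To make (5)-(8) concrete, the sketch below computes the t-link and n-link weights for a gray-scale image with 4-connectivity. It is our own illustration: the seed overrides of Table 1 and the actual minimum cut (computed with any max-flow solver) are only indicated in comments.

```python
import numpy as np

def edge_weights(I, pr_obj, pr_bkg, lam=1.0, sigma=10.0):
    """Region terms (5)-(6) and 4-connected boundary terms (7) for image I.

    pr_obj, pr_bkg are per-pixel likelihoods Pr(I_p|O), Pr(I_p|B).
    """
    eps = 1e-12
    R_obj = -np.log(pr_obj + eps)                                       # (5)
    R_bkg = -np.log(pr_bkg + eps)                                       # (6)
    # n-links to the right and bottom neighbours, dist(p, q) = 1.
    B_right = np.exp(-(I[:, :-1] - I[:, 1:]) ** 2 / (2 * sigma ** 2))   # (7)
    B_down  = np.exp(-(I[:-1, :] - I[1:, :]) ** 2 / (2 * sigma ** 2))
    # K of (8): 1 + the largest sum of n-link weights incident to a pixel.
    incident = np.zeros_like(I, dtype=float)
    incident[:, :-1] += B_right; incident[:, 1:] += B_right
    incident[:-1, :] += B_down;  incident[1:, :] += B_down
    K = 1.0 + incident.max()
    # t-link capacities of Table 1 (seed pixels would then be overridden with K / 0,
    # and the labeling obtained from a min-cut / max-flow computation on the graph).
    t_source = lam * R_bkg      # {p, S}
    t_sink   = lam * R_obj      # {p, T}
    return t_source, t_sink, B_right, B_down, K
```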
2.2 Problems with Graph Cuts
It is difficult to segment images that include complex edges with interactive graph cuts [2], [3], as shown in Fig. 2. This is because the cost of the n-link is larger than that of the t-link. If a t-link value is larger than that of an n-link, the number of error pixels will be increased due to the influence of the color. The edge has a strong influence when there is a large n-link. The cost of the n-link between the flower and the leaf is larger than that between the leaf and the shadow, as seen in Fig. 2. Therefore, it is difficult to segment an image that has a complex edge. We therefore introduced a coarse-to-fine approach to detect boundaries using graph cuts.

Fig. 2. Example of poor results
3 Iterated Graph Cuts by Multi-scale Smoothing
We present a novel approach to segmenting images using iterated Graph Cuts based on multi-scale smoothing. We computed the prior probability obtained by the likelihood from a color histogram and a distance transform, and set the probability as the t-link of the graph for the next process using the segmentation results from graph cuts in the previous process. The proposed method could segment regions of an object using a stepwise process from global to local segmentation by iterating the graph-cuts process with Gaussian smoothing using different values for the standard deviation.

3.1 Overview of Proposed Method
Our approach is outlined in Fig. 3. First, the seeds for the "foreground" and "background" are given by the user. The first smoothing parameter, σ, is then determined. Graph cuts are done to segment the image into an object or a background. The Gaussian Mixture Model (GMM) is then used to make a color distribution model for the object and background classes from the segmentation results obtained by graph cuts. The prior probability is updated from the distance transform and the object and background classes of the GMM. The t-links for the next graph-cuts process are calculated as a posterior probability, which is computed from the prior probability and the GMMs, and σ is updated as σ = α · σ (0 < α < 1). These processes are repeated until σ = 1, or until classification converges if σ < 1. The processes are as follows:
Step 1. Input seeds
Step 2. Initialize σ
Step 3. Smooth the input image by Gaussian filtering
Step 4. Do graph cuts
Step 5. Calculate the posterior probability from the segmentation results and set it as the t-link
Step 6. Steps 1-5 are repeated until σ = 1, or until classification converges if σ < 1.
The proposed method can be used to segment regions of the object with a stepwise process from global to local segmentation by iterating the graph-cuts process with Gaussian smoothing using different values for the standard deviation, as shown in Fig. 4; a schematic sketch of one such iteration is given after the figure caption below.

Fig. 3. Overview of proposed method
Fig. 4. Example of iterating the graph-cuts process
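The outer loop of Steps 1-6 can be summarized as follows. This is a schematic of our own; the helper callbacks `graph_cut`, `fit_gmms` and `prior_from_distance_transform` are placeholders for the operations described in Sections 2 and 3.3 and must be supplied by the caller.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def iterated_graph_cuts(image, seeds, sigma0=8.0, alpha=0.5, max_iter=20,
                        graph_cut=None, fit_gmms=None,
                        prior_from_distance_transform=None):
    """Coarse-to-fine segmentation loop (Steps 1-6) for a gray-scale (2-D) image."""
    sigma = sigma0
    t_links = seeds                      # Step 1: initial t-links from the user seeds
    labels = None
    for _ in range(max_iter):
        smoothed = gaussian_filter(image, sigma=sigma)          # Step 3
        new_labels = graph_cut(smoothed, t_links)               # Step 4
        gmm_obj, gmm_bkg = fit_gmms(image, new_labels)          # color models (Sec. 3.3)
        prior_obj = prior_from_distance_transform(new_labels)   # Eq. (17)-(18)
        t_links = (gmm_obj, gmm_bkg, prior_obj)                 # Step 5: posterior as t-link
        if sigma <= 1.0 and labels is not None and np.array_equal(new_labels, labels):
            break                                               # Step 6 stopping rule
        labels = new_labels
        sigma = max(1.0, alpha * sigma)                         # sigma = alpha * sigma
    return labels
```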
3.2 Smoothing Image Using Down Sampling
The smoothing image is created with a Gaussian filter. Let I denote an image and G(σ) denote a Gaussian kernel. The smoothing image L(σ) is given by

L(σ) = G(σ) ∗ I.   (9)

If the Gaussian parameter σ is large, it is necessary to enlarge the window size of the filter. As it is very difficult to design such a large window for Gaussian filtering, we used down-sampling to obtain a smoothing image that maintains the continuity of σ. The smoothing image L₁(σ) is first computed from the input image I₁ while increasing σ. The image I₂ is then down-sampled to half the size of the input image I, and the smoothing image L₂(σ) is computed using I₂. Here, the relationship between L₁(σ) and L₂(σ) is

L₁(2σ) = L₂(σ).   (10)

We obtain the smoothing image, which maintains the continuity of σ without changing the window size, using this relationship. Figure 5 shows the smoothing process obtained by down-sampling. The smoothing procedure was repeated until σ = 1 in our implementation.

Fig. 5. Smoothing image using down-sampling
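A minimal version of this down-sampling pyramid, using SciPy's Gaussian filter, is sketched below. It is our own illustration (grayscale images, nearest-neighbour up-sampling); the factor-of-two relation (10) is exploited to keep the filter window small.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smooth_with_pyramid(image, sigma, max_window_sigma=4.0):
    """Approximate L(sigma) = G(sigma) * I by filtering a down-sampled image.

    While sigma is too large for a reasonable window, the image and sigma are halved,
    using L1(2*sigma) = L2(sigma) of Eq. (10); the result is then up-sampled back
    (shapes may differ by one pixel for odd image sizes).
    """
    levels = 0
    img = np.asarray(image, dtype=float)
    while sigma > max_window_sigma:
        img = gaussian_filter(img, 1.0)[::2, ::2]   # light anti-aliasing, then halve
        sigma /= 2.0
        levels += 1
    smoothed = gaussian_filter(img, sigma)
    for _ in range(levels):                          # nearest-neighbour up-sampling
        smoothed = np.repeat(np.repeat(smoothed, 2, axis=0), 2, axis=1)
    return smoothed
```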
3.3 Iterated Graph Cuts
We compute the prior probability obtained by the likelihood from a color histogram and a distance transform using the segmentation results from the graph cuts in the previous process, and set the probability as the t-link using

R_p("obj") = − ln Pr(O | I_p)   (11)
R_p("bkg") = − ln Pr(B | I_p)   (12)

where Pr(O | I_p) and Pr(B | I_p) are given by

Pr(O | I_p) = Pr(O) Pr(I_p | O) / Pr(I_p)   (13)
Pr(B | I_p) = Pr(B) Pr(I_p | B) / Pr(I_p).   (14)

Fig. 6. Outline of updating for likelihood and prior probability
Pr(I_p | O), Pr(I_p | B) and Pr(O), Pr(B) are computed from the segmentation results using graph cuts in the previous process. Figure 6 outlines the t-link updating obtained by the likelihood and prior probability.

Updating likelihood. The likelihoods Pr(I_p | O) and Pr(I_p | B) are computed by a GMM [6]. The GMM for RGB color space is given by

Pr(I_p | ·) = \sum_{i=1}^{K} α_i p_i(I_p | μ_i, Σ_i)   (15)

p(I_p | μ, Σ) = \frac{1}{(2π)^{3/2} |Σ|^{1/2}} exp( −\frac{1}{2} (I_p − μ)^T Σ^{−1} (I_p − μ) ).   (16)

We used the EM algorithm to fit the GMM [7].
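Equations (15)-(16) and the EM fit correspond directly to a standard Gaussian mixture implementation. One possible way to obtain the two likelihoods with scikit-learn (our choice of library, not the authors') is sketched below.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_color_likelihoods(image, labels, K=5):
    """Fit K-component GMMs (Eq. 15-16, via EM) to the RGB values of the current
    object / background regions; return per-pixel likelihoods Pr(I_p|O), Pr(I_p|B)."""
    pixels = image.reshape(-1, 3).astype(float)
    obj = pixels[labels.reshape(-1) == 1]
    bkg = pixels[labels.reshape(-1) == 0]
    gmm_obj = GaussianMixture(n_components=K, covariance_type="full").fit(obj)
    gmm_bkg = GaussianMixture(n_components=K, covariance_type="full").fit(bkg)
    pr_obj = np.exp(gmm_obj.score_samples(pixels)).reshape(image.shape[:2])
    pr_bkg = np.exp(gmm_bkg.score_samples(pixels)).reshape(image.shape[:2])
    return pr_obj, pr_bkg
```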
Updating prior probability. The prior probabilities Pr(O) and Pr(B) are updated by spatial information from the graph cuts in the previous process. The next segmentation label is uncertain in the vicinity of the boundary. Therefore, the prior probability is updated by using the results of a distance transform. The distance from the boundary is normalized from 0.5 to 1. Let d_obj denote the distance transform of the object, and d_bkg denote the distance transform of the background. The prior probability is given by

Pr(O) = d_obj       if d_obj ≥ d_bkg
Pr(O) = 1 − d_bkg   if d_obj < d_bkg   (17)

Pr(B) = 1 − Pr(O).   (18)
Image Segmentation Using Iterated Graph Cuts
813
Fig. 7. Example of segmentation results when changing n-link
4 4.1
Experimental Results Dataset
We used 50 images(humans, animals, and landscapes) provided by the GrabCut database [8]. We compared the proposed method, Interactive Graph Cuts[2] and GrabCut[4] using the same seeds. The segmentation error rate is defined as object of miss detection pixels image size background of miss detection pixels under segmentation = . image size over segmentation =
4.2
(19) (20)
Experimental Results
Table 2 lists the error rate (%) for segmentation results using the proposed method and the conventional methods [2], [4]. The proposed method can obtain Table 2. Error rate[%] Interactive GrabCut[4] Proposed method Graph Cuts[2] Over segmentation 1.86 3.33 1.12 1.89 1.59 0.49 Under segmentation total 3.75 4.93 1.61
2.14% better segmentation than Interactive Graph Cuts. To clarify the differences between the methods, successfully segmented images were defined, based on the results of interactive Graph Cuts, as those with error rates below 2%, and missed images were defined as those with error rates over 2%. Table 3 list the segmentation results for successfully segmented and missed images. The proposed
814
T. Nagahashi, H. Fujiyoshi, and T. Kanade
Fig. 8. Examples of segmentation results
method and Interactive graph cuts are comparable in the number of successfully segmented images. However, we can see that the proposed method can obtain 4.79% better segmentation than Interactive Graph Cuts in missed images. The proposed method can be used to segment regions of the object using a stepwise process from global to local segmentation. Figure 8 shows examples of segmentation results obtained with the new method. 4.3
Video Segmentation
The proposed method can be applied to segmenting N-D data. A sequence of 40 frames (320x240) was treated as a single 3D volume. A seed is given to the first frame. Figure 9 shows examples of video segmentation obtained with the new method. It is clear that the method we propose can easily be applied to segmenting videos. We can obtain video-segmentation results.
Image Segmentation Using Iterated Graph Cuts
815
Fig. 9. Example of video segmentation Table 3. Error rate [%]
Over segmented Successfully segmented Under segmented (26 images) total Over segmented Missed images Under segmented (24images) total
5
Interactive Proposed GrabCut[4] Graph Cuts[2] method 0.29 3.54 0.81 0.43 1.03 0.22 0.72 4.58 1.03 3.56 3.10 1.45 3.47 2.21 0.79 7.04 5.31 2.25
Conclusion
We presented a novel approach to image segmentation using iterated Graph Cuts based on multi-scale smoothing. We computed the prior probability obtain by the likelihood from a color histogram and a distance transform, and set the probability as the t-link of the graph for the next process using the segmentation results from the graph cuts in the previous process. The proposed method could segment regions of an object with a stepwise process from global to local segmentation by iterating the graph cuts process with Gaussian smoothing using different values for the standard deviation. We demonstrated that we could obtain 4.7% better segmentation than that with the conventional approach. Future works includes increased speed using super pixels and highly accurate video segmentation.
References 1. Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell. PAMI-6, 721–741 (1984)
816
T. Nagahashi, H. Fujiyoshi, and T. Kanade
2. Boykov, Y., Jolly, M-P.: Interactive Graph Cuts for Optimal Boundary & Region Segmentation of Objects in N-D Images. In: ICCV, vol. I, pp. 105–112 (2001) 3. Boykov, Y., Funka-Lea, G.: Graph Cuts and Efficient N-D Image Segmentation. IJCV 70(2), 109–131 (2006) 4. Rother, C., Kolmogorv, V., Blake, A.: “GrabCut”:Interactive Foreground Extraction Using Iterated Graph Cuts. ACM Trans. Graphics (SIGGRAPH 2004) 23(3), 309– 314 (2004) 5. Boykov, Y., Kolmogorov, V.: An Experimental Comparison of Min-Cut/Max-Flow Algorithms for Energy Minimization in Vision. PAMI 26(9), 1124–1137 (2004) 6. Stauffer, C., Grimson, W.E.L: Adaptive Background Mixture Models for Real-time Tracking. In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 246–252. IEEE Computer Society Press, Los Alamitos (1999) 7. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum-likelihood From Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B 39(1), 1–38 (1977) 8. GrabCut Database: http://research.microsoft.com/vision/cambridge/i3l/ segmentation/ GrabCut.htm
Backward Segmentation and Region Fitting for Geometrical Visibility Range Estimation Erwan Bigorgne and Jean-Philippe Tarel LCPC (ESE), 58 Boulevard Lef`ebvre, F-75732 Paris Cedex 15, France [email protected] [email protected]
Abstract. We present a new application of computer vision: continuous measurement of the geometrical visibility range on inter-urban roads, solely based on a monocular image acquisition system. To tackle this problem, we propose first a road segmentation scheme based on a Parzenwindowing of a color feature space with an original update that allows us to cope with heterogeneously paved-roads, shadows and reflections, observed under various and changing lighting conditions. Second, we address the under-constrained problem of retrieving the depth information along the road based on the flat word assumption. This is performed by a new region-fitting iterative least squares algorithm, derived from half-quadratic theory, able to cope with vanishing-point estimation, and allowing us to estimate the geometrical visibility range.
1
Introduction
Coming with the development of outdoor mobile robot systems, the detection and the recovering of the geometry of paved and /or marked roads has been an active research-field in the late 80’s. Since these pioneering works, the problem is still of great importance for different fields of Intelligent Transportation Systems. A precise, robust segmentation and fitting of the road thus remains a crucial requisite for many applications such as driver assistance or infrastructure management systems. We propose a new infrastructure management system: automatic estimation of the geometrical visibility range along a route, which is strictly related to the shape of the road and the presence of occluding objects in its close surroundings. Circumstantial perturbations such as weather conditions (vehicles, fog, snow, rain ...) are not considered. The challenge is to use a single camera to estimate the geometrical visibility range along the road path, i.e the maximum distance the road is visible. When only one camera is used, the process of recovering the projected depth information is an under-constrained problem which requires the introduction of generic constraints in order to infer a unique solution. The hypothesis which is usually considered is the flat world assumption [1], by which the road is assumed included in a plane. With the flat word assumption, a precise detection
Thanks to the French Department of Transportation for funding, within the SARIPREDIT project (http://www.sari.prd.fr/HomeEN.html).
Y. Yagi et al. (Eds.): ACCV 2007, Part II, LNCS 4844, pp. 817–826, 2007. c Springer-Verlag Berlin Heidelberg 2007
818
E. Bigorgne and J.-P. Tarel
of the vanishing line is crucial. Most of the past and recent single camera algorithms are based on this assumption but differ by the retained model for the road itself [2,3,4,5,6,7]. One group of algorithms moves aside from the flat world assumption and provides an estimation of the vertical curvature of the road. In [8,9,10] the constraint that the road generally keeps an approximately constant width and does not tilt sideways is used. In a general way, it should be noted that the quoted systems, which often relate to applications of lane-tracking /-following, work primarily on relatively ’not so far’ parts of the road. In our case, the geometrical visibility range should be monitored along an interurban route to check for instance its compatibility with speed limits. We are thus released from the requirements of a strictly realtime application; however, both parts of the system, the detection and the fitting of the road, should manage the far extremity of the perceptible road, a requisite for which a road detection-based approach appears to be more adequate than the detection of markings. This article is composed of two sections. The first section deals with the segmentation of the image. We restrict ourselves to structured road contrary to [11]. The proposed algorithm operates an adaptative supervised classification of each pixel in two classes: Road (R) and Other (O). The proposed algorithm is robust and benefits from the fact that the process is off-line. The second section deals with the region-fitting algorithm of the road working on the probability map provided by the segmentation step. The proposed algorithm follows an alternated iterative scheme which allows both to estimate the position of the vanishing line and to fit the borders of the road. The camera calibration being known, the positions of the vanishing line and of the far extremity of the perceptible road are enough to estimate the geometrical visibility range.
2
An Adaptative Probabilistic Classification
A dense detection of the road has been the object of many works which consider it as a two class pixel classification problem either in a supervised way [1,12,13,14], or not [15,16]. All these works face the same difficulty: the detection should be performed all along a road, when the appearance of the road is likely to strongly vary because of changes in the pavement material or because of local color heterogeneity; the lighting conditions can also drastically modify the appearance of the road, see Fig. 1:
Fig. 1. Examples of road scenes to segment, with shadows and changes in pavement material
Backward Segmentation and Region Fitting
819
– The shadows in outdoors environment modify intensity and chromatic components (blue-wards shifting). – The sun at grazing angles and/or the presence of water on the road causes specular reflections. Several previously proposed systems try to tackle these difficulties. The originality of our approach is that we take advantage of the fact that the segmentation is off-line by performing backward processing which leads to robustness. We use a classification scheme able to cope with classes with possibly complex distributions of the color signal, rather than searching for features that would be invariant to well-identified transformations of the signal. In our tests, and contrary to [11], no feature with spatial or textured content (Gabor energy, local entropy, moments of co-occurence matrix, etc.) appeared to be sufficiently discriminant in the case of paved roads, whatever is the environment. In practice, we have chosen to work in the La∗ b∗ color-space which is quasi-uncorrelated. 2.1
Parzen-Windowing
In order to avoid taking hasty and wrong decisions, the very purpose of the segmentation stage is restricted to provide a probability map to be within the Road class, which will be used for fitting the road. The classification of each pixel is performed using a Bayesian decision: the posterior probability for a pixel with feature vector x = [L, a∗ , b∗ ] to be part of the road class is: P(R/x) =
p(x/R)P(R) p(x/R)P(R) + p(x/O)P(O)
(1)
where R and O denotes the two classes. We use Parzen windows to model p(x/R) and p(x/O), the class-conditional probability density functions (pdf ). We choose the anisotropic Gaussian function with mean zero and diagonal covariance matrix Σd as the Parzen window. Parzen windows are accumulated during the learning phases in two 3-D matrices, called P R and P O . The matrix dimensions depend on the signal dynamic and an adequacy is performed with respect to the bandwidth of Σd . For a 24-bit color signal, we typically use two 643 matrices and a diagonal covariance matrix Σd with [2, 1, 1] for bandwidth. This particular choice indeed allows larger variations along the intensity axis making it possible to cope with color variations causes by sun reflexions far ahead on the road. A fast estimation of p(x/R) and p(x/O) is thus obtained by using P R and P O as simple Look-Up-Tables, the entries of which are the digitized color coordinates of feature vectors x. 2.2
Comparison
We compared our approach with [16] which is based on the use of color saturation only. We found although saturation usually provides good segmentation results, this heuristic fails in cases too complex, where separability is no longer verified, see Fig. 2. Fig. 3 shows the correct pixel classification rate for a variable
820
E. Bigorgne and J.-P. Tarel
Fig. 2. Posterior probability maps based on saturation (middle) and [L, a∗ , b∗ ] vectors (right) 1
Correct classification rate
0.9 0.8 0.7 0.6 Saturation rg RGB
0.5 0.4
0
0.1
0.2
0.3
0.4 0.5 0.6 Probability threshold
0.7
0.8
0.9
1
Fig. 3. Correct classification rate comparison for different types of feature
threshold applied on the class-conditional pdf p(x/R). Three types of feature have been compared on twenty images of different road scenes with a groundMin(R,G,B) , 2) truth segmented by hand: 1) the color saturation x = S = 1 − Mean(R,G,B) R B , b = R+G+B ] and 3) the full color the chromatic coefficients x = [r = R+G+B signal x = [L, a∗ , b∗ ]. The obtained results show the benefit of a characterization based on this last vector, which is made possible by the use of Parzen-windowing. 2.3
2.3 Robust Update
The difficulty is to correctly update the class-conditional pdfs along a route despite drastic changes in the road appearance. With online processing, thanks to temporal continuity, new pixel samples are typically selected in areas where either road or non-road pixels are predicted to lie [12,1]. In practice, this approach is not very robust because the segmentation prediction is subject to errors, and these errors corrupt the class-conditional pdfs, which then produce a poor segmentation on the next image. Because our particular application is off-line, we greatly benefit from a backward processing of the entire sequence: given N images taken at regular intervals, the (N − k)-th one is processed at the k-th iteration. For this image, new pixel samples are picked up in the bottom center part of the image to update the 'Road' pdf. The advantage is that we know for sure that these
Fig. 4. 20 of the detected road connected-components in an image sequence. This particular sequence is difficult due to shadows and pavement material changes.
pixels are from the 'Road' class, since the on-board imaging system grabbing the sequence is on the road. Moreover, these new samples belong to the newly observable portion of the road, and thus no prediction is needed. The update of the 'Other' pdf is only made on pixels that were labeled 'Other' at the previous iteration. In order to lower as much as possible the risk of incorrectly learning the 'Road' class, and thus to prevent any divergence of the learning, the proper labeling of pixels as 'Road' is performed by carrying out a logical AND operation between the fitted model explained in the next section and the connected component of the thresholded probability map which overlaps the bottom center part of the image. The 'Other' class is then naturally defined as the complement. This process drastically improves the robustness of the update compared to online approaches. Fig. 4 shows the detected connected components superimposed on the corresponding original images, with a probability threshold set at 0.5. These quite difficult frames show both shadowed and overexposed bi-component pavement materials. The over-detections in the first three frames of the fourth row are due to a partially occluded private gravel road. This quality of results cannot be obtained with an online update.
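The backward update can be organized roughly as follows. This is our own sketch, not the authors' code: `to_lab`, `bottom_connected_component`, `fit_road_region`, `bottom_center_indices` and `update_lut` are placeholder names for the steps described above (color conversion, thresholded connected component, Section 3 fitting, seed-region selection and Parzen-LUT accumulation), and `posterior` is the function from the previous sketch.

```python
import numpy as np

def backward_segmentation(images, lut_road, lut_other):
    """Sketch of the backward update: the (N-k)-th image is processed at iteration k."""
    masks, prev_other = [], None
    for img in reversed(images):
        lab = to_lab(img)                                    # (H*W, 3) color features
        prob = posterior(lab, lut_road, lut_other).reshape(img.shape[:2])
        road_cc = bottom_connected_component(prob > 0.5)     # CC overlapping the bottom center
        road = np.logical_and(road_cc, fit_road_region(prob))  # logical-AND labeling
        # 'Road' pdf: samples from the newly observable bottom-center strip (known to be road)
        lut_road = update_lut(lut_road, lab[bottom_center_indices(img)])
        # 'Other' pdf: only pixels labeled 'Other' at the previous iteration
        if prev_other is not None:
            lut_other = update_lut(lut_other, lab[prev_other.ravel()])
        prev_other = ~road
        masks.append(road)
    return masks[::-1]
```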
3 Road Fitting
As explained in the introduction, the estimation of the shape of the road is usually achieved by means of edge-fitting algorithms, which are applied after the detection of some lane or road boundaries. Hereafter, we propose an original approach based on region fitting, which is more robust to missing data and which is also able to cope with vanishing line estimation.
3.1 Road Models
Following [4], we use two possible curve families to model the borders of the road. First, we use polynomial curves. u_r(v) (resp. u_l(v)) models the right (resp. left) border of the road and is given as

\[ u_r = b_0 + b_1 v + b_2 v^2 + \dots + b_d v^d = \sum_{i=0}^{d} b_i v^i \quad (2) \]
and similarly for the left border. Close to the vehicle, the first four parameters b_0, b_1, b_2, b_3 are proportional, respectively, to the lateral offset, the bearing of the vehicle, the curvature and the curvature gradient of the lane. Second, we use hyperbolic polynomial curves, which better fit road edges at long range:

\[ u_r = a_0 (v - v_h) + a_1 + \dots + \frac{a_d}{(v - v_h)^{d-1}} = \sum_{i=0}^{d} a_i (v - v_h)^{1-i} \quad (3) \]
and similarly for the left border. The previous equations are rewritten compactly in vector notation as u_r = A_r^t X_{v_h}(v) (resp. u_l = A_l^t X_{v_h}(v)).
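For concreteness, the two border parameterizations can be evaluated as in the short sketch below (our own code; the degree, the coefficients and v_h are whatever the fitting stage provides).

```python
import numpy as np

def polynomial_border(b, v):
    """Eq. (2): u(v) = sum_i b_i v^i."""
    return sum(bi * v**i for i, bi in enumerate(b))

def hyperbolic_border(a, v, vh):
    """Eq. (3): u(v) = sum_i a_i (v - vh)^(1-i); valid for image rows v below the vanishing line vh."""
    return sum(ai * (v - vh)**(1 - i) for i, ai in enumerate(a))

def design_matrix(v, vh=None, degree=4):
    """Basis X_vh(v) so that u = A^t X_vh(v); polynomial basis if vh is None, hyperbolic otherwise."""
    if vh is None:
        return np.vstack([v**i for i in range(degree + 1)])
    return np.vstack([(v - vh)**(1 - i) for i in range(degree + 1)])
```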
3.2 Half Quadratic Theory
We propose to cast the region fitting algorithm as the minimization of the following classical least-squares error

\[ e(A_l, A_r) = \int_{Image} \left[ P(R/x(u, v)) - \Omega_{A_l, A_r}(u, v) \right]^2 du\, dv \quad (4) \]
between the image P(R/x(u, v)) of the probability of being within the road class and the function Ω_{A_l,A_r}(u, v) modeling the road. This region is parametrized by A_l and A_r, the left and right border parameters. Ω_{A_l,A_r} must be one inside the region and zero outside. Notice that the function A^t X_{v_h}(v) − u is defined for all pixel coordinates (u, v). Its zero set is the explicit curve parametrized by A, and the function is negative on the left of the curve and positive on its right. We can thus use this function to build Ω_{A_l,A_r}. For instance, the function g((A^t X_{v_h}(v) − u)/σ) + 1/2 is a smooth model of the region on the right of the curve, for any increasing odd function g with g(+∞) = 1/2. The σ parameter is useful to tune the smoothing strength. For a two-border region, we multiply the models for the left and right borders accordingly:

\[ \Omega_{A_l,A_r} = \left( g\!\left(\frac{A_l^t X_{v_h}(v) - u}{\sigma}\right) + \frac{1}{2} \right) \left( -g\!\left(\frac{A_r^t X_{v_h}(v) - u}{\sigma}\right) + \frac{1}{2} \right) \quad (5) \]

By substituting the previous model into (4), we rewrite it in its discrete form:

\[ e_{A_l,A_r} = \sum_{ij \in Image} \left( P_{ij} - \left( g\!\left(\frac{A_l^t X_i - j}{\sigma}\right) + \frac{1}{2} \right) \left( -g\!\left(\frac{A_r^t X_i - j}{\sigma}\right) + \frac{1}{2} \right) \right)^2 \quad (6) \]
The previous minimization is non-linear due to g. However, we now show that this minimization can be handled within the half-quadratic theory, which allows us to derive the associated iterative algorithm. Indeed, after expansion of the square in (6), the function g² of the left and right residuals appears: g²((A_l^t X_i − j)/σ) and g²((A_r^t X_i − j)/σ). The function g²(t) being even, it can be rewritten as g²(t) = h(t²). Once the problem is set in these terms, the half-quadratic theory can be applied in a similar way as, for instance, in [6], by defining the auxiliary variables ω^l_ij = (A_l^t X_i − j)/σ, ω^r_ij = (A_r^t X_i − j)/σ, ν^l_ij = ((A_l^t X_i − j)/σ)² and ν^r_ij = ((A_r^t X_i − j)/σ)². The Lagrangian of the minimization is then obtained as:

\[
\begin{aligned}
L = \sum_{ij} \Big[ & h(\nu^l_{ij})\,h(\nu^r_{ij}) + \tfrac{1}{4}\big(h(\nu^l_{ij}) + h(\nu^r_{ij})\big) + (2P_{ij} - 1)\, g(\omega^l_{ij})\, g(\omega^r_{ij}) \\
& + (P_{ij} - 1/4)\big(-g(\omega^l_{ij}) + g(\omega^r_{ij})\big) - h(\nu^l_{ij})\, g(\omega^r_{ij}) + h(\nu^r_{ij})\, g(\omega^l_{ij}) \Big] \\
+ \sum_{ij} & \Big[ \lambda^l_{ij}\Big(\omega^l_{ij} - \tfrac{A_l^t X_i - j}{\sigma}\Big) + \lambda^r_{ij}\Big(\omega^r_{ij} - \tfrac{A_r^t X_i - j}{\sigma}\Big) \Big] \\
+ \sum_{ij} & \Big[ \mu^l_{ij}\Big(\nu^l_{ij} - \big(\tfrac{A_l^t X_i - j}{\sigma}\big)^2\Big) + \mu^r_{ij}\Big(\nu^r_{ij} - \big(\tfrac{A_r^t X_i - j}{\sigma}\big)^2\Big) \Big] \quad (7)
\end{aligned}
\]

The derivatives of (7) with respect to the auxiliary variables, the unknown variables A_l and A_r, and the Lagrange coefficients λ^l_ij, λ^r_ij, μ^l_ij, μ^r_ij are set to zero. The algorithm is derived as an alternate and iterative minimization using the resulting equations.
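A direct (non-optimized) evaluation of the region model (5) and of the error (6) could look as follows. This is our own sketch: we use g(t) = arctan(t)/π, one admissible choice of increasing odd function with g(+∞) = 1/2, and only the objective evaluation is shown, not the full alternating minimization.

```python
import numpy as np

def g(t):
    """An admissible smoothing function: increasing, odd, with g(+inf) = 1/2."""
    return np.arctan(t) / np.pi

def region_model(Al, Ar, X, u, sigma):
    """Eq. (5): smooth indicator of the region between the left and right borders.
    X: design matrix (basis evaluated at each pixel's row), u: column coordinates."""
    left = g((Al @ X - u) / sigma) + 0.5
    right = -g((Ar @ X - u) / sigma) + 0.5
    return left * right

def region_error(P, Al, Ar, X, u, sigma):
    """Eq. (6): discrete least-squares error between the probability map P and the model."""
    return np.sum((P - region_model(Al, Ar, X, u, sigma)) ** 2)
```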
Fig. 5. 15th degree polynomial region fitting on a difficult synthetic image. Left: ΩAl ,Ar 3D-rendering. Right: Obtained borders in white.
The proposed algorithm can handle a region defined either with polynomial curves (2) or with hyperbolic curves (3); only the design matrix (X_i) changes. Fig. 5 presents a region fit on a difficult synthetic image with numerous outliers and missing parts. The fit is a 15th-order polynomial. On the left side, the 3-D rendering of the obtained region model Ω_{A_l,A_r} is shown. Notice how the proposed region model is able to fit a closed shape even though the region borders are two explicit curves. We stress that contour-based fitting cannot correctly handle such images, with so many edge outliers and closings.
3.3 Geometrical Visibility Range
As explained in the introduction, for road fitting it is of prime importance to be able to estimate the position v_h of the vanishing line, which parametrizes the
Fig. 6. Road region fitting results with 6th-order hyperbolic polynomial borders. The images on the right provide a zoom on the far extremity of the road. The white line marks the estimated vanishing line; the red line marks the maximum distance at which the road is visible.
design matrix. We solve this problem by adding an extra step in the previous iterative minimization scheme, where v_h is updated as the ordinate of the point where the asymptotes of the two curves intersect. In practice, we observed that the modified algorithm converges towards a local minimum. The minimization is performed at decreasing scales to better converge towards a local minimum not too far from the global one. Moreover, as underlined in [4], the left and right borders of the road are related, being approximately parallel. This constraint can easily be enforced in the region fitting algorithm and leads to a minimization problem with a reduced number of parameters. Indeed, parallelism of the road borders implies a_i^r = a_i^l, ∀i ≤ 1, in (3). This constraint brings improved robustness to the road region fitting algorithm with respect to missing parts and outliers. Fig. 6 shows two images taken from one of the sequences we experimented with. It illustrates the accuracy and the robustness of the obtained results when the local-flatness assumption is valid (first row). Notice the limited effect of the violation of this assumption on
Fig. 7. Flat and non-flat road used for distance accuracy experiments
the second row at long distance. The white line shows the estimated vanishing line, while the red line shows the maximum image height at which the road is visible. The geometric visibility range of the road is directly related to the difference in height between the white and red lines.

Table 1. Comparison of the true distance in meters (true) with the distances estimated by camera calibration (calib.) and estimated using the proposed segmentation and fitting algorithms (estim.), for four targets and on two images. On the left is the flat road, on the right the non-flat road of Fig. 7.

         |      flat road            |      non-flat road
 target  |  true    calib.   estim.  |  true     calib.    estim.
    1    |  26.56   26.95    27.34   |  33.59    23.49     34.72
    2    |  52.04   56.67    68.41   |  59.49    60.93     61.67
    3    | 103.52   98.08   103.4    | 111.66   111.58    114.08
    4    | 200.68  202.02   225.96   | 208.64  1176.75   1530.34
Finally, we ran experiments to evaluate the accuracy of the distances estimated using one camera. On two images, one where the road is really flat and one where it is not (see Fig. 7), we compared the estimated and measured distances of white calibration targets set at different distances. The true distances were measured using a theodolite, and two kinds of estimates are provided: the first is obtained using the camera calibration with respect to the road at close range, and the second using the proposed road segmentation and fitting. Results are shown in Tab. 1. It appears that errors in distance estimation can be large at long range when the flat-world assumption is not valid; but when it is valid the error is no more than 11%, which is satisfactory. A video-format image is processed in a few seconds, and this could be optimized further.
4 Conclusion
We tackle the original question of how to estimate the geometrical visibility range of the road from a vehicle with only one camera along inter-urban roads. This application is new and of importance in the field of transportation. It is a difficult inverse problem since 3D distances must be estimated using only one 2D view. However, we propose a solution based first on a fast and robust segmentation of the road region using local color features, and second on parametrized fitting of the segmented region using a priori knowledge we have on road regions. The segmentation is robust to lighting and road color variations thanks to a backward processing. The proposed original fitting algorithm is another new illustration of the power of half-quadratic theory. An extension of this algorithm is also proposed to estimate the position of the vanishing line in each image. We validated the good accuracy of the proposed approach for flat roads. In the future, we will focus on the combination of the proposed approach with stereovision, to handle the case of non-flat roads.
References
1. Turk, M., Morgenthaler, D., Gremban, K., Marra, M.: VITS – a vision system for autonomous land vehicle navigation. IEEE Transactions on Pattern Analysis and Machine Intelligence 10(3), 342–361 (1988)
2. Liou, S., Jain, R.: Road following using vanishing points. Comput. Vision Graph. Image Process 39(1), 116–130 (1987)
3. Crisman, J., Thorpe, C.: Color vision for road following. In: Proc. of SPIE Conference on Mobile Robots, Cambridge, Massachusetts (1988)
4. Guichard, F., Tarel, J.P.: Curve finder combining perceptual grouping and a Kalman-like fitting. In: ICCV 1999. IEEE International Conference on Computer Vision, Kerkyra, Greece. IEEE Computer Society Press, Los Alamitos (1999)
5. Southall, C., Taylor, C.: Stochastic road shape estimation. In: Proceedings Eighth IEEE International Conference on Computer Vision, vol. 1, pp. 205–212. IEEE Computer Society Press, Los Alamitos (2001)
6. Tarel, J.P., Ieng, S.S., Charbonnier, P.: Using robust estimation algorithms for tracking explicit curves. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 492–507. Springer, Heidelberg (2002)
7. Wang, Y., Shen, D., Teoh, E.: Lane detection using spline model. Pattern Recognition Letters 21, 677–689 (2000)
8. Dementhon, D.: Reconstruction of the road by matching edge points in the road image. Technical Report CAT-TR-368, Center for Automation Research, Univ. Maryland (1988)
9. Dickmanns, E., Mysliwetz, B.: Recursive 3D road and relative ego-state recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 14(2), 199–213 (1992)
10. Chapuis, R., Aufrere, R., Chausse, F.: Recovering a 3D shape of road by vision. In: Proc. of the 7th Int. Conf. on Image Processing and its Applications, Manchester (1999)
11. Rasmussen, C.: Texture-based vanishing point voting for road shape estimation. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 470–477. IEEE Computer Society Press, Los Alamitos (2004)
12. Thorpe, C., Hebert, M., Kanade, T., Shafer, S.: Vision and navigation for the Carnegie-Mellon Navlab. IEEE Transactions on Pattern Analysis and Machine Intelligence 10(3), 362–373 (1988)
13. Sandt, F., Aubert, D.: Comparison of color image segmentations for lane following. In: SPIE Mobile Robot VII, Boston (1992)
14. Crisman, J., Thorpe, C.: SCARF: a color vision system that tracks roads and intersections. IEEE Transactions on Robotics and Automation 9(1), 49–58 (1993)
15. Crisman, J., Thorpe, C.: UNSCARF, a color vision system for the detection of unstructured roads. In: Proc. of IEEE International Conference on Robotics and Automation, pp. 2496–2501. IEEE Computer Society Press, Los Alamitos (1991)
16. Charbonnier, P., Nicolle, P., Guillard, Y., Charrier, J.: Road boundaries detection using color saturation. In: Proc. European Signal Processing Conference (EUSIPCO), Rhodes, Greece, pp. 2553–2556 (1998)
Image Segmentation Using Co-EM Strategy Zhenglong Li, Jian Cheng, Qingshan Liu, and Hanqing Lu National Laboratory of Pattern Recognition Institute of Automation, Chinese Academy of Sciences P.O. Box 2728, Beijing (100080), P.R. China {zlli,jcheng,qsliu,luhq}@nlpr.ia.ac.cn
Abstract. Inspired by the idea of multi-view learning, we propose an image segmentation algorithm using a co-EM strategy. Image data are modeled with a Gaussian Mixture Model (GMM), and two sets of features, i.e. two views, are employed with the co-EM strategy, instead of conventional single-view EM, to estimate the parameters of the GMM. Compared with single-view GMM-EM methods, the proposed segmentation method using the co-EM strategy has several advantages. First, the imperfectness of a single view can be compensated by the other view. Second, by employing two views, the co-EM strategy offers more reliable segmentation results. Third, the tendency of single-view EM to get stuck in local optima is overcome to some extent. Fourth, the convergence rate is improved; the average running time is far less than that of single-view methods. We test the proposed method on a large number of images with unconstrained content. The experimental results verify the above advantages and outperform single-view GMM-EM segmentation methods.
1 Introduction
Image segmentation is an important pre-processing step for many higher-level vision understanding systems. The basic task of image segmentation is to divide an input image, according to some criteria, into foreground and background objects. The most commonly used criteria can be categorized into three classes: global knowledge based, region-based (homogeneity) and edge-based [1]. Global knowledge based methods usually refer to thresholding based on global knowledge of the image histogram. The homogeneity criterion assumes that the meaningful foreground and background objects of an image consist of homogeneous regions in the sense of some homogeneity metric. Usually the resulting segments are not exactly the objects: only a partial segmentation [1] is achieved, in which the results are regions with homogeneous characteristics. At a higher level of an image understanding system, the partial segmentation results can be regrouped to correspond to real objects with the aid of specific domain knowledge. This paper focuses on partial image segmentation. The segmentation of natural images, or images with rich textural content, is a challenging task because of the influences of texture, albedo, the unknown number
of objects, and amorphous object shapes. Several methods [2, 3, 4] have been proposed to deal with natural image segmentation. Among them, EdgeFlow [2] is a successful edge-based method that shows fine performance on generic image segmentation, but because of its edge-based nature its performance relies heavily on post-processing such as spur trimming and edge linking. The work of [4], which uses Gaussian Mixture Models (GMM) to model image content, has shown some success in content-based image retrieval. The parameters of the GMM are solved by Expectation Maximization (EM). This GMM-EM based method is more robust than edge-based methods because of its region-based nature, and no post-processing is required to form closed and continuous boundaries, a step that otherwise heavily influences segmentation quality. GMM-EM based methods estimate the maximum a posteriori parameters of a GMM with the EM algorithm, use this GMM as a classifier to label all points in an N-D feature space, and then map the labels back to the 2-D image plane to obtain a partition of the image. The first step is feature extraction, in which the 2-D image is transformed into an N-D feature space; the feature vectors are then clustered according to some metric. Usually only a single feature set is used in GMM-EM based methods. Although the GMM-EM method with a single feature set in [3, 4] achieves some success in generic image segmentation on the CorelTM image dataset, this single-feature-set strategy has several drawbacks. First, a single feature set is usually imperfect at discriminating some details of an image. Second, more reliable results call for more features to be incorporated, yet a high-dimensional feature space suffers from over-fitting. Third, EM in essence seeks a local optimum; it is prone to getting stuck in a local optimum, so the algorithm may yield an improper segmentation. To solve the above problems, we introduce the co-EM (or multi-view) strategy in this paper. We first define the term view, used throughout the paper: if the feature domain can be divided into disjoint subsets and each subset is sufficient to learn the targets [5], then each such subset is called a view. The idea of multi-view learning was proposed in [6, 7] for text and Web page classification. It is an extension of the co-Training of Blum and Mitchell [7]. The difference between co-Training and co-EM is that the former adopts only the most confident labels in each iteration, whereas the latter uses all trained labels in each iteration (for details, please cf. [6, 5]). In this paper, we propose a method using the co-EM strategy for segmenting natural or richly textured images. In the proposed method, the image is first modeled by a GMM; the co-EM algorithm then employs two views to solve the parameters of the GMM and label all pixels in the image, yielding the segmentation result. Compared with one-view based methods, the proposed method has the following advantages.

Compensated imperfectness of a single view. A single view usually has some imperfection in its discriminative power. In image segmentation, this imperfection brings problems such as imprecise or even
wrong boundaries. With the co-EM strategy, the two views augment each other, and to some extent the imperfection of each view is compensated by the other, giving better segmentation results.

More reliability. One way to improve reliability is to include more features in the feature space. While a higher-dimensional feature space can improve reliability, single-view methods are then apt to suffer from over-fitting, and the higher dimension also exacerbates the computational burden. The co-EM strategy employs two views by turns: more information is provided while the dimension of each feature space is kept relatively small, so the proposed method is more reliable with low computational overhead.

Improved local optimality. The solution found by EM is not optimal in the global sense; an improper initial parameter estimate is prone to getting stuck in a local optimum. In the co-EM strategy, when the estimates get stuck in a local extremum in one view, the other view "pulls" the evolution away from it, and vice versa. Until the two views reach consensus, co-EM keeps seeking a better solution. Therefore the proposed co-EM method can, to some extent, avoid the local optima of single-view GMM-EM.

Accelerated convergence rate. By letting the two views augment each other, the convergence rate of the co-EM strategy is faster than that of classical single-view EM.

We note that [8] recently presented a similar segmentation method using co-EM. However, there are several serious problems with the work of [8], and we give a short comparison with [8] in Section 2. The rest of the paper is organized as follows. In Section 2, we briefly review related work, in particular in comparison with [8]. The algorithms using the co-EM strategy are proposed in Section 3. Section 4 gives the experiments and analysis. Finally, we conclude the paper in Section 5.
2 Related Works
In [3,4], GMM-EM algorithms achieve some success in generic image segmentation. The keys to their success can be summarized as: 1) fine feature descriptors of image content; 2) a scheme for selecting proper initial parameter estimates for EM; 3) the Minimum Description Length (MDL) criterion to determine the proper number of Gaussians in the GMM. However, only one view is used in these works, and they suffer from the issues, such as view imperfectness and reliability, mentioned in the previous section. We note that a similar co-EM concept was recently proposed for image segmentation in [8]. However, there are several critical problems with [8]. First, no consideration is given to the initial parameter estimate for EM. This estimate is important: improper initial parameters are apt to give a wrong, under-segmented result because of local optimality. Second, it is curious that only the RGB channels and 2-D spatial coordinates are chosen as the two views.
This split of the feature domain breaks the rule that each view should be sufficient to learn the object, i.e. both views must be strong views [9]. Third, only two images are tested in [8]. Compared with the work of [8], our method considers the sensitivity of EM to the initial parameter estimate, and the features are split into two strong views. Finally, the proposed method is tested on a large number of images, and the experimental results are meaningful and promising.
3 Image Segmentation Using Co-EM Strategy
We use a finite GMM to model the content of an image [3]. The first step is to extract features for the co-EM strategy. In our tests, we choose Carson's features and Gabor features as the two views to describe the image content. The co-EM strategy is then applied to solve the parameters of the GMM and produce the segmentation results.
3.1 Feature Extraction
For the co-EM strategy, the views must possess two properties [9]: 1) uncorrelated: for any labeled instance, the descriptions of it in the two views should not be strongly correlated; 2) compatibility: each view should give the same label for any instance. The uncorrelated property implies that each view should be a "strong" view that can learn the objects by itself. Compatibility means that the views should be consistent in describing the nature of the objects: there is no contradiction between the resulting classifications. These two properties should be strictly obeyed by the co-EM strategy (there exist co- strategies to deal with views violating these properties, i.e. "weak" views, but they require human intervention and are not suitable in our case; for details we refer the reader to [9, 5]). We choose two views, Carson's features [3, 4] and Gabor features [10, 11], in our experiments. Although a rigorous proof of the uncorrelatedness and compatibility of the two views is difficult, the experiments in Section 4 show that the chosen views satisfy the criteria. In the following we describe Carson's features and the Gabor features respectively.

Carson's features. These features are mainly composed of three parts: color, texture and normalized 2-D coordinates. The 3-D color descriptor adopts the CIE 1976 L*a*b* color space for its perceptual uniformity. The texture descriptor used in Carson's features is slightly more involved, so we describe it in more detail. The texture descriptors comprise anisotropy, normalized texture contrast, and polarity. For anisotropy and normalized texture contrast, the second moment matrix

\[ M_\sigma(x, y) = G_\sigma(x, y) * \nabla L\, (\nabla L)^T \quad (1) \]

should be computed first, where L is the L* component of the L*a*b* color and G_σ is a smoothing Gaussian filter with standard deviation σ. Then the eigenvalues λ1, λ2 (λ1 > λ2) and the corresponding eigenvectors l1, l2 of M_σ are computed.
The anisotropy is given as a = 1 − λ2/λ1 and the normalized texture contrast as c = 2√(λ1 + λ2). To generate the polarity descriptors at a specific scale, the first step is scale selection using the polarity of the image [12]:

\[ p_\sigma = \frac{|E_+ - E_-|}{E_+ + E_-} \quad (2) \]

where E_+ = Σ_{x,y} G_σ(x, y)[∇L · n]_+ and E_− = Σ_{x,y} G_σ(x, y)[∇L · n]_−. Here p_σ is the polarity at one point of the image, G_σ is a Gaussian filter with standard deviation σ, n is a unit vector perpendicular to l1, and the operator [·]_+ (resp. [·]_−) is the rectified positive (resp. negative) part of its argument. For each point (x, y) in the image, according to Eq. (2), σ takes the values σ_k = 0, 1/2, 2/2, ..., 7/2. The resulting polarity images are then convolved with Gaussians of standard deviation 2σ_k = 0, 1, 2, ..., 7 respectively. For each point (x, y), the scale is selected as σ_k if (p_{σ_k} − p_{σ_{k−1}})/p_{σ_k} < 0.02.

The last part of Carson's features is the coordinates (x, y), normalized to the range [0, 1]. These 2-D coordinates describe spatial coherence and prevent over-segmentation. The resulting Carson feature space is 8-D.

Gabor features. Gabor filters can be considered as Fourier basis functions modulated by Gaussian windows, and a Gabor filter bank describes particular spatial frequencies and orientations locally. The Gabor filters used in this paper are

\[ g(x, y) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\!\left(-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right) + 2\pi j W x\right), \quad (3) \]

\[ G(u, v) = \exp\!\left(-\frac{1}{2}\left(\frac{(u - W)^2}{\sigma_u^2} + \frac{v^2}{\sigma_v^2}\right)\right), \qquad \sigma_u = \frac{1}{2\pi\sigma_x}, \;\; \sigma_v = \frac{1}{2\pi\sigma_y}. \quad (4) \]

Eq. (3) is a 2-D Gabor filter in the spatial domain, and Eq. (4) is the Fourier transform of Eq. (3). The Gabor filter bank should cover the whole effective frequency domain while reducing redundancy as much as possible. In our tests, the numbers of orientations and scales are set to 6 and 4 respectively (for a detailed derivation of the Gabor filter bank parameters, please cf. [10]). There are therefore 24 filters in the Gabor filter bank, and the Gabor feature dimension is 24.
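A bank of the filters in Eq. (3) can be generated along the following lines. This is a sketch of the standard construction in our own code, not the authors' implementation; the kernel size, base frequency, octave spacing and the 0.56/W envelope width are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def gabor_kernel(sigma_x, sigma_y, W, theta, size=31):
    """One complex Gabor filter, Eq. (3), rotated by angle theta."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)     # rotate coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-0.5 * (xr**2 / sigma_x**2 + yr**2 / sigma_y**2))
    carrier = np.exp(2j * np.pi * W * xr)
    return envelope * carrier / (2 * np.pi * sigma_x * sigma_y)

def gabor_bank(n_orient=6, n_scale=4, base_freq=0.4):
    """6 orientations x 4 scales = 24 filters, matching the 24-D Gabor view."""
    bank = []
    for s in range(n_scale):
        W = base_freq / (2 ** s)          # assumed octave spacing between scales
        sigma = 0.56 / W                  # assumed envelope width tied to frequency
        for o in range(n_orient):
            theta = o * np.pi / n_orient
            bank.append(gabor_kernel(sigma, sigma, W, theta))
    return bank                            # per-pixel feature: |I * g| for each filter
```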
3.2 Co-EM Strategy
We use the well-known finite GMM to model the content of images. The single-view EM algorithm is composed of two steps, the E-step and the M-step: in the E-step the expectation is evaluated, and this expectation is maximized in the M-step. For brevity, we do not give the solution equations of standard GMM-EM here (we refer the interested reader to [13]).
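For reference, the single-view E-step and M-step used inside each classifier can be sketched as below. These are the standard GMM-EM updates written by us (see [13] for the derivation); the small diagonal regularization added to the covariances is our own choice.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    """Responsibilities gamma_nk = P(component k | x_n) for data X of shape (N, D)."""
    K = len(weights)
    resp = np.column_stack([
        weights[k] * multivariate_normal.pdf(X, means[k], covs[k]) for k in range(K)
    ])
    return resp / resp.sum(axis=1, keepdims=True)

def m_step(X, resp):
    """Re-estimate mixture weights, means and covariances from the responsibilities."""
    Nk = resp.sum(axis=0)
    weights = Nk / len(X)
    means = (resp.T @ X) / Nk[:, None]
    covs = []
    for k in range(resp.shape[1]):
        d = X - means[k]
        covs.append((resp[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(X.shape[1]))
    return weights, means, covs
```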
We use the co-EM strategy to estimate the parameters of the GMM. The idea of co-EM is to train two classifiers in the two views respectively, and to let them suggest labels to each other by turns during the EM training process. The two views employed in our experiments are Carson's features and the Gabor features. We give the pseudocode of the co-EM strategy in Fig. 1. Lines 1 to 3 are the initialization stage of co-EM. At line 3, Initial_Label1 is fed to TrainingClassifier to obtain an initial estimate of the GMM parameters; TrainingClassifier is a classical GMM-EM solver. Lines 4-19 form the main loop of the co-EM algorithm. Lines 5-10 operate on Classifier1 in View1, producing Label1 for all training points in View1 using Classifier1. Note that at line 8 (line 14), TrainingClassifier uses Label1 (Label2) from View1 (View2) to aid the training of Classifier2 (Classifier1) in View2 (View1). Lines 12-16 serve the same purpose as lines 5-10, but in View2 instead of View1. The cooperation of View1 and View2 by means of EM is embodied in lines 8 and 14: the labels learned from one view are suggested to the other view, by turns, to aid learning the objects.
co-EM
Input: View1, View2, Initial_Label1
 1  Label1 ← 0, Label2 ← 0
 2  counter ← 0, flag1 ← false, flag2 ← false
 3  Classifier1 ← TrainingClassifier(Initial_Label1, View1)
 4  while counter < max_iteration or flag1 = false or flag2 = false do
 5      Label1 ← LabelingData(Classifier1, View1)
 6      if IsLabelFull(Label1) = false then
 7          goto Step 16
 8      else Classifier2 ← TrainingClassifier(Label1, View2)
 9      end if
10      Label2 ← LabelingData(Classifier2, View2)
11      flag1 ← IsConverged(Classifier1, View1)
12      if IsLabelFull(Label2) = false then
13          goto Step 18
14      else Classifier2 ← TrainingClassifier(Label2, View1)
15      end if
16      Classifier1 ← TrainingClassifier(Label2, View1)
17      flag2 ← IsConverged(Classifier2, View2)
18      counter ← counter + 1
19  end while

Fig. 1. The pseudocode of the co-EM strategy
The initial parameter estimate plays a key role in the performance of EM-like algorithms: improper initial parameters will produce a locally optimal solution and result in a wrong segmentation. In the experiments, each test image is given several fixed initial partitioning templates for co-EM, as an attempt to avoid locally optimal solutions.
The determination of the number of Gaussians, K, in the GMM is another critical problem in the co-EM strategy. In the experiments, K is set to 3 for each image. (In [4], the MDL criterion is used to determine the number of Gaussians, but we found that a fixed K of 3 works well in most situations in our experiments; another promising way to estimate K is to use a Dirichlet Process, and although it works for our cases, for efficiency we keep the fixed-K scheme in the current experiments.)
4 Experiments
Fig. 2 shows a comparison between the results of the co-EM strategy and of single-view EM. The second column is the result of the proposed co-EM strategy using two views, Carson's features and Gabor features. The third column is the result of single-view EM using Carson's features, and the rightmost
(a); (b) t_coEM = 13.53 sec; (c) t_Carson = 25.61 sec; (d) t_Gabor = 26.42 sec
(e); (f) t_coEM = 12.84 sec; (g) t_Carson = 16.54 sec; (h) t_Gabor = 17.83 sec
(i); (j) t_coEM = 13.38 sec; (k) t_Carson = 31.16 sec; (l) t_Gabor = 23.97 sec
Fig. 2. Comparison between results of the co-EM strategy and single-view GMM-EM methods. The second column shows the segmentation results of the co-EM strategy using two views, Carson's features and Gabor features. The third column shows the results of one-view EM using Carson's features, and the rightmost column those using Gabor features. The gray regions in the segmentation results are invalid regions.
Image Name   Co-EM   EM (Carson)   EM (Gabor)
108004        7.76      16.54        11.71
41069        15.58      18.14        29.16
42049        11.89      20.56        24.05
66053        12.24      27.25        34.59
97033        17.25      19.24        34.87
291000       14.54      14.18        24.18
62096        24.16      20.65        34.59
134052       14.88      21.25        26.42
45096        12.34      31.17        23.97
126007       12.10      20.56        33.85
156065       30.63      18.42        23.65
41004        10.92      21.05        30.38
100080       10.63      15.33        23.11
78004        15.81      17.59        29.35
189080       21.59      18.81        29.59
55075        15.04      26.30        32.09
302008       21.75      33.06        31.56
Fig. 3. Segmentation results using the co-EM strategy, and running-time comparison between the co-EM strategy and single-view GMM-EM methods using Carson's features and Gabor features
column those using Gabor features. The gray regions in the segmentation results represent invalid regions, i.e. regions with low confidence of belonging to any object. Observe the leopard in the top row of the figure. Using only Carson's features, the leopard and the tree trunk are wrongly classified as one object, while in the Gabor feature domain the image is over-segmented and the boundaries get distorted. The proposed co-EM strategy, however, gives a finer result than either one-view strategy: the leopard and the tree trunk are correctly classified as two parts, and the boundaries fit the real object boundaries well (see the top-left subfigure in Fig. 2). The remaining results all show that the co-EM strategy gives superior results compared to one-view EM. We list the running time under each segmentation result. The proposed co-EM method shows the fastest convergence rate while giving finer segmentation results.

An interesting phenomenon can be observed in Fig. 2: the cooperation of two views, here the Carson and Gabor views, produces finer results than either single view. This can be explained as follows. A view usually has some imperfection that impairs classification performance. In our case, the ability of Carson's features to discriminate texture is relatively weak, whereas the Gabor features place too much weight on texture discrimination. Therefore, in the Carson feature space textures are not finely discriminated, and Gabor-feature based algorithms are prone to over-segmenting textured regions. With the co-EM strategy, the drawbacks of these two views are compensated to some extent, hence the proposed co-EM algorithm outperforms single-view methods.

In Fig. 3, we show some segmentation results obtained by the proposed co-EM strategy. The number under each segmentation result is the image name in the CorelTM image database. The table in the lower part shows the running time of the proposed co-EM strategy with 8-D Carson's features and 24-D Gabor features, of single-view GMM-EM using Carson's features, and of single-view GMM-EM using Gabor features. The average times for these three methods are 15.52 sec, 21.49 sec and 26.43 sec respectively. The proposed method is faster than single-view EM using the Carson view and the Gabor view by 27.8% and 41.3% respectively.
5 Conclusion
In this paper, we proposed a co-EM strategy for generic image segmentation. Two views are employed in the co-EM strategy: Carson's features and Gabor features. The proposed method shows clear advantages over classical single-view methods, and we tested it on a large number of images. The experimental results are promising and verify the proposed method.
Acknowledgment We would like to acknowledge support from Natural Sciences Foundation of China under grant No. 60475010, 60121302, 60605004 and No. 60675003.
References
1. Sonka, M., Hlavac, V., Boyle, R.: Image Processing, Analysis, and Machine Vision, 2nd edn. Brooks/Cole (1998)
2. Ma, W., Manjunath, B.S.: EdgeFlow: A technique for boundary detection and image segmentation. IEEE Trans. Image Process 9(8), 1375–1388 (2000)
3. Belongie, S., Carson, C., Greenspan, H., Malik, J.: Color- and texture-based image segmentation using EM and its application to content-based image retrieval. In: IEEE Proc. Int. Conf. Computer Vision, pp. 675–682. IEEE Computer Society Press, Los Alamitos (1998)
4. Carson, C., Belongie, S., Greenspan, H., Malik, J.: Blobworld: Image segmentation using Expectation-Maximization and its application to image querying. IEEE Trans. Pattern Anal. Mach. Intell. 24(8), 1026–1038 (2002)
5. Muslea, I., Minton, S., Knoblock, C.A.: Active + semi-supervised learning = robust multi-view learning. In: Proc. Int. Conf. Machine Learning, pp. 435–442 (2002)
6. Nigam, K., Ghani, R.: Analyzing the effectiveness and applicability of co-training. In: Proc. Int. Conf. on Information and Knowledge Management, pp. 86–93 (2000)
7. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Annual Workshop on Computational Learning Theory, pp. 92–100 (1998)
8. Yi, X., Zhang, C., Wang, J.: Multi-view EM algorithm and its application to color image segmentation. In: IEEE Proc. Int. Conf. Multimedia and Expo, pp. 351–354. IEEE Computer Society Press, Los Alamitos (2004)
9. Muslea, I., Minton, S.N., Knoblock, C.A.: Active learning with strong and weak views: A case study on wrapper induction. In: Proc. Int. Joint Conf. on Artificial Intelligence, pp. 415–420 (2003)
10. Manjunath, B.S., Ma, W.Y.: Texture features for browsing and retrieval of image data. IEEE Trans. Pattern Anal. Mach. Intell. 18(8), 837–842 (1996)
11. Daugman, J.G.: Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of Optical Society of America A 2(7) (1985)
12. Freeman, W.T., Adelson, E.H.: The design and use of steerable filters. IEEE Trans. Pattern Anal. Mach. Intell. 13(9), 891–906 (1991)
13. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006)
Co-segmentation of Image Pairs with Quadratic Global Constraint in MRFs Yadong Mu and Bingfeng Zhou Institute of Computer Science and Technology Peking University, Beijing, 100871 {muyadong,zhoubingfeng}@icst.pku.edu.cn
Abstract. This paper provides a novel method for co-segmentation, namely simultaneously segmenting multiple images with the same foreground and distinct backgrounds. Our contribution is primarily four-fold. First, image pairs are typically captured under different imaging conditions, which makes the color distribution of the desired object shift greatly and challenges color-based co-segmentation; we propose a robust regression method to minimize color variance between corresponding image regions. Second, although it has been intensively discussed, the exact meaning of the term "co-segmentation" is rather vague, and the importance of the image background has previously been neglected; this motivates us to provide a novel, clear and comprehensive definition of co-segmentation. Third, specific regions tend to be categorized as foreground, so we introduce a "risk term" to differentiate colors, which, to our best knowledge, has not been discussed before in the literature. Lastly, and most importantly, unlike conventional linear global terms in MRFs, we propose a sum-of-squared-difference (SSD) based global constraint and deduce its equivalent quadratic form, which takes into account pairwise relations in feature space. Under reasonable assumptions, the global optimum can be efficiently obtained via alternating Graph Cuts.
1 Introduction
Segmentation is a fundamental and challenging problem in computer vision. Automatic segmentation [1] is possible yet prone to error. Since the well-known Graph Cuts algorithm was utilized in [2], there has been a burst of interactive segmentation methods ([3], [4] and [5]). It has also been shown that fusing information from multiple modalities ([6], [7]) can improve segmentation quality. However, as argued in [8], segmentation from one single image is too difficult, and recently there has been much research interest in multiple-image based approaches. In this paper we focus on co-segmentation, namely simultaneously segmenting an image pair containing identical objects and distinct backgrounds. The term "co-segmentation" was first introduced into the computer vision community by Carsten Rother [8] in 2006. The areas where co-segmentation is potentially useful are broad: automatic image/video object extraction, image partial distance,
Fig. 1. Experimental results for our proposed co-segmentation approach
video summarization and tracking. Due to space considerations, we focus on the technique of co-segmentation itself and discuss little about its applications. We try to solve several key issues in co-segmentation. Traditional global terms in MRFs are typically linear functions and can be optimized in polynomial time [9]. Unfortunately, such linear terms are too limited, and the highly non-linear, challenging global terms [8] proposed for co-segmentation are NP-hard to optimize. Moreover, although it has been intensively discussed, the exact meaning of the term "co-segmentation" is rather vague, and the importance of the image background has previously been neglected. In this paper, we present a more comprehensive definition and a novel probabilistic model for co-segmentation, introduce a quadratic global constraint which can be efficiently optimized, and propose a Risk Term which proves effective in boosting segmentation quality.
2 Generative Model for Co-segmentation

2.1 Notations
The inputs for co-segmentation are image pairs, and it is usually required that each pair contain image regions corresponding to identical objects or scenes. Let K = {1, 2} and Ic = {1, . . . , N} be two index sets, ranging over images and pixels respectively; k and i are elements of them. Zk and Xk are random vectors of image measurements and pixel categories (foreground/background in the current task). zki or xki represents the i-th element of the k-th image. We assume images are generated according to some unknown distribution, and that each pixel is sampled independently. The parameters of the image generation model can be divided into two parts, related to the foreground and background regions respectively. Let θkf and θkb denote the object/background parameters of the k-th image.
2.2 Graphical Models for Co-segmentation
Choosing appropriate image generation models is the most crucial step in co-segmentation. However, such models are not obvious. As in the previous work in
Fig. 2. Generative models for co-segmentation. (a) Rother’s model (refer to [8] for details) based on hypothesis evaluation. J = 1 and J = 0 correspond to the hypothesis that image pairs are generated with/without common foreground model respectively. (b) Generative model proposed in this paper for co-segmenting.
[8], Rother et al. selected 1D-histogram based image probabilistic models, whose graphical models are drawn in Figure 2(a). As can be seen, Rother's approach relies on hypothesis evaluation, namely choosing the parameters that maximize the desired hypothesis that the two images are generated in the manner of sharing non-trivial common parts. It can equivalently be viewed as maximizing the joint probability of the observed image pairs and the hidden random vectors (specifically, θkf, θkb and Xk in Figure 2(a), where k ranges over {1, 2}). However, the above-mentioned generative models, although flexible, are not practical; the drawbacks lie in several aspects. Firstly, Rother's model makes many assumptions for the sake of feasibility, which complicates parameter estimation and makes the model sensitive to noise. Some model parameters cannot even be estimated unbiasedly due to the lack of sufficient training samples; for some parameters only one sample can be found. An example of this is that the image likelihoods under hypothesis J = 0 are always almost equal to 1, which is certainly not the true case. Secondly, the final global term deduced in [8] is highly non-linear. In fact, it can be regarded as the classical 1-norm if we treat each histogram as a single vector, which complicates the optimization of the pixel labeling. Lastly, and most importantly, the authors did not seriously take into account the relation between the background models of the image pair. Let hkf and hkb denote the image measurement histograms (typically color or texture) of the foreground/background of the k-th image. The final energy function to be minimized in [8] only contains an image generation term proportional to Σz |h1f(z) − h2f(z)|, while the background parameters disappear. This greedy strategy sometimes brings mistakes. Here we argue that the effect of the background cannot be neglected. An example illustrating our idea is given in Figure 3, where two segmentation results are shown for comparison. In case 1, the extracted foregrounds match each other perfectly if we just compare their color histograms. However, the segmentation in case 2 seems preferable, although the purple regions in
Fig. 3. An example to illustrate the relation between "optimality" and "maximality". The purple region in the bottom image is slightly larger than the top image's. If we only consider foreground models as in [8], case 1 is optimal. However, it is not maximal, since the purple regions are supposed to be labeled as foreground as in case 2.
the two images differ greatly in size. In other words, we should consider both "optimality" and "maximality". Case 1 is an extreme example, which is optimal according to the aforementioned criterion, yet not maximal. We argue that the task of co-segmentation can be regarded as finding the maximal common parts between two feature sets together with spatial consistency. Unlike [8], we obtain maximality by introducing large penalties if the backgrounds contain similar content. A novel energy term on image backgrounds is proposed and detailed in Section 4. Our proposed graphical model is shown in Figure 2(b). At each phase, we optimize over X on one image by assuming the parameters of the other image are known (note that θ̂ in Figure 2(b) is colored gray since its value is known). We solve this optimization using alternating Graph Cuts, illustrated in Figure 4. The joint probability to be maximized can be written as:

\[ X^* = \arg\max_X P(X)\, P(Z \mid X, \hat{\theta}) \quad (1) \]

Solving this optimization problem is equivalent to finding the minimum of its negative logarithm. Denote E1 = −log P(X) and E2 = −log P(Z|X, θ̂); for convenience we use this latter log form.
3 Preprocessing by Color Rectification
It is well known that the RGB color space is not uniform, and that the three channels are not independent. It has previously been argued in [10] that proper color coordinate transformations are able to partition RGB-space differently. Similar
Fig. 4. Illustration of alternating Graph Cuts. The optimization is performed in an alternating fashion. The vertical arrows denote optimization with graph cuts, while the horizontal arrows indicate building color histograms from the segmentation X and the pixel measurements Z.
to the idea used in intrinsic images [11], we abandon the intensity channel, keeping only the color information. In practice, we first transform the images from RGB space to CIE-LAB space, where the L-channel represents the lightness of the color and the other two channels carry the color information. After that, we perform color rectification in two steps:

– Step One: Extract local feature points from each image, and find their correspondences in the other image if they exist.
– Step Two: Sample colors from a small neighborhood of the matching points, and use linear regression to minimize the color variance.

In step one, we adopt SIFT [12] to detect feature points. SIFT points are invariant to rotation, translation and scaling, and partly robust to affine distortion. They also show high repeatability and distinctiveness in various applications and work well for our task. Typically we can extract hundreds of SIFT points from each image, while the number of matching point pairs varies with the inputs. An example of the SIFT matching procedure can be found in Figure 8; in its middle column, matching pixels are connected with red lines. These matching points are then used to perform linear regression [13] within each color channel. Colors are scaled and translated to match their correspondences, so that the color variance between the image pair is minimized in the least-squared-error (LSE) sense. An example can be found in Figure 5. Robust methods such as RANSAC can be exploited to remove outliers.
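A minimal sketch of the two steps, in our own code using OpenCV's SIFT, is given below. The ratio-test threshold and the sampling window size are illustrative assumptions rather than values from the paper, and outlier trimming (e.g. RANSAC) is omitted.

```python
import cv2
import numpy as np

def match_sift(img1, img2):
    """Step one: SIFT keypoints matched across the image pair (Lowe's ratio test)."""
    sift = cv2.SIFT_create()
    k1, d1 = sift.detectAndCompute(img1, None)
    k2, d2 = sift.detectAndCompute(img2, None)
    matches = cv2.BFMatcher().knnMatch(d1, d2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]
    return [(k1[m.queryIdx].pt, k2[m.trainIdx].pt) for m in good]

def rectify_channel(ch1, ch2, pairs, win=2):
    """Step two: least-squares scale/offset so that channel 2 matches channel 1."""
    s1, s2 = [], []
    for (x1, y1), (x2, y2) in pairs:
        s1.append(ch1[int(y1)-win:int(y1)+win+1, int(x1)-win:int(x1)+win+1].mean())
        s2.append(ch2[int(y2)-win:int(y2)+win+1, int(x2)-win:int(x2)+win+1].mean())
    a, b = np.polyfit(np.array(s2), np.array(s1), 1)   # ch1 ≈ a * ch2 + b at matches
    return np.clip(ch2 * a + b, 0, 255)
```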
4 Incorporating Global Constraint into MRFs

4.1 Notations
In this section we provide definitions for E1 and E2, which are the negative logs of the image prior and likelihood respectively. Since we focus on only one image at a time, we drop the k subscript and reuse it to index histogram bins. We adopt the following notations for convenience:

– xi ∈ {1, −1}, where xi = 1 implies "object" and xi = −1 background.
– Ih = {1, . . . , M} and Ic = {1, . . . , N} are index sets for histogram bins and image pixels.
Fig. 5. Illustration of color rectification. Variance in the foreground colors affects the final segmentation results notably (compare the top image of the third column with the bottom segmentation). We operate in the CIE-LAB color space. After color rectification, the 1-norm of the distribution difference in the A-channel is reduced to 0.2245, compared with the original 0.2552, and the results in the B-channel are even more promising, from 0.6531 to 0.3115. We plot the color distribution curves in the middle column. The color curve of image A remains unchanged as ground truth and is plotted in black, while the color curves of image B before/after rectification are plotted in red and blue respectively. Note that the peaks in the B-channel approach the ground truth perfectly after transformation. The two experiments in the rightmost column share the same parameters.
– S(k) is the set of pixels that lie in histogram bin k.
– F(k) and B(k) denote the numbers of pixels belonging to the foreground/background in bin k. Specifically, F(k) = ½(|S(k)| + Σ_{i∈S(k)} xi) and B(k) = ½(|S(k)| − Σ_{i∈S(k)} xi), where |·| denotes the cardinality of a set.
– Nf and Nb denote the pixel counts labeled as foreground/background across the whole image: Nf = ½(N + Σ_{i∈Ic} xi), Nb = ½(N − Σ_{i∈Ic} xi).
– DIST(h1, h2) is a metric defined on histograms. We adopt a sum-of-squared-difference (SSD) form, namely DIST(h1, h2) = Σ_k (h1(k) − h2(k))².
4.2 Ising Prior for P(X)
We adopt the well-known Ising prior for P(X). Similar to [8], a preference term is added to encourage larger foreground regions, whose strength is controlled by a positive constant α. A second term is defined over neighboring pixels. This energy term can be summarized as:

\[ E_1 = -\alpha \sum_i x_i + \lambda \sum_{i,j} c_{ij}\, x_i x_j \quad (2) \]
where cij = exp(−||zi − zj ||2 /σ 2 ) are coefficients accounting for similarity between pixel pairs.
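Written out directly, the prior energy of Eq. (2) over a 4-connected grid can be evaluated as in the sketch below (our own code). The values α = 0.3 and λ = 50 follow the parameters reported in the experiments; σ is an assumed value.

```python
import numpy as np

def ising_energy(x, z, alpha=0.3, lam=50.0, sigma=10.0):
    """Eq. (2): preference term plus pairwise term over 4-connected neighbours.
    x: labels in {1, -1}, shape (H, W); z: pixel measurements, shape (H, W, C)."""
    e = -alpha * x.sum()
    for dy, dx in ((0, 1), (1, 0)):                        # right and down neighbours
        xi, xj = x[:x.shape[0]-dy, :x.shape[1]-dx], x[dy:, dx:]
        zi, zj = z[:x.shape[0]-dy, :x.shape[1]-dx], z[dy:, dx:]
        cij = np.exp(-np.sum((zi - zj) ** 2, axis=-1) / sigma**2)
        e += lam * np.sum(cij * xi * xj)
    return e
```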
4.3 Global Term for P(Z|X, θ̂)
As argued before, the global constraint should take into account the effects of both the foreground and the background. We adopt a simple linear combination of the two, that is:

\[ E_2 = w_f\, DIST(\hat{h}_f, h_f) - w_b\, DIST(\hat{h}_b, h_b) \quad (3) \]

where ĥ denotes the known histograms of the reference image, while h represents the histograms to be estimated. In practice we build a 2D histogram from the two color channels of LAB-space. It is obvious that this global term favors maximal common parts: similar foregrounds, and backgrounds that differ from each other as much as possible. For the purpose of tractability we assume wf = γ1 Nf² and wb = γ2 Nb²; then E2 can be written as:

\[
\begin{aligned}
E_2 &= w_f\, DIST(h_f, \hat{h}_f) - w_b\, DIST(h_b, \hat{h}_b) \\
    &= \gamma_1 N_f^2 \sum_k \Big(\frac{F(k)}{N_f} - \hat{h}_f(k)\Big)^2 - \gamma_2 N_b^2 \sum_k \Big(\frac{B(k)}{N_b} - \hat{h}_b(k)\Big)^2 \\
    &= \gamma_1 \sum_k F^2(k) - \gamma_2 \sum_k B^2(k) + \gamma_1 N_f^2 \sum_k \hat{h}_f^2(k) - \gamma_2 N_b^2 \sum_k \hat{h}_b^2(k) \\
    &\quad - 2\gamma_1 \sum_k N_f F(k)\hat{h}_f(k) + 2\gamma_2 \sum_k N_b B(k)\hat{h}_b(k) \quad (4)
\end{aligned}
\]
We now show that Equation 4 is actually a quadratic function of X. Denote the first two terms of Equation 4 by T1, the middle two by T2 and the last two by T3, so that E2 = T1 + T2 + T3. Recall that in Equation 2 the parameter α encodes the user's preference for the ratio (foreground size)/(image size) (typically set to 0.3 in our experiments); we can thus deduce that Σ_{i∈Ic} xi = (2α − 1)N. Based on this observation, it is easy to prove that:

– T1 = ½(γ1 − γ2) Σ_{∃k: i,j∈S(k)} xi xj + Σ_{i∈Ic} pi xi + const, where pi is a coefficient independent of X.
– T2 is independent of X.
– T3 = Σ_{i∈Ic} qi xi + const, where qi is a coefficient concerning the i-th pixel.

As a result, we can represent the global term E2 in the following form:

\[ E_2 = \frac{1}{2}(\gamma_1 - \gamma_2) \sum_{\exists k:\, i,j \in S(k)} x_i x_j + \sum_{i \in I_c} (p_i + q_i)\, x_i + const \quad (5) \]
This novel quadratic energy term consists of both unary and binary constraints, and is thus fundamentally different from the conventional terms used in [2], [3] and [4], where only linear constraints are utilized. Moreover, it also differs from the pairwise Ising term defined in Equation 2, since the latter operates on a neighborhood system in the spatial domain while the pairwise term in Equation 5 works in feature space. From a graph point of view, each pair of pixels adjacent in feature space (that is, falling into the same histogram bin) is connected by an edge, even if the pixels are far away from each other in the spatial domain.
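In its histogram form (3)-(4), the global term is cheap to evaluate for a candidate labeling. A plain sketch in our own code (the default γ values are placeholders):

```python
import numpy as np

def global_term(F, B, h_ref_f, h_ref_b, gamma1=1.0, gamma2=1.0):
    """Eq. (3)-(4): E2 = w_f * DIST(h_f, h_ref_f) - w_b * DIST(h_b, h_ref_b)
    with w_f = gamma1 * Nf^2 and w_b = gamma2 * Nb^2.
    F, B: per-bin foreground/background pixel counts for the current labeling;
    h_ref_f, h_ref_b: known normalized histograms of the reference image."""
    Nf, Nb = F.sum(), B.sum()
    dist_f = np.sum((F / max(Nf, 1) - h_ref_f) ** 2)
    dist_b = np.sum((B / max(Nb, 1) - h_ref_b) ** 2)
    return gamma1 * Nf**2 * dist_f - gamma2 * Nb**2 * dist_b
```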
4.4 Computation
Optimizing the energy function defined above is challenging because of the quadratic global constraint. Although optimization methods such as graph cuts [14] or normalized cuts [1] could find its optimum, the required memory is too large for current computer hardware: for image pairs of typical size 800*600, the global term usually gives rise to more than 1G extra edges, which is intolerable. General inference algorithms such as MCMC [15], hierarchical methods or iterative procedures [8] are more suitable for such an optimization task; however, their common drawback is that they are too time-consuming and thus not suitable for real-time applications. To strike a balance between efficiency and accuracy, we let γ1 equal γ2 in Equation 5, reducing the global term to a classical linear form. Experiments prove the effectiveness of this approximation.
4.5 Risk Term
Another important issue is seldom considered in previous work: for an input image pair, small regions with unique colors tend to be categorized as "foreground" (see Figure 6 for a concrete example). This is mainly because they affect E2 much less than the preference term in E1. To mitigate this problem, we propose a novel constraint named the Risk Term, which reflects the risk of assigning a pixel to the foreground according to its color. Let h1, h2 denote the 2D histograms of the image pair. For histogram bin k, the risk value is defined as:

\[ R(k) = \frac{|h_1(k) - h_2(k)|}{|h_1(k) + h_2(k)|} \quad (6) \]
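Computing the risk of Eq. (6) for every bin, and attaching it to each pixel through its bin index, can be sketched as follows (our own code; `bin_index` is assumed to hold each pixel's flattened 2-D histogram bin).

```python
import numpy as np

def risk_map(h1, h2, bin_index, eps=1e-12):
    """Eq. (6): per-bin risk, looked up per pixel via its histogram bin index."""
    R = np.abs(h1 - h2) / (np.abs(h1 + h2) + eps)   # risk value for every 2-D bin
    return R.ravel()[bin_index.ravel()].reshape(bin_index.shape)
```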
Fig. 6. Illustration of the Risk Term. For the right image in (a), several small regions are labeled as foreground objects (see the left image in (b)); after introducing the risk term they are removed. We also draw the coefficients pi + qi of Equation 5 (normalized to [0, 255]) in (c) and (d). Lower brightness implies a stronger tendency to be foreground. The benefit of the risk term is obvious.
Fig. 7. Comparison with Rother’s method. Parameters are identical in both experiments: α = 0.3, λ = 50. Note that α corresponds to user’s prior knowledge about the percentage of foreground in the whole image. It is shown that the way to choose α in our method is more consistent with user’s intuition.
5 Experiments and Comparison
We apply the proposed method to a variety of image pairs taken from public image sets or captured by ourselves. Experiments show that our method is superior to previous ones in accuracy, computing time, and ease of use. Lacking color rectification, previous methods such as [8] cannot handle input images captured under very different illumination conditions or with cluttered backgrounds (Figures 1, 5 and 6). Experiments also show that the way parameters are chosen in our method is more consistent with the user's intuition (Figure 7). For typical 640×480
Fig. 8. A failure example due to confusion of foreground/background colors
image pairs, the algorithm usually converges in fewer than 4 cycles, and each iteration takes about 0.94 seconds on a Pentium-4 2.8G/512M RAM computer.
6 Conclusions and Future Work
We have presented a novel co-segmentation method. Various experiments demonstrated its superiority over state-of-the-art work. Our results (Figure 8) also revealed a limitation of the algorithm, which stems from using only color information; our future work will focus on how to effectively utilize additional types of information such as shape, texture, and high-level semantics.
References 1. Yu, S.X., Shi, J.: Multiclass spectral clustering. In: ICCV, pp. 313–319 (2003) 2. Boykov, Y., Jolly, M.P.: Interactive graph cuts for optimal boundary and region segmentation of objects in n-d images. In: ICCV, pp. 105–112 (2001) 3. Rother, C., Kolmogorov, V., Blake, A.: ”grabcut”: interactive foreground extraction using iterated graph cuts. ACM Trans. Graph. 23(3), 309–314 (2004) 4. Li, Y., Sun, J., Tang, C.K., Shum, H.Y.: Lazy snapping. ACM Trans. Graph. 23(3), 303–308 (2004) 5. Wang, J., Cohen, M.F.: An iterative optimization approach for unified image segmentation and matting. In: ICCV, pp. 936–943 (2005) 6. Kolmogorov, V., Criminisi, A., Blake, A., Cross, G., Rother, C.: Bi-layer segmentation of binocular stereo video. CVPR (2), 407–414 (2005) 7. Sun, J., Kang, S.-B., Xu, Z., Tang, X., Shum, H.Y.: Flash cut: Foreground extraction with flash/no-falsh image pairs. In: CVPR (2007) 8. Rother, C., Minka, T.P., Blake, A., Kolmogorov, V.: Cosegmentation of image pairs by histogram matching - incorporating a global constraint into mrfs. CVPR (1), 993–1000 (2006) 9. Narasimhan, M., Bilmes, J.: A submodular-supermodular procedure with applications to discriminative structure learning. In: UAI, pp. 404–441. AUAI Press (2005) 10. van de Weijer, J., Gevers, T.: Boosting saliency in color image features. CVPR (1), 365–372 (2005) 11. Weiss, Y.: Deriving intrinsic images from image sequences. In: ICCV, pp. 68–75 (2001) 12. Lowe, D.G.: Object recognition from local scale-invariant features. In: ICCV, pp. 1150–1157 (1999) 13. Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning. Springer, Heidelberg (2001) 14. Kolmogorov, V., Zabih, R.: What energy functions can be minimized via graph cuts? In: ECCV (3), pp. 65–81 (2002) 15. Barbu, A., Zhu, S.C.: Generalizing swendsen-wang to sampling arbitrary posterior probabilities. IEEE Trans. Pattern Anal. Mach. Intell. 27(8), 1239–1253 (2005)
Shape Reconstruction from Cast Shadows Using Coplanarities and Metric Constraints

Hiroshi Kawasaki¹ and Ryo Furukawa²

¹ Faculty of Engineering, Saitama University, 255, Shimo-okubo, Sakura-ku, Saitama, Japan [email protected]
² Faculty of Information Sciences, Hiroshima City University, 3-4-1, Ozuka-higashi, Asaminami-ku, Hiroshima, Japan [email protected]
Abstract. To date, various techniques of shape reconstruction using cast shadows have been proposed. These techniques have the advantage that they can be applied to various scenes, including outdoor scenes, without using special devices. Previously proposed techniques usually require calibration of camera parameters and light source positions, and such calibration processes limit their range of application. If a shape can be reconstructed even when these values are unknown, the technique can be applied to a much wider range of scenarios. In this paper, we propose a method that realizes such a technique by constructing simultaneous equations from coplanarities and metric constraints, which are observed from cast shadows of straight edges and visible planes in the scene, and solving them. We conducted experiments using simulated and real images to verify the technique.
1 Introduction

To date, various techniques of scene shape reconstruction using shadows have been proposed. One of the advantages of using shadows is that the information for 3D reconstruction can be acquired without using special devices, since shadows exist wherever light is present. For example, these techniques are applicable to outdoor poles on a sunny day or to indoor objects under a room light. Another advantage of shape reconstruction using shadows is that only a single camera is required. So far, most previously proposed methods have assumed known light source positions because, if they are unknown, there are ambiguities in the solution and Euclidean reconstruction cannot be achieved [1]. If a shape can be reconstructed with unknown light source positions, the technique can be applied more widely. For example, a scene captured by a remote web camera under unknown lighting environments could be reconstructed. Since the intrinsic parameters of a remote camera are usually unknown, the application becomes even more useful if the focal length of the camera can be estimated at the same time. In this paper, we propose a method to achieve this. Our technique is actually more general: both the object that casts shadows and the light source can be moved freely while scanning, because their positions are required to be neither known nor static. This is a great advantage for actual scanning processes, since the unmeasured area caused by self-shadows can be drastically reduced by moving the light source.
To realize the technique, we propose a novel formulation of simultaneous linear equations from the planes created by shadows of straight edges (shadow planes) and the real planes in the scene, which extends previous studies on shape from planes [2,3] and on the interpretation of line drawings of polyhedrons [4]. Since shadow planes and real planes are treated equally in our formulation, various geometrical constraints among the planes can be utilized efficiently for Euclidean upgrade and camera calibration. In this paper, we assume two typical situations for reconstructing the scene. The first, which we call "shadows of the static object," assumes a fixed camera position, a static scene, and a static object with a straight edge that casts a moving shadow as the light source (e.g., the sun or a point light) moves. The second, which we call "active scan by cast shadow," assumes a fixed camera and arbitrary motion of both a light source and an object with a straight edge, generating shadows that are used to conduct an active scan.
2 Related Work

3D reconstruction using information from shadows has a long history. Shafer et al. presented the mathematical formulation of shadow geometries and derived constraints on surface orientation from shadow boundaries [5]. Hambrick et al. proposed a method for classifying boundaries of shadow regions [6]. Several methods for recovering 3D shapes up to Euclidean reconstruction based on geometrical constraints of cast shadows have been proposed [7,8,9,10]. All of these methods assume that the objects that cast shadows are static and that the light directions or positions are known. On the other hand, Bouguet et al. proposed a method that allows users to move a straight-edged object freely so that the shadow generated by a fixed light source sweeps the object [11,12]. However, the technique requires calibration of camera parameters, the light source position, and a reference plane. If a Euclidean shape could be reconstructed with unknown light source positions, it would broaden the application of "shape from cast shadow" techniques. However, it has been proved that scene reconstructions based on binary shadow regions have ambiguities of four degrees of freedom (DOFs) if the light positions are unknown [1]. In the case of a perspective camera, these ambiguities correspond to the family of transformations called generalized projective bas-relief (GPBR) transformations. To deal with unknown light source positions, Caspi et al. proposed a method using two straight, parallel, fixed objects to cast shadows and a reference plane (e.g., the ground) [13]. To resolve the ambiguities caused by unknown light sources, they used the parallelism of shadows of straight edges by detecting vanishing points. Compared to their work, our method is more general: the camera can be partially calibrated, the straight object and the light source can be moved, the light source can be a parallel or point light source, and a wider variety of constraints than the parallelism of shadows can be used to resolve the ambiguities.
3 Shape Reconstruction from Cast Shadow

If a set of points exists on the same plane, the points are coplanar, as shown in Figure 1(a). All the points on a plane are coplanar even if the plane does not have textures or feature
Fig. 1. Coplanarities in a scene: (a) Explicit coplanarities. Regions of each color except white are sets of coplanar points. Note that points on a region of a curved surface are not coplanar. (b) Implicit coplanarities. Segmented lines of each color are sets of coplanar points. (c) Examples of metric constraints: π0⊥π1 and π0⊥π2 if λ⊥π0; π3⊥π4, π4⊥π5, π3⊥π5, and π3∥π0 if box B is rectangular and rests on π0. (d) Intersections between explicit coplanar curves and implicit coplanar curves in a scene. Lines of each color correspond to a plane in the scene.
points. A scene composed of planar structures has many coplanarities. In this paper, a coplanarity that is actually observed as a real plane in the scene is called an explicit coplanarity. In contrast, in 3D space there exist an infinite number of coplanarities that are not explicitly observed in ordinary situations, but that could be observed under specific conditions. For example, the boundary of a cast shadow of a straight edge is a set of coplanar points, as shown in Figure 1(b). This kind of coplanarity is not visible until the shadow is cast on the scene. In this paper, we call these coplanarities implicit coplanarities. Implicit coplanarities can be observed in various situations, for example when buildings with straight edges stand under the sun and cast shadows onto the scene. Although explicit coplanarities are observed only for limited parts of the scene, implicit coplanarities can be observed on arbitrarily shaped surfaces, including free curves. In this study, we create linear equations from the implicit coplanarities of the shadows and the explicit coplanarities of the planes. By solving the resulting simultaneous equations, a scene can be reconstructed, except for the four (or more) DOFs inherent in the simultaneous equations and the DOFs corresponding to unknown camera parameters. For a Euclidean reconstruction from the solution, the remaining DOFs should be resolved (called metric reconstruction in this paper). To achieve this, constraints other than coplanarities should be used. For many scenes, especially those that include artificial objects, we can find geometrical constraints among explicit and implicit planes. Examples of such information are explained here. (1) In Figure 1(c), the ground is plane π0, and the linear object λ stands vertically on the ground. If the planes corresponding to the shadows of λ are π1 and π2, then π0⊥π1 and π0⊥π2 can be derived from λ⊥π0. (2) In the same figure, the sides of box B are π3, π4, and π5. If box B is rectangular, π3, π4, and π5 are mutually orthogonal. If box B is on the ground, π3 is parallel to π0. From constraints such as those in the above examples, we can determine the variables for the remaining DOFs and achieve metric reconstruction. With enough constraints, the camera parameters can be estimated at the same time. We call these constraints the metric constraints.
Based on this, the actual flow of the algorithm is as follows.

Step 1: Extraction of coplanarities. From a series of images of a scene with shadows captured by a fixed camera, shadow boundaries are extracted as implicit-coplanar curves. If the scene has planar areas, explicit-coplanar points are sampled from those areas. For the efficient processing of steps 2 and 3 below, only selected frames are processed.

Step 2: Cast shadow reconstruction by shape from coplanarities. From a dataset of coplanarities, constraints are acquired as linear equations. By numerically solving the simultaneous equations, a space of solutions with four (or more) DOFs can be acquired.

Step 3: Metric reconstruction by metric constraints. To achieve metric reconstruction, the solution of step 2 must be upgraded. The solution can be upgraded by solving the metric constraints.

Step 4: Dense shape reconstruction. The processes in steps 2 and 3 are performed on selected frames. To realize dense shape reconstruction of a scene, implicit-coplanar curves from all the images are used to reconstruct 3D shapes using the results of the preceding processes.
4 Algorithm Details for Each Step

4.1 Data Acquisition

To detect coplanarities in a scene, the boundaries of cast shadows are required. Automatic extraction of a shadow area from a scene is not easy. However, since shadow extraction has been studied for a long time [14,15], many techniques have already been proposed, and we adopt a spatio-temporal method as follows (a rough code sketch of the first two steps follows at the end of this subsection):
1. Images are captured by a fixed camera at fixed intervals, and a spatio-temporal image is created by stacking the images after background subtraction.
2. The spatio-temporal image is divided using 3D segmentation. The 3D segmentation is achieved by applying a region-growing method to the spatio-temporal space. To deal with noise in real images, we merge small regions into the surrounding regions and split a large region connected by a small region into two.
3. From the segmented regions, shadow regions are selected interactively by hand. Also, if wrong regions are produced by the automatic process, those regions are modified manually in this step.
4. The segmented regions are again divided into frames, and coplanar shadow curves are extracted from each frame as the boundaries of the divided regions.
By drawing all the detected boundaries on a single image, we can acquire many intersections. Since each intersection is shared by at least two planes, we can construct simultaneous equations. The numerical solution of these equations is explained in the following section.
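As a rough stand-in for steps 1 and 2 above (not the authors' implementation: it substitutes 3D connected-component labeling for the region-growing step and omits the merge/split logic and the interactive selection), the sketch below shows how a spatio-temporal candidate-shadow volume could be segmented.

```python
import numpy as np
from scipy import ndimage

def spatiotemporal_shadow_regions(frames, background, thresh=25, min_voxels=500):
    """Segment candidate shadow regions in a (T, H, W) image stack.

    frames     : grayscale frames from the fixed camera, stacked over time.
    background : (H, W) shadow-free background image.
    Returns an integer label volume; tiny regions are discarded as noise,
    and shadow regions would still be chosen interactively afterwards.
    """
    # Background subtraction: shadow pixels are darker than the background.
    diff = background[None, :, :].astype(np.int32) - frames.astype(np.int32)
    candidate = diff > thresh

    # 3D connected components as a simple substitute for region growing.
    labels, n = ndimage.label(candidate)
    sizes = ndimage.sum(candidate, labels, index=np.arange(1, n + 1))
    too_small = np.isin(labels, np.flatnonzero(sizes < min_voxels) + 1)
    labels[too_small] = 0
    return labels
```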
4.2 Projective Reconstruction

Suppose a set of N planes, including both implicit and explicit planes. Let the j-th plane of the set be πj. We express the plane πj by the form

$$a_j x + b_j y + c_j z + 1 = 0 \qquad (1)$$

in the camera coordinate system. Suppose a set of points such that each point of the set lies on the intersection of multiple planes. Let the i-th point of the set be represented as ξi and lie on the intersection of πj and πk. Let the coordinates (ui, vi) be the location of the projection of ξi onto the image plane. We represent the camera intrinsic parameter by α = p/f, where f is the focal length and p is the pixel size. We define $a^*_j = \alpha a_j$ and $b^*_j = \alpha b_j$. The direction vector of the line of sight from the camera to the point ξi is (αui, αvi, −1). Thus,

$$a_j(-\alpha u_i z_i) + b_j(-\alpha v_i z_i) + c_j z_i + 1 = 0, \qquad (2)$$

where zi is the z-coordinate of ξi. Dividing this form by zi and using the substitutions $t_i = 1/z_i$, $a^*_j = \alpha a_j$, and $b^*_j = \alpha b_j$, we get

$$-u_i a^*_j - v_i b^*_j + c_j + t_i = 0. \qquad (3)$$

Since ξi is also on πk,

$$-u_i a^*_k - v_i b^*_k + c_k + t_i = 0. \qquad (4)$$

From equations (3) and (4), the following simultaneous equations in the variables $a^*_j$, $b^*_j$ and $c_j$ can be obtained:

$$(a^*_j - a^*_k)u_i + (b^*_j - b^*_k)v_i + (c_j - c_k) = 0. \qquad (5)$$
We define L as the coefficient matrix of the above simultaneous equations, and x as the solution vector. Then, the equations can be described in matrix form as

$$L\mathbf{x} = \mathbf{0}. \qquad (6)$$
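As a minimal sketch (an assumed implementation, not the authors' code) of how equations (5)-(6) can be assembled and solved numerically, the snippet below builds L from intersection observations and also extracts the trivial and non-trivial basis solutions that are derived in the remainder of this subsection. Plane parameters are assumed to be ordered as x = (a*_1, b*_1, c_1, a*_2, b*_2, c_2, ...).

```python
import numpy as np

def build_coplanarity_matrix(intersections, n_planes):
    """Assemble L for Lx = 0 (Eq. 5-6).

    intersections: list of (u, v, j, k) tuples -- image coordinates of a point
    lying on the intersection of planes j and k (0-based plane indices).
    """
    L = np.zeros((len(intersections), 3 * n_planes))
    for row, (u, v, j, k) in enumerate(intersections):
        L[row, 3 * j:3 * j + 3] = (u, v, 1.0)
        L[row, 3 * k:3 * k + 3] = (-u, -v, -1.0)
    return L

def projective_solution_basis(L, n_planes):
    """Return (x1, x2, x3, xs): the trivial solutions and the eigenvector of
    L^T L with the 4th-smallest eigenvalue (the non-trivial solution)."""
    x1 = np.tile([1.0, 0.0, 0.0], n_planes)
    x2 = np.tile([0.0, 1.0, 0.0], n_planes)
    x3 = np.tile([0.0, 0.0, 1.0], n_planes)
    w, V = np.linalg.eigh(L.T @ L)   # eigenvalues in ascending order
    xs = V[:, 3]                     # skip the three (near-)zero trivial ones
    return x1, x2, x3, xs
```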
The simultaneous equations of form (5) have trivial solutions that satisfy

$$a^*_j = a^*_k,\; b^*_j = b^*_k,\; c_j = c_k \quad \text{for all } j, k. \qquad (7)$$
Let x1 be the solution with $a^*_i = 1, b^*_i = 0, c_i = 0$ (i = 1, 2, . . .), x2 be the solution with $a^*_i = 0, b^*_i = 1, c_i = 0$, and x3 be the solution with $a^*_i = 0, b^*_i = 0, c_i = 1$. Then, the above trivial solutions form a linear space spanned by the bases x1, x2, x3, which we denote by T. We now describe a numerical solution of the simultaneous equations, assuming that the observed coordinates (ui, vi) on the image plane include errors. Since equation (6) is over-constrained, it generally cannot be fulfilled completely. Therefore, we consider the n-dimensional linear space Sn spanned by the n eigenvectors of $L^\top L$ associated with the n smallest eigenvalues. Then, Sn is the solution space of x
such that $\max_{\mathbf{x} \in S_n} |L\mathbf{x}|/|\mathbf{x}|$ is minimal with respect to all possible n-dimensional linear spaces. Even if the coordinates ui, vi are perturbed by additive errors, x1, x2, x3 remain trivial solutions that completely satisfy equations (5) within the precision of floating-point calculations. Thus, normally, the 3D space S3 is equivalent to the space of trivial solutions T. For the non-trivial solution, we can define $\mathbf{x}_s = \operatorname{argmin}_{\mathbf{x} \in T^\perp} (|L\mathbf{x}|/|\mathbf{x}|)^2$, where $T^\perp$ is the orthogonal complement of T. $\mathbf{x}_s$ is the solution that minimizes $|L\mathbf{x}|/|\mathbf{x}|$ and is orthogonal to x1, x2 and x3. Since T and S3 are normally equal, $\mathbf{x}_s$ can be calculated as the eigenvector of $L^\top L$ associated with the 4th-smallest eigenvalue. Thus, the general form of the non-trivial solutions is

$$\mathbf{x} = f_1\mathbf{x}_1 + f_2\mathbf{x}_2 + f_3\mathbf{x}_3 + f_4\mathbf{x}_s = M\mathbf{f}, \qquad (8)$$

where f1, f2, f3, f4 are free variables, $\mathbf{f} = (f_1\; f_2\; f_3\; f_4)^\top$, and $M = (\mathbf{x}_1\; \mathbf{x}_2\; \mathbf{x}_3\; \mathbf{x}_s)$. The four DOFs of the general solution basically correspond to the DOFs of the generalized projective bas-relief (GPBR) transformations described in the work of Kriegman et al. [1]. As far as we know, there are no previous studies that reconstruct 3D scenes using linear equations derived from the 3-DOF implicit and explicit planes. The advantages of this formulation are that the solution can be obtained stably and that a wide range of geometrical constraints can be used as metric constraints.

4.3 Metric Reconstruction

The solution obtained in the previous section has four DOFs from f. In addition, if the camera parameters are unknown, additional DOFs must be resolved to achieve metric reconstruction. Since these DOFs cannot be determined using coplanarities, they should be resolved using metric constraints derived from the geometrical constraints in the scene. For example, suppose that orthogonality between the planes πs and πt is assumed. We denote the unit normal vector of plane πs by the vector function $\mathbf{n}_s(\mathbf{f}, \alpha) = N((a_s(\mathbf{f}, \alpha)\; b_s(\mathbf{f}, \alpha)\; c_s(\mathbf{f}, \alpha))^\top)$, whose parameters are f and the camera parameter α, and where N(·) denotes normalization. Then, the orthogonality between πs and πt can be expressed as

$$\mathbf{n}_s(\mathbf{f}, \alpha)^\top \mathbf{n}_t(\mathbf{f}, \alpha) = 0. \qquad (9)$$
Other types of geometrical constraints, such as parallelism, can easily be formulated in a similar way. To solve the equations described above, non-linear optimization with respect to f and α can be used. We implemented the numerical solver using the Levenberg-Marquardt method. The determination of the initial value of f may be a problem. In the experiments described in this study, we construct a solution vector xI from the given plane parameters and use $\mathbf{f}_I = M^\top \mathbf{x}_I$ as the initial value of f. Here, fI is the projection of xI, which lies in the 3N-dimensional space of plane parameters, onto the solution space of the projective reconstruction (8), such that the metric distance between MfI and xI is minimal. Using this process, from an arbitrary set of plane parameters we can obtain a set of plane parameters that fulfills the coplanarity conditions.
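A compact sketch of this metric upgrade is given below. It is hypothetical code under stated assumptions: only orthogonality constraints are used, M and the constraint list are given, and there are at least as many constraints as unknowns (a requirement of MINPACK's Levenberg-Marquardt as exposed by SciPy).

```python
import numpy as np
from scipy.optimize import least_squares

def plane_normals(f, alpha, M):
    """Unit plane normals (a, b, c) per plane from x = M f, undoing a* = alpha*a."""
    abc = (M @ f).reshape(-1, 3).copy()
    abc[:, :2] /= alpha                       # a = a*/alpha, b = b*/alpha
    return abc / np.linalg.norm(abc, axis=1, keepdims=True)

def metric_upgrade(M, ortho_pairs, f0, alpha0):
    """Solve for (f, alpha) so that the listed plane pairs satisfy Eq. (9).

    ortho_pairs : list of (s, t) indices of planes assumed perpendicular.
    f0, alpha0  : initial guesses (e.g. f0 = M^T x_I as in the text).
    """
    def residuals(p):
        n = plane_normals(p[:4], p[4], M)
        return [n[s] @ n[t] for s, t in ortho_pairs]

    sol = least_squares(residuals, np.r_[f0, alpha0], method="lm")
    return sol.x[:4], sol.x[4]
```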
Fig. 2. Reconstruction of simulation data:(a)(b) input images with shadows and (c)(d) reconstruction results. In the results, the shaded surfaces are ground truth and the red points are reconstructed points.
4.4 Dense Reconstruction

To obtain a dense 3D shape, we also conduct a dense 3D reconstruction using all the captured frames. The actual process is as follows (a code sketch of steps 2 and 3 follows the list).
1. Detect the intersections between an implicit-coplanar curve in an arbitrary frame and the curves of the already estimated planes.
2. Estimate the parameters of the plane of the implicit-coplanar curve by fitting it to the known 3D points retrieved from the intersections, using principal component analysis (PCA).
3. Recover the 3D positions of all the points on the implicit-coplanar curve by using the estimated plane parameters and triangulation.
4. Iterate steps 1 to 3 for all frames.
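The sketch below (assumed helper functions, not the paper's code) illustrates steps 2 and 3: a PCA plane fit to the known 3D points, followed by intersecting each pixel's viewing ray, whose direction is (αu, αv, −1) as in Sect. 4.2, with that plane.

```python
import numpy as np

def fit_plane_pca(points):
    """Fit a plane a*x + b*y + c*z + 1 = 0 to (N, 3) points by PCA."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]                       # direction of least variance
    d = -normal @ centroid                # plane: normal . p + d = 0
    if abs(d) < 1e-12:
        raise ValueError("plane passes through the origin; cannot use +1 form")
    return normal / d                     # (a, b, c)

def triangulate_on_plane(abc, u, v, alpha):
    """3D point where the viewing ray of pixel (u, v) meets plane (a, b, c)."""
    a, b, c = abc
    # substitute p = t * (alpha*u, alpha*v, -1) into a*x + b*y + c*z + 1 = 0
    t = -1.0 / (a * alpha * u + b * alpha * v - c)
    return t * np.array([alpha * u, alpha * v, -1.0])
```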
5 Experiments

5.1 Simulation Data (Shadow of Static Object)

Figures 2(a) and (b) show data synthesized by CG, including a square floor, a rabbit, and a perpendicular wall. 160 images were generated while moving the light source, so that the edge of the shadow scanned the rabbit. Simultaneous equations were created from the intersection points between the implicit-coplanar shadow curves and lines drawn on an explicit plane (the floor). The initial value for the nonlinear optimization was given to indicate whether the light source was located on the right or the left. Using the coplanarity information, the reconstruction could be done only up to scale, so three DOFs remained. Since we also estimated the focal length, we needed four metric constraints. To obtain a Euclidean solution, we used two metric constraints from the orthogonalities between the shadow planes and the floor, and two more from the orthogonalities of the two corners of the floor. Figures 2(c) and (d) show the result (red points) and the ground truth (shaded surface). We can observe that the reconstruction result almost coincides with the correct shape. The RMS error (root of
Fig. 3. Reconstruction of simulation data (active scanning): (a) an input image, (b) explicit and implicit coplanarities, and (c)(d) reconstruction results. In the results, the shaded surfaces are ground truth and the red points are reconstructed points.
mean squared error) of the z-coordinates of all the reconstructed points was 2.6 × 10⁻³, where the average distance from the camera to the bunny was scaled to 1.0. Thus, the high accuracy of the technique was confirmed.

5.2 Simulation Data (Active Scan by Cast Shadow)

Next, we attempted to reconstruct 3D shapes by sweeping cast shadows over the objects, moving both a light source and a straight object. We synthesized a sequence of images of the model of a bunny that includes 20 implicit coplanarities and three visible planes (i.e., explicit planes). There are three metric constraints of orthogonality and parallelism between the visible planes. Figure 3(a) shows an example of the synthesized images, and Figure 3(b) shows all the implicit-coplanar curves as the borders of the grid patterns. Figures 3(c) and (d) show the result. The RMS error of the z-coordinates of all the reconstructed points (normalized by the average of the z-coordinates as in the previous section) was 4.6 × 10⁻³. We can confirm the high accuracy of the result.

5.3 Real Outdoor Scene (Shadow of Static Object)

We conducted a shape reconstruction from images acquired by a fixed, uncalibrated outdoor camera. Images from the camera were captured periodically, and the shape and the focal length of the camera were reconstructed by the proposed technique from the shadows in the scene. Since the scene also contained many shadows generated by non-straight edges, the automatic extraction of complete shadows was difficult. In this experiment, these noises were eliminated by human interaction, which took about 10 minutes of actual working time. Figure 4(a) shows the input frame, (b) shows the detected coplanar shadow curves, (c) shows all the coplanar curves and their intersections, and (d) to (f) show the reconstruction result. The proposed technique could correctly reconstruct the scene using images from a fixed remote camera.

5.4 Real Indoor Scene (Active Scan by Cast Shadow)

We conducted an indoor experiment on an actual scene using a point light source. A video camera was directed toward a target object and multiple boxes, and the scene
Fig. 4. Reconstruction of outdoor scene: (a) input image, (b) an example frame of the 3D segmentation result, (c) implicit (green) and explicit (red) coplanar curves, (d) reconstructed result of coplanar curves(red) and dense 3D points(shaded), and (e)(f) the textured reconstructed scene
was captured while the light source and the bar for shadowing were being moved freely. From the captured image sequence, several images were selected and the shadow curves of the bar were detected from the images. By using the detected coplanar shadow curves, we performed the 3D reconstruction up to 4 DOFs. For the metric
Fig. 5. Reconstruction of an indoor real scene: (a)(b) the captured frames, (c)(d) the reconstructed coplanar shadow curves (red) with dense reconstructed model(shaded), and (e)(f) the textured reconstructed model
Fig. 6. Reconstruction and evaluation of an indoor real scene: (a)(b) the captured frames and (c)(d) the reconstructed model displayed with the ground truth data (shaded model: reconstructed model, red points: ground truth)
reconstruction, orthogonalities between the faces of the boxes were used. Figure 5 shows the captured frames and the reconstruction result. In this case, since only a small amount of noise was extracted in the indoor environment, shadow detection was stable and no human interaction was required. These results show that the dense shape is correctly reconstructed. We also reconstructed a scene of a box (size: 0.4m × 0.3m × 0.3m) and a cylinder (height: 0.2m, diameter: 0.2m) to evaluate the accuracy of the proposed method. The reconstruction was conducted in the same way as in the previous experiment, except that we also measured the 3D scene with an active measurement method using coded structured light [16] as the ground truth. The reconstruction result was scaled to match the ground truth using the average distance to the points. Figures 6(a) and (b) show the captured frames, and (c) and (d) show both the scaled reconstruction (polygon mesh) and the ground truth (red points). Although there were small differences between the reconstruction and the ground truth, the shape was correctly recovered. The RMS error of the reconstruction from the ground truth, normalized by the average distance, was 1.80 × 10⁻².
6 Conclusion

This paper proposed a technique capable of reconstructing shape when only multiple shadows of straight linear objects or straight edges are available from a scene, even when the light source position is unknown and the camera is not calibrated. The technique is achieved by extending the conventional method, which reconstructs polyhedra from coplanar planes and their intersections, to general curved surfaces. Since reconstruction from coplanarities can be solved only up to four DOFs, we proposed a technique for upgrading it to the metric solution by adding metric constraints. For the stable extraction of shadow areas from a scene, we developed a spatio-temporal image processing technique. By implementing the technique and conducting experiments using simulated and real images, accurate and dense shape reconstruction was verified.
References 1. Kriegman, D.J., Belhumeur, P.N.: What shadows reveal about object structure. Journal of the Optical Society of America 18(8), 1804–1813 (2001) 2. Bartoli, A., Sturm, P.: Constrained structure and motion from multiple uncalibrated views of a piecewise planar scene. International Journal of Computer Vision 52(1), 45–64 (2003) 3. Kawasaki, H., Furukawa, R.: Dense 3d reconstruction method using coplanarities and metric constraints for line laser scanning. In: 3DIM 2007. Proceedings of the 5th international conference on 3-D digital imaging and modeling (2007) 4. Sugihara, K.: Machine interpretation of line drawings. MIT Press, Cambridge, MA, USA (1986) 5. Shafer, S.A., Kanade, T.: Using shadows in finding surface orientations. Computer Vision, Graphics, and Image Processing 22(1), 145–176 (1983) 6. Hambrick, L.N., Loew, M.H., Carroll, J.R.L.: The entry exit method of shadow boundary segmentation. PAMI 9(5), 597–607 (1987) 7. Hatzitheodorou, M., Kender, J.: An optimal algorithm for the derivation of shape from shadows. CVPR, 486–491 (1988) 8. Raviv, D., Pao, Y., Loparo, K.A.: Reconstruction of three-dimensional surfaces from twodimensional binary images. IEEE Trans. on Robotics and Automation 5(5), 701–710 (1989) 9. Daum, M., Dudek, G.: On 3-d surface reconstruction using shape from shadows. CVPR, 461–468 (1998) 10. Savarese, S., Andreetto, M., Rushmeier, H., Bernardini, F., Perona, P.: 3d reconstruction by shadow carving: Theory and practical evaluation. IJCV 71(3), 305–336 (2007) 11. Bouguet, J.Y., Perona, P.: 3D photography on your desk. In: ICCV, pp. 129–149 (1998) 12. Bouguet, J.Y., Weber, M., Perona, P.: What do planar shadows tell about scene geometry? CVPR 01, 514–520 (1999) 13. Caspi, Y., Werman, M.: Vertical parallax from moving shadows. In: CVPR, pp. 2309–2315. IEEE Computer Society, Washington, DC, USA (2006) 14. Jiang, C., Ward, M.O.: Shadow segmentation and classification in a constrained environment. CVGIP: Image Underst. 59(2), 213–225 (1994) 15. Salvador, E., Cavallaro, A., Ebrahimi, T.: Cast shadow segmentation using invariant color features. Comput. Vis. Image Underst. 95(2), 238–259 (2004) 16. Sato, K., Inokuchi, S.: Range-imaging system utilizing nematic liquid crystal mask. In: Proc. of FirstICCV, pp. 657–661 (1987)
Evolving Measurement Regions for Depth from Defocus

Scott McCloskey, Michael Langer, and Kaleem Siddiqi

Centre for Intelligent Machines, McGill University
{scott,langer,siddiqi}@cim.mcgill.ca
Abstract. Depth from defocus (DFD) is a 3D recovery method based on estimating the amount of defocus induced by finite lens apertures. Given two images with different camera settings, the problem is to measure the resulting differences in defocus across the image, and to estimate a depth based on these blur differences. Most methods assume that the scene depth map is locally smooth, and this leads to inaccurate depth estimates near discontinuities. In this paper, we propose a novel DFD method that avoids smoothing over discontinuities by iteratively modifying an elliptical image region over which defocus is estimated. Our method can be used to complement any depth from defocus method based on spatial domain measurements. In particular, this method improves the DFD accuracy near discontinuities in depth or surface orientation.
1 Introduction
The recovery of the 3D structure of a scene from 2D images has long been a fundamental goal of computer vision. A plethora of methods, based on many different depth cues, have been presented in the literature. Depth from defocus methods belong to the class of depth estimation schemes that use optical blur as a cue to recover the 3D scene structure. Given a small number of images taken with different camera settings, depth can be found by measuring the resulting change in blur. In light of this well-known relationship, we use the terms 'depth' and 'change in blur' interchangeably in this paper. Ideally, in order to recover the 3D structure of complicated scenes, the depth at each pixel location would be computed independently of neighboring pixels. This can be achieved through measurements of focus/defocus, though such approaches require a large number of images [9] or video with active illumination [17]. Given only a small number of observations (typically two), however, the change in blur must be measured over some region in the images. The shape of the region over which these measurements are made has, to date, been ignored in the literature. Measurements for a given pixel have typically been made over square regions centered on its location, leading to artificially smoothed depth estimates near discontinuities. As a motivating example, consider the image in Fig. 1 of a scene with two fronto-parallel planes at different distances, separated by a depth discontinuity. Now consider an attempt to recover the depth of the point near the discontinuity
Fig. 1. Two fronto-parallel surfaces separated by a step edge in depth. When estimating depth at a point near the step edge, inaccuracies arise when measurements are made over a square window.
(such as the point marked with the X) by estimating blur over a square region centered there (as outlined in the figure). We would like to recover the depth of the point, which is on one plane, but this region consists of points that lie on both surfaces. Inevitably, when trying to recover a single depth value from a region that contains points at two different depths, we will get an estimate that is between the distances of the two planes. Generally, the recovered depth is an average of the two depths weighted by the number of points in the measurement region taken from each plane. As a result, the depth map that is recovered by measuring blur over square windows will show a smooth change in depth across the edge where there should be a discontinuity (see the red curve in Fig. 3). The extent of the unwanted smoothing depends on the size of the measurement region. In order to produce accurate estimates, DFD methods have traditionally used relatively large measurement regions. The method of [5] uses 64-by-64 pixel regions, and the method of [10] uses 51-by-51 regions. For the scene in Fig. 1, this results in a gradual transition of depth estimates over a band of more than 50 pixels where there should be a step change. This example illustrates a failure of the standard equifocal assumption, which requires that all points within the measurement region be at the same depth in the scene. In order for the equifocal assumption to hold over square regions, the scene must have constant depth over that region. In the case of a depth discontinuity, as illustrated above, the violation of the equifocal assumption results in smoothed depth estimates. In this paper, we propose a method for measuring blur over elliptical regions centered at each point, rather than over square regions as in standard DFD [5,10,14,15]. Each elliptical region is evolved to minimize the depth variation within it (Sec. 3.1). Depths estimated over these regions (Sec. 3.2) are more accurate, particularly near discontinuities in depth and surface orientation.
Given the semantic importance of discontinuities, this improvement is important for subsequent 3D scene recovery and modeling. We demonstrate the improved accuracy of our method in several experiments with real images (Sec. 4).
2 Previous Work in Depth from Defocus
Broadly speaking, DFD methods fall into one of two categories: deconvolution methods based on a linear systems model, and energy minimization methods.

Deconvolution Methods. Pentland [12] proposed the first DFD method based on frequency domain measurements. Given one image taken through a pinhole aperture, the blur in a second (finite aperture) image is found by comparing the images' power spectra. Subbarao [15] generalized this result, allowing for changes in other camera settings and removing the requirement that one image be taken through a pinhole aperture. Gökstorp [8] uses a set of Gabor filters to generate local frequency representations of the images. Other methods based on the frequency domain model, such as that of Nayar and Watanabe [11], develop filter banks which are used to measure blur (see also [16]). Ens and Lawrence point out a number of problems with frequency domain measurements and propose a matrix-based spatial domain method that avoids them [5]. Subbarao and Surya [14] model image intensities as a polynomial and introduce the S-transform to measure the blur in the spatial domain. More recently, McCloskey et al. use a reverse projection model and measure blur using correlation in the spatial domain [10]. Whether they employ spatial or frequency domain schemes, square measurement regions are the de facto standard for DFD methods based on deconvolution. Frequency domain methods use the pixels within a square region to compute the Fast Fourier Transform and observe changes in the local power spectrum. Spatial domain methods must measure blur over some area, and take square regions as the set of nearby points over which depth is assumed to be constant. In either case, estimates made over square windows will smooth over discontinuities, either in depth or in surface orientation. Despite this shortcoming, deconvolution has been a popular approach to DFD because of its elegance and accessibility.

Energy Minimization Methods. Several iterative DFD methods have been developed which recover both a surface (depth) and its radiance from pairs of defocused images. The methods presented by Favaro and Soatto [6,7] define an energy functional which is jointly minimized with respect to both shape and radiance. In [6], four blurred images are used to recover the depth and radiance, including partially occluded background regions. In [7] a regularized solution is sought and developed using Hilbert space techniques and SVD. Chaudhuri and Rajagopalan model depth and radiance as Markov Random Fields (MRF) and find a maximum a posteriori estimate [4]. These methods have the advantage that they do not explicitly assume the scene to have locally constant depth, though regularization and MRF models implicitly assume that depth changes slowly.
Building upon the MRF scheme with line fields, Bhasin and Chaudhuri [2] explicitly model image formation in the neighborhood of depth discontinuities in order to more accurately recover depth. However, their results are generated on synthetic data that is rendered with an incorrect model: they assert that (regardless of shadowing) a dark band arises near depth discontinuities as a result of partial occlusion. For a correct model of partial occlusion and image formation near depth discontinuities, see Asada et al [1].
3 DFD with Evolving Regions
The method we propose in this paper combines elements of both of the above categories of DFD methods. Our approach iteratively refines a depth estimate like the energy minimization methods, but it does not involve a large error function of unknown topography. Our method is conceptually and computationally straightforward like the deconvolution methods, but it will be shown to have better accuracy near discontinuities in depth and in surface orientation. To achieve this, we evolve a measurement region at each pixel location toward an equifocal configuration, using intermediate depth estimates to guide the evolution. In order to vary the shape of the measurement region in a controlled fashion despite erroneous depth estimates, we impose an elliptical shape on it. The measurement region at a given location starts as a circular region centered there, and evolves with increasing eccentricity while maintaining a fixed area¹ M. In addition to providing a controlled evolution, elliptical regions are more general than square ones in that they can provide better approximations to the scene's equifocal contours². Furthermore, ellipses can be represented with a small number of parameters. This compactness of representation is important in light of our desire to produce dense depth maps, which requires us to maintain a separate measurement region for each pixel location in an image. Given only two images of the scene under different focus settings as input, we have no information about depth or the location of equifocal regions. Initially, we make the equifocal assumption that the scene has locally constant depth and estimate depth over hypothesized equifocal regions (circles) based on this assumption. The key to our approach is that this initial depth estimate is then used to update the measurement region at each pixel location. Instead of estimating defocus over square regions that we assume to have constant depth, we evolve elliptical regions over which we have evidence that depth is nearly constant. In order to give the reader an intuition for the process, we refer to Fig. 2, which shows the evolution of the measurement regions at two locations. In regions of the scene that have constant depth, initial measurement regions are found to contain no depth variation, and are maintained (blue circle). In regions of changing depth, as near the occluding contour, the initial region (red circle) is found to
2
Maintaining a measurement region with at least M pixels is necessary to ensure a consistent level of performance in the DFD estimation. In regions of locally constant depth, equifocal (iso-depth) points form a 2D region. More generally, the surface slopes and equifocal points fall along a 1D contour.
862
S. McCloskey, M. Langer, and K. Siddiqi
Fig. 2. Evolving measurement regions. Near the depth discontinuity, regions grow from circles (red) through a middle stage (yellow) to an equifocal configuration (green). In areas of constant depth, initial regions (blue) do not evolve.
contain significant depth variation and is evolved through an intermediate state (yellow ellipse) to an equifocal configuration (green ellipse). Our method can be thought of as a loop containing two high-level operations: evolution of elliptical regions and depth measurement. The method of evolving elliptical regions toward an equifocal configurations is presented in Sec. 3.1. The blur estimation method is presented in Sec. 3.2. We use the Ens and Lawrence blur estimation algorithm of [5], and expect that any DFD method based on spatial domain measurements could be used in its place. 3.1
Evolving Elliptical Regions
Given a circle3 as our initial measurement region’s perimeter, we wish to evolve that region in accordance with depth estimates. Generally, we would like the extent of the region to grow in the direction of constant scene depth while shrinking in the direction of depth variation. Using an elliptical model for the region, we would like the major axis to follow depth contours and the minor axis to be oriented in the direction of depth variation. We iteratively increase the ellipse’s eccentricity in order to effect the expansion and contraction along these dimensions, provided there are depth changes within it. Each pixel location p has its own measurement region Rp , an ellipse repre→ − sented by a 2-vector fp which expresses the location of one of the foci relative to → − p, and a scalar rp which is the length of the semi-major axis. Initially f = (0, 0) and r = rc for all p. The value rc comes from the region area parameter M = πrc2 . Once we have measured depth over circular regions, we establish the orientation of the ellipse by finding the direction of depth variation in the smoothed depth map. In many cases, thought not all, the direction of depth variation 3
An ellipse with eccentricity 0.
Evolving Measurement Regions for Depth from Defocus
863
is the same as the direction of the depth gradient. Interesting exceptions happen along contours of local minima or maxima of depth (as in the image in Fig. 4 (left)). In order to account for such configurations, we take the angle θv to be the direction of depth variation if the variance of depth values along diameters of the circular region is maximal in that direction. The set of depth values D(θ) along the diameter in direction θ is interpolated from the depth map d at equally-spaced points about p, D(θ) = {d(p + n(cos θ, sin θ))|n = −rc , −rc + 1, ...rc }.
(1)
We calculate D(θ) over a set of orientations θ ∈ [0, π), and take θv to be the direction that maximizes the depth variance, i.e. θv = argmaxθ var(D(θ)).
(2)
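A small sketch of this orientation step (assumed code, using nearest-neighbour sampling rather than interpolation) that evaluates equations (1) and (2) for a candidate set of angles:

```python
import numpy as np

def direction_of_depth_variation(depth, p, r_c, n_angles=8):
    """Return the angle theta_v maximizing depth variance along a diameter.

    depth : 2-D (smoothed) depth map, p : (x, y) centre pixel,
    r_c   : radius of the initial circular region, in pixels.
    """
    h, w = depth.shape
    best_theta, best_var = 0.0, -np.inf
    for theta in np.linspace(0.0, np.pi, n_angles, endpoint=False):
        samples = []
        for n in range(-r_c, r_c + 1):
            x = int(round(p[0] + n * np.cos(theta)))
            y = int(round(p[1] + n * np.sin(theta)))
            if 0 <= x < w and 0 <= y < h:
                samples.append(depth[y, x])
        v = np.var(samples)
        if v > best_var:
            best_theta, best_var = theta, v
    return best_theta, best_var   # best_var is then compared to a threshold
```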
The variance of D(θv) is compared to a threshold chosen to determine whether the variance is significant. In the event that the scene has locally constant depth within the circle, the variance should be below the threshold, and the measurement region does not evolve. In the event that the variance is above the threshold, we orient the minor axis in the direction θv by setting $\vec{f} = (-\sin\theta_v, \cos\theta_v)$. Having established the orientation of the ellipses, we estimate depth over the current measurement regions and increase their eccentricity if the depth variation along the minor axis is significant (i.e., above threshold) and the variation along the major axis is insignificant. These checks halt the evolution if the elliptical region begins to expand into a region of changing depth or if it is found to be an equifocal configuration. If these checks are passed, the ellipse at iteration n + 1 is evolved from the ellipse at iteration n by increasing $\vec{f}$ according to

$$\vec{f}_{n+1} = \left(|\vec{f}_n| + k\right)\frac{\vec{f}_n}{|\vec{f}_n|}, \qquad (3)$$

where the scalar k represents the speed of the deformation. As necessary, the value of r is suitably adjusted in order to maintain a constant area despite changes in $\vec{f}$. Though the accuracy of depth estimation might increase by measuring blur over regions of area greater than M, we keep the area constant in our experiments, allowing us to demonstrate the improvement in depth estimates due only to changes in the shape of the measurement region.

3.2 Measuring Depth Over Evolving Regions
In order to be utilized in our algorithm, a depth estimation method must be able to measure blur over regions of arbitrary shape. This rules out frequency domain methods that require square regions over which a Fast Fourier Transform can be calculated. Instead, we use a spatial domain method, namely the DFD method of Ens and Lawrence [5], adapted to our elliptical regions. The choice of this particular algorithm is not integral to the method; we expect that any spatial domain method for blur measurement could be used.
The Ens and Lawrence method takes two images as input: i1 taken through a relatively small aperture, and i2 taken through a larger aperture. The integration time of the image taken through the larger aperture is reduced to keep the level of exposure constant. As is common in the DFD literature, we model each of these images to be the convolution of a theoretical pinhole aperture image i0 (which is sharply focused everywhere) with a point spread function (PSF) h, i1 = i0 ∗ h1 and i2 = i0 ∗ h2 .
(4)
The PSFs h1 and h2 belong to a family of functions parameterized by the spread σ. This family of PSFs is generally taken to be either the pillbox, which is consistent with a geometric model of optics, or Gaussian, which accounts for diffraction in the optical system. We take the PSFs to be a Gaussian parameterized by its standard deviation. That is, hn = G(σn ). Since it was taken through a larger aperture, regions of the image i2 cannot be sharper than corresponding regions in i1 . Quantitatively, σ1 ≤ σ2 . As described in [5], depth recovery can be achieved by finding a third PSF h3 such that i 1 ∗ h3 = i 2 .
(5)
As we have chosen a Gaussian model for h1 and h2, this unknown PSF is also a Gaussian: h3 = G(σ3). Computationally, we take σ3 at pixel location p to be the value that minimizes, in the sum-of-squared-errors sense, the difference between i2 and i1 ∗ G(σ) over p's measurement region Rp. That is,

$$\sigma_3(p) = \operatorname{argmin}_{\sigma} \sum_{q \in R_p} \left(i_2(q) - (i_1 * G(\sigma))(q)\right)^2. \qquad (6)$$
The value of σ3 can be converted to the depth when the camera parameters are known. We omit the details for brevity.
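A brute-force sketch of Eq. (6) for a single pixel's region is given below (assumed code, with an assumed grid of candidate σ values; in practice one would blur i1 once per candidate σ and reuse the result for every pixel rather than re-blurring inside the loop):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def estimate_sigma3(i1, i2, region_mask, sigmas=np.linspace(0.0, 5.0, 51)):
    """Return the sigma minimising the squared difference between i2 and
    i1 * G(sigma) over the elliptical region R_p given by region_mask."""
    i1f, i2f = i1.astype(float), i2.astype(float)
    best_sigma, best_err = sigmas[0], np.inf
    for s in sigmas:
        blurred = i1f if s == 0.0 else gaussian_filter(i1f, s)
        err = np.sum((i2f[region_mask] - blurred[region_mask]) ** 2)
        if err < best_err:
            best_sigma, best_err = s, err
    return best_sigma
```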
4 Experiments
We have conducted a number of experiments on real scenes. Our images were all acquired with a Nikon D70s digital camera and a 50mm Nikkor lens. The camera's tone mapping function was characterized and inverted by the method described in [3], effectively making the camera a linear sensor. For each scene, the lens was focused at a distance nearer than the closest object in the scene⁴.

Depth Discontinuities. Our first experiment involves the scene considered in Sec. 1, which consists of two fronto-parallel planes separated by a depth discontinuity. We have shown the sharply focused, small-aperture (f/13) input image in Figs. 1 and 2; the second, blurrier input image (not shown) was taken with a larger aperture (f/4.8). We have worked on a relatively small (400-by-300 pixel) window of the 1000-by-1500 pixel images in order to illustrate the improved accuracy in the area of the discontinuity in sufficient detail.
[Fig. 3 (right) plot axes: scene depth, measured as distance from the plane of focus (m), versus image column.]
Fig. 3. (Left) Surface plot of estimates from circular regions (dark points are closer to the camera). (Center) Estimates from computed elliptical regions. (Right) Profile of depth discontinuity as measured over disks (red), and evolved elliptical regions (blue).
Fig. 4. (Left) Image of two planes meeting at a discontinuity in surface orientation, annotated with an evolved measurement region. (Center) Surface plot of depth recovered over circular regions (dark points are closer to the camera). (Right) Surface plot of depth recovered over elliptical regions.
Fig. 3 shows surface renderings of the scene recovered from the initial circular measurement regions (left) and from the elliptical regions produced by our method (middle). These surface renderings demonstrate the improved accuracy of the depth estimates as a result of the elliptical evolution. Fig. 3 (right) shows a plot of the depth estimates for a single row, recovered from the circular regions (red) and the elliptical regions (green). As this plot shows, our method recovers an edge profile that is significantly more accurate than the initial result. This result was obtained by evolving the ellipses 25 times with speed constant k = 10. The orientation θv was determined by measuring the depth variance over 8 different angles uniformly spaced in the interval [0, π). The value rc = 45 pixels, making the measurement regions comparable in size to those used in [5].

⁴ Defocus increases in both directions away from the focused depth, and so depth from defocus suffers a sign ambiguity relative to the plane of focus [12]. This is typically avoided by having the entire scene be in front of or behind the plane of focus.
Fig. 5. (Left) Image of a tennis ball on a ground plane annotated with measurement regions. (Center) Surface plot of depth measured over circular regions. (Right) Surface plot of depth measured over ellipses evolved to approximate equifocal regions.
Discontinuities in Surface Orientation. As in the case of discontinuous depth changes, square (or circular) windows for the measurement of defocus produce inaccurate depth estimates near discontinuities in surface orientation. Consider, for example, the scene shown in Fig. 4 (left), which consists of two slanted planes that intersect at the middle column of the image. The scene is convex, so the middle column of the image depicts the closest points on the surface. Within a square window centered on one of the locations on this ridge line of minimum depth, most of the scene points will be at greater depths. As a result, the distance to that point will be overestimated. Similarly, were the ridge line at a maximum distance in the scene, the depth would be underestimated. Fig. 4 (center and right) show surface plots of the scene recovered from circular and evolved elliptical regions, respectively. An example of an evolved measurement region is shown in Fig. 4 (left). The inaccuracies of the circular regions are apparent near the discontinuity in orientation, where the width of the measurement region results in a smoothed corner. The surface plot of estimates made over evolved elliptical regions shows the discontinuity more accurately, and has more accurate depth estimates near the perimeter of the image. Non-linear Discontinuities. Because our ellipses evolve to a line segment in the limit, one may expect that our method will fail on objects with occluding contours that are non-linear. While an ellipse cannot describe a curved equifocal contour exactly, it can still provide an improvement over square or circular regions when the radius of the equifocal contour’s curvature is small compared to the scale of the ellipse. Fig. 5 (left) shows an example of a spherical object, whose equifocal contours are concentric circles. The curvature of these equifocal contours is relatively low at the sphere’s occluding contour, allowing our elliptical regions to increase in eccentricity while reducing the internal depth variation (blue ellipse). Near the center of the ball, the equifocal contours have higher curvature, but the surface is nearly fronto-parallel in these regions. As a result, the initial circles do not become as elongated in this region (red circle). This results in depth estimates that show a clearer distinction between the sphere and the background (see Fig 5 (center) and (right)).
The parameters for this experiment were the same as in the previous examples, except that θv was found over 28 angles in the interval [0, π). The algorithm was run for 25 iterations, though most of the ellipses stopped evolving much earlier, leaving a maximum value of $|\vec{f}| = 100$. For the experiments shown in this paper, a substantial majority of the running time was spent in the depth estimation step. The amount of time spent orienting and evolving the elliptical regions depends primarily on the scene structure, and totals about 2 minutes in the worst case.
5 Conclusions and Future Work
We have demonstrated that the accuracy of DFD estimates can depend on the shape of the region over which the estimate is computed. The standard assumption in the DFD literature - that square regions are equifocal - is shown to be problematic around discontinuities in depth and surface orientation. Moreover, through our experiments, we have demonstrated that an elliptical model can be used to evolve measurement regions that produce more accurate depth estimates near such features. We have, for the first time, presented an algorithm that iteratively tailors the measurement region to the structure of the scene. Future research could address both the size and shape of the measurement region. In order to illustrate the benefits of changes in its shape, we have kept the size M of the measurement region constant. In actual usage, however, we may choose to increase the size of the measurement area in regions of the scene that are found to be fronto-parallel in order to attain improved DFD accuracy. Though we have shown that elliptical regions are more general than squares, and that this additional generality improves DFD performance, there are scene structures for which ellipses are not sufficiently general. Non-smooth equifocal contours, found near corners, will be poorly approximated by ellipses. Such structures demand a more general model for the measurement region and would require a different evolution algorithm, which is an interesting direction for future work.
References 1. Asada, N., Fujiwara, H., Matsuyama, T.: Seeing Behind the Scene: Analysis of Photometric Properties of Occluding Edges by the Reversed Projection Blurring Model. IEEE Trans. on Patt. Anal. and Mach. Intell. 20, 155–167 (1998) 2. Bhasin, S., Chaudhuri, S.: Depth from Defocus in Presence of Partial Self Occlusion. In: Proc. Intl. Conf. on Comp. Vis., pp. 488–493 (2001) 3. Debevec, P., Malik, J.: Recovering High Dynamic Range Radiance Maps from Photographs. In: Proc. SIGGRAPH, pp. 369–378 (1997) 4. Chaudhuri, S., Rajagopalan, A.: Depth from Defocus: A Real Aperture Imaging Approach. Springer, Heidelberg (1999) 5. Ens, J., Lawrence, P.: Investigation of Methods for Determining Depth from Focus. IEEE Trans. on Patt. Anal. and Mach. Intell. 15(2), 97–108 (1993)
6. Favaro, P., Soatto, S.: Seeing beyond occlusions (and other marvels of a finite lens aperture). In: Proc. CVPR 2003, vol. 2, pp. 579–586 (June 2003) 7. Favaro, P., Soatto, S.: A Geometric Approach to Shape from Defocus. IEEE Trans. on Patt. Anal. and Mach. Intell. 27(3), 406–417 (2005) 8. G¨ okstorp, M.: Computing Depth from Out-of-Focus Blur Using a Local Frequency Representation. In: Proc. of the IAPR Conf. on Patt. Recog., pp. 153–158 (1994) 9. Hasinoff, S.W., Kutulakos, K.N.: Confocal Stereo. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 620–634. Springer, Heidelberg (2006) 10. McCloskey, S., Langer, M., Siddiqi, K.: The Reverse Projection Correlation Principle for Depth from Defocus. In: Proceedings of the 3rd International Symposium on 3D Data Processing, Visualization and Transmission (2006) 11. Nayar, S.K., Watanabe, M.: Minimal Operator Set for Passive Depth from Defocus. In: Proc. CVPR 1996, pp. 431–438 (June 1996) 12. Pentland, A.: A New Sense for Depth of Field. IEEE Trans. on Patt. Anal. and Mach. Intell. 9(4), 523–531 (1987) 13. Pentland, A., Scherock, S., Darrell, T., Girod, B.: Simple Range Cameras Based on Focal Error. J. of the Optical Soc. Am. 11(11), 2925–2935 (1994) 14. Subbarao, M., Surya, G.: Depth from Defocus: A Spatial Domain Approach. Intl. J. of Comp. Vision 13, 271–294 (1994) 15. Subbarao, M.: Parallel Depth Recovery by Changing Camera Parameters. In: Proc. Intl. Conf. on Comp. Vis., pp. 149–155 (1998) 16. Xiong, Y., Shafer, S.A.: Moment Filters for High Precision Computation of Focus and Stereo. In: Proc. Intl. Conf. on Robotics and Automation, pp. 108–113 (1995) 17. Zhang, L., Nayar, S.K.: Projection Defocus Analysis for Scene Capture and Image Display. In: Proc. SIGGRAPH, pp. 907–915 (2006)
A New Framework for Grayscale and Colour Non-lambertian Shape-from-Shading William A.P. Smith and Edwin R. Hancock Department of Computer Science, The University of York {wsmith,erh}@cs.york.ac.uk
Abstract. In this paper we show how arbitrary surface reflectance properties can be incorporated into a shape-from-shading scheme, by using a Riemannian minimisation scheme to minimise the brightness error. We show that for face images an additional regularising constraint on the surface height function is all that is required to recover accurate face shape from single images, the only assumption being of a single light source of known direction. The method extends naturally to colour images, which add additional constraints to the problem. For our experimental evaluation we incorporate the Torrance and Sparrow surface reflectance model into our scheme and show how to solve for its parameters in conjunction with recovering a face shape estimate. We demonstrate that the method provides a realistic route to non-Lambertian shape-from-shading for both grayscale and colour face images.
1 Introduction
Shape-from-shading is a classical problem in computer vision which has attracted over four decades of research. The problem is underconstrained and proposed solutions have, in general, made strong assumptions in order to make the problem tractable. The most common assumption is that the surface reflectance is perfectly diffuse and is explained using Lambert’s law. For many types of surface, this turns out to be a poor approximation to the true reflectance properties. For example, in face images specularities caused by perspiration constitute a significant proportion of the total surface reflectance. More complex models of reflectance have rarely been considered in single image shape-from-shading. Likewise, the use of colour images for shape recovery has received little attention. In this paper we present a general framework for solving the shape-fromshading problem which can make use of any parametric model of reflectance. We use techniques from differential geometry in order to show how the problem of minimising the brightness error can be solved using gradient descent on a spherical manifold. We experiment with a number of regularisation constraints and show that from single face images, given only the illumination direction, we can make accurate estimates of the shape and reflectance parameters. We show how the method extends naturally to colour images and that both the shape and diffuse albedo maps allow convincing view synthesis under a wide range of illumination directions, viewpoints and illuminant colours. Y. Yagi et al. (Eds.): ACCV 2007, Part II, LNCS 4844, pp. 869–880, 2007. c Springer-Verlag Berlin Heidelberg 2007
1.1 Related Work
Ahmed and Farag [1] incorporate a non-Lambertian reflectance model into a perspective shape-from-shading algorithm. By accounting for light source attenuation they are able to resolve convex/concave ambiguities. Georghiades [2] uses a similar approach to the work presented here in an uncalibrated photometric stereo framework. From multiple grayscale images he estimates reflectance parameters and surface shape using an integrability constraint. Work that considers the use of colour images for shape recovery includes Christensen and Shapiro [3]. They use a numerical approach to transform a pixel brightness colour vector into the set of normals that corresponds to that colour. By using multiple images they are able to constrain the choice of normal directions to a small set, from which one is chosen using additional constraints. Ononye and Smith [4] consider the problem of shape-from-colour, i.e. estimating surface normal directions from single, colour intensity measurements. They present a variational approach that is able to recover coarse surface shape estimates from single colour images.
2 Solving Shape-from-Shading
The aim of computational shape-from-shading is to make estimates of surface shape from the intensity measurements in a single image. Since the amount of light reflected by a point on a surface is related to the surface orientation at that point, in general the shape is estimated in the form of a field of surface normals (a needle-map). If the surface height function is required, this can be estimated from the surface normals using integration. In order to recover surface orientation from image intensity measurements, the reflectance properties of the surface under study must be modelled.
2.1 Radiance Functions
The complex process of light reflecting from a surface can be captured by the bidirectional reflectance distribution function (BRDF). The BRDF describes the ratio of the emitted surface radiance to the incident irradiance over all possible incident and exitant directions. A number of parametric reflectance models have been proposed which capture the reflectance properties of different surface types with varying degrees of accuracy. Assuming a normalised and linear camera response, the image intensity predicted by a particular parametric reflectance model is given by its radiance function: I = g(N, L, V, P). This equality is the image irradiance equation. The arguments N, L and V are unit vectors in the direction of the local surface normal, the light source and viewer respectively. We use the set P to denote the set of additional parameters specific to the particular reflectance model in use. The best known and simplest reflectance model follows Lambert’s law which predicts that light is scattered equally in all directions, moderated by a diffuse albedo term, ρd , which describes the intrinsic reflectivity of the surface: gLambertian (N, L, V, {ρd }) = ρd N · L.
(1)
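As a point of reference, (1) can be evaluated directly over a field of surface normals. The following is a minimal Python/NumPy sketch, not the authors' code; the array shapes and names are assumptions.

import numpy as np

def lambertian_radiance(normals, light, albedo):
    # Evaluate (1): I = rho_d * (N . L) per pixel.
    # normals: (H, W, 3) unit surface normals N
    # light:   (3,) unit light source direction L
    # albedo:  (H, W) diffuse albedo rho_d
    n_dot_l = np.einsum('ijk,k->ij', normals, light)
    # in practice, shadowed pixels (N . L < 0) would be clamped to zero
    return albedo * n_dot_l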
Note that the image intensity is independent of viewing direction. Other models capture more complex effects. For example, the Torrance and Sparrow [5] model has previously been shown to be suitable for approximating skin reflectance properties in a vision context [2] since it captures effects such as off-specular forward scattering at large incident angles. If the effects of Fresnel reflectance and bistatic shadowing are discounted, the Torrance and Sparrow model has the radiance function:

gT&S(N, L, V, {ρd, ρs, ν, L}) = L ρd N·L + L ρs e^(−ν² arccos(N·(L+V)/‖L+V‖)²) / (N·V),   (2)
where ρs is the specular coefficient, ν the surface roughness, L the light source intensity and, once again, ρd is the diffuse albedo.
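A direct transcription of (2) for a single pixel is sketched below in Python/NumPy; it is an illustrative reading of the formula under the stated simplifications (no Fresnel term, no bistatic shadowing), not the authors' implementation.

import numpy as np

def torrance_sparrow_radiance(N, L, V, rho_d, rho_s, nu, L_int):
    # Evaluate (2) for unit vectors N, L, V (each of shape (3,)).
    # rho_d: diffuse albedo, rho_s: specular coefficient,
    # nu: surface roughness, L_int: light source intensity L.
    H = (L + V) / np.linalg.norm(L + V)            # halfway direction
    alpha = np.arccos(np.clip(N @ H, -1.0, 1.0))   # angle between N and H
    diffuse = L_int * rho_d * (N @ L)
    specular = L_int * rho_s * np.exp(-nu**2 * alpha**2) / (N @ V)
    return diffuse + specular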
2.2 Brightness Error
The radiance function provides a succinct mapping between the reflectance geometry in the scene and the observed intensity. For an image in which the viewer and light source directions are fixed, the radiance function reduces to a function of one variable: the surface normal. This image-specific function is known as the reflectance map in the shape-from-shading literature. The squared error between the observed image intensities, I(x, y), and those predicted by the estimated surface normals, n(x, y), according to the chosen reflectance model is known as the brightness error. In terms of the radiance function this is expressed as follows:

EBright(n) = Σ_{x,y} (I(x, y) − g(n(x, y), L, V, P(x, y)))².   (3)
For typical reflectance models, this function does not have a unique minimum. In fact, there are likely to be an infinite set of normal directions all of which minimise the brightness error. For example, in the case of a Lambertian surface the minimum of the brightness error will be a circle on the unit sphere.
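For completeness, the brightness error (3) can be accumulated over the image for any radiance function g. The sketch below is illustrative only (names and array conventions are assumed); the per-pixel parameter set P(x, y) is passed through unchanged.

import numpy as np

def brightness_error(I, normals, g, L, V, P):
    # Evaluate (3): sum over pixels of (I(x,y) - g(n(x,y), L, V, P(x,y)))^2.
    # I: (H, W) intensities; normals: (H, W, 3); P: (H, W) per-pixel parameter sets.
    err = 0.0
    rows, cols = I.shape
    for y in range(rows):
        for x in range(cols):
            err += (I[y, x] - g(normals[y, x], L, V, P[y][x])) ** 2
    return err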
2.3 The Variational Approach
In order to make the shape-from-shading problem tractable, the most common approach has been to augment the brightness error with a regularization term, EReg(n), which penalises departures from a constraint based on the surface normals. A wide range of such constraints have been considered, but obvious choices are surface smoothness and integrability. In some of the earliest work, Horn and Brooks [6] used a simple smoothness constraint in a regularization framework. They used variational calculus to solve the minimisation: n* = arg min_n EBright(n) + λEReg(n), where λ is a Lagrange multiplier which effectively weights the influence of the two terms. The resulting iterative solution is [7]:

n^(t+1) = fReg(n^t) + (C/λ)(I − n^t · L) L,   (4)
where fReg(n^t) is a function which enforces the regularising constraint (in this case a simple neighbourhood averaging which effectively smooths the field of surface normals). The second term provides a step in the light source direction of a size proportional to the deviation of n^t from the image irradiance equation and seeks to reduce the brightness error. The weakness of the Horn and Brooks approach is that for reasons of numerical stability, a large value of λ is typically required. The result is that the smoothing term dominates and image brightness constraints are only weakly satisfied. The recovered surface normals therefore lose much of the fine surface detail and do not accurately recreate the image.
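For concreteness, one iteration of (4) in the Lambertian case might be implemented as in the sketch below; the neighbourhood-averaging regulariser and the constant C are assumptions based on the description above, not the authors' code.

import numpy as np

def horn_brooks_step(normals, I, L, lam, C=1.0):
    # One update of (4): n^{t+1} = f_Reg(n^t) + (C / lambda) (I - n^t . L) L.
    # normals: (H, W, 3) unit normal field; I: (H, W) intensities; L: (3,) light direction.
    padded = np.pad(normals, ((1, 1), (1, 1), (0, 0)), mode='edge')
    # f_Reg: simple neighbourhood averaging of the normal field
    smoothed = 0.25 * (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                       padded[1:-1, :-2] + padded[1:-1, 2:])
    residual = I - np.einsum('ijk,k->ij', normals, L)   # deviation from (1)
    updated = smoothed + (C / lam) * residual[..., None] * L
    return updated / np.linalg.norm(updated, axis=2, keepdims=True)  # re-normalise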
2.4 The Geometric Approach
An approach which overcomes these deficiencies was proposed by Worthington and Hancock [7]. Their idea was to choose a solution which strictly satisfies the brightness constraint at every pixel but uses the regularisation constraint to help choose a solution from within this reduced solution space. This is possible because in the case of Lambertian reflectance, it is straightforward to select a surface normal direction whose brightness error is 0. From (1), it is clear that the angle θi = ∠NL can be recovered using θi = arccos(I), assuming unit albedo. By ensuring a normal direction is chosen that meets this condition, the brightness error is strictly minimised. In essence, the method applies the regularization constraint within the subspace of solutions which have a brightness error of 0: n* = arg min_{EBright(n)=0} EReg(n).
To solve this minimisation, Worthington and Hancock use a two step iterative procedure which decouples application of the regularization constraint and projection onto the closest solution with zero brightness error:
1. n^t = fReg(n^t)
2. n^(t+1) = arg min_{EBright(n)=0} d(n, n^t),
where d(·, ·) is the arc distance between two unit vectors and fReg(n^t) enforces a robust regularizing constraint. The second step of this process is implemented using n^(t+1) = Θn^t, where Θ is a rotation matrix which rotates a unit vector to the closest direction that satisfies θi = arccos(I).
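The projection step can also be realised without an explicit rotation matrix, by moving n within the plane spanned by n and L until the target angle is reached. The sketch below is an illustrative construction for the Lambertian, unit-albedo case, not the authors' implementation.

import numpy as np

def project_to_cone(n, L, I):
    # Return the direction closest to n that satisfies theta_i = arccos(I).
    theta = np.arccos(np.clip(I, -1.0, 1.0))
    perp = n - (n @ L) * L                 # component of n perpendicular to L
    norm = np.linalg.norm(perp)
    if norm < 1e-12:                       # n parallel to L: any azimuth is equally close
        perp = np.cross(L, [1.0, 0.0, 0.0])
        if np.linalg.norm(perp) < 1e-12:
            perp = np.cross(L, [0.0, 1.0, 0.0])
        perp = perp / np.linalg.norm(perp)
    else:
        perp = perp / norm
    # place the result at angle theta from L, on the same side as n
    return np.cos(theta) * L + np.sin(theta) * perp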
3 Minimising the Brightness Error
Worthington and Hancock exploited the special case of Lambertian reflectance in which the brightness error can be made zero using a single rotation whose axis and magnitude can be found analytically. Our approach in this paper effectively generalises their idea to reflectance models that do not have such easily invertible forms. For Lambertian reflectance, our approach would take the same single step in order to minimise the brightness error. Neither their approach, nor that of Horn and Brooks, can incorporate arbitrary reflectance models.
In general, finding a surface normal that minimises the brightness error requires minimising a function of a unit vector. It is convenient to consider a unit vector as a point lying on a spherical manifold. Accordingly, the surface normal n corresponds to the point on the unit 2-sphere, n ∈ S², where Φ(n) = n and Φ : S² → R³ is an embedding. We can therefore define the brightness error for the local surface normal as a function of a point on the manifold S²: f(n) = (g(Φ(n), L, V, P) − I)². The need to solve minimisation problems on Riemannian manifolds arises when performing diffusion on tensor valued data. This problem has recently been addressed by Zhang and Hancock [8]. We propose a similar approach here for finding minima in the brightness error on the unit 2-sphere for arbitrary reflectance functions. This provides an elegant formulation for the shape-from-shading problem stated in very general terms. In order to do so we make use of the exponential map, which transforms a point on the tangent plane to a point on the sphere. If v ∈ T_n S² is a vector on the tangent plane to S² at n ∈ S² and v ≠ 0, the exponential map, denoted Exp_n, of v is the point on S² along the geodesic in the direction of v at distance ‖v‖ from n. Geometrically, this is equivalent to marking out a length equal to ‖v‖ along the geodesic that passes through n in the direction of v. The point on S² thus obtained is denoted Exp_n(v). This is illustrated in Figure 1. The inverse of the exponential map is the log map which transforms a point from the sphere to the tangent plane.
Fig. 1. The exponential map from the tangent plane to the sphere.
We can approximate the local gradient of the error function f in terms of a vector on the tangent plane T_n S² using finite differences:

∇f(n) ≈ ( [f(Exp_n((ε, 0)^T)) − f(n)] / ε ,  [f(Exp_n((0, ε)^T)) − f(n)] / ε )^T.   (5)
We can therefore use gradient descent to find the local minimum of the brightness error function:

n^(t+1) = Exp_{n^t}(−γ∇f(n^t)),   (6)

where γ is the step size. At each iteration we perform a line search using Newton's method to find the optimum value of γ. In the case of Lambertian reflectance, the radiance function strictly monotonically decreases as θi increases. Hence, the gradient of the error function will be a vector that lies on the geodesic curve that passes through the current normal estimate and the point corresponding to the light source direction vector. The optimal value of γ will reduce the brightness error to zero and the update is equivalent to the one step approach of Worthington and Hancock. For
more complex reflectance models, the minimisation will require more than one iteration. We solve the minimisation on the unit sphere in a two step iterative process:
1. n^t = fReg(n^t)
2. n^(t+1) = arg min_n EBright(n),
where step 2 is solved using the gradient descent method given in (6) with n^t as the initialisation. Our approach extends naturally to colour images. The error functional to be minimised on the unit sphere simply comprises the sum of the squared errors for each colour channel:

f(n) = Σ_{c∈{R,G,B}} (g(Φ(n), L, V, Pc) − Ic)².   (7)
Note that for colour images the problem is more highly constrained, since the ratio of knowns to unknowns improves. This is because the surface shape is fixed across the three colour channels.
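A minimal per-pixel sketch of the minimisation in (5) and (6) is given below. It parameterises the tangent plane with an arbitrary orthonormal basis and uses a fixed step size rather than the Newton line search described above; all names are illustrative, and the colour case of (7) is handled simply by passing a per-channel summed error as f.

import numpy as np

def exp_map(n, v):
    # Exponential map: move from n along tangent vector v on the unit 2-sphere.
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return n
    return np.cos(norm_v) * n + np.sin(norm_v) * (v / norm_v)

def tangent_basis(n):
    # Two orthonormal vectors spanning the tangent plane T_n S^2.
    a = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    e1 = np.cross(n, a)
    e1 = e1 / np.linalg.norm(e1)
    e2 = np.cross(n, e1)
    return e1, e2

def minimise_brightness(n, f, eps=1e-3, gamma=0.05, iters=50):
    # Gradient descent (6) on the sphere; f(n) is the per-pixel brightness error.
    for _ in range(iters):
        e1, e2 = tangent_basis(n)
        g1 = (f(exp_map(n, eps * e1)) - f(n)) / eps   # finite differences (5)
        g2 = (f(exp_map(n, eps * e2)) - f(n)) / eps
        n = exp_map(n, -gamma * (g1 * e1 + g2 * e2))
    return n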
4 Regularisation Constraints
In this paper we use a statistical regularisation constraint, closely related to integrability [9]. Suppose that a facial surface F ∈ R³ is projected orthographically onto the image plane and parameterised by the function z(x, y). We can express the surface F in terms of a linear combination of K surface functions Ψi (or modes of variation):

zb(x, y) = Σ_{i=1}^{K} bi Ψi(x, y),   (8)

where the coefficients b = (b1, . . . , bK)^T are the surface parameters. In this paper we use a surface height basis set learnt from a set of exemplar face surfaces. Here, the modes of variation are found by applying PCA to a representative sample of face surfaces and Ψi is the eigenvector of the covariance matrix of the training samples corresponding to the ith largest eigenvalue. We may express the surface normals in terms of the parameter vector b:

nb(x, y) = ( Σ_{i=1}^{K} bi ∂x Ψi(x, y),  Σ_{i=1}^{K} bi ∂y Ψi(x, y),  −1 )^T.   (9)
When we wish to refer to the corresponding vectors of unit length we use: n̂b(x, y) = nb(x, y)/‖nb(x, y)‖. A field of normals expressed in this manner satisfies a stricter constraint than standard integrability. The field of normals will be integrable since they correspond exactly to the surface given by (8). But in addition, the surface corresponding to the field of normals is also constrained to lie within the span of the surface height model. We term this constraint model-based integrability.
In order to apply this constraint to the (possibly non-integrable) field of surface normals n(x, y), we seek the parameter vector b*, whose field of surface normals given by (9), minimises the distance to n(x, y). We pose this as minimising the squared error between the surface gradients of n(x, y) and those given by (9). The surface gradients of n(x, y) are p(x, y) = nx(x, y)/nz(x, y) and q(x, y) = ny(x, y)/nz(x, y). The optimal solution is therefore given by:

b* = arg min_b Σ_{x,y} [ ( Σ_{i=1}^{K} bi pi(x, y) − p(x, y) )² + ( Σ_{i=1}^{K} bi qi(x, y) − q(x, y) )² ],   (10)
where pi(x, y) = ∂x Ψi(x, y) and qi(x, y) = ∂y Ψi(x, y). The solution to this minimisation is linear in b and is solved using linear least squares as follows. If the input image is of dimension M × N, we form a vector of length 2MN of the surface gradients of n(x, y):

G = (p(1, 1), q(1, 1), . . . , p(M, N), q(M, N))^T.   (11)
We then form the 2MN × K matrix of the surface gradients of the eigenvectors, Ψ, whose ith column is Ψi = [pi(1, 1), qi(1, 1), . . . , pi(M, N), qi(M, N)]^T. We may now state our least squares problem in terms of matrix operations: b* = arg min_b ‖Ψb − G‖². The least squares solution is given by:

b* = (Ψ^T Ψ)^(−1) Ψ^T G.   (12)
With the optimal parameter vector to hand, the field of surface normals satisfying the model-based integrability constraint are given by (9). Furthermore, we have also implicitly recovered the surface height, which is given by (8).
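Fitting the parameter vector via (10)-(12) is an ordinary linear least-squares problem. A NumPy sketch is given below; the array layout follows (11), a numerically stable solver is used in place of the explicit inverse of (12), and everything apart from the equations themselves is an assumption.

import numpy as np

def fit_model_based_integrability(p, q, Psi_grad):
    # p, q: (M, N) surface gradients of the current normal field,
    #       p = n_x / n_z, q = n_y / n_z
    # Psi_grad: (2*M*N, K) matrix whose i-th column stacks
    #       (p_i(1,1), q_i(1,1), ..., p_i(M,N), q_i(M,N)) for basis surface Psi_i
    G = np.stack([p, q], axis=-1).reshape(-1)          # vector G of (11)
    b, *_ = np.linalg.lstsq(Psi_grad, G, rcond=None)   # solves (12)
    return b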
4.1 Implementation
For our implementation, we choose to employ the Torrance and Sparrow [5] reflectance model given in (2). We make a number of assumptions to simplify its fitting. The first follows [2] and assumes that the specular coefficient, ρs, and roughness parameter, ν, are constant over the surface, i.e. we estimate only a single value for each from one image. We allow the diffuse albedo, ρd, to vary arbitrarily across the face, though we do not allow albedo values greater than one. For colour images we also allow the diffuse albedo to vary between colour channels. However, we fix the specular coefficient and roughness terms to remain the same. In doing so we are making the assumption that specular reflection is independent of the colour of the surface. For both colour and grayscale images, we introduce a regularisation constraint on the albedo by applying a light anisotropic diffusion to the estimated albedo maps at each iteration [10]. In performing this step we are assuming the albedo is piecewise smooth. We initialise the normal field to the surface normals of the average face surface from the samples used to construct the surface height model.
Algorithm 1. Non-Lambertian shape-from-shading algorithm
Input: Light source direction L, image intensities I(x, y) and gradients of surface functions Ψ
Output: Estimated surface normal map n(x, y), surface height zb(x, y) and T&S parameters: ρs, ρd(x, y), ν and L.
1  Initialise ρs^(0) = 0.2, ρd^(0)(x, y) = 0.8, ν^(0) = 2 and b = (0, . . . , 0)^T;
2  Set iteration t = 1;
3  repeat
4    Find L^(t) by solving for L in (2) (linear least squares) using surface normals n̂b^(t−1)(x, y), fix all other parameters;
5    Find ρs^(t) and ν^(t) by solving nonlinear minimisation of (2) using Newton's method keeping all other parameters fixed and using ρs^(t−1) and ν^(t−1) as an initialisation and normals n̂b^(t−1)(x, y);
6    Find n^(t)(x, y) by minimising brightness error by solving (4) for every pixel (x, y) using n̂b^(t−1)(x, y) as initialisation, fix all other parameters;
7    Enforce model-based integrability. Form matrix of surface gradients G^(t) using Equation 11 from n^(t)(x, y) and find b^(t) by solving: b^(t) = (Ψ^T Ψ)^(−1) Ψ^T G^(t);
8    Calculate diffuse albedo for every pixel: ρd^(t)(x, y) = min(1, [I(N·V) − L ρs^(t) e^(−ν² arccos(N·(L+V)/‖L+V‖)²)] / [L(N·V)(N·L)]), where N = n̂b^(t)(x, y);
9    Set iteration t = t + 1;
10 until Σ_{x,y} arccos(n^(t)(x, y) · n^(t−1)(x, y))² < ε;
The algorithm for a grayscale image is given in Algorithm 1. For a colour image we simply replace the minimisation term in step 7 with (7) and calculate the diffuse albedo and light source intensity independently for each colour channel.
5 Experiments
We now demonstrate the results of applying our non-Lambertian shape-fromshading algorithm to both grayscale (drawn from the Yale B database [11]) and colour (drawn from the CMU PIE database [12]) face images. The model-based integrability constraint is constructed by applying PCA to cartesian height maps extracted from the 3DFS database [13] range data. In Figure 2 we show results of applying our technique to three grayscale input images (shown in first column). In the second and third columns we show the estimated shape rendered with Lambertian reflectance and frontal illumination in both a frontal and rotated view. The shape estimates are qualitatively good and successfully remove the effects of non-Lambertian reflectance. In the fourth and fifth columns we compare a synthesised image under novel illumination with a real image under the same illumination (the light source is 22.3◦ from frontal). Finally in the sixth column
Fig. 2. Results on grayscale images (columns: input, estimated Lambertian, rotated Lambertian, novel illumination, real view, novel pose)
we show a synthesised novel pose, rotated 30◦ from frontal. We keep the light source and viewing direction coincident and hence the specularities are in a different position than in the input images. The result is quite convincing.
In Figure 3 we provide an example of applying our technique to an input image in which the illumination is non-frontal. In this case the light source is positioned 25◦ along the negative horizontal axis. We show the input image on the left and on the right we show an image in which we have rendered the recovered shape with frontal illumination using the estimated reflectance parameters. This allows us to normalise images to frontal illumination.
Fig. 3. Correcting for non-frontal illumination.
In Figure 4 we show the results of applying our method to colour images taken from the CMU PIE database. As is clear in the input images, the illuminant is strongest in the blue channel, resulting in the faces appearing unnaturally blue. The subject shown in the third row is particularly challenging because of the lack of shading information due to facial hair.
Fig. 5. Histogram of estimated light source hue (in degrees) for 67 CMU PIE subjects; the vertical axis is frequency.
For each subject we apply our algorithm to a frontally illuminated image. This provides an estimate of the
Fig. 4. Results on colour images (columns: input, diffuse albedo, white illuminant, estimated Lambertian, rotated Lambertian, novel pose)
colour of the light source in terms of an RGB vector: (LR, LG, LB)^T. To demonstrate that our estimate of the colour of the light source is stable across all 67 subjects, we convert the estimated light source colour into HSV space and plot a histogram of the estimated hue values in Figure 5. It can be seen that this estimate is quite stable and that all samples lie within the ‘blue’ range of hue values. The mean estimated hue was 215.6◦. Returning to Figure 4, in the second column we show the estimated diffuse albedo in the three colour channels. These appear to have accurately recovered the colour of features such as lips, skin and facial hair, despite the use of a coloured illuminant. Note that residual shading effects in the albedo maps are minimal. In the third column we show a synthesised image in which we have rendered the estimated shape using a white light source and the estimated reflectance parameters. These are qualitatively convincing and appear to have removed the effect of the coloured light source. This effectively provides a route to facial colour constancy. In the fourth and fifth columns we show the estimated shape rendered with Lambertian reflectance in both a frontal and rotated view. Finally in the sixth column we show a synthesised image in a novel pose rendered with a white light source coincident with the viewing direction. Note that the specularities are in a different position compared to the input images.
Fig. 6. Synthesising colour images under novel lighting
Finally, in Figure 6 we provide some additional examples of the quality of image that can be synthesised under novel lighting conditions. From the input image in the top row of Figure 4, we synthesise images using white light from a variety of directions which subtend an angle of approximately 35◦ with the viewing direction.
6 Conclusions
In this paper we have presented a new framework for solving shape-from-shading problems which can incorporate arbitrary surface reflectance models. We used techniques from Riemannian geometry to minimise the brightness error in a manner that extends naturally to colour images. We experimented with the Torrance and Sparrow reflectance model on both grayscale and colour images. We showed that the shape and reflectance information we recover from one image is sufficient for realistic view synthesis. An obvious target for future work is to exploit the recovered information for the purposes of face recognition. The work also raises a number of questions that we do not answer here. The first is whether the iterative solution of two minimisation steps always converges in a stable manner (experimental results would suggest this is the case). The second is whether these two steps could be combined into a single minimisation in a more elegant manner. Finally, the generalisation power of the statistical model impacts upon the precision of the recovered face surfaces; it would be interesting to test this experimentally.
References 1. Ahmed, A.H., Farag, A.A.: A new formulation for shape from shading for nonlambertian surfaces. In: Proc. CVPR, vol. 2, pp. 1817–1824 (2006) 2. Georghiades, A.: Recovering 3-d shape and reflectance from a small number of photographs. In: Eurographics Symposium on Rendering, pp. 230–240 (2003) 3. Christensen, P.H., Shapiro, L.G.: Three-dimensional shape from color photometric stereo. Int. J. Comput. Vision 13, 213–227 (1994) 4. Ononye, A.E., Smith, P.W.: Estimating the shape of a surface with non-constant reflectance from a single color image. In: Proc. BMVC, pp. 163–172 (2002) 5. Torrance, K., Sparrow, E.: Theory for off-specular reflection from roughened surfaces. J. Opt. Soc. Am. 57, 1105–1114 (1967) 6. Horn, B.K.P., Brooks, M.J.: The variational approach to shape from shading. Comput. Vis. Graph. Image Process 33, 174–208 (1986) 7. Worthington, P.L., Hancock, E.R.: New constraints on data-closeness and needle map consistency for shape-from-shading. IEEE Trans. Pattern Anal. Mach. Intell. 21, 1250–1267 (1999) 8. Zhang, F., Hancock, E.R.: A riemannian weighted filter for edge-sensitive image smoothing. In: Proc. ICPR, pp. 594–598 (2006) 9. Frankot, R.T., Chellappa, R.: A method for enforcing integrability in shape from shading algorithms. IEEE Trans. Pattern Anal. Mach. Intell. 10, 439–451 (1988)
10. Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion. IEEE Trans. Pattern Anal. Mach. Intell. 12, 629–639 (1990) 11. Georghiades, A., Belhumeur, P., Kriegman, D.: From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intell. 23, 643–660 (2001) 12. Sim, T., Baker, S., Bsat, M.: The cmu pose, illumination, and expression database. IEEE Trans. Pattern Anal. Mach. Intell. 25, 1615–1618 (2003) 13. USF HumanID 3D Face Database, Courtesy of Sudeep Sarkar, University of South Florida, Tampa, FL
A Regularized Approach to Feature Selection for Face Detection Augusto Destrero1 , Christine De Mol2 , Francesca Odone1 , and Alessandro Verri1 1
DISI, Università di Genova, Via Dodecaneso 35 I-16146 Genova, Italy {destrero,odone,verri}@disi.unige.it 2 Université Libre de Bruxelles, boulevard du Triomphe, 1050 Bruxelles, Belgium [email protected]
Abstract. In this paper we present a trainable method for selecting features from an overcomplete dictionary of measurements. The starting point is a thresholded version of the Landweber algorithm for providing a sparse solution to a linear system of equations. We consider the problem of face detection and adopt rectangular features as an initial representation for allowing straightforward comparisons with existing techniques. For computational efficiency and memory requirements, instead of implementing the full optimization scheme on tens of thousands of features, we propose to first solve a number of smaller size optimization problems obtained by randomly sub-sampling the feature vector, and then recombining the selected features. The obtained set is still highly redundant, so we further apply feature selection. The final feature selection system is an efficient two-stage architecture. Experimental results of an optimized version of the method on face images and image sequences indicate that this method is a serious competitor of other feature selection schemes recently popularized in computer vision for dealing with problems of real time object detection.
1 Introduction Overcomplete, general purpose set of features combined with learning techniques provide effective solutions to object detection [1,2]. Since only a small fraction of features is usually relevant to a given problem, these methods must face a difficult problem of feature selection. Even if a coherent theory is still missing a number of trainable methods are emerging empirically for their effectiveness [3,4,5,1,6]. An interesting way to cope with feature selection in the learning by examples framework is to resort to regularization techniques based on penalty term of L1 type [4]. In the case of linear problems, a theoretical support for this strategy can be derived from [7] where it is shown that for most under-determined linear systems the minimal-L1 solution equals the sparsest solution. In this paper we explore the Lagrangian formulation of the so called Lasso scheme [4] for selecting features in the computer vision domain. This choice is driven by three major considerations: first, a simple algorithm for obtaining the optimal solution in this setting has been recently proposed [8]. Second, this feature selection mechanism to date has not been evaluated in the vision context, context in which many spatially highly correlated features are available. Finally, since this approach to feature selection seems to Y. Yagi et al. (Eds.): ACCV 2007, Part II, LNCS 4844, pp. 881–890, 2007. c Springer-Verlag Berlin Heidelberg 2007
be appropriate for mathematical analysis [8] the gathering of empirical evidence on its merits and limitations can be very useful. To the purpose of obtaining a direct comparison with state-of-the-art methods we decided to focus our research to the well studied case of face detection. In the last years face detection has been boosted by the contribution of the learning from examples paradigm which helps to solve many well known problems related to face variability in images [9,10,11]. Component-based approaches highlighted the fact that local areas are often more meaningful and more appropriate for dealing with occlusions and deformations [12,13]. We use rectangle features [1] (widely retained as a very good starting point for many computer vision applications) as a starting representation. We perform feature selection through a sequence of stages, each of which we motivate empirically through extensive experiments and comparisons. Our investigation shows that the proposed regularized approach to feature selection appears to be very appropriate for computer vision applications like face detection. As a by-product we obtain a deeper understanding of the properties of the adopted feature selection framework: one step feature selection does not allow to obtain a small number of significant features due to the presence of strong spatial correlation between features computed on overlapping supports. As a simple way to overcome this problem we propose to repeat the feature selection procedure on a reduced but still quite large set of features. The obtained results are quite interesting. The final set of features not only leads to the construction of state-of-the-art classifiers but can also be further optimized to build a hierarchical real-time system for face detection which makes use of a very small number of loosely correlated features.
2 Iterative Algorithm with a Sparsity Constraint
In this section we describe the basic algorithm upon which our feature selection method is built. We restrict ourselves to the case of a linear dependence between input and output data, which means that the problem can be reformulated as the solution of the following linear system of equations:

g = Af   (1)
where g = (g1, . . . , gn) is the n × 1 vector containing output labels, A = {Aij}, i = 1, . . . , n; j = 1, . . . , p is the n × p matrix containing the collection of features j for each image i and f = (f1, . . . , fp) the vector of the unknown weights to be estimated. Typically the number of features p is much larger than the dimension n of the training set, so that the system is hugely under-determined. Because of the redundancy of the feature set, we also have to deal with the collinearities responsible for severe ill-conditioning. Both difficulties call for some form of regularization and can be obviated by turning problem (1) into a penalized least-squares problem. Instead of classical regularization such as Tikhonov regularization (based on an L2 penalty term) we are looking for a penalty term that automatically enforces the presence of (many) zero weights in the vector f. Among such zero-enforcing penalties, the L1 norm of f is the only convex one, hence providing feasible algorithms for high-dimensional data. Thus we consider
the following penalized least-squares problem, usually referred to as “lasso regression” [4]:

fL = arg min_f { |g − Af|_2² + 2τ |f|_1 }   (2)
where |f|_1 = Σ_j |f_j| is the L1-norm of f and τ is a regularization parameter regulating the balance between the data misfit and the penalty. In feature selection problems, this parameter also allows one to vary the degree of sparsity (number of true zero weights) of the vector f. Notice that the L1-norm penalty makes the dependence of lasso solutions on g nonlinear. Hence the computation of L1-norm penalized solutions is more difficult than with L2-norm penalties. To solve (2) in this paper we adopt a simple iterative strategy:
fL^(t+1) = Sτ[ fL^(t) + A^T (g − A fL^(t)) ],   t = 0, 1, . . .   (3)
with arbitrary initial vector fL^(0), where Sτ is the following “soft-thresholder”: (Sτ h)_j = h_j − τ sign(h_j) if |h_j| ≥ τ, and (Sτ h)_j = 0 otherwise. In the absence of soft-thresholding (τ = 0) this scheme is known as the Landweber iteration, which converges to the generalized solution (minimum-norm least-squares solution) of (1). The soft-thresholded Landweber scheme (3) has been proven in [8] to converge to a minimizer of (2), provided the norm of the matrix A is renormalized to a value strictly smaller than 1.
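The scheme in (3) is the iterative soft-thresholding of [8]; a compact NumPy sketch is shown below. The rescaling of A and the stopping test follow the description in the text, while everything else (names, iteration budget) is an assumption for illustration.

import numpy as np

def soft_threshold(h, tau):
    # (S_tau h)_j = h_j - tau*sign(h_j) if |h_j| >= tau, else 0
    return np.where(np.abs(h) >= tau, h - tau * np.sign(h), 0.0)

def thresholded_landweber(A, g, tau, max_iter=5000):
    # Iterate (3): f^{(t+1)} = S_tau[ f^{(t)} + A^T (g - A f^{(t)}) ], with f^{(0)} = 0.
    A = A / (np.linalg.norm(A, 2) * 1.01)   # renormalise so that ||A|| < 1
    f = np.zeros(A.shape[1])
    for _ in range(max_iter):
        f_new = soft_threshold(f + A.T @ (g - A @ f), tau)
        # simple stability test; the paper keeps iterating until the change stays
        # below a threshold for 50 consecutive iterations
        if np.linalg.norm(f_new - f) < np.linalg.norm(f_new) / 100:
            return f_new
        f = f_new
    return f   # non-zero entries index the selected features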
3 Setting the Scene The application motivating our work is a face detector to be integrated in a monitoring system installed in our department. For this reason most of our experiments are carried out on a dataset of images collected by the system (Fig. 1). The system monitors a busy corridor acquiring video shots when motion is detected. We used the acquired video frames to extract examples of faces and non faces (non faces are motion areas containing everything but a face). We crop and rescale all images to the size 19×19. Our scenario has few difficult negative examples, but, face examples are quite challenging since faces rarely appear in a frontal position. In addition, the quality of the signal is low due to the fact that the acquisition device is a common video-surveillance camera and the detected object are often affected by motion blur. The dataset we consider for our empirical evaluations is made of 4000 training data, evenly distributed between positive and negative data, 2000 validation data and a test set of 3400 images. We compute rectangle features [1] over different locations, sizes, and aspect ratios of each 19 × 19 image patch, obtaining an overcomplete set of image descriptors of about 64000 features. Given the size of the image description obtained, computing the whole set of rectangle features for each analyzed image patch would make video analysis impossible: some kind of dimensionality reduction has to be performed.
Fig. 1. Examples of positive (left) and negative (right) data gathered with our monitoring system
4 Sampled Version of the Thresholded Landweber Algorithm In this section we present and discuss the method we propose for feature selection. We start by applying the iterative algorithm of Sec. 2 on the original set of features. We consider a problem of the form (1) in which A is the matrix of processed image data, where each entry Aij is obtained from i = 1 . . . n images each of which is represented by j = 1, . . . , p rectangle features. Since we are in a binary classification setting we associate to each datum (image) a label gi ∈ {−1, 1}. Each entry of the unknown vector f is associated to one feature: we perform feature selection looking for a sparse solution f = (f1 , . . . , fp ) : features corresponding to non-zero weights fi are relevant to model the diversity of the two classes. Experimental evidence showed that the choice of the initialization vector is not crucial, therefore we always initialize the weight vector f with zeros: f (0) = 0 . The stopping rule of the iterative process is related to the stability of the solution reached: at the t-th iteration we evaluate |f (t) − f (t−1) |: if it is smaller than a threshold T (that we choose as a proportion of f (t) , T = f (t) /100) for 50 consecutive iterations we conclude that the obtained solution is stable and stop the iterative process. 4.1 Implementation and Design Issues Let us now analyze in detail how we build the linear system: all images of our training set (the size of which is 4000) are represented by 64000 measurements of the rectangle features. Since these measures take real values and matrix A is 4000×64000, the matrix size in bytes is about 1 Gb (if each entry is represented in single precision). For this reason applying the iterative algorithm described in Eq. 3 directly to the whole matrix may not be feasible on all systems: the matrix multiplication needs to be implemented carefully so that we do not keep in primary memory the entire matrix A. One possibility is to compute intermediate solutions with multiple accesses to secondary memory. We implemented a different approach, based on resampling the features set and obtaining many smaller problems, that can be briefly described as follows: we build S feature subsets each time extracting m features from the original set of size p (m << p), we then obtain S smaller linear sub-problems of the type: As fs = g for s = 1, . . . , S, where As is a sub-matrix of A containing the columns relative to the features in s; fs is computed accordingly. As for the choice of the number S of sub-problems and their size we observe that the subset size should be big enough to be descriptive, small enough to handle the matrix easily; thus, we consider subsets of 6400 features (10% of the original size). To choose the number of sub-problems S, we rely on the binomial distribution
and estimate how many extractions are needed so that each feature is extracted at least 10 times with high probability. We observe that, if S = 200, such a probability is 99.6% which is good enough to our purposes — in practice only 256 features over 64000 are extracted less then 10 times. After we build the S sub-problems we look for the S solutions running S iterative methods as in Eq. (3). At the end of the process we are left with S overlapping sets of features. The final set is obtained choosing the features that have been selected each time they appear in the subsets. The experimental analysis that we report in the next section confirms that our resampling strategy leads to results comparable to the original version of the algorithm with a gain in efficiency. As for model selection we tune the parameter on the first sub-problem and consider two alternative methods: – Method 1: choose τ that includes a fixed number of 0s in the solution (or, equivalently, that selects a given number features) in about I iterations. – Method 2: choose τ on the basis of the generalization performance, using crossvalidation: we select τ s leading to classification rates below a certain threshold and then choose among them, the value providing the smallest number of features. 4.2 Experimental Analysis We now report on the feature selection results obtained running the selection process described above on 200 sub-problems built each time extracting 10% of the original set of features. The comparisons we discuss in the paper are based on the classification performance on the test set, using a SVM (for computational reasons we choose a linear kernel): we train the SVM on the training data represented with the set of features output of our feature selection method; we tune the SVM parameters on the validation set, then we build a ROC curve over the test set, varying the SVM offset b. We first analyze the effectiveness of our resampling strategy comparing the results obtained with the ones achieved by solving directly problem (1) on a high performance PC. Fig. 2 shows that there is no loss when applying the resampling strategy. We then compare the two different parameter selection methods. For Method 1 we use the first sub-problem to choose a τ leading to the 90% of zeros in about 1000 iterations. The size of selected feature sets is, on average, 638.7 ± 0.6 on the S subproblems. When applying Method 2 the average size of selected feature sets on the S sub-problems is 194.9 ± 23.4. In both cases, after solving the 200 problems we merge the feature sets obtained, keeping only the features that were selected at each time they appeared in the sub-problem. The final set of selected features, S1 , contains 4636 in the case of fixed number of zeros, 2368 features in the case τ was selected with a tuning procedure. In both cases less than 10% of the original feature set was kept. The selected features are a good synthesis of the original description as confirmed by the classification results on our test set (see Fig. 2) and they maintain all the descriptiveness, representing all meaningful areas of a face. Nonetheless, the number of selected features is still high, higher than the size of the feature set obtained with Adaboost on our same dataset (about 70 features), than the number of principal components (about 400), and also than the intrinsic size of original
(ROC curves: one stage feature selection; one stage feature selection (cross validation); one stage feature selection (on entire set of features).)
Fig. 2. Generalization performance of a linear SVM with different feature sets: a direct solution of the linear problem and the resampling strategy (with the two different parameter choices). The intersection of the ROC with the vertical line gives the hit rate for 0.5% false positives, the intersection with f (x) = 1 − x gives the e.e.r.
input data ( 19 × 19 pixels). To reduce the number of features we first tried to force a higher number of zeros using Method 1. Doing so, we noticed that the features descriptiveness decreases: we tested our feature selection tuning τ with Method 1 and setting 99% of zeros to be obtained in about 1000 iterations: we got 345 features that showed to a visual inspection a higher degree of short range correlation than the previous output sets, while many interesting patterns were missing. Also, as expected, the classification results on the test set dropped of about 3%. The solutions to this problem of redundancy that we investigate in the next section consider a possible subsequent stage, aiming at decreasing the size of S1 .
5 A Refinement of the Solution In principle, the redundancy of image features does not compromise generalization performance, but certainly affects the computational efficiency of the classifier. Dealing with redundancy of information means selecting one or few delegates for each group of correlated features to represent the other elements of the group. As for the choice of the delegate, unlike in other application domains (such as micro-array data analysis) in most computer vision problems one could choose a random delegate for the correlated feature sets. Image features are not important per se but for the appearance information that they carry, which is resemblant to all other members of their group. On this respect a remark is in order. The face features that we compute are correlated, not only because of the intrinsic short range correlation of all natural images, or because they are an overcomplete description, but also because of dependencies related to the class of face images (which contain multiple occurrences of similar patterns at different locations, for example the eyes). Correlation due to the representation chosen produces redundant descriptions, while correlation due to the class of interest may carry important information on its peculiarities.
(ROC curves. Left legend: Method 1 on I and II stage; Method 1 on I stage, method 2 on II stage; Method 2 on I and II stage. Right legend: Two stages feature selection; Two stages + correlation analysis.)
Fig. 3. Evaluations of our strategy Left: results obtained with different model selections for the first and second stage. Right: results obtained with and without a 3rd step of correlation analysis.
5.1 Thresholded Landweber Once Again In order to obtain a smaller set of meaningful features we further apply the feature selection procedure to set S1 looking for a new, sparser, solution. The new data matrix is (0) obtained selecting from A the columns corresponding to S1 , and fS1 is initialized again to a vector of zeros. As for the choice of the parameter tuning strategy we considered various combinations of Method 1 and 2: our final choice was Method 1 for the first stage and Method 2 for the second. This choice is not so much related to performance (the results obtained with the various methods are very similar – see Figure 3, left) but to two important issues: (A) Method 2 is computationally expensive for the first stage (B) Method 2 takes explicitly into account generalization and therefore is more appropriate for the second stage. 5.2 Experimental Analysis As pointed out earlier, the generalization performances are similar for the various model selection methods analyzed, with a slight advantage for the combination of Method 1 for the first stage and Method 2 for the second. Fig. 3 (left) reports the ROC curves. The selection procedure leaves us with a set S2 of 247 features. To compare our approach with other dimensionality reduction methods we first consider the classical PCA. Fig. 4 (left) compares the classification results obtained with PCA on the set S1 and the ones obtained with a second step of regularized feature selection. Our set of features allows us to obtain higher classification rates and, also, only 247 rectangle features per each test need to be computed, instead than combinations of about 2400 features. Then we consider the Adaboost feature selection proposed in [1] 1 . Our comparison focuses on the quality of the selected features more than on the classifier: again, we evaluate the goodness of features with respect to their generalization ability, than in both cases we check how good they are on a classification task, for a fixed classifier (SVM). Figure 4 (right) shows a comparison between classification performances obtained with our features and Adaboost features selected on our same set of data. Since in [1] feature selection and training are performed on a unique Adaboost loop we verified the fairness of our comparison by including in our plot the false 1
We used the Intel OpenCV implementation http://www.intel.com/
(ROC curves. Left legend: 2 stages feature selection; PCA. Right legend: 2 stages feature selection; 2 stages feature selection + correlation; Viola+Jones feature selection using our same data; Viola+Jones cascade performance.)
Fig. 4. Comparison of our second selection stage against other dimensionality reduction methods applied to set S1 . Left: PCA; Right: Adaboost features obtained from our same pool of data (see text).
positive / hit ratio obtained with a cascade of Adaboost classifiers: they are marked as a black dot. Notice that these results are in line with the curve obtained with Adaboost features and the SVM classifier. The impressive performance obtained by the face detection in [1] seems to rely on the use of a very big set of negative examples. At each stage of the cascade the current classifier is trained with new negative examples, while the ones that were correctly classified in the previous stages are discarded.
6 The Final System This section is devoted to describing the final face detection system based on our feature selection strategy. We start by describing a third selection stage allowing us to obtain a very small set of features to the price of a negligible reduction in performance. In order to obtain a minimal set of features carrying all the information without redundancy, we apply a further selection on the features that survived the previous two stages, based on choosing only one delegate for groups of short range correlated features. Our evaluation is based on the principle of discarding features that are of the same type, correlated, and spatially close to a feature already included in the final set. We evaluate the correlation between two features using the well known Spearman’s correlation test. Using this further analysis as a third stage of our selection procedure we obtain very compact descriptions with little degradation of the results. Fig. 5 shows the final set S3 of 42 features. The classification results are reported in Fig. 3 (right). The features in S3 is used are used to build a cascade of small SVMs that is able to process video frames in real time. The cascade of classifiers analyzes an image according to the usual coarse-to-fine approach. Each image patch at a certain location and scale is the input of the classifiers cascade: if the patch fails one test it is immediately discarded; if it passes all tests a face is detected. Each classifier of the cascade is built starting by 3 mutually distant features, training a linear SVM on such features, and adding further features until a target performance is obtained on a validation set. Target performance is chosen so that each classifier will not be likely to miss faces: we set the minimum hit rate to 99.5% and the maximum false positive rate to 50%.
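The correlation-based pruning described at the beginning of this section can be sketched as a greedy filter over the surviving features. The SciPy call is standard; the thresholds and the notion of spatial closeness are assumptions for illustration, not the values used by the authors.

import numpy as np
from scipy.stats import spearmanr

def prune_correlated(values, positions, types, corr_thr=0.8, dist_thr=3.0):
    # values:    (n_samples, n_features) feature responses on a validation set
    # positions: (n_features, 2) feature centre coordinates in the 19x19 patch
    # types:     (n_features,) rectangle-feature type labels
    kept = []
    for j in range(values.shape[1]):
        redundant = False
        for k in kept:
            if types[j] != types[k]:
                continue
            if np.linalg.norm(positions[j] - positions[k]) > dist_thr:
                continue
            rho, _ = spearmanr(values[:, j], values[:, k])
            if abs(rho) > corr_thr:
                redundant = True   # a same-type, nearby, correlated delegate is already kept
                break
        if not redundant:
            kept.append(j)
    return kept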
Fig. 5. The 42 features that are left after a third stage of correlation analysis (S3)

Table 1. The performance of the face detection system on a controlled set of images (top row) and on a 5 hours unconstrained live video (bottom row)

test data      false pos rate   hit rate
test images    0.1 %            94 %
5 hours live   4 × 10−7 %       76 %
Assuming a cascade of 10 classifiers, we would get as overall performance: HR = 0.995^10 ≈ 0.9 and FPR = 0.5^10 ≈ 3 × 10^−5 [1]. Currently our face detector is running as a module of a monitoring system installed in our department. Table 1 shows its detection performance in two different situations. The first row of the table refers to the results obtained on images acquired in our department in controlled conditions, while the second row refers to uncontrolled detection: we manually labeled the events occurring in a 5 hours recording of a busy week day; the recording was entirely out of our control and it included changes of the scene, people stopping for unpredictable time, lateral faces. Notice that at run time, the classifiers have been tuned so as to minimize the number of false positives. As a final remark we observe that the amount of data analyzed is huge since, on average, the detector analyzes 20000 patches per image or frame. Our system running in various environmental conditions can be appreciated on the demo video uploaded as supplemental material 2. We also tested our face detection method on benchmark datasets — test set A and C of MIT-CMU and the BioID dataset — obtaining results consistently above an Adaboost detector trained with our same dataset. Keeping the false positive rate fixed, the hit rate gain with our approach is on average 8% on the various test sets.
7 Discussion This work is built around an iterative algorithm implementing the so-called Lasso scheme that produces a sparse solution to a linear problem and can be used as a feature selection procedure. We explored its effectiveness on the computer vision domain taking into account the peculiarity of image features. The final strategy that we devised includes two main stages: in the first stage, for computational reasons, we proposed to find a solution of a big problem by solving a number of smaller optimization problems 2
The video will be available at the url: ftp://ftp.disi.unige.it/person/DestreroA/Publications/ACCV07/ACCV.mpg
and returns a smaller but still highly redundant set of features; this stage could be further optimized by adopting parallel computation. The second stage is based on applying again the selection algorithm to the features resulting from the first stage; the obtained set is, on average, 0.5% the size of the initial representation. We experimentally validated its appropriateness on a face classification task obtaining very good results. A further third stage, added for optimizing the system, is based on finding a very small set of uncorrelated features that is suitable for real-time processing to the price of a very limited decrease in performance. We tested the latter description on a real-time face detection system based on a cascade of SVM classifiers. The generality of our method, which is entirely data driven, is confirmed by very promising preliminary results that we are obtaining on eye detection. A real time monitoring system implementing our method both for face and eye detection is currently running in our department [14]. For completeness we tested our real-time face detector on publicly available data obtaining comparable results to state-of-the-art approaches trained on our same dataset.
References
1. Viola, P., Jones, M.J.: Robust real-time face detection. International Journal of Computer Vision 57(2), 137–154 (2004)
2. Papageorgiou, C., Poggio, T.: A trainable system for object detection. International Journal of Computer Vision 38(1), 15–33 (2000)
3. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal of Machine Learning Research 3, 1157–1182 (2003)
4. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B 58(1), 267–288 (1996)
5. Weston, J., Elisseeff, A., Scholkopf, B., Tipping, M.: The use of zero-norm with linear models and kernel methods. Journal of Machine Learning Research 3 (2003)
6. Zhu, J., Rosset, S., Hastie, T., Tibshirani, R.: 1-norm support vector machines. In: Advances in Neural Information Processing Systems 16, MIT Press, Cambridge (2004)
7. Donoho, D.: For most large underdetermined systems of linear equations, the minimal l1-norm near-solution approximates the sparsest near-solution (2004)
8. Daubechies, I., Defrise, M., Mol, C.D.: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. on Pure Appl. Math. 57 (2004)
9. Yang, M.H., Kriegman, D.J., Ahuja, N.: Detecting faces in images: a survey. IEEE Trans. on Pattern Analysis and Machine Intelligence 24(1), 34–58 (2002)
10. Osuna, E., Freund, R., Girosi, F.: Training support vector machines: an application to face detection. In: CVPR (1997)
11. Schneiderman, H., Kanade, T.: A statistical method for 3D object detection applied to faces and cars. In: International Conference on Computer Vision (2000)
12. Mohan, A., Papageorgiou, C., Poggio, T.: Example-based object detection in images by components. IEEE Trans. on PAMI 23(4), 349–361 (2001)
13. Ullman, S., Vidal-Naquet, M., Sali, E.: Visual features of intermediate complexity and their use in classification. Nature Neuroscience 5(7) (2002)
14. Destrero, A., Mol, C.D., Odone, F., Verri, A.: A regularized approach to feature selection for face detection. Technical Report DISI-TR-07-01, Dipartimento di Informatica e Scienze dell'Informazione, Università di Genova (2007)
Iris Tracking and Regeneration for Improving Nonverbal Interface
Takuma Funahashi, Takayuki Fujiwara, and Hiroyasu Koshimizu
School of Information Science and Technology, Chukyo University, 101, Tokodachi, Kaizu-cho, Toyota, Aichi, 470-0393, Japan
[email protected], {tfuji,hiroyasu}@sist.chukyo-u.ac.jp
Abstract. In this study, we discuss the quality of teleconferencing, especially with respect to "eye contact". Recently, video conference systems have become easy to use, even on camera-equipped mobile phones, and many people use them in daily life. Since a user tends to look at the face of his partner on the monitor rather than at the camera, he usually fails to send eye-contacted facial images of himself to his partner, and vice versa. We pay attention to this disagreement of eye contact in teleconferencing, caused by the separation between the input camera and the output monitor. We propose an Eye-Contact Camera System that generates eye-contacted motion images for the receiver. In this system, the iris contour is extracted after face region extraction, the vertical and horizontal directions of the glance are calculated from the relative positions of the monitor, camera and receiver, and finally the iris center coordinates are shifted in the image so that each partner appears to be looking at the other. We implemented the system on a notebook PC with a web camera to evaluate its usability. Keywords: Eye-Contact, Eye Tracking, Iris Evaluation, Iris Regeneration.
1 Introduction
We introduce a system for supporting teleconferencing. Recently, teleconference systems have become easy to use, even on camera-equipped mobile phones, and many people use them in daily life. As shown in Fig. 1, since a user tends to look at the face of his partner on the monitor rather than at the camera, he usually fails to send eye-contacted facial images of himself to his partner, and vice versa.
Fig. 1. Glance disagreement on teleconference
1.1 Previous Research (A Brief Survey)
The hardware approaches of Ishii et al. [1] and Ichikawa et al. [2] solve the eye-contact problem by placing a camera behind a specialized screen. However, it is difficult for users to construct such complicated systems. Yang et al. [3] proposed a virtual face model that generates virtual 3D face images using stereo matching with a pair of cameras. Yip [4] proposed a face angle rectification method that obtains eye-contacted images using a neural network, an affine transform and a single camera. These software approaches are another possibility for solving the problem, but usability and cost issues remain.
1.2 Proposed Method
The basic idea to remedy this communication degradation is to regenerate the facial image by changing the direction of the irises within the original facial image. The system is composed as follows: CPU: Pentium 1.8 GHz, RAM: 1.0 GB, Camera: 0.3 Mpixel CMOS sensor. The system flow is shown in Fig. 2. The system captures facial images from a web camera, a face region extraction procedure is applied to the images, and the iris is recognized and evaluated within the limited eye region.
Fig. 2. System flow
2 Iris Recognition
2.1 Face and Eye Region Extraction
The images, in 24-bit color and QVGA (320x240) size, were taken in an indoor environment. First, we extract a skin region using color information, based on the HSV color table of [5]. Concretely, let the color transform be Eq. (1), from which a hue image, a saturation image and a value image are generated:

  H = tan^{-1}(C_1 / C_2),  S = sqrt(C_1^2 + C_2^2),  V = -0.3R - 0.59G + 0.89B   (1)
  where C_1 = R - Y = 0.7R - 0.59G - 0.11B and C_2 = B - Y = -0.3R - 0.59G + 0.89B.
Procedures of erosion, dilation and labeling are utilized to detect the skin region. Fig. 3b shows an example of the face region extracted from the image given in Fig. 3a.
The eye region is limited by using the extracted facial region. Next, contrast improvement, horizontal edge emphasis and binarization are applied to the extracted facial region. Finally, the number of pixels is counted from the center of the face region to threshold the region, and the eye region is located as shown in Fig. 3c.
(a) Original Image
(b) Skin Region
(c) Eye region
Fig. 3. Face and eye region extraction
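A minimal sketch of the face-region step described above, assuming OpenCV and NumPy. The chrominance components follow Eq. (1); the skin thresholds and the use of the standard luminance for Y are placeholders, since the paper relies on the HSV table of [5], which is not reproduced here.

```python
# Sketch under assumptions: skin thresholds and luminance definition are ours.
import cv2
import numpy as np

def skin_region(bgr):
    b, g, r = [c.astype(np.float32) for c in cv2.split(bgr)]
    y = 0.3 * r + 0.59 * g + 0.11 * b            # assumed luminance Y
    c1, c2 = r - y, b - y                        # chrominance components of Eq. (1)
    h = np.degrees(np.arctan2(c1, c2))           # hue
    s = np.sqrt(c1 ** 2 + c2 ** 2)               # saturation
    mask = ((h > 90) & (h < 160) & (s > 10)).astype(np.uint8) * 255  # assumed skin range
    kernel = np.ones((3, 3), np.uint8)
    mask = cv2.dilate(cv2.erode(mask, kernel), kernel)                # erosion, dilation
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)      # labeling
    if n < 2:
        return None
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])              # keep largest blob
    return (labels == largest).astype(np.uint8) * 255
```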
2.2 Iris Recognition
We propose a method for recognizing irises in gray-level images using the Hough transform for circle detection. Several candidate iris pairs are first extracted by applying the Hough transform for circle detection to a binary image; the binarization method follows [6]. The voting ranges of the parameter space (a, b, r) are limited in order to reduce the computational cost and to improve performance. Parameters a and b indicate the center of an iris, and parameter r indicates its radius. The best pair of irises is selected from the candidates according to the following criteria: the number of votes is large, the positional relation between the left and right irises is roughly horizontal, and the radius of the left iris is close to that of the right. Fig. 4 shows an example of the iris selection process. Fig. 4a shows the candidate circles; among them, the pair with the shortest segment connecting the left and right iris is adopted. As shown in Fig. 4b, the circle drawn in the extraction result indicates the iris position, and the iris is extracted accurately in the image.
(a) Candidates of irises
(b) Selected pair of irises
Fig. 4. Iris recognition
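A rough sketch of the circle detection and pair selection described above, assuming OpenCV. cv2.HoughCircles stands in for the authors' own Hough implementation (which also limits the voting ranges), and the scoring weights for pair selection are assumptions.

```python
# Sketch under assumptions: radius limits and pair-scoring weights are illustrative.
import cv2
import numpy as np

def detect_iris_pair(eye_gray, r_min=5, r_max=20):
    circles = cv2.HoughCircles(eye_gray, cv2.HOUGH_GRADIENT, dp=1.5,
                               minDist=r_min, param1=80, param2=15,
                               minRadius=r_min, maxRadius=r_max)
    if circles is None:
        return None
    circles = circles[0]                      # each row: (a, b, r) = centre x, centre y, radius
    best, best_score = None, np.inf
    for i in range(len(circles)):
        for j in range(i + 1, len(circles)):
            (ax, ay, ar), (bx, by, br) = circles[i], circles[j]
            # Prefer pairs that are roughly horizontal, of similar radius, and not far apart.
            score = abs(ay - by) + abs(ar - br) + 0.1 * abs(ax - bx)
            if score < best_score:
                best, best_score = (circles[i], circles[j]), score
    return best
```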
2.3 Evaluation of the Iris
Irises are recognized by the method of Section 2.2. However, an eyebrow may be mistakenly extracted as an iris in this processing. Therefore, it is necessary to limit the eye region more precisely and to check whether a pair of circles corresponds to irises or not. The eye regions are limited using the iris positions in the previous frames. Each new region is determined by Eq. (2), based on the radius of the iris. The iris center positions of the previous 5 frames are saved, and their average is calculated by Eq. (3). Fig. 5 shows an example of the limited eye regions.

  EyeRegion_L = circle_l_length.x × circle_l_length.y,  EyeRegion_R = circle_r_length.x × circle_r_length.y   (2)
  (each length is computed from the radius of the iris)

  (x_i^ave, y_i^ave) = (1/5) Σ_{i=N-5}^{N} (x_i, y_i)   (3)
  (N: frame number)
Fig. 5. Adaptive limitation of eye regions led by the previous iris positions
The current eye region is compared with the previous eye region, and it is evaluated whether the region could include the irises or not. The evaluation method is as follows:
Step 1. Calculate the difference between the average gray values in the left and right eye regions.
Step 2. Compare the difference between the left and right Y-axis coordinate values.
Step 3. Count the number of pixels of the extracted iris by Eq. (4).
If the sum of these evaluation values is greater than the threshold, the true irises are judged not to be included in this region.
  P_th = Σ_{i=-3r}^{3r} Σ_{j=-3r}^{3r} (i, j)   (4)
  P_th < 4r^2 ⇒ others;  4r^2 < P_th < 10r^2 ⇒ eye;  10r^2 < P_th ⇒ eyebrow
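A small sketch of the checks in Eqs. (3) and (4), assuming `binary` is a 0/1 image of extracted (dark) pixels and `history` is a list of past iris centres; the variable names are ours.

```python
# Sketch under assumptions: data layout and names are illustrative.
import numpy as np

def averaged_centre(history):
    """Average iris centre over the previous five frames, as in Eq. (3)."""
    return tuple(np.mean(history[-5:], axis=0))

def classify_candidate(binary, cx, cy, r):
    """Count extracted pixels in a +/-3r window around the candidate, as in Eq. (4)."""
    window = binary[max(0, cy - 3 * r):cy + 3 * r + 1,
                    max(0, cx - 3 * r):cx + 3 * r + 1]
    p_th = int(window.sum())
    if p_th < 4 * r * r:
        return "other"
    return "eye" if p_th < 10 * r * r else "eyebrow"
```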
3 Eye Contact Image Generation Method
3.1 Geometric Model of Eye Contact
To model the video conference environment shown in Fig. 6, the parameters V and H specify the vertical and horizontal relation between the camera and the monitor, and the parameter Duc specifies the distance between the camera and the user. This is depicted in Fig. 7. We imagine that the iris moves from the previously extracted coordinate (x_in, y_in) to a new coordinate (x_out, y_out). This new coordinate (x_out, y_out) is easily calculated by Eq. (5) and Eq. (6), characterized by the parameters θ_ver and θ_hor that express the spatial relationship among the person, the camera and the monitor. In this expression, the functions Δx and Δy convert the angle θ into a number of pixels in the facial image.
  θ_ver = tan^{-1}(V / Duc),  θ_hor = tan^{-1}(H / Duc)   (5)
  x_out = x_in + Δx(θ_ver), Δx(θ_ver) = θ_ver / 10;  y_out = y_in + Δy(θ_hor), Δy(θ_hor) = θ_hor / 10   (6)

Fig. 6. Geometric model of the video conference environment: (a) vertical position, (b) horizontal position
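A minimal sketch of Eqs. (5) and (6), assuming V, H and Duc are given in the same length unit, the angles are taken in degrees, and the division by 10 converts the angle to a pixel shift, as the equations above suggest.

```python
# Sketch under assumptions: angle unit and the theta/10 pixel conversion are interpretations.
import math

def corrected_iris_centre(x_in, y_in, V, H, Duc):
    theta_ver = math.degrees(math.atan2(V, Duc))   # Eq. (5)
    theta_hor = math.degrees(math.atan2(H, Duc))
    x_out = x_in + theta_ver / 10.0                # Eq. (6)
    y_out = y_in + theta_hor / 10.0
    return x_out, y_out
```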
3.2 Segmentation of the Eye Region
The center coordinate (x_out, y_out) and the radius r of the extracted iris are used to regenerate the moved iris. Beforehand, the eye regions are recognized using the extracted eye center coordinate (x_in, y_in). The eye regions (sclera, iris, skin and the contour of the eyelid, as shown in Fig. 7) are extracted for both eyes, and based on the recognition results the moved iris is generated within the contour of the eyelid.
Fig. 7. Eye region segment for iris regeneration
3.3 Regeneration of the Irises
The pixels (s_n, t_n) in the region of the new iris are painted black where the distance d between (x_out, y_out) and (s_n, t_n) is less than or equal to the radius r (as shown in Fig. 8). All pixels (s_n, t_n) within the contour of the eyelid are painted white where the distance is greater than r. The black and white colors are decided by Eq. (7):
  B_{i,j} = black + d_n × γ,  W_{i,j} = white + d_n × γ,  d_n = sqrt((s_n - x_out)^2 + (t_n - y_out)^2),  γ: constant   (7)
  black = min{F_{i,j} | f_{i,j} = 1},  white = max{F_{i,j} | f_{i,j} = 0}
  (F_{i,j}: pixel value, f_{i,j}: label number in Fig. 7)
After this procedure, smoothing is applied to complete the regenerated irises. The procedure is shown in Fig. 9.
Fig. 8. Section of iris regeneration region
Fig. 9. Flow of color selection
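A rough sketch of the repainting rule in Eq. (7), assuming `gray` is the greyscale eye image, `labels` marks the iris (1) and sclera (0) labels as in Fig. 7, `inside` is a boolean mask of the eyelid interior, and gamma is the constant of Eq. (7); these names and the default gamma are assumptions.

```python
# Sketch under assumptions: data layout, names and gamma value are illustrative.
import numpy as np

def regenerate_iris(gray, labels, inside, x_out, y_out, r, gamma=0.5):
    out = gray.astype(np.float32).copy()
    black = gray[(labels == 1) & inside].min()      # darkest iris-labelled pixel
    white = gray[(labels == 0) & inside].max()      # brightest sclera-labelled pixel
    ys, xs = np.nonzero(inside)
    d = np.sqrt((xs - x_out) ** 2 + (ys - y_out) ** 2)
    out[ys[d <= r], xs[d <= r]] = black + d[d <= r] * gamma   # new iris pixels, Eq. (7)
    out[ys[d > r], xs[d > r]] = white + d[d > r] * gamma      # new sclera pixels, Eq. (7)
    return np.clip(out, 0, 255).astype(np.uint8)
```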
4 Experiment and Consideration
4.1 Iris Recognition
A performance evaluation was carried out; the experimental results of the iris recognition are presented below. Three data sets were prepared by changing the distance between the camera and the face, under ordinary lighting without any special light:
(a) 30 cm (looking into the monitor)
(b) 50 cm (typing close to the keyboard)
(c) 80 cm (typing apart from the keyboard)
Fig. 10 shows example results at each distance for 25 frontal facial images (QVGA, 320×240), and the iris recognition results are shown in Table 1.
(a), (b), (c) are captured from 30 cm, 50 cm and 80 cm distance, respectively.
Fig. 10. Example of iris recognition results

Table 1. Iris recognition results
  Distance   30 cm         50 cm         80 cm
  Success    18/25 (72%)   21/25 (84%)   16/25 (64%)
  Failure    7/25 (28%)    4/25 (16%)    9/25 (36%)

Evaluation:
At a distance of 30 cm, edge extraction sometimes failed because of light reflected from the monitor. At 50 cm, the reflected light was reduced and there was no problem with edge extraction. At 80 cm, there were many iris recognition failures due to the lack of resolution and to glasses. Real-time processing is feasible, since the processing speed was about 20 fps.
4.2 Iris Evaluation
A few experimental results of the iris evaluation are presented below. Two movie data sets, captured without any special lighting, were prepared. The performance was checked by comparing the facial images generated by the proposed method with the original images. Figs. 11 and 12 show examples of the experimental results.
Fig. 11. Iris evaluation results for movie A: (a) original method, (b) proposed method (per-frame success (○) or failure (×) for the left eye, right eye and, in (b), the eye region, plotted against the frame number)
Fig. 12. Iris evaluation results for movie B: (a) original method, (b) proposed method (same plotting conventions as Fig. 11)
With the original method, the results are unstable in iris position and the success rate is low, because the iris contour extraction could not be carried out without misrecognition. With the proposed method, the results are stable and the success rate is high. We also confirmed that the number of processed frames decreases compared with the original method: frames in which an eyebrow or other region would be misrecognized as the iris are rejected, and the iris is accepted only in frames where the evaluation is completed successfully.
4.3 Eye Contact Image
Fig. 13 shows the result of generating a facial image whose eye gaze is directed at the camera (i.e., at the partner). In this experiment, since the monitor was set above the camera, the eye gaze was shifted toward the center so that it came to the position of the camera.
(a) Original Image
(b) Eye-contacted Image Fig. 13. Eye contact image result by iris regeneration
Since the temporal relationship between successive image frames is not yet exploited in this algorithm, the robustness of this application could still be improved.
4.4 Usability Test
Twenty-seven persons participated in a usability test (5-point evaluation) of the eye-contact camera system. Fig. 14 shows the results of this questionnaire survey. The eye-contact index improved by 14% (from 3.07 for the original images to 3.51 for the eye-contacted images). To evaluate side effects of the proposed method, we also examined the index of natural eye motion, shown in Fig. 14b; no significant degradation was found. As a next step, we expect to introduce a model of eyelid motion concurrent with the motion of the irises.
(a) Eye-Contact
(b) Naturality of Motion Fig. 14. Result of usability test
5 Conclusion
In this paper, based on image processing techniques for iris recognition, we proposed a system for generating eye-contacted facial images. Through the system development, we demonstrated a new facial interface medium for networked environments that can guarantee eye contact. As future work, the proposed system is being refined so that real applications can be realized, and the recognition of facial parts other than the irises is being introduced toward a complete digital model of the face.
References
1. Ishii, H., Kobayashi, M., Grudin, J.: Integration of interpersonal space and shared workspace: ClearBoard design and experiments. ACM Transactions on Information Systems 11(4), 349–375 (1993)
2. Ichikawa, Y., Okada, K., Jeong, G., Tanaka, S., Matsushita, Y.: MAJIC Videoconferencing System: Experiments, Evaluation, and Improvement. In: Proc. of ECSCW 1995, pp. 279–292 (1995)
3. Yang, R., Zhang, Z.: Eye Gaze Correction with Stereovision for Video-teleconference. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, pp. 479–494. Springer, Heidelberg (2002)
4. Yip, B.: Face and Eye Rectification in Video Conference Using Artificial Neural Network. In: Proc. of IEEE ICME 2005, IEEE Computer Society Press, Los Alamitos (2005)
5. Hagai, N., Hongo, H., Kato, K., Yamamoto, K.: Method of Eyes and Mouth detection which is robust to changing shapes. Technical Report of IEICE, PRMU97-159, pp. 55–60 (1997)
6. Funahashi, T., Yamaguchi, T., Tominaga, M., Koshimizu, H.: Facial Parts Recognition by Hierarchical Tracking from Motion Image and Its Application. IEICE TRANS. on Information and Systems E87-D(1), 129–135 (2004)
Face Mis-alignment Analysis by Multiple-Instance Subspace
Zhiguo Li^1, Qingshan Liu^{1,2}, and Dimitris Metaxas^1
1 Department of Computer Science, Rutgers University
2 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
{zhli,qsliu,dnm}@cs.rutgers.edu
Abstract. In this paper, we systematically study the effect of poorly registered faces on the training and inferring stages of traditional face recognition algorithms. We then propose a novel multiple-instance based subspace learning scheme for face recognition. In this approach, we iteratively update the subspace training instances according to diverse densities, using class-balanced supervised clustering. We test our multiple instance subspace learning algorithm with Fisherface for the application of face recognition. Experimental results show that the proposed learning algorithm can improve the robustness of current methods with poorly aligned training and testing data.
1 Introduction
Face recognition has been one of the most successful applications of image analysis due to its wide range of potential commercial, security and entertainment applications. Depending on the type of features used, face recognition algorithms can be classified into two categories: shape-based approaches, such as elastic bunch graph matching [1], and appearance-based approaches, such as eigenfaces [2,3] and Fisherfaces [4,5]. Accurate face alignment is critical to the performance of both appearance-based and shape-based approaches. However, current feature extraction techniques are still not reliable or accurate enough, and it is unrealistic to expect localization algorithms to always produce very accurate results under widely different lighting, pose and expression conditions. To obtain a better recognition rate, we need to improve the robustness of existing recognition algorithms. To illustrate the effect of face alignment error on face recognition performance, we use the FERET face database [6], for which ground truth alignment information is available. We intentionally add perturbations to the ground truth by moving the left and right eye center annotations by a few random pixels. Figure 1 shows that rotation perturbations affect the recognition performance most, and translation perturbations have the smallest effect. Overall, we can see that even small perturbations can reduce the recognition rate significantly.
Fig. 1. Recognition rate change for Fisherface w.r.t. (a) rotation, (b) scale and (c) translation perturbations (each panel plots the rank-1 recognition rate against the degree of rotation, the eye distance scaling and the eye center translation, respectively)
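A hedged sketch of how perturbed alignments of the kind discussed above can be generated from a pair of eye-centre annotations; the perturbation ranges and the function name are assumptions, not the values used in the paper.

```python
# Sketch under assumptions: perturbation ranges are illustrative placeholders.
import numpy as np

def perturb_eye_centres(left, right, n_instances=25, rng=None,
                        max_shift=2.0, max_scale=0.05, max_rot_deg=3.0):
    rng = rng or np.random.default_rng(0)
    left, right = np.asarray(left, float), np.asarray(right, float)
    mid = (left + right) / 2.0
    bag = []
    for _ in range(n_instances):
        ang = np.radians(rng.uniform(-max_rot_deg, max_rot_deg))
        s = 1.0 + rng.uniform(-max_scale, max_scale)
        R = s * np.array([[np.cos(ang), -np.sin(ang)], [np.sin(ang), np.cos(ang)]])
        t = rng.uniform(-max_shift, max_shift, size=2)
        # Rotate/scale about the mid-point, then translate: one perturbed instance.
        bag.append(((R @ (left - mid)) + mid + t, (R @ (right - mid)) + mid + t))
    return bag  # each entry gives perturbed (left, right) eye centres for cropping
```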
One intuitive way to make classifiers robust to image alignment errors is to augment the training set by adding random perturbations to the training images. By adding noisy but identifiable versions of the given examples, we can expand the training data and improve the robustness of the feature extraction against a small amount of noise in the input; the augmented training set models small image alignment errors. Another way is to add perturbations to the probe images during the testing stage. Adding perturbations to the training set, however, requires that we know the ground truth beforehand.

In multiple-instance learning, the task is to learn a classifier given positive and negative bags of instances. Each bag may contain many instances. A bag is labeled positive if at least one of the instances in it is positive; a bag is labeled negative only if all instances in it are negative. The face alignment problem can be explicitly formulated as a multiple-instance learning problem: we take the whole image as a bag, and all possible sub-windows within it as instances. If an image contains a face, we label it as a positive bag, since we know that at least one sub-window contains the face, but we do not know exactly which one.

In this paper, we systematically investigate the effect of mis-aligned face images on face recognition systems. To make classifiers robust to the unavoidable face registration error, we formulate the face alignment problem within the multiple-instance learning framework. We then propose a novel multiple-instance based subspace learning scheme for face recognition tasks. In this algorithm, noisy training image bags are modeled as a mixture of Gaussians, and we introduce a supervised clustering method to iteratively select better subspace learning samples. Compared with previous methods, our algorithm requires accurately aligned images for neither training nor testing, and can achieve the same or better performance than manually aligned face recognition systems. Throughout this paper, we use the term "noisy images" to denote poorly aligned images.
1.1 Related Work
Researchers have been trying to overcome the sensitivity of subspace based face recognition algorithms to image alignment errors. Martinez [7] proposed
a method to learn a subspace that represents the error for each training image. Shan et al. [8] studied the effect of the mis-alignment problem and, for each training image, generated several perturbed images to augment the training set and thus model the mis-alignment errors. Compared with the aforementioned work, our algorithm requires ground truth for neither the training set nor the testing set. Multiple-instance learning, on the other hand, was used in [9] for face detection: Viola et al. formulated face detection as a multiple-instance learning problem and adapted AnyBoost to the multiple-instance setting (MILBoost). Several other multiple-instance learning methods have been proposed, such as diverse density [10] and MI-SVM [11]. The diverse density algorithm tries to find an area with both a high density of positive points and a low density of negative points. kNN was adapted to multiple-instance learning using the Hausdorff distance in the work of Wang et al. [12].
2 Multiple-Instance Subspace Learning
2.1 Motivation
Given a limited set of noisy training images, we augment the training set by perturbing the training images. The augmented, larger training set will normally cover more variations for each subject and thus model the alignment error; however, it can also introduce some very poorly registered faces into the training set, which has a negative effect on the learning process.
Fig. 2. Bags of Instances
Figure 2 shows two noisy training images, (1) and (2). From the noisy images, we generate two bags, each with multiple instances, denoted by (a), (b), ..., (e) in the figure. While images (b) and (d) will certainly benefit the training process, image (e) will most likely confuse the classifier, since it could be more similar to another subject. As will be shown later, these
very poorly registered images will indeed increase the recognition error. Thus, given noisy training images, we must build an algorithm that automatically selects the "good" perturbed images from the training bags and excludes the very poorly registered images from being selected.
2.2 Approximating the Constrained k-Minimum Spanning Tree
Excluding very poorly registered images from the noisy bags can be formulated within the multiple-instance learning framework. One assumption is that good perturbed images from the same subject tend to be near each other: high-density areas correspond to good perturbed images, while low-density areas correspond to poorly perturbed images, which are the bad images we want to exclude from the training set. As shown in Figure 2, the good perturbed images will lie in the intersection area of the two bags. The idea is very similar to the diverse density approach used by Maron [10] for multiple-instance learning. Since the perturbed noisy images have an irregular distribution, we use a non-parametric method to find the high-density area. Our non-parametric method is based on the k-minimum spanning tree [13]: given an edge-weighted graph G = (V, E), it consists of finding a tree in G with exactly k < |V| - 1 edges such that the sum of the edge weights is minimal. In our face recognition application, the nodes are the face image instances, and the edge weights are the Euclidean distances between instances. The problem is known to be NP-complete, but we do not need an exact solution; we use a heuristic to find an approximate k-minimum spanning tree. First, for each instance, we build its k-nearest neighbor graph. Among all the instances, we find the one whose k-nearest neighbor graph has minimum total weight. Since the number of neighbors is fixed at k, the instance with the minimum-weight k-nearest neighbor graph has the highest density and thus corresponds to the good perturbed image area. Although some noisy images still exist in this high-density area, they are identifiable and useful to our learning algorithm. We also add the constraint that at least one instance from each bag is included during the base selection phase. The idea is similar to that of MI-SVM: for every positive bag, MI-SVM initializes with the average of the bag and computes the QP solution; with this solution, the responses of all examples within each positive bag are computed, and the instance with maximum response is taken as the selected training example in that bag. In our k-nearest neighbor graph algorithm, if some bag is far from the others, using only the k-nearest neighbor graph to select training images may not include any instance from this isolated bag, so we force the algorithm to accept at least one instance from every bag: if all the instances in a bag fall outside the most compact k-nearest neighbor graph, we select the instance with the minimum distance to that graph. The iterative multiple-instance based Fisherface [4,5] learning procedure is shown in Algorithm 1; it normally takes 2-3 iterations to converge.
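A simplified sketch of the base-selection heuristic just described: keep the most compact k-nearest-neighbour neighbourhood among the projected instances and force at least one instance per bag. This is written for clarity under our own data-layout assumptions, not the authors' exact implementation.

```python
# Sketch under assumptions: brute-force distances, simple seed-based neighbourhood.
import numpy as np

def select_base(y, bag_ids, k):
    """y: (n, d) projected instances; bag_ids: (n,) bag index of each instance."""
    d = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=2)   # pairwise distances
    knn_cost = np.sort(d, axis=1)[:, 1:k + 1].sum(axis=1)       # weight of each k-NN graph
    seed = int(np.argmin(knn_cost))                             # highest-density instance
    selected = set(np.argsort(d[seed])[:k + 1].tolist())        # seed plus its k neighbours
    for b in np.unique(bag_ids):
        members = np.where(bag_ids == b)[0]
        if not selected.intersection(members):
            # Isolated bag: take its instance closest to the selected set.
            dist_to_sel = d[np.ix_(members, sorted(selected))].min(axis=1)
            selected.add(int(members[np.argmin(dist_to_sel)]))
    return sorted(selected)
```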
Algorithm 1. Multiple-Instance Subspace Learning
Input: S: number of subjects; N_s, s = 1...S: number of noisy images for subject s; R: number of instances per bag; K: target number of nearest neighbors
1: Initialize x_b^(0) = (1/R) Σ_{i=1}^{R} x_{bi}^(0)
2: while the base is still changing do
3:   Compute sufficient statistics: x_b^(t) = (1/Z) Σ_{g∈G_s} x_{bg}^(t)
4:   Compute the multiple-instance eigenbase:
       m_s^(t) = (1/N_s) Σ_{b=1}^{N_s} x_b^(t)
       S_W^(t) = Σ_{s=1}^{S} Σ_{b=1}^{N_s} Σ_{g∈G_s} (x_{bg}^(t) - m_s^(t))(x_{bg}^(t) - m_s^(t))^T
       S_B^(t) = Σ_{s=1}^{S} Σ_{i≠s} N_s (m_s^(t) - m_i^(t))(m_s^(t) - m_i^(t))^T
       W* = arg max_W (W^T S_B^(t) W) / (W^T S_W^(t) W)
5:   Base selection: y = W*^T x; select the good perturbed training samples G_s for each subject by finding the most compact k-nearest neighbor graph in the projected subspace y
6: end while
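For reference, the eigenbase computation in step 4 amounts to a Fisherface-style generalized eigenproblem. The sketch below is generic LDA code using the usual global-mean between-class scatter, which may differ in detail from the scatter definitions in Algorithm 1; it assumes the instances are already dimensionality-reduced so that the within-class scatter is well conditioned (a small ridge is added for safety).

```python
# Sketch under assumptions: generic Fisherface/LDA solve, not the authors' code.
import numpy as np
from scipy.linalg import eigh

def fisher_base(X, labels, n_components):
    """X: (n, d) selected instances; labels: (n,) subject ids."""
    m = X.mean(axis=0)
    d = X.shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                  # within-class scatter
        Sb += len(Xc) * np.outer(mc - m, mc - m)       # between-class scatter (global mean)
    evals, evecs = eigh(Sb, Sw + 1e-6 * np.eye(d))     # generalized eigenproblem
    order = np.argsort(evals)[::-1]
    return evecs[:, order[:n_components]]              # columns form W*
```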
Fig. 3. Bag Distances Map
In our experiments, we use a bag size of 25, i.e., each original training image is perturbed to generate 25 instances. Each subject has 1-4 training images, and we take k as 60% of each subject's total number of perturbed noisy instances. To show that good perturbed images are similar to each other, Figure 3 shows an example distance map for two bags, (1) and (2). Each bag has 25 instances, generated by adding 25 random perturbations to a well-aligned image. The instances around the middle of the two bags have smaller perturbations, i.e.,
they are good perturbed images. In the distance map, the darker the color, the more similar the two instances. From the map we can see that an instance from bag (1) is not necessarily always nearer to instances in bag (1) than to instances in bag (2), which means that two well-aligned face images of one subject can be more similar to each other than a given image is to a noisily perturbed version of itself. We can also see that the instances around the middle of bag (1) are more similar to those around the middle of bag (2), which means that good perturbed images from the same subject are similar to each other, confirming our assumption.
2.3 Testing Procedure
During the testing stage, we use the nearest neighbor algorithm for classification, with a modified Hausdorff distance as the distance metric. The Hausdorff distance measures the distance between subsets of a metric space: by definition, two sets A and B are within Hausdorff distance d of each other iff every point of A is within distance d of at least one point of B, and every point of B is within distance d of at least one point of A. Formally, given two sets of points A = {A_1, ..., A_m} and B = {B_1, ..., B_n}, the Hausdorff distance is defined as H(A, B) = max{h(A, B), h(B, A)}, where h(A, B) = max_{A_i∈A} min_{B_j∈B} ||A_i - B_j||. This definition is very sensitive to outliers, so we use a modified version: in this paper, we take the distance between bag A and bag B as H(A, B) = min_{A_i∈A} min_{B_j∈B} ||A_i - B_j||. For the single-instance probe and gallery testing case, we use the nearest neighbor method based on Euclidean distance in the subspace.
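A minimal sketch of this bag-to-bag test procedure (the min-min variant of the Hausdorff distance on subspace-projected instances, followed by nearest-neighbor classification); the data layout is assumed.

```python
# Sketch under assumptions: bags are NumPy arrays of projected instances.
import numpy as np

def bag_distance(A, B):
    """A: (m, d), B: (n, d) projected instances of two bags."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return d.min()                      # H(A, B) = min_i min_j ||A_i - B_j||

def classify(probe_bag, gallery_bags, gallery_labels):
    dists = [bag_distance(probe_bag, g) for g in gallery_bags]
    return gallery_labels[int(np.argmin(dists))]
```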
3 Experimental Results and Discussions
We used the well-known FERET database [6] in our experiments. One reason to use this data set is that it is a relatively large publicly available database, so the testing results have more statistical significance. The training set, which is used to find the optimal Fisherface subspace, consists of 1002 images of 429 subjects, all at near-frontal pose. The testing set consists of a gallery set and a probe set. The gallery set has 1196 subjects, each with one near-frontal image under normal lighting conditions. The probe set has 1195 subjects, each with one image captured under the same conditions as the gallery set but with a different expression. For comparison purposes, we have the ground truth positions of the two eye centers for the training, probe and gallery images. In this paper, we denote by a noisy bag a bag generated from a noisy image, and by an aligned bag a bag generated from a well-aligned image; we use "single" in contrast to bag. Since there are many possible experimental setup combinations (training data, gallery data, probe data, noisy image, well-aligned image, single image, bag of images, etc.), we use Table 1 and Table 3 to explain our experimental setups.
3.1 Testing with Well-Aligned Training Data
To see how the introduction of the augmented training bags will affect the recognition performance, we first test on the well-aligned training data.

Table 1. Testing combinations for aligned training data
  Base 1:    single aligned training
  Base 2:    aligned bag training
  Testing 1: single aligned gallery, single aligned probe
  Testing 2: single aligned gallery, single noisy probe
  Testing 3: aligned bag gallery, noisy bag probe
  Testing 4: single aligned gallery, noisy bag probe
Table 2. Results comparison
             base 1    base 2
  testing 1  0.9247    0.9749
  testing 2  0.8795    0.9665
  testing 3  0.9674    0.9849
  testing 4  0.9431    0.9774
From Table 2 we have the following notable observations:
– The recognition rate is always higher if we use an aligned bag instead of a single image as training data, which motivates the aforementioned perturbation-based robust algorithms. However, this no longer holds if we do not have well-aligned training data, i.e., if we only have noisy training images and add perturbations to generate noisy bags: using the noisy bags as training data may not necessarily improve recognition performance, since the very poorly aligned images confuse the classifier.
– If we take the baseline algorithm to be the case of single aligned training, single aligned gallery and single aligned probe, then the rank-1 recognition rate of the baseline is 92.47%.
– If we use an aligned bag gallery and a noisy bag probe, the rank-1 recognition rate is 96.74%, which is better than the baseline. This means that adding perturbations to the gallery and probe sets can make the algorithm robust to alignment errors.
3.2 Testing with Noisy Training Data
To show that, even when we do not have well-aligned training data, adding random perturbations to augment the training set can help greatly, we performed various experiments. More importantly, we also show that after selecting good perturbed images from the augmented data with our multiple-instance based scheme, the recognition performance improves considerably.
Table 3. Testing combinations for noisy training data
  Base 1:    single noisy training
  Base 2:    iteration 1, noisy bag training
  Base 3:    iteration 3, noisy bag training
  Testing 1: single aligned gallery, single aligned probe
  Testing 2: single aligned gallery, single noisy probe
  Testing 3: aligned bag gallery, noisy bag probe

Table 4. Results comparison
           testing 1   testing 2   testing 3
  base 1   0.3213      0.1941      0.5431
  base 2   0.9540      0.9364      0.9766
  base 3   0.9690      0.9590      0.9833
Table 4 shows the testing results, and we have the following notable observations:
– When we use a single noisy training image without adding perturbations (base 1), the recognition rate is very low for all testing combinations. This indicates that the within-subject scatter of the poorly registered faces in the training set is so high that the subjects' clusters overlap and confuse the classifiers. For Fisherface, this means the objective function it tries to optimize is ill-conditioned, which leads to the failure of the algorithm.
– In the base 2 case, we augment the noisy images by adding perturbations to generate noisy bags; the recognition rate then increases greatly compared to using the noisy images directly.
– Base 3 shows that it is not good to treat all instances from the noisy bags equally. We use our multiple-instance based subspace learning method to remove the "bad" instances from the augmented noisy bags. The resulting training set increases the discriminative power of the classifier without dispersing the within-subject clusters and causing confusion.
– Given only a noisy training and probe set, we still achieve a recognition rate of 98.33%, much higher than the 92.47% of the baseline algorithm in Table 2, and roughly the same as the optimal case of 98.49%, where all noisy bags are generated by perturbing the aligned images.
Fig. 4. Single aligned training base. Fig. 5. Aligned bag training base. Fig. 6. Single noisy training base. (Each plot shows the rank-1 recognition rate of the testing combinations versus the number of dimensions used by Fisherface.)
Figure 4 shows testing results with a single aligned image as training data, and Figure 5 shows testing results with an aligned bag as training data. Both figures show the change of recognition rate with respect to the number of dimensions used by Fisherface. In both cases, the recognition rates follow the order testing 3 > testing 1 > testing 2, where the testings have the same meaning as in Table 1. Figure 6 shows how noisy training images affect the recognition rate. It is obvious that when the training set is not aligned well, all the testing
cases fail, including those using probe bags and gallery bags. It is therefore very important to keep noisy training images from corrupting the training subspace. Figures 7, 8 and 9 show recognition error rates for three different testing combinations; the testings have the same meaning as in Table 3. Optimal 1 means training with aligned bags, and Optimal 2 means training with single aligned images. Iter 1 and Iter 3 denote the first and third iterations of the base selection procedure. We can see that in all cases the third-iteration results are better than the first-iteration results, which supports our claim that extremely poorly registered images do not benefit the learning algorithm; our multiple-instance learning algorithm excludes those bad training images from corrupting the training base. Also, interestingly, in all tests Optimal 1 always performs worst, which indicates that by adding perturbations to the training base, even very noisy images, we can improve the robustness of learning algorithms. Note that in all cases, as the number of dimensions increases, the error rate first decreases and then increases; we normally obtain the best recognition rate using around the first 50 dimensions (accounting for 70% of the total energy).
Fig. 7. Single aligned gallery, single aligned probe. Fig. 8. Single aligned gallery, single noisy probe. Fig. 9. Aligned bag gallery, noisy bag probe. (Each plot shows the rank-1 recognition error rate of Iter 1, Iter 3, Optimal 1 and Optimal 2 versus the number of dimensions, for iterative base selection on testings 1-3.)
4 Conclusions
In this paper, we systematically studied the influence of image mis-alignment on face recognition performance, including mis-alignment in the training, probe and gallery sets. We then formulated the image alignment problem in the multiple-instance learning framework and proposed a novel supervised-clustering based multiple-instance learning scheme for subspace training, which proceeds by iteratively updating the training set. A simple subspace method such as Fisherface, when augmented with the proposed multiple-instance learning scheme, achieves a very high recognition rate. Experimental results show that even with noisy training and testing sets, the Fisherface learned by our multiple-instance learning scheme achieves a much higher recognition rate than the baseline algorithm in which the training and testing images are aligned accurately. Our algorithm is a meta-algorithm that can easily be used with other methods. The same framework could also be deployed to deal with illumination and occlusion problems, with different definitions of training bags and training instances.
Acknowledgments The research in this paper was partially supported by NSF CNS-0428231.
References
1. Wiskott, L., Fellous, J.M., Krüger, N., von der Malsburg, C.: Face recognition by elastic bunch graph matching. In: Sommer, G., Daniilidis, K., Pauli, J. (eds.) CAIP 1997. LNCS, vol. 1296, pp. 456–463. Springer, Heidelberg (1997)
2. Kirby, M., Sirovich, L.: Application of the Karhunen-Loeve procedure for the characterization of human faces. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(1), 103–108 (1990)
3. Turk, M., Pentland, A.: Face recognition using eigenfaces. In: Proceedings of IEEE Computer Vision and Pattern Recognition, pp. 586–591. IEEE Computer Society Press, Los Alamitos (1991)
4. Belhumeur, P.N., Hespanha, J., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7), 711–720 (1997)
5. Etemad, K., Chellappa, R.: Discriminant analysis for recognition of human face images. In: Bigün, J., Borgefors, G., Chollet, G. (eds.) AVBPA 1997. LNCS, vol. 1206, pp. 127–142. Springer, Heidelberg (1997)
6. Phillips, P.J., Moon, H., Rizvi, S.A., Rauss, P.J.: The FERET evaluation methodology for face-recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(10), 1090–1104 (2000)
7. Martinez, A.: Recognizing imprecisely localized, partially occluded and expression variant faces from a single sample per class. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(6), 748–763 (2002)
8. Shan, S., Chang, Y., Gao, W., Cao, B.: Curse of mis-alignment in face recognition: Problem and a novel mis-alignment learning solution. In: Proceedings of International Conference on Automatic Face and Gesture Recognition, pp. 314–320 (2004)
9. Viola, P., Platt, J.C., Zhang, C.: Multiple instance boosting for object detection. In: Proceedings of Neural Information Processing Systems (2005)
10. Maron, O., Lozano-Perez, T.: A framework for multiple-instance learning. In: Proceedings of Neural Information Processing Systems, pp. 570–576 (1998)
11. Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for multiple-instance learning. In: Proceedings of Neural Information Processing Systems, pp. 561–568 (2002)
12. Wang, J., Zucker, J.D.: Solving multiple-instance problem: A lazy learning approach. In: Proceedings of International Conference on Machine Learning, pp. 1119–1125 (2000)
13. Blum, C., Blesa, M.J.: New metaheuristic approaches for the edge-weighted k-cardinality tree problem. Computers and Operations Research 32(6), 1355–1377 (2005)
Author Index
Abe, Shinji I-292 Agrawal, Amit I-945 Ai, Haizhou I-210 Akama, Ryo I-779 Andreopoulos, Alexander I-385 Aoki, Nobuya I-116 Aptoula, Erchan I-935 Arita, Daisaku I-159 Arth, Clemens II-447 Ashraf, Nazim II-63 Åström, Kalle II-549
Babaguchi, Noboru II-651 Banerjee, Subhashis II-85 Ben Ayed, Ismail I-925 Beveridge, J. Ross II-733 Bigorgne, Erwan II-817 Bischof, Horst I-657, II-447 Bouakaz, Sa¨ıda I-678, I-738 Boyer, Edmond II-166, II-580 Brice˜ no, Hector M. I-678, I-738 Brooks, Michael J. I-853, II-227 Byr¨ od, Martin II-549 Cai, Kangying I-779 Cai, Yinghao I-843 Cannons, Kevin I-532 Cha, Seungwook I-200 Chan, Tung-Jung II-631 Chang, Jen-Mei II-733 Chang, Wen-Yan II-621 Chaudhuri, Subhasis I-240 Chebrolu, Hima I-956 Chen, Chu-Song I-905, II-621 Chen, Ju-Chin II-700 Chen, Qian I-565, I-688 Chen, Tsuhan I-220, II-487, II-662 Chen, Wei I-843 Chen, Wenbin II-53 Chen, Ying I-832 Chen, Yu-Ting I-905 Cheng, Jian II-827 Choi, Inho I-698 Choi, Ouk II-269
Chu, Rufeng II-22 Chu, Wen-Sheng Vincnent II-700 Chun, Seong Soo I-200 Chung, Albert C.S. II-672 Chung, Ronald II-301 Cichowski, Alex I-375 Cipolla, Roberto I-335 Courteille, Fr´ed´eric II-196 Cui, Jinshi I-544 Dailey, Matthew N. I-85 Danafar, Somayeh II-457 Davis, Larry S. I-397, II-404 DeMenthon, Daniel II-404 De Mol, Christine II-881 Destrero, Augusto II-881 Detmold, Henry I-375 Di Stefano, Luigi II-517 Dick, Anthony I-375, I-853 Ding, Yuanyuan I-95 Dinh, Viet Cuong I-200 Doermann, David II-404 Donoser, Michael II-447 Dou, Mingsong II-722 Draper, Bruce II-733 Du, Wei I-365 Du, Weiwei II-590 Durou, Jean-Denis II-196 Ejiri, Masakazu I-35 Eriksson, Anders P. II-796 Fan, Kuo-Chin I-169 Farin, Dirk I-789 Foroosh, Hassan II-63 Frahm, Jan-Michael II-353 Fu, Li-Chen II-124 Fu, Zhouyu I-482, II-134 Fujimura, Kikuo I-408, II-32 Fujiwara, Takayuki II-891 Fujiyoshi, Hironobu I-915, II-806 Fukui, Kazuhiro II-467 Funahashi, Takuma II-891 Furukawa, Ryo II-206, II-847
Gao, Jizhou I-127 Gargallo, Pau II-373, II-784 Geurts, Pierre II-611 Gheissari, Niloofar II-457 Girdziuˇsas, Ram¯ unas I-811 Goel, Dhiraj I-220 Goel, Lakshya II-85 Grabner, Helmut I-657 Grabner, Michael I-657 Guillou, Erwan I-678 Gupta, Ankit II-85 Gupta, Gaurav II-394 Gupta, Sumana II-394 Gurdjos, Pierre II-196 Han, Yufei II-1, II-22 Hancock, Edwin R. II-869 Handel, Holger II-258 Hao, Pengwei II-722 Hao, Ying II-12 Hartley, Richard I-13, I-800, II-279, II-322, II-353 Hasegawa, Tsutomu I-628 Hayes-Gill, Barrie I-945 He, Ran I-54, I-728, II-22 H´eas, Patrick I-864 Hill, Rhys I-375 Hiura, Shinsaku I-149 Honda, Kiyoshi I-85 Hong, Ki-Sang II-497 Horaud, Radu II-166 Horiuchi, Takahiko I-708 Hou, Cong I-210 Hsiao, Pei-Yung II-124 Hsieh, Jun-Wei I-169 Hu, Wei I-832 Hu, Weiming I-821, I-832 Hu, Zhanyi I-472 Hua, Chunsheng I-565 huang, Feiyue II-477 Huang, Guochang I-462 Huang, Kaiqi I-667, I-843 Huang, Liang II-680 Huang, Po-Hao I-106 Huang, Shih-Shinh II-124 Huang, Weimin I-875 Huang, Xinyu I-127 Huang, Yonggang II-690 Hung, Y.S. II-186 Hung, Yi-Ping II-621
Ide, Ichiro II-774 Ijiri, Yoshihisa II-680 Ikeda, Sei II-73 Iketani, Akihiko II-73 Ikeuchi, Katsushi II-289 Imai, Akihiro I-596 Ishikawa, Hiroshi II-537 Itano, Tomoya II-206 Iwata, Sho II-570 Jaeggli, Tobias I-608 Jawahar, C.V. I-586 Je, Changsoo II-507 Ji, Zhengqiao II-363 Jia, Yunde I-512, II-641, II-754 Jiao, Jianbin I-896 Jin, Huidong I-482 Jin, Yuxin I-748 Josephson, Klas II-549 Junejo, Imran N. II-63 Kahl, Fredrik I-13, II-796 Kalra, Prem II-85 Kanade, Takeo I-915, II-806 Kanatani, Kenichi II-311 Kanbara, Masayuki II-73 Katayama, Noriaki I-292 Kato, Takekazu I-688 Kawabata, Satoshi I-149 Kawade, Masato II-680 Kawamoto, Kazuhiko I-555 Kawasaki, Hiroshi II-206, II-847 Khan, Sohaib I-647 Kim, Daijin I-698 Kim, Hansung I-758 Kim, Hyeongwoo II-269 Kim, Jae-Hak II-353 Kim, Jong-Sung II-497 Kim, Tae-Kyun I-335 Kim, Wonsik II-560 Kirby, Michael II-733 Kitagawa, Yosuke I-688 Kitahara, Itaru I-758 Klein Gunnewiek, Rene I-789 Kley, Holger II-733 Kogure, Kiyoshi I-758 Koh, Tze K. I-945 Koller-Meier, Esther I-608 Kondo, Kazuaki I-544 Korica-Pehserl, Petra I-657
Author Index Koshimizu, Hiroyasu II-891 Kounoike, Yuusuke II-424 Kozuka, Kazuki II-342 Kuijper, Arjan I-230 Kumano, Shiro I-324 Kumar, Anand I-586 Kumar, Pankaj I-853 Kuo, Chen-Hui II-631 Kurazume, Ryo I-628 Kushal, Avanish II-85 Kweon, In So II-269 Laaksonen, Jorma I-811 Lai, Shang-Hong I-106, I-638 Lambert, Peter I-251 Langer, Michael I-271, II-858 Lao, Shihong I-210, II-680 Lau, W.S. II-186 Lee, Jiann-Der II-631 Lee, Kwang Hee II-507 Lee, Kyoung Mu II-560 Lee, Sang Wook II-507 Lee, Wonwoo II-580 Lef`evre, S´ebastien I-935 Lei, Zhen I-54, II-22 Lenz, Reiner II-744 Li, Baoxin II-155 Li, Heping I-472 Li, Hongdong I-800, II-227 Li, Jiun-Jie I-169 Li, Jun II-722 Li, Ping I-789 Li, Stan Z. I-54, I-728, II-22 Li, Zhenglong II-827 Li, Zhiguo II-901 Liang, Jia I-512, II-754 Liao, ShengCai I-54 Liao, Shu II-672 Lien, Jenn-Jier James I-261, I-314, I-885, II-96, II-700 Lim, Ser-Nam I-397 Lin, Shouxun II-106 Lin, Zhe II-404 Lina II-774 Liu, Chunxiao I-282 Liu, Fuqiang I-355 Liu, Jundong I-956 Liu, Nianjun I-482 Liu, Qingshan II-827, II-901 Liu, Wenyu I-282
Liu, Xiaoming II-662 Liu, Yuncai I-419 Loke, Eng Hui I-430 Lu, Fangfang II-134, II-279 Lu, Hanqing II-827 Lubin, Jeffrey II-414 Lui, Shu-Fan II-96 Luo, Guan I-821 Ma, Yong II-680 Maeda, Eisaku I-324 Mahmood, Arif I-647 Makhanov, Stanislav I-85 Makihara, Yasushi I-452 Manmatha, R. I-586 Mao, Hsi-Shu II-96 Mar´ee, Rapha¨el II-611 Marikhu, Ramesh I-85 Martens, Ga¨etan I-251 Matas, Jiˇr´ı II-236 Mattoccia, Stefano II-517 Maybank, Steve I-821 McCloskey, Scott I-271, II-858 Mekada, Yoshito II-774 Mekuz, Nathan I-492 ´ M´emin, Etienne I-864 Metaxas, Dimitris II-901 Meyer, Alexandre I-738 Michoud, Brice I-678 Miˇcuˇs´ık, Branislav I-65 Miles, Nicholas I-945 Mitiche, Amar I-925 Mittal, Anurag I-397 Mogi, Kenji II-528 Morgan, Steve I-945 Mori, Akihiro I-628 Morisaka, Akihiko II-206 Mu, Yadong II-837 Mudenagudi, Uma II-85 Mukaigawa, Yasuhiro I-544, II-246 Mukerjee, Amitabha II-394 Murai, Yasuhiro I-915 Murase, Hiroshi II-774 Nagahashi, Tomoyuki II-806 Nakajima, Noboru II-73 Nakasone, Yoshiki II-528 Nakazawa, Atsushi I-618 Nalin Pradeep, S. I-522, II-116 Niranjan, Shobhit II-394 Nomiya, Hiroki I-502
Odone, Francesca II-881 Ohara, Masatoshi I-292 Ohta, Naoya II-528 Ohtera, Ryo I-708 Okutomi, Masatoshi II-176 Okutomoi, Masatoshi II-384 Olsson, Carl II-796 Ong, S.H. I-875 Otsuka, Kazuhiro I-324 Pagani, Alain I-769 Paluri, Balamanohar I-522, II-116 Papadakis, Nicolas I-864 Parikh, Devi II-487 Park, Joonyoung II-560 Pehserl, Joachim I-657 Pele, Ofir II-435 Peng, Yuxin I-748 Peterson, Chris II-733 Pham, Nam Trung I-875 Piater, Justus I-365 Pollefeys, Marc II-353 Poppe, Chris I-251 Prakash, C. I-522, II-116 Pujades, Sergi II-373 Puri, Manika II-414 Radig, Bernd II-332 Rahmati, Mohammad II-217 Raskar, Ramesh I-1, I-945 Raskin, Leonid I-442 Raxle Wang, Chi-Chen I-885 Reid, Ian II-601 Ren, Chunjian II-53 Rivlin, Ehud I-442 Robles-Kelly, Antonio II-134 Rudzsky, Michael I-442 Ryu, Hanjin I-200 Sagawa, Ryusuke I-116 Sakakubara, Shizu II-424 Sakamoto, Ryuuki I-758 Sato, Jun II-342 Sato, Kosuke I-149 Sato, Tomokazu II-73 Sato, Yoichi I-324 Sawhney, Harpreet II-414 Seo, Yongduek II-322 Shah, Hitesh I-240, I-522, II-116 Shahrokni, Ali II-601
Shen, Chunhua II-227 Shen, I-fan I-189, II-53 Shi, Jianbo I-189 Shi, Min II-42 Shi, Yu I-718 Shi, Zhenwei I-180 Shimada, Atsushi I-159 Shimada, Nobutaka I-596 Shimizu, Ikuko II-424 Shimizu, Masao II-176 Shinano, Yuji II-424 Shirai, Yoshiaki I-596 Siddiqi, Kaleem I-271, II-858 Singh, Gajinder II-414 Slobodan, Ili´c I-75 Smith, Charles I-956 Smith, William A.P. II-869 ˇ Sochman, Jan II-236 Song, Gang I-189 Song, Yangqiu I-180 Stricker, Didier I-769 Sturm, Peter II-373, II-784 Sugaya, Yasuyuki II-311 Sugimoto, Shigeki II-384 Sugiura, Kazushige I-452 Sull, Sanghoon I-200 Sumino, Kohei II-246 Sun, Zhenan II-1, II-12 Sung, Ming-Chian I-261 Sze, W.F. II-186 Takahashi, Hidekazu II-384 Takahashi, Tomokazu II-774 Takamatsu, Jun II-289 Takeda, Yuki I-779 Takemura, Haruo I-618 Tan, Huachun II-712 Tan, Tieniu I-667, I-843, II-1, II-12, II-690 Tanaka, Hidenori I-618 Tanaka, Hiromi T. I-779 Tanaka, Tatsuya I-159 Tang, Sheng II-106 Taniguchi, Rin-ichiro I-159, I-628 Tao, Hai I-345 Tao, Linmi I-748 Tarel, Jean-Philippe II-817 Tian, Min I-355 Tombari, Federico II-517 Tominaga, Shoji I-708
Author Index Toriyama, Tomoji I-758 Tsai, Luo-Wei I-169 Tseng, Chien-Chung I-314 Tseng, Yun-Jung I-169 Tsotsos, John K. I-385, I-492 Tsui, Timothy I-718 Uchida, Seiichi I-628 Uehara, Kuniaki I-502 Urahama, Kiichi II-590 Utsumi, Akira I-292 Van de Walle, Rik I-251 van den Hengel, Anton I-375 Van Gool, Luc I-608 Verri, Alessandro II-881 Vincze, Markus I-65 Wada, Toshikazu I-565, I-688 Wan, Cheng II-342 Wang, Fei II-1 Wang, Guanghui II-363 Wang, Junqiu I-576 Wang, Lei I-800, II-145 Wang, Liming I-189 Wang, Te-Hsun I-261 Wang, Xiaolong I-303 Wang, Ying I-667 Wang, Yuanquan I-512, II-754 Wang, Yunhong I-462, II-690 Wehenkel, Louis II-611 Wei, Shou-Der I-638 Werman, Michael II-435 Wildenauer, Horst I-65 Wildes, Richard I-532 Wimmer, Matthias II-332 With, Peter H.N. de I-789 Wong, Ka Yan II-764 Woo, Woontack II-580 Woodford, Oliver II-601 Wu, Fuchao I-472 Wu, Haiyuan I-565, I-688 Wu, Jin-Yi II-96 Wu, Q.M. Jonathan II-363 Wu, Yihong I-472 Wuest, Harald I-769 Xu, Gang II-570 Xu, Guangyou I-748, II-477
Xu, Lijie II-32 Xu, Shuang II-641 Xu, Xinyu II-155 Yagi, Yasushi I-116, I-452, I-544, I-576, II-246 Yamaguchi, Osamu II-467 Yamamoto, Masanobu I-430 Yamato, Junji I-324 Yamazaki, Masaki II-570 Yamazoe, Hirotake I-292 Yang, Ruigang I-127 Yang, Ying II-106 Ye, Qixiang I-896 Yin, Xin I-779 Ying, Xianghua I-138 Yip, Chi Lap II-764 Yokoya, Naokazu II-73 Yu, Hua I-896 Yu, Jingyi I-95 Yu, Xiaoyi II-651 Yuan, Ding II-301 Yuan, Xiaotong I-728 Zaboli, Hamidreza II-217 Zaharescu, Andrei II-166 Zha, Hongbin I-138, I-544 Zhang, Changshui I-180 Zhang, Chao II-722 Zhang, Dan I-180 Zhang, Fan I-282 Zhang, Ke I-482 Zhang, Weiwei I-355 Zhang, Xiaoqin I-821 Zhang, Yongdong II-106 Zhang, Yu-Jin II-712 Zhang, Yuhang I-800 Zhao, Qi I-345 Zhao, Xu I-419 Zhao, Youdong II-641 Zhao, Yuming II-680 Zheng, Bo II-289 Zheng, Jiang Yu I-303, II-42 Zhong, H. II-186 Zhou, Bingfeng II-837 Zhou, Xue I-832 Zhu, Youding I-408