Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board

David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Moshe Y. Vardi, Rice University, Houston, TX, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
4843
Yasushi Yagi Sing Bing Kang In So Kweon Hongbin Zha (Eds.)
Computer Vision – ACCV 2007 8th Asian Conference on Computer Vision Tokyo, Japan, November 18-22, 2007 Proceedings, Part I
Volume Editors

Yasushi Yagi
Osaka University
The Institute of Scientific and Industrial Research
8-1 Mihogaoka, Ibaraki, Osaka, 567-0047, Japan
E-mail: [email protected]

Sing Bing Kang
Microsoft Corporation
1 Microsoft Way, Redmond, WA 98052, USA
E-mail: [email protected]

In So Kweon
KAIST
School of Electrical Engineering and Computer Science
335 Gwahag-Ro Yusung-Gu, Daejeon, Korea
E-mail: [email protected]

Hongbin Zha
Peking University
Department of Machine Intelligence
Beijing, 100871, China
E-mail: [email protected]
Library of Congress Control Number: 2007938408
CR Subject Classification (1998): I.4, I.5, I.2.10, I.2.6, I.3.5, F.2.2
LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
ISSN 0302-9743
ISBN-10 3-540-76385-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-76385-7 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2007 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12183654 06/3180 543210
Preface
It is our great pleasure to welcome you to the Proceedings of the Eighth Asian Conference on Computer Vision (ACCV07), which was held November 18–22, 2007 in Tokyo, Japan. ACCV07 was sponsored by the Asian Federation of Computer Vision.

We received 640 abstracts by the abstract submission deadline, 551 of which became full submissions. This is the largest number of submissions in the history of ACCV. Out of these 551 full submissions, 46 were selected for oral presentation and 130 as posters, yielding an acceptance rate of 31.9%. Following the tradition of previous ACCVs, the reviewing process was double blind. Each of the 31 Area Chairs (ACs) handled about 17 papers and nominated five reviewers for each submission (from 204 Program Committee members). The final selection of three reviewers per submission was done in such a way as to avoid conflicts of interest and to evenly balance the load among the reviewers. Once the reviews were done, each AC wrote summary reports based on the reviews and their own assessments of the submissions. For conflicting scores, ACs consulted with reviewers, and at times had us contact authors for clarification.

The AC meeting was held in Osaka on July 27 and 28. We divided the 31 ACs into 8 groups, with each group having 3 or 4 ACs. The ACs could confer within their respective groups, and were permitted to consult pre-approved "consulting" ACs outside their groups if needed. The ACs were encouraged to rely on their own assessment of each paper in light of the reviewer comments, rather than strictly on numerical scores alone. This year, we introduced the category "conditional accept"; this category is targeted at papers with good technical content but whose writing requires significant improvement.

Please keep in mind that no reviewing process is perfect. As with any major conference, reviewer quality and timeliness of reviews varied. To minimize the impact of variation in these factors, we chose highly qualified and dependable people as ACs to shepherd the review process. We all did the best we could given the large number of submissions and the limited time we had. Interestingly, we did not have to instruct the ACs to revise their decisions at the end of the AC meeting; all the ACs did a great job in ensuring the high quality of accepted papers. That being said, it is possible there were good papers that fell through the cracks, and we hope such papers will quickly end up being published at other good venues.

It has been a pleasure for us to serve as ACCV07 Program Chairs, and we can honestly say that this has been a memorable and rewarding experience. We would like to thank the ACCV07 ACs and members of the Technical Program Committee for their time and effort spent reviewing the submissions. The ACCV Osaka team (Ryusuke Sagawa, Yasushi Makihara, Tomohiro Mashita, Kazuaki Kondo, and Hidetoshi Mannami), as well as our conference secretaries (Noriko
Yasui, Masako Kamura, and Sachiko Kondo) did a terrific job organizing the conference. We hope that all of the attendees found the conference informative and thought-provoking.

November 2007
Yasushi Yagi Sing Bing Kang In So Kweon Hongbin Zha
Organization
General Chair: Katsushi Ikeuchi (University of Tokyo, Japan)
General Co-chairs: Naokazu Yokoya (NAIST, Japan), Rin-ichiro Taniguchi (Kyushu University, Japan)
Program Chair: Yasushi Yagi (Osaka University, Japan)
Program Co-chairs: In So Kweon (KAIST, Korea), Sing Bing Kang (Microsoft Research, USA), Hongbin Zha (Peking University, China)
Workshop/Tutorial Chair: Kazuhiko Sumi (Mitsubishi Electric, Japan)
Finance Chair: Keiji Yamada (NEC, Japan)
Local Arrangements Chair: Yoshinari Kameda (University of Tsukuba, Japan)
Publication Chairs: Hideo Saito (Keio University, Japan), Daisaku Arita (ISIT, Japan)
Technical Support Staff: Atsuhiko Banno (University of Tokyo, Japan), Daisuke Miyazaki (University of Tokyo, Japan), Ryusuke Sagawa (Osaka University, Japan), Yasushi Makihara (Osaka University, Japan)

Area Chairs

Tat-Jen Cham (Nanyang Tech. University, Singapore)
Koichiro Deguchi (Tohoku University, Japan)
Frank Dellaert (Georgia Inst. of Tech., USA)
Martial Hebert (CMU, USA)
Ki Sang Hong (Pohang University of Sci. and Tech., Korea)
Yi-ping Hung (National Taiwan University, Taiwan)
Reinhard Klette (University of Auckland, New Zealand)
Chil-Woo Lee (Chonnam National University, Korea)
Kyoung Mu Lee (Seoul National University, Korea)
Sang Wook Lee (Sogang University, Korea)
Stan Z. Li (CASIA, China)
Yuncai Liu (Shanghai Jiaotong University, China)
Yasuyuki Matsushita (Microsoft Research Asia, China)
Yoshito Mekada (Chukyo University, Japan)
Yasuhiro Mukaigawa (Osaka University, Japan)
P.J. Narayanan (IIIT, India)
Masatoshi Okutomi (Tokyo Inst. of Tech., Japan)
Tomas Pajdla (Czech Technical University, Czech)
Shmuel Peleg (The Hebrew University of Jerusalem, Israel)
Jean Ponce (Ecole Normale Superieure, France)
Long Quan (Hong Kong University of Sci. and Tech., China)
Ramesh Raskar (MERL, USA)
Jim Rehg (Georgia Inst. of Tech., USA)
Jun Sato (Nagoya Inst. of Tech., Japan)
Shinichi Sato (NII, Japan)
Yoichi Sato (University of Tokyo, Japan)
Cordelia Schmid (INRIA, France)
Christoph Schnoerr (University of Mannheim, Germany)
David Suter (Monash University, Australia)
Xiaoou Tang (Microsoft Research Asia, China)
Guangyou Xu (Tsinghua University, China)
Program Committee Adrian Barbu Akash Kushal Akihiko Torii Akihiro Sugimoto Alexander Shekhovtsov Amit Agrawal Anders Heyden Andreas Koschan Andres Bruhn Andrew Hicks Anton van den Hengel Atsuto Maki Baozong Yuan Bernt Schiele Bodo Rosenhahn Branislav Micusik C.V. Jawahar Chieh-Chih Wang Chin Seng Chua Chiou-Shann Fuh Chu-song Chen
Cornelia Fermuller Cristian Sminchisescu Dahua Lin Daisuke Miyazaki Daniel Cremers David Forsyth Duy-Dinh Le Fanhuai Shi Fay Huang Florent Segonne Frank Dellaert Frederic Jurie Gang Zeng Gerald Sommer Guoyan Zheng Hajime Nagahara Hanzi Wang Hassan Foroosh Hideaki Goto Hidekata Hontani Hideo Saito
Hiroshi Ishikawa Hiroshi Kawasaki Hong Zhang Hongya Tuo Hynek Bakstein Hyun Ki Hong Ikuko Shimizu Il Dong Yun Itaru Kitahara Ivan Laptev Jacky Baltes Jakob Verbeek James Crowley Jan-Michael Frahm Jan-Olof Eklundh Javier Civera Jean Martinet Jean-Sebastien Franco Jeffrey Ho Jian Sun Jiang yu Zheng
Jianxin Wu Jianzhuang Liu Jiebo Luo Jingdong Wang Jinshi Cui Jiri Matas John Barron John Rugis Jong Soo Choi Joo-Hwee Lim Joon Hee Han Joost Weijer Jun Sato Jun Takamatsu Junqiu Wang Juwei Lu Kap Luk Chan Karteek Alahari Kazuhiro Hotta Kazuhiro Otsuka Keiji Yanai Kenichi Kanatani Kenton McHenry Ki Sang Hong Kim Steenstrup Pedersen Ko Nishino Koichi Hashomoto Larry Davis Lisheng Wang Manabu Hashimoto Marcel Worring Marshall Tappen Masanobu Yamamoto Mathias Kolsch Michael Brown Michael Cree Michael Isard Ming Tang Ming-Hsuan Yang Mingyan Jiang Mohan Kankanhalli Moshe Ben-Ezra Naoya Ohta Navneet Dalal Nick Barnes
Nicu Sebe Noboru Babaguchi Nobutaka Shimada Ondrej Drbohlav Osamu Hasegawa Pascal Vasseur Patrice Delmas Pei Chen Peter Sturm Philippos Mordohai Pierre Jannin Ping Tan Prabir Kumar Biswas Prem Kalra Qiang Wang Qiao Yu Qingshan Liu QiuQi Ruan Radim Sara Rae-Hong Park Ralf Reulke Ralph Gross Reinhard Koch Rene Vidal Robert Pless Rogerio Feris Ron Kimmel Ruigang Yang Ryad Benosman Ryusuke Sagawa S.H. Srinivasan S. Kevin Zhou Seungjin Choi Sharat Chandran Sheng-Wen Shih Shihong Lao Shingo Kagami Shin'ichi Satoh Shinsaku Hiura Shiguang Shan Shmuel Peleg Shoji Tominaga Shuicheng Yan Stan Birchfield Stefan Gehrig
Stephen Lin Stephen Maybank Subhashis Banerjee Subrata Rakshit Sumantra Dutta Roy Svetlana Lazebnik Takayuki Okatani Takekazu Kato Tat-Jen Cham Terence Sim Tetsuji Haga Theo Gevers Thomas Brox Thomas Leung Tian Fang Til Aach Tomas Svoboda Tomokazu Sato Toshio Sato Toshio Ueshiba Tyng-Luh Liu Vincent Lepetit Vivek Kwatra Vladimir Pavlovic Wee-Kheng Leow Wei Liu Weiming Hu Wen-Nung Lie Xianghua Ying Xianling Li Xiaogang Wang Xiaojuan Wu Yacoob Yaser Yaron Caspi Yasushi Sumi Yasutaka Furukawa Yasuyuki Sugaya Yeong-Ho Ha Yi-ping Hung Yong-Sheng Chen Yoshinori Kuno Yoshio Iwai Yoshitsugu Manabe Young Shik Moon Yunde Jia
Zen Chen Zhifeng Li Zhigang Zhu
Zhouchen Lin Zhuowen Tu Zuzana Kukelova
Additional Reviewers Afshin Sepehri Alvina Goh Anthony Dick Avinash Ravichandran Baidya Saha Brian Clipp Cédric Demonceaux Christian Beder Christian Schmaltz Christian Wojek Chunhua Shen Chun-Wei Chen Claude Pégard D.H. Ye D.J. Kwon Daniel Hein David Fofi David Gallup De-Zheng Liu Dhruv K. Mahajan Dipti Mukherjee Edgar Seemann Edgardo Molina El Mustapha Mouaddib Emmanuel Prados Frank R. Schmidt Frederik Meysel Gao Yan Guy Rosman Gyuri Dorko H.J. Shim Hang Yu Hao Du Hao Tang Hao Zhang Hiroshi Ohno Huang Wei Hynek Bakstein
Ilya Levner Imran Junejo Jan Woetzel Jian Chen Jianzhao Qin Jimmy Jiang Liu Jing Wu John Bastian Juergen Gall K.J. Lee Kalin Kolev Karel Zimmermann Ketut Fundana Koichi Kise Kongwah Wan Konrad Schindler Kooksang Moon Levi Valgaerts Li Guan Li Shen Liang Wang Lin Liang Lingyu Duan Maojun Yuan Mario Fritz Martin Bujnak Martin Matousek Martin Sunkel Martin Welk Micha Andriluka Michael Stark Minh-Son Dao Naoko Nitta Neeraj Kanhere Niels Overgaard Nikhil Rane Nikodem Majer Nilanjan Ray Nils Hasler
Nipun kwatra Olivier Morel Omar El Ganaoui Pankaj Kumar Parag Chaudhuri Paul Schnitzspan Pavel Kuksa Petr Doubek Philippos Mordohai Reiner Schnabel Rhys Hill Rizwan Chaudhry Rui Huang S.M. Shahed Nejhum S.H. Lee Sascha Bauer Shao-Wen Yang Shengshu Wang Shiro Kumano Shiv Vitaladevuni Shrinivas Pundlik Sio-Hoi Ieng Somnath Sengupta Sudipta Mukhopadhyay Takahiko Horiuchi Tao Wang Tat-Jun Chin Thomas Corpetti Thomas Schoenemann Thorsten Thormaehlen Weihong Li Weiwei Zhang Xiaoyi Yu Xinguo Yu Xinyu Huang Xuan Song Yi Feng Yichen Wei Yiqun Li
Yong MA Yoshihiko Kawai
Zhichao Chen Zhijie Wang
Sponsors

Sponsor: Asian Federation of Computer Vision
Technical Co-sponsors: IPSJ SIG-CVIM, IEICE TG-PRMU
Table of Contents – Part I
Plenary and Invited Talks Less Is More: Coded Computational Photography . . . . . . . . . . . . . . . . . . . . Ramesh Raskar
1
Optimal Algorithms in Multiview Geometry . . . . . . . . . . . . . . . . . . . . . . . . . Richard Hartley and Fredrik Kahl
13
Machine Vision in Early Days: Japan’s Pioneering Contributions . . . . . . Masakazu Ejiri
35
Shape and Texture Coarse-to-Fine Statistical Shape Model by Bayesian Inference . . . . . . . . . Ran He, Stan Li, Zhen Lei, and ShengCai Liao
54
Efficient Texture Representation Using Multi-scale Regions . . . . . . . . . . . . Horst Wildenauer, Branislav Mičušík, and Markus Vincze
65
Fitting Comparing Timoshenko Beam to Energy Beam for Fitting Noisy Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ilić Slobodan A Family of Quadratic Snakes for Road Extraction . . . . . . . . . . . . . . . . . . . Ramesh Marikhu, Matthew N. Dailey, Stanislav Makhanov, and Kiyoshi Honda
75 85
Poster Session 1: Calibration Multiperspective Distortion Correction Using Collineations . . . . . . . . . . . . Yuanyuan Ding and Jingyi Yu
95
Camera Calibration from Silhouettes Under Incomplete Circular Motion with a Constant Interval Angle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Po-Hao Huang and Shang-Hong Lai
106
Mirror Localization for Catadioptric Imaging System by Observing Parallel Light Pairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ryusuke Sagawa, Nobuya Aoki, and Yasushi Yagi
116
Calibrating Pan-Tilt Cameras with Telephoto Lenses . . . . . . . . . . . . . . . . . Xinyu Huang, Jizhou Gao, and Ruigang Yang
127
Camera Calibration Using Principal-Axes Aligned Conics . . . . . . . . . . . . . Xianghua Ying and Hongbin Zha
138
Poster Session 1: Detection 3D Intrusion Detection System with Uncalibrated Multiple Cameras . . . . Satoshi Kawabata, Shinsaku Hiura, and Kosuke Sato Non-parametric Background and Shadow Modeling for Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tatsuya Tanaka, Atsushi Shimada, Daisaku Arita, and Rin-ichiro Taniguchi Road Sign Detection Using Eigen Color . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Luo-Wei Tsai, Yun-Jung Tseng, Jun-Wei Hsieh, Kuo-Chin Fan, and Jiun-Jie Li
149
159
169
Localized Content-Based Image Retrieval Using Semi-supervised Multiple Instance Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dan Zhang, Zhenwei Shi, Yangqiu Song, and Changshui Zhang
180
Object Detection Combining Recognition and Segmentation . . . . . . . . . . . Liming Wang, Jianbo Shi, Gang Song, and I-fan Shen
189
An Efficient Method for Text Detection in Video Based on Stroke Width Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Viet Cuong Dinh, Seong Soo Chun, Seungwook Cha, Hanjin Ryu, and Sanghoon Sull
200
Multiview Pedestrian Detection Based on Vector Boosting . . . . . . . . . . . . Cong Hou, Haizhou Ai, and Shihong Lao
210
Pedestrian Detection Using Global-Local Motion Patterns . . . . . . . . . . . . . Dhiraj Goel and Tsuhan Chen
220
Poster Session 1: Image and Video Processing Qualitative and Quantitative Behaviour of Geometrical PDEs in Image Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Arjan Kuijper
230
Automated Billboard Insertion in Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hitesh Shah and Subhasis Chaudhuri
240
Improved Background Mixture Models for Video Surveillance Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chris Poppe, Gaëtan Martens, Peter Lambert, and Rik Van de Walle
251
High Dynamic Range Scene Realization Using Two Complementary Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ming-Chian Sung, Te-Hsun Wang, and Jenn-Jier James Lien
261
Automated Removal of Partial Occlusion Blur . . . . . . . . . . . . . . . . . . . . . . . Scott McCloskey, Michael Langer, and Kaleem Siddiqi
271
Poster Session 1: Applications High Capacity Watermarking in Nonedge Texture Under Statistical Distortion Constraint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fan Zhang, Wenyu Liu, and Chunxiao Liu Attention Monitoring for Music Contents Based on Analysis of Signal-Behavior Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Masatoshi Ohara, Akira Utsumi, Hirotake Yamazoe, Shinji Abe, and Noriaki Katayama View Planning for Cityscape Archiving and Visualization . . . . . . . . . . . . . Jiang Yu Zheng and Xiaolong Wang
282
292
303
Face and Gesture Synthesis of Exaggerative Caricature with Inter and Intra Correlations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chien-Chung Tseng and Jenn-Jier James Lien Pose-Invariant Facial Expression Recognition Using Variable-Intensity Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shiro Kumano, Kazuhiro Otsuka, Junji Yamato, Eisaku Maeda, and Yoichi Sato Gesture Recognition Under Small Sample Size . . . . . . . . . . . . . . . . . . . . . . . Tae-Kyun Kim and Roberto Cipolla
314
324
335
Tracking Motion Observability Analysis of the Simplified Color Correlogram for Visual Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Qi Zhao and Hai Tao
345
On-Line Ensemble SVM for Robust Object Tracking . . . . . . . . . . . . . . . . . Min Tian, Weiwei Zhang, and Fuqiang Liu
355
Multi-camera People Tracking by Collaborative Particle Filters and Principal Axis-Based Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wei Du and Justus Piater
365
Poster Session 2: Camera Networks Finding Camera Overlap in Large Surveillance Networks . . . . . . . . . . . . . . Anton van den Hengel, Anthony Dick, Henry Detmold, Alex Cichowski, and Rhys Hill
375
Information Fusion for Multi-camera and Multi-body Structure and Motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alexander Andreopoulos and John K. Tsotsos
385
Task Scheduling in Large Camera Networks . . . . . . . . . . . . . . . . . . . . . . . . . Ser-Nam Lim, Larry Davis, and Anurag Mittal
397
Poster Session 2: Face/Gesture/Action Detection and Recognition Constrained Optimization for Human Pose Estimation from Depth Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Youding Zhu and Kikuo Fujimura
408
Generative Estimation of 3D Human Pose Using Shape Contexts Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xu Zhao and Yuncai Liu
419
An Active Multi-camera Motion Capture for Face, Fingers and Whole Body . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Eng Hui Loke and Masanobu Yamamoto
430
Tracking and Classifying Human Motions with Gaussian Process Annealed Particle Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Leonid Raskin, Michael Rudzsky, and Ehud Rivlin
442
Gait Identification Based on Multi-view Observations Using Omnidirectional Camera . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kazushige Sugiura, Yasushi Makihara, and Yasushi Yagi
452
Gender Classification Based on Fusion of Multi-view Gait Sequences . . . . Guochang Huang and Yunhong Wang
462
Poster Session 2: Learning MAPACo-Training: A Novel Online Learning Algorithm of Behavior Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Heping Li, Zhanyi Hu, Yihong Wu, and Fuchao Wu
472
Optimal Learning High-Order Markov Random Fields Priors of Colour Image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ke Zhang, Huidong Jin, Zhouyu Fu, and Nianjun Liu
482
Hierarchical Learning of Dominant Constellations for Object Class Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nathan Mekuz and John K. Tsotsos
492
Multistrategical Approach in Visual Learning . . . . . . . . . . . . . . . . . . . . . . . . Hiroki Nomiya and Kuniaki Uehara
502
Poster Session 2: Motion and Tracking Cardiac Motion Estimation from Tagged MRI Using 3D-HARP and NURBS Volumetric Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jia Liang, Yuanquan Wang, and Yunde Jia
512
Fragments Based Parametric Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Prakash C., Balamanohar Paluri, Nalin Pradeep S., and Hitesh Shah
522
Spatiotemporal Oriented Energy Features for Visual Tracking . . . . . . . . . Kevin Cannons and Richard Wildes
532
Synchronized Ego-Motion Recovery of Two Face-to-Face Cameras . . . . . . Jinshi Cui, Yasushi Yagi, Hongbin Zha, Yasuhiro Mukaigawa, and Kazuaki Kondo
544
Optical Flow – Driven Motion Model with Automatic Variance Adjustment for Adaptive Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kazuhiko Kawamoto
555
A Noise-Insensitive Object Tracking Algorithm . . . . . . . . . . . . . . . . . . . . . . Chunsheng Hua, Qian Chen, Haiyuan Wu, and Toshikazu Wada
565
Discriminative Mean Shift Tracking with Auxiliary Particles . . . . . . . . . . . Junqiu Wang and Yasushi Yagi
576
Poster Session 2: Retrival and Search Efficient Search in Document Image Collections . . . . . . . . . . . . . . . . . . . . . . Anand Kumar, C.V. Jawahar, and R. Manmatha
586
Human Pose Estimation Hand Posture Estimation in Complex Backgrounds by Considering Mis-match of Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Akihiro Imai, Nobutaka Shimada, and Yoshiaki Shirai
596
Learning Generative Models for Monocular Body Pose Estimation . . . . . Tobias Jaeggli, Esther Koller-Meier, and Luc Van Gool
608
Human Pose Estimation from Volume Data and Topological Graph Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hidenori Tanaka, Atsushi Nakazawa, and Haruo Takemura
618
Matching Logical DP Matching for Detecting Similar Subsequence . . . . . . . . . . . . . . Seiichi Uchida, Akihiro Mori, Ryo Kurazume, Rin-ichiro Taniguchi, and Tsutomu Hasegawa
628
Efficient Normalized Cross Correlation Based on Adaptive Multilevel Successive Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shou-Der Wei and Shang-Hong Lai
638
Exploiting Inter-frame Correlation for Fast Video to Reference Image Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Arif Mahmood and Sohaib Khan
647
Poster Session 3: Face/Gesture/Action Detection and Recognition Flea, Do You Remember Me? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Michael Grabner, Helmut Grabner, Joachim Pehserl, Petra Korica-Pehserl, and Horst Bischof
657
Multi-view Gymnastic Activity Recognition with Fused HMM . . . . . . . . . Ying Wang, Kaiqi Huang, and Tieniu Tan
667
Real-Time and Marker-Free 3D Motion Capture for Home Entertainment Oriented Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Brice Michoud, Erwan Guillou, Hector Briceño, and Saïda Bouakaz
678
Tracking Iris Contour with a 3D Eye-Model for Gaze Estimation . . . . . . . Haiyuan Wu, Yosuke Kitagawa, Toshikazu Wada, Takekazu Kato, and Qian Chen
688
Eye Correction Using Correlation Information . . . . . . . . . . . . . . . . . . . . . . . Inho Choi and Daijin Kim
698
Eye-Gaze Detection from Monocular Camera Image Using Parametric Template Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ryo Ohtera, Takahiko Horiuchi, and Shoji Tominaga
708
An FPGA-Based Smart Camera for Gesture Recognition in HCI Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yu Shi and Timothy Tsui
718
Poster Session 3: Low Level Vision and Phtometory Color Constancy Via Convex Kernel Optimization . . . . . . . . . . . . . . . . . . . Xiaotong Yuan, Stan Z. Li, and Ran He
728
User-Guided Shape from Shading to Reconstruct Fine Details from a Single Photograph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alexandre Meyer, Hector M. Briceño, and Saïda Bouakaz
738
A Theoretical Approach to Construct Highly Discriminative Features with Application in AdaBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yuxin Jin, Linmi Tao, Guangyou Xu, and Yuxin Peng
748
Robust Foreground Extraction Technique Using Gaussian Family Model and Multiple Thresholds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hansung Kim, Ryuuki Sakamoto, Itaru Kitahara, Tomoji Toriyama, and Kiyoshi Kogure Feature Management for Efficient Camera Tracking . . . . . . . . . . . . . . . . . . Harald Wuest, Alain Pagani, and Didier Stricker Measurement of Reflection Properties in Ancient Japanese Drawing Ukiyo-e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xin Yin, Kangying Cai, Yuki Takeda, Ryo Akama, and Hiromi T. Tanaka
758
769
779
Texture-Independent Feature-Point Matching (TIFM) from Motion Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ping Li, Dirk Farin, Rene Klein Gunnewiek, and Peter H.N. de With
789
Where’s the Weet-Bix? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yuhang Zhang, Lei Wang, Richard Hartley, and Hongdong Li
800
How Marginal Likelihood Inference Unifies Entropy, Correlation and SNR-Based Stopping in Nonlinear Diffusion Scale-Spaces . . . . . . . . . . . . . . Ramūnas Girdziušas and Jorma Laaksonen
811
Poster Session 3: Motion and Tracking Kernel-Bayesian Framework for Object Tracking . . . . . . . . . . . . . . . . . . . . . Xiaoqin Zhang, Weiming Hu, Guan Luo, and Steve Maybank
821
Markov Random Field Modeled Level Sets Method for Object Tracking with Moving Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xue Zhou, Weiming Hu, Ying Chen, and Wei Hu
832
Continuously Tracking Objects Across Multiple Widely Separated Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yinghao Cai, Wei Chen, Kaiqi Huang, and Tieniu Tan
843
Adaptive Multiple Object Tracking Using Colour and Segmentation Cues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pankaj Kumar, Michael J. Brooks, and Anthony Dick
853
Image Assimilation for Motion Estimation of Atmospheric Layers with Shallow-Water Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nicolas Papadakis, Patrick Héas, and Étienne Mémin
864
Probability Hypothesis Density Approach for Multi-camera Multi-object Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nam Trung Pham, Weimin Huang, and S.H. Ong
875
Human Detection AdaBoost Learning for Human Detection Based on Histograms of Oriented Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chi-Chen Raxle Wang and Jenn-Jier James Lien
885
Multi-posture Human Detection in Video Frames by Motion Contour Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Qixiang Ye, Jianbin Jiao, and Hua Yu
896
A Cascade of Feed-Forward Classifiers for Fast Pedestrian Detection . . . . Yu-Ting Chen and Chu-Song Chen
905
Combined Object Detection and Segmentation by Using Space-Time Patches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yasuhiro Murai, Hironobu Fujiyoshi, and Takeo Kanade
915
Segmentation Embedding a Region Merging Prior in Level Set Vector-Valued Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ismail Ben Ayed and Amar Mitiche
925
A Basin Morphology Approach to Colour Image Segmentation by Region Merging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Erchan Aptoula and Sébastien Lefèvre
935
Detecting and Segmenting Un-occluded Items by Actively Casting Shadows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tze K. Koh, Amit Agrawal, Ramesh Raskar, Steve Morgan, Nicholas Miles, and Barrie Hayes-Gill
945
A Local Probabilistic Prior-Based Active Contour Model for Brain MR Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jundong Liu, Charles Smith, and Hima Chebrolu
956
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
965
Less Is More: Coded Computational Photography Ramesh Raskar Mitsubishi Electric Research Labs (MERL), Cambridge, MA, USA
Abstract. Computational photography combines plentiful computing, digital sensors, modern optics, actuators, and smart lights to escape the limitations of traditional cameras, enables novel imaging applications and simplifies many computer vision tasks. However, a majority of current Computational Photography methods involve taking multiple sequential photos by changing scene parameters and fusing the photos to create a richer representation. The goal of Coded Computational Photography is to modify the optics, illumination or sensors at the time of capture so that the scene properties are encoded in a single (or a few) photographs. We describe several applications of coding exposure, aperture, illumination and sensing and describe emerging techniques to recover scene parameters from coded photographs.
1 Introduction Computational photography combines plentiful computing, digital sensors, modern optics, actuators, and smart lights to escape the limitations of traditional cameras, enable novel imaging applications, and simplify many computer vision tasks. Unbounded dynamic range, variable focus, resolution, and depth of field, hints about shape, reflectance, and lighting, and new interactive forms of photos that are partly snapshots and partly videos are just some of the new applications found in Computational Photography. In this paper, we discuss Coded Photography, which involves encoding of the photographic signal and post-capture decoding for improved scene analysis. With film-like photography, the captured image is a 2D projection of the scene. Due to limited capabilities of the camera, the recorded image is a partial representation of the view. Nevertheless, the captured image is ready for human consumption: what you see is what you almost get in the photo. In Coded Photography, the goal is to achieve a potentially richer representation of the scene during the encoding process. In some cases, Computational Photography reduces to 'Epsilon Photography', where the scene is recorded via multiple images, each captured by epsilon variation of the camera parameters. For example, successive images (or neighboring pixels) may have a different exposure, focus, aperture, view, illumination, or instant of capture. Each setting allows recording of partial information about the scene and the final image is reconstructed from these multiple observations. In Coded Computational Photography, the recorded image may appear distorted or random to a human observer. But the corresponding decoding recovers valuable information about the scene. 'Less is more' in Coded Photography. By blocking light over time or space, we can preserve more details about the scene in the recorded single photograph. In this paper we look at four specific examples.
(a) Coded Exposure: By blocking light in time, by fluttering the shutter open and closed in a carefully chosen binary sequence, we can preserve high spatial frequencies of fast-moving objects to support high quality motion deblurring. (b) Coded Aperture Optical Heterodyning: By blocking light near the sensor with a sinusoidal grating mask, we can record the 4D light field on a 2D sensor. And by blocking light with a mask at the aperture, we can extend the depth of field and achieve full resolution digital refocussing. (c) Coded Illumination: By observing blocked light at silhouettes, a multi-flash camera can locate depth discontinuities in challenging scenes without depth recovery. (d) Coded Sensing: By sensing intensities with lateral inhibition, a gradient sensing camera can record large as well as subtle changes in intensity to recover a high-dynamic-range image. We describe several applications of coding exposure, aperture, illumination and sensing and describe emerging techniques to recover scene parameters from coded photographs. 1.1 Film-Like Photography Photography is the process of making pictures by, literally, 'drawing with light' or recording the visually meaningful changes in the light leaving a scene. This goal was established for film photography about 150 years ago. Currently, 'digital photography' is electronically implemented film photography, refined and polished to achieve the goals of the classic film camera, which were governed by chemistry, optics, and mechanical shutters. Film-like photography presumes (and often requires) artful human judgment, intervention, and interpretation at every stage to choose viewpoint, framing, timing, lenses, film properties, lighting, developing, printing, display, search, index, and labelling. In this article we plan to explore a progression away from film and film-like methods to something more comprehensive that exploits plentiful low-cost computing and memory with sensors, optics, probes, smart lighting and communication. 1.2 What Is Computational Photography? Computational Photography (CP) is an emerging field, just getting started. We don't know where it will end up, and we can't yet set its precise, complete definition, nor make a reliably comprehensive classification. But here is the scope of what researchers are currently exploring in this field. – Computational photography attempts to record a richer visual experience, captures information beyond just a simple set of pixels and makes the recorded scene representation far more machine readable. – It exploits computing, memory, interaction and communications to overcome long-standing limitations of photographic film and camera mechanics that have persisted in film-style digital photography, such as constraints on dynamic
range, depth of field, field of view, resolution and the extent of scene motion during exposure. – It enables new classes of recording the visual signal such as the 'moment' [Cohen 2005], shape boundaries for non-photorealistic depiction [Raskar et al 2004], foreground versus background mattes, estimates of 3D structure, 'relightable' photos and interactive displays that permit users to change lighting, viewpoint, focus, and more, capturing some useful, meaningful fraction of the 'light field' of a scene, a 4-D set of viewing rays. – It enables synthesis of impossible photos that could not have been captured at a single instant with a single camera, such as wrap-around views ('multiple-center-of-projection' images [Rademacher and Bishop 1998]), fusion of time-lapsed events [Raskar et al 2004], the motion-microscope (motion magnification [Liu et al 2005]), video textures and panoramas [Agarwala et al 2005]. They also support seemingly impossible camera movements such as the 'bullet time' (Matrix) sequence recorded with multiple cameras with staggered exposure times. – It encompasses previously exotic forms of scientific imaging and data-gathering techniques, e.g., from astronomy, microscopy, and tomography. 1.3 Elements of Computational Photography Traditional film-like photography involves (a) a lens, (b) a 2D planar sensor and (c) a processor that converts sensed values into an image. In addition, the photography may involve (d) external illumination from point sources (e.g. flash units) and area sources (e.g. studio lights).
Fig. 1. Elements of Computational Photography: the scene acts as an 8D ray modulator; novel illumination (light sources and modulators) provides 4D incident lighting; novel cameras combine generalized optics (4D ray benders), generalized sensors (up-to-4D ray samplers) and processing (ray reconstruction); a 4D light field display recreates the 4D light field.
Computational Photography generalizes these four elements. (a) Generalized Optics: Each optical element is treated as a 4D ray-bender that modifies a light field. The incident 4D light field for a given wavelength is transformed into a new 4D light field. The optics may involve more than one optical axis [Georgiev et al 2006]. In some cases the perspective foreshortening of objects based on distance may be modified using wavefront-coded optics [Dowski and Cathey 1995]. In recent lensless imaging methods [Zomet and Nayar 2006] and in coded-aperture imaging [Zand 1996] used for gamma-ray and X-ray astronomy, the traditional lens is missing entirely. In some cases optical elements such as mirrors [Nayar et al 2004] outside the camera adjust the linear combinations of ray bundles that reach the sensor pixel to adapt the sensor to the viewed scene. (b) Generalized Sensors: All light sensors measure some combined fraction of the 4D light field impinging on them, but traditional sensors capture only a 2D projection of this light field. Computational photography attempts to capture more: a 3D or 4D ray representation using planar, non-planar or even volumetric sensor assemblies. For example, a traditional out-of-focus 2D image is the result of a capture-time decision: each detector pixel gathers light from its own bundle of rays that do not converge on the focused object. But a Plenoptic Camera [Adelson and Wang 1992, Ren et al 2005] subdivides these bundles into separate measurements. Computing a weighted sum of rays that converge on the objects in the scene creates a digitally refocused image, and even permits multiple focusing distances within a single computed image. Generalizing sensors can extend their dynamic range [Tumblin et al 2005] and wavelength selectivity as well. While traditional sensors trade spatial resolution for color measurement (wavelengths) using a Bayer grid of red, green and blue filters on individual pixels, some modern sensor designs determine photon wavelength by sensor penetration, permitting several spectral estimates at a single pixel location [Foveon 2004]. (c) Generalized Reconstruction: Conversion of raw sensor outputs into picture values can be much more sophisticated. While existing digital cameras perform 'demosaicking' (interpolating the Bayer grid), remove fixed-pattern noise, and hide 'dead' pixel sensors, recent work in computational photography can do more. Reconstruction might combine disparate measurements in novel ways by considering the camera intrinsic parameters used during capture. For example, the processing might construct a high dynamic range scene from multiple photographs from coaxial lenses or from sensed gradients [Tumblin et al 2005], or compute sharp images of a fast-moving object from a single image taken by a camera with a 'fluttering' shutter [Raskar et al 2006]. Closed-loop control during photography itself can also be extended, exploiting traditional cameras' exposure control, image stabilizing, and focus, as new opportunities for modulating the scene's optical signal for later decoding. (d) Computational Illumination: Photographic lighting has changed very little since the 1950s: with digital video projectors, servos, and device-to-device communication, we have new opportunities to control the sources of light with as much sophistication as we use to control our digital sensors. What sorts of spatio-temporal modulations of light might better reveal the visually important contents
of a scene? Harold Edgerton showed that high-speed strobes offer tremendous new appearance-capturing capabilities; how many new advantages can we realize by replacing the 'dumb' flash units, static spot lights and reflectors with actively controlled spatio-temporal modulators and optics? Already we can capture occluding edges with multiple flashes [Raskar 2004], exchange cameras and projectors by Helmholtz reciprocity [Sen et al 2005], gather relightable actors' performances with light stages [Wagner et al 2005] and see through muddy water with coded-mask illumination [Levoy et al 2004]. In every case, better lighting control during capture builds richer representations of photographed scenes.
2 Sampling Dimensions of Imaging 2.1 Epsilon Photography for Optimizing Film-Like Cameras Think of film cameras at their best as defining a 'box' in the multi-dimensional space of imaging parameters. The first, most obvious thing we can do to improve digital cameras is to expand this box in every conceivable dimension. This effort reduces Computational Photography to 'Epsilon Photography', where the scene is recorded via multiple images, each captured by epsilon variation of the camera parameters. For example, successive images (or neighboring pixels) may have different settings for parameters such as exposure, focus, aperture, view, illumination, or the instant of capture. Each setting allows recording of partial information about the scene, and the final image is reconstructed from these multiple observations. Epsilon photography is thus a concatenation of many such boxes in parameter space; multiple film-style photos are computationally merged to make a more complete photo or scene description. While the merged photo is superior, each of the individual photos is still useful and comprehensible on its own, without any of the others. The merged photo contains the best features from all of them. (a) Field of view: A wide field of view panorama is achieved by stitching and mosaicking pictures taken by panning a camera around a common center of projection or by translating a camera over a near-planar scene. (b) Dynamic range: A high dynamic range image is captured by merging photos at a series of exposure values [Debevec and Malik 1997, Kang et al 2003]; a minimal merging sketch follows this list. (c) Depth of field: An all-in-focus image is reconstructed from images taken by successively changing the plane of focus [Agrawala et al 2005]. (d) Spatial resolution: Higher resolution is achieved by tiling multiple cameras (and mosaicing individual images) [Wilburn et al 2005] or by jittering a single camera [Landolt et al 2001]. (e) Wavelength resolution: Traditional cameras sample only three basis colors, but multi-spectral (multiple colors in the visible spectrum) or hyper-spectral (wavelengths beyond the visible spectrum) imaging is accomplished by taking pictures while successively changing color filters in front of the camera, using tunable wavelength filters, or using diffraction gratings. (f) Temporal resolution: High-speed imaging is achieved by staggering the exposure times of multiple low-framerate cameras. The exposure durations of individual cameras can be non-overlapping [Wilburn et al 2005] or overlapping [Shechtman et al 2002].
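To make the dynamic-range dimension in (b) concrete, the sketch below shows one simple way a bracketed exposure stack can be merged into a single radiance estimate, assuming a linear sensor response and known exposure times. It is only an illustrative approximation, not the procedure of [Debevec and Malik 1997], which additionally recovers the camera response curve; the function name and weighting scheme are our own.

```python
import numpy as np

def merge_exposures(images, exposure_times):
    """Merge a bracketed stack of linear images into one HDR radiance map.

    images: list of float arrays in [0, 1], all the same shape.
    exposure_times: matching list of exposure durations (seconds).
    Assumes a linear sensor response; a real pipeline would first
    calibrate and invert the camera response curve.
    """
    numerator = np.zeros_like(images[0], dtype=np.float64)
    denominator = np.zeros_like(images[0], dtype=np.float64)
    for img, t in zip(images, exposure_times):
        # Hat-shaped weight: trust mid-range pixels, distrust pixels that
        # are nearly black (noisy) or nearly saturated.
        w = 1.0 - np.abs(2.0 * img - 1.0)
        numerator += w * img / t          # per-image radiance estimate
        denominator += w
    return numerator / np.maximum(denominator, 1e-6)
```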
Taking multiple images under varying camera parameters can be achieved in several ways. The images can be taken with a single camera over time. The images can be captured simultaneously using 'assorted pixels', where each pixel is tuned to a different value for a given parameter [Nayar and Narasimhan 2002]. Simultaneous capture of multiple samples can also be recorded using multiple cameras, each camera having different values for a given parameter. Two designs are currently being used for multi-camera solutions: a camera array [Wilburn et al 2005] and single-axis multiple parameter (co-axial) cameras [Mcguire et al 2005].

Fig. 2. Blocking light to achieve Coded Photography. (Left) Coded exposure: using a temporal 1-D broadband code to block and unblock light over time, a coded exposure photo can reversibly encode motion blur (Raskar et al 2006). (Right) Coded aperture: using a spatial 2-D broadband code to block parts of the light via a masked aperture, a coded aperture photo can reversibly encode defocus blur (Veeraraghavan et al 2007).
2.2 Coded Photography But there is much more beyond the ‘best possible film camera’. We can virtualize the notion of the camera itself if we consider it as a device that collects bundles of rays, each ray with its own wavelength spectrum and exposure duration. Coded Photography is a notion of an ‘out-of-the-box’ photographic method, in which individual (ray) samples or data sets may or may not be comprehensible as ‘images’ without further decoding, re-binning or reconstruction. Coded aperture techniques, inspired by work in astronomical imaging, try to preserve high spatial frequencies so that out of focus blurred images can be digitally re-focused [Veeraraghavan07]. By coding illumination, it is possible to decompose radiance in a scene into direct and global components [Nayar06]. Using a coded exposure technique, one can rapidly flutter open and close the shutter of a camera in a carefully chosen binary sequence, to capture a single photo. The fluttered shutter encoded the motion in the scene in the observed blur in a reversible way. Other examples include confocal images and techniques to recover glare in the images [Talvala07].
We may be converging on a new, much more capable 'box' of parameters in computational photography that we don't yet recognize; there is still quite a bit of innovation to come! In the rest of the article, we survey recent techniques that exploit exposure, focus, active illumination and sensors.

Fig. 3. An overview of projects. Coding in time (exposure [Raskar et al 2006]) or space (aperture; mask-based optical heterodyning [Veeraraghavan et al 07]), coding the incident active illumination (inter-view [Raskar et al 2004]; intra-view [Nayar et al 2006]), and coding the sensing pattern (gradient sensor, i.e., differential encoding [Tumblin et al 2005]).
3 Coded Exposure In a conventional single-exposure photograph, moving objects or moving cameras cause motion blur. The exposure time defines a temporal box filter that smears the moving object across the image by convolution. This box filter destroys important high-frequency spatial details, so that deblurring via deconvolution becomes an ill-posed problem. We have proposed to flutter the camera's shutter open and closed during the chosen exposure time with a binary pseudo-random sequence, instead of leaving it open as in a traditional camera [Raskar et al 2006]. The flutter changes the box filter to a broad-band filter that preserves high-frequency spatial details in the blurred image, and the corresponding deconvolution becomes a well-posed problem. Results on several challenging cases of motion-blur removal, including outdoor scenes, extremely large motions, textured backgrounds and partial occluders, were presented. However, the method assumes that the PSF is given or is obtained by simple user interaction. Since changing the integration time of conventional CCD cameras is not feasible, an external ferro-electric shutter is placed in front of the lens to code the exposure. The shutter is driven opaque and transparent according to the binary signals generated by a PIC microcontroller using the pseudo-random binary sequence.
Fig. 4. The flutter shutter camera. The coded exposure is achieved by fluttering the shutter open and closed. Instead of a mechanical movement of the shutter, we used a ferro-electric LCD in front of the lens. It is driven opaque and transparent according to the desired binary sequence.
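To illustrate why the broadband code makes deconvolution well posed, the following one-dimensional sketch simulates motion blur as convolution with a normalized shutter code and inverts it with a regularized frequency-domain division. The code sequence here is a hypothetical pseudo-random one (not the published sequence), and the simple Wiener-style inverse stands in for the linear solve used in the actual system.

```python
import numpy as np

def motion_blur(signal, code):
    """Blur a 1-D signal by convolving it with the (normalized) shutter code."""
    kernel = np.asarray(code, dtype=np.float64)
    kernel /= kernel.sum()
    return np.convolve(signal, kernel, mode="full")

def deblur(blurred, code, eps=1e-3):
    """Invert the blur with a regularized division in the frequency domain."""
    kernel = np.asarray(code, dtype=np.float64)
    kernel /= kernel.sum()
    n = len(blurred)
    K = np.fft.rfft(kernel, n)
    B = np.fft.rfft(blurred, n)
    # A broadband (fluttered) code keeps |K| away from zero at all
    # frequencies, so this division is stable; a box filter (shutter left
    # open) has near-zero frequencies and the same step amplifies noise.
    X = B * np.conj(K) / (np.abs(K) ** 2 + eps)
    return np.fft.irfft(X, n)[: n - len(kernel) + 1]

# Hypothetical 32-chop pseudo-random shutter code (for illustration only).
rng = np.random.default_rng(0)
code = rng.integers(0, 2, size=32)
code[0] = 1                                   # shutter opens at least once
sharp = np.zeros(200); sharp[80:120] = 1.0    # toy 1-D "object"
recovered = deblur(motion_blur(sharp, code), code)
```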
4 Coded Aperture and Optical Heterodyning Can we capture additional information about a scene by inserting a patterned mask inside a conventional camera? We use a patterned attenuating mask to encode the light field entering the camera. Depending on where we put the mask, we can effect desired frequency domain modulation of the light field. If we put the mask near the lens aperture, we can achieve full resolution digital refocussing. If we put the mask near the sensor, we can recover a 4D light field without any additional lenslet array.
Fig. 5. The Encoded Blur Camera, i.e. with a mask in the aperture, can preserve high spatial frequencies in the defocus blur. Notice the glint in the eye. In the misfocused photo, on the left, the bright spot appears blurred with the bokeh of the chosen aperture (shown in the inset). In the deblurred result, on the right, the details on the eye are correctly recovered.
Ren et al. have developed a camera that can capture the 4D light field incident on the image sensor in a single photographic exposure [Ren et al. 2005]. This is achieved by inserting a microlens array between the sensor and main lens, creating a plenoptic camera. Each microlens measures not just the total amount of light deposited at that location, but how much light arrives along each ray. By re-sorting the measured rays of light to where they would have terminated in slightly different, synthetic cameras, one can compute sharp photographs focused at different depths. A linear increase in the resolution of images under each microlens results in a linear increase in the sharpness of the refocused photographs. This property allows one to extend the depth of field of the camera without reducing the aperture, enabling shorter exposures and lower image noise. Our group has shown that it is also possible to create a plenoptic camera using a patterned mask instead of a lenslet array; the geometric configuration remains nearly identical [Veeraraghavan2007]. The method is known as 'spatial optical heterodyning'. Instead of remapping rays in 4D using a microlens array so that they can be captured on a 2D sensor, spatial optical heterodyning remaps frequency components of the 4D light field so that the frequency components can be recovered from the Fourier transform of the captured 2D image. In the microlens-array-based design, each pixel effectively records light along a single ray bundle. With patterned masks, each pixel records a linear combination of multiple ray bundles. By carefully coding the linear combination, the coded heterodyning method can reconstruct the values of individual ray bundles. This is a reversible modulation of the 4D light field by inserting a patterned planar mask in the optical path of a lens-based camera. We can reconstruct the 4D light field from a 2D camera image. The patterned mask attenuates light rays inside the camera instead of bending them, and the attenuation recoverably encodes the rays on the 2D sensor. Our mask-equipped camera focuses just as a traditional camera might to capture conventional 2D photos at full sensor resolution, but the raw pixel values also hold a modulated 4D light field. The light field can be recovered by rearranging the tiles of the 2D Fourier transform of sensor values into 4D planes, and computing the inverse Fourier transform.

Fig. 6. Coding the light field entering a camera via a mask: a mask at the aperture gives a coded aperture for full resolution digital refocusing, while a mask near the sensor forms the heterodyne light field camera.
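The decoding step can be sketched in a reduced setting with one spatial and one angular dimension: the sensor signal's spectrum contains a row of modulated replicas, and cutting that spectrum into tiles, stacking the tiles along a new (angular) axis and applying an inverse 2D transform yields a space-angle light field. The routine below is only schematic; it assumes an ideal mask producing exactly n_ang replicas and omits the calibration, windowing and frequency bookkeeping needed for real sensor data.

```python
import numpy as np

def decode_light_field_1d(sensor_row, n_ang):
    """Schematic heterodyne decoding for a 1-D coded sensor row.

    sensor_row: 1-D array whose spectrum holds n_ang modulated replicas
        of the light field spectrum (one per angular sample).
    Returns an (n_ang x n_space) space-angle light field estimate.
    """
    n = len(sensor_row)
    assert n % n_ang == 0, "row length must be a multiple of n_ang"
    tile = n // n_ang
    spectrum = np.fft.fftshift(np.fft.fft(sensor_row))
    # Rearranging: cut the 1-D spectrum into n_ang contiguous tiles and
    # stack them, so each tile becomes one slice of a 2-D spectrum.
    lf_spectrum = np.fft.ifftshift(spectrum.reshape(n_ang, tile))
    light_field = np.fft.ifft2(lf_spectrum)
    return np.real(light_field)
```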
5 Coded Illumination By observing blocked light at silhouettes, a multi-flash camera can locate depth discontinuities in challenging scenes without depth recovery. We used a multi-flash camera to find the silhouettes in a scene [Raskar et al 2004]. We take four photos of an object with four different light positions (above, below, left and right of the lens). We detect the shadows cast along depth discontinuities and use them to locate the depth discontinuities in the scene. The detected silhouettes are then used for stylizing the photograph and highlighting important features. We also demonstrate silhouette detection in a video using a repeated fast sequence of flashes.

Fig. 7. Multi-flash Camera for Depth Edge Detection. (Left) A camera with four flashes (top, bottom, left and right of the lens). (Right) Photos due to individual flashes, ratio images showing the shadows, and epipolar traversal to compute the single-pixel depth edges, yielding a shadow-free image and depth edge maps.
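In code, the core of the depth-edge computation is short: build a maximum composite of the flash images, form per-flash ratio images in which cast shadows stand out, and mark pixels where a bright ratio value is followed by a shadowed one along the direction away from that flash. The sketch below simplifies the method of [Raskar et al 2004]; the fixed step directions and the single threshold stand in for the epipolar traversal and edge tests of the full algorithm.

```python
import numpy as np

def depth_edges(flash_images, directions, thresh=0.75):
    """Simplified multi-flash depth edge detector.

    flash_images: dict name -> 2-D grayscale float image of the same scene,
        each lit by one flash (e.g. 'left', 'right', 'top', 'bottom').
    directions: dict name -> (dy, dx) unit step pointing away from that
        flash in the image, e.g. {'left': (0, 1), 'top': (1, 0)}.
    Returns a boolean map of detected depth discontinuities.
    """
    max_img = np.maximum.reduce(list(flash_images.values())) + 1e-6
    edges = np.zeros(max_img.shape, dtype=bool)
    for name, img in flash_images.items():
        ratio = img / max_img            # cast shadows drop well below 1
        dy, dx = directions[name]
        ahead = np.roll(ratio, shift=(-dy, -dx), axis=(0, 1))
        # ahead[y, x] is the ratio one step further from the flash; a lit
        # pixel next to a shadowed one marks a depth edge.
        edges |= (ratio > thresh) & (ahead < thresh)
    return edges
```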
6 High Dynamic Range Using a Gradient Camera A camera sensor is limited in the range of highest and lowest intensities it can measure. To capture the high dynamic range, one can adaptively expose the sensor so that the signal-to-noise ratio is high over the entire image, including in the dark and brightly lit regions. One approach for faithfully recording the intensities in a high dynamic range scene is to capture multiple images using different exposures, and then to merge these images. The basic idea is that when longer exposures are used, dark regions are well exposed but bright regions are saturated. On the other hand, when short exposures are used, dark regions are too dark but bright regions are well imaged. If exposure varies and multiple pictures are taken of the same scene, the value of a pixel can be taken from those images where it is neither too dark nor saturated. This type of approach is often referred to as exposure bracketing. At the sensor level, various approaches have also been proposed for high dynamic range imaging. One type of approach is to use multiple sensing elements with different sensitivities within each cell [Street 1998, Handy 1986, Wen 1989, Hamazaki 1996]. Multiple measurements are made from the sensing elements, and they are combined
on-chip before a high dynamic range image is read out from the chip. Spatial sampling rate is lowered in these sensing devices, and spatial resolution is sacrificed. Another type of approach is to adjust the well capacity of the sensing elements during photocurrent integration [Knight 1983, Sayag 1990, Decker 1998], but this gives higher noise. By sensing intensities with lateral inhibition, a gradient sensing camera can record large as well as subtle changes in intensity to recover a high-dynamic range image. By sensing differences between neighboring pixels instead of actual intensities, our group has shown that a 'Gradient Camera' can record large global variations in intensity [Tumblin et al 2005]. Rather than measure absolute intensity values at each pixel, this proposed sensor measures only forward differences between them, which remain small even for extremely high-dynamic range scenes, and reconstructs the sensed image from these differences using Poisson solver methods. This approach offers several advantages: the sensor is nearly impossible to over- or under-expose, yet offers extremely fine quantization, even with very modest A/D convertors (e.g. 8 bits). The thermal and quantization noise occurs in the gradient domain, and appears as low-frequency 'cloudy' noise in the reconstruction, rather than uncorrelated high-frequency noise that might obscure the exact position of scene edges.
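Reconstructing the image from the sensed forward differences is a Poisson problem: find the image whose gradients best match the measurements. The sketch below solves it with an FFT under an assumed periodic boundary condition, which keeps the code short; a real gradient-camera pipeline would use boundary handling and a noise model appropriate to the sensor.

```python
import numpy as np

def integrate_gradients(gx, gy):
    """Reconstruct an image (up to a constant) from forward differences
    gx (horizontal) and gy (vertical) by solving a Poisson equation with
    periodic boundaries via the FFT."""
    h, w = gx.shape
    # Divergence of the gradient field (backward differences) equals the
    # discrete Laplacian of the unknown image.
    div = (gx - np.roll(gx, 1, axis=1)) + (gy - np.roll(gy, 1, axis=0))
    fx = np.fft.fftfreq(w)
    fy = np.fft.fftfreq(h)
    # Eigenvalues of the periodic 5-point Laplacian.
    denom = (2 * np.cos(2 * np.pi * fx)[None, :] +
             2 * np.cos(2 * np.pi * fy)[:, None] - 4)
    denom[0, 0] = 1.0                 # avoid dividing by zero at DC
    img_hat = np.fft.fft2(div) / denom
    img_hat[0, 0] = 0.0               # the mean intensity is unrecoverable
    return np.real(np.fft.ifft2(img_hat))
```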
7 Conclusion

As these examples indicate, we have scarcely begun to explore the possibilities offered by combining computation, 4D modeling of light transport, and novel optical systems. Nor have such explorations been limited to photography, computer graphics, or computer vision. Microscopy, tomography, astronomy, and other optically driven fields already contain some ready-to-use solutions to borrow and extend. If the goal of photography is to capture, reproduce, and manipulate a meaningful visual experience, then the camera is not sufficient to capture even the most rudimentary birthday party. The human experience and our personal viewpoint are missing. Computational Photography can supply us with visual experiences, but cannot decide which ones matter most to humans. Beyond coding the first-order parameters such as exposure, focus, illumination, and sensing, perhaps the ultimate goal of Computational Photography is to encode the human experience in a single captured photo.
Acknowledgements

We wish to thank Jack Tumblin and Amit Agrawal for contributing several ideas for this paper. We also thank our co-authors and collaborators Ashok Veeraraghavan, Ankit Mohan, Yuanzen Li, Karhan Tan, Rogerio Feris, Jingyi Yu, and Matthew Turk. We thank Shree Nayar and Marc Levoy for useful comments and discussions.
References

Raskar, R., Tan, K., Feris, R., Yu, J., Turk, M.: Non-photorealistic Camera: Depth Edge Detection and Stylized Rendering Using a Multi-Flash Camera. SIGGRAPH 2004 (2004)
Tumblin, J., Agrawal, A., Raskar, R.: Why I want a Gradient Camera. In: CVPR 2005, IEEE, Los Alamitos (2005)
Raskar, R., Agrawal, A., Tumblin, J.: Coded exposure photography: motion deblurring using fluttered shutter. ACM Trans. Graph. 25(3), 795–804 (2006)
Veeraraghavan, A., Raskar, R., Agrawal, A., Mohan, A., Tumblin, J.: Dappled Photography: Mask-Enhanced Cameras for Heterodyned Light Fields and Coded Aperture Refocusing. ACM Siggraph (2007)
Nayar, S.K., Narasimhan, S.G.: Assorted Pixels: Multi-Sampled Imaging With Structural Models. In: ECCV. European Conference on Computer Vision, vol. IV, pp. 636–652 (2002)
Debevec, Malik: Recovering high dynamic range radiance maps from photographs. In: Proc. SIGGRAPH (1997)
Mann, Picard: Being 'undigital' with digital cameras: Extending dynamic range by combining differently exposed pictures. In: Proc. IS&T 46th ann. conference (1995)
McGuire, M., Matusik, Pfister, Hughes, Durand: Defocus Video Matting. ACM Transactions on Graphics. Proceedings of ACM SIGGRAPH 2005 24(3) (2005)
Adelson, E.H., Wang, J.Y.A.: Single Lens Stereo with a Plenoptic Camera. IEEE Transactions on Pattern Analysis and Machine Intelligence 14(2) (1992)
Ng, R.: Fourier Slice Photography. SIGGRAPH (2005)
Morimura: Imaging method for a wide dynamic range and an imaging device for a wide dynamic range. U.S. Patent 5455621 (October 1993)
Levoy, M., Hanrahan, P.: Light field rendering. In: SIGGRAPH, pp. 31–42 (1996)
Dowski Jr., E.R., Cathey, W.T.: Extended depth of field through wave-front coding. Applied Optics 34(11), 1859–1866 (1995)
Georgiev, T., Zheng, C., Nayar, S., Salesin, D., Curless, B., Intwala, C.: Spatio-angular Resolution Trade-Offs in Integral Photography. In: Proceedings, EGSR 2006 (2006)
Optimal Algorithms in Multiview Geometry

Richard Hartley¹ and Fredrik Kahl²

¹ Research School of Information Sciences and Engineering, The Australian National University; National ICT Australia (NICTA)
² Centre for Mathematical Sciences, Lund University, Sweden
Abstract. This is a survey paper summarizing recent research aimed at finding guaranteed optimal algorithms for solving problems in Multiview Geometry. Many of the traditional problems in Multiview Geometry now have optimal solutions in terms of minimizing residual image-plane error. Success has been achieved in minimizing L2 (least-squares) or L∞ (smallest maximum error) norm. The main methods involve Second Order Cone Programming, or quasi-convex optimization, and branch-and-bound. The paper gives an overview of the subject while avoiding as far as possible the mathematical details, which can be found in the original papers.

J.E. Littlewood: The first test of potential in mathematics is whether you can get anything out of geometry.

G.H. Hardy: The sort of mathematics that is useful to a superior engineer, or a moderate physicist has no esthetic value and is of no interest to the real mathematician.
1 Introduction
In this paper, we describe recent work in geometric Computer Vision aimed at finding provably optimal solutions to some of the main problems. This is a subject which the two authors of this paper have been involved in for the last few years, and we offer our personal view of the subject. We cite most of the relevant papers that we are aware of and apologize for any omissions. There remain still several open problems, and it is our hope that more researchers will be encouraged to work in this area. Research in Structure from Motion since the start of the 1990s resulted in the emergence of a dominant accepted technique – bundle adjustment [46]. In this method, a geometric problem is formulated as a (usually) non-linear optimization problem, which is then solved using an iterative optimization algorithm.
NICTA is funded by the Australian Government’s Backing Australia’s Ability initiative, in part through the Australian Research Council.
Generally, the bundle adjustment problem is formulated as follows. One defines a cost function (also called an objective function) in terms of a set of parameters. Solving the problem involves finding the set of parameters that minimize the cost. Generally, the parameters are associated with the geometry that we wish to discover. Often they involve parameters of a set of cameras, as well as parameters (such as 3D point coordinates) describing the geometry of the scene. The cost function usually involves image measurements, and measures how closely a given geometric configuration (for instance the scene geometry) explains the image measurements. Bundle adjustment has many advantages, which account for its success and popularity.

1. It is quite general, and can be applied to a large range of problems.
2. It is very easy to augment the cost function to include other constraints that the problem must satisfy.
3. We can "robustify" the cost function by minimizing a robust cost function, such as the Huber cost function.
4. Typically, the estimation problem is sparse, so sparse techniques can be used to achieve quite fast run times.

One of the issues with bundle adjustment is that it requires a relatively accurate initial estimate of the geometry in order to converge to a correct minimum. This requirement led to one of the other main themes of research in Multiview Geometry through the 1990s: finding reliable initial solutions, usually through so-called algebraic techniques. Most well known among such techniques is the 8-point algorithm [28] for estimation of the essential or fundamental matrix, which solves the two-view relative orientation problem. Generally, in such methods, one defines an algebraic condition that must hold in the case of a noise-free solution, and an algebraic cost function that expresses how far this condition is from being met. Unfortunately, the cost function often is not closely connected with the geometry, and may be meaningless in any geometric or statistical sense.

Multiple minima. One of the drawbacks of bundle adjustment is the possibility of converging to a local, rather than a global, minimum of the cost function. The cost functions that arise in multiview optimization problems commonly do have multiple local minima. As an example, in Fig 1 we show graphs of the cost functions associated with the two-view triangulation problem (described later) and the autocalibration problem. The triangulation problem with more than two views is substantially more complex. It has been shown in [43] that the triangulation problem with three views involves solving a polynomial of degree 47, and hence the cost function potentially may have up to 24 minima. For higher numbers of views, the degree of the polynomial grows cubically. Stewénius and Nistér have calculated the sequence of degrees of the polynomial to be 6, 47, 148, 336, 638, 1081 for 2 to 7 view triangulation, which implies a large number of potential local minima of the cost function.
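To fix ideas, here is a minimal sketch of this recipe applied to its simplest instance, two-view triangulation of one point: reprojection residuals are minimized with an off-the-shelf iterative least-squares solver (SciPy's least_squares, standing in for a Levenberg-Marquardt style bundle adjuster; the cameras, point, and noise are hypothetical). Like bundle adjustment in general, the solver converges to the minimum nearest the initial estimate.

```python
import numpy as np
from scipy.optimize import least_squares

def project(P, X):
    """Pinhole projection of a 3D point X by a 3x4 camera matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def residuals(X, cameras, observations):
    """Stacked 2D reprojection residuals; the cost is their sum of squares."""
    return np.concatenate([project(P, X) - x for P, x in zip(cameras, observations)])

# Hypothetical cameras, point, and slightly perturbed image observations.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.3, -0.2, 4.0])
obs = [project(P1, X_true) + 0.001, project(P2, X_true) - 0.001]

X0 = np.array([0.0, 0.0, 3.0])                      # initial estimate
sol = least_squares(residuals, X0, args=([P1, P2], obs))
print(sol.x)                                        # close to X_true
```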
Fig. 1. Top: Two-view triangulation cost function, showing up to three minima. The independent variable (x-axis) parametrizes the epipolar plane. On the left, an example with three local minima (two of them with equal cost). On the right, an example with global solution with zero cost (perfect noise-free triangulation), yet having a further local minimum. The graphs are taken from [12]. Bottom: 2-dimensional cross-section of a cost associated with an autocalibration problem. The graphs are taken from [9]. Left to right: top view, side-view, contour plot. Note the great complexity of the cost function, and the expected difficulties in finding a global minimum.
As for autocalibration, progress has been made on this problem in [5], which finds optimal methods of carrying out the projective-affine-Euclidean upgrade, the "stratified" approach to autocalibration. However, this is not quite the same as an optimal solution to autocalibration, which remains an open (very difficult) problem.

Optimal methods. The difficulties and uncertainties associated with bundle adjustment and algebraic methods have led to a theme of research in the last few years that aims at finding guaranteed, provably optimal methods for solving these problems. Although such methods have not been successful for all geometric problems, the number of problems that can be solved using such optimal methods continues to grow. It is the purpose of this paper to survey progress in this area.
2 What Is Meant by an Optimal Solution?
In this section, we will argue that what is meant by an optimal solution to a geometric problem is not clearly defined. We consider a problem in which a set of measurements is made and we need to fit a parametrized model of some kind to these measurements. Optimization involves finding the set of parameters of the model that best fits the purpose. The optimization problem is defined in terms of a specified cost function which must be minimized over the set of
meaningful parameters. However, what particular cost functions merit being called "optimal" will be considered in the rest of this section.

To set up the problem, consider a set of measurements x_i to which we wish to fit a parametrized model. The model gives rise to some model values x̂_i which are defined in some way in terms of the parametrization. The set of residuals δ_i are defined as δ_i = ‖x_i − x̂_i‖, where ‖·‖ represents a suitable norm in the measurement space. For image measurements, this is most reasonably the distance in the image between x_i and x̂_i. Finally, denote by Δ the vector with components δ_i. This may be called the vector of residuals.

2.1 L2 Error and ML Estimation
Often, it is argued that the optimal solution is the least-squares solution, which minimizes the cost function

    ‖Δ‖₂² = Σ_i ‖x_i − x̂_i‖²,

namely the L2-norm of the vector of residuals.¹ The argument for the optimality of this solution is as follows. If we assume that the measurements are derived from actual values, corrupted with Gaussian noise with variance σ², then the probability of the set of measurements x_i given true values x̂_i is given by

    P({x_i} | {x̂_i}) = K exp( −Σ_i ‖x_i − x̂_i‖² / (2σ²) ),

where K is a normalizing constant. Taking logarithms and minimizing, we see that the modelled data that maximizes the probability of the measurements also minimizes Σ_i ‖x_i − x̂_i‖². Thus, the least-squares solution is the maximum likelihood (ML) estimate, under an assumption of Gaussian noise in the measurements. This is the argument for optimality of the least-squares solution. Although useful, this argument does rely on two assumptions that are open to question.

1. It makes an assumption of Gaussian noise. This assumption is not really justified at all. Measurement errors in a digital image are not likely to satisfy a Gaussian noise model.
2. Maximum likelihood is not necessarily the same as optimal, a term that we have not defined. One might define optimal to mean maximum likelihood, but this is a circular argument.

¹ In future the symbol ‖·‖ represents the 2-norm of a vector.

2.2 L∞ Error
An alternative noise model, perhaps equally justified for image point measurements, is that of uniform bounded noise. Thus, we assume that all measurements
less than a given threshold distance from the true value are equally likely, but measurements beyond this threshold have probability zero. In the case of a discrete image, measurements more accurate than one pixel from the true value are difficult to obtain, though in fact they may be achieved in carefully controlled situations. If we assume that the measurement error probability model is

    P(x | x̂) = K exp( −(‖x − x̂‖/σ)^p ),        (1)

where K is a normalizing factor, then as before, the ML estimate is the one that minimizes Σ_i ‖x_i − x̂_i‖^p. Letting p increase to infinity, the probability distribution (1) converges uniformly (except at σ) to a uniform distribution for x within distance σ of x̂. Now, taking the p-th root, we see that minimizing this sum is equivalent to minimizing the p-norm of the vector Δ of residuals. Furthermore, as p increases to infinity, ‖Δ‖_p converges to the L∞ norm ‖Δ‖_∞. In this way, minimizing the L∞ error corresponds to an assumption of a uniform distribution of measurement errors.

Looked at another way, under the L∞ norm, all sets of measurements that are within a distance σ of the modelled values are equally likely, whereas a set of measurements where one of the values exceeds the threshold σ has probability zero. Then L∞ optimization finds the smallest noise threshold for which the set of measurements is possible, and determines the ML estimate for this minimum threshold. Note that the L∞ norm of a vector is simply the largest component of the vector, in absolute value. Thus,

    min ‖Δ‖_∞ = min max_i ‖x_i − x̂_i‖,

where the minimization is taken over the parameters of the model. For this reason, L∞ minimization is sometimes referred to as minimax optimization.

2.3 Other Criteria for Optimality
It is not clear that the maximum likelihood estimate has a reasonable claim to being the optimal estimate. It is pointed out in [13] that the maximum likelihood estimate may be biased, and in fact have infinite bias even for quite simple estimation problems. Thus, as the noise level of measurements increases, the average (or expected) estimate drifts away from the true value. This is of course undesirable. In addition, a different way of thinking of the estimation problem is in terms of the risk of making a wrong estimate. For example consider the triangulation problem (discussed in more detail later) in which several observers estimate a bearing (direction vector) to a target from known observation points. If the bearing directions are noisy, then where is the target? In many cases, there is a cost associated with wrongly estimating the position of the target. (For instance, if the target is an incoming ballistic missile, the
cost of a wrong estimate can be quite high.) A reasonable procedure would be to choose an estimate that minimizes the expected cost. As an example, if the cost of an estimate is equal to the square of the distance between the estimate and the true value, then the estimate that minimizes the expected cost is the mean of the posterior probability distribution of the parameters, P(θ | {x_i}).² More discussion of these matters is contained in [13], Appendix 3. We are not, however, aware of any literature in multiview geometry finding estimates of this kind.

² This assumes that the parameters θ are in a Euclidean space, which may not always be the case. In addition, estimation of the posterior probability distribution may require the definition of a prior P(θ).

What we mean by optimality. In this survey, we will consider the estimates that minimize the L2 or L∞ norm of the residual error vector, with respect to a parametrized model, as being optimal. This is reasonable in that it is related in either case to a specific geometric noise model, tied directly to the statistics of the measurements.
3 Polynomial Methods
One approach for obtaining optimal solutions to multiview problems is to compute all stationary points of the cost function and then check which of these is the global minimum. From a theoretical point of view, any structure and motion problem can be solved in this manner as long as the cost function can be expressed as a rational polynomial function in the parameters. This will be the case for most cost functions encountered (though not for L∞ cost functions, which are not differentiable). The method is as follows. A minimum of the cost function must occur at a point where the derivatives of the cost with respect to the parameters vanish. If the cost function is a rational polynomial function, then the derivatives are rational as well. Setting the derivatives to zero leads to a system of polynomial equations, and the solutions of this set of equations define the stationary points of the cost function. These can be checked one-by-one to find the minimum. This method may also be applied when the parameters satisfy certain constraints, such as a constraint of zero determinant for the fundamental matrix, or a constraint that a matrix represents a rotation. Such problems can be solved using Lagrange multipliers. Although this method is theoretically generally applicable, in practice it is only tractable for small problems, for example the triangulation problem. A solution to the two-view triangulation problem was given in [12], involving the solution of a polynomial of degree 6. The three-view problem was addressed in [43]; the solution involves solving a polynomial of degree 47. Further work (unpublished) by Stewénius and Nistér has shown that the triangulation problem for 4 to 7 views can be solved by finding the roots of a polynomial of degree 148, 336, 638, 1081 respectively, and in general, the degree grows cubically. Since
solving large sets of polynomials is numerically difficult, the issue of accuracy has been addressed in [4].

The polynomial method may also be applied successfully in many minimal-configuration problems. We do not attempt here to enumerate all such problems considered in the literature. One notable example, however, is the relative orientation (two-view reconstruction) problem with 5 point correspondences. This has long been known to have 10 solutions [7,16].³ The second of these references gives a very pleasing derivation of this result. Recent simple algorithms for solving this problem by finding the roots of a polynomial have been given in [31,26]. Methods for computing a polynomial solution need not result in a polynomial of the smallest possible degree. However, recent work using Galois theory [32] gives a way to address this question, showing that the 2-view triangulation and the relative orientation problem essentially require solution of polynomials of degree 6 and 10 respectively.

³ These papers find 20 solutions, not 10, since they are solving for the number of possible rotations. There are 10 possible essential matrices, each of which gives two possible rotations, related via the twisted pair ambiguity (see [13]). Only one of these two rotations is cheirally correct, corresponding to a possible realizable solution.
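For a one-parameter rational cost, the stationary-point recipe is easy to sketch numerically; the toy cost below is an illustrative assumption, not the degree-6 two-view triangulation polynomial of [12].

```python
import numpy as np

# Toy rational cost C(t) = p(t) / q(t) with q > 0 and C -> +inf as |t| -> inf,
# so the global minimum is attained at a stationary point.
p = np.poly1d([1.0, -2.0, 0.0, 3.0, 0.5])   # numerator (degree 4)
q = np.poly1d([1.0, 0.0, 1.0])              # denominator t^2 + 1, positive everywhere

# Stationary points: d/dt [p/q] = (p' q - p q') / q^2 = 0  =>  p' q - p q' = 0.
num = p.deriv() * q - p * q.deriv()
roots = num.roots
crit = roots[np.abs(roots.imag) < 1e-9].real   # keep the real stationary points

# Evaluate the cost at every stationary point and keep the smallest value.
vals = p(crit) / q(crit)
t_opt = crit[np.argmin(vals)]
print(t_opt, vals.min())
```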
4 L∞ Estimation and SOCP
In this section, we will consider L∞ optimization, and discuss its advantages vis-a-vis L2 optimization. We show that there are many problems that can be formulated in the L∞ norm and yield a single solution. This is the main advantage, and contrasts with L2 optimization, which may have many local minima, as was shown in Fig 1.

4.1 Convex Optimization
We start by considering convex optimization problems. First, a few definitions.

Convex set. A subset S of Rⁿ is said to be convex if the line segment joining any two points in S is contained in S. Formally, if x₀, x₁ ∈ S, then (1 − α)x₀ + αx₁ ∈ S for all α with 0 ≤ α ≤ 1.

Convex function. A function f : Rⁿ → R is convex if its domain is a convex set and for all x₀, x₁ ∈ domain(f), and α with 0 ≤ α ≤ 1, we have f((1 − α)x₀ + αx₁) ≤ (1 − α)f(x₀) + αf(x₁). Another, less formal, way of defining a convex function is to say that a line joining two points on the graph of the function will always lie above the graph. This is illustrated in Fig 2.
Fig. 2. Left. Examples of convex and non-convex sets. Middle. The definition of a convex function; the line joining two points lies above the graph of the function. Right. Convex optimization.
Convex optimization. A convex optimization problem is as follows:

– Given a convex function f : D → R, defined on a convex domain D ⊂ Rⁿ, find the minimum of f on D.

A convex function is always continuous, and given reasonable conditions that ensure a minimum of the function (for instance, D is compact), such a convex optimization problem is solvable by known algorithms.⁴ A further desirable property of a convex problem is that it has no local minima apart from the global minimum. The global minimum value is attained at a single point, or at least on a convex set in Rⁿ where the function takes the same minimum value at all points. For further details, we refer the reader to the book [3].

⁴ The most efficient algorithm to use will depend on the form of the function f, and the way the domain D is specified.

Quasi-convex functions. Unfortunately, although convex problems are agreeable, they do not come up often in multiview geometry. Interestingly enough, however, certain other problems do: quasi-convex problems. Quasi-convex functions are defined in terms of sublevel sets as follows.

Definition 1. A function f : D → R is called quasi-convex if its α-sublevel set, S_α = {x ∈ D | f(x) ≤ α}, is convex for all α.

Examples of quasi-convex and non-quasi-convex functions are shown in Fig 3. Quasi-convex functions have two important properties.

1. A quasi-convex function has no local minima apart from the global minimum. It will attain its global minimum at a single point or else on a convex set where it assumes a constant value.
Fig. 3. Quasi-convex functions. The left two functions are quasi-convex. All the sublevel sets are convex. The function on the right is not quasi-convex. The indentation in the function graph (on the left) means that the sublevel-sets are not convex. All convex functions are quasi-convex, but the example on the left shows that the converse is not true.
2. The pointwise maximum of a set of quasi-convex functions is quasi-convex. This is illustrated in Fig 4 for the case of functions of a single variable. The general case follows directly from the following observation concerning sublevel sets:

    S_δ(max_i f_i) = ⋂_i S_δ(f_i),

which is convex, since each S_δ(f_i) is convex.
Fig. 4. The pointwise maximum of a set of quasi-convex functions is quasi-convex
A quasi-convex optimization problem is defined in the same way as a convex optimization problem, except that the function to be minimized is quasi-convex. Nevertheless, quasi-convex optimization problems share many of the pleasant properties of convex optimization. Why consider quasi-convex functions? The primary reason for considering such functions is that the residual of a measured image point x with respect to the projection of a point X in space is a quasi-convex function. In other words, f (X) = d(x, PX) is a quasi-convex function of X. Here, PX is the projection of a point X into an image, and d(·, ·) represents distance in the image.
Fig. 5. The triangulation problem: Assuming that the maximum reprojection error is less than some value δ, the sought point X must lie in the intersection of a set of cones. If δ is set too small, then the cones do not have a common intersection (left). If δ is set too large, then the cones intersect in a convex region in space, and the desired solution X must lie in this region (right). The optimal value of δ lies between these two extremes, and can be found by a binary search (bisection) testing successive values of δ. For more details, refer to the text.
Specifically, the sublevel set S_δ(f(X)) is a convex set, namely a cone with vertex at the centre of projection, as will be discussed in more detail shortly (see Fig 5). As the reader may easily verify by example, however, the sum of quasi-convex functions is not in general quasi-convex. If we take several image measurements, then the sum of squares of the residuals will not in general be a quasi-convex function. In other words, an L2 cost function of the form Σ_{i=1}^{N} f_i(X)² will not in general be a quasi-convex function of X, nor have a single minimum. On the other hand, as remarked above, the maximum of a set of quasi-convex functions is quasi-convex, and hence will have a single minimum. Specifically, max_i f_i(X) will be quasi-convex, and have a single minimum with respect to X. For this reason, it is typically easier to solve the minimax problem min_X max_i f_i(X) than the corresponding least-squares (L2) problem.

Example: The triangulation problem. The triangulation problem is the simplest problem in multiview geometry. Nevertheless, in the L2 formulation, it still suffers from the problem of local minima, as shown in Fig 1. In this problem, we have a set of known camera centres O_i and a set of direction vectors v_i which give the direction of a target point X from each of the camera centres. Thus, nominally, v_i = (X − O_i)/‖X − O_i‖. The problem is to find the position of the point X. We choose to solve this problem in the L∞ norm; in other words, we seek the point X that minimizes the maximum error (over all i) between v_i and the directions given by X − O_i.

Consider Fig 5. Some simple notation is required. Define ∠(X − O_i, v_i) to be the angle between the vectors X − O_i and v_i. Given a value δ > 0, the set of points C_δ(O_i, v_i) = {X | ∠(X − O_i, v_i) ≤ δ} forms a cone in R³ with vertex O_i, axis v_i, and angle determined by δ.
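The bisection strategy of Fig 5, formalized in the observations below, can be sketched as follows. The sketch assumes the cvxpy modelling package as a stand-in for the SOCP solvers used in the cited work (e.g. SeDuMi [44]); for δ < π/2, membership in the cone C_δ(O_i, v_i) is the second-order cone constraint ‖(I − v_i v_iᵀ)(X − O_i)‖ ≤ tan δ · v_iᵀ(X − O_i).

```python
import numpy as np
import cvxpy as cp

def cones_feasible(O, v, delta):
    """SOCP feasibility test: do the cones C_delta(O_i, v_i) have a common point?
    O: (n, 3) camera centres, v: (n, 3) unit bearing directions, 0 < delta < pi/2."""
    X = cp.Variable(3)
    cons = []
    for Oi, vi in zip(O, v):
        d = X - Oi
        P = np.eye(3) - np.outer(vi, vi)       # projector orthogonal to the cone axis
        cons.append(cp.norm(P @ d) <= np.tan(delta) * (vi @ d))
    prob = cp.Problem(cp.Minimize(0), cons)
    prob.solve()
    return prob.status in ("optimal", "optimal_inaccurate"), X.value

def linf_triangulate(O, v, tol=1e-6):
    """Bisection on the maximum angular error delta (the L-infinity cost)."""
    lo, hi = 0.0, 0.5 * np.pi - 1e-3
    X_best = None
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        ok, X = cones_feasible(O, v, mid)
        if ok:
            hi, X_best = mid, X                # cones intersect: shrink delta
        else:
            lo = mid                           # infeasible: grow delta
    return X_best, hi
```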
We begin by hypothesizing that there exists a solution X to the triangulation problem for which the maximum error is δ. In this case, the point X must lie inside cones in R³ with vertex O_i, axis v_i, and angle determined by δ. If the cones are too narrow, they do not have a common intersection and there can be no solution with maximum error less than δ. On the other hand, if δ is sufficiently large, then the cones intersect, and the desired solution X must lie in the intersection of the cones. The optimal value of δ is found by a binary search over values of δ to find the smallest value such that the cones C_δ(O_i, v_i) intersect in at least one point. The intersection will be a single point, or in special configurations, a segment of a line. The problem of determining whether a set of cones has non-empty intersection is solved by a convex optimization technique called Second Order Cone Programming (SOCP), for which open source libraries exist [44]. We make certain observations about this problem:

1. Each cone C_δ(O_i, v_i) is a convex set, and hence their intersection is convex.
2. If we define a cost function

       Cost_∞(X) = max_i ∠(X − O_i, v_i),

   then the sublevel set S_δ(Cost_∞) is simply the intersection of the cones C_δ(O_i, v_i), which is convex for all δ. This by definition says that Cost_∞(X) is a quasi-convex function of X.
3. Finding the optimum

       min_X Cost_∞(X) = min_X max_i ∠(X − O_i, v_i)

   is accomplished by a binary search over possible values of δ, where for each value of δ we solve an SOCP feasibility problem (determining whether a set of cones has a common intersection). Such a problem is known as a minimax or L∞ optimization problem.

Generally speaking, this procedure generalizes to arbitrary quasi-convex optimization problems; they may be solved by binary search involving a convex feasibility problem at each step. If we have a set of individual cost functions f_i(X), perhaps each associated with a single measurement, and each of them quasi-convex, then the maximum of these cost functions max_i f_i(X) is also quasi-convex, as illustrated in Fig 4. In this case, the minimax problem of finding min_X max_i f_i(X) is solvable by binary search.

Reconstruction with known rotations. Another problem that may be solved by very similar means to the triangulation problem is that of Structure and Motion with known rotations, which is illustrated in Fig 6.

The role of cheirality. It is important in solving problems of this kind to take into account the concept of "cheirality", which means the requirement that
Fig. 6. Structure and motion with known rotations. If the orientations of several cameras are all known, then image points correspond to direction vectors in a common coordinate frame. Here, blue (or grey) circles represent the positions of cameras, and black circles the positions of points. The arrows represent direction vectors (their length is not known) from cameras to points. The positions of all the points and cameras may be computed (up to scale and translation) using SOCP. Abstractly, this is the embedding problem for a bi-partite graph in 3D, where the orientation of the edges of the graph is known. The analogous problem for an arbitrary (not bi-partite) graph was applied in [41] to solve for motion of the cameras without computing the point positions.
points visible in an image must lie in front of the camera, not behind. If we subdivide space by planes separating front and back of the camera, then there will be at least one local minimum of the cost function (whether L∞ or L2) in each region of space. Since the number of regions separated by n planes grows cubically, so does the number of local minima, unless we constrain the solution so that the points lie in front of the cameras.

Algorithm acceleration. Although the bisection algorithm using SOCP has been the standard approach to L∞ geometric optimization problems, there has been recent work on speeding up the computations [39,40,34]. However, it has been shown that the general structure and motion problem (with missing data) is NP-hard no matter what criterion of optimality of reprojection error is used [33].

4.2 Problems Solved in L∞ Norm
The list of problems that can be solved globally with L∞ estimation continues to grow, and by now it is quite long; see Table 1. In [21], an L∞ estimation algorithm serves as the basis for solving the least-median-of-squares problem for various geometric problems. However, the extension to such problems is essentially based on heuristics, and it has no guarantee of finding the global optimum.
Table 1. Geometric reconstruction problems that can be solved globally with the L∞ or L2 norm (references follow each entry)

L∞-norm:
− Multiview triangulation [11,18,20,6]
− Camera resectioning (uncalibrated case) [18,20]
− Camera pose (calibrated case) [10,47]
− Homography estimation [18,20]
− Structure and motion recovery with known camera orientation [11,18,20]
− Reconstruction by using a reference plane [18]
− Camera motion recovery [41]
− Outlier detection [42,20,25,47]
− Reconstruction with covariance-based uncertainty [41,21]
− Two-view relative orientation [10]
− 1D retinal vision [2]

L2-norm:
− Affine reconstruction from affine cameras [23,45]
− Multiview triangulation [1,29]
− Camera resectioning (uncalibrated case) [1]
− Homography estimation [1]
− 3D – 3D registration and matching [14]
− 3D – 3D registration and matching (unknown pairing) [27]

5 Branch-and-Bound Theory
The method based on L∞ optimization is not applicable to all problems. In this section, we will describe in general terms a different method that has been used with success in obtaining globally optimal solutions. Branch and bound algorithms are non-heuristic methods for global optimization in non-convex problems. They maintain a provable upper and/or lower bound on the (globally) optimal objective value and terminate with a certificate proving that the solution is within ε of the global optimum, for arbitrarily small ε.

Consider a non-convex, scalar-valued objective function Φ(x), for which we seek a global minimum over a domain Q₀. For a subdomain Q ⊆ Q₀, let Φ_min(Q) denote the minimum value of the function Φ over Q. Also, let Φ_lb(Q) be a function that computes a lower bound for Φ_min(Q), that is, Φ_lb(Q) ≤ Φ_min(Q). An intuitive technique to determine the solution would be to divide the whole search region Q₀ into a grid with cells of sides δ and compute the minimum of a lower bounding function Φ_lb defined over each grid cell, with the presumption that each Φ_lb(Q) is easier to compute than the corresponding Φ_min(Q). However, the number of such grid cells increases rapidly as δ → 0, so a clever procedure must be deployed to create as few cells as possible and "prune" away as many of these grid cells as possible (without having to compute the lower bounding function for these cells). Branch and bound algorithms iteratively subdivide the domain into subregions (which we refer to as rectangles) and employ clever strategies to prune away as many rectangles as possible to restrict the search region.
Fig. 7. This figure illustrates the operation of a branch and bound algorithm on a one-dimensional non-convex minimization problem. Figure (a) shows the function Φ(x) and the interval l ≤ x ≤ u in which it is to be minimized. Figure (b) shows the convex relaxation of Φ(x) (indicated in yellow/dashed), its domain (indicated in blue/shaded) and the point for which it attains a minimum value. q₁* is the corresponding value of the function Φ. This value is the current best estimate of the minimum of Φ(x), and is used to reject the left subinterval in Figure (c) because the minimum value of the convex relaxation is higher than q₁*. Figure (d) shows the lower bounding operation on the right sub-interval, in which a new estimate q₂* of the minimum value of Φ(x) is found.
A graphical illustration of the algorithm is presented in Fig 7. Computation of the lower bounding functions is referred to as bounding, while the procedure that chooses a domain and subdivides it is called branching. The choice of the domain picked for refinement in the branching step and the actual subdivision itself are essentially heuristic. Although guaranteed to find the global optimum (or a point arbitrarily close to it), the worst case complexity of a branch and bound algorithm is exponential. However, in many cases the properties offered by multiview problems lead to fast convergence rates in practice.
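A toy one-dimensional version of this branch-and-bound loop is sketched below. The objective, its assumed Lipschitz constant and the resulting lower bound are illustrative stand-ins for the convex relaxations used in the cited papers.

```python
import numpy as np

def branch_and_bound(phi, phi_lb, Q0, eps=1e-6, min_width=1e-9):
    """Generic 1-D branch and bound: phi(x) is the objective, phi_lb(l, u) a lower
    bound for min phi over [l, u], Q0 the initial interval."""
    l0, u0 = Q0
    best_x = 0.5 * (l0 + u0)
    best_val = phi(best_x)
    queue = [(l0, u0)]
    while queue:
        l, u = queue.pop()
        if u - l < min_width or phi_lb(l, u) > best_val - eps:
            continue                      # bound: this interval cannot improve best_val
        m = 0.5 * (l + u)                 # branch: split at the midpoint
        if phi(m) < best_val:
            best_x, best_val = m, phi(m)
        queue += [(l, m), (m, u)]
    return best_x, best_val

# Example: a wiggly objective with a Lipschitz-style lower bound (L is an assumed bound).
L = 12.0
phi = lambda x: np.sin(3.0 * x) + 0.1 * (x - 1.0) ** 2
phi_lb = lambda l, u: phi(0.5 * (l + u)) - 0.5 * L * (u - l)
print(branch_and_bound(phi, phi_lb, (-4.0, 4.0)))
```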
6 Branch-and-Bound for L2 Minimization
The branch-and-bound method can be applied to find L2 norm solutions to certain simple problems. This is done by a direct application of branch-and-bound over the parameter space. Up to now, methods used for branching have been quite simple, consisting of simple subdivision of rectangles in half, or in half along all dimensions.

In order to converge as quickly as possible to the solution, it is useful to confine the region of parameter space that needs to be searched. This can be conveniently done if the cost function is a sum of quasi-convex functions. For instance, suppose the cost is C₂(X) = Σ_i f_i(X)², and the optimal point is denoted by X_opt. If a good initial estimate X₀ is available, with C₂(X₀) = δ², then

    C₂(X_opt) = Σ_i f_i(X_opt)² ≤ C₂(X₀) = δ².

This implies that each f_i(X_opt) ≤ δ, and so X_opt ∈ ⋂_i S_δ(f_i), which is a convex set enclosing both X₀ and X_opt. One can find a rectangle in parameter space that
encloses this convex region, and begin the branch-and-bound algorithm starting with this rectangle. This general method was used in [1] to solve the L2 multiview triangulation problem and the uncalibrated camera resection (pose) problem. In that paper fractional programming (described later in section 8.2) was used to define the convex sub-envelope of the cost function, and hence provide a cost lower bound on each rectangle. The same branch-and-bound idea, but with a different bounding method, was described in [29] to provide a simpler solution to the triangulation problem. The triangulation problem for other geometric features, more specifically lines and conics, was solved in [17] using the same approach.
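Returning to the initial confinement of the search region described above: here is a sketch of that bounding step, reusing the cone constraints from the triangulation example of Section 4 and again assuming the cvxpy package. Each face of the enclosing rectangle is found by minimizing or maximizing one coordinate over the convex set ⋂_i S_δ(f_i).

```python
import numpy as np
import cvxpy as cp

def bounding_rectangle(O, v, delta):
    """Axis-aligned box enclosing the convex set {X : angle(X - O_i, v_i) <= delta
    for all i}.  Assumes delta is small enough that the intersection is bounded."""
    X = cp.Variable(3)
    cons = []
    for Oi, vi in zip(O, v):
        d = X - Oi
        P = np.eye(3) - np.outer(vi, vi)
        cons.append(cp.norm(P @ d) <= np.tan(delta) * (vi @ d))
    lo, hi = np.zeros(3), np.zeros(3)
    for k in range(3):
        lo[k] = cp.Problem(cp.Minimize(X[k]), cons).solve()
        hi[k] = cp.Problem(cp.Maximize(X[k]), cons).solve()
    return lo, hi   # branch and bound is then started on this rectangle
```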
7 Branch-and-Bound in Rotation Space

7.1 Essential Matrix
All the problems that we have so far considered, solvable using quasi-convex optimization or SOCP, involved no rotations. If there are rotations, then the optimization problem is no longer quasi-convex. An example of this type of problem is estimation of the essential matrix from a set of matching points x_i ↔ x′_i. A linear solution to this problem was given in 1981 by Longuet-Higgins [28]. From the essential matrix E, which satisfies the defining equation x′_iᵀ E x_i = 0 for all i, we can extract the relative orientation (rotation and translation) of the two cameras.

To understand why this is not a quasi-convex optimization problem, we look at the minimal problem involving 5 point correspondences. It is well known that with 5 points, there are 10 solutions for the relative orientation. (Recent algorithms for solving the 5-point orientation problem are given in [31,26].) However, if there are many possible discrete solutions, then the problem cannot be quasi-convex or convex, since such problems have a unique solution. This is so whether we are seeking an L∞ or L2 solution, since the 5-point problem has an exact solution, and hence the cost is zero in either norm. Many algorithms for estimating the essential matrix have been given, without however any claim to optimality. Recently, an optimal solution, at least in L∞ norm, was given in [10].

To solve the essential matrix problem optimally (at least in L∞ norm), we make the following observation. If the rotation of the two cameras were somehow known, then the problem would reduce to the one discussed in Fig 6, where the translation of the cameras can be estimated optimally (in L∞ norm) given the rotations. The residual cost of this optimal solution may be found, as a function of the assumed rotation. To solve for the relative pose (rotation and translation) of the two cameras, we merely have to consider all possible rotations, and select the one that yields the smallest residual. The trick is to do this without having to look at an infinite number of rotations. Fortunately, branch-and-bound provides a means of carrying out this search. A key to the success of branch-and-bound is that the optimal cost (residual) estimated for one value of the rotation constrains the optimal cost for "nearby"
rotations. This allows us to put a lower bound on the optimal cost associated with all rotations in a region of rotation space. The branch-and-bound algorithm carries out a search of rotation space (a 3-dimensional space). The translation is not included in the branch-and-bound search. Instead, for a given rotation, the optimal translation may be computed using SOCP, and hence factored out of the parameter-space search.

A similar method of search over rotation space is used in [10] to solve the calibrated camera pose problem. This is the problem of finding the position and orientation of a camera given known 3D points and their corresponding image points. An earlier iterative algorithm that addresses this problem using L∞ norm is given in [47]. The algorithm appears to converge to the global optimum, but the author states that this is unproven.

7.2 General Structure and Motion
The method outlined in section 7.1 for solving the structure and motion problem for two views could in principle be extended to give an optimal solution (albeit in L∞ norm) to the complete structure and motion problem for any number of views. This would involve a search over the combined space of all rotations. For the two-camera problem there is only one relative rotation, and hence the branch-and-bound algorithm involves a search over a 3-dimensional parameter space. In the case of n cameras, however, the parameter space is 3(n − 1)-dimensional. Had we but world enough and time, a branch-and-bound search over the combined rotation parameter space would yield an optimal L∞ solution to structure and motion with any number of cameras and points. Unfortunately, in terms of space and time requirements, this algorithm would be vaster than empires and more slow [30].

7.3 1D Retinal Vision
One-dimensional cameras have proven useful in several different applications, most prominently for autonomous guided vehicles (see Fig 8), but also in ordinary vision for analysing planar motion and the projection of lines. Previous results on one-dimensional vision are limited to classifying and solving minimal cases, bundle adjustment for finding local minima of the structure and motion problem, and linear algorithms based on algebraic cost functions. A method for finding the global minimum of the structure and motion problem using the max norm of reprojection errors is given in [2]. In contrast to the 2D case, which uses SOCP, the optimal solution can be computed efficiently using simple linear programming techniques.

It is assumed that neither the positions of the objects nor the positions and orientations of the cameras are known. However, it is assumed that the correspondence problem is solved, i.e., it is known which measured bearings correspond to the same object. The problem can formally be stated as follows. Given n bearings from m different positions, find the camera positions and 2D points
Fig. 8. Left: A laser-guided vehicle. Middle: A laser scanner or angle meter. Right: Calculated structure and motion for the ice-hockey experiment.
in the plane, such that the reprojected solution has minimal residual errors. The norm for measuring the errors will be the L∞ norm. The basic idea of the optimization scheme is to first consider optimization with fixed camera orientations (which is a quasi-convex problem) and then use branch-and-bound over the space of possible orientations, similar to that of section 7.1.

Hockey rink data. By combining optimal structure and motion with optimal resection and intersection, it is possible to solve for arbitrarily many cameras and views. We illustrate this with the data from a real set of measurements performed at an ice-hockey rink. The set contains 70 images of 14 points. The result is shown in the right of Fig 8.

7.4 3D – 3D Alignment
A similar method of doing branch-and-bound in rotation space was used in [27] to find an optimal solution for the problem of aligning two sets of points in 3D with unknown pairing. The solution consists of a specified pairing of the two point sets, along with a rotation and translation to align the paired points. The algorithm relies on the fact that if the rotation is known, then the optimal pairing can be computed directly using the Hungarian algorithm [36]. This enables the problem to be addressed by a branch-and-bound search over rotations. The problem is solved for the L2 norm in [27] using a bounding method based on Lipschitz bounds. Though the L∞ problem is not specifically addressed in the paper, it would probably also yield to a similar approach.
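The fixed-rotation inner step is easy to sketch: for a full permutation the optimal translation is just the difference of centroids, and the optimal pairing is then a linear assignment problem (SciPy's linear_sum_assignment below). This is a minimal illustration inspired by [27], not the authors' implementation, and the points are hypothetical; the outer search over rotations would be branch-and-bound as in Section 7.1.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pairing_cost(R, A, B):
    """L2 alignment cost of point sets A, B (both (n, 3)) for a fixed rotation R:
    optimal translation = centroid difference, optimal pairing = Hungarian algorithm."""
    t = B.mean(axis=0) - A.mean(axis=0) @ R.T
    A_t = A @ R.T + t
    C = ((A_t[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)  # pairwise squared distances
    rows, cols = linear_sum_assignment(C)
    return C[rows, cols].sum(), cols

# Toy check: a transformed and shuffled copy of A gives (near) zero cost at the true R.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 3))
c, s = np.cos(0.3), np.sin(0.3)
R_true = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
B = (A @ R_true.T + np.array([0.5, -0.2, 1.0]))[rng.permutation(8)]
print(pairing_cost(R_true, A, B)[0])   # ~0
```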
7.5 An Open Problem: Optimal Essential Matrix Estimation in L2 Norm
The question naturally arises of whether we can use similar techniques to section 7.1 to obtain the optimal L2 solution for the essential matrix. At present, we have no solution to this problem. Two essential steps are missing.
1. If the relative rotation R between the two cameras is given, can we estimate the optimal translation t? This is simple in L∞ norm using SOCP, or in fact linear programming. In L2 norm, a solution has been proposed in [15], but it is iterative, and it is not clear that it is guaranteed to converge. For n points, this seems to be a harder problem than optimal n-view L2 triangulation (for which solutions have recently been given [1,29]).
2. If we can find the optimal residual for a given rotation, how does this constrain the solution for nearby rotations? Loose bounds may be given, but they may not be sufficiently tight to allow for efficient convergence of the branch-and-bound algorithm.
8 Other Methods for Minimizing L2 Norm

8.1 Convex Relaxations and Semidefinite Programming
Another general approach for solving L2 problems was introduced in [19] based on convex relaxations (underestimators) and semidefinite programming. More specifically, the approach is based on a hierarchy of convex relaxations to solve non-convex optimization problems. Linear matrix inequalities (LMIs) are used to construct the convex relaxations. These relaxations generate a monotone sequence of lower bounds of the minimal value of the objective function and it is shown how one can detect whether the global optimum is attained at a given relaxation. Standard semidefinite programming software (like SeDuMi [44]) is extensively used for computing the bounds. The technique is applied to a number of classical vision problems: triangulation, camera pose, homography estimation and epipolar geometry estimation. Although good results are obtained, there is no guarantee of achieving the optimal solution and the sizes of the problem instances are small.

8.2 Fractional Programming
Yet another method was introduced in [1,35]. It was the first method to solve the n-view L2 triangulation problem with a guarantee of optimality. Other problem applications include camera resectioning (that is, uncalibrated camera pose), camera pose estimation and homography estimation. In its most general form, fractional programming seeks to minimize/maximize a sum of fractions subject to convex constraints. Our interest from the point of view of multiview geometry, however, is specific to the minimization problem

    min_x Σ_{i=1}^{p} f_i(x)/g_i(x)   subject to x ∈ D,

where f_i : Rⁿ → R and g_i : Rⁿ → R are convex and concave functions, respectively, and the domain D ⊂ Rⁿ is a convex compact set. This is because the residual of the projection of a 3D point into an image may be written in this form. Further, it is assumed that both f_i and g_i are positive with lower
and upper bounds over D. Even with these restrictions, the above problem is NP-complete [8], but practical and reliable estimation of the global optimum is still possible for many multiview problems through an iterative algorithm that solves an appropriate convex optimization problem at each step. The procedure is based on branch and bound. Perhaps the most important observation made in [1] is that many multiview geometry problems can be formulated as a sum of fractions where each fraction consists of a convex function over a concave function. This has inspired new, more efficient ways of computing the L2-minimum for n-view triangulation; see [29].
9 Applications
There have been various application papers that have involved this type of optimization methodology, though they cannot be said to have found an optimal solution to the respective problems. In [38], SOCP has been used to solve the problem of tracking and modelling a deforming surface (such as a sheet of paper) from a single view. Results are shown in Fig 9.
Fig. 9. Modelling a deforming surface from a single view. Left: the input image, with automatically overlaid grid. Right: the computed surface model viewed from a new viewpoint. Image features provide cone constraints that constrain the corresponding 3D points to lie on or near the corresponding line-of-sight, namely a ray through the camera centre. Additional convex constraints on the geometry of the surface allow the shape to be determined unambiguously using SOCP.
In another application-inspired problem, SOCP has been applied (see [22]) to the odometry problem for a vehicle with several rigidly mounted cameras with almost non-overlapping fields of view. Although the algorithm in [22] is tested on laboratory data, it is motivated by its potential use with vehicles such as the one shown in Fig 10. Such vehicles are used for urban mapping. Computation of individual essential matrices for each of the cameras reduces the computation of the translation of the vehicle to a multiple-view triangulation problem, which is solved using SOCP.
Fig. 10. Camera and car mount used for urban mapping. Images from non-overlapping cameras on both sides of the car can be used to do odometry of the vehicle. An algorithm based on triangulation and SOCP is proposed in [22]. (The image is used with permission of the UNC Vision group).
10 Concluding Remarks
The application of new optimization methods to the problems of Multiview Geometry has led to the development of reliable and provably optimal solutions under different geometrically meaningful cost functions. At present these algorithms are not as fast as standard methods, such as bundle adjustment. Nevertheless, the run times are not wildly impractical. Recent work on speeding up the optimization process is yielding much faster run times, and further progress is likely. Such optimization techniques are also being investigated in other areas of Computer Vision, such as discrete optimization. A representative paper is [24]. For 15 years or more, geometric computer vision has relied on a small repertoire of optimization methods, with Levenberg-Marquardt [37] being the most popular. The benefit of using new methods such as SOCP and other convex and quasi-convex optimization methods is being realised.
References

1. Agarwal, S., Chandraker, M.K., Kahl, F., Kriegman, D.J., Belongie, S.: Practical global optimization for multiview geometry. In: European Conf. Computer Vision, Graz, Austria, pp. 592–605 (2006)
2. Åström, K., Enqvist, O., Olsson, C., Kahl, F., Hartley, R.: An L∞ approach to structure and motion problems in 1d-vision. In: Int. Conf. Computer Vision, Rio de Janeiro, Brazil (2007)
3. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
4. Byröd, M., Josephson, K., Åström, K.: Improving numerical accuracy in Gröbner basis polynomial equation solvers. In: Int. Conf. Computer Vision, Rio de Janeiro, Brazil (2007)
5. Chandraker, M.K., Agarwal, S., Kriegman, D.J., Belongie, S.: Globally convergent algorithms for affine and metric upgrades in stratified autocalibration. In: Int. Conf. Computer Vision, Rio de Janeiro, Brazil (2007)
6. Farenzena, M., Fusiello, A., Dovier, A.: Reconstruction with interval constraints propagation. In: Proc. Conf. Computer Vision and Pattern Recognition, New York City, USA, pp. 1185–1190 (2006)
7. Faugeras, O.D., Maybank, S.J.: Motion from point matches: Multiplicity of solutions. Int. Journal Computer Vision 4, 225–246 (1990)
8. Freund, R.W., Jarre, F.: Solving the sum-of-ratios problem by an interior-point method. J. Glob. Opt. 19(1), 83–102 (2001)
9. Hartley, R., de Agapito, L., Hayman, E., Reid, I.: Camera calibration and the search for infinity. In: Proc. 7th International Conference on Computer Vision, Kerkyra, Greece, September 1999, pp. 510–517 (1999)
10. Hartley, R., Kahl, F.: Global optimization through searching rotation space and optimal estimation of the essential matrix. Int. Conf. Computer Vision (2007)
11. Hartley, R., Schaffalitzky, F.: L∞ minimization in geometric reconstruction problems. In: Conf. Computer Vision and Pattern Recognition, Washington DC, USA, vol. I, pp. 504–509 (2004)
12. Hartley, R., Sturm, P.: Triangulation. Computer Vision and Image Understanding 68(2), 146–157 (1997)
13. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004)
14. Horn, B.K.P.: Closed form solution of absolute orientation using unit quaternions. J. Opt. Soc. America 4(4), 629–642 (1987)
15. Horn, B.K.P.: Relative orientation. Int. Journal Computer Vision 4, 59–78 (1990)
16. Horn, B.K.P.: Relative orientation revisited. J. Opt. Soc. America 8(10), 1630–1638 (1991)
17. Josephson, K., Kahl, F.: Triangulation of points, lines and conics. In: Scandinavian Conf. on Image Analysis, Aalborg, Denmark (2007)
18. Kahl, F.: Multiple view geometry and the L∞-norm. In: Int. Conf. Computer Vision, Beijing, China, pp. 1002–1009 (2005)
19. Kahl, F., Henrion, D.: Globally optimal estimates for geometric reconstruction problems. Int. Journal Computer Vision 74(1), 3–15 (2007)
20. Ke, Q., Kanade, T.: Quasiconvex optimization for robust geometric reconstruction. In: Int. Conf. Computer Vision, Beijing, China, pp. 986–993 (2005)
21. Ke, Q., Kanade, T.: Uncertainty models in quasiconvex optimization for geometric reconstruction. In: Conf. Computer Vision and Pattern Recognition, New York City, USA, pp. 1199–1205 (2006)
22. Kim, J.H., Hartley, R., Frahm, J.M., Pollefeys, M.: Visual odometry for non-overlapping views using second-order cone programming. In: Asian Conf. Computer Vision (November 2007)
23. Koenderink, J.J., van Doorn, A.J.: Affine structure from motion. J. Opt. Soc. America 8(2), 377–385 (1991)
24. Kumar, P., Torr, P.H.S., Zisserman, A.: Solving Markov random fields using second order cone programming relaxations. In: Conf. Computer Vision and Pattern Recognition, pp. 1045–1052 (2006)
25. Li, H.: A practical algorithm for L-infinity triangulation with outliers. In: CVPR, vol. 1, pp. 1–8. IEEE Computer Society, Los Alamitos (2007)
26. Li, H., Hartley, R.: Five-point motion estimation made easy. In: Int. Conf. Pattern Recognition, pp. 630–633 (August 2006)
27. Li, H., Hartley, R.: The 3D – 3D registration problem revisited. In: Int. Conf. Computer Vision (October 2007)
28. Longuet-Higgins, H.C.: A computer algorithm for reconstructing a scene from two projections. Nature 293, 133–135 (1981)
29. Lu, F., Hartley, R.: A fast optimal algorithm for L2 triangulation. In: Asian Conf. Computer Vision (November 2007)
30. Marvell, A.: To his coy mistress. circa (1650)
31. Nistér, D.: An efficient solution to the five-point relative pose problem. IEEE Trans. Pattern Analysis and Machine Intelligence 26(6), 756–770 (2004)
32. Nistér, D., Hartley, R., Stewénius, H.: Using Galois theory to prove that structure from motion algorithms are optimal. In: Conf. Computer Vision and Pattern Recognition (June 2007)
33. Nistér, D., Kahl, F., Stewénius, H.: Structure from motion with missing data is NP-hard. In: Int. Conf. Computer Vision, Rio de Janeiro, Brazil (2007)
34. Olsson, C., Eriksson, A., Kahl, F.: Efficient optimization of L∞-problems using pseudoconvexity. In: Int. Conf. Computer Vision, Rio de Janeiro, Brazil (2007)
35. Olsson, C., Kahl, F., Oskarsson, M.: Optimal estimation of perspective camera pose. In: Int. Conf. Pattern Recognition, Hong Kong, China, vol. II, pp. 5–8 (2006)
36. Papadimitriou, C., Steiglitz, K.: Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, Englewood Cliffs (1982)
37. Press, W., Flannery, B., Teukolsky, S., Vetterling, W.: Numerical Recipes in C. Cambridge University Press, Cambridge (1988)
38. Salzman, M., Hartley, R., Fua, P.: Convex optimization for deformable surface 3D tracking. In: Int. Conf. Computer Vision (October 2007)
39. Seo, Y., Hartley, R.: A fast method to minimize L∞ error norm for geometric vision problems. In: Int. Conf. Computer Vision (October 2007)
40. Seo, Y., Hartley, R.: Sequential L∞ norm minimization for triangulation. In: Asian Conf. Computer Vision (November 2007)
41. Sim, K., Hartley, R.: Recovering camera motion using the L∞-norm. In: Conf. Computer Vision and Pattern Recognition, New York City, USA, pp. 1230–1237 (2006)
42. Sim, K., Hartley, R.: Removing outliers using the L∞-norm. In: Conf. Computer Vision and Pattern Recognition, New York City, USA, pp. 485–492 (2006)
43. Stewénius, H., Schaffalitzky, F., Nistér, D.: How hard is three-view triangulation really? In: Int. Conf. Computer Vision, Beijing, China, pp. 686–693 (2005)
44. Sturm, J.F.: Using SeDuMi 1.02, a Matlab toolbox for optimization over symmetric cones. Optimization Methods and Software 11(12), 625–653 (1999)
45. Tomasi, C., Kanade, T.: Shape and motion from image streams under orthography: A factorization approach. Int. Journal Computer Vision 9(2), 137–154 (1992)
46. Triggs, W., McLauchlan, P.F., Hartley, R.I., Fitzgibbon, A.: Bundle adjustment for structure from motion. In: Vision Algorithms: Theory and Practice, pp. 298–372. Springer, Heidelberg (2000)
47. Zhang, X.: Pose estimation using L∞. In: Image and Vision Computing New Zealand (2005)
Machine Vision in Early Days: Japan’s Pioneering Contributions Masakazu Ejiri R & D Consultant in Industrial Science, formerly at Central Research Laboratory, Hitachi, Ltd.
Abstract. The history of machine vision began in the mid-1960s through the efforts of Japanese industry researchers. A variety of prominent vision-based systems was made possible by creating and evolving real-time image processing techniques, and these systems were applied to factory automation, office automation, and even social automation during the 1970-2000 period. In this article, these historical attempts are briefly explained to promote understanding of the pioneering efforts that opened the door to, and formed the basis of, today's computer vision research. Keywords: Factory automation, office automation, social automation, real-time image processing, video image analysis, robotics, assembly, inspection.
1
Introduction
There is an old saying, “knowing the old brings you a new wisdom for tomorrow,” that originated with Confucius (a Chinese philosopher, 551 BC-479 BC). This is the basic idea underlying this article, and its purpose is to enlighten young researchers on old technologies rather than new ones. In the 1960s, one of the main concerns of researchers in the field of information science was the realization of intelligence by using a conventional computer, which had been used mainly for numerical computing. At that time, a hand-eye system was thought to be an excellent research tool to visualize intelligence and demonstrate its behavior. The hand-eye system was, by itself, soon recognized as an important research target, and it became known as the “intelligent robot.” One of the core technologies of the intelligent robot was, of course, vision, and people started to call this vision research area “computer vision.” However, the academic research on computer vision was apt to be stagnant. Its achievements stayed at the level of simulated tasks and could not surpass our expectations because of its intrinsic difficulty and the limitation of computing power in those days. On the other hand, practical vision technology was eagerly anticipated in industry, particularly in Japan, as one of the core technologies towards attaining flexible factory automation. Research was initiated in the mid-1960s, notably at our group at Hitachi’s Central Research Laboratory. In contrast to the word “computer vision,” we used the word “machine vision” for representing a more
pragmatic approach towards “useful” vision systems, because the use of computers was less essential in the pragmatic approach. Most of the leading companies followed this lead, and they all played an important role in the incubation and development of machine vision technology. Currently, computer vision is regarded as a fundamental and scientific approach to investigate the principles that underlie vision and how artificial vision can be best achieved, in contrast to the more pragmatic, needs-oriented machine vision approach. We have to note, however, that there is no difference in the ultimate goals of these two approaches. Though the road to machine vision was not smooth, Japanese companies fortunately achieved some key successes. In this article, we briefly introduce these pioneering vision applications and discuss the history of machine vision.
2
Prehistoric Machine Vision
Our first attempt at vision application, in 1964, was to automate the assembly process (i.e., wire-bonding process) of transistors. In this attempt, we used a very primitive optical sensor by combining a microscope and a rotating-drum type scanner with two slits on its surface. By detecting the reflection from the transistor surface with photo-multipliers, the position and orientation of transistor chips were determined with about a 95% success rate. However, this percentage was still too low to enable us to replace human workers; thus, our first attempt failed and was eventually abandoned after a two-year struggle. What we learned from this experience was the need for reliable artificial vision comparable to a human’s pattern recognition capability, which quickly captures the image first, and then reduces the information quantity drastically until the positional information is firmly determined. Our slit-type optical-scanning method inherently lacked the right quantity of captured information; thus, the recognition result was apt to be easily affected by reflective noises. In those days, microprocessors had not yet been developed and the available computers were still too expensive, bulky, and slow, particularly for image processing and pattern recognition. Moreover, memory chips were extremely costly, so the use of full-frame image memories was prohibitive. Though there was no indication that these processors would soon improve, we started seminal research on flexible machines in 1968. A generic intelligent machine conceived at that time was the one that consisted of three basic functions: understanding of a human’s instruction or intention to clarify the goal of the task, understanding of objects or environment to clarify the start of the task, and decision-making to find the optimum route between the start and the goal. Based on this conception, a prototype intelligent robot was developed in 1970. The configuration of this intelligent robot is shown in Fig.1. In this robot, the image of a trihedral plan drawing was captured by one of the cameras and was analyzed to clarify the goal assemblage as well as the components of the assemblage. Another camera looked at the real blocks scattered on a table, and found their positions and postures. From these blocks, the
Fig. 1. Intelligent robot (1970)
computer recognized the component blocks needed to complete this assembly task, and made a plan to assemble these blocks. For this assembly planning, backward reasoning was used to find the route from the goal to the start, not from the start to goal. That is, the route was found by analyzing the disassembly task from the goal assemblage to each component. The assembly sequence was determined as the reverse of this disassembly sequence. Thus, the robot could assemble blocks into various forms by responding to the objectives presented macroscopically by an assembly drawing [2]. This research formed part of Japan’s national project on PIPS (Pattern Information Processing System), initiated in the following year.
3
Factory Applications
Our project on the prototype intelligent robot in 1970 revealed many basic problems underlying “flexible machines” and gave us useful insights into future applications of robotics. One significant problem we were confronted with was the robot’s extremely slow image-processing speed in understanding objects. Our next effort was therefore focused on developing high-speed dedicated hardware for image processing with the minimum use of memory, instead of using rather slow and expensive computers. One of the core ideas was to adaptively threshold the image signal into a binary form by responding to the signal behavior and to input it into a shift-register-type local memory that dynamically stored the latest pixel data of several horizontal scan-lines. This local-parallel-type configuration enabled us to simultaneously extract plural pixel data from a 2-D local area in synchronization with image scanning. By designing the logic circuit connected to this 2-D local area according to the envisaged purpose, the processing hardware could be adapted to many applications. One useful yet simple method using this local-parallel-type image processing was windowing. This method involved setting up a few particular window areas
Fig. 2. Bolting robot for piles/poles (1972)
in the image plane, and the pixels in the windows were selectively counted to find the background area and the object area occupying the windows. In 1972 a bolting robot applying windowing was developed in order to automate the molding process of concrete piles and poles [3]. It became the first application of machine vision to moving objects. Note that this paper [3] was published only later, as publication was not the first priority for industry researchers. Another effective method based on the local-parallel architecture was erosion/dilation of patterns, which was executed by simple AND/OR logic on a 2-D local area. This method could detect defects in printed circuit boards (PCBs), and formed one of the bases of today’s morphological processing. This defect-detection machine in 1972 also became the first application of machine vision to the automation of visual inspection [4]. These two pioneering applications are illustrated in Figs. 2 and 3. Encouraged by the effectiveness of these machine-vision systems in actual production lines, we again started to develop a new assembly machine for transistors that was, this time, based fully on image processing. A multiple local pattern matching method was extensively studied for this purpose. In this method, each local pattern position in a transistor image was found by matching to a standard pattern. The distance and the angle between a pair of detected local pattern positions were sequentially checked to see if these local patterns were correctly detected. The electrode positions for wiring were then calculated from the coordinates of the first detected correct pair. By basing a local-parallel-type image processor on this matching, we finally developed fully automatic transistor assembly machines in 1973 [5]. This successful development was the result of a ten-year effort since our first failed attempt. The developed assembly machines were recognized as the world’s first image-based machines for fully automatic assembly of semiconductor devices. These machines and their configuration are shown in Fig. 4.
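The erosion/dilation principle behind the 1972 PCB inspection machine can be illustrated with a small software sketch. This is only a modern analogue of the idea, not the original AND/OR hardware logic; the function name, the structuring-element size, and the use of NumPy/SciPy are assumptions made for illustration.

```python
import numpy as np
from scipy import ndimage

def find_defects(pattern, size=3):
    """Flag pin-holes and protrusions in a binary circuit pattern (1 = conductor)."""
    p = pattern.astype(bool)
    se = np.ones((size, size), dtype=bool)                 # the 2-D local area
    opened = ndimage.binary_dilation(ndimage.binary_erosion(p, se), se)
    closed = ndimage.binary_erosion(ndimage.binary_dilation(p, se), se)
    protrusions = p & ~opened      # thin whiskers vanish under erosion-then-dilation
    pinholes = closed & ~p         # small holes are filled by dilation-then-erosion
    return protrusions | pinholes
```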
Fig. 3. PCB inspection machine (1972)
Fig. 4. Transistor assembly machine (1973)
After this development, our efforts were then focused on expanding machine-vision applications from transistors to other semiconductor devices, such as integrated circuits (ICs), hybrid ICs, and large-scale integrated circuits (LSIs). Consequently, the automatic assembly of all types of semiconductor devices was completed by 1977. With the export of this automatic assembly technology to a US manufacturer as a start, the technology gained widespread attention from semiconductor manufacturers worldwide and expanded quickly into industry. As a result, the semiconductor industry as a whole prospered by virtue of higher speed production of higher quality products with more uniform performance than had been achieved previously. Encouraged by the success of semiconductor assembly, our efforts were further broadened to other industrial applications in the mid-1970s to early 1980s. Examples of such applications during this period are a hose-connecting robot
Fig. 5. Wafer inspection machines (1984-1987)
for pressure testing in pump production lines, a reading machine for 2-D object codes for an intra-factory physical distribution system, and a quality inspection machine for marks and characters printed on electronic parts [6][7][8]. Machines for inspecting photo-masks in semiconductor fabrication and CRT black-matrix fabrication were other examples [9] (by Toshiba) and [10] (by Hitachi). Machines for classifying medical tablets and capsules [11] (by Fuji Electric) and machines for classifying agricultural products and fish [12][13][14] (by Mitsubishi Electric) were also unique and epoch-making achievements in those days. These examples show that the key concept representing those years seemed to be the realization of a “productive society” through factory automation, and the objectives of machine vision were mainly position detection for assembly, shape detection for classification, and defect detection for inspection. In 1980, the PIPS project finished after a 10-year effort by Japanese industry. A variety of recognition systems were successfully prototyped for hand-written Kanji characters, graphics, drawings, documents, color pictures, speech, and three-dimensional objects. One particular outcome among others was the development of high-speed, general-purpose image processors [15], which in turn served as the basis of subsequent research and development in industry. The most difficult but rewarding development in the mid-1980s was an inspection machine for detecting defects in semiconductor wafers [16]. It was estimated that even the world’s largest super-computer available at that time would require at least one month of computing to finish the defect detection of a single 8-inch wafer. We therefore had to develop special hardware to lower the processing time to less than 1 hour/wafer. The resulting hardware was a network of local-parallel-type image processors that used a “design pattern referring method,” shown in Fig. 5. In this machine, hardware-based knowledge processing, in which each processor was regarded as a combination of IF-part and THEN-part logical circuits, was first attempted [17].
Meanwhile, the processing speed of microprocessors improved considerably since their appearance in the mid-1970s, and the memory capacity drastically increased without excessively increasing costs. These improvements facilitated the use of gray-scale images instead of binary ones, and dedicated LSI chips for image processing were developed in the mid-1980s [18]. These developments all contributed to achieving more reliable, microprocessor-based general-purpose machine vision systems with full-scale buffers for gray-level images. As a result, applications of machine vision soon expanded from circuit components, such as semiconductors and PCBs, to home-use electronic equipment, such as VCRs and color TVs. Currently, machine vision systems are found in various areas such as electronics, machinery, medicine, and food industries.
4
Office Applications
Besides the above-described machine-vision systems for factory applications, there was extensive research on character recognition in the area of office automation. For example, in the mid-1960s, a FORTRAN program reader was developed to replace key-punching tasks. Mail-sorting machines for post offices were developed in the late 1960s to automatically read handwritten postal codes (by Toshiba et al.). Another topical developmental effort started in 1974 for automatic classification of fingerprint patterns, and in 1982 the system was first put in use at a Japanese police office with great success, and later at US police offices (by NEC). Our first effort to apply machine-vision technology to areas other than factory automation was the automatic recognition of monetary bills in 1976. This recognition system was extremely successful in spurring the development of automated teller machines (ATMs) for banks (see Fig. 6). Due to the processing time limitation, the entire image of a bill was not captured, but by combining several partial images obtained from optical sensors with those from magnetic sensors, so-called sensor fusion was first attempted, resulting in high-accuracy bill recognition with a theoretical error rate of less than 1/10^15. Early ATM models for domestic use employed vertical safes, but in the later models, horizontal safes were extensively used for increasing spatial efficiency and for facilitating use in Asian countries having a larger number of bill types. Our next attempt, in the early 1980s, was the efficient handling of a large amount of graphic data in the office [19]. The automatic digitization of paper-based engineering drawings and maps was first studied. The recognition of these drawings and maps was based on a vector representation technique, such as that shown in Fig. 7. The recognition was usually executed by spatially-parallel-type image processors, in which each processor was designated to a specific image area. Currently, geographic information systems (GIS) based on these digital maps have gained popularity and are being used by many service companies and local governments to manage their electric power supply, gas supply, water supply facilities, and sewage service facilities (see Fig. 8). The use of digital maps was then extended to car navigation systems and more recently to various
Fig. 6. ATM: automated teller machines (1976-1995)
Fig. 7. Automatic digitizer for maps and engineering drawings (1982)
other information service systems via the Internet. Machine-vision technology contributed, mainly in the early developmental stage of these systems, to the digitization of original paper-based maps into electronic form until these digital maps began to be produced directly from measured data through computer-aided map production. Spatially divided parallel processing was also useful for large-scale images such as those from satellite data. One of our early attempts in this area was the recognition of wind vectors, back in 1972, by comparing two simulated satellite images with a 30-minute interval. This system formed a basis of weather forecasting using Japan’s first meteorological geo-stationary satellite “Himawari,” launched a few years later. Also, an algorithm for deriving a sea temperature contour map from infra-red satellite images was built for environmental study and fisheries.
Fig. 8. GIS: geographic information systems(1986)
Research on document understanding also originated as part of machine-vision research in the mid-1980s [20]. During those years, electronic document editing and filing became popular owing to the progress in word-processing technology for over 4000 Kanji and Kana characters. The introduction of an electronic patent-application system in Japan in 1990 was an important stimulus for further research on office automation. We developed dedicated workstations and a parallel-disk-type distributed filing system for the use of patent examiners. This system enabled examiners to efficiently retrieve and display the images of past documents for comparison. The recognition of handwritten postal addresses was one of the most challenging topics in machine-vision applications. In 1992, a decision was made by a government committee (to which the author served as a member) to adopt a new 7-digit postal code system in Japan beginning in 1998. To this end, three companies (Hitachi, Toshiba and NEC) developed new automatic mail-sorting machines for post offices in 1997. An example of the new sorting machines is shown in Fig. 9. In those machines, hand-written/printed addresses in Kanji characters are fully read together with the 7-digit postal codes; both results are then matched for consistency; and the recognized full address is printed on each letter as a transparent barcode consisting of 20-digit data. The letters are then dispatched to other post offices for delivery. In a subsequent process, only these barcodes are read, and prior to home delivery the letters are arranged by the new sorting machine in such a way that the order of the letters corresponds to the house order on the delivery route. In these postal applications, the recognition of all types of printed fonts and hand-written Kanji characters was made possible by using a multi-microprocessor type image processing system. A mail image is sent to one of the unoccupied processors, and this designated processor analyzes the image. The address recognition by a designated single processor usually requires 1.0 to 2.5 seconds, depending on the complexity of the address image. As up to 32 microprocessors are used
Fig. 9. New mail sorting machine (1997)
in parallel for successively flowing letters, the equivalent recognition time of the whole system is less than 0.1 seconds/letter, producing a maximum processing speed of 50,000 letters per hour. The office applications of vision technology described above show that the key concept representing those years seemed to be the realization of an “efficient society” through office automation, and the objectives of machine vision were mainly efficient handling of large-scale data and also high-precision, high-speed recognition and handling of paper-based information. Recent progress in network technology has also increased the importance of office automation. To secure the reliability of information and communication systems, a variety of advanced image processing technologies will be required. These will include more effective and reliable compression, encryption, scrambling, and watermarking technologies for image data.
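As a rough consistency check of the throughput figures above (the mean per-letter recognition time is an assumed value inside the stated 1.0-2.5 s range):

```python
processors = 32
mean_seconds_per_letter = 2.3                     # assumed average within the 1.0-2.5 s range
letters_per_hour = processors / mean_seconds_per_letter * 3600
equivalent_seconds = mean_seconds_per_letter / processors
print(round(letters_per_hour), round(equivalent_seconds, 3))   # ~50000 letters/hour, ~0.072 s/letter
```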
5
Social Applications
In recent years, applications to social automation have become increasingly important. Social automation here means “making the systems designed for social use more intelligent,” and it includes systems for traffic and for environmental use. The technologies used in these systems are, for example, surveillance, monitoring, flow control and security assurance. The earliest attempt at social automation was probably our elevator-eye project in 1977, in which we tried to implement machine vision in an elevator system in order to control the human traffic in large-scale buildings. The elevator hall on each floor was equipped with a camera to observe the hall, and a vision system to which these cameras were connected surveyed all floors in a time-sharing manner and estimated the number of persons waiting for an elevator. The vision system then designated an elevator cage to quickly serve the crowded floor [21]. The configuration of this system is shown in Fig. 10.
Fig. 10. Elevator and other traffic applications (1977-1986)
In this elevator system, a robust change-finding algorithm based on edge vectors was used in order to cope with the change in the brightness of the surroundings. In this algorithm, the image plane was divided into several blocks, and the edge-vector distribution in each block was compared with that of the background image, which was updated automatically by new image data when no motion was observed and thus nobody was in the elevator hall (a software sketch of this block-wise comparison is given at the end of this section). This system could minimize the average waiting time for the elevator. Though a few systems were put into use in the Tokyo area in the early 1980s, there has not been enough market demand to continue to develop the system further. More promising applications of image recognition seemed to be for monitoring road traffic, where license plates, traffic jams, and illegally parked cars were identified so that traffic could be controlled smoothly and parking lots could be automatically allocated [22]. Charging tolls automatically at toll gates without stopping cars, by means of a wireless system with an IC card, is now popular on highways as a result of the ITS (Intelligent Transport System) project. The system will be further improved if machine vision can be effectively combined with it to quickly recognize other important information such as license plate numbers and even drivers’ faces and other identities. A water-purity monitoring system using fish behavior [23] was in operation for at least 10 years at a river control center in a local city in Japan, after the river water had been accidentally polluted by toxicants. A schematic diagram of the system is shown in Fig. 11. The automatic observation of algae in water in sewage works was also studied. Volcanic lava flow was continuously monitored at the base of Mt. Fugendake in Nagasaki, Japan, during the eruption period in 1993. To optically send images from unmanned remote observation posts to the central control station, laser communication routes were planned by using 3-D undulation data derived from GIS digital contour maps. A GIS was also constructed to assist in restoration after the “Hanshin-Awaji” earthquake in Kobe, Japan, in 1995. Aerial photographs after the earthquake were analyzed by
Fig. 11. Environmental use (1990-1995)
matching them with digital 3-D urban maps containing additional information on the height of buildings. Buildings with damaged walls and roofs could thus be quickly detected and given top priority for restoration [24]. Intruder detection is also becoming important in the prevention of crimes and in dangerous areas such as those around high-voltage electric equipment. Railroad crossings can also be monitored intensively by comparing the vertical line data in an image with that in a background image updated automatically [25]. Arranging the image differences in this vertical window gives a spatiotemporal image of objects intruding onto the crossing. In almost all of these social applications, real-time color-image processing is becoming increasingly important for reliable detection and recognition. As mentioned before, the application of image processing to communications is increasingly promising as multimedia and network technologies improve. Human-machine interfaces will be greatly improved if the machine is capable of recognizing every medium used by humans. Human-to-human communication assisted by intelligent machines and networks is also expected. Machine vision will contribute to this communication in such fields as motion capturing, face recognition, facial expression understanding, gesture recognition, sign language understanding, and behavior understanding. In addition, applications of machine vision to the fields of human welfare, medicine, and environmental improvement will become increasingly important in the future. Examples of these applications are rehabilitation equipment, medical surgery assistance, and water purification in lakes. Thus, the key concept representing the future seems to be the realization of a calm society, in which all uneasiness will be relieved through networked social automation, and the important objectives of machine vision will typically be the realization of the following two functions: 24-hour/day “abnormality monitoring” via networks and reliable “personal identification” via networks.
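A minimal sketch of the block-wise, edge-vector-based change detection with automatic background updating described above (used in the elevator halls and, in a vertical-window form, at railroad crossings). The block size, gradient threshold, histogram bins, and decision threshold are illustrative assumptions, not the values used in the original systems.

```python
import numpy as np

def edge_histograms(gray, block=32, bins=8):
    """Per-block histogram of edge directions (a simple edge-vector distribution)."""
    gy, gx = np.gradient(gray.astype(float))
    mag, ang = np.hypot(gx, gy), np.arctan2(gy, gx)
    h, w = gray.shape
    feats = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            a = ang[y:y+block, x:x+block][mag[y:y+block, x:x+block] > 10.0]
            hist, _ = np.histogram(a, bins=bins, range=(-np.pi, np.pi))
            feats.append(hist / max(hist.sum(), 1))
    return np.array(feats)

def detect_change(frame, background, thresh=0.3):
    """Compare each block against the background; refresh the background when nothing moves."""
    diff = np.abs(edge_histograms(frame) - edge_histograms(background)).sum(axis=1)
    changed = diff > thresh          # blocks whose edge-vector distribution changed
    if not changed.any():            # no motion observed: update the background model
        background[:] = frame
    return changed
```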
6
Key Technologies
In most of the future applications, dynamic image processing will be a key to success. There are various approaches already for analyzing incoming video images in real-time by using smaller-scale personal computers. One typical example is the “Mediachef” system, which automatically cuts video images into a set of scenes by finding significant changes between consecutive image frames [26]. The principle of the system is shown in Fig. 12. This is one of the essential technologies for video indexing and video-digest editing. To date, this technology has been put into use in the video inspection process in a broadcasting company so that subliminal advertising can be detected before the video is on the air.
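The scene-cutting principle can be sketched as follows: a cut is declared when the color histograms of consecutive frames differ significantly. This is only an illustration of the frame-difference idea, not the actual “Mediachef” implementation; the bin count and threshold are assumptions.

```python
import numpy as np

def cut_points(frames, bins=16, thresh=0.4):
    """frames: iterable of RGB images (H, W, 3). Returns indices where a new scene starts."""
    cuts, prev_hist = [], None
    for i, f in enumerate(frames):
        hist = np.histogram(f, bins=bins, range=(0, 256))[0].astype(float)
        hist /= hist.sum()
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > thresh:
            cuts.append(i)           # significant change between consecutive frames
        prev_hist = hist
    return cuts
```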
Fig. 12. Key technologies: “Mediachef” for video indexing and editing (1990)
For the purpose of searching scenes, we developed a real-time video coding technique that uses an average color in each frame and represents its sequence by a “run” between frames. This method can compress 24-hour video signals into a memory capacity of only 2 MB. This video-coding technology can be applied to automatically detect the time of broadcast of a specific TV commercial by continuously monitoring TV signals by means of a compact personal computer. It therefore allows manufacturers to monitor their commercials being broadcast by an advertising company and, thus, provides evidence of a broadcast. The technology called “Cyber BUNRAKU,” in which human facial expressions are monitored by small infrared-sensitive reflectors put on a performer’s face, is also noteworthy. By combining the facial expressions thus obtained with the limb motions of a 19-jointed “Bunraku doll” (used in traditional Japanese theatrical performance), a 3-D character model in a computer can be animated in real-time to create video images [27], as shown in Fig. 13. This technology can create TV animation programs much faster than through traditional hand-drawing methods.
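The average-color run coding described at the start of this passage can be sketched as follows; the quantization step and data layout are assumptions, not the actual product format. Because one run entry covers many similar consecutive frames, a 24-hour signal can indeed be reduced to a few megabytes.

```python
import numpy as np

def color_signature(frames, quant=16):
    """Encode a video as runs of quantized per-frame average color: (color, run_length) pairs."""
    runs = []
    for f in frames:
        c = tuple((f.reshape(-1, 3).mean(axis=0) // quant).astype(int))  # quantized mean RGB
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1          # extend the current run while the average color is unchanged
        else:
            runs.append([c, 1])       # start a new run at a color change
    return runs
```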
Fig. 13. Key technologies: Cyber BUNRAKU(1996)
Fig. 14. Key technologies: Tour into the picture (1997)
three-dimensional data by manually fitting vanishing lines on the displayed picture. The picture can then be looked at from different angles and distances [28]. A motion video can thus be generated from a single picture and viewers can feel as if they were taking a walk in an ancient city when an old picture of the city is available. A real-time creation of panoramic pictures is also an important application of video-image processing [29]. A time series of each image frame from a video camera during panning and tilting is spatially connected in real-time into a single still picture (i.e. image mosaicing), as shown in Fig. 15. Similarly, by connecting all the image frames obtained during the zooming process, a high-resolution picture (having higher resolution in the inner areas) can be obtained as shown in Fig. 16.
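The mosaicing step relies on estimating the shift between consecutive frames; reference [29] does this with luminance-projection correlation. The sketch below illustrates that idea under simplifying assumptions (integer shifts only, mean-removed correlation); it is not the algorithm of [29].

```python
import numpy as np

def projection_shift(prev, curr, max_shift=32):
    """Estimate the integer (dy, dx) between two gray frames from 1-D luminance projections."""
    def best_shift(p, q):
        p, q = p - p.mean(), q - q.mean()
        scores = []
        for s in range(-max_shift, max_shift + 1):
            a = p[max(s, 0): len(p) + min(s, 0)]
            b = q[max(-s, 0): len(q) + min(-s, 0)]
            scores.append(np.dot(a, b) / len(a))   # mean-removed, length-normalized correlation
        return int(np.argmax(scores)) - max_shift
    dy = best_shift(prev.mean(axis=1), curr.mean(axis=1))   # row projections -> vertical shift
    dx = best_shift(prev.mean(axis=0), curr.mean(axis=0))   # column projections -> horizontal shift
    return dy, dx
```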
Fig. 15. Key technologies: Panoramic view by panning and tilting(1998)
Fig. 16. Key technologies: Panoramic view by zooming (1999)
As mentioned already, one important application of machine vision is personal identification in social use. Along these lines, there have been a few promising developments. These include personal identification systems by means of fingerprint patterns [30] (by NEC, 1996), iris patterns (by Oki Electric, 1997), finger vein patterns (by Hitachi, 2000, see Fig. 17) and palm vein patterns (by Fujitsu, 2002). These are now finding wide use in security systems, including application to automated teller machines (ATMs). We have given a few examples of real-time image processing technologies, which will be key technologies applicable to a wide variety of systems in the future. The most difficult technical problem that social automation is likely to face, however, is how to make robust machine-vision systems that can be used day or night in all types of weather conditions. To cope with the wide changes in illumination, the development of a variable-sensitivity imaging device with a wide dynamic range is still a stimulating challenge. Artificial retina chips [31] (by Mitsubishi, 1998) and high-speed vision chips (by Fujitsu and the University of Tokyo, 1999) are expected to play an important role along these lines.
Fig. 17. Key technologies: Biometrics based on finger vein patterns (2000)
7
Summaries
The history of machine vision and its applications in Japan was briefly reviewed by focusing on the efforts in industry, and is roughly summarized in chronological form in Fig. 18.
Fig. 18. History of machine vision research
Details of the topics of industrial activities are listed on a year-to-year basis in Table 1, together with various topics in other related fields for easier understanding of each period. The history is also summarized in Table 2 in a list form characterizing each developmental stage. As indicated, we can see that, in addition to factory automation, office automation and social automation have been greatly advanced in those years by the evolution of machine-vision technology, owing to the progress of processor and
Table 1. History of machine vision research (1961-2000)
memory technologies. However, it is also a fact that one of the most prominent contributions of machine vision technology was in the production of semiconductors. The semiconductor industry, and thus our human life, would not have been able to enjoy prosperity without machine vision technology. This article was prepared from the viewpoint of the old saying, “knowing the old brings you a new wisdom for tomorrow,” by Confucius. The author will be extremely pleased if this article is read widely by young researchers, as it would give them some insight into this field, and would encourage them to get into, and play a great role in, this seemingly simple but actually difficult research field.
Table 2. History of machine vision research
The closing message from the author to young researchers is as follows: Lift your sights, raise your spirits, and get out into the world!
References 1. Ejiri, M., Miyatake, T., Sako, H., Nagasaka, A., Nagaya, S.: Evolution of realtime image processing in practical applications. In: Proc. IAPR MVA, Tokyo, pp. 177–186 (2000) 2. Ejiri, M., Uno, T., Yoda, H., Goto, T., Takeyasu, K.: A prototype intelligent robot that assembles objects from plan drawings. IEEE Trans. Comput. C-21(2), 161–170 (1972) 3. Uno, T., Ejiri, M., Tokunaga, T.: A method of real-time recognition of moving objects and its application. Pattern Recognition 8, 201–208 (1976) 4. Ejiri, M., Uno, T., Mese, M., Ikeda, S.: A process for detecting defects in complicated patterns, Comp. Graphics & Image Processing 2, 326–339 (1973) 5. Kashioka, S., Ejiri, M., Sakamoto, Y.: A transistor wire-bonding system utilizing multiple local pattern matching techniques. IEEE Trans. Syst. Man & Cybern. SMC-6(8), 562–570 (1976) 6. Ejiri, M.: Machine vision: A practical technology for advanced image processing. Gordon & Breach Sci. Pub, New York (1989) 7. Ejiri, M.: Recent image processing applications in industry. In: Proc. 9th SCIA, Uppsala, pp. 1–13 (1995) 8. Ejiri, M.: A key technology for flexible automation. In: Proc. of Japan-U.S.A. Symposium on Flexible Automation, Otsu, Japan, pp. 437–442 (1998) 9. Goto, N.: Toshiba Review 33, 6 (1978) (in Japanese) 10. Hara, Y., et al.: Automatic visual inspection of LSI photomasks. In: Proc. 5th ICPR (1980) 11. Haga, K., Nakamura, K., Sano, Y., Miyamori, N., Komuro, A.: Fuji Jiho 52(5), pp.294–298 (1979) (in Japanese) 12. Nakahara, S., Maeda, A., Nomura, Y.: Denshi Tokyo. IEEE Tokyo Section 18, 46–48 (1979)
13. Nomura, Y., Ito, S., Naemura, M.: Mitsubishi Denki Giho 53(12), 899–903 (1979) (in Japanese) 14. Maeda, A., Shibayama, J.: Pattern measurement, ITEJ Technical Report, 3, 32 (in Japanese) (1980) 15. Mori, K., Kidode, M., Shinoda, H., Asada, H.: Design of local parallel pattern processor for image processing. In: AFIP Conf. Proc., vol. 47, pp. 1025–1031 (1978) 16. Yoda, H., Ohuchi, Y., Taniguchi, Y., Ejiri, M.: An automatic wafer inspection system using pipelined image processing techniques, IEEE Trans. Pattern Analysis & Machine Intelligence. PAMI-10 1 (1988) 17. Ejiri, M., Yoda, H., Sakou, H.: Knowledge-directed inspection for complex multilayered patterns. Machine Vision and Applications 2, 155–166 (1989) 18. Fukushima, T., Kobayashi, Y., Hirasawa, K., Bandoh, T., Ejiri, M.: Architecture of image signal processor, Trans. IEICE, J-66C 12, 959–966 (1983) 19. Ejiri, M., Kakumoto, S., Miyatake, T., Shimada, S., Matsushima, H.: Automatic recognition of engineering drawings and maps. In: Proc. Int. Conf. on Pattern Recognition, Montreal, Canada, pp. 1296–1305 (1984) 20. Ejiri, M.: Knowledge-based approaches to practical image processing. In: Proc. MIV-89, Inst. Ind. Sci, Univ. of Tokyo, pp. 1–8. Tokyo (1989) 21. Yoda, H., Motoike, J., Ejiri, M., Yuminaka, T.: A measurement method of the number of passengers using real-time TV image processing techniques, Trans. IEICE, J-69D 11, 1679–1686 (1986) 22. Takahashi, K., Kitamura, T., Takatoo, M., Kobayashi, Y., Satoh, Y.: Traffic flow measuring system by image processing. In: Proc. IAPR MVA, Tokyo, pp. 245–248 (1996) 23. Yahagi, H., Baba, K., Kosaka, H., Hara, N.: Fish image monitoring system for detecting acute toxicants in water. In: Proc. 5th IAWPRC, pp. 609–616 (1990) 24. Ogawa, Y., Kakumoto, S., Iwamura, K.: Extracting regional features from aerial images based on 3-D map matching, Trans. IEICE, D-II 6, 1242–1250 (1998) 25. Nagaya, S., Miyatake, T., Fujita, T., Itoh, W., Ueda, H.: Moving object detection by time-correlation-based background judgment. In: Li, S., Teoh, E.K., Mital, D., Wang, H. (eds.) Recent Developments in Computer Vision. LNCS, vol. 1035, pp. 717–721. Springer, Heidelberg (1996) 26. Nagasaka, A., Miyatake, T., Ueda, H.: Video retrieval method using a sequence of representative images in a scene. In: Proc. IAPR MVA, Kawasaki, pp. 79–82 (1994) 27. Arai, K., Sakamoto, H.: Real-time animation of the upper half of the body using a facial expression tracker and an articulated input device, Research Report 96CG-83, Information Processing Society of Japan (in Japanese), 96, 125, pp. 1–6 (1996) 28. Horry, Y., Anjyo, K., Arai, K.: Tour into the picture: Using a spidery mesh interface to make animation from a single image. In: Proc. ACM SIGGRAPH 1997, pp. 225– 232 (1997) 29. Nagasaka, A., Miyatake, T.: A real-time video mosaics using luminance-projection correlation, Trans. IEICE, J82-D-II 10, 1572–1580 (1999) 30. Kamei, T., Shinbata, H., Uchida, K., Sato, A., Mizoguchi, M., Temma, T.: Automated fingerprint classification, IEICE Technical Report, Pattern Recognition and Understanding, 95(470), 17–24 (in Japanese) (1996) 31. Ui, H., Arima, Y., Murao, F., Komori, S., Kyuma, K.: An artificial retina chip with pixel-wise self-adjusting intensity response, ITE Technical Report, 23(30), pp.29–33 (in Japanese) (1999)
Coarse-to-Fine Statistical Shape Model by Bayesian Inference Ran He, Stan Li, Zhen Lei, and ShengCai Liao Institute of Automation, Chinese Academy of Sciences, Beijing, China
[email protected]
Abstract. In this paper, we take a predefined geometric shape as a constraint for accurate shape alignment. A shape model is divided into two parts: a fixed shape and an active shape. The fixed shape is a user-predefined simple shape with only a few landmarks which can be easily and accurately located by machine or human. The active one is composed of many landmarks with a complex contour. When searching for an active shape, the pose parameters are calculated from the fixed shape. Bayesian inference is introduced to make the whole shape more robust to local noise generated by the active shape, which leads to a compensation factor and a smooth factor for a coarse-to-fine shape search. This method provides a simple and stable means for online and offline shape analysis. Experiments on cheek and face contours demonstrate the effectiveness of our proposed approach. Keywords: Active shape model, Bayesian inference, statistical image analysis, segmentation.
1 Introduction Shape analysis is an important area in computer vision. A common task of shape analysis is to recover both the pose parameters and a low-dimensional representation of the underlying shape from an observed image. Applications of shape analysis range from medical image processing and face recognition to object tracking. After the pioneering work on the active shape model (ASM) put forward by Cootes and Taylor [1,2], various shape models have been developed for shape analysis, which mainly focus on two parts: (1) a statistical framework to estimate the shape and pose parameters and (2) optimal features to accurately model the appearance around landmarks. For parameter estimation, Zhou, Gu, and Zhang [3] propose a Bayesian tangent shape model to estimate parameters more accurately by Bayesian inference. Liang et al. [4] adopt a Markov network to find an optimal shape which is regularized by the PCA-based shape prior through a constrained regularization algorithm. Li and Ito [5] use AdaBoosted histogram classifiers to model local appearances and optimize shape parameters. Thomas Brox et al. [6] integrate 3D shape knowledge into a variational model for pose estimation and image segmentation. For optimal features, van Ginneken et al. [7] propose a non-linear ASM with Optimal Features (OF-ASM), which allows distributions of multi-modal intensities and uses a k-nearest-neighbors classifier for local texture classification. Federico Sukno et al. [8] further develop
this non-linear appearance model, incorporating a reduced set of differential invariant features as local image descriptors. A cascade structure containing multiple ASMs is introduced in [9] to make the location of landmarks more accurate and robust. However, these methods lose their effectiveness when dealing with complicated shape geometries or large texture variations. Can we utilize some accurate information to simplify the ASM algorithm and make shape parameter estimation more robust? For example, we can use a face detection algorithm to detect the coordinates of the eyes and mouth, or manually label these coordinates, when we want to find a facial contour for further analysis. In this paper, the problem of shape analysis is addressed from three aspects. Firstly, we present a geometry-constrained active shape model (GCASM) and divide it into two parts: a fixed shape and an active shape. The fixed shape is a user-predefined shape with only a few points and lines. These points can be easily and accurately located by machine or human. The active one is a user's desired shape and is composed of many landmarks with a complex contour. It will be located automatically with the help of the fixed shape. Secondly, Bayesian inference is introduced to make parameter estimation more robust to local noise generated by the active shape, which leads to a compensation factor and a smooth factor to perform a coarse-to-fine shape search. Thirdly, optimal features are selected as local image descriptors. Since the pose parameters can be calculated from the fixed shape, classifiers are trained for each landmark without sacrificing performance. The rest of the paper is organized as follows: In Section 2, we begin with a brief review of ASM. Section 3 describes our proposed algorithm and Bayesian inference. Experimental results are provided in Section 4. Finally, we draw the conclusions in Section 5.
2 Active Shape Models This section briefly reviews the ASM segmentation scheme. We follow the description and notation of [2]. An object is described by points, referred to as landmark points. The landmark points are (manually) determined in a set of N training images. From these collections of landmark points, a point distribution model (PDM) [10] is constructed as follows. The landmark points (x1, y1, … , xn, yn) are stacked in shape vectors.
x = ( x1 , y1 ,..., xn , yn )T .
(1)
Principal component analysis (PCA) is applied to the shape vectors x by computing the mean shape, covariance and eigensystem of the covariance matrix.
$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad \text{and} \qquad S = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})(x_i - \bar{x})^T$ .  (2)
The eigenvectors corresponding to the $k$ largest eigenvalues $\lambda_j$ are retained in a matrix $\Phi = (\phi_1 \,|\, \phi_2 \,|\, \cdots \,|\, \phi_k)$. A shape can now be approximated by
$x \approx \bar{x} + \Phi b$ .  (3)
Where $b$ is a vector of $k$ elements containing the shape parameters, computed by
$b = \Phi^T (x - \bar{x})$ .  (4)
When fitting the model to a set of points, the values of $b$ are constrained to lie within a range
$|b_j| \le c\sqrt{\lambda_j}$ ,  (5)
where c usually has a value between two and three. Before PCA is applied, the shapes can be aligned by translating, rotating and scaling so as to minimize the sum of squared distances between the landmark points. We can express the initial estimate x of a shape as a scaled, rotated and translated version of original shape
x = M ( s,θ )[ x] + t .
(6)
Where M ( s,θ ) and t are pose parameters (See [1] for details). Procrustes analysis [11] and EM algorithm [3] are often used to estimate the pose parameters and align the shapes. This transformation and its inverse are applied both before and after projection of the shape model. The alignment procedure makes the shape model independent of the size, position, and orientation of the objects.
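Under the notation of (1)-(5), the training and fitting steps of the basic PDM can be sketched as follows. This is a simplified NumPy illustration that ignores the pose alignment of (6); the function names and the retained-variance threshold are assumptions.

```python
import numpy as np

def train_pdm(shapes, var_kept=0.95):
    """shapes: (N, 2n) array of aligned shape vectors. Returns mean, eigenvectors, eigenvalues."""
    x_bar = shapes.mean(axis=0)
    S = np.cov(shapes, rowvar=False)                        # (2n, 2n) covariance matrix
    lam, phi = np.linalg.eigh(S)
    order = np.argsort(lam)[::-1]                           # sort eigenpairs by decreasing eigenvalue
    lam, phi = lam[order], phi[:, order]
    k = int(np.searchsorted(np.cumsum(lam) / lam.sum(), var_kept)) + 1
    return x_bar, phi[:, :k], lam[:k]

def fit_shape(x, x_bar, phi, lam, c=3.0):
    """Project a shape onto the model, clamp b as in (5), and reconstruct as in (3)."""
    b = phi.T @ (x - x_bar)                                 # formula (4)
    b = np.clip(b, -c * np.sqrt(lam), c * np.sqrt(lam))     # |b_j| <= c*sqrt(lambda_j), formula (5)
    return x_bar + phi @ b                                  # formula (3)
```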
3 Coarse-to-Fine Statistical Shape Model 3.1 Geometry Constrained Statistical Shape Model
To make use of the user-predefined information, we extend the PDM to two parts: an active shape and a fixed shape. The active shape is a collection of landmarks to describe an object, as in the basic PDM. It is composed of many points with a complex contour. The fixed shape is a predefined simple shape accurately marked by a user or a machine. It is composed of several connected lines between points which can be easily and accurately marked by machine or human. Considering that a line contains a huge number of points, we represent a line by several equidistant points. Thus the extended PDM is constructed as follows. The landmarks (x1, y1, … , xm, ym) are stacked in active shape vectors, and the landmarks (xm+1, ym+1, … , xn, yn) are stacked in fixed shape vectors.
$x = (x_1, y_1, \ldots, x_m, y_m, x_{m+1}, y_{m+1}, \ldots, x_n, y_n)^T$ .  (7)
As in PDM, a shape can now be approximated by
$x \approx \bar{x} + \Phi b$ .  (8)
When aligning shapes during training, the pose parameters of a shape (scaling, rotation and translation) are estimated from the fixed shape. An obvious reason is that the fixed shape is simpler and more accurate than the active one. Taking the cheek contour as an example, the active shape is composed of the landmarks on a cheek contour and the fixed shape is composed of 13 landmarks derived from three manually labeled points: the left eye center, the right eye center and the mouth center. Five
landmarks are added equidistantly between the two eye centers to represent the horizontal connecting line, and five landmarks are inserted equidistantly on the vertical line passing through the mouth center and perpendicular to the horizontal line (see the left graph of Fig. 1 for details). During training, two shapes are aligned according to the points between the two eyes only. Each item of b reflects a specific variation along the corresponding principal component (PC) axis. Shape variation along the first three PCs is shown in the right graph of Fig. 1. The interpretation of these PCs is straightforward. The first PC describes left-right head rotations. The second PC accounts for face variation in the vertical direction: long or short. And the third one explains whether a face is fat or thin.
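Under one reading of the construction above, the 13-point fixed shape can be built from the three labeled points as follows (the choice of running the vertical segment from the foot of the perpendicular on the eye line down to the mouth is an assumption, as are the function and variable names):

```python
import numpy as np

def fixed_t_shape(left_eye, right_eye, mouth):
    """Build the 13-landmark 'T'-shaped fixed shape from three labeled points."""
    le, re, m = (np.asarray(p, dtype=float) for p in (left_eye, right_eye, mouth))
    horiz = [le + t * (re - le) for t in np.linspace(0.0, 1.0, 7)]    # 2 eye centers + 5 in between
    d = (re - le) / np.linalg.norm(re - le)
    foot = le + np.dot(m - le, d) * d         # foot of the perpendicular from the mouth onto the eye line
    vert = [foot + t * (m - foot) for t in np.linspace(1/6, 1.0, 6)]  # 5 inserted points + mouth center
    return np.array(horiz + vert)             # 7 + 6 = 13 landmarks, stacked as (x, y) rows
```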
Fig. 1. The fixed shape and shapes reconstructed by the first three PCs. The thirteen white circles in the left image are the points of the fixed shape. In the right image, the middle shape in each row is the mean shape.
3.2 Bayesian Inference
When directly calculating the shape parameter $b$ by formula (4), there is an offset between the reconstructed fixed shape and the given fixed shape. But the fixed shape is supposed to be accurate; this noise comes from the reconstruction error of the active shape. Inspired by paper [3], we associate PCA with a probabilistic explanation. An isotropic Gaussian noise term is added to both the fixed and the active shape; thereby we can compute the posterior of the model parameters. The model can be written as:
$y = \bar{x} + \Phi b + \varepsilon$ ,  (9)
$y - \bar{x} - \Phi b = \varepsilon$ .  (10)
Where the shape parameter $b$ is an $n$-dimensional vector distributed as a multivariate Gaussian $N(0, \Lambda)$ with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_k)$. $\varepsilon$ denotes an isotropic noise on the whole shape. It is an $n$-dimensional random vector which is independent of $b$ and distributed as
$p(\varepsilon) \sim \exp\{-\|\varepsilon\|^2 / 2\rho^2\}$ .  (11)
$\rho = \sum_{i=1}^{n} \alpha_i \, \|y_i^{\mathrm{old}} - y_i\|^2$ .  (12)
Where $y^{\mathrm{old}}$ is the shape estimated in the last iteration and $y$ is the observed shape in the current iteration. $\alpha_i$ is the classification confidence related to the classifier used in locating a
landmark. When $\alpha_i$ is 0, the classifier can perfectly predict the shape's boundary; when $\alpha_i$ is 1, the classifier fails to predict the boundary. Combining (10) and (11), we obtain the posterior of the model parameters:
$P(b \mid y) = \mathrm{const} \cdot P(y \mid b)\, P(b) \sim \exp\!\Big(-\tfrac{1}{2}\big[(y - \bar{x} - \Phi b)^T (y - \bar{x} - \Phi b)/\rho + b^T \Lambda^{-1} b\big]\Big)$  (13)
Let $\partial(\ln P(b \mid y)) / \partial b = 0$; we get
$b_j = \big(\lambda_j / (\lambda_j + \rho)\big)\, \phi_j^T (y - \bar{x})$ .  (14)
Combining (4), we obtain:
$b_j = \big(\lambda_j / (\lambda_j + \rho)\big)\, b_j$ .  (15)
It is obvious that the value of $b_j$ becomes smaller after the update of (15) (since $\rho \ge 0$). This slows down the search. Hence, a compensation factor $p_1$ is introduced to make shape variation along the eigenvectors corresponding to large eigenvalues more aggressive (see formula (18)). If $p_1$ is equal to $(\lambda_{\max} + \rho)/\lambda_{\max}$, we get
$b_j = \big((\lambda_{\max} + \rho)/\lambda_{\max}\big) \big(\lambda_j/(\lambda_j + \rho)\big)\, b_j$ .  (16)
Formula (16) shows that a parameter $b_j$ corresponding to a larger eigenvalue receives only a small penalty, while a parameter $b_j$ corresponding to a small eigenvalue becomes smaller after updating. Moreover, we expect a smooth shape contour and neglect details in the first several iterations. A smooth factor $p_2$ (see formula (18)) is introduced to further penalize the parameter $b_j$. Note that $\rho$ is smaller than the largest eigenvalue and will keep decreasing. The factor $p_2$ regularizes the parameters by enlarging the penalty. As in Fig. 2, the shape contour reconstructed by Bayesian inference is smoother than the one reconstructed by PCA in the regions pointed to by the black arrows. Although the PCA reconstruction can remove some noise, the reconstructed shape is still unstable when the image is noisy. Formula (18) makes the parameter estimation more robust to local noise.
Fig. 2. Shapes reconstructed from PCA and Bayesian Inference. Left shape is mean shape after desired movements; middle shape is reconstructed by PCA; right shape is reconstructed by Bayesian Inference. The black arrows highlight the regions to be compared.
3.3 Optimal Features
Recently, optimal features have been applied in ASMs and have drawn more and more attention [5,6,7]. Experimental results show that optimal features can make shape segmentation more accurate. But a main drawback of optimal-feature methods is that the ASM needs more time to find the desired landmarks, because optimal features must be extracted in each iteration. An efficient speed-up strategy is to select one subset of the available features for all landmarks [6,7]. It is clear that textures around different landmarks are different, and it is impossible for a single subset of optimal features to describe the various textures around all the landmarks. In GCASM, the pose parameters of scale, rotation and translation can be calculated from the fixed shape. All landmarks can be categorized into several groups, for each of which we select the same discriminative features. When searching a shape, the image is divided into several areas according to these categories. For each area, the same optimal features are extracted to determine the movement. The optimal features are those reported in both papers [6] and [7]. Fig. 3 shows the classification results for each landmark. The mean classification accuracy is 76.67%. We can see that landmarks near the jaw and the two ears have low classification accuracy, and landmarks near the cheek have high classification accuracy. Considering this classification error, we introduce Bayesian inference and the $\alpha_i$ of formula (12) to make the shape estimation more robust.
Fig. 3. Classification results for each cheek landmark. Classification accuracy stands for a classifier's ability to classify whether a point near the landmark is inside or outside of the shape. The points around indices 4 and 22 are close to the ears, and the points around index 13 are close to the jaw.
3.4 Coarse-to-Fine Shape Search
During image search, the main differences between GCASM and ASM are twofold. One is that, since the pose parameters of GCASM have been calculated from the fixed shape, we need not consider pose variation during the iterative updating procedure. The other is that the fixed shape is predefined accurately in GCASM. After reconstruction from the shape parameters, noise will make the reconstructed fixed shape drift away from the given fixed shape. Because the fixed shape is supposed to be accurate, it should be realigned to the initial points. The iterative updating procedures of GCASM and ASM are shown in Fig. 4. We use formula (17) to calculate the shape parameter $b = [b_1, \ldots, b_k]^T$ and normalize $b$ by formula (18).
$b_j = \phi_j^T (y - \bar{x})$ .  (17)
$b_j = \big(p_1 \lambda_j / (\lambda_j + p_2 \rho)\big)\, b_j$ .  (18)
Where $1 \le p_1 \le (\lambda_{\max} + p_2\rho)/\lambda_{\max}$ and $p_2 \ge 1$. We call the parameter $p_1$ the compensation factor, which makes shape variation more aggressive. The parameter $p_2$ is a smooth factor, which penalizes the shape parameters when the shape has a large variation. The compensation factor and the smooth factor put more emphasis on shape parameters corresponding to large eigenvalues. This adjusts a shape along the major PCs and neglects the shape's local detail in the initial iterations. When the algorithm converges ($\rho \to 0$), $p_1\lambda_j/(\lambda_j + p_2\rho)$ is equal to 1. Hence, the compensation factor and the smooth factor lead to a coarse-to-fine shape search. Here, we simply set $p_1 = (\lambda_{\max} + p_2\rho)/\lambda_{\max}$, $\alpha_i = 1$, and $p_2 = 4$. Obviously, formula (18) can also be used in ASMs to normalize the shape parameters.
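One parameter-update step of the coarse-to-fine search, following (12), (17) and (18), can be sketched as follows. This is a simplified illustration; the function name is an assumption, and $\alpha_i = 1$ and $p_2 = 4$ are taken from the settings above.

```python
import numpy as np

def update_parameters(y, y_old, x_bar, phi, lam, p2=4.0, alpha=None):
    """One coarse-to-fine parameter update, following formulas (12), (17) and (18)."""
    if alpha is None:
        alpha = np.ones(len(y) // 2)                       # alpha_i = 1, as set above
    d2 = ((y - y_old).reshape(-1, 2) ** 2).sum(axis=1)     # squared per-landmark displacements
    rho = float(np.dot(alpha, d2))                         # formula (12)
    p1 = (lam.max() + p2 * rho) / lam.max()                # compensation factor at its upper bound
    b = phi.T @ (y - x_bar)                                # formula (17)
    b = (p1 * lam / (lam + p2 * rho)) * b                  # formula (18): coarse-to-fine shrinkage
    return x_bar + phi @ b, rho
```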
Fig. 4. Updating rules of ASM and GCASM. The left block diagram is the basic ASM’s updating rule and the right block diagram is GCASM updating rule.
4 Experiments In this section, our proposed method is tested in two experiments: cheek contour search and facial contour search. A total of 100 face images are randomly taken from the XM2VTS face database [12]. Each image is aligned by the coordinates of the two eyes. The average distance between the two eyes is 80 pixels. The three points of the fixed shape, including the two eye centers and the mouth center, are manually labeled. The fixed shape takes the shape of the letter 'T'. Hamarneh's ASM source code [13] is taken as the standard ASM without modification. Optimal features are collected from the features reported in both papers [7] and [8]. The number of optimal features is reduced by sequential feature
selection [14]. In this work, all the points near the landmarks are classified by linear regression to predict whether they lie inside or outside of a shape. 4.1 Experiments on Cheek Contour
A designed task to directly search a cheek contour without eyes, brows, mouth, and nose is presented to validate our method. A total of 25 cheek landmarks are labeled manually on each image. The PCA thresholds are set to 99% for all ASMs. The fixed shape is composed of the points between the two eyes and the mouth. As seen in Fig. 3, it is difficult to locate landmarks near the ears and the jaw. When a contour shape is simple and the textures around landmarks are complex, the whole shape will be dragged away from the right position if there are several inaccurate points. It is clear that the cheek shape can be accurately located with the help of the fixed shape.
Fig. 5. Comparison of different algorithms’ cheek searching results: Shapes in first column are results of ASM searching; Shapes in second column are results of simple OF-ASM; Shapes in third column are results of the basic GCASM; Shapes in fourth column are results of GCASM with optimal features; Shapes in fifth column are results of GCASM with optimal features and Bayesian inference
As in Fig. 5, the first two columns are the search results of ASM and OF-ASM. It is clear that the search results miss the desired position because of local noise; several inaccurate landmarks drag the shape away from the desired position. It also illustrates that optimal features can model the contour appearance more accurately. As illustrated in the last three columns of Fig. 5, the search results are well trapped in a local area when the fixed shape is introduced. Because the fixed shape is accurate and without noise, the reconstructed shape falls into a local area around the fixed shape even if some landmarks are inaccurate. Every landmark finds a locally best matched point instead of a global one. Comparing the third and fourth columns, we can see that optimal features can locate landmarks more accurately. But optimal
features couldn’t keep local contour detail very well. There is still some noise in searching results. Looking at the fifth column of Fig.5, it is clear that borders of the shapes become smoother. The Bayesian inference can further improve the accuracy. 4.2 Experiments on Facial Contour
A total of 96 face landmarks are labeled manually on each image. The PCA thresholds are set to 95% for all ASMs. Three landmarks are inserted between the two eyes to represent the horizontal connecting line, and three landmarks are inserted between the mouth and the horizontal line to represent the vertical line. For the sake of simplicity, optimal features are not used in this subsection. The results are shown in Table 1. Table 1. Comparison results of traditional ASM and our method without optimal features
                ASM    Our algorithm   Improvement
Face            7.74   4.68            39.5%
F.S.O.          6.45   4.41            31.6%
Cheek Contour   11.4   5.47            52.0%
Here F.S.O. means the five sense organs. Location error is measured in pixels. It is clear that our algorithm is much more accurate than ASM.
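For reference, the location error in Table 1 can be read as a mean point-to-point distance in pixels. The following minimal sketch shows how such an error could be computed with NumPy; the exact error definition is our assumption, not the authors' code.

```python
import numpy as np

def mean_location_error(fitted, ground_truth):
    # fitted, ground_truth: (N, 2) arrays of (x, y) landmark coordinates.
    # Returns the mean Euclidean distance between corresponding landmarks, in pixels.
    return float(np.mean(np.linalg.norm(fitted - ground_truth, axis=1)))

# Hypothetical usage: average the per-image error over the test set.
# errors = [mean_location_error(f, g) for f, g in zip(results, annotations)]
# print(np.mean(errors))
```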
Fig. 6. Comparison of ASM and GCASM with Bayesian inference. The first row shows ASM results, and the second row shows our results.
Fig. 6 shows a set of search results of the basic ASM and of GCASM with Bayesian inference. In these cases, there are wrinkles and shading on the facial contour or other facial sub-parts. It is clear that our method can recover the shape from local noise. A direct reason is that the shape variation is restricted to a local area when accurate information is combined into the ASM. The Bayesian inference constrains the whole shape and smooths the shape border.
5 Conclusion
This work focuses on an interesting topic: how to combine accurate information given by a user or a machine to further improve shape alignment accuracy. The PDM is extended by adding a fixed shape generated from the given information. After PCA reconstruction, local noise in the active shape makes the whole shape unsmooth. Hence, Bayesian inference is proposed to further normalize the parameters of the extended PDM. Both the compensation factor and the smoothing factor lead to a coarse-to-fine shape adjustment. Comparisons between our algorithm and the ASM algorithms demonstrate its effectiveness and efficiency.
Acknowledgements This work was supported by the following funding resources: National Natural Science Foundation Project #60518002, National Science and Technology Supporting Platform Project #2006BAK08B06, National 863 Program Projects #2006AA01Z192 and #2006AA01Z193, Chinese Academy of Sciences 100 people project, and the Authen-Metric Collaboration Foundation.
References
1. Cootes, T.F., Taylor, C.J., Cooper, D., Graham, J.: Active shape models - their training and application. Comput. Vis. Image Understanding 61(1), 38–59 (1995)
2. Cootes, T.F., Taylor, C.J.: Statistical models of appearance for computer vision. Wolfson Image Anal. Unit, Univ. Manchester, Manchester, U.K., Tech. Rep. (1999)
3. Zhou, Y., Gu, L., Zhang, H.-J.: Bayesian tangent shape model: Estimating shape and pose parameters via Bayesian inference. In: IEEE Conf. on Computer Vision and Pattern Recognition, Madison, WI (June 2003)
4. Liang, L., Wen, F., Xu, Y.Q., Tang, X., Shum, H.Y.: Accurate face alignment using shape constrained Markov network. In: Proc. CVPR (2006)
5. Li, Y.Z., Ito, W.: Shape parameter optimization for Adaboosted active shape model. In: ICCV, pp. 259–265 (2005)
6. Brox, T., Rosenhahn, B., Weickert, J.: Three-dimensional shape knowledge for joint image segmentation and pose estimation. In: Kropatsch, W.G., Sablatnig, R., Hanbury, A. (eds.) Pattern Recognition. LNCS, vol. 3663, pp. 109–116. Springer, Heidelberg (2005)
7. van Ginneken, B., Frangi, A.F., Staal, J.J., ter Haar Romeny, B.M., Viergever, M.A.: Active shape model segmentation with optimal features. IEEE Transactions on Medical Imaging 21(8), 924–933 (2002)
8. Sukno, F., Ordas, S., Butakoff, C., Cruz, S., Frangi, A.F.: Active shape models with invariant optimal features: IOF-ASMs. In: Kanade, T., Jain, A., Ratha, N.K. (eds.) AVBPA 2005. LNCS, vol. 3546, pp. 365–375. Springer, Heidelberg (2005)
9. Zhang, S., Wu, L.F., Wang, Y.: Cascade MR-ASM for locating facial feature points. In: The 2nd International Conference on Biometrics (2007)
10. Dryden, I., Mardia, K.V.: The Statistical Analysis of Shape. Wiley, London, U.K. (1998)
11. Goodall, C.: Procrustes methods in the statistical analysis of shapes. J. Roy. Statist. Soc. B 53(2), 285–339 (1991)
12. Messer, K., Matas, J., Kittler, J., Luettin, J., Maitre, G.: XM2VTSDB: The extended M2VTS database. In: Proc. AVBPA, pp. 72–77 (1999)
13. Hamarneh, G.: Active Shape Models with Multi-resolution, http://www.cs.sfu.ca/~hamarneh/software/asm/index.html
14. Kudo, M., Sklansky, J.: Comparison of algorithms that select features for pattern classifiers. Pattern Recognition 33(1), 25–41 (2000)
Efficient Texture Representation Using Multi-scale Regions
Horst Wildenauer1, Branislav Mičušík1,2, and Markus Vincze1
1 Automation and Control Institute, Vienna University of Technology, Austria
2 Institute of Computer Aided Automation, PRIP Group, Vienna University of Technology, Austria
Abstract. This paper introduces an efficient way of representing textures using connected regions which are formed by coherent multi-scale over-segmentations. We show that the recently introduced covariance-based similarity measure, initially applied on rectangular windows, can be used with our newly devised, irregular structure-coherent patches; increasing the discriminative power and consistency of the texture representation. Furthermore, by treating texture in multiple scales, we allow for an implicit encoding of the spatial and statistical texture properties which are persistent across scale. The meaningfulness and efficiency of the covariance-based texture representation is verified utilizing a simple binary segmentation method based on min-cut. Our experiments show that the proposed method, despite the low dimensional representation in use, is able to effectively discriminate textures and that its performance compares favorably with the state of the art.
1 Introduction
Textures and structured patterns are important cues towards image understanding, pattern classification and object recognition. The analysis of texture properties and their mathematical and statistical representation has been attracting the interest of researchers for many years, with the primary goal of finding low-dimensional and expressive representations that allow for reliable handling and classification of texture patterns. Texture representations, which have been successfully applied to image segmentation tasks, include steerable filter responses [1], color changes in a pixel's neighborhood [2], covariance matrices of gradients, color, and pixel coordinates [3], Gaussian Mixture Models (GMM) computed from color channels [4,5], color histograms [6], or multi-scale densities [7,8]. Since textures "live" at several scales, a scale-dependent discriminative treatment should be aimed for. In this paper, we explore the possibility of refining coarse texture segmentation by matching textures between adjacent scales, taking
The research has been supported by the Austrian Science Foundation (FWF) under the grant S9101, and the European Union projects MOVEMENT (IST-2003-511670), Robots@home (IST-045350), and MUSCLE (FP6-507752).
Fig. 1. Segmentation results using the min-cut algorithm [12] with different texture representations. (a) Input image with user specified foreground and background markers. (b) The proposed multi-scale texture representation. (c) Color Histograms [6]. (d) GrabCut [4] using GMMs. (e) Color changes in the pixel neighbourhoods [2].
advantage of spatial and statistical properties which persist across scale. We show that texture segments can be efficiently treated in a multi-scale hierarchy similarly to [8], but building on superpixels. In our approach, textures are represented by covariance matrices, for which an effective similarity measure based on the symmetric generalized eigenproblem was introduced in [9]. In contrast to the rectangular windows used in [3], covariance matrices are computed from irregular structure-coherent patches, found at different scales. In order to allow for an efficient image partitioning into scale-coherent regions we devised a novel superpixel method, utilizing watershed segmentations imposed by extrema of an image's mean curvature. However, the suggested framework of multi-scale texture representation is generally applicable to other superpixel methods, such as [10,11], depending on the accuracy and time complexity constraints imposed by the application domain. We verify the feasibility and meaningfulness of the multi-scale covariance-based texture representation by a binary segmentation method based on the min-cut algorithm [12]. Figure 1 shows an example of how different types of texture descriptors influence the min-cut segmentation of a particularly challenging image, consisting of textured regions with highly similar color characteristics. The remainder of the paper is organized as follows. We present the details of the proposed method in Section 2. Section 3 reports experimental results and compares them to the results obtained using state-of-the-art methods. The paper is concluded with a discussion in Section 4.
2 Our Approach
2.1 Superpixels
Probably one of the most commonly used blob detectors is based on the properties of the Laplacian of Gaussians (LoG) or its approximation, the Difference of Gaussians (DoG) [13]. Given a Scale-Space representation L(t), obtained by repeatedly convolving an input image with Gaussians of increasing size t, the shape of the intensity surface around a point x at scale t can be described using the Hessian matrix
H(x, t) = \begin{pmatrix} L_{xx}(x,t) & L_{xy}(x,t) \\ L_{xy}(x,t) & L_{yy}(x,t) \end{pmatrix}.    (1)
The LoG corresponds to the trace of the Hessian:
\nabla^2 L(x,t) = L_{xx}(x,t) + L_{yy}(x,t),    (2)
and equals the mean intensity curvature multiplied by two. The LoG computation results in strong negative or positive responses for bright and dark blob-like structures of size √t, respectively. Using this, the position and characteristic scale of blobs can be found by detecting Scale-Space extrema of scale-normalized LoG responses [14]. In our approach, we do not directly search for blob positions and scales, but rather use spatial response extrema as starting points for a watershed-based over-segmentation of an image's mean curvature surface. Specifically, we proceed as follows (a minimal code sketch of this procedure is given at the end of this subsection):
1. Computation of LoG responses at scales √t = 2^{m/3}, with m = 1 . . . M, where M denotes a predefined number of scales; i.e., we calculate 3 scale levels per Scale-Space octave.
2. Watershed segmentation:
   (a) Detection of spatial response extrema at all scales. Extrema with low contrast, i.e., those with a minimum absolute difference to adjacent pixels smaller than a predefined threshold, are discarded.
   (b) At each scale, segment the image into regions assigned to positive or negative mean curvature. This is achieved by applying the watershed to the negative absolute Laplacian −|∇²L(x, t)| using the seeds from (a).
The majority of the watersheds thus obtained follow the zero-crossings of the Laplacian, i.e., the edges where the mean curvature of the intensity surface changes its sign. However, for irregularly shaped blobs, which exhibit significant variations in mean curvature, several seed points are usually detected. This results in an over-segmentation of regions with otherwise consistent curvature signs. Figure 2 shows a direct comparison of the superpixels produced by our method at a single scale and the normalized-cut based superpixels suggested in [10]. Another method for image over-segmentation, which is partially favoured for its speed, utilizes the Minimum Spanning Tree [11]. However, for larger superpixels, which are needed to stably compute the covariance-based descriptor
Fig. 2. Left: Superpixels obtained by the proposed method. Right: Superpixels obtained by the method of Ren et al. [10].
on, the regions obtained by this method are highly irregular and often do not align well with object boundaries. Figure 3 shows the effect of using different superpixels in conjunction with the method proposed in this paper. As one can see, our method gives acceptable results compared to the normalized-cut based approach, which needs more than 100 times longer to compute the segmentation. The outlined approach is similar in spirit to the watershed segmentation of principal curvature images proposed by Deng et al. [15]. In their approach, the first principal curvature image (i.e., the image of the larger eigenvalue of the Hessian matrix) is thresholded near zero and either the positive or the negative remainder is flooded starting from the resulting zero-valued basins. Hence, as opposed to our method, the watersheds follow the ridges of the image's principal curvature surface. In experiments we found that this approach was not suitable for our purposes since it tends to under-segment images, aggressively merging regions with same-signed principal curvature.
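The following minimal sketch, assuming SciPy and scikit-image, illustrates one way the per-scale seeded watershed described above might be implemented; the seed-selection details (minimum distance, contrast threshold) are our assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def multiscale_superpixels(image, num_scales=6, contrast_thresh=0.02):
    # Returns one superpixel label image per scale.
    labels_per_scale = []
    for m in range(1, num_scales + 1):
        sigma = 2.0 ** (m / 3.0)                 # sqrt(t) = 2^{m/3}
        # Scale-normalized LoG response (trace of the Hessian).
        log = sigma ** 2 * ndi.gaussian_laplace(image.astype(float), sigma)
        # Seeds: spatial extrema of the LoG response with sufficient contrast.
        maxima = peak_local_max(np.abs(log), min_distance=3,
                                threshold_abs=contrast_thresh)
        seeds = np.zeros(image.shape, dtype=int)
        seeds[tuple(maxima.T)] = np.arange(1, len(maxima) + 1)
        # Flood the negative absolute Laplacian from the seeds, so the
        # watershed lines tend to follow the zero-crossings of the LoG.
        labels_per_scale.append(watershed(-np.abs(log), markers=seeds))
    return labels_per_scale
```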
2.2 Covariance-Based Texture Similarity
Recently, Tuzel et al. [3] have introduced region covariance matrices as potent, low-dimensional image patch descriptors, suitable for object recognition and texture classification. One of the authors' main contributions was the introduction of an integral-image like preprocessing stage, allowing for the computation of covariances from image features of an arbitrarily sized rectangular window in constant time. However, since in the presented work covariances are directly obtained from irregularly shaped superpixels, the aforementioned fast covariance computation is not applicable. We proceed to give a brief description of the covariance-based texture descriptor in use. The sample covariance matrix of the feature vectors collected inside a superpixel is given by
M = \frac{1}{N-1} \sum_{n=1}^{N} (z_n - \mu)(z_n - \mu)^\top,    (3)
Fig. 3. Effect of superpixels on the final image segmentation. From left to right: Superpixels through a color-based Minimum Spanning Tree [11]. The proposed approach. Superpixels based on normalized-cuts using combined color and textured gradient [10].
where μ denotes the sample mean, and {z_n}_{n=1...N} are the d-dimensional feature vectors extracted at N pixel positions. In our approach, these feature vectors are composed of the values of the RGB color channels R, G, and B and the absolute values of the first derivatives of the intensity I at the n-th pixel:
z_n = \Big[ R_n,\; G_n,\; B_n,\; \Big|\frac{\partial I}{\partial x}\Big|,\; \Big|\frac{\partial I}{\partial y}\Big| \Big]^\top.    (4)
The resulting 5×5 covariance matrix gives a very compact texture representation with the additional advantage of exhibiting a certain insensitivity to illumination changes and, as will be shown experimentally, offers sufficient discriminative power for the segmentation task described in the remainder of the paper. To measure the similarity ρ(M_i, M_j) of two covariance matrices M_i and M_j we utilize the distance metric initially proposed by Förstner [9]:
\rho(M_i, M_j) = \sqrt{\sum_{k=1}^{d} \ln^2 \lambda_k(M_i, M_j)},    (5)
where the {λ_k}_{k=1...d} are the eigenvalues obtained by solving the generalized eigenvalue problem
M_i e_k = \lambda_k M_j e_k, \quad k = 1 \ldots d,    (6)
with e_k ≠ 0 denoting the generalized eigenvectors. The cost of computing ρ is on the order of O(d³) flops which, due to the low dimensionality of the representation, leads to speed advantages compared to histogram matching methods. For a detailed discussion of the topic, other choices of feature combinations, as well as the useful properties of region covariances, see [3].
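As a concrete illustration of Eqs. (3)-(5), the following sketch (assuming NumPy and SciPy) computes the 5×5 region covariance of a superpixel and the Förstner distance between two such matrices; the choice of gradient filter and of the RGB mean as the intensity I are our assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.linalg import eigh

def region_covariance(rgb, mask):
    # Eq. (3)/(4): covariance of [R, G, B, |dI/dx|, |dI/dy|] over a superpixel.
    # rgb: (H, W, 3) float image; mask: boolean superpixel mask.
    intensity = rgb.mean(axis=2)               # assumption: I = mean of R, G, B
    gy, gx = np.gradient(intensity)            # simple finite differences
    feats = np.stack([rgb[..., 0], rgb[..., 1], rgb[..., 2],
                      np.abs(gx), np.abs(gy)], axis=-1)
    z = feats[mask]                            # (N, 5) feature vectors z_n
    return np.cov(z, rowvar=False)             # unbiased 1/(N-1) normalization

def covariance_distance(Mi, Mj):
    # Eq. (5): Foerstner metric from the generalized eigenvalues M_i e = lambda M_j e.
    lam = eigh(Mi, Mj, eigvals_only=True)
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))
```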
2.3 Foreground and Background Codebooks
From the covariance-based descriptors proposed in Subsection 2.2 we compute representative codebooks for foreground and background regions. These are used later on to drive the image segmentation.
Fig. 4. Two-layer MRF with superpixels detected at two different scales. To avoid clutter, not all superpixels are connected to the sink/source nodes.
For the foreground and background codebooks, we require user-specified markers, as shown in Figure 1(a), computing the covariance matrices M_i^F and M_i^B from all points under the marker regions. Usually, the background contains more textures and cluttered areas, requiring more seeds to be established. Moreover, in applications like object detection or recognition, the background can vary significantly across images while objects of interest usually remain quite consistent in appearance. To somewhat alleviate the burden of manually selecting many seeds, we propose to avoid the need for background markers by following a simple strategy: We take a rim at the boundary of the image and feed all superpixels under the rim into a hierarchical clustering method with a predefined stopping distance threshold, with the distance between superpixels given by Equation 5. After clustering we take the K most occupied clusters and compute the mean covariance matrix for each cluster out of all covariance matrices belonging to the cluster. For efficiency reasons, we do not calculate the mean covariance matrix by pooling over all participating feature vectors, but use the method described in [16,3], which has its roots in formulations of symmetric positive definite matrices lying on connected Riemannian manifolds. Using this procedure, we arrive at the background codebook matrices M_i^B. Of course, the applicability of this ad-hoc technique is limited in cases where the object of interest touches the boundary, or when the rim is not representative enough. However, in most cases the approach led to background codebooks with sufficient explanatory power for a successful segmentation.
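To make the rim-based codebook construction concrete, here is a rough sketch using SciPy's hierarchical clustering; the stopping threshold, K, and the Euclidean averaging of cluster members (a crude stand-in for the Riemannian mean of [16,3]) are our assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.linalg import eigh
from scipy.spatial.distance import squareform

def covariance_distance(Mi, Mj):               # Eq. (5), as in the sketch above
    lam = eigh(Mi, Mj, eigvals_only=True)
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))

def background_codebook(rim_covs, stop_dist=0.5, K=4):
    # rim_covs: covariance matrices of the superpixels under the boundary rim.
    n = len(rim_covs)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = covariance_distance(rim_covs[i], rim_covs[j])
    labels = fcluster(linkage(squareform(dist), method='average'),
                      t=stop_dist, criterion='distance')
    ids, counts = np.unique(labels, return_counts=True)
    codebook = []
    for cid in ids[np.argsort(-counts)][:K]:   # the K most occupied clusters
        members = [rim_covs[i] for i in range(n) if labels[i] == cid]
        codebook.append(np.mean(members, axis=0))
    return codebook
```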
2.4 Multi-scale Graph-Cut
In order to verify the validity of the covariance-based texture representation, taking into account the superpixel behaviour across different scales, we adopted a binary segmentation method based on the min-cut algorithm [12]. Suppose that the image at a certain scale t is represented by a graph G_t = ⟨V_t, E_t⟩, where V_t is the set of all vertices representing superpixels, and E_t is the set of all intrascale edges connecting spatially adjacent vertices. To capture the Scale-Space behaviour we connect the graphs by interscale edges forming a set of edges S. We form the entire graph G = ⟨V, E, S⟩ consisting of the union of
all vertices V_t, and all intrascale E_t and interscale edges S. For more clarity, the resulting graph structure is depicted in Figure 4. The binary segmentation of the graph G is achieved by finding the minimum cut [12], minimizing the Gibbs energy
E(x) = \sum_{i \in V} E_{data}(x_i, M_i) + \lambda \sum_{(i,j) \in E} \delta(x_i, x_j)\, E_{sm\_im}(M_i, M_j) + \gamma \sum_{(i,j) \in S} \delta(x_i, x_j)\, E_{sm\_sc}(M_i, M_j),    (7)
where x = [x_0, x_1, ...] corresponds to a vector with a label x_i for each vertex. We concentrate on a bi-layer segmentation where the label x_i is either 0 (background) or 1 (foreground). M_i corresponds to the measurement at the i-th graph vertex, i.e., to the covariance matrix of a given superpixel. The weight constants λ, γ control the influence of the image (intrascale) and interscale smoothness terms, respectively; δ denotes the Kronecker delta.
The data term describes how likely the superpixel is foreground or background. The data term for the foreground is defined as
E_{data}(x_i = 1, M_i) = \frac{l(M_i, F)}{l(M_i, F) + l(M_i, B)},    (8)
where l(M_i, F) = \exp\big(-\min_{k=1...|F|} \rho(M_i, M_k^F)/(2\sigma_1^2)\big) stands for the foreground likelihood of superpixel i. M_k^F denotes the k-th covariance matrix from the foreground codebook set F, and σ_1 is an experimentally determined parameter. As the derivation of the background terms and likelihoods follows analogously, we omit its description.
The smoothness term describes how strongly neighboring pixels are bound together. There are two types of smoothness terms, see Equation (7): one for intrascale neighborhoods, E_{sm\_im}, and one for interscale neighborhoods, E_{sm\_sc}. The intrascale smoothness term, using α-blending, is defined as
E_{sm\_im}(M_i, M_j) = \alpha \exp\big(-\rho(M_i, M_j)/(2\sigma_2^2)\big) + (1-\alpha) \exp\big(-(l(M_i, F) - l(M_j, F))^2/(2\sigma_3^2)\big),    (9)
where σ_2 and σ_3 are pre-defined parameters. The interscale smoothness term is only defined for edges between two vertices from neighboring scales when the corresponding superpixels share at least one image pixel. The weight on the edge between superpixels i and j from consecutive scales is set to
E_{sm\_sc}(M_i, M_j) = \beta \frac{\mathrm{area}(i \cap j)}{\mathrm{area}(i)} + (1-\beta) \exp\big(-(l(M_i, F) - l(M_j, F))^2/(2\sigma_3^2)\big).    (10)
Fig. 5. Importance of inter-scale graph edges. From left to right: Only one lower scale used. Only one higher scale used. Three consecutive scales used.
The second term in both Equations (9) and (10) increases the dependency of the smoothness terms on the foreground likelihood, making them more robust, as originally suggested in [8]. However, we rely on this term only partially, through the interpolation parameters α, β, since a full dependency on the likelihood often resulted in compact, but otherwise incomplete segmentations. Figure 5 shows how the use of multiple scales and inter-scale edges improves the segmentation compared to segmentation performed separately for given scales.
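For concreteness, the data and smoothness terms of Eqs. (8)-(10) could be evaluated per vertex and edge roughly as follows; the σ, α, and β values are illustrative only, not those used in our experiments.

```python
import numpy as np
from scipy.linalg import eigh

def covariance_distance(Mi, Mj):               # Eq. (5)
    lam = eigh(Mi, Mj, eigvals_only=True)
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))

def likelihood(M, codebook, sigma1=1.0):       # l(M, F) or l(M, B) in Eq. (8)
    d = min(covariance_distance(M, Mk) for Mk in codebook)
    return np.exp(-d / (2.0 * sigma1 ** 2))

def data_term_fg(M, fg, bg):                   # E_data(x_i = 1, M_i), Eq. (8)
    lf, lb = likelihood(M, fg), likelihood(M, bg)
    return lf / (lf + lb)

def smooth_intrascale(Mi, Mj, fg, alpha=0.7, sigma2=1.0, sigma3=0.2):
    # E_sm_im of Eq. (9): appearance similarity blended with likelihood similarity.
    app = np.exp(-covariance_distance(Mi, Mj) / (2.0 * sigma2 ** 2))
    lik = np.exp(-(likelihood(Mi, fg) - likelihood(Mj, fg)) ** 2 / (2.0 * sigma3 ** 2))
    return alpha * app + (1.0 - alpha) * lik

def smooth_interscale(Mi, Mj, overlap, area_i, fg, beta=0.5, sigma3=0.2):
    # E_sm_sc of Eq. (10): overlap ratio of the two superpixels plus likelihood term.
    lik = np.exp(-(likelihood(Mi, fg) - likelihood(Mj, fg)) ** 2 / (2.0 * sigma3 ** 2))
    return beta * overlap / float(area_i) + (1.0 - beta) * lik
```

These weights would then be handed to a standard min-cut/max-flow solver [12] over the two-layer graph of Figure 4.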
3 Experimental Results
We performed segmentation tests on images from the Berkeley dataset (http://www.cs.berkeley.edu/projects/vision/grouping/segbench). We compare the results to the recent approach proposed by Mičušík & Pajdla [2]. Their method looks at color changes in the pixel neighbourhood, yielding superior results on textured images compared to other methods. For both methods the same manually established foreground and background markers were used. To guarantee a fair comparison, the automatic background codebook creation proposed in Section 2.3 was omitted. We present some results where our proposed method performs better than or comparably to [2]. These images typically contain textures with similar colors and are, as stated in [2], the most crucial for their texture descriptor. One must realize that covariance-based texture description cannot cope reliably with homogeneous color regions; see the missing roof of the hut in Figure 6. This should be kept in mind, and such a descriptor should be used complementarily with some color features. Overall, as the experiments show, the newly proposed technique performs very well on textures. The advantage over methods such as [6,4,2] is computational efficiency. Moreover, using more accurate superpixels, e.g. [10], improves the accuracy of the result at the price of higher time consumption.
4 Summary and Conclusion
We present an efficient way of representing textures using connected regions, formed by coherent multi-scale over-segmentations. We show the favourable
Fig. 6. Segmentation comparison. (a) Input image with user marked seeds. (b) The method from [2]. (c) Our approach.
performance on the segmentation of textured images. However, our primary goal is not to segment images accurately, but to demonstrate the feasibility of the covariance-matrix-based descriptor used in a multi-scale hierarchy built on superpixels. The method is aimed at further use in recognition and image understanding systems where highly accurate segmentation is not required.
References
1. Malik, J., Belongie, S., Leung, T., Shi, J.: Contour and texture analysis for image segmentation. IJCV 43(1), 7–27 (2001)
2. Mičušík, B., Pajdla, T.: Multi-label image segmentation via max-sum solver. In: Proc. CVPR (2007)
3. Tuzel, O., Porikli, F., Meer, P.: Region covariance: A fast descriptor for detection and classification. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 589–600. Springer, Heidelberg (2006)
4. Rother, C., Kolmogorov, V., Blake, A.: "GrabCut": interactive foreground extraction using iterated graph cuts. In: Proc. ACM SIGGRAPH, pp. 309–314. ACM Press, New York (2004)
5. Kolmogorov, V., Criminisi, A., Blake, A., Cross, G., Rother, C.: Probabilistic fusion of stereo with color and contrast for bi-layer segmentation. PAMI 28(9), 1480–1492 (2006)
6. Boykov, Y., Jolly, M.P.: Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In: Proc. ICCV, pp. 105–112 (2001)
7. Hadjidemetriou, E., Grossberg, M., Nayar, S.K.: Multiresolution histograms and their use for recognition. PAMI 26(7), 831–847 (2004)
8. Turek, W., Freedman, D.: Multiscale modeling and constraints for max-flow/min-cut problems in computer vision. In: Proc. CVPR Workshop, vol. 180 (2006)
9. Förstner, W., Moonen, B.: A metric for covariance matrices. Technical report, Dept. of Geodesy and Geoinformatics, Stuttgart University (1999)
10. Ren, X., Malik, J.: Learning a classification model for segmentation. In: Proc. ICCV (2003)
11. Felzenszwalb, P., Huttenlocher, D.: Efficient graph-based image segmentation. IJCV 59(2), 167–181 (2004)
12. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. PAMI 26(9), 1124–1137 (2004)
13. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004)
14. Lindeberg, T.: Scale-Space Theory in Computer Vision. Kluwer Academic Publishers, Dordrecht (1994)
15. Deng, H., Zhang, W., Dietterich, T., Shapiro, L.: Principal curvature-based region detector for object recognition. In: Proc. CVPR (2007)
16. Pennec, X., Fillard, P., Ayache, N.: A Riemannian framework for tensor computing. International Journal of Computer Vision 66(1), 41–66 (2006)
Comparing Timoshenko Beam to Energy Beam for Fitting Noisy Data
Ilić Slobodan
Deutsche Telekom Laboratories, Berlin University of Technology, Ernst-Reuter Platz 7, 14199 Berlin, Germany
[email protected]
Abstract. In this paper we develop a highly flexible Timoshenko beam model for tracking large deformations in noisy data. We demonstrate that by neglecting some physical properties of the Timoshenko beam, a classical energy beam can be derived. A comparison of these two models in terms of their robustness and precision against noisy data is given. We demonstrate that the Timoshenko beam model is more robust and precise for tracking large deformations in the presence of clutter and partial occlusions. Experiments using both synthetic and real image data are performed. In synthetic images we fit both models to noisy data and use Monte Carlo simulation to analyze their performance. In real images we track deformations of a pole vault, rat whiskers, and a car antenna.
1 Introduction
In this paper, we develop a true physical 2D Timoshenko beam model and use it for tracking large deformations in noisy image data. The Timoshenko beam relies on shear deformation to account for non-linearities. We derive from it a physically based energy beam by neglecting shear deformation. The models which closely approximate real physics we call true physical models (in this case the Timoshenko beam), while the models which are designed to retain only some physical properties we call physically based models (in this case the energy beam). Physically based models, introduced almost twenty years ago [1,2,3,4], have demonstrated their effectiveness on Computer Vision problems. However, they typically rely on simplifying assumptions to yield easy-to-minimize energy functions and ignore the complex non-linearities that are inherent to the large deformations present in highly flexible structures. To justify the use of complex true physical models over simplified physically based models, we compare the Timoshenko beam to the energy beam. Both models were fitted to noisy synthetic data and real images in the presence of clutter and partial occlusions. We demonstrate that using the fully non-linear Timoshenko beam model, which approximates the physical behavior more closely, yields more robust and precise fitting to noisy data and tracking of large deformations in the presence of clutter and partial occlusions.
The rat whiskers images shown in this paper were obtained at Harvard University’s School of Engineering and Applied Sciences by Robert A. Jenks.
The fitting algorithm used for both beam representations is guided by the image forces. Since the image forces are proportional to the square distance of the model to the image data, they are usually not sufficient to deform the Timoshenko beam model of known material properties so that it fits the data immediately; the image forces only move the model toward the image observations. To fit the model to the data we repeat the Gauss-Newton minimization in several steps. We stop when the distance of the model to the data between two consecutive Gauss-Newton minimizations is smaller than some given precision. We use a Levenberg-Marquardt optimizer to fit the quadratic energy function of the energy beam. In this case the image forces are sufficient to deform the beam in a single minimization because the model energy, being only an approximation of the real model strain energy, does not impose realistic physical restrictions on the beam deformations. In the remainder of the paper we give a brief overview of the known physically based techniques, then introduce the non-linear planar Timoshenko beam model, derive the energy beam from it, describe our fitting algorithm, and finally present the results.
2 Related Work
Recovering model deformations from images requires deformable models to constrain the search space and make the problem tractable. In the last decade deformable models have been exploited in Computer Graphics [5], Medical Imaging [6] and Computer Vision [7]. There are several important categories of physically based models: mass-spring models [8], finite element models (FEM) [9,10], snake-like models [1,3,4] and models obtained from FEM by modal analysis to reduce the number of dofs [11,2,12,13]. In this paper we are particularly interested in physical models, especially those based on FEM. FEM are known to be precise and to produce physically valid deformations. However, because of their mathematical complexity, FEM were mainly developed for small linear deformations [14] where the model stiffness matrix is constant. In the case of large deformations the stiffness matrix and the applied forces become functions of the displacement. Such non-linear FEM were used by [15] to recover material parameters from images of highly elastic shell-like models. The model deformation was measured from a 3D model scan and then given to Finite Element Analysis (FEA) software. By contrast, we develop non-linear beam equations and recover the model deformations automatically through optimization. In computer vision, physically based models based on a continuous energy function have been used extensively. The original ones [1] were 2D and have been shown to be effective for 2D deformable surface registration [16]. They were soon adapted for 3D surface modeling purposes by using deformable superquadrics [3,4], triangulated surfaces [2], or thin-plate splines [6]. In this framework, modeling generic structures often requires many degrees of freedom that are coupled by regularization terms. In practice, this coupling implicitly reduces the number of degrees of freedom, which makes these models robust to noise and is one of the reasons for their immense popularity. In this paper we compare the complex true physical 2D Timoshenko beam model to the 2D elastic energy beam in terms of their robustness against noisy data. We reveal, in spite of their complexity, the real benefits of true physical models.
3 Plane Timoshenko Beam Model
The beam represents the most common structural component in civil and mechanical engineering. A beam is a rod-like structural element (one dimension is considerably larger than the other two) that resists transversal loading applied between its supports. The Timoshenko beam we develop assumes geometrically large deformations under small strains with linearly elastic materials. Timoshenko beam theory [17] accounts for nonlinear effects, such as shear, by assuming that the cross-section does not remain normal to the deformed longitudinal axis. The beam is divided into a number of finite elements. The beam deformation is defined by 3 dofs per node: the axial displacement u_X(X), the transverse displacement u_Y(X), and the cross-section rotation θ(X), where X is the longitudinal coordinate in the reference configuration, as shown in Fig. 1. The undeformed initial configuration is referred to as the reference configuration and the deformed one as the current configuration. The parameters describing the beam geometry and the material properties are the cross-sectional area A_0, the element length in the reference configuration L_0, the element length in the deformed configuration L, the second moment of inertia I_0, the Young modulus of elasticity E, and the shear modulus G. The material remains linearly elastic. The beam rotation is defined by the angle ψ, also equal to the rotation of the cross-section. The angle γ̄ is the shear angle by which the cross section deviates from its normal position defined by the angle ψ. The total rotation of a beam cross section becomes θ = ψ − γ̄, which is exactly one of the dofs defined above. To describe the beam kinematics we consider the motion of the point P_0(X, Y) in the reference configuration to the point P(x, y) in the current configuration. We keep the assumption that the cross section dimensions do not change and that the shear rotation
Fig. 1. (a) Plane Timoshenko beam kinematics notation. (b) Synthetic example of fitting the plane beam, initially aligned along the x-axis, to the synthetic image data shown as magenta dots. The intermediate steps shown in blue are the output of a number of repeated Gauss-Newton optimizations driven by the image forces.
is small, γ̄ ≪ 1, so that cos γ̄ ≈ 1. The Lagrangian representation of the motion relating the points P_0(X, Y) and P(x, y) is then given by x = X + u_X − Y sin θ, y = u_Y + Y cos θ. The displacement of any point on the beam element can then be represented by a vector w = [u_X(X), u_Y(X), θ(X)]^T. In the FEM formulation for a 2-node C^0 element it is natural to express the displacement and rotation functions of w as linear combinations of the node displacements,
u_X = \sum_{i=1}^{2} N_i u_{Xi}, \quad u_Y = \sum_{i=1}^{2} N_i u_{Yi}, \quad \theta = \sum_{i=1}^{2} N_i \theta_i,
or in matrix form w = N u, where N_1 = \frac{1}{2}(1 - \xi) and N_2 = \frac{1}{2}(1 + \xi) are the linear element shape functions. The strain is a measure of the change of the object shape, in this case the length, before and after the deformation caused by some applied load. The stress is the internal distribution of force per unit area that balances and reacts to the external loads applied to a body. We have three different strain components per beam element: e, the axial strain measuring the beam's relative extension; γ, the shear strain measuring the relative angular change between any two lines in a body before and after the deformation; and κ, measuring the curvature change. They can be computed from the deformation gradient of the motion
F = \begin{pmatrix} \partial x/\partial X & \partial x/\partial Y \\ \partial y/\partial X & \partial y/\partial Y \end{pmatrix}.
The Green-Lagrange (GL) strain tensor describing the model strain becomes e = \frac{1}{2}(F^T F - I). After the derivation, the only nonzero elements are the axial strain e_{XX} and the shear strain 2e_{XY} = e_{XY} + e_{YX}. Under the small strain assumption we can finally express the strain vector as
e = \begin{pmatrix} e_1 \\ e_2 \end{pmatrix} = \begin{pmatrix} e_{XX} \\ 2e_{XY} \end{pmatrix} = \begin{pmatrix} e - Y\kappa \\ \gamma \end{pmatrix},    (1)
where the three strain quantities introduced above are e, the axial strain; γ, the shear strain; and κ, the curvature. These can be collected in the generalized strain vector h^T = [e, γ, κ]. Because of the assumed linear variations in X of u_X(X), u_Y(X) and θ(X), e and γ depend on θ, and κ is constant over the element, depending only on the rotation angles at the element end nodes. e and γ can be expressed in a geometrically invariant form:
e = \frac{L}{L_0}\cos\bar\gamma - 1 = \frac{L}{L_0}\cos(\theta - \psi) - 1, \quad \gamma = \frac{L}{L_0}\sin\bar\gamma = \frac{L}{L_0}\sin(\theta - \psi).    (2)
These geometrically invariant strain quantities can be used for the beam in an arbitrary reference configuration. The variations δe, δγ and δκ with respect to the nodal displacement variations are required for the derivation of the strain-displacement relation δh = B δu. To form the strain-displacement matrix B we take the partial derivatives of e, γ and κ with respect to the node displacements and collect them into the matrix
B = \frac{1}{L_0}\begin{pmatrix} -c_\omega & -s_\omega & L_0 N_1 \gamma & c_\omega & s_\omega & L_0 N_2 \gamma \\ s_\omega & -c_\omega & -L_0 N_1 (1+e) & -s_\omega & c_\omega & -L_0 N_2 (1+e) \\ 0 & 0 & -1 & 0 & 0 & 1 \end{pmatrix},    (3)
where ω = θ + ψ, c_ω = cos ω and s_ω = sin ω. We introduce the pre-stress resultants N^0, V^0 and M^0, which define the axial forces, transverse shear forces and bending moments, respectively, in the reference configuration. We also define the stress resultants in the current configuration using the linear elastic equations as N = N^0 + EA_0 e, V = V^0 + GA_0 γ and M = M^0 + EI_0 κ, and collect them into the stress-resultant vector z = [N, V, M]^T.
The internal model strain energy along the beam, under zero pre-stress resultants N^0 = V^0 = M^0 = 0, can be expressed as the length integral
U = \frac{1}{2}\int_{L_0} z^T h \, dX = \frac{1}{2}\int_{L_0} EA_0 e^2 \, dX + \frac{1}{2}\int_{L_0} GA_0 \gamma^2 \, dX + \frac{1}{2}\int_{L_0} EI_0 \kappa^2 \, dX,    (4)
where L_0 is the beam length in the reference configuration. The internal force vector can be obtained by taking the first variation of the strain energy with respect to the nodal displacements:
p = \frac{\partial U}{\partial u} = \int_{L_0} B^T(u)\, z \, dX.    (5)
We evaluate this expression by reduced Gauss integration in order to eliminate shear locking, which overstiffens the model deformation and makes the shear energy dominate. In addition, we use the residual bending flexibility (RBF) correction and replace GA_0 in the shear energy of Eq. 4 by 12EI_0/L_0^2. Finally, the first variation of the internal force defines the tangent stiffness matrix:
K_T = \frac{\partial p}{\partial u} = \int_{L_0} \Big( B^T \frac{\partial z}{\partial u} + \frac{\partial B^T}{\partial u} z \Big) dX = K_M + K_G,    (6)
where K_M is the material stiffness and K_G is the geometric stiffness. The material stiffness is constant and identical to the linear stiffness matrix of the C^1 Euler-Bernoulli beam. The geometric stiffness comes from the variation of B while the stress resultants are kept fixed, and carries the beam nonlinearity responsible for large geometric deformations. A sketch of the element-level computation is given below.
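The following sketch, assuming NumPy, illustrates how the element strains of Eq. (2), the stress resultants, the B matrix of Eq. (3) and the internal force of Eq. (5) might be evaluated with one-point (reduced) Gauss integration. The choice of ψ as the chord rotation and the default material values are our assumptions, not the paper's implementation.

```python
import numpy as np

def element_internal_force(X1, X2, u, E=1e5, A0=1.0, I0=1.0):
    # X1, X2: node coordinates (2,) in the reference configuration.
    # u: nodal displacements [uX1, uY1, th1, uX2, uY2, th2].
    uX1, uY1, th1, uX2, uY2, th2 = u
    L0 = np.linalg.norm(X2 - X1)
    x1 = X1 + np.array([uX1, uY1])
    x2 = X2 + np.array([uX2, uY2])
    L = np.linalg.norm(x2 - x1)
    # Assumption: psi is the rotation of the element chord from reference to current.
    psi = np.arctan2(x2[1] - x1[1], x2[0] - x1[0]) - np.arctan2(X2[1] - X1[1], X2[0] - X1[0])
    theta = 0.5 * (th1 + th2)                   # cross-section rotation at xi = 0
    e = L / L0 * np.cos(theta - psi) - 1.0      # Eq. (2)
    gam = L / L0 * np.sin(theta - psi)
    kap = (th2 - th1) / L0
    GA0 = 12.0 * E * I0 / L0 ** 2               # residual bending flexibility correction
    z = np.array([E * A0 * e, GA0 * gam, E * I0 * kap])   # stress resultants [N, V, M]
    om = theta + psi
    c, s = np.cos(om), np.sin(om)
    N1 = N2 = 0.5                               # shape functions at the Gauss point xi = 0
    B = np.array([[-c, -s,  L0 * N1 * gam,       c,  s,  L0 * N2 * gam],
                  [ s, -c, -L0 * N1 * (1 + e),  -s,  c, -L0 * N2 * (1 + e)],
                  [ 0,  0, -1,                   0,  0,  1]]) / L0
    return L0 * (B.T @ z)                       # Eq. (5) with one Gauss point of weight L0
```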
4 Energy Beam Model
The model energy of the energy beam can be derived directly from the Timoshenko beam strain energy of Eq. 4. Let us neglect the shear deformation by setting the shear angle γ̄ to zero. The strain quantities of Eq. 2 become e = (L/L_0) cos γ̄ − 1 ≈ L/L_0 − 1 and γ = (L/L_0) sin γ̄ ≈ 0. In this way the shear energy is eliminated and only the axial energy and the bending energy are left. Also, since shear is eliminated, the rotational dof θ(X) disappears, and only the displacements in the X and Y directions are taken to form the new displacement vector w = [u_X, u_Y]. Since we deal with a discrete beam, its energy can be expressed as
U = \frac{1}{2} w_s \sum_{(i,j)\in 1..n} \Big( \frac{\|v_i - v_j\|}{L_0} - 1 \Big)^2 + \frac{1}{2} w_b \sum_{(i,j,k)\in 1..n} \| 2 v_j - v_i - v_k \|^2,    (7)
where i, j are pairs of element nodes, and i, j, k are triplets of element nodes necessary to define the curvature at the j-th beam node. The derived energy can be considered physically based since it comes directly from the realistic physical beam model. The weight coefficients w_s and w_b can be considered proportional to the Young modulus of elasticity E. However, we will show that, in practice, they significantly change the behavior of the fitting algorithm. A sketch of this discrete energy is given below.
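A minimal sketch of the discrete energy of Eq. (7), assuming NumPy; the node positions v_i form the beam polyline, and the weights are illustrative values taken from the range explored in Fig. 2(a).

```python
import numpy as np

def energy_beam_energy(v, L0, ws=1e4, wb=1e2):
    # v: (n, 2) array of node positions; L0: rest length of one segment.
    stretch = np.linalg.norm(np.diff(v, axis=0), axis=1) / L0 - 1.0   # (||v_i - v_j|| / L0 - 1)
    bend = 2.0 * v[1:-1] - v[:-2] - v[2:]                              # 2 v_j - v_i - v_k
    return 0.5 * ws * np.sum(stretch ** 2) + 0.5 * wb * np.sum(np.sum(bend ** 2, axis=1))
```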
5 Model Fitting
The general approach in mechanical simulations is that some external load f is applied to the model and the displacement u is computed. This can be done through energy minimization, where the total potential energy Π of the system is computed as the sum of the system's internal strain energy U and the energy caused by the external load, P. The minimum of the energy with respect to the displacement u can be found by differentiation:
\frac{\partial \Pi}{\partial u} = \frac{\partial U}{\partial u} + \frac{\partial P}{\partial u} \;\Rightarrow\; r(u) = p(u) + f = 0.    (8)
Here r(u) is the force residual, p(u) is the internal force of Eq. 5 and f is the external load. This is a nonlinear system of equations and is usually solved using the Newton-Raphson method. It is an incremental method; at each iteration we solve for the displacement update du by solving the linear system K_T du = −(f + p(u)). In classical mechanical simulation the external forces f are given a priori. In our case we do not know them. To compute the model displacements we solve Eq. 8 using image forces. We create a vector of image observations F(u) = [d_1(u), d_2(u), ..., d_N(u)]^T, where d_i(u) are the distances of the image observations from the beam segments. We use the edges in the image, obtained using the Canny edge detector, as our observations. In practice we sample every beam segment and then search for multiple observations in the direction of the beam normal. The external image energy becomes P_I = ½ F^T(u) F(u). The image forces are obtained as the derivative of this energy with respect to the displacement, f_I = ∇F^T(u) F(u). The force residual of Eq. 8 becomes r(u) = p(u) + f_I(u) = 0. We derive the displacement increment by developing the residual in a Taylor series around the current displacement u as follows:
(9)
we obtain Gauss-Newton optimization step. We neglect the second order term ∇2 FT F of Eq. 9. To make it more robust we use Tukey robust estimator ρ of [18]. This is simply done by weighting the residuals di (u) of the image observation vector F(u) at each Newton iteration: Each di (u) is replaced by di (u) = wi di (u) such that: (di )2 (u) = (wi )2 d2i (u), therefore the weight is chosen to be: wi = ρ(di (u))1/2 /di (u). We then create a weighting matrix W = diag(. . . , wi , . . .). We then solve in each step: (KT + ∇FT W∇F)du = −(∇FT WF + p(u))
(10)
By solving the Eq. 9 we compute the displacement caused by the image forces. Since the image forces are proportional to the square distance of the model to the image edge points, they are not sufficient to deform the model so that it fits the data. They only move the model toward the image observations. To obtain the exact fit of the model to the data we repeatedly fit the model to the data performing Gauss-Newton method in several steps. We stop when the distance of the model to the data of two consecutive Gauss-Newton minimizations becomes smaller then some given precision. The
optimization algorithm is illustrated on the synthetic example of Fig. 1(b). We obtain the total displacement u_T as the sum of all intermediate displacements. For fitting the energy beam we slightly modify Eq. 10 by adding λI to its left side, so that we obtain a Levenberg-Marquardt optimizer. In practice this turned out to be more suitable for optimizing the energy function of Eq. 7. The computational complexity of our method is quadratic and corresponds to the complexity of the Gauss-Newton minimization. A sketch of the outer fitting loop is given below.
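The overall fitting procedure might be sketched as follows, assuming NumPy. The callbacks residuals, jacobian, internal_force and tangent_stiffness stand in for the beam model and the Canny-edge distance search described above, the Tukey tuning constant c is a standard value not given in the paper, and damping > 0 corresponds to the λI modification used for the energy beam.

```python
import numpy as np

def tukey_weight(d, c=4.685):
    # w_i = rho(d_i)^(1/2) / d_i for the Tukey biweight rho.
    d = np.atleast_1d(np.abs(d))
    rho = np.where(d <= c, (c ** 2 / 6.0) * (1 - (1 - (d / c) ** 2) ** 3), c ** 2 / 6.0)
    return np.where(d > 1e-12, np.sqrt(rho) / d, 1.0)

def fit_beam(u0, residuals, jacobian, internal_force, tangent_stiffness,
             n_iter=50, tol=1e-4, damping=0.0):
    # Repeated robust Gauss-Newton / Levenberg-Marquardt steps of Eq. (10).
    u = u0.copy()
    prev = np.inf
    for _ in range(n_iter):
        F, J = residuals(u), jacobian(u)
        W = np.diag(tukey_weight(F))
        A = tangent_stiffness(u) + J.T @ W @ J + damping * np.eye(len(u))
        b = -(J.T @ W @ F + internal_force(u))
        u = u + np.linalg.solve(A, b)
        dist = float(np.mean(np.abs(residuals(u))))
        if abs(prev - dist) < tol:      # stop when the data distance no longer changes
            break
        prev = dist
    return u
```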
6 Results
We fit the Timoshenko beam and the energy beam to synthetic data and compare their performance in the presence of different amounts of added noise. We then run our experiments on real images in three different cases: the deformation of a car antenna, a pole vault, and the deformation of rat whiskers.
6.1 Fitting Synthetic Data
We generate synthetic data clouds around two given ground truth positions of the deformed beams depicted in Fig. 1(b) by adding a certain amount of Gaussian noise around them. The amount of noise is controlled by the standard deviation σ ∈ {0.01, 0.1, 0.5, 1.0, 2.0}. We perform a Monte Carlo simulation such that for each value of σ we fit, in a number of trials, the Timoshenko and the energy beam to randomly generated data clouds. The number of trials is 100 for every value of σ. We measure the mean square error of the fitting result with respect to the ground truth position of the beam.
Fig. 2. Mean square error from the ground truth, measured over a number of fittings using Monte Carlo simulation, with respect to different amounts of noise. (a) Energy beam fitting errors for different values of the energy weights. (b) Timoshenko beam fitting errors for different values of the Young modulus of elasticity. (c) Comparison of fitting errors for the energy beam (red) and the Timoshenko beam (blue).
Fig. 3. Failure examples using the energy beam with different energy weighting coefficients. (a) For w_b = 10^4, w_s = 10^2 the beam stays smooth but changes its length, producing a failure in the 5th frame. (b) For w_b = 10^2, w_s = 10^4 it tries to retain its length but not the smoothness, producing a failure in the 3rd frame. (c,d) The rat whisker fails in the 10th frame because of the occluded ear; the energy coefficients are w_b = 10^3, w_s = 10 and w_s = 10^2, w_b = 10, respectively.
Fig. 4. Tracking the car antenna using the Timoshenko beam. Selected frames from the tracking sequence with the recovered model shown in white.
Fig. 5. Tracking the pole in a pole vault using Timoshenko beam. Because of the moving camera the image frames are warped to one reference frame using robustly estimated homography. Selected frames from the tracking sequence with the recovered model shown in yellow.
Initially we perform fittings for different values of the energy weights w_s and w_b for the energy beam and different values of the Young modulus E for the Timoshenko beam, as shown in Fig. 2(a,b) respectively. The error differs for the different values of the energy weights. We take those values for which the error is minimal and refit the beams to the noisy data with different values of σ. Usually a good balance between the stretching and bending energies is required for reasonable fitting performance of the energy beam. For the Timoshenko
Fig. 6. Tracking the deformation of the rat whisker using Timoshenko beam. Selected frames from the tracking sequence with the recovered model are shown in white.
beam, small values of the Young modulus ranging from 10^2 to 10^3 are unrealistic for materials with small strains, to which the Timoshenko theory applies. This means that small values of the Young modulus are suitable for elastic materials with large strains, while large values are suitable for elastic materials with small strains, i.e., ones that tend to retain their length but can undergo large rotations. For that reason we obtained the best fitting performance for values of E of 10^5 and 10^6. The errors with respect to the ground truth for both beams in the two synthetic examples are shown in Fig. 2(c). The Timoshenko beam retains the same error measure as the amount of noise increases, while the error of the energy beam increases with the amount of added noise. This indicates that the Timoshenko beam is more robust when fitted to noisy data. The same is shown below during tracking in real images. A sketch of the Monte Carlo protocol used for Fig. 2 is given at the end of this section.
6.2 Real Images
In real images we chose to track highly flexible structures: a car antenna, a pole vault, and the deformation of rat whiskers. The car antenna example of Fig. 4 has a simple background, and both the Timoshenko and the energy beam track it with no problems, as can be seen in the supplementary videos. The more complex pole vault and rat whiskers were successfully tracked using the Timoshenko beam, while tracking failed when the energy beam was used. The failure examples are depicted in Fig. 3, and selected frames from successful tracking using the Timoshenko beam are depicted in Fig. 5 and Fig. 6. In all examples the initialization was done manually in the first frame, followed by frame-to-frame fitting. The energy beam has a tendency to attach to strong edges regardless of the combination of the energy weights, as depicted in Fig. 3, while the Timoshenko beam overcomes this problem because of the naturally imposed physical constraints implicitly contained in the model description.
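For reference, the Monte Carlo protocol behind Fig. 2 can be reconstructed roughly as follows (our sketch, assuming NumPy; the error definition as a per-node mean squared distance is an assumption).

```python
import numpy as np

def monte_carlo_mse(ground_truth, fit_fn, sigmas=(0.01, 0.1, 0.5, 1.0, 2.0),
                    trials=100, seed=0):
    # ground_truth: (n, 2) node positions of a deformed beam.
    # fit_fn(data) -> (n, 2) fitted node positions, e.g. a wrapper around fit_beam above.
    rng = np.random.default_rng(seed)
    results = {}
    for sigma in sigmas:
        errors = []
        for _ in range(trials):
            data = ground_truth + rng.normal(0.0, sigma, ground_truth.shape)
            fitted = fit_fn(data)
            errors.append(np.mean(np.sum((fitted - ground_truth) ** 2, axis=1)))
        results[sigma] = float(np.mean(errors))
    return results
```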
7 Conclusion
In this paper we investigated the true physical Timoshenko beam model for tracking large non-linear deformations in images. We compared it to the physically based energy beam
approach, which uses simplifying physical assumptions to create the model energy, similarly to most physically based models used in computer vision. These approaches ignore the complex non-linearities that are inherent to large deformations. We found that using the Timoshenko beam, which approximates the physical behavior more closely, contributed to robust fitting to noisy data and efficient tracking of large deformations in the presence of clutter and partial occlusions.
References
1. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1(4), 321–331 (1988)
2. Cohen, L., Cohen, I.: Deformable models for 3-d medical images using finite elements and balloons. In: Conference on Computer Vision and Pattern Recognition, pp. 592–598 (1992)
3. Terzopoulos, D., Metaxas, D.: Dynamic 3D models with local and global deformations: Deformable superquadrics. IEEE Transactions on Pattern Analysis and Machine Intelligence 13, 703–714 (1991)
4. Metaxas, D., Terzopoulos, D.: Constrained deformable superquadrics and nonrigid motion tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(6), 580–591 (1993)
5. Gibson, S., Mirtich, B.: A survey of deformable modeling in computer graphics. Technical report, Mitsubishi Electric Research Lab, Cambridge, MA (1997)
6. McInerney, T., Terzopoulos, D.: Deformable models in medical image analysis: a survey. Medical Image Analysis 1(2), 91–108 (1996)
7. Metaxas, D.: Physics-Based Deformable Models: Applications to Computer Vision, Graphics, and Medical Imaging. Kluwer Academic Publishers, Dordrecht (1996)
8. Lee, Y., Terzopoulos, D., Walters, K.: Realistic modeling for facial animation. In: SIGGRAPH 1995. Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques, pp. 55–62. ACM Press, New York (1995)
9. Essa, I., Sclaroff, S., Pentland, A.: Physically-based modeling for graphics and vision. In: Martin, R. (ed.) Directions in Geometric Computing. Information Geometers, U.K. (1993)
10. Sclaroff, S., Pentland, A.P.: Physically-based combinations of views: Representing rigid and nonrigid motion. Technical Report 1994-016 (1994)
11. Pentland, A.: Automatic extraction of deformable part models. International Journal of Computer Vision 4(2), 107–126 (1990)
12. Delingette, H., Hebert, M., Ikeuchi, K.: Deformable surfaces: A free-form shape representation. SPIE Geometric Methods in Computer Vision 1570, 21–30 (1991)
13. Nastar, C., Ayache, N.: Frequency-based nonrigid motion analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(11) (1996)
14. O'Brien, J.F., Cook, P.R., Essl, G.: Synthesizing sounds from physically based motion. In: Fiume, E. (ed.) SIGGRAPH 2001. Computer Graphics Proceedings, pp. 529–536 (2001)
15. Tsap, L., Goldgof, D., Sarkar, S.: Fusion of physically-based registration and deformation modeling for nonrigid motion analysis (2001)
16. Bartoli, A., Zisserman, A.: Direct estimation of non-rigid registration. In: British Machine Vision Conference, Kingston, UK (2004)
17. Timoshenko, S., MacCullough, G.: Elements of Strength of Materials, 3rd edn. Van Nostrand, New York (1949)
18. Lepetit, V., Fua, P.: Monocular model-based 3d tracking of rigid objects: A survey. Foundations and Trends in Computer Graphics and Vision 1(1), 1–89 (2005)
A Family of Quadratic Snakes for Road Extraction
Ramesh Marikhu1, Matthew N. Dailey2, Stanislav Makhanov3, and Kiyoshi Honda4
1 Information and Communication Technologies, Asian Institute of Technology
2 Computer Science and Information Management, Asian Institute of Technology
3 Sirindhorn International Institute of Technology, Thammasat University
4 Remote Sensing and GIS, Asian Institute of Technology
Abstract. The geographic information system industry would benefit from flexible automated systems capable of extracting linear structures from satellite imagery. Quadratic snakes allow global interactions between points along a contour, and are well suited to segmentation of linear structures such as roads. However, a single quadratic snake is unable to extract disconnected road networks and enclosed regions. We propose to use a family of cooperating snakes, which are able to split, merge, and disappear as necessary. We also propose a preprocessing method based on oriented filtering, thresholding, Canny edge detection, and Gradient Vector Flow (GVF) energy. We evaluate the performance of the method in terms of precision and recall in comparison to ground truth data. The family of cooperating snakes consistently outperforms a single snake in a variety of road extraction tasks, and our method for obtaining the GVF is more suitable for road extraction tasks than standard methods.
1 Introduction
The geographic information system industry would benefit from flexible automated systems capable of extracting linear structures and regions of interest from satellite imagery. In particular, automated road extraction would boost the productivity of technicians enormously. This is because road networks are among the most important landmarks for mapping, and manual marking and extraction of road networks is an extremely slow and laborious process. Despite years of research and significant progress in the computer vision and image processing communities (see, for example, [1,2] and Fortier et al.'s survey [3]), the methods available thus far have still not attained the speed and accuracy necessary for practical application in GIS tools. Among the most promising techniques for extraction of complex objects like roads are active contours or snakes, originally introduced by Kass et al. [4]. Since the seminal work of Kass and colleagues, techniques based on active contours have been applied to many object extraction tasks [5] including road extraction [6]. Rochery et al. have recently proposed higher-order active contours, in particular quadratic snakes, which hold a great deal of promise for extraction of linear
structures like roads [7]. The idea is to use a quadratic formulation of the contour’s geometric energy to encourage anti-parallel tangents on opposite sides of a road and parallel tangents along the same side of a road. These priors increase the final contour’s robustness to partial occlusions and decrease the likelihood of false detections in regions not shaped like roads. In this paper, we propose two heuristic modifications to Rochery et al.’s quadratic snakes, to address limitations of a single quadratic snake and to accelerate convergence to a solution. First, we introduce the use of a family of quadratic snakes that are able to split, merge, and disappear as necessary. Second, we introduce an improved formulation of the image energy combining Rochery et al.’s oriented filtering technique [7] with thresholding, Canny edge detection, and Xu and Prince’s Gradient Vector Flow (GVF) [8]. The modified GVF field created using the proposed method is very effective at encouraging the quadratic snake to snap to the boundaries of linear structures. We demonstrate the effectiveness of the family of snakes and the modified GVF field in a series of experiments with real satellite images, and we provide precision and recall measurements in comparison with ground truth data. The results are an encouraging step towards the ultimate goal of robust, fully automated road extraction from satellite imagery. As a last contribution, we have developed a complete GUI environment for satellite image manipulation and quadratic snake evolution, based on the Matlab platform. The system is freely available as open source software [9].
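As an illustration of the GVF component of the proposed preprocessing, the following sketch (assuming NumPy/SciPy) implements the standard iterative Gradient Vector Flow update of Xu and Prince [8] on an edge map. The oriented filtering, thresholding, and Canny stages are not shown, and mu, n_iter, and dt are illustrative values rather than the parameters used in this paper.

```python
import numpy as np
from scipy import ndimage as ndi

def gradient_vector_flow(edge_map, mu=0.2, n_iter=200, dt=0.5):
    # edge_map: 2D float array, e.g. a Canny edge map after oriented filtering
    # and thresholding. Returns the (u, v) external force field.
    f = edge_map.astype(float)
    f = (f - f.min()) / (np.ptp(f) + 1e-12)
    fy, fx = np.gradient(f)
    mag2 = fx ** 2 + fy ** 2
    u, v = fx.copy(), fy.copy()
    for _ in range(n_iter):
        # u_t = mu * Laplacian(u) - (u - f_x) * |grad f|^2, and likewise for v.
        u += dt * (mu * ndi.laplace(u) - mag2 * (u - fx))
        v += dt * (mu * ndi.laplace(v) - mag2 * (v - fy))
    return u, v
```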
2 Experimental Methods

2.1 Quadratic Snake Model
Here we provide a brief overview of the quadratic snake proposed by Rochery et al. [7]. An active contour or snake is parametrically defined as

    γ(p) = [x(p)  y(p)]^T,    (1)

where p is the curvilinear abscissa of the contour and the vector [x(p)  y(p)]^T defines the Cartesian coordinates of the point γ(p). We assume the image domain Ω to be a bounded subset of R^2. The energy functional for Rochery et al.'s quadratic snake is given by

    Es(γ) = Eg(γ) + λ Ei(γ),    (2)

where Eg(γ) is the geometric energy and Ei(γ) is the image energy of the contour γ. λ is a free parameter determining the relative importance of the two terms. The geometric energy functional is defined as

    Eg(γ) = L(γ) + α A(γ) − (β/2) ∫∫ tγ(p) · tγ(p′) Ψ(γ(p) − γ(p′)) dp dp′,    (3)
A Family of Quadratic Snakes for Road Extraction
87
where L(γ) is the length of γ in the Euclidean metric over Ω, A(γ) is the area enclosed by γ, tγ(p) is the unit-length tangent to γ at point p, and Ψ(z), given the distance z between two points on the contour, is used to weight the interaction between those two points (see below). α and β are constants weighting the relative importance of each term. Clearly, for positive β, Eg(γ) is minimized by contours with short length and parallel tangents. If α is positive, contours with small enclosed area are favored; if it is negative, contours with large enclosed area are favored. The interaction function Ψ(z) is a smooth function expressing the radius of the region in which parallel tangents should be encouraged and anti-parallel tangents should be discouraged. Ψ(z) incorporates two constants: d, the expected road width, and ε, the expected variability in road width. During snake evolution, weighting by Ψ(z) in Equation 3 discourages two points with anti-parallel tangents (the opposite sides of a putative road) from coming closer than distance d from each other. The image energy functional Ei(γ) is defined as

    Ei(γ) = ∫ nγ(p) · ∇I(γ(p)) dp − ∫∫ tγ(p) · tγ(p′) [∇I(γ(p)) · ∇I(γ(p′))] Ψ(γ(p) − γ(p′)) dp dp′,    (4)
where I : Ω → [0, 255] is the image and ∇I(γ(p)) denotes the 2D gradient of I evaluated at γ(p). The first, linear term favors anti-parallel normal and gradient vectors, encouraging counterclockwise snakes to shrink around, or clockwise snakes to expand to enclose, dark regions surrounded by light roads.¹ The quadratic term favors nearby point pairs with two different configurations, one with parallel tangents and parallel gradients and the other with anti-parallel tangents and anti-parallel gradients. After solving the Euler-Lagrange equations for minimizing the energy functional Es(γ) (Equation 2), Rochery et al. obtain the update equation

    nγ(p) · ∂Es/∂γ (p) = −κγ(p) − α − λ ‖∇I(γ(p))‖² + G(γ(p))
        + β ∫ r(γ(p), γ(p′)) · nγ(p′) Ψ(γ(p) − γ(p′)) dp′
        + 2λ ∫ r(γ(p), γ(p′)) · nγ(p′) (∇I(γ(p)) · ∇I(γ(p′))) Ψ(γ(p) − γ(p′)) dp′
        + 2λ ∫ ∇I(γ(p′)) · (∇∇I(γ(p)) × nγ(p′)) Ψ(γ(p) − γ(p′)) dp′,    (5)

¹ For dark roads on a light background, we negate all the terms involving the image, including G(γ(p)) in Equation 5. In the rest of the paper, we assume light roads on a dark background.
where κγ(p) is the curvature of γ at γ(p) and G(γ(p)) is the "specific energy" evaluated at point γ(p) (Section 2.2). r(γ(p), γ(p′)) = (γ(p′) − γ(p)) / ‖γ(p′) − γ(p)‖ is the unit vector pointing from γ(p) towards γ(p′). ∇∇I(γ(p)) is the Hessian of I evaluated at γ(p). α, β, and λ are free parameters that need to be determined experimentally. d and ε are specified a priori according to the desired road width. Following Rochery et al., we normally initialize our quadratic snakes with a rounded rectangle covering the entire image.

2.2 Oriented Filtering
We use Rochery's oriented filtering method [10] to enhance linear edges in our satellite imagery. The input image is first convolved with oriented derivative-of-Gaussian filters at various orientations. Then the minimum (most negative) filter response over the orientations is run through a ramp function equal to 1 for low filter values and −1 for high filter values. The thresholds are user-specified. An example is shown in Fig. 1(b).
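For concreteness, the following Python sketch shows one plausible implementation of the oriented filtering and ramp thresholding described above. The filter shape (sig_along, sig_across), the number of orientations, and the ramp thresholds t_low and t_high are illustrative assumptions, not values taken from the paper, and Rochery's actual filter bank may differ in its details.

    import numpy as np
    from scipy.ndimage import gaussian_filter, rotate

    def oriented_specific_energy(I, sig_along=6.0, sig_across=1.5,
                                 n_orient=8, t_low=-10.0, t_high=-1.0):
        # Sketch of the "specific energy" G(x): oriented derivative-of-Gaussian
        # filtering followed by the ramp described in Section 2.2.  All
        # parameter values here are illustrative assumptions.
        I = np.asarray(I, dtype=float)
        responses = []
        for ang in np.linspace(0.0, 180.0, n_orient, endpoint=False):
            Ir = rotate(I, ang, reshape=False, order=1, mode='nearest')
            # Smooth along the (rotated) road direction, differentiate across it.
            Dr = gaussian_filter(Ir, sigma=(sig_along, sig_across), order=(0, 1))
            responses.append(rotate(Dr, -ang, reshape=False, order=1, mode='nearest'))
        R = np.min(np.stack(responses), axis=0)      # most negative filter response
        # Ramp: +1 for strongly negative responses, -1 for weak or positive ones.
        return np.interp(R, [t_low, t_high], [1.0, -1.0])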
2.3 GVF Energy
Rather than using the oriented filtering specific image energy G(x) from Section 2.2 for snake evolution directly, we propose to combine the oriented filtering approach with Xu and Prince's Gradient Vector Flow (GVF) method [8]. The GVF is a vector field V^GVF(x) = [u(x)  v(x)]^T minimizing the energy functional

    E(V^GVF) = ∫_Ω [ μ (ux(x)² + uy(x)² + vx(x)² + vy(x)²) + ‖∇Ĩ(x)‖² ‖V(x) − ∇Ĩ(x)‖² ] dx,    (6)

where ux = ∂u/∂x, uy = ∂u/∂y, vx = ∂v/∂x, vy = ∂v/∂y, and Ĩ is a preprocessed version of image I, typically an edge image of some kind. The first term inside the integral encourages a smooth vector field, whereas the second term encourages fidelity to ∇Ĩ. μ is a free parameter controlling the relative importance of the two terms. Xu and Prince [8] experimented with several different methods for obtaining ∇Ĩ. We propose to perform Canny edge detection on G (the result of oriented filtering and thresholding, introduced in Section 2.2) to obtain a binary image Ĩ for GVF, then to use the resulting GVF V^GVF as an additional image energy for quadratic snake evolution. The binary Canny image is ideal because it only includes information about road-like edges that have survived sharpening by oriented filters. The GVF field is ideal because during quadratic snake evolution, it points toward road-like edges, pushing the snake in the right direction from a long distance away. This speeds evolution and makes it easier to find suitable parameters to obtain fast convergence. Fig. 1 compares our method to alternative GVF formulations based on oriented filtering or Canny edge detection alone.
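As a hedged illustration, the sketch below computes a GVF field by gradient descent on Equation (6), following Xu and Prince's iterative scheme; the values of mu, the step size, and the iteration count are illustrative assumptions rather than settings reported in the paper.

    import numpy as np
    from scipy.ndimage import laplace, sobel

    def gradient_vector_flow(edge_map, mu=0.2, iters=200, dt=0.2):
        # Minimal GVF sketch: gradient descent on Equation (6).  mu, dt and the
        # number of iterations are illustrative assumptions.
        f = np.asarray(edge_map, dtype=float)
        fx = sobel(f, axis=1) / 8.0          # approximate df/dx
        fy = sobel(f, axis=0) / 8.0          # approximate df/dy
        mag2 = fx ** 2 + fy ** 2
        u, v = fx.copy(), fy.copy()
        for _ in range(iters):
            u += dt * (mu * laplace(u) - mag2 * (u - fx))
            v += dt * (mu * laplace(v) - mag2 * (v - fy))
        return u, v

    # Typical use in the proposed pipeline (Sections 2.2-2.3), where `canny`
    # could be, e.g., skimage.feature.canny applied to the thresholded G:
    #   G = oriented_specific_energy(I)
    #   edges = canny(G > 0)
    #   u, v = gradient_vector_flow(edges.astype(float))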
Fig. 1. Comparison of GVF methods. (a) Input image. (b) G(x) obtained from oriented filtering on I(x). (c) Image obtained from G(x) using threshold 0. (d) Canny edge detection on (c), used as Ĩ for GVF. (e-f) Zoomed views of GVFs in the region delineated in (d). (e) Result of using the magnitude of the gradient ∇(Gσ ∗ I) to obtain Ĩ. (f) Result of using Canny edge detection alone to obtain Ĩ. (g) GVF energy obtained using our proposed edge image. This field pushes most consistently toward the true road boundaries.
2.4 Family of Quadratic Snakes
A single quadratic snake is unable to extract enclosed regions and multiple disconnected networks in an image. We address this limitation by introducing a family of cooperating snakes that are able to split, merge, and disappear as necessary. In our formulation, due to the curvature term κγ(p) and the area constant α in Equation 5, specifying the points on γ in a counterclockwise direction creates a shrinking snake and specifying the points on γ in a clockwise direction creates a growing snake. An enclosed region (a loop or a grid cell) can be extracted effectively by initializing two snakes, one shrinking snake covering the whole road network and another growing snake inside the enclosed region. Our method is heuristic and depends on somewhat intelligent user initialization, but it is much simpler than level set methods for the same problem [7], and, assuming a constant number of splits and merges per iteration, it does not increase the asymptotic complexity of the quadratic snake's evolution. Splitting a Snake. We split a snake into two snakes whenever two of its arms are squeezed too close together, i.e., when the distance between two snake points is less than d_split and those two points are at least k snake points from each other in both directions of traversal around the contour. d_split should be less than 2η, where η is the maximum step size.
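A minimal sketch of the splitting test, assuming the snake is stored as an ordered array of 2D points on a closed contour; the helper names are ours, not the paper's.

    import numpy as np

    def find_split_pair(points, d_split, k):
        # Return indices (i, j) of two snake points closer than d_split that
        # are at least k points apart along the contour in both directions of
        # traversal, or None if no such pair exists.
        n = len(points)
        for i in range(n):
            for j in range(i + 1, n):
                sep = min(j - i, n - (j - i))     # separation along the contour
                if sep < k:
                    continue
                if np.linalg.norm(points[i] - points[j]) < d_split:
                    return i, j
        return None

    def split_snake(points, i, j):
        # Split the closed contour at (i, j) into two closed contours,
        # preserving the direction of traversal.
        return points[i:j + 1], np.concatenate([points[j:], points[:i + 1]])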
Merging Two Snakes. Two snakes are merged when they have high-curvature points within a distance d_merge of each other, the two snakes' order of traversal (clockwise or counterclockwise) is the same, and the tangents at the two high-curvature points are nearly antiparallel. High-curvature points are those with κγ(p) > 0.6 κγ^max, where κγ^max is the maximum curvature over all points on γ. High-curvature points are used to ensure that merging only occurs when two snakes have the semi-circular tips of their arms facing each other. Filtering out the low-curvature points means the angle between the tangents at two points needs to be computed only for the high-curvature points. When these conditions are fulfilled, the two snakes are merged by deleting the high-curvature points and joining the snakes into a single snake while preserving the direction of traversal for the combined snake.

Deleting a Snake. A snake γ is deleted if it has low compactness (4πA(γ)/L(γ)²) and a perimeter less than L_delete.
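The merging test described above can be sketched as follows, again for snakes stored as ordered point arrays; the finite-difference curvature estimate and the helper names are our own choices, and the same-traversal-order check is assumed to be performed separately.

    import numpy as np

    def curvature_and_tangents(points):
        # Finite-difference tangents and unsigned curvature for a closed contour.
        d1 = (np.roll(points, -1, axis=0) - np.roll(points, 1, axis=0)) / 2.0
        d2 = np.roll(points, -1, axis=0) - 2.0 * points + np.roll(points, 1, axis=0)
        num = d1[:, 0] * d2[:, 1] - d1[:, 1] * d2[:, 0]
        den = (d1[:, 0] ** 2 + d1[:, 1] ** 2) ** 1.5 + 1e-12
        tangents = d1 / (np.linalg.norm(d1, axis=1, keepdims=True) + 1e-12)
        return np.abs(num / den), tangents

    def find_merge_pair(pa, pb, d_merge, angle_thresh=130.0 * np.pi / 180.0):
        # Look for a pair of high-curvature tips, one on each snake, that are
        # within d_merge of each other and have nearly antiparallel tangents.
        ka, ta = curvature_and_tangents(pa)
        kb, tb = curvature_and_tangents(pb)
        for i in np.where(ka > 0.6 * ka.max())[0]:
            for j in np.where(kb > 0.6 * kb.max())[0]:
                if np.linalg.norm(pa[i] - pb[j]) > d_merge:
                    continue
                cos_ang = np.clip(np.dot(ta[i], tb[j]), -1.0, 1.0)
                if np.arccos(cos_ang) > angle_thresh:   # nearly antiparallel
                    return i, j
        return None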
2.5 Experimental Design
We analyze extraction results on different types of road networks using the single quadratic snake proposed by Rochery et al. [7] and the proposed family of cooperating snakes. The default convergence criterion is that the minimum Es(γ) has not improved for some number of iterations. Experiments have been performed to analyze the extraction of tree-structured road networks and of networks with loops, grids, and disconnected networks. We then analyze the effectiveness of the GVF energy obtained from the proposed edge image in Experiment 4. For all the experiments, we digitize the images manually to obtain the ground truth data necessary to compute precision and recall.
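For reference, pixel-level precision and recall against a binary ground-truth road mask can be computed as below; the paper does not state whether a tolerance buffer around the ground truth is used, so this is only the simplest plausible definition.

    import numpy as np

    def precision_recall(extracted_mask, ground_truth_mask):
        # extracted_mask and ground_truth_mask are binary images of equal size.
        ex = np.asarray(extracted_mask, dtype=bool)
        gt = np.asarray(ground_truth_mask, dtype=bool)
        tp = np.logical_and(ex, gt).sum()
        precision = tp / max(ex.sum(), 1)
        recall = tp / max(gt.sum(), 1)
        return precision, recall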
3 Results
We have obtained several parameters empirically. For splitting a snake, d_split should be less than d. k is chosen depending on how far apart the two splitting points should be, ensuring that the snakes formed after splitting have at least k points each. To ensure that merging of snakes takes place only between arms with semi-circular tips facing each other, the tangents at the high-curvature points are checked against an antiparallelism threshold of 130π/180 radians. The compactness should be greater than 0.2 to ensure that linear-structured contours are not deleted.

3.1 Experiment 1: Simple (Tree-Structured) Road Networks
A single quadratic snake is well suited for tree-structured road networks, as the snake does not need to change its topology during evolution (Figure 2). A family of snakes enables faster and better road extraction, as non-road regions are eliminated through snake splitting and deletion.
Fig. 2. Evolution of quadratic snake on roads with tree structure. Each column displays an image with initial contour in red and the extracted road network below it.
Fig. 3. Evolution of quadratic snake on roads with loops and disconnected networks. Each column displays an image with initial contour in red and the extracted road network below it.
3.2 Experiment 2: Road Networks with Single Loop and Multiple Disconnected Networks
The family of quadratic snakes is able to extract disconnected networks with high accuracy (Figure 3), but it cannot extract enclosed regions automatically, since the snakes cannot develop holes inside themselves in the form of growing snakes.

3.3 Experiment 3: Complex Road Networks
A road network is considered complex if it has multiple disconnected networks, enclosed regions, and a large number of branches. With appropriate user initialization (Figure 4), the snakes are able to extract the road networks with high accuracy and in less time.

3.4 Experiment 4: GVF Energy to Enable Faster Evolution
The Gradient Vector Flow field [8] accelerates the evolution process, as can be seen from the number of iterations required for each evolution in Experiment 4 with and without the GVF energy. From the evolution in the fifth column, we see that the snake was able to extract the network in greater detail. From the evolution in the last column, we also see that the quadratic image energy is necessary for robust extraction, and thus the GVF weight and λ need to be balanced appropriately.
Fig. 4. Evolution of quadratic snake on roads with enclosed regions. Each column displays an image with initial contour in green and the extracted road network below it.
Fig. 5. Evolution of quadratic snake on roads with enclosed regions. Each column displays an image with initial contour in green and the extracted road network below it.
4 Discussion and Conclusion
In Experiment 1, we found that our modified quadratic snake is able to move into concavities to extract entire tree-structured road networks with very high accuracy. Experiment 2 showed that the family of quadratic snakes is effective at handling changes in topology during evolution, enabling better extraction of road networks. Currently, loops cannot be extracted automatically. We demonstrated the difficulty of extracting complex road networks with multiple loops and grids in Experiment 3. However, user initialization of a family of contours enables extraction of multiple closed regions and helps the snakes avoid road-like regions. The level set framework could be used to handle changes in topology, enabling effective extraction of enclosed regions. Rochery et al. [10] evolved the contour using the level set methods introduced by Osher and Sethian. However, our method is faster, conceptually simpler, and a direct extension of Kass et al.'s computational approach. In Experiment 4, we found that faster and more robust extraction is achieved using oriented filtering and GVF energy along with the image energy of the quadratic snakes. Our proposed edge image obtained from oriented filtering is effective for computing the GVF energy and enhances the extraction process. We also found that our method for obtaining the GVF outperforms standard methods. Finally, we have developed a complete GUI environment for satellite image manipulation and quadratic snake evolution, based on the Matlab platform. The system is freely available as open source software [9].
Future work will focus on automating the extraction of enclosed regions. Digital elevation models could also be integrated with the image energy for increased accuracy.
Acknowledgments This research was supported by Thailand Research Fund grant MRG4780209 to MND. RM was supported by a graduate fellowship from the Nepal High Level Commission for Information Technology.
References
1. Fischler, M., Tenenbaum, J., Wolf, H.: Detection of roads and linear structures in low-resolution aerial imagery using a multisource knowledge integration technique. Computer Graphics and Image Processing 15, 201–223 (1981)
2. Geman, D., Jedynak, B.: An active testing model for tracking roads in satellite images. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(1), 1–14 (1996)
3. Fortier, A., Ziou, D., Armenakis, C., Wang, S.: Survey of work on road extraction in aerial and satellite images. Technical Report 241, Université de Sherbrooke, Quebec, Canada (1999)
4. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1(4), 321–331 (1987)
5. Cohen, L.D., Cohen, I.: Finite-element methods for active contour models and balloons for 2-D and 3-D images. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(11), 131–147 (1993)
6. Laptev, I., Mayer, H., Lindeberg, T., Eckstein, W., Steger, C., Baumgartner, A.: Automatic extraction of roads from aerial images based on scale space and snakes. Machine Vision and Applications 12(1), 23–31 (2000)
7. Rochery, M., Jermyn, I.H., Zerubia, J.: Higher order active contours. International Journal of Computer Vision 69(1), 27–42 (2006)
8. Xu, C., Prince, J.L.: Gradient Vector Flow: A new external force for snakes. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 66–71 (1997)
9. Marikhu, R.: A GUI environment for road extraction with quadratic snakes, Matlab software (2007), available at http://www.cs.ait.ac.th/~mdailey/snakes
10. Rochery, M.: Contours actifs d'ordre supérieur et leur application à la détection de linéiques dans des images de télédétection. PhD thesis, Université de Nice - Sophia Antipolis, UFR Sciences (2005)
Multiperspective Distortion Correction Using Collineations

Yuanyuan Ding and Jingyi Yu
Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, USA
{ding,yu}@eecis.udel.edu
Abstract. We present a new framework for correcting multiperspective distortions using collineations. A collineation describes the transformation between the images of a camera due to changes in sampling and image plane selection. We show that image distortions in many previous models of cameras can be effectively reduced via proper collineations. To correct distortions in a specific multiperspective camera, we develop an interactive system that allows users to select feature rays from the camera and position them at the desirable pixels. Our system then computes the optimal collineation to match the projections of these rays with the corresponding pixels. Experiments demonstrate that our system robustly corrects complex distortions without acquiring the scene geometry, and the resulting images appear nearly undistorted.
1 Introduction
A perspective image represents the spatial relationships of objects in a scene as they would appear from a single viewpoint. Recent developments have suggested that alternative multiperspective camera models [5,16] can combine what is seen from several viewpoints into a single image. These cameras provide potentially advantageous imaging systems for understanding the structure of observed scenes. However, they also exhibit multiperspective distortions such as the curving of lines, apparent stretching and shrinking, and duplicated projections of a single point [12,14]. In this paper, we present a new framework for correcting multiperspective distortions using collineations. A collineation describes the transformation between the images of a camera due to changes in sampling and image plane selection. We show that image distortions in many previous cameras can be effectively reduced via proper collineations. To correct distortions in a specific multiperspective camera, we develop an interactive system that allows users to select feature rays from the camera and position them at the desirable pixels. Our system then computes the optimal collineation to match the projections of these rays with the corresponding pixels. Compared with classical distortion correction methods [12,2,11], our approach does not require prior knowledge on scene geometry and it can handle highly
complex distortions. We demonstrate the effectiveness of our technique on various synthetic and real multiperspective images, including the General Linear Cameras [14], catadioptric mirrors, and reflected images from arbitrary mirror surfaces. Experiments show that our method is robust and reliable, thus the resulting images appear nearly undistorted.
2 Previous Work
In recent years, there has been a growing interest in designing multiperspective cameras which capture rays from different viewpoints in space. These multiperspective cameras include pushbroom cameras [5], which collect rays along parallel planes from points swept along a linear trajectory, the cross-slit cameras [8,16], which collect all rays passing through two lines, and the oblique cameras [7], in which each pair of rays is oblique. The recently proposed General Linear Cameras (GLC) uniformly model these multiperspective cameras as 2D linear manifolds of rays (Fig. 1). GLCs produce easily interpretable images, which are also amenable to stereo analysis [9]. However, these images exhibit multiperspective distortions [14]. In computer vision, image-warping has been commonly used to reduce distortions. Image-warping computes an explicit pixel-to-pixel mapping to warp the original image onto a nearly perspective image. For cameras that roughly maintain a single viewpoint [6], simple parametric functions are sufficient to eliminate perspective, radial, and tangential distortions [2,3]. However, for complex imaging systems, especially those exhibiting severe caustic distortions [12], the warping function is difficult to model and may not have a closed-form solution. Image-based rendering algorithms have also been proposed to reduce image distortions [10,4]. There, the focus has been to estimate the scene structure from a single or multiple images. Swaminathan and Nayar [13] have shown that simple geometry proxies, such as the plane, sphere, and cylinder, are often sufficient to reduce caustic distortions on catadioptric mirrors, provided that the prior on scene structure is known. We present a third approach based on multiperspective collineations. A collineation describes the transformation between the images of a camera due to changes in sampling and image plane selection. For many multiperspective cameras such as the pushbroom [5] and the cross-slit [8], collineations can be uniformly modeled using the recently proposed General Linear Cameras (GLC) [15].

2.1 GLC Collineation
In the GLC framework, every ray is parameterized by its intersections with the two parallel planes, where [u, v] is the intersection with the first and [s, t] the second, as shown in Fig. 1(a). This parametrization is often called a two-plane parametrization (2PP) [4,15]. We can reparameterize each ray by substituting σ = s − u and τ = t − v. In this paper, we will use this [σ, τ, u, v] parametrization to simplify our analysis. We also assume the default uv plane is at z = 0 and st plane at z = 1. Thus [σ, τ, 1] represents the direction of the ray.
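As a small illustration of this parameterization, under the stated convention that the uv plane is at z = 0 and the st plane at z = 1 (the function name is ours):

    import numpy as np

    def ray_from_2pp(u, v, s, t):
        # Reparameterize a two-plane ray by sigma = s - u, tau = t - v.
        # The ray passes through [u, v, 0] with direction [sigma, tau, 1].
        sigma, tau = s - u, t - v
        origin = np.array([u, v, 0.0])
        direction = np.array([sigma, tau, 1.0])
        return np.array([sigma, tau, u, v]), origin, direction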
Fig. 1. General Linear Camera Models. (a) A GLC collects radiance along all possible affine combination of three rays. The rays are parameterized by their intersections with two parallel planes. The GLC model unifies many previous cameras, including the pinhole (b), the orthographic (c), the pushbroom (d), and the cross-slit (e).
A GLC is defined as the affine combination of three rays parameterized under 2PP:

    r = α[σ1, τ1, u1, v1] + β[σ2, τ2, u2, v2] + (1 − α − β)[σ3, τ3, u3, v3],  ∀α, β    (1)

Many well-known multiperspective cameras, such as pushbroom, cross-slit, and linear oblique cameras, are GLCs, as shown in Fig. 1. If we assume uv is the image plane, we can further choose three special rays with [u, v] coordinates [0, 0], [1, 0], and [0, 1] to form a canonical GLC as:

    r[σ, τ, u, v] = (1 − α − β) · [σ1, τ1, 0, 0] + α · [σ2, τ2, 1, 0] + β · [σ3, τ3, 0, 1]    (2)
It is easy to see that α = u, β = v, and σ and τ are linear functions in u and v. Therefore, under the canonical form, every pixel [u, v] maps to a ray r(u, v) in the GLC. A GLC collineation maps every ray r(u, v) to a pixel [i, j] on the image plane Π[ṗ, d1, d2], where ṗ specifies the origin and d1 and d2 specify the two spanning directions of Π. For every ray r[σ, τ, u, v], we can intersect r with Π to compute [i, j]:

    [u, v, 0] + λ[σ, τ, 1] = ṗ + i d1 + j d2    (3)
Solving for i, j, and λ gives:

    i = [(τ d2^z − d2^y)(u − px) + (d2^x − σ d2^z)(v − py) − (σ d2^y − τ d2^x) pz] / γ
    j = [(d1^y − τ d1^z)(u − px) + (σ d1^z − d1^x)(v − py) − (τ d1^x − σ d1^y) pz] / γ    (4)

where

    γ = | d1^x  d2^x  −σ |
        | d1^y  d2^y  −τ |
        | d1^z  d2^z  −1 |    (5)
For a canonical GLC, since σ and τ are both linear functions in u and v, γ must be linear in u and v. Therefore, we can rewrite i and j as:
    i = (a1 u² + b1 uv + c1 v² + d1 u + e1 v + f1) / (a3 u + b3 v + c3)
    j = (a2 u² + b2 uv + c2 v² + d2 u + e2 v + f2) / (a3 u + b3 v + c3)    (6)
Thus, the collineation Col̃_Π(u, v) of a GLC from the uv image plane to a new image plane Π is a quadratic rational function. Fig. 2 shows the images of a GLC under different collineations. It implies that image distortions may be reduced using a proper collineation.
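Numerically, the collineation of a single ray can be evaluated either through the closed forms (4)-(5) or, equivalently, by solving the small linear system of Equation (3) directly, as in the following sketch (function and variable names are ours):

    import numpy as np

    def glc_collineation(sigma, tau, u, v, p, d1, d2):
        # Intersect the ray [u, v, 0] + lambda*[sigma, tau, 1] with the image
        # plane Pi[p, d1, d2] by solving Equation (3) for (i, j, lambda).
        A = np.column_stack([d1, d2, -np.array([sigma, tau, 1.0])])
        b = np.array([u, v, 0.0]) - np.asarray(p, dtype=float)
        i, j, lam = np.linalg.solve(A, b)
        return i, j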
Fig. 2. The image of a cross-slit GLC (d) under collineation (c) appears much less distorted than the image (b) of the same camera under collineation (a).
3 Correcting Distortions in GLCs
Given a specific GLC, our goal is to find the optimal collineation to minimize its distortions. Similar to previous approaches [12,11], we assume the rays captured by the camera are known. We have developed an interactive system to allow users to design their ideal undistorted images. Our system supports two modes. In the first mode, the user can select feature rays from the camera and position them at desirable pixels in the target images. In the second mode, the user can simply provide a reference perspective image. Our system then automatically matches the feature points. Finally, the optimal collineation is estimated to fit the projections of the feature rays with the target pixels.

3.1 Interactive Distortion Correction
Given a canonical GLC, the user can first select n feature rays (blue crosses in Fig. 3(a)) from the source camera and then position them at desirable pixels (red crosses in Fig. 3(b)) on the target image. Denote [uk, vk] as the uv coordinate of each selected ray rk in the camera and [ik, jk] as the desired pixel coordinate of rk on the target image; we want to find the collineation Π[ṗ, d1, d2] that maps [u, v] as close to [i, j] as possible. We formalize it as a least squares fitting problem:

    min_Π Σ_{k=1}^{n} ‖Col̃_Π(uk, vk) − [ik, jk]‖²    (7)
Since each collineation Π[ṗ, d1, d2] has 9 variables, we need a minimal number of five ray-pixel pairs. This is not surprising because four pairs uniquely
determine a projective transformation, a degenerate collineation in the case of perspective cameras. Recall that the GLC collineations are quadratic rational functions. Thus, finding the optimal Π in Equation (7) requires using non-linear optimizations. To solve this problem, we use the Levenberg-Marquardt method. A common issue with the Levenberg-Marquardt method, however, is that the resulting optimum depends on the initial condition. To avoid getting trapped in a local minimum, we choose a near optimal initial condition by sampling different spanning directions of Π. We rewrite the spanning directions as:

    d_i = η_i · [cos(φ_i)cos(θ_i), cos(φ_i)sin(θ_i), sin(φ_i)],  i = 1, 2    (8)
We sample several θ1, θ2, φ1, and φ2 and find the corresponding ṗ, η1, and η2 as the initial conditions. Finally, we choose the one with the minimum error. This preconditioned optimization robustly approximates a near optimal collineation that significantly reduces distortions as shown in Fig. 3(b).
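A rough sketch of this preconditioned fit using SciPy's Levenberg-Marquardt solver is given below; the packing of the nine plane parameters, the fixed initial guesses for ṗ, η1, η2, and the helper `project` (which could wrap the glc_collineation sketch above) are our own assumptions, whereas the paper derives ṗ, η1, η2 from the sampled angles.

    import numpy as np
    from itertools import product
    from scipy.optimize import least_squares

    def fit_collineation(rays, targets, project, theta_samples, phi_samples):
        # rays: (n, 4) array of [sigma, tau, u, v]; targets: (n, 2) pixel coords.
        # Parameters packed as [p (3), eta1, theta1, phi1, eta2, theta2, phi2].
        def spanning(eta, theta, phi):
            return eta * np.array([np.cos(phi) * np.cos(theta),
                                   np.cos(phi) * np.sin(theta),
                                   np.sin(phi)])

        def residuals(x):
            p, d1, d2 = x[:3], spanning(*x[3:6]), spanning(*x[6:9])
            proj = np.array([project(s, t, u, v, p, d1, d2) for s, t, u, v in rays])
            return (proj - targets).ravel()

        best = None
        for th1, ph1, th2, ph2 in product(theta_samples, phi_samples,
                                          theta_samples, phi_samples):
            x0 = np.array([0.0, 0.0, 1.0, 1.0, th1, ph1, 1.0, th2, ph2])
            res = least_squares(residuals, x0, method='lm')
            if best is None or res.cost < best.cost:
                best = res
        return best.x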
Fig. 3. Interactive Distortion Correction. (a) The user selects feature rays (blue crosses) and positions them at desirable pixels (red crosses). (b) shows the new image under the optimal collineation. The distortions are significantly reduced. The green crosses illustrate the final projections of the feature rays.
3.2 Automatic Distortion Correction
We also present a simple algorithm to automatically reduce distortions. Our method consists of two steps. First, the user provides a target perspective image that captures the same scene. Next, we automatically select the matched features between the source camera and the target image and compute the optimal collineation by minimizing Equation (7). Recall that a GLC captures rays from different viewpoints in space and hence, its image may appear very different from a perspective image. To match the feature points, we use Scale Invariant Feature Transform (SIFT) to preprocess the two images. SIFT robustly handles image distortion and generates transformation-invariant features. We then perform global matching to find the potential matching pairs. Finally, we prune the outliers by using RANSAC with the homography model. To tolerate parallax, we use a loose inlier threshold of 20 pixels. In Fig. 4, we show our automatic distortion correction results on various GLCs including the pushbroom, the cross-slit, and the pencil cameras. The user inputs
Fig. 4. Automatic Distortion Correction. (a) Perspective reference image; (b), (c), and (d) are distorted images captured from a pushbroom camera, a cross-slit camera, and a pencil camera. (e), (f), and (g) are the distortion-corrected results of (b), (c), and (d) using the automatic algorithm.
a perspective image (Fig. 4(a)) and the corrected GLC images appear nearly undistorted using the optimal collineations (bottom row of Fig. 4).
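A minimal OpenCV sketch of this SIFT-plus-RANSAC matching step, assuming grayscale uint8 images and OpenCV 4.x; the brute-force matcher with cross-checking is our choice, since the paper does not specify the matching strategy beyond "global matching":

    import cv2
    import numpy as np

    def match_features(source_img, reference_img, ransac_thresh=20.0):
        # SIFT features matched between the multiperspective image and the
        # reference perspective image; outliers pruned by RANSAC under a
        # homography model with a loose 20-pixel inlier threshold.
        sift = cv2.SIFT_create()
        k1, d1 = sift.detectAndCompute(source_img, None)
        k2, d2 = sift.detectAndCompute(reference_img, None)
        matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(d1, d2)
        src = np.float32([k1[m.queryIdx].pt for m in matches])
        dst = np.float32([k2[m.trainIdx].pt for m in matches])
        H, mask = cv2.findHomography(src, dst, cv2.RANSAC, ransac_thresh)
        inliers = mask.ravel().astype(bool)
        return src[inliers], dst[inliers]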
4 Correcting Distortions on Catadioptric Mirrors
Next, we show how to correct multiperspective distortions on catadioptric mirrors. Conventional catadioptric mirrors place a pinhole camera at the focus of a hyperbolic or parabolic surface to synthesize a different pinhole camera with a wider field of view [6]. When the camera moves off the focus, the reflection images exhibit complex caustic distortions that are generally difficult to correct [12]. We apply a similar algorithm using multiperspective collineations. Our method is based on the observation that, given any arbitrary multiperspective imaging system that captures a smoothly varying set of rays, we can map the rays onto a 2D ray manifold in the 4D ray space. The characteristics of this imaging system, such as its projection, collineation, and image distortions, can be analyzed by the 2D tangent ray planes, i.e., the GLCs [14]. This implies that a patch on an arbitrary multiperspective image can be locally approximated as a GLC. We first generalize the GLC collineation to arbitrary multiperspective imaging systems. Notice that not all rays in these systems can be parameterized as [σ, τ, u, v] (e.g., some rays may lie parallel to the parametrization plane). Thus, we use the origin ȯ and the direction l to represent each ray r. The collineation Π[ṗ, d1, d2] maps r[ȯ, l] to a pixel [i, j] as:

    [ox, oy, oz] + λ[lx, ly, lz] = ṗ + i d1 + j d2    (9)
Solving for i, j in Equation (9) gives:

    i = [(ly d2^z − lz d2^y)(ox − px) + (lz d2^x − lx d2^z)(oy − py) + (lx d2^y − ly d2^x)(oz − pz)] / γ*
    j = [(lz d1^y − ly d1^z)(ox − px) + (lx d1^z − lz d1^x)(oy − py) + (ly d1^x − lx d1^y)(oz − pz)] / γ*    (10)
Fig. 5. Selecting different feature rays ((a) and (c)) produces different distortion correction results ((b) and (d)). (f) shows the automatic feature matching between a region (blue rectangle) on the spherical mirror and a perspective image. (g) is the final distortion corrected image. The holes are caused by the under-sampling of rays.
where

    γ* = | d1^x  d2^x  −lx |
         | d1^y  d2^y  −ly |
         | d1^z  d2^z  −lz |    (11)

We abbreviate Equation (10) as [i, j] = Col̃_Π(ȯ, l). The user then selects n feature rays from the catadioptric mirror and positions them at target pixels [ik, jk], k = 1 . . . n. Alternatively, they can provide a target perspective image (Fig. 5(f)) and our system will automatically establish feature correspondences using the SIFT-RANSAC algorithm. We then use the Levenberg-Marquardt method (Equation (7)) with sampled initial conditions to find the optimal collineation Col̃_Π.
5 Results
We have experimented with our system on various multiperspective images. We modify the PovRay [18] ray tracer to generate both GLC images and reflected images on catadioptric mirrors. Fig. 3 shows an image of a cross-slit camera in
Fig. 6. Correcting distortions on a spherical mirror. The user selects separate regions on the sphere (a) to get (b) and (d). (c) and (e) are the resulting images by matching the selected features (blue) and target pixels (red) in (b) and (d) using collineations.
which the two slits form an acute angle. The user then selects feature rays (blue) from the GLC image and positions them at desirable pixels (red). Our system estimates the optimal collineation and re-renders the image under this collineation as shown in Fig. 3(b). The distortions in the resulting image are significantly reduced. Next, we apply our algorithm to correct reflection distortions on a spherical mirror shown in Fig. 6. It has been shown [14] that more severe distortions occur near the boundary of the mirror than at the center. Our algorithm robustly corrects both distortions in the center region and near the boundary. In particular, our method is able to correct the highly curved silhouettes of the refrigerator (Fig. 6(d)). The resulting images are rendered by intersecting the rays inside the patch with the collineation plane, and thus contain holes due to the undersampling of rays. Our algorithm can further correct highly complex distortions on arbitrary mirror surfaces. In Fig. 7, we render a reflective horse model of 48,000 triangles at two different poses. Our system robustly corrects various distortions such as stretching, shrinking, and duplicated projections of scene points in the reflected image, and the resulting images appear nearly undistorted. We have also experimented with our automatic correction algorithm on both the GLC models and catadioptric mirrors. In Fig. 4, the user inputs a target perspective image (Fig. 4(a)) and our system automatically matches the feature points between the GLC and the target image. Even though the ray structures in the GLCs are significantly different from those of a pinhole camera, the corrected GLC images appear close to perspective. In Fig. 5(f), a perspective image of a kitchen scene is used to automatically correct distortions on a spherical mirror. This
Fig. 7. Correcting complex distortions on a horse model. We render a reflective horse model under two different poses (a) and (d) and then select regions (b) and (e). (c) and (f) are the resulting images by matching the selected features (blue) and target pixels (red) in (b) and (e) using collineations.
Fig. 8. Correcting reflection distortions. (a) and (c) are two captured reflected images on a mirror sphere. Our algorithm not only reduces multiperspective distortions but also synthesizes strong perspective effects (b) and (d).
implies that our collineation framework has the potential to benefit automatic catadioptric calibration. Finally, we have applied our algorithm to real reflected images of a mirror sphere in a deep scene. We position the viewing camera far away from the sphere so that it can be approximated as an orthographic camera. We then calculate the corresponding reflected ray for each pixel and use our collineation algorithm to correct the distortions. Our system not only reduces multiperspective distortions but also synthesizes strong perspective effects, as shown in Fig. 8.
6 Discussions and Conclusion
We have presented a new framework for correcting multiperspective distortions using collineations. We have shown that image distortions in many previous cameras can be effectively reduced via proper collineations. To find the optimal collineation for a specific multiperspective camera, we have developed an interactive system that allows users to select feature rays from the camera and position them at the desirable pixels. Our system then computes the optimal collineation to match the projections of these rays with the corresponding pixels. Experiments demonstrate that our system robustly corrects complex distortions without acquiring the scene geometry, and the resulting images appear nearly undistorted.
Fig. 9. Comparing collineations with the projective transformation. The user selects feature rays (blue) and target pixels (red). (c) is the result using the optimal collineation. (d) is the result using the optimal projective transformation.
It is important to note that a collineation computes the mapping from a ray to a pixel whereas image warping computes the mapping from a pixel to a pixel. One limitation of using collineations is that we cannot compute the inverse mapping from pixels to rays. Therefore, if the rays in the source camera are undersampled, e.g., in the case of a fixed-resolution image of the catadioptric mirrors, the collineation algorithm produces images with holes. As for future work, we plan to explore using image-based rendering algorithms such as the push-pull method [4] to fill in the holes in the ray space. We have also compared our collineation method with the classical projective transformations. In Fig. 9, we select the same set of feature points (rays) from a reflected image on the horse model. Fig. 9(c) computes the optimal projective transformation and Fig. 9(d) computes the optimal collineation, both using the Levenberg-Marquardt method for fitting the feature points. The optimal collineation result is much less distorted and is highly consistent with the pinhole image while the projective transformation result remains distorted. This is because multiperspective collineation describes a much broader class of warping functions than the projective transformation.
Acknowledgement This work has been supported by the National Science Foundation under grant NSF-MSPA-MCS-0625931.
References
1. Chahl, J., Srinivasan, M.: Reflective surfaces for panoramic imaging. Applied Optics 37(8), 8275–8285 (1997)
2. Chen, S.E.: QuickTime VR – An Image-Based Approach to Virtual Environment Navigation. Computer Graphics 29, 29–38 (1995)
3. Derrien, S., Konolige, K.: Approximating a single viewpoint in panoramic imaging devices. In: International Conference on Robotics and Automation, pp. 3932–3939 (2000)
4. Gortler, S.J., Grzeszczuk, R., Szeliski, R., Cohen, M.F.: The Lumigraph. In: SIGGRAPH 1996, pp. 43–54 (1996)
5. Gupta, R., Hartley, R.I.: Linear Pushbroom Cameras. IEEE Trans. Pattern Analysis and Machine Intelligence 19(9), 963–975 (1997)
6. Nayar, S.K.: Catadioptric Omnidirectional Cameras. In: Proc. CVPR, pp. 482–488 (1997)
7. Pajdla, T.: Stereo with Oblique Cameras. Int'l J. Computer Vision 47(1/2/3), 161–170 (2002)
8. Pajdla, T.: Geometry of Two-Slit Camera. Research Report CTU–CMP–2002–02 (March 2002)
9. Seitz, S., Kim, J.: The Space of All Stereo Images. In: Proc. ICCV, pp. 26–33 (July 2001)
10. Shum, H., He, L.: Rendering with concentric mosaics. Computer Graphics 33, 299–306 (1999)
11. Stein, G.P.: Lens distortion calibration using point correspondences. In: Proc. CVPR, pp. 143–148 (June 1997)
12. Swaminathan, R., Grossberg, M.D., Nayar, S.K.: Caustics of Catadioptric Cameras. In: Proc. ICCV, pp. 2–9 (2001)
13. Swaminathan, R., Grossberg, M.D., Nayar, S.K.: A Perspective on Distortions. In: Proc. IEEE Computer Vision and Pattern Recognition, Wisconsin (June 2003)
14. Yu, J., McMillan, L.: Multiperspective Projection and Collineation. In: Sebe, N., Lew, M.S., Huang, T.S. (eds.) Computer Vision in Human-Computer Interaction. LNCS, vol. 3766, Springer, Heidelberg (2005)
15. Yu, J., McMillan, L.: Modelling Reflections via Multiperspective Imaging. In: Proc. IEEE Computer Vision and Pattern Recognition, San Diego (June 2005)
16. Zomet, A., Feldman, D., Peleg, S., Weinshall, D.: Mosaicing New Views: The Crossed-Slits Projection. IEEE Trans. on PAMI, 741–754 (2003)
17. Zorin, D., Barr, A.H.: Correction of Geometric Perceptual Distortions in Pictures. Computer Graphics 29, 257–264 (1995)
18. POV-Ray: The Persistence of Vision Raytracer, http://www.povray.org/
Camera Calibration from Silhouettes Under Incomplete Circular Motion with a Constant Interval Angle

Po-Hao Huang and Shang-Hong Lai
Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
{even,lai}@cs.nthu.edu.tw
Abstract. In this paper, we propose an algorithm for camera calibration from silhouettes under circular motion with an unknown constant interval angle. Unlike previous silhouette-based methods based on the surface of revolution, the proposed algorithm can be applied to sparse and incomplete image sequences. Under the assumption of circular motion with a constant interval angle, the epipoles of successive image pairs remain constant and can be determined from silhouettes. A pair of epipoles formed by a certain interval angle provides a constraint on the angle and focal length. With more pairs of epipoles recovered, the focal length can be determined as the value that best satisfies the constraints, and the interval angle is determined concurrently. The rest of the camera parameters can be recovered from image invariants. Finally, the estimated parameters are optimized by minimizing the epipolar tangency constraints. Experimental results on both synthetic and real images are shown to demonstrate its performance. Keywords: Circular Motion, Camera Calibration, Shape Reconstruction.
1 Introduction

Reconstructing 3D models from image sequences has been studied for decades [1]. In real applications, for instance 3D object digitization in digital museums, modeling from circular motion sequences is a practical and widely used approach in the computer vision and computer graphics communities. Numerous methods that focus on circular motion have been proposed, and they can be classified into two camps, namely the feature-based [2,3,4,5] and silhouette-based [6,7,8,9] approaches. In the feature-based approaches, Fitzgibbon et al. [2] proposed a method that makes use of the fundamental matrices and trifocal tensors to uniquely determine the rotation angles and determine the reconstruction up to a two-parameter family. Jiang et al. [3,4] further developed a method that avoids the computation of multiview tensors to recover the circular motion geometry by either fitting conics to tracked points in at least five images or computing a plane homography from minimally two points in four images. Cao et al. [5] aimed at the problem of varying focal lengths under circular motion with a constant but unknown rotation angle. However, it is difficult to establish accurate feature correspondences from the image sequences for objects of texture-less, semi-transparent, or reflective materials, such as jade. Instead of feature correspondences, the silhouette-based approach
integrates the object contours to recover the 3D geometry. In [6], Mendonca and Cipolla addressed the problem of estimating the epipolar geometry from apparent contours under circular motion. Under the assumption of a constant rotation angle, there are only two common epipoles of successive image pairs that need to be determined. The relation between the epipolar tangencies and the image of the rotation axis is used to define a cost function. Nevertheless, the initialization of the epipole positions can influence the final result and make the algorithm converge to a local minimum. In [7], Mendonca et al. exploited the symmetry properties of the surface of revolution (SoR) swept out by the rotating object to obtain an initial guess of the image invariants, followed by several one-dimensional searching steps to recover the epipolar geometry. Zhang et al. [8] further extended this method to achieve auto-calibration. The rotation angle is estimated from three views, which sometimes results in inaccurate estimation. In [9], they formulated the circular motion as 1D camera geometry to achieve more robust motion estimation. Most of the silhouette-based methods are based on the SoR to obtain an initial guess of the image invariants, thus making them infeasible when the image sequence is sparse (interval angle larger than 20 degrees [7]) or incomplete. In this paper, we propose an algorithm for camera calibration from silhouettes of an object under circular motion with a sparse and incomplete sequence. In our approach, we first use the same cost function as proposed in [6] to determine the epipoles of successive image pairs from silhouettes. Thus, a constant interval angle is the main assumption of our algorithm. In addition, we propose a method for initializing the positions of the epipoles, which is important in practice. A pair of epipoles formed by a certain interval angle can provide a constraint on the angle and focal length. With more pairs of epipoles recovered, the focal length can therefore be determined as the one that best satisfies these constraints, and the angle is determined concurrently. After obtaining the camera intrinsic parameters, the rotation matrix about the camera center can be recovered from the image invariants up to two sign ambiguities, which can be further resolved by making the sign of the back-projected epipoles consistent with the camera coordinate system. Finally, the epipolar tangency constraints for all pairs of views are minimized to refine the camera parameters, using all determined parameters as an initial guess in the nonlinear optimization process. The remainder of this paper is organized as follows. Section 2 describes the image invariants under circular motion. Section 3 describes the epipolar tangency constraints and explains how to extract epipoles from contours. The estimation of camera parameters is described in Section 4. Experimental results on both synthetic and real data are given in Section 5. Finally, we conclude this paper in Section 6.
2 Image Invariants Under Circular Motion

The geometry of circular motion is illustrated in Fig. 1(a). A camera C rotates about an axis Ls, and its track forms a circle on a plane Πh that is perpendicular to Ls and intersects it at the circle center Xs. Without loss of generality, we assume the world coordinate system to be centered at Xs with the Y-axis along Ls and C placed on the negative Z-axis. If the camera parameters are kept constant under circular motion, the
Fig. 1. (a) The geometry of circular motion. (b) The image invariants under circular motion.
image Πi of C will contain invariant entities of the geometry as shown in Fig. 1(b). Line lh (ls) is the projection of Πh (Ls). The three points vx, vy, and xs are the vanishing points of the X-, Y-, and Z-axes, respectively. Similar descriptions of the image invariants can also be found in [2,3,4,8]. Let the camera intrinsic parameters and the rotation matrix about the camera center, which will be referred to as the camera pose in the rest of this paper, be denoted as K and R, respectively. The camera projection matrix P can be written as:

    P = KR [R_y(θ) | −C],    (1)
where R = [r1 r2 r3], R_y(θ) is the rotation matrix about Ls with angle θ, and C = [0 0 −t]^T. In mathematical expression, the three points can be written as:

    [vx  vy  xs] ~ KR = K [r1  r2  r3],    (2)
where the symbol "~" denotes the equivalence relation in homogeneous coordinates.
3 Epipolar Geometry from Silhouettes

Epipoles can be obtained by computing the null vectors of the fundamental matrix when feature correspondences between two views are available. However, from silhouettes alone, it takes more effort to determine the epipoles. In this section, the relationship between the epipoles and the silhouettes is discussed.

3.1 Constraints on Epipoles and Silhouettes Under Circular Motion

In two-view geometry, a frontier point is the intersection of contour generators, and its projection is located at an epipolar tangency of the object contour, as shown in Fig. 2(a). Hence, the tangent points (lines) induced by the epipoles can be regarded as corresponding points (epipolar lines). In addition, as mentioned in [6], under circular motion the intersections of corresponding epipolar lines lie on ls when two views are put in the same image, as shown in Fig. 2(b). This property provides constraints on epipoles and silhouettes. In [6], the cost function is defined as the distance between the intersections of corresponding epipolar
Fig. 2. (a) Frontier point and the epipolar tangencies. (b) Epipolar tangencies and ls under circular motion. (c) Epipolar tangency constraints.
lines and ls. In general, a pair of views has two epipoles with four unknowns but provides only two constraints (intersections), which are not enough to uniquely determine the answer. Therefore, they assume the interval angle of adjacent views is kept constant, thus reducing the number of epipoles to be estimated to only two, with four unknowns. Given the epipoles, ls can be determined by line fitting the intersections. In their method, with appropriate initialization of the epipoles, the cost function is iteratively minimized to determine the epipoles (see [6] for details).

3.2 Initialization of Epipoles

In [6], they only showed experiments on synthetic data. In practice, it is crucial to obtain good initial positions of the epipoles. In our algorithm, the assumption of a constant interval angle is also adopted to reduce the unknowns. When taking an image sequence in a turn-table environment, the camera pose is usually close to the form R = Rz(0)Ry(0)Rx(θx) = Rx(θx); therefore the harmonic homography derived in [7] reduces to a bilateral symmetry as follows:

    W = I − 2 K r1 r1^T K⁻¹ =
        [ −1  0  2u0 ]
        [  0  1   0  ]
        [  0  0   1  ],    (3)
where r1 = [1 0 0]^T and u0 is the x-coordinate of the optical center. Assuming the optical center coincides with the image center, we can obtain a rough harmonic homography from (3). Using this harmonic homography, the initial positions of the epipoles can be obtained from the epipole estimation step described in [7]. In fact, given contours C1 and C2 and a harmonic homography W, the corresponding epipole e1 (e2) can be directly located from the bi-tangent lines of the contours WC1 and C2 (C1 and WC2) without performing several one-dimensional searching steps as in [7]. Note that here WC means the contour C transformed by W.

3.3 Epipolar Tangency Constraints

In the silhouette-based approach, the most common energy function for measuring the model is the epipolar tangency constraint, which can be illustrated in Fig. 2(c). In
Fig. 2(c), a pair of contours is put on the same image. The epipole e1 (e2) is the projection of C1 (C2) onto the camera C2 (C1), and x1 (x2) is the tangent point induced by epipole e2 (e1) with tangent line t1 (t2). As mentioned in Section 3.1, the tangent points x1 and x2 are considered corresponding points. The dashed line l1 (l2) is the corresponding epipolar line of x2 (x1). Ideally, l1 (l2) should be the same as t1 (t2). Let the projection matrix of camera C1 (C2) be denoted P1 (P2). The error associated with the epipolar tangency constraints can be written as:
    err(x1, x2, P1, P2) = d(x1, l1) + d(x2, l2),    (4)
where the function d(·,·) gives the Euclidean distance from a point to a line, l1 = (P1 P2⁺ x2) × e2, l2 = (P2 P1⁺ x1) × e1, and P⁺ is the pseudo-inverse of the projection matrix. Given a set of silhouettes S and its corresponding projection matrices P, the overall cost function can be written as:
    Cost(P, S) = Σ_{(Si,Sj)∈Sp} Σ_{(xa,xb)∈Tpi,j} err(xa, xb, Pi, Pj),    (5)
where the set Sp contains all contour pairs that are used in the cost function, and Tpi,j is the set of tangent points induced by epipoles (of Pi and Pj) with contours (Si and Sj).
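A direct transcription of Equations (4) and (5) in Python (points and epipoles as homogeneous 3-vectors, projection matrices 3x4; following Fig. 2(c), e2 lies in image 1 and e1 in image 2):

    import numpy as np

    def point_line_distance(x, l):
        # Euclidean distance from homogeneous point x to homogeneous line l.
        return abs(np.dot(l, x / x[2])) / np.linalg.norm(l[:2])

    def epipolar_tangency_error(x1, x2, P1, P2, e1, e2):
        # Equation (4): l1 is the epipolar line of x2 in image 1, l2 that of x1
        # in image 2.
        l1 = np.cross(P1 @ np.linalg.pinv(P2) @ x2, e2)
        l2 = np.cross(P2 @ np.linalg.pinv(P1) @ x1, e1)
        return point_line_distance(x1, l1) + point_line_distance(x2, l2)

    def total_cost(pairs):
        # Equation (5): pairs is a list of (x1, x2, P1, P2, e1, e2) tuples, one
        # per tangent-point pair over all contour pairs in Sp.
        return sum(epipolar_tangency_error(*args) for args in pairs)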
4 Camera Calibration

In the previous section, the method to extract epipoles from silhouettes under circular motion with a constant interval angle was presented. In this section, we describe how to compute the camera parameters from the epipoles. For simplicity, the camera is assumed to have zero skew, unit aspect ratio, and principal point at the image center, which is a reasonable assumption for current cameras.

4.1 Recovery of Focal Length and Interval Angle

The geometry of circular motion under a constant interval angle is illustrated in Fig. 3(a). In Fig. 3(a), Xs is the circle center, and cameras are distributed on the circle with a constant interval angle θ. With a certain interval angle, a pair of determined epipoles
Fig. 3. (a) Circular motion with a constant interval angle. (b) The image of one camera.
can provide a constraint on the angle and focal length. For instance, for the image of C3 as shown in Fig. 3(b), epipoles formed by the angle θ are e2 and e4. Therefore, we can derive a relationship of the angle and focal length with epipoles as follows:
    θ = π − ang(K⁻¹e2, K⁻¹e4),    (6)
where the function ang(·,·) gives the angle between two vectors. In Equation (6), we have one constraint but two unknowns, the angle θ and the focal length, so the constraint alone is not enough to determine them. Recall that the image sequence is taken with a constant interval angle. Take Fig. 3(a) for example: C1-C2-C3-C4-C5 is an image sequence with a constant interval angle θ. Also, C1-C3-C5 is an image sequence with a constant interval angle 2θ, which can provide another constraint as follows:
    2θ = π − ang(K⁻¹e1, K⁻¹e5).    (7)
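As a numerical illustration of how Equations (6) and (7) jointly constrain the two unknowns, the sketch below evaluates the implied angles for a hypothesized focal length (zero skew, unit aspect ratio, principal point (u0, v0), as assumed in the paper); the search procedure actually used is described in the following paragraph, where the focal length is instead obtained from Equation (6) in closed form.

    import numpy as np

    def angle_from_epipoles(ea, eb, f, u0, v0):
        # Equation (6)/(7): the angle implied by a pair of epipoles (homogeneous
        # 3-vectors) for focal length f and principal point (u0, v0).
        K_inv = np.array([[1.0 / f, 0.0, -u0 / f],
                          [0.0, 1.0 / f, -v0 / f],
                          [0.0, 0.0, 1.0]])
        ra, rb = K_inv @ ea, K_inv @ eb
        cos_ang = np.dot(ra, rb) / (np.linalg.norm(ra) * np.linalg.norm(rb))
        return np.pi - np.arccos(np.clip(cos_ang, -1.0, 1.0))

    def constraint_residual(theta, f, u0, v0, e2, e4, e1, e5):
        # How well a hypothesized (theta, f) satisfies both constraints: theta
        # from the (e2, e4) pair and 2*theta from the (e1, e5) pair (Fig. 3).
        return (abs(angle_from_epipoles(e2, e4, f, u0, v0) - theta)
                + abs(angle_from_epipoles(e1, e5, f, u0, v0) - 2.0 * theta))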
Two pairs of epipoles are sufficient to determine the unknowns. With more pairs of epipoles formed by different interval angles recovered, more constraints similar to Equations (6) and (7) can be applied to precisely determine the focal length. In our implementation, a linear search on the angle θ is performed to find the value that best satisfies these constraints. For instance, given an angle θ, the focal length can be determined by solving a quadratic equation derived from Equation (6). Substituting the estimated focal length into the right-hand side of Equation (7) yields the difference between the left-hand and right-hand sides of Equation (7). The interval angle and focal length are therefore determined as the pair that best satisfies these constraints.

4.2 Recovery of Image Invariants

From the extracted epipoles, lh can be computed by line fitting these epipoles, and ls is determined concurrently in the epipole extraction stage. Then, xs is the intersection of lh and ls. After the camera intrinsic matrix K is obtained, vx can be computed from the pole-polar relationship [1], i.e., vx ~ K K^T ls.

4.3 Recovery of Camera Pose

From Equation (2), with the camera intrinsic parameters and image invariants known, the camera pose R can be computed up to two sign ambiguities as follows:
    r1 = α × norm(K⁻¹vx),  α = ±1
    r3 = β × norm(K⁻¹xs),  β = ±1
    r2 = r3 × r1    (8)
where the function norm(·) normalizes a vector to unit norm. Notice that the sign of the rotation axis makes no difference for projection into image coordinates, but back-projection of image points leads to a sign ambiguity. This ambiguity can be resolved by back-projecting the epipole obtained from the image and checking its sign against the corresponding camera position, which is transformed from the world coordinate system to the camera coordinate system using the determined camera pose R. Because the camera position in the world
coordinate system is independent of the camera pose R, we can still recover the camera position in the presence of the rotation ambiguity. Furthermore, the Gram-Schmidt process is applied to obtain an orthogonal basis.

4.4 Summarization of the Proposed Algorithm

INPUT: n object contours, S1 to Sn, under circular motion with an unknown constant interval angle.
OUTPUT: Camera parameters and 3D model.
1. Choose a frame interval Δv; the contours Sv and Sv+Δv are considered as a contour pair for determining the two epipoles formed by the interval, where v = 1...(n−Δv).
2. Initialize the two common epipoles by the method described in Section 3.2.
3. Extract the epipoles with a nonlinear minimization step as described in Section 3.1.
4. Choose different frame intervals and repeat steps 1-3 to extract more epipoles.
5. Use the epipoles extracted from steps 1-4 as initial guesses and perform the nonlinear minimization step again to uniquely determine ls.
6. Recover the camera parameters as described in Section 4.
7. Set the projection matrices according to Equation (1) as initial guesses.
8. Minimize the overall epipolar tangency error as described in Section 3.3.
9. Generate the 3D model using the image-based visual hull technique [10].
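Step 6 above recovers the camera parameters; the pose-recovery part (Equation (8) of Section 4.3) can be sketched as follows, with the sign flags alpha and beta assumed to be resolved beforehand by the back-projection check described earlier (the function name is ours):

    import numpy as np

    def camera_pose_from_invariants(K, v_x, x_s, alpha=1, beta=1):
        # Equation (8): build R = [r1 r2 r3] from the vanishing point v_x and
        # the point x_s (both homogeneous 3-vectors), up to the signs alpha, beta.
        def norm(v):
            return v / np.linalg.norm(v)
        r1 = alpha * norm(np.linalg.solve(K, v_x))
        r3 = beta * norm(np.linalg.solve(K, x_s))
        # Gram-Schmidt step so that r3 is orthogonal to r1 before the cross product.
        r3 = norm(r3 - np.dot(r3, r1) * r1)
        r2 = np.cross(r3, r1)
        return np.column_stack([r1, r2, r3])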
5 Experimental Results

In this section, we present experimental results of applying the proposed silhouette-based algorithm to reconstruct 3D object models from sparse and incomplete image sequences under circular motion, on both synthetic and real data sets.
Fig. 4. (a) Experimental images for reconstructing the bunny model. (b) Intersections of corresponding epipolar lines and the estimated rotation axis ls before and after minimization.

Table 1. Accuracy of the recovered camera parameters

error   Δe (%)   Δθx (°)   Δθy (°)   Δθz (°)   Δθi (°)   Δf (%)
avg.    0.565    0.301     0.046     0.080     0.279     1.822
std.    0.270    0.315     0.038     0.059     0.244     1.484
5.1 Synthetic Data

In this part, we used the Stanford bunny model to randomly generate 100 synthetic data sets to test the algorithm. Each set contains 12 images of size 800×600 pixels with an interval angle θi = 30°, which means each sequence is sparsely sampled and the methods based on SoR will fail. Example images of one set are depicted in Fig. 4(a). The focal length f is generated in the range 1500–5000 pixels, and the three camera pose angles θx, θy, and θz lie within −10° to −50°, −5° to 5°, and −5° to 5°, respectively. Two different frame intervals, 2 and 3, are chosen to extract the epipoles. The comparison between the recovered parameters and the ground truth is listed in Table 1. In Table 1, the angle errors are in degrees. The error of the focal length is expressed as a percentage, i.e., the difference divided by the ground truth. The error of the epipoles is expressed as the difference divided by the distance from the ground-truth epipole to the image center. The experimental results show that the proposed algorithm provides a good initial guess for the camera parameter optimization. Fig. 4(b) shows an example result of the epipole extraction stage before and after iterative minimization. The dashed line is the initial ls fitted to the 'x' points, which are the intersections induced by the initial epipoles as described in Fig. 2(b). The solid line is the estimated ls after minimization, and the intersections ('o' points) are close to ls. The obtained ls is very close to the ground truth, shown as a dash-dot line in the enlarged figure.

5.2 Real Data

In the experiments on real data, two image sequences are used; example images are shown in Fig. 5. Fig. 5 (top) is the Oxford dinosaur sequence, which contains 36 images of size 720×576 pixels. Fig. 5 (bottom) is a sequence of a jadeite object that contains 36 images of size 2000×1303 pixels; it is very difficult to establish feature correspondences for this kind of material. In both sequences, only the silhouette information is used for reconstruction. Different views of the reconstructed models are shown in Fig. 6 and Fig. 7, respectively. After the overall optimization, the RMS errors of the recovered interval angles in the two sequences are 0.192° and 0.247°, respectively. In addition, when only the first 18 images of a sequence are used, i.e., the image sequence is incomplete, the estimated results are similar. Due to space limitations, we cannot give the details of these experimental results.
Fig. 5. Example images of (top) the Oxford dinosaur sequence and (bottom) the jadeite sequence
Fig. 6. Different views of the reconstructed Oxford dinosaur model
Fig. 7. Different views of the reconstructed jadeite model
6 Conclusion

In this paper, we propose a novel silhouette-based algorithm for camera calibration and 3D reconstruction from sparse and incomplete image sequences of objects under circular motion with an unknown but constant interval angle. Unlike previous silhouette-based methods, the proposed algorithm requires neither dense image sequences nor known camera intrinsic parameters. Under the assumption of a constant interval angle, the epipoles of successive images remain constant and can be determined from silhouettes by a nonlinear optimization process. With more pairs of epipoles recovered from silhouettes, constraints on the interval angle and focal length can be formed to determine the camera parameters. Experimental results on synthetic and real data sets are presented to demonstrate the performance of the proposed algorithm.

Acknowledgments. This work was supported by the National Science Council, Taiwan, under grant NSC 95-2221-E-007-224.
References

1. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000)
2. Fitzgibbon, A.W., Cross, G., Zisserman, A.: Automatic 3D Model Construction for Turn-Table Sequences. In: Proceedings of SMILE Workshop on 3D Structure from Multiple Images of Large-Scale Environments, pp. 155–170 (1998)
3. Jiang, G., Tsui, H.T., Quan, L., Zisserman, A.: Single Axis Geometry by Fitting Conics. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 1343–1348 (2002)
4. Jiang, G., Quan, L., Tsui, H.T.: Circular Motion Geometry Using Minimal Data. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 721–731 (2004)
5. Cao, X., Xiao, J., Foroosh, H., Shah, M.: Self-calibration from Turn-table Sequences in Presence of Zoom and Focus. Computer Vision and Image Understanding 102, 227–237 (2006)
6. Mendonca, P.R.S., Cipolla, R.: Estimation of Epipolar Geometry from Apparent Contours: Affine and Circular Motion Cases. In: Proceedings of Computer Vision and Pattern Recognition, pp. 9–14 (1999)
7. Mendonca, P.R.S., Wong, K.-Y.K., Cipolla, R.: Epipolar Geometry from Profiles under Circular Motion. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 604–616 (2001)
8. Zhang, H., Zhang, G., Wong, K.-Y.K.: Auto-Calibration and Motion Recovery from Silhouettes for Turntable Sequences. In: Proceedings of British Machine Vision Conference, pp. 79–88 (2005)
9. Zhang, G., Zhang, H., Wong, K.-Y.K.: 1D Camera Geometry and Its Application to Circular Motion Estimation. In: Proceedings of British Machine Vision Conference, pp. 67–76 (2006)
10. Matusik, W., Buehler, C., Raskar, R., Gortler, S.J., McMillan, L.: Image-Based Visual Hulls. In: Proceedings of SIGGRAPH, pp. 369–374 (2000)
Mirror Localization for Catadioptric Imaging System by Observing Parallel Light Pairs Ryusuke Sagawa, Nobuya Aoki, and Yasushi Yagi Institute of Scientific and Industrial Research, Osaka University, 8-1 Mihogaoka, Ibaraki-shi, Osaka, 567-0047, Japan {sagawa,aoki,yagi}@am.sanken.osaka-u.ac.jp
Abstract. This paper describes a method of mirror localization to calibrate a catadioptric imaging system. While the calibration of a catadioptric system includes the estimation of various parameters, we focus on the localization of the mirror. The proposed method estimates the position of the mirror by observing pairs of parallel lights, which are projected from various directions. Although some earlier methods for calibrating catadioptric systems assume that the system is single viewpoint, which is a strong restriction on the position and shape of the mirror, our method does not restrict the position and shape of the mirror. Since the constraint used by the proposed method is that the relative angle of two parallel lights is constant with respect to the rigid transformation of the imaging system, we can omit both the translation and rotation between the camera and calibration objects from the parameters to be estimated. Therefore, the estimation of the mirror position by the proposed method is independent of the extrinsic parameters of a camera. We compute the error between the model of the mirror and the measurements, and then estimate the position of the mirror by minimizing this error. We test our method using both simulation and real experiments, and evaluate the accuracy thereof.
1 Introduction

For various applications, e.g. robot navigation, surveillance and virtual reality, a special field of view is desirable to accomplish the task. For example, omnidirectional imaging systems [1,2,3] are widely used in various applications. One of the main methods to obtain a special field of view is to construct a catadioptric imaging system, which observes rays reflected by mirrors. By using various shapes of mirrors, different fields of view are easily obtained. There are two types of catadioptric imaging systems: central and noncentral. The former has a single effective viewpoint, and the latter has multiple ones. Though central catadioptric systems have the advantage that the image can be transformed to a perspective projection image, they place strong restrictions on the shape and position of the mirror. For example, it is necessary to use a telecentric camera and a parabolic mirror whose axis is parallel to the axis of the camera. Thus, misconfiguration can be the reason that a catadioptric system is not a central one. To obtain more flexible fields of view, several noncentral systems [4,5,6,7,8] have been proposed for various purposes. For geometric analysis with catadioptric systems, it is necessary to calibrate both camera and mirror parameters. Several methods of calibration have been proposed for
central catadioptric systems. Geyer and Daniilidis [9] have used three lines to estimate the focal length, mirror center, etc. Ying and Hu [10] have used lines and spheres to calibrate the parameters. Mei and Rives [11] have used a planar marker to calibrate the parameters, based on the calibration of a perspective camera [12]. However, since these methods assume that the system has a single viewpoint, they cannot be applied to noncentral systems. On the other hand, several methods have also been proposed to calibrate noncentral imaging systems. Aliaga [13] has estimated the parameters of a catadioptric system with a perspective camera and a parabolic mirror using known 3D points. Strelow et al. [14] have estimated the position of a misaligned mirror using known 3D points. Micusík and Pajdla [15] have fitted an ellipse to the contour of the mirror and calibrated a noncentral camera by approximating it with a central camera. Mashita et al. [16] have used the boundary of a hyperboloidal mirror to estimate the position of a misaligned mirror. However, all of these methods are restricted to omnidirectional catadioptric systems. There are also some approaches for calibrating more general imaging systems. Swaminathan et al. [17] computed the parameters of noncentral catadioptric systems by estimating a caustic surface from known camera motion and the point correspondences of unknown scene points. Grossberg and Nayar [18] proposed a general imaging model and computed the ray direction for each pixel using two planes. Sturm and Ramalingam [19] calibrated the camera of a general imaging model by using unknown camera motion and a known object. Since these methods estimate both the internal and external parameters of the system, measurement error affects the estimates of all of the parameters.

In this paper, we focus on the localization of the mirror in the calibration of catadioptric systems. Our assumptions about the other parameters are as follows:
– The intrinsic parameters, such as the focal length and principal point of the camera, are known.
– The shape of the mirror is known.
The only remaining parameters to be estimated are the translation and rotation of the mirror with respect to the camera. If we calibrate the parameters of an imaging system by observing some markers, it is necessary to estimate the extrinsic parameters, such as rotation and translation, with respect to the marker. If we include these as parameters to be estimated, the calibration results are affected by them. We previously proposed a method to localize a mirror by observing a parallel light [20], which estimates the mirror parameters independently of the extrinsic parameters. Since the translation between a marker and the camera is omitted from the estimation, this method reduces the number of parameters. The method, however, needs a rotation table to observe a parallel light from various directions. Instead of using a rotation table, the method proposed in this paper observes pairs of parallel lights as calibration markers. We can therefore omit both rotation and translation from the estimation and reduce the number of parameters that are affected by measurement error in the calibration.

We describe the geometry of the projection of two parallel lights in Section 2. Next, we propose an algorithm for mirror localization using pairs of parallel lights in Section 3. We test our method in Section 4 and finally summarize this paper in Section 5.
Fig. 1. Projecting a parallel light onto a catadioptric imaging system

Fig. 2. Projecting a pair of parallel lights with two different camera positions and orientations
2 Projecting a Pair of Parallel Lights onto a Catadioptric Imaging System

In this section, we first explain the projection of a parallel light, which depends only on the rotation of a camera. Next, we describe the projection of a pair of parallel lights and the constraint on the relative angle between them.

2.1 Projecting a Parallel Light

First, we explain the projection of a parallel light. Figure 1 shows the projection of a parallel light onto a catadioptric system. Since a parallel light is not a single ray, but a bunch of parallel rays, such as sunlight, it illuminates the whole catadioptric system. v is the vector of the incident parallel light. m is the vector at the point onto which the light is projected. m is computed as follows:

m = K⁻¹ p̂,    (1)

where p̂ = (px, py, 1) is the point onto which the light is projected in the homogeneous image coordinate system. K is a 3×3 matrix that represents the intrinsic parameters of the camera. Although the incident light is reflected at every point on the mirror surface where the mirror is illuminated, the reflected light must go through the origin of the camera to be observed. Since the angle of the incident light is the same as that of the reflected light, the camera only observes the ray reflected at a point x. Therefore, the equation of projection becomes

−v = m/‖m‖ + 2 (NR,t(x) · (−m/‖m‖)) NR,t(x),    (2)

where NR,t(x) is the normal vector of the mirror surface at the point x. R and t are the rotation and translation, respectively, of the mirror relative to the camera.

2.2 Projecting a Pair of Parallel Lights

Since the direction of the incident parallel light is invariant even if it is observed from different camera positions, the direction of the light relative to the camera depends only
on the orientation of the camera. Now, if we observe two parallel lights simultaneously, the relative angle between these parallel lights does not change irrespective of the camera orientation. Figure 2 shows a situation in which a pair of parallel lights is projected onto a catadioptric system from two different camera positions and orientations. The relative position of the mirror is fixed to the camera. The two parallel lights are reflected at the points x1, x2, x2′ and x1′, respectively, and the reflected rays are projected onto the points m1, m2, m2′ and m1′ in the image plane. Since the relative angle between the pair of parallel lights is invariant, we obtain the following constraint:

v1 · v2 = v1′ · v2′,    (3)

where v1′ and v2′ are represented in a different camera coordinate system from v1 and v2, all of which are computed by (2).
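To make equations (1)–(3) concrete, the following sketch computes the incident vector for one image point and notes the relative-angle check of a pair. It is an illustration of ours only (NumPy); the surface intersection point and its unit normal N are assumed to be already known for the given pixel, and the names are not from the paper.

import numpy as np

def incident_vector(p, K, N):
    """Eqs. (1)-(2): viewing ray m = K^-1 p_hat for pixel p = (px, py),
    reflected about the unit mirror normal N at the intersection point."""
    m = np.linalg.inv(K) @ np.array([p[0], p[1], 1.0])
    u = m / np.linalg.norm(m)
    return -(u + 2.0 * np.dot(N, -u) * N)      # solve Eq. (2) for v

# Constraint (3): the dot product of the two incident vectors of a pair stays
# the same no matter how the whole camera-mirror rig is oriented, e.g.
# v1, v2 = incident_vector(p1, K, N1), incident_vector(p2, K, N2)
# assert abs(np.dot(v1, v2) - np.cos(alpha)) < tol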
3 Mirror Localization Using Pairs of Parallel Lights

This section describes an algorithm to estimate the mirror position by observing pairs of parallel lights.

3.1 Estimating Mirror Position by Minimizing Relative Angle Error

By using the constraint (3), we estimate the mirror position by minimizing the following cost function:

E1 = Σ_i ‖v_i1 · v_i2 − cos αi‖²,    (4)

where i is the index of the pair and αi is the angle of the i-th pair. If we do not know the angle between the parallel lights, we can use

E2 = Σ_{i≠j} ‖v_i1 · v_i2 − v_j1 · v_j2‖².    (5)

The parameters of these cost functions are R and t, which are the rotation and translation, respectively, of the mirror relative to the camera. Since minimizing (4) or (5) is a nonlinear minimization problem, we estimate R and t by a nonlinear minimization method, such as the Levenberg-Marquardt algorithm. Our algorithm can then be described as follows:

1. Set initial parameters of R and t.
2. Compute the intersecting point x for each image point m.
3. Compute the normal vector NR,t(x) for each intersecting point x.
4. Compute the incident vector v for each intersecting point x.
5. Compute the cost function (4) or (5).
6. Update R and t by a nonlinear minimization method.
7. Repeat steps 2–6 until convergence.
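A compact way to realize steps 1–7 is to hand the residuals of Eq. (4) or Eq. (5) to an off-the-shelf Levenberg-Marquardt solver. The sketch below is an assumption-laden illustration, not the authors' implementation: it reuses the hypothetical incident_vector() helper from Section 2, the mirror object with its normal_at() lookup is a placeholder for the ray-mirror intersection of steps 2 and 3, and the large-penalty handling of rays that miss the mirror is omitted.

import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def residuals(params, pairs, K, mirror, cos_alpha=None):
    # params: 3 axis-angle rotation + 3 translation parameters of the mirror
    R = Rotation.from_rotvec(params[:3]).as_matrix()
    t = params[3:]
    dots = []
    for p1, p2 in pairs:
        N1 = mirror.normal_at(p1, R, t, K)          # steps 2-3 (placeholder)
        N2 = mirror.normal_at(p2, R, t, K)
        v1, v2 = incident_vector(p1, K, N1), incident_vector(p2, K, N2)
        dots.append(np.dot(v1, v2))
    dots = np.asarray(dots)
    if cos_alpha is not None:                       # E1, Eq. (4)
        return dots - cos_alpha
    i, j = np.triu_indices(len(dots), k=1)          # E2, Eq. (5)
    return dots[i] - dots[j]

# result = least_squares(residuals, x0, method='lm',
#                        args=(pairs, K, mirror, cos_alpha))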
Fig. 3. Two collimators generate a pair of parallel lights. Each collimator consists of a light source, a pinhole and a concave parabolic mirror.
In the current implementation, the initial parameters are given by the user. We set them so that every image point m has an intersecting point x. As described in Section 3.2, computing the intersecting points is expensive if the mirror surface is represented by a mesh model. Therefore, we describe a GPU-based method for steps 2–4 that directly computes the incident vectors to reduce the computational time. For updating the parameters, we numerically compute the derivatives required by the Levenberg-Marquardt algorithm. To ensure that every image point keeps an intersecting point, if an image point has no intersecting point we penalize it with a large value instead of evaluating (4) or (5).

3.2 Computing the Incident Vector

The important step in this algorithm is the computation of the incident vector v, for which there are two methods. The first computes x by solving a system of equations. If the mirror surface is represented as a parametric surface, x is obtained by simultaneously solving the equations of the viewing ray and the mirror surface, because the intersecting point x lies on both. Once x is computed, the normal vector NR,t(x) is obtained from the cross product of two tangential vectors of the mirror surface at x, and the incident vector v is then computed by (2). However, solving the simultaneous equations is expensive if the mirror surface has an intricate shape or is non-parametric. If the mirror surface is represented as a mesh model, it is necessary to search for the intersecting point of each image point by solving the equations for each facet of the model.

To accommodate any mirror shape, the second method computes x by projecting the mirror shape onto the image plane of the camera using R, t and the intrinsic parameters K. Since this operation is equivalent to rendering the mirror shape onto the image plane, it can be executed easily using computer graphics techniques if the mirror shape is approximated by a mesh model. Furthermore, with recent graphics hardware the incident vector v can be computed directly during the rendering process. The source code to compute v for every pixel is shown in Appendix A.

3.3 Generating a Pair of Parallel Lights

Our proposed method requires the observation of parallel lights. A parallel light can be obtained by one of the following two approaches:
– Use a feature point of a distant marker.
– Generate a collimated light.
In the former approach, a small translation of the camera can be ignored because it is much smaller than the distance to the marker. Thus, the ray vector from the feature point is invariant even if the camera moves. The issue with this approach is lens focus: when the focus setting of the camera is not at infinity, the image must be acquired with a minimum aperture and a long shutter time to avoid blur. Instead of using distant points to obtain two parallel lights, vanishing points can be used; some methods [21,22,23] were proposed along these lines for the calibration of a perspective camera. In the latter approach, a parallel light is generated by a collimator. A simple method is to use a concave parabolic mirror and a point-light source. Figure 3 shows an example of such a system. By placing pinholes in front of the light sources, they become point-light sources, and since the pinholes are placed at the foci of the parabolic mirrors, the reflected rays are parallel. The illuminated area is indicated in yellow in the figure. The advantage of this approach is that a small and precise system can be constructed, although an optical apparatus is required.
4 Experiments

4.1 Estimating Accuracy by Simulation

We first evaluate the accuracy of our method by simulation. In this simulation, we estimate the position of a parabolic mirror relative to a perspective camera. The intrinsic parameter matrix K of the perspective camera is represented as

K = [ f  0  cx
      0  f  cy
      0  0  1 ].    (6)

The shape of the mirror is represented as z = (x² + y²)/(2h), where h is the radius of the paraboloid. In this experiment, the image size is 512×512 pixels and f = 900, cx = cy = 255 and h = 9.0. The ground truths of the rotation and translation of the mirror are R = I and t = (0, 0, 50), respectively. We tested two relative angles between the two incident parallel lights, namely 30 and 90 degrees. 24 pairs of incident lights are used, generated by rotating the camera and mirror around the y- and z-axes. We estimate R and t after adding noise to the positions of the input points. The added Gaussian noise has standard deviations of 0, 0.1, 0.5, and 1.0 pixels. As for E1, since the relative angle α between the two lights has to be given, we also add noise to α with standard deviations of 0, 0.1, and 0.5 degrees. To evaluate the accuracy of the estimated parameters, we compute the root-mean-square (RMS) errors between the input points and the reprojections of the incident lights.

Figure 4 shows the RMS errors of E1 and E2. It is clear that the results obtained with a relative angle of 90 degrees are better than those for 30 degrees. A reason for this may be that the constraint is weaker when the relative angle is smaller and the projected points are close to each other. The error depends mainly on the noise of the input points, while the effect of the noise of the relative angle is small. Since the accuracy of E2 is similar to that of E1, we can apply our method even if we do not know the relative angle. Next, we evaluate the error when the intrinsic parameter matrix K differs from the ground truth. Figure 5 shows the RMS errors of E1 with varying values of f and cx. The
Fig. 4. The RMS errors with respect to the noise of image points

Fig. 5. The RMS errors with respect to the error of the intrinsic parameters
Fig. 6. Compound parabolic mirrors attached to a camera

Fig. 7. An example image from compound parabolic mirrors
other parameters are fixed to the ground truth. The horizontal axis is the difference between the ground truth and f or cx. The results show that the reprojection error of the incident lights is significantly affected by cx, while the effect of f is small. This shows that the principal point (cx, cy) must be computed accurately before minimizing E1, and that an error in f is more tolerable than an error in the principal point.

4.2 Localizing Mirrors from Real Images

In the next experiment, we compute the mirror positions of a catadioptric system with compound parabolic mirrors [24,25], as shown in Figure 6. Figure 7 shows an example of an image obtained with this system. Our system has 7 parabolic mirrors and a perspective camera (PointGrey Scorpion), which has 1600 × 1200 pixels and a field of view of about 22.6°. The lens distortion is calibrated by the method described in [26], and the intrinsic parameters of the camera are already calibrated. With this setup, the catadioptric system is not a single-viewpoint system. The radii h of the center mirror and the side mirrors are 9.0mm and 4.5mm, respectively. The diameter and height of the center mirror are 25.76mm and 9.0mm, respectively, and the diameter and height of the side mirrors are 13.0mm and 4.5mm, respectively. The diameters of the center and side mirrors projected onto the image are 840 and 450 pixels, respectively.
Fig. 8. A distant point used as a parallel light source

Fig. 9. The mirror positions estimated by the proposed method
Table 1. The RMS errors of (7) are computed using the estimated mirror positions

Mirror   Number of Pairs   RMS Error (pixels)
Center   78                0.84
Side1    21                0.87
Side2    45                1.05
Side3    45                1.16
Side4    21                0.59
To localize the mirrors from real images, we experimented with two ways of acquiring parallel lights, namely distant markers and collimated lights. In the first case, we chose points on a distant object in the image. Figure 8 shows the chosen point, which is a point on a building about 260 meters away from the camera. We rotated the catadioptric system and obtained 78 pairs of parallel lights. The relative angles of the pairs of parallel lights vary between 15 degrees and 170 degrees. We estimated the positions of the center and the four side mirrors independently. Figure 9 shows the estimated mirror positions, obtained by rendering the mirror shapes from the viewpoint of the camera. Since we do not know the ground truth of the mirror position and the incident light vectors, we evaluate the accuracy of the estimated parameters by the following criterion. If the observed points of a pair of parallel lights are p1 and p2, and the corresponding incident vectors computed by (2) are v1 and v2, respectively, the criterion is

min_q ‖p2 − q‖²  subject to  v_q · v1 = cos α,    (7)

where v_q is the incident vector corresponding to an image point q. This criterion measures the error in pixels. Table 1 shows the estimated results. Since some of the lights are occluded by the other mirrors, the number of lights used for calibration varies for each mirror. The error is computed as the RMS of (7). Since the position of a feature point is assumed to have about 0.5 pixels of error, the error obtained with the estimated mirror positions is reasonable. Next, we tested our method by observing collimated lights generated by the system shown in Figure 3. The relative angle of the two collimated lights is 87.97 degrees.
struct VS_OUT {
    float4 Pos : POSITION;
    float3 Tex : TEXCOORD0;
};

VS_OUT VS(float4 Pos : POSITION, float4 Nor : NORMAL)
{
    VS_OUT Out = (VS_OUT)0;
    float3 tmpPos, tmpNor, v;
    float a;
    tmpPos = normalize(mul(Pos, T));    // unit viewing direction to the vertex (camera coords)
    tmpNor = mul(Nor, R);               // vertex normal rotated into camera coords
    a = dot(-tmpPos, tmpNor);
    v = tmpPos + 2 * a * tmpNor;        // reflected (incident) vector, cf. Eq. (2)
    Out.Pos = mul(Pos, KT);             // project the vertex with KT = K[R|t]
    Out.Tex = normalize(v);             // interpolated by the rasterizer
    return Out;
}

float4 PS(VS_OUT In) : COLOR
{
    float4 Col = 0;
    Col.rgb = In.Tex.xyz;               // output the incident vector as a color
    return Col;
}
Fig. 10. Top: an example of the acquired image. Bottom: the image of two collimated lights after turning off the room light.

Fig. 11. The source code for computing the incident vector in HLSL
We acquired 60 pairs of parallel lights. Figure 10 shows an example of an image, onto which two collimated lights are projected. In this experiment, we estimated the position of the center mirror. The RMS error of (7) is 0.35 pixels, which is smaller than that obtained using distant markers. This shows that the accuracy of the estimated results is improved by using the collimated lights.
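For completeness, the criterion of Eq. (7) can be approximated by a discrete search over image points near p2. The sketch below reuses the hypothetical incident_vector() helper introduced earlier and a placeholder lookup of per-pixel mirror normals; it only illustrates the evaluation and is not the authors' code.

import numpy as np

def criterion_error(p2, v1, cos_alpha, K, normals, tol=1e-3):
    """Eq. (7): distance from p2 to the nearest pixel q whose incident vector
    makes the known angle with v1. `normals` maps a pixel q to the mirror
    normal at its intersection point (placeholder for the rendering step)."""
    best = np.inf
    for q, N in normals.items():
        v_q = incident_vector(q, K, N)
        if abs(np.dot(v_q, v1) - cos_alpha) < tol:   # constraint of Eq. (7)
            best = min(best, np.linalg.norm(np.asarray(p2) - np.asarray(q)))
    return best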
5 Conclusion

This paper describes a method of mirror localization to calibrate a catadioptric imaging system; within that calibration, we focused on the localization of the mirror. By observing pairs of parallel lights, our method exploits the constraint that the relative angle between two parallel lights is invariant with respect to the translation and rotation of the imaging system. Since the translation and rotation between the camera and the calibration objects are omitted from the parameters, the only parameter to be estimated is the rigid transformation of the mirror. Our method estimates this rigid transformation by minimizing the error between the model of the mirror and the measurements. Since our method makes no assumptions about the mirror shape or its position, it can be applied to noncentral systems. If we compute the incident light vector by projecting the mirror shape onto the image, our method can accommodate any mirror shape. Finally, to validate the accuracy of our method, we tested it in simulation and in real experiments. For future work, we plan to apply the proposed method to various mirror shapes using the collimated lights and to analyze the best settings for the parallel lights.
References 1. Ishiguro, H., Yamamoto, M., Tsuji, S.: Omni-directional stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 14(2), 257–262 (1992) 2. Yamazawa, K., Yagi, Y., Yachida, M.: Obstacle detection with omnidirectional image sensor hyperomni vision. In: IEEE The International Conference on Robotics and Automation, Nagoya, pp. 1062–1067. IEEE Computer Society Press, Los Alamitos (1995) 3. Nayar, S.: Catadioptric omnidirectional camera. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 482–488. IEEE Computer Society Press, Los Alamitos (1997) 4. Gaspar, J., Decco, C., Okamoto Jr., J., Santos-Victor, J.: Constant resolution omnidirectional cameras. In: Proc. The Third Workshop on Omnidirectional Vision, pp. 27–34 (2002) 5. Hicks, R., Perline, R.: Equi-areal catadioptric sensors. In: Proc. The Third Workshop on Omnidirectional Vision, pp. 13–18 (2002) 6. Swaminathan, R., Nayar, S., Grossberg, M.: Designing Mirrors for Catadioptric Systems that Minimize Image Errors. In: Fifth Workshop on Omnidirectional Vision (2004) 7. Kondo, K., Yagi, Y., Yachida, M.: Non-isotropic omnidirectional imaging system for an autonomous mobile robot. In: Proc. 2005 IEEE International Conference on Robotics and Automation, Barcelona, Spain, IEEE Computer Society Press, Los Alamitos (2005) 8. Kojima, Y., Sagawa, R., Echigo, T., Yagi, Y.: Calibration and performance evaluation of omnidirectional sensor with compound spherical mirrors. In: Proc. The 6th Workshop on Omnidirectional Vision, Camera Networks and Non-classical cameras (2005) 9. Geyer, C., Daniilidis, K.: Paracatadioptric camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(5), 687–695 (2002) 10. Ying, X., Hu, Z.: Catadioptric camera calibration using geometric invariants. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(10), 1260–1271 (2004) 11. Mei, C., Rives, P.: Single view point omnidirectional camera calibration from planar grids. In: Proc. 2007 IEEE International Conference on Robotics and Automation, Rome, Italy, pp. 3945–3950. IEEE Computer Society Press, Los Alamitos (2007) 12. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004) 13. Aliaga, D.: Accurate catadioptric calibration for realtime pose estimation of room-size environments. In: Proc. IEEE International Conference on Computer Vision, vol. 1, pp. 127–134. IEEE Computer Society Press, Los Alamitos (2001) 14. Strelow, D., Mishler, J., Koes, D., Singh, S.: Precise omnidirectional camera calibration. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 689–694. IEEE Computer Society Press, Los Alamitos (2001) 15. Micus´ık, B., Pajdla, T.: Autocalibration and 3d reconstruction with non-central catadioptric cameras. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington US, vol. 1, pp. 58–65. IEEE Computer Society Press, Los Alamitos (2004) 16. Mashita, T., Iwai, Y., Yachida, M.: Calibration method for misaligned catadioptric camera. In: Proc. The Sixth Workshop on Omnidirectional Vision (2005) 17. Swaminathan, R., Grossberg, M., Nayar, S.: Caustics of catadioptric camera. In: Proc. IEEE International Conference on Computer Vision, vol. 2, pp. 2–9. IEEE Computer Society Press, Los Alamitos (2001) 18. Grossberg, M., Nayar, S.: The raxel imaging model and ray-based calibration. 
International Journal on Computer Vision 61(2), 119–137 (2005) 19. Sturm, P., Ramalingam, S.: A generic camera calibration concept. In: Proc. European Conference on Computer Vision, Prague, Czech, vol. 2, pp. 1–13 (2004)
20. Sagawa, R., Aoki, N., Mukaigawa, Y., Echigo, T., Yagi, Y.: Mirror localization for a catadioptric imaging system by projecting parallel lights. In: Proc. IEEE International Conference on Robotics and Automation, Rome, Italy, pp. 3957–3962. IEEE Computer Society Press, Los Alamitos (2007) 21. Caprile, B., Torre, V.: Using vanishing points for camera calibration. International Journal of Computer Vision 4(2), 127–140 (1990) 22. Daniilidis, K., Ernst, J.: Active intrinsic calibration using vanishing points. Pattern Recognition Letters 17(11), 1179–1189 (1996) 23. Guillemaut, J., Aguado, A., Illingworth, J.: Using points at infinity for parameter decoupling in camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(2), 265–270 (2005) 24. Mouaddib, E., Sagawa, R., Echigo, T., Yagi, Y.: Two or more mirrors for the omnidirectional stereovision? In: Proc. of The second IEEE-EURASIP International Symposium on Control, Communications, and Signal Processing, Marrakech, Morocco, IEEE Computer Society Press, Los Alamitos (2006) 25. Sagawa, R., Kurita, N., Echigo, T., Yagi, Y.: Compound catadioptric stereo sensor for omnidirectional object detection. In: Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems, Sendai, Japan, vol. 2, pp. 2612–2617 (2004) 26. Sagawa, R., Takatsuji, M., Echigo, T., Yagi, Y.: Calibration of lens distortion by structuredlight scanning. In: Proc. 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, Edmonton, Canada, pp. 1349–1354 (2005)
A Source Code for Rendering Incident Vectors

The reflected vector for each pixel is computed using the source code in Figure 11. It is written in the High-Level Shader Language (HLSL) and executed by graphics hardware. The shape of the mirror is represented by a mesh model that consists of vertices and triangles. The inputs of the vertex shader (VS) are the positions of the vertices of the mirror (Pos) and the normal vectors of the vertices (Nor). R, T and KT are constant matrices given by the main program: R is the rotation matrix of the mirror, T = [R|t], where t is the translation vector of the mirror, and KT is the projection matrix computed as KT = K[R|t], where K is the intrinsic matrix of the camera. The reflected vector v is computed for each vertex. Since it is interpolated by the rasterizer of the graphics hardware, the pixel shader (PS) outputs the reflected vector for each pixel.
Calibrating Pan-Tilt Cameras with Telephoto Lenses Xinyu Huang, Jizhou Gao, and Ruigang Yang Graphics and Vision Technology Lab (GRAVITY) Center for Visualization and Virtual Environments University of Kentucky, USA {xhuan4,jgao5,ryang}@cs.uky.edu http://www.vis.uky.edu/∼gravity
Abstract. Pan-tilt cameras are widely used in surveillance networks. These cameras are often equipped with telephoto lenses to capture objects at a distance. Such a camera makes full-metric calibration more difficult since the projection with a telephoto lens is close to orthographic. This paper discusses the problems caused by pan-tilt cameras with long focal length and presents a method to improve the calibration accuracy. Experiments show that our method reduces the re-projection errors by an order of magnitude compared to popular homography-based approaches.
1 Introduction

A surveillance system usually consists of several inexpensive wide field of view (WFOV) fixed cameras and pan-tilt-zoom (PTZ) cameras. The WFOV cameras are often used to provide an overall view of the scene while a few zoom cameras are controlled by a pan-tilt unit (PTU) to capture close-up views of the subject of interest. The control of a PTZ camera is typically done manually using a joystick. However, in order to automate this process, calibration of the entire camera network is necessary. One of our driving applications is to capture and identify subjects using biometric features such as iris and face over a long range. A high-resolution camera with a narrow field of view (NFOV) and a telephoto lens is used to capture the rich details of biometric patterns. For example, a typical iris image should have 100 to 140 pixels in iris radius to obtain good iris recognition performance [1]. That means that, in order to capture the iris image over three meters using a video camera (640×480), we have to use a 450mm lens, assuming the sensor size is 4.8 × 3.6 mm. If we want to capture both eyes (e.g., the entire face) at once, then the face image resolution could be as high as 5413×4060 pixels, well beyond the typical resolution of a video camera. In order to provide adequate coverage over a practical working volume, PTZ cameras have to be used. The simplest way to localize the region of interest (ROI) is to pan and tilt the PTZ camera iteratively until the region is approximately in the center of the field of view [2]. This is time-consuming and only suitable for still objects. However, if the PTZ cameras are fully calibrated, including the axes of rotation, the ROI can be localized rapidly with a single pan and tilt operation. In this paper we discuss the degeneracy caused by cameras with telephoto lenses and develop a method to calibrate such a system with significantly improved accuracy. The remainder of this paper is organized as follows. We first briefly overview the related work in Section 2. In Section 3, we describe our system and a calibration
method for long focal length cameras. Section 4 contains experimental results. Finally, a summary is given in Section 5. We also present in the appendix a simple method to calculate the pan and tilt angles when the camera coordinate system is not aligned with the pan-tilt coordinate system.
2 Related Work

It is generally considered that camera calibration reached its maturity in the late 90's, and a great deal of work has been done in this area. In the photogrammetry community, a calibration object with known and accurate geometry is required. With markers of known 3D positions, camera calibration can be done efficiently and accurately (e.g., [3], [4]). In computer vision, a planar pattern such as a checkerboard is often used to avoid the requirement of a 3D calibration object with good precision (e.g., [5], [6]). These methods estimate intrinsic and extrinsic parameters, including radial distortions, from homographies between the planar pattern at different positions and the image plane. Self-calibration estimates fixed or varying intrinsic parameters without the knowledge of special calibration objects and with unknown camera motions (e.g., [7], [8]). Furthermore, self-calibration can compute a metric reconstruction from an image sequence. Besides the projective camera model, the affine camera model, in which the camera center lies on the plane at infinity, was proposed in [9,10]. Quan presents a self-calibration method for an affine camera in [11]. However, the affine camera model should not be used when feature points lie at many different depths [12].

For the calibration of PTZ cameras, Hartley proposed a self-calibration method for stationary cameras with pure rotations in [13]. Agapito extended this method in [14] to deal with varying intrinsic parameters of a camera. Sinha and Pollefeys proposed a method for calibrating pan-tilt-zoom cameras in outdoor environments in [15]. Their method determines intrinsic parameters over the full range of zoom settings. These methods approximate PTZ cameras as rotating cameras without translations, since the translations are very small compared to the distance of scene points. Furthermore, these methods are based on computing the absolute conic from a set of inter-image homographies. In [16], Wang and Kang present an error analysis of intrinsic parameters caused by translation. They suggest self-calibrating using distant scenes, larger rotation angles, and more diverse homographies in order to reduce the effects of camera translation. The work most similar to ours is proposed in [17,18]. In these papers, Davis and Chen proposed a general pan-tilt model in which the pan and tilt axes are arbitrary axes in 3D space. They used a bright LED to create a virtual 3D calibration object and a Kalman filter tracking system to solve the synchronization between different cameras. However, they did not discuss the calibration problems caused by telephoto lenses. Furthermore, their method cannot easily be applied to digital still cameras, with which it would be tedious to capture hundreds or even thousands of frames.
3 Method

In this section, we first briefly describe the purpose of our system. Then, we discuss the calibration of long focal length cameras in detail.
3.1 System Description

The goal of our system is to capture face or iris images over a long range with a resolution high enough for biometric recognition. As shown in Fig. 1, a prototype of our system consists of two stereo cameras and a NFOV high-resolution (6M pixels) still camera. The typical focal length for the pan-tilt camera is 300mm, while previous papers dealing with pan-tilt cameras have reported the use of lenses between 1mm and 70mm. When a person walks into the working area of the stereo cameras, facial features are detected in each frame and their 3D positions can be easily determined by triangulation. The pan-tilt camera is steered so that the face is in the center of the observed image. A high-resolution image with enough biometric detail can then be captured. Since the field of view of the pan-tilt camera is only about 8.2 degrees, the ROI (e.g., the eye or entire face) is likely to fall outside the field of view if the calibration is not accurate enough.
Fig. 1. System setup with two WFOV cameras and a pan-tilt camera. Labels: 1) pan-tilt camera (Nikon 300mm), 2) laser pointer, 3) stereo camera (4mm), 4) pan axis, 5) tilt axis, 6) flash.
3.2 Calibration

In [5], given one homography H = [h1, h2, h3] between a planar pattern at one position and the image plane, two constraints on the absolute conic ω can be formulated as in Eq. (1):

h1^T ω h2 = 0,
h1^T ω h1 = h2^T ω h2.    (1)
By imaging the planar pattern n times at different orientations, a linear system Ac = 0 is formed, where A is a 2n × 6 matrix built from the observed homographies and c represents ω as a 6 × 1 vector. Once c is solved, the intrinsic matrix K can be recovered by Cholesky factorization since ω = (KK^T)⁻¹. Equivalently, one could rotate the camera instead of moving a planar pattern. This is the key idea in the self-calibration of pan-tilt cameras ([15], [13]). First, inter-image homographies are computed robustly. Second, the absolute conic ω is estimated by a
linear system ω = (H^i)^{-T} ω (H^i)^{-1}, where H^i is the homography between each view i and a reference view. Then, Cholesky decomposition of ω is applied to compute the intrinsic matrix K. Furthermore, a Maximum Likelihood Estimation (MLE) refinement can be applied using the above closed-form solution as the initial guess. However, the difference between the closed-form solution and the MLE refinement is small [12]. As mentioned in [5], a second homography does not provide any new constraints if it is parallel to the first one. In order to avoid this degeneracy and generate an over-determined system, the planar pattern has to be imaged many times with different orientations. This is also true for the self-calibration of rotating cameras. If conditions are near singular, the matrix A formed from the observed homographies will be ill-conditioned, making the solution inaccurate. Generally, the degeneracy is easy to avoid when the focal length is short; for example, we only need to change the orientation for each position of the planar pattern. However, this is not true for long focal length cameras. When the focal length increases and the field of view decreases, the camera's projection becomes less projective and more orthographic. The observed homographies then contain large depth ambiguities that make the matrix A ill-conditioned, and the solution becomes very sensitive to small perturbations. If the projection is purely orthographic, the observed homographies cannot provide any depth information no matter where we put the planar pattern or how we rotate the camera. In summary, traditional calibration methods based on observed homographies are in theory not accurate for long focal length cameras. We will also demonstrate this point with real data in the experiments section.
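The ill-conditioning argument can be checked numerically: stacking the two constraints of Eq. (1) for every observed homography gives the 2n × 6 system Ac = 0, whose condition number explodes as the projection approaches orthographic. The sketch below (NumPy, with a hypothetical list Hs of 3×3 homographies) illustrates this standard construction and is not the authors' code.

import numpy as np

def v_ij(H, i, j):
    # Row vector such that h_i^T * omega * h_j = v_ij . c,
    # with c = (w11, w12, w22, w13, w23, w33) parameterizing omega
    hi, hj = H[:, i], H[:, j]
    return np.array([hi[0]*hj[0],
                     hi[0]*hj[1] + hi[1]*hj[0],
                     hi[1]*hj[1],
                     hi[2]*hj[0] + hi[0]*hj[2],
                     hi[2]*hj[1] + hi[1]*hj[2],
                     hi[2]*hj[2]])

def conic_system(Hs):
    rows = []
    for H in Hs:
        rows.append(v_ij(H, 0, 1))                  # h1^T w h2 = 0
        rows.append(v_ij(H, 0, 0) - v_ij(H, 1, 1))  # h1^T w h1 = h2^T w h2
    return np.vstack(rows)

# A = conic_system(Hs)
# print(np.linalg.cond(A))        # grows rapidly for telephoto lenses
# c = np.linalg.svd(A)[2][-1]     # least-squares solution of Ac = 0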
Fig. 2. Pan-tilt camera model
The best way to calibrate a long focal length camera is to create 2D-3D correspondences directly. One could use a 3D calibration object, but this approach is not only costly but also impractical given the large working volume we would like to cover. In our system, 3D feature points are triangulated by the stereo cameras, so the ambiguities of the methods based on observed homographies are avoided. With a set of known 2D and 3D features, we can estimate the intrinsic parameters and the relative
transformation between the camera and the pan-tilt unit. The pan-tilt model is shown in Fig. 2 and is written as

x = K R*⁻¹ Rtilt Rpan R* [R|t] X,    (2)

where K is the intrinsic matrix, and R and t are the extrinsic parameters of the pan-tilt camera at the reference view, i.e., pan = 0 and tilt = 0 in our setting. X and x are 3D and 2D feature points. Rpan and Rtilt are rotation matrices around the pan and tilt axes. R* is the rotation matrix between the coordinate systems of the camera and the pan-tilt unit. We did not model the pan and tilt axes as two arbitrary axes in 3D space as in [17], since the translation between the two coordinate systems is very small (usually only a few millimeters in our setting) and a full-scale simulation shows that adding the translational offset yields little accuracy improvement. Based on the pan-tilt model in Eq. (2), we can estimate the complete set of parameters using MLE to minimize the re-projected geometric distances. This is given by the following functional:
n m
xij − x ˆij (K, R∗ , Rpan , Rtilt , R, t, Xij )2
(3)
i=1 j=1
The method of acquiring of calibration data in [17] is not applicable in our system because that our pan-tilt camera is not a video camera that could capture a video sequence of LED points. Typically a commodity video camera does not support both long focal length and high-resolution image. Here we propose another practical method to acquire calibration data from a still camera. We attach a laser pointer close enough to the pan-tilt camera as shown in Fig.1. The laser’s reflection on scene surfaces generates a 3D point that can be easily tracked. The laser pointer rotates with the pan-tilt camera simultaneously so that its laser dot can be observed by the pan-tilt camera at most of pan and tilt settings. In our set-up, we mount the laser pointer on the tilt axis. A white board is placed at several positions between the near plane and the far plane within the working area of two wide-view fixed cameras. For each pan and tilt step, three images are captured by the pan-tilt camera and two fixed cameras respectively. A 3D point is created by triangulation from the two fixed cameras. The white board does not need to be very large since we can always move it around during the data acquisition process so that 3D points cover the entire working area. The calibration method is summarized as Algorithm 1. In order to compare our method with methods based on observed homographies, we formulate a calibration framework similar to the methods in [15] and [12]. The algorithm is summarized in Algorithm 2. An alternative of step 4 in Algorithm 2 is to build a linear system ω = (H i )−T ω(H i )−1 and solve ω. Intrinsic matrix K is solved by Cholesky decomposition ω = KK T . However, this closed-form solution often fails since infinite homography is hard to estimate with narrow fields of view. After calibration step, R∗ , intrinsic and extrinsic parameters of three cameras are known. Hence, We can solve the pan and tilt angles easily (see Appendix A for the details) for almost arbitrary 3D points triangulated by stereo cameras.
132
X. Huang, J. Gao, and R. Yang
Algorithm 1. Our calibration method for a pan-tilt camera with a long focal length Input: observed laser point images by three cameras. Output: intrinsic matrix K extrinsic parameters R, t, and rotation matrix R∗ between coordinates of camera and PTU. 1. Calibrate the stereo cameras and reference view of the pan-tilt camera using [19]. 2. Rectify stereo images such that epipolar lines are parallel with the y-axis (optional). 3. Capture laser points on a 3D plane for three cameras at each pan and tilt setting in the working area. 4. Based on blob detection and epipolar constraint, find two laser points in the stereo cameras. Generate 3D points by triangulation of two laser points. 5. Plane fitting for each plane position using RANSAC. 6. Remove outliers of 3D points based on the fitted 3D plane. 7. Estimate R∗ , K, R, t by minimizing Eq.(3).
Algorithm 2. Calibration method for a pan-tilt camera with a long focal length based on homographies Input: images captured at each pan and tilt setting. Output: intrinsic matrix K and rotation matrix R∗ between coordinates of camera and PTU. 1. Detect features based on Scale-invariant feature transform (SIFT) in [20] and find correspondences between neighboring images. 2. Robust homography estimation using RANSAC. 3. Compute homography between each image and reference view (pan = 0, tilt = 0). 4. Estimate K using Calibration Toolbox [19]. 5. Estimate R∗ and refine intrinsic matrix K by minimizing argminR∗ ,K
n m
xij − KR∗−1 Rtilt Rpan R∗ K −1 xiref 2
(4)
i=1 j=1
where xij and xiref are ith feature point at jth and reference view respectively.
4 Experiments Here we present experimental results from two fixed cameras (Dragonfly2 DR2-HICOL with resolution 1024 × 768) and a pan-tilt still camera (Nikon D70 with resolution 3008 × 2000). First, we compare the calibration accuracy with short and long focal length lenses using traditional homograph-based method. Then, we demonstrate that our calibration method significantly improves accuracy for telephoto lenses. In order to validate the calibration accuracy, we generate about 500 3D testing points that are randomly distributed cover the whole working area following step 2 to step 4 in Algorithm 1, i.e., tracking and triangulating the laser dot. The testing points are different from the points used for calibration. First we present the calibration results of the still camera with a short (18mm) and a long focal length (300mm) lenses. For simplicity, we assume the coordinate systems
Calibrating Pan-Tilt Cameras with Telephoto Lenses
133
Table 1. The comparison between short and long focal length cameras. α and β are focal length. μ0 and ν0 are principal point. The uncertainties of principal point for 300mm camera cannot be estimated by Calibration Toolbox [19]. focal length α β μ0 ν0 RMS (in pixels) 300mm 40869.2 ± 1750.2 41081.7 ± 1735.1 1503.5 ± ∗ 999.5 ± ∗ 3.85 18mm 2331.2 ± 9.1 2339.7 ± 9.1 1550.8 ± 12.1 997.9 ± 14.4 2.11
of pan-tilt camera and pan-tilt unit are aligned perfectly. This means R∗ is an identity matrix. We use the Calibration Toolbox [19] to do the calibration for the reference view of pan-tilt camera and stereo cameras. In order to reduce the ambiguities caused by the long focal length, we capture over 40 checkerboard patterns at different orientations for pan-tilt camera. Table 1 shows results of the intrinsic matrix K and RMS of calibration data. The uncertainties of the focal length with the 300mm lens is about 10 times larger than that with a 18mm lens although the RMS of calibration data for both cases are similar. Fig.3 shows distributions of re-projection errors for the 500 testing points with 18mm and 300mm cameras. From this figure, we find that calibration is quite accurate for short focal length camera even that we assume R∗ is an identity matrix. Given the high resolution image, the relative errors from 18mm and 300mm cameras are about 1.3% and 30% respectively. This is computed as the ratio of the mean pixel error and the image width. Furthermore, many of the test points are out of field of view of the 300mm camera. Focal Length: 18mm
0
10
20
30
40
50
Focal Length: 300mm
60
Error in Pixels, Mean: 37.9 Variance: 265.9
70
0
500
1000
1500
2000
Error in Pixels, Mean:898.6, Variance: 2.5206e+05
Fig. 3. Distributions of re-projection error (in pixels) based on 500 testing data for 18mm and 300mm pan-tilt cameras
We then recalibrate the 300mm case with methods outlined in Algorithm 1 and 2, both of which include the estimation of R∗ . About 600 3D points are sampled for calibration over the working area in Algorithm 1. We pre-calibrate the reference view for the pan-tilt camera as the initial guess. After calibration, we validate the accuracy with 500 3D points. Fig. 4 shows the distributions of re-projection errors from the two different methods. Our method is about 25 times better than the homography-based one. The
134
X. Huang, J. Gao, and R. Yang Calibration Based on Algorithm 1
0
20
40
60
80
100
120
Error in Pixels, Mean: 34.6, Variance: 424.3
Calibration Based on Algorithm 2
140
0
500
1000
1500
2000
Error in Pixels, Mean: 880.0 Variance: 2.2946e+05
Fig. 4. Distributions of re-projection error based on 500 testing data (in pixels) for Algorithm 1 and 2 Table 2. The comparison between Algorithm 1 and 2. α and β are focal length. μ0 and ν0 are principal point. θx and θy are rotation angles between pan-tilt camera and pan-tilt unit. Algorithm α β μ0 ν0 θx θy 1 40320.9 39507.7 1506.6 997.7 −0.14 −1.61 2 39883.3 40374.6 1567.5 1271.5 1.99 1.41
relative errors from Algorithm 1 and 2 are about 1.2% and 29% respectively. It should be noted that R∗ can not be estimated accurately from observed homographies. Hence, the percentage error from Algorithm 2 remains very large. In fact, the improvement over assuming an identity R∗ is little. Table 2 shows the results for intrinsic matrix K, θx , and θy after MLE refinement. Here we decompose R∗ into two rotation matrices. One is the rotation around x axis for θx degree, and the other is the rotation around y axis for θy degree.
5 Conclusion This paper shows that calibration methods based on observed homographies are not suitable for cameras with telephoto (long-focal-length) lenses. This is caused by the ambiguities induced by the near-orthographic projection. We develop a method to calibrate a pan-tilt camera with long focal length in a surveillance network. In stead of using a large precisely-manufactured calibration object, our key idea is to use fixed stereo cameras to create a large collection of 3D calibration points. Using these 3D points allows full metric calibration over a large area. Experimental results show that the re-projection relative error is reduced from 30% to 1.2% with our method. In future work, we plan to extend our calibration method to auto-zoom cameras and build a complete surveillance system that can adjust zoom settings automatically by estimating the object’s size.
Calibrating Pan-Tilt Cameras with Telephoto Lenses
135
References
1. Daugman, J.: How Iris Recognition Works. In: ICIP (2002)
2. Guo, G., Jones, M., Beardsley, P.: A System for Automatic Iris Capturing. In: MERL TR2005-044 (2005)
3. Tsai, R.Y.: A Versatile Camera Calibration Technique for High-accuracy 3D Machine Vision Metrology Using Off-The-Shelf TV Cameras and Lenses. IEEE Journal of Robotics and Automation 4(3), 323–344 (1987)
4. Faugeras, O.: Three-Dimensional Computer Vision: a Geometric Viewpoint. MIT Press, Cambridge (1993)
5. Zhang, Z.: A Flexible New Technique for Camera Calibration. PAMI 22, 1330–1334 (2000)
6. Heikkila, J., Silven, O.: A Four-Step Camera Calibration Procedure with Implicit Image Correction. In: Proceedings of CVPR, pp. 1106–1112 (1997)
7. Pollefeys, M., Koch, R., Gool, L.V.: Self-Calibration and Metric Reconstruction in spite of Varying and Unknown Internal Camera Parameters. In: Proceedings of ICCV, pp. 90–95 (1997)
8. Pollefeys, M.: Self-Calibration and Metric 3D Reconstruction from Uncalibrated Image Sequences. PhD thesis, K.U.Leuven (1999)
9. Mundy, J., Zisserman, A.: Geometric Invariance in Computer Vision. MIT Press, Cambridge (1992)
10. Aloimonos, J.Y.: Perspective Approximations. Image and Vision Computing 8, 177–192 (1990)
11. Quan, L.: Self-Calibration of an Affine Camera from Multiple Views. International Journal of Computer Vision 19(1), 93–105 (1996)
12. Hartley, R.I., Zisserman, A.: Multiple View Geometry. Cambridge University Press, Cambridge (2000)
13. Hartley, R.I.: Self-Calibration of Stationary Cameras. International Journal of Computer Vision 1(22), 5–23 (1997)
14. de Agapito, L., Hayman, E., Reid, I.: Self-Calibration of a Rotating Camera with Varying Intrinsic Parameters. In: BMVC (1998)
15. Sinha, N., Pollefeys, M.: Towards Calibrating a Pan-Tilt-Zoom Camera Network. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3021, Springer, Heidelberg (2004)
16. Wang, L., Kang, S.B.: Error Analysis of Pure Rotation-Based Self-Calibration. PAMI 2(26), 275–280 (2004)
17. Davis, J., Chen, X.: Calibrating Pan-Tilt Cameras in Wide-Area Surveillance Networks. In: Proceedings of ICCV, vol. 1, pp. 144–150 (2003)
18. Chen, X., Davis, J.: Wide Area Camera Calibration Using Virtual Calibration Objects. In: Proceedings of CVPR (2000)
19. Bouguet, J.Y.: Camera Calibration Toolbox for Matlab. http://www.vision.caltech.edu/bouguetj/calib_doc/
20. Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 20, 91–110 (2003)
Appendix A: Solving Pan and Tilt Angles
Here we discuss how to solve for the pan and tilt angles so that the projection of an arbitrary point X in 3D space lies at the center of the image plane. We assume there is a rotation between the pan-tilt coordinate system and the camera's. Because of the dependency within the pan-tilt unit, namely that the tilt axis depends on the pan axis, the solution is not as
simple as it appears. In order to address this problem, we back-project the image center to a line L̃2. The center of projection and the point X form another line L̃1. After the calibration steps described in Section 3, L̃1 and L̃2 are transformed into L1 and L2 in the coordinate system of the pan-tilt unit. Hence, the problem is simplified to panning around the y-axis and tilting around the x-axis to make L1 and L2 coincident, or as close as possible to each other. If L1 and L2 are represented by their Plücker matrices, one method to compute the transformation of an arbitrary 3D line to another line by performing only rotations around the x and y axes is to minimize the following functional:

$$\arg\min_{R_x, R_y, \lambda} \|\mathbf{L}_1 - \lambda \mathbf{L}_2\|^2 \qquad (5)$$

where λ is a scalar, L2 is the 6 × 1 vector of Plücker coordinates of L2, and L1 is the 6 × 1 vector of Plücker coordinates of $(R_y R_x)\,L_1\,(R_y R_x)^T$, where Rx and Ry are rotation matrices around the x and y axes.
Fig. 5. Solve pan and tilt angles from L1 to L2.
However, the problem can be further simplified because L1 and L2 intersect at the origin of the pan-tilt unit in our model. As shown in Fig. 5, we want to pan and tilt the line L1 to coincide with another line L2. Assuming both lines have unit length, the tilt angles are first computed by Eq. (6):

$$\varphi_1 = \arctan\!\left(\frac{y_1}{z_1}\right) - \arctan\!\left(\frac{y_2}{r}\right), \quad
\varphi_2 = \arctan\!\left(\frac{y_1}{z_1}\right) - \arctan\!\left(\frac{y_2}{-r}\right), \quad
r = \sqrt{y_1^2 + z_1^2 - y_2^2} \qquad (6)$$
If $(y_1^2 + z_1^2 - y_2^2)$ is less than 0, the two conics C1 and C2 do not intersect, which means no exact solution exists. However, this almost never happens in practice since the rotation
between the pan-tilt unit and the camera is small. After tilting, (x1, y1, z1) is rotated to (a1, b1, c1) or (a2, b2, c2). Then the pan angles are computed by Eq. (7):

$$\vartheta_1 = \arctan\!\left(\frac{z_2}{x_2}\right) - \arctan\!\left(\frac{c_1}{a_1}\right), \quad
\vartheta_2 = \arctan\!\left(\frac{z_2}{x_2}\right) - \arctan\!\left(\frac{c_2}{a_2}\right) \qquad (7)$$
Hence, two solutions, (ϕ1, ϑ1) and (ϕ2, ϑ2), are obtained. We choose the minimum rotation angles as the final solution.
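The closed-form solution above is straightforward to implement. The following sketch is our illustration only: function and variable names are our own choices, and the sign conventions of the rotation matrices are assumptions that depend on the actual pan-tilt unit. It evaluates Eqs. (6) and (7) for unit direction vectors of L1 and L2 and keeps the candidate with the smallest total rotation.

```python
import numpy as np

def pan_tilt_solutions(p1, p2):
    """Candidate (tilt, pan) pairs that rotate unit direction p1 = (x1, y1, z1)
    onto unit direction p2 = (x2, y2, z2), tilting about the x-axis first and
    then panning about the y-axis (a sketch of Eqs. (6)-(7); names are ours)."""
    x1, y1, z1 = p1
    x2, y2, z2 = p2
    r2 = y1**2 + z1**2 - y2**2
    if r2 < 0:
        raise ValueError("no exact solution: the configuration has no intersection")
    r = np.sqrt(r2)
    # Eq. (6): two candidate tilt angles about the x-axis.
    phi = [np.arctan2(y1, z1) - np.arctan2(y2, r),
           np.arctan2(y1, z1) - np.arctan2(y2, -r)]
    solutions = []
    for ph in phi:
        # Assumed sign convention for the tilt rotation about the x-axis.
        Rx = np.array([[1, 0, 0],
                       [0, np.cos(ph), -np.sin(ph)],
                       [0, np.sin(ph),  np.cos(ph)]])
        a, b, c = Rx @ np.asarray(p1, dtype=float)   # tilted direction (a, b, c)
        # Eq. (7): pan angle about the y-axis that aligns (a, c) with (x2, z2).
        theta = np.arctan2(z2, x2) - np.arctan2(c, a)
        solutions.append((ph, theta))
    # Choose the solution with the minimum rotation, as in the paper.
    return min(solutions, key=lambda s: abs(s[0]) + abs(s[1]))
```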
Camera Calibration Using Principal-Axes Aligned Conics Xianghua Ying and Hongbin Zha National Laboratory on Machine Perception Peking University, Beijing, 100871 P.R. China {xhying,zha}@cis.pku.edu.cn
Abstract. The projective geometric properties of two principal-axes aligned (PAA) conics in a model plane are investigated in this paper by utilizing the generalized eigenvalue decomposition (GED). We demonstrate that one constraint on the image of the absolute conic (IAC) can be obtained from a single image of two PAA conics even if their parameters are unknown. If, in addition, the eccentricity of one of the two conics is given, two constraints on the IAC can be obtained. An important merit of the algorithm using PAA conics is that it can be employed to avoid the ambiguities that arise when estimating extrinsic parameters in calibration algorithms using concentric circles. We evaluate the characteristics and robustness of the proposed algorithm in experiments with synthetic and real data. Keywords: Camera calibration, Generalized eigenvalue decomposition, Principal-axes aligned conics, Image of the absolute conic.
1 Introduction
Conics are among the most important image features in computer vision, alongside points and lines. The motivation to study the geometry of conics arises from the facts that conics carry more geometric information, and can be extracted from images more robustly and more accurately than points and lines. In addition, conics are much easier to produce and identify than general algebraic curves, though general algebraic curves may carry more geometric information. Unlike points and lines, on which a large body of research has been developed, only a handful of algorithms based on conics have been proposed, for pose estimation [2][10], structure recovery [11][15][7][17][13], object recognition [8][14][5], and camera calibration [18][19][3]. Forsyth et al. [2] discovered the projective invariants for pairs of conics and then developed an algorithm to determine the relative pose of a scene plane from two conic correspondences. However, the algorithm requires solving quartics and has no closed-form solution. Ma [10] developed an analytical method based on conic correspondences for motion estimation and pose determination from stereo images. Quan [14] discovered two polynomial constraints from corresponding conics in two uncalibrated perspective images and applied them to object recognition. Weiss [18] demonstrated that two conics are sufficient for calibration under affine projection and derived a nonlinear calibration algorithm. Kahl and Heyden [7] proposed an algorithm for epipolar geometry estimation from conic correspondences. They found that one conic correspondence gives two independent constraints on the fundamental matrix, and a method to
estimate the fundamental matrix from at least four corresponding conics was presented. Sugimoto [17] proposed a linear algorithm for solving the homography from conic correspondences, but it requires at least seven correspondences. Mudigonda et al. [13] showed that two conic correspondences are enough for solving the homography, but their method requires solving polynomial equations. The works closest to the one proposed here are [19] and [3]. Yang et al. [19] presented a linear approach for camera calibration from concentric conics on a model plane. They showed that two constraints can be obtained from a single image of these concentric conics. However, their approach requires at least three concentric conics, and the equations of all these conics must be given in advance. Gurdjos et al. [3] utilized the projective and Euclidean properties of confocal conics to perform camera calibration; the key property is that the line conic consisting of the images of the circular points belongs to the conic range of these confocal conics. Two constraints on the IAC can be obtained from a single image of the confocal conics. Gurdjos et al. [3] argued that an important reason to use confocal conics for camera calibration is that there exist ambiguities in the calibration methods using concentric circles [9][6] when recovering the extrinsic parameters of the camera, and that algorithms using confocal conics can avoid such ambiguities. In this paper, we introduce a novel and useful pattern, the PAA conics, and investigate in depth the properties of two arbitrary PAA conics with unknown or known eccentricities.
2 Basic Principles
2.1 Pinhole Camera Model
Let X = [X Y Z 1]^T be a world point and x̃ = [u v 1]^T be its image point, both in homogeneous coordinates. They satisfy

$$\mu\tilde{\mathbf{x}} = P\mathbf{X}, \qquad (1)$$
where P is a 3 × 4 projection matrix describing the perspective projection process and μ is an unknown scale factor. The projection matrix can be decomposed as

$$P = K[\mathbf{R} \mid \mathbf{t}], \qquad (2)$$
where

$$K = \begin{bmatrix} f_x & s & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}. \qquad (3)$$
Here the matrix K contains the intrinsic parameters, and (R, t) denotes a rigid transformation that indicates the orientation and position of the camera with respect to the world coordinate system.
2.2 Homography Between the Model Plane and Its Image
Without loss of generality, we assume the model plane lies on Z = 0 of the world coordinate system. Let us denote the i-th column of the rotation matrix R by r_i. From (1) and (2), we have
$$\mu\tilde{\mathbf{x}} = K[\mathbf{r}_1\ \mathbf{r}_2\ \mathbf{r}_3\ \mathbf{t}]\begin{bmatrix} X \\ Y \\ 0 \\ 1 \end{bmatrix} = K[\mathbf{r}_1\ \mathbf{r}_2\ \mathbf{t}]\begin{bmatrix} X \\ Y \\ 1 \end{bmatrix}. \qquad (4)$$

We denote x = [X Y 1]^T; then a model point x and its image x̃ are related by a 2D homography H:

$$\mu\tilde{\mathbf{x}} = H\mathbf{x}, \qquad (5)$$

where

$$H = K[\mathbf{r}_1\ \mathbf{r}_2\ \mathbf{t}]. \qquad (6)$$
Obviously, H is defined up to a scale factor.
2.3 Standard Forms for Conics
All conics are projectively equivalent under projective transformations [16]. This means that any conic can be converted into any other conic by some projective transformation. A conic is an ellipse (including a circle), a parabola, or a hyperbola if and only if its intersection with the line at infinity on the projective plane consists of 2 imaginary points, 2 repeated real points, or 2 real points, respectively. In the case of central conics (ellipses and hyperbolas), by moving the coordinate origin to the center and choosing the directions of the coordinate axes coincident with the so-called principal axes (axes of symmetry) of the conic, we obtain that the equation in standard form for an ellipse is $X^2/a^2 + Y^2/b^2 = 1$, where $a^2 \geq b^2$, and the equation in standard form for a hyperbola is $X^2/a^2 - Y^2/b^2 = 1$. These equations can be written in a simpler form:

$$AX^2 + BY^2 + C = 0, \qquad (7)$$
and rewritten in matrix form, we obtain

$$\mathbf{x}^T A\mathbf{x} = 0, \qquad (8)$$

where

$$A = \begin{bmatrix} A & & \\ & B & \\ & & C \end{bmatrix}. \qquad (9)$$
For a parabola, let the unique axis of symmetry of the parabola coincide with the X-axis, and let the Y-axis pass through the vertex of the parabola; then the equation of the parabola is brought into the form

$$Y^2 = 2pX, \qquad (10)$$

or

$$\mathbf{x}^T B\mathbf{x} = 0, \qquad (11)$$

where

$$B = \begin{bmatrix} & & -p \\ & 1 & \\ -p & & \end{bmatrix}. \qquad (12)$$

Equation (12) can be rewritten in a homogeneous form:

$$B = \begin{bmatrix} & & E \\ & D & \\ E & & \end{bmatrix}. \qquad (13)$$
2.4 Equations for the Images of Conics in Standard Form
Given the homography H between the model plane and its image, from (5) and (8) we obtain that the image of a central conic in standard form satisfies

$$\tilde{\mathbf{x}}^T \tilde{A}\tilde{\mathbf{x}} = 0, \qquad (14)$$

where

$$\tilde{A} = H^{-T} A H^{-1}. \qquad (15)$$

Similarly, the image of a parabola in standard form satisfies

$$\tilde{\mathbf{x}}^T \tilde{B}\tilde{\mathbf{x}} = 0, \qquad (16)$$

where

$$\tilde{B} = H^{-T} B H^{-1}. \qquad (17)$$
3 Properties of PAA Conics
3.1 Properties of Two Conics via the GED
Conics remain conics under an arbitrary 2D projective transformation [16]. An interesting property of two conics is that the GED of the two conics is projectively invariant [12]. This property is interpreted in detail as follows. Given two point conic pairs (A1, A2) and (Ã1, Ã2) related by a plane homography H, i.e., Ãi ~ H^{-T} Ai H^{-1}, i = 1, 2, if x is a generalized eigenvector of (A1, A2), i.e.,
A1 x = λA2 x, then x̃ = Hx must be a generalized eigenvector of (Ã1, Ã2), i.e., Ã1 x̃ = λ̃ Ã2 x̃. In general, there are 3 generalized eigenvectors for two 3 × 3 matrices. Therefore, for a point conic pair, we may obtain three points (i.e., the three generalized eigenvectors of the point conic pair), which are projectively invariant under 2D projective transformations of the projective plane. Similarly, for a line conic pair, we may obtain three lines (i.e., the three generalized eigenvectors of the line conic pair), which are projectively invariant under 2D projective transformations of the projective plane.
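Numerically, these points can be obtained directly from a generalized eigenvalue routine. The following sketch is our own illustration (not code from the paper); it uses `scipy.linalg.eig` with two matrix arguments, which solves A1 x = λ A2 x, and checks the covariance property on a synthetic homography.

```python
import numpy as np
from scipy.linalg import eig

def conic_pair_eigenvectors(A1, A2):
    """Generalized eigenvectors of the point-conic pair (A1, A2), i.e. the
    solutions of A1 x = lambda A2 x, returned as unit-normalized columns."""
    _, V = eig(A1, A2)                    # solves A1 x = lambda A2 x
    return np.real_if_close(V / np.linalg.norm(V, axis=0))

# Two PAA central conics in standard form (diagonal matrices, cf. Eq. (18)).
A1 = np.diag([1.0 / 4.0, 1.0, -1.0])      # ellipse x^2/4 + y^2 = 1
A2 = np.diag([1.0 / 9.0, 0.5, -1.0])      # ellipse x^2/9 + y^2/2 = 1

H = np.array([[1.1, 0.2, 3.0],
              [-0.1, 0.9, 1.0],
              [1e-3, 2e-3, 1.0]])         # some plane homography
Hi = np.linalg.inv(H)
A1_img, A2_img = Hi.T @ A1 @ Hi, Hi.T @ A2 @ Hi   # imaged conics, Eqs. (14)-(15)

V_model = conic_pair_eigenvectors(A1, A2)
V_image = conic_pair_eigenvectors(A1_img, A2_img)
# Each column of H @ V_model is proportional to a column of V_image
# (possibly in a different order), illustrating the projective covariance.
```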
3.2 Properties of Two PAA Central Conics
Two PAA central conics (point conics) in standard form are

$$A_1 = \begin{bmatrix} A_1 & & \\ & B_1 & \\ & & C_1 \end{bmatrix}, \quad A_2 = \begin{bmatrix} A_2 & & \\ & B_2 & \\ & & C_2 \end{bmatrix}. \qquad (18)$$
The GED of the two conics is:
$$A_1\mathbf{x} = \lambda A_2\mathbf{x}. \qquad (19)$$
It is not difficult to find that the generalized eigenvalues and generalized eigenvectors of A1 and A2 are as follows:
$$\lambda_1 = \frac{A_1}{A_2},\ \mathbf{x}_1 = \begin{bmatrix}1\\0\\0\end{bmatrix}, \quad \lambda_2 = \frac{B_1}{B_2},\ \mathbf{x}_2 = \begin{bmatrix}0\\1\\0\end{bmatrix}, \quad \lambda_3 = \frac{C_1}{C_2},\ \mathbf{x}_3 = \begin{bmatrix}0\\0\\1\end{bmatrix}, \qquad (20)$$
where x1 is the direction vector of the X-axis, x2 is the direction vector of the Y-axis, and x3 gives the homogeneous coordinates of the common center of the two central conics. From the projective geometric properties of two point conics via the GED presented in Section 3.1, we obtain:
Proposition 1. From the images of two PAA central conics, we can obtain the image of the direction vector of the X-axis, the image of the direction vector of the Y-axis, and the image of the common center of the two central conics via the GED.
3.3 Properties and Ambiguities in Concentric Circles
Two concentric circles in standard form are

$$A_1 = \begin{bmatrix} A_1 & & \\ & A_1 & \\ & & C_1 \end{bmatrix}, \quad A_2 = \begin{bmatrix} A_2 & & \\ & A_2 & \\ & & C_2 \end{bmatrix}. \qquad (21)$$
It is not difficult to find that the generalized eigenvalues and generalized eigenvectors of A1 and A2 are as follows:
$$\lambda_1 = \lambda_2 = \frac{A_1}{A_2},\quad \mathbf{x}_1 = \rho_1\begin{bmatrix}1\\0\\0\end{bmatrix} + \mu_1\begin{bmatrix}0\\1\\0\end{bmatrix},\quad \mathbf{x}_2 = \rho_2\begin{bmatrix}1\\0\\0\end{bmatrix} + \mu_2\begin{bmatrix}0\\1\\0\end{bmatrix},\quad \lambda_3 = \frac{C_1}{C_2},\quad \mathbf{x}_3 = \begin{bmatrix}0\\0\\1\end{bmatrix}, \qquad (22)$$

where ρ1, μ1, ρ2, μ2 are four real constants that are only required to satisfy x1 ≠ x2 up to a scale factor. This means ρ1, μ1, ρ2, μ2 cannot be determined uniquely. There are infinitely many solutions for ρ1, μ1, ρ2, μ2, and thus infinitely many solutions for x1 and x2. Here x1 and x2 are two points at infinity, and x3 gives the homogeneous coordinates of the common center of the two circles. The ambiguity in x1 and x2 can be understood from the fact that we cannot establish a unique XY coordinate system from two concentric circles on the model plane, because there is a remaining degree of freedom in the 2D rotation around the common center. However, for two general PAA central conics it is very easy to establish an XY coordinate system in the supporting plane without any ambiguity, because we can choose the coordinate axes coincident with the principal axes of the two PAA conics.
Proposition 2. From the images of two concentric circles, we can obtain the image of the common center, and the image of the line at infinity of the supporting plane via the GED.
4 Calibration
4.1 Dual Conic of the Absolute Points from Conics in Standard Form
The eccentricity e is one of the most important parameters of a conic. If e = 0, the conic is a circle. If 0 < e < 1, the conic is an ellipse. If e = 1, it is a parabola. If e > 1, it is a hyperbola. The equation in standard form for an ellipse is $X^2/a^2 + Y^2/b^2 = 1$, with $e = c/a$, where $c^2 = a^2 - b^2$, and thus $b^2 = (1 - e^2)a^2$. Therefore, the line at infinity $\mathbf{l}_\infty = (0, 0, 1)^T$ of the supporting plane intersects the ellipse at two imaginary points:

$$\mathbf{I}_E = \begin{bmatrix} 1 \\ \sqrt{1 - e^2}\,i \\ 0 \end{bmatrix}, \quad \mathbf{J}_E = \begin{bmatrix} 1 \\ -\sqrt{1 - e^2}\,i \\ 0 \end{bmatrix}. \qquad (23)$$
The equation in standard form for a hyperbola is $X^2/a^2 - Y^2/b^2 = 1$, with $e = c/a$, where $c^2 = a^2 + b^2$, and thus $b^2 = (e^2 - 1)a^2$. Therefore, the line at infinity $\mathbf{l}_\infty = (0, 0, 1)^T$ of the supporting plane intersects the hyperbola at two real points:

$$\mathbf{I}_H = \begin{bmatrix} 1 \\ \sqrt{e^2 - 1} \\ 0 \end{bmatrix}, \quad \mathbf{J}_H = \begin{bmatrix} 1 \\ -\sqrt{e^2 - 1} \\ 0 \end{bmatrix}. \qquad (24)$$
The equation in standard form for a parabola is $Y^2 = 2pX$; it is not difficult to see that the line at infinity $\mathbf{l}_\infty = (0, 0, 1)^T$ of the supporting plane intersects the parabola
at two repeated real points; that is, the line at infinity is tangent to the parabola at one real point:
$$\mathbf{I}_P = \mathbf{J}_P = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}. \qquad (25)$$
From discussions above, we obtain:
Definition 1. The line at infinity intersects a conic in standard form at two points, which are called the absolute points of a conic in standard form:

$$\mathbf{I}_A = \begin{bmatrix} 1 \\ \sqrt{e^2 - 1} \\ 0 \end{bmatrix}, \quad \mathbf{J}_A = \begin{bmatrix} 1 \\ -\sqrt{e^2 - 1} \\ 0 \end{bmatrix}. \qquad (26)$$
For a circle (e = 0), the two absolute points are the well-known circular points, $\mathbf{I} = [1\ \ i\ \ 0]^T$ and $\mathbf{J} = [1\ {-i}\ \ 0]^T$.
Definition 2. The conic

$$C^*_\infty = \mathbf{I}_A\mathbf{J}_A^T + \mathbf{J}_A\mathbf{I}_A^T \qquad (27)$$

is the conic dual to the absolute points. The conic $C^*_\infty$ is a degenerate (rank 2 or 1) line conic, which consists of the two absolute points. In a Euclidean coordinate system it is given by

$$C^*_\infty = \mathbf{I}_A\mathbf{J}_A^T + \mathbf{J}_A\mathbf{I}_A^T = \begin{bmatrix} 1 & & \\ & 1 - e^2 & \\ & & 0 \end{bmatrix}. \qquad (28)$$
The conic $C^*_\infty$ is fixed under scale and translation transformations. The reason is as follows: under the point transformation x̃ = Hx, where H is a scale and translation transformation, one can easily verify that

$$\tilde{C}^*_\infty = HC^*_\infty H^T = C^*_\infty. \qquad (29)$$

The converse is also true, and we have:
Proposition 3. The dual conic $C^*_\infty$ is fixed under the projective transformation H if and only if H is a scale and translation transformation.
For circles, $C^*_\infty$ is fixed not only under scale and translation transformations, but also under rotation transformations [4].
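As a small numerical check (ours, not from the paper), the snippet below builds the absolute points of Eq. (26) for a given eccentricity, forms the dual conic of Eqs. (27)-(28), and verifies that it is fixed, up to scale, under a scale-and-translation transformation.

```python
import numpy as np

def dual_conic_of_absolute_points(e):
    """C*_inf = I_A J_A^T + J_A I_A^T for a conic of eccentricity e (Eqs. (26)-(28))."""
    s = np.sqrt(complex(e**2 - 1.0))      # imaginary for an ellipse (e < 1)
    I_A = np.array([1.0, s, 0.0])
    J_A = np.array([1.0, -s, 0.0])
    C = np.outer(I_A, J_A) + np.outer(J_A, I_A)
    return np.real(C)                     # equals 2 * diag(1, 1 - e^2, 0)

C = dual_conic_of_absolute_points(0.5)    # an ellipse
H = np.array([[2.0, 0.0, 5.0],
              [0.0, 2.0, -3.0],
              [0.0, 0.0, 1.0]])           # scale + translation
C_mapped = H @ C @ H.T
# C_mapped is proportional to C, i.e. the dual conic is fixed (Proposition 3).
assert np.allclose(C_mapped / C_mapped[0, 0], C / C[0, 0])
```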
4.2 Calibration from Unknown PAA Central Conics
Given the images of two PAA central conics, from Proposition 1 we can determine the images of the direction vectors of the X-axis and Y-axis; we denote them as x̃1
and x̃2, respectively. From [4] we know that the vanishing points of lines with perpendicular directions satisfy

$$\tilde{\mathbf{x}}_1^T\,\boldsymbol{\omega}\,\tilde{\mathbf{x}}_2 = 0, \qquad (30)$$
where $\omega = K^{-T}K^{-1}$ is the IAC [4]. Therefore, we have:
Proposition 4. From a single image of two PAA conics whose parameters are both unknown, one constraint can be obtained on the IAC.
Given 5 images taken in general positions, we can linearly recover the IAC ω. The intrinsic parameter matrix K can then be obtained from the Cholesky factorization of ω. Once the intrinsic parameters are known, it is not difficult to obtain the images of the circular points for each image by intersecting the image of the line at infinity with the IAC ω. From the images of the circular points, the image of the common center, and the images of the direction vectors of the X-axis and Y-axis, we can obtain the extrinsic parameters without ambiguity [4].
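One way to implement Proposition 4 is to stack one linear equation x̃1ᵀ ω x̃2 = 0 per image in the six entries of the symmetric matrix ω, solve by SVD, and recover K by a Cholesky factorization. The sketch below is our own illustration, not the authors' implementation; it assumes the vanishing-point pairs have already been extracted via the GED and that the data are clean enough for ω to come out positive definite.

```python
import numpy as np

def omega_row(u, v):
    """Row of the linear system expressing u^T omega v = 0 in the 6 parameters
    (w11, w12, w13, w22, w23, w33) of the symmetric IAC omega."""
    return np.array([u[0]*v[0],
                     u[0]*v[1] + u[1]*v[0],
                     u[0]*v[2] + u[2]*v[0],
                     u[1]*v[1],
                     u[1]*v[2] + u[2]*v[1],
                     u[2]*v[2]])

def calibrate_from_vanishing_pairs(pairs):
    """pairs: list of (x1, x2) homogeneous vanishing points of the imaged
    principal axes, one pair per image (at least 5 images for a full K)."""
    A = np.array([omega_row(x1, x2) for x1, x2 in pairs])
    w = np.linalg.svd(A)[2][-1]                  # null vector of A
    omega = np.array([[w[0], w[1], w[2]],
                      [w[1], w[3], w[4]],
                      [w[2], w[4], w[5]]])
    if omega[0, 0] < 0:                          # fix the overall sign
        omega = -omega
    # omega = K^{-T} K^{-1}; with L = chol(omega) lower triangular, K = L^{-T}.
    L = np.linalg.cholesky(omega)                # assumes omega positive definite
    K = np.linalg.inv(L).T
    return K / K[2, 2]
```

With noisy detections, a projection of ω onto the positive-definite cone or a nonlinear refinement of K would be needed before the Cholesky step.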
4.3 Calibration from Eccentricity-Known PAA Central Conics
Assume that the eccentricity of one of the PAA central conics is known. From Proposition 2, we can determine the image of the line at infinity from the images of the two PAA conics. Then we can obtain the images of the absolute points of the conic with known eccentricity by intersecting the image of the line at infinity with the image of this conic. Thus we can obtain the image of the conic dual to the absolute points, $\tilde{C}^*_\infty$. In fact, a suitable rectifying homography may be obtained directly from the identified $\tilde{C}^*_\infty$ in an image using the eigenvalue decomposition; after some manipulation, we obtain

$$\tilde{C}^*_\infty = U\begin{bmatrix} 1 & & \\ & 1 - e^2 & \\ & & 0 \end{bmatrix}U^T. \qquad (31)$$
The rectifying projectivity is H = U up to a scale and translation transformation.
Proposition 5. Once the dual conic $C^*_\infty$ is identified on the projective plane, projective distortion may be rectified up to a scale and translation transformation.
After performing the rectification, we can translate the image so that the coordinate origin coincides with the common center. Thus we obtain the 2D homography between the supporting plane and its image, where the coordinate system in the supporting plane is established with its axes coincident with the principal axes of the PAA central conics. Let us denote H = [h1 h2 h3]; from (6), we have
$$H = [\mathbf{h}_1\ \mathbf{h}_2\ \mathbf{h}_3] = K[\mathbf{r}_1\ \mathbf{r}_2\ \mathbf{t}]. \qquad (32)$$
Using the fact that r1 and r2 are orthonormal, we have [20],
$$\mathbf{h}_1^T K^{-T}K^{-1}\mathbf{h}_2 = 0, \quad \text{i.e.,} \quad \mathbf{h}_1^T\boldsymbol{\omega}\,\mathbf{h}_2 = 0, \qquad (33)$$
$$\mathbf{h}_1^T K^{-T}K^{-1}\mathbf{h}_1 = \mathbf{h}_2^T K^{-T}K^{-1}\mathbf{h}_2, \quad \text{i.e.,} \quad \mathbf{h}_1^T\boldsymbol{\omega}\,\mathbf{h}_1 = \mathbf{h}_2^T\boldsymbol{\omega}\,\mathbf{h}_2. \qquad (34)$$
These are two constraints on the intrinsic parameters from one homography. If the eccentricities of the two PAA central conics are both known, we can obtain a least-squares solution for the homography. From the discussions above, we have:
Proposition 6. From a single image of two PAA conics, if the eccentricity of one of the two conics is known, two constraints can be obtained on the IAC.
Given 3 images taken in general positions, we can obtain the IAC ω. The intrinsic parameter matrix K can be obtained from the Cholesky factorization of ω. Once K is obtained, the extrinsic parameters for each image can be recovered without ambiguity as proposed in [20].
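For the eccentricity-known case, the constraints of Eqs. (33)-(34) can be stacked in the same six unknowns of ω, exactly as in Zhang's method [20]. The helper below is a sketch with our own names: it returns the two rows contributed by one rectified homography; rows from three or more images are stacked, and the SVD null vector of the stacked matrix gives ω.

```python
import numpy as np

def v_row(hi, hj):
    """Row expressing hi^T omega hj in the 6 parameters of the symmetric omega."""
    return np.array([hi[0]*hj[0],
                     hi[0]*hj[1] + hi[1]*hj[0],
                     hi[0]*hj[2] + hi[2]*hj[0],
                     hi[1]*hj[1],
                     hi[1]*hj[2] + hi[2]*hj[1],
                     hi[2]*hj[2]])

def iac_constraints_from_homography(H):
    """The two linear constraints of Eqs. (33)-(34) contributed by one
    plane-to-image homography H = [h1 h2 h3]."""
    h1, h2 = H[:, 0], H[:, 1]
    return np.vstack([v_row(h1, h2),                   # h1^T omega h2 = 0
                      v_row(h1, h1) - v_row(h2, h2)])  # h1^T omega h1 = h2^T omega h2
```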
5 Experiments
We perform a number of experiments, both simulated and real, to test our algorithms with respect to noise sensitivity. Due to lack of space, the simulated experimental results are not shown here. In order to demonstrate the performance of our algorithm, we capture an image sequence of 209 real images, with resolution 800 × 600, to perform augmented reality. Edges were extracted using Canny's edge detector, and the ellipses were obtained using a least-squares ellipse fitting algorithm [1]. Some augmented reality examples are shown in Fig. 1 to illustrate the calibration results.
Fig. 1. Some augmented reality results
6 Conclusion
A thorough investigation of the projective geometric properties of principal-axes aligned conics is given in this paper. These properties are obtained by utilizing the generalized eigenvalue decomposition of two PAA conics. We define the absolute
points of a conic in standard form, which are the analogue of the circular points of a circle. Furthermore, we define the dual conic consisting of the two absolute points, the analogue of the dual conic consisting of the circular points. Using this dual conic, we propose a linear algorithm to obtain the extrinsic parameters of the camera. We also discovered a further example of PAA conics, consisting of a circle and a conic concentric with each other while the parameters of both are unknown, from a single image of which two constraints on the IAC can be obtained. Due to lack of space, this is not discussed in this paper. Exploring more novel patterns containing conics is our ongoing work.
Acknowledgements This work was supported in part by the NKBRPC 973 Grant No. 2006CB303100, the NNSFC Grant No. 60605010, the NHTRDP 863 Grant No. 2006AA01Z302, and the Key grant Project of Chinese Ministry of Education No. 103001.
References
1. Fitzgibbon, A.W., Pilu, M., Fisher, R.B.: Direct least squares fitting of ellipses. IEEE Trans. Pattern Analysis and Machine Intelligence 21(5), 476–480 (1999)
2. Forsyth, D., Mundy, J.L., Zisserman, A., Coelho, C., Heller, A., Rothwell, C.: Invariant descriptors for 3-D object recognition and pose. IEEE Trans. Pattern Analysis and Machine Intelligence 13(10), 971–991 (1991)
3. Gurdjos, P., Kim, J.-S., Kweon, I.-S.: Euclidean Structure from Confocal Conics: Theory and Application to Camera Calibration. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 1214–1222. IEEE Computer Society Press, Los Alamitos (2006)
4. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge, UK (2003)
5. Heisterkamp, D., Bhattacharya, P.: Invariants of families of coplanar conics and their applications to object recognition. Journal of Mathematical Imaging and Vision 7(3), 253–267 (1997)
6. Jiang, G., Quan, L.: Detection of Concentric Circles for Camera Calibration. In: Proc. Int'l Conf. Computer Vision, pp. 333–340 (2005)
7. Kahl, F., Heyden, A.: Using conic correspondence in two images to estimate the epipolar geometry. In: Proc. Int'l Conf. Computer Vision, pp. 761–766 (1998)
8. Kanatani, K., Liu, W.: 3D Interpretation of Conics and Orthogonality. Computer Vision and Image Understanding 58(3), 286–301 (1993)
9. Kim, J.-S., Gurdjos, P., Kweon, I.-S.: Geometric and Algebraic Constraints of Projected Concentric Circles and Their Applications to Camera Calibration. IEEE Trans. Pattern Analysis and Machine Intelligence 27(4), 637–642 (2005)
10. Ma, S.: Conics-Based Stereo, Motion Estimation, and Pose Determination. Int'l J. Computer Vision 10(1), 7–25 (1993)
11. Ma, S., Si, S., Chen, Z.: Quadric curve based stereo. In: Proc. of the 11th Int'l Conf. Pattern Recognition, vol. 1, pp. 1–4 (1992)
12. Mundy, J.L., Zisserman, A. (eds.): Geometric Invariance in Computer Vision. MIT Press, Cambridge (1992)
13. Mudigonda, P., Jawahar, C.V., Narayanan, P.J.: Geometric structure computation from conics. In: Proc. Indian Conf. Computer Vision, Graphics and Image Processing (ICVGIP), pp. 9–14 (2004)
14. Quan, L.: Algebraic and geometric invariant of a pair of noncoplanar conics in space. Journal of Mathematical Imaging and Vision 5(3), 263–267 (1995)
15. Quan, L.: Conic reconstruction and correspondence from two views. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(2), 151–160 (1996)
16. Semple, J.G., Kneebone, G.T.: Algebraic Projective Geometry. Oxford University Press, Oxford (1952)
17. Sugimoto, A.: A linear algorithm for computing the homography from conics in correspondence. Journal of Mathematical Imaging and Vision 13, 115–130 (2000)
18. Weiss, I.: 3-D curve reconstruction from uncalibrated cameras. In: Proc. of Int'l Conf. Pattern Recognition, vol. 1, pp. 323–327 (1996)
19. Yang, C., Sun, F., Hu, Z.: Planar Conic Based Camera Calibration. In: Proc. of Int'l Conf. Pattern Recognition, vol. 1, pp. 555–558 (2000)
20. Zhang, Z.: A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(11), 1330–1334 (2000)
3D Intrusion Detection System with Uncalibrated Multiple Cameras Satoshi Kawabata, Shinsaku Hiura, and Kosuke Sato Graduate School of Engineering Science, Osaka University, Japan
[email protected], {shinsaku,sato}@sys.es.osaka-u.ac.jp
Abstract. In this paper, we propose a practical intrusion detection system using uncalibrated multiple cameras. Our algorithm combines the contour based multi-planar visual hull method and a projective reconstruction method. To set up the detection system, no advance knowledge or calibration is necessary. A user can specify points in the scene directly with a simple colored marker, and the system automatically generates a restricted area as the convex hull of all specified points. To detect an intrusion, the system computes intersections of an object and each sensitive plane, which is the boundary of the restricted area, by projecting an object silhouette from each image to the sensitive plane using 2D homography. When an object exceeds one sensitive plane, the projected silhouettes from all cameras must have some common regions. Therefore, the system can detect intrusion by any object with an arbitrary shape without reconstruction of the 3D shape of the object.
1 Introduction
In this paper, we propose a practical system for detecting 3D volumetric intrusion in a predefined restricted area using uncalibrated multiple cameras. Intrusion detection techniques (e.g., person–machine collision prevention, off-limits area observation, etc.) are important for establishing safe, secure societies and environments. Today, equipment which detects the blocking of a light beam, referred to as a light curtain, is widely used for this purpose. Although the light curtain is useful for achieving very safe environments in settings previously considered dangerous, it is excessive for widespread applications. For example, the light curtain method requires us to set equipment at both sides of a rectangle for detection, which leads to higher cost, a limited shape of the detection plane, and set-up difficulty. In the meantime, surveillance cameras have been installed in many different environments; however, the scenes observed by these cameras are used only for recording or visual observation by distant human observers, and they are rarely used to warn a person in a dangerous situation or to immediately halt a dangerous machine. There are many computer-vision enhancements that recognize events in a scene [1], but it is difficult to completely detect dangerous situations, including unexpected phenomena. Furthermore, we do not have sufficient knowledge and methodologies to use the recognition results from such systems to ensure safety. Therefore, our proposed system simply detects an intrusion
in a specific area in 3D space using multiple cameras. We believe this system will help establish a safe and secure society. As mentioned above, flexibility and ease in setting up the equipment and the detection region are important factors for cost and practical use. However, there are two problems in image-based intrusion detection: one is the necessity of complex and cumbersome calibration for a multiple-camera system, and the other is the lack of an intuitive way of defining a restricted area. Thus, we propose a method to complete the calibration and the restricted-area definition simultaneously by simply moving a colored marker in front of the cameras.
2 Characterization and Simplification of the Intrusion Detection Problem
Over the last decade of computer vision research, there have been many studies on measuring or recognizing a scene captured by cameras in an environment. In particular, methods to extract or track a moving object in an image have been investigated with great effort and have rapidly progressed. In most of this research, the region of an object can be detected without consideration of the actual 3D shape. Therefore, although these techniques may be used for rough intrusion detection, they cannot handle detailed motion and deformation, such as whether a person is reaching for a dangerous machine or an object of value. On the other hand, there has been other research to reconstruct the whole shape of a target object from images taken by multiple cameras. Using this method, it is possible to detect the intrusion of an object in a scene by computing the overlapping region of the restricted area and the target object. This approach is not reasonable, however, because the reconstruction computation generally needs huge CPU and memory resources, and, as described later, the approach involves unnecessary processes for detecting an intrusion. In addition, it is not easy for users to set up such a system because the cameras must be calibrated precisely. Thus, we resolve these issues by considering two characteristics of the intrusion detection problem. The first is the projective invariance of the observed space in intrusion detection. The state of intrusion, that is, the existence of an overlapping region between the restricted area and the object, is invariant if the entire scene is projectively transformed. Hence, we can use weak calibration, instead of full calibration, to detect an intrusion. Furthermore, setting the restricted area can be done simultaneously with the calibration, because the relationship between the area and the cameras can also be represented in a projective space. Although the whole shape of an intruding object is only determined up to a projective ambiguity, this does not affect the detection of intrusion. The second characteristic is that a restricted area is always a closed region. Consequently, we do not have to check the total volume of a restricted area; it is sufficient to observe only the boundary of the restricted area. This manner of thinking is one of the standard approaches for ensuring safety, and is also adopted by the above-mentioned light curtain. Our system detects an intrusion by projecting the silhouette on each camera image onto the boundary plane, then
computing the common region of all the silhouettes. This common region on the boundary plane is equivalent to the intersection of the object shape reconstructed by the visual hull method with the boundary plane. The remainder of this paper is organized as follows. In the next section, the principle of our approach is described. We explain our approach in more detail in Section 3. In Section 4, we derive the simultaneous initialization (calibration and restricted area setting). We describe an experiment of intrusion detection in Section 5. In Section 6 we present our conclusion.
3 Detection of an Intruding Object
3.1 The Visual Hull Method
To decide whether an object exists in a specific area, the 3D shape of the object in the scene must be obtained. We adopt the visual hull method for shape reconstruction. In the visual hull method, the shape of an object is reconstructed by computing the intersection of all cones, each defined by the set of rays through a viewpoint and the points on the edge of the silhouette on the corresponding image plane. This method has the advantage that the texture of an object does not affect the reconstructed shape, because there is no need to search for corresponding points between images. However, this method tends to reconstruct a shape larger than the real one, particularly for concave surfaces. Also, an area invisible from any of the cameras makes it impossible to measure the shape there. Although this is a common problem for image-based surveillance, our approach is always safe because the proposed system treats the invisible area as part of the object. Although the visual hull method has great merit for intrusion detection, it needs large computational resources for the set operations in 3D space. Therefore, it is difficult to construct an intrusion detection system that is reasonable in cost and works in real time.
3.2 Section Shape Reconstruction on a Sensitive Plane
As mentioned above, it is sufficient to observe only the sensitive planes, i.e., the boundary of the restricted area, for intrusion detection. Accordingly, only the shape of the intersection region on a sensitive plane is reconstructed, by homography-based volume intersection [2]. In this case, the common region of the projected silhouettes on the plane is equivalent to the intersection of the visual hull and the plane. Therefore, when an object exceeds a sensitive plane, a common region appears on the plane (Fig. 1). In this way, the 3D volumetric intrusion detection problem is reduced to efficient inter-plane projection and common-region computation in 2D space.
3.3 Vector Representation of the Silhouette Boundary
The visual hull method uses only the boundary information of a silhouette. Therefore, the amount of data can be decreased by replacing the boundary with
Fig. 1. Intrusion detection based on the existence of an intersection: (a) non-intruding, (b) intruding
Fig. 2. Vector representation of silhouette contours
vector representation by tracking the edge of the silhouette in an image (Fig. 2). In the vector representation, the projection between planes is achieved by transforming only the few vertices on the edge, and the common-region computation reduces to deciding whether each vertex is inside or outside the other contour. With this representation, we are able to reduce the computational cost of the transformation and common-region calculation, and it is not necessary to adjust the resolution of the sensitive plane to compute the common region with sufficient precision. In a distributed vision system, it is also possible to reduce the amount of communication data, because the camera-connected nodes extract silhouette contours and a single host gathers the silhouette data and computes the common region.
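As an illustration of the common-region computation on a sensitive plane, the following sketch clips one projected silhouette polygon against another using Sutherland-Hodgman clipping. This is our own example and assumes the clip polygon is convex; the paper only states that vertices are tested for being inside or outside the other contour and does not prescribe a particular clipping algorithm.

```python
def clip_polygon(subject, clip):
    """Sutherland-Hodgman clipping of polygon `subject` against the convex,
    counter-clockwise polygon `clip` (both given as lists of (x, y) vertices).
    An empty result means the two projected silhouettes have no common region."""
    def inside(p, a, b):        # p lies to the left of the directed edge a->b
        return (b[0]-a[0])*(p[1]-a[1]) - (b[1]-a[1])*(p[0]-a[0]) >= 0
    def intersect(p, q, a, b):  # intersection of segment p-q with the line a-b
        d1, d2 = (q[0]-p[0], q[1]-p[1]), (b[0]-a[0], b[1]-a[1])
        denom = d1[0]*d2[1] - d1[1]*d2[0]
        t = ((a[0]-p[0])*d2[1] - (a[1]-p[1])*d2[0]) / denom
        return (p[0] + t*d1[0], p[1] + t*d1[1])
    output = list(subject)
    for i in range(len(clip)):
        a, b = clip[i], clip[(i + 1) % len(clip)]
        input_list, output = output, []
        for j in range(len(input_list)):
            p, q = input_list[j - 1], input_list[j]
            if inside(q, a, b):
                if not inside(p, a, b):
                    output.append(intersect(p, q, a, b))
                output.append(q)
            elif inside(p, a, b):
                output.append(intersect(p, q, a, b))
        if not output:
            break
    return output
```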
3.4 Procedure of the Proposed System
In summary, intrusion detection on the boundary is realized by the following steps:
1. Defining sensitive planes.
2. Extracting the silhouette of a target object.
3. Generating the vector representation from the silhouette.
4. Projecting each silhouette vector onto sensitive planes.
5. Computing the common region.
6. Deciding the intrusion.
In the next section, we discuss step 1.
4 Construction of a Restricted Area
Using the following relationship, the silhouette of an object on an image plane can be transformed onto a sensitive plane. Let x (∈ ℝ²) be the coordinates of a
Fig. 3. Homography between two planes (viewpoint, image plane, sensitive plane)
point on a sensitive plane. The corresponding point x' on the image plane can be calculated as follows:

$$\mu\tilde{\mathbf{x}}' = H\tilde{\mathbf{x}}, \qquad (1)$$

$$H = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix}, \qquad (2)$$

where x̃ denotes the homogeneous coordinates of x. The matrix H is referred to as a homography matrix, which has only 8 DOF because of scale invariance. From Eq. (2), the homography matrix can be determined from four or more pairs of corresponding points specified by a user. However, this method is a burden on users, who must provide correspondences in proportion to the product of the number of cameras and the number of sensitive planes. Also, it is not easy for users to define an arbitrary restricted area without a reference object. Therefore, in the next section, we introduce a more convenient method for setting a sensitive plane.
4.1 Relation of the Homography Matrix and Projection Matrix
Instead of specifying points on an image from the camera's view, it is easy to place a small marker in the real observed space so that we obtain the corresponding points using the cameras. However, in this case, it is difficult to indicate four points lying exactly on a plane in real 3D space. Therefore, we consider a method in which users input enough 'inner' points of the restricted area, and the system automatically generates a set of sensitive planes which cover all the input points. Now, when we know the projection matrix P, which maps a scene coordinate onto the image plane, the relationship between X, a point in 3D space, and x, a point on the image plane, is given by

$$\lambda\tilde{\mathbf{x}} = P\tilde{\mathbf{X}}. \qquad (3)$$
Likewise, as shown in Fig. 4, a point on the plane Π in 3D space is projected onto the image plane as follows.
Fig. 4. A plane in 3D space projected onto the image plane
$$\lambda\tilde{\mathbf{x}} = P(\alpha\tilde{\mathbf{e}}_1 + \beta\tilde{\mathbf{e}}_2 + \tilde{\boldsymbol{\pi}}_0) \qquad (4)$$
$$\qquad\ = P\,[\tilde{\mathbf{e}}_1\ \tilde{\mathbf{e}}_2\ \tilde{\boldsymbol{\pi}}_0]\,[\alpha\ \ \beta\ \ 1]^T, \qquad (5)$$

where e1, e2 are basis vectors of Π in 3D, and π0 and (α, β) are the origin and the plane parameters of Π, respectively. From Eq. (5), we can compute the homography matrix between an arbitrary plane in 3D and the image plane by

$$H = P\,[\tilde{\mathbf{e}}_1\ \tilde{\mathbf{e}}_2\ \tilde{\boldsymbol{\pi}}_0]. \qquad (6)$$

Therefore, when we know the projection matrices of the cameras and are given three or more points on a plane in 3D, it is possible to define the plane as a sensitive plane, except in singular cases (e.g., all points lie on a line). For example, three adjacent points X0, X1, X2 define one plane:

$$\mathbf{e}_1 := \mathbf{X}_1 - \mathbf{X}_0, \quad \mathbf{e}_2 := \mathbf{X}_2 - \mathbf{X}_0, \quad \boldsymbol{\pi}_0 := \mathbf{X}_0. \qquad (7)$$

As mentioned above, a set of homography matrices can be generated automatically from each given camera projection matrix and the vertices of the sensitive planes in 3D space. However, in our problem, we assume both the camera parameters and the 3D points are unknown. Therefore, we have to calculate both by the projective reconstruction technique [3] using the given corresponding points between cameras.
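Equations (6)-(7) translate directly into a small routine that builds, for one camera, the homography from a triangular sensitive plane to the image. The sketch below is our own illustration; the function name and the normalization are our choices, and the plane vertices are assumed to be given as 3-vectors.

```python
import numpy as np

def plane_homography(P, X0, X1, X2):
    """Homography mapping plane coordinates (alpha, beta, 1) of the sensitive
    plane through the 3D points X0, X1, X2 onto the image plane, following
    Eqs. (6)-(7). P is the 3x4 camera projection matrix."""
    X0, X1, X2 = (np.asarray(p, dtype=float) for p in (X0, X1, X2))
    e1, e2, pi0 = X1 - X0, X2 - X0, X0         # plane basis and origin, Eq. (7)
    homog = lambda v, w: np.append(v, w)       # homogenize with last entry w
    H = P @ np.column_stack([homog(e1, 0.0),   # directions have w = 0
                             homog(e2, 0.0),
                             homog(pi0, 1.0)]) # origin has w = 1
    return H / H[2, 2]                         # normalization is a convenience
```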
4.2 Generation of Sensitive Planes from Reconstructed Inner Points
Now we have the projection matrices and many reconstructed 3D points which reside in the restricted area, so we have to determine suitable sets of 3D points as the vertices of the sensitive planes. We compute the convex hull, which covers all the input points, to generate the sensitive planes. The system defines a restricted
Fig. 5. Points and their convex hull (2D case)
Fig. 6. Flow chart of the proposed system: sensitive plane setup (inputting points by a marker, projective reconstruction, convex hull calculation, generation of sensitive planes), followed by intrusion detection (silhouette extraction, silhouette vectorization, projection onto sensitive planes, common region computation)
area as the boundary of the convex hull computed using qhull [4] (Fig. 5). The reconstructed points that are not on the boundary are removed because they do not contribute to any sensitive plane.
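In practice the hull computation can be delegated to qhull, for example through SciPy. The following sketch is ours (the function name and return format are arbitrary choices); it turns a set of input points into triangular sensitive-plane facets and illustrates that interior points are discarded, as described above.

```python
import numpy as np
from scipy.spatial import ConvexHull

def sensitive_planes(points_3d):
    """Triangular sensitive planes covering all user-input points, taken as the
    boundary facets of their convex hull (computed with qhull via SciPy).
    Returns, for each facet, the tuple (X0, X1, X2) of its vertices."""
    pts = np.asarray(points_3d, dtype=float)
    hull = ConvexHull(pts)                      # qhull under the hood
    return [tuple(pts[idx] for idx in simplex) for simplex in hull.simplices]

# Example: 12 triangular planes are generated for the 8 corners of a cube,
# and a 9th point placed inside the cube does not contribute any plane.
cube = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
planes = sensitive_planes(cube + [(0.5, 0.5, 0.5)])
assert len(planes) == 12
```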
5 Experiment
We implemented the proposed intrusion detection method in a multiple-camera system. From the user's point of view, the system has two phases: one is setting the sensitive planes and the other is executing intrusion detection (see Fig. 6). Since the latter phase is completely automated, users only need to input corresponding points with a simple marker. Therefore, any complicated technical process, such as calibration of the multiple-camera system, is handled internally when setting the actual sensitive planes. In this experiment, we verify the proposed method of sensitive plane generation and intrusion detection in projective space. The system consists of three cameras (SONY DFW-VL500) and a PC (dual Intel Xeon @ 3.6 GHz with HT). We set the cameras at appropriate positions so that each camera can observe the whole region in which to detect an intrusion (Fig. 7).
5.1 Input of Sensitive Plane Using a Colored Marker
We use a simple red-colored marker to input corresponding points among all image planes. First, the user specifies the color of the marker by clicking on the marker area; the system then computes the mean and the variance of that area. According to the Mahalanobis distance between the input color at each pixel and the reference color, the system extracts similar pixels by thresholding the distance. For noise reduction, the center of gravity of the largest region is
Fig. 7. Cameras and observed space
Fig. 8. Setting of restricted area (top: camera view, bottom: extracted marker position)
Fig. 9. Inputted points and generated convex hull
calculated as the marker position (Fig. 8). The user moves the marker in the real scene to set up the restricted area. Fig. 9 shows an example of the sensitive planes generated from the input points. In this case, 16 sensitive planes are generated from 10 of the 12 input points; the remaining two points are removed because they lie inside the convex hull.
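A minimal version of the marker-extraction step described above can be written as follows. This is our own sketch: it uses a full covariance matrix and a squared-distance threshold, whereas the paper only states that a mean and variance are computed and the Mahalanobis distance is thresholded; finding the largest connected region and its center of gravity (e.g., with a labeling routine) is omitted.

```python
import numpy as np

def marker_mask(image, ref_pixels, threshold=3.0):
    """Pixels whose color lies within `threshold` Mahalanobis distance of the
    reference marker color. `image` is an HxWx3 float array, `ref_pixels` an
    Nx3 array of color samples clicked by the user; names are ours."""
    mean = ref_pixels.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(ref_pixels, rowvar=False)
                            + 1e-6 * np.eye(3))          # regularized inverse
    diff = image.reshape(-1, 3) - mean
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)   # squared Mahalanobis
    return (d2 < threshold**2).reshape(image.shape[:2])
```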
5.2 Intrusion Detection
In this experiment, we input eight points on the vertices of a hexahedron. Fig. 10 depicts the generated set of sensitive planes from the input points. In this case, 12 planes are generated by the proposed method. The result of the intrusion detection is shown in Fig. 11. In our implementation, we use a statistical background subtraction method [5] to extract a silhouette of the object from an image. The silhouette is transformed into vector representation by tracking the edge and projected onto each sensitive
Fig. 10. Generated sensitive planes
Fig. 11. Detection result (top: intrusion of a leg, bottom: intrusion of a wrist, reaching for the object)
plane. Then, the system computes the common region on each sensitive plane. In the figure, the leg or wrist of the intruder is detected on the boundary of the restricted area. Although some false-positive silhouette regions can be seen (e.g., the cast shadow in the image of the top row, third column), our method is robust against such noise because the common region is computed over all extracted silhouettes.
6 Conclusion
In this paper, we introduce an intrusion detection system for an arbitrary 3D volumetric restricted area using uncalibrated multiple cameras. Although our algorithm is based on the visual hull method, the whole shape of the intruding object does not need to be reconstructed; instead, the system can efficiently detect an intrusion by perspective projections in 2D space. In general, an intricate calibration process has been necessary for a distributed camera system, but the proposed system automatically calibrates the cameras when users input corresponding points while setting the restricted region. Furthermore, the user does not need any prior knowledge about the cameras because of the projective reconstruction, and any combination of cameras with different intrinsic parameters can be used. Therefore, non-expert users can intuitively operate the proposed system for intrusion detection simply by setting the cameras in place.
References
1. Collins, R.T., Lipton, A.J., Kanade, T., Fujiyoshi, H., Duggins, D., Tsin, Y., Tolliver, D., Enomoto, N., Hasegawa, O., Burt, P., et al.: A System for Video Surveillance and Monitoring (VSAM project final report). Technical Report CMU-RI-TR-00, CMU (2000)
2. Wada, T., Wu, X., Tokai, S., Matsuyama, T.: Homography Based Parallel Volume Intersection: Toward Real-Time Volume Reconstruction Using Active Cameras. In: Proc. Computer Architectures for Machine Perception, pp. 331–339 (2000)
3. Mahamud, S., Hebert, M.: Iterative projective reconstruction from multiple views. In: Proc. CVPR, vol. 2, pp. 430–437 (2000)
4. Barber, C.B., Dobkin, D.P., Huhdanpaa, H.: The quickhull algorithm for convex hulls. ACM Trans. Mathematical Software (TOMS) 22(4), 469–483 (1996), http://www.qhull.org
5. Horprasert, T., Harwood, D., Davis, L.S.: A statistical approach for real-time robust background subtraction and shadow detection. In: ICCV 1999, pp. 1–19 (1999)
Non-parametric Background and Shadow Modeling for Object Detection Tatsuya Tanaka1 , Atsushi Shimada1, Daisaku Arita1,2 , and Rin-ichiro Taniguchi1 1
Department of Intelligent Systems, Kyushu University, 744, Motooka, Nishi-ku, Fukuoka 819–0395 Japan 2 Institute of Systems & Information Technologies/KYUSHU 2–1–22, Momochihama, Sawara-ku, Fukuoka 814–0001 Japan
Abstract. We propose a fast algorithm to estimate background models using Parzen density estimation in non-stationary scenes. Each pixel has a probability density which approximates pixel values observed in a video sequence. It is important to estimate a probability density function fast and accurately. In our approach, the probability density function is partially updated within the range of the window function based on the observed pixel value. The model adapts quickly to changes in the scene and foreground objects can be robustly detected. In addition, applying our approach to cast-shadow modeling, we can detect moving cast shadows. Several experiments show the effectiveness of our approach.
1 Introduction
Background subtraction has traditionally been applied to the detection of objects in images. Without prior information about the objects, we can obtain object regions by subtracting a background image from an observed image. However, when a simple background subtraction technique is applied to video-based surveillance, which usually captures outdoor scenes, it often detects not only objects but also many noise regions. This is because it is quite sensitive to small illumination changes caused by moving clouds, swaying tree leaves, etc. There are many approaches to handling these background changes [1,2,3,4]. Han et al. proposed a background estimation method in which a mixture of Gaussians is used to approximate the background model and the number of Gaussians is variable at each pixel. Their method can handle variations in lighting since a Gaussian is inserted or deleted according to the illumination conditions. However, it takes a long time to estimate the background model. There are also several approaches that estimate the background model in a shorter time [5,6]. For example, Stauffer et al. proposed a fast estimation method that avoids a costly matrix inversion by ignoring the covariance components of multi-dimensional Gaussians [6]. However, the number of Gaussians is constant in their background model. When recently observed pixel values change frequently, a constant number of Gaussians is not always enough to estimate the background model accurately, and it is very difficult to determine the appropriate number of Gaussians in advance. Shimada et al. proposed a fast method in which the number of Gaussians is changed dynamically to adapt to changes in the lighting conditions [7]. However, in principle, the Gaussian Mixture Model (GMM) cannot produce a well-suited background model and cannot detect foreground objects accurately when the intensity
of the background changes frequently. Especially when the intensity distribution of the background is very wide, it is not easy to represent the distribution with a set of Gaussians. In addition, if the number of Gaussians is increased, the computation time to estimate the background model also increases. Thus, the GMM is not powerful enough to represent the various changes of the lighting conditions. To solve this problem, Elgammal et al. employed a non-parametric representation of the background intensity distribution and estimated the distribution by Parzen density estimation [2]. However, in their approach, the computational cost of the estimation is quite high, and it is not easy to apply it to real-time processing. Another problem of background subtraction is that detected foreground regions generally include not only the objects to be detected but also their cast shadows, since the shadow intensity differs from that of the modeled background. This misclassification of shadow regions as foreground objects can cause various unwanted behaviors such as object shape distortion and object merging, affecting surveillance capabilities like target counting and identification. To obtain better segmentation quality, detection algorithms must correctly separate foreground objects from the shadows they cast. To this end, various approaches have been proposed [8,9,10,11,12]. Martel-Brisson et al. proposed a shadow detection method [12] in which the detection of moving cast shadows is incorporated into a background subtraction algorithm. However, they use a GMM to model the background and shadow, and the aforementioned problem of the GMM remains. In this paper, we propose a fast algorithm to estimate a non-parametric probability distribution based on Parzen density estimation, which is applied to background modeling. Also, applying our approach to cast-shadow modeling, we can detect moving cast shadows. Several experiments show its effectiveness, i.e., its accuracy and computational efficiency.
2 Background Estimation by Parzen Density Estimation
2.1 Basic Algorithm
First, we describe the basic background model estimation and object detection process. The background model is established to represent recent pixel information of an input image sequence, reflecting changes of the intensity (or pixel-value) distribution as quickly as possible. We consider the values of a particular pixel (x, y) over time as a "pixel process", which is a time series of pixel values, e.g. scalars for gray values and vectors for color images. Each pixel is judged to be a foreground pixel or a background pixel by observing the pixel process. In Parzen density estimation (kernel density estimation), the probability density function (PDF) of a pixel value is estimated with reference to the latest pixel process, and here we assume that a pixel process consists of the latest N pixel values. Let X be a pixel value observed at pixel (x, y), and {X_1, · · · , X_N} be the latest pixel process. The PDF of the pixel value is estimated with the kernel estimator K as follows:

$$P(\mathbf{X}) = \frac{1}{N}\sum_{i=1}^{N} K(\mathbf{X} - \mathbf{X}_i). \qquad (1)$$
Usually a Gaussian distribution function N(0, Σ) is adopted for the estimator K.¹ In this case, equation (1) reduces to

$$P(\mathbf{X}) = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{X} - \mathbf{X}_i)^T\Sigma^{-1}(\mathbf{X} - \mathbf{X}_i)\right), \qquad (2)$$
where d is the dimension of the distribution (for example, d = 3 for color image pixels). To reduce the computational cost, the covariance matrix in equation (2) is often approximated as

$$\Sigma = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & \sigma_d^2 \end{bmatrix}. \qquad (3)$$

This means that each dimension of the distribution is treated as independent of the others. With this approximation, equation (2) reduces to

$$P(\mathbf{X}) = \frac{1}{N}\sum_{i=1}^{N}\prod_{j=1}^{d} \frac{1}{(2\pi\sigma_j^2)^{1/2}} \exp\!\left(-\frac{1}{2}\frac{(\mathbf{X} - \mathbf{X}_i)_j^2}{\sigma_j^2}\right). \qquad (4)$$
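For reference, Eq. (4) and the thresholding test of the basic algorithm can be written compactly as below. This is our own sketch with assumed names; the per-pixel bookkeeping of the pixel process is omitted.

```python
import numpy as np

def parzen_probability(x, samples, sigma):
    """P(x) of Eq. (4): Parzen estimate with an axis-aligned Gaussian kernel.
    `samples` is the pixel process (N x d array of the latest N pixel values),
    `sigma` the per-dimension smoothing bandwidths (length d)."""
    x, samples, sigma = map(np.asarray, (x, samples, sigma))
    z = (x - samples) / sigma                        # (N, d) standardized diffs
    norm = np.prod(np.sqrt(2.0 * np.pi) * sigma)
    return np.mean(np.exp(-0.5 * np.sum(z**2, axis=1))) / norm

def is_background(x, samples, sigma, threshold):
    """Step 2 of the basic algorithm: threshold the estimated probability."""
    return parzen_probability(x, samples, sigma) > threshold
```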
Here, Σ works as the smoothing parameter.
¹ Here, Σ works as the smoothing parameter.
P( X )
Pt ( X ) = Pt −1 ( X ) +
1 ⎛ | X − X N +1 | ⎞ 1 ⎛ | X − X1 | ⎞ Ȁ⎜ Ȁ⎜ ⎟− ⎟ d Nh d ⎝ h h ⎠ ⎠ Nh ⎝
Update within the range of the window function
h=5
Background
K(u)
d=1
1 h
Threshold
u
䈅䈅䈅 − h
2
0
h 2
Fig. 1. Kernel function of our algorithm
Pixel Value
Observed value
Oldest data
Fig. 2. Update of background model
At first, we use a kernel with rectangular shape, or hypercube, instead of Gaussian distribution function. For example, in 1-dimensional case, the kernel is represented as follows (see Figure 1). 1 if − 12 ≤ u ≤ h2 K(u) = h (5) 0 otherwise where h is a parameter representing the width of the kernel,i.e., some smoothing parameter [13]. Using this kernel, equation (1) is represented as follows: N |X − X i | 1 1 ψ (6) P (X) = N i=1 hd h where, |X − X i | means the chess-board distance in d-dimensional space, and ψ(u) is calculated by the following formula. 1 if u ≤ 12 ψ(u) = (7) 0 otherwise When an observed pixel value is inside of the kernel located at X, ψ(u) is 1; otherwise ψ(u) is 0. Thus, we estimate the PDF based on equation (6), and P (X) is calculated by enumerating pixels in the latest pixel process whose values are inside of the kernel located at X. However, if we calculate the PDF, in a naive way, by enumerating pixels in the latest pixel process whose values are inside of the kernel located at X, the computational time is proportional to N . Instead, we propose a fast algorithm to compute the PDF, whose computation cost does not depend on N . In background modeling we estimate P(X) referring to the latest pixel process consisting of pixel values of the latest N frames. Let us suppose that at time t we have a new pixel value X N +1 , and that we estimate an updated PDF P t (X) referring to the new X N +1 . Basically, the essence of PDF estimation is accumulation of the kernel
estimator: when a new value X_{N+1} is acquired, the kernel estimator corresponding to X_{N+1} should be accumulated. At the same time, the oldest one, i.e., the kernel estimator from N frames earlier, should be discarded, since the length of the pixel process is constant, N. This idea reduces the PDF computation to the following incremental computation:
\[
P_t(X) = P_{t-1}(X) + \frac{1}{N h^d}\, \psi\!\left( \frac{|X - X_{N+1}|}{h} \right) - \frac{1}{N h^d}\, \psi\!\left( \frac{|X - X_1|}{h} \right) \qquad (8)
\]
where P_{t−1} is the PDF estimated at the previous frame. The above equation means that the PDF when a new pixel value is observed can be acquired by:
– increasing the probabilities of the pixel values that are inside the kernel located at the new pixel value X_{N+1} by 1/(N h^d);
– decreasing the probabilities of those that are inside the kernel located at the oldest pixel value X_1 (the pixel value from N frames earlier) by 1/(N h^d).
In other words, the new PDF is acquired by a local operation on the previous PDF, assuming the latest N pixel values are stored in memory, which makes the PDF estimation quite fast. Figure 2 illustrates how the PDF, or the background model, is modified.
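A minimal sketch of this incremental update for the one-dimensional case (d = 1) is given below, keeping the PDF as a discrete table over the 256 gray values. The table resolution and the circular buffer holding the pixel process are implementation assumptions, not choices stated in the paper.

```python
import numpy as np
from collections import deque

class IncrementalParzenBackground:
    """Per-pixel background PDF with a box kernel, updated as in equation (8), d = 1."""

    def __init__(self, n_frames=500, h=5, n_values=256):
        self.N, self.h = n_frames, h
        self.pdf = np.zeros(n_values)           # P_t(X) sampled at every integer gray value
        self.process = deque(maxlen=n_frames)   # the latest N observed values

    def _add(self, value, sign):
        lo = max(0, value - self.h // 2)
        hi = min(len(self.pdf), value + self.h // 2 + 1)
        self.pdf[lo:hi] += sign / (self.N * self.h)   # +/- 1/(N h^d) inside the kernel

    def update(self, new_value):
        """Add the kernel of the new value; discard the kernel of the oldest one."""
        if len(self.process) == self.N:
            self._add(self.process[0], -1.0)    # oldest value, evicted by the deque below
        self.process.append(new_value)
        self._add(new_value, +1.0)

    def probability(self, value):
        # Properly normalized once N frames have been observed.
        return self.pdf[value]
```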
3 Cast-Shadow Modeling by Parzen Density Estimation
In this section, we propose a method to detect moving cast shadows in a background subtraction algorithm. We have developed a cast shadow detection method based on an idea similar to [12]: it relies on the observation that a shadow cast on a surface attenuates the three components of its YUV color by roughly the same ratio. We first estimate this attenuation ratio from the Y component, and then we examine whether both the U and V components are reduced by a similar ratio. More specifically, if the color vector X represents a shadow cast on a surface whose background color vector is B, we have
\[
\alpha_{\min} < \alpha_Y < 1 \quad \text{with } \alpha_Y = \frac{X_Y}{B_Y} \qquad (9)
\]
\[
\min\{|X_U|, |X_V|\} > \epsilon \qquad (10)
\]
\[
\left| \alpha_Y - \frac{X_U}{B_U} \right| < \Lambda \qquad (11)
\]
\[
\left| \alpha_Y - \frac{X_V}{B_V} \right| < \Lambda \qquad (12)
\]
where B denotes the pixel value of the highest probability. α_min is a threshold on the maximum luminance reduction. This threshold is important when the U and V components of a color are small, in which case any dark pixel value would otherwise be labeled as a shadow on a light-colored surface. ε is a threshold on the minimum value of the U and V components. If either X_U or X_V does not satisfy equation (10), we use only equation (9). Λ represents the tolerable chromaticity fluctuation around the surface values B_U,
B_V. If these conditions are satisfied, the pixel value is regarded as "pseudo-shadow," and the shadow model is updated with this pixel value using a procedure similar to the one described in section 2.2. The detailed algorithm of shadow model construction and shadow detection is summarized as follows:
1. Background subtraction is performed with the dynamic background model described in section 2.2.
2. If the pixel is labeled as foreground, P_S(X_{N+1}), the probability that X_{N+1} belongs to a cast shadow, is estimated. If P_S(X_{N+1}) is greater than a given threshold, the pixel is judged to be a shadow pixel; otherwise, it is judged to be an object pixel. In the shadow model, however, the number of "pseudo-shadow" pixel values may not be sufficient to approximate the shadow model, because the shadow model is updated only when the observed pixel is regarded as "pseudo-shadow." For such pixels, equations (9)–(12) are used directly for shadow detection.
3. When the observed pixel value satisfies equations (9)–(12), the pixel value is regarded as "pseudo-shadow," and the shadow model is updated in a way similar to that described in section 2.2.
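The per-pixel shadow test of equations (9)–(12) can be sketched as follows. The threshold values for α_min, Λ, and the U/V minimum-value threshold ε are illustrative guesses, since the paper does not state them at this point.

```python
def is_pseudo_shadow(x_yuv, b_yuv, alpha_min=0.3, lam=0.05, eps=10.0):
    """Check the cast-shadow conditions (9)-(12) for one pixel.

    x_yuv : observed (Y, U, V) value
    b_yuv : background (Y, U, V) value with the highest probability
    alpha_min, lam (Lambda) and eps are illustrative threshold guesses.
    """
    xy, xu, xv = x_yuv
    by, bu, bv = b_yuv
    if by == 0:
        return False
    alpha_y = xy / by
    if not (alpha_min < alpha_y < 1.0):            # condition (9): luminance attenuation
        return False
    if min(abs(xu), abs(xv)) <= eps:               # condition (10) fails: use (9) only
        return True
    if bu == 0 or bv == 0:
        return False
    return (abs(alpha_y - xu / bu) < lam and       # conditions (11) and (12):
            abs(alpha_y - xv / bv) < lam)          # chromaticity reduced by a similar ratio
```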
4 Experiment
4.1 Experiment 1: Experiment on the Dynamic Background Model
In our experiment verifying the effectiveness of the proposed method, we have used the PETS2001 data set, after the image resolution is reduced to 320 × 240 pixels. The data set includes images in which people and cars are passing through streets and tree leaves are flickering, i.e., the illumination conditions are varying rapidly. Using this data set, we have compared the proposed method with the adaptive GMM (a GMM in which the number of Gaussians is adaptively changed) [7] and Elgammal's method based on Parzen density estimation [2]. In this experiment, supposing that the R, G, B components of pixel values are independent of one another, we estimate a one-dimensional PDF of each component. Then, we have judged a pixel as a foreground pixel when the probability of at least one component, either R, G or B, is below a given threshold. For the evaluation of computation speed, we have used a PC with a Pentium IV 3.3 GHz and 2.5 GB memory. Next, we have evaluated the computation time to process one image frame. For the proposed algorithm, we have used h = 5 and N = 500. Figure 3 shows the comparison between the proposed method and the adaptive GMM method, where the horizontal axis is the frame-sequence number and the vertical axes are the processing time (left) and the average number of Gaussians assigned to each pixel (right). In this experiment, the number of Gaussians increases after the 2500th frame, where people and cars, i.e., foreground objects, begin to appear in the scene.
Benchmark data of the International Workshop on Performance Evaluation of Tracking and Surveillance, available from ftp://pets.rdg.ac.uk/PETS2001/.
Fig. 3. Processing time of adaptive GMM and average number of Gaussians (processing time in msec and number of Gaussians vs. frame number, for the traditional approach and the proposed method)
Fig. 4. The number of samples N and the required processing time (msec)
Fig. 5. Recall and precision (%) of the proposed method, the Gaussian mixture model, and the traditional approach
In the adaptive GMM method, the number of Gaussians is increased so that the changes of pixel values are properly represented in the GMM. However, when the number of Gaussians increases, the computation time also increases. On the other hand, the computation time of the proposed method does not change depending on the scene, which shows that the real-time characteristic, i.e., the invariance of the processing speed, of the proposed method is much better than that of the adaptive GMM method. Next, Figure 4 shows a comparison between the proposed method and Elgammal's method based on Parzen density estimation. In Elgammal's method, the computation time is almost proportional to the length of the pixel process from which the PDF is estimated, and, from the viewpoint of real-time processing, we cannot use a long image sequence to estimate the PDF. For example, in a standard PC environment like the one in our experiment, only up to 200 frames can be used for the PDF estimation in Elgammal's method. In our method, on the other hand, when we estimate the PDF, we just update it in a local region, i.e., in the kernel located at the oldest pixel value and in the kernel located at the newly observed pixel value, and the computation cost does not depend on the length of the pixel process at all. Finally, to evaluate the object detection accuracy, we examine the precision and recall rates of object detection. Precision and recall are defined as follows:
\[
\text{precision} = \frac{\#\text{ of correctly detected objects}}{\#\text{ of detected objects}} \qquad (13)
\]
\[
\text{recall} = \frac{\#\text{ of correctly detected objects}}{\#\text{ of objects which should be detected}} \qquad (14)
\]
Fig. 6. Object detection by the proposed method and the GMM-based method: (a) input image, (b) background image, (c) proposed method, (d) GMM method
When we apply our proposed method and Elgammal's method, we set N = 500. In addition, we set h = 5 in our method. Figure 5 shows the precision and recall when the data set is processed by the proposed method, the adaptive GMM method, and Elgammal's method, where the vertical axis is the recall and precision rate. This shows that the proposed method outperforms the adaptive GMM method. It is also shown that the proposed method gives almost the same performance as Elgammal's method, although the proposed method uses a simple kernel function, i.e., the rectangular function shown in Figure 1. We have achieved a recall rate of 94.38% and a precision rate of 88.89%. Figure 6 shows the results of object detection by the proposed method. Figure 6(a) is an input image frame. Figure 6(b) is the background model acquired when that input image frame is acquired, showing the pixel value having the highest probability at each pixel. Figure 6(c) shows the detected objects. Figure 6(d) shows the object detection result acquired by the adaptive GMM method. Comparing these two results, the proposed method exhibits a very good result.
4.2 Experiment 2: Experiment on the Dynamic Shadow Model
We took indoor scenes in which people were walking on the floor. These images include shadows of various darkness cast by the pedestrians. The size of each image is 320 × 240 and each pixel has a 24-bit RGB value. We have compared the proposed method with the adaptive Gaussian Mixture Shadow Model (GMSM) [12]. For object detection, the dynamic background models based on Parzen density estimation and on the GMM are used, respectively. Figure 7 shows the results of shadow detection by the proposed method. Figure 7(a) is an input image frame. Figure 7(b) shows the shadow detection result acquired by the
Fig. 7. Shadow detection by the proposed method and the GMSM-based method: (a) input image, (b) proposed method, (c) GMSM method
proposed method. Figure 7(c) shows the shadow detection result acquired by the GMSM method. The red colored pixels represent pixels judged to be shadow pixels. In Figure 7(b), the green colored pixels represent pixels judged to be shadow pixels directly by equations (9)–(12); these pixels cannot be examined by the probabilistic model, because the number of pseudo-shadow pixels is not sufficient to estimate the probability distribution of the shadow pixel value. Comparing these two results with each other, the proposed method exhibits a good result. In addition, the computation time of the proposed method is superior to that of GMSM, i.e., the former is 88 msec per image frame while the latter is 97 msec.
5 Conclusion
In this paper, we have proposed a fast computation method to estimate a non-parametric background model using Parzen density estimation. We estimate the PDF of the background pixel value at each pixel position. In general, to estimate the PDF at every image frame, a pixel value sequence of the latest N frames, or a pixel process, should be referred to. In our method, using a simple kernel function, the PDF can be estimated from the PDF at the previous frame using local operations on the PDF. This greatly reduces the computation cost of the PDF estimation. Comparison of our method with the GMM-based method and Elgammal's method based on Parzen density estimation shows that our method has the following merits: small computation cost, a real-time characteristic (invariance of computation speed), and good object detection accuracy.
In addition, applying our approach to shadow modeling, we can construct a shadow model and detect moving cast shadows correctly. Comparison of our method with the GMSM-based method shows its effectiveness, i.e., its accuracy and computation speed. Future work is summarized as follows:
– Reduction of memory space.
– Precision improvement of shadow detection.
References 1. Han, B., Comaniciu, D., Davis, L.: Sequential Kernel Density Approximation through Mode Propagation: Applications to Background Modeling. In: Asian Conference on Computer Vision 2004, pp. 818–823 (2004) 2. Elgammal, A., Duraiswami, R., Harwood, D., Davis, L.S.: Background and Foreground Modeling Using Nonparametric Kernel Density Estimation for Visual Surveillance. In: Proceedings of the IEEE, vol. 90, pp. 1151–1163 (2002) 3. Toyama, K., Krumm, J., Brumitt, B., Meyers, B.: Wallflower: Principle and Practice of Background Maintenance. In: International Conference on Computer Vision, pp. 255–261 (1999) 4. Harville, M.: A Framework for High-Level Feedback to Adaptive, Per-Pixel, Mixture-ofGaussian Background Models. In: the 7th European Conference on Computer Vision, vol. III, pp. 543–560 (2002) 5. Lee, D.-S.: Online Adaptive Gaussian Mixture Learning for Video Applications. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 105–116. Springer, Heidelberg (2004) 6. Stauffer, C., Grimson, W.E.L.: Adaptive background mixture models for real-time tracking. Computer Vision and Pattern Recognition 2, 246–252 (1999) 7. Shimada, A., Arita, D., Taniguchi, R.i.: Dynamic Control of Adaptive Mixture-of-Gaussians Background Model. In: Proceedings of IEEE International Conference on Advanced Video and Signal Based Surveillance 2006 (2006) 8. Salvador, E., Cavallaro, A., Ebrahimi, T.: SHADOW IDENTIFICATION AND CLASSIFICATION USING INVARIANT COLOR MODELS. In: Proc. of IEEE International Conference on Acoustics, vol. 3, pp. 1545–1548 (2001) 9. Cucchiara, R., Grana, C., Piccardi, M., Prati, A., Sirotti, S.: Improving Shadow Suppression in Moving Object Detection with HSV Color Information. In: IEEE Intelligent Transportation Systems Conference Proceedings, pp. 334–339 (2001) 10. Schreer, O., Feldmann, I., Golz, U., Kauff, P.: FAST AND ROBUST SHADOW DETECTION IN VIDEOCONFERENCE APPLICATION. 4th IEEE Intern. Symposium on Video Proces. and Multimedia Comm, 371–375 (2002) 11. Bevilacqua, A.: Effective Shadow Detection in Traffic Monitoring Applications. In: The 11th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (2003) 12. Martel-Brisson, N., Zaccarin, A.: Moving Cast Shadow Detection from a Gaussian Mixture Shadow Model. IEEE Computer Society International Conference on Computer Vision and Pattern Recognition (2005) 13. Parzen, E.: On the estimation of a probability density function and mode. The Annals of Mathematical Statistics 33(3), 1065–1076 (1962)
Road Sign Detection Using Eigen Color Luo-Wei Tsai1, Yun-Jung Tseng1, Jun-Wei Hsieh2, Kuo-Chin Fan1, and Jiun-Jie Li1 1
Department of CSIE, National Central University Jung-Da Rd., Chung-Li 320, Taiwan
[email protected] 2 Department of E. E., Yuan Ze University 135 Yuan-Tung Road, Chung-Li 320, Taiwan
[email protected]
Abstract. This paper presents a novel color-based method to detect road signs directly from videos. A road sign usually has specific colors and high contrast to its background. Traditional color-based approaches need to train different color detectors if the road signs to be detected have different colors. This paper presents a novel color model derived from the Karhunen-Loeve (KL) transform to detect road sign color pixels from the background. The proposed color transform model is invariant to different perspective effects and occlusions. Furthermore, only one color model is needed to detect various road signs. After transformation into the proposed color space, an RBF (Radial Basis Function) network is trained to find all possible road sign candidates. Then, a verification process is applied to these candidates according to their edge maps. Due to the filtering effect and discriminative ability of the proposed color model, different road signs can be detected very efficiently from videos. Experimental results have proved that the proposed method is robust, accurate, and powerful in road sign detection.
1 Introduction
Traffic sign detection is an important and essential task in a driver support system. The texts on road signs carry much useful information, such as the speed limit, guided direction, and current traffic situation, to help drivers drive safely and comfortably. However, it is very challenging to detect road signs directly from still images or videos due to large changes in environmental conditions. In addition, when the camera is moving, perspective effects will make a road sign have different sizes, shapes, contrast changes, or motion blurs. Moreover, it will sometimes be occluded by natural objects such as trees. To tackle the above problems, many works [1]-[9] have been proposed for automatic road sign detection and recognition. Since a road sign usually has a high-contrast color and a regular shape, these approaches can be categorized into color-based and shape-based ones. For the color-based approach, in [1], Escalera et al. used a color thresholding technique to separate road sign regions from the background in the RGB color domain. In addition to the RGB space, other color spaces like YIQ and HSV are also good for road sign detection. For example, in [2], Kehtarnavaz and Ahmad used a discriminant analysis on the YIQ color space for
detecting desired road signs from the background. Since road signs have different colors (like red, blue, or green) for showing different warning or direction messages, different color detectors should be designed to tackle these color variations. In addition to color, shape is another important feature for detecting road signs. In [7], Barnes and Zelinsky adopted the fast radial symmetry detector to detect possible road sign candidates and then verified them using a correlation technique. Wu et al. [6] used the corner feature and a vertical plane criterion to cluster image data for finding possible road sign candidates. Blancard [9] used an edge linking technique and the contour feature to locate all possible road sign candidates and then verified them according to their perimeters and curvature features. Usually, different shapes of road signs represent different warning functions. Different shape detectors must then be designed, which makes the detection process very time-consuming. Therefore, some hybrid methods have been proposed for road sign detection. For example, Bahlmann et al. [8] used a color representation, integral features, and the AdaBoost algorithm to train a strong classifier such that a real-time traffic sign detector can be achieved. Furthermore, Fang et al. [3] used fuzzy neural networks and gradient features to locate and track road signs. The major disadvantage of the shape-based approach is that a road sign has large shape variations when the camera is moving. This paper presents a novel hybrid method to detect road signs from videos using an eigen color and shape feature. First of all, this paper proposes a novel eigen color model for searching possible road sign candidates from videos. The model makes road sign colors more compact, so that they concentrate in a small area. It is learned by observing how road sign colors change in static images under different lighting conditions and cluttered backgrounds. It is global and doesn't need to be re-estimated for any new road signs or new input images. Although no prior knowledge of surface reflectance, weather conditions, or view geometry is used in the training phase, the model still locates road sign pixels very efficiently against the background. Even though road signs have different colors, only a single model is needed. After the transformation, the RBF network is used to find the best hyper-plane to separate the road sign colors from the background. Then, a verification engine is built to verify these candidates using their edge maps. The engine records appearance characteristics of road signs and has good discriminative properties to verify the correctness of each candidate. In this system, the eigen color model can filter out most of the background pixels in advance, so only a few candidates need to be further checked. In addition, no matter what color the road sign is, only one eigen color model is needed for color classification. Due to the filtering effect and discriminative abilities of the proposed method, different road signs can be effectively detected from videos. Experimental results have proved the superiority of the proposed method in road sign detection.
2 Eigen Color Detection
A road sign usually has a specific color which is in high contrast to the background. The color information can be used to narrow down the search area for finding road signs more efficiently. For example, in Fig. 1(a), the road sign has a specific "green" color. Then, we can use a green color detector to filter out all non-green objects.
Fig. 1. Green color detection in HIS color space. (a) Original image. (b) Result of green color detection.
However, after simple green color classification, many non-road-sign objects (with green color) are detected, as shown in Fig. 1(b). A precise color modeling method is necessary for road sign detection. In addition, different road signs have different specific colors (green, red, or blue). In contrast to most previous systems, which design different "specific" color detectors, this paper presents a single eigen color model to detect all kinds of road signs.
2.1 Eigen Color Detection Using Dimension Reduction
Our idea is to design a single eigen-color transform model for detecting road sign pixels from the background. At first, we collect a lot of road sign images from various highways, roads, and natural scenes under different lighting and weather conditions. Fig. 2(a) shows part of our training samples. Assume that there are N training images. Through a statistical analysis, we can get the covariance matrix Σ of the color
Fig. 2. Parts of training samples. (a) Road sign images. (b) Non-road sign images.
distributions of the R, G, and B channels from these N images. Using the Karhunen-Loeve (KL) transform, the eigenvectors and eigenvalues of Σ can be obtained and represented as e_i and λ_i, respectively, for i = 1, 2, and 3. Then, three new color features C_i can be defined as
\[
C_i = e_{ir} R + e_{ig} G + e_{ib} B \quad \text{for } i = 1, 2, 3, \qquad (1)
\]
where e_i = (e_{ir}, e_{ig}, e_{ib}). The color feature C_1 with the largest eigenvalue is the one used for the color-to-gray transform, i.e.,
\[
C_1 = \frac{1}{3} R + \frac{1}{3} G + \frac{1}{3} B. \qquad (2)
\]
The other two color features C_2 and C_3 are orthogonal to C_1 and have the following forms:
\[
C_2 = \frac{2(R - B)}{5} \quad \text{and} \quad C_3 = \frac{R + B - 2G}{5}. \qquad (3)
\]
In [10], Healey used a similar idea for image segmentation and pointed out that the colors of homogeneous dielectric surfaces cluster closely along the axis directed by Eq. (2), i.e., (1/3, 1/3, 1/3). In other words, if we project all the road sign colors onto a plane which is perpendicular to the axis pointed to by C_1, the road sign colors will concentrate in a small area. The above principal component analysis (PCA) inspires us to analyze road signs so that a new color model can be found.
Fig. 3. Eigen color re-projection. (a) Original image. (b) Result of projection on eigen color map.
This paper defines the plane ( C2 , C3 ) as a new color space (u, v). Then, given an input image, we first use Eq.(3) to project all color pixels on the (u, v) space. Then, the problem becomes a 2-class separation problem, which tries to find a best decision boundary from the (u, v) space such that all road sign color pixels can be well separated from non-road sign ones. Fig. 3(b) shows the projection result of road sign pixels and non-road sign pixels. The green and red regions denote the results of re-projection of green and red road signs, respectively. The blue region is the result of background. We also re-project the tree region and green road signs (shown in Fig. 3(a)) on the (u, v) space. Although these two regions are both “green”, they can be easily separated on the (u, v) space if a proper classifier is designed for finding the best separation boundary. In what follows, road sign pixels are fed into the RBF network for this classification task.
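A small sketch of this projection is given below: it maps an RGB image into the (u, v) = (C2, C3) plane that is fed to the classifier. The normalization constants follow Eq. (3) as reconstructed above and may differ slightly from the authors' exact implementation.

```python
import numpy as np

def eigen_color_uv(image_rgb):
    """Project an H x W x 3 RGB image into the (u, v) = (C2, C3) space of Eq. (3)."""
    rgb = image_rgb.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    u = 2.0 * (r - b) / 5.0            # C2
    v = (r + b - 2.0 * g) / 5.0        # C3
    return np.stack([u, v], axis=-1)   # H x W x 2 feature map for pixel classification
```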
2.2 Eigen Color Pixel Classification Using Radial Basis Function Network
A RBF network’s structure is similar to multilayer perceptrons. The RBF network we used includes an input layer, one hidden layer, and an output layer. Each hidden neuron is associated with a kernel function. The most commonly used kernel function (also called an activation function) is Gaussian. The output units is approximated as a linear combination of a set of kernel functions, i.e., R
ψ i ( x ) = ∑ wijϕ j ( x ) , for i=1, …, C, j =1
where wij is the connection weight between the jth hidden neuron and ith output layer neuron, and C the number of outputs. The output of the radial basis function is limited to the interval (0, 1) by a sigmoid function:
Fi ( x ) =
1 . 1 + exp(-ψ i ( x ))
When training the RBF network, we use the back-propagation rule to adjust the output connection weights, the mean vector, and the variance vectors of the hidden layer. The parameters wij of the RBF networks are computed by the gradient descent method such that the cost function is minimized: E=
1 N
N
C
∑∑ ( y (x ) − F ( x )) k =1 i =1
i
k
i
k
2
,
where N is the number of inputs and yi ( xk ) the ith output associated with the input sample xk from the training set. Then, if a pixel belongs to the road sign class, it will be labeled to 1; otherwise, 0. When training, all pixels in the (R, G, B) domain are first transformed to the (u, v) domain using Eq. (3).
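The following sketch shows an RBF classifier of this kind operating on (u, v) pixel features. Note that the paper trains the centers, variances, and output weights jointly by back-propagation, whereas this simpler stand-in fixes the Gaussian centers with k-means and learns only the output weights; the hidden-layer size and kernel width are arbitrary choices of the sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

class SimpleRBFNet:
    """RBF classifier on (u, v) pixels: Gaussian hidden units plus a sigmoid output."""

    def __init__(self, n_hidden=20, gamma=10.0):
        self.n_hidden, self.gamma = n_hidden, gamma

    def _phi(self, X):
        # Gaussian activations of every sample against every hidden-unit center.
        d2 = ((X[:, None, :] - self.centers[None, :, :]) ** 2).sum(-1)
        return np.exp(-self.gamma * d2)

    def fit(self, X, y):
        """X: (n, 2) uv features; y: 0/1 labels (1 = road sign pixel)."""
        self.centers = KMeans(self.n_hidden, n_init=10).fit(X).cluster_centers_
        self.out = LogisticRegression(max_iter=1000).fit(self._phi(X), y)
        return self

    def predict_proba_roadsign(self, X):
        return self.out.predict_proba(self._phi(X))[:, 1]
```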
3 Candidate Verification
After color segmentation, different road sign candidates can be extracted. To verify these candidates, we use the road sign's shape to filter out impossible candidates. The verification process is a coarse-to-fine scheme that gradually removes impossible candidates. At the coarse stage, two criteria are first used to roughly eliminate a large number of impossible candidates. The first criterion requires the dimension of a road sign candidate R to be large enough. The second criterion requires the road sign to have enough edge pixels; candidates with E_R / Area_R < 0.02 are eliminated, where E_R and Area_R are the number of edge pixels and the area of R, respectively.
Fig. 4. Result of distance transform. (a) Original Image. (b) Edge map. (c) Distance transform of (b).
After the coarse verification, a fine verification procedure is further applied to verify each candidate using its shape. Assume that B_R is the set of boundary pixels extracted from R. Then, the distance transform of a pixel p in R is defined as
\[
DT_R(p) = \min_{q \in B_R} d(p, q), \qquad (4)
\]
where d(p, q) is the Euclidean distance between p and q. In order to enhance the strength of distance changes, Eq. (4) is further modified as follows:
\[
\overline{DT}_R(p) = \min_{q \in B_R} d(p, q) \times \exp\bigl(\kappa\, d(p, q)\bigr), \qquad (5)
\]
where κ = 0.1. Fig. 4 shows the result of the distance transform: (a) is an image R of a road sign and (b) is its edge map; Fig. 4(c) is the result of the distance transform of Fig. 4(b). If we scan all pixels of R in row-major order, a set F_R of contour features can be represented as a vector, i.e.,
\[
F_R = [\, \overline{DT}_R(p_0), \ldots, \overline{DT}_R(p_i), \ldots\,], \qquad (6)
\]
where all p_i belong to R and i is the scanning index. In addition to the outer contour, a road sign usually contains many text patterns. To verify a road sign candidate more accurately, its outer shape is more important than its inner text patterns. To reflect this fact, a new weight w_i, which increases with the distance between the pixel p_i and the origin O, is included. Assume that O is the center of R, r_i is the distance between p_i and O, and the circumcircle of R has radius z. Then, the weight w_i is defined by
\[
w_i = \begin{cases} \exp(-|r_i - z|^2), & \text{if } r_i \le z; \\ 0, & \text{otherwise}. \end{cases} \qquad (7)
\]
Then, Eq. (6) can be rewritten as follows:
\[
F_R = [\, w_0\, \overline{DT}_R(p_0), \ldots, w_i\, \overline{DT}_R(p_i), \ldots\,]. \qquad (8)
\]
This paper assumes that only three types of road signs, i.e., circles, triangles, and rectangles, need to be verified. For each type R_i, a set of training samples is
collected in advance to capture shape characteristics. If there are N_i templates in R_i, we can calculate the mean μ_i and variance Σ_i of F_R from all samples in R_i. Then, given a road sign candidate H, the similarity between H and R_i can be measured by
\[
S(H, R_i) = \exp\!\left( -(\bar{F}_H - \mu_i)\, \Sigma_i^{-1}\, (\bar{F}_H - \mu_i)^{t} \right), \qquad (9)
\]
where t denotes the transpose of a vector.
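A sketch of the weighted distance-transform feature of Eqs. (5)–(8) for one candidate region follows, using SciPy's Euclidean distance transform. Taking the bounding-box center as O and its half-diagonal as the circumcircle radius z are simplifying assumptions of this sketch.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def contour_feature(boundary_mask, kappa=0.1):
    """Weighted distance-transform feature F_R of Eqs. (5)-(8) for one candidate.

    boundary_mask : boolean H x W array, True on the edge pixels of the candidate R.
    Returns the feature vector in row-major scanning order.
    """
    h, w = boundary_mask.shape
    dt = distance_transform_edt(~boundary_mask)        # min_q d(p, q) to the boundary
    dt = dt * np.exp(kappa * dt)                       # Eq. (5): emphasise larger distances

    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0              # take the box centre as O
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - cy, xx - cx)                     # r_i: distance of p_i to O
    z = np.hypot(cy, cx)                               # circumcircle radius of R (assumed)
    wgt = np.where(r <= z, np.exp(-(r - z) ** 2), 0)   # Eq. (7): the outer contour dominates

    return (wgt * dt).ravel()                          # Eq. (8)
```

The similarity of Eq. (9) is then a Mahalanobis-style score between this vector and the per-type mean and covariance collected from the templates.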
4 Experimental Results
To examine the performance of our proposed method, several video sequences on highways and roads were adopted. The sequences were captured under different road and weather conditions (such as sunny and cloudy). The camera was mounted at the front of the car, and its optical axis is not required to be perpendicular to the road sign. The frame rate of our system is over 20 fps. Fig. 5 shows the results of road sign color detection using the proposed method. For comparison, the color thresholding technique [1] was also implemented. Fig. 5(a) is the original image and Fig. 5(b) is the result using the color thresholding technique. Many false regions were detected in Fig. 5(b). Fig. 5(c) is the result of eigen color classification. Notice that only one eigen color model was used to detect all the desired road signs even though their colors were different. Compared with the thresholding technique, our proposed scheme has a much lower false detection rate. A lower false detection rate means that less computation time is
Fig. 5. Result of color classification. (a) Original image. (b) Detection result of color thresholding [1]. (c) Eigen color classification.
needed for candidate verification. In addition, the color thresholding technique needs several scanning passes to detect road signs if they have different colors. Thus, our method performs much more efficiently than traditional color-based approaches.
Fig. 6. Detection results of rectangular road sign
Fig. 7. Detection result when a skewed road sign or a low-quality frame was handled
Fig. 6 shows the detection results when rectangular road signs were handled. Even though the tree regions have similar color to the road signs, our method still worked very well to detect the desired road signs. Fig. 7 shows the detection results when skewed road signs or a low-quality video frame were handled. No matter how skewed and what color the road sign is, our proposed method performed well to detect it from the background.
Fig. 8. Detection results of circular road signs
Fig. 8 shows the detection results when circular road signs were captured under different lighting conditions. The conditions included low lighting, skewed shapes, and multiple signs. However, our method still worked well to detect all these circular road signs. Furthermore, we also used our scheme to detect triangular road signs. Fig. 9 shows the detection results when triangular road signs were handled. No matter what types or colors of road signs were handled, our proposed method detected them very successfully.
Fig. 9. Detection results of triangular road signs
Fig. 10. Road sign detection in a video sequence under a sunny day. (a), (b), and (c): Consecutive detection results of a road sign from a video.
The next set of experiments demonstrates the performance of our method in detecting road signs under different weather conditions in video sequences. Fig. 10 shows a series of detection results when consecutive video frames captured on a sunny day were handled. In Fig. 10(a) and (b), a smaller and darker road sign was detected. Then, its size gradually became larger. Fig. 10(c) shows the detection result of a larger road sign. Clearly, even though the road sign changed in size, all its variations were successfully detected using our proposed method. Fig. 11 shows the detection results for a series of road signs captured on a cloudy day. In Fig. 11(a), a very small road sign was detected; its color was also darker. In Fig. 11(b), (c), and (d), its size gradually became larger. No matter how the size of the road sign changes, it can still be well detected using our proposed method. Experimental results have proved the superiority of our proposed method in real-time road sign detection.
Fig. 11. Road sign detection in a video sequence under a cloudy day
5 Conclusion
This paper presents a novel eigen color model for road sign detection. With this model, different road sign candidates can be quickly located no matter what colors they have. The model is global and doesn't need to be re-estimated. Even though the road signs are lit under different illuminations, the model still works very well to identify them against the background. After that, a coarse-to-fine verification scheme is applied to effectively identify all candidates according to their edge maps. Since most impossible candidates have been filtered out in advance, desired road signs can be located very quickly. Experimental results have proved the superiority of our proposed method in real-time road sign detection.
References [1] Escalera, A.D.L., et al.: Road Traffic Sign Detection and Classification. IEEE Transaction on Industrial Electronics 44(6), 848–859 (1997) [2] Kehtarnavaz, N., Ahmad, A.: Traffic sign recognition in noisy outdoor scenes. In: Proceedings of Intelligent Vehicles 1995 Symposium, pp. 460–465 (September 1995) [3] Fang, C.-Y., Chen, S.-W., Fuh, C.-S.: Road-sign detection and tracking. IEEE Transactions on Vehicular Technology 52(5), 1329–1341 (2003) [4] Chen, X., Yang, J., Zhang, J., Waibel, A.: Automatic detection and recognition of signs from natural scenes. IEEE Transactions on Image Processing 13(1), 87–99 (2004) [5] Loy, G., Barnes, N.: Fast shaped-based road sign detection for a Driver Assistance System. In: IROS 2004 (2004)
[6] Wu, W., Chen, X., Yang, J.: Detection of Text on Road Signs From Video. IEEE Transactions on ITS 6(4), 378–390 (2005) [7] Barnes, N., Zelinsky, A.: Real-time radial symmetry for speed sign detection. In: Proc. IEEE Intelligent Vehicles Symposium, Italy, pp. 566–571 (June 2004) [8] Bahlmann, C., et al.: A system for traffic sign detection, tracking, and recognition using color, shape, and motion information. In: Proceedings of IEEE Intelligent Vehicles Symposium, pp. 255–260 (June 2005) [9] de Saint Blancard, M.: Road Sign Recognition: A Study of Vision-based Decision Making for Road Environment Recognition, ch. 7. Springer, Heidelberg (1991) [10] Healey, G.: Segmenting Images Using Normalized Color. IEEE Transactions on Systems, Man, and Cybernetics 22(1), 64–73 (1992)
Localized Content-Based Image Retrieval Using Semi-Supervised Multiple Instance Learning
Dan Zhang1, Zhenwei Shi2, Yangqiu Song1, and Changshui Zhang1
1 State Key Laboratory on Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology (TNList), Department of Automation, Tsinghua University, Beijing 100084, China
2 Image Processing Center, School of Astronautics, Beijing University of Aeronautics and Astronautics, Beijing 100083, P.R. China
[email protected], [email protected], [email protected], [email protected]
Abstract. In this paper, we propose a Semi-Supervised Multiple-Instance Learning (SSMIL) algorithm and apply it to Localized Content-Based Image Retrieval (LCBIR), where the goal is to rank all the images in the database according to the object that users want to retrieve. SSMIL treats LCBIR as a semi-supervised problem and utilizes the unlabeled pictures to help improve the retrieval performance. The comparison of SSMIL with several state-of-the-art algorithms gives promising results.
1 Introduction
Much work has been done in applying Multiple Instance Learning (MIL) to Localized Content-Based Image Retrieval (LCBIR). One main reason is that, in LCBIR, what a user wants to retrieve is often an object in a picture, rather than the whole picture itself. Therefore, in order to tell the retrieval system what he really wants, the user often has to provide several pictures with the desired object in them, as well as several pictures without this object, either directly or through relevance feedback. Each picture with the desired object is then treated as a positive bag, while the other query pictures are considered negative ones. Furthermore, after using image segmentation techniques to divide the images into small patches, each patch represents an instance. In this way, the problem of image retrieval can be converted into an MIL one. The notion of Multi-Instance Learning was first introduced by Dietterich et al. [1] to deal with drug activity prediction. A collection of different shapes of the same molecule is called a bag, while its different shapes represent different instances. A bag is labeled positive if and only if at least one of its instances is positive; otherwise, the bag is negative. This basic idea was extended by several following works. Maron et al. [2] proposed another MIL algorithm, Diverse Density (DD). They tried to find a target in the feature space that resembled the positive instances most, and this target was called a concept point. Then they applied this
The work was supported by the National Science Foundation of China (60475001, 60605002).
method to solve the task of natural scene classification [3]. Zhang and Goldman [6] combined Expectation Maximization with DD and developed EM-DD, an algorithm much more efficient than the previous DD algorithm, to search for the desired concept. They extended their idea in [7] and made some modifications to ensemble the different concept points returned by EM-DD with different initial values. This is reasonable, since the desired object cannot be described by only one concept point. Andrews et al. [10] used an SVM-based method to solve the MI problem. They then developed an efficient algorithm based on linear programming boosting [11]. Y. Chen et al. [4] combined EM-DD and SVM and devised DD-SVM. Recently, P.-M. Cheung et al. [9] gave a regularized framework for this problem. Z.-H. Zhou et al. [15] also initiated research on the Multiple-Instance Multiple-Label problem and applied it to scene classification. All the above works assume that each negative bag should not contain any positive instance, but there may be exceptions. After image segmentation, the desired object may be divided into several different patches. The pictures without this object may also contain a few patches that are similar to those of the object and should not be retrieved. So, negative bags may also contain positive instances, if we consider each patch as an instance. Based on this assumption, Y. Chen et al. [5] recently devised a new algorithm called Multiple-Instance Learning via Embedded Instance Selection (MILES) to solve multiple instance problems. So far, some developments of MIL have been reviewed. When it comes to LCBIR, one natural problem is that users are often unwilling to provide many labeled pictures, and therefore the inadequate number of labeled pictures poses a great challenge to the existing MIL algorithms. Semi-supervised algorithms are designed to handle the situation when the labeled information is inadequate. Some typical semi-supervised algorithms include Semi-Supervised SVM [13], Transductive SVM [12], graph-based semi-supervised learning [14], etc. How to convert a standard MIL problem into a semi-supervised one has received some attention. Recently, R. Rahmani and S. Goldman combined a modified version of DD and graph-based semi-supervised algorithms, and put forward the first graph-based Semi-Supervised MIL algorithm, MISSL [8]. They adopted an energy function to describe the likelihood of an instance being a concept point, and redefined the weights between different bags. In this paper, we propose a new algorithm, Semi-Supervised Multiple-Instance Learning (SSMIL), to solve the Semi-Supervised MIL problem, and the results are promising. Our paper is outlined as follows: in Section 2, the motivation of our algorithm is introduced. In Section 3, we give the proposed algorithm. In Section 4 the experimental results are presented. Finally, a conclusion is given in Section 5.
2 Motivation
A bag can be mapped into a feature space determined by the instances in all the labeled bags. To be more precise, a bag B is embedded in this feature space as follows [5]:
\[
m(B) = [\, s(x^1, B), s(x^2, B), \cdots, s(x^n, B)\,]^T \qquad (1)
\]
Here, s(x^k, B) = max_t exp(−‖b_t − x^k‖² / σ²). σ is a predefined scaling parameter, x^k is the kth instance among all the n instances in the labeled bags, and b_t denotes the tth instance in the bag B. Then, the whole labeled set can be mapped to the matrix
\[
[\, m_1^+, \cdots, m_{l^+}^+, m_1^-, \cdots, m_{l^-}^- \,]
= [\, m(B_1^+), \cdots, m(B_{l^+}^+), m(B_1^-), \cdots, m(B_{l^-}^-) \,]
= \begin{bmatrix}
s(x^1, B_1^+) & \cdots & s(x^1, B_{l^-}^-) \\
s(x^2, B_1^+) & \cdots & s(x^2, B_{l^-}^-) \\
\vdots & \ddots & \vdots \\
s(x^n, B_1^+) & \cdots & s(x^n, B_{l^-}^-)
\end{bmatrix} \qquad (2)
\]
B_1^+, ..., B_{l^+}^+ denote the bags labeled positive, while B_1^-, ..., B_{l^-}^- refer to the negatively labeled bags. Each column represents a bag. If x^k is near some positive bags and far from some negative ones, the corresponding dimension is useful for discrimination. In MILES [5], a 1-norm SVM is trained to select features and get their corresponding weights from this feature space as follows:
\[
\begin{aligned}
\min_{w,b,\xi,\eta} \quad & \lambda \sum_{k=1}^{n} |w_k| + C_1 \sum_{i=1}^{l^+} \xi_i + C_2 \sum_{j=1}^{l^-} \eta_j \\
\text{s.t.} \quad & (w^T m_i^+ + b) + \xi_i \ge 1, \quad i = 1, \ldots, l^+, \\
& -(w^T m_j^- + b) + \eta_j \ge 1, \quad j = 1, \ldots, l^-, \\
& \xi_i, \eta_j \ge 0, \quad i = 1, \ldots, l^+, \; j = 1, \ldots, l^-
\end{aligned} \qquad (3)
\]
Here, C_1 and C_2 reflect the loss penalties imposed on the misclassification of positive and negative bags, respectively. λ is a regularization parameter, which controls the trade-off between the complexity of the classifier and the hinge loss. It can be seen that this formulation does not restrict all the instances in negative bags to be negative. Since the 1-norm SVM is utilized, a sparse solution can be obtained, i.e., in this solution, only a few w_k in Eq. (3) are nonzero. Hence, MILES finds the most important instances in the labeled bags and their corresponding weights. MILES gives impressive results on several data sets and has shown its advantages over several other methods, such as DD-SVM [4], MI-SVM [10] and k-means SVM [16], both in accuracy and speed. However, the image retrieval task is itself a semi-supervised problem: only a few labeled pictures are available for searching a tremendous database. The utilization of the unlabeled pictures may therefore actually improve the retrieval performance.
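A minimal sketch of the bag embedding of Eqs. (1)–(2) is given below: each bag is described by its maximal kernel similarity to every instance collected from the labeled bags. The array shapes are assumptions of this sketch.

```python
import numpy as np

def embed_bag(bag_instances, concept_instances, sigma=0.5):
    """Eq. (1): m(B)_k = max_t exp(-||b_t - x^k||^2 / sigma^2).

    bag_instances     : (t, dim) instances of one bag
    concept_instances : (n, dim) instances collected from all labeled bags
    """
    d2 = ((concept_instances[:, None, :] - bag_instances[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2).max(axis=1)    # one similarity per instance x^k

def embed_bags(bags, concept_instances, sigma=0.5):
    """Stack the embeddings of all bags column-wise, as in Eq. (2)."""
    return np.column_stack([embed_bag(b, concept_instances, sigma) for b in bags])
```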
3 Semi-Supervised Multiple Instance Learning (SSMIL)
3.1 The Formulation of Semi-Supervised Multiple Instance Learning
In this section, we give the formulation of Semi-Supervised Multiple Instance Learning. Our aim is to maximize margins not only on the labeled bags but also on the unlabeled ones. A straightforward way is to map both the labeled and unlabeled bags into the feature space determined by all the labeled bags, using Eq. (2). Then, we try to solve the following optimization problem:
\[
\begin{aligned}
\min_{w,b,\xi,\eta,\zeta} \quad & \lambda \sum_{k=1}^{n} |w_k| + C_1 \sum_{i=1}^{l^+} \xi_i + C_2 \sum_{j=1}^{l^-} \eta_j + C_3 \sum_{u=1}^{|U|} \zeta_u \\
\text{s.t.} \quad & (w^T m_i^+ + b) + \xi_i \ge 1, \quad i = 1, \cdots, l^+ \\
& -(w^T m_j^- + b) + \eta_j \ge 1, \quad j = 1, \cdots, l^- \\
& y_u^* (w^T m_u + b) + \zeta_u \ge 1, \quad u = 1, \cdots, |U| \\
& \xi_i, \eta_j, \zeta_u \ge 0, \quad i = 1, \cdots, l^+, \; j = 1, \cdots, l^-, \; u = 1, \cdots, |U|
\end{aligned} \qquad (4)
\]
The difference between Eq. (3) and Eq. (4) is the appended penalty term imposed on the unlabeled data. C_3 is the penalty parameter that controls the effect of the unlabeled data, and y_u^* is the label assigned to the uth unlabeled bag during the training phase.
3.2 The Up-Speed of Semi-Supervised Multiple Instance Learning (UP-SSMIL)
Directly solving the optimization problem (4) is too time-consuming, because, in Eq. (4), all the unlabeled pictures are required to be mapped into the feature space determined by all the instances in the labeled bags, and most of the time will be spent on the feature mapping step (Eq. (2)). In this paper, we try to speed up this process and propose UP-SSMIL. After each labeled bag is mapped into the feature space by Eq. (2), all the unlabeled bags can also be mapped into this feature space according to Eq. (1). As mentioned in Section 2, the 1-norm SVM can find the most important features, i.e., the predominant instances in the training bags. Hence, the dimension of each bag representation can be greatly reduced, with the irrelevant features being discarded. So, we propose using MILES as a first step to select the most important instances, and mapping each bag B in both the labeled and unlabeled set into the space determined by these instances as follows:
\[
m(B) = [\, s(z^1, B), s(z^2, B), \cdots, s(z^v, B) \,]^T \qquad (5)
\]
Here, z^k is the kth selected instance and v denotes the total number of selected instances. This is a supervised step. Then, we intend to use the unlabeled bags to improve the performance by optimizing the feature weights of the selected
Table 1. UP-SSMIL Algorithm
1. Feature Mapping 1: Map each labeled bag into the feature space determined by the instances in the labeled bags, using Eq. (2).
2. MILES Training: Use the 1-norm SVM to train a classifier, utilizing only the training bags. Then, each feature in the feature space determined by the training instances is assigned a weight, i.e., w_k in Eq. (3). The regularizer in this step is denoted as λ_1.
3. Feature Selecting: Select the features with nonzero weights.
4. Feature Mapping 2: Map all the unlabeled and labeled bags into the feature space determined by the features selected in the previous step, i.e., the selected instances, using Eq. (5).
5. TSVM Training: Taking into account both the re-mapped labeled and unlabeled bags, use TSVM to train a classifier. The regularizer in TSVM is denoted as λ_2.
6. Classifying: Use this classifier to rank the unlabeled bags.
features. A Transductive Support Vector Machine (TSVM) [12] algorithm is employed to learn these weights. The whole UP-SSMIL algorithm is depicted in Table 1. In this algorithm, TSVM is a 2-norm Semi-Supervised SVM. The reason why a 1-norm Semi-Supervised SVM is not employed is that, after the feature selection step, all the selected features are relevant to the final solution, whereas a 1-norm Semi-Supervised SVM would again favor a sparse w and discard some of them. Therefore, it is not used here.
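A rough sketch of steps 1–4 of Table 1 follows. It uses scikit-learn's L1-regularized linear SVM as a stand-in for the 1-norm SVM of Eq. (3) (the losses are not identical), stops where the transductive step begins, and the parameter values are arbitrary.

```python
import numpy as np
from sklearn.svm import LinearSVC

def _embed(bags, concepts, sigma):
    """Eq. (1)/(5): one row per bag, one column per concept instance."""
    rows = []
    for b in bags:
        d2 = ((concepts[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        rows.append(np.exp(-d2 / sigma ** 2).max(axis=1))
    return np.vstack(rows)

def up_ssmil_features(labeled_bags, labels, unlabeled_bags, sigma=0.5, C=0.5):
    """Embed the labeled bags, select instances with an L1 linear SVM, then re-map
    both labeled and unlabeled bags onto the selected instances (Eq. (5))."""
    all_instances = np.vstack(labeled_bags)
    X_lab = _embed(labeled_bags, all_instances, sigma)
    selector = LinearSVC(C=C, penalty='l1', dual=False).fit(X_lab, labels)
    selected = np.flatnonzero(np.abs(selector.coef_.ravel()) > 1e-8)
    concepts = all_instances[selected]                 # the selected instances z^k
    return _embed(labeled_bags, concepts, sigma), _embed(unlabeled_bags, concepts, sigma)
```

The transductive training of step 5 would then be run on these low-dimensional bag vectors with an external TSVM solver such as SVMlin, and the resulting decision values used to rank the unlabeled bags.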
4 Experiments
We test our method on SIVAL, which can be obtained at www.cs.wustl.edu/~sg/multi-inst-data/. Some sample images are shown in Fig. 1. In this database, each image is pre-segmented into around 30 patches. Color, texture and
Fig. 1. Some sample images in the SIVAL dataset: (a) SpriteCan, (b) WD40Can
Table 2. Average AUC values with 95% confidence intervals, with 8 randomly selected positive and 8 randomly selected negative pictures
Category             UP-SSMIL    MISSL       MILES       Accio!      Accio!+EM
FabricSoftenerBox    97.2±0.7    97.7±0.3    96.8±0.9    86.6±2.9    44.4±1.1
CheckeredScarf       95.5±0.5    88.9±0.7    95.1±0.8    90.8±1.5    58.1±4.4
FeltFlowerRug        94.6±0.8    90.5±1.1    94.1±0.8    86.9±1.6    51.1±24.8
WD40Can              90.5±1.3    93.9±0.9    86.9±3.0    82.0±2.4    50.3±3.0
CockCan              93.4±0.8    93.3±0.9    91.8±1.3    81.5±3.4    48.5±24.6
GreenTeaBox          90.9±1.9    80.4±3.5    89.4±3.1    87.3±2.9    46.8±3.5
AjaxOrange           90.1±1.7    90.0±2.1    88.4±2.8    77.0±3.4    43.6±2.4
DirtyRunningShoe     87.2±1.3    78.2±1.6    85.6±2.1    83.7±1.9    75.4±19.8
CandleWithHolder     85.4±1.7    84.5±0.8    83.4±2.3    68.8±2.3    57.9±3.0
SpriteCan            84.8±1.1    81.2±1.5    82.1±2.8    71.9±2.4    59.2±22.1
JulisPot             82.1±2.9    68.0±5.2    78.8±3.5    79.2±2.6    51.2±24.5
GoldMedal            80.9±3.0    83.4±2.7    76.1±3.9    77.7±2.6    42.1±3.6
DirtyWorkGlove       81.9±1.7    73.8±3.4    80.4±2.2    65.3±1.5    57.8±2.9
CardBoardBox         81.1±2.3    69.6±2.5    78.4±3.0    67.9±2.2    57.8±2.9
SmileyFaceDoll       80.7±1.8    80.7±2.0    77.7±2.8    77.4±3.2    48.0±25.8
BlueScrunge          76.7±2.6    76.8±5.2    73.2±2.8    69.5±3.3    36.3±2.5
DataMiningBook       76.6±1.9    77.3±4.3    74.0±2.3    74.7±3.3    37.7±4.9
TranslucentBowl      76.3±2.0    63.2±5.2    74.0±3.1    77.5±2.3    47.4±25.9
StripedNoteBook      75.1±2.6    70.2±2.9    73.2±2.5    70.2±3.1    43.5±3.1
Banana               69.2±3.0    62.4±4.3    66.4±3.4    65.9±3.2    43.6±3.8
GlazedWoodPot        68.6±2.8    51.5±3.3    69.0±3.0    72.7±2.2    51.0±2.8
Apple                67.8±2.7    51.1±4.4    64.7±2.8    63.4±3.3    43.4±2.7
RapBook              64.9±2.8    61.3±2.8    64.6±2.3    62.8±1.7    57.6±4.8
WoodRollingPin       64.1±2.1    51.6±2.6    63.5±2.0    66.7±1.7    52.5±23.9
LargeSpoon           58.6±1.9    50.2±2.1    57.7±2.1    57.6±2.3    51.2±2.5
Average              80.6        74.8        78.6        74.6        50.3
neighborhood features have already been extracted for each patch, and they form a set of 30-dimensional feature vectors. In our experiments, these features are normalized to lie exactly in the range from 0 to 1, and the scaling parameter σ is chosen to be 0.5. We treat each picture as a bag, and each patch in this picture as an instance in this bag. The source code of MILES is obtained from [17], and TSVM is obtained from [18]. During each trial, 8 positive pictures are randomly selected from one category, and another 8 negative pictures are randomly selected as background pictures from the other 24 categories. The retrieval speed of UP-SSMIL is quite fast: on our computer, for each round, UP-SSMIL takes only 25 seconds while SSMIL takes around 30 minutes. For convenience, only the results of UP-SSMIL are reported here. We will demonstrate below that it achieves the best performance on the SIVAL database. In UP-SSMIL's MILES Training step in Table 1 and in MILES (see Eq. (3)), λ_1 is set to 0.2, and C_1 and C_2 are set to 0.5. In UP-SSMIL's TSVM Training step in Table 1 (for a detailed description of the parameters, see the reference for SVMlin [18]),
Fig. 2. The comparison result between UP-SSMIL and MILES: AUC values on the SpriteCan and WD40Can categories as the number of labeled pictures |L| and the number of unlabeled pictures |U| vary
λ_2 is set to 0.1. The positive class fraction of the unlabeled data is set to 0.01. The other parameters in SVMlin are all set to their default values. In image retrieval, the ROC curve is a good measure of performance, so the area under the ROC curve (the AUC value) is used here to measure the performance. All the results reported here are averaged over 30 independent runs, with a 95% confidence interval being calculated. The final comparison result is shown in Table 2. From this table, it can be seen that, among all the 25 categories, UP-SSMIL performs better than MISSL for most categories and worse for only a few. This may be due to two reasons. For one thing, MISSL uses an inadequate number of pictures to learn the likelihood of each instance being positive, and the "steepness factor" in MISSL is relatively hard to determine; these may lead to an inaccurate energy function. For another, on the graph level, MISSL uses just one vertex to represent all the negative training vertices, and assumes the weights connecting this vertex to all the unlabeled vertices to be the same, which results in some inaccuracy as well. Furthermore, after the pre-calculation of the distances between different instances, MISSL takes 30-100 seconds to get a retrieval result, while UP-SSMIL takes no more than 30 seconds without the need to calculate these distances. This
is quite understandable: in the first Feature Mapping step in Table 1, UP-SSMIL only needs to calculate the distances within the training bags. Since the number of query images is small, this calculation burden is relatively light. Then, after the features have been selected, the unlabeled bags only need to be mapped into the space determined by these few selected features. In our experiments, this dimension can be reduced to around 10, so the calculation cost of the second Feature Mapping step in Table 1 is very low. With the dimensions greatly reduced, TSVM obtains the solution relatively fast. Compared with other supervised methods, such as MILES, Accio! [7] and Accio!+EM [7], the performance of UP-SSMIL is also quite promising. Some comparison results with its supervised counterpart, MILES, are provided in Fig. 2. We illustrate how the learning curve changes when both the number of labeled pictures (|L|) and the number of unlabeled pictures (|U|) vary. It can be seen that UP-SSMIL always outperforms its supervised counterpart.
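For reference, the evaluation protocol described above (AUC averaged over independent runs with a 95% confidence interval) can be reproduced with a few lines. The normal-approximation interval below is an assumption of this sketch, since the paper does not state how its intervals are computed.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_auc_with_ci(run_scores, run_labels):
    """Average AUC over independent runs with a normal-approximation 95% interval.

    run_scores / run_labels: one array of ranking scores and 0/1 relevance labels per run.
    """
    aucs = np.array([roc_auc_score(y, s) for y, s in zip(run_labels, run_scores)])
    half_width = 1.96 * aucs.std(ddof=1) / np.sqrt(len(aucs))
    return aucs.mean(), half_width
```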
5 Conclusion
In this paper, we propose a semi-supervised SVM framework for Multiple Instance learning, SSMIL. It uses the unlabeled pictures to help improve the performance. Then, UP-SSMIL is presented to accelerate the retrieval speed. Finally, we demonstrate its superior performance on the SIVAL database.
References 1. Dietterich, T.G., Lathrop, R.H., Lozano-P¨erez, T.: Solving the Multiple Instance Problem with Axis-Parallel Rectangles. Artificial Inteligence 1446, 1–8 (1998) 2. Maron, O., Lozano-P¨erez, T.: A Framework for Multiple-Instance Learning. Advances in Neural Information Processing System 10, 570–576 (1998) 3. Maron, O., Ratan, A.L.: Multiple-Instance Learning for Natural Scene Classification. In: Proc. 15th Int’l. Conf. Machine Learning, pp. 341–349 (1998) 4. Chen, Y., Wang, J.Z.: Image Categorization by Learning and Reasoning with Regions. J. Machine Learning Research 5, 913–939 (2004) 5. Chen, Y., Bi, J., Wang, J.Z.: MILES: Multiple-Instance Learning via Embedded Instance Selection. IEEE Transatctions on Pattern Analysis and Machine Intelligence 28(12) (2006) 6. Zhang, Q., Goldman, S.: EM-DD: An improved Multiple-Instance Learning. In: Advances in Neural Information Processing System, vol. 14, pp. 1073–1080 (2002) 7. Rahmani, R., Goldman, S., Zhang, H., et al.: Localized Content-Based Image Retrieval. In: Proceedings of ACM Workshop on Multimedia Image Retrieval, ACM Press, New York (2005) 8. Rahmani, R., Goldman, S.: MISSL: Multiple-Instance Semi-Supervised Learning. In: Proc. 23th Int’l. Conf. Machine Learning, pp. 705–712 (2006) 9. Cheung, P.-M., Kwok, J.T.: A Regularization Framework for Multiple-Instance Learning. In: ICML (2006) 10. Andrews, S., Tsochantaridis, I., Hofmann, T.: Support Vector Machines for Multiple-Instance Learning. In: Advances in Neural Information Processing System, vol. 15, pp. 561–568 (2003)
11. Andrews, S., Hofmann, T.: Multiple Instance Learning via Disjunctive Programming Boosting. In: Advances in Neural Information Processing System, vol. 16, pp. 65–72 (2004)
12. Joachims, T.: Transductive Inference for Text Classification using Support Vector Machine. In: Proc. 16th Int'l. Conf. Machine Learning, pp. 200–209 (1999)
13. Bennett, K.P., Demiriz, A.: Semi-supervised support vector machines. In: Advances in Neural Information Processing System, vol. 11, pp. 368–374 (1999)
14. Zhu, X.: Semi-supervised learning literature survey. Technical Report 1530, Department of Computer Sciences, University of Wisconsin at Madison (2006)
15. Zhou, Z.H., Zhang, M.L.: Multi-Instance Multi-Label Learning with Application to Scene Classification. In: Advances in Neural Information Processing System (2006)
16. Csurka, G., Bray, C., Dance, C., Fan, L.: Visual Categorization with Bags of Keypoints. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 59–74. Springer, Heidelberg (2004)
17. http://john.cs.olemiss.edu/~ychen/MILES.html
18. http://people.cs.uchicago.edu/~vikass/svmlin.html
Object Detection Combining Recognition and Segmentation
Liming Wang1, Jianbo Shi2, Gang Song2, and I-fan Shen1
1 Fudan University, Shanghai, PRC, 200433 {wanglm,yfshen}@fudan.edu.cn
2 University of Pennsylvania, 3330 Walnut Street, Philadelphia, PA 19104
[email protected],
[email protected]
Abstract. We develop an object detection method combining top-down recognition with bottom-up image segmentation. There are two main steps in this method: a hypothesis generation step and a verification step. In the top-down hypothesis generation step, we design an improved Shape Context feature, which is more robust to object deformation and background clutter. The improved Shape Context is used to generate a set of hypotheses of object locations and figure-ground masks, which have a high recall and a low precision rate. In the verification step, we first compute a set of feasible segmentations that are consistent with top-down object hypotheses; then we propose a False Positive Pruning (FPP) procedure to prune out false positives. We exploit the fact that false positive regions typically do not align with any feasible image segmentation. Experiments show that this simple framework is capable of achieving both high recall and high precision with only a few positive training examples and that this method can be generalized to many object classes.
1 Introduction Object detection is an important, yet challenging vision task. It is a critical part in many applications such as image search, image auto-annotation and scene understanding; however it is still an open problem due to the complexity of object classes and images. Current approaches [1,2,3,4,5,6,7,8,9,10] to object detection can be categorized by top-down, bottom-up or combination of the two. Top-down approaches [2,11,12] often include a training stage to obtain class-specific model features or to define object configurations. Hypotheses are found by matching models to the image features. Bottomup approaches start from low-level or mid-level image features, i.e. edges or segments [5,8,9,10]. These methods build up hypotheses from such features, extend them by construction rules and then evaluate by certain cost functions. The third category of approaches combining top-down and bottom-up methods have become prevalent because they take advantage of both aspects. Although top-down approaches can quickly drive attention to promising hypotheses, they are prone to produce many false positives when features are locally extracted and matched. Features within the same hypothesis may not be consistent with respect to low-level image segmentation. On the other hand, bottom-up approaches try to keep consistency in low level image segmentation, but usually need much more efforts in searching and grouping. Y. Yagi et al. (Eds.): ACCV 2007, Part I, LNCS 4843, pp. 189–199, 2007. c Springer-Verlag Berlin Heidelberg 2007
Fig. 1. Method overview. Our method has three parts (shaded rectangles). Codebook building (cyan) is the training stage, which generates codebook entries containing improved SC features and object masks. Top-down recognition (blue) generates multiple hypotheses via improved SC matching and voting in the input image. The verification part (pink) aims to verify these top-down hypotheses using bottom-up segmentation. Round-corner rectangles are processes and ordinary rectangles are input/output data.
Wisely combining these two can avoid exhaustive searching and grouping while maintaining consistency in object hypotheses. For example, Borenstein et al. enforce continuity along segmentation boundaries to align matched patches [2]. Levin et al. take into account both bottom-up and top-down cues simultaneously in the framework of CRF [3]. Our detection method falls into this last category of combining top-down recognition and bottom-up segmentation, with two major improvements over existing approaches. First, we design a new improved Shape Context (SC) for the top-down recognition. Our improved SC is more robust to small deformation of object shapes and background clutter. Second, by utilizing bottom-up segmentation, we introduce a novel False Positive Pruning (FPP) method to improve detection precision. Our framework can be generalized to many other object classes because we pose no specific constraints on any object class. The overall structure of the paper is organized as follows. Sec. 2 provides an overview to our framework. Sec.3 describes the improved SCs and the top-down hypothesis generation. Sec.4 describes our FPP method combining image segmentation to verify hypotheses. Experiment results are shown in Sec.5, followed by discussion and conclusion in Sec.6.
2 Method Overview Our method contains three major parts: codebook building, top-down recognition using matching and voting, and hypothesis verification, as depicted in Fig.1. The object models are learned by building a codebook of local features. We extract improved SC as local image features and record the geometrical information together with object figure-ground masks. The improved SC is designed to be robust to shape variances and background clutters. For rigid objects and objects with slight articulation, our experiments show that only a few training examples suffice to encode local shape information of objects. We generate recognition hypotheses by matching local image SC features to the codebook and use SC features to vote for object centers. A similar top-down voting scheme is described in the work of [4], which uses SIFT point features for pedestrian
Fig. 2. Angular Blur. (a) and (b) are different bin responses of two similar contours. (c) are their histograms. (d) enlarges angular span θ to θ , letting bins be overlapped in angular direction. (e) are the responses on the overlapped bins, where the histograms are more similar.
detection. The voting result might include many false positives due to small context of local SC features. Therefore, we combine top-down recognition with bottom-up segmentation in the verification stage to improve the detection precision. We propose a new False Positive Pruning (FPP) approach to prune out many false hypotheses generated from top-down recognition. The intuition of this approach is that many false positives are generated due to local mismatches. These local features usually do not have segmentation consistency, meaning that pixels in the same segment should belong to the same object. True positives are often composed of several connected segments while false positives tend to break large segments into pieces.
3 Top-Down Recognition In the training stage of top-down recognition, we build up a codebook of improved SC features from training images. For a test image, improved SC features are extracted and matched to codebook entries. A voting scheme then generates object hypotheses from the matching results. 3.1 Codebook Building For each object class, we select a few images as training examples. Object masks are manually segmented and only edge map inside the mask is counted in shape context histogram to prune out edges due to background clutter. The Codebook Entries (CE) are a repository of example features: CE = {cei }. Each codebook entry cei = (ui , δi , mi , wi ) records the feature for a point i in labelled objects of the training images. Here ui is the shape context vector for point i. δi is the position of point i relative to the object center. mi is a binary mask of figure-ground segmentation for the patch centered at point i. wi is the weight mask computed on mi , which will be introduced later. 3.2 Improved Shape Context The idea of Shape Context (SC) was first proposed by Belongie et al. [13]. The basic definition of SC is a local histogram of edge points in a radius-angle polar grid. Following works [14,15] improve its distinctive power by considering different edge orientations. Besides SC, other local image features such as wavelets, SIFT and HOG have been used in keypoint based detection approaches [4,12].
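For concreteness, a codebook entry can be held in a small record like the following (a sketch with our own field names mirroring (ui, δi, mi, wi); not code from the paper):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CodebookEntry:
    """One codebook entry ce_i = (u_i, delta_i, m_i, w_i) as described in Sec. 3.1."""
    u: np.ndarray      # shape context vector of point i
    delta: np.ndarray  # offset of point i relative to the object center, e.g. (dx, dy)
    m: np.ndarray      # binary figure-ground mask of the patch centered at point i
    w: np.ndarray      # per-bin weight mask computed from m (cf. Eq. 2)

# the codebook is then simply a list of such entries:
# codebook = [CodebookEntry(u, delta, m, w) for (u, delta, m, w) in training_points]
```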
Suppose there are nr (radial) by nθ (angular) bins and the edge map E is divided into E1, ..., Eo by o orientations (similar to [15]). For a point at p, its SC is defined as u = {h1, ..., ho}, where

hi(k) = #{q ≠ p : q ∈ Ei, vector pq ∈ bin(k)},  k = 1, 2, ..., nr nθ    (1)
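For illustration, the orientation-split histogram of Eq. (1) can be computed along the following lines (a sketch with our own names and illustrative bin counts, not the authors' implementation):

```python
import numpy as np

def shape_context(p, edge_points, edge_orients, n_r=3, n_theta=12, n_o=4, r_max=50.0):
    """Oriented shape context of point p (cf. Eq. 1), as an (n_o, n_r*n_theta) histogram.

    edge_points : (N, 2) float array of edge-point coordinates.
    edge_orients: (N,) int array of orientation-channel indices in [0, n_o).
    The log-polar binning, radius and channel count are illustrative choices.
    """
    d = np.asarray(edge_points, dtype=float) - np.asarray(p, dtype=float)   # vectors p -> q
    dist = np.hypot(d[:, 0], d[:, 1])
    keep = (dist > 0) & (dist <= r_max)                     # q != p, inside the descriptor radius
    r_bin = np.minimum((np.log1p(dist[keep]) / np.log1p(r_max) * n_r).astype(int), n_r - 1)
    a_bin = ((np.arctan2(d[keep, 1], d[keep, 0]) + np.pi) / (2 * np.pi) * n_theta).astype(int) % n_theta
    o = np.asarray(edge_orients)[keep]
    hist = np.zeros((n_o, n_r * n_theta))
    np.add.at(hist, (o, r_bin * n_theta + a_bin), 1)        # h_i(k) = #{q : q in E_i, pq in bin(k)}
    return hist
```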
Angular Blur. A common problem for the shape context is that when dense bins are used or contours are close to the bin boundaries, similar contours have very different histograms (Fig. 2-(c)). This leads to a large distance for two similar shapes if the L2-norm or χ2 distance function is used. EMD [16] alleviates this by solving a transportation problem, but it is computationally much more expensive. The way we overcome this problem is to overlap the spans of adjacent angular bins: bin(k) ∩ bin(k + 1) ≠ ∅ (Fig. 2-(d)). This amounts to blurring the original histogram along the angular direction. We call such an extension Angular Blur. An edge point in the overlapped regions is counted in both of the adjacent bins, so two contours close to the original bin boundary will have similar histograms for the overlapping bins (Fig. 2-(e)). With angular blur, even the simple L2-norm can tolerate slight shape deformation. It improves the basic SC without the expensive computation of EMD.
Mask Function on Shape Context. In real images, objects' SCs always contain background clutter. This is a common problem for matching local features. Unlike learning methods [1,12] which use a large number of labeled examples to train a classifier, we propose to use a mask function to focus only on the parts inside the object while ignoring the background during matching. For ce = (u, δ, m, w) and an SC feature f in the test image, each bin of f is masked by the figure-ground patch mask m of ce to remove the background clutter. Formally, we compute the weight w for bin k and the distance function with mask as:

w(k) = Area(bin(k) ∩ m) / Area(bin(k)),  k = 1, 2, ..., nr nθ    (2)

Dm(ce, f) = D(u, w · v) = ||u − w · v||²    (3)
where (·) is the element-wise product. D can be any distance function computing the dissimilarity between histograms (We simply use L2 -norm). Figure 3 gives an example for the advantage of using mask function. 3.3 Hypothesis Generation The goal of hypothesis generation is to predict possible object locations as well as to estimate the figure-ground segmentation for each hypothesis. Our hypothesis generation is based on a voting scheme similar to [4]. Each SC feature is compared with every codebook entry and makes a prediction of the possible object center. The matching scores are accumulated over the whole image and the predictions with the maximum scores are the possible object centers. Given a set of detected features {fi } at location {li }, we define the probability of matching codebook entry cek to fi as P (cek |li ) ∝ exp(−Dm (cek , fi )). Given the match of cek to fi , the probability of an object o with
Fig. 3. Distance function with mask. In (a), a feature point v has the edge map of a1 around it. Using object mask b1 , it succeeds to find a good match to u in B (object model patch), whose edge map is b2 . a2 is the object mask b1 over a1 . Only the edge points falling into the mask area are counted for SC. In (b), histograms of a1 , a2 and b2 are shown. With the mask function, a2 is much closer to b2 , thus got well matched.
center located at c is defined as P(o, c | cek, li) ∝ exp(−||c + δk − li||²). Now the probability of the hypothesis of object o with center c is computed as:

P(o, c) = Σi,k P(o, c | cek, li) P(cek | li) P(li)    (4)
P(o, c) gives a voting map V over the different locations c for the object class o. Extracting local maxima in V gives a set of hypotheses {Hj} = {(oj, cj)}. Furthermore, the figure-ground segmentation for each Hj can be estimated by backtracing the matching results. For those fi giving the correct prediction, the patch mask m in the codebook is “pasted” to the corresponding image location as the figure-ground segmentation. Formally, for a point p in the image at location pl, we define P(p = fig | cek, li) as the probability of point p belonging to the foreground when the feature at location li is matched to the codebook entry cek: P(p = fig | cek, li) ∝ exp(−||pl − li||) mk(vector pl li). We further assume that P(cek, li | Hj) ∝ P(oj, cj | cek, li) and P(fi | cek) ∝ P(cek | fi). The figure-ground probability for hypothesis Hj is estimated as

P(p = fig | Hj) ∝ Σk,i exp(−||pl − li||) mk(vector pl li) P(fi | cek) P(cek, li | Hj)    (5)

Eq. (5) gives the estimation of the top-down segmentation. The whole process of top-down recognition is shown in Fig. 4. The binary top-down segmentation (F, B) into figure (F) and background (B) is then obtained by thresholding P(p = fig | Hj).
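The voting of Eq. (4) amounts to accumulating matching scores at the predicted centers. A minimal sketch (our own simplification, with Dm replaced by a plain L2 distance and all names ours) is:

```python
import numpy as np

def vote_object_centers(features, codebook, img_shape):
    """Accumulate the voting map V of Eq. (4) -- an illustration, not the authors' code.

    features : list of (sc_vector, (x, y)) pairs extracted from the test image.
    codebook : list of (sc_vector, (dx, dy)) entries, where (dx, dy) is the offset delta_k
               from the feature location to the object center.
    """
    V = np.zeros(img_shape, dtype=float)
    for f_vec, (fx, fy) in features:
        for c_vec, (dx, dy) in codebook:
            p_match = np.exp(-np.linalg.norm(np.asarray(f_vec) - np.asarray(c_vec)))  # ~ P(ce_k | l_i)
            cx, cy = int(round(fx + dx)), int(round(fy + dy))    # predicted center c = l_i + delta_k
            if 0 <= cy < img_shape[0] and 0 <= cx < img_shape[1]:
                V[cy, cx] += p_match                             # P(o, c) accumulates over i, k
    return V   # local maxima of V give the hypotheses {H_j}; backtracing the voters gives Eq. (5)
```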
4 Verification: Combining Recognition and Segmentation
From our experiments, the top-down recognition using the voting scheme will produce many False Positives (FPs). In this section, we propose a two-step procedure of False Positive Pruning (FPP) to prune out FPs. In the first step, we refine the top-down hypothesis mask by checking its consistency with the bottom-up segmentation. Second, the final score on the refined mask is recomputed by considering spatial constraints.
Fig. 4. Top-down recognition. (a) An input image; (b) A matched point feature votes for 3 possible positions; (c) The vote map V; (d) The hypothesis Hj traces back to find its voters {fi}; (e) Each fi predicts the figure-ground configuration using Eq. (5).
Combining Bottom-up Segmentation. The basic idea for local feature voting is to make global decision by the consensus of local predictions. However, these incorrect local predictions using a small context can accumulate and confuse the global decision. For example, in pedestrian detection, two trunks will probably be locally taken as human legs and produce a human hypothesis (in Fig. 5-(a)); another case is the silhouettes from two standing-by pedestrians.
Fig. 5. Combining bottom-up segmentation. FPs tend to spread out as multiple regions from different objects. In the example of (a), an object O consists of five parts (A, B, C, D, E). (A′ ∈ O1, D′ ∈ O2, E′ ∈ O3) are matched to (A, D, E) because locally they are similar, and the hypothesis O′ = (A′, D′, E′) is generated. (b) shows the boundaries of a FP (in green) and a TP (in red) in a real image. (c) is the layered view of the TP in (b): the top layer is the top-down segmentation, which forms a force (red arrows) to pull the mask out from the image; the bottom layer is the background force (green arrows); the middle layer is the top-down segmentation (thresholded to a binary mask) over the segmentation results. (d) is the case for the FP.
In pedestrian detection, the top-down figure-ground segmentation masks of the FPs usually look similar to a pedestrian. However we notice that such top-down mask is not consistent with the bottom-up segmentation for most FPs. The bottom-up segments share bigger contextual information than the local features in the top-down recognition and are homogenous in the sense of low-level image feature. The pixels in the same segment should belong to the same object. Imagine that the top-down hypothesis mask(F, B) tries to pull the object F out of the whole image. TPs generally consists of several well-separated segments from the background so that they are easy to be pulled
out (Fig. 5-(c)). However, FPs often contain only part of the segments. In the example of the tree trunks, only part of the tree trunk is recognized as foreground while the whole tree trunk forms one bottom-up segment. This makes pulling out FPs more difficult because they have to break the homogeneous segments (Fig. 5-(d)). Based on these observations, we combine the bottom-up segmentation to update the top-down figure-ground mask. Incorrect local predictions are removed from the mask if they are not consistent with the bottom-up segmentation. We give each bottom-up segment Si a binary label. Unlike the work in [17], which uses graph cut to propose the optimized hypothesis mask, we simply define the ratio Λ = Area(Si ∩ F) / Area(Si ∩ B) as a criterion to assign Si to F or B. We try further segmentation when such an assignment is uncertain, to avoid the case of under-segmentation in a large area. The Normalized Cut (NCut) cost [18] is used to determine whether such further segmentation is reasonable. The procedure to refine the hypothesis mask is formulated as follows (a code sketch of this loop is given at the end of this section):
Input: top-down mask (F, B) and bottom-up segments {Si, i = 1, ..., N}. Output: refined object mask (F, B). Set i = 0.
1) If i > N, exit; else, i = i + 1.
2) If Λ = Area(Si ∩ F) / Area(Si ∩ B) > κup, then F = F ∪ Si, go to 1); else if Λ < κdown, then F = F − (F ∩ Si), go to 1); otherwise, go to 3).
3) Segment Si into (Si1, Si2). If ζ = NCut(Si) > Υup, then F = F − (F ∩ Si), go to 1); else set SN+1 = Si1, SN+2 = Si2, S = S ∪ {SN+1, SN+2}, N = N + 2, go to 1).
Re-evaluation. There are two advantages of the updated masks. The first is that we can recompute more accurate local features by masking out the background edges. The second is that the shapes of the updated FP masks will change much more than those of TPs, because FPs are usually generated by locally similar parts of other objects, which will probably be taken away through the above process. We require that TPs have voters from all the different locations around the hypothesis center; this eliminates hypotheses with little region support or with only a partial matching score. The final score is the summation of the average scores over the different spatial bins in the mask. The shape of the spatial bins is predefined: for pedestrians we use radius-angle polar ellipse bins; for other objects we use rectangular grid bins. For each hypothesis, SC features are re-computed over the edge map masked by F, and a feature fi is only allowed to be matched to a cek in the same bin location. For each bin j, we compute an average matching score Ej = Σ P(cek | fi) / #(cek, fi), where both cek and fi come from bin j. The final score of this hypothesis is defined as:
E = Σj E′j ,  where  E′j = { Ej,  if Ej > α
                           { −α,  if Ej = 0 and #{cek : cek ∈ bin(j)} > 0.    (6)
The term α is used to penalize the bins which have no match with the codebook. This decreases the scores of FPs that contain only part of a true object, e.g. a bike hypothesis with only one wheel. Experiments show that our FPP procedure can prune out FPs effectively.
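The mask-refinement procedure above can be written as a short loop. The following is the sketch referred to earlier, under our own naming (the thresholds κup, κdown, Υup and the NCut/split helpers are illustrative stand-ins, not values or code from the paper):

```python
import numpy as np

def refine_mask(F, segments, ncut_cost, split, k_up=2.0, k_down=0.5, ncut_up=0.1):
    """Sketch of the mask-refinement loop in Sec. 4 (our paraphrase, not the authors' code).

    F         : boolean top-down foreground mask (the background B is its complement).
    segments  : list of boolean masks of the bottom-up segments S_i.
    ncut_cost : callable returning the NCut cost of splitting a segment (stand-in for [18]).
    split     : callable returning the two sub-segment masks (S_i1, S_i2).
    The thresholds k_up, k_down and ncut_up are illustrative values only.
    """
    F = F.copy()
    queue = list(segments)
    i = 0
    while i < len(queue):
        S = queue[i]
        i += 1
        in_f = np.logical_and(S, F).sum()
        in_b = np.logical_and(S, ~F).sum()
        ratio = in_f / max(in_b, 1)            # Lambda = Area(S_i & F) / Area(S_i & B)
        if ratio > k_up:
            F |= S                             # the whole segment joins the figure
        elif ratio < k_down:
            F &= ~S                            # the whole segment is removed from the figure
        else:                                  # uncertain assignment: try splitting the segment
            if ncut_cost(S) > ncut_up:
                F &= ~S                        # splitting is unreasonable, drop the segment
            else:
                queue.extend(split(S))         # recurse on the two halves
    return F
```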
5 Results
Our experiments test different object classes including pedestrian, bike, human riding bike, umbrella and car (Table 1). These pictures were taken from scenes around campus and urban streets. Objects in the images are roughly at the same scale. For pedestrians, the range of heights is from 186 to 390 pixels.

Table 1. Dataset for detection task

#Object    Pedestrian   Bike   Human on bike   Umbrella   Car
Training       15         3          2             4        4
Testing       345        67         19            16       60
For our evaluation criteria, a hypothesis whose center falls into an ellipse region around ground truth center is classified as true positive. The radii for ellipse are typically chosen as 20% of the mean width / height of the objects. Multiple detections for one ground truth object are only counted once. Angular Blur and Mask Function Evaluation. We compare the detection algorithm on images w/ and w/o Angular Blur (AB) or mask function. The PR curves are plotted in Fig.6. For pedestrian and umbrella detection, it is very clear that adding Angular Blur and mask function can improve the detection results. For other object classes, AB+Mask outperforms at high-precision/low-recall part of the curve, but gets no significant improvement at high-recall/low-precision part. The reason is that AB+Mask can improve the cases where objects have deformation and complex background clutter. For bikes,
Fig. 6. PR-Curves of object detection results. Panels: (a) Pedestrian, (b) Bike, (c) Umbrella, Human on bike, and Car; legend: Angular Blur+Mask Function, w/o Angular Blur, w/o Mask Function, w/ FPP, and HOG (only in (a)).
Fig. 7. Detection result on real images. The color indicates different segments. The last row contains cases of FPs for bikes and pedestrians.
the inner edges dominate the SC histogram, so adding the mask function makes only a little difference.
Pedestrian Detection Compared with HOG. We also compare with HOG, using the implementation of the authors of [12]. Figure 6-(a) shows that our method with the FPP procedure is better than the results of HOG. Note that we only use a very limited number of training examples, as shown in Table 1, and we did not utilize any negative training examples.
6 Conclusion and Discussion
In this paper, we developed an object detection method combining top-down model-based recognition with bottom-up image segmentation. Our method not only detects object positions but also gives the figure-ground segmentation mask. We designed an improved Shape Context feature for recognition and proposed a novel FPP procedure to verify hypotheses. This method can be generalized to many object classes. Results show that our detection algorithm can achieve both high recall and high precision rates. However, there are still some FP hypotheses that cannot be pruned. They are typically very similar to objects, like a human-shaped rock or some tree trunks. More information, such as color or texture, should be explored to prune out these FPs. Another failure case of the SC detector is very small scale objects; these objects have very few edge points and thus are not suitable for SC. Also, our method does not work for severe occlusion, where most local information is corrupted.
Acknowledgment. This work is partially supported by the National Science Foundation through grants NSF-IIS-04-47953 (CAREER) and NSF-IIS-03-33036 (IDLP). We thank Qihui Zhu and Jeffrey Byrne for polishing the paper.
References 1. Viola, P.A., Jones, M.J.: Rapid object detection using a boosted cascade of simple features. In: CVPR (2001) 2. Borenstein, E., Ullman, S.: Class-specific, top-down segmentation. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, Springer, Heidelberg (2002) 3. Levin, A., Weiss, Y.: Learning to combine bottom-up and top-down segmentation. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, Springer, Heidelberg (2006) 4. Leibe, B., Seemann, E., Schiele, B.: Pedestrian detection in crowded scenes. In: CVPR (2005) 5. Ferrari, V., Tuytelaars, T., Gool, L.J.V.: Object detection by contour segment networks. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, Springer, Heidelberg (2006) 6. Kokkinos, I., Maragos, P., Yuille, A.L.: Bottom-up & top-down object detection using primal sketch features and graphical models. In: CVPR (2006) 7. Zhao, L., Davis, L.S.: Closely coupled object detection and segmentation. In: ICCV (2005)
8. Ren, X., Berg, A.C., Malik, J.: Recovering human body configurations using pairwise constraints between parts. In: Sebe, N., Lew, M.S., Huang, T.S. (eds.) Computer Vision in Human-Computer Interaction. LNCS, vol. 3766, Springer, Heidelberg (2005) 9. Mori, G., Ren, X., Efros, A.A., Malik, J.: Recovering human body configurations: Combining segmentation and recognition. In: CVPR (2004) 10. Srinivasan, P., Shi, J.: Bottom-up recognition and parsing of the human body. In: CVPR (2007) 11. Felzenszwalb, P.F., Huttenlocher, D.P.: Pictorial structures for object recognition. International Journal of Computer Vision 61(1) (2005) 12. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005) 13. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell. 24(4) (2002) 14. Mori, G., Belongie, S.J., Malik, J.: Efficient shape matching using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell 27(11) (2005) 15. Thayananthan, A., Stenger, B., Torr, P.H.S., Cipolla, R.: Shape context and chamfer matching in cluttered scenes. In: CVPR (2003) 16. Rubner, Y., Tomasi, C., Guibas, L.J.: A metric for distributions with applications to image databases. In: ICCV (1998) 17. Ramanan, D.: Using segmentation to verify object hypotheses. In: CVPR (2007) 18. Shi, J., Malik, J.: Normalized cuts and image segmentation. In: CVPR (1997)
An Efficient Method for Text Detection in Video Based on Stroke Width Similarity
Viet Cuong Dinh, Seong Soo Chun, Seungwook Cha, Hanjin Ryu, and Sanghoon Sull
Department of Electronics and Computer Engineering, Korea University, 5-1 Anam-dong, Seongbuk-gu, Seoul, 136-701, Korea
{cuongdv,sschun,swcha,hanjin,sull}@mpeg.korea.ac.kr
Abstract. Text appearing in video provides semantic knowledge and significant information for video indexing and retrieval systems. This paper proposes an effective method for text detection in video based on the similarity in stroke width of text (the stroke width is defined as the distance between the two edges of a stroke). From the observation that text regions can be characterized by a dominant fixed stroke width, edge detection with local adaptive thresholds is first devised to keep text regions while reducing background regions. Second, a morphological dilation operator with an adaptive structuring element size determined by the stroke width value is exploited to roughly localize text regions. Finally, to reduce false alarms and refine the text location, a new multi-frame refinement method is applied. Experimental results show that the proposed method is not only robust to different levels of background complexity, but also effective for different fonts (size, color) and languages of text.
1 Introduction
The need for efficient content-based video indexing and retrieval has increased due to the rapid growth of video data available to consumers. For this purpose, text in video, especially superimposed text, is the most frequently used since it provides high-level semantic information about video content and it has distinctive visual characteristics. Therefore, success in video text detection and recognition would have a great impact on multimedia applications such as image categorization [1], video summarization [2], and lecture video indexing [3]. Many efforts have been made for text detection in images and video. Regarding the way used to locate text regions, text detection methods can be classified into three approaches: the connected component (CC)-based method [4, 5, 6], the texture-based method [7, 8], and the edge-based method [9, 10]. The CC-based method is based on the analysis of the geometrical arrangement of edges or homogeneous colors that belong to characters. Alternatively, the texture-based method treats a text region as a special type of texture and employs learning algorithms, e.g., a neural network [8] or a support vector machine (SVM) [11], to extract text. In general, the texture-based method is more robust than the CC-based method in dealing with complex backgrounds. However, the main drawbacks of this method are its high complexity and inaccurate localization.
Another popularly studied method is the edge-based method, which is based on the fact that text regions have abundant edges. This method is widely used due to its fast performance in detecting text and its ability to keep geometrical structure of text. The method in [9] detects edges in an image and then uses the fixed size horizontal, vertical morphological dilation operations to form text line candidate. Real text regions are identified by using the SVM. Two disadvantages of this method are its poor performance in case of complex background and the use of fixed size structuring element in dilation operations. To deal with the background complexity problem, edge detection-based method should be accompanied by a local threshold algorithm. In [10], the image is first divided into small windows. A window is considered to be complex if the “number of blank rows” is smaller than a certain specific value. Then, in the edge detection step, a higher threshold is assigned for these complex windows. However, the “number of blank rows” criterion appears sensitive to noise and not strong enough to handle different text sizes. Therefore, how to design an effective local threshold algorithm for detecting edge is still a challenging problem of text detection in video. The main problem of the above existing methods is that they are not robust to different text colors, sizes, and background complexity, since they simply use either general segmentation method or some prior knowledge. In this paper, we attempt to discover the intrinsic characteristic of text (namely the stroke width similarity) and then exploit it to build a robust method for text detection in video. From the knowledge of font system, it turns out that, if characters are in the same font type and size, their stroke widths are almost constant. In another view, a text region can be considered as a region with a dominant fixed stroke width value. Therefore, the similarity in stroke width can be efficiently used as a critical characteristic to describe the text region in video frame. The contributions of this paper can be summarized as follow: • Exploiting the similarity in stroke width characteristic of text to build an effective edge detection method with local adaptive threshold algorithm. • Implementing a stroke-based method to localize text regions in video. • Designing a multi-frame refinement method which can not only refine the text location but also enhance the quality of the detected text. The rest of this paper is organized as follows: Section 2 presents the proposed method for text detection in video. To demonstrate its effectiveness, experimental results are given in Section 3. In Section 4, the concluding remarks are drawn.
2 Proposed Method In the proposed method, text regions in video are detected through three processes. First, edge detection with local adaptive threshold algorithm is applied to reveal text edge pixels. Second, dilation morphological operator with adaptive structuring element size is exploited in the stroke-based localization process to roughly localize text regions. Finally, a multi-frame refinement process is applied to reduce false alarm, refine the location, and enhance the quality of each text region. Figure 1 shows the flow chart of the proposed system.
[Flowchart: video frames → edge detection with local adaptive thresholds → stroke-based text localization → multi-frame text refinement → detected text regions]
Fig. 1. Flowchart of the proposed text detection method
2.1 Motivation
From the knowledge of font systems, it turns out that if characters are in the same font type and font size, their stroke widths are almost constant. Therefore, in the proposed method, the stroke width similarity is used as a clue to characterize text regions in a frame. Generally, the width of any stroke (of both text and non-text objects) can be calculated as the distance (measured in pixels) in the horizontal direction between its double-edge pixels. Figure 2(a) shows an example of double-edge pixels (A and B). It can be seen from the figure that the stroke widths of different characters are almost similar.
Fig. 2. An example of text image. (a) Text image. (b) Edge values for the scan line in (a), wt is the stroke width value.
In general, the color of text often contrasts to its local background. Therefore, for any double-edge pixels of a stroke, this contrast makes an inversion in sign of the edge values, i.e. the gradient magnitude of edge pixels, in horizontal direction (dx) between two pixels on the left- and right-hand side of the stroke. Figure 2(b) shows the corresponding edge values in horizontal direction of a given horizontal scan line in Fig. 2(a); it is clear that the stroke can be modeled as double-edge pixels within a certain range, delimited by a positive and a negative peak nearby. By using the doubleedge pixel model to describe the stroke, we can take the advantages of: 1) Reducing the effect of noise; 2) Applicability even with low-quality edge image. 2.2 Edge Detection with Local Adaptive Threshold Algorithm First, the Canny edge detector with a low threshold is applied to video frame to keep all possible text edge pixels and each frame is divided into M × N blocks, typically 8 × 8 or 12 × 8. Second, by analyzing the similarity in stroke width corresponding to each block, blocks are classified into two types: simple blocks and complex blocks. Then, a suitable threshold algorithm for each block type is used to determine the proper threshold for each block. Finally, the final edge image is created by applying each block with the new proper threshold.
2.2.1 Block Classification For each block, we create a stroke width set which is the collection of all stroke width candidates contained in this block. Due to the similarity in stroke width of characters, the values in the stroke width set of the text region on the simple background are concentrated on some close values. Whereas stroke width candidates of the text region on the complex background or background region may also be created by other background objects. As a result, the element values in this set may spread over a wide range of values. Therefore, text regions on a simple background can be characterized by a smaller value of the standard deviation of stroke width than those on other regions. Based on this different characteristic, blocks in the frame are classified into two types: simple blocks and complex blocks. A block is classified as a simple one if the standard deviation of stroke width values is smaller than a given specific value. Otherwise, it will be classified as a complex one. For the simple block, the threshold of the edge detector should be relatively low to detect both low-contrast and high-contrast texts. On the contrary, the threshold for the complex block should be relatively high to eliminate background and highlight text. 2.2.2 Local Adaptive Threshold Algorithm In each block, the stroke width value corresponding to text objects often dominates in population of the stroke width set. Therefore, it can be estimated by calculating the stroke width with the maximum stroke width histogram value. Let wt denote the stroke width value of text, wt can be defined as:
wt = arg maxl H(l),    (1)

where H(l) is the value of the block's stroke width histogram at stroke width l. From the set of all double-edge pixels, we construct two rough sets: the text set St and the background set Sbg. St represents the set of all pixels which are predicted as text edge pixels, whereas Sbg represents the set of all predicted background edge pixels. St and Sbg are constructed as follows:

St = {(i, j) | i, j ∈ E, w(i, j) = wt},    (2)

Sbg = {(i, j) | i, j ∈ E, w(i, j) ≠ wt},    (3)
where E is the edge map of the block and w(i, j) denotes the stroke width between the double-edge pixels i and j. Note that St and Sbg are only rough sets of the text edge pixels and background edge pixels, since only edge pixels with a horizontal gradient direction are considered during the stroke width calculation. Thresholds for the simple block and the complex block are determined as follows:
• In the simple block case, the text lies on a clear background. Therefore, the threshold is determined as the minimum edge value of all edge pixels belonging to St in order to keep text information and simplify the computation.
• In the complex block case, determining a suitable threshold for the edge detector is much more difficult. Applying general thresholding methods often does not give
a good result, since these methods are designed for general classification problems, not for such a specific problem as separating text from background. In this paper, by exploiting the similarity in stroke width of text, we can roughly estimate the text set and the background set as St and Sbg. Therefore, the problem of finding an appropriate threshold in this case can be converted into another, easier problem: finding an appropriate threshold to correctly separate the two sets St and Sbg.
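The classification of a block and the rough split into St and Sbg can be sketched as follows (our own helper names; the standard-deviation threshold is an assumed value, since the paper does not state it here):

```python
import numpy as np

def classify_block(widths, std_thresh=3.0):
    """Label a block 'simple' or 'complex' from its stroke-width set (Sec. 2.2.1)."""
    if len(widths) == 0:
        return "simple"
    return "simple" if np.std(widths) < std_thresh else "complex"

def split_edge_pairs(pairs, widths):
    """Rough text / background sets of Eqs. (2)-(3).

    pairs : list of double-edge pixel pairs, aligned one-to-one with widths.
    Pairs whose width equals the dominant stroke width w_t go to S_t, the rest to S_bg.
    """
    if len(widths) == 0:
        return [], []
    w_t = int(np.argmax(np.bincount(widths)))                 # Eq. (1): peak of the width histogram
    s_t = [p for p, w in zip(pairs, widths) if w == w_t]
    s_bg = [p for p, w in zip(pairs, widths) if w != w_t]
    return s_t, s_bg
```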
[Flowchart: image block → calculate the stroke width wt → estimate the text and background sets St and Sbg → (simple block case) set the threshold as the smallest edge value of St / (complex block case) construct the edge value histograms ht(r) and hbg(r) and set the threshold as the edge value satisfying equation (4) → edge detection with the new threshold value → edge image block]
Fig. 3. Flowchart of the proposed local adaptive threshold algorithm
Fig. 4. Edge detection results. (a) Original Image. Edge detection using (b) constant threshold, (c) proposed local adaptive threshold algorithm.
Let r denote the edge value (gradient magnitude) of a pixel in a block, and let ht(r) and hbg(r) denote the histograms of the edge values corresponding to the text set St and the background set Sbg, respectively. According to [12], if the form of the two distributions is known or assumed, it is possible to determine an optimal threshold (in terms of minimum error) for segmenting the image into the two distinct sets. The optimal threshold, denoted T, can be obtained as the root of the equation:

pt × ht(T) = pbg × hbg(T),    (4)
where pt and pbg (pbg = 1 − pt) are the probabilities of a pixel to be in the St and Sbg sets, respectively. Consequently, the appropriate threshold for the complex block is determined as the value which satisfies or approximately satisfies equation (4). Figure 3 shows the flowchart of the local adaptive threshold algorithm. Figure 4 shows the results of the edge detection method on the video frame in Fig. 4(a) using only one constant threshold (Fig. 4(b)), in comparison with using the proposed local adaptive thresholds (Fig. 4(c)). The pictures show that the proposed method eliminates more background pixels while still preserving text pixels.
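One simple way to satisfy Eq. (4) approximately is to discretize both edge-value distributions and pick the bin where the two weighted histograms are closest. The sketch below is our illustration, not the authors' implementation:

```python
import numpy as np

def complex_block_threshold(edges_t, edges_bg, n_bins=64):
    """Approximate root of Eq. (4): p_t * h_t(T) = p_bg * h_bg(T).

    edges_t, edges_bg : NumPy arrays of gradient magnitudes of the pixels in S_t and S_bg.
    """
    lo = min(edges_t.min(), edges_bg.min())
    hi = max(edges_t.max(), edges_bg.max())
    bins = np.linspace(lo, hi, n_bins + 1)
    h_t, _ = np.histogram(edges_t, bins=bins, density=True)
    h_bg, _ = np.histogram(edges_bg, bins=bins, density=True)
    p_t = len(edges_t) / float(len(edges_t) + len(edges_bg))
    diff = np.abs(p_t * h_t - (1.0 - p_t) * h_bg)      # |p_t h_t(T) - p_bg h_bg(T)|
    k = int(np.argmin(diff))
    return 0.5 * (bins[k] + bins[k + 1])               # bin centre used as the threshold T
```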
2.3 Stroke-Based Text Localization
After the edge detection process, the dilation morphological operator is applied to the edge-detected video frame to highlight text regions. The size of the structuring element is adaptively determined by the stroke width value. When applying the dilation operator, one of the most important factors that needs to be considered is the size of the structuring element. If this size is set too small, the text area cannot be filled wholly; as a result, this area can be regarded as a non-text area. In contrast, if this size is set too large, text can be mixed with the surrounding background, which increases the number of false alarms. Moreover, using only a fixed size of the structuring element, as in Chen et al.'s [9] method, is not applicable to texts of different sizes.
Fig. 5. Structure element of the dilation operation (wt is the stroke width value)
In this paper, we determine the size of the structuring element based on the stroke width value, which is already revealed in the edge detection process. More specifically, for each block whose stroke width is wt, we apply a dilation operator of size (2 × wt + 1) × 1, as shown in Fig. 5. This size is sufficient to wholly fill the characters as well as connect neighboring characters together. Moreover, using block-based dilation with a suitable structuring element shape makes it applicable to text of different sizes at different locations in the video frame. Figure 6(a) shows the image after applying the proposed dilation operators.
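The adaptive dilation itself is straightforward; the following pure-NumPy sketch (ours, not the authors' code) applies the (2·wt + 1) × 1 structuring element of Fig. 5 to one block:

```python
import numpy as np

def dilate_horizontal(edge_block, w_t):
    """Dilate an edge block with a (2*w_t + 1) x 1 horizontal structuring element.

    A pixel becomes foreground if any edge pixel lies within w_t columns of it on
    the same row (equivalent to morphological dilation with the element of Fig. 5).
    """
    h, w = edge_block.shape
    padded = np.zeros((h, w + 2 * w_t), dtype=bool)
    padded[:, w_t:w_t + w] = edge_block.astype(bool)
    out = np.zeros((h, w), dtype=bool)
    for s in range(2 * w_t + 1):                  # OR together the shifted copies
        out |= padded[:, s:s + w]
    return out
```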
Fig. 6. Text localization and refinement process (a) Dilated image (b) Text regions candidates (c) Text regions after being refined by multi-frame refinements
After the dilation process, connected component analysis is performed to create text region candidates. Then, based on the characteristics of text, the following simple criteria for filtering out non-text regions are applied: 1) the height of the region is between 8 and 35 pixels; 2) the width of the region must be larger than the height; 3) the number of edge pixels must be at least two times larger than the width, based on the observation that a text
region should have abundant edge pixels. Figure 6(b) shows the text region candidates after applying these criteria. 2.4 Multi-frame Refinement
Multi-frame integration has been used for the purpose of text verification [13] or text enhancement [14]. However, temporal information for the purpose of text refinement in a frame, which often plays an important role in increasing the accuracy of the subsequent text segmentation and recognition steps, has not been utilized so far. In this paper, we propose a multi-frame based method to refine the location of text by further eliminating background pixels in the rough text regions detected in the previous steps. Moreover, the quality of the text is also improved by selecting the most suitable frame, i.e. the frame in which the text is displayed most clearly, in the frame sequence. By using our method, the enhanced text region does not suffer from the blurring problem of the text enhancement in Li et al.'s method [14]. First, a multi-frame verification [13] is applied to reduce the number of false alarms. For each of m consecutive frames in a video sequence, a text region candidate is considered as a true text region only if there exist at least n (n < m) similar text regions T0, T1, ..., Tn-1 appearing in n different frames. Tk (k = 0, 1, ..., n-1) is the region of the corresponding frame obtained after the edge detection process. Let us call T the stationary edge image of the corresponding text region candidate. The pixel value at location (x, y) of T is determined as follows:

T(x, y) = { edge pixel,      if Σk=0..n−1 Ik(x, y) > θ
          { non-edge pixel,  otherwise,    (5)
where θ is a specific threshold and Ik(x, y) is defined as:

Ik(x, y) = { 1,  if Tk(x, y) is an edge pixel
           { 0,  otherwise.    (6)
Referring to (5), T(x, y) is an edge pixel if an edge pixel appears more than θ times at location (x, y); otherwise, T(x, y) is a non-edge pixel. In the proposed method, θ is set equal to [n × 3/4] in order to reduce the effect of noise. Based on the stationary characteristic of text, almost all background pixels are removed in T. However, this integration process may also remove some text edge pixels. In order to recover the lost text edge pixels, a simple edge recovery process is performed. A pixel in T is marked as an edge pixel if its two neighbors in the horizontal, vertical, or diagonal direction are edge pixels. After the recovery process, T can be seen as the edge image of the true text regions. Therefore, the precise text location of the corresponding text region can be obtained by calculating the bounding box of the edge pixels contained in T. In order to enhance the quality of the text, we extract the most suitable frame in the frame sequence, i.e. the one where the text appears clearest. Based on the fact that a text region is clearest if the corresponding edge image contains mostly text pixels, the most suitable frame is extracted if the edge image of its text region is the best match with T. In
other words, we choose the frame whose edge image Tk (k = 0, ..., n-1) is the most similar to T. The MSE (Mean Squared Error) measurement is used to measure the similarity between the two regions. The effectiveness of the multi-frame refinement is shown in Fig. 6(c). Compared to Fig. 6(b), two false alarms are removed and all of the true text regions have more precise bounding boxes.
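The stationary edge image of Eqs. (5)-(6), the recovery step, and the MSE-based frame selection can be sketched as follows (our simplified version: the recovery checks only horizontal neighbors, and the rounding of θ = [n × 3/4] is our assumption):

```python
import numpy as np

def stationary_edge_map(edge_maps):
    """Multi-frame refinement of Sec. 2.4 (a sketch with our own variable names).

    edge_maps : (n, H, W) boolean array holding the edge images T_k of one text-region
                candidate over n consecutive frames.
    Returns the stationary edge image T and the index of the frame whose edge image
    is closest to T in the MSE sense (the 'clearest' frame).
    """
    edge_maps = np.asarray(edge_maps, dtype=bool)
    n = edge_maps.shape[0]
    theta = int(round(3 * n / 4.0))                          # theta = [n * 3/4]
    T = edge_maps.sum(axis=0) > theta                        # Eq. (5): keep pixels that repeat
    recovered = T.copy()                                     # simple recovery (horizontal only)
    recovered[:, 1:-1] |= T[:, :-2] & T[:, 2:]
    mse = ((edge_maps.astype(float) - recovered.astype(float)) ** 2).mean(axis=(1, 2))
    return recovered, int(np.argmin(mse))
```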
3 Experimental Result
Due to the lack of a standard database for the problem of text detection in video, in order to evaluate the effectiveness of the proposed method we have collected a number of videos from various sources for a test database. Text appearance varies with different colors, orientations, languages, and character font sizes (from 8pt to 90pt). The video frame formats are 512×384 and 720×480 pixels. The test database can be divided into three main categories: news, sport, and drama. Table 1 shows the video length and the number of ground-truth text regions contained in each video category. In total, there are 553 ground-truth text regions in the whole video test database.

Table 1. Properties of video categories

                Drama        Sport        News
Video length    15 minutes   32 minutes   38 minutes
Text regions    126          202          225
For quantitative evaluation, the detected text region is considered as a correct one if the intersection of the detected text region (DTR) and the ground-truth text region (GTR) covers more than 90% of this DTR and 90% of this GTR. The efficiency of our detection method is assessed in terms of three measurements (which are defined in [10]): Speed, Detection Rate, and Detection Accuracy. In order to assess the effectiveness of the proposed method, we compare the performance of the proposed method with that of the typical edge-based method proposed by Lyu et al. [10], and with a method using three processes: edge detection with a constant threshold, text localization with fixed-size dilation operations (similar to the algorithm in [9]), and multi-frame refinement. We call it the “constant threshold” method. Table 2 shows the number of correct and false DTRs for the three video categories. It can be seen from the table that not only does the proposed method create the highest number of correct DTRs, but it also produces the smallest number of false DTRs in every case. Our method is clearly stronger than the others even for the news category (the number of false DTRs is only about half that of the other methods). It is more difficult to detect text in news video since the background changes fast and texts have variable sizes with different contrast levels to the background. The proposed method overcomes these problems since it successfully exploits the intrinsic characteristic of text (the stroke width similarity), which is invariant to the background complexity as well as to different font sizes and colors of text. Table 3 gives a summary of the detection rate and the detection accuracy of the three methods tested on the whole video test database. The proposed method achieved
Table 2. Number of correct and false DTRs

                     Lyu et al. [10]        Constant threshold      Proposed Method
                     Correct    False       Correct    False        Correct    False
Drama                   96        16          109        19           114        11
Sport                  154        26          152        32           179        20
News                   185        38          189        46           205        21
the highest accuracy, with a detection rate of 90.1% and a detection accuracy of 90.5%. This encouraging result shows that our proposed method is an effective solution to the background complexity problem of text detection in video. It can also be seen from the table that the proposed method is faster than Lyu et al.'s [10] method and slightly slower than the constant threshold method, which is expected since the frame must be scanned with different thresholds. Moreover, the processing time of 0.18s per frame meets the requirement for real-time applications. Figure 7 shows some more examples of our results. In these pictures, all the text strings are detected and their bounding boxes are relatively tight and accurate.

Table 3. Text detection accuracy
                      Lyu et al. [10]   Constant threshold   Proposed Method
Correct DTRs               435                 450                 498
False DTRs                  80                  97                  52
Detection Rate            78.7%               81.4%               90.1%
Detection Accuracy        84.5%               82.3%               90.5%
Speed (sec/frame)         0.23s               0.16s               0.18s

Fig. 7. Some pictures of detected text regions in frames
4 Conclusion This paper presents a comprehensive method for text detection in video. Based on the similarity in stroke width of text, an effective edge detection method with local adaptive thresholds is applied to reduce the background complexity. The stroke width information is further utilized to determine the structure element size of the dilation operator in the text localization process. To reduce the false alarm as well as refine the text location, a new multi-frame refinement method is applied. Experimental results with a large set of videos demonstrate the efficiency of our method with the detection rate of 90.1% and detection accuracy of 90.5%. Based on these encouraging results, we plan to continue research on text tracking and recognition for a real time text-based video indexing and retrieval system.
References 1. Zhu, Q., Yeh, M.C., Cheng, K.T.: Multimodal fusion using learned text concepts for image categorization. In: Proc. of ACM Int’l. Conf. on Multimedia, pp. 211–220. ACM Press, New York (2006) 2. Lienhart, R.: Dynamic video summarization of home video. In: Proc. of SPIE, vol. 3972, pp. 378–389 (1999) 3. Fan, J., Luo, H., Elmagarmid, A.K.: Concept-oriented indexing of video databases: toward semantic sensitive retrieval and browsing. IEEE Trans. on Image Processing 13, 974–992 (2004) 4. Zhong, Y., Karu, K., Jain, A.K.: Locating text in complex color images. Pattern Recognition 28, 1523–1536 (1995) 5. Jain, A.K., Yu, B.: Automatic text location in images and video frames. In: Proc. of Int’l. Conf. on Pattern Recognition, vol. 2, pp. 1497–1499 (August 1998) 6. Ohya, J., Shio, A., Akamatsu, S.: Recognition characters in scene images. IEEE Trans. on Pattern Analysis and Machine Intelligence 16, 214–220 (1994) 7. Qiao, Y.L., Li, M., Lu, Z.M., Sun, S.H.: Gabor filter based text extraction from digital document images. In: Proc. of Int’l. Conf. on Intelligent Information Hiding and Multimedia Signal Processing, pp. 297–300 (December 2006) 8. Li, H., Doermann, D., Kia, O.: Automatic text detection and tracking in digital video. IEEE Trans. on Image Processing, 147–156 (2000) 9. Chen, D., Bourlard, H., Thiran, J.P.: Text identification in complex background using SVM. In: Proc. of Int’l. Conf. on Document Analysis and Recognition, vol. 2, pp. 621–626 (December 2001) 10. Lyu, M.R., Song, J., Cai, M.: A comprehensive method for multilingual video text detection, localization, and extraction. IEEE Trans. on Circuits Systems Video Technology, 243–255 (2005) 11. Jung, K.C., Han, J.H., Kim, K.I., Park, S.H.: Support vector machines for text location in news video images. In: Proc. of Int’l. Conf. on System Technology, pp. 176–189 (September 2000) 12. Gonzalez, R.-C., Woods, R.E.: Digital Image Processing, 2nd edn., pp. 602–608. PrenticeHall, Englewood Cliffs (2002) 13. Lienhart, R., Wernicke, A.: Localizing and segmenting text in images and videos. IEEE Trans. on Circuits Systems Video Technology, 256–268 (2002) 14. Li, H., Doermann, D.: Text enhancement in digital video using multiple frame integration. In: Proc. of ACM Int’l. Conf. on Multimedia, pp. 19–22. ACM Press, New York (1999)
Multiview Pedestrian Detection Based on Vector Boosting
Cong Hou1, Haizhou Ai1, and Shihong Lao2
1 Computer Science and Technology Department, Tsinghua University, Beijing 100084, China
2 Sensing and Control Technology Laboratory, Omron Corporation, Kyoto 619-0283, Japan
[email protected]
Abstract. In this paper, a multiview pedestrian detection method based on the Vector Boosting algorithm is presented. The Extended Histograms of Oriented Gradients (EHOG) features are formed via dominant orientations, in which gradient orientations are quantized into several angle scales that divide the gradient orientation space into a number of dominant orientations. Blocks of combined rectangles with their dominant orientations constitute the feature pool. The Vector Boosting algorithm is used to learn a tree-structure detector for multiview pedestrian detection based on EHOG features. Furthermore, a detector pyramid framework over several pedestrian scales is proposed for better performance. Experimental results are reported to show its high performance.
Keywords: Pedestrian detection, Vector Boosting, classification.
1 Introduction Pedestrian detection researches originated in the requirement of intelligent vehicle system such as driver assistance systems [1] and automated unmanned car systems [13], and become more popular in recent research activities including visual surveillance [2], human computer interaction, and video analysis and content extraction, of which the last two are in more general sense that involve full-body human detection and his movement analysis [14]. Pedestrian, by definition, means a person traveling on foot, that is, a walker. Pedestrian detection is to locate all pedestrian areas in an image, usually in the form of bounding rectangles. We all know as a special case in more general research domain “object detection or object category”, face, car and pedestrian are most researched targets. Nowadays, although face detection or at least frontal face detection is well accepted solved problem in academic society, car detection and pedestrian detection are not so well solved; they remain a big challenge to achieve a comparable performance to face detection in order to meet the requirement of practical applications in visual surveillance etc. In general, object detection or object category is still in its early research stage that is very far from real application. For previous works before 2005 see a survey [3] and an experimental study [4]. Recent works are mainly machine learning based approaches among which the edgelets method [5] and the HOG method [7] are most representative. The edgelets method [5] uses a new type of silhouette oriented feature called an edgelet that is a short segment of line or curve. Based on edgelets features, part (full-body, Y. Yagi et al. (Eds.): ACCV 2007, Part I, LNCS 4843, pp. 210–219, 2007. © Springer-Verlag Berlin Heidelberg 2007
head-shoulders, torso, and legs) detectors are learned by Real AdaBoost. Responses of part detectors are combined to form a joint likelihood model and the maximum a posteriori (MAP) method is used for post processing to deal with multiple, possibly inter-occluded humans detection problem. The HOG method [7] uses the histograms of oriented gradients features to characterize gradient orientation distribution in a rectangular block. A detection window is divided into several rectangular blocks and each block is divided into several small spatial regions called cells in which the HOGs are computed and combined to form the features for classification. A linear SVM is used for detector training. This method is improved in [9] for speed up by AdaBoost learning of a cascade detector, in which variable-size blocks are used to enrich the feature pool and each feature is a 36D vector (concatenated by 4 cells’ 9 orientation bins of histogram in a block) that is fed into a linear SVM to form a weak classifier for AdaBoost learning. The above works are for still images and there are also recent advance for video [6][8][12]. Pedestrian detection in video is quite different from that in still images although techniques developed in later case can be help to pedestrian detection in video, for example to initialize or trigger a tracking module. Anyway pedestrian detection in still images is more fundamental. In this paper, we will focus on multiview pedestrian detection (MVPD) in still images and present a method using extended HOG features and Vector Boosting originally developed for multiview face detection (MVFD) for MVPD. Although pedestrian detection seems similar to face detection, it is more difficult due to large variation caused by clothes in addition to other common factors like pose, illumination, etc. In last several years MVFD has achieved great success and found its ways in practical applications. Many MVFD methods have been developed including parallel cascades [15], pyramid [16], decision tree [17] and WFS tree [18], of which the WFS tree together with the Vector Boosting algorithm is proved to be one of the most efficient methods. In this paper, we develop a method to apply this technique to the MVPD problem. We quantify gradient orientations into three angle scales that divide gradient orientation space into totally 27 dominant orientations. The EHOG features of a block of rectangle or non-regular rectangle are used to represent statistical information of edges in that block. Therefore blocks with their dominant orientations constitute the feature pool. The Vector Boosting learning [18] is used to construct a tree-structure detector [18] for MVPD. Further a detector pyramid framework over several pedestrian scales is proposed for better performance. The main contributions are (1) Dominant orientation in combined rectangle block is introduced into HOG features to form a feature pool; (2) A high performance tree-structure detector is developed for MVPD based on Vector Boosting. The rest of the paper is organized as follows: in Section 2, an extension to the HOG feature is introduced. In Section 3, the tree-structure detector training is described. Experiments are reported in Section 4 and conclusions are given in Section 5.
2 Extended HOG Feature The HOG feature has been proved effective in pedestrian detection [7][9]. The feature makes statistics about magnitude of gradient in several orientations, which are called
bins. In [7][9], the orientation range 0°–180° is divided into 9 bins. The HOG is collected in local regions of the image called cells. A block contains several cells whose HOGs are concatenated into a higher-dimensional vector, and an SVM is then used to construct a classifier for each corresponding block over the training set. However, the SVM detector is computationally intensive at detection time. Therefore, a boosted detector, which has proved successful in face detection, can be a good choice. In [9], HOG features are fed into a linear SVM to form a weak classifier, which results in a much faster detector. In this paper, in order to achieve better performance in both detection rate and speed, we make an extension to the HOG feature which outputs a scalar value. The feature can be directly used in boosting learning as a weak classifier, which avoids the time-consuming inner products of high-dimensional vectors as in SVM or LDA types of weak classifiers. First, we calculate the HOG over a block itself without dividing it into smaller cells as in [7]. Therefore the block functions in fact as an ensemble cell:
$$G_b = (g_b(1), g_b(2), \ldots, g_b(n))^T$$

where $n$ is the dimension of the HOG (in [7][9], $n = 9$), and $b$ is a block in an image. Then, we introduce the concept of dominant orientation $D$ (for details, see Section 2.1), defined as a subset of the above basic level of bins, that is, $D \subseteq \{1, 2, \ldots, n\}$, and calculate the EHOG feature corresponding to $D$ as

$$F_b(D) = \sum_{i \in D} g_b(i) / Z_b$$

where $Z_b$ is the normalizing factor

$$Z_b = \sum_{i=1}^{n} g_b(i).$$
With the help of the integral image of HOG [9], $g_b(1), g_b(2), \ldots, g_b(n)$ and $Z_b$ can be calculated very fast. We will explain two important concepts in more detail: the dominant orientation $D$ and the non-rectangle block $b$.

2.1 Dominant Orientation

The dominant orientation is a set of representative bins of the HOG. We have observed that in an area containing simple edges, most gradients concentrate in a relatively small range of orientations. Therefore, we can use a small subset of bins to represent these edges. In most situations, this treatment is acceptable, as shown in Fig. 1. In training, the dominant orientation is found by feature selection. In our implementation, we also divide the orientation range 0°–180° into 9 bins as in [7], and the dominant orientation of each feature may contain 1, 2 or 3 neighboring bins as shown in Fig. 2. Therefore, there are in total 27 different dominant orientations for each block.
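To make the EHOG construction concrete, the following sketch (our illustration, not the authors' code; the function names, the bin wrap-around at 180° and the small ε are assumptions) builds the block-level 9-bin HOG, enumerates the 27 dominant-orientation subsets, and evaluates $F_b(D)$.

```python
import numpy as np

N_BINS = 9  # orientation bins over 0-180 degrees, as in [7][9]

# The 27 dominant orientations: 1, 2 or 3 neighbouring bins (wrapping at 180 deg).
DOMINANT_SETS = [tuple((start + k) % N_BINS for k in range(width))
                 for width in (1, 2, 3) for start in range(N_BINS)]

def block_hog(grad_mag, grad_bin, block):
    """Accumulate the 9-bin HOG over a whole block, used here as one 'ensemble cell'."""
    y0, x0, y1, x1 = block
    mag = grad_mag[y0:y1, x0:x1].ravel()
    bins = grad_bin[y0:y1, x0:x1].ravel()
    return np.bincount(bins, weights=mag, minlength=N_BINS)

def ehog(g, dominant):
    """EHOG output F_b(D): normalized sum of the HOG entries in the dominant set D."""
    z = g.sum() + 1e-12          # normalizing factor Z_b (epsilon avoids division by zero)
    return g[list(dominant)].sum() / z
```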
Fig. 1. (a) A picture with a pedestrian. (b) The HOG calculated in the red rectangle in (a). The length of each line denotes the magnitude of gradient in each bin. It can be seen there are three main orientations (lines of these orientations are in bold). (c) We only pick these three bins out and use the normalized summation of their values as the output of the EHOG feature.
Fig. 2. Three levels of orientation partition between 0◦-180◦, and each partition has 9 different orientations (note that there are some overlaps between neighboring parts in (b) (c)). The dominant orientation in each level covers (a) 20◦ (1 bin), (b) 40◦ (2 bins), (c) 60◦ (3 bins).
2.2 Non-rectangle Blocks

The HOG and EHOG features are both calculated in a local region of an image called a block. In [7], the size of the block is fixed, and in [9] it is variable. We also use variable-size blocks, and make a further extension: in addition to the rectangle blocks used in [7][9], we also adopt blocks with non-rectangular shapes as in Fig. 3(a), called combined blocks, to enrich the feature pool in order to reflect the geometric structure of the feature representation. In addition, we add block pairs to capture the symmetry of pedestrians (see Fig. 7 for block pair examples). To avoid feature space explosion, we manage the feature space by selecting and expanding with a heuristic search strategy. The initial feature space contains only
Fig. 3. (a) Some blocks with irregular shapes. (b) Two types of expanding operators.
rectangle blocks. After feature selection, we obtain a small set of the best rectangle features as seeds for generating additional non-rectangle blocks. Two kinds of operations on these seeds to change their shapes are defined, as illustrated in Fig. 3(b): sticking and pruning. To describe the operations, we distinguish two types of rectangle blocks: positive and negative. To stick is to add a positive rectangle block beside the seed block, and to prune is to add a negative rectangle block inside the seed block. After several such operations, a seed can be propagated into thousands of new blocks, which constitute the new feature space for further training.
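The stick/prune seed expansion can be sketched as follows; the data layout (a combined block as lists of positive and negative rectangles) and the expansion depth are our assumptions, not details from the paper.

```python
def stick(seed, rect):
    """Combined block = seed plus a positive rectangle added beside it."""
    return {"pos": seed["pos"] + [rect], "neg": list(seed["neg"])}

def prune(seed, rect):
    """Combined block = seed with a negative rectangle added inside it."""
    return {"pos": list(seed["pos"]), "neg": seed["neg"] + [rect]}

def expand_seeds(seeds, candidate_rects, rounds=2):
    """Propagate each selected rectangle seed into many combined blocks."""
    pool, frontier = list(seeds), list(seeds)
    for _ in range(rounds):
        frontier = [op(s, r) for s in frontier for r in candidate_rects
                    for op in (stick, prune)]
        pool.extend(frontier)
    return pool
```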
3 Multi-view Pedestrian Detection

Although pedestrians of different poses are not as discriminable as the sub-views in the MVFD problem, in which frontal, left-half-profile, left-full-profile, right-half-profile and right-full-profile are commonly divided, we can still divide pedestrians into three relatively separable classes according to their views: frontal/rear, left-profile and right-profile. We use Vector Boosting to learn a tree-structure detector for multiview pedestrian detection.

3.1 Vector Boosting

The Vector Boosting algorithm was first proposed by Huang et al. in [19] to deal with multi-view face detection. It handles multi-class classification problems by means of the vectorization of the hypothesis output space and a flexible loss function defined by intrinsic projection vectors; for details, see [19].

3.2 Tree-Structure Detector

The tree-structure detector is illustrated in Fig. 4. Before the branching node, a series of nodes tries to separate different views of positive samples and at the same time discard as many negative samples as possible. They function as a cascade detector [11] in which each node performs a binary decision: positive or negative. The branching node outputs a 3D vector whose components determine which branch or branches the sample should be sent to. For example, the output (1, 1, 0) means the sample may be a left profile pedestrian or a frontal/rear one. After the branching node, there again comes a cascade detector for each branch.
Fig. 4. The tree-structure multi-view pedestrian detector. The gray node is a branching node which outputs a 3D binary decision.
3.3 Training Process

There are three kinds of tree nodes to train: the nodes before the branching node, the branching node, and the nodes after the branching node. Each node is a strong classifier learned by Vector Boosting, denoted by $\mathbf{F}(x)$. In our problem,

$$\mathbf{F}(x) = (F_l(x), F_f(x), F_r(x))^T.$$

The decision boundaries, as stated in [19], become in our problem:

$$P(\omega_N \mid x) = \frac{1}{1 + \exp(2F_l(x)) + \exp(2F_f(x)) + \exp(2F_r(x))}$$

$$P(\omega_L \mid x) = \exp(2F_l(x))\, P(\omega_N \mid x)$$
$$P(\omega_F \mid x) = \exp(2F_f(x))\, P(\omega_N \mid x)$$
$$P(\omega_R \mid x) = \exp(2F_r(x))\, P(\omega_N \mid x)$$

where $P(\omega_N \mid x)$, $P(\omega_L \mid x)$, $P(\omega_F \mid x)$ and $P(\omega_R \mid x)$ are, respectively, the posterior probabilities of the negative class and of the positive classes of the three views. The first kind of node only cares whether the sample is positive or negative, so it only needs to calculate $P(\omega_N \mid x)$. In training, we find a threshold $P_t(\omega_N)$
Fig. 5. Distributions of 3 classes (negative samples, positive samples of left profile and frontal/rear views) in the output space of the first 9 nodes before the branching. It can be seen that after 6 nodes pedestrians of different views can be separated rather well.
according to the detection rate and false alarm rate of the node. If $P(\omega_N \mid x) > P_t(\omega_N)$, the sample is regarded as negative, otherwise positive. The second kind of node, that is, the branching node, tries to separate positive samples of different views, so $P(\omega_L \mid x)$, $P(\omega_F \mid x)$ and $P(\omega_R \mid x)$ are all needed. The output is a 3D vector, in which each dimension is a binary decision made with a corresponding threshold. The nodes in the branches deal with a two-class classification problem, so normal Real AdaBoost [10] learning can be used. One remaining question is how to determine when to branch. In our practice this is done by experiments. Fig. 5 shows, for the first 9 nodes before the branching node, the distributions of negative samples and of positive samples with frontal/rear and left profile views. It can be seen that the pedestrians of different views are well separated at the 9th node; therefore we choose this node as the branching node.

3.4 The Detector Pyramid Framework

Generally speaking, the size of the training samples has a great impact on the performance of the learned detector, both in detection accuracy and in speed. In face detection research, a commonly used size is 24×24 pixels (19×19 and 20×20 were also used in earlier work), which has been demonstrated to be very effective. In pedestrian detection research, 15×20 [12], 24×58 [5] and 64×128 [7][9] have been used; different from face detection research, there is no widely accepted common size. In practice, we found that detectors trained with larger samples have better performance when detecting larger pedestrians, possibly because larger samples offer clearer information for the classification of more complex objects like pedestrians. We therefore use samples of different scales (sizes) to build a detector pyramid. The small-size detector in the pyramid deals with small pedestrians and the large-size detector deals with large ones. The number of layers of the scale pyramid of the input image to be scanned decreases accordingly, which can speed up detection compared with the single-scale detector case.
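For concreteness, the branching-node decision implied by the posteriors of Sect. 3.3 can be written as below; the per-view thresholds are placeholders, not values used in the paper.

```python
import numpy as np

def posteriors(F_l, F_f, F_r):
    """Posterior of the negative class and of the three positive views."""
    denom = 1.0 + np.exp(2 * F_l) + np.exp(2 * F_f) + np.exp(2 * F_r)
    p_neg = 1.0 / denom
    return (p_neg,
            np.exp(2 * F_l) * p_neg,   # P(omega_L | x)
            np.exp(2 * F_f) * p_neg,   # P(omega_F | x)
            np.exp(2 * F_r) * p_neg)   # P(omega_R | x)

def branch(F_l, F_f, F_r, t_l=0.2, t_f=0.2, t_r=0.2):
    """3D binary output, e.g. (1, 1, 0): maybe left profile or frontal/rear."""
    _, p_l, p_f, p_r = posteriors(F_l, F_f, F_r)
    return int(p_l > t_l), int(p_f > t_f), int(p_r > t_r)
```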
4 Experiments

Our training set contains 9,006 positive samples for the frontal/rear view and 2,817 positive samples for the left/right profile views. Pedestrians in the samples are upright, standing or walking. Some samples are shown in Fig. 6. The negative samples are sampled from more than 10,000 images without any humans.
Fig. 6. Positive training samples: (a) frontal/rear views; (b) left profile view; (c) right profile view
The detector pyramid has 3 layers whose sizes are 24×58, 36×87 and 48×116 pixels, respectively. The number of features in each node of the three detectors decreases as the size of the detector increases; for example, the total numbers of features in the first 5 nodes of these three detectors are 75, 53 and 46, respectively. So the speed of the detector increases as its size grows. Because EHOG features with non-rectangle blocks are slower to compute than those with rectangle blocks, for efficiency the feature pool for the first several nodes contains only the rectangle ones, which guarantees a faster speed. Fig. 7 shows the first three (pairs of) features selected in the 24×58 detector. It can be seen that the second feature captures the edges of the shoulders and the third captures the edges of the feet. The detection speed of our detector is about 1.2 FPS on a 320×240 pixel image with a 3.06 GHz CPU.
Fig. 7. The first three features selected and their corresponding dominant orientations
We evaluate our detector on two testing sets: one is Wu et al.'s testing set [5], which contains pedestrians of frontal/rear view, and the other is the INRIA testing set [7]. Wu's testing set contains 205 photos with 313 humans of frontal and rear view. Fig. 8(a) shows the ROC curves of our detector (including the detector pyramid and a
Fig. 8. (a) ROC curves of evaluation on Wu’s testing set [5]. (b) Miss-rate/FPPW curves on INRIA testing set [7].
Fig. 9. Some detection results on Wu’s frontal/rear testing set [5]
Fig. 10. Some detection results on INRIA testing set [7]
24×58 detector), Wu's edgelet full-body detector, and their combined detector. It can be seen that our detector pyramid is better in accuracy than the full-body detector and the combined detector, and is better than the single detector too. Some detection results on Wu's test set are shown in Fig. 9. The INRIA testing set contains 1805 64×128 images of humans with a wide range of variations in pose, appearance, clothing, illumination and background. Fig. 8(b) shows the comparative results as miss-rate/FPPW (False Positives Per Window) curves. We can see that our method is comparable with Zhu's method when the FPPW is low. At 10^{-4} FPPW, the detection rate is 90%. Some detection results are shown in Fig. 10.
5 Conclusion

In this paper, a multiview pedestrian detection method based on the Vector Boosting algorithm is presented. The HOG features are extended to form EHOG features via dominant orientations. Blocks of combined rectangles with their dominant orientations constitute the feature pool. The Vector Boosting algorithm is used to learn a tree-structure detector for multiview pedestrian detection based on EHOG features. Further, a detector pyramid framework over several pedestrian scales is proposed for better performance. This results in a high-performance MVPD system that can be very useful in many practical applications including visual surveillance. We are planning to extend this research to video for pedestrian tracking in the future.
Acknowledgement This work is supported in part by National Science Foundation of China under grant No.60673107 and it is also supported by a grant from Omron Corporation.
References

1. Gavrila, D.M.: Sensor-based Pedestrian Protection. IEEE Intelligent Systems, 77–81 (2001)
2. Zhao, T.: Model-based Segmentation and Tracking of Multiple Humans in Complex Situations. In: CVPR 2003 (2003)
3. Ogale, N.A.: A Survey of Techniques for Human Detection from Video. University of Maryland, Technical report (2005)
4. Munder, S., Gavrila, D.M.: An Experimental Study on Pedestrian Classification. TPAMI 28(11) (2006)
5. Wu, B., Nevatia, R.: Detection of Multiple, Partially Occluded Humans in a Single Image by Bayesian Combination of Edgelet Part Detectors. In: Sebe, N., Lew, M.S., Huang, T.S. (eds.) Computer Vision in Human-Computer Interaction. LNCS, vol. 3766, Springer, Heidelberg (2005)
6. Wu, B., Nevatia, R.: Tracking of Multiple, Partially Occluded Humans Based on Static Body Part Detection. In: CVPR 2006 (2006)
7. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. In: CVPR 2005 (2005)
8. Dalal, N., Triggs, B., Schmid, C.: Human Detection Using Oriented Histograms of Flow and Appearance. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, Springer, Heidelberg (2006)
9. Zhu, Q., Avidan, S., et al.: Fast Human Detection Using a Cascade of Histograms of Oriented Gradients. In: CVPR 2006 (2006)
10. Schapire, R.E., Singer, Y.: Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning 37, 297–336 (1999)
11. Viola, P., Jones, M.: Rapid Object Detection Using a Boosted Cascade of Simple Features. In: CVPR 2001 (2001)
12. Viola, P., Jones, M., Snow, D.: Detecting Pedestrians Using Patterns of Motion and Appearance. In: ICCV 2003 (2003)
13. Zhao, L., Thorpe, C.E.: Stereo- and Neural Network-Based Pedestrian Detection. IEEE Trans. on Intelligent Transportation Systems 1(3) (2000)
14. Gavrila, D.M.: The Visual Analysis of Human Movement: A Survey. Computer Vision and Image Understanding 73(1), 82–98 (1999)
15. Wu, B., Ai, H., et al.: Fast Rotation Invariant Multi-View Face Detection Based on Real AdaBoost. In: FG 2004 (2004)
16. Li, S.Z., Zhu, L., et al.: Statistical Learning of Multi-View Face Detection. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2352, Springer, Heidelberg (2002)
17. Jones, M., Viola, P.: Fast Multi-view Face Detection. MERL-TR2003-96 (July 2003)
18. Huang, C., Ai, H.Z., et al.: Vector Boosting for Rotation Invariant Multi-View Face Detection. In: Sebe, N., Lew, M.S., Huang, T.S. (eds.) Computer Vision in Human-Computer Interaction. LNCS, vol. 3766, Springer, Heidelberg (2005)
19. Huang, C., Ai, H.Z., et al.: High-Performance Rotation Invariant Multiview Face Detection. TPAMI 29(4), 671–686 (2007)
Pedestrian Detection Using Global-Local Motion Patterns Dhiraj Goel and Tsuhan Chen Department of Electrical and Computer Engineering Carnegie Mellon University, U.S.A.
[email protected],
[email protected]
Abstract. We propose a novel learning strategy called Global-Local Motion Pattern Classification (GLMPC) to localize pedestrian-like motion patterns in videos. Instead of modeling such patterns as a single class that alone can lead to high intra-class variability, three meaningful partitions are considered - left, right and frontal motion. An AdaBoost classifier based on the most discriminative eigenflow weak classifiers is learnt for each of these subsets separately. Furthermore, a linear threeclass SVM classifier is trained to estimate the global motion direction. To detect pedestrians in a given image sequence, the candidate optical flow sub-windows are tested by estimating the global motion direction followed by feeding to the matched AdaBoost classifier. The comparison with two baseline algorithms including the degenerate case of a single motion class shows an improvement of 37% in false positive rate.
1 Introduction
Pedestrian detection is a popular research problem in the field of computer vision. It finds applications in surveillance, fast automatic video browsing for pedestrians, activity monitoring, etc. The problem of localizing pedestrians in image sequences, however, is extremely challenging owing to the variations in pose, articulation and clothing. The resulting high intra-class variability of the pedestrian class is further exaggerated by background clutter and the presence of pedestrian-like upright objects in the scene such as trees and windows. Traditionally, appearance and shape cues have been the popular discernible features to detect pedestrians in a single image. Oren et al. [1] devised one of the first appearance-based algorithms using wavelet responses, while more recently, histograms of oriented gradients [2] have been used to learn a shape-based model to segment out humans. However, in an uncontrolled environment the appearance cues alone aren't faithful enough for reliable detection. Recently, motion cues have been gaining a lot of interest for pedestrian detection. In general, pedestrians need to be detected in videos, where the high correlation between consecutive frames can be used to good effect. While human appearances can be deceptive in a single image, their motion patterns are significantly different from other kinds of motions like those of vehicles (Fig. 2). The articulation of the human body while in motion, due to the movement of limbs and torso, can
Fig. 1. Overview of the proposed system
provide useful cues to localize moving pedestrians, especially against a stationary cluttered background. To model such a phenomenon, spatio-temporal filters based on shifted frame differences were used by Viola et al. [3], thus combining the advantages of both shape and motion cues. Fablet and Black [4] used dense optical flow to learn a generative human-motion model, while a discriminative model based on Support Vector Machines was trained by Hedvig [5]. The common feature of all the above techniques is that they consider pedestrians as a single class. Though on one hand using human motion patterns circumvents many problems posed by appearance cues, considering all such patterns as a single class can still lead to a very challenging classification problem. In this paper, we present a novel learning strategy to partition the human motion patterns into natural subsets with less variability. The rest of the paper is organized as follows: Sect. 2 provides an overview of the proposed method, Sect. 3 introduces the learning strategy based on partitioning the human motion pattern space, Sect. 4 reports the comparison with two baseline algorithms and detection results, and Sect. 5 concludes with a discussion.
2 Overview
Figure 1 gives an overview of the proposed system to detect pedestrian-like motion patterns in image sequences. Figure 2 illustrates some examples of such patterns. Due to the high intra-class variability of the flow patterns generated by pedestrians, modeling all such patterns with a single classifier is difficult. Hence, these are divided into meaningful subsets according to the global motion direction - left, right and frontal. As a result, the classification is divided into two stages. A linear three-class Support Vector Machine (SVM) classifier is trained to estimate the global motion direction. Next, a cascade of AdaBoost classifiers with the most discriminative eigenflow vectors is learnt for each of the global motion subsets. The motion patterns in the same partition share some similarity, and hence the intra-class variability of each of these subsets is less than that of the whole set, rendering the classification less challenging.
Fig. 2. (a) Pedestrian sample images along with their horizontal optical flow for right, left and frontal motion subsets. (b) Sample labeled images from the non-pedestrian data and examples of non-pedestrian horizontal flow.
At the time of testing, the dense optical flow image is searched for pedestrian-like motion patterns using sub-windows of different sizes. For every candidate sub-window, the global motion direction is first estimated using the linear three-class SVM classifier. Thereafter, it is tested against the matching AdaBoost classifier.
2.1 Computing Dense Optical Flow
Dense optical flow is used as a measure to estimate motion between consecutive frames. Though numerous methods exist in the literature to compute dense flow, the 2-D Combined Local Global method [8] was chosen since it has been shown to provide very accurate flow. Furthermore, using a bidirectional multi-grid strategy, it can work in real time [9] at up to 40 fps for a 200×200 pixel image. The final implementation used for pedestrian detection incorporates a slight modification in the weighting function of the regularization term, as mentioned in [6].
2.2 Training Data
The anatomy of the learning algorithm necessitates a pedestrian data set labeled according to the global motion. For this purpose, the CASIA Gait database [7] was chosen. A total of eight global motion directions were considered, which were merged to give three dominant motions - left, right and frontal (Fig. 2(a)). The left and right motion subsets capture lateral motion, while motion perpendicular to the camera plane is contained in the frontal motion subset. Dense optical flow was computed for the videos, and the horizontal (u) and vertical (v) flows for the labeled pedestrians were cropped. The collection of these flow patterns formed the training and test data for the classification. Specifically, the frontal motion subset had 2500 training data samples and 1000 test data samples. The other two motion subsets had 4800 training data samples and 2000 test data samples each. The cropped data samples were resized to 16×8 pixels, normalized to lie in the range [−1, 1] and concatenated to form a 256-dimensional feature vector [u1, u2, ..., u128, v1, v2, ..., v128]. The non-pedestrian data was generated by hand-labeling sub-windows with non-zero flow in videos containing moving vehicles. To automate the process,
an AdaBoost classifier was trained on the set of all pedestrian and non-pedestrian data and was run on other videos to generate additional non-pedestrian flow patterns (from the false positives). The non-pedestrian data samples are resized and normalized in the same way as the pedestrian data. Approximately 120,000 such samples were generated, with some examples shown in Fig. 2(b).
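The flow-to-feature preprocessing described above might look as follows; OpenCV is an assumed dependency for resizing, and the max-absolute-value normalization is only one plausible way to map the flow into [−1, 1].

```python
import numpy as np
import cv2  # assumed dependency for resizing

def flow_feature(u_crop, v_crop):
    """Resize a cropped (u, v) flow pair to 16x8 and build the 256-D feature."""
    u = cv2.resize(u_crop.astype(np.float32), (8, 16))   # 16 rows x 8 columns
    v = cv2.resize(v_crop.astype(np.float32), (8, 16))
    feat = np.concatenate([u.ravel(), v.ravel()])          # 128 + 128 = 256 dims
    return feat / (np.abs(feat).max() + 1e-12)             # values in [-1, 1]
```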
3 Classification Strategy
This section describes the classification strategy to distinguish the motion patterns of pedestrians from other kinds of motions, such as those of vehicles. As illustrated in Fig. 1, it is divided into two stages: estimating the global motion direction (Section 3.1) followed by testing against the discriminative classifier (Section 3.2). The training procedure for the latter has been described in [6]. The final detection performance depends on the accuracy of both stages and is greatly influenced by the taxonomy of the pedestrian motion patterns. A maximum of eight possible motion classes were considered, as shown in Fig. 2(a). Building a discriminative classifier for each of them results in a group of classifiers that are highly discriminative for the motion direction they are trained for. Thus, the accuracy in estimating the motion direction becomes crucial to the overall performance, i.e. a sub-window containing a strictly left moving pedestrian should be fed to the classifier trained to detect strictly left moving pedestrians. However, it is very difficult to reliably estimate the motion direction among these eight subsets, so the detection rate of the classifier as a whole degrades. The natural modification is to merge the different motion subsets such that the motion direction can be estimated faithfully, but at the same time intra-class variability is kept low. Splitting the motion patterns into three subsets - left, right and frontal - gave the best performance.
3.1 Estimating Global Motion
In order to decide which motion-specific discriminative classifier to use, it is important to first estimate the global motion. The mean motion direction of the pedestrian data was found to be unreliable for achieving such an objective. Hence, a linear three-class SVM classifier was trained. This classifier acts as a switch that assigns the queried data samples to their appropriate classifiers, which have been specifically trained to handle those particular flow patterns. The labeled pedestrian data is used to train this switch classifier. The same number of training data samples, about 2000 each, was used for all three classes to obviate bias towards any particular class. Further, each of the classes contains the same proportion of the different motions merged within it. For example, the left class contains the same number of samples for strict left motion, left front at 45° and left back at 45°. Figure 3 shows the class confusion matrix for the learned model. 348 support vectors were chosen by the model, which is less than 6% of the number of training data samples, indicating a well-generalized classifier.
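A minimal sketch of the switch classifier, assuming scikit-learn's linear SVM as the solver (the paper does not name an implementation):

```python
import numpy as np
from sklearn.svm import LinearSVC  # assumed solver

def train_switch(X, labels):
    """X: (n_samples, 256) flow features; labels in {'left', 'right', 'frontal'}."""
    clf = LinearSVC(C=1.0)          # one-vs-rest linear three-class SVM
    clf.fit(X, labels)
    return clf

def global_motion(clf, feat):
    """Route a candidate sub-window to its motion-specific classifier."""
    return clf.predict(feat.reshape(1, -1))[0]
```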
Fig. 3. Class confusion matrix for estimating the global motion direction using the three-class linear SVM classifier:

             Frontal   Right   Left
  Frontal     0.964    0.023   0.013
  Right       0.022    0.978   0.00
  Left        0.019    0.00    0.981
Fig. 4. Magnitude of the mean and the first two eigenflow vectors of the horizontal optical flow for the training pedestrian data
The trained switch classifier is used to allocate non-pedestrian data to each of the motion classes for training the discriminative motion-specific classifiers. Out of 120,000 data samples, about 75,000 were classified as belonging to the frontal motion class, 25,000 were categorized as left motion, and the remaining 20,000 as right motion.
3.2 Learning the Discriminative Classifiers
This section describes the learning procedure to train the discriminative motion-specific classifiers. In total, three separate classifiers are learnt, one for each global motion. The learning process is the same for all of them. Hence, for the sake of clarity, the motion-specific term is dropped in this section, and whenever pedestrian and non-pedestrian data are mentioned, they refer to the data belonging to a particular global motion, unless stated otherwise. It is worth mentioning that the symmetry of the left and right classifiers can be exploited by training the classifier for one and using its mirror image (after changing the sign of the horizontal motion) for the other.

Weak Classifier. Principal Component Analysis was done separately on the pedestrian and non-pedestrian data to obtain the eigenvectors of the optical flow, known as eigenflow [10]. Figure 4 shows the magnitude of the mean and the first two u-flow eigenvectors for each of the three global motions. As is evident, the mean flows represent the global motion, while the eigenflow vectors capture the poses and the articulation of the human body, especially the movement of the limbs. For the frontal motion, the mean is not that informative since it contains both forward and backward moving pedestrians. Using all the eigenflow vectors, 256 for each of the pedestrian and non-pedestrian data, we have a total of 512 eigenflow vectors that act as a pool of features for AdaBoost. Taking the magnitude of the correlation between the training
Table 1. Feature selection and training the AdaBoost classifier

– Given the training data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ where $x_i$ is the eigenflow and $y_i$ is 0 for non-pedestrian and 1 for pedestrian examples.
– Initialize the weights $w_{1,i} = \frac{1}{2l}, \frac{1}{2m}$ for $y_i = 0, 1$ respectively, where $l$ and $m$ are the number of pedestrian and non-pedestrian examples.
– For $t = 1, \ldots, T$:
  1. Normalize the weights $w_{t,i} \leftarrow \frac{w_{t,i}}{\sum_{j=1}^{n} w_{t,j}}$
  2. Select the best weak classifier $h_t$ with respect to the weighted error: $\epsilon_t = \min_j \sum_i w_i\, |h_j(x_i) - y_i|$
  3. Update the weights: $w_{t+1,i} = w_{t,i}\,\beta_t^{1-e_i}$, where $e_i = 0$ if example $x_i$ is correctly classified by $h_t$, $e_i = 1$ otherwise, and $\beta_t = \frac{\epsilon_t}{1 - \epsilon_t}$.
– The strong classifier is given by:
$$C(x) = \begin{cases} 1, & \text{if } \sum_{t=1}^{T}\alpha_t h_t(x) \geq \frac{1}{2}\sum_{t=1}^{T}\alpha_t \\ 0, & \text{otherwise,} \end{cases} \quad (2)$$
where $\alpha_t = \log\frac{1}{\beta_t}$.
data $x$ and an eigenflow vector $z_j$, and finding the optimum threshold $\theta_j$ that minimizes the overall classification error, yields a weak classifier $h_j$:

$$h_j(x) = \begin{cases} 1, & \text{if } |x^T z_j| \gtrless \theta_j \\ 0, & \text{otherwise.} \end{cases} \quad (1)$$

Feature Selection and AdaBoost. The procedure to choose the most discriminative of the weak classifiers, as illustrated in Fig. 5(a), is motivated by the face detection algorithm proposed in [11]. Table 1 describes the complete algorithm. The final strong classifier is a weighted vote of the weak classifiers (Eq. (2)). Figures 5(b), (c) and (d) depict the horizontal component of the two eigenflow features selected by this algorithm for each of the global motion subsets. The selection of the most discriminative vectors follows a similar trend in all three cases. While the first one responds to motion near the boundary, the second one captures the motion within the window. It is also interesting to note the pattern at the bottom of the first eigenflow vectors - those belonging to the right and left subsets take into account the spread of the legs in lateral motion, while the one for the frontal motion restricts any such articulation. Individually, they may perform poorly, but in combination they can perform much better. Table 2 juxtaposes the false positive rate (FPR) of the GLMPC classifier with two other classifiers for a fixed detection rate of 98%. The first one is the linear SVM classifier, which is clearly outperformed in both speed and accuracy. 13,313 support vectors were chosen by the linear SVM, which is more than 50% of the
Fig. 5. (a) Feature selection using AdaBoost. (b), (c) and (d) Two u-eigenflow vectors selected by AdaBoost for the Right, Left and Frontal subsets, respectively.

Table 2. False positive rate of the different classifiers at a detection rate of 98%

                        SVM    LMPC   GLMPC
  False Positives (%)   62.3   1.16   0.74
training data, an indication of a poorly generalized classifier. Besides, such a high number of support vectors would result in about 1.3 million dot products per frame, assuming 100 candidate sub-windows in a frame. On the other hand, classification using GLMPC requires only 348 dot products for the three-class SVM switch and 35 dot products for the AdaBoost cascade (full cascade in the worst case). The other classifier considered for comparison is the degenerate case of the proposed algorithm, which we refer to as the Local Motion Pattern Classifier (LMPC) [6], when all the pedestrian data is considered as one single class. GLMPC provides a reduction of 37% in FPR, which is further amplified by the fact that there may be hundreds of candidate sub-windows in a frame.

Cascade of AdaBoost Classifiers. In general, in any scene, flow patterns that share no resemblance with human motion should be discarded quickly, while those that share greater similarity require more complex analysis. A cascade of AdaBoost classifiers [11] can achieve this. The early stages in the cascade have a smaller number of weak classifiers and hence aren't very discriminative, but are really fast at classification. The later stages consist of more complex classifiers with a larger number of weak classifiers. To be labeled as a detection, a candidate data sample has to pass through all the stages. Hence, the classifier spends most of its time analyzing difficult motion patterns and rejects easy ones quickly. In our implementation, there are two stages in the cascade for each of the global motion classifiers. The same pedestrian data was used across all stages. For training the classifier, the ratio of pedestrian to non-pedestrian data (for both training and test data) was kept at one for the left and right motion subsets and 0.5 for the frontal motion. Non-pedestrian data for the next stage in the cascade is generated by collecting the false positives after running the existing classifier on different videos taken from both static and moving cameras. The
final frontal classifier has 5 weak learners in the first stage and 20 in the second. The corresponding numbers for the right and the left motion classifiers are 10 and 25, and 10 and 20 respectively.
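Putting Eq. (1), Eq. (2) and the cascade together, a hedged test-time sketch could look like the following; the data layout and the fixed inequality direction in the weak classifier are our simplifications.

```python
import numpy as np

def weak_response(x, z, theta):
    """Eigenflow weak classifier h_j of Eq. (1): threshold on |x^T z_j|."""
    return 1 if abs(x @ z) > theta else 0

def strong_response(x, weak_params, alphas):
    """Strong classifier C(x) of Eq. (2): weighted vote of the weak classifiers."""
    score = sum(a * weak_response(x, z, th) for a, (z, th) in zip(alphas, weak_params))
    return score >= 0.5 * sum(alphas)

def cascade_detect(x, stages):
    """A window must pass every (weak_params, alphas) stage, cheap to complex."""
    return all(strong_response(x, wp, al) for wp, al in stages)
```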
4 Experiments
For detecting human motion patterns, the dense optical flow image is searched with sub-windows of different scales, seven in total. Every scale also has an associated step size. Naturally, larger sub-windows have bigger step sizes to prevent redundancy due to excessive overlap between neighboring sub-windows. Knowing the camera orientation a priori can greatly reduce the search space, since pedestrians need to be looked for only on the ground plane. Exploiting such information reduced the total number of scanned sub-windows in the image by almost half. Finally, only the candidate sub-windows that satisfy the minimum flow thresholds are resized and normalized before being fed to the classifier. Again, these thresholds vary with the scale, as larger sub-windows search for nearby pedestrians, which should appear to move faster due to parallax. Figure 6 depicts the detection results of the linear SVM, LMPC and GLMPC classifiers after the first stage in the cascade. The overlapping windows have not been merged, to show all the detected sub-windows. As is evident, GLMPC is able to localize the pedestrians much better than either of the two other methods and, in addition, gives fewer false positives. The full-cascade GLMPC classifier was tested for pedestrian patterns in different test videos and works at 2 fps on a Core 2 Duo 2 GHz PC. Figure 7 shows some of the relevant results. The algorithm was tested with multiple moving pedestrians in the presence of other moving objects, mainly cars, and is able to detect humans in different poses and moving at different paces (Fig. 7(a)). Occluding objects can lead to false rejections, since the flow in the concerned sub-window doesn't conform to the pedestrian motion. This is evident in the
Fig. 6. Comparison of the performance of the GLMPC classifier with linear SVM and LMPC after Stage 1 in the cascade (panels: (a) SVM, (b) LMPC, (c) GLMPC). Color coding - white if the direction is not known, red for right-moving pedestrians, yellow for left and black for frontal motion.
Fig. 7. Final detection results (panels (a)-(d)) without merging the overlapping detections
second image in Fig. 7(a). Stationary and far-off pedestrians that are moving very slowly can also be missed owing to their negligible optical flow. The system is also robust to illumination changes (Fig. 7 (b)) and can detect moving children (Fig. 7(c)) even though the training data was composed of only adult pedestrians. Moreover, notice the panning of the camera over time in the image sequence, illustrating the robustness of the system towards small camera motion. The videos captured from a slow moving car were also tested and the system still manages to detect pedestrians (Fig. 7 (d)).
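The multi-scale scan described at the start of this section can be sketched as below, reusing the helpers from the earlier sketches; the scale list, step sizes and minimum-flow thresholds are illustrative placeholders only.

```python
import numpy as np

SCALES  = [(32, 16), (48, 24), (64, 32)]   # (height, width), placeholder values
STEP    = {32: 4, 48: 6, 64: 8}            # bigger windows take bigger steps
MINFLOW = {32: 0.5, 48: 1.0, 64: 1.5}      # nearer pedestrians appear to move faster

def scan(u, v, switch, cascades):
    """Yield detections (y, x, h, w, direction) over a dense flow field (u, v)."""
    H, W = u.shape
    for h, w in SCALES:
        s = STEP[h]
        for y in range(0, H - h, s):
            for x in range(0, W - w, s):
                win_u, win_v = u[y:y+h, x:x+w], v[y:y+h, x:x+w]
                if np.abs(win_u).mean() + np.abs(win_v).mean() < MINFLOW[h]:
                    continue                       # not enough motion in this window
                feat = flow_feature(win_u, win_v)  # from the earlier sketch
                d = global_motion(switch, feat)    # three-class SVM switch
                if cascade_detect(feat, cascades[d]):
                    yield y, x, h, w, d
```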
5 Discussion
A novel learning strategy to detect moving pedestrians in videos using motion patterns was introduced in this paper. Instead of considering all human motion patterns as one class, they were split into three meaningful subsets dictated by the global motion direction. A cascade of AdaBoost classifiers with the most discriminative eigenflow vectors was learnt for each of these global motion
subsets. Further, a linear three-class SVM classifier was trained that acts as a switch to decide which Adaboost classifier to choose to determine if a pedestrian is contained in the candidate sub-window. It was shown that the proposed algorithm is far superior to the linear SVM and provides an improvement of 37% in FPR as compared to LMPC. Moreover, the proposed system has been shown to be robust to slow illumination changes, camera motion and can even detect children. Apart from conspicuous advantages of accuracy, GLMPC allows for extensibility to incorporate new pedestrian motion like jumping without retraining the whole classifier again. Only a couple of changes would be required. The first would be to retrain the motion switch multi-class SVM classifier to take into account the new motion type. The next would be to train a new AdaBoost classifier to discriminate between the jumping motion of the pedestrians and other kinds of motions. The already trained classifiers for left, right and frontal motion can be used in their original form. An important area of research for the future work would be to compute the ROC curve for the classifiers like GLMPC that don’t have a single global threshold. Work on similar lines has been done by Xiaoming et al. [10].
References

1. Oren, M., Papageorgiou, C., Sinha, P., Osuna, E., Poggio, T.: Pedestrian Detection Using Wavelet Templates. CVPR, 193–199 (1997)
2. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. CVPR 1, 886–893 (2005)
3. Viola, P., Jones, M., Snow, D.: Detecting Pedestrians Using Patterns of Motion and Appearance. ICCV 2, 734–741 (2003)
4. Fablet, R., Black, M.J.: Automatic Detection and Tracking of Human Motion with a View-Based Representation. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2352, pp. 476–491. Springer, Heidelberg (2002)
5. Sidenbladh, H.: Detecting Human Motion with Support Vector Machines. ICPR 2, 188–191 (2004)
6. Goel, D., Chen, T.: Real-time Pedestrian Detection Using Eigenflow. In: IEEE International Conference on Image Processing, IEEE Computer Society Press, Los Alamitos (2007)
7. http://www.cbsr.ia.ac.cn/Databases.htm
8. Bruhn, A., Weickert, J., Schnörr, C.: Lucas/Kanade Meets Horn/Schunck: Combining Local and Global Optic Flow Methods. IJCV 61, 211–231 (2005)
9. Bruhn, A., Weickert, J., Kohlberger, T., Schnörr, C.: A Multigrid Platform for Real-Time Motion Computation with Discontinuity-Preserving Variational Methods. IJCV 69, 257–277 (2006)
10. Liu, X., Chen, T., Kumar, B.V.: Face Authentication for Multiple Subjects Using Eigenflow. Pattern Recognition 36, 313–328 (2003)
11. Viola, P., Jones, M.: Rapid Object Detection Using a Boosted Cascade of Simple Features. CVPR (2001)
Qualitative and Quantitative Behaviour of Geometrical PDEs in Image Processing Arjan Kuijper Radon Institute for Computational and Applied Mathematics, Linz, Austria
Abstract. We analyse a series of approaches to evolve images. It is motivated by combining Gaussian blurring, the Mean Curvature Motion (used for denoising and edge-preserving), and maximal blurring (used for inpainting). We investigate the generalised method using the combination of second order derivatives in terms of gauge coordinates. For the qualitative behaviour, we derive a solution of the PDE series and mention its properties briefly. Relations with general diffusion equations are discussed. Quantitative results are obtained by a novel implementation whose stability and convergence is analysed. The practical results are visualised on a real-life image, showing the expected qualitative behaviour. When a constraint is added that penalises the distance of the results to the input image, one can vary the desired amount of blurring and denoising.
1 Introduction
Already in the early years of image analysis the Gaussian filter played an important role. As a side effect of Koenderink's observation that this filter relates to human observation due to the causality principle [1], it opened the way for the application of diffusion processes in image analysis. This is due to the fact that the Gaussian filter is the Green's function of the heat equation, a linear partial differential equation (PDE). Because of its linearity, details are blurred during evolution. Therefore, various non-linear PDEs were developed to analyse and process images. A desirable aspect in the evolution of images is independence of the Cartesian coordinate system, by choosing one that relates directly to image properties. One can think of the famous Perona-Malik equation [2] using edge strength. Using such so-called gauge coordinates, Alvarez et al. derived the Mean Curvature Motion [3] by blurring only along edges. On the other hand, the opposite approach can be used in inpainting [4,5]: blurring perpendicular to edges. Perhaps surprisingly, when combining these two methods one obtains the heat equation (see Section 2). In this paper we propose a series of PDEs obtained by a parameterised linear combination of these two approaches. By doing so, one is able to influence the
This work was supported by FFG, through the Research & Development Project ‘Analyse von Digitaler Bilder mit Methoden der Differenzialgleichungen’, and the WWTF ‘Five senses-Call 2006’ project ‘Mathematical Methods for Image Analysis and Processing in the Visual Arts’.
evolution of the methods discussed above by adjusting the parameters. This relates to increasing or decreasing blurring that is locally either tangent or normal to isophotes. Although one cannot obtain a filter as the Green's function for the general case, solutions give insight into the qualitative behaviour of the PDE. This is done in Section 2. Relations with general diffusion processes are also given. The PDEs need a stable numerical implementation, which depends on the parameters. In Section 3 a novel numerical scheme is given, including a stability analysis. This scheme allows larger time steps than conventional finite difference schemes, and remains stable at corner points, in contrast to standard finite difference schemes.
2 Geometric PDEs: Second Order Gauge Derivatives
An image can be thought of as a collection of curves with equal value, the isophotes. Most isophotes are non-self-intersecting. At extrema an isophote reduces to a point; at saddle points the isophote is self-intersecting. At the non-critical points gauge coordinates $(v, w)$ (or $(T, N)$, or $(\xi, \eta)$, or ...) can be chosen [6,7,8]. Gauge coordinates are locally set such that the $v$ direction is tangent to the isophote and the $w$ direction points in the direction of the gradient vector. Consequently, $L_v = 0$ and $L_w = \sqrt{L_x^2 + L_y^2}$. Of interest are the following second order structures:

$$L_{vv} = \frac{L_x^2 L_{yy} + L_y^2 L_{xx} - 2 L_x L_y L_{xy}}{L_x^2 + L_y^2} \quad (1)$$

$$L_{ww} = \frac{L_x^2 L_{xx} + L_y^2 L_{yy} + 2 L_x L_y L_{xy}}{L_x^2 + L_y^2} \quad (2)$$

These gauge derivatives can be expressed as a product of gradients and the Hessian matrix $H$ of second order derivatives:

$$L_{ww} L_w^2 = \nabla L \cdot H \cdot \nabla^T L \quad (3)$$
$$L_{vv} L_w^2 = \nabla L \cdot \tilde{H}^{-1} \cdot \nabla^T L, \quad (4)$$

with $\nabla L = (L_x, L_y)$, $H$ the Hessian matrix, and $\tilde{H}^{-1} = \det H \cdot H^{-1}$. Note that the expressions are invariant with respect to the spatial coordinates. Combining the two different expressions for the second order derivatives in gauge coordinates, Eqs. (1)-(2), yields

$$L_t = p L_{vv} + q L_{ww}. \quad (5)$$
Several parameter settings have relations to PDEs and histogram operations [9]:
– (p, q) = (1, 1): Gaussian scale space [1], repeated infinitesimal mean filtering,
– (p, q) = (1, 0): Mean Curvature Motion [3,10], repeated infinitesimal median filtering,
– (p, q) = (1, −2): infinitesimal mode filtering,
– (p, q) = (0, 1): maximal blurring, used for inpainting [4].
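A direct, hedged implementation of Eqs. (1)-(5) on a discrete image using Gaussian derivatives; the smoothing scale and the regularizing ε are our choices (Section 3 analyses a different, kernel-based scheme):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gauge_flow(L, p, q, sigma=1.0, eps=1e-8):
    """Right-hand side p*L_vv + q*L_ww of Eq. (5) for an image L."""
    Lx  = gaussian_filter(L, sigma, order=(0, 1))   # derivative along axis 1 = x
    Ly  = gaussian_filter(L, sigma, order=(1, 0))   # derivative along axis 0 = y
    Lxx = gaussian_filter(L, sigma, order=(0, 2))
    Lyy = gaussian_filter(L, sigma, order=(2, 0))
    Lxy = gaussian_filter(L, sigma, order=(1, 1))
    w2  = Lx**2 + Ly**2 + eps                           # L_w^2, regularized
    Lvv = (Lx**2*Lyy + Ly**2*Lxx - 2*Lx*Ly*Lxy) / w2    # Eq. (1)
    Lww = (Lx**2*Lxx + Ly**2*Lyy + 2*Lx*Ly*Lxy) / w2    # Eq. (2)
    return p * Lvv + q * Lww
```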
2.1 A General Solution

In this section we derive the general solution for Eq. (5). As is the case with gauge coordinates, it is assumed that the solution is independent of direction, size, dimension, and orientation. Therefore the dimensionless variable $\xi = \frac{x^2+y^2}{t}$ is used. Second, an additional $t$-dependency is assumed. This is inspired by the observation that the solution for $p = q = 1$ (the Gaussian filter) contains the factor $t^{-1}$. So the starting assumption is $L(x, y; t) = t^n f(\xi)$, and equation (5) becomes

$$t^{n-1}\left(-n f(\xi) + (2(p+q) + \xi) f'(\xi) + 4 q \xi f''(\xi)\right) = 0. \quad (6)$$

The solution of this ODE with respect to $f(\xi)$ and $\xi$ is given by

$$f(\xi) = e^{-\frac{\xi}{4q}}\left(c_1\, U\!\left(n + \tfrac{p+q}{2q}, \tfrac{p+q}{2q}, \tfrac{\xi}{4q}\right) + c_2\, L^{\frac{p-q}{2q}}_{-\frac{p+2nq+q}{2q}}\!\left(\tfrac{\xi}{4q}\right)\right) \quad (7)$$

Here $U(a, b, z)$ is a confluent hypergeometric function and $L^b_a(z)$ is the generalised Laguerre polynomial [11]. Taking $r = \frac{p+q}{2q}$, we find

$$L(x, y; t) = e^{-\frac{x^2+y^2}{4qt}}\, t^n \left(c_1\, U\!\left(n + r, r, \tfrac{x^2+y^2}{4qt}\right) + c_2\, L^{r-1}_{-n-r}\!\left(\tfrac{x^2+y^2}{4qt}\right)\right). \quad (8)$$

The formula reduces dramatically for $n = -r$, since $U(0, \cdot, \cdot) = L^{\cdot}_0(\cdot) = 1$. This gives the following positive solutions of Eq. (5):

$$L(x, y; t) = \frac{t^{-\frac{p+q}{2q}}}{4\pi q}\, e^{-\frac{x^2+y^2}{4qt}} \quad (9)$$

The simplified diffusion $(p, q) = (b-1, 1)$,

$$L_t = L_{ww} + (b-1) L_{vv} \quad (10)$$

has solution

$$L(x, y; t) = \frac{1}{4\pi t^{b/2}}\, e^{\frac{-x^2-y^2}{4t}}. \quad (11)$$

Qualitatively these types of flows are just a rescaling of standard Gaussian blurring, albeit that the linearity between subsequent images in a sequence with increasing scale $t$ is lost. Only for $b = 2$ is the filter linear, resulting in the Gaussian filter. For $b = 1$ one obtains maximal blurring. Note that a solution can only be obtained when $q \neq 0$. This implies that the $L_{ww}$ direction (i.e. blurring) must be present in the flow. Solutions for the pure $L_{vv}$ flow - mean curvature motion - are given by $L(x, y; t) = L\left(\sqrt{x^2 + y^2 + 2t}\right)$, which is not dimensionless.
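For reference, the closed-form solutions (9) and (11) written out numerically (a convenience for inspection, not part of the paper):

```python
import numpy as np

def solution_eq9(x, y, t, p, q):
    """Positive solution (9) of L_t = p L_vv + q L_ww (requires q > 0)."""
    return (t ** (-(p + q) / (2.0 * q)) / (4.0 * np.pi * q)
            * np.exp(-(x**2 + y**2) / (4.0 * q * t)))

def solution_eq11(x, y, t, b):
    """Solution (11) of the simplified diffusion L_t = L_ww + (b-1) L_vv."""
    return np.exp(-(x**2 + y**2) / (4.0 * t)) / (4.0 * np.pi * t ** (b / 2.0))
```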
2.2 Nonlinear Diffusion Filtering

The general diffusion equation [12] reads

$$L_t = \nabla \cdot (D \cdot \nabla L). \quad (12)$$
The diffusion tensor $D$ is a positive definite symmetric matrix. With $D = 1$ (when $D$ is considered a scalar, i.e. an isotropic flow), or better, $D = I_n$, we have Gaussian scale space. When $D$ depends on the image, the flow is nonlinear, e.g. in the case of the Perona-Malik equation [7,2] with $D = k^2/(k^2 + \|\nabla L\|^2)$. For $D = L_w^{p-2}$ we have the p-Laplacian [13,14]. To force the equality Eq. (5) = Eq. (12),

$$\nabla \cdot (D \cdot \nabla L) = p L_{vv} + q L_{ww}, \quad (13)$$

$D$ must be a matrix that is dimensionless and that contains only first order derivatives. The most obvious choice for $D$ is $D = \nabla L \cdot \nabla^T L / L_w^2$. This yields, perhaps surprisingly, the Gaussian scale space solution. This is, in fact, the only possibility, as one can verify.
2.3 Constraints
An extra condition may occur in the presence of noise (assume zero mean, variance $\sigma^2$):

$$I = \frac{1}{2}\int_\Omega (L - L_0)^2\, d\Omega = \sigma^2 \quad (14)$$

where $L_0$ is the input image and $L$ the denoised one. The solution of $\min E$ s.t. $I$ is obtained by the Euler-Lagrange equation $\delta E + \lambda \delta I = 0$ with $\delta I = L - L_0$, $\lambda = \frac{\langle \delta E, \delta I\rangle}{\langle \delta I, \delta I\rangle}$, and $\langle \delta I, \delta I\rangle = 2\sigma^2$ (see Eq. (14)). The solution can be reached by a steepest descent evolution $L_t = -(\delta E + \lambda \delta I)$. When we set $\lambda = 0$, an unconstrained blurring process is obtained. Alternatively, $\lambda$ can be regarded as a penalty parameter that limits the $L_2$ difference between the input and output images. A too small value will cause an evolution that forces the image to stay close to the input image.
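One explicit steepest-descent step of the constrained evolution might be sketched as below, reusing gauge_flow() from the earlier sketch for $-\delta E$; the time step is a placeholder.

```python
import numpy as np

def constrained_step(L, L0, p, q, dt, sigma):
    """One step of L_t = -(dE + lambda*dI) with dE = -(p L_vv + q L_ww), dI = L - L0."""
    flow = gauge_flow(L, p, q)                      # -dE
    dI = L - L0
    lam = -np.sum(flow * dI) / (2.0 * sigma**2)     # lambda = <dE, dI> / <dI, dI>
    return L + dt * (flow - lam * dI)
```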
3 Numerical Implementation

The PDE is implemented using Gaussian derivatives [6,15]. As a consequence, larger time steps can be taken. When the spatial derivatives are computed as a convolution ($\ast$) of the original image $L$ with derivatives of a Gaussian $G$, the following results hold: $(G \ast L)_x = (\frac{-x}{2t} G) \ast L$, $(G \ast L)_y = (\frac{-y}{2t} G) \ast L$, $(G \ast L)_{xx} = (\frac{x^2 - 2t}{4t^2} G) \ast L$, $(G \ast L)_{xy} = (\frac{xy}{4t^2} G) \ast L$, and $(G \ast L)_{yy} = (\frac{y^2 - 2t}{4t^2} G) \ast L$. Consequently,

$$(G \ast L)_{xx} + (G \ast L)_{yy} = \left(\frac{x^2 + y^2 - 4t}{4t^2}\, G\right) \ast L \quad (15)$$

$$(G \ast L)_{vv} = \left(\frac{-1}{2t}\, G\right) \ast L \quad (16)$$

$$(G \ast L)_{ww} = \left(\frac{x^2 + y^2 - 2t}{4t^2}\, G\right) \ast L \quad (17)$$

Then we have

$$p L_{vv} + q L_{ww} = \left(\frac{q(x^2 + y^2) - 2t(p+q)}{4t^2}\, G\right) \ast L \quad (18)$$

and Eq. (5) is numerically computed by

$$\frac{L^{n+1}_{j,k} - L^n_{j,k}}{\Delta t} = \left(\frac{q(x^2 + y^2) - 2t(p+q)}{4t^2}\, G\right) \ast L^n_{j,k} \quad (19)$$

where $L^n_{j,k} = \xi^n e^{i(jx+ky)}$ is the Von Neumann solution. The double integral on the right hand side of Eq. (19) reads

$$\int\!\!\int \frac{q\left((\alpha - x)^2 + (\beta - y)^2\right) - 2t(p+q)}{16\pi t^3}\, \xi^n\, e^{i(j\alpha + k\beta) - \frac{(\alpha - x)^2 + (\beta - y)^2}{4t}}\, d\alpha\, d\beta$$

and evaluates to $-\frac{1}{2t}\left(p + q\left(2t(j^2 + k^2) - 1\right)\right)\xi^n e^{-t(j^2+k^2) + i(xj+ky)}$, which equals

$$-\frac{1}{2t}\left(p + q\left(2t(j^2 + k^2) - 1\right)\right) e^{-t(j^2+k^2)} \cdot L^n_{j,k} = \Psi \cdot L^n_{j,k}. \quad (20)$$

Consequently, after dividing by $L^n_{j,k}$ ($\neq 0$!), Eq. (19) reduces to

$$\xi - 1 = \Delta t \cdot \Psi \quad (21)$$

For stability we require $|\xi| \leq 1$, so $|\Delta t \cdot \Psi + 1| \leq 1$. The minimum of $\Psi$ is obtained from $\partial_j \Psi = 0$, $\partial_k \Psi = 0$, i.e. $t = \frac{-p + 3q}{2q(j^2 + k^2)}$, yielding the value $\Psi_{\min} = \frac{-q}{t}\, e^{\frac{p - 3q}{2q}}$. For the maximum step size we find $\xi_{\max} = \frac{2t}{q}\, e^{-\frac{p - 3q}{2q}}$. Obviously, as the implementation is based on the solution of the heat equation, the maximum step size is limited by the case $p = q = 1$, i.e. $\xi_{\max} \leq 2te$. So for the $L_{ww}$ flow ($p = 0$, $q = 1$) the step size $2te^{3/2}$ would yield instabilities. Secondly, for the $L_{vv}$ flow ($p = 1$, $q = 0$), $\Psi$ reduces to $-\frac{e^{-t(j^2+k^2)}}{2t}\, p$. The minimum is obtained at $(j, k) = (0, 0)$, which obviously makes no sense, as the Von Neumann solution then simplifies to $\xi^n$. We can therefore assume $j^2 + k^2 \geq 1$. Then $\Psi_{\min} = \frac{-1}{2t} e^{-t}$ and the maximum step size is $\min\{4te^t, 2te\}$.
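A hedged sketch of the scheme of Eq. (19): build the kernel of Eq. (18) once and apply explicit Euler steps; the kernel radius, scale t and step size are placeholder choices that must respect the bounds derived above.

```python
import numpy as np
from scipy.signal import fftconvolve

def gauge_kernel(p, q, t, radius=None):
    """Kernel of Eq. (18): ((q(x^2+y^2) - 2t(p+q)) / (4 t^2)) * G, with sigma^2 = 2t."""
    if radius is None:
        radius = int(4.0 * np.sqrt(2.0 * t)) + 1       # roughly 4 sigma
    x, y = np.meshgrid(np.arange(-radius, radius + 1),
                       np.arange(-radius, radius + 1))
    G = np.exp(-(x**2 + y**2) / (4.0 * t)) / (4.0 * np.pi * t)
    return (q * (x**2 + y**2) - 2.0 * t * (p + q)) / (4.0 * t**2) * G

def evolve(L, p, q, t=0.32, dt=1.0, steps=10):
    """Explicit Euler steps of Eq. (19) on an image L."""
    K = gauge_kernel(p, q, t)
    for _ in range(steps):
        L = L + dt * fftconvolve(L, K, mode='same')
    return L
```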
3.1 An Alternative Approach
Niessen et al. [15, p. 196] used $\frac{\nabla L}{\|\nabla L\|} = (\cos\theta, \sin\theta)$ to derive a maximal time step of $2et$ for the $L_{vv}$ flow. Here we follow their line of reasoning for the more general Eq. (5). Firstly, the derivatives become

$$G_{vv} = \cos^2(\theta) \ast G_{yy} + \sin^2(\theta) \ast G_{xx} - 2\sin(\theta)\cos(\theta) \ast G_{xy} \quad (22)$$

$$G_{ww} = \cos^2(\theta) \ast G_{xx} + \sin^2(\theta) \ast G_{yy} + 2\sin(\theta)\cos(\theta) \ast G_{xy}. \quad (23)$$

Strictly, the Von Neumann stability analysis is only suitable for linear differential equations with constant coefficients. However, we can apply it to equations with variable coefficients by introducing new constant coefficients equal to the frozen values of the original ones at some specific point of interest and testing the modified problem instead. Let $\theta'$ denote $\theta^n_{j,k}$. We then find:

$$G_{vv} = \cos^2(\theta')\,\frac{y^2 - 2t}{4t^2} + \sin^2(\theta')\,\frac{x^2 - 2t}{4t^2} - 2\sin(\theta')\cos(\theta')\,\frac{xy}{4t^2} \quad (24)$$

$$G_{ww} = \cos^2(\theta')\,\frac{x^2 - 2t}{4t^2} + \sin^2(\theta')\,\frac{y^2 - 2t}{4t^2} + 2\sin(\theta')\cos(\theta')\,\frac{xy}{4t^2}. \quad (25)$$

Numerically, with $L^n_{j,k}$ as above, we derive

$$(L^n_{j,k} \ast G)_{vv} = (j\sin(\theta') - k\cos(\theta'))^2\, e^{-(j^2+k^2)t}\, L^n_{j,k} \quad (26)$$

$$(L^n_{j,k} \ast G)_{ww} = (j\cos(\theta') + k\sin(\theta'))^2\, e^{-(j^2+k^2)t}\, L^n_{j,k}. \quad (27)$$

Since

$$p\,(j\sin(\theta') - k\cos(\theta'))^2 + q\,(j\cos(\theta') + k\sin(\theta'))^2 \leq \max(p, q)(j^2 + k^2) \quad (28)$$

we derive for the stability criterion

$$\xi = 1 - \Delta t\, \max(p, q)(j^2 + k^2)\, e^{-(j^2+k^2)t} \quad (29)$$

where again the optimum is obtained over $s = j^2 + k^2$, yielding $\Delta t \leq \frac{2et}{\max(p, q)}$. This derivation holds for all points $L^n_{j,k}$, and we find the same stability criterion for the $L_{ww}$ and $L_{vv}$ flows and for Gaussian blurring.
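The step-size bounds can be collected in a small helper (our convenience, directly transcribing the formulas above):

```python
import numpy as np

def max_step_gauge(p, q, t):
    """Bound from Sect. 3 for q > 0: (2t/q) exp(-(p-3q)/(2q)), capped by 2et."""
    return min(2.0 * t / q * np.exp(-(p - 3.0 * q) / (2.0 * q)), 2.0 * np.e * t)

def max_step_frozen(p, q, t):
    """Bound from the frozen-coefficient analysis (Eq. 29): 2et / max(p, q)."""
    return 2.0 * np.e * t / max(p, q)
```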
4 Results
Figure 1 shows two standard shapes used to evaluate the given numerical recipes. Firstly, the results of applying 10 time steps of a finite difference scheme are shown in Figure 2. Clear artifacts can be seen at the corners, due to the directional preference of the first order derivatives. The corner behaving "well" under the $L_{vv}$ flow behaves "badly" under the $L_{ww}$ flow, and vice versa. Secondly, the Gaussian derivative implementations for the disk and square are shown in Figures 3-4. The scale is chosen as $\sigma = 0.8$, so $\Delta t = 2et = 2e\cdot\frac{1}{2}\sigma^2 = 1.74$. The predicted critical scale for the $L_{vv}$ flow is $4te^t = 1.28\, e^{0.32} = 1.76$. One clearly
Fig. 1. Disk and square with values 0, 1, with uniform random noise on (0,1), and the results of a Gaussian filter at σ = √128, i.e. t = 64
Fig. 2. Results of 10 time steps in a finite difference scheme for, from left to right, the Lvv flow for the noisy disk and square, and the Lww flow for these images
[Fig. 3 panels: time steps 1.83, 1.78, 1.73, 1.68, 1.64 for each of the three flows]
Fig. 3. Results of the noisy disk for $L_{vv}$ flow (top row), Gaussian flow (middle row), and $L_{ww}$ flow (bottom row), for various time step ranges around the critical value $\Delta t = 2et = 2e\cdot\frac{1}{2}\sigma^2 = 1.74$
timestep 1.77778
timestep 1.72973
timestep 1.68421
timestep 1.64103
timestep 1.82857
timestep 1.77778
timestep 1.72973
timestep 1.68421
timestep 1.64103
timestep 1.82857
timestep 1.77778
timestep 1.72973
timestep 1.68421
timestep 1.64103
Fig. 4. Results of the noisy square for $L_{vv}$ flow (top row), Gaussian flow (middle row), and $L_{ww}$ flow (bottom row), for various time step ranges around the critical value $\Delta t = 2et = 2e\cdot\frac{1}{2}\sigma^2 = 1.74$
sees the change around the critical values. Since a relatively large amount of noise is added, the observed value is a bit lower than the predicted one. If too large a time step is taken, instability artifacts are visible: for the $L_{vv}$ flow the results become peaky, the $L_{ww}$ flow shows ringing, and the Gaussian blurring is completely disastrous. Note that the rounding effect for the $L_{vv}$ flow and the peaky results for the $L_{ww}$ flow are intrinsic to these flows.
Fig. 5. Original image and a noisy one, σ = 20
Fig. 6. Geometrical evolution of $L_t = pL_{vv} + qL_{ww}$ for several values of $p$ and $q$. The noise variance σ is set to 20. The result satisfies the noise constraint up to an error of $10^{-7}$.
To see the effect of Lt = pLvv + qLww for several values of p and q, Figure 5 is used. The result of applying the Gaussian derivatives implementation in 50 time steps is shown in Figure 6 (with noise constraint) and Figure 7 (without one). As one can see in Figure 6, the choice of p and q enables one to steer between
Fig. 7. Geometrical evolution of Lt = pLvv + qLww for several values of p and q. There is no constraint. For negative p there are spiky artifacts, for positive ones there is blurring. For negative q one sees the edges.
denoising regions and deblurring around edges (where the artifacts occurred). The evolution converges within 50 time steps; the error in the constraint is of order $10^{-7}$. The unconstrained evolution shows spiky artifacts for $p \leq 0$, while $q < 0$ gives the edges. Note that for these values $\Psi$ may become negative and local stability problems may occur. The diagonal gives Gaussian (de)blurring. Visually, $q = 0$ gives the best result, although here the number of time steps heavily influences the results.
5 Summary and Discussion
We presented a line of approaches to evolve images that unifies existing methods in a general framework, via a weighted combination of second order derivatives in terms of gauge coordinates. The series incorporates the well-known Gaussian blurring, Mean Curvature Motion and Maximal Blurring. For the qualitative
behaviour, a solution of the series was derived and its properties were briefly mentioned. Relations with general diffusion equations were given. Quantitative results were obtained by a novel implementation and its stability was analysed. The practical results are visualised on artificial images to study the method in detail, and on a real-life image showing the expected qualitative behaviour. The examples showed that positive values for p and q are indeed necessary to guarantee numerical stability (Fig. 7). Theoretically, this relates to the fact that q < 0 implies deblurring, notoriously ill-posed and unstable. However, when a reasonable constraint is added, this deblurring is possible (Fig. 6). Choosing optimal values of p and q depends on the underlying image and is beyond the scope of this paper.
References 1. Koenderink, J.J.: The structure of images. Biological Cybernetics 50, 363–370 (1984) 2. Perona, P., Malik, J.: Scale space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence 12, 629–639 (1990) 3. Alvarez, L., Lions, P., Morel, J.: Image selective smoothing and edge detection by nonlinear diffusion. SIAM Journal on Numerical Analysis 29, 845–866 (1992) 4. Caselles, V., Morel, J.M., Sbert, C.: An axiomatic approach to image interpolation. IEEE Transactions on Image Processing 7, 376–386 (1996) 5. Bertalmio, M., Vese, L., Sapiro, G., Osher, S.: Simultaneous structure and texture image inpainting. IEEE Transactions on Image Processing 12, 882–889 (2003) 6. Haar Romeny, B.M.t.: Front-end vision and multi-scale image analysis. Kluwer Academic Publishers, Dordrecht, The Netherlands (2003) 7. Aubert, G., Kornprobst, P.: Mathematical Problems in Image Processing: Partial Differential Equations and the Calculus of Variations, 2nd edn. Springer, Heidelberg (2006) 8. Kornprobst, P., Deriche, R., Aubert, G.: Image coupling, restoration and enhancement via PDE’s. In: Proc. Int. Conf. on Image Processing, vol. 4, pp. 458–461 (1997) 9. Griffin, L.: Mean, median and mode filtering of images. Proceedings of the Royal Society Series A 456, 2995–3004 (2000) 10. Yezzi, A.: Modified curvature motion for image smoothing and enhancement. IEEE Transactions on Image Processing 7, 345–352 (1998) 11. Abramowitz, M., Stegun, I.A.: Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th edn. Dover, New York (1972) 12. Weickert, J.A.: Anisotropic Diffusion in Image Processing. Teubner, Stuttgart (1998) 13. Aronsson, G.: On the partial differential equation u2x uxx + 2ux uy uxy + u2y uyy = 0. Arkiv f¨ ur Matematik 7, 395–425 (1968) 14. Kuijper, A.: p-laplacian driven image processing. In: ICIP 2007 (2007) 15. Niessen, W.J., ter Haar Romeny, B.M., Florack, L.M.J., Viergever, M.A.: A general framework for geometry-driven evolution equations. International Journal of Computer Vision 21, 187–205 (1997)
Automated Billboard Insertion in Video Hitesh Shah and Subhasis Chaudhuri Indian Institute of Technology Bombay, Powai, Mumbai 400 076, India
Abstract. The paper proposes an approach to superimpose virtual contents for advertising in an existing image sequence with no or minimal user interaction. Our approach automatically recognizes planar surfaces in the scene over which a billboard can be inserted for seamless display to the viewers. The planar surfaces are segmented in the image frame using a homography dependent scheme. In each of the segmented planar regions, a rectangle with the largest area is located to superimpose a billboard into the original image sequence. It can also provide a viewing index based on the occupancy of the virtual real estate for charging the advertiser.
1
Introduction
Recent developments in computer vision algorithms have paved the way for a novel set of applications in the field of augmented reality [1]. Among these, virtual advertising has gained considerable attention on account of its commercial implications. The objective of virtual advertising is to superimpose computer mediated advertising images or videos seamlessly into the original image sequence so as to give the appearance that the advertisement was part of the scene when the images were taken. It introduces possibilities to capitalize on the virtual space. Conventionally, augmentation of video or compositing has been done by skilled animators by painting 2D images onto each frame. This technique ensures that the final composite is visually credible, but is enormously expensive, and is also limited to relatively simple effects. Current state-of-art methods for introducing virtual advertising broadly fall into three categories. The first category consists of approaches which utilize pattern recognition techniques to track the regions over which the advertisement is to be placed. Patent [2] is an example of such an approach. It depends on human assistance to initially locate the region for placement of billboard which is tracked in subsequent frames using a Burt pyramid. The approaches in this category face problems when the region leaves the field of view and later reappears, requiring complete and accurate re-initialization. Medioni et al. [3] present an interesting approach which addresses this issue, but the approach is limited to substitution of billboards. The second category comprises of the methods which require access to the scene and/or to the equipment prior to filming like in [4,5,6,7]. In these approaches special markers are set up in the scene to identify the places for future billboard insertions. They may also require physical sensors on the camera to track the changes in the view (pan, Y. Yagi et al. (Eds.): ACCV 2007, Part I, LNCS 4843, pp. 240–250, 2007. c Springer-Verlag Berlin Heidelberg 2007
tilt, zoom). However, this renders such approaches incapable of augmenting an existing video. The final category assumes the knowledge about the structure and geometry of the scene, e.g. the system proposed in patent [8] depends on a landmark model describing the set of natural landmarks in a given scene. Similarly, the techniques proposed in [9,10] assume the image sequence to be of a soccer and tennis game, respectively to make the best use of scene property. This makes such a solution very case specific. As opposed to the above methods, the proposed approach automatically locates a possible region for superimposing a billboard and does not require access or strong assumptions on the structure of the scene or equipment. Our approach exploits the inherent constraints introduced due to requirement of mapping of a planar surface (billboard) onto a planar surface (for e.g. wall) in this particular context. Further, we provide a viewing index for the price fixation for the advertisement.
2
Problem Formulation
An arbitrary sequence of n images I1 , ..., In of a scene and the billboard(s) to be placed are given. We need to automatically locate dominant planar regions in the scene and superimpose geometrically corrected billboards over each of the planar regions in each image frame (Figure 1). The scene is assumed to have textured and planar dominant regions which are not occluded in majority of the frames in the sequence. As indoor scene with walls, outdoor scenes of buildings or pre-existing physical billboards are the target for the approach, the requirement of dominant planar region is not at all restrictive.
Fig. 1. Illustration of virtual billboard insertion: (a) is a frame from an input image sequence. The frames after augmentation by placing a virtual advertisement at a planar region in the scene are shown in (b) and (c).
3
Proposed Approach
Our approach consists of three stages: image analysis, reconstruction of a quadrilateral in 3D space and image synthesis. Image analysis stage is responsible for
finding and segmenting the planar surfaces in the scene. It consists of weak calibration, plane fitting, planar segmentation and then locating the largest rectangular area on each of the segmented regions. Back projecting each rectangle on the corresponding planar surface, a projective reconstruction of a quadrilateral in 3D space is obtained. The image synthesis stage maps texture on the quadrilateral with that of the required billboard and performs augmentation by projecting them on each of the given image frames. It also calculates the viewing index for each billboard inserted in the image sequence as a measure of price to be paid by the sponsor. 3.1
Weak Calibration
In weak calibration the structure and motion will be determined up to an arbitrary projective transformation. For this, the interest points in the image sequence, obtained by Harris corner detector are tracked over all the frames using normalized correlation based matching. The tracked interest points are then utilized to solve for the projection matrices P1 , ..., Pn corresponding to each image frame and for recovering the 3D positions of the interest points X1 , ..., Xm with projective ambiguity as explained in Beardsley et al. [11] or by Triggs [12]. In our approach the projection matrices and the recovered positions of the interest points are used to evaluate the homography between image frames. As co-planarity of points is preserved under any projective transformation, a projective reconstruction suffices; updating the reconstruction to affine or Euclidean is not needed to deal with the planar regions. 3.2
Plane Fitting
For recovering the planar surface in the scene, interest points X1 , ..., Xm are divided on the basis of the plane they support. Thus from a point cloud in 3D space, points which are coplanar are to be identified and grouped. Hough transform and RANdom SAmple Consensus (RANSAC) [13] are powerful tools to detect specified geometrical structures among a cluster of data points. However, any one of them when used individually has the following limitations. Accuracy of the parameters recovered using Hough transform is dependent on the bin size. To obtain higher accuracy the bin size has to be smaller implying a large number of bins and thus is computationally more expensive. RANSAC, on the other hand requires many iterations when the fraction of outliers is high and trying all possible combinations can be also computationally expensive. It is able to calculate the parameters for the plane with higher accuracy in reasonable time when it is to fit one instance of the model albeit with a few outliers in the data points, as in our case there might be multiple instances of the model, i.e. plane, in the data it performs poorly on its own. To overcome the above limitations, a Hough transform followed by RANSAC on the partitioned data is adopted for recognizing planes. In the first stage Hough transform with a coarse bin size is utilized to obtain the parameters of the planes. These parameters are then utilized to partition the input points
X1 , ..., Xm into subsets of points belonging to individual planar regions. Each one of these subset of points support a plane whose parameters are calculated using the Hough transform. Note that there will be a number of points which cannot be fit to a planar surface and they should be discarded. Each subset of data forms the input to the RANSAC algorithm which then fits a plane to recover the accurate parameters for the plane. Such an approach can efficiently calculate the equations of planes fitting the data points. Thus at the end of plane fitting operation, equations of the dominant planes in the scene are obtained. In the following subsections we explain the details of Hough transform and RANSAC method used in this study. Data Partitioning. A plane P in XY Z space can be expressed with the following equation: ρ = xsinθcosφ + ysinθsinφ + zcosθ
(1)
where (ρ, θ, φ) helps define a vector from the origin to the nearest point on the plane. This vector is perpendicular to the plane. Thus under the Hough transform each plane in XY Z space is represented by a point in (ρ, θ, φ) parameter space. All the planes passing through a particular point B(xb , yb , zb ) in XY Z space can be expressed with the following equation from eq. (1) ρ = xb sinθcosφ + yb sinθsinφ + zb cosθ.
(2)
Accordingly all the planes that pass through the point B(xb , yb , zb ) can be expressed with a curved surface described by the eq. (2) in (ρ, θ, φ) space. A three dimensional histogram in (ρ, θ, φ) space is set up to find the planes to which a group of 3D data points belong. For each 3D data point B(xb , yb , zb ) ∈ Xi , all histogram bins that the curved surface passes through are incremented. To obtain the parameters of a particular plane a search for the local maxima in the (ρ, θ, φ) space is performed. The top k significant local maxima are obtained in the (ρ, θ, φ) space the input point cloud is divided into k + 1 subsets, each containing the points that satisfy the plane eq. (1) with a certain tolerance. The last subset contains points that do not fit into any of the above k planes. Accurate plane fitting is carried out on each set using RANSAC as explained in the next section. Plane Fitting Using RANSAC. The basic idea of RANSAC method is to compute the free parameters of the model from an adequate number of randomly selected samples. Then all samples vote whether they agree with the proposed hypothesis. This process is repeated until a sufficiently broad consensus is achieved. The major advantage of this approach is its ability to ignore outliers without explicit handling. We proceed as follows to detect a plane in the subsets of points obtained using the previous step: – Choose a candidate plane by randomly drawing three samples from the set. – The consensus on this candidate is measured.
– Iterate the above steps.
– The candidate having the highest consensus is selected.
At the end of the above iterations, equations for k planes π1, ..., πk, corresponding to each subset, are obtained.
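As an illustration, a minimal NumPy sketch of this RANSAC stage, applied to one subset of 3D points produced by the coarse Hough partitioning, is given below. The iteration count and inlier tolerance are assumed values, not taken from the paper.

import numpy as np

def plane_from_points(p1, p2, p3):
    # Plane through three points, as a unit normal n and offset d with n . x = d.
    n = np.cross(p2 - p1, p3 - p1)
    if np.linalg.norm(n) < 1e-12:
        return None
    n = n / np.linalg.norm(n)
    return n, float(n @ p1)

def ransac_plane(points, iters=500, tol=0.01, seed=0):
    # points: (N, 3) array, one Hough-partitioned subset.
    rng = np.random.default_rng(seed)
    best_model, best_count = None, 0
    for _ in range(iters):
        candidate = plane_from_points(*points[rng.choice(len(points), 3, replace=False)])
        if candidate is None:
            continue
        n, d = candidate
        count = int(np.sum(np.abs(points @ n - d) < tol))   # consensus vote
        if count > best_count:
            best_model, best_count = (n, d), count
    return best_model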
3.3
Segmentation of Planar Regions
Having estimated the dominant planar structures in the scene, we now need to segment these regions irrespective of its texture. For a given plane πi , the image frame in which the minimum foreshortening of the plane occurs is selected as the reference image Iref . This ensures maximum visibility of the region on the plane in the image Iref . When any other image frame Iother from the sequence is mapped onto the reference image using a homography for the plane πi , the region of the image frame Iother on the plane πi is pixel aligned with the region of the image on the plane πi in Iref and the rest of it gets misaligned due to non belongingness to the selected plane πi . Figure 2(b) shows the resulting image obtained by applying homography (calculated for the plane coincident with the top of the box) to images in the sequences and then taking the average color at each pixel location over all back projected frames. The pixels in the region on the top of the box in the image frames projected using homography are aligned well with the region of the box top in Iref . Thus in the averaged image the top of the box appears sharp in contrast to the surrounding region. Hence to segment the region on the plane πi in image Iref , at each pixel location a set of color is obtained by mapping equally time spaced image frame (for e.g., every 10th frame) in the sequence using their respective homography for the plane πi . Homography calculation is explained in Appendix A. For each pixel in image Iref lying on the plane, variance of image texture for all re-projected points in the scene at this pixel will be very small as compared to pixels which are not on this planar region due to misalignment. Hence pixel wise variance over all re-projected image frames can be used as a measure to segment the regions in the image on the plane. For each pixel in the image Iref the variance of the above set is calculated and compared against a threshold to obtain a binary segmentation
Fig. 2. Illustration of homography based segmentation: (a) reference image to be segmented. (b) is obtained by projecting and averaging all the frames in the sequence on (a). Notice the extensive blurring of the region not coplanar to the top of the box. (c) Variance measured at each pixel (white represents larger variance). (d) Segmented planar region obtained after performing thresholding on the basis of variance.
Fig. 3. (a), (b) and (c) are the augmented image frames where separate billboards have been placed on two dominant planes which were automatically identified by the approach
of the reference image for the particular planar region. Figure 2(c) represents the per pixel variance of the image frame in figure 2(a). It can be readily seen that the variance of the pixels on the top of the box is lower than that of the surrounding region, which appears white. Finally, figure 2(d) is the segmented planar region obtained by thresholding the variance image. There may be occasional holes in the segmented region, as seen in figure 2(d). Such small regions are filled up using a morphological closing operation. Regions corresponding to each of the planes π1, ..., πk are obtained similarly.
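The segmentation step can be summarized by the following sketch, assuming OpenCV is available. The plane-induced homographies (one per sampled frame, computed as in Appendix A) and the variance threshold are taken as given; the function name and parameter values are illustrative, not the authors' implementation.

import numpy as np
import cv2

def segment_plane_region(ref, frames, homographies, var_thresh=60.0):
    # ref: reference image I_ref (color); frames: equally spaced frames I_other;
    # homographies: 3x3 matrices mapping each frame onto I_ref for the plane pi_i.
    h, w = ref.shape[:2]
    warped = [cv2.warpPerspective(f, H, (w, h)) for f, H in zip(frames, homographies)]
    stack = np.stack([ref] + warped).astype(np.float32)
    # Pixels on the plane stay aligned after warping, so their per-pixel variance is low.
    variance = stack.var(axis=0).mean(axis=-1)        # average over the color channels
    mask = (variance < var_thresh).astype(np.uint8) * 255
    # Fill occasional holes with a morphological closing, as described above.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)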
3.4
Billboard Placement and Augmentation
Having obtained the segmented regions corresponding to each of the dominant planes, the largest inscribed rectangular area within each of them is located using a dynamic programming approach. Billboards are usually rectangular in shape and are horizontally (or vertically) oriented. Hence we try to fit the largest virtual real estate possible in the segmented region. In the absence of any extrinsic camera calibration, it is assumed that the reference frame is vertically aligned. The end points of these rectangles are back-projected, using the projection matrix of the corresponding reference image as explained in Appendix B, onto the corresponding plane to obtain a quadrilateral in 3D space. Each quadrilateral represents a possible planar region for insertion of a billboard in the 3D projective space. The quadrilaterals can be texture mapped [14] with the required advertising material and then projected onto each of the image frames in the sequence using the respective projection matrices. To reduce aliasing artifacts and to increase rendering speed, mipmapping [15] is used for texture mapping. 3.5
Calculation of the Viewing Index
Total viewing index can also be calculated during augmentation for each billboard inserted in the video. The total viewing index for a billboard is directly proportional to the amount of time the billboard is on the screen and is equal to the sum of the viewing index calculated per frame. The per frame viewing index
Fig. 4. (a) Calculated viewing index for billboard on the top and front of the box for each frame. (b) & (c) are the frames with highest viewing indices (encircled in (a)) for billboard on the front and the top, respectively.
depends on the amount of area the billboard is occupying in the image frame as well as the part of the image frame where it appears, i.e. top, middle, bottom, corners as the location matters for advertising purposes. The total viewing index for a particular billboard reflects roughly the amount of impact the billboard has on the viewer. Thus, it can be utilized to develop a fair pricing policy for the sponsor of the advertisement billboard. To calculate viewing index per frame each pixel Pi,j in the image frame is assigned a weight
Weight(Pi,j) = (1 / (2π σx σy)) exp( −(1/2) ( (μx − i)^2 / σx^2 + (μy − j)^2 / σy^2 ) )   (3)
where μx = (height of frame)/2, μy = (width of frame)/2, σx = (height of frame)/6, σy = (width of frame)/6. This selection of parameters assigns higher weights to the pixels in the center of the frame, and the weight slowly decreases as the pixels move away from the center. Also, the selection of σx and σy ascertains that the sum of the weights of all the pixels in a frame is almost equal to unity. The viewing index for the billboard in a frame is then equal to the sum of the weights of each pixel over which it is projected. This ensures that the viewing index is directly proportional to the area of the frame occupied by the billboard and also to the billboard's position in the frame. Figure 4 shows the computed viewing indices for the billboard on the top and front sides of the box calculated in the above manner. Using the viewing index per frame, a total viewing index for a particular billboard can be calculated by summing the viewing index in each frame. It can be observed that the billboard on the top of the box has a higher total viewing index as compared to the one
on the front side. Hence the sponsor of the billboard on the top can be charged relatively higher to account for the larger occupancy of prime virtual real estate.
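A short sketch of the viewing-index computation of Eq. (3) is given below; the billboard mask (the set of pixels the projected billboard covers in a frame) is assumed to be available from the augmentation step, and the function names are illustrative.

import numpy as np

def frame_weights(height, width):
    # Per-pixel weights of Eq. (3): a 2D Gaussian centred on the frame,
    # with sigma = extent / 6 so that the weights sum to roughly one.
    mu_x, mu_y = height / 2.0, width / 2.0
    sg_x, sg_y = height / 6.0, width / 6.0
    i = np.arange(height)[:, None]
    j = np.arange(width)[None, :]
    g = np.exp(-0.5 * (((mu_x - i) / sg_x) ** 2 + ((mu_y - j) / sg_y) ** 2))
    return g / (2.0 * np.pi * sg_x * sg_y)

def viewing_index(billboard_mask):
    # billboard_mask: boolean image, True where the billboard is projected.
    return float(frame_weights(*billboard_mask.shape)[billboard_mask].sum())

# Total index for one billboard: sum of viewing_index over all frames of the sequence.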
4
Experimental Results
The proposed approach has been implemented in MATLAB and the image sequences used for tests have been captured using a hand held camera. Each of the image sequences had 300-700 frames. In our experimentation we used factorization method proposed in [12] for weak calibration. Figure 1 shows the qualitative result of our approach on two image sequences. In the first sequence one dominant plane was detected corresponding to the wall whereas in the second sequence two dominant planes, top and front of the box, were located. Figure 1 (b,c,e,f) are resulting frames with billboard added on one plane and figure 3 shows the altered image frames with separate billboard over the two dominant planes generated by the proposed approach. Videos captured using a mobile phone camera are difficult to augment using existing approaches due to inherent jitters, low frame rate and less spatial resolution. However, the proposed approach is able to insert billboards seamlessly into such videos also. Few results obtained by augmenting videos captured using a mobile phone camera are shown in figure 5.
Fig. 5. Augmented frames from three distinct videos captured using a mobile phone camera are shown in (a,b,c), (d,e,f) respectively. In each of the video one dominant planar region is identified and augmented with a billboard.
5
Conclusion
In this paper we have presented an automated approach for locating planar regions in a scene. The planar regions recovered are used to augment the image sequence with a billboard which may be a still image or a video. One of the
possibilities for the application of the approach is to use it in conjunction with the set top box. Before transmission the video is analyzed for the planar regions in the scene. Information about the identified planar regions is stored as meta data in each frame. At the receiving end before display the set top box can augment each of the image frames with billboards in real time. The billboards in this case may be adaptively selected by the set top box depending on the viewer habits learned by it or the video being shown, e.g. a health video may be augmented with a health equipment advertisement. While, evaluating the results of the current experimentation it is observed that the placement of the billboard is geometrically correct in each of the image frame. No significant drift or jitter has been observed. However, there may be photometric mismatches for the inserted billboard with its surrounding. We are currently looking into the photometric issues related to illumination, shadow and focus correction of the augmented billboard.
References 1. Azuma, R., Baillot, Y., Behringer, R., Feiner, S., Julier, S., MacIntyre, B.: Recent advances in augmented reality. IEEE Comput. Graph. Appl. 21(6), 34–47 (2001) 2. Rosser, R.J., Leach, M.: Television displays having selected inserted indicia. In: US Patent 5,264,933 (2001) 3. Medioni, G., Guy, G., Rom, H., Francois, A.: Real-time billboard substitution in a video stream. In: Proceedings of the 10th Tyrrhenian International Workshop on Digital Communications (1998) 4. Rosser, R., Tan, Y., Kennedy Jr., H., Jeffers, J., DiCicco, D., Gong, X.: Image insertion in video streams using a combination of physical sensors and pattern recognition. In: US Patent 6,100,925 (2000) 5. Wilf, I., Sharir, A., Tamir, M.: Method and apparatus for automatic electronic replacement of billboards in a video image. In: US Patent 6,208,386 (2001) 6. Gloudemans, J.R., Cavallaro, R.H., Honey, S.K., White, M.S.: Blending a graphic. In: US Patent 6,229,550 (2001) 7. Bruno, P., Medioni, G.G., Grimaud, J.J.: Midlink virtual insertion system. In: US Patent 6,525,780 (2003) 8. DiCicco, D.S., Fant, K.: System and method for inserting static and dynamic images into a live video broadcast. In: US Patent 5,892,554 (1999) 9. Xu, C., Wan, K., Bui, S.H., Tian, Q.: Implanting virtual advertisement into broadcast soccer video. In: Advances in Multimedia Information Processing - PCM, vol. 2, pp. 264–271 (2004) 10. Tien, S.C., Chia, T.L.: A fast method for virtual advertising based on geometric invariant-a tennis match case. In: Proc. of Conference on Computer Vision, Graphics, and Image Processing (2001) 11. Beardsley, P.A., Zisserman, A., Murray, D.W.: Sequential updating of projective and affine structure from motion. Int. J. Comput. Vision 23(3), 235–259 (1997) 12. Triggs, B.: Factorization methods for projective structure and motion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 845–851. IEEE Computer Society Press, San Francisco, California, USA (1996) 13. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981)
14. Foley, J.D., van Dam, A., Feiner, S.K., Hughes, J.F.: Computer graphics: principles and practice. Addison-Wesley Longman Publishing Co. Inc., USA (1996) 15. Williams, L.: Pyramidal parametrics. In: SIGGRAPH 1983, pp. 1–11. ACM Press, New York (1983) 16. Hartley, R., Zisserman, A.: Multiple view geometry in computer vision. Cambridge University Press, New York, USA (2000)
A
Homography Calculation
Consider images Im and In with projection matrices given by Pm and Pn , respectively. Let A be any point on the plane π which projects to am and an on the images Im and In , respectively, thus πT A = 0 am = Pm A
(4) (5)
an = Pn A
(6)
where π is a 4x1 vector and A is a 4x1 homogeneous representation of the point. Due to eq. (4), A lies in the null space (NS) of π^T:
A = NS(π^T) C   (7)
where C are the coordinates of A with respect to the basis of the null space of π^T. From eq. (7) and eq. (6),
C = [Pn NS(π^T)]^(−1) an.   (8)
Using eq. (5), eq. (7) and eq. (8),
am = Pm NS(π^T) [Pn NS(π^T)]^(−1) an   (9)
am = Hmn an   (10)
where Hmn = Pm NS(π^T) [Pn NS(π^T)]^(−1) is a 3x3 homography matrix mapping an to am.
B
Back-Projection of a Point on a Plane
Let x be the homogenous representation of the point in the image, which is to back-projected on the plane. Let P be a 3x4 projection matrix of the image. It can be written as
P = [M |p4 ]
(11)
where M is a 3x3 matrix consisting of the first three columns and p4 is the last column of P. As per [16], the camera center C for the image is given by
C = [ (−M^(−1) p4)^T  1 ]^T   (12)
and if D is the point at infinity in the direction of the ray from C passing through x, then
D = [ (M^(−1) x)^T  0 ]^T   (13)
All the points on the line from C and D can be expressed parametrically by X(t) = C + tD
(14)
Let π = [ a b c d ]^T be the equation of the plane onto which the point x is to be back-projected. Thus for a point with homogeneous representation Y on the plane,
π^T Y = 0   (15)
It can also be written as π = [ z^T d ]^T, where z = [ a b c ]^T. The back-projected point is on the line and on the plane. Thus, using eq. (14) and eq. (15),
π^T X(t) = π^T C + t π^T D
0 = [ z^T  d ] [ (−M^(−1) p4)^T  1 ]^T + t [ z^T  d ] [ (M^(−1) x)^T  0 ]^T
t z^T M^(−1) x = z^T M^(−1) p4 − d
t = (z^T M^(−1) p4 − d) / (z^T M^(−1) x)
Thus the back-projection of a point x on a plane is given by
X* = [ (−M^(−1) p4)^T  1 ]^T + ( (z^T M^(−1) p4 − d) / (z^T M^(−1) x) ) [ (M^(−1) x)^T  0 ]^T .
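The two appendices translate directly into a few lines of NumPy; the sketch below is an illustration only (the null space is obtained via an SVD, degenerate configurations are not handled, and the function names are hypothetical).

import numpy as np

def null_space(pi):
    # 4x3 basis of the null space of pi^T, for a plane pi = (a, b, c, d).
    _, _, vt = np.linalg.svd(pi.reshape(1, 4))
    return vt[1:].T

def plane_homography(Pm, Pn, pi):
    # Hmn of eq. (10): maps image points of plane pi from image n to image m.
    N = null_space(np.asarray(pi, dtype=float))
    return Pm @ N @ np.linalg.inv(Pn @ N)

def back_project(x, P, pi):
    # Back-projection of homogeneous image point x onto plane pi (Appendix B).
    M, p4 = P[:, :3], P[:, 3]
    z, d = np.asarray(pi[:3], dtype=float), float(pi[3])
    Minv = np.linalg.inv(M)
    C = np.append(-Minv @ p4, 1.0)          # camera centre, eq. (12)
    D = np.append(Minv @ x, 0.0)            # point at infinity along the ray, eq. (13)
    t = (z @ Minv @ p4 - d) / (z @ Minv @ x)
    return C + t * D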
Improved Background Mixture Models for Video Surveillance Applications Chris Poppe, Gaëtan Martens, Peter Lambert, and Rik Van de Walle Ghent University - IBBT Department of Electronics and Information Systems - Multimedia Lab Gaston Crommenlaan 8, B-9050 Ledeberg-Ghent, Belgium
Abstract. Background subtraction is a method commonly used to segment objects of interest in image sequences. By comparing new frames to a background model, regions of interest can be found. To cope with highly dynamic and complex environments, a mixture of several models has been proposed. This paper proposes an update of the popular Mixture of Gaussian Models technique. Experimental analysis shows a lack of this technique to cope with quick illumination changes. A different matching mechanism is proposed to improve the general robustness and a comparison with related work is given. Finally, experimental results are presented to show the gain of the updated technique, according to the standard scheme and the related techniques.
1
Introduction
The detection and segmentation of objects of interest in image sequences is the first processing step in many computer vision applications, such as visual surveillance, traffic monitoring, and semantic annotation. Since this is often the input for other modules in computer vision applications, it is desirable to achieve very high accuracy with the lowest possible false alarm rates. The detection of moving objects in dynamic scenes has been the subject of research for several years and different approaches exist [1]. One popular technique is background subtraction. During the surveillance of a scene, a background model is created and dynamically updated. Foreground objects are represented by the pixels that differ significantly from this background model. Many different models have been proposed for background subtraction, of which the Mixture of Gaussian Models (MGM) is one of the most popular [2]. However, there are a number of important problems when using background subtraction algorithms (quick illumination changes, initialization with moving objects, ghosts and shadows), as was reported in [3]. Sect. 2 elaborates on a number of techniques which improve the traditional MGM and try to deal with the above mentioned problems. This paper presents a new approach to deal with the problem of quick illumination changes (like clouds gradually changing the lighting conditions of the environment). We propose an updated matching mechanism for MGM. As such, Sect. 3 elaborates on the conventional mixture technique and its observed shortcomings. Subsequently, Sect. 4 shows the adjustments of the original scheme Y. Yagi et al. (Eds.): ACCV 2007, Part I, LNCS 4843, pp. 251–260, 2007. c Springer-Verlag Berlin Heidelberg 2007
and presents experimental results. Finally, some conclusions are formulated in Sect. 5.
2
Related Work
Toyama et al. discussed in detail several known problems when using background subtraction algorithms [4]. This list was adopted by Javed et al. and they selected a number of important problems which have not been addressed by most background subtraction algorithms [3]. Wu et al. gave a concise overview of background subtraction algorithms, of which they have chosen MGM to compare with their own technique [2]. They use a global gaussian mixture model, built upon a difference image between the current image and an estimated background. Although their system is better for localization and contour preserving, it is more sensitive to complex environmental movements (such as waving trees). Lee et al. improved MGM by introducing means to initialize the background models when moving objects are present in the environment [5]. They presented an online expectation maximization learning algorithm for training adaptive Gaussian mixtures. Their system allows to initialize the mixture models much faster then the original approach. Related to this topic, Zhang et al. presented an online background reconstruction method to cope with the initialization problem [6]. Additionally, they presented a change history map to control the foreground mergence time and make it independent of the learning rate. As such, they deal with the initialization problem and the problem of ghosts (background objects which are moved) but can not deal with quick illumination changes. In [3], Javed et al. presented a number of important problems when using background subtraction algorithms. Accordingly, they proposed a solution using pixel, region and frame level processing. They proposed a solution to the problem of the quick illumination changes, but their technique is based on a complex gradients-based algorithm. Conveniently, the paper does not provide any information about the additional processing times needed for this technique. Tian et al. [7] used a similar approach as the one used in [3]. They presented a texture similarity measure based on gradient vectors, obtained by the Sobel operator. A fixed window is used for the retrieval of the gradient vectors, which largely determines the performance (both in processing time and accuracy ) of their system. We present a conceptual simpler approach by extending the MGM algorithm. The results presented in Sect. 4.2 show that we obtain similar successes in coping with quick illumination changes. Numerous techniques have been proposed to deal with shadows. Since these are used to deal with the effects of the lighting of a scene, an evaluation is made of available shadow detection techniques to see how they can be used to manage the quick illumination changes. An interesting overview on the detection of moving shadows is given in [8]. Prati et al. divide the shadow detection techniques in four classes, of which the deterministic non-model based approach shows the
best results for the entire evaluation set used in the overview. Since the two critical requirements of a video surveillance system are accuracy and speed, not every shadow removal technique is appropriate. MGM is an advanced object detection technique, however, the maintenance of several models for each pixel is computational expensive. Therefore, every additional processing task should be minimal. Furthermore, MGM was created to cope with highly dynamic environments, with the only assumption being the static camera. According to these constraints and following the results presented by Prati et al., we have chosen the technique described in [9] for a comparison with our system. Results hereof are presented in Sect. 4.2.
3
Background Subtraction Using Mixture of Gaussian Models
3.1
The Mixture of Gaussian Models
MGM was first proposed by Stauffer and Grimson in [10]. It is a time-adaptive per-pixel subtraction technique in which every pixel is represented by a vector, called Ip, consisting of three color components (red, green, and blue). For every pixel a mixture of Gaussian distributions, which are the actual models, is maintained and each of these models is assigned a weight.
G(Ip, μp, Σp) = (1 / ((2π)^(n/2) |Σp|^(1/2))) exp( −(1/2) (Ip − μp)^T Σp^(−1) (Ip − μp) )
(1)
(1) depicts the formula for a Gaussian distribution G. The parameters are μp and Σp , which are the mean and covariance matrix of the distribution respectively. For computational simplicity, the covariance matrix is assumed to be diagonal. For every new pixel a matching, an update, and a decision step are executed. The new pixel value is compared with the models of the mixture. A pixel is matched if its value occurs inside a confidence interval within 2.5 standard deviations from the mean of the model. In that case, the parameters of the corresponding distribution are updated according to (2), (3), and (4). μp,t = (1 − ρ) μp,t−1 + ρ (Ip,t ) .
(2)
Σp,t = (1 − ρ) Σp,t−1 + ρ (Ip,t − μp,t ) (Ip,t − μp,t )T .
(3)
ρ = αG (Ip,t , μp,t−1 , Σp,t−1 ) .
(4)
α is the learning rate, which is a global parameter, and introduces a trade-off between fast adaptation and detection of slow moving objects. Each model has a weight, w, which is updated for every new image according to (5). wt = (1 − α) wt−1 + αMt .
(5)
If the corresponding model introduced a match, M is 1 , otherwise it is 0. Formulas (2) to (5) represent the update step. Finally, in the decision step, the models are sorted according to their weights. MGM assumes that background pixels occur more frequently then actual foreground pixels. For that reason a threshold based on these weights is used to define which models of the mixture depict background or foreground. Indeed, if a pixel value occurs recurrently, the weight of the corresponding model increases and it is assumed to be background. If no match is found with the current pixel value, then the model with the lowest weight is discarded and replaced by a normal distribution with a small weight, a mean equal to the current pixel value, and a large covariance. In this case the pixel is assumed to be a foreground pixel. 3.2
Problem Description
Fig. 1 shows the results of applying MGM to the PetsD2Tec2 sequence (with a resolution of 384x288) provided by IBM Research at several time points [11]. Black pixels depict the background, white pixels are assumed to be foreground. The figure shows a fragment of the scene being subject of changing illumination circumstances causing a repetitive increase of certain pixel values in a relatively short time period (about 30s). The illumination change results in relatively small differences between the pixel values of consecutive frames. However, the consistent nature of these differences causes the new pixel values to eventually exceed the acceptance range of the mixture models. This is because the acceptance decision is based on the difference with the average of the model, regardless the difference with the previous pixel value. The learning rate of MGM is typically very small (α is usually less then 0.01), so gradual changes spread over long periods (e.g., day turning into night) can be learned into the models. However, the small learning rate makes the adaptation of the current background models not quick enough to encompass the short consistent gradual changes. Consequently, these changes, which are hard to distinguish by the human eye, result in numerous false detections. The falsely detected regions can range from small regions of misclassified pixels to regions encompassing almost half of the image.
Fig. 1. MGM output during quick illumination change
4
Improved Background Mixture Models
4.1
Advanced MGM
MGM uses only the current pixel value and the mixture model in the matching, update and decision steps. The pixel values of the previous image are not stored since they are only used to update the models. We propose to make the technique aware of the immediate past by storing the previous pixel value (prevI) and the previously matched model number (prevM odel). The matching step is then altered according to the following pseudocode: If (Model == prevModel) If (|I - prevI| < cgc * stdev) Match = true; Else checkMatch(Model,I); Else checkMatch(Model,I); update(Model,I); decide(Model); If ((Match == true) and (Decision == background)) { prevModel = Model; prevI = I; } In the matching step for each pixel, it is checked if the pixel value matches with one of the models in the mixture. For the model which matched the pixel values in the previous frame, the difference between the previous and current pixel value is taken. If this difference is small enough, a match is immediately effectuated. Otherwise the normal matching step is executed. If the matched model is considered to represent part of the background, then the model number and the current pixel value are stored, otherwise they remain unchanged. This way, passing foreground objects do not affect the recent history. If a new pixel value differs slightly from the previously matched one, but would fall out of the matching range of the model, a different outcome, compared with the original algorithm, will be obtained. Since the normal matching process is dependent on the specific model, more specifically on the standard deviation, it is better to enforce this for the threshold as well. Therefore, we have chosen for a per pixel threshold dependent on the standard deviation. We introduce a new parameter, cgc (from Consistent Gradual Change), to control the threshold. In Fig. 2 we have recorded the number of detection failures and false alarms for several values of cgc for the PetsD1TeC2 sequence (another sequence from the IBM benchmark which shows similar situations for the consistent gradual changes). A manual ground
Fig. 2. ROC curve (detection failures vs. false alarms) for different values of cgc
truth annotation has been used to calculate the false positives and negatives. The average values over the entire sequence were then plotted in the curve to find the optimal value for the parameter. A cgc of about 1.8 gives the best results. Lower values result in too many false alarms, since many of the consistent gradual changes will not be dealt with then. If we increase the value of cgc, we see that the number of detection failures increases drastically; if the threshold is too high, too many foreground pixels will be mistaken for background pixels. Consequently, cgc = 1.8 is chosen and is further used in all experiments.
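For clarity, the modified matching step can be sketched as follows for a single pixel and a single frame. The model representation and the background-weight test are illustrative simplifications, and ρ is approximated by the learning rate α rather than by Eq. (4); none of these choices are prescribed by the paper.

import numpy as np

CGC, ALPHA = 1.8, 0.005          # cgc from the ROC analysis above; learning rate

def match_and_update(models, pixel, state):
    # models: list of dicts with 'mean', 'var', 'weight' (one Gaussian per entry);
    # state:  per-pixel memory holding 'prev_model' and 'prev_pixel'.
    matched = None
    k = state.get('prev_model')
    if k is not None and np.all(np.abs(pixel - state['prev_pixel'])
                                < CGC * np.sqrt(models[k]['var'])):
        matched = k                                      # consistent gradual change
    else:
        for i, m in enumerate(models):                   # standard MGM matching
            if np.all(np.abs(pixel - m['mean']) < 2.5 * np.sqrt(m['var'])):
                matched = i
                break
    for i, m in enumerate(models):                       # weight update, Eq. (5)
        m['weight'] = (1 - ALPHA) * m['weight'] + ALPHA * (i == matched)
    if matched is None:
        return False              # foreground: the weakest model is replaced elsewhere
    m = models[matched]
    m['mean'] = (1 - ALPHA) * m['mean'] + ALPHA * pixel
    m['var'] = (1 - ALPHA) * m['var'] + ALPHA * (pixel - m['mean']) ** 2
    background = m['weight'] > 0.25                      # illustrative weight threshold
    if background:
        state['prev_model'], state['prev_pixel'] = matched, pixel.astype(float)
    return background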
4.2
Experimental Results
We adopt the evaluation means of the related work [3,7] and compare the updated algorithm with the original scheme. The result of the proposed algorithm for the example frame of Fig. 1 is shown in Fig. 3. The left side shows the results of the original MGM, the right side shows the results of our system. The new matching process gives significantly less false positives, while it still detects foreground objects. In this case, no morphological post-processing has been applied, so further refinements can be done. Fig. 4 and 5 show a quantitative comparison of the regular MGM and the proposed scheme for the PetsD2TeC2 sequence (with a framerate of 30 frames per second). A manual ground truth annotation has been done for every 50th frame of the sequence. For each of these frames the ground truth annotation is matched with the output of the detection algorithms to find the amount of pixels which are erroneously considered to be foreground. As can be seen, a sudden increase occurs at the end of the video (which corresponds to a quick illumination change in the scene). We notice that
Fig. 3. Results of MGM and the proposed scheme
Fig. 4. False Positives for MGM, a shadow removal technique and our proposed system
the proposed scheme succeeds in dealing with the gradual lighting changes (frames 2100 to 2850) much better than the original scheme. The amount of false positives is largely reduced; in the best case we obtain a reduction of up to 95% of the false positives compared with the normal scheme. The figure also shows that the updated technique obtains the same results as the original technique in scenes without gradual changes (frames 0 to 2050). Fig. 5 shows the false negatives recorded during the sequence. Our updated algorithm gives only a slight increase in the number of false negatives. In Sect. 2 we elaborated on alternative techniques which also give a solution for the quick illumination change problem. These methods are based on complex region-level processing, whereas our technique is solely pixel-based. Javed et al. presented their results on their website (http://www.cs.ucf.edu/∼vision/projects/Knight/background.html). Fig. 6 shows a scene in which a sudden
Fig. 5. False Negatives for MGM, a shadow removal technique and our proposed system
Fig. 6. From left to right: captured image, MGM output, result of [3], result of proposed scheme
illumination change occurs. The second image is the output from MGM and the third is the result of the system of Javed et al. The fourth image is the output of our proposed system. As can be seen, our conceptually simpler approach achieves similar results in coping with the illumination changes. As discussed in Sect. 2, some shadow techniques might provide a solution for the problem of quick illumination changes. We have evaluated the technique described in [9]. This technique uses the HSV color space since this corresponds closely to the human perception of color. Since the hue of a pixel does not change significantly when a shadow is cast and the saturation is lowered in shadowed points, the HSV color space indeed looks interesting for shadow detection. Consequently, the decision process is based on the following equation:
Sp = ( α ≤ IpV / BgpV ≤ β ) ∧ ( IpS − BgpS ≤ τs ) ∧ ( IpH − BgpH ≤ τh )   (6)
In (6), Bgp are the pixel values for the current background model. If Sp = true the pixel is assumed to be covered by a shadow. α should be adjusted according
to the strength of the light source causing the shadows, β is needed to cope with certain aspects of noise, and τs and τh are thresholds which respectively decide how large the difference in saturation and hue can be. This technique is therefore vastly dependent on the actual environment, but works well for shadow detection if the individual parameters can be fine-tuned according to the scene. In highly dynamic scenes as discussed in this paper, this approach would not be optimal. The illumination changes, in our situation, can cause shadows, but will mostly result in the opposite effect; pixel values get lighter color values. Therefore, we use the adjusted formula (7) for the detection of the lighting change.
Sp = ( 1/β ≤ IpV / μpV ≤ 1/α ) ∧ ( IpS − μpS ≥ −τs ) ∧ ( IpH − μpH ≤ τh )   (7)
Fig. 4 and 5 also show the false positives and false negatives of the adjusted shadow removal technique (Shadow hsv), respectively. We see that the shadow detection results in fewer false positives than the original scheme, but it cannot manage the entire change. Moreover, there is a strong increase of the false negatives.
5
Conclusion
This paper presents an updated scheme for object detection using a Mixture of Gaussian Models. The original scheme has been discussed in more detail and the incapability of dealing with quick illumination changes has been detected. An update of the matching mechanism has been presented. Furthermore, a comparison has been made with existing relevant object detection techniques which are able to deal with the problem. Experimental results show that our algorithm has significant improvements compared with the standard scheme, while only introducing minor additional processing.
Acknowledgments The research activities that have been described in this paper were funded by Ghent University, the Interdisciplinary Institute for Broadband Technology (IBBT), the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), the Fund for Scientific Research-Flanders(FWO-Flanders), the Belgian Federal Science Policy Office (BFSPO), and the European Union.
References 1. Dick, A., Brooks, M.J.: Issues in automated visual surveillance. In: Proceedings of International Conference on Digital Image Computing: Techniques and Applications, pp. 195–204 (2003)
2. Wu, J., Trivedi, M.: Performance Characterization for Gaussian Mixture Model Based Motion Detection Algorithms. In: Proceedings of the IEEE International Conference on Image Processing, pp. 97–100. IEEE Computer Society Press, Los Alamitos (2005) 3. Javed, O., Shafique, K., Shah, M.: A Hierarchical Approach to Robust Background Subtraction using Color and Gradient Information. In: Proceedings of the Workshop on Motion and Video Computing, pp. 22–27 (2002) 4. Toyama, K., Krumm, J., Brumitt, B., Meyers, B.: Wallflower: Principles and Practice of Background Maintenance. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 255–261. IEEE Computer Society Press, Los Alamitos (1999) 5. Lee, D.: Online Adaptive Gaussian Mixture Learning for Video Applications. Statistical Methods in Video Processing. LNCS, pp. 105–116 (2004) 6. Zhang, Y., Liang, Z., Hou, Z., Wang, H., Tan, M.: An Adaptive Mixture Gaussian Background Model with Online Background Reconstruction and Adjustable Foreground Mergence Time for Motion Segmentation. In: Proceedings of the IEEE International Conference on Industrial Technology, pp. 23–27. IEEE Computer Society Press, Los Alamitos (2005) 7. Tian, Y., Lu, M., Hampapur, A.: Robust and Efficient Foreground Analysis for Real-time Video Surveillance. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1182–1187. IEEE Computer Society Press, Los Alamitos (2005) 8. Prati, A., Mikic, I., Trivedi, M.M., Cucchiara, R.: Detecting Moving Shadows: Algorithms and Evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 918–923 (2003) 9. Cucchiara, R., Grana, C., Neri, G., Piccardi, M., Prati, A.: The Sakbot System for Moving Object Detection and Tracking. Video-Based Surveillance Systems Computer Vision and Distributed Processing, pp. 145–157 (2001) 10. Stauffer, C., Grimson, W.E.L.: Learning Patterns of Activity Using Real-Time Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 747–757 (2000) 11. Brown, L.M., Senior, A.W., Tian, Y., Connell, J., Hampapur, A., Shu, C., Merkl, H., Lu, M.: Performance Evaluation of Surveillance Systems Under Varying Conditions. In: Proceedings of IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (2005) http://www.research.ibm.com/peoplevision/performanceevaluation.html
High Dynamic Range Scene Realization Using Two Complementary Images Ming-Chian Sung, Te-Hsun Wang, and Jenn-Jier James Lien Robotics Laboratory, Dept. of Computer Science and Information Engineering National Cheng Kung University, No. 1, Ta-Hsueh Road, Tainan, Taiwan {qwer,jjlien}@csie.ncku.edu.tw http://robotics.csie.ncku.edu.tw
Abstract. Many existing tone reproduction schemes are based on the use of a single high dynamic range (HDR) image and are therefore unable to accurately recover the local details and colors of the scene due to the limited information available. Accordingly, the current study develops a novel tone reproduction system which utilizes two images with different exposures to capture both the local details and color information of the low- and high-luminance regions of a scene. By computing the local region of each pixel, whose radius is determined via an iterative morphological erosion process, the proposed system implements a pixel-wise local tone mapping module which compresses the luminance range and enhances the local contrast in the low-exposure image. And a local color mapping module is applied to capture the precise color information from the high-exposure image. Subsequently, a fusion process is then performed to fuse the local tone mapping and color mapping results to generate highly realistic reproductions of HDR scenes. Keywords: High dynamic range, local tone mapping, local color mapping.
1 Introduction In general, a tone reproduction problem occurs when the dynamic range of a scene exceeds that of the recording or display device. This problem is typically resolved by applying some form of tone mapping technique, in which the high dynamic range (HDR) luminance of the real world is mapped to the low dynamic range (LDR) luminance of the display device. Various uniform (or global) tone mapping methods have been proposed [19], [21]. However, while these systems are reasonably successful in resolving the tone reproduction problem and avoid visual artifacts such as halos, the resulting images tend to lose the local details of the scene. By contrast, non-uniform (or local) tone mapping methods such as those presented in [1], [3], [4], [5], [7], [16] and [18] not only provide a good tone reproduction performance, but also preserve the finer details of the original scene. Such approaches typically mimic the human visual system by computing the local adaptation luminance in the scene. When computing the local adaptation luminance, the size of the local region is a crucial consideration and is generally estimated using some form of local contrast measure. Center-surround Y. Yagi et al. (Eds.): ACCV 2007, Part I, LNCS 4843, pp. 261–270, 2007. © Springer-Verlag Berlin Heidelberg 2007
functions such as the difference of Gaussian blurring images in [2] and [16] provides one means of estimating the size of this region. However, the local region size determined using this method is generally too local (or small) to reveal sufficient details. By contrast, piece-wise methods tend to improperly emphasize the local details. Furthermore, if the dynamic range of the real-world scene is very large, some of the image regions will be over-exposed, while some will be under-exposed, and hence the details in these regions will be lost. When processing such scenes using only a single image, the use of luminance compression techniques to recover the scene details achieves only limited success due to the lack of scene information available. Modern digital cameras invariably feature a built-in exposure bracketing functionality which allows a series of images to be obtained at different exposure levels via a single depression of the shutter release mechanism. This functionality is exploited to generate two simultaneous images of a high contrast scene with different levels of exposure such that the color and details of the scene can be accurately reproduced. Let the term IL refer to the low-exposure image, in which the brighter region is well-exposed, but the darker region is under-exposed. The brighter region contains good detail definition and has an abundance of color information. However, in the dark region of the image, the scene details are hidden and the true color of the scene cannot be recovered. Furthermore, let IH denote the high-exposure image, in which the darker region is well-exposed, but the brighter region is over-exposed. In this case, the darker region retains good detail definition and accurately reproduces the true color, while the brighter region is saturated such that the scene details cannot be perceived and the true color is not apparent. Although IL and IH have different exposure levels and may be not perfectly overlapped geometrically due to taken by unstable hand-held camera, coherence nevertheless exists in their color statistics and spatial constraints because they are taken simultaneously of the same scene [10]. The basic principle of the tone reproduction system developed in this study is to exploit these coherences in order to derive an accurate reproduction of the scene. In many image processing applications, the performance can be enhanced by using multiple input images to increase the amount of available information. Typical applications which adopt this approach include noise removal using flash and no-flash pairs [6], [12]; motion deblurring using normal and low-exposure pairs [10]; color transfer [15], [17]; and gray scale image colorization [11]. Goshtasby proposed an excellent method for realizing HDR reduction via the maximization of image information by using many images with different exposures [8]. However, the proposed method required the contents of the two images to be perfectly overlapped. Thus, the use of a fixed tripod was required, with the result that the practicality of the method was rather limited. In an attempt to resolve the problems highlighted above, the current study develops a novel tone reproduction system in which two input images acquired under different exposure conditions are input to a local adaptation mechanism which takes into account both the color statistics and the spatial constraint between the two images in order to reproduce the details and color of the scene. 
The proposed system performs a local tone mapping and a local color mapping process to exploit the respective advantages of IL and IH. Subsequently, a fusion process is applied to make a compromise between the local optimum and the global optimum.
Fig. 1. (a) Low-exposure image IL; (b) high-exposure image IH; (c) segmentation of IL into four regions based upon entropy; (d) illustration of morphological erosion operation following three iterations; (e) iteration map of IL; and (f) example of pixels and their corresponding local regions as determined by their iteration map values
2 Iteration Map Creation and Local Region Determination

In order to perform the local mechanism, this study commences by finding the local region of each pixel. A histogram-based segmentation process is applied to group the luminance values of the various pixels in the image into a small number of discrete intervals. The radius of the local region of each pixel is then determined by an iteration map which is derived from a morphological erosion operation.

2.1 Histogram-Based Region Segmentation Using Entropy Theorem

The entropy theorem provides a suitable means of establishing an optimal threshold value T when separating the darker and brighter regions of an image [13]. According to this theorem, the optimal threshold maximizes the total entropy of the image, i.e. the sum of the entropies of the darker and brighter regions:

T = \arg\max_{t} \Big( -\sum_{i=0}^{t} p_i^d \log p_i^d \;-\; \sum_{i=t}^{255} p_i^b \log p_i^b \Big), (1)

where t is the candidate threshold, and p_i^d and p_i^b are the probabilities of the darker and brighter pixels with luminance value i, respectively. Adopting a dichotomy approach, the segmentation procedure is repeated three times, yielding three separate threshold values, i.e. L_low, L_middle and L_high, which collectively segment the histogram into four subintervals, namely [L_min, L_low], [L_low, L_middle], [L_middle, L_high] and [L_high, L_max].
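As an illustration of Eq. (1), the following minimal Python/NumPy sketch selects the threshold by maximizing the summed entropies of the two histogram halves and then applies the dichotomy three times. It is not the authors' implementation; the function names and the assumption of integer luminance in [0, 255] are ours.

import numpy as np

def entropy_threshold(lum, bins=256):
    """Eq. (1): pick T maximizing the entropy of the darker plus brighter halves."""
    hist, _ = np.histogram(lum, bins=bins, range=(0, bins))
    p = hist.astype(float) / max(hist.sum(), 1)

    def entropy(pp):
        pp = pp[pp > 0]
        pp = pp / pp.sum()                  # probabilities within the sub-interval
        return -(pp * np.log(pp)).sum()

    scores = []
    for t in range(1, bins):
        if p[:t].sum() == 0 or p[t:].sum() == 0:
            scores.append(-np.inf)          # ignore degenerate splits
        else:
            scores.append(entropy(p[:t]) + entropy(p[t:]))
    return int(np.argmax(scores)) + 1

def segment_four_regions(lum):
    """Dichotomy applied three times, yielding L_low, L_middle, L_high."""
    l_mid = entropy_threshold(lum)
    l_low = entropy_threshold(lum[lum < l_mid]) if np.any(lum < l_mid) else 0
    l_high = entropy_threshold(lum[lum >= l_mid]) if np.any(lum >= l_mid) else 255
    return l_low, l_mid, l_high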
2.2 Iteration Map Creation Using Morphological Erosion Operation Having segmented the image, the proposed tone reproduction system then determines the circular local region Rx,y of each pixel (x, y). The radius of this region is found by performing an iterative morphological erosion operation in each luminance region, and creating an iteration map to record the iteration number at which each pixel is eroded. Clearly, for pixels located closer to the region boundary, the corresponding iteration value is lower, while for those pixels closer to the region center, the iteration value is higher. Hence, by inspection of the values indicated on the iteration map, it is possible not only to determine the radius of the circular local region of each pixel, but also to modulate the tone mapping function as described later.
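A minimal sketch of this iteration-map construction is given below (Python with scipy.ndimage; an illustrative reimplementation rather than the authors' code). Each luminance region is repeatedly eroded and every pixel records the iteration at which it disappears, so boundary pixels receive small values and interior pixels large ones.

import numpy as np
from scipy.ndimage import binary_erosion

def iteration_map(region_labels):
    """Iterative morphological erosion per region; returns the iteration map."""
    it_map = np.zeros(region_labels.shape, dtype=int)
    for label in np.unique(region_labels):
        mask = region_labels == label
        it = 0
        while mask.any():
            it += 1
            it_map[mask] = it            # surviving pixels take the current count
            mask = binary_erosion(mask)  # shrink the region by one pixel layer
    return it_map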
3 Tone Reproduction System

Since the light range in the brighter regions of an image is greater than that in the darker regions, the details in the over-exposed regions of IH are usually lost. Hence, in the current approach, the more detailed IL is processed using the color and tone information associated with IH.

3.1 Luminance: Pixel-Wise Local Tone Mapping

The proposed tone reconstruction method commences by applying a non-uniform luminance scaling process to IL to generate an initial middle-gray tone image. Due to the under-exposed darker region and well-exposed brighter region of IL, it is necessary to apply a greater scaling effect to the darker region to brighten the concealed details and a reduced scaling to the well-exposed brighter region, i.e.

\bar{L}_k = \exp\Big( \frac{1}{N} \sum_{x,y} \log(\delta + L_k(x, y)) \Big), \quad L_k \in \{L_L, L_H\}, (2)

L(x, y) = \Big( 2\,\frac{\bar{L}_H}{\bar{L}_L} \Big) \Big( 1 - \frac{L_L(x, y)}{2 L_{white}} \Big) L_L(x, y), (3)

where \bar{L}_L and \bar{L}_H are the log-average luminances (referred to as the key values in [9], [19] and [20]) of IL and IH, respectively, and are used to objectively measure whether the scene is of low-gray, middle-gray or high-gray tone. Furthermore, L_L(x, y) is the luminance value of pixel (x, y) in IL, normalized to the interval [0, 1], and L_{white} is the maximum luminance value in IL. By applying Eqs. (2) and (3), the luminance L_L can be scaled to an overall luminance L. To mimic the human visual system, which attains visual perception of a scene by locally adapting to luminance differences, the system proposed in this study performs a local tone mapping process which commences by computing the local adaptation luminance. Since the radius of the circular local region R_{x,y} has already been determined for each pixel (x, y), the value of the local adaptation luminance can be obtained
Fig. 2. (a) Local adaptation luminance result. Note result is normalized into interval [0, 255] for display purposes; (b) detailed term H; and (c) luminance compression term V’.
simply by convolving the luminance values in the local region with a weighted mask, i.e.

V(x, y) = \frac{1}{Z_{x,y}} \sum_{(i,j)\in R_{x,y}} L(i, j)\, G_{x,y}(i, j)\, K_{x,y}(i, j), (4)
where the significance of each neighborhood pixel (i, j) in this convolution is evaluated using G_{x,y} and K_{x,y}, which are Gaussian weights corresponding to the spatial distance between pixels (x, y) and (i, j) and to the difference in luminance of the two pixels, respectively, and Z_{x,y} in Eq. (4) is a normalization term. A method known as local tone mapping was proposed by Reinhard et al. [16] for addressing the tone reproduction problem. This simple non-uniform mapping technique compresses the luminance range of the scene such that all of the luminance values fall within the interval [0, 1]. The system presented in the current study goes a step further in modulating the local contrast and luminance compression by extracting the detail term (denoted as H) and the local adaptation luminance compression term (denoted as V'), illustrated in Fig. 2(a~c), and then modulating them in accordance with [3]:

L_d = \frac{L}{1 + V} = H \times V' = \left(\frac{L}{V}\right)^{\rho} \times \left(\frac{V}{1 + V}\right)^{\gamma}, (5)

where 0 < \rho < 2 and 0 < \gamma \leq 1. The value of \rho controls the degree of sharpness of the reproduced image, with a larger value generating a sharper result. In the current study, the value of \rho is varied in direct proportion to the iteration value of each pixel specified in the iteration map to ensure a smooth boundary while simultaneously revealing most of the image details near the region center. Meanwhile, the value of \gamma determines the degree of luminance compression. As its value is reduced, the luminance of the darker regions in the image is compressed into a larger display range. As a result, \gamma is inversely proportional to the range of the interval [L_min, L_middle]. Finally, in order to obtain the final local tone mapping result I_t, the value of each RGB channel is scaled according to the change ratio of the luminance, obtained by dividing the output luminance L_d by the input scaled luminance L, i.e.

I_t = I_L \times \frac{L_d}{L}. (6)
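A compact sketch of this modulation step (Eqs. (5)-(6)) is shown below in Python/NumPy. The arrays L, V, rho and gamma are assumed to be per-pixel maps of identical size, with rho and gamma derived from the iteration map as described above; this is an illustration under those assumptions, not the published implementation.

import numpy as np

def local_tone_mapping(L, V, rho, gamma, I_L, eps=1e-6):
    """Modulate the detail term H = L/V and the compression term V' = V/(1+V),
    then rescale the RGB channels of the low-exposure image by L_d / L."""
    H = (L + eps) / (V + eps)            # detail term
    V_prime = V / (1.0 + V)              # luminance compression term
    L_d = (H ** rho) * (V_prime ** gamma)
    ratio = (L_d + eps) / (L + eps)      # change ratio of the luminance
    return I_L * ratio[..., None]        # Eq. (6): scale each RGB channel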
3.2 Color: Pixel-Wise Local Color Mapping

The tone mapping method described in the previous section modulates only the luminance values, and hence the original imprecise RGB color information of IL is retained. If a low-exposure image is exposed such that the color and detail information of the brighter regions is captured, the darker regions of the image will inevitably be under-exposed. If the scene is characterized by an extremely broad light range, the darker regions may become so dark that the true color information is not recorded by the camera at all, and thus only an unsatisfactory local tone mapping result can be obtained. Moreover, the darker regions generally contain considerable noise, and thus obtaining a crisp result is difficult. To resolve these problems, the current approach modulates IL using the color information of IH, and a pixel-wise local color mapping module is applied to acquire the ground-truth color. Reinhard et al. [15] proposed the following simple but highly effective method, implemented in the lαβ color space [14], for transferring the color statistics from IH to IL:

I_L'(x, y) = g(I_L(x, y)) = \mu_H + \frac{\sigma_H}{\sigma_L}\,(I_L(x, y) - \mu_L), (7)

where I_L(x, y) is the color value of pixel (x, y) in each lαβ channel of the low-exposure image, \mu_L and \sigma_L are the mean and standard deviation of the color values in IL, respectively, and \mu_H and \sigma_H are the mean and standard deviation of the color values in IH, respectively. In the current system, this color transfer process is performed on each lαβ channel and the results are then converted back to the RGB color space to obtain the preliminary result IL'. However, this global color transfer approach fails to obtain the precise local color when the source or target image contains many different color regions, because it cannot distinguish which particular region each color statistic derives from and thus mixes them all up, yielding a uniform transfer [17]. Hence, the current system applies a further pixel-wise local color mapping process in the RGB channels to find the true local color of pixel (x, y) from a pixel (i*, j*) of IH within the local color mapping region S_{x,y}, whose radius is inversely proportional to the value of pixel (x, y) in the iteration map, because the color near the region center is more reliable.
Fig. 3. Local color mapping region Sx,y is inversely proportional to the value of pixel (x, y) in the iteration map
Fig. 4. Finding fusion weight αx,y using double sigmoid function
reliable, so we can search its precise color in IH within a smaller local color mapping region Sx,y as shown in Fig. 3.. This suggests following equation: ⎛ ( I ' ( x, y ) − I H (i, j )) 2 ⎞⎟ (i* , j* ) = arg max ⎜ exp(− L ) , ⎜ ⎟ 2σ 2 (i , j )∈S x , y ⎝ ⎠
(8)
σ
is specified as half the value of pixel (x, y) in the iteration map. where the value of When (i*, j*) in the Eq. (8) is obtained, the color value of IH (i*, j*) is used in place of IL’(x, y) to construct the local mapping result Ic. This local color mapping method resolves the image shift problem and gives the ground-truth color appearance of IH. Input Image Pairs
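The following short Python/NumPy sketch illustrates Eqs. (7)-(8) in a simplified form: a per-channel statistics transfer followed by the local search for the best-matching pixel of IH. It operates directly on RGB arrays for brevity (the paper performs the global transfer in the lαβ space), and the helper names are ours.

import numpy as np

def global_color_transfer(I_L, I_H):
    """Eq. (7) per channel: match mean and standard deviation of I_L to I_H."""
    mu_L, sd_L = I_L.mean(axis=(0, 1)), I_L.std(axis=(0, 1)) + 1e-6
    mu_H, sd_H = I_H.mean(axis=(0, 1)), I_H.std(axis=(0, 1))
    return mu_H + (sd_H / sd_L) * (I_L - mu_L)

def local_color_match(I_L_prime, I_H, x, y, radius, sigma):
    """Eq. (8): search S_{x,y} of I_H for the color closest to I_L'(x, y).
    (x, y) is used as (column, row); rows/cols are clipped to the image."""
    h, w = I_H.shape[:2]
    y0, y1 = max(0, y - radius), min(h, y + radius + 1)
    x0, x1 = max(0, x - radius), min(w, x + radius + 1)
    window = I_H[y0:y1, x0:x1].reshape(-1, I_H.shape[2])
    target = I_L_prime[y, x]
    dist2 = ((window - target) ** 2).sum(axis=1)
    best = np.argmax(np.exp(-dist2 / (2.0 * sigma ** 2)))   # same as argmin(dist2)
    return window[best]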
Fig. 5. Flowchart of the proposed system
I_d = \alpha_{x,y} I_t + (1 - \alpha_{x,y}) I_c, (9)

\alpha_{x,y} = \begin{cases} \dfrac{\beta}{1 + \exp\!\left( -2\,\dfrac{L_L(x, y) - L_{middle}}{|L_{middle} - L_{low}|} \right)}, & \text{if } L_L(x, y) < L_{middle}, \\[2ex] \dfrac{\beta}{1 + \exp\!\left( -2\,\dfrac{L_L(x, y) - L_{middle}}{|L_{middle} - L_{high}|} \right)}, & \text{otherwise.} \end{cases} (10)
The weight \alpha_{x,y} is generated by the double sigmoid function shown in Fig. 4. This function provides a virtually linear fusion over the interval [L_low, L_high] and a non-linear fusion over the intervals [L_min, L_low] and [L_high, L_max], respectively. In the darker regions, Ic is assigned a higher weight in order to obtain the true color appearance and to reduce the noise present in It. Meanwhile, in the brighter regions, It is assigned a greater weight to compensate for the color and details in the saturated regions of Ic. The parameter \beta in Eq. (10) is inversely proportional to the value of pixel (x, y) in the iteration map and serves to further control the color appearance of the fusion result. The flowchart of the proposed method is shown in Fig. 5.
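A minimal sketch of this fusion step (Eqs. (9)-(10)) is given below in Python/NumPy. It assumes L_L is the normalized luminance map of IL and that beta has already been derived from the iteration map; it is an illustration, not the authors' code.

import numpy as np

def fusion_weight(L_L, L_low, L_mid, L_high, beta):
    """Eq. (10): double sigmoid weight alpha_{x,y}."""
    denom = np.where(L_L < L_mid,
                     np.abs(L_mid - L_low) + 1e-6,
                     np.abs(L_mid - L_high) + 1e-6)
    return beta / (1.0 + np.exp(-2.0 * (L_L - L_mid) / denom))

def fuse(I_t, I_c, alpha):
    """Eq. (9): I_d = alpha * I_t + (1 - alpha) * I_c, applied per RGB channel."""
    return alpha[..., None] * I_t + (1.0 - alpha[..., None]) * I_c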
4 Experimental Results and Conclusion The current tone reproduction experiments were performed on a PC equipped with an Intel Pentium 4 (3.2GHz) processor. The execution time of the proposed method depends on the number and size of segmentation regions. In general, the results showed that a 640 x 480 image could be processed within 10 seconds on average. When implementing the local tone reproduction methods presented in the literature, the size of the local region is a crucial factor. Figure 6 compares the local regions estimated by the schemes presented in [16] and [3], respectively, with that estimated by the current method. It can be seen that the local region size derived using the method proposed by Reinhard et al. [16] may be too small to reveal sufficient local details. Conversely, the region size estimated using the region-wise method presented in [3] is not always adaptive for each pixel in that region and may therefore result in an unnatural emphasis. However, the morphological erosion method proposed in the current study enables the derivation of a more adaptive local region size.
Fig. 6. Example of pixel and corresponding local region measured using: (a) local region method proposed by Reinhard et al. [16]; (b) region-wise method [3]; and (c) current morphological erosion operation
Fig. 7. Typical image pairs and corresponding tone reproduction results: (a.1~c.1) input image pairs IL and IH; (a.2~c.2) tone mapping results It; (a.3~c.3) local color mapping results Ic; and (a.4~c.4) fusion results Id
In addition to successfully processing HDR images, as shown in Fig. 7(a.1~a.4), the proposed method is also able to handle LDR images and non-overlapping image pairs, as shown in Fig. 7(b.1~b.4). The local color mapping process effectively overcomes the positional shift between the image pairs while simultaneously obtaining the true color from IH. In Fig. 7(c.1~c.4), it is seen that the proposed method effectively smoothes the noise within the darker regions. In conclusion, tone reproduction methods are essential techniques for realizing HDR scenes on LDR display devices. Many previous tone reproduction techniques fail to accurately recover the color and details of HDR scenes since they use only a single image and therefore have only limited information at their disposal. In contrast, in the tone reproduction technique presented in this paper, two images are acquired at different exposures and are supplied to an automatic local adaptation mechanism which takes account of both the color statistics and the spatial constraint between the images in order to maintain the color and detail information of the original scene. The experimental results confirm that the proposed system
provides promising results with rich detail and color information and is capable of generating highly realistic reproductions of HDR scenes.
References
1. Ashikhmin, M.: A Tone Mapping Algorithm for High Contrast Images. 13th Eurographics, 145-156 (2002)
2. Blommaert, F.J.J., Martens, J.-B.: An Object-Oriented Model for Brightness Perception. Spatial Vision 5(1), 15-41 (1990)
3. Chen, H.T., Liu, T.L., Fuh, C.S.: Tone Reproduction: A Perspective from Luminance-Driven Perceptual Grouping. IJCV 65, 73-96 (2005)
4. DiCarlo, J.M., Wandell, B.A.: Rendering High Dynamic Range Images. SPIE: Image Sensors 3965, 392-401 (2000)
5. Durand, F., Dorsey, J.: Fast Bilateral Filtering for the Display of High-Dynamic-Range Images. ACM SIGGRAPH, 257-266 (2002)
6. Eisemann, E., Durand, F.: Flash Photography Enhancement via Intrinsic Relighting. ACM Trans. Graph. 23(3), 673-678 (2004)
7. Fattal, R., Lischinski, D., Werman, M.: Gradient Domain High Dynamic Range Compression. ACM SIGGRAPH, 249-256 (2002)
8. Goshtasby, A.: High Dynamic Range Reduction via Maximization of Image Information, http://www.cs.wright.edu/~agoshtas/hdr.html
9. Holm, J.: Photographic Tone and Colour Reproduction Goals. CIE Expert Symposium '96 on Colour Standard for Image Technology, 51-56 (1996)
10. Sun, J., Tang, J.: Bayesian Correction of Image Intensity with Spatial Consideration. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3024, pp. 342-354. Springer, Heidelberg (2004)
11. Levin, A., Lischinski, D., Weiss, Y.: Colorization Using Optimization. ACM Trans. Graph. 23(3), 689-694 (2004)
12. Petschnigg, G., Szeliski, R., Agrawala, M., Cohen, M., Hoppe, H., Toyama, K.: Digital Photography with Flash and No-Flash Image Pairs. ACM Trans. Graph. 23(3), 664-672 (2004)
13. Pardo, A., Sapiro, G.: Visualization of High Dynamic Range Images. IEEE Trans. on Image Proc. 12(6), 639-647 (2003)
14. Ruderman, D.L., Cronin, T.W., Chiao, C.C.: Statistics of Cone Responses to Natural Images: Implications for Visual Coding. J. Optical Soc. of America 15(8), 2036-2045 (1998)
15. Reinhard, E., Ashikhmin, M., Gooch, B., Shirley, P.: Color Transfer between Images. IEEE CG&A 21, 34-41 (2001)
16. Reinhard, E., Stark, M., Shirley, P., Ferwerda, J.: Photographic Tone Reproduction for Digital Images. ACM SIGGRAPH, 267-276 (2002)
17. Tai, Y.W., Jia, J., Tang, C.K.: Local Color Transfer via Probabilistic Segmentation by Expectation-Maximization. In: CVPR, vol. 1, pp. 747-754 (2005)
18. Tumblin, J., Turk, G.: LCIS: A Boundary Hierarchy for Detail-Preserving Contrast Reduction. ACM SIGGRAPH, 83-90 (1999)
19. Tumblin, J., Rushmeier, H.: Tone Reproduction for Computer Generated Images. IEEE CG&A 13(6), 42-48 (1993)
20. Ward, G.: A Contrast-Based Scale Factor for Luminance Display. In: Heckbert, P. (ed.) Graphics Gems IV, pp. 415-421. Academic Press, Boston (1994)
21. Ward, G., Rushmeier, H.E., Piatko, C.D.: A Visibility Matching Tone Reproduction Operator for High Dynamic Range Scenes. IEEE Trans. Visualization and Computer Graphics 3(4), 291-306 (1997)
Automated Removal of Partial Occlusion Blur Scott McCloskey, Michael Langer, and Kaleem Siddiqi Centre for Intelligent Machines, McGill University {scott,langer,siddiqi}@cim.mcgill.ca
Abstract. This paper presents a novel, automated method to remove partial occlusion from a single image. In particular, we are concerned with occlusions resulting from objects that fall on or near the lens during exposure. For each such foreground object, we segment the completely occluded region using a geometric flow. We then look outward from the region of complete occlusion at the segmentation boundary to estimate the width of the partially occluded region. Once the area of complete occlusion and width of the partially occluded region are known, the contribution of the foreground object can be removed. We present experimental results which demonstrate the ability of this method to remove partial occlusion with minimal user interaction. The result is an image with improved visibility in partially occluded regions, which may convey important information or simply improve the image’s aesthetics.
1 Introduction
Partial occlusions arise in natural images when an occluding object falls nearer to the lens than the plane of focus. The occluding object will be blurred in proportion to its distance from the plane of focus, and contributes to the exposure of pixels that also record background objects. This sort of situation can arise, for example, when taking a photo through a small opening such as a cracked door, fence, or keyhole. If the opening is smaller than the lens aperture, some part of the door/fence will fall within the field of view, partially occluding the background. This may also arise when a nearby object (such as the photographer's finger, or a camera strap) accidentally falls within the lens' field of view. Whatever its cause, the width of the partially-occluded region depends on the scene geometry and the camera settings. Primarily, the width increases with increasing aperture size (decreasing f-number), making partial occlusion a greater concern in low lighting situations that necessitate a larger aperture. Fig. 1 (left) shows an image with partial occlusion, which has three distinct regions: complete occlusion (outside the red contour), partial occlusion (between the green and red contours), and no occlusion (inside the green contour). As is the case in this example, the completely occluded region often has little high-frequency structure because of the severe blurring of objects far from the focal plane. In addition, the region of complete occlusion can be severely underexposed when the camera's settings are chosen to properly expose the background. In [7], it was shown that it is possible to remove the partial occlusion when the location and width of the partially occluded region are found by a user. Because
Fig. 1. (Left) Example image taken through a keyhole. Of the pixels that see through the opening, more than 98% are partially occluded. (Right) The output of our method, with improved visibility in the partially-occluded region.
of the low contrast and arbitrary shape of the boundary between regions of complete and partial occlusion, this task can be challenging, time consuming, and prone to user error. In the current paper we present an automated solution to this vision task for severely blurred occluding objects and in doing so significantly extend the applicability of the method in [7]. Given the input image of Fig. 1 (left), the algorithm presented in this paper produces the image shown in Fig. 1 (right). The user must only click on a point within each completely-occluded region in the image, from which we find the boundary of the region of complete occlusion. Next, we find the width of the partially occluded band based on a model of image formation under partial occlusion. We then process the image to remove the partial occlusion, producing an output with improved visibility in that region. Each of these steps is detailed in Sec. 4.
2 Previous Work
The most comparable work to date was presented by Favaro and Soatto [3], who describe an algorithm which reconstructs the geometry and radiance of a scene, including partially-occluded regions. While this restores the background, it requires several registered input images taken at different focal positions. Tamaki and Suzuki [12] presented a method for the detection of completely occluded regions in a single image. Unlike our method, they assume that the occluding region has high contrast with the background, and that there is no adjacent region of partial occlusion. A more distantly related technique is presented by Levoy et al. in [5], where synthetic aperture imaging is used to see around occluding objects. Though this
ability is one of the key features of their system, no effort is made to identify or remove partial occlusion in the images. Partial occlusion also occurs in matte recovery which, while designed to extract the foreground object from a natural scene, can also recover the background in areas of partial occlusion. Unlike our method, matte recovery methods require either additional views of the same scene [8,9] or substantial user intervention [1,11]. In the latter category, users must supply a trimap, a segmentation of the image into regions that are either definitely foreground, definitely background, or unknown/mixed. Our method is related to matte recovery, and can be viewed as a way of automatically generating, from a single image, a trimap for images with partial occlusion due to blurred foreground objects.
3 Background and Notation
In [7], it was shown that the well-known matting equation,

R_{input}(x, y) = \alpha(x, y)\,R_f + (1 - \alpha(x, y))\,R_b(x, y), (1)

describes how the lens aperture combines the radiance R_b of the background object with the radiance R_f of the foreground object. The blending parameter \alpha describes the proportion in which the two quantities combine, and R_{input} is the observed radiance. Notionally, the quantity \alpha is the fraction of the pixel's viewing frustum that is subtended by the foreground object. Since that object is far from the plane of focus, the frustum is a cone and \alpha is the fraction of that cone subtended by the occluding object. In order to remove the contribution of the occluding object, the values of \alpha and R_f must be found at each pixel. Given the location of the boundary between regions of complete and partial occlusion, the distance d between each pixel and the nearest point on the boundary can be found. From d and the width w of the partially occluded region, the value of \alpha is well-approximated [7] by

\alpha = \frac{1}{2} - \frac{1}{\pi}\left( l\sqrt{1 - l^2} + \arcsin(l) \right), \quad \text{where } l = \min\!\left(2, \frac{2d}{w}\right) - 1. (2)

This can be done if the user supplies both w and the boundary between the regions of complete and partial occlusion, as in [7]. Unfortunately, this task is time consuming, difficult, and prone to user error. In this paper, we present an automated solution to this vision problem, from which we compute the values \alpha and R_f.
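Assuming the reconstruction of Eq. (2) above, the blending value can be computed directly from the distance map, as in the following small Python/NumPy sketch (our illustration, not the authors' code):

import numpy as np

def alpha_from_distance(d, w):
    """Eq. (2): alpha at a pixel at distance d from the boundary of complete
    occlusion, given the width w of the partially occluded band."""
    l = np.minimum(2.0, 2.0 * np.asarray(d, dtype=float) / w) - 1.0
    return 0.5 - (l * np.sqrt(1.0 - l ** 2) + np.arcsin(l)) / np.pi

# Sanity check: alpha = 1 on the boundary of complete occlusion (d = 0)
# and alpha = 0 at the outer edge of the partially occluded band (d = w).
print(alpha_from_distance(0.0, 100.0), alpha_from_distance(100.0, 100.0))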
4 Method
To state the vision problem more clearly, we refer to the example¹ in Fig. 2. In this example, the partial occlusion is due to the handle of a fork directly in front
¹ The authors of [7] have made this image available to the public at http://www.cim.mcgill.ca/~scott/research.html
Fig. 2. To remove partial occlusion from a foreground object, the vision problem is to determine the boundary of the completely occluded region (green curve) and the width of the partially-occluded region (the length of the red arrow)
of the lens. In order to remove the contribution of the occluding object, we must automatically find the region of complete occlusion (outlined in green) and the width of the partially occluded band (the length of the red arrow). In order to find the region of complete occlusion within the image, we assume that the foreground image appears as a region of nearly constant intensity. Note that this does not require that the object itself have constant radiance. Because the object is far from the plane of focus, high-frequency radiance variations will be lost due to blurring in the image. Moreover, when objects are placed against the lens they are often severely under-lit, as they fall in the shadow of the camera or photographer. As such, many occluding objects with texture may appear to have constant intensity in the image. A brief overview of the method is as follows. Given the location of a point p that is completely occluded (provided by the user), we use a geometric flow (Sec. 4.2) to produce a segmentation with a smooth contour such as the one outlined in Fig. 2, along which we find normals facing outward from the region of complete occlusion. The image is then re-sampled (Sec. 4.3) to uniformly-spaced points on these normals, reducing an arbitrarily-shaped occluding contour to a linear contour. Low variation rows in the resulting image are averaged to produce a profile from which the blur width is estimated (Sec. 4.4). Once the blur width is estimated, the method of [7] is used to remove the partial occlusion (Sec. 4.5). 4.1
Preprocessing
Two pre-processing steps are applied before attempting segmentation:
1. Because our model of image formation assumes that the camera's response is linear, we use the method of [2] to undo the effects of its tone mapping function, transforming the camera image I_{input} to a radiance image R_{input}.
2. Before beginning the segmentation, we force the completely occluded region to be the darkest part of the image by subtracting R_p, the radiance of the user-selected point, and taking the absolute value. This gives a new image

R = |R_{input} - R_p|, (3)
which is nearly zero at points in the region of complete occlusion, and higher elsewhere. As a result of this step, points on the boundary between the regions of partial and complete occlusion will have gradients ∇R that point out of the region of complete occlusion. This property will be used to find the boundary between the regions of complete and partial occlusion. 4.2
Foreground Segmentation
While the region of complete occlusion is assumed to have nearly constant intensity, segmenting this region is nontrivial due to the extremely low contrast at the boundary between complete and partial occlusion. In order to produce good segmentations in spite of this difficulty, we use two cues. The first cue is that pixels on the boundary of the region of complete occlusion have gradients of R that point into the region of partial occlusion. This is assured by the pre-processing of Eq. 3, which causes the foreground object to have the lowest radiance in R. The second cue is that points outside of the completely occluded region will generally have intensities that differ from the foreground intensity. To exploit these two cues, we employ the flux maximizing geometric flow of [13], which evolves a 2D curve to increase the outward flux of a static vector field through its boundary. Our cues are embodied in the vector field

\vec{V} = \phi\,\frac{\nabla R}{|\nabla R|}, \quad \text{where } \phi = (1 + R)^{-2}. (4)
The vector field \nabla R / |\nabla R| embodies the first cue, representing the direction of the gradient, which is expected to align with the desired boundary as well as be orthogonal to it.² The scalar field \phi, which embodies the second cue, is near 1 in the completely-occluded region and smaller elsewhere. As noted in [6], an exponential form for \phi can be used to produce a monotonically-decreasing function of R, giving similar results. The curve evolution equation works out to be

C_t = \mathrm{div}(\vec{V})\,\vec{N} = \left( \left\langle \nabla\phi, \frac{\nabla R}{|\nabla R|} \right\rangle + \phi\,\kappa_R \right) \vec{N}, (5)

where \kappa_R is the Euclidean mean curvature of the iso-intensity level set of the image. The flow cannot leak outside the completely occluded region since by construction both \phi and \nabla\phi are nearly zero there. This curve evolution, which starts from a small circular region containing the user-selected point, may produce a boundary that is not smooth in the presence of noise. In order to obtain a smooth curve, from which outward normals can be robustly estimated, we apply a few iterations of the Euclidean curve-shortening flow [4]. While it is possible to include a curvature term in the flux-maximizing flow to evolve a smooth contour, we separate the terms into different flows which are computed in sequence. Both flows are implemented using level set methods [10]; details are given in the Appendix.

² It is important to normalize the gradient of the image so that its magnitude does not dominate the measure outside of the occluded region.
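The static quantities driving this flow can be precomputed on the pixel grid, as in the following Python/NumPy sketch. It only illustrates the speed term of Eq. (5) under finite differences and is not the authors' level-set implementation; the mild pre-smoothing is our own assumption.

import numpy as np
from scipy.ndimage import gaussian_filter

def flow_speed(R, sigma_grad=1.0, eps=1e-8):
    """Return <grad(phi), grad(R)/|grad(R)|> + phi * kappa_R for Eq. (5)."""
    Rs = gaussian_filter(R, sigma_grad)          # smooth before differencing
    gy, gx = np.gradient(Rs)
    mag = np.sqrt(gx ** 2 + gy ** 2) + eps
    nx, ny = gx / mag, gy / mag                  # unit gradient field of R

    phi = (1.0 + R) ** -2                        # scalar field of Eq. (4)
    py, px = np.gradient(phi)

    # curvature of the iso-intensity level sets: kappa_R = div(grad R / |grad R|)
    kappa = np.gradient(nx, axis=1) + np.gradient(ny, axis=0)

    return px * nx + py * ny + phi * kappa       # speed multiplying the normal N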
Fig. 3. (Left) Original image with segmentation boundary (green) and outward-facing normals (blue) along which the image will be re-sampled. (Right) The re-sampled image (scaled), which is used to estimate the blur width.
Once the curve-shortening flow has terminated, we can recover the radiance Rf of the foreground (occluding) object by simply taking the mean radiance value within the segmented (completely occluded) region. Note that we use this instead of Rp , the radiance of the user-selected point, as there may be some low-frequency intensity variation within the region of complete occlusion. 4.3
Boundary Rectification and Profile Generation
One of the difficulties in measuring the blur width is that the boundary of the completely occluded region can have an arbitrary shape. In order to handle this, we re-sample the image R along outward-facing normals to the segmentation boundary, reducing the shape of the occluding contour to a line along the left edge of a re-sampled image Rl . The number of rows in Rl is determined by the number of points on the segmentation boundary, and pixels in the same row of Rl come from points on the same outward-facing normal. Pixels in the same column come from points the same distance from the segmentation boundary on different normals and thus, recalling Eq. 2, have the same α value. The number of columns in the image depends on the distance from the segmentation boundary to the edge of the input image. We choose this quantity to be the largest value such that 80% of the normals remain within the image frame and do not re-enter the completely-occluded region (this exact quantity is arbitrary and the method is not sensitive to variations in this choice). Fig. 3 shows outward-facing surface normals from the contour in Fig. 2, along with the re-sampled image. The task of measuring the width of the partially occluded region is also complicated by the generality of the background intensity. In the worst case, it is impossible (for human observers or our algorithm) to estimate the width if the background has an intensity gradient in the opposite direction of the intensity gradient due to partial occlusion. The measurement is straightforward if the background object had constant intensity, though this assumption is too strong. Given that the blurred region is a horizontal feature in the re-sampled image, we average rows of Rl in order to smooth out high-frequency background
Fig. 4. [Left] Profile generated from the re-sampled image in Fig. 3 (black curve). Model profile P_m^50 with relatively high error (red curve). Model profile P_m^141 with minimum error (green curve). [Right] Fitting error as a function of w'.
texture. While we do not assume a uniform background, we have found it useful to eliminate rows with relatively more high-frequency structure before averaging. In particular, for each row of R_l we compute the sum of its absolute horizontal derivatives,

\sum_{x} \left| R_l(x + 1, y) - R_l(x - 1, y) \right|. (6)

Rows with an activity measure in the top 70% are discarded, and the remaining rows are averaged to generate the one-dimensional blur profile P.
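A minimal sketch of this profile generation step (Eq. (6) plus row selection and averaging) follows, written in Python/NumPy under the assumption that R_l is the re-sampled 2-D array described above; the helper name is ours.

import numpy as np

def blur_profile(R_l, keep_fraction=0.3):
    """Score rows by summed absolute horizontal derivatives, keep the quiet
    30%, and average them into the 1-D profile P."""
    activity = np.abs(R_l[:, 2:] - R_l[:, :-2]).sum(axis=1)   # Eq. (6) per row
    n_keep = max(1, int(keep_fraction * R_l.shape[0]))
    quiet_rows = np.argsort(activity)[:n_keep]                # discard the top 70%
    return R_l[quiet_rows].mean(axis=0)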
4.4 Blur Width Estimation
Given a 1D blur profile P, like the one shown in Fig. 4 (black curve), we must estimate the width w of the partially occluded region. We do this by first expressing P in terms of \alpha. Recalling Eq. 3 and the fact that R_f \approx R_p, we rearrange Eq. 1 to get

R_l(x, y) = (1 - \alpha(x, y))\,R_b^l(x, y) - R_f, (7)

where R_b^l is the radiance of the background object defined on the same lattice as the re-sampled image. The profile P(x) is the average of many radiances from pixels with the same \alpha value, so

P(x) = (1 - \alpha(x))\,R_b^l(x) - R_f, (8)

where R_b^l(x) is the average radiance of background points a fixed distance from the segmentation boundary (which fall in a column of the re-sampled image). As we have removed rows with significant high-frequency structure and averaged the rows of the re-sampled image, we assume that the values R_b^l(x) are relatively constant over the partially-occluded band, and thus

P(x) = (1 - \alpha(x))\,R_b^l - R_f. (9)

Based on this, the blur width w is taken to be the value that minimizes the average fitting error between the measured profile P and model profiles.
The model profile P_m^w for a given width w is constructed by first generating a linear ramp l and then transforming these values into \alpha values by Eq. 2. An example is shown in Fig. 4, where the green curve shows the model profile for which the error is minimized with respect to the measured profile (black curve), and the red curve shows another model profile which has higher error. A plot of the error as a function of w is shown in Fig. 4 (right). We see that it has a well-defined global minimum, which is at w = 141 pixels.
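A sketch of this width search is given below in Python/NumPy. It follows Eq. (9) as printed above with R_b^l and R_f treated as known scalars (in practice R_f comes from the segmented region and R_b^l would itself be estimated); the function name and error metric are our own choices.

import numpy as np

def estimate_blur_width(P, R_bl, R_f, w_range):
    """Return the candidate width w whose model profile best fits P."""
    x = np.arange(P.size, dtype=float)
    best_w, best_err = None, np.inf
    for w in w_range:
        l = np.minimum(2.0, 2.0 * x / w) - 1.0                       # ramp of Eq. (2)
        alpha = 0.5 - (l * np.sqrt(1.0 - l ** 2) + np.arcsin(l)) / np.pi
        model = (1.0 - alpha) * R_bl - R_f                           # Eq. (9)
        err = np.abs(model - P).mean()
        if err < best_err:
            best_w, best_err = w, err
    return best_w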
4.5 Blur Removal
Once the segmentation boundary and the width w of the partially-occluded region have been determined, the value of \alpha can be found using Eq. 2. In order to compute \alpha at each pixel, we must find its distance to the nearest point on the segmentation boundary. We employ the fast marching method of [10]. Recall that the radiance R_f of the foreground object was found previously, so we can recover the radiance of the background at pixel (x, y) according to

R_b(x, y) = \frac{R_{input}(x, y) - \alpha(x, y)\,R_f}{1 - \alpha(x, y)}. (10)
Finally, the processed image Rb is tone-mapped to produce the output image. This tone-mapping is simply the inverse of what was done in section 4.1.
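Eq. (10) amounts to a single vectorized operation, as in the following Python/NumPy sketch. Clipping alpha away from 1 is our own safeguard against amplifying noise near the boundary of complete occlusion and is not part of the paper.

import numpy as np

def remove_partial_occlusion(R_input, alpha, R_f, alpha_max=0.95):
    """Eq. (10): recover the background radiance in partially occluded pixels."""
    a = np.clip(alpha, 0.0, alpha_max)
    return (R_input - a * R_f) / (1.0 - a)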
5 Experimental Results
Fig. 5 shows the processed result from the example image in Fig. 2. The user-selected point was near the center of the completely occluded region, though in our experience the segmentation is insensitive to the location of the initial point. We also show enlargements of a region near the occluding contour to illustrate the details that become clearer after processing. Near the contour, as α → 1, noise becomes an issue. This is because we are amplifying a small part of the input signal, namely the part that was contributed by the background.
Fig. 5. (Left) Result for the image shown in Fig. 2. (Center) Enlargement of processed result. (Right) Enlargement of the corresponding region in the input image.
Fig. 6. Example scene through a small opening. (Left) Input wide-aperture image. (Middle) Output wide-aperture image. (Right) Reference small-aperture image. Notice that more of the background is visible in our processed wide-aperture image.
Fig. 6 shows an additional input and output image pair, along with a reference image taken through a smaller aperture. The photos were taken through a slightly opened door. It is important to note that processing the wide aperture photo reveals background detail in parts of the scene where a small aperture is completely occluded. Namely, all pixels where α > .5 are occluded in a pinhole aperture image, though many of them can be recovered by processing a wide aperture image. In this scene, there are two disjoint regions of complete occlusion, each of which has an adjacent region of partial occlusion. This was handled by having the user select two starting points from which the segmentation flow was initialized, though the method could also have been applied separately to each occluded region. The method described in this paper can also be extended to video processing. In the event that the location of the camera and the occluding object are fixed relative to one another, we need only perform the segmentation and blur estimation on a single frame of the video. The recovered value of α at each pixel (the matte) can be used to process each frame of the video separately. A movie, keyholevideo.mpg, is included in the supplemental material with this submission, and shows the raw and processed frames side-by-side (as in Fig. 1).
6 Conclusion
The examples in the previous section demonstrate how our method automatically measures the blur parameters and removes partial occlusion due to nearby objects. Fig. 6 shows that pictures taken through small openings (such as a fence,
keyhole, or slightly opened door) can be processed to improve visibility. In this and the case of the text image shown in Fig. 5, this method reveals important image information that was previously difficult to see. The automated nature of this method makes the recovery of partially-occluded scene content accessible to the average computer user. Users need only specify a single point in each completely occluded region, and the execution time of 10-20 seconds is likely acceptable. Given such a tool, users could significantly improve the quality of images with partial occlusions. In order to automate the recovery of the necessary parameters, we have assumed that the combination of blurring and under-exposure produces a foreground region with nearly constant intensity. Methods that allow us to relax this assumption are the focus of ongoing future work, and must address significant additional complexity in each of the segmentation, blur width estimation, and blur removal steps.
References
1. Chuang, Y., Curless, B., Salesin, D.H., Szeliski, R.: A Bayesian Approach to Digital Matting. In: CVPR 2001, pp. 264-271 (2001)
2. Debevec, P., Malik, J.: Recovering High Dynamic Range Radiance Maps from Photographs. In: SIGGRAPH 1997, pp. 369-378 (1997)
3. Favaro, P., Soatto, S.: Seeing Beyond Occlusions (and other marvels of a finite lens aperture). In: CVPR 2003, pp. 579-586 (2003)
4. Grayson, M.: The Heat Equation Shrinks Embedded Plane Curves to Round Points. Journal of Differential Geometry 26, 285-314 (1987)
5. Levoy, M., Chen, B., Vaish, V., Horowitz, M., McDowall, I., Bolas, M.: Synthetic Aperture Confocal Imaging. In: SIGGRAPH 2004, pp. 825-834 (2004)
6. Perona, P., Malik, J.: Scale-Space and Edge Detection using Anisotropic Diffusion. IEEE Trans. on Patt. Anal. and Mach. Intell. 12(7), 629-639 (1990)
7. McCloskey, S., Langer, M., Siddiqi, K.: Seeing Around Occluding Objects. In: Proc. of the Int. Conf. on Patt. Recog., vol. 1, pp. 963-966 (2006)
8. McGuire, M., Matusik, W., Pfister, H., Hughes, J.F., Durand, F.: Defocus Video Matting. ACM Trans. Graph. 24(3) (2005)
9. Reinhard, E., Khan, E.A.: Depth-of-field-based Alpha-matte Extraction. In: Proc. 2nd Symp. on Applied Perception in Graphics and Visualization 2005, pp. 95-102 (2005)
10. Sethian, J.A.: Level Set Methods and Fast Marching Methods. Cambridge University Press, Cambridge (1999)
11. Sun, J., Jia, J., Tang, C., Shum, H.: Poisson Matting. ACM Trans. Graph. 23(3) (2004)
12. Tamaki, T., Suzuki, H.: String-like Occluding Region Extraction for Background Restoration. In: Proc. of the Int. Conf. on Patt. Recog., vol. 3, pp. 615-618 (2006)
13. Vasilevskiy, A., Siddiqi, K.: Flux Maximizing Geometric Flows. IEEE Trans. on Patt. Anal. and Mach. Intell. 24(12), 1565-1578 (2002)
Appendix: Implementation Details

For the experiments shown here, we down-sample the original 6MP images to 334 by 502 pixels for segmentation and blur width estimation. Blur removal is
performed on the original 6MP images. Based on this, blur estimation and image processing take approximately 10 seconds (on a 3 GHz Pentium IV) to produce the output in Fig. 5. Other images take more or less time, depending on the size of the completely-occluded region. Readers should note that some of the code used in this implementation was written in Matlab, implying that the execution time could be further reduced in future versions. As outlined in Sec. 4.2, we initially use a flux-maximizing flow to perform the segmentation, followed by a Euclidean curve-shortening flow to produce a smooth contour. For the flux-maximizing flow, we evolve the level set function with time step Δt = 0.1. This parameter was chosen to ensure stability for our 6MP images; in general it depends on image size. The evolution's running time depends on the size of the foreground region. The curve evolution is terminated if it fails to increase the segmented area by 0.01% over 10 iterations. As the flux-maximizing flow uses an image-based speed term, we use a narrow-band implementation [10] with a bandwidth of 10 pixels.
High Capacity Watermarking in Nonedge Texture Under Statistical Distortion Constraint Fan Zhang, Wenyu Liu, and Chunxiao Liu Huazhong University of Science and Technology, Wuhan, 430074, P.R. China {zhangfan,liuwy}@hust.edu.cn
[email protected]
Abstract. A high-capacity image watermarking scheme aims to maximize the bit rate of the hidden information without eliciting perceptible image distortion or facilitating specialized watermark attacks. Texture, in preattentive vision, reveals itself through concise high-order statistics and can hold a large watermark capacity. However, traditional distortion constraints, e.g. just-noticeable-distortion (JND), cannot evaluate texture distortion in visual perception and thus impose an overly strict constraint. Inspired by recent work on image representation [9], which suggests texture extraction and a mixture of probabilistic principal component analyzers for learning texture features, we propose a distortion measure in the subspace spanned by the texture principal components, together with an adaptive distortion constraint that depends on the local roughness of the image. The proposed spread-spectrum watermarking scheme generates watermarked images with a larger SNR than JND-based schemes at the same allowed distortion level, and its watermark has a power spectrum approximately directly proportional to that of the host image, making it more robust against Wiener filtering attacks.
1 Introduction

Image watermarking is applied to copyright protection and covert communication. The efficiency of a watermarking scheme can be measured by its hiding capacity. Generally, a watermarking scheme achieves a high hiding capacity by balancing the tradeoff between the achievable information-hiding rate and the allowed distortion constraints for the information hider and the attacker [1]. In particular, for an additive watermarking scheme, e.g. the spread-spectrum scheme [2], the embedding algorithm is designed to add a watermark of maximal intensity into the host image while satisfying the distortion constraint of the information hider. In this way, the hiding capacity is restricted by the allowed distortion level; information-theoretic analysis [1] has proved this argument. A variety of distortion constraints have been proposed to incorporate certain psychovisual properties of the human visual system (HVS). Perhaps the most popular is just-noticeable-distortion (JND), which was first applied to image compression [3]. JND provides each signal with a threshold level of error visibility below which reconstructed signals are rendered without noticeable distortion. The JND profile of a still image is a function of local signal properties, including background intensity, activity of luminance changes and dominant spatial frequency. For a more accurate JND estimation, edge regions and nonedge regions should be distinguished [4]. An edge is
directly related to image content that demarcates object boundaries, surface creases, and reflectance changes, and thus distortion around edges is more easily noticed. Nonedge texture, therefore, can hide more error than smooth or edge areas. Psychovisual studies by Julesz [5] suggest that image textures with the same second-order statistics are perceived as identical by the human visual system. Among the work on texture synthesis, synthesis-by-analysis methodologies [6~8] verify that texture can be reconstructed by matching feature statistics, where the features are generally the texture's projections onto a set of directions determined by a bank of suitable filters. Meanwhile, resampling approaches [11,12] imply that local neighborhoods can be replaced without introducing perceptible distortion. These works suggest that a distortion constraint should tolerate local diversity while guarding against global differences in statistics. This paper adopts the hypothesis that local texture can be warped more intensely along the directions in which the texture itself has larger variance. The argument may be explained as follows: (1) noise (a watermark) having the same statistics as the texture will cause less perceptible distortion than white noise, and (2) when warped along a direction of large variance, the texture patch is more likely to resemble another patch somewhere else, so the warp is equivalent to a replacement and may be allowable. Inspired by HIMPA (Hybrid ICA-Mixture of PPCA Algorithm) [9], which uses an independent component analysis (ICA) model for edge representation followed by a mixture of probabilistic principal component analyzers (MPPCA) for textural surface representation, this paper proposes (1) a nonedge texture extraction method based on a three-factor image model, and (2) a texture distortion measure based on the texture's principal components, together with an adaptive distortion constraint that depends on local image roughness, so that a spread-spectrum watermark of maximal intensity can be embedded into the nonedge regions of the host image to achieve a high-capacity watermarking scheme. The watermarking scheme is introduced in Section 2. Selected experimental results are presented in Section 3. The paper closes with concluding remarks in Section 4.
2 Proposed Scheme

Our scheme is shown in Fig. 1. Without loss of generality, we embed one bit of message m, whose value is either -1 or +1, into the image block vector x. Corresponding to a secret key k, p is a pseudorandom noise sequence with zero mean whose values are equal to +1 or -1. Modulated by m, the sequence p is added to or subtracted from x with pixel-wise weights to form the watermarked image block vector y as

y = x + m\,\mathrm{diag}(p)\,\gamma, (1)

where diag(A) is a diagonal matrix having vector A as its diagonal. The pixel-wise weight vector \gamma represents the intensity of the watermark, and it is maximized under the adaptive distortion constraint

d(y, x) < D_w, (2)

which will be analyzed in Section 2.1.
The watermarked image is possibly corrupted by an attacker's noise. We only consider additive noise, as in Eq. (3):
z = y + n. (3)
The receiver knows the secret key k and can regenerate p, and then the detection is performed. First, the (normalized) correlation is calculated as

r = \langle z, p \rangle = m \sum \gamma + \langle x, p \rangle + \langle n, p \rangle, (4)

where \langle A, B \rangle denotes the correlation, i.e., the inner product of sequences A and B. The two latter terms in (4) can be neglected due to the independence between x and p and between n and p; the watermark is then estimated as the sign of r:

\hat{m} = \mathrm{sign}(r). (5)
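The embedding and detection steps of Eqs. (1)-(5) reduce to a few lines of code, as in the Python/NumPy sketch below. The key-seeded generation of the +/-1 sequence is our own assumption about how p would be reproduced at the detector.

import numpy as np

def embed(x, m, gamma, key):
    """Eq. (1): y = x + m * diag(p) * gamma, with p a +/-1 PN sequence."""
    rng = np.random.default_rng(key)
    p = rng.integers(0, 2, size=x.shape) * 2 - 1     # pseudorandom +/-1 sequence
    return x + m * p * gamma, p

def detect(z, key):
    """Eqs. (4)-(5): correlate the received block with p and take the sign."""
    rng = np.random.default_rng(key)
    p = rng.integers(0, 2, size=z.shape) * 2 - 1
    return int(np.sign(np.dot(z, p)))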
Fig. 1. The proposed watermarking scheme
2.1 Nonedge Texture Extraction
For the extraction of nonedge texture, we propose a three-factor image model. Three factors are assumed to contribute to the appearance of a natural image: (a) the objects' intrinsic luminance and shading effects, (b) dominant edges, and (c) nonedge texture. Differently from the HIMPA model [9], our model isolates the low-frequency band of the image, x_L, as factor (a), and uses the HIMPA model to analyze the high-frequency band x_H so as to discriminate factor (b) from factor (c). Factor (b) is the edge region of x_H, while factor (c) is the surface region. The reason for isolating factor (a) is that the clustering of MPPCA will then not be affected by the local average luminance and will thereby depend more on local contrast. Following HIMPA [9], we use independent component analysis (ICA) to extract nonedge texture from the high-frequency band of the image. An image patch x_H is represented as a linear superposition of the columns of the mixing matrix A plus a residual noise term which can be neglected:

x_H = A s. (6)
The columns of A are learned from natural images by making the components of the vector s statistically independent. In the learning algorithm, A is constrained to be an orthogonal transformation, so the vector s is calculated by
s = A^T x_H. (7)
The elements of s with large magnitudes signify the presence of structure in an image block, corresponding to edges, oriented curves, etc. The elements of s with smaller magnitudes represent the summation of a number of weak edges yielding nonedge texture. For each image block, we divide A into two groups, A_E and A_T, such that the absolute values of the elements of s corresponding to the columns of A_E are all larger than those corresponding to A_T. The image block is then decomposed into two subcomponents,

x_H = x_E + x_T, (8)

where

x_E = A_E (A_E^T A_E)^{-1} A_E^T x_H. (9)
The ratio between the number of columns of A_T and of A is determined by the ICA threshold. In this paper, the ICA threshold is 0.375 and A has 160 columns; therefore A_E has 100 columns, while A_T has 60 columns. Owing to this ratio control by the ICA threshold, the nonedge texture is the residual after removing the relatively dominant edges, rather than only the absolutely sharp edges as in HIMPA [9]; therefore, much more intense nonedge texture is extracted from rough regions than from smooth regions. Figure 2 illustrates the process of nonedge texture extraction.

2.2 Statistical Distortion Measure
The basic idea of the proposed measure is to evaluate distortion in the subspace spanned by the texture's principal components. PPCA is a suitable stochastic model for a homogeneous nonedge region, as suggested by HIMPA [9]. MPPCA is able to model more complex nonedge regions in a real image scene thanks to its use of clustering. We describe the salient features of the MPPCA model here; the details are contained in [9]. The nonedge texture is assumed to be sampled from a limited number of clusters. Each cluster is assumed homogeneous, and an efficient basis can be constructed, where a texture block from cluster k is generated using the generative model

x_k = W_k s_k + \mu_k + \varepsilon_k, \quad k = 1, \ldots, K, (10)

where x_k \in R^a is a column vector elongated from the host image block and has dimension a, and s_k \in R^q is the lower-dimensional source manifold, assumed to be Gaussian distributed with zero mean and identity covariance. Note that, in this section, an image block always refers to a nonedge texture block. The dimension of s_k is q, with a > q. W_k is an a-by-q mixing matrix, \mu_k is the cluster observation mean, and \varepsilon_k is Gaussian white isotropic noise with zero mean, i.e. \varepsilon_k \sim N(0, \sigma^2 I). Hence, x_k conforms to the distribution N(\mu_k, W_k W_k^T + \sigma^2 I). W_k, s_k, \mu_k, and \varepsilon_k are hidden variables that need to be estimated from the observed texture data. \mu and the columns of W for one cluster are visualized as 8×8 blocks at the bottom right of Figure 2.
Fig. 2. Three-factor model for nonedge extraction and features learned by MPPCA
We define the distortion between image block vectors x and y as the quadratic form

d(y, x) = E_{xy}^2 = (x - y)^T C^{-1} (x - y), (11)
where C is the covariance of the maximum-likelihood cluster to which x belongs, and

C = \sigma^2 I + W W^T. (12)
Formula (11) can be transformed into a standard quadratic form according to [10]:

d(y, x) = v^T D^{-1} v, (13)
where v is the projection of x - y, i.e. v = U^T(x - y). If the sample covariance matrix learned by MPPCA has its eigenvalues arranged in descending order, denoted by diag(\lambda_1, \lambda_2, \ldots, \lambda_q, \ldots, \lambda_a), then U is the corresponding eigenvector matrix, D^{-1} is the diagonal matrix diag(1/\lambda_1^2, 1/\lambda_2^2, \ldots, 1/\lambda_q^2, 1/\sigma^2, \ldots, 1/\sigma^2), and \sigma is obtained by averaging the last a - q eigenvalues. Similarly to a principal component analyzer (PCA), the transform U^T projects the observation onto a set of orthogonal directions across which the observations have large variation. In fact, formula (13) reduces to the mean square error metric when D is the identity matrix, so the measure is a generalization of the mean square error metric. It is noteworthy that formula (13) resembles a Karhunen-Loeve Transform (KLT) norm, but the substantial difference is that formula (13) is a measurement in an orthogonal space estimated from the covariance of the whole set of observations, rather than from a single observation's covariance. Therefore, formula (13) is a statistical measurement.

2.3 Adaptive Distortion Constraint
We define the adaptive distortion constraint by setting D_w as

D_w = \alpha\,(x_T - \mu)^T C^{-1} (x_T - \mu), (14)
where \alpha is a positive coefficient scaling the upper limit of the distortion, and \mu is the mean of the maximum-likelihood cluster to which x_T belongs. Because x_T tends to be closer to \mu in smooth regions than in rough regions, D_w for a smooth region is generally smaller than for a rough region; D_w is therefore loosened for rough regions, which are usually full of intense texture. According to (4), optimizing the watermark hiding capacity amounts to maximizing the sum of \gamma under the quadratic inequality constraint given by (2) and (13). Using a Lagrange multiplier, we obtain the solution

\tilde{\gamma} = \frac{D_w\,\mathrm{diag}(p)\,C p}{p^T C p}. (15)
We also consider the luminance masking effect of JND theory, where the luminance masking term T_L conforms to the definition in [4]. The final solution \gamma for each pixel is then

\gamma = \tilde{\gamma} + T_L. (16)
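The per-block computation of Eqs. (12)-(16) is summarized in the Python/NumPy sketch below. It follows the fraction form of Eq. (15) as printed above; the function name and argument layout are ours, and mu, W and sigma2 are assumed to come from the maximum-likelihood MPPCA cluster of the block.

import numpy as np

def block_watermark_intensity(x_T, p, mu, W, sigma2, alpha, T_L):
    """Build C (Eq. 12), the adaptive bound D_w (Eq. 14), and gamma (Eqs. 15-16)."""
    C = sigma2 * np.eye(len(x_T)) + W @ W.T          # Eq. (12)
    C_inv = np.linalg.inv(C)
    D_w = alpha * (x_T - mu) @ C_inv @ (x_T - mu)    # Eq. (14)
    Cp = C @ p
    gamma_tilde = D_w * (p * Cp) / (p @ Cp)          # Eq. (15): diag(p) C p scaled
    return gamma_tilde + T_L                         # Eq. (16)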
3 Experimental Results

The ICA mixing matrix is estimated beforehand using samples from a training set of 13 natural images downloaded with the FastICA package [17]. Note that the ICA mixing matrix is independent of the host images and need not be held by the watermark detector. At the watermark encoder, after 8×8 averaging filtering and extraction by ICA coding, the nonedge texture is obtained and then partitioned into 8×8 blocks and vectorized into 64×1 vectors. The MPPCA model has eight clusters with 4 principal components in each cluster; the program code was developed based on the NETLAB Matlab package [13]. Then, the adaptive distortion constraint D_w and the maximal intensity \gamma for each pixel are calculated, taking into account the luminance masking of JND. Finally, the spread-spectrum watermark is embedded. At the watermark decoder, correlation detection alone is enough to extract the watermark message, without any image analysis or attack estimation. The proposed scheme is therefore a blind watermarking scheme that needs no side information.

3.1 Distortion

We compare our scheme with the spread-spectrum watermarking schemes based on the spatial JND model [4] and the wavelet JND model [14]. We measure the watermark intensity by the signal-to-noise ratio (SNR) of the watermarked image relative to the host image. SNR is also used as an indirect measure of the hiding capacity for a spread-spectrum watermark, because a scheme with a more intense watermark supports a higher bit ratio of watermark to host image under the same detection error rate at the watermark decoder. In our experiment, the watermarked images of the three schemes are given the same SNR (Baboon 20.1 dB, Bridge 21.7 dB, and Lena 27.3 dB) by adjusting their parameters, so that experimenters can make a subjective distortion evaluation and further compare the distortion levels allowed under the different constraints. The watermarked image patches are shown in Figure 3.
Fig. 3. Watermarked patches (rows: Baboon, Bridge, Lena). (a) Host image, (b) our scheme, (c) spatial JND, (d) wavelet JND.
Fig. 4. Watermark of our scheme. The top row contains host images, and the bottom contains watermarks, where dark pixels denote negative signals and light pixels denote positive signals.
In the experiment, the two JND-based schemes expose more distortion than our scheme when embedding a saturated watermark. The spatial JND scheme (Figure 3c) reveals noticeable noise both across sharp edges and on smooth surfaces. The wavelet JND scheme (Figure 3d) fails to keep image quality around fine edges, which correspond to high-frequency wavelet bands. Moreover, the wavelet JND scheme generates glaring dark or light speckles in texture regions where the inverse wavelet transform of the heavily watermarked coefficients exceeds the valid range. Our scheme has the following advantages. First, it does not raise the watermark intensity at sharp edges, owing to the three-factor image coding performed before watermarking. Second, it prefers to embed the watermark into rough regions rather than smooth regions, owing to the adaptive distortion constraint D_w. Last, it boosts the intensity of watermark components that conform to the principal components of the texture and suppresses nonconforming components. These characteristics are clearly exhibited in the watermark signal shown in Figure 4. Since our scheme exploits the texture region, its advantages are less obvious for images with sparse texture, e.g., the image Lena.

3.2 Robustness Against Estimation-Based Attack

Sophisticated attackers can mount an estimation-based attack if they can obtain some prior knowledge of the host image or the watermark's statistics [16]. Image denoising provides a natural way to develop estimation-based attacks, where the watermark is treated as noise. Given the power spectra of the host image and the watermark, one of the most malevolent attacks is denoising by an adaptive Wiener filter. Hence the most robust watermark should have a power spectrum directly proportional to the power spectrum of the host image [16]. We compare the power spectra of the watermarks produced by the three schemes in Figure 5.
Fig. 5. Power spectrum of image Baboon by Fourier analysis. (a) Host image. (b)–(d) Watermark signals: (b) our scheme, (c) spatial JND scheme, (d) wavelet JND scheme. Each signal is normalized to zero mean and unit variance before Fourier analysis. The luminance denotes the logarithmic amplitude of the Fourier components, and the zero-frequency component is shifted to the center.
It is clear that the spatial JND scheme generates a nearly white watermark, and the wavelet JND scheme embeds its watermark mainly in the middle-frequency components. Among the three schemes, the watermark power spectrum of our scheme most closely resembles the power spectrum of the host image and is nearly directly proportional to it. Our scheme therefore offers a suboptimal solution against the estimation-based attack of the Wiener filter.
4 Conclusion

The proposed watermarking scheme can hide more information than traditional schemes because it loosens the distortion constraint for nonedge texture, and it is also robust against the Wiener filtering attack because the watermark power spectrum resembles that of the host image. Meanwhile, the watermark detection can be blind and fast. A high-capacity watermark should balance the tradeoff between the bit ratio of watermark to host image and the detector's error rate. Although this paper presents a one-bit watermarking scheme with maximal watermark intensities, it also provides a potential design for high-capacity watermarking, since a multiple-bit watermark only requires choosing a suitable number of image blocks for embedding each bit. Our scheme only considers the attack of additive noise. Well-known strategies against geometric attacks and smarter detection algorithms may be merged into the proposed scheme to resist more sophisticated attacks.

Acknowledgement. This work was supported by NSFC (No. 60572063), the Doctoral Fund of the Ministry of Education of China (No. 20040487009), and the Cultivation Fund of the Key Scientific and Technical Innovation Project of the Ministry of Education of China (No. 705038). The authors would like to thank Nikhil Balakrishnan and Yizhi Wang for discussions.
References

1. Moulin, P., O'Sullivan, J.A.: Information-theoretic analysis of information hiding. IEEE Trans. Information Theory 49, 563–593 (2003)
2. Cox, I.J., Kilian, J., Leighton, F.T., Shamoon, T.: Secure spread spectrum watermarking for multimedia. IEEE Trans. Image Proc. 6, 1673–1687 (1997)
3. Jayant, N.: Signal compression: Technology targets and research directions. IEEE J. Select. Areas Commun. 10, 314–323 (1992)
4. Yang, X.K., Li, W.S., Lu, Z.K., et al.: Motion-compensated residue preprocessing in video coding based on just-noticeable-distortion profile. IEEE Trans. Circuits & System Video Tech. 15, 742–752 (2005)
5. Julesz, B.: Visual pattern discrimination. IRE Trans. Inf. Theory 8, 84–92 (1962)
6. Heeger, D., Bergen, J.: Pyramid-based texture analysis/synthesis. In: Proc. ACM SIGGRAPH, pp. 229–238 (1995)
7. Zhu, S.C., Wu, Y.N., Mumford, D.B.: Filters, Random Fields, and Maximum Entropy (FRAME) – Towards a Unified Theory for Texture Modeling. Int'l Journal of Computer Vision 27, 107–126 (1998)
8. Portilla, J., Simoncelli, E.P.: A parametric texture model based on joint statistics of complex wavelet coefficients. Int'l Journal of Computer Vision 40, 49–71 (2000)
9. Balakrishnan, N., Hariharakrishnan, K., Schonfeld, D.: A new image representation algorithm inspired by image submodality models, redundancy reduction, and learning in biological vision. IEEE Trans. Pattern Analysis & Machine Intelligence 27, 1367–1378 (2005)
10. Tipping, M., Bishop, C.: Mixtures of Probabilistic Principal Component Analyzers. Neural Computation 11(2), 443–482 (1999)
11. Lefebvre, S., Hoppe, H.: Parallel controllable texture synthesis. In: ACM SIGGRAPH, pp. 777–786 (2005)
12. De Bonet, J.S.: Multiresolution sampling procedure for analysis and synthesis of texture images. In: ACM SIGGRAPH, pp. 361–368 (1997)
13. Nabney, I., Bishop, C.: Netlab Neural Network Software (July 2003), http://www.ncrg.aston.ac.uk/netlab
14. Huang, X.L., Zhang, B.: Perceptual watermarking using a wavelet visible difference predictor. IEEE ICASSP 2, 817–820 (2005)
15. Liu, W.Y., Zhang, F., Liu, C.X.: Spread-Spectrum Watermark by Synthesizing Texture. In: Pacific-Rim Conf. on Multimedia (to be published, 2007)
16. Voloshynovskiy, S., Pereira, S., Pun, T., et al.: Attacks on Digital Watermarks: Classification, Estimation-Based Attacks, and Benchmarks. IEEE Communications Magazine 39, 118–126 (2001)
17. Hyvärinen, A.: FastICA Matlab package (April 2003), http://www.cis.hut.fi/projects/ica/fastlab
Attention Monitoring for Music Contents Based on Analysis of Signal-Behavior Structures
Masatoshi Ohara1,2, Akira Utsumi1, Hirotake Yamazoe1, Shinji Abe1, and Noriaki Katayama2
1 ATR Intelligent Robotics and Communication Laboratories, 2-2-2 Hikaridai, Seikacho, Sorakugun, Kyoto 619-0288, Japan
2 Osaka Prefectural College of Technology, 26-12 Saiwaicho Neyagawashi, Osaka 572-8572, Japan
Abstract. In this paper, we propose a method to estimate user attention to displayed content signals through a temporal analysis of the users' exhibited behavior. Detecting user attention and controlling contents are key issues in our "networked interaction therapy system" that effectively attracts the attention of memory-impaired people. In our proposed method, user behavior, including body motions (beat actions), is detected with auditory/vision-based methods. This design is based on our observations of the behavior of memory-impaired people under video watching conditions. User attention to the displayed content is then estimated based on body motions synchronized to audio signals. Estimated attention levels can be used for content control to attract deeper attention of viewers to the display system. Experimental results suggest that the proposed method effectively extracts user attention to musical signals.
1 Introduction
Human behavior is considered to have a close relation with mental and/or physical states, intentions, and individual interests. For instance, when a person is shopping, eye movement may provide significant information regarding interest in specific products in the shop. Since the issue of human state/intention estimation based on behavior is related to a wide range of application domains, estimation mechanisms have been widely investigated in the fields of computer vision and pattern recognition [1,2]. Generally speaking, however, achieving reliable behavior-based state/intention estimation is hard without domain-specific knowledge, because there is a wide variety of human behaviors and environment/context dependencies. On the other hand, estimating human attention and interest in a specific stimulus is relatively easy when the timing, position, and content of the stimulus are known. In this paper, we propose a system to estimate a viewer's interest in displayed content based on viewer behavior. We focus on a method to estimate viewer attention levels to music contents by observing the synchronous behavior of users to the music signals. In a human-computer interaction task, for instance, attracting and retaining the motivation of users becomes significant for extracting positive reactions. To
achieve this, the system has to estimate individual concentration levels and control the style and amount of displayed information. Since reactions vary from user to user, such control must occur dynamically. The same situation prevails in our "networked interaction therapy system," which effectively attracts the attention of memory-impaired people and lightens the burden on helpers or family members [3]. "Networked interaction therapy" requires that the system provide remote communication with helpers and family members as well as video contents and other services. To attract the attention of users for long periods of time, the system has to detect user behaviors and control the order and timing of the provided services based on estimates of their concentration levels. Prior to this research, we presented several video contents to memory-impaired people and analyzed their behaviors while they watched video and music contents. We observed the following positive reactions: head nodding, laughing, singing, and clapping. Based on these results, we implemented a content switching mechanism based on user head movements: we estimated user concentration levels on the displayed video contents from head directions and controlled the switching of contents to retain more user attention to the video. However, human behavior can change quickly depending on the content category, and this tendency is especially remarkable for musical contents, which do not require user attention to the video display. Therefore, in this paper, we examine the possibility of estimating viewer concentration levels on presented musical contents using user "beat" actions. "Beat" is a basic property by which humans recognize music, and there are close relations between a music's "beat" and human behavior. For example, Shiratori et al. analyzed the relation between the "beat" structure in music and human dance behavior. Since synchronized motion to music is a commonly observed human behavior when listening to music, it should serve as a useful observable feature for estimating user attention to and interest in music. As we mention below, synchronized motion to music contents is not always observed for all users. However, we consider that this cue usefully complements other cues, such as head movements, in concentration level estimation. In the next section, we briefly summarize our "networked interaction therapy" project. Section 3 describes our observations of the TV watching behavior of dementia patients. Section 4 explains the framework of our attention estimation, and Section 5 shows the experimental results. Section 6 concludes this paper.
2 Information Therapy Project
Memory is frequently impaired in people with such acquired brain damage problems as encephalitis, head trauma, subarachnoid haemorrhage, dementia, and cerebral vascular injury. Such people have difficulty leading normal lives due to memory impairment or higher brain dysfunction; consequently, constant care and attention impose a heavy burden on their families. Networked Interaction Therapy, the name of our method that relieves the stress suffered by memory-impaired
people and their family members, provides easy access to the services of networked support groups [3]. The main goals of Networked Interaction Therapy include supporting the daily activities of memory-impaired people and reducing the burden on their families. Its primary function is to automatically detect the intentions of a memory-impaired person. The therapy provides the necessary information to guide the individual to comfortable situations on behalf of family members before he or she begins to experience behavioral problems such as wandering at night, incontinence, temper tantrums, and so on. These anxieties are often caused by a lack of information. A system based on Networked Interaction Therapy needs to detect situations where communication can help the individual overcome the difficulties encountered in daily activities. In this paper, we describe a method to estimate user attention/interest levels in presented contents. For instance, the estimation results can be used to switch contents effectively so as to hold user interest/attention. In the next section, we describe typical behavior of memory-impaired people observed in experiments with video content display.
3 Video Watching Behavior of Dementia People
Before system implementation, we examined the behavior of actual dementia patients watching TV through observation experiments [4]. Here we briefly summarize the results. In the experiments, we prepared both interesting and uninteresting video programs for the subjects based on their pre-examined preferences, as well as personalized reminiscence videos produced from photographs selected from their photo albums [5], and sequentially displayed these video contents to them. We performed experiments with eight subjects, including three (A, B, C) living at home with their families and five in nursing facilities (D-H). Table 1 shows brief profiles of the subjects. Though the observed behavior depends on the individuals and their symptom levels, behavior in which subjects look away from the TV during "uninteresting" contents is very common. We observed significant differences in watching time for particular subjects. Based on this observation, we consider face orientation a crucial behavior related to attention/interest in video contents. The following are additional typical behaviors related to the level of attention/interest.

Examples of utterances:
– presenting positive or negative utterances about the displayed video contents.
– explaining the contents of the reminiscence video to family members.
– singing songs from the video.

Both positive and negative words are observed in the subjects' utterances. Thus, it is difficult to estimate subject interest in the displayed contents only from the existence of utterances.
Table 1. Brief subject profiles

Subject  Age  Case History          Preference
A        62   cerebral contusion    Japanese chess, music
B        81   Alzheimer's disease   train travel, music
C        69   cerebral infarction   baseball games, music
D        83   cerebral dementia     children's songs
E        90   senile dementia       N/A
F        89   Alzheimer's disease   movies
G        89   Alzheimer's disease   music
H        92   senile dementia       N/A
Examples of hand motions:
– beating time with the hands to music contents.
– pointing at the TV while viewing reminiscence videos and news programs.

The "beating time with hands to music" action is considered a typical positive reaction to the displayed contents. However, subject A, for instance, presented the "beat" action without gazing at the TV and left the room just after making this reaction. In addition, in one case a "pointing" action was observed together with a negative utterance about the contents. So hand motions can be either "positive" or "negative" reactions, and estimating user concentration levels is difficult from the existence of hand motions alone. The overall summary of the observations is as follows.
– Effective content switching attracts user attention to the displayed contents longer and can be realized by measuring user attention levels with face directions.
– Although user utterances contain information regarding user interest in the displayed contents, estimating the user interest level is hard from only the presence of utterances.
– Although the frequency depends on individuals, reactions such as hand beckoning, clapping, and singing were observed while subjects watched the video contents.

The above results suggest that user interest in contents can be estimated from the direction of user faces, the content of utterances, and hand motions. We previously developed a content switching system based on user face orientations and confirmed its effectiveness through experiments. However, in some cases, such as music contents that can be enjoyed without being constantly watched, estimating user interest levels from face orientations is difficult. Therefore, the following sections examine a method to estimate user interest levels from another cue, i.e., body motions synchronous to the displayed music signals.
4 Attention Monitoring Using Image and Audio Analysis
From the results of the previous section, we found that laughing, singing, and clapping can be observed as "positive" reactions. Therefore, we developed a system to estimate user attention to displayed music contents based on the synchronization between user behavior and the displayed music. Here, the "tempo" of user behavior is extracted with image and audio analysis. Assuming a known temporal structure for music signals, such as MIDI signals, we could directly compare user behavior with such structures. However, since temporal structures are generally unknown for normal TV/radio programs, CDs, DVDs, etc., we first extract the "beat" structure from the music signal through frequency analysis and then compare it with the viewer's body motions.

4.1 System Configurations
Figure 1 shows a diagram of the proposed system. The music signal output from the speaker system is simultaneously fed into the system through an A/D converter, processed with an FFT to extract the frequency elements, and sent to the beat extraction process. During beat extraction, a voting process determines the dominant (fundamental) tempo value of the input music signal. Cyclic user behavior is detected as follows. User motions are observed through both image and audio analysis. Image analysis detects user movements as the number of changed pixels in inter-frame subtraction, and user "beat" actions can be extracted as cyclic changes. Audio analysis extracts the sound related to user motions (e.g., clapping) from the input audio signal by subtracting the original music signal with an LMS algorithm. The "beat" of the user motion is then determined from the extracted sound signal. Finally, we compare the tempos of the user behavior and the music signal and estimate the user's attention (concentration) level to the music contents from the degree of synchronicity. The sections below explain the details of the above processes.
Fig. 1. System diagram (speaker, microphone, and camera inputs feed FFT, beat extraction, clapping-noise extraction, frame subtraction, and rhythm estimation modules)
Fig. 2. Input music signal and results of frequency analysis (bands: 0–250 Hz, 250–500 Hz, 500 Hz–1 kHz, 1–2 kHz, 2–4 kHz)
Fig. 3. Voting results for tempo estimation (Music No. 3)

4.2 Beat Detection
First, we describe how to detect the beat information in a music signal. The proposed method employs a frequency-based beat extraction algorithm [6,7]. The input signal is converted into the frequency domain with an FFT analysis every 1/32 s and separated into the following five frequency bands: 0–250 Hz, 250–500 Hz, 500 Hz–1 kHz, 1–2 kHz, and 2–4 kHz. We detect the envelope of each frequency band and extract the "beat" as common rising points across multiple frequency bands. Figure 2 shows example results for a pop song.
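A minimal sketch of this band-wise beat extraction is given below, assuming a mono audio array and its sample rate; the window length (equal to the hop), the envelope definition (per-band energy), and the requirement that at least three bands rise simultaneously are illustrative assumptions not specified in the text.

```python
import numpy as np

BANDS = [(0, 250), (250, 500), (500, 1000), (1000, 2000), (2000, 4000)]

def band_envelopes(audio, sr, hop=1 / 32):
    """Short-time band energies computed every 1/32 s."""
    n = int(sr * hop)
    frames = audio[: len(audio) // n * n].reshape(-1, n)
    spec = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(n, 1 / sr)
    return np.stack([spec[:, (freqs >= lo) & (freqs < hi)].sum(axis=1)
                     for lo, hi in BANDS])          # shape (5, n_frames)

def detect_beats(env, min_bands=3):
    """Mark frames where the envelope rises simultaneously in several bands."""
    rising = np.diff(env, axis=1) > 0
    return np.flatnonzero(rising.sum(axis=0) >= min_bands) + 1
```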
4.3 Tempo Estimation
In this section, we explain our tempo estimation method. The current implementation assumes that the music tempo stays in a range of 60 to 145 bpm (the range into which most pop music falls), and the tempo and the phase of the "beat" are determined using a voting algorithm. Figure 3 shows the result for a country song of 81 bpm.
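A possible realization of the voting step is sketched below: for every candidate tempo in 60–145 bpm and every phase offset, it counts how many detected beat frames fall near the predicted beat grid and keeps the best pair. The scoring rule and tolerance are assumptions; the paper does not detail its voting function.

```python
import numpy as np

def vote_tempo(beat_frames, frame_rate=32.0, bpm_range=(60, 145), tol=1):
    """Vote for the (tempo, phase) pair that best explains the detected beat frames."""
    beat_frames = np.asarray(beat_frames)
    best, best_score = None, -1
    for bpm in range(bpm_range[0], bpm_range[1] + 1):
        period = 60.0 * frame_rate / bpm                # beat period in frames
        for phase in range(int(period)):
            # distance of each detected beat to the nearest predicted beat position
            d = np.abs(((beat_frames - phase) + period / 2) % period - period / 2)
            score = int(np.sum(d <= tol))
            if score > best_score:
                best, best_score = (bpm, phase), score
    return best, best_score
```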
4.4 Detection of Body Behavior by Image
For body motion detection, silhouette-based methods [8,9], model-based methods [10], etc. have previously been proposed. However, since it is difficult to target a specific body part for motion extraction in advance, and only a relatively low-frequency one-dimensional signal must be extracted for "beat" extraction, we employ simple motion analysis based on inter-frame subtraction. Here, the magnitude of motion is measured as the number of moving pixels N_t. A "beat action," in which the user moves hands or feet, can be detected at the times when the number of moving pixels N_t reaches a minimum:

beat = 1 if dN_{t-1}/dt < 0, dN_t/dt ≥ 0, and N_t ≈ 0; 0 otherwise.    (1)
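The detection of Eq. (1) can be sketched as follows on a stack of grayscale frames; the motion threshold and the "close to zero" limit are illustrative assumptions.

```python
import numpy as np

def beat_actions(frames, motion_thresh=20, near_zero=0.05):
    """Inter-frame subtraction gives the moving-pixel count N_t; a beat is
    flagged where N_t stops decreasing and is close to zero (Eq. (1))."""
    diffs = np.abs(frames[1:].astype(int) - frames[:-1].astype(int))
    n_t = (diffs > motion_thresh).reshape(len(diffs), -1).sum(axis=1)
    limit = near_zero * n_t.max() if n_t.max() > 0 else 0
    d = np.diff(n_t)
    beats = [t for t in range(1, len(n_t) - 1)
             if d[t - 1] < 0 and d[t] >= 0 and n_t[t] <= limit]
    return n_t, beats
```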
Fig. 4. Examples of "beat" action extractions (top: hand, middle: foot, bottom: head)

Fig. 5. Separation of user origin sound through the LMS algorithm ((1) original music signal, (2) sound data captured with microphone, (3) extracted user-origin sound)
Figure 4 shows examples of extracted beat actions. Here, the person expressed the beat by using a hand (left), a foot (middle), and nodding (right).
4.5 Extraction of User Origin Sound
The observed audio signal contains the displayed music signal output from the speaker in addition to sound related to user actions such as clapping. Here, we separate these signals with an LMS algorithm [11], which is commonly used for echo canceling in TV conferencing systems. Generally, sound output from a speaker travels along multiple reflection paths before it reaches a microphone. The LMS algorithm models this echo path as a linear system and determines its impulse response with a steepest-descent method as follows:

h_N(k + 1) = h_N(k) + α e(k) x(k − N).    (2)
Here, x(k) is the observation at time k, h_N(k) is the N-th filter coefficient at time k, e(k) is the estimation error for the observation at time k, and α is a step gain. We estimate the music signal level in the input audio signal through the estimated model and extract the sound related to user motions by a subtraction process. Figure 5 shows the results of the extraction process: Figure 5 (top) shows the original music signal, Figure 5 (middle) shows the signal observed with the microphone, and Figure 5 (bottom) shows the subtraction result based on the LMS algorithm. As can be seen, the sound signal related to user motion is properly extracted by the above process. We then estimate the "tempo" of user motions with a process similar to that in Section 4.2.
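A minimal sketch of this subtraction process is given below, assuming the reference (loudspeaker) signal and the microphone signal are time-aligned sample arrays of the same length; the filter length and step gain are illustrative choices, and the loop implements the plain update of Eq. (2) rather than a tuned echo canceller.

```python
import numpy as np

def lms_cancel(reference, observed, taps=256, alpha=1e-4):
    """Subtract the loudspeaker music from the microphone signal with the LMS
    update of Eq. (2), leaving the user-origin sound (clapping, etc.)."""
    h = np.zeros(taps)
    residual = np.zeros(len(observed))
    for k in range(taps, len(observed)):
        x = reference[k - taps:k][::-1]      # most recent reference samples
        e = observed[k] - h @ x              # estimation error = user-origin sound
        h += alpha * e * x                   # h_N(k+1) = h_N(k) + alpha * e(k) * x(k - N)
        residual[k] = e
    return residual
```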
4.6 Synchronous Judgment
Finally, we judge the synchronicity between the estimated tempos of the music signal and the user's "beats." First, we estimate the "tempo" of user motions by applying the voting algorithm described in Section 4.3 to the "beat" information (image and sound). Then we compare the estimation results with the music "tempo," and if the difference is less than 10 bpm, we determine that both signals are synchronized. As mentioned in the next section, we estimate the song to which a user is listening as the one whose tempo differs from the user's motion tempo by less than 10 bpm, with the difference being minimal among multiple candidates (if any):

result = i, if |b_music,i − b| < threshold and |b_music,i − b| < |b_music,j − b| for all j ≠ i; −1 otherwise.    (3)

In addition, a user's "beat" action sometimes occurs once every two beats of the music signal. Therefore, we may have to consider two signals synchronized if the tempo of one signal is an integral multiple of the other.
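Eq. (3) amounts to a nearest-tempo test with a rejection threshold, as in the following sketch (the 10 bpm threshold comes from the text; handling of integral-multiple tempos is left out):

```python
def judge_sync(user_bpm, music_bpms, threshold=10):
    """Return the index of the music whose tempo is both within the threshold
    of the user's tempo and closest among all candidates; -1 otherwise."""
    diffs = [abs(b - user_bpm) for b in music_bpms]
    best = min(range(len(diffs)), key=diffs.__getitem__)
    return best if diffs[best] < threshold else -1
```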
5 Experimental Results
First, we examined whether user "beat" motions synchronized to the displayed music signals could be extracted. Figure 6 shows the voting results of tempo extraction for clapping while listening to the music of Figure 3 (Figure 6, left) and for random clapping (Figure 6, right). As can be seen, the "beat" action case has a clear peak, while the non-"beat" action case does not. This suggests that the presence of synchronized behavior can be distinguished from other behavior. Figure 7 shows the difference in the number of detected synchronous "beats" between "beat" and non-"beat" actions over a longer sequence. Since a non-"beat" action has no cyclic property and is not synchronized to the music signal, "beat" and non-"beat" motions can be distinguished with the proposed method. Next, we performed the following experiments to confirm the possibility of behavior-based estimation of user attention to music signals. We prepared five
Fig. 6. Results of beat and non-beat actions (body motion)

Fig. 7. Discrimination of beat/non-beat actions

Table 2. Genre and extracted tempo for sample music

Music No.  Genre         Tempo extraction results [bpm]
1          country       113
2          fusion        74
3          country       81
4          Japanese pop  130
5          popular       113
music signals. Table 2 shows their genres and estimated tempos, which range from 74 to 130 bpm; music 1 and 5 have similar tempos. We employed five subjects (healthy adults) who listened and clapped to the displayed music, and the tempos of their body motions were estimated by the proposed method. Table 3 shows the tempo of subject 1's motion estimated with both image and sound cues; we obtained similar estimation results for both cues. Table 4 shows the estimation results of all subjects (sound cues). Here, ○ denotes cases where the system correctly estimated the music from user behavior, × denotes cases where estimation failed, and △ denotes other cases. In the following, we discuss the "△" cases. Music 1 has two instruments related to the "beat," and the extracted tempo changes along with the instrument part to which the user clapped. As in this case, the extracted tempo may change depending on the part (instrument) to which the user is attracted. The result for music 2 with subject 3 is similar; therefore, we may have to consider two signals synchronized if the tempo of one signal is an integral multiple of the other. On the other hand, since music 1 and 5 have similar tempos, deciding which song the user is listening to is difficult. However, since the main purpose of the proposed method is to estimate attention levels to the currently displayed music contents, synchronization with other music does not matter. Consequently, we accurately estimated the synchronous properties between user "beat" actions and the displayed music in more than 90% of all cases. This suggests the effectiveness of the proposed method for estimating user attention levels to displayed music contents.
Table 3. Tempo extraction results (subject 1, image and sound cues)

Music No.                        1    2    3    4    5
Tempo extracted with image cues  61   77   88   130  130
Tempo extracted with sound cues  61   77   81   130  113

Table 4. Tempo extraction for all subjects (sound cues)

Music No.  Subject 1  Subject 2  Subject 3  Subject 4  Subject 5
1          61         67         60         61         61
2          77         74         145        77         74
3          81         130 ×      81         81         81
4          130        64 ×       135        135        135
5          113        61         61         113        113
Furthermore, since the two false cases belong to the same subject, the temporal accuracy of this subject's clapping was perhaps lower than that of the others. We will investigate this point in the future.
6 Summary
In this paper, we examined a method to estimate user interest in music contents from the synchronicity between the presented music and human behavior. This builds on our earlier observations, in which two or more video contents were switched for memory-impaired people while their attention behavior was observed; user reactions differ depending on the presented contents and include clapping, speaking, and turning the face sideways. We also examined an effective content-switching technique that reflects user intention. Future work includes improving the estimation accuracy of user interest by integrating other user behaviors with beat actions, and developing a content control system that switches displayed contents based on user attention levels to entertain users longer. This research was supported in part by the National Institute of Information and Communications Technology.
References

1. Boiman, O., Irani, M.: Detecting irregularities in images and in video. In: Proceedings of International Conference on Computer Vision, pp. 462–469 (2005)
2. Osawa, T., Wu, X., Wakabayashi, K., Yasuno, T.: Human tracking by particle filtering using full 3D model of both target and environment. In: Proceedings of International Conference on Pattern Recognition, pp. 25–28 (2006)
3. Kuwahara, N., Kuwabara, K., Utsumi, A., Yasuda, K., Tetsutani, N.: Networked interaction therapy: Relieving stress in memory-impaired people and their family members. In: Proc. of IEEE Engineering in Medicine and Biology Society, IEEE Computer Society Press, Los Alamitos (2004)
4. Utsumi, A., Kawato, S., Abe, S.: Attention monitoring based on temporal signal-behavior structures. In: Sebe, N., Lew, M.S., Huang, T.S. (eds.) Computer Vision in Human-Computer Interaction. LNCS, vol. 3766, pp. 100–109. Springer, Heidelberg (2005)
5. Kuwahara, N., Kuwabara, K., Tetsutani, N., Yasuda, K.: Reminiscence video helping at-home caregivers of people with dementia. In: HOIT 2005, pp. 145–154 (2005)
6. Scheirer, E.D.: Tempo and beat analysis of acoustic musical signals. Journal of the Acoustical Society of America 103(1), 588–601 (1998)
7. Goto, M.: An audio-based real-time beat tracking system for music with or without drum-sounds. Journal of New Music Research 30(2), 159–171 (2001)
8. Kanade, T., Rander, P., Narayanan, P.J.: Virtualized reality: Constructing virtual worlds from real scenes. IEEE MultiMedia 4(1), 34–47 (1997)
9. Wagg, D.K., Nixon, M.S.: On automated model-based extraction and analysis of gait. In: Proc. of the 6th IEEE International Conference on Automatic Face and Gesture Recognition, pp. 11–16. IEEE Computer Society Press, Los Alamitos (2004)
10. Lim, J., Kriegman, D.: Tracking humans using prior and learned representations of shape and appearance. In: Proc. of the 6th IEEE International Conference on Automatic Face and Gesture Recognition, pp. 869–874. IEEE Computer Society Press, Los Alamitos (2004)
11. Widrow, B., Hoff, M.E.: Adaptive switching circuits, 96–104 (1960)
View Planning for Cityscape Archiving and Visualization
Jiang Yu Zheng and Xiaolong Wang
Department of Computer Science, Indiana University Purdue University Indianapolis (IUPUI), USA
Abstract. This work explores the full registration of scenes in a large area, based purely on images, for city indexing and visualization. Ground-based images including route panoramas, scene tunnels, panoramic views, and spherical views are acquired in the area and are associated with geospatial information. In this paper, we plan distributed locations and paths in the urban area for image acquisition based on visibility, image properties, image coverage, and scene importance. The criterion is to use a small number of images to cover as many scenes as possible. LIDAR data are used in this view evaluation, and real data are acquired accordingly. The extended images realize a compact and complete visual data archive, which will enhance the perception of the spatial relations of scenes. Keywords: Multimedia, panoramas, visibility, urban space, LIDAR data, model based vision, heritage.
1 Introduction

This work explores a framework to represent spaces with images. With densely taken images in a space, one can establish a complete database for scene access. This image registration task is required at excavation sites, museums, markets, real estate properties, heritage districts, and large urban areas. Moreover, a full archive of scenes will allow navigation, location finding, guidance, crisis preparation, etc. Employing 3D models, either manually constructed or automatically extracted from LIDAR (Light Detection and Ranging) data, has been a long-standing trend in showcasing an urban area [6]. The process is usually laborious, particularly in the model construction and texture mapping. To obtain higher resolutions and more suitable viewing angles than overhead images (satellite or aerial) and maps, we emphasize ground-based views in this work. Although video clips can capture walkthrough views continuously [12], they have not been extended to large areas because of the vast data size. Mosaicing translating images is also difficult because of the disparity inconsistencies caused by depth variations [8]. A multi-perspective image has been composed only for scenes close to a planar surface [1]. On the other hand, slit-scanning approaches including push-broom [14] and X-slit [9][16] have been implemented to create route panoramas in order to avoid inter-frame matching, depth estimation, morphing, and interpolation.
Our work in this paper aims at a ground-based image database representing large-scale spaces. The goals are: (1) fewer images for rich spatial information, and (2) a pervasive view archive of an environment. Taking images at every location in a large area can be very inefficient due to data redundancy, and is sometimes impossible. We plan viewpoints that cover as large a space as possible to achieve data reduction. The view significance is evaluated according to the scene visibility and distribution. The sparse viewpoints and paths are then planned for acquiring cityscapes using route panoramas [14], scene tunnels [15][20], panoramic views, spherical views [4][5][11], and traditional digital images. As these images are compact in data size, as well as complete and continuous in scene coverage, we can significantly enlarge the area to be modeled. In the following sections, we propose a view significance measure using LIDAR data in Section 2 for various types of spaces. Section 3 describes view selection approaches based on the viewpoint evaluations, followed by a real process of view acquisition to form an image-based virtual environment.
2 Ground-Based Views and View Significance

2.1 View Categorization According to Virtual Actions

In archiving a space at the city level, image storage and transmission demand a scene selection scheme that covers more scenes with less data; taking images at every viewpoint in a space is not necessary. Extending normal digital images, we use local panoramas, route panoramas, and object panoramas to cover sites, paths, and architectures, respectively. In a real environment, a visitor takes different actions in spaces, such as looking around, walking, and examining things from various directions, and it is preferable that our images be categorized according to these actions. In addition, a map view (or satellite image) can provide global locations. Figure 1 shows a diagram for registering an area with these images. All these views can be classified more formally as follows.

A local panorama V(0) is taken from a single viewpoint outward with an FOV possibly up to 360 degrees in orientation. Examples include wide/fish-eye images and panoramic/spherical images, which employ perspective, central, cylindrical, and spherical projections, respectively.

A route panorama V(1) is a long image along a straight or mildly curved path for representing positions. Examples include slit-scanning route panoramas and scene tunnels, which employ parallel-perspective and parallel-central projections. Multi-perspective projection (mosaicing multiple images along a path) can also be used if scenes are close to an equal depth [1].

An object panorama V(2) is a set of inward images taken around an object or building (also called an object movie, similar to aspect views). They are useful in showing all building facades or the full appearance of an object.

Now, it is necessary to investigate where these views should be located in a large urban area if we cannot take images everywhere, i.e., how to cover large areas pervasively with a limited number of ground-based images. We use airborne LIDAR data as an elevation map for view prediction. LIDAR is an optical remote sensing
Fig. 1. A visualization model of an urban area. Real scenes are projected towards a grid of paths and positions. (a) Image framework containing V(0), V(1), and V(2). (b) An object panorama V(2). (c) A section of route panorama V(1).
technology that measures properties of scattered light to find the range and intensity of a remote target. The resolution of airborne LIDAR data, with height and texture measured by laser scanning, can be as fine as half a meter per dot.

2.2 Viewpoint Evaluation Based on Visibility

To take images in a large space, we have to plan the camera positions. We propose a view significance estimation based on how much of the 3D surfaces an image can cover. This view significance can be measured in an urban area using LIDAR data or examined on typical structure layouts. Our main interest is in walkthrough views taken roughly at eye level; compared to an overhead image, ground-based images capture more vertical surfaces and details. Intuitively, a view with a large portion of horizon is not as visually significant as a view full of objects in conveying spatial information. Similarly, a view with a large sight from an overlook is more significant than a view from a narrow valley of buildings in telling global locations. Figure 2 illustrates the idea of computing the view significance from a view shed, i.e., the view-covered region on 3D surfaces. Let P(X,Y,Z) denote a position in the space, and let a ray from P be defined by n(φ,ϕ) with orientation φ∈[0, 2π] and azimuth angle ϕ∈[-π/2, π/2]. If the ray hits a surface point at distance D(φ, ϕ), an indicator function λ(φ, ϕ) takes the value 1, and otherwise 0. The visible
surfaces from P thus form a view shed. We define the view significance σ(P) to be the area of the view shed (∑λ), which is calculated by

σ(P) = ∫∫_{φ,ϕ} w λ(φ,ϕ) · D(φ,ϕ) / (D(φ,ϕ) + D_0) dφ dϕ    (1)
where w is a weight of importance assigned to each surface, and D_0 is a large constant (e.g., 100 m). The denominator of Eq. 1 accounts for the image quality degradation of distant scenes due to atmospheric haze; it prevents a close-to-infinity scene from contributing largely to σ(P). The weight w, assigned building-wise in a space, takes a uniform value in this paper to show the influence of the geometric layout only, unless some important façades or landmarks need particular emphasis.
Fig. 2. Computing view significances from the view shed at a viewpoint for V(0)
We calculate a continuous distribution of σ(P) using an elevation map H(X,Z). For an urban area, the LIDAR data are first reduced in resolution to a hole-less map. At each small grid region (e.g., 1–5 m²) in the LIDAR data, non-zero elevation points are median-filtered to yield an integer value in the discrete map H(X,Z) (Fig. 4a). Second, all reachable points P(X,Y,Z) at eye level are marked (if Y>0). Third, we compute lines of sight in all discrete orientations originating from every viewpoint P(X,Y,Z). Each line is stretched out until it hits an obstacle, where the front tip of the line, Pl(Xl,Yl,Zl), satisfies Yl ≤ H(Xl,Zl).
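A discrete version of this evaluation could look like the following ray-marching sketch over the elevation map H(X,Z); the angular resolution, the marching step, the eye height, and the hit test are assumptions made for illustration, not the authors' exact procedure.

```python
import numpy as np

def view_significance(H, x, z, eye=1.6, d0=100.0, w=1.0,
                      n_phi=72, n_theta=18, max_range=300.0, step=1.0):
    """Discrete approximation of Eq. (1) at viewpoint (x, eye, z) over the
    elevation map H[X, Z] (one cell per metre assumed)."""
    sig = 0.0
    d_phi = 2 * np.pi / n_phi
    d_theta = np.pi / n_theta
    for phi in np.arange(0, 2 * np.pi, d_phi):
        for theta in np.arange(-np.pi / 2 + d_theta / 2, np.pi / 2, d_theta):
            for r in np.arange(step, max_range, step):
                px = x + r * np.cos(theta) * np.cos(phi)
                pz = z + r * np.cos(theta) * np.sin(phi)
                py = eye + r * np.sin(theta)
                ix, iz = int(round(px)), int(round(pz))
                if not (0 <= ix < H.shape[0] and 0 <= iz < H.shape[1]):
                    break                      # ray leaves the map: no surface hit
                if py <= H[ix, iz]:            # ray hits a building or ground surface
                    sig += w * (r / (r + d0)) * d_phi * d_theta
                    break
    return sig
```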
Fig. 3. Typical spatial structures and distributions of the view significances with spherical views. A crossing, square, square with roads, and two connected yards are enclosed with walls.
Figure 4 shows a city area with buildings of different heights. Three spherical views in Fig. 4a are evaluated in depth (Fig. 4b). The σ(P) of the entire reachable space is calculated with D_0 = 100 m in Fig. 4c,d. According to σ(P) evaluated for spherical views [11], the positions marked in red and yellow at a round square and an open crossing in Fig. 4a are more significant than the third position, marked in green in a parking lot. This can be confirmed in Fig. 4b, where a large sky area appears in Fig. 4(b.3) and the high buildings at a distance degrade in the significance estimation. For cylindrical panoramas [4] (FOV limited to ϕ∈[-45°,45°]), the distribution of σ(P) is shown in Fig. 4d. The significance value is now lower at positions close to high rises than at positions farther away, because the close positions miss the high surfaces of the buildings.
Fig. 4. View significance evaluations at all positions in an area. (a) LIDAR elevation map with intensity representing height. (b.1)(b.2)(b.3) 360°×180° spherical depths at the three positions marked in (a); FOVs of the cylindrical projection are in dashed frames. (c) Reachable positions and their view significances in gray levels. (d) View significance evaluated with cylindrical panoramas.
Fig. 5. Depth map from LIDAR data and corresponding views. (a) The mechanism to scan a scene tunnel. (b) Depths along a camera path calculated from LIDAR data (intensity inversely proportional to depth); the vertical axis indicates the azimuth angle of rays and the horizontal axis the route distance. (c) Blocks of the scene tunnel taken along the real street.
2.3 View Significance Evaluation for Paths and Buildings

The view significance can be defined similarly for a route panorama and an object panorama. Figure 5a depicts the geometry of acquiring a scene tunnel when a camera facing sideways on a vehicle moves along a street [15]. Denote a ray by n(ϕ) in the Plane of Scanning (PoS) at position l on the path (Fig. 5a). A surface point at distance D(l,ϕ) is projected to the scene tunnel. The view significance at position l is defined as

σ(P(l)) = ∫_ϕ w λ(l,ϕ) · D(l,ϕ) / (D(l,ϕ) + D_0) dϕ    (2)
where ϕ∈[-π/2, π/2] is the azimuth angle of the ray. The significance of an entire street is

σ(L) = (1/L) ∫_L σ(P(l)) dl    (3)
where L is the street length. Figure 5 shows the depth map D(l,ϕ) visible from a street, predicted using LIDAR data; the half-side scene tunnel along the real street is displayed for comparison as well. The σ(L) of this section is high. In contrast, we can predict that very open streets, such as suburban highways, may be insignificant because of their monotonic scenes and large sky area. An architecture can be displayed with its object panoramas V(2), i.e., side views adjacent to each other. Here, the object panorama does not strictly fall into the definition of aspect views, which are divided according to object surfaces. In evaluating significant viewpoints for buildings or landmarks, visibility is more meaningful than the shooting distance, because distance can be compensated for by using different lenses. We define the view significance σ(P,B) at a surrounding point P for watching B by
σ(P,B) = ∫∫_{φ,ϕ} w λ_2(φ,ϕ) · D(φ,ϕ) / (D(φ,ϕ) + D_0) dφ dϕ    (4)
where λ_2(φ,ϕ) is 1 if a ray from P reaches building B and 0 otherwise. As an example, we use a convex block, a building complex, and scattered towers in Fig. 6a to calculate their view significance distributions in Fig. 6b,c. The block in Fig. 6a can be captured with a normal lens (Fig. 6b.1), while the scattered architecture group is better captured in a wide image from a point in the central area (Fig. 6b.3). In general, the view significance is continuous in the reachable area of H(X,Z) and decreases as the viewing distance gets farther. The significance value also tends to be higher in an orientation from which multiple facades become visible, and it drops to zero when the target is completely occluded. Moreover, the smaller the vertical field of view of a camera, the farther the significant viewpoints tend to be, as can be seen by comparing individual pairs in Fig. 6b and 6c. In real situations, many buildings are aligned in street blocks, which are simpler to capture in V(1) images.
Fig. 6. View significances of buildings displayed in levels. (a) Elevation map of three groups of architecture. (b.1)(b.2)(b.3) Distributions of view significances in (a) when a cylindrical panorama (vertical FOV 90°) is used. (c.1)(c.2)(c.3) As references, the view significances counted with spherical views for each building group.
3 View Selections and Pervasive Image Acquisition

3.1 Street View Acquisition from a Moving Vehicle

We start with the scanning of route panoramas V(1), because imaging from a moving vehicle is the most efficient way to obtain cityscapes. Route panoramas and scene tunnels are typically planned along streets with rich scenes (where σ(L) is high). The route panoramas are scanned continuously with a pixel line in the video frame. This slit-scanning method is much more efficient for capturing translating scenes with depth changes than most mosaicing methods, which need to merge discrete images at consecutive positions, because it avoids image correspondence, depth estimation, and image integration.
According to the simulated depth map along a street, we select a lens with up to a 180° FOV for the street. The lens scope may exclude the tops of some high rises if the rest of the street consists of low architecture on average. The route panoramas also have some deformations: because of the parallel-perspective projection employed in composing a long image, the aspect ratios of objects at different depths from the path are not constant in the route panorama. The aspect ratio can be adjusted well at one depth but is hard to satisfy at other depths. At the adjusted depth, scenes are exposed sharply in the route panorama, while beyond that depth the stationary blur [10] appears along the horizontal direction. Therefore, we select a sampling rate and vehicle speed after the camera lens is determined from the height coverage of the side scenes. If we denote the curvature of the path by κ, the depth of the scene by z, the vehicle speed by v, the angle of the plane of sight with respect to the motion vector by α, and the focal length by f, the blurring rate relative to the original image (I_t/I_x) can be calculated as

I_t / I_x = f v (1/z − κ / sin²α)    (5)
where I_t and I_x are the contrasts in the route panorama and the video image, respectively. The detailed proof is in [21]. Setting the ratio to 1, we can inversely obtain v from the known κ, α, f and the average z. This reduces the distortion of the aspect ratio as well as the stationary blur on major scenes. Alternatively, we can keep the vehicle velocity as constant as possible and normalize the length of the route panorama/scene tunnel according to GPS output or satellite images.

3.2 Placing Local Panoramas in the Urban Area

After route panoramas V(1) cover the major streets, we place a number of local panoramas V(0) in the urban area. There are various strategies for viewpoint selection. To represent larger 3D surfaces, we plan V(0) at peaks of the σ(P) distribution. A viewpoint close to a selected peak may share a large portion of scenes with that peak; therefore, we avoid local maxima of σ(P) on the same hill as a selected peak. As an automatic procedure, we gradually decrease a threshold level over the σ(P) distribution in Fig. 4d in order to locate peaks and island regions, E(level)={Pe | σ(Pe)≥level}. Assuming two viewpoints should not be closer than a predefined distance r, we select an emerging local maximum of σ(P) as a viewpoint if it is not closer to any island region than r. We locate the peaks in σ(P) for panoramic views according to the following algorithm:

Set E to be an empty set
For level decreasing from max(σ) to min(σ)
    For every point P ∈ H(X,Z) satisfying σ(P) ≥ level and P ∉ E
        If P is a peak and is farther away from E than r
            then select P as a viewpoint
        Add point P to E   // mark P as a point in E
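A runnable version of the pseudocode above could look like the following sketch, where sigma is the σ(P) map, reachable is a boolean mask of reachable eye-level positions, and the 8-neighbourhood peak test, the number of threshold levels, and the spacing r in grid cells are illustrative assumptions.

```python
import numpy as np

def is_peak(sigma, i, j):
    """Local-maximum test on the 8-neighbourhood (an assumed definition of 'peak')."""
    patch = sigma[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2]
    return sigma[i, j] >= patch.max()

def select_viewpoints(sigma, reachable, r=50.0, n_levels=100):
    """Sweep a threshold from the highest to the lowest significance; accept a
    peak as a V(0) viewpoint only if it is farther than r from every examined
    (island) point E, then add the point to E."""
    E = np.zeros_like(reachable, dtype=bool)        # examined "island" points
    selected = []
    vals = sigma[reachable]
    for level in np.linspace(vals.max(), vals.min(), n_levels):
        ii, jj = np.nonzero(reachable & (sigma >= level) & ~E)
        for i, j in zip(ii, jj):
            ei, ej = np.nonzero(E)
            far = ei.size == 0 or np.hypot(ei - i, ej - j).min() > r
            if far and is_peak(sigma, i, j):
                selected.append((i, j))
            E[i, j] = True
    return selected
```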
This algorithm selects peaks of σ(P) on major hills and ignores local maxima on the examined hills of the view significance distribution. Figure 7 gives an output of the algorithm for placing local panoramas. After selecting viewpoints, images are
Fig. 7. Planned positions for local panoramas, plotted as white spots in the elevation map of the area
taken with digital cameras [17], and spherical views are taken with a fish-eye lens if necessary. In real cases, the camera positions have to deviate from the planned ones due to busy traffic. Also, the V(0) images here are more suitable for representing spatial layouts than symbolic scenes with cultural or sightseeing value. Additional V(2) images can always be taken to emphasize meaningful scenes.

3.3 Locating Images Around Objects

Now we locate multiple discrete images around buildings of interest to generate an image group, namely an object panorama (in QTVR). After the calculation of σ(P,B) for a building or building block, picking viewpoints manually at positions with high σ(P,B) values is feasible. Alternatively, we can also calculate viewpoints automatically. First, the center of heights, P0(X0, Z0), of a building is obtained from H(X,Z) as

(X_0, Z_0) = Σ_{P∈B} (X,Z) H(X,Z) / Σ_{P∈B} H(X,Z)    (6)
In each orientation φ around P0, we find the distance d(φ) at which the viewpoint has the highest significance, i.e.,

d(φ) = argmax_d σ(P, B)    (7)
which results in a closed curve surrounding B. We then select several orientations φ1, φ2, φ3, … on the curve that are local maxima of σ(P,B) for imaging B inward. The images can be dense enough to share partial views (or facades) so that the orientations are more conceivable through the images. Various lenses can be used for these images as long as the scenes are located properly in the frame. Besides the above three types of images, discrete images can always be taken at spots of interest or of partial scenes on the buildings to highlight details. We conducted experiments in an urban area of 1.6×1 km², which includes 50 panoramic images, 34 route panoramas of both street sides, and 44 buildings. The images have been normalized to a fixed height for display, and their lengths are proportional to the real lengths [22].
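Eqs. (6) and (7) can be sketched as follows, assuming the building footprint is given as a boolean mask over H(X,Z) and that σ(P,B) of Eq. (4) has been wrapped in a callable; all names and step sizes are illustrative.

```python
import numpy as np

def building_center_of_heights(H, mask):
    """Eq. (6): height-weighted centre (X0, Z0) of building B, where mask
    selects the building's cells in the elevation map H."""
    xs, zs = np.nonzero(mask)
    w = H[xs, zs]
    return (xs * w).sum() / w.sum(), (zs * w).sum() / w.sum()

def best_viewing_distance(sig_pb, x0, z0, phi, d_max=200.0, step=1.0):
    """Eq. (7): along orientation phi from (X0, Z0), return the distance d at
    which the significance for watching B, sig_pb(x, z), is largest."""
    ds = np.arange(step, d_max, step)
    scores = [sig_pb(x0 + d * np.cos(phi), z0 + d * np.sin(phi)) for d in ds]
    return float(ds[int(np.argmax(scores))])
```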
4 Conclusion

This work creates a framework to pervasively record scenes in a large-scale area for image archiving and visualization. The compact nature of the employed images significantly enlarges the scale of cityscape modeling. We proposed an evaluation of viewpoints, namely the view significance, according to the visibility and effective image coverage of cityscapes. Based on the significance distribution estimated from LIDAR data, we located different types of images so that they include as many scenes as possible while avoiding data redundancy. The visual data can be displayed on the web and further loaded onto portable navigation devices for on-site urban area guidance.
References

[1] Agarwala, A., et al.: Photographing long scenes with multi-viewpoint panoramas. ACM Transactions on Graphics 25(3), 853–861 (2006)
[2] Aliaga, D.G., Carlbom, I.: Plenoptic Stitching: A scalable method for reconstructing 3D interactive walkthroughs. In: SIGGRAPH 2001, pp. 443–450 (2001)
[3] Aliaga, D.G., Funkhouser, T., Yanovsky, D., Carlbom, I.: Sea of Images. In: IEEE Conf. Visualization, pp. 331–338 (2002)
[4] Chen, S.E., Williams, L.: QuickTime VR – An image-based approach to virtual environment navigation. In: SIGGRAPH 1995, pp. 29–38 (1995)
[5] Coorg, S., Master, N., Teller, S.: Acquisition of a large pose-mosaic dataset. In: IEEE CVPR 1998, pp. 23–25 (1998)
[6] Frueh, C., Zakhor, A.: Constructing 3D city models by merging ground-based and airborne views. In: IEEE CVPR 2003, pp. 562–569 (2003)
[7] McMillan, L., Bishop, G.: Plenoptic modeling: an image based rendering system. In: ACM SIGGRAPH 1995 (1995)
[8] Peleg, S., Rousso, B., Rav-Acha, A., Zomet, A.: Mosaicing on adaptive manifolds. IEEE Trans. PAMI 22(10), 1144–1154 (2000)
[9] Roman, A., Garg, G., Levoy, M.: Interactive design of multi-perspective images for visualizing urban landscapes. In: IEEE Conf. Visualization 2004, pp. 537–544 (2004)
[10] Shi, M., Zheng, J.Y.: A slit scanning depth of route panorama from stationary blur. In: IEEE CVPR (2005)
[11] Szeliski, R., Shum, H.-Y.: Creating full view panoramic image mosaics and texture-mapped models. In: ACM SIGGRAPH 1997, pp. 251–258 (1997)
[12] Uyttendaele, M., et al.: Image-based interactive exploration of real-world environments. IEEE Computer Graphics and Applications 24(3) (2004)
[13] Zhao, H., Shibasaki, R.: A vehicle-borne urban 3D acquisition system using single-row laser range scanners. IEEE Trans. on SMC, B 33(4), 658–666 (2003)
[14] Zheng, J.Y.: Digital Route Panorama. IEEE Multimedia 10(3), 57–68 (2003)
[15] Zheng, J.Y., Zhou, Y., Mili, P.: Scanning Scene Tunnel for city traversing. IEEE Trans. Visualization and Computer Graphics 12(2), 155–167 (2006)
[16] Zomet, A., Feldman, D., Peleg, S., Weinshall, D.: Mosaicing New Views: The crossed-slits projection. IEEE Trans. on PAMI, 741–754 (2003)
[17] Li, S.: Full-View Spherical Image Camera. ICPR (4), 386–390 (2006)
[18] Li, S., Nakano, M., Chiba, N.: Acquisition of spherical image by fish-eye conversion lens. IEEE Virtual Reality, pp. 235–236 (2004)
[19] Zheng, J.Y., Tsuji, S.: Panoramic Representation for route recognition by a mobile robot. IJCV 9(1), 55–76 (1992)
[20] Zheng, J.Y., Li, S.: Employing a fish-eye camera in scanning scene tunnel. In: 7th ACCV, vol. 1, pp. 509–518 (2006)
[21] Zheng, J.Y., Shi, M.: Depth from stationary blur with adaptive filtering. In: 8th ACCV (2007)
[22] http://www.cs.iupui.edu/~jzheng/ACCV07
Synthesis of Exaggerative Caricature with Inter and Intra Correlations
Chien-Chung Tseng and Jenn-Jier James Lien
Robotics Laboratory, Dept. of Computer Science and Information Engineering, National Cheng Kung University, No. 1, Ta-Hsueh Road, Tainan, Taiwan
{ed,jjlien}@csie.ncku.edu.tw
http://robotics.csie.ncku.edu.tw
Abstract. We developed a novel system consisting of two modules, statistics-based synthesis and non-photorealistic rendering (NPR), to synthesize caricatures with exaggerated facial features and other particular characteristics, such as beards or nevi. The statistics-based synthesis module can exaggerate the shapes and positions of facial features based on non-linear exaggerative rates determined automatically. Instead of comparing only the inter relationship between features of different subjects, as in existing methods, our synthesis module applies both inter and intra (i.e., comparisons between facial features of the same subject) relationships to make the synthesized exaggerative shape more contrastive. Subsequently, the NPR module generates a line-drawing sketch of the original face, and the sketch is then warped to an exaggerative style with the synthesized shape points. The experimental results demonstrate that this system can automatically and effectively exaggerate facial features, thereby generating corresponding facial caricatures. Keywords: Exaggerative rate, Exaggerative caricature synthesis, Eigenspace, Non-photorealistic rendering (NPR).
1 Introduction

Caricatures often appear in newspapers and comics, and even at popular tourist spots where artists draw them for sightseers. Generally, a caricature is a kind of exaggerative representation: it exhibits the funny and extraordinary characteristics of a person, and it can also serve as a human-like agent because it contains many recognizable facial features. The basic definition of a facial caricature is the exaggeration of all the facial features that are found by comparing the impression features of the subject with the average face. The first caricature generation system was made by Brennan [4], who used an interactive algorithm to create sketches with exaggeration. Subsequent works are based on two approaches: with or without a training process. Works with a training process usually build standards which are compared with the testing data to find the differences between their features. Koshimizu et al. [11] proposed a template-based approach to create caricatures with exaggerative rates. However, the method is line-based drawing, and the result becomes unrecognizable when the exaggerative rates are too
large. The works of [12], [13], [17] developed example-based approaches using partial least squares or neural networks to learn the drawing style of an artist, but they cannot exhibit some particular characteristics, like beards, nevus, etc., without training on them in advance. Chiang et al. [7] analyzed facial features and warped a color caricature created by an artist to an exaggerative style with the analyzed result. However, the representation of the result is limited by the prototype drawn by the artist. Xu [18] exaggerated both the shapes and the positions of facial features using eigenspaces, but the parameters used to exaggerate the facial features are determined empirically. Apart from the approaches with a training process, some works on caricature creation have no training process. Gooch et al. [9] proposed an approach to create an exaggerative black-and-white drawing by using Blommaert and Martens' model and manual exaggeration [3], and Akleman [1] created a caricature by a forward triangle morphing. Those methods involve not only exaggerated facial features but also other particular characteristics. However, they also require a lot of manual work. Besides, there are other works [5], [6], [10], [19] creating good-quality cartoon faces without exaggeration, and our work can be viewed as an extension of their results. In order to solve the existing problems discussed above, we developed a novel system based on the advantages of processes with and without training, by implementing statistics-based synthesis (with a training process) and non-photorealistic rendering (without a training process). The exaggerative caricatures generated by our method contain exaggerative facial features and other particular characteristics, which are usually eliminated in order to simplify the training process. In the statistics-based synthesis module, we find the major principal components of the features by using the principal component analysis (PCA) process, and expand them to emphasize personal specialties. In addition, the exaggeration is based on two relationships proposed by Redman [16]: one is the relationship between different subjects' features, which we define as the inter process; the other is the relationship between facial features of the same subject, ignored by previous works, which we define as the intra process.
2 Statistics-Based Synthesis Module: Training Process
In the training process of the statistics-based synthesis module, we compiled 69 male and 55 female images from the AR face database [14]; 90 images are used for training and 34 for testing. All the facial images are taken at the same distance from the camera, and each training facial image is normalized to a canonical image (512×512 pixels) by an affine transformation including only translation and rotation factors, using the two inner corner points of the eyes. The reason for eliminating the scaling factor is that the eye positions of all training images would become identical if scaling were applied, and thus we could not exaggerate the position of the eyes by comparing the difference from the average eye position. For each canonical image, the feature shape points of the eyebrows (20 points), eyes (16 points), nose (12 points), mouth (20 points), and facial contour (19 points) are selected manually and denoted by Fi, where i = 1 to 7. Also, we choose 7 control points corresponding to the left eyebrow, right eyebrow, left eye, right eye, nose, mouth,
and face contour respectively, denoted by P, where P = (p1x, p1y, p2x … p7x, p7y). Each control point represents the position of the corresponding feature shape. Furthermore, all the feature shape and position vectors are separated into x and y coordinates, denoted by Fix, Fiy, Px, Py respectively, to avoid mutual effects of the x and y coordinates of the feature shapes or positions. Based on all feature shape vectors, Fix and Fiy, we can calculate their corresponding average vectors, MFix and MFiy, and by applying PCA, the corresponding eigenspaces, UFix and UFiy, can be obtained. Subsequently, the average position point of each feature is calculated instead of generating eigenspaces, because there is only one position point per facial feature; this information is not enough to build position eigenspaces. The framework of the training process is shown in Fig. 1.
Fig. 1. The training framework of statistics-based synthesis module
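As a rough sketch of this training step (not code from the paper; the array shapes, function names and number of retained eigenvectors are our own assumptions), the per-feature average shapes and eigenspaces can be computed with a standard PCA, shown here for the x coordinate only:

    import numpy as np

    def train_shape_eigenspace(shapes_x, n_components=10):
        # shapes_x: (num_images, num_points) x coordinates of one facial
        # feature (e.g. the 20 eyebrow points), one row per training image.
        MF_x = shapes_x.mean(axis=0)                 # average shape vector MF_ix
        centered = shapes_x - MF_x
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)  # PCA via SVD
        UF_x = Vt[:n_components].T                   # eigenspace UF_ix (eigenvectors as columns)
        return MF_x, UF_x

    def train_positions(P_x):
        # P_x: (num_images, 7) x coordinates of the 7 control points.
        # Only the average MP_x is kept; no eigenspace is built for positions.
        return P_x.mean(axis=0)

The y coordinate is handled identically, and the procedure is repeated for each of the seven features.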
3 Statistics-Based Synthesis Module: Testing Process
To start the testing process of the statistics-based synthesis module, as shown in Fig. 2, we extract the facial feature points by applying an active appearance model (AAM) [8]. If the extraction result is not good, we label the feature points manually. Then, as in the training process, we obtain TFi where i = 1 to 7, TP = (tp1x, tp1y, tp2x … tp7x, tp7y), and TFix, TFiy, TPx, TPy. Subsequently, the eigenspaces and average position points generated in the training process are used to perform the statistics-based synthesis. The statistics-based synthesis module is divided into two parts: one is the exaggeration of the feature shape points on the x and y coordinates; the other is the exaggeration of the feature position points on the x and y coordinates. Both parts further apply inter and intra processes, by which we can automatically determine which features should be exaggerated. The exaggeration of the feature position points is performed after the exaggeration of the feature shape points. To simplify the explanation, we take the feature shape exaggeration on the x coordinate as an example to explain the inter and intra processes for the facial feature shape. The feature position exaggeration is discussed more fully in Section 3.4.
Fig. 2. The framework of exaggerative caricature creation (Ini: exaggerative rates initialization; Irf: inter exaggeration for feature shape; Iaf: intra exaggeration for feature shape; Irp: inter exaggeration for feature position; Iap: intra exaggeration for feature position)
3.1 The Initialization of Exaggerative Rates
In Fig. 3, there are two ellipses, a small one and a big one. For the same horizontal variation, the small ellipse is more distinctive than the big one, and the same holds for the variation in displacement. This example indicates that the degree of shape or position exaggeration is negatively related to the length or width of the feature itself [15]. Therefore, we develop the following equation to model this condition:

    Dix = exp( 1 − length(TFix) / Σ_{i=1..7} length(TFix) )                                    (1)

where Dix represents the ratio, which is negatively and exponentially related to the horizontal or vertical length of the feature, and length(TFix) is the horizontal length of the feature, with length(TFix) = max(TFix) − min(TFix). The reasons we use an exponential function are that it is always positive (it never violates the nature of the feature) and that its rate of growth is proportional to its value; namely, the larger the variation of the feature, the higher the degree of exaggeration it should have. Basically, the exaggeration of a feature is in fact the extension of its difference from the corresponding average feature. In order to implement the exaggeration process, the "exaggerative rate" is defined to be this scalar. There are two kinds of exaggerative rates: the shape exaggerative rates and the position exaggerative rates.
Fig. 3. (a) Ellipses having the same horizontal variances. (b) Ellipses having the same horizontal displacement.
The initial shape exaggerative rate of the x coordinate of the i-th feature is efix^(k=1) = 1 + c1·Dix, which satisfies the rule of exaggeration; c1 is a constant controlling the degree of exaggeration, and c1 = 0.2 in this study. The initialization of the position exaggerative rates will be discussed later.

3.2 Exaggerative Feature Shape Creation: Inter Exaggeration
After initializing the shape exaggerative rates, the unbiased shape vector of the i-th facial feature is calculated by subtracting the corresponding average shape MFix, and then projected onto the corresponding eigenspace UFix to obtain the weights of the eigenvectors. We then expand the weights by multiplying them by the shape exaggerative rates. Finally, the exaggerative feature shape is obtained by reconstruction with the expanded weights:
    Eix^k = Σ_{j=1..n} ( ρ·efix^k·wjx·ujx ) + MFix                                             (2)

    wjx = (TFix − MFix)^T · ujx                                                                (3)

where Eix^k is the result of the inter exaggeration at the k-th iteration, ρ is the proportion of each eigenvector (the higher the proportion of an eigenvector, the bigger the shape exaggerative rate it receives), wjx is the projection weight of the j-th eigenvector, and ujx is the j-th eigenvector of the eigenspace. The result of the inter process is shown in Fig. 5(d). After comparing with the average face, we further apply the intra exaggeration process, which considers the relationship with the other features of the same subject, to increase the contrast between all the facial features.
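A minimal sketch of this inter exaggeration step (the variable names are ours, and the eigenvector proportions ρ are assumed to be given, e.g. derived from the eigenvalue ratios):

    import numpy as np

    def inter_exaggerate(TF_x, MF_x, UF_x, rho, ef_x):
        # TF_x: test feature shape (x coordinates); MF_x: average shape;
        # UF_x: eigenspace with eigenvectors u_jx as columns;
        # rho: per-eigenvector proportions; ef_x: current shape exaggerative rate.
        w = UF_x.T @ (TF_x - MF_x)       # projection weights w_jx, Eq. (3)
        w_expanded = rho * ef_x * w      # expand the weights
        return UF_x @ w_expanded + MF_x  # reconstructed exaggerative shape E_ix, Eq. (2)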
3.3 Exaggerative Feature Shape Creation: Intra Exaggeration
As mentioned above, an artist considers not only the difference from the average face but also the neighboring features of the same subject. In other words, the major facial features of the subject are expected to be extracted, enhanced and then exaggerated in order to emphasize the personal style of the face. As for the other, non-major facial features, the degree of exaggeration will decrease in order to enhance the contrast between the major and non-major features. For example, suppose the subject's mouth is bigger than the average mouth, but his/her nose is much bigger and
rounder. Then the artist will draw the mouth smaller, but still bigger than the average mouth, to emphasize the variation of the nose. In order to increase the contrast between features, all the variances of the exaggerative features are calculated by the following equation:

    vix = ( (1/n)·Σ (Eix − TFix)² ) / length(TFix)                                             (4)
Here, vix represents the variance of the x coordinate of the i-th feature, normalized by the feature length, and n is the number of points of the feature. After calculating all the variances of the features, these variances are sorted to determine which shape exaggerative rates should be increased and which should be decreased. Then, we update the shape exaggerative rates by the following equations, based on the sorted variances. The update process should satisfy the condition mentioned in Section 3.1, and the power of the effect decreases with the distance between features according to a Gaussian weighting function.
    efix^(k+1) = efix^k − A · ( Σ_{j=1..7} rj·d(x, y) · Dix ) / ( Σ_{j=1..7} d(x, y) )          (5)

    rj = 1 if vjx > vix,  0 if vjx = vix,  −1 if vjx < vix                                      (6)

    d(x, y) = exp( −[ ((tpix − tpjx)/σx)² + ((tpiy − tpjy)/σy)² ] )                             (7)
where A is a constant controlling the power of the effect, and A = 0.2 in this study. rj is a switch that increases the shape exaggerative rate of a feature with a bigger variance and decreases that of a feature with a smaller variance. tpix and tpiy are the i-th elements of the vectors TPx and TPy, used to calculate the distance between facial features in the 2-D Gaussian weighting function d(x, y). After the intra exaggeration process, we use the new shape exaggerative rates and the original feature points TF, which consist of all facial feature points, to perform the inter exaggeration process again in order to increase the contrast of these features and to find the major features, as shown in Fig. 4. Thus, we can determine all of the shape exaggerative rates automatically instead of setting them one by one. As soon as any exaggerative rate reaches zero, the iteration process terminates, to prevent the feature from being exaggerated in the contrary direction; for example, drawing a nose smaller when it is in fact bigger than average would be an exaggeration in the contrary direction. Additionally, we set a maximum number of iterations in case none of the shape exaggerative rates reaches zero. When the termination condition is satisfied
after n − 1 iterations, the original feature shape TF and the last shape exaggerative rates ef^n are used in the inter process to generate the x coordinate of the exaggerative shape. As for the exaggeration of the feature shapes on the y coordinate, the inter and intra processes mentioned above are applied in the same way. The result of the iteration is shown in Fig. 5(e).
Fig. 4. The diagram of the entire iteration process for inter and intra shape exaggerations (ef^k: shape exaggerative rate at the k-th iteration; E^k: the intermediate result at the k-th iteration)
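The loop of Fig. 4 can be sketched as follows; this is an illustrative reading of Eqs. (4)-(7) rather than the authors' implementation, and the pairwise Gaussian weights d and the ratios D are assumed to be precomputed:

    import numpy as np

    def intra_update(ef, E_list, TF_list, D, d, A=0.2):
        # ef: (7,) current shape exaggerative rates; E_list/TF_list: per-feature
        # exaggerated and original shape vectors; D: (7,) ratios of Eq. (1);
        # d: (7, 7) Gaussian weights between feature positions, Eq. (7).
        lengths = np.array([tf.max() - tf.min() for tf in TF_list])
        v = np.array([np.mean((E - TF) ** 2) / L                   # Eq. (4)
                      for E, TF, L in zip(E_list, TF_list, lengths)])
        new_ef = ef.copy()
        for i in range(7):
            r = np.sign(v - v[i])                                  # switch r_j, Eq. (6)
            new_ef[i] = ef[i] - A * np.sum(r * d[i]) * D[i] / np.sum(d[i])  # Eq. (5)
        return new_ef

In use, the inter and intra steps would alternate until some rate reaches zero or a maximum number of iterations is hit, as described in Section 3.3.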
3.4 Exaggerative Feature Position Creation
Having finished the exaggeration of the feature shapes, the positions of the features are further exaggerated. For the position vectors TPx and TPy, the difference from the average feature position is calculated by subtracting MPx and MPy, and then expanded by the following equations with the position exaggerative rate, which also contains two parts: the inter position exaggerative rate ep^inter and the intra position exaggerative rate ep^intra:

    tp'ix = mpix + epix·(tpix − mpix)·Dix,   tp'iy = mpiy + epiy·(tpiy − mpiy)·Diy              (8)

    epix = epix^inter + epix^intra,   epiy = epiy^inter + epiy^intra                            (9)

    epix^intra = (length(TF'7x) − length(MF7x)) / length(MF7x),
    epiy^intra = (length(TF'7y) − length(MF7y)) / length(MF7y)                                  (10)

Here, the inter position exaggerative rate is given by the initialization process as epix^inter = 1 + c2·Dix, where c2 is a constant controlling the exaggerative degree and c2 = 0.3 in this study. After comparing the difference from the average feature positions, artists always adjust the results of the comparison based on the width or length of the face. If the face is wider or longer than the average face, the distance between two features will be increased horizontally or vertically, and vice versa. Thus, the intra position exaggerative rate depends only on the width and length of the face. Dix and Diy are included to ensure that the displacement of each feature position also satisfies the
condition discussed in Section 3.1. After the feature position exaggeration, we translate all the exaggerative feature shape points to the locations given by the exaggerative positions, and the result TS' is shown in Fig. 5(f).
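Under the same naming assumptions, the position exaggeration of Eqs. (8)-(10) reduces to a few lines (NumPy arrays assumed; shown for the x coordinate):

    def exaggerate_positions_x(tp_x, mp_x, D_x, face_len, avg_face_len, c2=0.3):
        # tp_x, mp_x: (7,) test and average control-point x coordinates;
        # D_x: (7,) ratios of Eq. (1); face_len / avg_face_len: horizontal lengths
        # of the exaggerated and the average facial contour (feature 7).
        ep_inter = 1.0 + c2 * D_x                                # initial inter rate
        ep_intra = (face_len - avg_face_len) / avg_face_len      # Eq. (10)
        ep = ep_inter + ep_intra                                 # Eq. (9)
        return mp_x + ep * (tp_x - mp_x) * D_x                   # Eq. (8)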
4 Non-photorealistic Rendering Module
Having exaggerated the shapes and positions of the facial features, we apply non-photorealistic rendering processes to generate a caricature in an exaggerative style. First, the facial contour is emphasized by modifying the method proposed by Gooch [9]. Instead of generating a binary image, the pixel values of the result are replaced with the gray values of the original image to make the result more lifelike. Then, after creating a line-drawing sketch T' without exaggeration, as shown in Fig. 5(g), the sketch is warped to an exaggerative style by using image metamorphosis [2] with the exaggerative feature shape. Finally, a caricature with exaggerative feature positions and shapes as well as other particular characteristics, as shown in Fig. 5(h), is generated.
Fig. 5. (a) Original input image. (b) Original feature shape. (c) Average feature shape. (d) The result after only inter exaggeration process for shape. (e) The result after inter and intra exaggeration processes only for shape. (f) The result after inter and intra exaggeration processes for both shape and position. (g) The line-drawing sketch of original face. (h) The exaggerative caricature of original face keeping the information of nevus.
5 Experimental Results
For the test image of a frontal face, shown in Fig. 5(a), the diagrams of the shape exaggerative rates are shown in Fig. 6. These two diagrams indicate that the contrast between the shape exaggerative rates increases over the iterations, and the major features can be extracted as those with high exaggerative rates. In other words, our system can automatically set the exaggerative rates and extract the major facial features. In addition, compared with Xu's method [18], shown in Fig. 7, our result keeps the contour of the original feature as the exaggerative rates increase, whereas the result of Xu's method exhibits distortion, which decreases the likeness of the features. Therefore, our result is subtler than Xu's. The experimental results demonstrate that our system can automatically and effectively exaggerate facial features and generate corresponding facial caricatures that retain other particular characteristics. More results with exaggerated sizes and positions of the facial features are shown in Fig. 8.
Fig. 6. The variation of shape exaggerative rates on (a) x coordinate (b) y coordinate
Fig. 7. (a) The results of our approach (b) The results of Xu’s approach [18]
Fig. 8. More results of (a) original face (b) line-drawing sketch (c) exaggerative caricature
6 Conclusion
We have developed a novel system that generates exaggerative caricatures involving not only exaggerated facial features but also other particular characteristics, such as beards or nevus. In addition, we model the drawing skills of artists to
simplify the determination of the exaggerative rates. However, our system can only handle image data taken at a fixed distance and under fixed lighting; variations in lighting and distance will affect our results. In future work, we plan to address these environmental factors by simulating images at an appropriate viewing distance and lighting to make our system more robust.
References 1. Akleman, E.: Making Caricature with Morphing. In: Proc. of ACM SIGGRPH, p. 145 (1997) 2. Beier, T., Neely, S.: Feature-based Image Metamorphosis. In: Proceedings of ACM SIGGRAPH, pp. 35–42 (July 1992) 3. Blommaert, F.J.J., Martens, J.-B.: An Object-Oriented Model for Brightness Perception. Spatial Vision 5(1), 15–41 (1990) 4. Brennan, S.: Caricature Generator. Master’s thesis, Cambridge, MIT (1982) 5. Chen, H., Xu, Y.Q., Shum, H.Y., Zhu, S.C., Zheng, N.N.: Example-based Facial Sketch Generation with Non-parametric Sampling. In: Proceedings of International Conference on Computer Vision, Vancouver, Canada (July 2001) 6. Chen, H., Zheng, N.N., Liang, L., Li, Y., Xu, Y.Q., Shum, H.Y.: PicToon: A Personalized Image-based Cartoon System. In: Proc. of ACM Int’l Conf. on Multimedia (2002) 7. Chiang, P.Y., Liao, W.H., Li, T.Y.: Automatic Caricature Generation by Analyzing Facial Features. In: Proceedings of Asia Conf. on Computer Vision (2004) 8. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active Appearance Models. IEEE Trans. on Pattern Analysis and Machine Intelligence 23(6) (2001) 9. Gooch, B., Reinhard, E., Gooch, A.: Human Facial Illustrations: Creation and Psychophysical Evaluation. ACM Trans. on Graphics 23(1), 27–44 (2004) 10. Hsu, R.L., Jain, A.K.: Generating Discriminating Cartoon Faces Using Interacting Snakes. IEEE Trans. on Pattern Analysis and Machine Intelligence 25, 1388–1398 (2003) 11. Koshimize, H., Tominaga, M., Fujiwara, T., Murakami, K.: On KANSEI Facial Processing for Computerized Facial Caricaturing System Picasso. In: Proceedings of IEEE International Conference on Systems, Man and Cybernetics, pp. 294–299 (1999) 12. Lai, K.H., Edirisinghe, E.A., Chung, P.W.H.: A Facial Component based Hybrid Approach to Caricature Generation using Neural Networks. In: Proceedings of ACTA on Computational Intelligence (2006) 13. Liang, L., Chen, H., Xu, Y.Q., Shum, H.Y.: Example-based Caricature Generation with Exaggeration. In: Proc. 10th Pacific Conf. on Computer Graphics and Applications (2002) 14. Martinez, A.M., Benavente, R.: The AR Face Database. CVC Technical Report #24 (June 1998) 15. Mo, Z., Lewis, J.P., Neumann, U.: Improved Automatic Caricature by Feature Normalization and Exaggeration. In: Proceedings of ACM SIGGRAPH Conf. on Abstracts and Applications, New York (2004) 16. Redman, L.: How to Draw Caricatures. Contemporary Books (1984) 17. Shet, R.N., Lai, K.H., Edirisinghe, E.A., Chung, P.W.H.: Use of Neural Networks in Automatic Caricature Generation: An Approach Based on Drawing Style Capture. In: IEE International Conference on VIE, UK, pp. 23–29 (2005) 18. Xu, G.Z., Kaneko, M., Kurematsu, A.: Synthesis of Facial Caricature Using Eigenspaces. In: Proceedings of Electronics and Communications in Japan. Part 3, vol. 87(8) (2004) 19. Xu, Z., Chen, H., Zhu, S.C.: A High Resolution Grammatical Model for Face Representation and Sketching. In: Proceedings of Computer Vision and Pattern Recognition, pp. 470–477 (2005)
Pose-Invariant Facial Expression Recognition Using Variable-Intensity Templates Shiro Kumano1 , Kazuhiro Otsuka2 , Junji Yamato2 , Eisaku Maeda2 , and Yoichi Sato1 Institute of Industrial Science, The University of Tokyo, 4–6–1 Komaba, Meguro-ku, Tokyo, 153–8505 Japan {kumano,ysato}@iis.u-tokyo.ac.jp 2 NTT Communication Science Laboratories, NTT 3–1 Morinosato-Wakamiya, Atsugi-shi, Kanagawa, 243–0198 Japan {otsuka,yamato}@eye.brl.ntt.co.jp, [email protected] 1
Abstract. In this paper, we propose a method for pose-invariant facial expression recognition from monocular video sequences. The advantage of our method is that, unlike existing methods, our method uses a very simple model, called the variable-intensity template, for describing different facial expressions, making it possible to prepare a model for each person with very little time and effort. Variable-intensity templates describe how the intensity of multiple points defined in the vicinity of facial parts varies for different facial expressions. By using this model in the framework of a particle filter, our method is capable of estimating facial poses and expressions simultaneously. Experiments demonstrate the effectiveness of our method. A recognition rate of over 90% was achieved for horizontal facial orientations on a range of ±40 degrees from the frontal view.
1 Introduction
Facial expression recognition is attracting a great deal of attention because of its usefulness in many applications such as human-computer interaction and the analysis of conversation structure [1]. Most existing methods for facial expression recognition assume that the person in the video sequence does not make any large movements and that the image shows a nearly frontal view of the face [2][3][4][5]. However, in situations such as multi-party conversations (e.g. meetings), people will often turn their faces to look at other participants. Hence, we must simultaneously handle the variations in facial pose as well as the facial expression changes. Facial expression recognition methods handling facial pose variations require a facial shape model of the user's neutral expression and a model of facial expression to treat the variations of facial pose and expression separately. The shape model and the facial expression model are together referred to as the face model in this paper. The face model expresses facial pose variations by globally translating and rotating the shape model in three-dimensional space, and facial
expression changes by locally deforming the shape model according to the facial expression model. Existing methods require an accurate face model, because image variations caused by facial expression change are often smaller than those caused by facial pose change. Accordingly, the use of inaccurate shape models degrades the accuracy of the facial pose and expression estimates because those two components cannot be separated reliably. Face models are divided broadly into two groups: person-dependent models and person-independent models. Previous methods generate a person-dependent face model for each user by using stereo cameras [6][7]. Accordingly, this approach cannot be applied to monocular video sequences. On the other hand, person-independent models can be applied to arbitrary users [8][9][10]. However, it has been reported that person-independent models cannot cover large interpersonal variations of face shape and facial expression with sufficient accuracy [11]. Motivated by these problems, we propose a novel method for facial expression recognition; based on variable-intensity templates, it offers the following advantages:
1. It supports monocular video capture systems.
2. It can easily generate a face model for each person.
3. Facial expressions can be estimated even with a large change in facial pose.
The variable-intensity template consists of three components: (1) a simple shape model, (2) a set of interest points, (3) an intensity distribution model. We use a cylinder as the shape model because it can be easily generated. However, to use such a simple shape model, we must handle the following problem. The interest points are defined on a face image in the frontal view (white points in the right figure of Fig. 3). Their image positions for arbitrary facial poses are then calculated by projecting them onto the shape model, translating and rotating the shape model according to the pose, and projecting the resulting three-dimensional positions onto the image plane. Hence, if there is an error in the shape model, the calculated image position of an interest point shifts from its actual position as the facial pose angle increases. This problem can be effectively avoided by employing pairs of points that straddle the edges of different facial parts (the left part of Fig. 1) [1][12]. Even if the interest points are shifted due to the error in the model, the change in the intensity of the points is small because the points are defined away from the edges, where the intensity changes significantly (the center part of Fig. 1). The intensity distribution model describes how interest point intensity varies for different expressions of a face. Our method prepares it to recognize facial expressions and to improve the robustness of facial pose estimation against the changes in facial expression that cause large changes in intensity (the right part of Fig. 1). Our contribution is that we propose a facial expression recognition method for varying facial poses based on the key idea that facial expressions can be correctly recognized by knowing how interest point intensity varies for different facial expressions. The main advantage of our method is that a face model for each person is created simply by capturing frontal face images of the person in
Fig. 1. Our method absorbs errors in shape models and recognizes facial expressions by treating the changes in intensity of multiple points defined around facial parts. Interest points are composed of pairs of points that straddle the edges of facial parts.
the target facial expressions, unlike existing methods. Furthermore, we implement a particle filter utilizing the variable-intensity template as a face model to simultaneously estimate facial poses and expressions. The remainder of this paper is organized as follows. First, our proposed method is described in Section 2. Then, in Section 3, experimental results are given. Finally, a summary and future works are given in Section 4.
2 Proposed Method
Our method consists of two stages: a calibration stage and a test stage (see Fig. 2). The calibration stage generates a variable-intensity template. In the test stage, our method estimates the facial pose and expression simultaneously within the framework of a particle filter.

2.1 Variable-Intensity Template
The variable-intensity template M consists of the following three components:

    M = {S, P, L}                                                                               (1)
Fig. 2. System flow chart
Fig. 3. Left: The method to extract interest points P. Right: The extraction result. Small white rectangles represent interest points; point pairs are indicated by the lines. The set of large rectangles represents the regions holding the interest points.
where S, P, and L denote a rigid facial shape model, a set of interest points, and an intensity distribution model, respectively. The intensity distribution model describes the intensity distribution of each interest point for different facial expressions. Shape Model S. A cylinder is used as the facial shape model because of its geometric simplicity. The radius of the cylinder is calculated as the width of the face region detected in a neutral expression image by the method of [13] multiplied by a fixed constant. Set of Interest Points P. The set of interest points P is described as follows: P = {p1 , · · · , pN }
(2)
where pi denotes the image coordinates of point i and N denotes the number of interest points in the same image used to generate shape model S. An interest point constitutes a pair of points. The pairs that satisfy the following conditions are extracted, from the region including four kinds of facial parts (eyebrows, eyes, nose, and mouth) in ascending order of the interpair difference in intensity until the number of pairs in each facial part reaches the limit [12]: (1) Each pair of interest points straddles and is centered on an edge. (2) Pairs are separated by at least a predefined distance (see Fig.3). The edges are obtained as zero-cross boundaries of the Laplacian-Gaussian filtered image. Intensity Distribution Model L. The intensities of interest points are assumed to follow independent normal distributions, because each point is apart from the other point and the normal distribution can adequately express the effect of the position shifts due to shape model error and imaging noise. The intensity distribution model L describes how the means and standard deviations of the distributions vary for discrete facial expressions. That is, the intensity distribution model of each interest point is the mixture distribution that consists of distributions for each facial expression.
328
S. Kumano et al. Interest point
(2) Eyebrow lowerer
Probability
Eyebrow
Face Expression Neutral
Angry
(3) Change in intensity of interest point
Intensity of interest point
Intensity distribution model (1) Change in facial expression
Fig. 4. Intensity distribution model L: The intensity distributions of interest points, described as normal distributions, change in facial expressions. The colors in the right part correspond to the interest points in the left part.
The intensity distribution model L is described as follows:

    L = {N1, · · · , NN},   Ni = N(μi(e), σi²(e))                                               (3)

    σi(e) = k·μi(e)                                                                             (4)
where N(μ, σ²) denotes a normal distribution with mean μ and standard deviation σ, e ∈ {0, · · · , Ne − 1} denotes the facial expression, Ne denotes the number of target expressions, e = 0 expresses the neutral expression, and μi(e) and σi(e) denote the mean and standard deviation of the intensity of point i for expression e, respectively. The standard deviation σi is assumed to be proportional to the mean μi with a constant of proportionality k. Changes in facial expression cause large changes in intensity around the facial parts (see Fig. 4). The intensity distribution model is used to represent these changes for the different facial expressions. Our method generates intensity distribution models for each person by using one frontal face image for each facial expression, without any head movement during the capture process. By using the neutral expression image to also generate the shape model S and extract the set of interest points P, the intensity means of the interest points in each expression can be set to the intensities of the pixels where these points were defined. For the standard deviation, our method employs a large k to reduce the effect of calibration errors and of changes in intensity of the face caused by changes in facial pose, which alter the illumination direction of the face.
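As an illustrative sketch of this calibration step (the image containers, point layout and the value of k are assumptions of ours, not values from the paper), the model of Eqs. (3)-(4) amounts to a table of per-point, per-expression means with proportional standard deviations:

    import numpy as np

    def build_intensity_model(calib_images, points, k):
        # calib_images: dict {expression id e: frontal grayscale image (H, W)},
        # all captured without head movement; points: (N, 2) integer (x, y)
        # coordinates of the interest points defined on the neutral image.
        mu, sigma = {}, {}
        for e, img in calib_images.items():
            vals = img[points[:, 1], points[:, 0]].astype(float)  # intensities at the points
            mu[e] = vals                                          # mu_i(e)
            sigma[e] = k * vals                                   # sigma_i(e) = k * mu_i(e), Eq. (4)
        return mu, sigma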
2.2 Simultaneous Estimation of Facial Pose and Expression by Using a Particle Filter
Our method simultaneously estimates the facial pose and expression by calculating the likelihood of the intensity of interest points for the intensity distribution model. The joint distribution of facial pose and expression at time t given all face images up to that time (z 1:t ) is recursively represented as follows:
    P(ht, et | z1:t) = α·P(zt | ht, et) ∫ P(ht | ht−1) [ Σ_{et−1} P(et | et−1)·P(ht−1, et−1 | z1:t−1) ] dht−1        (5)
where the facial pose state ht and expression state et follow first-order Markov processes; ht and et are assumed to be conditionally independent given image zt; Bayes' rule and conditional independence are used along with marginalization [14], and α = 1/P(zt) is a normalization constant. The facial pose state ht consists of the following six continuous variables: the coordinates of the center of the template on the image plane, three-dimensional rotation angles (roll, pitch, and yaw), and scale. We adopt a random walk model for each parameter of the facial pose, yielding the state transition model P(ht | ht−1), and set P(et | et−1) to be equal for all facial expression combinations. Equation (5), unfortunately, cannot be calculated exactly, because the parameters of the facial pose ht are continuous and their distributions are complex due to occlusion, etc. Hence, we use a particle filter, which calculates Equation (5) by approximating the posterior density as a set of weighted samples called particles. Each particle expresses a state and its weight. In our method, the state and weight of the l-th particle are expressed as [ht^(l), et^(l)] and ωt^(l), where ωt^(l) is proportional to P(zt | ht^(l), et^(l)) calculated using Equation (6) and satisfies Σ_l ωt^(l) = 1.
where P denotes the set of non-occluded interest points. Here, we consider that the interest point is not occluded if the surface normal of its corresponding point on the facial shape model is pointing toward the camera. We define the likelihood of point i in facial pose ht and expression et , P (zi,t |ht , et ), by adopting a robust estimation as follows:
zi,t − μi (et ) 1 1 exp − ρ P (zi,t |ht , et ) = √ , (7) 2 σi (et ) 2πσi (et ) 2 if x2 < x , (8) ρ(x) = , otherwise where μi (et ) and σi (et ) are the mean and standard deviation of the intensity of interest point i in facial expression et , respectively, and ρ(·) denotes a robust function. Intensity zi,t is the intensity of image z t at the image coordinate of point i at time t, q i,t . The image coordinate q i,t is calculated as q i,t = f (pi , S, ht ), where function f returns q i,t by projecting the image coordinate pi onto the shape model S, with translation and rotation of S according to pose ht , and performing weak perspective projection of the resulting threedimensional position onto the image plane.
Estimator of Facial Pose and Expression. The estimators of facial pose and expression at time t (ĥt and êt) are calculated as follows:

    ĥt = Σ_l ωt^(l)·ht^(l)                                                                      (9)

    êt = arg max_{et} Σ_{l: et^(l) = et} ωt^(l)                                                 (10)

where the estimated facial expression êt is defined as the expression with the maximum probability P(et | z1:t).
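Given normalized particle weights, the estimators of Eqs. (9)-(10) are a weighted mean over poses and a weighted vote over expression labels; a sketch under the same naming assumptions:

    import numpy as np

    def estimate(h_particles, e_particles, weights, num_expressions):
        # h_particles: (L, 6) pose states; e_particles: (L,) integer expression
        # labels; weights: (L,) normalized particle weights.
        h_hat = weights @ h_particles                             # Eq. (9)
        expr_prob = np.bincount(e_particles, weights=weights,
                                minlength=num_expressions)        # approx. P(e_t | z_1:t)
        e_hat = int(np.argmax(expr_prob))                         # Eq. (10)
        return h_hat, e_hat, expr_prob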
3 Experimental Results
To evaluate the usefulness of our method, we performed two types of tests on video sequences wherein subjects exhibited multiple facial expressions with the head fixed (Test 1) and with the head moving freely (Test 2). The target facial expressions were neutral, angry, sad, surprise and happy. Grayscale video sequences with a size of 512 × 384 pixels were captured by an IEEE1394 camera at 15 fps for the subjects. The number of particles was set to 1,500, and the processing time was about 80 ms/frame on a Pentium D processor at 3.73 GHz with 3.0 GB RAM.

3.1 Details of the Experiment
Five male subjects each participated in Test 1 once, and one male subject participated in Test 2 once. In Test 1, the subject showed the five facial expressions one by one with the head fixed in a horizontal direction, each for a duration of 60 frames followed by a 60-frame interval, according to instructions displayed on a monitor. We targeted the following five yaw angles of the face relative to the camera direction: −40, −20, 0, 20, and 40 degrees. In Test 2, the subject freely showed the five facial expressions one by one while shaking the head left and right.

3.2 Facial Expression Recognition Results
The recognition rates of facial expression were calculated for Test 1. The ground truth of the facial expression at every frame was defined to be the expression indicated by the instruction. In consideration of the time lag between the instruction and the exhibition of the facial expression, we excluded the first 20 frames of each expression just after the instruction was displayed. Figure 5 shows some estimation results of facial poses and expressions in Test 1, where facial poses and expressions were correctly estimated in the images of all subjects. The recognition rate was calculated as the ratio between the number of frames wherein the estimated expression equaled the ground truth and the total number of target frames. Table 1 shows the average facial expression recognition rates of the five tests for each target yaw angle. The average rate over the range of ±40 degree yaw angles exceeded 90%, although the recognition rate decreased as the yaw
Fig. 5. Some estimation results of facial poses and expressions in Test 1: White grids and small points on each face denote the shape model and interest points in the estimated facial pose. The width of each bar in the upper right part of each image denotes the estimated probability of each facial expression, P (et |z1:t ).
Table 1. Average recognition rates of facial expressions for each yaw angle: Test 1

    Yaw angle (degree)     total   -40    -20     0     20    40
    Recognition rate (%)    90.7   91.4  100.0  99.9  84.3  78.1
Table 2. Confusion matrix of average recognition rates of facial expressions in Test 1: the expressions in the rows are the ground truths and the expressions in the columns are the recognition results

    Expression   Neutral  Angry   Sad  Surprise  Happy
    Neutral         97.0    0.0   2.8       0.1    0.1
    Angry            1.9   80.1  12.3       1.0    4.7
    Sad              9.2    0.6  77.7      12.2    0.3
    Surprise         0.3    0.1   0.3      99.0    0.3
    Happy            0.1    0.0   0.0       0.0   99.9
Fig. 6. Images and estimation results in Test 2. (a) Input video sequence (from left to right, frame numbers 100, 290, 400, 560, 660). (b) Ground truth (top) and recognition results (below) of facial expression: the probability of the correct expression is remarkably higher than that of the other expressions. (c) Estimation results of facial pose (roll, pitch and yaw; horizontal axis as in (b)): facial poses are estimated with enough accuracy to detect three cycles of head shake movement.
angle increased. It seems that the recognition rates decrease more with positive yaw angles than with negative yaw angles because of lighting asymmetry. That is, if the lighting condition is horizontally symmetric against the face, it
is expected that the recognition rates with positive yaw angles will increase to a similar extent as those with negative yaw angles. Table 2 shows the confusion matrix for the recognition rates. It seems that the angry and sad expressions were the most similar, because they were sometimes confused with each other. Five frames of the video sequence and the estimated results of facial expression and pose in each frame in Test 2 are shown in (a), (b) and (c) of Fig. 6. The ground truth of the facial expression at every frame was determined manually. Figure 6(b) shows that facial expressions were recognized correctly in almost all frames. In addition, the correct expressions were assigned remarkably higher probabilities than the other expressions. Fig. 6(c) also shows that facial poses were estimated with enough accuracy to detect three cycles of head shake movement. Video sequences of the results in Tests 1 and 2 are available from [15].
4 Summary and Future Works
In this paper, we presented a particle filter-based method for estimating facial pose and expression simultaneously by using a novel face model called the variable-intensity template. Our method has the distinct advantage that a face model for each person can be prepared very easily with a simple calibration step. With our method, five facial expressions were recognized with 90.7% accuracy for horizontal facial orientations over the range of ±40 degrees from the frontal view. In the future, we would like to conduct more experiments with additional subjects to complete the statistical evaluation. In the current framework, our method cannot correctly estimate facial poses and expressions under the large variations in intensity caused by large head movements, especially in the vertical direction, or by lighting variations. Hence, we are planning to handle such intensity variations by updating the intensity of each interest point (e.g. [16]). Another goal is to achieve a fully automatic system by applying an online clustering technique for extracting target facial expressions, instead of calibrating the intensities for the facial expressions in advance. In addition, we would also like to improve the way interest points are defined, by optimizing point selection using a projection matrix that maximizes the ratio of between-class to within-class scatter of the facial expression images (i.e. Fisherfaces [17]).
References 1. Otsuka, K., Yamato, J., Takemae, Y., Murase, H.: Conversation scene analysis with Dynamic Bayesian Network based on visual head tracking. In: Proc. of the IEEE International Conference on Multimedia and Expo, pp. 949–952. IEEE Computer Society Press, Los Alamitos (2006) 2. Cohen, I., Sebe, N., Chen, L., Garg, A., Huang, T.: Facial expression recognition from video sequences: Temporal and static modeling. Computer Vision and Image Understanding 91, 160–187 (2003)
3. Kaliouby, R., Robinson, P.: Generalization of a vision-based computational model of mind-reading. In: Proc. of the First International Conference on Affective Computing and Intelligent Interatction, pp. 582–589 (2005) 4. Chang, Y., Hu, C., Feris, R., Turk, M.: Manifold based analysis of facial expression. Image and Vision Computing 24, 605–614 (2006) 5. Bartlett, M., Littlewort, G., Frank, M., Lainscsek, C., Fasel, I., Movellan, J.: Automatic recognition of facial actions in spontaneous expressions. Journal of Multimedia 1, 22–35 (2006) 6. Gokturk, S.B., Tomasi, C., Girod, B., Bouguet, J.Y.: Model-based face tracking for view-independent facial expression recognition. In: Proc. of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition, pp. 287–293. IEEE Computer Society Press, Los Alamitos (2002) 7. Oka, K., Sato, Y.: Real-time modeling of face deformation for 3D head pose estimation. In: Proc. of the IEEE International Workshop on Analysis and Modeling of Faces and Gestures, pp. 308–320. IEEE Computer Society Press, Los Alamitos (2005) 8. Dornaika, F., Davoine, F.: Simultaneous facial action tracking and expression recognition using a particle filter. In: Proc. of the Tenth IEEE International Conference on Computer Vision, vol. 2, pp. 1733–1738 (2005) 9. Zhu, Z., Ji, Q.: Robust real-time face pose and facial expression recovery. In: Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 681–688. IEEE Computer Society Press, Los Alamitos (2006) 10. Lucey, S., Matthews, I., Hu, C., Ambadar, Z., Torre, F., Cohn, J.: AAM derived face representations for robust facial action recognition. In: Proc. of the 7th International Conference on Automatic Face and Gesture Recognition, pp. 155–160 (2006) 11. Gross, R., Matthews, I., Baker, S.: Generic vs. person specific Active Appearance Models. Image and Vision Computing 23, 1080–1093 (2005) 12. Matsubara, Y., Shakunaga, T.: Sparse template matching and its application to real-time object tracking. IPSJ Transactions on Computer Vision and Image Media 46(9), 17–40 (2005) 13. Viola, P.A., Jones, M.J.: Rapid object detection using a boosted cascade of simple features. In: Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 511–518. IEEE Computer Society Press, Los Alamitos (2001) 14. Rich, E., Knight, K.: Artificial intelligence, pp. 537–583. McGraw-Hill Book Company, New York (1991) 15. http://www.hci.iis.utokyo.ac.jp∼ kumano/papers/accv2007.html 16. Jepson, A.D., Fleet, D.J., El-Maraghi, T.F.: Robust online appearance models for visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 1296–1311 (2003) 17. Belhumeur, P.N., Hespanha, J., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 711–720 (1997)
Gesture Recognition Under Small Sample Size Tae-Kyun Kim1 and Roberto Cipolla2 2
1 Sidney Sussex College, University of Cambridge, Cambridge, CB2 3HU, UK Department of Engineering, University of Cambridge, Cambridge, CB2 1PZ, UK
Abstract. This paper addresses gesture recognition under small sample size, where direct use of traditional classifiers is difficult due to the high dimensionality of the input space. We propose a pairwise feature extraction method of video volumes for classification. The method of Canonical Correlation Analysis is combined with discriminant functions and the Scale-Invariant Feature Transform (SIFT) to obtain discriminative spatiotemporal features for robust gesture recognition. The proposed method is practically favorable as it works well with a small number of training samples, involves few parameters, and is computationally efficient. In experiments using 900 videos of 9 hand gesture classes, the proposed method notably outperformed classifiers such as the Support Vector Machine/Relevance Vector Machine, achieving 85% accuracy.
1 Introduction
Gesture Recognition Review. Gesture recognition is an important topic in computer vision because of its wide range of applications, such as human-computer interfaces, sign language interpretation and visual surveillance. Not only spatial variation but also temporal variation among gesture samples makes this recognition problem difficult. For instance, different subjects have different hand appearances and may sign gestures at different paces. Recent work in this area tends to handle the above variations separately and therefore splits into two smaller areas, namely posture recognition (static) and hand motion or action recognition (dynamic). In posture recognition, the pose or configuration of the hands is recognised using silhouette [5] and texture [6]. By contrast, hand motion or action recognition interprets the meaning of the movement using the full trajectory [9], optical flow [4] and motion gradient [11]. Compared with hand motion recognition, posture recognition is easier in the sense that state-of-the-art classifiers, e.g. the Support Vector Machine, the Relevance Vector Machine [11] or Adaboost [6], can be directly applied to it. Gesture recognition, on the other hand, has adopted rather different approaches, e.g. Hidden Markov Models [9] or Dynamic Time Warping [3], to discriminate dynamic or temporal information which is typically highly non-linear in the data space. These methods, especially the Hidden Markov Models, have many parameters to set, require a large number of training examples, and are difficult to extend to a large vocabulary [2]. Besides, these traditional methods have not integrated the posture and
temporal information and thus have difficulty differentiating gestures with similar movements signed by different hand shapes. Some recent works [8] directly operate on the full spatiotemporal volume, considering both posture and temporal information of gestures to a certain degree, but are still unreliable in cases of motion discontinuities and motion aliasing. Also, the method [8] requires the manual setting of important parameters such as the positions and scales of local space-time patches. Another important line of methods exploits visual code words (for representation) with either a Support Vector Machine (SVM) or a probabilistic generative model [12,13]. Again, for their good performance, it is critical to properly set the parameters associated with the representation, e.g. the space-time interest points and the code book size. Motivation and Summary of This Study. To avoid the empirical setting of parameters in the existing methods, it seems obvious to seek a more generic and simpler learnable approach for gesture recognition. Note that many of the critical parameters in the previous methods are incurred in the step of representing gesture videos prior to using classifiers. In that case, it might seem better to apply learnable classifiers directly to the videos, which can simply be converted into column vectors. Unfortunately, this is not a good way either. Vectorization of a video by concatenating all pixels in the three-dimensional video volume causes a high dimension of N³, which is much larger than the N² of an image. Also, it may be more difficult to collect a sufficient number of video samples for classifiers than images (note that a single video consists of multiple images). The so-called small sample size problem is more serious in learning classifiers with videos than with images. Getting back to the representation issue, this work focuses on how to learn useful features from videos for classification, discussing its benefits over directly using classifiers. With the given discriminative features, even a simple Nearest Neighbor (NN) classifier achieved a very good accuracy. An extension of Canonical Correlation Analysis (CCA) [1,15] (a standard tool for inspecting linear relationships between two sets of vectors) is proposed to yield robust pairwise features of any two gesture videos. The proposed method is closely related to our previous framework of Tensor Canonical Correlation Analysis [14], which extends the classical CCA to multidimensional data arrays by sharing either a single axis or two axes. The method of sharing two axes, i.e. planes, between two video data is updated and combined with the discriminative functions and the Scale-Invariant Feature Transform for further improvements. The proposed method does not require any significant meta-parameters to be adjusted and can learn both posture and temporal information for gesture classification. The rest of the paper is organized as follows: the next section explains the proposed method with the discriminant functions, discussing the benefit of the method over traditional classifiers. The SIFT representation for video data is combined with the method for improvements in Section 3. Section 4 shows the experimental results and Section 5 draws the conclusion.
2 Discriminative Spatiotemporal Canonical Correlations
Canonical Correlation Analysis (CCA) has been a standard tool for inspecting linear relationships between two random variables, or two sets of vectors. This was recently extended to two multidimensional data arrays in [14]. The method of spatiotemporal canonical correlations (which is related to the previous work in exploiting planes rather than scan vectors of two videos) is explained as follows. A gesture video is represented by first decomposing an input video clip (i.e. a spatiotemporal volume) into three sets of orthogonal planes, namely XY-, YT- and XT-planes, as shown in Figure 1. This captures posture information in the XY-planes and joint posture/dynamic information in the YT- and XT-planes. Three kinds of subspaces are learnt from the three sets of planes (which are converted into vectors by raster-scanning). Then, gesture recognition is done by comparing these subspaces with the corresponding subspaces of the models by classical canonical correlation analysis, which measures principal angles between subspaces¹. By comparing the subspaces of an input and a model, robust gesture recognition can be achieved up to pattern variations on the subspaces. The similarity of any model Dm and query spatiotemporal data Dq is defined as the weighted sum of the normalized canonical correlations of the three subspaces by

    F(Dm, Dq) = Σ_{k=1..3} wk · N^k(Pm^k, Pq^k)                                                 (2)

    N^k(Pm^k, Pq^k) = (G(Pm^k, Pq^k) − m^k) / σ^k                                               (3)

where P^1, P^2, P^3 denote the matrices containing the first few eigenvectors (in their columns) of the XY-planes, XT-planes and YT-planes respectively, and G(Pm, Pq) is the sum of the canonical correlations computed from Pm, Pq. The normalization parameters with index k are the mean and standard deviation of the matching scores, i.e. of G over all pairwise videos in a validation set for the corresponding planes. The discriminative spatiotemporal canonical correlation is defined by applying the discriminative transformation [10] learnt from each of the three data domains as

    H(Dm, Dq) = Σ_{k=1..3} wk · N^k(h(Q^kT Pm^k), h(Q^kT Pq^k))                                  (4)
¹ Canonical correlations between two d-dimensional linear subspaces L1 and L2 are uniquely defined as the maximal correlations between any two vectors of the subspaces [1]:

    ρi = cos θi = max_{ui∈L1} max_{vi∈L2} ui^T vi                                               (1)

subject to ui^T ui = vi^T vi = 1 and ui^T uj = vi^T vj = 0 for j = 1, ..., i−1. We will refer to ui and vi as the i-th pair of canonical vectors. Multiple canonical correlations are defined by having the next pairs of canonical vectors orthogonal to the previous ones. The solution is given by the SVD of P1^T P2 as

    P1^T P2 = L Λ R^T,   where Λ = diag{ρ1, ..., ρd},

where P1, P2 are the eigen-basis matrices and L, Λ, R are the outputs of the SVD.
where h is a vector orthonormalization function and Q^k is the discriminative transformation matrix learnt over the corresponding set of planes. The discriminative matrix is found so as to maximize the canonical correlations of within-class sets and minimize the canonical correlations of between-class sets, by analogy to the optimization concept of Linear Discriminant Analysis (LDA) (see [10] for details). In the transformed space, gesture video classes are more discriminative in terms of canonical correlations. In this paper, this concept has been validated not only for the spatial domain (XY-subspaces) but also for the spatiotemporal domains (XT-, YT-subspaces).

Discussions. The proposed method is essentially a divide-and-conquer approach: it partitions the original input space into the three different data domains, learns the canonical correlations on each domain, and then aggregates them with proper weights. In this way, the original data dimension N³, where N is the size of each axis, is reduced to 3 × N², so that the data can be conveniently modelled. As shown in Figure 2a-c, each data domain is well characterized by the corresponding low-dimensional subspace (e.g. hand shapes in the XY-planes, joint spatial and temporal information in the YT- and XT-planes).
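The canonical correlations of the footnote (Eq. (1)) and the similarities of Eqs. (2)-(4) can be sketched directly with an SVD; orthonormal basis matrices are assumed, and all names below are ours:

    import numpy as np

    def canonical_correlations(P1, P2):
        # P1, P2: (D, d) orthonormal basis matrices of two subspaces.
        # The singular values of P1^T P2 are the canonical correlations, Eq. (1).
        return np.linalg.svd(P1.T @ P2, compute_uv=False)

    def similarity(planes_m, planes_q, w, m, s):
        # planes_m, planes_q: lists of the three basis matrices (XY, XT, YT);
        # w, m, s: the weights w_k and normalization parameters m_k, sigma_k.
        F = 0.0
        for k in range(3):
            G = canonical_correlations(planes_m[k], planes_q[k]).sum()
            F += w[k] * (G - m[k]) / s[k]                         # Eqs. (2)-(3)
        return F

For the discriminative version of Eq. (4), the basis matrices would first be multiplied by Q^kT and re-orthonormalized (e.g. with a QR decomposition) before computing the canonical correlations.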
Fig. 1. Spatiotemporal Data Representation (a spatiotemporal volume with axes X, Y, T decomposed into XY-, XT- and YT-planes)
Fig. 2. Principal Components and Canonical Vectors: The first few principal components of the (a) XY, (b) XT, (c) YT subspaces of two differently illuminated sequences of the same gesture class (see Figure 5) are shown in the top and bottom rows respectively. The corresponding pairwise canonical vectors are visualized in (d)-(f). Despite the different lighting conditions of the two input sequences, the canonical vectors in each pair (top and bottom) are very much alike, capturing common modes.
Moreover, the method is robust because it uses the mutual (or canonically correlated) components of the pairwise subspaces. By finding the mutual components of maximum correlation, which are the canonical correlations, some information that is undesirable for classification can be filtered out. See Figure 2 for the principal components and canonical vectors of two sequences of the same gesture class captured under different lighting conditions. Whereas the first few principal components mainly corresponded to the different lighting conditions (Figure 2a-c), the canonical vectors (Figure 2d-f) captured the common modes of the two sequences well, being visually the same in each pair. In other words, the lighting variations across the two sets were removed in the process of CCA, as it is invariant to any variations on the subspaces. Many previous studies have reported that lighting variations are often confined to a low-dimensional subspace. In summary, the proposed method has a benefit over directly learning classifiers under small sample size, as illustrated in Figure 3. A high-dimensional input space and a small training set often cause classifiers to overfit the training data and to generalize poorly to new test data. The distribution of test samples taken under different conditions can deviate largely from that of the training set, so that the majority of the test samples of class 1 are misclassified in Figure 3. Nevertheless, the two intersection sets of the train and test sets are still placed in the correct decision regions learnt from the training sets. As discussed above, canonical correlation analysis can be conceptually seen as a process of finding the mutual information (or an intersection set) of any two sets.

Fig. 3. Canonical Correlation Based Classification
3 SIFT Descriptor for Spatiotemporal Volume Data
Edge-based description of each plane of the videos can help the method achieve more robust gesture recognition. In this section we propose a simple and effective SIFT (Scale-Invariant Feature Transform) [7] representation of spatiotemporal data using a fixed grid. As explained, the spatiotemporal volume is broken down into three sets of orthogonal planes (XY-, YT- and XT-planes) in the method. Along each data domain, there is a finite number of planes, which can be regarded as images. Each of these images is further partitioned into M × N patches on a predefined fixed grid, and a SIFT descriptor is obtained from each patch (see Figure 4a). For each image, the feature descriptor is obtained by concatenating the SIFT descriptors of the patches in a predefined order. The SIFT representation of the three sets of planes is directly integrated into the proposed method of Section 2 by replacing the sets of image vectors with the sets of SIFT descriptors prior to canonical correlation analysis. The experimental results show that the edge-based representation generally improves on the intensity-based representation in both the joint space-time domain (YT-, XT-planes) and the spatial domain (XY-planes).
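As a rough illustration of the fixed-grid SIFT representation of a single plane, the sketch below uses OpenCV's SIFT implementation with keypoints placed at the centres of an M × N grid; the grid size and patch scale are assumptions, not values taken from the paper.

import cv2
import numpy as np

def grid_sift_descriptor(plane, M=4, N=4, patch_size=16.0):
    # plane: an 8-bit grayscale XY-, XT- or YT-image.
    h, w = plane.shape[:2]
    sift = cv2.SIFT_create()
    keypoints = []
    for i in range(M):
        for j in range(N):
            # One keypoint at the centre of each grid patch.
            x = (j + 0.5) * w / N
            y = (i + 0.5) * h / M
            keypoints.append(cv2.KeyPoint(float(x), float(y), patch_size))
    _, desc = sift.compute(plane, keypoints)   # (M*N) x 128 descriptors
    return desc.reshape(-1)                    # concatenated in grid order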
SIFT obtained from 3D blocks. This section presents a general 3D extension of SIFT features. Traditional classifiers such as Support Vector Machine (SVM)/ Relevance Vector Machine (RVM) are applied to the video data represented by the 3D SIFT so that they can be compared with the proposed method (with SIFT) in the same data domain. Given a spatiotemporal volume representing a gesture sequence, the volume is firstly partitioned into M × N × T tiny blocks. Within each tiny block, further analysis is done along XY-planes and YT-planes (see Figure 4b). For analysis on a certain plane, say XY-planes, derivatives along X- and Y- dimensions are obtained and accumulated to form several regional orientation histograms (under a 3D Gaussian weighting scheme). For each tiny block, the resultant orientation histograms of both planes are then concatenated to form the final SIFT descriptor of dimension 256. The descriptor for the whole spatiotemporal volume can then be formed by concatenating the SIFT descriptors of all tiny blocks in a predefined order. The spatiotemporal volume is eventually represented as a single long concatenated vector.
Fig. 4. SIFT Representation: (a) SIFT used in [7]. (b) SIFT from 3D blocks (refer to text).
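The following sketch is one plausible reading of the 3D-SIFT block descriptor: it accumulates gradient orientation histograms over the XY- and YT-slices of each block on a 4 × 4 spatial grid with 8 orientation bins (128-D per plane, 256-D per block) and omits the 3D Gaussian weighting; these simplifications are our assumptions, not the authors' implementation.

import numpy as np

def plane_orientation_histogram(patch, grid=4, bins=8):
    # Gradient orientation histogram on a 2-D slice, pooled over a grid x grid layout.
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2.0 * np.pi)
    H, W = patch.shape
    hist = np.zeros((grid, grid, bins))
    rows = np.arange(H) * grid // H
    cols = np.arange(W) * grid // W
    b = (ang * bins / (2.0 * np.pi)).astype(int) % bins
    for i in range(H):
        for j in range(W):
            hist[rows[i], cols[j], b[i, j]] += mag[i, j]
    return hist.ravel()                              # grid*grid*bins = 128-D

def sift3d_block(block, grid=4, bins=8):
    # Accumulate histograms over all XY-slices and all YT-slices of one tiny block.
    xy = sum(plane_orientation_histogram(block[:, :, t], grid, bins)
             for t in range(block.shape[2]))
    yt = sum(plane_orientation_histogram(block[x, :, :], grid, bins)
             for x in range(block.shape[0]))
    return np.concatenate([xy, yt])                  # 256-D per block

def sift3d_descriptor(volume, M=4, N=4, T=1):
    # volume: spatiotemporal array indexed as (x, y, t); output: one long vector.
    xs, ys, ts = (np.array_split(np.arange(s), n)
                  for s, n in zip(volume.shape, (M, N, T)))
    descs = [sift3d_block(volume[np.ix_(bx, by, bt)])
             for bx in xs for by in ys for bt in ts]
    return np.concatenate(descs)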
4 Empirical Evaluation

4.1 Cambridge Hand Gesture Data Set and Experimental Protocol
We have acquired a hand-gesture database² consisting of 900 image sequences of 9 gesture classes. Each class has 100 image sequences (5 different illuminations × 10 arbitrary motions of 2 subjects). Each sequence was recorded at a frame rate of 30 fps and a resolution of 320×240. The 9 classes are defined by 3 primitive hand shapes and 3 primitive motions (see Figure 5). See Figure 5c for example images captured under the 5 different illumination settings.
² The database is available upon request. Contact e-mail: [email protected]
Fig. 5. Hand-Gesture Database: (a) 9 gesture classes formed by 3 shapes and 3 motions: Flat/Leftward (class 1), Flat/Rightward (class 2), Flat/Contract (class 3), Spread/Leftward (class 4), Spread/Rightward (class 5), Spread/Contract (class 6), V-shape/Leftward (class 7), V-shape/Rightward (class 8), V-shape/Contract (class 9); (b) 5 illumination settings.
The data set contains temporally isolated gesture sequences which exhibit variations in initial position, hand posture and speed of movement across sequences. All training was performed on data acquired in a single illumination setting, while testing was done on data acquired in the remaining settings. The 20 sequences in the training set were randomly partitioned into 10 sequences for training and 10 for validation.

4.2 Results and Discussions
We compared the accuracy of 9 different methods:
– Support Vector Machine (SVM) or Relevance Vector Machine (RVM) applied to Motion Gradient Orientation images [11] (MGO SVM or MGO RVM),
– RVM applied to the 3D SIFT vectors described in Section 3 (3DSIFT RVM),
– the canonical correlations (CC) (i.e. the method using G(P^1_m, P^1_q) in (2)), spatiotemporal canonical correlations (ST-CC), and discriminative ST-CC (ST-DCC),
– the canonical correlations of the SIFT descriptors (SIFT CC), spatiotemporal canonical correlations of the SIFT vectors (SIFT ST-CC), and SIFT ST-CC with the discriminative transformations (SIFT ST-DCC).
Fig. 6. Recognition Accuracy: The identification rates (in percent) of all comparative methods are shown for the plain lighting set used for training and all the others for testing
In the proposed method, the weights w_k were set proportionally to the accuracy of the three subspaces on the validation set, and Nearest Neighbor (NN) classification was done with the defined similarity functions. Figure 6 shows the recognition rates of the 9 methods when the plain lighting set (the leftmost in Figure 5c) was used for training and all the others for testing. The approaches using SVM/RVM on the motion gradient orientation images are the worst. As observed in [11], using RVM improved the accuracy of SVM by about 10% for MGO images. However, we obtained much poorer accuracy than in the previous study [11], mainly for the following reasons. The gesture classes in this study are defined by hand shapes as well as motions, and both methods often failed to discriminate gestures which exhibit the same motion with different shapes, as they are mainly based on motion information. The much smaller number of training sequences (of a single lighting condition) is another reason for the performance degradation. The accuracy of the RVM on the 3D-SIFT vectors was also poor. The high dimension of the 3D-SIFT vectors and the small sample size might prevent the classifier from learning properly, as discussed. We measured the accuracy of the RVM classifier for different numbers of blocks in the 3D-SIFT representation (2-2-1, 3-3-1, 4-4-1, 4-4-2 for X-Y-T) and obtained the best accuracy for the 2-2-1 case, which yields the lowest dimension of the 3D-SIFT vectors. Canonical correlation based methods significantly outperformed the previous approaches. The proposed spatiotemporal canonical correlation method (ST-CC) improved on the simple canonical correlation method by about 15%. The proposed discriminative method (ST-DCC) unexpectedly decreased the accuracy of ST-CC, possibly due to overfitting of the discriminative transformation: the training set did not reflect the lighting conditions of the test set. However, note that the discriminative method improved the accuracy when it was applied to the SIFT representations rather than to intensity images (see SIFT ST-CC and SIFT
Table 1. Evaluation of the individual subspaces

                    CC                          SIFT CC
(%)      XY     XT     YT     ST       XY     XT     YT     ST
mean    64.5   40.2   56.2   78.9     70.3   61.8   58.3   80.4
std      1.3    5.9    5.3    2.4      2.1    3.3    4.0    3.2
Table 2. Evaluation for different numbers of blocks in the SIFT representation. E.g. 2-2-1 indicates the SIFT representation where the X, Y, and T axes are divided into 2, 2, 1 segments respectively.

               2-2-1            3-3-1            4-4-1            4-4-2
(%)      ST-CC  ST-DCC    ST-CC  ST-DCC    ST-CC  ST-DCC    ST-CC  ST-DCC
mean      80.3   80.0      78.9   83.8      80.4   85.1      75.9   83.4
std        1.9    2.5       3.6    2.7       3.2    2.8       2.4    0.7
ST-DCC in Figure 6). The three proposed methods using the SIFT representations are better than the respective three methods using intensity images. The best accuracy, 85%, was achieved by SIFT ST-DCC. Table 1 and Table 2 show further results for the proposed method, where all 5 experimental results (corresponding to each illumination set used for training) are averaged. As shown in Table 1, the canonical correlations of the XY subspace obtained better accuracy with smaller standard deviations than the other two subspaces, but all three are relatively good compared with the traditional methods, MGO SVM/RVM and 3DSIFT RVM. Using the SIFT representation considerably improved the accuracy over intensity images for each individual subspace, whereas the improvement for the joint representation was relatively small. Table 2 shows the accuracy of ST-CC and ST-DCC for different numbers of blocks in the SIFT representation. The best accuracy was obtained for the 4-4-1 case for XYT (each number indicates the number of divisions along one axis). Generally, using the discriminative transformation improved the accuracy of ST-CC for the SIFT representation. Note that the accuracy of the method is not sensitive to the number of blocks, which is practically important. Also, the proposed approach based on canonical correlations is computationally cheap, requiring O(3 × d³) operations, where d is the dimension of each subspace (which was 10), and thus facilitates efficient gesture recognition on a large data set.
5 Conclusion
A new subspace-based method has been proposed for gesture recognition under small sample size. Unlike typical classification approaches that operate directly on the input space, the proposed method reduces the input dimension using the three sets of orthogonal planes. The method provides robust spatiotemporal volume
matching by analyzing the mutual information (or canonical correlations) between any two gesture sequences. Experiments on the 900 gesture sequences showed that the proposed method significantly outperformed the traditional classifiers and yielded the best classification results when the discriminative transformations and SIFT descriptors were used jointly. The method is also practically attractive as it does not involve significant parameter tuning and is computationally efficient.
References
1. Björck, Å., Golub, G.H.: Numerical methods for computing angles between linear subspaces. Mathematics of Computation 27(123), 579–594 (1973)
2. Bowden, R., Windridge, D., Kadir, T., Zisserman, A., Brady, M.: A linguistic feature vector for the visual interpretation of sign language. In: ECCV, pp. 390–401 (2004)
3. Darrell, T., Pentland, A.: Space-time gestures. In: Proc. of CVPR, pp. 335–340 (1993)
4. Efros, A.A., Berg, A.C., Mori, G., Malik, J.: Recognizing action at a distance. In: Proc. of ICCV, pp. 726–733 (2003)
5. Freeman, W., Roth, M.: Orientation histogram for hand gesture recognition. In: Int'l Conf. on Automatic Face and Gesture Recognition (1995)
6. Just, A., Rodriguez, Y., Marcel, S.: Hand posture classification and recognition using the modified census transform. In: Int'l Conf. on Automatic Face and Gesture Recognition (2006)
7. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60(2), 91–110 (2004)
8. Shechtman, E., Irani, M.: Space-time behavior based correlation. In: Proc. of CVPR 2005, pp. 405–412 (2005)
9. Starner, T., Pentland, A., Weaver, J.: Real-time American sign language recognition using desk and wearable computer based video. IEEE Trans. Pattern Anal. Mach. Intell. 20(12), 1371–1375 (1998)
10. Kim, T., Kittler, J., Cipolla, R.: Discriminative learning and recognition of image set classes using canonical correlations. IEEE Trans. on PAMI 29(6), 1005–1018 (2007)
11. Wong, S., Cipolla, R.: Real-time interpretation of hand motions using a sparse Bayesian classifier on motion gradient orientation images. In: Proc. of BMVC 2005, pp. 379–388 (2005)
12. Niebles, J.C., Wang, H., Fei-Fei, L.: Unsupervised learning of human action categories using spatial-temporal words. In: BMVC (2006)
13. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: A local SVM approach. In: ICPR 2004, pp. 32–36 (2004)
14. Kim, T., Wong, S., Cipolla, R.: Tensor Canonical Correlation Analysis for Action Classification. In: CVPR (2007)
15. Hardoon, D., Szedmak, S., Taylor, J.S.: Canonical correlation analysis: An overview with application to learning methods. Neural Computation 16(12), 2639–2664 (2004)
Motion Observability Analysis of the Simplified Color Correlogram for Visual Tracking

Qi Zhao and Hai Tao
Department of Computer Engineering, University of California at Santa Cruz, Santa Cruz, CA 95064
{zhaoqi,tao}@soe.ucsc.edu
Abstract. Compared with the color histogram, where the position information of each pixel is ignored, a simplified color correlogram (SCC) representation encodes the spatial information explicitly and enables an estimation algorithm to recover the object orientation. This paper analyzes the capability of the SCC (in a kernel based framework) in detecting and estimating object motion and presents a principled way to obtain motion observable SCCs as object representations to achieve more reliable tracking. Extensive experimental results demonstrate the reliability of the tracking procedure using the proposed algorithm.
1 Introduction
The computer vision community has witnessed the development of several excellent tracking algorithms that use color statistics, e.g. the color histogram, as representations. These statistical features can be convolved with an isotropic kernel to allow gradient estimation of the representation [1]. One inherent limitation of such kernel based methods is the singularity problem, where the representation is blind to certain motion. Most existing kernel based tracking algorithms are concerned only with tracking object locations [1], or object locations and scales [2]. Since isotropic kernels are often used [1,3,4], rotational motion cannot be estimated with these methods. Recently, Zhao and Tao [5] proposed the simplified color correlogram (SCC) representation to efficiently track location and orientation simultaneously. Although the kernel is also rotationally symmetric, the underlying SCC representation is sensitive to orientation changes. This property makes the representation capable of tracking rotational as well as translational motion. As in most kernel based algorithms, the assumption is that the statistics of the SCC feature are sufficient to determine the motion of the object [4]. However, this assumption needs to be validated. This paper shows that in certain degenerate cases the problem may become ill-conditioned, i.e., translational and/or rotational motion may not cause changes to the SCC, so that the motion is not observable. In this study, we derive a criterion to evaluate the numerical stability of the tracking solution, according to which schemes for SCC selection are designed. The paper is organized as follows. Section 2 reviews the simplified color correlogram (SCC). Section 3 analyzes the properties of the SCC in a kernel based
framework, and proposes the solution to obtain the motion observable SCCs. Section 4 discusses implementation details. Section 5 shows experimental results and demonstrates the improvement over the standard mean shift algorithm, and section 6 concludes the paper.
2 Introduction to Simplified Color Correlogram (SCC)
The color correlogram expresses the correlation between color pairs in an image and has been commonly used in the image retrieval/indexing literature [6]. Zhao and Tao [5] have recently proposed a simplified version of the color correlogram (SCC) for tracking purposes. Instead of including pixel pairs along all directions and with a set of distances, the SCC only counts pairs along one or several axes, i.e., pre-selected directions, with predefined distances. Formally, the SCC with L axes is defined as

S_{u,v} = Pr(I(p_1) = u ∧ I(p_2) = v | f(p_1 − p_2) = (θ, d)).    (1)

Here, f is a function that returns the direction and the distance of a pixel pair, representing the spatial relationship of the two pixels p_1 and p_2; θ ∈ {θ_l, l = 1, ..., L} and d ∈ {d_l, l = 1, ..., L}, where L is the number of axes, θ_l is the direction of axis l and d_l is the pair distance along axis l. We use the L2 norm to measure the distance between pixels. Though the conclusions about the singularity problem in kernel based representations made in this study are not restricted to the SCC representation, we focus our discussion on the SCC for the following reasons [5]:
1. The SCC achieves a natural integration of color and spatial information.
2. Since certain directions are emphasized, the SCC is effective in manifesting rotational variations.
3. The SCC is computationally inexpensive.
4. Being a middle ground between template based methods and pure statistics based representations, the SCC can inherit desirable properties from both sides.
Similar to the conventional color histogram, the SCC is integrated into a kernel framework to allow efficient localization, and thereby the singularity problem arises. This paper presents two methods to approach this issue in the SCC context, where both translation and rotation are concerned. One is to select more than one axis to form a multi-axis SCC, so that motion ignored by pairs along one axis can be recovered along other axes, as will be justified later. This strategy suffices in most cases. However, efficiency considerations suggest an alternative: obtaining one optimal axis and its corresponding pair distance, so that the resulting SCC is the most sensitive to all different motions. Details of the two approaches are provided in the next section.
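A minimal sketch of a single-axis SCC follows, assuming the colors have already been quantized into n_colors bins; the function name and the rounding of the offset to the nearest pixel are our own choices, not taken from the paper.

import numpy as np

def simplified_color_correlogram(labels, theta, d, n_colors):
    # labels: 2-D array of quantized color indices in [0, n_colors).
    dy = int(round(d * np.sin(theta)))
    dx = int(round(d * np.cos(theta)))
    H, W = labels.shape
    S = np.zeros((n_colors, n_colors))
    for y in range(H):
        for x in range(W):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < H and 0 <= x2 < W:      # both pixels inside the window
                S[labels[y, x], labels[y2, x2]] += 1
    total = S.sum()
    return S / total if total > 0 else S         # Pr(u, v | offset), Eqn. (1)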
3 Motion Observability Analysis
The SCC based tracking method enables the detection of both translational and rotational motion, therefore reliable tracking in this context requires that both types of motion be distinctly observed and reliably recovered.
3.1 Representing Objects Using SCC
In this subsection, we first introduce the notation needed for the later analysis. Each pair of pixels counted in the SCC representation can be parameterized by a 3-dimensional vector Φ = [cx, cy, θ]^T, where (cx, cy) are the image coordinates of the midpoint of the pair and θ is the angle between the axis of the object and the object coordinate system. Consider for a moment a target model/candidate for the SCC with one axis l; the index l is omitted to keep the notation simple.

Target Model. Similar to [4], we define the matrix form of the target model as

M = α U_M^T K(0).    (2)

In Eqn. (2), U_M = [u_11, u_12, ..., u_1m, u_21, ..., u_2m, ..., u_m1, u_m2, ..., u_mm], where u_rs = [δ(I(Φ_111) − I_rs), δ(I(Φ_112) − I_rs), ..., δ(I(Φ_WHO) − I_rs)]^T, r, s = 1, ..., m. I(Φ_ijk) represents the colors of the pixel pair Φ_ijk in the image I. W, H and O are the numbers of cx, cy and θ values considered in the SCC. If the colors of the pixels in the pair Φ_ijk are r and s, then the corresponding element of the vector is set to 1, otherwise it is 0. The subscript M stands for "model" and α normalizes the representation. K(0) = [K(Φ_111/h), K(Φ_112/h), ..., K(Φ_WHO/h)]^T, where K is a kernel function that assigns a smaller weight to locations and orientations farther from the center of the object, h is the kernel radius, and the kernel is centered at 0. By definition, L different axes yield L different target models M^1, ..., M^L.

Target Candidate. Similarly, the target candidate is defined as

C(Φ_0) = β U_C^T K(Φ_0),    (3)

where Φ_0 is the initialized location and orientation in the current frame. β and U_C are defined in the same way as α and U_M for the target model, and the subscript C denotes "candidate". K(Φ_0) = [K((Φ_111 − Φ_0)/h), K((Φ_112 − Φ_0)/h), ..., K((Φ_WHO − Φ_0)/h)]^T. L target candidates C^1, ..., C^L can be defined for the L axes.
3.2 The Objective Function and Solution for Reliable Tracking
We focus on the single-axis case in this subsection; the multi-axis case is analyzed in the next subsection. For mean shift based tracking algorithms [1,5], the objective is to seek the maximum of the Bhattacharyya coefficient [7]. Its well-known connection with the Matusita metric makes it possible to analyze the Matusita metric instead of the Bhattacharyya coefficient, which better illustrates the inherent problem of kernel based tracking [4]. Using the notation of Section 3.1, the tracking objective under the Matusita metric is

min_Φ D_M(Φ) = min_Φ ||√M − √C(Φ)||² = min_Φ Σ_{u,v} (√M_{u,v} − √C_{u,v}(Φ))²,    (4)

where Φ is the object location and orientation in the current frame.

A Newton-style iterative procedure is applied to convert this optimization problem to a more explicit form (derivations are provided in the Appendix):

d(C(Φ_0))^{−1/2} U_C^T J_K(Φ_0) ΔΦ = 2(√M − √C(Φ_0)),    (5)

where Φ_0 is the initialized object location and orientation for the current frame and d(C(Φ_0)) denotes the matrix with C(Φ_0) on its diagonal. In Eqn. (5), J_K(Φ_0) = [∇_Φ K((Φ_111 − Φ_0)/h), ∇_Φ K((Φ_112 − Φ_0)/h), ..., ∇_Φ K((Φ_WHO − Φ_0)/h)]^T, where ∇_Φ K((Φ_ijk − Φ_0)/h) = (2/h²)(Φ_ijk − Φ_0) g(||(Φ_ijk − Φ_0)/h||²), g(x) = −k′(x), and k(||x||²) = K(x).

Denoting A = d(C(Φ_0))^{−1/2} U_C^T J_K(Φ_0) and converting the matrix before ΔΦ to a square matrix for further analysis, we obtain

A^T A ΔΦ = 2 A^T (√M − √C(Φ_0)).    (6)

Therefore the solution to the optimization problem in this 3-dimensional case is unique if and only if the 3 × 3 matrix A^T A is of full rank. Additionally, the stability of the solution depends on the magnitude of its condition number. In the single-axis case, the SCC with the parameters (θ, d) corresponding to the smallest condition number of A^T A is the optimal SCC.
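A minimal sketch of the resulting stability test: given a routine (here the assumed helper build_A) that assembles A for a candidate axis, the single-axis SCC is chosen by minimizing the condition number of A^T A.

import numpy as np

def solution_condition_number(A):
    return np.linalg.cond(A.T @ A)           # large value: motion poorly observable

def select_optimal_axis(build_A, thetas, distances):
    # build_A(theta, d) -> A is an assumed helper assembling d(C)^(-1/2) U_C^T J_K.
    return min(((t, d) for t in thetas for d in distances),
               key=lambda td: solution_condition_number(build_A(*td)))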
3.3 SCC with Multiple Axes
If multi-axis correlograms are used, denote

A_L = [ d(C^1(Φ_0))^{−1/2} (U_C^1)^T J_K^1(Φ_0) ; ... ; d(C^L(Φ_0))^{−1/2} (U_C^L)^T J_K^L(Φ_0) ]  (row blocks stacked vertically),

and

A^l = d(C^l(Φ_0))^{−1/2} (U_C^l)^T J_K^l(Φ_0),   l = 1, ..., L,

then we have A_L^T A_L = Σ_{l=1}^{L} (A^l)^T A^l.

In this paper, we explore this multi-axis problem further, considering the simple yet effective two-axis case. A useful property of the positive semi-definite matrices (A^1)^T A^1 and (A^2)^T A^2 is

min(cond((A^1)^T A^1), cond((A^2)^T A^2)) ≤ cond(A_2^T A_2) ≤ max(cond((A^1)^T A^1), cond((A^2)^T A^2)),    (7)

where A_2^T A_2 = (A^1)^T A^1 + (A^2)^T A^2. These inequalities indicate that the condition number of a two-axis SCC lies between the two condition numbers of the corresponding single-axis SCCs. A consequence is that an SCC defined with two axes is less likely to produce unfavorable condition numbers, since this would require both corresponding single-axis SCCs to have sufficiently large condition numbers. To make full use of this point, the two axes should be as independent as possible; in our work, two orthogonal axes are used.
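A corresponding sketch for the two-orthogonal-axis case simply sums the single-axis normal matrices before taking the condition number; build_A is the same assumed helper as above.

import numpy as np

def two_axis_condition_number(build_A, theta, d):
    A1 = build_A(theta, d)                       # first axis
    A2 = build_A(theta + np.pi / 2.0, d)         # orthogonal axis
    M = A1.T @ A1 + A2.T @ A2                    # A_2^T A_2 of Eqn. (7)
    return np.linalg.cond(M)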
3.4 Visual Interpretations and Verifications on Example Patterns
Using the SCC as an object representation, an image patch can have different SCCs, with some being more favorable than others in terms of motion observability. To provide a visual interpretation of the SCC representation, we analyze the matrix A further. Its rows, indexed by the color pairs (1,1), ..., (1,m), ..., (m,m), are

d(C(Φ_0))^{−1/2} U_C^T J_K(Φ_0) = [ (2/h²) C_{1,1}^{−1/2} Σ_{I(Φ_ijk)=I_11} (Φ_ijk − Φ_0) g(||(Φ_ijk − Φ_0)/h||²) ; ... ; (2/h²) C_{1,m}^{−1/2} Σ_{I(Φ_ijk)=I_1m} (Φ_ijk − Φ_0) g(||(Φ_ijk − Φ_0)/h||²) ; ... ; (2/h²) C_{m,m}^{−1/2} Σ_{I(Φ_ijk)=I_mm} (Φ_ijk − Φ_0) g(||(Φ_ijk − Φ_0)/h||²) ].    (8)

To obtain a unique solution for ΔΦ in Eqn. (5), at least three of the row vectors of the above matrix need to be linearly independent. In this paper, we analyze two typical image patterns that are inherently unobservable to certain motion.
Fig. 1. Illustration for visual interpretation of SCC choice
Concentric Circles (Fig. 1(a)): In color histogram based kernel methods, concentric circles are regarded as a degenerate case [4], where translation cannot be detected. In the SCC based kernel method, due to the spatial information encoded in the pixel pairs, translation along the SCC axis can now be observed. Without loss of generality, we set the SCC axis along the y direction, as shown in Fig. 1(a), and validate the translation observability through Eqn. (8) by examining the weighted distance vectors (Φ_ijk − Φ_0) g(||(Φ_ijk − Φ_0)/h||²) for pixel pairs of two given distinct colors. The cx component (recall that Φ = [cx, cy, θ]^T) of this term is cancelled out by every two corresponding pixel pairs, i.e., two pairs symmetric w.r.t. the SCC axis, like pairs a and b in Fig. 1(a). However, the cy component cannot always be cancelled out by corresponding pairs, i.e., pairs symmetric w.r.t. the x direction. For example, pair c in Fig. 1(a) is of color (i, j), while its corresponding pair (pair d) is of color (j, i), therefore they
can cancel out each other neither for C_ij nor for C_ji. Since the representation is rotationally symmetric, the θ components are all cancelled out. As a result, the row vectors in Eqn. (8) are of the form κ[0, 1, 0]^T. The intuition behind this is that among the three degrees of motion, only translation along the SCC axis can cause sufficient changes to the SCC. Although one may use two axes to detect translation in both dimensions, blindness to rotation is the inherent limitation of concentric circles.

Parallel Stripes (Fig. 1(b)): Independent of the axis choice in the SCC, the parallel stripes pattern is sensitive to motion along the x direction, while blind to motion along the y direction. However, its observability to rotation depends on the direction of the SCC axis. If the axis is defined to be along the x direction, then the elements for the orientation dimension in (Φ_ijk − Φ_0) g(||(Φ_ijk − Φ_0)/h||²) cancel out. Intuitively, this means that a slight rotation does not cause enough change to the SCC. On the other hand, if the axis is some degrees away from the x direction, then rotation does make a difference in the SCC by causing some pixels at the boundary between two stripes to take the other color.
Fig. 2. Quantitative relationship between condition numbers and parameters (θ, d): (a) Image patch (96 × 96), (b)-(d) Condition numbers w.r.t. axis directions with different pair distances: (b) d = 5 (Symbols at the top line represent overflow values), (c) d = 10, (d) d = 40
We evaluated the quantitative relationship between condition numbers and the parameters (θ, d) for the image patch with the Parallel Stripes pattern shown in Fig. 2(a). The axis direction θ is defined in the object coordinate system, i.e., the x direction corresponds to 0 degrees, and θ increases counterclockwise. We observe the condition numbers of the single-axis SCCs with different axis directions and those of the two-orthogonal-axis SCCs, where the directions of the axes are θ and θ + 90. Figs. 2(b-d) show the relationship of the condition numbers w.r.t.
different axis directions with pair distances of 5, 10, and 40, respectively. From the illustrated outputs, the following conclusions can be made:
1. The pair distance cannot be too small. This is due to the discrete nature of the image and the fact that the colors for the SCC are assigned in a nearest-neighbor manner. For short pixel pairs, a small image rotation may not cause any change to the SCC. As shown in Fig. 2, the results with a pair distance of 5 pixels (Fig. 2(b)) are much poorer than those with longer distances (Fig. 2(c-d)).
2. Some axis directions are more favorable than others in terms of stability. For the patch shown in Fig. 2(a), the most favorable axis direction for single-axis SCCs is along the y direction (θ = 90 in Fig. 2(c-d)).
3. When two axes are used, due to the inequalities given in Eqn. (7), no matter what the directions of the axes are, the condition number is between the two condition numbers generated independently by the two single-axis SCCs. Figs. 2(c-d) show that the average condition numbers provided by the two-orthogonal-axis SCCs are significantly smaller than those generated by the single-axis ones.
4 Implementation Details
For most real objects, textures are irregular enough to avoid the extreme cases shown in Section 3.4, so a two-orthogonal-axis SCC suffices in most cases. However, in applications where speed is an important factor, a single-axis SCC is greatly favored for efficiency. In this case, we search for the optimal axis direction in the orientation space to obtain the SCC with the smallest condition number. The pair distance is an important parameter in that it influences not only the SCC's sensitivity to rotation, but also the stability of the solution. On one hand, the larger the pair distance, the more observable the orientation changes, as discussed in Section 3.4; on the other hand, an SCC with a large pair distance tends to have too few pairs counted (both pixels should lie in the tracking window), which decreases the stability of the tracking. By trial and error, we set the default distance to max((l + w)/8, 10), where l and w are the length and width of the kernel. The lower bound of 10 pixels ensures the stability of the tracker when the object is small. In this paper, similar to [5], the mean shift algorithm is extended to a joint translation-rotation domain to locate the object position and orientation simultaneously, in a gradient descent manner. However, the proposed idea can easily be incorporated into any tracking framework other than the mean shift one.
5 Experiments
The usefulness of the proposed schemes for ensuring reliable tracking has been demonstrated on vehicle and pedestrian sequences under various environmental conditions. In the following, the first two real-time tracking tasks compute and use optimal single-axis SCCs as representations, and two-orthogonal-axis SCCs are used for the other sequences.
Fig. 3. Car-Chasing Sequence (panels a-1, a-2, b-1, b-2, c-1, c-2)
Car-Chasing: The Car-Chasing sequence is a live video of 2250 frames. It has been tested with the standard mean shift (MS) based tracking algorithm and with the single-axis SCC based tracking algorithms, both with and without optimal SCC selection. The possible problems in vehicle tracking with the MS tracker are revealed in Fig. 3: (1) loss of track tends to occur when the car makes turns (Fig. 3(a-1)); (2) the fixed orientation of the tracking window makes scale adaptation difficult to realize, and the mismatch of the window to the object makes the tracker sensitive to background clutter (Fig. 3(b-1)). In Fig. 3(c-1), although the window direction adapts to the object, using a non-optimal single-axis SCC makes the tracker less stable. In Fig. 3(a-2), (b-2), (c-2), we show results of the optimal single-axis SCC based tracker, which tracks the car throughout the entire sequence. The results demonstrate that challenging issues such as object rotation, heavy occlusion, background clutter, scale changes and motion blur are handled elegantly.
PETS 2001 Data: The proposed schemes are further evaluated on the PETS 2001 dataset [8]. Compared with color histogram based methods, the SCC based tracker successfully removes the restrictions imposed by certain object shapes and/or camera viewpoints. The proposed optimal single-axis SCC selection scheme further ensures both the reliability and the efficiency of the tracker. Sample frames of the experimental outputs are shown in Fig. 4.
Multiple Human Parts: Other experiments are shown in Fig. 5. The two-orthogonal-axis SCCs are used to track multiple human parts. Although only
Fig. 4. PETS 2001 Sequences
Fig. 5. Stretching and Walking Sequences
color information is extracted, the reliable tracking results indicate the algorithm's potential as a useful module in human tracking or behavior analysis tasks.
6 Conclusion
This paper analyzes the capability of the SCC (in a kernel based framework) to recover both translation and rotation. A criterion to evaluate the SCC in terms of motion estimation is provided to guide the SCC selection. Two-orthogonal-axis SCCs are shown to be sufficient in practice, while in tasks with high speed requirements, optimal single-axis SCCs are desirable. The discussion is focused on, but not limited to, the SCC representation. The SCC in an extended mean shift tracking framework is not computationally expensive: the tracker runs comfortably at 30 fps on a PIV 3.20 GHz PC.
References
1. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. PAMI 25(5), 564–577 (2003)
2. Collins, R.: Mean-shift blob tracking through scale space. In: CVPR, pp. 234–240 (2003)
3. Fan, Z., Wu, Y.: Multiple collaborative kernel tracking. In: CVPR II, pp. 502–509 (2005)
4. Hager, G., Dewan, M., Stewart, C.: Multiple kernel tracking with SSD. In: CVPR I, pp. 790–797 (2004)
5. Zhao, Q., Tao, H.: Object tracking using color correlogram. In: VS-PETS, pp. 263–270 (2005)
6. Huang, J.: Color-spatial image indexing and applications. PhD thesis, Cornell University (1998)
7. Kailath, T.: The divergence and Bhattacharyya distance measures in signal selection. IEEE Trans. on Comm. Tech. 15(1), 52–60 (1967)
8. http://visualsurveillance.org/PETS2001/
Appendix: Derivation of Eqn. (5)
Optimization of the objective function min ||√M − √C(Φ)||² results in the following equation: d(C(Φ_0))^{−1/2} U_C^T J_K(Φ_0) ΔΦ = 2(√M − √C(Φ_0)), where ΔΦ is the motion vector in terms of both translation and rotation, and Φ_0 is the initialized object center in the current frame. Applying the Taylor expansion to √C(Φ) and dropping higher-order terms yields

√C(Φ) = √C(Φ_0) + (d√C(Φ)/dΦ)|_{Φ=Φ_0} ΔΦ.    (9)

Since C(Φ_0) = U_C^T K(Φ_0), we have dC(Φ)/dΦ|_{Φ=Φ_0} = U_C^T ∇K(Φ_0). Introducing this into Eqn. (9), we have

√C(Φ) = √C(Φ_0) + (1/2) d(C(Φ_0))^{−1/2} U_C^T ∇K(Φ_0) ΔΦ,    (10)

where d(C(Φ_0)) is the matrix with C(Φ_0) on its diagonal. Rewriting the objective function in terms of the motion vector ΔΦ, we obtain

argmin_{ΔΦ} ||√M − √C(Φ_0 + ΔΦ)||.    (11)

Substituting Eqn. (10) into Eqn. (11), the resulting objective function is

argmin_{ΔΦ} ||√M − √C(Φ_0) − (1/2) d(C(Φ_0))^{−1/2} U_C^T ∇K(Φ_0) ΔΦ||,    (12)

the solution of which equates to the solution of the linear system

(1/2) d(C(Φ_0))^{−1/2} U_C^T ∇K(Φ_0) ΔΦ = √M − √C(Φ_0).    (13)

Denoting ∇K(Φ_0) as J_K(Φ_0) and scaling both sides by a factor of 2 results in

d(C(Φ_0))^{−1/2} U_C^T J_K(Φ_0) ΔΦ = 2(√M − √C(Φ_0)).    (14)
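A minimal sketch of one update step implied by Eqn. (14), solved in the least-squares sense; the variable names and the small regularization constant are our own assumptions.

import numpy as np

def motion_update(C0, U_C, J_K, M):
    # C0, M: flattened SCC vectors of candidate and model; U_C, J_K: matrices of Eqn. (5).
    A = np.diag(1.0 / np.sqrt(C0 + 1e-12)) @ U_C.T @ J_K    # d(C)^(-1/2) U_C^T J_K
    b = 2.0 * (np.sqrt(M) - np.sqrt(C0))
    dPhi, *_ = np.linalg.lstsq(A, b, rcond=None)            # solves A dPhi ~ b
    return dPhi                                             # [d cx, d cy, d theta]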
On-Line Ensemble SVM for Robust Object Tracking

Min Tian¹, Weiwei Zhang², and Fuqiang Liu¹
¹ Broadband Wireless Communication and Multimedia Laboratory, Tongji University, Shanghai, China
² Microsoft Research Asia, Beijing, China
[email protected], [email protected], [email protected]
Abstract. In this paper, we present a novel visual object tracking algorithm based on an ensemble of linear SVM classifiers. There are two main contributions. First, we propose a simple yet effective way of updating a linear SVM classifier on-line, in which useful "Key Frames" of the target are automatically selected as support vectors. Second, we propose an on-line ensemble SVM tracker that can effectively handle target appearance variation. The proposed algorithm makes better use of history information, which leads to better discrimination of the target from the surrounding background. The proposed algorithm is tested on many video clips, including some publicly available ones. Experimental results show the robustness of the proposed algorithm, especially under large appearance changes during tracking.
1 Introduction

Visual tracking is an important subject in computer vision with a variety of applications. One of the main challenges that limit the performance of a tracker is appearance change caused by variations in pose, illumination or viewpoint. In order to develop a robust tracker, much former work has addressed these problems; however, robust object tracking still remains a big challenge.
Object tracking can be considered as an optimization problem: the tracking algorithm searches for the region with the locally maximal similarity score. In [1], the similarity is defined as the SSD between the observation and a fixed template. In [6], mean shift is proposed as a nonparametric density gradient estimator to search for the most similar region by computing the similarity between the color histograms of the target and the search window. Object tracking can also be considered as a state estimation problem. In early work, the Kalman filter or its variants were frequently used. However, the Kalman filter cannot handle non-Gaussian and non-linear cases well. To address these cases, sequential Monte Carlo methods have been applied to tracking, among which the Particle Filter (PF) [3,8,12,4] is the most popular. Object tracking can also be regarded as a template updating problem. A classical subspace tracking method was proposed by Black et al. [2]. Ross and Lim [13] extended Eigen-tracking by on-line incremental subspace updating. In another direction, Jepson [9] models the target as a mixture of a stable component, outliers, and a two-frame transient component, and an on-line EM algorithm is used to estimate the parameters of each
component. Considering the tracker as a binary classifier has become very popular recently. [18] proposed a transductive learning method for tracking, in which a D-EM algorithm is used for transducing color classifiers and selecting a good color space to determine whether each pixel belongs to the foreground or the background. The limitations of these color trackers are clear, because they can only work on color image sequences. In [10] an off-line SVM is used to distinguish the target vehicle from the background. Similar to [10], in [12] an Adaboost classifier is trained off-line to detect hockey players, providing a proposal distribution that improves the robustness of the tracker. Since they need a large amount of training data, it is not easy to extend those approaches to general object tracking. In order to obtain a general tracker, [16] proposed on-line learning of an ensemble of pixel based weak classifiers, where tracking is done by classifying pixels into foreground and background. However, in that method the features are limited to an 11-D low-dimensional vector including pixel colors and a local orientation histogram. Helmut Grabner proposed an on-line version of the Adaboost algorithm in [17], which can select features on-line to obtain a strong classifier. That work is roughly similar in spirit to [16]; however, it can choose on-line the features with the most discriminative ability.
Among all the topics discussed in existing classifier based trackers, we found that history information, which is very important for tracking, has not received much attention. In order to make better use of this important information, we propose an ensemble SVM classifier based tracking algorithm. In our algorithm, the linear SVM can automatically select the "Key Frames" of the target as support vectors. Moreover, by combining several linear SVM classifiers into an ensemble, history information can be used more sensibly and the risk of drift can be decreased more effectively. Finally, because the ensemble method can automatically adjust each SVM classifier's weight on-line, some off-line classifiers can also be trained and added into the framework to form a more robust tracker.
The paper is organized as follows. Section 2 presents the proposed on-line updating SVM based tracker. Section 3 presents the ensemble SVM based tracker. Experiments and conclusions are given in Section 4 and Section 5, respectively.
2 On-Line SVM Tracker

In this paper, we treat tracking as a binary classification problem and choose the SVM as our basic classifier. In the following sections, we will show that the SVM can automatically select "Key Frames" of the target from the historic frames, which is one of the most important factors of the proposed algorithm. First of all, let us introduce our on-line SVM classifier based tracker.

2.1 SVM Classifier Based Tracking

Within the context of object tracking, we define the target object region and its surroundings as the positive and negative data sources respectively, as shown in Fig. 1(b). Our goal is to learn an SVM classifier which can separate the positive and negative data in the new frame. Starting from the first frame, the positive and negative samples are used to train the SVM classifier. Then the search region (Fig. 1(b)) can be estimated in the next frame. Finally, the target region in the next frame is located at the local maximum score within the search region.
Fig. 1. (a) The confidence map of the search region. (b) The object region, search region and the context region. (c) Demonstration of a linear SVM. The filled circles and rectangles are the “support vectors”.
2.2 On-Line Updating Linear SVM

One of the most difficult tasks for a tracker is how to update itself on-line so that it adapts to appearance changes of the target. Some former methods use a rapid update model, i.e. x_t = x̂_{t−1}, but this is dangerous and may cause the drift problem [11]. Some off-line trackers, such as [7], benefit from off-line selected "Key Frames", which make the tracker more robust against drift. Our idea is inspired by both: the difference is that our tracker not only records the "Key Frames" of the target as history information, but also updates on-line to decrease the risk of drift. We consider the updating of the classifier as an on-line learning process, and propose a simple yet effective way of on-line updating the linear SVM classifier. The details are described in Algorithm 1.

Algorithm 1. On-line Linear SVM Tracking & Updating
Input: video frames I_n for processing (n = 1, ..., L); rectangular region R of the target.
Output: rectangles R_n of the target object's region (n = 1, ..., L).
Initialization for the first frame I_n (n = 1):
– Extract positive and negative samples S_1 = {x_i, y_i}_{i=1}^{N}, where y_i ∈ {−1, +1}, corresponding to the target region R.
– Train a linear SVM to get f_1(x) = w_1 x + b_1 and its support vectors V_1 = {x_i, y_i}_{i=1}^{M}.
For each new frame I_n (n > 1):
– Find the region R_n with the local maximum score given by f_{n−1}(x). Here x denotes the search window's feature vector: R_n = arg max_x f_{n−1}(x) = arg max_x (w_{n−1} x + b_{n−1}).
– If f_{n−1}(x_{R_n}) > 0, go to the next step to get a new SVM. Else stop updating, take R_n as the target region, and go to the next frame.
– Refresh the positive samples P_n = V^+_{n−1} ∪ S^+_n and the negative samples N_n = V^−_{n−1} ∪ S^−_n. Here V^+_{n−1} and V^−_{n−1} are the positive and negative support vectors of f_{n−1}(x), and S^+_n and S^−_n are the positive and negative samples of the current frame.
– Retrain the SVM on the new samples to obtain the updated classifier f_n(x) = w_n x + b_n.
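A minimal sketch of the retraining step of Algorithm 1, assuming scikit-learn's linear-kernel SVC; carrying the previous support vectors into the new training set is what retains the "Key Frames".

import numpy as np
from sklearn.svm import SVC

def update_linear_svm(prev_svm, X_prev, y_prev, X_new, y_new):
    # Keep only the support vectors of f_{n-1} (the retained "Key Frames") and
    # retrain together with the current frame's positive/negative samples.
    sv = prev_svm.support_                    # indices of support vectors in X_prev
    X = np.vstack([X_prev[sv], X_new])
    y = np.concatenate([y_prev[sv], y_new])
    new_svm = SVC(kernel='linear').fit(X, y)
    return new_svm, X, y                      # (X, y) become X_prev, y_prev next time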
By on-line updating, the SVM tracker can adjust its hyperplane to maximize the margin between the new positive and negative samples. The support vectors transferred from frame to frame contain important "Key Frames" of the target object from the previous tracking process (see Fig. 2(b)). Figure 2 shows the tracking results and part of the "Key Frames" selected by the SVM at the final frame; the video is provided by Jepson [9]. The proposed tracker can adapt to the face under appearance variation and distinguish it from the cluttered background (Fig. 2(a)).
Fig. 2. (a) Tracking results on frames 1, 206, 366, 588, 709, 761, 973 and 1131. (b) At the end of the tracking task, 182 positive support vectors contain enough history information; the images displayed at the bottom are some of these support vectors.
3 Ensemble SVM Classifier Based Tracking

Although the proposed SVM based tracker is powerful thanks to on-line updating, we found that several issues still need to be addressed in real-world video clips: (a) the variation of the target is very large; (b) the tracker is disturbed by scale variation, partial occlusion or motion blur in certain frames. These two issues may lead the tracking algorithm to drift and finally fail. In order to address them, we propose an ensemble SVM algorithm in this section.

3.1 On-Line Building the Ensemble of SVMs

Our algorithm starts with an SVM trained with labeled data in the first frame. After that, in each frame a new SVM may be added, depending on how well the current tracking result is matched by the previous SVM classifiers. The match ratio r_m is defined in equation (1), where U(x) is a step function that equals 1 when x is above zero and 0 otherwise. Here {x_k}_{k=1}^{N^+} are the positive samples and N^+ is their number. The larger r_m is, the better the current component matches the positive samples. If r_m is below a ratio threshold, a new SVM should be added.

r_m = Σ_{k=0}^{N^+} U(f_m(x_k) − 1) / (N^+ − Σ_{k=0}^{N^+} U(f_m(x_k) − 1))    (1)

So after several frames, many SVM classifiers have been generated and updated during different periods, as illustrated in Fig. 3.
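A minimal sketch of the match ratio r_m of Eqn. (1); the spawning policy shown (add a new SVM when no existing component matches well enough) is our reading of the rule above, not a verbatim transcription.

import numpy as np

def match_ratio(svm_m, X_pos):
    scores = svm_m.decision_function(X_pos)      # f_m(x_k)
    matched = np.sum(scores >= 1.0)              # U(f_m(x_k) - 1)
    unmatched = len(X_pos) - matched
    return matched / unmatched if unmatched > 0 else np.inf

def needs_new_svm(svms, X_pos, ratio_threshold=1.0):
    # Hypothetical policy: spawn a new component if no existing SVM matches well enough.
    return all(match_ratio(m, X_pos) < ratio_threshold for m in svms)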
Fig. 3. Flowchart of our on-line ensemble SVM tracking framework
Once the number of SVM classifiers is larger than one, we combine the linear SVM classifiers in the pool to get a better classification result. Each SVM classifier is assigned a coefficient α_m, defined as follows:

α_m = (1/2) log ( Σ_{i=1}^{N} U(P_m(x_i, y_i) · ω_i) / Σ_{i=1}^{N} U(−P_m(x_i, y_i) · ω_i) )    (2)

Here ω_i is the weight of sample i, and P_m(x_i, y_i) is the output of each SVM classifier used to evaluate its discriminative ability on every sample. We define P_m(x_i, y_i) as follows:

P_m(x_i, y_i) = 1,  if y_i · f_m(x) ≥ 1;
P_m(x_i, y_i) = (f_m(x) − T) · |f_m(x) − T| / T²,  if 1 > y_i · f_m(x) > 0;
P_m(x_i, y_i) = −1,  if y_i · f_m(x) ≤ 0.    (3)

When P_m(x_i, y_i) is positive, it expresses the probability of a correct classification; when it is negative, it expresses the probability of a wrong classification. Here T ∈ (0, 1) is a threshold in the decision rule, and we set it to 0.5 in our method. The details of the on-line ensemble SVM tracker are described in Algorithm 2.
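A minimal sketch of Eqns. (2)-(3) with T = 0.5; the absolute-value form of the middle case of P_m is our reading of the definition above, and the function names are our own.

import numpy as np

def P_m(score, label, T=0.5):
    # score = f_m(x); piecewise confidence of Eqn. (3).
    margin = label * score
    if margin >= 1.0:
        return 1.0
    if margin > 0.0:
        return (score - T) * abs(score - T) / (T * T)
    return -1.0

def alpha_m(scores, labels, weights, T=0.5):
    P = np.array([P_m(s, y, T) for s, y in zip(scores, labels)])
    right = np.sum((P * weights) > 0)            # U(P_m * omega_i)
    wrong = np.sum((-P * weights) > 0)           # U(-P_m * omega_i)
    return 0.5 * np.log(right / wrong) if right > 0 and wrong > 0 else 0.0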
Algorithm 2. On-line Ensemble SVM Classifiers for Tracking
Input: video frames I_n for processing (n = 1, ..., L); rectangular region R of the target.
Output: rectangles R_n of the target object's region (n = 1, ..., L).
Initialization for the first frame I_n (n = 1):
– Use the target and other randomly chosen regions that do not overlap with the target in the first frame to form a ground-truth classifier f_0(x) = w_0 x + b_0, which will not be updated during tracking. This classifier can also be replaced by another off-line trained classifier.
– Extract the positive and negative samples as in Algorithm 1.
– Train an SVM classifier f_1(x) = w_1 x + b_1 using the extracted samples.
– Initialize the sample weights ω_i = 1/N, i = 1, 2, ..., N.
– For m = 0 to 1:
  a) Make {ω_i}_{i=1}^{N} a distribution.
  b) Choose the strongest SVM, i.e. the one with the largest α_m according to equation (2).
  c) If α_m < 0, set α_m = 0 and break.
  d) Remove the chosen SVM.
  e) Update the sample weights ω_i = ω_i exp[−α_m · y_i P_m(x_i, y_i)], i = 1, 2, ..., N.
– Normalize the α_i so that Σ_i α_i = 1. The output of the ensemble is F(x) = α_0 f_0(x) + α_1 f_1(x).
For each new frame I_n (n > 1):
– Use F(x) to search for the target region and extract samples S = S^+ ∪ S^−.
– Push some of S^+ into the positive sample queue by random sampling.
– Check whether a new SVM should be built using r_m in (1).
– Choose the last K SVMs to update as in Algorithm 1. Here we set K = 5.
– Randomly choose M samples S′ from the sample history queue, where M equals the number of samples in S^+. The new group of samples is S″ = S ∪ S′.
– Initialize the sample weights ω_i = 1/N, i = 1, 2, ..., N.
– For m = 0 to K_max (K_max = 10 in our implementation):
  a) Make {ω_i}_{i=1}^{N} a distribution.
  b) Choose the strongest SVM, i.e. the one with the largest α_m according to equation (2).
  c) If α_m < 0, set α_m = 0 and break.
  d) Remove the chosen SVM from the SVM queue.
  e) Update the sample weights ω_i = ω_i exp[−α_m · y_i P_m(x_i, y_i)], i = 1, 2, ..., N.
– Normalize the α_i so that Σ_i α_i = 1. The output of the ensemble is F(x) = Σ_{m=0}^{K} α_m f_m(x).
Compared with a single on-line SVM, the ensemble tracker can obtain a more reliable result, especially when the appearance of the target changes frequently (see Figs. 4 and 5). From Fig. 4 and Fig. 5, we can clearly see that a single on-line SVM is useful. However, it records all the history information as its support vectors to achieve a global optimum, which makes it difficult to handle large appearance variation within a short period. This phenomenon also appears in the incremental
Fig. 4. (a) Tracking results on frames 100, 152, 171, 183 and 366 using the single on-line linear SVM tracker. (b) Tracking results of the on-line ensemble SVM tracker; the confidence map of the search region is shown at the bottom right of each frame. (c) The ensemble weight of each SVM in the mixture model. (d) The updating period of each SVM (from being generated to the end of updating). Note that SVM No. 1 is the ground-truth SVM and is not updated during tracking.

Fig. 5. Sequences provided by Lim and Ross (frames 1, 121, 207, 282, 345, 384, 409, 429, 440, 481). (a) Tracking results of the incremental subspace learning tracker; the tracker failed after frame 345. (b) Tracking results of our single on-line linear SVM tracker; the tracker almost failed on frame 345, and then drifted away from the target. (c) Tracking results of the on-line ensemble SVM tracker; the tracker finished the whole video with accurate results.
subspace learning tracker, as shown in Fig. 5(a). The ensemble SVM tracker proposed here can choose the SVM classifiers with the best discriminative ability on the chosen samples and combine them on-line by adjusting their weights. Using this method, the ensemble classifier can use the history information more sensibly, and at the same time the final tracker has an especially strong discriminative ability, which makes the tracking results more reliable.
4 Experiments

In this section, several experiments are carried out with our algorithm. The region patterns used here are common features: histograms of oriented gradients (HOG) [14] and local binary patterns (LBP) [5]. Integral histograms [15] are built to extract region features efficiently. Similar to the method used in [14], we construct a 9-bin HOG histogram for each cell; each block contains four cells, giving a 36-D HOG feature vector that is normalized to unit L2 norm, plus a 59-D LBP feature vector. Different from [14], the number of pixels included in a cell is not constant, because the object region is scalable. The positive samples are captured by scaling the target region from 0.8 to 1.5 and rotating it from -8 to +8 degrees. The negative samples are captured within the context region, and the negatives can have some overlap with the positive ones (below 1/3 of the area of a positive sample). The sampling rate between neighboring negative regions is set to 5 pixels per step in our method. The scale problem is handled in a naive way: a suitable scale is obtained by searching over different scales around the centre of the local maximum region.
First of all, we captured some videos ourselves to demonstrate the robustness of our framework (Fig. 6(a)). Then we ran our method on some frequently used, publicly available sequences (Fig. 6(b,c)). Compared with some other popular methods,
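A minimal sketch, assuming OpenCV, of the positive-sample generation described above (scaling 0.8-1.5 and rotating -8 to +8 degrees about the window centre); the specific scale/angle grid is our own choice.

import cv2
import numpy as np

def positive_samples(frame, rect, scales=(0.8, 1.0, 1.2, 1.5), angles=(-8, 0, 8)):
    x, y, w, h = rect                              # target region in the frame (ints)
    cx, cy = x + w / 2.0, y + h / 2.0
    samples = []
    for s in scales:
        for a in angles:
            M = cv2.getRotationMatrix2D((cx, cy), a, s)   # rotate and scale about centre
            warped = cv2.warpAffine(frame, M, (frame.shape[1], frame.shape[0]))
            samples.append(warped[y:y + h, x:x + w])      # re-crop the same window
    return samples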
Fig. 6. (a) A glass bottle with illumination and appearance changes while moving against a cluttered background. Beside the target there is another, similar-looking bottle, yet the tracker discriminates them without confusion. (b) A moving doll with large pose and illumination changes, frames 1, 454, 728, 1162 and 1343. (c) A moving vehicle disturbed by the surrounding lights, frames 1, 129, 245, 295 and 391.
Fig. 7. Results on a boy with large head pose changes. (a) Results of the incremental subspace learning tracker; the tracker failed after frame 96. (b) Results of the ensemble tracker; it runs well for 203 frames but fails later, which may be caused by a size problem. (c) Results of our method.
the incremental subspace learning tracker [13] of Ross and Lim and the ensemble tracker [16] of Shai Avidan, we obtain the results in Fig. 7. The incremental learning tracker is based on updating the sample mean and the eigenbasis over time. However, when the variation is very large, the updating cannot adapt to the change quickly enough and an imprecise position may be obtained. Then, after several frames of updating, the target may drift away very quickly because of error accumulation. The ensemble tracker is powerful; however, it is a pixel based tracker (as is [18]): the information carried by a single pixel is limited, the feature vector may have a large variation when the target is colorful, and the tracker may get confused when the colors of the target and the background are similar. Our method, as mentioned before, is based on region patterns, which are more stable during tracking. Meanwhile, it retains and chooses the most useful "key frames" of the target through the ensemble of SVMs, which has the strongest discriminative ability. Because of that, the performance of the tracker is especially good on some challenging videos with large appearance variation.
5 Conclusion

In this paper, we build a novel framework to track general objects. The ensemble SVM tracker proposed here is made up of several SVM classifiers, which prove especially strong in selecting and recording the "Key Frames" of the object. These classifiers are generated and updated during different periods with different historical
information. By adjusting each SVM's weight on-line, the ensemble classifier can distinguish the target from the background better than any single component. With the selected useful historical information and its strong discriminative ability, the tracker performs especially well on some difficult videos with large appearance variation.
Acknowledgments
This work was done while the author visited the visual computing group at Microsoft Research Asia. The author would like to thank all the researchers in that group for their support. Thanks to Lim and Ross for the image sequences and the Matlab code of the incremental subspace learning tracker provided on their website. Furthermore, thanks to Shai Avidan for his help and for providing the results of the ensemble tracker.
References
1. Hager, G.D., Belhumeur, P.N.: Efficient region tracking with parametric models of geometry and illumination. PAMI 20(10), 1025-1039 (1998)
2. Black, M.J., Jepson, A.: EigenTracking: Robust matching and tracking of articulated objects using a view-based representation. International Journal of Computer Vision 26(1), 63-84 (1998)
3. Isard, M., Blake, A.: Condensation - Conditional Density Propagation for Visual Tracking. International Journal of Computer Vision 29(1), 5-28 (1998)
4. Perez, P., et al.: Color-Based Probabilistic Tracking. In: ECCV, pp. 661-675 (2002)
5. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. PAMI 24, 971-987 (2002)
6. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. PAMI 25(5), 564-577 (2003)
7. Vacchetti, L., Lepetit, V., Fua, P.: Fusing online and offline information for stable 3D tracking in real-time. In: CVPR 2003, vol. 2, pp. 241-248 (2003)
8. Nummiaro, K., Koller-Meier, E., Gool, L.V.: An Adaptive Color-Based Particle Filter. Image and Vision Computing, 99-110 (2003)
9. Jepson, A.D., Fleet, D.J., El-Maraghi, T.F.: Robust online appearance models for visual tracking. PAMI 25(10), 1296-1311 (2003)
10. Avidan, S.: Support Vector Tracking. PAMI 26(8), 1064-1072 (2004)
11. Matthews, I., Ishikawa, T., Baker, S.: The Template Update Problem. PAMI 26, 810-815 (2004)
12. Okuma, K., Taleghani, A.: A Boosted Particle Filter: Multitarget Detection and Tracking. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3024, pp. 28-39. Springer, Heidelberg (2004)
13. Ross, D., Lim, J., Yang, M.H.: Probabilistic visual tracking with incremental subspace update. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3022, pp. 470-482. Springer, Heidelberg (2004)
14. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. In: CVPR 2005, vol. 1, pp. 886-893 (2005)
15. Porikli, F.: Integral histogram: a fast way to extract histograms in Cartesian spaces. In: CVPR 2005, vol. 1, pp. 829-836 (2005)
16. Avidan, S.: Ensemble tracking. In: CVPR 2005, vol. 2, pp. 494-501 (2005)
17. Grabner, H., Bischof, H.: On-line Boosting and Vision. In: CVPR 2006, vol. 1, pp. 260-267 (2006)
18. Wu, Y., Huang, T.S.: Color Tracking by Transductive Learning. In: CVPR 2000, vol. 1, pp. 133-138 (2000)
Multi-camera People Tracking by Collaborative Particle Filters and Principal Axis-Based Integration
Wei Du and Justus Piater
University of Liège, Department of Electrical Engineering and Computer Science, Montefiore Institute, B28, Sart Tilman Campus, B-4000 Liège, Belgium
weidu@montefiore.ulg.ac.be,
[email protected]
Abstract. This paper presents a novel approach to tracking people in multiple cameras. A target is tracked not only in each camera but also in the ground plane by individual particle filters. These particle filters collaborate in two different ways. First, the particle filters in each camera pass messages to those in the ground plane where the multi-camera information is integrated by intersecting the targets’ principal axes. This largely relaxes the dependence on precise foot positions when mapping targets from images to the ground plane using homographies. Secondly, the fusion results in the ground plane are then incorporated by each camera as boosted proposal functions. A mixture proposal function is composed for each tracker in a camera by combining an independent transition kernel and the boosted proposal function. Experiments show that our approach achieves more reliable results using less computational resources than conventional methods.
1 Introduction
Tracking people in multiple cameras is a basic task in many applications such as video surveillance and sports analysis. A commonly-used fusion strategy is to detect people in each camera with bottom-up approaches such as background subtraction and color segmentation, and then to calculate the correspondences between cameras using the camera calibrations, or more often, the ground homographies. In order to reason about occlusions between targets, this fusion strategy usually requires all targets to be correctly detected and tracked [8,4,2,6,5]. However, sometimes we may be interested in the trajectories of only a few key targets, for instance, the star players in a soccer game or a few suspects in a surveillance scenario. Top-down approaches are preferable in such situations. In this paper, we present a novel top-down approach to people tracking by multiple cameras. The approach is based on collaborative particle filters, i.e., we track a target not only in each camera but also in the ground plane by individual particle filters. These particle filters collaborate in two different ways. First, the particle filters in each camera pass messages to those in the ground plane where the multi-camera information is integrated using the homographies
of each camera. Such a fusion framework usually relies on precise foot positions of the targets, which are often not provided by the particle filters in the cameras. To overcome the imprecise foot positions as well as the uncertainties of the camera calibrations, we exploit the principal axes of the targets during integration, which greatly improves the precision of the fusion results. These fusion results are then incorporated by the trackers in each camera as boosted proposal functions. A mixture proposal function is composed for each tracker in a camera by combining an independent transition kernel and the boosted proposal function, from which new particles are generated for the next time instant. Our approach has several distinct features. First, it doesn’t require all targets to be tracked simultaneously. Instead of having different target trackers interact, we compute the consensus between cameras by having different camera trackers communicate. Second, it has a fully distributed architecture. All the computations are performed locally and only the filter estimates are exchanged between the cameras and the fusion module. Third, the fusion of the multi-camera information is done by intersecting the targets’ principal axes. Experiments on both surveillance and soccer scenarios show that our approach achieves more reliable results using less computational resources than conventional methods. Particle filters are conventional in multi-camera tracking. Most previous work performed particle filtering in 3D so that precise camera calibration is required to project particles into the image plane of each camera [9,7]. The multi-camera information is often integrated by either the product of the likelihoods in all cameras [7] or a selection of the best cameras that contain the most distinctive information [9]. In our previous work, we proposed a different approach to fusion that combined particle filters and belief propagation, where particle filters collaborated with each other via a message passing procedure [1]. To match ground-plane target positions using homographies, the foot positions of the tracked people have to be detected. This, however, is a difficult and errorprone task if done separately for each camera. In this paper, we address the precision and computational issues. We relax the dependence on precise foot positions by exploiting the principal axes of the targets, the intersections of which give better ground positions. At the same time, we improve the speed over our previous system [1] by incorporating the fusion results from the ground plane as proposal functions into each camera. The rest of the paper is organized as follows. Section 2 formulates the multicamera tracking problem. Section 3 introduces the collaborative particle filters, including the principal axis-based integration and the boosted proposal functions. Experiments on sequences of video surveillance and soccer games are shown in Section 4.
2 Problem Formulation
Suppose L cameras are used and each camera collects one observation for each target at each time instant. Denote the target state on the ground plane by xt,0 and its states in different cameras by xt,j , j = 1, . . . , L. Let zt,j denote
(a) tree-structured graphical model
(b) dynamic Markov model
Fig. 1. Graphical models for modeling the dependencies at time t and for modeling the evolution of the system in time
the observation in camera j at time t, Zt = {zt,1, . . . , zt,L} the multi-camera observation at time t, and Z t = {Z1, . . . , Zt} the multi-camera observations up to time t. Fig. 1(a) shows the graphical model that models the dependencies between target states in the ground plane and at different cameras at time t. We assume that the xt,j, j = 1, . . . , L, are independent given xt,0 so that a tree-structured model is formed. Note that xt,0 is associated with no observation. Connecting the graphical models at different times results in a dynamic Markov model, shown in Fig. 1(b), that describes the evolution of the system over time. As all the xt,j depend on xt,0, we add temporal links from xt−1,0 to xt,j. The addition of these temporal links is beneficial to the design of the proposal functions, shown in the next section. In both models in Fig. 1, each directed link from xt,0 to xt,j, j = 1, . . . , L, represents a message passing process and is associated with a potential function ψt0,j(xt,0, xt,j). The directed link from xt,j to zt,j, j = 1, . . . , L, represents the observation process and is associated with a likelihood function pj(zt,j|xt,j). In Fig. 1(b), the directed links from xt−1,i to xt,i, i = 0, . . . , L, and from xt−1,0 to xt,j, j = 1, . . . , L, represent the state transition processes and are associated with motion models p(xt,i|xt−1,i) and p(xt,j|xt−1,0) respectively. Thus, we infer each xt,i, i = 0, . . . , L, based on all Z t. A message passing scheme, the same as is used in belief propagation, is adopted to pass messages from each camera to the ground plane. The message from camera j is defined as

m_{0j}(x_{t,0}) \leftarrow \iint p_j(z_{t,j}|x_{t,j}) \, \psi^t_{0,j}(x_{t,0}, x_{t,j}) \, p(x_{t,j}|x_{t-1,j}) \, p(x_{t-1,j}|Z^{t-1}) \, dx_{t-1,j} \, dx_{t,j} .   (1)
The belief p(xt,0|Z t) is computed recursively by the message product and the propagation of the previous posterior,

p(x_{t,0}|Z^t) \propto \prod_{j=1,\ldots,L} m_{0j}(x_{t,0}) \times \int p(x_{t,0}|x_{t-1,0}) \, p(x_{t-1,0}|Z^{t-1}) \, dx_{t-1,0} .   (2)
Note that the same message and belief update equations are used in our previous work [1].
The inference of xt,j, j = 1, . . . , L, is done by nearly standard particle filters, except that the fusion results at t − 1 are taken into consideration. The belief p(xt,j|Z t) is computed as

p(x_{t,j}|Z^t) \propto p(x_j|z_j) \times \iint p(x_{t,j}|x_{t-1,j}) \, p(x_{t-1,j}|Z^{t-1}) \, p(x_{t,j}|x_{t-1,0}) \, p(x_{t-1,0}|Z^{t-1}) \, dx_{t-1,0} \, dx_{t-1,j} .   (3)

The factors p(x_{t,j}|x_{t-1,0}) \, p(x_{t-1,0}|Z^{t-1}) incorporate the fusion results as a boosted proposal function. In other words, the fusion module is used by each camera as a coupled process.
3 Collaborative Particle Filters
All the inference processes formulated above, in the ground plane and for each camera, are performed by individual but collaborative particle filters. Details are given below.
3.1 Principal Axis-Based Integration
The ground-plane particle filter integrates the multi-camera information according to Eqs. 1 and 2. For tracking ground targets, homographies are often used to map the foot positions from each camera to the ground plane. However, a large number of particles are required to estimate precise foot positions, which significantly slows down the tracking system. With a small number of particles, usually the sizes of the targets cannot be estimated precisely. We overcome this problem by exploiting the principal axes of the targets. The principal axis of a target is defined as the vertical line from the head of the target to the feet. It has been shown that the principal axes of a target in different cameras intersect in the ground plane, and computing the intersection point yields very robust fusion results [4,6], illustrated in Fig. 3. We exploit this effect in our multi-camera integration. The idea is to sample particles in the ground plane by importance sampling, and to evaluate these particles by passing messages from each camera. Here, p(xt,0|xt−1,0) is used as the proposal function from which new particles for xt,0 are sampled. Each of these ground-plane particles receives messages from each camera, and a message weight is computed using Eq. 1. The principal axes are incorporated in the potential function ψt0,j(xt,0, xt,j) in Eq. 1. In general, the principal axes of the particles in a camera are projected to the ground plane using the homographies. The potential function measures the distances of the ground particles to these projected principal axes and converts them to probability densities, given by

\psi^t_{0,j}(x^n_{t,0}, x^m_{t,j}) \propto \exp\left(-\mathrm{dist}^2\big(x^n_{t,0}, \mathrm{project}(H_j, x^m_{t,j})\big)\right),   (4)
where xnt,0 and xm t,j are the nth ground-plane particle and mth particle in camera j, Hj is the homography from camera j to the ground plane, dist() computes
Fig. 2. The particle distributions in four cameras at a time instant. It can be seen that the foot positions are not precise although all the particles are placed at the right location.
(a) Mapping particles to the ground.
(b) Mapping principal axes to the ground.
Fig. 3. Comparison between homography-based integration and principal axis-based integration. In (a), the projections of the particles (the red stars) from the images in Fig 2 to the ground have a large variance, making the integration imprecise. In contrast, in (b), the intersection of the principal axes (the red lines) of four selected particles yields a more precise foot position (the white square).
the distance between a point and a line, and project() maps the principal axis to the ground. The message and belief weights are then computed by

w^{j,n}_{t,0} \propto \sum_{m=0}^{N} \pi^m_{t,j} \, \psi^t_{0,j}(x^n_{t,0}, x^m_{t,j}), \qquad \pi^n_{t,0} \propto \prod_{j=1}^{L} w^{j,n}_{t,0},   (5)
where w^{j,n}_{t,0} is the message weight of x^n_{t,0} from camera j, and \pi^n_{t,0} and \pi^m_{t,j} are the belief weights of x^n_{t,0} and x^m_{t,j}. Intuitively, the closer a ground-plane particle is to all the principal axes, the larger its weight, as illustrated in Fig. 4.
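The following sketch evaluates Eqs. (4)-(5) for one camera. It assumes that each camera particle is summarised by its foot and head image points (so the principal axis is the line through them), and the helper names project_axis and point_line_dist are illustrative, not part of the paper.

```python
import numpy as np

def project_axis(H, foot_xy, head_xy):
    """Projects a particle's principal axis (the image line through its head
    and foot points) to the ground plane with the camera-to-ground
    homography H; returns the two projected ground points."""
    pts = np.array([[foot_xy[0], head_xy[0]],
                    [foot_xy[1], head_xy[1]],
                    [1.0, 1.0]])
    g = H @ pts
    return (g[:2] / g[2]).T                # shape (2, 2): two ground points

def point_line_dist(p, a, b):
    """Distance of ground point p from the line through a and b."""
    d = b - a
    cross = d[0] * (p[1] - a[1]) - d[1] * (p[0] - a[0])
    return abs(cross) / (np.linalg.norm(d) + 1e-12)

def message_weights(ground_particles, cam_particles, cam_weights, H):
    """Eqs. (4)-(5): message weight of every ground-plane particle from one
    camera, accumulated over that camera's weighted particles."""
    w = np.zeros(len(ground_particles))
    for (foot, head), pi_m in zip(cam_particles, cam_weights):
        a, b = project_axis(H, foot, head)
        for n, x0 in enumerate(ground_particles):
            psi = np.exp(-point_line_dist(np.asarray(x0, float), a, b) ** 2)
            w[n] += pi_m * psi             # sum over m of pi^m * psi
    return w

# The belief weight of ground particle n (Eq. 5) is the product of its
# message weights over all cameras, renormalised to sum to one.
```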
3.2 Boosted Proposal Functions
A target is tracked in each camera by a particle filter. Due to the occlusions or other image noise, feedback from the fusion module is expected to improve
(a) Camera 1 passes messages
(b) Camera 2 passes messages
Fig. 4. An illustration of evaluating ground-plane particles using two cameras. The ground-plane particles are evaluated according to the distances to the projected principal axes. (a) After the first camera passes messages to the ground plane, all the particles along the principal axes (red dots) have larger weights than those further away (blue dots). The weights of the camera particles are shown at one end of the corresponding principal axes. (b) After the second camera passes messages, only those ground-plane particles that are close to the intersections have large weights.
the tracking performance in a camera. A similar message passing procedure was adopted in our previous work to pass messages from the ground plane to each camera, which proved computationally expensive. We propose here a different method to incorporate this feedback. Note that in the dynamic Markov model in Fig. 1(b), for each xt,j, j = 1, . . . , L, there is an extra temporal link from xt−1,0 besides that from xt−1,j. This enables us to design a mixture proposal function for importance sampling,

p(x_{t,j}|x_{t-1,j}, x_{t-1,0}) \propto \alpha \, p(x_{t,j}|x_{t-1,j}) + (1-\alpha) \, p(x_{t,j}|x_{t-1,0}).   (6)
Thus, we sample particles from both p(xt,j |xt−1,j ) and p(xt,j |xt−1,0 ), i.e., αN particles are sampled from p(xt,j |xt−1,j ) and the other (1 − α)N from p(xt,j |xt−1,0 ). Parameter α specifies a trade-off between two proposal functions and is set to 0.5 in our experiments. To sample from p(xt,j |xt−1,0 ), we fit a Gaussian distribution to xt−1,0 and propagate it to each camera using the homographies. In a sense, the fusion results at t− 1 are used as boosted proposal functions by each camera. This is beneficial not only in maintaining consistency between the particle filters at different nodes but also in speeding up the tracking algorithm. The sampled particles are evaluated using the image likelihood as is done in standard particle filters.
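A hedged sketch of this sampling step is given below. It assumes the per-camera state is a 2-D image position, a random-walk transition kernel with an arbitrary motion_std, and alpha = 0.5 as in the text; only the idea of splitting the particle budget between the two proposals is taken from the paper.

```python
import numpy as np

def sample_mixture_proposal(prev_particles, ground_mean, ground_cov,
                            H_ground_to_image, N=100, alpha=0.5,
                            motion_std=5.0, rng=None):
    """Eq. (6): draw alpha*N particles from the independent transition kernel
    p(x_t,j | x_t-1,j) and (1-alpha)*N from the boosted proposal
    p(x_t,j | x_t-1,0), here a Gaussian fitted to the ground-plane estimate
    and mapped into the image through the homography."""
    rng = rng or np.random.default_rng()
    n_trans = int(round(alpha * N))

    # Independent transition kernel: random walk around resampled particles
    # from the previous time step (the walk model is an assumption).
    idx = rng.integers(0, len(prev_particles), size=n_trans)
    from_transition = prev_particles[idx] + rng.normal(0.0, motion_std, (n_trans, 2))

    # Boosted proposal: sample on the ground plane, then project to the image.
    g = rng.multivariate_normal(ground_mean, ground_cov, size=N - n_trans)
    g_h = np.hstack([g, np.ones((N - n_trans, 1))])     # homogeneous coords
    im = (H_ground_to_image @ g_h.T).T
    from_fusion = im[:, :2] / im[:, 2:3]

    return np.vstack([from_transition, from_fusion])
```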
4 Results
We tested our method on both video surveillance and soccer game sequences. We manually initialized the targets of interest in the first frames of the sequences and sampled 100 particles for each filter. Figure 5 shows the results of tracking a pedestrian in PETS sequences and a comparison with a reference method [7], which tracks a target in 3D by a particle
Fig. 5. The results of tracking a pedestrian in PETS sequences with our approach (top rows) and with a reference method [7] (bottom rows). In the latter method [7], we initialize a tracker in one camera and project the particles to another camera using the homographies. Here, due to imprecise foot positions, the estimates are projected to wrong positions.
filter and evaluates the particles by the product of the likelihoods in all cameras. In this experiment, we adopted a classic color observation model and evaluate each particle by matching the color histogram to a reference model [11]. The figure shows that particle filters do not estimate precise foot positions; thus, mapping the particles between cameras or between cameras and the ground plane using homographies is imprecise. As a result, using this method [7], most particles in one camera are projected to wrong positions in another camera so that only one camera contributes to the tracking. On the other hand, due to the use of the principal axes, our method integrates information from both cameras and achieves more reliable results. Figure 6 shows the results of tracking two selected people in an indoor environment with four cameras. In this experiment, we adopted a hierarchical multi-cue observation model and evaluated each particle first by a color likelihood function and then by a background-subtraction likelihood function [10]. We also assumed that the sizes of the people were fixed and could be inferred from their ground positions [2]. Thus, the only parameters of interest were the positions in the images and in the ground plane. A comparison with our previous work [1] shows that the new approach achieved similar results but was approximately twice faster.
Fig. 6. The results of tracking two people in an indoor environment with four cameras. Each row shows four simultaneous views. In this experiment, both the head and the ground homographies of each camera are available. The fixed-size assumption significantly improved the robustness of the algorithm.
Figure 7 shows the results of tracking several soccer players in three cameras. Due to the interactions between the players, the feedback from the fusion module to each camera becomes critical, without which the trackers in different cameras fail one by one. In this experiment, the same observation model as in the PETS experiment was used and the homographies of each camera were obtained on-line by using a field model and by accumulating motion estimates between consecutive frames [3]. Note in the figure that the estimated foot positions do not coincide with the bottom of the bounding boxes, but are more precise than these thanks to the multiple-camera fusion using principal axes. At one point, due to a heavy occlusion that occurs in all cameras, a tracker jumps from one
Fig. 7. The results of tracking several soccer players in the last frames of the three sequences. The ellipses under the rectangles are the fusion results in the ground plane.
Fig. 8. The particle distributions at the time when the tracker is about to jump to a different player, which happens here because the players involved are very close both in space and in appearance in all three views. The green rectangles are the sampled particles, the blue are the estimates, and the red are the predictions of the fusion results at the previous time.
target to another. In such situations, multi-camera systems without feedback between cameras are susceptible to mismatched targets. In our system, thanks to the feedback from the ground-plane tracker, the trackers at each camera remain consistent, even if they collectively follow the wrong target. Figure 8 shows the particle distributions at the time instant when the jump begins. This problem can be partially solved by tracking multiple targets simultaneously.
5 Conclusion and Future Work
This paper presents a novel approach to ground-plane tracking of targets in multiple cameras. Different from previous work, our approach is not based on bottom-up detection or segmentation methods. Instead, we infer target states in each camera and in the ground plane by collaborative particle filters. Message passing and boosted proposal functions are incorporated in the collaboration between the trackers in each camera and the fusion module. Principal axes are exploited in the multi-camera integration, which enables us to handle the imprecise foot positions and some calibration uncertainties. In doing so, we achieve robust results using relatively little computational resources. We are currently adapting this approach to multi-target, multi-camera tracking, which involves the modeling of the target interactions and data association across cameras.
Acknowledgement The authors wish to thank J. Berclaz and F. Fleuret for sharing their data.
References
1. Du, W., Piater, J.: Multi-view object tracking using sequential belief propagation. In: Asian Conference on Computer Vision, Hyderabad, India (2006)
2. Fleuret, F., Berclaz, J., Lengagne, R., Fua, P.: Multi-camera people tracking with a probabilistic occupancy map. IEEE Transactions on Pattern Analysis and Machine Intelligence (2007)
3. Hayet, J.-B., Piater, J., Verly, J.: Robust incremental rectification of sports video sequences. In: British Machine Vision Conference, Kingston, UK, pp. 687-696 (2004)
4. Hu, W.-M., Hu, M., Zhou, X., Tan, T.-N., Lou, J., Maybank, S.J.: Principal axis-based correspondence between multiple cameras for people tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(4), 663-671 (2006)
5. Khan, S.M., Shah, M.: A multiview approach to tracking people in crowded scenes using a planar homography constraint. In: ECCV, pp. 98-109 (2006)
6. Kim, K., Davis, L.S.: Multi-camera tracking and segmentation of occluded people on ground plane using search-guided particle filtering. In: ECCV, pp. 98-109 (2006)
7. Kobayashi, Y., Sugimura, D., Sato, Y.: 3D head tracking using the particle filter with cascaded classifiers. In: BMVC (2006)
8. Mittal, A., Davis, L.S.: M2tracker: A multi-view approach to segmenting and tracking people in a cluttered scene. International Journal of Computer Vision 51(3), 189-203 (2003)
9. Nummiaro, K., Koller-Meier, E., Svoboda, T., Roth, D., van Gool, L.: Color-based object tracking in multi-camera environment. In: Michaelis, B., Krell, G. (eds.) Pattern Recognition. LNCS, vol. 2781, Springer, Heidelberg (2003)
10. Pérez, P., Vermaak, J., Blake, A.: Data fusion for visual tracking with particles. Proceedings of the IEEE 92(3), 495-513 (2004)
11. Pérez, P., Hue, C., Vermaak, J., Gangnet, M.: Color-based probabilistic tracking. In: European Conference on Computer Vision, Copenhagen, Denmark, vol. 1, pp. 661-675 (2002)
Finding Camera Overlap in Large Surveillance Networks Anton van den Hengel, Anthony Dick, Henry Detmold, Alex Cichowski, and Rhys Hill School of Computer Science University of Adelaide Adelaide, 5005, Australia {anton,ard,henry,alex,rhys}@cs.adelaide.edu.au
Abstract. Recent research on video surveillance across multiple cameras has typically focused on camera networks of the order of 10 cameras. In this paper we argue that existing systems do not scale to a network of hundreds, or thousands, of cameras. We describe the design and deployment of an algorithm called exclusion that is specifically aimed at finding correspondence between regions in cameras for large camera networks. The information recovered by exclusion can be used as the basis for other surveillance tasks such as tracking people through the network, or as an aid to human inspection. We have run this algorithm on a campus network of over 100 cameras, and report on its performance and accuracy over this network.
1 Introduction
Manual inspection is an inefficient and unreliable way to monitor large surveillance networks (see Figure 1 for example), particularly when coordination across observations from multiple cameras is required. In response to this, several systems have been developed to automate inspection tasks that span multiple cameras, such as following a moving target, or grouping together related cameras. A key part of any multi-camera surveillance system is to understand the spatial relationships between cameras in the network. In early surveillance systems, this information was manually specified or derived from camera calibration, but recent systems at least partly automate the process by analysing video from the cameras. These systems are demonstrated on networks containing of the order of 10 cameras, but have requirements that mean they do not scale well to networks an order of magnitude larger. For example: [1] requires manually marked correspondences between images; [2] requires a training stage where only one object is observed; and [3,4,5] require many correct detections of objects as they appear and disappear from cameras over a long period of time. An important step towards recovering spatial camera layout is to determine where cameras overlap. The approach taken in [6] is to estimate motion trajectories for people walking on a plane, and then match trajectories between cameras. However, this assumes planar motion, and accurate tracking over long periods
Fig. 1. Snapshot of video feeds from the network. Some cameras are offline.
of time. It also does not scale well, since track matching complexity increases as O(n2 ) with the number of cameras n. In [7] evidence for overlap is accumulated by estimating the boundary of each camera’s field of view in all other cameras. Again, this does not scale well to large numbers of cameras, and assumes that all cameras overlap. Because they start with an assumption of non-connectedness, and gradually accumulate evidence for connections, most methods for determining spatial layout rely on accurately detecting and/or tracking objects over a long time period. They also require comparisons to be made between every pair of cameras in a network. The number of pairs of cameras grows with the square of the number of cameras in the network, rendering exhaustive comparisons infeasible. This paper describes the implementation of a method called exclusion for determining camera overlap that is designed to quickly home in on cameras that may overlap. The method is computationally fast, and does not rely on accurate tracking of objects within each camera view. In contrast to most existing methods, it does not attempt to build up evidence for camera overlap over time. Instead, it starts by assuming all cameras are connected and uses observed activity to rule out connections over time. This is an easier decision to make, especially when a limited amount of data is available. It is also based on the observation that it is impossible to prove a positive connection between cameras—any correlation of events could be coincidence—whereas it is possible to prove a negative connection by observing an object in one camera while not observing it at all in another.
2 The Exclusion Algorithm
Consider a set of c cameras that generates c images at time t. By applying foreground detection [8] to all images we obtain a set of foreground blobs, each of which can be summarised by an image position and camera index. Each image is partitioned into a grid of windows, and each window can be labelled "occupied" or "unoccupied" depending on whether it contains a foreground object. Exclusion is based on the observation that a window which is occupied at time t cannot be an image of the same area as any other window that is simultaneously unoccupied. Given that windows tend to be unoccupied more often than they are occupied, this observation can be used to eliminate a large number of window pairs as potentially viewing the same area. The process of elimination can be repeated for each frame of video to rapidly reduce the number of pairs of image windows that could possibly be connected. This is the opposite of most previous approaches: rather than accumulate positive information over time about links between windows, we seek negative information allowing the instant elimination of impossible connections. Such connections are referred to as having been excluded [9].
2.1 Exclusion over Multiple Timesteps
Rather than calculate exclusion separately at each timestep, it is more efficient to gather occupancy information over multiple frames and then calculate exclusion over all of them at once. Let the set of windows over all cameras be W = {w1 . . . wn}. Corresponding to each window wi is an occupancy vector oi = (oi1, . . . , oiT) with oit set to 1 if window wi is occupied at time t, and 0 if not. If two windows are images of exactly the same region in the world, we would expect their corresponding occupancy vectors to match exactly. This can be tested by applying the exclusive-or operator ⊕ to elements of the occupancy vectors:

a \oplus b = \max_{k=1}^{K} \, a_k \oplus b_k .
It can be inferred that two windows wi and wj do not overlap if oi ⊕ oj = 1. This comparison is very fast to compute, even for long vectors.
2.2 Exclusion with Tolerance
Exclusion as described so far assumes that:
1. corresponding windows in overlapping cameras cover exactly the same visible area in the scene,
2. all cameras are synchronised, so they capture frames at exactly the same time, and
3. the foreground detection module never produces false positives or false negatives.
In reality none of these assumptions is likely to hold completely. It is thus possible that two overlapping windows might simultaneously register as occupied and vacant and therefore that the exclusive-or of the corresponding occupancy vectors might incorrectly indicate that they do not overlap. Assumptions 1 and 2 can be relaxed by including the neighbours of a particular window when registering its occupancy. We use a padded occupancy vector pi which has element pit set to 1 when window wi or any of its neighbours is occupied at time t. A more robust mechanism for determining whether two windows wi and wj overlap is thus to calculate oi ⊖ pj on the basis of the occupancy vector oi and the padded occupancy vector pj. The operator ⊖ is a uni-directional version of the exclusive-or, defined such that

a \ominus b = \max_{k=1}^{K} \, a_k \ominus b_k ,   (1)
where ak ⊖ bk is 1 if and only if ak is 1 and bk is 0. Note that this means exclusion calculation is no longer symmetric. To account for detection errors (assumption 3), we calculate exclusion based on accumulated results over multiple tests, rather than relying on a single contradictory observation. Assuming that the detector has a constant failure rate, the evidence for exclusion is directly related to the number of contradictory observations in a fixed time period t = 1 . . . T [9], which we call the exclusion count:

E_{ij} = \sum_{t=1}^{T} o_{it} \ominus p_{jt} .   (2)
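A minimal sketch of Eqs. (1)-(2) follows, assuming boolean occupancy arrays per camera and a rectangular window grid. The wrap-around behaviour of np.roll at the grid border is a simplification; a real implementation would clip instead of wrapping.

```python
import numpy as np

def pad_occupancy(occ, grid_shape):
    """Padded occupancy p_i: a window counts as occupied if it or any of its
    8 neighbours in the window grid is occupied.
    occ: (T, rows*cols) boolean occupancy for one camera."""
    T = occ.shape[0]
    o = occ.reshape(T, *grid_shape)
    p = o.copy()
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            # np.roll wraps around the border; acceptable for a sketch.
            p |= np.roll(np.roll(o, dy, axis=1), dx, axis=2)
    return p.reshape(T, -1)

def exclusion_counts(occ_i, pad_j):
    """Eq. (2): E_ij = sum_t (o_it (-) p_jt), where a (-) b is 1 iff a = 1 and b = 0.
    occ_i: (T, Wi) occupancy of the candidate source windows,
    pad_j: (T, Wj) padded occupancy of the candidate target windows."""
    contradiction = occ_i[:, :, None] & ~pad_j[:, None, :]   # (T, Wi, Wj)
    return contradiction.sum(axis=0).astype(np.uint16)       # E_ij
```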
2.3 Normalised Exclusion
The exclusion count has two main shortcomings as a measure for deciding window overlap/non-overlap:
– As the operator a ⊖ b will only return true when a is true, the exclusion count Eij between windows wi and wj is bounded by the number of detections in wi, and is likely to be higher for windows wi that register more detections.
– In a large network, it will frequently occur that data sent from a camera will be lost, or not arrive in time to be included in the exclusion calculation, or that a camera will go offline. Thus the maximum value of Eij also depends on how often data from wj is available.
To address these problems we define a padded availability vector v for each window that is set to 1 when occupancy data for the window and its neighbours is available, and 0 otherwise. We can then define an exclusion opportunity count between each pair of windows:

O_{ij} = \sum_{t=1}^{T} o_{it} \, v_{jt}   (3)
Based on this we define an overlap certainty measure from each window with opportunity count at least 1 to every other window:

C_{ij} = \frac{O_{ij} - E_{ij}}{O_{ij}}   (4)
which measures the number of times that an exclusion was not found between wi and wj as a proportion of the number of times an exclusion could possibly have been found given the available data. In general, exclusion estimates for windows that are only occupied a small number of times are dominated by noise such as erroneous detection. We therefore include a penalty term for such windows:

\hat{C}_{ij} = C_{ij} \times \min\left(1, \frac{\log(O_{ij})}{\log(O_{\mathrm{ref}})}\right)   (5)

where Oref is a number of detections empirically determined to result in reliable exclusion calculation. We set this to 20 in our experiments.
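The normalisation step of Eqs. (3)-(5) can be sketched directly on the count matrices; only O_ref = 20 is taken from the text, the array-based formulation is an implementation choice.

```python
import numpy as np

def opportunity_counts(occ_i, avail_j):
    """Eq. (3): O_ij = sum_t o_it * v_jt, with v the padded availability.
    occ_i: (T, Wi) boolean occupancy, avail_j: (T, Wj) boolean availability."""
    return (occ_i[:, :, None] & avail_j[:, None, :]).sum(axis=0)

def overlap_certainty(E, O, O_ref=20):
    """Eqs. (4)-(5): overlap certainty for every window pair, with the log
    penalty for windows that were rarely occupied."""
    O = O.astype(float)
    C = np.zeros_like(O)
    seen = O >= 1                               # opportunity count at least 1
    C[seen] = (O[seen] - E[seen]) / O[seen]     # Eq. (4)
    penalty = np.minimum(1.0, np.log(np.maximum(O, 1.0)) / np.log(O_ref))
    return C * penalty                          # Eq. (5)
```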
3 Implementing Exclusion
In this section we describe how the exclusion algorithm is implemented in order to find overlap in a large network of cameras. This is done in two steps:
– Object detection (Section 3.1): after each frame is captured, it is processed to detect objects within it. These detections are then converted to occupancy data for each window and sent to a central server. The main challenge for large camera networks is to detect objects quickly and reliably.
– Exclusion calculation/update (Sections 3.2, 3.3): at regular intervals of the order of several seconds, the stored occupancy data is used to calculate exclusion between each window pair. This exclusion result is then merged with exclusion results from earlier time periods, resulting in an updated estimate of camera overlap. The main challenge here is to synchronise data from different cameras, and to mitigate the memory requirements of exclusion data.
3.1 Distributed Foreground Detection
We detect foreground objects within each camera image using the Stauffer and Grimson background subtraction method [8]. To derive a single position from a foreground blob, we use connected components and take the midpoint of the low edge of the bounding box of each blob. This corresponds approximately to the lowest visible extent of the object in the image, assuming that the camera is approximately upright. Foreground detection is the most computationally intensive part of exclusion, but is also the stage that is easiest to parallelise. Presently, cameras are assigned to one of several processors that perform background subtraction on each image they capture. Eventually, though, we aim to implement detection on the cameras themselves.
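The per-camera detection step might look roughly as follows. OpenCV's MOG2 subtractor is used here only as a readily available descendant of the Stauffer-Grimson model; the morphological clean-up, the minimum blob area, and the grid dimensions (taken from Section 4) are assumptions rather than details given in the paper.

```python
import cv2
import numpy as np

GRID_ROWS, GRID_COLS = 9, 12          # window grid used in the experiments
subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=False)

def frame_occupancy(frame, min_area=50):
    """Returns a (GRID_ROWS, GRID_COLS) boolean occupancy map for one frame.
    Each blob is summarised by the midpoint of the low edge of its bounding
    box, i.e. the approximate lowest visible extent of an upright object."""
    fg = subtractor.apply(frame)
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
    n, labels, stats, _ = cv2.connectedComponentsWithStats(fg, connectivity=8)
    occ = np.zeros((GRID_ROWS, GRID_COLS), dtype=bool)
    h, w = fg.shape
    for i in range(1, n):                         # label 0 is the background
        x, y, bw, bh, area = stats[i]
        if area < min_area:
            continue
        foot_x, foot_y = x + bw // 2, y + bh - 1  # midpoint of the low edge
        occ[min(foot_y * GRID_ROWS // h, GRID_ROWS - 1),
            min(foot_x * GRID_COLS // w, GRID_COLS - 1)] = True
    return occ
```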
3.2 Calculating Exclusion
Each occupancy result is tagged with the timestamp of the frame of video on which it is based and sent to a central server. After a fixed time interval, typically several seconds, these results are assembled to form an occupancy vector oi for each window wi. Each element of oi is indexed by a time offset t within the time interval, and can be one of three values:
– oit = 2 if no occupancy data is available for wi within the time interval [t − t̂, t + t̂)
– oit = 1 if wi is occupied within the time interval [t − t̂, t + t̂)
– oit = 0 if wi is not occupied within the time interval [t − t̂, t + t̂)
where t̂ is a tolerance to account for inaccuracies in camera synchronisation. These occupancy vectors are then used to calculate exclusion and opportunity counts as described in Sections 2.2 and 2.3 for each window pair within the time interval. These counts are then added to counts from previous time intervals, giving an updated estimate of exclusion confidence for each window pair.
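A possible assembly of the three-valued occupancy vector is sketched below. The report format (timestamp, occupied flag) and the rule that an occupancy report overrides a vacancy report within the tolerance window are assumptions made for illustration.

```python
import numpy as np

NO_DATA, VACANT, OCCUPIED = 2, 0, 1

def assemble_occupancy(reports, t0, T, t_hat):
    """Builds the three-valued occupancy vector o_i for one window over the
    interval [t0, t0 + T). `reports` is a list of (timestamp, occupied)
    tuples received from that window's camera; t_hat is the synchronisation
    tolerance, all in frame-time units."""
    o = np.full(T, NO_DATA, dtype=np.uint8)
    for ts, occupied in reports:
        lo = max(0, int(np.floor(ts - t_hat - t0)))
        hi = min(T, int(np.ceil(ts + t_hat - t0)))
        for t in range(lo, hi):
            if occupied:
                o[t] = OCCUPIED            # occupancy wins over vacancy
            elif o[t] == NO_DATA:
                o[t] = VACANT
    return o
```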
3.3 Exclusion Data Compression
The central server stores both an exclusion count Eij and an exclusion opportunity count Oij for each pair of windows. Both counts are stored as a byte. This means that for a network of 100 cameras, each containing a 10 × 10 window grid, the counts require approximately 2 × 10^8 bytes of storage. Initially, Eij = 0 for all i and j. Consider how the exclusion counts are affected when a single person is observed in one window wD, and no other person is detected across the network. This will result in the exclusion count EDj being incremented for all windows j ≠ D in the network. If the exclusion counts are stored in a matrix whose ij-th element is the exclusion count between wi and wj, this results in an entire row of the matrix being incremented. Situations similar to this are quite common and suggest that a run length encoding scheme could effectively compress the matrix. Similarly, exclusion opportunity counts Oij are initially 0 for all i and j. Like exclusion counts, neighbouring opportunity counts are likely to be incremented at identical times, since an increment to Oij requires that wi is occupied and all data in the neighbourhood of wj is available. Again, this suggests the use of a run length encoding scheme to store exclusion opportunity data.
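The exact on-disk layout is not specified in the paper; the following sketch only illustrates the row-wise run-length idea, including the cheap whole-row increment described above.

```python
import numpy as np

def rle_encode(row):
    """Run-length encode one row of the count matrix as (value, run) pairs;
    long runs of identical counts collapse to a single pair."""
    change = np.flatnonzero(np.diff(row)) + 1
    starts = np.concatenate(([0], change))
    runs = np.diff(np.concatenate((starts, [len(row)])))
    return list(zip(row[starts].tolist(), runs.tolist()))

def rle_decode(pairs, dtype=np.uint8):
    return np.concatenate([np.full(run, value, dtype=dtype) for value, run in pairs])

def rle_increment(pairs, amount=1):
    """Adding a constant to an entire row (the common case described above)
    touches only the stored values, not the run structure."""
    return [(value + amount, run) for value, run in pairs]
```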
4 Testing Exclusion
We tested our exclusion implementation on a network containing 100 Axis IP cameras, distributed across a university campus. Frames are captured from each camera as JPEG compressed 320×240 images using the Axis API. Each frame is divided into a 9×12 grid of windows, for a total of 10800 windows. As previously mentioned, the computational cost of foreground detection over a large number of
cameras far outweighs that of exclusion. This coarse level of foreground detection is well suited to implementation on board a camera, but for the purposes of testing a cluster of 16 dual core Opteron PCs has been used to process the footage from the 100 cameras in real time. By contrast, the central server, where occupancy results are assembled and exclusion is calculated, is a single desktop PC (Dell Dimension 4700, 3.2GHz Pentium 4, 1GB memory).
4.1 Performance Testing
We first test how the performance of exclusion scales, both over long time periods and large numbers of cameras. It was found that due to the optimisations described previously, the performance of exclusion does not depend strongly on the number of cameras on the network. Rather, it depends on the amount of activity in the network. Thus we observed the performance of exclusion during high and low activity periods, over a period of one hour. The memory required by exclusion increases over time, as shown in Figure 2. This is largely due to the decreased effectiveness of RLE compression of the exclusion counts (EC) as more activity is observed. The opportunity counts (EOC) are still well compressed by RLE after one hour, as camera availability changes rarely during this time. However, notice that the increase in EC elements, and corresponding increase in memory usage, is less than linear. Even after an hour of observation, only 29.56MB of memory is being used, compared to over 200MB that would be required to store the uncompressed data. Figure 3 shows the time taken to calculate exclusion at intervals over the one hour period. Notice that the time to compute exclusion remains fairly constant over the time period, and is consistently faster than real time, even using a standard desktop PC. In fact the exclusion is calculated for the hour's footage
Fig. 2. Memory usage over one hour of processing. The exclusion element count (EC) shows how the RLE compression becomes less effective over time.
Fig. 3. Timing information for one hour of processing. The time required to process each frame remains approximately constant over time, although it increases slightly during periods of higher activity. Exclusion for 100 cameras is consistently calculated at over 4 times real time on a desktop PC, and an hour’s video takes under 13 minutes to process.
using less than 13 minutes of processor time. It is also evident that the time taken to calculate each exclusion count does depend on the amount of activity, measured by the number of occupancies detected per available camera. This can be seen by the slight increase in "Avg Occupancy Count per Camera" between about 30 and 50 minutes, and the corresponding decrease in "Speed relative to real time".
4.2 Ground Truth Verification
It is difficult to verify that exclusion captures all overlap in a large camera network, and excludes all non-overlap. For example, Figure 1 shows a set of images captured from across the network at one moment. After some exclusion processing, the grid is rearranged to group together related cameras as shown in Figure 4. Connections are drawn between a window pair when the overlap certainty measure (Equation 5) exceeds a threshold C*. The link must pass the threshold in both directions for the connection to be established, i.e. a link is drawn between wi and wj if and only if Ĉij > C* and Ĉji > C*. In our experiments we set C* = 0.8. To verify the exclusion results we manually inspected the groups that were found. Close-up views of some groups can be seen in Figure 5. It can be seen that overlap has been detected correctly in a variety of cases despite widely differing viewpoints and lighting conditions. These are correspondences that would be
Fig. 4. Video feeds from Figure 1 after running exclusion on one hour of footage. The cameras are arranged on screen so that related cameras are near each other, to aid human inspection.
Fig. 5. Overlapping groups detected by exclusion
very difficult to detect by tracking people, and attempting to build up correlations between tracks. The lighting conditions are often very poor, and the size of people in each camera varies greatly. Figure 4 also includes four camera groups that have been erroneously linked. Each of these groups has only one or two links between windows in each image, and views low-traffic areas. These errors would thus disappear as more traffic is viewed. To correct these groups until enough traffic has been seen, a filter can be implemented that only links cameras when more than one window in each camera is linked, as in the sketch below. Alternatively, a human operator can sever the links manually.
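A possible implementation of the mutual-threshold linking and of the "more than one window" filter is given below. C* = 0.8 comes from the text; min_links = 2 realises the filter, and the window-to-camera index array is an assumed input.

```python
import numpy as np
from collections import defaultdict

def camera_groups(C_hat, window_camera, C_star=0.8, min_links=2):
    """Links window pairs whose overlap certainty exceeds C* in both
    directions, then groups cameras connected by at least `min_links`
    window links (the filter for low-traffic errors suggested above)."""
    linked = (C_hat > C_star) & (C_hat.T > C_star)
    np.fill_diagonal(linked, False)

    # Count window links per camera pair.
    pair_links = defaultdict(int)
    for i, j in zip(*np.nonzero(np.triu(linked, 1))):
        ci, cj = window_camera[i], window_camera[j]
        if ci != cj:
            pair_links[(min(ci, cj), max(ci, cj))] += 1

    # Union-find over cameras that share enough window links.
    parent = {c: c for c in set(window_camera)}
    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c
    for (ci, cj), n in pair_links.items():
        if n >= min_links:
            parent[find(ci)] = find(cj)

    groups = defaultdict(list)
    for c in parent:
        groups[find(c)].append(c)
    return list(groups.values())
```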
Some camera overlap was not detected because of low traffic during the hour that footage was captured. However, all overlap between cameras monitoring areas with enough detections (relative to Oref ) to calculate exclusion was correctly determined. This leads us to believe that remaining overlap can be detected when the system is run over a longer time period.
5 Conclusion
This paper describes a method for automatically determining camera overlap in large surveillance networks. The method is based on the process of eliminating impossible connections rather than the slower process of building up positive evidence of activity. We describe our implementation of the method, and show that it runs faster than real time on an hour of footage from a 100 camera network, using a single desktop PC. Future work includes testing the system over a period of several days, adding more cameras to the network, and implementing a more efficient foreground detector.
References
1. Javed, O., Rasheed, Z., Shafique, K., Shah, M.: Tracking across multiple cameras with disjoint views. In: ICCV 2003, pp. 952-957 (2003)
2. Dick, A.R., Brooks, M.J.: A stochastic approach to tracking objects across multiple cameras. In: Webb, G.I., Yu, X. (eds.) AI 2004. LNCS (LNAI), vol. 3339, pp. 160-170. Springer, Heidelberg (2004)
3. Ellis, T.J., Makris, D., Black, J.K.: Learning a multi-camera topology. In: Joint IEEE Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS), pp. 165-171. IEEE Computer Society Press, Los Alamitos (2003)
4. Stauffer, C.: Learning to track objects through unobserved regions. In: IEEE Computer Society Workshop on Motion and Video Computing, pp. 96-102. IEEE Computer Society Press, Los Alamitos (2005)
5. Tieu, K., Dalley, G., Grimson, W.E.L.: Inference of non-overlapping camera network topology by measuring statistical dependence. In: ICCV 2005, pp. 1842-1849 (2005)
6. Lee, L., Romano, R., Stein, G.: Monitoring activities from multiple video streams: Establishing a common coordinate frame. IEEE Trans. Pattern Analysis and Machine Intelligence 22(8), 758-767 (2000)
7. Khan, S., Javed, O., Rasheed, Z., Shah, M.: Human tracking in multiple cameras. In: IEEE International Conference on Computer Vision, pp. 331-336 (2001)
8. Stauffer, C., Grimson, W.E.L.: Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Analysis and Machine Intelligence 22(8), 747-757 (2000)
9. van den Hengel, A., Dick, A., Hill, R.: Activity topology estimation for large networks of cameras. In: AVSS 2006. Proc. IEEE International Conference on Video and Signal Based Surveillance, pp. 44-49. IEEE Computer Society Press, Los Alamitos (2006)
Information Fusion for Multi-camera and Multi-body Structure and Motion Alexander Andreopoulos and John K. Tsotsos York University, Dept. of Computer Science & Engineering, Toronto, Ontario, M3J 1P3, Canada {alekos,tsotsos}@cse.yorku.ca Abstract. Information fusion algorithms have been successful in many vision tasks such as stereo, motion estimation, registration and robot localization. Stereo and motion image analysis are intimately connected and can provide complementary information to obtain robust estimates of scene structure and motion. We present an information fusion based approach for multi-camera and multi-body structure and motion that combines bottom-up and top-down knowledge on scene structure and motion. The only assumption we make is that all scene motion consists of rigid motion. We present experimental results on synthetic and nonsynthetic data sets, demonstrating excellent performance compared to binocular based state-of-the-art approaches for structure and motion.
1 Introduction
Multi-body and multi-camera structure and motion establishes the structure and motion of a scene that consists of multiple moving rigid objects that are observed from multiple views [1], [2]. Stereo vision analysis and image motion analysis provide information with complementary uncertainties which can depend on the motion of the camera platform, the scene structure and the spatio-temporal baselines. There are four fundamental problems with the extractable information from motion data or from stereo data [3]: (i) Image motion and disparity, with an unknown camera translation, allow us to infer object range only up to a scale ambiguity, since image motion and disparity depend on the ratio of camera translation to object range. (ii) Image motion and disparity tend towards zero near the focus of expansion (FOE). Since object range is inversely proportional to image motion and disparity, scene structure estimation is ill-conditioned near the FOE. (iii) The more closely aligned the local image structure is with the epipolar directions – i.e., directions pointing towards the FOE – the more ill-conditioned scene structure estimation becomes in those regions. (iv) Whereas large spatio-temporal baselines give better depth estimates for distant objects, the greater disparity and occlusion makes such cameras unsuitable for nearby objects. The severity of these problems is reversed when dealing with small baselines. The spatio-temporal baselines might be defined with respect to a monocular camera in motion (structure from motion), a static stereo camera, or some other combination of static and non-static cameras.
A method for fusing the structure and motion estimates of different cameras by preserving the accurate estimates and diminishing the effect of inaccurate estimates is highly desirable. For example, whereas in one camera the optical flow near the FOE might be poorly estimated, from another camera’s viewpoint the optical flow of the same scene region might not be as ill-conditioned, since the FOE will likely have changed. We present an information fusion based approach for dealing with all these problems in a unified framework. We model the above mentioned errors as originating from ambiguities in the estimation of stereo image correspondences and in the optical flow across all cameras. The only assumption we make as to the scene motion is that we are dealing with rigidly moving objects. The rest of the paper is organized as follows. Section 2 presents some related work. Section 3 introduces an approach for representing the motion and stereo data from a network of cameras. Section 4 describes how to combine this data in a single reference frame. Section 5 outlines a simple extension of the approach to camera rigs with arbitrary intrinsic and extrinsic parameters. Section 6 presents experimental results demonstrating the robustness of the approach. Section 7 concludes the paper.
2 Related Work
Richards [4] shows how the integration of changing disparity and object velocity can solve many of the ambiguitites inherent in stereopsis and motion under orthographic projection. Waxman [5] demonstrates the importance of the ratio of the rate of change of disparity over disparity, by using this quantity to unify stereo and motion analysis. As it is elaborated in [6], the importance of this ratio has been demonstrated numerous other times. Hanna and Okamoto [3] demonstrate how motion and stereo could be combined in a multi-camera system for egomotion and scene structure estimation. Their work is further expanded upon by Mandelbaum et al. [7]. Zhang and Kambhamettu [8] present a system which integrates 3D scene flow and structure recovery in order to complement the performance of each other, using a number of calibrated cameras. Singh and Allen [9] employ the Best Linear Unbiased Estimator (BLUE) to fuse local motion. Comaniciu [10], [11] developed a method for motion estimation under multiple source models. Neumann et al. [12] present a method for establishing a hierarchy of cameras based upon the stability and complexity of structure and motion estimation. To the best of our knowledge, the work we present is the first approach using information fusion for multi-camera and multi-body structure and motion.
3 Fusing Multiple Cameras
Assume we have a multi-camera rig composed of N monocular cameras. A maximum of \binom{N}{2} camera pairs exist. The coordinate system of camera C0 is referred
Fig. 1. (a) Diagram of a hypothetical nine camera rig. (b) A five camera rig mounted on a mobile robotic platform. (c) A planar textured region we used in some of the experiments for structure and motion estimation at a depth of 300cm. (d) The region after a 20 degree rotation around the camera's optical axis.
to as the basis coordinate system. By convention a vector's superscript will denote the coordinate system with respect to which we are expressing the vector. The camera rig is calibrated and therefore, for each pair of cameras Ci, Cj, we know a rotation matrix Rij and translation vector Tij = (Tij^x, Tij^y, Tij^z)^T that describes the rotation and translation that aligns camera Ci's coordinate axes with camera Cj's coordinate axes. See Fig. 1(a),(b) for examples of camera rigs where Rij = I (the identity matrix) ∀i, j. For each pixel p0 in camera C0, and for each camera pair (Cj, Ci) such that i ≠ 0, j ≠ i, we can use a stereo correspondence algorithm, such as [13], to obtain estimates of the pixels pj, pi in cameras Cj, Ci respectively, corresponding to pixel p0 in basis camera C0. Similarly, we can obtain motion flow estimates for each pixel pj, pi in Cj, Ci. With each such pair of image pixels pj, pi, we can associate a 6D vector V(p0, Cj, Ci), containing the 3D coordinates X1^{C0} = (X1^{C0}, Y1^{C0}, Z1^{C0}) of a point P that is imaged by pj, pi in camera pair (Cj, Ci). We can also associate with V(p0, Cj, Ci) a 3D vector u^{C0} corresponding to the 3D displacement vector of P that was extracted using the camera pair (Cj, Ci). The displacement vector might be due to camera movement, an independent motion of scene point P or a combination of both. As we have indicated above, the superscript C0 in X1^{C0}, u^{C0} indicates that the vectors are expressed with respect to the coordinate system of C0. Let X2^{C0} = (X2^{C0}, Y2^{C0}, Z2^{C0}) denote the coordinates of P with respect to camera C0's coordinate system, obtained after an arbitrary camera rig or scene motion. The context will always make it clear with respect to which camera pair (Cj, Ci) we estimated X1^{C0}, X2^{C0}. We can then obtain the 3D motion estimate u^{C0} for point P by u^{C0} = X2^{C0} − X1^{C0}. Then V(p0, Cj, Ci) ≜ (X1^{C0}, u^{C0})^T. Given a small neighborhood Δp0 of pixels around a pixel p0 in C0 – we use 3×3 pixel neighborhoods in this paper – the set ⋃_{p∈Δp0} ⋃_{j=0}^{N} ⋃_{i=1, i>j}^{N} V(p, Cj, Ci) contains estimates of scene structure and motion over all camera pairs. If we need to enforce a hard real-time constraint, we can select to process a subset of the camera pairs. For each camera pair Cj, Ci and each pixel p0 in C0 that we process, we assign the covariance matrix Cov(V(p0, Cj, Ci)). In the next section we will show how to estimate this covariance matrix and how to use it to assign a weight of importance to each one of those vectors. We will also show how to use
information fusion techniques to get a robust estimate of the true scene structure and motion. Notice that in the above mentioned set, mainly due to occlusions, V(p0 , Cj , Ci ) will not always contribute a vector for all p0 , Cj , Ci .
4 Fusing the Camera Data
We need to model the uncertainty in each of the 6D vectors V(p0, Cj, Ci) in order to obtain each vector's 6 × 6 covariance matrix. These covariance matrices are used by the BLUE estimator to obtain a reliable estimate of the scene structure and motion. For example, an image pixel that is near the focus of expansion in one monocular camera needs to assign a high uncertainty to its motion elements and assign a 3D structure uncertainty that depends on the scene depth relative to the camera pair used. From a different stereo camera's point of view, these uncertainties will differ. By combining bottom-up and top-down information related to the scene uncertainty we obtain the noise model used by our BLUE estimator. For notational simplicity, we initially assume a perspective projection camera model where all cameras have the same focal length f, the aspect ratio is 1, the skew is 0, and the principal point is set to (0,0). The camera setup is similar to Fig. 1(a),(b), where Rij = I and Tij^z = 0 ∀i, j. The extension to arbitrary camera setups is presented in Section 5. Every pair of cameras (Cj, Ci) can be viewed as a stereo camera with a focal length f, such that the projection of a point X1^{C0} = (X1^{C0}, Y1^{C0}, Z1^{C0}) in camera Ci is given by

x_r = \frac{(X_1^{C_0} - T_{0i}^x) f}{Z_1^{C_0}}, \qquad y_r = \frac{(Y_1^{C_0} - T_{0i}^y) f}{Z_1^{C_0}},   (1)

and the projection of the same point in camera Cj is

x_l = \frac{(X_1^{C_0} - T_{0j}^x) f}{Z_1^{C_0}}, \qquad y_l = \frac{(Y_1^{C_0} - T_{0j}^y) f}{Z_1^{C_0}}.   (2)
y y x x If | − T0j + T0i | ≥ | − T0j + T0i |, we have: x x x x (−T0j + T0i ) (xr + xl ) T0j + T0i + 2 xl − xr 2 yr y x x = (−T0j + T0i ) + T0i xl − xr x x + T0i )f (−T0j = . xl − xr
X1C0 =
(3)
Y1C0
(4)
Z1C0
(5)
y y x x Conversely, if | − T0j + T0i | < | − T0j + T0i |, we have:
xr x + T0i yl − yr y y y y + T0i ) (yr + yl ) T0j + T0i (−T0j = + 2 yl − yr 2 y y + T0i )f (−T0j = . yl − yr
y y X1C0 = (−T0j + T0i )
(6)
Y1C0
(7)
Z1C0
(8)
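To make the use of Eqs. (3)-(8) concrete, the following is a minimal sketch (not part of the paper; the function name, the NumPy dependency, and the argument layout are our own) of how a point could be triangulated from one camera pair under the simplified rig assumptions above:

```python
import numpy as np

def triangulate(pl, pr, T0j, T0i, f):
    """Triangulate a scene point in C0's frame from corresponding pixels.

    pl = (xl, yl): pixel in camera Cj,  pr = (xr, yr): pixel in camera Ci.
    T0j, T0i: translations of Cj, Ci relative to the basis camera C0
    (rotations assumed to be the identity, as in Eqs. (1)-(8)).
    Returns X1 = (X, Y, Z) expressed in C0's coordinate system.
    """
    (xl, yl), (xr, yr) = pl, pr
    bx = -T0j[0] + T0i[0]          # baseline component along x
    by = -T0j[1] + T0i[1]          # baseline component along y
    if abs(bx) >= abs(by):         # Eqs. (3)-(5): use the x-disparity
        d = xl - xr
        X = 0.5 * bx * (xr + xl) / d + 0.5 * (T0j[0] + T0i[0])
        Y = bx * yr / d + T0i[1]
        Z = bx * f / d
    else:                          # Eqs. (6)-(8): use the y-disparity
        d = yl - yr
        X = by * xr / d + T0i[0]
        Y = 0.5 * by * (yr + yl) / d + 0.5 * (T0j[1] + T0i[1])
        Z = by * f / d
    return np.array([X, Y, Z])
```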
Notice that in Eqs. (3)-(5) and Eqs. (6)-(8), y_l and x_l respectively, are not used. This provides a simple approximation for X_1^{C0} when, due to small errors, (x_l, y_l), (x_r, y_r) are not corresponding pixels. The corresponding image coordinates in the next frame are given by (x_l', y_l') = (x_l, y_l) + (v_x^{Cj}, v_y^{Cj}), (x_r', y_r') = (x_r, y_r) + (v_x^{Ci}, v_y^{Ci}), where (v_x^{Cj}, v_y^{Cj}), (v_x^{Ci}, v_y^{Ci}) denote the motion flow vectors in cameras C_j, C_i respectively. We can use (x_l', y_l'), (x_r', y_r'), in conjunction with Eqs. (3)-(8), to estimate X_2^{C0} = (X_2^{C0}, Y_2^{C0}, Z_2^{C0}) and calculate V(p_0, C_j, C_i).
We now show how Eqs. (3)-(8) can be used to define a covariance matrix for V(p_0, C_j, C_i). We only describe the covariance matrix derivation for |−T_{0j}^x + T_{0i}^x| ≥ |−T_{0j}^y + T_{0i}^y|, since the case |−T_{0j}^x + T_{0i}^x| < |−T_{0j}^y + T_{0i}^y| is similar. We model the error in the correspondences of the image points as (x_r + n_{x_r}, y_r + n_{y_r}), (x_l + n_{x_l}, y_l + n_{y_l}), where n_{x_r}, n_{y_r}, n_{x_l}, n_{y_l} are zero mean Gaussian random variables. Their standard deviation can depend on how noisy the images are and on prior knowledge regarding the accuracy of the correspondences – e.g., the sample variance of the correspondences within Δ_{p_0}. In this paper we assume a variance of 1/2 pixel for each of the four random variables. We also assume that the random variables are independent. Furthermore, we notice that in Eqs. (3)-(8) we can view X_1^{C0}, Y_1^{C0}, Z_1^{C0} as functions in terms of n_{x_r}, n_{y_r}, n_{x_l}, n_{y_l}. We obtain first order Taylor expansions of X_1^{C0}, Y_1^{C0}, Z_1^{C0} and we use these Taylor expansions to obtain variance/covariance measures for vector X_1^{C0}. It can be shown that within first order:

Var(X_1^{C0}) ≈ ((−T_{0j}^x + T_{0i}^x) x̂_l)^2 / (x̂_l − x̂_r)^4 Var(x_r) + ((T_{0j}^x − T_{0i}^x) x̂_r)^2 / (x̂_l − x̂_r)^4 Var(x_l)   (9)
Var(Y_1^{C0}) ≈ (−T_{0j}^x + T_{0i}^x)^2 / (x̂_l − x̂_r)^2 Var(y_r) + ((−T_{0j}^x + T_{0i}^x) ŷ_r)^2 / (x̂_l − x̂_r)^4 Var(x_r) + ((T_{0j}^x − T_{0i}^x) ŷ_r)^2 / (x̂_l − x̂_r)^4 Var(x_l)   (10)
Var(Z_1^{C0}) ≈ ((−T_{0j}^x + T_{0i}^x) f)^2 / (x̂_l − x̂_r)^4 Var(x_r) + ((T_{0j}^x − T_{0i}^x) f)^2 / (x̂_l − x̂_r)^4 Var(x_l)   (11)
where x̂_l, x̂_r, ŷ_l and ŷ_r are estimated using a trimmed mean estimator, with the top and bottom 25% of the samples being rejected before calculating the mean. The samples used to estimate x̂_l, x̂_r, ŷ_l and ŷ_r are the pixels in C_i, C_j corresponding to the neighborhood Δ_{p_0} in C_0. For example, to estimate x̂_r we use the stereo matching algorithm to find the pixels in C_i that correspond to the pixels Δ_{p_0} in C_0, and then apply the trimmed mean estimator to get x̂_r.
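As an illustration, a rough sketch of the trimmed-mean step and the first-order variance propagation of Eqs. (9)-(11) might look as follows (the helper names are ours; the 25% trimming fraction is from the text above, and the default pixel variance of 1/2 follows our reading of the noise assumption):

```python
import numpy as np

def trimmed_mean(samples, trim=0.25):
    """Mean after discarding the top and bottom `trim` fraction of samples."""
    s = np.sort(np.asarray(samples, dtype=float))
    k = int(len(s) * trim)
    return s[k:len(s) - k].mean() if len(s) > 2 * k else s.mean()

def structure_variances(xl_samples, xr_samples, yr_samples,
                        T0j, T0i, f, var_pix=0.5):
    """First-order variances of (X1, Y1, Z1) per Eqs. (9)-(11),
    for the case where the baseline is dominant along x."""
    xl_h = trimmed_mean(xl_samples)
    xr_h = trimmed_mean(xr_samples)
    yr_h = trimmed_mean(yr_samples)
    bx = -T0j[0] + T0i[0]                  # baseline along x
    d2 = (xl_h - xr_h) ** 2
    d4 = d2 ** 2
    var_X = ((bx * xl_h) ** 2 / d4) * var_pix + ((bx * xr_h) ** 2 / d4) * var_pix
    var_Y = (bx ** 2 / d2) * var_pix \
          + ((bx * yr_h) ** 2 / d4) * var_pix + ((bx * yr_h) ** 2 / d4) * var_pix
    var_Z = ((bx * f) ** 2 / d4) * var_pix + ((bx * f) ** 2 / d4) * var_pix
    return var_X, var_Y, var_Z
```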
Fig. 2. The covariance matrix encoding the uncertainties:
Cov(V(p_0, C_j, C_i)) = (1 − a) D + a M,
where D = diag(Var(X_1^{C0}), Var(Y_1^{C0}), Var(Z_1^{C0}), Var(X_2^{C0} − X_1^{C0}), Var(Y_2^{C0} − Y_1^{C0}), Var(Z_2^{C0} − Z_1^{C0})) and M has the same diagonal as D, with the additional off-diagonal entries −Var(X_1^{C0}), −Var(Y_1^{C0}), −Var(Z_1^{C0}) coupling each structure component with the corresponding motion component (entries (1,4), (2,5), (3,6) and their transposes).
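A small sketch of assembling this 6×6 covariance, under our reading of Fig. 2 (the blending weight a and the exact block structure are as reconstructed above, so treat the precise form as an assumption):

```python
import numpy as np

def covariance_of_V(var_X1, var_Y1, var_Z1, var_dX, var_dY, var_dZ, a=0.5):
    """6x6 covariance for V(p0, Cj, Ci) = (X1, u) as a blend (1-a)*D + a*M."""
    diag = np.array([var_X1, var_Y1, var_Z1, var_dX, var_dY, var_dZ], dtype=float)
    D = np.diag(diag)
    M = np.diag(diag)
    for k, v in enumerate((var_X1, var_Y1, var_Z1)):
        M[k, k + 3] = M[k + 3, k] = -v   # structure/motion coupling terms
    return (1.0 - a) * D + a * M
```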
In order to obtain a covariance matrix for V(p_0, C_j, C_i), we also need to obtain an estimate of the variance of the elements of u^{C0}. We know that for a physical point P^{Cj} that is moving with velocity S^{Cj} with respect to camera C_j and its coordinate frame, we can decompose the velocity as S^{Cj} = −T^{Cj} − Ω^{Cj} × P^{Cj}, where T^{Cj} = (T_x^{Cj}, T_y^{Cj}, T_z^{Cj})^T and Ω^{Cj} = (Ω_x^{Cj}, Ω_y^{Cj}, Ω_z^{Cj})^T denote the translational and angular velocity vectors of camera C_j that would cause the same apparent motion of the particle P^{Cj} with respect to camera C_j's coordinate frame. Then, the image velocity of the projection (x_l, y_l) of P^{Cj} in camera C_j is given by

(v_x^{Cj}, v_y^{Cj})^T = B^{Cj} Ω^{Cj} + d^{Cj} A^{Cj} T^{Cj}   (12)

where d^{Cj} is the inverse of the scene depth with respect to camera C_j's coordinate system – it is estimated using Eqs. (3)-(8) and the current camera pair – and

B^{Cj} = [ x_l y_l / f    −(f + x_l^2 / f)    y_l ;
           f + y_l^2 / f   −x_l y_l / f       −x_l ],
A^{Cj} = [ −f   0   x_l ;
            0  −f   y_l ].   (13)
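As a worked illustration of Eq. (12), the sketch below (ours, not the paper's code) builds B and A for a set of pixels in one camera and recovers a least-squares estimate of (Ω, T) from observed flow vectors; this is essentially the pseudo-inverse step that the next paragraph applies to random subsets of displacement vectors:

```python
import numpy as np

def flow_matrices(x, y, f):
    """B (rotational) and A (translational) flow matrices of Eq. (13) for pixel (x, y)."""
    B = np.array([[x * y / f, -(f + x * x / f),  y],
                  [f + y * y / f, -x * y / f,   -x]])
    A = np.array([[-f, 0.0, x],
                  [0.0, -f, y]])
    return B, A

def estimate_motion(pixels, flows, inv_depths, f):
    """Least-squares (Omega, T): stack v = B*Omega + d*A*T over all pixels."""
    rows, rhs = [], []
    for (x, y), v, d in zip(pixels, flows, inv_depths):
        B, A = flow_matrices(x, y, f)
        rows.append(np.hstack([B, d * A]))   # unknowns ordered as (Omega, T)
        rhs.append(np.asarray(v, dtype=float))
    J = np.vstack(rows)
    b = np.concatenate(rhs)
    sol, *_ = np.linalg.lstsq(J, b, rcond=None)
    return sol[:3], sol[3:]                  # Omega_hat, T_hat
```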
Similar conditions hold for camera C_i. We use Eq. (12) to model the noise sensitivity of X_2^{C0}, as we did for X_1^{C0} in Eqs. (9)-(11). This allows us to weigh the suitability of each camera for tracking a particular object. In the case of multibody structure and motion and due to the reasons mentioned in the introduction, it is quite feasible to end up with degenerate situations of objects whose motion estimation is ill-conditioned from a particular viewpoint. If we model T^{Cj}, Ω^{Cj} as being corrupted by Gaussian noise, we can view X_2^{C0} as a function of n_{x_r}, n_{y_r}, n_{x_l}, n_{y_l}, n_{T^{Cj}}, n_{Ω^{Cj}}, n_{T^{Ci}}, n_{Ω^{Ci}}, where n_{T^{Cj}}, n_{Ω^{Cj}}, n_{T^{Ci}}, n_{Ω^{Ci}} denote zero mean Gaussian noise vectors. For each camera pair (C_j, C_i) and for each corresponding pixel pair p_j, p_i in the two cameras, we obtain approximations for T^{Cj}, Ω^{Cj}, T^{Ci}, Ω^{Ci} – denoted T̂^{Cj}, Ω̂^{Cj}, T̂^{Ci}, Ω̂^{Ci} – and the variances of n_{T^{Cj}}, n_{Ω^{Cj}}, n_{T^{Ci}}, n_{Ω^{Ci}} as follows: For each local image region centered at p_α, α ∈ {i, j}, or for each image region containing p_α and undergoing independent rigid motion – estimated using any popular motion segmentation algorithm – we estimate T̂^{Cα}, Ω̂^{Cα}, the approximation of the camera's translational and rotational velocity that would lead to the motion flow observed in that particular image region using camera pair (C_j, C_i). For each such image region, we use a least squares pseudo-inverse based approach on a random subset of the estimated displacement vectors to approximate the translational and rotational velocity. We repeat this approach a number of times; the mean of the results is used as T̂^{Cα}, Ω̂^{Cα}, and their variance provides an estimate for the variance used in the noise model described above. If we take the partial derivatives of X_2^{C0}, Y_2^{C0}, Z_2^{C0} with respect to the above mentioned random variables and expand around x̂_l, x̂_r, ŷ_l, ŷ_r, T̂^{Cj}, Ω̂^{Cj}, T̂^{Ci}, Ω̂^{Ci}, we obtain the desired expressions for Var(X_2^{C0}), Var(Y_2^{C0}) and Var(Z_2^{C0}). In the appendix we list the derived expressions for these variances. The above mentioned variances are referred to
as the “top-down” information. Note that in our experiments, when modeling Var(W_2^{C0} − W_1^{C0}), we make the assumption of independence between W_1^{C0} and W_2^{C0} for all W ∈ {X, Y, Z}. Notice that Var(X_2^{C0}), Var(Y_2^{C0}) and Var(Z_2^{C0}) are calculated using the derivatives of velocities v_x^{Cj}, v_y^{Cj} and might have very different magnitudes from Var(X_1^{C0}), Var(Y_1^{C0}), Var(Z_1^{C0}). To guarantee the numerical stability of the covariance matrices, we perform two simple modifications to the top-down variances. We first set an upper bound maxvar on each of the variances by setting Var(W_1^{C0}) ← min(Var(W_1^{C0}), maxvar), Var(W_2^{C0} − W_1^{C0}) ← min(Var(W_2^{C0} − W_1^{C0}), maxvar). Secondly, for each W ∈ {X, Y, Z} and each pixel in C_0, we scale the variances Var(W_2^{C0}) acquired across all camera pairs by c · min_{all pairs}(Var(W_1^{C0})) / min_{all pairs}(Var(W_2^{C0})) for some constant c (we set c = 2 in our experiments). For each pair (C_j, C_i), we also estimate the sample variances Var(X_1^{C0}), Var(Y_1^{C0}), Var(Z_1^{C0}), Var(X_2^{C0} − X_1^{C0}), Var(Y_2^{C0} − Y_1^{C0}) and Var(Z_2^{C0} − Z_1^{C0}) by using the samples in ∪_{p∈Δ_{p_0}} V(p, C_j, C_i) and using the mean of the vectors in ∪_{p∈Δ_{p_0}} ∪_{j=0}^{N} ∪_{i=1,i>j}^{N} V(p, C_j, C_i) as the sample mean. We refer to these sample variances as the “bottom-up” information. We define the final covariance matrix corresponding to each vector V(p_0, C_j, C_i) as a linear combination of their corresponding top-down and bottom-up variances. For each point p_0 in camera C_0 and by using the two cameras C_j, C_i for depth estimation, we use the set ∪_{p∈Δ_{p_0}} V(p, C_j, C_i), in conjunction with the variances defined above, to model the covariance matrix of V(p_0, C_j, C_i) as given by Fig. 2, where 0 ≤ a ≤ 1. Assume we have n vectors V_{i(1),j(1)}, ..., V_{i(n),j(n)}, where for each k ∈ {1, ..., n}, V_{i(k),j(k)} is the average of all the vectors in ∪_{p∈Δ_{p_0}} V(p, C_{j(k)}, C_{i(k)}). Also, with each of the vectors V_{i(k),j(k)} we associate a covariance matrix N_k indicating our confidence in this measure, as described in this section. If we ignore any potential cross-correlation between the n vectors, the Best Linear Unbiased Estimator (BLUE) [9] is the vector X that minimizes the sum of the Mahalanobis distances Σ_{k=1}^{n} D(X, V_{i(k),j(k)}, N_k). It can be shown that X̂^T = (V_{i(1),j(1)}^T N_1^{−1} + ... + V_{i(n),j(n)}^T N_n^{−1})(N_1^{−1} + ... + N_n^{−1})^{−1}. In the next section we extend our approach to camera rigs with arbitrary intrinsic and extrinsic parameters.
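The BLUE combination above is an inverse-covariance weighted average; a minimal sketch (ours, assuming symmetric, invertible covariances) is:

```python
import numpy as np

def blue_fuse(vectors, covariances):
    """Best Linear Unbiased Estimate of a 6D state from per-camera-pair
    measurements V_k with covariances N_k (cross-correlations ignored)."""
    info_sum = np.zeros((6, 6))
    weighted_sum = np.zeros(6)
    for V, N in zip(vectors, covariances):
        N_inv = np.linalg.inv(N)
        info_sum += N_inv
        weighted_sum += N_inv @ np.asarray(V, dtype=float)
    return np.linalg.solve(info_sum, weighted_sum)
```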
5 Arbitrary Camera Rig Setup
Let us suppose that for a camera pair (C_j, C_i) with intrinsic camera parameters (K_j, K_i) and for pixels p_j = (x_l, y_l)^T, p_i = (x_r, y_r)^T imaging a common scene point P = (X_1^{C0}, Y_1^{C0}, Z_1^{C0})^T, the following equations hold:

x_l = [K_j^{1,1} R_{j,0}^{1,1} (X_1^{C0} − T_{0,j}^x) + K_j^{1,2} R_{j,0}^{1,2} (Y_1^{C0} − T_{0,j}^y) + K_j^{1,3} R_{j,0}^{1,3} (Z_1^{C0} − T_{0,j}^z)] / [K_j^{3,3} R_{j,0}^{3,3} (Z_1^{C0} − T_{0,j}^z)]
y_l = [K_j^{2,1} R_{j,0}^{2,1} (X_1^{C0} − T_{0,j}^x) + K_j^{2,2} R_{j,0}^{2,2} (Y_1^{C0} − T_{0,j}^y) + K_j^{2,3} R_{j,0}^{2,3} (Z_1^{C0} − T_{0,j}^z)] / [K_j^{3,3} R_{j,0}^{3,3} (Z_1^{C0} − T_{0,j}^z)]   (14)

x_r = [K_i^{1,1} R_{i,0}^{1,1} (X_1^{C0} − T_{0,i}^x) + K_i^{1,2} R_{i,0}^{1,2} (Y_1^{C0} − T_{0,i}^y) + K_i^{1,3} R_{i,0}^{1,3} (Z_1^{C0} − T_{0,i}^z)] / [K_i^{3,3} R_{i,0}^{3,3} (Z_1^{C0} − T_{0,i}^z)]
y_r = [K_i^{2,1} R_{i,0}^{2,1} (X_1^{C0} − T_{0,i}^x) + K_i^{2,2} R_{i,0}^{2,2} (Y_1^{C0} − T_{0,i}^y) + K_i^{2,3} R_{i,0}^{2,3} (Z_1^{C0} − T_{0,i}^z)] / [K_i^{3,3} R_{i,0}^{3,3} (Z_1^{C0} − T_{0,i}^z)]   (15)
where K_j^{m,n} / R_{j,0}^{m,n} denote the (m, n)-th entry of K_j / R_{j,0}. As we did in Eqs. (3)-(8), if |−T_{0j}^x + T_{0i}^x| ≥ |−T_{0j}^y + T_{0i}^y|, we can express X_1^{C0} in terms of x_l, x_r, y_r. Conversely, if |−T_{0j}^x + T_{0i}^x| < |−T_{0j}^y + T_{0i}^y|, we can express X_1^{C0} in terms of y_l, y_r, x_r. Thus, we can define a function g(p_j, p_i) = X_1^{C0} with respect to camera C_0's coordinate system. By using g(·) and the approach described in Section 4, we can obtain the desired variance approximations. We also need to redefine Eqs. (12)-(13) in order to obtain variance estimates for the motion error. We will only deal with the case of camera C_j, as the case of camera C_i is similar. As indicated in Section 4, S^{Cj} = −T^{Cj} − Ω^{Cj} × P^{Cj}. Then:

(v_x^{Cj}, v_y^{Cj})^T = ( d/dt [K_j^{1,1} X^{Cj}/Z^{Cj} + K_j^{1,2} Y^{Cj}/Z^{Cj} + K_j^{1,3}],  d/dt [K_j^{2,2} Y^{Cj}/Z^{Cj} + K_j^{2,3}] )^T   (16)

assuming K_j^{2,1} = 0, K_j^{3,3} = 1. The derivatives are taken with respect to time t, and by using the expression for S^{Cj} we can express Eq. (16) in terms of T^{Cj} and Ω^{Cj}. Then the variance derivation proceeds as described in Section 4. The derivatives can be determined analytically, or via common numerical methods such as finite differences.
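Since the derivatives are left to either analytic or numerical evaluation, one hedged way to implement the numerical route is a central-difference Jacobian followed by first-order propagation; the sketch below is our own and assumes independent inputs with a diagonal noise model:

```python
import numpy as np

def numerical_jacobian(g, x0, eps=1e-5):
    """Central-difference Jacobian of a vector-valued function g at x0."""
    x0 = np.asarray(x0, dtype=float)
    f0 = np.asarray(g(x0), dtype=float)
    J = np.zeros((f0.size, x0.size))
    for k in range(x0.size):
        dx = np.zeros_like(x0)
        dx[k] = eps
        J[:, k] = (np.asarray(g(x0 + dx)) - np.asarray(g(x0 - dx))) / (2 * eps)
    return J

def propagate_variance(g, x0, input_variances):
    """First-order output covariance J diag(var) J^T for independent inputs."""
    J = numerical_jacobian(g, x0)
    return J @ np.diag(input_variances) @ J.T
```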
6 Experiments
We present our camera setup and results in Figs. 1, 3, 4. We test our approach on a number of synthetic and non-synthetic datasets. Synthetic dataset (i) consists of a 30cm × 30cm planar surface on a black background (Fig. 1(c),(d)) centered at camera C_0, moving in depth, along the optical axis, by 15cm per frame. Synthetic dataset (ii) consists of the planar surface, rotated by 4 degrees around the optical axis between each frame. Synthetic dataset (iii) consists of a (2cm, 2cm) translation of the planar surface, parallel to the image plane, between each frame. The camera setup is similar to that of Fig. 1(a). All cameras C_i, i > 0 are radially distributed around camera C_0 at a radius of 12cm, have a focal length of 4mm, and have corrupting Gaussian noise added to their images. We fuse all (C_0, C_i) camera pairs and demonstrate the performance of the algorithm with an increasing object range from 300cm to 800cm by setting a = 0.5 in Fig. 2. The stereo correspondence and optical flow algorithm used is described in [13] and is available online from its authors.¹ Our results are illustrated in Fig. 3(a)-(f). We also test our algorithm using a five camera rig, as shown in Fig. 1(b). The corresponding results are presented in Fig. 4(a)-(h). In the synthetic data set we used the entire planar surface to estimate each T̂^{Cα}, Ω̂^{Cα}
¹ http://www.cs.umd.edu/users/ogale/download/code.html
[Fig. 3, panels (a)-(f): plots of the RMS error of 3D reconstruction and of 3D motion (in cm) versus the distance of the object from the camera (300-800 cm), for the object shift [0;0;15], the 4-degree rotation around the optical axis, and the object shift [2;2;0], each at SNR 4 dB with no calibration error. See the caption below.]
Fig. 3. (a)-(f): The results of our tests on the synthetic dataset. The x-axes represent the depth of the object in cm, and the y-axes represent the RMS error of the stereo reconstructed coordinates and the 3D motion vector (in cm). The RMS error for a particular camera pair is calculated by estimating the error across all pixels in the base camera C_0 that fall within the textured region. The solid/dashed lines correspond to the errors of our information fusion based approach using the BLUE/mean estimator, and the boxplots represent the distribution of the errors across each of the camera pairs used. The red crosses represent outliers. Note that in some figures the outliers are not displayed as they fall outside the vertical range of our error axes. (a),(b) correspond to the stereo reconstruction and 3D motion error respectively, when the planar object was translated by 15cm in depth along the optical axis. (c),(d) correspond to the stereo reconstruction and 3D motion error respectively, when the planar object was rotated by 4 degrees around the optical axis between frames. (e),(f) correspond to the stereo reconstruction and 3D motion error respectively when the translation occurred parallel to the image plane. The object was translated by 2cm along the x and y axes of the world coordinate system. Notice how, even though gross outliers exist in most of the figures, the effect of those outliers on the estimated scene structure and motion is minimal in general. We also performed a number of experiments with modest errors in the external parameters' calibration and similar observations were made. The mean RMS error of the stereo reconstruction using the information fusion/mean approach for all instances of the reconstructed planar surface is 2.05 ± 1.71/3.71 ± 3.87 cm respectively. The respective values for the motion data are 2.35 ± 1.43/2.56 ± 1.41 cm. In both cases the improvement compared to the mean approach is statistically significant using a paired-samples t-test (p ≈ 0.01).
Fig. 4. (a)-(h):Experimental results from an image sequence showing a robotic wheelchair that is equipped with a 6-d.o.f. robotic arm. The robotic arm is moving diagonally towards the top left image corner. (a)-(b): Adjacent frames from the respective sequence (before correcting for radial/tangential distortions). (c): The reconstructed scene depth using a single pair of cameras to reconstruct each scene. Image regions in black denote pixels where the left-right consistency constraint could not be enforced. (d): The reconstructed scene depth of frames (a),(b) using the five camera rig setup shown in Fig. 1 in conjunction with our information fusion based algorithm. The colorbar depths of (c),(d) represent mm. Notice the significant decrease in occlusions. (e)-(f): Image motion of the sequence after projecting the estimated 3D motion on the image plane using a single camera pair in conjunction with our information fusion based algorithm. Image motion is represented in pixel units. (g)-(h): Image motion of the respective image sequences after using the five camera rig setup shown in Fig. 1 in conjunction with our information fusion based algorithm. (e),(g): The image motion component parallel to the horizontal axis and (f),(h): The image motion component parallel to the vertical axis.
(simulating perfect motion segmentation) and in the non-synthetic data set we used 21 × 21 pixel regions centered at the current pixel of interest. From Fig. 3, we observe that the multi-camera approach provides a significant decrease of the RMS error in both structure and motion estimation compared to the errors achieved using the stereo camera pairs. In almost all cases the quality of the results surpasses that obtained by any one of the camera pairs. As indicated in the caption of Fig. 3 the BLUE estimator provides better results than the results obtained by the mean vector across all cameras and their neighborhoods. For both the structure and motion data the improvement is judged statistically significant. In Fig. 3(a),(b) where the plane is moving along the z-axis and we are dealing with ill-conditioned motion estimation near the focus of expansion, we observe significant improvements. In Fig. 3(c)-(d) we present results after a pure rotation of the plane around the optical axis. It is interesting
to notice, however, that for depths 700cm, 800cm the optical flow estimation algorithm we used performs poorly on about half of our camera pairs, thus resulting in a large RMS error, as the boxplots show. Our algorithm is capable of ignoring the erroneous data and gives us a relatively robust estimate of the 3D motion. This indicates that if we are using a multi-camera rig with cameras that break down quite often and provide gross outliers, our algorithm remains reliable. We observe that the mean estimator is severely affected by outliers at various depths, while the information fusion based algorithm is more robust in the presence of outliers. In Fig. 4(a)-(h) we compare the performance of our algorithm using a two camera rig versus a five camera rig (Fig. 1(b)). The five camera rig consists of two Point Grey Research Bumblebee stereo cameras and a Point Grey Research Flea camera. The coordinate system of the Flea camera is used as our basis coordinate system and represents camera C_0. The two camera rig is represented using the Flea camera and one of the four Bumblebee monocular cameras. The robotic wheelchair presented in Fig. 4 is equipped with a 6-d.o.f. robotic arm providing a number of independent rigid motions to test our algorithm. We used the algorithm described in [13] to determine the correspondences. We notice a dramatic increase in the number of pixels satisfying the left-right consistency constraint as the number of cameras in our rig increases.
7 Conclusion
We presented an algorithm for multi-camera and multi-body structure and motion. The algorithm combines top-down and bottom-up knowledge on scene structure and motion to model the respective uncertainties. An information fusion based algorithm uses these uncertainties to obtain competitive results demonstrating that our algorithm performs robustly in situations where a number of camera pairs provide severely degraded results. Such situations arise in practice due to hardware failures and poor environmental conditions. We are currently investigating the use of other information fusion algorithms for solving this problem [10]. Some potential application areas in future research are dynamic scene interpretation, vision based simultaneous localization and mapping (SLAM) and dynamic rendering. Acknowledgments. JKT holds the Canada Research Chair in Computational Vision and gratefully acknowledges its financial support. AA would also like to thank the Natural Sciences and Engineering Research Council of Canada (NSERC) for its financial support through the PGS-D scholarship program.
References 1. Schindler, K., Suter, D.: Two-view multibody structure and motion. In: Proc. Conf. Computer Vision and Pattern Recognition (2005) 2. Zhang, W., Kosecka, J.: Nonparametric estimation of multiple structures with outliers. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, Springer, Heidelberg (2006)
3. Hanna, K.J., Okamoto, N.E.: Combining stereo and motion analysis for direct estimation of scene structure. In: Proc. Int. Conf. on Computer Vision (1993) 4. Richards, W.: Structure from stereo and motion. Journal of the Optical Society of America A. 2(2), 343–349 (1985) 5. Waxman, A., Duncan, J.: Binocular image flows. IEEE Trans. Patt. Anal. Mach. Intell. 8(6), 715–729 (1986) 6. Grosso, E., Tistarelli, M.: Active dynamic stereo vision. IEEE Trans. Patt. Anal. Mach. Intell. 17(11), 1117–1128 (1995) 7. Mandelbaum, R., Salgian, G., Sawhney, H.: Correlation-based estimation of egomotion and structure from motion and stereo. In: Proc. Int. Conf. on Computer Vision (1999) 8. Zhang, Y., Kambhamettu, C.: Integrated 3D scene flow and structure recovery from multiview image sequences. In: IEEE Conf. Computer Vision and Pattern Recognition, IEEE Computer Society Press, Los Alamitos (2000) 9. Singh, A., Allen, P.: Image-flow computation: An estimation-theoretic framework and a unified perspective. CVGIP: Image Understanding 56(2), 152–177 (1992) 10. Comaniciu, D.: Nonparametric information fusion for motion estimation. In: IEEE Conf. Computer Vision and Pattern Recognition, IEEE Computer Society Press, Los Alamitos (2003) 11. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. IEEE Trans. Patt. Anal. Mach. Intell. 24, 603–619 (2002) 12. Neumann, J., Fermuller, C., Aloimonos, Y.: A hierarchy of cameras for 3D photography. Computer Vision and Image Understanding 96, 274–293 (2004) 13. Ogale, A.S., Aloimonos, Y.: A roadmap to the integration of early visual modules. International Journal of Computer Vision: Special Issue on Early Cognitive Vision 72(1), 9–25 (2007)
Appendix

In this section we derive expressions for Var(X_2^{C0}), Var(Y_2^{C0}), Var(Z_2^{C0}) for the case |−T_{0j}^x + T_{0i}^x| ≥ |−T_{0j}^y + T_{0i}^y|. From Eqs. (9)-(11) we can derive the corresponding expressions:

Var(X_2^{C0}) ≈ ((−T_{0j}^x + T_{0i}^x) x̂_l')^2 / (x̂_l' − x̂_r')^4 Var(x_r') + ((T_{0j}^x − T_{0i}^x) x̂_r')^2 / (x̂_l' − x̂_r')^4 Var(x_l'),
Var(Y_2^{C0}) ≈ (−T_{0j}^x + T_{0i}^x)^2 / (x̂_l' − x̂_r')^2 Var(y_r') + ((−T_{0j}^x + T_{0i}^x) ŷ_r')^2 / (x̂_l' − x̂_r')^4 Var(x_r') + ((T_{0j}^x − T_{0i}^x) ŷ_r')^2 / (x̂_l' − x̂_r')^4 Var(x_l'),
Var(Z_2^{C0}) ≈ ((−T_{0j}^x + T_{0i}^x) f)^2 / (x̂_l' − x̂_r')^4 Var(x_r') + ((T_{0j}^x − T_{0i}^x) f)^2 / (x̂_l' − x̂_r')^4 Var(x_l').

We have previously noted that (x_r', y_r') = (x_r, y_r) + (v_x^{Ci}, v_y^{Ci}), (x_l', y_l') = (x_l, y_l) + (v_x^{Cj}, v_y^{Cj}). If we let a ∈ {x, y}, b ∈ {r, l} and k = i/j if b = r/l, we obtain the following approximations for Var(x_r'), Var(x_l'), Var(y_r'):

Var(a_b') ≈ (∂a_b'/∂x_r)^2 Var(x_r) + (∂a_b'/∂x_l)^2 Var(x_l) + (∂a_b'/∂y_r)^2 Var(y_r) + (∂a_b'/∂Ω_x^{Ck})^2 Var(Ω_x^{Ck}) + (∂a_b'/∂Ω_y^{Ck})^2 Var(Ω_y^{Ck}) + (∂a_b'/∂Ω_z^{Ck})^2 Var(Ω_z^{Ck}) + (∂a_b'/∂T_a^{Ck})^2 Var(T_a^{Ck}) + (∂a_b'/∂T_z^{Ck})^2 Var(T_z^{Ck}).

By using Eqs. (12)-(13) we can derive the expressions for the partial derivatives. By expanding these expressions for the partial derivatives around x̂_l, x̂_r, ŷ_l, ŷ_r, T̂^{Ci}, Ω̂^{Ci}, T̂^{Cj}, Ω̂^{Cj}, as appropriate, we can obtain the desired expressions.
Task Scheduling in Large Camera Networks

Ser-Nam Lim1, Larry Davis2, and Anurag Mittal3

1 Cognex Corp., Natick, MA, USA
2 CS Dept., University of Maryland, College Park, Maryland, USA
3 CSE Dept., IIT, Madras, India
Abstract. Camera networks are increasingly being deployed for security. In most of these camera networks, video sequences are captured, transmitted and archived continuously from all cameras, creating enormous stress on available transmission bandwidth, storage space and computing facilities. We describe an intelligent control system for scheduling Pan-Tilt-Zoom cameras to capture video only when task-specific requirements can be satisfied. These videos are collected in real time during predicted temporal “windows of opportunity”. We present a scalable algorithm that constructs schedules in which multiple tasks can possibly be satisfied simultaneously by a given camera. We describe two scheduling algorithms: a greedy algorithm and another based on Dynamic Programming (DP). We analyze their approximation factors and present simulations that show that the DP method is advantageous for large camera networks in terms of task coverage. Results from a prototype real time active camera system however reveal that the greedy algorithm performs faster than the DP algorithm, making it more suitable for a real time system. The prototype system, built using existing low-level vision algorithms, also illustrates the applicability of our algorithms.
1 Introduction

Large scale camera network systems are being increasingly deployed for purposes that include security, traffic monitoring, etc. These systems typically consist of a large number of cameras, which can either be active (specifically, Pan-Tilt-Zoom or PTZ cameras) or static, transmitting in real time video streams to processing and/or storage systems. Our interest is in controlling these cameras to acquire video segments that satisfy task-specific constraints. For example, one may wish to acquire at least a few images of each person who enters a given region, capture video segments lasting k seconds and containing well-magnified facial images for facial recognition, or capture k second long video segments of the side view of a person for gait modeling and recognition. By intelligently transmitting and storing only video segments satisfying task requirements, we can reduce the bandwidth requirements and storage space significantly and increase the efficiency and effectiveness with which the collected video segments can be processed. The control of the cameras to collect these video segments is a challenging problem. The system must detect and track moving objects both within and between cameras in a sensing stage, a problem which is not fully solved yet. Papers such as [1,2,3] deal
This research was funded in part by the U.S. Government VACE program.
with tracking under occlusions, and other papers such as [4,5] describe algorithms for tracking across non-overlapping views. A second challenge is to predict, given a set of tracked targets, the time intervals during which video segments meeting the requirements of the tasks can be collected from available cameras. These requirements include ensuring (1) that the associated object is unobstructed by other objects, (2) that it is moving in a direction suitable for the task, (3) that it can be captured in a field of view of the camera (PTZ) assigned to collect its video segments, and (4) that the collected video segments must satisfy task-specific minimum resolution and duration. For example, if the task is to collect facial images, then we must ensure that the video segments are collected only during time intervals when the person is predicted to be walking towards the camera and unobstructed by other moving objects. This can be done using the observed tracks of the person and other moving objects, predicting their trajectories into the future, and then identifying periods of crossings between the predicted trajectories with respect to each of the available cameras. The complements of these periods of crossings are visibility time intervals during which the person is unobstructed, and camera settings can be determined within these temporal visibility predictions to capture the person in a well-magnified frontal image or video sequence satisfying the four requirements above. This problem has been addressed in earlier work. [6] described the construction of so-called “Task Visibility Intervals” (TVIs) and “Multiple Task Visibility Intervals” (MTVIs), that represent time-varying camera setting ranges that can be used to collect video segments satisfying one (TVI) or multiple tasks simultaneously (MTVIs). A TVI is a 6-tuple:

(c, (T, o), [r, d], Valid_{ψ,φ,f}(t)),   (1)

where c represents a PTZ camera, (T, o) is a (task, object) pair – T is the index of a task to be accomplished and o is the index of the object to which the task is to be applied – and [r, d] is a future time interval during which task requirements can be satisfied using camera c. Then, for each time instance t ∈ [r, d], Valid_{ψ,φ,f}(t) is the range of valid combinations of the pan angle (ψ), tilt angle (φ) and focal length (f) settings that camera c can employ to capture object o at time t. The tasks themselves are 3-tuples:

(p, α, β),   (2)

where p is the required duration of the task, α is the orientation of the object relative to the optical axis of the camera used to accomplish the task, and β is the minimum image resolution needed to accomplish the task. [6] also described the composition of TVIs into MTVIs, time intervals during which collections of tasks can be satisfied simultaneously by one camera. A set of n TVIs, each represented in the form (c, (T_i, o_i), [r_i, d_i], Valid_{ψ_i,φ_i,f_i}(t)) for TVI i [Eqn. 1], can be combined into a valid MTVI, represented as:

(c, ∪_{i=1...n} (T_i, o_i), ∩_{i=1...n} [r_i, d_i], ∩_{i=1...n} Valid_{ψ_i,φ_i,f_i}(t)),   (3)
such that:

∩_{i=1...n} [r_i, d_i] ≠ ∅,   (4)

i.e., there is some common time interval during which they can be scheduled, and:

|∩_{i=1...n} [r_i, d_i]| ≥ p_max,

where p_max is the largest processing time among the tasks, and for all t ∈ ∩_{i=1...n} [r_i, d_i],

∩_{i=1...n} Valid_{ψ_i,φ_i,f_i}(t) ≠ ∅,   (5)
i.e., the tasks can be captured with common PTZ settings. Besides [6], other work that has focused on temporal analysis and planning for camera scheduling includes [7,8], which discuss a dynamic sensor planning system, called the MVP system. They were concerned with determining occlusion-free viewpoints of a target. This involves handling occlusions between the target and the different moving objects in a scene, each of which generates a swept volume in temporal space. Using a temporal interval search, they divide the temporal intervals into halves while searching for a viewpoint that is not occluded in time by these sweep volumes. This is then integrated with other constraints such as focus and field of view in [8]. The culmination of this work is found in [7], where the algorithms are applied to an active robot work cell. In this paper, we will address the problem of job scheduling given the TVIs and MTVIs generated as in [6]. In general, job scheduling problems are NP-hard, and approximation algorithms have to be employed. We first analyze the approximation factor of a greedy scheduling algorithm (as a function of the number of cameras), which reveals that its performance deteriorates significantly as the number of cameras increases. We then describe a Dynamic Programming (DP) approximation algorithm with an approximation factor that is much better than the greedy approach. The performance advantage of the DP algorithm is confirmed by simulations. Finally, we describe a prototype real time active camera system. A scheduler controls PTZ cameras in real time to capture video segments based on automatically constructed TVIs and MTVIs. While the prototype system includes only a small number of cameras due to limited resources, the results illustrate the applicability of the algorithms for large scale camera networks.
2 Single-Camera Scheduling

We first study the scheduling problem when only a single camera is used. This will be extended to the problem of multiple cameras in the next section. Also, we will limit our analysis to non-preemptive schedules in this paper. We introduce the following theorems that make the single-camera scheduling problem tractable:

Theorem 1. Let the slack for the i-th task be δ_i = [t_{δ_i}^−, t_{δ_i}^+], and define δ_max = max(|δ_i|) and p_min as the smallest processing time among all (M)TVIs for some camera. Then, if |δ_max| < p_min, any feasible schedule for the camera is ordered by the slacks' start times.
Proof. Consider that the slack δ_1 = [t_{δ_1}^−, t_{δ_1}^+] precedes δ_2 = [t_{δ_2}^−, t_{δ_2}^+] in a schedule and t_{δ_1}^− > t_{δ_2}^−. Let the processing time corresponding to δ_1 be p_1. Then t_{δ_1}^− + p_1 > t_{δ_2}^− + p_1. We know that if t_{δ_1}^− + p_1 > t_{δ_2}^+, then the schedule is infeasible. This happens if t_{δ_2}^+ ≤ t_{δ_2}^− + p_1, i.e., t_{δ_2}^+ − t_{δ_2}^− ≤ p_1. Given that |δ_max| < p_min, t_{δ_2}^+ − t_{δ_2}^− ≤ p_1 is true.
Theorem 1 implies that if |δ_max| < p_min, we can limit our attention to feasible schedules that are ordered by the slacks' start times. This is a reasonably close assumption in many cases, since the time to move the cameras and capture an object is generally quite large compared to the slack times in crowded scenes, where such scheduling matters most. This assumption allows us to construct a Directed Acyclic Graph (DAG), where each (M)TVI is a node with an incoming edge from a common source node and an outgoing edge to a common sink node, with the weights of the outgoing edges initialized to zero. An outgoing edge from one (M)TVI node to another exists iff the slack's start time of the first node precedes that of the second (Theorem 1), which can however be removed if it makes the schedule infeasible. Consider the following theorem and corollary:

Theorem 2. A schedule – a sequence of n (M)TVIs each with slack δ_i = [t_{δ_i}^−, t_{δ_i}^+], where i = 1...n represents the order of execution – is feasible if t_{δ_n}^+ − t_{δ_1}^− ≥ (Σ_{i=1...n−1} p_i) − (Σ_{i=1...n−1} |δ_i|), p_i being the processing time of the i-th (M)TVI in the schedule.

Proof. For the schedule to be feasible the following must be true: t_{δ_1}^− + p_1 ≤ t_{δ_2}^+, t_{δ_2}^− + p_2 ≤ t_{δ_3}^+, ..., t_{δ_{n−1}}^− + p_{n−1} ≤ t_{δ_n}^+. Summing them up gives t_{δ_1}^− + t_{δ_2}^− + ... + t_{δ_{n−1}}^− + Σ_{i=1...n−1} p_i ≤ t_{δ_2}^+ + t_{δ_3}^+ + ... + t_{δ_n}^+, which can then be simplified as t_{δ_n}^+ − t_{δ_1}^− ≥ (Σ_{i=1...n−1} p_i) − (Σ_{i=1...n−1} |δ_i|). The condition t_{δ_1}^− + p_1 ≤ t_{δ_2}^+, t_{δ_2}^− + p_2 ≤ t_{δ_3}^+, ..., t_{δ_{n−1}}^− + p_{n−1} ≤ t_{δ_n}^+ is, however, only a sufficient condition for a feasible schedule.

Corollary 1. Define a new operator ≺, such that if δ_1 (= [t_{δ_1}^−, t_{δ_1}^+]) ≺ δ_2 (= [t_{δ_2}^−, t_{δ_2}^+]), then t_{δ_1}^− + p_1 ≤ t_{δ_2}^+. Consider a schedule of (M)TVIs with slacks δ_{1...n}. The condition δ_1 ≺ δ_2, δ_2 ≺ δ_3, ..., δ_{n−1} ≺ δ_n is necessary for the schedule to be feasible. Conversely, if a schedule is feasible, then δ_1 ≺ δ_2, δ_2 ≺ δ_3, ..., δ_{n−1} ≺ δ_n. Proof is omitted since it follows easily from Theorem 2.
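To make Theorem 2 and Corollary 1 operational, a small sketch of the pairwise feasibility test could look as follows (our own code; the function names are ours, and the operator symbol above is a stand-in for the one used in the paper):

```python
def precedes(slack1, p1, slack2):
    """Pairwise relation of Corollary 1: slack1 'precedes' slack2 when the
    start of slack1 plus its processing time fits before slack2 ends."""
    start1, _end1 = slack1
    _start2, end2 = slack2
    return start1 + p1 <= end2

def chain_is_feasible_candidate(slacks, proc_times):
    """Necessary condition for feasibility: every consecutive pair satisfies
    the relation above (slacks assumed sorted by start time, Theorem 1)."""
    return all(precedes(slacks[i], proc_times[i], slacks[i + 1])
               for i in range(len(slacks) - 1))
```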
Due to Corollary 1, an edge between two (M)TVI nodes can be removed if it violates the ≺ relationship, since it can never be part of a feasible schedule. Using such a DAG, a Dynamic Programming (DP) algorithm can be used to solve the single-camera scheduling problem. Consider the following set of (M)TVIs that have been constructed for a given camera, represented by the tasks (T_{1...6}) they satisfy and sorted in order of their slacks' start times: {node_1 = {T_1, T_2}, node_2 = {T_2, T_3}, node_3 = {T_3, T_4}, node_4 = {T_5, T_6}}, where the set of nodes in the DAG in Figure 1 is given as node_{i=1...4}. DP is run by first initializing paths of length 1 starting from each of the (M)TVI nodes to the sink, all with “merit” 0. At each subsequent path length, the next node node_next chosen for a given node node_curr in the current iteration is:

node_next = argmax_{n ∈ S_curr2next} |S_n ∪ Tasks(node_curr)|,   (6)
Fig. 1. Single-camera scheduling: DAG formed from the set {node_1 = {T_1, T_2}, node_2 = {T_2, T_3}, node_3 = {T_3, T_4}, node_4 = {T_5, T_6}}. The weights between (M)TVI nodes are determined on the fly during DP. Assume that, in this example, the ≺ relationship is satisfied for the edges between the (M)TVI nodes.
where S_curr2next is the set of nodes that have valid paths starting from them in the previous iteration and to which node_curr has an outgoing edge. S_n is defined as the set of tasks covered by the path (in the previous iteration) starting from n, and Tasks() gives the set of tasks covered by the (M)TVI associated with node_curr. So, for example, from node_1, paths of length 2 exist by moving on to either one of node_{2...4}, with the move to node_2, node_3 and node_4 covering {T_1, T_2, T_3} (merit=3), {T_1, T_2, T_3, T_4} (merit=4) and {T_1, T_2, T_5, T_6} (merit=4) respectively. We choose the path of length 2 from node_1 to node_3. Iterations are terminated when there is only one path left that starts at the source node or a path starting at the source node covers all the tasks. In our example, the optimal path becomes node_1 → node_3 → node_4, terminated at paths of length 4 from the sink when all the tasks are covered.
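A compact sketch of this procedure (ours; a simplified variant of the pass described above, with node merits measured as the number of distinct tasks covered and the DAG given as explicit successor lists) is:

```python
def single_camera_dp(nodes, tasks_of, edges):
    """DP over a DAG of (M)TVI nodes (ordered by slack start time).

    nodes:    list of node ids, ordered by slack start time
    tasks_of: dict node -> set of tasks covered by that (M)TVI
    edges:    dict node -> list of feasible successor nodes (Corollary 1)
    Returns (covered_tasks, path) for the best path found.
    """
    # best[n] = (tasks covered by the best path starting at n, the path itself)
    best = {n: (set(tasks_of[n]), [n]) for n in nodes}     # paths of length 1
    for n in reversed(nodes):                              # extend paths backwards
        candidates = [best[m] for m in edges.get(n, []) if m in best]
        if candidates:
            # rule of Eq. (6): maximize |S_n  union  Tasks(node_curr)|
            cov, tail = max(candidates,
                            key=lambda c: len(c[0] | set(tasks_of[n])))
            best[n] = (cov | set(tasks_of[n]), [n] + tail)
    # the source implicitly connects to every node; take the best overall path
    return max(best.values(), key=lambda c: len(c[0]))
```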
3 Multi-camera Scheduling

While single-camera scheduling using DP is optimal and has polynomial running time, the multi-camera scheduling problem is unfortunately NP-hard. Consequently, computationally feasible solutions can only be obtained with approximation algorithms. We consider both a simple greedy algorithm and a branch and bound-like algorithm.

3.1 Greedy Algorithm

The greedy algorithm iteratively picks the (M)TVI that covers the maximum number of uncovered tasks, subject to schedule feasibility as given by Theorem 2. Under such a greedy scheme, the following is true:

Theorem 3. Given k cameras, the approximation factor for multi-camera scheduling using the greedy algorithm is 2 + kλμ, where the definitions of λ and μ are given in the proof.
Proof. Let G = ∪_{i=1...k} G_i, where G_i is the set of (M)TVIs scheduled on camera i by the greedy algorithm, and let OPT = ∪_{i=1...k} OPT_i, where OPT_i is the set of (M)TVIs assigned to camera i in the optimal schedule. We further define (1) H_1 = ∪_{i=1...k} H_{1,i}, where H_{1,i} is the set of (M)TVIs for camera i that have been chosen by the optimal schedule but not the greedy algorithm, and each of these (M)TVIs contains tasks that are not covered by the greedy algorithm in any of the cameras, (2) H_2 = ∪_{i=1...k} H_{2,i}, where H_{2,i} is the set of (M)TVIs for camera i that have been chosen by the optimal schedule but not the greedy algorithm, and each of these (M)TVIs contains tasks that are also covered by the greedy algorithm, and finally (3) OG = OPT ∩ G. Clearly, OPT = H_1 ∪ H_2 ∪ OG. Then, for h_{j=1...n_i} ∈ H_{1,i}, where n_i is the number of (M)TVIs in H_{1,i}, ∃ g_{j=1...n_i} ∈ G_i such that h_j and g_j cannot be scheduled together based on the requirement given in Theorem 2, else h_j should have been included by G. If Tasks(h_j) ∩ Tasks(g_j) = ∅, then h_j contains only tasks that are not covered by G. In this case, |h_j| ≤ |g_j|, else G would have chosen h_j instead of g_j. Note that the cardinality is defined as the number of unique tasks covered. In the same manner, even if Tasks(h_j) ∩ Tasks(g_j) ≠ ∅, h_j could have replaced g_j unless |h_j| ≤ |g_j|. Consequently, |H_{1,i}| = |h_1 ∪ h_2 ∪ ... ∪ h_{n_i}| ≤ |h_1| + |h_2| + ... + |h_{n_i}| ≤ |g_1| + |g_2| + ... + |g_{n_i}|. Let β_j = |g_j| / |G_i| and λ_i = max(β_j · n_i). This gives |H_{1,i}| ≤ β_1 |G_i| + ... + β_{n_i} |G_i| ≤ λ_i |G_i|. Similarly, we know |H_1| ≤ λ_1 |G_1| + ... + λ_k |G_k| ≤ λ(|G_1| + ... + |G_k|), where λ = max(λ_i). Introducing a new term, γ_i = |G_i| / |G|, and letting μ = max(γ_i), we get |H_1| ≤ kλμ|G|. Since |H_2| ≤ |G| and |OG| ≤ |G|, |OPT| ≤ (2 + kλμ)|G|.

3.2 Branch and Bound Algorithm

The branch and bound approach runs DP in a similar manner as single-camera scheduling, but on a DAG that consists of multiple source-sink pairs (one pair per camera), with the node of one camera's sink node linked to another camera's source node. An example is shown in Figure 2. Then, for a source node s, we define its “upper bounding set” S_s as:

S_s = ∪_{c ∈ S_link} S_c,   (7)
where Slink is the set of cameras for which paths starting from the corresponding sink nodes to s exist in the DAG, and Sc is the set of all tasks that are covered by some (M)TVIs belonging to camera c. Intuitively, such an approach aims to overcome the “shortsightedness” of the greedy algorithm by “looking forward” in addition to backtracking and using the tasks that can be covered by other cameras to influence the (M)TVI nodes chosen for a particular camera. Admittedly, better performance is possibly achievable if “better” upper bounding sets are used, as opposed to blindly using all the tasks that other cameras can cover without taking scheduling feasibility into consideration. The algorithm can be illustrated with the example shown in Figure 2, which shows two cameras, c1 and c2 , and the following sets of (M)TVIs that have been constructed for them, again ordered by the slacks’ start times and shown here by the tasks (T1...4 )
Fig. 2. Multi-camera scheduling: DAG formed from the set {node1 = {T1 , T2 , T3 }, node2 = {T3 , T4 }} for the first camera, and the set {node3 = {T1 , T2 , T3 }} for the second camera
they satisfy. For c_1, the set is {node_1 = {T_1, T_2, T_3}, node_2 = {T_3, T_4}} and for c_2, {node_3 = {T_1, T_2, T_3}}. The DAG that is constructed has two source-sink pairs, one for each camera – (Source_1, Sink_1) belongs to c_1 and (Source_2, Sink_2) to c_2. The camera sinks are connected to a final sink node as shown, with the weights of the edges initialized to zero. Weights between nodes in the constructed DAG are similarly determined on the fly, as in the single-camera scheduling. Directed edges from Sink_2 to Source_1 connect c_1 to c_2. The DP algorithm is run in almost the same manner as single-camera scheduling, except that at paths of length 3 from the final sink node, the link from Source_1 to node_2 is chosen because the upper bounding set indicates that choosing the link potentially covers a larger number of tasks (i.e., the upper bounding set of Source_1, {T_1, T_2, T_3}, combines with the tasks covered by node_2 to form {T_1, T_2, T_3, T_4}). The branch and bound algorithm can be viewed as applying the single-camera DP algorithm, camera by camera in the order given in the corresponding DAG, with the schedule of one camera depending on its upper bounding set. This allows us to derive a potentially better approximation factor than the greedy algorithm, as follows:

Theorem 4. For k cameras, the approximation factor of the branch and bound algorithm is (1 + kμ(1 + u))^k / ((1 + kμ(1 + u))^k − (kμ(1 + u))^k). μ and u are defined as follows. Let G^* = ∪_{i=1...k} G_i^*, where G_i^* is the set of (M)TVIs assigned to camera i by the branch and bound algorithm. Then, μ = max(|G_i^*| / |G^*|) and u = max(u_i), where u_i is the ratio of the cardinality of the upper bounding set of camera i to |G_i^*|.

Proof. Let α be the approximation factor of the branch and bound algorithm. Then, assuming that schedules for G_1^*, ..., G_{i−1}^* have been determined, |G_i^*| ≥ (1/α)(|OPT| − Σ_{j=1}^{i−1} |G_j^*|). Adding Σ_{j=1}^{i−1} |G_j^*| to both sides gives:

Σ_{j=1}^{i} |G_j^*| ≥ (1/α)|OPT| + ((α − 1)/α) Σ_{j=1}^{i−1} |G_j^*|.

A proof by induction shows, after some manipulation:

(α^k / (α^k − (α − 1)^k)) Σ_{j=1}^{k} |G_j^*| ≥ |OPT|.
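For completeness, the manipulation can be sketched as a geometric-series unrolling (our own derivation, not spelled out in the paper); with s_i = Σ_{j=1}^{i} |G_j^*| and s_0 = 0:

```latex
% Unrolling  s_i \ge \frac{1}{\alpha}|OPT| + \frac{\alpha-1}{\alpha}\, s_{i-1}  with  s_0 = 0:
s_k \;\ge\; \frac{|OPT|}{\alpha}\sum_{m=0}^{k-1}\left(\frac{\alpha-1}{\alpha}\right)^{m}
      \;=\; |OPT|\left(1-\left(\frac{\alpha-1}{\alpha}\right)^{k}\right)
      \;=\; |OPT|\,\frac{\alpha^{k}-(\alpha-1)^{k}}{\alpha^{k}},
\qquad\text{so}\qquad
\frac{\alpha^{k}}{\alpha^{k}-(\alpha-1)^{k}}\; s_k \;\ge\; |OPT|.
```

Substituting α = 1 + kμ(1 + u), which the remainder of the proof establishes, then recovers the factor stated in Theorem 4.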
Fig. 3. (a) The approximation factor for the greedy algorithm using 10, 50 and 100 cameras respectively. λ and μ here are as defined in Theorem 3. (b) The same plots for the branch and bound algorithm. Here, the approximation factor depends only on the distribution parameters and not on the number of cameras. u and μ are as defined in Theorem 4.
Fig. 4. The DP algorithm consistently covers more tasks than the greedy algorithm (percentage of tasks covered, plotted against the number of tasks and the number of cameras).
Let H = ∪_{i=1...k} H_i, H_i being the set of (M)TVIs chosen by the optimal schedule on camera i but not by the branch and bound algorithm. The condition |H_i| ≤ |G_i^*| + u_i |G_i^*| is true; otherwise, H_i would have been added to G^* instead. Consequently, |H| ≤ (|G_1^*| + ... + |G_k^*|) + (u_1 |G_1^*| + ... + u_k |G_k^*|) ≤ kμ|G^*| + kuμ|G^*| ≤ kμ(1 + u)|G^*|. Since OPT = OG ∪ H (Theorem 3), we get |OPT| ≤ (1 + kμ(1 + u))|G^*|. Thus, α = 1 + kμ(1 + u).
By expressing the approximation factors of the greedy and branch and bound algorithms as a function of the number of cameras, we see that the branch and bound algorithm theoretically outperforms the greedy algorithm substantially in terms of task coverage. This is illustrated in Figure 3, whereby the approximation factors of the greedy and branch and bound algorithms are plotted as the “distribution” parameters vary when different numbers of cameras are used. These distribution parameters refer to λ and μ in Theorem 3, and μ and u in Theorem 4. They represent how well the tasks are distributed among the cameras and (M)TVIs. The plots show that the greedy algorithm is highly sensitive to the number of cameras, with the approximation factor becoming prohibitively high when the tasks are unevenly distributed. On the other hand, the performance of the branch and bound algorithm depends only on the distribution parameters and is not affected by the number of cameras. Both the single-camera and the branch and bound DP-based multi-camera algorithms have a computational complexity of O(N³), N being the average number of (M)TVIs constructed for a given camera and used in the resulting DAG. On the other hand, the greedy algorithm takes only O(N²) time, which could outweigh the benefits of better scheduling for very large camera networks.
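For reference, the greedy rule of Section 3.1 amounts to roughly the following loop (our own sketch; `is_feasible` stands in for the Theorem 2 check on a camera's current schedule and is not defined here):

```python
def greedy_schedule(mtvis, tasks_of, camera_of, is_feasible):
    """Repeatedly pick the (M)TVI covering the most still-uncovered tasks,
    subject to per-camera schedule feasibility (Theorem 2)."""
    covered, schedules = set(), {}
    candidates = set(mtvis)
    while candidates:
        best = max(candidates,
                   key=lambda m: len(set(tasks_of[m]) - covered))
        candidates.discard(best)
        gain = set(tasks_of[best]) - covered
        cam = camera_of[best]
        if gain and is_feasible(schedules.get(cam, []) + [best]):
            schedules.setdefault(cam, []).append(best)
            covered |= gain
    return schedules, covered
```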
4 Implementation and Experiments

Although we have theoretically found the approximation factors for the scheduling algorithms, it would be interesting for practical purposes to investigate the performance of the greedy algorithm relative to the DP algorithm under “normal” circumstances where we would expect “reasonable” task distribution. For this purpose, we conducted simulations using a scene of size 200m × 200m, and generated moving objects in the scene by randomly assigning to them different starting positions in the scene, sizes and velocities. Cameras are also simulated with calibration data from real cameras. The objects are assumed to be moving in straight lines at constant speeds, and the (M)TVIs for each camera are then constructed and utilized by the scheduler. We conducted simulations for 20, 40, 60, and 80 cameras and 100, 120, 140, 160, 180, and 200 objects, and plot the percentage of the total number of tasks that were captured by both the greedy and DP algorithm. For each object, the task is to capture video segments in which the full body of the object is visible. Since there is only one task for each object, the total number of tasks equals the number of objects. The results are shown in Figure 4. The DP algorithm schedules more tasks than the greedy algorithm by a minimum of 13.55 percent and a maximum of 33.78 percent. Finally, we test our algorithms in a small-scale real time image analysis system. Due to limited resources, building a system with a large number of cameras was not possible. We developed a prototype multi-camera system consisting of four PTZ cameras synchronized by a Matrox four-channel card. For running the experiments, one camera is kept static, so that it can be used for background subtraction and tracking in the sensing stage [9,1]. From the detection and tracking, the system recovers an approximate 3D size estimate of each detected object from the ground plane and camera calibration. This is followed by the planning stage, during which the observed tracks allow the system to predict the future locations of
Fig. 5. (a) The robots are tracked (left and middle image), and the predicted tracks are used to construct the TVIs and MTVIs, which are then used by the scheduler to assign cameras to the (M)TVIs (annotated in the right image). Next, (b) camera 0 captures robot 3, and (c) camera 1 captures robots 0, 1 and 2 simultaneously.
the objects, and to use them for constructing (M)TVIs, which are then scheduled for capture. The predicted position of each detected object on the ground plane is mapped to the PTZ cameras, after which the 3D size estimate of the object is used to construct a rough 3D model of the object for the corresponding PTZ camera. Such a 3D model is utilized to determine valid ranges of PTZ settings during the construction of TVIs. The experiments confirm that the greedy algorithm performs faster than the DP algorithm. This makes the greedy algorithm more suitable for our prototype system. A real time experiment using the greedy scheduler is illustrated in Figure 5. Four remote-controllable 12×14 inch robots move through the scene. Two PTZ cameras were needed to capture the four robots using a (one task) TVI and a three-task MTVI.
5 Conclusion

This paper addressed scheduling algorithms for smart video capture in large camera networks. We developed approximation algorithms for scheduling using a greedy and a DP based approach. While the DP algorithm gives very good results both theoretically and experimentally, it is computationally more expensive than the greedy algorithm. A suitable algorithm can thus be chosen depending on the application scenario and the computational resources available.
References 1. Isard, M., Blake, A.: Condensation - conditional density propagation for visual tracking. International journal of computer vision 29, 5–28 (1998) 2. Mittal, A., Davis, L.: M2tracker: A multi-view approach to segmenting and tracking people in a cluttered scene. In: European Conference on Computer Vision, Copenhagen, Denmark (2002) 3. Zhao, T., Nevatia, R.: Bayesian human segmentation in crowded situation. In: IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society Press, Los Alamitos (2003) 4. Kaucic, R., Perera, A.A., Brooksby, G., Kaufhold, J., Hoogs, A.: A unified framework for tracking through occlusions and across sensor gaps. In: IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, IEEE Computer Society Press, Los Alamitos (2005) 5. Rahimi, A., Dunagan, B., Darrell, T.: Simultaneous calibration and tracking with a network of non-overlapping sensors. In: IEEE Conference on Computer Vision and Pattern Recognition, Washington DC, IEEE Computer Society Press, Los Alamitos (2004) 6. Lim, S.N., Davis, L.S., Mittal, A.: Constructing task visibility intervals for video surveillance. ACM Multimedia Systems (2006) 7. Abrams, S., Allen, P.K., Tarabanis, K.: Computing camera viewpoints in an active robot work cell. International Journal of Robotics Research 18 (1999) 8. Tarabanis, K., Tsai, R., Allen, P.: The mvp sensor planning system for robotic vision tasks. IEEE Transactions on Robotics and Automation 11, 72–85 (1995) 9. Grimson, W.E.L., Stauffer, C.: Adaptive background mixture models for real-time tracking. In: IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society Press, Los Alamitos (1999)
Constrained Optimization for Human Pose Estimation from Depth Sequences

Youding Zhu1 and Kikuo Fujimura2

1 Computer Science and Engineering, The Ohio State University
2 Honda Research Institute USA
[email protected],
[email protected]
Abstract. A new 2-step method is presented for human upper-body pose estimation from depth sequences, in which coarse human part labeling takes place first, followed by more precise joint position estimation as the second phase. In the first step, a number of constraints are extracted from notable image features such as the head and torso. The problem of pose estimation is cast as that of label assignment with these constraints. Major parts of the human upper body are labeled by this process. The second step estimates joint positions optimally based on kinematic constraints using dense correspondences between depth profile and human model parts. The proposed framework is shown to overcome some issues of existing approaches for human pose tracking using similar types of data streams. Performance comparison with motion capture data is presented to demonstrate the accuracy of our approach.
1 Introduction
Markerless human motion capture is a research field concerned with obtaining large scale human motion data, such as head, torso, limbs, from image observation of human subjects. For the past decades, markerless human motion capture has been an active research field motivated by various applications such as action recognition, surveillance, and man-machine interaction. Despite substantial advances in related aspects including tracking, pose estimation, and recognition, challenging problems still remain due to the high degrees of freedom coming from the dynamic range of poses during human activities, the diversity of visual appearance caused by clothing, visual ambiguities from self-occlusion of non-rigid 3D objects, and background clutter. There have been many attempts at solving this problem using various modalities including a single image, a sequence of images, and multiple streams (using multiple cameras) of images. In this paper, we propose a method for upper-body pose estimation from a stream of depth images. The region occupied by a human subject is easier to capture in a depth image and it usually contains a stronger cue to distinguish a human subject from other objects. By taking advantage of this characteristic, an optimization approach is presented to generate the most plausible pose from various cues provided by depth image analysis.
More concretely, coarse body part labeling is cast as a linear programming problem, where various small segments of the human body are labeled with constraints arising from body part detection and tracking. Furthermore, the 3D human pose is optimally estimated from dense correspondences. Our algorithm is structured to work on a stream of point clouds from depth sensors. It is also configured to function when a color stream is provided in addition to a depth stream. Our implementation of the algorithm runs at 5∼9 frames per second in online mode on a 3.00GHz HP desktop. The rest of the paper is organized as follows. After a brief review of related work in Section 2, our algorithm is described in Section 3. Experimental results are presented in Section 4 and Section 5 concludes the paper.
2 Related Work
A large number of pose estimation methods have been proposed in the literature. A thorough discussion of various pose estimation approaches is beyond the scope of this paper and the reader is referred to a recent survey [1] for a comprehensive comparison between various approaches. Lately, there have been approaches making use of depth sequences due to their advantages over a single color image. For example, depth measurement provides necessary information to resolve depth ambiguity, which is an issue with approaches using a single color image [2]. Grest et al. [4] adapt an Iterative Closest Point (ICP) approach to the articulated human model, where pose parameters are updated using inverse kinematics based on dense correspondences between sampled depth observations and model vertices that are found based on nearest neighbor association. Knoop et al. [3] also use ICP to update pose parameters by incorporating multiple input data from different sensors such as stereo depth, hands/face tracking from a color camera, etc. Ziegler et al. [11] use an unscented Kalman filter based on a set of correspondences between model vertices and the observed stereo point cloud. ICP is often used as a method of choice when a 3D model is to be fitted to 3-dimensional data. A common issue with ICP approaches for human pose tracking is that the model may drift away from the data or get stuck in local minima. An initial configuration is critical for ICP to converge correctly. Our framework also uses the idea of closest point correspondence as a part of the solution, but it is less susceptible to the problem of local minima due to the coarse body part identification. We also use a grid acceleration data structure so as to achieve pose estimation at a high frame rate without loss of accuracy.
3 Algorithm
Our algorithm takes a depth image sequence representing human motion and outputs pose vectors of the upper body. Depth data is usually obtained using stereo cameras, structured light sensors, or time-of-flight sensors. If another image modality, e.g., a color image sequence corresponding to the depth sequence, is
also available, our framework allows such data to be integrated to strengthen the result. The algorithm consists of two major modules, namely, (i) coarse body part labeling and (ii) model fitting (Fig. 1). In the first module, the region within the given image corresponding to the human body is partitioned into small homogeneous segments. The segments are formed such that, within each segment, the depth of each pixel is similar, and the segments are of a small and similar size. Each segment is then assigned a body part label (e.g., head, left arm) by a label assignment framework. At this point, coarse body part identification within each image is completed and passed to the second module. In the second module, a polygonal human upper body model attached to the underlying kinematic skeleton structure is fitted to the depth observation using ICP for each body part.

Fig. 1. Flow of the Algorithm. The left half of the figure illustrates Step 1, in which coarse body part labeling is determined. The right half illustrates the process of determining joint positions by fitting.

3.1 Body Constraints
A few body constraints are extracted from the depth images.
1. Head and torso constraints: The head and torso are tracked by specialized modules. The head is tracked by circle fitting with head contour points predicted from depth, while the torso is tracked by box fitting, where a box with 5 degrees of freedom (x, y, height, width, and orientation) is positioned so as to minimize the number of background pixels within the box (a brute-force sketch of this box fitting is given below).
2. Depth constraint: For certain frames, an arm shape is clearly separable in depth when it is in front of the torso.
These cues are further used by body part labeling, as described in the next section.
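A minimal sketch of the torso box fitting, assuming a fixed box size and a coarse local search around the previous frame's box; this is our own illustration, not the authors' implementation, and the search ranges, sampling step, and scoring are hypothetical choices.

import numpy as np

def background_in_box(fg_mask, cx, cy, w, h, theta, step=4):
    # Sample the rotated box on a coarse grid and count background samples inside it.
    # fg_mask: binary image, 1 = human foreground, 0 = background.
    us, vs = np.meshgrid(np.arange(-w / 2, w / 2, step),
                         np.arange(-h / 2, h / 2, step))
    c, s = np.cos(theta), np.sin(theta)
    xs = (cx + c * us - s * vs).astype(int)
    ys = (cy + s * us + c * vs).astype(int)
    inside = (xs >= 0) & (xs < fg_mask.shape[1]) & (ys >= 0) & (ys < fg_mask.shape[0])
    return np.count_nonzero(fg_mask[ys[inside], xs[inside]] == 0)

def fit_torso_box(fg_mask, prev):
    # prev = (x, y, w, h, theta) from the previous frame; width/height search omitted here.
    best, best_score = prev, np.inf
    x0, y0, w0, h0, t0 = prev
    for dx in (-8, 0, 8):
        for dy in (-8, 0, 8):
            for dt in (-0.1, 0.0, 0.1):
                score = background_in_box(fg_mask, x0 + dx, y0 + dy, w0, h0, t0 + dt)
                if score < best_score:
                    best, best_score = (x0 + dx, y0 + dy, w0, h0, t0 + dt), score
    return best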
3.2 Coarse Part Labeling
For the next step, we form an adjacency graph G where a node represents a small cluster of pixels within the human body (see Fig. 1, top left) and an edge represents an adjacency relationship between two pixel clusters. A recursive subdivision strategy partitions the image into pixel clusters based on homogeneity of depth values and spatial positions. Starting from the root node representing all pixels, k-means clustering (with k = 2) is used to subdivide clusters until each cluster is sufficiently small in size and has small depth variance. All the leaf nodes form the segment set S to be labeled. Segments s_i (i = 1, 2, ..., N) are to be classified into major body parts p_1, p_2, ..., p_M. This is a labeling problem and is formulated as the following optimization problem. A segment s_i in S is to be assigned a label by a function f. Each s_i has an estimate of its likelihood of having label f(s_i). This comes from a heuristic in pose estimation; for example, if s is near the bottom of the image, its likelihood of being part of the head is low. For this purpose, a non-negative cost function c(s, f(s)) is introduced to represent this likelihood. Further, we consider two neighboring segments s_i and s_j to be related, in the sense that we would like s_i and s_j to have the same label. Each edge e in graph G has a non-negative weight w_e indicating the strength of the relation. Moreover, certain pairs of labels are more similar than others, so we impose a distance d(\cdot) on the label set, where larger distance values indicate less similarity. The total cost of a labeling f is given by

Q(f) = \sum_{s \in S} c(s, f(s)) + \sum_{e=(s_i, s_j) \in E} w_e \, d(f(s_i), f(s_j))
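The recursive subdivision step above can be sketched as follows. This is our own minimal illustration (not the authors' code); the feature vector (x, y, depth) per pixel, the size limit, and the depth-variance threshold are assumed choices.

import numpy as np

def two_means(feats, iters=10):
    # Simple 2-means on row-wise (x, y, depth) feature vectors.
    centers = feats[np.random.choice(len(feats), 2, replace=False)].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = feats[labels == k].mean(axis=0)
    return labels

def subdivide(feats, max_size=200, max_depth_var=4.0):
    # Recursively split a pixel cluster until it is small and depth-homogeneous.
    if len(feats) <= max_size and feats[:, 2].var() <= max_depth_var:
        return [feats]                                   # leaf node: one segment of S
    labels = two_means(feats)
    parts = [feats[labels == k] for k in range(2)]
    if any(len(p) == 0 for p in parts):
        return [feats]                                   # degenerate split; keep as one segment
    return subdivide(parts[0], max_size, max_depth_var) + \
           subdivide(parts[1], max_size, max_depth_var)

# segments = subdivide(body_pixels)   # body_pixels: (n, 3) array over the human region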
In our problem, the following table (Fig. 2) is to be completed, where the binary variable A_ij indicates whether segment s_i belongs to body part p_j. For c(i, j), the Euclidean distance from the segment to the model body part (from the previous frame) is used. More specifically, the Euclidean distance from a segment s_i to a model body part p_j is estimated as the sum of squared distances from a number of sampled pixels to their nearest model vertices in that body part.

              Head (p1)  Torso (p2)  LeftArm (p3)  RightArm (p4)  ...  (pM)
segment s1    A11        A12         A13           A14            ...  A1M
segment s2    A21        A22         A23           A24            ...  A2M
...
segment sN    AN1        AN2         AN3           AN4            ...  ANM

Label Assignment Problem: A_ij = 1 if s_i belongs to p_j, A_ij = 0 otherwise; \sum_{j=1}^{M} A_ij = 1; nearby pixels have similar labels.

Fig. 2. Labeling problem
Since each segment belongs to only one body part, \sum_{j=1}^{M} A_{ij} = 1 holds. In addition to this constraint, a number of related constraints are considered.
1. Neighboring segments should have a similar label.
2. Head and torso constraint.
3. Depth slicing constraint.
4. Color constraint.
It turns out that this is an instance of the Uniform Labeling Problem, which can be expressed as the following integer program by introducing an auxiliary variable Z_e for each edge e to express the distance between the labels, and Z_{ej} to express the absolute value |A_{pj} - A_{qj}|. Following Kleinberg and Tardos [6], we can rewrite our optimization problem as follows:

\min \Big\{ \sum_{i=1}^{N} \sum_{j=1}^{M} c(i,j) A_{ij} + \sum_{e \in E} w_e Z_e \Big\}    (1)

subject to

\sum_{j=1}^{M} A_{ij} = 1, \quad i = 1, 2, \ldots, N    (2)

Z_e = \frac{1}{2} \sum_{j=1}^{M} Z_{ej}, \quad e \in E    (3)

Z_{ej} \ge A_{pj} - A_{qj}, \quad e = (p, q); \; j = 1, 2, \ldots, M    (4)

Z_{ej} \ge A_{qj} - A_{pj}, \quad e = (p, q); \; j = 1, 2, \ldots, M    (5)

A_{ij} \in \{0, 1\}, \quad i = 1, 2, \ldots, N; \; j = 1, 2, \ldots, M    (6)
Here, the terms involving Z_e and Z_{ej} come from constraint (1). The weight w_e is given by w_e = e^{-\alpha d_e}, where d_e is the depth difference between two adjacent segments and \alpha is selected based on experiments. In our study, we let M = 4 (head, torso, left arm, right arm). For constraint (2), if segment s_i is outside the tracked head circle, an additional constraint A_{i,1} = 0 is added; if segment s_i is outside the tracked torso box, the constraint A_{i,2} = 0 is added. To apply constraint (3) to the detected arm segments, the constraint A_{i,3} + A_{i,4} = 1 is added (as it is not clear whether the segment belongs to the right or the left arm). Finally, if there are tracked hand positions based on skin color information, we can add the constraint A_{i,3} + A_{i,4} = 1 for the corresponding segments. In general, solving an integer program optimally is NP-hard. However, we can relax the above problem to a linear program with A_{ij} \ge 0, which can be solved efficiently using a publicly available library, e.g., [7]. Kleinberg and Tardos [6] describe a method for rounding fractional solutions so that the expected objective function Q(f) is within a factor of 2 of the optimal solution. In our experiments we find that the relaxed linear program always returns an integer solution (see an observation due to Anguelov et al. [10]). Figure 1 shows one example of this body part labeling result.
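As an illustration, the relaxed linear program of Eqs. (1)-(6) can be assembled as below. This is a sketch of ours, not the paper's implementation: the variable layout and the use of SciPy's linprog (instead of the library cited in [7]) are assumptions. Hard constraints such as A_{i,1} = 0 are handled here through variable bounds; the arm constraints A_{i,3} + A_{i,4} = 1 could be added analogously as extra equality rows.

import numpy as np
from scipy.optimize import linprog

def label_segments(cost, edges, fixed_zero=()):
    # cost: (N, M) matrix of c(i, j); edges: list of (p, q, w_e) over adjacent segments;
    # fixed_zero: optional (i, j) pairs forced to A_ij = 0 (e.g. "outside the head circle").
    N, M = cost.shape
    E = len(edges)
    nA, nZ = N * M, E * M                       # A_ij variables first, then Z_ej variables
    obj = np.concatenate([cost.ravel(),
                          np.repeat([0.5 * w for (_, _, w) in edges], M)])  # w_e Z_e = (w_e/2) sum_j Z_ej
    A_eq = np.zeros((N, nA + nZ)); b_eq = np.ones(N)
    for i in range(N):
        A_eq[i, i * M:(i + 1) * M] = 1.0        # sum_j A_ij = 1
    A_ub = np.zeros((2 * E * M, nA + nZ)); b_ub = np.zeros(2 * E * M)
    row = 0
    for e, (p, q, _) in enumerate(edges):       # Z_ej >= |A_pj - A_qj|
        for j in range(M):
            z = nA + e * M + j
            for sgn in (+1, -1):
                A_ub[row, p * M + j] = sgn
                A_ub[row, q * M + j] = -sgn
                A_ub[row, z] = -1.0
                row += 1
    bounds = [(0.0, 1.0)] * (nA + nZ)
    for (i, j) in fixed_zero:
        bounds[i * M + j] = (0.0, 0.0)
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:nA].reshape(N, M)             # in practice this comes back integral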
3.3 Model Fitting
The human body model is represented as a hierarchy of joint-link models with a skin mesh attached to it, as in Lewis et al. [8]. For the upper body model used in this paper, the skin mesh and hierarchical skeleton structure are illustrated in Fig. 1. Given a set of 3D data points P = {p_1, p_2, ..., p_m} as targets and their corresponding model vertices V = {v_1, v_2, ..., v_m}, the model pose vector q is estimated as

\hat{q} = \arg\min_q \|P - V(q)\|^2    (7)
where q = (\theta_0, \ldots, \theta_n)^T is the pose parameter vector and the v_i are visible vertices of the polygonal model. To solve this minimization problem efficiently and robustly, we use a variant of inverse kinematics known as damped least squares, inspired by the well-known ICP algorithm [9]. The formulation (see Fig. 7) minimizes \|J\Delta q - \Delta E\|^2 + \lambda \|\Delta q\|^2, where J is the Jacobian. Inverse kinematics with damped least squares [5] has the benefit of avoiding singularities, making the process numerically stable. We use \lambda = 0.1 based on our experiments. For articulated body pose estimation, the algorithm depends on the accuracy of the correspondence pairs of data points and model vertices. Most recent works apply a nearest neighbor search between two point clouds: one containing all the observed 3D points from the depth or other sensor, the other containing all the model vertices. Since the iteration may be attracted to local minima, we apply the aforementioned body part labeling to limit the nearest neighbor search to a subset of observed 3D points and a subset of visible model vertices for each body part. This not only speeds up the nearest neighbor search but, more importantly, achieves robust pose estimation even for long sequences containing large motions between consecutive frames. In our implementation, the OpenGL depth buffer is used to decide model vertex visibility. For faster computation, we use a grid-based spatial index to speed up the nearest neighbor search between point clouds. We partition the working volume into a 3D grid; because only scene profiles are used, we partition the xy plane of the working volume. Depth points and visible model vertices are indexed into the corresponding grid cells. To perform the nearest neighbor search for a model vertex, we first find the cell in which it is located and find the nearest depth point in this cell. Then, we recursively propagate the search to neighboring cells until the minimal distance from the cell corners to the model vertex is greater than the current minimal distance from the model vertex to the depth points. As illustrated in Fig. 8, this yields a speed-up by a factor of 6. When capturing a pose sequence, the subject is initially requested to take an open-arm posture (the so-called "T-pose"), in which the arms and torso do not
overlap. At this initialization stage, body dimensions are measured and further used to scale the kinematic skeleton and polygonal human body model.
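The grid-accelerated nearest-neighbor search described in Section 3.3 might look as follows. This is our own simplified sketch under the stated idea (a 2D hash over the xy plane with ring expansion); the cell size and the safety guard are hypothetical parameters.

import numpy as np
from collections import defaultdict

def build_grid(points, cell=0.05):
    # Index depth points (an (n, 3) numpy array) by their xy cell.
    grid = defaultdict(list)
    for idx, p in enumerate(points):
        grid[(int(np.floor(p[0] / cell)), int(np.floor(p[1] / cell)))].append(idx)
    return grid

def nearest_depth_point(vertex, points, grid, cell=0.05):
    cx, cy = int(np.floor(vertex[0] / cell)), int(np.floor(vertex[1] / cell))
    best, best_d, ring = -1, np.inf, 0
    while True:
        # Visit cells on the ring at Chebyshev distance `ring` from the start cell.
        for dx in range(-ring, ring + 1):
            for dy in range(-ring, ring + 1):
                if max(abs(dx), abs(dy)) != ring:
                    continue
                for idx in grid.get((cx + dx, cy + dy), ()):
                    d = np.linalg.norm(points[idx] - vertex)
                    if d < best_d:
                        best, best_d = idx, d
        # Unvisited cells lie at xy distance >= ring * cell, so the current best
        # cannot be beaten once best_d falls below that bound.
        if best >= 0 and best_d <= ring * cell:
            break
        ring += 1
        if ring > 1000:          # safety guard for empty or sparse grids
            break
    return best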
4 Experimental Results
The proposed pose estimation algorithm has been tested on many sequences collected from a few human subjects. Depth and color sequences were obtained synchronously using a calibrated hybrid camera pair (a SwissRanger SR3000 for depth and a Sony DFW-V500 for color). Furthermore, a motion capture system by PhaseSpace Inc. with 8 camera units was run synchronously with the hybrid camera system, recording the coordinates of eight major joints of the subject for ground-truth reference. The subject wears markers for motion capture purposes only; these markers are not used by the main algorithm. Test motion sequences include a complete semaphore flag signaling motion (A to Z), simple exercise movements, and TaiChi movements. The total number of frames collected is 4800 and each test sequence is about 400 frames long. All sequences were tracked successfully at a frame rate of 5∼9 Hz on a 3.00 GHz HP desktop. Figure 3 contains tracked frames taken from full-length sequences, where the subject performs (a) a TaiChi motion and (b) an exercise motion, respectively. Pose estimation precision has also been compared against joint position data captured by the marker-based motion capture system. Figure 4 lists the errors in various joint positions for the TaiChi motion sequence. As seen there, the overall tracking error is approximately 5 cm (with the subject standing 1.5 m to 2 m from the camera). Similar tracking results have been obtained for the other sequences, such as out-of-plane rotation (Fig. 6), which is usually difficult to capture with single-color-camera-based pose estimation methods. To compare tracking stability, we tested an ICP-based approach on exactly the same sequences. ICP and our method have similar performance (in terms of error from the ground truth) when tracking runs successfully. A significant difference, however, is that for some frames (which our method processes successfully), the ICP-based tracking fails and never recovers (as shown in Fig. 5). Our method works in all cases where the ICP method works. The ICP-based method is also slower, because it needs more iterations to converge. This illustrates the advantage of the part labeling step in our framework. At this point, let us contrast our approach with a few other approaches using depth sequences. Ziegler et al. [11] use depth sequences obtained by four stereo cameras. Point correspondences are based on spatial proximity, which may result in wrong correspondences when body parts are close to each other; our method is less susceptible to this problem due to the use of inverse kinematics. Demirdjian et al. [13] use efficient example-based matching to improve the tracking of a set of likelihood modes. Large errors can still occur when the test example is not close to the training examples. Our coarse labeling step has a similar function and finds the likelihood mode with constraints from bottom-up observations. Grest et al. [4] introduce an ICP method for articulated body pose estimation,
Fig. 3. Snapshots of the algorithm output (a) TaiChi sequence, (b) Simple exercise sequence
Model Joints      ΔX (μ, σ)    ΔY (μ, σ)    ΔZ (μ, σ)    [errors in millimeters]
Right Hand        (-15, 49)    (-39, 58)    (23, 44)
Right Elbow       (-23, 34)    (-70, 42)    (-48, 59)
Right Shoulder    (21, 57)     (-43, 19)    (1, 25)
Waist             (-24, 26)    (-12, 15)    (-19, 14)
Left Hand         (16, 61)     (-6, 86)     (44, 45)
Left Elbow        (30, 35)     (-74, 39)    (71, 66)
Left Shoulder     (-23, 53)    (-36, 30)    (27, 30)
Head              (-15, 26)    (-18, 15)    (-22, 15)
Overall           (-4, 49)     (-37, 50)    (22, 52)

Fig. 4. Comparison table
Fig. 5. Stability comparison between our method and standard ICP (TaiChi sequence, frame 178: color image, our method, ICP only)
Fig. 6. Examples of out-of-plane rotation up to 50 degrees
while they do not address the robustness of ICP-based pose estimation. Knoop et al. [3] utilize skin color segmentation (hence face and hand feature trackers) to improve ICP-based pose estimation; it is not clear how this handles a temporarily invisible face or hands. Some other approaches use multiple sensors to obtain more surface data: Cheung et al. [12] use a visual hull, while Anguelov et al. [10] use 3D range scan data to reconstruct human skeletal structures. Accurate pose estimation might be obtained with these methods since body parts are visible in multiple views.
5 Concluding Remarks
We have presented a method for estimating human pose from depth sequences. Our method consists of two major components that cooperate to estimate and track human motion. The first module is body component identification, which is solved by reducing it to linear programming. If an application requires only rough body labeling in the image, the first module alone provides such a solution. If more accurate positions of the major joints are required, the second module is used: model fitting by inverse kinematics based on dense correspondences between the image data and the human kinematic model. The result of the second component is, in turn, used to initialize the first component for the next frame. The algorithm tracks human upper-body movements over several minutes of pose sequences at a speed of a few Hz on a laptop PC (up to 10 Hz when a 3 GHz desktop is used).
model fitting(P: 3D points, V: model vertices)
1. Form

\Delta e_i = \begin{bmatrix} p_i^x - v_i^x \\ p_i^y - v_i^y \\ p_i^z - v_i^z \end{bmatrix}, \quad \Delta E = \begin{bmatrix} \Delta e_1 \\ \Delta e_2 \\ \vdots \\ \Delta e_m \end{bmatrix}

2. Solve J\Delta q = \Delta E by damped least squares, where J is the Jacobian of the model vertices:

\Delta q = (J^T J + \lambda I)^{-1} J^T \Delta E, \quad q = q + \Delta q

3. Repeat until \Delta q is sufficiently small

Fig. 7. Model fitting procedure
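One damped-least-squares update from Fig. 7 can be sketched in a few lines. This is our own illustration; jacobian() is a placeholder for the reader's articulated-model Jacobian routine, and lam = 0.1 follows the text.

import numpy as np

def dls_step(q, target_pts, model_pts, jacobian, lam=0.1):
    # q: current pose vector; target_pts/model_pts: (m, 3) corresponding points;
    # jacobian(q): (3m, len(q)) Jacobian of the stacked model-vertex positions.
    dE = (target_pts - model_pts).reshape(-1)            # stacked residuals, as in Fig. 7
    J = jacobian(q)
    JtJ = J.T @ J + lam * np.eye(J.shape[1])             # damping avoids singularities
    dq = np.linalg.solve(JtJ, J.T @ dE)
    return q + dq, np.linalg.norm(dq)

# Outer loop: re-associate correspondences per labeled body part, call dls_step,
# and repeat until the returned step norm is sufficiently small.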
Fig. 8. Performance comparison with and without grid acceleration (on a 2.13GHz IBM Laptop)
We have also made a comparative study of markerless pose tracking against a commercial marker-based tracking system and shown that our joint positions are accurate to several centimeters. The algorithm has a number of possible extensions, such as handling severe occlusions caused by environmental objects. We leave these for future work.
References
1. Moeslund, T.B., Hilton, A., Kruger, V.: A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding 104(2-3), 90–126 (2005)
2. Sminchisescu, C., Triggs, B.: Kinematic jump processes for monocular 3D human tracking. In: Int. Conf. on Computer Vision and Pattern Recognition, pp. 18–20 (2003) 3. Knoop, S., Vacek, S., Dillmann, R.: Sensor fusion for 3D human body tracking with an articulated 3D body model. In: Int. Conf. on Robotics and Automation, pp. 1686–1691 (2006) 4. Grest, D., et al.: Nonlinear body pose estimation from depth images. In: Kropatsch, W.G., Sablatnig, R., Hanbury, A. (eds.) DAGM 2005. LNCS, vol. 3663, Springer, Heidelberg (2005) 5. Buss, S., Kim, J.: Selectively damped least squares for inverse kinematics. Journal of Graphics Tools 10(3), 37–49 (2005) 6. Kleinberg, J., Tardos, E.: Approximation algorithms for classification problems with pairwise relationships: Metric partitioning and Markov random fields. Journal of the ACM 49(5), 616–639 (2002) 7. LP solve reference guide, http://lpsolve.sourceforge.net/5/5 8. Lewis, J.P., Cordner, M., Fong, N.: Pose space deformations: A unified approach to shape interpolation and skeleton-driven deformation. In: SIGGRAPH, pp. 165–172 (2000) 9. Besl, P., McKay, N.: A method for registration of 3-D shapes. IEEE Trans. on Pattern Analysis and Machine Intelligence 14(2), 239–256 (1992) 10. Anguelov, D., Koller, D., Pang, H., Srinivasan, P., Thrun, S.: Recovering articulated object models from 3D range data. In: Proc. of Uncertainty in Artificial Intelligence Conference, pp. 18–26 (2004) 11. Ziegler, J., Nickel, K., Stiefelhagen, R.: Tracking of the articulated upper body on multi-view stereo image sequences. In: Int. Conf. on Computer Vision and Pattern Recognition, pp. 774–781 (2006) 12. Cheung, K.M., Baker, S., Kanade, T.: Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture. In: Int. Conf. on Computer Vision and Pattern Recognition, pp. 77–84 (2003) 13. Demirdjian, D., Taycher, L., Shakhnarovich, G., Grauman, K., Darrell, T.: Avoiding the Streetlight Effect: tracking by exploring likelihood modes. In: Int. Conf. on Computer Vision, pp. 357–364 (2005)
Generative Estimation of 3D Human Pose Using Shape Contexts Matching
Xu Zhao and Yuncai Liu
Institute of Image Processing & Pattern Recognition, Shanghai Jiao Tong University, 200240, Shanghai, China
Abstract. We present a method for 3D pose estimation of human motion in a generative framework. For generality of the application scenario, the observation information we use comes from monocular silhouettes. We distill prior information on human motion by performing conventional PCA on a single motion capture data sequence. In doing so, the aims of both reducing dimensionality and extracting the prior knowledge of human motion are achieved simultaneously. We adopt the shape contexts descriptor to construct the matching function, by which the validity and robustness of the matching between image features and synthesized model features can be ensured. To explore the solution space efficiently, we design the Annealed Genetic Algorithm (AGA) and the Hierarchical Annealed Genetic Algorithm (HAGA), which search for optimal solutions effectively by utilizing the characteristics of the state space. Results of pose estimation on different motion sequences demonstrate that the novel generative method can achieve viewpoint-invariant 3D pose estimation.
1 Introduction

Capturing 3D human motion from visual cues has received increasing attention in recent years, driven by a wide spectrum of potential applications such as behavior understanding, content-based image retrieval and visual surveillance. Although it has been attacked by many researchers, this challenging problem is still long standing because of the difficulties caused mainly by the complicated nature of 3D human motion and the incomplete information 2D images provide for 3D human motion analysis. In the context of graphical models, the state-of-the-art approaches to 3D human motion estimation can be classified as generative and discriminative [1]. Generative methods [2,3,4,5,6,7] follow the bottom-up Bayes' rule and model the state posterior density using an observation likelihood or cost function. Given an image observation and a prior state distribution, the posterior likelihood is usually evaluated using Bayes' rule. This approach has a sound framework of probabilistic support and can achieve significant success in recovering complex unknown motions by utilizing well-defined state constraints. However, it is generally computationally expensive because one has to perform a complex search over the state space in order to locate the peaks of the observation likelihood. Moreover, the prediction model and initialization are also bottlenecks of the approach, especially in tracking situations.
In this paper, we propose a novel generative approach in the framework of evolutionary computation, by which we try to widen the bottlenecks mentioned above with an effective search strategy embedded in the extracted state subspace. Considering the generality of the application scenario, the observation information we use comes from an uncalibrated monocular camera. This makes the state estimation a severely ill-conditioned problem, and we have to confront the curse of dimensionality, because there are more than forty degrees of freedom (DOFs) of full body joints in our 3D human model. Therefore, the search for optimal solutions should be performed in a compact state space by search algorithms suited to the characteristics of this space. In doing so, infeasible solutions, namely absurd poses, can be avoided naturally. To this end, we reduce the dimensionality of the state space by principal component analysis (PCA) of motion capture data. The motion capture data embody prior knowledge about human motion; by PCA, the aims of both reducing dimensionality and extracting this prior knowledge are achieved simultaneously. From a theoretical view, PCA is optimal in the sense of reconstruction because it allows minimal information loss in the course of state transformation from the subspace back to the original state space. Different from previous works [8,9], we perform the lengthways PCA, by which the subspace can be extracted from only a single sequence of motion capture data. To explore the solution space efficiently, we design the Annealed Genetic Algorithm (AGA), combining the ideas of simulated annealing and genetic algorithms [10]. As a promoted version of AGA, the Hierarchical Annealed Genetic Algorithm (HAGA) searches for optimal solutions more effectively than AGA by utilizing the characteristics of the state space. According to the theory of PCA, in our problem, the first principal component captures the most important part of human motion and the remaining principal components capture the detailed parts of the motion. In the monocular uncalibrated camera situation, the fitness function (observation likelihood function) is very sensitive to changes of the global motion. HAGA performs a hierarchical search automatically in the extracted state subspace by first localizing the state variables, such as the global motion and the coordinate of the first principal component, that dominate the topology of the state space. We adopt the shape contexts descriptor [11] to construct the fitness function, by which valid and robust matching between image features and synthesized model features can be achieved.

1.1 Related Work

There has been considerable previous work on capturing human motion from image information. The earlier work on this research topic has been reviewed comprehensively in the survey papers [12,13,14]. Generally speaking, to recover a 3D human pose configuration, more information is required than an image can provide, especially in the monocular situation. Therefore, much work focuses on using prior knowledge and experiential data in order to alleviate the ill-conditioning of this problem. An explicit body model embodies the most important prior knowledge about pose configuration and is thus widely used in human motion analysis. Another class of important prior knowledge comes from experiential data such as motion capture data acquired by a commercial
motion capture system and other hand-labeled data. The combination of both kinds of prior information can produce favorable techniques for solving this problem. Agarwal et al. [5] distill prior information (the motion model) of human motion from hand-labeled training sequences using PCA and clustering on the basis of a simple 2D human body model. This method presents a good autoregressive tracking scheme but gives no description of pose initialization. In the framework of generative approaches, the prior information is usually employed to constrain or reduce the search space. Urtasun et al. [15,9] construct a differentiable objective function based on the PCA of motion capture data and then find the poses of all frames simultaneously by optimizing a function in a low-dimensional space. Sidenbladh et al. [3,8] present similar methods in the framework of stochastic optimization. For a specific activity, such methods need many example sequences of motion capture to perform PCA, and all of these sequences must have the same length and phase, achieved by interpolation and alignment. Ning et al. [6] learn a motion model from semi-automatically acquired training examples which are aligned with a correlation function, and then some motion constraints are introduced to cut the search space. Unlike these methods, we extract the state subspace from only one example sequence of a specific activity using the lengthways PCA and thus require no interpolation or alignment. In addition, useful motion constraints are included naturally in the low-dimensional subspace. In recent years, particle filter [16] (also known as the condensation algorithm) based optimization methods have been used widely for recovering human pose in a generative framework [2,3,4,5,6,7]. However, as a stochastic search algorithm, the particle filter is essentially similar to an evolutionary algorithm (EA) if it has no explicit temporal dynamic model. The EA can provide more flexible evolutionary mechanisms, such as the crossover operator. This is an important motivation for us to solve this problem in the framework of EA. A noticeable example showing the relationship between particle filters and EA is the work of Deutscher et al. [17]: by introducing the crossover operator, the annealed particle filter proposed in their earlier work [2] gets a remarkable improvement. Compared with previous generative methods, extracting the common characteristics of a specific type of motion from prior information and representing them in a compact form is of particular interest to us. At the same time, we preserve the motion individuality of the input sequences with an effective evolutionary search strategy suited to the characteristics of the state subspace.
2 State Space Analysis

The potential special interests that motivate us to analyze the characteristics and structure of the state space mainly involve modeling human activities effectively in the extracted state subspace and eliminating the curse of dimensionality.

2.1 Pose Representation

We use an explicit model that represents the articulated structure of the human body. Our fundamental 3D skeleton model (see Figure 1.a) is composed of 34 articulated rigid sticks. The pose is described by a 44-dimensional vector x = (x_g, x_j), where the 3D vector
Fig. 1. (a) The 3D human skeleton model. (b) The 3D human convolution surface model. (c) The 2D convolution curves.
x_g represents the global rotation of the human body and the 41D vector x_j represents the joint angles. Figure 1.b shows the 3D convolution surface [18] human model, which is an isosurface in a scalar field defined by convolving the 3D body skeleton with a kernel function. Similarly, the 2D convolution curves of the human body, as shown in Figure 1.c, are the isocurves generated by convolving the 2D projected skeleton. As the synthetic model features, the curves are used to match the edges of image silhouettes when constructing the likelihood function.

2.2 Subspace Extraction

All of the 3D poses are distributed in the state space X. The set of poses belonging to a specific activity, such as walking, running, or handshaking, generally crowds into a subspace of X. We extract the subspace X_s from motion capture data obtained from the CMU database (http://mocap.cs.cmu.edu). Assuming {x_t | x_t \in X} is a given data sequence of motion capture corresponding to one motion type, where t is the time tag, the subspace X_s is extracted by PCA as follows:
1. Center the state vectors and assemble them into a matrix (by rows): X = [(x_1 - c); (x_2 - c); ...; (x_T - c)], where c is the mean vector.
2. Perform a singular value decomposition of the matrix to project out the dominant directions: X = U D V^T.
3. Project the state vectors into the dominant subspace: each state vector is represented as a reduced vector x_s = (x - c) U_m, where U_m is the matrix consisting of the first m columns of U, by which the m-D subspace X_s is spanned.
Therefore, the original state vector x can be reconstructed by

x = c + x_s U_m^T    (1)
The dimensionality m of the subspace X_s is determined according to the cumulative sum of the principal component variance percentages. In our experience, this cumulative percentage should be set no smaller than 0.95; accordingly, the value of m is generally no greater than 6.
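A minimal numpy sketch of the subspace extraction and the reconstruction in Eq. (1); this is our own illustration, and with samples stacked as rows the dominant directions are taken here from the right singular vectors, which play the role of U_m above. The 0.95 cumulative-variance threshold follows the text.

import numpy as np

def extract_subspace(X, var_ratio=0.95):
    # X: (T, D) matrix of pose vectors, one mocap frame per row.
    c = X.mean(axis=0)
    Xc = X - c
    U, D, Vt = np.linalg.svd(Xc, full_matrices=False)
    cum = np.cumsum(D ** 2) / np.sum(D ** 2)      # cumulative variance percentage
    m = int(np.searchsorted(cum, var_ratio)) + 1
    Um = Vt[:m].T                                 # (D, m) basis spanning the subspace X_s
    return c, Um

def project(x, c, Um):
    return (x - c) @ Um                           # x_s = (x - c) U_m

def reconstruct(x_s, c, Um):
    return c + x_s @ Um.T                         # Eq. (1): x = c + x_s U_m^T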
3 Fitness Function

In a generative framework, pose capturing can be formulated as Bayesian posterior inference:

p(x_s | y) \propto p(x_s) \, p(y | x_s)    (2)

The function p(y | x_s) represents the likelihood of observing image y conditioned on a pose candidate x_s. It is used to evaluate every pose candidate generated from p(x_s). In the context of an evolutionary algorithm, the likelihood function is just the fitness function. We propose a fitness function based on shape contexts matching [11]. We choose the image silhouette of the subject as the observed image feature, extracted using statistical background subtraction. The shape context descriptor is used to describe the shape of the image silhouette and of the convolution curves generated by the pose candidate (see Figure 1). Figure 2 illustrates the shape contexts [11] (histograms of local edge pixels in log-polar bins) of the human shape. Our shape contexts contain 12 angular and five radial bins, giving rise to 60-dimensional histograms, as shown in Figure 2.b. In the matching process, regularly spaced points on the edge of the silhouette are sampled as the query shape. The point set sampled from the convolution curves is viewed as the candidate shape. Before matching, the image shape and the candidate shape are normalized to the same scale. We represent the query shape and the candidate shape as S_query(y) and S_m(x_s), respectively. The matching cost function is then formulated as

F(S_{query}(y), S_m(x_s)) = \sum_{j=1}^{r} \chi^2(SC_{query}^j(y), SC_m^j(x_s))    (3)

where SC is the shape context, r is the number of sample points on the edge of the image silhouette, and SC_m^j(x_s) = \arg\min_u \chi^2(SC_{query}^j(y), SC_m^u(x_s)). Here, we use the \chi^2 distance as the similarity measure. In AGA, the optimization mechanism is designed to search for the maximal value of the objective function. Therefore, according to Eq. (3), the fitness function can be reformulated as

\varphi(S_{query}(y), S_m(x_s)) = C \exp(-F(S_{query}(y), S_m(x_s)))    (4)

where C is a constant for adjusting the value range of the fitness function.
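The matching cost of Eq. (3) and the fitness of Eq. (4) can be sketched as follows. This is our own illustration; the 60-bin shape-context histograms are assumed to be precomputed as row-wise arrays, and the constant C = 100 is a hypothetical choice.

import numpy as np

def chi2(h1, h2, eps=1e-9):
    # Chi-square distance between two shape-context histograms.
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def matching_cost(sc_query, sc_model):
    # sc_query: (r, 60) histograms of silhouette edge samples;
    # sc_model: (k, 60) histograms of convolution-curve samples.
    total = 0.0
    for hq in sc_query:
        # best-matching model descriptor for this query point (the argmin in Eq. (3))
        total += min(chi2(hq, hm) for hm in sc_model)
    return total

def fitness(sc_query, sc_model, C=100.0):
    return C * np.exp(-matching_cost(sc_query, sc_model))   # Eq. (4)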
4 Pose Estimation Using HAGA

In this section, we describe the key algorithms of the generative framework, namely AGA and HAGA, and their adaptation for pose estimation from monocular silhouettes.

4.1 Hierarchical Annealed Genetic Algorithm

Combining simulated annealing (SA) and genetic algorithms (GA), we design the annealed genetic algorithm, which is in fact a hybrid (1+1) evolutionary strategy. In our algorithm, local optima are avoided by introducing several genetic evolutionary principles. We represent a chromosome by a state vector x = [x_1, x_2, ..., x_n],
Fig. 2. (a) The shape contexts computed on edge points of the image silhouette (right) and sampled points of the convolution curves (left). (b) Example shape contexts for the reference samples shown in (a): image silhouette (bottom) and convolution curves (top).
where the genes x_i, i = 1, 2, ..., n, are random numbers uniformly distributed in the interval (0, 1). We use real encodings. The search for optimal solutions with AGA proceeds as follows:

Parameter initialization: set values for the evolution control parameters: S_t – stop criterion; N_t – termination condition; E_t – number of trials for reaching an equilibrium state;
for st = 1 to S_t do:
  NonImproveNum = 0;
  Generate the genes of x uniformly at random in the interval (0, 1);
  Evaluate the fitness function \varphi(x) by mapping x onto the problem domain;
  while (NonImproveNum < N_t) do
    for et = 1 to E_t do:
      Evolve x by the genetic operators (see Table 1);
      Evaluate \varphi(x);
    end for
    If the value of the fitness function is improved, NonImproveNum = 0; else NonImproveNum = NonImproveNum + 1;
  end while
  Record the optimal x;
end for

We design five genetic operators, which are executed in order in AGA. The operators are introduced by evolving an example chromosome x = [x_1, x_2, x_3, x_4, x_5, x_6]; the new chromosome generated by an operator is denoted x'. Assuming the randomly generated positions are number 2 and number 6 (or number 3 for the point mutation operator), the five operators are illustrated in Table 1 (new gene values are marked with primes). On the basis of AGA, we develop HAGA by utilizing the characteristics of the state space X. In HAGA, the state space is decomposed automatically by computing the variances of the state components generated in each annealing run. According to these variances, the state space is partitioned by localizing the important components down to a small area of their range.
Table 1. The genetic operators in AGA

Operator            Example
Exchange            x = [x1, x2, x3, x4, x5, x6]  ->  x' = [x1, x6, x3, x4, x5, x2]
Segment reversion   x = [x1, x2, x3, x4, x5, x6]  ->  x' = [x1, x6, x5, x4, x3, x2]
Segment shift       x = [x1, x2, x3, x4, x5, x6]  ->  x' = [x1, x6, x2, x3, x4, x5]
Point mutation      x = [x1, x2, x3, x4, x5, x6]  ->  x' = [x1, x2, x3', x4, x5, x6]
Segment mutation    x = [x1, x2, x3, x4, x5, x6]  ->  x' = [x1, x2, x3', x4', x5', x6']
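The five operators of Table 1 can be sketched as follows. This is a minimal illustration of ours with positions i < j playing the role of the randomly chosen positions 2 and 6; the exact segment boundaries for segment mutation follow our reading of Table 1 and are therefore an assumption.

import random

def exchange(x, i, j):
    y = x[:]; y[i], y[j] = y[j], y[i]; return y            # swap genes at i and j

def segment_reversion(x, i, j):
    return x[:i] + x[i:j + 1][::-1] + x[j + 1:]            # reverse genes i..j

def segment_shift(x, i, j):
    y = x[:]; g = y.pop(j); y.insert(i, g); return y       # move gene j to position i

def point_mutation(x, k):
    y = x[:]; y[k] = random.random(); return y             # resample one gene in (0, 1)

def segment_mutation(x, i, j):
    y = x[:]
    for k in range(i + 1, j + 1):
        y[k] = random.random()                             # resample genes after i, up to j
    return y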
This partition is explainable in theory: the important state components dominate the topology of the state space, and small changes in their values produce a great effect, whereas the values of the other state components have little influence on whether they are selected or not. This is illustrated in Figure 3. Focusing on one annealing run of state evolution (st -> st + 1), the detailed HAGA is as follows.

1. Generate an initial chromosome x = [x_1, x_2, ..., x_n] at random, where the x_i, i = 1, 2, ..., n, are random numbers uniformly distributed in the interval (0, 1). Map it linearly into the variance domain:

x \mapsto x_t \in (\min x_t, \max x_t)    (5)

In the first round of state evolution, (\min x_1, \max x_1) = (0, 1). Evaluate the fitness function \varphi(x).
2. Evolve the chromosome according to the state evolution mechanism of AGA. Before evaluating the fitness function, every new chromosome is mapped onto the variance domain as formulated in Eq. (5).
3. Store the N best states (chromosomes) and compute the covariance matrix

V_{t+1} = \frac{1}{N} \sum_{i=1}^{N} (x_i^{t+1} - x_c^{t+1})^T (x_i^{t+1} - x_c^{t+1})    (6)

where x_c^{t+1} is the mean vector, and the covariance matrix V_{t+1} is a diagonal matrix under the assumption that the state components are independent of each other. The variance domain can then be formulated as

(\min x_{t+1}, \max x_{t+1}) = (x_c^{t+1} - c^{t+1} V_{t+1}, \; x_c^{t+1} + c^{t+1} V_{t+1})    (7)

where c^{t+1} = [c^{t+1}, c^{t+1}, ..., c^{t+1}] is used to adjust the variance domain and c^{t+1} is a positive constant.
4. The variance domain (\min x_{t+1}, \max x_{t+1}) is used to cut down the state space in the next state evolution.

4.2 Experiments

We demonstrate our method by extracting subspaces for different classes of human motion and using them to estimate 3D body pose in unseen video sequences.
Fig. 3. Variance reduction contrast between principal state components and other state components. Graph (a) shows the variances of state set in which the chromosomes have not been evolved, displaying almost equal variances for each components. Graph (b) shows the variances of state set which have come through one round of state evolution, noticing that the variances of first four principal state components have been greatly reduced whereas the variances of other components have been reduced with a slighter extent. In graph (c), the variances of the principal components have been reduced to very small scopes indicating advanced localization after coming through two rounds of state evolution.
Walking motion: straight walk and turning walk. To extract the motion subspace of walking, a data set consisting of motion capture data of a single subject was used. The total number of frames is 316. It was found that different subjects and different frame counts produce generally identical subspaces. To keep the ratio of information loss lower than 0.05, the dimensionality of the subspace was chosen to be 5. For the sequence of one subject walking in a straight line, the parameters of HAGA are set as S_t = 2, N_t = 2, E_t = 5. The results are shown in Figure 4. It can be seen that the estimator is successful in determining the correct global motion as well as the 3D pose of the subject. The occlusion problem is tackled by searching for the optimal pose in the extracted subspace, because the prior knowledge about walking motion is contained in
Fig. 4. Results of recovering the poses of a subject walking straight (the images are part of a sequence from www.nada.kth.se/~hedvig/data.html). The second pose demonstrates the left-right confusion in the silhouette.
Fig. 5. Results of recovering the poses of a subject performing a turning walking motion
this space. The left-right confusion is mostly disambiguated, however, in few frames, the left-right confusion conduced by silhouette ambiguity still exist. This can be seen from Figure. 4. We test the generalization capability of our method in a turning walk sequence. In this sequence [19], a subject is performing continuing turning walking motion around a circle therefore the global motion is changed in a wide range. The parameters of HAGA are set as S t 2 Nt 2 Et 5. The results can be seen in Figure.5. Running motion. The subspace of running motion is extracted from motion capture data that consisted of 130 frames. This subspace is more compact than that of walking motion. Figure.6 shows the estimation results of 3D poses.
Fig. 6. Results of recovering the poses of a subject performing a running motion. The images are extracted from a video taken from the web site http://mocap.cs.cmu.edu.
5 Conclusion

We have discussed a novel generative approach to estimating 3D human pose from a single camera. Our approach is a step towards describing the motion characteristics of high-dimensional data spaces by extracting their subspaces. From motion capture data, we not only distilled the prior knowledge about human motion but also reduced the dimensionality of the problem. In the compact subspace, we perform an effective search for the optimal poses. To explore the solution space efficiently, we designed AGA and HAGA, by which optimal solutions can be found effectively by utilizing the characteristics of the state subspace. The robust shape contexts descriptor allows us to use silhouettes as image features. The approach was tested on different human motion sequences with good results, and allows the estimation of complex unseen motions in the presence of image ambiguities. In terms of future work, more interior edge information needs to be added to disambiguate some challenging sequences. Including a wider range of motion capture data would allow the estimator to cover more types of human motion.
Acknowledgements
This research is supported by the National Basic Research Program (973 Program) of China (No. 2006CB303103) and the National Natural Science Foundation of China (No. 60675017).
References
1. Sminchisescu, C., Kanaujia, A., Li, Z., Metaxas, D.: Discriminative density propagation for 3D human motion estimation. In: Proc. Conf. Computer Vision and Pattern Recognition, pp. 217–323 (2005) 2. Deutscher, J., Blake, A., Reid, I.: Articulated body motion capture by annealed particle filtering. In: Proceedings of the 2000 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 126–133 (2000) 3. Sidenbladh, H., Black, M., Fleet, D.: Stochastic tracking of 3D human figures using 2D image motion. In: European Conference on Computer Vision, vol. 2, pp. 702–718 (2000) 4. Sminchisescu, C., Triggs, B.: Covariance scaled sampling for monocular 3D body tracking. In: IEEE International Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 447–454 (2001) 5. Agarwal, A., Triggs, B.: Tracking articulated motion using a mixture of autoregressive models. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3023, pp. 54–65. Springer, Heidelberg (2004) 6. Ning, H., Tan, T., Wang, L., Hu, W.: People tracking based on motion model and motion constraints with automatic initialization. Pattern Recognition 37(7), 1423–1440 (2004) 7. Mori, G., Malik, J.: Recovering 3D Human Body Configurations Using Shape Contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(7), 1052–1062 (2006) 8. Sidenbladh, H., Black, M., Sigal, L.: Implicit Probabilistic Models of Human Motion for Synthesis and Tracking. In: European Conference on Computer Vision, vol. 1, pp. 784–800 (2002)
9. Urtasun, R., Fleet, D., Fua, P.: Monocular 3-D Tracking of the Golf Swing. In: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, IEEE Computer Society Press, Los Alamitos (2005) 10. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. Springer, Heidelberg (1996) 11. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(4), 509–522 (2002) 12. Aggarwal, J., Cai, Q.: Human motion analysis: A review. Computer Vision and Image Understanding 73(3), 428–440 (1999) 13. Gavrila, D.: Visual analysis of human movement: A survey. Computer Vision and Image Understanding 73(1), 82–98 (1999) 14. Moeslund, T., Granum, E.: A survey of computer vision-based human motion capture. Computer Vision and Image Understanding 81(3), 231–268 (2001) 15. Urtasun, R., Fua, P.: 3D Human Body Tracking using Deterministic Temporal Motion Models. In: European Conference on Computer Vision, vol. 3, pp. 92–106 (2004) 16. Arulampalam, M., Maskell, S., Gordon, N., Clapp, T.: A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing 50(2), 174–188 (2002) 17. Deutscher, J., Davison, A., Reid, I.: Automatic partitioning of high dimensional search spaces associated with articulated body motion capture. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Computer Society Press, Los Alamitos (2001) 18. Jin, X., Tai, C.: Convolution surfaces for arcs and quadratic curves with a varying kernel. The Visual Computer 18(8), 530–546 (2002) 19. Sigal, L., Black, M.J.: HumanEva: Synchronized video and motion capture dataset for evaluation of articulated human motion. Technical Report CS-06-08, Brown University (2006)
An Active Multi-camera Motion Capture for Face, Fingers and Whole Body Eng Hui Loke and Masanobu Yamamoto Graduate School of Science & Technology, and Department of Information Engineering, Niigata University, Ikarashi 2-nocho 8050, Nishi-ku, Niigata-city 950-2181, Japan
[email protected]
Abstract. This paper explores a novel endeavor: deploying only four active-tracking cameras and fundamental vision-based technologies for 3D motion capture of a full human body figure, including facial expression, the motion of the fingers of both hands, and the whole body. The proposed methods suggest alternatives for extracting the motion parameters of the mentioned body parts from four single-view image sequences. The proposed ellipsoidal-model- and flow-based facial expression motion capture solution tackles both 3D head pose and non-rigid facial motion effectively, and we observe that a set of 22 self-defined feature points suffices for the expression representation. Body figure and finger motion capture is solved with a combination of articulated-model and flow-based methods.
1 Introduction
Human-based character animation technology has been growing at an incredible speed. Indirectly, this induces growth in, and demand for, motion capture. An abundance of work has been proposed suggesting better and more robust solutions for human figure [8], finger, or facial [10] motion capture. Despite these efforts, no one has yet proposed a solution for full human motion processing that includes the motion of the body figure, the facial expression and the fingers simultaneously. In this paper, we propose a novel motion capture idea for implementing an image-based full human motion capture system, where the simultaneous motion data estimated with our framework can easily be used to reconstruct a 3D character animation. This paper presents the idea that, by simply adopting four cameras with an auto-tracking function and employing standard computer vision techniques, the 3D motion of a human figure, the fingers and the facial expression can be estimated from single-view image sequences recorded against a cluttered background. This first attempt at using only four cameras for image acquisition of whole-body motion eliminates cost and capture-area issues. Besides, in one system we tackle the facial feature recognition problem and 3D rigid head and non-rigid expression motion tracking with a mixture of vision techniques, while treating the human figure and fingers as articulated-model problems for 3D motion estimation.
Taking into account the fact that the characteristics of the facial expression and of the other body parts are dissimilar, we treat them differently. To be precise, facial expression is caused by muscle motion and is thus changeable, inducing variation in the appearance and size of the features, so it is posed as a non-rigid-motion problem [1]. Meanwhile, body and finger motions are produced by bones and joints with limited D.O.F. [9]; therefore their sizes are invariant, and they are appropriately represented as articulated models with rigid motion. In brief, our framework consists of three main modules: image acquisition (Section 2), vision-based motion estimation (Sections 3 and 4), and full model reconstruction and animation (Sections 5 and 6).
2 Image Acquisition
Most image acquisition in past work is done with VGA cameras, dealing either solely with the movement of one small body part or with whole-figure motion without involving the smaller body parts. In our context, however, a VGA image of 640x480 pixels has limited resolution, which affects the image analysis quality for smaller body parts such as the face and fingers. A high-vision camera, with its high resolution of 1920x1080 pixels, would suit our requirement; however, a single non-compressive high-vision camera with a full set of image capture devices can easily reach a cost that makes it infeasible in most cases. We therefore adopt an active multi-camera system consisting of four VGA color cameras (SONY EVI-D30) connected in parallel to the computer, where each automatically tracks one of four different body parts: the face, the whole body, and the left and right hands. These cameras are able to auto-pan, tilt and zoom to track the movement of the target by color- and brightness-matching the current view with a user-predefined area [11]. Once the start trigger is issued, the computer with a multi-image capture board (Micro Vision MV-34) starts fetching image data from the four cameras in parallel. This avoids the start-up delay that is common in a manual starting set-up and achieves truly simultaneous motion tracking. The image sequences are captured at a frame rate of 30 fps, sized 320x240 pixels, within a limited designated range. Each camera is targeted at tracking the movement of a different body part; the resulting images are shown in Figure 1. The hand and face images keep sufficiently high resolution to capture the targets in motion, while each camera keeps track of its target within the field of view.
3 Facial Expression Solutions
This section introduces a mixed rigid and non-rigid solution framework with which both the 3D head pose and the facial feature motion can be solved concurrently. We define a set of 22 feature points to represent the facial features, as shown in Figure 2: one per nostril, three per eyebrow, four per eye and six
Fig. 1. The 4 image sequences recorded using the active multi-camera system
Fig. 2. Facial feature points location. The numbering is the detection sequence.
for the mouth. These feature points are assumed to adhere to the surface of the ellipsoidal head model yet are tracked independently in the image. The initial facial pose is required to be a front view with no occlusion of the facial features. The following subsections detail the facial feature localization, followed by the facial expression motion capture adopted for both rigid and non-rigid motion recovery.

3.1 Skin Color Region Detection and Facial Region Recognition
Chromatic color space is the traditional approach for skin color analysis, since its two useful color components are the result of normalizing the color channels by intensity. However, this paper also considers the HSV space of skin color, based on two facts: (i) the evaluation in [12] shows that a Gaussian mixture is more accurate than a single Gaussian model, and (ii) the HSV color space exhibits the same clustering characteristic as the chromatic space, with a broader range. A Gaussian model can approximate these clustering characteristics. Applying this Gaussian model to the chromatic and HSV color spaces of the initial facial image, two grey-level images are generated in which the intensity of each pixel represents its probability of being skin. We average the intensity values of these two images and apply an empirical threshold to obtain the binary skin-likelihood segments. Spurious pixels induced by noise or color defects in the image are removed with a morphological closing operator followed by a median filter. The biggest skin-color blob is selected as the facial candidate region, while the other regions are discarded. Under the assumption that a frontal face view is assured in the initial image, we use a grey-scale eye-nose template as a detector identifying the exact location
of the facial features, eyes and nose. This is based on the conclusion of Brunelli et al. [2] that the eyes give the best matching result, followed by the nose. A full frontal-face template would be too time-consuming and matching against images with different mouth openings is error-prone, so we settle on this eye-nose template; it is rescaled only once, to the updated width of the facial-feature area image. Template matching is carried out on the grey-level image of the updated facial feature area. The Normalized Correlation Coefficients (NCC) at different y locations are computed vertically. Normalizing the correlation coefficients subdues ambiguous matching results induced by illumination and brightness. The highest correlation value in the upper facial region reports the presence of the eyes and nose.
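A simplified sketch of the skin-likelihood step of Section 3.1 is given below. This is our own illustration, not the authors' implementation: the Gaussian parameters are assumed to have been fitted to skin training pixels, the 0.4 threshold is a hypothetical empirical value, and OpenCV is only one possible choice for the color conversion and filtering.

import numpy as np
import cv2

def skin_likelihood(bgr, mean_rg, cov_rg, mean_hs, cov_hs):
    # Gaussian skin models in chromatic (r, g) and HSV (h, s) spaces.
    def gauss(x, mean, cov):
        d = x - mean
        m = np.einsum('...i,ij,...j->...', d, np.linalg.inv(cov), d)  # squared Mahalanobis
        return np.exp(-0.5 * m)
    b, g, r = cv2.split(bgr.astype(np.float32))
    s = r + g + b + 1e-6
    chrom = np.dstack([r / s, g / s])                                  # intensity-normalized r, g
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    p = 0.5 * (gauss(chrom, mean_rg, cov_rg) + gauss(hsv[..., :2], mean_hs, cov_hs))
    mask = (p > 0.4).astype(np.uint8) * 255                            # empirical threshold
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, np.ones((5, 5), np.uint8))
    mask = cv2.medianBlur(mask, 5)
    return mask                                # the largest blob becomes the face candidate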
3.2 Extraction of Facial Feature Points
There are 22 feature points in a set to be localized after the positions of the eyes and nose are confirmed. Various approaches have been proposed in different works, to name a few: edge detection, integral projection, and snake localization. The facial feature area recognized in the previous section is segmented into different blocks on the color image according to the scaled size of the eyes and nose on the template. The approaches for each feature point's localization are detailed as follows.

Detecting feature points on Eyes: In [4], eyes (pupils) are located using template matching, but in [3,5] the eye feature points are detected using integral projection or, in other works, manually. Before the two corners and two opening points of each eye are detected, we have to eliminate the hair-likelihood region with the following steps:
– Calculate the color histogram within the eye feature block and extract the highest value as the threshold; pixels with lower luminance are extracted
– Convert the extracted pixel block into a binary image
– Get the vertical/horizontal integral projection values of the binary image to locate the hair region
– Identify the upper rows and side columns at which the projection value is above the threshold; these regions are recognized as hair and thus eliminated
– The remaining binary region is segmented as the eye region

The two corners of an eye are located at the left and right bounds of this region. The Harris corner detector is further carried out within a fixed area centered at these two bounds to find good corner values. Following that, the y values of both the upper and lower bounds of the eye region are taken as the opening values of the eye, although their x values are not yet determined. An initial attempt computed the center of the two corner points, but this did not yield a stable or good solution. The adopted approach computes the horizontal integral projection of the eye region on the binary image; the x value with the highest integral projection value is taken as the x coordinate of the opening points, as plotted in Figure 3. The detection results show that this method works well, as it indirectly identifies the gazing direction of the person. In addition, this approach works stably even when the captured eye is closed, because a dark line forms at the edge of the eyelid.

Fig. 3. Eyes feature points detection
Fig. 4. Lips opening feature points detection
Fig. 5. Eyebrow mid point detection

Detecting feature points on Nose: The approach taken in detecting nostrils is straightforward, as nostrils carry a special characteristic: they are the darkest regions in the middle area of the face (between the vertical positions of the eyes and mouth). The system searches for one dark region in each half of the nose feature block. The center point of each region is taken as the nostril feature point.

Detecting feature points on Mouth: In this system, we define six feature points on the mouth, where two are the corners and the rest represent the upper and lower lips. To calculate each of their locations, we have to estimate the searching area on the image. This is done by fixing the searching boundary based
on the known eye and nostril coordinates, shown as the black segment in Figure 4. The luminance of the lips is the most easily recognized characteristic for mouth detection. Thus, this informative color can be considered the best choice for such a task on a color image, rather than the edge extraction method adopted in [4], which is susceptible to noise that can lead to fallacious detection results. Our method applies the HSV or RGB color space within the mentioned bounding box to extract the segment where lips-likelihood pixels exist. The two feature points on the upper and lower lips can then easily be derived at the mid x value between the nostrils; the y value of each point is extracted from the lips-likelihood segment, as illustrated in Figure 4. Similar to the search method for detecting the corners of the eyes, the two corner points of the lips are determined using the Harris corner detector, one in each half of the mouth feature block, and they must satisfy certain geometric constraints. Detecting feature points on Eyebrows: The eyebrow is one of the more difficult facial objects to detect and track, owing to its uneven shape and thickness, besides the problem of being occluded by or misrecognized as forelock hair. Few facial feature detection studies give a clear solution for it. We represent one eyebrow with three points: one at the middle and two at its ends. Figures 5 and 6 illustrate the detection mechanism explained below. First, the middle position of the eyebrow has to be determined. The searching boundary is estimated between the beginning y value of the skin-likelihood region, y_start, and the beginning of the eyes' bounding box, y_end. This bounding area is denoised with a Gaussian convolution. Starting from y_end towards y_start, a searching-for-edge function is run at the midpoint between the two corners of the eye. The first point encountered with a significant intensity change (from bright to dark) is taken as the lower edge of the eyebrow. Once this point is detected, we search for the upper edge of the eyebrow, which is the first point encountered with a significant intensity change (from dark to bright). Averaging the y values of these two points, we mark the result as the mid feature point of the eyebrow. Both ends of an eyebrow are detected by running the detection function from the known mid eyebrow point towards both ends horizontally. This
Fig. 6. Eyebrow corner points localization
This bounding area is converted to an edge image. At every incremented or decremented x location, the upper and lower edges (y values) of the eyebrow are determined. The search continues, taking the previous average y value as the starting y location at the next x value, until one of two conditions is met: (i) the upper and lower edges come closer to each other than a certain threshold, or (ii) the edges drift farther apart than the thickness of the eyebrow at its center point. The head is represented as a 3D ellipsoidal model fitted onto the facial region in the image. The depth value of each facial feature point is calculated from its x and y position so that the point lies on the surface of the ellipsoidal head.
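To make the eyebrow edge search concrete, the following sketch scans a single smoothed image column upward from the eye toward the hairline, as described above. It is only an illustration of the idea, not the authors' implementation; the function name, the gradient threshold, and the coordinate convention (row 0 at the top, so y_start < y_end) are our own assumptions.

```python
import numpy as np

def find_eyebrow_mid_point(gray, x_mid, y_start, y_end, grad_thresh=25):
    """Scan one Gaussian-smoothed grayscale column upward from the eye box
    (row y_end) toward the skin region top (row y_start) and return the
    vertical midpoint of the eyebrow, or None if no edge pair is found."""
    column = gray[:, x_mid].astype(np.int32)
    lower_edge = upper_edge = None
    # Move upward (decreasing y) looking for a bright-to-dark transition (lower edge).
    for y in range(y_end, y_start, -1):
        if column[y - 1] - column[y] < -grad_thresh:   # pixel above is darker
            lower_edge = y
            break
    if lower_edge is None:
        return None
    # Continue upward looking for a dark-to-bright transition (upper edge).
    for y in range(lower_edge - 1, y_start, -1):
        if column[y - 1] - column[y] > grad_thresh:    # pixel above is brighter
            upper_edge = y
            break
    if upper_edge is None:
        return None
    return (lower_edge + upper_edge) // 2              # y value of the mid feature point
```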
3.3 Facial Features Motion Capture
Facial expression, i.e., the motion of the facial features, has been treated as a non-rigid motion problem only very recently; most earlier works either model it with a two-dimensional affine approximation or leave it untouched. The proposed solutions include, but are not limited to, physically based head models (polygonal or mesh) of varying complexity, local parametric flow models, and optical flow computation combined with model-based constraints. Aiming for a robust yet less computation-intensive approach, this system takes a middle path between the simple template-based affine approximation and intricate physical or muscle-structure model-based tracking. Adopting an idea similar to the one proposed in [1], we tackle rigid head motion and non-rigid facial expression tracking separately. Rigid Head Motion Estimation: The 2D image flow field in the head region between two successive frames (excluding the sparse motion field in the facial feature regions of the eyebrows, eyes and mouth) is calculated from the constant-brightness constraint over time coupled with the Lucas-Kanade technique [6]. This 2D flow is then interpreted as the 3D rigid motion of the head using the depth constraint given by the current pose of the ellipsoidal head model. The 3D rigid motion updates the current pose of the head at every frame. Non-Rigid Facial Features Motion Estimation: This module inherits the 3D motion parameters from the rigid tracking module to predict the facial feature points in the next frame. The facial feature points are also tracked individually using the NCC method. The displacement between the predicted position and the tracked one denotes the relative motion with respect to the head as the reference coordinate frame; if the displacement is large, expression deformation has occurred.
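The sketch below illustrates how these two modules could be realized with standard tools: sparse Lucas-Kanade flow for the rigid part and normalized cross-correlation template matching for the individual feature points. It is not the authors' code; the interpretation of the 2D flow as 3D rigid motion on the ellipsoid is omitted, and the function names, window sizes, and the use of OpenCV are assumptions.

```python
import cv2
import numpy as np

def rigid_flow(prev_gray, curr_gray, head_points):
    """Sparse Lucas-Kanade flow at sample points inside the head region
    (the eyebrow/eye/mouth blocks are assumed to be excluded beforehand)."""
    pts = head_points.astype(np.float32).reshape(-1, 1, 2)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    good = status.ravel() == 1
    return (nxt[good] - pts[good]).reshape(-1, 2)      # 2D flow vectors

def track_feature_ncc(prev_gray, curr_gray, pt, templ_size=11, search=15):
    """Track one facial feature point by normalized cross-correlation.
    Assumes the point lies far enough from the image border."""
    x, y = int(pt[0]), int(pt[1])
    h = templ_size // 2
    templ = prev_gray[y - h:y + h + 1, x - h:x + h + 1]
    win = curr_gray[y - h - search:y + h + search + 1,
                    x - h - search:x + h + search + 1]
    res = cv2.matchTemplate(win, templ, cv2.TM_CCOEFF_NORMED)
    _, _, _, max_loc = cv2.minMaxLoc(res)
    # Convert the best-match location back to image coordinates.
    return (x - search + max_loc[0], y - search + max_loc[1])
```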
4
Articulated Model Solutions
The use of articulated models to represent the human body and hand is common in model-based tracking systems [8]. Fitting the model to an image sequence produces a pose sequence, i.e., the motion, of the body and hand. In this paper, we
Fig. 7. Left: body wireframe model and the tree structure of the body parts; right: fingers wireframe model and its tree structure
adopt the motion capture method proposed in [9]. We briefly explain the method here; the details can be found in [9]. Our body model consists of 16 parts arranged in a tree structure with the waist part as the root of the model, as shown in Figure 7, left. The same concept used for modeling the body figure is extended to the fingers: we define a similar structure with 16 parts, as illustrated in Figure 7, right, with the root of this model placed at the palm. The tree structure denotes the connection from parent to child by the arrow direction. Each part has its own coordinate system, in which one axis is aligned along the body axis and the origin is located at the joint with its parent part (at the center of gravity for the waist and chest). We manually adjust the pose of the articulated model to fit it onto the initial frame of each body and finger sequence. Automatic model fitting remains a further issue; it should be possible, since existing methods, e.g. [7], can be utilized for detecting the human body in an image and determining its 3D pose. A pose displacement of the human body can be estimated from the difference between successive frames. Therefore, after obtaining a pose at the initial frame by model fitting, the pose at any frame can be obtained by accumulating the successive pose displacements onto the initial pose. The estimation of the pose displacement is based on the facts that (1) optical flow is constrained by a spatio-temporal linear equation, (2) a 3D translation vector on the model is approximated as a sum of pose displacements weighted by the Jacobian matrix, and (3) the depth of the human body can be obtained from the model fitted to the body. Chain substitutions based on (1), (2) and (3) produce a system of linear equations with the pose displacements as unknowns. Solving this system and accumulating the obtained pose displacements onto the initial pose yields the human motion. However, this approach suffers from pose drift as an inherent drawback. To cope with this drift, additional
model fitting is performed manually at several key-frames, and the accumulated pose at each in-between frame is corrected by propagating the poses given at the key-frames [9].
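A minimal sketch of the solving and accumulation steps is given below, assuming the per-pixel Jacobians and spatio-temporal gradients have already been computed from the fitted model and its depth; the exact formulation in [9] differs in detail, and all names here are illustrative.

```python
import numpy as np

def estimate_pose_displacement(J_list, flow_constraints):
    """Assemble and solve the linear system in the pose displacements dq.

    J_list           : per-pixel 2 x n_dof Jacobians mapping a pose displacement
                       to image motion (from the fitted model and its depth).
    flow_constraints : per-pixel (gx, gy, gt) spatio-temporal gradients giving
                       the optical-flow constraint gx*u + gy*v + gt = 0.
    """
    rows, rhs = [], []
    for J, (gx, gy, gt) in zip(J_list, flow_constraints):
        rows.append(np.array([gx, gy]) @ J)   # one row of the linear system
        rhs.append(-gt)
    A = np.vstack(rows)
    b = np.asarray(rhs)
    dq, *_ = np.linalg.lstsq(A, b, rcond=None)
    return dq

def accumulate_poses(initial_pose, displacements):
    """Accumulate per-frame pose displacements onto the initial pose."""
    poses = [np.asarray(initial_pose, dtype=float)]
    for dq in displacements:
        poses.append(poses[-1] + dq)           # drift accumulates here, hence key-frames
    return poses
```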
5
Animation of the Simultaneous Motion
A general humanoid polygonal model is deformed automatically according to the initial pose acquired from the earlier initialization module. Since this animation step focuses on reconstructing a humanoid model and on restoring and animating the simultaneous motion captured from the actor, no human motion kinematics is applied. The finger and facial feature models are installed on the body figure model to reconstruct a full humanoid model. However, the initial sizes of the body parts estimated from the different image sequences are inconsistent. Therefore, several steps are taken for the integration: (i) The system identifies the parent or replacement part for each model: the head is the parent part for the facial features, as the palm is for the finger model. The system locates the center of the ellipsoidal head on the body figure model and makes it the origin of the coordinate system for the facial feature points. For both hands, the finger models simply replace the hand parts and are installed at the origins of their respective local coordinate systems, Σ_LeftHand and Σ_RightHand, on the body figure model. (ii) Model size scaling: letting the three-axis sizes of the initial ellipsoidal head in the body figure and facial image sequences be H_body = (x, y, z) and H_face = (x, y, z) respectively, we rescale each feature position by H_body/H_face along the x, y and z axes. The same calculation is applied to the finger models.
5.1 Reconstruction of the Full Human Model
By utilizing the camera-referenced motion data of all the targeted subjects acquired in the preceding motion capture stage, the system automatically constructs a full human model and realizes the simultaneous motions of each body part on it. However, the facial feature, finger and body figure models each have their own structure, and their poses are expressed in different camera coordinate systems. Thus, to assemble these different body parts into one complete human model, they need to be referenced in the same coordinate system, namely the world coordinate system. In this paper, the camera coordinate system of the human body sequence is taken as the world coordinate system. The hand model shares a palm with the body model, and the face model shares a head with the body model. Therefore, if the parent of the fingers is changed from the palm of the hand model to the palm of the body model, and the parent of the facial features is changed from the head of the face model to the head of the body model, a full body model can be established without further camera calibration. We illustrate the transformation from local camera coordinates to world coordinates with the example of the little fingertip.
Fig. 8. The tracking result of each face, body figure, right and left hand image sequence, at frame 56, 101 and 197
To assemble the hand models on the body model, let ${}^{j}T_{i}$ denote the transformation from the coordinate system $\Sigma_i$ of part $i$ to $\Sigma_j$ of part $j$. Let parts $1, 2, \dots, 8$ be the waist, chest, upper arm, forearm, hand (alias palm), finger seg. 1, finger seg. 2, and finger seg. 3 (alias fingertip), respectively, and let $\Sigma_0$ denote the world coordinate system. The body camera can capture the motion up to the palm. The transformation from the palm to the world coordinate system is given by

${}^{0}T_{1}\,{}^{1}T_{2}\,{}^{2}T_{3}\,{}^{3}T_{4}\,{}^{4}T_{5}.$   (1)

Meanwhile, the hand camera captures the hand motion from the palm to the fingertip. The transformation from the fingertip to the hand camera coordinate system is given by

${}^{h}T_{5}\,{}^{5}T_{6}\,{}^{6}T_{7}\,{}^{7}T_{8},$   (2)

where $h$ denotes the hand camera coordinate system. To represent the pose of the fingertip in the world coordinate system, appending ${}^{5}T_{6}\,{}^{6}T_{7}\,{}^{7}T_{8}$ from eq. (2) to eq. (1) gives the transformation

${}^{0}T_{1}\,{}^{1}T_{2}\,{}^{2}T_{3}\,{}^{3}T_{4}\,{}^{4}T_{5}\,{}^{5}T_{6}\,{}^{6}T_{7}\,{}^{7}T_{8}.$   (3)

Similarly, the face model is embedded into the full human model.
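Assembling eq. (3) amounts to composing homogeneous transforms, as in the following sketch; the identity matrices are placeholders for the per-frame transforms estimated by the tracker, and the variable names are ours.

```python
import numpy as np

def compose(*transforms):
    """Compose 4x4 homogeneous transforms left to right: T1 @ T2 @ ... @ Tn."""
    T = np.eye(4)
    for t in transforms:
        T = T @ t
    return T

# Placeholder transforms (identity); in practice each iTj comes from the tracker.
T01, T12, T23, T34, T45 = (np.eye(4) for _ in range(5))   # waist ... palm chain
T56, T67, T78 = (np.eye(4) for _ in range(3))             # palm ... fingertip chain

T_world_palm = compose(T01, T12, T23, T34, T45)           # eq. (1)
T_world_tip = compose(T_world_palm, T56, T67, T78)        # eq. (3): append the finger chain
```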
6
Experimental Results
To prepare the motion image sequences for the experiments, we carried out the image acquisition in the laboratory without constraints on the background scene or the actor's costume. The actor is required to start in a front-view position but is free to move throughout the 200-frame recording, under the conditions that the facial features and fingers remain visible and the motion is smooth. Two sets of motion were recorded with this multi-camera system as 320×240-pixel bitmap image sequences, in which the actor (i) performed simple hand and face movements, and (ii) played air guitar while singing. The tracking results at different frames of the four image sequences (facial features, whole body figure, and both hands) are shown in Figure 8, while Figure 9 illustrates the reconstructed model's appearance and the motion recovered at the same frames. The average processing time is about 30 seconds per frame, with which the 200-frame motion capture and animation process takes less than 15 minutes to complete.
Fig. 9. Animation reconstructed on the humanoid model at frame 56, 101 and 197
7
Conclusion
In this paper, we have presented a novel approach to full human motion capture that simultaneously covers the motion of the whole body figure, the fingers of both hands and the facial features, and that requires only four cameras for recording. Our major contributions lie in several aspects. The main one is the novel use of only four cameras to capture the full motion of an actor. We also built a foundation platform for the concurrent motion estimation of the body figure, fingers and facial expression solely from single-view image sequences, together with motion reconstruction on a full humanoid model. In addition, we proposed several alternatives for facial feature detection and for 3D rigid head pose and non-rigid feature motion estimation. Through the animation results, we have demonstrated that our approaches provide a concise description of the human motion of different body parts and are feasible for constructing humanoid model animations. Acknowledgments. We thank Hideaki Sasagawa for the hand tracking work.
References 1. Black, M.J., Yacoob, Y.: Recognizing Facial Expressions In Image Sequences Using Local Parameterized Models Of Image Motion. IJCV 25(1), 23–48 (1997) 2. Brunelli, R., Poggio, T.: Face Recognition: Features versus Templates. IEEE PAMI 15(10), 1042–1052 (1993) 3. Chuang, M.M., Chang, R.F., Huang, Y.L.: Automatic Facial Feature Extraction In Model-Based Coding. Journal of Information Science And Engineering 16, 447–458 (2000) 4. Feris, R.S., De Campos, T.E., Junior, R.M.C.: Detection and Tracking Of Facial Features In Video Sequences. In: Cair´ o, O., Cant´ u, F.J. (eds.) MICAI 2000. LNCS, vol. 1793, pp. 197–206. Springer, Heidelberg (2000) 5. Gu, H., Su, G.: Feature Points Extraction From Faces. Image and Vision Computing NZ (2003) 6. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proc. Imaging Understanding Workshop, pp. 121–130 (1981) 7. Mori, G., Malik, J.: Recovering 3D human body configurations using shape contexts. IEEE PAMI 7(28), 1052–1062 (2006) 8. Wang, J.J., Singh, S.: Video analysis of human dynamics - a survey. Real-Time Imaging 9(5), 321–346 (2003) 9. Yamamoto, M., Ohta, Y., Yamagiwa, T., Yagishita, K., Yamanaka, H., Ohkubo, H.: Human Action Tracking Guided by Key-Frames. FG2000, 354–361 (2000) 10. Zhao, W., Chellappa, R., Phillips, P.J., Rosenfeld, A.: Face recognition: A literature survey. ACM Computiong Surveys 35(4), 299–458 (2003) 11. http://www.Sony.Co.Jp/Products/ISP/Products/Model/Ptz/EVID30.Html 12. Evaluation of RGB and HSV models in Human Faces Detection, [Online] (2004) Available, http://www.cescg.org/CESCG-2004,/web/Sedlacek-Marian/
Tracking and Classifying of Human Motions with Gaussian Process Annealed Particle Filter Leonid Raskin, Michael Rudzsky, and Ehud Rivlin Computer Science Department, Technion—Israel Institute of Technology, Haifa, Israel, 32000 {raskinl,rudzsky,ehudr}@cs.technion.ac.il
Abstract. This paper presents a framework for 3D articulated human body tracking and action classification. The method is based on nonlinear dimensionality reduction of the high-dimensional data space to a low-dimensional latent space. The motion of the human body is described by a concatenation of low-dimensional manifolds that characterize different motion types. We introduce a body pose tracker that uses the learned mapping function from the low-dimensional latent space to the high-dimensional body pose space. The trajectories in the latent space provide low-dimensional representations of the body poses performed during motion and are used to classify human actions. The approach was evaluated on the HumanEva dataset as well as on our own dataset. The results and a comparison to other methods are presented.
1
Introduction
Human body pose estimation and tracking is a challenging task for several reasons. The main problem that has to be solved in order to achieve satisfactory pose tracking and understanding is the high dimensionality of the human pose model, which complicates the examination of the entire subject and makes it harder to detect each body part separately. Despite the high dimensionality of the problem, many poses can be represented in a low-dimensional space obtained by dimensionality reduction, and human body motions can be displayed as curves in this space. This space can be obtained by learning from different motion types [1,2]. This paper presents an approach to 3D people tracking and motion analysis in which we apply nonlinear dimensionality reduction using the Gaussian Process Dynamical Model (GPDM) [3,4] and the annealed particle filter [5]. GPDM is better able to capture the properties of high-dimensional motion data than linear methods such as PCA. It generates a mapping function from the low-dimensional latent space to the full data space, learned from previously observed poses of different motion types. For tracking we separate the model state into two independent parts: one contains information about the 3D location and orientation of the body, and the second describes the pose. We learn a latent space that describes poses only. The tracking algorithm consists of two stages. First, particles are generated in the latent space and are transformed into the data space using the a priori learned mapping function.
Second, we add rotation and translation parameters to obtain valid poses. The likelihood function is calculated in order to evaluate how well a pose matches the visual data. The resulting tracker estimates the locations in the latent space that represent the poses with the highest likelihood. As the latent space is learned from pose sequences of different motion types, each action is represented by a curve in the latent space. The classification of the motion is based on comparing the sequence of latent coordinates produced by the tracker with the sequences that represent poses of the different motion types. We use a modified Fréchet distance [6] to compare the pose sequences. This approach also allows actions different from those used for learning to be introduced, by exploiting the curves that represent them. We show that our tracking algorithm provides good results even at low frame rates. An additional advantage of our tracking algorithm is its capability to recover after a temporary loss of the target. We also show that the task of action classification, when performed in the latent space, is robust.
2
Related Works
One of the common approaches to tracking is the particle filter. This method uses multiple predictions, obtained by drawing samples from a pose and location prior and propagating them using a dynamic model, which are then refined by comparing them with the local image data through a likelihood computation [7]. The prior is typically quite diffuse (because motion can be fast), but the likelihood function may be very peaky, containing multiple local maxima which are hard to account for in detail [8]. Annealed particle filtering [5,19] or local searches are ways to attack this difficulty. An alternative is to apply a strong model of dynamics [9]. There exist several possible strategies for reducing the dimensionality of the configuration space. First, it is possible to restrict the range of movement of the subject [10]; because of the restrictive assumptions, the resulting trackers are not capable of tracking general human poses. Another way to cope with a high-dimensional data space is to learn low-dimensional latent variable models [11]. However, methods such as Isomap [12] and locally linear embedding (LLE) [13] do not provide a mapping between the latent space and the data space, and therefore Urtasun et al. [14] proposed using a form of probabilistic dimensionality reduction by GPDM [3,4] to formulate tracking as a nonlinear least-squares optimization problem. During the last decade many different methods for behavior recognition and classification of human actions have been proposed. Popular methods are based on Hidden Markov Models (HMM), Finite State Automata (FSA), stochastic context-free grammars (SCFG), etc. Sato et al. [15] presented a method that extracts human trajectory patterns to identify interactions. Park et al. [16] proposed a method using a nearest-neighbor classifier for the recognition of two-person interactions such as hand-shaking, pointing, and standing hand-in-hand. Hongeng et al. [17] proposed probabilistic finite state automata for recognition
of a sequential occurrence of several scenarios. Park et al. [18] presented a recognition method that combines model-based tracking and deterministic finite state automata. This paper is organized as follows. Section 3 describes the tracking algorithm. Section 4 describes the classification algorithm. Section 5 shows the experimental results for tracking and action classification of different data sets and motion types.
3 Tracking
3.1 GPAPF Tracker
The drawback of the annealed particle filter tracker [5] is that the high dimensionality of the state space increases the number of particles that need to be generated in order to preserve the same particle density. This makes the algorithm computationally ineffective for low frame rate videos (30 fps and lower). The other problem is that once the target is lost (i.e., the body pose is wrongly estimated, which can happen for fast and non-smooth movements), it becomes highly unlikely that the pose will be estimated correctly in the following frames. In order to reduce the dimension of the space, we propose the Gaussian Process Annealed Particle Filter (GPAPF). Using the Gaussian Process Dynamical Model (GPDM) [3,4], we embed several types of poses into a low-dimensional space. We used two- and three-dimensional spaces, which proved sufficient for robust tracking and classification. The poses are taken from different sequences, such as walking, running, punching and kicking. We divide the state into two independent parts. The first part contains the global 3D body rotation and translation, which are independent of the actual pose. The second part contains only information about the pose (26 DoF). We use GPDM to reduce the dimensionality of this second part and thereby construct a latent space (Fig. 1) of significantly lower dimensionality (for example 2 or 3 DoF). The latent space includes solely pose information and is therefore rotation and translation invariant. For the tracking task we use a modified annealed particle filter [5] with a two-stage algorithm. The first stage, which is the main modification of the tracking algorithm, generates new particles in the latent space. We then apply the learned mapping function that transforms latent coordinates into the data space. As a result, after adding the translation and rotation information, we construct 31-dimensional vectors that describe a valid data state, including location and pose information, in the data space. In order to estimate how well a pose matches the images, the likelihood function is calculated [19]. Suppose we have M annealing layers. The state is defined as a pair Γ = {Λ, Ω}, where Λ is the location information and Ω is the pose information. We also define ω as the latent coordinates corresponding to the data vector Ω: Ω = ℘(ω), where ℘ is the mapping function learned by the GPDM. Λn,m, Ωn,m and ωn,m denote the location, pose vector and corresponding latent coordinates at frame n and annealing layer m.
Fig. 1. The latent space that is learned from different motion types. (a) 2D latent space from 3 different motions: lifting an object (red), kicking with the left (green) and the right (magenta) legs. (b) 3D latent space from 3 different motions: hand waving (red), lifting an object (magenta), kicking (blue), sitting down (black), and punching (green).
For each 1 ≤ m ≤ M − 1, Λn,m and ωn,m are generated by adding a multi-dimensional Gaussian random variable to Λn,m+1 and ωn,m+1, respectively. Then Ωn,m is calculated from ωn,m. The full body state Γn,m = {Λn,m, Ωn,m} is projected onto the cameras, and the likelihood πn,m is calculated using the likelihood function. The main difficulty is that the latent space is not uniformly distributed, and sequential poses may not be close to each other in the latent space. Therefore we use a dynamic model, as proposed by Wang et al. [4], in order to achieve smooth transitions between sequential poses in the latent space. However, some irregularities and discontinuities remain. Moreover, in the latent space each pose has a certain probability of occurring, and the probability of drawing it as a hypothesis should depend on it. For each location in the latent space, a variance can be estimated and used for generating the new particles. In Fig. 1(a) the lighter pixels represent lower variance, which depicts the regions of the latent space that correspond to more likely poses. An additional modification concerns the way the optimal configuration is calculated. In the original annealed particle filter, the optimal configuration is obtained by averaging over the particles in the last layer. However, as the latent space is not Euclidean, applying this method to ω produces poor results. We propose to calculate the optimal configuration in the data space and then project it back to the latent space: first we apply ℘ to all the particles to generate vectors in the data space, then we calculate the weighted average of these vectors in the data space and project it back to the latent space. This can be written as $\omega_n = \wp^{-1}\left(\sum_{i=1}^{N} \pi_{n,0}^{(i)}\, \wp\bigl(\omega_{n,0}^{(i)}\bigr)\right)$.
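The following sketch outlines one annealing layer of this two-stage scheme, including the back-projection of the weighted data-space mean. The GPDM mapping, its inverse projection, and the likelihood function are assumed to be supplied externally, and the structure is a simplification of the actual GPAPF tracker rather than the authors' implementation.

```python
import numpy as np

def gpapf_layer(latent_particles, loc_particles, weights, sigma_latent, sigma_loc,
                gpdm_map, gpdm_inverse_map, likelihood):
    """One annealing layer of a GPAPF-style tracker (sketch).

    gpdm_map         : latent coordinates -> 26-DoF pose vector (learned GPDM mapping).
    gpdm_inverse_map : projection of a pose vector back to the latent space.
    likelihood       : function of (location, pose) returning an image likelihood.
    """
    n = len(latent_particles)
    # Resample according to the previous layer's weights.
    idx = np.random.choice(n, size=n, p=weights / weights.sum())
    # Diffuse in the low-dimensional latent space and in the location space.
    lat = latent_particles[idx] + np.random.randn(n, latent_particles.shape[1]) * sigma_latent
    loc = loc_particles[idx] + np.random.randn(n, loc_particles.shape[1]) * sigma_loc
    # Map latent particles to full poses and weight them against the images.
    poses = np.array([gpdm_map(z) for z in lat])
    w = np.array([likelihood(l, p) for l, p in zip(loc, poses)])
    w /= w.sum()
    # Optimal pose: weighted average in the data space, projected back to latent space.
    mean_pose = (w[:, None] * poses).sum(axis=0)
    omega_hat = gpdm_inverse_map(mean_pose)
    return lat, loc, w, omega_hat
```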
Fig. 2. Losing and recovering the tracked target despite mis-tracking in the previous frame
pose was estimated correctly, the tracker will be able to choose the most suitable one from the tested poses. At the same time, if the pose in the previous frame was miscalculated, the tracker will still consider poses that are quite different. As these poses are expected to receive higher values of the weighting function, the next annealing layers will generate many particles from them. In this way the pose is likely to be estimated correctly despite the mis-tracking in the previous frame, as shown in Fig. 2. Another advantage of our approach is that the generated poses are, in most cases, natural. In the case of CONDENSATION or the annealed particle filter, the large variance in the data space can cause the generation of unnatural poses. Poses produced from latent-space points with low variance are usually natural, and therefore the number of effectively used particles is higher, which enables more accurate tracking.
3.2 Obtaining a Better Tracker
The problem with such a two-stage approach is that the Gaussian field is not capable of describing all possible poses. As mentioned above, the approach resembles using probabilistic PCA to reduce the data dimensionality. However, for tracking we are interested in obtaining a pose estimate as close as possible to the actual pose. Therefore, we add an additional annealing layer as the last step. This layer consists of only one stage. We use the data states generated in the previous two-stage annealing layer to generate data states for this layer. This is done with very low variances in all dimensions, which are practically the same for all actions, since the purpose of this layer is to make only slight changes to the final estimated pose. Thus it does not depend on the actual frame rate, in contrast to the original annealed particle filter tracker, where a change of frame rate requires updating the model parameters (the variances for each layer).
4
Action Classification
The classification of the actions is based on the sequences of poses detected by the tracker during the performed motion. We use the Fréchet distance [6] to determine the class of the motion, i.e., walking, kicking,
waving, etc. The Fréchet distance between two curves measures the resemblance of the curves, taking their direction into consideration, and is quite tolerant to position errors. Suppose there are K different motion types. Each type k is represented by a model Mk, which is a sequence of lk + 1 latent coordinates Mk = {μ0, ..., μlk}. The GPAPF tracker generates a sequence of l + 1 latent coordinates Γ = {ϕ0, ..., ϕl}. We define a polygonal curve P^E as a continuous, piecewise linear curve made of segments connecting the vertices E = {v0, ..., vn}. The curve can be parameterized by α ∈ [0, n], where P^E(α) refers to a position on the curve, P^E(0) denotes v0, and P^E(n) denotes vn. The distance between two curves is defined as

$F\bigl(P^{M_k},P^{\Gamma}\bigr)=\min_{\alpha,\beta}\,\bigl\{\,f\bigl(P^{M_k}(\alpha),P^{\Gamma}(\beta)\bigr)\;:\;\alpha:[0,1]\to[0,l_k],\;\beta:[0,1]\to[0,l]\,\bigr\},$

where $f\bigl(P^{M_k}(\alpha),P^{\Gamma}(\beta)\bigr)=\max\bigl\{\,\|P^{M_k}(\alpha(t))-P^{\Gamma}(\beta(t))\|_2\;:\;t\in[0,1]\,\bigr\}$ and α(t) and β(t) range over continuous, increasing functions with α(0) = 0, α(1) = lk, β(0) = 0, β(1) = l. The model with the smallest distance is chosen as the type of the action. While the Fréchet distance is hard to compute in general, Alt et al. [6] have presented an efficient algorithm for computing it between two piecewise linear curves.
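In practice the latent trajectories are sampled point sequences, so a discrete Fréchet distance computed by dynamic programming is a common stand-in for the continuous definition above. The sketch below uses that approximation rather than the algorithm of Alt et al. [6], and the function names are ours.

```python
import numpy as np

def discrete_frechet(P, Q):
    """Discrete Frechet distance between two polygonal curves given as
    (n, d) and (m, d) arrays of vertices (dynamic-programming formulation)."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    n, m = len(P), len(Q)
    dist = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)
    ca = np.empty((n, m))
    for i in range(n):
        for j in range(m):
            d = dist[i, j]
            if i == 0 and j == 0:
                ca[i, j] = d
            elif i == 0:
                ca[i, j] = max(ca[0, j - 1], d)
            elif j == 0:
                ca[i, j] = max(ca[i - 1, 0], d)
            else:
                ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]), d)
    return ca[n - 1, m - 1]

def classify(track, models):
    """Return the index of the model curve closest to the tracked latent trajectory."""
    return int(np.argmin([discrete_frechet(track, M) for M in models]))
```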
5
Results
We tested the GPAPF tracking algorithm on the HumanEva dataset. The dataset contains different activities, such as walking and boxing, and provides the correct 3D locations of body joints, such as the hips and knees, for evaluating the results and comparing to other tracking algorithms. We compared the results produced by the GPAPF tracker with those produced by the annealed particle filter body tracker [20]. The error measures the average 3D distance between the joint locations provided by the MoCap system and those estimated by the tracker [20]. Fig. 3 shows the actual poses estimated for this sequence, projected onto the first and second cameras: the first two rows show the results of the GPAPF tracker, and the last two rows show the results of the annealed particle filter. Fig. 4(a) shows the error graphs produced by the GPAPF tracker (blue circles) and by the annealed particle filter (red crosses) for the walking sequence captured at 30 fps. The graph suggests that the GPAPF tracker produces more accurate estimates. We also compared the performance of the tracker with and without the additional annealing layer, using 5 double-stage annealing layers in both cases and adding one single-stage layer for the second tracker. Fig. 4(b) shows the errors of the GPAPF tracker with the additional layer (blue circles) and without it (red crosses); Fig. 5 shows sample poses projected onto the cameras. The improvement is not dramatic, which is explained by the fact that the difference between the pose estimated using only the latent-space annealing and the actual pose is not very large. This suggests that the latent space accurately represents the data space. We also created a database containing videos with similar actions performed by a different actor. The
Fig. 3. Tracking results of annealed particle filter tracker and GPAPF tracker. Sample frames from the walking sequence. First row: GPAPF tracker, first camera. Second row: GPAPF tracker, second camera. Third row: annealed particle filter tracker, first camera. Forth row: annealed particle filter tracker, second camera.
Fig. 4. (a) The errors of the annealed tracker (red crosses) and the GPAPF tracker (blue circles) for a walking sequence captured at 30 fps. (b) The errors of the GPAPF tracker with the additional annealing layer (blue circles) and without it (red crosses) for a walking sequence.
frame rate was 15 fps. We manually marked some of the sequences in order to produce the training sets needed for GPDM. After learning, we validated the results on the other sequences containing the same behavior. We experimented with different numbers of particles. With 100 particles per layer, the computational cost was 30 seconds per frame. Using the same number of particles and layers, the annealed particle filter algorithm takes 20 seconds per frame. However, the annealed particle filter was not capable of tracking the body pose with such a low number of particles at 30 fps and 15
Fig. 5. GPAPF algorithm with (a) and without (b) additional annealed layer
Fig. 6. Tracking results of annealed particle filter tracker and GPAPF tracker. Sample frames from the running, leg movements and object lifting sequences.
fps. Therefore, we had to increase the number of particles used in the annealed particle filter to 500.
5.1 Motion Classification
The classification algorithm was tested on two different data sets. The first set contained 3 different activities: (1) lifting an object, kicking with (2) the left and (3) the right leg. For each activity 5 different sequences were captured. We have used one sequence for each motion type in order to construct the models. The latent space was learned based on the poses in these models (Fig. 1.a). The latent space had a clear and very distinguishable separation between these 3 actions. Therefore, although the results of the tracker contained much noise as shown in Fig. 7, the algorithm was able to perform perfect classification. The second set contained 5 different activities: (1) hand waving, (2) lifting an object, (3) kicking, (4) sitting down, and (5) punching. Once again 5 different sequences were captured for each activity. The cross validation procedure was used to classify the sequences (see Fig. 1.b). The accuracies of the classification,
Fig. 7. Tracking trajectories in the latent space for different activities: (a) lifting an object, kicking with (b) the left and (c) the right leg. In each image the black lines represent incorrect activities, the red line represents the correct one, and the other colored lines represent the trajectories produced by the GPAPF tracker.
Table 1. The accuracies of the classification for 5 different activities: hand waving, object lifting, kicking, sitting down, and punching. The rows represent the correct motion type; the columns represent the classification results.

                 Hand waving  Object lifting  Kicking  Sitting down  Punching
Hand waving           15            0            0           0           5
Object lifting         0           17            0           3           0
Kicking                0            0           20           0           0
Sitting down           0            3            1          16           0
Punching               6            0            0           0          14
as shown in Table 1, are 75, 85, 100, 80 and 70 percent for activities (1)-(5), respectively. The lower classification rates of the actions involving hand gestures are due to the similarity between these actions. The lower classification rates of the sitting down and object lifting actions are due to strong self-occlusions, which caused the tracker to estimate the actual poses incorrectly.
6
Conclusion and Future Work
In this paper we have introduced an approach to articulated body tracking and human motion classification using a low-dimensional latent space. The latent space is constructed from pose samples of different motion types. The tracker generates trajectories in the latent space, which are classified using the Fréchet distance. An interesting problem that has not yet been solved is the classification of interactions between multiple actors. The main difficulty is constructing the latent space: while a single person's poses can be described in a low-dimensional space, this may not be the case for multiple people.
References 1. Christoudias, C.M., Darrell, T.: On modelling nonlinear shape-and-texture appearance manifolds. In: Proc. CVPR, vol. 2, pp. 1067–1074 (2005) 2. Elgammal, A., Lee, C.: Inferring 3d body pose from silhouettes using activity manifold learning. In: Proc. CVPR, vol. 2, pp. 681–688 (2004) 3. Lawrence, N.: Gaussian process latent variable models for visualization of high dimensional data. In: NIPS. Information Processing Systems, vol. 16, pp. 329–336 (2004) 4. Wang, J., Fleet, D., Hetzmann, A.: Gaussian process dynamical models. In: NIPS. Information Processing Systems, pp. 1441–1448 (2005) 5. Deutscher, J., Blake, A., Reid, I.: Articulated body motion capture by annealed particle filtering. In: Proc. CVPR, pp. 2126–2133 (2000) 6. Alt, H., Knauer, C., Wenk, C.: Matching polygonal curves with respect to the fr`echet distance. In: Ferreira, A., Reichel, H. (eds.) STACS 2001. LNCS, vol. 2010, pp. 63–74. Springer, Heidelberg (2001) 7. Isard, M., Blake, A.: Condensation - conditional density propagation for visual tracking. International Journal of Computer Vision 29(1), 5–28 (1998) 8. Sidenbladh, H., Black, M., Fleet, D.: Stochastic tracking of 3d human figures using 2d image motion. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1842, pp. 702–718. Springer, Heidelberg (2000) 9. Mikolajczyk, K., Schmid, K., Zisserman, A.: Human detection based on a probabilistic assembly of robust part detectors. In: Proc. ECCV, vol. 1, pp. 69–82 (2003) 10. Rohr, K.: Human movement analysis based on explicit motion models. MotionBased Recognition 8, 171–198 (1997) 11. Wang, Q., Xu, G., Ai, H.: Learning object intrinsic structure for robust visual tracking. In: Proc. CVPR, vol. 2, pp. 227–233 (2003) 12. Tenenbaum, J., de Silva, V.: A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323 (2000) 13. Roweis, S., Saul, L.: Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326 (2000) 14. Urtasun, R., Fleet, D., Fua, P.: 3d people tracking with gaussian process dynamical models. In: Proc. CVPR, vol. 1, pp. 238–245 (2006) 15. Sato, K., Aggarwal, J.: Recognizing two-person interactions in outdoor image sequences. In: IEEE Workshop on Multi-Object Tracking, IEEE Computer Society Press, Los Alamitos (2001) 16. Park, S., Aggrawal, J.: Recognition of human interactions using multiple features in a grayscale images. In: Proc. ICPR, vol. 1, pp. 51–54 (2000) 17. Hongeng, S., Bremond, F., Nevatia, R.: Representation and optimal recognition of human activities. In: Proc. CVPR, vol. 1, pp. 818–825 (2000) 18. Park, J., Park, S., Aggrawal, J.: Video retrieval of human interactions using modelbased motion tracking and multi-layer finite state automata. In: Bakker, E.M., Lew, M.S., Huang, T.S., Sebe, N., Zhou, X.S. (eds.) CIVR 2003. LNCS, vol. 2728, Springer, Heidelberg (2003) 19. Deutscher, J., Reid, I.: Articulated body motion capture by stochastic search. International Journal of Computer Vision 61(2), 185–205 (2004) 20. Balan, A., Sigal, L., Black, M.: A quantitative evaluation of video-based 3d person tracking. In: VS-PETS. IEEE Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 349–356. IEEE Computer Society Press, Los Alamitos (2005)
Gait Identification Based on Multi-view Observations Using Omnidirectional Camera Kazushige Sugiura, Yasushi Makihara, and Yasushi Yagi Osaka University 8-1 Mihogaoka, Ibaraki, Osaka, 567-0047, Japan {sugiura,makihara,yagi}@am.sanken.osaka-u.ac.jp
Abstract. We propose a method of gait identification based on multi-view gait images captured with an omnidirectional camera. We first transform omnidirectional silhouette images into panoramic ones and obtain a spatio-temporal Gait Silhouette Volume (GSV). Next, we extract frequency-domain features by Fourier analysis based on gait periods estimated by autocorrelation of the GSVs. Because the omnidirectional camera makes it possible to observe a straight-walking person from various views, multi-view features can be extracted from the GSVs composed of multi-view images. In the identification phase, the distance between a probe and a gallery feature of the same view is calculated, and the distances for all views are then integrated for matching. Gait identification experiments including 15 subjects observed from 5 views demonstrate the effectiveness of the proposed method.
1
Introduction
There is a growing need in modern society to identify individuals in many situations, including surveillance and access control. For personal identification, many biometrics-based authentication methods have been proposed using a wide variety of cues: fingerprint, iris, face, and gait. Among these, gait identification has recently gained considerable attention because gait promises to enable surveillance systems to ascertain identity at a distance. Many gait identification approaches have been proposed, both model-based [1][2] and appearance-based [3][4]. One of the difficulties facing these approaches is the appearance change due to changes in viewing or walking direction. Yu et al. [5] discussed the effects of view angle variation on gait identification and reported a performance drop when the view difference is large. To cope with view changes, Kale et al. [6] proposed a view transformation method based on perspective projection of the sagittal plane; the method does not, however, work well when the view difference is large. Shakhnarovich et al. [7] proposed a visual hull-based method, but it needs multiple-view synchronized images for all subjects. As a training-based method, a View Transformation Model (VTM) in the frequency domain was proposed [8]. Once the VTM is trained using sets of gait features of multiple views and subjects, a few-view reference can be transformed
into an arbitrary-view gallery so as to match a probe view. It was also reported that the verification rate increases as the number of reference views increases [9]. Moreover, a method of multi-view gait identification using walking direction changes within a sequence was proposed [10], and the verification rate was reported to increase as the number of walking directions increases. It is, however, troublesome to capture gait images many times to acquire many references in the registration phase. In addition, it is unreasonable to assume that subjects always change their walking directions enough for multi-view identification. Therefore, we propose a method of gait identification based on multi-view observations from an omnidirectional camera. Note that an omnidirectional camera makes it possible to observe multi-view gait images even if the subject walks straight. Observation views are estimated from the azimuth angles of the tracked person regions in the omnidirectional image and the walking trajectory on the floor. Then, for each gallery and probe sequence, silhouette-based gait features are extracted for multiple basis views that are common to both the gallery and the probe. Finally, the extracted multi-view gait features are matched view by view, and the matching results are integrated for better identification. The outline of this paper is as follows. First, the construction of a Gait Silhouette Volume (GSV) is addressed, with silhouette extraction and panoramic expansion, in Section 2. Next, extraction and matching of multi-view frequency-domain gait features are described in Section 3. Finally, experimental results for gait identification are presented with an analysis of the effect of view variation in Section 4. Section 5 contains conclusions and a discussion of further work.
2 GSV Construction
2.1 Extraction of Gait Silhouette Images
The first step in constructing a GSV is to extract gait silhouette images from the omnidirectional images by background subtraction. First, the background is modeled by the average color vector u(x, y) and its covariance matrix Σ(x, y) at each position (x, y), using a background image sequence, as

$u(x,y) = \frac{1}{N}\sum_{n=1}^{N} u(x,y,n)$   (1)

$\Sigma(x,y) = \frac{1}{N}\sum_{n=1}^{N} u(x,y,n)\,u(x,y,n)^{T} - u(x,y)\,u(x,y)^{T},$   (2)

where u(x, y, n) is the background color vector at position (x, y) in the nth frame, and N is the total number of frames in the background training sequence. Second, to extract foreground regions, the Mahalanobis distance D(x, y, n) between an input image c(x, y, n) and the modeled background is calculated at each position (x, y) in each nth frame as

$d(x,y,n) = c(x,y,n) - u(x,y)$   (3)

$D(x,y,n) = \sqrt{\,d(x,y,n)^{T}\,\Sigma(x,y)^{-1}\,d(x,y,n)\,}.$   (4)
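A vectorized sketch of eqs. (1)-(4) is given below; the regularization term, array layout, and function names are our assumptions, and shadow removal and morphological filtering are not included.

```python
import numpy as np

def learn_background(frames):
    """frames: (N, H, W, 3) float array of background images.
    Returns the per-pixel mean color and inverse covariance, eqs. (1)-(2)."""
    u = frames.mean(axis=0)                                  # (H, W, 3)
    d = frames - u
    # Per-pixel 3x3 covariance: E[cc^T] - u u^T, via the centered samples.
    cov = np.einsum('nhwi,nhwj->hwij', d, d) / frames.shape[0]
    cov += 1e-6 * np.eye(3)                                  # regularize for inversion
    return u, np.linalg.inv(cov)

def foreground_mask(image, u, cov_inv, thresh=12.0):
    """Per-pixel Mahalanobis distance, eqs. (3)-(4), thresholded at Dthresh."""
    d = image - u                                            # (H, W, 3)
    D2 = np.einsum('hwi,hwij,hwj->hw', d, cov_inv, d)
    return np.sqrt(np.maximum(D2, 0.0)) > thresh
```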
A foreground region is defined as the set of pixels whose Mahalanobis distance D(x, y, n) is larger than a threshold Dthresh. Here, the threshold Dthresh is set to 12.0 empirically. Figure 1 shows an input image and the result of background subtraction; the person region is extracted correctly. Background subtraction, however, sometimes fails because of cast shadows and changes in illumination. To overcome these difficulties, shadow removal is performed based on the color vector angle between background and foreground, and a morphological closing filter is applied to improve silhouette quality.
2.2 Panorama Extension
The second step is panorama extension of the silhouettes in the omnidirectional image [11]. Let P(X, Y, Z) be a point in the world coordinate system and p(x, y) the point in the omnidirectional image onto which P is projected. Let ρ and Z be the azimuth angle and the vertical position in a cylindrical coordinate system whose central axis passes through the mirror focal point Om and the camera center Oc, and whose radius is RP. The panorama extension is then expressed as

$\tan\rho = Y/X = y/x$   (5)

$Z = R_P \tan\alpha + c,$   (6)

where $\alpha = \tan^{-1}\frac{(b^2+c^2)\sin\gamma - 2bc}{(b^2-c^2)\cos\gamma}$ and $\gamma = \tan^{-1}\frac{f}{\sqrt{x^2+y^2}}$ are the viewing directions defined in Fig. 2, respectively, and b and c are the mirror parameters.
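The mapping of eqs. (5)-(6) can be written as in the following sketch; it maps a single omnidirectional image point to cylindrical coordinates and leaves out the resampling of the full panoramic image, and all names are illustrative.

```python
import numpy as np

def panorama_extension(x, y, b, c, f, R_P):
    """Map an omnidirectional image point (x, y) to cylindrical coordinates
    (rho, Z) on a cylinder of radius R_P, following eqs. (5)-(6).
    b and c are the hyperboloidal mirror parameters and f the focal length."""
    rho = np.arctan2(y, x)                                   # azimuth angle, eq. (5)
    gamma = np.arctan2(f, np.sqrt(x**2 + y**2))              # viewing direction to the pixel
    alpha = np.arctan2((b**2 + c**2) * np.sin(gamma) - 2.0 * b * c,
                       (b**2 - c**2) * np.cos(gamma))        # viewing direction from the mirror
    Z = R_P * np.tan(alpha) + c                              # cylinder height, eq. (6)
    return rho, Z
```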
2.3 Scaling and Registration of Silhouette Images
The third step is scaling and registration of the panoramic silhouettes to acquire normalized gait patterns.
(a) Input image with omnidirectional camera
(b) Background subtraction
Fig. 1. Result of background subtraction
Fig. 2. Projection to cylindrical surface and floor surface
Fig. 3. Definition of the person region for scaling and registration: (a) omnidirectional image coordinates; (b) panoramic image coordinates
First, person regions are tracked in the omnidirectional image by considering the connected regions' area sizes and the position differences between adjacent frames. Next, in order to normalize the silhouette by the person region height, the maximum radius (head point) rmax and the minimum radius (foot point) rmin of the person region in the polar coordinates (r, ρ) of the omnidirectional image are found (see Fig. 3(a)). Then, in order to register the horizontal position, the median azimuth angle ρmed of the person region is found (see Fig. 3(a)). Note that the radius and azimuth angle correspond to the vertical and horizontal positions in the panoramic image, respectively. As a result, the head position, foot position, and horizontal center in the panoramic image are represented by Zmax, Zmin, and ρmed, as shown in Fig. 3(b). Second, the silhouette images are scaled so that the height (Zmax − Zmin) in the panoramic image becomes exactly 30 pixels while the aspect ratio of each region is kept. Then, we produce a 20 × 30 pixel image in which the horizontal median ρmed corresponds to the horizontal center of the produced
(a) front-oblique
(b) fronto-parallel
(c) rear-oblique
(d) Definition of observation view
Fig. 4. GSV examples for multiple observation views
image. A GSV is finally constructed by aligning the images on the temporal axis. Figure 4 shows GSV examples for multiple observation views. We can clearly see appearance changes in each view.
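The scaling and registration step could look like the sketch below, which resizes the cropped panoramic silhouette to a height of 30 pixels and pastes it into a 20 × 30 frame centred on the median azimuth column; the coordinate conventions and border handling are simplified assumptions, not the authors' implementation.

```python
import cv2
import numpy as np

def normalize_silhouette(panorama_sil, z_min, z_max, rho_med_col,
                         out_w=20, out_h=30):
    """Scale the silhouette rows [z_min, z_max] so the person height becomes
    out_h pixels (aspect ratio kept) and paste it into an out_h x out_w frame
    centred on the median azimuth column rho_med_col."""
    region = panorama_sil[z_min:z_max + 1, :]
    scale = out_h / float(region.shape[0])
    resized = cv2.resize(region, None, fx=scale, fy=scale,
                         interpolation=cv2.INTER_NEAREST)
    frame = np.zeros((out_h, out_w), dtype=panorama_sil.dtype)
    centre = int(round(rho_med_col * scale))
    left = centre - out_w // 2
    # Clip the horizontal crop to the resized image borders.
    x0, x1 = max(left, 0), min(left + out_w, resized.shape[1])
    frame[:resized.shape[0], (x0 - left):(x0 - left) + (x1 - x0)] = resized[:, x0:x1]
    return frame
```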
3 Multi-view Feature Extraction
3.1 Frequency-Domain Feature Extraction
The second step in the proposed method is frequency-domain feature extraction from the constructed GSV. First, the gait period Ngait is detected by maximizing the normalized autocorrelation

$C(N) = \frac{\sum_{x,y}\sum_{n=0}^{T(N)} g(x,y,n)\,g(x,y,n+N)}{\sqrt{\sum_{x,y}\sum_{n=0}^{T(N)} g(x,y,n)^{2}\;\sum_{x,y}\sum_{n=0}^{T(N)} g(x,y,n+N)^{2}}}$   (7)

of the GSV g(x, y, n) over an N-frame shift along the temporal axis, where Ntotal is the total number of frames in the sequence and T(N) = Ntotal − N − 1 is the number of overlapped frames. The domain of N is set to [25, 45] empirically for natural gait periods; various gait types such as running, brisk walking, and 'ox walking' are not within the scope of this paper. For the autocorrelation-based period detection, adjacent gait-period sequences need to be similar to each other; we assume that the walker's trajectory is smooth to some extent and that appearance changes between adjacent gait-period sequences are small. Next, a subsequence S_ns is picked from the complete sequence S; the frame range of the subsequence S_ns is [ns, ns + Ngait − 1]. A Discrete Fourier Transform (DFT) Gns(x, y, k) along the temporal axis is then applied to the subsequence, and the amplitude spectra Ans(x, y, k) are calculated as

$G_{n_s}(x,y,k) = \sum_{n=n_s}^{n_s+N_{gait}-1} g(x,y,n)\,e^{-j\omega_0 k n}$   (8)

$A_{n_s}(x,y,k) = \frac{1}{N_{gait}}\,|G_{n_s}(x,y,k)|,$   (9)

where ω0 is the base angular frequency for the gait period Ngait. In this paper, the direct-current elements (k = 0; the averaged silhouette) and the low-frequency elements (k = 1, 2) are chosen as gait features. Let a be the feature vector composed of the elements of the amplitude spectra A(x, y, k). As a result, the dimension of the feature vector a is 20 × 30 × 3 = 1800.
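A compact sketch of the period detection of eq. (7) and the frequency features of eqs. (8)-(9) is given below; it uses an FFT along the temporal axis, whose amplitudes coincide with eq. (8) up to an irrelevant phase offset, and the array shapes and function names are assumptions.

```python
import numpy as np

def gait_period(gsv, n_min=25, n_max=45):
    """Detect the gait period by maximizing the normalized autocorrelation
    C(N) of the GSV over temporal shifts N in [n_min, n_max], eq. (7).
    gsv: (T, 30, 20) array of normalized silhouettes."""
    T = gsv.shape[0]
    best_n, best_c = n_min, -1.0
    for N in range(n_min, n_max + 1):
        a, b = gsv[:T - N], gsv[N:]
        c = (a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum() + 1e-12)
        if c > best_c:
            best_n, best_c = N, c
    return best_n

def frequency_features(gsv, n_start, n_gait, ks=(0, 1, 2)):
    """Amplitude spectra of the temporal DFT over one gait period,
    eqs. (8)-(9); concatenating k = 0, 1, 2 gives a 20 x 30 x 3 = 1800-D vector."""
    sub = gsv[n_start:n_start + n_gait]                      # (N_gait, 30, 20)
    G = np.fft.fft(sub, axis=0)                              # DFT along the temporal axis
    A = np.abs(G[list(ks)]) / n_gait                         # eq. (9)
    return A.reshape(-1)
```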
3.2 Observation View Estimation
In this section, observation view estimation for multi-view feature extraction is addressed. The observation view θ is defined as

$\theta = (180 - \phi) + \rho,$   (10)

where ρ is the azimuth angle and φ is the walking direction (see Fig. 4(d)). The azimuth angle ρ is simply defined as the direction of the vector (x, y), where (x, y) is the foot point in the omnidirectional image. The walking direction φ is estimated from the trajectory of the subject's foot points F(X, Y) in the floor coordinate system. Let (Rf, ρ) be polar coordinates on the floor. If the floor plane is regarded as an image plane, the distance Hr from the mirror focal point Om to the floor can be seen as the focal length to the floor image plane. The radius Rf is then calculated as follows [11]:

$R_f = \frac{-(b^2-c^2)\,H_r\,r_f}{(b^2+c^2)f - 2bc\sqrt{r_f^2+f^2}}$   (11)

Thus, the walking trajectory on the floor is obtained as a time series of the floor points (Rf, ρ). Next, the walking direction φ is defined as the tangential direction of the estimated walking trajectory. Let (Xn, Yn) and (VXn, VYn) be the foot point's position and velocity at the nth frame. The velocity is obtained by central differences as

$V_{X_n} = \frac{X_{n+\Delta n} - X_{n-\Delta n}}{2\Delta n},\quad V_{Y_n} = \frac{Y_{n+\Delta n} - Y_{n-\Delta n}}{2\Delta n}$   (12)

Here, Δn is set to 15 frames considering velocity smoothness. Finally, the walking direction φn at the nth frame is defined as the direction of the velocity vector (VXn, VYn).
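The view estimation of eqs. (10) and (12) can be sketched as follows, given a foot-point trajectory already projected onto the floor via eq. (11); the angle conventions and the handling of the first and last Δn frames are our simplifications.

```python
import numpy as np

def observation_views(floor_points, delta_n=15):
    """Observation view per frame from foot points on the floor, eqs. (10)-(12).
    floor_points: (T, 2) array of (X, Y) positions in the floor coordinate system."""
    X, Y = floor_points[:, 0], floor_points[:, 1]
    views = np.full(len(X), np.nan)
    for n in range(delta_n, len(X) - delta_n):
        vx = (X[n + delta_n] - X[n - delta_n]) / (2.0 * delta_n)   # eq. (12)
        vy = (Y[n + delta_n] - Y[n - delta_n]) / (2.0 * delta_n)
        phi = np.degrees(np.arctan2(vy, vx))                        # walking direction
        rho = np.degrees(np.arctan2(Y[n], X[n]))                    # azimuth angle
        views[n] = ((180.0 - phi) + rho) % 360.0                    # eq. (10)
    return views
```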
3.3 Multi-view Feature Extraction
In this section, multi-view feature extraction based on the estimated observation views is introduced. First, multiple basis views θi (i = 1, 2, ...) are chosen from the observation views; the interval between basis views is set to 15 deg empirically. Next, a basis frame nθi corresponding to a basis view θi is found in the complete sequence, and a subsequence is picked up as the set of Ngait frames around the basis frame nθi, as shown in Fig. 5(a). Concretely, the start frame ns in eq. (9) is replaced by ns = nθi − Ngait/2.
(a) Overview of multi-view feature extraction
(b) Multi-view features for each subject (every 15 deg)
Fig. 5. Multi-view feature extraction
Results of multi-view feature extraction for multiple subjects are shown in Fig. 5(b). In this figure, each block corresponds to one subject, and the rows and columns indicate observation view and frequency, respectively. We can see individual differences, for example, the difference in swing motion between subjects 2 and 4 in the double-frequency component of the 270-deg features. In addition, we can also see view differences for each subject. Thus, by integrating the different types of features across views, gait identification performance should improve compared with the single-view case. The next section describes how to match the multi-view features.
3.4 Matching Features
A matching measure between two subsequences must first be defined. Let S^P and S^G be complete sequences for the probe and the gallery, respectively, and let S^P_θi and S^G_θi be their subsequences for basis view θi. Also let a(S_θi) be the feature vector for subsequence S_θi. The matching measure is simply chosen as the Euclidean distance $d(S^{P}_{\theta_i}, S^{G}_{\theta_i}) = \|a(S^{P}_{\theta_i}) - a(S^{G}_{\theta_i})\|$. Complete sequences have variations in general and may contain outliers. Because the median is robust to such noise, the measure between complete sequences is defined as the median of the per-view results:

$D(S^{P}, S^{G}) = \mathrm{Median}_i\,\{\,d(S^{P}_{\theta_i}, S^{G}_{\theta_i})\,\}$   (13)
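The fusion of eq. (13) reduces to a median over per-view Euclidean distances, as in the following sketch; the dictionary-based data layout and function names are assumptions.

```python
import numpy as np

def sequence_distance(probe_features, gallery_features):
    """Fuse per-view feature distances by their median, eq. (13).
    Both arguments map a basis view (deg) to its 1800-D feature vector;
    only the views common to probe and gallery are compared."""
    common = sorted(set(probe_features) & set(gallery_features))
    d = [np.linalg.norm(probe_features[v] - gallery_features[v]) for v in common]
    return float(np.median(d))

def identify(probe_features, gallery):
    """Return the gallery subject with the smallest fused distance.
    gallery: dict mapping subject id -> dict of per-view feature vectors."""
    scores = {sid: sequence_distance(probe_features, feats)
              for sid, feats in gallery.items()}
    return min(scores, key=scores.get)
```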
4 Experiment
4.1 Datasets
A total of 60 gait sequences from 15 subjects were used for the experiments. Each sequence consisted of approximately 10 steps of a straight walk in front of the omnidirectional camera, and it included 5 basis views: 240, 255, 270,
285, and 300 deg. The camera was a Sony DCR-VX2000, and images were captured at a size of 720 × 480 pixels at 30 fps. The hyperboloidal mirror and camera parameters were a = 13.722, b = 11.708, c = 18.038, f = 427.944 (unit: mm). The dataset was captured over two days, with two sequences per day for each subject. A test set is composed of one gallery sequence from one day and two probe sequences from the other day; therefore, four combinations of test sets were generated in total.
4.2 Results
The gait identification experiments were performed for the above four combinations of datasets, and the average performance was evaluated by Receiver Operating Characteristic (ROC) curves [12]. An ROC curve shows the relation between the verification rate PV and the false alarm rate PF as the acceptance threshold is varied. A curve closer to the top-left corner of the graph indicates higher performance, i.e., a high verification rate at a low false alarm rate. In addition, the effect of the number of observation views and of the view combinations on performance is analyzed to validate the effectiveness of multi-view observations. First, ROC curves for each single view are shown in Fig. 6(a). The figure shows that performance varies greatly among basis views and that it is difficult to obtain sufficient performance when an arbitrary single-view feature is used for matching. Next, ROC curves for two-view combinations are shown in Fig. 6(b), where the best and the worst three combinations are plotted. The performance order is judged by the Equal Error Rate (EER), i.e., the error rate at which the false alarm rate PF equals the false rejection rate (1 − PV). In the worst cases, the view differences are small (within 15 deg except for the worst 2), whereas in the best cases the view differences are relatively large (more than 30 deg). It is therefore clear that combinations with a large view difference are effective for identification. Moreover, ROC curves for each number of observation views are shown in Fig. 6(c); the verification rates in this graph are averaged over all combinations for each number of observation views. We can see that performance improves as the number of observation views increases. Finally, the verification rates at a 3% false alarm rate are picked out for each number of observation views. Figure 6(d) shows the best, the worst, and the average performance over all combinations. For the best combinations, the verification rate already reaches its highest value with two observation views; thus a small number of observation views is enough when the combination can be chosen appropriately. For the worst combinations, the verification rate improves steadily as the number of observation views increases: because the worst combinations are usually composed of adjacent views (as seen in the two-view case), increasing the number of observation views directly increases the view variation. In summary, it is validated that observation view variation greatly contributes to performance improvement.
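The ROC and EER quantities used here can be computed from genuine and impostor distance lists as in the sketch below; this is a generic evaluation routine, not the protocol code used for the experiments, and all names are illustrative.

```python
import numpy as np

def roc_points(genuine, impostor):
    """Verification rate P_V and false alarm rate P_F over a sweep of
    acceptance thresholds, given genuine and impostor distance lists."""
    genuine, impostor = np.sort(genuine), np.sort(impostor)
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    pv = np.array([(genuine <= t).mean() for t in thresholds])
    pf = np.array([(impostor <= t).mean() for t in thresholds])
    return pf, pv

def equal_error_rate(genuine, impostor):
    """EER: the error rate where the false alarm rate equals the
    false rejection rate (1 - P_V)."""
    pf, pv = roc_points(genuine, impostor)
    frr = 1.0 - pv
    i = int(np.argmin(np.abs(pf - frr)))
    return (pf[i] + frr[i]) / 2.0
```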
(a) ROC curves for single views
(b) ROC curves for two views
(c) ROC curves for each number of views
(d) Verification rate at 3% false alarm for each number of views
Fig. 6. Experimental results
5
Conclusion
This paper has described a method of gait identification based on multi-view gait images using an omnidirectional camera. The omnidirectional silhouette images are first transformed into panoramic ones, and a spatio-temporal Gait Silhouette Volume (GSV) is obtained. Next, frequency-domain features are extracted by Fourier analysis. Because the omnidirectional camera makes it possible to observe a person from various views, multi-view features can be extracted from the GSVs composed of multi-view images. In the identification phase, the distance between a probe and a gallery feature of the same view is calculated, and the distances for all views are then integrated for matching. The effect of observation view variation on gait identification performance was analyzed through experiments including 15 subjects observed from 5 views. As a result, the average performance increases from 82% (single view) to 93% (5 views), and it is clear that observation view variation contributes to gait identification performance. In this paper, basis views are chosen only from the views common to a gallery and a probe; in future work it would be possible to use other-view features interpolated by the View Transformation Model (VTM) [8] for better performance. Moreover, the subjects in this experiment walked within 5 m of the omnidirectional camera, so relatively high-resolution silhouettes (approximately 60 pixels in height) were obtained. The effects of the distance from the camera, or of silhouette resolution, on identification performance should therefore be analyzed. That also leads to an analysis of the optimal placement of the omnidirectional camera to capture multi-view
gait images effectively, considering both silhouette resolution and observation view variation.
References

1. Urtasun, R., Fua, P.: 3D tracking for gait characterization and recognition. In: Proc. of the 6th IEEE Int. Conf. on Automatic Face and Gesture Recognition, pp. 17–22. IEEE Computer Society Press, Los Alamitos (2004)
2. Yam, C., Nixon, M., Carter, J.: Automated person recognition by walking and running via model-based approaches. Pattern Recognition 37(5), 1057–1072 (2004)
3. Sarkar, S., Phillips, J., Liu, Z., Vega, I., Grother, P., Bowyer, K.: The HumanID gait challenge problem: Data sets, performance, and analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence 27(2), 162–177 (2005)
4. Han, J., Bhanu, B.: Individual recognition using gait energy image. IEEE Trans. on Pattern Analysis and Machine Intelligence 28(2), 316–322 (2006)
5. Yu, S., Tan, D., Tan, T.: Modelling the effect of view angle variation on appearance-based gait recognition. In: Proc. of the 7th Asian Conf. on Computer Vision, vol. 1, pp. 807–816 (2006)
6. Kale, A., Roy-Chowdhury, A., Chellappa, R.: Towards a view invariant gait recognition algorithm. In: Proc. of IEEE Conf. on Advanced Video and Signal Based Surveillance, pp. 143–150. IEEE Computer Society Press, Los Alamitos (2003)
7. Shakhnarovich, G., Lee, L., Darrell, T.: Integrated face and gait recognition from multiple views. In: Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 439–446 (2001)
8. Makihara, Y., Sagawa, R., Mukaigawa, Y., Echigo, T., Yagi, Y.: Gait recognition using a view transformation model in the frequency domain. In: Proc. of the 9th European Conf. on Computer Vision, Graz, Austria, vol. 3, pp. 151–163 (2006)
9. Makihara, Y., Sagawa, R., Mukaigawa, Y., Echigo, T., Yagi, Y.: Which reference view is effective for gait identification using a view transformation model? In: Proc. of the IEEE Computer Society Workshop on Biometrics 2006, New York, USA (2006)
10. Makihara, Y., Sagawa, R., Mukaigawa, Y., Echigo, T., Yagi, Y.: Adaptation to walking direction changes for gait identification. In: Proc. of the 18th Int. Conf. on Pattern Recognition, Hong Kong, China, vol. 2, pp. 96–99 (2006)
11. Yamazawa, K., Yagi, Y., Yachida, M.: HyperOmni Vision: Visual navigation with an omnidirectional image sensor. Systems and Computers in Japan 28(4), 36–47 (1997)
12. Phillips, P., Moon, H., Rizvi, S., Rauss, P.: The FERET evaluation methodology for face-recognition algorithms. IEEE Trans. on Pattern Analysis and Machine Intelligence 22(10), 1090–1104 (2000)
Gender Classification Based on Fusion of Multi-view Gait Sequences Guochang Huang and Yunhong Wang Intelligent Recognition and Image Processing Lab, School of Computer Science and Engineering, Beihang University, Beijing 100083, China gc
[email protected],
[email protected]
Abstract. In this paper, we present a new method for gender classification based on the fusion of multi-view gait sequences. For each silhouette of a gait sequence, we first use a simple method to divide the silhouette into 7 (for the 90-degree, i.e., fronto-parallel view) or 5 (for the 0- and 180-degree, i.e., front and back views) parts, and then fit an ellipse to each of the regions. Next, features are extracted from each sequence by computing the ellipse parameters. For each view angle, every subject's features are normalized and combined into a feature vector. The combined feature vector contains enough information to perform well on gender recognition. The sum rule and an SVM are applied to fuse the similarity measures from 0°, 90°, and 180°. We carried out our experiments on the CASIA Gait Database, one of the largest gait databases known to us, and achieved a classification accuracy of 89.5%.
1 Introduction
Gait is an attractive biometric feature for human recognition and classification. In recent years, gait has received more and more attention from computer vision and biometrics researchers. Compared with other biometric features, such as fingerprint, face, and iris, gait has many desirable qualities: it is non-invasive, can be captured at a distance, and is hard for the subject to perceive. Gait analysis therefore plays an important role in surveillance. Gait analysis mainly consists of two areas. One is gait recognition, which identifies a subject's ID in environments where other biometrics are very difficult to capture. The other is gait classification, including gender recognition, action classification, and age estimation. Gender classification in particular has attracted much attention recently [2,4] because of its wide range of potential applications. Research on gait recognition and gait classification has a long history, but most articles on gait analysis focus on human recognition; articles on how to use gait to classify gender are few. In complex real surveillance scenarios, describing a subject's attributes, such as gender and age, is very important and necessary, because in these environments it is very difficult or even impossible to capture other features that could correctly identify the subject's ID.
The remainder of this paper is organized as follows. Section 2 summarizes related work on gender recognition. Gait representation and feature extraction are described in Section 3. Section 4 presents the classification and fusion scheme. Experiments and results are reported in Section 5, followed by the conclusions in Section 6.
2 Related Work
Point-light displays were an important tool for studies of walking manner and received much attention during the past few decades. Kozlowski and Cutting [3] were the first researchers to study gender recognition from human walking manner; they demonstrated that observers are able to recognize the gender of point-light walkers. Barclay et al. [1] conducted further research on gender recognition and investigated the influence of spatial and temporal factors on the correct rate. Their results show that at least two gait cycles are necessary for successful gender recognition, and that the speed of the walker also has a great influence on classification accuracy. In their experiments, the highest recognition accuracy was 68%. Most of these studies presented walkers to observers from the side view, while some experiments examined the effect of the view angle on gender recognition [1,8]; it was found that the front-view presentation contains more information than the side view for gender recognition [8]. In our method, we combine the front-view, back-view, and side-view presentations of gait, which leads to a higher correct rate than before. Troje [8] recently used linear pattern recognition techniques to analyze biological motion and presented a two-stage PCA framework for recognizing gender, reporting a 92.5% recognition rate. Davis and Gao [2] used a three-mode PCA model of point-light walkers for gender recognition; with 40 walkers, their best recognition rate was 95.5%. However, in real surveillance environments it is very difficult to attach small point lights to the main joints of a subject's body, so whether this approach would perform well on a video-based gait database is unknown. Most of the aforementioned studies used point-light displays to represent biological motion, which, as mentioned above, is a fatal limitation in surveillance environments. Lee and Grimson [5,4] recently proposed a computer vision algorithm to extract visual gait features from image sequences for gender recognition. Their experiments were performed on a database of 24 subjects and the approach achieved a recognition rate of 84%.
3 Gait Representation
In our method, three view angles are chosen: 0° (front view), 90° (fronto-parallel view), and 180° (back view). Many studies have demonstrated that the front-view presentation contains more information for gender classification, and some research [11] shows that fusing gait sequences with an angle difference
greater than or equal to 90° achieves a larger improvement than fusing sequences with an acute angle difference. Therefore, 0°, 90°, and 180° are chosen for gender classification in our method. First of all, we assume that silhouettes have been extracted from the original video files. The extracted silhouette sequences are then normalized to the same size, so that all silhouettes have the same height. The horizontal center of each normalized silhouette is obtained, and all normalized silhouettes are aligned according to the horizontal center.
Fig. 1. Examples of normalized silhouettes at different view angles from the same subject. The top row shows the 0° normalized silhouettes, the middle row the 90°, and the bottom row the 180°.
The gait image representation method was first proposed by Lee [5], but in Lee's thesis it was only applied to 90° gait images. In this paper, we apply the method to divide the 0° and 180° images into five parts, and the experimental results demonstrate that this segmentation of the 0° and 180° images is effective for gender classification. For the gait images at 0° and 180°, we proportionally divide the silhouette into 5 parts, as shown in Fig. 2(a). These 5 regions roughly correspond to: $Z_1$, head region; $Z_2$, left of torso; $Z_3$, right of torso; $Z_4$, left leg; $Z_5$, right leg. For the gait images at 90°, we proportionally divide the silhouette into 7 parts, as shown in Fig. 3(a). These 7 regions roughly correspond to: $N_1$, head region; $N_2$, front of torso; $N_3$, back of torso; $N_4$, front thigh; $N_5$, back thigh; $N_6$, front foot; $N_7$, back foot. For each of the parts, we fit an ellipse to the foreground in the region, as shown in Fig. 2(b) and Fig. 3(b). The intuition behind the segmentation of the 0° and 180° silhouettes is that men show a larger extent of lateral sway of the upper body than women do, and the orientation of the major axis can reflect this phenomenon; the difference in the shoulder-hip ratio between men and women can also be reflected by the orientation of the major axis. To facilitate the description of each part of the silhouette, we divide the 90° silhouettes into 7 regions, each of which roughly corresponds to one part of the human body. For each ellipse fitted to these regions, we compute four parameters: the centroid $(\bar{X}, \bar{Y})$, the elongation of the ellipse ($L$), and the orientation of the
Fig. 2. Example of the 0 and 180 degree silhouette which is divided into 5 regions, and five ellipses are fitted to these regions
Fig. 3. Example of the 90 degree silhouette which is divided into 7 regions, and seven ellipses are fitted to these regions
major axis ($\alpha$). The details of how to calculate these four parameters are given in [4] and are not repeated here. The 4 parameters of each region form the region feature vector $R_i$:

$R_i = (\bar{X}_i, \bar{Y}_i, L_i, \alpha_i)$   (1)
where $i = 1, \ldots, 5$ (for 0° and 180° images) or $1, \ldots, 7$ (for 90° images). For each image there are therefore 20 parameters (5 regions × 4 parameters) or 28 parameters (7 regions × 4 parameters), and these parameters form the image feature vector $I_j$:

$I_j = (R_1, \ldots, R_{5(\mathrm{or}\,7)})_j = (\bar{X}_1, \bar{Y}_1, L_1, \alpha_1, \ldots, \bar{X}_{5(7)}, \bar{Y}_{5(7)}, L_{5(7)}, \alpha_{5(7)})_j$   (2)
where $j = 1, \ldots, n$ and $n$ is the total number of images in one gait sequence. By computing the mean value of the image feature vectors in one sequence, we obtain the sequence feature vector $S_p(k)$:

$S_p(k) = \mathrm{mean}(I_1(k), \ldots, I_n(k))_p$   (3)
For example,

$S_p(1) = S_p(\bar{X}_1) = \mathrm{mean}(I_1(\bar{X}_1), \ldots, I_n(\bar{X}_1))_p = \mathrm{mean}(I_1(1), \ldots, I_n(1))_p$   (4)

where $p$ is the index of sequences, $p = 1, \ldots,$ total number of sequences, $n$ is the total number of images in one sequence, and $k$ is the index of features,
$k = 1, \ldots, 20$ (or 28). There are thus 20 features in total for each 0° and 180° sequence and 28 features for each 90° sequence.
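Purely as an illustration of this feature pipeline (the exact region boundaries and moment formulas follow Lee [5] and [4] and are not reproduced here), a per-region feature of the form (X bar, Y bar, L, alpha) can be computed from the second-order moments of a binary silhouette roughly as follows; region_slices is a hypothetical list of (row, column) slices standing in for the 5- or 7-part division.

import numpy as np

def region_ellipse_features(silhouette, region_slices):
    """Fit an ellipse (via second-order moments) to the foreground of every region and
    return the concatenated (centroid_x, centroid_y, elongation, orientation) features."""
    feats = []
    for rows, cols in region_slices:
        region = silhouette[rows, cols]
        ys, xs = np.nonzero(region)
        if len(xs) < 2:                      # empty region: pad with zeros
            feats.extend([0.0, 0.0, 0.0, 0.0])
            continue
        cx, cy = xs.mean(), ys.mean()        # centroid (X bar, Y bar)
        cov = np.cov(np.vstack([xs, ys]))    # second-order central moments
        evals, evecs = np.linalg.eigh(cov)
        major, minor = evals[1], max(evals[0], 1e-6)
        elongation = np.sqrt(major / minor)  # ratio of major to minor axis (L)
        vx, vy = evecs[:, 1]
        orientation = np.arctan2(vy, vx)     # orientation of the major axis (alpha)
        feats.extend([cx, cy, elongation, orientation])
    return np.array(feats)

def sequence_feature(silhouettes, region_slices):
    """Average the per-frame feature vectors over one gait sequence (cf. Eq. 3)."""
    return np.mean([region_ellipse_features(s, region_slices) for s in silhouettes], axis=0)

# Toy usage with a 5-part front/back-view style division of a 128x64 silhouette.
h, w = 128, 64
sil = np.zeros((h, w), dtype=np.uint8)
sil[10:120, 20:44] = 1
slices = [(slice(0, 20), slice(0, w)),        # head
          (slice(20, 70), slice(0, w // 2)),  # left torso
          (slice(20, 70), slice(w // 2, w)),  # right torso
          (slice(70, h), slice(0, w // 2)),   # left leg
          (slice(70, h), slice(w // 2, w))]   # right leg
print(sequence_feature([sil, sil], slices).shape)   # (20,) = 5 regions x 4 parameters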
4 Gender Classification

4.1 Similarity Measure
Once the sequence feature vectors are obtained for each subject, the similarity measures are calculated next. Here, the similarity measures are computed for the three view angles separately. We randomly choose some male and female subjects from the database to construct the testing and training sets. In both sets, the number of female subjects must equal the number of male subjects, because otherwise the experimental results may be biased by the separability of the larger class. According to the gender attribute, the training set is further divided into a female training subset and a male training subset. The mean Euclidean distance from the testing set to the female or male training subset is then calculated. Let $M$ be the total number of female (or male) training sequences, and $S_t(k)$ the $k$-th feature of the $t$-th testing sequence. The mean Euclidean distance of the $k$-th feature between the testing sequence and the female training set, $DF_t(k)$, is defined as

$DF_t(k) = \frac{1}{M} \sum_{n=1}^{M} \mathrm{Euclidean}(S_t(k), S_n(k))$   (5)

where $n = 1, \ldots, M$, $S_n \in$ female training set, $S_t \in$ testing set, and $k = 1, \ldots, 20$ (or 28). The mean Euclidean distance of the $k$-th feature between the testing sequence and the male training set, $DM_t(k)$, is defined as

$DM_t(k) = \frac{1}{M} \sum_{m=1}^{M} \mathrm{Euclidean}(S_t(k), S_m(k))$   (6)
where $m = 1, \ldots, M$, $S_m \in$ male training set, $S_t \in$ testing set, and $k = 1, \ldots, 20$ (or 28). Both distances, $DF_t$ and $DM_t$, are regarded as the similarity measures of the $t$-th sequence; they express the degree of similarity between the testing sequence and the two subsets. $DF_t$ and $DM_t$ are 20- (or 28-) dimensional vectors.

4.2 Fusion Scheme
Based on the similarity measure introduced in Section 4.1, we obtain two vectors, $DF_t$ and $DM_t$, for each view angle. In total there are three female vectors and three male vectors: $DF_t^{0°}$, $DF_t^{90°}$, $DF_t^{180°}$, $DM_t^{0°}$, $DM_t^{90°}$, $DM_t^{180°}$. We concatenate the three female vectors into one vector and the three male vectors into another vector:
$CF_t = \mathrm{concatenate}(DF_t^{0°}, DF_t^{90°}, DF_t^{180°})$   (7)

$CM_t = \mathrm{concatenate}(DM_t^{0°}, DM_t^{90°}, DM_t^{180°})$   (8)

$DF_t^{0°}$ and $DF_t^{180°}$ are 20-dimensional vectors and $DF_t^{90°}$ is a 28-dimensional vector, so $CF_t$ is a 68-dimensional vector, and the same holds for $CM_t$. Before fusing the similarity measures of all features, we normalize them to a common range $[0, 1]$ using the Min-Max normalization method [6]:

$CF_t(k) = \frac{CF_t(k) - \min}{\max - \min}$   (9)

$CM_t(k) = \frac{CM_t(k) - \min}{\max - \min}$   (10)

where max and min denote the maximum and the minimum value of the $k$-th feature in the $t$-th sequence, respectively, and $CF_t(k)$ and $CM_t(k)$ denote the normalized gender similarity measures of the $k$-th feature in the $t$-th sequence.
Sum Rule. Snelick et al. [7] found that Min-Max normalization followed by the sum-of-scores fusion method outperforms other schemes, so we adopt this fusion scheme:

$CF_t = \sum_{k=1}^{N} CF_t(k)$   (11)

$CM_t = \sum_{k=1}^{N} CM_t(k)$   (12)
where $N$ denotes the dimension of $CF_t$ and $CM_t$, $N = 68$. $CF_t$ and $CM_t$ are the fused gender similarity measures. By comparing $CF_t$ and $CM_t$, we decide whether the testing sequence belongs to a female or a male subject: the testing sequence is assigned the gender corresponding to the minimum of $CF_t$ and $CM_t$,

$\mathrm{gender} = \begin{cases} \mathrm{female}, & CF_t < CM_t \\ \mathrm{male}, & CF_t > CM_t \end{cases}$   (13)

Support Vector Machine (SVM). The SVM attempts to maximize the distance between the hyperplane and the closest training samples on either side of the hyperplane. It is a powerful technique for classification and, in particular, for solving binary classification problems, so we choose the SVM as our second classifier. First, we construct feature vectors for training and testing the SVM classifier: we concatenate the normalized vectors $CF_t$ and $CM_t$ into one vector, which is used to represent the sequence,

$G_t = \mathrm{concatenate}(CF_t, CM_t)$   (14)
We construct such a feature vector $G$ for each training sequence. Then $G_t$ (where $t = 1, \ldots,$ total number of training sequences) can be used as the input vector to train the SVM classifier.
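For clarity, the sum-rule branch of the fusion scheme can be sketched as below; this assumes DF and DM are the concatenated 68-dimensional distance vectors of Eqs. (7) and (8), and it takes the per-feature min and max over the female/male pair, which is one reading of Eqs. (9) and (10).

import numpy as np

def min_max_normalise(cf, cm):
    """Map the female/male similarity measures of each feature to a common [0, 1] range."""
    lo = np.minimum(cf, cm)
    hi = np.maximum(cf, cm)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
    return (cf - lo) / span, (cm - lo) / span

def sum_rule_gender(cf, cm):
    """Sum-rule decision of Eqs. (11)-(13): the smaller fused distance wins."""
    cf_n, cm_n = min_max_normalise(cf, cm)
    return "female" if cf_n.sum() < cm_n.sum() else "male"

# Hypothetical 68-dimensional concatenated distance vectors for one testing sequence.
rng = np.random.default_rng(1)
DF = rng.uniform(0.5, 1.5, 68)       # distances to the female training subset
DM = DF + rng.uniform(0.0, 0.5, 68)  # slightly larger distances to the male subset
print(sum_rule_gender(DF, DM))       # -> "female"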
5 Experiments and Results

5.1 Data
We carried out our experiments on the CASIA Gait Database [10], one of the largest shared databases in the current gait-research community. There are 124 subjects in the database, of which 93 are male and 31 are female. We only chose the normal sequences from the 0-degree, 90-degree, and 180-degree views to construct the experimental data set, which therefore contains 2,232 (124 subjects × 6 sequences × 3 angles) video sequences in total. As mentioned above, if the numbers of females and males are unbalanced in the training and testing sets, the experimental results may be biased by the separability of the larger class. Since the CASIA database contains only 31 females, in our experiments we randomly choose 25 males and 25 females from the database to form the training set and randomly choose 5 males and 5 females from the remaining subjects to form the testing set. There are thus 600 (50 × 6) sequences in the training set and 60 (10 × 6) sequences in the testing set. Once a subject is assigned to the training set or the testing set, all sequences of that subject are assigned to the corresponding set as well.

5.2 Experimental Results
Since the training set and testing set are chosen randomly, we repeat our experiments two hundred times and use the mean of these experimental results as the final recognition accuracy, so the recognition rates listed here reliably reflect the performance of our method. First, we use the Sum Rule scheme to fuse the 20 features from the 0-degree view, the 28 features from the 90-degree view, and the 20 features from the 180-degree view separately, in order to see what performance can be achieved before fusing all three view angles; the recognition results are listed in Table 1. Then, we use the Sum Rule scheme to fuse all the features from the three view angles, 68 features in total, and achieve a recognition accuracy of 87.7%. Fig. 4 shows the scatter plot of the test data and the decision boundary. The x axis denotes the dissimilarity measure for male, i.e., the distance between the testing subject and the male class; the y axis denotes the dissimilarity measure for female, i.e., the distance between the testing subject and the female class. Three kernels, including Linear, 2nd-degree Polynomial, and RBF (width of RBF network = 1), are used to train the SVM. The results of fusing the features of the three view angles separately are shown in Table 2, and the results of fusing all the features from the three view angles are shown in Table 3.

Table 1. Results of fusing the features of the three view angles separately using the sum rule

  View angle (degree)   Recognition rate (%)
  0                     83.0
  90                    85.5
  180                   85.5

Fig. 4. Two-dimensional scatter plot showing the testing data and decision boundary. Red 'x' denotes male and green 'o' denotes female.

Table 2. Results of fusing the features of the three view angles separately using Linear, Polynomial (d = 2), and RBF (g = 1) kernels (d = degree of polynomial, g = width of RBF network)

  View angle (degree)   Kernel       Recognition rate (%)
  0                     Linear       79.5
  0                     Polynomial   88.0
  0                     RBF          80.0
  90                    Linear       82.0
  90                    Polynomial   85.0
  90                    RBF          82.5
  180                   Linear       86.0
  180                   Polynomial   88.0
  180                   RBF          86.0

Table 3. Results of fusing all the features from the three view angles using Linear, Polynomial (d = 2), and RBF (g = 1) kernels

  Kernel       Recognition rate (%)
  Linear       89.5
  Polynomial   89.5
  RBF          88.5
Table 4. Comparison of experimental results

  Authors                        Database       Representation   View         Recognition rate (%)
  Kozlowski and Cutting (1977)   6 subjects     Point-light      Side-view    63
  Troje (2002)                   40 subjects    Point-light      Mixed        92.5
  Lee and Grimson (2002)         24 subjects    Video-based      Side-view    84
  Davis and Gao (2004)           40 subjects    Point-light      Front-view   95.5
  This paper                     124 subjects   Video-based      Mixed        89.5

5.3 Discussions
Based on the above results, the following conclusions can be drawn. Using fusion schemes improves the performance of gender recognition systems; in particular, with the SVM fusion scheme the performance reaches 89.5%. Comparing the results of the Sum Rule and SVM schemes, we note that the SVM scheme has advantages over the Sum Rule, showing an improvement of up to 5%. From Table 2 we find that the 2nd-degree Polynomial kernel performs better than the other kernels, which leads us to believe that the Polynomial kernel is more effective for the gender recognition task. In Table 3, the Linear kernel performs as well as the 2nd-degree Polynomial kernel. Compared with other methods, the recognition rate of our method is higher than most; Table 4 lists the recognition accuracy of our method and of related methods. Although our recognition rate is slightly lower than those of Davis and Gao and of Troje, our experiments are carried out on a larger database and are based on video sequences, so our method is more suitable for gender recognition in surveillance environments. We also implemented Lee and Grimson's method and ran it on the same CASIA database; its recognition rate is 85%, which equals the result of using only the 90° sequences in our experiments. The remaining methods in Table 4 are difficult to implement and run on the same CASIA database because they use the point-light representation, in which points of light must be attached to the body joints in order to accurately locate the joint positions; in video sequences it is hard or even impossible to locate the joint positions accurately. It is therefore not meaningful to re-implement these methods and compare them with ours on the same CASIA database.
6 Conclusion
In this paper, we have presented a new gender classification scheme based on fusing the similarity measures from multi-view gait sequences. The experimental results show that our method achieves higher performance than most other methods and is well suited to surveillance scenarios. The proposed fusion method helps improve the performance of gender classification.
Acknowledgments. This work was supported by the Program for New Century Excellent Talents in University, the National Natural Science Foundation of China (No. 60332010), a Joint Project supported by the National Natural Science Foundation of China and the Royal Society of the UK (No. 60710059), and the Hi-Tech Research and Development Program of China (No. 2006AA01Z133). Portions of the research in this paper use the CASIA Gait Database collected by the Institute of Automation, Chinese Academy of Sciences.
References

1. Barclay, C.D., Cutting, J.E., Kozlowski, L.T.: Temporal and spatial factors in gait perception that influence gender recognition. Perception & Psychophysics 23(2), 145–152 (1978)
2. Davis, J.W., Gao, H.: Gender recognition from walking movements using adaptive three-mode PCA. In: IEEE CVPR Workshop on Articulated and Nonrigid Motion, IEEE Computer Society Press, Los Alamitos (2004)
3. Kozlowski, L.T., Cutting, J.E.: Recognizing the sex of a walker from a dynamic point-light display. Perception & Psychophysics 21(6), 575–580 (1977)
4. Lee, L.: Gait Analysis for Classification. Technical report, MIT AI Lab (2003)
5. Lee, L., Grimson, W.E.L.: Gait analysis for recognition and classification. In: FG 2002. IEEE International Conference on Automatic Face and Gesture Recognition, IEEE Computer Society Press, Los Alamitos (2002)
6. Nandakumar, K., Jain, A.K., Ross, A.A.: Score normalization in multimodal biometric systems. Pattern Recognition 38(12), 2270–2285 (2005)
7. Snelick, R., Indovina, M., Yen, J., Mink, A.: Multimodal biometrics: Issues in design and testing. In: Proceedings of the Fifth International Conference on Multimodal Interfaces, Vancouver (2003)
8. Troje, N.F.: Decomposing biological motion: A framework for analysis and synthesis of human gait patterns. Journal of Vision 2(5), 371–387 (2002)
9. Wang, Y., Yu, S., Wang, Y., Tan, T.: Gait recognition based on fusion of multi-view gait sequences. In: Zhang, D., Jain, A.K. (eds.) Advances in Biometrics. LNCS, vol. 3832, Springer, Heidelberg (2005)
10. Yu, S., Tan, D., Tan, T.: A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition. In: ICPR 2006. Proc. of the 18th International Conference on Pattern Recognition, Hong Kong, China (2006)
11. CASIA Gait Database, http://www.sinobiometrics.com
MAPACo-Training: A Novel Online Learning Algorithm of Behavior Models

Heping Li, Zhanyi Hu, Yihong Wu, and Fuchao Wu

National Laboratory of Pattern Recognition and Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Sciences, P.O. 2728, Beijing 100080, P.R. China
{hpli,huzy,yhwu,fcwu}@nlpr.ia.ac.cn
Abstract. The traditional co-training algorithm, which needs a great number of unlabeled examples in advance and then trains classifiers by an iterative learning approach, is not suitable for online learning of classifiers. To overcome this barrier, we propose a novel semi-supervised learning algorithm, called MAPACo-Training, by combining co-training with the principle of Maximum A Posteriori adaptation. MAPACo-Training is an online multi-class learning algorithm and has been successfully applied to online learning of behaviors modeled by Hidden Markov Models. The proposed algorithm is tested on Li's dataset as well as on Schuldt's dataset.
1 Introduction
Behavior modeling is driven by a wide range of applications, such as advanced user interfaces, visual surveillance, and virtual reality. Most existing works in this field focus on modeling behaviors with manual labeling, e.g., [1,2,3,4]. For example, Li and Greenspan [1] built a multi-scale model from time-varying contours, and Gong and Xiang [2] learned a Dynamically Multi-Linked Hidden Markov Model (DML-HMM). However, manual labeling of behavior patterns is laborious, impractical and error prone [5]. Recently, some behavior modeling methods based on semi-supervised or unsupervised learning [5,6,7,8] have been proposed. For instance, Xiang and Gong [5] discovered natural groupings of behavior patterns through unsupervised model selection and feature selection, and Zelnik-Manor and Irani [6] used the normalized-cut approach to automatically cluster the data and then build a statistical behavior model. Unfortunately, these methods need a great number of unlabeled examples beforehand; they are therefore unsuitable for online learning of behavior models and cannot automatically adjust the model parameters as the environment changes. The co-training approach proposed by Blum and Mitchell [9] is also a semi-supervised learning method. Levin et al. [10] used the co-training framework in the context of boosted binary classifiers to build automobile detectors.
Yan and Naphade [11] proposed a multi-view semi-supervised learning algorithm which avoids the requirement of the co-training approach that each view of the examples be sufficient for learning the target concepts. However, these methods belong to the off-line learning category. Javed et al. [12] combined the co-training approach and boosting to propose an algorithm for online detection and classification of moving objects, but behavior modeling is not considered there. In this paper, we present a novel semi-supervised learning method called MAPACo-Training, which combines the co-training approach with the principle of Maximum A Posteriori adaptation [8,13,16]. The proposed method can simultaneously train the parameters of multi-class models. We have successfully applied the method to online learning of the parameters of behaviors modeled by Hidden Markov Models (HMMs). Since it only needs a small labeled sample set beforehand, our method alleviates the problem of the methods [1,2,3,4]; and unlike the approaches [5,6,7,8], it can automatically adjust the parameters with the current example online. The remainder of this paper is organized as follows. Motion signature representation is outlined in Section 2. Section 3 is a detailed description of MAPACo-Training. Experimental results are reported in Section 4, and conclusions as well as future research directions are given in Section 5.
2 Motion Signature Representation

2.1 Feature Extraction
Background subtraction is used to detect the foreground. In our approach, two types of features are considered: (1) a shape feature and (2) an optical flow feature [14]. The size of the foreground region varies with the distance of the object to the camera, the camera parameters, and the size of the object, so we need to normalize the foreground region. First, we equidistantly divide the bounding rectangle of the foreground into $U \times V$ non-overlapping sub-blocks. Then, the normalized value of each sub-block is calculated as follows:

$x_i^1 = s\_sub(i) / \max, \quad i = 1, 2, \ldots, num$   (1)

where $num = U \times V$ is the number of sub-blocks, $s\_sub(i)$ is the number of foreground pixels in the $i$-th sub-block, and $\max$ is the maximum value of $\{s\_sub(i), i = 1, 2, \ldots, num\}$. The optical flow value of each sub-block is calculated as follows:

$x_i^j = f\_sub(i, j) / sum(i), \quad i = 1, 2, \ldots, num, \; j = 2, 3$   (2)

where $f\_sub(i, j)$ with $j = 2, 3$ is the sum of the horizontal and vertical optical flow, respectively, in the $i$-th sub-block, and $sum(i)$ is the number of pixels in the $i$-th sub-block. The feature vector at frame $t$ from shape and optical flow is then

$o_t^d = [x_1^d, x_2^d, \ldots, x_{num}^d], \quad d = 1, 2, 3.$
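A minimal sketch of the sub-block normalisation of Eqs. (1) and (2) is given below (not the authors' implementation); fg is a hypothetical binary foreground mask for the bounding rectangle and flow a hypothetical dense optical-flow field of the same size.

import numpy as np

def subblock_features(fg, flow, U=9, V=5):
    """Shape feature x^1 and optical-flow features x^2 (horizontal), x^3 (vertical)
    computed over U x V non-overlapping sub-blocks of the foreground bounding box."""
    H, W = fg.shape
    hs = np.array_split(np.arange(H), U)
    ws = np.array_split(np.arange(W), V)
    s_sub, fx_sub, fy_sub, npix = [], [], [], []
    for r in hs:
        for c in ws:
            block = np.ix_(r, c)
            s_sub.append(fg[block].sum())             # foreground pixels in the block
            fx_sub.append(flow[..., 0][block].sum())  # sum of horizontal flow
            fy_sub.append(flow[..., 1][block].sum())  # sum of vertical flow
            npix.append(len(r) * len(c))              # pixel count of the block
    s_sub, npix = np.array(s_sub, float), np.array(npix, float)
    x1 = s_sub / max(s_sub.max(), 1.0)                # Eq. (1): shape feature
    x2 = np.array(fx_sub) / npix                      # Eq. (2), j = 2
    x3 = np.array(fy_sub) / npix                      # Eq. (2), j = 3
    return x1, x2, x3

# Toy foreground mask and flow field.
fg = np.zeros((90, 50), dtype=np.uint8); fg[10:85, 15:35] = 1
flow = np.random.default_rng(2).normal(size=(90, 50, 2))
x1, x2, x3 = subblock_features(fg, flow)
print(x1.shape, x2.shape, x3.shape)   # each (45,) for U=9, V=5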
2.2 Motion Signature Representation
Given the observation feature sequences $O_T^d = \{o_1^d, o_2^d, \cdots, o_t^d, \cdots, o_T^d\}$, two different Hidden Markov Models (HMMs) are adopted to build the behavior models: one is a single continuous HMM from shape, and the other is like a Parallel Hidden Markov Model (PHMM) with two continuous HMMs from optical flow. These HMM topologies are shown in Figure 1, where shaded circles denote observation nodes and clear circles hidden nodes. For the optical flow model, the two HMMs are learned independently. The output probability density function is the following Gaussian Mixture Model (GMM):

$p(o_t^d|\theta) = \sum_{k=1}^{K} \alpha_k p_k(o_t^d | \mu_k, \Sigma_k)$   (3)

where $\theta = \{\alpha_k, \mu_k, \Sigma_k, k = 1, 2, \ldots, K\}$ represents the parameters of the GMM, including the weight $\alpha_k$, mean $\mu_k$ and covariance matrix $\Sigma_k$ of every mixture component, and $\sum_{k=1}^{K} \alpha_k = 1$.
Fig. 1. HMM topology: (a) shape model; (b) optical flow model
Using the forward procedure, we compute the observation probabilities $P(O_T^d|\lambda_c^d)$, $c = 1, 2, \ldots, C$, for the observation feature sequences $O_T^d$, where $C$ is the number of behavior classes and $\lambda_c^d$ is the HMM parameter set of the $c$-th behavior class. Since the output probability density function is a GMM, the probabilities are normalized [15] as follows:

$\bar{P}(O_T^d|\lambda_i^d) = P(O_T^d|\lambda_i^d) \Big/ \sum_{c=1}^{C} P(O_T^d|\lambda_c^d)$   (4)

For the optical flow model (Figure 1(b)), the following operation is further performed:

$\bar{P}(O_T^{2,3}|\lambda_i^{2,3}) = \frac{\bar{P}(O_T^2|\lambda_i^2)\,\bar{P}(O_T^3|\lambda_i^3)}{\sum_{c=1}^{C} \big[\bar{P}(O_T^2|\lambda_c^2)\,\bar{P}(O_T^3|\lambda_c^3)\big]}$   (5)

The Bayes classifier is used as our base classifier. According to the Bayes rule, the posterior
$P(c|O_T^d) = \bar{P}(O_T^d|\lambda_c^d) P(c) / P(O_T^d)$, where $P(c) = 1/C$, so $P(c|O_T^d) \propto \bar{P}(O_T^d|\lambda_c^d)$. Thus, if $\bar{P}(O_T^d|\lambda_{c_0}^d) = \max_c \bar{P}(O_T^d|\lambda_c^d)$, then $O_T^d$ belongs to the $c_0$-th behavior. In our proposed algorithm, we set $f_1^i = \bar{P}(O_T^1|\lambda_i^1)$ and $f_2^i = \bar{P}(O_T^{2,3}|\lambda_i^{2,3})$.
MAPACo-Training
In this section, we propose a new semi-supervised learning algorithm called Maximum A Posteriori Adaptation Co-Training (MAPACo-Training) which attempts to learn behavior models online. We first describe the principle of MAP adaptation, and then give the details of MAPACo-Training. 3.1
MAP Adaptation
MAP adaptation has widely been used in speaker and face verification [13]. Recently, Zhang et al. in [8,16] used it for unusual event detection and meeting event recognition. During the course of learning the parameters of GMM-based HMM in [8,16], the state-transition probabilities are kept fixed while the mean, variance and mixture weights are adapted as follows: (1) According to the existing parameters, new statistical values are computed: K d d αk pk (odt |μk , Σk ) (6) P (i|ot ) = αi pi (ot |μi , Σi ) k=1
αnew i = μnew i
T t=1
T Σinew =
=
t=1
T t=1
P (i|odt )
T
T odt P (i|odt )
t=1
(7) P (i|odt )
P (i|odt )(odt − μnew )(odt − μnew )T i i T d t=1 P (i|ot )
(8)
(9)
(2) New parameters are estimated as follows: + (1 − ρ) · αold α ˆ i = ρ · αnew i i
(10)
+ (1 − ρ) · μold μ ˆi = ρ · μnew i i
(11)
T ˆi = ρ · Σinew + (1 − ρ) · [Σiold + (ˆ Σ μi − μold μi − μold i )(ˆ i ) ]
(12)
where ρ(0 ≤ ρ ≤ 1) is the scale factor. We use the principle of MAP adaptation into our algorithm. More details about MAP adaptation can be found in [8,13,16].
476
3.2
H. Li et al.
MAPACo-Training Algorithm
The traditional co-training algorithm [9] needs to get a great number of unlabeled samples in advance and then train models by an approach of iterative learning. It is an off-line learning method. By combining the co-training and the MAP adaptation, we propose a novel online multi-class learning algorithm called MAPACo-Training as follows: Input: Labeled data L including a small training sample set Ltr and a small validation sample set Lv with two views V1 and V2 , threshold value Th > 1 and Tnum ≥ 1. Output: a classifier from the probabilitiesf 1 , f 2 , ..., f C . MAPACo-Training 1. Create f1i and f2i (i = 1, 2, . . . , C) using Ltr on V1 and V2 . Set the new training sample set of each one of the C classes Lib = φ (b=1,2); 2. For k = 1, 2, . . . , C (a) For current sample S, assume n = max{f1j ,j = 1, 2, . . . , C, j = k}, j
m = max{f2j , j = 1, 2, . . . , C, j = k}, j
i. if f1k /f1n ≥ Th , the view V2 of sample S is added into Lk2 as a new sample; ii. if f2k /f2m ≥ Th , the view V1 of sample S is added into Lk1 as a new sample; iii. if 1 < f1k /f1n < Th and 1 < f2k /f2m < Th , the view V1 of sample S is added into Lk1 as a new sample and the view V2 of sample S is added into Lk2 as a new sample. (b) if the sample number in Lkb equals Tnum , the parameters of model fbk are updated according to MAP equations (6)∼(12) with these samples in Lkb and the scale factor ρ is decided by validation sample set Lv . And then let Lkb = φ. 3. Combine f i = ω1 f1i + ω2 f2i (ω1 + ω2 = 1) using Lv . 4. Create a new classifier using f i according to the Bayes theory. Similar to co-training, two base classifiers of every class model need to be trained on separate features of the same sample. How to select samples to train the models? In this algorithm, we use a threshold Th to do it. The conditions (i)(ii) show if one base classifier can predict the label of the sample confidently, then we add this sample into the training set of the other base classifier of the corresponding class. The condition (iii) means that both base classifiers can get the same label according to the bayes rule, but neither of them is confident, which shows the sample is useful for improving the performance of the two classifiers. During the course of updating parameters by MAP adaptation equations (6)∼(12), we use validation set Lv to decide the scale factor ρ. If the class prediction for a sample from the conditions (i)∼(iii) is not correct, which means the sample is a noise, the sample is no longer used for further learning by setting
MAPACo-Training: A Novel Online Learning Algorithm of Behavior Models
477
ρ = 0 according to Lv . In our experiment, we assume the possible value of ρ is 0 or a constant¯ ρ (0 ≤ ρ¯ ≤ 0.5). Remark: From the equations (6)∼(12), we can see that the MAP adaptation only uses the current samples to calculate the new statistical values and then gets the new parameters by simple weighted estimation. It avoids to directly train the HMM parameters from a great number of samples by EM algorithm and improves the computational efficiency. The MAPACo-Training algorithm starts from a small label sample set Ltr and then updates the parameters by the MAP adaptation. So the algorithm is suitable for online multi-class learning.
4
Experiments
We test our method from two datasets: Li’s dataset [17] and Schuldt’s dataset [18]. In the experiments, U =9 and V =5 are used for dividing the bounded rectangle of foreground. To each type of features such as shape, horizontal optical flow and vertical optical flow, the Principal Component Analysis (PCA) is used to reduce the 45-dimensional features to the 8-dimensional ones. 4.1
Results on Li’s Dataset
38
24
8
36
22
7
34
20
32
18
30
16
28
14
26
12
24
10
22 0
500
1000
1500
2000
6
HTER(%)
HTER(%)
HTER(%)
We get a video consisting of five kinds of behaviors from Li’s dataset [17], of which each one is performed by 18 subjects. Image size is of 160 × 120 pixels and frame rate is of 6 frames/sec. The video totals 38120 frames including “box”
8 0
2500
5
4
3
2
500
number of samples
1000
1500
2000
1 0
2500
500
number of samples
(a)
(b)
22
1000
1500
2000
2500
number of samples
(c)
25
22
20 20 18
14
18
12 10
HTER(%)
20
HTER(%)
HTER(%)
16
15
16
14
8 12 6 4 0
500
1000
1500
number of samples
(d)
2000
2500
10 0
500
1000
1500
number of samples
(e)
2000
2500
10 0
500
1000
1500
2000
2500
number of samples
(f)
Fig. 2. The learning curves: (a) box; (b) kick; (c) lookround; (d) standup (e) wave; (f) average HTER
478
H. Li et al. Table 1. Initial confusion matrix box kick lookround standup wave
box 37.14 12.86 1.43 1.43 16.67
kick 34.29 72.86 1.90 33.80 2.38
lookround 7.62 1.90 92.86 0.48 14.76
standup 8.57 12.38 3.33 63.81 5.24
wave 12.38 0 0.48 0.48 60.95
Table 2. Final confusion matrix box kick lookround standup wave
box 54.76 0 0 0.95 10.95
kick 18.10 86.67 0 3.80 0.95
lookround 3.33 1.43 95.72 0.48 1.43
standup 7.14 11.42 3.33 94.29 5.24
wave 16.67 0.48 0.95 0.48 81.43
(8000 frames), “kick” (7600 frames), “lookround” (7820 frames), “standup” (7040 frames) and “wave” (7660 frames). We slice this video sequence into 3810 segments with the fixed time duration of 20 frames and the step length of 10 frames, where 25 segments in every class are selected for the small training sample setLtr , 12 segments for the validation sample set Lv , 210 segments for the test sample set and the remaining segments for online learning. Parameters in our algorithm are preset as: Th =1.5, Tnum =5 and ρ¯ = 0.2. MAP adaptation is only used to update the means. The proposed algorithm is implemented in Matlab 6.0 and tested on a 2.0 GHz Pentium 4 PC with 256MB memory. The average time per frame is about 0.228s. As a result, our algorithm at the correct implement could be used for those applications with a frame rate of 6∼10 frames/sec. Figure 2 gives the learning curves for behavior models of “box”, “kick”, “lookround”, “standup”, “wave” and average half-total error rate (HTER), where HTER=(FAR+FRR)/2 [8], FAR is false acceptance rate and FRR is false rejection rate. The horizontal axis shows the number of effective samples for estimating the parameters in the MAPACo-Training algorithm. The vertical axis shows the HTER. Figure 2(f) is the average HTER curve of all behaviors. From these curves, we can see the learning performance of behavior models can be markedly improved by MAPACo-Training, and after about 500 samples are used, the curves almost become stable. Table 1 gives the initial confusion matrix from the initial behavior models trained by the small training set Ltr , and Table 2 shows the final confusion matrix from the final behavior models by our algorithm. From these tables, we can see that when the initial recognition rate is low, those for “box”, “kick”, “standup” and “wave”, the final recognition rate is clearly improved. And when the initial recognition rate is high, that for “lookround”, the final recognition rate is still high.
MAPACo-Training: A Novel Online Learning Algorithm of Behavior Models
4.2
479
Results on Schuldt’s Dataset
We get a video sequence of 49813 frames from Schuldt’s dataset [18] including “box” (8370 frames), “clap” (8476 frames), “wave” (8275 frames), “run” (7945 frames), “jog” (8170 frames) and “walk” (8577 frames). We slice this video sequence into 4978 segments with the fix time duration of 25 frames and the step length of 10 frames, where 30 segments in each class are selected for the small training sample set Ltr , 16 segments for the validation sample set Lv , 240 segments for the test sample set, and the remaining segments for online learning. Parameters in our algorithm are preset as: Th =1.5, Tnum =5 and ρ¯ = 0.4. MAP adaptation is only used to update the means and variances. Figure 3 shows the learning curves. We can see that the learning results for all the behaviors except “run” are very good. For the behavior “run”, the main reason of poor performance is that running of some people is very similar to the jogging of the others in this dataset [18], which is difficult to distinguish. From the initial confusion matrix (Table 3) and the final confusion matrix (Table 4), 15
30
22 20
25 18
5
16
HTER(%)
HTER(%)
20
HTER(%)
10
15
10
14 12 10 8
5 6 0 0
500
1000
1500
2000
2500
0 0
3000
500
number of samples
1000
1500
2000
2500
4 0
3000
500
1000
number of samples
(a)
(b)
25
40
24
38
1500
2000
2500
3000
number of samples
(c) 34 32 30
36
23
28
21
HTER(%)
HTER(%)
HTER(%)
34 22
32 30
26 24 22 20
20
28 18
19
26
18 0
24 0
500
1000
1500
2000
2500
3000
number of samples
16 500
1000
1500
2000
2500
3000
14 0
500
1000
number of samples
(d)
1500
2000
2500
3000
number of samples
(e)
(f)
24
15 14
22
13 12
HTER(%)
HTER(%)
20
18
16
11 10 9 8 7
14
6 12 0
500
1000
1500
2000
number of samples
(g)
2500
3000
5 0
500
1000
1500
2000
2500
3000
number of samples
(h)
Fig. 3. The learning curves: (a) box; (b) clap; (c) wave; (d) run; (e) jog; (f) walk; (g) average HTER (h) run+jog
480
H. Li et al. Table 3. Initial confusion matrix box clap wave run jog walk
box 79.17 24.17 6.25 0.42 2.50 7.08
clap 2.50 53.75 8.75 0 4.58 1.67
wave 2.92 9.17 60.00 0 0 0
run 3.33 3.33 2.50 80.00 50.42 25.83
jog 3.33 2.08 18.75 14.16 31.67 8.75
walk 8.75 7.50 3.75 5.42 10.83 56.67
jog 0 0 0 30.42 59.58 20.00
walk 0.84 0 0 4.58 7.08 57.92
Table 4. Final confusion matrix box clap wave run jog walk
box 94.58 3.33 0 0.83 0.42 0.41
clap 2.50 94.17 7.50 0 0.42 0.42
wave 2.08 2.50 91.25 0 2.08 3.33
run 0 0 1.25 64.17 30.42 17.92
we can see the confusion values between a pair of behaviors other than “run” and “jog” are not high. But for “run” and “jog”, the HTER about “run” is only increased from 18.5% to 23% and nearly unchanged after about 2000 samples while the HTER about “jog” is declined from 38.5% to 24.5%. When we regard “run” and “jog” as one behavior “run+jog”, the result becomes quite satisfactory as shown in Figure 3(h).
5 Conclusion
In this paper, we proposed a semi-supervised learning algorithm called MAPACo-Training, which combines the traditional co-training algorithm with the principle of MAP adaptation. The algorithm is suitable for online learning of behaviors modeled by HMMs, and experiments on two datasets validate our method. In the future, we will explore a better way to train the models of similar behaviors such as "run" and "jog".

Acknowledgment. This work was supported by the National Natural Science Foundation of China under grants (60633070, 60475009) and by the National Key Technology R&D Program under grants (2006BAH02A03, 2006BAH02A13).
References

1. Li, H., Greenspan, M.: Multi-scale gesture recognition from time-varying contours. In: IEEE Int'l Conf. on Computer Vision, pp. 236–243. IEEE Computer Society Press, Los Alamitos (2005)
2. Gong, S.G., Xiang, T.: Recognition of group activities using dynamic probabilistic networks. In: IEEE Int'l Conf. on Computer Vision, pp. 742–749. IEEE Computer Society Press, Los Alamitos (2003)
3. Bobick, A.F., Davis, J.W.: The recognition of human movement using temporal templates. IEEE Trans. on Pattern Analysis and Machine Intelligence 23, 257–267 (2001)
4. Laptev, I., Lindeberg, T.: Space-time interest points. In: IEEE Int'l Conf. on Computer Vision, pp. 432–439. IEEE Computer Society Press, Los Alamitos (2003)
5. Xiang, T., Gong, S.G.: Video behaviour profiling and abnormality detection without manual labeling. In: IEEE Int'l Conf. on Computer Vision, pp. 1238–1245. IEEE Computer Society Press, Los Alamitos (2005)
6. Zelnik-Manor, L., Irani, M.: Event-based analysis of video. In: IEEE Conf. on Computer Vision and Pattern Recognition, pp. 123–130. IEEE Computer Society Press, Los Alamitos (2001)
7. Zhong, H., Shi, J., Visontai, M.: Detecting unusual activity in video. In: IEEE Conf. on Computer Vision and Pattern Recognition, pp. 819–826. IEEE Computer Society Press, Los Alamitos (2004)
8. Zhang, D., Gatica-Perez, D., Bengio, S., McCowan, I.: Semi-supervised adapted HMMs for unusual event detection. In: IEEE Conf. on Computer Vision and Pattern Recognition, pp. 611–618. IEEE Computer Society Press, Los Alamitos (2005)
9. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: 11th Annual Conference on Computational Learning Theory (1998)
10. Levin, A., Viola, P., Freund, Y.: Unsupervised improvement of visual detectors using co-training. In: IEEE Int'l Conf. on Computer Vision, pp. 626–633. IEEE Computer Society Press, Los Alamitos (2003)
11. Yan, R., Naphade, M.: Semi-supervised cross feature learning for semantic concept detection in videos. In: IEEE Conf. on Computer Vision and Pattern Recognition, pp. 657–663. IEEE Computer Society Press, Los Alamitos (2005)
12. Javed, O., Ali, S., Shah, M.: Online detection and classification of moving objects using progressively improving detectors. In: IEEE Conf. on Computer Vision and Pattern Recognition, pp. 696–701. IEEE Computer Society Press, Los Alamitos (2005)
13. Reynolds, D.A., Quatieri, T.F., Dunn, R.B.: Speaker verification using adapted Gaussian mixture models. Digital Signal Processing 10, 19–41 (2000)
14. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: DARPA Image Understanding Workshop (April 1981)
15. Lv, F., Nevatia, R.: Recognition and segmentation of 3-D human action using HMM and multi-class AdaBoost. In: Proc. European Conference on Computer Vision, vol. IV, pp. 359–372 (2006)
16. Zhang, D., Gatica-Perez, D., Bengio, S.: Semi-supervised meeting event recognition with adapted HMMs. In: ICME. IEEE International Conference on Multimedia and Expo (2005)
17. Li, H., Hu, Z., Wu, Y., Wu, F.: Behavior modeling and recognition based on space-time image features. In: International Conference on Pattern Recognition, vol. 1, pp. 243–246 (2006)
18. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: International Conference on Pattern Recognition, vol. 3, pp. 32–36 (2004)
Optimal Learning High-Order Markov Random Fields Priors of Colour Image

Ke Zhang, Huidong Jin, Zhouyu Fu, and Nianjun Liu

Research School of Information Sciences and Engineering (RSISE), Australian National University, and National ICT Australia (NICTA), Canberra Lab, ACT, Australia
{ke.zhang,huidong.jin,zhouyu.fu,nianjun.liu}@rsise.anu.edu.au
Abstract. In this paper, we present an optimised learning algorithm for learning parametric prior models for high-order Markov random fields (MRFs) of colour images. Compared to the priors used by conventional low-order MRFs, the learned priors have richer expressive power and can capture the statistics of natural scenes. Our optimal learning algorithm is achieved by simplifying the estimation of the partition function without compromising the accuracy of the learned model. The parameters in the MRF colour image priors are learned alternately and iteratively in an EM-like fashion by maximising their likelihood. We demonstrate the capability of the proposed learning algorithm for high-order MRF colour image priors with the application of colour image denoising. Experimental results show the superior performance of our algorithm compared to the state-of-the-art colour image prior in [1], although we use a much smaller training image set. Keywords: Markov random fields, image prior, colour image denoising.
1 Introduction
The need for prior models of image structure arises in many computer vision problems, including stereo, optical flow, denoising, super-resolution, and image-based rendering. Whenever an observed "scene" must be inferred from noisy, degraded, or partially missing image information, a natural image prior is required [2]. Modeling image priors is a challenging task, because of the high dimensionality of images, their non-Gaussian statistics, and the need to model correlations in image structure over extended image neighbourhoods [3]. Some researchers have attempted to use sparse coding approaches to address the modeling of complex image structure. Based on a variety of simple assumptions, they obtained sparse representations of local image structure in terms of the statistics of filters that are local in position, orientation, and scale [4][5]. However, these methods, which focus on image patches, provide no direct way of modeling the statistics of whole images [3]. Markov random fields (MRFs), on the other hand, have been widely used in computer vision but exhibit serious limitations. In particular, as MRF priors typically exploit handcrafted clique potentials and small neighbourhood systems, the expressiveness of the models is limited, and they only crudely capture the statistics
of natural images [6]. Since typical MRF models consider simple nearest-neighbour relations and model first-derivative filter responses, the extremely local (e.g., first-order) priors employed by most MRF methods may not show any advantage compared with the rich, patch-based priors obtained by sparse coding methods [3]. However, Roth and Black [3] went beyond this limitation with the Fields of Experts (FoE) model, a generic MRF model of image priors over extended neighbourhoods. To make the model practical, they represented the MRF potentials as a Product of Experts (PoE) [7]. As the FoE takes the product over all neighbourhoods of each image patch, the number of parameters is determined only by the size of the maximal cliques in the MRF model and the number of filters defining the potential [3]. Furthermore, because of the homogeneity of the potential functions, the model places no restriction on the size of the images [3]. As shown in their experiments, the FoE achieves state-of-the-art performance for monochromatic image denoising and inpainting. Building on the work of Roth and Black [3], McAuley et al. [1] proposed an MRF colour image prior model by generalising the FoE model to capture the correlations between different colour channels. Their model was compared with the original FoE monochromatic prior for colour image denoising and showed performance improvements, although their learning algorithm is clearly sub-optimal. In this paper we build on McAuley et al.'s [1] contribution to further improve the learning algorithm for colour image priors. The proposed model of colour image priors is akin to the one in [1], yet by improving the estimation of the model partition function, both the high-dimensional filters and their corresponding weights can be optimally learned by maximising the likelihood. The experimental results show improvements over the results reported by McAuley et al. [1] on colour image denoising, although we use a much smaller training image set. The remainder of this paper is organised as follows. In Section 2, we briefly describe the MRF image prior models and their original learning approaches. Our optimised learning algorithm is introduced in Section 3. In Section 4, we demonstrate the performance of our learning algorithm and compare the denoising quality with three other methods (McAuley et al.'s [1], bilateral filtering, and a wavelet-based denoising approach). Finally, Section 5 concludes this paper.
2 MRF Prior Model

2.1 Monochromatic MRF Prior Model
In [3], Roth and Black merged the ideas of learning in MRFs and sparse image coding in order to develop a high-order MRF prior model in which the cliques are square image patches [8][4]. According to the Hammersley-Clifford theorem, the joint probability distribution of an MRF with clique set $C$ can be written as

$P(x) = \frac{1}{Z(\Theta)} \prod_{c \in C} \phi_c(x_c)$   (1)

where $\phi_c(x_c)$ is a potential function and $Z(\Theta)$ is the partition function.
The potential functions over these cliques are assumed to be Products of Experts [7], i.e., products of individual functions $\phi_f$ (with a parameter $\alpha_f$) given the response of a filter $J_f$ to the image patch $x_c$:

$\phi_c(x_c; J, \alpha) = \prod_{f=1}^{F} \phi_f(x_c; J_f, \alpha_f)$   (2)
In the prior model, the potential functions are assumed to be stationary, i.e., every clique in the image has the same parameter set $\Theta = \{J_f, \alpha_f : 1 \le f \le F\}$. The particular form they postulated for the individual experts is related to the Student-t distribution and is given by [3]

$\phi_f(x_c; J_f, \alpha_f) = \left(1 + \tfrac{1}{2}\langle J_f, x_c\rangle^2\right)^{-\alpha_f}$   (3)
The problem of learning MRF priors can then be recast in a parameter estimation setting: the optimal $J$'s and $\alpha$'s are recovered by maximising the likelihood given by the joint probability in Equation 1. However, calculating the true partition function $Z(\Theta)$ is intractable due to its high complexity, so approximate procedures are often used to estimate it, such as the contrastive divergence method used by Roth and Black [3].

2.2 Colour Image MRF Prior Model
McAuley et al. [1] extended the monochromatic MRF prior to handle colour images. They proposed a "higher"-order MRF prior model, i.e., using 3 × 3 × 3 cliques instead of 3 × 3 cliques, to represent the correlations between colour channels over the local neighbourhood. To deal with the dramatic increase in computational load caused by the significant rise in data dimension in this colour model, they adopted a simple gradient-ascent-based learning algorithm rather than learning by maximising the likelihood. In their learning algorithm, they performed singular value decomposition (SVD) on the covariance matrix of the training data to learn the filters $J$'s, and only updated the $\alpha$'s along the gradient direction with the filters fixed. The estimation of the $\alpha$'s in their model takes the following form. Let $D = \{X_1, X_2, \cdots, X_M\}$ be a set of training images, $R = \{Y_1, Y_2, \cdots, Y_N\}$ a set of random images, and $P(D|\Theta)$ the likelihood of the training images given the model. Let $\Theta = \{\theta_1, \theta_2, \cdots, \theta_F\}$, where $\theta_f = (J_f, \alpha_f)$, be the set of filters and their corresponding weights. Then the likelihood of the training images, $P(D|\Theta)$, is given by

$P(D|\Theta) = \prod_{i=1}^{M} \frac{1}{Z(\Theta)} \prod_{c \in C} \phi_c(x_c^i; J, \alpha)$   (4)

where $Z(\Theta)$ is the partition function. They used the arithmetic mean of the responses of the random images, $\hat{Z}_{am}(\Theta)$, to approximate the real value of the partition function:
$Z(\Theta) \propto \hat{Z}_{am}(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \prod_{c \in Y_i} \phi_c(x_c^i; J, \alpha)$   (5)

From Eq. 4, the gradient of the log-likelihood function with respect to $\alpha_k$ is given by

$\frac{\partial}{\partial \alpha_k} \log P(D|\Theta) = \sum_{i=1}^{M} \sum_{c \in X_i} \psi_c(J_k, x_c^i) - M \frac{\partial}{\partial \alpha_k} \log Z(\Theta)$   (6)
where $\psi(a, b) = -\log(1 + \tfrac{1}{2}\langle a, b\rangle^2)$. From Eq. 5, the gradient of the log partition function is given by

$\frac{\partial}{\partial \alpha_k} \log Z(\Theta) = \frac{\sum_{i=1}^{N} \big[\big(\sum_{c \in Y_i} \psi_c(J_k, x)\big)\big(\prod_{c \in Y_i} \phi_c(x_c; J, \alpha)\big)\big]}{\sum_{i=1}^{N} \prod_{c \in Y_i} \phi_c(x_c; J, \alpha)}$   (7)
and the α’s are updated along the gradient ascent direction which can be obtained by Eq. 6 and Eq. 7. Note that the algorithm proposed by McAuley et al. [1] only updates the α’s with fixed values of filters. It is not optimal for several reasons: (1) the filters obtained from SVD are sub–optimal because they ideally should be learned by maximising the likelihood; (2) the α’s in their learning algorithm must be initialised to zero for numerical reasons [1], and the gradient ascent dose not work (absolute values of α’s are not convergent and their relative values remain the same) after the first iteration according to our implementation. Our aim is to learn a set of filters and their corresponding weights that maximises the likelihood. This can be achieved via standard gradient ascent method given initial estimates of the model parameters. However, as the Z(Θ) estimated in [1] is just proportional to the true Z(Θ), we can not obtain a reliable model likelihood by Eq.4. Furthermore, when we performed the partial derivative with respect to J and implemented it, we found that the J–α gradient iterations did not converge due to numerical instability in the estimation of Z(Θ).
3 An Optimised Learning Algorithm

3.1 Estimation of Partition Function
Although the approximation of Z(Θ), \hat{Z}_{am}(\Theta), has a clear physical meaning (when the images used to approximate Z(Θ) cover all possible images, this estimate equals the true Z(Θ)), the main problem lies in the parameter update, i.e., the complicated form of Eq. 7, which makes the gradient iteration error-prone due to numerical problems. To solve these problems, we use the geometric mean \hat{Z}_{gm}(\Theta), which is more robust in the case of non-Gaussian distributions [9], instead of the arithmetic mean \hat{Z}_{am}(\Theta) (Eq. 5). The geometric mean of the partition function can be written as

\hat{Z}_{gm}(\Theta) = \left( \prod_{i=1}^{N} \prod_{c \in Y_i} \phi_c(x_{ic}; J, \alpha) \right)^{1/N}.   (8)
According to Jensen’s inequality [10], we can obtain the upper and lower boundaries of Zˆam (Θ) and Zˆgm (Θ) N N N N fi (ε)2 1 1 1 1 −1 fi (ε) ≥ ( fi (ε)) N ≥ N ( (9) ( i=1 )2 ≥ ) , N N i=1 f (ε) i=1 i=1 i where fi (ε) = c∈Yi φc (xic ; J, α). As shown in Fig. 1, the log values of Zˆgm (Θ) and Zˆam (Θ) are very close along the increasing number of random images. Furthermore, we found that the standard deviations of log Zˆgm (Θ) is smaller than those of log Zˆam (Θ) over various amount of random images tests. Therefore, we can say that Zˆgm (Θ) is a robust approximation to the mean of the partition function, and can use a small set of random images to estimate the mean values of partition function in log form. In the calculation of the model likelihood given a set of parameters, ˆ we can use Z(Θ) = T × Zˆgm (Θ) to estimate the true value of Z(Θ). Here, T is the number of assignments for all the possible pixel values of the images patch, i.e., T = 2563×3×3 (3 × 3 clique size) or T = 2565×5×3 (5 × 5 clique size). Based ˆ on this observation, the approximation of log–partition function Z(Θ) can be rewritten as: ˆ log Z(Θ) = log T + log Zˆgm (Θ) = log T +
N F 1 αf ψf (Jf , xic ). N i=1
(10)
c∈Xi f =1
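A minimal sketch of how this geometric-mean estimate of the log-partition function (Eq. 10) could be computed from a set of random image patches is given below. The function name and the data layout (each random image reduced to an array of flattened cliques) are assumptions, not the authors' code:

```python
import numpy as np

def log_partition_estimate(random_cliques, J, alpha, T_log):
    """Geometric-mean estimate of log Z(Theta) as in Eq. 10.

    random_cliques: list of length N; element i is an array of shape
                    (num_cliques_i, d) with the flattened cliques of the
                    i-th random image.
    J:              array of shape (F, d), one filter per row.
    alpha:          array of shape (F,), expert weights.
    T_log:          log of the number of possible patch assignments,
                    e.g. 27 * np.log(256) for a 3x3x3 clique.
    """
    N = len(random_cliques)
    total = 0.0
    for cliques in random_cliques:
        responses = cliques @ J.T                  # <J_f, x_c> for all cliques and filters
        psi = -np.log1p(0.5 * responses ** 2)      # psi_f(J_f, x_c)
        total += np.sum(psi * alpha[None, :])      # sum over cliques and filters
    return T_log + total / N
```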
3.2 Proposed Learning Algorithm
The log-likelihood of the MRF prior model (Eq. 4) can be rewritten as follows:

\log P(D|\Theta) = \sum_{i=1}^{M} \sum_{c \in X_i} \sum_{f=1}^{F} \alpha_f \psi_f(J_f, x_{ic}) - M \log T - \frac{M}{N} \sum_{i=1}^{N} \sum_{c \in Y_i} \sum_{f=1}^{F} \alpha_f \psi_f(J_f, x_{ic}),   (11)
where the parameters have the same denotations as in Section 2.1. As we can expect to obtain a more accurate value of the log model likelihood than the one suggested by [1], Eq. 11 can serve as an indicator of whether a given set of parameters has a higher likelihood in our learning algorithm. Furthermore, based on Eq. 11, the partial derivatives with respect to both the filters J's and their corresponding weights α's are significantly simplified:

\frac{\partial \log P(D|\Theta)}{\partial J_k} = \sum_{i=1}^{M} \sum_{c \in X_i} \frac{-\alpha_k x_{ic} \langle J_k, x_{ic}\rangle}{1 + \tfrac{1}{2}\langle J_k, x_{ic}\rangle^2} - \frac{M}{N} \sum_{i=1}^{N} \sum_{c \in Y_i} \frac{-\alpha_k x_{ic} \langle J_k, x_{ic}\rangle}{1 + \tfrac{1}{2}\langle J_k, x_{ic}\rangle^2},   (12)

\frac{\partial \log P(D|\Theta)}{\partial \alpha_k} = \sum_{i=1}^{M} \sum_{c \in X_i} \psi_c(J_k, x_{ic}) - \frac{M}{N} \sum_{i=1}^{N} \sum_{c \in Y_i} \psi_c(J_k, x_{ic}).   (13)
Our learning algorithm is summarised as follows:

1. Initialise the filters (J's) by performing SVD over the training images. The initial values of the α's are randomly generated.

2. Update the α's by applying a line search in the gradient direction given by Eq. 13. The step size μ_α is chosen such that the highest log-likelihood in Eq. 11 is reached:

\alpha \leftarrow \alpha + \mu_\alpha \left\{ \frac{\partial}{\partial \alpha_i} \log P(D|\Theta) \right\}.   (14)

3. Update the J's by applying a line search in the gradient direction given by Eq. 12. The step size μ_J is, again, chosen by maximising the log-likelihood in Eq. 11:

J \leftarrow J + \mu_J \left\{ \frac{\partial}{\partial J_i} \log P(D|\Theta) \right\}.   (15)

4. Repeat steps 2–3 until the log-likelihood of the model does not change.

As Eq. 14 and Eq. 15 indicate, both the α's and the J's are updated along the direction that maximises the model likelihood, so the proposed learning algorithm is optimal with respect to the model likelihood in Eq. 11. Since the update step sizes (μ_α and μ_J) are very sensitive to the input parameters, it is quite difficult to specify them as constants. In our implementation, we employ a back-tracking line search to find the optimal step in each update [11].
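The following sketch illustrates the alternating line-search updates of steps 2–3 under simplifying assumptions: `eval_ll`, `grad_alpha` and `grad_J` are assumed to evaluate Eq. 11, Eq. 13 and Eq. 12 respectively on NumPy arrays, and the backtracking rule simply shrinks the step until the log-likelihood improves. It is an illustration of the procedure, not the authors' implementation:

```python
def backtracking_step(params, grad, eval_ll, step0=1.0, shrink=0.5, max_tries=20):
    """Backtracking line search along `grad`; returns the updated parameters."""
    base_ll = eval_ll(params)
    step = step0
    for _ in range(max_tries):
        candidate = params + step * grad
        if eval_ll(candidate) > base_ll:
            return candidate
        step *= shrink
    return params  # no improving step found

def learn_prior(J, alpha, grad_J, grad_alpha, eval_ll, tol=1e-6, max_iters=100):
    """Alternate line searches on alpha (Eqs. 13/14) and J (Eqs. 12/15)."""
    prev_ll = eval_ll((J, alpha))
    for _ in range(max_iters):
        alpha = backtracking_step(alpha, grad_alpha(J, alpha),
                                  lambda a: eval_ll((J, a)))
        J = backtracking_step(J, grad_J(J, alpha),
                              lambda j: eval_ll((j, alpha)))
        ll = eval_ll((J, alpha))
        if abs(ll - prev_ll) < tol:   # stop when the log-likelihood no longer changes
            break
        prev_ll = ll
    return J, alpha
```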
Fig. 1. The log values of the two Z(Θ) estimation methods as the number of random images increases
3.3 Inference
After obtaining the MRF prior model, in order to perform inference (i.e. denoising in our experiments), we adopted a standard gradient-based approach, as used by McAuley et al. [1]. Gradient ascent is a valid technique in the case of denoising, since the noisy image is "close to" the original image, meaning that a local maximum is likely to be a global one [1]. In the denoising problem, the purpose is to infer the most likely correction for the image given the image prior and the noise model. The noise model assumed in our experiments, as in [1], is i.i.d. Gaussian: P(y|x) \propto \prod_j \exp\left(-\frac{1}{2\sigma^2}(y_j - x_j)^2\right). Here, j ranges over all the pixels in the image, y_j denotes the colour value of the noisy image at pixel j, x_j denotes the colour to be estimated at pixel j, and σ denotes the standard deviation of the Gaussian noise. Combining the noise model and the MRF prior (Eq. 1), the gradient of the log-posterior becomes [1]

\nabla_x \log P(x|y) = -\sum_{f=1}^{F} \alpha_f J_f^{-} * \frac{J_f * x}{1 + \tfrac{1}{2}(J_f * x)^2} + \frac{\lambda}{\sigma^2}(y - x),   (16)

where * denotes matrix convolution, and the algebraic operations above are performed in an elementwise fashion on the corresponding convolution matrix. J_f^{-} denotes the mirror image of J_f in two dimensions. λ is a critical parameter that gauges the relative importance of the prior and the image terms. The updated image is then simply computed by

x^{t+1} = x^{t} + \delta \frac{\partial \log P(x|y)}{\partial x},   (17)

where δ is the step size of the gradient ascent. We find that the inference result is not sensitive to it, and it can be selected empirically.
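A sketch of one inference update (Eqs. 16–17) for a single colour channel is given below, assuming the learned 2-D filters, λ, σ and δ are available. The correlation/convolution conventions and boundary handling are assumptions and would need to match those used during learning:

```python
import numpy as np
from scipy.ndimage import correlate, convolve

def denoise_step(x, y, filters, alpha, lam, sigma, delta):
    """One gradient-ascent update x <- x + delta * grad log P(x|y).

    x, y:    current estimate and noisy observation (2-D arrays, one channel).
    filters: list of 2-D filter kernels J_f.
    alpha:   iterable of expert weights alpha_f.
    """
    grad = (lam / sigma ** 2) * (y - x)                # data (noise-model) term
    for J_f, a_f in zip(filters, alpha):
        r = correlate(x, J_f, mode='nearest')          # filter response J_f * x
        inner = -a_f * r / (1.0 + 0.5 * r ** 2)        # derivative of the log-expert
        grad += convolve(inner, J_f, mode='nearest')   # applies the mirrored filter J_f^-
    return x + delta * grad
```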
4 Experimental Results and Comparison
In our experiments, to initialise our filters J's, we randomly selected 8,000 3 × 3 × 3 and 5 × 5 × 3 patches, cropped from 200 images in the Berkeley Segmentation Database [12], and performed singular value decomposition (SVD) over their covariance matrices. Thus, we obtained 27 and 75 filters for the two clique sizes, with 27 and 75 dimensions per filter respectively. The α's were initialised to a set of random values with the same dimension as the number of filters. There was no constraint on the scale of the initial α's, and both their absolute and relative values converged after several update steps. In the updating process, we randomly selected 2,000 training image patches and 2,000 random image patches from the same image database for each update step. The sizes of the training/random image patches for the 3 × 3 × 3 and 5 × 5 × 3 cliques were, respectively, 7 × 7 × 3 and 13 × 13 × 3.
Fig. 2. Typical denoising results. The first column displays the original image (up), the noisy image (middle) with σ = 75 (red), 25 (green), 15 (blue) (PSNR = 14.97), and the result of bilateral filtering with a 5 × 5 window (down, PSNR = 23.55); the second column shows the result of the wavelet-based approach with a 3 × 3 window (up, PSNR = 23.32), the result of McAuley et al.'s 3 × 3 prior (middle, PSNR = 25.03) and our 3 × 3 prior (down, PSNR = 25.90); the third column shows the result of the 5 × 5 window wavelet-based approach (up, PSNR = 23.84), McAuley et al.'s 5 × 5 prior (middle, PSNR = 25.99) and our 5 × 5 prior (down, PSNR = 26.82).
In the inference, we did not need to eliminate the filter with the highest variance since in our algorithm the α's are normalised (to the range [0, 1]). The least important filter is ignored automatically in the denoising process because its corresponding weight will be zero. In the selection of λ, we used images other than those involved in our experiments. This was done by denoising a test image with several candidate λ values, and selecting whichever one yields the best results [1]. The step size δ was chosen to grow linearly with the noise level, which was found to work well in practice. The denoising performance was evaluated by PSNR, 10\log_{10}\left(\frac{255^2}{\mathrm{MSE}}\right), where \mathrm{MSE} = \frac{1}{IJK}\sum_{i,j,k}(R_{i,j,k} - O_{i,j,k})^2, R denotes the restored image and O denotes the original image. In Figure 2 we show results obtained for denoising an image in which a different amount of noise is applied to each of the three channels, and compare these with the state-of-the-art [1], simple bilateral filtering (using the MATLAB code from [13]), and
Table 1. (3 × 3 × 3 window) Average denoising performance over 50 testing images. Results are measured in PSNR.

image/σ                   5       15      25      50
Noisy image               34.10   24.70   20.15   14.16
McAuley et al.^1          36.19   29.17   26.04   22.45
Our algorithm^1           37.12*  29.78*  26.98*  23.38*
McAuley et al.^2          36.83   29.74   26.69   23.15
Bilateral filtering       28.11   27.18   25.75   21.87
Wavelet-based denoising   36.11   28.99   25.98   22.41

* Indicates significant difference in performance compared with the upper one.
Table 2. (5 × 5 × 3 window) Average denoising performance over 50 testing images. Results are measured in PSNR.

image/σ                   5       15      25      50
Noisy image               34.10   24.70   20.15   14.16
McAuley et al.^1          36.57   29.59   26.55   22.79
Our algorithm^1           37.56*  30.11*  27.41*  23.69*
McAuley et al.^2          37.28   30.08   27.16   23.41
Bilateral filtering       29.32   27.78   25.82   21.90
Wavelet-based denoising   36.41   29.50   26.32   22.46

* Indicates significant difference in performance compared with the upper one.
Wavelet-based denoising [14]. In the bilateral filtering and wavelet-based denoising experiments, the RGB test images were converted into YCbCr format, which has less correlation between colour channels, before processing. As these results show, the denoising performance of the priors learned by our algorithm is significantly better than that obtained with the priors of McAuley et al. [1] and with the other two denoising approaches. Tables 1 and 2 summarise the results of denoising 50 test colour images (from the Berkeley Segmentation Database) in which all three channels have been equally corrupted, and compare our algorithm with McAuley et al.'s priors and two well-known methods in the 3 × 3 × 3 and 5 × 5 × 3 case, respectively. As the results show, the performance of the priors learned by our proposed algorithm for both model sizes is statistically significantly superior (paired t-test at the 0.05 level) to its counterparts that use the same training/random image set. Furthermore, the performance of our priors learned from 2,000/2,000 training/random image patches is comparable with that of the priors learned in [1], which used 100,000/50,000 training/random image patches.
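For reference, the PSNR measure used in Tables 1 and 2 can be computed as follows (a minimal sketch over all pixels and channels; function name is hypothetical):

```python
import numpy as np

def psnr(restored, original):
    """PSNR = 10 * log10(255^2 / MSE), with MSE averaged over pixels and channels."""
    diff = restored.astype(np.float64) - original.astype(np.float64)
    mse = np.mean(diff ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```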
5 Conclusion
In this paper, we have proposed a learning algorithm for high-order MRF prior models for colour image denoising. By collecting a relatively small set of sample

^1 Indicates priors learned from 2,000/2,000 training/random patches.
^2 Indicates priors learned from 100,000/50,000 training/random patches.
colour image patches from a standard colour image database, we have learned priors specific to colour images using several gradient ascent updates under the maximum likelihood rule. Results comparing the colour prior models learned by our algorithm to a state-of-the-art colour image prior model [1] show performance improvements.
References

1. McAuley, J., Caetano, T., Smola, A., Franz, M.: Learning high-order MRF priors of color images. In: ICML. LNCS, vol. 4503, pp. 617–624. Springer, Heidelberg (2006)
2. Freeman, W., Pasztor, E., Carmichael, O.: Learning low-level vision. International Journal of Computer Vision 40, 25–47 (2000)
3. Roth, S., Black, M.: Fields of experts: A framework for learning image priors. In: ICCV, pp. 860–867 (2005)
4. Olshausen, B., Field, D.: Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research 37, 3311–3325 (1997)
5. Welling, M., Hinton, G., Osindero, S.: Learning sparse topographic representations with products of Student-t distributions. In: NIPS, vol. 15, pp. 1359–1366 (2003)
6. Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. PAMI 6(6), 721–741 (1984)
7. Hinton, G.: Products of experts. In: 9th ICANN, pp. 1–6 (1999)
8. Zhu, S., Wu, Y., Mumford, D.: Filters, random fields and maximum entropy (FRAME): Towards a unified theory of texture modeling. International Journal of Computer Vision 27, 107–126 (1998)
9. Abramowitz, M., Stegun, I.A.: The process of the arithmetic–geometric mean. In: Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th printing, p. 571 (1972)
10. Krantz, S.: Handbook of Complex Variables, p. 118. Birkhäuser, Boston, MA (1999)
11. Moré, J., Thuente, D.: Line search algorithms with guaranteed sufficient decrease. ACM Trans. Math. Software 20, 286–307 (1994)
12. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: 8th ICCV, pp. 416–423 (2001)
13. http://mesh.brown.edu/dlanman/photos/Bilateral
14. Portilla, J., Strela, V., Wainwright, M., Simoncelli, E.: Image denoising using scale mixtures of Gaussians in the wavelet domain. IEEE Trans. Image Processing 12(11), 1338–1351 (2003)
Hierarchical Learning of Dominant Constellations for Object Class Recognition

Nathan Mekuz and John K. Tsotsos

Center for Vision Research (CVR) and Department of Computer Science and Engineering, York University, Toronto, Canada M3J 1P3
{mekuz,tsotsos}@cse.yorku.ca
Abstract. The importance of spatial configuration information for object class recognition is widely recognized. Single isolated local appearance codes are often ambiguous. On the other hand, object classes are often characterized by groups of local features appearing in a specific spatial structure. Learning these structures can provide additional discriminant cues and boost recognition performance. However, the problem of learning such features automatically from raw images remains largely uninvestigated. In contrast to previous approaches which require accurate localization and segmentation of objects to learn spatial information, we propose learning by hierarchical voting to identify frequently occurring spatial relationships among local features directly from raw images. The method is resistant to common geometric perturbations in both the training and test data. We describe a novel representation developed to this end and present experimental results that validate its efficacy by demonstrating the improvement in class recognition results realized by including the additional learned information.
1
Introduction
Humans are highly adept at classifying and recognizing objects with significant variations in shape and pose. However, the complexity and degree of variance involved make this task extremely challenging for machines. Current leading edge methods use a variety of tools including local features [1,2,3,4], global [5] and region [6] histograms, dominant colors [7], textons [8] and others, collecting features sparsely at detected key points, at random locations, or densely on a grid and at single or multiple scales. In practice, different types of features are often complementary and work well in different scenarios, and good results are often achieved by combining different classifiers. Of the above approaches, much focus has recently been dedicated to learning with local appearance descriptors, which have been shown to be extremely effective thanks to their discriminant qualities and high degree of resistance to geometric and photometric variations as well as partial occlusions. A very effective and widely-used technique that enables the use of efficient search methods borrowed from the text retrieval field is vector quantization,
whereby each patch is associated with a label (visual word) from a vocabulary. The vocabulary is usually constructed offline by means of some clustering algorithm. To avoid aliasing effects arising from boundary conditions, soft voting is employed, whereby each vote is distributed into several nearby words using some kernel function. Finally, images are coded as histograms of their constituent visual words. While the importance of local features’ spatial configuration information for object class recognition is widely recognized, the basic scheme described above is typically employed on sets of isolated local appearance descriptors. However, for the most part, local appearance descriptors were designed to recognize local patches. When used for recognizing objects, the spatial layout that they appear in is of paramount importance. The SIFT algorithm [9], for example, represents local features in a way that is invariant to geometric perturbations. However, it also stores the parameters of the local geometry, and subsequently applies a Hough transform to select from potential hypotheses a model pose that conforms to the geometry associated with a large number of identified keys. Current systems that capture spatial information do so by learning and enforcing local relationships [10,11], global relationships [12,13], using dense sampling [2,1], or at multiple levels [14,15]. In [14], the system learns groups of local features that appear frequently in the training data, followed by global features composed of local groupings. In [11], spatial consistency is enforced by requiring a minimum number of features to co-occur in a feature neighborhood of fixed size. The authors of [16,12] demonstrate the benefit of learning the spatial relationships between various components in an image from a vocabulary of relative relationships. In [2], appearance models are built where clusters are learned around object centers and the object representation encodes the position and scale of local parts within each cluster. Significant performance gains are reported resulting from the inclusion of location distribution information. Fergus et al. [1] learn a scale-normalized shape descriptor for localized objects. However, the shapes are not normalized with respect to any anchor point. Consequently, some preprocessing of the input images is required. A boosting algorithm that combines local feature descriptors and global shape descriptors is presented in [13], however extracting global shape is extremely difficult under occlusion or cluttered background. We take a different approach and seek to learn object class-specific hierarchies of constellations, based on the following principles: Unsupervised learning. A clear tradeoff exists between the amount of training data required for effective learning, and the quality of its labeling. Given the high cost of manual annotation and segmentation, and the increased availability (e.g. on the internet) of images that are only globally annotated with a binary class label, a logical goal is the automatic learning of constellation information from images with minimal human intervention. Specifically, this precludes manual segmentation and localization of objects in the scene. Invariance to shift, scale and rotation. In order to be able to train with and recognize objects in various poses, we require a representation that
captures spatial information, yet is resistant to common geometric perturbations.

Robustness. In order to successfully learn in an unsupervised fashion, the algorithm must be robust to feature distortions and partial occlusion. A common approach for achieving robustness is voting.

Learn with no spatial restrictions. We would like to learn spatial relationships over the entire image, without restrictions of region or prior (e.g. Gestalt principles). This allows grouping discontinuous features, e.g. features that lie on the outline of an object with a variable-texture interior.

The main contribution of this paper is a novel representation that captures spatial relationship information in a scale and rotation invariant manner. The constellation descriptors are made invariant by anchoring with respect to one local feature descriptor, similar to the way the SIFT local descriptor anchors with respect to the dominant orientation. We present a framework for learning spatial configuration information by collecting inter-patch statistics hierarchically in an unsupervised manner. To tackle the combinatorial complexity problem, higher-level histograms are constructed by successive pruning. The most frequently occurring constellations are learned and added to the vocabulary as new visual words. We also describe an efficient representation for matching learned constellations in novel images for the purpose of object class recognition or computing similarity.

The remainder of this paper is organized as follows: in Section 2 we describe our proposed constellation representation. This is followed by implementation details of the voting scheme in Section 3, and the matching algorithm in Section 4. The results obtained on images of various categories are presented in Section 5, and finally, Section 6 concludes with a discussion.
2 Invariant Constellation Representation
The constellation representation captures the types and relative positions, orientations and scales of the constituent parts. An effective representation must
Fig. 1. An illustration of the constellation representation. Local features are represented with circles, with arrows emanating out of them to indicate dominant orientation. Feature f1 is selected as anchor, and the positions, scales and orientations of the remaining features are expressed relative to it.
be resistant to minor distortions arising from changes in pose or artifacts of the local feature extraction process. Pose changes can have a significant effect on the coordinates of local features. Another key requirement is a consistent frame of reference. In the absence of models of localized objects, a frame of reference can be constructed as a function of the constituent features. One option is to use the average attributes of local features [17]. However, since our method uses local features that are quantized into discrete visual words, we opt for the simpler alternative of pivoting at the feature with the lowest vocabulary index. This results in a more compact representation (and in turn, computational complexity savings) by eliminating the need to store spatial information for the anchor feature. On the downside, the anchor feature may not lie close to the geometric center of the constellation, reducing the granularity of position information for the other features. Whatever method is used for selecting the pivot, detection of the constellation depends on the reliable recovery of the pivot feature. However, even if the pivot feature cannot be recovered (e.g. under occlusion), subsets of the constellation may still be detected. Our representation is illustrated graphically in Figure 1, with the local appearance features represented as circles, and their dominant orientation as arrows. Feature f1 is selected as anchor and the coordinate system representing the remaining constellation features is centered about it and rotated to align with its dominant orientation. More formally, given a set of local features F_i = ⟨Γ_i, t_i, x_i, y_i, α_i⟩, where Γ_i is the index of the visual word corresponding to feature F_i, t_i is its scale, x_i and y_i its position in the image and α_i its orientation relative to the global image coordinate system, we select the anchor F_* as F_* = argmin_i Γ_i, and construct the constellation descriptor encoding the anchor feature's type Γ_*, as well as the following attributes for each remaining feature F_j:

Type: Γ_j
Scale ratio: t_j / t_*
Relative orientation: α_j − α_*
Relative position: atan2(y_j − y_*, x_j − x_*) − α_*, where atan2 is the quadrant-sensitive arctangent function. This attribute ignores distances, and merely provides a measure of F_j's polar angle relative to F_*, using F_*'s coordinate system.

As an example, using this representation, a pair of local features {F_1, F_2} with Γ_1 < Γ_2 is represented as ⟨Γ_1, Γ_2, t_2/t_1, α_2 − α_1, atan2(y_2 − y_1, x_2 − x_1) − α_1⟩. A constellation of m local features is represented as an n-tuple with n = 4m − 3 elements. To maintain a consistent representation, the descriptor orders the local features by their vocabulary index. If the lowest vocabulary index is not unique to one local feature, we build multiple descriptors, just as the SIFT algorithm creates multiple descriptors at each keypoint where multiple dominant orientations exist.
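A minimal sketch of how such a descriptor could be assembled from quantized local features is shown below. The tuple layout and function name are assumptions, and the case of a non-unique lowest vocabulary index (which would produce multiple descriptors) is omitted for brevity:

```python
import math

def constellation_descriptor(features):
    """Build the invariant descriptor for a constellation of local features.

    features: list of tuples (word, scale, x, y, orientation), orientation in radians.
    Returns (anchor word, then word, scale ratio, relative orientation and
    relative polar angle for every other feature), i.e. 4m - 3 elements.
    """
    feats = sorted(features, key=lambda f: f[0])          # order by vocabulary index
    w0, t0, x0, y0, a0 = feats[0]                         # anchor = lowest word index
    desc = [w0]
    for w, t, x, y, a in feats[1:]:
        desc.extend([
            w,                                            # type
            t / t0,                                       # scale ratio
            (a - a0) % (2 * math.pi),                     # relative orientation
            (math.atan2(y - y0, x - x0) - a0) % (2 * math.pi),  # relative position angle
        ])
    return tuple(desc)
```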
3 Voting by Successive Pruning
The learning phase performs histogram voting in order to identify the most frequently occurring constellations in each category, using the representation described above. Since the descriptor orders the local features by their type attribute, each resulting histogram takes the shape of a triangular hyper-prism, with the Γ (type) axes along the hyper-triangular bases and the other attributes forming the rectangular component of the prism. Spatial information is encoded in 8 × 8 × 8 bins. The relative orientation and relative position attributes encode the angle into one of eight bins, similar to the way this is done in SIFT. Scale ratios are also placed into a bin according to log_2(t_j/t_*) + 3. Co-occurrences with scale ratios outside the range [1/16, ..., 32] are discarded. As in SIFT, in order to avoid aliasing effects due to boundary conditions, we use soft voting whereby, for each attribute, each vote is hashed proportionally into two neighboring bins. For the type attribute, the vote is distributed into several nearby visual words using a kernel. We use a Gaussian kernel with σ set to the average cluster radius, although other weighting formulas are certainly possible. It is also possible to threshold by distance rather than fixing the number of neighbors. The exponential complexity of the problem, and in particular the size of the voting space, call for an approximate solution. A simple method that has often been used successfully is successive pruning. In computer vision, good results have been reported in [18], although some human supervision was necessary at the highest levels of the hierarchy. In some sense, the successive pruning strategy can be viewed as a coarse-to-fine refinement process. Starting with coarse bins, the algorithm identifies areas of the search space with a high number of counts. It then iteratively discards bins with a low number of counts, re-divides the rest of the voting space into finer bins, and repeats the voting process. In the case of multi-dimensional histograms, coarse bins can also be created by collapsing dimensions. This latter approach is more convenient in our case since it fits
Fig. 2. (a) A depiction of a triangular histogram used for voting for the most frequently co-occurring pairs. (b) Histogram pruning: bins with a low number of counts are discarded. Finer bins are allocated for each bin with a number of counts above a threshold.
naturally with the notion of hierarchical learning, creating larger constellations from smaller ones. Also, collapsing the dimensions associated with spatial information offers computational advantages by allowing early termination of the voting in the discarded bins, since visual word indices for local features are available immediately. Figures 2(a) and 2(b) illustrate the structures used in the two-phase voting process to identify pairs of local features that appear frequently in a particular spatial configuration. In the first phase, local feature descriptors are extracted and cached from all images belonging to an object class, and a triangular histogram (Fig. 2(a)) is collected to count the number of times each pair of local features co-occurs, regardless of geometry. In the second phase (Fig. 2(b)), sub-histograms are allocated for the bins with a high number of counts, and the remaining bins are discarded. Each sub-histogram consists of 8 × 8 × 8 bins and captures spatial information for its associated bin in the triangular histogram. Finally, the vocabulary is augmented with new visual words corresponding to the most frequent constellations in each class identified in phase 2.
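The following sketch illustrates the two-phase voting for pairs under simplifying assumptions: hard binning is used instead of soft voting, the pair-frequency threshold is a free parameter, and the data layout (each image as a list of (word, scale, x, y, orientation) tuples) is hypothetical:

```python
import math
from collections import defaultdict

def spatial_bins(anchor, other):
    """Map an ordered feature pair to (scale, orientation, position) bins, 8 each."""
    _, t0, x0, y0, a0 = anchor
    _, t1, x1, y1, a1 = other
    s_bin = int(math.floor(math.log2(t1 / t0) + 3))           # bin from log2 of scale ratio
    if not 0 <= s_bin <= 7:
        return None                                           # discard out-of-range ratios
    o_bin = int(((a1 - a0) % (2 * math.pi)) / (2 * math.pi) * 8) % 8
    p_bin = int(((math.atan2(y1 - y0, x1 - x0) - a0) % (2 * math.pi))
                / (2 * math.pi) * 8) % 8
    return s_bin, o_bin, p_bin

def vote_pairs(images, pair_threshold):
    """Phase 1: count word pairs; phase 2: spatial sub-histograms for frequent pairs."""
    pair_counts = defaultdict(int)
    for feats in images:
        for i, f in enumerate(feats):
            for g in feats[i + 1:]:
                a, b = sorted((f, g), key=lambda u: u[0])     # anchor = lower word index
                pair_counts[(a[0], b[0])] += 1
    frequent = {p for p, c in pair_counts.items() if c >= pair_threshold}

    sub_hists = defaultdict(lambda: defaultdict(int))         # (w1, w2) -> bins -> count
    for feats in images:
        for i, f in enumerate(feats):
            for g in feats[i + 1:]:
                a, b = sorted((f, g), key=lambda u: u[0])
                if (a[0], b[0]) in frequent:
                    bins = spatial_bins(a, b)
                    if bins is not None:
                        sub_hists[(a[0], b[0])][bins] += 1
    return sub_hists
```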
4 Indexing Constellation Descriptors for Efficient Matching
Given novel input images, the system compares constellations extracted from these images against the learned constellations stored in its vocabulary. To achieve this, an exhaustive search of all local feature combinations in the input images is not necessary. A more efficient search is possible by indexing the learned constellation information offline, as depicted in Figure 3. At the first level, the structure consists of a single array indexed by local feature type index. Given a moderate number of learned constellations, the resulting first-level array is sparse. Local feature types for which learned constellations exist have their array entries point to arrays of stored constellation descriptors, sorted lexicographically by the other type indices. The matching algorithm works by constructing an inverted file [19] of the local features in the image. A sparse inverted file containing only links to vocabulary
Fig. 3. A depiction of an efficient indexing structure for fast lookup of constellation features
entries that are in use suffices thanks to the sorted second-level arrays: a match is sought simply by traversing both lists simultaneously. Spatial relationship information is compared only when all type attributes in a constellation descriptor are matched in the inverted file.
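A simplified sketch of this two-level lookup is given below; the descriptor layout and the geometric-consistency test are placeholders, and the sorted-list traversal is replaced by a plain membership check for clarity:

```python
from collections import defaultdict

def build_index(constellations):
    """First level: anchor word -> descriptors sorted by their other word indices.

    Each constellation is assumed to be a dict with keys 'words' (tuple of member
    word indices, anchor first) and 'geometry' (stored spatial attributes).
    """
    index = defaultdict(list)
    for c in constellations:
        index[c['words'][0]].append(c)
    for anchor in index:
        index[anchor].sort(key=lambda c: c['words'][1:])
    return index

def match_image(index, image_words, geometry_matches):
    """Return learned constellations whose word set occurs in the image.

    image_words:      set of visual-word indices present in the image.
    geometry_matches: callable checked only after all type attributes match.
    """
    hits = []
    for anchor in image_words:
        for c in index.get(anchor, []):
            if all(w in image_words for w in c['words'][1:]) and geometry_matches(c):
                hits.append(c)
    return hits
```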
5 Evaluation
We tested our technique by examining the effect of using spatial relationship information captured using our descriptors on object class recognition performance. In order to isolate the effect of our constellation learning algorithm on class detection, we limited our evaluation to still greyscale images. We used the SIFT detector and descriptor to extract and represent local appearance features. We constructed a vocabulary of 13,000 visual words by extracting features from the first 800 hits returned by Google Images for the keyword ‘the’ and clustering with k-means. The result is a neutral vocabulary, that is not tuned specifically for any object category. For training and test data, we used 600 images of faces, airplanes, watches, bonsai trees and motorbikes from the PASCAL 2006 data set [20], divided equally into training and test images. All images were converted to greyscale, but no other processing was performed. In the training phase, image descriptors were collected for each of the training images encoding histograms of their constituent visual words. We used our neutral vocabulary constructed as described above, and quantized each feature descriptor to its 15 nearest neighbors using a Gaussian kernel with σ set to the average radius covered by each vocabulary entry. In the testing phase, each image was matched and classified using a simple unweighted nearest neighbor classifier against the trained image descriptors. As is standard practice, we used
(a)
               Faces  Airplanes  Watches  Bonsai trees  Motorcycles
Faces           88.3       53.3     45.0          21.7         70.0
Airplanes        3.3       36.7     11.7           1.7          5.0
Watches          5.0        0.0     26.7           1.7          1.7
Bonsai trees     1.7        8.3     16.7          71.7          8.3
Motorcycles      1.7        1.7      0.0           3.3         15.0

(b)
               Faces  Airplanes  Watches  Bonsai trees  Motorcycles
Faces           91.7       50.0     40.0          16.7         66.7
Airplanes        3.3       43.3     10.0           3.3          5.0
Watches          5.0        0.0     35.0           1.7          0.0
Bonsai trees     0.0        5.0     13.3          76.7          6.7
Motorcycles      0.0        1.7      1.7           1.7         21.7

Fig. 4. Confusion matrices (a) using a vocabulary of only local appearance features, (b) using an augmented vocabulary with an additional 50 constellation words per category.
Fig. 5. Object class categorization performance as a function of the number of constellation visual words used. The error bars represent a margin of 3 standard errors.
a stop list to discard the 2% most frequently occurring visual words. Although more elaborate classifiers (e.g. SVM) and weighting schemes (e.g. tf-idf [21]) are possible, we opted for the simple scheme described here in order to focus on the effect of the additional spatial information. We expect the tf-idf scheme to place increased weights on the constellation features, since they carry more class-specific discriminant information. We tested the effect of augmenting the vocabulary with 10, 20 and 50 constellation words per object class on class recognition performance. It is worth noting that the vocabulary used for constructing these additional constellation words was again our generic neutral vocabulary: the only class-specific information captured in the training phase was the most frequently co-occurring pairs and their spatial relationships in each class. A 2% stop list was again used on the visual words associated with the local features but not on the pairs. Figure 4 presents confusion matrices for the categorization tests (a) using no spatial information, and (b) using 50 additional constellation visual words. Perhaps surprisingly, poor performance is realized in the motorcycles category, where local feature-based methods typically excel. A likely explanation is that normally the vocabulary is constructed using images of the modeled class, and
captures features such as wheels in the case of the motorcycle class, whereas in our experiments we used a neutral vocabulary that was not trained specifically for any class. More importantly, however, we note that the addition of a few visual words corresponding to learned spatial features clearly boosted recognition performance in all classes, with average gains of about 5% using 50 constellation words. Figure 5 shows correct recognition results (corresponding to the diagonal of the confusion matrices) with different numbers of constellation words. The general trend shows recognition performance improving as more constellation features are used. The error bars represent an interval of 3 standard errors.
6 Discussion
This paper has presented a novel approach for representing constellation information that is learned directly from raw image data in a hierarchical fashion. The method is capable of learning spatial configuration information from possibly cluttered images where objects appear in various poses and possibly partly occluded. Novel images are tested for the presence of learned configurations in a way that is robust to common geometric perturbations. Additionally, the paper presents implementation details for an efficient voting algorithm that allows collecting robust co-occurrence statistics in a computationally highly complex voting space, and efficient indexing structures that allow fast lookup in the matching phase. Our experimental results confirm the importance of spatial structure to the class recognition problem, and show that the proposed representation can provide significant benefit with constellations consisting of pairs. We are currently exploring richer constellation structures corresponding to higher levels of the hierarchy and looking at ways for visualizing the learned constellations.
Acknowledgments The authors are grateful to Erich Leung and Kosta Derpanis for many helpful discussions. This work was supported by OGSST and Precarn incorporated.
References

1. Fergus, R., Perona, P., Zisserman, A.: Object class recognition by unsupervised scale-invariant learning. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 264–271 (2003)
2. Leibe, B., Mikolajczyk, K., Schiele, B.: Efficient clustering and matching for object class recognition. In: British Machine Vision Conference, Edinburgh, England (2006)
3. Berg, A.C., Berg, T.L., Malik, J.: Shape matching and object recognition using low distortion correspondences. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 26–33. IEEE Computer Society Press, Los Alamitos (2005)
4. Dorko, G., Schmid, C.: Object class recognition using discriminative local features (2005)
5. Ortega, M., Rui, Y., Chakrabarti, K., Mehrotra, S., Huang, T.S.: Supporting similarity queries in MARS. In: ACM International Conference on Multimedia, pp. 403–413. ACM Press, New York (1997)
6. Carson, C., Thomas, M., Belongie, S., Hellerstein, J., Malik, J.: Blobworld: a system for region-based image indexing and retrieval. Technical report, Berkeley, CA, USA (1999)
7. Mukherjea, S., Hirata, K., Hara, Y.: Amore: a world-wide web image retrieval engine. In: CHI 1999. Extended abstracts on human factors in computing systems, pp. 17–18. ACM Press, New York (1999)
8. Malik, J., Belongie, S., Shi, J., Leung, T.K.: Textons, contours and regions: Cue integration in image segmentation. In: IEEE International Conference on Computer Vision, pp. 918–925. IEEE Computer Society Press, Los Alamitos (1999)
9. Lowe, D.G.: Object recognition from local scale-invariant features. In: IEEE International Conference on Computer Vision, vol. 1150, IEEE Computer Society Press, Los Alamitos (1999)
10. Lazebnik, S., Schmid, C., Ponce, J.: Affine-invariant local descriptors and neighborhood statistics for texture recognition. In: IEEE International Conference on Computer Vision, vol. 649, IEEE Computer Society, Los Alamitos (2003)
11. Sivic, J., Zisserman, A.: Video Google: A text retrieval approach to object matching in videos. In: IEEE International Conference on Computer Vision, vol. 2, pp. 1470–1477 (2003)
12. Lipson, P., Grimson, E., Sinha, P.: Configuration based scene classification and image indexing. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1007, IEEE Computer Society, Los Alamitos (1997)
13. Zhang, W., Yu, B., Zelinsky, G.J., Samaras, D.: Object class recognition using multiple layer boosting with heterogeneous features. In: IEEE Conference on Computer Vision and Pattern Recognition
14. Amit, Y., Geman, D.: A computational model for visual selection. Neural Comput. 11, 1691–1715 (1999)
15. Agarwal, A., Triggs, W.: Hyperfeatures - multilevel local coding for visual recognition. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, Springer, Heidelberg (2006)
16. Sinha, P.: Image invariants for object recognition. Invest. Opth. & Vis. Sci. 34(6) (1994)
17. Shokoufandeh, A., Dickinson, S.J., Jönsson, C., Bretzner, L., Lindeberg, T.: On the representation and matching of qualitative shape at multiple scales. In: European Conference on Computer Vision, pp. 759–775. Springer, Heidelberg (2002)
18. Fidler, S., Berginc, G., Leonardis, A.: Hierarchical statistical learning of generic parts of object structure. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 182–189. IEEE Computer Society Press, Los Alamitos (2006)
19. Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. Morgan Kaufmann Publishers Inc., San Francisco (1999)
20. Everingham, M., Zisserman, A., Williams, C.K.I., Van Gool, L.: The PASCAL Visual Object Classes Challenge. In: VOC2006 (2006)
21. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley, Reading (1999)
Multistrategical Approach in Visual Learning

Hiroki Nomiya and Kuniaki Uehara

Graduate School of Science and Technology, Kobe University
[email protected],
[email protected]
Abstract. In this paper, we propose a novel visual learning framework to develop flexible and accurate object recognition methods. Currently, most visual-learning-based recognition methods adopt a monostrategy learning framework using a single feature. However, real-world objects are so complex that it is quite difficult for a monostrategy method to classify them correctly. Thus, utilizing a wide variety of features is required to distinguish them precisely. In order to utilize various features, we propose multistrategical visual learning by integrating multiple visual learners. In our method, multiple visual learners are collaboratively trained. Specifically, a visual learner L intensively learns the examples misclassified by the other visual learners. In turn, the other visual learners learn the examples misclassified by L. As a result, a powerful object recognition method can be developed by integrating various visual learners even if they have mediocre recognition performance.
1 Introduction
To achieve flexible and accurate recognition in computer vision, an effective framework called visual learning has been proposed, which introduces machine learning techniques into computer vision. However, conventional visual learning methods adopt monostrategy learning. That is, they are based on a single learning strategy and are able to utilize only a few features. Thus, they often fail to distinguish complex objects. Since an image is described with various features such as contour, color and texture, a wide variety of features are required for flexible recognition. Therefore, it is essential to integrate multiple features. To solve this problem, multistrategy learning was developed [1]. Since it can deal with multiple learning frameworks, it is potentially more capable. For example, Nomiya et al. proposed a multistrategy visual learning method by integrating two types of object recognition methods (appearance-based and region-based) using decision trees and discriminant analysis [2], but the integration method is so simple that the recognition performance is inadequate. Most existing multistrategy learning methods simply integrate the learning results of all base learners using, for example, a linear combination. Moreover, each learner is trained separately. Since the features are mutually interrelated, the visual learners should be trained collaboratively. Thus, we propose an effective learning framework in which the visual learners can cooperate with each other. If a visual learner L frequently misclassifies some examples, it seems to be difficult for L to correctly discriminate the examples. Then, L never learns the
examples and they are learned by the other visual learners. Instead, L learns the examples misclassified by the other learners. As a result, each visual learner is specialized in discriminating the objects which have particular features. For example, an appearance-based visual learner will be specialized in discriminating the objects whose shapes are unique while they have various colors and textures. Conversely, a region-based visual learner will be specialized in discriminating the objects whose colors and textures are unique while they have similar shapes. To classify an example (i.e. an image), a more accurate prediction can be obtained by determining the most suitable visual learner based on the learning result. Therefore, our method can be more efficient than the conventional multistrategy learning methods. In order to develop this learning scheme, we propose the intensive and collaborative learning framework in the following section.
2 Multistrategical Visual Learning
We propose an effective multistrategy learning model based on AdaBoost [3]. We improve its weighting algorithm so that each visual learner can be trained collaboratively. First, we extend AdaBoost to solve multiclass problems. AdaBoost.M1 [3] is a direct extension to the multiclass problems, but it shares with AdaBoost the property that the classification error of each weak learner must be less than 1/2. This is a crucial constraint when the number of classes is large. AdaBoost.M2 [3] gives a solution to this problem by using a pseudo-loss. A hypothesis whose pseudo-loss is lower than 1/2 is much more easily generated for multiclass problems. Thus, we adopt the AdaBoost.M2 algorithm. It takes a training set {(x_1, y_1), ..., (x_m, y_m)}, where m is the number of training examples, x_i is an element of the instance space X, and y_i is an element of the label space Y = {1, ..., C}, where C is the number of classes. In object recognition problems, an example corresponds to an image and a class label corresponds to the object in the image. AdaBoost.M2 can be extended to our multistrategy learning framework as follows. The label weighting functions q_t^{X^l} are computed for each base learner using the weight vectors of the l-th base learner, w_{t,y}^{X^l}(i), which is the i-th example's weight for the y-th class (y ≠ y_i) at the t-th round:

q_t^{X^l}(i, y) = \frac{w_{t,y}^{X^l}(i)}{W_t^{X^l}(i)},

where w_{1,y}^{X^l}(i) = \frac{1}{m(C-1)} for all l and i, and

W_t^{X^l}(i) = \sum_{y \neq y_i} w_{t,y}^{X^l}(i).

Then, the weight distribution D_t^{X^l} for the l-th base learner at the t-th round can be computed based on the weight vectors of the other base learners as follows:

D_t^{X^l}(i) = \frac{1}{n-1} \sum_{j \neq l}^{n} \frac{W_t^{X^j}(i)}{\sum_{i=1}^{m} W_t^{X^j}(i)}.   (1)
Equation (1) represents the intensive and collaborative learning framework. It is obvious that the weight distribution D_t^{X^l} depends on the learning results of the other base learners X^j (j ≠ l). If X^j misclassifies an example, then X^l's weight for that example is increased. Thus, X^l intensively learns the examples misclassified by X^j. Conversely, if X^l misclassifies some examples, X^j intensively learns them. Consequently, each base learner is collaboratively trained by leaving the examples misclassified by itself to the other base learners and training on the examples misclassified by the other base learners. This intensive and collaborative learning framework leads to better performance than conventional multistrategy learning, which integrates only the learning results of the individual base learners. The pseudo-loss ε_t^{X^l} of the l-th base learner at the t-th round is given by

\varepsilon_t^{X^l} = \frac{1}{2} \sum_{i=1}^{m} D_t^{X^l}(i) \left( 1 - [h_t^{X^l}(x_i) = y_i] + \sum_{y \neq y_i} q_t^{X^l}(i, y)\, [h_t^{X^l}(x_i) = y] \right).
For any predicate π, we define [π] = 1 if π holds, and [π] = 0 otherwise. At the next round, the new weight vectors are calculated as

w_{t+1,y}^{X^l}(i) = w_{t,y}^{X^l}(i) \exp\left( \frac{\beta_t^{X^l}}{2} \left( 1 + [h_t^{X^l}(x_i) = y_i] - [h_t^{X^l}(x_i) = y] \right) \right),

where \beta_t^{X^l} = \log\{(1 - \varepsilon_t^{X^l}) / \varepsilon_t^{X^l}\}. We define the final hypothesis H^{X^l} of the base learner X^l as follows:

H^{X^l}(x) = \operatorname*{argmax}_{c \in Y} H_T^{X^l}(c, x); \qquad H_t^{X^l}(c, x) = \sum_{\tau=1}^{t} \beta_\tau^{X^l}\, [h_\tau^{X^l}(x) = c],
βτX [hX τ (x) = c].
τ =1
where T is the number of rounds. To evaluate each hypothesis, we compute the class separability at each round. The class separability is a criterion calculated using a confusion matrix. If most of the training examples belonging to a class are correctly classified, then the class separability is high. Conversely, if most of the training examples are confused with the other class(es), then the class l l separability is low. We define the class separability sX t (c) of X for the c-th class at the t-th round as follows: ⎧ l Xl ⎨ sX t+ (c) if argmax Ht (y, x) = c l X y∈Y (2) st (c) = ⎩ sX l (c) otherwise. t− where,
C C
l
l sX t+ (c)
nX t,c,c
= C
i=1 l
l
nX t,c,i
,
l sX t− (c)
l
i=c
j=c
nX t,i,j
i=c
j=1
nX t,i,j
= C C
l
and nX t,i,j denotes the number of the examples whose class label is i and classified l
into the class j by HtX .
Multistrategical Approach in Visual Learning
505
As the learning proceeds, each base learner will be specialized in discriminating a kind of objects which are relatively easy to classify. As a result, the base learner can very precisely classify some kinds of objects but may sometimes misclassify the other objects. Thus, we estimate the confidence of the predictions of all the base learners to determine reliable base learners. We utilize the class separability of each weak hypothesis to estimate the confidence of the prediction of a base learner. Especially, the weak hypotheses generated at the beginning of the learning are more suitable because the weak hypotheses in the later stage of the learning are specialized. Thus, we define the confidence K based on the class separability of each weak hypothesis, emphasizing the suitable hypotheses. Xl (c) + KtXl (c) = Kt−1
t
l sˆX i (c)
(3)
i=1 X where K0X (c) = 0 for all c and sˆX t (c) is calculated by replacing Ht in equation X (2) with ht . The final hypothesis H is computed by integrating the learning results of all the base learners considering their confidence values as follows:
L X Xl l KT (c)HT (x) . (4) H(x) =argmax c
l=1
That is, the final hypothesis contains two prediction steps. The first step is to predict which base learner can correctly classify the example. The second step is to predict the class label of the example by the hypotheses of the base learners.
3
Base Learners
The appearance is an essential feature to recognize objects. Thus, we utilize a set of straight lines extracted from the contour. We call the lines contour fragments. Since a contour fragment is too simple, we discriminate objects by finding meaningful combinations of the contour fragments called patterns. In the learning process, we first extract contour fragments using stick growing method [4]. Next, we find meaningful combinations of contour fragments. We show the process in Figure 1. In Figure 1, (a) and (b) represent the original image and the extracted contour fragments respectively. P1 , P2 and P3 in (c) are the patterns found by searching mutually adjacent contour fragments.
Fig. 1. An example of the pattern extraction process
506
H. Nomiya and K. Uehara
A frequent pattern, which is common to the examples in a certain class is useful. We define a frequent pattern as the pattern which satisfies the condition nc that N > ρ, where, for the c-th class, nc is the number of examples which c nc contain the pattern. Nc is the number of examples in the c-th class. Thus, N c corresponds to the probability that the pattern is included in the c-th class. ρ is the frequency threshold. To find useful frequent patterns, we define a criterion to evaluate the usefulness U of a frequent pattern p for the c-th class as follows: nc . (5) Uc (p) = C i=1 ni where C is the number of classes. When a test example is given, the frequent patterns {pi } (i = 1, · · · , m) are extracted from the object, where m is the number of frequent patterns. Each pattern pi is compared with each useful frequent pattern qic (i = 1, · · · , mc ) for each class, where mc is the number of frequent patterns in the c-th class. The similarity σ(pi , qi ) between pi and qi is calculated for each frequent pattern. Then, the confidence Sc of the test example for the c-th class is calculated as follows. Sc corresponds to the possibility that the class label of the example is c. Sc =
M
σ(pi , qi ) where M = min{m, mc }.
(6)
i=1
If p is similar to q, σ(pi , qi ) is Uc (qi ), otherwise 0. In equation (6), pi and the corresponding frequent pattern qi are determined so that Sc is minimized. To determine whether a pattern p is similar to a pattern q, we define the following conditions for each contour fragment lip and liq in p and q: Condition 1: Condition 2: Condition 3:
np = nq = n. p q 1 r < |li |/|li | < r (i = 1, · · · , n). p q A(li , li ) < θ (i = 1, · · · , n).
where np and nq are the numbers of the contour fragments included in p and q. A(x, y) represents the angle between x and y. r and θ are the thresholds. When all the contour fragments in p and q satisfy these conditions, p is similar to q. The test example is classified into the class which has the highest confidence. The region component which represents the color and texture is a discriminative feature. We consider the region component in the minimum region that contains all contour fragments represented by the encircled regions in Figure 2. Figure 2 (a) is the minimum region in the original image. (b) is the corresponding contour fragments. We use the pixel intensity values in the minimum region. But there is the problem of the computational cost caused by a large amount of pixels. To solve this problem, we introduce Generic Fourier Descriptor (GFD) [5] and reduce the dimensionality of the feature vectors. GFD is a rotation-invariance descriptor derived by applying a 2D polar Fourier transform to the polar image as shown in Figure 2 (c). The transform is given by F (ρ, φ) =
−1 R−1 T r=0 i=0
2πi r φ) f (r, θi ) exp j2π( ρ + R T
(7)
Multistrategical Approach in Visual Learning
507
Fig. 2. An example of the minimum region and GFD
where R and T are the radial and angular resolutions and θi = 2πi T (0 ≤ i < T ). Figure 2 (d) represents the Fourier coefficients. We use the GFD feature vector as the feature vector calculated as follows:
GF D =
|F (0, n)| |F (m, n)| |F (0, 0)| ,···, ,···, area |F (0, 0)| |F (0, 0)|
(8)
where area represents the area of the polar image. m and n are the maximum numbers of the radial and angular frequencies respectively. GFD is calculated to obtain the feature vector of the minimum region. The similarity between two images is calculated as the distance between the two GFD feature vectors. The similarity S(x, t) between a test example x and a training example t is defined as the Euclidian distance between the GFD feature vectors as follows: S(x, t) =
d
− 12 {GF Di (x) − GF Di (t)}
2
(9)
i=1
where d is the number of dimensions of the GFD feature vectors. GF Di (x) and GF Di (t) are the i-th dimension’s values of the examples x and t respectively. The example x is classified into the class which has the highest average similarity.
4
Experiments
We carried out some experiments to verify the performance of our method with real-world images. The images are in the ETH-80 Image Set database [6]. This data set contains 8 different objects; apple, car, cow, cup, dog, horse, pear, and tomato. Each class contains 410 images (10 kinds of objects from 41 different directions). We used a total of 656 images as the training set and the other 2624 images as the test set. The number of rounds is experimentally set to 100. In this experiment, we used the following six base learners. The first and second learners are the appearance and region based methods proposed in this paper. The third learner is based on feature tree [2]. This method combines some predefined features into some decision trees called feature trees. The fourth learner is based on Scale Invariant Feature Transform (SIFT) [7]. This method generates the deformation-invariant descriptors by finding some keypoints in an object. The fifth learner is based on PCA-SIFT [8]. It can generate more distinctive and compact descriptors than SIFT by introducing Principal Component
508
H. Nomiya and K. Uehara
Analysis. The sixth learner is based on shape context method [9]. It discriminates an object using a set of points on its contour called shape context. 4.1
Comparison with Other Object Recognition Methods
We compare the recognition performance of our method with the following six object recognition methods. First, the shape context by Belongie et al. [9]1 . Second, an appearance-based method called multidimensional receptive histogram by Schiele et al. [10]. This method describes local shapes of objects using statistical representations to recognize the objects. Third, a region-based method based on color indexing by Swain et al. [11]. It discriminates an object using RGB histograms calculated from the pixels in the object. Fourth, a regionbased method using local invariant features by Grauman et al. [12]. It utilizes deformation-invariant local features generated by a gradient-based descriptor. Fifth, a region-based method based on boosting by Tu et al. [13]. In this method, probabilistic boosting-tree framework is introduced to construct discriminative models. Finally, a visual learning method by Mar´ee et al. [14]. In this method, an object is described by randomly extracted multi-scale subwindows in the image. The random decision tree ensembles are constructed using the subwindows. The recognition accuracy for each method is shown in Table 1. Table 1. The recognition accuracy (in %) of each method methods Swain accuracy(%) 64.85
Mar´ee 74.51
Tu 76
Schiele 79.79
Grauman 81
Belongie 81.06
proposed 84.49
Our method outperforms all the other recognition methods. This result reflects the effectiveness of the multistrategical learning. Although the base learner using shape context method greatly contributes the high accuracy, our multistrategy learning is fully effective because the recognition performance is considerably improved compared with the recognition accuracy of a single base learner. 4.2
Recognition Performance of Multistrategical Learning Methods
In order to verify the effectiveness of our multistrategy learning from the viewpoint of the number of base learners, we construct five object classifiers using two, three, four, five and six different base learners and compare them. We call these classifiers L2 , L3 , L4 , L5 and L6 respectively. Li contains a total of i base 1
It is reported in [6] that the shape context method achieved 86.40% accuracy. However, this recognition accuracy has been achieved using over 98% of examples in the training set. Since this method discriminates objects by matching with each object in the training set, the accuracy is proportional to the number of the training examples as shown in [9]. Thus, we show in Table 1 the recognition accuracy of the shape context method using 20% of training examples in the same way as our experiments.
Fig. 3. The recognition accuracy with multiple base learners
Fig. 4. The recognition accuracy with hard example elimination
learners. In addition, since we use the shape context method as the sixth base learner, we also show the recognition accuracy of the shape context method for comparison. The result of the experiment is shown in Figure 3. The recognition accuracy increases with the number of base learners. The third learner (feature tree) and the fourth learner (SIFT) give new features to L3 and L4, so that more complex objects can be precisely described. The fifth learner (PCA-SIFT) is similar to the fourth learner because it is based on SIFT. Thus, the recognition performance is slightly improved in L5. Introducing the sixth base learner (shape context) considerably improved the total recognition accuracy. Although this improvement is due to the high recognition performance of the shape context method, the recognition accuracy of L6 is significantly higher than that of the shape context method alone. In addition, the recognition accuracy of our method for apples does not degrade when the shape context method is introduced, in spite of the low recognition accuracy of the shape context method on that class. However, there is room for improvement because the classification accuracy of our method for cars, cows and cups is lower than that of the shape context method.
The main reason for this result is that our method is vulnerable to hard examples. Hard examples are noisy examples such as deformed or occluded images. In our method, the examples misclassified by all the base learners are regarded as hard examples. In AdaBoost, it is a crucial problem that hard examples often cause overfitting and degrade the classification performance. In order to investigate the influence of hard examples, we performed an additional experiment by introducing NadaBoost [15], which can detect hard examples. We used NadaBoost instead of AdaBoost.M2 and eliminated the hard examples detected by NadaBoost during the learning process. The result of the experiment is shown in Fig. 4. By eliminating hard examples, the overall recognition performance of our method is improved and our method outperforms the shape context method on all the objects. From this result, we confirmed the influence of hard examples and the necessity to appropriately detect and eliminate them.
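To make the hard-example criterion above concrete, the following Python sketch is an illustration only: the authors use NadaBoost for detection, whereas here the "misclassified by every base learner" rule from the text is applied, and the classifier interface is an assumption. It removes such examples and renormalizes the boosting weights.

import numpy as np

def remove_hard_examples(X, y, base_learners, weights):
    """Drop examples misclassified by all base learners and renormalize the weights.

    base_learners : fitted classifiers assumed to expose a .predict(X) method.
    weights       : current boosting weights over the training examples.
    """
    preds = np.stack([clf.predict(X) for clf in base_learners])  # (n_learners, n_examples)
    hard = np.all(preds != y, axis=0)   # hard example: wrong under every base learner
    keep = ~hard
    w = weights[keep]
    w = w / w.sum()                     # renormalize the surviving weights
    return X[keep], y[keep], w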
4.3 Effectiveness of the Method to Integrate Visual Learners
To verify the effectiveness of our integration method, we compare our method with a voting method using the six base learners. The voting method separately trains the base learners and combines them using a voting method without weighting. The result using 5-fold cross validation is shown in Table 2.

Table 2. The recognition accuracy (in %) of our method and the voting method

          apple  car    cow    cup    dog    horse  pear   tomato  total
voting    84.51  85.06  69.45  82.38  70.07  70.43  83.60  75.79   77.66
proposed  90.30  89.88  76.10  87.01  78.23  79.27  89.63  85.49   84.49
For all objects and for the total recognition accuracy, the accuracy of our method is significantly better than that of the voting method according to a t-test at the 5% significance level. The voting method treats the predictions of all the base learners equally, even if an object is often misclassified by one base learner while another base learner correctly classifies it. In addition, each base learner is trained separately. As a result, using the voting method deteriorates the total recognition performance. Our method avoids this problem by selecting a suitable base learner depending on the given object. This result shows the advantage of our integration method and our learning framework.
5 Conclusion and Future Work
In this paper, we proposed an effective object recognition method based on multistrategical visual learning. Since our method collaboratively trains and integrates multiple visual learners, the discrimination performance can be improved compared with monostrategy visual learning methods. Through the experiments, we verified the performance of our method. The experimental results show that complex objects can be correctly discriminated by integrating diverse visual learners. However, the recognition accuracy of our method is still
inadequate in the presence of hard examples. As the additional experiment indicates, future work should make our method more robust to hard examples by appropriately detecting and eliminating them.
References
1. Levin, A., Viola, P., Freund, Y.: Unsupervised improvement of visual detectors using co-training. In: Proc. of the 9th IEEE International Conference on Computer Vision, pp. 626–633. IEEE Computer Society Press, Los Alamitos (2003)
2. Nomiya, H., Uehara, K.: Feature construction and feature integration in visual learning. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS (LNAI), vol. 3720, pp. 86–95. Springer, Heidelberg (2005)
3. Freund, Y., Schapire, R.E.: A decision theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), 119–139 (1997)
4. Nelson, R.C.: Finding line segments by stick growing. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(5), 519–523 (1994)
5. Zhang, D., Lu, G.: Enhanced generic fourier descriptors for object-based image retrieval. In: Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 3668–3671. IEEE Computer Society Press, Los Alamitos (2002)
6. Leibe, B., Schiele, B.: Analyzing appearance and contour based methods for object categorization. In: Proc. of International Conference on Computer Vision and Pattern Recognition, pp. 409–415 (2003)
7. Lowe, D.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
8. Ke, Y., Sukthankar, R.: PCA-SIFT: A more distinctive representation for local image descriptors. In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 506–513 (2004)
9. Belongie, S., Malik, J., Puzicha, J.: Matching shapes. In: Proc. of the 8th IEEE International Conference on Computer Vision, pp. 454–463. IEEE Computer Society Press, Los Alamitos (2001)
10. Schiele, B., Crowley, J.L.: Recognition without correspondence using multidimensional receptive field histograms. International Journal of Computer Vision 36(1), 31–50 (2000)
11. Swain, M.J., Ballard, D.H.: Color indexing. International Journal of Computer Vision 7(1), 11–32 (1991)
12. Grauman, K., Darrell, T.: Efficient image matching with distributions of local invariant features. In: Proc. of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 627–634 (2005)
13. Tu, Z.: Probabilistic boosting-tree: Learning discriminative models for classification, recognition, and clustering. In: Proc. of the 10th IEEE International Conference on Computer Vision, vol. 2, pp. 1589–1596 (2005)
14. Marée, R., Geurts, P., Piater, J., Wehenkel, L.: Random subwindows for robust image classification. In: Proc. of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 34–40 (2005)
15. Nakamura, M., Nomiya, H., Uehara, K.: Improvement of boosting algorithm by modifying the weighting rule. Annals of Mathematics and Artificial Intelligence 41, 95–109 (2004)
Cardiac Motion Estimation from Tagged MRI Using 3D-HARP and NURBS Volumetric Model
Jia Liang1, Yuanquan Wang2, and Yunde Jia1
1 School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100081, P.R. China
2 School of Computer Science, Tianjin University of Technology, Tianjin 300191, P.R. China
{liangjia,yqwang,jiayunde}@bit.edu.cn
Abstract. Concerning the analysis of tagged cardiac MR images, harmonic phase (HARP) is a promising technique with the largest potential for clinical use in terms of rapidity and automation without tag detection and tracking. However, it is usually applied to 2D images and only provides "apparent motion" information. In this paper, HARP is integrated with a nonuniform rational B-spline (NURBS) volumetric model to densely reconstruct the 3D motion of the left ventricle (LV). The NURBS model represents the anatomy of the LV compactly, and the displacement information that HARP provides within short-axis and long-axis images drives the model to deform. After estimating the motion at each phase, we smooth the NURBS models temporally to achieve a 4D continuous time-varying representation of LV motion. Experimental results on in vivo data show that the proposed strategy can estimate the 3D motion of the LV rapidly and effectively, benefiting from both HARP and the NURBS model. Keywords: Tagged MRI, LV, motion estimation, HARP, NURBS model.
1 Introduction
Tagged MRI [1,2] is a noninvasive technique that can be used for quantitative assessment of myocardial function and the dynamic behavior of the human heart, which is invaluable in the diagnosis of myocardial diseases [3]. In tagged MRI, tags move with the underlying tissue during the cardiac cycle, providing unsurpassed information about myocardial motion which can be used to calculate local strain and deformation indices from different myocardial regions. Hence, many studies have contributed to analyzing the deformation of tags to derive a motion model of the underlying tissue, such as [4,5,6]. These methods for tag detection and tracking are mostly manual or semiautomatic with human interaction, which makes cardiac motion analysis more time-consuming and more dependent on the validity of the detected tags. Therefore it is imperative to develop an automatic method for estimating heart motion. Recently, Osman et al. [7-9] have introduced a harmonic phase (HARP) technique for cardiac motion tracking without extracting tag features from an image. The approach treats the harmonic phase, which is computed from the inverse Fourier transform of the first off-center isolated spectral peak in the Fourier domain of a tagged MR image, as a
material property for tracking underlying motion and calculating myocardial strain. It is a promising technique with the largest potential for clinical use in terms of rapidity, simplicity, robustness and automation. However, the HARP technique is mostly applied to 2D images, typically on short-axis (SA) image planes, and thus only provides information about "apparent motion" while the true motion of the heart is 3D. Some work has been done to deal with this problem. Ryf et al. [10] presented a combined 3D tagging and imaging approach. Haber and Westin [11] constructed a finite element model (FEM) within the LV wall using the HARP phase computed on a sparse collection of image planes. Pan et al. [12] developed a mesh model approximating the mid-wall of the LV and then applied HARP to track the mesh. In this paper, we present a novel method that integrates HARP with a nonuniform rational B-spline (NURBS) volumetric model to densely reconstruct the 3D motion of the LV. First we extend HARP from 2D to 3D mathematically, making full use of the information afforded by SA and LA images, each with a grid tagging pattern, to obtain three mutually orthogonal components of the LV motion displacements. This process retains the rapidity and automation of the HARP method. Then a NURBS volumetric model is employed to fit the complex anatomy of the LV. The model offers an immediate and compact representation of a wide variety of shapes, and can model the geometry of the LV well with only a few control points coming from the sparse image planes. The model also interpolates the known sparse displacements to obtain a dense 3D motion field reflecting the natural continuity and smoothness of the three-dimensional tissue deformations. Finally, after smoothing the model temporally, we can estimate the 3D motion of the LV at any time, and consequently the local strain and deformation indices for different myocardial regions can be computed.
2 3D Extension of HARP
Harmonic phase (HARP) is an image processing technique developed to analyze the motion of the heart rapidly using MR tagging [7,8]. It is mostly applied to 2D images, typically on SA image planes, and thus only provides information about "apparent motion". The actual motion of the heart is not confined to the imaging plane but likely moves out of it; hence, 2D-HARP motion computations do not yield the true motion. Therefore we extend this method to 3D-HARP, which yields a more comprehensive description of the full 3D motion of myocardial tissues. This extension is also based on the principle that the harmonic phase value of a material point is time-invariant, a material property we can use to track the motion of material points of the heart in three-dimensional space. In tagged MRI, one tag pattern can only provide one component of the underlying motion. To achieve full 3D tracking of any point, the information coming from different mutually orthogonal tagging patterns has to be combined and interpolated in space and time. Hence, we need to acquire SA image planes with two orthogonal tag orientations to obtain the in-plane motion and several LA image planes to capture the third directional component of tissue motion, normal to the short-axis image planes. In the LA images, which are arranged radially, the tagging planes are usually applied orthogonal to the long axis and appear as parallel lines measuring longitudinal compression. Since the HARP
method needs 2D tagged images, an additional orthogonal tag pattern should be applied to the LA images during imaging. SA and LA images, each with a grid tagging pattern, are acquired for 3D-HARP. To detail the procedure of 3D-HARP analysis formally, we define a material point p ∈ IR^3 that lies in the cross-section of an SA image and an LA image. Points y_S and y_L are its projections onto the SA and LA images. To extract useful information from the Fourier transforms of the SA and LA images, we use the band-pass filter [8] to isolate the off-center (non-dc) spectral peak in one tag direction and rotate this filter by 90 degrees to isolate the spectral peak in the orthogonal tag direction. Zero padding the rest of the Fourier transform and performing an inverse Fourier transform then yields a complex image whose angle is called a harmonic phase image. The harmonic phase image gives a detailed picture of myocardial motion in the corresponding direction. The harmonic phases (φ_1, φ_2)^T computed from the SA image in the x/y tag directions correspond to y_S, and (φ_3, φ_4)^T computed from the LA image in the z/z' (orthogonal to z) tag directions correspond to y_L; they remain time-invariant throughout the motion. Thus, we track the point that has the same phase value in the filtered image sequence. Here the phase value of p is denoted by a vector φ = (φ_1, φ_2, φ_3)^T, where φ_1, φ_2 and φ_3 come from three mutually orthogonal tagging directions; φ_4 is used only for tracking the point on the LA image with 2D-HARP. Assume a material point is located at p_t at time t. If p_{t+1} is the position of this point at time t + 1, then

φ_n(p_t, t) = φ_n(p_{t+1}, t + 1),   n = 1, 2, 3.   (1)

This relationship provides the basis for tracking p_t from time t to time t + 1. In practice, φ cannot be calculated and visualized from the data directly, so its principal value, produced by a wrapping function, often takes its place. Despite the wrapping artifact, the principal value is also a material property of the tagged tissue and remains time-invariant. The remaining work is to find y_S with the same harmonic phase (φ_1, φ_2)^T in the SA image and y_L with the same harmonic phase (φ_3, φ_4)^T in the LA image. The displacement field u = (u_x, u_y, u_z)^T for all intersections can then be obtained.
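As a rough Python sketch of the harmonic phase extraction described above (an illustration, not the authors' implementation: the Gaussian band-pass shape, its width, and the location of the spectral peak are assumptions supplied by the user), one off-center spectral peak of a tagged image is isolated and the angle of the inverse transform gives the wrapped phase.

import numpy as np

def harmonic_phase(image, peak_uv, sigma=5.0):
    """Return the (wrapped) harmonic phase image for one tag direction.

    image   : 2D tagged MR image (numpy array).
    peak_uv : (u, v) offset of the off-center spectral peak from DC, in FFT bins.
    sigma   : width of the assumed Gaussian band-pass filter.
    """
    F = np.fft.fftshift(np.fft.fft2(image))
    rows, cols = image.shape
    u = np.arange(rows) - rows // 2
    v = np.arange(cols) - cols // 2
    V, U = np.meshgrid(v, u)
    # Gaussian band-pass centered on the chosen spectral peak; everything else ~ 0.
    bandpass = np.exp(-((U - peak_uv[0]) ** 2 + (V - peak_uv[1]) ** 2) / (2.0 * sigma ** 2))
    complex_img = np.fft.ifft2(np.fft.ifftshift(F * bandpass))
    return np.angle(complex_img)   # wrapped phase in (-pi, pi]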
3 Nonuniform Rational B-Splines Model
3.1 Principle
NURBS, an acronym for nonuniform rational B-splines, have become the de facto standard for computational geometric representation [13]. Their predominance lies in parametric continuity, local support, a compact and unified mathematical representation, and an extra degree of freedom in the form of weights, apart from the knot vector and control points, which can be used in designing a wide variety of shapes.
The NURBS volumetric model can be expressed as

P(u, v, w) = \frac{\sum_{i=0}^{m} \sum_{j=0}^{n} \sum_{f=0}^{h} ω_{i,j,f} D_{i,j,f} N_{i,k_1}(u) N_{j,k_2}(v) N_{f,k_3}(w)}{\sum_{i=0}^{m} \sum_{j=0}^{n} \sum_{f=0}^{h} ω_{i,j,f} N_{i,k_1}(u) N_{j,k_2}(v) N_{f,k_3}(w)} ,   (2)
where D_{i,j,f} represents a control point and ω_{i,j,f} is the corresponding weight. The quantities u, v and w are location parameters, and N(·) is the B-spline basis function. The variables k_1, k_2 and k_3 are the orders (one more than the degree) of the model in the directions of u, v and w respectively. After defining the knot vectors U = {u_1, …, u_{ms}}, V = {v_1, …, v_{ns}} and W = {w_1, …, w_{hs}}, each a nondecreasing sequence of real numbers called knots, a NURBS volume model is uniquely defined. The remaining work is to solve for the control points and weights in a least-squares sense. Given a set of discrete points Γ = {P_{p,q,r} = (x_{p,q,r}, y_{p,q,r}, z_{p,q,r})} and a corresponding set of weights Λ = {ξ_{p,q,r}} that quantify the relative confidence in the measurement of each corresponding point, where p ∈ [0, s_1], q ∈ [0, s_2], r ∈ [0, s_3], the weighted least-squares error criterion E_ξ^x for the x-coordinate is
E_ξ^x = \sum_{p,q,r} ξ_{p,q,r} \left( x_{p,q,r} - \frac{\sum_{i=0}^{m} \sum_{j=0}^{n} \sum_{f=0}^{h} ω_{i,j,f} D_{i,j,f}^x N_{i,k_1}(u_p) N_{j,k_2}(v_q) N_{f,k_3}(w_r)}{\sum_{i=0}^{m} \sum_{j=0}^{n} \sum_{f=0}^{h} ω_{i,j,f} N_{i,k_1}(u_p) N_{j,k_2}(v_q) N_{f,k_3}(w_r)} \right)^2 ,   (3)
where D_{i,j,f}^x is the x component of the control point D_{i,j,f}; the other parameters are defined as before. It is known that rational B-splines can be produced by projecting nonrational B-splines onto the hyperplane w = 1. Therefore we substitute the 3D control points D_i = {D_i^x, D_i^y, D_i^z} = {x_i, y_i, z_i} in 3D space by the 4D homogeneous coordinates D_i' = {D_i^{wx}, D_i^{wy}, D_i^{wz}, D_i^w} = {w_i x_i, w_i y_i, w_i z_i, w_i}, and then the weighted least-squares error criterion E_ξ^{wx} for the x-homogeneous coordinate is given by
E_ξ^{wx} = \sum_{p,q,r} ξ_{p,q,r} \left( x_{p,q,r} D_{p,q,r}^w - \sum_{i=0}^{m} \sum_{j=0}^{n} \sum_{f=0}^{h} D_{i,j,f}^{wx} B_{i,j,f}^{k_1,k_2,k_3}(u_p, v_q, w_r) \right)^2 ,   (4)
where

B_{i,j,f}^{k_1,k_2,k_3}(u_p, v_q, w_r) = N_{i,k_1}(u_p) N_{j,k_2}(v_q) N_{f,k_3}(w_r) ,   (5)
D_{p,q,r}^w = \sum_{i=0}^{m} \sum_{j=0}^{n} \sum_{f=0}^{h} ω_{i,j,f} B_{i,j,f}^{k_1,k_2,k_3}(u_p, v_q, w_r) .   (6)
Similar equations are used for y and z components. Then we follow the methodology given by Tustison and Amini [14] to solve the equations for the control points and the weights.
3.2 Anatomy of the LV
The image planes acquired by the tagged MRI imaging technique are usually sparse 2D images, while the actual anatomy of the LV is very complex and looks like a prolate spheroid. In order to cover the full geometry we employ a NURBS volumetric model, since it can be created with only a few control points. From the sparse SA and LA images, a finite set of discrete points located on the intersections of the contours and the images is observed. The phantom of the LV anatomy and the parametric directions of the model are shown in Fig. 1(d). Applying the fitting algorithm in Section 3.1, the NURBS volumetric model is obtained. Owing to its parametric continuity and compact representation, the model can represent any point of the 3D LV very well.
3.3 3D Motion Estimations
After building the NURBS volumetric model of the current phase, 3D-HARP is applied to the SA and LA images at the next phase to obtain the sparse 3D displacements. These displacements then drive the model to deform. Owing to the model's parametric continuity, local support and differentiability, we can obtain the dense 3D motion displacement field by differencing the NURBS volumetric models.
3.4 Four-Dimensional (4D) NURBS Model
After the 3D NURBS volumetric model is generated across 3D space for each phase, a time dimension is added to smooth it along all continuous time points. Given the orders of the model, the 4D grid of control points, the corresponding weights, and the knot vector sequences, the 4D NURBS model can be written as

P(u, v, w, t) = \frac{\sum\sum\sum\sum N(u) N(v) N(w) N(t) ω D}{\sum\sum\sum\sum N(u) N(v) N(w) N(t) ω} ,   (7)
where t denotes the time instant. The meanings of other parameters are the same as defined previously, and the subscripts are omitted for concise writing.
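For concreteness, the following Python sketch evaluates the tensor-product form of Eq. (2) at one parameter triple using the Cox-de Boor recursion. This is an illustrative implementation rather than the authors' code: the array layouts of the control points, weights and knot vectors are assumptions, and adding the time dimension of Eq. (7) amounts to one more basis factor and summation.

import numpy as np

def bspline_basis(i, k, t, knots):
    """Cox-de Boor recursion: B-spline basis N_{i,k}(t) of order k (degree k-1)."""
    if k == 1:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = 0.0
    if knots[i + k - 1] != knots[i]:
        left = (t - knots[i]) / (knots[i + k - 1] - knots[i]) * bspline_basis(i, k - 1, t, knots)
    right = 0.0
    if knots[i + k] != knots[i + 1]:
        right = (knots[i + k] - t) / (knots[i + k] - knots[i + 1]) * bspline_basis(i + 1, k - 1, t, knots)
    return left + right

def nurbs_volume_point(u, v, w, ctrl, weights, U, V, W, k1, k2, k3):
    """Evaluate P(u, v, w) of Eq. (2). ctrl: (m+1, n+1, h+1, 3) control points."""
    m, n, h = ctrl.shape[0] - 1, ctrl.shape[1] - 1, ctrl.shape[2] - 1
    num = np.zeros(3)
    den = 0.0
    for i in range(m + 1):
        Nu = bspline_basis(i, k1, u, U)
        if Nu == 0.0:
            continue
        for j in range(n + 1):
            Nv = bspline_basis(j, k2, v, V)
            if Nv == 0.0:
                continue
            for f in range(h + 1):
                Nw = bspline_basis(f, k3, w, W)
                b = weights[i, j, f] * Nu * Nv * Nw
                num += b * ctrl[i, j, f]
                den += b
    return num / den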
4 Strain Analysis
Strain is a dimensionless quantity measuring the percent change in length at different points to describe the internal deformation of a continuum body. It is an appealing tool to study and quantify myocardial deformation.
Given the spatial coordinates of a point in the material, X at time t = 0 and x at time t > 0 , the deformation gradient tensor F includes both the rotation and deformation around a point in the material and can be calculated by
F_{pq} = ∂x_p / ∂X_q ,   (8)
where the subscripts p and q range from 1 to 3 and denote one of the 3D Cartesian coordinates. The Lagrangian strain tensor E only includes the deformation of the material with respect to its initial configuration, and is related to F as follows:
E = \frac{1}{2} \left( F^T F − I \right) ,   (9)
where the superscript T represents the matrix transpose and I represents the identity matrix. The Lagrangian strain E is used to describe systolic deformation in a region surrounding a point in the heart wall relative to its initial position at end-diastole.
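A minimal numerical sketch of Eqs. (8) and (9) follows (an illustration only; the regular-grid sampling and voxel spacing are assumptions): given a dense displacement field u(X) = x − X, the deformation gradient is F = I + ∂u/∂X and the Lagrangian strain follows directly.

import numpy as np

def lagrangian_strain(disp, spacing=(1.0, 1.0, 1.0)):
    """Compute the Lagrangian strain tensor E at every voxel.

    disp    : displacement field of shape (X, Y, Z, 3), u(X) = x - X.
    spacing : assumed grid spacing along each axis.
    Returns E of shape (X, Y, Z, 3, 3).
    """
    # grads[..., p, q] = d u_p / d X_q, estimated by finite differences
    grads = np.stack(
        [np.stack(np.gradient(disp[..., p], *spacing), axis=-1) for p in range(3)],
        axis=-2,
    )
    I = np.eye(3)
    F = I + grads                                   # Eq. (8): F_pq = dx_p/dX_q
    Ft_F = np.einsum('...ij,...ik->...jk', F, F)    # F^T F
    return 0.5 * (Ft_F - I)                         # Eq. (9)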
5 Experiments on in vivo Data
5.1 Imaging Protocol
For the studies shown in this paper, the following imaging protocol was utilized on a 1.5T Siemens Sonata scanner. Two sequences consisting of 11 phases of 9 tagged MR SA images and 9 LA images, each with a grid tagging pattern, a 256 × 208 acquisition matrix and 8 mm slice thickness, were acquired throughout the cardiac cycle from a normal healthy volunteer. The LA images were taken perpendicular to the SA images and arranged radially every 20 degrees around the long axis of the LV. Fig. 1 shows the relative spatial positions of the SA and LA images.
5.2 Initial NURBS Volumetric Model
To construct the reference NURBS volumetric model of the LV, we manually segment the endocardial and epicardial contours of the LV from the end-diastolic images. Then the initial reference NURBS (cubic spline) volumetric model of the LV is built, as illustrated in the top-left panel of Fig. 2. Here the color carries no meaning, and the cyan pentagrams denote the points at the reference phase.
Fig. 1. (a) The spatial relative position of the SA and LA images of LV. Red circles denote the origins of parallel SA images. (b) LA images locations (denoted by red lines) on a sample SA image. (c) SA images locations (denoted by red lines) on a sample LA image. (d) Parametric directions of the NURBS volumetric model of LV.
Fig. 2. The NURBS volumetric models of the LV at six different instants and the corresponding displacements. The cyan pentagrams denote the points at the reference phase and the blue pentagrams denote the corresponding positions at the current phase. The pink vectors roughly represent the displacements. The color bar denotes the length of the displacements. The top-left panel shows the initial NURBS volumetric model of the real LV and the reference points.
5.3 3D Motion Reconstruction
Once the initial reference NURBS volumetric model of the LV is obtained, the phases of the material points on the model are known. The next step is to track the 3D motion of this model according to the material property of phase invariance using the 3D-HARP method. Fig. 2 shows the NURBS volumetric models and the displacement field during cardiac systole. These results are similar to those reported in [15].
5.4 4D NURBS Model and Strain Analysis
The 4D NURBS model is generated by smoothing the 3D NURBS volumetric models over time using cubic splines, which is well suited to the real heart. The movement of each myocardial point over time can be captured accurately by assigning u, v and w any fractional values. In addition, the shape of the LV at any time instant can be obtained by setting t to any desired value. Using this model, the changes of displacement and strain over time can be obtained at all myocardial points with sub-pixel accuracy. Due to the LV geometry, it is appropriate to calculate the myocardial strains in the radial, circumferential, and longitudinal directions. The basal and midcavity portions of the LV are each divided into six regions in the SA view: antero-septal, anterior, lateral, posterior, inferior, and infero-septal. Normal LV strains, i.e., average radial, circumferential, and longitudinal strains, are given in Fig. 3. These results are similar to those reported in [16]. The radial strains mostly remain positive, indicative of the systolic thickening of the LV. The circumferential and longitudinal strains are negative, denoting
Fig. 3. Average Lagrangian normal strains are plotted for the six basal and midcavity regions of the left ventricle of a normal human volunteer. The different geometric shapes (star, diamond, and square) represent the radial, longitudinal, and circumferential strain values, respectively. The x axis marks the time point during systole.
shortening in the circumferential direction and compression in the longitudinal direction during LV contraction.
6 Conclusion
In this paper, we have proposed a novel method for dense 3D motion estimation of the LV without tag detection and tracking. This method takes advantage of the rapidity and automation of the HARP technique. It also benefits from NURBS properties, such as parametric continuity, local support, and a compact and unified mathematical representation for a wide variety of shapes. Under this framework, we have created a compact representation of the complex LV anatomy and reconstructed the motion of the LV on in vivo data; experimental results show that the dense 3D motion estimation and the local strains can be calculated rapidly and effectively. It is strongly felt that this tool will help take MR tagging from the ranks of a valuable scientific research tool into the ranks of a valuable diagnostic clinical tool. This method is also suitable for the right ventricle and even for the atria, with only a different model initialization.
Acknowledgments. We are grateful to Prof. Pheng Ann Heng of the Chinese University of Hong Kong for providing the in vivo human heart data. This work was supported by the Natural Science Foundation of China under grant 60602050 and the 973 Program of China (No. 2006CB303105).
References
1. Zerhouni, E.A., Parish, D., Rogers, W., Yang, A., Shapiro, E.: Human heart: tagging with MR imaging — a method for non-invasive assessment of myocardial motion. J. Radiology 169, 59–63 (1988)
2. Axel, L., Dougherty, L.: MR imaging of motion with spatial modulation of magnetization. J. Radiology 171, 841–845 (1989)
3. Masood, S., Yang, G.-Z., Pennell, D.J., Firmin, D.N.: Investigating intrinsic myocardial mechanics: the role of MR tagging, velocity phase mapping, and diffusion imaging. J. Magn. Reson. Imag. 12, 873–883 (2000)
4. Guttman, M.A., Prince, J.L., McVeigh, E.R.: Tag and contour detection in tagged MR images of the left ventricle. IEEE Trans. Med. Imag. 13, 74–88 (1994)
5. Amini, A.A., Chen, Y., Curwen, R.W., Mani, V., Sun, J.: Coupled B-snake grids and constrained thin-plate splines for analysis of 2-D tissue deformations from tagged MRI. IEEE Trans. Med. Imag. 17, 344–356 (1998)
6. Young, A.: Model tags: direct 3D tracking of heart wall motion from tagged magnetic resonance images. In: Wells, W.M., Colchester, A.C.F., Delp, S.L. (eds.) MICCAI 1998. LNCS, vol. 1496, pp. 92–101. Springer, Heidelberg (1998)
7. Osman, N., Kerwin, W., McVeigh, E., Prince, J.: Cardiac motion tracking using CINE harmonic phase (HARP) magnetic resonance imaging. J. Magn. Reson. Med. 42, 1048–1060 (1999)
8. Osman, N., McVeigh, E., Prince, J.: Imaging heart motion using harmonic phase MRI. IEEE Trans. Med. Imag. 19, 186–202 (2000)
9. Osman, N.F., McVeigh, E.R., Prince, J.L.: Visualizing myocardial function using HARP MRI. J. Phys. in Med. and Biol. 45, 1665–1682 (2000)
10. Ryf, S., Spiegel, M.A., Gerber, M., Boesiger, P.: Myocardial tagging with 3D-SPAMM. J. Magn. Res. Imag. 16, 320–325 (2002)
11. Haber, I., Westin, C.F.: Model-based 3D tracking of cardiac motion in HARP images. In: Int. Soc. Mag. Reson. Med., Honolulu, HI (2002)
12. Pan, L., Prince, J.L., Lima, J.A.C., Osman, N.F.: Fast tracking of cardiac motion using 3D-HARP. IEEE Trans. BioMed. Eng. 52, 1425–1435 (2005)
13. Piegl, L., Tiller, W.: The NURBS Book. Springer, Berlin (1997)
14. Tustison, N.J., Amini, A.A.: Biventricular myocardial kinematics based on tagged MRI from anatomical NURBS models. In: IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 514–519 (2004)
15. Luo, G., Heng, P.A.: LV Shape and Motion: B-Spline-Based Deformable Model and Sequential Motion Decomposition. IEEE Trans. Inform. Technol. BioMed. 9, 430–446 (2005)
16. Moore, C., Lugo-Olivieri, C., McVeigh, E., Zerhouni, E.: Three-dimensional systolic strain patterns in the normal human left ventricle: characterization with tagged MR imaging. J. Radiology 214, 453–466 (2000)
Fragments Based Parametric Tracking
Prakash C, Balamanohar Paluri, Nalin Pradeep S, and Hitesh Shah
Sarnoff Innovative Technologies Private Limited, Asha arch, Magrath Road, Bangalore-560025, India
Abstract. The paper proposes a parametric approach for color based tracking. The method fragments a multimodal color object into multiple homogeneous, unimodal fragments. The fragmentation process consists of multi-level thresholding of the object color space followed by an assembling step. Each homogeneous region is then modelled using a single parametric distribution and the tracking is achieved by fusing the results of the multiple parametric distributions. The advantage of the method lies in tracking complex objects under partial occlusions and various deformations such as non-rigid, orientation and scale changes. We evaluate the performance of the proposed approach on standard and challenging real world datasets.
1 Introduction
Two prominent components of a tracking system are the object descriptor and the search mechanism. The object descriptor is the representation of the object to be tracked using a set of features that capture various properties of the object such as appearance, shape, texture, etc. Given an object descriptor, the search mechanism, like [1,2], locates the region in a new image that best matches the object description. Multiple methods have been suggested in the literature for object descriptors. Most of the successful methods for tracking employ a non-parametric object descriptor like the histogram [1,3,4,5,6,7], as it faithfully captures the variability in the features of the object to be tracked. However, with an increase in the number of objects to be tracked or the features to be considered, the histogram size grows exponentially, which is undesirable. To address this issue, we propose a parametric object descriptor for color based tracking. An N-dimensional Gaussian distribution is employed as the object descriptor in the proposed approach. Such a descriptor can accurately model a unimodal object. But objects under consideration for tracking are generally multimodal in color space, making an N-d Gaussian descriptor insufficient. Hence, we need to convert multimodal objects to a unimodal representation. Primarily, there are two ways to achieve this conversion:
– By projecting the multimodal object into a space where it becomes unimodal
– By representing each mode separately
An approach similar to Collins et al. [5] can be used to find a linear transformation to project the multimodal color object into a unimodal space. A non-linear transformation, as suggested by Larry et al. [7], can be used to the same effect. However, in both cases the search for such an optimal transformation is not exhaustive over the entire space of possible transformations, as that would be computationally expensive. Hence the obtained transformation is suboptimal. For representing a multimodal object, Gaussian mixture model (GMM) distributions can also be used. However, computation of the GMM parameters is expensive and a priori knowledge of the number of modes is essential, rendering it inapplicable as an object descriptor for tracking. Therefore, in this paper we propose a method based on fragmenting multimodal objects into multiple homogeneous models using discriminant analysis. The fragmentation process finds the fragments online, as opposed to fragmenting the object into fixed sizes as suggested in [4]. Each fragment is then modelled using a single N-dimensional Gaussian distribution and tracked separately. These parametric distributions are used to generate a probability density function termed the strength image. The maximum likelihood (ML) framework proposed in [2] is used to estimate the location (mean) and shape (covariance) of the best matching region in the subsequent frame. The paper is organized as follows: Section 2 explains the proposed approach, experimental results are presented in Section 3 to illustrate the performance of the tracker, and Section 4 concludes the work and outlines future work.
2 Proposed Method
The proposed tracking approach is color based; hence, in modelling an object we use the color values of its pixels. Our initial step involves fragmenting based on the color values of the pixels. Prior work involved the application of multi-level thresholding techniques to segment an illumination/gray image [8]; but these techniques cannot be applied directly in our case, as our objective is to group regions similar in color rather than in illumination (gray). Hence, multi-level thresholding is done in color space. The input template is in the RGB space. Multi-level thresholding on the histogram generated using all three channels is not possible due to the immense size of the histogram (256 × 256 × 256). So, given the color template of the object in the RGB space, we first transform the input to HSV space. Since Hue represents the color component alone, multi-level thresholding on Hue gives the desired results. The grouped regions similar in color are then modelled using a single parametric distribution. The uni/multi modal classification and the fragmentation process both use the Hue image.
2.1 Fragmentation
The given region of interest (ROI) is initially divided into uniform blocks of size M × N. Each block (B) with mean (μ), variance (σ) and Hue histogram (H) is then classified as homogeneous if either of the following two criteria is satisfied:
1. If the variance of the region is less than a certain pre-defined value. The variance of the region is given by

σ^2 = \sum_{i ∈ B} (P_i − μ)^2   (1)

where P_i is the hue value at i and μ is the mean of the block.
2. If, when the block is divided into two classes C1 with values [1, …, t] and C2 with values [t + 1, …, L] using the optimal threshold given by

\arg\max_t \sum_{i=1}^{N} W_i (μ_i − μ)^2   (2)

where N is the number of classes, W_i is the total number of pixels in class i, μ_i is the mean of the i-th class and μ is the mean of the block, the Separability Factor (SF) of the block,

SF = \frac{BCV}{TV}   (3)

is less than a certain pre-defined value. Here TV is the total variance of the block given by (1) and BCV is the between-class variance given by

BCV = \sum_{i=1}^{N} W_i (μ_i − μ)^2   (4)
The fragmentation process is applied only to non-homogeneous regions. The multi-level thresholding is carried out until the SF of the block is less than a pre-defined value Th_SF. The TV of the region is constant and is used for normalization purposes. The BCV will be high when fragments of similar color are grouped and dissimilar colors are separated. The multi-level thresholding is done in the following way: each time, the fragment/class with the maximum within-class variance is selected (initially, the entire block is treated as one class), since a high within-class variance signifies that the class is non-homogeneous. The division is done by finding the optimal threshold given by (2). The process is repeated until the SF of the block is less than Th_SF. The class pool thus created needs to be assembled based on color similarity. The assembled regions will signify the multiple unimodal regions of the multimodal object. The assembling process is started with a new region which includes the first class of the first region. This is followed by a merging process which finds the classes similar to this class. The criterion for similarity is the difference of the mean values of the two classes. The class with the least difference is identified and, if the difference of the means is less than a pre-defined value Th_mean, the class is merged into the region. If none of the classes can be merged into any of the existing regions, a new region is created by picking up the class which
Fig. 1. Parrot sequence: The input image (a) is fragmented into five parts. (b) represents the body of the parrot (green), (c) represents the forehead (white), (d) represents the hair (blue), (e) represents the cheeks (red) and (f) represents the beak (yellow).
has the largest difference in the mean value from the existing regions. Then the unclassified classes are again tried for merging into this new region. The process is repeated until all the classes are merged into regions. The regions thus obtained form the unimodal fragments of the multimodal object. An example of the fragmentation is shown in Figure 1.
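The two-class split of Eq. (2) and the separability factor of Eq. (3) can be sketched in Python as follows (an illustration only; the Hue range and bin count are assumptions, and the recursive splitting and assembling logic described above is omitted for brevity).

import numpy as np

def optimal_threshold(hue_values, levels=256):
    """Two-class split of Eq. (2): choose t maximizing the between-class variance."""
    hist, _ = np.histogram(hue_values, bins=levels, range=(0, levels))
    total = hist.sum()
    mu = np.sum(np.arange(levels) * hist) / total
    best_t, best_bcv = 0, -1.0
    for t in range(1, levels):
        w1, w2 = hist[:t].sum(), hist[t:].sum()
        if w1 == 0 or w2 == 0:
            continue
        mu1 = np.sum(np.arange(t) * hist[:t]) / w1
        mu2 = np.sum(np.arange(t, levels) * hist[t:]) / w2
        bcv = w1 * (mu1 - mu) ** 2 + w2 * (mu2 - mu) ** 2   # Eq. (4) with N = 2
        if bcv > best_bcv:
            best_bcv, best_t = bcv, t
    return best_t, best_bcv

def separability_factor(hue_values, bcv):
    """SF of Eq. (3): between-class variance normalized by the total variance, Eq. (1)."""
    tv = np.sum((hue_values - hue_values.mean()) ** 2)
    return bcv / tv if tv > 0 else 0.0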
2.2 Modelling the Object
Each fragment obtained after the fragmentation process is modelled separately. Each region R ⊂ Regions is described by the color values {R, G, B}; thus the feature descriptor at an image location x = (x, y)^t is computed as f(x) = [R(I, x) G(I, x) B(I, x)]^t. The region covariance C of the feature descriptors in R is computed as

C = \frac{1}{|R|} \sum_{x ∈ R} (f(x) − μ)(f(x) − μ)^t   (5)

where μ = \frac{1}{|R|} \sum_{x ∈ R} f(x) is the mean feature descriptor in R and |R| is the number of pixels in the region R. A simple covariance matrix computed with color features contains the information needed to capture the appearance of the object. The color distribution in the target region is estimated by a Gaussian distribution. The ML estimate of the parameters of the Gaussian, Θ = (μ, C), is the target model. The probability density function (PDF), also termed the strength image, is computed over the new image. The value of each pixel in
the strength image signifies the probability with which the pixel belongs to the target model. In the remainder of this paper we denote this value as p(x|Θ), where x is the pixel location. The PDF in this case is computed as:

p(x|Θ) ∝ exp(−(f(x) − μ)^t C^{−1} (f(x) − μ))   (6)
The PDF is calculated for each of the unimodal regions of the object obtained through the fragmentation process. The PDF has high values for pixels which belong to the particular parametric distribution and low values otherwise. In the next section, we show how the PDF computed for the image can be used to track the region accurately in the presence of various deformations.
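As a concrete illustration of Eqs. (5) and (6), the following Python sketch fits the Gaussian model of one fragment and evaluates its strength image over a new frame (a sketch with assumed array layouts; the small regularization term is an assumption not in the paper).

import numpy as np

def fit_fragment_model(pixels):
    """Fit the Gaussian target model Theta = (mu, C) of Eq. (5).

    pixels: (N, 3) array of RGB values belonging to one fragment.
    """
    mu = pixels.mean(axis=0)
    diff = pixels - mu
    C = diff.T @ diff / len(pixels)
    return mu, C

def strength_image(frame, mu, C, reg=1e-6):
    """Strength image of Eq. (6): p(x|Theta) for every pixel of an RGB frame."""
    h, w, _ = frame.shape
    f = frame.reshape(-1, 3).astype(float) - mu
    C_inv = np.linalg.inv(C + reg * np.eye(3))   # small regularizer (an assumption)
    maha = np.einsum('ni,ij,nj->n', f, C_inv, f)
    return np.exp(-maha).reshape(h, w)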
2.3 ML Framework
The region to be tracked, R0, is represented by an ellipse in our case. The position and shape of the object are described by the mean M0 and covariance V0 of the pixels in the region. Given the target model Θ, the objective of the search mechanism is to find a region R in the new frame, described by mean and covariance (M, V), that maximizes the function

J(M, V) = \sum_{x ∈ R} p(x|Θ) L(x|M, V)   (7)

where the term

L(x|M, V) ∝ exp(−(x − M)^t V^{−1} (x − M))   (8)

prevents pixel locations that are farther from the original region from distracting the tracker. As a pixel's contribution falls off with the distance from the original region, this helps both in reducing the effect of outlier pixels on the search and in preventing the tracker from drifting away from the object. As shown in [2,9,10], the maximum-likelihood estimates of M and V can be obtained via an EM-like iterative procedure. The key to the method is to assume a set of hidden variables w(x). Starting with an initial estimate M0, V0 of R, the EM iteration proceeds as below:
– E-Step: Given the current estimates M^k and V^k of the mean and covariance of the region in the k-th iteration, compute the hidden variables w^k(x):

w^k(x) = \frac{p(x|Θ) L(x|M^k, V^k)}{\sum_{x' ∈ R} p(x'|Θ) L(x'|M^k, V^k)}   (9)

– M-Step: Using the hidden variables computed above, compute the next estimates of the mean and covariance of the region, M^{k+1} and V^{k+1}, that maximize J(·, ·):

M^{k+1} = \sum_{x ∈ R} w^k(x) x   (10)

V^{k+1} = \sum_{x ∈ R} w^k(x)(x − M^{k+1})(x − M^{k+1})^t   (11)
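The E- and M-steps above translate directly into a few lines of numpy; the following sketch (an illustration, not the authors' implementation: it sums over the whole image rather than restricting to the region R, and the regularizer is an assumption) iterates Eqs. (9)-(11) given a precomputed strength image.

import numpy as np

def em_ellipse_update(strength, M, V, n_iter=10, reg=1e-6):
    """EM-like iteration of Eqs. (9)-(11) for the region mean M and covariance V.

    strength : 2D strength image p(x|Theta) over the current frame.
    M, V     : initial 2D mean and 2x2 covariance of the tracked ellipse.
    """
    ys, xs = np.mgrid[0:strength.shape[0], 0:strength.shape[1]]
    X = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    p = strength.ravel()
    for _ in range(n_iter):
        d = X - M
        V_inv = np.linalg.inv(V + reg * np.eye(2))
        L = np.exp(-np.einsum('ni,ij,nj->n', d, V_inv, d))      # Eq. (8)
        w = p * L
        w = w / (w.sum() + 1e-12)                                # E-step, Eq. (9)
        M = (w[:, None] * X).sum(axis=0)                         # M-step, Eq. (10)
        d = X - M
        V = np.einsum('n,ni,nj->ij', w, d, d)                    # M-step, Eq. (11)
    return M, V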
The optimal values for M and V are obtained by iterating the above steps until convergence. Our experimental results demonstrate that the search mechanism described above is both efficient and robust to a wide variety of changes in the shape of the object. In Algorithm 1, we explain the complete tracking algorithm.

Algorithm 1. Track(Video V, Region R0)
1: I0 ← Initial Frame(V)
2: (M0, V0) ← Fit Ellipse(R0)
3: H ← HSV(R0)
4: Class Pool ← Multi Level Thresholding(H)
5: Fragments ← Assembling(Class Pool)
6: Θ ← Region Covariance(R0, Fragments)
7: for each frame Ii in V do
8:   (Mi, Vi) ← (Mi−1, Vi−1)
9:   S ← Strength Image(Ii, Θ)
10:  k ← 0
11:  while not converged do
12:    compute weights wk using equation (9)
13:    update estimates (Mi, Vi) using equations (10), (11)
14:    k ← k + 1
15:  end while
16: end for

3 Experimental Results
The tracking algorithm was tested on various challenging datasets [11]. It was also tested on a few low-contrast videos taken from the Internet. The tracker performance was encouraging when tested for its ability to handle the following aspects:
Non-rigid deformations: Tracking non-rigid objects is a challenging problem. A couple of examples are highlighted in Figures 2 and 4. In the Cat sequence (Figure 2), the cat is tracked accurately under considerable deformations (sitting, jumping and running). Also, note that the contrast between the cat and the background is quite low. In the case of Figure 4, the monkey is tracked successfully under extreme deformations. In both cases, the tracked ellipse changes accurately to handle the non-rigid deformations of the object.
Orientation: Change in the orientation of objects is a common scenario in tracking. Tracking the object with accurate orientation is possible in our case since we track the object using an ellipse. The mean of the ellipse characterizes the location and the covariance signifies the scale and orientation. In Figure 4(b,c), the monkey undergoes considerable changes in orientation. The orientation of the ellipse changes according to the orientation of the object. Figure 3 is another example where the fish is tracked accurately in the presence of rapid orientation changes.
Fig. 2. Cat sequence: The cat is tracked successfully in the presence of changes in scale and non-rigid deformations
Fig. 3. Fish sequence: An example of a low quality video containing partial occlusions and orientation changes. Note the other fish in the tank with a similar color (but not pattern) to the object being tracked. Many existing trackers fail in such cases.
Fig. 4. Monkey sequence: In spite of changes in orientation and non-rigid deformations, the monkey is tracked precisely
Scale: Earlier trackers relied on techniques such as searching through an exhaustive search space [12] or using templates of the object at different scales [13]. In our case, the EM-like algorithm enables efficient handling of scale changes by estimating the covariance of the tracked ellipse. Figure 2 shows how the tracking handles scale changes.
Partial Occlusion: Partial and full occlusions occur frequently in tracking scenarios and the tracker needs to handle these successfully. Even if an object is
Fig. 5. Caviar sequence: The sequence shows the handling of partial occlusion of the person (blue ellipse) when he crosses two other people
Fig. 6. Parrot sequence: The multimodal object parrot is decomposed into multiple unimodal regions and tracked separately
Fig. 7. Parrot sequence: The multimodal object is tracked successfully. Note that the ellipse completely fits the entire parrot, enclosing all the homogeneous regions.
completely occluded for a considerable time, the tracker should be able to track the object on reappearance. A scenario with complete and partial occlusions is shown in Figures 3 and 5. Figure 3(b) shows a case where the fish is partially occluded. In Figure 3(c), the fish reappears after being completely occluded and the tracker was able to relocate the object. On a standard dataset, as in Figure 5, the ellipse fits the partially visible person even when a major portion of the person is occluded by two other people.
Handling Multimodal: Many of the datasets to be tracked contain multimodal objects. We handle multimodal objects by fusing information from each unimodal region.
Figures 2, 3 and 7 show tracking results on multimodal objects. For instance, Figure 7 shows the example of the parrot, where each homogeneous region is extracted and modelled separately as explained previously, and Figure 6 shows the tracking of each unimodal region.
Videos with low quality and contrast: The quality of multimedia data available on the web varies significantly owing to various compression and transmission techniques. Several tests using videos with low quality and contrast were carried out to test our technique. In the case of Figures 3 and 4, taken from Google Videos, the quality is poor owing to compression. In these videos, the background color merges more with the color of the object. The tracker performance is very good and insensitive to these variations in video.
4 Conclusion
We have proposed a fragment based tracking approach in which multimodal objects are fragmented into homogeneous regions based on hue. These unimodal regions are then tracked using single parametric distributions, and these distributions are fused to form the final tracking result of the entire object. The proposed tracker is also complemented with an efficient search mechanism that makes the system robust to non-rigid deformations, occlusions, and scale and orientation changes. To model the object more effectively, current research is focused on combining other cues such as motion, edge and texture with the present color based tracker.
References
1. Comaniciu, D., Meer, P.: Mean shift analysis and applications. In: ICCV, vol. 2, pp. 1197–1203 (1999)
2. Zivkovic, Z., Krose, B.: An EM-like algorithm for color-histogram-based object tracking. In: Proceedings of the International Conference on Computer Vision and Pattern Recognition, pp. 798–803 (2004)
3. Comaniciu, D., Ramesh, V., Meer, P.: Real-time tracking of non-rigid objects using mean shift. In: IEEE Conf. on Comp. Vis. and Pat., pp. 142–151. IEEE Computer Society Press, Los Alamitos (2000)
4. Adam, A., Rivlin, E., Shimshoni, I.: Robust fragments-based tracking using the integral histogram. In: CVPR 2006. Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 798–805. IEEE Computer Society Press, Los Alamitos (2006)
5. Leordeanu, M., Collins, R.T., Liu, Y.
6. Birchfield, S.T., Rangarajan, S.: Spatiograms versus histograms for region-based tracking. In: Proceedings of the Computer Vision and Pattern Recognition, vol. 2, pp. 1158–1163. IEEE Computer Society, Los Alamitos (2005)
7. Han, B., Davis, L.: Object tracking by adaptive feature extraction. In: Proceedings of the International Conference on Image Processing (2004)
8. Liao, P.S., Chen, T.S., Chung, P.C.: A fast algorithm for multilevel thresholding.
9. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 39(1), 1–38 (1977)
10. Neal, R.M., Hinton, G.E.: A new view of the EM algorithm that justifies incremental, sparse and other variants. In: Jordan, M.I. (ed.) Learning in Graphical Models, pp. 355–368. Kluwer Academic Publishers, Dordrecht (1998)
11. EC Funded CAVIAR project/IST 2001 37540: found at http://homepages.inf.ed.ac.uk/rbf/caviar/ (2004)
12. Porikli, F., Tuzel, O.: Covariance tracking using model update based on means on Riemannian manifolds. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Computer Society Press, Los Alamitos (2006)
13. Birchfield, S.: Elliptical head tracking using intensity gradients and color histograms. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 232. IEEE Computer Society, Los Alamitos (1998)
Spatiotemporal Oriented Energy Features for Visual Tracking
Kevin Cannons and Richard Wildes
York University, Department of Computer Science and Engineering, Toronto, Ontario, Canada
{kcannons,wildes}@cse.yorku.ca
Abstract. This paper presents a novel feature set for visual tracking that is derived from "oriented energies". More specifically, energy measures are used to capture a target's multiscale orientation structure across both space and time, yielding a rich description of its spatiotemporal characteristics. To illustrate utility with respect to a particular tracking mechanism, we show how to instantiate oriented energy features efficiently within the mean shift estimator. Empirical evaluations of the resulting algorithm illustrate that it excels in certain important situations, such as tracking in clutter with multiple similarly colored objects and environments with changing illumination. Many trackers fail when presented with these types of challenging video sequences.
1 Introduction
Target tracking is a critically important aspect of a wide range of computer vision applications, including surveillance, smart rooms, and human-computer interfaces. Significant contributions have been made to the field, but no general-purpose tracker has been found that can operate effectively in every real-world setting [1]. Scenarios that are present in realistic sequences and challenge many trackers include changes in illumination, small targets, and significant clutter. In general, to facilitate accurate tracking, features must be selected that distinguish targets from the background and from one another, even while being robust to photometric and geometric distortions. In response to these requirements, many different proposals have been made; here, representative examples are provided. Perhaps the simplest approach is to make use of image intensity-based templates for feature definition [2,3,4]. To provide robustness to photometric distortions, consideration has been given to discrete features [5,6,7]. To encompass object outlines, methods have emerged that use contours and silhouettes [8,9,10]. Other features (e.g., color, texture) have been derived on a more regional basis [11,12,13]. Recovered motion also has been used in feature definitions [14,15,16]. Limited attention has been given to the integrated analysis of both the spatial and temporal domains when considering features for visual tracking. Potential benefits of a more integrated approach include the ability to combine static and dynamic target information in a natural fashion as well as simplicity of
design and implementation. In response to this observation, the present paper documents a novel feature set for visual tracking that uses energy measures to capture a target's multiscale, spatiotemporal orientation structure. A considerable body of research has emerged on the use of orientation selective filters in the spatiotemporal domain for the purpose of analyzing motion [17,18,19]. However, it appears that no previous research has explored the use of multiscale, spatiotemporal oriented energies that uniformly encompass space and time as the basis for defining features in the service of visual tracking. To illustrate the use of the proposed oriented energy feature set, we make use of the mean shift tracking paradigm [13,20,21,22], a framework onto which these features readily map. The energy features are, however, also applicable to alternative paradigms, e.g., those that preserve within-target spatial relationships, since the oriented energies are calculated locally. In light of previous research, the main contributions of this paper are as follows. (1) A novel oriented energy feature set is defined for visual tracking. This representation captures the spatiotemporal characteristics of a target in an integrated, compact fashion. (2) Oriented energy features are instantiated with respect to the mean shift estimator. (3) The performance of the resulting system is documented both qualitatively and quantitatively. Our algorithm outperforms a color-based mean shift implementation in three common, real-world situations: substantial clutter; multiple targets with similar color; and illumination changes.
2 Technical Approach
2.1 Oriented Energy Features
Oriented Energy Computation. Events in a video sequence will generate diverse structures in the spatiotemporal domain. For instance, a textured, stationary object produces a much different signature in image space-time than if the same object were moving. One method of capturing the spatiotemporal characteristics of a video sequence is through the use of oriented energies [17]. These energies are derived using the filter responses of orientation selective bandpass filters when they are convolved with the spatiotemporal volume produced by a video stream. Responses of filters that are oriented parallel to the image plane are indicative of the spatial pattern of observed surfaces and objects (e.g., spatial texture); whereas, orientations that extend into the temporal dimension capture dynamic aspects (e.g., velocity and flicker). The basis of our approach is that energies computed at orientations which span the space-time domain can provide a rich description of a target for visual tracking. Here, multiscale processing is also important, as coarse scales capture gross spatial pattern and overall target motion while finer scales capture detailed spatial pattern and motion of individual parts (e.g., limbs). With regard to dynamic aspects, simple motion is captured (orientation along a single spatiotemporal diagonal) as well as more complex phenomena, e.g., multiple juxtaposed motions as limbs cross (multiple orientations in a spatiotemporal region). By encompassing both spatial and temporal target characteristics in an integrated fashion,
tracking is supported in the presence of significant clutter. Further, as detailed below, such representations can be made invariant to local image contrast to support tracking throughout substantial illumination changes. For this work, filtering was performed using broadly tuned, steerable, separable filters based on the second derivative of a Gaussian, G2, and their corresponding Hilbert transforms, H2 [23], with responses pointwise rectified (squared) and summed. Filtering was executed across θ = (η, ξ) 3D orientations (η, ξ specifying polar angles) and σ scales using a Gaussian pyramid formulation. Hence, a measure of local energy, e, can be computed according to

e(x; θ, σ) = [G_2(θ, σ) ∗ I(x)]^2 + [H_2(θ, σ) ∗ I(x)]^2 ,   (1)
where x = (x, y, t) corresponds to spatiotemporal image coordinates, I is the image sequence, and ∗ denotes convolution. This initial measure of local energy is dependent on image contrast. To attain a purer measure of the relative contribution of the orientations irrespective of contrast, (1) is normalized as

ê(x; θ, σ) = e(x; θ, σ) / ( \sum_{θ̃, σ̃} e(x; θ̃, σ̃) + ε ),   (2)
where ε is a bias term to avoid instabilities when the energy content is small and the summation in the denominator covers all scale and orientation combinations. (In this paper, our convention is to mark variables of summation with a tilde.) For illustrative purposes, Fig. 1 displays a subset of the energies that are computed for a single frame of a MERL traffic sequence [24]. Here, there is a white car moving to the left near the center of the frame. Notice how the energy channel that is tuned for leftward motion is very effective at distinguishing this car from the static background. Consideration of the channel tuned for horizontal structure shows how it captures the overall orientation structure of the white car. In contrast, while the channel tuned for vertical textures captures the outline of the crosswalks, it shows little response to the car, as it is largely devoid of vertical structure at the scales considered. Finally, note how the energies become more diffuse and capture more gross structure at the coarser scale. Given that the tracking problem is being considered, the goal is to locate the target’s position as precisely as possible. However, as seen in Fig. 1, the energies computed at coarser scales are diffuse due to the downsampling/upsampling that is employed in pyramid processing. Coarse energies are important because they provide information regarding the target’s gross shape and motion, but a method is required to improve their localization for accurate tracking. To that end, a set of weights is applied to the normalized energies of (2) according to

Ê(x; θ, σ) = ê(x; θ, σ) b(x; θ),   (3)
where b are pixel-wise weighting factors for a particular orientation channel, θ. The weighting factors for a specific orientation are computed by integrating the energies across all scales and applying a threshold, T_θ, according to

b(x; θ) = [ \sum_{σ̃} ê(x; θ, σ̃) > T_θ ].   (4)
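To make the preceding definitions concrete, the following is a minimal NumPy/SciPy sketch of the energy computation in (1)–(4). It is not the authors’ implementation: the broadly tuned 3D G2/H2 steerable, separable filters of [23,25] are assumed to be supplied as precomputed kernels (one per orientation–scale pair), the Gaussian-pyramid handling of scale is elided, and the interpretation of the per-orientation threshold as a multiple of the channel’s mean energy follows the setting reported in Section 3.

```python
import numpy as np
from scipy.ndimage import convolve

def oriented_energies(volume, g2_bank, h2_bank, bias=1e-3, thresh_scale=2.75):
    """Schematic normalized, weighted oriented energies, following Eqs. (1)-(4).

    volume           : (T, Y, X) spatiotemporal image volume
    g2_bank, h2_bank : dicts mapping (orientation, scale) -> 3D kernel arrays,
                       stand-ins for the G2/H2 steerable filters of [23]
    """
    raw = {}
    for key in g2_bank:                          # Eq. (1): quadrature-pair energy
        g = convolve(volume, g2_bank[key], mode='nearest')
        h = convolve(volume, h2_bank[key], mode='nearest')
        raw[key] = g ** 2 + h ** 2

    total = sum(raw.values()) + bias             # Eq. (2): divisive normalization
    norm = {key: e / total for key, e in raw.items()}

    weighted = {}
    orientations = {theta for theta, _ in norm}
    for theta in orientations:
        # Eq. (4): binary weights from the scale-summed, normalized energies
        summed = sum(e for (t, _), e in norm.items() if t == theta)
        b = (summed > thresh_scale * summed.mean()).astype(volume.dtype)
        for (t, sigma), e in norm.items():
            if t == theta:
                weighted[(t, sigma)] = e * b     # Eq. (3)
    return weighted
```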
Fig. 1. Frame 29 of the MERL traffic video sequence with select corresponding energy channels. Finer and coarser scales are shown in rows two and three, resp. From left to right, the energy channels roughly correspond to horizontal structure, vertical structure, and leftward motion.
When computing the weights, summing across scales allows the better localized fine scales to sharpen the coarse scales, while the coarse scales help to smooth the responses of the fine scales. Furthermore, by calculating weights separately for each orientation, we avoid being prejudiced toward any particular type of oriented structure (e.g., static vs. dynamic). Two significant advantages of the proposed oriented energy feature set must be further highlighted. First, normalized energy, as defined by (1) and (2), captures local spatiotemporal structure at a particular orientation and scale with a degree of robustness to scene illumination: By virtue of the bandpass filtering, (1), invariance will be had to changes that are manifest in the image as additive offsets to image brightness; by virtue of the normalization, (2), invariance will be had to changes that are manifest in the image as multiplicative offsets. Second, the calculation of the defined normalized oriented energies requires nothing more than 3D separable convolution and pointwise nonlinear operations, and is thereby amenable to compact, efficient implementation [25]. Histogram Representation. As defined, oriented energies provide local characterization of image structure. Therefore, the energy measurements could be used to provide pointwise descriptors for target tracking (e.g., in conjunction
with spatial template-based matching). Alternatively, the pointwise measurements can be aggregated over target support to provide region-based descriptors (e.g., in conjunction with mean shift tracking). Here, we pursue the second option and demonstrate the efficacy of the features as regional descriptors. With an eye to mean shift tracking, we collapse the spatial information in our initial energy measurements and represent the target as a histogram. Each histogram bin corresponds to the weighted energy content of the target at a particular scale and orientation. Specifically, the template histogram that defines the target in the first frame is given by

q̂_u = C \sum_{i=1}^{n} k(‖x*_i‖^2) Ê(x*_i; φ_u),   (5)
where k is the profile of the tracking kernel, C is a normalization constant to ensure the histogram sums to unity, x∗i = (x∗ , y ∗ ) is a single target pixel at some temporal instant, i ranges so that x∗i covers the template support, and φu is the scale and orientation combination which corresponds to bin u of the histogram. When tracking a target, it may be necessary to evaluate several target candidates for the current frame. Candidate histograms are defined as
p̂_u(y) = C_h \sum_{i=1}^{n_h} k(‖(y − x*_i)/h‖^2) Ê(x*_i; φ_u),   (6)

where y is the center of the target candidate’s tracking window, h is the bandwidth of the tracking kernel and i ranges so that x*_i covers the candidate support. A sample energy histogram for the target region shown in Fig. 1 (represented by the white box) is shown in Fig. 2. The bin corresponding most closely to leftward motion at the finest scale (bin 5) has by far the most energy. The next two high energy counts are found in bins 2 and 9, which are tuned to combinations of dynamic and static structure, with an emphasis on leftward motion and spatial orientation similar to that of the target. The overall horizontal structure of the car is captured by the energy in bins 1 and 4. In contrast, bins 3 and 6, which roughly represent static, vertical structure, do not have strong responses, given the nature of the car target. The histogram also shows that the oriented energies for the highest frequency structures have the strongest response, as the target is fairly small and dominated by relatively finer scale structure.

2.2 Oriented Energy Features in the Mean Shift Framework
Target Position Estimation. Under the mean shift framework, tracking an object involves locating the candidate position in the current frame that produces the histogram that is most similar to the template. Thus, a measure of similarity between two histograms is required. For histogram comparisons we utilize the Bhattacharyya coefficient, the sample estimate of which can be computed using

ρ[p̂(y), q̂] = \sum_{u=1}^{m} \sqrt{p̂_u(y) q̂_u},   (7)
Fig. 2. Oriented energy histogram for the target region in Fig. 1, plotting weighted energy against scale and orientation bin (high-, mid-, and low-frequency energies indicated)
where p̂(y) and q̂ are histograms with m bins apiece. Due to the definition of the Bhattacharyya coefficient, in order to minimize the distance between two histograms, (7) must be maximized with respect to the target position, y. The Bhattacharyya coefficient can be maximized via mean shift iterations [20]. The specific mean shift vector that can be used for this maximization is

ŷ_1 = \frac{\sum_{i=1}^{n_h} x*_i w_i g(‖(ŷ_0 − x*_i)/h‖^2)}{\sum_{i=1}^{n_h} w_i g(‖(ŷ_0 − x*_i)/h‖^2)},  where  w_i = \sum_{u=1}^{m} \sqrt{\frac{q̂_u}{p̂_u(ŷ_0)}} Ê(x*_i; φ_u),   (8)

g(x) = −k′(x) is the (negated) derivative of the tracking kernel profile, k, with respect to x, and ŷ_0 is the current target position. The Epanechnikov kernel has been shown to be effective [20] and is the most commonly used kernel for mean shift tracking. Thus, the position of the target in the current frame is estimated as follows. Starting from the target’s position in the previous frame, the mean shift vector is computed and the target candidate is moved to the position indicated by the mean shift vector. These steps are repeated until convergence has been reached or a fixed number of iterations have been executed.

Template and Scale Updates. When tracking an object through a long video sequence, it is common that its characteristics will change. To combat the changes a target may incur over time (e.g., due to alterations in velocity or rotation), our tracker includes a simple template update mechanism defined as

q̂_{i+1} = α π q̂_i + (1 − α)(1 − π) p̂(y_i),   (9)

where α is a weighting factor to control the speed of the updates, q̂_i is the template at frame i, and π = ρ[p̂(y_i), q̂_i] is the Bhattacharyya coefficient between the current template and the optimal candidate found in the ith frame. Empirically, α was set to 0.85. Following each application of (9), the resulting template is renormalized and thereby remains consistent with our overall formulation. Owing to dependence on the Bhattacharyya coefficient, the template update rule
indicates that if the template and the optimal candidate are well-matched, the update to the template will be minimal. The size of a target may change during a video sequence as well. Although there are more effective methods of dealing with changes of object scale in the mean shift framework [21,22], in the current implementation we employ a simple approach, similar to that taken in [20]. In particular, our system performs mean shift optimization three times per frame using three different bandwidth values, h. Unless stated otherwise, h values of ±5% are used. We obtain the new bandwidth, h_new, by combining the best of the three bandwidths evaluated at the current frame, h_opt, with the previous target size, h_prev, according to

h_new = γ h_opt + (1 − γ) h_prev.   (10)
Empirically, we set γ = 0.15.
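As a concrete illustration of Section 2.2, the following is a hedged NumPy sketch of the main quantities (5)–(10); it is a schematic reimplementation, not the authors’ code. The per-pixel weighted energies Ê are assumed to be available as an array of shape (bins, height, width), image coordinates are taken in the candidate’s frame, and the Epanechnikov kernel is used, for which g in (8) is constant on the kernel support so that the mean-shift step reduces to a weighted average.

```python
import numpy as np

def energy_histogram(E, center, h):
    """Kernel-weighted oriented energy histogram, Eqs. (5)/(6).
    E: (n_bins, H, W) weighted energies; center = (cx, cy); h: bandwidth."""
    ys, xs = np.mgrid[0:E.shape[1], 0:E.shape[2]]
    d2 = ((xs - center[0]) ** 2 + (ys - center[1]) ** 2) / h ** 2
    k = np.clip(1.0 - d2, 0.0, None)            # Epanechnikov profile (up to a constant)
    hist = (E * k).reshape(E.shape[0], -1).sum(axis=1)
    return hist / (hist.sum() + 1e-12)          # normalization constant C / C_h

def bhattacharyya(p, q):
    return np.sqrt(p * q).sum()                 # Eq. (7)

def mean_shift_step(E, q_hat, y0, h):
    """One mean-shift update, Eq. (8); g is constant for the Epanechnikov kernel."""
    p_hat = energy_histogram(E, y0, h)
    w_bins = np.sqrt(q_hat / (p_hat + 1e-12))
    w_pix = np.tensordot(w_bins, E, axes=1)     # w_i = sum_u sqrt(q_u/p_u) E(x_i; phi_u)
    ys, xs = np.mgrid[0:E.shape[1], 0:E.shape[2]]
    inside = ((xs - y0[0]) ** 2 + (ys - y0[1]) ** 2) <= h ** 2
    w = w_pix * inside
    denom = w.sum() + 1e-12
    return np.array([(w * xs).sum() / denom, (w * ys).sum() / denom])

def update_template(q_hat, p_hat, alpha=0.85):
    pi = bhattacharyya(p_hat, q_hat)            # Eq. (9), followed by renormalization
    q_new = alpha * pi * q_hat + (1 - alpha) * (1 - pi) * p_hat
    return q_new / (q_new.sum() + 1e-12)

def update_bandwidth(h_opt, h_prev, gamma=0.15):
    return gamma * h_opt + (1 - gamma) * h_prev  # Eq. (10)
```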
3 Empirical Evaluation
The performance of the oriented energy-based mean shift tracker has been evaluated on an illustrative set of test sequences. For comparative purposes, a mean shift tracker based on RGB color space was also developed and tested. Other than the use of different histograms, the two trackers were identical. The colorbased tracker was implemented in a similar manner to [20], whereby each color channel was quantized into 16 levels (yielding a histogram with 163 bins). In our current implementation of the energy-based tracker, energies were computed at 3 scales with 10 different spatiotemporal orientations per scale. Hence, the energy-based histograms contained 30 bins. For the oriented energy feature set, 10 orientations were selected because they span the space of 3D orientations for the highest order filters that we use (H2 ) [23]; in particular, the selected orientations correspond to the normals to the faces of an icosahedron with antipodal directions counted once, which provides a uniform tessellation of a sphere. For all results in this paper, an Epanechnikov kernel, K, was used. The thresholds for (4) were empirically set as 2.75× the mean energy for each orientation channel. The color and energy-based trackers were hand-initialized with identical target regions in the first frame of each video. Figure 3 illustrates the effectiveness of oriented energy-based features in dealing with illumination changes. An individual starts walking in a poorly lit area;
Fig. 3. Video sequence (x × y × t = 360 × 240 × 60) of a man walking through shadows. From left to right, frames 4, 18, 31, and 55 are shown. Tracked regions are highlighted with white boxes.
Fig. 4. Video sequence (x × y × t = 360 × 240 × 50) of people walking through a room with similar colored clothing. From left to right, frames 6, 18, 32, and 50 are shown. Tracked regions are highlighted with white boxes.
Fig. 5. MERL traffic video sequence (x × y × t = 368 × 240 × 64) where a white car is tracked as it travels through an intersection. From left to right frames 13, 24, 38, and 58 are shown. Tracked regions are highlighted with white boxes.
then, he travels into and out of the bright region as he walks across the room. Using our proposed feature set, the tracker appeared to be relatively unaffected by the changes in illumination. This robustness arises from the normalization performed in (2). In comparison, our color-based mean shift tracker completely lost track of the target after only a few frames, even when histograms created using normalized RG-space [20] were utilized. Figure 4 shows a case where two persons with similar colored clothing walk in opposite directions and the individual starting on the right side is being tracked. Despite the full occlusion that occurs for several frames, the tracker using energy features is capable of following the true target throughout the video. The different texture patterns and velocities of the walkers were sufficient cues for the energy-based tracker to achieve success, as the representation spans the spatiotemporal domain. In comparison, our color-based tracker became distracted by the other walker as the individuals have near-identical color distributions. Figure 5 shows a real-life, grayscale video sequence of a cluttered traffic scene that was obtained from MERL [24] (a portion also used in Fig. 1). As the figure shows, our proposed system experiences some slight difficulty when tracking the vehicle as it passes over the crosswalk (e.g. notice off-centered tracking in frames 13 and 24). This performance decrease occurs because the lack of contrast (essentially uniform white on white) between the car and the crosswalk yields little energy for the involved portions of the car. Nevertheless, the tracker never loses the target; indeed, the frames shown are representative of the worst case performance in this video. Our feature set was also successfully used when tracking people and vehicles in videos obtained from the PETS2001 dataset [26]. Figure 6 shows an example
Fig. 6. PETS2001 video sequence (x × y × t = 384 × 288 × 85) where a cyclist is being tracked. From left to right frames 18, 32, and 73 are displayed. Tracked regions are highlighted with white boxes.
Fig. 7. Video sequence (x × y × t = 360 × 240 × 100) showing an individual walking in an erratic pattern. From left to right frames 22, 74, 86, and 100 are displayed. Tracked regions are highlighted with white boxes.
of our results on this dataset where a cyclist is tracked. The tracker that utilizes oriented energy features is successful despite the fact that the cyclist is partially occluded by another individual near the beginning of the sequence. The results on this data sequence are impressive given that the video accurately reflects real-world surveillance settings where targets of interest are often small and of low-resolution. In contrast, our implementation of the color-based mean shift tracker drifted off the target after only a few frames. In Fig. 7 an individual is shown walking erratically, making sudden changes in direction and moving at a wide variety of speeds. Since the oriented energy features encompass both spatial and temporal information, tracking of the target continues throughout each change in velocity. In particular, at instances where the target motion changes radically, the spatially-based components of the representation keep the tracker on target. Subsequently, template updates, (9), incorporate changes to adapt the model for further tracking. Figure 8 shows footage that one might obtain from overhead surveillance cameras in public areas. The oriented energy-based tracker follows the target of interest even though there are multiple similar walkers with little texture, cast shadows, and complex reflectance effects, as the video was recorded through a window. Using the oriented energy feature set, the target is not lost, even during the partial occlusion. The tracker does lag behind the target for a few frames immediately following the occlusion; however, it ultimately follows the correct person. Indeed, frame 39 is representative of its worst-case performance for this
Fig. 8. Video sequence (x × y × t = 320 × 240 × 70) showing multiple people in motion that are similar in appearance. From left to right frames 9, 31, 39, and 59 are displayed. Tracked regions are highlighted with white boxes.
Fig. 9. Bhattacharyya coefficients over the entire video sequence for the MERL and PETS2001 videos
video. In comparison, our color-based implementation was only able to follow the true target for approximately 30 frames. Quantitative performance analysis was performed for the video sequences that are publicly available — MERL and PETS2001. Specifically, Fig. 9 shows the Bhattacharyya coefficient vs. frame number for these two sequences. The Bhattacharyya coefficient is a measure of the system’s confidence in the target found in each frame, with 1 being the largest possible value. For the MERL video, the decreased level of performance at the crosswalks that was qualitatively observed is also indicated quantitatively. In particular, Fig. 9 shows two slight decreases in the Bhattacharyya coefficient at frames 12 and 58 — precisely the frames when the vehicle is passing over the crosswalks. For the PETS video sequence, the significant deviation the Bhattacharyya coefficient experiences is a result of the partial occlusion of the cyclist by the walker (approximately frames 15 - 34). The other, less substantial decreases are a result of the significant background clutter (e.g., parked cars). Also of note is that an average of 3 mean shift iterations were required to reach convergence for these two videos. Twenty iterations, the maximum we allow, was observed only three times.
4 Summary
Spatiotemporal oriented energy features provide a rich, yet compact representation of a target’s characteristic structure across both space and time. In particular,
by encompassing a range of orientations and scales, the proposed feature set provides a natural integration of the static (e.g., spatial texture) and dynamic (e.g., motion) aspects of a target. To illustrate their usefulness with respect to a particular tracking mechanism, we provide an instantiation with respect to the mean shift estimator. In our experiments over a wide range of video sequences, the energy-based tracker was considered to perform as well as or better than an identical algorithm that used color histograms. Of primary interest in our work were surveillance-inspired video sequences that included challenges such as substantial background clutter, targets that contained similar colors to other objects in the scene, and changes in illumination. Tracking with the use of oriented energy features was shown to be robust to these challenges. Acknowledgments. Portions of this work were funded by an Ontario Graduate Scholarship to K. Cannons and an NSERC Discovery Grant to R. Wildes.
References 1. Yilmaz, A., Javed, O., Shah, M.: Object tracking: A survey. Comp. Surv. 38(4), 1–45 (2006) 2. Lucas, B., Kanade, T.: An iterative image registration technique with application to stereo vision. In: DARPA IUW, pp. 121–130 (1981) 3. Anandan, P.: A computational framework and an algorithm for the measurement of visual motion. IJCV 2(3), 283–310 (1989) 4. Shi, J., Tomasi, C.: Good features to track. CVPR 1, 593–600 (1994) 5. Sethi, I., Jain, R.: Finding trajectories of feature points in monocular images. PAMI 9(1), 56–73 (1987) 6. Deriche, R., Faugeras, O.: Tracking line segments. IVC 8(4), 261–270 (1991) 7. Rangarajan, K., Shah, M.: Establishing motion correspondence. CVGIP 54(1), 56– 73 (1991) 8. Terzopoulos, D., Szeliski, R.: Tracking with kalman snakes. In: Blake, A., Yuille, A. (eds.) Active Vision, pp. 553–556. MIT Press, Cambridge (1992) 9. Isard, M., Blake, A.: Contour tracking by stochastic propagation of conditional density. In: Buxton, B.F., Cipolla, R. (eds.) ECCV 1996. LNCS, vol. 1064, pp. 343–354. Springer, Heidelberg (1996) 10. Haritaoglu, L., Harwood, D., Davis, L.: W4: Real-time surveillance of people and their activities. PAMI 22(8), 809–830 (2000) 11. Birchfield, S.: Elliptic head tracking with intensity gradients and color histograms. CVPR 1, 232–237 (1998) 12. Sigal, L., Sclaroff, S., Athitsos, V.: Estimation and prediction of evolving color distributions for skin segmentation under varying illumination. CVPR 2, 152–159 (2000) 13. Elgammal, A., Duraiswami, R., Davis, L.: Probabilistic tracking in joint featurespatial spaces. CVPR 1, 781–788 (2003) 14. Bolgomolov, Y., Dror, G., Lapchev, S., Rivlin, E., Rudzsky, M.: Classification of moving targets based on motion and appearance. In: BMVC, pp. 142–149 (2003) 15. Cremers, D., Schnorr, C.: Statistical shape knowledge in variational motion segmentation. IVC 21(1), 77–86 (2003)
16. Sato, K., Aggarwal, J.: Temporal spatio-velocity transformation and its application to tracking and interaction. CVIU 96(2), 100–128 (2004) 17. Adelson, E., Bergen, J.: Spatiotemporal energy models for the perception of motion. JOSA 2(2), 284–299 (1985) 18. Heeger, D.: Optical flow from spatiotemporal filters. IJCV 1(4), 297–302 (1988) 19. Enzweiler, M., Wildes, R., Herpers, R.: Unified target detection and tracking using motion coherence. Wrkshp. Motion & Video Comp. 2, 66–71 (2005) 20. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE PAMI 25(5), 564–575 (2003) 21. Collins, R.: Mean-shift blob tracking through scale space. CVPR 2, 234–240 (2003) 22. Zivkovic, Z., Krose, B.: An EM-like algorithm for color-histogram tracking. CVPR 1, 798–803 (2004) 23. Freeman, W., Adelson, E.: The design and use of steerable filters. IEEE PAMI 13(9), 891–906 (1991) 24. Brand, M., Kettnaker, V.: Discovery and segmentation of activities in video. IEEE PAMI 22(8), 844–851 (2000) 25. Derpanis, K., Gryn, J.: Three-dimensional nth derivative of Gaussian separable steerable filters. ICIP 3, 553–556 (2005) 26. PETS (2006), http://peipa.essex.ac.uk/ipa/pix/pets/
Synchronized Ego-Motion Recovery of Two Face-to-Face Cameras

Jinshi Cui1, Yasushi Yagi2, Hongbin Zha1, Yasuhiro Mukaigawa2, and Kazuaki Kondo2

1 State Key Lab on Machine Perception, Peking University, China
{cjs,zha}@cis.pku.edu.cn
2 Department of Intelligent Media, Osaka University, Japan
{yagi,mukaigawa,kondo}@am.sanken.osaka-u.ac.jp
Abstract. A movie captured by a wearable camera affixed to an actor’s body gives audiences the sense of being immersed in the movie. The raw footage captured by such a wearable camera needs to be stabilized because of jitter caused by ego-motion. However, conventional approaches often fail to estimate ego-motion accurately when moving objects dominate the image and the background region provides too few feature pairs. To address this problem, we propose a new approach that utilizes an additional, synchronized video captured by a camera attached to the foreground object (another actor). Formally, we configure this sensor system as two face-to-face moving cameras and derive the relations between four views, consisting of two consecutive views from each camera. The proposed solution has two steps: first, the extrinsic relationship of the two cameras is calibrated via an AX = XB formulation; second, the ego-motion is estimated using the calibration matrix. Experiments verify that this approach can recover from failures of the conventional approach and provides acceptable stabilization results on real data. Keywords: Wearable camera, synchronized ego-motion estimation, stabilization, two face-to-face cameras, extrinsic calibration.
1 Introduction

The goal of this work is to recover the ego-motion of two face-to-face moving cameras simultaneously. The work targets situations in which ego-motion estimation with a single camera may fail, and uses a second camera to provide additional information. Ego-motion estimation of a moving camera is the task of recovering the camera motion trajectory from a set of 2D image frames; it has many applications, including the stabilization required in our setting. Most existing methods address one of the following two cases. For static scenes, the problem of fitting a 3D scene compatible with the images is well understood and essentially solved [1,2]. The second case deals with dynamic scenes, where the segmentation into independently moving objects and the motion estimation for each object have to be solved simultaneously [3,4]. These methods may fail in camera ego-motion estimation if: (1) the foreground occupies too much space in the image, (2) there are insufficient features in the background
Fig. 1. Two image pairs captured by one wearable camera with a moving foreground. Left image pair: camera motion can be computed using the background region, which provides enough feature point matches. Right image pair: there are very few feature matches in the background region, so it is impossible to estimate ego-motion without additional information; moreover, the motion of the foreground point matches depends on both the camera motion and the person’s motion. If the foreground person’s motion is known, the camera ego-motion can be estimated.
Fig. 2. Two face-to-face cameras in our application of “Dive into Movie”. One camera is attached to the body of each person.
region of the image pair, or (3) there is too much repeated structure for features to be matched reliably. Fig. 1 shows a situation in which almost the whole image is covered by a moving foreground; it is impossible to estimate ego-motion in this case. Additional information can be utilized, such as inertial data [5] or synchronized image frames from another camera. When another camera is used, there are two possibilities. The first is that the additional camera is fixed somewhere, watching either person 1 or person 2. If it watches person 1, the motion of the wearable camera is estimated directly by pose estimation. If it watches person 2, the motion of person 2 is estimated first, and the camera motion is then obtained by eliminating that motion from the foreground motion observed by the wearable camera. In both cases, the fixed camera must always keep the moving person in view. The second possibility is that the additional camera is simply the one attached to the foreground object (i.e., the other person’s body). This configuration is very natural in our application (see Fig. 2). The motivation for this work comes from a new application of computer vision technology in entertainment, the so-called “Dive into Movie”. In this application, a movie captured by a wearable camera attached to an actor’s body gives audiences the sense of being immersed in the movie. The raw footage captured by the wearable camera needs to be stabilized because of jitter and the ego-motion of the actor, and accurate ego-motion estimation of a moving camera is not easy when there are moving objects in the
Fig. 3. Overview of the proposed approach at time k
image. In this application, there are at least two face-to-face interacting actors in a scene, and the audience can choose any one of the actors in order to watch the movie from different views. One camera is attached to each actor. For simplicity, in this paper we consider only the case of two actors in the scene. Our goal is then to recover the ego-motion of the two face-to-face cameras using information from both of them. To address this problem, we first configure the sensor system as two face-to-face moving cameras and then derive the relationship between the four views consisting of two consecutive views from each camera. In the estimation stage, the two cameras are calibrated first, and the ego-motion is then estimated using the calibration result. The calibration problem is formulated as AX = XB, and we draw on the solutions developed for traditional robotic hand-eye calibration [6,7,8,9]. Compared with the consistent motion of hand and eye in traditional hand-eye calibration, we deal with two independently moving cameras. To our knowledge, there is no other work reported on this problem. In [10], a similar configuration is proposed that uses two face-to-face static cameras; the epipolar geometry of these mutual cameras is studied and used to improve the performance of a structure-from-motion approach. In contrast to [10], our approach estimates the ego-motion of two moving face-to-face cameras. The flowchart of the proposed system is shown in Fig. 3. First, the input videos are pre-processed to segment out the background region and the object region that moves consistently with the opposing camera. SIFT features are extracted and matched between two consecutive images for the background region and the object region, respectively. If there are enough reliable point matches in the background region, the ego-motion is estimated and a stabilized frame is output. These steps are performed for both cameras. Second, if estimation with the background region fails, the algorithm proceeds to the synchronized estimation step, which has two stages: the extrinsic parameters of the two cameras are calibrated in the first stage, which requires at least three consecutive images from each camera, and the ego-motions are then estimated with the calibration result. The following section presents the two-camera geometry, Section 3 describes the estimation procedure, and the evaluation of the experiments is given in Section 4.
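As an illustration of the conventional branch of Fig. 3 (background SIFT matches followed by two-view motion estimation, Section 2.1), here is a small OpenCV sketch. It is a schematic stand-in rather than the authors’ implementation: the matcher settings (Lowe’s ratio test with a 0.75 threshold) and the RANSAC threshold are assumptions not specified in the paper, and the background mask is taken as given by the pre-processing step.

```python
import cv2
import numpy as np

def two_view_motion(img_prev, img_cur, K, bg_mask=None, min_matches=20):
    """Background SIFT matching + essential-matrix estimation (Section 2.1).
    K: 3x3 intrinsic matrix (assumed known); bg_mask optionally restricts
    feature detection to the segmented background region."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img_prev, bg_mask)
    kp2, des2 = sift.detectAndCompute(img_cur, bg_mask)
    if des1 is None or des2 is None:
        return None
    pairs = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
    good = [p[0] for p in pairs
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    if len(good) < min_matches:
        return None                      # fall back to synchronized estimation (Section 3)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    E, inl = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inl)
    return R, t                          # translation recovered up to scale
```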
2 Two-Camera Geometry

Our application of ego-motion estimation is stabilization for “Dive into Movie”. The cameras are affixed to the actors’ bodies and move consistently with them (see Fig. 2). First of all, it is convenient to assign frames of reference.
W: a fixed frame of reference;
C1(k): the camera1 frame, located at the optical center of camera1 with the positive z axis along the optical axis at time k; it is attached to person 1, watches camera2, and varies with camera1’s motion;
C2(k): the camera2 frame, located at the optical center of camera2 with the positive z axis along the optical axis at time k; it is attached to person 2, watches camera1, and varies with camera2’s motion.

The relation between any two coordinate frames is represented by a rotation matrix R_{a→b} ∈ SO(3) and a translation vector t_{a→b} ∈ R^3. The 4×4 matrix T_{a→b} = [R_{a→b} t_{a→b}; 0_{1×3} 1] is the transformation from frame a to frame b: if a point X_a is expressed with respect to frame a, then X_b = T_{a→b} X_a. We assume that the internal parameters of the cameras are known. Given enough correct feature matches in two views (of a static scene) captured by the same camera, the camera ego-motion can be computed easily. In the following, we first recall the two-view geometry of a conventional static scene; then foreground motion is taken into account; finally, the four-view (two from each camera) geometry is derived by 3D motion analysis on the two moving cameras.

2.1 Two-View Geometry: Epipolar Constraint and Essential Matrix
As is well known, the Essential matrix constrains the motion of points between two views from one camera; it encodes the epipolar constraint and the motion matrix. The set of homogeneous image points {x_i}, i = 1, ..., n, in the first image is related to the set {x′_i}, i = 1, ..., n, in the second image by the Essential matrix through

x′_i^T E x_i = 0,   E = T̂ R,   T̂ = [ 0  −t_3  t_2 ;  t_3  0  −t_1 ;  −t_2  t_1  0 ].   (1)
From the above equation, given feature matches between the two views, the Essential matrix can be determined, and the rotation matrix and translation vector can then be computed up to a universal scale. We used RANSAC [1] for the transformation matrix estimation.

2.2 Two-View Geometry with Moving Foreground
Let the set of homogeneous 3D space points {X_{F,i}(k)}, i = 1, ..., n, be the positions of foreground points at time k in the view of camera1, undergoing a rigid motion that is independent of camera1’s motion. The motion of these points in C1(k) can be represented as
X_{F_1,C_1}(k) = T_{F_1,C_1}(k) X_{F_1,C_1}(k−1) = T_{C_1}^{−1}(k) T_{C_1←W}(k−1) T_{F_1,W}(k) T_{W←C_1}(k−1) X_{F_1,C_1}(k−1),   (2)
where T_{F_1,C_1}(k) represents the 3D foreground motion in C1’s coordinates from time k−1 to k, T_{C_1}(k) is C1’s motion, and T_{F_1,W}(k) is the foreground motion in world coordinates. T_{F_1,C_1}(k) and T_{C_1}(k) can be computed with the two-view geometry described in Section 2.1, using feature matches in the foreground region and background region, respectively. If there are not enough background feature matches to compute T_{C_1}(k), but T_{F_1,W}(k) is given in some other way, then T_{C_1}(k) can be computed using Equation (2).

2.3 Four-View Geometry of Two Face-to-Face Cameras
In this case (see Fig. 2), the motion of camera1’s foreground points F1 in C2(k) coordinates is the same as C2’s motion T_{C_2}(k), i.e., T_{F_1,C_2}(k) = T_{C_2}(k). Then

T_{F_1,W}(k) = T_{W←C_2}(k−1) T_{F_1,C_2}(k) T_{C_2←W}(k−1) = T_{W←C_2}(k−1) T_{C_2}(k) T_{C_2←W}(k−1).   (3)
Now let us derive the relations among the four 3D motion transformation matrices T_{F_1,C_1}(k), T_{C_2}(k), T_{C_1}(k) and T_{F_2,C_2}(k). With these relations, given any three of the four matrices, the remaining unknown matrix can be computed; T_{C_2}(k) and T_{C_1}(k) are the target matrices in this paper. From Equations (2) and (3), we have

T_{F_1,C_1}(k) = T_{C_1}^{−1}(k) T_{C_1←W}(k−1) T_{W←C_2}(k−1) T_{C_2}(k) T_{C_2←W}(k−1) T_{W←C_1}(k−1) = T_{C_1}^{−1}(k) T_{C_1←C_2}(k−1) T_{C_2}(k) T_{C_1←C_2}^{−1}(k−1).

If we let T_{C_2←C_1} = T_{C2−1} (and likewise T_{C_1←C_2} = T_{C1−2}) for simplicity, then we have

T_{F_1,C_1}(k) = T_{C_1}^{−1}(k) T_{C1−2}(k−1) T_{C_2}(k) T_{C1−2}^{−1}(k−1).   (4)
Similarly, considering the foreground points of camera2, we obtain

T_{F_2,C_2}(k) = T_{C_2}^{−1}(k) T_{C2−1}(k−1) T_{C_1}(k) T_{C2−1}^{−1}(k−1).   (5)
Now let us check the relations between the above matrices and the image observations:
a) T_{F_1,C_1}(k): motion of the foreground points (belonging to person 2) in camera1;
b) T_{F_2,C_2}(k): motion of the foreground points (belonging to person 1) in camera2;
c) T_{C_2}(k): computed from the motion of the background points in camera2;
d) T_{C_1}(k): computed from the motion of the background points in camera1;
e) T_{C2−1}(k−1): extrinsic calibration matrix between camera1 and camera2.
Quantities a)–d) can be computed using the two-view relations described in Section 2.1; e) cannot be computed directly and is determined in Section 3.1.
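The relations above are easy to sanity-check numerically. The short NumPy snippet below (not from the paper) builds synthetic rigid motions, verifies relation (4), and shows how camera1’s ego-motion is recovered once the extrinsic matrix is known, which is precisely how Eq. (10) is used in Section 3.2.

```python
import numpy as np

def make_T(R, t):
    """4x4 homogeneous transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

rng = np.random.default_rng(0)
T_C1   = make_T(rot_z(0.10), rng.normal(size=3))   # camera1 ego-motion T_C1(k)
T_C2   = make_T(rot_z(-0.05), rng.normal(size=3))  # camera2 ego-motion T_C2(k)
T_C1_2 = make_T(rot_z(1.00), rng.normal(size=3))   # extrinsic T_{C1<-C2}(k-1)

# Relation (4): foreground motion seen by camera1
T_F1_C1 = np.linalg.inv(T_C1) @ T_C1_2 @ T_C2 @ np.linalg.inv(T_C1_2)

# Given the other three matrices, camera1's motion follows by rearranging (4):
T_C1_rec = T_C1_2 @ T_C2 @ np.linalg.inv(T_C1_2) @ np.linalg.inv(T_F1_C1)
assert np.allclose(T_C1_rec, T_C1)
```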
3 Synchronized Estimation

Recall the overview of the algorithm in Fig. 3. The synchronized estimation stage is divided into two steps: extrinsic calibration using the frames at times k−3, k−2 and k−1, and motion estimation using the frames at times k−1 and k.

3.1 Extrinsic Calibration of Two Face-to-Face Cameras
First we present an outline of our calibration procedure; the details of each step are given below. The extrinsic calibration of the two cameras is broken down into the following steps: a) for times k−3, k−2 and k−1, compute T_{F_1,C_1}, T_{F_2,C_2}, T_{C_2} and T_{C_1} with the steps of Section 2.3; b) compute the extrinsic matrix T_{C1−2}(k−1) (equivalently T_{C2−1}(k−1) = T_{C1−2}^{−1}(k−1)) from Equations (6)–(9) below. To obtain a unique solution, at least three views from one camera are necessary [6], while avoiding special configurations of view angles. In the following equations, all matrices other than the extrinsic matrix T_{C1−2} can be calculated from the image observations and are treated as known. Using Equation (4) at time k−1, and noting that the extrinsic matrix evolves as T_{C1−2}(k−1) = T_{C_1}^{−1}(k−1) T_{C1−2}(k−2) T_{C_2}(k−1), we obtain

T_{F_1,C_1}(k−1) = T_{C1−2}(k−1) T_{C_2}(k−1) T_{C1−2}^{−1}(k−1) T_{C_1}^{−1}(k−1),   (6)

or, equivalently, the AX = XB form

T_{F_1,C_1}(k−1) T_{C_1}(k−1) T_{C1−2}(k−1) = T_{C1−2}(k−1) T_{C_2}(k−1).   (7)

Applying (6) at time k−2, together with T_{C1−2}(k−2) = T_{C_1}(k−1) T_{C1−2}(k−1) T_{C_2}^{−1}(k−1), gives

T_{F_1,C_1}(k−2) = T_{C_1}(k−1) T_{C1−2}(k−1) T_{C_2}^{−1}(k−1) T_{C_2}(k−2) T_{C_2}(k−1) T_{C1−2}^{−1}(k−1) T_{C_1}^{−1}(k−1) T_{C_1}^{−1}(k−2),   (8)

and hence a second AX = XB form,

T_{C_1}^{−1}(k−1) T_{F_1,C_1}(k−2) T_{C_1}(k−2) T_{C_1}(k−1) T_{C1−2}(k−1) = T_{C1−2}(k−1) T_{C_2}^{−1}(k−1) T_{C_2}(k−2) T_{C_2}(k−1).   (9)
In the estimation of the extrinsic motion, we decompose T into R and t. The problem can then be simplified to computing the X that satisfies AX = XB in Equations (7) and (9) for X = R_{C1−2}(k−1); t_{C1−2}(k−1) can easily be obtained from R_{C1−2}(k−1) and Equations (7) and (9). Here, both A and B are known, and X is the unknown to be solved for. While solutions to this question have been studied when A and B are general n × n matrices, here we need solutions that belong to the Euclidean group. In the context of robot sensor calibration, [6] first motivated this equation and provided a closed-form solution. Their approach is based on geometric interpretations
of the eigenvalues and eigenvectors of a rotation matrix; both translation and orientation are calculated simultaneously using least-squares fitting. [7] used this formulation of the problem and developed a non-linear optimization technique to solve it. Park and Martin [8] derived a closed-form solution as a linear least-squares fit. [9] formulated the problem using canonical coordinates of the rotation group, which enables a particularly simple closed-form solution. In [6], conditions for the uniqueness of solutions are discussed: the solution cannot be found from only one measurement, and the parameters can be uniquely estimated from two relative camera motions, provided the rotation angles are neither zero nor π. In this paper, we use the approach described in [8].

3.2 Ego-Motion Estimation
Given T_{F_1,C_1}(k), the motion of the foreground points in the view of camera1, T_{C_2}(k), the motion of the background points in the view of camera2, and T_{C1−2}(k−1), obtained in Section 3.1,
T_{C_1}(k) is computed using Equation (4):

T_{F_1,C_1}(k) = T_{C_1}^{−1}(k) T_{C1−2}(k−1) T_{C_2}(k) T_{C1−2}^{−1}(k−1).   (10)
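For concreteness, here is a hedged sketch of the calibration step. Pairs (A_i, B_i) are formed from Equations (7) and (9), e.g. A = T_{F_1,C_1}(k−1) T_{C_1}(k−1) and B = T_{C_2}(k−1); the rotation part of X is recovered with an SVD/Procrustes step on the rotation logarithms, which coincides with the closed form of Park and Martin [8] for consistent data, and the translation follows from a linear least-squares problem. This is a schematic reimplementation, not the authors’ code.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def solve_ax_xb(As, Bs):
    """Closed-form AX = XB calibration in the spirit of Park and Martin [8].
    As, Bs: lists of 4x4 homogeneous motions built from Eqs. (7) and (9)."""
    alphas = [Rotation.from_matrix(A[:3, :3]).as_rotvec() for A in As]
    betas = [Rotation.from_matrix(B[:3, :3]).as_rotvec() for B in Bs]
    # rotation: minimize sum ||alpha_i - R_X beta_i||^2 (orthogonal Procrustes)
    H = sum(np.outer(b, a) for a, b in zip(alphas, betas))
    U, _, Vt = np.linalg.svd(H)
    R_X = Vt.T @ np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)]) @ U.T
    # translation: stack (R_Ai - I) t_X = R_X t_Bi - t_Ai and solve least squares
    C = np.vstack([A[:3, :3] - np.eye(3) for A in As])
    d = np.concatenate([R_X @ B[:3, 3] - A[:3, 3] for A, B in zip(As, Bs)])
    t_X = np.linalg.lstsq(C, d, rcond=None)[0]
    X = np.eye(4)
    X[:3, :3], X[:3, 3] = R_X, t_X
    return X

# Once X = T_C1-2(k-1) is known, camera1's ego-motion follows from Eq. (10):
#   T_C1(k) = X @ T_C2(k) @ np.linalg.inv(X) @ np.linalg.inv(T_F1C1_k)
```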
4 Evaluations

Both simulated and real data are used for the evaluations. With synthetic data, we check the accuracy of the approach and its sensitivity to various levels of noise. With real data, the procedure outlined in Fig. 3 is implemented along with the proposed calibration and estimation approach; furthermore, stabilization results using the estimated ego-motion matrices are shown to demonstrate the feasibility and accuracy of the approach.

4.1 Evaluations with Simulated Data
The simulated data were created using a set of known 3D points and transformations. The transformations between the two cameras and the ego-motions of both cameras were constructed with random rotation axes, angles and translation vectors. In order to analyze the influence of noise, data sets were generated with Gaussian noise of three different levels added to the 2D pixel points. The resulting error in the calibration transformation is plotted in Fig. 4. The error in the final estimated transformation matrix, or residual error, is defined as ‖T_true − T_estimated‖_F, where ‖·‖_F is the Frobenius norm of the matrix. Referring to the results (Figs. 4 and 5), some interesting observations can be made. The proposed approach produces results with low error. In Fig. 5, we also show the residual error resulting from noise in the calibration matrix; this noise does not greatly affect the final estimation result, presumably because the error introduced in the calibration stage is largely cancelled by an inverse computation.
Fig. 4. Residual error of the calibration result (residual error vs. data set number) for different levels of Gaussian noise (variance 1, 2 and 4 pixels) added to the 2D image points
Fig. 5. Residual error of the estimated motion matrices (residual error vs. data set number) for different levels of noise (5, 10, 20 and 45 degrees of rotation angle error) on the calibration matrix
4.2 Experimental Results with Real Data
For the real experiments, recall the overview of the algorithm in Fig. 3. We used real video data with the cameras affixed to the actors’ bodies. Before estimation, the synchronized input videos from both cameras are pre-processed to segment out the background region and the object region. In this step, color-distribution-based mean-shift region tracking [11] is used for the object region. SIFT features [12] are extracted and matched between two consecutive images for the background region and the object, respectively; Fig. 6 shows the resulting SIFT feature matches. The synchronized estimation step has two stages. In the first stage, the extrinsic parameters of the two cameras are calibrated from a total of four image pairs (two for each camera, using data at three time steps) with the method of Section 3.1; the two-view transformation matrices for the foreground and background regions are computed using RANSAC [1], and the calibration matrix is computed using the approach described in [8]. The ego-motion is then estimated using the method of Section 3.2.
Fig. 6. One set of data used for the transformation computations (A and B) in the calibration stage. Left column, top to bottom: point matches in the background region of video2; point matches in the foreground region of video2; point matches in the background region of video1. Right column, top to bottom: stabilization result using the background region of video2; stabilization result using the foreground region of video2; stabilization result using the background region of video1.
Fig. 7. Stabilization results. Top left: point matches in the background region of video2; there are not enough features, and the conventional approach failed in this case. Top right: original image before the motion. Bottom: stabilization result using the proposed approach. Compared with the original image at the top right, our approach provides an acceptable stabilization result.
Finally, a 2D affine transformation is derived from the motion matrix for stabilization, considering only the effect of rotation: x′ = sRx, where we set s = 1/R_33 for simplicity; x and x′ are the homogeneous image points before and after the motion. Since the main purpose of this paper is ego-motion recovery, stabilization has not been treated carefully and is left as future work. Fig. 7 shows that stabilization using only the background region or only the foreground region fails, whereas the stabilization result with the proposed approach, compared with the original image, is acceptable.
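The warp itself is straightforward; the following OpenCV sketch applies the rotation-only correction x′ = sRx described above. It is an illustrative, assumption-laden sketch rather than the authors’ implementation: whether R or its inverse is applied depends on the direction convention of the estimated motion, and conjugating by the intrinsics K (the usual infinite-homography form) is offered as an option the paper does not spell out.

```python
import numpy as np
import cv2

def stabilize_frame(frame, T_cam, K=None):
    """Rotation-only stabilization sketch: x' = s R x with s = 1/R_33.
    T_cam: estimated 4x4 camera motion. If the intrinsics K are supplied, the
    rotation is conjugated by K; otherwise R is applied to homogeneous pixel
    coordinates directly, as written in the text. Use np.linalg.inv(T_cam)
    instead if the motion convention requires undoing the estimated motion."""
    R = T_cam[:3, :3]
    H = R / R[2, 2]                      # s = 1 / R_33
    if K is not None:
        H = K @ H @ np.linalg.inv(K)
    h, w = frame.shape[:2]
    return cv2.warpPerspective(frame, H, (w, h))
```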
5 Conclusion

Accurate estimation of ego-motion is difficult when there is a moving foreground, and in some situations it is almost impossible. To address this problem, we proposed a new approach that utilizes an additional video captured by a camera attached to the foreground object (i.e., another actor in our application). We first configure the sensor system as two face-to-face moving cameras and then derive the relationship between four views from the two cameras. In the estimation stage, the two cameras are calibrated first, with the extrinsic relationship expressed as an AX = XB problem, and the ego-motion is then estimated. Experiments with simulated and real data verify that this approach can provide acceptable ego-motion estimation and stabilization results.
Acknowledgment. This work was supported in part by the NKBRPC (No. 2006CB303100), NSFC Grant (No. 60333010), NSFC Grant (No. 60605001) and the Key Grant Project of the Chinese Ministry of Education (No. 103001).
References 1. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000) 2. Faugeras, O., Luong, Q.T., Papadopoulo, T.: The geometry of multiple images. MIT Press, Cambridge (2001) 3. Schindler, K., Suter, D.: Two-view multibody structure-and-motion with outliers through model selection. IEEE T-PAMI 28(6), 983–995 (2006) 4. Wolf, L., Shashua, A.: Two-body segmentation from two perspective views. In: Proc. CVPR, pp. 263–270 (2001) 5. Makadia, A., Daniilidis, K.: Correspondenceless Ego-Motion Estimation Using an IMU. In: Proceedings of the IEEE International Conference on Robotics and Automation (2005) 6. Shiu, Y.C., Ahmad, S.: Calibration of wrist-mounted robotic sensors by solving homogenous transform equations of the form AX = XB. IEEE Transactions on Robotics and Automation 5(1), 16–29 (1989) 7. Li, M.: Kinematic calibration of an active head-eye system. IEEE Transactions on Robotics and Automation 14(1), 153–157 (1998)
554
J. Cui et al.
8. Park, F.C., Martin, B.J.: Robot sensor calibration: Solving AX = XB on the Euclidean group. IEEE T-RA 10(5), 717–721 (1994) 9. Neubert, J., Ferrier, N.J.: Robust active stereo calibration. In: Proceedings of the IEEE International Conference on Robotics and Automation, vol. 3, pp. 2525–2531 (2002) 10. Sato, J.: Recovering Multiple View Geometry from Mutual Projections of Multiple Cameras. Int. J. Comput. Vision 66(2), 123–140 (2006) 11. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-Based Object Tracking. IEEE Trans. Pattern Analysis Machine Intell. 25(5), 564–575 (2003) 12. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
Optical Flow–Driven Motion Model with Automatic Variance Adjustment for Adaptive Tracking

Kazuhiko Kawamoto

Kyushu Institute of Technology, 1-1 Sensui-cho, Tobata-ku, Kitakyushu 804-8550, Japan
[email protected]
Abstract. We propose a statistical motion model for sequential Bayesian tracking, called the optical flow–driven motion model, and present an adaptive particle filter algorithm based on it. The model predicts the current state with the help of optical flows, i.e., it explores the state space using information derived from the current and previous images of an image sequence. In addition, we introduce an automatic method for adjusting the variance of the motion model, a parameter that is set manually in most particle filters. In experiments with synthetic and real image sequences, we compare the proposed motion model with a random walk model, which is widely used for tracking, and show that the proposed model outperforms the random walk model in terms of accuracy even though their execution times are almost the same.
1 Introduction
Particle filters [1] have proven to be a powerful and popular tool for visual tracking. One strength of particle filters is the ability to deal with a wide range of statistical models in sequential Bayesian estimation. In particle filters, a filtering distribution is approximately represented by a finite number of weighted samples, referred to as particles, and is updated by propagating the particles through time. The probability distribution most commonly used for propagation is a prior model [2,3,4], which describes the state dynamics. However, this often gives a poor estimate if unexpected motions occur, because the model explores the state space without any additional information on the current state. For adaptive particle propagation, we propose a statistical motion model which predicts the current state with the help of sparse optical flows. We call it the optical flow–driven motion model. With current computing power, the real-time computation of sparse optical flows has become possible even when sophisticated methods, such as robust estimation and hierarchical search, are employed; hence, constructing the motion model is not expensive. In addition, we introduce a method for adjusting the variance of the motion model. The variance affects the robustness to unexpected motions and the accuracy of the Monte Carlo approximation. This adjustment is based on an error propagation technique [5] and is fully automatic, i.e., manual setting of the variance is not required.
This motion model becomes more effective in combination with observation models based on global image features, such as color-histogram-based models [6], because optical flow and such global image features are complementary. In the experiments, we implement the particle filter algorithm using a color-histogram-based observation model. This paper is organized as follows. In Section 2, we review particle filters and related work. In Section 3, we propose the optical flow–driven motion model and show how to construct it. In Section 4, we show experimental results with synthetic and real image sequences.
2 Related Works
Visual tracking can be formulated as the problem of estimating recursively in time the filtering distribution p(x_t | y_{1:t}) of the state x_t, given the sequence of observations y_{1:t} ≡ {y_k | k = 1, 2, ..., t}. In the context of visual tracking, x_t represents the state of a target, such as the position, the rotational angle, and the velocity, and y_t might be an intensity pattern, feature points, or a histogram. The states x_k, k = 1, 2, ..., t, are assumed to be Markovian given an initial distribution p(x_0) and a transition distribution p(x_t | x_{t−1}). The observations y_k, k = 1, 2, ..., t, are conditionally independent with distribution p(y_k | x_k) given the state x_k. A recursive estimation of the filtering distribution p(x_t | y_{1:t}) can be achieved by iterating two steps:

prediction: p(x_t | y_{1:t−1}) = \int p(x_t | x_{t−1}) p(x_{t−1} | y_{1:t−1}) dx_{t−1},   (1)
filtering: p(x_t | y_{1:t}) ∝ p(y_t | x_t) p(x_t | y_{1:t−1}).   (2)
For linear–Gaussian state space models, the Kalman filter gives an analytical solution to this recursive estimation. However, for more general models, including the model studied in this paper, analytical evaluation is impossible. Particle filters give a numerical solution to the recursive estimation with Monte Carlo methods. The basic idea of particle filtering is to approximately represent the filtering distribution in the pointwise form p(x_t | y_{1:t}) ≈ \sum_{i=1}^{N} w_t^{(i)} δ(x_t − x_t^{(i)}), with \sum_{i=1}^{N} w_t^{(i)} = 1, where δ(·) denotes the Dirac delta function, x_t^{(i)} is an independent and identically distributed ith random sample at time t, called a particle, drawn from p(x_t | y_{1:t}), and w_t^{(i)} is the normalized weight associated with x_t^{(i)}. There are several sampling methods for the prediction step in the recursive estimation. One of the most widely used is to draw particles x_t^{(i)}, i = 1, ..., N, from a prior model p(x_t | x_{t−1}) [2,3,4]. As prior models, “smooth” motion models such as the random walk and constant velocity models
x_{t+1} = x_t + v_t,   (3)
x_{t+1} = 2x_t − x_{t−1} + v_t   (4)
are often used for visual tracking, where v_t is a white Gaussian noise. Although the motion models in Eqs. (3) and (4) work well in many situations, they cannot, by their nature, deal with rapid changes in the target’s motion. Thus, if a rapid motion change happens, the particles drawn from such a smooth model can give a poor Monte Carlo approximation of the filtering distribution. To improve the Monte Carlo approximation, adaptive sampling methods can be useful. The auxiliary particle (AP) filter [7], the self-organizing state space model, and the sequential importance sampling (SIS) filter [8,9] (and their variants) include mechanisms for adaptive sampling. The AP filter [7] draws particles from

q(x_t, i | y_{1:t}) ∝ w_{t−1}^{(i)} p(y_t | μ_t^{(i)}) p(x_t | x_{t−1}^{(i)}),   (5)

where μ_t^{(i)} is some characterization of x_t given x_{t−1}^{(i)}. This distribution is conditioned on the current observation y_t, so it adapts to y_t. This adaptation leads to a more efficient sampling method, but if p(x_t | x_{t−1}^{(i)}) is far from the true transition, q(x_t, i | y_{1:t}) can also be poor. The self-organizing state space model [10,11] builds an augmented state vector x̄_t ≡ (x_t θ_t), where θ_t collects unknown parameters of the system, called hyper-parameters. The dynamics of the augmented state can be decomposed into

p(x̄_t | x̄_{t−1}) = p(x_t | x_{t−1}, θ_t) p(θ_t | θ_{t−1}),   (6)
assuming p(x_t | x_{t−1}, θ_t, θ_{t−1}) = p(x_t | x_{t−1}, θ_t) and p(θ_t | x_{t−1}, θ_{t−1}) = p(θ_t | θ_{t−1}). Particles are therefore drawn from a distribution conditioned on the hyper-parameter θ_t. In particular, θ_t can be taken to be the variance of the prior model; in this case, each particle x_t^{(i)} is diffused based on its own variance θ_t^{(i)}. Since particles with large variances are expected to diffuse widely in the state space, such particles are likely to capture rapid motion changes. As a result, the self-organizing state space model is robust to unexpected motions. Although the model is a flexible approach for adaptive sampling, it doubles the dimensionality of the state space, which makes sampling less efficient because more particles are required. The SIS filter [8,9] draws particles from a proposal distribution q(x_t | x_{1:t−1}, y_{1:t}). This distribution includes the prior model p(x_t | x_{t−1}) and Eq. (5) as special cases. The optimal proposal distribution is

p(x_t | x_{t−1}^{(i)}, y_t) ∝ p(y_t | x_t) p(x_t | x_{t−1}^{(i)}).   (7)
In visual tracking, this optimal distribution is often not available because p(y_t | x_t) is unknown in most cases. Hence an alternative proposal distribution is necessary. ICondensation [12] constructs a proposal distribution by detecting skin color regions in the image for hand tracking. However, such specific knowledge about objects is not always available for a wide range of objects.
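To fix notation for what follows, here is a minimal sketch (not tied to any particular method cited above) of one sampling-importance-resampling step implementing the recursion (1)–(2); `propagate` and `likelihood` are placeholders for a transition model and an observation model supplied by the caller, and the random-walk prior of Eq. (3) is given as an example propagation.

```python
import numpy as np

def particle_filter_step(particles, weights, propagate, likelihood, rng):
    """One SIR update: prediction via the transition model (Eq. (1)),
    reweighting by the observation likelihood (Eq. (2)), then resampling.
    propagate(particles, rng) draws from p(x_t | x_{t-1});
    likelihood(particles) evaluates p(y_t | x_t) for each particle."""
    particles = propagate(particles, rng)          # prediction
    weights = weights * likelihood(particles)      # filtering (unnormalized)
    weights = weights / (weights.sum() + 1e-300)
    n = len(weights)
    if 1.0 / np.sum(weights ** 2) < n / 2:         # resample when the ESS is low
        idx = rng.choice(n, size=n, p=weights)
        particles, weights = particles[idx], np.full(n, 1.0 / n)
    return particles, weights

def random_walk(particles, rng, sigma=3.0):
    """The random-walk prior of Eq. (3) as an example propagation model."""
    return particles + rng.normal(scale=sigma, size=particles.shape)
```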
3 Optical Flow-Driven Motion Model
We propose a statistical motion model, formally expressed by

x_t = f(x_{t−1}, u_t, v_t),   v_t ∼ N(0, Σ_v),   (8)
where v_t is a white Gaussian noise with mean 0 and covariance matrix Σ_v. Unlike the smooth models in Eqs. (3) and (4), this model includes a term u_t which is estimated from sparse optical flows between two successive images. The term u_t helps the particles to capture the object of interest by guiding them based on the current and previous images. We call this the optical flow–driven motion model, because the state is mainly driven by optical flows. In the experiments, the object of interest is represented by its bounding box (more general object models are also possible). The reference bounding box to be tracked is specified by the user at time 0. The bounding box is modeled by (w, h, t_x, t_y), where w and h are the width and the height of the bounding box, respectively, and (t_x, t_y) is its center. The state vector at time t is therefore defined by x_t = (w_t, h_t, t_{xt}, t_{yt})^T. The covariance matrix Σ_v is assumed to be a diagonal matrix Σ_v = diag(σ_{v,w}^2, σ_{v,h}^2, σ_{v,tx}^2, σ_{v,ty}^2) in what follows. In this setting, Eq. (8) can be specified by

x_t = ( (u_{wt} + v_{wt}) w_{t−1},  (u_{ht} + v_{ht}) h_{t−1},  t_{x,t−1} + u_{tx,t} + v_{tx,t},  t_{y,t−1} + u_{ty,t} + v_{ty,t} )^T,   (9)

where v_t = (v_{wt}, v_{ht}, v_{tx,t}, v_{ty,t})^T and u_t = (u_{wt}, u_{ht}, u_{tx,t}, u_{ty,t})^T, the latter estimated from optical flows. The underlying motion in Eq. (9) is an anisotropic similarity transformation, which includes planar rigid transformations. Since the elements of u_t change at every time step, the model in Eq. (9) is adaptively updated.
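In code, drawing particles from (9) is a one-liner per component; the sketch below (an illustrative reimplementation, not the author’s code) propagates an (N, 4) array of bounding-box states given an estimate of u_t and per-component noise standard deviations.

```python
import numpy as np

def propagate_state(particles, u, sigma_v, rng):
    """Draw particles from the optical flow-driven model of Eq. (9).

    particles : (N, 4) array of states (w, h, tx, ty)
    u         : (u_w, u_h, u_tx, u_ty) estimated from optical flows, Eq. (12)
    sigma_v   : per-component noise std. dev. (adjusted as in Section 3.2)"""
    v = rng.normal(scale=sigma_v, size=particles.shape)
    out = np.empty_like(particles)
    out[:, 0] = (u[0] + v[:, 0]) * particles[:, 0]   # width:    multiplicative update
    out[:, 1] = (u[1] + v[:, 1]) * particles[:, 1]   # height:   multiplicative update
    out[:, 2] = particles[:, 2] + u[2] + v[:, 2]     # centre x: additive update
    out[:, 3] = particles[:, 3] + u[3] + v[:, 3]     # centre y: additive update
    return out
```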
3.1 Robust Estimation of u_t from Optical Flows
The estimation of u_t from optical flows is a common problem in computer vision. Let r_{αt} = (x_{αt}, y_{αt})^T, α = 1, ..., M,¹ denote the αth feature point in the image at time t, and assume that r_{αt} is a feature point of the object. The geometrical relation between r_{α,t−1} and r_{αt} is written as

r_{αt} = D_t r_{α,t−1} + t_t,  where  D_t = [ u_{wt}  0 ;  0  u_{ht} ]  and  t_t = (u_{tx,t}, u_{ty,t})^T.   (10)

Therefore, if more than one pair of corresponding points, i.e., r_{αt} ↔ r_{α,t−1}, α = 1, 2, ..., is available, the least-squares estimate of u_t is obtained by solving the optimization problem

\sum_{α=1}^{M} ‖ r_{αt} − (D_t r_{α,t−1} + t_t) ‖^2 → min.   (11)

¹ The number of feature points M may vary from time to time, but the time subscript is suppressed for the sake of brevity.
The solution to Eq. (11) is calculated as

u_{wt} = (1/Δ_x) [ M \sum_{α=1}^{M} x_{α,t−1} x_{αt} − \sum_{α=1}^{M} x_{α,t−1} \sum_{α=1}^{M} x_{αt} ],
u_{ht} = (1/Δ_y) [ M \sum_{α=1}^{M} y_{α,t−1} y_{αt} − \sum_{α=1}^{M} y_{α,t−1} \sum_{α=1}^{M} y_{αt} ],
u_{tx,t} = (1/Δ_x) [ \sum_{α=1}^{M} x_{α,t−1}^2 \sum_{α=1}^{M} x_{αt} − \sum_{α=1}^{M} x_{α,t−1} \sum_{α=1}^{M} x_{α,t−1} x_{αt} ],
u_{ty,t} = (1/Δ_y) [ \sum_{α=1}^{M} y_{α,t−1}^2 \sum_{α=1}^{M} y_{αt} − \sum_{α=1}^{M} y_{α,t−1} \sum_{α=1}^{M} y_{α,t−1} y_{αt} ],   (12)
where

Δ_x = M \sum_{α=1}^{M} x_{α,t−1}^2 − ( \sum_{α=1}^{M} x_{α,t−1} )^2,   Δ_y = M \sum_{α=1}^{M} y_{α,t−1}^2 − ( \sum_{α=1}^{M} y_{α,t−1} )^2.   (13)
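The closed form (12)–(13) amounts to two independent simple linear regressions (current x on previous x, current y on previous y); a direct NumPy transcription is given below as an illustrative sketch, assuming matched inlier points are already available as (M, 2) arrays.

```python
import numpy as np

def estimate_u(prev_pts, cur_pts):
    """Closed-form least-squares estimate of u_t = (u_w, u_h, u_tx, u_ty),
    Eqs. (12)-(13), from matched feature points (inliers assumed)."""
    x0, y0 = prev_pts[:, 0], prev_pts[:, 1]
    x1, y1 = cur_pts[:, 0], cur_pts[:, 1]
    M = len(x0)
    dx = M * (x0 ** 2).sum() - x0.sum() ** 2            # Delta_x, Eq. (13)
    dy = M * (y0 ** 2).sum() - y0.sum() ** 2            # Delta_y, Eq. (13)
    u_w = (M * (x0 * x1).sum() - x0.sum() * x1.sum()) / dx
    u_h = (M * (y0 * y1).sum() - y0.sum() * y1.sum()) / dy
    u_tx = ((x0 ** 2).sum() * x1.sum() - x0.sum() * (x0 * x1).sum()) / dx
    u_ty = ((y0 ** 2).sum() * y1.sum() - y0.sum() * (y0 * y1).sum()) / dy
    return np.array([u_w, u_h, u_tx, u_ty])
```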
In practice a proportion of feature points, r αt , α = 1, . . . , M, may be outliers, i.e., some of them may be feature points of background or other objects. Since the least-squares estimate to eq. (12) is sensitive to outliers, a robust method for estimating ut is necessary. In order to remove outliers we employ the RANSAC (Random Sample Consensus) algorithm [13]. After removing outliers, we can obtain the least-squares estimate recalculated from only inliers. For RANSAC, an important parameter is the distance threshold d which is used to classify a given data into inliers and outliers. If the measurement error deviation σr , a of rαt is isotropic and Gaussian with zero mean and standard √ reasonable choice of the threshold is d = χ22 (0.95) σr ≈ 5.99 σr , where χ2m (α) is the α × 100 percentile of the χ2 distribution with m degrees of freedom. In practice σr is empirically determined because it is usually unknown (we set σr = 3 in the experiments). In what follows we assume the outliers are removed and all of r αt , α = 1, . . . , M, are inliers for simple notation. 3.2
3.2 Automatic Variance Adjustment by Error Propagation
The variance Σ_v of the stochastic term v_t in eq. (9) is important for an appropriate diffusion of particles. If the variance is too large, most of the particles may not contribute to the approximation of the filtering distribution, which results in inefficient sampling. If the variance is too small, the particles may not accurately capture the characteristics of the filtering distribution because of a loss of particle diversity. In order to adjust Σ_v automatically, we assume that the uncertainty of the model in eq. (8) is almost the same as that of u_t, i.e., Σ_v ≈ Σ_u, where Σ_u is the covariance matrix of u_t and is assumed to be a diagonal matrix Σ_u = diag(σ²_{u,w}, σ²_{u,h}, σ²_{u,tx}, σ²_{u,ty}). In fact we take the variance of v_t to be that of u_t. We
estimate the variance of u_t from optical flows using an error propagation technique [5]. If optical flows are inaccurate, the variance of u_t increases accordingly, and vice versa. The error propagation is generally based on the relation
σ²_θ = (∂f/∂x)² σ²_x,   (14)

where x and θ are related by θ = f(x) and σ²_x, σ²_θ are the variances of x and θ. From eqs. (12) and (14), we estimate the variances of u_t by

σ̂²_{u,w} = (M/Δ_x) σ²_r,   σ̂²_{u,h} = (M/Δ_y) σ²_r,
σ̂²_{u,tx} = (1/Δ_x) Σ_{α=1}^{M} x²_{αt−1} σ²_r,   σ̂²_{u,ty} = (1/Δ_y) Σ_{α=1}^{M} y²_{αt−1} σ²_r.   (15)

The variance σ²_r of r_{αt} is usually unknown, but an unbiased estimate of σ²_r is calculated by

σ̂²_r = ε̂² / (2M − 2),   (16)

where ε̂² is the sum of squared residuals of eq. (11). We therefore obtain the variances σ̂²_{u,w}, σ̂²_{u,h}, σ̂²_{u,tx}, σ̂²_{u,ty} by substituting σ̂²_r from eq. (16) for σ²_r in eq. (15).
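A minimal sketch of this variance adjustment, assuming the inlier correspondences and the estimated motion u from the previous step, is given below; function and variable names are illustrative.

```python
import numpy as np

def adjust_variance(prev_pts, curr_pts, u):
    """Estimate the diagonal of Sigma_u ~ Sigma_v from eqs. (15)-(16)."""
    x0, y0 = prev_pts[:, 0], prev_pts[:, 1]
    m = len(x0)
    dx = m * np.sum(x0 ** 2) - np.sum(x0) ** 2
    dy = m * np.sum(y0 ** 2) - np.sum(y0) ** 2
    # residual sum of squares of eq. (11)
    pred = np.column_stack([u[0] * x0 + u[2], u[1] * y0 + u[3]])
    eps2 = np.sum((curr_pts - pred) ** 2)
    sigma_r2 = eps2 / (2 * m - 2)                     # unbiased estimate, eq. (16)
    var_w  = m / dx * sigma_r2                        # eq. (15)
    var_h  = m / dy * sigma_r2
    var_tx = np.sum(x0 ** 2) / dx * sigma_r2
    var_ty = np.sum(y0 ** 2) / dy * sigma_r2
    return np.array([var_w, var_h, var_tx, var_ty])   # diagonal of Sigma_u
```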
3.3 Redetection of Feature Points
The number of feature points r_{αt}, α = 1, ..., M, decreases over time, because tracking of some feature points may fail and RANSAC may remove some feature points as outliers. With a small number of feature points, the estimates in eq. (12) may be inaccurate. In the worst case, no feature points may remain in the image, making it impossible to estimate u_t; the proposed tracking algorithm would then no longer work. Hence the redetection of feature points is necessary. To this end, feature points are redetected within the bounding box corresponding to the mode of the filtering distribution, calculated as x̂_t = argmax_{x_t^{(i)}} p(y_t | x_t^{(i)}), whenever the number of feature points falls below a threshold (set to 10 in the experiments).
4 Experiments with Synthetic and Real Image Sequences
We compare the performance of the proposed motion model with a prior model using synthetic and real image sequences. Specifically, the random walk model in eq. (3) is used as the prior model in both experiments. The elements of the noise v in the random walk model are assumed to be independent white Gaussian noises with variances σ²_{v,w} = 0.01 w_0, σ²_{v,h} = 0.01 h_0, σ²_{v,tx} = σ²_{v,ty} = 3.0² (pixel), where w_0 and h_0 are the width and height of the reference bounding box, respectively.
The observations are taken to be the normalized histograms of the bounding box in the RGB channels, denoted by y = (hR , hG , hB ). Then the observation model is defined as
p(y_t | x_t) ∝ exp( − [ B²(h^R, h^R_r) + B²(h^G, h^G_r) + B²(h^B, h^B_r) ] / (2σ²) ),   (17)

where (h^R_r, h^G_r, h^B_r) are the normalized histograms of the reference bounding box and B(h, h′) is the Bhattacharyya distance between normalized histograms with N_h bins, defined as B²(h, h′) = 1 − Σ_{i=1}^{N_h} sqrt(h_i h′_i). In both experiments, the parameter σ in eq. (17) is set to σ = 0.001. The number of particles is set to 200. The two particle filters are implemented on a computer with a Pentium 4 (3.4 GHz) CPU and 2 GB of main memory.
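For reference, a small sketch of this observation model, assuming pre-computed normalized RGB histograms, follows; it is one possible reading of eq. (17), not the authors' code.

```python
import numpy as np

def bhattacharyya_sq(h, h_ref):
    """Squared Bhattacharyya distance between two normalized histograms."""
    return 1.0 - np.sum(np.sqrt(h * h_ref))

def observation_likelihood(hist_rgb, ref_rgb, sigma=0.001):
    """Likelihood p(y_t | x_t) of eq. (17), up to a normalizing constant.
    hist_rgb, ref_rgb: lists of three normalized histograms (R, G, B)."""
    d2 = sum(bhattacharyya_sq(h, r) for h, r in zip(hist_rgb, ref_rgb))
    return np.exp(-d2 / (2.0 * sigma ** 2))
```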
4.1 Synthetic Example
The purpose of this example is to quantitatively evaluate the performance difference between the proposed model and the prior model, for which ground truth is necessary. We therefore generate the synthetic image sequence by transforming the “Lena” image in Fig. 1 (left) with the motion parameters in Fig. 1 (right). The white bounding box in Fig. 1 (left) is a manually selected region at time 0 of size 90 × 100 pixels, and the feature points within the box are used for tracking the region with the proposed model. The feature points are detected by the Harris detector [14] and tracked by the Lucas–Kanade–Tomasi tracker [15] with a pyramidal implementation [16]. We perform 100 simulations and evaluate the root mean squared error (RMSE) between the mean estimate of the state, calculated as x̄_t = (1/N) Σ_{i=1}^{N} x_t^{(i)}, and the ground truth. In Fig. 2 the RMSEs of the two models are presented. The results show that the RMSEs of the prior model increase sharply at time 6. This increase is caused by the large displacement of the target region from time 5 to 6 ((305, 215) → (320, 200)), as shown in Fig. 1 (right); i.e., the prior model cannot follow the target region because of the unexpected motion. In contrast, the proposed model provides more accurate and stable estimates. The average execution times with the proposed model and the prior model are 39
t     0    1    2    3    4    5    6    7    8    9    10
w_t   90   90   90   90   90   90   90   90   90   90   90
h_t   100  100  100  100  100  100  100  100  100  100  100
t_x   300  301  302  303  304  305  320  321  322  323  324
t_y   220  219  218  217  216  215  200  199  198  197  196
Fig. 1. The “Lena” image (left) and the true state parameters (right) used for generating the synthetic image sequence
[Fig. 2 consists of four panels — (a) width w_t, (b) height h_t, (c) x-directional translation t_{x,t}, (d) y-directional translation t_{y,t} — each plotting the error in pixels against time (frames) for the SIR (prior) model and the proposed model.]
Fig. 2. Root mean squared errors (RMSE) between the estimated mean of the state and the ground truth for the synthetic image sequence
msec/frame and 36 msec/frame, respectively. Consequently, the proposed model outperforms the prior model in terms of accuracy even though the execution times are almost the same.

4.2 Real Example
The real image sequence consists of 240 frames (8 sec) at 320 × 240 pixels resolution, in which a target object of size 80 × 100 pixels is moved by hand. Figure 3 shows the mean estimates (the white bounding box) provided by the prior model (upper row) and the proposed model (lower row) at times 126, 128, 130, 137, 139, and 141. These results are selected as typical tracking examples to show the difference between the two models. Because the object is moved rapidly at time 126, the prior model loses track of it, as shown at times 128 and 130. Similarly, after time 137, the prior model fails to track the object and locks onto a false object with a relatively similar color pattern (the can on the desk), as shown at time 141. In contrast, the proposed model successfully tracks the object throughout the image sequence. The execution times with the proposed model and the prior model are 14 msec/frame and 12 msec/frame, respectively.
[Fig. 3 shows frames #126, #128, #130, #137, #139, and #141 of the real sequence.]
Fig. 3. Mean estimate of the state with the three proposal distributions
5 Conclusion
We have proposed an adaptive statistical motion model, called the optical flow–driven motion model, for particle filter based tracking. This motion model explores the state space with the help of sparse optical flows, and the exploration can be carried out quickly because optical flow is estimated by a gradient method, which performs only a fast local search. Furthermore, we introduced a variance adjustment method for the motion model. This adjustment is derived from the error propagation technique and is fully automatic. The determination of the variance of the motion model is important because the variance affects tracking accuracy and efficiency; it is, however, manually determined in most particle filters. The experimental results with the synthetic and real image sequences show that the proposed model provides better performance than the random walk model.
Acknowledgements This work is supported by the Ministry of Education, Culture, Sports, Science and Technology, Japan, under a Grant-in-Aid (No.19700174).
References
1. Doucet, A., de Freitas, N., Gordon, N.J.: Sequential Monte Carlo Methods in Practice. Springer, Heidelberg (2001)
2. Gordon, N.J., Salmond, D.J., Smith, A.F.M.: Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proc.-F 140(2), 107–113 (1993)
3. Kitagawa, G.: Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. J. Comput. Graph. Stat. 5(1), 1–25 (1996)
4. Isard, M., Blake, A.: Condensation – Conditional density propagation for visual tracking. Int. J. Computer Vision 29(1), 5–28 (1998)
5. Haralick, R.M.: Propagating Covariance in Computer Vision. Int. J. Pattern Recognition and Artificial Intelligence 10(5), 561–572 (1996)
6. Nummiaro, K., Koller-Meier, E., Gool, L.V.: An adaptive color-based particle filter. Image and Vision Computing 21, 99–110 (2003)
7. Pitt, M.K., Shephard, N.: Filtering Via Simulation: Auxiliary Particle Filters. J. the American Statistical Association 94, 590–599 (1999)
8. Liu, J.S., Chen, R.: Sequential Monte Carlo methods for dynamical systems. J. the American Statistical Association 93(443), 1032–1044 (1998)
9. Doucet, A., Godsill, S., Andrieu, C.: On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing 10, 197–208 (2000)
10. Kitagawa, G.: Self-organizing state space model. J. the American Statistical Association 93(443), 1203–1215 (1998)
11. Ichimura, N.: Stochastic Filtering for Motion Trajectory in Image Sequences Using a Monte Carlo Filter with Estimation of Hyper-Parameters. Proc. Int. Conf. on Pattern Recognition IV, 68–73 (2002)
12. Isard, M., Blake, A.: ICondensation: Unifying low-level and high-level tracking in a stochastic framework. Proc. European Conf. Computer Vision 1, 893–908 (1998)
13. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Comm. ACM 24(6), 381–395 (1981)
14. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proc. 4th Alvey Vision Conf., pp. 147–151 (August 1988)
15. Shi, J., Tomasi, C.: Good Features to Track. In: Proc. Computer Vision and Pattern Recognition, pp. 593–600 (1994)
16. Bouguet, J.Y.: Pyramidal Implementation of the Lucas Kanade Feature Tracker. Intel Corporation, Microprocessor Research Labs (2000)
A Noise-Insensitive Object Tracking Algorithm

Chunsheng Hua^{1,2}, Qian Chen^1, Haiyuan Wu^1, and Toshikazu Wada^1

^1 Graduate School of Systems Engineering, Wakayama University, 930 Sakaedani, Wakayama, 640-8510, Japan
^2 Institute of Scientific and Industrial Research, Osaka University, 8-1 Mihogaoka, Ibaraki, Osaka, 567-0047, Japan
Abstract. In this paper, we present a noise-insensitive pixel-wise object tracking algorithm whose kernel is a new reliable data grouping algorithm that introduces reliability evaluation into the existing K-means clustering (called RK-means clustering). RK-means clustering addresses two problems of the existing K-means clustering algorithm: 1) the unreliable clustering result when noise data exist; 2) the bad or wrong clustering result caused by an incorrectly assumed number of clusters. The first problem is solved by evaluating the reliability of classifying an unknown data vector according to the triangular relationship among it and its two nearest cluster centers; noise data are ignored by being assigned low reliability. The second problem is solved by introducing a new group merging method that detects pairs of “too near” data groups by checking their variance and average reliability and then combines them. We developed a video-rate object tracking system (called the RK-means tracker) with the proposed algorithm. Extensive experiments of tracking various objects in cluttered environments confirmed its effectiveness and advantages.
1 Introduction
Object tracking has drawn the attention of more and more researchers over the last decade, and numerous powerful tracking algorithms have been proposed, such as background subtraction [16], optical flow [17], CONDENSATION [8], template matching [15,19], mean shift [11], the EM algorithm [10], dynamic Bayesian networks [9], iterative clustering [6], the Kalman filter [7], etc. The common feature of these algorithms is that their success depends on checking the similarity between the target model and an unknown region or pixel. To measure such similarity, a threshold is usually applied. Since the target object may move under cluttered conditions, it is difficult to select a threshold that works stably under all conditions. Furthermore, there is no guarantee that the object with the maximum similarity is really the target. Collins et al. [13] argue that, in object tracking, the most important thing is the ability to discriminate the target object from its surrounding background: not only the target feature but also the background feature should be processed during tracking. They propose a method that switches the mean-shift tracking algorithm among different linear combinations of the RGB
colors, which selects the features that best distinguish the object from the surrounding background. However, the color histogram has little identification power, and in the case of high dimensional features like textures, the large number of color combinations (in fact, each image contains 49 such color combinations) prevents their method from achieving real-time performance. Similar ideas have been applied by Nguyen [14] and Zhang [18]. The performance of [18] will be unstable if the target object contains apertures, because the target object is assumed to be solid. In [14], although the more powerful Gabor filter is used to discriminate the target from the background, the long-term performance is questionable because the target is again assumed to be solid: when the target is non-rigid, there is no guarantee that the update of the target template is correct, and the multi-scale problem also remains in [14]. Therefore, almost all the above tracking algorithms share one common problem: when the target object is non-rigid and/or contains apertures, background pixels will be mixed into the target object; when the target object moves against a cluttered background, the continuously mixed background pixels will greatly degrade the purity of the target feature (such as color or texture). In this paper, this phenomenon is called background interfusion. In order to solve the background interfusion problem, Hua et al. [12] proposed a pixel-wise tracking algorithm called the “K-means tracker,” which applies K-means clustering to both the target and background samples to remove the mixed background pixels from the target object. However, two problems degrade the performance of the K-means tracker: 1) K-means clustering will wrongly classify noise data into some pre-defined cluster; 2) a wrongly assumed number of clusters sometimes leads to a wrong clustering result. Although some improvements [1,2,3,4,5] have been made to K-means clustering, such problems still affect the performance of the K-means tracker during object tracking. In this paper, we solve these two problems by introducing reliability estimation into K-means clustering. By considering the triangular relationship among an unknown data vector and its two nearest cluster centers, each data vector is given a reliability value, and noise data are ignored by being assigned low reliability. With a new merging method based on the average reliability and variance of each cluster, the second problem is also solved.
2 The RK-Means Clustering

2.1 Reliability Evaluation
Because noise data are usually distant from any cluster center, the distance from a noise datum to its nearest cluster center is generally longer than that of a normal datum. Thus, this distance can be regarded as a feature that tells noise data from normal data. However, since it is difficult to choose a proper threshold for examining such a feature (as claimed above), we instead use the triangular relationship among an unknown data vector and its two nearest cluster centers to measure whether it can reliably be classified into one of the clusters. High reliability is assigned to normal data and low reliability to noise data.
Fig. 1. The relationship between data vectors and their two closest cluster centers
As shown in Fig. 1, w_1 and w_2 are the cluster centers, and x_1 and x_2 are unknown data vectors. From d_11 and d_21 alone, we can only say that x_2 is closer to w_1 than x_1 is, but we cannot judge whether x_1 and x_2 are noise data vectors. However, according to the shapes of the triangles x_1 w_1 w_2 and x_2 w_1 w_2, we can judge that, compared with x_2, x_1 is more likely to be a noise data vector. While clustering an arbitrary data vector x_k of the data set X = {x_k; k = 1, ..., n}, the reliability value of x_k is defined as the ratio of the distance between its two closest cluster centers to the sum of the distances from x_k to those two cluster centers:

R(x_k) = || w_{f(x_k)} − w_{s(x_k)} || / (d_{kf} + d_{ks}),   (1)

d_{kf} = || x_k − w_{f(x_k)} ||,   d_{ks} = || x_k − w_{s(x_k)} ||,
f(x_k) = argmin_{i=1,...,c} || x_k − w_i ||,   s(x_k) = argmin_{i=1,...,c, i≠f(x_k)} || x_k − w_i ||.   (2)
f(x_k) and s(x_k) are the subscripts of the closest and the second-closest cluster centers to x_k. The degree μ_{kf} to which a data vector x_k belongs to its closest cluster f(x_k) is computed from d_{kf} and d_{ks}:

μ_{kf} = d_{ks} / (d_{kf} + d_{ks}).   (3)
Since R(x_k) indicates how reliably x_k can be classified, the probability that x_k belongs to its closest cluster can be computed as

t_{kf} = R(x_k) · μ_{kf}.   (4)
t_{kf} denotes the probability, under the reliability R(x_k), that x_k belongs to its closest cluster. Given the number of clusters and the initial cluster centers, the data grouping in the RK-means clustering algorithm is carried out in two steps: 1) for each data vector x_k, compute its probability of belonging to its nearest cluster center as in eq. (4); 2) update the clusters by minimizing the following objective function:

J_{rkm}(w) = Σ_{k=1}^{n} t_{kf} || x_k − w_{f(x_k)} ||².   (5)
[Fig. 2 panels: (a) input image, (b) K-means clustering (with a mistake area), (c) RK-means clustering.]
Fig. 2. RK-means clustering can detect the outliers with the reliability evaluation
The cluster centers w are obtained by solving the equation

∂J_{rkm}(w) / ∂w = 0.   (6)

The existence of a solution to eq. (6) can be proved easily if the Euclidean distance is assumed. The cluster centers can be updated separately as

w_j = ( Σ_{k=1}^{n} δ_j(x_k) t_{kf} x_k ) / ( Σ_{k=1}^{n} δ_j(x_k) t_{kf} ),   δ_j(x_k) = 1 if j = f(x_k), 0 otherwise.   (7)

The output of eq. (7) is used as the initial value for step one, and steps one and two are performed iteratively until w converges. A minimal sketch of one such iteration is given below.
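The following Python sketch implements one data-grouping iteration of eqs. (1)–(7); it assumes at least two cluster centers and plain Euclidean distance, and the small epsilon guard is an addition for numerical safety.

```python
import numpy as np

def rk_means_step(X, W):
    """One data-grouping iteration of RK-means clustering.

    X : (n, d) data vectors; W : (c, d) current cluster centers (c >= 2).
    Returns updated centers, per-point reliabilities R, and weights t_kf."""
    dists = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)   # (n, c)
    order = np.argsort(dists, axis=1)
    f, s = order[:, 0], order[:, 1]                  # nearest and second-nearest centers
    d_f = dists[np.arange(len(X)), f]
    d_s = dists[np.arange(len(X)), s]
    centre_gap = np.linalg.norm(W[f] - W[s], axis=1)
    R = centre_gap / (d_f + d_s + 1e-12)             # reliability, eq. (1)
    mu = d_s / (d_f + d_s + 1e-12)                   # membership degree, eq. (3)
    t = R * mu                                       # weight t_kf, eq. (4)
    W_new = W.copy()
    for j in range(len(W)):                          # weighted mean update, eq. (7)
        mask = (f == j)
        if t[mask].sum() > 0:
            W_new[j] = (t[mask, None] * X[mask]).sum(axis=0) / t[mask].sum()
    return W_new, R, t
```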
2.2 Redundant Cluster Deletion
When the assumed number of clusters is greater than the real number in a dataset, there will be some redundant clusters. Such redundant clusters scramble for the data vectors that should belong to one cluster; thus one cluster may be divided into two or more clusters forcibly, which makes the clustering process unstable and unreliable. To solve this problem, we merge two redundant clusters into one according to their variance and average reliability. Fig. 3(A) illustrates the reliability field in a two-dimensional space around two cluster centers. The data vectors located on the line connecting the two
(A) Reliability field  (B) Initial state  (C) Iteration: 5  (D) Iteration: 9
Fig. 3. The changes of two cluster centers in one crowd of data vectors during updating
cluster centers will have higher reliability than the others. Such data vectors attempt to attract the two cluster centers toward each other, and this attraction becomes stronger as the two cluster centers get closer. Fig. 3(B)–(D) show an example of clustering one crowd of data vectors with RK-means clustering when two initial clusters are given. Because the two cluster centers (red circles) get closer as the iterations proceed, the average reliability of the two clusters decreases continuously according to eq. (1). Therefore, we consider it possible to judge whether two clusters should be merged by checking their average reliability and variance. Here we describe the dispersion of cluster i as

v(i) = ( Σ_{x_k ∈ cluster(i)} (x_k − w_i)² ) / N,   (8)

where N is the number of data vectors in cluster i. The average reliability of cluster i is computed by

r(i) = ( Σ_{x_k ∈ cluster(i)} R(x_k) ) / N.   (9)

In Fig. 4, Cases 1–4 show the clustering results of data sets with RK-means clustering under different distributions. In order to check whether two clusters should be merged, we compare the dispersion and the average reliability of the two clusters before and after merging. The result of this comparison is obtained as

R_v(i, j) = ( v(i) + v(j) ) / v(i ∪ j),   R_r(i, j) = ( r(i) + r(j) ) / r(i ∪ j),   (10)

where i ∪ j indicates the cluster merged from clusters i and j. The graph in the right part of Fig. 4 shows the resulting R_v and R_r of the two clusters. By analyzing these results, we discovered that the ratio of R_r to R_v can be used to judge whether two clusters should be merged:

Merge(i, j) = P_m(i, j) = R_r(i, j) / R_v(i, j).   (11)

Fig. 4. Investigation on the possibility of merging two clusters
If P_m(i, j) ≤ 1 then cluster i and cluster j should be merged. This procedure is executed after each update of the cluster centers. Therefore, we add the redundant cluster deletion procedure as the third step of the data grouping process described in subsection 2.1. The RK-means clustering algorithm is summarized as:
1) Initialization: give the number of clusters c and the initial value of each cluster center w_i, i = 1, ..., c.
2) Iteration of data grouping:
   i) calculate f(x_k) and s(x_k) for each x_k;
   ii) update w_i, i = 1, ..., c, by solving eq. (6);
   iii) delete redundant clusters with the merging method described above (a code sketch of the merge test follows below).
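The sketch below evaluates the merge test P_m(i, j) = R_r/R_v ≤ 1 of eqs. (8)–(11) for a candidate pair of clusters. How the merged cluster's center is chosen (here simply the mean of the pooled data) and the reuse of the current reliability values are assumptions of this sketch.

```python
import numpy as np

def should_merge(Xi, Ri, wi, Xj, Rj, wj):
    """Merge test for clusters i and j (eqs. 8-11).

    Xi, Xj : member data vectors; Ri, Rj : their reliabilities;
    wi, wj : the two cluster centers."""
    def dispersion(X, w):
        return np.mean(np.sum((X - w) ** 2, axis=1))          # eq. (8)

    X_union = np.vstack([Xi, Xj])
    R_union = np.concatenate([Ri, Rj])
    w_union = X_union.mean(axis=0)                            # assumed merged center
    Rv = (dispersion(Xi, wi) + dispersion(Xj, wj)) / dispersion(X_union, w_union)
    Rr = (Ri.mean() + Rj.mean()) / R_union.mean()             # eqs. (9)-(10)
    return (Rr / Rv) <= 1.0                                   # eq. (11)
```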
3 Object Tracking Using the RK-means Algorithm
In object tracking, we address background interfusion by applying RK-means clustering to target and background samples. To describe the image feature properly, we assume that each object in the image is composed of one or several regions, and that all the pixels within each region 1) have similar colors and 2) are close to each other in 2D space. According to this assumption, the image feature is described by a combined color–position 5D feature vector f. In fact, any color system can be used in eq. (12) (such as RGB, YCbCr, or YUV):

f = (f_1, f_2, f_3, f_4, f_5)^T = (Y, U, V, αx, αy)^T,   (12)

where α is a weight factor adjusting the importance of the position information relative to the color information. Then, as shown in Fig. 5, the minimum dissimilarities from an unknown data vector f_u to the target clusters and to the surrounding background samples (selected from the ellipse contour) are calculated as

d_T = min_{i=1,...,K} || f_u − f_T(i) ||,   d_B = min_{j=1,...,m} || f_u − f_B(j) ||,   (13)

which also gives the nearest target cluster center f_T(i) and the nearest background center f_B(j) to f_u. The dissimilarity between f_T(i) and f_B(j) is calculated as

d_{TB} = || f_T(i) − f_B(j) ||.   (14)

The reliability of f_u is estimated as

R(f_u) = d_{TB} / (d_T + d_B).   (15)
[Fig. 5 labels: cross point, unknown pixel, target centers, nearest point, and the distances d_T, d_B, d_TB.]
Fig. 5. RK-means clustering on multi-color object with target and background samples
Then the probability that f_u belongs to target cluster i is calculated as

μ_T^{(i)}(f_u) = R(f_u) · d_B / (d_T + d_B).   (16)
According to eq. (7), the target centers are then smoothly updated in both the color and the position space. How to select the background samples and update the search area, the tracking failure detection and recovery, and the initialization process are reported in detail in [12].
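A compact sketch of the per-pixel classification step of eqs. (13)–(16) is given below, assuming the 5-D color–position features have already been built; the epsilon guard and the function interface are additions for illustration.

```python
import numpy as np

def pixel_membership(f_u, target_centers, background_samples):
    """Classify one 5-D colour-position feature f_u against the target
    cluster centers and the surrounding background samples."""
    d_t_all = np.linalg.norm(target_centers - f_u, axis=1)
    d_b_all = np.linalg.norm(background_samples - f_u, axis=1)
    i = int(np.argmin(d_t_all))                       # nearest target cluster, eq. (13)
    j = int(np.argmin(d_b_all))                       # nearest background sample
    d_t, d_b = d_t_all[i], d_b_all[j]
    d_tb = np.linalg.norm(target_centers[i] - background_samples[j])   # eq. (14)
    r = d_tb / (d_t + d_b + 1e-12)                    # reliability, eq. (15)
    mu = r * d_b / (d_t + d_b + 1e-12)                # membership to cluster i, eq. (16)
    return i, r, mu
```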
4 Experiment and Discussion

4.1 Evaluating the Effectiveness of RK-Means
To confirm the effectiveness of RK-means clustering (RKM), we compare it with the existing K-means (EKM) and fuzzy K-means (FKM) clustering. All the algorithms are given the same conditions (e.g., initial cluster centers and number of iterations). The upper row of Fig. 6 shows a comparative experiment where four clusters exist but the initial number of clusters is three. Here, the sky-blue markers denote the initial cluster centers, the yellow dots show the clustering result at each iteration, and the final results are pointed to by arrows. EKM only succeeds in setting one cluster center correctly, and FKM also fails to give a good clustering result. Our RKM successfully classifies three clusters according to the initial cluster centers, because the data vectors of the fourth, redundant cluster are given extremely low reliability by RKM: they cannot be classified reliably and should not be assigned to any of the given clusters. The lower row of Fig. 6 shows the opposite case, in which the real number of clusters is smaller than the assumed one. The hollow markers indicate the initial cluster centers, and the solid sky-blue markers show the resulting cluster centers. The EKM and FKM algorithms brutally divide the data vectors of group 2 into two separate clusters, whereas RKM successfully classifies the three clusters by removing one redundant assumed cluster in the way described in Section 2.2. In order to confirm the convergence of RKM, we applied it to the IRIS dataset¹, as shown in Fig. 7(A).
http://www.ics.uci.edu/˜mlearn/databases/iris/iris.data
[Fig. 6 shows two rows of scatter plots comparing EKM, FKM, and RKM: the upper row for the case where the real number of clusters is larger than the assumed number, and the lower row (groups 1–3) for the case where the real number of clusters is smaller than the assumed number.]
Fig. 6. Simulation experiment under different conditions
(A) convergence  (B) input  (C) iteration 1  (D) iteration 2  (E) iteration 3
Fig. 7. (A): Comparing the convergence of RKM and EKM. (B) ∼ (E): Image segmentation with redundant cluster deletion.
In Fig. 7(A), the pink curve shows the convergence of RKM and the blue curve that of EKM. We confirm that the convergence of RKM is as good as that of EKM, and its convergence speed is no slower. Fig. 7(B)–(E) show an image segmentation experiment with RKM and redundant cluster deletion. Panel (B) is extracted from the lower row (frame 285) of Fig. 8, and the yellow crosses indicate the target cluster centers. Although three initial cluster centers are given, RKM correctly merges them into one cluster according to the redundant cluster deletion described in Section 2.2.

4.2 Object Tracking with the RK-Means Tracker
Because Hua et al. [12] have already compared² the K-means tracker with several well-known tracking algorithms, here we only compare our RK-means tracker with the K-means tracker.
http://vrl.sys.wakayama-u.ac.jp/VRL/studyresult/study result 3 en.html
[Fig. 8 shows frames 220, 258, 285, and 305: results of the K-means tracker (upper row) and of the RK-means tracker (lower row).]
Fig. 8. Results of comparative experiment with K-means tracker[12] and RK-means tracker under complex scenes
[Fig. 9 shows frames 055, 075, 107, 117, 125, 138, 142, and 168 of the PETS2004 sequence.]
Fig. 9. Performance of RK-means tracker on PETS2004 where two people are fighting
Fig. 8 shows a hand-tracking sequence. The K-means tracker fails from frame 285 onward: since the color of some surrounding background parts (e.g., a corrugated carton) is somewhat similar to the skin color, the K-means tracker mistakenly takes them as target pixels. This causes an incorrect update of the search area and leads to the tracking failure. With our RK-means tracker, in frame 285 the influence of such background parts is effectively suppressed through the reliability evaluation, so the tracker detects the target area (the hand) and updates the search area correctly. Fig. 9 shows the performance of the RK-means tracker on the public PETS2004 database. In this sequence, although two persons are fighting and get entangled
in each other, the RK-means tracker successfully treats the occluding person as noise data by assigning him low reliability and ignoring those data. All experiments were performed on a desktop PC with a 3.06 GHz Intel Xeon CPU, and the image size was 640 × 480 pixels. With the target size varying from 100 × 100 to 200 × 200 pixels and the number of target colors from 1 to 6, the processing time of our algorithm was about 9–15 ms/frame.
5 Conclusion
In this paper, we have proposed a robust pixel-wise object tracking algorithm based on a new reliability-based K-means clustering algorithm (called the RK-means tracker). By considering the triangular relationship among an unknown data vector and its two nearest cluster centers, each data vector is assigned a reliability value, and noise data are ignored by being given low reliability. When the number of clusters is incorrectly assumed, a new group merging method that checks the variance and average reliability of each cluster deletes the redundant clusters. Through extensive experiments, we have confirmed that the proposed RK-means tracker works more robustly than other algorithms when the background is cluttered. Besides object tracking, we have also confirmed that the proposed RK-means clustering algorithm can be applied to image segmentation.

Acknowledgement. This research is partially supported by the Ministry of Education, Culture, Sports, Science and Technology, Grant-in-Aid for Scientific Research (A)(2) 16200014, (C) 18500131, and (C) 19500150.
References
1. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithm. Plenum Press, New York (1981)
2. Krishnapuram, R., Keller, J.M.: A Possibilistic Approach to Clustering. IEEE Trans. Fuz. Sys. 1(2), 98–110 (1993)
3. Chintalapudi, K.K., Kam, M.: A noise-resistant fuzzy C means algorithm for clustering. FUZZ-IEEE 2, 1458–1463 (1998)
4. Jolion, J.M., et al.: Robust Clustering with Applications in Computer Vision. PAMI 13(8) (1991)
5. Zass, R., Shashua, A.: A Unifying Approach to Hard and Probabilistic Clustering. ICCV 1, 294–301 (2005)
6. Hartigan, J., Wong, M.: Algorithm AS136: A K-means clustering algorithm. Applied Statistics 28, 100–108 (1979)
7. Peterfreund, N.: Robust tracking of position and velocity with Kalman snakes. PAMI 22, 564–569 (2000)
8. Isard, M., Blake, A.: CONDENSATION – Conditional density propagation for visual tracking. IJCV 29(1), 5–28 (1998)
9. Toyama, K., Blake, A.: Probabilistic Tracking in a Metric Space. ICCV 2, 50–57 (2001)
10. Jepson, A.D., et al.: Robust Online Appearance Models for Visual Tracking. PAMI 25(10) (2003)
11. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-Based Object Tracking. PAMI 25(5), 564–577 (2003)
12. Hua, C., Wu, H., Wada, T., Chen, Q.: K-means Tracking with Variable Ellipse Model. IPSJ Transactions on CVIM 46(Sig 15 CVIM12), 59–68 (2005)
13. Collins, R., Liu, Y.: On-line Selection of Discriminative Tracking Features. ICCV 2, 346–352 (2003)
14. Nguyen, H.T., Smeulders, A.: Tracking aspects of the foreground against the background. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3022, pp. 446–456. Springer, Heidelberg (2004)
15. Rosenfeld, A., Kak, A.C.: Digital Picture Processing, Computer Science and Applied Mathematics. Academic Press, New York (1976)
16. Stauffer, C., Grimson, W.E.L.: Adaptive Background Mixture Models for Real-time Tracking. CVPR, pp. 246–252 (1999)
17. Barron, J., Fleet, D., Beauchemin, S.: Performance of optical flow techniques. IJCV 2(1), 42–77 (1994)
18. Zhang, C., Rui, Y.: Robust Visual Tracking via Pixels Classification and Integration. ICPR, 37–42 (2006)
19. Gräßl, C., et al.: Illumination Insensitive Template Matching with Hyperplanes. In: Michaelis, B., Krell, G. (eds.) DAGM 2003. LNCS, vol. 2781, pp. 273–280. Springer, Heidelberg (2003)
Discriminative Mean Shift Tracking with Auxiliary Particles

Junqiu Wang and Yasushi Yagi

The Institute of Scientific and Industrial Research, Osaka University, 8-1 Mihogaoka, Ibaraki, Osaka, Japan
[email protected]
Abstract. We present a new approach towards efficient and robust tracking by incorporating the efficiency of the mean shift algorithm with the robustness of the particle filtering. The mean shift tracking algorithm is robust and effective when the representation of a target is sufficiently discriminative, the target does not jump beyond the bandwidth, and no serious distractions exist. In case of sudden motion, the particle filtering outperforms the mean shift algorithm at the expense of using a large particle set. In our approach, the mean shift algorithm is used as long as it provides reasonable performance. Auxiliary particles are introduced to conquer the distraction and sudden motion problems when such threats are detected. Moreover, discriminative features are selected according to the separation of the foreground and background distributions. We demonstrate the performance of our approach by comparing it with other trackers on challenging image sequences.
1 Introduction
Tracking objects through image sequences is one of the fundamental problems in computer vision. Among the algorithms developed in the pursuit of robust and efficient tracking, two major successful approaches are the mean shift algorithm [1][5], which focuses on Target Representation and Localization, and particle filtering [7][9], which is developed based on Filtering and Data Association. Both of them have their respective advantages and drawbacks. This paper aims at developing a robust and efficient tracker that incorporates the efficiency of the mean shift algorithm with the multi-hypothesis characteristics of the particle filtering. The mean shift algorithm is a robust non-parametric probability density estimation method. Comaniciu et al. [5] define a spatially-smooth similarity function and reduce the state estimation problem to a search of the basin of attraction of this function. Since the similarity function is smooth, a gradient optimization method leading to fast localization is applied. Despite its efficiency and robustness, the mean shift algorithm is not good at coping with quick motions. The distractions in the neighborhood of the target are threats to successful tracking. In addition, the basic mean-shift algorithm assumes that the target representation is sufficiently discriminative against the background. This assumption is
not always true, especially when tracking is carried out against a dynamic background, such as surveillance with a moving camera. We introduce particles to deal with the first two problems because they are able to provide multiple hypotheses. Adaptive tracking is one possible solution to alleviate the third problem [3]; we update the target model according to the separation of the foreground and background distributions. Particle filtering stands out among filtering-based techniques due to its ability to represent multi-modal probability distributions using a weighted sample set S = {(s^(n), π^(n)) | n = 1, ..., N} that keeps multiple hypotheses of the states of targets [7][9]. When tracking is performed in a cluttered environment where multiple objects similar to the target can be present, particle filters are able to find the target by validation and association of the measurements. However, since the number of particles can be large, a potential drawback of particle filtering is its high computational cost. Moreover, the particle set can degenerate and diffuse in a long sequence: after tracking over many frames, only a few particles with high weights remain useful. Accurate models of shape and motion learned from examples have been used to deal with these problems [9]; one drawback of this approach, though, is that the construction of explicit models is sometimes hardly achievable because of viewpoint changes. Isard and Blake [10] proposed the ICONDENSATION algorithm, in which high- and low-level information are combined using importance sampling. However, it is complicated to model the dynamic characteristics accurately in an uncontrolled environment. Sullivan and Rittscher [14] noticed the advantages of the mean shift and particle filter algorithms and proposed particle filter-based tracking guided by a deterministic search based on an SSD-type cost function; the size of the particle set is adjusted according to the difficulty of the problem at hand, which is indicated by motion. Deterministic search using mean shift has also been applied in a hand tracking algorithm by embedding the mean-shift optimization into particle filtering to move particles to local peaks of the likelihood, which improves the sampling efficiency [13]. Although mean shift and particle filters have been combined in various ways in previous work, none of these methods deal with occlusions and distractions explicitly. Cai et al. [2] embed the mean-shift algorithm into the particle filter framework to stabilize the trajectories of the targets; however, it is necessary to learn classifiers for the targets in their work, which is not always possible in tracking applications. The mean shift tracking algorithm outperforms the particle filter when the representation of a target is discriminative enough, the target does not jump beyond the bandwidth, and no serious distractions exist. Although these conditions may seem too strict, we observe that they are met in a large percentage of real image sequences captured for surveillance or other applications. In this work, the mean-shift algorithm is adopted as the main tracker as long as these conditions are met; in other words, only one particle, driven by the mean-shift search, is used to estimate the state of the target. Auxiliary particles are introduced when sudden motion or distractions are detected. We compute log likelihood ratios of class conditional sample densities of the target
and its background. These ratios are applied in feature selection and distraction detection, and the target model is updated according to the feature selection results. Sudden motions are estimated using efficient motion filters [16]. The proposed method offers several advantages. It achieves high efficiency when the target moves smoothly; when sudden motions or distractions are detected, auxiliary particles are initialized to support the mean shift tracker. The help from particle filtering partially solves the problems resulting from sudden motions or distractions. The remainder of the paper is organized as follows. Section 2 gives a brief introduction of the target model. Section 3 describes the feature selection and model updating methods. Section 4 introduces motion estimation and distraction detection. Section 5 discusses the use of auxiliary particles. The performance of the proposed method is evaluated in Section 6 and conclusions are given in Section 7.
2 Target Modeling
The target model should be as discriminative as possible to distinguish a complex target from the background. We use an adaptive target model represented by the best features selected from shape-texture and color cues [17]. Color histograms are computed in three color spaces: RGB, HSV and normalized rg. There are 7 color features (R, G, B, H, S, r, g) in the candidate feature set, and each color channel is quantized into 12 bins. A color histogram is calculated using a weighting scheme in which the Epanechnikov kernel is applied [5]. A shape-texture cue is described by an orientation histogram, which is computed from image derivatives. The orientations are also quantized into 12 bins; each orientation is weighted and assigned to one of two adjacent bins according to its distance from the bin centers. The similarity between the model and its candidates is measured by the Bhattacharyya distance [5].
3 Feature Selection and Model Updating

3.1 Log-Likelihood Ratio Images
To determine the descriptive ability of different features, we compute log-likelihood ratio images [3][15] based on the histograms of the target and its background. Log-likelihood ratio images are also employed in detecting possible threats to the target. The likelihood ratio produces a function that maps feature values associated with the target to positive values and those associated with the background to negative values. The frequency of the pixels that fall into a histogram bin, p^{(bin)}, is calculated as ζ_f^{(bin)} = p_f^{(bin)} / n_f and ζ_b^{(bin)} = p_b^{(bin)} / n_b, where n_f is the number of pixels in the target region and n_b the number of pixels in the background.
The log-likelihood ratio of a feature value is given by

L^{(bin)} = max( −1, min( 1, log [ max(ζ_f^{(bin)}, δ_L) / max(ζ_b^{(bin)}, δ_L) ] ) ),   (1)
Feature Selection
Given md features for tracking, the purpose of the feature selection module is to find the best subset feature of size mm , and mm < md . Feature selection can help minimize the tracking error and maximize the descriptive ability of the feature set. We find the features with the largest corresponding variances. Following the method in [3], based on the equality var(x) = E[x2 ] − (E[x])2 , the variance of Equation(1) is computed as var(L; p) = E[(Lbin )2 ] − (E[Lbin ])2 . The variance ratio of the likelihood function is defined as [3]: VR = 3.3
var(L; (pf + pb )/2) var(B ∪ F ) = . var(F ) + var(B) var(L; pf ) + var(L; pb )
(2)
Updating the Target Model
It is necessary to update the target model due to the fact that the appearance of a target tends to change during a tracking process. Unfortunately, updating the target model adaptively may lead to tracking drift because of the imperfect classification of the target and background. To reliably update the target model, we propose an approach based on similarities between the initial and current appearance of the target. Similarity θ is measured by a simple correlation based template matching performed between the initial and current frames. The updating is done according to the similarity θ: Hm = (1 − θ)Hi + θHc ,
(3)
where the Hi is the histogram computed on the initial target; the Hc the histogram of the target current appearance, the Hm the updated histogram of the target. Template matching is performed between the initial model and the current candidates. Since we do not use the search window that is necessary in template matching-based tracking, the matching process is efficient and brings little computational cost to our algorithm. In unstable tracking period (When sudden motions or distractions are detected), the classification of the target and background is not reliable. It is difficult to reliably update the target model at these moments. Thus the model is updated when the tracker is in stable states.
580
4 4.1
J. Wang and Y. Yagi
Motion Estimation and Distraction Detection Motion Estimation
The number of particles is adjusted according to motion information of the target. Discriminative mean shift tracking is sufficient to determine the position of a target when it moves smoothly and slowly. More particles are necessary to estimate the correct position of the target when it moves quickly. We use the efficient motion filters that have been applied in pedestrian detection [16]. We estimate the motion of foreground and background region simultaneously and partially solve the problem brought by dynamic background. There are five motion filters computed on 5 image pairs: 1 τi |It (x) − It+1 (x)|, (4) Δi = nRg x∈Rg where It and It+1 are consequential images, nRg is the number of pixels in a specific region, and τi ∈ {, ←, →, ↑, ↓} which are image shift operators denoting no shift, shift left, shift right, shift up, and shift down for one pixel respectively. The motion filters are computed on the target and its background region respectively. The results of the last four motion filters (Δi , i ∈ {1, 2, 3, 4}) are compared with the absolute differences Δ0 : Mif = |Δfi − Δf0 |, Mib = |Δbi − Δb0 |
(5)
Mi represent the likelihood that a particular region is moving in a given direction. We compute the maximum motion likelihood to determine the number of particles for the tracking: Mmax = max(|Mif − Mib |)i=1,2,3,4. .
(6)
Given the high efficiency of the estimation method, it is performed in each frame before tracking is carried out. 4.2
Distraction Detection
Distractions in the neighborhood of the target have similar appearance to the target. They are possible threats to successful tracking. When the similarity between the target model and its candidate is less than a certain value (ρT ), distraction detection is performed using spatial reasoning [3] to find peaks besides the target in the log-likelihood ratio images. Note that the log-likelihood ratio images here are back-projection results of the conditional distributions based on selected features. Assuming that the region RT actually contains the target and the region RD is a possible distraction, we want to find the region that have maximum strength of threat to the target. A certain region where the sum of its log-likelihood ratios
Discriminative Mean Shift Tracking with Auxiliary Particles
581
has minimum difference with that in the target region is the distraction we want to find: min(| L(bin ) − L(bin ) |) (7) RD
RX
where RX is a region in the neighborhood of the target. It is too expensive to compare the sums of log-likelihood ratio in all the possible regions with that in the target region. The searching process can be accelerated using a Gaussian kernel [3]. The value at each pixel in the convolved log-likelihood ratio image with a Gaussian kernel is a weighted sum of the log-likelihood ratios in a circular region surrounding it, normalized by the total weight pixels in that region. First, the log-likelihood image is convolved using a Gaussian kernel. The peak DT which represents the target region can be found in the convolved image. Second, the target region in the log-likelihood image is removed and the result is convolved using a Gaussian kernel again. The most dangerous distraction is detected by searching for the peak DD in the convolved image. The difference between the two peaks represents the threat strength of the distraction: ρ = |DD − DT |, (8) The distraction may attract the mean shift tracker to the incorrect position if it is strong enough. We initialize a auxiliary particle set to track the distraction region when ρ is less than the given threshold ρT .
5
Auxiliary Particle Filtering
Particle filtering implements recursive Bayesian filter by Monte Carlo simulations. In the implementation, the posterior density is approximated by a weighted (n) (n) (n) (n) particle set {st , πt }n=1,···,J , where πt = p(zt |xt = st ). We initialize auxiliary particles when sudden motion or distraction are detected. Different strategies are adopted for the generation of particles under these two circumstances. 5.1
Particle Filtering for Sudden Motion
When a sudden motion is detected Np particles are generated using a stochastic motion model. The number of particles is determined from to the motion computed: (9) JS = max(min(J0 Mmax , Jmax ), Jmin ), where J0 is the coefficient; Jmin is the smallest number of particles and Jmax the largest number of particles to maintain reasonable particles. The motion model is a normal density centered on the previous pose with a constant shift vector: (10) xjt = xt−1 + xc + ujt ; where ujt is a standard normal random vector and xc a constant shift vector from the previous position according to the motion estimation results (it is set to one pixel to the motion direction).
582
5.2
J. Wang and Y. Yagi
Particle Filtering for Distraction
After distractions are detected, a joint particle filter with an MRF motion model is initialized [12]. The motion interaction between the target and the distraction ψ(Xit , Xjt ) is described by the Gibbs distribution ψ(Xit , Xjt ) ∝ exp(−g(Xit , Xjt ), where g(Xit , Xjt ) is a penalty function approximated by the distance between the target and the distraction. The posterior on the joint state Xt is approximated as a set of J weighted samples: (J) (J) ψ(Xit , Xjt ) πt−1 P (Xit |Xi(t−1) ), P (Xt |Z t ) ≈ kP (Zt |Xt ) ij∈E
J
i
where the samples are drawn from the joint proposal distribution; k is a normalizing constant that does not depend on the state variables; E is edges in the MRF model; the samples are weighted according to the factored likelihood function: 2 (s) (s) (s) (s) P (Zit |Xit ) ψ(Xit , Xjt ). πt = i
ij∈E
where Zit are measurement nodes. 5.3
Algorithm Summary
In summary, the detailed steps of the proposed tracking algorithm are: Algorithm: Discriminative Mean-Shift Tracking with Auxiliary Particles Input:
t video frames I1 , . . . , It ; Initial target region given in the first frame I1 target regions in I2 , . . . , It
Output: Initialization in I1 1. Save the initial target appearance for model updating; 2. Compute the similarity (S1 ) between the target model and the candidate. For each new frame Ij : Estimate the motion (Mj ) on the consequential frames; IF Mj > MT THEN initialize particles according to the motion estimated. ELSE IF the similarity is less than a given threshold (Sj−1 < S T ) THEN detect distractions in the neighborhood of the target If Distraction is detected (ρ < ρT ) Initialize MRF particles; Else Update the target model. End If End If End If Estimate the position of the target. Compute the similarity Sj for next frame. End For
Discriminative Mean Shift Tracking with Auxiliary Particles
2 Particle filtering
90
90
80 70 60 50
50 30
20
20
10
10
0
0
3
4
5
25
60
30
Tracking approaches
20 15 10 5
1
2
4
3
0
5
Tracking approaches
90
30 25
14 12 10
20
8
15
6
10
4
5
2
0
0
3
4
5
Dataset tracked(%)
100
16
Dataset tracked(%)
18
35
Tracking approaches
(d)
4
3
5
(c)
40
2
2
Tracking approaches
45
1
1
(b)
(a)
Dataset tracked(%)
5 The proposed tracker
Peak difference
70
40
2
4
80
40
1
3 Variance ratio
Dataset tracked(%)
100
100
Dataset tracked(%)
Dataset tracked(%)
1 Basic mean-shift
583
80 70 60 50 40 30 20 10
1
2
4
3
5
Tracking approaches
(e)
0
1
2
4
3
5
Tracking approaches
(f)
Fig. 1. Tracking results using different tracking approaches. Tests are performed on (a) EgTest01; (b) EgTest02; (c) EgTest03; (d) EgTest04; (e) EgTest05; and (f) Redteam.
6 Experimental Results
To illustrate the performance of the proposed tracker, we have implemented and tested it on a wide variety of challenging image sequences in different environments and applications. Due to space limitations, we only show the results on the public CMU datasets with ground truth [4]. The datasets include 6 sequences: EgTest01, EgTest02, EgTest03, EgTest04, EgTest05 and Redteam. Several factors make the tracking challenging: different viewpoints (these sequences are captured by moving cameras); similar objects nearby; sudden motions; illumination changes; reflectance variations of the targets; and partial occlusions. The tracking results are compared with those of the basic mean shift and particle filtering trackers. Since the proposed tracker updates the target model based on feature selection, it is reasonable to compare it with adaptive trackers as well; the variance ratio and peak difference [3] trackers are included for this purpose. In the particle filtering tracker, the target model is represented by 12 × 12 × 12-bin RGB histograms and there are 100 samples in the sample set. RGB histograms are also adopted in the basic mean shift algorithm, with the Bhattacharyya distance between the model and its candidate as the similarity measure. The most important criterion for the comparison is the percentage of the dataset tracked, which is the number of tracked frames divided by the total number of frames; the track is considered lost if the bounding box does not overlap the ground truth. The tracking success rates achieved by each tracker are compared and the results are shown in Fig. 1. The proposed tracker gives the
(a) f=1  (b) f=516  (c) f=1300  (d) Target appearance changes
Fig. 2. Tracking results of the EgTest02 sequence
best results (or ties with another tracker) in all the test sequences. These comparisons demonstrate that the proposed tracking algorithm performs better than the other trackers. In Fig. 2, the tracking results for EgTest02 are shown: despite the distractions and sudden motions in the sequence, the proposed tracker completes the tracking successfully, and Fig. 2(d) illustrates how the appearance of the target changes over time. There are sudden motions and image blur in EgTest04, which lead to the failure of the basic mean-shift tracker; the proposed tracker detects this motion successfully and initializes auxiliary particles, which help it overcome the problem caused by the sudden motion. The running time of the proposed tracker depends on the difficulty of the image sequence being tracked: if sudden motions or distractions happen frequently, its efficiency is low; otherwise it is highly efficient because the mean shift algorithm is used in most cases. The current implementation ran at 16 frames per second (average) on an Intel Centrino 1.6 GHz laptop with 1 GB RAM when applied to images of size 640 × 480. The average running time includes the time to run the main tracking algorithm, to read image files from a USB disk, and to display color images with the object bounding box overlaid.
7 Conclusion and Future Work
We describe a discriminative mean shift tracking algorithm with auxiliary particles in the pursuit of robust and efficient tracking. The arrangement of the particle filtering and the mean shift algorithm is based on the difficulty of the tracking which is indicated by sudden motions and distractions. The model updating strategy in our tracker can effectively deal with appearance changes of targets. The proposed approach provides better performance than those of the mean shift, particle filtering and other trackers.
We are going to investigate how to extend the proposed method to multi-target tracking, in which multiple mean shift searches are necessary.
References
1. Bradski, G.R.: Computer Vision Face Tracking as a Component of a Perceptual User Interface. In: Proc. of the IEEE Workshop on Applications of Computer Vision, pp. 214–219 (1998)
2. Cai, Y., de Freitas, N., Little, J.: Robust Visual Tracking for Multiple Targets. In: Proc. of European Conf. on Computer Vision, pp. 893–908 (2006)
3. Collins, R.T., Liu, Y.: On-line Selection of Discriminative Tracking Features. IEEE Trans. on Pattern Analysis and Machine Intelligence 27(10), 1631–1643 (2005)
4. Collins, R.T., Zhou, X., Teh, S.K.: An Open Source Tracking Testbed and Evaluation Web Site. In: PETS 2005. IEEE Int'l Workshop on Performance Evaluation of Tracking and Surveillance (January 2005)
5. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based Object Tracking. IEEE Trans. Pattern Analysis Machine Intelligence 25(5), 564–577 (2003)
6. Cover, T.M., Thomas, J.A.: Elements of Information Theory. John Wiley and Sons Press, Chichester (1991)
7. Gordon, N., Salmond, D., Smith, A.: Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proc. 140(2), 107–113 (1993)
8. Gevers, T., Smeulders, A.W.M.: Color based object recognition. Pattern Recognition 32(3), 453–464 (1999)
9. Isard, M., Blake, A.: Condensation – conditional density propagation for tracking. Int'l Journal of Computer Vision 29(1), 5–28 (1998)
10. Isard, M., Blake, A.: ICONDENSATION: unifying low-level and high-level tracking in a stochastic framework. In: Proc. of 5th European Conf. on Computer Vision, vol. I, pp. 893–908 (1998)
11. Jähne, B., Scharr, H., Körkel, S.: Handbook of Computer Vision and Applications. In: Jähne, B., Haußecker, H., Geißler, P. (eds.), vol. 2, pp. 125–151. Academic Press, London (1999)
12. Khan, Z., Balch, T., Dellaert, F.: An MCMC-based particle filter for tracking multiple interacting targets. In: Proc. of European Conf. on Computer Vision, vol. I, pp. 893–908 (2004)
13. Shan, C., Tan, T., Wei, Y.: Real-time hand tracking using a mean shift embedded particle filter. Pattern Recognition 40(7), 1958–1970 (2007)
14. Sullivan, J., Rittscher, J.: Guiding Random Particles by Deterministic Search. In: Proc. of Eighth IEEE Int'l Conf. on Computer Vision, vol. I, pp. 323–330 (2001)
15. Swain, M., Ballard, D.: Color Indexing. Int'l Journal of Computer Vision 7, 11–32 (1991)
16. Viola, P., Jones, M.J., Snow, D.: Detecting pedestrians using patterns of motion and appearance. Int'l Journal of Computer Vision 63(2), 153–161 (2005)
17. Wang, J., Yagi, Y.: Integrating Shape and Color Features for Adaptive Real-time Object Tracking. In: IEEE Int'l Conf. on Robotics and Biomimetics (2006)
Efficient Search in Document Image Collections
Anand Kumar1, C.V. Jawahar1, and R. Manmatha2
1 Center for Visual Information Technology, International Institute of Information Technology, Hyderabad, India - 500032
[email protected], [email protected]
2 Department of Computer Science, University of Massachusetts, Amherst, MA 01003, USA
[email protected]
Abstract. This paper presents an efficient indexing and retrieval scheme for searching in document image databases. In many non-European languages, optical character recognizers are not very accurate. Word spotting - word image matching - may instead be used to retrieve word images in response to a word image query. The approaches used for word spotting so far, dynamic time warping and/or nearest neighbor search, tend to be slow. Here, indexing is done using locality sensitive hashing (LSH) - a technique which computes multiple hashes - over word image features computed at the word level. Efficiency and scalability are achieved by content-sensitive hashing implemented through approximate nearest neighbor computation. We demonstrate that the technique achieves high precision and recall (in the 90% range) on a large image corpus consisting of seven books in the Telugu language by Kalidasa (a well-known Indian poet of antiquity). The accuracy is comparable to that of dynamic time warping and nearest neighbor search, while the speed is orders of magnitude better - 20,000 word images can be searched in milliseconds.
1
Introduction
Many document image collections are now being scanned and made available over the Internet or in digital libraries. Effective access to such information sources is limited by the lack of efficient retrieval schemes. The use of text search methods requires efficient and robust optical character recognizers (OCR), which are presently unavailable for Indian languages [1]. Another possibility is to search in the image domain using word spotting [2,3,4]. Direct matching of images is inefficient due to the complexity of matching and thus impractical for large databases. We solve this problem by directly hashing word image representations. We present an efficient mechanism for indexing and retrieval in large document image collections. First, words are automatically segmented. Then features are computed at the word level and indexed. Word retrieval is done very efficiently by using an approximate nearest neighbor retrieval technique called locality sensitive hashing (LSH). Word images are hashed into multiple tables using features computed at the word level. Content-sensitive hash functions are used to hash words
such that the probability of grouping similar words in the same index of the hash table is high. The sub-linear time content-sensitive hashing scheme makes the search very fast without degrading the accuracy. Experiments on a collection of books in Telugu by Kalidasa, the classical Indian poet of antiquity, demonstrate that 20,000 word images may be searched in a few milliseconds. The approach thus makes searching large document image collections practical. There are essentially three classes of techniques to search document image collections. The first is based on using a recognizer to convert an image to text and then searching the results using a text search engine. An example is the gHMM approach of Chan et al. [5], suggested for printed and handwritten Arabic documents. It uses gHMMs with a bi-gram letter transition model, and KPCA/LDA for letter discrimination. In this approach segmentation and recognition go hand in hand. The words are modeled at the letter level, where the likelihood of a word given a segment is used for discriminating words. The Byblos system [6] uses a similar approach to recognize documents, where a line is first segmented out and then divided into image strips. Each line is then recognized using an HMM and a bi-gram letter transition model. The second class, used by Rath et al. [7], involves the automatic annotation of word images with a lexicon and probabilities using a relevance-based language model. Here, words are segmented out and each word image is annotated using a statistical model with the entire lexicon and probabilities. A language model retrieval approach is then used to search the documents. The technique was successfully used to build a 1000-page demonstration for George Washington's handwritten manuscripts. The third approach, proposed by Rath and Manmatha [2,3], involves what is called word spotting, where word images are matched with each other and then clustered. Each cluster is then annotated by a person. Alternatively, Jawahar et al. [4] showed that in the case of printed books one can synthesize the query image from a textual query to make the system more usable. Word spotting has been tried for many different kinds of documents, both handwritten and printed. Rath and Manmatha [2] used dynamic time warping (DTW) to compute image similarities for handwriting. The word similarities are then used for clustering with K-means or agglomerative clustering techniques. This approach was adopted in Jawahar et al. [4] for printed Indian language document images. To simplify the process of querying, a word image is generated for each query and the cluster corresponding to this word is identified. In such methods, efficiency is achieved by significant offline computation. Ataer and Duygulu [8] tried word spotting for handwritten Ottoman documents, where they use successive pruning stages to eliminate irrelevant words. Gatos et al. [9] used word spotting for old Greek typewritten manuscripts for which OCRs did not work. One advantage of word spotting over traditional OCR methods is that it exploits the fact that within corpora such as books the word images are likely to be much more similar, which traditional OCRs do not do. In addition, techniques that work at the symbol level of word images, like [5], are very sensitive to segmentation errors. Segmentation of Indian language document images at the symbol level is very difficult due to the complexity of the scripts.
Many of these techniques (for example DTW) are computationally expensive and do not scale very well. In spite of this, Sankar et al. [10] successfully indexed 500 books in Indian languages using this approach by doing virtually all the computation off-line. Avoiding DTW, Rath et al. [3] demonstrated the use of direct clustering of word image features on historical handwritten manuscripts. However, clustering is itself an expensive operation. Image matching often involves offline nearest neighbor computations and storage for efficient access. These nearest neighbor techniques are expensive in high dimensions even when computed off-line. Indyk and Motwani [11] proposed an approximate nearest neighbor search technique called locality sensitive hashing (LSH) which is much more efficient. LSH has been applied to a number of problems, including some in computer vision. For example, LSH is used to efficiently index high dimensional pose examples by Shakhnarovich et al. [12]. Matei et al. [13] use LSH for 3D object indexing. LSH is different from the geometric hashing approaches used in model-based recognition of 3-D objects in occluded scenes from 2-D gray scale images [14] and also for finding documents from a set of camera-based document images [15].
Fig. 1. Sample document images from Kalidasa’s books in Telugu
Our work mainly aims at addressing some of the issues involved in effective and efficient retrieval from document images, using suitable representations of the word images. We demonstrate efficient retrieval through content-sensitive hashing on a collection of Kalidasa's writings. Sample pages from the Kalidasa collection are shown in Figure 1.
2
Content Sensitive Hashing
In the proposed retrieval technique, the index is built by hashing word level features of document images. The features are hashed using content sensitive hash functions, such that the probability of finding words with similar content in the same bucket is high. The same content sensitive hash functions are used to
Fig. 2. Hashing Method: Word image hashing for efficient search. Showing offline preprocessing and on-line query processing stages.
query similar words during the search. The major challenges in efficient indexing and retrieval are the preprocessing and word matching times. We overcome these challenges with the use of hashing. A conceptual block diagram of the technique is shown in Figure 2. Books are scanned and processed to index the document pages. The textual word query is first converted to an image by rendering, features are extracted from this image, and a search is then carried out to retrieve the relevant word images. To facilitate searching, scanned document images are preprocessed and segmented at the word level. A set of features is extracted to represent the word images to be indexed. Content-sensitive hash functions are used to hash the features such that similar word images are grouped in the same index of the hash table.
2.1 Word Image Representation
We employ a combination of scalar, profile, structural and transform domain feature extraction methods as used in [2,3,4]. Scalar features include the number of ascenders, descenders and the aspect ratio. The profile and structural features include projection profiles, background-to-ink transitions, and upper and lower word profiles. A fixed-length description of the features is obtained by computing the lower-order coefficients of a DFT (Discrete Fourier Transform); discarding the noisy high-order coefficients makes the representation more robust. We use 84 Fourier coefficients of the segmented profiles and ink transition features to represent the word images. Finding similar word images is now equivalent to the nearest neighbor search (NNS) problem: given a set of n points P = {p_1, ..., p_n} in some metric space X, we preprocess P so as to efficiently answer queries, which require finding the
point in P closest to a query point q ∈ X. Traditional data structures for similarity search suffer from the curse of dimensionality. Locality sensitive hashing (LSH) is a state-of-the-art technique introduced by Indyk and Motwani [11] to alleviate the problem of high-dimensional similarity search in large databases. The main idea in LSH is to hash points into bins based on a probability of collision. Thus, points that are far apart in the parameter space will have a high probability of landing in different bins, while close points will go into the same bucket. It has been shown that LSH outperforms tree-based structures such as the Sphere/Rectangle-tree (SR-tree) by at least an order of magnitude.
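As a rough illustration of the word-image representation of Sect. 2.1, the following sketch computes profile features and keeps their lower-order DFT coefficients. It is not the authors' implementation: the binarization convention, the choice of profiles and the number of coefficients kept per profile (21 × 4 = 84, to match the 84 coefficients mentioned above) are assumptions, and the scalar features (ascenders, descenders, aspect ratio) are omitted.

```python
import numpy as np

def word_descriptor(word_img, n_coeffs=21):
    """Profile features of a binarized word image, compressed with a DFT.

    word_img : 2-D array with background = 0 and ink = 1
    Assumes the word is at least about 2*n_coeffs pixels wide.
    """
    ink = word_img > 0
    h, _ = ink.shape

    profiles = [
        ink.sum(axis=0),                                                 # vertical projection profile
        np.where(ink.any(axis=0), ink.argmax(axis=0), h),                # upper word profile
        np.where(ink.any(axis=0), h - 1 - ink[::-1].argmax(axis=0), -1),  # lower word profile
        np.abs(np.diff(ink.astype(int), axis=0)).sum(axis=0),            # background-to-ink transitions
    ]

    feats = []
    for p in profiles:
        spectrum = np.fft.rfft(p)
        feats.append(np.abs(spectrum[:n_coeffs]))   # keep only lower-order, less noisy coefficients
    return np.concatenate(feats)                    # fixed-length word representation
```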
2.2 Hashing Technique
Let P = {x_1, x_2, ..., x_n} be the words in the document image collection. A word is represented by a feature vector x = {f_1, ..., f_D}, i.e., as a point x ∈ R^D in feature space, where f_j is computed by extracting features that describe the content of the word images. The extracted features satisfy the following assumptions.
1. A distance function d is given which measures the content-level similarity of the words, and a radius R in the feature space is given such that x_1, x_2 are considered similar iff d(x_1, x_2) < R.
2. For a randomly chosen word image, there exists with high probability a word image with similar feature values in the collection.
3. There are no significant variations in the feature vectors of words with similar content; in other words, the feature extraction process is unbiased.
The distance function and the similarity threshold are dependent on the particular task, and often reflect perceptual similarities between the words. The last assumption implies that there are no significant sources of variation in the word features for words that are similar in content. The content similarity search is done by efficient nearest neighbor searching with a content-sensitive hashing algorithm. The content-sensitive hashing is achieved by hashing words using a number of hash functions from a family H = {h : S → U} of functions. H is called content-sensitive if for any q, the function

p(t) = \Pr_H[h(q) = h(x) : \|q - x\| = t]   (1)
is strictly decreasing in t. That is, the probability of collision of points q and x decreases with the content dissimilarity (distance) between them. We concatenate several hash functions h ∈ H. In particular, we define a function family G = {g : S → U^k} such that g(x) = (h_1(x), ..., h_k(x)). For an integer L, the algorithm chooses L functions g_1, ..., g_L from G, independently and uniformly at random. During preprocessing, the algorithm stores each input point in the buckets g_j(x), for all j = 1, ..., L. Since the total number of buckets may be large, the algorithm retains only the non-empty buckets by resorting to hashing.
Algorithm 1. Content Sensitive Hashing
Input: Word images W_j, j = 1, ..., n
Output: Hash tables T_i, i = 1, ..., L
1: for each i = 1, ..., L do
2:   Initialize hash table T_i with hash function g_i
3: end for
4: for each i = 1, ..., L do
5:   for each j = 1, ..., n do
6:     Pre-process word image W_j (noise removal etc.)
7:     Extract features F_j of word image W_j
8:     Compute hash bucket I = g_i(F_j)
9:     Store word image W_j in bucket I of hash table T_i
10:  end for
11: end for
A D-dimensional word feature x is mapped onto a set of integers by each hash function h_{a,b}(x). Each hash function in the family is indexed by a choice of random a and b, where a is a D-dimensional vector with entries chosen independently from a p-stable distribution and b is a real number chosen uniformly from the range [0, w]. For fixed a, b the hash function h_{a,b} is given by

h_{a,b}(x) = \left\lfloor \frac{a \cdot x + b}{w} \right\rfloor   (2)
Generally w = 4. The dot product a · x projects each vector onto the real line. The real line is chopped into equi-width segments of appropriate size w, and hash values are assigned to vectors based on which segment they project onto. The value of k is chosen such that t_c + t_g is minimal, where t_c is the mean query time and t_g is the time to compute the hash functions in the L hash tables. The value of k is determined by estimating these times on a sample data set S ⊆ P. The details of such parameter settings and the hash functions are presented in [11,16]. Algorithm 1 summarizes the major steps of content-sensitive hashing. Given a query word image, it is represented with the set of features q. The first-level k hash functions are calculated and concatenated to get bucket ids g_i(q), i = 1, ..., L, in the L hash tables. Then all the features, and the corresponding words, in the buckets of the L tables are retrieved as the query results. Thus the nearest neighbor problem boils down to searching only the vectors in the buckets that have the same hash index value as the query. Algorithm 2 summarizes the major steps of querying. The hash-based search in a collection of document images is faster than other approaches, like exhaustive search with DTW and nearest neighbor techniques. Approaches presented in the literature take a long time for building the index and for retrieval due to preprocessing and complex matching procedures. This computational time can be reduced by eliminating costly processes like clustering. We achieve this by employing the faster content-sensitive hashing technique, and obtain interactive retrieval with retrieval speeds in milliseconds.
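A minimal sketch of the hash family of Eq. (2) is given below. The Gaussian entries of a (a 2-stable distribution) and w = 4 follow the text; the class name, the tuple used as bucket key and the random number generator are illustrative choices.

```python
import numpy as np

class PStableHash:
    """One g(.) = (h_1, ..., h_k), with h_{a,b}(x) = floor((a.x + b) / w) as in Eq. (2)."""

    def __init__(self, dim, k, w=4.0, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        self.a = rng.standard_normal((k, dim))      # entries from a 2-stable (Gaussian) distribution
        self.b = rng.uniform(0.0, w, size=k)        # b uniform in [0, w)
        self.w = w

    def __call__(self, x):
        # Concatenating the k hash values gives the bucket key g(x)
        return tuple(np.floor((self.a @ x + self.b) / self.w).astype(int))
```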
Algorithm 2. Word Retrieval
Input: Query word image w
Output: Similar word images
1: O ← ∅
2: for each i = 1, ..., L do
3:   Pre-process word image w (noise removal etc.)
4:   Extract features F_w of word image w
5:   Compute hash bucket I = g_i(F_w)
6:   O ← O ∪ {points found in bucket I of T_i}
7: end for
8: Return similar words from O by linear search
Time-consuming offline processing of the data is not required for creating the index. The hashing technique avoids complex image matching methods and searches in sub-linear time.
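Building on the hash-family sketch above, the following code illustrates Algorithms 1 and 2: L tables are filled with word features during indexing, and a query retrieves the union of the matching buckets followed by a linear scan within the query radius. The table count L, the hash length k and the data layout are assumptions, not the parameters used in the paper.

```python
import numpy as np
from collections import defaultdict

class WordIndex:
    """Sketch of Algorithms 1 and 2: L hash tables over word-image features."""

    def __init__(self, dim, L=20, k=10, w=4.0):
        self.tables = [defaultdict(list) for _ in range(L)]
        self.hashes = [PStableHash(dim, k, w) for _ in range(L)]

    def add(self, word_id, features):
        # Algorithm 1: store the word in bucket g_i(F) of every table T_i
        for table, g in zip(self.tables, self.hashes):
            table[g(features)].append((word_id, features))

    def query(self, query_features, radius):
        # Algorithm 2: union of the L matching buckets, then a linear scan within the radius
        candidates = {}
        for table, g in zip(self.tables, self.hashes):
            for word_id, feats in table.get(g(query_features), []):
                candidates[word_id] = feats
        return [wid for wid, feats in candidates.items()
                if np.linalg.norm(feats - query_features) < radius]
```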
3
Results and Discussions
We evaluated the proposed hash-based retrieval scheme on word image data sets obtained from a collection of 7 Kalidasa books. The books are printed in Telugu, an Indian language. The document images were scanned and preprocessed to get segmented words, with little manual effort needed to remove segmentation errors. The words were then represented by a set of features. Around 20 words were annotated in each book for experimentation and performance evaluation purposes. Given a textual query word, an image is rendered (generated), features are extracted from the query image, and these are hashed to search for and retrieve the relevant words. The book-wise performance, measured using precision, recall and F-score values, is shown in Table 1. The query image and example search results are shown in Figure 3. The first two rows show correct results. Sometimes other words may also appear somewhat visually similar, and the last column in the last two rows shows examples of such words being retrieved.

Table 1. Search performance: precision, recall and F-score values for retrieval experiments conducted on each book from the Kalidasa collection

Book                  # Pages   # Words   Precision   Recall   F-score
Maalavikaagnimitra       292     22,500      100.00    91.72     95.68
Vikramuurvashiyam        286     23,600      100.00    95.58     97.74
Abhijnanasakuntalam      312     22,500       96.79    91.27     93.96
Ritusamhara              142     11,000       94.65    93.67     94.16
Kumarasambhava           282     56,100       92.37    90.21     91.27
Raghuvamsha              300     36,000       93.23    92.60     92.91
Meghaduta                238     44,000       96.15    93.53     94.82
Fig. 3. Results: Example (Telugu) words searched for input queries
Fig. 4. Results: Words with small variations in style and size are retrieved
Examples of queries containing words of different sizes and style types are shown in Figure 4. Such results are obtained by querying the same word in multiple books of the collection. Using the same query across two different books of the collection retrieves words which are content-wise similar. Indian language words have small form variations. For example, the same word may have different case endings. Such words are also searched correctly using the proposed solution. Example results of such queries are shown in Figure 5 (row 2). The retrieved words have the same stem, which is due to the similarity in image content. There are limits to the font variations that can be handled by the proposed retrieval technique. Experiments show that we cannot use combinations of different font words but such combinations are very unlikely to occur in books. The proposed hashed based search is sub-linear and much faster than exhaustive nearest neighbor search. The plot in Figure 6(a) shows the time efficiency of our hash based search versus nearest neighbor search. The experiments were conducted on data sets of increasing size (by 5,000 words) in each iteration. The maximum number of words used were around 45,000. With the use of the maximum size data set, the maximum time to search relevant words was of the order of milliseconds. The experiments were conducted on an AMD Athlon 64 bit processor using 512 MB memory. The precision and recall values are controlled by the query radius (distance) value. Experiments were conducted on synthetic images of Telugu language to see
Fig. 5. Results: Words with small form variations are retrieved as relevant
Fig. 6. Performance comparison: (a) Hashing and exhaustive nearest neighbor search. (b) Effect of distance: Precision and recall change with the query radius.
the effect of the radius on the performance. Around 6,000 synthetic word images in the Telugu language were used for these experiments. Each word was repeated around 4-10 times in the whole collection. Figure 6(b) shows the change in precision and recall values with the radius. The degradation in performance with increasing radius indicates that many irrelevant words are added to the group of similar words. Similar results were obtained with datasets in different fonts for the Telugu language. The query distance therefore has to be determined experimentally. Table 2 compares this approach to one based on using the DTW score as a similarity measure. It shows that our method is much faster than the DTW-based exhaustive matching and search procedure, while the accuracy is similar.

Table 2. Performance: DTW-based exhaustive search is much slower, while the accuracy of the proposed method is similar to that of DTW matching

Book                 | Hash Based Search               | DTW Based NNS
                     | Precision  Recall  Time (sec)   | Precision  Recall  Time (sec)
Abhijnanasakuntalam  |   96.79    91.27      0.005     |   95.27    93.71      650
Ritusamhara          |   94.65    93.67      0.003     |   93.33    96.63      216
4
Conclusion and Future Work
We presented an efficient indexing and retrieval scheme for searching in large document image databases. Efficiency and scalability, along with high precision and recall values, are achieved by content-sensitive hashing. The retrieval speed is orders of magnitude better than that of exhaustive matching - the technique can search 20,000 word images in milliseconds. We have demonstrated that this technique is practical for searching printed documents rapidly. Future improvements could include feature selection using machine learning techniques to handle multiple fonts and styles.
References 1. Pal, U., Chaudhuri, B.: Indian script character recognition: A survey. Pattern Recognition 37, 1887–1899 (2004) 2. Rath, T.M., Manmatha, R.: Word image matching using dynamic time warping. In: Conference on Computer Vision and Pattern Recognition, vol. (2), pp. 521–527 (2003) 3. Rath, T.M., Manmatha, R.: Word spotting for historical documents. IJDAR 9(2), 139–152 (2007) 4. Balasubramanian, A., Meshesha, M., Jawahar, C.V.: Retrieval from document image collections. In: Bunke, H., Spitz, A.L. (eds.) DAS 2006. LNCS, vol. 3872, pp. 1–12. Springer, Heidelberg (2006) 5. Chan, J., Ziftci, C., Forsyth, D.A.: Searching off-line arabic documents. In: CVPR. Conference on Computer Vision and Pattern Recognition, vol. (2), pp. 1455–1462 (2006) 6. Lu, Z., Schwartz, R., Natarajan, P., Bazzi, I., Makhoul, J.: Advances in the bbn byblos ocr system. In: ICDAR, pp. 337–340 (1999) 7. Rath, T.M., Manmatha, R., Lavrenko, V.: A search engine for historical manuscript images. In: SIGIR, pp. 369–376 (2004) 8. Ataer, E., Duygulu, P.: Retrieval of ottoman documents. In: Multimedia Information Retrieval (MIR) workshop, pp. 155–162 (2006) 9. Konidaris, T., Gatos, B., Ntzios, K., Pratikakis, I., Theodoridis, S., Perantonis, S.J.: Keyword-guided word spotting in historical printed documents using synthetic data and user feedback. IJDAR 9(2), 167–177 (2007) 10. Sankar, K.P., Jawahar, C.V.: Probabilistic reverse annotation for large scale image retrieval. In: Conference on Computer Vision and Pattern Recognition, pp. 1–6 (2007) 11. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: SOTC, pp. 604–613 (1998) 12. Shakhnarovich, G., Viola, P., Darrell, T.: Fast pose estimation with parametersensitive hashing. In: Sebe, N., Lew, M.S., Huang, T.S. (eds.) Computer Vision in Human-Computer Interaction. LNCS, vol. 3766, pp. 750–757. Springer, Heidelberg (2005) 13. Matei, B., Shan, Y., Sawhney, H., Tan, Y., Kumar, R., Huber, D., Hebert, M.: Rapid object indexing using locality sensitive hashing and joint 3D-signature space estimation. IEEE Trans. PAMI 28(7), 1111–1126 (2006) 14. Lamdan, Y., Wolfson, H.: Geometric hashing: A general and efficient model-based recognition scheme. In: ICCV, pp. 238–249 (1988) 15. Nakai, T., Kise, K., Iwamura, M.: Use of affine invariants in locally likely arrangement hashing for camera-based document image retrieval. In: Bunke, H., Spitz, A.L. (eds.) DAS 2006. LNCS, vol. 3872, pp. 541–552. Springer, Heidelberg (2006) 16. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: Proceedings of the 25th VLDB conference, pp. 518–529 (1999)
Hand Posture Estimation in Complex Backgrounds by Considering Mis-match of Model
Akihiro Imai1, Nobutaka Shimada2, and Yoshiaki Shirai2
1 Dept. of Computer-Controlled Mechanical Systems, Osaka University, Yamadaoka, Suita, Osaka 565-0871, Japan
2 Dept. of Human and Computer Intelligence, Ritsumeikan University, Nojihigashi, Kusatsu, Shiga 525-8577, Japan
Abstract. This paper proposes a novel method of estimating 3-D hand posture from images observed in complex backgrounds. Conventional methods often make mistakes due to mis-matches of local image features. Our method considers the possibility of mis-matches between each posture model appearance and the other model appearances in a Bayesian stochastic estimation form by introducing a novel likelihood concept, the "Mistakenly Matching Likelihood" (MML). The correct posture model is discriminated from mis-matches by MML-based posture candidate evaluation. The method is applied to a hand tracking problem in complex backgrounds and its effectiveness is shown.
1
Introduction
Precise hand-finger shape estimation methods using visual cues have been developed [1][2][3][4][5][6] in order to implement gestural interfaces in a touch-less manner, which are utilized in interaction with virtual environments and automatic sign-language translation. One of the difficulties of implementing interfaces based on hand shape estimation lies in the situations where the interfaces are needed: complex backgrounds such as colorful and textured clothes, skin-colored regions such as a human face, and desktops on which various tools and objects are scattered. Since hand shape estimation even in simple backgrounds is a tough problem due to the great variety of postures, shape estimation with simultaneous segmentation is still a challenging problem. To solve the problem with feasible computing resources, some trials have been reported from the following two viewpoints: 1) how to reduce the number of posture candidates to consider (i.e., how to predict the posture), and 2) how to evaluate the matching degree between a posture candidate and the observed image features. From viewpoint 1), the Active Shape Model [7] has been proposed, which learns acceptable shape deformations and tracks the region contour or texture assuming smooth deformation and motion. Non-smooth deformation can be treated by introducing Switching Linear Dynamics [8]. 3-D model-based shape prediction and tracking, not based on appearance learning, has also been proposed [5]. Most of
Fig. 1. Mistake of the conventional method: (a) estimation result; (b) edges of the estimation result put on the input image; (c) shape similar to the input image; (d) edges of the similar shape put on the input image
Fig. 2. Correspondence of edges
those methods employ a parallel search scheme in tracking, like beam search or a particle filter, for robustness against temporal mis-estimation and tracking failure [5][9][10][11][12][13][14]. While many improvements from the first viewpoint have been reported, those from the second viewpoint, concerned with the evaluation of the matching degree, are comparatively few, and most implementations employ a simple feature correspondence and evaluation method: chamfer matching [15][16]. Chamfer matching makes correspondences between the features with the least distance in the image, and evaluates the matching degree by the sum of the distances (chamfer distance). This simple matching scheme, of course, often causes a wrong shape estimate on complicated backgrounds. Fig. 1 is an example of a wrong estimate of hand posture caused by chamfer matching. Because many edge textures are observed in the hand region, the finger tips of the posture model are mistakenly matched to the inner edges (see Fig. 2), and as a result its chamfer distance is evaluated as too small. While this problem is hard to avoid as long as chamfer matching is used, no more appropriate matching method than chamfer matching is available. Therefore the matching degree should be evaluated under the consideration that such wrong matching often happens. Nevertheless, the existence of mismatches caused by chamfer matching does not directly mean it is useless. The Embedding approach [15] evaluates the matching degree between an input image feature and not only one posture model but also several other reference models. For the example of Fig. 1, in addition to the correct posture candidate (c), the candidate (a) also has such a high matching
degree that (a) is picked as the estimate. However, if only (a) has a high matching degree when (a) is actually the correct match, these two cases can be discriminated by evaluating the matching degrees with both reference models (a) and (c). Since the Embedding approach only uses an ad-hoc way of evaluating the squared sum of the matching degrees of all reference models, its estimate is not optimal from a Bayesian point of view. This paper mathematically derives a Bayesian form of the Embedding approach. In its derivation, a novel concept of likelihood is introduced: the Mistakenly Matching Likelihood (MML), which predicts the high evaluation caused by wrong matching and gives the ability to discriminate the true estimate from false matches in a stochastic way. The derived MML-based candidate evaluation is applied to a hand tracking problem in complex backgrounds and its effectiveness is experimentally shown.
2
Acquisition of Typical Hand Posture Images
The 3-D hand model used in our research is originally a wireframe model. The model is modified into a shaded model. The joint bending angles are denoted by θ_{b,t,1}, θ_{b,t,2}, θ_{b,t,3}, θ_{b,i,1}, θ_{b,i,2}, θ_{b,i,3}, ... and the opening angles at the base joint of the fingers are denoted by θ_{o,t}, θ_{o,i}, ... (shown in Fig. 3). The posture of the whole hand model is represented by the translation t_x, t_y, t_z and the rotation θ_{r,x}, θ_{r,y}, θ_{r,z}. As a whole, the shape of the hand model has 26 degrees of freedom:

\theta = (\theta_{b,t,1}, \ldots, \theta_{r,z})   (1)
CG images of typical hand models are shown in Fig. 4. The finger joints move dependently in natural actions [5]. In the index, middle, ring and pinky fingers, adjacent joint angles are usually similar. Such joint constraints reduce
Fig. 3. Hand model: (a) wireframe model, (b) model after shading
Fig. 4. CG images of typical hand postures
Fig. 5. Edge images of hand model
Table 1. Quantization widths from the search center

Parameter                 Quantized values
Δθ_bend^c [°]             -ζ     0     ζ
Δθ_open^c [°]             -6     0     6
Δθ_{o,m}^c [°]            -30    0     30
Δθ_{r,x}^c [°]            -15    0     15
Δθ_{r,y}^c [°]            -6     0     6
Δθ_{r,z}^c [°]            -8     0     8
Δt_x^c, Δt_y^c [mm]       -40    0     40
Δt_z^c [mm]                      0
the number of possible postures. Fig. 5 shows the edge images generated from the CG images of the typical hand postures under the constraints. The posture parameters θ to be estimated are quantized. The changes are shown in Table 1 (the change of θ_b is represented by Δθ_bend^c, and that of θ_o other than the middle finger by Δθ_open^c). Each parameter of θ_b is assumed to change by 0 or ±ζ. Each parameter has its own ζ between 9° and 15°.
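For illustration, the quantized search neighbourhood around the previous estimate can be enumerated as sketched below; in the actual system the joint constraints tie groups of parameters together, so the product is taken over a much smaller, constraint-reduced parameter set. The function and argument names are illustrative.

```python
import itertools
import numpy as np

def candidate_postures(theta_prev, deltas):
    """Enumerate quantized posture candidates around the previous estimate (cf. Table 1).

    theta_prev : current estimate of the (constraint-reduced) pose parameters
    deltas     : per-parameter quantization width (zeta or the values of Table 1)
    """
    steps = [(-d, 0.0, d) for d in deltas]          # each parameter changes by -delta, 0 or +delta
    for offsets in itertools.product(*steps):
        yield theta_prev + np.asarray(offsets)
```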
3
Matching Method
The system has several hand models with various dimensions, i.e., lengths and widths of the palm and fingers. Since input image sequences are assumed to start from a predefined simple shape and an initial position, the dimensions are easily initialized at the first frame. The posture parameters to be estimated lie around the posture estimate of the previous image frame. When each input image is obtained, the best-matched model is determined by the maximum likelihood criterion. Let I denote the edges and skin regions extracted from an input image and Θ_j denote the quantized parameter vector of the j-th model. The criterion is

\hat{\Theta} = \arg\max_{\Theta_j} P(I|\Theta_j)   (2)
where P(I|Θ_j) denotes the likelihood of the j-th model for the input. The likelihood is defined based on the difference between the image I and the appearance of the shape model.
3.1 Difference of Image and Appearance of Shape Model
In this paper, the edge image I^(e) and the skin-color region image I^(s) are used as the image features of the input I. The difference between the silhouette of a typical hand model and that of an image is computed. Let A^(s)_θ be the silhouette generated from a typical hand model θ. The difference of the silhouettes, f_skin(A^(s)_θ; I^(s)), is defined as the area of A^(s)_θ that does not overlap with I^(s). The difference between I^(e) and the edge appearance A^(e)_θ, f_dist(A^(e)_θ; I^(e)), is computed by a modified chamfer matching, in which the edge points are classified by gradient direction and the edges are matched by the original chamfer matching within each direction class [13][18]. The distance is weighted by the edge contrast, and as a result f_dist is defined as follows:

f_{dist}(A^{(e)}_\theta; I^{(e)}) = \sum_j w_{\theta,j} \min_k \big( \|x_{\theta,j} - x_{I^{(e)},k}\| + f_{I^{(e)},k} + g(j,k) \big)   (3)

where x_{θ,j} and x_{I^(e),k} denote the j-th edge point of the model and the k-th edge point of the input edge image, ||·|| is the 2-dimensional Euclidean norm, w_{θ,j} is a weight, and f_{I^(e),k} is a penalty for an edge with low contrast:

w_{\theta,j} = \frac{d_{\theta,j}}{\sum_l d_{\theta,l}}   (4)

f_{I^{(e)},k} = -w_d \, d_{I^{(e)},k}   (5)

where w_d is a weight constant. The difference of gradient direction g(j,k) is defined in terms of the gradient direction φ_{θ,j} of the model edge and the gradient direction φ_{I^(e),k} of the input edge as

g(j,k) = w_\phi \|\phi_{\theta,j} - \phi_{I^{(e)},k}\|   (6)

where w_φ is a weight constant. All weights are experimentally determined. Using a distance transform, the modified chamfer matching is computed as fast as the original chamfer matching.
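The following sketch illustrates how f_dist of Eqs. (3)-(6) can be evaluated with one distance transform per gradient-direction class. It is only an approximation of the minimization in Eq. (3) (the nearest input edge within the same direction class is taken as the match), and the bin count, the fallback cost for empty classes and the weights w_d, w_phi are assumed values, not those of the paper.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def modified_chamfer(model_pts, model_dirs, model_contrast,
                     edge_map, edge_dirs, edge_contrast,
                     n_bins=8, w_d=0.5, w_phi=1.0):
    """Directed chamfer distance in the spirit of Eqs. (3)-(6).

    model_pts : (M, 2) integer (row, col) model edge coordinates
    edge_map  : (H, W) boolean edge image I^(e); edge_dirs / edge_contrast are per-pixel maps
    """
    H, W = edge_map.shape
    w = model_contrast / model_contrast.sum()                       # Eq. (4)

    # quantize gradient directions (modulo pi) into n_bins classes
    img_bin = ((np.mod(edge_dirs, np.pi) / np.pi) * n_bins).astype(int) % n_bins
    mod_bin = ((np.mod(model_dirs, np.pi) / np.pi) * n_bins).astype(int) % n_bins

    total = 0.0
    for b in range(n_bins):
        sel = mod_bin == b
        if not np.any(sel):
            continue
        mask = edge_map & (img_bin == b)
        if not np.any(mask):
            total += w[sel].sum() * np.hypot(H, W)                  # assumed fallback cost
            continue
        # distance to (and index of) the nearest input edge of this direction class
        dt, inds = distance_transform_edt(~mask, return_indices=True)
        iy, ix = inds
        r, c = model_pts[sel, 0], model_pts[sel, 1]
        ny, nx = iy[r, c], ix[r, c]
        dist = dt[r, c]
        penalty = -w_d * edge_contrast[ny, nx]                      # f_{I,k}, Eq. (5)
        g = w_phi * np.abs(model_dirs[sel] - edge_dirs[ny, nx])     # Eq. (6)
        total += np.sum(w[sel] * (dist + penalty + g))
    return total
```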
3.2 Discrimination Principle
An example of mis-matching by a conventional method was shown in Fig. 1 in Section 1. Let Θ_a and Θ_c respectively denote the posture parameters of model (a) (fist shape) and model (c) (flat shape). Suppose the true hand posture is Θ_c. In this section, we describe the principle of discriminating the true match from the wrong matches caused by complicated skin textures and backgrounds, and finally introduce its stochastic form, which gives the discrimination criterion. The probability that the appearance A_{Θ_c} is matched to the input image, p(A_{Θ_c}|Θ_c), should be large enough because the true posture is the same as the one which generates the appearance, Θ_c. On the other hand, the probability of A_{Θ_a} (the small fist shape), p(A_{Θ_a}|Θ_c), can also be large in spite of the posture difference between Θ_a and Θ_c because
Fig. 6. Likelihood for edge images: p(A_{Θ_a}|Θ_c) ≈ p(A_{Θ_a}|Θ_a), whereas p(A_{Θ_c}|Θ_a) < p(A_{Θ_c}|Θ_c)
almost all of the area of A_{Θ_a} is included and the inner texture edges can be wrongly matched to the finger contours. Therefore, conventional likelihood maximization can often choose Θ_a mistakenly for flat hand shapes like Θ_c due to image capture noise, inaccuracy of the 3-D shape model and quantization errors of the posture parameters. In order to resolve the mis-matches, we carefully analyze the behaviour of two more probabilities: p(A_{Θ_a}|Θ_a) and p(A_{Θ_c}|Θ_a). p(A_{Θ_a}|Θ_a) should be large and p(A_{Θ_c}|Θ_a) should be small, because A_{Θ_c} protrudes from the area of Θ_a. The four probabilities thus behave as follows: while p(A_{Θ_a}|Θ_c) and p(A_{Θ_c}|Θ_c) are both large for the posture Θ_c, p(A_{Θ_a}|Θ_a) is large and p(A_{Θ_c}|Θ_a) is small for the posture Θ_a (see Fig. 6). Therefore, when the appearances A_{Θ_a} and A_{Θ_c} are observed together, the posture should be estimated as Θ_c; when A_{Θ_a} is observed alone, the posture should be Θ_a. When the likelihood of an appearance A_{Θ_k} for a model Θ_j, p(A_{Θ_k}|Θ_j), is obtained for each possible combination of k and j in advance, the appropriate model can be chosen by taking all the p(A_{Θ_k}|Θ_j) values into account, as in the above discussion. When k and j are identical, p(A_{Θ_k}|Θ_j) is equivalent to the conventional likelihood function. Otherwise, it is the likelihood that an appearance A_{Θ_k} comes from a mistakenly chosen model Θ_j. We call this likelihood the "mistakenly matching likelihood" (MML).
3.3 Model Selection using Mistakenly Matching Likelihood
We introduce the stochastic discrimination criterion based on the principle described in the previous section, employing a Bayesian estimation framework. Let A_{Θ_1}, A_{Θ_2}, ... denote the appearances of the typical hand models. Assuming A_{Θ_1}, A_{Θ_2}, ... are exclusive under each Θ_j, the likelihood of Θ_j for I can be expanded as

p(I|\Theta_j) = \sum_k p(I, A_{\Theta_k}|\Theta_j) = \sum_k p(I|A_{\Theta_k}, \Theta_j)\, p(A_{\Theta_k}|\Theta_j).   (7)
Assuming the appearance A_{Θ_k} carries all the information needed to generate the observed image I, the condition Θ_j can be removed:

p(I|\Theta_j) = \sum_k p(I|A_{\Theta_k})\, p(A_{\Theta_k}|\Theta_j).   (8)

In the conventional maximum likelihood estimation method, only the likelihood for the case k = j is considered. In contrast, we additionally consider the MML for the case k ≠ j. Assuming that I^(e) and I^(s) are independent when a certain Θ_j is specified,

p(I|\Theta_j) = p(I^{(e)}|\Theta_j)\, p(I^{(s)}|\Theta_j)   (9)

is derived as the discrimination criterion in stochastic form. The likelihoods p(I^(e)|Θ_j) and p(I^(s)|Θ_j) are respectively derived from the following equations:

p(I^{(e)}|\Theta_j) = \sum_k p(I^{(e)}|A^{(e)}_{\Theta_k})\, p(A^{(e)}_{\Theta_k}|\Theta_j)   (10)

p(I^{(s)}|\Theta_j) = \sum_k p(I^{(s)}|A^{(s)}_{\Theta_k})\, p(A^{(s)}_{\Theta_k}|\Theta_j)   (11)

The probability distributions p(A^(e)_{Θ_k}|Θ_j) and p(I^(e)|A^(e)_{Θ_k}) for edge images are introduced in the following sections. Those for the skin-color silhouette, p(A^(s)_{Θ_k}|Θ_j) and p(I^(s)|A^(s)_{Θ_k}), can be introduced in the same manner as those for edge images.
where θj∗ is the mean value of the interval Θj . p(AΘk |θj∗ ) is derived as follows from the definition of fdist in sec.3.1 and assuming that fdist obeys a gaussian distributionfdist [12][13][17]: (e)
p(AΘk |θj∗ ) = αθ∗ exp(−(dM (k, j))2 ) (e)
(e)
(e)
j
(13)
(e)
where Ir (θ) is the edge image rendered from the posture θ, and (e)
dM (k, j) =
fdist (AΘ ;Ir(e) (θj∗ )) (e)
k (e)
σM
.
(14)
Hand Posture Estimation in Complex Backgrounds (e) 2
σM
603
is the variance of the value of fdist (AΘk ; Ir (θj∗ )). σM is experimentally (e)
(e)
(e)
(e)
determined. αθ∗ is normalization constant, j
(e)
αθ∗ =
k
j
−1 (e) exp(−(dM (k, j))2 ) .
(15)
In the same manner as the above, p(AΘk |θj∗ ) = αθ∗ exp(−(dM (k, j))2 ) (s)
(s)
(s)
(16)
j
(s)
where Ir (θ) is the silhouette generated from θ, and fskin (AΘ ;Ir(s) (θj∗ )) (s)
(s)
dM (k, j) =
k (s)
σM
(s) 2
(17)
.
σM is the variance of the value of fskin (AΘk ; Ir (θj∗ )). αθ∗ is normalization j constant, −1 (s) (s) 2 (18) αθ∗ = . k exp(−(dM (k, j)) ) (s)
(s)
(s)
j
3.5
Likelihood of Appearance
In this section, we explain the evaluation of the likelihood of an appearance (e) p(I (e) |AΘk ). The likelihood is defined based on the definition of fdist as p(I (e) 2
where, σI
(e)
(e) |AΘk )
=
(e) βΘk exp
2 (e) (fdist (AΘ ;I (e) )) k − . (e) 2 σI
(19)
(e)
is the variance of (fdist (AΘk ; I (e) )). It is experimentally determined.
The normalization constant abilistic distributions:
(e) βΘk
is derived from the integral condition of prob(e)
p(i(e) |AΘk )di(e) = 1
(20)
(e)
Assuming that p(i(e) |AΘk ) can be large value only for i(e) of hand images and is 0 for most of other i(e) , (e) (e) (e) p(i(e) |AΘk )di(e) ≈ p(Ir (θl )|AΘk )dθl (e) ∗ (e) ≈ l p(Ir (θl )|AΘk ) · δ (21) (e) (e) = βΘk l exp(−(dM (k, l))2 ) · δ ≡1 where δ is the range of the quantization of Θ. −1 (e) (e) 2 βΘk = l exp(−(dM (k, l)) ) · δ
(22)
604
A. Imai, N. Shimada, and Y. Shirai (e)
(e)
(e)
When AΘk wrongly matches to many of Ir (θl )(l = k), βΘk becomes small. On (e) βΘk
becomes large. It means that ambiguous the other hand when a few of those, appearance model, which is easy to mis-match to other posture’s appearances, are automatically low evaluated.
4
Estimation of More Accurate Posture Parameters
Posture parameters of the best-matched model are slightly different from that of the hand of an input image due to quantization errors of posture parameters. Thus, more accurate parameters must be estimated. The wireframe CG model of the hand is deformed so that the model is matched to an input image, and the 3-D hand shape is reconstructed from the deformed model[19]. In this method, while the curved surface shape of the hand is reconstructed, posture parameters are not estimated. We deform the CG model so that the appearance of the model is matched to those of the input image by using this method. The accurate posture parameters are estimated from coordinates of the vertices of the triangle patches of the deformed wireframe model. Parameters are estimated by the following steps of a procedure. 1. We make correspondences of edges of the best-matched model to those of the input image. 2. The change of the appearance is evaluated from the correspondences so that the edges of the model move toward those of the input image. 3. In order to reduce the huge search region of posture parameters due to the high DOF of human hand, available deformations of surface mesh of the CG model are learned by PCA in advance for each of typical postures, and then the best approximated mesh deformation is estimated by the projection to the PCA subspace. 4. Return to 1. We make correspondences of edges of the CG model deformed at 3. to those of the input image. CG model is deformed by the change of appearance evaluated from the correspondence, again. Repeat these processes. 5. We evaluate the 2-dimensional Euclidean norm of the vertices of the triangle patches between the deformed CG model and CG model generated by the posture parameters. The sum of the norms is minimized using steepest descent method. The posture parameters with minimized sum of the norms are the posture estimate.
5
Experiment
We did the experiment of posture tracking for 250 hand images. The resolution of the images is 320 × 240. The images are captured by 30 fps. In the conventional (e) (s) method, where p(I|Θj ) = p(I|AΘj )p(I|AΘj ) is used as a matching criterion, 70.4% images are correctly matched. In our method, 82.0% images are correctly matched. The success rates show effectiveness of our method.
Hand Posture Estimation in Complex Backgrounds
605
Fig. 7. Experimental result
Fig. 9. Experimental result
Fig. 8. Experimental result
Fig. 10. Experimental result
The example of the image which is correctly matched in our method while mis-matched in the conventional method, is shown in Fig. 7. While the wrong fist hand shape is matched in conventional method, the correct flat shape is matched in our method. Fig. 8 shows the results of an image sequence. In the input images, fingers are partially occluded and the edges of the background are confusingly appeared near the fingers. Such cases causing mismatches are correctly matched in our method. Fig. 9 and Fig. 10 show the tracking results for other hand shapes. These images are also correctly matched in our method.
6
Conclusion and Discussion
The paper introduces a Bayesian form of evaluation of posture candidates for hand tracking in complex backgrounds. The novel concept of Mistakenly
606
A. Imai, N. Shimada, and Y. Shirai
Matching Likelihood (MML) enables to discriminate the true posture candidate from other confusing ones when the mismatch of image features frequently occurs. Experimental results for tracking of the real human hand show the effectiveness of this evaluation method. Additional image features like optical flows or range other than edges and silhouette should be considered on this framework as future work.
Acknowledgment This work is supported in part by Grant-in-Aid for Scientific Research from Ministry of Education, Science, Sports, and Culture, Japanese Government, No.15300058. The 3-D model of the real human hand was provided by courtesy of Prof. F. Kishino and Prof. Y. Kitamura, Osaka University.
References 1. Liu, X., Fujimura, K.: Hand Gesture Recognition using Depth Data. In: Proc. 6th Int. Conf. on Automatic Face and Gesture Recognition, pp. 529–534 (2004) 2. Iwai, Y., Yagi, Y., Yachida, M.: Estimation of Hand Motion and Position from Monocular Image Sequence. In: Li, S., Teoh, E.K., Mital, D., Wang, H. (eds.) ACCV1995. LNCS, vol. 1035, pp. 230–234. Springer, Heidelberg (1996) 3. Lee, S.U., Cohen, I.: 3D Hand Reconstruction from a Monocular View. In: Proc. 17th Int. Conf. on Pattern Recognition, vol. 3, pp. 310–313 (1995) 4. Kameda, Y., Minoh, M., Ikeda, K.: Three Dimensional Pose Estimation of an Articulated Object from its Silhouette Image. In: ACCV 1993, pp. 612–615 (1993) 5. Shimada, N., Kimura, K., Shirai, Y.: Real-time 3-D Hand Posture Estimation based on 2-D Appearance Retrieval Using Monocular Camera. In: Proc. Int. Workshop on RATFG-RTS, pp. 23–30 (2001) 6. Imai, A., Shimada, N., Shirai, Y.: 3-D Hand Posture Recognition by Training Contour Variation. In: Proc. of The 6th Int. Conf. on Automatic Face and Gesture Recognition, pp. 895–900 (2004) 7. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active Shape Models-Their Training and Application. COMPUTER VISION AND IMAGE UNDERSTANDING 61(1), 38–59 (1995) 8. Jeong, M., Kuno, Y., Shimada, N., Shirai, Y.: Recognition of shape-changing hand gestures. IEICE Trans. inf.Syst. E85-D(10), 1678–1687 (2002) 9. Isard, M., Blake, A.: Visual tracking by stochastic propagation of conditional density. In: Proc. European Conf. Computer Vision, pp. 343–356 (1996) 10. Isard, M., Blake, A.: ICONDENSATION:Unifying low-level and high-level tracking in a stochastic framework. In: Proc. European Conf. Computer Vision, pp. 767–781 (1996) 11. Heap, T., Hogg, D.: Wormholes in Shape Space:Tracking through Discontinuous Changes in Shape. In: 6th Int. Conf. on Computer Vision, pp. 344–349 (1998) 12. Zhou, H., Huand, T.S.: Tracking Articulated Hand Motion with Eigen Dynamics Analysis. 9th Int. Conf. on Computer Vision 2, 1102–1109 (2003)
Hand Posture Estimation in Complex Backgrounds
607
13. Stenger, B., Thayananthan, A., Torr, P.H.S., Cipolla, R.: Model-Based Hand Tracking Using a Hierarchical Bayesian Filter. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(9), 1372–1384 (2006) 14. Wu, Y., Lin, J., Huang, T.S.: Analyzing and Capturing Articulated Hand Motion in Image Sequences. IEEE TRANS. ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 27(12), 1910–1922 (2005) 15. Athitsos, V., Sclaroff, S.: Estimating 3D Hand Pose from a Cluttered Image. In: IEEE Conf. on Computer Vision and Pattern Recognition, vol. II, pp. 432–439. IEEE Computer Society Press, Los Alamitos (2003) 16. Barrow, H.G., Tenenbaum, J.M., Bolles, R.C., Wolf, H.C.: Parametric correspondence and chamfer matching: Two new techniques for image matching. In: Proc. 5th Int. Joint Conf. Artificial Intelligence, pp. 659–663 (1977) 17. Blake, A., Isard, M.: Active Contours. Springer, Heidelberg (1998) 18. Navaratnam, R., Thayananthan, A., Torr, P.H.S., Cipolla, R.: Hierarchical PartBased Human Body Pose Estimation. In: Proc. British machine Vision Conference (2005) 19. Heap, T., Hogg, D.: Towards 3D Hand Tracking using a Deformable Model. In: 2nd Int. Conf. on Automatic Face and Gesture Recognition, pp. 140–145 (1996)
Learning Generative Models for Monocular Body Pose Estimation Tobias Jaeggli1 , Esther Koller-Meier1 , and Luc Van Gool1,2 1
2
ETH Zurich, D-ITET/BIWI, CH-8092 Zurich Katholieke Universiteit Leuven, ESAT/VISICS, B-3001 Leuven
[email protected]
Abstract. We consider the problem of monocular 3d body pose tracking from video sequences. This task is inherently ambiguous. We propose to learn a generative model of the relationship of body pose and image appearance using a sparse kernel regressor. Within a particle filtering framework, the potentially multimodal posterior probability distributions can then be inferred. The 2d bounding box location of the person in the image is estimated along with its body pose. Body poses are modelled on a low-dimensional manifold, obtained by LLE dimensionality reduction. In addition to the appearance model, we learn a prior model of likely body poses and a nonlinear dynamical model, making both pose and bounding box estimation more robust. The approach is evaluated on a number of challenging video sequences, showing the ability of the approach to deal with low-resolution images and noise.
1 Introduction Monocular body pose estimation is difficult, because a certain input image can often be interpreted in different ways. Image features computed from the silhouette of the tracked figure hold rich information about the body pose, but silhouettes are inherently ambiguous, e.g. due to the Necker reversal. Through the use of prior models this problem can be alleviated to a certain degree, but in many cases the interpretation is ambiguous and multi-valued throughout the sequence. Several approaches have been proposed to tackle this problem, they can be divided into discriminative and generative methods. Discriminative approaches directly infer body poses given an appearance descriptor, whereas generative approaches provide a mechanism to predict the appearance features given a pose hypothesis, which is then used in a generative inference framework such as particle filtering or numerical optimisation. Recently, statistical methods have been introduced that can learn the relationship of pose and appearance from a training data set. They often follow a discriminative approach and have to deal explicitly with the nonfunctional nature of the multi-valued mapping from appearance to pose [1,2,3,4]. Generative approaches on the other hand typically use hand crafted geometric body models to predict image appearances (e.g. [5], see [6,7] for an overview). We propose to combine the generative methodology with a learning based statistical approach. The mapping from pose to appearance is single-valued and can thus be seen Y. Yagi et al. (Eds.): ACCV 2007, Part I, LNCS 4843, pp. 608–617, 2007. c Springer-Verlag Berlin Heidelberg 2007
Learning Generative Models for Monocular Body Pose Estimation
609
as a nonlinear regression problem. We approximate the mapping with a RVM kernel regressor [8] that is efficient due to its sparsity. The human body has many degrees of freedom, leading to high dimensional pose parametrisations. In oder to avoid the difficulties of high dimensionality in both the learning and the inference stage, we apply a nonlinear dimensionality reduction algorithm [9] to a set of motion capture data containing walking and running movements. 1.1 Related Work Statistical approaches to the monocular pose estimation problem include [1,2,3,4,10,11]. In [10] the focus lies on the appearance descriptor, and the discriminative mapping from appearance to pose is assumed to be single-valued and thus modelled with a single linear regressor. The one-to-many discriminative mapping is explicitly addressed in [1,2,3,4] by learning multiple mappings in parallel as a mixture of regressors. In order to choose between the different hypotheses that the different regressors deliver, [1,2] use a geometric model that is projected into the image to verify the hypotheses. Inference is performed for each frame independently in [1]. In [2] a temporal model is included using a bank of Kalman filters. In [3,4] gating functions are learned along with the regressors in order to pick the right regressor(s) for a given appearance descriptor. The distribution is propagated analytically in [3], and temporal aspects are included in the learned discriminative mapping, whereas [4] adopts a generative sampling-based tracking algorithm with a first-order autoregressive dynamic model. These discriminative approaches work in a bottom-up fashion, starting with the computation of the image descriptor, which requires the location of the figure in the images to be known beforehand. When including 2d bounding box estimation in the tracking problem, a learned dynamical model might help the bounding box tracking, and avoid loosing the subject when it is temporarily occluded. To this end, [12] learns a subjectspecific dynamic appearance model from a small set of initial frames, consisting of a low-dimensional embedding of the appearances and a motion model. This model is used to predict location and appearance of the figure in future frames, within a CONDENSATION tracking framework. Similarly, low-dimensional embeddings of appearance (silhouette) manifolds are found using LLE in [11], where additionally the mapping from the appearance manifold to 3d pose in body joint space is learned using RBF interpolants, allowing for pose inference from sequences of silhouettes. Instead of modelling manifolds in appearance space, [13,14,15] work with low dimensional embeddings of body poses. In [13], the low-dimensional pose representation, its dynamics, and the mapping back to the original pose space are learned in a unified framework. This approach does not include statistical models of image appearance. In a similar fashion, we also chose to model manifolds in pose space rather than appearance space, because the pose manifold has fewer self-intersections than the appearance manifold, making the dynamics and tracking less ambiguous. In contrast to [13,14,15], our model includes a learned generative likelihood model. When compared to [1,2,3,4,10,11], our approach can simultaneously estimate pose and bounding box, and learning a single regressor is more easily manageable than a mixture of regressors. The paper is structured as follows. 
Section 2 and 3 introduce our learned models and the inference algorithm, and in Section 4 we show experimental results.
610
T. Jaeggli, E. Koller-Meier, and L. Van Gool
2 Learning Figure 1 a) shows an overview of the tracking framework. Central element is the lowdimensional body pose parametrisation, with learned mappings back to the original pose space and into the appearance space. In this section all elements of the framework will be described in detail. Our models were trained on real motion capture data sets of different subjects, running and walking at different speeds. 2.1 Pose and Motion Prior Representations for the full body pose configuration are high dimensional by nature; our current representation is based on 3d joint locations of 20 body locations such as hips, knees and ankles, but any other representation (e.g. based on relative orientations between neighbouring limbs) can easily be plugged into the framework. To alleviate the difficulties of high dimensionality in both the learning and inference stages, a dimensionality reduction step identifies a low dimensional embedding of the body pose representations. We use Locally Linear Embedding (LLE) [9], which approximately maintains the local neighbourhood relationships of each data point and allows for global deformations (e.g. unrolling) of the dataset/manifold. LLE dimensionality reduction is performed on all poses in the data set and expresses each data point in a space of desired low dimensionality. We currently use a 4-dimensional embedding. However, LLE does
Body Pose
X : Body Pose (high dim.)
learn LLE dim. red.
reconstruct pose, eq. (1)
x : Body Pose (low dim.)
generative mapping
dynamic prior eq. (4)
Y : Image (high dim.)
BPCA reconstruction eq. (5)
(b) eq. (6)
y : Appearance Descriptor: (low dim.)
learn BPCA dim. red.
Appearance
(a)
(c)
Fig. 1. a) An overview of the tracking framework. Solid arrows represent signal flow during inference, the dashed arrow stands for LLE resp. BPCA dimensionality reduction during training. The figure refers to equations in Section 2. b) Body pose representation as a number of 3d joint locations. c) Corresponding synthetically generated silhouette, as used for training the appearance model.
However, LLE does not provide explicit mappings between the high-dimensional and the low-dimensional space, which would allow projecting new data points (that were not contained in the original data set) between the two spaces. Therefore, we model the reconstruction projection from the low-dimensional LLE space to the original pose space with a kernel regressor.

X = f_p(x) = W_p Φ_p(x)    (1)

Here, X and x are the body pose representations in the original resp. LLE-reduced spaces, Φ_p is a vector of kernel functions, and W_p is a sparse matrix of weights, which are learned with a Relevance Vector Machine (RVM). We use Gaussian kernel functions, computed at the training data locations.
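As an illustration of how such a regressor is evaluated, the sketch below computes eq. (1) for an already trained model; the RVM training itself is not shown, and the kernel centres, weight matrix and kernel width are placeholders.

```python
# A sketch of evaluating the kernel regressor of eq. (1), X = W_p Phi_p(x),
# with Gaussian kernels centred on the retained (relevance) training points.
import numpy as np

def gaussian_kernel_vector(x, centres, width):
    """Phi_p(x): one Gaussian kernel value per centre (plus a bias term)."""
    d2 = np.sum((centres - x) ** 2, axis=1)
    return np.concatenate(([1.0], np.exp(-d2 / (2.0 * width ** 2))))

def reconstruct_pose(x_low, W_p, centres, width=1.0):
    """Map a low-dimensional pose x_low back to the original pose space."""
    return W_p @ gaussian_kernel_vector(x_low, centres, width)

# Example with made-up shapes: 4-d LLE space, 60-d pose space, 30 centres.
centres = np.random.randn(30, 4)
W_p = np.random.randn(60, 31)               # 30 kernels + bias column
X = reconstruct_pose(np.zeros(4), W_p, centres)
print(X.shape)                               # (60,)
```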
The training examples form a periodic twisted 'ring' in LLE space, with a curvature that varies with the phase within the periodic movement. A linear dynamical model, as often used in tracking applications, is not suitable to predict future poses on this curved manifold. We view the nonlinear dynamics as a regression problem, and model it using another RVM regressor, yielding the following dynamic prior,

p_d(x_t | x_{t-1}) = N(x_t; x_{t-1} + f_d(x_{t-1}) ΔT, Σ_d),    (2)
where f_d(x_{t-1}) = W_d Φ_d(x_{t-1}) is the nonlinear mapping from poses to local velocities in LLE pose space, ΔT is the time interval between the subsequent discrete timesteps t-1 and t, and Σ_d is the variance of the prediction errors of the mapping, computed on a hold-out data set that was not used for the estimation of the mapping itself.

Not all body poses that can be expressed using the LLE pose parameterisation correspond to valid body configurations that can be reached with a human body. The motion model described so far only includes information about the temporal evolution of the pose, but no information about how likely a certain body pose is to occur in general. In other words, it does not yet provide any means to restrict our tracking to feasible body poses. Worse, the learned regressors can produce erroneous outputs when they are applied to unfeasible input poses, since the extrapolation capabilities of kernel regressors to regions without any training data are limited. The additional prior knowledge about feasible body poses is introduced as a static prior that is modelled with a Gaussian Mixture Model (GMM).

p_s(x) = Σ_{c=1}^{C} p_c N(x; μ_c, Σ_c),    (3)

with C the number of mixture components. We obtain the following formulation for the temporal prior by combination with the dynamic prior p_d(x_t | x_{t-1}):

p(x_t | x_{t-1}) ∝ p_d(x_t | x_{t-1}) p_s(x_t)    (4)
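To make the combination of eqs. (2)-(4) concrete, the sketch below evaluates the (unnormalised) temporal prior for a pose hypothesis; the velocity regressor f_d, Σ_d and the GMM parameters are assumed to be already learned, and the helper names are illustrative, not taken from the paper.

```python
# A sketch of evaluating the combined temporal prior of eq. (4): the dynamic
# term p_d of eq. (2) uses a learned velocity regressor f_d, and the static
# term p_s of eq. (3) is a Gaussian mixture over feasible poses.
import numpy as np

def gaussian_pdf(x, mean, cov):
    d = x.shape[0]
    diff = x - mean
    norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def dynamic_prior(x_t, x_prev, f_d, Sigma_d, dT=1.0 / 30):
    """p_d(x_t | x_{t-1}) from eq. (2)."""
    mean = x_prev + f_d(x_prev) * dT
    return gaussian_pdf(x_t, mean, Sigma_d)

def static_prior(x, weights, means, covs):
    """p_s(x) from eq. (3)."""
    return sum(w * gaussian_pdf(x, m, c)
               for w, m, c in zip(weights, means, covs))

def temporal_prior(x_t, x_prev, f_d, Sigma_d, gmm):
    """Unnormalised p(x_t | x_{t-1}) of eq. (4)."""
    return dynamic_prior(x_t, x_prev, f_d, Sigma_d) * static_prior(x_t, *gmm)
```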
2.2 Likelihood Model
The representation of the subject's image appearance is based on a rough figure-ground segmentation. Under realistic imaging conditions, it is not possible to get a clean silhouette, therefore the image descriptor has to be robust to noisy segmentations to a certain degree. In order to obtain a compact representation of the appearance of a person, we apply Binary PCA [16] to the binary foreground images. The descriptors are
computed from the content of a bounding box around the centroid of the figure, and 10 to 20 BPCA components are kept to yield good reconstructions. The projection of a new bounding box into the BPCA subspace is done in an iterative fashion, as described in [16]. Since we model appearance in a generative top-down fashion, we also consider the inverse operation that projects the low-dimensional image descriptors y back into high dimensional pixel space and transforms it into binary images or foreground probability maps. By linearly projecting y back to the high-dimensional space using the mean μ and basis vectors V of the Binary PCA, we obtain a continuous representation Y_c that is then converted back into a binary image by looking at its signs, or into a foreground probability map via the sigmoid function σ(Y_c).

p(Y = fg | y) ∝ σ(V^T y + μ)    (5)
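A short sketch of this back-projection, eq. (5), is given below, under the assumption that the BPCA basis is stored with one row per component; the 32×32 bounding-box size is an arbitrary illustrative choice.

```python
# A sketch of eq. (5): projecting a low-dimensional appearance descriptor y
# back to the pixel grid of the bounding box and converting it into a
# foreground-probability map via the sigmoid.
import numpy as np

def foreground_probability_map(y, V, mu, box_shape=(32, 32)):
    Yc = V.T @ y + mu                        # continuous reconstruction
    prob = 1.0 / (1.0 + np.exp(-Yc))         # sigmoid -> p(pixel = fg | y)
    return prob.reshape(box_shape)

y = np.zeros(15)                             # 15 BPCA components
V = np.random.randn(15, 32 * 32)             # basis vectors (one row per component)
mu = np.zeros(32 * 32)
Seg = foreground_probability_map(y, V, mu)
print(Seg.shape)                             # (32, 32)
```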
Now we will look at how the image appearance is linked to the LLE body pose representation x. We model the generative mapping f_a from pose x to image descriptors y, which allows us to predict image appearance given pose hypotheses and fits well into generative inference algorithms such as particle filtering. In addition to the local body pose x, the appearance depends on the global body orientation ω relative to the camera, around the vertical axis. First, we map the pose x, ω into the low dimensional appearance space y,

f_a(x, ω) = W_a Φ_a(x, ω)    (6)
where the functional mapping f_a(x, ω) is approximated by a sparse kernel regressor (RVM) with weight matrix W_a and kernel functions Φ_a(x). By plugging (6) into (5), we obtain a discrete 2d probability distribution of foreground probabilities Seg(p) over the pixels p in the bounding box.

Seg(p) = p(p = fg | f_a(x, ω))    (7)
From this pdf, a likelihood measure can then be derived by comparing it to the actually observed segmented image Y_obs, also viewed as a discrete pdf Obs(p), using the Bhattacharyya similarity measure [17], which measures the affinity between distributions.

Obs(p) = p(p = fg | Y_obs)
BC(x, ω, Y_obs) = Σ_p √( Seg(p) Obs(p) )    (8)
We model the likelihood measure as a zero mean Gaussian distribution of the Bhattacharyya distance d_Bh = -ln(BC(x, ω, Y_obs)), and obtain the observation likelihood

p(Y_obs | x, ω) ∝ exp( - ln(BC(x, ω, Y_obs))² / (2 σ_BC²) )    (9)
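The following sketch turns a predicted segmentation and an observed foreground map into the likelihood of eqs. (8)-(9); normalising the two maps to discrete distributions and the value of σ_BC are assumptions made for illustration.

```python
# A sketch of eqs. (8)-(9): comparing the predicted segmentation Seg with the
# observed foreground map Obs via the Bhattacharyya coefficient and turning
# the resulting distance into an observation likelihood.
import numpy as np

def bhattacharyya_coefficient(Seg, Obs, eps=1e-12):
    # Both maps are treated as discrete distributions over the pixels.
    p = Seg.ravel() / (Seg.sum() + eps)
    q = Obs.ravel() / (Obs.sum() + eps)
    return np.sum(np.sqrt(p * q))

def observation_likelihood(Seg, Obs, sigma_BC=0.2):
    bc = max(bhattacharyya_coefficient(Seg, Obs), 1e-12)
    d_bh = -np.log(bc)                       # Bhattacharyya distance
    return np.exp(-(d_bh ** 2) / (2.0 * sigma_BC ** 2))

Seg = np.random.rand(32, 32)
Obs = np.random.rand(32, 32)
print(observation_likelihood(Seg, Obs))
```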
3 Inference
In this section we will show how the 2d image position, body orientation, and body pose of the subject are simultaneously estimated given a video sequence, by using the learned models from the previous section within the framework of particle filtering. The pose estimation as well as the image localisation can benefit from the coupling of pose
and image location. For example, the known current pose and motion pattern can help to distinguish subjects from each other and track them through occlusions. We therefore believe that tracking should happen jointly in the entire state space Θ,

Θ_t = [ω_t, u_t, v_t, w_t, h_t, x_t],    (10)
consisting of the orientation ω, the 2d bounding box parameters (position, width and height) u, v, w, h, and the body pose x. Despite the reduced number of pose dimensions, we face an inference problem in 9-dimensional space. Having a good sample proposal mechanism like our dynamical model is crucial for the Bayesian recursive sampling to run efficiently with a moderate number of samples.

For the monocular sequences we consider, the posteriors can be highly multimodal. For instance a typical walking sequence, e.g. observed from a side view, has two obvious posterior modes, shifted 180 degrees in phase, corresponding to the left resp. the right leg swinging forward. When taking the orientation of the figure into account, the situation gets even worse, and the modes are no longer well separated in state space, but can be close in both pose and orientation. Our experiments have shown that a strong dynamical model is necessary to avoid confusion between these posterior modes and reduce ambiguities. Some posterior multimodalities do however remain, since they correspond to a small number of different interpretations of the images, which are all valid and feasible motion patterns.

The precise inference algorithm is very similar to classical CONDENSATION [18], with normalisation of the weights and resampling at each time step. The prior and likelihood for our inference problem are obtained by extending (4) and (9) to the full state space Θ. In our implementation, the dynamical prior p_d(Θ_t^i | Θ_{t-1}^i) serves as the sample proposal function. It consists of the learned dynamical prior from eq. (2), and a simple motion model for the remaining state variables θ = [ω_t, u_t, v_t, w_t, h_t].

p_d(Θ_t^i | Θ_{t-1}^i) = p_d(x_t^i | x_{t-1}^i) N(θ_t^i; θ_{t-1}^i, Σ_θ)    (11)
In practice, one may want to use a standard autoregressive model for propagating θ, omitted here for notational simplicity. The static prior over likely body poses (3) and the likelihood (9) are then used for assigning weights w^i to the samples.

w_t^i ∝ p(Y_t^i | Θ_t^i) p_s(Θ_t^i) = p(Y_t^i | x_t^i, ω_t^i) p_s(x_t^i)    (12)
Here, i is the sample index, and Y_t^i is the foreground probability map contained in the sampled bounding box (u_t^i, v_t^i, w_t^i, h_t^i) of the actually observed image. Note that our choice for sample proposal and weighting functions differs from CONDENSATION in that we only use one component (p_d) of the prior (4) as a proposal function, whereas the other component (p_s) is incorporated in the weighting function.
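A condensed sketch of one filtering step over the 9-dimensional state is given below. It follows the scheme just described (resample, propagate with the learned pose dynamics plus a random walk on the bounding-box/orientation block, weight according to eq. (12)), but it is a simplified illustration rather than the authors' implementation; all callables and parameters are placeholders.

```python
# One particle-filtering step over Theta = [omega, u, v, w, h, x] (9-d state,
# pose block x in the last 4 dimensions).
import numpy as np

def particle_filter_step(particles, weights, f_d, Sigma_d_chol, sigma_theta,
                         likelihood, static_prior, dT=1.0 / 30, rng=None):
    rng = rng or np.random.default_rng()
    N = len(particles)

    # 1) Resample according to the current weights.
    idx = rng.choice(N, size=N, p=weights / weights.sum())
    particles = particles[idx]

    # 2) Propagate: learned dynamics for the pose block, Gaussian random walk
    #    for [omega, u, v, w, h] (first 5 dimensions).
    new_particles = np.empty_like(particles)
    for i, p in enumerate(particles):
        theta, x = p[:5], p[5:]
        x_new = x + f_d(x) * dT + Sigma_d_chol @ rng.standard_normal(x.size)
        theta_new = theta + sigma_theta * rng.standard_normal(theta.size)
        new_particles[i] = np.concatenate([theta_new, x_new])

    # 3) Weight: w_i proportional to p(Y | x_i, omega_i) * p_s(x_i), eq. (12).
    new_weights = np.array([likelihood(p) * static_prior(p[5:])
                            for p in new_particles])
    new_weights /= new_weights.sum()
    return new_particles, new_weights
```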
4 Experiments
We evaluated our tracking algorithm on a number of different sequences. The main goals were to show its ability to deal with noisy sequences with poor foreground segmentation, very low resolution, and varying viewpoints. Particle filtering was performed using a set of 500 samples, leading to a computation time of approx. 2-3 seconds per image frame in unoptimised Matlab code. The
Fig. 2. Circular walking sequence from [5]. The figure shows full frames (top), and cutouts with bounding box in original or segmented input images and estimated poses. Darker limbs are closer in depth.
Fig. 3. Diagonal walking sequence. Estimated bounding boxes and poses. The intensity of the stick figure limbs encodes depth; lighter limbs are further away.
sample set is initialised in the first frame as follows. Hypotheses for the 2d bounding box locations are either derived from the output of a pedestrian detector that is run on the first image, or from a simple procedure to find connected components in the
Fig. 4. An extract from a soccer game. The figure shows original and segmented images with estimated bounding boxes, and estimated 3d poses.
segmented image. Pose hypotheses xi1 are difficult to initialise, even manually, since the LLE parameterisation is not easily interpretable. Therefore, we randomly sample from the entire space of feasible poses in the reduced LLE space to generate the initial hypotheses. Thanks to the low-dimensional representation, this works well, and the sample set converges to a low number of clusters after a few time steps, as desired. The described models were trained on a database of motion sequences from 6 different subjects, walking and running at different speeds. The data was recorded using an optical motion capture system. The resulting sequences of body poses were normalised for limb lengths and used to animate a realistic computer graphics figure in order to create matching silhouettes for all training poses (see Fig. 1c). The figure was rendered from different view points, located every 10 degrees in a circle around the figure. Due to this choice of training data, our system currently assumes that the camera is in an approximately horizontal position. The training set consists of 4000 body poses in total. All the kernel regressors were trained using the Relevance Vector Machine algorithm (RVM) [8], with Gaussian Kernels. Different kernel widths were tested and compared using a crossvalidation set consisting of 50% of the training data, in order to avoid overfitting. 4 LLE dimensions were used, and 15 BPCA components. The first experiment (Fig. 2) shows tracking on a standard test sequence1 from [5], where a person walks in a circle. We segmented the images using background subtraction, yielding noisy foreground probability maps. The main challenge here is the varying viewing angle that is difficult to estimate from the noisy silhouettes. Tracking through another publicly available sequence from the HumanID corpus is shown in Figure 3. The subject walks in an angle of approx. 35 degrees to the camera plane. In addition it is viewed from a slight top-view and shows limb foreshortening due to the perspective projection. These are violations of the assumptions that are inherent in our 1
http://www.nada.kth.se/∼hedvig/data.html
Fig. 5. Traffic scene with low resolution images and noisy segmentation
choice of training data, where we used horizontal views and orthographic projection. Nevertheless the tracker performs well. Figure 4 shows an extract from a real soccer game with a running player. The sequence was obtained from www.youtube.com, therefore the resolution is low and the quality suffers from compression artefacts. We obtained a foreground segmentation by masking the color of the grass. In Figure 5 we show a real traffic scene that was recorded with a webcam of 320 × 240 pixels. The subjects are as small as 40 pixels in height. Noisy foreground segmentation was carried out by subtracting one of the frames at the beginning of the sequence. Our experiments have shown that the dynamical model is crucial for tracking through these sequences with unreliable segmentations and multimodal per-frame likelihoods.
5 Summary and Conclusion
We have proposed a learning-based approach to the estimation of 3d body pose and image bounding boxes from monocular video sequences. The relationship between body pose and image appearance is learned in a generative manner. Inference is performed with a particle filter that samples in a low-dimensional body pose representation obtained by LLE. A nonlinear dynamical model is learned from training data as well. Our experiments show that the proposed approach can track walking and running persons through video sequences of low resolution and unfavourable image quality. Future work will include several extensions of the current method. We will explicitly consider multiple activity categories and perform action recognition along with the
tracking. Also, we will investigate different image descriptors that extract the relevant image information more efficiently.
Acknowledgements
This work is supported, in part, by the EU Integrated Project DIRAC (IST-027787), the SNF project PICSEL and the SNF NCCR IM2.
References 1. Rosales, R., Sclaroff, S.: Learning body pose via specialized maps. In: NIPS (2001) 2. Thayananthan, A., Navaratnam, R., Stenger, B., Torr, P., Cipolla, R.: Multivariate relevance vector machines for tracking. In: Ninth European Conference on Computer Vision (2006) 3. Sminchisescu, C., Kanaujia, A., Li, Z., Metaxas, D.: Discriminative density propagation for 3d human motion estimation. In: CVPR (2005) 4. Agarwal, A., Triggs, B.: Monocular human motion capture with a mixture of regressors. In: CVPR. IEEE Workshop on Vision for Human-Computer Interaction, IEEE Computer Society Press, Los Alamitos (2005) 5. Sidenbladh, H., Black, M., Fleet, D.: Stochastic tracking of 3d human figures using 2d image motion. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1843, pp. 702–718. Springer, Heidelberg (2000) 6. Forsyth, D.A., Arikan, O., Ikemoto, L., O’Brien, J.D.R.: Computational studies of human motion: Part 1. Computer Graphics and Vision 1(2/3) (2006) 7. Moeslund, T.B., Hilton, A., Kr¨uger, V.: A survey of advances in vision-based human motion capture and analysis. Comput. Vis. Image Underst. 104(2), 90–126 (2006) 8. Tipping, M.: The relevance vector machine. In: NIPS (2000) 9. Roweis, S., Saul, L.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326 (2000) 10. Agarwal, A., Triggs, B.: A local basis representation for estimating human pose from cluttered images. In: Narayanan, P.J., Nayar, S.K., Shum, H.-Y. (eds.) ACCV 2006. LNCS, vol. 3852, Springer, Heidelberg (2006) 11. Elgammal, A., Lee, C.S.: Inferring 3d body pose from silhouettes using activity manifold learning. In: CVPR (2004) 12. Lim, H., Camps, O.I., Sznaier, M., Morariu, V.I.: Dynamic appearance modeling for human tracking. In: Conference on Computer Vision and Pattern Recognition, pp. 751–757 (2006) 13. Wang, J.M., Fleet, D.J., Hertzmann, A.: Gaussian process dynamical models. Advances in Neural Information Processing Systems 18, 1441–1448 (2006) 14. Sminchisescu, C., Jepson, A.: Generative modeling for continuous non-linearly embedded visual inference. In: ICML. International Conference on Machine Learning (2004) 15. Li, R., Yang, M.H., Sclaroff, S., Tian, T.P.: Monocular tracking of 3d human motion with a coordinated mixture of factor analyzers. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 137–150. Springer, Heidelberg (2006) 16. Zivkovic, Z., Verbeek, J.: Transformation invariant component analysis for binary images. In: CVPR, vol. 1, pp. 254–259 (2006) 17. Bhattacharyya, A.: On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math Soc. (1943) 18. Isard, M., Blake, A.: Condensation - conditional density propagation for visual tracking. Int. J. Computer Vision (1998)
Human Pose Estimation from Volume Data and Topological Graph Database Hidenori Tanaka1 , Atsushi Nakazawa2, and Haruo Takemura2 1
Graduate School of Information Science and Technology, Osaka University 1-32 Machikaneyama, Toyonaka-shi, Osaka, 560-0043 Japan
[email protected] 2 Cybermedia Center, Osaka University 1-32 Machikaneyama, Toyonaka-shi, Osaka, 560-0043 Japan {nakazawa,takemura}@cmc.osaka-u.ac.jp
Abstract. This paper proposes a novel volume-based motion capture method using a bottom-up analysis of volume data and an example topology database of the human body. By using a two-step graph matching algorithm with many example topological graphs corresponding to postures that a human body can take, the proposed method does not require any initial parameters or iterative convergence processes, and it can solve the changing topology problem of the human body. First, three-dimensional curved lines (skeleton) are extracted from the captured volume data using the thinning process. The skeleton is then converted into an attributed graph. By using a graph matching algorithm with a large amount of example data, we can identify the body parts from each curved line in the skeleton. The proposed method is evaluated using several video sequences of a single person and multiple people, and we can confirm the validity of our approach.
1 Introduction
Motion capture is widely used in the research fields of computer graphics, human-computer interaction, medical applications, and robotics. However, in conventional motion capture systems, the actors must attach markers or special devices to their bodies. To solve this problem, markerless motion capture methods have been thoroughly studied by computer vision researchers [1,2]. Many studies have introduced articulated human body models and solve the problem through error minimization frameworks. These studies use different types of three-dimensional primitives to express body parts, for example cylinders and ellipsoids [3], colored blobs [4], and so on. Kehl et al. [5] have introduced a dense surface model, where joint parameter estimation is carried out through error minimization between model features and input image features, such as silhouettes, optical flows, and contours. These top-down approaches pose some difficulties with regard to the computation time for the convergence process, initial parameter estimation, and recovery from tracking error. In reality, the initial body parameters are provided manually
in all methods, or users are required to form specific poses at the start frame. Moreover, the estimation is performed by the tracking framework; this implies that the estimation at one frame depends on the results of previous frames. Therefore, the tracking cannot be continued once an error has occurred. To avoid these problems, some methods employ a bottom-up approach. Cheung et al. [6] proposed an algorithm to segment volume data into articulated rigid parts and acquire human kinematics. Other methods directly analyze the input volume data and obtain the medial axes, termed “skeleton”, of the 3D shape. Then, joint positions are estimated from the skeletons. The analysis of each frame is performed by a bottom-up process and is independent of the other frames; this can avoid the problems that occur in the top-down approach. To segment volume data and obtain the skeleton, Chu et al. [7] have used Isomaps. Another approach used by Sundaresan et al. [8] uses the Laplacian eigenspace and fitting of spline curves. During human movement, the topology of the body can change significantly, for example when both hands touch each other or touch the body. This changing topology should be taken into consideration to ensure robust estimation. However, previous studies have only considered a limited number (one or two) of topologies. Our method can be categorized as one of the latter bottom-up methods. Skeletons are extracted from captured volume data and the joint positions are estimated frame-by-frame. In order to consider many topologies in the same manner, example skeletons and a matching framework are used. These steps are explained in the following sections.
2 Proposed Method
Figure 1 shows an overview of our method. First, we capture the time series of the volume data using a visual hull based method (2.1). Then, we apply the volume thinning process and obtain 3D curved lines (2.2). Next, we identify the body parts formed by the curves using the model graph database (2.3). Finally, the joint positions are estimated using the body part information along with time-series curvature analysis of the skeletons (2.4). In contrast to previous methods, our method has the following advantages:
– By adopting a bottom-up approach, we can avoid the difficulty of parameter initialization and having to recover from tracking failure.
– The computational cost of our process is less than that of model-based approaches or subspace projections because we do not use any iterative convergence algorithms.
– The changing topology problem of the human body is solved by developing many possible examples of the human body with a graph matching and decision tree framework. This approach is very general and there are few heuristic rules. We can easily expand this method for identifying many varieties of topologies, noisy data, multiple person tracking, and other articulated objects.
Fig. 1. Algorithm Overview
Fig. 2. Captured volume data and their skeletons
2.1 Volume Reconstruction
We use a voxel-based visual hull algorithm for capturing time-series volume data of the human body [9]. The actor's image regions are detected by using background subtraction. In this step, the CIELab color space is used to remove the shadows on the floor. Here, the weight used for the L value is smaller than that used for a and b. Finally, a visual hull algorithm is applied to reconstruct the target volume data (figure 2).
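A compact sketch of silhouette-based voxel carving is shown below, assuming calibrated 3×4 projection matrices and binary silhouette images are available for all cameras; the voxel-grid layout and the function names are illustrative, not the authors' implementation.

```python
# A sketch of voxel-based visual hull reconstruction: a voxel is kept only if
# it projects inside the silhouette of every camera.
import numpy as np

def visual_hull(voxel_centers, projections, silhouettes):
    """voxel_centers: (N,3) points; projections: list of 3x4 matrices;
    silhouettes: list of binary HxW images. Returns an (N,) occupancy mask."""
    occupied = np.ones(len(voxel_centers), dtype=bool)
    homog = np.hstack([voxel_centers, np.ones((len(voxel_centers), 1))])
    for P, sil in zip(projections, silhouettes):
        uvw = homog @ P.T                    # project all voxels at once
        u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
        v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
        inside = (u >= 0) & (u < sil.shape[1]) & (v >= 0) & (v < sil.shape[0])
        fg = np.zeros(len(voxel_centers), dtype=bool)
        fg[inside] = sil[v[inside], u[inside]] > 0
        occupied &= fg                       # carve away voxels outside this silhouette
    return occupied
```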
2.2 Three Dimensional Volume Thinning
We apply a 3D volume thinning process to the captured data and obtain a skeleton of the human body. Though there are many algorithms to extract skeleton from volume data [10,11], we use Saitoh’s algorithm [10] because it is fast and simple. First, we calculate the depth value (minimum distance between a voxel and the original surface) for each voxel. Then, the surface voxels are sequentially removed beginning from smaller depth voxels. This process is continued until the line width becomes one voxel while preserving the original topology. The resulting skeleton consists of a set of 3D curved lines whose topology is the same as that of the original volume data, passing through the middle of the original shape (figure 2). In this process, some unnecessary short lines are produced. To solve this problem, we first apply a thresholding technique for their removal. In order to remove longer noise lines, we add example graphs into the model graph database as described below.
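The sequential thinning algorithm of [10] is not reproduced here; the sketch below only shows the depth-value computation that drives it (the Euclidean distance from each interior voxel to the nearest surface), using SciPy's distance transform on a binary occupancy volume. The volume contents are placeholders.

```python
# Depth values for the thinning step: the Euclidean distance from each
# foreground voxel to the nearest background voxel.
import numpy as np
from scipy.ndimage import distance_transform_edt

volume = np.zeros((64, 64, 64), dtype=bool)
volume[20:44, 20:44, 10:54] = True          # placeholder occupancy volume

depth = distance_transform_edt(volume)      # 0 outside, grows towards the medial axis
print(depth.max())                          # deepest (most interior) voxel
```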
Fig. 3. Variations in the topology of the human body. (a,b): Two arms are connected at the same position on the body, (c,d): Loop structure, (e,f,g): Some body parts merge into each other.
2.3 Identification of Body Parts Using Model Graph Database
In this step, we identify the body parts formed by curves in the skeleton. Here, we must consider the variation in the topologies of the human body. In this section, we first describe the topology of the human body and then describe a method to identify the topology and body parts based on the input skeleton. The topologies of the skeletons may differ from the normal topology of the human (figure 3-a) in the following ways: 1. arms and feet may not be connected at the same position (figure 3-b), 2. multiple body parts contact with each other (figure 3- c,d), 3. multiple body parts merge into each other (figure 3-e,f). In these cases, we apply a split-and-merge algorithm for some of the curves to determine the joint positions. In addition, multiple skeletons, while having the same topology, may differ in the way body parts touch the body (touch pattern) such as figure 3-b and g. So, we must consider not only the topologies, but also other features of each curved line for proper identification. To solve this problem, we introduce an example based approach (figure 4). We develop several example skeletons with different topologies in advance. They are converted into attributed graphs, and body part IDs are manually assigned to their nodes which are then stored in the model graph database (MGDB). The input skeleton is also converted into an attributed graph and then matched with the example graphs. Based on the node-to-node correspondence, we can identify body parts from the input graph. Skeletons are represented as attributed graphs; the line segments become nodes and the intersection points become links (figure 5). As attribute values for the node, we use the length of the curve, volume of the original 3D volume data, and variance of the depth value among the point in the curve. The length attribute value is the length of the line segment normalized over the whole skeleton. The volume is obtained by summing the squares of the depth values (described in Section 2.2) of line voxels. Normalization is also applied to the volume, resulting in the final volume attribute value. Example graphs are obtained from test sequences and stored into the MGDB. IDs and attribute values are assigned to the nodes in the graph. These graph representations of skeletons are invariant with human posture (joint angles). Thus, we only need to develop a limited number of examples, despite the high number of possible human postures.
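As an illustration of the node attributes described above, the sketch below computes the normalised length, the volume (sum of squared depth values) and the depth variance for one skeleton line segment; the normalisation constants (whole-skeleton totals) are bookkeeping assumptions, not taken from the paper.

```python
# Node attributes for one curved line of the skeleton.
import numpy as np

def node_attributes(segment_depths, total_length, total_volume):
    """segment_depths: depth value of every voxel along one curved line."""
    d = np.asarray(segment_depths, dtype=float)
    length = len(d) / float(total_length)          # length normalised over the skeleton
    volume = np.sum(d ** 2) / float(total_volume)  # normalised volume attribute
    variance = np.var(d)                           # variance of the depth values
    return {"length": length, "volume": volume, "variance": variance}

print(node_attributes([2, 3, 3, 4, 5], total_length=100, total_volume=500.0))
```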
Fig. 4. Left: our body part identification algorithm that uses the MGDB. Right: learning of the decision tree filter.
Two step graph matching algorithm. Graph matching consists of two steps and is performed on the input graph with the example graphs in the MGDB (figure 4-left). First, the topology of the input graph is identified using Messmer’s method [12], which is a combination of graph subdivision and subgraph matching. This algorithm increases the speed of matching by using a network that contains hierarchical relations between subgraphs of the model graphs. However, this matching may produce multiple results and node-to-node correspondences because it compares only the graph topologies. Therefore, we employ decision tree based filters to narrow these down to one correct matching. The Decision Tree. To obtain the correct matching result and node-to-node correspondence, C4.5 decision trees [13] are made for each topology and touch pattern (figure 4-right). During the learning stage of the MGDB, we manually label the nodes in the example graphs and acquire feature vectors using the length and volume of each node. These feature vectors are collected from example graphs with the same touch pattern and used as positive samples (figure 6). We also prepare feature vectors for error patterns, such as the head and a hand being swapped and different touch patterns, and used them as negative samples. Then, C4.5 decision trees are constructed using these samples. During the matching step, several feature vectors are generated from one input graph according to the topology-matching results. After that, they are evaluated by the decision trees of the touch patterns and we can identify the body parts of the input nodes.
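The sketch below illustrates the decision-tree filtering stage, with scikit-learn's CART trees standing in for the C4.5 trees of [13]; the feature layout, tree depth and sample counts are illustrative assumptions rather than the authors' settings.

```python
# Decision-tree filter for one topology / touch pattern: trained on feature
# vectors from correctly labelled example graphs (positive samples) and from
# permuted/incorrect labellings (negative samples); at matching time it keeps
# only the candidate node-to-node correspondences classified as correct.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Each row stacks the (length, volume, variance) attributes of the 7 nodes in
# the body-part order implied by one candidate correspondence.
X_train = np.random.rand(40, 7 * 3)          # placeholder feature vectors
y_train = np.r_[np.ones(20), np.zeros(20)]   # 1 = correct labelling, 0 = error

tree = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)

def filter_candidates(candidate_feature_vectors):
    """Return indices of candidate correspondences accepted by the tree."""
    pred = tree.predict(np.asarray(candidate_feature_vectors))
    return np.nonzero(pred == 1)[0]

print(filter_candidates(np.random.rand(5, 21)))
```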
Fig. 5. Graph representation of skeleton and attribute values
Fig. 6. Making feature vectors from an example graph. Positive samples are enclosed within blue frames.
Fig. 7. Split and merge algorithm in joint estimation. Left: If there are several line segments identified as the body, they are merged together. Right: The loop structure is also removed by the same rule.
2.4 Estimation of Joint Positions
In this step, we estimate the joint positions from the curved lines that are used to identify the body parts. First, we introduce the split and merge algorithm using body part IDs. If a body part is split by connecting points (figure 7-left), they are removed and the line segments of that body part are merged together. If there is a loop structure (figure 7-right), the same rule is applied and we can obtain a one-to-one relation between the body parts and the curved lines. Next, we perform a time-series curvature analysis and determine the joint positions in each curved line. Our algorithm consists of the following steps:
1. The lengths of the curved lines are normalized, and their curvature at each frame is calculated.
2. The curvatures of each body portion are summed up for all frames.
3. The local maxima in the summed curvature graph are determined and considered as joint positions relative to the body part's length. Here, we use the same number of maxima as the number of joints given by the body part attribute value.
Fig. 8. Result of body part identification with subject 1. Top: Input image. Bottom: Skeleton color coded by body part.
4. The absolute joint positions are determined for all the frames according to the relative joint positions in the curved line.
Through these steps, all joints will be determined if they are "bent" at least once within all the video frames. By using the split and merge algorithm and the time-series curvature analysis based on the results of body part identification, we can obtain consistent kinematics of the human body in every frame.
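A sketch of this time-series curvature analysis is given below; the curvature estimate is a simple turning-angle approximation and the peak selection uses SciPy's find_peaks, neither of which is necessarily what the authors implemented.

```python
# Summed curvature over all frames and local-maxima detection for one body part.
import numpy as np
from scipy.signal import find_peaks

def discrete_curvature(points):
    """Turning angle at each interior sample of a 3d polyline (N x 3)."""
    v1 = points[1:-1] - points[:-2]
    v2 = points[2:] - points[1:-1]
    cosang = np.sum(v1 * v2, axis=1) / (
        np.linalg.norm(v1, axis=1) * np.linalg.norm(v2, axis=1) + 1e-12)
    return np.arccos(np.clip(cosang, -1.0, 1.0))

def relative_joint_positions(lines_per_frame, n_joints, n_samples=100):
    """lines_per_frame: list of (n_samples x 3) resampled curves, one per frame."""
    summed = np.zeros(n_samples - 2)
    for pts in lines_per_frame:
        summed += discrete_curvature(pts)
    peaks, props = find_peaks(summed, height=0)
    # keep the n_joints strongest local maxima, as relative positions in [0,1]
    strongest = peaks[np.argsort(props["peak_heights"])[-n_joints:]]
    return np.sort(strongest + 1) / float(n_samples)

frames = [np.cumsum(np.random.rand(100, 3), axis=0) for _ in range(5)]
print(relative_joint_positions(frames, n_joints=3))
```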
3 Experimental Results
We tested our algorithm on 3 subjects (subjects 1, 2, and 3) individually and two subjects (subjects 1 and 4) together. We set up a studio that contained 8 cameras and blue screens. Each camera was connected to a PC and could synchronously capture XGA images at 30 fps. Volume reconstruction was performed at a resolution of 2 cm, and the following processes were done on a single PC. We prepared the MGDB with 190 example graphs (11 topologies) selected from the reconstruction results of several videos and 196 test graphs from these same videos. The body part IDs were manually assigned to the nodes. The results of body part identification with subject 1 are shown in figure 8. Captured images are shown in the top row and the colors of the curves in the bottom row show the identified body parts. The results confirm that our method correctly identifies body parts even for loops and mergers. We can also confirm that this method works well even when unnecessary curves exist, thanks to the prepared example graphs. Figure 9 shows the results of subjects 2 and 3 along with that of subjects 1 and 4 together in the scene. The experimental result also shows that our method correctly identifies the body parts of two people even when their body parts touch each other. 170 graphs out of 196 test graphs were assigned topology and body part IDs matching the manual assignments. The others were assigned incorrect touch patterns. Figure 10 is a curvature graph of four body parts summed up over 291 frames from one video. Local maxima in the curved lines of arms become wrists (relative
Fig. 9. Result of body part identification with subjects 2 and 3, and with 2 subjects together. Top: Input image. Bottom: Color coded skeleton.
Fig. 10. Sum of the curvature of 291 skeletons
position from 9% to 11%), elbows (40% to 43%), and shoulders (84% to 85%), and those found in one of the legs become ankles (7% to 8%), knees (51% to 52%), and thighs (90% to 94%). The estimated joint positions are shown in figure 11. With regard to processing time, it took 5.75 seconds for volume reconstruction, 480 ms for thinning, 94 ms for graph conversion of the skeleton, and 4.2 ms for graph matching and decision tree filtering. Joint estimation needs 40 ms per frame.

3.1 Discussion
The experimental results show that our method can successfully identify body parts even when the topology of the body shape changes in each consecutive frame. Since the algorithm performs the estimation independently for each frame, the estimation in one frame is not affected by errors in previous frames. The system can identify some unnecessary curved lines in the skeletons as noise and ignore them during body part identification. Analyzing the failure cases, we found that sometimes the upper arm is partially merged to the body. Our touch pattern classes did not contain such a
Fig. 11. Result of the pose estimation
pattern; our sample patterns had the upper arm as either completely separate or completely merged with the body. Our example graph-based approach is very easy to expand. As shown in the experiment, we can easily apply our method to multiple people simply by developing new example graphs that express the topologies of multiple human bodies. The time-series curvature analysis effectively obtains consistent kinematics of the human body. Hence, we can obtain joint position data in a manner similar to that of conventional motion capture systems. Our method requires a total processing time of 6.4 s per frame. However, 90% of the total processing time takes place during reconstruction, while the volume thinning process and joint position estimation only require less than 0.7s. Thus, we are attempting to develop faster volume reconstruction and thinning methods to reach real-time estimation. In general, our example based approach is very useful because of its generality and fast processing.
4 Conclusion and Future Study
In this paper, we have proposed a novel markerless motion capture method using volume data. Our method employs a bottom-up analysis of the volume data; therefore, we do not require any initial parameters of the human body, and fast estimation can be achieved by using a 3D volume thinning process. Additionally, we have solved the changing topology problem of the body part identification process using an example-based approach. The experimental results indicate that our method can successfully identify different topologies arising from a single person as well as from multiple people, and can correctly estimate joint positions using time-series curvature analysis. Thus far, it has been difficult to distinguish between left and right body parts because our graph representations do not contain geometric information. To solve this problem, we are currently attempting to incorporate some geometrical attributes into our graph representations.
Acknowledgement
This work is supported by the Ministry of Education, Culture, Sports, Science and Technology under the “Development of fundamental software technologies for digital archives” project.
References 1. Gavrila, D.M.: The visual analysis of human movement: A survey. CVIU 73(1), 82–98 (1999) 2. Moeslund, T.B., Granum, E.: A survey of computer vision-based human motion capture. CVIU 81(3), 231–268 (2001) 3. Miki´c, I., Trivedi, M., Hunter, E., Cosman, P.: Human body model acquisition and tracking using voxel data. International Journal of Computer Vision 53(3), 199–223 (2003) 4. Caillette, F., Howard, T.: Real-Time Markerless Human Body Tracking with MultiView 3-D Voxel Reconstruction. In: Proc. of British Machine Vision Conference, vol. 2, pp. 597–606 (2004) 5. Kehl, M.B.R., Gool, L.V.: Full body tracking from multiple views using stochastic sampling. In: Proc. of CVPR 2005, vol. 2, pp. 129–136 (2005) 6. Cheung, G., Baker, S., Kanade, T.: Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture. In: Proc. of CVPR 2003, vol. 1, pp. 77–84 (2003) 7. Chu, C.W., Jenkins, O.C., Matari´c, M.J.: Markerless kinematic model and motion capture from volume sequences. Proc. of CVPR 2003 2, 475–482 (2003) 8. Sundaresan, A., Chellappa, R.: Segmentation and probabilistic registration of articulated body models. In: Proc. of ICPR 2006, vol. 2, pp. 92–96 (2006) 9. Laurentini, A.: The visual hull concept for silhouette-based image understanding. IEEE Trans. on PAMI 16(2), 150–162 (1994) 10. Saitoh, T., Mori, K., Toriwaki, J.: A sequential thinning algorithm for three dimensional digital pictures using the euclidean distance transformation and its properties. The transactions of the Institute of Electronics, Information and Communication Engineers. D-II J79-D-II (10), 1675–1685 (1996) 11. Brostow, G.J., Essa, I., Steedly, D., Kwatra, V.: Novel skeletal representation for articulated creatures. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3023, pp. 66–78. Springer, Heidelberg (2004) 12. Messmer, B.T., Bunke, H.: A new algorithm for error-tolerant subgraph isomorphism detection. IEEE Trans. on PAMI 20, 493–504 (1998) 13. Quinlan, R.J.: Programs for Machine Learning. Morgan kaufmann Publishers, San Francisco (1993)
Logical DP Matching for Detecting Similar Subsequence Seiichi Uchida1 , Akihiro Mori2 , Ryo Kurazume1 , Rin-ichiro Taniguchi1 , and Tsutomu Hasegawa1 1
2
Kyushu University, Fukuoka, 819-0395, Japan
[email protected] http://human.is.kyushu-u.ac.jp/ Toshiba Corporation, Tokyo, 198-8710, Japan
Abstract. A logical dynamic programming (DP) matching algorithm is proposed for extracting similar subpatterns from two sequential patterns. In the proposed algorithm, local similarity between two patterns is measured by a logical function, called support. The DP matching with the support can extract all similar subpatterns simultaneously while compensating nonlinear fluctuation. The performance of the proposed algorithm was evaluated qualitatively and quantitatively via an experiment of extracting motion primitives, i.e., common subpatterns in gesture patterns of different classes.
1 Introduction
Detection of similar subsequences between two sequential patterns is one of the fundamental problems for sequential pattern processing. Especially, this problem is very important when extracting frequent subsequences. For example, frequent subsequences of gesture patterns, called motion primitives, are widely used for analyzing human activities. In addition, frequent subsequences of biological sequences (such as DNA sequences and protein sequences), called motifs, also play quite an important role in genome science [1,2]. In this paper, a new algorithm, called logical dynamic programming (DP) matching, is proposed for detecting similar subsequences. The proposed algorithm employs a logical function s(i, j), called support, to evaluate similarity between the ith frame of A = a1, ..., ai, ..., aI and the jth frame of B = b1, ..., bj, ..., bJ. Specifically, if a pair of frames ai and bj are similar, s(i, j) = true. Otherwise, s(i, j) = false. Similar subsequences between A and B can be determined as sets of consecutive frame pairs having true supports. The use of the support allows simultaneous detection of all similar subsequences while compensating nonlinear fluctuations in the subsequences. The remaining part of this paper is organized as follows. After reviewing related work in Section 2, the logical DP matching algorithm is described in Section 3. In addition, the logical DP matching algorithm is applied to actual gesture patterns for evaluating its performance on the detection of similar subsequences among them. In Section 4, the performance of the proposed algorithm
is further evaluated quantitatively via an experiment of extracting motion primitives from gesture patterns.
2 Related Work
The algorithms for detecting similar subsequences are classified into two groups. The first group is based on “rigid comparison” between two sequential patterns. In the algorithms of this group, sequential patterns are firstly described by rough representations, such as sequences of a limited number of symbols, by clustering or other quantization techniques, and then identical subsequences are detected. The rough representation is necessary for compensating fluctuations between two sequences. Tanaka et al. [3] transformed a gesture sequence into a symbol sequence for extracting motion primitives under an MDL criterion. A similar algorithm can be found in [4]. A problem of this group is that the rough representation will lose exact boundaries of similar subsequences. In fact, a refinement step is introduced in [4] for finding exact boundaries. The second group is based on “elastic comparison”, where two sequential patterns are compared under an optimized nonlinear matching to compensate fluctuations. The DP matching algorithm, or dynamic time warping (DTW) algorithm, is probably the most popular algorithm in this group. DP matching possesses several useful properties, such as (i) the global optimality of its matching result, (ii) computational feasibility, (iii) high versatility, etc., and thus has been used in various sequential pattern processing tasks, such as speech recognition [5], character recognition [6], and genome science [1,2]. The difference between the proposed and the conventional DP matching algorithms is the definition of frame (dis)similarity. Specifically, the proposed algorithm employs the logical support function s(i, j), whereas the conventional algorithm employs the Euclidean distance between feature vectors of the ith and jth frames. This modification brings the following novel abilities to the DP matching:
– The proposed algorithm can detect strictly similar subsequences. This means that all the paired frames in the similar subsequences are similar. In contrast, similar subsequences by the conventional DP matching algorithms are averagely similar; that is, several paired frames can be dissimilar if other paired frames are very similar. (This is because the conventional algorithm is based on accumulated Euclidean distance.)
– The proposed algorithm can detect all similar subsequences simultaneously.
3 Detecting Similar Subsequence by Logical DP Matching
This section describes the logical DP matching algorithm and its application to the simultaneous detection of similar subsequences. The performance of the algorithm is also discussed via a detection experiment.
Fig. 1. Similar subsequence detection using support s(i, j)
3.1 Logical DP Matching
Support and Supported Path. Each sequential pattern can be represented as a trajectory in a feature space, as shown in Fig. 1. Similar subsequences between two sequential patterns will be observed as regions where two trajectories are close to each other. This closeness is evaluated by the following logical function, called support:

s(i, j) = true   if d(i, j) < θ1
          false  otherwise,    (1)

where θ1 is a positive constant and d(i, j) is a distance between the ith and the jth frames. In the following, we will use the Euclidean distance d(i, j) = ‖ai − bj‖, unless otherwise noted. If two trajectories are close at the ith and the jth frames (namely, if ai and bj are similar), the support s(i, j) = true. Otherwise, s(i, j) = false. The dashed lines in Fig. 1 indicate frame pairs where the two trajectories are close.

Consider the i-j plane of Fig. 2 and assume that the support s(i, j) has been already calculated at each (i, j) node according to (1). Since similar subsequences between A and B can be determined as sets of consecutive frame pairs having true supports, the problem of finding similar subsequences is treated as the problem of finding paths connecting nodes with true supports. Hereafter, this path is called a supported path. A supported path from (is, js) to (ie, je) indicates that subsequences ais, ..., aie and bjs, ..., bje are similar throughout, that is, strictly similar.

DP Recursion. All the supported paths can be found efficiently and simultaneously by a DP-based algorithm. In the algorithm, the following equation, the so-called DP recursion, is calculated at each j from i = 1 to I:

g(i, j) = s(i, j) ∧ ⋁_{k=1}^{3} g_k    (2)

where
g1 = g(i − 1, j − 1)
g2 = s(i, j − 1) ∧ g(i − 1, j − 2)
g3 = s(i − 1, j) ∧ g(i − 2, j − 1)
Fig. 2. Representation of similar subsequences by supported paths. Here, “1” stands for true and “0” for false.
Fig. 3. DP recursion and starting/ending node
and ∨ and ∧ denote logical OR and AND operations, respectively. The function g(i, j) is a logical function taking true or false and indicates whether there is at least one supported path arriving at the node (i, j) or not. The nodes (a), (b), and (c) of Fig. 3 correspond to the nodes with g1, g2, and g3, respectively. The above recursion represents the attempt to extend the supported paths arriving at the nodes (a), (b), and (c) by connecting them to the node (i, j). The starting and the ending nodes of each supported path are also detected while calculating the DP recursion (2). The node (i, j) is marked as a starting node iff g1 = g2 = g3 = false and s(i, j) = true. Similarly, the node (i, j) is marked as an ending node iff g(i, j) = true and s(i + 1, j + 1) = false1. Backtracking from every ending node provides all the supported paths simultaneously. The supported paths shorter than θ2 are eliminated as noise. As shown in Fig. 2, supported paths often branch and share the same starting/ending node. In the following experiment, those branched supported paths are shrunk into a non-branched one by removing less reliable paths. The reliability of each branch is evaluated by the accumulated distance of d(i, j) along the branch. A branch with a higher accumulated distance is considered as a less reliable branch.
Any supported path passing (i, j) will also pass (i + 1, j + 1). Thus, if s(i + 1, j + 1) = false, no path can be extended from (i, j) and therefore (i, j) is an ending node.
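The sketch below implements a simplified version of this detection: the support matrix of eq. (1), starting nodes as supported nodes without a supported predecessor, a diagonal walk along the support to trace each path, and the length threshold θ2. The time-warping moves g2/g3 of eq. (2) and the pruning of branches by accumulated distance are omitted for brevity, so this is an illustration of the idea rather than the full algorithm.

```python
# Simplified similar-subsequence detection based on the support of eq. (1).
import numpy as np

def support_matrix(A, B, theta1):
    """A: (I,d) frames, B: (J,d) frames; boolean support s(i, j) of eq. (1)."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return d < theta1

def detect_similar_subsequences(A, B, theta1, theta2):
    s = support_matrix(A, B, theta1)
    I, J = s.shape

    def sup(i, j):
        return 0 <= i < I and 0 <= j < J and bool(s[i, j])

    def is_start(i, j):
        # no supported predecessor via the moves of eq. (2)
        return s[i, j] and not (sup(i - 1, j - 1)
                                or (sup(i, j - 1) and sup(i - 1, j - 2))
                                or (sup(i - 1, j) and sup(i - 2, j - 1)))

    paths = []
    for i in range(I):
        for j in range(J):
            if is_start(i, j):
                path = [(i, j)]
                while sup(path[-1][0] + 1, path[-1][1] + 1):
                    path.append((path[-1][0] + 1, path[-1][1] + 1))
                if len(path) >= theta2:          # prune short paths (theta2)
                    paths.append(path)           # one pair of similar subsequences
    return paths

A = np.random.rand(90, 6)                        # e.g. 6-d hand-position features
B = np.random.rand(80, 6)
print(len(detect_similar_subsequences(A, B, theta1=0.5, theta2=8)))
```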
Fig. 4. Snapshot of assumed gestures (hurrah, bye, stop, raise hand, point, circle, shrug)
After this removal, only one supported path will start from each starting node and end in each ending node. The computational complexity of the above algorithm is O(IJ). This is the same complexity as the conventional DP matching algorithms.

3.2 Experiment of Detecting Similar Subpatterns Between Gesture Patterns
Experimental Setup. The performance of the proposed algorithm has been evaluated via an experiment of detecting similar subsequences of gesture patterns. Gestures of 18 classes were used in the experiment. Fig. 4 shows snapshots of 7 of them. Each gesture was performed 6 times by one person and thus 108 gesture patterns were prepared. The average frame length was 89. A six-dimensional feature vector was used for representing the performer's posture at each frame. Specifically, the six elements of the vector were the three-dimensional positions of both hands (relative to the head position), which were obtained by stereo measurement with two IEEE1394 cameras. The two parameters θ1 and θ2 were fixed at 200 and 8, respectively. The condition θ1 = 200 implies that the support s(i, j) becomes true if the difference of two postures, i.e., ‖ai − bj‖, is less than 20 cm. The condition θ2 = 8 implies that similar subsequences whose length is less than 500 ms are eliminated as noise.

Detection Result of Similar Subsequences. Fig. 5 shows five detection results. The images at the leftmost column show d(i, j) on the i–j plane and the images at the middle column show the support s(i, j). The images at the rightmost column show the supported paths detected by the proposed algorithm. Fig. 5 (a) is the result when an identical "(left-hand) bye" pattern was used as both A and B. Since A and B were identical, a long diagonal straight supported path was obtained. The other two, more meaningful, supported paths were obtained because a left-to-right hand motion is repeated twice in "bye" and the first motion and the second motion were correctly detected as similar subsequences. Fig. 5 (b) is the result when two different "bye" patterns were used for A and B. The repeated hand motions were correctly detected, as in (a).
Logical DP Matching for Detecting Similar Subsequence distance map
90 80 70 60 50 40 30 20 10 0
"left-hand bye" no.1
"left-hand bye" no.1
0
10 20 30 40 50 60 70 80 90
"left-hand bye" no.1 100
"left-hand bye" no.2
(b)
detected similar sub-sequences
support map
"left-hand bye" no.1
(a)
633
80 60 40 20 0
"left-hand bye" no.1
"left-hand bye" no.1
0
10 20 30 40 50 60 70 80 90
(c)
"raise right-hand"
"left-hand bye" no.1
"right-hand bye"
0
"hurrah"
10 20 30 40 50 60 70 80 90
"right-hand bye" 90 80 70 60 50 40 30 20 10 0
"circle"
(d)
"right-hand bye"
45 40 35 30 25 20 15 10 5 0
"hurrah"
0
20
40
60
80
100
80
100
"hurrah" 70 60 50 40
"stop"
(e)
30 20 10 0
"hurrah"
"hurrah"
0
20
40
60
"hurrah"
Fig. 5. Detection result of similar subsequences
Fig. 5 (c) is the result when "bye" and "raise (right-)hand" were used for A and B, respectively. Those gestures share two common motions. One is the beginning motion, in which the right hand is raised to shoulder height. The other is the ending motion, in which the right hand is lowered from shoulder height. Those common motions were successfully detected as the two supported paths around the beginning and the end of the gestures. Fig. 5 (d) is the result when "hurrah" and "circle" were used for A and B, respectively. Those gestures may seem similar but, actually, do not have any common motion, as shown in Fig. 4. The proposed algorithm could avoid detecting any false positives, i.e., spurious supported paths.
Fig. 6. Statistical extension of support
Fig. 5 (e) is a failure result. Two different gestures "hurrah" and "stop" were used and their common motion ("raising both hands to shoulder height") around their beginning part was not detected. This failure result indicates that the beginning parts of those gestures fluctuate more strongly than other parts. One possible remedy to deal with this non-uniform fluctuation range is a statistical extension of d(i, j). Specifically, as illustrated in Fig. 6, if d(i, j) is evaluated according to the degree of instability, the region where s(i, j) = true will be defined adaptively to the fluctuation range and thus it will be possible to improve the detection accuracy. In fact, the above missed common motion was detected correctly in the experimental result using the Bhattacharyya distance as d(i, j). (The details of this experiment will be presented elsewhere.)
4 Application to Extraction of Motion Primitives
The logical DP algorithm can be applied to the extraction of motion primitives, which are subpatterns constituting gesture patterns. In this section, the performance of the logical DP matching was evaluated qualitatively and quantitatively via an experiment of extracting motion primitives.
4.1 Extracting Motion Primitives by Logical DP Matching
Gesture patterns are often decomposed as a sequence of motion primitives for analyzing how gesture patterns are composed and share common motions. The extraction of the motion primitives can be done by the logical DP algorithm as follows (a code sketch of the procedure is given at the end of this subsection):
1. Apply the logical DP algorithm to a pair of training gesture patterns of the assumed classes.
2. Decompose those two gesture patterns into similar subsequences and the remaining subsequences. All of those subsequences are candidates of motion primitives.
In this paper, we treat subsequences particular to only a certain gesture class as motion primitives. Thus, any gesture pattern can be decomposed into a sequence of motion primitives.
Fig. 7. Snapshot of extracted motion primitives
3. Repeat the above two steps for all pairs.
4. Unify similar subsequences if their distance is smaller than θ3. The subsequences which survive the unification are considered as motion primitives.
Any gesture pattern (from the assumed classes) will be approximately represented as a sequence of the resulting motion primitives. Most past attempts for extracting motion primitives, such as Nakazawa et al. [7] and Fod et al. [8], have assumed that motion primitives can be obtained by segmenting gesture patterns at physically particular points, such as locally minimum speed points and zero-acceleration points. In contrast, the motion primitives extracted by this procedure have two different properties. First, they are extracted without using any physically particular point; that is, they are free from any assumption (or users' prejudice) about motion primitives. Second, the extracted motion primitives totally depend on the assumed classes. In other words, if the classes of training gesture patterns change, the extracted motion primitives will also change. This property is especially useful for gesture recognition tasks, where the number of assumed gestures is often limited.
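The sketch below outlines this extraction procedure; detect_paths and dtw_distance are assumed to be supplied (e.g. the logical DP sketch of Section 3 and a conventional DP matching distance), and the handling of the remaining, class-specific subsequences is only indicated by a comment.

```python
# Pairwise logical DP matching followed by unification of candidate primitives.
from itertools import combinations

def extract_motion_primitives(gestures, detect_paths, dtw_distance, theta3):
    candidates = []
    for A, B in combinations(gestures, 2):
        for path in detect_paths(A, B):
            ia, ib = [p[0] for p in path], [p[1] for p in path]
            candidates.append(A[min(ia):max(ia) + 1])   # similar subsequence in A
            candidates.append(B[min(ib):max(ib) + 1])   # and its counterpart in B
            # (the remaining, class-specific subsequences would be added here too)

    primitives = []
    for cand in candidates:
        if all(dtw_distance(cand, p) >= theta3 for p in primitives):
            primitives.append(cand)                      # survives the unification
    return primitives
```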
Extraction Results of Motion Primitives
For each of the 18 gesture classes, one of the 6 patterns was used as a training pattern for the extraction of motion primitives. According to the above procedure, 142 subsequences were first detected and then unified into 26 motion primitives. Fig. 7 shows snapshots of several motion primitives. Note that the parameter θ3 was determined via a preliminary experiment. For evaluating the extracted motion primitives, the remaining five gesture patterns of each class were decomposed into the motion primitives. If the five gestures are decomposed into the same motion primitive sequence, the validity of the motion primitives can be shown experimentally. The decomposition was done by a recognition-based segmentation algorithm [5,9], which performs segmentation and assignment of each segment to a motion primitive in an optimization framework. Fig. 8 (a) is the decomposition result of the pattern “hurrah.” The five gestures were correctly decomposed into the same motion primitive sequence,³
³ Two subsequences may have different lengths and therefore their distance is evaluated by the conventional DP matching algorithm with Euclidean distance.
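The subsequence distance described in this footnote (conventional DP matching with a Euclidean frame distance) can be sketched as follows, assuming each subsequence is an array of per-frame feature vectors; the names are illustrative.

    import numpy as np

    def dtw_distance(seq_a, seq_b):
        # DP (DTW) matching distance between two subsequences of per-frame
        # feature vectors, using the Euclidean distance between frames.
        n, m = len(seq_a), len(seq_b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    # Two subsequences are unified into one motion primitive if
    # dtw_distance(s1, s2) < theta3 (threshold chosen in a preliminary experiment).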
Fig. 8. Segmentation of gesture patterns by the extracted motion primitives (motion primitive ID over time): (a) “hurrah” no. 2–6, (b) “left-hand bye” no. 2–6
0→12→1→0→12→1. (Those numbers correspond to the motion primitives shown in Fig. 7.) As shown in Fig. 8 (b), the five gestures of “(left-hand) bye” were also correctly decomposed into the same sequence, 7→22→9→22→8. The same decomposition results were obtained for 13 of the 18 classes. This stable result indicates that the motion primitives extracted by the procedure of Section 4.1 validly represent common and particular motions. Thus, this result also shows reasonable accuracy of the logical DP algorithm. The failure results, i.e., different decomposition results within the same class, were mainly due to lax unification in selecting motion primitives. For example, several subsequences of “lowering both hands from shoulder height” survived as motion primitives. This fact indicates that some gesture subsequences fluctuate more than others and thus the distance between them exceeded θ3. The statistical extension pointed out in Section 3.2 will also be useful for tackling this fluctuation problem.
5 Conclusion
A logical DP matching algorithm has been proposed for detecting similar subsequences between two sequential patterns, such as gesture patterns. The algorithm
was examined via an experiment of extracting motion primitives from a set of gesture patterns. The result of the experiment showed that the proposed algorithm could detect the similar subsequences among gesture patterns successfully and provide stable motion primitives. Future work will focus on (i) a statistical extension of d(i, j), (ii) application to sequential patterns other than gesture, and (iii) utilization of the extracted motion primitives for practical tasks.
References 1. Durbin, R., Eddy, S., Krogh, A., Mitchison, G.: Biological sequence analysis. Cambridge University Press, Cambridge (1998) 2. Mount, D.: Bioinformatics: sequence and genome analysis, 2nd edn. Cold Spring Harbor Laboratory Press (2004) 3. Tanaka, Y., Iwamoto, K., Uehara, K.: Discovery of time-series motif from multidimensional data based on MDL principle. Machine Learning 58, 269–300 (2005) 4. Zhao, T., Wang, T., Shum, H.-Y.: Learning a highly structured motion model for 3D human tracking. In: Proc. Asian Conf. Comp. Vis. pp. 144–149 (2002) 5. Ney, H., Ortmanns, S.: Progress in dynamic programming search for LVCSR. Proc. IEEE 88(8), 1224–1240 (2000) 6. Uchida, S., Sakoe, H.: A survey of elastic matching techniques for handwritten character recognition. IEICE Trans. Inf. Syst. E88-D(8), 1781–1790 (2005) 7. Nakazawa, A., Nakaoka, K., Ikeuchi, S., Yokoi, K.: Imitating human dance motions through motion structure analysis. In: Proc. Int. Conf. Intell. Robots Syst. pp. 2539–2544 (2002) 8. Fod, A., Mataric, M.J., Jenkins, O.C.: Automated derivation of primitives for movement classification. Autonomous Robots 12(1), 39–54 (2002) 9. Casey, R.G., Lecolinet, E.: A survey of methods and strategies in character segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 18(7), 690–706 (1996)
Efficient Normalized Cross Correlation Based on Adaptive Multilevel Successive Elimination Shou-Der Wei and Shang-Hong Lai Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan {greco,lai}@cs.nthu.edu.tw
Abstract. In this paper we propose an efficient normalized cross correlation (NCC) algorithm for pattern matching based on adaptive multilevel successive elimination. This successive elimination scheme is applied in conjunction with an upper bound for the cross correlation derived from Cauchy-Schwarz inequality. To apply the successive elimination, we partition the summation of cross correlation into different levels with the partition order determined by the gradient energies of the partitioned regions in the template. Thus, this adaptive multi-level successive elimination scheme can be employed to early reject most candidates to reduce the computational cost. Experimental results show the proposed algorithm is very efficient for pattern matching under different lighting conditions. Keywords: Pattern matching, normalized cross correlation, successive elimination, multi-level successive elimination, fast algorithms.
1 Introduction

Pattern matching is widely used in many applications related to computer vision and image processing, such as object tracking, object detection, pattern recognition and video compression, etc. The pattern matching problem can be formulated as follows: given a source image I and a template image T of size M × N, the pattern matching problem is to find the best match of template T in the source image I with minimum distortion or maximum correlation. The most popular similarity measures are the sum of absolute differences (SAD), the sum of squared differences (SSD) and the normalized cross correlation (NCC). For some applications, such as block motion estimation in video compression, the SAD and SSD measures have been widely used. For practical applications, a number of approximate block matching methods have been proposed [1][2][3] and some optimal block matching solutions have been proposed [4][5][6], which have the same solution as that of full search but with fewer operations by using early termination in the computation of SAD, given by

SAD(x, y) = \sum_{i=1}^{M} \sum_{j=1}^{N} |T(i, j) - C(x + i, y + j)|    (1)
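The early-termination idea mentioned above can be sketched as follows (a minimal illustration, not the exact method of [4][5][6]): the partial SAD is compared against the best value found so far and the candidate is abandoned as soon as it can no longer win.

    import numpy as np

    def sad_early_terminate(T, C_block, sad_min):
        # Accumulate |T - C| row by row and stop as soon as the partial sum
        # already exceeds the best (smallest) SAD found so far.
        sad = 0.0
        for row_t, row_c in zip(T, C_block):
            sad += np.abs(row_t.astype(np.float64) - row_c.astype(np.float64)).sum()
            if sad >= sad_min:          # this candidate can no longer win
                return None
        return sad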
In [7], a coarse-to-fine pruning algorithm with the pruning threshold determined from the lower-resolution search space was presented. This search algorithm can be proved to provide the global solution. Hel-Or and Hel-Or [8] proposed a fast template matching method based on accumulating the distortion in the Walsh-Hadamard domain in the order of the frequency of the Walsh-Hadamard basis. In general, a small number of the first few projections can capture most of the distortion energy. By using a predefined threshold, they can early reject most of the impossible candidates very efficiently. Besides the SAD and SSD, the NCC is also a popular similarity measure. The NCC measure is more robust than SAD and SSD under linear illumination changes, so the NCC measure has been widely used in object recognition and industrial inspection. The definition of NCC is given as follows:
NCC(x, y) = \frac{\sum_{i=1}^{M} \sum_{j=1}^{N} I(x + i, y + j) \cdot T(i, j)}{\sqrt{\sum_{i=1}^{M} \sum_{j=1}^{N} I(x + i, y + j)^2} \cdot \sqrt{\sum_{i=1}^{M} \sum_{j=1}^{N} T(i, j)^2}}    (2)
The traditional NCC needs to compute the numerator and denominator, which is very time-consuming. Later, the sum table scheme [11][12][13] was proposed to reduce the computation in the denominator, and some other methods have been proposed [9][10] to reduce the computation in the numerator. In this paper, we propose an efficient NCC algorithm based on an adaptive MSEA procedure. The adaptive MSEA scheme determines the elimination order based on the sum of the gradient magnitudes of the template and adapts the bound value derived from the Cauchy-Schwarz inequality to early reject the candidates. The rest of this paper is organized as follows: we first briefly review the successive elimination algorithm (SEA)-based methods [4][5] as well as the upper bound for the cross correlation derived from the Cauchy-Schwarz inequality [9][10]. Then, we present the proposed efficient NCC algorithm that performs the adaptive MSEA in Section 3. The experimental results are shown in Section 4. Finally, we conclude this paper in the last section.
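For reference, the sum-table idea used throughout the paper can be sketched as follows: two integral images give the sum and square sum of any M × N block with a handful of additions. The padding convention and names are implementation choices, not the authors' code.

    import numpy as np

    def build_sum_tables(I):
        # Pre-compute integral images of I and I**2 (one extra row/column of zeros).
        I = I.astype(np.float64)
        S = np.zeros((I.shape[0] + 1, I.shape[1] + 1))
        S2 = np.zeros_like(S)
        S[1:, 1:] = I.cumsum(0).cumsum(1)
        S2[1:, 1:] = (I ** 2).cumsum(0).cumsum(1)
        return S, S2

    def block_sum(S, x, y, M, N):
        # Sum of the M x N block whose top-left pixel is (x, y) -- 4 table lookups.
        return S[x + M, y + N] - S[x, y + N] - S[x + M, y] + S[x, y]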
2 Previous Works

The Successive Elimination Algorithm (SEA) [4] used an upper bound for a block sum difference as the criterion to eliminate the impossible candidate blocks and reduce the computation of motion estimation based on the SAD criterion. Suppose (u, v) is the current optimal motion vector in the previous search process, and the corresponding SAD value is denoted by SAD_min. According to the inequality |a + b| ≤ |a| + |b|, where a, b ∈ ℝ, it can easily be shown that the following relation holds:

|\bar{T} - \bar{C}(u, v)| \le SAD(u, v).    (3)
where \bar{T} and \bar{C}(u, v) represent the sums of the image intensities for the template and the window of the source image I at the position (u, v), respectively, and SAD(u, v) denotes the corresponding SAD value computed for the window at the position (u, v). From inequality (3), we can conclude that a candidate corresponding to the source image I at position (u, v) cannot be a better-matching block if |\bar{T} - \bar{C}(u, v)| ≥ SAD_min. If SAD(u, v) is less than SAD_min, then SAD_min and the best motion vector are replaced by SAD(u, v) and (u, v), respectively. The boundary value BV = |\bar{T} - \bar{C}(u, v)| can be considered as the elimination criterion. For each candidate block, we can repeat this procedure to prune out a large portion of candidates. At the end, we can still find the best motion vector. The drawback of SEA is that the difference of block sums is not close enough to the SAD. Gao et al. extended the SEA to a multilevel successive elimination algorithm (MSEA) [5] that provides tighter and tighter boundary values from the lowest level to the highest level, as depicted in Figure 1. The relation of the boundary values at different levels is BV_0 ≤ BV_1 ≤ ⋯ ≤ BV_{\log_2 N} = SAD. MSEA builds an image pyramid structure of the current and reference blocks with L = \log_2 N levels. Using only level zero to eliminate impossible candidates is the same as the SEA, and the boundary value of the final level is the same as the SAD value.
Fig. 1. The levels of elimination order in MSEA. Using only level 0 as the elimination criterion is the same as SEA, and the final level is the same as SAD.
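A compact sketch of the level-0 (SEA) test over all candidate positions, using a sum table to obtain every block sum; the names are illustrative, not the authors' implementation.

    import numpy as np

    def sea_candidate_mask(I, T, sad_min):
        # Level-0 SEA test: keep only candidates (x, y) whose block-sum difference
        # |sum(T) - sum(C(x, y))| is below the current best SAD.
        M, N = T.shape
        S = np.zeros((I.shape[0] + 1, I.shape[1] + 1))
        S[1:, 1:] = I.astype(np.float64).cumsum(0).cumsum(1)
        # block sums of every M x N window, 4 table lookups each (vectorized)
        sums = S[M:, N:] - S[:-M, N:] - S[M:, :-N] + S[:-M, :-N]
        return np.abs(T.sum() - sums) < sad_min        # True = still a candidate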
Although the similarity measure of NCC is more robust than SAD, the computational cost of NCC is very high. The technique of the sum table [11][12][13] can be used to reduce the computation of the denominator: any block sum in the source image can be obtained with 4 operations from the pre-computed sum table. To reduce the computational cost of the numerator, Di Stefano and Mattoccia [9][10] derived upper bounds for the cross correlation based on Jensen's and the Cauchy-Schwarz inequalities to early terminate some impossible search points. Because the bound is not tight enough, they partitioned the image into two blocks and computed the partial cross correlation for the first block (from row 1 to row k), with the other block (from row k+1 to row N) bounded by the upper bound. Then they used the SEA scheme to reject the impossible candidates successively while increasing the size of the first block. From the Cauchy-Schwarz inequality [10] given in equation (4), the upper bound (UB) of the numerator, i.e., the cross correlation, can be derived as in equation (5), and the boundary value of NCC is given in equation (6).

\sum_{i=1}^{N} a_i \cdot b_i \le \sqrt{\sum_{i=1}^{N} a_i^2} \cdot \sqrt{\sum_{i=1}^{N} b_i^2}    (4)

UB(x, y) = \sum_{i=1}^{M} \sum_{j=1}^{k} I(x + i, y + j) \cdot T(i, j) + \sqrt{\sum_{i=1}^{M} \sum_{j=k+1}^{N} I(x + i, y + j)^2} \cdot \sqrt{\sum_{i=1}^{M} \sum_{j=k+1}^{N} T(i, j)^2}
\ge \sum_{i=1}^{M} \sum_{j=1}^{k} I(x + i, y + j) \cdot T(i, j) + \sum_{i=1}^{M} \sum_{j=k+1}^{N} I(x + i, y + j) \cdot T(i, j)
= \sum_{i=1}^{M} \sum_{j=1}^{N} I(x + i, y + j) \cdot T(i, j)    (5)

BV(x, y) = \frac{UB(x, y)}{\|I(x, y)\| \cdot \|T\|}    (6)

where \|I(x, y)\| and \|T\| are the norms of the candidate window and the template (cf. the denominator of Eq. (2)). Similar to the SEA scheme, the candidate at the position (x, y) of image I will be rejected if BV(x, y) < NCC_max, and NCC_max will be updated as NCC(x, y) if NCC(x, y) > NCC_max.
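A sketch of the bound test of equations (4)-(6): the cross correlation is computed exactly over the first k columns (index j in Eq. (5)) and bounded via Cauchy-Schwarz over the rest. For clarity the sums are evaluated directly on the window; a real implementation would take them from pre-computed sum tables.

    import numpy as np

    def bounded_ncc_upper_bound(I, T, x, y, k):
        # Upper bound BV(x, y) on NCC(x, y): exact correlation over columns 1..k,
        # Cauchy-Schwarz bound over columns k+1..N (cf. equations (4)-(6)).
        M, N = T.shape
        C = I[x:x + M, y:y + N].astype(np.float64)
        Tf = T.astype(np.float64)
        partial_cc = (C[:, :k] * Tf[:, :k]).sum()
        ub_rest = np.sqrt((C[:, k:] ** 2).sum()) * np.sqrt((Tf[:, k:] ** 2).sum())
        return (partial_cc + ub_rest) / (np.linalg.norm(C) * np.linalg.norm(Tf))

    # A candidate (x, y) is skipped whenever bounded_ncc_upper_bound(...) < ncc_max.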
3 Adaptive Successive Elimination for NCC

The cross correlation can be bounded by the Cauchy-Schwarz inequality as described above, but the bound is not tight enough. As equation (7) shows, if we divide the block into many sub-blocks and sum the upper bounds of the sub-blocks, we obtain a tighter bound. Following the partitioning scheme of MSEA, we have upper bounds for different partitioning levels, and the relation between the upper bounds of different levels is given in equations (8) and (9). At the final level, the upper bound is equal to the cross correlation, as shown below.

\sqrt{\sum_{i=1}^{N} a_i^2} \cdot \sqrt{\sum_{i=1}^{N} b_i^2} \ge \sqrt{\sum_{i=1}^{k} a_i^2} \cdot \sqrt{\sum_{i=1}^{k} b_i^2} + \sqrt{\sum_{i=k+1}^{N} a_i^2} \cdot \sqrt{\sum_{i=k+1}^{N} b_i^2} \ge \sum_{i=1}^{N} a_i \cdot b_i    (7)

UB_l(x, y) = \sum_{a \in AllSubblocks} \left( \sqrt{\sum_{a_i \in AllPixels} I_{a_i}(x, y)^2} \cdot \sqrt{\sum_{a_i \in AllPixels} T_{a_i}^2} \right)    (8)

UB_0 \ge UB_1 \ge \cdots \ge UB_{L = \log_2 N} = CC    (9)
The MSEA scheme can provide tighter and tighter upper bounds as the partitioning level increases, but there are at most L = log 2 N levels. If we increase the partitioning levels we have better chance to early reject most candidates. The upper bound of a block
is determined by the summation of squared pixel values. If the block is homogeneous, the upper bound is close to the cross correlation value; thus, partitioning within a homogeneous area has little chance of rejecting non-optimal candidates and only increases the operation count for measuring the similarity. These unnecessary operations should be avoided. In other words, blocks with large intensity variance normally contain more details: the block sum cannot represent the details of such a block, so partitioning a block with larger variance may produce a larger decrease in the upper bound value. Consequently, in order to obtain a tighter bound in the early stage, it is reasonable to partition the blocks with large variances into sub-blocks first. We propose the adaptive MSEA algorithm for adaptive block partitioning and successive elimination for NCC. For simplicity, we determine the elimination order by the sum of gradient magnitudes of the sub-blocks in the template instead of their variances. The block with the currently largest sum of gradient magnitudes is divided into 2x2 sub-blocks for consideration of further partitioning. It should be noted that a block will not be further partitioned into 2x2 sub-blocks if its sum of gradient magnitudes is less than a given threshold T. This partitioning process is repeated until the gradient magnitude sums of all sub-blocks are less than the threshold. Figure 2 depicts an example of block partitioning by the proposed algorithm. The adaptive algorithm for determining the block partitioning order is given in Algorithm 1.

Fig. 2. An example of the block partitioning order

Algorithm 1. Algorithm for determining the adaptive block partitioning order
Push the largest block into the queue
REPEAT
1. Select the block with the largest sum of gradient magnitudes from the queue.
2. Divide the selected block into four sub-blocks and calculate their sums of gradient magnitudes.
3. Check the four sub-blocks and push each sub-block into the queue if its sum of gradient magnitudes is greater than a given threshold T.
UNTIL the queue is empty
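Algorithm 1 can be sketched with a max-priority queue keyed by the sum of gradient magnitudes; the gradient map and the threshold are assumed to be given, and the order in which blocks are popped is the partitioning (elimination) order used later. All names are illustrative.

    import heapq

    def adaptive_partition_order(grad_mag, threshold):
        # grad_mag: 2-D array of gradient magnitudes of the template.
        # Returns blocks (top, left, height, width) in the order they are split.
        H, W = grad_mag.shape
        heap = [(-grad_mag.sum(), (0, 0, H, W))]      # max-heap via negated keys
        order = []
        while heap:
            _, (top, left, h, w) = heapq.heappop(heap)
            if h < 2 or w < 2:
                continue                              # cannot be split further
            order.append((top, left, h, w))
            h2, w2 = h // 2, w // 2
            children = [(top, left, h2, w2), (top, left + w2, h2, w - w2),
                        (top + h2, left, h - h2, w2), (top + h2, left + w2, h - h2, w - w2)]
            for (t, l, ch, cw) in children:
                s = grad_mag[t:t + ch, l:l + cw].sum()
                if s > threshold:                     # only promising blocks are queued
                    heapq.heappush(heap, (-s, (t, l, ch, cw)))
        return order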
With the block partitioning order obtained by using the above algorithm, we have the relation of the upper bounds for different levels as UB_0 ≥ UB_1 ≥ ⋯ ≥ UB_{maxL} ≥ CC. We can calculate the boundary values from equation (6) and have the relation of the boundary values of different levels as BV_0 ≥ BV_1 ≥ ⋯ ≥ BV_{maxL} ≥ NCC. The BV_l value is closer to NCC as the level increases. If BV_l(x, y) < NCC_max, the candidate at the position (x, y) is rejected; else, if NCC(x, y) > NCC_max, NCC_max is updated by NCC(x, y). The following is the proposed adaptive multi-level elimination algorithm for quickly finding the position with the optimal NCC in pattern matching.

Algorithm 2. The proposed fast NCC pattern matching algorithm
Step 1: Determine the elimination order by Algorithm 1.
Step 2: Calculate the norm of the template ||T||.
Step 3: Calculate the initial NCC_max = NCC(template, initial candidate).
Step 4: Compute the integral image for the square of the search image I.
For each candidate C(x, y) do
  Step 5: Calculate the norm of the current candidate ||C(x, y)|| from the integral image.
  Step 6: Repeat
    1. Retrieve the next partitioning level l from the queue.
    2. Calculate UB_l for level l and compute BV_l = UB_l / (||T|| ||C(x, y)||).
    3. Reject the candidate if BV_l < NCC_max.
  Until the queue is empty.
  Step 7:
    1. If the candidate passes the criteria of all levels, calculate NCC_new = NCC(T, C(x, y)).
    2. If NCC_new > NCC_max, update NCC_max by NCC_new.
End For
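Putting the pieces together, a simplified sketch of Algorithm 2 is given below; it reuses the partitioning order of the previous sketch, evaluates the level bounds directly on the image instead of from the sum tables of Step 4, and all names are illustrative.

    import numpy as np

    def ncc(I, Tf, x, y):
        C = I[x:x + Tf.shape[0], y:y + Tf.shape[1]].astype(np.float64)
        return (C * Tf).sum() / (np.linalg.norm(C) * np.linalg.norm(Tf))

    def adaptive_msea_ncc(I, T, order):
        # 'order' is the block partitioning order produced by Algorithm 1.
        M, N = T.shape
        Tf = T.astype(np.float64)
        norm_T = np.linalg.norm(Tf)                              # Step 2
        best_pos, ncc_max = (0, 0), ncc(I, Tf, 0, 0)             # Step 3 (initial guess)
        for x in range(I.shape[0] - M + 1):
            for y in range(I.shape[1] - N + 1):
                C = I[x:x + M, y:y + N].astype(np.float64)
                norm_C = np.linalg.norm(C)                       # Step 5
                blocks, rejected = [(0, 0, M, N)], False
                for blk in order:                                # Step 6: refine one block per level
                    blocks.remove(blk)
                    t, l, h, w = blk
                    h2, w2 = h // 2, w // 2
                    blocks += [(t, l, h2, w2), (t, l + w2, h2, w - w2),
                               (t + h2, l, h - h2, w2), (t + h2, l + w2, h - h2, w - w2)]
                    ub = sum(np.linalg.norm(C[a:a + c, b:b + d]) *
                             np.linalg.norm(Tf[a:a + c, b:b + d])
                             for (a, b, c, d) in blocks)         # Eq. (8)
                    if ub / (norm_T * norm_C) < ncc_max:         # BV_l < NCC_max
                        rejected = True
                        break
                if not rejected:                                 # Step 7: exact NCC
                    val = (C * Tf).sum() / (norm_T * norm_C)
                    if val > ncc_max:
                        ncc_max, best_pos = val, (x, y)
        return best_pos, ncc_max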
We can also apply the proposed scheme to the zero mean normalized cross correlation (ZNCC) by rewriting it in the form of equation (10). Note that the terms \sum a_i, \sum a_i^2, \sum b_i and \sum b_i^2 in the equation can be calculated very efficiently from integral images.

ZNCC = \frac{\sum_{i=1}^{n} (a_i - \bar{a}) \cdot (b_i - \bar{b})}{\sqrt{\sum_{i=1}^{n} (a_i - \bar{a})^2} \cdot \sqrt{\sum_{i=1}^{n} (b_i - \bar{b})^2}} = \frac{\sum_{i=1}^{n} a_i \cdot b_i - n \bar{a} \bar{b}}{\sqrt{\left(\sum_{i=1}^{n} a_i^2 - n \bar{a}^2\right) \cdot \left(\sum_{i=1}^{n} b_i^2 - n \bar{b}^2\right)}}    (10)

\bar{a} = \frac{1}{n} \sum_{i=1}^{n} a_i, \qquad \bar{b} = \frac{1}{n} \sum_{i=1}^{n} b_i    (11)
4 Experimental Results In this section, we show the efficiency improvement of the proposed adaptive MSEA algorithm for NCC-based pattern matching. The proposed algorithm adaptively partitioned the image block into many subblocks to obtain tighter upper bounds for the cross correlation. To compare the efficiency of the proposed algorithm termed AdaMSEA_NCC, we also implement the multi-level SEA with fixed partitioning scheme and the results are termed as MSEA_NCC. In our experiment, we use the Lenna
Fig. 3. (a) The original Lenna image and (b) the noisy Lenna image added with Gaussian noise with σ = 8
Fig. 4. (a), (b), (c): The template images. (d), (e), (f): their brighter versions.
Fig. 5. The template images of size 128x128
image of size 512-by-512 as the source image and six template images of size 64x64 inside the Lenna image as shown in Figure 3 and Figure 4, respectively. Figure 4(d)(e)(f) is the brighter version (increase 50% brightness) of Figure 4(a)(b)(c). To compare the robustness and the efficiency of the proposed algorithms, we add random Gaussian noises with σ =8 onto the search image shown in figure (b) and compare the performance of the pattern search on the noisy images. The experimental results of proposed algorithms and the original NCC are shown in Table 1 and 2. All these three algorithms used the sum table to reduce the computation of denominator in equation (2). For efficiently calculating the bound of the numerator, we also used the approach of BSAP[6] to build two block square sum pyramids for intensity image and the gradient map, respectively. The execution time shown in section includes the time of memory allocation for sum table and pyramids, and building sum table, pyramids and the gradient map. Table 3 shows the experimental results of applying different template of size 128x128 as shown in Figure 5. All of these experimental results show the significantly improved efficiency of the proposed adaptive MSEA algorithm for the NCC-based pattern matching compared to the previous MSEA algorithm. Table 1. The execution time of applying traditional NCC, MSEA_NCC and AdaMSEA_NCC on six templates shown in Figure 4(a)~(f) and the source image shown in Figure 3(a). It should be notable the NCC algorithm used the sum table to reduce the computation in the denominator of NCC.
              T(a)       T(b)   T(c)   T(d)   T(e)   T(f)
NCC           3235 (ms)  3235   3235   3235   3235   3235
MSEA_NCC      688        453    203    515    329    203
AdaMSEA_NCC   297        188    219    125    172    141
Table 2. The execution time of applying traditional NCC, MSEA_NCC and AdaMSEA_NCC on the six templates shown in Figure 4(a)~(f) and the noisy source image shown in Figure 3(b). It should be noted that the NCC algorithm used the sum table to reduce the computation in the denominator of NCC.
              T(a)       T(b)   T(c)   T(d)   T(e)   T(f)
NCC           3235 (ms)  3235   3235   3235   3235   3235
MSEA_NCC      1672       1485   265    1687   1515   313
AdaMSEA_NCC   594        359    141    546    438    172
Table 3. The execution time of applying traditional NCC, MSEA_NCC and AdaMSEA_NCC on the three templates shown in Figure 5(a)(b)(c) and the source image shown in Figure 3(a). It should be noted that the NCC algorithm used the sum table to reduce the computation in the denominator of NCC.
              T(a)       T(b)   T(c)
NCC           9813 (ms)  9813   9813
MSEA_NCC      2250       1032   469
AdaMSEA_NCC   844        407    234
5 Conclusion In this paper, we proposed a very efficient adaptive MSEA algorithm for fast pattern matching in an image based on normalized cross correlation. To achieve more effective successive elimination, we partition the summation of cross correlation into different levels with the partition order adaptively determined by the sum of gradient magnitudes for each partitioned regions in the template. The experimental results show the proposed adaptive MSEA algorithm is very efficient and robust for pattern matching under linear illumination change and noisy environments. Acknowledgments. This research work was supported in part by National Science Council, Taiwan, under grant 95-2220-E-007-028.
References 1. Zhu, S., Ma, K.K.: A new diamond search algorithm for fast block-matching motion estimation. Image Processing 9(2), 287–290 (2000) 2. Li, R., Zeng, B., Liou, M.L.: A new three-step search algorithm for block motion estimation. IEEE Trans. Circuits Systems Video Technology 4(4), 438–442 (1994) 3. Po, L.M., Ma, W.C.: A novel four-step search algorithm for fast block motion estimation. IEEE Trans. Circuits Systems Video Technology 6(3), 313–317 (1996) 4. Li, W., Salari, E.: Successive elimination algorithm for motion estimation. IEEE Trans. Image Processing 4(1), 105–107 (1995) 5. Gao, X.Q., Duanmu, C.J., Zou, C.R.: A multilevel successive elimination algorithm for block matching motion estimation. IEEE Trans. Image Processing 9(3), 501–504 (2000) 6. Lee, C.-H., Chen, L.-H.: A fast motion estimation algorithm based on the block sum pyramid. IEEE Trans. Image Processing 6(11), 1587–1591 (1997) 7. Gharavi-Alkhansari, M.: A fast globally optimal algorithm for template matching using low-resolution pruning. IEEE Trans. on Image Processing 10(4), 526–533 (2001) 8. Hel-Or, Y., Hel-Or, H.: Real-time pattern matching using projection kernels. IEEE Trans. Pattern Analysis and Machine Intelligence 27(9), 1430–1445 (2005) 9. Di Stefano, L., Mattoccia, S.: Fast Template Matching using Bounded Partial Correlation. Machine Vision and Applications (JMVA) 13(4), 213–221 (2003) 10. Di Stefano, L., Mattoccia, S.: A sufficient condition based on the Cauchy-Schwarz inequality for efficient template matching. In: Proc. Intern. Conf. Image Processing (2003) 11. Lewis, J.P.: Fast template matching Vision Interface, pp. 120–123 (1995) 12. Mc. Donnel, M.: Box-filtering techniques. Computer Graphics and Image Processing 17, 65–70 (1981) 13. Viola, P., Jones, M.: Robust real-time object detection. In: Proceeding of International Conf. on Computer Vision Workshop Statistical and Computation Theories of Vision (2001)
Exploiting Inter-frame Correlation for Fast Video to Reference Image Alignment Arif Mahmood and Sohaib Khan Department of Computer Science, Lahore University of Management Sciences, Lahore, Pakistan {arifm,sohaib}@lums.edu.pk
Abstract. Strong temporal correlation between adjacent frames of a video signal has been successfully exploited in standard video compression algorithms. In this work, we show that the temporal correlation in a video signal can also be used for fast video to reference image alignment. To this end, we first divide the input video sequence into groups of pictures (GOPs). Then for each GOP, only one frame is completely correlated with the reference image, while for the remaining frames, upper and lower bounds on the correlation coefficient (ρ) are calculated. These newly proposed bounds are significantly tighter than the existing Cauchy-Schwartz inequality based bounds on ρ. These bounds are used to eliminate the majority of the search locations, resulting in significant speedup without affecting the value or location of the global maxima. In our experiments, up to 80% of the search locations are eliminated, and the speedup is up to five times that of the FFT based implementation and up to seven times that of the spatial domain techniques.
1 Introduction
A digital video signal consists of a sequence of frames and is usually characterized by strong temporal correlation between adjacent frames. This correlation has been successfully exploited in standard video codecs to achieve significant compression [1]. We show that the temporal correlation of a video sequence can also be used for fast video to reference image alignment. This results in an efficient, close to real time implementation of a number of applications in computer vision that use video to image alignment as a key component. Such applications include automatic camera tracking, model based landmark extraction, activity monitoring and video geo-registration [2]. Although video to reference image alignment involves pattern matching, it is inherently different from block matching for motion compensation as used in video codecs [3]. This is because block matching algorithms for video codecs are applied to video frames that appear temporally close to each other and are acquired by the same sensor, hence the level of dissimilarity is low. On the other hand, video to reference image alignment requires pattern matching between frames in a video signal and a reference image. These two signals are usually
Fig. 1. Every GOP must have at least one C-Frame while the remaining frames in the GOP are B-Frames. A GOP of length 5 frames is shown.
acquired at different times, under different illumination conditions, and by using sensors with different spectral responses, resulting in high dissimilarity between them. The inherent differences between block matching in video coding and video to reference image alignment renders the standard matching techniques [4], [5] used in the former to be inaccurate for the latter. In contrast, the correlation coefficient, which is usually criticized for its high complexity for block matching applications [6], [7], turns out to be a more accurate and robust similarity measure for video to reference image alignment [8], [9], [10]. While a number of schemes have been investigated to reduce the time complexity of the correlation, elimination of potential matching locations based on correlation bounds, to the best of our knowledge, has not been previously considered. In this work, we propose an elimination algorithm for reducing the number of search locations in correlation coefficient based video to reference image alignment. This reduction results in significant speedup over the currently used correlation coefficient based image alignment techniques. The reduction in search space is based upon the newly proposed bounds on the correlation coefficient, which are significantly tighter than the currently known Cauchy-Schwartz inequality based bounds. During the search space reduction process, elimination of a search location takes place when the upper bound on correlation coefficient at that location turns out to be smaller than the currently known maxima. In order to implement the elimination algorithm, we divide the input video sequence into groups of pictures (GOPs). In each GOP, only one frame is completely correlated with the reference image. For the remaining frames, the proposed bounds are evaluated at all search locations and unsuitable search locations are identified and eliminated from the search space. In the elimination algorithm, the length of GOP is an important parameter. An algorithm for automatic detection of optimal length of GOP is also proposed. In our experiments, the optimal length of GOP is found to be 7 frames. For this optimal length, on the average, 81% search locations are found to be eliminated. The execution time is compared
with the FFT [11], [12] based frequency domain implementation and the spatial domain implementations including the Bounded Partial Correlation (BPC) [13] technique. The execution time speedup is up to 7 times the spatial domain correlation, 5 times the FFT based implementation and 2.5 times the BPC based implementation.
2 Problem Definition
We consider a digital video signal as a sequence of frames F indexed at discrete time k, each of size m × n pixels. Each frame F^k is to be matched at all valid search locations in the reference image R of size s × t pixels. For the purpose of matching, the reference image R is considered to be divided into overlapping rectangular blocks R_{i_o j_o}, each of size m × n pixels, where (i_o, j_o) are the coordinates of the first pixel of the reference block. Each of the reference blocks R_{i_o j_o} is a valid candidate search location. The similarity measure used to match the frame F^k with search location R_{i_o j_o} is the correlation coefficient defined as:

\rho^k_{i_o,j_o} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} (F^k(i, j) - \bar{F}^k)(R_{i_o j_o}(i, j) - \bar{R}_{i_o j_o})}{\sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} (F^k(i, j) - \bar{F}^k)^2} \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} (R_{i_o j_o}(i, j) - \bar{R}_{i_o j_o})^2}}    (1)
where F k and Rio jo represent the mean of F k and Rio jo . For each frame F k , the maxima of ρkio ,jo will yield the best matching candidate search location defined as (k, imax , jmax ). The primary goal is to minimize overall computations for the full video sequence without changing the value or location of the global maxima for any individual frame. For this purpose, following information can be used: 1. Each frame F k exhibits nonzero correlation ρkk with each of its temporal neighbor F k ∈ {F k−p . . . F k−1 , F k+1 , . . . F k+p }, where p is an integer. 2. For a frame F k , the correlation coefficient values ρkio ,jo at all (io , jo ) are available to be used for each of its temporal neighbor F k . 3. For each F k , an initial guess of the best match location is available from a previously matched frame.
We intend to find exact bounds on ρkio ,jo using ρkio ,jo and ρkk such that complete calculations of ρkio ,jo can be avoided without having any effect on the value or location of global maxima for each frame F k .
3 Related Work
The correlation coefficient, as given by Equation 1, has been computed in the spatial domain with a computational complexity of the order of O(mn(s − m + 1)(t − n + 1)), and in the frequency domain using the FFT with a computational complexity of the order of O((s + m − 1)(t + n − 1) \log_2[(s + m − 1)(t + n − 1)]) [6]. The order
of computational complexity of frequency domain implementation is lesser than the spatial domain implementation, however the basic operations in frequency domain implementation are significantly more complex. In addition to complex basic operations, the over all complexity of the frequency domain implementation remains the same no matter how dissimilar the two images to be matched are. The frequency domain implementation cannot utilize the additional information available in the form of initial guess or inter frame correlation in the video to reference image alignment problem. Recently a fast spatial domain technique has been proposed in [13], the Bounded Partial Correlation (BPC) algorithm. In the BPC algorithm, each frame F k and the search location Rio jo is divided into cross-correlation area Ac and the bound area Ab . Partial cross-correlation (CC p ) is calculated over area Ac only, and using the Cauchy-Schwartz inequality, partial upper bound (Up ) is calculated on the area Ab only. It is shown that Up + CC p ≥ CC, where CC is the complete cross-correlation between frame F k and Rio jo . An upper-bound on ρkio jo is computed using Equation (1). During the match process, further calculations on the current search location are terminated if the upper-bound on ρkio jo evaluates to a value smaller than the yet known best maxima. Since the bound calculations are simpler than the correlation calculations, computations are reduced by selecting a larger bound are, Ab . However, a larger Ab yields the upper bound looser and it becomes harder to obtained elimination. In the limiting case, if the total block area is used as Ab , the Cauchy-Schwartz inequality yields an upper bound of +1, which will never become less than the known maxima and therefore no elimination will be obtained. Moreover, like FFT based algorithms, the BPC algorithm also cannot utilize the inter frame correlation for computational advantage. Many approximate fast block matching schemes have also been proposed in literature, for example: two dimensional logarithmic search, three step search, conjugate direction search, cross search, orthogonal search. In these approximate schemes, global maxima is not guaranteed to be found while the algorithms proposed in this paper are exact and preserve the value and the location of the global maxima.
4 Transitive Elimination Algorithm

4.1 Basic Idea
The basic idea of the Transitive Elimination Algorithm (TEA) is similar to the video compression algorithms [3]. In video compression algorithms, an input video sequence is divided into Group of Pictures (GOP). Each GOP contains at least one intra-coded frame (I-Frame) and the remaining frames are predictive coded (P-Frames) or bidirectional predictive coded (B-Frames) [1]. In video to reference image alignment problem, we also propose to divide the input video sequence into GOPs. In each GOP we have one C-Frame (Correlated-Frame) and the remaining are B-Frames (Bounded-Frames). The C-Frame is correlated with complete reference image. For B-frames, transitive bounds are evaluated
Fig. 2. Vector representation of the images and the angles between their mean subtracted versions
at all search locations. Then, based upon sufficient conditions for elimination, unsuitable search locations are identified and eliminated from the search space. Note that there is no approximation involved in the bound calculations or in the search location elimination, therefore the value and location of the global maxima is exactly preserved. In the following discussion, the C-Frames are denoted by F^k and the B-Frames are denoted by F^{k'}. In each GOP, the middle frame is considered as the C-Frame and the temporal neighbors on each side are considered as B-Frames.

4.2 Transitive Bounds on the Correlation Coefficient
Transitive bounds are derived by considering the video frames and the search locations as vectors in R^{m×n}. Let the vectors F^k and R_{i_o j_o} represent a C-Frame F^k and a search location R_{i_o j_o}. The correlation coefficient between F^k and R_{i_o j_o} can be interpreted as the inner product of their mean subtracted and unit magnitude normalized versions:

\rho^k_{i_o,j_o} = \frac{\tilde{F}^k}{\|\tilde{F}^k\|} \cdot \frac{\tilde{R}_{i_o j_o}}{\|\tilde{R}_{i_o j_o}\|},    (2)

where \tilde{F}^k = F^k − \bar{F}^k, \tilde{R}_{i_o j_o} = R_{i_o j_o} − \bar{R}_{i_o j_o}, and \|\cdot\| denotes the magnitude of each vector. Let θ^k_{i_o j_o} be the magnitude of the smaller angle between \tilde{F}^k and \tilde{R}_{i_o j_o}, measured in the plane Π containing both of these vectors (Figure 2), such that 0° ≤ θ^k_{i_o j_o} ≤ 180°. Then the correlation coefficient between F^k and R_{i_o,j_o} is given by:

\rho^k_{i_o,j_o} = \cos θ^k_{i_o j_o}.    (3)
Let F^{k'} be a vector representing a B-Frame in the temporal neighborhood of F^k, and let θ_{kk'} be the smaller angle between \tilde{F}^k and \tilde{F}^{k'}, measured in the plane Π' containing both of these vectors, as shown in Figure 2. Like θ^k_{i_o j_o}, θ_{kk'} is also constrained between 0° and 180°: 0° ≤ θ_{kk'} ≤ 180°. The correlation coefficient
between vectors F^k and F^{k'} is given by ρ_{kk'} = \cos θ_{kk'}. Using the orientation of the vectors and planes shown in Figure 2, we are interested in finding bounds on θ^{k'}_{i_o j_o} that will be used to bound ρ^{k'}_{i_o j_o}. The planes Π and Π' in Figure 2 are uniquely defined by the vectors \tilde{F}^k, \tilde{F}^{k'} and \tilde{R}_{i_o j_o}. Let φ^{Π'}_{Π} be the magnitude of the smaller angle between the planes Π and Π'. Like θ^k_{i_o j_o} and θ_{kk'}, φ^{Π'}_{Π} is also bounded between 0° and 180°: 0° ≤ φ^{Π'}_{Π} ≤ 180°. The magnitude of θ^{k'}_{i_o j_o} is functionally dependent upon the magnitude of φ^{Π'}_{Π}. If the magnitude of φ^{Π'}_{Π} is zero, then the magnitude of θ^{k'}_{i_o j_o} is equal to |θ_{kk'} − θ^k_{i_o j_o}|. At the other extreme, if the magnitude of φ^{Π'}_{Π} is 180°, then the magnitude of θ^{k'}_{i_o j_o} is equal to θ_{kk'} + θ^k_{i_o j_o}. For all intermediate values of φ^{Π'}_{Π}, the magnitude of θ^{k'}_{i_o j_o} will remain within these bounds:
|θ_{kk'} − θ^k_{i_o j_o}| \le θ^{k'}_{i_o j_o} \le θ_{kk'} + θ^k_{i_o j_o}.    (4)

Since θ^{k'}_{i_o j_o} is also bounded between 0° and 180°, the actual bounds on θ^{k'}_{i_o j_o} turn out to be:

|θ_{kk'} − θ^k_{i_o j_o}| \le θ^{k'}_{i_o j_o} \le \min(θ_{kk'} + θ^k_{i_o j_o}, 360° − (θ_{kk'} + θ^k_{i_o j_o})).    (5)

For θ = 0° to 180°, \cos(θ) is a monotonically decreasing function; therefore, taking the cosine of both sides of Equation 5 changes the direction of the inequalities. Using the cosine properties \cos(−θ) = \cos(θ) and \cos(360° − θ) = \cos(θ), we get:

\cos(θ_{kk'} − θ^k_{i_o j_o}) \ge \cos θ^{k'}_{i_o j_o} \ge \cos(θ_{kk'} + θ^k_{i_o j_o}).    (6)

Using the trigonometric identities, the following expressions for the upper bound α^{k'}_{i_o j_o} and the lower bound β^{k'}_{i_o j_o} on ρ^{k'}_{i_o j_o} can be easily calculated:

α^{k'}_{i_o j_o} = ρ_{kk'} ρ^k_{i_o j_o} + \sqrt{(1 − (ρ_{kk'})^2)(1 − (ρ^k_{i_o j_o})^2)},    (7)

β^{k'}_{i_o j_o} = ρ_{kk'} ρ^k_{i_o j_o} − \sqrt{(1 − (ρ_{kk'})^2)(1 − (ρ^k_{i_o j_o})^2)}.    (8)
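Equations (7) and (8) translate directly into code; the sketch below accepts scalars or NumPy arrays (e.g., the whole correlation surface of the C-Frame at once). Function names are illustrative.

    import numpy as np

    def transitive_bounds(rho_kk, rho_k):
        # Upper and lower bounds (Eqs. 7-8) on the correlation of a B-Frame with a
        # reference block, given rho_kk' (B-Frame vs. C-Frame) and rho^k (C-Frame
        # vs. the same reference block).
        cross = np.sqrt((1.0 - rho_kk ** 2) * (1.0 - rho_k ** 2))
        alpha = rho_kk * rho_k + cross
        beta = rho_kk * rho_k - cross
        return alpha, beta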
We have experimentally studied the characteristics of αkio jo and βiko jo . Figure 3 shows the variation of αkio jo and βiko jo with the variation of ρkk and ρkio jo on a real dataset. In Figure 3, each pair of αkio jo and βiko jo corresponds to a fixed value of ρkk , while the variation along the x-axis is due to the variation in ρkio jo on consecutive pixel positions in the reference image. From Figure 3, it can be observed that if both of the angles, θkk and θiko jo , are small (or both of the correlations, ρkk and ρkio jo , are large), then both of the bounds, αkio jo and βiko jo , become tight. If one of the two angles is small and the other is large, the upper bound remains tight because cos(θiko jo − θkk ) in Equation 6, results in a smaller value, however the lower bound becomes loose because cos(θiko jo + θkk ) also results in a small value. On the other hand, if both of the angles are large i.e both correlations are small, then both upper and lower bounds become loose. Therefore in order to get a useful upper bound on ρkio jo , at least one of the two bounding correlations should be large.
Fig. 3. Variation of upper and lower bounds on the correlation coefficient with the variation of bounding correlations ρkk and ρkio jo . ρkk varies across the curves while k ρio jo varies as the pixel position varies along a row in the reference image. (1) to (5): Upper Bounds for ρkk = 0.306, 0.441, 0.571, 0.722 and 0.896. (6) Actual value of ρkio jo . (7) to (11) Lower bounds for ρkk = 0.896, 0.722, 0.571, 0.441 and 0.306. The Cauchy Schwartz inequality based upper bound is always +1 and the lower bound is always -1.
4.3 Sufficient Elimination Conditions
Using the upper and lower bounds on ρ^{k'}_{i_o j_o}, we can derive three types of sufficient elimination conditions for eliminating candidate search locations from the search space of a B-Frame F^{k'}. Note that this elimination is without any change in the value or location of the global maxima. All locations for which any one of the following conditions is satisfied cannot become the best match search location (a sketch of how these tests can be applied over the whole search space is given after the list):

1. All search locations R_{i_o j_o} can be discarded if there exists a location R_{i'_o j'_o} such that:

α^{k'}_{i_o j_o} < β^{k'}_{i'_o j'_o}    (9)

For this condition to be maximally effective, the maxima of the lower bound should be known. However, it can be shown that the maxima of the lower bound will occur at the peak location of ρ^k_{i_o j_o}. The proof of this fact follows directly from Equation 8: since ρ_{kk'} is constant for all locations, the location at which the maxima of ρ^k_{i_o j_o} occurs will be the location of the maxima of the lower bound.

2. All search locations R_{i_o j_o} can be discarded if their upper bound is less than the yet known maxima of the correlation surface:

α^{k'}_{i_o j_o} < ρ_{max}    (10)

The actual number of search locations discarded due to this condition depends upon the magnitude and location of the currently known maxima (ρ_{max}). The earlier in the search order a high value of the correlation coefficient is found, the larger the number of eliminated search locations will be. Thus this condition takes into account the computational advantage of an initial guess.

3. All locations R_{i_o j_o} can be discarded if their upper bound is less than a known initial threshold:

α^{k'}_{i_o j_o} < ρ_{threshold}    (11)

Selecting a high initial threshold can enhance the elimination capability of this condition. However, a high initial threshold can also discard the actual peak location, in case the actual peak magnitude is lower than the selected threshold.
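A sketch of how these conditions prune the search space of a B-Frame, given the correlation surface of the C-Frame and the inter-frame correlation; the names are illustrative, and condition (10) is noted as an on-the-fly test.

    import numpy as np

    def surviving_locations(rho_k_surface, rho_kk, rho_threshold=None):
        # Apply the sufficient elimination conditions to every candidate location.
        # Returns a boolean mask of locations that still need an exact evaluation.
        alpha, beta = transitive_bounds(rho_kk, rho_k_surface)   # see sketch above
        keep = np.ones_like(alpha, dtype=bool)
        keep &= alpha >= beta.max()          # condition (9): beaten by the best lower bound
        if rho_threshold is not None:
            keep &= alpha >= rho_threshold   # condition (11): below a known threshold
        return keep

    # Condition (10) is applied on the fly: while scanning the surviving locations,
    # a candidate is skipped as soon as alpha[x, y] < rho_max found so far.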
5 Experiments and Results
In order to demonstrate the concept of Transitive Elimination Algorithm, a multi-satellite multi-temporal real image dataset is prepared. The reference image is an 800K pixels satellite image [14] from the University of Central Florida area, having ground sampling distance of approximately 5 m/pixel. The video frames are acquired by modeling a flight simulation on images of the same area but captured at different time of the year and by a different satellite [15]. In the simulation, the scale and orientation is assumed to be approximately same as that of the reference image (Figure 1). The video camera acquires 25 frames per second while moving at a velocity of 450km/hr. In order to reduce the size and temporal redundancy, the video is down-sampled by dropping every second frame. For experimentation, a sequence of 250 frames, each of size 101×101 pixels is selected. The input video sequence is divided into GOPs. For the test sequence, the optimal length of GOP is found to be 7 frames. The detection of optimal length of GOP is discussed later in this section. For the optimal length GOP, the average execution time of the proposed algorithm is 2.24 Sec/Frame, including all required calculations, on a 1.6GHz, 1GB RAM, IBM ThinkPad computer. The C-Frames in each GOP are correlated in frequency domain [11] while the B-Frames are correlated in spatial domain. The average spatial domain time, without elimination, is observed to be 15.64 Sec/Frame and the frequency domain time as 10.53 Sec/Frame. The spatial domain implementation is also modified for the BPC algorithm with correlation area taken as 20% [13]. The average execution time of BPC implementation is observed to be 5.70 Sec/Frame and the average elimination as 65% of the computations. The elimination observed in the proposed algorithm is 81% of the search locations, which is significantly larger than the BPC algorithm (Figure 4b). According to these results, the speedup of the proposed algorithm, over spatial domain implementation is 7 times, over FFT based implementation is 5 times and over BPC implementation is 2.5 times. For maximum speedup performance of the proposed algorithm, detection of the optimal length of GOP is of prime importance. Optimal GOP length is one that minimizes the overall computation time by maximizing the elimination in B-Frames while minimizing the number of C-Frames. Optimal GOP length can be found by starting the matching process with the minimum GOP length and then increasing the length gradually until the optimal performance is achieved (Figure 5c). Optimal GOP length can also be predicted if average value of global
Fig. 4. Comparison of Transitive Elimination Algorithm (TEA) with Spatial (SPT), FFT and BPC based implementations: (a) Average execution time per frame. (b) Average % computation elimination.
maxima and the average inter frame correlation are known for a given dataset. For the optimal GOP length, the inter frame correlation should be such that, at all search locations, the upper bound approaches the currently known maxima ρ_{max} from below:

ρ_{max} − \left[ ρ_{kk'} ρ^k_{i_o j_o} + \sqrt{(1 − ρ_{kk'}^2)(1 − (ρ^k_{i_o j_o})^2)} \right] = Δ^{k'}_{i_o j_o},    (12)

where Δ^{k'}_{i_o j_o} is a very small positive number. By varying the length of the GOP, we can change ρ_{kk'}, while ρ^k_{i_o j_o} is generally very small and we can safely assume its average value to be zero: E[ρ^k_{i_o j_o}] = 0. For the test video sequence the average value of ρ_{max} is 0.72. Assuming Δ^{k'}_{i_o j_o} = 0 in Equation 12, the required ρ_{kk'} turns out to be 0.70 (Figure 5a). From Figure 5b, for ρ_{kk'} = 0.70, the number of B-Frames per GOP turns out to be 6, that is, a GOP size of 7 frames. We also verified this finding experimentally by studying the variation of performance with the length of the GOP (Figure 5c). The length of GOP was varied from 3 to 15 frames and the optimal length of GOP was again found to be 7 frames, which verified the predicted GOP length.
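Under the stated assumptions E[ρ^k] = 0 and Δ = 0, equation (12) gives the required inter-frame correlation in closed form; the small sketch below reproduces the quoted value of roughly 0.70.

    import numpy as np

    # Eq. (12) with E[rho^k] = 0 and Delta = 0 reduces to
    #     rho_max = sqrt(1 - rho_kk'**2)  =>  rho_kk' = sqrt(1 - rho_max**2)
    rho_max = 0.72                      # average peak correlation of the test sequence
    rho_kk_required = np.sqrt(1.0 - rho_max ** 2)
    print(round(rho_kk_required, 2))    # -> 0.69, i.e. approximately 0.70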
Fig. 5. (a) Variation of UB with variation of Inter Frame Correlation (IFC) for E[ρkio jo ] = {0.2, 0.1, 0.0, −0.0, −0.2} for curves 1 to 5 respectively. (b) Variation of IFC, Upper Bound(UB) and Maxima of Lower Bound (MLB) with variation of the length of GOPs. (c)Variation of execution time with the length of GOPs.
6 Conclusion
An elimination algorithm for fast video to reference image alignment is presented. The elimination algorithm is based upon new proposed bounds on the correlation coefficient. Using the proposed algorithm, significant speedup is obtained without any change in the value or location of the global maxima of the correlation coefficient. The proposed algorithm is up to 7 times faster than the spatial domain implementation and 5 times faster than the FFT based implementations of the correlation coefficient.
References 1. ITU-T, ISO/IEC JTC1: Advanced video coding for generic audiovisual services. (JTC 1, Recommendation H.264 and ISO/IEC 14 496-10 (MPEG-4) AVC, 2003 2. Shah, M., Kumar, R.: Video Registration. Kluwer Academic Publishers, Boston (2003) 3. Ghanbari, M.: Standard Codecs: Image compression to advanced video coding. IEE Telecom. Series 49, Institute of Electrical Engineers 49 (2003) 4. Li, W., Salari, E.: Successive elimination algorithm for motion estimation. IEEE Trans. Image Processing 4(1), 105–107 (1995) 5. Montrucchio, Q.B.: New sorting-based lossless motion estimation algorithms and a partial distortion elimination performance analysis. IEEE Trans.CSVT 15, 210–212 (2005) 6. Barnea, D., Silverman, H.: A class of algorithms for fast digital image registration. IEEE Trans. Commun. 21(2), 179–186 (1972) 7. Pratt, W.K.: Digital Image Processing, 3rd edn (2001) 8. Irani, M., Anandan, P.: Robust multi-sensor image alignment. In: ICCV (1998) 9. Sheikh, Y., Khan, S., Shah, M.: Feature-based geo-registration of aerial images. Geo-sensorNetworks (2004) 10. Ziltova, B., Flusser, J.: Image registration methods: A survey. Image and Vision Computing 21, 977–1000 (2003) 11. Lewis, J.: Fast normalized cross-correlation. In: Vision Interface, pp. 120–123 (1995) 12. Press, W., Teukolsky, S., Vetterling, W., Flannery, B.: Numerical Recipes in C, 2nd edn. Cambridge University Press, Cambridge (1992) 13. di Stefano, L., Mattoccia, S., Tombari, F.: ZNCC-based template matching using bounded partial correlation. Pattr. Rec. Ltr. 26(14) (2005) 14. Google Earth, http://earth.google.com/ 15. Microsoft Terra Server, http://terraserver.microsoft.com/
Flea, Do You Remember Me? Michael Grabner, Helmut Grabner, Joachim Pehserl, Petra Korica-Pehserl, and Horst Bischof Institute for Computer Graphics and Vision Graz University of Technology, Austria {mgrabner,hgrabner,pehserl,korica,bischof}@icg.tugraz.at
Abstract. The ability to detect and recognize individuals is essential for an autonomous robot interacting with humans even if computational resources are usually rather limited. In general a small user group can be assumed for interaction. The robot has to distinguish between multiple users and further on between known and unknown persons. For solving this problem we propose an approach which integrates detection, recognition and tracking by formulating all tasks as binary classification problems. Because of its efficiency it is well suited for robots or other systems with limited resources but nevertheless demonstrates robustness and comparable results to state-of-the-art approaches. We use a common over-complete representation which is shared by the different modules. By means of the integral data structure an efficient feature computation is performed enabling the usage of this system for real-time applications such as for our autonomous robot Flea.
1 Introduction
Autonomous robots guiding blinds, cleaning dishes, delivering mail, laundering, entertaining and handling many other daily tasks belong to the future goals of a competition called RoboCup@Home1 . The aim is to develop applications that can assist humans in everyday life. One specific task within this challenge is called Who is Who? and is thought to focus on enhancement of techniques for natural and social human-computer interaction. In specific, the detection and recognition of known vs. unknown individuals should enforce robots usability and make them automatically recognize familiar persons. Real-time capability is essential for interaction. From the computer vision perspective (we do not consider the audio modality in this work) it requires three approaches to successfully handle these tasks, namely (1) Detection (2) Recognition and further on (3) Tracking can be optionally added. For these specific computer vision problems much research has been done and overviews of proposed techniques are given in [1,2,3]. Especially classification techniques have turned out to provide robust results for these tasks and are hard to compete in efficiency. For object detection the probably most widely 1
www.robocupathome.org
used technique is AdaBoost introduced by Viola and Jones [4]. For face recognition/identification several classification methods have been applied [5,6] and also tracking recently has been often solved by fromulating it as a classification problem between object and background [7,8,9]. However, only a few approaches exist considering the problem of detection, recognition and tracking as a common problem [10,11]. At most they are combined using consecutive stages and using totally independent techniques for the specific tasks (i.e. motion detection for tracker initialization). Related is also the work of Zisserman et al. [12,13] on face recognition in feature-length films. Their task is to perform person retrieval from movies, and they use face detection and tracking to perform that task. Their system demonstrates that by closely integrating the tasks an impressive performance can be obtained. However their approach is not applicable to our problem because of the high computational costs. To summarize, there does not exist an efficient combination of detection, recognition and tracking however for all specific tasks classification methods have been successfully applied. In this paper we propose a system which integrates detection, identification and further on tracking. The approach is applied to faces however can be used as well for other objects. All tasks are formulated as binary classification problems allowing to apply well established learning techniques. An integral data structure is shared among all modules allowing very fast and efficient feature extraction. The system is especially suited for any device having limited resources. In specific, we apply the proposed approach to an autonomous robot. The outline of the paper is as follows. In Section 2 we describe detection, recognition and tracking as binary classification problems. Furthermore the procedure how to share low-level computations among the modules as well as the used learning technique is presented. Section 3 shows experimental evaluations on a public database and in addition illustrative sample sequences captured from our mobile robot Flea. Section 4 concludes the paper and gives some outlook of ongoing work.
2 Identifying Familiar Persons and Unknowns
The identification of persons within images requires two steps. First, there is the need of a detection part which is responsible for locating all faces appearing in the image. Second, given the set of faces we want to distinguish between the class of known persons and unknown persons and further on between the individuals of known persons which is accomplished by the recognition step. The problem of identification is formulated in a coarse to fine manner by applying discriminative classification methods at both stages. In addition we are also interested in tracking of detected individuals since it allows our robot to track even if appearance changes (e.g. due to occlusion, view change) occur.
2.1 Detection, Recognition and Tracking as Binary Classification
The key idea of our approach is to formulate the abilities to detect, recognize and to track as classification problems, as depicted in Figure 1. By doing so we can apply the same techniques for all tasks. The major advantage is that low-level computations can be shared and have to be done only once. This will be explained in more detail in Section 2.2.
Fig. 1. Detecting, recognizing and tracking persons are considered as independent binary classification problems. A detector is trained off-line given a large set of positive labeled faces against non-face images. For recognition a specific face is trained against all other faces. For tracking an on-line classifier is used which allows continuously updating the model using the surrounding background as negative samples.
Detection. The task of the detector is to distinguish between the class of faces and the background. This can be formulated as a binary classification problem as proposed by Viola and Jones [14]: a training set X_d = {⟨x_{d,1}, y_{d,1}⟩, . . . , ⟨x_{d,n}, y_{d,n}⟩} is given, where x_{d,i} ∈ ℝ^m is an image patch and the class label y_{d,i} ∈ {+1, −1} indicates face or non-face, respectively. Using this training set a binary classifier is trained by applying an off-line learning algorithm L_off. Positive samples (faces) are hand-labeled and negative samples are drawn from an image database containing no faces. For evaluation, i.e., the detection of faces in an image, the classifier is applied exhaustively to image patches sampled at different locations and scales.
Recognition. Once a detection has been made by the detector, it is handed over to the recognizer module. In the first stage the recognizer classifies the provided sample as known or unknown, and in the second stage it verifies the identity in case of a known face. This can be formulated as a multi-class classification problem. Given a set of samples X_r = {⟨x_{r,1}, y_{r,1}⟩, . . . , ⟨x_{r,n}, y_{r,n}⟩}, where x_{r,i} ∈ ℝ^m represents a face image and y_{r,i} ∈ {0, 1, 2, . . . , M} is its label, each label corresponding to one of the M different individuals and 0 to unknowns. This problem can be rewritten in a one-vs.-all manner, which makes the use of binary classifiers feasible: for each person j = 1, . . . , M
we train a single binary classifier against the other classes. Thus, the training set for the classifier C_j is X_{r,j} = {⟨x_{r,i}, +1⟩ | y_{r,i} = j} ∪ {⟨x_{r,i}, −1⟩ | y_{r,i} ≠ j}. The classifier C_j is created by applying an off-line learning algorithm L_off on this training set. In fact, a model is learned which best discriminates the current person from the other given identities, as shown in Figure 1. In order to obtain robustness against the class unknown, we add arbitrary faces (e.g., from a face database used for training the face detector) as negative samples for the training. In the evaluation step, each classifier C_j(x) evaluates the face image x and provides a confidence value. The classifier with the highest response delivers the class label ŷ:

ŷ = arg max_{j=1,...,M} C_j(x)    (1)
The class unknown is recognized if all classifiers' responses are below a certain threshold. This approach can be extended by using an on-line learning algorithm L_on for updating the classifiers. It allows novel persons to be added by simply using their samples as negative updates to the existing classifiers and training a new classifier as shown above.
Tracking. Tracking allows us to localize the object even if detection fails due to appearance changes (e.g., occlusion). Furthermore, it helps to eliminate possible false detections and to increase recognition accuracy. Following the formulation in [9], we summarize the main steps of formulating tracking as a binary classification problem. Once the target object has been detected at time t, it is assumed to be a positive image sample ⟨x, +1⟩_{t=0} for the classifier. At the same time negative examples {⟨x_1, −1⟩, . . . , ⟨x_n, −1⟩}_{t=0} are extracted by taking regions of the same window size from the surrounding background. Given these examples an initial classifier C_{t=0} is trained. The tracking step is based on the classical approach of template tracking: the current classifier C_t is evaluated at the surrounding region of interest, and we obtain for each sub-patch a confidence value which indicates how well the underlying image patch fits the current model. Afterwards we analyze the obtained confidence map and shift the target window to the new maximum location. Next, the classifier has to be updated in order to adjust to possible changes in the appearance of the target object and to become discriminative to a different background. The current target region is used for a positive update of the classifier, while surrounding regions are again taken as negative samples. This update policy has proved to allow stable tracking in natural scenes. As new frames arrive, the whole procedure is repeated, and the classifier is therefore able to adapt to possible appearance changes and in addition becomes robust against background clutter. Note that, in order to formulate tracking as a classification task, we need an on-line learning algorithm L_on. The binary classifier updates the model (decision boundary) by using the information from a single new sample ⟨x, y⟩, with x ∈ ℝ^m and y ∈ {+1, −1}.
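As an illustration of this decision rule, the following minimal sketch implements Eq. (1) together with the rejection of unknowns. It assumes a hypothetical per-person classifier object exposing a confidence(patch) method returning a value in [−1, +1]; the names and the threshold value are ours, not part of the original system.

```python
import numpy as np

def identify(face_patch, classifiers, threshold=0.0):
    """One-vs-all identification with rejection of unknowns.

    classifiers : list of per-person binary classifiers; each is assumed
                  to expose a confidence(patch) method returning a value
                  in [-1, +1] (hypothetical interface).
    Returns the index of the recognised person, or -1 for 'unknown'.
    """
    confidences = np.array([c.confidence(face_patch) for c in classifiers])
    best = int(np.argmax(confidences))     # Eq. (1): y_hat = argmax_j C_j(x)
    if confidences[best] < threshold:      # all responses too low -> unknown
        return -1
    return best
```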
2.2 Efficient Features and a Single Shared Data Structure
An overview of the proposed system is given in Figure 2. For each frame the integral representation needs to be computed only once and is then used by all three modules for feature computation. Note that each unit selects features appropriate for its specific task; the computation time of the features is nevertheless negligible.
Fig. 2. Each module (detector, tracker and recognizer) is based on the same classification method, allowing the use of the same feature types (Haar-like wavelets). These features can be computed very efficiently using a shared integral data structure.
For the binary classification of image patches we propose to use the classical approach of Viola and Jones [14]. Their main assumption is that a small set of simple image features can separate the two classes. The selection of the features is done by a machine learning algorithm. In the following we briefly summarize the applied techniques.
Features. As features we use simple Haar-like features². We spend some time on the pre-computation of an efficient data structure, namely the integral image, which can be used for fast feature evaluation. This pre-computation has to be done only once, since all three modules share this information.
Boosting for Feature Selection. For training a classifier we apply boosting for feature selection [17,14]. The core of the technique is the machine learning algorithm AdaBoost [18]. Given a set of training samples X = {⟨x_1, y_1⟩, . . . , ⟨x_n, y_n⟩}, with x_i ∈ ℝ^m and y_i ∈ {−1, +1}, boosting builds an additive model of weak classifiers in the training stage. At each iteration a weak classifier is trained using a weight distribution over the training samples. In order to perform feature selection, each weak classifier corresponds to a simple image feature. Afterwards the samples are re-weighted. The result is a strong classifier
² Note that other kinds of features, such as edge orientation histograms [15] or Local Binary Patterns [16], can also be built using integral data structures.
H(x) = sign(conf(x)),    conf(x) = (1 / ∑_{i=1}^{N} α_i) ∑_{i=1}^{N} α_i · h_i(x)    (2)
which consists of a weighted linear combination of N weak classifiers h_i. The value conf(x), bounded to the interval [−1, +1], denotes how confident the classifier is about its decision. This fulfills the requirements on the classifier C from the previous section. Boosting, and especially boosting for feature selection as described above, runs off-line, meaning that all the training data is given at once. Primarily for tracking we need an on-line learning algorithm. For the on-line adaptation of the classifier we make use of an on-line version [19]. The key idea is to introduce so-called selectors, which hold a set of weak classifiers and of which each can choose exactly one. An on-line boosting algorithm [20] is performed on the selectors and not on the weak classifiers directly. Updating can be done efficiently. After updating the classifier, the evaluation is similar to the off-line case, because each selector has chosen one specific weak classifier which again corresponds to a single feature.
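To make the shared data structure concrete, here is a small CPU sketch of an integral image, a two-rectangle Haar-like feature evaluated from it, and the normalized weighted vote of Eq. (2). The feature parametrization and the weak-classifier representation are our own simplifications, not the authors' implementation.

```python
import numpy as np

def integral_image(img):
    # Cumulative sum over rows and columns; shared by detector, recognizer and tracker.
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, x, y, w, h):
    # Sum of pixels in the rectangle [x, x+w) x [y, y+h) using four lookups.
    A = ii[y - 1, x - 1] if x > 0 and y > 0 else 0
    B = ii[y - 1, x + w - 1] if y > 0 else 0
    C = ii[y + h - 1, x - 1] if x > 0 else 0
    D = ii[y + h - 1, x + w - 1]
    return D - B - C + A

def haar_two_rect(ii, x, y, w, h):
    # Simple two-rectangle feature: left half minus right half.
    return box_sum(ii, x, y, w // 2, h) - box_sum(ii, x + w // 2, y, w - w // 2, h)

def strong_classifier(ii, weak_classifiers):
    """weak_classifiers: list of (feature_params, theta, polarity, alpha) tuples
    (a hypothetical representation). Returns conf(x) in [-1, +1] as in Eq. (2)."""
    num = sum(a * p * np.sign(haar_two_rect(ii, *f) - t)
              for f, t, p, a in weak_classifiers)
    den = sum(a for _, _, _, a in weak_classifiers)
    return num / den
```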
3 Results
First we introduce our autonomous robot and give some relevant details regarding the hardware setup. We then present a performance evaluation of our proposed system, focusing on the recognition accuracy as well as on the ability to distinguish between known and unknown individuals. Finally, an illustrative experiment is shown, which is also available at www.flea.at.
3.1 Robot Flea
The hardware setup consists of an ActiveMedia Peoplebot platform including diverse sensors (e.g., sonar, IR). The robot's head has thirteen degrees of freedom and can move its eyes, mouth, eyebrows, forehead, chin and neck. A Dual Core Centrino with 2 GHz and 1024 MB RAM serves as the main
Fig. 3. Our robot Flea consists of a humanoid head. Visual information is obtained through a camera which is included in the artificial eye. A captured image from the view of the robot is depicted on the right.
Fig. 4. Performance evaluation of the recognition system: (a) scalability: the recognition rate when increasing the number of persons; (b) performance: the trade-off of recognizing persons vs. unknowns.
processing unit. A stereo camera from Videre Design (STH-MDCS2-VAR, max. 1280 × 960, used at 640 × 480) is used for capturing images, and about 12 frames per second are processed with a non-optimized C++ implementation. Training of the robot is done in a fully autonomous way. When Flea meets an unknown person, she focuses on him or her and asks for the name and other relevant information. During the conversation she starts collecting training samples of the person and trains a classifier for identification. When meeting Flea somewhere and asking the robot "Flea, do you remember me?", she will reply "Sure, ...", adding your name in case she knows you, and otherwise ask you for your name. This is exactly the task that has to be fulfilled within the "Who is Who?" competition of RoboCup@Home.
3.2 Recognition Performance
For the evaluation of the recognition accuracy, and in particular the recognition of the class unknown, we use the AT&T database³, which includes 40 different persons with 10 samples per class. This dataset is well suited since in our application it is not important to distinguish a huge number of individuals, but it is important to accurately recognize the class of unknown individuals. In the first experiment we want to demonstrate the recognition accuracy with respect to the number of classes. The dataset has been randomly split into a training and a test set (70% training and 30% testing). The result, depicted in Figure 4, has been obtained by running the experiment 10 times. The second evaluation illustrates on the one hand the performance of distinguishing between unknown and known faces, and on the other hand the trade-off of recognizing the correct class in case of a known person. We use the same training procedure as in the previous evaluation. We randomly selected 20
³ www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
Fig. 5. Subset of training samples for three different persons (from upper left to bottom right: Chris, Joe, Ann) and samples taken from the training database (bottom right set) as additional negative samples
(a) Frame 215  (b) Frame 220  (c) Frame 355  (d) Frame 357  (e) Frame 431  (f) Frame 839
Fig. 6. Sample sequence from the perspective of Flea. Learned faces are detected, robustly tracked (d) and correctly identified if they are known (e-f). Different individuals are marked by different colored rectangles. The approach runs on the robot at about 12 frames per second.
individuals from the dataset and applied cross-validation to obtain statistically significant results. As depicted in Figure 4, the overall task of distinguishing between unknown and known faces is handled properly. The difference in performance between recognizing the correct identity and recognizing just the class known is marginal.
3.3 Sequences
We also want to demonstrate a typical "Who is Who?" scenario. Three persons introduce themselves to the robot, while the robot collects samples of each
individual, as depicted in Figure 5. Training each individual is done within a few seconds, including the capturing of the faces. Figure 6 illustrates the applied system on a sequence from our autonomous robot. As can be seen, the approach handles the recognition of known and unknown faces and shows the benefit of combining detection, recognition and tracking.
4 Conclusion
We have combined detection, recognition and tracking by formulating all tasks as binary classification problems. As a result, low-level computations can be shared among all modules. Due to the efficient feature computation, the approach can be used within real-time applications such as autonomous robots. Note that the approach is not limited to faces, since all modules are generic, and it can therefore be applied to any other type of object. The common formulation opens several new avenues, such as improving (specializing) detectors and recognizers for images taken from a static camera, as is the case in video surveillance applications.
Acknowledgement. This work has been sponsored by the Austrian Joint Research Project Cognitive Vision under projects S9103-N04 and S9104-N04, and by the EC-funded NoE MUSCLE IST-507572.
References 1. Yang, M.H., Kriegman, D.J., Ahuja, N.: Detecting faces in images: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(1), 34–58 (2002) 2. Tan, X., Chen, S., Zhou, Z.H., Zhang, F.: Face recognition from a single image per person: A survey. Pattern Recognition 39(9), 1725–1745 (2006) 3. Yilmaz, A., Javed, O., Shah, M.: Object tracking: A survey. ACM Computing Surveys 38(4) (2006) 4. Viola, P., Jones, M.: Robust real-time object detection. International Journal of Computer Vision (2002) 5. Jonsson, K., Kittler, J., Li, Y.P., Matas, J.: Learning support vectors for face verification and recognition. In: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, pp. 208–213. IEEE Computer Society Press, Los Alamitos (2000) 6. Yang, P., Shan, S., Gao, W., Li, S.: Face recognition using Ada-boosted Gabor features. In: Proceedings Conference on Automatic Face and Gesture Recognition, pp. 356–361 (2004) 7. Avidan, S.: Ensemble tracking. In: Proceedings IEEE Conference Computer Vision and Pattern Recognition, vol. 2, pp. 494–501 (2005) 8. Avidan, S.: Support vector tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 1064–1072 (2004)
9. Grabner, H., Grabner, M., Bischof, H.: Real-time tracking via on-line boosting. In: Proceedings British Machine Vision Conference, vol. 1, pp. 47–56 (2006) 10. Ebbecke, M., Ali, M., Dengel, A.: Real time object detection, tracking and classification in monocular image sequences of road traffic scenes. In: Proceedings International Conference on Image Processing, vol. 2, pp. 402–405 (1997) 11. Hernández, M., Cabrera, J., Dominguez, A., Santana, M.C., Guerra, C., Hernández, D., Isern, J.: Deseo: An active vision system for detection, tracking and recognition, pp. 376–391 (1999) 12. Arandjelovic, O., Zisserman, A.: Automatic face recognition for film character retrieval in feature-length films. In: Proceedings IEEE Conference Computer Vision and Pattern Recognition, vol. 1, pp. 860–867 (2005) 13. Sivic, J., Everingham, M., Zisserman, A.: Person spotting: Video shot retrieval for face sets. In: Proceedings International Conference on Image and Video Retrieval, pp. 226–236 (2005) 14. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proceedings IEEE Conference Computer Vision and Pattern Recognition, vol. 1, pp. 511–518 (2001) 15. Porikli, F.: Integral histogram: A fast way to extract histograms in cartesian spaces. In: Proceedings IEEE Conference Computer Vision and Pattern Recognition, vol. 1, pp. 829–836 (2005) 16. Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7), 971–987 (2002) 17. Tieu, K., Viola, P.: Boosting image retrieval. In: Proceedings IEEE Conference Computer Vision and Pattern Recognition, vol. 1, pp. 228–235 (2000) 18. Freund, Y., Schapire, R.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), 119–139 (1997) 19. Grabner, H., Bischof, H.: On-line boosting and vision. In: Proceedings IEEE Conference Computer Vision and Pattern Recognition, vol. 1, pp. 260–267 (2006) 20. Oza, N., Russell, S.: Online bagging and boosting. In: Proceedings Artificial Intelligence and Statistics, pp. 105–112 (2001)
Multi-view Gymnastic Activity Recognition with Fused HMM Ying Wang, Kaiqi Huang, and Tieniu Tan National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing {wangying,kqhuang,tnt}@nlpr.ia.ac.cn
Abstract. More and more researchers focus their studies on multi-view activity recognition, because a fixed view cannot provide enough information for recognition. In this paper, we use multi-view features to recognize six kinds of gymnastic activities. Firstly, shape-based features are extracted from two orthogonal cameras in the form of the R transform. Then a multi-view approach based on the Fused HMM is proposed to combine the different features for the recognition of similar gymnastic activities. Compared with other activity models, our method achieves better performance, even in the case of frame loss.
1 Introduction
Human activity recognition is a hot topic in the domain of computer vision. There is a wide range of open questions in this field, such as dynamic background modelling, object tracking under occlusion, activity recognition and so on [1]. Most previous activity recognition methods are dependent on the view direction. In these works, there is a strong assumption that the low-level features for the subsequent activity recognition are obtained without any ambiguity. However, recognizing actions from a single camera suffers from the unavoidable fact that parts of the action are not visible to the camera because of self-occlusions. Moreover, an action looks different from different views, and some activities may not be captured because of the loss of depth information. In [2], Madabhushi and Aggarwal recognized 12 different actions in the frontal or lateral view using the movement of the head, but they were not able to model and test all the actions due to the problem of self-occlusion for some actions in the frontal view. Therefore great efforts have been made to find robust and accurate approaches to solve this problem. In fact, while performing an action, the object essentially generates a view-independent 3D trajectory or shape in (X, Y, Z) space with respect to time. Thus 3D methods can recognize activities efficiently without the trouble of self-occlusion and depth information loss. In [3], the authors extracted the 3D shape for recognizing human postures using support vector machines. In [4], Chellappa et al. chose six joint points of the body and calculated the 3D invariants of each posture. In [5], the Motion History Volume (MHV) was proposed to extract view-invariant features in Fourier space for recognizing actions from a variety of viewpoints. Alignment and comparisons were performed using the Fourier transform in
Fig. 1. The flowchart of multi-view activity recognition
cylindrical coordinates around the vertical axis [5]. For all 3D methods, in order to use affine transformations in activity learning and recognition, point correspondences are needed, which have a high computational cost. To avoid this, a simple mechanism is to use 2D data from several views, which can describe the activities integrally at a low computational cost. Some attempts combine the features from different views when constructing the activity model. In [6], Bui et al. constructed an Abstract Hidden Markov Model (AHMM) to hierarchically encode the wide spatial locations from different views and describe activities at different levels of detail. Some researchers try to directly fuse multi-view information at the feature level. In [7], Bobick et al. used two cameras at orthogonal views to recognize activities by temporal templates. Motion history information (MHI) was proposed to represent activity, which has temporal information but no spatial information of the motion. In [8], Huang proposed a representation, "Envelop Shape", obtained from the silhouettes of objects using two orthogonal cameras for view-insensitive action recognition. However, "Envelop Shape" simply overlaps the silhouettes from the two views, which inevitably destroys the correspondence between consecutive frames of the respective views. 2D data from multiple views are easy to acquire, but how to use them efficiently deserves further research. In this paper, six kinds of gymnastic activities are recognized. From a single view, many movements look so similar that they cannot be classified correctly. We therefore use two cameras with orthogonal views to capture more silhouette features. Different from previous work, we use the R transform, a novel shape descriptor, to represent gymnastic activities. Then activity models, Fused HMMs (FHMMs) based on features extracted by the R transform, are trained for
six kinds of gymnastic activities; these models merge the different activity features captured from the different views. The overall system architecture is illustrated in Fig. 1. The remainder of this paper is organized as follows. In Section 2, we describe the six kinds of gymnastic activities analysed in this paper. In Sections 3 and 4, we provide a detailed description of the R transform and the FHMM. Section 5 demonstrates the effectiveness of the proposed method by comparison with other activity models. Finally, some conclusions are drawn in Section 6.
2 Activity Description
In this study, we focus our attention on gymnastic activities. Gymnastics is rhythmical, and each activity starts by standing with one's arms down without any motion and ends with the same stance. In this framework, we divide the gymnastic video of each person into six kinds of activities. Table 1 describes these activities with their respective numbers and abbreviations. Some silhouette examples sampled from the video sequences of each activity are shown in Fig. 2. In each sub-figure, the first row shows the silhouettes from the frontal view and the second row shows the silhouettes from the lateral view.

Table 1. Description of the six activity types
1. AtoS: Raise arms out to the side with elbows straight to shoulder height, keep standing with the arms held at that height, and then put the arms down to the sides of the body.
2. AtoC: Lift both arms up towards the ceiling and then put the arms back to the starting position.
3. Aup: Raise one arm forward with elbow straight towards the ceiling and then down.
4. AupLup: Lift both arms up towards the ceiling while one leg moves backwards, then the arms backwards while one leg moves forwards, and finally put arm and leg down to the starting position.
5. Lup: Lift both hands up to shoulder height while raising the lap until it is parallel to the floor and the knee points to the ceiling, and finally put arm and leg down to the starting position.
6. Still: Keep the body still (this occurs at the end of each activity).
These activities are so similar that they are difficult to discriminate from a single view. Because of the loss of depth information, fore-and-aft movements towards the camera cannot be captured on the 2D image plane. As for the shape sequences, there is only a little variation, which cannot represent the detailed activity information, as shown in Fig. 2.4, which is hard to recognize. For example, from the frontal view, activity 1 and activity 5 are different movements but have quite similar shape variations, as shown in Fig. 2.1 and 2.5; so do activities 2 and 4 (Fig. 2.2 and 2.4). From the lateral view, activities 1 and 6 have the same shape variation (Fig. 2.1 and 2.6). In order to discriminate these seemingly
(1. Hand upwards to shoulder height)  (2. Two hands upwards to ceiling)  (3. One hand upwards to ceiling)  (4. Two hands upwards to ceiling, leg forwards and backwards alternately)  (5. One leg upwards till parallel to floor)  (6. Body keeping still)
Fig. 2. Examples of extracted silhouettes in video sequences from two views
similar but actually different activities, two views are needed to provide richer information. As shown in Fig. 2, some activity sequences have similar variations from one view, but can be discriminated from the other view. These activity sequences, easily misclassified from a single view, can thus be recognized correctly from two views.
3 Low-Level Feature Representation by the R Transform
Feature representation is a key step of human activity recognition because it is an abstraction of the original data into a compact and reliable format for later
processing. In this paper, we adopt a novel feature descriptor, the R transform, which is an extended Radon transform [10]. The two-dimensional Radon transform is the integral of a function over the set of lines in all directions, which is roughly equivalent to finding the projection of a shape onto any given line. For a discrete binary image f(x, y), its Radon transform is defined by [11]:

T_R f(ρ, θ) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) δ(x cos θ + y sin θ − ρ) dx dy = Radon{f(x, y)}    (1)

where θ ∈ [0, π], ρ ∈ (−∞, ∞) and δ(·) is the Dirac delta function,

δ(x) = 1 if x = 0, and 0 otherwise    (2)

However, the Radon transform is sensitive to scaling, translation and rotation, and hence an improved representation, called the R transform, is introduced [9,10]:

Rf(θ) = ∫_{−∞}^{∞} T_R^2 f(ρ, θ) dρ    (3)

The R transform has several useful properties for shape representation in activity recognition [9,10]:

Translation of the image by a vector μ = (x_0, y_0):

∫_{−∞}^{∞} T_R^2 f(ρ − x_0 cos θ − y_0 sin θ, θ) dρ = ∫_{−∞}^{∞} T_R^2 f(ν, θ) dν = Rf(θ)    (4)

Scaling of the image by a factor α:

(1/α^2) ∫_{−∞}^{∞} T_R^2 f(αρ, θ) dρ = (1/α^3) ∫_{−∞}^{∞} T_R^2 f(ν, θ) dν = (1/α^3) Rf(θ)    (5)

Rotation of the image by an angle θ_0:

∫_{−∞}^{∞} T_R^2 f(ρ, θ + θ_0) dρ = Rf(θ + θ_0)    (6)

According to the symmetry property of the Radon transform, and letting ν = −ρ,

∫_{−∞}^{∞} T_R^2 f(−ρ, θ ± π) dρ = −∫_{∞}^{−∞} T_R^2 f(ν, θ ± π) dν = ∫_{−∞}^{∞} T_R^2 f(ν, θ ± π) dν = Rf(θ ± π)    (7)
From equations (4)-(7), one can see that:
1. Translation in the plane does not change the result of the R transform.
2. A scaling of the original image only induces a change of amplitude. In order to remove the influence of body size, the result of the R transform is therefore normalized to the range [0, 1].
3. A rotation of θ_0 in the original image leads to a phase shift of θ_0 in the R transform. In this paper, the recognized activities rarely exhibit such rotations.
4. Considering equation (7), the period of the R transform is π. Thus a shape vector of dimension 180 is sufficient to represent the spatial information of a silhouette.
Therefore, the R transform is robust to geometric transformations, which makes it appropriate for activity representation. According to [9], the R transform outperforms other moment-based descriptors, such as Wavelet moments, Zernike moments and Invariant moments, on similar but actually different shape sequences, even in the case of noisy data.
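To make the descriptor concrete, a discrete approximation of Eqs. (1)-(3) can be computed by rotating the silhouette and summing along one axis to obtain the Radon projections, then integrating their squares over ρ for each θ. This rotation-based sketch (1° sampling, amplitude normalization to [0, 1]) is only illustrative and is not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import rotate

def r_transform(silhouette, angles=np.arange(0, 180)):
    """R transform of a binary silhouette image.
    Returns a vector R(theta) normalised to [0, 1] (Eq. (3) plus amplitude normalisation)."""
    sil = silhouette.astype(float)
    r = np.empty(len(angles))
    for k, theta in enumerate(angles):
        # Radon projection at angle theta: rotate the image and integrate along columns.
        rotated = rotate(sil, angle=float(theta), reshape=True, order=1)
        projection = rotated.sum(axis=0)     # T_R f(rho, theta) over all rho
        r[k] = np.sum(projection ** 2)       # integral of the squared projection
    return r / (r.max() + 1e-12)             # remove the influence of body size
```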
(Arm up and leg up)  (Arm back, leg forwards)  (Arm up, leg backwards)
Fig. 3. R transform of key frames for different activities from two views
Fig. 3 shows silhouette examples extracted from different activities. Each row shows the same frame from two views, and the sub-figure following each silhouette is its R transform. The R transform curves of the different activities from the two views show different variations. For example, from the frontal view, the R transform curve of the first row has two peaks, one at about 90° and the other at about 170°; that of the second row has no peak, and that of the last row has a peak close to 10°. This shows that the R transform can describe the spatial information sufficiently and characterize the different activity silhouettes effectively.
4 Fused Hidden Markov Model
The next step is to combine the features obtained from the different views. Here we employ the FHMM, which was proposed by Pan et al. for bimodal speech processing [12]. Like the Coupled HMM (CHMM) [13], the FHMM consists of two HMMs, as shown in Fig. 4 (where circles represent observations and rectangles hidden states; each red rectangle is one HMM component). However, unlike the CHMM's connections between hidden states, the FHMM's connections are between the hidden state nodes of one HMM and the observation nodes of the other, as shown in Fig. 4.
(1) CHMM  (2) FHMM
Fig. 4. The graphical structure of CHMM and FHMM
Assume O_1 and O_2 are two different observation sequences. In the FHMM's parameter training, the focus is to estimate the joint probability function P(O_1, O_2). However, a straightforward estimation of the joint likelihood P(O_1, O_2) is not desirable because the computation is inefficient and large amounts of training data are required. To solve this problem, Pan et al. train the two HMMs separately, and then use their respective parameters to estimate an optimal solution for P(O_1, O_2) [12,14]. In the maximum entropy sense, an optimal (if less precise) solution P̂(O_1, O_2) is given by [15]:

P̂(O_1, O_2) = P(O_1) P(O_2) · P(w, v) / (P(w) P(v))    (8)

where w = f(O_1), v = g(O_2), and f(·) and g(·) are mapping functions which must satisfy the following requirements: 1. the dependencies between w and v describe the dependencies between O_1 and O_2 to some extent; 2. P(w, v) is easy to estimate. In other words, f(·) and g(·) should maximize the mutual information of w and v [16]. This is an ill-posed inverse problem with more than one solution. Specifically, we choose w = Û_1 = arg max_{U_1} (log p(O_1, U_1)) and v = O_2 according to the maximum mutual information criterion [16]. Then equation (8) is expressed as:

P̂(O_1, O_2) = P(O_1) P(O_2) · P(Û_1, O_2) / (P(Û_1) P(O_2)) = P(O_1) P(O_2 | Û_1)    (9)

Finally, the computation of the joint probability P(O_1, O_2) is converted into estimating P(O_1) and P(O_2 | Û_1). According to the process mentioned above, the learning algorithm of the FHMM includes three steps [12,14]:
1) Learn the parameters of the two individual HMMs independently by the EM algorithm: (Π_1, A_1, B_1) and (Π_2, A_2, B_2).
2) Determine the optimal hidden states of the HMMs using the Viterbi algorithm with the obtained parameters: Û_1 and Û_2.
3) Estimate the coupling parameters P(O_2 | Û_1) using the known parameters:

P(O_2 = k | Û_1 = i) = ∑_{t=0}^{T−1} δ(O_2^t − k) δ(Û_1^t − i) / ∑_{t=0}^{T−1} δ(Û_1^t − i)    (10)

where k ranges over the observation values of O_2 and i over the states of HMM 1.
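The three-step training procedure can be sketched as follows. Steps 1 and 2 (EM training of the two HMMs and Viterbi decoding) are assumed to come from any standard HMM library; the sketch only shows step 3, the coupling matrix of Eq. (10), and the fused log-likelihood of Eq. (9), under the simplifying assumption of discretized observations for O_2. All names are ours.

```python
import numpy as np

def coupling_matrix(viterbi_states_1, obs_symbols_2, n_states_1, n_symbols_2):
    """Estimate P(O2 = k | U1_hat = i) from a paired training sequence (Eq. (10))."""
    counts = np.zeros((n_states_1, n_symbols_2))
    for i, k in zip(viterbi_states_1, obs_symbols_2):
        counts[i, k] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0      # avoid division by zero for unvisited states
    return counts / row_sums

def fused_log_likelihood(loglik_1, viterbi_states_1, obs_symbols_2, coupling):
    """log P_hat(O1, O2) = log P(O1) + sum_t log P(O2_t | U1_hat_t)   (Eq. (9))."""
    cond = coupling[viterbi_states_1, obs_symbols_2]
    return loglik_1 + np.sum(np.log(cond + 1e-12))
```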
5 Experimental Analysis
5.1 Experimental Data and Feature Extraction
Experimental data are synchronized videos (320×240, 25 fps) obtained by two cameras placed roughly orthogonally. The experiments are based on 300 low-resolution video sequences of 50 different people, each performing the six gymnastic activities described in Table 1. The resulting silhouettes contain holes and shadows due to imperfect background segmentation. To train the activity models, holes, shadows and other noise are removed manually to form ground truth data. 180 of the 300 sequences (30 of the 50 people) are used for training, while 120 of the 300 sequences (20 of the 50 people) are used for recognition. Then the R transform is used to extract the spatial information of the postures in the videos. Because the R transform is non-orthogonal, the 180-dimensional shape vector is redundant. In general, PCA is employed to obtain compact and accurate information for each video sequence. According to a preliminary analysis of each activity, we find that 10 principal components are enough to represent 98% of the variance.

Table 2. Recognition results based on FHMM (20 test samples per activity)
1. AtoS: 16 correct, recognition rate 80%
2. AtoC: 17 correct, recognition rate 85%
3. Aup: 15 correct, recognition rate 75%
4. AupLup: 19 correct, recognition rate 95%
5. Lup: 18 correct, recognition rate 90%
6. Still: 16 correct, recognition rate 80%

Then six
FHMMs, each consisting of a 2-state HMM for the frontal view and a 3-state HMM for the lateral view, are constructed to combine the features of the two views and to model the six kinds of activities in Table 1. The FHMM achieves good recognition results, as shown in Table 2 (each activity has 20 testing samples). Activity 4 achieves the best recognition rate, 95%. Even the poorest result, 75% for activity 3, is promising.
5.2 Comparison with Other Graphical Models
In order to evaluate the robustness and coupling ability, the FHMM is compared with the CHMM and an Independent HMM (IHMM). The two component HMMs in the CHMM and the IHMM have the same structures as those of the FHMM, i.e., a 2-state HMM for the frontal view and a 3-state HMM for the lateral view. Both of them use the same training and testing data as the FHMM. The structure of the CHMM is shown in Fig. 4.1; more details on parameter training and inference can be found in [13]. The IHMM assumes O_1 and O_2 to be independent, so the joint probability of the two observations is computed by P(O_1, O_2) = P(O_1)P(O_2). This means the IHMM simply multiplies the observation probabilities of two independent HMMs. As shown in Fig. 5.1, although the three methods achieve different recognition rates for each activity, the overall performance ranking is: FHMM > CHMM > IHMM. The IHMM obtains the worst recognition performance because it does not consider the correlations between the observations from the two views. As described in Section 2, some activities look very similar from a single view, and thus misrecognition is hard to avoid; this misrecognition increases linearly with the product of P(O_1) and P(O_2). The recognition performance of the CHMM is better than that of the IHMM but worse than that of the FHMM, because the CHMM optimizes all parameters globally by iteratively updating the component HMMs' parameters and the coupling parameters. Therefore more training data and more iterations are needed for it to converge. Given the same training data as the FHMM, but a higher demand for training data, the parameters of the CHMM may
(1. Ground truth data)  (2. Frame loss data)  (3. Frame loss data)
Fig. 5. Recognition rates for CHMM, FHMM and IHMM in the case of different data
not be robust. Moreover, the component HMMs of the CHMM are linked through their hidden states, which cannot fully represent the statistical relationship between the observations extracted from the different cameras: the dependence between the hidden states is too weak to represent the coupled observation videos accurately. Compared with them, the FHMM links the hidden states of one HMM to the observations of the other, which gives it a stronger coupling ability than the CHMM.
5.3 Comparison Experiments in the Case of Frame Loss
In order to compare the robustness of the CHMM, FHMM and IHMM, we simulate 120 sequences (20 samples per activity type) with frame loss by removing 10 frames (frames 26 to 35; each activity has about 50–90 frames). Fig. 5.2 illustrates the recognition results for the three models. In spite of lower recognition rates than on the ground truth data, the FHMM still outperforms the CHMM and the IHMM. This shows that the FHMM is more robust to frame loss in the video than the other two models. In order to test the coupling ability of the three models, we also simulate 120 frame-loss sequences (20 samples per type), but remove 10 frames from the frontal view (frames 26 to 35) and a different 10 frames from the lateral view (frames 16 to 25). Fig. 5.3 illustrates the recognition results for the three models. Compared with Fig. 5.2, the performance of the FHMM and the IHMM does not change much, but that of the CHMM decreases noticeably (compare the blue parts of Fig. 5.2 and 5.3, respectively). Note that the CHMM does not always perform better than the IHMM: for activities 1 and 4, the CHMM gets even worse results than the IHMM. Because the frame loss in the two views does not happen at the same time, the state relationship coupled in the CHMM is destroyed, which leads to a lower recognition rate.
6 Conclusion
From the theoretical and experimental analysis of our proposed approach, we can see that the FHMM based on the R transform descriptor has many advantages for multi-view activity recognition. Firstly, only the silhouette is taken as input, which is easier to obtain than meaningful feature points that need tracking and correspondence. Secondly, the R transform descriptor captures both the boundary and the internal content of the shape. The computation of the 2D R transform is linear, so the computation cost is low. Moreover, the R transform performs well for similar but actually different shape sequences, e.g., gymnastic activities. Thirdly, activity features based on multiple views are easy to acquire and carry abundant information for discriminating similar activities. Finally, compared with the CHMM and the IHMM, the FHMM achieves the best performance with lower model complexity and computational cost. Even in the case of frame loss, the FHMM shows strong robustness, with a strong ability to couple the inputs from the two views.
Acknowledgment The work reported in this paper was funded by research grants from the National Basic Research Program of China (No. 2004CB318110), the National Natural Science Foundation of China (No. 60605014, No. 60335010 and No. 2004DFA06900) and CASIA Innovation Fund for Young Scientists.
References 1. Hu, W., Tan, T., Wang, L., Maybank, S.: A survey on visual surveillance of object motion and behaviors. IEEE Trans. on Systems, Man and Cybernetics, Part C: Applications and Reviews 34, 334–352 (2004) 2. Madabhushi, A.R., Aggarwal, J.K.: Using movement to recognize human activity. In: ICIP, vol. 4, pp. 698–701 (2000) 3. Cohen, I., Li, H.: Inference of human postures by classification of 3D human body shape. In: IEEE Internal Workshop on FG, pp. 74–81. IEEE Computer Society Press, Los Alamitos (2003) 4. Parameswaren, V.V., Chellappa, R.: Human Action-Recognition Using Mutual Invariants. Computer Vision and Image Understanding 98, 295–325 (2005) 5. Weinland, D., Ronfard, R., Boyer, E.: Free Viewpoint Action Recognition using Motion History Volumes. In: CVIU (2006) 6. Bui, H., Venkatesh, S., West, G.: Policy Recognition in the Abstract Hidden Markov Model. Journal of Artificial Intelligence Research 17, 451–499 (2002) 7. Bobick, A., Davis, J.: The recognition of human movement using temporal templates. In: PAMI, vol. 23, pp. 257–267 (2001) 8. Huang, F., Di, H., Xu, G.: Viewpoint Insensitive Posture representation for action recognition (2006) 9. Wang, Y., Huang, K., Tan, T.: Human Activity Recognition based on Transform. In: The 7th IEEE International Workshop on Visual Surveillance, IEEE Computer Society Press, Los Alamitos (2007) 10. Tabbone, S., Wendling, L., Salmon, J.-P.: A new shape descriptor defined on the Radon transform. Computer Vision and Image Understanding 102 (2006) 11. Deans, S.R.: Applications of the Radon Transform. Wiley Interscience Publications, Chichester (1983) 12. Pan, H., Levinson, S.E., Huang, T.S., Liang, Z.-P.: A Fused Hidden Markov Model with Application to Bimodal Speech Processing. IEEE Transactions on Signal Processing 52, 573–581 (2004) 13. Brand, M., Oliver, N., Pentland, A.: Coupled hidden Markov models for complex action recognition. In: CVPR, pp. 994–999 (1997) 14. Rabiner, L.R.: A Tutorial On Hidden Markov Models and Selected Applications in Speech. Proceedings of the IEEE 77(2), 257–286 (1989) 15. Luttrell, S.P. (ed.): The use of Bayesian and entropic methods in neural network theory. Maximum Entropy and Bayesian Methods, pp. 363–370. Kluwer, Boston (1989) 16. Pan, H., Liang, Z.-P., Huang, T.S.: Estimation of the joint probability of multisensory signals. Pattern Recogn. Letter 22, 1431–1437 (2001)
Real-Time and Marker-Free 3D Motion Capture for Home Entertainment Oriented Applications
Brice Michoud, Erwan Guillou, Hector Briceño, and Saïda Bouakaz
Laboratory LIRIS - CNRS, UMR 5205, University of Lyon, France
{brice.michoud,erwan.guillou,saida.bouakaz}@liris.cnrs.fr
Abstract. We present an automated system for real-time, marker-free motion capture from two calibrated webcams. For fast 3D shape and skin reconstruction, we extend Shape-From-Silhouette algorithms. The motion capture system is based on simple and fast heuristics to increase efficiency. A multi-modal scheme using shape and skin-part analysis, temporal coherence, and human anthropometric constraints is adopted to increase robustness. Thanks to fast algorithms, low-cost cameras and the fact that the system runs on a single computer, our system is well suited for home entertainment devices. Results on real video sequences demonstrate the efficiency of our approach.
1 Introduction
In this paper we propose a real-time method for markerless 3D human motion capture suitable for home entertainment (see Fig. 2(a)). While commercial real-time products using markers are already available, online marker-free systems remain an open issue because many real-time algorithms still lack robustness or require expensive devices. While most popular techniques run on PC clusters, our system requires two low-cost cameras (e.g., webcams) and a laptop computer. Our system works in real time (at least 30 fps), without markers (active or passive) or any particular sensors. Several techniques have been proposed to tackle the marker-free motion capture problem; they differ in the number of cameras and the analysis method used. We now review techniques related to our work. Motion capture systems vary in the number of cameras used. Single-camera systems [1] encounter several limitations; in some cases they suffer from ambiguous responses, as different positions can yield the same image. Concerning multi-view approaches, most techniques are based on silhouette analysis [2,3]. These techniques provide good results if the reconstructed shape topology complies with the human topology, i.e., each body part is unambiguously mapped to the 3D shape estimation. In cases of self-occlusion or large contacts between limbs and body, these techniques frequently fail. The method of Caillette et al. [4] involves shape and color cues: colored blobs are linked to a kinematic model to track an individual's body parts. This technique requires contrasting clothing between body parts for tracking, thus adding a usability constraint. Few methods provide real-time motion
Fig. 1. (a) System overview: Reconstruction algorithms and Pose estimation algorithms. Body parts labeling (b) and joint naming (c).
capture from multiple views; most of them run at interactive frame rates (10 fps for [4]). We therefore propose a fully automated system for practical real-time motion capture from two calibrated webcams. Our process is based on simple heuristics, driven by shape and skin-part topology analysis and temporal coherence. It runs at 30 fps. This article is organized as follows. Section 2 presents our work on real-time 3D reconstruction. Section 3 gives an overview of the motion tracking. Section 4 details the motion tracking and Section 5 presents the initialization step. Experimental results are presented in Section 6; they show the validity and robustness of our process. In Section 7 we conclude on our contributions and outline perspectives for this work.
2 3D Shape and Skin-Parts Estimation
We propose extensions of Shape-From-Silhouette (SFS) algorithms that reconstruct, in real time, the 3D shape and the 3D skin-colored parts of a person. SFS methods compute in real time an estimation of the 3D shape of an object from its silhouette images. Silhouette images are binary masks corresponding to the captured images, where 0 corresponds to background and 1 to the (interesting) features of the object. The formalism of SFS was introduced by A. Laurentini [5]. By definition, an object lies inside the volume generated by back-projecting its silhouette through the camera center (called the silhouette's cone). With multiple views of the same object at the same time, the intersection of all the silhouette cones builds a volume called the "Visual Hull", which is guaranteed to contain the real object. There are mainly two ways to compute an object's Visual Hull: surface-based and volumetric-based methods. Surface-based approaches compute the intersection of the silhouette cone surfaces (see Fig. 2(b)). Silhouette edges are first converted into polygons, then back-projected to form silhouette cones. The intersection of these cones approximates the object shape ([6]). Because of their high computation time, these methods are
Fig. 2. (a) Interaction setup. (b) Object reconstruction by surface and volumetric approaches. (c) SFS using “projective texture mapping”. (d) Example of ghost objects.
not well suited for real-time reconstruction on a single computer. Volumetric-based approaches [2] usually estimate the shape by processing a set of voxels. The object's acquisition area is split up into a 3D grid of voxels (volume elements). Each voxel remains part of the estimated shape if its projection in all images lies in all silhouettes (see Fig. 2(b)). This volumetric approach is suitable for real-time pose estimation, due to its fast computation and robustness to noisy silhouettes. We propose a new framework which simultaneously computes a 3D volumetric shape estimation and a skin estimation from "hybrid" silhouettes. Our GPU implementation provides real-time reconstruction on a laptop computer.
2.1 Image Processing
Our 3D reconstruction system consists of three tasks. First, the webcams are calibrated using the popular algorithm proposed by Zhang et al. [7]. To enforce coherency between the two webcams, color calibration is done using the method proposed by N. Joshi [8]. The second step consists of silhouette segmentation. We assume that the background is static and the subject moves. We use the method proposed in [9]: first, we acquire images of the background (without the user); the user is then detected at the pixels whose value has changed. The third step consists of extracting skin parts from the silhouette masks and color images. The "Normalized Look-up Table" method [10] provides fast skin-color segmentation. This segmentation is applied to each image, restricted to the silhouette mask.
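A minimal CPU sketch of the two segmentation steps (background subtraction, then skin detection restricted to the silhouette) is given below. The simple per-pixel difference threshold and the normalised-rg lookup table are stand-ins for the cited methods [9,10], chosen only to illustrate the data flow; all names are ours.

```python
import numpy as np

def silhouette_mask(frame, background, diff_threshold=30):
    # Foreground where the colour differs sufficiently from the learnt background image.
    diff = np.abs(frame.astype(int) - background.astype(int)).sum(axis=2)
    return diff > diff_threshold

def skin_mask(frame, silhouette, skin_lut, bins=32):
    # skin_lut: (bins, bins) boolean table indexed by normalised (r, g) chromaticity,
    # learnt offline from labelled skin pixels (stand-in for the normalised LUT of [10]).
    rgb = frame.astype(float)
    s = rgb.sum(axis=2) + 1e-6
    r_idx = np.clip((rgb[..., 0] / s * bins).astype(int), 0, bins - 1)
    g_idx = np.clip((rgb[..., 1] / s * bins).astype(int), 0, bins - 1)
    return silhouette & skin_lut[r_idx, g_idx]
```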
2.2 Extended GPU SFS Implementation
Volumetric SFS is usually based on voxel projection: a voxel remains part of the estimated shape if it projects into every silhouette. To better fit a GPU implementation we choose the opposite: we project each silhouette into the 3D voxel grid, as proposed in [9]. If a voxel is intersected by all the silhouette projections, then it represents the original object. This provides 3D shape estimation in real time. We extend this method in order to compute 3D shape and skin estimations in parallel. The classical N³ voxel cube can be considered as a stack of N images of resolution N × N. We stack the N images in screen-parallel planes. For each webcam, let HS (hybrid silhouette) be a two-channel image which contains the silhouette mask in the first channel and the skin mask
in the second. For each camera view, its HS is projected onto each slice using the "projective texture mapping" technique. The per-channel intersection of all HS projections on all slices provides voxel-based 3D shape (in the first channel) and 3D skin (in the second channel) estimations. A single-channel HS projection is illustrated in Fig. 2(c). Our implementation provides 100 reconstructions per second. Our fast method assumes that the skin parts are visible in all the cameras. In our context this holds, as the user looks at the screen and thus faces the two webcams (Fig. 2(a)).
Ghost Objects Removal. One limitation of SFS is the construction of ghost objects. They appear when there are visual ambiguities, as in the example shown in Fig. 2(d): a single person is filmed, which should correspond to one 3D connected component in the estimated shape. To remove ghost parts we keep the voxels which are in the biggest (in volume, i.e., voxel count) connected component.
Data Simplification. To reduce the computation time of pose estimation we use a subset of each set of voxels (i.e., shape voxels and skin voxels). Using a surface normal estimation for each surface voxel (i.e., voxels having fewer than 26 neighbors), we keep the voxels that face the webcams. Let V_shape be the selected voxels from the shape voxel set, V_skin be the selected voxels from the skin voxel set, and V_all be their union.
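For reference, a plain CPU formulation of the voxel test (keep a voxel if it projects into every hybrid silhouette) can be written as follows. This is the classical carving loop, not the GPU slice-projection scheme described above; the camera projection matrices and the voxel grid layout are assumed inputs.

```python
import numpy as np

def carve(voxel_centers, projections, hybrid_silhouettes):
    """voxel_centers: (V, 3) world coordinates.
    projections: list of 3x4 camera matrices.
    hybrid_silhouettes: list of (H, W, 2) masks, channel 0 = silhouette, channel 1 = skin.
    Returns boolean arrays (shape_voxels, skin_voxels)."""
    V = voxel_centers.shape[0]
    shape_keep = np.ones(V, dtype=bool)
    skin_keep = np.ones(V, dtype=bool)
    homog = np.hstack([voxel_centers, np.ones((V, 1))])
    for P, hs in zip(projections, hybrid_silhouettes):
        uvw = homog @ P.T
        u = (uvw[:, 0] / uvw[:, 2]).round().astype(int)
        v = (uvw[:, 1] / uvw[:, 2]).round().astype(int)
        inside = (u >= 0) & (u < hs.shape[1]) & (v >= 0) & (v < hs.shape[0])
        sil = np.zeros(V, dtype=bool)
        skin = np.zeros(V, dtype=bool)
        sil[inside] = hs[v[inside], u[inside], 0] > 0
        skin[inside] = hs[v[inside], u[inside], 1] > 0
        shape_keep &= sil          # a voxel must lie inside every silhouette
        skin_keep &= skin          # a skin voxel must lie inside every skin mask
    return shape_keep, skin_keep
```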
3 Motion Capture
The goal of motion capture is to determine the pose of the body over time. We can determine the pose of the body if we can associate each voxel with a body part. The joint labeling is presented in Figure 1(c). We propose a system based on simple and fast heuristics. This approach is less accurate than registration-based methods, but it runs in real time. Robustness is increased by using a multi-modal scheme composed of shape and skin-part analysis, temporal coherence and human anthropometric constraints. Our motion capture runs in two steps, initialization and tracking, which use the same algorithm with different initial conditions. The initialization step (see Section 5) estimates the anthropometric values and the initial pose. Using this information, the tracking step then tracks the joint positions (see Section 4). Our premises are that both the hands and the person's face are partially uncovered, that the torso is dressed, and that the clothing has a non-skin color. We introduce some common notations: L_x denotes the length of body part x (see Fig. 1(b)), D_x its orientation and R_x its radius (of a sphere or cylinder). J^n denotes the value of a quantity J (joint position, voxel set, ...) at frame n. When dealing with sided joints, indices l and r denote the left and right side, respectively. V_x denotes a set of voxels, E_{V_x} its inertia ellipsoid and Cog(V_x) its center of gravity. For iterative algorithms, J(i) denotes the value of J at step i.
4 Body Parts Tracking
To track the body parts, we assume that the previous body pose and the anthropometric values are known. Using the 3D shape estimation and the 3D skin parts, we track the human body parts in real time. The tracking process works on the active voxels V_act. This set of voxels is initialized to all voxels V_all and updated at each step by removing the voxels used to estimate body parts. First we estimate the head joints, next we track the torso joints, and finally we compute the limb joints.
4.1 Head Tracking
This step aims at finding T^n and B^n, respectively the positions of the top of the head and of the connection point between head and neck at frame n. Let V^n_face be the face voxels at the current frame. By hypothesis, V^n_skin contains face and hand voxels. Using a temporal coherence criterion, V^n_face is the connected component of V^n_skin nearest to the previous set of face voxels V^{n-1}_face. The center of the head C^n is computed by fitting a sphere S(i) in V^n_act (see Figure 3). The sphere S(i) is defined by its center C^n(i) and radius R_head.
Head Fitting Algorithm. C^n(0) is initialized as the centroid of V^n_face. At step i of the algorithm, C^n(i) is the centroid of the set V^n_head(i) of active voxels that lie inside the sphere S(i-1) defined by its center C^n(i-1) and its radius R_head (see Fig. 3(a)). The algorithm iterates until a step k at which the position of C^n stabilizes, i.e., the distance between C^n(k-1) and C^n(k) falls below a threshold ε_head.
Head Joints Estimation. Knowing the position of C^n, B^n (respectively T^n) is computed as the lower (resp. upper) intersection between S(k) and the principal axis of E_{V^n_head} (see Fig. 3(b)). The back-to-front direction D^n_b2f is defined as the direction from C^n towards the centroid of V^n_face (note that the voxels from the back of the head are not in V_skin). At this point, we remove from V^n_act the set of elements that belong to V^n_head.
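The iterative sphere fit can be summarised as below; voxel sets are arrays of 3D points, and the stopping threshold eps_head is a free parameter. This is an illustrative sketch with our own names, not the authors' code.

```python
import numpy as np

def fit_head_sphere(active_voxels, face_voxels, r_head, eps_head=5.0, max_iter=20):
    """Iteratively re-centre a sphere of fixed radius r_head on the active voxels.
    active_voxels, face_voxels: (N, 3) arrays of voxel centres (e.g. in mm)."""
    c = face_voxels.mean(axis=0)                     # C(0): centroid of the face voxels
    inside = np.zeros(len(active_voxels), dtype=bool)
    for _ in range(max_iter):
        inside = np.linalg.norm(active_voxels - c, axis=1) <= r_head
        if not inside.any():
            break
        c_new = active_voxels[inside].mean(axis=0)   # C(i): centroid of voxels in the sphere
        converged = np.linalg.norm(c_new - c) < eps_head
        c = c_new
        if converged:                                # convergence test on the centre motion
            break
    return c, inside                                 # head centre and (approximate) head voxels
```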
Fig. 3. (a) Head sphere fitting (light gray denotes V^n_face, dark gray denotes V^n_head(i)), (b) head joints estimation, (c) torso segmentation by cylinder fitting, (d) the "binding" step of leg tracking and (e) leg articulation estimation.
4.2 Torso Tracking
This step aims at finding P^n, the pelvis position, by fitting a cylinder in V^n_act. Estimating the torso shape by a cylinder provides a simple and fast way to localize the pelvis. Let V^n_torso be the set of voxels that describes the torso; it is initialized using the voxels V^n_act. At step i, the algorithm estimates D^n_torso by fitting a cylinder CYL(i-1) in V^n_torso(i) (see Fig. 3(c)). CYL(i) has a cap anchored at B^n, its radius is R_torso, its length is L_torso and its axis is D^n_torso(i).
Torso Fitting Algorithm. V^n_torso(0) is initialized with V^n_act, and the vector from B^n to P^{n-1} defines the initial value of D^n_torso(0). At step i, V^n_torso(i) is computed as the set of elements from V^n_torso(i-1) that lie in CYL(i-1). D^n_torso(i) is then the principal axis of E_{V^n_torso(i)} (see Fig. 3(c)). The algorithm iterates until a step k at which the distance between the axis of CYL(k) and the centroid of V^n_torso(k) falls below a threshold ε_torso. The position P^n is defined as the center of the lower cap of CYL(k).
Global Body Orientation. The top-down orientation D^n_t2d of the acquired subject is given by P^n - B^n. D^n_b2f was computed in Section 4.1. The left-to-right orientation D^n_l2r of the acquired subject is given by D^n_l2r = D^n_t2d × D^n_b2f. V^n_act is then updated by removing its elements that belong to V^n_torso.
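The torso fit follows the same fixed-point pattern as the head fit; a compact sketch is given below, where the cylinder membership test checks both the distance to the axis and the position of the projection along it. Parameter names are ours and this is only an illustrative approximation of the procedure described above.

```python
import numpy as np

def fit_torso_cylinder(voxels, neck, pelvis_prev, r_torso, l_torso,
                       eps_torso=5.0, max_iter=20):
    """voxels: (N, 3) active voxels; neck = B^n; pelvis_prev = P^{n-1}."""
    axis = pelvis_prev - neck
    axis = axis / np.linalg.norm(axis)                   # D_torso(0)
    members = np.zeros(len(voxels), dtype=bool)
    for _ in range(max_iter):
        rel = voxels - neck
        t = rel @ axis                                    # coordinate along the cylinder axis
        radial = np.linalg.norm(rel - np.outer(t, axis), axis=1)
        members = (t >= 0) & (t <= l_torso) & (radial <= r_torso)
        pts = voxels[members]
        if len(pts) < 3:
            break
        centred = pts - pts.mean(axis=0)
        _, _, vt = np.linalg.svd(centred, full_matrices=False)
        axis = vt[0] if vt[0] @ (pelvis_prev - neck) > 0 else -vt[0]
        # Stop when the member centroid lies (almost) on the current axis.
        if np.linalg.norm(np.cross(pts.mean(axis=0) - neck, axis)) < eps_torso:
            break
    pelvis = neck + l_torso * axis                        # centre of the lower cylinder cap
    return pelvis, axis, members
```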
4.3 Arms Tracking
We propose a simple and robust algorithm to compute the forearm joint positions. First, we compute hand positions from skin voxels. Using the forearm length, we determine the elbow positions. Temporal coherence is used to compute their sides. Let V^n_hand be the set of potential hand voxels and L_height be the acquired human body length; L_height/2 is an upper bound of the arm length. V^n_hand is defined by the voxels of V^n_skin - V^n_face that lie within a sphere of center B^n and radius L_height/2. By hypothesis, V^n_skin contains the hand and face voxels. The different forearm configurations are:
Two Distinct Hands. V^n_hand contains several connected components. Let V^n_hand0 and V^n_hand1 be the two biggest ones, corresponding to the two hands, with H^n_x = Cog(V^n_handx), x in {0, 1}. Forearms have a constant length L_farm across time. The potential voxels for forearm x are the voxels from V^n_act which lie within a sphere of radius L_farm centered at H^n_x. The connected component of these voxels which contains H^n_x represents forearm x. Let V^n_farmx be this connected component; there are two possible cases to identify the elbows.
If V^n_farm0 and V^n_farm1 do not intersect, then we use the principal axis of EV^n_farmx and L_farm to compute the elbow position E^n_x. The sides are computed using a temporal coherence criterion: the side of forearm x is the same as that of the closest forearm computed at the previous frame.
Otherwise V^n_farm0 and V^n_farm1 intersect and the forearms are touching each other. In that case we first identify the hand sides using the property of constant forearm length: H^n_x is right-sided if

||d(H^n_x, E^{n-1}_r) - L_farm|| < ||d(H^n_x, E^{n-1}_l) - L_farm||,
or H^n_x is left-sided otherwise. The voxels v_i of V^n_farm0 and V^n_farm1 are segmented into two parts V^n_farmr and V^n_farml using the "point to line mapping" algorithm (see Section 4.4). If v_i is closer to [H^n_r E^{n-1}_r] than to [H^n_l E^{n-1}_l], v_i is added to V^n_farmr; else v_i is added to V^n_farml. We compute E^n_r and E^n_l with the principal axes of EV^n_farmr, EV^n_farml and L_farm.
One Hand or Joined Hands. V^n_hand contains only one connected component, which corresponds to joined hands or to only one hand (the other is not visible). We use temporal coherence to disambiguate these two cases. If H^{n-1}_r and H^{n-1}_l are close to V^n_hand, then the hands are joined, H^n_r = H^n_l = Cog(V^n_hand), and we compute V^n_farm as proposed previously. We segment V^n_farm into two parts V^n_farmr and V^n_farml by the plane orthogonal to [E^{n-1}_r E^{n-1}_l] containing H^n_l. The principal axes of EV^n_farmr, EV^n_farml and L_farm are used to compute E^n_r and E^n_l. Otherwise, the closest hand H^{n-1}_x to V^n_hand is used to determine the side of H^n_x, and H^n_x = Cog(V^n_hand). We compute V^n_farm as proposed previously and its principal axis of inertia is used to compute E^n_x.
No Visible Hand. V^n_hand is empty, so no hand is visible. We carry the positions computed at frame n - 1 over to the current frame. V^n_act is updated by removing the elements that belong to each forearm.
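A minimal sketch of the two ingredients used above, assuming the forearm voxels are given as an N x 3 NumPy array; the elbow is placed on the principal axis of the forearm voxels at the constant forearm length from the hand, and hand sides are assigned by temporal coherence against the previous elbows. All names are illustrative:

import numpy as np

def estimate_elbow(hand, v_farm, l_farm):
    """Place the elbow on the principal axis of the forearm voxels,
    at the constant forearm length l_farm from the hand center."""
    centered = v_farm - v_farm.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axis = vt[0]                              # principal axis of inertia
    # Orient the axis so that it points from the hand into the forearm.
    if np.dot(v_farm.mean(axis=0) - hand, axis) < 0:
        axis = -axis
    return hand + l_farm * axis

def assign_sides(hands, prev_elbows, l_farm):
    """Temporal coherence: a hand is labelled 'right' when its distance to the
    previous right elbow is the one closest to the forearm length."""
    sides = []
    for h in hands:
        err_r = abs(np.linalg.norm(h - prev_elbows['right']) - l_farm)
        err_l = abs(np.linalg.norm(h - prev_elbows['left']) - l_farm)
        sides.append('right' if err_r < err_l else 'left')
    return sides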
4.4 Legs Tracking
Until now all body parts but the legs have been estimated; hence V^n_act contains only the leg voxels. Our leg joint extraction is inspired by the "point to line mapping" process used to bind an animation skeleton to a 3D mesh [11]. The elements of V^n_act are split up into four sets V^n_thighl, V^n_calfl, V^n_thighr and V^n_calfr depending on their Euclidean distance to the segments [P^n, K^{n-1}_l], [K^{n-1}_l, F^{n-1}_l], [P^n, K^{n-1}_r] and [K^{n-1}_r, F^{n-1}_r] (see Fig. 3(d)). For the left/right side x, we compute the inertia ellipsoid EV^n_calfx and its extremal points P0 and P1. The knee is the intersection point between thigh and calf (Fig. 3(e)); hence the foot position F^n_x is the EV^n_calfx extremal point farthest from EV^n_thighx (say it is P1). The knee is then aligned on [P0 P1], on the P0 side, at a distance L_calf from F^n_x.
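The "point to line mapping" step can be sketched as follows (hypothetical helper names, NumPy assumed): each remaining voxel is assigned to the closest of the four segments built from the pelvis and the previous-frame knee and foot joints:

import numpy as np

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment [a, b]."""
    ab = b - a
    t = np.clip(np.dot(p - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def split_leg_voxels(v_act, pelvis, knees_prev, feet_prev):
    """Assign every remaining voxel to the closest limb segment."""
    segments = {
        'thigh_l': (pelvis, knees_prev['l']),
        'calf_l':  (knees_prev['l'], feet_prev['l']),
        'thigh_r': (pelvis, knees_prev['r']),
        'calf_r':  (knees_prev['r'], feet_prev['r']),
    }
    parts = {name: [] for name in segments}
    for p in v_act:
        dists = {name: point_segment_distance(p, a, b)
                 for name, (a, b) in segments.items()}
        parts[min(dists, key=dists.get)].append(p)
    return {name: np.array(pts) for name, pts in parts.items()}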
5 Body Parts Initialization
In this section we present our techniques to estimate the anthropometric measurements and the initial pose of the body. The methods presented in the literature for initial pose estimation can be classified into three categories. In the first kind [6], the anthropometric measurements and the initial pose are entered manually. Another class of methods needs a fixed pose, like the T-pose [12]; these methods work in real time. The last class of methods is fully automatic [2] and does not need a specific pose, but is not real-time. Our approach is real-time and fully automated for any movement, as long as the filmed person is standing up, with the hands below the level of the head and the feet not joined.
Anthropometric Measurements. They correspond to the lengths of each body part [13]. We estimate some anthropometric measures as average ratios of the human body length. Let L_height be the acquired human body length, estimated as the maximum distance from the foreground voxels to the floor plane. Knowing L_height, the anthropometric measures can be approximated by the ratios R_head ~ L_height/16, L_torso ~ L_height/8, L_farm ~ L_height/6 and L_calf ~ L_height/4. The active set of voxels V_act is initialized with V_all.
Head Initialization. This step aims at finding T^0 and B^0. From our initialization hypothesis, the face voxels V^0_face are defined by the topmost connected component of V^0_skin. We compute T^0 and B^0 with the head tracking algorithm (Section 4.1). V^0_act is updated by removing the elements that belong to V^0_head.
Torso Initialization. The torso fitting algorithm (Section 4.2) is applied using V^0_act as the initial value of V^0_torso(0). D^0_torso(0) is initialized as the vector from B^0 toward the centroid of EV^0_act. The pelvis position P^0, D^0_t2d and D^0_l2r are then computed. V^0_act is updated by removing the voxels that belong to V^0_torso.
Legs Initialization. The tracking algorithm outlined in Section 4.4 needs the legs' previous positions. We simulate them by a coarse estimation, then compute more precise positions using the legs tracking algorithm. V^0_act contains the voxels that have not been used for any other part of the body. First, we compute the set of connected components of V^0_act whose height is below L_height/8. If there are fewer than two connected components, we assume that the feet are joined and cannot be distinguished. Otherwise we use the two major connected components V^0_footl and V^0_footr. The left and right assignment of the voxel sets is done using the left-to-right vector D_l2r. For the left/right side x, let v_x be the vector from P^0 to the centroid of V^0_footx. The knee and foot joints are estimated with

K^{-1}_x = T^0_x + v_x \frac{L_{thigh}}{|v_x|},   F^{-1}_x = T^0_x + v_x \frac{L_{thigh} + L_{calf}}{|v_x|}.

Finally we compute F^0_r, K^0_r, F^0_l and K^0_l using the legs tracking algorithm.
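A small sketch of the initialization arithmetic described above, with illustrative names; the coarse leg joints are anchored here at the pelvis, which the surrounding text uses as the origin of the vector v_x (an assumption of this sketch):

import numpy as np

def anthropometric_measures(l_height):
    """Approximate body-part lengths as fixed ratios of the body height."""
    return {
        'r_head':  l_height / 16.0,   # head sphere radius
        'l_torso': l_height / 8.0,    # torso length
        'l_farm':  l_height / 6.0,    # forearm length
        'l_calf':  l_height / 4.0,    # calf length
    }

def init_leg_joints(pelvis, foot_centroid, l_thigh, l_calf):
    """Coarse previous-frame knee and foot used to bootstrap the legs tracker,
    placed along the pelvis-to-foot-blob direction."""
    v = foot_centroid - pelvis
    u = v / np.linalg.norm(v)
    knee = pelvis + u * l_thigh
    foot = pelvis + u * (l_thigh + l_calf)
    return knee, foot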
6 Results
We now present the results of our system. Fig. 2(a) outlines the system configuration. The acquisition infrastructure is composed of two Philips webcams (SPC900NC) connected to a single PC (CPU: P4 3.2 GHz, GPU: NVIDIA Quadro 3450). The webcams produce images of resolution 320 x 240 at 30 fps. Our method has been applied to different persons performing fast and challenging motions. Thanks to shape analysis and the knowledge of skin parts, our system is able to acquire the joint positions for the challenging pose outlined in Fig. 4(a). This pose is difficult because the topology of the reconstructed shape is not coherent with the human shape topology. Temporal coherence is the key to
Fig. 4. (a), (b) and (c) show results for challenging poses. (d) represents the user's recovered pose with shape and skin voxels. (e-h) show results for wide-range motions.
success for the pose presented in Fig. 4(b). It shows the case of joined hands (Section 4.3), which is successfully recognized. The difficult pose shown in Fig. 4(c) is also successfully recovered by our system. Fig. 4(d) shows the shape, the skin voxels and the recovered joints. The images of Fig. 4(e), 4(f), 4(g) and 4(h) demonstrate that our system tracks a large range of movements. Additional results are included in the supplementary video, which shows the robustness of our approach. Our current experimental implementation can track more than 30 poses per second on a single computer, which is faster than the webcams' acquisition frame rate. An optimized implementation could run on the current generation of home entertainment computers. As our algorithm is based on 3D reconstruction, it depends on the voxel grid resolution. Experiments are made on a grid of 64^3 voxels with a resolution of 2.7 x 2.7 x 2.7 cm per voxel. This resolution is sufficient for human-machine interfaces in the field of entertainment.
7 Conclusion
In this paper, we have described a new marker-free human motion capture system using two webcams connected to a single computer. Fully automated and working under real-time constraints, the system is based on 3D shape analysis, human morphology constraints, and skin-color segmentation. By combining different 3D cues, the approach is robust to self-occlusion and to the coarse 3D shape approximation provided by the voxel estimation sub-system. We are able to estimate the eleven main human body joints at more than 30 frames per second. This frame rate is well suited for home entertainment applications. The system provides real-time motion capture for one person. Future work aims at providing motion capture of several persons filmed together in the same
area, even when they are in contact. For home entertainment applications, the major limitation is the silhouette processing, because the background cannot be guaranteed to be static at home. We are working on a new segmentation algorithm based on a statistical background model assisted by an optical flow algorithm.
References
1. Agarwal, A., Triggs, B.: Monocular human motion capture with a mixture of regressors. In: CVPR 2005, p. 72. IEEE Computer Society, Los Alamitos (2005)
2. Mikic, I., Trivedi, M., Hunter, E., Cosman, P.: Human body model acquisition and tracking using voxel data. Int. J. Comput. Vision 53(3), 199–223 (2003)
3. Tangkuampien, T., Suter, D.: Human motion de-noising via greedy kernel principal component analysis filtering. In: ICPR 3, pp. 457–460. IEEE Computer Society Press, Los Alamitos (2006)
4. Caillette, F., Galata, A., Howard, T.: Real-Time 3-D Human Body Tracking using Variable Length Markov Models. In: Proceedings BMVC 2005, vol. 1 (2005)
5. Laurentini, A.: The visual hull concept for silhouette-based image understanding. IEEE Trans. Pattern Anal. Mach. Intell. 16(2), 150–162 (1994)
6. Ménier, C., Boyer, E., Raffin, B.: 3D skeleton-based body pose recovery. In: Proceedings of the 3rd International Symposium on 3D Data Processing, Visualization and Transmission, Chapel Hill (USA) (2006)
7. Zhang, Z.: Flexible camera calibration by viewing a plane from unknown orientations. In: ICCV, pp. 666–673 (1999)
8. Joshi, N.: Color calibration for arrays of inexpensive image sensors. Technical report, Stanford University (2004)
9. Hasenfratz, J.M., Lapierre, M., Sillion, F.: A real-time system for full body interaction with virtual worlds. In: Eurographics Symposium on Virtual Environments, pp. 147–156 (2004)
10. Vezhnevets, V., Sazonov, V., Andreeva, A.: A survey on pixel-based skin color detection techniques. In: Proceedings of Graphicon-2003 (2003)
11. Sun, W., Hilton, A., Smith, R., Illingworth, J.: Layered animation of captured data. The Visual Computer 17(8), 457–474 (2001)
12. Fua, P., Gruen, A., D'Apuzzo, N., Plankers, R.: Markerless Full Body Shape and Motion Capture from Video Sequences. In: Symposium on Close Range Imaging, International Society for Photogrammetry and Remote Sensing, Corfu, Greece (2002)
13. Dreyfuss, H., Tilley, A.R.: The Measure of Man and Woman: Human Factors in Design. John Wiley & Sons, Chichester (2001)
Tracking Iris Contour with a 3D Eye-Model for Gaze Estimation
Haiyuan Wu, Yosuke Kitagawa, Toshikazu Wada, Takekazu Kato, and Qian Chen
Faculty of Systems Engineering, Wakayama University, Japan
Abstract. This paper describes a sophisticated method to track iris contours and to estimate the eye gaze of blinking eyes with a monocular camera. A 3D eye-model that consists of eyeballs, iris contours and eyelids is designed to describe the geometrical properties and the movements of eyes. Both the iris contours and the eyelid contours are tracked by using this eye-model and a particle filter. This algorithm is able to detect "pure" iris contours because it can distinguish iris contours from eyelid contours. The eye gaze is described by the movement parameters of the 3D eye model, which are estimated by the particle filter during tracking. Other distinctive features of this algorithm are: 1) it does not require any special light sources (e.g. an infrared illuminator) and 2) it can operate at video rate. Through extensive experiments on real video sequences we confirmed the robustness and the effectiveness of our method.
1 Introduction
The goal of this work is to realize video-rate tracking of iris contours and eye gaze estimation with a monocular camera in a normal indoor environment without any special lighting. Since blinking is a physiological necessity for humans, it is also a goal of this work to be able to cope with blinking eyes. To detect or to track eyes robustly, many popular systems use infrared light (IR) [1][2] and stereo cameras [3][4]. Recently, mean-shift filtering [5], particle filtering [6] and K-means clustering [15] have also been used for tracking eyes. The majority of the proposed methods assume open eyes, and most of them neglect the eyelids and assume that iris contours appear as circles in the image. Therefore, they cannot detect pure iris contours, which are important for eye gaze estimation. The information about eyes used for eye gaze estimation can roughly be divided into three categories: (a) global information, such as active appearance models (AAM) [5][7], (b) local information, such as eye corners, irises, and mouth corners [8][9][10], and (c) the shape of ellipses fitted to the iris contours [11]. There are also some methods based on combinations of them [12]. D. Hansen et al. [6] employ an active-contour method to track the iris. It is based on a combination of a particle filter and the EM algorithm. The method is robust against changes of the lighting condition and camera defocusing. A gaze calibration is necessary for eye gaze estimation.
Y. Tian et al. [13] propose a method for tracking the eye locations, detecting the eye states, and estimating the eye parameters. They develop a dual-state eye model and use it to detect whether an eye is open or closed. However, it cannot estimate the eye gaze correctly. It is difficult to track iris contours consistently and reliably because they are often partly occluded by the upper and lower eyelids and confused with the eyelid contours. Most conventional methods lack the ability to distinguish iris contours from eyelid contours. Some methods separate the iris contours from the eyelid contours heuristically. In order to solve this problem, we design a 3D eye-model that consists of eyeballs, irises and eyelids. We use this model together with a particle filter to track both the iris contours and the eyelid contours. With this approach, the iris contours can be distinguished from the eyelid contours and both of them can be tracked simultaneously. The shape of the iris contours can then be estimated correctly by using only the edge points between the upper and lower eyelids, which are then used to estimate the eye gaze. In this method, we assume that the movements of the two eyes of a person are synchronized and use this constraint to restrict the movement of the eyeballs in the 3D eye-model during tracking. This increases the robustness of the tracking and the reliability of the eye gaze estimation of our method. Implemented on a PC with a Pentium 4 3 GHz CPU, the processing speed is about 30 frames/second.
2 3D Eye-Model for Tracking Iris Contours
Our 3D eye-model consists of two eyes; each eye consists of an eyeball, an iris, an upper and a lower eyelid. As shown in Fig. 1(a), the two eyeballs are assumed to be spheres of equal radius r_w. The irises are defined as circles of equal radius r_b on the surface of each eyeball. Then the distance between the eyeball center and the iris plane is

z_{pi} = \sqrt{r_w^2 - r_b^2}.    (1)
The centers of the two eyeballs (c_l and c_r) are placed on the X-axis of the eye-model coordinate system symmetrically about the origin, and the distance between each center and the origin is w_x:

c_l = (-w_x, 0, 0)^T;   c_r = (w_x, 0, 0)^T.    (2)
We define the direction of the visual lines when people look forward to be the same as the Z-axis. The plane in which the iris resides is then parallel to the X-Y plane. The iris contours of the left and the right eye (p_l and p_r) can be expressed as

p_j(alpha) = p(alpha) + c_j,   j = l, r,    (3)

where p(alpha) = (r_b cos(alpha), r_b sin(alpha), z_{pi})^T, alpha in [0, 2*pi].
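For illustration, the canonical iris contour of Eqs. (1)-(3) can be sampled as follows (NumPy assumed, names illustrative):

import numpy as np

def iris_contour(r_w, r_b, w_x, side='left', n=64):
    """Sample the iris contour of one eye in eye-model coordinates (Eqs. 1-3)."""
    z_pi = np.sqrt(r_w ** 2 - r_b ** 2)       # eyeball center to iris plane
    c = np.array([-w_x if side == 'left' else w_x, 0.0, 0.0])
    alpha = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    p = np.stack([r_b * np.cos(alpha),
                  r_b * np.sin(alpha),
                  np.full_like(alpha, z_pi)], axis=1)
    return p + c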
The upper and lower eyelids are defined as B-spline curves located on the plane z = r_w, which is a vertical plane in front of the eyeballs. As shown in Fig. 1(a), each eyelid has three control points. We let the upper and the lower eyelid of the same eye share the same inner eye corner (E_hl and E_hr) and the same outside eye corner (E_tl and E_tr). The eight control points (E_hl, E_ul, E_tl, E_dl, E_hr, E_ur, E_tr and E_dr) describe the shape of the eyelids when the two eyes are open. Since the values of these eight control points, w_x, r_w and r_b depend on each individual person, we call them personal parameters; they are estimated at the beginning of tracking.
Fig. 1. (a) The structure of the 3D eye-model. (b) The gaze vector and the movement parameters of an eyeball. (c) The movement parameter (m) of an upper eyelid.
In general, when people look at a far place, the visual lines of both eyes will be approximately parallel. Also, when people blink, in most cases the lower eyelids do not move and the upper eyelids of the two eyes move in the same way. In this paper, in order to keep the number of eye movement parameters small so that the particle filter can work efficiently while not losing generality, we assume that the lower eyelids keep still and that the movements of the eyeballs and the upper eyelids of the two eyes are the same and synchronized. Therefore, only the parameters describing the movements of the eyeball and the two eyelids of one eye are necessary, because the movement of the other eye can be described with the same parameters. The movement of the left (or right) eyeball can be described by a unit vector v_e that indicates the gaze direction (see Fig. 1(b)):

v_e = (cos(theta_t) sin(theta_p), sin(theta_t), cos(theta_t) cos(theta_p))^T,    (4)

where theta_t is the angle between v_e and the X-Z plane (tilt), and theta_p is the angle between the projection of v_e onto the X-Z plane and the Z-axis (pan). In this paper, theta_t and theta_p are used as the movement parameters of the left and the right eyeball. The iris contours of the left and the right eye can then be expressed by

p^M_j(alpha, theta_t, theta_p) = R(theta_t, theta_p) p(alpha) + c_j,   j = l, r,    (5)

where

R(theta_t, theta_p) = \begin{pmatrix} \cos\theta_p & 0 & \sin\theta_p \\ -\sin\theta_t \sin\theta_p & \cos\theta_t & \sin\theta_t \cos\theta_p \\ -\cos\theta_t \sin\theta_p & -\sin\theta_t & \cos\theta_t \cos\theta_p \end{pmatrix}.    (6)
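The following sketch evaluates Eqs. (4)-(6) as reconstructed above, i.e. a tilt rotation about X composed with a pan rotation about Y applied to the canonical contour; it is an illustration under that reading of the equations, not the authors' implementation:

import numpy as np

def gaze_vector(theta_t, theta_p):
    """Unit gaze direction of Eq. (4): theta_t = tilt, theta_p = pan (radians)."""
    return np.array([np.cos(theta_t) * np.sin(theta_p),
                     np.sin(theta_t),
                     np.cos(theta_t) * np.cos(theta_p)])

def eye_rotation(theta_t, theta_p):
    """Rotation matrix of Eq. (6) as reconstructed here."""
    ct, st = np.cos(theta_t), np.sin(theta_t)
    cp, sp = np.cos(theta_p), np.sin(theta_p)
    return np.array([[cp,        0.0, sp],
                     [-st * sp,  ct,  st * cp],
                     [-ct * sp, -st,  ct * cp]])

def rotated_iris_contour(p, c, theta_t, theta_p):
    """Eq. (5): rotate the canonical contour p (N x 3) and re-attach it to the
    eyeball center c."""
    return p @ eye_rotation(theta_t, theta_p).T + c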
The movement of opening or closing the eyes can be expressed by changing the shape of the upper eyelids. This movement is expressed by the movement of the middle control point of an upper eyelid (E_ul and E_ur) along the line connecting the middle control points of the upper and the lower eyelids of an opened eye, as shown in Fig. 1(c). The displaced middle control points of the upper eyelids can be expressed as

E'_{uj} = m E_{uj} + (1 - m) E_{dj},   j = l, r,  m in [0, 1],    (7)
where m is a parameter that describes the movement of both upper eyelids. The parameters theta_t, theta_p and m are the movement parameters of our 3D eye-model. In order to track the iris contours and estimate the eye gaze with a particle filter, it is necessary to project the iris contours and the eyelid contours onto the image plane. This can be done with the following equation:

i_p = M_c (R_h p + T_h).    (8)
Here p is a point on an iris contour or an eyelid contour in the 3D eye-model and i_p is its projection onto the image plane. M_c is the projection matrix of the camera, which is assumed known. R_h and T_h are the rotation matrix and the translation vector of the eye-model coordinate system relative to the camera; they are the movement parameters of the head and are also estimated with the particle filter during tracking.
3 Likelihood Function for Tracking Iris
In many applications using the particle filter (also known as Condensation [14]), the distance between a model and the edges detected from an input image is used as the likelihood. However, in the case of iris contour tracking, this definition of the likelihood often leads particles to converge on a wrong place (such as the inner eyes or the eyelashes), because there are many edges at those places and the likelihood therefore also becomes high there. In order to track the iris contours with the particle filter, we define a likelihood function that considers both the image intensity and the image gradient.

3.1 The Likelihood Function of Irises
In most cases, the brightness of the iris in an input image is lower than that of its surroundings. In order to make use of this fact, the average brightness of the iris area is introduced into the likelihood function of the iris. In this paper, we let the average brightness of the iris area of the 3D eye model be 0. Then the values (E_l and E_r) indicating the likelihood of an iris candidate area in an image, considering the image intensity, can be calculated as

E_j = e^{-Y_j^2 / k},   j = l, r.    (9)
Here, Y_l and Y_r are the average brightness of the left and right iris candidate areas, and k is a constant. The higher the average brightness in those areas is, the lower E_l and E_r will be. In order to reduce the influence of non-iris-contour edges when estimating the likelihood for irises, we consider the direction of the edges as well as their strength. Since an iris area is darker than its surroundings, the direction of the image gradient at the iris contour points outward from the iris center. In this paper, we define the direction of the normal vector of the iris contour in our 3D eye-model to be outward from the iris center. Therefore, if the iris contour of the 3D eye-model and the iris contour in the image overlap, the direction of the normal vector and the image gradient will be the same. We pick n points from the iris contours of the 3D eye-model at fixed intervals as follows.
p^s_{jk} = p^M_j(2*pi*k/n),   j = l, r,  k = 0, 1, ..., n - 1.    (10)

These points are called iris contour points (ICPs). Using the hypothesis generated by the particle filter, which is a set of parameters describing the movements of the eyes and the head, the projections of the ICPs (i^s_{jk}) and their normal vectors (h_{jk}) on the image plane can be calculated. Let g(i^s_{jk}) denote the image gradient at each projected ICP. The likelihood pi_I of the iris candidates in the image is computed as

pi_I = E_l \frac{\sum_{k=0}^{n-1} B(i^s_{lk}) D(i^s_{lk})}{\sum_{k=0}^{n-1} B(i^s_{lk})} + E_r \frac{\sum_{k=0}^{n-1} B(i^s_{rk}) D(i^s_{rk})}{\sum_{k=0}^{n-1} B(i^s_{rk})},    (11)

where B(i^s_{jk}) removes the influence of the eyelid edges by ignoring the ICPs outside the region enclosed by the lower and the upper eyelids,

B(i^s_{jk}) = 1 if i^s_{jk} is between the lower and the upper eyelids, 0 otherwise,    (12)

and

D(i^s_{jk}) = h_{jk} . g(i^s_{jk}) if h_{jk} . g(i^s_{jk}) > 0, 0 otherwise,   j = l, r.    (13)

Here, . denotes the inner product.
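A compact sketch of one term of the iris likelihood, assuming the ICPs have already been projected and that their outward normals, the image gradients, the eyelid mask of Eq. (12) and the average candidate brightness are available as NumPy arrays (all parameter names, and the value of k, are illustrative):

import numpy as np

def iris_likelihood_term(mean_brightness, icp_normals, icp_gradients,
                         between_eyelids, k=50.0):
    """One (left or right) term of Eq. (11), built from Eqs. (9), (12), (13)."""
    e_j = np.exp(-(mean_brightness ** 2) / k)          # Eq. (9)
    dot = np.sum(icp_normals * icp_gradients, axis=1)
    d = np.where(dot > 0.0, dot, 0.0)                  # Eq. (13)
    b = between_eyelids.astype(float)                  # Eq. (12)
    if b.sum() == 0.0:
        return 0.0
    return e_j * np.sum(b * d) / b.sum()

def pi_I(left_term, right_term):
    # Eq. (11): the iris likelihood is the sum of the left and right terms.
    return left_term + right_term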
3.2 The Likelihood Function of Eyelids
In order to calculate the likelihood of the iris more correctly, it is also necessary to track the eyelids. Since the difference between the brightness of the two sides of an eyelid is not as large as in the case of an iris contour, we only use the image gradient to estimate the likelihood of the eyelids (pi_E):

pi_E = \frac{\sum_{k=0}^{N-1} D(i^d_{lk})}{N} + \frac{\sum_{k=0}^{N-1} D(i^d_{rk})}{N},    (14)

where

D(i^d_{jk}) = h^d_{jk} . g(i^d_{jk}) if h^d_{jk} . g(i^d_{jk}) > 0, 0 otherwise,   j = l, r,

i^d_{jk} is the projection of each point on the eyelids and h^d_{jk} is its normal vector. The likelihood function pi of the whole 3D eye-model, including the irises and the eyelids, is defined as

pi = pi_I pi_E.    (15)

4 Eye Gaze Estimation
In order to estimate the eye gaze with the 3D eye-model, it is necessary to determine the personal parameters described in Section 2 for the testee's eyes. At the beginning of tracking, we assign the positions of the inner eye corners (E_hl and E_hr) and the outside eye corners (E_tl and E_tr) on the image manually. The other eyelid parameters are derived from these four points. The personal parameters of the eyeballs are set to average human values. After this, all personal parameters except the four manually assigned points are estimated with a particle filter using several frames of the image sequence. During tracking, the movement parameters of the eyes and the head that give the maximum likelihood are estimated by the particle filter. From these movement parameters the eye gaze is calculated.
5 Experimental Results
We have tested our method of tracking iris contours and estimating eye gaze using a PC with a 3 GHz Pentium 4 CPU. The input image sequences (640 x 480 pixels) were taken by a normal video camera. The experimental environment is shown in Fig. 2(a).

5.1 Tracking the Iris Contour
First, we used our method to track iris contours. The parameters of the particle filter are theta_p, theta_t, m, T = (t_x, t_y, t_z), and psi, which is the rotation angle of the head (eye-model) around the z-axis of the camera coordinate system. For the random sampling, the standard deviations of the normal distribution were taken as theta_p: 5 [degree], theta_t: 5 [degree], m: 0.05, t_x: 0.1 [cm], t_y: 0.03 [cm], t_z: 0.01 [cm], and psi: 0.001 [rad], respectively.
Fig. 2. (a): The experimental environment. (b): The environment for evaluating the accuracy of eye gaze.
Fig. 3. Some tracking results (frames 1, 35, 45, 74, 80, 140, 170 and 200) with random sampling performed twice for each frame using 150 samples
The processing time of the tracking was 9.6 ms/frame for 100 samples and 30.8 ms/frame for 400 samples, respectively. When the eye moved quickly, large delays occurred during tracking in the case of 100 samples, and only small ones in the case of 400 samples. In order to increase the tracking accuracy, we carried out the random sampling twice for each frame with 150 samples. Some tracking results are shown in Fig. 3. In this case, the processing time was 29 ms/frame and the iris contour and the eyelid contour could be tracked without delay even when the eyes moved quickly. Moreover, the tracking accuracy was much improved. All experiments described hereafter were carried out in this way. When a person was blinking the eyes, as shown in Fig. 4, the eyelids moved quickly and thus could not be tracked perfectly. Also, since the irises were not visible when the eyes were closed, the iris contours could not be tracked exactly. When the system detects closed eyes (from the movement parameter m), it holds the former state of the irises just before the eyes were closed. When the irises become visible after blinking, the tracking of the irises will
Fig. 4. Some tracking results of blinking eyes (frames 263-270)
Fig. 5. The convergence of personal parameters: (a) radius of the eyeball r_w, (b) radius of the iris r_b, (c) the eyeball center w_x, (d) the first frame, (e) the 15th frame
be restarted. From this experimental result we confirmed that our method can track iris contours even when people blink their eyes.

5.2 Initialization of Personal Parameters
After the four eye corners had been given manually, the rest of the personal parameters of the 3D eye-model were estimated with a particle filter. For the random sampling, the standard deviations of the normal distribution were taken as r_b: 0.02 [cm], r_w: 0.01 [cm], w_x: 0.03 [cm], and for the x and y coordinates of the eyelid control points: 0.2 [cm] and 0.05 [cm], respectively. Random sampling with 1500 samples was performed for each frame, and the likelihood was evaluated twice. A known eye gaze is necessary for estimating the personal parameters. In the experiments, this estimation was carried out using the images taken when the person looked forward.
Figure 5 shows the behavior of the estimated values of some personal parameters. The horizontal axis indicates the frame number and the vertical axis indicates the estimated value. From Fig. 5, we confirmed that the personal parameters converge within several frames. Figure 5(d) and (e) show the estimated results for the initial frame and the 15th frame. The processing time for estimating the personal parameters was 430 ms/frame. After that, our algorithm could work at video rate (30.1 ms/frame) for tracking the iris contours and estimating the eye gaze.

5.3 Accuracy Evaluation of Eye Gaze Estimation
In order to evaluate the accuracy of the eye gaze estimated with the proposed method, we put some markers on a wall 4 meters away from the testee (see Fig. 2(b)). The number of testees was five. We let each testee gaze at each marker for 2 seconds and estimated the eye gaze with our system. Table 1 shows the difference between the mean value of the estimated eye gaze and the true value of each marker.

Table 1. The difference between the mean value of the estimated eye gaze and the true value of each marker (x-direction, y-direction) [unit: degree]

Person A        1st col.      2nd col.      3rd col.
Upper row       (0.6, -0.5)   (1.5, 1.9)    (2.5, 1.5)
Middle row      (2.6, 2.4)    (2.5, 1.6)    (1.8, 0.6)
Bottom row      (4.9, 5.1)    (1.9, 3.6)    (1.7, 4.3)

Person B        1st col.      2nd col.      3rd col.
Upper row       (-1.8, -0.5)  (-2.9, -0.1)  (-4.1, 1.4)
Middle row      (-1.6, 1.7)   (-1.6, 2.1)   (-4.8, 1.6)
Bottom row      (-0.9, 5.4)   (-2.4, 6.0)   (-3.4, 6.0)

Person C        1st col.      2nd col.      3rd col.
Upper row       (5.1, 3.7)    (2.5, 3.3)    (2.6, 1.3)
Middle row      (2.4, -1.8)   (2.9, -1.4)   (0.9, -0.2)
Bottom row      (1.9, 1.2)    (-2.0, 1.9)   (-3.2, 2.9)

Person D        1st col.      2nd col.      3rd col.
Upper row       (-3.3, -1.8)  (1.1, 0.0)    (-1.0, 0.0)
Middle row      (-4.5, -1.8)  (-2.4, -1.3)  (1.6, -0.1)
Bottom row      (0.9, 0.4)    (-0.9, 0.2)   (0.1, 0.8)

Person E        1st col.      2nd col.      3rd col.
Upper row       (3.5, 0.8)    (2.4, 1.4)    (4.5, 1.5)
Middle row      (-0.9, -5.3)  (1.2, -6.4)   (3.5, -3.5)
Bottom row      (3.8, -1.6)   (-2.6, -1.9)  (-1.1, -2.2)

6 Conclusion
This paper aims at tracking iris contours and estimating eye gaze from monocular video images. In order to suppress the influence of eyelid edges, we have proposed a 3D eye-model for tracking iris and eyelid contours. Using this eye-model, the eyelid contours and iris contours can be distinguished. By using only the edge points between the upper and lower eyelids, the shape of the iris can be estimated and then the eye gaze can be measured. From the experimental results, we confirmed that the proposed algorithm can track iris contours and eyelid contours robustly and can estimate the eye gaze at video rate. The proposed algorithm can be applied
to various applications, such as checking whether a driver is looking aside. Acknowledgments. This research was partially supported by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Scientific Research (C), 18200131 and 19500150.
References
1. Zhu, Z., Ji, Q.: Eye Gaze Tracking Under Natural Head Movements. In: CVPR, vol. 1, pp. 918–923 (2005)
2. Hennessey, C., Noureddin, B., Lawrence, P.: A Single Camera Eye-Gaze Tracking System with Free Head Motion. In: Symposium on Eye Tracking Research & Applications, pp. 87–94 (2006)
3. Matsumoto, Y., Zelinsky, A.: An Algorithm for Real-time Stereo Vision Implementation of Head Pose and Gaze Direction Measurement. In: FG, pp. 499–504 (2000)
4. Beymer, D., Flickner, M.: Eye Gaze Tracking Using an Active Stereo Head. In: CVPR, vol. 2, pp. 451–458 (2003)
5. Hansen, D., et al.: Tracking Eyes using Shape and Appearance. In: IAPR Workshop on Machine Vision Applications, pp. 201–204 (2002)
6. Hansen, D.W., Pece, A.: Eye Typing off the Shelf. In: CVPR, vol. 2, pp. 159–164 (2004)
7. Ishikawa, T., et al.: Passive Driver Gaze Tracking with Active Appearance. In: 11th World Congress on ITS in Nagoya, pp. 100–109 (2004)
8. Gee, A., Cipolla, R.: Estimating Gaze from a Single View of a Face. In: ICPR, pp. 758–760 (1994)
9. Zhu, J., Yang, J.: Subpixel Eye Gaze Tracking. In: FG (2002)
10. Smith, P., Shah, M., Lobo, N.: Determining Driver Visual Attention with One Camera. IEEE Trans. on Intelligent Transportation Systems 4(4), 205–218 (2003)
11. Wu, H., Chen, Q., Wada, T.: Visual Direction Estimation from a Monocular Image. IEICE E88-D(10), 2277–2285 (2005)
12. Wang, J.G., Sung, E., Venkateswarlu, R.: Eye Gaze Estimation from a Single Image of One Eye. In: ICCV (2003)
13. Tian, Y.l., Kanade, K., Cohn, J.F.: Dual-state Parametric Eye Tracking. In: FG, pp. 110–115 (2000)
14. Isard, M., Blake, A.: Condensation - conditional density propagation for visual tracking. IJCV 29(1), 5–28 (1998)
15. Hua, C., Wu, H., Chen, Q., Wada, T.: A General Framework For Tracking People. In: FG, pp. 511–516 (2006)
16. Duchowski, A.T.: A Breadth-First Survey of Eye Tracking Applications. Behavior Research Methods, Instruments, and Computers (2002)
17. Criminisi, A., Shotton, J., Blake, A., Torr, P.H.S.: Gaze Manipulation for One-to-one Teleconferencing. In: ICCV (2003)
18. Yoo, D.H., et al.: Non-contact Eye Gaze Tracking System by Mapping of Corneal Reflections. In: FG (2002)
19. Schubert, A.: Detection and Tracking of Facial Features in Real Time Using a Synergistic Approach of Spatial-Temporal Models and Generalized Hough-Transform Techniques. In: FG, pp. 116–121 (2000)
Eye Correction Using Correlation Information
Inho Choi and Daijin Kim
Department of Computer Science and Engineering, Pohang University of Science and Technology (POSTECH)
{ihchoi,dkim}@postech.ac.kr
Abstract. This paper proposes a novel eye detection method using the MCT-based pattern correlation. The proposed method detects the face by the MCT-based AdaBoost face detector over the input image and then detects two eyes by the MCT-based AdaBoost eye detector over the eye regions. Sometimes, we have some incorrectly detected eyes due to the limited detection capability of the eye detector. To reduce the falsely detected eyes, we propose a novel eye verification method that employs the MCT-based pattern correlation map. We verify whether the detected eye patch is eye or non-eye depending on the existence of a noticeable peak. When one eye is correctly detected and the other eye is falsely detected, we can correct the falsely detected eye using the peak position of the correlation map of the correctly detected eye. Experimental results show that the eye detection rate of the proposed method is 98.7% and 98.8% on the Bern images and AR-564 images.
1 Introduction
Face analysis and authentication problems are solved by three different kinds of methods [1]: holistic methods [2], local methods [2,3], and hybrid methods [4]. The holistic method identifies a face using the whole face image and needs alignment of the images and normalization using the facial features. The local method uses local facial features such as the eyes, nose, and mouth, and needs to localize the fiducial points to analyze the face image. The hybrid method uses both holistic and local features. Because the eyes are stable features of the face, they are used as reliable features for face normalization. So, it is very important to detect and localize the eyes in face authentication and/or recognition applications. Brunelli and Poggio [2] and Beymer [5] detected the eyes using template matching. It uses the similarity between the template image and the input image and is largely dependent on the initial position of the template. Pentland et al. [6] used the eigenspace method to detect the eyes. The eigenspace method showed better eye detection performance than the template matching method, but its detection performance is largely dependent on the choice of training images. Kawaguchi and Rizon [7] detected the iris using intensity and edge information. Song and Liu [8] use binary edge images. Their method includes many techniques, such as the feature template, template matching, a separability filter, binary valley extraction, and so on, and needs different parameters for different databases. So, these methods are not intuitive and not simple. We therefore consider a more natural and intuitive method that detects the face region and then searches for the eyes in subregions of the detected face. Freund and Schapire introduced the AdaBoost algorithm [9] and showed that it has a good generalization capability. Viola and Jones [10] proposed a robust face detection method using AdaBoost with simple features, and it provided a good performance for locating the face region. Fröba and Ernst [11] used the AdaBoost algorithm with a modified version of the census transform (MCT). This method is very robust to illumination changes and very fast at finding the face. Among these AdaBoost methods, we choose the MCT-based AdaBoost training method to detect the face and the eyes, due to its simplicity of learning and high speed of detection. However, sometimes we fail to detect the eyes, detecting instead the eyebrows or hair. To reduce this problem, we propose a novel eye verification process that decides whether the detected eye is a true or a false eye. The proposed eye verification method employs the MCT-based pattern correlation map. We can verify whether the detected eye is true or false depending on the existence of a noticeable peak in the correlation map. Using this property of the correlation map, we can correct the falsely detected eye using the peak position in the correlation map of the opposite eye. Assume that one eye is correctly detected and the other eye is falsely detected. Then the correlation map of the correctly detected eye provides a noticeable peak that corresponds to the true location of the falsely detected eye.
2 Eye Detection Using MCT + AdaBoost
The Modified Census Transform (MCT) is a non-parametric local transform, proposed by Fröba and Ernst [11], which modifies the census transform. It is an ordered set of comparisons of pixel intensities in a local neighborhood, representing which pixels have an intensity greater than the mean of the pixel intensities. We present an eye detection method using AdaBoost training with MCT-based eye features. In the AdaBoost training, we construct weak classifiers which classify eye and non-eye patterns and then construct the strong classifier as a linear combination of the weak classifiers. In the detection, we scan the eye region by moving a 12 x 8 scanning window and obtain the confidence value of the strong classifier at each window location. Then, we determine the window location whose confidence value is maximum as the location of the detected eye. Our MCT-based AdaBoost training has been performed using only left-eye and non-eye training images. So, when we are trying to detect the right eye, we need to flip the right subregion of the face image.
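The scanning step can be sketched as follows, assuming a grayscale eye subregion stored as a 2-D NumPy array and an already trained strong classifier exposed as a callable returning a confidence value (the callable itself is not defined here and is an assumption of this sketch):

import numpy as np

def detect_eye(eye_region, strong_classifier, win_w=12, win_h=8):
    """Slide a 12x8 window over the eye subregion and return the location with
    the maximum confidence of the boosted strong classifier."""
    best_conf, best_xy = -np.inf, (0, 0)
    h, w = eye_region.shape
    for y in range(h - win_h + 1):
        for x in range(w - win_w + 1):
            patch = eye_region[y:y + win_h, x:x + win_w]
            conf = strong_classifier(patch)
            if conf > best_conf:
                best_conf, best_xy = conf, (x, y)
    return best_xy, best_conf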
3 Eye Verification
To remove false detections, we devise an eye verification step that decides whether the detected eye is true or false, using the MCT-based pattern correlation and the symmetry of the human face [12].

3.1 MCT-Based Pattern Correlation
The MCT represents the local intensity variation of several neighboring pixels around a given pixel as an ordered bit pattern (see Fig. 1). Because the MCT value is non-linear, the decoded value of the MCT is not appropriate for measuring the difference between two MCT patterns. To solve this problem, we propose the MCT-based pattern and the MCT-based pattern correlation based on the Hamming distance, which measures the difference between two MCT-based patterns. The MCT-based pattern P(x, y) at the pixel position (x, y) is a binary representation of the 3 x 3 pixels in a fixed order, from the upper left pixel to the lower right pixel:

P(x, y) = [b_0, b_1, ..., b_8],    (1)

where b_i is a binary value obtained by the comparison function

b_{3(y'-y+1)+(x'-x+1)} = C(\bar{I}(x, y) + \alpha, I(x', y')),    (2)

where x' in {x - 1, x, x + 1}, y' in {y - 1, y, y + 1}, \bar{I}(x, y) is the mean of the neighborhood pixels and I(x', y') is the intensity of each pixel.
Fig. 1. Examples of the MCT-based patterns: (a) the bit ordering b_0 ... b_8 over the 3 x 3 neighborhood; (b) a 3 x 3 patch with intensities (15, 70, 15; 15, 70, 15; 15, 70, 15) and mean 33.3 is mapped to the MCT-based pattern 010010010 (a bit is set where I(x, y) > 33.3)
We propose the MCT-based pattern correlation to compute the similarity between two different MCT-based patterns. It is based on the Hamming distance, which counts the number of positions whose binary values differ between two MCT-based patterns. The MCT-based pattern correlation between image A and image B is defined as

\rho = \frac{1}{N} \sum_{x,y} \rho_{x,y},    (3)

where N is the number of pixels in the image and \rho_{x,y} is the MCT-based pattern correlation at the pixel position (x, y):

\rho_{x,y} = \frac{1}{9} (9 - HammingDistance(P_A(x, y), P_B(x, y))),    (4)

where P_A(x, y) and P_B(x, y) are the MCT-based patterns of images A and B, respectively.
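A small sketch of Eqs. (1)-(4), following the convention of Fig. 1 that a bit is set where the pixel is brighter than the neighborhood mean; NumPy is assumed and the function names are illustrative:

import numpy as np

def mct_pattern(image, x, y, alpha=0.0):
    """MCT-based pattern P(x, y) of Eq. (1): 9 bits comparing each pixel of the
    3x3 neighborhood against the neighborhood mean plus a small offset alpha."""
    patch = image[y - 1:y + 2, x - 1:x + 2].astype(float)
    return (patch > patch.mean() + alpha).astype(np.uint8).ravel()

def pattern_correlation(pattern_a, pattern_b):
    """rho_{x,y} of Eq. (4): similarity from the Hamming distance."""
    hamming = np.count_nonzero(pattern_a != pattern_b)
    return (9 - hamming) / 9.0

def image_correlation(img_a, img_b):
    """rho of Eq. (3): average pattern correlation over the image interior."""
    h, w = img_a.shape
    vals = [pattern_correlation(mct_pattern(img_a, x, y), mct_pattern(img_b, x, y))
            for y in range(1, h - 1) for x in range(1, w - 1)]
    return float(np.mean(vals))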
Fig. 2. Five face images with different illuminations

Table 1. Comparison between the conventional image correlation with histogram equalization and the MCT-based pattern correlation

Image pair                       Conventional image correlation with hist. eq.   MCT-based pattern correlation
Face image 1 and face image 2    0.873                                            0.896
Face image 1 and face image 3    0.902                                            0.862
Face image 1 and face image 4    0.889                                            0.883
Face image 1 and face image 5    0.856                                            0.839
Face image 2 and face image 3    0.659                                            0.827
Face image 2 and face image 4    0.795                                            0.890
Face image 2 and face image 5    0.788                                            0.849
Face image 3 and face image 4    0.846                                            0.870
Face image 3 and face image 5    0.794                                            0.865
Face image 4 and face image 5    0.627                                            0.808
Mean                             0.803                                            0.859
Variance                         0.094                                            0.028
Fig. 2 shows five different face images with different illuminations. Table 1 compares the conventional image correlation with histogram equalization and the MCT-based pattern correlation between two image pairs. The table shows that (1) the mean of the MCT-based pattern correlation is higher than that of the conventional image correlation and (2) the variance of the MCT-based pattern correlation is much smaller than that of the conventional image correlation. This implies that the MCT-based pattern is more robust to illumination changes than the conventional image correlation. The MCT-based pattern correlation map is built by sliding a detected eye over the eye region of the opposite side and computing the correlation value in terms of the Hamming distance.

3.2 Eye/Non-eye Classification
The detected left and right eye patches can each be either an eye or a non-eye. In this work, they are classified into eye or non-eye depending on the existence of a noticeable peak in the MCT-based correlation map, as follows. If there is a noticeable peak in the MCT-based correlation map, the detected eye patch is an eye; otherwise, the detected eye patch is a non-eye. Since the eye detector produces two detected eye patches, on the left and right eye subregions respectively, we build two different MCT-based pattern correlation maps:

- Case 1(2): Left(right) eye correlation map, which is the MCT-based pattern correlation map between the detected left(right) eye patch and the right(left) subregion of the face image.

We want to show how the MCT-based pattern correlation maps of a correctly detected eye and a falsely detected eye differ from each other. Three images in Fig. 3 are used to build the left eye correlation map (Case 1): (a) a right eye subregion, (b) a flipped image patch of the correctly detected left eye, and (c) a flipped image patch of the falsely detected left eye (in this case, an eyebrow). Fig. 3-(d),(e) show the correlation maps of the correctly detected left eye patch and the falsely detected left eye patch, respectively. As can be seen, the two correlation maps look very different from each other: the true eye patch produces a noticeable peak at the right eye position while the non-eye patch (eyebrow) does not produce any noticeable peak over the entire right eye subregion. From this fact, we need an effective way of finding a noticeable peak in the correlation map in order to decide whether the detected eye patch is an eye or a non-eye. In this work, we consider a simple way of peak finding based on two predetermined correlation values.
Fig. 3. A typical example of the right eye subregion, the detected eye and non-eye in the left eye subregion
The proposed eye/non-eye classification method is given below. First, we rescale the correlation map so that its highest peak value becomes 1. Second, we overlay a peak finding window W_peak with a size of w x h at the position with the highest value in the correlation map, where w and h are the width and the height of the detected eye patch. Third, we classify whether the detected eye patch E_detected is eye or non-eye according to the following rule:

E_detected = eye if R < \tau, non-eye otherwise,    (5)

where \tau is a given threshold and R is the high correlation ratio, defined as the ratio of the number of pixel positions whose correlation value is greater than a given threshold \rho_t to the total number of pixel positions within the peak finding window W_peak:

R = \frac{1}{N} \sum_{u'=u-w/2}^{u+w/2} \sum_{v'=v-h/2}^{v+h/2} C(\rho(u', v'), \rho_t),    (6)

where N is the total number of pixel positions of W_peak and C is a comparison function:

C(\rho(u', v'), \rho_t) = 1 if \rho(u', v') > \rho_t, 0 otherwise.    (7)
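A sketch of the decision rule of Eqs. (5)-(7); the threshold values rho_t and tau below are illustrative placeholders, not values taken from the paper:

import numpy as np

def classify_eye(corr_map, peak_xy, win_w, win_h, rho_t=0.8, tau=0.3):
    """Eye/non-eye decision of Eqs. (5)-(7) on an MCT-based correlation map."""
    corr = corr_map / corr_map.max()            # rescale so the peak is 1
    x, y = peak_xy
    y0, y1 = max(0, y - win_h // 2), y + win_h // 2 + 1
    x0, x1 = max(0, x - win_w // 2), x + win_w // 2 + 1
    window = corr[y0:y1, x0:x1]                 # peak finding window W_peak
    r = np.count_nonzero(window > rho_t) / window.size   # Eq. (6)
    # A true eye gives one sharp, isolated peak, so few positions exceed rho_t.
    return ('eye' if r < tau else 'non-eye'), r           # Eq. (5)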
4 Falsely Detected Eye Correction
After eye verification, we have four different classification results on the left and right eye regions: 1) eye and eye, 2) eye and non-eye, 3) non-eye and eye, and 4) non-eye and non-eye. In the first and fourth cases, we succeed and fail to detect the two eyes, respectively; in the fourth case, there is no way to detect the eyes. However, in the second and third cases, we can correct the falsely detected eye as follows. In the second case, we can locate the falsely detected right eye using the peak position of the correlation map of the correctly detected left eye. Similarly, in the third case, we can locate the falsely detected left eye using the peak position of the correlation map of the correctly detected right eye. Fig. 4 shows an example of falsely detected eye correction, where (a) and (b) show the eye region images before and after the correction, respectively. In Fig. 4-(a), A, B, and C represent the correctly detected left eye, the falsely detected right eye, and the true right eye, respectively. As can be seen in Fig. 4-(b), the falsely detected right eye is corrected well.
Fig. 4. An example of the falsely detected eye correction
5 Experimental Results
For the AdaBoost training with MCT-based eye features, we used two face databases, the Asian Face Image Database PF01 [13] and the XM2VTS Database (XM2VTSDB) [14], and prepared 3,400 eye images and 220,000 non-eye images of size 12 x 8. For evaluating the proposed eye detection method, we used the Bern database [15] and the AR face database [16]. Because the accuracy of our face detection method is 100% on the Bern images and the AR images, we consider only the performance of the proposed eye detection method. As a measure of eye detection, we define the eye detection rate as

r_eye = \frac{1}{N} \sum_{i=1}^{N} d_i,    (8)

where N is the total number of test eye images and d_i is an indicator function of successful detection:

d_i = 1 if max(\delta_l, \delta_r) < R_iris, 0 otherwise,    (9)

where \delta_l and \delta_r are the distances between the center of the detected left eye and the center of the real left eye, and between the center of the detected right eye and the center of the real right eye, respectively, and R_iris is the radius of the eye's iris.
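The evaluation metric of Eqs. (8)-(9) can be computed as in the following sketch (illustrative names, NumPy assumed):

import numpy as np

def eye_detection_rate(detected, ground_truth, r_iris):
    """Eqs. (8)-(9): a detection counts as correct when both eye-center errors
    are below the iris radius r_iris. detected and ground_truth are lists of
    ((lx, ly), (rx, ry)) eye-center pairs."""
    hits = 0
    for (dl, dr), (gl, gr) in zip(detected, ground_truth):
        delta_l = np.linalg.norm(np.array(dl) - np.array(gl))
        delta_r = np.linalg.norm(np.array(dr) - np.array(gr))
        if max(delta_l, delta_r) < r_iris:
            hits += 1
    return hits / len(detected)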
Experiments in the Bern Images
Fig. 5-(a) and Fig. 5-(b) show some Bern images whose eyes are correctly and falsely detected by the strong classifier that is obtained by the AdaBoost Training with MCT-based eye features, respectively, where the boxes represent the
(a) Some examples of the (b) Some examples of the (c) An example of falsely correctly detected results falsely detected results detected eye correction Fig. 5. Some examples of results in Bern face database Table 2. Comparisons of various eye detection methods using the Bern face database Algorithms Eye detection rate (%) Proposed method 98.7% Kawaguchi and Rizon [7] 95.3% Template matching [7] 77.9% Eigenface method using 50 training samples [7] 90.7% Eigenface method using 100 training samples [7] 93.3%
Eye Correction Using Correlation Information
705
detected eye patches and the white circles represent the center of the detected eye patches. Fig. 5-(c) shows one example of falsely detected eye correction by the proposed eye correction method, where the left and right figures represent the eye detection results before and after falsely detected eye correction, respectively. Table 2 compares the detection performance of various eye detection methods. 5.2
Experiments in the AR Images
The AR-63 face database contains 63 images (twenty-one people × three different facial expressions) and the AR-564 face database includes 564 images (94 peoples × 6 conditions (3 different facial expressions and 3 different illuminations)). Fig. 6-(a) and Fig. 6-(b) show some AR images whose eyes are correctly and falsely detected by the strong classifier that is obtained by the AdaBoost Training with MCT-based eye features, respectively, where the boxes represent the detected eye patches and the white circles represent the center of the detected eye patches.
(a) Some examples of the correctly de- (b) Some examples of the falsely detected tected results results Fig. 6. Some examples of results in AR-564 face database
(a) Results of the falsely detected
(b) Results of the correction
Fig. 7. Three examples of the falsely detected eye correction
Table 3. Comparisons of various eye detection methods using the AR face database Algorithms Proposed method Song and Liu [8] Kawaguchi and Rizon [7]
AR-63 AR-564 98.4% 98.8% 96.8% 96.8% 96.8% -
706
I. Choi and D. Kim
Fig. 7 shows four examples of falsely detected eye correction by the proposed eye correction method, where (a) and (b) represent the eye detection results before and after falsely detected eye correction, respectively. Table 3 compares the detection performance of various eye detection methods. As you see, the proposed eye detection method shows better eye detection rate than other existing methods and we the improvement of eye detection rate in the case of AR-564 face database is bigger than that in the case of AR-63 face database. This implies that the proposed eye detection method works well under various conditions than other existing eye detection methods.
6
Conclusion
We proposed a eye detection method using the MCT-based pattern correlation. The eye detection method can produce the false detection near the eyebrows or the boundary of hair and forehead in particular. When the existing eye detection method detects the eye in just one subregion, then it does not improve the eye detection rate. To overcome this limitation, we proposed the eye verification and falsely detected eye correction method based on the MCT-based pattern correlation. The MCT-based pattern correlation is based on the Hamming distance that measures the difference between two MCT-based patterns ,where the MCT-based pattern is a binary representation of the MCT. Also, the proposed MCT-based pattern is robust to the illumination changes. To verify detected eye, we proposed the eye/non-eye classification method which classifies into eye or non-eye depending on the existence of a noticeable peak in the MCT-based pattern correlation map. The proposed falsely detected eye correction method uses the peak position in the MCT-based pattern correlation map to correction of the falsely detected eye which is verified by the proposed eye/non-eye classification method. It improves the eye detection rate of the proposed eye detection method. The experimental results show that a eye detection rate of 98.7% and 98.8% can be achieve on the Bern images and AR-564 database. The proposed eye detection method works well under various conditions than other existing eye detection methods.
Acknowledgements This work was partially supported by the Korea Science and Engineering Foundation (KOSEF) through the Biometrics Engineering Research Center (BERC) at Yonsei University. Also, it is financially supported by the Ministry of Education and Human Resources Development (MOE), the Ministry of Commerce, Industry and Energy (MOCIE) and the Ministry of Labor (MOLAB) through the fostering project of the Lab of Excellency.
Eye Correction Using Correlation Information
707
References 1. Tan, X., Chen, S., Zhou, Z., Zhang, F.: Face recognition from a single image per person: A survey. Pattern Recognition 39, 1725–1745 (2006) 2. Brunelli, R., Poggio, T.: Face recognition: features versus templates. IEEE Transaction on Pattern Analysis and Machine Intelligence 15, 1042–1052 (1993) 3. Lawrence, S., Giles, C., Tsoi, A., Back, A.: Face recognition: a convolutional neuralnetwork approach. IEEE Transaction on Neural Networks 8, 98–113 (1997) 4. Martinez, A.: Recognizing imprecisely localized, partially occluded, and expression variant faces from a single sample per class. IEEE Transaction on Pattern Analysis and Machine Intelligence 24, 748–768 (2002) 5. Beymer, D.: Face recognition under varying pose. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 756–761. IEEE Computer Society Press, Los Alamitos (1994) 6. Pentland, A., Moghaddam, B., Starner, T.: View-based and modular eigenspaces for face recognition. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 84–91. IEEE Computer Society Press, Los Alamitos (1994) 7. Kawaguchi, T., Rizon, M.: Iris detection using intensity and edge information. Pattern Recognition 36, 549–562 (2003) 8. Song, J., Chi, Z., Li, J.: A robust eye detection method using combined binary edge and intensity information. Pattern Recognition 39, 1110–1125 (2006) 9. Freund, Y., Schapire, R.: A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence 14, 771–780 (1999) 10. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 511–518. IEEE Computer Society Press, Los Alamitos (2001) 11. Froba, B., Ernst, A.: Face detection with the modified census transform. In: IEEE International Conference on Automatic Face and Gesture Recognition, pp. 91–96 (2004) 12. Song, Y.J., Kim, Y.G., Chang, U.D., Kwon, H.B.: Face recognition robust to left/right shadows; facial symmetry. Pattern Recognition 39, 1542–1545 (2006) 13. Je, H., Kim, S., Jun, B., Kim, D., Kim, H., Sung, J., Bang, S.: Asian Face Image Database PF01, Technical Report. Intelligent Multimedia Lab, Dept. of CSE, POSTECH (2001) 14. Luettin, J., Maˆıtre, G.: Evaluation Protocol for the Extended M2VTS database (XM2VTSDB), IDIAP Communication 98-05. In: IDIAP, Martigny, Switzerland, pp. 98–95 (1998) 15. Achermann, B.: The face database of University of Bern. Institute of Computer Science and Applied Mathematics, University of Bern (1995) 16. Martinez, A., Benavente, R.: The AR Face Database, CVC Technical Report #24 (1998)
Eye-Gaze Detection from Monocular Camera Image Using Parametric Template Matching Ryo Ohtera, Takahiko Horiuchi, and Shoji Tominaga Graduate School of Science and Technology, Chiba University, 1-33 Yayoi-cho, Inage-ku, Chiba, 263-8522, Japan
[email protected], {horiuchi,shoji}@faculty.chiba-u.jp
Abstract. In the coming ubiquitous-computing society, an eyegaze interface will be one of the key technologies as an input device. Most of the conventional eyegaze tracking algorithms require specific light sources, equipments, devices, etc. In a previous work, the authors developed a simple eye-gaze detection system using a monocular video camera. This paper proposes a fast eye-gaze detection algorithm using the parametric template matching. In our algorithm, the iris extraction by the parametric template matching is applied to the eye-gaze detection based on physiological eyeball model. The parametric template matching can carry out an accurate sub-pixel matching by interpolating a few template images of a user’s eye captured in the calibration process for personal error. So, a fast calculation can be realized with keeping the detection accuracy. We construct an eye-gaze communication interface using the proposed algorithm, and verified the performance through key typing experiments using visual keyboard on display. Keywords: Eyegaze Detection, Parametric template matching, Eyeball model, Eyegaze Keyboard.
1 Introduction Human eyes always chase an interesting object. Gaze determines a user’s current line of sight or point of fixation. So, the direction of the eyegaze can express the interests of the user, and the gaze may be used to interpret the user’s intention for non-command interactions. Incorporating eye movements into the interaction between humans and computers provides numerous potential benefits. Moreover, the eyegaze communication interface is very important for not only users in normal health but also severely handicapped users such as quadriplegic users. Although the keyboard and the mouse are used as an interface of the computer, it is necessary for us to move the input device with the hand. Therefore, the substituted input device is necessary for the person who owed the handicap. The eyegaze detection has progressed through measurements of infrared irradiation and myoelectric potential around eyes. Gips et al. proposed a detection algorithm based on EOG method [1] in which the myoelectric potential following the motion of eyeball can be measured by electrode on the face. However, the influence of the electric noise Y. Yagi et al. (Eds.): ACCV 2007, Part I, LNCS 4843, pp. 708–717, 2007. © Springer-Verlag Berlin Heidelberg 2007
However, the influence of the electric noise embedded in these minute potentials is not negligible. Moreover, specific instruments are required for the measurements, and the load on the user is also not negligible. Cornea-reflection-based systems were also proposed in Refs. [2]-[4]; these systems require that the ambient illumination suppress extraneous reflections. Cornsweet and Crane presented a simple, high-precision eyegaze detection system using a jaw board and a headrest, in which head motion is cancelled by exploiting the fact that the first and fourth Purkinje images are affected by head motion in opposite ways [5]. However, methods using corneal reflection images require specific optical devices, such as infrared illuminators, to capture the iris as a high-luminance or low-brightness area.
Recently, video-based techniques have been studied. Compared with the non-video-based gaze tracking methods mentioned above, video-based methods have the advantage of being unobtrusive and comfortable during gaze detection. Kawato et al. presented a gaze detection method using four reference points placed on the face and three calibration images [6]. Matsumoto et al. presented a real-time stereo vision system to estimate the gaze direction [7]. Wang et al. presented a method for estimating the eyegaze by measuring the change of the contour of the iris [8]. In general, the "big-eye" approaches based on precise measurements of the eye are expensive [8],[9], because they require a pan-tilt/zoom-in stereo camera with sufficient resolution to accurately measure the iris contour or the pupil. Methods that use the movement of the iris captured by a single low-resolution camera with a fixed head pose are described in Refs. [10],[11]. These methods use a luminance gradient to extract the iris semicircle and the eye corners, so the rough position of the eyes must be specified beforehand. Iris extraction is very important in any algorithm that detects the gaze from the movement of the iris. The authors previously proposed a precise iris extraction algorithm based on the Hough transform [12]; however, it cannot detect sub-pixel movement, and the Hough transform requires a heavy computation time.
In this study, we concentrate on a video-based approach and develop a simple eyegaze detection system using only a monocular video camera. The proposed method does not require any specific equipment other than the monocular video camera and places only a light psychological burden on the user. In our algorithm, the rotation model for the eyeball is constructed from traditional physiological models, namely Emsley's eyeball model [13] and Gullstrand's model No. 2 [14]. In the system, the eyegaze angle with respect to the optical axis is calculated from the amount of movement of the iris center after calibration. Then, a coarse-to-fine parametric template matching method [15] is performed to extract the user's iris robustly, quickly, and with sub-pixel accuracy.
The remaining sections are organized as follows. In Sec. 2, a rotation model for the eyeball is defined and an eyegaze estimation algorithm is described. In Sec. 3, an iris detection algorithm, which is the key technology for estimating the eyegaze, is proposed. In Sec. 4, experimental results show the performance of the proposed gaze detection system. In order to verify the performance, in Sec. 5 the proposed gaze detection algorithm is applied to an eyegaze communication interface.
2 Gaze Estimation Model
2.1 Gullstrand's Schematic Eye No. 2 and Emsley's Reduced Schematic Eye
In order to estimate the eyegaze, we first consider physiological eyeball models. Schematic eyes are standard models in ocular optics whose parameters are given by observed values, or approximations thereof, of the optical parameters in dioptrics. Several schematic eyes have been proposed; examples include LeGrand's schematic eye, Donders' reduced eye, Lawrence's reduced eye, and Listing's reduced eye. In this paper, we use Gullstrand's No. 2 schematic eye and Emsley's schematic eye, which express the size of the eyeball in a simple way. Gullstrand's model, which consists of a precise model (No. 1) and a non-precise one (No. 2), and Emsley's model are well-known eyeball models. The Gullstrand (No. 2)-Emsley reduced eye consists of a one-surface cornea and a two-surface lens with spherical, rotationally symmetric surfaces. Values for the accommodation-stop and the super-accommodation of the eye can be represented.
2.2 Gaze Detection Algorithm
The eyegaze can be defined as a vector directed from the center of the iris to the center of the eyeball. In this study, the movement of the center of the iris is detected, and the eyegaze angle is calculated using a rotation model based on Gullstrand's reduced schematic eye (No. 2) and Emsley's reduced schematic eye. Figure 1 shows the proposed rotation model. The center of the eyeball lies 13 mm behind the cornea. The gonioscope width, i.e., the distance from the cornea to the iris, is set to 3.4 mm, the average over the reported range (3.2-3.6 mm) from unaccommodated to strongly accommodated eyes. The length from the cornea to the fundus is 23.9 mm in Emsley's reduced schematic eye. Therefore, the length from the eyeball rotation center to the bottom of the iris becomes 9.6 mm. The eyeball rotation is defined as 0 deg when the user gazes at the reference point, and the movement of the iris center calculated in Sec. 3 is denoted B. Assuming that the distance to the camera is much larger than the eyeball, the eyegaze angle θ with respect to the frontal baseline is given by

θ = sin⁻¹(B / S)    (1)
where, for oblique gaze directions, the vertical and horizontal movements B of the iris center are calculated separately and each is applied to the model. The symbol S in Fig. 1 denotes the personal size of the eyeball, which depends on the user and is calibrated in Sec. 4.2. However, a conversion is needed before applying the model, because the movement of the iris center is obtained in pixels rather than as a physical measurement. When the user fixates on the reference point, the iris appears as a circle because the eyeball is not rotated. The diameter of the iris is assumed to be 11.5 mm from the schematic eyes. Therefore, the conversion from the obtained image to physical units assumes a corneal diameter of 11.5 mm
and computes the physical length (in mm) corresponding to one pixel. Since time-sequential processing between frames is not considered, the proposed process is performed independently for each frame.
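As a minimal sketch (ours, not the authors' code) of how Eq. (1) and the pixel-to-millimetre conversion just described combine to give a gaze angle; the function and variable names are illustrative, and the default S is the schematic-eye value before personal calibration.

```python
import math

def gaze_angle_deg(delta_px, iris_diameter_px, S_mm=9.6, iris_diameter_mm=11.5):
    """Eyegaze angle (Eq. 1) from the iris-centre displacement.

    delta_px         -- displacement of the iris centre from the reference
                        position, in pixels (horizontal or vertical component)
    iris_diameter_px -- iris diameter measured while the user fixates the
                        reference point
    S_mm             -- eyeball parameter S (schematic-eye default, calibrated later)
    iris_diameter_mm -- assumed physical iris diameter from the schematic eyes
    """
    mm_per_pixel = iris_diameter_mm / iris_diameter_px
    B_mm = delta_px * mm_per_pixel                # physical displacement B
    ratio = max(-1.0, min(1.0, B_mm / S_mm))      # guard against noise
    return math.degrees(math.asin(ratio))         # theta = asin(B / S)

# Example: a 12-pixel shift with a 70-pixel iris diameter
print(gaze_angle_deg(12, 70))   # approximately 11.9 degrees
```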
Fig. 1. The eye rotation model used in our algorithm
3 Iris Detection Algorithm
3.1 Overview of the Parametric Template Matching
As described in Sec. 2.2, the eyegaze can be detected by extracting the center of the iris, so precise extraction of the iris center is essential for accurate eyegaze detection. One of the biggest problems in iris extraction is how to handle the change of iris shape caused by rotation of the eyeball. In addition, high-speed processing and accurate iris detection are required for real-time operation. In this paper, we use the parametric template matching method [15] for iris extraction. The parametric template space is defined as a template space expressed by a linear interpolation of two or more given templates called "vertex templates"; high-speed and high-accuracy matching can then be realized by coarse-to-fine matching. The vertex templates are described in detail in Sec. 3.2. The zero-mean normalized cross-correlation is calculated between a constructed parametric template and an object image as follows:
ρ(x, y) ≡ [ Σ_{(k,s)∈T} Δt(x, y, k, s) · Δg(k, s) ] / [ √( Σ_{(k,s)∈T} Δt(x, y, k, s)² ) · √( Σ_{(k,s)∈T} Δg(k, s)² ) ]    (2)

Δt(x, y, k, s) ≡ t(x + k, y + s) − t̄,   Δg(k, s) ≡ g(k, s) − ḡ
T ≡ { (k, s) | 1 ≤ k ≤ l_x, 1 ≤ s ≤ l_y },   l_x, l_y : template size
where t and g denote the luminance values of the template image and the object image, respectively. In this paper, an iris image captured while the user gazes at the center position is used as the object image, and t̄ and ḡ are the mean values of the respective images. The correlation is calculated for every position within the input image, and the position (x*, y*) with the maximum correlation ρ(x*, y*) is detected as the iris position. To reduce processing time, we use a coarse-to-fine searching algorithm, which is explained in the next subsection.
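To make Eq. (2) concrete, the following Python sketch (our own illustration, not the authors' implementation) computes a plain zero-mean normalized cross-correlation between a single fixed template and every position of a search image; the interpolation of vertex templates and the coarse-to-fine schedule are omitted here.

```python
import numpy as np

def zncc(template, image):
    """Zero-mean normalized cross-correlation score map (cf. Eq. (2)).

    template, image -- 2-D float arrays, template smaller than image.
    The maximum of the returned map indicates the best-matching position.
    """
    ly, lx = template.shape
    H, W = image.shape
    dt = template - template.mean()
    norm_t = np.sqrt((dt ** 2).sum())
    scores = np.full((H - ly + 1, W - lx + 1), -np.inf)
    for y in range(H - ly + 1):
        for x in range(W - lx + 1):
            patch = image[y:y + ly, x:x + lx]
            dg = patch - patch.mean()
            denom = norm_t * np.sqrt((dg ** 2).sum())
            if denom > 0:
                scores[y, x] = (dt * dg).sum() / denom
    return scores

# Usage sketch:
# scores = zncc(iris_template, eye_image)
# y_star, x_star = np.unravel_index(np.argmax(scores), scores.shape)
```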
3.2 Construction of the Parametric Template
In this section, we define the vertex templates and construct the parametric template. We use the following seven images as the vertex templates: (a) three local images within the input image, and (b) four iris images captured while the user gazes at each corner of the display. Images (a) are used as vertex templates for detecting the iris movement. As shown in Fig. 2, the three vertex templates t̂1, t̂2, t̂3 are local images at the positions (x, y), (x + Δx, y), and (x, y + Δy) in the input image, respectively, where Δx and Δy are the sampling intervals. Images (b) are used as vertex templates for robustness to the geometric transformation of the iris, because the iris shape is transformed into an ellipse when the eyeball rotates.
Fig. 2. Concept of the parametric template matching
Fig. 3. Vertex template images for iris transformation
In Fig. 2, these vertex templates are denoted t̂4, t̂5, t̂6, and t̂7. Figure 3 shows an example of a set of vertex images for the iris transformation. When the rotation angle is large, the iris is partially concealed by the eyelid. In general multi-template matching, all transformed iris images must be prepared beforehand; in this study, we instead obtain four vertex images by having the user gaze at the corner positions and express the variation of the iris shape by interpolating those four images continuously. Using the seven vertex images, the parametric template t(ω) is constructed by the linear interpolation proposed in Ref. [15]. With the constructed parametric template, the correlation ρ(x, y) in Eq. (2) is calculated for every position in the input image, and the best-matched position (x*, y*) with the highest ρ(x*, y*) becomes a candidate position of the iris. The sampling intervals Δx and Δy are then decreased, and more precise matching is performed around the candidate position. This procedure is repeated, decreasing the interval until Δx = Δy = 1, at which point sub-pixel matching can be realized. Finally, the center of the template at the best-matched position is detected as the center of the iris.
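The coarse-to-fine schedule just described can be sketched as follows (our own illustration, not the authors' code): `match_score` stands for the correlation of Eq. (2) evaluated with the parametric template at an integer position, the initial interval follows the 8-15 range used later in the experiments, and the halving schedule, the neighbourhood size, and the omission of the final sub-pixel refinement are our simplifying assumptions.

```python
def coarse_to_fine_search(match_score, search_h, search_w, init_interval=8):
    """Coarse-to-fine template search down to pixel resolution.

    match_score(x, y) -- correlation score at integer position (x, y)
    search_h, search_w -- size of the search region in pixels
    init_interval      -- initial sampling interval
    """
    interval = init_interval
    # coarse pass over the whole search region
    best = max((match_score(x, y), x, y)
               for y in range(0, search_h, interval)
               for x in range(0, search_w, interval))
    while interval > 1:
        interval = max(1, interval // 2)
        _, bx, by = best
        # refine only around the current candidate position
        candidates = [(match_score(x, y), x, y)
                      for y in range(max(0, by - 2 * interval),
                                     min(search_h, by + 2 * interval + 1), interval)
                      for x in range(max(0, bx - 2 * interval),
                                     min(search_w, bx + 2 * interval + 1), interval)]
        best = max(candidates + [best])
    return best[1], best[2]   # (x*, y*) at pixel resolution
```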
3.3 Adjustment for Center of Iris
The shape of the iris becomes an ellipse when the eyeball rotates, so the center of the iris projected onto the two-dimensional image may shift from the center of the detected template. Figure 4 shows the eyeball viewed from the top: the line A-B represents the iris, and the line A'-B' a rotated iris. The algorithm in the previous section extracts point C as the center of the iris, because it takes the geometric midpoint between A' and B'; however, the actual center of the iris is E in Fig. 4. For more accurate eyegaze detection, we therefore have to adjust the center of the iris. We assume that the correct center of the iris is the iris center observed when the user gazes at the front. Let O be the center of the eyeball in Fig. 4, and let S be the calibrated radius of the rotation locus (the detailed calibration process is described in Sec. 4.2). Then A and B can be expressed as (−I, −√(S² − I²)) and (I, −√(S² − I²)), respectively, and the points A' and B' obtained by rotating through θ degrees are

A' : (A'_x, A'_y)ᵀ = [cos θ  −sin θ; sin θ  cos θ] (−I, −√(S² − I²))ᵀ    (3)

B' : (B'_x, B'_y)ᵀ = [cos θ  −sin θ; sin θ  cos θ] (I, −√(S² − I²))ᵀ    (4)

In Fig. 4, f, the x-coordinate of the point D, can be calculated as

f = (A'_x + B'_x) / 2 = √(S² − I²) sin θ    (5)

Here D is the center of the iris in the camera image, so the actual angle can be derived as

θ = sin⁻¹( f / √(S² − I²) )    (6)

Then the accurate eyegaze vector O-E is detected.
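In code, the correction of Eqs. (5)-(6) reduces to a few lines (a sketch with our own names; all quantities are in millimetres):

```python
import math

def corrected_gaze_angle(f_mm, S_mm, I_mm):
    """Corrected eyegaze angle from Eqs. (5)-(6).

    f_mm -- x-coordinate of the detected template centre D, which by Eq. (5)
            equals sqrt(S^2 - I^2) * sin(theta)
    S_mm -- calibrated radius S of the iris-centre rotation locus
    I_mm -- half-width I of the iris (see Fig. 4)
    """
    r = math.sqrt(S_mm ** 2 - I_mm ** 2)      # distance from O to the iris plane
    ratio = max(-1.0, min(1.0, f_mm / r))
    return math.asin(ratio)                   # Eq. (6), in radians
```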
Fig. 4. Adjustment for center of iris
4 Experimental Results of Eyegaze Detection
4.1 Experimental Environment
The proposed method is demonstrated here for a display interface. In the experiment, a reference point is set at the center of the display, and an observer with naked eyes sits in front of the reference point. The observer keeps his face directed to the front, so that the effect of face direction is suppressed. The observers were three males, and each observer was tested twice. As the observer sits in front of the eyegaze monitor display, a monocular video camera mounted below the monitor observes one of the observer's eyes. The distance from the display to the eyeball of the observer sitting on the chair is set to 400 mm, which is a widely usable distance. The only light source is an overhead fluorescent lamp arranged slightly behind the observer, so the observer is illuminated from above and behind by the fluorescent lamp alone. The face image is taken with a digital video camera, a Panasonic NV-GS200K (640×480), placed 120 mm from the display. The observer's jaw and forehead are fixed on a plate. In the experiment, 20 indices, excluding the center index, were displayed at 10-degree intervals of visual angle; the center index was used as the reference point. Each index was displayed at five-second intervals, and the observers gazed at the displayed index. The face image was captured four seconds after the index was displayed. The direction of the eyegaze from the reference point was calculated for each index using Eq. (1). With the head pose fixed, the iris does not move much, so for high-speed processing the search region was limited to about 2.5 times the iris size around the initial position.
4.2 Calibration Method
In general, before the eyegaze is detected, individual calibration for each observer is performed by having the observer gaze at two or more markers, typically 5-20 markers, on the display. Ideally, individual calibration would be unnecessary for eyegaze detection; however, because deviations result from several factors, individual calibration is
actually required. The conceivable factors in the proposed method are as follows:
(1) errors due to the optical system, such as the position of the camera;
(2) refraction at the surface of the cornea;
(3) personal error due to the shape of the eyeball;
(4) the degree of asphericity of the corneal surface;
(5) refraction through glasses or contact lenses.
This paper focuses on Error (3). Considering the personal eyeball size in Emsley's model eye and Gullstrand's model eye No. 2, we propose a simple 4-point calibration using the corner points of the display screen. The points are (vertical view angle [deg], horizontal view angle [deg]) = (−25, 20), (−25, −20), (25, −20), (25, 20). First, the display is divided into four blocks around the reference point. Next, B in Eq. (1) is calculated for a calibration point. Then, the parameter S in Eq. (1) is personalized so that the computed rotation of the eyeball matches the known view angle of that calibration point. The same adjustment procedure is applied to the other calibration points, and an adjusted parameter S is thus obtained for each of the four blocks.
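One way to realize this personalization of S is sketched below (our interpretation of the calibration step, not the authors' procedure); it is applied separately to the vertical and horizontal components, and the helper name and the usage comment are ours.

```python
import math

def calibrate_S(B_mm, known_angle_deg):
    """Personalize the eyeball parameter S of Eq. (1) for one calibration point.

    B_mm            -- measured iris-centre displacement (mm) while the user
                       gazes at the calibration corner
    known_angle_deg -- known view angle of that corner (e.g. 20 or 25 deg)

    From theta = asin(B / S) it follows that S = B / sin(theta).
    """
    return B_mm / math.sin(math.radians(known_angle_deg))

# One S per display block, e.g.:
# S_blocks = {block: calibrate_S(B, angle)
#             for block, (B, angle) in corner_measurements.items()}
```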
4.3 Results of Eyegaze Detection
We compared the proposed method with the conventional algorithm of Ref. [12] under the same system conditions. Table 1 shows the average error of gaze detection and the processing times. The average error of the proposed method is 0.92 deg in the horizontal direction and 2.09 deg in the vertical direction. The average error of the proposed method is slightly larger than that of the conventional method, but its maximum error is smaller; the proposed method therefore realizes stable eyegaze detection. Next, we compared the processing times. In the parametric template matching, the initial sampling intervals were set to 8-15, and the processing times were averaged. As shown in Table 1, the processing time is drastically reduced. The proposed method is inferior to Ref. [12] in accuracy, but it has a clear advantage in processing time, which strongly influences user comfort when the eyegaze is applied to a human interface. Further speed-up of the proposed method is expected by using information between frames.

Table 1. Average error of gaze detection and processing times
Method               Horizontal   Vertical   Processing time
Ref. [12]            0.66 deg     1.05 deg   86.3 sec
Proposed method      0.92 deg     2.09 deg   0.69 sec
5 Application for Eyegaze Keyboard
We developed an eyegaze communication interface using the proposed method. A user operates the system by looking at rectangular keys displayed on the control screen. The Japanese syllabary is written on the visual keyboard, so simple word processing can be realized by looking at each key in turn. The size of a key is 5 deg; when the error exceeds 2.5 degrees, the detection fails. However, because a letter is detected quickly, the user can correct mistakes immediately and proceed with the work without stress. Figure 5 shows the eyegaze keyboard system.
Fig. 5. Eyegaze keyboard system
6 Conclusion
This paper has proposed a simple method for eyegaze detection. It is contact-free for the observer, requires no specific devices other than a monocular video camera, and needs neither a reference light nor infrared illumination. A rotation model of the eyeball was constructed, and we devised a simple, fast eyegaze detection algorithm based on iris extraction using the parametric template matching method. To verify the performance of the proposed method, an eyegaze detection experiment was performed. The average horizontal error was 0.92 deg and the average vertical error was 2.09 deg. Although the accuracy was slightly worse than that of the conventional algorithm, the processing time was reduced drastically, to about 1/125. Improving the accuracy is future work, and a head-free condition is required for a more comfortable system.
References
1. Gips, J., Olivieri, C.P., Tecce, J.J.: Direct control of the computer through electrodes placed around the eyes. In: Smith, M.J., Salvendy, G. (eds.) Proc. 5th Int. Conf. on Human Computer Interaction, Orlando, FL. Published in Human-Computer Interaction: Applications and Case Studies, pp. 630–635. Elsevier, Amsterdam (1993)
2. Talmi, K., Liu, J.: Eye and gaze tracking for visually controlled interactive stereoscopic displays. Signal Processing: Image Communication 14, 799–810 (1999)
3. Hutchinson, T.E., White, K.P., Martin, W.N., Reichert, K.C., Frey, L.A.: Human-computer interaction using eyegaze input. IEEE Trans. Systems, Man & Cybernetics 19(6), 1527–1534 (1989)
4. Ohno, T., Mukawa, N., Kawato, S.: Just Blink Your Eyes: A Head-Free Gaze Tracking System. In: Int. Conf. for Human-Computer Interaction, Florida, USA, pp. 950–951 (2003)
5. Cornsweet, T.N., Crane, H.D.: Accurate two-dimensional eye tracker using first and fourth Purkinje images. J. Opt. Soc. Am. 63(8), 921–928 (1973)
6. Kawato, S., Tetsutani, N.: Gaze Direction Estimation with a Single Camera Based on Four Reference Points and Three Calibration Images. In: Narayanan, P.J., Nayar, S.K., Shum, H-Y. (eds.) ACCV 2006. LNCS, vol. 3852, pp. 419–428. Springer, Heidelberg (2006)
7. Matsumoto, Y., Zelinsky, A.: An algorithm for real-time stereo vision implementation of head pose and gaze direction measurement. In: Proceedings of IEEE fourth Int. Conf. on Face and Gesture Recognition, pp. 499–505 (2000)
8. Wang, J., Sung, E.: Gaze determination via images of irises. Image and Vision Computing 19(12), 891–911 (2001)
9. Kim, K.-N., Ramakrishna, R.S.: Vision-based Eyegaze Tracking for Human Computer Interface. In: IEEE Int. Conf. on Systems, Man, and Cybernetics, vol. 2, pp. 324–329 (1999)
10. Hammal, Z., Massot, C., Bedoya, G., Caplier, A.: Eyes Segmentation Applied to Gaze Direction and Vigilance Estimation. In: Singh, S., Singh, M., Apte, C., Perner, P. (eds.) ICAPR 2005. LNCS, vol. 3687, pp. 236–246. Springer, Heidelberg (2005)
11. Benoit, A., Caplier, A., Bonnaud, L.: Gaze direction estimation tool based on head motion analysis or iris position estimation. In: Proc. EUSIPCO 2005, Antalya, Turkey (September 2005)
12. Ohtera, R., Horiuchi, T., Kotera, H.: Eye-gaze Detection from Monocular Camera Image Based on Physiological Eyeball Models. In: IWAIT2006. Proc. International Workshop on Advanced Image Technology, pp. 639–664 (2006)
13. Emsley, H.H.: Visual Optics, 5th edn. Hatton Press Ltd, London (1952)
14. Gullstrand, A.: Appendix II.3 The optical system of the eye, von Helmholtz H, Handbuch der physiologischen Optik (1909)
15. Tanaka, K., Sano, M., Ohara, S., Okudaira, M.: A parametric template method and its application to robust matching. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 620–627 (2000)
An FPGA-Based Smart Camera for Gesture Recognition in HCI Applications Yu Shi and Timothy Tsui National ICT Australia, Bay 15 Australian Technology Park, Sydney, NSW 1430, Australia
[email protected]
Abstract. A smart camera is a camera that can not only see but also think and act. It is an embedded vision system that captures and processes images to extract application-specific information in real time. The brain of a smart camera is a special processing module that performs application-specific information processing. Designing a smart camera as an embedded system is challenging because video processing has an insatiable demand for performance and power, while at the same time embedded systems place considerable constraints on the design. We present our work to develop GestureCam, an FPGA-based smart camera built from scratch that can recognize simple hand gestures. The first completed version of GestureCam has shown promising real-time performance and is being tested in several desktop HCI (Human-Computer Interface) applications.
Keywords: Intelligent Systems, Human-Computer Interaction, Embedded Systems, Computer Vision, Pattern Recognition.
1 Introduction
Broadly speaking, a smart camera can be defined as a vision system whose primary function is to produce a high-level understanding of the imaged scene and generate application-specific data to be used in an autonomous and intelligent system. A smart camera is 'smart' because it contains a processing unit that performs application-specific information processing (ASIP). The primary goal of ASIP is to extract information from the captured images that is useful to an application. For example, a motion-triggered surveillance camera captures video of a scene, detects motion in the region of interest, and raises an alarm when the detected motion satisfies certain criteria; in this case, the ASIP is motion detection and alarm generation. Strictly speaking, a smart camera is a stand-alone, self-contained embedded system that integrates image sensing, ASIP, and communications in one single box. However, there are other types of vision systems that are often referred to as smart cameras as well, such as PC-based smart cameras. In a PC-based smart camera, the camera is a general-purpose camera such as a webcam or CCTV camera, with the video output connected to a PC through USB, Ethernet, FireWire, or another protocol. This kind of configuration has a few disadvantages. For example, a general-purpose PC is usually not suited to intensive image processing of high-resolution, high-frame-rate video streams. In addition, the bandwidth
requirement for the camera-PC link is very high. A smart camera can greatly simplify application system design because there is no need for an extra PC to perform image processing tasks. What is more, the embedded processing unit inside the smart camera is a much better way to process images at high resolution and high frame rate in real time. The output from the smart camera requires very low bandwidth, because only image features or a high-level description of the imaged scene needs to be transferred to a central control computer. Smart cameras have many applications, such as video surveillance, security, machine vision, and human-computer interaction.
A multimodal user interface (MMUI) allows a user to interact with a computer using his or her natural communication modalities, such as speech, pen, touch, gestures, eye gaze, and facial expression, just as in human-to-human communication. Gesture recognition is an important part of an MMUI system. Compared to glove- and tracker-based gesture recognition, vision-based gesture recognition, which uses cameras and computer vision techniques, is more flexible, portable, and affordable. However, vision-based gesture recognition is not a trivial task, especially when built as an embedded system.
Building smart cameras as embedded systems has been a hot R&D topic in recent years, and in particular there has been research on building smart cameras that can recognize gestures. Wolf et al. in [1] described a VLSI-based smart camera for gesture recognition; they used a commercial general-purpose camera that provides analogue output to a VLSI video processing board inserted into a PC. Bonato et al. in [2] presented the design of an FPGA-based smart camera that can perform gesture recognition in real time for a mobile-robot application; they used a CMOS camera as the capture device, which provides a processed digital image output to their FPGA. Wilson et al. in [3] designed a system allowing a user to control Windows applications using gestures; their system uses a pair of general-purpose cameras.
In this paper, we present the design and development of a smart camera, called GestureCam, which can perform simple hand gesture recognition. GestureCam is built from scratch; that is, it is not based on a commercial camera that provides processed analogue or digital outputs. Rather, the image capture part of GestureCam is custom-built around a CMOS image sensor chip, which allows us to apply our own color and image pre-processing algorithms to the raw video output (Bayer pattern) of the image sensor. This gives us the opportunity to feed low-noise, better-quality data into gesture recognition. In the following sections, we describe the GestureCam design process and architecture, followed by discussions on algorithm development and implementation issues, especially for the contour tracing and gesture classification algorithms. Finally, we present our work in progress on applying GestureCam to GestureBrowser, a tool that enables a person to use hand gestures to control a web browser.
2 GestureCam Design and Implementation
2.1 Design Process
To design GestureCam, we followed the process below, which we believe is appropriate to the design of smart cameras as embedded systems [4].
• Step one: Application Requirements Specification. Correct specifications can shorten the design and development cycle, provide clear targets for algorithm and hardware performance, and reduce total cost.
• Step two: System Architecture Design. Software and hardware architectures are defined based on performance, time-to-delivery, and cost criteria. Algorithmic design and timing design suitable to the targeted hardware platform also need to be defined. The mapping between algorithm requirements and hardware resources is an important issue. For the hardware architecture, a heterogeneous, multiple-processor architecture can be ideal for smart camera development.
• Step three: Proof-of-Concept. This may use a PC platform for research and algorithm development. Usually a general-purpose camera (e.g., a webcam) is used at this stage. In the next section we describe our proof-of-concept work using a webcam and a PC.
• Step four: Algorithm Conversion. This is necessary because algorithm development for embedded systems is quite different from that for PC-based platforms. It can be far more demanding and challenging, especially if FPGA or ASIC processors are targeted. Converting floating-point arithmetic to fixed-point, eliminating divisions as much as possible (e.g., by using hardware multipliers and look-up tables), and taking low-power and low-complexity requirements into account are other design considerations for algorithm conversion.
• Step five: Integration and Debugging. This results in a prototype smart camera using an embedded hardware platform running embedded versions of the algorithms. It can be a time-consuming process and sometimes requires adjustments to the algorithms, software, and hardware architecture.
• Step six: Test and Evaluation. Test camera performance in a realistic environment, identify potential problems and possible improvements, and benchmark camera performance against the initial application requirement specifications.
2.2 Vision Based Gesture Recognition
The main steps typically involved in vision-based gesture recognition are image pre-processing, object segmentation, feature extraction, tracking, and gesture classification. Before we built GestureCam, we went through a proof-of-concept stage (Step 3). Specifically, we used a Logitech QuickCam Pro 4000 and a PC to implement and test core modules such as object segmentation, feature extraction, and gesture classification. For object segmentation, we applied skin color detection and contour tracing techniques to segment the hand from the background. For feature extraction, we calculated the center of mass of the segmented hand. For gesture classification, we applied a simple non-trainable neural network (with hard-coded weightings); trajectory-based classification was also tested. The QuickCam- and PC-based proof-of-concept successfully validated all core algorithms for GestureCam and was used in a speech- and gesture-based multimodal interface built for a traffic incident management application [5]. However, building a smart camera as an embedded system is very different from building a PC-based system: there are many design considerations and implementation issues pertinent to the chosen hardware processing architecture and design specifications. These are discussed below.
2.3 GestureCam System Architecture
GestureCam consists mainly of three parts: an image capture unit (ICU), an FPGA-based gesture recognition unit (GRU), and a host and display unit (HDU). The ICU includes a small in-house-built PCB on which sits a megapixel CMOS color image sensor, the OmniVision OV9620. The PCB is fitted into a dummy camera casing, which provides easy connection to a 2/3" format video lens from Computar. The OV9620 provides full SXGA (1280×1024) Bayer-pattern video output at 15 frames per second, and VGA (640×480) resolution at 30 frames per second. A Xilinx Virtex-II Pro FPGA development kit from Memec was chosen to form the GRU. This kit is a powerful yet flexible development platform for imaging applications: the Virtex-II Pro 2VP30 FPGA comes with over 2 million system gates, 2 on-chip embedded PowerPC cores, and over 2 MB of on-chip RAM. All image and video processing algorithms, in addition to the image sensor interface and display interface, are implemented in the Virtex-II Pro FPGA without using off-chip memory. All implementation was done in VHDL; the main programming tools are Xilinx ISE 7.1, EDK 7.1, and ChipScope Pro 7.1. The motivation for adopting an FPGA as a development
Fig. 1. (a) GestureCam development platform. (b) A programmer working on the platform.
Fig. 2. Processing chain of GestureCam implemented in the FPGA
platform is that an FPGA is a far better computing platform than a PC for performing data-intensive video processing tasks with real-time requirements. In addition, no high-bandwidth link between camera and PC is needed, because the output from the camera can be as simple as an index of a gesture in a pre-defined gesture database. A picture of the development platform is shown in Figure 1(a); Figure 1(b) shows one of the authors working on the development platform.
Figure 2 shows the software architecture of GestureCam. The CMOS image sensor outputs Bayer-pattern images, meaning that each pixel carries only one 8-bit color value. For each frame, a color interpolation algorithm developed in-house produces a real-world color image of the scene. A skin color detection algorithm is then applied to the color image to extract skin-colored pixels from the background, and a low-pass filter smooths out isolated in-color noise to produce a cleaner skin image. A contour tracing algorithm is then called to extract the hand contour coordinates for feature extraction, which calculates the center of mass (CoM) of the hand contour. Lastly, a neural-network-based gesture recognition stage analyzes the temporal trajectory of the hand CoM and classifies the hand gesture based on probability. The contour tracing and gesture classification are described in more detail in the following section. For the first version of GestureCam, we decided to work at half-VGA resolution, that is, 320×240 pixels, so that the RAM on the FPGA chip is big enough for all frame and line buffering requirements and there is no need to use off-chip SDRAM. Later the design can be scaled up to accommodate the full capture resolution.
2.4 Contour Tracing and Gesture Classification for GestureCam
2.4.1 Contour Tracing
There are two well-known families of contour tracing methods: chain code and row-wise representations. In chain code methods, the immediate pixels surrounding the current edge pixel, often called the 8-neighbourhood, are searched. In contrast, row-wise representations analyze the interface between two adjacent rows of pixels and infer a contour from it by looking at the connection between two edge pixels. Row-wise representations offer better performance [6] because tracing can begin as soon as two rows of pixel data are available to the algorithm, but this comes at the cost of increased design complexity. With chain code methods, the entire frame needs to be available before tracing can occur, giving rise to poorer performance since tracing starts much later. The initial algorithms researched for GestureCam involved chain code: given the small resolution of 320×240 pixels at which GestureCam operates, it was not necessary at this stage to design a complex contour tracing module such as a row-wise scheme. The chain code algorithms investigated were the square-tracing algorithm [7], the Moore-neighbourhood algorithm [7], Rhee et al.'s algorithm [12] based on the External Boundary Tracing Algorithm, and the Inner Boundary Tracing Algorithm (IBTA) [8]. Of these, IBTA was believed to be the most suitable. One shortcoming of square-tracing is its restriction to strictly 8-connected images. Although similar to IBTA, the Moore-neighbourhood method searches for the next edge pixel in a brute-force manner, whereas IBTA searches for the next edge pixel based on a mathematical
rule, optimizing the chances of detecting an edge pixel; Rhee et al.'s algorithm has a similar shortfall. The tracing algorithm used by GestureCam is based on IBTA with several notable modifications, which involve the method of finding the first pixel to begin tracing and the stopping criterion. In the IBTA algorithm, the first pixel is found by searching from the top left of the frame and continuing in a raster direction, left to right and top to bottom. The implementation in GestureCam differs by locating the brightest pixel of a frame and hopping in the negative x-direction until the pixel value falls below a certain threshold; this is the left-most edge of the object, and it is where tracing begins. This method decreases the chance of noisy pixels being traced and hence improves the chance of tracing the hand, which is the object of interest. The IBTA algorithm includes a stopping criterion that can trace objects with one-pixel-wide segments. This may be a critical feature for applications such as medical imaging, in which the utmost precision is required to ensure proper diagnosis, but GestureCam does not require such fine-tuned precision, so we have replaced it with a simpler criterion: stop tracing when the pixel at which tracing began is encountered a second time. Figure 3 shows the flow diagram for tracing one frame using the modified IBTA algorithm. Data from the filtering module are passed pixel by pixel to the pre-processing state, where the brightest pixel is found by updating its location until the whole frame has been traversed. Because the data from the filtering module are constantly updating, the pixel data are copied to a memory buffer so that the IBTA algorithm can trace a static frame. The algorithm itself is robust enough to handle any shape of the hand. Inner boundaries are not traced, which is a property inherent in the IBTA algorithm, but since the image moments require only the outer boundary, this does not present any problems.
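A software sketch of the modified starting rule and stopping criterion described above is given below (our own illustration, not the GestureCam VHDL); for brevity the next-edge-pixel search is a plain Moore-style 8-neighbourhood sweep rather than the IBTA rule, and all names are ours.

```python
import numpy as np

def trace_contour(mask, threshold=1):
    """Trace the outer contour of the bright object in a (skin) mask.

    Starting rule: take the brightest pixel of the frame and hop in the
    negative x-direction until the pixel falls below `threshold`; tracing
    begins at the left-most edge pixel found this way.
    Stopping rule: stop when the starting pixel is reached a second time.
    """
    ys, xs = np.nonzero(mask == mask.max())
    y0, x0 = int(ys[0]), int(xs[0])
    while x0 > 0 and mask[y0, x0 - 1] >= threshold:
        x0 -= 1                                   # hop to the left-most edge
    # 8-neighbourhood in clockwise order, starting from "west"
    nbrs = [(0, -1), (-1, -1), (-1, 0), (-1, 1),
            (0, 1), (1, 1), (1, 0), (1, -1)]
    contour = [(y0, x0)]
    y, x = y0, x0
    prev_dir = 4   # pretend we entered moving east, from the background on the west
    while True:
        for k in range(8):
            d = (prev_dir + 5 + k) % 8            # resume just past the backtrack
            dy, dx = nbrs[d]
            ny, nx = y + dy, x + dx
            if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                    and mask[ny, nx] >= threshold):
                y, x, prev_dir = ny, nx, d
                break
        else:
            break                                 # isolated pixel, nothing to trace
        if (y, x) == (y0, x0):
            break                                 # start pixel met a second time
        contour.append((y, x))
    return contour
```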
Fig. 3. The flow diagram for tracing one frame using the modified IBTA algorithm
2.4.2 Gesture Classification
In GestureCam's first prototype, classification utilizes the image moments from the feature extraction module. In any given time interval, we can collect a group of image moments and hence build up the trajectory of a user executing a moving hand gesture. GestureCam collects 15 image moments, which corresponds to approximately 1 second of input data. A total of 8 gestures are recognizable; they are shown in Figure 4.
Fig. 4. Gestures (“left”, “right”, “up”, “down”, “curved right”, “curved left”, “curved down”, “curved up”) that are recognized by GestureCam – arrow tail indicates start of gesture and arrow head indicates end of gesture
The field of classification has been extensively researched over the years and has provided sound techniques such as neural networks, Hidden Markov Models, and various other statistical methods. We achieved classification by implementing a neural network as well as another technique which we call 'trajectory-based'. While the concept is not new, 'trajectory-based methods' is a term coined by the authors; it reflects the fact that rules are inferred from the trajectory itself. These rules may be based on the general shape, curve, or features of the trajectory, but they should be distinguishable from one another so that they do not cause ambiguity. The rules derived from the trajectory are hard-coded in VHDL. A preliminary literature review revealed that neural network designs with a hardware focus predominantly classify 'stationary' gestures, in which the positions of fingers and hands are recognized. Bonato et al. used a RAM-based neural net to control the direction and speed of a mobile robot using stationary gestures [2]. Gruenstein compared two methods, the condensation algorithm alongside a time-delayed neural network [9]; the neural net they investigated was essentially based on the back-propagation algorithm, in which each layer of inputs corresponds to a different time interval.
In GestureCam, the input data to classification are the image moments obtained from the feature extraction module. To build up a trajectory, we collect 15 successive image moments and store them in block RAM for display on a monitor; gesture recognition itself is processed as soon as new image moments are received at the input, so no additional storage is necessary. As each address of the memory buffer corresponds directly to a pixel on the screen, the image moments are normally stored as 17-bit addresses. The conversion from memory address to input data appropriate for the gesture recognition techniques is based on the work of Kinder et al. [11]. They argue that by separating a trajectory into segments of 2 or more pieces, called tuples, position-invariant and error-tolerant recognition can be achieved. For each tuple, a set of features is obtained, such as x-component, y-component, gradient, and angle. In GestureCam, we define a tuple to be any 2 successive image moments and obtain 3 features: x-component, y-component, and gradient; these form the input for both the neural network and the trajectory-based methods. GestureCam analyses 14 tuples individually to build up a collective decision when classifying gestures. We use the neural network to detect the "UP", "DOWN", "LEFT" and "RIGHT" gestures, whilst for the curved "UP", "DOWN", "LEFT" and "RIGHT" gestures we use the trajectory-based methods.
The neural network is composed of 3 inputs, a hidden layer of 2 nodes, and an output node. To understand the operation of this neural network, let us consider the
“LEFT” or “RIGHT” gesture in which the user waves his hand from right to the left and vice versa. Given the feature set derived from the tuple, we can infer that the ycomponent will not deviate from start to finish, thus we impose a threshold of 10 pixels. Another inference is that the gradient will be mostly level, thus we impose another threshold that the gradient can not exceed 0.5. Since both properties have to be met, there is a third and final node to check that both rules are met. This is illustrated in Figure 5. Next, we can determine whether the gesture is “LEFT” or “RIGHT” by noting the sign of the x-component - if it is negative, then the direction is “LEFT”, otherwise it is “RIGHT”. For gestures “UP” and “DOWN”, we apply similar principles. The result of the neural network consists of whether the gesture was detected, in which case the output would be ‘1’, or not detected where the output would be ‘0’. Using this neural network as a basis, we can alter the weights to suit the requirements for “LEFT”, “RIGHT”, “UP” and “DOWN” gestures, making it highly adaptable and easy to adjust.
Fig. 5. Neural network for “left” and “right” gesture recognition. Fig 5a shows the decision boundaries for the “LEFT” or “RIGHT” gesture. Fig 5b shows the neural network with hard-coded weights. Inputs of 1 denote bias weights.
3 Results and Application Development
The first version of GestureCam has been completed. The prototype has shown good real-time performance, recognizing all 8 planned hand gestures, and the implemented algorithm is robust enough to handle any shape of the hand. We applied GestureCam in an application called GestureBrowser, which allows common web browsing operations, such as following hyperlinks, traversing the browsing history, or scrolling up and down pages, to be performed using hand or head gestures. GestureBrowser is implemented as a regular extension to the Mozilla Firefox web browser, visible as a new toolbar; it connects to the GestureCam through a UDP connection. The GestureCam acts as a gesture event server, delivering gesture and tracking information according to a proprietary protocol. The toolbar allows the user to enter the address and port of the GestureCam and then initiates the connection with a handshake. The server then sends packets containing information about the GestureCam identity (potentially allowing multiple cameras to be used at the same time), a timestamp
(mainly for late-packet ordering if required), and a set of parameters about the current gesture: the 2D tracking position and the type of gesture (simple tracking versus user-defined gesture).
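The actual GestureCam wire format is proprietary and not specified here, so the following sketch only illustrates a plausible packet carrying the fields listed above (camera identity, timestamp, 2D position, gesture type); every field name, field size, and the port number are our own assumptions.

```python
import socket
import struct

# Hypothetical packet layout for the fields listed above:
#   camera_id : uint16   timestamp_ms : uint32
#   x, y      : int16    gesture_type : uint8 (0 = tracking, >0 = gesture id)
PACKET_FMT = "!HIhhB"

def parse_gesture_packet(data: bytes):
    cam_id, ts_ms, x, y, gesture = struct.unpack(PACKET_FMT, data)
    return {"camera": cam_id, "timestamp_ms": ts_ms,
            "position": (x, y), "gesture": gesture}

def listen(port=5005):
    """Receive gesture events over UDP, as the browser toolbar might."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    while True:
        data, _addr = sock.recvfrom(64)
        if len(data) == struct.calcsize(PACKET_FMT):
            yield parse_gesture_packet(data)

# for event in listen():
#     print(event)
```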
4 Conclusion and Future Work
We have developed a smart camera that can recognize simple hand gestures. The camera was built using a single FPGA chip as the processing device. The first prototype of GestureCam has shown very promising real-time performance and recognition rates, and it is being tested in a real-world application, GestureBrowser. In future work, we plan to continue improving various aspects of GestureCam through the GestureBrowser application; in particular, we will use an off-chip SDRAM to allow us to process images of higher resolution, which should improve the performance of skin color detection, a step that is critical to good hand tracking. Initially, a trainable neural network using the back-propagation algorithm was intended for GestureCam. However, robust classification success rates normally require training data on the order of hundreds of samples, and because GestureCam had no training data of gestures, it was difficult to achieve acceptable performance with the back-propagation algorithm. In its present state, the neural network is based on thresholding, which limits the complexity a gesture can have. Using the back-propagation algorithm and given ample training data, GestureCam could recognize a plethora of non-linear gestures, which would replace the current classification scheme. A major advantage for the programmer is that they need not define rules for every gesture and convert them to VHDL; instead, the back-propagation algorithm automatically adjusts the decision boundaries until all training data are correctly classified. The benefits include classification of a wide range of complex gestures and improved accuracy.
References 1. Wolf, W., Ozer, B., Lv, T.: Smart Cameras as Embedded Systems. IEEE Computer 35(9), 48–53 (2002) 2. Bonato, V., Sanches, A., Fernandes, M., Cardoso, J., Simoes, E., Marques, E.: A Real Time Gesture Recognition System for Mobile Robots. In: International Conference on Informatics in Control, Automation, and Robotics, August 25-28 2004, Setúbal, Portugal, pp. 207–214. INSTICC (2004) 3. Wilson, A., Oliver, N.: Gwindows: Robust Stereo Vision for Gesture-Based Control of Windows. In: Proceedings of the International Conference on Multimodal Interaction, November 5–7, 2003, Vancouver, British Columbia, Canada (2003) 4. Shi, Y., Taib, R., Lichman, S.: GestureCam: A Smart Camera for Gesture Recognition and Gesture-Controlled Web Navigation. In: Proc. of ICARCV 2006, ICARCV, Singapore (December 2006) 5. Chen, F., Choi, E., Epps, J., Lichman, S., Ruiz, N., Shi, Y., Taib, R., Wu, M.: A Study of Manual Gesture-Based Selection for the PEMMI Multimodal Transport Management Interface. In: Proc. ICMI 2005, pp. 274–281 (2005)
6. Miyatake, T., Matsushima, H., Ejiri, M.: Contour representation of binary images using run-type direction codes. Machine Vision and Applications 70(2), 239–284 (1997) 7. Ghuneim: Contour Tracing (August 2006), http://www.imageprocessingplace.com/DIP/ dip_downloads/tutorials/contour_tracing_Abeer_George_Ghuneim/index.html 8. Sonka, M., Hlavac, V., Boyle, R.: Image processing, analysis and machine vision, 2nd edn. Brooks Cole (1998) 9. Gruenstein, A.: Two Methods of Gesture Recognition (March 2002) 10. Gose, E., Johnsonbaugh, R., Jost, S.: Pattern recognition and image analysis. Prentice Hall, PTR (1996) 11. Kinder, M., Brauer, W.: Classification of Trajectories – Extracting Invariants with a Neural Network. Neural Networks 7, 1011–1017 (1993) 12. Rhee, P.K., La, C.W.: Boundary Extraction of Moving Objects From Image Sequence. In: IEEE TENCON (1999)
Color Constancy Via Convex Kernel Optimization Xiaotong Yuan, Stan Z. Li, and Ran He Center for Biometrics and Security Research, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, China
Abstract. This paper introduces a novel convex kernel based method for color constancy computation with explicit illuminant parameter estimation. A simple linear render model is adopted, and the illuminants in a new scene that contains some of the color surfaces seen in the training image are sequentially estimated in a global optimization framework. The proposed method is fully data-driven and initialization invariant. Nonlinear color constancy can also be solved approximately in this kernel optimization framework under a piecewise linear assumption. Extensive experiments on real-scene images validate the practical performance of our method.
1 Introduction
Color is an important feature for many machine vision tasks such as segmentation [8], object recognition [13], and surveillance [4]. However, light sources, shadows, transducer non-linearities, and camera processing (such as auto-gain-control and color balancing) can all affect the apparent color of a surface. Color constancy algorithms attempt to estimate these photic parameters and compensate for their contribution to image appearance. There is a large body of work in the color constancy literature. A common approach is to use linear models of reflectance and illuminant spectra [9]. The gray world algorithm [1] assumes that the average reflectance of all the surfaces in a scene is gray. The white world algorithm [5] assumes that the brightest pixel corresponds to a scene point with maximal reflectance. Another widely used technique is to estimate the relative illuminant, or mapping of colors under an unknown illuminant to a canonical one. Color gamut mapping [3] uses the convex hull of all achievable RGB values to represent an illuminant, and the intersection of the mappings for each pixel in an image is used to choose a "best" mapping. In [14], a back-propagation multi-layer neural network is trained to estimate the parameters of a linear color mapping. In [6], a Bayesian estimation scheme is introduced to integrate prior knowledge, e.g., lighting and object classes, into a bilinear likelihood model motivated by the physics of image formation and sensor error. Linear subspace learning is used in [12] to develop the color eigenflows method for modeling joint illuminant change. This linear model uses no prior knowledge of lighting conditions and surface reflectance and does not need to be re-estimated for new objects or scenes; however, the demand for a large training set and a rigorous pixel-wise correspondence between training and test images limits its application.
In this work, we build our color constancy study on linear transformation parameter estimation. Recently, [8] presented a diagonal rendering model for an outdoor color classification problem. Only one image containing the color samples under a certain "canonical" illuminant is needed to train the Gaussian classifiers, and the trained colors seen under different illuminations can then be robustly recognized via MAP estimation. Because it requires less training data, we adopt this diagonal render model as the base model for our study. The main difference between our solution and that of [8] lies in the definition of the objective function and the associated optimization method. In [8], the image likelihood and model priors are integrated into a MAP formulation and locally optimized with the EM algorithm. This algorithm works well when all the render matrices are properly initialized; however, such initializations are not always available or accurate in practice. In this paper, we propose a novel convex kernel based criterion function to measure the color compensation accuracy in a new scene. A sequential global mode-seeking framework is then developed for parameter estimation. The optimization procedure includes the following three key steps:
– A two-step iterative algorithm derived by half-quadratic optimization is used to find a local maximum.
– A multi-bandwidth method is then used to locate the global maximum by gradually decreasing the bandwidth from an estimated uni-mode-promising bandwidth.
– A well-designed adaptive re-sampling mechanism is adopted, and the above multi-bandwidth method is repeated until the desired number of peak modes is found.
The peak modes obtained in this procedure may be naturally viewed as transformation vectors for the apparent illuminants in the scene. Our convex kernel based method is fully data-driven and initialization invariant. These good numerical properties also lead to our solution for the nonlinear color constancy problem based on the current linear model: we make a piecewise linear assumption to approximate the general nonlinear case, and our method can automatically find the transformation vectors for each linear piece. Local optimization methods, such as the EM-based method in [8], can hardly achieve this goal in practice because of their initialization dependency. Some results achieved by our method will be reported.
The remainder of this paper is organized as follows. In Section 2, we model color constancy as a linear mapping and estimate the parameters via multi-bandwidth kernel optimization in a fully data-driven way. In Section 3, we show experimental results that validate the numerical superiority of our method over that of [8]. We conclude the paper in Section 4.
2 Problem Formulation
Because it requires fewer training samples, we adopt the linear render model stated in [8] as the base model for our color constancy study. The key assumptions are:
– One hand-labeled image is available for training the class-conditional color distributions under the "canonical" illuminant.
– The class-conditional color surface likelihood under the canonical illuminant is a Gaussian density with mean μ_j and covariance Σ_j.
– The illuminant-induced color transformation from the test image to the training image can be modeled as F(C_i) = C_i d, where d = (d_1, d_2, d_3)ᵀ is the color render vector to be estimated and C_i = diag(r_i, g_i, b_i) is a diagonal matrix that stores the observed RGB colors of pixel i in the test image.
Suppose we have trained S color surfaces with distributions y_j ∼ N(μ_j, Σ_j), j = 1, ..., S. Assume also that we are given a test image with N pixels C_i, i = 1, ..., N, which contains L illuminants linearly parameterized by vectors d_l, l = 1, ..., L. Our goal is to estimate the optimal d_l from the image data and then obtain the assignments of surface class labels j(i) and illuminant type labels l(i) for each pixel i according to

(j(i), l(i)) = arg min_{j,l} dist(C_i d_l, y_j)    (1)
where dist(·) is a properly selected distance metric (the Mahalanobis distance in this work).
2.1 Kernel Based Objective Function
To estimate the optimal transformation vectors d_l, we propose to find the L peak modes of the following kernel sum function:

f̂_k(d) = Σ_{i=1}^{N} Σ_{j=1}^{S} w_ij k( M²(C_i d, μ_j, η² Σ_j) )    (2)
where k(·) is the kernel profile function [2] (see Sect. 2.2 for a detailed description), M²(C_i d, μ_j, η² Σ_j) = (C_i d − μ_j)ᵀ (η² Σ_j)⁻¹ (C_i d − μ_j) is the Mahalanobis distance from the compensated color C_i d to the training color surface mean y_j, and w_ij is the prior weight for pixel i belonging to color surface j. The larger function (2) is, the better the test image is compensated by the vector d. In the following subsections 2.2 to 2.4, we focus on the optimization issues and develop a highly efficient sequential mechanism to find the desired L peak modes of (2) as the optimal d_l.
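For concreteness, the objective of Eq. (2) can be evaluated as in the sketch below (our own code, not the authors'); the Gaussian profile k(x) = exp(−x/2) used later in Example 1 is taken as the default, and all variable names are ours.

```python
import numpy as np

def kernel_objective(d, C, mu, Sigma_inv, w, eta=1.0, k=lambda x: np.exp(-x / 2)):
    """Evaluate the kernel sum objective f_k(d) of Eq. (2).

    d         -- candidate render vector, shape (3,)
    C         -- per-pixel diagonal colour matrices, shape (N, 3, 3)
    mu        -- trained surface means, shape (S, 3)
    Sigma_inv -- inverses of the trained covariances, shape (S, 3, 3)
    w         -- prior weights w_ij, shape (N, S)
    eta       -- kernel bandwidth
    k         -- kernel profile (Gaussian profile by default)
    """
    x = C @ d                               # compensated colours C_i d, shape (N, 3)
    total = 0.0
    for j in range(mu.shape[0]):
        diff = x - mu[j]                    # (N, 3)
        m2 = np.einsum("ni,ij,nj->n", diff, Sigma_inv[j], diff) / eta ** 2
        total += np.sum(w[:, j] * k(m2))
    return total
```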
2.2 Half Quadratic Optimization
In this section, we use the half-quadratic technique [10] to optimize objective function (2). The results follow directly from standard material in convex analysis (e.g., [10]), and we omit the technical proofs for lack of space. The conditions we impose on the kernel profile k(·) are summarized below:
1. k(x) is a continuous, monotonically decreasing, and strictly convex function.
2. lim_{x→0+} k(x) = β > 0 and lim_{x→+∞} k(x) = 0.
3. lim_{x→0+} k′(x) = −γ < 0, lim_{x→+∞} k′(x) = 0, and lim_{x→+∞} (−x k′(x)) = α < β.
4. k″(x) is continuous except at finitely many points.
The following Theorem 1 forms the basis for optimizing function (2) in a half-quadratic way.

Theorem 1. Let k(·) be a profile satisfying all the above conditions. Then there exists a strictly monotonically increasing concave function ϕ : (0, γ) → (α, β) such that

k(M²(C_i d, μ_j, η²Σ_j)) = sup_p (−p M²(C_i d, μ_j, η²Σ_j) + ϕ(p))
and for a fixed d, the supremum is reached at p = −k′(M²(C_i d, μ_j, η²Σ_j)). To further study criterion (2), we introduce a new objective function F̂_η : R³ × (0, γ)^{N×S} → (0, +∞):

F̂_η(d, p) = Σ_{i=1}^{N} Σ_{j=1}^{S} w_{ij} (−p_{ij} M²(C_i d, μ_j, η²Σ_j) + ϕ(p_{ij}))    (3)

where p = (p_{11}, ..., p_{NS}). According to Theorem 1, we get f̂_k(d) = sup_p F̂_η(d, p).
It is straightforward to see that

max_d f̂_k(d) = max_d sup_p F̂_η(d, p)    (4)
From (4) we see that maximizing f̂_k(d) is equivalent to maximizing F̂_η(d, p), which is quadratic w.r.t. d when p is fixed. We propose a strategy based on alternate maximization over d and p as follows (the superscript l denotes the time stamp):

p_{ij}^l = −k′(M²(C_i d^{l−1}, μ_j, η²Σ_j))    (5)

d^l = [ Σ_{i=1}^{N} Σ_{j=1}^{S} w_{ij} p_{ij}^l C_i^T Σ_j^{-1} C_i ]^{-1} [ Σ_{i=1}^{N} Σ_{j=1}^{S} w_{ij} p_{ij}^l C_i^T Σ_j^{-1} μ_j ]    (6)
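A minimal sketch (ours) of one alternate-maximization sweep, again assuming the Gaussian profile for which −k′(x) = ½ e^{−x/2}; since C_i = diag(c_i) with c_i the observed RGB color of pixel i, the matrix products in (6) reduce to elementwise operations.

```python
import numpy as np

def half_quadratic_step(d, pixels, w, mus, sigmas, eta):
    """One sweep of the alternate maximization (5)-(6) under the Gaussian profile."""
    inv_eta = np.linalg.inv(eta**2 * sigmas)        # (eta^2 Sigma_j)^{-1}
    inv_sig = np.linalg.inv(sigmas)                 # Sigma_j^{-1}
    comp = pixels * d                               # compensated colors C_i d
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for j in range(len(mus)):
        diff = comp - mus[j]
        m2 = np.einsum('ni,ij,nj->n', diff, inv_eta[j], diff)
        coef = w[:, j] * 0.5 * np.exp(-0.5 * m2)    # w_ij * p_ij, update (5)
        # C_i^T S^-1 C_i = S^-1 * (c_i c_i^T) elementwise; C_i^T S^-1 mu = c_i * (S^-1 mu)
        A += inv_sig[j] * np.einsum('n,na,nb->ab', coef, pixels, pixels)
        b += (inv_sig[j] @ mus[j]) * (coef[:, None] * pixels).sum(axis=0)
    return np.linalg.solve(A, b)                    # update (6)
```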
2.3 Global Mode-Seeking
Since the above two-step iterations (5) and (6) are essentially a gradient ascent method, they are only guaranteed to converge to a local maximum. In this section, we first derive Proposition 1, which indicates that if the bandwidth parameter η is large enough, then the criterion function (2) is strictly concave and hence uni-modal. We then develop a global peak-mode-seeking method based on this proposition to find the transformation vector d that best compensates the illuminant in the test image.
Proposition 1. A sufficient condition for F̂_η(d, p) to be uni-modal is

η > Const · ( 2 sup_v (−k″(v)/k′(v)) )^{1/2}    (7)

where Const = max{ √(M²(x, μ_j, Σ_j)) | x ∈ [0, 255]³, j = 1, ..., S }. The proof follows from straightforward derivative calculations. We give below an example profile to further clarify Proposition 1.
Example 1 (Gaussian profile). k(x) = e^{−x/2}. Then k′(x) = −(1/2) e^{−x/2}, k″(x) = (1/4) e^{−x/2}, and sup_x (−k″(x)/k′(x)) = 1/2. By Proposition 1, the uni-mode-promising bandwidth can be selected according to η > Const. In addition, the dual variable function in Theorem 1 is ϕ(p) = 2p − 2p ln 2p.
From Proposition 1 we can tell that if η is large enough, then from any initial estimation the two-step iteration algorithm presented in (5) and (6) will converge to the unique maximizer of the over-smoothed density function. When this unique maximizer is reached, we may decrease the value of η and run the same iterations again, taking the previous maximizers as initializations. This procedure is repeated until a certain termination condition is met (e.g., the convergence error is small enough). The final maximizer is very likely to be the global peak mode of the criterion function, since such a numerical procedure is actually deterministic annealing [7]. See Algorithm 1 for a formal description of this optimization procedure. We have noticed that this global peak-mode-seeking mechanism is similar to what is called annealed mean shift in [11], which aims to find the global kernel density mode. The key improvement is that we give an explicit bound on the uni-mode-promising bandwidth, which makes the algorithm more practical to use.
Algorithm 1. Global Transformation Vector Seeking
1: m ← 0, initialize η_m satisfying the condition of Proposition 1
2: Randomly initialize d
3: while the termination condition is not met do
4:   Run iterations (5) and (6) until convergence.
5:   m ← m + 1
6:   η_m ← η_{m−1} · ρ
7:   Initialize d and p with the maximizers obtained in step 4.
8: end while
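The following sketch (ours) wires the half-quadratic sweep into the bandwidth-annealing loop of Algorithm 1; `half_quadratic_step` refers to the sketch given after update (6), and the number of annealing levels, inner iterations and tolerance are arbitrary choices of ours.

```python
import numpy as np

def global_transformation_vector(pixels, w, mus, sigmas, eta0, rho=0.5,
                                 n_anneal=10, n_inner=50, tol=1e-6):
    """Sketch of Algorithm 1: anneal the bandwidth from a uni-mode-promising
    value eta0 (Proposition 1) down by factor rho, re-running the
    half-quadratic iterations at each level from the previous maximizer."""
    rng = np.random.default_rng(0)
    d = rng.uniform(0.2, 3.0, size=3)             # arbitrary random initialization
    eta = eta0
    for _ in range(n_anneal):
        for _ in range(n_inner):                   # run (5)-(6) until convergence
            d_new = half_quadratic_step(d, pixels, w, mus, sigmas, eta)
            if np.linalg.norm(d_new - d) < tol:
                d = d_new
                break
            d = d_new
        eta *= rho                                 # eta_m <- eta_{m-1} * rho
    return d
```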
In the following subsections, we let d* and p* denote the convergent points reached in Algorithm 1, and η* the corresponding bandwidth. We call the global maximizer d* reached in Algorithm 1 the global transformation vector (GTV) (associated with the current prior weights w).
2.4 Multiple Mode-Seeking
In this section, as an extension of Algorithm 1, we develop an adaptive and sequential method, named Ada-GTV, for multiple transformation vector mode-seeking. The core idea is to find the GTVs one after another by adaptively changing the prior weight vector w and finding the corresponding GTV d* via Algorithm 1. Suppose the current GTV has been estimated; we then search around it for a local maximizer d* of the criterion function (2) estimated under equal prior weights (this is because our purpose is to find the peak modes of (2) estimated on the original training and test data). The dual variables are calculated as p_{ij} = −k′(M²(C_i d*, μ_j, η*²Σ_j)), i = 1, ..., N, j = 1, ..., S. We then reweight all the terms in (2), giving higher weight to the cases that are "worse" compensated (those with lower p_{ij}), and repeat the GTV-seeking procedure with Algorithm 1. This leads to a sequential global mode-seeking algorithm. The formal description of Ada-GTV is given in Algorithm 2. The GTVs found in this way can be naturally viewed as transformation parameters for the different illuminations in the scene. Compensation and color classification can then easily be done according to (1), as stated in [8]. The running time of Ada-GTV is obviously O(L·N·S) (L, S ≪ N), hence it is a linear-complexity algorithm w.r.t. the pixel number N.

Algorithm 2. Ada-GTV
1: Initialization: Start with weights w_{ij}^0 = 1/(NS), i = 1, ..., N, j = 1, ..., S
2: for l = 0 to L − 1 do
3:   GTV Estimation: Find the GTV d* by Algorithm 1 with the current prior weights w^l.
4:   Mode Refinement: Starting from d*, find the local maximum d* of f̂_k(d) estimated under η* and w^0.
5:   Dual Variables: Compute p_{ij} = −k′(M²(C_i d*, μ_j, η*²Σ_j)).
6:   Sample Re-weighting: Set w_{ij}^{l+1} ← w_{ij}^l / (1 + p_{ij}). Normalize w_{ij}^{l+1} ← w_{ij}^{l+1} / Σ_{ij} w_{ij}^{l+1}.
7: end for
8: Color and Illuminant Classification: Each pixel's illuminant and color labels are determined as (j(i), l(i)) = arg min_{j,l} M²(C_i d_l, μ_j, Σ_j).
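A short sketch (ours) of the dual-variable and re-weighting steps 5–6 of Ada-GTV under the Gaussian profile; terms that are poorly compensated (small p_ij) receive a larger weight in the next round.

```python
import numpy as np

def reweight(w, pixels, d_star, mus, sigmas, eta_star):
    """Steps 5-6 of Ada-GTV: dual variables for the current GTV d*, then
    up-weight the 'worse' compensated terms (Gaussian profile assumed)."""
    inv = np.linalg.inv(eta_star**2 * sigmas)
    comp = pixels * d_star
    p = np.empty_like(w)
    for j in range(len(mus)):
        diff = comp - mus[j]
        m2 = np.einsum('ni,ij,nj->n', diff, inv[j], diff)
        p[:, j] = 0.5 * np.exp(-0.5 * m2)           # p_ij = -k'(M^2)
    w_new = w / (1.0 + p)                           # lower p_ij -> higher weight
    return w_new / w_new.sum()                      # normalize as in step 6
```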
3 Experiments
We present several groups of experiments on color compensation and classification of real scenes to show the performance of our method. The first experiment shows the global optimization property of our algorithm. For comparison purposes, we adopt one set of image data used in [8]. The training image under the "canonical" light (with the manually selected sample colors) and the test image are shown in Fig. 1(a) and 1(b). The compensation and color classification results of [8] are shown in Fig. 1(c)–1(f). It is obvious that result R1 from starting point P1 (Fig. 1(c) and 1(d)) is much more satisfying than R2 from starting point P2 (Fig. 1(e) and 1(f)); hence the EM-based algorithm is highly dependent on its initialization.
Fig. 1. A comparison example with EM based method [8]. (a): training image (with selected color) under “canonical” light (b) test image. (c) ∼ (f): compensation and color classification results by [8]. (c) and (d): R1 from starting point P1 ; (e) and (f): R2 from starting point P2 . (g) ∼ (i): the compensation, color classification and illuminant classification results by Ada-GTV from the starting point either P1 or P2 .
Table 1. Numerical results, Ada-GTV vs. EM
Starting point P1:       d1 = (1.0, 1.0, 1.0)         d2 = (2.0, 2.0, 2.0)
Result R1 by EM [8]:     d1 = (0.693, 0.773, 0.914)   d2 = (2.005, 1.636, 1.456)
Result by Ada-GTV:       d1 = (0.916, 0.990, 1.053)   d2 = (2.123, 1.614, 1.402)
Starting point P2:       d1 = (0.5, 0.5, 0.5)         d2 = (1.0, 2.0, 1.0)
Result R2 by EM [8]:     d1 = (0.493, 0.748, 0.502)   d2 = (1.873, 1.557, 1.487)
Result by Ada-GTV:       d1 = (0.916, 0.990, 1.053)   d2 = (2.123, 1.614, 1.402)
The compensation, color classification and illuminant classification results of our Ada-GTV algorithm initialized with either P1 or P2 are shown in Fig. 1(g)–1(i). Detailed numerical results are given in Table 1, which clearly indicates the initialization-invariant property of our method. The second experiment shows the ability of our method to handle nonlinear illuminant changes based on the current linear render model. To do this, we make a piecewise linear assumption to approximate the general nonlinear case; our method can automatically find the transformation vectors for each linear piece. We give here one experiment on a pair of "map" images to validate this interesting property. We used a Canon A550 DC with automatic exposure, taking care to compensate for the camera's gamma setting. The training image Fig. 2(a) and the test image Fig. 2(b) were shot under two very different camera settings. The selected 6 sample colors from the training image and their ground truth counterparts in the test image are shown in Fig. 2(c) (left part).
Fig. 2. Piecewise linear color constancy. (a) Training image; (b) Test image; (c) left: 6 selected sample colors and their ground truth counterparts in the test image; right: the ground truth transformation vectors for the 6 sample colors; (e)∼(g) color compensation, color classification and piecewise linear illuminant classification results. The black part in (e) and (f) represents unseen colors in the test image. (h): color compensation result by render vector d1 only.
Table 2. Initializations and iteration results for the render vectors

                     d1                       d2
Initializations      (1, 1, 1)                (1, 1, 1)
Iteration results    (0.649, 0.845, 1.661)    (0.788, 1.008, 3.014)
Initializations      (0.5, 0.5, 0.5)          (0.5, 0.5, 0.5)
Iteration results    (0.648, 0.843, 1.661)    (0.788, 1.008, 3.014)
Initializations      (5, 5, 5)                (5, 5, 5)
Iteration results    (0.655, 0.852, 1.661)    (0.788, 1.008, 3.014)
To test whether the illuminant change in the test image is linear or not, we calculate the ground truth transformation vectors for the samples and plot them in Fig. 2(c) (right part). Obviously, two clusters (bounded by dotted ellipses) appear among these vectors; thus the illuminant change is highly nonlinear. One reasonable assumption is that such a change is piecewise linear, and we may simply feed the image data into Ada-GTV to let it find the transformation vector modes sequentially for each piece, from arbitrary initializations. The EM-based method [8] can hardly achieve this goal because an accurate initialization for each linear piece is required, which is not always available beforehand. Here, we set the mode number L = 2 in Ada-GTV and initialize both render vectors d1 and d2 with three different starting points. The convergent points are the same under these initializations, as shown in Table 2 (parameters are set to η0 = 1.934 and ρ0 = 0.5). The image results are shown in Fig. 2(d)–2(f). Fig. 2(g) shows the color compensation result by render vector d1 alone, which visually introduces a very large compensation error.
Fig. 3. Some other experimental results. From left to right: training image, test image and color compensated image. (a)“Casia” image pairs, (b)“Comic” image pairs, (c) and (d): “face” image pairs.
Thus, we can see that the adopted piecewise linear assumption greatly improves the performance of color constancy. We have also extensively evaluated our Ada-GTV method on other real-scene image pairs; selected results are given in Fig. 3.
4 Conclusion
We have introduced in this paper a novel convex kernel based method for color constancy computation with explicit illuminant parameter estimation. A convex kernel sum function is defined to measure the illuminant compensation accuracy in a new scene that contains some of the color surfaces seen in the training image. The render vector parameters are estimated by sequentially locating the peak modes of this objective function. The proposed method is fully data-driven and initialization invariant. Nonlinear color constancy can also be approximately solved in our framework under a piecewise linear assumption. The experimental results clearly show the advantage of our method over local optimization frameworks, e.g., the MAP formulation with EM solution stated in [8].
Acknowledgement This work was supported by the following funding resources: National Natural Science Foundation Project #60518002, National Science and Technology Supporting Platform Project #2006BAK08B06, National 863 Program Projects #2006AA01Z192 and #2006AA01Z193, and Chinese Academy of Sciences “100 people project”.
References
1. Buchsbaum, G.: A spatial processor model for object color perception. Journal of the Franklin Institute 310(1), 1–26 (1980)
2. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(5), 603–619 (2002)
3. Forsyth, D.A.: A novel algorithm for color constancy. International Journal of Computer Vision 5(1), 5–36 (1990)
4. Gilbert, A., Bowden, R.: Tracking objects across cameras by incrementally learning inter-camera colour calibration and patterns of activity. In: European Conference on Computer Vision, vol. 2, pp. 125–136 (2006)
5. Hall, J., McGann, J., Land, E.: Color mondrian experiments: the study of average spectral distributions. J. Opt. Soc. Amer. A(67), 1380 (1977)
6. Finlayson, G., Barnard, K., Funt, B.: Color constancy for scenes with varying illumination. Computer Vision and Image Understanding 65(2), 311–321 (1997)
7. Li, S.Z.: Robustizing robust m-estimation using deterministic annealing. Pattern Recognition 29(1), 159–166 (1996)
8. Manduchi, R.: Learning outdoor color classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(11), 1713–1723 (2006)
9. Marimont, D.H., Wandell, B.A.: Linear models of surface and illuminant spectra. J. Opt. Soc. Amer. 9(11), 1905–1913 (1992)
10. Rockafellar, R.T.: Convex Analysis. Princeton University Press (1970)
11. Shen, C., Brooks, M.J., van den Hengel, A.: Fast global kernel density mode seeking with application to localization and tracking. In: IEEE International Conference on Computer Vision, vol. 2, pp. 1516–1523. IEEE, Los Alamitos (2005)
12. Tieu, K., Miller, E.G.: Unsupervised color constancy. In: Thrun, S., Becker, S., Obermayer, K. (eds.) Advances in Neural Information Processing Systems 15. MIT Press, Cambridge
13. Tsin, Y., Ramesh, V., Collins, R., Kanade, T.: Bayesian color constancy for outdoor object recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 1132–1139 (2001)
14. Funt, B.V., Cardei, V.C., Barnard, K.: Modeling color constancy with neural networks. In: International Conference on Visual Recognition and Action: Neural Models of Mind and Machine (1997)
User-Guided Shape from Shading to Reconstruct Fine Details from a Single Photograph

Alexandre Meyer, Hector M. Briceño, and Saïda Bouakaz

Université de Lyon, LIRIS, France
Abstract. Many real objects, such as faces, sculptures, or low-reliefs, are composed of many detailed parts that cannot easily be modeled by an artist or by 3D scanning. In this paper, we propose a new shape from shading (SfS) approach to rapidly model details of these objects, such as wrinkles and surface reliefs, from one photograph. The method first determines the surface's flat areas in the photograph. Then, it constructs a graph of relative altitudes between each of these flat areas. We circumvent the ill-posed problem of shape from shading by having the user indicate whether some of these flat areas are a local maximum or a local minimum; additional points can be added by the user (e.g. at discontinuous creases) – this is the only user input. We use an intuitive mass-spring based minimization to determine the final position of these flat areas and a fast-marching method to generate the surface. This process can be iterated until the user is satisfied with the resulting surface. We illustrate our approach on real faces and low-relief photographs.
1 Introduction

Despite recent advances in surface modeling and deformation, creating photorealistic 3D models remains a difficult and time-consuming task. Many real objects, such as people, faces, sculptures, masks or low-reliefs, are composed of many detailed parts that cannot be easily modeled by an artist. Alternatively, 3D scanning technology is still an expensive process. While much work has been devoted to using several photographs to build 3D models, or to rendering new views from many photographs, little work has addressed the problem of modeling objects from a single photograph. The fine aspects of these surfaces appear in photographs as variations of shading. The methods that recover these features from the shading are called Shape from Shading (SfS). Nevertheless, it has been shown that this is an ill-posed problem [1,2]: a solution does not necessarily exist, and when it exists it is not unique, meaning that different surfaces may have produced a given image. Figure 1 illustrates this point: the image on the left may have been produced by both objects on the right. In this paper, we propose a new practical Shape from Shading method which can be applied to a real photograph to help with the difficult problem of modeling fine aspects of surfaces such as wrinkles and reliefs. The ill-posedness of shape from shading is resolved by asking a user to decide whether some areas orthogonal to the viewing direction are a
Laboratoire d'InfoRmatique en Images et Systèmes d'information, UMR 5205 CNRS / INSA de Lyon / Université Claude Bernard Lyon 1 / Université Lumière Lyon 2 / École Centrale de Lyon.
Fig. 1. Shape from Shading is an ill-posed problem: two different surfaces may produce the same shaded image. The left image may have been produced by several surfaces shown on the right. The highlights correspond to the flat areas. The upper right image was produced considering highlights A and C as peaks, and the lower right image considering only highlight B as a peak. Other combinations are possible.
local extrema (a local maximum or a local minimum) or not. The user information is propagated during a minimization process based on a mass-spring simulation. This mass-spring minimization has the advantage of providing a graphical visualization which allows the user to interact during the computation according to his knowledge. Indeed, we have noticed that many fully automatic minimization approaches, like simulated annealing [3], do not offer any convenient way of correcting the reconstructed surface if it is incorrect. With our approach, the user can visually follow the intuitive mass-spring minimization and correct any errors in the subsequent reconstruction. Moreover, each time he wants to see the potential reconstructed surface, the fast marching method [4,5] generates it in a few seconds.
2 Related Work

The problem of surface reconstruction from a single image can vary in the degree of user interaction, from fully automatic approaches to interactive modeling. Hoiem et al. [6] proposed a fully automatic system that creates a 3D model of the scene made up of several texture-mapped planar billboards. Their approach captures the global geometry of the scene as planar. On a single image without any underlying lines (or planes), for example a face, it seems difficult to obtain better results without using shading information. The Shape from Shading (SfS) problem has been widely studied in the computer vision area; see [7,8,9] for surveys of SfS methods. The SfS problem is known to be difficult because of its ill-posedness [2]. Consequently, few approaches have been tried on real photographs. Courteille et al. [10] propose a method to flatten a photograph of a curved sheet of paper in order to facilitate character recognition. Prados et al. [2,11] proposed a method that takes light attenuation into account, with relatively good results on face photographs. Zhu et al. [12] tackle the ambiguities of shape from shading by a semi-definite programming relaxation process which flips patches and adjusts heights until the resulting surface has no kinks.
To our knowledge, besides Zeng et al. [13], SfS approaches are mostly fully automatic in spite of the ill-posed nature of the problem. Zeng's method asks the user to enter some normals in order to determine in which direction the slope goes up to a local maximum. Once all local maxima are computed, they compute the relative altitude between each of them, and the fast-marching algorithm generates the surface. Similarly to Zeng et al., we believe that the ill-posed aspect can only be tackled with user interaction. Compared to Zeng's approach, we differ in the user input. In Zeng's approach, the user has to enter enough normals to capture small surface variations such as wrinkles on a forehead. Our approach automatically computes all flat areas, and our mass-spring minimization is more visual. Thus, the user may interact more directly with the data (the mass-spring graph in our case) and may explore different solutions, as illustrated in Fig. 1 with the two plausible configurations.
3 Formulation and Overview of Our Approach

Our technique takes as input a color image and produces as output a heightmap, which is an image where each pixel stores a distance to the underlying plane. Input images are obtained with a camera using a flash as the only light source. The first step of our approach is to compute the shading image from the RGB colored photograph. For that, we convert each RGB pixel into the YUV color space and keep the luminance Y as shading. A more elaborate solution, based on an assumption of shading continuity, is the one proposed by Funt et al. [14]: reflectance changes are located and removed from the luminance image by thresholding the gradient at locations of abrupt chromaticity change. More recently, Tappen et al. [15] proposed a solution based on both color information and a classifier trained to recognize gray-scale patterns. Once the shading image is computed, we have the classical SfS problem. In this part, we assume that the shading image is photographed orthographically, and that the scene is composed of Lambertian surfaces which exhibit single-bounce reflections and are illuminated from the camera direction by a point light source at infinity. Our approach is decomposed as follows:
1. For each pixel of the RGB photograph, extract the shading value by converting it into YUV and taking the Y value (or with the methods of [14] or [15]).
2. Detect the flat regions (highlights) in the image, which will become the vertices of our relative-altitude graph that guides the user interaction and the reconstruction (Sect. 4).
3. Using fast marching (for a recap see Sect. 3.1), compute the relative altitude difference between the vertices of the relative-altitude graph (Sect. 4).
4. The user defines a few vertices as peaks or saddles.
5. The position of the remaining vertices is computed by solving a spring-mass system over the vertices (Sect. 5), and a new surface is quickly computed.
6. If features on the surface do not have a vertex associated with them, the user can add new vertices to the relative-altitude graph.
7. These last three steps can be re-iterated until the user is satisfied with the reconstructed surface.
3.1 Shading Image Formation Model and Fast-Marching

We first review the formation model of the shading image I(x, y) for a 3D Lambertian object. In our approach the camera and the light source have the same position, which we consider at infinity from the object; we then define the light source direction as L = (0, 0, 1). The surface normal direction is given by N_{x,y} = (∂z_{x,y}/∂x, ∂z_{x,y}/∂y, 1). Notice that N is not normalized. The shading image is the dot product of the light and the normalized surface normal; it is computed by:

I_{x,y} = L · N_{x,y}/||N_{x,y}|| = 1 / sqrt( (∂z_{x,y}/∂x)² + (∂z_{x,y}/∂y)² + 1 )

With ∇z_{x,y} = (∂z_{x,y}/∂x, ∂z_{x,y}/∂y) we get ||∇z_{x,y}|| = sqrt( I_{x,y}^{-2} − 1 ). This equation is known as the Eikonal equation, which can be solved by the numerical fast marching algorithm, which we recap coarsely here; more detailed information can be found in [4,5,16]. At initialization, all pixel altitudes z_{x,y} are set to ∞ except for a few pixels whose altitudes are known. All known pixels are put into a priority queue ordered by their altitude, smallest altitude first. The algorithm extracts pixels from the priority queue until it is empty. Starting from a known pixel, the altitude of each of its four-connected neighbors is updated and added to the queue. The altitude z_{x,y} of pixel (x, y) is updated as follows:
• Let z1 = min(z_{x−1,y}, z_{x+1,y}) and z2 = min(z_{x,y−1}, z_{x,y+1}).
• If |z1 − z2| < ||∇z_{x,y}||, then z_{x,y} = ( z1 + z2 + sqrt( 2||∇z_{x,y}||² − (z1 − z2)² ) ) / 2; else z_{x,y} = min(z1, z2) + ||∇z_{x,y}||.
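A minimal Python sketch (ours, not the authors' implementation) of the fast-marching update just described: seeds are pixels with known altitude, the Eikonal right-hand side is ||∇z|| = sqrt(I^{−2} − 1), and a heap stands in for the priority queue.

```python
import heapq
import numpy as np

def fast_march(I, seeds, eps=1e-6):
    """Fast marching on a shading image I in (0, 1].

    seeds maps (x, y) pixels to known altitudes; returns the altitude map z.
    """
    h, w = I.shape
    grad = np.sqrt(np.maximum(I, eps) ** -2 - 1.0)     # ||grad z|| per pixel
    z = np.full((h, w), np.inf)
    heap = []
    for (x, y), alt in seeds.items():
        z[x, y] = alt
        heapq.heappush(heap, (alt, x, y))
    while heap:
        alt, x, y = heapq.heappop(heap)
        if alt > z[x, y]:
            continue                                   # stale queue entry
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if not (0 <= nx < h and 0 <= ny < w):
                continue
            g = grad[nx, ny]
            z1 = min(z[nx - 1, ny] if nx > 0 else np.inf,
                     z[nx + 1, ny] if nx < h - 1 else np.inf)
            z2 = min(z[nx, ny - 1] if ny > 0 else np.inf,
                     z[nx, ny + 1] if ny < w - 1 else np.inf)
            if np.isfinite(z1) and np.isfinite(z2) and abs(z1 - z2) < g:
                cand = 0.5 * (z1 + z2 + np.sqrt(2 * g**2 - (z1 - z2)**2))
            else:
                cand = min(z1, z2) + g                 # one-sided update
            if cand < z[nx, ny]:
                z[nx, ny] = cand
                heapq.heappush(heap, (cand, nx, ny))
    return z
```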
4 Relative Altitude Graph

Our minimization process, described in the next section, is based on a relative-altitude graph mapped onto the shading image. This section is dedicated to the computation of this graph.

Vertex Detection. We define a graph over the photograph where the vertices correspond to highlights (high intensity values) in the shading image. These points correspond to singular points of the surface where the gradient is 0 (e.g. local minima, local maxima, saddles). In the shading image, several adjacent pixels may have the same intensity; we thus consider only one area, and thus one vertex in the graph. Since we assume an orthographic camera, a vertex represents a flat area of the surface orthogonal to the viewing direction. We name them OVD areas, for Orthogonal to the Viewing Direction. All these OVD areas are parallel because of the orthographic assumption. Theoretically, these OVD areas have a maximal shading value. Since light attenuation may appear in a real photograph, we define them by a threshold T_shading. To compute these OVD areas we consider all regions of 4-connected pixels with equivalent values. The OVD areas are computed by a depth-first search on the shading image interpreted as a graph: two pixels are neighbors if their shading values are equal. Among all these areas, we keep as OVD areas only those with a shading value greater than T_shading and whose neighboring areas are less bright.
Fig. 2. (a) Each highlight of the shading image gives a vertex in the relative-distance graph. The fast marching algorithm computes the relative distance between each pair of vertices. The graph is simplified into a minimum spanning tree. (b) An iterative process of mass-spring simulation/user inputs on the graph (c) runs until the user is satisfied by the reconstructed surface (d). Notice that, since our graph represents relative altitude (and not Euclidean distance), each vertex can move only in a column (change altitude) during the mass-spring simulation.
On the vase image of Fig. 2, this algorithm finds four OVD areas, which intuitively are the four highlight spots.

Edges and Relative Altitude Computation. We now determine the edges and their weights, which are the relative altitudes between vertices. For each vertex v_i of the graph, we compute the relative altitude to all other vertices by fast marching: we set to zero the altitude of the pixel p_i under the considered vertex v_i. The altitude of all other pixels is set to ∞. We run the fast marching algorithm as described in Sect. 3.1. The altitude of all other pixels will be lower than that of v_i because we only descend from this pixel. We then look at the altitude of the pixels that correspond to the other vertices v_j, j ≠ i, and use this difference to set the weight of the edge e_ij to be the relative altitude difference between vertices v_i and v_j. Note that the relative altitude difference between two vertices might be wrong if there is an inflection point between them; this situation will be addressed in the next section by a simplification of the graph. We iterate this process for each vertex until we get the weights of all the edges between all the vertices. After this process, we do not know if a vertex is, for example, a local maximum, a local minimum, or a saddle; we only know its relative altitude to its neighbors.
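The edge-weight computation can be sketched as follows (ours): run the fast-marching sketch of Sect. 3.1 once per vertex, read off the altitudes at the other vertices, and — anticipating the simplification of Sect. 5 — reduce the complete graph to its minimum spanning tree with SciPy; the symmetrization of the weight matrix is an arbitrary choice of ours.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def relative_altitude_tree(I, vertex_pixels):
    """Relative-altitude graph over the OVD vertices, reduced to its MST.

    vertex_pixels : one (x, y) pixel per OVD area, in the same convention as
    the fast_march sketch of Sect. 3.1 (assumed available).
    """
    n = len(vertex_pixels)
    W = np.zeros((n, n))
    for i, p in enumerate(vertex_pixels):
        z = fast_march(I, {p: 0.0})               # descend from vertex v_i
        for j, q in enumerate(vertex_pixels):
            W[i, j] = z[q]                        # relative altitude to v_j
    W = 0.5 * (W + W.T)                           # symmetrize (our choice)
    np.fill_diagonal(W, 0.0)
    return minimum_spanning_tree(W).toarray()     # spanning tree of the graph
```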
5 Mass-Spring Simulation and User Interaction

The relative altitude graph is mapped onto the shading image, then it is simplified and converted into a mass-spring network which serves as a visual aid and as a way for the user to correct the minimization process. Our initial complete graph is composed of C(n, 2) = n(n − 1)/2 edges, n being the number of vertices (OVD areas). However, the relative altitude between two distant vertices (in the sense of Euclidean distance) is probably incorrect and should be discarded: the monotonic descent assumption used for the relative altitude calculation does not
hold if the path between the two vertices crosses a valley, a saddle or a ridge. Thus, a relative altitude is only meaningful for adjacent vertices, and the graph can be reduced to a subset of edges. For the same reasons as Zeng et al. [13], we simplify the graph into its minimum spanning tree using Prim's algorithm [17]. The number of edges is thus reduced to n − 1. Since all vertices are directly or indirectly connected to each other by the tree, the user can seed a minimization process to find their absolute altitudes. Notice that our approach of computing a complete graph which is then simplified is simple to set up, whereas determining directly which vertices are neighbors would have been error-prone. The weight of each edge is the (relative) altitude difference between its two vertices, but the sign of this relative altitude is unknown: we do not know which vertex is above the other. In other words, we do not know which vertices are local maxima, which are local minima, and which ones are saddles. Thus, we ask the user to select some vertices and to move them up or down according to his knowledge of the target surface; this serves as the initial condition to the minimization process. For example, on a face, the user will move up the vertex corresponding to the nose. In order to respect the relative altitude constraints between vertices and to propagate the user's information to the remaining vertices, we build a mass-spring network. This saves the user from having to adjust the altitude of all vertices. Each vertex becomes a mass which is able to move only in the z direction, as its (x, y) position is fixed. Indeed, since we consider only relative altitude between vertices (and not Euclidean distance), this simulation is like a single column of masses linked by springs, as illustrated in Fig. 2. All vertices have a mass of 1, meaning that no vertex is more important than another. Each edge of the spanning tree becomes a spring with a rest length equal to the relative altitude between its two vertices. This mass-spring network is animated by an explicit Euler integration [18] until it stabilizes. This simulation has two advantages: propagating user inputs and being visually intuitive for the user. Indeed, even before stabilization, the user may want to interact by moving vertices according to his knowledge of the surface. Moreover, each time he wants to see the potential reconstructed surface, the fast marching method generates it in less than a second.
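A minimal sketch (ours) of the single-column mass-spring relaxation with explicit Euler integration; stiffness, damping, step size and the way user-pinned vertices are enforced are our own choices, not the authors'.

```python
import numpy as np

def relax(edges, rest, z0, pinned, k=1.0, damping=0.9, dt=0.05, n_steps=2000):
    """Explicit-Euler relaxation of the single-column mass-spring network.

    edges  : list of (i, j) vertex index pairs (the spanning-tree edges)
    rest   : rest length of each edge = relative altitude between its vertices
    z0     : initial altitudes, after the user moved a few vertices up or down
    pinned : indices of vertices whose altitude the user keeps fixed
    """
    z_init = np.asarray(z0, dtype=float)
    z = z_init.copy()
    v = np.zeros_like(z)                       # unit mass for every vertex
    pinned = list(pinned)
    for _ in range(n_steps):
        f = np.zeros_like(z)
        for (i, j), length in zip(edges, rest):
            d = z[i] - z[j]
            direction = 1.0 if d >= 0 else -1.0
            stretch = abs(d) - length          # spring wants |z_i - z_j| = length
            f[i] -= k * stretch * direction
            f[j] += k * stretch * direction
        v = damping * (v + dt * f)             # explicit Euler step
        z += dt * v
        z[pinned] = z_init[pinned]             # enforce the user constraints
    return z
```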
Fig. 3. A surface may have sharp edges corresponding to local minimum or maximum without producing highlights. For example, at the line of the junction of the lips, the surface changes its orientation without forming a flat area. To correct this, the user can add vertices to the graph which will allow a change in the orientation. On the left, the graph without user intervention produces incoherent lips whereas on the right, after adding a vertex (blue), our method produces correct lips.
This iterative process of mass-spring simulation and user interaction on the vertices runs until the reconstructed surface satisfies the user. Our interactive method can also deal with surfaces with sharp edges, meaning local minima or maxima without a highlight. Sharp edges are points where the surface is C0 continuous but only piecewise C1 continuous, i.e., where the gradient is not smooth. For instance, at the line of the junction of the lips, the surface changes its orientation without producing a flat area with a highlight. Thus, our method allows the user to add a vertex to the relative-altitude graph, which allows a change in the surface orientation. Fig. 3 illustrates this feature: on the left, the graph without user intervention produces incoherent lips, whereas on the right, after the addition of the blue vertex, our method produces correct lips.
6 Results

In Fig. 5 we show results from real photographs to demonstrate that our technique is well suited to image-based modeling. The surface is reconstructed from one input photograph in a few minutes by a user: graph generation takes around 30 seconds on an Intel Centrino laptop for approximately 70 vertices, and surface generation by fast marching takes less than a second for an image of 300 × 200. For the faces, the user has to interact with fewer vertices, between 2 and 10. We illustrate our concept on face photographs to show the capability of our technique to capture fine wrinkles of the skin, which are difficult to obtain with multi-view approaches. Once the surface is reconstructed as a heightmap, we use the surface normal to extract a pseudo-intrinsic color for each pixel by solving the diffuse equation (R, G, B)_image = (R, G, B)_intrinsic × (N · L) with L = (0, 0, 1). Thus, a textured image of the surface is obtained by combining the intrinsic color and the shading computed from the surface normal. Notice that if the light is similar to the original photograph (position and color), we should obtain similar results.
Fig. 4. On the left, the original photograph and the computed shading image. On the right, the reconstructed surface representing a low-relief hand. After an empirical test on this image, we do not perform any particular process to manage the specular aspect, nor to take into account that the wall behind the hand has probably a different albedo than the hand.
Fig. 5. Fine details of facial expressions are hard to capture because of the wrinkles of the skin. Using shading information, our technique allows us to capture them from a single photograph. We show the original image, the computed shading image used to reconstruct the surface, a rendering of the reconstructed surface with only the shading computed using the normals, and some results with the color texture. Top left is a photo downloaded from the web; the others are extracted from a 640 × 480 video sequence. Notice that Zeng et al. [13] propose a minimization method to fix the kinks of the surface due to fast marching imprecision, like the ones present near the eyebrows (bottom).
The heightmap produced by our system is easily triangulated into a mesh. Nevertheless, it can also be directly used as a displacement map, for instance to produce realistic scenes of low-relief walls, as illustrated in Fig. 4.

Limitations. Our method allows the reconstruction of fine details; nevertheless, there are some limitations to our system. First, the reconstructed surface might have some kinks at surface junctions produced during the fast marching process. In Fig. 5, kinks near the eyebrows are present due to the difference of albedo between the skin and the eyebrows. At the end of [13], Zeng et al. propose a method to fix this kind of surface incoherency by a minimization process. Second, if there are too many fine details, the amount of user input can become important. It is conceivable to add heuristics, to modify the mass-spring simulation or to use a hierarchical approach to alleviate this limitation. Additionally, the algorithm supposes that the surface is C1 continuous; thus discontinuities in the surface can be hard to capture. This problem is partially mitigated by allowing the user to add vertices to the relative altitude graph.
7 Conclusion and Future Work

Starting from a single color image (a photograph), we have presented an intuitive method for user-guided reconstruction of surfaces which may have produced the image. Our method is interactive: guided by the user, it may reconstruct different surfaces for the same input image. It allows the user to explore different SfS solutions in case of doubt, for example for the image on the left of Fig. 1, which may have been produced by the two surfaces on the right. This exploration facility allows the user to interactively, quickly, and easily reconstruct the surface of a given object from only one photograph. The ambiguity around the global shape of the photographed object is hard to resolve automatically without any a priori knowledge, so we ask the user to specify a few local extrema (maxima or minima). Since the reconstructed surface is computed in a few seconds, it is easy for the user to converge to a surface. Manual intervention is only needed to reconstruct the global shape, whereas the fine parts of the surface are automatically extracted. Finally, we believe that a little user interaction can help to reconstruct many real objects. Thus, SfS approaches may be practically included in 3D mesh modelers¹ by defining a shape-by-example paradigm. In the future, we would also like to combine SfS approaches with global surface reconstruction based on multiple views.
References
1. Durou, J.D., Mascarilla, L., Piau, D.: Non-Visible Deformations. In: Del Bimbo, A. (ed.) ICIAP 1997. LNCS, vol. 1311. Springer, Heidelberg (1997)
2. Prados, E., Faugeras, O.: Shape from shading: a well-posed problem? In: CVPR 2005. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 870–877. IEEE, Los Alamitos (2005)
3. Courteille, F., Durou, J.D., Morin, G.: A global solution to the SfS problem using B-spline and simulated annealing. In: ICPR (2006)
¹ Such as Maya (Alias Wavefront), 3D Studio Max (Discreet) or Image Modeler (Realviz).
4. Sethian, J.A.: A Fast Marching Level Set Method for Monotonically Advancing Fronts. Proceedings of the National Academy of Sciences of the United States of America 93(4), 1591–1595 (1996)
5. Kimmel, R., Sethian, J.A.: Optimal Algorithm for Shape from Shading and Path Planning. Journal of Mathematical Imaging and Vision 14(3), 237–244 (2001)
6. Hoiem, D., Efros, A.A., Hebert, M.: Automatic photo pop-up. In: SIGGRAPH 2005. ACM SIGGRAPH 2005 Papers, pp. 577–584. ACM Press, New York (2005)
7. Kozera, R.: An overview of the shape from shading problem. Machine Graphics and Vision (1998)
8. Zhang, R., Tsai, P.S., Cryer, J.E., Shah, M.: Shape from Shading: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(8), 690–706 (1999)
9. Durou, J.D., Falcone, M., Sagona, M.: A Survey of Numerical Methods for Shape from Shading. Rapport de Recherche 2004-2-R, Institut de Recherche en Informatique de Toulouse, Toulouse, France (2004)
10. Courteille, F., Crouzil, A., Durou, J.D., Gurdjos, P.: Shape from shading for the digitization of curved documents. Machine Vision and Applications (2006)
11. Prados, E., Camilli, F., Faugeras, O.: A unifying and rigorous shape from shading method adapted to realistic data and applications. Journal of Mathematical Imaging and Vision 25(3), 307–328 (2006)
12. Zhu, Q., Shi, J.: Shape from shading: Recognizing the mountains through a global view. In: CVPR 2006. Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Computer Society Press, Los Alamitos (2006)
13. Zeng, G., Matsushita, Y., Quan, L., Shum, H.Y.: Interactive Shape from Shading. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. I, pp. 343–350. IEEE Computer Society Press, Los Alamitos (2005)
14. Funt, B.V., Drew, M.S., Brockington, M.: Recovering shading from color images. In: Sandini, G. (ed.) ECCV 1992. LNCS, vol. 588, pp. 124–132. Springer, Heidelberg (1992)
15. Tappen, M.F.: Recovering intrinsic images from a single image. IEEE Trans. Pattern Anal. Mach. Intell. 27(9), 1459–1472 (2005)
16. Ho, J., Lim, J., Yang, M.H.: Integrating Surface Normal Vectors Using Fast Marching Method. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3953, pp. 239–250. Springer, Heidelberg (2006)
17. Prim, R.C.: Shortest connection networks and some generalizations. Bell Syst. Tech. J. 36, 1389–1401 (1957)
18. Desbrun, M., Schröder, P., Barr, A.: Interactive animation of structured deformable objects. Graphics Interface, 1–8 (1999)
A Theoretical Approach to Construct Highly Discriminative Features with Application in AdaBoost

Yuxin Jin, Linmi Tao, Guangyou Xu, and Yuxin Peng

Computer Science and Technology Department, Tsinghua University, Beijing, China
[email protected]
Abstract. AdaBoost is a practical method for real-time face detection, but it suffers from overfitting because of the large number of features used in a trained classifier, a consequence of the weak discriminative abilities of these features. This paper proposes a theoretical approach to construct highly discriminative features, named composed features, from Haar-like features. Both the composed and the Haar-like features are employed to train a multi-view face detector. Preliminary experiments show promising results in reducing the number of features used in a classifier, which increases the generalization ability of the classifier.
1 Introduction
In 1995, Freund and Schapire [1] introduced the AdaBoost algorithm based on the traditional boosting method. Thanks to their efforts, theoretical analyses of AdaBoost were proposed in the following years. They proved in [1] that the generalization error should be smaller if fewer training rounds are involved. Later, they gave a new theory of generalization in terms of margins [2]: greater margins contribute to better results. In early years, AdaBoost was, however, inapplicable in real-time cases due to its great computational cost. Fortunately, the breakthrough occurred in 2001 when Viola and Jones proposed a novel real-time AdaBoost for face detection [4]. The keys to making real-time detection possible are the use of the Integral Image, Haar-like features and a cascade hierarchy. Based on this approach, two kinds of extensions focused on improving the hierarchy and the features. [12] extended the cascade hierarchy into the multi-view case with the Detector Pyramid Architecture AdaBoost (DPAA). Later, [7] adopted a Width-First-Search (WFS) tree structure to balance high speed and robustness. [11] extended the Haar-like features with 45° rotated features. [8] proposed Asymmetric Rectangle Features, and experiments showed improved performance. However, the above methods are based on Haar-like features, which are so weak that a large number of weak classifiers are needed to train a strong classifier. As proven in [1], such a burdensome strong classifier increases the risk of overfitting. Some efforts were made to overcome the disadvantage of Haar-like features (their poor discriminative abilities). [9] used a PCA approach to generate the
global features which are included in the feature set in later layers of the cascade. [10] used Gabor features instead of Haar-like features. Although these features show superior discriminative abilities to Haar-like features, they are time-consuming to compute, which may preclude real-time application. [14] used EOH (Edge Oriented Histogram) features and obtained good results. In this paper, we propose a novel theoretical approach to construct highly discriminative features, named composed features, from Haar-like features; their computational load is small enough for real-time tasks such as face detection. Thus, we can not only efficiently compute the highly discriminative features but also decrease overfitting. In Section 2, we discuss features and their discriminative abilities. In Section 3, highly discriminative features that are efficient to compute are constructed from Haar-like features. Section 4 shows promising experimental results.
2 Features
For AdaBoost, a strong classifier combining many weak classifiers, each only slightly better than random guessing, can achieve an arbitrarily small training error after sufficiently many rounds, as proven in [1]. Each weak classifier contains two parts: a feature and a classification function. AdaBoost systematically chooses different features, builds weak classifiers based on them, and finally outputs a strong classifier. We focus on the features. To be real-time, Viola and Jones [4] introduced Haar-like features. With the help of the Integral Image, they are efficient to compute: only about six additions are needed to compute their value. Moreover, they are in essence linear features because their value can also be calculated as the dot product of the feature vector and the image vector. Fig. 1 shows the vector representation.
Fig. 1. Vector representation of Haar-like feature
The image can be represented as a vector:

image = [a_{11}, a_{21}, a_{31}, a_{41}, ..., a_{34}, a_{44}]^T

Similarly, the Haar-like feature can also be represented in vector form:

w = [0, −1, −1, 0, 0, −1, −1, 0, 0, +1, +1, 0, 0, +1, +1, 0]^T

The feature value x_w is just the dot product of these two vectors (the magnitude of w does not affect the discriminative ability):

x_w = w · image = w^T image    (1)
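For illustration, a small sketch (ours, not the authors' code) of how a two-rectangle Haar-like feature value — i.e. the dot product w^T·image in (1) for a ±1 pattern — is obtained from an integral image with a handful of additions; the rectangle layout is a hypothetical example.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with an extra zero row/column for easy indexing."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    """Sum over the h x w rectangle with top-left corner (y, x): 4 lookups."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_two_rect(ii, y, x, h, w):
    """Two-rectangle feature: right half weighted +1, left half weighted -1."""
    return (rect_sum(ii, y, x + w // 2, h, w // 2)
            - rect_sum(ii, y, x, h, w // 2))
```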
Unlike many known linear features which are highly discriminative, such as PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis) or Gabor features, a single Haar-like feature is so simple that its discrimination ability is weak. With respect to discriminative ability, features can be categorized into two classes, weak features and strong features, as shown in Fig. 2. This categorization is not strictly defined: strong features simply means features with relatively higher discrimination abilities (or whose classification error is lower than some threshold).
Fig. 2. A 2-class case. Left: the directions of 2 weak feature vectors and 1 strong feature vector. Right: dashed lines: the separation planes in the weak feature space; solid line: the separation plane in the strong feature space.
In this example, two classes are drawn as circles and crosses. Suppose there are only two weak features, whose directions are horizontal and vertical. In order to fully separate these two classes, about eight weak classifiers would have to be trained by AdaBoost. However, a single weak classifier with the strong feature can fully separate them, as shown in Fig. 2. According to the generalization theory in [1], a strong classifier combining fewer weak classifiers performs better. Although strong features are superior to weak features in discriminative ability, they are always time-consuming to compute, requiring n multiplications for an n-dimensional image vector. That is exactly what keeps them from being used in real-time cases. In the example illustrated by Fig. 2, all features are vectors. The strong feature can be constructed from the two weak features f_W1 and f_W2: f_S = α f_W1 + β f_W2. More generally, any linear feature vector in R^d can be constructed from at most d linearly independent vectors (linear features) in R^d. In light of this, we propose a novel approach to reduce the computational load of strong features by constructing them from Haar-like features.
3 Composed Features

3.1 Definition of Composed Features
The computational cost of a strong feature value arises from the numerous multiplications in the dot product between the feature vector and the image vector. Thus, in order to reduce this cost, the number of multiplications should be decreased. We can use Haar-like features to construct a strong feature. Due to the low computation
cost of Haar-like feature values, a strong feature's value can be computed quickly. A strong feature is denoted as a vector f_j in R^d. For a 16*16 image, there exist more than 50,000 Haar-like features. These features are necessarily linearly dependent because the dimension is just 256, far fewer than the number of features. In fact, there are just M × N linearly independent feature vectors for an M × N image. We define one group of M × N linearly independent feature vectors from the set of Haar-like features as base features:

w_0, w_1, w_2, ..., w_{d−1},  d = M × N

By using these base features, f_j can be constructed as follows:

f_j = Σ_{i=0}^{d−1} p_{ij} w_i    (2)
Then the feature value can be computed as:

x_j = Σ_{i=0}^{d−1} p_{ij} w_i^T image    (3)
In this way, we can compute the strong feature's value. However, such a representation still requires d multiplications. To reduce the computing time, we want to use fewer Haar-like features to construct a strong feature. We choose k linearly independent Haar-like features w_0, ..., w_{k−1} ∈ W, k ≪ d, to construct a feature q_j, where W is the complete set of Haar-like features. Then:

q_j = Σ_{i=0}^{k−1} q_{ij} w_i = W α_q,  with W = [w_0, ..., w_{k−1}],  α_q = [q_{0j}, ..., q_{k−1,j}]^T    (4)
We define such features as Composed Features:

Definition 1 (Composed features). Linear features constructed from some linearly independent features – the base features – so that the computation of the feature's value can be carried out indirectly by calculating the base features' values. Usually, the computational cost is reduced.
3.2 Approximation Measurement
Composed features are not necessarily strong features: some of them may be strong features while others may not. We have two ways to find strong features. First, we can exhaustively search for a proper set of Haar-like features to construct a composed feature and then check its discrimination ability: if the classification error is less than some predetermined threshold, it can be viewed as a strong feature. Second, since some strong features (PCA, LDA, or Gabor) are known, we can construct a composed feature to approximate these features. The first way would be feasible only if k is very small (just 2–3). An exhaustive exploration is equivalent to comparing all possible combinations of Haar-like features. Even fixing the coefficients, e.g. α_q = [1, 1]^T,
the search space would still be C(|W|, k), where |W| is the number of Haar-like features. When k > 3, the computational cost is too large: C(50000, 4) ≈ 2.6 × 10^17. Such a naïve approach breaks down when k is even slightly larger. The second way is more practical. In this case, f_j is assumed to be known (it can be a PCA, LDA, or Gabor feature). Our task is to find a proper set of w_0, ..., w_{k−1} and construct a composed feature which approximates f_j. Thus, a measurement to evaluate the approximation is introduced as follows:

Definition 2 (Approximation Measurement). The smaller the angle θ between the vectors f_j and q_j is, the better q_j approximates f_j.

Thus, given the set W, the coefficients α_q can be uniquely determined according to the Approximation Measurement. This is equivalent to maximizing cos θ:

cos θ = (q_j · f_j) / (|q_j| |f_j|) = ( Σ_{i=0}^{k−1} q_{ij} w_i^T f_j ) / ( ||Σ_{i=0}^{k−1} q_{ij} w_i|| |f_j| ) = (α_q^T F_W) / ( sqrt(α_q^T W^T W α_q) |f_j| )    (5)

where F_W = [w_0^T f_j, ..., w_{k−1}^T f_j]^T. Then the problem becomes solving the maximization problem given below:

max_{α_q} α_q^T F_W   s.t.  α_q^T W^T W α_q = 1    (6)
Here, we restrict the composed vector q_j to be a unit vector without loss of generality. Such a problem can easily be solved with the Lagrangian method:

f(p_{0j}, p_{1j}, ..., p_{k−1,j}, λ) = α_q^T F_W − λ (α_q^T W^T W α_q − 1)    (7)

λ = sqrt( F_W^T (W^T W)^{−1} F_W ) / 2,   α_q = (W^T W)^{−1} F_W / sqrt( F_W^T (W^T W)^{−1} F_W )    (8)
As the selected Haar-like features are linearly independent, W^T W is invertible. The remaining problem is how to find a proper set W. It is infeasible to do this by exhaustive search; therefore, we introduce a novel algorithm based on Simulated Annealing to achieve it.
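A direct transcription (ours) of the closed form (8): given a target feature f_j and a matrix W whose columns are the k chosen Haar-like features, it returns α_q and the resulting cos θ from (5).

```python
import numpy as np

def compose(f, W):
    """Coefficients (8) for approximating target feature f with the columns
    of W (k linearly independent Haar-like features). Returns (alpha_q, cos_theta)."""
    F = W.T @ f                           # F_W = [w_0^T f, ..., w_{k-1}^T f]^T
    G = W.T @ W                           # Gram matrix, invertible by assumption
    Ginv_F = np.linalg.solve(G, F)        # (W^T W)^{-1} F_W
    denom = np.sqrt(F @ Ginv_F)
    alpha = Ginv_F / denom                # eq. (8); makes q_j a unit vector
    q = W @ alpha                         # the composed feature q_j
    cos_theta = (q @ f) / (np.linalg.norm(q) * np.linalg.norm(f))
    return alpha, cos_theta
```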
3.3 Proper W Searching Algorithm
Finding a proper set W is an optimization problem described as follows:

min_W θ = min_W f(W) for a given k,  s.t. W ∈ D    (9)
D is the configuration space. Each configuration is a set of k Haar-like features chosen from W. Because the number of Haar-like features, denoted |W|, is finite, the number of configurations in D, denoted |D|, is also finite. However, the problem is computationally hard because |D| = C(|W|, k), k ≪ d. In order to find a proper set of Haar-like features, a Simulated Annealing algorithm is implemented. We take θ as the energy function: if the current θ_i is larger than the θ_j calculated for configuration j through (5), i is transited into j; otherwise, the transition occurs with some probability. From the current configuration i to the next one j, i, j ∈ D, we only exchange one Haar-like feature of i, with transition probability

p_{ij} = G_{ij}(t) A_{ij}(t),  ∀j ∈ D    (10)

where t is the temperature, and G_{ij}(t) and A_{ij}(t) are the generation and acceptance probabilities, respectively:

G_{ij}(t) = { 1/|N(i)|,  if j ∈ N(i);   0,  if j ∉ N(i) }    (11)

A_{ij}(t) = { 1,  if f(i) ≥ f(j);   exp(−(f(j) − f(i))/t),  if f(i) < f(j) }    (12)
N(i) is the neighborhood of configuration i: one randomly selected feature in i can be exchanged with any other feature in W that is linearly independent of the remaining features. Thus, the number of neighbors of i is at most k|W| ≪ |D|. In this way, the problem becomes feasible to solve. The Proper W Searching Algorithm is given in Fig. 3. Research in feature extraction has lasted for several decades. Thanks to these efforts, PCA, LDA, Gabor features or other possible linear features can be used here as f_j.
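A compact sketch (ours) of the Proper W Searching Algorithm: one randomly chosen base feature is swapped at each step and the move is accepted according to (12), with θ from (5) as the energy; `compose` is the sketch given after (8), `haar_bank` is a hypothetical matrix of candidate Haar-like feature vectors, and the cooling schedule and parameters are our own choices.

```python
import numpy as np

def search_W(f, haar_bank, k, t0=1.0, cooling=0.95, n_iter=2000, seed=0):
    """Simulated-annealing search for a proper set W of k Haar-like features."""
    rng = np.random.default_rng(seed)
    n = haar_bank.shape[1]
    idx = list(rng.choice(n, size=k, replace=False))   # initial configuration

    def energy(ids):
        _, cos_t = compose(f, haar_bank[:, ids])
        return np.arccos(np.clip(cos_t, -1.0, 1.0))    # theta from (5)

    e = energy(idx)
    t = t0
    for _ in range(n_iter):
        cand = idx.copy()
        pos = int(rng.integers(k))                     # feature of i to exchange
        repl = int(rng.choice([c for c in range(n) if c not in idx]))
        cand[pos] = repl
        try:
            e_new = energy(cand)                       # fails if features dependent
        except np.linalg.LinAlgError:
            t *= cooling
            continue
        # acceptance rule (12): always accept a lower energy, otherwise with prob.
        if e_new <= e or rng.random() < np.exp(-(e_new - e) / t):
            idx, e = cand, e_new
        t *= cooling
    return idx, e
```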
Fig. 3. Proper Searching Algorithm
With this construction, complicated strong features are composed of several simple Haar-like features. As a result, the computation of strong feature values is faster, with an insignificant loss in accuracy. The composed features can then be used in AdaBoost for real-time applications.
4 Experiment
A total of 10,000 faces were collected from various sources and categorized into 5 views, [-90, -60], [-60, -15], [-15, 15], [15, 60] and [60, 90] (views 1–5), with 2,000 faces per view. Each face example's size is 16*16. We adopt DPAA ([12], [8]) as the training structure under the AdaBoost scheme. The ratio of the number of positive examples to the number of negative examples is 1.0 for each layer. We ran our experiments on a P4 3.0 GHz computer with 512 MB RAM.
Fig. 4. Left: face data in view 4; Center: extracted PCA features of group 4; Right: approximated PCA features
Fig. 5. Distributions of feature values. Top-Left: PCA feature. Top-Right: PCA-CF. Bottom-Left: 1st selected Haar-like feature. Bottom-Right: 5th selected Haar-like feature.
In our experiments, we only construct PCA features. We extract PCA features from 9 groups, each of which may include the data of one or more views. Groups 1–5 include the data of the 5 individual views, respectively; group 6 includes views 1 and 2; group 7 includes views 2, 3 and 4; group 8 includes views 4 and 5; group 9 includes all views. We choose the first 100 PCA features from each group to be approximated. Fig. 4 shows the PCA features and the PCA-CFs (PCA Composed Features), each of which is composed of 15 Haar-like features, for view [30, 60]. The angles between the PCA features and the PCA-CFs are less than 20° (cos 20° ≈ 0.939693). In Fig. 5, the top row shows that in layer 7 the distribution of feature values is similar for the PCA feature and the PCA-CF. Their error rates are 26.75% and 27.58%, respectively. Obviously, although a PCA-CF is not an exact PCA feature, their discrimination abilities are similar. In layer 7, it is difficult to separate faces from non-faces. The bottom row shows the distributions for the 1st and 5th selected Haar-like features, with errors of 30.17% and 38.17%. They are worse than the PCA-CF.
Fig. 6. Left: Training Error Curve; Right: Margin distribution
Fig. 7. Some results on CMU profile test set
Fig. 8. ROC comparison on CMU profile test set
We compare the ROC curve of our approach with those of Viola and Jones [5] and Schneiderman [6], which report results on the CMU profile data set; the curves are shown in Fig. 8. Our approach performs best among them while running in real time. (The results in [8] were reported on an unspecified dataset; [9] and [10] report results on other datasets; [11] only gives the false alarm rate instead of the number of false alarms.)
5 Conclusion
In this paper, we propose a theoretical approach to constructing strong linear features that improves both generalization ability and efficiency. Haar-like features are too weak to discriminate the classes, which results in serious overfitting; strong features offer high discriminative ability but are expensive to compute. The composed features proposed in this paper inherit the advantages of both Haar-like features and strong features. Using the Proper W Searching Algorithm, composed features can be constructed that approximate strong features, so that both efficiency and better generalization ability are achieved. Experiments show that our method outperforms those of Viola and Jones, Schneiderman, and Levi and Weiss, and that a real-time AdaBoost system can be built on composed features. Our approach can be extended to construct any linear strong feature, and its application is not limited to AdaBoost: it can be used wherever fast computation of strong features is needed.
Acknowledgements. This work is supported by the National Science Foundation of China under grants No. 60673189 and No. 60433030.
References
1. Freund, Y., Schapire, R.E.: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences 55, 119–139 (1997)
2. Schapire, R., Freund, Y., Bartlett, P., Lee, W.S.: Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods. The Annals of Statistics 26(5), 1651–1686 (1998)
3. Schapire, R., Singer, Y.: Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning 37, 297–336 (1999)
4. Viola, P., Jones, M.: Rapid Object Detection Using a Boosted Cascade of Simple Features. In: IEEE Conf. Computer Vision and Pattern Recognition, IEEE Computer Society Press, Los Alamitos (2001)
5. Jones, M., Viola, P.: Fast Multi-view Face Detection. In: TR (2003)
6. Schneiderman, H., Kanade, T.: A statistical method for 3D object detection applied to faces and cars. In: IEEE Conf. Computer Vision and Pattern Recognition, IEEE Computer Society Press, Los Alamitos (2000)
7. Huang, C., Ai, H., et al.: Vector Boosting for Rotation Invariant Multi-view Face Detection. In: IEEE International Conf. on Computer Vision (2005)
8. Wang, Y., Liu, Y., Tao, L., Xu, G.: Real-Time Multi-View Face Detection and Pose Estimation in Video Stream. In: IEEE International Conf. on Pattern Recognition (2006)
9. Zhang, D., Li, S.Z., et al.: Real-Time Face Detection Using Boosting in Hierarchical Feature Spaces. In: IEEE International Conf. on Pattern Recognition, IEEE Computer Society Press, Los Alamitos (2004)
10. Yang, P., Shan, S., Zhang, D., et al.: Face Recognition Using Ada-Boosted Gabor Features. In: IEEE International Conf. on Automatic Face and Gesture Recognition (2004)
11. Lienhart, R., Maydt, J.: An Extended Set of Haar-like Features for Rapid Object Detection. In: IEEE International Conf. on Image Processing (2002)
12. Li, S.Z., Zhu, L., et al.: Statistical Learning of Multi-View Face Detection. In: Tistarelli, M., Bigun, J., Jain, A.K. (eds.) ECCV 2002. LNCS, vol. 2359, Springer, Heidelberg (2002)
13. Fleuret, F.: Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research 5, 1531–1555 (2004)
14. Levi, K., Weiss, Y.: Learning object detection from a small number of examples: the importance of good features. In: IEEE Conf. Computer Vision and Pattern Recognition, IEEE Computer Society Press, Los Alamitos (2004)
Robust Foreground Extraction Technique Using Gaussian Family Model and Multiple Thresholds

Hansung Kim1, Ryuuki Sakamoto1, Itaru Kitahara1,2, Tomoji Toriyama1, and Kiyoshi Kogure1
1 Knowledge Science Lab, ATR, Kyoto, Japan
2 Dept. of Intelligent Interaction Technologies, Univ. of Tsukuba, Japan
{hskim,skmt,toriyama,kogure}@atr.jp, [email protected]
Abstract. We propose a robust method to extract silhouettes of foreground objects from color video sequences. To cope with various changes in the background, the background is modeled with generalized Gaussian family (GGF) distributions and updated by a selective running average and static pixel observation. All pixels in the input video image are classified into four initial regions using background subtraction with multiple thresholds, after which shadow regions are eliminated using color components. The final foreground silhouette is extracted by refining the initial region using morphological processes. Experiments verify that the proposed algorithm works very well in various background and foreground situations. Keywords: Foreground segmentation, Silhouette extraction, Background subtraction, Generalized Gaussian Family model.
1 Introduction
The background subtraction technique is one of the most common approaches for extracting foreground objects from video sequences [1,2]. This technique subtracts the current image from a static background image acquired in advance from multiple images over a period of time. Since it works very quickly and distinguishes semantic object regions from static backgrounds, it has been used for years in many vision systems such as video surveillance, teleconferencing, video editing, and human-computer interfaces. Conventional approaches assume that the background is static; therefore, they cannot adapt to changes in illumination or geometry [3,4,5]. Several algorithms have been developed to overcome this problem by modeling and updating the background statistics. They can be classified into two categories: parametric and non-parametric approaches. The parametric approaches assume a form of the background distribution in advance and estimate the parameters of the model. Earlier methods used a single Gaussian distribution to model the probability distribution of the pixel intensity [6,7]. Recently, the Gaussian mixture model has become the most representative
approach [8] and has been widely combined with Bayesian frameworks [9], color and gradient information [10], mean-shift analysis [11], and region information [12]. However, these approaches have high computational complexity and involve a trade-off in the learning rate [13]. Tuzel et al. proposed using 3D multivariate Gaussians instead of the Gaussian mixture model to improve computational efficiency [14]. The non-parametric approaches estimate density functions directly from sample data. Elgammal et al. used kernel density estimators (KDE) to adapt quickly to changes in the background [13], and several advanced approaches using KDE have been proposed [15,16]. However, these KDE-based approaches consume a lot of memory to maintain recent background statistics. Kim et al. proposed a codebook algorithm to construct a background model from long image sequences [17]. Other approaches represent various environmental conditions: [18,19] use Hidden Markov Models (HMMs) to switch background states according to observations, and [20,21] aim to segment objects in dynamic textured backgrounds such as water and waving trees. In this paper, we propose a robust foreground silhouette extraction algorithm for color video sequences. Each pixel of the background is modeled as a generalized Gaussian family (GGF) distribution. The form of the distribution is chosen within the family by calculating the kurtosis of the data, and the model is updated by two methods in order to adapt to changes in both illumination and geometry in the background. We classify the initial mask into four categories according to their reliability and refine them with color information. The final foreground silhouette is extracted by morphological processes.
2 Background Model
2.1 Modeling Background
Formerly, the variance of a pixel in a static scene over time was modeled with a Gaussian distribution N(μ, σ) because the image noise over time was modeled by a zero-mean Gaussian distribution N(0, σ) [4,5,6,7,8]. However, the latest digital video cameras provide clean and steady images with noise reduction. Moreover, in stable scenes such as indoor ones, pixel variations are smaller than in outdoor scenes due to less light dispersion and illumination change, and fewer of the small motions that occur frequently in nature. We extracted distributions of the deviation from the mean of each pixel in indoor and outdoor scenes over a short time interval, and then compared these distributions with two Gaussian family distributions: a Gaussian distribution and a Laplace distribution.

Table 1. Average difference of distributions

          Outdoor   Indoor
Gaussian  0.4165    0.0452
Laplace   0.4923    0.0161
Fig. 1. Intensity histograms of pixels in an image
Table 1 shows the average difference from each model within the range of 3σ. Clearly, the indoor scene is much closer to a Laplace model than to a Gaussian one. A more serious problem is that the distributions of different pixels over time show different shapes within the same image. Figure 1 shows intensity histograms of several pixels in an image together with their excess kurtosis g2. Excess kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution and is calculated as Eq. (1), where n is the number of samples and μ is their mean. The excess kurtosis of the Gaussian and Laplace distributions is 0 and 3, respectively. From Fig. 1 we can see that the background is hard to model with a Gaussian distribution alone.

$$g_2 = \frac{m_4}{\sigma^4} - 3 = \frac{n \sum_{i=1}^{n} (x_i - \mu)^4}{\left( \sum_{i=1}^{n} (x_i - \mu)^2 \right)^2} - 3 \qquad (1)$$

Therefore, we propose to use the GGF distributions to model the background in this research. The GGF model is defined as

$$p(x : \rho) = \frac{\rho\,\gamma}{2\,\Gamma(1/\rho)} \exp\left( -\gamma^{\rho}\, |x - \mu|^{\rho} \right) \qquad (2)$$

with

$$\gamma = \frac{1}{\sigma} \left( \frac{\Gamma(3/\rho)}{\Gamma(1/\rho)} \right)^{1/2},$$

where Γ(·) is the gamma function and σ² is the variance of the distribution.
In Eq. (2), ρ = 2 represents a Gaussian distribution while ρ = 1 represents a Laplace distribution. In this research, we restrict the GGF model to the Laplace and Gaussian cases. The model for each pixel of the background is decided by calculating the excess kurtosis over the first N frames. Optimized parameters of the models can be estimated by maximizing the likelihood of the observed data [22,23]. The background is modeled in two distinct parts: a luminance component and a color component. Normal RGB components are very sensitive to noise and changes in lighting conditions, so we use the luminance component for initial object segmentation. However, the luminance component may change drastically due to shadows of objects in the background and reflections from lighting in the foreground. We therefore construct a second background model with the color component of the image to remove false segmentation.
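A minimal sketch of the per-pixel model selection is given below (Python with NumPy). The decision threshold of 1.5, halfway between the Gaussian and Laplace excess-kurtosis values, is our assumption; the paper only states that the choice is based on the excess kurtosis of the first N frames.

```python
import numpy as np

def excess_kurtosis(samples):
    """Excess kurtosis g2 of Eq. (1); 0 for a Gaussian, 3 for a Laplace distribution."""
    x = np.asarray(samples, dtype=np.float64)
    d = x - x.mean()
    m2 = np.sum(d ** 2)
    m4 = np.sum(d ** 4)
    return len(x) * m4 / (m2 ** 2) - 3.0

def choose_pixel_model(luminance_history):
    """Pick the GGF member (rho) for one pixel from its first N frames.

    The 1.5 threshold is an assumption of this sketch, not the authors' stated rule.
    """
    g2 = excess_kurtosis(luminance_history)
    rho = 2.0 if g2 < 1.5 else 1.0   # rho = 2: Gaussian, rho = 1: Laplace
    return rho, float(np.mean(luminance_history)), float(np.var(luminance_history))
```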
The color component H is extracted from the HSI model as follows:

$$I = \max(R, G, B)$$

$$S = \begin{cases} (I - \min(R, G, B))/I & \text{if } I \neq 0 \\ 0 & \text{otherwise} \end{cases}$$

$$H = \begin{cases} (G - B) \times 60/S & \text{if } I = R \\ 180 + (B - R) \times 60/S & \text{if } I = G \\ 240 + (R - G) \times 60/S & \text{if } I = B \end{cases} \qquad (3)$$

$$\text{if } H < 0 \text{ then } H = H + 360$$
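For reference, a direct transcription of Eq. (3) into code might look like the following sketch (pure Python). The guard for S = 0 (achromatic pixels) is an added assumption not spelled out in the paper.

```python
def hue_component(r, g, b):
    """Color component H of Eq. (3) from the HSI model."""
    i = max(r, g, b)
    s = (i - min(r, g, b)) / i if i != 0 else 0.0
    if s == 0:            # achromatic pixel: hue undefined, return 0 by convention (assumption)
        return 0.0
    if i == r:
        h = (g - b) * 60.0 / s
    elif i == g:
        h = 180.0 + (b - r) * 60.0 / s
    else:                 # i == b
        h = 240.0 + (r - g) * 60.0 / s
    return h + 360.0 if h < 0 else h
```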
2.2 Updating the Background
The background model should be adapted to changes in the background statistics. There are two types of change in the background, with different characteristics: gradual changes due to lighting conditions, and sudden changes due to shifts in background geometry. To cope with gradual changes, we update the background model of each pixel using the running average of Eq. (4):

$$\mu_{t+1} = \alpha x_t + (1 - \alpha)\mu_t, \qquad \sigma^2_{t+1} = \alpha (x_t - \mu_t)^2 + (1 - \alpha)\sigma^2_t, \qquad \alpha = \begin{cases} 0 & \text{if } x_t \notin \text{background} \\ 0.05 & \text{if } x_t \in \text{background} \end{cases} \qquad (4)$$

However, this background update process cannot handle sudden and permanent changes in the background. For example, if an object in the background is moved and stays fixed at the new location for a long time, the system will detect both the new position and the old position as foreground objects permanently. Therefore, the background model is also updated using static pixel observation. If a region is determined to be a foreground region and assigned the same label by the labeling process described in Section 3, successive frame differences of the pixels in the region are observed. If the pixels have been stationary for the past Th_Bg frames, the old background models of the pixels in the region are replaced with new models. However, if there is any non-stationary area bigger than the smallest region size Th_RG, all observation processes in the region with the same label are reset to avoid partial disappearance of local stationary pixels in the foreground. That is, background models are updated per unit with the same label. In the experiment, Th_RG was set to 0.1% of the image size.
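A minimal sketch of the selective running average of Eq. (4) for a single pixel is shown below; the function and argument names are illustrative assumptions.

```python
ALPHA = 0.05  # learning rate for background pixels, from Eq. (4)

def update_running_average(mu, var, x, is_background):
    """Selective running average of Eq. (4) for one pixel.

    mu, var : current background mean and variance of the pixel
    x       : current luminance observation
    is_background : True if the pixel was classified as background in this frame
    """
    a = ALPHA if is_background else 0.0
    new_mu = a * x + (1.0 - a) * mu
    new_var = a * (x - mu) ** 2 + (1.0 - a) * var
    return new_mu, new_var
```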
3 Foreground Extraction
Based on the constructed background models, the silhouettes of foreground objects are extracted from the video sequences. First, initial region classification is performed by subtracting the luminance component of the current frame from the background model. A fixed single
threshold, however, can lead to serious over-segmentation or under-segmentation errors in ambiguous regions such as shadows or foreground regions whose brightness is similar to the background. Therefore, we classify the initial object region into four categories using multiple thresholds based on their reliability, as in Eq. (5). L_I and L_B denote the luminance components of the current frame and the background model, respectively, and σ is the standard deviation of the background model.

$$BD(p) = |L_I(p) - L_B(p)|$$

$$\begin{cases} BD(p) < K_1\sigma(p) & \Rightarrow \text{(a) Reliable Background} \\ K_1\sigma(p) \le BD(p) \le K_2\sigma(p) & \Rightarrow \text{(b) Suspicious Background} \\ K_2\sigma(p) \le BD(p) \le K_3\sigma(p) & \Rightarrow \text{(c) Suspicious Foreground} \\ K_3\sigma(p) \le BD(p) & \Rightarrow \text{(d) Reliable Foreground} \end{cases} \qquad (5)$$
Thresholds K1 ∼ K3 used in Eq. (5) were determined by training data. We used around 100 images with ground-truth foreground masks taken from different environments. The following condition is used to decide the parameters, where β was set to 3 because false negative errors are generally more critical than false positive errors in foreground extraction.
$$(K_1, K_2, K_3) = \arg\min_{K_1, K_2, K_3} \left( \beta \times \mathit{FalseNegativeError} + \mathit{FalsePositiveError} \right) \qquad (6)$$

However, a large amount of background can be assimilated into the suspicious foreground region, caused by the shadow of an object changing the background brightness. We eliminate shadows from the initial object region by using the color component, because a shadow does not change the color property of the background but only its brightness. With Eq. (7), shadows in the suspicious foreground region are merged into the suspicious background region. H denotes the color component of the image, and σ_H is the standard deviation of the color component in the background model.
$$\text{if } p \in \text{region(c)} \;\text{and}\; |H_I(p) - H_B(p)| < K_1 \sigma_H(p) \;\text{then}\; p \Rightarrow \text{region(b)} \qquad (7)$$

In the labeling step, all foreground regions (c) and (d) of Eq. (5) are labeled with their own identification numbers. All connected foreground pixels under the 8-neighbor rule are assigned the same label using a region-growing technique [24]. However, there are also small noise regions in the initial object regions. A conventional way of eliminating noise regions is to use a morphological operation to filter out small regions; therefore, we refine the initial mask by closing and opening operations [24]. Then, we sort and relabel all labeled regions in descending order of size. In the relabeling process, regions smaller than Th_RG are eliminated. Finally, we use a silhouette extraction technique, which is an improvement of Kumar's profile extraction technique [3], to smooth the boundaries of the foreground and eliminate holes inside the regions.
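Putting Eq. (5) and Eq. (7) together, the initial classification and shadow elimination can be sketched as follows (NumPy, whole-frame arrays). The integer region codes are our own encoding, and hue wrap-around is ignored for brevity.

```python
import numpy as np

# Region codes (our own encoding): 0 = reliable background, 1 = suspicious background,
# 2 = suspicious foreground, 3 = reliable foreground.
def classify_regions(L_I, L_B, sigma, H_I, H_B, sigma_H, K1, K2, K3):
    """Initial classification of Eq. (5) followed by the shadow merge of Eq. (7)."""
    BD = np.abs(L_I - L_B)
    region = np.zeros(BD.shape, dtype=np.uint8)
    region[BD >= K1 * sigma] = 1
    region[BD >= K2 * sigma] = 2
    region[BD >= K3 * sigma] = 3
    # Eq. (7): suspicious foreground pixels whose hue matches the background are shadows
    # (circular hue distance is ignored here for brevity).
    shadow = (region == 2) & (np.abs(H_I - H_B) < K1 * sigma_H)
    region[shadow] = 1
    return region
```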
Fig. 2. Segmentation results at each step: (a) original image; (b) classification; (c) shadow elimination; (d) labeling; (e) silhouette extraction; (f) final result
A weighted one-pixel-thick drape is moved from one side to the opposite side. The pixels adjacent to the drape are connected by an elastic spring that covers the object without infiltrating gaps whose widths are smaller than a threshold M. This process is performed from all four sides, and the region wrapped by the four drapes denotes the final foreground region. We apply the profile extraction technique independently to each labeled region to avoid errors between multiple foreground objects. However, the silhouette extraction algorithm also covers real holes inside the object; we therefore perform region growing from reliable background regions inside the silhouette if such a region is bigger than the small-region threshold Th_RG. Figure 2 shows a test image and the results of silhouette extraction at each step.
4 Experimental Results
We applied the proposed algorithm to various video streams, including indoor and outdoor scenes taken with an IEEE-1394 camera and a normal camcorder. The IEEE-1394 camera provides 1024 × 768 RGB video streams, and the camcorder provides 720 × 480 interlaced DV streams. We ran the algorithm on a PC with a Pentium IV 3.2-GHz CPU, 1.0 GByte of memory, and Visual C++ on a Windows XP operating system. The parameters used in the simulation were selected experimentally as Th_BG = 100 for stationary objects in updating the background model and M = 12 for the maximum gap width in silhouette extraction. We set Th_BG very short to show the effect of the background update within a short time, but it should be much longer in real applications. Figure 3 shows the segmentation results for various scenes: in each pair, the left image shows the captured image and the right image shows the extracted foreground. Figure 4 shows objective evaluations of the proposed algorithm. We randomly selected 14 frames from seven different scenes (i.e., 98 images in total) and created ground-truth segmentation masks by manual segmentation.
Fig. 3. Results of foreground extraction
Fig. 4. Segmentation errors relative to ground truth (%): (a) false negative (FN) error; (b) false positive (FP) error
Then we compared the segmentation error of the proposed algorithm with a Gaussian-based algorithm with a single threshold [7] and a KDE-based algorithm [13]. We applied the same morphological processes in all experiments to isolate the effect of the proposed model. We compared the results by calculating the percentage of erroneous pixels as in Eq. (8):

$$\mathit{error} = \frac{\text{Number of erroneous pixels}}{\text{Number of real foreground pixels}} \times 100\ (\%) \qquad (8)$$
In Fig. 4, FN denotes the false negative error, whereby a foreground region is falsely classified as background, and FP denotes the false positive error, whereby background is falsely classified as foreground. In all results, FP errors are much larger than FN errors due to blurring from fast motion and errors around object boundaries.
Table 2. Processing speed analysis (msec)

Stage                     Time
Background subtraction      15
Shadow elimination          46
Object labeling             16
Silhouette extraction      250
Background update           15
Total                      342
Generally, FN error is more uncomfortable to the eye and less acceptable to many vision systems than FP. The average error rates of the proposed algorithm are lower than those of the conventional methods in most scenes. Table 2 shows a runtime analysis of the proposed system. The times listed are the average processing times when one person is in the scene; the resolution of the video is 1024 × 768. Considering the image resolution, the processing speed is quite high. In order to evaluate the effect of the background update algorithm, we created an artificial environment in which the lighting condition changes gradually over a short time. Some rigid objects in the background are also shifted to different positions by an actor. In this experiment, we assumed that a rigid object attached to an actor is a foreground object, but it is considered background once it is separated from the actor.
Fig. 5. Results of background update: (a) original scene; (b) without background update; (c) with background update
Fig. 6. Errors relative to ground truth for the sequence in Fig. 5 (%)
Figure 5 shows snapshots of the results of this experiment. We also manually created ground-truth foreground masks for every third frame of the 1200-frame sequence and plotted the error rate of the segmentation results in Fig. 6. In this graph, the error rates were calculated as a percentage of errors not against the foreground size, as in Fig. 4, but against the image size, because the error rate diverges under background changes when there is no real object in the scene. The curve with background update in Fig. 6 shows that the error rate increased temporarily when objects parted from the actors but soon became low again, and that the change in brightness in the room hardly affected the error rate.
5 Conclusion
In this paper, we proposed a powerful silhouette extraction algorithm that is robust against variations in the background. The background is modeled with GGF distributions and updated by selective running averages and static pixel observation, while the foreground is segmented using multiple thresholds and morphological processes. Experimental results indicate that the proposed algorithm works very well in various background and foreground situations. Future work on this topic will take two main directions. First, although the proposed algorithm is fast, it does not run in real time on XGA image sequences; real-time performance could be achieved by using hardware accelerators such as a Graphics Processing Unit (GPU) and by further optimizing the implementation. Second, the proposed method can cope with gradual or long-term changes in the background but not with high-frequency repetitive changes such as flickering monitors or branches shaking in the wind. We plan to develop a multi-modal GGF model to overcome this problem.
Acknowledgements This research was supported by the National Institute of Information and Communications Technology.
References
1. Gelasca, E.D., Ebrahimi, T., Karaman, M., Sikora, T.: A Framework for Evaluating Video Object Segmentation Algorithms. In: Proc. CVPR Workshop, pp. 198–198 (2006)
2. Piccardi, M.: Background subtraction techniques: a review. In: Proc. IEEE SMC, vol. 4, pp. 3099–3104 (2004)
3. Kumar, P., Sengupta, K., Ranganath, S.: Real time detection and recognition of human profiles using inexpensive desktop cameras. In: Proc. ICPR, pp. 1096–1099 (2000)
4. Jabri, S., Duric, Z., Rosenfeld, A., Wechsler, H.: Detection and location of people in video images using adaptive fusion of color and edge information. In: Proc. ICPR, pp. 627–630 (2000)
5. Horprasert, T., Harwood, D., Davis, L.S.: A robust background subtraction and shadow detection. In: Proc. ACCV (2000)
6. McKenna, S.J., Jabri, S., Duric, Z., Rosenfeld, A., Wechsler, H.: Tracking Groups of People. Computer Vision and Image Understanding 80(1), 42–56 (2000)
7. Wren, C., Azarbayejani, A., Darrell, T., Pentland, A.P.: Pfinder: Real-Time Tracking of the Human Body. IEEE Trans. PAMI 19(7), 780–785 (1997)
8. Stauffer, C., Grimson, W.E.L.: Adaptive background mixture models for real-time tracking. In: Proc. CVPR, pp. 246–252 (1999)
9. Lee, D.S., Hull, J.J., Erol, B.: A Bayesian framework for Gaussian mixture background modeling. In: Proc. ICIP, vol. 3, pp. 973–976 (2003)
10. Javed, O., Shafique, K., Shah, M.: A hierarchical approach to robust background subtraction using color and gradient information. In: Proc. IEEE Motion and Video Computing, pp. 22–27. IEEE Computer Society Press, Los Alamitos (2002)
11. Porikli, F., Tuzel, O.: Human body tracking by adaptive background models and mean-shift analysis. In: Proc. PETS-ICVS (2003)
12. Cristani, M., Bicego, M., Murino, V.: Integrated region- and pixel-based approach to background modeling. In: Proc. IEEE MVC, pp. 3–8. IEEE Computer Society Press, Los Alamitos (2002)
13. Elgammal, A., Harwood, D., Davis, L.S.: Non-parametric model for background subtraction. In: Proc. ECCV, vol. 2, pp. 751–767 (2000)
14. Tuzel, O., Porikli, F., Meer, P.: A Bayesian Approach to Background Modeling. In: Proc. IEEE MVIV, vol. 3, pp. 58–63 (2005)
15. Han, B., Comaniciu, D., Davis, L.: Sequential kernel density approximation through mode propagation: applications to background modeling. In: Proc. ACCV (2004)
16. Mittal, A., Paragios, N.: Motion-based background subtraction using adaptive kernel density estimation. In: Proc. CVPR, pp. 302–309 (2004)
17. Kim, K., Chalidabhongse, T.H., Harwood, D., Davis, L.S.: Real-time foreground-background segmentation using codebook model. Real-Time Imaging 11, 172–185 (2005)
18. Wang, D., Feng, T., Shum, H., Ma, S.: Novel probability model for background maintenance and subtraction. In: Proc. ICVI (2002)
19. Stenger, B., Ramesh, V., Paragios, N., Coetzee, F., Buhmann, J.: Topology free hidden Markov models: Application to background modeling. In: Proc. ICCV, pp. 294–301 (2001)
20. Zhong, J., Sclaroff, S.: Segmenting foreground objects from a dynamic textured background via a robust Kalman filter. In: Proc. ICCV, pp. 44–50 (2003)
21. Monnet, A., Mittal, A., Paragios, N., Ramesh, V.: Background modeling and subtraction of dynamic scenes. In: Proc. ICCV, pp. 1305–1312 (2003)
22. Lee, J.Y., Nandi, A.K.: Maximum Likelihood Parameter Estimation of the Asymmetric Generalized Gaussian Family of Distribution. In: Proc. SPW-HOS (1999)
23. Kotz, S., Kozubowski, T.J., Podgorski, K.: Maximum likelihood estimation of asymmetric Laplace parameters. Ann. Inst. Statist. Math. 54, 816–826 (2002)
24. Shapiro, L.G., Stockman, G.C.: Computer Vision. Prentice Hall, New Jersey (2001)
Feature Management for Efficient Camera Tracking

Harald Wuest1,2, Alain Pagani2, and Didier Stricker2
1 Centre for Advanced Media Technology (CAMTech), Nanyang Technological University (NTU), 50 Nanyang Avenue, Singapore 649812
2 Department of Virtual and Augmented Reality, Fraunhofer IGD, TU Darmstadt, GRIS, Germany
[email protected]
Abstract. In dynamic scenes with occluding objects, many features need to be tracked for robust real-time camera pose estimation. An open problem is that tracking too many features has a negative effect on the real-time capability of a tracking approach. This paper proposes a feature management method that performs a statistical analysis of the ability to track a feature and then uses only those features which are very likely to be tracked from the current camera position. A large set of features at different scales is created, where every feature holds a probability distribution of camera positions from which the feature can be tracked successfully. As only the feature points with the highest probability are used in the tracking step, the method can handle a large number of features at different scales without losing real-time performance. Both the statistical analysis and the reconstruction of the features' 3D coordinates are performed online during tracking, and no preprocessing step is needed.
1 Introduction
Tracking point-based features is a widely used technique for camera pose estimation. Either reference features are taken from pre-calibrated images with a given 3D model [1,2], or the feature points are reconstructed online during tracking [3,4,5]. These approaches are very promising if the feature points are located on well-textured planar regions. However, in industrial scenarios objects often consist of reflecting materials and poorly textured surfaces. Because of spotlights or occluding objects, the range of camera positions from which a feature point has the same visual appearance can be very limited. Increasing the number of features can help to ensure a robust camera pose estimation, but as the 2D feature tracking step accounts for a large share of the computation time, the overall tracking performance becomes very poor. Using only a subset of those features which are visible from a given viewpoint can avoid this problem. Najafi et al. [1] present a statistical analysis of the appearance and shape of features from possible viewpoints. In an offline training phase they coarsely
sample the viewing space at discrete camera positions and create clusters of viewpoints for every model feature according to similar feature descriptors. Thereby a map is created which, for every feature, gives information about the detection repeatability, accuracy, and visibility from different viewpoints. During the online phase this information is used to select good features. In this paper we present a feature management method that does not rely on any preprocessing but performs an online estimation of the tracking probability of every feature. The ability to track a feature is observed at runtime, and distributions of the camera positions of tracking successes and tracking failures are created. These distributions are represented by mixture models with a constant number of Gaussians; a merge operation is used to keep the number of Gaussians fixed. The resulting tracking probability, which models not only the visibility but also the robustness of a feature, is then used to decide which features are most suitable to be tracked at a given camera position. Robust camera pose estimation is solved by using Levenberg-Marquardt minimization and RANSAC outlier rejection.
2 Feature Tracking and Reconstruction
For robust reconstruction and pose estimation, a feature point must be tracked as long as possible; therefore it should be invariant to deformations, illumination, and scale. The well-known Shi-Tomasi-Kanade tracker is a widely used technique for tracking 2D feature points [6]. It is based on the iterative minimization of the sum of squared differences with a gradient descent method. In [7], illumination compensation was added to the minimization procedure. The problem of updating a template patch was addressed in [8]. Another promising approach for reliable 2D feature tracking was presented by Zinßer et al. [9], where a brightness-corrected, affinely warped template patch is used to track a feature point. They proposed a two-stage approach where pure translation from frame to frame is estimated first on several levels of the image pyramid, and then the template patch is iteratively aligned at the resulting image position of the first stage. The alignment of the patch T in the image I is based on minimizing the following squared intensity difference:

$$\epsilon = \sum_{x} \left( I(x) - \left( \lambda\, T(g_{\alpha}(x)) + \delta \right) \right)^2 \qquad (1)$$

where λ and δ are the parameters for adjusting the contrast and the brightness, and g_α is the affine transformation function. We extend this method by extracting a template patch at different resolution levels of the image pyramid and always selecting the patch whose resolution is most similar to that of the predicted affine-transformed patch. If the desired resolution of the patch does not exist, it is extracted from the current image after a successful tracking step. A feature is regarded as tracked successfully if the iterations of the alignment converge and the error of Eq. (1) is smaller than a given threshold. Successfully tracked features are reconstructed by triangulation and further refined by an Extended Kalman Filter. More details can be found in [5].
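A minimal sketch of the residual of Eq. (1) follows (NumPy). The function and argument names, and the bilinear sampling helper `sample`, are illustrative assumptions; the summation over the patch pixels is how we read the (garbled) original equation.

```python
import numpy as np

def alignment_error(image_patch, template, warp_a, warp_b, lam, delta, sample):
    """Squared intensity difference of Eq. (1) for one patch position.

    image_patch : intensities I(x) at the patch pixels x (2D array)
    g_alpha(x) = A x + b maps patch coordinates to template coordinates;
    warp_a is the 2x2 matrix A, warp_b the 2-vector b.
    `sample(template, pts)` is an assumed bilinear sampling helper that returns
    template intensities at sub-pixel positions.
    """
    h, w = image_patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
    warped = pts @ warp_a.T + warp_b                     # g_alpha(x) for every patch pixel
    residual = image_patch.ravel() - (lam * sample(template, warped) + delta)
    return float(np.sum(residual ** 2))
```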
3 Feature Management
The functions of the feature management are the extraction of new features, the estimation of the feature tracking probability, the selection of good features for a given camera position, and the removal of features which are of no further use for tracking. The whole management shall be an incremental process which runs in real time and uses only a limited amount of memory. The tracking probability of a feature denotes the probability that the feature can be tracked successfully at a given camera position. The sequential estimation of this probability is described in the following section.
3.1 Tracking Probability
As rotation around the camera center does not have any influence on the visibility of a point feature (as long as the feature is located inside the image), only the position of the camera in world coordinates is regarded as useful information to decide whether a feature is worth tracking. What is known about the ability to track a feature at a given camera position are the observations of its tracking success in previous frames. The problem of modeling a probability distribution p(x) of a random variable x, given a finite set x₁, ..., x_N of observations, is known as density estimation. A widely used non-parametric method for creating probability distributions are kernel density estimators. To obtain a smooth density model we choose a Gaussian kernel function. For a D-dimensional vector x the probability density can be written as

$$p(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi\sigma^2)^{D/2}} \exp\left( -\frac{\|x - x_n\|^2}{2\sigma^2} \right) \qquad (2)$$

where N is the number of observation points x_n, and σ represents the variance of the Gaussian kernel function in one dimension. Every observation of a feature belongs to one element of the class C = {s, f}, which simply records whether the tracking step was successful (s) or failed (f). The probability density of the camera position is estimated for every element of the class C separately. Let p(x|C = s) be the conditional probability density of the camera position for successfully tracked features and p(x|C = f) the conditional density for unsuccessfully tracked features. The marginal probability of tracking successes is given by p(C = s) = N_s/N and that of tracking failures by p(C = f) = N_f/N, where N_s and N_f are the numbers of successful and unsuccessful tracking steps respectively, and N is the total number of observations. The probability p_t(x) that a feature can be tracked at a given camera position x is estimated as

$$p_t(x) = p(C = s \mid x) \qquad (3)$$
Applying Bayes' theorem, the tracking probability can be written as

$$p_t(x) = \frac{p(x|C{=}s)\,p(C{=}s)}{p(x)} = \frac{p(x|C{=}s)\,p(C{=}s)}{p(x|C{=}s)\,p(C{=}s) + p(x|C{=}f)\,p(C{=}f)} = \frac{p(x|C{=}s)\,N_s}{p(x|C{=}s)\,N_s + p(x|C{=}f)\,N_f} \qquad (4)$$
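To make Eqs. (2)–(4) concrete, a small sketch using per-class Gaussian kernel density estimates is given below (NumPy). Storing all observations is exactly the memory problem addressed next with a fixed-size mixture, so this is only the naive baseline; the default σ = 5 echoes the example value given later for centimetre world coordinates.

```python
import numpy as np

def gaussian_kde(x, observations, sigma):
    """Kernel density estimate of Eq. (2) at camera position x.

    observations : (N, D) array of camera positions of previous observations.
    """
    obs = np.asarray(observations, dtype=np.float64)
    d = obs.shape[1]
    sq_dist = np.sum((obs - x) ** 2, axis=1)
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    return np.mean(np.exp(-sq_dist / (2.0 * sigma ** 2)) / norm)

def tracking_probability(x, successes, failures, sigma=5.0):
    """p_t(x) of Eq. (4) from the success/failure camera positions."""
    ns, nf = len(successes), len(failures)
    ps = gaussian_kde(x, successes, sigma) * ns if ns else 0.0
    pf = gaussian_kde(x, failures, sigma) * nf if nf else 0.0
    return ps / (ps + pf) if (ps + pf) > 0 else 0.0
```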
The estimation of probability densities using Eq. (2), however, has the major drawback that the complexity of storage and computation grows linearly with the number of observations, which is not feasible for an online application. Our approach to density estimation is therefore based on a finite set of Gaussian mixtures. The use of mixture models for efficient clustering of huge data sets has already been addressed. In [10] the Iterative Pairwise Replacement Algorithm (IPRA) is proposed, a computationally efficient method for conditional density estimation on very large data sets where kernel estimates are approximated by much smaller mixtures. Goldberger [11] uses a hierarchical approach to reduce large Gaussian mixtures to smaller mixtures by minimizing a KL-based distance between them. Zhang [12] presents another efficient approach for simplifying mixture models by using an L2 norm as the distance measure between the mixtures. Zivkovic [13] presents a recursive solution for estimating the parameters of a mixture with simultaneous selection of the number of components. We use a method similar to [10], but instead of clustering a large data set we use it for online density estimation with a finite mixture model. A mixture with a finite number of Gaussians is maintained for both the successfully and the unsuccessfully tracked features. Consider the multivariate Gaussian mixture distribution of the successfully tracked features, which can be written as

$$p(x|C{=}s) = \sum_{k=1}^{K} \omega_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \quad \text{with} \quad \sum_{k=1}^{K} \omega_k = 1 \qquad (5)$$
where μ_k is the D-dimensional mean vector and Σ_k the D×D covariance matrix. The mixing coefficient ω_k = N_k/N_s records how many observations N_k have contributed to Gaussian k. The probability distribution p(x|C = f) is defined in the same way. Together with Eq. (4), the tracking probability for a given camera position can then be estimated. The mixture model is built and maintained as follows. Depending on the tracking success, an observation is assigned to a class C, which means that either the distribution p(x|C = s) or the distribution p(x|C = f) is updated. First, for every observation a Gaussian kernel function is created, where every kernel can be regarded as a Gaussian of the mixture model. If the maximum number of mixture components K is reached, the two most similar Gaussians are merged and a new Gaussian is created by taking the kernel function of the incoming observation.
3.2 Similarity Measure
A similarity matrix is maintained which stores the pairwise similarity of all Gaussians. Scott [10] defined the similarity measure between two density functions p₁ and p₂ as

$$\mathrm{sim}(p_1, p_2) = \frac{\int_{-\infty}^{\infty} p_1(x)\,p_2(x)\,dx}{\left( \int_{-\infty}^{\infty} p_1^2(x)\,dx \int_{-\infty}^{\infty} p_2^2(x)\,dx \right)^{1/2}} \qquad (6)$$

Equation (6) can be considered as a correlation between the two densities. If p₁(x) = N(x|μ₁, Σ₁) and p₂(x) = N(x|μ₂, Σ₂) are normal distributions, the similarity measure can be calculated as

$$\mathrm{sim}(p_1, p_2) = \frac{\left( 2^D |\Sigma_1 \Sigma_2|^{1/2} \right)^{1/2}}{|\Sigma_1 + \Sigma_2|^{1/2}} \exp(\Delta) \qquad (7)$$

with

$$\Delta = -\frac{1}{2} (\mu_1 - \mu_2)^T (\Sigma_1 + \Sigma_2)^{-1} (\mu_1 - \mu_2). \qquad (8)$$

This follows from the fact that

$$\int_{-\infty}^{\infty} \mathcal{N}(x|\mu_1, \Sigma_1)\,\mathcal{N}(x|\mu_2, \Sigma_2)\,dx = \mathcal{N}(0 \mid \mu_1 - \mu_2, \Sigma_1 + \Sigma_2). \qquad (9)$$
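A direct sketch of the closed form of Eqs. (7)–(8) follows (NumPy); numerical safeguards for near-singular covariances are omitted.

```python
import numpy as np

def gaussian_similarity(mu1, cov1, mu2, cov2):
    """Normalized correlation of two Gaussians, Eqs. (7)-(8)."""
    d = len(mu1)
    diff = np.asarray(mu1, dtype=np.float64) - np.asarray(mu2, dtype=np.float64)
    cov_sum = cov1 + cov2
    delta = -0.5 * diff @ np.linalg.solve(cov_sum, diff)          # Eq. (8)
    num = np.sqrt(2.0 ** d * np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return num / np.sqrt(np.linalg.det(cov_sum)) * np.exp(delta)  # Eq. (7)
```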
The two most similar Gaussians according to the similarity measure of Eq. (6) are used for the merging step, which is described in the next section.

3.3 Merging Gaussian Distributions
The merge operation of the two most similar Gaussians is carried out as follows. Assume that the i-th and the j-th components are merged into the i-th component of the mixture. Since a mixing coefficient represents the number of observations which affect a distribution, the new number of observations is N′_i = N_i + N_j, and therefore ω_i is updated by

$$\omega_i' = \omega_i + \omega_j. \qquad (10)$$

The mean of the new distribution can be calculated as

$$\mu_i' = \frac{1}{N_i'} \sum_{n=1}^{N_i'} x_n = \frac{1}{N_i'} \left( \sum_{n=1}^{N_i} x_n + \sum_{n=1}^{N_j} x_n \right) = \frac{1}{N_i'} \left( N_i \mu_i + N_j \mu_j \right) = \frac{1}{\omega_i'} \left( \omega_i \mu_i + \omega_j \mu_j \right) \qquad (11)$$
After the mean is computed, the covariance Σ_i can be updated as follows:

$$\Sigma_i' = \frac{1}{N_i'} \sum_{n=1}^{N_i'} (x_n - \mu_i')(x_n - \mu_i')^T = \frac{1}{N_i'} \sum_{n=1}^{N_i'} x_n x_n^T - \mu_i' \mu_i'^T = \frac{1}{N_i'} \left( \sum_{n=1}^{N_i} x_n x_n^T + \sum_{n=1}^{N_j} x_n x_n^T \right) - \mu_i' \mu_i'^T$$

$$= \frac{1}{N_i'} \left( N_i (\Sigma_i + \mu_i \mu_i^T) + N_j (\Sigma_j + \mu_j \mu_j^T) \right) - \mu_i' \mu_i'^T = \frac{1}{\omega_i'} \left( \omega_i (\Sigma_i + \mu_i \mu_i^T) + \omega_j (\Sigma_j + \mu_j \mu_j^T) \right) - \mu_i' \mu_i'^T. \qquad (12)$$
After the merge operation, the j-th component can be reused for a new observation to represent a new Gaussian. It can be regarded as a kernel estimate with a Gaussian kernel function. For a new observation, the camera position is assigned to x_j and the covariance is set to σ²I, where I is the identity matrix and σ determines the size of the Parzen window. The parameter σ affects the smoothness of the resulting mixture model and must be chosen with respect to the world coordinate system. If, for example, the camera position is given in cm, then σ = 5 yields a convincing probability distribution for indoor camera tracking. The weight ω_j is initialized with ω_j = 1/N_c, where N_c is the number of observations of the assigned class.
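A sketch of the merge of Eqs. (10)–(12) for two mixture components is given below (NumPy); representing a component as a (weight, mean, covariance) tuple is an assumption of this sketch.

```python
import numpy as np

def merge_components(w_i, mu_i, cov_i, w_j, mu_j, cov_j):
    """Merge component j into component i following Eqs. (10)-(12)."""
    w = w_i + w_j                                            # Eq. (10)
    mu = (w_i * mu_i + w_j * mu_j) / w                       # Eq. (11)
    second_moment = (w_i * (cov_i + np.outer(mu_i, mu_i)) +
                     w_j * (cov_j + np.outer(mu_j, mu_j))) / w
    cov = second_moment - np.outer(mu, mu)                   # Eq. (12)
    return w, mu, cov
```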
3.4 Feature Selection
Features which already have a precisely reconstructed 3D coordinate need no further reconstruction or refinement step. If such a feature is not very likely to be tracked from the current camera position, it is probably of no use for the pose estimation and can be disregarded in the tracking step. Features which do not have a valid 3D coordinate are always selected for the tracking step, because it is important that a feature point is triangulated quickly and an exact 3D position is reconstructed, so that the feature becomes beneficial for the camera pose estimation. Before the tracking step, all features which were not tracked successfully in the last frame are projected into the image with the last camera pose in order to provide a good starting position for the iterative alignment. The tracking probabilities of all features located inside the current image are calculated with Eq. (4), and the features are sorted by their probability in descending order. The feature tracking described in Section 2 is then applied to the sorted list of features until a minimum number of features has been tracked successfully. In our implementation we stop after 30 successfully tracked features with a valid 3D coordinate, which is sufficient for a robust pose estimation.
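The selection loop can be summarized by the following sketch; `track`, `has_valid_3d`, `tracking_probability`, and the feature records are illustrative assumptions, not the authors' data structures.

```python
def select_and_track(features, camera_position, track, min_tracked=30):
    """Sketch of the selection step: features without a 3D coordinate are always
    tracked; the rest are tracked in order of decreasing tracking probability
    until `min_tracked` features with a valid 3D coordinate succeed."""
    without_3d = [f for f in features if not f.has_valid_3d]
    with_3d = sorted((f for f in features if f.has_valid_3d),
                     key=lambda f: f.tracking_probability(camera_position),
                     reverse=True)
    for f in without_3d:
        track(f)                        # always attempted, so the point can be triangulated
    tracked = 0
    for f in with_3d:
        if track(f):                    # 2D alignment of Section 2
            tracked += 1
            if tracked >= min_tracked:
                break
    return tracked
```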
The benefit of this approach is that the total number of tracked features is kept to a minimum if most of the features are tracked successfully; but if there are many tracking failures due to occlusion or strong motion blur, as many features as needed are tracked until a robust camera pose estimation is possible.

3.5 Feature Extraction
Most point-based feature tracking methods use the well-known Harris corner detector [14], which is based on an eigenvalue analysis of the gradient structure of an image patch. Another simple but very efficient approach, called FAST (Features from Accelerated Segment Test), was presented by Rosten et al. [15]. Their method analyzes the intensity values on a circle of 16 pixels surrounding the corner point: if at least 12 contiguous pixels are all above or all below the intensity of the center by some threshold, the point is regarded as a corner feature. For reasons of efficiency we use the FAST feature detector in our implementation. To avoid too many features and overlapping patches, a new feature is only extracted if no other feature point exists within a minimum distance in the image. New features are extracted if the total number of features with p_t(x) > 0.5 for the current camera position x falls below a given threshold.
3.6 Feature Removal
In order to decide whether a feature is valuable for further tracking, a measure of usefulness has to be defined. If the tracking probability p_t(x) is smaller than 0.5 for every camera position x, a feature can be regarded as dispensable. The exact computation of the maximum of p_t(x) with the expectation-maximization algorithm for every feature is computationally too expensive. With μ_{k,s} denoting the Gaussian means of the mixture model representing successfully tracked features, we approximate the maximum of the tracking probability by evaluating p_t at all positions μ_{k,s}:

$$p_{\max} \approx \max_k \; p_t(\mu_{k,s}) \qquad (13)$$
If pmax < 0.5 holds, then no camera position exists where this feature is likely to be tracked, and it can be removed from the feature map without the concern of losing valuable information. If a feature point gets lost and the 3D coordinate of that feature has not been reconstructed yet, this feature is removed as well, because without a valid 3D coordinate it is not possible to re-project the feature back into the image for further tracking.
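A short sketch combining Eq. (13) with this removal rule follows; `success_means` and `p_t` are assumed accessors for a feature's success mixture and its tracking probability of Eq. (4).

```python
def is_dispensable(feature, threshold=0.5):
    """Feature removal test of Eq. (13): drop a feature if its tracking
    probability stays below `threshold` at every success-mixture mean."""
    means = feature.success_means()          # the mu_{k,s} of the feature's mixture
    if not means:
        return True                          # never tracked successfully
    p_max = max(feature.p_t(mu) for mu in means)
    return p_max < threshold
```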
4 Experimental Results
To evaluate whether the tracking probability distribution of a single feature is estimated correctly, the following test scenario was created. The camera pose is computed by tracking a set of planar fiducial markers located in the x/y-plane.
Fig. 1. Probability density map of camera position for a single feature. (a) A frame of the test sequence. (b) The Gaussian mixture models of camera positions where the feature was tracked successfully (blue) and where the tracking failed (red). (c) The tracking probability in the x/z-plane.
Fig. 2. Histograms of successfully and unsuccessfully tracked features with their corresponding tracking probability: (a) tracking failures; (b) tracking successes
A point feature is extracted manually on the same plane. When the camera is moved around, the point feature gets lost while it is occluded by an object, but it is tracked successfully again when it becomes visible. The Gaussian mixture model is visualized in Fig. 1(b) by a set of confidence ellipsoids, drawn in blue and red for p(x|C = s) and p(x|C = f) respectively. The number of Gaussians is limited to 8 for each mixture model in this example. Fig. 1(c) shows the probability distribution p_t(x) in the x/z-plane together with the Gaussian means. It can be seen that the camera positions where the point feature was visible or occluded are correctly represented by the mixture models of tracking successes and tracking failures, respectively. The probability distribution clearly shows that the tracking probability falls to 0 at camera positions where the feature is occluded.
Table 1. Average processing time of the individual steps of the tracking approach

Step                                     Time (ms)
Prediction step / build image pyramid        10.53
Feature selection and tracking               29.08
Pose estimation                               2.74
Update feature probability                    1.94
Reconstruct feature points                    5.53
Extract new features                          5.93
Total time without feature extraction        49.82
An image sequence showing an industrial scenario is used for the further experiments. In order to evaluate the quality of the tracking probability estimation, all available features are used as input for the tracking step, and we observe whether the features are tracked successfully or not, compared with their tracking probability. Figure 2 plots histograms of the number of successfully and unsuccessfully tracked features against their corresponding tracking probability. It can be seen that the majority of features with a high tracking probability were indeed tracked successfully. An analysis of the processing time was carried out on a Pentium 4 at 2.8 GHz with a FireWire camera at a resolution of 640 × 480 pixels. The average computational costs of the individual steps are shown in Table 1. Without feature extraction, the tracking system can run at a frame rate of 20 Hz. If no feature selection is performed, on average 93.9 features are used in the feature tracking step, only 49.0% of all features are tracked successfully, and the average runtime of the tracking step is 64.36 ms. With the selection of the most probable features, on average only 48.94 features are analysed per frame in the tracking step. The success rate of the feature tracking rises to 83.0% and the mean computation time is lowered to 29.08 ms, with no significant difference in the quality of the pose estimation.
5 Conclusion
We have presented an approach for real-time camera pose estimation which uses an efficient feature management to store many features and to track only those features which are most likely to be tracked from a given camera position. The tracking probability for every feature is estimated online during the tracking and no preprocessing is necessary. Features which are only visible in a limited area of viewpoints are only tracked at those certain camera positions and ignored at any other viewpoints. Even if they are occluded for a long time, reliable features are not deleted, but kept in the feature set as long as a camera position exists from which the feature can be tracked successfully. Not only the visibility, but also the robustness of a feature is represented by the tracking probability. Tracking failures due to reflections or spotlights at certain camera positions are also modeled correctly.
Acknowledgements This work was partially funded by the European Commission project SKILLS, Multimodal Interfaces for Capturing and Transfer of Skills (IST-035005, www.skills-ip.eu).
References
1. Najafi, H., Genc, Y., Navab, N.: Fusion of 3D and appearance models for fast object detection and pose estimation. In: Narayanan, P.J., Nayar, S.K., Shum, H.-Y. (eds.) ACCV 2006. LNCS, vol. 3852, Springer, Heidelberg (2006)
2. Bleser, G., Pastarmov, Y., Stricker, D.: Real-time 3D camera tracking for industrial augmented reality applications. In: WSCG (Full Papers), pp. 47–54 (2005)
3. Genc, Y., Riedel, S., Souvannavong, F., Akinlar, C., Navab, N.: Marker-less tracking for AR: A learning-based approach. In: IEEE/ACM International Symposium on Mixed and Augmented Reality, pp. 295–304. IEEE Computer Society Press, Los Alamitos (2002)
4. Davison, A.: Real-time simultaneous localisation and mapping with a single camera. In: Proc. International Conference on Computer Vision, Nice (2003)
5. Bleser, G., Wuest, H., Stricker, D.: Online camera pose estimation in partially known and dynamic scenes. In: ISMAR, pp. 56–65 (2006)
6. Shi, J., Tomasi, C.: Good features to track. In: CVPR 1994. IEEE Conference on Computer Vision and Pattern Recognition, pp. 593–600. IEEE Computer Society Press, Los Alamitos (1994)
7. Jin, H., Favaro, P., Soatto, S.: Real-time feature tracking and outlier rejection with changes in illumination. In: IEEE Intl. Conf. on Computer Vision, pp. 684–689. IEEE Computer Society Press, Los Alamitos (2001)
8. Matthews, I., Ishikawa, T., Baker, S.: The template update problem. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(6), 810–815 (2004)
9. Zinßer, T., Gräßl, C., Niemann, H.: Efficient Feature Tracking for Long Video Sequences. In: Rasmussen, C.E., Bülthoff, H.H., Schölkopf, B., Giese, M.A. (eds.) DAGM 2004. LNCS, vol. 3175, pp. 326–333. Springer, Heidelberg (2004)
10. Scott, D.W., Szewczyk, W.F.: From kernels to mixtures. Technometrics 43, 323–335 (2001)
11. Goldberger, J., Roweis, S.: Hierarchical clustering of a mixture model. In: Saul, L.K., Weiss, Y., Bottou, L. (eds.) Advances in Neural Information Processing Systems 17, pp. 505–512. MIT Press, Cambridge (2005)
12. Zhang, K., Kwok, J.: Simplifying mixture models through function approximation. In: Schölkopf, B., Platt, J., Hoffman, T. (eds.) Advances in Neural Information Processing Systems 19, MIT Press, Cambridge (2007)
13. Zivkovic, Z., van der Heijden, F.: Recursive unsupervised learning of finite mixture models. IEEE Trans. Pattern Anal. Mach. Intell. 26(5), 651–656 (2004)
14. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proc. Alvey Vision Conf., Univ. Manchester, pp. 147–151 (1988)
15. Rosten, E., Drummond, T.: Fusing points and lines for high performance tracking. In: IEEE International Conference on Computer Vision, vol. 2, pp. 1508–1511 (2005)
Measurement of Reflection Properties in Ancient Japanese Drawing Ukiyo-e

Xin Yin, Kangying Cai, Yuki Takeda, Ryo Akama, and Hiromi T. Tanaka
Ritsumeikan University, Nojihigashi 1-1-1, Kusatsu, Shiga, 5258577, Japan
http://cv.ci.ritsumei.ac.jp
Abstract. Ukiyo-e is a famous traditional Japanese woodblock print. Some patterns printed with special printing techniques can only be seen from particular directions; this phenomenon is related to the reflection properties of the Ukiyo-e surface. In this paper, we propose a method to measure these reflection properties. First, the normals on the surface and the directions of the fibers in the Japanese paper are computed from photos taken with a measuring machine named OGM. Then, a reflection model is fitted to the measured data and the reflection properties of the Ukiyo-e are obtained. Based on these parameters, the appearance of the Ukiyo-e can be rendered in real time. Keywords: Ukiyo-e, measurement, fibers in Japanese paper, cultural heritage.
1 Introduction
Several rendering techniques such as NPR (Non-Photorealistic Rendering) were developed over the last two decades. These studies mainly focused on simulating the pen stroke and the distribution of pigment on the paper. On the other hand, the scattering of light is also important for representing the appearance of a drawing: the pigment particles and the fibers in the paper affect the scattering of light and produce special effects on the surface. In this paper, we measure the appearance of a type of ancient Japanese drawing named Ukiyo-e and observe both the isotropic reflection from the pigment particles and the anisotropic reflection from the fibers of the Japanese paper. Based on these observations, we propose a shading technique which blends these two types of reflection and renders the appearance of Ukiyo-e in real time. Ukiyo-e is a traditional Japanese drawing whose origin lies in depictions of life in Kyoto in the 16th century. Special printing techniques were developed to obtain particular effects; the Karazuri and Kirazuri techniques are introduced here. Karazuri does not use any pigment: the woodblock is pressed onto the paper by force, creating a bump pattern. Kirazuri uses mica and gold particles to draw patterns on the Ukiyo-e. As shown in Figure 1, the pattern of snow is made by Karazuri and the pattern on the clothing of the woman is made by Kirazuri.
Fig. 1. The photos of Ukiyo-e
Fig. 2. Photomicrographs of the Ukiyo-e. The pigment on the left is gold particles; the pigment in the middle is ink; there is no pigment on the right.
As a result of these printing techniques, the color of the Ukiyo-e varies according to the position of the light source and the viewpoint. Another special effect comes from the fibers in the Japanese paper. The length of the fibers in the Japanese paper used for Ukiyo-e is 6.0–15.0 mm, 7–8 times that in ordinary paper. Figure 2 shows photomicrographs of the Ukiyo-e surface; the fibers can be seen clearly in these photos, even in the parts that appear to be filled with pigment particles. Since the reflection of the fibers is anisotropic, we need to design an anisotropic shading model to represent their reflection effect. As mentioned above, NPR techniques mainly represent the effects of drawings such as paintings, sculpture, block prints, and dyeing. Some work has also been done to reproduce the effect of Ukiyo-e: processing 2D photos ([1]) or simulating the Ukiyo-e printing process ([2]) can reproduce Ukiyo-e in a virtual world. These works focus on simulating the isotropic color of the Ukiyo-e and do not simulate the light scattering on it. To our knowledge, we are the first to simulate the light scattering properties of Ukiyo-e.
Our work is based on measured data and is related to previous work on measuring spatially varying BRDFs (Bidirectional Reflectance Distribution Functions) and BTFs (Bidirectional Texture Functions). Such measurements usually place the light source and the camera on a hemispherical dome and take a large number of photos ([3], [4], [5]); from these photos, the BRDF or BTF can be constructed to render a photorealistic scene. Our measurement is similar to [5], using high-density samples to capture detailed color variation on the surface of the Ukiyo-e. Constructing geometric parameters such as surface normals from photos has been studied for a long time. The principle of photometric stereo ([6]) can be used to construct geometric parameters and BRDFs. The surface normal can be obtained from the color variation across different photos or video ([7], [8], [9]). By comparing the reflection of example objects such as a sphere with the reflection of the target object under the same illumination conditions, the geometric parameters of the target object can be computed ([10]). With the development of techniques for scanning 3D geometric data by laser, it has become possible to improve the precision of the reflection parameters for high-quality rendering by comparing with the scanned data ([11], [12]). For ease of application, the BRDF and surface normal can also be obtained using a small number of photos ([13], [14]). To decrease measurement errors, we measure the data at high density and construct a shading model to render the Ukiyo-e with high realism. Another geometric parameter is the direction of the fibers in the Japanese paper. As a result of fiber reflection, the appearance of the Ukiyo-e shows anisotropic reflection. Several anisotropic shading models have been proposed based on microfacet models and empirical models ([15], [16]); these models assume that the reflected light is distributed over a narrow lobe. For a fiber shading model, the strongest reflection direction lies on a cone around the direction of the fiber ([17], [18]). We develop this type of fiber shading model and fit it to the measured data in this paper. The main idea of our work comes from [14] and [5]: [14] computes the normals on the surface of isotropic reflecting materials, and [5] mainly computes the fiber direction in wood and renders the effect of the fibers. The fibers in the Japanese paper are more complex than in the case of wood; their directions are close to a random distribution. The appearance of the Ukiyo-e blends two effects: the isotropic reflection from the pigment and the anisotropic reflection from the fibers in the Japanese paper. Unlike [14] and [5], we blend these two reflections together and fit the two reflection models to the measured data. Because we combine two different shading models, the errors between the model and the measured data are decreased and high-quality rendering results can be obtained.
2 Taking Photos
We use a system named OGM (Optical Gyro Measuring Machine) to take photos of the Ukiyo-e. The OGM is a 4-axis measuring machine which can place the light source and the camera at any position on a hemispherical dome. Figure 3 shows a photo
Fig. 3. Optical gyro measuring machine (OGM), consisting of a camera, a light source, a light source arm, a camera arm, and an object stage
Fig. 4. BRDFs of different pixels on the Ukiyo-e image: the BRDF of a pigment pixel (golden particle) and the BRDF of a Japanese paper fiber pixel
of the OGM. For measuring the color variation on the Ukiyo-e, the camera is fixed at the position perpendicular to the surface of the Ukiyo-e and only the position of the light source is changed. The light positions are recorded in the computer as a 2D array. To map this 2D array onto positions of the light source on the hemispherical dome, a concentric map ([19]) is used; we use a 37 by 37 grid to set the light positions. To avoid placing the light source behind the camera arm, the object stage is turned 180 degrees. Some markers are placed around the Ukiyo-e to calibrate the pixel positions of the Ukiyo-e image, and photos of a white paper are also taken to calibrate the light distribution on the surface. The technique of [14] is used to calibrate the images. After calibrating the color and the pixel positions, the BRDF of each pixel can be obtained.
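The light positions are indexed through the concentric map of [19]; the following is a minimal sketch, not the authors' code, of how a grid index could be mapped through the Shirley-Chiu square-to-disk transform and then lifted to a direction on the hemispherical dome. The lift used here (z = sqrt(1 - r^2)) and the cell-center sampling are our assumptions.

```python
import numpy as np

def concentric_square_to_disk(u, v):
    """Shirley-Chiu concentric map: (u, v) in [0,1]^2 -> point on the unit disk."""
    a, b = 2.0 * u - 1.0, 2.0 * v - 1.0            # remap to [-1, 1]^2
    if a == 0.0 and b == 0.0:
        return 0.0, 0.0
    if a > -b:
        if a > b:   r, phi = a, (np.pi / 4.0) * (b / a)
        else:       r, phi = b, (np.pi / 4.0) * (2.0 - a / b)
    else:
        if a < b:   r, phi = -a, (np.pi / 4.0) * (4.0 + b / a)
        else:       r, phi = -b, (np.pi / 4.0) * (6.0 - a / b)
    return r * np.cos(phi), r * np.sin(phi)

def grid_to_light_direction(i, j, n=37):
    """Map grid cell (i, j) of an n-by-n array to a unit vector on the upper hemisphere."""
    x, y = concentric_square_to_disk((i + 0.5) / n, (j + 0.5) / n)
    z = np.sqrt(max(0.0, 1.0 - x * x - y * y))     # assumed lift from disk to hemisphere
    return np.array([x, y, z])

# Example: enumerate light directions for the dome. Note the paper reports 1225 photos,
# so the exact indexing of the 37 x 37 array may differ from this illustration.
lights = np.array([grid_to_light_direction(i, j) for i in range(37) for j in range(37)])
print(lights.shape)
```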
Figure 4 shows the two types of BRDFs. The center of each image is black because the light source is blocked by the camera at that position. In regions covered by pigment, the anisotropic reflection is weak and the highlight is concentrated at a point. In regions of bare paper, a strong anisotropic reflection can be observed and the highlight is distributed along a line. This means that we need to construct two shading models and blend them together to render the appearance of the Ukiyo-e.
3 Shading Model
From the measured results, two types of reflection phenomena are observed: an isotropic reflection coming from the pigment and an anisotropic reflection coming from the fibers in the Japanese paper. Modeling these two phenomena and fitting the models to the measured data is introduced in this section.
3.1 Two Type Models
From the photomicrographs of the Ukiyo-e, the distribution of the pigment and the fibers can be observed clearly. As shown in Figure 5(a), the shape of a fiber is approximated as a cylinder. When light refracts from air into a fiber and back to air, its inclination is maintained; as a result, light that enters the fiber along a line leaves it on a cone surface whose axis is the fiber direction. If part of the paper surface is covered by pigment particles, the light is reflected back to the air by the pigment directly, so the light leaving the surface stays along a line. Let αrp denote the angle between the surface normal Np and the viewpoint vector V. The effect of the pigment Ip can then be expressed as

Ip = Idp + ksp · g(σ, αhp) / cos^2(αrp)    (1)
Here, Idp is the diffuse reflection and ksp is the specular reflectance; g(σ, αhp) is a normalized Gaussian with zero mean and standard deviation σ. This model can be used to represent the effect of the pigment on the surface of the Ukiyo-e. Similar to the shading model of the pigment, we can construct the reflection model of the fiber. As shown in Figure 5(b), the blue plane is the normal plane
Fig. 5. Two type shading models of the Ukiyo-e: (a) pigment, (b) fiber geometry with light L, view V, fiber direction F, angles αif and αrf, and the normal plane, (c) normal Nf, highlight line H, and fiber direction F
Γ perpendicular to the fiber direction F. The angle between the light vector L and the normal plane Γ is αif, and the angle between the viewpoint vector V and the normal plane Γ is αrf. If the viewpoint is near the surface of the cone, the reflected light is strong; if the viewpoint is far from the cone surface, the reflected light is weak. We can therefore construct the reflection model of the fiber by extending a traditional reflection model such as the Torrance-Sparrow model. The main difference between the fiber reflection model and the traditional model is that the cone replaces the single mirror-reflection vector. The effect of the fiber If can be represented as

If = Idf + ksf · g(σ, αhf) / cos^2(αrf)    (2)
Here, Idf is the diffuse reflection of the fiber, ksf is the specular reflectance of the fiber, and g(σ, αhf) is the same normalized Gaussian as above; αhf is the half-angle between the normal plane Γ and the viewpoint vector V. Blending the two effects of pigment and fiber, we obtain the final color of the Ukiyo-e as follows:

I = Idp · β + Idf · (1 − β)    (3)
This expression means that the final appearance of the Ukiyo-e is a linear interpolation of the pigment effect and the fiber effect. The next step is to fit this shading model to the measured data and to decide the parameters of the model at each pixel of the image.
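The following is a minimal sketch, not the authors' implementation, of how the two reflection terms of Eqs. (1)-(2) and the blend of Eq. (3) could be evaluated for one pixel. The half-angle definitions (αhp as the angle between the half-vector and the normal, αhf as the angle between the half-vector and the normal plane Γ) and the use of the blend weight β on the full pigment and fiber terms are our assumptions about details the text leaves implicit.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def gauss(sigma, alpha):
    """Normalized zero-mean Gaussian lobe g(sigma, alpha) of Eqs. (1)-(2)."""
    return np.exp(-0.5 * (alpha / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def ukiyoe_pixel(L, V, N, F, Idp, Idf, ksp, ksf, sigma, beta):
    """Blend of the pigment term Ip and the fiber term If for one pixel (our reading of Eq. (3))."""
    L, V, N, F = map(normalize, (np.asarray(L, float), np.asarray(V, float),
                                 np.asarray(N, float), np.asarray(F, float)))
    H = normalize(L + V)                                # half-vector
    a_rp = np.arccos(np.clip(np.dot(N, V), -1, 1))      # angle(normal, view)
    a_hp = np.arccos(np.clip(np.dot(N, H), -1, 1))      # assumed: angle(normal, half-vector)
    Ip = Idp + ksp * gauss(sigma, a_hp) / max(np.cos(a_rp) ** 2, 1e-4)

    # Fiber term: angles are measured from the normal plane perpendicular to F.
    a_rf = np.arcsin(np.clip(np.dot(F, V), -1, 1))      # angle(view, normal plane)
    a_hf = np.arcsin(np.clip(np.dot(F, H), -1, 1))      # assumed: angle(half-vector, normal plane)
    If = Idf + ksf * gauss(sigma, a_hf) / max(np.cos(a_rf) ** 2, 1e-4)

    return beta * Ip + (1.0 - beta) * If

# Toy call with made-up parameters.
print(ukiyoe_pixel(L=[0.3, 0.1, 1], V=[0, 0, 1], N=[0, 0, 1], F=[1, 0, 0],
                   Idp=0.4, Idf=0.3, ksp=0.5, ksf=0.6, sigma=0.15, beta=0.5))
```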
3.2 Computing the Geometry Parameter
The Ukiyo-e is nearly planar. The geometry parameters here are the normal of the micro geometric surface and the direction of the fiber. Although the micro geometric surface could be obtained by integrating the normals, we do not need to reconstruct it: using only the normal and the fiber direction, we can render the appearance of the Ukiyo-e well. What the two shading models have in common is that the normal N lies halfway between the strongest reflection direction R and the light vector L; the difference is that for the fiber the "normal" is a normal plane perpendicular to the fiber direction. For this reason, we can obtain the normal by computing the strongest reflection direction. Since the OGM captures data at sufficient density, it is easy to find the strongest reflection direction R, and the normal can then be computed as N = (L + R)/2. The left image in Figure 6 is the image of the surface normal; the RGB values represent the XYZ components of the normal N. The bump pattern can be seen in this image (the contrast is enlarged for clearer printing). The direction of the fiber F is the axis of the cone on which the highlight can be seen; as a result, the highlight of the fiber reflection is a line on the hemisphere. Figure 5(c) shows the relationship between the normal Nf, the highlight line H, and the fiber direction F: F is perpendicular to the plane that is parallel to N and H. The highlight line H can be obtained by fitting a line to the highlight using Principal Component Analysis, and the fiber direction is then given by F = N × H.
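A minimal sketch, under our own assumptions about the data layout, of the geometry estimation just described: for one pixel, the brightest light direction is taken as the strongest reflection, the normal is formed from it and the fixed view direction (our reading of N = (L + R)/2 up to scale), and the fiber direction follows from a PCA fit to the highlight line.

```python
import numpy as np

def estimate_geometry(light_dirs, intensities, view=np.array([0.0, 0.0, 1.0]),
                      highlight_quantile=0.95):
    """Estimate the surface normal N and fiber direction F for one pixel.

    light_dirs  : (K, 3) unit light directions used by the OGM
    intensities : (K,)   measured brightness of this pixel under each light
    """
    L_star = light_dirs[np.argmax(intensities)]         # brightest light = strongest reflection
    N = L_star + view                                    # halfway vector (assumed fixed view direction)
    N = N / np.linalg.norm(N)

    # Highlight line of the fiber: light directions whose response is in the top few percent.
    mask = intensities >= np.quantile(intensities, highlight_quantile)
    pts = light_dirs[mask] - light_dirs[mask].mean(axis=0)
    _, _, Vt = np.linalg.svd(pts, full_matrices=False)   # PCA via SVD
    H = Vt[0]                                            # principal direction of the highlight line
    F = np.cross(N, H)
    return N, F / np.linalg.norm(F)

# Toy usage with random data, just to show the shapes involved.
rng = np.random.default_rng(0)
dirs = rng.normal(size=(1225, 3)); dirs[:, 2] = np.abs(dirs[:, 2])
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
N, F = estimate_geometry(dirs, rng.random(1225))
print(N, F)
```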
Fig. 6. Geometry parameter images. The left one is the normal image and the right one is the fiber direction image.
The right image in Figure 6 is the image of the fiber direction; the RGB values represent the XYZ components of the fiber direction F. We now know the normal N and the fiber direction F. Fitting the parameters of the model to the measured data is introduced next.
3.3 Fitting the Data
Fitting the model to the measured data is a nonlinear optimization problem: find the parameters that minimize the value of ρ in the following expression,

ρ = Σ_{u,v} (I − Muv)    (4)

Here, I is the theoretical value of the shading model introduced above, Muv is the BRDF data measured by the OGM, and u and v are the coordinates in the measured BRDF image. To obtain the correct parameters of the shading model, a good initial estimate is important: the initial diffuse value is the mean value of the color, the initial Gaussian parameter is computed from the measured data directly, and the initial β is 0.5. The parameters are then obtained by the steepest descent method. With all parameter values available, they are stored as textures, and using these textures the appearance of the Ukiyo-e can be rendered.
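A minimal sketch, not the authors' code, of the per-pixel fit: a plain steepest-descent loop with numerical gradients over (Idp, Idf, ksp, ksf, σ, β), minimizing the squared residual between the model and the measured samples. The squared form, the fixed step size, and the initial σ are our assumptions (the paper only states that steepest descent is used); `ukiyoe_pixel` refers to the sketch given after Eq. (3).

```python
import numpy as np

def fit_pixel(light_dirs, measured, V, N, F, iters=200, lr=1e-2, eps=1e-4):
    """Steepest-descent fit of the shading parameters for one pixel.

    light_dirs : (K, 3) light directions, measured : (K,) observed values.
    Returns theta = [Idp, Idf, ksp, ksf, sigma, beta].
    """
    def residual(theta):
        Idp, Idf, ksp, ksf, sigma, beta = theta
        pred = np.array([ukiyoe_pixel(L, V, N, F, Idp, Idf, ksp, ksf, sigma, beta)
                         for L in light_dirs])
        return np.sum((pred - measured) ** 2)            # squared residual (our choice)

    # Initial estimates as in the paper: diffuse terms from the mean color, beta = 0.5.
    theta = np.array([measured.mean(), measured.mean(), 0.5, 0.5, 0.2, 0.5])
    for _ in range(iters):
        grad = np.zeros_like(theta)
        for k in range(len(theta)):                       # numerical gradient
            d = np.zeros_like(theta); d[k] = eps
            grad[k] = (residual(theta + d) - residual(theta - d)) / (2 * eps)
        theta -= lr * grad                                # steepest-descent step
        theta = np.clip(theta, 1e-4, None)                # keep parameters positive
        theta[5] = np.clip(theta[5], 0.0, 1.0)            # beta stays a blend weight
    return theta
```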
4 Results
The experiments are carried out on a GPU (Graphics Processing Unit) and can render the Ukiyo-e in real time; the graphics card is an NVIDIA GeForce 6800 GS. The experiments use an Ukiyo-e that was made hundreds of years ago. Because we captured 1225 photos (each of 3888 x 2592 pixels) to construct the high-density BRDFs, more than 10 hours were needed to capture the data and more than 4 hours to compute the reflection properties
Fig. 7. Experiment result 1
Fig. 8. Experiment result 2. The left image shows the result without the fiber reflection effect and the right image shows the result with the fiber reflection effect.
(most of the time is spent reading the photos into the computer). Based on the computed results, the Ukiyo-e can be rendered in real time. The image shown in Figure 7 is the rendering result using the shading model proposed in this paper. When the viewpoint changes, the color of the surface changes as well, and the bump pattern of the snow becomes visible or invisible according to the position of the light source. This behavior is similar to the phenomenon that occurs on the real Ukiyo-e. We also compare the cases with and without the fiber effect, using another Ukiyo-e in which the flower pattern is made by Karazuri and the background of the woman is made by Kirazuri. As shown in Figure 8, the left image is the rendering result using only the surface normal; it looks more like plastic than paper. The right image is the rendering result using the surface normal and the fiber direction together; some natural noise appears in the center of the image and the image becomes brighter. The edges of the Karazuri become softer due to the effect of the fibers in the paper, and the result looks more like paper than plastic. All parameters of the model are stored as textures, whose size is about 1/100 of the original BRDF data. Using GPU rendering techniques, the rendering can
be carried out in real time. Because our method is based on real measured data, we can obtain rendering results with high realism.
5 Conclusion
In this paper, a technique for measuring the reflection properties of an ancient Japanese drawing, the Ukiyo-e, is proposed. To our knowledge, this is the first time the reflection properties of Ukiyo-e materials have been measured and their appearance rendered while considering the fiber effect in the Japanese paper. Our method fits the real data well because the isotropic and anisotropic reflections are blended together. The technique can also be used for rendering other similar objects such as cloth. In the future, new techniques for modeling the detail of the fibers in the Japanese paper from images need to be developed. We also plan to develop a VR system that allows a person to hold the Ukiyo-e virtually and to feel its touch at the same time.
Acknowledgments. This work was supported partly by the Grants-in-Aid for Scientific Research "Scientific Research (A) 17200013" and "Encouragement of Young Scientists (B) 19700104" of the Japan Society for the Promotion of Science. This work was also supported partly by the "Kyoto Art Entertainment Innovation Research" project of the Centre of Excellence Program for the 21st Century of the Japan Society for the Promotion of Science.
References
1. Okamoto, T.: http://www.tatuharu.com/ (2007)
2. Okada, M., Mizuno, S., Toriwaki, J.: Virtual sculpting and virtual woodblock printing by model-driven scheme. The Journal of the Society for Art and Science 1(2), 74–84 (2002)
3. Dana, K.J., Ginneken, B., Nayar, S.K., Koenderink, J.J.: Reflectance and texture of real-world surfaces. ACM Transactions on Graphics 18, 1–34 (1999)
4. Gardner, A., Tchou, C., Hawkins, T., Debevec, P.: Linear light source reflectometry. ACM Transactions on Graphics 22(3), 749–758 (2003)
5. Marschner, S.R., Westin, S.H., Arbree, A., Moon, J.T.: Measuring and modeling the appearance of finished wood. In: Proceedings of SIGGRAPH 2005, pp. 727–734 (2005)
6. Woodham, R.J.: Photometric method for determining surface orientation from multiple images. Optical Engineering 19(1), 139–144 (1980)
7. Rushmeier, H., Taubin, G., Gueziec, A.: Applying shape from lighting variation to bump map capture. In: Proceedings of Eurographics Workshop on Rendering, pp. 35–44 (1997)
8. Paterson, J.A., Fitzgibbon, A.W.: Flexible bump map capture from video. In: Proceedings of the Eurographics 2002 Conference, Short Papers (2002)
9. Rushmeier, H., Gomes, J., Giordano, F., El-Shishiny, H., Magerlein, K., Bernardini, F.: Design and use of an in-museum system for artifact capture. In: IEEE/CVPR Workshop on Applications of Computer Vision in Archaeology (2003)
10. Hertzmann, A., Seitz, S.M.: Shape and materials by example: A photometric stereo approach, pp. 533–540. IEEE Computer Society Press, Los Alamitos (2003)
11. Rushmeier, H., Bernardini, F., Mittleman, J., Taubin, G.: Acquiring input for rendering at appropriate levels of detail: Digitizing a Pietà. In: Proceedings of Eurographics Workshop on Rendering, pp. 81–92 (1998)
12. Lensch, H.P.A., Kautz, J., Goesele, M., Heidrich, W., Seidel, H.P.: Image-based reconstruction of spatial appearance and geometric detail. ACM Transactions on Graphics 22, 234–257 (2003)
13. Georghiades, A.S.: Recovering 3-D shape and reflectance from a small number of photographs. In: Proceedings of Eurographics Workshop on Rendering, pp. 230–240 (2003)
14. Paterson, J.A., Claus, D., Fitzgibbon, A.W.: BRDF and geometry capture from extended inhomogeneous samples using flash photography. Computer Graphics Forum 24(3), 383–391 (2005)
15. Ashikhmin, M., Shirley, P.S.: An anisotropic Phong BRDF model. Journal of Graphics Tools 5(2), 25–32 (2000)
16. Ward, G.J.: Measuring and modeling anisotropic reflection. In: Proceedings of SIGGRAPH 1992, pp. 265–272 (1992)
17. Marschner, S.R., Jensen, H.W., Cammarano, M., Worley, S., Hanrahan, P.: Light scattering from human hair fibers. ACM Transactions on Graphics 22(3), 780–791 (2003)
18. Kawai, N.: Reproducing reflection properties of natural textures onto real object surfaces. In: Proceedings of the 4th International Workshop on Texture, pp. 101–106 (2006)
19. Shirley, P., Chiu, K.: A low distortion map between disk and square. Journal of Graphics Tools 2, 45–52 (1997)
Texture-Independent Feature-Point Matching (TIFM) from Motion Coherence Ping Li1 , Dirk Farin1 , Rene Klein Gunnewiek2, and Peter H.N. de With3 1
Eindhoven University of Technology {p.li,d.s.farin}@tue.nl 2 Philips Research Eindhoven
[email protected] 3 LogicaCMG Netherlands B.V.
[email protected]
Abstract. This paper proposes a novel and efficient feature-point matching algorithm for finding point correspondences between two uncalibrated images. The striking feature of the proposed algorithm is that the algorithm is based on the motion coherence/smoothness constraint only, which states that neighboring features in an image tend to move coherently. In the algorithm, the correspondences of feature points in a neighborhood are collectively determined in a way such that the smoothness of the local motion field is maximized. The smoothness constraint does not rely on any image feature, and is self-contained in the motion field. It is robust to the camera motion, scene structure, illumination, etc. This makes the proposed algorithm texture-independent and robust. Experimental results show that the proposed method outperforms existing methods for feature-point tracking in image sequences.
1 Introduction Intensity similarity of image texture is used by most existing algorithms for feature matching, which typically requires that the contrasts of the two images are constant. However, a constant contrast is difficult to maintain in practice. Even if we assume that the camera hardware is identical, for slightly different points of view, the amount of light entering the two cameras can be different, causing dynamically adjusted internal parameters such as aperture, exposure and gain to be different [1]. It is favorable to establish the feature correspondences using the geometric similarity alone, because it appears that the geometric similarity is more fundamental and stable than intensity similarity, as intensity is more liable to change [2]. This paper proposes a feature-point matching algorithm that uses only the smoothness constraint1 [3], which states that neighboring features in an image usually move in similar magnitudes and directions. The smoothness constraint does not rely on any image feature, and is self-contained in the motion field. It is robust to the camera motion, scene structure, illumination, etc. This makes the proposed algorithm texture-independent and robust. 1
We consider the smoothness constraint a geometric constraint, because it is the object rigidity that gives the motion smoothness in an image. For example, a group of points on the surface of a rigid object typically move in similar speeds. This leads to smooth image motion.
1.1 Related Work
Photometric region descriptors have recently been widely used for feature-point matching. In this approach, local image regions are described using image measurements such as the histogram of the pixel intensity, the distribution of intensity gradients [4], image derivatives [5,6], etc. Below, we summarize some well-known descriptors that fall into this category; a review of state-of-the-art region descriptors can be found in [7]. Lowe [4] proposed the Scale-Invariant Feature Transform (SIFT) algorithm for feature-point matching and object recognition, which combines a scale-invariant region detector and a gradient-distribution-based descriptor. The descriptor is a 128-dimensional vector capturing the distribution of gradient orientations in 16 location grids (sub-sampled into 8 orientations and weighted by gradient magnitudes). Features are matched if two descriptors show a small difference. Recently, Bay et al. proposed a new rotation- and scale-invariant interest point detector and descriptor, called SURF (Speeded Up Robust Features) [8]. It is based on sums of 2D Haar wavelet responses and makes efficient use of integral images. The algorithm was shown to have comparable or better performance, while obtaining a much faster execution than previously proposed schemes. Another category of feature-point matching/tracking algorithms does not use descriptors. In [2], a feature-point matching algorithm is proposed using a combination of intensity similarity and geometric similarity: feature correspondences are first detected based on window-based intensity correlation, and outliers are then rejected by a few subsequent heuristic tests involving geometry, rigidity, and disparity. The Kanade-Lucas-Tomasi (KLT) feature tracker [9] combines a feature detector, which detects feature points located in image areas containing sufficient texture variation, with a feature tracker, which determines the displacement vector by minimizing the window-based intensity residue between two image patches around the two feature points. In [10], a global feature-point matching algorithm is proposed, which converts the huge combinatorial search into an efficient implementation using concave programming on its convex hull. In [11], an algorithm based on a proximity constraint is proposed for associating features in two images, which is thus independent from the texture.
1.2 Our Approach
For feature-point tracking in a video/image sequence, the variation of the camera parameters (rotation, zooming, translation) is relatively small. Methods based on local intensity correlation, like the block-matching technique, are often used because of their computational efficiency, but they are usually not robust (to noise, scaling, light change, etc.) because only local image information is used. Descriptor-based algorithms are more robust, but the high computational complexity of the high-dimensional descriptors makes them less efficient. Our approach concentrates on both the computational efficiency and robustness of the feature-point matching algorithm, as well as the fundamental nature of the geometric similarity. Therefore, this paper proposes an efficient and robust point-matching algorithm that uses only the smoothness constraint, targeting feature-point tracking
along successive frames of uncalibrated image/video sequences. Texture information is not required for the feature-point matching. The proposed algorithm is thus referred to as Texture-Independent Feature Matching (TIFM). In TIFM, the correspondences of feature points within a neighborhood are collectively determined in a way such that the smoothness of the local motion field is maximized. TIFM is robust because it is based on only the smoothness constraint, which is robust to the scene structure, camera motion, light change, etc. TIFM is efficient, as the smoothness of the motion field can be efficiently computed using a simple coherence metric (discussed in Section 3). Experimental results on both synthetic and real images show that TIFM outperforms existing algorithms for feature-point tracking in image/video sequences. Assuming the feature points are already detected using the Harris corner detector [12], our focus in this paper is to establish the two-frame correspondences, with which feature points can be easily tracked by simply linking the two-frame correspondences across more images.
2 Notations
Let I = {I1, I2, · · · , IM} and J = {J1, J2, · · · , JN} be two sets of feature points in two related images, containing M and N feature points, respectively. For any point Ii, we want to find its corresponding feature point Jj from its candidate set CIi, which is defined as all the points within a co-located rectangle in the second image, as shown in Fig. 1(b). The dimension of the rectangle and the density of the feature points determine the number of points in the set.
Fig. 1. The set of feature points in neighborhood NIi in the first image and the set of candidate matching points CIi in the second image for feature point Ii
As seen in Fig. 1, the neighborhood NIi of point Ii is defined as a circular area around that point. The displacement between Ii and Jj is represented by its Correspondence Vector (CV) v Ii . The candidate set CIi for Ii leads to a corresponding set of candidate correspondence vectors VIi . Determining the correspondence for Ii is equivalent to finding the corresponding point from CIi or finding the correct CV from VIi .
3 Matching Algorithm
3.1 Coherence Metric
Suggested by the motion coherence theory [3], we assume that the CVs within a small neighborhood have similar directions and magnitudes, which is referred to as the local-translational-motion (LTM) assumption/constraint in this paper. CVs that satisfy this constraint are called coherent.
Fig. 2. (a) two coherent CVs v i and v j within a neighborhood (v i is the reference CV); (b) n feature points in a neighborhood; each point has a varying number of candidate CVs; only those repeated points indicated by “T” have the true/correct CVs
Given two coherent CVs vi and vj, we require that both the difference dij between their magnitudes and the angle deviation θij between their directions should be small, as shown in Fig. 2(a). Combining these two requirements, we obtain the following coherence metric:

dij < ||vi|| × sin(ϕ) = R,    (1)

where ϕ is the maximum allowed angle deviation between two CVs within a neighborhood, and R is a threshold based on the magnitude of the reference CV and ϕ, as illustrated in Fig. 2(a). The allowed degree of deviation ϕ specifies how similar two CVs should be in order to satisfy the coherence criterion. The difference dij is computed in simplified form as

dij = |vi − vj| = |xvi − xvj| + |yvi − yvj|.    (2)
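A minimal sketch (our own, not the authors' code) of the coherence test of Eqs. (1)-(2): the L1 difference between two correspondence vectors is compared against a threshold derived from the reference vector's magnitude and the allowed angular deviation ϕ.

```python
import numpy as np

def coherent(v_ref, v_other, phi_deg=15.0):
    """Eqs. (1)-(2): is v_other coherent with the reference CV v_ref?"""
    v_ref, v_other = np.asarray(v_ref, float), np.asarray(v_other, float)
    d = np.abs(v_ref - v_other).sum()                        # Eq. (2): L1 difference
    R = np.linalg.norm(v_ref) * np.sin(np.deg2rad(phi_deg))  # Eq. (1): threshold
    return d < R

print(coherent([10, 2], [9, 3]))   # small deviation -> True
print(coherent([10, 2], [-4, 8]))  # clearly incoherent -> False
```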
3.2 Smoothness Computation
Given a reference CV vj ∈ VIi, the smoothness S of the motion field with respect to vj within the neighborhood NIi is measured as the ratio between the number of coherent CVs found in NIi and the number of feature points in NIi. This ratio is denoted by S(NIi, vj) and can be computed by

S(NIi, vj) = ( Σ_{Ik ∈ NIi} fIk(vj) ) / n,    (3)

where n is the number of feature points in NIi, and fIk(vj) is a binary indicator variable, indicating whether the CV of feature point Ik is coherent with the reference vj, and is defined by
fIk(vj) = 1 if dik < R, and fIk(vj) = 0 otherwise.    (4)
The LTM assumption suggests that CVs within a neighborhood should have similar directions and magnitudes. This means that S(NIi, vj) should be as high as possible to have a smooth motion field. We compute S(NIi, vj) for every vj ∈ VIi. The maximum is considered as the smoothness for the neighborhood, which is computed by

Sm(Ni) = max_{vj ∈ VIi} S(NIi, vj).    (5)
With the above equation, the problem of determining the correspondence for feature point Ik ∈ NIi is converted into selecting a CV vIk ∈ VIk such that the maximum smoothness Sm(Ni) is found.
3.3 Steps to Compute Correspondences for Feature Points Within a Neighborhood
S1: Given a reference CV vj ∈ VIi (j = 1, · · · , m), find the most similar CV from VIk for every Ik ∈ NIi (k = 1, · · · , n), so that the difference dik of Eq. (2) is minimal.
S2: Set the indicator variable fIk(vj) according to Eq. (4); compute the smoothness S(NIi, vj) of the motion field using Eq. (3).
S3: Compute the maximum smoothness Sm(Ni) using Eq. (5); true correspondences are found if Sm(Ni) is higher than a given threshold.
A code sketch of steps S1–S3 is given below. The LTM constraint requires that correct matches for all points within a neighborhood give coherent CVs. Once we find the true CV for one point, the CVs for the other points can be found as well. As a result, the point-matching process is highly constrained: the combinations between the n points and the (n × m) candidates are reduced from m^n to approximately m in TIFM by using the LTM constraint. Fig. 2(b) illustrates the possible matching combinations.
3.4 Rationale of the Algorithm
We now explain why the maximum number of coherent CVs gives the true correspondences with a high probability. According to the LTM assumption, the (n × α) true CVs are coherent, and they compose a smooth motion field with a smoothness of α. Due to the random texture, feature points appear randomly along any other, non-true CV. Thus, the probability of finding another set of coherent CVs with a smoothness higher than α is low. Once the highest smoothness is detected, the true CVs are found. We do not assign any correspondence if the highest smoothness is below a threshold; this occurs, for example, in low-repetition-ratio areas such as trees and grass.
α is the repetition ratio of feature points within a neighborhood, which is defined as the ratio between the #points that appear in both images and the #points that appear in the 1st image.
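Below is a minimal sketch, under our own data-structure assumptions, of steps S1–S3 for one neighborhood: each candidate reference CV of the center point is tried; every neighbor picks its most similar candidate CV (S1); coherence is counted to obtain the smoothness of Eq. (3) (S2); and the reference with the maximum smoothness of Eq. (5) decides the correspondences if it exceeds a threshold (S3).

```python
import numpy as np

def l1(a, b):
    return np.abs(np.asarray(a, float) - np.asarray(b, float)).sum()

def match_neighborhood(ref_cands, neighbor_cands, phi_deg=15.0, smooth_thresh=0.5):
    """Steps S1-S3 for one neighborhood.

    ref_cands      : list of candidate CVs (2-vectors) of the center feature point
    neighbor_cands : list (one entry per neighbor) of lists of candidate CVs
    Returns (best smoothness, chosen CV index per neighbor) or (smoothness, None).
    """
    sin_phi = np.sin(np.deg2rad(phi_deg))
    best_S, best_choice = -1.0, None
    for v_ref in ref_cands:                                   # try each reference CV
        R = np.linalg.norm(v_ref) * sin_phi                   # Eq. (1)
        choice, coherent_count = [], 0
        for cands in neighbor_cands:
            dists = [l1(v_ref, v) for v in cands]             # S1: most similar CV
            k = int(np.argmin(dists))
            choice.append(k)
            coherent_count += dists[k] < R                    # S2: indicator of Eq. (4)
        S = coherent_count / max(len(neighbor_cands), 1)      # Eq. (3)
        if S > best_S:                                        # S3: keep the maximum, Eq. (5)
            best_S, best_choice = S, choice
    if best_S < smooth_thresh:                                # reject ambiguous neighborhoods
        return best_S, None
    return best_S, best_choice

# Toy example: three neighbors whose true motion is roughly (10, 2).
ref = [np.array([10.0, 2.0]), np.array([-3.0, 7.0])]
nbrs = [[np.array([9.0, 3.0]), np.array([0.0, 0.0])],
        [np.array([10.0, 1.0])],
        [np.array([11.0, 2.0]), np.array([-5.0, 4.0])]]
print(match_neighborhood(ref, nbrs))
```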
4 Experimental Results
4.1 Evaluation Criteria
Only a portion of the detected points can be matched, and only a fraction of the detected matches are correct. For two-frame point matching, the results are presented with the parameters #CorrectMatches, recall and precision, as introduced below. A correct match is determined based on its conformity to either the homography matrix H or the fundamental matrix F that is computed using the RANSAC [13] algorithm on the obtained data set. A match is considered correct if its associated residual error dr computed by Eq. (6) is smaller than one pixel; this error is computed by

dr = [d(x', Fx) + d(x, F^T x')]/2, given F, or dr = [d(x', Hx) + d(x, H^{-1} x')]/2, given H,    (6)

where (x, x') is a pair of matched points and d(·, ·) is the geometric distance between the point and the epipolar line (given F), or the Euclidean distance between the two points (given H). The measurements recall and precision are computed as follows:

recall = #CorrectMatches / #DetectedMatches,    (7)
precision = #CorrectMatches / #AveragePointsInTwoImages.    (8)
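A minimal sketch (not the authors' code) of how the residual of Eq. (6) for the fundamental-matrix case and the measures of Eqs. (7)-(8) could be computed with NumPy; the point-to-epipolar-line distance is the standard |x'ᵀFx| divided by the norm of the first two line coefficients.

```python
import numpy as np

def point_line_dist(p_h, line):
    """Distance from homogeneous 2D point p_h to homogeneous 2D line."""
    return abs(np.dot(p_h, line)) / np.hypot(line[0], line[1])

def residual_F(x, x2, F):
    """Eq. (6), fundamental-matrix case, for inhomogeneous points x, x2 (2-vectors)."""
    xh, x2h = np.append(x, 1.0), np.append(x2, 1.0)
    d1 = point_line_dist(x2h, F @ xh)        # x2 against the epipolar line of x
    d2 = point_line_dist(xh, F.T @ x2h)      # x against the epipolar line of x2
    return 0.5 * (d1 + d2)

def recall_precision(n_correct, n_detected, n_avg_points):
    return n_correct / n_detected, n_correct / n_avg_points   # Eqs. (7)-(8)

# Toy numbers; 989/1043 echoes the synthetic experiment below, the third value is a placeholder.
print(recall_precision(989, 1043, 1000))
```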
recall is the percentage of correct matches among the total detected matches, which measures the quality of the detected correspondences; precision is the percentage of the total feature points that are correctly matched, which measures the efficiency of an algorithm (more feature points imply more computation). For experiments on image/video sequences, besides the two-frame matching results, structure from motion is conducted, and the tracking performance of TIFM is evaluated in terms of the success or failure of the 3D reconstruction.
4.2 Experiments on Synthetic Images
First, we generate an 800×600 image with 1,000 randomly-distributed points. Second, the 1,000 feature points are rotated and translated with controlled rotation or translation parameters to generate the second image. Third, an equal number of randomly-distributed outliers are injected into both images to generate two corrupted images. TIFM is then applied to the corrupted images to detect feature correspondences. The homography is computed for performance evaluation. Figs. 3(a) and 3(b) show the results obtained by TIFM under different settings of Degree of Rotation (DoR) and Percentage of Injected Outliers (PIO). The Degree of Rotation is the angle by which the image rotates around its image center, which measures
The point is first randomly generated, and then suppressed in a way similar to the Harris corner suppression such that each 3x3 block contains at most one point.
Fig. 3. Results on synthetic images: (a) #CorrectMatches and (b) recall, each plotted against the %Injected Outliers and the Degree of Rotation; (c) CVs superimposed on the uncorrupted first image
how strongly the image motion deviates from translation; the %Injected Outliers is the percentage of outliers injected into both images, which can be considered the noise level of the image, or the inverse of the repetition ratio of the feature points. As we see from Figs. 3(a) and 3(b), TIFM is able to reliably detect the correspondences even when the image contains a large portion of injected outliers and evident rotation. For example, when PIO = 50% and DoR = 4°, we found 989 correct matches; furthermore, 94.8% of the 1,043 detected matches are inliers to the homography. The obtained CVs are shown in Fig. 3(c), where an evident rotation is observed. The simulation results demonstrate the robustness of the LTM assumption to noise and to motion that deviates from pure translation. For real images, this means that TIFM is able to work for image areas with a low repetition ratio and containing non-translational motion, which is certainly very desirable.
4.3 Experiments on Real Images
Tables 1(a) and 1(b) list the test image/video sequences and the individual image pairs used in the experiments. For performance comparison, three other matching algorithms are implemented, i.e., SIFT, KLT and the Block Matching (BM) algorithm. To track feature points by TIFM, SIFT and BM, correspondences between two successive images are first computed; second, feature-point tracks are obtained by linking the two-frame correspondences. For KLT, about 3,000 good features are initially selected in the first image and then tracked or discarded over the remaining frames. The search range for TIFM and BM is set to 50 for image sequences (except that it is set to 70 for BM on castle), and to 15 for video sequences. KLT relies on the image gradient for computing the CVs; it works only for sequences with small motion, and does not work with house, kspoort, castle and church.
Two-Frame Matching Results. Fig. 5 shows the results by TIFM, SIFT and BM on individual image pairs. From the figure, we see that TIFM obtains the largest
Source codes of SIFT and KLT are available from http://www.cs.ubc.ca/˜lowe/ keypoints/ and http://www.ces.clemson.edu/∼stb/klt/, respectively. For point matching by BM, the Sum of the Absolute Difference (SAD) of the luminance intensity between two 7 × 7 windows around the feature points are computed. A correspondence is established if the SAD is minimal and smaller than a given threshold.
Table 1. (a) Test sequences, and (b) test image pairs
(a) Sequences
Seq (#frm) | Description
medusa (194) | Fig. 4(f); from www.cs.unc.edu/~marc/; small motion
castle (26) | Fig. 4(a); from www.cs.unc.edu/~marc/; mod. motion
lab (150) | Fig. 4(e); by hand-held DV; small motion
kspoort (22) | Fig. 4(d); by hand-held DC; mod. motion
house (16) | Fig. 4(b); by hand-held DC; mod. motion
church (25) | Fig. 4(c); by hand-held DC; mod. motion
leuven (6) | Fig. 4(i); from www.robots.ox.ac.uk/~vgg/research/affine/; big light change; small motion
(b) Image pairs
ImagePair | Description
L01 | Fig. 4(i); two brightest images from leuven
L05 | the brightest and darkest images from leuven
IP1 | Fig. 4(a); extracted from castle; mod. motion
IP2 | Fig. 4(b); extracted from house; mod. motion
IP3 | Fig. 4(g); extracted from medusa; small motion
IP4 | Fig. 4(h); by hand-held DC

Fig. 4. Test sequences/images superimposed with detected CVs or tracked feature points by TIFM; all CVs are before outlier rejection: (a) castle (IP1) with CVs, (b) house (IP2) with CVs, (c) church with CVs, (d) kspoort with 524 points tracked along 22 frames, (e) lab with 147 points tracked along 51 frames, (f) medusa with 67 points tracked along 141 frames, (g) 1st image of IP3 with CVs, (h) 1st image of IP4 with CVs, (i) 1st image of L01 with CVs
#CorrectMatches for 5 out of 6 image pairs. The recall of TIFM is comparable to that of SIFT, while its precision is much higher than that of SIFT and BM. This implies that TIFM is accurate and more efficient: without the need to detect many feature points, a large number of correspondences can be obtained with a high accuracy.
The results on L01, L05 and IP2, which contain evident light change, show that TIFM is robust to light change. The results on IP4 show the potential of TIFM for images containing reflecting or non-Lambertian objects; the reason is its texture independence. TIFM works as long as the LTM assumption is satisfied. Not surprisingly, BM works only for IP1 and IP3, which have small light change. We have applied TIFM to every successive image pair of all the test sequences, and results similar to Fig. 5 are obtained.
Fig. 5. Results by TIFM, SIFT, and BM on individual image pairs: (a) #CorrectMatches, (b) recall, (c) precision
Tracking Results. The performance of TIFM is further evaluated in terms of the number of tracked feature points and the failure or success of the factorization-based 3D reconstruction [14]. A feature tracking is considered of a high quality if it can render an accurate reconstruction. Removing incorrect matches from two-frame correspondences is not used in all tested methods. Table 2. #T rackedP oints and the Success (S) or Failure (F) of the 3D reconstruction for (a) medusa, and (b) house; tracking starts from frame #0 and ends at frame #f rm #f rm 5 TIFM 1914F SIFT 676F KLT 753F BM 509F
10 40 80 1363S 550S 211S 413F 87S 23S 526F 220F 97F 187F 8F 0F (a) medusa
140 66S 5F 29F 0F
180 21F 0F 10F 0F
#f rm 5 10 15 TIFM 1152S 846S 699S SIFT 591F 286F 179S BM 430F 169F 79F (b) house
Table 2(a) shows the tracking results on medusa, where we see that TIFM performs better than the other algorithms in terms of both the number and the quality of the tracked feature points. Among the six results, for tracking from 6 frames to 181 frames, only the first and the last tracking fail for the 3D reconstruction. Table 2(b) lists the results on house, which show a similar behavior as medusa: TIFM is able to track more points along more frames and with a higher accuracy than SIFT, KLT and BM. KLT and BM fail for all 3D reconstructions. Using local image information alone for feature-point matching makes it difficult to obtain satisfactory tracking; outlier rejection is necessary for such algorithms.
Fig. 6. Four 3D shapes reconstructed by TIFM: (a) top view of the reconstructed 3D shape of house (16 cameras), (b) top-left view of the reconstructed 3D shape of medusa (161 cameras), (c) top view of the reconstructed 3D shape of kspoort (22 cameras in a 'W' shape), (d) top-front view of the reconstructed 3D shape of castle (4 cameras)
As examples, Fig. 6 depicts four 3D scene structures reconstructed by TIFM. By examining the ‘W’-shape camera track in Fig. 6(c), the zigzag shape of the house in Fig. 6(a), and by comparing Fig. 6(d) with Fig. 4(a), we can see 3D reconstructions are successful. We believe the reconstruction in Fig. 6(b) is also correct, though it is more difficult to see from the figure. The success of the 3D reconstructions on a 161-frame long track in Fig. 6(b), and on a 4-frame short track in Fig. 6(d) fully demonstrates the reliability of TIFM. Experiments on other sequences show similar results.
5 Conclusion
In this paper, we have proposed a novel texture-independent feature-point matching algorithm that uses only a self-contained smoothness constraint. The feature-point correspondences within a neighborhood are collectively determined such that the smoothness of the motion field is maximized. The experimental results on both synthetic and real images show that the proposed method outperforms SIFT, KLT and BM, in terms of both the number and quality of tracked feature points in image/video sequences. It provides an attractive solution for finding feature-point tracks in image/video sequences for tasks such as structure from motion. The accuracy and high trackability of the feature points by TIFM come from the neighborhood-based smooth feature-point matching. First, collective determination of correspondences for a group of points constrains the feature-matching process and decreases the chance of detecting individual incorrect correspondences. Second, the use of local neighborhoods and the allowance for motion deviating from pure translation ensure a robust and accurate feature-point matching; TIFM obtains a good balance between using global and local image information. Third, the Harris corner detector guarantees the localization accuracy.
Correspondences by TIFM can be wrong for a complete neighborhood; however, such erroneous two-frame correspondences are less likely to be propagated to succeeding images.
References
1. Ogale, A.S., Aloimonos, Y.: Robust Contrast Invariant Stereo Correspondence. In: Proc. IEEE Int. Conf. Robotics and Automation, pp. 819–824 (2005)
2. Hu, X., Ahuja, N.: Matching point features with ordered geometric, rigidity, and disparity constraints. IEEE Trans. Pattern Analysis and Machine Intelligence 16(10), 1041–1049 (1994)
3. Yuille, A., Grzywacz, N.: A Mathematical Analysis of the Motion Coherence Theory. In: Proc. 2nd Int. Conf. Computer Vision (1988)
4. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. of Computer Vision 60(2), 91–110 (2004)
5. Baumberg, A.: Reliable feature matching across widely separated views. In: Proc. IEEE Comp. Vision and Pattern Recognition, vol. 1, pp. 774–781 (2000)
6. Schaffalitzky, F., Zisserman, A.: Multi-view Matching for Unordered Image Sets. In: Proc. 7th European Conf. Computer Vision, pp. 414–431 (2002)
7. Mikolajczyk, K., Schmid, C.: A Performance Evaluation of Local Descriptors. IEEE Trans. Pattern Analysis and Machine Intelligence 27(10), 1615–1629 (2005)
8. Bay, H., Tuytelaars, T., Gool, L.V.: SURF: Speeded Up Robust Features. In: Proc. 9th European Conf. Computer Vision (2006)
9. Tomasi, C., Kanade, T.: Detecting and Tracking of Point Features. Carnegie Mellon University Technical Report CMU-CS-91-132 (1991)
10. Maciel, J., Costeira, J.P.: A Global Solution to Sparse Correspondence Problems. IEEE Trans. Pattern Analysis and Machine Intelligence 25(2), 187–199 (2003)
11. Scott, G., Longuet-Higgins, H.: An Algorithm for Associating the Features of Two Images. In: Proc. of the Royal Society of London, vol. B-244, pp. 21–26 (1991)
12. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proc. 4th Alvey Vision Conf., pp. 147–151 (1988)
13. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24(6), 381–393 (1981)
14. Han, M., Kanade, T.: A perspective factorization method for Euclidean reconstruction with uncalibrated cameras. J. of Visualization and Computer Animation 13(4), 211–223 (2002)
Where’s the Weet-Bix? Yuhang Zhang, Lei Wang, Richard Hartley, and Hongdong Li Research School of Information Sciences and Engineering Australian National University
Abstract. This paper proposes a new retrieval problem and conducts an initial study of it. The problem aims at finding the location of an item in a supermarket by means of visual retrieval. It is modelled as object-based retrieval and approached using local invariant features. Two existing retrieval methods are investigated and their similarity measures are modified to better fit this new problem. More importantly, through this study the new retrieval problem proves itself to be a challenging task. An immediate application is to help customers find what they want without physically wandering around the shelves, but a wide range of potential applications can be expected.
1 Introduction
Given the query image of an object, object retrieval requires finding the same object in a collection of images [1,2]. In this paper, we propose a new object retrieval problem in which the object to retrieve occupies only a small part of a database image and might have multiple copies there. An immediate application is the following: suppose a customer needs a certain brand of biscuit and has a sample or its image. Through our object retrieval, the customer is informed of the shelf where this biscuit lies. This problem is more challenging than those posed in [1,2] due to the following issues: 1. The query image is a close-up view of the object to find, whereas each database image contains dozens of objects that are small in size and different in brand and manufacturer, as shown in Figure 1. As a result, the object to retrieve is presented at very different scales in the query and database images. Worse, in a database image all the objects other than the queried one become background clutter. 2. In each database image, there are often multiple copies of an object. In that case the same or similar local invariant features may be extracted from different locations, which makes some widely used geometrical constraints unsuitable. For example, some feature points found in a database image may not be close to each other, even if their matches are close in the query image;
The last two authors are also affiliated with NICTA, a research institute funded by the Australian Government's Backing Australia's Ability initiative, in part through the Australian Research Council.
3. To attract the attention of customers, the appearance of objects in a supermarket is often full of striking signs and patterns, leading to a very rich set of local features. This challenges the discriminating ability of the descriptors for different local features. Moreover, a large number of objects have round or non-rigid shapes, and glistening is very common. These situations further increase the difficulty of this object retrieval task. 4. Products are often produced in series; for instance, the "BIO" brand includes both a detergent for brighter color and a softener. In this case, the appearances of the objects differ only slightly; however, they are completely different for customers. This fact is reflected in the ground truth defined for this retrieval problem, as shown in Figure 2. In this paper, the image database is created and its characteristics are discussed. From each image, local invariant regions are detected and described by the SIFT feature [3]. Upon all of these features, a visual vocabulary is constructed. Each image is then represented by a histogram vector showing the occurrence frequency of each visual word, and the similarity of two images is measured by comparing the corresponding histogram vectors. To address the issues mentioned above, a loose geometrical constraint is applied, which partitions each database image into multiple overlapped sub-images to mitigate the effect of background clutter.
Fig. 1. Top row: query images; Bottom row: database images
Fig. 2. Definition of ground truth (must be identical rather than similar only): a query next to candidates labeled "Not ground truth" and "Ground truth"
Taking advantage of the existence of multiple copies, a new weighting scheme is designed that assigns high weight to the visual words intensively present in a small number of database images. An inner-product-based similarity measure is developed, which encourages the "co-presence" of the same visual words in query and database images but does not penalize other cases, such as the "co-presence" of different visual words. An experimental study is conducted to compare our approach with the existing methods from [1,2] on this new retrieval problem, demonstrating its superior performance.
2 Related Work
This paper employs local invariant features to represent the visual content of an image. The Harris-Affine interest point detector proposed by Krystian Mikolajczyk and Cordelia Schmid in [4] provides interest points for reliable matching even between images with significant perspective deformations; in the later discussion in [5], it is also demonstrated to be more efficient than most other region detectors when dealing with occlusion and clutter. The image feature generation approach proposed by David G. Lowe in [3] gives SIFT features which are largely invariant to changes in scale, illumination, and local affine distortions. These local invariant features are of intermediate complexity, which means they are not only distinctive enough to determine matches in a big feature database but also robust enough to bear clutter and occlusion. There has been considerable progress in developing real-world object recognition systems with local invariant features in recent work [1,2,6,7]. Among this work, the visual vocabulary [1] has been proved to be a distinctive indexing device in local-invariant-feature-based object retrieval. To build a visual vocabulary, local feature descriptors are quantized into clusters according to their similarity; when a new local feature arrives, all similar features can be found by assigning the new feature's descriptor to its nearest cluster. In [1], retrieval of all occurrences of an object outlined by the user in a movie is carried out. After extracting local invariant features from each frame, two visual vocabularies of different types of local invariant features are built. As complements to the visual vocabulary, the term frequency-inverse document frequency (tf-idf) weighting standard, an L2-norm similarity measure, and a spatial consistency check are employed to improve the retrieval performance, showing excellent results. In [2], the visual vocabulary is extended into a hierarchical vocabulary tree and a scheme is proposed which can quickly retrieve the same CD covers from a large database of music CDs. In [2], all the query and database images contain only a single object, namely a CD cover; that work tries to find out whether a query and a database image have the same content. In our work, however, although a query image contains a single object, a database image always contains dozens of objects of different classes. What we try to find out is whether a database image contains the single object shown in a query image. In [1] the problem seems to be the same as ours, but their query images are cropped from database images, which actually amounts to finding out where an image patch comes from. Moreover, a new issue we need to deal
with is the potential multiple appearances of the query object in a single database image.
3 Creation of the Database
Our database, named WebMarket, contains 3,153 images which were taken in a supermarket named Coles in January 2007. This supermarket has eighteen 30-meter-long shelves, each of which has approximately six levels; ten shelves are captured in this database. Starting from one end of the first shelf, the photographing was carried out roughly following the order of the shelves. Two adjacent images have some overlap (less than one third of the image size) to ensure that no part of the shelf is missed. Three digital cameras were used and all the images are saved in JPEG format. The resolution of the images is either 2,272 × 1,704 or 2,592 × 1,944 pixels. Each image generally covers an area of about 1.5 meters in height and 2 meters in width on the shelves, imaging all the objects within three or four shelf levels. The size of each single object in an image is usually small. During photographing, no restriction on the viewpoint and distance was imposed, although most images are frontal views, and no special illumination was used. To build the query set, about 200 different objects were randomly selected, put on the ground, and captured one by one; for each query object, three images are taken from different view angles or distances. For the same object, there is a large difference in its scale between the query and database images. This image database, WebMarket, will soon be published on the web.
4 Our Approach
4.1 Image Representation
Local invariant features have shown excellent performance in representing images under a certain degree of change in scale, view angle, and illumination, and under partial occlusion. In this work, the Harris-Affine interest region detector [4] is applied to each image and the SIFT descriptor [3] is used to describe the detected interest regions. The binaries are downloaded from the Visual Geometry Group, Univ. of Oxford, and the threshold parameter is set to 20,000 to control the number of detected regions. For each database image, about 2,000 SIFT features are extracted, leading to 6.8 million SIFT features in total. When dealing with query images, to handle the large scale difference between the query and database images, the object is manually segmented out (an automatic segmentation algorithm can also be used) and downsized to a 300 × 300 image to partially alleviate the scale problem. The local features are then extracted from it.
4.2 Construction of the Visual Vocabulary
To ensure the quality of the visual vocabulary, it is built upon all of the 6.8 million SIFT features rather than a randomly selected subset. Hierarchical k-means clustering [8] is applied: step by step, all of the features in the database are
clustered into 1,000, 20,000, and 200,000 clusters, leading to a visual vocabulary containing 200,000 visual words. The hierarchical clustering is stopped at 200,000 because it gives reasonably good retrieval performance on our problem. A larger-sized vocabulary may be used; however, it runs the risk that features which are already sufficiently similar (up to some noise level) are separated into different clusters, which adversely affects the retrieval.
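A minimal sketch of hierarchical k-means vocabulary construction as described above, using scikit-learn's KMeans; the branching factors (1,000, then 20, then 10, matching the 1,000 / 20,000 / 200,000 levels), the leaf criterion, and the recursion structure are our illustrative assumptions, not the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_kmeans(descriptors, branching=(1000, 20, 10), min_points=50):
    """Recursively cluster SIFT descriptors; returns a flat array of leaf cluster centers."""
    def split(data, level):
        if level == len(branching) or len(data) < max(branching[level], min_points):
            return [data.mean(axis=0)]                 # leaf visual word
        k = branching[level]
        km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(data)
        centers = []
        for c in range(k):
            centers.extend(split(data[km.labels_ == c], level + 1))
        return centers
    return np.vstack(split(np.asarray(descriptors, dtype=np.float32), 0))

def assign_words(descriptors, vocabulary):
    """Quantize descriptors to their nearest visual word (brute force, for clarity only)."""
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

# Toy run on random 128-D "descriptors" with tiny branching factors.
rng = np.random.default_rng(0)
toy = rng.normal(size=(2000, 128)).astype(np.float32)
vocab = hierarchical_kmeans(toy, branching=(10, 5))
print(vocab.shape, assign_words(toy[:5], vocab))
```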
4.3 A Loose Geometry Constraint
In our problem, the local features are much richer in a database image than in a query image. As a result, a high matching score can easily be obtained between a query image and most database images, leading to very poor retrieval performance. The spatial consistency check in [1,3] cannot be used here, because there are often multiple copies of an identical object in a database image and identical local features can be extracted from different places. This paper imposes a loose geometry constraint: it evenly partitions a database image into 25 sub-images, each of which is one ninth of the original one. A sub-image is large enough to contain the object to retrieve but has much less background clutter. (An ideal way to implement this constraint would be to ensure that each sub-image contains exactly one object; however, this is no less difficult than the object retrieval problem itself.) Two neighboring sub-images have half of their area overlapped to reduce the risk of separating one object into two sub-images. In total, about 78,000 sub-images are obtained. For a query image, a match of a sub-image means a match of the corresponding database image. After the partition, the computational load of retrieval remains the same.
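A minimal sketch (our own) of the 25-sub-image partition: a 5×5 grid of windows, each one third of the image width and height, stepping by one sixth so that neighbors overlap by half, which yields windows of one ninth of the original area.

```python
def subimage_windows(width, height, grid=5):
    """Return (x0, y0, x1, y1) windows: grid x grid sub-images, each 1/3 of each dimension,
    with half-overlap between neighbors (so 5 x 5 = 25 windows of 1/9 the area)."""
    win_w, win_h = width // 3, height // 3
    step_x, step_y = width // 6, height // 6
    boxes = []
    for i in range(grid):
        for j in range(grid):
            x0, y0 = i * step_x, j * step_y
            boxes.append((x0, y0, x0 + win_w, y0 + win_h))
    return boxes

boxes = subimage_windows(2272, 1704)
print(len(boxes), boxes[0], boxes[-1])   # 25 windows; the last one ends near the image border
```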
4.4 Similarity Measure
A similarity measure defines the visual similarity between a query and the sub-images; those with higher measure values are selected as the retrieval result. Let xi = [xi1, xi2, · · · , xin] denote the i-th sub-image, where xij (1 ≤ i ≤ m, 1 ≤ j ≤ n) is the number of occurrences of the j-th visual word in this sub-image, m the total number of sub-images, and n the total number of visual words. Similarly, the query image is represented as q = [q1, q2, · · · , qn]. In our proposed similarity measure, q and xi are mapped respectively to q̃ = [q̃1, q̃2, · · · , q̃n] and x̃i = [x̃i1, x̃i2, · · · , x̃in], where

q̃j = 0 if qj = 0, and q̃j = 1 if qj > 0;   x̃ij = 0 if xij = 0, and x̃ij = dj if xij > 0,    (1)

where

dj = ( Σ_{k=1}^{m} xkj / Σ_{k=1}^{m} sign(xkj) ) · log( m / Σ_{k=1}^{m} sign(xkj) ),    (2)

sign(xij) = 0 if xij = 0, and sign(xij) = 1 if xij > 0.    (3)
The similarity measure is defined as the inner product between q̃ and x̃i,

S1(q, xi) = ⟨q̃, x̃i⟩.    (4)
Equation (2) shows the weighting scheme designed for our problem. It is a product of two terms: Σ_{k=1}^{m} xkj / Σ_{k=1}^{m} sign(xkj) and log(m / Σ_{k=1}^{m} sign(xkj)). Here, Σ_{k=1}^{m} xkj represents the total number of occurrences of the j-th visual word throughout the database, and Σ_{k=1}^{m} sign(xkj) represents the number of images containing at least one j-th visual word. If a visual word appears in certain image(s) with high repetition, then it is probably a stable feature that can be extracted from different copies of the same object. On the other hand, if a feature appears in different images but only once in each, it is more likely to be noise. These two cases are weighted via the first term. Moreover, for a very popular feature that can be extracted from many images, no matter whether stable or not, its importance is scaled down via the second term, since it is not discriminating. In addition, the similarity measure computes the inner product of two unnormalized vectors. By doing so, the "co-presence" of the same visual words is rewarded between two compared images, while the case that a visual word appears in only one of them is not penalized. We also tried a similarity measure which considers the exact number xij of occurrences of a visual word in a sub-image. It is defined as

q̃j = 0 if qj = 0, and q̃j = 1 if qj > 0;   x̃ij = 0 if xij = 0, and x̃ij = xij dj if xij > 0,    (5)

S2(q, xi) = ⟨q̃, x̃i⟩.    (6)
Theoretically, the similarity measure of Equation (6) rewards database images that have more copies of the visual words contained in the query image, which indicates a higher possibility of a true match rather than noise. However, in this manner database images that have only a single copy of the queried object may lose against images that do not contain the queried object but possess a certain amount of similar noise. On the other hand, the similarity measure based on Equation (2) focuses more on how many types of visual words are matched, which is expected to perform better when few copies of the queried object are present in a database image. Another two similarity measures have been proposed in [1,2]. In [1], each query or database image i is mapped to an n-dimensional vector Vi = [ti1, ti2, · · · , tin], where

tij = ( xij / Σ_{k=1}^{n} xik ) · log( m / Σ_{k=1}^{m} xkj ).    (7)

Then the similarity between each pair of images is measured by

S3(Vq, Vxi) = ⟨ Vq / ||Vq||, Vxi / ||Vxi|| ⟩.    (8)
In [2], each query or database image i is also mapped to an n-dimensional vector Ṽi = [t̃i1, t̃i2, · · · , t̃in], where

t̃ij = xij · log( m / Σ_{k=1}^{m} sign(xkj) ).    (9)

Then the similarity between each pair of images is measured by

S4(Ṽq, Ṽxi) = || Ṽq/||Ṽq|| − Ṽxi/||Ṽxi|| ||.    (10)
All four similarity measures will be compared in the experiments.
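As a concrete illustration, the sketch below evaluates the four measures for dense count vectors with NumPy; the function names are ours, the small guards against empty words are added only to keep the toy code safe, and the paper itself exploits sparsity rather than dense vectors.

```python
# S1-S4 of Eqs. (1)-(10) for dense count vectors; `counts` is the m x n matrix
# of visual-word occurrences over the database, q and x are single count vectors.
import numpy as np

def word_weights(counts):
    m = counts.shape[0]
    df = np.maximum((counts > 0).sum(axis=0), 1.0)   # images containing word j
    rep = counts.sum(axis=0) / df                    # average repetition when present
    return rep * np.log(m / df)                      # d_j, Eq. (2)

def s1(q, x, d):                                     # Eq. (4)
    return float((q > 0).astype(float) @ ((x > 0) * d))

def s2(q, x, d):                                     # Eq. (6)
    return float((q > 0).astype(float) @ (x * d))

def s3(q, x, counts):                                # Eqs. (7)-(8)
    m = counts.shape[0]
    idf = np.log(m / np.maximum(counts.sum(axis=0), 1.0))
    vq = q / max(q.sum(), 1.0) * idf
    vx = x / max(x.sum(), 1.0) * idf
    return float(vq @ vx / (np.linalg.norm(vq) * np.linalg.norm(vx) + 1e-12))

def s4(q, x, counts):                                # Eqs. (9)-(10)
    m = counts.shape[0]
    df = np.maximum((counts > 0).sum(axis=0), 1.0)
    vq, vx = q * np.log(m / df), x * np.log(m / df)
    vq = vq / (np.linalg.norm(vq) + 1e-12)
    vx = vx / (np.linalg.norm(vx) + 1e-12)
    return float(np.linalg.norm(vq - vx))
```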
5 Experimental Results
This experiment investigates the performance of the proposed similarity measures S_1 and S_2, as well as the previous measures proposed in [1,2], denoted by S_3 and S_4 respectively, on the new retrieval task. All of the 3,153 database images are used. Thirty retrievals are conducted and the average retrieval performance is reported. The 30 query images are randomly sampled from the query set of 600 images. The ground truth for each of them is created by manually checking the database images one by one. On average, a query image has merely 6.63 database images that are true matches. The retrieval rank of a database image is determined as the highest rank among the sub-images cropped from it. Each image is represented as a 200,000-dimensional vector. Although of very high dimension, this vector is quite sparse: the total number of non-zero entries is no more than the total number of local features extracted from the image, about 2,000 in our case. Taking advantage of this sparsity allows us to evaluate the similarity of two images efficiently. The retrieval performance is measured by the Precision and Recall widely used in information retrieval. They are defined as

$$
\mathrm{Precision} = \frac{\#\text{positive retrieved}}{\#\text{retrieved}} , \qquad
\mathrm{Recall} = \frac{\#\text{positive retrieved}}{\#\text{total positive}} ,
\tag{11}
$$
where #positive retrieved is the number of correctly retrieved images, #retrieved is the number of retrieved images, and #total positive is the number of ground-truth images in the database for a query. The average normalized rank of the true matching images used in [1] is also reported; it is defined as

$$
\mathrm{avg\_rank} = \frac{1}{m \cdot m_{pos}}\left(\sum_{k=1}^{m_{pos}} R_k - \frac{m_{pos}(m_{pos}+1)}{2}\right)
\tag{12}
$$

where m is the total number of images in the database, m_pos is the number of ground-truth images in the database for a query, and R_k is the rank of the k-th ground-truth image for that query. The smaller this average rank value is, the better the retrieval performance.
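A short evaluation helper matching Eqs. (11)-(12) might look as follows; `ranked_ids` is the full database ranking for one query and `relevant` its ground-truth set, both illustrative names of ours.

```python
# Precision/Recall at k and the average normalized rank of Eq. (12).
def precision_recall_at(ranked_ids, relevant, k):
    hits = sum(1 for r in ranked_ids[:k] if r in relevant)   # positive retrieved
    return hits / k, hits / len(relevant)                    # Eq. (11)

def avg_normalized_rank(ranked_ids, relevant, m):
    pos = {img: i + 1 for i, img in enumerate(ranked_ids)}   # 1-based ranks R_k
    ranks = [pos[r] for r in relevant]
    m_pos = len(relevant)
    return (sum(ranks) - m_pos * (m_pos + 1) / 2) / (m * m_pos)   # Eq. (12)
```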
Fig. 3. Comparison of the four similarity measures: (a) Precision curve, (b) Recall curve, (c) percentage of queries with at least one positive return

Table 1. Comparison of the average rank value (S_1, S_2: proposed measures)

Measure      S_1     S_2     S_3     S_4
avg_rank    15.71   15.83   17.13   15.99   (×10^-2)
The Precision and Recall curves are plotted in Figure 3. The horizontal axis is the number of retrieved images, and the top 50 retrieved images are evaluated. As shown in sub-figure (a), the proposed similarity measures, S_1 and S_2, achieve better retrieval performance than S_3 and S_4. The main difference between the two proposed measures and those in [1,2] lies in our new weighting scheme and the unnormalized inner product, and this result verifies their effectiveness. In addition, S_2 outperforms the other three at the first retrieved image, whereas S_1 becomes the best as the number of retrieved images increases. This shows that taking the number of occurrences of a visual word into account helps to identify a perfect matching database image that contains many copies of the queried object.
Fig. 4. Some retrieval examples; the number of ground-truth images for each query is listed in brackets
However, this also makes the similarity measure more sensitive to noise: when a database image does not contain the query object but contains multiple copies of another kind of object that shares one or two types of visual word with the query object, it may score higher under S_2 than a true matching image that contains only one copy of the query object. In other words, S_2 focuses on how many visual-word occurrences the query and database images have in common, while S_1 focuses on how many types of visual word they have in common. Recall is shown in sub-figure (b), from which a similar conclusion can be drawn. In sub-figure (c), the horizontal axis is the number of retrieved images, and the vertical axis shows the percentage of queries for which at least one correct match is found. As it shows, with S_1 and S_2, over 70% of the query images find at least one correct match in the top 50 retrieved images, whereas with S_3 and S_4 this number is only between 50% and 60%. This demonstrates again that the proposed S_1 and S_2 achieve better performance than S_3 and S_4 in our retrieval task. The average ranks are listed in Table 1. It can be seen that the proposed S_1 produces the lowest value, which means that, on average, the ground-truth images are assigned higher ranks under this measure. This coincides with the Precision and Recall results. Some examples of the top 3 retrieved images are shown in Figure 4.
6 Conclusion
A new retrieval problem has been proposed in this paper. The experimental results demonstrate the better performance of the proposed similarity measures. Meanwhile, it can be seen that less than half of the first retrieved images are correct answers and that only about 40% of the true matches can be found after retrieving twenty images. Such a result indicates that this retrieval task is quite challenging, and more work, such as the verification of matches and the consideration of feature dependency, needs to be explored to further boost the retrieval accuracy.

Acknowledgments. Many thanks to the Coles store located in Woden Plaza, ACT, where we were allowed to collect the images for our research. Their understanding and support made this work possible and are highly appreciated.
References
1. Sivic, J., Zisserman, A.: Video Google: A text retrieval approach to object matching in videos. In: IEEE International Conference on Computer Vision, pp. 1470–1477. IEEE Computer Society Press, Los Alamitos (2003)
2. Nister, D., Stewenius, H.: Scalable recognition with a vocabulary tree. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2161–2168. IEEE Computer Society Press, Los Alamitos (2006)
3. Lowe, D.G.: Object recognition from local scale-invariant features. In: IEEE International Conference on Computer Vision, pp. 1150–1157. IEEE Computer Society Press, Los Alamitos (1999)
4. Mikolajczyk, K., Schmid, C.: Scale & affine invariant interest point detectors. International Journal of Computer Vision, 63–86 (2004)
5. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Van Gool, L.: A comparison of affine region detectors. International Journal of Computer Vision 65(1/2), 43–72 (2005)
6. Lowe, D.G.: Local feature view clustering for 3D object recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 682–688. IEEE Computer Society Press, Los Alamitos (2001)
7. Schmid, C., Mohr, R.: Local grayvalue invariants for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 530–535 (1997)
8. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. John Wiley and Sons, Chichester (2001)
How Marginal Likelihood Inference Unifies Entropy, Correlation and SNR-Based Stopping in Nonlinear Diffusion Scale-Spaces

Ramūnas Girdziušas and Jorma Laaksonen

Laboratory of Computer and Information Science, Helsinki University of Technology, P.O. Box 5400, FI-02015 TKK, Finland
[email protected],
[email protected]
Abstract. Iterative smoothing algorithms are frequently applied in image restoration tasks. The result depends crucially on the optimal stopping (scale selection) criterion. An attempt is made towards the unification of two frequently applied model selection ideas: (i) the earliest time when the 'entropy of the signal' reaches its steady state, suggested by J. Sporring and J. Weickert (1999), and (ii) the time of the minimal 'correlation' between the diffusion outcome and the noise estimate, investigated by P. Mrázek and M. Navara (2003). It is shown that both ideas are particular cases of marginal likelihood inference. Better entropy measures are discovered and their connection to the generalized signal-to-noise ratio is emphasized.
1 Introduction
Scale-space methods allow one to restore and enhance semantically important features of images, such as edges. One particular strategy is to employ edge-preserving diffusions [9] supplied with grid-based regularization and splitting techniques [11]. The scale then becomes the diffusion time. Practice indicates the existence of an optimal stopping time which yields the diffusion outcome closest to the desired signal assumed to exist in the observations. A further need to automate the choice of the optimal time has been emphasized in [2]: "Attentive viewing of a computer screen for quite long periods of time may be necessary, and, because changes from one iteration to the next are usually imperceptible, locating the optimal point at which to terminate the process becomes highly elusive." We shall make an attempt to unify two ideas which seem, at first glance, to be completely different: the entropy criterion suggested in [10], and the use of the correlation studied in [8] and [5]. A maximization of the signal-to-noise ratio (SNR) will partially be covered too. The entropy criterion arises from the stability analysis of the nonlinear diffusion scale-space in discrete space and time. It is well known in majorization theory that iteration with doubly stochastic matrices diminishes any
Schur-convex (isotone) function. When assuming that the signal has only non-negative values, one constructs the Shannon entropy $-\sum_k u(x_k)\ln u(x_k)$, which is proven to be isotone in [7]. A further investigation [10] of the entropy increase contains the following statement, presumably based on unreported experiments: "This correspondence has focused on the maximal entropy change by scale to estimate the size of image structures. The minimal change by scale, however, indicates especially stable scales with respect to evolution time. We expect these scales to be good candidates for stopping times in nonlinear diffusion scale-spaces." The idea is of a certain interest as it relates Liapunov stability to the second law of thermodynamics and the MaxEnt inference. However, it is unsatisfactory that the authors neglect an explicit probabilistic model of the observations. An image is considered as a probability density (histogram) of a single scalar-valued random variable in [10]. The assignments of an image intensity value to a given spatial location are not reflected in the stopping criteria. Mixing the concept of 'observation' with the 'probability density' raises unnecessary questions, e.g.: Is Liapunov stability supposed to replace model selection? Is there any best way to preprocess a given image so that, when viewed as a scalar-valued function of the spatial coordinates, it would become a probability density? The current status of the entropy-based stopping [10] is summarized in [8]: "However, as the entropy can be stable on whole intervals, it may be difficult to decide on a single stopping instant from that interval; we are unaware of their idea being brought into practice in the field of image restoration." Instead, the suggestion in [8] is to stop the diffusion at the time when the 'correlation' between the signal and the noise estimate is minimal. It is rather evident that most of the critique directed against the entropy-based stopping applies to the correlation as well. In particular, the remark on the 'entropic stability' in [8] pertains to rare cases in which the correlation might have a very shallow minimum as well, or no minimum at all, as indicated in [5]. The experimental evidence of [5] suggests that correlation-based stopping might be suboptimal in the SNR sense and overestimates the stopping time for textured images. In our view, neither the entropy- nor the correlation-based stopping should be excluded by the developments related to robust statistics. We suggest a unification which allows one to: (i) avoid unnecessary preprocessing of signals, (ii) arrive at a more general criterion, which merges both ideas into a single equation and further clarifies their probabilistic assumptions, and (iii) view optimal diffusion stopping as an example where Bayesian arguments simplify the likelihood inference, not vice versa, as is commonly practiced. Section 2 presents the 'inverse covariance trick' applied to regularize a certain Gaussian model with a singular concentration matrix. This model unifies the entropy, correlation and SNR-based stopping, which is discussed in Section 3. Section 4 describes a univariate numerical example which shows typical evolutions of the suggested criteria. Multivariate extensions do not go beyond numerical aspects when the diffusion propagators satisfy the conditions of Section 3.2. Section 5 summarizes the conclusions.
2 Construction of Joint Probability Density
Given an image stored as a matrix of size n^{1/2} × n^{1/2}, one may consider it as a vector u_0 = y ∈ ℝ^n. At present, a variety of iterative smoothing algorithms are known [11,2] which, given an initial image u_0, provide a set of images u_t, t = τ, ..., mτ, with scale-space properties in a certain sense, e.g. non-increase of global and local extrema, sign changes, and a variety of Liapunov sequences. This can be summarized as

$$
u_{m\tau} = P_\theta^{-1}(u_{0:(m-1)\tau})\, u_0 .
\tag{1}
$$

Here P_θ(u_{0:(m-1)τ}) ∈ ℝ^{n×n} is assumed to be nonsingular, and is often chosen so that P_θ^{-1}(u_{0:(m-1)τ}) is doubly stochastic or totally positive [6]. The dependence on parameters and evolution will be suppressed. The subscript θ indicates the presence of the parameters, which comprise a vector θ ∈ ℝ^p and are typically set by a practitioner. A single globally optimal parameter setting may not even exist, but practice indicates that the choice of the stopping time can be automated.

Assumption 1 (Gaussian hypothesis space). Let the model outputs u ∈ ℝ^n and the observations y ∈ ℝ^n be distributed according to the joint Gaussian probability density with zero mean and the covariance matrix

$$
\Sigma = \begin{pmatrix} \Sigma_{uu} & \Sigma_{uy} \\ \Sigma_{uy}^T & \Sigma_{yy} \end{pmatrix},
\qquad \Sigma_{ab} \equiv \mathrm{Cov}(A, B),
\tag{2}
$$

where Cov(A, B) ≡ ⟨(A − ⟨A⟩)(B − ⟨B⟩)^T⟩ and ⟨·⟩ denotes the expectation. As the joint covariance is not specified yet, this assumption does not tell anything more than that one prefers to work with positive definite matrices. In accordance with the maximum likelihood inference, the parameters θ are not included in the joint random variable; the covariances depend implicitly on them.

Assumption 2 (Model H_1). Let Σ_uu = Σ_uy = Σ_yu and Σ_yy = Σ_uu + Σ_nn, where n stands for the 'noise' variable N. An explicit inverse reads:
$$
\Sigma_{H_1}^{-1} =
\begin{pmatrix} \Sigma_{uu} & \Sigma_{uu} \\ \Sigma_{uu} & \Sigma_{uu} + \Sigma_{nn} \end{pmatrix}^{-1}
=
\begin{pmatrix} \Sigma_{nn}^{-1} + \Sigma_{uu}^{-1} & -\Sigma_{nn}^{-1} \\ -\Sigma_{nn}^{-1} & \Sigma_{nn}^{-1} \end{pmatrix}.
\tag{3}
$$
This particular covariance model is one of the simplest. Formally, the conditional concentration Σ_{u|y}^{-1} = Σ_nn^{-1} + Σ_uu^{-1} reads directly from the upper block of the partitioned inverse, and it ensures that the conditioning is variance-reducing. Eq. (1) can now be given the meaning of a conditional expectation:

$$
\mu_{u|y} \equiv \langle U \,|\, y, H_1 \rangle = \Sigma_{uu}(\Sigma_{uu} + \Sigma_{nn})^{-1} y ,
\qquad \Sigma_{uu}^{-1}\Sigma_{nn} = P - I .
\tag{4}
$$
Given a nonsingular Σ nn , the diffusion propagator P uniquely defines the product Σ uu if and only if P has no eigenvalues equal to unity. If we further assume that the noise N is uniformly (isotropically) white, i.e. Σ nn = θ0 I for some
θ_0 > 0, then the covariance satisfies Σ_uu^{-1} = θ_0^{-1}(P − I), and the model H_1 becomes completely specified. The discrete space and time propagator P attains the form
$$
P \equiv (I - L(u_0))\,(I - L(u_\tau)) \cdots (I - L(u_{(m-1)\tau})) ,
\tag{5}
$$
where L : ℝ^n → ℝ^{n×n} is the generalized Laplacian matrix [11,6]. In order to preserve the average value of the signal, one applies the von Neumann boundary conditions, which yield singular Laplacians L irrespective of the evolution u_{0:(m-1)τ}. Thus, nonlinear diffusion scale-spaces result in a propagator P which has an eigenvalue equal to unity. Therefore, the model H_1 has a singular concentration Σ_uu^{-1}, and the problem of infinite covariance matrices emerges thereupon. We suggest resolving this difficulty via the following trick. Instead of adding the uncorrelated noise variable N with the covariance matrix Σ_nn to the signal variable U with the covariance matrix Σ_uu, one can add an uncorrelated 'noise' with the covariance Σ_uu^{-1} to the 'signal' with the covariance Σ_nn^{-1}.

Assumption 3 (Model H_2). Let Σ_uu = Σ_uy = Σ_yu = Σ_nn^{-1} and Σ_yy = Σ_nn^{-1} + Σ_uu^{-1}, and the overall covariance matrix possesses the following inverse:
$$
\Sigma_{H_2}^{-1} =
\begin{pmatrix} \Sigma_{nn}^{-1} & \Sigma_{nn}^{-1} \\ \Sigma_{nn}^{-1} & \Sigma_{nn}^{-1} + \Sigma_{uu}^{-1} \end{pmatrix}^{-1}
=
\begin{pmatrix} \Sigma_{uu} + \Sigma_{nn} & -\Sigma_{uu} \\ -\Sigma_{uu} & \Sigma_{uu} \end{pmatrix}.
\tag{6}
$$
The reader may check that the model H2 retains the conditional expectation given by Eq. (4). Contrary to the model H1 , the concentration Σ −1 uu is now allowed to be singular. The choice Σ nn = θ0 I completely specifies the model H2 even when P has an eigenvalue equal to unity. The problem with infinities is removed at the expense that the additive noise is no longer white.
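To make Eqs. (1), (4) and (5) concrete, the following sketch builds the propagator of a 1-D semi-implicit nonlinear diffusion; the Perona–Malik-type diffusivity, the step size and the dense matrix assembly are illustrative choices of ours, not the exact scheme of Section 4.

```python
# One semi-implicit step is u_{k+1} = (I - L(u_k))^{-1} u_k; the accumulated
# inverse propagator gives mu_{u|y} = P^{-1} u_0 as in Eqs. (1) and (4).
import numpy as np

def generalized_laplacian(u, tau=0.2, lam=5.0):
    n = len(u)
    g = np.gradient(u)
    c = 1.0 / (1.0 + (g / lam) ** 2)          # an edge-stopping diffusivity (example)
    w = 0.5 * (c[:-1] + c[1:])                # conductivities on cell interfaces
    L = np.zeros((n, n))
    for i in range(n - 1):                    # tau * div(c grad(.)), Neumann ends
        L[i, i] -= tau * w[i]
        L[i, i + 1] += tau * w[i]
        L[i + 1, i + 1] -= tau * w[i]
        L[i + 1, i] += tau * w[i]
    return L                                  # -L is symmetric positive semidefinite

def diffuse(u0, m):
    u, P_inv = u0.astype(float).copy(), np.eye(len(u0))
    for _ in range(m):
        step = np.linalg.inv(np.eye(len(u0)) - generalized_laplacian(u))
        u, P_inv = step @ u, step @ P_inv     # u_{(k+1)tau} and the running P^{-1}
    return u, P_inv
```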
3 Applications of Models H_1 and H_2
One can further decompose marginal likelihoods of the parameters of the models H1 and H2 into ‘decorrelation’ of the noise estimate with the model output and entropy maximization. Section 3.1 will also introduce differential entropies which: (i) avoid the normalization problems present in [10], (ii) are consistent with our experience that signals are less random when the scale is coarsened, and (iii) when diffusions are linear and time-homogeneous, the entropies do not depend on the signal, but only on the variance of the additive noise. Two sections are further included to emphasize a special role that the entropies play in the marginal likelihood maximization. Section 3.2 establishes conditions which guarantee that the entropies are monotonous in time. A Bayesian viewpoint is briefly outlined in Section 3.3, where the reduction of the marginal likelihood maximization to ‘decorrelation’ can be seen as a way of imposing a certain a priori density on the parameters which are no longer viewed as deterministic quantities.
3.1 Marginal Likelihood, Correlation, Entropy and SNR
It is somewhat paradoxical that a rather dull formal expression of the marginal likelihood can be seen as a conglomerate of different model selection ideas.

Lemma 1 (Marginal likelihood p(y|H_1)). Assume a white covariance Σ_nn = θ_0 I for some θ_0 > 0. Then,

$$
-2\ln p(y|H_1) = \frac{1}{\theta_0}\Big(\|y-\mu_{u|y}\|^2 + (y-\mu_{u|y})^T\mu_{u|y}\Big) + \ln\big|2\pi(\Sigma_{uu}+\theta_0 I)\big| .
\tag{7}
$$

Proof. It follows from the definition of the marginal likelihood that

$$
2\ln p(y|\theta) = -y^T\Sigma_{yy}^{-1}y - \ln|2\pi\Sigma_{yy}| ,
\tag{8}
$$

where Σ_yy = Σ_uu + θ_0 I. Furthermore,

$$
y^T(\Sigma_{uu}+\theta_0 I)^{-1}y = y^T\Sigma_{uu}^{-1}\mu_{u|y}
\tag{9}
$$
$$
= \theta_0^{-1}\, y^T(y-\mu_{u|y})
\tag{10}
$$
$$
= \theta_0^{-1}\big(\|y-\mu_{u|y}\|^2 + (y-\mu_{u|y})^T\mu_{u|y}\big) . \quad \text{Q.E.D.}
\tag{11}
$$
The difference y − μ_{u|y} can be thought of as the noise estimate, and minimizing the second term on the right-hand side of Eq. (7) is 'orthogonalization', except that (y − μ_{u|y})^T μ_{u|y} can be negative. This quantity can be compared with the correlation [8], which stops the diffusion at the time when

$$
\frac{\mathrm{cov}(y-\mu_{u|y},\,\mu_{u|y})}{\sqrt{\mathrm{var}(y-\mu_{u|y})\,\mathrm{var}(\mu_{u|y})}}
\tag{12}
$$

is minimal. Here cov(u, v) ≡ tr(Cov(W)) with W being a joint vector which takes values w: w^T = (u^T, v^T), and var(u) ≡ tr(Cov(U)). The authors of [8] study only the dot-product estimator which, when neglecting the subtraction of means and the normalization, turns out to be (y − μ_{u|y})^T μ_{u|y}. Given a random variable X, distributed according to the Gaussian density p_{μ,Σ}(x) with mean μ and covariance Σ, the differential entropy is

$$
h(X) \equiv -\int_{\mathbb{R}^n} p(x)\ln p(x)\,dx = \frac{1}{2}\ln|2\pi e\,\Sigma_x| .
\tag{13}
$$

Thus, ln|2π(Σ_uu + θ_0 I)| = 2h(Y|H_1) − n. The meaning of this entropy can also be appreciated by noticing that Σ_uu + θ_0 I = θ_0 Σ_uu(θ_0^{-1}I + Σ_uu^{-1}), and, thus,

$$
2h(Y|H_1) = 2h(U|H_1) - \ln\left|\frac{\mathrm{Cov}(U|y, H_1)}{\theta_0}\right| .
\tag{14}
$$

Therefore, minimizing ln|2π(Σ_uu + θ_0 I)| reduces the uncertainty of the prior density p(u|H_1) and maximizes the generalized signal-to-noise ratio.
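As a small illustration of how Eq. (7) can be evaluated, the sketch below assembles its three terms for a time-homogeneous propagator P = (I − L)^m; it assumes a symmetric, negative definite L (so that Σ_uu = θ_0(P − I)^{-1} exists), and the function name and interface are our own.

```python
# Sketch of the three terms of Eq. (7) for P = (I - L)^m with -L positive definite.
import numpy as np

def h1_criterion(y, L, m, theta0):
    n = len(y)
    P = np.linalg.matrix_power(np.eye(n) - L, m)
    mu = np.linalg.solve(P, y)                      # mu_{u|y} = P^{-1} y, Eq. (4)
    resid = np.sum((y - mu) ** 2) / theta0          # residual term
    orth = (y - mu) @ mu / theta0                   # 'orthogonality' term
    # Sigma_uu + theta0*I = theta0 (P - I)^{-1} P, so its log-determinant is:
    _, logdet = np.linalg.slogdet(2 * np.pi * theta0
                                  * np.linalg.solve(P - np.eye(n), P))
    return resid + orth + logdet                    # = -2 ln p(y | H1), Eq. (7)
```

Scanning m over a range of stopping times and taking the minimizer then realizes the marginal-likelihood stopping rule discussed above.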
It should not be very hard to verify that the marginal likelihood p(y|H_2) decomposes into

$$
-2\ln p(y|H_2) = \theta_0\Big(\|\mu_{u|y}\|^2 + (y-\mu_{u|y})^T\mu_{u|y}\Big) + \ln\big|2\pi(\Sigma_{uu}^{-1}+\theta_0^{-1} I)\big| .
\tag{15}
$$

By noticing that Cov(Y|H_2) = (Cov(U|y, H_1))^{-1}, one discovers that

$$
h(Y|H_2) = n\ln(2\pi e) - h(U|y, H_1) .
\tag{16}
$$
In Section 3.2 we shall prove that the entropy h(Y|H_2) does not decrease in time. Eq. (16) would then imply that the conditional entropy h(U|y, H_1) is nonincreasing (diminishing). The conditional expectation, given by the nonlinear diffusion scale-space via Eqs. (4) and (5), tends towards the steady state, which is a constant signal equal to the average value of the observations y. Intuitively, the signal becomes less random in time, which is reflected in the diminishing of h(U|y, H_1). This can be contrasted with the view in [10], which prefers to apply the Shannon entropy functional directly to the signal. That entropy increases as the constant signal represents a density, and the uniform density is known to attain the highest entropy value. However, the application of entropic arguments in [10] is inconsistent with the fact that each diffusion outcome is conditioned on the knowledge at the previous time instant. Before discussing the monotonicity, it is worth emphasizing that evaluating the entropies h(Y|H_1) and h(Y|H_2) is more difficult than evaluating the criteria in [10]. However, a linear scaling w.r.t. the number of observations can be achieved, irrespective of the dimension of the domain in which the diffusion propagator P is defined. The problem can first be reduced to the inner-product representation via the identities [1]

$$
\ln|I-A| = -\sum_{k=1}^{\infty}\frac{\mathrm{tr}(A^k)}{k}
= -n\sum_{k=1}^{\infty}\frac{1}{k}\left\langle\frac{X^T A^k X}{X^T X}\right\rangle ,
\tag{17}
$$

where the first equality holds for any A ∈ ℝ^{n×n} whose spectral radius does not exceed unity. The variable X ∼ N(0, I) is a standard normal variable which takes values x ∈ ℝ^n. The reader may verify that application of Eq. (17) leads to

$$
\ln\big|2\pi(\Sigma_{uu}+\theta_0 I)\big| = n\ln(2\pi\theta_0) + n\sum_{k=1}^{\infty}\frac{1}{k}\left\langle\frac{X^T P^{-k} X}{X^T X}\right\rangle ,
$$
$$
\ln\big|2\pi(\Sigma_{uu}^{-1}+\theta_0^{-1} I)\big| = n\ln(2\pi\theta_0^{-1}) + n\sum_{k=1}^{\infty}\sum_{m=0}^{k}\frac{(-1)^m (k-1)!}{m!\,(k-m)!}\left\langle\frac{X^T P^{-m} X}{X^T X}\right\rangle .
$$
The matrix-vector product can further be split into univariate diffusions. Splitting is a common practice in approximating multivariate flows with the univariate ones and is nicely documented in [4].
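A hedged sketch of the stochastic estimator behind Eq. (17), in the spirit of Barry and Pace [1], is given below; the truncation order and number of probe vectors are arbitrary choices of ours.

```python
# Monte Carlo estimate of ln|I - A| via Eq. (17); valid when the spectral radius
# of A is below one. Applying it to A = P^{-1} and negating the result gives the
# -ln|I - P^{-1}| term that enters h(Y|H1).
import numpy as np

def logdet_I_minus_A(A, n_probes=50, order=30, seed=0):
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    total = 0.0
    for _ in range(n_probes):
        x = rng.standard_normal(n)
        v, acc = x.copy(), 0.0
        for k in range(1, order + 1):
            v = A @ v                          # v = A^k x, reusing the previous product
            acc += (x @ v) / (x @ x) / k       # <x^T A^k x / x^T x> / k
        total += -n * acc
    return total / n_probes
```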
3.2 Monotonicity of Differential Entropies
Sufficient conditions can be stated which indicate when the entropies h(Y|H_1) and h(Y|H_2) are nondecreasing. Another way to say the same thing is that the negative entropies are Liapunov functions (sequences). More colloquially, the second law of thermodynamics takes place in a virtual world of discrete nonlinear diffusions.

Lemma 2 (Monotonicity of entropies in a discrete time mτ). Let the propagator be time-homogeneous, i.e. P = (I − L)^m. The entropy is nondecreasing, i.e.

$$
h(Y|m+1, H_1) \ge h(Y|m, H_1) ,
\tag{18}
$$

provided that the matrix −L is positive definite. If the propagator is given by the more general Eq. (5), the following inequality is true:

$$
h(Y|m+1, H_2) \ge h(Y|m, H_2) ,
\tag{19}
$$
provided that each matrix −L(u_{mτ}) is positive semidefinite for every m ∈ ℕ.

Proof. The time behavior of the entropy h(Y|m, H_1) is determined by the term ln|2π(Σ_uu + θ_0 I)| which, up to irrelevant constants, equals −ln|I − P_t^{-1}|. Let the eigenvalues λ(−L) be denoted as λ_i for i = 1, ..., n. The Taylor series expansion leads to

$$
-\ln|I-P_t^{-1}| = -\sum_{i=1}^{n}\ln\big(1-(1+\lambda_i)^{-m}\big) = \sum_{i=1}^{n}(1+\lambda_i)^{-m} + \text{h.o.t.} ,
\tag{20}
$$
which follows from ln(1 − x) = −Σ_{k=1}^∞ x^k/k. Clearly, if the matrices −L are positive definite, then each λ_i > 0 and the entropy increases w.r.t. m. If we further assume that the largest term, i.e. (1 + λ_min)^{−(t+1)} with λ_min > 0, is dominating, the decay of the negative entropy will be exponential in time. It follows from Eqs. (4) and (15) that, up to irrelevant constants, the entropy h(Y|H_2) is determined by

$$
\ln|P| = \sum_{i=1}^{n}\ln(1+\lambda_i)^{m} = m\sum_{i=1}^{n}\ln(1+\lambda_i) .
\tag{21}
$$
Therefore, the entropy h(Y|H_2) grows linearly in time, and L is allowed to be singular. The nondecrease of h(Y|H_2) can be established for nonlinear diffusions:

$$
\ln|P| = \sum_{k=0}^{m}\ln|I-L(u_{k\tau})| = \sum_{k=0}^{m}\sum_{i=1}^{n}\ln\big(1+\lambda_i(k)\big) .
\tag{22}
$$

Here each eigenvalue λ_i(k) ≥ 0 comes from the set λ(−L(u_{kτ})) and is now time-dependent. The positivity of the eigenvalues guarantees that the term ln|P| is nondecreasing, which proves the inequality in Eq. (19). Q.E.D.
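A small numerical companion to Eq. (22) is sketched below: it accumulates ln|P| from the eigenvalues of −L(u_k) along a nonlinear diffusion, and with every λ_i(k) ≥ 0 the running sum, and hence h(Y|H_2), never decreases. The callable `laplacian_fn` is any routine returning a symmetric L(u) whose negation is positive semidefinite (e.g. the 1-D sketch of Section 2); it is our own interface, not the paper's.

```python
# Running ln|P| along the diffusion, Eq. (22); the history list is nondecreasing.
import numpy as np

def logdet_P_evolution(u0, m, laplacian_fn):
    u, log_det_P, history = u0.astype(float).copy(), 0.0, []
    for _ in range(m):
        L = laplacian_fn(u)
        lam = np.maximum(np.linalg.eigvalsh(-L), 0.0)   # lambda_i(k) >= 0
        log_det_P += np.sum(np.log1p(lam))              # adds ln|I - L(u_k)|
        history.append(log_det_P)
        u = np.linalg.solve(np.eye(len(u)) - L, u)      # one semi-implicit step
    return history
```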
The most significant term that determines the nondecrease of the entropy h(Y|H_1) in the homogeneous case depends on the smallest eigenvalue λ_min, whereas it is the maximal eigenvalue λ_max which affects the entropy h(Y|H_2). Very convenient bounds follow from the Schur theorem [7], which states that the eigenvalues of a Hermitian matrix majorize its diagonal elements. As a special case, the following inequalities are true:

$$
\lambda_{\min} \le \min_{i\in\{1,2,\dots,n\}}(-l_{ii}) , \qquad
\lambda_{\max} \ge \max_{i\in\{1,2,\dots,n\}}(-l_{ii}) ,
\tag{23}
$$
where l_ii are the diagonal elements of L, and they are typically negative. Positive definiteness of the matrices −L is discussed more thoroughly in [6]. The ideology of [10] now gets a proper justification: utilization of the differential entropy first establishes it as a model complexity measure, and then proves that it is indeed a Liapunov function. In the model of [10], the signal is assumed to be normalized in order to satisfy the constraints of a probability density, which can be written as u_{mτ} = ⟨δ(U − u) | y, H⟩. However, the observations must be preprocessed in order to validate this density, and the diffusions are restricted to positive evolutions. In this work, u_{mτ} ≡ μ_{u|y} ≡ ⟨U | y, H_{1(2)}⟩ and y does not have to be preprocessed.
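Before moving on, the bound in Eq. (23) is easy to check numerically; the snippet below does so for an arbitrary symmetric weighted graph Laplacian standing in for −L (the matrix, its size and the tolerances are illustrative).

```python
# Check of Eq. (23): eigenvalues of a Hermitian matrix majorize its diagonal,
# hence lambda_min <= min(-l_ii) and lambda_max >= max(-l_ii).
import numpy as np

rng = np.random.default_rng(1)
w = rng.uniform(0.1, 1.0, size=9)                        # interface conductivities
negL = np.diag(np.r_[w, 0] + np.r_[0, w]) \
       - np.diag(w, 1) - np.diag(w, -1)                  # a 10 x 10 example of -L
eig = np.linalg.eigvalsh(negL)
diag = np.diag(negL)                                     # the values -l_ii
assert eig.min() <= diag.min() + 1e-12
assert eig.max() >= diag.max() - 1e-12
```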
3.3 Correlation Prior
Eq. (7) suggests that, when maximizing the marginal likelihood, the first and the third terms on its right-hand side prefer small stopping times m. Therefore, if the marginal likelihood p(y|H_1) is to be unimodal w.r.t. the stopping time m, the unnormalized correlation (y − μ_{u|y})^T μ_{u|y} must either possess an extremum, or it should give preference to larger stopping times. In the case of the model H_2, the first term in Eq. (15) gives preference to m → ∞, and the last term acts in the opposite way, so an optimal balance should exist even without the unnormalized correlation. This is what the theory predicts, assuming the correctness of the marginal likelihood criterion. Dropping a particular term in Eqs. (7) and (15) can be seen as imposing a certain a priori density (a prior). Ignoring the first terms leads to the maximum entropy priors, but one can discover the 'orthogonality', or the 'unnormalized correlation', prior too. For example, minimizing the orthogonality term in Eq. (7) is equivalent to the application of the prior

$$
p(\theta|H_1) \propto \theta_0^{-n/2}\exp\Big(-\ln p(y|\mu_{u|y},\theta,H_1) + \ln p(Y|\theta,H_1)\Big) .
\tag{24}
$$

If one applies ln p(Y|μ_{u|y}, θ, H_1) instead of ln p(y|μ_{u|y}, θ, H_1), the prior p(θ|H_1) becomes a uniform improper prior, because the term θ_0^{-n/2} can then be conveniently introduced into the exponent as the Gaussian entropy. The exponent disappears on the basis of the identity h(A) = h(B) + h(B|A). The prior p(θ|H_1) ∝ θ_0^{-n/2} is known as the Jeffreys prior for the multinomial density with the parameter θ_0. Thus, Bayesian inference simplifies the likelihood inference.
4 Experiment
A synthetic problem is indicated in Fig. 1a, where neither the true signal, whose range is [0, 1], nor its edge structure is visible in the noisy values scattered in [−30, 30]. During the simulation, the number of observations n is one million elements. This setting can be contrasted with common experiments on 'real data' where the edge structure is easy to detect by 'eyeballing'. The propagators P are implemented in [3], and we employ the gradient norm s-dependent Perona–Malik-type diffusivity c(s) ≡ 1 − exp(−ν/(s/λ)^{2m}), as suggested in [8]. The parameters are τ/h² = 0.025, m = 8, λ = 200; ν is determined by the software automatically. During the estimation of the gradient, the function is pre-smoothed via averaging over 1000 neighbours. The figures with sharply recovered fronts are not shown, as the signals are simple; it suffices to state that, at the optimal stopping time m = 5, the location of the right edge is recovered at x = 0.245 and the left edge is restored at x = 0.749. The following five stopping criteria have been applied: (i) the marginal likelihood given by Eq. (15), (ii) the entropy h(Y|H_2) contained therein, (iii) the orthogonality (y − μ_{u|y})^T μ_{u|y}, (iv) the correlation [8] in Eq. (12), and (v) the mean absolute error between the true signal (a rectangular pulse) and the diffusion outcome. Fig. 1b summarizes the results. All the criteria are normalized by subtracting their minimal values, dividing them by their range and adjusting the sign. The optimal stopping is at m = 5. The maximum likelihood criterion underestimates the stopping time, but its simplifications are helpful indeed. Contrary to the speculations in [8], detecting the steady state of the entropy does not present difficulties.
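For comparison, here is a small sketch of the correlation stopping rule of Eq. (12) as we read it; `diffusion_step` is a placeholder for one iteration of whichever scheme is used, not the exact filter of this experiment.

```python
# Correlation-based stopping: monitor the sample correlation between the noise
# estimate y - u_t and the outcome u_t, and stop where it is minimal.
import numpy as np

def correlation(y, u):
    a, b = (y - u) - (y - u).mean(), u - u.mean()
    return (a @ b) / (np.sqrt((a @ a) * (b @ b)) + 1e-12)

def stop_by_correlation(y, diffusion_step, max_iter=100):
    u, scores = y.astype(float).copy(), []
    for _ in range(max_iter):
        u = diffusion_step(u)
        scores.append(correlation(y, u))
    return int(np.argmin(scores)) + 1, scores            # stopping iteration, history
```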
Fig. 1. (a): A binary pulse with edges at x = 0.25 and x = 0.75 is blurred by retaining the first twenty components of its Fourier decomposition, which yields the result shown as ū. A white Gaussian noise with variance θ_0 = 25 is then added to create the observations y. Only a sample of 2000 noisy observations out of the set of n = 10^6 elements is visualized; the actual range of the observations is [−30, 30]. (b): Time evolution of the criteria for optimal stopping (legend: -Logl., Entr., Orth., Corr., M.A.E.; horizontal axis: iterations).
5 Conclusion
Consistent statistical inference postulates the joint probability density of any quantity and the estimation of unknowns emerges as the conditioning on what is known. Estimating the density itself via computer simulations and ‘histogramming’ is hard at best. However, when working with the Gaussian probability density, the ‘data-driven’ approach reduces to extending the knowledge of the conditional mean to the level of the joint covariance. This reveals axiomatic principles behind many heuristic model selection criteria. The suggested formalism clearly avoids the problems with an unnecessary image normalization in [10]. Contrary to the work [10], the introduced entropies are consistent with the fact that as the scale becomes coarser, the signal is less random. Simple arguments of positive definiteness determine whether the decrease of the negative entropy is exponential or linear in time. Up to certain scalings, the correlation statistics employed in [8] has been shown to be connected to the maximization of the entropy with an early stopping. It is important to emphasize that the presented Gaussian density construction results in singular concentration matrices. The ‘inverse covariance’ trick circumvents this difficulty, but there may exist some even better ways to solve this problem.
References
1. Barry, R.P., Pace, R.K.: Monte Carlo estimates of the log determinant of large sparse matrices. Lin. Alg. Appl. 289, 41–54 (1999)
2. Carasso, A.S.: Linear and nonlinear image deblurring: A documented study. SIAM J. Numer. Anal. 36(6), 1659–1689 (1999)
3. D'Almeida, F.: Nonlinear diffusion toolbox. MATLAB Central (2003)
4. Fischer, B., Modersitzki, J.: Fast diffusion registration. In: Inverse Problems, Image Analysis, and Medical Imaging. AMS Contemporary Mathematics, vol. 313, pp. 117–129 (2002)
5. Gilboa, G., Sochen, N., Zeevi, Y.Y.: Estimation of optimal PDE-based denoising in the SNR sense. IEEE Trans. Im. Proc. 15(8), 2269–2280 (2006)
6. Girdziušas, R., Laaksonen, J.: When is a discrete diffusion a scale-space. In: Int. Conf. Comp. Vis.
7. Marshall, A.W., Olkin, I.: Inequalities: Theory of Majorization and Its Applications. Academic Press, London (1979)
8. Mrázek, P., Navara, M.: Selection of optimal stopping time for nonlinear diffusion filtering. Int. Journal of Computer Vision 52(2), 189–203 (2003)
9. Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion. IEEE Trans. on PAMI 12(7), 629–639 (1990)
10. Sporring, J., Weickert, J.: Information measures in scale spaces. IEEE Trans. Inf. Theory 45(3), 1051–1058 (1999)
11. Weickert, J., ter Haar Romeny, B.M., Viergever, M.A.: Efficient and reliable schemes for nonlinear diffusion filtering. IEEE Trans. on Image Processing 7(3), 398–410 (1998)
Kernel-Bayesian Framework for Object Tracking

Xiaoqin Zhang¹, Weiming Hu¹, Guan Luo¹, and Steve Maybank²

¹ National Laboratory of Pattern Recognition, Institute of Automation, Beijing, China {xqzhang,wmhu,gluo}@nlpr.ia.ac.cn
² School of Computer Science and Information Systems, Birkbeck College, London, UK [email protected]
Abstract. This paper proposes a general Kernel-Bayesian framework for object tracking. In this framework, the kernel based method—the mean shift algorithm—is embedded seamlessly into the Bayesian framework to provide heuristic prior information for the state transition model, aiming at effectively alleviating the heavy computational load and avoiding the sample degeneracy suffered by conventional Bayesian trackers. Moreover, the tracked object is characterized by a spatial-constraint MOG (Mixture of Gaussians) based appearance model, which is shown to be more discriminative than the traditional MOG based appearance model. Meanwhile, a novel selective updating technique for the appearance model is developed to accommodate changes in both appearance and illumination. Experimental results demonstrate that, compared with Bayesian and kernel based tracking frameworks, the proposed algorithm is more efficient and effective.
1 Introduction

Object tracking is an important research topic in the computer vision community, because it is the foundation of high-level visual problems such as motion analysis and behavior understanding. Recent years have witnessed great advances in the literature, e.g. the snakes model [1], condensation [2], mean shift [3], appearance models [4], the probabilistic data association filter [5] and so on. Generally speaking, most tracking algorithms involve two major issues: the algorithmic framework and the target representation model. The frameworks of existing tracking algorithms can be roughly divided into two categories: deterministic methods and stochastic methods. Deterministic methods usually reduce to an optimization process, which is typically tackled by an iterative search for the minimum of a similarity cost function. In more detail, there exist two major types of similarity functions: SSD (Sum of Squared Differences) [6] and kernel [3] based cost functions. The SSD based cost function is defined as the sum of squared differences between the current image patch and the template, while the kernel based cost function is defined as the distance between two kernel densities. The deterministic methods are usually computationally efficient but often get trapped in local minima. In contrast, the stochastic methods adopt a state space to model the underlying dynamics of the tracking process, and object tracking is viewed as a Bayesian inference problem, which requires generating a number of hypotheses to estimate and propagate the posterior distribution of the state. Compared with their deterministic counterparts, the stochastic methods usually perform more robustly, but they suffer a heavy computational load due to the large
number of hypotheses, especially in a high-dimensional state space, which may suffer from the curse of dimensionality. Recently, some researchers have combined the merits of these two approaches to achieve more reliable performance [7,8]. In [7], random hypotheses are guided by a gradient based deterministic search which is carried out based on the sum of differences between two frames. Zhou et al. [8] propose an adaptive state transition model which is extracted from the information contained in the particle configuration. In essence, these methods rely on a constant-illumination assumption, which is hard to satisfy in practice, and moreover, they are far from being a general tracking framework that can be extended to other representation models. The target representation model is also a basic issue to be considered in tracking algorithms. The image patch [6], which takes the set of pixels in the target region as the model representation, is a direct way to model the target, but it loses the discriminative information that is implicit in the layout of the target. The color histogram [3] provides global statistical information about the target region and is robust to noise, but it is very sensitive to illumination changes. Recently the MOG (Mixture of Gaussians) [4,8,9] based appearance model has received more and more attention for the following merits: (1) it can model the multi-modal distribution of the appearance; (2) it can easily capture changes of the appearance; (3) it requires little computation and storage. However, the traditional MOG based appearance model considers each pixel independently and with the same level of confidence, which is not reasonable in practice. In view of the foregoing discussion, we propose a general Kernel-Bayesian tracking framework that combines the merits of both deterministic and stochastic methods. The main contributions of the proposed tracking approach are summarized as follows:

1. The kernel based method—the mean shift algorithm—is embedded into the Bayesian framework to give heuristic prior information to the state transition model, which eases the computational burden and avoids sample degeneracy in the Bayesian tracking framework.
2. The appearance of the target is modeled by a spatial constraint MOG, whose parameters are estimated via an on-line EM algorithm.
3. A novel selective adaptation scheme for updating the appearance model is adopted to reliably capture changes in appearance and illumination and to effectively prevent the model from drifting away.

The rest of this paper is structured as follows. A brief review of kernel based and Bayesian based tracking algorithms is presented in Section 2. The details of the Kernel-Bayesian tracking framework are described in Section 3. A spatial constraint MOG based appearance model and its application in the Kernel-Bayesian framework are discussed in Section 4. Experimental results are presented in Section 5, and Section 6 is devoted to the conclusion.
2 Review of Kernel Based and Bayesian Based Trackers

In this section, we briefly review the two typical tracking algorithms: kernel based and Bayesian based trackers.
2.1 Kernel Based Tracker

The kernel based tracker tries to find local minima of a similarity measure between the kernel density estimates of the candidate and target images. The most famous kernel based method is the mean shift algorithm, which first appeared in [10] as the gradient estimate of a density function and was introduced for visual tracking by Comaniciu [3] in 2000. Mean shift is a non-parametric mode seeking technique that shifts each data point to the average of the data points in its neighborhood [10]. Let A be a finite set embedded in an n-dimensional space X; the mean shift vector of x is defined as

$$
ms = \frac{\sum_{a \in A} K(a-x)\,w(a)\,a}{\sum_{a \in A} K(a-x)\,w(a)} - x , \qquad a \in A,\; x \in X,
\tag{1}
$$
where K is a kernel function and w is a weight function. The mean shift algorithm works by iteratively shifting the data point in the direction of the mean shift vector until convergence. In the mean shift based tracking algorithm, the convergence property is described by a Bhattacharyya coefficient [3], which reflects the similarity between the target and candidate kernel densities.

2.2 Bayesian Based Tracker

Another popular way is to view tracking as an on-line Bayesian inference process for estimating the unknown state s_t at time t from sequential observations o_{1:t} perturbed by noise. The dynamic state-space form employed in the Bayesian inference framework is as follows [11]:

$$
\text{state transition model:}\quad s_t = f_t(s_{t-1}, \epsilon_t) ,
\tag{2}
$$
$$
\text{observation model:}\quad o_t = h_t(s_t, \nu_t) ,
\tag{3}
$$

where s_t, o_t represent the system state and observation, ε_t, ν_t are the system noise and observation noise, f_t(·,·) characterizes the kinematics of the object, and h_t(·,·) models the observation. The key idea of Bayesian inference is to approximate the posterior probability distribution by a weighted sample set {(s^(n), π^(n)) | n = 1, ..., N}. Each sample consists of an element s^(n), which represents a hypothetical state of the object, and a corresponding discrete sampling probability π^(n), where Σ_{n=1}^N π^(n) = 1. First, the sample set is resampled to avoid the degeneracy problem, and the new samples are propagated according to the state transition model. Then each element of the set is weighted with probability π^(n) = p(o_t | S_t = s_t^(n)), which is calculated from the observation model. Finally, the state estimate ŝ_t can be either the minimum mean square error (MMSE) estimate or the maximum a posteriori (MAP) estimate.
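A minimal sketch of one such weighted-sample recursion is given below; the random-walk transition, the generic `likelihood` callable and the MAP read-out are illustrative stand-ins, not the adaptive model introduced in Section 3.

```python
# One SIR-style step: resample, propagate through the transition model,
# reweight with the observation model, and read off a MAP state estimate.
import numpy as np

def particle_filter_step(samples, weights, observation, likelihood, rng,
                         noise_std=1.0):
    n = len(samples)
    idx = rng.choice(n, size=n, p=weights)                  # resampling
    samples = samples[idx]
    samples = samples + noise_std * rng.standard_normal(samples.shape)
    weights = np.array([likelihood(observation, s) for s in samples])
    weights = weights / weights.sum()                       # pi^(n), sums to one
    return samples, weights, samples[np.argmax(weights)]    # MAP estimate
```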
3 Kernel-Bayesian Based Tracking Framework

The kernel based methods enjoy low computational complexity but often get trapped in local minima/maxima, while Bayesian based methods improve the robustness of the tracking process but suffer a large computational load, since a huge number of hypotheses must be generated to cover the target. As a result, we propose a unified Kernel-Bayesian tracking framework to combine the merits of both methods.
3.1 Kernel-Bayesian Framework

A state transition model is a basic component to be considered when Bayesian inference is adopted for tracking. Most existing approaches use a naive random walk around the previous system state [12] or learn the model from pre-labeled video sequences [13]. The former contains little information about the motion of the target, and thus involves a quite large computational load, since many hypotheses need to be randomly generated to cover the target, while the latter often suffers from an overfitting problem and is consequently applicable only to the training sequences. The mean shift algorithm provides the direction of motion toward the ground truth in its iterations, which motivates us to embed the kernel method into the Bayesian framework to provide a heuristic prior. In detail, the mean shift algorithm is first applied to the current frame to obtain the direction of motion and the offset of the state, which are then incorporated into the transition model as prior information. In this way, the kernel based method and the Bayesian based method are combined in a unified framework. Furthermore, it has been shown [14] that symmetric kernels are amenable to mean shift iterations, which means that our framework is general enough for all symmetric appearance models.

3.2 An Optimization View

A reinterpretation of the Kernel-Bayesian framework from an optimization point of view is presented to show why this framework can combine the merits of both the kernel method and the Bayesian method. To give a clear view, an input image with three templates superimposed, corresponding to the initialization, the local maximum and the global maximum, is illustrated in the left column of Fig. 1, and its cost function based on our appearance model is shown in the right column of Fig. 1. As witnessed by Fig. 1, starting from the initial position, the kernel method converges to the local maximum, which is near the global maximum. It is clear that a small number of hypotheses generated around the local maximum is enough to cover the global maximum. Otherwise, if the tracker starts from the initial position, numerous hypotheses need to be generated in order to reach the target, and the algorithm may even run into the curse of dimensionality in the high-dimensional case. In our proposed framework, the deterministic optimization method is used to refine the initial position and provide a heuristic prior, and the stochastic method is then adopted to reach the global optimum.
Fig. 1. (left) An input image with three templates superimposed, corresponding to the initialization (red), local maximum (green) and global maximum (blue), and (right) its cost function
4 The Proposed Tracking Algorithm

An overview of the proposed algorithm is presented in Fig. 2. First, kernel based prior information is obtained through mean shift iterations, which controls both the number of hypotheses and the directional offset of the state in the state transition model. After the hypothesis generation process, each hypothesis is evaluated by the spatial constraint MOG based observation model. Finally, a maximum a posteriori (MAP) estimate of the state is obtained based on the probability of each hypothesis. Meanwhile, a selective updating scheme is developed to update the parameters of the appearance model to accommodate the changes of the object and the environment. Each component of this algorithm is described in detail in the following sections.
Fig. 2. The flow chart of our Kernel-Bayesian based tracking algorithm
4.1 Spatial Constraint MOG Based Appearance Model

The appearance of the target is modeled by a spatial constraint MOG, with the parameters estimated by an on-line EM algorithm.

Appearance Model: Similar to [4,8], the appearance model consists of three components S, W, F, where the S component captures temporally stable images, the W component characterizes the two-frame variations, and the F component is a fixed template of the target to prevent the model from drifting away. However, this appearance model treats each pixel independently and discards the spatial outline of the target, so it may fail when, for instance, there are several similar objects close to the target or partial occlusion occurs. In our work, we apply a 2-D Gaussian spatial constraint to the SWF based appearance model, whose mean vector is the coordinate of the center position and whose diagonal covariance elements are proportional to the size of the target in the corresponding spatial direction, as illustrated in Fig. 3. As a result, the likelihood function of the spatial constraint appearance model can be formulated as

$$
p(o_t|s_t) = \prod_{j=1}^{d}\left\{ N\big(x(j);\, x_c, \Sigma_c\big) \cdot \sum_{i=s,w,f} \pi_{i,t}(j)\, N\big(o_t(j);\, \mu_{i,t}(j),\, \sigma_{i,t}^2(j)\big) \right\}
\tag{4}
$$
where N(x; μ, σ²) is a Gaussian density,

$$
N(x;\mu,\sigma^2) = (2\pi\sigma^2)^{-1/2}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) ,
\tag{5}
$$
Fig. 3. A 2-D Gaussian spatial constraint MOG based appearance model
and {π_{i,t}, μ_{i,t}, σ_{i,t}, i = s, w, f} represent the mixture probabilities, mixture centers and mixture variances respectively, d is the number of pixels inside the target, and x_c and Σ_c represent the center of the target and its covariance matrix in the spatial space.

Parameter Estimation: In order to make the model parameters depend more heavily on the most recent observations, we assume that the previous appearance is exponentially forgotten and new information is gradually added to the appearance model. To avoid having to store all the data from previous frames, an on-line EM algorithm [4] is used to estimate the parameters as follows.

Step 1: During the E-step, the ownership probability of each component is computed as

$$
m_{i,t}(j) \propto \pi_{i,t}(j)\, N\big(o_t(j);\, \mu_{i,t}(j),\, \sigma_{i,t}^2(j)\big) ,
\tag{6}
$$

which fulfills Σ_{i=s,w,f} m_{i,t}(j) = 1.

Step 2: The mixing probability of each component is estimated as

$$
\pi_{i,t+1}(j) = \alpha\, m_{i,t}(j) + (1-\alpha)\,\pi_{i,t}(j) ;\quad i = s, w, f ,
\tag{7}
$$

and a recursive form for the moments {M_{k,t+1}; k = 1, 2} is evaluated as

$$
M_{k,t+1}(j) = \alpha\, o_t^k(j)\, m_{s,t}(j) + (1-\alpha)\, M_{k,t}(j) ;\quad k = 1, 2 ,
\tag{8}
$$

where α = 1 − e^{-1/τ} acts as a forgetting factor and τ is a predefined constant.

Step 3: The mixture centers and variances are estimated in the M-step:

$$
\mu_{s,t+1}(j) = \frac{M_{1,t+1}(j)}{\pi_{s,t+1}(j)} , \qquad
\sigma_{s,t+1}^2(j) = \frac{M_{2,t+1}(j)}{\pi_{s,t+1}(j)} - \mu_{s,t+1}^2(j) ,
$$
$$
\mu_{w,t+1}(j) = o_t(j) , \quad \sigma_{w,t+1}^2(j) = \sigma_{w,1}^2(j) , \qquad
\mu_{f,t+1}(j) = \mu_{f,1}(j) , \quad \sigma_{f,t+1}^2(j) = \sigma_{f,1}^2(j) .
$$
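A per-pixel sketch of these updates is given below; the dictionary layout, the moment bookkeeping and the fixed W/F variances are our own illustrative choices.

```python
# On-line EM for one pixel j of the SWF model: E-step ownerships (Eq. 6),
# mixing probabilities (Eq. 7), moment recursions (Eq. 8), and the M-step.
import numpy as np

def normal_pdf(x, mu, sigma2):
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def online_em_update(pi, mu, sigma2, M1, M2, o_t, alpha):
    # E-step: ownership probabilities
    m = {i: pi[i] * normal_pdf(o_t, mu[i], sigma2[i]) for i in 'swf'}
    z = sum(m.values())
    m = {i: m[i] / z for i in 'swf'}
    # mixing probabilities
    pi = {i: alpha * m[i] + (1 - alpha) * pi[i] for i in 'swf'}
    # moments of the stable component
    M1 = alpha * o_t * m['s'] + (1 - alpha) * M1
    M2 = alpha * o_t ** 2 * m['s'] + (1 - alpha) * M2
    # M-step: S re-estimated, W follows the current frame, F stays fixed
    mu['s'] = M1 / pi['s']
    sigma2['s'] = M2 / pi['s'] - mu['s'] ** 2
    mu['w'] = o_t
    return pi, mu, sigma2, M1, M2
```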
In fact, updating the appearance model every frame may be dangerous when, for instance, some background is absorbed into the target or the target is occluded. Thus, we develop a selective adaptation scheme to tackle such cases, described in detail in Section 4.3.

4.2 Kernel-Bayesian Based Tracker

As stated in Section 3, the motivation for embedding the mean shift algorithm into the Bayesian filtering framework is to provide a heuristic prediction to the state transition
model, and thus to ease the computational burden and avoid the sample degeneracy problem. Suppose the target is well localized at x_{t−1} in frame t − 1; we first apply mean shift iterations to frame t, and the convergent position is taken as the refined initialization, denoted x̂_t. In order to embed the spatial constraint appearance model into the mean shift algorithm, the weighted kernel function is defined as

$$
\omega(x) = N(x;\, x_c, \Sigma_c) \sum_{i=w,s,f} \pi_{i,t}(x)\, N\big(o_t(x);\, \mu_{i,t}(x),\, \sigma_{i,t}^2(x)\big) .
\tag{9}
$$
The flat kernel is chosen, so the mean shift iteration can be written as

$$
\hat{x}_t = \frac{\sum_{x_i=1}^{d} w(x_i)\, x_i}{\sum_{x_i=1}^{d} w(x_i)} , \qquad x_i \in \text{candidate}.
\tag{10}
$$
The result obtained from the mean shift iterations is then integrated into a first-order state transition model to form an adaptive state transition model:

$$
s_t = \hat{s}_{t-1} + \mathrm{Affine}(\hat{x}_t - x_{t-1}) + \epsilon_t ,
\tag{11}
$$

where Affine(·) denotes the affine transformation. Meanwhile, the accuracy of the refined position is evaluated by our appearance model in order to adaptively control the number of hypotheses and the system noise ε_t. Finally, Bayesian inference is carried out based on the adaptive state transition model to achieve a robust and efficient tracking algorithm.

4.3 Selective Adaptation for the Appearance Model

In most tracking applications, the tracker must simultaneously deal with changes of both the target and the environment, so it is necessary to design an adaptation scheme for the appearance model. However, over-updating the model may gradually introduce background noise into the target model, eventually causing the model to drift away. Thus, a proper updating scheme is of significant importance for the tracking system. In this part, we propose a selective updating scheme based on three different confidence measures of the appearance model. First, the MAP-estimated state is evaluated by the full appearance model, the combined S and W components, and the F component, giving scores denoted π_a, π_sw, π_f, and {T_a, T_sw, T_f} represent the corresponding thresholds. Each component of the appearance model is then updated selectively as in Table 1. The S and W components together effectively capture the variations of the target, and F prevents the model from drifting away. As a result, such a selective updating strategy not only effectively captures the variations of the target, but also reliably prevents drifting during the tracking process.
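The sketch below strings Eqs. (9)–(11) together: a flat-kernel mean shift refinement of the centre, followed by hypothesis generation around the shifted previous state. The callbacks, the translation-only reading of Affine(·) and the noise scale are our own simplifications, not the paper's exact implementation.

```python
# Mean shift refinement with the weighted kernel of Eqs. (9)-(10), then the
# adaptive first-order transition of Eq. (11) for a 6-dim affine state.
import numpy as np

def mean_shift_refine(get_candidate_pixels, weight_fn, x_init, n_iter=10, tol=1e-3):
    """get_candidate_pixels(x): (d, 2) pixel coordinates of the candidate region
    centred at x; weight_fn(p): the weight omega at pixel p."""
    x = np.asarray(x_init, dtype=float)
    for _ in range(n_iter):
        pts = get_candidate_pixels(x)
        w = np.array([weight_fn(p) for p in pts])
        x_new = (w[:, None] * pts).sum(axis=0) / w.sum()   # Eq. (10)
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x                                               # refined centre x_hat_t

def generate_hypotheses(s_prev, x_prev, x_hat, n_hyp, noise_scale, rng):
    shift = np.zeros(6)
    shift[:2] = x_hat - x_prev                             # offset applied to (tx, ty)
    return s_prev + shift + noise_scale * rng.standard_normal((n_hyp, 6))  # Eq. (11)
```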
5 Experimental Results

In our experiments, an affine transformation is chosen to model the object motion. Specifically, the motion is characterized by s = (t_x, t_y, a_1, a_2, a_3, a_4), where {t_x, t_y} denote the 2-D translation parameters and {a_1, a_2, a_3, a_4} are deformation parameters. Each candidate image is rectified to a 30×15 patch, and thus the feature is a 450-dimensional vector with zero-mean, unit-variance normalization. All of the experiments run in real time on a dual-CPU Pentium IV 3.2 GHz PC with 512 MB memory.
Table 1. Selective Adaptation for the Appearance Model

if (πa > Ta)
  if (πsw > Tsw) && (πf > Tf)
    Update the appearance model of the target;
  else if (πsw > Tsw) && (πf ≤ Tf)
    Only update the SW components of the appearance model;
  else if (πsw ≤ Tsw) && (πf > Tf)
    Only update the F component of the appearance model;
  else if (πsw ≤ Tsw) && (πf ≤ Tf)
    Keep the appearance model of the target;
  end if
end if
(a) Tracking performance in Kernel-Bayesian framework
(b) Tracking performance in traditional Bayesian framework
(c) Tracking performance in traditional kernel based framework Fig. 4. Experimental performance in the different tracking frameworks
5.1 Single Object Tracking

In this section, three sets of experiments are presented to demonstrate the claimed contributions of the proposed tracking algorithm. The first part shows the experimental performance of our tracking framework and a comparison with the traditional Bayesian framework and the kernel based framework in both tracking accuracy and efficiency. As illustrated in Fig. 4, the first row shows the tracking performance of our algorithm, where the tracker efficiently and effectively catches the target. The second row gives similar tracking performance in the traditional Bayesian framework with 400 hypotheses. In the third row, it is clear that the kernel method usually gets trapped in local maxima, leading to inaccurate localization. Furthermore, the accuracy and efficiency of these tracking frameworks are quantitatively evaluated for a more thorough analysis. The tracking time with respect to the frame index is shown in the left panel of Fig. 5, and the tracking accuracy is measured by the MSE (mean square error) between the tracked position and the ground truth, shown in the right panel of Fig. 5.
Fig. 5. (left) Tracking time with respect to the frame index, (right) MSE between estimated points and groundtruth (red: kernel, green: Bayesian, blue: kernel-Bayesian)
The results in Fig. 5 show that the kernel method is efficient but performs poorly in localization. In contrast, the Bayesian tracking algorithm achieves more accurate performance due to the large number of hypotheses: it takes 81 ms (milliseconds) of tracking time per frame, and its average MSE is 8.6521. In the Kernel-Bayesian framework, the tracking time per frame is only 55 ms on average, which greatly eases the computational burden of the Bayesian framework, and the average MSE of our algorithm is only 5.8012, because the kernel method provides a heuristic prior to the state transition model, which avoids the sample degeneracy suffered by the Bayesian framework and thus leads to accurate localization. The comparison of the spatial constraint MOG based appearance model with the traditional MOG based appearance model is presented in the second part. It is clear that the SMOG based appearance model handles well the case where there are similar objects around the target, while the traditional MOG based appearance model fails, as shown in Fig. 6. The mechanism behind this is that the former extracts the spatial layout of the target, which makes the model more discriminative. The last part tests the proposed algorithm in varying scenes. In Fig. 7(a), it is clear that the selective updating scheme easily absorbs the illumination changes. Fig. 7(b) shows the result of our algorithm tracking a girl's head with an out-of-plane rotation, from which we notice that the scheme also effectively captures the variations of appearance.
(a) Tracking with spatial constraint SWF model
(b) Tracking with traditional SWF model Fig. 6. Experimental results with different appearance models in the clutter scene
(a) Scene with large illumination changes
(b) Object with out-plane rotation Fig. 7. Experimental results in different scenes (illumination change, out plane rotation)
Fig. 8. Results of multiple objects tracking with the proposed tracking algorithm
5.2 Multiple Object Tracking
Although the experiments above mainly involve a single object, our algorithm can easily be extended to multiple object tracking. As shown in Fig. 8, three objects are initialized manually and are tracked well in the subsequent frames, including some cases of partial occlusion, because the spatial constraint on appearance makes the model less dependent on peripheral pixels and the selective updating scheme effectively prevents noise from being introduced into the appearance model. Due to its computational efficiency, our algorithm performs better and has more potential to handle the various problems of multiple object tracking than other tracking methods.
6 Conclusion
This paper has proposed a robust and efficient Kernel-Bayesian framework for visual tracking. In this framework, the object to be tracked is characterized by a spatial constraint MOG based appearance model, which is shown to be more discriminative than the traditional MOG based appearance model. Our tracking framework combines the merits of both stochastic and deterministic tracking approaches in a unified way: the mean shift algorithm is embedded seamlessly into the Bayesian framework to give a heuristic prediction for the state transition model, which effectively alleviates the heavy computational load and avoids the sample degeneracy suffered by conventional Bayesian trackers. Moreover, a selective updating scheme is developed to effectively accommodate changes in both appearance and illumination. Experimental results have demonstrated the efficiency and effectiveness of the proposed tracking algorithm.
Acknowledgment This work is partly supported by NSFC (Grant No. 60520120099 and 60672040) and the National 863 High-Tech R&D Program of China (Grant No. 2006AA01Z453).
Markov Random Field Modeled Level Sets Method for Object Tracking with Moving Cameras Xue Zhou, Weiming Hu, Ying Chen, and Wei Hu National Laboratory of Pattern Recognition, Institute of Automation, Beijing, China
Abstract. Object tracking using active contours has attracted increasing interest in recent years due to acquisition of effective shape descriptions. In this paper, an object tracking method based on level sets using moving cameras is proposed. We develop an automatic contour initialization method based on optical flow detection. A Markov Random Field (MRF)-like model measuring the correlations between neighboring pixels is added to improve the general region-based level sets speed model. The experimental results on several real video sequences show that our method successfully tracks objects despite object scale changes, motion blur, background disturbance, and gets smoother and more accurate results than the current region-based method.
1 Introduction
Object tracking is an active research topic in the computer vision community, because it is the foundation of high-level visual problems such as motion analysis and behavior understanding. Current object tracking methods generally use predefined coarse shape models (rectangle or ellipse) to track objects [5,11]. Due to their inflexibility in dealing with scale changes, these methods have difficulty in accurately tracking non-rigid objects, especially with moving cameras. In order to overcome this disadvantage, methods based on active contours have been proposed, which provide detailed shape information for rigid or non-rigid objects. Level set is an implicit representation of active contours [3]. Due to being numerically stable and capable of handling topological changes, the level set method is getting more and more popular, compared with explicit representation modes characterized by parameterized contours. Active contour-based tracking can be viewed as an iterative process of evolving the initial contour to the desired object boundary based on minimizing an energy function. This energy function often consists of three terms corresponding to internal energy, external energy and shape energy respectively. The first term concerns the internal constraints such as the evolution force based on curvature, the second term concerns the image attachment, which has no correlation with the contour itself, and the last term reflects shape prior constraints constructed by some statistical learning methods [18,19]. In this paper, we mainly consider the first two terms. In terms of the different measurements used in the external energy, active contour-based methods are classified into two categories: edge-based ones and region-based ones. Snake [1] is a typical edge-based
active contour model considering the gradient of the image near the boundary of the object. An improved geodesic model, which compared with the snake considers the intrinsic geometric measures of the image, is proposed by Caselles et al. [12]. Edge-based methods are subject to a number of problems: (1) they only consider the local information around the contour, and initialization near the object is necessary; (2) they are sensitive to image noise. Consequently, an effective alternative is region-based methods, which consider the global image information. The measurements of an image can be some statistical quantities, such as the mean, variance, texture or histogram of the region concerned. Zhu and Yuille [15] present a statistical and variational framework for image segmentation using a region competition algorithm. Recently, Yilmaz et al. [16] adopted the features of both object and background regions in the level sets speed model. Current region-based methods usually establish the energy function in a Bayesian framework based on the segmentation idea, which means dividing an image into object and background regions. They assume the pixels in each region are independent when computing the region likelihood function. This assumption in some sense ignores the correlations between pixels, with the result that the contour is sensitive to background disturbance (similar color or texture between object and background) and is not smooth. Another difficulty in contour-based tracking is the contour initialization, which is still an open problem. Although the general manual method is accurate, it needs human interaction. Background subtraction methods [13] for initializing contours are only effective when using stationary cameras. In [9], the initial contour can be anywhere in the image; however, it is then time-consuming to converge to the correct boundary. The method we propose in this paper tries to solve the problems mentioned above. In our method, we adopt the region-based active contours method and represent the contour using a level sets mode. Our method has the following features:
– We model correlations between neighboring pixels (instead of treating them as independent) using a Markov Random Field (MRF)-like model. The computation of a single pixel's likelihood function not only depends on the pixel itself, but also considers the neighboring pixels. The correlations between neighboring pixels are measured by a penalty term. With this penalty term, the contour can be evolved to the desired object boundary more tightly and smoothly. Furthermore, our method gets rid of the influence of the background disturbance to some extent, compared with general methods without this penalty term.
– An automatic and fast initialization method based on optical flow detection is proposed. Closed initial contours near the boundaries of extracted objects are obtained.
The remainder of this paper is organized as follows: Section 2 describes the initialization process, which comprises generating the initial contour and establishing the prior models. Section 3 introduces the penalty term and the improved level sets speed function. Section 4 shows experimental results. The last section summarizes the paper.
2 Initialization
The initialization process of our method consists of two steps: (1) locating the initial contour and establishing the level set function; (2) modeling the object and background regions using features such as color, texture, etc.
2.1 Initialization of Closed Contours
Previous methods draw a closed contour near an object manually. Methods using the motion detection boundaries acquired by background subtraction as the initial contours of the moving objects are only effective with stationary cameras. With respect to a moving camera, motion detection based on optical flow is very popular [4]. Thus, we use the motion detection boundaries obtained by optical flow as the initial contours. Generally, the optical flow field can be viewed as the motion field. For each pixel, a velocity vector $(u, v)$ is defined, where $u$ and $v$ represent the velocity components in the x and y directions respectively. The detailed process of our method for initializing contours is described as follows:
1. Optical flow is computed iteratively using consecutive frames, based on the gradient of the image. Then the velocity vector $(u, v)$ of each pixel is obtained.
2. Reducing noise. The velocity vector is set to $(0, 0)$ when its magnitude is less than a predefined threshold T, i.e.,
$(u, v) = (0, 0) \quad \text{if} \quad \sqrt{u^{2} + v^{2}} < T$   (1)
3. Find the most probable moving regions, which should exhibit large and coherent motion. A coarse shape model is moved over the image to detect the most probable moving regions using a three-step algorithm. Firstly, a series of initial contour candidates is obtained by changing the position of the coarse shape model with fixed parameters (e.g., the radius of a circle). Each candidate is assigned a weight according to
$\text{weight} = \alpha \sum_{x \in \Omega} \lVert v_{x} \rVert^{2} - (1 - \alpha)\, \sigma^{2}\big(\arg(v_{x})\big)$   (2)
where $x = (x, y)$ is the coordinate vector of a pixel belonging to the internal area $\Omega$ of the candidate contour, $v_{x}$ is its flow vector, $\sigma^{2}(\arg(v_{x}))$ is the variance of the flow phase over $\Omega$, and $\alpha$ (ranging between 0 and 1) is the parameter used to weight the two terms. Secondly, the candidates are sorted by weight in descending order and the top N (N ≥ 1) are chosen as the initial contours; N, the number of detected initial contours, is determined by the principle that the weights of the top N candidates are far larger than those of the others. It is also assumed that the detected initial contours are not too close to one another, which guarantees non-repetitive initialization. Thirdly, after the optimal positions are obtained, the shape parameter of each detected initial contour is refined with its position fixed. For each contour the optimal parameter $\lambda^{*}$ should satisfy
$\lambda^{*} = \arg\max_{\lambda}(\text{weight})$   (3)
Consequently, the coarse shapes near the moving objects are obtained as the initial contours of these objects. Although our initialization method is not as accurate as some manual methods, for region-based active contour methods a precise initialization is not necessary [14], and the coarse initial contour can still be evolved to enclose the object tightly. After the initial contours are obtained, we compute the level set function $\phi(x, y, t)$ for each contour. The level set function is the signed Euclidean distance between the point $x = (x, y)$ and the contour C(t). In our method, we assume $\phi(x, y, t)$ is positive when x belongs to the external part of the contour C(t) and negative for the internal part of C(t).
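A rough sketch of the initialization procedure described above (Eqs. 1-3) is given below. It assumes the dense flow field (u, v) has already been computed by some optical-flow routine; the circular shape model, the grid search step, and the exact weight of Eq. (2) follow our reconstruction and are not guaranteed to match the authors' implementation.

```python
import numpy as np

def init_contours(u, v, radius=20, alpha=0.5, thresh=0.2, n_contours=1, step=8):
    """Score circular candidate contours on a dense optical-flow field (Eqs. 1-3).

    u, v  : H x W arrays with the per-pixel flow components.
    alpha : weight between the flow-magnitude and phase-variance terms of Eq. (2).
    """
    mag = np.hypot(u, v)
    u = np.where(mag < thresh, 0.0, u)       # Eq. (1): suppress weak, noisy flow
    v = np.where(mag < thresh, 0.0, v)
    mag = np.hypot(u, v)
    phase = np.arctan2(v, u)
    h, w = mag.shape
    yy, xx = np.mgrid[0:h, 0:w]

    candidates = []
    for cy in range(radius, h - radius, step):
        for cx in range(radius, w - radius, step):
            inside = (xx - cx) ** 2 + (yy - cy) ** 2 <= radius ** 2
            m = mag[inside]
            moving = m > 0
            if not np.any(moving):
                continue
            # Eq. (2): reward strong flow, penalise incoherent flow directions
            weight = alpha * np.sum(m ** 2) - (1 - alpha) * np.var(phase[inside][moving])
            candidates.append((weight, cx, cy))

    candidates.sort(key=lambda c: c[0], reverse=True)
    # A further pass could refine the radius of each kept circle (Eq. 3).
    return candidates[:n_contours]
```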
2.2 Construction of Prior Models
Modeling the object and background regions is indispensable to region-based active contour methods. In this paper, we present a hierarchical method for fusing color and texture features using a Gaussian Mixture Model (GMM), which is a variant of Stauffer's method [7]. The first step is to train a GMM using the color feature, where the HSV color space is chosen. The second step is to label each sample based on the trained color GMM; the label is the index j of the Gaussian in the mixture. The final step is to model these labeled samples using the texture feature, which is computed by a gray level co-occurrence matrix (GLCM) method [6]. We assume that samples with the same color are often adjacent; thus, the samples with the same label are modeled as a single Gaussian. As a whole, the estimated probability density function (pdf) at pixel $x_{i}$ in the joint color-texture space can be formulated as:
$p(x_{i}) = \sum_{j=1}^{k} \omega_{j}\, \mathcal{N}(x_{i}^{c}; \mu_{j}^{c}, \Sigma_{j}^{c})\, \mathcal{N}(x_{i}^{t}; \mu_{j}^{t}, \Sigma_{j}^{t})$   (4)
where $x_{i}^{c}$ and $x_{i}^{t}$ are respectively the color feature and the texture feature at pixel $x_{i}$, $\mathcal{N}(x; \mu, \Sigma)$ is a Gaussian pdf, $\omega_{j}$ is the weight parameter of the GMM, and k is the number of Gaussian modes. The method for updating the GMM parameters is similar to [7], which uses a learning-rate parameter.
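The joint color-texture density of Eq. (4) can be evaluated as in the sketch below; the container format for the per-mode parameters is purely an assumption for illustration, and the color and texture features would in practice come from the HSV conversion and GLCM computation mentioned above.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density N(x; mean, cov)."""
    x = np.atleast_1d(x).astype(float) - np.atleast_1d(mean).astype(float)
    cov = np.atleast_2d(cov).astype(float)
    k = cov.shape[0]
    norm = 1.0 / np.sqrt((2 * np.pi) ** k * np.linalg.det(cov))
    return float(norm * np.exp(-0.5 * x @ np.linalg.inv(cov) @ x))

def joint_color_texture_pdf(x_color, x_texture, weights, color_params, texture_params):
    """Eq. (4): p(x_i) = sum_j w_j N(x_i^c; mu_j^c, S_j^c) N(x_i^t; mu_j^t, S_j^t).

    color_params / texture_params: lists of (mean, covariance) pairs, one per mode.
    """
    p = 0.0
    for w_j, (mc, sc), (mt, st) in zip(weights, color_params, texture_params):
        p += w_j * gaussian_pdf(x_color, mc, sc) * gaussian_pdf(x_texture, mt, st)
    return p
```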
3 Evolving the Contour
Our method for evolving contours is motivated by the segmentation idea [15,16]. Its objective is to find the optimal partition, represented by a contour, based on the initial contour in the current frame. The segmentation result in the current frame is used as the initial contour of the object in the next frame; then, in consecutive images, the object is tracked iteratively. The segmentation problem in the current frame can be modeled as a MAP problem. We let the posterior probability of obtaining a partition $\mathcal{P}(R)$ of a given image I be represented by $P(\mathcal{P}(R) \mid I)$. In accordance with the Bayesian formula, and considering that the term P(I) is a constant, this posterior is proportional to:
$P(\mathcal{P}(R) \mid I) \propto P(I \mid \mathcal{P}(R))\, P(\mathcal{P}(R))$   (5)
Generally the prior probability $P(\mathcal{P}(R))$ is modeled as a smoothness regularization term which depends on the length of the contour [17]. In our method, $P(\mathcal{P}(R))$ is omitted, because the penalty term introduced later has a better smoothing effect than the prior probability $P(\mathcal{P}(R))$, which only focuses on minimizing the curve length of the contour and lacks interaction with the image. The assumption that the regions of the optimal partition are independent is made. This assumption is reasonable, since the aim of the segmentation is to separate out the regions of the image whose properties are different. Thus, the following equation is obtained:
$P(I \mid \mathcal{P}(R)) = P(I \mid \mathcal{P}(R_{in}))\, P(I \mid \mathcal{P}(R_{out}))$   (6)
where $R_{in}$ and $R_{out}$ denote respectively the regions inside and outside the contour, and $P(I \mid \mathcal{P}(R_{in}))$ and $P(I \mid \mathcal{P}(R_{out}))$ are the object and background region likelihood functions respectively. Current methods usually assume that the pixels in each region are also independent [15,16], so the above formula can be rewritten as:
$P(I \mid \mathcal{P}(R)) = \prod_{x \in R_{in}} P(I(x) \mid \mathcal{P}(R_{in})) \prod_{x \in R_{out}} P(I(x) \mid \mathcal{P}(R_{out}))$   (7)
However, the hypothesis of pixel independence within each region is very weak for textured regions or those with repeated patterns, where there is local interaction between the pixels. To avoid this problem, we rewrite the region likelihood function so that it takes account of the neighboring relationships. We take the object region likelihood function as an example; the background region likelihood function is formulated by analogy. MRF theory states that the conditional probability of a pixel depends only on its neighborhood [10]. So the object region likelihood function can be approximated as:
$P(I \mid \mathcal{P}(R_{in})) = \prod_{x_{i} \in R_{in}} P(x_{i} \mid \mathcal{N}_{x_{i}}, \theta_{in})$   (8)
where $\theta_{in} = \{\omega_{in}, \mu_{in}^{c}, \Sigma_{in}^{c}, \mu_{in}^{t}, \Sigma_{in}^{t}\}$ denotes the parameters of the object GMM model and $\mathcal{N}_{x_{i}}$ is the neighborhood of pixel $x_{i}$ in the 2D image lattice. We introduce a penalty term to measure the influence of neighboring pixels on the center pixel. The penalty term encourages nearby pixels to fall into the same region, which is reasonable in most applications. A $(2w+1) \times (2w+1)$ square neighborhood centered at pixel $x_{i}$ is defined. Before explaining the penalty term in detail, let us define the label set first. The label of a pixel depends on its membership of the object or the background: if the pixel's posterior of belonging to the object is larger than that of belonging to the background, the label of that pixel is set to 1, otherwise to 0. The single pixel's object likelihood function, considering the interaction in the neighborhood, is proportional to the product of the pure likelihood function and the penalty term:
$P(x_{i} \mid \mathcal{N}_{x_{i}}, \theta_{in}) \propto P(x_{i} \mid \theta_{in}) \cdot \exp\left[\mathrm{sign} \cdot \frac{1}{\sigma^{2}}\left(\frac{\max(N_{1}, N_{0})}{N_{1} + N_{0}}\right)^{2}\right]$   (9)
where the pure likelihood function is the general likelihood function that only considers the pixel itself and is computed using (4), the exponential function is the penalty term, $N_{1}$ and $N_{0}$ are the numbers of neighboring pixels with label 1 and label 0 respectively, $\sigma$ is the parameter controlling how fast the exponential function converges to zero, and sign is a piecewise function:
$\mathrm{sign} = \begin{cases} 1 & \text{if } (L - \tfrac{1}{2})(N_{1} - N_{0}) > 0 \\ 0 & \text{if } (L - \tfrac{1}{2})(N_{1} - N_{0}) = 0 \\ -1 & \text{if } (L - \tfrac{1}{2})(N_{1} - N_{0}) < 0 \end{cases}$   (10)
We define L as:
$L = \begin{cases} 1 & \text{when computing } P(x_{i} \mid \mathcal{N}_{x_{i}}, \theta_{in}) \\ 0 & \text{when computing } P(x_{i} \mid \mathcal{N}_{x_{i}}, \theta_{out}) \end{cases}$   (11)
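A small sketch of the penalty-weighted likelihood follows. The exponent implements our reconstruction of Eq. (9), which may differ in detail from the published formula; the neighborhood size w and the convergence parameter σ default to the values reported in Section 4.

```python
import numpy as np

def neighbourhood_penalty(labels, i, j, L, w=6, sigma=0.21):
    """Penalty factor of Eqs. (9)-(11) for the pixel at row i, column j.

    labels : H x W array of 0/1 labels (1 = object, 0 = background).
    L      : 1 when evaluating the object likelihood, 0 for the background one.
    """
    patch = labels[max(0, i - w): i + w + 1, max(0, j - w): j + w + 1]
    n1 = int(np.count_nonzero(patch)) - int(labels[i, j])   # neighbours with label 1
    n0 = (patch.size - 1) - n1                               # neighbours with label 0
    if n1 == n0:
        return 1.0                                           # feature 3: no influence
    sign = 1.0 if (L - 0.5) * (n1 - n0) > 0 else -1.0        # Eq. (10)
    ratio = max(n1, n0) / float(n1 + n0)
    return float(np.exp(sign * ratio ** 2 / sigma ** 2))     # Eq. (9), penalty term

def penalised_likelihood(pure_likelihood, labels, i, j, L, w=6, sigma=0.21):
    """P(x_i | N_{x_i}, theta) ~ P(x_i | theta) * penalty, as in Eq. (9)."""
    return pure_likelihood * neighbourhood_penalty(labels, i, j, L, w, sigma)
```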
The single pixel's background likelihood function $P(x_{i} \mid \mathcal{N}_{x_{i}}, \theta_{out})$ can be formulated similarly, just by replacing $\theta_{in}$ with $\theta_{out}$ in Formula (9). The penalty term has the following features:
1. If the label of the center pixel is the same as the labels of most neighboring pixels, the penalty term has an increasing effect on the likelihood function. The increasing extent depends on the difference between $N_{1}$ and $N_{0}$: the bigger the difference, the more the likelihood is increased.
2. If the label of the center pixel is not identical to the labels of most neighboring pixels, the penalty term has a decreasing effect on the likelihood function. The bigger the difference between $N_{1}$ and $N_{0}$, the more the likelihood is decreased.
3. If $N_{1}$ is equal to $N_{0}$, the penalty term equals one and has no influence on the likelihood function.
With the penalty term, the posterior partition probability is expressed as:
$P(\mathcal{P}(R) \mid I) \propto \prod_{x_{i} \in R_{in}} P(x_{i} \mid \mathcal{N}_{x_{i}}, \theta_{in}) \prod_{x_{j} \in R_{out}} P(x_{j} \mid \mathcal{N}_{x_{j}}, \theta_{out})$   (12)
Converting the MAP problem to an energy minimization problem, the energy equation is obtained:
$E = -\log P(\mathcal{P}(R) \mid I) = -\int_{x_{i} \in R_{in}} \log P(x_{i} \mid \mathcal{N}_{x_{i}}, \theta_{in})\, dx_{i} - \int_{x_{j} \in R_{out}} \log P(x_{j} \mid \mathcal{N}_{x_{j}}, \theta_{out})\, dx_{j}$   (13)
Minimizing the above energy function by solving the corresponding Euler-Lagrange equations [16], we obtain the level sets evolution speed model, in which a $(2l+1) \times (2l+1)$ square neighboring subregion around the center pixel is defined, resembling the definition of the square neighborhood in the penalty term. The object and background posterior probabilities, which we denote by $P_{R_{in}}(I \mid \tilde{x})$ and $P_{R_{out}}(I \mid \tilde{x})$, are also calculated in the speed model under the assumption that they have the same prior probabilities:
$P_{R_{in}}(I \mid \tilde{x}) = \dfrac{P(\tilde{x} \mid \mathcal{N}_{\tilde{x}}, \theta_{in})}{P(\tilde{x} \mid \mathcal{N}_{\tilde{x}}, \theta_{in}) + P(\tilde{x} \mid \mathcal{N}_{\tilde{x}}, \theta_{out})}$   (14)
$P_{R_{out}}(I \mid \tilde{x}) = \dfrac{P(\tilde{x} \mid \mathcal{N}_{\tilde{x}}, \theta_{out})}{P(\tilde{x} \mid \mathcal{N}_{\tilde{x}}, \theta_{in}) + P(\tilde{x} \mid \mathcal{N}_{\tilde{x}}, \theta_{out})}$   (15)
The level sets advection speed model of each pixel is obtained by:
$F_{x,y} = \sum_{i=-l}^{l} \sum_{j=-l}^{l} \Big[\log P_{R_{in}}(I \mid \tilde{x})\, H_{a}(\phi(\tilde{x}, t)) - \log P_{R_{out}}(I \mid \tilde{x})\, \big(1 - H_{a}(\phi(\tilde{x}, t))\big)\Big]$   (16)
where $\tilde{x}$ ranges over the neighboring pixels of (x, y), $\tilde{x} = (x + i, y + j)$, and $H_{a}(\phi(\tilde{x}, t))$ is a Heaviside function:
$H_{a}(\phi(\tilde{x}, t)) = \begin{cases} 0 & \phi(\tilde{x}, t) < 0 \\ 1 & \phi(\tilde{x}, t) \geq 0 \end{cases}$   (17)
The contour is evolved to the desired boundary by modifying $\phi$ iteratively with the overall speed F in the normal direction:
$\dfrac{\partial \phi}{\partial t} + (F_{adv} + F_{curv})\, \lvert \nabla \phi \rvert = 0$   (18)
where $F_{adv} = F_{x,y}$ is the external force reflecting the data attachment and $F_{curv}$ is the internal force proportional to the curvature $\kappa$. The detailed stable numerical approximation scheme of the above equation is given in [8].
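The speed model and the contour update can be sketched as follows, assuming the per-pixel log-posteriors of Eqs. (14)-(15) are available as dense arrays. The sign conventions follow our reconstruction of Eqs. (16)-(18), and the dense explicit update shown here stands in for the Narrow Band scheme [2] actually used in the paper.

```python
import numpy as np
from scipy.ndimage import convolve

def advection_speed(log_p_in, log_p_out, phi, l=2):
    """Eq. (16): window-summed region log-posteriors, gated by Eq. (17)."""
    h_a = (phi >= 0).astype(float)                 # Eq. (17): 1 outside, 0 inside
    kernel = np.ones((2 * l + 1, 2 * l + 1))       # (2l+1) x (2l+1) subregion
    term_in = convolve(log_p_in * h_a, kernel, mode='nearest')
    term_out = convolve(log_p_out * (1.0 - h_a), kernel, mode='nearest')
    return term_in - term_out

def evolve_level_set(phi, f_adv, curvature, dt=0.1):
    """Eq. (18): one explicit step of d(phi)/dt + (F_adv + F_curv)|grad phi| = 0."""
    gy, gx = np.gradient(phi)
    return phi - dt * (f_adv + curvature) * np.hypot(gx, gy)
```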
4 Experiments
To verify our method, we have performed a number of experiments on various sequences. Furthermore, some comparisons have been implemented. In our experiments, all the videos are captured with a moving camera, and a tracked object is represented with a gray contour (colored in the color images). The evolution of the level sets is implemented using the fast Narrow Band approach [2]. The contour is initialized based on the optical flow detection results in the first frame. Both object and background features are adopted. T in (1) is set to 0.2. The parameter σ in (9) is set to 0.21 for all the sequences. l in (16) is independent of the sequences and is fixed to 2. Among the above parameters, the most important one is w, which controls the size of the square neighborhood in the penalty term. Obviously, the bigger w is, the smoother the contour we can obtain. When w is zero, the neighborhood contains only the pixel itself and the penalty term does not work. This is evident from Fig. 1, where we show the tracking results with different values of w. This experiment is implemented considering only the color feature. From the experimental results, we can find that as w increases, the contour becomes smoother at the cost of losing more detail. Thus, choosing an appropriate parameter w is a trade-off between accuracy and smoothness of the tracking results. In our experiments, the parameter w is set to 6.
Fig. 1. Tracking results with different values of parameter w; from (a) to (f), w is 0, 2, 3, 4, 7 and 10
Fig. 2. Tracking results of several real sequences: (a) tracking results of indoor human walking sequence, the frame numbers are, respectively, 1, 16, 36, 70 and 114; (b) tracking results of two moving faces sequence, the frame numbers are, respectively, 1, 82, 144, 180 and 269; (c) tracking results of the football sequence, the frame numbers are, respectively, 1, 48, 75, 102 and 141
4.1 Results on Real Video Sequences
We first present tracking results on three real video sequences. In the first experiment, we track a person with a moving camera from Frame 1 to 114. The background is cluttered with items whose color is similar to that of the object. Both color and texture features are adopted to model the object and background regions in this sequence. As shown in Fig. 2 (a), despite the background disturbance, we still track the contour of the walking person accurately, and the arms and legs are also successfully tracked. (Please see the supplied video "indoor human tracking.avi".) In the second experiment, we demonstrate the performance of our method on a sequence of two moving faces. The camera zooms and moves as the people change their face poses continuously. Both color and texture features are considered. Different people are labeled with different grays (colors in the color image). The tracking results are shown in Fig. 2 (b). We are still able to track the contours of these two faces with high accuracy even though the faces' scales change a lot. (Please see the supplied video "two moving faces tracking.avi".) In the third experiment, we track a fast moving football from Frame 1 to 141. Based on the results of optical flow detection, we obtain many different initial contours corresponding to different moving objects (including players and the football). We choose the football as the object of interest and only model the object and background regions around the football. The color feature is enough to distinguish between foreground and background. As we can see from the tracking results shown in Fig. 2 (c), we keep good track of the football, pointed to by an arrow, even when it is moving with a very high speed. In the
Fig. 3. Tracking results for two comparisons: the first and third column are obtained by our method with the penalty term, and the second and fourth column are using Yilmaz’s speed model without considering the penalty term. (a) the first comparison, from top to bottom, the frame numbers are, respectively, 1, 28, 96 and 153; (b) the second comparison, from top to bottom, the frame numbers are, respectively, 1, 24, 55 and 74.
102nd frame, there is severe motion blur, but the football is still tracked robustly. The zoomed-in contour is shown in the top right corner of each image. (Please see the supplied video "football tracking.avi".)
4.2 Comparison
One of the characteristics of our method is to encode the Markov property into an additional penalty term expressed in our speed model, while Yilmaz's method [16] does not. Yilmaz's method is a typical region-based active contour method which does not consider the interaction between pixels when computing the likelihood function. Here we show two comparisons between these two methods, where only the color feature is taken into account. The first comparison is implemented on the Mickey head tracking sequence, in which we artificially introduce some disturbances for the tracked object. The color of the scrips in the background is the same as the color of the Mickey head. The camera zooms and tilts as the object moves. From the comparison results shown in Fig. 3 (a), it is obvious that the tracking results obtained using our method are more accurate and much smoother than those obtained using Yilmaz's method. The influence of the background disturbance is eliminated to some extent by our method. (Please see the supplied video "comparison 1.avi".)
A similar comparison is implemented on the real outdoor human walking sequence captured with a moving camera. The colors of some areas in the background are similar to those of the person’s clothes. The tracking results are shown in Fig. 3 (b) from Frame 1 to 74. The comparison results have demonstrated that the method without the penalty term is more sensitive to background disturbance and our method more accurately tracks the objects when background disturbance occurs. (Please see the supplied video “comparison 2.avi”)
5 Conclusion
In this paper, we have proposed an MRF modeled level sets method for object tracking with moving cameras. In our method, the contour is initialized automatically based on optical flow detection. The penalty term reflecting the interaction between pixels is introduced to reduce the influence of background disturbance and to smooth the contour. Our method has been tested on several real video sequences; objects are accurately tracked even when only the color information is considered. The comparison experiments have demonstrated that our method outperforms the general region-based method which does not consider correlations between neighboring pixels.
Acknowledgments This work is partly supported by NSFC (Grant No. 60520120099 and 60672040) and the National 863 High-Tech R&D Program of China (Grant No. 2006AA01Z453).
References
1. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: active contour models. In: IJCV, vol. 1, pp. 321–331 (1988)
2. Adalsteinsson, D., Sethian, J.: A fast level set method for propagating interfaces. J. Comput. Phys. 118, 269–277 (1995)
3. Osher, S., Sethian, J.: Fronts propagation with curvature-dependent speed: algorithms based on Hamilton-Jacobi formulations. J. Comput. Phys. 79, 12–49 (1988)
4. Horn, B., Schunck, B.: Determining optical flow. Artificial Intelligence 17, 185–203 (1981)
5. Wren, C.R., Azarbayejani, A., Darrell, T., Pentland, A.: Pfinder: real-time tracking of the human body. IEEE Trans. PAMI 19, 780–785 (1997)
6. Partio, M., Cramariuc, B., Gabbouj, M.: Rock texture retrieval using gray level co-occurrence matrix. In: NSPS, NORSIC 2002, October 4-7, 2002, p. 5 (2002)
7. Stauffer, C., Grimson, W.E.L.: Adaptive background mixture models for real-time tracking. In: CVPR, vol. 2, pp. 246–252 (1999)
8. Sethian, J.A.: Level set methods and fast marching methods: evolving interfaces in computational geometry, fluid mechanics, computer vision, and materials science. Cambridge University Press, Cambridge (1999)
9. Chan, T.F., Vese, L.A.: Active contours without edges. IEEE Trans. on IP 10, 266–277 (2001)
10. Xu, D.X., Hwang, J.N., Yuan, C.: Segmentation of multi-channel image with markov random field based active contour model. In: JVLSI, vol. 31, pp. 45–55 (2002)
11. Haritaoglu, I., Harwood, D., Davis, L.S.: W4: real-time surveillance of people and their activities. IEEE Trans. PAMI 22, 809–830 (2000)
12. Caselles, V., Kimmel, R., Sapiro, G.: Geodesic active contours. In: IJCV, vol. 22, pp. 61–79 (1997)
13. Paragios, N., Deriche, R.: Geodesic active contours and level sets for the detection and tracking of moving objects. IEEE Trans. PAMI 22, 266–280 (2000)
14. Bailloeul, T.: Active contours and prior knowledge for change analysis: application to digital urban building map updating from optical high resolution remote sensing images. PhD thesis, October 2005
15. Zhu, S.C., Yuille, A.: Region competition: unifying snakes, region growing and Bayes/MDL for multiband image segmentation. IEEE Trans. PAMI 18, 884–900 (1996)
16. Yilmaz, A., Li, X., Shah, M.: Object contour tracking using level sets. In: ACCV (2004)
17. Shi, Y., Karl, W.C.: Real-time tracking using level sets. In: CVPR, vol. 2, pp. 34–41 (2005)
18. Leventon, M., Grimson, E., Faugeras, O.: Statistical shape influence in geodesic active contours. In: CVPR, vol. 1, pp. 316–323 (2000)
19. Cremers, D.: Dynamical statistical shape priors for level set based tracking. IEEE Trans. PAMI 28, 1262–1273 (2006)
Continuously Tracking Objects Across Multiple Widely Separated Cameras Yinghao Cai, Wei Chen, Kaiqi Huang, and Tieniu Tan National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences P.O.Box 2728, Beijing, 100080, China {yhcai,wchen,kqhuang,tnt}@nlpr.ia.ac.cn
Abstract. In this paper, we present a new solution to the problem of multi-camera tracking with non-overlapping fields of view. The identities of moving objects are maintained when they are traveling from one camera to another. Appearance information and spatio-temporal information are explored and combined in a maximum a posteriori (MAP) framework. In computing appearance probability, a two-layered histogram representation is proposed to incorporate spatial information of objects. Diffusion distance is employed to histogram matching to compensate for illumination changes and camera distortions. In deriving spatio-temporal probability, transition time distribution between each pair of entry zone and exit zone is modeled as a mixture of Gaussian distributions. Experimental results demonstrate the effectiveness of the proposed method.
1 Introduction
Nowadays, a distributed network of video sensors is applied to monitor activities over a complex area. Instead of having a high resolution camera with a limited field of view, multiple cameras provide a solution to wide area surveillance by extending the field of view of a single camera. Various types of camera overlap and non-overlap can be employed in multi-camera surveillance systems. Continuously tracking objects across cameras is usually termed "object handover". The objective of handover is to maintain the identities of moving objects when they are traveling from one camera to another. More specifically, when an object appears in one camera, we need to determine whether it has previously appeared in other cameras or is a new object. In earlier work on handover, either calibrated cameras or overlapping fields of view are required. Subsequent approaches to handover recover the relative positions between cameras by statistical consistency. Statistical information reveals a trend of how people are likely to move between cameras. Possible cues for tracking across cameras include appearance information and spatio-temporal information. Appearance information includes the size, color and height of a moving object, etc., while spatio-temporal information refers to transition time, velocity, entry zone, exit zone, trajectory, etc. These cues impose a constraint on possible transitions between cameras; for example, a person who leaves the field of view of one camera
at exit zone A will never appear at entry zone B of another camera at the opposite direction of his or her moving direction. Combining appearance information with spatio-temporal information is promising since it does not require a priori calibration and is able to adapt to changes in the cameras’ positions. In this context, tracking objects across cameras is achieved through computing the probability of correspondence according to appearance and spatio-temporal cues. Since cameras are non-overlapping, the appearances of moving objects under multiple non-overlapping cameras may exhibit significant differences due to different illumination conditions, poses and camera parameters. Even under the same scene, the illumination conditions vary over time. As to spatio-temporal information, the transition time from one camera to another differs dramatically from person to person. Some people may wander along the way, while others are rushing against time. In addition, as pointed out in [1], the more dense the observations and the longer the transition time, the more likely the false correspondences. In this paper, we solve these problems under a maximum a posteriori (MAP) framework. The probability of two observations under two cameras generated from the same object is dependent on both appearance probability and spatiotemporal probability. At the off-line training stage, we assume the correspondences between objects are known. The parameters for appearance matching and transition distributions between each pair of entry and exit zone are learned. At the testing stage, correspondences are assigned according to appearance and spatio-temporal probability under the MAP framework. Experimental results demonstrate the effectiveness of the proposed algorithm. In the remainder of this paper, an overview of the related work is in Section 2. In Section 3, experimental setup is described. The MAP framework is presented in Section 4 with appearance probability and spatio-temporal probability described. Experimental results and conclusions are given in Section 5 and Section 6 respectively.
2 Related Work
To compensate color variations under two separated cameras, one solution is by color normalization. Niu et al. [2] employ a comprehensive color normalization algorithm (CCN) to remove image dependency on lighting geometry and illuminant color. This procedure is an iterative process until no change is detected. An alternative solution to the problem is by finding a transformation matrix [3] or a mapping function [4] which map the appearance of one object to its appearance under another view. In [3], the transformation matrix is obtained by solving a linear matrix equation. Javed et al. [4] show that all brightness transfer functions (BTF) from one camera to another lie in a low dimensional subspace. [4] assumes planar surfaces and uniform lighting which are undesirable in real applications. In determining the spatio-temporal relationship between pairs of cameras, Javed et al. [5] employ a non-parametric Parzen window technique to estimate the spatio-temporal pdfs between cameras. In [6], it is assumed that all pairs
of arrival and departure events contribute to the distribution of transition time. Observations of transition time are accumulated into a reappearance period histogram. The peak of the reappearance period histogram indicates the most popular transition time. No appearance information is used in [6]. Furthermore, [2,3] weight the temporally correlating information by appearance information: only those observations which look similar in appearance are used to derive the spatio-temporal pdfs. Both [6] and [2,3] assume a single-mode transition distribution and are not flexible enough to deal with multi-modal transition situations. In this paper, a two-layered histogram representation is proposed to incorporate spatial information of objects. This representation provides more descriptive ability than computing the histogram of the whole body directly. Furthermore, instead of modeling color changes between cameras explicitly as a mapping function or a transformation matrix, we apply the diffusion distance [7] to histogram matching to compensate for illumination changes and camera distortions. To deal with multi-modal transition situations, we model the spatio-temporal probability between each pair of entry zone and exit zone as a mixture of Gaussians. Correspondences are assigned according to the appearance and spatio-temporal probabilities under the MAP framework.
Fig. 1. (a) The layout of the camera system, (b) Three views from three widely separated cameras
3 Experimental Setup
The experimental setup consists of three cameras with non-overlapping fields of view. The cameras are widely separated, including two outdoor settings and one indoor setting. The layout is shown in Figure 1(a). As we can see from Figure 1(b), illumination conditions are quite different. In single camera motion detection and tracking, Gaussian Mixture Model(GMM) and Kalman filter are applied, respectively. Figure 2(a), (b) and (c) show numbers of people in camera C1 , C2 , C3 respectively. The number of people in each view is obtained by single camera tracking.
Fig. 2. (a-c) Numbers of people in camera C1 , C2 , and C3 respectively
Dense observations make the handover problem more difficult. However, the proposed method provides a satisfactory result given the difficulties above.
4 Bayesian Framework
Suppose we have m people $p_1, p_2, \ldots, p_m$ under n cameras $C_1, C_2, \ldots, C_n$; the observation under camera i (j) of moving object $p_a$ ($p_b$) is represented as $O_i^a$ ($O_j^b$). The observations of a moving object $p_a$ include appearance and spatio-temporal properties, which are represented as $O_i^a(app)$ and $O_i^a(st)$ respectively. According to Bayesian theory, given two observations $O_i^a$ and $O_j^b$ under two cameras, the probability of these observations being generated from the same object is [5]:
$P(a = b \mid O_i^a, O_j^b) = \dfrac{P(O_i^a, O_j^b \mid a = b)\, P(a = b)}{P(O_i^a, O_j^b)}$   (1)
where the denominator $P(O_i^a, O_j^b)$ is the normalization term and $P(O_i^a, O_j^b \mid a = b)$ depends on both the appearance probability and the spatio-temporal probability. $P(a = b)$ is a constant term denoting the probability of a transition from camera i to camera j, defined as
$P(a = b) = \dfrac{\text{Num of transitions from } C_i \text{ to } C_j}{\text{Num of people exiting } C_i}$   (2)
Since the appearance of each object does not depend on its spatio-temporal property, we assume independence between $O_i^a(app)$ and $O_i^a(st)$. So we have
$P(a = b \mid O_i^a, O_j^b) \propto P(O_i^a(app), O_j^b(app) \mid a = b) \times P(O_i^a(st), O_j^b(st) \mid a = b)$   (3)
The handover problem is now formalized as follows: given an observation $O_i^a$ under camera i, we need to find the observation among the candidates $Q_i^a$, in a time sliding window of $O_i^a$ under camera j, which maximizes the posterior probability $P(a = b \mid O_i^a, O_j^b)$:
$h = \arg\max_{\forall O_j^b \in Q_i^a} P(a = b \mid O_i^a, O_j^b)$   (4)
The appearance probability $P(O_i^a(app), O_j^b(app) \mid a = b)$ and the spatio-temporal probability $P(O_i^a(st), O_j^b(st) \mid a = b)$ are computed in Sections 4.2 and 4.3, respectively.
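The decision rule of Eq. (4) amounts to the loop below over the candidates in the sliding window. The arguments p_app, p_st and p_transition are placeholders for the appearance probability (Sect. 4.2), the spatio-temporal probability (Sect. 4.3) and the prior of Eq. (2); their names are our own and not part of the paper.

```python
import math

def handover(query_obs, candidates, p_app, p_st, p_transition):
    """Pick the candidate observation maximizing the posterior of Eqs. (3)-(4).

    query_obs    : observation O_i^a from camera i
    candidates   : observations O_j^b from camera j inside the time sliding window
    p_app, p_st  : callables returning the appearance / spatio-temporal likelihoods
    p_transition : the prior P(a = b) of Eq. (2) for this camera pair
    """
    best, best_score = None, -math.inf
    for cand in candidates:
        score = p_app(query_obs, cand) * p_st(query_obs, cand) * p_transition
        if score > best_score:
            best, best_score = cand, score
    return best, best_score   # best is None if the window is empty
```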
4.1 Moving Object Representation
The purpose of moving object representation is to describe appearance of each object so as to be discriminable from other objects. Histogram is a widely used appearance descriptor. The main drawback of histogram-based methods is that they lose spatial information of the color distribution which is essential to discriminate different moving objects. For example, histogram-based methods can not tell a person wearing a white shirt and blue pants from another person who dresses in a blue shirt and white pants.
Fig. 3. A two-layered histogram representation: (a,e) Histogram of the body, (b-d, f-h) Histograms of head, torso and legs respectively
In this paper, we propose a new moving object representation method based on a two-layered histogram. As pedestrians are our primary concern, the human body is divided into three subregions (head, torso and bottom) in the vertical direction, similar to the method in [8]. The first layer of the proposed representation corresponds to the color histogram of the whole body, Htotal, while the second layer consists of the histograms of the head, torso and legs, represented by Hh, Ht and Hl respectively. Histograms are quantized into 30 bins in the R, G, B channels separately. It is worth pointing out that coarse quantization discards too much discriminatory information, while fine quantization results in sparse histogram representations; our preliminary experiments validate the adequacy of thirty bins in terms of discriminability and accuracy. Figure 3 shows the separated regions and their histogram representations. A two-layered histogram representation captures both a global image description and local spatial information. Figure 3 shows that two different people can have visually similar Htotal, yet their Ht s are quite different, which demonstrates that the proposed two-layered representation provides more discriminability than computing the histogram of the whole body directly. Each layer of the representation under one view is matched against its corresponding layer under another view in the next subsection.
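A minimal sketch of the two-layered representation is shown below. The 20%/40%/40% vertical split into head, torso and legs is an assumption (the paper only states that the body is divided into three subregions); the per-channel 30-bin quantization follows the text.

```python
import numpy as np

def two_layer_histogram(person_rgb, bins=30):
    """Two-layered colour histograms: whole body plus head, torso and legs.

    person_rgb : H x W x 3 RGB crop of the detected person (values in 0..255).
    Returns a dict of per-part histograms, each a 3 x bins array (one row per channel).
    """
    h = person_rgb.shape[0]
    parts = {
        'total': person_rgb,
        'head':  person_rgb[: int(0.2 * h)],            # assumed 20% / 40% / 40% split
        'torso': person_rgb[int(0.2 * h): int(0.6 * h)],
        'legs':  person_rgb[int(0.6 * h):],
    }
    out = {}
    for name, region in parts.items():
        hist = np.stack([
            np.histogram(region[..., c], bins=bins, range=(0, 255))[0]
            for c in range(3)                            # R, G and B treated separately
        ]).astype(float)
        out[name] = hist / max(hist.sum(), 1.0)          # normalise for comparability
    return out
```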
4.2 Histogram Matching
As we mentioned in Section 1, the appearances of moving objects under multiple non-overlapping cameras exhibit significant differences due to different illumination conditions, poses and camera parameters. To compute the appearance probability given observations under two cameras, we first obtain the two-layered histogram representation of Section 4.1. The histogram representation provides robustness to pose changes to some degree. In this section, we apply the diffusion distance to histogram matching to compensate for illumination changes and camera distortions. The diffusion distance was first proposed by Ling et al. [7]. This approach models the difference between two histograms as a temperature field. Firstly, an initial distance between two histograms is defined. The diffusion process on this temperature field diffuses the difference between the two histograms by a Gaussian kernel; as time increases, the difference between these two histograms approaches zero. Therefore, the distance between two histograms can be defined as the sum of dissimilarities over this process [7]:
$K(hist1, hist2) = \sum_{i=0}^{N} k(\lvert d_i(x) \rvert)$   (5)
where
$d_0(x) = hist1(x) - hist2(x)$   (6)
$d_i(x) = [d_{i-1}(x) * \phi(x, \sigma)] \downarrow_2, \quad i = 1, \ldots, N$   (7)
"$\downarrow_2$" denotes half-size downsampling, $\sigma$ is the standard deviation of the Gaussian filter, which can be learned in the training phase, and $k(\lvert \cdot \rvert)$ is chosen as the L1 norm. Each subsequent distance $d_i(x)$ is defined as the half-size downsampling of its former layer.
Fig. 4. Diffusion distance plotted on the same figure. (a) Diffusion process for the difference of histograms of the same person under two views, (b) Diffusion process for different people.
Then, the ground distance between two histograms is defined as the sum of norms over the N scales of the pyramid. An intuitive illustration is shown in Figure 4: Figure 4(a) shows the diffusion process for the difference between histograms of the same person under two views, and Figure 4(b) shows the diffusion process for different people. We can see that (a) decays faster than (b). In our method, we compare Htotal, Hh, Ht and Hl of one object with its corresponding histograms under another view using the diffusion distance. The histogram representation is one-dimensional, since we treat each channel R, G, B separately. The four diffusion distances dtotal, dh, dt and dl are combined by a weighted sum. At the training stage, we fit a Gaussian distribution to the distances between the same object under different views; finally, distances are transformed into probabilities to obtain the appearance probability $P(O_i^a(app), O_j^b(app) \mid a = b)$. A comparison with other histogram distances is shown in Section 5.
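The pyramid computation of Eqs. (5)-(7) can be sketched for a one-dimensional histogram as follows. The Gaussian width σ and the number of layers are fixed here purely for illustration, whereas the paper learns σ during training.

```python
import numpy as np

def diffusion_distance(hist1, hist2, n_layers=5, sigma=1.0):
    """Diffusion distance of Eqs. (5)-(7) for one-dimensional histograms."""
    def smooth(x, sigma):
        radius = max(1, int(3 * sigma))
        t = np.arange(-radius, radius + 1)
        kernel = np.exp(-t ** 2 / (2 * sigma ** 2))
        return np.convolve(x, kernel / kernel.sum(), mode='same')

    d = np.asarray(hist1, dtype=float) - np.asarray(hist2, dtype=float)  # d_0, Eq. (6)
    total = np.abs(d).sum()                       # k(|d_0|) with k the L1 norm
    for _ in range(n_layers):
        d = smooth(d, sigma)[::2]                 # Eq. (7): Gaussian filter, downsample
        total += np.abs(d).sum()
        if d.size <= 1:
            break
    return float(total)
```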
4.3 Spatio-temporal Information
To estimate the spatio-temporal relationship between pairs of cameras, at the off-line training stage we group the locations where objects appear (entry zones) and disappear (exit zones) by k-means clustering. The transition time distribution between each pair of entry zone and exit zone is modeled as a mixture of Gaussian distributions. In this paper we choose K = 3; the three Gaussian distributions correspond to people walking slowly, at normal speed, and quickly. The probability of a test transition time x is
$P(x) = \sum_{i=1}^{3} \omega_i\, \eta(x, \mu_i, \sigma_i)$   (8)
where $\omega_i$ is the weight of the ith Gaussian in the mixture; $\omega_i$ can be interpreted as the prior probability of the random variable being generated by the ith Gaussian distribution. $\mu_i$ and $\sigma_i$ are the mean value and the standard deviation of the ith Gaussian, and $\eta$ is the Gaussian probability density function
$\eta(x, \mu, \sigma) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$   (9)
Fig. 5. (a) Transition distribution from Camera 1 to Camera 2, (b) Transition distribution from Camera 2 to Camera 3
The parameters of the model are estimated by expectation maximization (EM). It should be noted that a single Gaussian distribution cannot accurately model the transition time distributions between cameras, due to the variability of walking paces. Figure 5 shows a transition distribution and its approximations by a mixture of Gaussian distributions and by a single Gaussian distribution.
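As an illustration of the transition-time model of Eqs. (8)-(9), the mixture can be fitted and evaluated as below, using scikit-learn's EM implementation as a stand-in for the paper's own EM procedure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_transition_model(transition_times, k=3):
    """Fit the K = 3 Gaussian mixture of Eq. (8) to observed transition times (seconds)."""
    t = np.asarray(transition_times, dtype=float).reshape(-1, 1)
    return GaussianMixture(n_components=k, covariance_type='full').fit(t)

def transition_probability(model, x):
    """P(x) of Eq. (8) for a test transition time x."""
    return float(np.exp(model.score_samples(np.array([[float(x)]])))[0])
```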
5 Experimental Results
Experiments are carried out on two outdoor settings and one indoor setting as shown in Figure 1. The off-line training phase lasts 40 minutes, and evaluation of the effectiveness of the algorithm is performed using ground-truthed sequences lasting an hour. At the off-line training stage, locations where people appear and disappear are grouped together as entry zones and exit zones respectively in Figure 6. It takes approximately 40-70 seconds to exit from Camera 1 to Camera 2 and from Camera 2 to Camera 3. Some sample images under the three views are shown in Figure 7. Our first experiment consists of transitions from Camera 1 to Camera 2 with two outdoor settings. Our second experiment is carried out on Camera 2 and Camera 3. Numbers of correspondence pairs in the training stage, transitions and detected tracks in the testing stage are summarized in Table 1.
Fig. 6. (a-c) Entry zones and exit zones for Camera 1, 2 and 3, respectively
Fig. 7. Each column contains the same person under two different views

Table 1. Experimental Description
                Training Stage          Testing Stage
                Correspondence Pairs    Transition Nums    Detected Tracks
Experiment 1    100                     107                150
Experiment 2    50                      75                 100
Fig. 8. Rank Matching Performance. “app” denotes using appearance information only, “st” means using spatio-temporal information only, “app & st” means both appearance information and spatio-temporal information are employed. (a) Rank Matching Performance of Experiment 1. (b) Rank Matching Performance of Experiment 2.
Fig. 9. Continuously tracking objects across three non-overlapping views
Fig. 10. Rank 1 rates for diffusion distance, L1 distance and histogram intersection
Figure 8 shows our rank matching performance. Rank i (i = 1...5) performance is the rate that the correct person is in the top i of the handover list. Different people with similar appearances bring uncertainties into the system which can explain the rank one accuracy of 87.8% in Experiment 1 and 76% in Experiment 2. By taking the top three matches into consideration, the performance is improved to 97.5% and 98.6% respectively. People are tracked correctly in Figure 9. As a comparison between diffusion distance, the widely used L1 distance and histogram intersection distance [9], we use the same framework and replace the diffusion distance with L1 and histogram intersection distance. The rank 1 rates for different distances are shown in Figure 10, which demonstrates the superiority of the proposed diffusion distance.
6 Conclusion and Future Work
In this paper, we have presented a new solution to the problem of multi-camera tracking with non-overlapping fields of view. People are tracked correctly across the widely separated cameras by combining appearance and spatio-temporal cues under the MAP framework. Experimental results validate the effectiveness of the proposed algorithm. The proposed method requires an off-line training phase where parameters for appearance matching and transition probabilities are learned. Future work will focus on evaluation of the proposed method on larger datasets.
Acknowledgement This work is partly supported by National Basic Research Program of China (No. 2004CB318110), the National Natural Science Foundation of China (No. 60605014, No. 60335010 and No. 2004DFA06900) and CASIA Innovation Fund for Young Scientists.
References
1. Tieu, K., Dalley, G., Grimson, W.E.L.: Inference of non-overlapping camera network topology by measuring statistical dependence. In: Computer Vision, 2005. Proceedings. Ninth IEEE International Conference on, pp. 1842–1849. IEEE Computer Society Press, Los Alamitos (2005)
2. Niu, C., Grimson, E.: Recovering non-overlapping network topology using far-field vehicle tracking data. In: ICPR 2006. Pattern Recognition, 18th International Conference on, pp. 944–949 (2006)
3. Gilbert, A., Bowden, R.: Tracking objects across cameras by incrementally learning inter-camera colour calibration and patterns of activity. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 125–136. Springer, Heidelberg (2006)
4. Javed, O., Shafique, K., Shah, M.: Appearance modeling for tracking in multiple nonoverlapping cameras. In: CVPR 2005. Computer Vision and Pattern Recognition, pp. 26–33. IEEE Computer Society, Los Alamitos (2005)
5. Javed, O., Rasheed, Z., Shafique, K., Shah, M.: Tracking across multiple cameras with disjoint views. In: Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pp. 952–957. IEEE Computer Society Press, Los Alamitos (2003)
6. Makris, D., Ellis, T., Black, J.: Bridging the gaps between cameras. In: CVPR 2004. Computer Vision and Pattern Recognition, pp. 205–210. IEEE Computer Society Press, Los Alamitos (2004)
7. Ling, H., Okada, K.: Diffusion distance for histogram comparison. In: Computer Vision and Pattern Recognition, pp. 246–253. IEEE Computer Society Press, Los Alamitos (2006)
8. Hu, M., Hu, W., Tan, T.: Tracking people through occlusions. In: ICPR 2004. Pattern Recognition, 17th International Conference on, pp. 724–727 (2004)
9. Swain, M.J., Ballard, D.H.: Indexing via color histograms, pp. 390–393 (1990)
Adaptive Multiple Object Tracking Using Colour and Segmentation Cues Pankaj Kumar, Michael J. Brooks, and Anthony Dick University of Adelaide School of Computer Science South Australia 5005
[email protected],
[email protected],
[email protected]
Abstract. We consider the problem of reliably tracking multiple objects in video, such as people moving through a shopping mall or airport. In order to mitigate difficulties arising as a result of object occlusions, mergers and changes in appearance, we adopt an integrative approach in which multiple cues are exploited. Object tracking is formulated as a Bayesian parameter estimation problem. The object model used in computing the likelihood function is incrementally updated. Key to the approach is the use of a background subtraction process to deliver foreground segmentations. This enables the object colour model to be constructed using weights derived from a distance transform operating over foreground regions. Results from foreground segmentation are also used to gain improved localisation of the object within a particle filter framework. We demonstrate the effectiveness of the approach by tracking multiple objects through videos obtained from the CAVIAR dataset.
1 Introduction
Reliably tracking multiple objects in video remains a highly challenging and unsolved problem. If, for example, we aim to track several people in an airport or shopping mall, we face difficulties associated with appearance and scale changes as each person moves around. Compounding this are occlusion problems that can arise when people meet or pass by each other. This paper is concerned with improving the reliability of multiple object tracking in surveillance video. Visual tracking of multiple objects is formulated in this work as a parameter estimation problem. Parameters describing the state of the object are estimated using a Bayesian technique where the constraints of Gaussianity and linearity do not apply. In Bayesian estimation, the posterior probability density function (pdf) p(Xt |Z T ) of the state vector Xt given a set of observations Z T obtained from the camera is computed at every step, as new observations become available. Many tracking algorithms with a fixed object model have already been designed [1], [2]. However, trackers with a fixed object model are typically unable to track objects for long because of changes in lighting conditions, pose, scale and view point and also due to camera noise.
One of the ways of improving object tracking has been to update the object model with the observation data. Nummiaro et al. [3] developed an adaptive particle filter tracker, updating the object model by taking a weighted average of the current and a new histogram of the object. Zhou et al. [4] proposed an observation generated by adapting the appearance model, motion model, noise variance and number of particles. Ross et al. [5] proposed an adaptive probabilistic real-time tracker that updates the model using an incremental update of a so-called eigenbasis. Another way to improve the tracking of an object in video is to use multiple cues such as colour, texture, motion, shape, etc. Brasnett et al. [6] integrated colour and texture cues in a particle filter framework for tracking an object. Wu and Huang [7] investigated the relationship amongst different modalities for robust visual tracking and identified efficient ways to facilitate tracking with simultaneous use of different modalities. Spengler and Schiele [8] integrated skin colour and intensity change cues using CONDENSATION [2] for tracking multiple human faces. Perez et al. [9] proposed a multiple cue tracker for tracking objects in front of a web cam. They introduced a generic importance sampling mechanism for data fusion and applied it to fuse various subsets of colour, motion, and stereo sound for tele-conferencing and surveillance using fixed cameras. Appearance update is not factored into the approach. Shao et al. [10] improved a multiple cue particle filter by using a motion model comprising background and foreground motion parameters. R. Collins and Y. Liu [11] and B. Han and L. Davis [12] presented methods for online selection of the most discriminative feature for tracking objects. In these methods, multiple feature spaces are evaluated and adjusted while tracking, in order to improve tracking performance. The hypothesis is that the features that best discriminate between object and background are also best for tracking the object. In this paper we utilise multiple cues and object model adaptation to achieve improved robustness and accuracy in tracking. We make use of two object description cues: a colour histogram capturing appearance, and spatial dimensions obtained from background-foreground segmentation capturing location and size. Object model adaptation is implemented via an autoregressive update with the region where the mode of the particles of the state vector for an object lies in the current frame.
2 Proposed Scheme
A particle filter is a special case of a Bayesian estimation process (see [13] for a tutorial on particle filters for real-time, nonlinear, non-Gaussian Bayesian tracking). The key idea of a particle filter is to approximate the probability distribution of the state $X_t$ of the object with a set of $N_s$ particles/hypotheses and weights,
$$\{X_t^i, w_t^i\}_{i=1}^{N_s}. \qquad (1)$$
Each particle is a hypothetical state of the object and the weight/belief for each hypothesis is computed using a likelihood function. Particle filter based
tracking algorithms have four main components, namely: object representation, observation representation, hypothesis generation and hypothesis evaluation. This paper proposes improvements in (a) the object and observation representation, using the information obtained from background-foreground segmentation, and (b) the hypothesis evaluation methodology. Background-foreground segmentation is a well-developed technology and many real-time tracking systems use it for the detection of moving objects. There are algorithms that detect moving foreground objects even when the camera is gradually moving [14].

Figure 1 presents a schematic of the approach taken in this paper. The image frame obtained from the video stream is processed by background subtraction using the method presented in [15]. Each foreground blob is measured as a rectangular region, specified by centroid, width and height. A data association and merge-split analysis is carried out between the objects and measurements using the method presented in [16]. A distance transform [17], [18] is applied to the foreground segmentation result. The foreground pixel intensity obtained from the distance transform is used to weight the pixel's contribution when building the object's histogram model. Our contention is that this gives a better object and candidate representation than that obtained using other kernel functions. The hypothesis of the object's state is also evaluated using the measurement of the object obtained from the foreground segmentation process. The beliefs from the two hypothesis evaluation processes are combined to compute the weights of the particles. The mode of the particles is then evaluated, and the state at the mode of the particles is used to update the object model in an auto-regressive formulation. Object update is suspended for objects that have undergone a merge.
2.1 Hypothesis Generation
An object state is given by $X_t = [x_c, y_c, W, H]^T$, where $x_c, y_c$ are the co-ordinates of the centroid of the object and $W, H$ are its width and height in the image frame. The hypothesis generation process is also known as the prediction step, and is denoted $p(X_{t+1}|X_t)$. New particles are generated using a proposal function $q(\cdot)$, called an importance density, and the object dynamics. Using the predicted particles and the hypothesis evaluation from the observation, the posterior probability distribution of the object state is computed. We use a random walk for the object dynamics for the following reasons:
1. Using constant velocity or constant acceleration object dynamics instead increases the dimensionality of the state space, which in turn increases exponentially the number of particles needed to track the object with similar accuracy.
2. In real-life situations, especially with humans walking and interacting with other objects in the scene, it is very difficult to know the object dynamics beforehand. Different people have different dynamics, and human motions and interactions are relatively unpredictable.
Fig. 1. This schematic highlights the flow of information in the proposed multi-cue, adaptive object model tracking method
The particles are predicted using the update
$$X_{t+1} = X_t + v_t, \qquad (2)$$
where $v_t$ is independent, identically distributed, zero-mean Gaussian noise. The importance density is chosen to be the prior,
$$q(X_{t+1}|X_t^i, Z_{t+1}) = p(X_{t+1}|X_t^i). \qquad (3)$$
The result of using this importance density is that, after resampling, the particles of the current instance are used to generate the particles for the next iteration.
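As a concrete illustration of Eqs. (2)-(3), the following NumPy sketch performs one random-walk prediction of the particle set and a multinomial resampling step driven by the particle weights. The noise scales are arbitrary values chosen for the example, not parameters reported by the authors.

```python
import numpy as np

def predict_particles(particles, rng, noise_std=(2.0, 2.0, 1.0, 1.0)):
    """Random-walk prediction X_{t+1} = X_t + v_t for particles [xc, yc, W, H]."""
    noise = rng.normal(0.0, noise_std, size=particles.shape)
    return particles + noise

def resample_particles(particles, weights, rng):
    """Multinomial resampling: draw N_s particles proportionally to their weights."""
    weights = weights / weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]

# Example with N_s = 20 particles around an initial state.
rng = np.random.default_rng(0)
particles = np.tile([100.0, 50.0, 30.0, 60.0], (20, 1))
particles = predict_particles(particles, rng)
weights = rng.random(20)            # placeholder weights; Sect. 2.4 defines the real ones
particles = resample_particles(particles, weights, rng)
```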
2.2 Object Representation
An object is represented by its (previously specified) state $X_t$ and a colour model. The non-parametric representation of the colour histogram of the object is $P = \{p^{(u)}\}_{u=1\ldots m}$, where $m$ is the number of bins in the histogram. It has been argued in previous works [3], [1] that not all pixels contribute equally to the object or candidate model. Thus, for example, pixels on the boundary of a region are
typically more prone to errors than pixels in the interior of the region. A common strategy for overcoming this problem has been to use a kernel function such as the Epanechnikov kernel [19] to weight the pixels' contribution to the histogram. The same kernel function is applied irrespective of the position of the region. Our contention is that blind application of a kernel function can lead to (a) a drift problem when the object model is updated and (b) poor localisation of the object during a merge. Small errors can accumulate and ultimately the target model can become completely different from the actual object. Our strategy in building the object and candidate histograms is to weight a pixel's contribution by taking into account background-foreground segmentation information. To achieve this, the foreground segmentation result is first cleaned up using morphological operations. The Manhattan distance transform [18], [17] is then applied to obtain the weights of the pixels for their contribution to the object/candidate histogram. In a binary image the distance transform replaces the intensity of each foreground pixel with the distance of that pixel to its nearest background pixel. Thus, centrally located pixels (in the sense of being further from the background) receive greater weight and pixels on the boundary separating foreground and background receive small weights. The distance transform appears to be better suited for this purpose than more traditional kernel functions. The scores $p^{(u)}$ of the bins of the histogram model of the object, $P = \{p^{(u)}\}_{u=1\ldots m}$, are computed using the following equation:
$$p^{(u)} = \sum_{x_j \in \text{Foreground Region}} w(x_j)\, \delta(g(x_j) - u), \qquad (4)$$
where $\delta$ is the Kronecker delta function, $g(x_j)$ assigns a bin in the histogram to the colour at location $x_j$, and $w(x_j)$ is the weight of the pixel at location $x_j$ obtained by applying the distance transform to the foreground-segmented region. The weights for background pixels are almost zero, which makes it very unlikely that the tracker will shift to background regions of the scene. When two or more objects merge, this is detected using a merge-split algorithm [16], and the updating of the object model is temporarily halted.
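The following is a minimal sketch of the distance-transform-weighted histogram of Eq. (4), assuming OpenCV and NumPy are available. The bin count and colour quantisation are illustrative choices, not values specified by the authors.

```python
import numpy as np
import cv2

def dt_weighted_histogram(image_bgr, fg_mask, bins_per_channel=8):
    """Colour histogram weighted by the Manhattan (L1) distance transform.

    image_bgr : HxWx3 uint8 image
    fg_mask   : HxW uint8 mask, non-zero on foreground pixels
    """
    # Manhattan distance of each foreground pixel to the nearest background pixel.
    weights = cv2.distanceTransform((fg_mask > 0).astype(np.uint8), cv2.DIST_L1, 3)

    # Quantise each channel and form a joint bin index g(x_j).
    quant = (image_bgr.astype(np.int32) * bins_per_channel) // 256
    bin_idx = (quant[..., 0] * bins_per_channel + quant[..., 1]) * bins_per_channel + quant[..., 2]

    fg = fg_mask > 0
    hist = np.bincount(bin_idx[fg], weights=weights[fg],
                       minlength=bins_per_channel ** 3)
    return hist / max(hist.sum(), 1e-12)   # normalised histogram P

# Tiny synthetic example: a bright square on a dark background.
img = np.zeros((40, 40, 3), np.uint8); img[10:30, 10:30] = (0, 128, 255)
mask = np.zeros((40, 40), np.uint8);   mask[10:30, 10:30] = 255
P = dt_weighted_histogram(img, mask)
```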
2.3 Observation Representation
To estimate the posterior probability of the state of the object, $N_s$ hypotheses of the object are maintained (recall Eq. (1)). Each hypothesis gives rise to an observation representation which is used to evaluate the likelihood of it being the tracked object. The histogram for a hypothesised region in the current image frame, $Q = \{q^{(u)}\}_{u=1\ldots m}$, where $m$ is the number of bins in the histogram, is defined as
$$q^{(u)} = \sum_{x_j \in \text{Foreground Region}} w(x_j)\, \delta(g(x_j) - u), \qquad (5)$$
analogously to Eq. (4). The observations from the foreground segmentation are the centroids, widths and heights of the different foreground blobs in the current frame.
Nearest-neighbour data association is used to associate a measurement $(x_c^m, y_c^m, W^m, H^m)$ with an object. In the event of a merger of objects, the centroid used for evaluating a hypothesis is computed as the weighted mean of the foreground pixels in the region defined by the hypothesis/particle, where the weights are those obtained from the distance transform applied to the foreground segmentation result.
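A minimal NumPy sketch of this step is given below: nearest-neighbour association of blob measurements to object states, and a distance-transform-weighted centroid for use during merges. The Euclidean matching criterion on centroids is an assumption made for the illustration; the paper itself only specifies nearest-neighbour association.

```python
import numpy as np

def associate_nearest(object_states, blob_measurements):
    """Assign to each object state [xc, yc, W, H] the nearest blob measurement (by centroid)."""
    assignments = {}
    for oid, state in object_states.items():
        d = np.linalg.norm(blob_measurements[:, :2] - state[:2], axis=1)
        assignments[oid] = int(np.argmin(d))
    return assignments

def weighted_centroid(dt_weights, x0, y0, w, h):
    """Distance-transform-weighted centroid of the foreground pixels inside a particle's box."""
    H, W = dt_weights.shape
    y1, y2 = max(0, int(y0 - h / 2)), min(H, int(y0 + h / 2))
    x1, x2 = max(0, int(x0 - w / 2)), min(W, int(x0 + w / 2))
    ys, xs = np.mgrid[y1:y2, x1:x2]
    ww = dt_weights[ys, xs]
    total = ww.sum()
    if total <= 0:
        return x0, y0                       # no foreground support: keep the hypothesis centre
    return (ww * xs).sum() / total, (ww * ys).sum() / total
```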
2.4 Hypothesis Evaluation
Each hypothesis for the object state is evaluated using colour information and foreground information. A likelihood function [6] is used to compute the weight of each particle, integrating the colour and foreground cues as follows:
$$L(Z_t|X_t^i) = L_{colour}(Z_{c,t}|X_t^i) \times L_{fg}(Z_{fg,t}|X_t^i), \qquad (6)$$
where $Z_{c,t}$ is the current frame and $Z_{fg,t}$ is the measurement from the current frame, after foreground segmentation and eight-connected component analysis, associated with the object. Here,
$$L_{colour}(Z_{c,t}|X_t^i) = \exp\!\left(-d_c(P_t, Q_t)/\sigma_{Z_c}\right), \qquad (7)$$
where $d_c(P,Q) = \sqrt{1 - \rho(P,Q)}$ is the Bhattacharyya distance based on the Bhattacharyya coefficient $\rho(P,Q) = \sum_{i=1}^{m}\sqrt{p^{(i)} q^{(i)}}$, and $\sigma_{Z_c}$ is the zero-mean Gaussian noise associated with the colour observation. The term $L_{fg}(Z_{fg,t}|X_t^i)$ is the likelihood based on the foreground segmentation measurement and is given by
$$L_{fg}(Z_{fg,t}|X_t^i) = \exp\!\left(-d_{fg}(X_t^i, X_t^m)/\sigma_{z_{fg}}\right), \qquad (8)$$
where $d_{fg}(X_t^i, X_t^m) = 1 - \exp(-\lambda)$ and
$$\lambda = \frac{(x_c^i - x_c^m)^2 + (y_c^i - y_c^m)^2 + (W_c^i - W_c^m)^2 + (H_c^i - H_c^m)^2}{W_c^i \times H_c^i} \qquad (9)$$
when there is a match for the object by data association. In the case of a merge, $d_{fg}(X_t^i, X_t^m) = 1 - \exp\!\big(-[(x_c^i - x_c^M)^2 + (y_c^i - y_c^M)^2]/(W_c^i \times H_c^i)\big)$, where $(x_c^M, y_c^M)$ is the weighted centroid of the foreground pixels in the region defined by particle $X_t^i$.

For a meaningful, balanced integration of cues, the functions $d_c$ and $d_{fg}$ should behave similarly. To test this, we plotted $d_c$ against $(1 - \rho(P,Q)) \in [0,1]$, where $\rho = 1$ means the best match and $\rho = 0$ the worst match of histograms $P$ and $Q$. We also plotted $d_{fg}$ for $\lambda \in [0,2]$, where zero means a good match and two a bad match. Figure 2 shows that the plots of the two distance functions are very similar.
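The sketch below evaluates Eqs. (6)-(9) for one particle, assuming normalised histograms and an already associated measurement are available. The noise parameters sigma_c and sigma_fg are free parameters of the likelihood; the values used here are placeholders, not the ones used by the authors.

```python
import numpy as np

def bhattacharyya_distance(p, q):
    """d_c(P, Q) = sqrt(1 - sum_i sqrt(p_i * q_i)) for normalised histograms."""
    rho = np.sum(np.sqrt(p * q))
    return np.sqrt(max(1.0 - rho, 0.0))

def likelihood(particle, measurement, p_model, q_candidate,
               sigma_c=0.2, sigma_fg=0.5):
    """Combined colour/segmentation likelihood L = L_colour * L_fg of Eq. (6)."""
    # Colour cue, Eq. (7).
    l_colour = np.exp(-bhattacharyya_distance(p_model, q_candidate) / sigma_c)

    # Segmentation cue, Eqs. (8)-(9): normalised squared distance between the
    # particle state [xc, yc, W, H] and the associated blob measurement.
    xc, yc, w, h = particle
    xm, ym, wm, hm = measurement
    lam = ((xc - xm) ** 2 + (yc - ym) ** 2 + (w - wm) ** 2 + (h - hm) ** 2) / (w * h)
    d_fg = 1.0 - np.exp(-lam)
    l_fg = np.exp(-d_fg / sigma_fg)
    return l_colour * l_fg
```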
2.5 Model Update
To handle the appearance change of the object due to variation in illumination, pose, distance from the camera, etc., the object model is updated using the auto-regressive learning process
$$P_{t+1} = (1 - \alpha) P_t + \alpha P_t^{est}. \qquad (10)$$
Fig. 2. Left: the plot of $d_c$ against $(1 - \rho(P,Q))$ on the x-axis, for matching the colour observation. Right: the plot of $d_{fg}$ against $\lambda$ on the x-axis. The two plots are very similar.
Here $P_t^{est}$ is the histogram of the region defined by the mode of the particles used in tracking the object, and $\alpha$ is the learning rate. The higher the value of $\alpha$, the faster the object model is updated to the new region. The model update is applied when the likelihood of the current estimate of the state of the object, $X_t^{est}$, with respect to the current measurement $Z_t$, given by
$$L(Z_t|X_t^{est}) = L_{colour}(Z_{c,t}|X_t^{est}) \times L_{fg}(Z_{fg,t}|X_t^{est}), \qquad (11)$$
is greater than an empirical threshold.
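A small sketch of the gated auto-regressive update of Eqs. (10)-(11) follows; the learning rate and acceptance threshold are illustrative values, since the paper only states that the threshold was set empirically.

```python
import numpy as np

def update_model(p_model, p_estimate, likelihood_at_mode,
                 alpha=0.1, threshold=0.4, merged=False):
    """Auto-regressive model update P_{t+1} = (1 - alpha) P_t + alpha P_t^est,
    applied only when the estimate is trusted and the object is not merged."""
    if merged or likelihood_at_mode <= threshold:
        return p_model                       # suspend the update
    updated = (1.0 - alpha) * p_model + alpha * p_estimate
    return updated / updated.sum()           # keep the histogram normalised
```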
3 Results
Figure 3 shows the tracking result for a video from the CAVIAR data set. The person on the left of the frame undergoes significant scale and illumination change. As the person walks past the shop window the illumination on the person changes and hence there is a significant change in appearance. This is evident from the model histogram plots of the object at different instances of time in Figure 4, which shows the colour model for the person tracked with the dashed bounding box in Figure 3. In such a case an ordinary colour tracker suffers large localisation errors, and an adaptive colour tracker is likely to drift to other parts of the scene. The tracker proposed in this paper tracks the object accurately throughout the duration of the video.

Figure 5 shows the improvement in localisation of two targets when the targets overlap. Figure 5a shows the tracking result using the colour cue alone. Even though the colours of the two targets are different, nothing precludes the mode of the particles from converging to positions that include parts of the other object. Under such circumstances, if the tracker is adaptive then it is quite possible that it will drift to parts of the scene other than the object of interest. Incorporating cues from foreground segmentation gives better localisation of the targets in the case of overlap, as is evident from Figures 5b and 6.

Figures 6 and 7 show some more tracking results. These two sequences are particularly difficult because there are instances of long and complete occlusions of targets by each other. In the former sequence there is an occlusion that lasts for 280 frames. In the latter sequence the objects are small, there are partial occlusions by background objects, and the noise level is high. The complete tracking results can be downloaded from http://www.cs.adelaide.edu.au/~vision/projects/accv07-tracking/.
Fig. 3. These frames show successful tracking of three objects in a video from the CAVIAR data set
Fig. 4. The images show the RGB histogram model of the person on the left, for three different instances as tracking progresses. Because of the change in illumination due to shop windows there is significant change in appearance and hence the object model.
Fig. 5. The left image shows the poor localisation of the object when only the colour cue is used for tracking. The right image shows the improved localisation of the object when both colour and segmentation cues are integrated for tracking.
The proposed approach is more reliable for tracking objects when there are changes in scale, pose and illumination, and under occlusion. In our experiments we have been able to track objects with as few as 20 particles. However, two drawbacks were observed in the proposed method of tracking: (1) when shadows are detected as foreground, the localisation of the object is less accurate. This can be improved by using shadow removal
Fig. 6. These frames show successful tracking of objects in spite of almost complete occlusion
Fig. 7. These frames show successful tracking of people in spite of poor illumination, small object size, and several occlusions. The left and middle images show tracking of two persons. The right image shows tracking of the three persons present simultaneously in the scene.
methods; (2) during almost complete occlusions there are errors in localisation, but correct tracking resumes when the objects separate after the occlusion. Correct localisation of an occluded target during complete occlusion with a single sensor is a very difficult problem. Given the unconstrained environment of the real-life situations in the CAVIAR dataset, quite good tracking results are obtained by the scheme presented in this paper.
4 Conclusion
An enhanced scheme for tracking multiple objects in video has been proposed and demonstrated. Novel contributions of this work include a new weight function for construction of the object and candidate model. The measurement obtained from foreground segmentation is integrated with a colour cue to achieve better localisation of the object. Sometimes there are errors in segmentation and sometimes the colour cue is not reliable, but integration of the two cues gives a better result. The proposed method improves handling of object models undergoing change, rendering the system less susceptible to the drift problem. Furthermore the tracker can follow an object with as few as 20 particles. The method can be extended to moving cameras by using optical flow, mosaic or epipolar constraint techniques to segment the moving foreground objects.
References

1. Dorin, C., Visvanathan, R., Meer, P.: Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(5), 564–577 (2003)
2. Isard, M., Blake, A.: Condensation – conditional density propagation for visual tracking. International Journal of Computer Vision 29(1), 5–28 (1998)
3. Nummiaro, K., Koller-Meier, E., Gool, L.J.V.: Object tracking with an adaptive color-based particle filter. In: Proceedings of the 24th DAGM Symposium on Pattern Recognition, pp. 353–360. Springer, Heidelberg (2002)
4. Zhou, S.K., Chellappa, R., Moghaddam, B.: Visual tracking and recognition using appearance adaptive models in particle filters. IEEE Transactions on Image Processing 13(11), 1434–1456 (2004)
5. Ross, D., Lim, J., Yang, M.H.: Adaptive probabilistic visual tracking with incremental subspace update. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3022, pp. 470–482. Springer, Heidelberg (2004)
6. Brasnett, P.A., Mihaylova, L., Canagarajah, N., Bull, D.: Particle filtering with multiple cues for object tracking in video sequences. In: Proceedings of SPIE. Image and Video Communications and Processing, vol. 5685, pp. 430–441 (2005)
7. Wu, Y., Huang, T.S.: Robust visual tracking by integrating multiple cues based on co-inference learning. Int. J. Comput. Vision 58(1), 55–71 (2004)
8. Spengler, M., Schiele, B.: Towards robust multi-cue integration for visual tracking. Machine Vision and Applications 14, 50–58 (2003)
9. Perez, P., Vermaak, J., Blake, A.: Data fusion for visual tracking with particles. Proceedings of the IEEE 92(3), 495–513 (2004)
10. Shao, J., Zhou, S.K., Chellappa, R.: Tracking algorithm using background-foreground motion models and multiple cues. In: ICASSP 2005. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 233–236 (2005)
11. Collins, R.T., Liu, Y., Leordeanu, M.: Online selection of discriminative tracking features. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(10), 1631–1643 (2005)
12. Han, B., Davis, L.: Object tracking by adaptive feature extraction. In: ICIP 2004. International Conference on Image Processing, vol. 3, pp. 1501–1504 (2004)
13. Arulampalam, S., Maskell, S., Gordon, N., Clapp, T.: A tutorial on particle filters for on-line non-linear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing 50(2), 174–188 (2002)
14. Kang, J., Cohen, I., Medioni, G., Yuan, C.: Detection and tracking of moving objects from a moving platform in presence of strong parallax. In: Proceedings of the Tenth International Conference on Computer Vision, Beijing, China, vol. 1, pp. 10–17 (2005)
15. Kumar, P., Ranganath, S., Huang, W.: Queue based fast background modelling and fast hysteresis thresholding for better foreground segmentation. In: The Fourth Pacific Rim Conference on Multimedia. Proceedings of the 2003 Joint Conference of the Fourth International Conference on Information, Communications and Signal Processing, vol. 2, pp. 743–747 (2003)
16. Kumar, P., Ranganath, S., Sengupta, K., Huang, W.: Cooperative multitarget tracking with efficient split and merge handling. IEEE Transactions on Circuits and Systems for Video Technology 16(12), 1477–1490 (2006)
17. Jain, A.K.: Fundamentals of Digital Image Processing. Prentice Hall International, Englewood Cliffs (1989)
18. Rosenfeld, A., Pfaltz, J.: Distance functions in digital pictures. Pattern Recognition 1, 33–61 (1968)
19. Dorin, C., Visvanathan, R., Meer, P.: Real-time tracking of non-rigid objects using mean shift. In: IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head Island, SC, USA, vol. 2, pp. 142–149 (2000)
Image Assimilation for Motion Estimation of Atmospheric Layers with Shallow-Water Model

Nicolas Papadakis¹, Patrick Héas¹, and Étienne Mémin¹,²,³

¹ IRISA/INRIA, Campus de Beaulieu, 35042 Rennes, France
² CEFIMAS, Avenida Santa Fe 1145, C1059ABF Buenos Aires, Argentina
³ Fac. de Ing. de la Univ. de Buenos Aires, Av. Paseo Colón 850, C1063ACV Buenos Aires, Argentina
Abstract. The complexity of the dynamical laws governing 3D atmospheric flows, together with incomplete and noisy observations, makes the recovery of atmospheric dynamics from satellite image sequences very difficult. In this paper, we address the challenging problem of jointly estimating time-consistent horizontal motion fields and pressure maps at various atmospheric depths. Based on a vertical decomposition of the atmosphere, we propose a dense motion estimator relying on a multi-layer dynamical model. Noisy and incomplete pressure maps obtained from satellite images are reconstructed according to a shallow-water model on each cloud layer using a framework derived from data assimilation. While reconstructing dense pressure maps, this variational process estimates time-consistent horizontal motion fields related to the multi-layer model. The proposed approach is validated on a synthetic example and applied to a real-world meteorological satellite image sequence.
1 Introduction
Geophysical motion characterization and image sequence analysis are crucial issues for numerous scientific domains involved in the study of climate change, weather forecasting, climate prediction or biosphere analysis. The use of surface station, balloon, and more recently in-flight aircraft measurements and low-resolution satellite images has improved the estimation of wind fields and has been an important step towards a better understanding of meteorological phenomena. However, the temporal and spatial resolutions of these networks may be insufficient for the analysis of mesoscale dynamics. Recently, in an effort to overcome these limitations, a new generation of satellite sensors has been designed, providing image sequences with finer spatial and temporal resolutions. Nevertheless, the analysis of motion remains particularly challenging due to the complexity of atmospheric dynamics at such scales. Tools are needed to exploit this new generation of satellite images, and we believe it is important that the computer vision community gets involved in this domain, as it can bring relevant contributions to the analysis of spatio-temporal data.
Nevertheless, in the context of geophysical motion analysis, standard techniques from computer vision, originally designed for bi-dimensional quasi-rigid motions with stable salient features, are not well adapted [1,2]. The design of techniques dedicated to fluid flow has been a step forward towards reliable methods for extracting the characteristic features of flows [3,4]. However, for geophysical applications, existing fluid-dedicated methods are all limited to frame-to-frame estimation and do not use the underlying physical laws. Geophysical flows, on the other hand, are quite well described by appropriate physical models. As a consequence, in such contexts a physics-based approach can be very powerful for analyzing incomplete and noisy image data, in comparison to standard statistical methods. The inclusion of physical priors leads to advanced techniques for motion analysis which may be of interest to the computer vision community, opening new application domains of broad practical relevance and calling for the design of appropriate and efficient techniques. This is thus a research domain with wide perspectives, and our work is a contribution in this direction.

The method proposed in this paper is significantly different from previous work on motion analysis from satellite imagery. Indeed, our method estimates physically sound and time-consistent motion fields at different atmospheric levels for the whole image sequence. More precisely, we use a shallow-water formulation of the Navier-Stokes equations to control the motion evolution across the sequence. This is done through a variational approach derived from the data assimilation principle, which combines the a priori dynamics and the pressure difference observations obtained from satellite images.
2 Data Assimilation Principle

2.1 Introduction
Data assimilation is a technique related to optimal control theory which allows the state of a system of variables of interest to be estimated over time [5,6,7,8]. The method provides a smoothing of the unknown variables according to an initial state of the system, a dynamical law and noisy measurements of the system's state. Let $V$ be a Hilbert space, identified with its dual, defined over $\Omega$. The evolution of the state variable $X \in \mathcal{W}(t_0, t_f) = \{f \,|\, f \in L^2(t_0; t_f; V)\}$ is assumed to be described by a (possibly nonlinear) differential dynamical model $\mathbb{M} : V \to V$:
$$\partial_t X(x,t) + \mathbb{M}(X(x,t)) = 0, \qquad X(t_0) = X_0, \qquad (1)$$
where $X_0$ is a control parameter. We then assume that noisy observations $Y \in \mathcal{O}$ are available, where $\mathcal{O}$ is another Hilbert space. These observations may live in a different space (a reduced space, for instance) from the state variable. We nevertheless assume that there exists a differential operator $\mathbb{H} : V \to \mathcal{O}$ that maps the variable space to the observation space. A least squares estimation of the control variable with respect to the whole sequence of measurements available
within the considered time range $[t_0; t_f]$ amounts to minimizing, with respect to the control variable $X_0 \in V$, a cost function of the following form:
$$J(X_0) = \frac{1}{2} \int_{t_0}^{t_f} \|Y - \mathbb{H}X(X_0, t)\|_R^2 \, dt, \qquad (2)$$
where $R$ is the covariance matrix of the observations $Y$. A first approach consists in computing the functional gradient through finite differences. Denoting by $N$ the dimension of the control parameter $X_0$, such a computation is impractical for a control space of large dimension, since it requires $N$ integrations of the evolution model for each required value of the gradient. Adjoint models, first introduced in meteorology by Le Dimet and Talagrand [7], allow the functional gradient to be computed with a single backward integration of an adjoint variable: the value of this adjoint variable at the initial time provides the value of the gradient at the desired point. This approach is widely used in the environmental sciences for the analysis of geophysical flows [7,8].
2.2 Differentiated Model
To obtain the adjoint model, the system of equations (1) is first differentiated with respect to a small perturbation $dX = \frac{\partial X}{\partial X_0} dX_0$:
$$\partial_t\, dX(x,t) + \partial_X \mathbb{M}\, dX = 0, \qquad dX(t_0) = dX_0, \qquad (3)$$
where $\partial_X \mathbb{M}$ is the tangent linear operator of $\mathbb{M}$ defined by its Gâteaux derivative. The gradient of the functional in the direction $dX_0$ must also be computed:
$$\left\langle \frac{\partial J}{\partial X_0}, dX_0 \right\rangle = \int_{t_0}^{t_f} \left\langle Y - \mathbb{H}X(X_0),\ \mathbb{H}\frac{\partial X}{\partial X_0} dX_0 \right\rangle_R dt = \int_{t_0}^{t_f} \left\langle \mathbb{H}^* R\,(Y - \mathbb{H}X(X_0)),\ dX \right\rangle_V dt, \qquad (4)$$
where $\mathbb{H}^*$ is the adjoint operator of $\mathbb{H}$, defined by $\forall X \in V,\ Y \in \mathcal{O}:\ \langle X, \mathbb{H}Y \rangle_V = \langle \mathbb{H}^* X, Y \rangle_{\mathcal{O}}$.
2.3 Adjoint Model
We then introduce the adjoint variable $\lambda \in \mathcal{W}(t_0, t_f)$. The first equation of the differentiated model (3) is multiplied by this adjoint variable and integrated in the time interval $[t_0; t_f]$:
$$\int_{t_0}^{t_f} \left\langle \partial_t\, dX(x,t) + \partial_X \mathbb{M}\, dX,\ \lambda \right\rangle_V dt = 0.$$
After an integration by parts, we have:
$$\int_{t_0}^{t_f} \left\langle -\partial_t \lambda + \partial_X \mathbb{M}^* \lambda,\ dX(x,t) \right\rangle_V dt = \langle \lambda(t_0), dX(t_0) \rangle_V - \langle \lambda(t_f), dX(t_f) \rangle_V, \qquad (5)$$
where the adjoint operator $\partial_X \mathbb{M}^*$ is defined by $\forall X, Y \in V:\ \langle X, \partial_X \mathbb{M} Y \rangle_V = \langle \partial_X \mathbb{M}^* X, Y \rangle_V$.

To perform the computation of the functional gradient, we assume that $\lambda(t_f) = 0$ and define the following adjoint problem:
$$-\partial_t \lambda + \partial_X \mathbb{M}^* \lambda = \mathbb{H}^* R\, (Y - \mathbb{H}X(X_0)), \qquad \lambda(t_f) = 0. \qquad (6)$$

2.4 Functional Gradient
Combining (4), (5) and (6), we finally obtain the functional gradient as:
$$\frac{\partial J}{\partial X_0} = \lambda(t_0). \qquad (7)$$
Hence, the assimilation principle enables the functional gradient to be computed with a single backward integration. In the next section, we adapt this process to the control of high-dimensional state variables characterizing the dynamics of layered atmospheric flows.
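To illustrate the forward/backward structure of Eqs. (1)-(7), here is a toy NumPy sketch for a linear model M(X) = AX observed directly (H = I), discretized with explicit Euler. It is only a schematic illustration under these simplifying assumptions, not the scheme used in the paper (which relies on non-oscillatory spatial schemes and a shallow-water operator); with the forcing sign used in Eq. (6), adding a small multiple of lambda(t0) to X0 decreases the misfit.

```python
import numpy as np

def assimilate_linear(A, Y, X0, dt, n_steps, n_iters=200, step=0.05):
    """Toy variational assimilation for dX/dt = -A X with observations Y[k] of X."""
    for _ in range(n_iters):
        # Forward integration of the state (explicit Euler).
        X = [X0.copy()]
        for _ in range(n_steps):
            X.append(X[-1] - dt * A @ X[-1])

        # Backward integration of the adjoint, lambda(t_f) = 0,
        # forced by the observation misfit (R taken as identity here).
        lam = np.zeros_like(X0)
        for k in range(n_steps, 0, -1):
            misfit = Y[k] - X[k]
            lam = lam + dt * (misfit - A.T @ lam)

        # Update of the control variable (initial condition).
        X0 = X0 + step * lam
    return X0

# Example: recover the initial condition of a damped 2D linear system.
rng = np.random.default_rng(0)
A = np.array([[0.5, -0.2], [0.1, 0.3]])
dt, n_steps = 0.05, 40
true_X0 = np.array([1.0, -2.0])
traj = [true_X0.copy()]
for _ in range(n_steps):
    traj.append(traj[-1] - dt * A @ traj[-1])
Y = [x + 0.05 * rng.normal(size=2) for x in traj]   # noisy observations
X0_est = assimilate_linear(A, Y, np.zeros(2), dt, n_steps)
```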
3 Application to Atmospheric Layer Motion Estimation

3.1 Layer Decomposition
The layering of atmospheric flow in the troposphere is valid in the limit of horizontal scales much greater than the vertical scale height, thus roughly for horizontal scales greater than 100 km. In order to make the layering assumption valid in the case of satellite images of kilometer order, low-resolution observations on a coarser grid are considered. One can thus decompose the 3D space into elements of variable thickness, corresponding to layers. An analysis based on such a decomposition has the main advantage of operating at different atmospheric pressure ranges and avoids mixing heterogeneous observations.

Let us present the 3D space decomposition that we chose for the definition of the $K$ layers. The $k$-th layer corresponds to the volume lying between an upper surface $s^{k+1}$ and a lower surface $s^k$. These surfaces $s^k$ are defined by the height of the top of the clouds belonging to the $k$-th layer; they are thus defined only in areas where there exist clouds belonging to the $k$-th layer, and remain undefined elsewhere. The membership of cloud tops to the different layers is determined by cloud classification maps. Such classifications, which are based on thresholds of top-of-cloud pressure, are routinely provided by the EUMETSAT consortium, the European agency which supplies the METEOSAT satellite data, as illustrated in Figure 1.
3.2 Sparse Pressure Difference Observations
Top of cloud pressure images are also routinely provided by the EUMETSAT consortium. They are derived from a radiative transfer model using ancillary data
Fig. 1. Top of cloud classification. Satellite image of the visible channel at 0.8μm (a), visualization (in the same channel) of top of clouds classified by the EUMETSAT consortium : low layer (b), middle layer (c) and high layer (d).
obtained from analyses or short-term forecasts. Multi-channel techniques enable the determination of the pressure at the top of semi-transparent clouds [9]. We denote by $C^k$ the class corresponding to the $k$-th layer. Note that the top-of-cloud pressure image, denoted by $p$, is composed of segments of top-of-cloud pressure functions $p(s^{k+1})$ related to the different layers, that is to say $p = \bigcup_k \{p(s^{k+1}, s);\ s \in C^k\}$. Thus, pressure images of cloud tops are used to constitute sparse pressure maps of the layer upper boundaries $p(s^{k+1})$. As the lower boundaries of clouds are always occluded in satellite images, we coarsely approximate the missing pressure observations $p(s^k)$ by an average pressure value $\overline{p}^k$ observed on the cloud tops of the layer underneath. Finally, for layer $k \in [1, K]$, we define the observations $h_{obs}^k$ as pressure differences in hectopascal (hPa) units:
$$h_{obs}^k = \begin{cases} \overline{p}^k(s) - p & \text{if } s \in C^k, \\ 0 & \text{if } s \in \bar{C}^k. \end{cases} \qquad (8)$$
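A minimal NumPy sketch of Eq. (8) is given below; it assumes a top-of-cloud pressure image, a per-pixel cloud-class map and a pre-computed mean pressure for the layer underneath, all of which are hypothetical inputs used only for illustration.

```python
import numpy as np

def pressure_difference_obs(pressure_img, class_map, mean_pressure_below, k):
    """Sparse pressure-difference observations h_obs^k of Eq. (8).

    pressure_img        : top-of-cloud pressure image p (hPa)
    class_map           : integer map of cloud-layer membership C^k
    mean_pressure_below : average top-of-cloud pressure of the layer underneath (hPa)
    """
    in_layer = (class_map == k)
    h_obs = np.zeros_like(pressure_img, dtype=float)
    h_obs[in_layer] = mean_pressure_below - pressure_img[in_layer]
    return h_obs, in_layer      # observations and their validity mask C^k

# Example with a 4x4 toy image: layer-2 clouds occupy the upper-left block.
p = np.full((4, 4), 900.0); p[:2, :2] = 450.0
classes = np.zeros((4, 4), int); classes[:2, :2] = 2
h2, mask2 = pressure_difference_obs(p, classes, mean_pressure_below=820.0, k=2)
```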
3.3 Shallow-Water Model
In order to provide a dynamical model for the previous pressure difference observations, we use the shallow-water approximation (horizontal motion much greater than vertical motion) derived under the assumption of layer incompressibility (layers are characterized by mean densities $\rho^k$ which can be approximated according to the layer average pressure [10]). The shallow-water approximation is valid for mesoscale analysis in a layered atmosphere. As friction components can be neglected, the vertical integration of the momentum equation between the boundaries $s^k$ and $s^{k+1}$ yields, for the $k$-th layer, the equation [6,11,12]:
$$\frac{\partial \mathbf{q}^k}{\partial t} + \mathrm{div}\!\left(\frac{1}{h^k}\mathbf{q}^k \otimes \mathbf{q}^k\right) + \frac{1}{2\rho^k}\nabla_{xy}(h^k)^2 + g h^k \nabla_{xy}(s^{k+1}) + f^{\phi}\begin{pmatrix}0 & -1\\ 1 & 0\end{pmatrix}\mathbf{q}^k = 0, \qquad (9)$$
with
$$h^k = p(z = s^k) - p(z = s^{k+1}), \qquad (10)$$
$$\mathbf{v}^k = (u^k, v^k) = \frac{1}{h^k}\int_{p(z=s^{k+1})}^{p(z=s^k)} \mathbf{v}\, dp, \qquad (11)$$
$$\mathbf{q}^k = h^k \mathbf{v}^k, \qquad (12)$$
$$\mathrm{div}\!\left(\frac{1}{h^k}\mathbf{q}^k \otimes \mathbf{q}^k\right) = \begin{pmatrix} \dfrac{\partial (h^k (u^k)^2)}{\partial x} + \dfrac{\partial (h^k u^k v^k)}{\partial y} \\[6pt] \dfrac{\partial (h^k u^k v^k)}{\partial x} + \dfrac{\partial (h^k (v^k)^2)}{\partial y} \end{pmatrix}, \qquad (13)$$
where $f^{\phi}$ represents the Coriolis factor depending on the latitude $\phi$. By adding the integrated continuity equation to Eq. (9), we obtain independent shallow-water equation systems [12] for the layers $k \in [1, K]$:
$$\begin{cases} \dfrac{\partial h^k}{\partial t} + \mathrm{div}(\mathbf{q}^k) = 0, \\[8pt] \dfrac{\partial \mathbf{q}^k}{\partial t} + \mathrm{div}\!\left(\dfrac{1}{h^k}\mathbf{q}^k \otimes \mathbf{q}^k\right) + \dfrac{1}{2\rho^k}\nabla_{xy}(h^k)^2 + f^{\phi}\begin{pmatrix}0 & -1\\ 1 & 0\end{pmatrix}\mathbf{q}^k = 0, \end{cases} \qquad (14)$$
where we have assumed that the surfaces $s^k$ and $s^{k+1}$ are locally flat in the vicinity of a pixel. This expression is discretized spatially with non-oscillatory schemes [13] and integrated in time with a third-order Runge-Kutta scheme. The equation system describes the dynamics of physical quantities expressed in standard units; thus, some dimensional factors appear when it is discretized on a pixel grid with velocities expressed in pixels per frame and pressure in hectopascal (hPa). As one pixel represents $\Delta x$ meters and one frame corresponds to $\Delta t$ seconds, the densities $\rho^k$, expressed in pascal times square seconds per square meter ($\mathrm{Pa\,s^2/m^2}$), must be multiplied by $10^{-2}\Delta x^2/\Delta t^2$, and the Coriolis factor $f^{\phi}$, expressed per second, must be multiplied by $\Delta t$. By a scale analysis, and as also observed in our experiments, for $\Delta t = 900$ seconds the third term of Eq. (9) has a magnitude similar to the other terms if $\Delta x \geq 25$ km. This is in agreement with the shallow-water assumption.
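For intuition about the dynamical model, the following is a highly simplified, self-contained 1D shallow-water integrator with the classical g h^2/2 pressure term and a first-order Lax-Friedrichs flux. It is only a didactic stand-in for the pressure-based multi-layer system (14): the paper's implementation uses non-oscillatory spatial schemes and third-order Runge-Kutta time integration, neither of which is reproduced here.

```python
import numpy as np

def shallow_water_1d_step(h, q, dx, dt, g=9.81):
    """One Lax-Friedrichs step of the 1D shallow-water system
    dh/dt + d(q)/dx = 0,  dq/dt + d(q^2/h + g h^2/2)/dx = 0 (periodic boundaries)."""
    U = np.stack([h, q])                                   # conserved variables
    flux = np.stack([q, q**2 / h + 0.5 * g * h**2])
    Up, Um = np.roll(U, -1, axis=1), np.roll(U, 1, axis=1)
    Fp, Fm = np.roll(flux, -1, axis=1), np.roll(flux, 1, axis=1)
    U_new = 0.5 * (Up + Um) - dt / (2.0 * dx) * (Fp - Fm)
    return U_new[0], U_new[1]

# Small dam-break style example on a periodic domain.
x = np.linspace(0.0, 1.0, 200)
h = np.where(x < 0.5, 2.0, 1.0)
q = np.zeros_like(x)
dx = x[1] - x[0]
dt = 0.2 * dx / np.sqrt(9.81 * h.max())                    # CFL-limited time step
for _ in range(100):
    h, q = shallow_water_1d_step(h, q, dx, dt)
```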
3.4 Assimilation of Layer Motion and Pressure Differences
We can now define all the components of the assimilation system allowing the recovery of the pressure difference observations of Section 3.2 through the dynamical model of Section 3.3. The final system enables the tracking of the pressure difference $h^k$ and average velocity $\mathbf{q}^k$ related to each of the $k \in [1, K]$ layers. Referring to Section 2, we then have $X^k = [h^k, \mathbf{q}^k]^T$. The evolution model $\mathbb{M}$ is given by the mesoscale dynamics (14). The only observations available are the pressure difference maps $h_{obs}^k$. For each layer $k$, the observation operator then reads $\mathbb{H} = [1, 0]$ and the process minimizes
$$J^k(h_0^k, \mathbf{q}_0^k) = \int_{t_0}^{t_f} \| h_{obs}^k - h^k(h_0^k, \mathbf{q}_0^k, t) \|_{R^k}^2 \, dt, \qquad (15)$$
through a backward integration of the adjoint model $(\partial_X \mathbb{M})^*$ defined by:
$$\begin{cases} -\partial_t \lambda_h^k(t) + \mathbf{w}^k \cdot (\mathbf{w}^k \cdot \nabla)\lambda_{\mathbf{q}}^k - h^k\, \mathrm{div}(\lambda_{\mathbf{q}}^k) = R^k \big(h_{obs}^k(t) - h^k(t)\big), \\[4pt] -\partial_t \lambda_{\mathbf{q}}^k(t) - (\mathbf{w}^k \cdot \nabla)\lambda_{\mathbf{q}}^k - (\nabla \lambda_{\mathbf{q}}^k)\mathbf{w}^k - \nabla \lambda_h^k + f^{\phi}\begin{pmatrix}0 & 1\\ -1 & 0\end{pmatrix}\lambda_{\mathbf{q}}^k = 0, \\[4pt] \lambda_h^k(t_f) = 0, \\ \lambda_{\mathbf{q}}^k(t_f) = 0. \end{cases} \qquad (16)$$
In this expression, $\lambda_h^k$ and $\lambda_{\mathbf{q}}^k$ are the two components of the adjoint variable $\lambda^k$ of layer $k$ [11]. More details on the construction of adjoint models can be found in [8]. One can finally define a diagonal covariance matrix $R^k$ using the observation mask $C^k$:
$$R^k(s, s) = \begin{cases} \alpha & \text{if } s \in C^k, \\ 0 & \text{if } s \in \bar{C}^k, \end{cases} \qquad (17)$$
where $\alpha$ is a fixed parameter (set to 0.1 in our applications) defining the observation covariances. However, as the observations are sparse, a nine-diagonal covariance matrix is employed to diffuse information in a 3x3 pixel vicinity. As the assimilation process is not guaranteed to reach a global minimum, the results depend on the initialization. Thus, the state variables $h_0^k$ are initialized with a constant value, while the initial values of the variables $\mathbf{q}_0^k$ are provided by an optic-flow algorithm dedicated to atmospheric layers [3].
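The sketch below builds the masked observation weighting of Eq. (17) and its 3x3 "diffused" variant as a sparse matrix, assuming SciPy is available. The way the off-diagonal weights are shared among neighbours is an illustrative choice, since the paper only states that a nine-diagonal matrix is used to spread information over a 3x3 vicinity.

```python
import numpy as np
from scipy.sparse import lil_matrix

def observation_covariance(mask, alpha=0.1, diffuse=True):
    """Observation weighting R^k over an H x W grid, restricted to the mask C^k."""
    H, W = mask.shape
    R = lil_matrix((H * W, H * W))
    ys, xs = np.nonzero(mask)
    for y, x in zip(ys, xs):
        i = y * W + x
        if not diffuse:
            R[i, i] = alpha
            continue
        # Spread the weight alpha over the observed pixel and its 3x3 neighbourhood.
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                yy, xx = y + dy, x + dx
                if 0 <= yy < H and 0 <= xx < W:
                    R[i, yy * W + xx] += alpha / 9.0
    return R.tocsr()

# Example: a 5x5 grid observed on a 2x2 cloudy patch.
mask = np.zeros((5, 5), bool); mask[1:3, 1:3] = True
Rk = observation_covariance(mask)
```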
4 Results

4.1 Synthetic Experiments
For an exhaustive evaluation, we relied on image observations generated by short-time numerical simulation of atmospheric layer motion according to the shallow-water dynamical model (Eq. (14)). Realistic initial conditions on the layer pressure function and motion were chosen to derive a synthetic sequence of 10 images. The sequence was then deteriorated by different noises and by a masking operation to form 4 different data sets. The first two synthetic image sequences, named $e_1$ and $e_2$, are composed of dense observations of $h_{obs}^k$ in hectopascal (hPa) units corrupted by Gaussian noise with standard deviation equal to 10 and 20% of the pressure amplitude, respectively. A real cloud classification map (used in the next experiment) was employed to extract regions of data sets $e_1$ and $e_2$ in order to create two noisy and incomplete synthetic sequences, $e_3$ and $e_4$ (see Figure 3). For initializing the assimilation system, we did not rely on an optic-flow algorithm in this synthetic case; we instead used known values of the variables $h_0^k$ and $\mathbf{q}_0^k$ deteriorated by Gaussian noise.

The results of the joint motion-pressure estimation performed by image assimilation are summarized in Fig. 2. It clearly appears that, for noisy observations, the assimilation process induces a significant decrease of the RMSE between real and estimated velocities and pressures. Moreover, this evaluation demonstrates the efficiency of the proposed estimator, for incomplete and noisy observations, at both estimating dense motion fields and reconstructing the pressure maps $h^k$. Examples of reconstruction for experiments $e_2$ and $e_3$ are presented in Figure 3.
4.2 Real Meteorological Image Sequence
We then turned to qualitative comparisons on a real meteorological image sequence. The benchmark data consisted of a sequence of 10 METEOSAT
Experiment | Mask | Noise % | $h_{obs}^k$ RMSE (hPa) | final $h^k$ RMSE (hPa) | initial $|v_0^k|$ RMSE (pixel/frame) | final $|v_0^k|$ RMSE (pixel/frame)
e1 | - | 10 | 15.813880 | 5.904791 | 0.22863 | 0.03457
e2 | - | 20 | 22.361642 | 8.133384 | 0.21954 | 0.05078
e3 | x | 10 | 15.627055 | 6.979769 | 0.22351 | 0.04978
e4 | x | 20 | 22.798671 | 10.930078 | 0.21574 | 0.05944
Fig. 2. Numerical evaluation. Decrease of the Root Mean Square Error (RMSE) of estimates hk and |v0k | by image assimilation for noisy (experiments e1 , e2 , e3 and e4 ) and sparse observations (experiments e3 and e4 ).
[Figure 3 panels, for experiments e2 and e3: (a) actual maps, (b) noised (and masked) maps, (c) estimated maps]
Fig. 3. Synthetic sequences: results of experiments e2 and e3, where the pressure maps have been noised (e2 and e3) and masked (e3)
Second Generation (MSG) images, showing top-of-cloud pressures with a corresponding cloud classification sequence. The 1024 x 1024 pixel images cover an area over the north Atlantic Ocean during part of one day (5 June 2004), at a rate of one image every 15 minutes. The spatial resolution is 3 kilometers at the center of the whole Earth image disk. Clouds from a cloud classification were used to segment the images into K = 3 broad layers, at low, intermediate and high altitude. In order to make the layering assumption valid, low-resolution observations on an image grid of 128 x 128 pixels are obtained by smoothing and sub-sampling the original data for each layer. By applying the methodology described in Section 3.4 to the images at this coarser resolution, average motion and pressure difference maps are estimated from the image sequence for these 3 layers. The estimated vector fields superimposed on the observed pressure difference maps are displayed in Figure 4 for each of the 3 layers. The motion fields estimated for the different layers on the cloudy
Fig. 4. First (left) and last (right) estimated horizontal wind fields superimposed on observed pressure difference maps, for the low, middle and high layers (original images have been subsampled into images of 128 x 128 pixels)
observable parts are consistent with the visual inspection of the sequence. In particular, several motion differences between layers are very relevant. For instance, near the bottom left corner of the images, the lower layer possesses a southward motion while the intermediate layer moves northward. Moreover, the temporal coherence of the retrieved motion demonstrates the efficiency of this spatio-temporal method under physical constraints.
5 Conclusion
In this paper, we have presented a new method for estimating time-consistent horizontal winds in a stratified atmosphere from satellite image sequences of top-of-cloud pressure. The proposed estimator applies to a set of sparse image observations related to a multi-layer atmosphere, which obey independent shallow-water models. In order to manage the incomplete and noisy observations while considering this non-linear physical model, a variational assimilation scheme is proposed. This process estimates time-consistent motion fields related to the layer components while performing the reconstruction of dense pressure difference maps. The merit of the joint motion-pressure estimator by image assimilation is demonstrated on both synthetic and real satellite images. In view of the various meteorological studies relying on the analysis of experimental data of atmospheric dynamics, we believe that the proposed multi-layer horizontal wind field estimation technique constitutes a valuable tool.
Acknowledgments

This work was supported by the European Community through the IST FET Open FLUID Project (http://fluid.irisa.fr).
References

1. Horn, B., Schunck, B.: Determining optical flow. Artificial Intelligence 17, 185–203 (1981)
2. Leese, J., Novack, C., Clark, B.: An automated technique for obtaining cloud motion from geosynchronous satellite data using cross correlation. Journal of Applied Meteorology 10, 118–132 (1971)
3. Héas, P., Mémin, E., Papadakis, N., Szantai, A.: Layered estimation of atmospheric mesoscale dynamics from satellite imagery. IEEE Trans. Geoscience and Remote Sensing (2007)
4. Zhou, L., Kambhamettu, C., Goldgof, D.: Fluid structure and motion analysis from multi-spectrum 2D cloud image sequences. In: Proc. Conf. Comp. Vision Pattern Rec., Hilton Head Island, USA, vol. 2, pp. 744–751 (2000)
5. Bennet, A.: Inverse Methods in Physical Oceanography. Cambridge University Press, Cambridge (1992)
6. Courtier, P., Talagrand, O.: Variational assimilation of meteorological observations with the direct and adjoint shallow-water equations. Tellus 42, 531–549 (1990)
7. Le Dimet, F.X., Talagrand, O.: Variational algorithms for analysis and assimilation of meteorological observations: theoretical aspects. Tellus, 97–110 (1986)
8. Talagrand, O., Courtier, P.: Variational assimilation of meteorological observations with the adjoint vorticity equation. I: Theory. J. of Roy. Meteo. Soc. 113, 1311–1328 (1987)
9. Schmetz, J., Holmlund, K., Hoffman, J., Strauss, B., Mason, B., Gaertner, V., Koch, A., Berg, L.V.D.: Operational cloud-motion winds from Meteosat infrared images. Journal of Applied Meteorology 32(7), 1206–1225 (1993)
10. Holton, J.: An Introduction to Dynamic Meteorology. Academic Press, London (1992)
11. Honnorat, M., Le Dimet, F.X., Monnier, J.: On a river hydraulics model and Lagrangian data assimilation. In: ADMOS 2005. International Conference on Adaptive Modeling and Simulation, Barcelona (2005)
12. de Saint-Venant, A.: Théorie du mouvement non-permanent des eaux, avec application aux crues des rivières et l'introduction des marées dans leur lit. C. R. Acad. Sc. Paris 73, 147–154 (1871)
13. Xu, Z., Shu, C.W.: Anti-diffusive finite difference WENO methods for shallow water with transport of pollutant. Journal of Computational Mathematics 24, 239–251 (2006)
Probability Hypothesis Density Approach for Multi-camera Multi-object Tracking

Nam Trung Pham¹,², Weimin Huang¹, and S.H. Ong²

¹ Institute for Infocomm Research, Singapore
² Department of Electrical and Computer Engineering, National University of Singapore
Abstract. Object tracking with multiple cameras is more efficient than tracking with a single camera. In this paper, we propose a multiple-camera multiple-object tracking system that can track 3D object locations even when objects are occluded in some cameras. Our system tracks objects and fuses data from multiple cameras using the probability hypothesis density filter. This method avoids data association between observations and object states, and tracks multiple objects in the single-object state space; it therefore has a lower computational cost than methods using a joint state space. Moreover, our system can track a varying number of objects. The results demonstrate that our method is highly reliable when tracking the 3D locations of objects.
1 Introduction
Tracking moving objects is an important part of many applications. Several methods have been proposed to track objects using a single camera [1]. However, when persons may be occluded by other persons in the scene, tracking them with one camera is difficult, because the information available from a single camera is not sufficient to resolve the occlusion. One way to address this problem is to use multiple cameras to recover information that may be missing from a particular camera. Furthermore, multiple cameras can be used to recover the 3D information of objects.

There are several approaches to tracking with multiple cameras. Most of them have two stages: a single-view stage and a multiple-view data fusion stage. In the single-view stage, observations and estimates are extracted; in the second stage, these data are fused to obtain the final results. Some methods track one object using multiple cameras [2], [3]; they track an object and switch to another camera when the system predicts that the current camera no longer has a good view of the object. However, these methods need to consider data association when extended from tracking one object to multiple objects. Other methods can track multiple objects [4], [5], [6], [7]. Among them, some match objects between different camera views [4], [5] or incorporate classification methods [6] to perform the data association between observations and objects in multiple views. These methods can combine multiple cameras for multiple-object tracking. However, when the appearances of objects
are similar or occlusions occur, these methods might not be suitable, because wrong matches may occur. Another idea is to find 3D observations that correspond to observations from different views [7]; however, associating observations from different views can increase the computational cost of searching for 3D observations.

Recently, there has been increasing research interest in using random set theory to solve multiple-object tracking. Here, the states of objects and the measurements are represented as random finite sets (RFS). Mahler [8] presented a probability hypothesis density (PHD) filter that operates on the single-object state space. Vo [9], [10] proposed implementations of the PHD filter; in particular, the implementation in [10] is a closed-form solution of the PHD filter, called the Gaussian mixture probability hypothesis density (GMPHD) filter.

In this paper, we extend the GMPHD filter from a single sensor to multiple sensors to track several people in a room using multiple cameras. It is assumed that the projection matrices from 3D space to the cameras are available. Our method can recover the 3D object locations and handle occlusions at each camera. We assume that colour models are available; the proposed tracking method can then be applied efficiently to track a varying number of objects. Furthermore, because the fusion of multiple cameras to obtain 3D object locations is based on the GMPHD filter, it requires less computation than methods based on search or on the particle filter.
2 PHD Filter Approach
In multiple-object tracking, it is difficult to obtain the posterior density function when the number of objects increases. Fortunately, this density function can be approximately recovered from a probability hypothesis density (PHD) [8]. To obtain the PHD at each time step, the PHD filter [8] can be applied. We now review one implementation of the PHD filter, the GMPHD filter [10], which is a closed-form solution of the PHD filter under linear Gaussian assumptions. These assumptions are as follows. Each object follows a linear Gaussian model, i.e.,
$$f_{k|k-1}(x|\zeta) = \mathcal{N}(x; F_{k-1}\zeta, Q_{k-1}), \qquad (1)$$
$$g_k(z|x) = \mathcal{N}(z; H_k x, R_k), \qquad (2)$$
where $\mathcal{N}(\cdot; m, P)$ denotes a Gaussian density with mean $m$ and covariance $P$, $F_{k-1}$ is the state transition matrix, $Q_{k-1}$ is the process noise covariance, $H_k$ is the observation matrix, and $R_k$ is the observation noise covariance. The survival and detection probabilities are $p_{S,k}$ and $p_{D,k}$, respectively. The intensity of the spontaneous birth RFS is
$$\gamma_k(x) = \sum_{i=1}^{J_{\gamma,k}} w_{\gamma,k}^{(i)}\, \mathcal{N}(x; m_{\gamma,k}^{(i)}, P_{\gamma,k}^{(i)}), \qquad (3)$$
where $J_{\gamma,k}$ is the number of birth Gaussian components. It is assumed that the posterior intensity at time $k-1$ is a Gaussian mixture of the form
$$v_{k-1}(x) = \sum_{i=1}^{J_{k-1}} w_{k-1}^{(i)}\, \mathcal{N}(x; m_{k-1}^{(i)}, P_{k-1}^{(i)}), \qquad (4)$$
where $J_{k-1}$ is the number of Gaussian components of $v_{k-1}(x)$. Under these assumptions, the predicted intensity at time $k$ is given by
$$v_{k|k-1}(x) = v_{S,k|k-1}(x) + \gamma_k(x), \qquad (5)$$
where
$$v_{S,k|k-1}(x) = p_{S,k} \sum_{j=1}^{J_{k-1}} w_{k-1}^{(j)}\, \mathcal{N}(x; m_{S,k|k-1}^{(j)}, P_{S,k|k-1}^{(j)}),$$
$$m_{S,k|k-1}^{(j)} = F_{k-1} m_{k-1}^{(j)}, \qquad P_{S,k|k-1}^{(j)} = Q_{k-1} + F_{k-1} P_{k-1}^{(j)} F_{k-1}^T.$$
Because $v_{S,k|k-1}(x)$ and $\gamma_k(x)$ are Gaussian mixtures, $v_{k|k-1}(x)$ can be expressed as a Gaussian mixture of the form
$$v_{k|k-1}(x) = \sum_{i=1}^{J_{k|k-1}} w_{k|k-1}^{(i)}\, \mathcal{N}(x; m_{k|k-1}^{(i)}, P_{k|k-1}^{(i)}). \qquad (6)$$
Then, the posterior intensity at time $k$ is also a Gaussian mixture, and is given by
$$v_k(x) = (1 - p_{D,k})\, v_{k|k-1}(x) + \sum_{z \in Z_k} v_{D,k}(x; z), \qquad (7)$$
where
$$v_{D,k}(x; z) = \sum_{j=1}^{J_{k|k-1}} w_k^{(j)}(z)\, \mathcal{N}(x; m_{k|k}^{(j)}, P_{k|k}^{(j)}),$$
$$w_k^{(j)}(z) = \frac{p_{D,k}\, w_{k|k-1}^{(j)}\, q_k^{(j)}(z)}{\kappa_k(z) + p_{D,k} \sum_{l=1}^{J_{k|k-1}} w_{k|k-1}^{(l)}\, q_k^{(l)}(z)},$$
$$q_k^{(j)}(z) = \mathcal{N}\!\left(z;\, H_k m_{k|k-1}^{(j)},\, R_k + H_k P_{k|k-1}^{(j)} H_k^T\right),$$
$$m_{k|k}^{(j)}(z) = m_{k|k-1}^{(j)} + K_k^{(j)}\big(z - H_k m_{k|k-1}^{(j)}\big),$$
$$P_{k|k}^{(j)} = \big[I - K_k^{(j)} H_k\big] P_{k|k-1}^{(j)}, \qquad K_k^{(j)} = P_{k|k-1}^{(j)} H_k^T \big(H_k P_{k|k-1}^{(j)} H_k^T + R_k\big)^{-1}.$$
3
N.T. Pham, W. Huang, and S.H. Ong
System Overview
We propose a method to track 3D locations of heads of people using multiple cameras with assumptions that the cameras are calibrated and the field of views of cameras overlap. The proposed method, as shown in Fig. 1, consists of two major components: single-view tracking and multiple-camera fusion. In the first component, at each camera at time k, we find color observations and then use the i i , ..., ym,k } GMPHD filter to estimate the 2D locations of objects. Let Yki = {y1,k be the set of 2D estimations of objects at time k, view i. We have n single views, so the set of 2D estimations of objects at time k can be defined by (8) Yk = Yk1 ; Yk2 ; . . . ; Ykn More details on the first step will be shown in Section 5. In the second component, we consider the set of 2D estimations of objects Yk as observations for a data fusion step to estimate the 3D locations of objects by the GMPHD filter. This method can avoid the data association between observations and states of objects. More details of the second step will be shown in Section 6.
Fig. 1. The sketch of our system for multiple object tracking using multiple cameras
4
Color Likelihood
The state of single object in each camera view is described by x = {xc , yc , Hx , Hy }. This is a rectangle with center and size defined by {xc , yc } and {Hx , Hy }, respectively. Let the color histogram of object be denoted as p(u), the color histogram of template as q(u). The similarity function between an object and a template is measured by the Bhattacharyya distance [11]. p(u)q(u)du (9) D = 1−
Probability Hypothesis Density Approach
879
In multiple-object tracking, we can have many color models of templates, and let these models be as {q1 (u), q2 (u), ..., qn (u)}. The similarity function between an object and templates is modified by
D = min 1− p(u)qi (u)du (10) i
The color likelihood function is defined as in [1]
D2 1 exp − 2 lz (x) = N (D; 0, σ 2 ) = √ 2σ 2πσ
(11)
where z is the current image, x is the state of object and σ 2 is the variance of noise.
5
Single-View Tracking
At each single view, we assume that the object state does not change much between frames and each object in multiple-object tracking is evolved from a dynamic moving equation (12) xk = xk−1 + wk where the state of an object in a single view xk = {xc , yc , Hx , Hy }, and wk is the process noise. Single-view tracking consists of two parts: obtaining the color measurement random set, and using these color measurements to obtain the PHD. Now, we consider the ith camera. Let vki (x) be the PHD of the ith camera at time k and i (x) be the predicted PHD of the ith camera at time k. From [12], we have vk|k−1 i vki (x) ∝ v˜ki (x) = lz (x)vk|k−1 (x)
(13)
where lz (x) is the color likelihood that is defined in Section 4. Hence, peaks of v˜ki (x) are also peaks of vki (x). We apply the method in [12] to collect peaks in v˜ki (x). The set of these peaks is considered as the color measurement random set. Secondly, we use the color measurement random set to update the PHD by the updating step in the GMPHD filter (Equation (7)). After updating predicted i (x) with the color measurement random set, we obtain PHD vki (x). PHD vk|k−1 From PHD vki (x), we find Gaussian components whose weights are larger than a threshold (0.5). The set of means of these Gaussian components are 2D estii i mations of objects at the ith camera. They are denoted as Yki = {y1,k , ..., ym,k }. (See [12] for more details of single-view tracking).
6 Multiple-Camera Fusion
We assume that the dynamic model for 3D tracking is
$$x_k = x_{k-1} + w_k, \qquad (14)$$
where the state of an object $x_k = \{x_{1,k}, x_{2,k}, x_{3,k}\}$ is a 3D coordinate and $w_k$ is the process noise. The observations are the 2D estimates from the multiple cameras, so the measurement equation at the $i$-th camera is
$$\begin{pmatrix} l_{1,k} \\ l_{2,k} \\ l_{3,k} \end{pmatrix} = \begin{pmatrix} a_{11}^i & a_{12}^i & a_{13}^i & a_{14}^i \\ a_{21}^i & a_{22}^i & a_{23}^i & a_{24}^i \\ a_{31}^i & a_{32}^i & a_{33}^i & a_{34}^i \end{pmatrix} \begin{pmatrix} x_{1,k} \\ x_{2,k} \\ x_{3,k} \\ 1 \end{pmatrix}, \qquad \begin{pmatrix} y_{1,k}^i \\ y_{2,k}^i \end{pmatrix} = \begin{pmatrix} l_{1,k}/l_{3,k} \\ l_{2,k}/l_{3,k} \end{pmatrix} + u_k, \qquad (15)$$
where $u_k$ is the measurement noise and the $a_{mn}^i$ are the projection parameters from the 3D coordinate system to the $i$-th camera plane. Assuming that the cameras are calibrated, the projection parameters $a_{mn}^i$ are known.

The idea of fusing data from multiple cameras is to use the GMPHD filter sequentially at each camera; related work has used sequential sensor updating in the PHD approach [8]. Let $V_k(x)$ be the PHD for multiple-camera tracking at time step $k$. We propose the following fusion stage:

– Step 1: Assuming that we have the PHDs of the previous time step $k-1$ for the multiple-camera fusion stage, $V_{k-1}(x)$, and for the single-view tracking stage at camera 1, $v_{k-1}^1(x)$, we employ the method in Section 5 to obtain the set of 2D estimates of objects, $Y_k^1$, and the PHD $v_k^1(x)$. Then, from $V_{k-1}(x)$, we use the dynamic model (14) and the measurement equation (15) to predict $V_{k|k-1}^1(x)$ at camera 1 via Equation (5). Because the measurement equation (15) is not linear, we use the unscented transform in the prediction step (more details are given in [10]). The set of 2D estimates at camera 1, $Y_k^1$, is then used to update $V_{k|k-1}^1(x)$ to $V_k^1(x)$ by the updating step of the GMPHD filter (Equation (7)). From the assumptions of the GMPHD filter, $V_{k-1}(x)$ is a Gaussian mixture, so $V_k^1(x)$ is also a Gaussian mixture.
– Step 2: Set $i = 2$.
– Step 3: At camera $i$, set $V_{k|k-1}^i(x) = V_k^{i-1}(x)$. Assuming that we have the PHD of the previous time step $k-1$ of the single-view tracking stage at camera $i$, $v_{k-1}^i(x)$, the method described in Section 5 is performed to obtain the set of 2D estimates of objects at camera $i$, $Y_k^i$, and the PHD $v_k^i(x)$. Because $V_{k|k-1}^i(x)$ is a Gaussian mixture, we can use the updating step of the GMPHD filter to update $V_{k|k-1}^i(x)$ with the observations in $Y_k^i$. This means
$$V_k^i(x) = (1 - p_{D,k})\, V_k^{i-1}(x) + \sum_{y \in Y_k^i} V_{D,k}(x; y), \qquad (16)$$
and we obtain $V_k^i(x)$.
– Step 4: Set $i = i + 1$. If $i \leq n$ then repeat Step 3. Otherwise, we have $V_k^n(x)$.

The PHD of the system is $V_k(x) = V_k^n(x)$. To estimate the 3D object locations, we examine the PHD of the system $V_k(x)$ and choose the
Gaussian components whose weights are larger than a threshold (0.5) to obtain the 3D estimates of the objects. We note that the multiple-camera fusion stage is implemented by sequential sensor updating; hence, the most reliable camera should be updated first. Another remark is that the GMPHD filter in [10] does not include track labels for the objects. For label tracking, our method is as follows. Each Gaussian component is associated with a label; birth Gaussian components are assigned a special label (for example, -1). After the updating step in the first camera, the Gaussian components with their labels become the predicted Gaussian components for the second camera and are then used to update the PHD in the second camera. At the last camera, for each label we choose the Gaussian component that has the largest weight, and the estimates of the object locations are given by the means of these largest components. If a Gaussian component has the special label and its weight is large enough, we assign it a new label; this means that a new person has appeared. Hence, the identities of people are defined during tracking. This track-labelling method extends the work in [13] from a single sensor to multiple sensors and applies it to multiple-camera multiple-object tracking.
7 Experimental Results
We test the performance of our method with data from the first and second cameras in scenarios seq24-2p-0111, seq35-2p-1111, and seq44-3p-1111 of the test database [14]. There are about 4500 time steps (9000 image frames). The errors of the 3D estimations are measured by the Wasserstein distance [9] and are shown in Table 1.

Table 1. Error of 3D estimation

Scenarios        Mean error (m)
seq24-2p-0111    0.06
seq35-2p-1111    0.05
seq44-3p-1111    0.07

For visualization, we show the results from the test case 'seq44-3p-1111'. In this scenario, there are three persons, who appear and disappear at different times. The scenario is challenging because occlusions occur when the persons cross paths. Moreover, the lighting of the room changes during tracking, so it is difficult to apply segmentation methods. In addition, because the color models of the heads differ between views, it is sometimes difficult to apply methods such as stereo matching to find correspondences; hence, 3D reconstructions from correspondences are not reliable on this data. Nevertheless, our method successfully tracks the 3D object locations in this scenario. At each camera, we use 400 samples to detect the peaks of the PHD, and the maximum number of Gaussian mixture components is 30. We assume that persons enter the tracking
Fig. 2. 3D results of tracking multiple people using PHD filter
area from two entrances. Hence, the birth intensity is a mixture of Gaussian components whose means are the locations of these entrances. The clutter density in the multiple-view camera fusion is a uniform distribution over the 3 m × 2 m × 2 m tracking area, and the clutter density in the single-view tracking stage is a uniform distribution over the image size (the projection of the tracking area onto the cameras) and the range of the radii Hx and Hy ([5, 15]). The probability of survival is pS = 0.99 and the probability of detection is pD = 0.98. These parameters were set by experiment. Figure 2 shows the performance of 3D people tracking. The dots are the ground truth and the lines are the estimations from our method. The results indicate that the tracks of the people are maintained. The x and y components are reliable, while the z component has some errors, for example at steps 600 to 700; this is because, at those steps, the color of the background near the person's location at camera 2 is similar to the color of the templates. However, these errors are quite small. In this sequence, when a person moves out of view and then moves back, we assign a new label, which is treated as a correct detection. Figure 3 shows the results when we project the 3D locations onto the camera planes. Each cell in the figure has two images: the left image is from camera 1 and the right image is from camera 2. At times k = 99, 144, 247, the first, second, and third persons appear in the overlapped region sequentially; they are detected and tracked automatically. At times k = 264, 295, occlusions between the second and third persons occur in cameras 1 and 2, yet the tracks are maintained after the occlusions. At time k = 809, the occlusion between the first and
Fig. 3. Projection of the 3D estimations onto the two camera planes
third persons occurs at camera 1, and the occlusion between the first and second persons occurs at camera 2. We can see in the figure that our method handles these cases, because the PHD from camera 1 is a good prediction for the PHD at camera 2. Information from the two cameras is fused to obtain reliable 3D estimations without using data-association methods.
8 Conclusion
This paper described a method that uses the GMPHD filter to track the 3D locations of objects. The method can track a varying number of objects. Moreover, it can solve some occlusion problems with which a single-camera system has difficulty. The fusion stage using the GMPHD filter requires far less computation than methods that search the whole space or use a particle filter over multiple objects. Experimental results have shown that the proposed approach is promising.
Acknowledgements. The authors would like to thank Prof. Ba Ngu Vo of Melbourne University for his help and fruitful discussions. This work is partially supported by the EU project ASTRALS (FP6-IST-0028097).
References
1. Czyz, J., Ristic, B., Macq, B.: A color-based particle filter for joint detection and tracking of multiple objects. In: ICASSP (2005)
2. Cai, Q., Aggarwal, J.K.: Automatic tracking of human motion in indoor scenes across multiple synchronized video streams. In: ICCV, Bombay, India (1998)
3. Nummiaro, K., Koller-Meier, E., Svoboda, T., Roth, D., Gool, L.V.: Color-based object tracking in multi-camera environments. In: Michaelis, B., Krell, G. (eds.) DAGM 2003. LNCS, vol. 2781, Springer, Heidelberg (2003)
4. Chang, T., Gong, S.: Tracking multiple people with a multi-camera system. In: IEEE Workshop on Multi-Object Tracking, IEEE Computer Society Press, Los Alamitos (2001)
5. Mittal, A., Davis, L.S.: M2Tracker: a multi-view approach to segmenting and tracking people in a cluttered scene. International Journal of Computer Vision 51(3), 189–203 (2003)
6. Kim, K., Davis, L.S.: Multi-camera tracking and segmentation of occluded people on ground plane using search-guided particle filtering. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, Springer, Heidelberg (2006)
7. Dockstader, S., Tekalp, A.M.: Multiple camera tracking of interacting and occluded human motion. Proceedings of the IEEE 89(10) (2001)
8. Mahler, R.: Multi-target Bayes filtering via first-order multi-target moments. IEEE Trans. on Aerospace and Electronic Systems 39(4), 1152–1178 (2003)
9. Vo, B.N., Singh, S., Doucet, A.: Sequential Monte Carlo methods for Bayesian multi-target filtering with random finite sets. IEEE Trans. Aerospace and Electronic Systems 41(4), 1224–1245 (2005)
10. Vo, B.N., Ma, W.K.: The Gaussian mixture probability hypothesis density filter. IEEE Transaction Signal Processing 54(11), 4091–4104 (2006)
11. Comaniciu, D., Meer, P.: Mean shift analysis and applications. In: ICCV (1999)
12. Pham, N.T., Huang, W.M., Ong, S.H.: Tracking multiple objects using probability hypothesis density filter and color measurements. In: ICME (2007)
13. Clark, D., Panta, K., Vo, B.: The GM-PHD filter multiple target tracker. In: Proceedings of FUSION 2006, Florence (2006)
14. Lathoud, G., Odobez, J., Perez, D.: AV16.3: an audio-visual corpus for speaker localization and tracking. In: Bengio, S., Bourlard, H. (eds.) MLMI 2004. LNCS, vol. 3361, Springer, Heidelberg (2005)
AdaBoost Learning for Human Detection Based on Histograms of Oriented Gradients

Chi-Chen Raxle Wang and Jenn-Jier James Lien

Robotics Laboratory, Dept. of Computer Science and Information Engineering, National Cheng Kung University, No. 1, Ta-Hsueh Road, Tainan, Taiwan
{raxle,jjlien}@csie.ncku.edu.tw
http://robotics.csie.ncku.edu.tw
Abstract. We developed a novel learning-based human detection system that can detect people of different sizes and orientations against a wide variety of backgrounds, even in crowds. To overcome the effects of geometric and rotational variations, the system automatically assigns the dominant orientations of each block-based feature encoding using rectangular- and circular-type histograms of oriented gradients (HOG), which are insensitive to various lightings and noise in outdoor environments. Moreover, this work demonstrates that Gaussian weighting and tri-linear interpolation in HOG feature construction increase detection performance. In particular, a powerful feature selection algorithm, AdaBoost, is used to automatically select a small set of discriminative HOG features with orientation information in order to achieve robust detection results. The overall computational time is further reduced significantly, without any performance loss, by using a cascade-of-rejectors structure whose hyperplanes and weights are estimated at each stage by the AdaBoost approach.
Keywords: Human Detection, Histograms of Oriented Gradients, Cascaded AdaBoost.
1 Introduction

Human detection is a key capability for applications in robotics, surveillance, and automated personal assistance. The main challenge is the amount of variation in visual appearance owing to clothing, articulation, cluttered backgrounds, and illumination conditions, particularly in outdoor scenes. A number of approaches to detecting humans in images using various feature representations and learning methods have been proposed in the literature. The work in [4] offers a detailed survey of human detection and analysis. Papageorgiou et al. [14] describe a polynomial Support Vector Machine (SVM) method to detect pedestrians in images; this work used Haar wavelet features to represent a detection window. A part-based variant was presented by Mohan et al. [12], and an optimized version by Depoortere et al. [2]. Viola et al. [19] and Yang et al. [21] used Haar-like features and the AdaBoost algorithm and then built cascaded systems for efficient moving-person detection. Felzenszwalb et al. [3] build an
articulated body from its parts, where each part is represented by a Gaussian derivative filter at different scales and orientations. Their approach is similar to the works of Ioffe & Forsyth [7] and Ronfard et al. [15]. Gavrila et al. [5] implement a real-time pedestrian detection system by comparing edge images to an exemplar dataset using the chamfer distance. Leibe et al. [8] use the Implicit Shape Model (ISM) combined with a global verification stage based on silhouettes to detect humans; the improved version by Seemann et al. [17] presented a 4-D ISM approach. Mikolajczyk et al. [11] used a single hierarchical codebook representation and Munder et al. [13] used an SVM with LRF features, both capable of detecting humans in images. Orientation histograms have been used extensively in [1], [9], [11], [20], and [22]. The Histograms of Oriented Gradients (HOG) features of Dalal & Triggs [1] have provided excellent performance compared with other existing edge- and gradient-based features. They used a dense grid of normalized HOG features computed over 16×16 fixed-size blocks to represent a 64×128-pixel detection window, and then trained a linear SVM as a binary classifier for their human detection system. Unfortunately, the computational time of their system was approximately 7 seconds for a 320×240-pixel image using a dense scanning methodology. Zhu et al. [22] improved the Dalal & Triggs approach by integrating a cascade-of-rejectors approach with HOG features of variable-size blocks in order to achieve a fast and accurate human detection system; this sped up detection by up to 70 times while maintaining an accuracy level similar to that of [1]. Inspired by their works, this paper designs a large set of blocks of multiple types, sizes, and locations. The system automatically assigns the dominant orientations of each block to overcome the effects of geometric and rotational variations; therefore, each local pixel in a block can be described relative to its dominant orientations in order to achieve rotation invariance. For the construction of the HOG features of each block, our method differs from that of Zhu et al. [22], which omitted the Gaussian block-weighted window and tri-linear interpolation steps. These steps are important for constructing the HOG features, as demonstrated in our experiments. The AdaBoost approach has established itself as a powerful learning algorithm that can be used for feature selection [18]. Therefore, this work uses the AdaBoost approach to select a small set of discriminative HOG features well suited for human detection by constructing a cascade-of-rejectors system. Our experimental results show that the performance of our system is better than that of previous works.
2 Training Process

Our human detection system consists of a training process and a testing process, as shown in Figs. 1 and 6. The training process contains four modules. The gradient computation module creates the gradient image for each positive or negative training example image. The second module designs rectangular- and circular-type blocks, which vary in size and position within the training examples. The system automatically evaluates the dominant orientations of each block to achieve
rotation invariance. The HOG feature construction module then encodes the information of each block to construct its HOG features. Finally, the human detection module applies a cascaded AdaBoost approach for human detection.

2.1 Gradient Computation

The gradients of all sample pixels in the positive and negative training example images, each of which has a 64×128-pixel resolution, are computed using the 1-D discrete derivative mask [-1, 0, 1]. The central differences across x and y at pixel location (x, y) are d_x(x, y) and d_y(x, y), respectively:

d_x(x, y) = I(x + 1, y) - I(x - 1, y)   (1)

d_y(x, y) = I(x, y + 1) - I(x, y - 1)   (2)

where I(x, y) is the pixel gray value at location (x, y) in the positive or negative training example image. The gradient magnitude m(x, y) at a sample pixel location (x, y) is evaluated as

m(x, y) = \sqrt{d_x(x, y)^2 + d_y(x, y)^2}   (3)

The gradient direction angle θ(x, y) at sample pixel location (x, y) is measured relative to the x axis of the image space and is given by

θ(x, y) = \tan^{-1}\big( d_y(x, y) / d_x(x, y) \big)   (4)
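For concreteness, a minimal NumPy sketch of Equations (1)–(4) on a grayscale image is given below; the function name and array layout are our own illustrative choices rather than part of the original system.

```python
import numpy as np

def compute_gradients(image):
    """Central differences with the [-1, 0, 1] mask, plus magnitude and angle (Eqs. 1-4)."""
    img = image.astype(np.float64)
    dx = np.zeros_like(img)
    dy = np.zeros_like(img)
    dx[:, 1:-1] = img[:, 2:] - img[:, :-2]     # d_x(x, y) = I(x+1, y) - I(x-1, y)
    dy[1:-1, :] = img[2:, :] - img[:-2, :]     # d_y(x, y) = I(x, y+1) - I(x, y-1)
    magnitude = np.sqrt(dx ** 2 + dy ** 2)     # Eq. (3)
    angle = np.arctan2(dy, dx)                 # signed version of Eq. (4), in radians
    return magnitude, angle
```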
Fig. 1. Workflow of the training process
2.2 Block Type Assignment and Block Rotation Invariance

Block Type Assignment: In [1], the authors used only fixed-size blocks of 16×16 pixels to construct the HOG features. Each block consisted of 2×2 spatial cells, and each cell was 8×8 pixels. However, fixed-size blocks encode very limited information about the positive and negative training example images. Therefore,
Zhu et al. [22] used variable-size blocks to improve the detection performance of [1]. They considered all blocks whose sizes range from 12×12 to 64×128 pixels and whose width-to-height ratio is one of (1:1), (1:2), and (2:1). These three ratios define three types of blocks, as shown in Figs. 2(a), (c), and (e). In total, 5031 blocks are defined in a 64×128-pixel training example image in the approach of [22]. However, they used only rectangular blocks to encode information from the training example images. In this work, we additionally use circular blocks to encode more information from the training example images; the sizes and ratios of the circular blocks vary in the same way as those of the rectangular blocks, as shown in Figs. 2(b), (d), and (f). Therefore, a total of 10062 blocks are defined in a 64×128-pixel training example image in this work.
Fig. 2. Six types of blocks and corresponding multivariate Gaussian-weighted windows. (c: cell, (a:b): ratio between width and height of block).
Block Rotation Invariance: For each block, our system automatically evaluates one or more orientations, so that each local pixel in the block can be described relative to the block orientations to achieve rotation invariance. This is very different from the approaches in [1] and [22], which omit the orientation information of blocks. The gradient direction angles of all sample pixels in the block are voted into a 36-bin orientation histogram whose bins are evenly spaced over 0°–360°. Each sample pixel contributes to the orientation histogram a weight given by its gradient magnitude multiplied by a multivariate Gaussian-weighted window with covariance matrix C, defined as

C = \begin{pmatrix} 1.5 \times S_W & 0 \\ 0 & 1.5 \times S_H \end{pmatrix}^{2}   (5)

where S_W and S_H are one-half the width and one-half the height of the block, respectively. The multivariate Gaussian-weighted window for each type of block is shown in Fig. 2. A parabola is then fitted to the values of the 3 bins nearest the histogram peak to interpolate the orientation for improved accuracy. Finally, the maximum value in the orientation histogram determines the dominant orientation of the block. In addition, after the maximum value in the histogram is detected, any other peak higher than 80% of the maximum is used to create a new block example based on that orientation. This process is very similar to the keypoint orientation assignment in [9].
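The dominant-orientation assignment just described can be sketched as follows: 36-bin voting weighted by gradient magnitude and the Gaussian window of Eq. (5), parabolic peak refinement, and secondary peaks above 80% of the maximum. The code reuses the magnitude/angle arrays of the previous sketch; all names are illustrative.

```python
import numpy as np

def dominant_orientations(mag, ang, n_bins=36, peak_ratio=0.8):
    """Return the dominant orientation(s) of a block, in degrees.
    mag, ang: gradient magnitude and angle arrays covering the block's pixels."""
    h, w = mag.shape
    # Multivariate Gaussian weight centred on the block, sigma = 1.5 * half-size (Eq. 5)
    ys, xs = np.mgrid[0:h, 0:w]
    sw, sh = w / 2.0, h / 2.0
    gauss = np.exp(-0.5 * (((xs - sw) / (1.5 * sw)) ** 2 + ((ys - sh) / (1.5 * sh)) ** 2))
    deg = np.degrees(ang) % 360.0
    hist, _ = np.histogram(deg, bins=n_bins, range=(0.0, 360.0), weights=mag * gauss)
    orientations = []
    max_val = hist.max()
    for b in range(n_bins):
        left, right = hist[(b - 1) % n_bins], hist[(b + 1) % n_bins]
        if hist[b] >= peak_ratio * max_val and hist[b] > left and hist[b] > right:
            # Parabolic interpolation over the 3 nearest bins around the peak
            offset = 0.5 * (left - right) / (left - 2.0 * hist[b] + right)
            orientations.append(((b + 0.5 + offset) * 360.0 / n_bins) % 360.0)
    return orientations
```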
2.3 HOG Feature Construction

The HOG feature is fundamentally a nonlinearly normalized gradient descriptor, very similar to the SIFT descriptor, in which each descriptor covers a 2×2 subregion [9]. In the Dalal & Triggs approach [1], each cell of a 16×16-pixel block consists of 8×8 gradient direction angles, which are weighted and voted into a 9-bin orientation histogram whose bins are evenly spaced over 0°–180° ("unsigned" gradient); the votes are weighted by the gradient magnitudes and by a Gaussian block-weighted window with standard deviation equal to one-half the block width. The votes are accumulated into the orientation bins of histograms spread over the block's rectangular region. To reduce aliasing, tri-linear interpolation is used to distribute the value of each gradient into adjacent orientation bins of the histogram. The four orientation histograms of each block are then integrated into a 36-dimensional (36-D) HOG feature vector, and each HOG feature vector is normalized to L2 unit length by Lowe's normalization method [9]. Finally, each 64×128-pixel training example image is described by a dense (in fact overlapping) grid of 105 HOG features, which are used to train an SVM-based window classifier. In this work, the HOG features are constructed from 10062 blocks of multiple types, sizes, and locations. The system automatically evaluates one or more orientations for each block, based on the local image properties determined in the previous module, and the information of each block is encoded into its HOG feature with respect to its orientation. To construct the HOG feature, each cell of a block is weighted and voted into a 9-bin orientation histogram according to its gradient magnitudes and a multivariate Gaussian block-weighted window with covariance matrix

K = \begin{pmatrix} S_W & 0 \\ 0 & S_H \end{pmatrix}^{2}

Following this, the four orientation histograms are integrated into a 36-D HOG feature vector, as shown in Fig. 3, and each HOG feature is normalized to L2 unit length. Therefore, more than 10062 normalized HOG features are defined in a 64×128-pixel training example image in this work. Although the approach in [22] demonstrated that the variable-size block method gives higher detection accuracy than the fixed-size block method, it did not use the multivariate Gaussian block-weighted windows and tri-linear interpolation when the sample pixels of each cell were weighted and voted into an orientation histogram.
Fig. 3. Two examples of HOG features, created by computing the gradient magnitude and direction angle of each image sample pixel within the block. The sample pixels of each cell are accumulated, weighted by their gradient magnitudes and the corresponding multivariate Gaussian block-weighted window (indicated by the overlaid circle), and voted into a 9-bin orientation histogram. The four orientation histograms are integrated into a 36-D normalized HOG feature vector.
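A minimal sketch of the per-block 36-D HOG construction described above is shown below (2×2 cells, 9 unsigned-orientation bins per cell, magnitude and Gaussian weighting, L2 normalization). For brevity it uses simple hard binning instead of the full tri-linear interpolation, so it is an approximation of the descriptor rather than the exact one.

```python
import numpy as np

def block_hog(mag, ang, eps=1e-7):
    """36-D HOG vector for one block: 2x2 cells x 9 unsigned orientation bins."""
    h, w = mag.shape
    ys, xs = np.mgrid[0:h, 0:w]
    sw, sh = w / 2.0, h / 2.0
    gauss = np.exp(-0.5 * (((xs - sw) / sw) ** 2 + ((ys - sh) / sh) ** 2))
    weights = mag * gauss
    deg = np.degrees(ang) % 180.0            # "unsigned" gradient, 0-180 degrees
    feature = []
    for cy in range(2):                      # 2x2 spatial cells
        for cx in range(2):
            sl = (slice(cy * h // 2, (cy + 1) * h // 2),
                  slice(cx * w // 2, (cx + 1) * w // 2))
            hist, _ = np.histogram(deg[sl], bins=9, range=(0.0, 180.0),
                                   weights=weights[sl])
            feature.extend(hist)
    feature = np.asarray(feature)
    return feature / (np.linalg.norm(feature) + eps)   # L2 normalization
```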
In our experiment, omitting these two steps reduces the recall rate from 94% to 91% at 10^{-4} FPPW. Therefore, we do not omit the multivariate Gaussian block-weighted windows and tri-linear interpolation in this work.

2.4 Human Detection Using a Cascaded AdaBoost Approach

More than 10062 HOG features are now defined in each positive or negative 64×128-pixel training example image. We intend to select a meaningful subset of HOG features that are discriminative and distinctive. In this work, we use the AdaBoost algorithm [18] to select a small number of weighted HOG features, i.e., weak classifiers, which are integrated into a strong classifier. Each weak classifier is selected by evaluating the positive and negative training datasets, and the classifier showing the lowest error is chosen. We use the normalized difference score s as the similarity measurement function of a weak classifier:

s(q, p) = \frac{q}{\|q\|} - \frac{p}{\|p\|}   (6)

where q is the separating hyperplane that exhibits the lowest error rate for the HOG feature, and p is the HOG feature of the query image.
Fig. 4. The cascaded AdaBoost consists of a sequence of detection stages. The first several stages can eliminate a large number of negative examples and retain almost all positive examples, with little processing time. The last several stages eliminate remaining negative examples, but take much more processing time than did the first several stages.
Fig. 5. Some details of the cascaded AdaBoost detector. (a) The number of weak classifiers in each stage. (b) The rejection rate as cumulative sum over cascaded stages.
To increase the speed of the detector, we construct a cascaded AdaBoost detector, which rejects many negative examples while detecting almost all positive examples. The cascade is a sequence of detection stages, each designed to have a high detection rate while achieving a high rejection rate. In this work, we require a minimum detection rate of 0.9995 and a maximum false positive rate of 0.55 for each stage. The cascaded training process took a few days on a PC with a 3.2 GHz CPU and 4 GB of memory. A schematic depiction of the cascaded AdaBoost
approach is shown in Fig. 4. The final detector is a 34-stage cascaded AdaBoost detector comprising 813 HOG features. Each HOG feature involves parameters specifying the type, size, and position of a block in the 64×128-pixel detection window. The first stage in the cascade is constructed using five HOG features and rejects approximately 60% of non-humans (negatives) while correctly detecting nearly 100% of humans (positives). The next stage has five features and rejects 80% of non-humans while detecting almost all humans. More stages are added until the false positive rate is nearly zero, while still maintaining a high correct detection rate. Details of the cascaded AdaBoost detector are shown in Fig. 5.
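Schematically, a detection window is evaluated by such a cascade as in the sketch below; the stage weights, thresholds, and feature functions are placeholders rather than the trained 34-stage detector described above.

```python
def classify_window(window, stages):
    """Evaluate one detection window against a cascade of boosted stages.
    `stages` is a list of (weak_learners, stage_threshold) pairs; each weak learner
    is a (feature_fn, weight) pair whose feature_fn returns a score in {-1, +1}."""
    for weak_learners, stage_threshold in stages:
        score = sum(alpha * feature_fn(window) for feature_fn, alpha in weak_learners)
        if score < stage_threshold:
            return False          # rejected early: most negatives exit in the first stages
    return True                   # survived all stages: report a human detection
```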
3 Testing Process

In the testing process, as shown in Fig. 6, each testing image, e.g., 320×240 pixels, is first down-sampled iteratively by a factor of 8/9 from the original resolution of 320×240 pixels (level 0) to 178×133 pixels (level 5); the image height at level 5 is only slightly larger than the height of the detection window. Detection windows are generated by shifting pixel by pixel across the image at each level, producing approximately 70000 windows for each input 320×240-pixel image. Each detection window is then classified as a human (positive) or non-human (negative) example by our cascaded AdaBoost detector. On average, about 8.5 block evaluations are needed to classify a 64×128-pixel detection window, which is more than 12.3 times faster than the approach in [1] with its 105 blocks. The computational time of our system is 0.55 seconds per 320×240-pixel image, which is better than the approach of [1], which required 7 seconds.
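The multi-scale scanning loop described above can be sketched as follows. The 8/9 pyramid factor, six levels, and 64×128 window come from the text; the nearest-neighbour resize and the `classifier` callable (e.g., the cascade sketch above) are illustrative stand-ins.

```python
import numpy as np

def resize_nearest(img, new_h, new_w):
    """Nearest-neighbour resize, a simple stand-in for a proper image resampler."""
    rows = np.arange(new_h) * img.shape[0] // new_h
    cols = np.arange(new_w) * img.shape[1] // new_w
    return img[rows][:, cols]

def scan_image(image, classifier, win_h=128, win_w=64, factor=8.0 / 9.0, levels=6, stride=1):
    """Slide a 64x128 window over a pyramid down-sampled by 8/9 per level (levels 0-5).
    `classifier` is any callable mapping a window to True (human) or False."""
    detections = []
    level_img, scale = image, 1.0
    for _ in range(levels):
        h, w = level_img.shape[:2]
        for y in range(0, h - win_h + 1, stride):
            for x in range(0, w - win_w + 1, stride):
                if classifier(level_img[y:y + win_h, x:x + win_w]):
                    # map the hit back to original-image coordinates
                    detections.append((x / scale, y / scale, win_w / scale, win_h / scale))
        scale *= factor
        level_img = resize_nearest(image, int(round(image.shape[0] * scale)),
                                   int(round(image.shape[1] * scale)))
    return detections
```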
Fig. 6. Workflow of our testing process

Table 1. Two datasets compiled from the website of the work in [1] for research purposes. Each example image contains 64×128 pixels. (Unit: number of examples.)

Dataset                                  Positive examples    Negative examples
Training dataset (2416+1218 images)      2416                 12180
Testing dataset (741 images)             555 persons in the 741 images
Each successful human detector usually produces redundant detection windows, of the same or different scales, around each human in an image. The work in [18] combined all overlapping candidate windows and obtained good results for face detection, but this non-overlapping constraint may be too strict for closely spaced targets, which themselves cause overlapping candidate windows. To solve the overlapping problem, this work uses the non-maximum suppression method of [1]. In [1], the authors propose a robust fusion of overlapping detections in the 3-D position (x, y) and scale (s) space, taking the detection response into account. This non-maximum suppression algorithm applies a mean-shift mode detection procedure to locate a pre-defined model density, defined as a human, in the 3-D position and scale space. The list of all located modes then gives the final fused detections.
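The paper adopts the mean-shift-based fusion of [1]; as a rough and deliberately simpler stand-in, the sketch below shows a greedy overlap-based non-maximum suppression that merges redundant windows by score. It illustrates the general idea only and is not the mean-shift procedure described above.

```python
def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes and drop boxes that overlap them too much.
    boxes: list of (x, y, w, h); scores: matching list of detection scores."""
    def iou(a, b):
        ax, ay, aw, ah = a
        bx, by, bw, bh = b
        ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
        iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
        inter = ix * iy
        return inter / (aw * ah + bw * bh - inter + 1e-9)

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in kept):
            kept.append(i)
    return [boxes[i] for i in kept]
```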
4 Experimental Results

4.1 Databases and Performance Evaluation Method

Databases: We downloaded two datasets from the website of [1], as shown in Table 1. The first is the training dataset, containing 2416 person images and 1218 person-free images. We selected the 2416 person images as positive training examples and collected 12180 person-free windows, sampled randomly from the 1218 person-free images, as negative training examples. Both the positive and negative training example images are 64×128 pixels. To reduce the occurrence of false alarms, additional negative (non-human) examples are compiled from the false-acceptance windows obtained by applying the human detection process to the 1218 person-free images. The second set is the testing dataset, containing 555 persons in 741 images of different sizes.

Performance Evaluation Method: The detection performance is evaluated in the same way as in [1], by plotting the Detection Error Tradeoff (DET) curve of miss rate versus FPPW (False Positives Per Window). The terms are defined as follows:

miss rate = 1 - recall rate = \frac{\text{Number of false alarms}}{\text{Number of detected positives} + \text{Number of false alarms}}   (7)

FPPW = \frac{\text{Number of false alarms}}{\text{Total number of testing negative examples}}   (8)
Based on the DET curve, a better detector should achieve both a lower miss rate and a lower FPPW. In this work, we often use the miss rate at 10^{-4} FPPW as a reference point on the DET curve, as in [1].

4.2 Experiments

We performed a variety of experiments with the proposed system to evaluate its performance using the training and testing datasets shown in Table 1. In the first experiment, we compared the performance of our system with four combinations of block types on the testing dataset in order to choose the
best combination. Fig. 7(a) shows the classification performance of our system for the four combinations of block types as four curves. The DET curves clearly show that using only the 16×16-pixel blocks gives the least accuracy and that adding block types contributes significantly to the detection performance of our system. The combination of all six types of blocks gives the lowest miss rate (6% at 10^{-4} FPPW); restricting the combination to block types 1, 3, and 5, as in [22], reduces the performance by 2.5% at 10^{-4} FPPW. In the second experiment, we demonstrated that block orientation assignment significantly increases the detection performance of our system. In Fig. 7(b), the detection performance increases by approximately 3% at 10^{-4} FPPW when the block orientation assignment is performed before the HOG feature construction module; the results thus confirm that the local orientation assignment for each block gives the most stable performance. In the third experiment, we examined the importance of the multivariate Gaussian block-weighted window and tri-linear interpolation steps during construction of the HOG features. In Fig. 7(c), the results indicate that omitting these two steps decreases the detection performance from 94% to 91% at 10^{-4} FPPW; including them is effective in avoiding aliasing effects, in which the HOG descriptor changes abruptly as a sample shifts smoothly from one histogram to another, or from one orientation to another. In [1], the authors demonstrated that the HOG feature outperforms other existing feature representations, such as Haar-like wavelets, PCA-SIFT, and Shape Contexts. Therefore, in the final experiment we compared only
Fig. 7. (a) Classification performance of our detection system using four combinations of block types. (b) Without block orientation assignment to achieve invariance to image rotation, the performance decreases by about 3%. (c) Without the multivariate Gaussian block-weighted window and tri-linear interpolation in HOG feature construction, the detection rate decreases by about 3%. (d) DET curves comparing the approaches in [1] and [22] with our detection system; our system achieves a lower miss rate at 10^{-4} FPPW.
Fig. 8. Some typical results obtained with our detection system
the Dalal & Triggs approach [1] and the Zhu et al. approach [22] with our detection system. The DET curves are presented in Fig. 7(d). The results indicate that our detection system has better detection performance than the approaches in [1] and [22]. We observe that HOG features located in specific types, sizes, locations, and orientations of blocks achieve much higher accuracy when selected by the AdaBoost approach. The results demonstrate that the performance of our system is superior to those of [1] and [22] because the most informative blocks, in contrast to background blocks, are selected. Furthermore, the computational time of our system, using the cascaded AdaBoost approach, is reduced by a factor of 12.3 compared with the approach in [1]. Fig. 8 shows some typical results of our proposed detection system.
5 Conclusion

We have developed a novel learning-based human detection system. For the feature representation, we use rectangular- and circular-type HOG features, which are insensitive to various lightings and noise, and construct a large set of blocks of multiple types, sizes, locations, and orientations to overcome the effects of geometric and rotational variations. The discriminative HOG features are automatically selected using the AdaBoost approach. This work has examined the effects of block types and orientation assignment on HOG feature construction to obtain good performance. In addition, we have demonstrated that constructing HOG features with Gaussian block-weighted windows and tri-linear interpolation increases detection performance by 3%. The cascaded AdaBoost approach reduces the computational time by a factor of 12.3 compared with the approach in [1]. Finally, our detection system achieves a lower miss rate than the previous approaches in [1] and [22] at 10^{-4} FPPW.
References 1. Dalal, N., Triggs, B.: Histogram of Oriented Gradients for Human Detection. In: CVPR. IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 886–893 (2005) 2. Depoortere, V., Cant, J., Bosch, B.V., Prins, J.D., Fransens, R., Gool, L.V.: Efficient Pedestrian Detection: A Test Case for SVM Based Categorization. In: Workshop on Cognitive Vision (2002)
3. Felzenszwalb, P.F., Huttenlocher, D.P.: Pictorial Structures for Object Recognition. International Journal of Computer Vision (IJCV), 55–79 (2005) 4. Gavrila, D.M.: The Visual Analysis of Human Movement: A Survey. Computer Vision and Image Understanding 73, 82–98 (1999) 5. Gavrila, D.M., Giebel, J., Munder, S.: Vision-Based Pedestrian Detection: The Protector System. In: IEEE Intelligent Vehicles Symposium, pp. 13–18. IEEE Computer Society Press, Los Alamitos (2004) 6. Gerónimo, D., Sappa, A.D., López, A., Ponsa, D.: Pedestrian Detection Using Adaboost Learning of Features and Vehicle Pitch Estimation. In: International Conf. on Visualization, Imaging and Image Processing, pp. 400–405 (2006) 7. Ioffe, S., Forsyth, D.A.: Probabilistic Methods for Finding People. In: IJCV, pp. 45–68 (2001) 8. Leibe, B., Seemann, E., Schiele, B.: Pedestrian Detection in Crowded Scenes. In: IEEE Conf. on CVPR, pp. 878–885. IEEE Computer Society Press, Los Alamitos (2005) 9. Lowe, D.: Distinctive Image Features from Scale-Invariant Keypoints. In: IJCV, pp. 91– 110 (2004) 10. Mikolajczyk, K., Leibe, B., Schiele, B.: Multiple Object Class Detection with a Generative Model. In: IEEE Conf. on CVPR, pp. 26–36. IEEE Computer Society Press, Los Alamitos (2006) 11. Mikolajczyk, K., Schmid, C., Zisserman, A.: Human Detection Based on a Probabilistic Assembly of Robust Part Detections. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, pp. 69–81. Springer, Heidelberg (2004) 12. Mohan, A., Papageorgiou, C., Poggio, T.: Example-Based Object in Image by Components. IEEE Tran. on Pattern Analysis and Machine Intelligence, 349–361 (2001) 13. Munder, S., Gavrila, D.M.: An Experimental Study on Pedestrian Classification. IEEE Tran. on Pattern Analysis and Machine Intelligence (PAMI), 1863–1868 (2006) 14. Papageorgiou, C., Poggio, T.: A Trainable System for Object Detection. In: IJCV, pp. 15– 33 (2000) 15. Ronfard, R., Schmid, C., Triggs, B.: Learning to Parse Pictures of People. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, pp. 700–714. Springer, Heidelberg (2002) 16. Schneiderman, H., Kanade, T.: Object Detection Using the Statistics of Parts. In: IJCV, pp. 151–177 (2004) 17. Seemann, E., Leibe, B., Schiele, B.: Multi-Aspect Detection of Articulated Objects. In: IEEE Conf. on CVPR, pp. 1582–1588. IEEE Computer Society Press, Los Alamitos (2006) 18. Viola, P., Jones, M.: Rapid Object Detection Using a Boosted Cascade of Simple Features. In: IEEE Conf. on CVPR, pp. 511–518. IEEE Computer Society Press, Los Alamitos (2001) 19. Viola, P., Jones, M., Snow, D.: Detecting Pedestrians Using Patterns of Motion and Appearance. In: IJCV, pp. 153–161 (2005) 20. Wu, B., Nevatia, R.: Detection of Multiple, Partially Occluded Humans in a Single Image by Bayesian Combination of Edgelet Part Detectors. In: ICCV, pp. 90–97 (2005) 21. Yang, T., Li, J., Pan, Q., Zhao, C., Zhu, Y.: Active Learning Based Pedestrian Detection in Real Scenes. In: International Conf. on Pattern Recognition, pp. 20–24 (2006) 22. Zhu, Q., Avidan, S., Yeh, M.C., Cheng, K.T.: Fast Human Detection Using a Cascade of Histograms of Oriented Gradients. In: IEEE Conf. on CVPR, pp. 1491–1498. IEEE Computer Society Press, Los Alamitos (2006)
Multi-posture Human Detection in Video Frames by Motion Contour Matching

Qixiang Ye, Jianbin Jiao, and Hua Yu

Graduate University of Chinese Academy of Science
{Qxye,jiaojb,Yuh}@gucas.ac.cn
Abstract. In this paper, we propose a method for moving-human detection in video frames by motion contour matching. First, temporal and spatial differences between frames are calculated and contour pixels are extracted by global thresholding as the basic features. Then, skeleton templates with multiple representative postures are built on these features to represent multi-posture human contours. In the detection procedure, a dynamic programming algorithm is adopted to find the best global match between the built templates and the extracted contour features. Finally, a thresholding method is used to classify a matching result as a moving human or a negative. Scale variation and interpersonal contour differences are also considered in the matching process. Experiments on real video data prove the effectiveness of the proposed method.
Keywords: Human detection, motion contour, dynamic programming.
1 Introduction

Detecting humans in video frames is important for many applications, such as visual surveillance, traffic systems, smart rooms, and early threat assessment [1][2]. Precise moving-human detection algorithms with low false alarm rates will push forward the development of automated visual surveillance techniques on optical or infrared images. In the past years, much work has been done on human detection in images and video frames. The various image features and methodologies employed in these works can be categorized into three classes: 1) human detection by background subtraction, 2) human detection directly in a single image (frame) using image features, and 3) human detection using motion features, tracking cues, or a combination of motion and static image features. In the early years, human detection was performed based on background subtraction technologies (frame differencing or background modeling) and simple region-analysis features such as region area, region width/height ratio, and region moments [2][3]. It is difficult for these methods to discriminate moving humans from other objects, since pure region features cannot represent humans effectively; when the background is moving or there are illumination changes, many false alarms are detected. In [4], Gavrila et al. used a shape-based grey-value template matching method to detect humans. The dissimilarity between a template and a human candidate is evaluated
using the chamfer distance. They argue that the template matching method can deal with the challenging scenario of a moving camera mounted on a vehicle. The authors also included a second verification stage based on a neural network architecture that operates on the image patches detected by template matching. In [5], Mohan et al. proposed a human detection method that learns component classifiers using Haar wavelet features and an SVM (support vector machine) classifier. In [6], Dalal et al. presented a descriptor of oriented gradients and motion flow for human shape representation, and then trained an SVM classifier on this descriptor to detect pedestrians in video frames. In [7], Leibe et al. combined local information from sampled appearance features with global cues about an object's silhouette for human detection; by using both segmentation and classification, the flexible nature of their approach allows it to operate against complex backgrounds. Although using pure grey-value features for human contour representation is an intuitive idea, it usually requires a very large number of training samples to build a model, especially when facing humans of various views and postures. In [8], Cutler et al. used time-frequency features from the short-time Fourier transform (STFT) for pedestrian detection, exploiting the fact that human motion is periodic and repetitive. Viola et al. used an AdaBoost classifier trained on human shape and motion features for human detection; their shape features were extracted by rectangle filters on frame grey values and optical flow, and the method obtained good detection results on a large pedestrian dataset [9]. Haga et al. classified moving objects as human or other using motion uniqueness and continuity features with a linear classifier [10]. In [11], Zhang et al. proposed a method for detecting humans against a moving background (a moving camera on a car): they first calculate image interest points and then use the FOE (focus of expansion) with residuals to judge whether an object can be classified as a moving human. There has also been research on pedestrian detection in infrared images in recent years [12]. By using motion features, static features, or a combination of the two, the performance of human detection can be improved to some extent. However, the multi-posture problem in human detection has not been fully considered in these works, which may limit their usability in many real applications. Our approach builds on motion contour features and template matching. The contributions of this paper are: i) the development of a multi-posture human contour representation method, and ii) the implementation of human detection by template matching with a dynamic programming approach, which performs global optimization and tolerates small contour deformations.
2 Human Detection Algorithm

The human detection algorithm has three parts. The first part is feature extraction, on which the multi-posture templates are built. The second is optimized contour matching with dynamic programming, and the third is human/non-human determination. During detection, features are extracted in the same way as in template building. In presenting the detection algorithm we consider only single-scale human detection; multi-scale human detection is carried out by resizing the original frame.
2.1 Template Building on Contour Features

Videos with simple, static backgrounds are selected for building the templates, to ensure that the foreground objects can be easily segmented from the background. Each video frame is smoothed and then intra- and inter-frame difference values are calculated as follows:
F_{Intra}(x, y, σ) = \big( (L_t(x-1, y, σ) - L_t(x, y, σ))^2 + (L_t(x, y, σ) - L_t(x, y-1, σ))^2 \big)^{1/2}   (1)

F_{Inter}(x, y, σ) = | L_t(x, y, σ) - L_{t-n}(x, y, σ) |   (2)

L_t(x, y, σ) = G(x, y, σ) * I_t(x, y)   (3)

where I_t(x, y) represents the brightness of the pixel at location (x, y) in the t-th frame, and G(x, y, σ) is a Gaussian function in which σ is the smoothing scale factor. Based on the results of (1) and (2), the contour features F(x, y) can be calculated by the following function:

F(x, y) = \begin{cases} 1, & \text{if } F_{Inter} > T_{Inter} \text{ and } F_{Intra} > T_{Intra} \\ 0, & \text{otherwise} \end{cases}   (4)

where T_{Inter} and T_{Intra} are two global thresholds determined by histogram cavity analysis methods [13]. When detecting moving humans in a video, the values of T_{Inter} and T_{Intra} should ensure that enough candidate moving pixels are obtained so that moving objects are not missed.
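As an illustration of Equations (1)–(4), here is a minimal NumPy sketch of the thresholded intra/inter-frame contour feature map. The separable Gaussian smoothing, frame gap, and default thresholds are illustrative choices rather than values from the paper.

```python
import numpy as np

def gaussian_smooth(frame, sigma):
    """Separable Gaussian smoothing L_t = G * I_t (Eq. 3), written directly in NumPy."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    g = np.exp(-0.5 * (x / sigma) ** 2)
    g /= g.sum()
    tmp = np.apply_along_axis(lambda r: np.convolve(r, g, mode="same"), 1, frame.astype(float))
    return np.apply_along_axis(lambda c: np.convolve(c, g, mode="same"), 0, tmp)

def contour_features(frame_t, frame_t_minus_n, sigma=1.0, t_inter=15.0, t_intra=10.0):
    """Binary contour map F(x, y) from intra- and inter-frame differences (Eqs. 1, 2, 4)."""
    L_t = gaussian_smooth(frame_t, sigma)
    L_p = gaussian_smooth(frame_t_minus_n, sigma)
    f_intra = np.zeros_like(L_t)
    f_intra[1:, 1:] = np.sqrt((L_t[1:, :-1] - L_t[1:, 1:]) ** 2 +   # horizontal difference
                              (L_t[:-1, 1:] - L_t[1:, 1:]) ** 2)    # vertical difference
    f_inter = np.abs(L_t - L_p)
    return ((f_inter > t_inter) & (f_intra > t_intra)).astype(np.uint8)
```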
Fig. 1. Multi-posture human contour templates
The built templates should account for variation in human height. At present, four persons with heights of 1.60, 1.70, 1.80, and 1.90 meters are employed to capture the templates. Ten typical postures of a moving human are selected for each person; the templates of one person are shown in Fig. 1. Clearly, 40 templates (10 postures of 4 persons) cannot represent all moving humans, because there are certainly differences between a person in the templates and a person who was not included. Therefore, the matching algorithm requires some tolerance for deformation of the human body. In this paper we aim only to justify the feasibility of the matching method; more postures of more persons will be used to build the templates in future work. Foreground features can be extracted from a given video frame using equation (4). A built template is then matched with these moving pixels to search for moving humans. Taking the first template as an example, the detected
foreground pixels form a profile, as shown in Fig. 2. Discrete key points are then sampled on this profile to obtain the human contour features. In the sampling process, equally spaced horizontal lines (as shown in the second image of Fig. 2b) scan the foreground profile. Each line has two intersection points with the profile, on the left and right sides respectively, which are the two features we want. The reason we use discrete features instead of the continuous profile is that experiments have shown that the former is more tolerant of various human body shapes and of some body deformation.
Fig. 2. Human contour features. (a) The extracted profile; (b), (c) the discrete features; (d) the angle and distance between two features; (e) the searching window for features.
2.2 Single Template Matching
Given a built human contour template, we represent its features by \{F_t\}, t = 0, 1, ..., T, where T is the number of features, obtained by searching from the top left of the profile to the bottom right and then to the top right, as illustrated in the second image of Fig. 2. During the matching process, we first extract the foreground features and represent them as \{\tilde{F}_{t,i}\}, t = 0, 1, ..., T, i = 0, 1, ..., N_t, where N_t is the number of candidate features at the t-th step. The feature set of the t-th step is constructed by a square window (the searching window) shown in Fig. 2e. Given a template feature set A and a candidate feature set B, our goal is to find the best match as indicated by the following target function:

D(A, B) = \min_{\tilde{F}_t \in \{\tilde{F}_{t,i}\}} \left\{ \sum_{t=1}^{T} D(F_t, \tilde{F}_t) \right\}   (5)

where D is the distance function and D(A, B) represents the overall matching distance between the model and the candidate feature set; \tilde{F}_t is the globally optimized matching result at the t-th step. At each step, the function D is calculated as

D(F_t, \tilde{F}_t) = K_\theta(\theta_t - \tilde{\theta}_t) \cdot K_\rho(\rho_t - \tilde{\rho}_t)   (6)

where \theta_t is the angle between the line F_{t-1}F_t and the horizontal line, \rho_t is the distance between the two feature points F_{t-1} and F_t, as shown in Fig. 2, and \tilde{\theta}_t, \tilde{\rho}_t denote the corresponding quantities of the matched candidate features. K_\theta and K_\rho are two functions describing the dissimilarity between the model and the real data. In our experiments, they are chosen as Gaussian functions:

K_\theta(\theta_t - \tilde{\theta}_{t,i}) = \frac{1}{\sqrt{2\pi}\,\sigma_\theta} \exp\!\left( -\left( \frac{\theta_t - \tilde{\theta}_{t,i}}{\sigma_\theta} \right)^{2} \right), \qquad K_\rho(\rho_t - \tilde{\rho}_{t,i}) = \frac{1}{\sqrt{2\pi}\,\sigma_\rho} \exp\!\left( -\left( \frac{\rho_t - \tilde{\rho}_{t,i}}{\sigma_\rho} \right)^{2} \right)   (7)
The values of σ_θ and σ_ρ are determined experimentally with reference to the size of the template image. Higher values imply higher acceptance of posture variations of the detected targets, at the cost of higher false alarm rates; this is discussed further in the experimental part of this paper. In the detection process, a template is matched to the foreground features according to the target function (5). To solve it, we use a standard Viterbi decoding algorithm [13] to obtain the globally optimized matching result. A threshold is then applied to the matching result to determine whether a region of the image is a moving human:

H = \begin{cases} 1, & \text{if } D(A, B) > T_g \\ 0, & \text{otherwise} \end{cases}   (8)
where H represents the detection result: 1 stands for a human and 0 for a negative. T_g is a threshold whose value is determined in terms of the false alarm rate and the recall rate. Since all templates have similar sizes, T_g is the same for all templates.

2.3 Detecting Multi-scale and Multi-posture Humans
Suppose that n templates are built; we can then regard them as n human detectors representing n postures. In the detection process, a candidate image block is matched against all of the built templates in series, as shown in Fig. 3. If the matching result of any template satisfies equation (8), the image block is regarded as a human.
Fig. 3. Multi-scale and multi-posture human detection
Templates of fixed size cannot handle humans of much larger or smaller sizes. To enable the method to process multi-scale humans, a pyramid of resized video frames is used for multi-scale detection; experiments show that 8–10 scales can cover humans of various sizes. Letting D1–Dn denote the n posture detectors, the multi-scale and multi-posture human detection process is illustrated in Fig. 3.
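The dynamic-programming matching of Section 2.2, which the multi-scale, multi-posture loop applies once per template and per pyramid level, can be sketched as follows. For numerical convenience the sketch minimizes a sum of squared normalized angle/distance differences, i.e. the negative log of the Gaussian terms in Eq. (7), rather than maximizing their product; the candidate representation and parameters are illustrative placeholders.

```python
import math

def viterbi_match(template, candidates, sigma_theta=15.0, sigma_rho=5.0):
    """Dynamic-programming (Viterbi) matching of a contour template to candidate features.
    template: list of (x, y) template feature points F_0..F_T.
    candidates: candidates[t] is the list of (x, y) candidate points at step t."""
    def angle_dist(p, q):
        dx, dy = q[0] - p[0], q[1] - p[1]
        return math.degrees(math.atan2(dy, dx)), math.hypot(dx, dy)

    def step_cost(t, prev_pt, cur_pt):
        # Dissimilarity between the template segment (F_{t-1}, F_t) and the
        # candidate segment (prev_pt, cur_pt), cf. Eqs. (6)-(7).
        th_m, rh_m = angle_dist(template[t - 1], template[t])
        th_c, rh_c = angle_dist(prev_pt, cur_pt)
        return ((th_m - th_c) / sigma_theta) ** 2 + ((rh_m - rh_c) / sigma_rho) ** 2

    T = len(template) - 1
    cost = [0.0] * len(candidates[0])        # best accumulated cost per candidate at step 0
    back = []
    for t in range(1, T + 1):
        new_cost, pointers = [], []
        for cur in candidates[t]:
            best_i = min(range(len(candidates[t - 1])),
                         key=lambda i: cost[i] + step_cost(t, candidates[t - 1][i], cur))
            new_cost.append(cost[best_i] + step_cost(t, candidates[t - 1][best_i], cur))
            pointers.append(best_i)
        cost, back = new_cost, back + [pointers]
    return min(cost)    # overall matching cost D(A, B); backtrack through `back` for the path
```

Classifying a block as human then amounts to thresholding this matching result as in Eq. (8), and repeating the match for each template and each resized frame.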
3 Experiments

We have prepared a dataset of 220 video clips containing about 200,000 video frames captured from natural scenes. The frame sizes are 640×480 or 720×576 pixels. The test set covers a variety of cases, such as moving humans against a static background, moving humans against a moving background, and static humans against a moving background. The backgrounds of most of the video frames are complex, including swaying trees, moving cars, moving animals, and buildings. Fig. 4 illustrates foreground detection results for static and moving backgrounds; when the background is moving, more moving pixels are detected and detecting the human becomes more difficult.
Fig. 4. Examples of detected foreground pixels
Recall rate and false alarm rate are used to evaluate the proposed method. By adjusting the threshold in (8), we obtain the curves of recall rate and false alarm rate shown in Fig. 5, on which a tradeoff between the two can be chosen. For example, an average 86% recall rate with a 4.3% false alarm
Fig. 5. Detection performance on static (left) and moving (right) backgrounds
rate can satisfy the requirements of many real applications in intelligent video surveillance systems. Two kinds of results are given in Fig. 5 for three types of moving humans. The figures show that when the background is moving, human detection is more difficult and the performance is worse than for a static background. Since detection is performed on videos, the detection results can be smoothed by integrating multi-frame detections: if a moving human is detected in any of ten consecutive frames, we regard a moving human as present in all ten frames. The detection performance improves after this integration of multi-frame results. The figure also shows that, without the dynamic programming process, the detection performance drops considerably. Examples of contour matching results are shown in Fig. 6. After the dynamic-programming-based optimization, the final matching results are reasonable: given a moving feature set (the left image of each example), the algorithm automatically finds the most similar template and projects the feature points to reasonable positions so that the final matching distance is minimized. These results intuitively show the effectiveness of dynamic programming for contour matching.
Fig. 6. Examples of feature matching with dynamic programming
Fig. 7. Examples of human detection results
Fig. 7 shows examples of detected moving humans. Most of the humans are well detected despite variations in their postures, sizes, etc. The results also show that, even against cluttered backgrounds, the proposed method performs well under most conditions. Fig. 7d is a frame from a video clip captured from a moving vehicle with a shaking camera; two moving persons are well detected in this case. Fig. 7e contains a human that is missed by the algorithm (the lady beside the tree), which shows that a human whose brightness is similar to the background is more likely to be missed, since the foreground pixels cannot be correctly separated from the background. Fig. 7f contains a false alarm: some building exteriors look quite like human contours and are falsely classified as humans. However, we believe these false alarms could be eliminated by integrating the trajectory cues of a tracking algorithm in future work.
4 Conclusion and Future Works

In this paper, a new method is proposed for moving-human detection. Multiple human contour templates are built to represent multi-posture humans in video frames, and a dynamic programming method is employed to find the best match between candidates and templates. Experimental results on video frames have proved the effectiveness of the template matching method for human detection. The multi-scale, multi-posture detection algorithm is effective in detecting humans of various sizes and postures in videos. The speed of the algorithm should be addressed in future work so that it can run in real time. More representative templates should be built to detect more postures, such as creeping postures. Moving-object tracking algorithms can also be integrated to improve the performance of the proposed method.

Acknowledgement. This research is supported by the Bairen Project of the Chinese Academy of Sciences and partly supported by the National Science Foundation of China (NO. 60672147).
References 1. Lee, D.J., Zhan, P., Thomas, A., Schoenberger, R.B.: Shape-based Human Detection for Threat Assessment. In: Proceedings of Visual Information Processing, SPIE (2004) 2. Beleznai, C., Fruhstuck, B., Bischof, H.: Human Detection in Groups Using a Fast Meanshift Procedure. International Conference on Image Processing 1, 349–352 (2004) 3. Wren, C.R., Azarbayejani, A., Darrell, T., Pentland, A.P.: Pfinder: Real-time Tracking of Human Body. IEEE Trans. on PAMI 19, 780–785 (1997) 4. Gavrila, D.M., Giebel, J.: Shape-based Pedestrian Detection and Tracking. IEEE Intelligent Vechicle Symposium 1, 8–14 (2002) 5. Mohan, A., Papageorgiou, C., Poggio, T.: Example-based Object Detection in Images by Components. IEEE Trans.PAMI 23 (2001) 6. Dalal, N.m., Triggs, B., Schmid, C.: Human Detection Using Oriented Histograms of Flow and Appearance. European Conference on Computer Vision 2006.
7. Leibe, B., Seemann, E., Schiele, B.: Pedestrian Detection in Crowded Scenes. In: International Conference on Computer Vision and Pattern Recognition (2005) 8. Cutler, R., Davis, L.S.: Robust Real-time Periodic Motion Detection, Analysis and Applications. IEEE Trans. on PAMI 22, 781–796 (2000) 9. Viola, P., Jones, M.J., Snow, D.: Detecting Pedestrians using Patterns of Motion and Appearance. IEEE International Conference on Computer Vision 2, 734–741 (2003) 10. Haga, T., Sumi, K., Yagi, Y.: Human Detection in Outdoor Scene Using Spatio-temporal Motion Analysis. International Conference on Pattern Recognition 4, 331–334 (2004) 11. Zhang, Y., Kiselewich, S.J., Bauson, W.A., Hammoud, R.: Robust Moving Object Detection at Distance in the Visible Spectrum and Beyond Using A Moving Camera. In: Workshop of International Conference on Computer Vision and Pattern Recognition, pp. 131–134 (2006) 12. Dai, C., Zheng, Y., Li, X.: Pedestrian Detection and Tracking in Infrared Imagery Using Shape and Appearance. Int., J. CVIU. 106, 288–299 (2007) 13. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. John Wiley and Sons press, Chichester (2001)
A Cascade of Feed-Forward Classifiers for Fast Pedestrian Detection

Yu-Ting Chen(1,2) and Chu-Song Chen(1,3)

1 Institute of Information Science, Academia Sinica, Taipei, Taiwan
2 Dept. of Computer Science and Information Engineering, National Taiwan University
3 Graduate Institute of Networking and Multimedia, National Taiwan University
{yuhtyng,song}@iis.sinica.edu.tw
Abstract. We develop a method that can detect humans in a single image based on a new cascaded structure. In our approach, both rectangle features and 1-D edge-orientation features are employed in the feature pool for weak-learner selection; they can be computed via the integral-image and integral-histogram techniques, respectively. To make the weak learners more discriminative, Real AdaBoost is used for feature selection and for learning the stage classifiers from the training images. Instead of the standard boosted cascade, we propose a novel cascaded structure that exploits both the stage-wise classification information and the inter-stage cross-reference information. Experimental results show that our approach can detect people with both efficiency and accuracy.
1 Introduction
Detecting pedestrians in an image has received considerable attention in recent years. It has a wide variety of applications, such as video surveillance, smart rooms, content-based image retrieval, and driver-assistance systems. Detecting people against a cluttered background is still a challenging problem, since different postures and illumination conditions can cause large variations in appearance. In object detection, both efficiency and accuracy are important issues. In [1], Viola and Jones proposed a fast face detection framework based on a boosted cascade. This cascade structure has since been applied to many other object detection problems. For instance, Viola et al. [2] used the cascade framework for pedestrian detection. Rectangle features, which can be evaluated efficiently via the integral-image technique, are employed as the basic elements for constructing the weak learners of the AdaBoost classifier at each stage of the cascade. While rectangle features are effective for object-detection tasks such as face detection, they still encounter difficulties in detecting people, because they are built using only intensity information, which is not sufficient to encode the variance of human appearance caused by factors that produce large gray-value changes, such as clothing. Recently, Dalal and Triggs [3] presented a people detection method with promising detection performance. This method can detect people in a single
image. In this work, edge-based features, HOG (Histograms of Oriented Gradients), are designed to capture the edge-orientation structure that characterizes human images effectively. HOG features are a variant of Lowe's SIFT [4] (Scale Invariant Feature Transform), but they are computed on a dense grid with uniform spacing. Nevertheless, a limitation of this method is that a very high-dimensional feature vector is used to describe each block in an image, which requires a long computation time. To speed up the detection, Zhu et al. [5] combined the above two methods by using a linear SVM classifier with HOG features as a weak learner in the AdaBoost stages of the cascaded structure, enhancing the efficiency of the HOG approach. In this paper, we develop an object detection framework with both efficiency and accuracy. Our approach employs rectangle features and 1-D edge-orientation features that can be computed efficiently. To make the weak learner more discriminative, we use Real AdaBoost as a stage classifier in the cascade. Instead of learning a standard boosted cascade [1] for detection, a new cascading structure is introduced in this paper that exploits not only the stage-wise classification information but also the inter-stage cross-reference information, so that detection accuracy and efficiency can be further increased.
2 Previous Work
There are two main types of approaches to pedestrian detection: the holistic approach and the component-based approach. In holistic approaches, a full-body detector is used to analyze a single detection window. The method of Gavrila and Philomin [6] detects pedestrians in images by extracting edge images and matching them to a template hierarchy of learned exemplars using chamfer distances. Papageorgiou and Poggio [7] adopted a polynomial SVM to learn a pedestrian detector, where Haar wavelets are used as feature descriptors. In [1], Viola and Jones proposed a boosted cascade of Haar-like wavelet features for face detection. Subsequently, this work was extended to integrate intensity and motion information for walking-person detection [2]. Dalal and Triggs [3] designed HOG appearance descriptors, which are fed into a linear SVM for human detection. Zhu et al. [5] employed the HOG descriptor in the boosted cascade structure to speed up the people detector. Dalal et al. [8] further extended the approach in [3] by combining the HOG descriptors with oriented histograms of optical flow to handle space-time information for moving humans. The holistic approaches may fail to detect pedestrians when occlusion happens. Some component-based approaches have been proposed to deal with the occlusion problem. Generally, a component-based approach searches for a pedestrian by looking for its apparent components instead of the full body. For example, Mohan et al. [9] divided the human body into four components, head, legs, and left/right arms, and a detector is learned using an SVM with Haar features for each component. In [10], Mikolajczyk et al. used position-orientation histograms of binary edges as features to build component-based detectors of frontal/profile heads, faces, upper bodies, and legs. Though component-based approaches can
cope with the occlusion problem, a high image resolution of the detection window is required for capturing sufficient information about human components. This restricts the range of applications. For example, the resolution of humans in some surveillance videos is too low for component-based approaches to detect. In this paper, we propose a holistic human detection framework. Our approach can detect humans in a single image. It is thus applicable to cases where only single images are available, such as detecting people in home photos. The rest of this paper is organized as follows: In Section 3, the Real AdaBoost algorithm using rectangle features and EOH features is introduced. A novel cascaded structure of feed-forward classifiers is proposed in Section 4. Experimental results are shown in Section 5. Finally, a conclusion is given in Section 6.
3 Real AdaBoost with Rectangle and EOH Features

3.1 Feature Pool
Features based on edge orientations have been shown to be effective for human detection [3]. In the HOG representation [3], each image block is represented by 7 × 15 overlapping sub-blocks. Each sub-block contains 4 non-overlapping regions, where each region is represented as a 9-bin histogram, with each bin corresponding to a particular edge orientation. In this way, a 3780-dimensional feature, encoding part-based edge-orientation distribution information, is used to represent an image block. Such a representation is powerful for people detection, but it has some limitations. First, the representation is too complex to evaluate quickly, and thus the detection speed is slow. Second, all of the dimensions in an HOG feature vector are employed simultaneously, which precludes the use of only part of them, which may be sufficient to reject non-human blocks, for fast pre-filtering. A high-dimensional edge-orientation feature like HOG can be treated as a combination of many low-dimensional ones. In our approach, instead of employing a high-dimensional feature vector, we use a set of one-dimensional features derived from edge orientations, as suggested by Levi and Weiss [11]. Similar to HOG, the EOH (Edge Orientation Histogram) feature introduced in [11] also employs edge-orientation information for feature extraction, but an EOH feature characterizes only one orientation at a time, and each EOH feature is represented by a real value. Unlike HOG, which is uniquely defined for an image region, many EOH features (with respect to different orientations) can be extracted from an image region, each of which is only one-dimensional. Therefore, there is a pool of EOH features from which to select for a region. The EOH feature is thus suitable for integration into the AdaBoost or boosted-cascade approaches for weak-learner selection. In our approach, the EOH feature is employed in the AdaBoost stages of our cascading structure. Since the weak learners employed are all one-dimensional with scalar outputs, the resulting AdaBoost classifier is more efficient to compute than one that uses high-dimensional features (e.g., HOG) to build the weak learners [5]. We briefly review the EOH features in the following.
To compute EOH features, the pixel gradient magnitude m and gradient orientation θ in a block B are calculated by the Sobel edge operator. The edge orientation is evenly divided into K bins over 0° to 180°; the sign of the orientation is ignored, so orientations between 180° and 360° are treated the same as those between 0° and 180°. Then, the edge orientation histogram E_k(B) in each orientation bin k of block B is built by summing up all of the edge magnitudes whose edge orientations belong to bin k. The EOH feature we adopt is measured by the ratio of the bin value of a single orientation to the sum of all the bin values as follows:

$$F_k(B) = \frac{E_k(B) + \epsilon}{\sum_i E_i(B) + \epsilon}, \qquad (1)$$

where $\epsilon$ is a small positive value to avoid the denominator being zero. Each block thus has K EOH features, F_1(B), . . . , F_K(B), which are allowed to be selected as weak learners. Similar to the use of the integral-image technique for fast evaluation of the rectangle features, the integral histogram [12] can be used to compute the EOH features efficiently. The feature pool employed in our approach for AdaBoost learning contains the EOH features. To further enhance the detection performance, we also include the rectangle features used in [1] for weak-learner selection.
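As a concrete illustration of Eq. (1) and of the integral-histogram evaluation, the following is a minimal sketch (not the authors' code) that bins Sobel gradients into K orientation channels, builds one integral image per channel, and reads off the K one-dimensional EOH features of a block; the function names, the use of NumPy/SciPy, and the value of ε are our assumptions.

```python
import numpy as np
from scipy.ndimage import sobel  # any Sobel implementation would do

def orientation_integral_images(gray, K=9):
    """One integral image per orientation bin of the edge-magnitude histogram."""
    g = gray.astype(np.float64)
    gx, gy = sobel(g, axis=1), sobel(g, axis=0)
    mag = np.hypot(gx, gy)
    theta = np.degrees(np.arctan2(gy, gx)) % 180.0      # sign of orientation ignored
    bins = np.minimum((theta / (180.0 / K)).astype(int), K - 1)
    integrals = []
    for k in range(K):
        channel = np.where(bins == k, mag, 0.0)
        integrals.append(channel.cumsum(axis=0).cumsum(axis=1))  # integral image of bin k
    return integrals

def block_sum(ii, y0, x0, y1, x1):
    """Sum over the rectangle [y0, y1) x [x0, x1) using an integral image."""
    s = ii[y1 - 1, x1 - 1]
    if y0 > 0: s -= ii[y0 - 1, x1 - 1]
    if x0 > 0: s -= ii[y1 - 1, x0 - 1]
    if y0 > 0 and x0 > 0: s += ii[y0 - 1, x0 - 1]
    return s

def eoh_features(integrals, y0, x0, y1, x1, eps=1e-3):
    """K features F_k(B) = (E_k(B) + eps) / (sum_i E_i(B) + eps) of Eq. (1)."""
    E = np.array([block_sum(ii, y0, x0, y1, x1) for ii in integrals])
    return (E + eps) / (E.sum() + eps)
```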
3.2 Learning Via Real AdaBoost
After forming the feature pool, we learn an AdaBoost classifier for some stages of our cascading structure. Typically, the AdaBoost algorithm selects weak learners with binary-valued outputs obtained by thresholding the feature values, as shown in Fig. 1(a) [1,2,5,11]. However, a disadvantage of thresholded-type weak learners is that they are too crude to discriminate the complex distributions of the positive and negative training data. To deal with this problem, Schapire et al. [13] suggested the use of the Real AdaBoost algorithm. To represent the distributions of positive and negative data, the domain of the feature value is evenly partitioned into N disjoint bins (see Fig. 1(b)). The real-valued output in each bin is calculated according to the ratio of the training data falling into the bin. The weak-learner output then depends only on the bin to which the input data belongs. Real AdaBoost has shown better discriminating power between positive and negative data [13]. This algorithm is employed to find an AdaBoost classifier for each stage of the cascade; more details can be found in [13].
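The binned, real-valued weak learner of Fig. 1(b) can be sketched as follows. The half-log-ratio output and the smoothing constant follow Schapire and Singer's confidence-rated formulation [13]; these details, the class names, and the bin handling are our assumptions rather than the authors' implementation.

```python
import numpy as np

class BinnedRealWeakLearner:
    """Real-valued weak learner over N equal-width bins of one scalar feature."""

    def __init__(self, n_bins=10, eps=1e-6):
        self.n_bins, self.eps = n_bins, eps

    def fit(self, f, y, w):
        # f: feature values, y in {+1, -1}, w: current AdaBoost sample weights.
        self.lo, self.hi = float(f.min()), float(f.max())
        idx = self._bin(f)
        wp = np.bincount(idx, weights=w * (y > 0), minlength=self.n_bins)
        wn = np.bincount(idx, weights=w * (y < 0), minlength=self.n_bins)
        self.out = 0.5 * np.log((wp + self.eps) / (wn + self.eps))  # per-bin real output
        self.Z = 2.0 * np.sum(np.sqrt(wp * wn))  # selection criterion: smaller Z is better
        return self

    def _bin(self, f):
        t = (f - self.lo) / max(self.hi - self.lo, 1e-12)
        return np.clip((t * self.n_bins).astype(int), 0, self.n_bins - 1)

    def predict(self, f):
        return self.out[self._bin(f)]
```

At each boosting round, the feature whose learner has the smallest Z would be selected, and the sample weights updated as w ← w·exp(−y·h(x)) before renormalization.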
4 Feed-Forward Cascade Architecture
The Viola and Jones cascade structure containing S stages is illustrated in Fig. 2, where A_i denotes an AdaBoost or Real AdaBoost classifier in the i-th stage. In this cascaded structure, negative image blocks that do not contain humans can be discarded in the early stages of the cascade. Only the blocks passing all the stages are deemed positive (i.e., the ones containing humans).
Fig. 1. Two types of weak classifiers: (a) binary-valued weak classifier and (b) real-valued weak classifier

Fig. 2. Viola and Jones cascade structure. To learn each stage, negative images are randomly selected from the bootstrap set, as shown by the dashed arrows.
A characteristic of the cascading approach is that the decision times for negative and positive blocks are unequal: the former take little time, while the latter take much more. To find an object of unknown position and size in an image, one usually has to search blocks at all possible sites and scales in the image. In this case, since the negative blocks to be verified in an image usually far outnumber the positive blocks, saving on the decision time of the negative blocks increases the overall efficiency of the object detector. To train such a cascaded structure, we usually set a goal for each stage. The later the stage, the more difficult the goal. For example, consider the situation in which the first stage is designed so that 99.9% of positive examples are accepted and 50% of negative examples are rejected. Then, in the second stage, the positive examples remain the same, but the negative examples include those from the bootstrap set not successfully rejected by the first stage. If we set the goal of the second stage again as accepting 99.9% of positive examples and rejecting 50% of negative examples, and repeat the procedure for the later stages, the accepting rate of positive examples and the rejecting rate of negative examples after the i-th stage are (99.9%)^i and 1 − (50%)^i, respectively, on the training data. In each stage, the Real AdaBoost algorithm introduced in Section 3.2 can be used to select a set of weak learners from the feature pool to achieve the goal. Since more difficult negative examples are sent to the later stages, it usually happens that more weak learners have to be chosen to fulfill the goal in the later stages.
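The cumulative rates quoted above follow from simple arithmetic; a small sketch, using the per-stage goals of the example (any other setting is an assumption):

```python
def cascade_rates(n_stages, stage_detection=0.999, stage_rejection=0.5):
    """Cumulative positive-accept and negative-reject rates after i stages."""
    return [(stage_detection ** i, 1.0 - (1.0 - stage_rejection) ** i)
            for i in range(1, n_stages + 1)]

# For eight stages at (99.9%, 50%) per stage:
# positives kept ~ 0.999**8 ~ 0.992, negatives rejected ~ 1 - 0.5**8 ~ 0.996.
```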
The degree of prediction accuracy in each stage is evaluated by a confidence score. A high confidence value implies an accurate prediction. Each stage learns its own threshold to accept or reject an image block, as shown in Fig. 3(a). In the Viola and Jones structure, the confidence value is discarded in subsequent stages. That is, once the confidence value is used to make a binary decision (yes or no) in the current stage, it is no longer used in the later stages. This means that the stages are independent of each other and no cross-stage references are allowed. Nevertheless, exploiting the inter-stage information can boost the classification performance further. This is because, by composing the confidence values of multiple stages (say, d stages) into a vector and making a decision in the d-dimensional space, the classification boundaries considered are no longer restricted to hyper-planes parallel to the axes of the stages (as shown in Fig. 3(a)), but can be hyper-planes (or surfaces) of general form. A two-dimensional case is illustrated in Fig. 3(b). One possible way to exploit the inter-stage information is to delay the decision making of all S stages in the cascade and perform a post-classification in the S-dimensional space to make a single final decision. However, making a decision after gathering all the confidence scores would considerably decrease the detection efficiency, since there would be no chance to jump out of the cascade early. In this paper, we propose a novel approach that exploits the inter-stage information while preserving the early jump-out effect.

4.1 Adding Meta-stages
Our method is based on adding meta-stages to the original boosted cascade, as shown in Fig. 4. A meta-stage is a classifier that uses the inter-stage information (represented by the confidence scores) of some of the previous stages for learning. Like an AdaBoost stage, a meta-stage is also designed with a goal to accept and reject pre-defined ratios of positive and negative examples, respectively, and its prediction accuracy is also measured by the confidence score of the classification method adopted for the meta-stage. In our approach, the meta-stages and the AdaBoost stages are arranged in the cascade as AAMAMAM. . . AM, where 'A' and 'M' denote the AdaBoost stages and meta-stages, respectively, as shown in Fig. 4. In this case, each meta-stage is a classifier in a two-dimensional space. The input vector of the first meta-stage M_1 is the two-dimensional vector (C(A_0), C(A_1)), where C(A_i) is the confidence score of the i-th AdaBoost stage. The input vector of each other meta-stage M_i (i = 2, . . . , H) is also a two-dimensional vector, (C(M_{i−1}), C(A_i)), consisting of the confidence values of the two closest previous stages in the cascade, where C(M_i) is the confidence score of the i-th meta-stage. The meta-stage introduced above is light-weight in computation since only a two-dimensional classification is performed. However, it can help us further reject negative examples during training of the entire cascade. In our implementation, we usually set the goal of the meta-stage as allowing all the positive training examples to be correctly classified, and finding the classifier with the highest rejection rate of the negative training examples under this condition.
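To make the data flow of the AAMAM. . . AM arrangement concrete, the following is a hedged sketch of how one detection window would be evaluated; the stage objects, their confidence()/threshold interface, and the early-exit logic are hypothetical, not the authors' implementation.

```python
def evaluate_window(x, ada_stages, meta_stages):
    """Feed-forward cascade A A M A M ... A M; returns True if window x is accepted.

    ada_stages: [A0, A1, ..., AH]; meta_stages: [M1, ..., MH].
    Each stage is assumed to expose confidence(...) and a learned threshold.
    """
    c_prev = ada_stages[0].confidence(x)            # C(A0)
    if c_prev < ada_stages[0].threshold:
        return False
    c_a = ada_stages[1].confidence(x)               # C(A1)
    if c_a < ada_stages[1].threshold:
        return False
    for m, a in zip(meta_stages, ada_stages[2:] + [None]):
        c_prev = m.confidence((c_prev, c_a))        # 2-D input: (C(A0) or C(M_{i-1}), C(A_i))
        if c_prev < m.threshold:
            return False
        if a is None:                               # the cascade ends with a meta-stage
            break
        c_a = a.confidence(x)
        if c_a < a.threshold:
            return False
    return True
```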
Fig. 3. Triangles and circles are negative and positive examples shown in the data space. (a) The data space is separated into object (POS) and non-object (NEG) regions by thresholds th_i and th_{i+1} in stages i and i + 1. (b) The inter-stage information of stages i and i + 1 can be used to learn a new classification boundary, as shown by the green line.
Fig. 4. Feed-forward cascade structure
This criterion does not influence the decisions of the previous AdaBoost classifiers about the positive data, but it helps reject more of the negative data. In our experience, by adding the meta-stages, the total number of required AdaBoost stages can be reduced when the same goals are to be fulfilled.
4.2 Meta-stage Classifier
The classification method used in the meta-stage can be arbitrary. In our work, we choose the linear SVM as the meta-stage classifier due to its high generalization ability and its efficiency in evaluation. To train the meta-stage classifier, 3-fold cross-validation is applied to select the best penalty parameter C of the linear SVM. Then, a maximum-margin hyperplane that separates the positive and negative training data can be learned. To achieve the goal of the meta-stage, we move the hyperplane along its normal direction by applying different thresholds, and find the one with the highest rejection rate for the negative training data (under the condition that no positive examples are falsely rejected). Note that, even though a two-dimensional classifier is used, each meta-stage inherently contains the confidence of all the previous stages. This is because a meta-stage (except the first one) employs the confidence value of its closest previous meta-stage as one of its inputs. Thus, information from the previous stages is iteratively fed forward to the later meta-stages.
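Shifting the learned hyperplane along its normal amounts to choosing a new decision threshold on the signed SVM scores; a minimal sketch of that selection criterion (not the authors' code), assuming scores_pos and scores_neg hold the training scores w·x + b:

```python
import numpy as np

def tune_meta_threshold(scores_pos, scores_neg):
    """Largest threshold that still accepts every positive training example.

    Returns the threshold and the fraction of negative examples rejected by it.
    """
    threshold = float(np.min(scores_pos))       # accept rule: score >= threshold
    rejected = float(np.mean(scores_neg < threshold))
    return threshold, rejected
```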
5 Experimental Result
To evaluate the proposed cascade structure, a challenging pedestrian data set, the INRIA person data set [3], is adopted in our experiments. This data set contains standing people with different orientations and poses, in front of varied cluttered backgrounds. The resolution of the human images is 64 × 128 pixels. Within a 64 × 128 detection block, the feature pool contains 22477 features (6916 rectangle features and 15561 EOH features) for learning the AdaBoost stages, and the domain of the feature value is evenly divided into 10 disjoint bins for each feature in the Real AdaBoost algorithm. The edge orientation is evenly divided into 9 bins over 0° to 180° to calculate the EOH features. A bootstrap set with 3373860 negative images is generated by selecting sub-images from the non-pedestrian training images at different positions and scales. We refer to the method presented in Section 3 as the ErR-cascade method, since it employs the EOH and rectangle features in the Real AdaBoost algorithm for human detection. The method to which the meta-stages are further added (as illustrated in Fig. 4) is referred to as the ErRMeta-cascade method. In the ErRMeta-cascade method, all meta-stages are two-dimensional classifiers, and the linear SVM is adopted as the meta-stage learner. The meta-stages can thus be computed very fast, since only a two-dimensional inner product is needed. We use the same number of positive and negative examples for training each stage of the cascade: the data set provides 2416 positive training examples, and we randomly select 2416 negative images from the bootstrap set as the negative training data. In training each AdaBoost stage, we keep adding weak learners until the predefined goals are achieved. In our experiments, we require that at least 99.95% of positive examples are accepted and at least 50% of negative examples are rejected in each AdaBoost stage. For meta-stages, we only require that all the positive examples be accepted and find the classifier with the highest negative-example rejection rate. If the false positive rate of the cascade falls below 0.5%, the cascade stops learning new stages. We also implemented the Dalal and Triggs method [3] (referred to as the HOG-LSVM method). First, we compare the performance of the ErR-cascade and the HOG-LSVM methods. After training, there are eight AdaBoost stages with 285 weak classifiers in the ErR-cascade, as shown in Fig. 5(a). For a 320 × 240 image (containing 2770 detection blocks), the average processing speeds of the HOG-LSVM and the ErR-cascade are 1.21 and 9.48 fps (frames per second), respectively, on a PC with a 3.4 GHz processor and 2.5 GB memory. Since the HOG-LSVM uses a 3780-dimensional feature, that method is time-consuming. As for the detection results, the ROC curves of the two methods are shown in Fig. 6(a). From the ROC curves, the detection result of the ErR-cascade is overall better than that of the HOG-LSVM method. The introduced ErR-cascade method thus greatly improves the detection speed and also slightly increases the detection accuracy compared with the HOG-LSVM method. Then, we compare these methods with the method with meta-stages. All the goal settings of the AdaBoost stages are the same as those of the ErR-cascade. After training, there are seven AdaBoost stages with 258 weak classifiers, as shown in Fig. 5(b), and six meta-stages.
Fig. 5. The number of weak classifiers learned in each AdaBoost stage of the ErR-cascade method (a) and the ErRMeta-cascade method (b)

Fig. 6. (a) The ROC curves of the HOG-LSVM method and the ErR-cascade method. (b) The ROC curves of the ErR-cascade method and the ErRMeta-cascade method.

Fig. 7. Experimental results of the ErRMeta-cascade method
For a 320 × 240 image, the average processing speed is 10.13 fps. Compared with the ErR-cascade method, the trained cascade contains fewer weak learners, and some non-pedestrian blocks can be rejected early by the meta-stages with less computation. The ROC curves of the ErR-cascade and ErRMeta-cascade are shown in Fig. 6(b). The results demonstrate that, by adding the meta-stages, both detection speed and accuracy can be further improved. Some results are shown in Fig. 7.
6 Conclusion
A novel cascaded structure for pedestrian detection is presented in this paper, which consists of AdaBoost stages and meta-stages. In our approach, the 1-D edge-based EOH feature is employed for weak-learner selection, and the Real AdaBoost algorithm is used as the AdaBoost-stage classifier to make the weak learners more discriminative. As for the meta-stages, the inter-stage information of the previous stages is composed into a vector for learning an SVM hyperplane, so that negative examples can be further rejected. Based on the experimental results, our approach is practically useful since it can detect pedestrians with both efficiency and accuracy. Although the cascade type AAMAMAM. . . AM is used here, our approach can be generalized to other ways of composing the AdaBoost stages and the meta-stages. In the future, we plan to apply our method to other object detection problems, such as faces, vehicles, and motorcycles.

Acknowledgments. This work was supported in part under Grant NSC962422-H-001-001.
References

1. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: IEEE CVPR, vol. 1, pp. 511–518 (2001)
2. Viola, P., Jones, M., Snow, D.: Detecting pedestrians using patterns of motion and appearance. In: IEEE ICCV, vol. 2, pp. 734–741 (2003)
3. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE CVPR, vol. 1, pp. 886–893 (2005)
4. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004)
5. Zhu, Q., Yeh, M.C., Cheng, K.T., Avidan, S.: Fast human detection using a cascade of histograms of oriented gradients. In: IEEE CVPR, vol. 2, pp. 1491–1498 (2006)
6. Gavrila, D., Philomin, V.: Real-time object detection for "smart" vehicles. In: IEEE ICCV, vol. 1, pp. 87–93 (1999)
7. Papageorgiou, C., Poggio, T.: A trainable system for object detection. IJCV 38(1), 15–33 (2000)
8. Dalal, N., Triggs, B., Schmid, C.: Human detection using oriented histograms of flow and appearance. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, Springer, Heidelberg (2006)
9. Mohan, A., Papageorgiou, C., Poggio, T.: Example-based object detection in images by components. IEEE PAMI 23(4), 349–361 (2001)
10. Mikolajczyk, K., Schmid, C., Zisserman, A.: Human detection based on a probabilistic assembly of robust part detectors. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, pp. 69–82. Springer, Heidelberg (2004)
11. Levi, K., Weiss, Y.: Learning object detection from a small number of examples: the importance of good features. In: IEEE CVPR, vol. 2, pp. 53–60 (2004)
12. Porikli, F.: Integral histogram: a fast way to extract histograms in cartesian spaces. In: IEEE CVPR, vol. 1, pp. 829–836 (2005)
13. Schapire, R.E., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3), 297–336 (1999)
Combined Object Detection and Segmentation by Using Space-Time Patches

Yasuhiro Murai^1, Hironobu Fujiyoshi^1, and Takeo Kanade^2

1 Dept. of Computer Science, Chubu University, Matsumoto 1200, Kasugai, Aichi, 487-8501 Japan
[email protected], [email protected]
http://www.vision.cs.chubu.ac.jp/
2 The Robotics Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, 15213-3890 USA
[email protected]
Abstract. This paper presents a method for classifying the direction of movement and for segmenting objects simultaneously using features of space-time patches. Our approach uses vector quantization to classify the direction of movement of an object and to estimate its centroid by referring to a codebook of space-time patch features, which is generated from multiple learning samples. We segment the objects' regions based on a probability calculated from the mask images of the learning samples, using the estimated centroid of the object. Even when occlusions occur because multiple objects moving in different directions overlap, our method detects the objects individually, because their directions of movement are classified. Experimental results show that object detection is more accurate with our method than with the conventional method, which is based on appearance features only.
1 Introduction
Recent achievements in automatic object detection and segmentation have led to applications in robotics, visual surveillance, and ITS [1]. Motion- and part-based approaches have previously been proposed to detect and estimate the positions of objects moving in images. Optical flow, which quantifies the movement of objects as vector data, has previously been proposed [2]. However, dense, unconstrained, and non-rigid motion estimation using optical flow is noisy and unreliable, so estimating the movement of objects by optical flow is difficult. Shechtman et al. [3] proposed a method for detecting similar motions in video streams, despite differences in appearance due to clothing, background, and illumination, by using space-time patches. For short, we refer to a space-time patch as an ST-patch. Niebles et al. [4] proposed a method for categorizing human actions by gathering information from space-time interest points. The part-based approach with local features has been used to categorize unknown objects in difficult real-world images. Agarwal et al. [5] proposed an approach that uses an automatically acquired, sparse, part-based representation
of objects to learn a classifier that can accurately detect occurrences of a category of objects in a static image. Leibe et al. [6,7] proposed a method for categorizing and segmenting objects by estimating the centroids of objects from image patches, which are extracted from a test image, and the corresponding appearance codebook. Moreover, a method for object categorization using object boundary fragments and their relation to the centroid [8], a people detection algorithm using a dense grid of Histograms of Oriented Gradients (HOG) [9], and a face detection system using patterns of appearance obtained by Haar-like features [10] have been proposed. Thus, many recent studies have also used the part-based approach. These approaches have the advantage that they can detect an object even when part of it is occluded. However, it is difficult for them to segment multiple overlapping objects individually, such as pedestrians who are walking in different directions. We developed a method, based on the part-based approach, that uses spatio-temporal features to simultaneously classify the direction of movement and segment the objects. Our approach classifies the direction of movement of an object by using ST-patch features [3] and estimates the position of the centroid of the object based on its direction of motion. The object is segmented by using the estimated position of its centroid and the mask images stored with the learning samples of the ST-patch features.
2 ST-Patch
Our approach classifies the direction of movement of objects by using ST-patch features. When we observe two movements, such as a pedestrian walking to the right and another walking to the left, we obtain different ST-patch features. Therefore, we can generate a codebook based on the different motions of the ST-patch features. In this section, we describe the ST-patch features used to classify the direction of movement of an object, and we describe a method for generating a codebook from the ST-patch features extracted from learning samples.
2.1 Overview of the ST-Patch
The ST-patch features are extracted from a small domain of a spatio-temporal image, i.e., the 3-dimensional data obtained by extending the image in the direction of time. Fig. 1 shows an overview of the ST-patch. Three color lines represent the motion of each pixel, where [u v w]^T is a space-time direction vector in the ST-patch and ∇P_i represents the space-time gradients.
2.2 ST-Patch Features
A locally uniform motion induces parallel lines (see the zoomed-in part of Fig. 1) within the ST-patch P. All the color lines within a single ST-patch are oriented in the space-time direction [u v w]^T. The orientation of [u v w]^T can be different for different points.
Fig. 1. Overview of the ST-patch
It is assumed to be uniform locally, within a small ST-patch P in the video stream. By examining the space-time gradients ∇P_i = (P_{x_i}, P_{y_i}, P_{t_i}) of the intensity at each pixel within the ST-patch P (i = 1, · · ·, n), we find that these gradients all point in the direction of maximum change of the space-time intensity. Namely, these gradients are all perpendicular to the direction [u v w]^T of the color lines:

$$\nabla P_i \begin{bmatrix} u \\ v \\ w \end{bmatrix} = 0. \qquad (1)$$

Stacking these equations for all n pixels within the small ST-patch P, we obtain:

$$\begin{bmatrix} P_{x_1} & P_{y_1} & P_{t_1} \\ P_{x_2} & P_{y_2} & P_{t_2} \\ \vdots & \vdots & \vdots \\ P_{x_n} & P_{y_n} & P_{t_n} \end{bmatrix}_{n \times 3} \begin{bmatrix} u \\ v \\ w \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}_{n \times 1}, \qquad (2)$$

where n is the number of pixels in P; we denote the n × 3 matrix by G. Multiplying both sides of Eq. (2) by G^T (the transpose of the gradient matrix G) yields:

$$G^T G \begin{bmatrix} u \\ v \\ w \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}_{3 \times 1}. \qquad (3)$$

G^T G is a 3 × 3 matrix. We denote it by M:

$$M = G^T G = \begin{bmatrix} \sum P_x^2 & \sum P_x P_y & \sum P_x P_t \\ \sum P_y P_x & \sum P_y^2 & \sum P_y P_t \\ \sum P_t P_x & \sum P_t P_y & \sum P_t^2 \end{bmatrix}. \qquad (4)$$
The matrix M contains information about the appearance and motion of the ST-patch. This matrix can be represented as a 9-dimensional vector e as follows:

$$e = \left( \sum P_x^2,\ \sum P_x P_y,\ \cdots,\ \sum P_t^2 \right). \qquad (5)$$
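A compact sketch of Eqs. (4)–(5) for a single patch is given below; the derivative filters (np.gradient, i.e., simple finite differences) and the patch layout are our assumptions, since the exact implementation is not specified in the paper.

```python
import numpy as np

def st_patch_feature(patch):
    """9-D ST-patch feature e = vec(G^T G) for a small space-time patch.

    patch: array of shape (T, H, W), e.g. 3 x 15 x 15 gray-level values.
    """
    p = patch.astype(np.float64)
    pt, py, px = np.gradient(p)                                   # gradients along t, y, x
    G = np.stack([px.ravel(), py.ravel(), pt.ravel()], axis=1)    # n x 3 gradient matrix
    M = G.T @ G                                                   # 3 x 3 matrix of Eq. (4)
    return M.ravel()                                              # 9-D vector e of Eq. (5)
```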
2.3 The Codebook of the ST-Patch Features
To generate a codebook of ST-patch features for classifying the direction of movement and for segmenting objects, we use the LBG algorithm [11]. The LBG algorithm is a method for clustering features and generating a codebook. Using the LBG algorithm, the feature vectors of the learning samples can be clustered into a group of N representative vectors. Learning samples in which pedestrians or vehicles move to the right and to the left in the image were used to generate the codebook of ST-patch features. The following steps describe the flow of codebook generation.

Step 1. ST-patch features are extracted from multiple learning samples.
Step 2. The ST-patch features are labeled based on their direction of movement o_d = {right, left, other}. Moreover, the position of the centroid and the mask image of the object are stored with each learning sample of the ST-patch feature.
Step 3. A codebook is created by clustering into N groups with the LBG algorithm.
Step 4. The probability of the direction of movement p(o_d | I) of each codebook cluster I is calculated.

When the codebook of ST-patch features is created with the LBG algorithm, not all labels belonging to a codebook cluster are the same. However, within a codebook cluster, the proportion of samples sharing the same label is high. The probability of the direction of movement p(o_d | I) of codebook cluster I is therefore calculated from the number of labels belonging to that cluster. The positions of the centroids of the learning samples and the mask images are used for estimating the centroids of objects and for segmenting the objects' regions.
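A compact sketch of LBG-style codebook generation (Step 3) is shown below; the splitting perturbation, the fixed number of refinement iterations, and the simplified convergence handling are our assumptions, not details from [11] or from the authors' implementation.

```python
import numpy as np

def lbg_codebook(features, target_size=512, eps=1e-3, n_iter=20):
    """Grow a codebook by repeatedly splitting codewords and refining them (LBG [11]).

    features: (n_samples, 9) array of ST-patch feature vectors.
    """
    codebook = features.mean(axis=0, keepdims=True)
    while codebook.shape[0] < target_size:
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])  # split step
        for _ in range(n_iter):                                             # Lloyd refinement
            d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            assign = d2.argmin(axis=1)
            for k in range(codebook.shape[0]):
                members = features[assign == k]
                if len(members) > 0:
                    codebook[k] = members.mean(axis=0)
    return codebook
```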
3 Classifying Direction of Movement and Segmenting Regions of Objects
We quantized the vectors of the ST-patch features acquired from an input image using the codebook of ST-patch features. We estimated the position of the centroid of the object by voting for centroid positions based on the classification of the direction of movement and by sampling the ST-patch features. Then, we classified the direction of movement of the object. The flow of the proposed method is illustrated in Fig. 2.
3.1 Vector Quantization of the ST-Patch Features
The vector quantization of the ST-patch features was performed using the codebook generated in advance. The flow of vector quantization is shown below.
Fig. 2. Flow of the proposed method
Step 1. An image patch is obtained by downsampling the image, and the ST-patch features are extracted from this patch (Fig. 3(a)).
Step 2. Vector quantization is performed on the ST-patch features (Fig. 3(b)). The Euclidean distance between the vector of the input ST-patch features e and each codebook cluster c is calculated, and the codebook cluster I with the minimum Euclidean distance is selected by Eq. (6):

$$I = \arg\min_{c} \| e - c \|^2. \qquad (6)$$

Step 3. The size of the patch is changed to handle changes in scale.
Step 4. Steps 1–3 are repeated over a raster scan of the image.

Thus, we can respond to the scale of an object by changing the size of the patch.
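A sketch of Steps 1–4, i.e., quantizing densely sampled patches over a raster scan at several scales, is given below; the stride, the scale set, and the reuse of the st_patch_feature() sketch from Section 2.2 are our assumptions.

```python
import numpy as np

def quantize(e, codebook):
    """Index I of the nearest codebook cluster (Eq. (6))."""
    return int(np.argmin(np.sum((codebook - e) ** 2, axis=1)))

def scan_frames(volume, codebook, patch_size=15, n_frames=3, step=5, scales=(1.0, 1.5, 2.0)):
    """Raster-scan a space-time volume (T, H, W) at several patch scales.

    Returns (cluster index I, top-left location (y, x), scale) triples.
    st_patch_feature() refers to the sketch given with Eqs. (4)-(5).
    """
    detections = []
    T, H, W = volume.shape
    for s in scales:
        size = int(round(patch_size * s))
        for y in range(0, H - size + 1, step):
            for x in range(0, W - size + 1, step):
                patch = volume[:n_frames, y:y + size, x:x + size]
                I = quantize(st_patch_feature(patch), codebook)
                detections.append((I, (y, x), s))
    return detections
```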
3.2 Estimating Position of Centroid of Object
We estimated the position of the centroid of the object by voting based on the classification of the direction of movement obtained from the vector quantization of the input ST-patch features and from the learning samples.

Voting on Centroid Position. To estimate the position of the centroid of the object, we vote for centroid positions [6,7]. Let e be our evidence, an extracted ST-patch observed at location l. By matching it to our codebook, we obtain a valid interpretation I. The interpretation is weighted with probability p(I | e, l). Here, we use the relative matching score of a codebook cluster I and the ST-patch feature e for p(o_d, x | I, l). If a codebook cluster matches, it can cast votes for different object positions. That is, for learning samples belonging to a codebook cluster I, we obtain votes for several directions of movement o_d and positions x, which we weight with p(o_d, x | I, l). Formally, this can be expressed by the following marginalization:

$$p(o_d, x \mid e, l) = \sum_{I} p(o_d, x \mid e, I, l)\, p(I \mid e, l). \qquad (7)$$
Since we have replaced the unknown ST-patch by a known interpretation, the first term can be treated as independent of the ST-patch e. In addition, we match patches to the codebook independently of their location l. The equation thus reduces to:

$$p(o_d, x \mid e, l) = \sum_{I} p(o_d, x \mid I, l)\, p(I \mid e) \qquad (8)$$
$$= \sum_{I} p(x \mid o_d, I, l)\, p(o_d \mid I, l)\, p(I \mid e). \qquad (9)$$

Fig. 3. Estimating position of centroid of object
The first term is the probabilistic vote for an object position given its identity and the patch interpretation. The second term specifies the confidence that the codebook cluster really matches the direction of movement. The third term reflects the quality of the match between the ST-patch and the codebook cluster. Thus, the total number of votes for object o_d at location x in window W(x) is:

$$\mathrm{score}(o_d, x) = \sum_{k} \sum_{x_j \in W(x)} p(o_d, x_j \mid e_k, l_k). \qquad (10)$$
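The vote casting of Eqs. (7)–(10) can be implemented as a Hough-style accumulator per direction of movement; the sketch below assumes each codebook entry stores p(o_d | I) and the centroid offsets of its learning samples, a data layout that is our assumption rather than the paper's.

```python
import numpy as np
from collections import defaultdict

def cast_votes(patches, codebook, entries, accumulator_shape):
    """Accumulate centroid votes per direction of movement o_d.

    patches: iterable of (feature e, location (y, x), scale).
    entries[I]: dict with "p_dir" = {o_d: p(o_d | I)} and "offsets" = [(dy, dx), ...]
    collected from the learning samples of cluster I.
    """
    acc = defaultdict(lambda: np.zeros(accumulator_shape))       # one vote map per o_d
    for e, (y, x), scale in patches:
        I = int(np.argmin(np.sum((codebook - e) ** 2, axis=1)))  # hard assignment, Eq. (6)
        entry = entries[I]
        w_pos = 1.0 / max(len(entry["offsets"]), 1)              # spread p(x | o_d, I, l) evenly
        for o_d, p_dir in entry["p_dir"].items():
            for dy, dx in entry["offsets"]:
                cy, cx = int(y + scale * dy), int(x + scale * dx)
                if 0 <= cy < accumulator_shape[0] and 0 <= cx < accumulator_shape[1]:
                    acc[o_d][cy, cx] += p_dir * w_pos
    return acc
```

The resulting per-direction vote maps would then be searched for local maxima in (x, y, scale), as described next.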
Mean-Shift Clustering. We search for the positions of the points with the most votes (i.e., the local maxima) by using 3-dimensional (x-y-scale space) Mean-Shift clustering (Fig. 3(c)) [12]. Fig. 3 illustrates this procedure. Local maxima that converge under Mean-Shift clustering are integrated into one cluster by a nearest-neighbor clustering algorithm. When the total weight integrated around a local maximum is below a certain threshold, we reject it as an outlier (Fig. 3(d)). We can therefore remove the outliers among the voted points, and we can then estimate the position of the centroid of the object.
3.3 Segmenting Regions of Objects
We construct the regions of objects based on the number of voting points around the positions of the centroids. Fig. 4 shows the flow of segmenting the regions of objects.

Backprojection of the ST-Patch Features. We perform a backprojection of the ST-patch features, i.e., of the voted points around the position of the centroid of the object, and remove the outliers among the voted points.
Fig. 4. Segmenting regions of object
We can then select information from the reliable ST-patch features. The effect of a backprojected ST-patch e can be expressed as:

$$p(e, l \mid o_d, x) = \frac{p(o_d, x \mid e, l)\, p(e, l)}{p(o_d, x)} = \frac{p(o_d, x \mid I, l)\, p(I \mid e)\, p(e, l)}{p(o_d, x)}, \qquad (11)$$
where the patch votes p(o_d, x | e, l) are obtained from the codebook, as described in Eq. (8).

Estimating Region of Object. To segment the object, we now want to know whether a certain image pixel p is part of the object or of the background, given the backprojected ST-patch e. More precisely, we are interested in the probability p(p = obj. | o_d, x). Given the effect p(e, l | o_d, x), we can obtain information about a specific pixel as follows:

$$p(p = \mathrm{obj.} \mid o_d, x) = \sum_{\mathrm{num}} p(p = \mathrm{obj.} \mid o_d, x, e, l)\, p(e, l \mid o_d, x), \qquad (12)$$
where num is the number of backprojected ST-patches, and p(p = obj. | o_d, x, e, l) denotes patch-specific segmentation information, which is weighted by the effect p(e, l | o_d, x). Again, we can resolve patches by resorting to the learned patch interpretations I stored in the codebook:

$$p(p = \mathrm{obj.} \mid o_d, x) = \sum_{\mathrm{num}} p(p = \mathrm{obj.} \mid o_d, x, e, I, l)\, p(e, I, l \mid o_d, x)$$
$$= \sum_{\mathrm{num}} p(p = \mathrm{obj.} \mid o_d, x, I, l)\, \frac{p(o_d, x \mid I, l)\, p(I \mid e)\, p(e, l)}{p(o_d, x)}. \qquad (13)$$
Then, the segmentation information p(p = obj. | o_d, x, I, l) can be acquired from the mask images of the object stored with the learning samples. This means that for every pixel we calculate a weighted average over all segmentations stemming from the ST-patches. Therefore, we can calculate an object probability for each pixel. Here, an object probability below a certain threshold indicates a pixel in
the background, and an object probability above that threshold indicates a pixel in the object. We can therefore segment the objects' regions into rectangles by using the object probability of each pixel.
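Equation (13) amounts to pasting the learning-sample mask images at the locations of the backprojected patches and averaging them with their weights p(e, l | o_d, x); the sketch below shows that weighted average, with the normalization by the accumulated weight and the boundary clipping being our simplifications.

```python
import numpy as np

def object_probability_map(image_shape, locations, masks, weights):
    """Per-pixel object probability as a weighted average of mask patches.

    locations: patch top-left corners (y, x); masks: binary mask patches of the
    learning samples; weights: the corresponding p(e, l | o_d, x) values.
    """
    prob = np.zeros(image_shape, dtype=np.float64)
    norm = np.zeros(image_shape, dtype=np.float64)
    for (y, x), mask, w in zip(locations, masks, weights):
        h = min(mask.shape[0], image_shape[0] - y)
        ww = min(mask.shape[1], image_shape[1] - x)
        if h <= 0 or ww <= 0:
            continue
        prob[y:y + h, x:x + ww] += w * mask[:h, :ww]
        norm[y:y + h, x:x + ww] += w
    return np.where(norm > 0, prob / np.maximum(norm, 1e-12), 0.0)

# Thresholding the returned map labels object vs. background pixels; the bounding
# rectangle of the object pixels gives the segmented region described in the text.
```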
4 Experiment
This section describes the experimental results of the proposed method and of the conventional method [6], which uses appearance information only.
4.1 Experimental Overview
We extracted 10,198 ST-patch features from sequences of pedestrians walking toward the right, 10,220 ST-patch features from sequences of pedestrians walking toward the left, and 36,982 ST-patch features from the background. We also extracted 9,885 ST-patch features from sequences of vehicles moving toward the right, 9,968 ST-patch features from sequences of vehicles moving toward the left, and 20,047 ST-patch features from the background. Using the pedestrian and vehicle codebooks generated from the extracted ST-patch features, we classified the direction of movement and segmented the regions of the objects. In this experiment, the size of the ST-patch is 15 × 15 pixels × 3 frames, and the codebook size is 512 clusters. The test sequences were taken with a fixed camera at a location different from that where the learning samples were collected. The sequences include objects moving rightward and leftward, such as pedestrians and vehicles. The total number of frames in the test sequences is 23,097.
4.2 Experimental Results
Fig. 5 shows the detection and segmentation results of the conventional method and of our method. As shown in Fig. 5(a)-(d), the proposed method can classify the direction of movement and segment the regions of a pedestrian and a moving vehicle. In particular, separate objects can be segmented correctly even when multiple objects moving in different directions overlap, because our method segments objects' regions based on the classification of the direction of movement. As shown in Fig. 5(b), our method responds to the scale of an object. As shown in Fig. 5(d), a pedestrian whose body is partially occluded can be segmented with the object's region taken into account, because the region is estimated from the mask images of the learning samples. Moreover, as shown in Fig. 5(a), the proposed method detects multiple objects individually, without being affected by shadows. Table 1 shows the object detection results of our method and of the conventional method. Only frames in which an object exists in the image are set as detection targets. As shown in Table 1, the detection rate of our method is higher than that of the conventional method. Thus, because our method is based on classifying the direction of movement, the object detection rate is also better than that of the conventional method.
Fig. 5. Classifying direction of movement and segmenting the objects' regions

Table 1. Detection result

                        conventional method [6]   proposed method
  pedestrian sequence   64.3%                     74.7%
  vehicle sequence      70.7%                     93.3%
  average               67.3%                     84.0%

Fig. 6. Example of failure
From Fig. 6(a), it is difficult to estimate the position of the centroid when multiple objects move in the same direction, such as a group of pedestrians. This is why the segmentation fails. To solve this problem, we will add
more information about appearance to the 9-dimensional vector e in future work. Moreover, for moving objects (for example, a bus or a truck) that do not exist in the learning samples, as shown in Fig. 6(b), detection may also fail because such objects cannot be classified.
5 Conclusion
We developed a method for classifying the direction of movement and for segmenting objects simultaneously by using ST-patch features. Our method can segment objects even when occlusion occurs. Moreover, our method detects objects individually when multiple objects moving in different directions overlap, because the direction of movement is classified. Our future work will address overlapping objects moving in the same direction, and we will develop a method for identifying objects by adding more information about the object's appearance to the ST-patch features.
References

1. Fujiyoshi, H., Komura, T., Yairi, I.E., Kayama, K.: Road Observation and Information Providing System for Supporting Mobility of Pedestrian. In: IEEE International Conference on Computer Vision Systems, pp. 37–44. IEEE Computer Society Press, Los Alamitos (2006)
2. Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artificial Intelligence 17, 185–203 (1981)
3. Shechtman, E., Irani, M.: Space-Time Behavior Based Correlation. Computer Vision and Pattern Recognition 1, 405–412 (2005)
4. Niebles, J.C., Wang, H., Fei-Fei, L.: Unsupervised learning of human action categories using spatial-temporal words. In: British Machine Vision Conference, vol. 3, pp. 1249–1258 (2006)
5. Agarwal, S., Roth, D.: Learning a Sparse Representation for Object Detection. In: European Conference on Computer Vision, pp. 113–130 (2002)
6. Leibe, B., Leonardis, A., Schiele, B.: Interleaved Object Categorization and Segmentation. In: British Machine Vision Conference, Norwich, pp. 759–768 (2003)
7. Leibe, B., Leonardis, A., Schiele, B.: Combined Object Categorization and Segmentation with an Implicit Shape Model. In: European Conference on Computer Vision, Prague, pp. 496–510 (2004)
8. Opelt, A., Pinz, A., Zisserman, A.: Incremental learning of object detectors using a visual shape alphabet. Computer Vision and Pattern Recognition 1, 3–10 (2006)
9. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. IEEE Computer Vision and Pattern Recognition, 886–893 (2005)
10. Viola, P., Jones, M.: Rapid Object Detection using a Boosted Cascade of Simple Features. Computer Vision and Pattern Recognition 1, 511–519 (2001)
11. Linde, Y., Buzo, A., Gray, R.M.: An Algorithm for Vector Quantizer Design. IEEE Trans. on Communications 28(1), 84–95 (1980)
12. Comaniciu, D., Meer, P.: Mean Shift Analysis and Applications. International Conference on Computer Vision 2, 1197–1203 (1999)
Embedding a Region Merging Prior in Level Set Vector-Valued Image Segmentation

Ismail Ben Ayed^1 and Amar Mitiche^2

1 GE Healthcare, 268 Grosvenor, E5-137, London, ON, N6A 4V2, Canada
2 Institut national de la recherche scientifique, INRS-EMT, 800, de La Gauchetière Ouest, Montréal, QC, H5A 1K6, Canada
Abstract. In the scope of level set image segmentation, the number of regions is fixed beforehand. This number occurs as a constant in the objective functional and its optimization. In this study, we propose a region merging prior which optimizes the objective functional implicitly with respect to the number of regions. A statistical interpretation of the functional and learning over a set of relevant images and segmentation examples allow setting the weight of this prior to obtain the correct number of regions. This method is investigated and validated with color images and motion maps.
1 Introduction
Image segmentation by active contours/level sets leads to effective results, as several studies have shown [1] [3] [2] [5] [4] [6] [7] [8]. Current methods assume that the number of regions is given beforehand. It occurs as a constant in the objective functional and its optimization [1] [3] [2] [5] [4] [6] [7] [8]. A few investigations have proposed to estimate the number of regions automatically, but as a process external to the functional optimization [9] [10] [11]. In [9], a preliminary stage based on hierarchical level set splitting is used prior to a classical functional minimization with a fixed number of regions. In [10] [11], local region merging is alternated with curve evolution. Apart from their computational cost, these methods are subject to the well-known limitations of local region splitting/merging operations: (1) dependence on several ad hoc parameters and on the order of local operations [10] [14], (2) need for additional variables for local neighborhood search [10] [11], and (3) sensitivity to the initial conditions [10]. The purpose of this study is to vary the effective number of regions within level set optimization. A region merging prior is proposed for this purpose. This prior favors region merging. Used in conjunction with a data term which measures the conformity of the vector-valued image data in each region to the piecewise constant segmentation model [3] and a length-related term for smooth region boundaries, this prior allows the objective functional to be optimized implicitly with respect to the number of regions. A maximum number of regions is used in the definition of the segmentation functional. The effective number of regions, equal to the maximum number of regions initially, decreases implicitly during curve evolution
to be, ideally, the desired number of regions. The functional minimization is carried out using the partition-constrained minimization scheme developed in [7] [8]. A coefficient must be assigned to the region merging prior in order to balance its contribution with respect to the other functional terms. This coefficient will, of course, affect the number of regions obtained at convergence. We will show that we can determine systematically an interval of values of this coefficient that yields the desired number of regions. This is possible via a statistical interpretation of the coefficient over a set of relevant images and segmentation examples. The method is investigated and validated with color images and motion maps.
2 Segmentation into a Fixed Number of Regions
Consider a vector-valued image I : Ω ⊂ R^2 → R^L represented by L images I^l : Ω ⊂ R^2 → R^+ (l ∈ [1, .., L]) and a partition R = {R_k}_{k∈[1,N]} of Ω defined by a family of simple closed plane curves γ_k(s) : [0, 1] → Ω, k = 1, . . . , N − 1. For each k, region R_k corresponds to the interior R_{γ_k} of curve γ_k: R_k = R_{γ_k}. Let R_N = ∩_{k=1}^{N−1} R_k^c. Level set segmentation of an image I is commonly stated as determining a partition R which minimizes a functional containing two terms: a data term which measures the conformity of the data within each region to a parametric model and a regularization term for smooth segmentation boundaries [1] [2] [3] [4] [5] [6] [7] [8] [11]. Following the piecewise constant model [3], and with a regularization term, multiregion active curve segmentation consists of determining the curves γ_k, k = 1, . . . , N − 1, that minimize the following functional:

$$F(\{\gamma_k\}_{k=1}^{N-1}) = \sum_{k=1}^{N} \int_{R_k} \sum_{l=1}^{L} \| I^l - \mu_k^l \|^2 + \lambda \sum_{k=1}^{N-1} \oint_{\gamma_k} ds, \qquad (1)$$

where μ_k^l is the mean intensity of image I^l in segmentation region k (l ∈ [1, .., L], k ∈ [1, .., N]), and λ is a positive real constant weighing the relative contribution of the two terms of the functional. Let us first consider a simple example that illustrates the usefulness of adding a region merging term to this type of functional. Consider two non-intersecting regions R_1 and R_2 of a partition R. We have [12]:

$$\int_{R_1} \sum_{l=1}^{L} \|I^l - \mu_1^l\|^2 + \int_{R_2} \sum_{l=1}^{L} \|I^l - \mu_2^l\|^2 \le \int_{R_1 \cup R_2} \sum_{l=1}^{L} \|I^l - \mu_{1,2}^l\|^2, \qquad \lambda(\partial R_1 + \partial R_2) = \lambda\, \partial(R_1 \cup R_2), \qquad (2)$$

where μ_{1,2}^l is the mean of R_1 ∪ R_2 for image I^l, and ∂R is the boundary of R. Consequently, the minimization of (1) does not favor merging R_1 and R_2, even when μ_1^l = μ_2^l, ∀l ∈ [1, .., L]. As we will show in the experiments, model (1) may result in an over-segmentation when N is greater than the actual number of regions. An additional term in (1) which can merge regions, such as when μ_1^l = μ_2^l, ∀l ∈ [1, .., L], would be useful. In this study, we propose and investigate such a prior.
3 A Region Merging Prior
A region merging prior P_RM is a function from the set of partitions of Ω to R. This function must satisfy the following condition: for each partition R = {R_k}_{k=1}^N of Ω, and for each subset J of [1..N],

$$P_{RM}\big(\{\cup_{j\in J} R_j,\ \{R_k\}_{k\in[1..N],\, k\notin J}\}\big) < P_{RM}\big(\{R_k\}_{k=1}^{N}\big). \qquad (3)$$

This condition means that any region merging must decrease the prior term. We propose the following prior:

$$P_{RM}\big(\{R_k\}_{k=1}^{N}\big) = -\beta \sum_{k=1}^{N} a_k \log a_k, \qquad (4)$$

where a_k is the area of region R_k, and β is a positive real constant weighing the relative contribution of the region merging term in the segmentation functional. As we will see in Section 4.2, the logarithmic form of this prior has an interesting property which leads to a statistical interpretation that allows us to fix the weight of the region merging term systematically. This prior satisfies condition (3).
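A small sketch evaluating the prior of Eq. (4) on a toy partition, and checking the merging condition (3) numerically (the areas used are arbitrary examples):

```python
import numpy as np

def region_merging_prior(areas, beta=1.0):
    """P_RM({R_k}) = -beta * sum_k a_k log a_k  (Eq. (4)); areas are in pixels."""
    a = np.asarray(areas, dtype=np.float64)
    return -beta * float(np.sum(a * np.log(a)))

# Merging two regions always lowers the prior (condition (3)):
areas = [1000.0, 2500.0, 6500.0]
merged = [1000.0 + 2500.0, 6500.0]
assert region_merging_prior(merged) < region_merging_prior(areas)
```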
4 Segmentation Functional
Let N be the maximum number of regions, i.e., a number such that the actual number of regions is less than or equal to N. Such a number is available in most applications. With the region merging prior (4), the functional for segmentation into a number of regions less than N is:

$$F_{RM}(\{\gamma_k\}_{k=1}^{N-1}) = \underbrace{\sum_{k=1}^{N}\int_{R_k}\sum_{l=1}^{L}\|I^l-\mu_k^l\|^2}_{\text{data term}} \ \underbrace{-\,\beta\sum_{k=1}^{N} a_k\log a_k}_{\text{region merging prior}} \ + \ \underbrace{\lambda\sum_{k=1}^{N-1}\oint_{\gamma_k} ds}_{\text{regularization}}. \qquad (5)$$

In Section 4.1, we will show how the effective number of active curves can decrease as a result of the region merging prior. We minimize F_RM with respect to the curves γ_k, k = 1, .., N − 1, by embedding these into a family of one-parameter curves γ_k(s, t) : [0, 1] × R^+ → Ω and solving the partial differential equations:

$$\frac{d\gamma_k}{dt} = -\frac{\partial F_{RM}}{\partial \gamma_k}, \qquad k = 1, .., N-1. \qquad (6)$$
We use the multiregion minimization scheme developed in [7] [8]. This scheme has several advantages over others: it is fast, stepwise optimal, and robust to initialization [7] [8]. It embeds an efficient partition constraint directly in the curve/level set evolution equations. At each iteration, the scheme involves only two regions for each pixel x: a region R_i which currently contains x, and a region R_j, j ≠ i, which corresponds to the largest decrease in the functional were x transferred to this region (refer to [7] [8] for details). For a level set implementation of curve evolution, we represent each curve γ_k implicitly by the zero level set of a function u_k : R^2 → R, with the region inside γ_k corresponding to u_k > 0. The level set equations minimizing F_RM are given, at each x ∈ Ω, by:

$$\frac{\partial u_i}{\partial t} = \Big(-\lambda\kappa_{u_i} + \underbrace{\beta\,(\log a_i - \log a_j)}_{\text{region merging}} - \underbrace{\sum_{l=1}^{L}\big(\|I^l(x)-\mu_i^l\|^2 - \|I^l(x)-\mu_j^l\|^2\big)}_{\text{region competition}}\Big)\,\|\nabla u_i\|$$
$$\frac{\partial u_j}{\partial t} = \Big(-\lambda\kappa_{u_j} + \beta\,(\log a_j - \log a_i) - \sum_{l=1}^{L}\big(\|I^l(x)-\mu_j^l\|^2 - \|I^l(x)-\mu_i^l\|^2\big)\Big)\,\|\nabla u_j\|, \qquad (7)$$

where κ_{u_k} is the curvature of the zero level set of u_k, i ∈ [1..N] is the index of the region currently containing x, and j is given by:

$$j = \arg\min_{\{k\in[1..N],\; x\notin R_k\}} \Big(-\beta\, a_k \log a_k + \sum_{l=1}^{L}\|I^l(x)-\mu_k^l\|^2\Big). \qquad (8)$$
The level set equations (7) show how region merging occurs: When two disjoint regions Rγ i and Rγ j have close intensities in each image I l (l ∈ [1, .., L]), the L velocity resulting from the data term ( l=1 I l − μlj 2 − I l − μli 2 ) is weak. Ignoring the curvature term, evolution of curves γ i and γ j is guided principally by the region merging prior velocity. As ui increases and uj decreases under the effect of (logai − logaj ), this velocity expands the region with the larger area, and shrinks the other region until only one curve encloses both regions and the other curve disappears. 4.2
How to Fix the Weighting Parameter β
On the one hand, the data term increases N when regions are merged. On the other hand, the region merging term, − k=1 ak logak , decreases when regions are merged. The role of β is to balance the contribution of the region merging term against the other terms so as to, ideally, correspond to the actual number of regions. The weighting parameter β can be viewed as a unit conversion factor between the units of the region merging and the data terms. Therefore, and considering the form of these terms, we can take: L I l − μl 2 (9) β = α Ω l=1 A.logA
Embedding a Region Merging Prior
929
where μl is the mean intensity over the whole image I l (l ∈ [1, .., L]), A is the image domain area, and α is a constant without unit. Using expression (9), we rewrite the sum of the data term and the region merging prior as follows: N N L L l l 2 k=1 ak logak I − μk − α I l − μl 2 (10) A.logA R Ω k k=1 l=1 l=1
close to 1
Now, by applying inequality log(z) ≤ z − 1, ∀z ∈ [0, +∞[ to aAk , ∀k ∈ [1..N ] and using condition (3), one can prove the following important inequalities: N ak logak logN 1− ≤ k=1 ≤1 (11) logA A.logA N
a loga
k k In practice, N is generally much smaller than A and k=1 is close to A.logA 1. For example, for a maximum number of regions equal to 10 and a 256x256 N i=1 ai logai ≤ 1. We now consider the folimage, we have approximately 0.8 ≤ A.logA lowing classical relation in statistical pattern recognition between within-cluster distance, total distance, and in-between cluster distance [12]:
N k=1
L
I l − μlk 2 −
Rk l=1
within−cluster distance
L
Ω l=1
I l − μl 2 =
total distance
−
L N k=1 l=1
μlk − μl 2
(12)
in−between cluster distance
Consequently, with a value of α close to 1 in (10), the sum of the data and the region merging terms will be close to the segmentation in-between cluster distance. However, minimizing the in-between cluster distance is equivalent to minimizing the within-cluster distance because the total distance is independent from the segmentation [12]. This interpretation, which suggests a value of α close to 1, will be confirmed in the next section (section 5) with several simulations. We will show that we can take α in an interval containing, or close to, 1, and which we can use for all the images of the same class. Note that β depends on the image, and once α is fixed, β is given directly by (9).
5 Experiments
We conducted a large number of tests with color images and motion maps. We show representative segmentation examples. To support the possibility of determining via learning an interval of α values applicable to the images of a given class (color images and motion maps, for example), we ran tests showing that a common interval of α values giving the desired number of regions can be found for all the images of the same class.

Color Images. The RGB space is used to represent the color information in each image. We show here results for a set of 6 images, each containing several objects (Figure 2, (1)-(6)).
Fig. 1. Results without the region merging prior on an image composed of 2 regions (object and background): (a) initialization (N = 5, i.e., 4 curves), (b) final curves, (c) final segmentation into 5 regions.

Table 1. Color images: intervals of α values corresponding to the correct number of regions

Images   1      2      3      4      5      6
αmin     1.49   1.00   1.35   0.36   1.03   0.096
αmax     5.6    3.97   4.37   5.5    2.67   3.81
The objects in these images were taken from the ALOI database [15]. We have images with two, three and four regions. With these images, the actual number of regions is known, which allows us to evaluate experimentally the interval of α values giving this number. Segmenting these images without fixing the number of regions is difficult due to the illumination variations inside each object. Figure 1 gives the segmentation result, without the region merging prior, of the color image (1) shown in Figure 2. This image consists of two regions: an object and a background. Segmentation of this image into 5 regions gives the final curves displayed in Figure 1 (b) and the segmentation shown in Figure 1 (c). The corresponding initialization with 4 curves is depicted in Figure 1 (a). Without the region merging prior, the object is fragmented into 4 different regions due to illumination variations. The first line of Figure 2 shows the segmentation results of the same image as in Figure 1, this time with the region merging prior. Only one curve (red) remained at convergence. This final curve correctly separates the object from the background. With the same initialization (5 regions, 4 curves) as in Figure 1 (a), and using the same α (α = 2), the other images ((2) to (6)) were also segmented correctly. The columns of Figure 2 show, respectively, the image, the final curves remaining at convergence, and the corresponding final segmentation. The final segmentation of each image corresponds to the desired number of regions as well as to the objects. We evaluated the interval of α values, [αmin, αmax], which leads to the desired number of regions for each image. The obtained intervals are reported in Table 1.
Fig. 2. Segmentation results with the region merging prior for 6 color images from the same database (α = 2): images (1)-(6) (first column); final curves which remained at convergence (second column); final segmentations (third column).
All α values in [1.49, 2.67] lead to a correct segmentation of the six images. These results conform to expectations and to the statistical interpretation of the coefficient α given in Section 4.2, and they support the possibility of determining an interval of α values via learning.
Fig. 3. Segmentation results with the region merging prior (Marmor sequence, α = 2): (a) 5 initial curves (6 regions) and the motion field (2 moving objects), (b) 2 final curves corresponding to the moving objects, (c) obtained segmentation into 3 regions (2 moving objects and a background), (d)-(e) segmentation regions corresponding to moving objects, (f) segmentation region corresponding to the background.
Fig. 4. Segmentation results without the region merging prior (Marmor sequence, α = 0): (a) segmentation with N = 4, (b) segmentation with N = 6.
Fig. 5. Segmentation results with the region merging prior (Road sequence, α = 2): (a) 5 initial curves (6 regions) and the motion field (1 moving object), (b) 1 final curve, (c) segmentation region corresponding to the moving object (inside the curve), (d) segmentation region corresponding to the background (outside the curve).
Motion Segmentation. In this experiment, we segment optical flow images into motion regions. The optical flow at each pixel is a two-dimensional vector. The method in [16] was used to estimate the optical flow. We show two examples. The first example uses the Marmor sequence, which contains 3 regions: 2 moving objects and a background. The initial curves (5 curves for at most 6 regions) and motion vectors are shown in Figure 3 (a). With the region merging prior, Figure 3 (b) depicts the curves which remained, giving a correct segmentation into 3 regions (Figure 3 (c)). In this example, α = 2. Figures 3 (d) and (e) show the regions corresponding to the 2 moving objects, and (f) shows the background. To illustrate the effect of the region merging prior, Figure 4 (b) shows the segmentation obtained using the same initialization (N = 6) but without the region merging prior (α = 0). The background, in this case, is divided into 3 different regions, and the moving object on the right is divided into 2 regions. We also give in Figure 4 (a) the segmentation obtained with 4 initial regions (N = 4) and α = 0. The results obtained without the region merging prior do not correspond to a meaningful segmentation. The second example uses the Road image sequence (Figure 5 (a)), which contains two regions: a moving vehicle and a background. The same initialization as with the Marmor sequence was used. Figure 5 shows the results obtained using the region merging prior. In (b), one curve remains, which separates the moving object from the background; Figures 5 (c)-(d) display the segmentation regions. We evaluated the interval of α values, [αmin, αmax], which gave the desired number of regions for each sequence. The obtained intervals are reported in Table 2 and conform to the interpretation of the weight of the region merging prior, which suggested a value of α close to 1. All α values which correctly segment the Marmor sequence also give the desired number of regions for the Road sequence.

Table 2. Motion segmentation: intervals of α values corresponding to the desired number of regions

Images   Marmor   Road
αmin     1.217    0.017
αmax     2.69     6.5

6 Conclusion
This study investigated a curve evolution method which allowed the effective number of regions to vary during optimization. This was done via a region merging prior which embeds an implicit region merging in curve evolution. We gave a statistical interpretation of the weight of this prior. We confirmed this interpretation by several experiments with both color images and motion maps. Experiments demonstrated that we can determine by learning an interval of values of this weight applicable to the images of a given class.
References

1. Cremers, D., Rousson, M., Deriche, R.: A Review of Statistical Approaches to Level Set Segmentation: Integrating Color, Texture, Motion and Shape. Int. J. of Computer Vision 62, 249–265 (2007)
2. Rousson, M., Deriche, R.: A variational framework for active and adaptive segmentation of vector valued images. In: Proc. IEEE Workshop on Motion and Video Computing, pp. 56–61. IEEE Computer Society Press, Los Alamitos (2002)
3. Chan, T.F., Sandberg, B.Y., Vese, L.A.: Active Contours without Edges for Vector-Valued Images. J. Visual Communication and Image Representation 11, 130–141 (2000)
4. Vese, L.A., Chan, T.F.: A Multiphase Level Set Framework for Image Segmentation Using the Mumford and Shah Model. Int. J. of Computer Vision 50, 271–293 (2002)
5. Samson, C., Blanc-Féraud, L., Aubert, G., Zerubia, J.: A Level Set Model for Image Classification. Int. J. of Computer Vision 40, 187–197 (2000)
6. Ayed, I.B., Hennane, N., Mitiche, A.: Unsupervised Variational Image Segmentation/Classification using a Weibull Observation Model. IEEE Trans. on Image Processing 15, 3431–3439 (2006)
7. Ayed, I.B., Mitiche, A., Belhadj, Z.: Polarimetric Image Segmentation via Maximum Likelihood Approximation and Efficient Multiphase Level Sets. IEEE Trans. on Pattern Anal. and Machine Intell. 28, 1493–1500 (2006)
8. Ayed, I.B., Mitiche, A.: A Partition Constrained Minimization Scheme for Efficient Multiphase Level Set Image Segmentation. In: Proc. IEEE Int. Conf. on Image Processing, pp. 1641–1644. IEEE Computer Society Press, Los Alamitos (2006)
9. Brox, T., Weickert, J.: Level Set Segmentation With Multiple Regions. IEEE Trans. on Image Processing 15, 3213–3218 (2006)
10. Kadir, T., Brady, M.: Unsupervised non-parametric region segmentation using level sets. In: Proc. Int. Conf. on Computer Vision, pp. 1267–1274 (2003)
11. Zhu, S.C., Yuille, A.: Region Competition: Unifying Snakes, Region Growing, and Bayes/MDL for Multiband Image Segmentation. IEEE Trans. on Pattern Anal. and Machine Intell. 18, 884–900 (1996)
12. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. Wiley Interscience, Chichester (2000)
13. Sethian, J.: Level Set Methods and Fast Marching Methods. Cambridge University Press, Cambridge (1999)
14. Nock, R., Nielsen, F.: Statistical Region Merging. IEEE Trans. on Pattern Anal. and Machine Intell. 26, 1452–1458 (2004)
15. Geusebroek, J.M., Burghouts, G.J., Smeulders, A.W.M.: The Amsterdam Library of Object Images. Int. J. of Computer Vision 61, 103–122 (2005)
16. Vazquez, C., Mitiche, A., Laganiere, R.: Joint Multiregion Segmentation and Parametric Estimation of Image Motion by Basis Function Representation and Level Set Evolution. IEEE Trans. on Pattern Anal. and Machine Intell. 28 (2006)
A Basin Morphology Approach to Colour Image Segmentation by Region Merging

Erchan Aptoula and Sébastien Lefèvre

UMR-7005 CNRS-Louis Pasteur University LSIIT, Pôle API, Bvd Brant, PO Box 10413, 67412 Illkirch Cedex, France
{aptoula,lefevre}@lsiit.u-strasbg.fr
Abstract. The problem of colour image segmentation is investigated in the context of mathematical morphology. Morphological operators are extended to colour images by means of a lexicographical ordering in a polar colour space, which are then employed in the preprocessing stage. The actual segmentation is based on the use of the watershed transformation, followed by region merging, with the procedure being formalized as a basin morphology, where regions are “eroded” in order to form greater catchment basins. The result is a fully automated processing chain, with multiple levels of parametrisation and flexibility, the application of which is illustrated by means of the Berkeley segmentation dataset.
1 Introduction
Automatic, robust and efficient colour image segmentation is nowadays more indispensable than ever, since numerous image repositories have been formed and continue to grow at an increasing speed. As far as the human vision system is concerned, edge information is primarily contained within the luminance component. Hence colour is regarded as an invaluable, yet auxiliary, component when it comes to image segmentation and, more generally, object recognition. The problem of its efficient exploitation in this context remains to be resolved, not only because the principles of human colour vision are not yet fully understood, but also because colour introduces additional parameters into the already elusive problem of general purpose image segmentation. Specifically, one of the major questions is the representation of colour vectors and the choice of the associated colour space. Since the desired segmentation outcome is almost always based on the human interpretation of objects, it is deemed natural to attempt to emulate the sensitivities of human colour vision. That is one reason why polar colour spaces have been gaining popularity in this regard. However, as will be elaborated in Section 2, these spaces also suffer from considerable drawbacks. Among the approaches developed to resolve the problem of colour segmentation, mathematical morphology offers a different perspective from the mostly statistical and clustering based methods, since it is an algebraic image processing framework capable of exploiting not only the spectral, but the spatial relationships
of pixels as well. In this paper, we present a fully automated colour segmentation procedure, designed for polar colour spaces and based on morphological operators. In particular, the proposed approach consists primarily of manipulating the catchment basins resulting from a watershed transformation, by interpreting them as the new processing units for morphological operators, hence leading to a "morphology of basins". A hierarchy of attributes is thus organised, making it possible to merge these regions based on arbitrary characteristics such as mean colour and texture. The resulting method is tested using the Berkeley segmentation dataset [1]. The rest of the paper is organised as follows. Section 2 first discusses the crucial choice of polar colour space. Section 3 then elaborates the proposed segmentation approach and details its individual stages. Finally, Section 4 is devoted to concluding remarks.
Fig. 1. Vertical semi-slice (luminance vs. saturation) of the cylindrical HLS (left) and bi-conic IHLS (right) colour spaces.
2 Choice of Colour Space
For the reasons mentioned in the previous section, here we concentrate on 3d-polar colour spaces, that have appeared as the result of attempts to describe the RGB cube in a more intuitive manner, from the point of view of human interpretation of colour, in terms of luminance, saturation and hue. While luminance L ∈ [0, 1] accounts for the amount of light, saturation S ∈ [0, 1] represents the purity of a colour. The values of the periodical hue interval H ∈ [0, 2π[ on the other hand, denote the dominant wavelength, with 0 corresponding to red. Basically, polar colour spaces achieve this transformation by representing colours with respect to the achromatic axis of RGB. Nevertheless, several implementational variants are available for this single transformation, e. g. HSV, HSB, HLS, HSI, etc [2]. According to Hanbury and Serra [3], all of the aforementioned colour spaces were developed primarily for easy numerical colour specification, while they are ill-suited for image analysis. Specifically, although they were initially designed as conic or bi-conic shaped spaces, later on their cylindrical versions were employed in practice, in order to avoid the computationally expensive (for that period) checking for valid colour coordinates. The passage from conic to cylindrical shape however resulted in many inconsistencies within these spaces,
for instance by allowing fully saturated colours to be defined in zero luminance. Extensive details on this topic can be found in [3]. Here we adopt the suggestion made in [3], and make our colour space choice in favour of the improved HLS space (IHLS), which employs the original biconic version of HLS, hence limiting the maximal allowed value for saturation in relation to luminance (figure 1). Further advantages of IHLS with respect to its counterparts include the independence of saturation from luminance, thus permitting the use of any luminance expression (e. g. RGB average, perceptual luminance, etc) and the comparability of saturation values.
Fig. 2. Summary of the proposed processing chain: input image → preprocessing → basin extraction → merging → post-processing → label image.
3 Proposed Approach
In an ideal world, all images would have the same resolution, number of colours and overall complexity. Unfortunately, this is not the case. The problem of segmentation in its most general form is highly difficult to resolve, as it aims to detect the semantic regions of extremely heterogeneous input. Moreover, semantic-level segmentation naturally requires some a priori information of semantic nature, hence rendering it feasible only for domain specific applications, since no ontology incorporating all types of objects exists. Consequently, a more "practical and realistic" aim, also adopted here, is to attempt to detect the principal regions of images with respect to homogeneity, a task of prime importance for content based image retrieval. Considering the vast heterogeneity of image data, an equally high degree of adaptability is crucial, taking into account the different types of border information contained within an image, e. g. spectral, textural, etc. To this end, we propose the processing chain summarised in figure 2. Briefly, the input image is first simplified using border preserving morphological operators; then, through the combination of a colour gradient and the watershed transformation, the catchment basins are obtained. Next, an iteratively applied hierarchical fusion is carried out, providing a rough approximation of the sought borders, which are finally refined in the last stage by means of a marker based watershed transformation. Details on each step follow.

3.1 Preprocessing
This first step aims to simplify the input image and eliminate any "excessive" detail. The morphological toolbox offers a rich variety of operators for this purpose; however, several issues arise. The first concerns the extension of grayscale
morphological operators to colour images, a theoretical problem stemming from the need to impose a complete lattice structure on the pixel intensity range, which, in the case of multivalued images, is equivalent to the need to order vectorial data [4,5]. Several approaches have been developed to this end, a survey of which may be found in [6]. Here, it has been chosen to order the colour vectors of the IHLS space by means of a lexicographical ordering:

$$
(h_1, s_1, l_1) < (h_2, s_2, l_2) \;\Leftrightarrow\;
\begin{cases}
l_1 < l_2, & \text{or} \\
l_1 = l_2,\ s_1 < s_2, & \text{or} \\
l_1 = l_2,\ s_1 = s_2 \text{ and } h_1 < h_2
\end{cases} \qquad (1)
$$

where $l_1, l_2, s_1, s_2 \in [0, 1]$ and $h_1, h_2 \in [0, 2\pi[$. As the hue component is a circular value, an angular distance from a reference hue $h_0$ [7] is employed for their comparison:

$$
h \div h_0 =
\begin{cases}
|h - h_0| & \text{if } |h - h_0| < \pi \\
2\pi - |h - h_0| & \text{if } |h - h_0| \ge \pi
\end{cases} \qquad (2)
$$

which, for the sake of simplicity, is set as $h_0 = 0.0$. The hue values are then ordered according to their distances from $h_0$:

$$
\forall\, h, h' \in [0, 2\pi[,\quad h < h' \;\Leftrightarrow\; h \div h_0 > h' \div h_0 \qquad (3)
$$

where hues closer to $h_0$ are considered greater. Hence, with the luminance components compared first, this ordering leads to operators that act principally on this channel, which contains the majority of the total variational information.
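A small sketch of this comparison (plain Python, hypothetical function names) is given below; it follows the reconstruction of (1)-(3) above, in which hues closer to the reference are ranked greater.

```python
import math

def hue_dist(h, h0=0.0):
    """Acute angular distance on the hue circle (eq. 2), with h, h0 in [0, 2*pi)."""
    d = abs(h - h0)
    return d if d < math.pi else 2.0 * math.pi - d

def ihls_less(c1, c2, h0=0.0):
    """Lexicographical comparison of two IHLS triplets (h, s, l) following (1):
    luminance first, then saturation, then hue, hues closer to h0 being greater."""
    h1, s1, l1 = c1
    h2, s2, l2 = c2
    if l1 != l2:
        return l1 < l2
    if s1 != s2:
        return s1 < s2
    return hue_dist(h1, h0) > hue_dist(h2, h0)
```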
Fig. 3. From left to right, the original image (#101087), its preprocessed form, and the intensity transition of the white line in the original image, for a reconstructive and standard processing based leveling (plot: luminance vs. vertical dimension; curves: reconstructive, standard, original).
Equipped with this ordering, erosion (ε), dilation (δ) and all derived grayscale morphological operators may be extended to colour data. Nevertheless, a second issue in this regard is the need for border preserving operators. That is why it was chosen to employ a morphological leveling Λ(f, m) [8], which provides
a simplified version of the input image f, by applying iterative geodesic erosions and dilations to the marker m until idempotence, i.e., $\Lambda(f, m)^i = \sup\{\inf[f, \delta^i(m)],\ \varepsilon^i(m)\}$, iterated until $\Lambda(f, m)^{i+1} = \Lambda(f, m)^i$. The marker image is obtained by means of a reconstruction based opening followed by a reconstruction based closing. The result is a "leveled" image, in which the details smaller than the structuring element's (SE) size have been removed, while all region borders are perfectly preserved (figure 3). The size of the SE, typically a square of 7×7 pixels, is determined with respect to the dimensions of the input image.
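The following is a direct, per-channel transcription of that iteration into Python/SciPy, a sketch under our reading of the formula rather than the reference implementation of [8]; the colour version would use the lexicographical ordering instead of the scalar min/max.

```python
import numpy as np
from scipy import ndimage

def leveling(f, m, se=np.ones((3, 3)), max_iter=500):
    """Morphological leveling Lambda(f, m): iterate
    u_i = max( min(f, dilate^i(m)), erode^i(m) ) until idempotence."""
    f = np.asarray(f, dtype=float)
    d = np.asarray(m, dtype=float).copy()   # holds delta^i(m)
    e = d.copy()                            # holds epsilon^i(m)
    u_prev = None
    for _ in range(max_iter):
        d = ndimage.grey_dilation(d, footprint=se)
        e = ndimage.grey_erosion(e, footprint=se)
        u = np.maximum(np.minimum(f, d), e)
        if u_prev is not None and np.array_equal(u, u_prev):
            break
        u_prev = u
    return u
```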
3.2 Basin Extraction
Having simplified the input, this step consists in computing a first segmentation map of the image using the watershed transformation. As this powerful operator can be applied only to a scalar input representing the topographic relief of the image, it has been chosen to combine the colour channels by means of a channel-wise maximum of marginal gradients:

$$
\rho_{HLS}(h, s, l) = \max\{\rho(l),\ \rho(s),\ \rho^H(h)\} \qquad (4)
$$

where $\rho = f - \varepsilon(f)$ is the standard internal morphological gradient. Although the components of the polar colour spaces are highly intuitive, their combination is relatively problematic. In particular, hue is of no importance if saturation is "low", while the bi-conic shape of the colour space assures that no high saturation levels exist if luminance is not "high enough". Hence the hue gradient needs to be weighted with a coefficient that has a strong output only when both compared saturation values are "sufficiently high":

$$
\rho^H(h) = \max_{i \in B}\{j(s, s_i) \times (h \div h_i)\} - \min_{i \in B}\{j(s, s_i) \times (h \div h_i)\} \qquad (5)
$$

where $B$ is the local 8-neighborhood and $j(\cdot, \cdot)$ a double sigmoid controlling the transition from "low" to "high" saturation levels:

$$
j(s_1, s_2) = \frac{1}{\big(1 + \exp(\alpha (s_1 - \beta))\big) \times \big(1 + \exp(\alpha (s_2 - \beta))\big)} \qquad (6)
$$

where $\alpha = -10$ and the offset $\beta = \mu_S$ is set as the mean saturation of the image, hence making it possible to adapt the gradient's sensitivity to colour according to the image's overall colourfulness level. The application of the watershed transformation to the newly computed gradient leads to the result depicted in figure 4.
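A NumPy/SciPy sketch of this gradient is given below (hypothetical function names; the wrap-around borders introduced by np.roll are a simplification of the sketch, not part of the method).

```python
import numpy as np
from scipy import ndimage

def hue_dist(h1, h2):
    """Acute angular distance between two hue images (eq. 2)."""
    d = np.abs(h1 - h2)
    return np.where(d < np.pi, d, 2.0 * np.pi - d)

def double_sigmoid(s1, s2, alpha=-10.0, beta=0.5):
    """Saturation weighting j(., .) of eq. (6)."""
    return 1.0 / ((1.0 + np.exp(alpha * (s1 - beta)))
                  * (1.0 + np.exp(alpha * (s2 - beta))))

def colour_gradient(h, s, l, se=np.ones((3, 3))):
    """Channel-wise maximum of marginal gradients (eq. 4), with the
    saturation-weighted hue gradient of eq. (5) over the 8-neighbourhood."""
    rho_l = l - ndimage.grey_erosion(l, footprint=se)   # internal gradients
    rho_s = s - ndimage.grey_erosion(s, footprint=se)
    beta = s.mean()                                     # offset = mean saturation
    vals = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            h_i = np.roll(np.roll(h, dy, axis=0), dx, axis=1)
            s_i = np.roll(np.roll(s, dy, axis=0), dx, axis=1)
            vals.append(double_sigmoid(s, s_i, beta=beta) * hue_dist(h, h_i))
    vals = np.stack(vals)
    rho_h = vals.max(axis=0) - vals.min(axis=0)
    return np.maximum(np.maximum(rho_l, rho_s), rho_h)
```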
3.3 Merging
Given the sensitivity of the internal gradient, the oversegmented result obtained in the previous step was to be expected. Considering that the sought borders are contained within this complex of adjacency relations, from this point on all efforts
Fig. 4. From left to right, the hand reference segmentation, the proposed colour gradient and its oversegmented watershed transformation result, superposed on the original image
are concentrated on eliminating the unwanted borders, and thus increasing the sizes of the catchment basins. Merging the mosaic of basins obtained by the watershed transformation is a well known technique in automated image segmentation [9,10]. Here we follow a graph based formalisation of this procedure. As each basin represents a locally homogeneous region, the watershed procedure, despite the level of oversegmentation, provides spectrally atomic regions and thus greatly reduces the volume of clustering to be carried out in the later stages. At this point, based on the atomicity of each basin, one can proceed by manipulating the image content with the catchment basins as the new processing "image units", instead of pixels. Hence the image can be viewed as an undirected graph of basins, where each node is characterised by a set of spectral and other properties (e. g. mean colour, variance, etc) as well as its set of adjacent basins, or neighbours. With this point of view, the merging procedure can be defined as an operator on this graph, which propagates labels and modifies adjacency relations. Furthermore, by imposing a complete lattice structure on the "value interval" of basins, one can define morphological operators, hence leading to a basin morphology. In particular, by formulating the merging of basins as the replacement of each node by its closest neighbour with respect to a certain metric, the operator becomes intuitively similar to an erosion, while a dilation would dually replace each node with its most distant neighbour; however, this option is of no interest on its own. Consequently, the basin erosion ($\varepsilon_b$) and dilation ($\delta_b$) of a graph $G = (V, E)$ can be defined respectively as:

$$
\varepsilon_b(G) = G \;\big|\; \forall\, V_i \in V,\ \text{label}(V_i) = \text{label}\Big(\arg\min_{V_j \in N(V_i)} d(V_j, V_i)\Big) \qquad (7)
$$
$$
\delta_b(G) = G \;\big|\; \forall\, V_i \in V,\ \text{label}(V_i) = \text{label}\Big(\arg\max_{V_j \in N(V_i)} d(V_j, V_i)\Big) \qquad (8)
$$
where $N(V_i)$ is the set containing the neighbours of $V_i$. Several operators may be derived from combinations of these two; their efficiency in this context, however, is strongly related to the similarity metric in use ($d(\cdot, \cdot)$). Consequently, one can implement a rich variety of merging strategies, each based on a different basin similarity metric exploiting some of their properties.
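A minimal sketch of one (thresholded) basin erosion pass on a region adjacency graph is shown below; the data structures (dicts keyed by basin id) and the optional threshold are assumptions of the sketch.

```python
def basin_erosion(labels, features, adjacency, dist, threshold=None):
    """One pass of eq. (7): each basin takes the label of its most similar
    neighbour; if a threshold is given, only merges closer than it are applied.
    labels    : dict basin_id -> current label
    features  : dict basin_id -> attribute vector (e.g. mean colour)
    adjacency : dict basin_id -> set of neighbouring basin_ids
    dist      : callable implementing the similarity metric d(., .)"""
    new_labels = dict(labels)
    for b, neighbours in adjacency.items():
        if not neighbours:
            continue
        closest = min(neighbours, key=lambda n: dist(features[n], features[b]))
        if threshold is None or dist(features[closest], features[b]) <= threshold:
            new_labels[b] = labels[closest]
    return new_labels
```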
Fig. 5. Hierarchy of merging criteria: Level 1 — mean colour, variance; Level 2 — basin transition; Level 3 — texture, ... (the vertical axis of the original figure indicates reliability).
We propose a hierarchical approach in this regard, as illustrated in figures 5 and 6, which consists in employing various properties of the basins at different scales in order to compute their distances and hence realise their erosions (i. e. mergings) by means of equation (7). Specifically, it begins with a series of thresholded erosions based on their mean colour, where only the basins that are closer than a predefined limit are taken into account. This first step aims to merge only spectrally similar basins. The colour distance in use is:

$$
\forall\, c_1 = (h_1, s_1, l_1),\ c_2 = (h_2, s_2, l_2),\quad
d(c_1, c_2) = j(s_1, s_2) \times (h_1 \div h_2) + \big(1 - j(s_1, s_2)\big) \times |l_1 - l_2| \qquad (9)
$$

By means of the factor $j(\cdot, \cdot)$, a saturation based continuous transition of priority is realised between hue and luminance. This low level step is carried out iteratively with thresholds starting from $t_0 = 0.01$ and increasing until $t_{max}$, which doubles the initial intra-basin variances. This process results in a preliminary segmentation map, where relatively homogeneous regions appear. Next, we modify the distance metric so as to eliminate intensity gradients, and apply it using the same threshold. For this purpose, the erosions are computed by taking into account only the bordering pixels of basins. At the third step, higher level merging criteria are employed. Specifically, in order to calculate the textural similarity of basins, their mean covariance vector is used:

$$
K(f) = \mathrm{Vol}\big(\varepsilon_{P_{2,v}}(f)\big)\, /\, \mathrm{Vol}(f) \qquad (10)
$$

where $P_{2,v}$ is a pair of points separated by a vector $v$ and $\mathrm{Vol}$ the volume, i. e. the sum of pixel values. Of course, one is by no means limited to these criteria; for instance, border geometry may be further exploited. The threshold in this case is fixed as the mean covariance vector of the entire image.
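For instance, the first-level criterion (9) could be passed to the basin erosion sketch above as the dist callable; a plain-Python transcription (hypothetical names) is:

```python
import math

def j_sigmoid(s1, s2, alpha=-10.0, beta=0.5):
    """Double sigmoid of eq. (6) for two scalar saturations."""
    return 1.0 / ((1.0 + math.exp(alpha * (s1 - beta)))
                  * (1.0 + math.exp(alpha * (s2 - beta))))

def hue_div(h1, h2):
    """Acute angular hue distance of eq. (2)."""
    d = abs(h1 - h2)
    return d if d < math.pi else 2.0 * math.pi - d

def colour_distance(c1, c2, beta=0.5):
    """Basin mean-colour distance of eq. (9): hue dominates when both basins
    are saturated, luminance otherwise."""
    h1, s1, l1 = c1
    h2, s2, l2 = c2
    w = j_sigmoid(s1, s2, beta=beta)
    return w * hue_div(h1, h2) + (1.0 - w) * abs(l1 - l2)
```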
3.4 Post-processing
Once this stage is reached, the principal regions of the input are expected to have been formed. As a last touch, one can eliminate all regions smaller than a certain area by merging them with their closest neighbour.
Fig. 6. From left to right, the three levels of merging using the principle illustrated in figure 5
A more serious problem, however, concerns the possibility of local deviations from the sought borders, since, according to the definition of the proposed erosion operator, basin processing has so far been carried out using only the immediate neighborhood of each basin. To counter this phenomenon, one can employ, for instance, multiple scales by modifying the size of the processed neighborhood, or in other words the shape and extent of the SE. Another possibility is to profit from the topological properties of the marker based watershed transform, which, by limiting the flooding sources, provides absolute control over the number of regions that are formed. As to the markers, one can very simply erode the binary region map while preserving its connectivity. Thus, flexibility areas are formed among the regions, which make it possible to realise topological border corrections (figure 7).
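As an illustration, a hedged sketch of such marker generation (erode each region of the label map while keeping it non-empty) is given below; the final refinement would feed these markers to a marker-based watershed on the colour gradient.

```python
import numpy as np
from scipy import ndimage

def markers_from_regions(label_img, iterations=3):
    """Erode every region of a label image to obtain watershed markers; a region
    that would vanish is kept as-is so that no flooding source is lost.
    (This only guarantees non-empty markers; preserving exact connectivity, as
    described in the paper, would require a more careful erosion.)"""
    markers = np.zeros_like(label_img)
    for r in np.unique(label_img):
        mask = label_img == r
        eroded = ndimage.binary_erosion(mask, iterations=iterations)
        markers[eroded if eroded.any() else mask] = r
    return markers
```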
Fig. 7. From left to right, the segmentation result using the jump connection algorithm [11], the marker image and the final marker based watershed result
4 Discussion and Conclusion
In this paper, an unsupervised and input specific colour image segmentation method has been presented. It has been developed for the improved HLS space, and constitutes an attempt to integrate the spatial sensitivity of morphological operators with spectral image properties. Furthermore, a graph based morphology approach has been formulated in order to manipulate the catchment basins produced by the watershed transform. This formulation aims mainly to provide a more efficient and flexible exploitation framework for the wealth of topological information provided by the aforementioned transform. Pertinent results have been obtained on the Berkeley dataset (figure 8), even by using the basic erosion definition in combination with a hierarchy of multiple merging criteria, ranging from mean colour to covariance based texture features. More sophisticated operators (e. g. geodesic reconstruction of basins, etc.), as well as the exploration of further basin metrics, remain to be investigated. Its execution speed and adaptivity, along with its capacity to provide the "main" borders of its input, render this approach suitable for applications where precision is of secondary importance and a fast and robust segmentation is prioritised (e. g. content based image retrieval).
Fig. 8. From left to right, the original images, their segmentations based on jump connection [11] and based on the proposed approach (top to bottom: #3096, #42049, #143090 and #145086).
Issues that require further attention include improving the estimation of the method's parameters, as well as the use of additional high level merging criteria, such as border geometry.
References

1. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proceedings of the 8th International Conference on Computer Vision, Vancouver, Canada, vol. 2, pp. 416–423 (2001)
2. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 2nd edn. Addison-Wesley, New York (1992)
3. Hanbury, A., Serra, J.: Colour image analysis in 3d-polar coordinates. In: International Conference on Image Processing and its Applications, Magdeburg, Germany (2003)
4. Serra, J.: Image Analysis and Mathematical Morphology, vol. I. Academic Press, London (1982)
5. Ronse, C.: Why mathematical morphology needs complete lattices. Signal Processing 21(2), 129–154 (1990)
6. Aptoula, E., Lefèvre, S.: A comparative study on multivariate mathematical morphology. Pattern Recognition (2007), doi:10.1016/j.patcog.2007.02.004
7. Hanbury, A., Serra, J.: Morphological operators on the unit circle. IEEE Transactions on Image Processing 10(12), 1842–1850 (2001)
8. Gomila, C., Meyer, F.: Levelings in vector spaces. In: Proceedings of the IEEE Conference on Image Processing, Kobe, Japan (1999)
9. Chen, Q., Zhou, C., Luo, J., Ming, D.: Fast segmentation of high-resolution satellite images using watershed transform combined with an efficient region merging approach. In: Klette, R., Žunić, J. (eds.) IWCIA 2004. LNCS, pp. 621–630. Springer, Heidelberg (2004)
10. Garrido, L., Salembier, P., Garcia, D.: Extensive operators in partition lattices for image sequence analysis. Signal Processing 66(2), 157–180 (1998)
11. Angulo, J., Serra, J.: Modelling and segmentation of colour images in polar representations. Image and Vision Computing (2006), doi:10.1016/j.imavis.2006.07.018
Detecting and Segmenting Un-occluded Items by Actively Casting Shadows

Tze K. Koh 1,2,3, Amit Agrawal 1, Ramesh Raskar 1, Steve Morgan 3, Nicholas Miles 2, and Barrie Hayes-Gill 3

1 Mitsubishi Electric Research Labs (MERL), 201 Broadway, Cambridge MA 02139, USA
{agrawal,raskar}@merl.com
http://www.merl.com/people/agrawal/index.html
2 School of Chemical, Environmental and Mining Engineering, University of Nottingham, UK
{enxtkk,nick.miles}@nottingham.ac.uk
3 School of Electrical and Electronic Engineering, University of Nottingham, UK
{steve.morgan,barrie.hayes-gill}@nottingham.ac.uk
Abstract. We present a simple and practical approach for segmenting un-occluded items in a scene by actively casting shadows. By ‘items’, we refer to objects (or part of objects) enclosed by depth edges. Our approach utilizes the fact that under varying illumination, un-occluded items will cast shadows on occluded items or background, but will not be shadowed themselves. We employ an active illumination approach by taking multiple images under different illumination directions, with illumination source close to the camera. Our approach ignores the texture edges in the scene and uses only the shadow and silhouette information to determine the occlusions. We show that such a segmentation does not require the estimation of a depth map or 3D information, which can be cumbersome, expensive and often fails due to the lack of texture and presence of specular objects in the scene. Our approach can handle complex scenes with self-shadows and specularities. Results on several real scenes along with the analysis of failure cases are presented.
1 Introduction

The human vision system is extremely efficient at scene analysis. Identifying objects in a scene and grasping them is a mundane task for us. However, designing vision algorithms even for such simple tasks has proven to be notoriously difficult. For example, random 3D 'bin-picking', where objects are randomly placed in a bin, is still an unsolved problem. Commercial systems typically address less taxing robot-guidance tasks, such as picking singulated parts from a moving conveyor belt, and employ 2D image processing techniques. Partial occlusion with overlapping parts is a serious problem, and it is important to find un-occluded objects. In this paper, we address the problem of identifying un-occluded items in a scene. By 'items', we refer to objects (or parts of them) enclosed by depth edges. Such an approach could serve as a pre-processing stage for several vision tasks, for example robotic manipulation in factory automation, 3D pose estimation and object recognition. Our motivating application for detecting un-occluded items is to enable a robot-mounted vision system to better plan the picking sequence.
Fig. 1. Segmenting un-occluded items. (Left) Implementation of our active illumination approach using a firewire camera and eight light emitting diodes (LEDs) around it. (Right) A scene with two objects, A and B. B is occluded since it contains a shadow edge (orange). Equivalently, B's shadow region does not contain the complete depth edge contour (green) of B, as its depth edges are intersected by shadow edges. However, A's shadow region contains its complete depth edge contour. Thus, un-occluded items can be obtained by filling in depth edges inside shadow regions.
Although 2D image segmentation approaches can segment an image into semantic regions, in the absence of 3D or depth information these approaches cannot identify occlusion between objects. It is a general belief that once 3D or depth information is obtained, several vision tasks can be simplified. Although past decades have witnessed significant research efforts in this direction, accurate 3D estimation is cumbersome, expensive and usually has limitations (e.g. stereo on non-textured surfaces). Even if the depth map of the scene is available, one would have to do an analysis similar to ours to find un-occluded objects. This is because un-occluded objects may not necessarily be at a smaller distance from the camera than occluded objects. Range segmentation may segment the depth map into regions, but one still needs to determine the occlusions to remove occluded objects. More importantly, we show that such an analysis can be done using depth and shadow edges without obtaining the depth map of the scene. Thus, our approach inherently overcomes the limitations of shape-from-X algorithms. Our approach can easily handle textured and non-textured objects as well as specular objects (to a certain extent), as described in Sect. 3.

Contributions: We make the following contributions in this paper:
– We propose an approach to segment un-occluded items in a scene using cast shadows. We describe a simple implementation of this approach using depth and shadow edges.
– We analyze practical configurations where our approach works and fails, including self-occlusions, mutual occlusions, objects with holes and specular objects.
– We show how to handle missing depth/shadow edges due to noise and lack of shadow information.

1.1 Related Work

2D/Range Segmentation: Image and range segmentation [1,2,3,4,5,6] is a well researched area. Although 2D segmentation can segment an image into semantic regions, it cannot provide occlusion information due to the lack of depth information. Even when a depth map of the scene is available, we need to explicitly find occlusions using depth edges. In Sect. 2, we show that depth edges can be directly obtained using active illumination without first computing a depth map.
Shape from Silhouettes: These approaches [7,8,9,10] attempt to infer 3D shape from silhouettes obtained under different view-points. The computed silhouette of every image, along with the camera center of the corresponding camera, is used to define a volume assumed to bound the object. The intersection of these volumes, known as the visual hull [11], yields a reasonable approximation of the real object. In contrast, we capture images from a single view-point under varying illumination and use the information in cast shadows to segment un-occluded objects.

Active Illumination: Several vision approaches use active illumination to simplify the underlying problem. Nayar et al. [12] recover the shape of textured and textureless surfaces by projecting an illumination pattern on the scene. Shape from structured light [13,14] has been an active area of research for 3D capture. Raskar et al. [15] proposed the multiflash camera (MFC), which attaches four flashes to a conventional digital camera to capture depth edges in a scene. Crispell et al. [16] exploited the depth discontinuity information captured by the MFC for a 3D scanning system which can reconstruct the position and orientation of points located deep inside concavities. The depth discontinuities obtained by the MFC have also been utilized for robust stereo matching [17] and recognition of finger-spelling gestures [18]. Koh et al. [19] use the depth edges obtained using multiflash imaging [15] for automated particle size analysis, with applications in the mining and quarrying industry. Our approach also uses a variant of the MFC (with 8 flashes, Fig. 1) to extract depth discontinuities, which are then used to segment un-occluded objects.

Interpretation of Line Drawings: Our visual system is surprisingly good at the perceptual analysis of line drawings and occluding contours into 3D shapes [20]. Labeling line drawings into different types of edges was proposed by Huffman [21]. Waltz [22] describes a system to provide a precise description of a plausible scene which could give rise to a particular line drawing, for polyhedral scenes. Malik [23] proposed schemes for labeling line drawings of scenes containing curved objects under orthographic projection. Marr [24] argued that a given silhouette could be generated by an infinite variety of shapes, and analyzed the importance of assumptions about viewed surfaces in our perception of 3D from occluding contours. Our goal is not to interpret occluding contours into 3D shapes, but to label the occluding contours corresponding to un-occluded objects in the scene.
2 Segmentation Using Information in Cast Shadows

In this section, we describe the basic idea of segmenting un-occluded items in a scene. We first assume that complete depth and shadow edges are available, i.e., there are no missing edges. We do a thorough analysis of this ideal case for several practical scenes in Sect. 3. In Sect. 4, we extend our approach to handle missing edges. As mentioned earlier, depth edges alone cannot provide a unique interpretation of the objects in the scene. Thus, our goal is not to interpret depth edges into 3D shapes, but to identify or label those depth edges that possibly correspond to un-occluded objects in the scene. In particular, our approach outputs items enclosed by depth edges (see Fig. 2). In Sect. 3, we show how this is affected by self-shadows, self-occlusions and mutual occlusions. We assume that the scene consists of objects lying on a flat surface and on top of each other, and that the view direction is vertical.
Fig. 2. (Left) Different types of edges (depth, concave, convex, shadow) [22]. It is well-known that occluding contours alone cannot provide a unique interpretation of the objects in the scene [24,23]. We therefore find un-occluded items, which are objects or parts of objects enclosed within depth edges. (Right) In this scene, A and B could be parts of the same physical object or two different physical objects. Since A casts shadows on B, our approach will identify only A as the un-occluded item.
Consider the simple scene shown in Fig. 1, where A casts a shadow on B and B casts a shadow on the background. We depict depth edges in green and shadow edges in orange. Suppose we could segment the boundaries of A and B from the captured intensity images. Then we could easily infer that, since region B contains a shadow edge, it must be occluded. Thus, all regions that do not contain any shadow edges are potential candidates for un-occluded objects. The important question is how to obtain such a segmentation so that the segmentation boundaries correspond to object boundaries or shape edges. Note that any 2D segmentation approach relies on image intensities and thus will respond to texture/reflectance edges. A depth edge may not correspond to a texture edge at the same location in the image (e.g. all objects with the same reflectance, Fig. 4), and intensity edges on object surfaces will result in false depth edges. Thus, we need a robust method to find depth edges which can ignore texture edges.

Computing Depth Edges: The active illumination method proposed in [15] is an easy way to find depth edges in a scene. In this approach, four flashes are attached close to the main lens of the camera along the left, right, top and bottom directions. Four images, I1, I2, I3 and I4, are captured, each under a different flash illumination. Since shadows are cast by object boundaries and not by reflectance boundaries, depth edges can be extracted using the shadows. To compute depth edges, first a max composite image (Imax) is obtained by taking the maximum intensity value at every pixel. Imax will be a shadow-free image. Then, ratio images are calculated as ri = Ii/Imax. Depth edges are obtained by estimating the foreground-to-shadow transition in each ratio image and combining all the estimates. In our implementation, we capture eight images with different illumination directions using the setup shown in Fig. 1. Fig. 3 shows an example on a scene containing three overlapping crayons. Note that the shadow edges can be similarly obtained by estimating the shadow-to-foreground/background transition in each ratio image.

Segmenting Un-occluded Items: The basic idea in segmenting un-occluded items is to utilize the cast shadow information. If we trace the depth edges in the clockwise direction, the cast shadows should always be on the left of the depth edge. In other words, if an object is un-occluded, then cast shadows cannot fall inside the object, or to the right of the depth edge.
Fig. 3. Depth and shadow edges can be obtained using active illumination. (Top row) The eight input images captured using our setup. (Middle row) Ratio images obtained by dividing the input images by Imax; note that the ratio images are texture-free and have shadows according to the corresponding LED direction. (Bottom row) Depth edges, depth & shadow edges, shadow regions and un-occluded items obtained using the ratio images. Note that only the shadow region corresponding to the red crayon contains closed depth edge contours. Thus, filling depth edges inside shadow regions correctly outputs the red crayon as the un-occluded item. Matlab source code and input images for this example are included in the supplementary materials.
T-junctions at the intersection of two objects (Fig. 1) can also be handled with this tracing method by always tracing along the rightmost boundary at junctions. In Fig. 1, at the intersection of A and B, the above condition will be satisfied for A but not for B, identifying A as an un-occluded item. Instead of tracing depth edges, which might be cumbersome, we propose a simple equivalent implementation using shadow edges. The shadow edges segment the image into regions. For any un-occluded object, the shadow region should contain the entire depth edge contour of that object. For example, in Fig. 1, shadow region 1 contains the entire depth edge contour of object A. However, for occluded objects such as B, the shadow edge cuts through the depth edge; for shadow region 2, the depth edges inside that region do not form a closed contour. Thus, to find un-occluded items, we simply region-fill the depth edges inside each shadow region. For occluded objects, since the depth edges inside the shadow regions are not complete, they will not get filled. In Fig. 3, the shadow edges form five regions, as shown in the last row. Only the depth edges in the shadow region corresponding to the red crayon form a closed contour. Thus, the red crayon is correctly identified as an un-occluded item. The supplementary materials include Matlab source code and input images for this example.
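To make the two steps concrete, here is a heavily simplified NumPy/SciPy sketch of the ratio-image computation and of the region-filling test. The fixed threshold and the one-pixel transition test are crude stand-ins for the edge detection of [15], and the per-flash direction vectors are an assumption of the sketch.

```python
import numpy as np
from scipy import ndimage

def depth_and_shadow_edges(flash_images, flash_dirs, shadow_thresh=0.7):
    """flash_images: list of K images; flash_dirs: per-flash pixel offsets
    (dy, dx) pointing towards where that flash casts its shadows."""
    I = np.stack([np.asarray(f, dtype=float) for f in flash_images])
    I_max = I.max(axis=0) + 1e-8                 # shadow-free max composite
    depth_edges = np.zeros(I_max.shape, dtype=bool)
    shadow_edges = np.zeros(I_max.shape, dtype=bool)
    for img, (dy, dx) in zip(I, flash_dirs):
        shadow = (img / I_max) < shadow_thresh   # low ratio = shadow pixel
        ahead = np.roll(shadow, shift=(-dy, -dx), axis=(0, 1))  # shadow at p + dir
        depth_edges |= (~shadow) & ahead         # lit-to-shadow transition
        shadow_edges |= shadow & ~ahead          # shadow-to-background transition
    return depth_edges, shadow_edges

def unoccluded_items(depth_edges, shadow_edges):
    """Region-fill the depth edges inside every shadow region (Sect. 2):
    only closed depth-edge contours fill, yielding the un-occluded items."""
    regions, n = ndimage.label(~shadow_edges)
    items = np.zeros_like(depth_edges)
    for r in range(1, n + 1):
        contour = depth_edges & (regions == r)
        filled = ndimage.binary_fill_holes(contour)
        items |= filled & ~contour               # keep the filled interior only
    return items
```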
3 Practical Configurations

In this section, we analyze common scenes which give rise to more complex shadow configurations, such as objects with self-shadows, objects with holes and specular objects. We also analyze two failure cases involving self-occlusions and mutual occlusions.
Fig. 4. Self-shadows. The scene consists of a rabbit shaped object A on top of another object B. Part 'R' of object A casts shadow on itself, as evident from the ratio image corresponding to the right flash. This leads to extra depth and shadow edges, as shown in the third image. If these extra edges do not form closed contours, erroneous shadow regions are not obtained. Note that the shadow region corresponding to the rabbit still contains the closed contour corresponding to the outer boundary of object A. (Panels, left to right: scene, right flash ratio image, depth and shadow edges, shadow regions, shadow region 1, depth edges inside shadow region 1, filled depth edges: un-occluded item.)
Self-shadows: We consider self-shadows to be those shadows of an object which fall on the object itself. Fig. 4 shows an example where the part 'R' of the rabbit shaped object A casts a shadow on itself (part 'R' is a slanted piece whose one side is attached to object A). The self-shadows lead to extra depth and shadow edges. These extra edges can be ignored by our algorithm if they do not form closed contours, or do not cut through the outer boundary of the object. Note that the shadow edges lead to five shadow regions, which would also have been obtained if the self-shadows were not present. By filling in the depth edges inside shadow region 1, we can identify object A as an un-occluded item. In Sect. 3.1, we show that when the extra depth edges due to self-shadows form closed contours with other depth edges, the entire object is not identified as an un-occluded item.

Object with Holes: Our algorithm can handle the challenging case of objects with holes. A common scenario is shown in Fig. 5. Although the depth edges are the same in the two cases, the cast shadows are different. In the two-spheres case, the upper sphere casts shadows on the lower sphere, and thus only the upper sphere will be considered as the un-occluded item. In the doughnut case, note that the shadow cast by the inner region does not contain any depth edges, and hence will be ignored. The shadow cast by the outer region contains both depth edge contours. If the un-occluded item is obtained by filling the depth edges inside the outer shadow region as before, the doughnut hole will also get filled. We can remove the holes by ignoring those filled regions that contain a complete shadow edge contour. The inner filled region (in green) contains the complete shadow edge contour (in orange) due to the inner doughnut boundary, and can be removed.
Fig. 5. Object with holes. Our approach can recover a doughnut shaped object using cast shadow information. (Rows: doughnut, two spheres; columns: cast shadows, depth & shadow edges, shadow region, un-occluded item.)
Specularities and Specular Objects: Specular highlights on objects are a common problem for vision algorithms, as they tend to saturate and are view dependent. In the case of specular highlights, the active illumination approach for finding depth edges results in spurious depth edges [15,25]. We show that, similarly to the self-shadowing case above, our approach can ignore the effect of specular highlights if the spurious depth edges due to specularities do not form closed contours. For example, in Fig. 3, the specularities on the green and the red crayon result in spurious depth edges inside the crayons. But since these edges do not intersect the true depth edges and do not form closed regions, they can be ignored while filling in the true depth edges inside shadow regions.
Fig. 6. Handling specular objects. Using our method, depth edges for specular objects can also be obtained. If spurious depth edges due to specularities do not form closed contours, specular objects can be handled. (Columns: scene, depth and shadow edges, shadow regions, un-occluded items.)
A more general case of a scene having a specular object is shown in Fig. 6. An important point to note is that while specularities may result in spurious depth edges, the true depth edges even for a specular object are obtained by our technique. This is different from other techniques such as stereo/photometric stereo where the estimation is completely incorrect for specular objects. Note that in Fig. 6, the outer depth edges for the specular object are obtained. The shadow edges results in four regions. Once again, by filling in the depth edges in shadow regions, the un-occluded specular object can be recovered. Only the shadow region corresponding to the specular object has closed depth edges, as other objects are shadowed by the specular object.
3.1 Failure Cases

Two important failure cases are described below.

Self-Occlusions: The first case corresponds to self-occlusions such that the depth edges due to the self-occlusion form closed regions with the outer depth edges of the object. Fig. 7 shows an example. Note that the part of the object which is occluded by the object itself cannot be recovered.
Fig. 7. Failure cases. (Top row) Self-occlusions. The scene consists of a single pipe which occludes itself. The shadow edges result in three shadow regions. Only region 3 has closed depth edge contours. However, filling in the depth edges inside region 3, followed by hole removal, only recovers the un-occluded part of the object as the un-occluded item, instead of the entire object. (Bottom row) Mutual occlusions. The scene consists of two mutually occluding pipes. The shadow edges give rise to five regions. However, none of the shadow regions contains complete depth edge contours, as each depth edge is intersected by some shadow edge. Thus, the output will be zero un-occluded items.
Mutual Occlusions: The second failure case corresponds to mutual occlusions, where object A occludes object B but is also occluded by object B at the same time. Fig. 7 shows such a scenario for a scene containing two pipes. For this scene, neither of the two pipes, nor any part of them, will be segmented as an un-occluded item.
4 Handling Missing Depth and Shadow Edges

In the previous section, we showed that if we have complete depth and shadow edge information, we can reliably segment un-occluded items in the scene. However, in some cases complete depth/shadow edges are not obtained, due to noise or dark surfaces. If shadow edges are missing, correct shadow regions will not be obtained and we cannot use the previous approach of filling depth edges within the shadow regions. We now describe an extension to handle such cases. Our approach first tries to complete the depth edges by segmenting the pseudo-depth map [17,15] of the scene. We seek an over-segmentation in this step so that all missing depth edges are accounted for, although this may result in extra regions. We then verify each segmented region for occlusion by checking whether any shadow falls in that region.
Fig. 8. Handling missing edges. In a complex scene with several objects, depth and shadow edges may be missing (pointed by white arrows). We first compute the pseudo-depth map of the scene. We then complete the depth edges by segmenting the pseudo-depth map. Each segmented region is then checked for occlusions using shadow information. All regions intersecting with shadow edges are removed to obtain un-occluded items. (Panels: scene, edges, pseudo-depth map, segmented, un-occluded items.)
Fig. 8 shows a complex scene with several objects. The extracted depth edges have gaps, as shown. A pseudo-depth map of the scene is computed by assigning horizontal/vertical gradients to each depth edge pixel, according to the direction of the light source. The magnitude of the gradient is set proportional to the width of the shadow at that pixel [17]. The gradients at all other pixels are set to zero. The pseudo-depth map is obtained by integrating the resulting 2D gradient field, solving a Poisson equation. We segment the pseudo-depth map using EDISON [3]. The resulting pseudo-depth map and its segmentation are also shown in Fig. 8. Note that all the missing depth edges are completed, but the segmented pseudo-depth map has extra regions. The final step consists of checking each region for occlusions. If we draw a line from any point inside an un-occluded object to a point outside the object, it should intersect a depth edge before intersecting a shadow edge. For an occluded object, since a shadow falls inside the object, such a line may intersect a shadow edge first. For example, in Fig. 1, any line drawn from the inside of object A to the outside will intersect a depth edge (green) first, whereas certain lines drawn from the inside of object B to the outside will intersect a shadow edge (red) first. Thus, for each segmented region, we draw lines from inside the region at several different angles, and count the number of intersections with a shadow edge that occur before an intersection with a depth edge. If this count is greater than some threshold, the region is declared occluded. Fig. 8 shows that all occluded regions were successfully eliminated. The starting points of these lines are taken on the medial axis of each region, to handle general regions with concavities.
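A compact sketch of the gradient-field integration is shown below (NumPy only); the FFT solver assumes periodic boundaries, which is a simplification relative to the solver implied by [17], and the function name is hypothetical. Here gx and gy would be zero everywhere except at depth-edge pixels, where they are set according to the light direction and the local shadow width, as described above.

```python
import numpy as np

def integrate_gradients(gx, gy):
    """Recover a pseudo-depth map u whose gradient best matches (gx, gy) in the
    least-squares sense, by solving the Poisson equation lap(u) = div(gx, gy)
    with an FFT (periodic boundary conditions assumed)."""
    # backward-difference divergence, paired with forward-difference gradients
    div = (gx - np.roll(gx, 1, axis=1)) + (gy - np.roll(gy, 1, axis=0))
    H, W = div.shape
    wy = 2.0 * np.cos(2.0 * np.pi * np.fft.fftfreq(H))[:, None]
    wx = 2.0 * np.cos(2.0 * np.pi * np.fft.fftfreq(W))[None, :]
    denom = wy + wx - 4.0                 # eigenvalues of the 5-point Laplacian
    denom[0, 0] = 1.0                     # avoid 0/0 for the constant mode
    F = np.fft.fft2(div)
    F[0, 0] = 0.0                         # pin the free additive constant
    return np.real(np.fft.ifft2(F / denom))
```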
5 Discussion

Several improvements to our approach are possible. Better region filling approaches could handle cases where only a few pixels are missing from the depth or shadow edges. A gradient based analysis could be used to remove spurious depth edges due to specularities [25]. Since depth edges are view dependent, the labeling of scene parts as un-occluded items is also view dependent. Higher level information could be combined for an object-based interpretation.
Limitations: Our approach shares the limitations described in [15] for finding depth edges. These include dark surfaces/backgrounds, and shadows detached from the objects due to a large baseline between the LEDs and the camera or due to thin objects. Our scheme works better on curved objects than on polyhedral objects. The depth edge at the intersection of polyhedral objects may convert into a concave/convex edge depending on the viewpoint, and thus may not be obtained.

Conclusions: We have proposed a simple and practical approach to segment un-occluded items in a scene using cast shadows, by analyzing the resulting depth and shadow edges. A depth map of the scene is not required, and our approach can handle complex scenes with specularities, specular objects, self-shadows and objects with holes. We showed several real examples using our approach and analyzed the failure cases, including self-occlusions and mutual occlusions. To handle missing depth and shadow edges, we proposed an extension based on segmenting the scene using the pseudo-depth map and analyzing each region for occlusions. We believe that our approach could serve as a pre-processing stage for several vision tasks, including bin-picking, 3D pose estimation and object recognition.
References 1. Shapiro, L.G., Stockman, G.C.: Computer Vision. Prentice-Hall, Englewood Cliffs (2001) 2. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Machine Intell 22(8), 888–905 (2000) 3. Christoudias, C.M., Georgescu, B., Meer, P.: Synergism in low level vision. In: Proc. Int’l Conf. Pattern Recognition, vol. IV, pp. 150–155 (2002) 4. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Machine Intell. 24(5), 603–619 (2002) 5. Hoover, A., Jean-Baptiste, G., Jiang, X., Flynn, P., Bunke, H., Goldgof, D., Bowyer, K., Eggert, D., Fitzgibbon, A., Fisher, R.: An experimental comparison of range image segmentation algorithms. IEEE Trans. Pattern Anal. Machine Intell. 18(7), 673–689 (1996) 6. Yim, C., Bovik, A.: Multiresolution 3-D range segmentation using focus cues. IEEE Trans. Image Processing 7(9), 1283–1299 (1998) 7. Matusik, W., Buehler, C., Raskar, R., Gortler, S.J., McMillan, L.: Image-based visual hulls. In: SIGGRAPH, pp. 369–374 (2000) 8. Cheung, K.M.G.: Visual hull construction, alignment and refinement for human kinematic modeling, motion tracking and rendering. PhD thesis, CMU (2003) 9. Brand, M., Kang, K., Cooper, D.: Algebraic solution for the visual hull. In: Proc. Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 30–35 (2004) 10. Franco, J., Boyer, E.: Exact polyhedral visual hulls. In: Proc. Fourteenth British Machine Vision Conference, pp. 329–338 (2003) 11. Laurentini, A.: The visual hull concept for the silhouette-based image understanding. IEEE Trans. Pattern Anal. Machine Intell. 16, 150–162 (1994) 12. Nayar, S., Watanabe, M., Noguchi, M.: Real-time focus range sensor. IEEE Trans. Pattern Anal. Machine Intell. 18, 1186–1198 (1995) 13. Scharstein, D., Szeliski, R.: High-accuracy stereo depth maps using structured light. In: Proc. Conf. Computer Vision and Pattern Recognition (2003) 14. Zhang, L., Snavely, N., Curless, B., Seitz, S.M.: Spacetime faces: high resolution capture for modeling and animation. ACM Trans. Graph 23, 548–558 (2004)
15. Raskar, R., Tan, K.H., Feris, R., Yu, J., Turk, M.: Non-photorealistic camera: depth edge detection and stylized rendering using multi-flash imaging. ACM Trans. Graph. 23(3), 679– 688 (2004) 16. Crispell, D., Lanman, D., Sibley, P.G., Zhao, Y., Taubin, G.: Beyond silhouettes: Surface reconstruction using multi-flash photography. In: Third International Symposium on 3D Data Processing, Visualization, and Transmission, pp. 405–412 (2006) 17. Feris, R., Raskar, R., Chen, L., Tan, K.H., Turk, M.: Discontinuity preserving stereo with small baseline multi-flash illumination. In: Proc. Int’l Conf. Computer Vision, vol. 1, pp. 412–419 (2005) 18. Feris, R., Turk, M., Raskar, R., Tan, K., Ohashi, G.: Exploiting depth discontinuities for vision-based fingerspelling recognition. In: IEEE Workshop on Real-Time Vision for Human-Computer Interaction, IEEE Computer Society Press, Los Alamitos (2004) 19. Koh, T.K., Miles, N., Morgan, S., Hayes-Gill, B.: Image segmentation of overlapping particles in automatic size analysis using multi-flash imaging. In: WACV 2007. Proc. Eighth IEEE Workshop on Applications of Computer Vision, IEEE Computer Society Press, Los Alamitos (2007) 20. Barrow, H., Tenenbaum, J.: Interpreting line drawings as three-dimensional surfaces. Artificial Intelligence 17, 75–116 (1981) 21. Huffman, D.A.: Impossible objects as nonsense sentences. In: Melzer, B., Michie, D. (eds.) Machine Intelligence, vol. 6, pp. 295–323. Edinburgh University Press (1971) 22. Waltz, D.: Understanding line drawings of scenes with shadows. In: Winston, P. (ed.) The Psychology of Computer Vision, pp. 19–91. McGraw-Hill, New York (1975) 23. Malik, J.: Interpreting line drawings of curved objects. Int’l J. Computer Vision 1, 73–103 (1987) 24. Marr, D.: Analysis of occluding contour. Technical Report ADA034010, MIT (1976) 25. Feris, R., Raskar, R., Tan, K.H., Turk, M.: Specular reflection reduction with multi-flash imaging. In: SIBGRAPI, pp. 316–321 (2004)
A Local Probabilistic Prior-Based Active Contour Model for Brain MR Image Segmentation
Jundong Liu 1, Charles Smith 2, and Hima Chebrolu 2
1 School of Electrical Engineering and Computer Science, Ohio University, Athens, OH
2 Department of Neurology, University of Kentucky, Lexington, KY
Abstract. This paper proposes a probabilistic prior-based active contour model for segmenting human brain MR images. Our model is formulated with the maximum a posteriori (MAP) principle and implemented under the level set framework. A probabilistic atlas for the structure of interest, e.g., cortical gray matter or the caudate nucleus, can be seamlessly integrated into the level set evolution procedure to provide crucial guidance in accurately capturing the target. Unlike other region-based active contour models, our solution uses locally varying Gaussians to account for intensity inhomogeneity, so the local variations present in many MR images are better handled. Experiments on whole-brain as well as caudate segmentation demonstrate the improvement made by our model.
1 Introduction
Magnetic resonance images (MRI) of the brain are very important tools in diagnosing and treating various neurodegenerative diseases, including Alzheimer's disease (AD), Parkinson's disease (PD), and multiple sclerosis. Segmentation of the whole brain as well as the subcortical structures from MR images is a critical and fundamental task if 3D MRI data are to be effectively utilized for disease diagnosis and treatment. Numerous segmentation solutions have been proposed in the literature. Among them, region-based active contour models [17,2,3,11,14,5] have recently gained great popularity, mainly due to their demonstrated segmentation robustness. The Chan-Vese piecewise-constant model [2], commonly known as the active contour without edges model, adopts a stopping term based on a simplified version of the Mumford-Shah functional and has the ability to detect object boundaries with or without gradient. Although impressive experimental results have been reported for this model and its variants [11,5] in various applications, several common drawbacks and limitations have to be addressed when they are applied to brain MRI segmentation.
Firstly, in these models, a mixture of global Gaussians (piecewise-constant can be regarded as the degenerate case) has been used as a convenient assumption for modeling the intensity distribution, and global means are employed to discriminate regions from each other. However, "homogeneous regions with distinct means" is rarely an accurate description of brain MRIs, especially before the bias field is removed. Secondly, spatial distribution priors, which are often available and are used extensively in histogram-based models, are normally neglected in region-based active contour models. For whole-brain segmentation, prior knowledge about the organ's location is sometimes a helpful resource for separating certain tissue types from their surroundings. For subcortical segmentation, atlases constructed from training sets are often an indispensable part of defining the structure of interest.
1.1 Our Proposed Solution
This paper proposes a fully automatic whole-brain and subcortical structure segmentation solution. The model consists of two major components: a local likelihood based active contour (LLAC) model and a guiding probabilistic atlas. The former can be regarded as a bridging solution between the Chan-Vese piecewise-constant model [2] and the Chan-Vese [3] (Tsai-Yezzi [14]) piecewise-smooth models; the latter specifies which structure is to be captured. Formulated under a Bayesian a posteriori probability framework, our LLAC model can seamlessly integrate the probabilistic atlas information into the level set evolution procedure. In addition, it relaxes the Gaussian assumption made in many region-based active contour models from "global" to "local", and local means are used as the area representatives. Being better able to account for intensity inhomogeneity, the LLAC model can bring out structures of interest that have low contrast with the surrounding tissues.
2 Methods
Let C be an evolving curve in Ω. Cin denotes the region enclosed by C and Cout denotes the region outside of C. The Chan-Vese (two-phase) piecewise-constant model minimizes the energy functional

F(c1, c2, C) = μ · Length(C) + λ1 ∫_{Cin} |u0 − c1|² dxdy + λ2 ∫_{Cout} |u0 − c2|² dxdy
where c1 and c2 are the averages of u0 inside and outside C, respectively. Note that both c1 and c2 are global values, computed over the entire image, and a global Gaussian distribution has been assumed for each individual class. This assumption, however, is not an accurate depiction of the local image profile of many medical images, including brain MRIs. Piecewise-smooth models [3,14] provide a solution to the intensity variability problem: gradual intensity changes can potentially be handled with [3,14]. However, the high computational cost and the sensitivity to curve initialization pose a barrier to practical applications.
2.1 Our Local Likelihood-Based Active Contour (LLAC) Model
Let S = {in, out} be the two classes of a two-phase model. The probability of pixel (x, y) belonging to in and out is denoted by P(in|(x, y)) and P(out|(x, y)), respectively. Let Pr(in) and Pr(out) be the class prior probabilities at (x, y). Then,

P(in | u0(x, y)) = Pr(in(x,y)) P(u0(x, y) | in) / P(B)
P(out | u0(x, y)) = Pr(out(x,y)) P(u0(x, y) | out) / P(B)
where P(u0(x, y) | in) is the likelihood that a voxel in class in has intensity u0(x, y), and P(B) is a constant. The maximum a posteriori segmentation is achieved when the product, over the image domain, of P(in | u0(x, y)) for pixels inside C and P(out | u0(x, y)) for pixels outside C is maximized. Taking the logarithm, the maximization can be reduced to the minimization of the following energy:

F(C) = μ · Length(C) − ∫_{Cin} log( Pr(in) P(u0(x, y) | in) ) dxdy − ∫_{Cout} log( Pr(out) P(u0(x, y) | out) ) dxdy
Note that our overall model is similar to [11,12], but the setup of the likelihood term is different, as explained next.

Spatial Priors for Whole-Brain Segmentation, Pr(in) and Pr(out): A widely used whole-brain tissue distribution prior is provided by the Montreal Neurological Institute [7]. The MNI prior consists of three probability images containing values in the range of zero to one, representing the prior probability of a voxel being GM, WM or CSF after an image has been normalized to the same space (see Fig. 1). In this paper we are particularly interested in extracting sub-cortical GM, so for demonstration purposes we take the GM and WM prior images as Pr(in) and Pr(out), respectively. For these prior images to be applied, a registration is needed to align the prior with the input image; we used the affine registration routine provided by SPM [13] in all the 3D experiments of this paper.

Spatial Priors for Caudate Segmentation: In this paper we take the caudate as an example to show how our method can be used for subcortical segmentation; the proposed method would, in principle, work for other sub-cortical structures as well. The distribution prior used for the caudate is constructed from 18 T1-weighted MR data sets downloaded from the Internet Brain Segmentation Repository (IBSR) at Massachusetts General Hospital. Each data set contains a whole-brain MRI together with an expert manual segmentation of 43 individual structures (1.5 mm slice thickness). Out of the 18 data sets, the first 9 are used for constructing the distribution atlas, and the other 9 are used as testing cases to evaluate the accuracy of our segmentation model.
Fig. 1. Spatial prior probability images of CSF, GM and WM
To construct the distribution atlas, the 9 training caudate segmentations need to be put into a standard space. The template brain provided by SPM2 [13], obtained from 152 brains at the Montreal Neurological Institute, is used as the standard space. Let fi (1 ≤ i ≤ N = 9) be one of the training images, let si denote the extracted caudate segmentation, and let r denote the standard template. The probabilistic caudate atlas is constructed as follows:
1. For each training image fi in the IBSR data sets, map it to the standard template r using SPM2's normalization routine. A 12-parameter affine transformation is estimated first, followed by a nonlinear warping based on a linear combination of discrete cosine transform (DCT) basis functions. The resulting transformation is denoted Ti.
2. Apply Ti to si to obtain a transformed caudate segmentation si'.
3. Sum the si' in the standard space to get ss.
4. The prior distribution is then obtained as Pr(caudate) = ss / N and Pr(noncaudate) = 1 − Pr(caudate).
When we segment the caudate of a testing image k, an affine transformation from the standard template r to k is estimated using SPM2, and the obtained transformation is applied to the distribution atlas Pr(in) = Pr(caudate) and Pr(out) = Pr(noncaudate) to bring the prior images into alignment with the testing image k. Figure 2 shows a zoom-in view of the atlas (distribution prior image) constructed from the 9 IBSR data sets, viewed from three different axes.
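Steps 3 and 4 amount to a voxel-wise average of the normalized binary segmentations. A minimal numpy sketch of that averaging step is given below; the function name is ours, and the SPM2 normalization of steps 1 and 2 is assumed to have been applied to the masks beforehand.

import numpy as np

def build_caudate_atlas(warped_masks):
    # warped_masks: list of N binary caudate masks si', already mapped into
    # the standard template space (e.g., with SPM2's normalization routine).
    stack = np.stack([m.astype(np.float32) for m in warped_masks])
    pr_caudate = stack.sum(axis=0) / len(warped_masks)   # ss / N
    return pr_caudate, 1.0 - pr_caudate                  # Pr(caudate), Pr(noncaudate)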
Fig. 2. Spatial prior probability image of the caudate, viewed from three axes, constructed from 9 IBSR data sets
Likelihood Terms P(u0(x, y)|in) and P(u0(x, y)|out): Global Gaussians are commonly assumed in many region-based active contour models to model the intensity distribution, but they are often not an accurate description of the local image profile, especially when intensity inhomogeneity is present. A remedy is to relax the global Gaussian mixture assumption and take local intensity variations into consideration. More specifically, local Gaussians (piecewise-constant is the degenerate case) should be used as a better approximation to model the vicinity of each voxel. In the Chan-Vese model, two global means c1 and c2 are computed for Cin and Cout. In our approach, we introduce two functions v1(x, y) and v2(x, y), both defined on the image domain, to represent the mean values of the local pixels inside and outside the moving curve. By "local", we mean that only neighboring pixels are considered. A simple implementation of the neighborhood is a rectangular window W(x, y) of size (2k + 1) × (2k + 1), where k is a constant integer. Therefore,

v1(x, y) = mean( u0 ∈ (Cin ∩ W(x, y)) ),   v2(x, y) = mean( u0 ∈ (Cout ∩ W(x, y)) )

With this new setup, our segmentation model is updated as the minimization of the following energy:

F(v1, v2, C) = μ · Length(C) − ∫_{Cin} [ log(Pr(in)) − log(σ1) − (u0 − v1)²/(2σ1²) ] dxdy − ∫_{Cout} [ log(Pr(out)) − log(σ2) − (u0 − v2)²/(2σ2²) ] dxdy

The variances σ1 and σ2 should also be defined and estimated locally. However, because local variance estimation tends to be very unstable, we use global variances (for the pixels in Cin and Cout) as a uniform approximation.
2.2 Level Set Framework and Gradient Flow
Using the Heaviside function H and the one-dimensional Dirac measure δ [2], the energy functional F(v1, v2, C) can be minimized under the level set framework, where the update is carried out on the level set function φ. Parameterizing the descent direction by an artificial time t ≥ 0, the gradient flow for φ(t, x, y) is obtained from the associated Euler-Lagrange equation as

∂φ/∂t = sign(v1 − v2) · δ(φ) [ μ div(∇φ/|∇φ|) + log(Pr(in)/Pr(out)) + log(σ2/σ1) − (u0 − v1)²/(2σ1²) + (u0 − v2)²/(2σ2²) ],   φ(0, x, y) = φ0(x, y)     (1)
where φ0 is the level set function of the initial contour. This gradient flow is the evolution equation of the level set function in our proposed method. Correspondingly, v1 and v2 are computed as

v1 = ((u0 ∗ H(φ)) ⊗ W) / (H(φ) ⊗ W),   v2 = ((u0 ∗ (1 − H(φ))) ⊗ W) / ((1 − H(φ)) ⊗ W)     (2)
where ⊗ is the convolution operator. Note that the Chan-Vese model can be regarded as a special case of our model, obtained when the window W is made infinitely large. The sign(v1 − v2) term in Eq. (1) is designed to avoid an undesired curve evolution phenomenon that we call local twist: when local twist happens, multiple components of the same class may evolve to opposite sides of φ and therefore be labeled with different classes. sign(v1 − v2) is a simple yet effective way to prevent this from happening; more details can be found in [10]. In practice, the Heaviside function H and the Dirac function δ in Eq. (1) have to be approximated by smoothed versions; we adopt the H2 and δ2 used in [2]. For all the experiments conducted in this paper, we set the size of the window W to 21 × 21.
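Because the window W enters only through the convolutions of Eq. (2), the local means and a single evolution step of Eq. (1) can be implemented with separable box filters. The sketch below is an illustration under stated assumptions rather than the authors' code: the smoothed Heaviside/Dirac pair with ε = 1, the global standard deviations, and the pointwise use of sign(v1 − v2) follow our reading of the text, and the helper names (local_means, evolve_step) are hypothetical.

import numpy as np
from scipy.ndimage import uniform_filter

def local_means(u0, phi, k=10, eps=1e-8):
    # Eq. (2): local inside/outside means with a (2k+1) x (2k+1) box window W.
    H = 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(phi))     # smoothed Heaviside H2 (eps = 1)
    size = 2 * k + 1
    v1 = uniform_filter(u0 * H, size) / (uniform_filter(H, size) + eps)
    v2 = uniform_filter(u0 * (1.0 - H), size) / (uniform_filter(1.0 - H, size) + eps)
    return v1, v2

def curvature(phi, eps=1e-8):
    # div( grad(phi) / |grad(phi)| )
    gy, gx = np.gradient(phi)
    mag = np.sqrt(gx ** 2 + gy ** 2) + eps
    return np.gradient(gx / mag, axis=1) + np.gradient(gy / mag, axis=0)

def evolve_step(phi, u0, pr_in, pr_out, mu=0.2, dt=0.5, k=10, eps=1e-8):
    # One explicit update of Eq. (1), with global variances as in the paper.
    v1, v2 = local_means(u0, phi, k)
    s1 = u0[phi > 0].std() + eps
    s2 = u0[phi <= 0].std() + eps
    delta = 1.0 / (np.pi * (1.0 + phi ** 2))              # smoothed Dirac delta2 (eps = 1)
    data = (np.log((pr_in + eps) / (pr_out + eps)) + np.log(s2 / s1)
            - (u0 - v1) ** 2 / (2 * s1 ** 2) + (u0 - v2) ** 2 / (2 * s2 ** 2))
    force = np.sign(v1 - v2) * delta * (mu * curvature(phi) + data)
    return phi + dt * force

With k = 10 the window is 21 x 21, matching the setting used in the experiments; the same routine degenerates to a Chan-Vese-like update as the window size grows.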
3 Results and Discussions
The first experiment is based on a 3T MR image, shown in Fig. 3, before the bias field is removed. Due to the bias field, this image strongly violates the global Gaussian/mean assumption, so traditional region-based approaches, including the Chan-Vese model, are expected to fail. Fig. 3 shows the result of the Chan-Vese model (left column) and that of our local model (right column); three snapshots of each run are provided. As is evident, the Chan-Vese model has trouble capturing the GM area in the top-left and bottom-right corners, while our model separates the two tissues very accurately. The second experiment is conducted on a low-resolution 1.5T MR image. We compared our solution with SPM [13] and the Chan-Vese model. Fig. 4 shows a single-slice result from all three methods: Fig. 4a is the input image, and 4b, 4c and 4d are the GM segmentations from SPM, Chan-Vese and our model, respectively. The sub-cortical GM tissues in these images have slightly higher intensity values than cortical GM, so the Chan-Vese model, with its piecewise-constant assumption, mis-classifies quite a portion of the putamen as WM. Our model, on the other hand, clearly separates the putamen and thalamus from the surrounding WM. The comparison for the sub-cortical area is highlighted with a red circle in Fig. 4 (the figures are better seen on screen than in black-and-white print). The spatial distribution prior and the local Gaussians both play a role in achieving this improvement. Compared to SPM, our model has the edge in outlining cleaner cortical GM (highlighted with a blue circle; better seen on screen).
Fig. 3. Segmentation comparison of the Chan-Vese model and our model in handling severe intensity inhomogeneity. First row: three snapshots of the execution of the Chan-Vese model; second row: three snapshots for our model. (The figures are better seen on screen than in print.)
Fig. 4. Input image (a) and the GM segmentation results from SPM (b), Chan-Vese (c) and our model (d)
The last group of experiments is for subcortical structure segmentation, carried out on the remaining 9 IBSR data sets mentioned in Section 2.1. To assess the performance of our algorithm we computed the Dice coefficient between the segmentation obtained from our algorithm and the corresponding ground truth.
Table 1. Dice coefficients for all 9 test cases. The summary row at the end of the table displays the overall average.

IBSR Data Set    Dice Coefficient
Patient 10       0.7611
Patient 11       0.7929
Patient 12       0.7039
Patient 13       0.6928
Patient 14       0.8019
Patient 15       0.7342
Patient 16       0.7503
Patient 17       0.6622
Patient 18       0.8201
Average          0.7466
The Dice coefficient measures the similarity of two sets and ranges from 0 for sets that are disjoint to 1 for sets that are identical; it is a special case of the kappa index used for comparing set similarity. It is defined as

K(S1, S2) = 2 |S1 ∩ S2| / (|S1| + |S2|)     (3)

Table 1 shows the Dice coefficients for all test cases. The results are rather stable across the 9 data sets, with an average value of 0.7466. The accuracy is comparable to the results reported in [16].
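A direct implementation of Eq. (3) for binary masks is straightforward; the helper name below is illustrative, not part of the paper's software.

import numpy as np

def dice_coefficient(seg, gt):
    # Eq. (3): K(S1, S2) = 2 |S1 and S2| / (|S1| + |S2|) for binary masks.
    seg, gt = seg.astype(bool), gt.astype(bool)
    denom = seg.sum() + gt.sum()
    return 2.0 * np.logical_and(seg, gt).sum() / denom if denom else 1.0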
4 Conclusion
In this paper, we propose a brain MRI segmentation algorithm based on a local likelihood oriented active contour model. The LLAC model has the advantage of being able to bring out brain structures that have low contrast with the surrounding tissues. The probabilistic atlas essentially works as a mask to capture the structure of interest, so no thresholding step or threshold value is needed. The accuracy of our model may be further boosted if a shape-based atlas, constructed through PCA, is integrated into the level set framework.
References 1. Leemput, K.V., et al.: Automated model-based tissue classification of MR images of the brain. IEEE Trans. on Medical Imaging 18, 897–908 (1999) 2. Chan, T.F., Vese, L.A.: Active contours without edges. IEEE Trans. on Image Processing 10(2), 266–277 (2001) 3. Chan, T.F., Vese, L.A.: A level set algorithm for minimizing the Mumford-Shah functional in image processing. In: 1st IEEE Workshop on Variational and Level Set Methods in Computer Vision, pp. 161–168 (2001)
4. Cocosco, C.A., et al.: BrainWeb: Online interface to a 3D MRI simulated brain database. Neuroimage 5(4) part 2/4, S245 (1997) 5. Cremers, D., Rousson, M., Deriche, R.: A review of statistical approaches to level set segmentation: integrating color, texture, motion and shape. In: IJCV 6. Yang, J., Tagare, H., Staib, L.H., Duncan, J.S.: Segmentation of 3D Deformable Objects with Level Set Based Prior Models. In: ISBI, pp. 85–88 (2004) 7. Evans, A.C., Collins, D.L., Milner, B.: An MRI-based stereotactic atlas from 250 young normal subjects. Society of Neuroscience Abstrasts 18, 408 (1992) 8. Gao, S., Bui, T.D.: Image Segmentation and Selective Smoothing by Using Mumford-Shah Model. IEEE Transactions on Image Processing 14(10), 1537–1549 (2005) 9. Li, C., Liu, J., Fox, M.D.: Segmentation of Edge Preserving Gradient Vector Flow: An Approach Toward Automatically Initializing and Splitting of Snakes. In: CVPR, vol. 1, pp. 162–167 (2008) 10. Liu, J., Chelberg, D., Smith, C., Chebrolu, H.: Distribution-based Level Set Model for Medical Image Segmentation. In: BMVC 2007. British Machine Vision Conference, Warwick, 10-13 September 2007, UK (2007) 11. Paragios, N., Deriche, R.: Coupled Geodesic Active Regions for Image Segmentation: A Level Set Approach. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1843, pp. 224–240. Springer, Heidelberg (2000) 12. Rousson, M., Deriche, R.: A Variational Framework for Active and Adaptative Segmentation of Vector Valued Images, INRIA Technical Report (2002) 13. Mechelli, A., Price, C.J., Friston, K.J., Ashburner, J.: Voxel-Based Morphometry of the Human Brain: Methods and Applications. Current Medical Imaging Reviews, 105–113 (2005) 14. Tsai, A., Yezzi, A., Wells, W., Tempany, C.: Approach to Curve: Evolution for Segmentation of Medical Imagery. IEEE TMI 22(2), 137–154 (2003) 15. Xu, C., Prince, J.L.: Snakes, Shapes, and Gradient Vector Flow. IEEE Transactions on Image Processing 7(3), 359–369 (1998) 16. Zhou, J., Rajapakse, J.C.: Segmentation of subcortical brain structures using fuzzy templates. NeuroImage 28, 915–924 (2005) 17. Zhu, S., Yuille, A.: Region competition: Unifying snakes, region growing, and bayes/MDL for multiband image segmentation. PAMI 18(9), 884–900 (1996)
Author Index
Abe, Shinji I-292 Agrawal, Amit I-945 Ai, Haizhou I-210 Akama, Ryo I-779 Andreopoulos, Alexander I-385 Aoki, Nobuya I-116 Aptoula, Erchan I-935 Arita, Daisaku I-159 Arth, Clemens II-447 Ashraf, Nazim II-63 Åström, Kalle II-549
Babaguchi, Noboru II-651 Banerjee, Subhashis II-85 Ben Ayed, Ismail I-925 Beveridge, J. Ross II-733 Bigorgne, Erwan II-817 Bischof, Horst I-657, II-447 Bouakaz, Sa¨ıda I-678, I-738 Boyer, Edmond II-166, II-580 Brice˜ no, Hector M. I-678, I-738 Brooks, Michael J. I-853, II-227 Byr¨ od, Martin II-549 Cai, Kangying I-779 Cai, Yinghao I-843 Cannons, Kevin I-532 Cha, Seungwook I-200 Chan, Tung-Jung II-631 Chang, Jen-Mei II-733 Chang, Wen-Yan II-621 Chaudhuri, Subhasis I-240 Chebrolu, Hima I-956 Chen, Chu-Song I-905, II-621 Chen, Ju-Chin II-700 Chen, Qian I-565, I-688 Chen, Tsuhan I-220, II-487, II-662 Chen, Wei I-843 Chen, Wenbin II-53 Chen, Ying I-832 Chen, Yu-Ting I-905 Cheng, Jian II-827 Choi, Inho I-698 Choi, Ouk II-269
Chu, Rufeng II-22 Chu, Wen-Sheng Vincnent II-700 Chun, Seong Soo I-200 Chung, Albert C.S. II-672 Chung, Ronald II-301 Cichowski, Alex I-375 Cipolla, Roberto I-335 Courteille, Fr´ed´eric II-196 Cui, Jinshi I-544 Dailey, Matthew N. I-85 Danafar, Somayeh II-457 Davis, Larry S. I-397, II-404 DeMenthon, Daniel II-404 De Mol, Christine II-881 Destrero, Augusto II-881 Detmold, Henry I-375 Di Stefano, Luigi II-517 Dick, Anthony I-375, I-853 Ding, Yuanyuan I-95 Dinh, Viet Cuong I-200 Doermann, David II-404 Donoser, Michael II-447 Dou, Mingsong II-722 Draper, Bruce II-733 Du, Wei I-365 Du, Weiwei II-590 Durou, Jean-Denis II-196 Ejiri, Masakazu I-35 Eriksson, Anders P. II-796 Fan, Kuo-Chin I-169 Farin, Dirk I-789 Foroosh, Hassan II-63 Frahm, Jan-Michael II-353 Fu, Li-Chen II-124 Fu, Zhouyu I-482, II-134 Fujimura, Kikuo I-408, II-32 Fujiwara, Takayuki II-891 Fujiyoshi, Hironobu I-915, II-806 Fukui, Kazuhiro II-467 Funahashi, Takuma II-891 Furukawa, Ryo II-206, II-847
Gao, Jizhou I-127 Gargallo, Pau II-373, II-784 Geurts, Pierre II-611 Gheissari, Niloofar II-457 Girdziuˇsas, Ram¯ unas I-811 Goel, Dhiraj I-220 Goel, Lakshya II-85 Grabner, Helmut I-657 Grabner, Michael I-657 Guillou, Erwan I-678 Gupta, Ankit II-85 Gupta, Gaurav II-394 Gupta, Sumana II-394 Gurdjos, Pierre II-196 Han, Yufei II-1, II-22 Hancock, Edwin R. II-869 Handel, Holger II-258 Hao, Pengwei II-722 Hao, Ying II-12 Hartley, Richard I-13, I-800, II-279, II-322, II-353 Hasegawa, Tsutomu I-628 Hayes-Gill, Barrie I-945 He, Ran I-54, I-728, II-22 H´eas, Patrick I-864 Hill, Rhys I-375 Hiura, Shinsaku I-149 Honda, Kiyoshi I-85 Hong, Ki-Sang II-497 Horaud, Radu II-166 Horiuchi, Takahiko I-708 Hou, Cong I-210 Hsiao, Pei-Yung II-124 Hsieh, Jun-Wei I-169 Hu, Wei I-832 Hu, Weiming I-821, I-832 Hu, Zhanyi I-472 Hua, Chunsheng I-565 huang, Feiyue II-477 Huang, Guochang I-462 Huang, Kaiqi I-667, I-843 Huang, Liang II-680 Huang, Po-Hao I-106 Huang, Shih-Shinh II-124 Huang, Weimin I-875 Huang, Xinyu I-127 Huang, Yonggang II-690 Hung, Y.S. II-186 Hung, Yi-Ping II-621
Ide, Ichiro II-774 Ijiri, Yoshihisa II-680 Ikeda, Sei II-73 Iketani, Akihiko II-73 Ikeuchi, Katsushi II-289 Imai, Akihiro I-596 Ishikawa, Hiroshi II-537 Itano, Tomoya II-206 Iwata, Sho II-570 Jaeggli, Tobias I-608 Jawahar, C.V. I-586 Je, Changsoo II-507 Ji, Zhengqiao II-363 Jia, Yunde I-512, II-641, II-754 Jiao, Jianbin I-896 Jin, Huidong I-482 Jin, Yuxin I-748 Josephson, Klas II-549 Junejo, Imran N. II-63 Kahl, Fredrik I-13, II-796 Kalra, Prem II-85 Kanade, Takeo I-915, II-806 Kanatani, Kenichi II-311 Kanbara, Masayuki II-73 Katayama, Noriaki I-292 Kato, Takekazu I-688 Kawabata, Satoshi I-149 Kawade, Masato II-680 Kawamoto, Kazuhiko I-555 Kawasaki, Hiroshi II-206, II-847 Khan, Sohaib I-647 Kim, Daijin I-698 Kim, Hansung I-758 Kim, Hyeongwoo II-269 Kim, Jae-Hak II-353 Kim, Jong-Sung II-497 Kim, Tae-Kyun I-335 Kim, Wonsik II-560 Kirby, Michael II-733 Kitagawa, Yosuke I-688 Kitahara, Itaru I-758 Klein Gunnewiek, Rene I-789 Kley, Holger II-733 Kogure, Kiyoshi I-758 Koh, Tze K. I-945 Koller-Meier, Esther I-608 Kondo, Kazuaki I-544 Korica-Pehserl, Petra I-657
Author Index Koshimizu, Hiroyasu II-891 Kounoike, Yuusuke II-424 Kozuka, Kazuki II-342 Kuijper, Arjan I-230 Kumano, Shiro I-324 Kumar, Anand I-586 Kumar, Pankaj I-853 Kuo, Chen-Hui II-631 Kurazume, Ryo I-628 Kushal, Avanish II-85 Kweon, In So II-269 Laaksonen, Jorma I-811 Lai, Shang-Hong I-106, I-638 Lambert, Peter I-251 Langer, Michael I-271, II-858 Lao, Shihong I-210, II-680 Lau, W.S. II-186 Lee, Jiann-Der II-631 Lee, Kwang Hee II-507 Lee, Kyoung Mu II-560 Lee, Sang Wook II-507 Lee, Wonwoo II-580 Lef`evre, S´ebastien I-935 Lei, Zhen I-54, II-22 Lenz, Reiner II-744 Li, Baoxin II-155 Li, Heping I-472 Li, Hongdong I-800, II-227 Li, Jiun-Jie I-169 Li, Jun II-722 Li, Ping I-789 Li, Stan Z. I-54, I-728, II-22 Li, Zhenglong II-827 Li, Zhiguo II-901 Liang, Jia I-512, II-754 Liao, ShengCai I-54 Liao, Shu II-672 Lien, Jenn-Jier James I-261, I-314, I-885, II-96, II-700 Lim, Ser-Nam I-397 Lin, Shouxun II-106 Lin, Zhe II-404 Lina II-774 Liu, Chunxiao I-282 Liu, Fuqiang I-355 Liu, Jundong I-956 Liu, Nianjun I-482 Liu, Qingshan II-827, II-901 Liu, Wenyu I-282
Liu, Xiaoming II-662 Liu, Yuncai I-419 Loke, Eng Hui I-430 Lu, Fangfang II-134, II-279 Lu, Hanqing II-827 Lubin, Jeffrey II-414 Lui, Shu-Fan II-96 Luo, Guan I-821 Ma, Yong II-680 Maeda, Eisaku I-324 Mahmood, Arif I-647 Makhanov, Stanislav I-85 Makihara, Yasushi I-452 Manmatha, R. I-586 Mao, Hsi-Shu II-96 Mar´ee, Rapha¨el II-611 Marikhu, Ramesh I-85 Martens, Ga¨etan I-251 Matas, Jiˇr´ı II-236 Mattoccia, Stefano II-517 Maybank, Steve I-821 McCloskey, Scott I-271, II-858 Mekada, Yoshito II-774 Mekuz, Nathan I-492 ´ M´emin, Etienne I-864 Metaxas, Dimitris II-901 Meyer, Alexandre I-738 Michoud, Brice I-678 Miˇcuˇs´ık, Branislav I-65 Miles, Nicholas I-945 Mitiche, Amar I-925 Mittal, Anurag I-397 Mogi, Kenji II-528 Morgan, Steve I-945 Mori, Akihiro I-628 Morisaka, Akihiko II-206 Mu, Yadong II-837 Mudenagudi, Uma II-85 Mukaigawa, Yasuhiro I-544, II-246 Mukerjee, Amitabha II-394 Murai, Yasuhiro I-915 Murase, Hiroshi II-774 Nagahashi, Tomoyuki II-806 Nakajima, Noboru II-73 Nakasone, Yoshiki II-528 Nakazawa, Atsushi I-618 Nalin Pradeep, S. I-522, II-116 Niranjan, Shobhit II-394 Nomiya, Hiroki I-502
Odone, Francesca II-881 Ohara, Masatoshi I-292 Ohta, Naoya II-528 Ohtera, Ryo I-708 Okutomi, Masatoshi II-176 Okutomoi, Masatoshi II-384 Olsson, Carl II-796 Ong, S.H. I-875 Otsuka, Kazuhiro I-324 Pagani, Alain I-769 Paluri, Balamanohar I-522, II-116 Papadakis, Nicolas I-864 Parikh, Devi II-487 Park, Joonyoung II-560 Pehserl, Joachim I-657 Pele, Ofir II-435 Peng, Yuxin I-748 Peterson, Chris II-733 Pham, Nam Trung I-875 Piater, Justus I-365 Pollefeys, Marc II-353 Poppe, Chris I-251 Prakash, C. I-522, II-116 Pujades, Sergi II-373 Puri, Manika II-414 Radig, Bernd II-332 Rahmati, Mohammad II-217 Raskar, Ramesh I-1, I-945 Raskin, Leonid I-442 Raxle Wang, Chi-Chen I-885 Reid, Ian II-601 Ren, Chunjian II-53 Rivlin, Ehud I-442 Robles-Kelly, Antonio II-134 Rudzsky, Michael I-442 Ryu, Hanjin I-200 Sagawa, Ryusuke I-116 Sakakubara, Shizu II-424 Sakamoto, Ryuuki I-758 Sato, Jun II-342 Sato, Kosuke I-149 Sato, Tomokazu II-73 Sato, Yoichi I-324 Sawhney, Harpreet II-414 Seo, Yongduek II-322 Shah, Hitesh I-240, I-522, II-116 Shahrokni, Ali II-601
Shen, Chunhua II-227 Shen, I-fan I-189, II-53 Shi, Jianbo I-189 Shi, Min II-42 Shi, Yu I-718 Shi, Zhenwei I-180 Shimada, Atsushi I-159 Shimada, Nobutaka I-596 Shimizu, Ikuko II-424 Shimizu, Masao II-176 Shinano, Yuji II-424 Shirai, Yoshiaki I-596 Siddiqi, Kaleem I-271, II-858 Singh, Gajinder II-414 Slobodan, Ili´c I-75 Smith, Charles I-956 Smith, William A.P. II-869 ˇ Sochman, Jan II-236 Song, Gang I-189 Song, Yangqiu I-180 Stricker, Didier I-769 Sturm, Peter II-373, II-784 Sugaya, Yasuyuki II-311 Sugimoto, Shigeki II-384 Sugiura, Kazushige I-452 Sull, Sanghoon I-200 Sumino, Kohei II-246 Sun, Zhenan II-1, II-12 Sung, Ming-Chian I-261 Sze, W.F. II-186 Takahashi, Hidekazu II-384 Takahashi, Tomokazu II-774 Takamatsu, Jun II-289 Takeda, Yuki I-779 Takemura, Haruo I-618 Tan, Huachun II-712 Tan, Tieniu I-667, I-843, II-1, II-12, II-690 Tanaka, Hidenori I-618 Tanaka, Hiromi T. I-779 Tanaka, Tatsuya I-159 Tang, Sheng II-106 Taniguchi, Rin-ichiro I-159, I-628 Tao, Hai I-345 Tao, Linmi I-748 Tarel, Jean-Philippe II-817 Tian, Min I-355 Tombari, Federico II-517 Tominaga, Shoji I-708
Author Index Toriyama, Tomoji I-758 Tsai, Luo-Wei I-169 Tseng, Chien-Chung I-314 Tseng, Yun-Jung I-169 Tsotsos, John K. I-385, I-492 Tsui, Timothy I-718 Uchida, Seiichi I-628 Uehara, Kuniaki I-502 Urahama, Kiichi II-590 Utsumi, Akira I-292 Van de Walle, Rik I-251 van den Hengel, Anton I-375 Van Gool, Luc I-608 Verri, Alessandro II-881 Vincze, Markus I-65 Wada, Toshikazu I-565, I-688 Wan, Cheng II-342 Wang, Fei II-1 Wang, Guanghui II-363 Wang, Junqiu I-576 Wang, Lei I-800, II-145 Wang, Liming I-189 Wang, Te-Hsun I-261 Wang, Xiaolong I-303 Wang, Ying I-667 Wang, Yuanquan I-512, II-754 Wang, Yunhong I-462, II-690 Wehenkel, Louis II-611 Wei, Shou-Der I-638 Werman, Michael II-435 Wildenauer, Horst I-65 Wildes, Richard I-532 Wimmer, Matthias II-332 With, Peter H.N. de I-789 Wong, Ka Yan II-764 Woo, Woontack II-580 Woodford, Oliver II-601 Wu, Fuchao I-472 Wu, Haiyuan I-565, I-688 Wu, Jin-Yi II-96 Wu, Q.M. Jonathan II-363 Wu, Yihong I-472 Wuest, Harald I-769 Xu, Gang II-570 Xu, Guangyou I-748, II-477
Xu, Lijie II-32 Xu, Shuang II-641 Xu, Xinyu II-155 Yagi, Yasushi I-116, I-452, I-544, I-576, II-246 Yamaguchi, Osamu II-467 Yamamoto, Masanobu I-430 Yamato, Junji I-324 Yamazaki, Masaki II-570 Yamazoe, Hirotake I-292 Yang, Ruigang I-127 Yang, Ying II-106 Ye, Qixiang I-896 Yin, Xin I-779 Ying, Xianghua I-138 Yip, Chi Lap II-764 Yokoya, Naokazu II-73 Yu, Hua I-896 Yu, Jingyi I-95 Yu, Xiaoyi II-651 Yuan, Ding II-301 Yuan, Xiaotong I-728 Zaboli, Hamidreza II-217 Zaharescu, Andrei II-166 Zha, Hongbin I-138, I-544 Zhang, Changshui I-180 Zhang, Chao II-722 Zhang, Dan I-180 Zhang, Fan I-282 Zhang, Ke I-482 Zhang, Weiwei I-355 Zhang, Xiaoqin I-821 Zhang, Yongdong II-106 Zhang, Yu-Jin II-712 Zhang, Yuhang I-800 Zhao, Qi I-345 Zhao, Xu I-419 Zhao, Youdong II-641 Zhao, Yuming II-680 Zheng, Bo II-289 Zheng, Jiang Yu I-303, II-42 Zhong, H. II-186 Zhou, Bingfeng II-837 Zhou, Xue I-832 Zhu, Youding I-408